Article

Partial Information Decomposition and the Information Delta: A Geometric Unification Disentangling Non-Pairwise Information

Pacific Northwest Research Institute, Seattle, WA 98122, USA
* Author to whom correspondence should be addressed.
Entropy 2020, 22(12), 1333; https://doi.org/10.3390/e22121333
Submission received: 22 September 2020 / Revised: 12 November 2020 / Accepted: 19 November 2020 / Published: 24 November 2020
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Information theory provides robust measures of multivariable interdependence, but classically does little to characterize the multivariable relationships it detects. The Partial Information Decomposition (PID) characterizes the mutual information between variables by decomposing it into unique, redundant, and synergistic components. This has been usefully applied, particularly in neuroscience, but there is currently no generally accepted method for its computation. Independently, the Information Delta framework characterizes non-pairwise dependencies in genetic datasets. This framework has developed an intuitive geometric interpretation for how discrete functions encode information, but lacks some important generalizations. This paper shows that the PID and Delta frameworks are largely equivalent. We equate their key expressions, allowing for results in one framework to apply towards open questions in the other. For example, we find that the approach of Bertschinger et al. is useful for the open Information Delta question of how to deal with linkage disequilibrium. We also show how PID solutions can be mapped onto the space of delta measures. Using Bertschinger et al. as an example solution, we identify a specific plane in delta-space on which this approach’s optimization is constrained, and compute it for all possible three-variable discrete functions of a three-letter alphabet. This yields a clear geometric picture of how a given solution decomposes information.

1. Introduction

The variables in complex biological data frequently have nonlinear and non-pairwise dependency relationships. Understanding the functions and/or dysfunctions of biological systems requires understanding these complex interactions. How can we reliably detect interdependence within a set of variables, and how can we distinguish simple, pairwise dependencies from those which are fundamentally multivariable?
An analytical approach formulated by Williams and Beer frames these questions in terms of the Partial Information Decomposition (PID) [1]. The PID proposes to decompose the mutual information between a pair of source variables X and Y and a target variable Z, I ( Z : X , Y ) , into four non-negative components:
$$
\begin{aligned}
I(Z:X) &= U_X + R \\
I(Z:Y) &= U_Y + R \\
I(Z:X,Y) &= U_X + U_Y + R + S
\end{aligned}
$$
The constituent terms proposed are: the unique informations, U_X and U_Y, which represent the amounts of information about Z encoded by X alone and by Y alone; the redundant information, R, which is the information about Z encoded redundantly by both X and Y; and the synergistic information, S, which is the information about Z contained in neither X nor Y individually, but encoded by X and Y taken together. An illustration of this decomposition, the associated governing equations, and examples characterizing each type of information are shown in Figure 1. It has been shown that PID components can distinguish between dyadic and triadic relationships that no conventional Shannon information measure can tell apart [2].
The problem with this approach is that its governing equations form an underdetermined system, with only three equations relating the four components. To actually calculate the decomposition, an additional assumption must be made to provide an additional equation. Williams and Beer proposed a method for the calculation of R in their original paper, but this has since been shown to have some undesirable properties [1]. Much of the subsequent work in this domain consisted of attempts to define new relationships or formulae to calculate the components, as well as critiques of these proposed measures [3]. These proposed measures include (as an incomplete list): a measure based on information geometry [4]; an intersection information based on the Gács-Körner common information [5]; the minimum mutual information [6]; the pointwise common change in surprisal [7]; and the extractable shared information [8].
Another noteworthy putative solution is that of [9], which requires solving an optimization problem over a space Q of probability distributions, but is rigorous in that it directly follows from reasonable assumptions about the unique information. However, it is unclear how to sensibly generalize this approach to larger numbers of variables. Nonetheless, there has been considerable interest in using the PID approach to gather insights from real datasets, particularly within the neuroscience community [10,11,12,13,14,15].
Independently, an alternative approach to many of these same questions has been formulated focusing on devising new information theory-based measures of multivariable dependency. In genetics, non-pairwise epistatic effects are often crucially important in determining complex phenotypes, but traditional methods are sensitive only to pairwise relationships; thus there is particular interest in methods to identify the existence of synergistic dependencies within genetic datasets. Galas et al. [16,17] quantified the non-pairwise information between genetic loci and phenotype data with the Delta measure, Δ ( Z : X , Y ) . Briefly, given a set of variables { X , Y , Z } , Δ ( Z : X , Y ) quantifies the change in co-information when considering the variables { X , Y , Z } as opposed to only { X , Y } (we hereafter denote Δ ( Z : X , Y ) as Δ Z , Δ ( X : Y , Z ) as Δ X , and so on). In its simplest application, the magnitudes of { Δ X , Δ Y , Δ Z } can be used to detect and quantify non-pairwise interactions [16,17].
Recent work showed that the delta values encode considerable additional information about the dependency. Sakhanenko et al. [18] defined the normalized delta measures δ = (δ_X, δ_Y, δ_Z), which define an "information space", and considered the δ-values of all possible discrete functions Z = f(X,Y). Fully mapping the specific example set of functions where {X, Y, Z} are all discrete variables with 3 possible values, they found that the 19,683 possible functional relationships Z = f(X,Y) mapped onto a highly structured plane in the space of normalized deltas (as shown in Figure 2). Different regions of this plane corresponded to qualitatively different types of functional relationships; in particular, completely pairwise functions such as Z = X and completely non-pairwise functions such as Z = XOR(X,Y) were mapped onto the extremes of the plane (see Figure 2; note that this paper defines "XOR" for a ternary alphabet as XOR(X,Y) ≡ (X + Y) mod 3). Since discrete variables such as these occur naturally in genetics, this suggests that relationships between genetic variables may be usefully characterized by their δ-coordinates, with useful intuitive value. The difficulty of this in practice is that the coordinates are constrained to this plane only when X and Y are statistically independent, which is not the case in many real datasets, e.g., in genetic datasets in the presence of linkage disequilibrium.
In this paper, we show that the Partial Information Decomposition approach and Information Delta approach are largely equivalent, since their component variables can be directly related. The δ -coordinates can be written explicitly in terms of PID components, which leads us to an intuitive understanding of how δ -space encodes PID information by casting them into a geometric context. We then apply our results to two different approaches to solving the PID problem, first one from Bertschinger et al. [9] and then from Finn and Lizier [19]. We show that the sets of probability distributions, Q, used by Bertschinger can be mapped onto low-dimensional manifolds in δ -space, which intersect with the δ -plane of Figure 2. This approach is theoretically useful for the Delta information framework, since it factors out X , Y dependence in the data, thereby accounting for linkage disequilibrium between genetic variables. We suggest an approach for the analysis of genetic datasets which would return both the closest discrete function underlying the data and its PID in the Bertschinger solution, and which would require no further optimization after the initial construction of a solution library. This realization thus yields a low-dimensional geometric interpretation of this optimization problem, and we compute the solution for all possible three-variable discrete functions of alphabet size three. For these same functions, we then compute the PID components using the Pointwise PID approach of Finn and Lizier [19]. This visualization yields an immediate comparison of how each solution decomposes information. Since our derived relationship between the frameworks is general, it could be similarly applied to any putative PID solution as demonstrated here. Code to replicate these computations and the associated figures is freely available [20].

2. Background

2.1. Interaction Information and Multi-Information

An important body of background work, which served as a foundation for both the Information Decomposition and Information Delta approaches, involves the Interaction Information, II. II can be thought of as a multivariable extension of the mutual information [21]. Unlike the mutual information, however, the interaction information can assume negative values. What does it mean for the interaction information to be negative? It was once common to interpret II > 0 as implying a synergistic interaction, and II < 0 as implying a redundant interaction between the variables. As detailed in [1] and discussed in the following sections, this interpretation is mistaken. Interactions can be both partly synergistic and partly redundant, and the interaction information indicates the balance of these components.
For a set of variables ν n = { X 1 , . . . , X n } , I I can be defined as [22]:
$$II(\nu_n) = -\sum_{\tau_i \subseteq \nu_n} (-1)^{|\nu_n| - |\tau_i|} H(\tau_i)$$
where | ν n | is the total number of variables in the set, and the sum is over all possible subsets τ i (where | τ i | is the total number of variables in each subset). H ( τ i ) is the joint entropy between the variables in subset τ i . The interaction information, I I , is very similar to a measure called the co-information, C I [23]. These measures differ only by their sign: for an even number of variables they are identical (e.g., I I ( X , Y ) = C I ( X , Y ) ), and for an odd number of variables they are of opposite sign.
$$CI(\nu_n) = -\sum_{\tau_i \subseteq \nu_n} (-1)^{|\tau_i|} H(\tau_i) = (-1)^{|\nu_n|}\, II(\nu_n)$$
An additional, useful measure is the “multi-information”, Ω , introduced by Watanabe [24], sometimes called the “total correlation”, which represents the sum of all dependencies of variables and is zero only if all variables are independent. For a set of n variables ν n = { X i } it is defined as:
$$\Omega(\nu_n) = \sum_{i=1}^{n} H(X_i) - H(X_1, \ldots, X_n)$$
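As an illustrative sketch (not the released analysis code [20]; the helper names are our own), these quantities can be computed directly from a joint probability array:

```python
# Illustrative sketch: interaction information, co-information, and
# multi-information computed from a joint probability array p[x1, ..., xn].
import itertools
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def marginal(p, keep):
    """Marginalize the joint array down to the axes listed in `keep`."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    return p.sum(axis=drop) if drop else p

def interaction_information(p):
    """II(nu_n) = -sum over nonempty subsets tau of (-1)^(|nu_n|-|tau|) H(tau)."""
    n = p.ndim
    total = 0.0
    for k in range(1, n + 1):
        for subset in itertools.combinations(range(n), k):
            total += (-1) ** (n - k) * entropy(marginal(p, subset))
    return -total

def co_information(p):
    """CI(nu_n) = (-1)^|nu_n| * II(nu_n)."""
    return (-1) ** p.ndim * interaction_information(p)

def multi_information(p):
    """Omega(nu_n) = sum_i H(X_i) - H(X_1, ..., X_n)."""
    return sum(entropy(marginal(p, (i,))) for i in range(p.ndim)) - entropy(p)

# Example: ternary XOR, Z = (X + Y) mod 3, with X and Y uniform and independent.
p = np.zeros((3, 3, 3))
for x in range(3):
    for y in range(3):
        p[x, y, (x + y) % 3] = 1 / 9
print(interaction_information(p))  # ~ +1.585 bits
print(co_information(p))           # ~ -1.585 bits
print(multi_information(p))        # ~  1.585 bits
```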

2.2. Information Decomposition

Consider a pair of “source variables” X, Y which determine the value of a “target variable” Z. Assume that we can measure the mutual information each source carries about the target, I(Z:X) and I(Z:Y) (which we abbreviate as I_{Z:X} and I_{Z:Y}), as well as the mutual information between the joint distribution of {X, Y} and Z, I(Z:X,Y) (which we abbreviate as I_{Z:XY}). These mutual informations can be written in terms of the entropies (which we abbreviate using subscripts, e.g., H(X,Y) ≡ H_{XY}):
$$
\begin{aligned}
I_{Z:X} &= H_X + H_Z - H_{XZ} \\
I_{Z:Y} &= H_Y + H_Z - H_{YZ} \\
I_{Z:XY} &= H_{XY} + H_Z - H_{XYZ}
\end{aligned}
$$
These mutual informations can be decomposed into components which measure how much of each “type” of information they contain, as follows:
$$
\begin{aligned}
I_{Z:XY} &= U_X + U_Y + R + S \\
I_{Z:X} &= U_X + R \\
I_{Z:Y} &= U_Y + R
\end{aligned}
$$
where U_X and U_Y are the unique informations, R is the redundant information, and S is the synergistic information, as described previously in Section 1. This is an underdetermined system: an additional equation relating the components is required to render it solvable. Many of the current and previous efforts to define such an equation (for example, several proposals on how to directly compute the value of R from data), as well as the limitations of those efforts, have been nicely summarized in [3].

2.3. Solution from Bertschinger et al.

One solution to this problem came from Bertschinger et al. [9], who proposed that the unique information be approximated as:
$$\tilde{U}_X = \min_{q \in Q} I(Y, Z \mid X)$$
Let Ψ be the set of all joint probability distributions of X, Y, and Z. Then we define Q as the set of all distributions, q, which have the same marginal probability distributions p ( X = x , Z = z ) and p ( Y = y , Z = z ) as our dataset. That is,
$$Q = \left\{ q \in \Psi \;\middle|\; q(X=x, Z=z) = p(X=x, Z=z), \;\; q(Y=y, Z=z) = p(Y=y, Z=z) \right\}$$
Please note that in [9], this set of probability distributions is denoted as Δ P , which we change here to Q to avoid notational confusion with the information deltas. Similarly, its elements are indicated by Q in the original paper. Here we indicate the distributions, elements of the set Q, by a lowercase q for consistency with our notation for probability distributions.
Put another way, we consider all possible probability distributions that maintain the marginals p ( X = x , Z = z ) and p ( Y = y , Z = z ) implied by our data. The relationship between X and Y ( p ( X = x , Y = y ) , and consequently the joint distribution p ( X = x , Y = y , Z = z ) ) is allowed to vary. The minimization criterion is perhaps more intuitive when written, equivalently, as:
$$\tilde{U}_X = \min_{q \in Q} I(Y, Z \mid X) = \min_{q \in Q} \left[ II(X, Y, Z) - II(Y, Z) \right]$$
Thus, the unique information Ũ_X can be thought of as the smallest possible increase in the interaction information when the variable X is added to the set {Y, Z}. For example, if there exists a probability distribution in Q for which II(Y,Z) = II(X,Y,Z), then the addition of X adds no unique information about Z and Ũ_X = 0. The core assumption of this approach is that the unique and redundant informations depend only upon the marginal distributions p(X=x, Z=z) and p(Y=y, Z=z). This solution is rigorous in the sense that the result follows directly from this assumption, without any ad-hoc prescription for how the components are related.

2.4. Information Deltas and Their Geometry

Consider a set of three variables ν n = { X , Y , Z } . Using Equation (3), we can write the co-information in terms of the entropies:
$$CI(X,Y,Z) = H_X + H_Y + H_Z - H_{XY} - H_{XZ} - H_{YZ} + H_{XYZ}$$
The differential interaction information Δ Z is the change in the interaction information when a given variable Z is added to the set. This can be written in terms of C I and then the conditional mutual information:
$$\Delta_Z(\nu_n) = -\left[ CI(X,Y,Z) - CI(X,Y) \right] = I(X, Y \mid Z)$$
These measures can be normalized by the multi-information for the three variables, Ω ( X , Y , Z ) (which we abbreviate as Ω X Y Z ), which by Equation (4) we can write in terms of the entropies as:
$$\Omega_{XYZ} = H_X + H_Y + H_Z - H_{XYZ}$$
The normalized measures are then:
$$\delta_X = \Delta_X / \Omega_{XYZ}, \qquad \delta_Y = \Delta_Y / \Omega_{XYZ}, \qquad \delta_Z = \Delta_Z / \Omega_{XYZ}$$
If Z is a function of X and Y, and if X and Y are i.i.d., then δ = ( δ X , δ Y , δ Z ) lies within a highly structured plane, where different regions of the plane correspond to qualitatively different types of interactions. Figure 2 shows the mapping of all possible functions onto this highly structured plane.
The normalized deltas can be expressed as:
$$
\begin{aligned}
\delta_X &= 1 - \frac{I_{X:Y} + I_{Z:X}}{\Omega_{XYZ}} = \frac{-H_X + H_{XY} + H_{XZ} - H_{XYZ}}{\Omega_{XYZ}} \\
\delta_Y &= 1 - \frac{I_{X:Y} + I_{Z:Y}}{\Omega_{XYZ}} = \frac{-H_Y + H_{XY} + H_{YZ} - H_{XYZ}}{\Omega_{XYZ}} \\
\delta_Z &= 1 - \frac{I_{Z:X} + I_{Z:Y}}{\Omega_{XYZ}} = \frac{-H_Z + H_{XZ} + H_{YZ} - H_{XYZ}}{\Omega_{XYZ}}
\end{aligned}
$$
The normalized deltas can also be written in terms of joint mutual informations, as follows:
$$
\begin{aligned}
\delta_X &= \frac{1}{\Omega_{XYZ}} \left( -H_X + H_{XY} + H_{XZ} - H_{XYZ} \right) \\
&= \frac{1}{\Omega_{XYZ}} \left( -H_X + H_{XY} + H_{XZ} - H_{XYZ} + (H_X + H_Y + H_Z - H_{XYZ}) - \Omega_{XYZ} \right) \\
&= \frac{1}{\Omega_{XYZ}} \left( H_Y + H_Z + H_{XY} + H_{XZ} - 2 H_{XYZ} - \Omega_{XYZ} \right) \\
&= \frac{1}{\Omega_{XYZ}} \left( (H_Z + H_{XY} - H_{XYZ}) + (H_Y + H_{XZ} - H_{XYZ}) - \Omega_{XYZ} \right) \\
&= \frac{I_{Z:XY} + I_{Y:XZ}}{\Omega_{XYZ}} - 1
\end{aligned}
$$
We can write all normalized deltas in this form:
$$
\begin{aligned}
\delta_X &= \frac{I_{Z:XY} + I_{Y:XZ}}{\Omega_{XYZ}} - 1 \\
\delta_Y &= \frac{I_{Z:XY} + I_{X:YZ}}{\Omega_{XYZ}} - 1 \\
\delta_Z &= \frac{I_{Y:XZ} + I_{X:YZ}}{\Omega_{XYZ}} - 1
\end{aligned}
$$
By inverting previous equations, we can then write:
$$I_{Z:XY} = \frac{\Omega_{XYZ}}{2} \left( \delta_X + \delta_Y - \delta_Z + 1 \right)$$
$$I_{Z:X} = \frac{\Omega_{XYZ}}{2} \left( -\delta_X + \delta_Y - \delta_Z + 1 \right)$$
$$I_{Z:Y} = \frac{\Omega_{XYZ}}{2} \left( \delta_X - \delta_Y - \delta_Z + 1 \right)$$
Specifically, Equation Set (16) can be inverted to yield Equation (17a), and Equation Set (14) can be inverted to yield Equations (17b) and (17c).
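The expressions above translate directly into code. The following sketch (again with our own helper names, not the released code [20]) computes the normalized coordinates from a joint probability array and confirms that the ternary XOR maps to the uppermost corner of the plane in Figure 2:

```python
# Normalized information deltas (delta_X, delta_Y, delta_Z) from a joint array p[x, y, z].
# Assumes Omega_XYZ > 0 (i.e., the variables are not all mutually independent).
import numpy as np

def H(p, axes):
    """Joint entropy (bits) of the marginal over the given axes of a 3-way array."""
    drop = tuple(ax for ax in range(3) if ax not in axes)
    q = p.sum(axis=drop) if drop else p
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

def deltas(p):
    hx, hy, hz = H(p, (0,)), H(p, (1,)), H(p, (2,))
    hxy, hxz, hyz = H(p, (0, 1)), H(p, (0, 2)), H(p, (1, 2))
    hxyz = H(p, (0, 1, 2))
    omega = hx + hy + hz - hxyz                  # multi-information
    d_x = -hx + hxy + hxz - hxyz                 # Delta_X = I(Y,Z|X)
    d_y = -hy + hxy + hyz - hxyz                 # Delta_Y = I(X,Z|Y)
    d_z = -hz + hxz + hyz - hxyz                 # Delta_Z = I(X,Y|Z)
    return np.array([d_x, d_y, d_z]) / omega, omega

# Ternary XOR with X, Y uniform and independent maps to the top corner (1, 1, 1).
p = np.zeros((3, 3, 3))
for x in range(3):
    for y in range(3):
        p[x, y, (x + y) % 3] = 1 / 9
delta, omega = deltas(p)
print(np.round(delta, 3), round(omega, 3))   # -> [1. 1. 1.] 1.585
```

The same routine returns δ = (0, 1, 0) for the purely pairwise function Z = X, i.e., one of the lower corners of the plane.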

3. PID Mapped into Information Deltas

3.1. Information Decomposition in Terms of Deltas

With Equations (6) and (17), we can equate the expressions for the mutual informations in their delta and information decomposition forms:
$$
\begin{aligned}
\frac{\Omega_{XYZ}}{2} \left( +\delta_X + \delta_Y - \delta_Z + 1 \right) &= R + U_X + U_Y + S \\
\frac{\Omega_{XYZ}}{2} \left( -\delta_X + \delta_Y - \delta_Z + 1 \right) &= R + U_X \\
\frac{\Omega_{XYZ}}{2} \left( +\delta_X - \delta_Y - \delta_Z + 1 \right) &= R + U_Y
\end{aligned}
$$
From the above relations we can derive:
$$S - R = \frac{\Omega_{XYZ}}{2} \left( \delta_X + \delta_Y + \delta_Z - 1 \right)$$
In other words, the difference between the synergy and the redundancy grows as we move away from the origin along the main diagonal of δ-space. Also:
$$U_X - U_Y = \Omega_{XYZ} \left( \delta_Y - \delta_X \right)$$
so the distance from the diagonal in the ( δ X , δ Y ) -plane is proportional to the difference between the unique informations. These striking relationships are visualized in Figure 3.
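As a concrete example, consider the ternary XOR function with X and Y uniform and independent: its coordinates are δ = (1, 1, 1) and its multi-information is Ω_XYZ = log2(3) ≈ 1.585 bits, so the two relations above give

$$S - R = \frac{\log_2 3}{2}\,(1 + 1 + 1 - 1) = \log_2 3, \qquad U_X - U_Y = \log_2 3 \,(1 - 1) = 0.$$

Since I_{Z:X} = I_{Z:Y} = 0 for this function, any decomposition with non-negative components must then assign R = U_X = U_Y = 0 and S = log2(3), consistent with XOR being purely synergistic.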

3.2. Relationship between Diagonal and Interaction Information

Considering again Equation (19) and using Equation (13), we can write:
$$
\begin{aligned}
S - R &= \frac{\Omega_{XYZ}}{2} \left( \delta_X + \delta_Y + \delta_Z - 1 \right) \\
&= \frac{1}{2} \left( \Delta_X + \Delta_Y + \Delta_Z - \Omega_{XYZ} \right) \\
&= \frac{1}{2} \big( (-H_X + H_{XY} + H_{XZ} - H_{XYZ}) + (-H_Y + H_{XY} + H_{YZ} - H_{XYZ}) \\
&\qquad + (-H_Z + H_{XZ} + H_{YZ} - H_{XYZ}) - (H_X + H_Y + H_Z - H_{XYZ}) \big) \\
&= -\left( H_X + H_Y + H_Z - H_{XY} - H_{XZ} - H_{YZ} + H_{XYZ} \right) \\
&= II(X, Y, Z)
\end{aligned}
$$
where II(X,Y,Z) is the interaction information between the variables. This replicates the important result that II(X,Y,Z) = S − R from the original Williams and Beer paper [1].

3.3. The Function Plane

When the variables are related by a discrete function (as defined in [18]), and X and Y are i.i.d., the function will lie on a plane defined by:
$$\delta_Z = \delta_X + \delta_Y - 1$$
Thus, the distance d of a coordinate above the plane is given by
$$d = \delta_Z - \delta_X - \delta_Y + 1 = -\left( \delta_X + \delta_Y - \delta_Z + 1 \right) + 2$$
And so from Equation (18):
$$\frac{\Omega_{XYZ}}{2} \left( 2 - d \right) = R + U_X + U_Y + S$$

4. Solving the PID on the Function Plane

4.1. Transforming Probability Tensors within Q

As noted previously, there is no generally accepted solution for completing and computing the set of PID equations. Our results connecting the PID to the information deltas have therefore, up to this point, been agnostic on this question. All equations in the previous section follow from the basic PID formulation, and the delta coordinate equations. This means they are true for any putative solution, but also brings us no closer to an actual solution to the PID problem; we can still only compute the differences between PID components. We therefore now extend our analysis by using the solution of Bertschinger et al. [9] to fully compute the PID for the functions in Figure 2. We wish to emphasize, however, that the following approach could be used equally well to gain a geometric interpretation of any alternate solution to the PID.
Consider a probability tensor for an alphabet size of N:
$$
P_N = \left( \begin{bmatrix} p_{111} & \cdots & p_{1N1} \\ \vdots & \ddots & \vdots \\ p_{N11} & \cdots & p_{NN1} \end{bmatrix}, \;\ldots,\; \begin{bmatrix} p_{11N} & \cdots & p_{1NN} \\ \vdots & \ddots & \vdots \\ p_{N1N} & \cdots & p_{NNN} \end{bmatrix} \right)
$$
where we use the notation p i j k = p ( X = i , Y = j , Z = k ) . What transformations are permissible that will preserve the distribution within the set Q (as defined in Equation (8))? Please note that we can obtain the marginal distributions simply by summing over the appropriate tensor index. For example, summing along the first index yields the marginal distribution p ( Y = y , Z = z ) . To stay in Q, then, we require that the sums along the first and second indices both remain constant.
For an alphabet size of N = 2 , we can parameterize the set of all possible transformations quite simply:
$$
P_2 = \left( \begin{bmatrix} p_{111} + \alpha & p_{121} - \alpha \\ p_{211} - \alpha & p_{221} + \alpha \end{bmatrix}, \; \begin{bmatrix} p_{112} + \beta & p_{122} - \beta \\ p_{212} - \beta & p_{222} + \beta \end{bmatrix} \right)
$$
All possible changes to each layer of the tensor can be captured with a single parameter. For example, increasing p 111 will require that p 121 and p 211 be decreased by the same amount, as the row and column sums must remain constant (which, in turn, determines p 221 ). Each layer can be modified independently, and thus the second layer has an independent parameter.
For a given probability tensor with N = 2 , then, the probability tensor for any distribution in Q can be fully parameterized with two parameters, and thus the corresponding coordinates in delta-space are at most two-dimensional. In practice, we find that N = 2 functions have delta-coordinates that are restricted to a one-dimensional manifold.
Consider, for example, the AND function:
$$
P_2 = \frac{1}{4} \left( \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}, \; \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} \right)
$$
We can describe all possible perturbations which remain in Q by the parameterization:
$$
P_2 = \left( \begin{bmatrix} 1/4 + \alpha & 1/4 - \alpha \\ 1/4 - \alpha & 0 + \alpha \end{bmatrix}, \; \begin{bmatrix} 0 + \beta & 0 - \beta \\ 0 - \beta & 1/4 + \beta \end{bmatrix} \right)
$$
However, it can be seen that we must have β = 0, as all probabilities must remain in the range p ∈ [0, 1]. The parameter α, on the other hand, can fall within the range α ∈ [0, 1/4]. Since all possible perturbations can be captured by varying a single parameter, Q must therefore be mapped to a one-dimensional manifold in δ-space.
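The following sketch (illustrative only, with our own variable names) sweeps α across this range, confirms that each perturbed tensor keeps the marginals p(X=x, Z=z) and p(Y=y, Z=z) fixed, and prints the corresponding δ-coordinates, which trace out a one-dimensional curve:

```python
# Sweep the single free parameter alpha for the binary AND example, checking that
# the perturbed tensors stay in Q and printing their delta-coordinates.
import numpy as np

def H(p, axes):
    drop = tuple(ax for ax in range(3) if ax not in axes)
    q = p.sum(axis=drop) if drop else p
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

def deltas(p):
    hx, hy, hz = H(p, (0,)), H(p, (1,)), H(p, (2,))
    hxy, hxz, hyz = H(p, (0, 1)), H(p, (0, 2)), H(p, (1, 2))
    hxyz = H(p, (0, 1, 2))
    omega = hx + hy + hz - hxyz
    return np.array([-hx + hxy + hxz - hxyz,
                     -hy + hxy + hyz - hxyz,
                     -hz + hxz + hyz - hxyz]) / omega

# AND function with uniform, independent binary inputs: p[x, y, z], z = x AND y.
p_and = np.zeros((2, 2, 2))
p_and[0, 0, 0] = p_and[0, 1, 0] = p_and[1, 0, 0] = 0.25
p_and[1, 1, 1] = 0.25

pxz, pyz = p_and.sum(axis=1), p_and.sum(axis=0)     # the marginals defining Q

for alpha in np.linspace(0.0, 0.25, 6):
    q = p_and.copy()
    q[0, 0, 0] += alpha; q[0, 1, 0] -= alpha        # perturb the first z-layer only
    q[1, 0, 0] -= alpha; q[1, 1, 0] += alpha        # (beta is forced to zero)
    assert np.allclose(q.sum(axis=1), pxz) and np.allclose(q.sum(axis=0), pyz)
    print(round(float(alpha), 3), np.round(deltas(q), 3))
```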
The layers of a probability tensor become significantly harder to parameterize for N = 3 . Consider a single layer of a probability tensor:
$$
\begin{bmatrix} p_{111} & p_{121} & p_{131} \\ p_{211} & p_{221} & p_{231} \\ p_{311} & p_{321} & p_{331} \end{bmatrix} \equiv \begin{bmatrix} a & b & c \\ e & d & f \\ g & h & i \end{bmatrix}
$$
The permissible transformations to this layer can be parameterized by:
$$
\begin{bmatrix} a + \alpha & b + \beta & c - \alpha - \beta \\ e + \gamma & d + \delta & f - \gamma - \delta \\ g - \alpha - \gamma & h - \beta - \delta & i + \alpha + \beta + \gamma + \delta \end{bmatrix}
$$
subject to the constraints that:
$$
\begin{aligned}
0 \le (a + \alpha) \le 1 \qquad & 0 \le (b + \beta) \le 1 \\
0 \le (e + \gamma) \le 1 \qquad & 0 \le (d + \delta) \le 1 \\
0 \le (c - \alpha - \beta) \le 1 \qquad & 0 \le (f - \gamma - \delta) \le 1 \\
0 \le (g - \alpha - \gamma) \le 1 \qquad & 0 \le (h - \beta - \delta) \le 1 \\
0 \le (i + \alpha + \beta + \gamma + \delta) \le 1 &
\end{aligned}
$$
Clearly, these relations are too complicated to lend any immediate insight into the problem. However, it is a simple matter to use the above inequalities to calculate permissible values of parameters ( α , β , γ , δ ) and to plot out the corresponding delta coordinates. This is done for randomly generated sample functions in Figure 4. In this case, the delta coordinates have a complex distribution but are nonetheless restricted to a plane in delta-space (the vertically oriented red plane in Figure 5).
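As an illustration of how such a set can be generated numerically (a crude rejection-sampling sketch with our own function names; the released code [20] instead steps through the parameterization systematically), each z-layer can be perturbed independently with its own four parameters, rejecting any draw that pushes a probability outside [0, 1]:

```python
# Rejection-sample members of Q for a ternary function by applying the
# four-parameter perturbation above to each z-layer independently.
import numpy as np

rng = np.random.default_rng(0)

def perturb_layer(layer, scale=0.05, max_tries=1000):
    """Return a perturbed copy of one z-layer with all row and column sums unchanged."""
    for _ in range(max_tries):
        a, b, g, d = rng.uniform(-scale, scale, size=4)
        shift = np.array([[a,        b,       -a - b],
                          [g,        d,       -g - d],
                          [-a - g,  -b - d,    a + b + g + d]])
        new = layer + shift
        if (new >= 0).all() and (new <= 1).all():
            return new
    return layer    # fall back to the unperturbed layer if no valid draw is found

def sample_Q_member(p):
    """One perturbed tensor sharing the p(X,Z) and p(Y,Z) marginals of p."""
    return np.stack([perturb_layer(p[:, :, k]) for k in range(3)], axis=2)

# Ternary XOR as the starting function, X and Y uniform and independent.
p = np.zeros((3, 3, 3))
for x in range(3):
    for y in range(3):
        p[x, y, (x + y) % 3] = 1 / 9

samples = [sample_Q_member(p) for _ in range(200)]
assert all(np.allclose(q.sum(axis=1), p.sum(axis=1)) and
           np.allclose(q.sum(axis=0), p.sum(axis=0)) for q in samples)
# Passing these samples through the delta routine of the earlier sketch yields
# a local point cloud of Q members whose coordinates can be plotted as in Figure 4.
```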

4.2. δ -Coordinates in Q Are Always Restricted to a Plane

In the N = 2 case, delta-coordinates were parameterized by a single variable such that they must be restricted onto a line. In the N = 3 example, they are restricted onto a plane. Will larger alphabets map Q onto a three-dimensional volume? If not, is it possible to get a non-planar two-dimensional manifold, or are coordinates always restricted to a plane? We will now prove that Q is always constrained to a plane, regardless of the alphabet size.
Lemma 1.
In any set Q as defined previously, the following entropies remain constant: all individual entropies H X , H Y and H Z ; the joint, 2-variable entropies containing Z, namely H X Z and H Y Z . The only entropies which vary within a particular Q then are H X Y and H X Y Z .
Proof. 
The definition of Q preserves the marginal distributions by construction. H X Z and H Y Z being constant is a trivial consequence of holding p ( X = x , Z = z ) and p ( Y = y , Z = z ) constant, which is the condition defining Q. From these constant marginal distributions, we can calculate the distributions p ( X = x ) , p ( Y = y ) and p ( Z = z ) , which are therefore also constant, as are their corresponding entropies. □
Only two entropic quantities vary between the different distributions in Q. By considering just their effect on the delta coordinates, we can now show the following:
Theorem 1.
In any set Q of distributions with equal marginal distributions p ( X = x , Z = z ) and p ( Y = y , Z = z ) , the delta-coordinates ( δ X , δ Y , δ Z ) will be restricted to a plane. This is true for any alphabet size.
Proof. 
We begin by making several notational definitions to simplify the algebra which follows, first from the joint entropies which vary within Q:
$$d \equiv H_{XYZ} - H_{XY}, \qquad h \equiv H_{XYZ}$$
We then define quantities which collect the constant entropy terms:
$$
\begin{aligned}
c_1 &\equiv H_X + H_Y + H_Z \\
c_2 &\equiv -H_X + H_{XZ} \\
c_3 &\equiv -H_Y + H_{YZ} \\
c_4 &\equiv -H_Z + H_{XZ} + H_{YZ}
\end{aligned}
$$
In terms of these quantities we can now write the normalized delta coordinates as follows:
$$\delta_X = \frac{c_2 - d}{c_1 - h}, \qquad \delta_Y = \frac{c_3 - d}{c_1 - h}, \qquad \delta_Z = \frac{c_4 - h}{c_1 - h}$$
Solving for d in the δ X and δ Y equations yields:
$$\delta_Y (c_1 - h) - c_3 = \delta_X (c_1 - h) - c_2$$
And the δ Z equation allows us to solve for h:
$$h = \frac{c_4 - c_1 \delta_Z}{1 - \delta_Z} \quad \Longrightarrow \quad (c_1 - h) = \frac{c_1 - c_4}{1 - \delta_Z}$$
Plugging this into the equation above yields an equation which simplifies to:
$$(c_1 - c_4)(\delta_X - \delta_Y) + (c_3 - c_2)(1 - \delta_Z) = 0$$
Since c 1 , c 2 , c 3 and c 4 are all constant over Q, this defines a plane in δ X , δ Y , δ Z space. □
Equation (32) not only shows that the points in Q are bound to a plane, but it also implies that this plane always contains the line defined by δ X = δ Y and δ Z = 1 . Therefore for any function in Figure 2, we can trivially compute the plane in which the corresponding Q is contained.

4.3. PID Calculation for All Functions

For the set of probability distributions Q, Bertschinger et al. [9] provide the following estimators for the PID components:
$$
\begin{aligned}
\tilde{U}_X &= \min_{q \in Q} \Omega_{XYZ}\, \delta_Y \\
\tilde{U}_Y &= \min_{q \in Q} \Omega_{XYZ}\, \delta_X \\
\tilde{R} &= \max_{q \in Q} CI(X, Y, Z) \\
\tilde{S} &= I_{Z:XY} - \min_{q \in Q} I_{Z:XY}
\end{aligned}
$$
If we numerically compute the set Q for a given function f (i.e., by generating a distribution such as the one shown in Figure 4 via the parameterization of Equation (30)), these estimators can be computed directly. Figure 6 shows the computed values of the PID components for all of the functions shown in Figure 2. There is a clear geometric interpretation here: functions in the lower left/right corners consist almost entirely of Ũ_X and Ũ_Y, respectively. Functions approaching the top corner become increasingly synergistic, with a higher proportion of S. Functions are most redundant towards the lower center of the plane, though no single function is primarily R.
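To make the optimization concrete, the sketch below (our own simplification, not the released code [20]) evaluates these estimators while restricting the search to the one-parameter family of mixtures between the observed tensor p and the conditional-independence coupling q_ind(x, y, z) = p(x|z) p(y|z) p(z). Both endpoints lie in Q, and Q is convex, so every mixture is a valid member; the full optimization of [9] searches all of Q, so in general this scan only bounds the estimators (for the ternary XOR it happens to attain them exactly):

```python
# Evaluate the Bertschinger estimators along the segment between the observed
# distribution p and the conditional-independence coupling q_ind (both in Q).
import numpy as np

def H(p, axes):
    drop = tuple(ax for ax in range(3) if ax not in axes)
    q = p.sum(axis=drop) if drop else p
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

def terms(p):
    """Return (Delta_X, Delta_Y, Delta_Z, CI, I_{Z:XY}) for a joint array p[x,y,z]."""
    hx, hy, hz = H(p, (0,)), H(p, (1,)), H(p, (2,))
    hxy, hxz, hyz = H(p, (0, 1)), H(p, (0, 2)), H(p, (1, 2))
    hxyz = H(p, (0, 1, 2))
    d_x = -hx + hxy + hxz - hxyz                 # Omega * delta_X = I(Y,Z|X)
    d_y = -hy + hxy + hyz - hxyz                 # Omega * delta_Y = I(X,Z|Y)
    d_z = -hz + hxz + hyz - hxyz                 # Omega * delta_Z = I(X,Y|Z)
    ci = hx + hy + hz - hxy - hxz - hyz + hxyz   # CI(X,Y,Z)
    izxy = hxy + hz - hxyz                       # I(Z:X,Y)
    return d_x, d_y, d_z, ci, izxy

def bertschinger_scan(p, n=101):
    pz = p.sum(axis=(0, 1))                      # assumes p(z) > 0 for all z
    q_ind = np.einsum('xz,yz,z->xyz', p.sum(axis=1) / pz, p.sum(axis=0) / pz, pz)
    izxy_p = terms(p)[4]
    UX = UY = Imin = np.inf
    R = -np.inf
    for lam in np.linspace(0.0, 1.0, n):
        d_x, d_y, d_z, ci, izxy = terms((1 - lam) * p + lam * q_ind)
        UX, UY = min(UX, d_y), min(UY, d_x)      # U~_X = min Omega*delta_Y, etc.
        R, Imin = max(R, ci), min(Imin, izxy)
    return UX, UY, R, izxy_p - Imin              # (U~_X, U~_Y, R~, S~)

# Ternary XOR: q_ind attains the optima, so the scan recovers (0, 0, 0, log2 3).
p = np.zeros((3, 3, 3))
for x in range(3):
    for y in range(3):
        p[x, y, (x + y) % 3] = 1 / 9
print(np.round(bertschinger_scan(p), 3))         # -> approximately [0, 0, 0, 1.585]
```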

4.4. Alternate Solutions: Pointwise PID

The Pointwise Partial Information Decomposition (PPID) of Finn and Lizier [19] is an alternate approach to solving the PID problem. It is motivated by the fact that the entropy and mutual information can be expressed as the expectation value of pointwise quantities, which measure the information content of a single event. For example, the event ( X , Z ) = ( x 1 , z 1 ) has the associated pointwise mutual information:
$$i(x_1 : z_1) = \log \frac{p(X = x_1 \mid Z = z_1)}{p(X = x_1)}$$
and the overall mutual information between the two variables is the expectation value of this pointwise quantity, taken over all possible events. It is important to note that while the overall mutual information is non-negative, the pointwise mutual information can be negative. Finn and Lizier decompose this pointwise quantity into two non-negative components, the “specificity” i^+(x_1 → z_1) and “ambiguity” i^-(x_1 → z_1), and argue that:
$$
\begin{aligned}
i(x_1 : z_1) &= i^{+}(x_1 \to z_1) - i^{-}(x_1 \to z_1) \\
i^{+}(x_1 \to z_1) &= h(x_1) = -\log p(X = x_1) \\
i^{-}(x_1 \to z_1) &= h(x_1 \mid z_1) = -\log p(X = x_1 \mid Z = z_1)
\end{aligned}
$$
They similarly decompose the redundancy R into a pointwise specific redundancy r^+ and a pointwise specific ambiguity r^-, and argue for the following definitions:
$$
r^{+}_{\min}(a_1, \ldots, a_k \to z) = \min_{a_i} i^{+}(a_i \to z), \qquad r^{-}_{\min}(a_1, \ldots, a_k \to z) = \min_{a_i} i^{-}(a_i \to z)
$$
where { a i } are the values of each of the source variables in a particular realization (e.g., if we have two source variables X , Y predicting Z, then the event ( X , Y , Z ) = ( x 1 , y 2 , z 3 ) has { a i } = { x 1 , y 2 } ). The expectation value of the difference of these quantities then yields the redundancy:
$$R = \left\langle r^{+}_{\min} - r^{-}_{\min} \right\rangle$$
from which the rest of the PID components follow. See [19] for a full discussion of the motivations and axioms which these definitions satisfy (including a discussion of the relationship between this formulation and that of Bertschinger et al. [9], and of how many aspects of [19] are arguably pointwise adaptations of the assumptions in [9]).
One consequence of this approach is that the PID components are no longer guaranteed to be non-negative. There is extensive discussion of the interpretation of this in [19], but one example, RdnErr, is particularly informative. In our probability tensor notation, we can write this function as:
$$
P_{\mathrm{RdnErr}} = \left( \begin{bmatrix} 3/8 & 1/8 \\ 0 & 0 \end{bmatrix}, \; \begin{bmatrix} 0 & 0 \\ 1/8 & 3/8 \end{bmatrix} \right)
$$
This can be interpreted as follows: X is always equal to Z. Y is usually equal to Z, but occasionally (with probability 1/4) makes an error. What should we expect the PID components to be in this case? The PPID yields (R, U_X, U_Y, S) = (1, 0, −0.81, 0.81), which implies the following interpretation: the information about Z is encoded redundantly by both X and Y, but Y carries unique misinformation about Z due to its tendency to make errors. If all components were constrained to be non-negative, we would likely draw a different conclusion: both X and Y encode some information about Z, with X encoding additional unique information. In this way, different solutions will lead to slightly different interpretations about the nature of the relationship between the variables.
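The following sketch (our own implementation of the pointwise formulas quoted above, not code from [19]) reproduces these numbers for RdnErr by evaluating the specificity and ambiguity event by event and recovering the remaining components from the measured mutual informations:

```python
# Reproduce the RdnErr pointwise redundancy, R = <r+_min - r-_min> = 1 bit, then
# obtain U_X, U_Y, S from the PID equations and the measured mutual informations.
import numpy as np

p = np.zeros((2, 2, 2))                 # p[x, y, z]
p[0, 0, 0], p[0, 1, 0] = 3 / 8, 1 / 8   # z = 0 layer
p[1, 0, 1], p[1, 1, 1] = 1 / 8, 3 / 8   # z = 1 layer

def H(pr, axes):
    drop = tuple(ax for ax in range(3) if ax not in axes)
    q = pr.sum(axis=drop) if drop else pr
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

px, py, pz = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))
pxz, pyz = p.sum(axis=1), p.sum(axis=0)

R = 0.0
for (x, y, z), pr in np.ndenumerate(p):
    if pr == 0:
        continue
    # specificity r+_min: minimum over sources of h(a) = -log p(a)
    r_plus = min(-np.log2(px[x]), -np.log2(py[y]))
    # ambiguity r-_min: minimum over sources of h(a|z) = -log p(a|z)
    r_minus = min(-np.log2(pxz[x, z] / pz[z]), -np.log2(pyz[y, z] / pz[z]))
    R += pr * (r_plus - r_minus)

I_zx = H(p, (0,)) + H(p, (2,)) - H(p, (0, 2))
I_zy = H(p, (1,)) + H(p, (2,)) - H(p, (1, 2))
I_zxy = H(p, (0, 1)) + H(p, (2,)) - H(p, (0, 1, 2))
U_X, U_Y = I_zx - R, I_zy - R
S = I_zxy - U_X - U_Y - R
print(np.round([R, U_X, U_Y, S], 2))    # -> [ 1.    0.   -0.81  0.81]
```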
In Figure 7, we compute the PPID for all functions Z = f ( X , Y ) and map them onto δ -space, just as we did in Figure 6. Comparing Figure 6 and Figure 7 immediately highlights key differences in how each method decomposes information. For example: in Figure 6, the top corner is purely synergistic, the lower-left corner has information solely in X, and the lower-right corner has information solely in Y; in Figure 7, the top corner has zero redundancy, the lower-left corner has misinformation in Y, and the lower-right corner has misinformation in X.
It is not our goal here to argue which result is more correct. Instead, we wish to highlight how comparing Figure 6 and Figure 7 readily yields subtle insights into how the two approaches differ in decomposing information. It also yields immediate insights into the subtleties of how we might interpret coordinates in δ -space.

5. Conclusions

The key overall result of this paper is that the PID problem can be mapped directly into the previously defined “information landscape” represented by the “delta space” of [18]. This theoretical framework is simple and has a geometric interpretation which was well worked out previously. The simple set of relations between the frameworks, as explicated in Equation (18) and visualized in Figure 3, anticipates a much deeper set of geometric constraints.
We build upon this general relationship using the solution of Bertschinger et al. [9]. Using this solution, we parameterize the permissible transformations to a discrete function to numerically generate the distribution set Q, and prove in Theorem 1 that this set is mapped onto a plane in delta-space. The optimization problem defined by this approach is cast in terms of our variables in Equation (33), and the various extrema can be extracted directly from our parameterization and mapping procedure. Code which replicates these computations and generates the figures within this paper is freely available [20].
These results suggest the following approach for computation of the PID components, if using the solution from [9], and given the added assumption that there is some function Z = f ( X , Y ) which best approximates the relationship between variables. The steps are these:
  • Construct a library (set) of distributions {Q_1, Q_2, ..., Q_N} for all functions f_i(X,Y). Specifically, record the δ-coordinates spanned by each distribution (e.g., as plotted in Figure 4) along with the corresponding function and its PID component values.
  • For a set of variables in data for which we wish to find the decomposition, compute its δ-coordinates and then match them to the closest Q_i. This will then immediately yield the corresponding function and PID components (a minimal sketch of such a lookup follows this list).
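A minimal sketch of such a lookup (the function and entry names here are hypothetical, and the toy library contains only two hand-filled entries; a practical version would also need the noise handling discussed below):

```python
# Hypothetical sketch of the proposed library lookup: store, for each discrete
# function f_i, the delta-coordinates of points sampled from its Q_i together
# with that function's PID; then match observed coordinates to the nearest entry.
import numpy as np

def build_library(function_entries):
    """function_entries: list of (name, delta_points, pid) tuples, where delta_points
    is an (m, 3) array sampled from Q_i and pid is the tuple (R, U_X, U_Y, S)."""
    return [(name, np.asarray(pts, dtype=float), pid) for name, pts, pid in function_entries]

def closest_function(library, delta_obs):
    """Return the library entry whose sampled Q_i comes closest to delta_obs."""
    best, best_dist = None, np.inf
    for name, pts, pid in library:
        dist = np.min(np.linalg.norm(pts - np.asarray(delta_obs), axis=1))
        if dist < best_dist:
            best, best_dist = (name, pid), dist
    return best, best_dist

# Toy usage with two hand-filled entries (real entries come from the Q-sampling step):
library = build_library([
    ("Z = X",        [[0.0, 1.0, 0.0]], (0.0, 1.585, 0.0, 0.0)),
    ("Z = XOR(X,Y)", [[1.0, 1.0, 1.0]], (0.0, 0.0, 0.0, 1.585)),
])
print(closest_function(library, [0.9, 0.95, 0.9]))
```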
If this approach proves to be practical, it would have several clear advantages. First, the library would only need to be constructed once, and this construction would not need to be repeated for any subsequent analysis. The cost of the library construction is itself quite tractable (for example, exactly this computation was done to generate Figure 6). Second, it would solve an open problem in the use of Information Deltas when the source variables are not independent, for example, in applications to genetics in the presence of linkage disequilibrium. Specifically, this approach relaxes the assumption, common in [18], that X and Y must be statistically independent.
The practical application of this approach to data analysis requires further development, which is beyond the scope of this paper. Specifically, the actual data will contain noise, such that the computed δ-coordinates will not lie perfectly within the coordinates spanned by any set Q_i. The naïve approach of simply taking the closest Q_i may therefore be insufficient in general. Future work will characterize the response of δ-coordinates to various levels of noise within the data, to enable the computation of p(Q_i | δ, α) (i.e., the probability that the variables belong to the set Q_i given their observed coordinates δ and some noise level α).
Future work will extend the δ approach to larger sets of variables, so as to fully characterize a higher-dimensional δ-space and its relationship to the PID. Much of the complexity of each framework is contained in these higher-order relationships. Future work will also consider additional solutions to the PID problem beyond those of [9,19] considered here. All equations in Section 3 are general and agnostic to the precise solution used for the actual PID computation, and it should be straightforward to generate figures similar to Figure 6 for different solutions to show how they differ in mapping information components onto the function plane. This will provide interpretable geometric comparisons between solutions and will also immediately highlight all functions for which the results offer differing interpretations, as seen in Figure 6 and Figure 7. We anticipate that this direct comparison of how different solutions map the information content of discrete functions will provide a powerful visual tool for understanding the differing consequences of putative solutions, and thus that our unification of these frameworks will be useful in resolving the open question of how best to compute the PID.

Author Contributions

J.K.-G., N.S., and D.G. conceived of and designed the project; J.K.-G. performed the computations and formal analysis; N.S. and D.G. supervised and validated the results; J.K.-G. visualized the results; J.K.-G., N.S. and D.G. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

Research reported in this publication was supported by the National Heart, Lung, And Blood Institute of the National Institutes of Health under Award Number U01HL126496. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Acknowledgments

We wish to acknowledge support from the Pacific Northwest Research Institute.

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
PID   Partial Information Decomposition
II    Interaction Information
CI    Co-Information
U_X   Unique Information in X
U_Y   Unique Information in Y
R     Redundant Information
S     Synergistic Information
PPID  Pointwise Partial Information Decomposition

References

  1. Williams, P.L.; Beer, R.D. Nonnegative decomposition of multivariate information. arXiv 2010, arXiv:1004.2515.
  2. James, R.G.; Crutchfield, J.P. Multivariate dependence beyond Shannon information. Entropy 2017, 19, 531.
  3. Lizier, J.T.; Bertschinger, N.; Jost, J.; Wibral, M. Information Decomposition of Target Effects from Multi-Source Interactions: Perspectives on Previous, Current and Future Work. Entropy 2018, 20, 307.
  4. Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E 2013, 87, 012130.
  5. Griffith, V.; Chong, E.K.; James, R.G.; Ellison, C.J.; Crutchfield, J.P. Intersection information based on common randomness. Entropy 2014, 16, 1985–2000.
  6. Barrett, A.B. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Phys. Rev. E 2015, 91, 052802.
  7. Ince, R.A. Measuring multivariate redundant information with pointwise common change in surprisal. Entropy 2017, 19, 318.
  8. Rauh, J.; Banerjee, P.K.; Olbrich, E.; Jost, J.; Bertschinger, N. On extractable shared information. Entropy 2017, 19, 328.
  9. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy 2014, 16, 2161–2183.
  10. Timme, N.; Alford, W.; Flecker, B.; Beggs, J.M. Synergy, redundancy, and multivariate information measures: An experimentalist’s perspective. J. Comput. Neurosci. 2014, 36, 119–140.
  11. Stramaglia, S.; Cortes, J.M.; Marinazzo, D. Synergy and redundancy in the Granger causal analysis of dynamical networks. New J. Phys. 2014, 16, 105003.
  12. Timme, N.M.; Ito, S.; Myroshnychenko, M.; Nigam, S.; Shimono, M.; Yeh, F.C.; Hottowy, P.; Litke, A.M.; Beggs, J.M. High-degree neurons feed cortical computations. PLoS Comput. Biol. 2016, 12, e1004858.
  13. Wibral, M.; Priesemann, V.; Kay, J.W.; Lizier, J.T.; Phillips, W.A. Partial information decomposition as a unified approach to the specification of neural goal functions. Brain Cogn. 2017, 112, 25–38.
  14. Wibral, M.; Finn, C.; Wollstadt, P.; Lizier, J.T.; Priesemann, V. Quantifying information modification in developing neural networks via partial information decomposition. Entropy 2017, 19, 494.
  15. Kay, J.W.; Ince, R.A.; Dering, B.; Phillips, W.A. Partial and entropic information decompositions of a neuronal modulatory interaction. Entropy 2017, 19, 560.
  16. Galas, D.J.; Sakhanenko, N.A.; Skupin, A.; Ignac, T. Describing the complexity of systems: Multivariable “set complexity” and the information basis of systems biology. J. Comput. Biol. 2014, 21, 118–140.
  17. Sakhanenko, N.A.; Galas, D.J. Biological data analysis as an information theory problem: Multivariable dependence measures and the Shadows algorithm. J. Comput. Biol. 2015, 22, 1005–1024.
  18. Sakhanenko, N.; Kunert-Graf, J.; Galas, D. The Information Content of Discrete Functions and Their Application in Genetic Data Analysis. J. Comput. Biol. 2017, 24, 1153–1178.
  19. Finn, C.; Lizier, J.T. Pointwise partial information decomposition using the specificity and ambiguity lattices. Entropy 2018, 20, 297.
  20. Kunert-Graf, J. kunert/deltaPID: Initial Release (Version v1.0.0). Zenodo 2020.
  21. McGill, W. Multivariate information transmission. Trans. IRE Prof. Group Inf. Theory 1954, 4, 93–111.
  22. Jakulin, A.; Bratko, I. Quantifying and Visualizing Attribute Interactions: An Approach Based on Entropy. arXiv 2003, arXiv:cs/0308002.
  23. Bell, A.J. The co-information lattice. In Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation (ICA 2003), Granada, Spain, 22–24 September 2003.
  24. Watanabe, S. Information theoretical analysis of multivariate correlation. IBM J. Res. Dev. 1960, 4, 66–82.
Figure 1. (A) Visualization of the Information Decomposition (adapted from [3]) and its governing equations. The system is underdetermined. (B) Sample binary datasets which contain only one type of information. For (i), where Z = X, X contains all information about Z and Y is irrelevant, such that U_X is equal to the total information and all other terms are zero. For (ii), where Z = X = Y, X and Y are always identical and thus the information is fully redundant. For (iii), where Z is the XOR function of X and Y, both X and Y are individually independent of Z, but fully determine its value when taken jointly.
Figure 2. A geometric interpretation of the Information Deltas, as developed in [18]. (A) Consider functions where each variable has an alphabet size of three possible values. There are 19,683 possible functions f(X,Y). If the variables X and Y are independent, these functions map onto 105 unique points (function families) within a plane in δ-space. (B) Sample functions and their mappings onto δ-space. Functions with a full pairwise dependence on X or Y map to opposite lower corners, whereas the fully synergistic XOR (i.e., the XOR-like ternary extension XOR(X,Y) ≡ (X + Y) mod 3) is mapped to the uppermost corner.
Figure 3. As shown in Equations (23) and (24), the δ -space encodes the balance of synergy/redundancy along one diagonal, and the balance of unique information in each source along the other.
Figure 4. An example mapping of the Bertschinger set Q to δ -space for a randomly chosen function f. A set Q consists of all probability distributions p ( X = x , Y = y , Z = z ) that share the same marginal distributions p ( X = x , Z = z ) and p ( Y = y , Z = z ) . Each Q maps onto a set of points with a complex distribution, but which is constrained to a simple plane in δ -space.
Figure 5. The same function’s Q mapped onto δ -space as in Figure 4, viewed from a different angle. Q is constrained to a plane in δ -space. This plane, highlighted in red, contains the δ -coordinates of the function f (indicated by the red dot) as well as the line ( δ X = δ Y , δ Z = 1 ) (indicated by the solid red line).
Figure 6. All functions Z = f ( X , Y ) (with alphabet sizes of 3) mapped onto a plane in δ -space, as in Figure 2. Each function is colored by the fraction of the total information in each PID component, as computed using the solution of [9]. There is a clear geometric structure to the decomposition which matches the previously discussed intuition about δ -space.
Figure 7. The same set of all 3-letter functions Z = f ( X , Y ) mapped onto a plane in δ -space, as in Figure 6. The colorscale shows the amount of information in each component, now computed using the pointwise solution of Finn and Lizier [19]. In this formulation, the PID components are the average difference between two subcomponents, the specificity and ambiguity, and can be negative when the latter exceeds the former. Visualizing this solution immediately highlights the differences in how it decomposes the information of functions and leads to an alternate interpretation of δ -space.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
