Article

Partial Information Decomposition: Redundancy as Information Bottleneck

by Artemy Kolchinsky 1,2
1 ICREA-Complex Systems Lab, Universitat Pompeu Fabra, 08003 Barcelona, Spain
2 Universal Biology Institute, The University of Tokyo, Tokyo 113-0033, Japan
Entropy 2024, 26(7), 546; https://doi.org/10.3390/e26070546
Submission received: 13 May 2024 / Revised: 20 June 2024 / Accepted: 21 June 2024 / Published: 26 June 2024

Abstract:
The partial information decomposition (PID) aims to quantify the amount of redundant information that a set of sources provides about a target. Here, we show that this goal can be formulated as a type of information bottleneck (IB) problem, termed the “redundancy bottleneck” (RB). The RB formalizes a tradeoff between prediction and compression: it extracts information from the sources that best predict the target, without revealing which source provided the information. It can be understood as a generalization of “Blackwell redundancy”, which we previously proposed as a principled measure of PID redundancy. The “RB curve” quantifies the prediction–compression tradeoff at multiple scales. This curve can also be quantified for individual sources, allowing subsets of redundant sources to be identified without combinatorial optimization. We provide an efficient iterative algorithm for computing the RB curve.

1. Introduction

Many research fields that study complex systems are faced with multivariate probabilistic models and high-dimensional datasets. Prototypical examples include brain imaging data in neuroscience, gene expression data in biology, and neural networks in machine learning. In response, various information-theoretic frameworks have been developed in order to study multivariate systems in a universal manner. Here, we focus on two such frameworks, partial information decomposition and the information bottleneck.
The partial information decomposition (PID) considers how information about a target random variable Y is distributed among a set of source random variables $X_1, \ldots, X_n$ [1,2,3,4]. For example, in neuroscience, the sources $X_1, \ldots, X_n$ might represent the activity of n different brain regions and Y might represent a stimulus, and one may wish to understand how information about the stimulus is encoded in different brain regions. A central idea of the PID is that the information provided by the sources can exhibit redundancy, when the same information about Y is present in each source, and synergy, when information about Y is found only in the collective outcome of all sources. Moreover, it has been shown that standard information-theoretic quantities, such as entropy and mutual information, are not sufficient to quantify redundancy and synergy [1,5]. However, finding the right measures of redundancy and synergy has proven difficult. In recent work [4], we showed that such measures can be naturally defined by formalizing the analogy between set theory and information theory that lies at the heart of the PID [5]. We then proposed a measure of redundant information (Blackwell redundancy) that is motivated by algebraic, axiomatic, and operational considerations. We argued that Blackwell redundancy overcomes many limitations of previous proposals [4].
The information bottleneck (IB) [6,7] is a method for extracting compressed information from one random variable X that optimally predicts another target random variable Y. For instance, in the neuroscience example with stimulus Y and brain activity X, the IB method could be used to quantify how well the stimulus can be predicted using only one bit of information about brain activity. The overall tradeoff between the prediction of Y and compression of X is captured by the so-called IB curve. The IB method has been employed in various domains, including neuroscience [8], biology [9], and cognitive science [10]. In recent times, it has become particularly popular in machine learning applications [7,11,12,13,14].
In this paper, we demonstrate a formal connection between PID and IB. We focus in particular on the relationship between the IB and PID redundancy, leaving the connection to other PID measures (such as synergy) for future work. To begin, we show that Blackwell redundancy can be formulated as an information-theoretic constrained optimization problem. This optimization problem extracts information from the sources that best predict the target, under the constraint that the solution does not reveal which source provided the information. We then define a generalized measure of Blackwell redundancy by relaxing the constraint. Specifically, we ask how much predictive information can be extracted from the sources without revealing more than a certain number of bits about the identity of the source. Our generalization leads to an IB-type tradeoff between the prediction of the target (generalized redundancy) and compression (leakage of information about the identity of the source). We refer to the resulting optimization problem as the redundancy bottleneck (RB) and to the manifold of optimal solutions at different points on the prediction/compression tradeoff as the RB curve. We also show that the RB prediction and compression terms can be decomposed into contributions from individual sources, giving rise to an individual RB curve for each source.
Besides the intrinsic theoretical interest of unifying PID and the IB, our approach brings about several practical advantages. In particular, the RB curve offers a fine-grained analysis on PID redundancy, showing how redundant information emerges at various scales and across different sources. This fine-grained analysis can be used to uncover sets of redundant sources without performing intractable combinatorial optimization. Our approach also has numerical advantages. The original formulation of Blackwell redundancy was based on a difficult optimization problem that becomes infeasible for larger systems. By reformulating Blackwell redundancy as an IB-type problem, we are able to solve it efficiently using an iterative algorithm, even for larger systems (code available at https://github.com/artemyk/pid-as-ib, accessed on 12 May 2024). Finally, the RB has some attractive formal properties. For instance, unlike the original Blackwell redundancy, the RB curve is continuous in the underlying probability distributions.
This paper is organized as follows. In the next section, we provide the background on the IB, PID, and Blackwell redundancy. In Section 3, we introduce the RB, illustrate it with several examples, and discuss its formal properties. In Section 4, we introduce an iterative algorithm to solve the RB optimization problem. We discuss the implications and possible future directions in Section 5. All proofs are found in the Appendix A.

2. Background

We begin by providing relevant background on the information bottleneck, partial information decomposition, and Blackwell redundancy.

2.1. Information Bottleneck (IB)

The information bottleneck (IB) method provides a way to extract information that is present in one random variable X that is relevant for predicting another target random variable Y [6,15,16]. To do so, the IB posits a “bottleneck variable” Q that obeys the Markov condition $Q - X - Y$. This Markov condition guarantees that Q does not contain any information about Y that is not found in X. The quality of any particular choice of bottleneck variable Q is quantified via two mutual information terms: $I(X;Q)$, which decreases when Q provides a more compressed representation of X, and $I(Y;Q)$, which increases when Q allows a better prediction of Y. The IB method selects Q to maximize prediction given a constraint on compression [15,16,17]:
$$I_{\mathrm{IB}}(R) = \max_{Q \,:\, Q - X - Y} I(Y;Q) \quad \text{where} \quad I(X;Q) \le R.$$
The values of I IB ( R ) for different R specify the IB curve, which encodes the overall tradeoff between prediction and compression.
In practice, the IB curve is usually explored by considering the Lagrangian relaxation of the constrained optimization problem (1):
$$F_{\mathrm{IB}}(\beta) := \max_{Q} \; I(Y;Q) - \frac{1}{\beta} I(X;Q).$$
Here, $\beta \ge 0$ is a parameter that controls the tradeoff between compression cost (favored for $\beta \to 0$) and prediction benefit (favored for $\beta \to \infty$). The advantage of the Lagrangian formulation is that it avoids the non-linear constraint in Equation (1). If the IB curve is strictly concave, then the two Equations (1) and (2) are equivalent, meaning that there is a one-to-one map between the solutions of both problems [18]. When the IB curve is not strictly concave, a modified objective such as the “squared Lagrangian” or “exponential Lagrangian” should be used instead; see Refs. [18,19,20] for more details.
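As a concrete illustration, the following minimal sketch evaluates the two terms of the IB tradeoff for a fixed encoder. It is our own example in numpy, not part of the paper's released code, and the joint distribution and encoder shown are arbitrary choices:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in bits for a joint distribution pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def ib_terms(pxy, q_given_x):
    """Return (I(X;Q), I(Y;Q)) for an encoder q_given_x[x, q],
    using the Markov chain Q - X - Y."""
    px = pxy.sum(axis=1)
    pxq = px[:, None] * q_given_x          # joint p(x, q)
    pyq = pxy.T @ q_given_x                # joint p(y, q)
    return mutual_information(pxq), mutual_information(pyq)

# A correlated pair (X, Y) and a noisy one-bit encoder of X
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
q_given_x = np.array([[0.9, 0.1],
                      [0.1, 0.9]])
i_xq, i_yq = ib_terms(pxy, q_given_x)
beta = 5.0
print("prediction:", i_yq, "compression:", i_xq,
      "Lagrangian value:", i_yq - i_xq / beta)
```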
Since the original proposal, many reformulations, generalizations, and variants of the IB have been developed [7]. Notable examples include the “conditional entropy bottleneck” (CEB) [13,21], the “multi-view IB” [22], the “distributed IB” [23], as well as a large family of objectives called the “multivariate IB” [24]. All of these approaches consider some tradeoff between two information-theoretic terms: one that quantifies the prediction of target information that should be maximized and one that quantifies the compression of unwanted information that should be minimized. We refer to an optimization that involves a tradeoff between information-theoretic prediction and compression terms as an IB-type problem.

2.2. Partial Information Decomposition

The PID considers how information about a target random variable Y is distributed across a set of source random variables X 1 , , X n . One of the main goals of the PID is to quantify redundancy, the amount of shared information that is found in each of the individual sources. The notion of redundancy in PID was inspired by an analogy between sets and information that has re-appeared in various forms throughout the history of information theory [25,26,27,28,29,30,31]. Specifically, if the amount of information provided by each source is conceptualized as the size of a set, then the redundancy is conceptualized as the size of the intersection of those sets [1,4,5]. Until recently, however, this analogy was treated mostly as an informal source of intuition, rather than a formal methodology.
In a recent paper [4], we demonstrated that the terms of PID can be defined by formalizing this analogy to set theory. Recall that, in set theory, the intersection of sets $A_1, \ldots, A_n$ is defined as the largest set B that is contained in each set $A_s$ for $s \in \{1, \ldots, n\}$. Thus, the size of the intersection of finite sets $A_1, \ldots, A_n$ is
$$\Big| \bigcap_{s=1}^{n} A_s \Big| = \max_{B} |B| \quad \text{where} \quad B \subseteq A_s \;\; \forall s \in \{1, \ldots, n\}.$$
We showed that PID redundancy can be defined in a similar way: the redundancy between sources $X_1, \ldots, X_n$ is the maximum mutual information in any random variable Q that is less informative about the target Y than each individual source [4]:
$$I_\cap^{\sqsubseteq} := \max_{Q} I(Q;Y) \quad \text{where} \quad Q \sqsubseteq X_s \;\; \forall s \in \{1, \ldots, n\}.$$
The notation $Q \sqsubseteq X_s$ indicates that Q is “less informative” about the target than $X_s$, given some pre-specified ordering relation ⊑. The choice of the ordering relation completely determines the resulting redundancy measure $I_\cap^{\sqsubseteq}$. We discuss possible choices in the following subsection.
We used a similar approach to define “union information”, which in turn leads to a principled measure of synergy [4]. Note that union information and redundancy are related algebraically but not numerically; in particular, unlike in set theory, the principle of inclusion–exclusion does not always hold.
As mentioned above, here, we focus entirely on redundancy and leave the exploration of connections between IB and union information/synergy for future work.

2.3. Blackwell Redundancy

Our definition of PID redundancy (3) depends on the definition of the “less informative” relation ⊑. Although there are many relations that can be considered [25,32,33,34,35], arguably the most natural choice is the Blackwell order.
The Blackwell order is a preorder relation over “channels”, that is conditional distributions with full support. A channel κ B | Y is said to be less informative than κ C | Y in the sense of the Blackwell order if there exists some other channel κ B | C such that
$$\kappa_{B|Y} = \kappa_{B|C} \circ \kappa_{C|Y}.$$
Throughout, we use the notation ∘ to indicate the composition of channels, as defined via matrix multiplication. For instance, $\kappa_{B|Y} = \kappa_{B|C} \circ \kappa_{C|Y}$ is equivalent to the statement $\kappa_{B|Y}(b|y) = \sum_{c} \kappa_{B|C}(b|c)\, \kappa_{C|Y}(c|y)$ for all b and y. Equation (4) implies that $\kappa_{B|Y}$ is less informative than $\kappa_{C|Y}$ if $\kappa_{B|Y}$ can be produced by downstream stochastic processing of the output of channel $\kappa_{C|Y}$. We use the notation
$$\kappa_{B|Y} \sqsubseteq \kappa_{C|Y},$$
to indicate that κ B | Y is less Blackwell-informative than κ C | Y . The Blackwell order can also be defined over random variables rather than channels. Given a target random variable Y with full support, random variable B is said to be less Blackwell-informative than random variable C, written as
$$B \sqsubseteq_Y C,$$
when their corresponding conditional distributions obey the Blackwell relation, $p_{B|Y} \sqsubseteq p_{C|Y}$ [36]. It is not hard to verify that any random variable B that is independent of Y is lowest under the Blackwell order, obeying $B \sqsubseteq_Y C$ for all C.
The Blackwell order plays a fundamental role in statistics, and it has an important operational characterization in decision theory [36,37,38]. Specifically, $p_{B|Y} \sqsubseteq p_{C|Y}$ if and only if access to channel $p_{C|Y}$ is better for every decision problem than access to channel $p_{B|Y}$. See Refs. [4,39] for details of this operational characterization and Refs. [4,36,39,40,41,42] for more discussion of the relation between the Blackwell order and the PID.
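For small discrete systems, whether the relation $p_{B|Y} \sqsubseteq p_{C|Y}$ holds can be checked numerically by testing the feasibility of the linear system in Equation (4). The sketch below is our own illustration using scipy (function and variable names are ours); it searches for a channel $\kappa_{B|C}$ that reproduces $p_{B|Y}$ from $p_{C|Y}$:

```python
import numpy as np
from scipy.optimize import linprog

def is_blackwell_less_informative(p_b_given_y, p_c_given_y):
    """Test whether p_{B|Y} = kappa_{B|C} o p_{C|Y} for some channel
    kappa_{B|C}, i.e., whether B is Blackwell-less-informative than C."""
    nb, ny = p_b_given_y.shape
    nc, _ = p_c_given_y.shape
    # Unknowns: kappa[b, c], flattened as x[b * nc + c].
    # Constraints: sum_c kappa[b, c] p_c_given_y[c, y] = p_b_given_y[b, y]
    #              sum_b kappa[b, c] = 1   for every c
    A_eq = np.zeros((nb * ny + nc, nb * nc))
    b_eq = np.zeros(nb * ny + nc)
    for b in range(nb):
        for y in range(ny):
            A_eq[b * ny + y, b * nc:(b + 1) * nc] = p_c_given_y[:, y]
            b_eq[b * ny + y] = p_b_given_y[b, y]
    for col in range(nc):
        A_eq[nb * ny + col, col::nc] = 1.0
        b_eq[nb * ny + col] = 1.0
    res = linprog(c=np.zeros(nb * nc), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (nb * nc), method="highs")
    return res.status == 0   # feasible <=> the Blackwell relation holds

# B is pure noise about Y, C is a copy of Y, so B should be below C.
p_b_given_y = np.full((2, 2), 0.5)
p_c_given_y = np.eye(2)
print(is_blackwell_less_informative(p_b_given_y, p_c_given_y))   # True
```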
Combining the Blackwell order (6) with Equation (3) gives rise to Blackwell redundancy [4]. Blackwell redundancy, indicated here as $I_\cap$, is the maximal mutual information in any random variable that is less Blackwell-informative than each of the sources:
$$I_\cap := \max_{Q} I(Q;Y) \quad \text{where} \quad Q \sqsubseteq_Y X_s \;\; \forall s.$$
The optimization is always well defined because the feasible set is not empty, given that any random variable Q that is independent of Y satisfies the constraints. (Note also that, for continuous-valued or countably infinite sources, max may need to be replaced by a sup; see also Appendix A.)
$I_\cap$ has many attractive features as a measure of PID redundancy, and it overcomes several problems with previous approaches [4]. For instance, it can be defined for any number of sources, it uniquely satisfies a natural set of PID axioms, and it has fundamental statistical and operational interpretations. Statistically, it is the maximum information transmitted across any channel that can be produced by downstream processing of any one of the sources. Operationally, it is the maximum information that any random variable can have about Y without being able to perform better on any decision problem than any one of the sources.
As we showed [4], the optimization problem (7) can be formulated as the maximization of a convex function subject to a set of linear constraints. For a finite-dimensional system, the feasible set is a finite-dimensional polytope, and the maximum will lie on one of its extreme points; therefore, the optimization can be solved exactly by enumerating the vertices of the feasible set and choosing the best one [4]. However, this approach is limited to small systems, because the number of vertices of the feasible set can grow exponentially.
Finally, it may be argued that Blackwell redundancy is actually a measure of redundancy in the channels $p_{X_1|Y}, \ldots, p_{X_n|Y}$, rather than in the random variables $X_1, \ldots, X_n$. This is because the joint distribution over $(Y, X_1, \ldots, X_n)$ is never explicitly invoked in the definition of $I_\cap$; in fact, any joint distribution is permitted as long as it is compatible with the correct marginals. (The same property holds for several other redundancy measures ([4], Table 1), and Ref. [39] even suggested this property as a requirement for any valid measure of PID redundancy.) In some cases, the joint distribution may not even exist, for instance when different sources represent mutually exclusive conditions. To use a neuroscience example, imagine that $p_{X_1|Y}$ and $p_{X_2|Y}$ represent the activity of some brain region X in response to stimulus Y, measured either in younger ($p_{X_1|Y}$) or older ($p_{X_2|Y}$) subjects. Even though there is no joint distribution over $(Y, X_1, X_2)$ in this case, redundancy is still meaningful as the information about the stimulus that can be extracted from the brain activity of either age group. In the rest of this paper, we generally work within the channel-based interpretation of Blackwell redundancy.

3. Redundancy Bottleneck

In this section, we introduce the redundancy bottleneck (RB) and illustrate it with examples. Generally, we assume that we are provided with the marginal distribution $p_Y$ of the target random variable Y, as well as n source channels $p_{X_1|Y}, \ldots, p_{X_n|Y}$. Without loss of generality, we assume that $p_Y$ has full support. We use calligraphic letters (like $\mathcal{Y}$ and $\mathcal{X}_s$) to indicate the sets of outcomes of random variables (like Y and $X_s$). For simplicity, we use notation appropriate for discrete-valued variables, such as in Equation (4), though most of our results also apply to continuous-valued variables.

3.1. Reformulation of Blackwell Redundancy

We first reformulate Blackwell redundancy (7) in terms of a different optimization problem. Our reformulation will make use of the random variable Y, along with two additional random variables, S and Z. The outcomes of S are the indexes of the different sources, $\mathcal{S} = \{1, \ldots, n\}$. The set of outcomes of Z is the union of the outcomes of the individual sources, $\mathcal{Z} = \bigcup_{s=1}^{n} \mathcal{X}_s$. For example, if there are two sources with outcomes $\mathcal{X}_1 = \{0,1\}$ and $\mathcal{X}_2 = \{0,1,2\}$, then $\mathcal{S} = \{1,2\}$ and $\mathcal{Z} = \{0,1\} \cup \{0,1,2\} = \{0,1,2\}$. The joint probability distribution over $(Y, S, Z)$ is defined as
$$p_{YSZ}(y,s,z) = \begin{cases} p_Y(y)\, \nu_S(s)\, p_{X_s|Y}(z|y) & \text{if } z \in \mathcal{X}_s \\ 0 & \text{otherwise.} \end{cases}$$
In other words, y is drawn from the marginal $p_Y$, the source s is then drawn independently from the distribution $\nu_S$, and finally z is drawn from the channel $p_{X_s|Y}(z|y)$ corresponding to source s. In this way, the channels corresponding to the n sources ($p_{X_1|Y}, \ldots, p_{X_n|Y}$) are combined into a single conditional distribution $p_{Z|SY}$.
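As an illustration, the joint distribution of Equation (8) can be assembled directly from the target marginal and the source channels. The sketch below is our own (the function name and array conventions are not taken from the paper's repository); it embeds each source's outcomes into the shared outcome set $\mathcal{Z}$:

```python
import numpy as np

def build_p_ysz(p_y, channels, outcome_sets, nu_s=None):
    """Assemble the joint array p[y, s, z] of Equation (8).

    p_y          : target marginal, shape (ny,)
    channels     : list of source channels p_{X_s|Y}, each of shape (|X_s|, ny)
    outcome_sets : list of outcome labels for each source
    nu_s         : source weights nu_S (uniform if None)
    """
    n, ny = len(channels), len(p_y)
    if nu_s is None:
        nu_s = np.full(n, 1.0 / n)
    z_labels = sorted(set().union(*[set(o) for o in outcome_sets]))
    z_index = {z: i for i, z in enumerate(z_labels)}
    p = np.zeros((ny, n, len(z_labels)))
    for s, (chan, outs) in enumerate(zip(channels, outcome_sets)):
        for i, z in enumerate(outs):
            # p(y, s, z) = p_Y(y) * nu_S(s) * p_{X_s|Y}(z|y)
            p[:, s, z_index[z]] = p_y * nu_s[s] * chan[i, :]
    return p, z_labels

# Two sources with outcomes {0,1} and {0,1,2}: X_1 copies Y, X_2 is noise
p_y = np.array([0.5, 0.5])
chan1 = np.eye(2)                       # p_{X_1|Y}(x|y) = delta_{x,y}
chan2 = np.full((3, 2), 1.0 / 3)        # X_2 independent of Y
p_ysz, z_labels = build_p_ysz(p_y, [chan1, chan2], [[0, 1], [0, 1, 2]])
print(p_ysz.shape, p_ysz.sum())         # (2, 2, 3) 1.0
```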
We treat the distribution ν S as an arbitrary fixed parameter, and except where otherwise noted, we make no assumptions about this distribution except that it has full support. As we will see, different choices of ν S cause the different sources to be weighed differently in the computation of the RB. We return to the question of how to determine this distribution below.
Note that, under the distribution defined in Equation (8), Y and S are independent, so
I ( Y ; S ) = 0 .
Actually, many of our results can be generalized to the case where there are correlations between S and Y. We leave exploration of this generalization for future work.
In addition to Y, Z, and S, we introduce another random variable Q. This random variable obeys the Markov condition $Q - (Z,S) - Y$, which ensures that Q does not contain any information about Y that is not contained in the joint outcome of Z and S. The full joint distribution over $(Y, S, Z, Q)$ is
p Y S Z Q ( y , s , z , q ) = p Y S Z ( y , s , z ) p Q | S Z ( q | s , z ) .
We sometimes refer to Q as the “bottleneck” random variable.
The set of joint outcomes of $(S, Z)$ with non-zero probability is the disjoint union of the outcomes of the individual sources. For instance, in the example above with $\mathcal{X}_1 = \{0,1\}$ and $\mathcal{X}_2 = \{0,1,2\}$, the set of joint outcomes of $(S,Z)$ with non-zero probability is $\{(1,0), (1,1), (2,0), (2,1), (2,2)\}$. Because Q depends jointly on S and Z, our results do not depend on the precise labeling of the source outcomes, e.g., they are the same if $\mathcal{X}_2 = \{0,1,2\}$ is relabeled as $\mathcal{X}_2 = \{2,3,4\}$.
Our first result shows that Blackwell redundancy can be equivalently expressed as a constrained optimization problem. Here, the optimization is over bottleneck random variables Q, i.e., over conditional distributions p Q | S Z in Equation (10).
Theorem 1.
Blackwell redundancy (7) can be expressed as
$$I_\cap = \max_{Q \,:\, Q-(Z,S)-Y} I(Q;Y|S) \quad \text{where} \quad I(Q;S|Y) = 0.$$
Importantly, Theorem 1 does not depend on the choice of the distribution ν S , as long as it has full support.
In Theorem 1, the Blackwell order constraint in Equation (7) has been replaced by an information-theoretic constraint I ( Q ; S | Y ) = 0 , which states that Q does not provide any information about the identity of source S, additional to that already provided by the target Y. The objective I ( Q ; Y ) has been replaced by the conditional mutual information I ( Q ; Y | S ) . Actually, the objective can be equivalently written in either form, since I ( Q ; Y | S ) = I ( Q ; Y ) given our assumptions (see the proof of Theorem 1 in the Appendix A). However, the conditional mutual information form will be useful for further generalization and decomposition, as discussed in the next sections.

3.2. Redundancy Bottleneck

To relate Blackwell redundancy to the IB, we relax the constraint in Theorem 1 by allowing the leakage of R bits of conditional information about the source S. This defines the redundancy bottleneck (RB) at compression rate R:
$$I_{\mathrm{RB}}(R) := \max_{Q \,:\, Q-(Z,S)-Y} I(Q;Y|S) \quad \text{where} \quad I(Q;S|Y) \le R.$$
We note that, for R > 0 , the value of I RB ( R ) does depend on the choice of the source distribution ν S .
Equation (12) is an IB-type problem that involves a tradeoff between prediction I ( Q ; Y | S ) and compression I ( Q ; S | Y ) . The prediction term I ( Q ; Y | S ) quantifies the generalized Blackwell redundancy encoded in the bottleneck variable Q. The compression term I ( Q ; S | Y ) quantifies the amount of conditional information that the bottleneck variable leaks about the identity of the source. The set of optimal values of ( I ( Q ; S | Y ) , I ( Q ; Y | S ) ) defines the redundancy bottleneck curve (RB curve) that encodes the overall tradeoff between prediction and compression.
We prove a few useful facts about the RB, starting from monotonicity and concavity.
Theorem 2.
I RB ( R ) is non-decreasing and concave as a function of R.
Since I RB ( R ) is non-decreasing in R, the lowest RB value is achieved in the R = 0 regime, when it equals the Blackwell redundancy (Theorem 1):
$$I_{\mathrm{RB}}(R) \ge I_{\mathrm{RB}}(0) = I_\cap.$$
The largest value is achieved as $R \to \infty$, when the compression constraint vanishes. It can be shown that $I(Q;Y|S) \le I(Z;Y|S) = I(Y;Z,S)$ using the Markov condition $Q - (Z,S) - Y$ and the data-processing inequality (see the next subsection). This upper bound is achieved by the bottleneck variable Q = Z. Combining implies
$$I_{\mathrm{RB}}(R) \le I(Z;Y|S) = \sum_{s} \nu_S(s)\, I(X_s;Y),$$
where we used the form of the distribution $p_{YSZ}$ in Equation (8) to arrive at the last expression. The range of necessary compression rates can be restricted as $0 \le R \le I(Z;S|Y)$.
Next, we show that, for finite-dimensional sources, it suffices to consider finite-dimensional Q. Thus, for finite-dimensional sources, the RB problem (12) involves the maximization of a continuous objective over a compact domain, so the maximum is always achieved by some Q. (Conversely, in the more general case of infinite-dimensional sources, it may be necessary to replace max with sup in Equation (12); see Appendix A.)
Theorem 3.
For the optimization problem (12), it suffices to consider Q of cardinality $|\mathcal{Q}| \le \sum_s |\mathcal{X}_s| + 1$.
Interestingly, the cardinality bound for the RB is the same as for the IB if we take X = (Z,S) in Equation (1) [16,20]. It is larger than the cardinality required for Blackwell redundancy (7), where $|\mathcal{Q}| \le \big(\sum_s |\mathcal{X}_s|\big) - n + 1$ suffices [4].
The Lagrangian relaxation of the constrained RB problem (12) is given by
$$F_{\mathrm{RB}}(\beta) = \max_{Q \,:\, Q-(Z,S)-Y} \; I(Q;Y|S) - \frac{1}{\beta} I(Q;S|Y).$$
The parameter β controls the tradeoff between prediction and compression. The $\beta \to 0$ limit corresponds to the R = 0 regime, in which case, Blackwell redundancy is recovered, while the $\beta \to \infty$ limit corresponds to the $R = \infty$ regime, when the compression constraint is removed. The RB Lagrangian (15) is often simpler to optimize than the constrained optimization (12). Moreover, when the RB curve $I_{\mathrm{RB}}(R)$ is strictly concave, there is a one-to-one relationship between the solutions to the two optimization problems (12) and (15). However, when the RB curve is not strictly concave, there is no one-to-one relationship and the usual Lagrangian formulation is insufficient. This can be addressed by optimizing a modified objective that combines prediction and compression in a nonlinear fashion, such as the “exponential Lagrangian” [19]:
$$F_{\mathrm{RB}}^{\exp}(\beta) = \max_{Q \,:\, Q-(Z,S)-Y} \; I(Q;Y|S) - \frac{1}{\beta} e^{I(Q;S|Y)}.$$
(See an analogous analysis for IB in Refs. [18,19].)

3.3. Contributions from Different Sources

Both the RB prediction and compression terms can be decomposed into contributions from different sources, leading to an individual RB curve for each source. As we show in the examples below, this decomposition can be used to identify groups of redundant sources without having to perform intractable combinatorial optimization.
Let Q be an optimal bottleneck variable at rate R, so that $I_{\mathrm{RB}}(R) = I(Q;Y|S)$ and $I(Q;S|Y) \le R$. Then, the RB prediction term can be expressed as the weighted average of the prediction contributions from individual sources:
$$I_{\mathrm{RB}}(R) = I(Q;Y|S) = \sum_{s} \nu_S(s)\, I(Q;Y|S=s).$$
Here, we introduce the specific conditional mutual information:
$$I(Q;Y|S=s) := D\big( p_{Q|Y,S=s} \,\|\, p_{Q|S=s} \big),$$
where $D(\cdot \| \cdot)$ is the Kullback–Leibler (KL) divergence. To build intuitions about this decomposition, we may use the Markov condition $Q - (Z,S) - Y$ to express the conditional distributions in Equation (18) as compositions of channels:
$$p_{Q|Y,S=s} = p_{Q|Z,S=s} \circ p_{Z|Y,S=s}, \qquad p_{Q|S=s} = p_{Q|Z,S=s} \circ p_{Z|S=s}.$$
Using the data-processing inequality for the KL divergence and Equation (8), we can then write
$$I(Q;Y|S=s) \le D\big( p_{Z|Y,S=s} \,\|\, p_{Z|S=s} \big) = D\big( p_{X_s|Y} \,\|\, p_{X_s} \big).$$
The last term is simply the mutual information I ( Y ; X s ) between the target and source s. Thus, the prediction contribution from source s is bounded between 0 and the mutual information provided by that source:
$$0 \le I(Q;Y|S=s) \le I(Y;X_s).$$
The difference between the mutual information and the actual prediction contribution:
$$I(Y;X_s) - I(Q;Y|S=s) \ge 0,$$
quantifies the unique information in source s. The upper bound in Equation (19) is achieved in the $R \to \infty$ limit by Q = Z, leading to Equation (14). Conversely, for R = 0, $p_{Q|Y,S=s} = p_{Q|Y}$ (from $I(Q;S|Y) = 0$) and $p_{Q|S=s} = p_Q$ (from Equation (9)), so
$$I(Q;Y|S=s) = I(Q;Y) = I_{\mathrm{RB}}(0) = I_\cap.$$
Thus, when R = 0 , the prediction contribution from each source is the same, and it is equal to the Blackwell redundancy.
The RB compression cost can also be decomposed into contributions from individual sources:
$$I(Q;S|Y) = \sum_{s} \nu_S(s)\, I(Q;S=s|Y).$$
Here, we introduce the specific conditional mutual information:
$$I(Q;S=s|Y) := D\big( p_{Q|Y,S=s} \,\|\, p_{Q|Y} \big).$$
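Given a joint distribution over $(Q, Y, S)$, for instance one obtained from an optimized bottleneck variable, the per-source prediction and compression terms of Equations (18) and (21) can be computed directly. The sketch below is our own illustration (the array layout and function names are assumptions), treating each specific conditional mutual information as an averaged KL divergence:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

def per_source_terms(p_qys):
    """Per-source prediction I(Q;Y|S=s) and compression I(Q;S=s|Y)
    for a joint array p_qys[q, y, s]."""
    p_s = p_qys.sum(axis=(0, 1))                        # p(s) = nu_S(s)
    p_qy = p_qys.sum(axis=2)                            # p(q, y)
    p_q_given_y = p_qy / p_qy.sum(axis=0, keepdims=True)
    prediction, compression = [], []
    for s in range(p_qys.shape[2]):
        p_qy_s = p_qys[:, :, s] / p_s[s]                # p(q, y | S=s)
        p_y_s = p_qy_s.sum(axis=0)                      # p(y | S=s)
        p_q_s = p_qy_s.sum(axis=1)                      # p(q | S=s)
        # Equation (18): average over y of D( p(q|y,s) || p(q|s) )
        pred = sum(p_y_s[y] * kl(p_qy_s[:, y] / p_y_s[y], p_q_s)
                   for y in range(p_qy_s.shape[1]) if p_y_s[y] > 0)
        # Equation (21): average over y of D( p(q|y,s) || p(q|y) )
        comp = sum(p_y_s[y] * kl(p_qy_s[:, y] / p_y_s[y], p_q_given_y[:, y])
                   for y in range(p_qy_s.shape[1]) if p_y_s[y] > 0)
        prediction.append(pred)
        compression.append(comp)
    # Weighting these by p(s) recovers the decompositions (17) and (20)
    return np.array(prediction), np.array(compression)
```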
The source compression terms can be related to so-called deficiency, a quantitative generalization of the Blackwell order. Although various versions of deficiency can be defined [43,44,45], here we consider the “weighted deficiency” induced by the KL divergence. For any two channels p B | Y and p C | Y , it is defined as
$$\delta_D(p_{C|Y}, p_{B|Y}) := \min_{\kappa_{B|C}} D\big( \kappa_{B|C} \circ p_{C|Y} \,\|\, p_{B|Y} \big).$$
This measure quantifies the degree to which two channels violate the Blackwell order, vanishing when $\kappa_{B|Y} \sqsubseteq \kappa_{C|Y}$. To relate the source compression terms (20) to deficiency, observe that $p_{Q|Y,S=s} = p_{Q|Z,S=s} \circ p_{Z|Y,S=s}$ and that $p_{Z|Y,S=s} = p_{X_s|Y}$. Given Equation (21), we then have
$$I(Q;S=s|Y) \ge \delta_D\big( p_{X_s|Y}, p_{Q|Y} \big).$$
Thus, each source compression term is lower bounded by the deficiency between the source channel p X s | Y and the bottleneck channel p Q | Y . Furthermore, the compression constraint in the RB optimization problem (12) sets an upper bound on the deficiency of p Q | Y averaged across all sources.
Interestingly, several recent papers have studied the relationship between deficiency and PID redundancy in the restricted case of two sources [38,41,45,46,47]. To our knowledge, we provide the first link between deficiency and redundancy for the general case of multiple sources. Note also that previous work considered a slightly different definition of deficiency where the arguments of the KL divergence are reversed. Our definition of deficiency is arguably more natural, since it is more natural to minimize the KL divergence over a convex set with respect to the first argument [48].
Finally, observe that, in both decompositions (17) and (20), the source contributions are weighted by the distribution ν S ( s ) . Thus, the distribution ν S determines how different sources play into the tradeoff between prediction and compression. In many cases, ν S can be chosen as the uniform distribution. However, other choices of ν S may be more natural in other situations. For example, in a neuroscience context where different sources correspond to different brain regions, ν S ( s ) could represent the proportion of metabolic cost or neural volume assigned to region s. Alternatively, when different sources represent mutually exclusive conditions, as in the age group example mentioned at the end of Section 2, ν S ( s ) might represent the frequency of condition s found in the data. Finally, it may be possible to set ν S in an “adversarial” manner so as to maximize the resulting value of I RB ( R ) in Equation (12). We leave the exploration of this adversarial approach for future work.

3.4. Examples

We illustrate our approach using a few examples. For simplicity, in all examples, we use a uniform distribution over the sources, ν S ( s ) = 1 / n . The numerical results are calculated using the iterative algorithm described in the next section.
Example 1.
We begin by considering a very simple system, called the “UNIQUE gate” in the PID literature. Here, the target Y is binary and uniformly distributed, $p_Y(y) = 1/2$ for $y \in \{0,1\}$. There are two binary-valued sources, $X_1$ and $X_2$, where the first source is a copy of the target, $p_{X_1|Y}(x_1|y) = \delta_{x_1, y}$, while the second source is an independent and uniformly distributed bit, $p_{X_2|Y}(x_2|y) = 1/2$. Thus, source $X_1$ provides 1 bit of information about the target, while $X_2$ provides none. The Blackwell redundancy is $I_\cap = 0$ [4], because it is impossible to extract any information from the sources without revealing that this information came from $X_1$.
We performed RB analysis by optimizing the RB Lagrangian $F_{\mathrm{RB}}(\beta)$ (15) at different β. Figure 1a,b show the prediction $I(Q;Y|S)$ and compression $I(Q;S|Y)$ values for the optimal bottleneck variables Q. At small β, the prediction converges to the Blackwell redundancy, $I(Q;Y|S) = I_\cap = 0$, and there is complete loss of information about source identity, $I(Q;S|Y) = 0$. At larger β, the prediction approaches the maximum $I(Q;Y|S) = 0.5 \times I(X_1;Y) = 0.5$ bit, and compression approaches $I(Q;S|Y) = I(Z;S|Y) \approx 0.311$ bit. Figure 1c shows the RB curve, illustrating the overall tradeoff between prediction and compression.
In the shaded regions of Figure 1a,b, we show the additive contributions to the prediction and compression terms from the individual sources, ν S ( s ) I ( Q ; Y | S = s ) from Equation (17) and ν S ( s ) I ( Q ; S = s | Y ) from Equation (20), respectively. We also show the resulting RB curves for individual sources in Figure 1d. As expected, only source X 1 contributes to the prediction at any level of compression.
To summarize, if some information about the identity of the source can be leaked (non-zero compression cost), then improved prediction of the target is possible. At the maximum needed compression cost of 0.311 , it is possible to extract 1 bit of predictive information from X 1 and 0 bits from X 2 , leading to an average of 0.5 bits of prediction.
Example 2.
We now consider the “AND gate”, another well-known system from the PID literature. There are two independent and uniformly distributed binary sources, X 1 and X 2 . The target Y is also binary-valued and determined via Y = X 1 AND X 2 . Then, p Y ( 0 ) = 3 / 4 and p Y ( 1 ) = 1 / 4 , and both sources have the same channel:
$$p_{X_s|Y}(x|y) = \begin{cases} 2/3 & \text{if } y=0,\, x=0 \\ 1/3 & \text{if } y=0,\, x=1 \\ 0 & \text{if } y=1,\, x=0 \\ 1 & \text{if } y=1,\, x=1 \end{cases}$$
Because the two source channels are the same, the Blackwell redundancy obeys $I_\cap = I(Y;X_1) = I(Y;X_2) \approx 0.311$ bits [4]. From Equations (13) and (14), we see that $I_{\mathrm{RB}}(R) = I_\cap$ across all compression rates. In this system, all information provided by the sources is redundant, so there is no strict tradeoff between prediction and compression. The RB curve (not shown) consists of a single point, $(I(Q;Y|S), I(Q;S|Y)) = (0.311, 0)$.
Example 3.
We now consider a more sophisticated example with four sources. The target is binary-valued and uniformly distributed, p Y ( y ) = 1 / 2 for y { 0 , 1 } . There are four binary-valued sources, where the conditional distribution of each source s { 1 , 2 , 3 , 4 } is a binary symmetric channel with error probability ϵ s :
$$p_{X_s|Y}(x|y) = \begin{cases} 1 - \epsilon_s & \text{if } y = x \\ \epsilon_s & \text{if } y \ne x \end{cases}$$
We take $\epsilon_1 = \epsilon_2 = 0.1$, $\epsilon_3 = 0.2$, and $\epsilon_4 = 0.5$. Thus, sources $X_1$ and $X_2$ provide the most information about the target; $X_3$ provides less information; and $X_4$ is completely independent of the target.
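A sketch of this setup in code (our own construction, with channels stored as arrays indexed by source outcome and target outcome) could look as follows; the mutual information helper is included only to reproduce the per-source values discussed in this example:

```python
import numpy as np

def binary_symmetric_channel(eps):
    """p_{X_s|Y} as a 2x2 array with rows indexed by x and columns by y."""
    return np.array([[1 - eps, eps],
                     [eps, 1 - eps]])

p_y = np.array([0.5, 0.5])
epsilons = [0.1, 0.1, 0.2, 0.5]
channels = [binary_symmetric_channel(e) for e in epsilons]

def mi_from_channel(chan, p_y):
    """Mutual information I(X_s;Y) in bits for a channel p(x|y)."""
    pxy = chan * p_y[None, :]          # joint p(x, y)
    px = pxy.sum(axis=1, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2((pxy / (px * p_y[None, :]))[nz])))

print([round(mi_from_channel(c, p_y), 3) for c in channels])
# approximately [0.531, 0.531, 0.278, 0.0]
```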
We performed our RB analysis and plot the RB prediction values in Figure 2a and the compression values in Figure 2b, as found by optimizing the RB Lagrangian at different β. At small β, the prediction converges to the Blackwell redundancy, $I(Q;Y|S) = I_\cap = 0$, and there is complete loss of information about source identity, $I(Q;S|Y) = 0$. At large β, the prediction is equal to the maximum $I(Z;Y|S) \approx 0.335$ bit, and compression is equal to $I(Q;S|Y) \approx 0.104$ bit. Figure 2c shows the RB curve.
In Figure 2a,b, we show the additive contributions to the prediction and compression terms from the individual sources, ν S ( s ) I ( Q ; Y | S = s ) and ν S ( s ) I ( Q ; S = s | Y ) , respectively, as shaded regions. We also show the resulting RB curves for individual sources in Figure 2d.
As expected, source $X_4$ does not contribute to the prediction at any level of compression, in accord with the fact that $I(Q;Y|S=4) \le I(X_4;Y) = 0$. Sources $X_1$ and $X_2$ provide the same amount of prediction and compression at all points, up to the maximum $I(X_1;Y) = I(X_2;Y) \approx 0.531$. Source $X_3$ provides the same amount of prediction and compression as sources $X_1$ and $X_2$, until it hits its maximum prediction $I(X_3;Y) \approx 0.278$. As shown in Figure 2d, at this point, $X_3$ splits off from sources $X_1$ and $X_2$ and its compression contribution decreases to 0; this is compensated by increasing the compression cost of sources $X_1$ and $X_2$. The same behavior can also be seen in Figure 2a,b, where we see that the solutions undergo phase transitions as different optimal strategies are uncovered at increasing β. Importantly, by considering the prediction/compression contributions from the individual sources, we can identify that sources $X_1$ and $X_2$ provide the most redundant information.
Let us comment on the somewhat surprising fact that, at larger β , the compression cost of X 3 decreases—even while its prediction contribution remains constant and the prediction contribution from X 1 and X 2 increases. At first glance, this appears counter-intuitive if one assumes that, in order to increase prediction from X 1 and X 2 , the bottleneck channel p Q | Y should approach p X 1 | Y = p X 2 | Y , thereby increasing the deficiency δ D ( p X 3 | Y , p Q | Y ) and the compression cost of X 3 via the bound (23). In fact, this is not the case, because the prediction is quantified via the conditional mutual information I ( Q ; Y | S ) , not the mutual information I ( Q ; Y ) . Thus, it is possible that the prediction contributions from X 1 and X 2 are large, even when the bottleneck channel p Q | Y does not closely resemble p X 1 | Y = p X 2 | Y .
More generally, this example shows that it is possible for the prediction contribution from a given source to stay the same, or even increase, while its compression cost decreases. In other words, as can be seen from Figure 2d, it is possible for the RB curves of the individual sources to be non-concave and non-monotonic. It is only the overall RB curve, Figure 2c, representing the optimal prediction–compression tradeoff on average, that must be concave and monotonic.
Example 4.
In our final example, the target consists of three binary spins with a uniform distribution, so Y = ( Y 1 , Y 2 , Y 3 ) and p Y ( y ) = 1 / 8 for all y. There are three sources, each of which contains two binary spins. Sources X 1 and X 2 are both equal to the first two spins of the target Y, X 1 = X 2 = ( Y 1 , Y 2 ) . Source X 3 is equal to the first and last spin of the target, X 3 = ( Y 1 , Y 3 ) .
Each source provides $I(Y;X_s) = 2$ bits of mutual information about the target. The Blackwell redundancy $I_\cap$ is 1 bit, reflecting the fact that there is a single binary spin ($Y_1$) that is included in all sources.
We performed our RB analysis and plot the RB prediction values in Figure 3a and the compression values in Figure 3b, as found by optimizing the RB Lagrangian at different β. At small β, the prediction converges to the Blackwell redundancy, $I(Q;Y|S) = I_\cap = 1$, and $I(Q;S|Y) = 0$. At large β, the prediction is equal to the maximum $I(Z;Y|S) = 2$ bits, and compression is equal to $I(Z;S|Y) \approx 0.459$. Figure 3c shows the RB curve. As in the previous example, the RB curve undergoes phase transitions as different optimal strategies are uncovered at different β.
In Figure 3a,b, we show the additive contributions to the prediction and compression terms from the individual sources, ν S ( s ) I ( Q ; Y | S = s ) and ν S ( s ) I ( Q ; S = s | Y ) , as shaded regions. We also show the resulting RB curves for individual sources in Figure 3d.
Observe that sources $X_1$ and $X_2$ provide more redundant information at a given level of compression. For instance, as shown in Figure 3d, at source compression $I(Q;S=s|Y) \approx 0.25$, $X_1$ and $X_2$ provide 2 bits of prediction, while $X_3$ provides only a single bit. This again shows how the RB source decomposition can be used for identifying sources with high levels of redundancy.

3.5. Continuity

It is known that the Blackwell redundancy $I_\cap$ can be discontinuous as a function of the probability distribution of the target and source channels [4]. In Ref. [4], we explain the origin of this discontinuity in geometric terms and provide sufficient conditions for Blackwell redundancy to be continuous. Nonetheless, the discontinuity of $I_\cap$ is sometimes seen as an undesired property.
On the other hand, as we show in this section, the value of RB is continuous in the probability distribution for all R > 0 .
Theorem 4.
For finite-dimensional systems and R > 0 , I RB ( R ) is a continuous function of the probability values p X s | Y ( x | y ) , p Y ( y ) , and ν S ( s ) .
Thus, by relaxing the compression constraint in Theorem 1, we “smooth out” the behavior of Blackwell redundancy and arrive at a continuous measure. We illustrate this using a simple example.
Example 5.
We consider the COPY gate, a standard example in the PID literature. Here, there are two binary-valued sources jointly distributed according to
$$p_{X_1 X_2}(x_1, x_2) = \begin{cases} 1/2 - \epsilon/4 & \text{if } x_1 = x_2 \\ \epsilon/4 & \text{if } x_1 \ne x_2 \end{cases}$$
The parameter ϵ controls the correlation between the two sources, with perfect correlation at ϵ = 0 and complete independence at ϵ = 1 . The target Y is a copy of the joint outcome of the two sources, Y = ( X 1 , X 2 ) .
It is known that Blackwell redundancy $I_\cap$ is discontinuous for this system, jumping from $I_\cap = 1$ at $\epsilon = 0$ to $I_\cap = 0$ for $\epsilon > 0$ [4]. On the other hand, the RB function $I_{\mathrm{RB}}(R)$ is continuous for R > 0. Figure 4 compares the behavior of Blackwell redundancy and RB as a function of ε, at R = 0.01 bits. In particular, it can be seen that $I_{\mathrm{RB}}(R) = 1$ at $\epsilon = 0$ and then decays continuously as ε increases.
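For reference, a sketch (our own) of the COPY-gate construction for a given ε, returning the target marginal and the two source channels in the channel-based form used throughout this paper:

```python
import numpy as np

def copy_gate(eps):
    """Return p_Y and the channels p_{X_1|Y}, p_{X_2|Y} for the COPY gate.
    Y = (X_1, X_2) is indexed as y = 2*x1 + x2."""
    p_x1x2 = np.array([[0.5 - eps / 4, eps / 4],
                       [eps / 4, 0.5 - eps / 4]])    # rows x1, cols x2
    p_y = p_x1x2.flatten()                           # p_Y over the 4 joint outcomes
    # X_1 and X_2 are deterministic functions of Y = (X_1, X_2):
    chan1 = np.array([[1, 1, 0, 0],
                      [0, 0, 1, 1]], dtype=float)    # p(x1 | y)
    chan2 = np.array([[1, 0, 1, 0],
                      [0, 1, 0, 1]], dtype=float)    # p(x2 | y)
    return p_y, [chan1, chan2]

p_y, channels = copy_gate(0.25)
print(p_y)   # [0.4375, 0.0625, 0.0625, 0.4375]
```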

4. Iterative Algorithm

We provide an iterative algorithm to solve the RB optimization problem. This algorithm is conceptually similar to the Blahut–Arimoto algorithm, originally employed for rate distortion problems and later adapted to solve the original IB problem [6]. A Python implementation of our algorithm is available at https://github.com/artemyk/pid-as-ib; there, we also provide updated code to exactly compute Blackwell redundancy (applicable to small systems).
To begin, we consider the RB Lagrangian optimization problem, Equation (15). We rewrite this optimization problem using the KL divergence:
$$F_{\mathrm{RB}}(\beta) = \max_{r_{Q|SZ}} \; D\big( r_{Y|QS} \,\|\, p_{Y|S} \big) - \frac{1}{\beta} D\big( r_{Q|SY} \,\|\, r_{Q|Y} \big).$$
Here, notation like r Q | Y , r Y | Q S , etc., refers to distributions that include Q and therefore depend on the optimization variable r Q | S Z , while notation like p Y | S refers to distributions that do not depend on Q and are not varied under the optimization. Every choice of conditional distribution r Q | S Z induces a joint distribution r Y S Z Q = p Y S Z r Q | S Z via Equation (10).
We can rewrite the first KL term in Equation (25) as
$$D\big( r_{Y|QS} \,\|\, p_{Y|S} \big) = D\big( r_{Y|QS} \,\|\, p_{Y|S} \big) - \min_{\omega_{YSZQ}} D\big( r_{Y|QS} \,\|\, \omega_{Y|QS} \big) = \max_{\omega_{YSZQ}} \mathbb{E}_{p_{YSZ}\, r_{Q|SZ}}\!\left[ \ln \frac{\omega(y|q,s)}{p(y|s)} \right],$$
where E indicates the expectation, and we introduced the variational distribution ω Y S Z Q . The maximum is achieved by ω Y S Z Q = r Y S Z Q , which gives ω Y | Q S = r Y | Q S . We rewrite the second KL term in Equation (25) as
$$D\big( r_{Q|SY} \,\|\, r_{Q|Y} \big) = D\big( r_{Q|SY}\, r_{Z|SYQ} \,\|\, r_{Q|Y}\, r_{Z|SYQ} \big) = \min_{\omega_{YSZQ}} D\big( r_{Q|SY}\, r_{Z|SYQ} \,\|\, \omega_{Q|Y}\, \omega_{Z|SYQ} \big).$$
Here, we introduce the variational distribution ω Y S Z Q , where the minimum is achieved by ω Y S Z Q = r Y S Z Q . The term r Q | S Y r Z | S Y Q can be rewritten as
$$r(q|s,y)\, r(z|s,y,q) = \frac{r(z,s,y,q)}{p(s,y)} = \frac{r(q|s,z)\, p(z,s,y)}{p(s,y)} = r(q|s,z)\, p(z|s,y),$$
where we used the Markov condition $Q - (S,Z) - Y$. In this way, we separate the contribution from the conditional distribution $r_{Q|SZ}$ being optimized.
Combining the above allows us to rewrite Equation (25) as
$$F_{\mathrm{RB}}(\beta) = \max_{r_{Q|SZ},\, \omega_{YSZQ}} \; \mathbb{E}_{p_{YSZ}\, r_{Q|SZ}}\!\left[ \ln \frac{\omega(y|q,s)}{p(y|s)} \right] - \frac{1}{\beta} D\big( r_{Q|SZ}\, p_{Z|SY} \,\|\, \omega_{Q|Y}\, \omega_{Z|SYQ} \big).$$
We now optimize this objective in an iterative and alternating manner with respect to r Q | S Z and ω Y S Z Q . Formally, let L ( r Q | S Z , ω Y S Z Q ) refer to the objective in Equation (26). Then, starting from some initial guess r Q | S Z ( 0 ) , we generate a sequence of solutions
$$\omega_{YSZQ}^{(t+1)} = \arg\max_{\omega_{YSZQ}} \mathcal{L}\big( r_{Q|SZ}^{(t)}, \omega_{YSZQ} \big)$$
$$r_{Q|SZ}^{(t+1)} = \arg\max_{r_{Q|SZ}} \mathcal{L}\big( r_{Q|SZ}, \omega_{YSZQ}^{(t+1)} \big)$$
Each optimization problem can be solved in closed form. As already mentioned, the optimizer in Equation (27) is
$$\omega_{YSZQ}^{(t+1)} = r_{YSZQ}^{(t)} = r_{Q|SZ}^{(t)}\, p_{SZY}.$$
The optimization (28) can be solved by taking derivatives, giving
$$r^{(t+1)}(q|s,z) \propto \exp\!\left( \sum_{y} p(y|s,z) \left[ \beta \ln \omega^{(t)}(y|q,s) - \ln \frac{p(z|s,y)}{\omega^{(t)}(q|y)\, \omega^{(t)}(z|s,y,q)} \right] \right),$$
where the proportionality constant in ∝ is fixed by the normalization $\sum_q r^{(t+1)}(q|s,z) = 1$.
Each iteration increases the value of objective L . Since the objective is upper bounded by I ( Z ; Y | S ) , the algorithm is guaranteed to converge. However, as in the case of the original IB problem, the objective is not jointly convex in both arguments, so the algorithm may converge to a local maximum or a saddle point, rather than a global maximum. This can be partially alleviated by running the algorithm several times starting from different initial guesses r Q | S Z ( 0 ) .
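A compact reference implementation of these updates is sketched below. It is our own illustration in numpy, not the released code at the repository above; the function names, the array layout (p_ysz[y, s, z] for the joint of Equation (8), r[s, z, q] for the encoder), and the small-constant clipping used to avoid log(0) are all our choices, and mutual information values are reported in bits:

```python
import numpy as np

def conditional_mi(p_abc):
    """I(A;B|C) in bits for a joint array p_abc[a, b, c]."""
    pc = p_abc.sum(axis=(0, 1))
    pac = p_abc.sum(axis=1)
    pbc = p_abc.sum(axis=0)
    nz = p_abc > 0
    num = p_abc * pc[None, None, :]
    den = pac[:, None, :] * pbc[None, :, :]
    return float(np.sum(p_abc[nz] * np.log2(num[nz] / den[nz])))

def rb_lagrangian(p_ysz, beta, nq, n_iter=1000, seed=0, r_init=None, tiny=1e-30):
    """Alternating maximization of the RB Lagrangian (15), following the
    iterative scheme of Section 4.  p_ysz[y, s, z] is the joint distribution
    of Equation (8); nq is the cardinality of Q.  Returns the encoder
    r[s, z, q] = r(q|s,z) and the (prediction, compression) values in bits."""
    ny, ns, nz = p_ysz.shape
    p_sz = p_ysz.sum(axis=0)                                      # p(s,z)
    p_ys = p_ysz.sum(axis=2)                                      # p(y,s)
    p_y_given_sz = p_ysz / np.maximum(p_sz, tiny)[None, :, :]     # p(y|s,z)
    ln_p_z_given_sy = np.log(np.maximum(
        p_ysz / np.maximum(p_ys, tiny)[:, :, None], tiny))        # ln p(z|s,y)

    if r_init is None:
        r = np.random.default_rng(seed).random((ns, nz, nq))
    else:
        r = r_init.copy()
    r /= r.sum(axis=2, keepdims=True)

    for _ in range(n_iter):
        # omega-update: omega^{(t+1)} is the full joint induced by r^{(t)}
        w = p_ysz[:, :, :, None] * r[None, :, :, :]               # w[y,s,z,q]
        w_ysq = w.sum(axis=2)                                     # omega(y,s,q)
        ln_w_y_qs = np.log(np.maximum(
            w_ysq / np.maximum(w_ysq.sum(axis=0, keepdims=True), tiny), tiny))
        w_yq = w.sum(axis=(1, 2))                                 # omega(y,q)
        ln_w_q_y = np.log(np.maximum(
            w_yq / np.maximum(w_yq.sum(axis=1, keepdims=True), tiny), tiny))
        ln_w_z_syq = np.log(np.maximum(
            w / np.maximum(w_ysq[:, :, None, :], tiny), tiny))
        # r-update: r(q|s,z) prop. to exp( sum_y p(y|s,z) [ beta ln w(y|q,s)
        #   - ln p(z|s,y) + ln w(q|y) + ln w(z|s,y,q) ] )
        term = (beta * ln_w_y_qs[:, :, None, :]
                - ln_p_z_given_sy[:, :, :, None]
                + ln_w_q_y[:, None, None, :]
                + ln_w_z_syq)
        expo = np.einsum('ysz,yszq->szq', p_y_given_sz, term)
        expo -= expo.max(axis=2, keepdims=True)                   # numerical stability
        r = np.exp(expo)
        r /= r.sum(axis=2, keepdims=True)

    w = p_ysz[:, :, :, None] * r[None, :, :, :]
    p_ysq = w.sum(axis=2)                                         # p(y,s,q)
    prediction = conditional_mi(p_ysq.transpose(2, 0, 1))         # I(Q;Y|S)
    compression = conditional_mi(p_ysq.transpose(2, 1, 0))        # I(Q;S|Y)
    return r, prediction, compression
```

As with the standard Blahut–Arimoto scheme, different random initializations (different seed values) can be used to mitigate convergence to poor local optima, as discussed above.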
When the RB is not strictly concave, it is more appropriate to optimize the exponential RB Lagrangian (16) or another objective that combines the prediction and compression terms in a nonlinear manner [18,19]. The algorithm described above can be used with such objectives after a slight modification. For instance, for the exponential RB Lagrangian, we modify (26) as
$$F_{\mathrm{RB}}^{\exp}(\beta) = \max_{r_{Q|SZ},\, \omega_{YSZQ}} \; \mathbb{E}_{p_{YSZ}\, r_{Q|SZ}}\!\left[ \ln \frac{\omega(y|q,s)}{p(y|s)} \right] - \frac{1}{\beta}\, e^{D\left( r_{Q|SZ}\, p_{Z|SY} \,\|\, \omega_{Q|Y}\, \omega_{Z|SYQ} \right)}.$$
A similar analysis as above leads to the following iterative optimization scheme:
$$\omega_{YSZQ}^{(t+1)} = r_{Q|SZ}^{(t)}\, p_{SZY}, \qquad r^{(t+1)}(q|s,z) \propto \exp\!\left( \sum_{y} p(y|s,z) \left[ \beta^{(t)} \ln \omega^{(t)}(y|q,s) - \ln \frac{p(z|s,y)}{\omega^{(t)}(q|y)\, \omega^{(t)}(z|s,y,q)} \right] \right),$$
where $\beta^{(t)} = \beta\, e^{-I_{r^{(t)}}(Q;S|Y)}$ is an effective inverse temperature. (Observe that, unlike the squared Lagrangian [18], the exponential Lagrangian leads to an effective inverse temperature $\beta^{(t)}$ that is always finite and converges to β as $I_{r^{(t)}}(Q;S|Y) \to 0$.)
When computing an entire RB curve, as in Figure 1a–c, we found good results by annealing, that is by re-using the optimal r Q | S Z found for one β as the initial guess at higher β . For quantifying the value of the RB function I RB ( R ) at a fixed R, as in Figure 4, we approximated I RB ( R ) via a linear interpolation of the RB prediction and compression values recovered from the RB Lagrangian at varying β .
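Continuing the sketch above (the rb_lagrangian function and build_p_ysz are our own illustrations, not the released implementation), the annealing strategy can be expressed as a warm-started sweep over β; the β grid and bottleneck cardinality below are arbitrary choices:

```python
import numpy as np

# Sweep beta from small (heavy compression) to large (weak compression),
# warm-starting each solve from the previous optimum.  The collected
# (compression, prediction) pairs trace out an approximate RB curve.
betas = np.geomspace(0.1, 100.0, 30)
curve, r = [], None
for beta in betas:
    r, prediction, compression = rb_lagrangian(p_ysz, beta, nq=8, r_init=r)
    curve.append((compression, prediction))
```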

5. Discussion

In this paper, we propose a generalization of Blackwell redundancy, termed the redundancy bottleneck (RB), formulated as an information-bottleneck-type tradeoff between prediction and compression. We studied some implications of this formulation and proposed an efficient numerical algorithm to solve the RB optimization problem.
We briefly mention some directions for future work.
The first direction concerns our iterative algorithm. The algorithm is only applicable to systems where it is possible to enumerate the outcomes of the joint distribution p Q Y S Z . This is impractical for discrete-valued variables with very many outcomes, as well as continuous-valued variables as commonly found in statistical and machine learning settings. In future work, it would be useful to develop RB algorithms suitable for such datasets, possibly by exploiting the kinds of variational techniques that have recently gained traction in machine learning applications of IB [11,12,13].
The second direction would explore connections between the RB and other information-theoretic objectives for representation learning. To our knowledge, the RB problem is novel to the literature. However, it has some similarities to existing objectives, including among others the conditional entropy bottleneck [13], multi-view IB [22], and the privacy funnel and its variants [49]. Showing formal connections between these objectives would be of theoretical and practical interest, and could lead to new interpretations of the concept of PID redundancy.
Another direction would explore the relationship between RB and information-theoretic measures of causality [50,51]. In particular, if the different sources represent some mutually exclusive conditions—such as the age group example provided at the end of Section 2—then redundancy could serve as a measure of causal information flow that is invariant to the conditioning variable.
Finally, one of the central ideas of this paper is to treat the identity of the source as a random variable in its own right, which allows us to consider what information different bottleneck variables reveal about the source. In this way, we convert the search for topological or combinatorial structure in multivariate systems into an interpretable and differential information-theoretic objective. This technique may be useful in other problems that consider how information is distributed among variables in complex systems, including other PID measures such as synergy [4], information-theoretic measures of modularity [52,53], and measures of higher-order dependency [54,55].

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 101068029.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No research data was used in this study.

Acknowledgments

I thank Nihat Ay, Daniel Polani, Fernando Rosas, and especially, André Gomes for useful feedback. I also thank the organizers of the “Decomposing Multivariate Information in Complex Systems” (DeMICS 23) workshop at the Max Planck Institute for the Physics of Complex Systems (Dresden, Germany), which inspired this work.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

We provide proofs of the theorems in the main text. Throughout, we use D for the Kullback–Leibler (KL) divergence, H for the Shannon entropy, and I for the mutual information.

Appendix A.1. Proof of Theorem 1

We begin by proving a slightly generalized version of Theorem 1, that is we show the equivalence between the two optimization problems:
$$I_\cap = \sup_{Q} I(Q;Y) \quad \text{where} \quad Q \sqsubseteq_Y X_s \;\; \forall s,$$
$$I_\cap = \sup_{Q \,:\, Q-(Z,S)-Y} I(Q;Y|S) \quad \text{where} \quad I(Q;S|Y) = 0.$$
The slight generalization comes from replacing max by sup, so that the result also holds for systems with infinite-dimensional sources, where the supremum is not guaranteed to be achieved. For finite-dimensional systems, the supremum is always achieved, and we reduce to the simpler case of Equations (7) and (11).
Proof. 
Let V 1 indicate the supremum in Equation (A1) and V 2 the supremum in Equation (A2), given some ν S ( s ) with full support. We prove that V 1 = V 2 .
We will use that, for any distribution that has the form of Equation (10) and obeys I ( Q ; S | Y ) = 0 , the following holds:
$$I(Q;Y|S) = H(Q|S) - H(Q|S,Y) = H(Q) - H(Q|Y) = I(Q;Y).$$
Here, we used the Markov condition $Q - Y - S$, as well as $I(Q;S) = H(Q) - H(Q|S) = 0$, as follows from Equation (9) and the data-processing inequality.
Let Q be a feasible random variable that comes within $\epsilon \ge 0$ of the objective in (A1), $I(Q;Y) \ge V_1 - \epsilon$. Define the joint distribution:
$$p_{QYSZ}(q,y,s,z) = \kappa_{Q|X_s}(q|z)\, p_Y(y)\, \nu_S(s)\, p_{X_s|Y}(z|y)$$
whenever $z \in \mathcal{X}_s$, otherwise $p_{QYSZ}(q,y,s,z) = 0$. Here, we used the channels $\kappa_{Q|X_s}$ associated with the Blackwell relation $Q \sqsubseteq_Y X_s$, so that $p_{Q|Y} = \kappa_{Q|X_s} \circ p_{X_s|Y}$. Under the distribution $p_{QYSZ}$, the Markov conditions $Q - (S,Z) - Y$ and $Q - Y - S$ hold, the latter since
p Q S | Y ( q , s | y ) = p Q | Y ( q | y ) ν S ( s ) .
Therefore, this distribution has the form of Equations (8) and (10) and satisfies the constraints in Equation (A2). Using Equation (A3), we then have
$$V_1 - \epsilon \le I(Q;Y) = I(Q;Y|S) \le V_2.$$
Conversely, let $p_{YSZQ}$ be a feasible joint distribution for the optimization of Equation (A2) that comes within $\epsilon \ge 0$ of the supremum, $I(Q;Y|S) \ge V_2 - \epsilon$. Using the form of this joint distribution from Equation (10), we can write
$$p_{Q|Y}(q|y) \stackrel{(a)}{=} p_{Q|YS}(q|y,s) = \sum_{z} p_{Q|YSZ}(q|y,s,z)\, p_{Z|YS}(z|y,s) \stackrel{(b)}{=} \sum_{z} p_{Q|SZ}(q|z,s)\, p_{Z|YS}(z|y,s) \stackrel{(c)}{=} \sum_{z} p_{Q|SZ}(q|z,s)\, p_{X_s|Y}(z|y).$$
In (a), we used $I(Q;S|Y) = 0$; in (b), we used $Q - (S,Z) - Y$; in (c), we used that $p_{Z|Y,S=s} = p_{X_s|Y}$. This implies that $p_{Q|Y} \sqsubseteq p_{X_s|Y}$ for all s. Therefore, $p_{Q|Y}$ satisfies the constraints in Equation (A1), so $I(Q;Y) \le V_1$. Combining with Equation (A3) implies
$$V_2 - \epsilon \le I(Q;Y|S) = I(Q;Y) \le V_1.$$
Taking the limit $\epsilon \to 0$ gives the desired result. □

Appendix A.2. Proof of Theorem 2

We now prove a slightly generalized version of Theorem 2. We show that the solution to the following optimization problem is non-decreasing and concave in R:
$$I_{\mathrm{RB}}(R) := \sup_{Q \,:\, Q-(Z,S)-Y} I(Q;Y|S) \quad \text{where} \quad I(Q;S|Y) \le R.$$
The slight generalization comes from replacing max in Equation (12) by sup, so that the result also holds for systems with infinite-dimensional sources where the supremum is not guaranteed to be achieved.
Proof. 
$I_{\mathrm{RB}}(R)$ is non-decreasing in R because a larger R gives a weaker constraint (a larger feasible set) in the maximization problem (A5).
To show concavity, consider any two points on the RB curve as defined by Equation (A5): $(R, I_{\mathrm{RB}}(R))$ and $(R', I_{\mathrm{RB}}(R'))$. For any $\epsilon > 0$, there exist Q and Q′ such that
$$I(Q;S|Y) \le R, \qquad I(Q;Y|S) \ge I_{\mathrm{RB}}(R) - \epsilon,$$
$$I(Q';S|Y) \le R', \qquad I(Q';Y|S) \ge I_{\mathrm{RB}}(R') - \epsilon.$$
Without loss of generality, suppose that both variables have the same set of outcomes $\mathcal{Q}$. Then, we define a new random variable $Q_\lambda$ with outcomes $\mathcal{Q}_\lambda = \{1,2\} \times \mathcal{Q}$, as well as a family of conditional distributions parameterized by $\lambda \in [0,1]$:
$$p_{Q_\lambda|ZS}\big((1,q)\,\big|\,z,s\big) = \lambda\, p_{Q|ZS}(q|z,s), \qquad p_{Q_\lambda|ZS}\big((2,q)\,\big|\,z,s\big) = (1-\lambda)\, p_{Q'|ZS}(q|z,s).$$
In this way, we define $Q_\lambda$ via a disjoint convex mixture of Q and Q′ onto non-overlapping subspaces, with λ being the mixing parameter. With a bit of algebra, it can be verified that, for every λ,
$$H(Q_\lambda|Y) = \lambda H(Q|Y) + (1-\lambda) H(Q'|Y) + h(\lambda),$$
where $h(\lambda)$ is the binary entropy of the mixing parameter, and similarly for $H(Q_\lambda|Y,S)$ and $H(Q_\lambda|S)$. The $h(\lambda)$ terms cancel in the mutual information differences; therefore,
$$I(Q_\lambda;S|Y) = \lambda I(Q;S|Y) + (1-\lambda) I(Q';S|Y) \le \lambda R + (1-\lambda) R',$$
$$I(Q_\lambda;Y|S) = \lambda I(Q;Y|S) + (1-\lambda) I(Q';Y|S) \ge \lambda I_{\mathrm{RB}}(R) + (1-\lambda) I_{\mathrm{RB}}(R') - \epsilon.$$
Since I RB is defined via a maximization, we have
$$I_{\mathrm{RB}}\big(\lambda R + (1-\lambda) R'\big) \ge I(Q_\lambda;Y|S) \ge \lambda I_{\mathrm{RB}}(R) + (1-\lambda) I_{\mathrm{RB}}(R') - \epsilon.$$
Taking the limit $\epsilon \to 0$ proves the concavity. □

Appendix A.3. Proof of Theorem 3

Proof. 
We show that, for any Q that achieves $I(Q;S|Y) \le R$, there is another Q′ with cardinality $|\mathcal{Q}'| \le \sum_s |\mathcal{X}_s| + 1$ that satisfies $I(Q';S|Y) \le R$ and $I(Q';Y|S) \ge I(Q;Y|S)$.
Consider any joint distribution $p_{QSZY}$ from Equation (10) that achieves $I(Q;S|Y) \le R$, and let $\mathcal{Q}$ be the corresponding set of outcomes of Q. Fix the corresponding conditional distribution $p_{SZ|Q}$, and note that it also determines the conditional distributions:
$$p_{YSZ|Q}(y,s,z|q) = p_{Y|SZ}(y|s,z)\, p_{SZ|Q}(s,z|q) = \frac{\nu_S(s)\, p_{X_s|Y}(z|y)\, p_Y(y)}{p_{SZ}(s,z)}\, p_{SZ|Q}(s,z|q),$$
$$p_{Y|SQ}(y|s,q) = \frac{\sum_z p_{YSZ|Q}(y,s,z|q)}{\sum_{z,y} p_{YSZ|Q}(y,s,z|q)}, \qquad p_{S|YQ}(s|y,q) = \frac{\sum_z p_{YSZ|Q}(y,s,z|q)}{\sum_{z,s} p_{YSZ|Q}(y,s,z|q)}.$$
Next, consider the following linear program:
$$V = \max_{\omega_Q \in \Delta} \sum_{q} \omega_Q(q)\, D\big( p_{Y|S,Q=q} \,\|\, p_{Y|S} \big)$$
$$\text{where} \quad \sum_{q} \omega_Q(q)\, p_{SZ|Q}(s,z|q) = p_{SZ}(s,z) \;\; \forall s,z$$
$$\sum_{q} \omega_Q(q)\, H(S|Y,Q=q) = H(S|Y,Q)$$
where Δ is the $|\mathcal{Q}|$-dimensional unit simplex, and we use the notation $H(S|Y,Q=q) = -\sum_{y,s} p_{YS|Q}(y,s|q) \ln p_{S|YQ}(s|y,q)$. Let Q′ denote the bottleneck variable whose joint distribution with $(Y,S,Z)$ is $\omega_Q(q)\, p_{YSZ|Q}(y,s,z|q)$. The first set of constraints (A11) guarantees that $\omega_Q\, p_{YSZ|Q}$ belongs to the family (10) and, in particular, that the marginal over $(S,Z,Y)$ is $\nu_S(s)\, p_{X_s|Y}(z|y)\, p_Y(y)$ (see Equation (A7)). There are $\sum_s |\mathcal{X}_s|$ possible outcomes of $(s,z)$, but $\sum_{s,z} p_{SZ}(s,z) = 1$ by the conservation of probability. Therefore, Equation (A11) effectively imposes $\sum_s |\mathcal{X}_s| - 1$ constraints. The last constraint (A12) guarantees that $H(S|Y,Q') = H(S|Y,Q)$; hence,
$$I(Q';S|Y) = H(S|Y) - H(S|Y,Q') = H(S|Y) - H(S|Y,Q) = I(Q;S|Y) \le R.$$
Equation (A10) involves a maximization of a linear function over the simplex, subject to s X s hyperplane constraints. The feasible set is compact, and the maximum is achieved at one of the extreme points of the feasible set. By Dubin’s theorem [56], any extreme point of this feasible set can be expressed as a convex combination of at most s X s + 1 extreme points of Δ . Thus, the maximum is achieved by a marginal distribution ω Q with support on at most s X s + 1 outcomes. This distribution satisfies:
$$ \sum_q \omega_Q(q)\, D\!\left(P_{Y|S,Q=q} \,\|\, P_{Y|S}\right) \ge \sum_q p_Q(q)\, D\!\left(P_{Y|S,Q=q} \,\|\, P_{Y|S}\right), $$
since the actual marginal distribution $p_Q$ is an element of the feasible set. Finally, note that
$$ \sum_q \omega_Q(q)\, D\!\left(P_{Y|S,Q=q} \,\|\, P_{Y|S}\right) = I(Q';Y|S), \qquad \sum_q p_Q(q)\, D\!\left(P_{Y|S,Q=q} \,\|\, P_{Y|S}\right) = I(Q;Y|S); $$
therefore, $I(Q';Y|S) \ge I(Q;Y|S)$. □
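The cardinality-reduction step can be illustrated numerically. The following sketch sets up a toy analogue of the linear program (A10)–(A12) with scipy.optimize.linprog; the channel, objective coefficients, and entropy coefficients below are randomly generated placeholders rather than the quantities constructed in the proof. The point is only that a vertex (basic) solution of such a linear program has support bounded by the number of hyperplane constraints plus one, in line with Dubins' theorem.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)

# Toy instance: nQ outcomes for Q and nSZ joint outcomes for (S, Z).
nQ, nSZ = 12, 4
p_q = rng.dirichlet(np.ones(nQ))                  # marginal p_Q
chan_sz_q = rng.dirichlet(np.ones(nSZ), size=nQ)  # rows play the role of p_{SZ|Q}(.|q)
p_sz = p_q @ chan_sz_q                            # implied marginal p_{SZ}

# Per-outcome coefficients. In the proof these would be D(P_{Y|S,Q=q} || P_{Y|S})
# and H(S|Y,Q=q); here they are placeholder values, since only the LP structure
# and the resulting support-size bound are being illustrated.
d_q = rng.random(nQ)          # objective coefficients
h_q = rng.random(nQ)          # "conditional entropy" coefficients
h_target = p_q @ h_q          # value that the entropy constraint must preserve

# Equality constraints analogous to (A11) and (A12). The normalization sum_q w(q) = 1
# is already implied by the marginal-matching rows, mirroring the remark in the proof.
A_eq = np.vstack([chan_sz_q.T, h_q])
b_eq = np.concatenate([p_sz, [h_target]])

# linprog minimizes, so negate the objective; a simplex-type solver returns a
# vertex (basic) solution, whose support obeys the Dubins-type bound nSZ + 1.
res = linprog(-d_q, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs-ds")
w = res.x

print("LP value:", -res.fun, "  value under p_Q:", float(p_q @ d_q))
print("support of optimal w:", int(np.sum(w > 1e-9)), " (bound:", nSZ + 1, ")")
```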

Appendix A.4. Proof of Theorem 4

Proof. 
For a finite-dimensional system, we may restrict the optimization problem in Theorem 1 to $Q$ with cardinality $|\mathcal{Q}| \le \sum_s |\mathcal{X}_s| + 1$ (Theorem 3). In this case, the feasible set can be restricted to a compact set, and the objective is continuous; therefore, the maximum will be achieved.
Now, consider a tuple of random variables $(S,Z,Y,Q)$ that obeys the Markov conditions $S - Y - Z$ and $Q - (S,Z) - Y$. Suppose that $Q$ achieves the maximum in Theorem 1 for a given $R > 0$:
$$ I(Q;Y|S) = I_{\mathrm{RB}}(R), \qquad I(Q;S|Y) \le R. $$
Consider also a sequence of random variables $(S^k, Z^k, Y^k, Q^k)$ for $k = 1, 2, 3, \dots$, where each tuple has the same outcomes as $(S,Z,Y,Q)$ and obeys the Markov conditions $S^k - Y^k - Z^k$ and $Q^k - (S^k,Z^k) - Y^k$. Let $I_{\mathrm{RB}}^k(R)$ indicate the redundancy bottleneck defined in Theorem 1 for the random variables $Z^k, Y^k, S^k$, and suppose that $Q^k$ achieves the optimum for problem $k$:
$$ I(Q^k;Y^k|S^k) = I_{\mathrm{RB}}^k(R), \qquad I(Q^k;S^k|Y^k) \le R. $$
To prove continuity, we assume that the joint distribution of $(S^k,Z^k,Y^k)$ approaches the joint distribution of $(S,Z,Y)$:
$$ \lim_{k\to\infty} \big\| p_{S^k Z^k Y^k} - p_{SZY} \big\|_1 = 0. $$
We first show that
$$ I_{\mathrm{RB}}(R) \ge \lim_{k\to\infty} I_{\mathrm{RB}}^k(R). $$
First, observe that, given our assumption that $p_{SZ}$ has full support, we can always take $k$ sufficiently large so that each $p_{S^k Z^k}$ has full support. Next, we define the random variable $\tilde{Q}^k$ that obeys the Markov condition $\tilde{Q}^k - (S,Z) - Y$, with conditional distribution
$$ p_{\tilde{Q}^k|SZ}(q \mid s,z) := p_{Q^k|S^kZ^k}(q \mid s,z). $$
This conditional distribution is always well-defined, given that $p_{S^kZ^k}$ has the same support as $p_{SZ}$. By assumption, $p_{S^kZ^kY^k} \to p_{SZY}$; therefore,
$$ \big\| p_{\tilde{Q}^k SZY} - p_{Q^k S^k Z^k Y^k} \big\|_1 \to 0. $$
Conditional mutual information is (uniformly) continuous due to the (uniform) continuity of entropy (Theorem 17.3.3, [57]). Therefore,
$$ 0 = \lim_{k\to\infty} \big[ I(Q^k;S^k|Y^k) - I(\tilde{Q}^k;S|Y) \big], \qquad R \ge \lim_{k\to\infty} I(\tilde{Q}^k;S|Y), $$
where we used Equation (A14). We also define another random variable $\tilde{Q}^k_\alpha$, which also obeys the Markov condition $\tilde{Q}^k_\alpha - (S,Z) - Y$, and whose conditional distribution is defined in terms of the convex mixture
$$ p_{\tilde{Q}^k_\alpha|SZ}(q \mid s,z) := \alpha_k\, p_{\tilde{Q}^k|SZ}(q \mid s,z) + (1-\alpha_k)\, p_U(q), \qquad \alpha_k := \min\!\left\{1,\ \frac{R}{I(\tilde{Q}^k;S|Y)}\right\} \in [0,1]. $$
Here, $p_U(q) = 1/|\mathcal{Q}|$ is a uniform distribution over an auxiliary independent random variable $U$ with outcomes $\mathcal{Q}$. From the convexity of conditional mutual information [58],
$$ I(\tilde{Q}^k_\alpha;S|Y) \le \alpha_k\, I(\tilde{Q}^k;S|Y) + (1-\alpha_k)\, I(U;S|Y) \le R. $$
In the last inequality, we used $I(U;S|Y) = 0$ and plugged in the definition of $\alpha_k$. Observe that the random variable $\tilde{Q}^k_\alpha$ falls in the feasible set of the maximization problem that defines $I_{\mathrm{RB}}$, so
$$ I_{\mathrm{RB}}(R) \ge I(\tilde{Q}^k_\alpha;Y|S). $$
Combining Equations (A17) and (A18) with $R > 0$ implies that $\alpha_k \to 1$, so
$$ \big\| p_{\tilde{Q}^k_\alpha SZY} - p_{\tilde{Q}^k SZY} \big\|_1 \to 0. $$
Combining this with Equations (A14), (A16) and (A19), along with the continuity of conditional mutual information, gives
$$ I_{\mathrm{RB}}(R) \ge \lim_{k\to\infty} I(\tilde{Q}^k_\alpha;Y|S) = \lim_{k\to\infty} I(\tilde{Q}^k;Y|S) = \lim_{k\to\infty} I(Q^k;Y^k|S^k) = \lim_{k\to\infty} I_{\mathrm{RB}}^k(R). $$
We now proceed in a similar way to prove the reverse inequality,
$$ I_{\mathrm{RB}}(R) \le \lim_{k\to\infty} I_{\mathrm{RB}}^k(R). $$
We define the random variable $\hat{Q}^k$ that obeys the Markov condition $\hat{Q}^k - (S^k,Z^k) - Y^k$, with conditional distribution
$$ p_{\hat{Q}^k|S^kZ^k}(q \mid s,z) := p_{Q|SZ}(q \mid s,z). $$
Since $p_{S^kZ^kY^k} \to p_{SZY}$ by assumption,
$$ \big\| p_{\hat{Q}^k S^k Z^k Y^k} - p_{QSZY} \big\|_1 \to 0. $$
We then have
$$ 0 = \lim_{k\to\infty} \big[ I(Q;S|Y) - I(\hat{Q}^k;S^k|Y^k) \big], \qquad R \ge \lim_{k\to\infty} I(\hat{Q}^k;S^k|Y^k), $$
where we used Equation (A13). We also define the random variable $\hat{Q}^k_\alpha$ that obeys the Markov condition $\hat{Q}^k_\alpha - (S^k,Z^k) - Y^k$, with conditional distribution
$$ p_{\hat{Q}^k_\alpha|S^kZ^k}(q \mid s,z) := \alpha_k\, p_{\hat{Q}^k|S^kZ^k}(q \mid s,z) + (1-\alpha_k)\, p_U(q), \qquad \alpha_k := \min\!\left\{1,\ \frac{R}{I(\hat{Q}^k;S^k|Y^k)}\right\} \in [0,1]. $$
Using the convexity of conditional mutual information, $I(U;S^k|Y^k) = 0$, and the definition of $\alpha_k$, we have
$$ I(\hat{Q}^k_\alpha;S^k|Y^k) \le \alpha_k\, I(\hat{Q}^k;S^k|Y^k) + (1-\alpha_k)\, I(U;S^k|Y^k) \le R. $$
Therefore, the random variable $\hat{Q}^k_\alpha$ falls in the feasible set of the maximization problem that defines $I_{\mathrm{RB}}^k$, so
$$ I_{\mathrm{RB}}^k(R) \ge I(\hat{Q}^k_\alpha;Y^k|S^k). $$
Combining Equations (A22) and (A23) with $R > 0$ implies that $\alpha_k \to 1$; therefore,
$$ \big\| p_{\hat{Q}^k_\alpha S^k Z^k Y^k} - p_{\hat{Q}^k S^k Z^k Y^k} \big\|_1 \to 0. $$
Combining this with Equations (A13), (A21) and (A24), along with the continuity of conditional mutual information, gives
$$ \lim_{k\to\infty} I_{\mathrm{RB}}^k(R) \ge \lim_{k\to\infty} I(\hat{Q}^k_\alpha;Y^k|S^k) = \lim_{k\to\infty} I(\hat{Q}^k;Y^k|S^k) = I(Q;Y|S) = I_{\mathrm{RB}}(R). $$
 □
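The key step in the continuity argument, transferring a fixed conditional distribution between nearby joint distributions and invoking the continuity of conditional mutual information, can be illustrated with the following sketch (arbitrary toy distributions and placeholder names, not an example from the paper). As the perturbed joint approaches the original in total variation, both $I(Q;S|Y)$ and $I(Q;Y|S)$ computed with the same channel converge to their unperturbed values.

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(p):
    """Shannon entropy (in nats), ignoring zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cond_mi(p_abc):
    """I(A;B|C) for a joint array indexed as p[a, b, c]."""
    return (entropy(p_abc.sum(axis=1)) + entropy(p_abc.sum(axis=0))
            - entropy(p_abc) - entropy(p_abc.sum(axis=(0, 1))))

# Arbitrary full-support joint p(s, z, y) and a fixed channel p(q | s, z).
nS, nZ, nY, nQ = 2, 3, 3, 4
p_szy = rng.random((nS, nZ, nY))
p_szy /= p_szy.sum()
chan_q = rng.dirichlet(np.ones(nQ), size=(nS, nZ))

def rb_terms(joint_szy):
    """(I(Q;S|Y), I(Q;Y|S)) when the same channel is attached to a given joint."""
    p_qsy = np.einsum('szq,szy->qsy', chan_q, joint_szy)
    return cond_mi(p_qsy), cond_mi(p_qsy.transpose(0, 2, 1))

base = rb_terms(p_szy)
for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    noise = rng.random(p_szy.shape)
    p_k = (1 - eps) * p_szy + eps * noise / noise.sum()   # perturbed joint, playing p_{S^k Z^k Y^k}
    pert = rb_terms(p_k)
    l1 = np.abs(p_k - p_szy).sum()
    print(f"||p_k - p||_1 = {l1:.1e}   |dI(Q;S|Y)| = {abs(pert[0] - base[0]):.1e}   "
          f"|dI(Q;Y|S)| = {abs(pert[1] - base[1]):.1e}")
```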

References

1. Williams, P.L.; Beer, R.D. Nonnegative decomposition of multivariate information. arXiv 2010, arXiv:1004.2515.
2. Wibral, M.; Priesemann, V.; Kay, J.W.; Lizier, J.T.; Phillips, W.A. Partial information decomposition as a unified approach to the specification of neural goal functions. Brain Cogn. 2017, 112, 25–38.
3. Lizier, J.; Bertschinger, N.; Jost, J.; Wibral, M. Information decomposition of target effects from multi-source interactions: Perspectives on previous, current and future work. Entropy 2018, 20, 307.
4. Kolchinsky, A. A Novel Approach to the Partial Information Decomposition. Entropy 2022, 24, 403.
5. Williams, P.L. Information Dynamics: Its Theory and Application to Embodied Cognitive Systems. Ph.D. Thesis, Indiana University, Bloomington, IN, USA, 2011.
6. Tishby, N.; Pereira, F.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Monticello, IL, USA, 22–24 September 1999.
7. Hu, S.; Lou, Z.; Yan, X.; Ye, Y. A Survey on Information Bottleneck. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 1–20.
8. Palmer, S.E.; Marre, O.; Berry, M.J.; Bialek, W. Predictive information in a sensory population. Proc. Natl. Acad. Sci. USA 2015, 112, 6908–6913.
9. Wang, Y.; Ribeiro, J.M.L.; Tiwary, P. Past–future information bottleneck for sampling molecular reaction coordinate simultaneously with thermodynamics and kinetics. Nat. Commun. 2019, 10, 3573.
10. Zaslavsky, N.; Kemp, C.; Regier, T.; Tishby, N. Efficient compression in color naming and its evolution. Proc. Natl. Acad. Sci. USA 2018, 115, 7937–7942.
11. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; Available online: https://openreview.net/forum?id=HyxQzBceg (accessed on 12 May 2024).
12. Kolchinsky, A.; Tracey, B.D.; Wolpert, D.H. Nonlinear information bottleneck. Entropy 2019, 21, 1181.
13. Fischer, I. The conditional entropy bottleneck. Entropy 2020, 22, 999.
14. Goldfeld, Z.; Polyanskiy, Y. The information bottleneck problem and its applications in machine learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38.
15. Ahlswede, R.; Körner, J. Source Coding with Side Information and a Converse for Degraded Broadcast Channels. IEEE Trans. Inf. Theory 1975, 21, 629–637.
16. Witsenhausen, H.; Wyner, A. A conditional entropy bound for a pair of discrete random variables. IEEE Trans. Inf. Theory 1975, 21, 493–501.
17. Gilad-Bachrach, R.; Navot, A.; Tishby, N. An Information Theoretic Tradeoff between Complexity and Accuracy. In Learning Theory and Kernel Machines; Goos, G., Hartmanis, J., van Leeuwen, J., Schölkopf, B., Warmuth, M.K., Eds.; Springer: Berlin/Heidelberg, Germany, 2003; Volume 2777, pp. 595–609.
18. Kolchinsky, A.; Tracey, B.D.; Van Kuyk, S. Caveats for information bottleneck in deterministic scenarios. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019; Available online: https://openreview.net/forum?id=rke4HiAcY7 (accessed on 12 May 2024).
19. Rodríguez Gálvez, B.; Thobaben, R.; Skoglund, M. The convex information bottleneck lagrangian. Entropy 2020, 22, 98.
20. Benger, E.; Asoodeh, S.; Chen, J. The cardinality bound on the information bottleneck representations is tight. In Proceedings of the 2023 IEEE International Symposium on Information Theory (ISIT), Taipei, Taiwan, 25–30 June 2023; pp. 1478–1483.
21. Geiger, B.C.; Fischer, I.S. A comparison of variational bounds for the information bottleneck functional. Entropy 2020, 22, 1229.
22. Federici, M.; Dutta, A.; Forré, P.; Kushman, N.; Akata, Z. Learning robust representations via multi-view information bottleneck. arXiv 2020, arXiv:2002.07017.
23. Murphy, K.A.; Bassett, D.S. Machine-Learning Optimized Measurements of Chaotic Dynamical Systems via the Information Bottleneck. Phys. Rev. Lett. 2024, 132, 197201.
24. Slonim, N.; Friedman, N.; Tishby, N. Multivariate Information Bottleneck. Neural Comput. 2006, 18, 1739–1789.
25. Shannon, C. The lattice theory of information. Trans. IRE Prof. Group Inf. Theory 1953, 1, 105–107.
26. McGill, W. Multivariate information transmission. Trans. IRE Prof. Group Inf. Theory 1954, 4, 93–111.
27. Reza, F.M. An Introduction to Information Theory; Dover Publications: Mineola, NY, USA, 1961.
28. Ting, H.K. On the amount of information. Theory Probab. Its Appl. 1962, 7, 439–447.
29. Han, T. Linear dependence structure of the entropy space. Inf. Control 1975, 29, 337–368.
30. Yeung, R.W. A new outlook on Shannon's information measures. IEEE Trans. Inf. Theory 1991, 37, 466–474.
31. Bell, A.J. The co-information lattice. In Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA, Nara, Japan, 1–4 April 2003; Volume 2003.
32. Gomes, A.F.; Figueiredo, M.A. Orders between Channels and Implications for Partial Information Decomposition. Entropy 2023, 25, 975.
33. Griffith, V.; Koch, C. Quantifying synergistic mutual information. In Guided Self-Organization: Inception; Springer: Berlin/Heidelberg, Germany, 2014; pp. 159–190.
34. Griffith, V.; Chong, E.K.; James, R.G.; Ellison, C.J.; Crutchfield, J.P. Intersection information based on common randomness. Entropy 2014, 16, 1985–2000.
35. Griffith, V.; Ho, T. Quantifying redundant information in predicting a target random variable. Entropy 2015, 17, 4644–4653.
36. Bertschinger, N.; Rauh, J. The Blackwell relation defines no lattice. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 2479–2483.
37. Blackwell, D. Equivalent comparisons of experiments. Ann. Math. Stat. 1953, 24, 265–272.
38. Rauh, J.; Banerjee, P.K.; Olbrich, E.; Jost, J.; Bertschinger, N. Coarse-Graining and the Blackwell Order. Entropy 2017, 19, 527.
39. Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy 2014, 16, 2161–2183.
40. Rauh, J.; Banerjee, P.K.; Olbrich, E.; Jost, J.; Bertschinger, N. On extractable shared information. Entropy 2017, 19, 328.
41. Venkatesh, P.; Schamberg, G. Partial information decomposition via deficiency for multivariate gaussians. In Proceedings of the 2022 IEEE International Symposium on Information Theory (ISIT), Espoo, Finland, 26 June–1 July 2022; pp. 2892–2897.
42. Mages, T.; Anastasiadi, E.; Rohner, C. Non-Negative Decomposition of Multivariate Information: From Minimum to Blackwell Specific Information. Entropy 2024, 26, 424.
43. Le Cam, L. Sufficiency and approximate sufficiency. Ann. Math. Stat. 1964, 35, 1419–1455.
44. Raginsky, M. Shannon meets Blackwell and Le Cam: Channels, codes, and statistical experiments. In Proceedings of the 2011 IEEE International Symposium on Information Theory, St. Petersburg, Russia, 31 July–5 August 2011; pp. 1220–1224.
45. Banerjee, P.K.; Olbrich, E.; Jost, J.; Rauh, J. Unique informations and deficiencies. In Proceedings of the 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–5 October 2018; pp. 32–38.
46. Banerjee, P.K.; Montufar, G. The Variational Deficiency Bottleneck. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
47. Venkatesh, P.; Gurushankar, K.; Schamberg, G. Capturing and Interpreting Unique Information. In Proceedings of the 2023 IEEE International Symposium on Information Theory (ISIT), Taipei, Taiwan, 25–30 June 2023; pp. 2631–2636.
48. Csiszár, I.; Matus, F. Information projections revisited. IEEE Trans. Inf. Theory 2003, 49, 1474–1490.
49. Makhdoumi, A.; Salamatian, S.; Fawaz, N.; Médard, M. From the information bottleneck to the privacy funnel. In Proceedings of the 2014 IEEE Information Theory Workshop (ITW 2014), Hobart, Australia, 2–5 November 2014; pp. 501–505.
50. Janzing, D.; Balduzzi, D.; Grosse-Wentrup, M.; Schölkopf, B. Quantifying causal influences. Ann. Stat. 2013, 41, 2324–2358.
51. Ay, N. Confounding ghost channels and causality: A new approach to causal information flows. Vietnam. J. Math. 2021, 49, 547–576.
52. Kolchinsky, A.; Rocha, L.M. Prediction and modularity in dynamical systems. In Proceedings of the European Conference on Artificial Life (ECAL), Paris, France, 8–12 August 2011; Available online: https://direct.mit.edu/isal/proceedings/ecal2011/23/65/111139 (accessed on 12 May 2024).
53. Hidaka, S.; Oizumi, M. Fast and exact search for the partition with minimal information loss. PLoS ONE 2018, 13, e0201126.
54. Rosas, F.; Ntranos, V.; Ellison, C.J.; Pollin, S.; Verhelst, M. Understanding interdependency through complex information sharing. Entropy 2016, 18, 38.
55. Rosas, F.E.; Mediano, P.A.; Gastpar, M.; Jensen, H.J. Quantifying high-order interdependencies via multivariate extensions of the mutual information. Phys. Rev. E 2019, 100, 032305.
56. Dubins, L.E. On extreme points of convex sets. J. Math. Anal. Appl. 1962, 5, 237–244.
57. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2006.
58. Timo, R.; Grant, A.; Kramer, G. Lossy broadcasting with complementary side information. IEEE Trans. Inf. Theory 2012, 59, 104–131.
Figure 1. RB analysis for the UNIQUE gate (Example 1). (a) Prediction values found by optimizing the RB Lagrangian (15) at different $\beta$. Colored regions indicate contributions from different sources, $\nu_S(s)\, I(Q;Y|S=s)$ from Equation (17). For this system, only source $X_1$ contributes to the prediction. (b) Compression costs found by optimizing the RB Lagrangian at different $\beta$. Colored regions indicate contributions from different sources, $\nu_S(s)\, I(Q;S=s|Y)$ from Equation (20). (c) The RB curve shows the tradeoff between optimal compression and the prediction values; the marker colors correspond to the $\beta$ values as in (a,b). All bottleneck variables $Q$ must fall within the accessible grey region. (d) RB curves for individual sources.
Figure 2. RB analysis for the system with 4 binary symmetric channels (Example 3). (a,b) Prediction and compression values found by optimizing the RB Lagrangian (15) at different β . Contributions from individual sources are shown as shaded regions. (c) The RB curve shows the tradeoff between optimal compression and prediction values; marker colors correspond to the β values as in (a,b). (d) RB curves for individual sources.
Figure 3. RB analysis for the system with a 3-spin target (Example 4). (a,b) Prediction and compression values found by optimizing the RB Lagrangian (15) at different β . Contributions from individual sources are shown as shaded regions. (c) The RB curve shows the tradeoff between optimal compression and prediction values; marker colors correspond to the β values as in (a,b). (d) RB curves for individual sources.
Figure 4. The RB function $I_{\mathrm{RB}}(R)$ is continuous in the underlying probability distribution for $R > 0$, while Blackwell redundancy can be discontinuous. This is illustrated here for the COPY gate, $Y = (X_1, X_2)$, as a function of the correlation strength $\epsilon$ between $X_1$ and $X_2$ (perfect correlation at $\epsilon = 0$, independence at $\epsilon = 1$). Blackwell redundancy jumps from a value of 1 at $\epsilon = 0$ to 0 for any $\epsilon > 0$, while $I_{\mathrm{RB}}(R)$ (at $R = 0.01$) decays continuously.