Coarse-Graining and the Blackwell Order

Rauh, Johannes; Banerjee, Pradeep Kr.; Olbrich, Eckehard; Jost, Jürgen; Bertschinger, Nils; Wolpert, David

doi:10.3390/e19100527

Open AccessFeature PaperArticle

Coarse-Graining and the Blackwell Order

by

Johannes Rauh

^1,*

,

Pradeep Kr. Banerjee

¹,

Eckehard Olbrich

¹

,

Jürgen Jost

¹,

Nils Bertschinger

² and

David Wolpert

^3,4

¹

Max Planck Institute for Mathematics in the Sciences, 04103 Leipzig, Germany

²

Frankfurt Institute for Advanced Studies, 60438 Frankfurt, Germany

³

Santa Fe Institute, Santa Fe, NM 87501, USA

⁴

Massachusetts Institute of Technology, Cambridge, MA 02139, USA

^*

Author to whom correspondence should be addressed.

Entropy 2017, 19(10), 527; https://doi.org/10.3390/e19100527

Submission received: 15 June 2017 / Revised: 13 September 2017 / Accepted: 20 September 2017 / Published: 6 October 2017

(This article belongs to the Special Issue Information Decomposition of Target Effects from Multi-Source Interactions)

Download

Browse Figure

Versions Notes

Abstract

:

Suppose we have a pair of information channels,

κ_{1}, κ_{2}

, with a common input. The Blackwell order is a partial order over channels that compares

κ_{1}

and

κ_{2}

by the maximal expected utility an agent can obtain when decisions are based on the channel outputs. Equivalently,

κ_{1}

is said to be Blackwell-inferior to

κ_{2}

if and only if

κ_{1}

can be constructed by garbling the output of

κ_{2}

. A related partial order stipulates that

κ_{2}

is more capable than

κ_{1}

if the mutual information between the input and output is larger for

κ_{2}

than for

κ_{1}

for any distribution over inputs. A Blackwell-inferior channel is necessarily less capable. However, examples are known where

κ_{1}

is less capable than

κ_{2}

but not Blackwell-inferior. We show that this may even happen when

κ_{1}

is constructed by coarse-graining the inputs of

κ_{2}

. Such a coarse-graining is a special kind of “pre-garbling” of the channel inputs. This example directly establishes that the expected value of the shared utility function for the coarse-grained channel is larger than it is for the non-coarse-grained channel. This contradicts the intuition that coarse-graining can only destroy information and lead to inferior channels. We also discuss our results in the context of information decompositions.

Keywords:

Channel preorders; Blackwell order; degradation order; garbling; more capable; coarse-graining

MSC:

62C05; 62B15; 94A15

1. Introduction

Suppose we are given the choice of two channels that both provide information about the same random variable, and that we want to make a decision based on the channel outputs. Suppose that our utility function depends on the joint value of the input to the channel and our resultant decision based on the channel outputs. Suppose as well that we know the precise conditional distributions defining the channels, and the distribution over channel inputs. Which channel should we choose? The answer to this question depends on the choice of our utility function as well as on the details of the channels and the input distribution. So, for example, without specifying how we will use the channels, in general we cannot just compare their information capacities to choose between them.

Nonetheless, for certain pairs of channels we can make our choice, even without knowing the utility functions or the distribution over inputs. Let us represent the two channels by two (column) stochastic matrices

κ_{1}

and

κ_{2}

, respectively. Then, if there exists another stochastic matrix

λ

such that

κ_{1} = λ \cdot κ_{2}

, there is never any reason to strictly prefer

κ_{1}

; if we choose

κ_{2}

, we can always make our decision by chaining the output of

κ_{2}

through the channel

λ

and then using the same decision function we would have used had we chosen

κ_{1}

. This simple argument shows that whatever the three stochastic matrices are and whatever the decision rule we would use if we chose channel

κ_{1}

, we can always get the same expected utility by instead choosing channel

κ_{2}

with an appropriate decision rule. In this kind of situation, where

κ_{1} = λ \cdot κ_{2}

, we say that

κ_{1}

is a garbling (or degradation) of

κ_{2}

. It is much more difficult to prove that the converse also holds true:

Theorem 1.

(Blackwell’s theorem [1]) Let

κ_{1}, κ_{2}

be two stochastic matrices representing two channels with the same input alphabet. Then the following two conditions are equivalent:

1.: When the agent chooses $κ_{2}$ (and uses the decision rule that is optimal for $κ_{2}$ ), her expected utility is always at least as big as the expected utility when she chooses $κ_{1}$ (and uses the optimal decision rule for $κ_{1}$ ), independent of the utility function and the distribution of the input S.
2.: $κ_{1}$ is a garbling of $κ_{2}$ .

Blackwell formulated his result in terms of a statistical decision maker who reacts to the outcome of a statistical experiment. We prefer to speak of a decision problem instead of a statistical experiment. See [2,3] for an overview.

Blackwell’s theorem motivates looking at the following partial order over channels

κ_{1}, κ_{2}

with a common input alphabet:

κ_{1} \leq κ_{2} : ⟺ \{\begin{matrix} one of the two statements \\ in Blackwell ’ s theorem holds true . \end{matrix}

We call this partial order the Blackwell order (this partial order is called degradation order by other authors [4,5]). If

κ_{1} \leq κ_{2}

, then

κ_{1}

is said to be Blackwell-inferior to

κ_{2}

. Strictly speaking, the Blackwell order is only a preorder, since there are channels

κ_{1} \neq κ_{2}

that satisfy

κ_{1} \leq κ_{2} \leq κ_{1}

(when

κ_{1}

arises from

κ_{2}

by permuting the output alphabet). However, for our purposes, such channels can be considered as equivalent. We write

κ_{1} < κ_{2}

if

κ_{1} \leq κ_{2}

and

κ_{1} ≱ κ_{2}

. By Blackwell’s theorem, this implies that

κ_{2}

performs at least as well as

κ_{1}

in any decision problem and that there exist decision problems in which

κ_{2}

outperforms

κ_{1}

.

For a given distribution of S, we can also compare

κ_{1}

and

κ_{2}

by comparing the two mutual informations

I (S; X_{1})

,

I (S; X_{2})

between the common input S and the channel outputs

X_{1}

and

X_{2}

. The data processing inequality shows that

κ_{2} \geq κ_{1}

implies

I (S; X_{2}) \geq I (S; X_{1})

. However, the converse implication does not hold. The intuitive reason is that for the Blackwell order, not only the amount of information is important. Rather, the question is how much of the information that

κ_{1}

or

κ_{2}

preserve is relevant for a given fixed decision problem (that is, a given fixed utility function).

Given two channels

κ_{1}, κ_{2}

, suppose that

I (S; X_{2}) \geq I (S; X_{1})

for all distributions of S. In this case, we say that

κ_{2}

is more capable than

κ_{1}

. Does this imply that

κ_{1} \leq κ_{2}

? The answer is known to be negative in general [6]. In Proposition 2 we introduce a new surprising example of this phenomenon with a particular structure. In fact, in this example,

κ_{1}

is a Markov approximation of

κ_{2}

by a deterministic function, in the following sense: Consider another random variable

f (S)

that arises from S by applying a (deterministic) function f. Given two random variables S, X, denote by

X \leftarrow S

the channel defined by the conditional probabilities

P_{X | S} (x | s)

, and let

κ_{2} : = (X \leftarrow S)

and

κ_{1} : = (X \leftarrow f (S)) \cdot (f (S) \leftarrow S)

. Thus,

κ_{1}

can be interpreted as first replacing S by

f (S)

and then sampling X according to the conditional distribution

P_{X | S} (x | f (s))

. Which channel is superior? Using the data processing inequality, it is easy to see that

κ_{1}

is less capable than

κ_{2}

. However, as Proposition 2 shows, in general

κ_{1} ≰ κ_{2}

.

We call

κ_{1}

a Markov approximation, because the output of

κ_{1}

is independent of the input S given

f (S)

. The channel

κ_{1}

can also be obtained from

κ_{2}

by “pre-garbling” (Lemma 3); that is, there is another stochastic matrix

λ^{f}

that satisfies

κ_{1} = κ_{2} \cdot λ^{f}

. It is known that pre-garbling may improve the performance of a channel (but not its capacity) as we recall in Section 2. What may be surprising is that this can happen for pre-garblings of the form

λ^{f}

, which have the effect of coarse-graining according to f.

The fact that the more capable preorder does not imply the Blackwell order shows that “Shannon information,” as captured by the mutual information, is not the same as “Blackwell information,” as needed for the Blackwell decision problems. Indeed, our example explicitly shows that even though coarse-graining always reduces Shannon information, it need not reduce Blackwell information. Finally, let us mention that there are further ways of comparing channels (or stochastic matrices); see [5] for an overview.

Proposition 2 builds upon another effect that we find paradoxical: Namely, there exist random variables

S, X_{1}, X_{2}

and there exists a function

f : S \to S^{'}

from the support of S to a finite set

S^{'}

such that the following holds:

S and $X_{1}$ are independent given $f (S)$ .
$(X_{1} \leftarrow f (S)) \leq (X_{2} \leftarrow f (S))$ .
$(X_{1} \leftarrow S) ≰ (X_{2} \leftarrow S)$ .

Statement (1) says that everything

X_{1}

knows about S, it knows through

f (S)

. Statement (2) says that

X_{2}

knows more about

f (S)

than

X_{1}

. Still, (3) says that we cannot conclude that

X_{2}

knows more about S than

X_{1}

. The paradox illustrates that it is difficult to formalize what it means to “know more.”

Understanding the Blackwell order is an important aspect of understanding information decompositions; that is, the quest to find new information measures that separate different aspects of the mutual information

I (S; X_{1}, \dots, X_{k})

of k random variables

X_{1}, \dots, X_{k}

and a target variable S (see the other contributions of this special issue and references therein). In particular, [7] argues that the Blackwell order provides a natural criterion when a variable

X_{1}

has unique information about S with respect to

X_{2}

. We hope that the examples we present here are useful in developing intuition on how information can be shared among random variables and how it behaves when applying a deterministic function, such as a coarse-graining. Further implications of our examples on information decompositions are discussed in [8]. In the converse direction, information decomposition measures (such as measures of unique information) can be used to study the Blackwell order and deviations from the Blackwell order. We illustrate this idea in Example 4.

The remainder of this work is organized as follows: In Section 2, we recall how pre-garbling can be used to improve the performance of a channel. We also show that the pre-garbled channel will always be less capable and that simultaneous pre-garbling of both channels preserves the Blackwell order. In Section 3, we state a few properties of the Blackwell order, and we explain why we find these properties counter-intuitive and paradoxical. In particular, we show that coarse-graining the input can improve the performance of a channel. Section 4 contains a detailed discussion of an example that illustrates these properties. In Section 5 we use the unqiue information measure from [7], which has properties similar to the Le Cam’s deficiency, to illustrate deviations from the Blackwell relation.

2. Pre-Garbling

As discussed above (and as made formal in Blackwell’s theorem (Theorem 1)), garbling the output of a channel (“post-garbling”) never increases the quality of a channel. On the other hand, garbling the input of a channel (“pre-garbling”) may increase the performance of a channel, as the following example shows.

Example 1.

Suppose that an agent can choose an action from a finite set

A

. She then receives a utility

u (a, s)

that depends both on the chosen action

a \in A

and on the value s of a random variable S. Consider the channels

κ_{1} = (\begin{matrix} 0.9 & 0 \\ 0.1 & 1 \end{matrix}) and κ_{2} = κ_{1} \cdot (\begin{matrix} 0 & 1 \\ 1 & 0 \end{matrix}) = (\begin{matrix} 0 & 0.9 \\ 1 & 0.1 \end{matrix}),

and the utility function

\begin{array}{c} s & 0 & 0 & 1 & 1 \\ a & 0 & 1 & 0 & 1 \\ u (s, a) & 2 & 0 & 0 & 1 \end{array}

For uniform input, the optimal decision rule for

κ_{1}

is

a (0) = 0, a (1) = 1

and the opposite

a (0) = 1, a (1) = 0

for

κ_{2}

. The expected utility with

κ_{1}

is 1.4, while using

κ_{2}

, it is slightly higher (1.45).

It is also not difficult to check that neither of the two channels is a garbling of the other (cf. Propsition 3.22 in [5]).

The intuitive reason for the difference in the expected utilities is that the channel

κ_{2}

transmits one of the states without noise and the other state with noise. With a convenient pre-processing, it is possible to make sure that the relevant information for choosing an action and for optimizing expected utility is transmitted with less noise.

Note the symmetry of the example: each of the two channels arises from the other by a convenient pre-processing, since the pre-processing is invertible. Hence, the two channels are not comparable by the Blackwell order. In contrast, two channels that only differ by an invertible garbling of the output are equivalent with respect to the Blackwell order.

The pre-garbling in Example 1 is invertible, and so it is more aptly described as a pre-processing. In general, though, pure pre-garbling and pure pre-processing are not easily distinguishable, and it is easy to perturb Example 1 by adding noise without changing the conclusion. In Section 3, we will present an example in which the pre-garbling consists of coarse-graining. It is much more difficult to understand how coarse-graining can be used as sensible pre-processing.

Even though pre-garbling can make a channel better (or, more precisely, more suited for a particular decision problem at hand), pre-garbling cannot invert the Blackwell order:

Lemma 1.

If

κ_{1} < κ_{2} \cdot λ

, then

κ_{1} ≱ κ_{2}

.

Proof.

Suppose that

κ_{1} < κ_{2} \cdot λ

. Then the capacity of

κ_{1}

is less than the capacity of

κ_{2} \cdot λ

, which is bounded by the capacity of

κ_{2}

. Therefore, the capacity of

κ_{1}

is less than the capacity of

κ_{2}

. ☐

Additionally, it follows directly from Blackwell’s theorem that

κ_{1} \leq κ_{2} implies κ_{1} \cdot λ \leq κ_{2} \cdot λ

for any channel

λ

, where the input and output alphabets of

λ

equal the input alphabet of

κ_{1}, κ_{2}

. Thus, pre-garbling preserves the Blackwell order when applied to both channels simultaneously.

Finally, let us remark that certain kinds of simultaneous pre-garbling can also be “hidden” in the utility function; namely, in Blackwell’s theorem, it is not necessary to vary the distribution of

S

as long as the support of the (fixed) input distribution has full support S (that is, every state of the input alphabet of

κ_{1}

and

κ_{2}

appears with positive probability). In this setting, it suffices to look only at different utility functions. When the input distribution is fixed, it is more convenient to think in terms of random variables instead of channels, which slightly changes the interpretation of the decision problem. Suppose we are given random variables

S, X_{1}, X_{2}

and a utility function

u (a, s)

depending on the value of S and an action

a \in A

as above. If we cannot look at both

X_{1}

and

X_{2}

, should we choose to look at

X_{1}

or at

X_{2}

to make our decision?

Theorem 2.

(Blackwell’s theorem for random variables [7]) The following two conditions are equivalent:

1.: Under the optimal decision rule, when the agent chooses $X_{2}$ , her expected utility is always at least as large as the expected utility when she chooses $X_{1}$ , independent of the utility function.
2.: $(X_{1} \leftarrow S) \leq (X_{2} \leftarrow S)$ .

3. Pre-Garbling by Coarse-Graining

In this section we present a few counter-intuitive properties of the Blackwell order.

Proposition 1.

There exist random variables

S, X_{1}, X_{2}

and a function

f : S \to S

from the support of S to a finite set

S^{'}

such that the following holds:

1.: S and $X_{1}$ are independent given $f (S)$ .
2.: $(X_{1} \leftarrow f (S)) < (X_{2} \leftarrow f (S))$ .
3.: $(X_{1} \leftarrow S) ≰ (X_{2} \leftarrow S)$ .

This result may at first seem paradoxical. After all, property (3) implies that there exists a decision problem involving S for which it is better to use

X_{1}

than

X_{2}

. Property (1) implies that any information that

X_{1}

has about S is contained in

X_{1}

’s information about

f (S)

. One would therefore expect that, from the viewpoint of

X_{1}

, any decision problem in which the task is to predict S and to react on S looks like a decision problem in which the task is to react to

f (S)

. But property (2) implies that for such a decision problem, it may in fact be better to look at

X_{2}

.

Proof of Proposition 1.

The proof is by Example 2, which will be given in Section 4. This example satisfies

S and $X_{1}$ are independent given $f (S)$ .
$(X_{1} \leftarrow f (S)) \leq (X_{2} \leftarrow f (S))$ .
$(X_{1} \leftarrow S) ≰ (X_{2} \leftarrow S)$ .

It remains to show that it is also possible to achieve the strict relation

(X_{1} \leftarrow f (S)) < (X_{2} \leftarrow f (S))

in the second statement. This can easily be done by adding a small garbling to the channel

X_{1} \leftarrow f (S)

(e.g., by adding a binary symmetric channel with sufficiently small noise parameter

ϵ

). This ensures

(X_{1} \leftarrow f (S)) < (X_{2} \leftarrow f (S))

, and if the garbling is small enough, this does not destroy the property

(X_{1} \leftarrow S) ≰ (X_{2} \leftarrow S)

. ☐

The example from Proposition 1 also leads to the following paradoxical property:

Proposition 2.

There exist random variables

S, X

and there exists a function

f : S \to S^{'}

from the support of S to a finite set

S^{'}

such that the following holds:

(X \leftarrow f (S)) \cdot (f (S) \leftarrow S) ≰ X \leftarrow S .

Let us again give a heuristic argument for why we find this property paradoxical. Namely, the combined channel

(X \leftarrow f (S)) \cdot (f (S) \leftarrow S)

can be seen as a Markov chain approximation of the direct channel

X \leftarrow S

that corresponds to replacing the conditional distribution

P_{X | S} (x | s) = \sum_{f (s)} P_{X | S f (S)} (x | s, f (s)) P_{f (S) | S} (f (s) | s) .

by

\sum_{f (s)} P_{X | f (S)} (x | f (s)) P_{f (S) | S} (f (s) | s) .

Proposition 2 together with Blackwell’s theorem states that there exist situations where this approximation is better than the correct channel.

Proof of Proposition 2.

Let

S, X_{1}, X_{2}

be as in Example 2 in Section 4 that also proves Proposition 1, and let

X = X_{2}

. In that example, the two channels

X_{1} \leftarrow f (S)

and

X_{2} \leftarrow f (S)

are equal. Moreover,

X_{1}

and S are independent given

f (S)

. Thus,

(X \leftarrow f (S)) \cdot (f (S) \leftarrow S) = (X_{1} \leftarrow S)

. Therefore, the statement follows from

(X_{1} \leftarrow S) ≰ (X_{2} \leftarrow S)

. ☐

On the other hand, the channel

(X \leftarrow f (S)) \cdot (f (S) \leftarrow S)

is always less capable than

X \leftarrow S

:

Lemma 2.

For any random variables S, X, and function

f : S \to S

, the channel

(X \leftarrow f (S)) \cdot (f (S) \leftarrow S)

is less capable than

X \leftarrow S

.

Proof.

For any distribution of S, let

X^{'}

be the output of the channel

(X \leftarrow f (S)) \cdot (f (S) \leftarrow S)

. Then,

X^{'}

is independent of S given

f (S)

. On the other hand, since f is a deterministic function,

X^{'}

is independent of

f (S)

given S. Together, this implies

I (S; X^{'}) = I (f (S); X^{'})

. Using the fact that the joint distributions of

(X, f (S))

and

(X^{'}, f (S))

are identical and applying the data processing inequality gives

I (S; X^{'}) = I (f (S); X^{'}) = I (f (S); X) \leq I (S; X) . ☐

The setting of Proposition 2 can also be understood as a specific kind of pre-garbling. Namely, consider the channel

λ^{f}

defined by

λ_{s^{'}, s}^{f} : = P_{S | f (S)} (s^{'} | f (s)) .

The effect of this channel can be characterized as a randomization of the input: the precise value of S is forgotten, and only the value of

f (S)

is preserved. Then, a new value

s^{'}

is sampled for S according to the conditional distribution of S given

f (S)

.

Lemma 3.

(X \leftarrow f (S)) \cdot (f (S) \leftarrow S) = (X \leftarrow S) \cdot λ^{f}

.

Proof.

\begin{array}{l} \sum_{s_{1}} P_{X | S} (x | s_{1}) P_{S | f (S)} (s_{1} | f (s)) & = \sum_{s_{1}, t} P_{X | S} (x | s_{1}) P_{S | f (S)} (s_{1} | t) P_{f (S) | S} (t | s) \\ = \sum_{t} P_{X | f (S)} (x | t) P_{f (S) | S} (t | s), \end{array}

where we have used that

X - S - f (S)

forms a Markov chain. ☐

While it is easy to understand that pre-garbling can be advantageous in general (since it can work as preprocessing), we find it surprising that this can also happen in the case where the pre-garbling is done in terms of a function f; that is, in terms of a channel

λ^{f}

that does coarse-graining.

4. Examples

Example 2.

Consider the joint distribution

\begin{array}{c} f (s) & s & x_{1} & x_{2} & P_{f (S) S X_{1} X_{2}} \\ 0 & 0 & 0 & 0 & 1 / 4 \\ 0 & 1 & 0 & 1 & 1 / 4 \\ 0 & 0 & 1 & 0 & 1 / 8 \\ 0 & 1 & 1 & 0 & 1 / 8 \\ 1 & 2 & 1 & 1 & 1 / 4 \end{array}

and the function f: {0, 1, 2} → {0, 1} with f(0) = f(1) = 0 and f(2) = 1. Then,

X_{1}

and

X_{2}

are independent uniform binary random variables, and f(S) = And(X₁, X₂). By symmetry, the joint distributions of the pairs

(f (S), X_{1})

and

(f (S), X_{2})

are identical, and so the two channels

X_{1} \leftarrow f (S)

and

X_{2} \leftarrow f (S)

are identical. In particular,

(X_{1} \leftarrow f (S)) \leq (X_{2} \leftarrow f (S))

.

On the other hand, consider the utility function

\begin{array}{c} s & a & u (s, a) \\ 0 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 2 & 0 & 0 \\ 2 & 1 & 1 \end{array}

To compute the optimal decision rule, let us look at the conditional distributions:

\begin{matrix} \begin{array}{c} s & x_{1} & P_{S | X_{1}} (s | x_{1}) \\ 0 & 0 & 1 / 2 \\ 1 & 0 & 1 / 2 \\ 0 & 1 & 1 / 4 \\ 1 & 1 & 1 / 4 \\ 2 & 1 & 1 / 2 \end{array} & \begin{array}{c} s & x_{2} & P_{S | X_{2}} (s | x_{2}) \\ 0 & 0 & 3 / 4 \\ 1 & 0 & 1 / 4 \\ 0 & 1 & 0 \\ 1 & 1 & 1 / 2 \\ 2 & 1 & 1 / 2 \end{array} \end{matrix}

The optimal decision rule for

X_{1}

is a(0) = 0, a(1) = 1, with expected utility

u_{X_{1}} : = 1 / 2 \cdot 1 / 2 + 1 / 2 \cdot 1 / 2 = 1 / 2 .

The optimal decision rule for

X_{2}

is a(0) = 0, a(1) ∈ {0, 1} (this is not unique in this case), with expected utility

u_{X_{2}} : = 1 / 2 \cdot 1 / 4 + 1 / 2 \cdot 1 / 2 = 3 / 8 < 1 / 2 .

How can we understand this example? Some observations:

It is easy to see that $X_{2}$ has more irrelevant information than $X_{1}$ : namely, $X_{2}$ can determine relatively precisely when $S = 0$ . However, since $S = 0$ gives no utility independent of the action, this information is not relevant. It is more difficult to understand why $X_{2}$ has less relevant information than $X_{1}$ . Surprisingly, $X_{1}$ can determine more precisely when $S = 1$ : if $S = 1$ , then $X_{1}$ “detects this” (in the sense that $X_{1}$ chooses action 0) with probability $2 / 3$ . For $X_{2}$ , the same probability is only $1 / 3$ .
The conditional entropies of S given $X_{2}$ are smaller than the conditional entropies of S given $X_{1}$ :

$\begin{matrix} H (S | X_{1} = 0) & = log (2), & H (S | X_{1} = 1) & = \frac{3}{2} log (2), \\ H (S | X_{2} = 0) & = 2 log (2) - \frac{3}{2} log (3) \approx 0.4150375 log (2), & H (S | X_{2} = 1) & = log (2) . \end{matrix}$
One can see in which sense $f (S)$ captures the relevant information for $X_{1}$ , and indeed for the whole decision problem: knowing $f (S)$ is completely sufficient in order to receive the maximal utility for each state of S. However, when information is incomplete, it matters how the information about the different states of S is mixed, and two variables $X_{1}, X_{2}$ that have the same joint distribution with $f (S)$ may perform differently. It is somewhat surprising that it is the random variable that has less information about S and that is conditionally independent of S given $f (S)$ which actually performs better.

Example 2 is different from the pre-garbling Example 1 discussed in Section 2. In the latter, both channels had the same amount of information (mutual information) about S, but for the given decision problem the information provided by

κ_{2}

was more relevant than the information provided by

κ_{1}

. The first difference in Example 2 is that

X_{1}

has less mutual information about S than

X_{2}

(Lemma 2). Moreover, both channels are identical with respect to

f (S)

; i.e., they provide the same information about

f (S)

, and for

X_{1}

it is the only information it has about S. So, one could argue that

X_{2}

has additional information that does not help, but decreases the expected utility instead.

We give another example which shows that

X_{2}

can also be chosen as a deterministic function of S.

Example 3.

Consider the joint distribution

\begin{array}{c} f (s) & s & x_{1} & x_{2} & P_{f (S) S X_{1} X_{2}} \\ 0 & 0 & 0 & 0 & 1 / 6 \\ 0 & 0 & 1 & 0 & 1 / 6 \\ 0 & 1 & 0 & 1 & 1 / 6 \\ 0 & 1 & 1 & 1 & 1 / 6 \\ 1 & 2 & 1 & 1 & 1 / 3 \end{array}

The function f is as above, but now also

X_{2}

is a function of S. Again, the two channels

X_{1} \leftarrow f (S)

and

X_{2} \leftarrow f (S)

are identical, and

X_{1}

is independent of S given

f (S)

. Consider the utility function

\begin{array}{c} s & a & u (s, a) \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \\ 2 & 0 & 0 \\ 2 & 1 & - 1 \end{array}

One can show that it is optimal for an agent who relies on

X_{2}

to always choose action 0, which brings no reward (and no loss). However, when the agent knows that

X_{1}

is zero, he may safely choose action 1 and has a positive probability of receiving a positive reward.

To add another interpretation to the last example, we visualize the situation in the following Bayesian network:

X \leftarrow S \to f (S) \to X^{'},

where, as in Proposition 2 and its proof, we let

X = X_{2}

, and we consider

X^{'} = X_{1}

as an approximation of X. Then, S denotes the state of the system that we are interested in, and X denotes a given set of observables of interest.

f (S)

can be considered as a “proxy” in situations where it is difficult to observe X directly. For example, in neuroimaging, instead of directly measuring the neural activity X, one might look at an MRI signal

f (S)

. In economic and social sciences, monetary measures like the GDP are used as a proxy for prosperity.

A decision problem can always be considered as a classification problem defined by the utility

u (s, a)

by considering the optimal action as the class label of state S. Proposition 2 now says that there exist

S, X, f (S)

, and a classification problem

u (s, a)

, such that the approximated features

X^{'}

(simulated from

f (S)

) allow for a better classification (higher utility) than the original features X.

In such a situation, looking at

f (S)

will always be better than looking at either X or

X^{'}

. Thus, the paradox will only play a role in situations where it is not possible to base the decision on

f (S)

directly. For example,

f (S)

might still be too large, or X might have a more natural interpretation, making it easier to interpret for the decision taker. However, when it is better to base a decision on a proxy rather than directly on the observable of interest, this interpretation may be erroneous.

5. Information Decomposition and Le Cam Deficiency

Given two channels

κ_{1}, κ_{2}

, how can one decide whether or not

κ_{1} \leq κ_{2}

? The easiest way is to check whether the equation

κ_{1} = λ \cdot κ_{2}

has a solution

λ

that is a stochastic matrix. In the finite alphabet case, this amounts to checking the feasibility of a linear program, which is considered computationally easy. However, when the feasibility check returns a negative result, this approach does not give any more information (e.g., how far

κ_{1}

is away from being a garbling of

κ_{2}

). A function that quantifies how far

κ_{1}

is from being a garbling of

κ_{2}

is given by the (Le Cam) deficiency and its various generalizations [9]. Another such function is given by

U I

defined in [7] that accounts for the fact that the channels we consider are of the form

κ_{1} = (X_{1} \leftarrow S)

and

κ_{2} = (X_{2} \leftarrow S)

; that is, they are derived from conditional distributions of random variables. In contrast to the deficiencies,

U I

depends on the input distribution to these channels.

Let

P_{S X_{1} X_{2}}

be a joint distribution of S and the outputs

X_{1}

and

X_{2}

. Let

Δ_{P}

be the set of all joint distributions of the random variables

S, X_{1}, X_{2}

(with the same alphabets) that are compatible with the marginal distributions of

P_{S X_{1} X_{2}}

for the pairs

(S, X_{1})

and

(S, X_{2})

; i.e.,

Δ_{P} : = \{Q_{S X_{1} X_{2}} \in Δ : Q_{S X_{1}} = P_{S X_{1}}, Q_{S X_{2}} = P_{S X_{2}}\} .

In other words,

Δ_{P}

consists of all joint distributions that are compatible with

κ_{1}

and

κ_{2}

and that have the same distribution for S as

P_{S X_{1} X_{2}}

. Consider the function

\begin{matrix} U I (S; X_{1} \ X_{2}) : = min_{Q \in Δ_{P}} I_{Q} (S; X_{1} | X_{2}), \end{matrix}

where

I_{Q}

denotes the conditional mutual information evaluated with respect to the the joint distribution Q. This function has the following property:

U I (S; X_{1} \ X_{2}) = 0

if and only if

κ_{1} \leq κ_{2}

[7]. Computing

U I

is a convex optimization problem. However, the condition number can be very bad, which makes the problem difficult in practice.

U I

is interpreted in [7] as a measure of the unique information that

X_{1}

conveys about S (with respect to

X_{2}

). So, for instance, with this interpretation Example 2 can be summarized as follows: neither

X_{1}

nor

X_{2}

has unique information about

f (S)

. However, both variables have unique information about S, although

X_{1}

is conditionally independent of S given

f (S)

and thus—in contrast to

X_{2}

—contains no “additional” information about S. We now apply

U I

to a parameterized version of the And gate in Example 2.

Example 4.

Figure 1a shows a heat map of

U I

computed on the set of all distributions of the form

\begin{array}{c} f (s) & s & x_{1} & x_{2} & P_{f (S) S X_{1} X_{2}} \\ 0 & 0 & 0 & 0 & 1 / 8 + 2 b \\ 0 & 1 & 0 & 0 & 1 / 8 - 2 b \\ 0 & 0 & 0 & 1 & 1 / 8 + a \\ 0 & 1 & 0 & 1 & 1 / 8 - a \\ 0 & 0 & 1 & 0 & 1 / 8 + M M a / 2 + b \\ 0 & 1 & 1 & 0 & 1 / 8 - M M a / 2 - b \\ 1 & 2 & 1 & 1 & 1 / 4 \end{array}

where

- 1 / 8 \leq a \leq 1 / 8

and

- 1 / 16 \leq b \leq 1 / 16

. This is the set of distributions of

S, X_{1}, X_{2}

that satisfy the following constraints:

1.: $X_{1}, X_{2}$ are independent;
2.: f(S) = And(X₁, X₂), where f is as in Example 2; and
3.: $X_{1}$ is independent of S given $f (S)$ .

Along the secondary diagonal

b = a / 2

, the marginal distributions of the pairs

(S, X_{1})

and

(S, X_{2})

are identical. In such a situation, the channels

(X_{1} \leftarrow S)

and

(X_{2} \leftarrow S)

are Blackwell-equivalent, and so

U I

vanishes. Further away from the diagonal, the marginal distributions differ, and

U I

grows. The maximum value is achieved at the corners for

(a, b) = (- \frac{1}{8}, \frac{1}{16})

,

(\frac{1}{8}, - \frac{1}{16})

. At the upper left corner

(a, b) = \pm (- \frac{1}{8}, \frac{1}{16})

, we recover Example 2.

Example 5.

Figure 1b shows a heat map of

U I

computed on the set of all distributions of the form

\begin{array}{c} f (s) & s & x_{1} & x_{2} & P_{f (S) S X_{1} X_{2}} \\ 0 & 0 & 0 & 0 & a^{2} / (a + b) \\ 0 & 0 & 1 & 0 & a b / (a + b) \\ 0 & 1 & 0 & 1 & a b / (a + b) \\ 0 & 1 & 1 & 1 & b^{2} / (a + b) \\ 1 & 2 & 1 & 1 & 1 - a - b \end{array}

where

a, b \geq 0

and

a + b \leq 1

. This extends Example 3, which is recovered for

a = b = \frac{1}{3}

. This is the set of distributions of

S, X_{1}, X_{2}

that satisfy the following constraints:

1.: $X_{2}$ is a function of S, where the function is as in Example 3.
2.: $X_{1}$ is independent of S given $f (S)$ .
3.: The channels $X_{1} \leftarrow f (S)$ and $X_{2} \leftarrow f (S)$ are identical.

Acknowledgments

We thank the participants of the PID workshop at FIAS in Frankfurt in December 2016 for many stimulating discussions on this subject. Nils Bertschinger thanks h.c. Maucher for funding his position.

Author Contributions

The research was initiated by J.R. and carried out by all authors. Computer experiments to find and analyze the examples were done by P.K.B. D.W. simplified Example 1. J.J. and N.B. added interpretation. The manuscript was written by J.R., P.K.B., E.O. and D.W. Nobody played the synthesizer. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

Blackwell, D. Equivalent Comparisons of Experiments. Ann. Math. Stat. 1953, 24, 265–272. [Google Scholar] [CrossRef]
Torgersen, E. Comparison of Statistical Experiments; Cambridge University Press: New York, NY, USA, 1991. [Google Scholar]
Le Cam, L. Comparison of Experiments—A Short Review. Stat. Probab. Game Theory 1996, 30, 127–138. [Google Scholar]
Bergmans, P. Random coding theorem for broadcast channels with degraded components. IEEE Trans. Inf. Theory 1973, 19, 197–207. [Google Scholar] [CrossRef]
Cohen, J.; Kemperman, J.; Zbăganu, G. Comparisons of Stochastic Matrices with Applications in Information Theory, Statistics, Economics, and Population Sciences; Birkhäuser: Boston, MA, USA, 1998. [Google Scholar]
Körner, J.; Marton, K. Comparison of two noisy channels. In Topics in Information Theory; Colloquia Mathematica Societatis János Bolyai: Keszthely, Hungary, 1975; Volume 16, pp. 411–423. [Google Scholar]
Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy 2014, 16, 2161–2183. [Google Scholar] [CrossRef]
Rauh, J.; Banerjee, P.K.; Olbrich, E.; Jost, J.; Bertschinger, N. On extractable shared information. arXiv, 2017; arXiv:1701.07805. [Google Scholar]
Raginsky, M. Shannon meets Blackwell and Le Cam: Channels, codes, and statistical experiments. In Proceedings of the 2011 IEEE International Symposium on Information Theory Proceedings, St. Petersburg, Russia, 31 July–5 August 2011; pp. 1220–1224. [Google Scholar]

Figure 1. Heatmaps for the function UI in (a) Example 4, and (b) Example 5.

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Rauh, J.; Banerjee, P.K.; Olbrich, E.; Jost, J.; Bertschinger, N.; Wolpert, D. Coarse-Graining and the Blackwell Order. Entropy 2017, 19, 527. https://doi.org/10.3390/e19100527

AMA Style

Rauh J, Banerjee PK, Olbrich E, Jost J, Bertschinger N, Wolpert D. Coarse-Graining and the Blackwell Order. Entropy. 2017; 19(10):527. https://doi.org/10.3390/e19100527

Chicago/Turabian Style

Rauh, Johannes, Pradeep Kr. Banerjee, Eckehard Olbrich, Jürgen Jost, Nils Bertschinger, and David Wolpert. 2017. "Coarse-Graining and the Blackwell Order" Entropy 19, no. 10: 527. https://doi.org/10.3390/e19100527

APA Style

Rauh, J., Banerjee, P. K., Olbrich, E., Jost, J., Bertschinger, N., & Wolpert, D. (2017). Coarse-Graining and the Blackwell Order. Entropy, 19(10), 527. https://doi.org/10.3390/e19100527

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Coarse-Graining and the Blackwell Order

Abstract

1. Introduction

2. Pre-Garbling

3. Pre-Garbling by Coarse-Graining

4. Examples

5. Information Decomposition and Le Cam Deficiency

Acknowledgments

Author Contributions

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI