1. Introduction
The information loss $K(f)$ associated with a measure-preserving function $f\colon (X,p)\to (Y,q)$ between finite probability spaces is given by the Shannon entropy difference $K(f)=H(p)-H(q)$, where $H(p)=-\sum_{x\in X}p_x\log p_x$ is the Shannon entropy of $p$ (and similarly for $q$). In [1], Baez, Fritz, and Leinster proved that the information loss satisfies, and is uniquely characterized up to a non-negative multiplicative factor by, the following conditions:
- 0. Positivity: $K(f)\geq 0$ for all $f$. This says that the information loss associated with a deterministic process is always non-negative.
- 1. Functoriality: $K(g\circ f)=K(g)+K(f)$ for every composable pair $(f,g)$ of measure-preserving maps. This says that the information loss of two successive processes is the sum of the information losses associated with each process.
- 2. Convex Linearity: $K(\lambda f\oplus(1-\lambda)g)=\lambda K(f)+(1-\lambda)K(g)$ for all $\lambda\in[0,1]$. This says that the information loss associated with tossing a (possibly unfair) coin in deciding amongst two processes is the associated weighted sum of their information losses.
- 3. Continuity: $K(f)$ is a continuous function of $f$. This says that the information loss does not change much under small perturbations (i.e., is robust with respect to errors).
As measure-preserving functions may be viewed as deterministic stochastic maps, it is natural to ask whether there exist extensions of the Baez–Fritz–Leinster (BFL) characterization of information loss to maps that are inherently random (i.e., stochastic) in nature. In particular, what information-theoretic quantity captures such an information loss in this larger category?
This question is answered in the present work. Namely, we extend the BFL characterization theorem, which is valid on deterministic maps, to the larger category of stochastic maps. In doing so, we also find a characterization of the conditional entropy. Although the resulting extension is not functorial on the larger category of stochastic maps, we formalize a weakening of functoriality that restricts to functoriality on deterministic maps. This weaker notion of functoriality is definable in any Markov category [2,3], and it provides a key axiom in our characterization.
To explain how we arrive at our characterization, let us first recall the definition of stochastic maps between finite probability spaces, of which the measure-preserving functions are a special case. A stochastic map $f\colon (X,p)\to (Y,q)$ associates with every $x\in X$ a probability distribution $f_x$ on $Y$ such that $q_y=\sum_{x\in X}f_{yx}\,p_x$, where $f_{yx}$ is the distribution $f_x$ evaluated at $y\in Y$. In terms of information flow, the space $(X,p)$ may be thought of as a probability distribution on the set of inputs for a communication channel described by the stochastic matrix $(f_{yx})$, while $q$ is then thought of as the induced distribution on the set of outputs of the channel.
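For readers who prefer a computational picture, the following minimal Python sketch (the matrix and distribution are illustrative assumptions, not data from the article) represents a stochastic map as a column-stochastic matrix and computes the induced output distribution:

```python
import numpy as np

# A stochastic map f : X -> Y stored as a column-stochastic matrix:
# f[y, x] = probability that input x is sent to output y.
f = np.array([[0.7, 0.2, 0.0],
              [0.3, 0.8, 1.0]])    # |Y| = 2 rows, |X| = 3 columns

p = np.array([0.5, 0.25, 0.25])    # prior distribution p on the inputs X

q = f @ p                          # induced distribution q on the outputs Y
print(q, q.sum())                  # [0.4 0.6] 1.0
```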
Extending the information loss functor by assigning $H(p)-H(q)$ to any stochastic map $f\colon (X,p)\to (Y,q)$ would indeed result in an assignment that satisfies conditions 1–3 listed above. However, it would no longer be positive, and the interpretation as an information loss would be gone. Furthermore, no additional information about the stochasticity of the map $f$ would be used in determining this assignment. In order to guarantee positivity, an additional term, depending on the stochasticity of $f$, is needed. This term is provided by the conditional entropy of $f$ given $p$ and is given by the non-negative real number
$$H(f|p):=\sum_{x\in X}p_x\,H(f_x),$$
where $H(f_x)$ is the Shannon entropy of the distribution $f_x$ on $Y$ (in the case that $(X,p)$ and $(Y,q)$ are the probability spaces associated with the alphabets of random variables $X$ and $Y$, then $H(f|p)$ coincides with the conditional entropy $H(Y|X)$ [4]). If $f$ is in fact deterministic, i.e., if $f_x$ is a point-mass distribution for all $x\in X$, then $H(f_x)=0$ for all $x\in X$. As such, $H(f|p)$ is a measure of the uncertainty (or randomness) of the outputs of $f$ averaged over the prior distribution $p$ on the set $X$ of its inputs. Indeed, $H(f|p)$ is maximized precisely when $f_x$ is the uniform distribution on $Y$ for all $x\in X$.
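As a small numerical illustration of this quantity (a sketch with made-up data, reusing the matrix conventions from the previous snippet):

```python
import numpy as np

def shannon_entropy(dist):
    """H(dist) with the convention 0 * log(0) = 0 (natural logarithm)."""
    dist = np.asarray(dist, dtype=float)
    nz = dist[dist > 0]
    return float(-np.sum(nz * np.log(nz)))

def conditional_entropy(f, p):
    """H(f|p) = sum_x p_x * H(f_x): the average output uncertainty of f."""
    return float(sum(p[x] * shannon_entropy(f[:, x]) for x in range(len(p))))

f = np.array([[0.7, 0.2, 0.0],
              [0.3, 0.8, 1.0]])
p = np.array([0.5, 0.25, 0.25])

print(conditional_entropy(f, p))            # > 0 since f is genuinely stochastic
print(conditional_entropy(np.eye(3), p))    # 0.0 for a deterministic map
```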
Therefore, given a stochastic map $f\colon (X,p)\to (Y,q)$, we call
$$K(f):=H(p)-H(q)+H(f|p)$$
the conditional information loss of $f$ (the same letter $K$ is used here because it agrees with the Shannon entropy difference when $f$ is deterministic). As $H(f|p)=0$ whenever $f$ is deterministic, the conditional information loss restricts to the category of measure-preserving functions as the information loss functor of Baez, Fritz, and Leinster, while also satisfying conditions 0, 2, and 3 (i.e., positivity, convex linearity, and continuity) on the larger category of stochastic maps. However, conditional information loss is not functorial in general, and while this may seem like a defect at first glance, we prove that there is no extension of the information loss functor that remains functorial on the larger category of stochastic maps if the positivity axiom is to be preserved, thus retaining an interpretation as information loss. In spite of this, conditional information loss does satisfy a weakened form of functoriality, which we briefly describe now.
A pair $(f,g)$ of composable stochastic maps $(X,p)\xrightarrow{f}(Y,q)\xrightarrow{g}(Z,r)$ is a.e. coalescable if and only if for every pair of elements $x\in X$ and $z\in Z$ for which $p_x>0$ and $(g\circ f)_{zx}>0$, there exists a unique $y\in Y$ such that $f_{yx}>0$ and $g_{zy}>0$. Intuitively, this says that the information about the intermediate step can be recovered given knowledge of the input and output. In particular, if $f$ is deterministic, then the pair $(f,g)$ is a.e. coalescable (for obvious reasons, since knowing $x$ alone is enough to determine the intermediate value). However, there are many other situations where a pair could be a.e. coalescable even though the maps need not be deterministic. With this definition in place (which we also generalize to the setting of arbitrary Markov categories), we replace functoriality with the following weaker condition.
- 1*. Semi-functoriality: $K(g\circ f)=K(g)+K(f)$ for every a.e. coalescable pair $(f,g)$ of stochastic maps. This says that the conditional information loss of two successive processes is the sum of the conditional information losses associated with each process, provided that the information in the intermediate step can always be recovered.
Replacing functoriality with semi-functoriality is not enough to characterize the conditional information loss. However, it comes quite close, as only one more axiom is needed. Assuming positivity, semi-functoriality, convex linearity, and continuity, there are several equivalent axioms that may be stipulated to characterize the conditional information loss. To explain the first option, we introduce a convenient factorization of every stochastic map $f\colon (X,p)\to (Y,q)$. The bloom-shriek factorization of $f$ is given by the decomposition $f=\pi_Y\circ ¡_f$, where $¡_f\colon X\to X\times Y$ is the bloom of $f$, whose value at $x\in X$ is the probability measure on $X\times Y$ given by sending $(x',y)$ to $\delta_{x'x}\,f_{yx}$, where $\delta_{x'x}$ is the Kronecker delta. In other words, $¡_f$ records each of the probability measures $f_x$ on a copy of $Y$ indexed by $x\in X$. A visualization of the bloom of $f$ is given in Figure 1a. When one is given the additional data of probability measures $p$ and $q$ on $X$ and $Y$, respectively, then Figure 1b illustrates the bloom-shriek factorization of $f$. From this point of view, $¡_f$ keeps track of the information encoded in both $p$ and $f$, while the projection map $\pi_Y\colon X\times Y\to Y$ forgets, or loses, some of this information.
With this in mind, our final axiom to characterize the conditional information loss is
- 4(a). Reduction: $K(f)=K(\pi_Y)$, where $f=\pi_Y\circ ¡_f$ is the bloom-shriek factorization of $f$. This says that the conditional information loss of $f$ equals the information loss of the projection $\pi_Y$ computed using the associated joint distribution on $X\times Y$.
Note that this axiom describes how K is determined by its action on an associated class of deterministic morphisms. These slightly modified axioms, namely, semi-functoriality, convex linearity, continuity, and reduction, characterize the conditional information loss and therefore extend Baez, Fritz, and Leinster’s characterization of information loss. A much simpler axiom that may be invoked in place of the reduction axiom which also characterizes conditional information loss is the following.
- 4(b). Blooming: $K(¡_q)=0$, where $¡_q\colon(\bullet,1)\to(Y,q)$ is the unique map from a one-point probability space to $(Y,q)$. This says that if a process begins with no prior information, then there is no information to be lost in the process.
The conditional entropy itself can be extracted from the conditional information loss by a process known as Bayesian inversion, which we now briefly recall. Given a stochastic map $f\colon(X,p)\to(Y,q)$, there exists a stochastic map $\bar f\colon(Y,q)\to(X,p)$ such that $\bar f_{xy}\,q_y=f_{yx}\,p_x$ for all $x\in X$ and $y\in Y$ (the stochastic map $\bar f$ is the almost everywhere unique conditional probability so that Bayes' rule holds). Such a map $\bar f$ is called a Bayesian inverse of $f$. The Bayesian inverse can be visualized using the bloom-shriek factorization because it itself has a bloom-shriek factorization $\bar f=\pi_X\circ ¡_{\bar f}$. This is obtained by finding the stochastic maps in the opposite direction of the arrows so that they reproduce the appropriate volumes of the water droplets.
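Concretely, a Bayesian inverse can be computed by reweighting via Bayes' rule, as in the following sketch (illustrative data; on outputs of probability zero the inverse is only determined almost everywhere, and the snippet arbitrarily picks the uniform distribution there):

```python
import numpy as np

def bayesian_inverse(f, p):
    """Return (fbar, q) with fbar[x, y] * q[y] = f[y, x] * p[x] (Bayes' rule)."""
    q = f @ p
    fbar = np.empty((f.shape[1], f.shape[0]))
    for y in range(f.shape[0]):
        fbar[:, y] = f[y, :] * p / q[y] if q[y] > 0 else 1.0 / f.shape[1]
    return fbar, q

f = np.array([[0.7, 0.2, 0.0],
              [0.3, 0.8, 1.0]])
p = np.array([0.5, 0.25, 0.25])
fbar, q = bayesian_inverse(f, p)
assert np.allclose(fbar * q, (f * p).T)   # fbar[x, y] * q[y] == f[y, x] * p[x]
```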
Given this perspective on Bayesian inversion, we prove that the conditional entropy of $f$ equals the conditional information loss of its Bayesian inverse $\bar f$. Moreover, since the conditional information loss of $\bar f$ is just the information loss of the projection $\pi_X$, this indicates how the conditional entropy and conditional information loss are the ordinary information losses associated with the two projections $\pi_X$ and $\pi_Y$ in Figure 1b. This duality also provides an interesting perspective on conditional entropy and its characterization. Indeed, using Bayesian inversion, we also characterize the conditional entropy as the unique assignment $F$ sending measure-preserving stochastic maps between finite probability spaces to real numbers satisfying conditions 0, 1*, 2, and 3 above, but with a new axiom that reads as follows.
- 4(c). Entropic Bayes' Rule: $F(f)+F(¡_p)=F(\bar f)+F(¡_q)$ for all $f\colon(X,p)\to(Y,q)$. This is an information-theoretic analogue of Bayes' rule, which reads $f_{yx}\,p_x=\bar f_{xy}\,q_y$ for all $x\in X$ and $y\in Y$, or in more traditional probabilistic notation, $P(y|x)\,P(x)=P(x|y)\,P(y)$.
In other words, we obtain a Bayesian characterization of the conditional entropy. This provides an entropic and information-theoretic description of Bayes' rule from the Markov category perspective, in a way that we interpret as answering an open question of Fritz [6].
2. Categories of Stochastic Maps
In the first few sections, we define all the concepts involved in proving that the conditional information loss satisfies the properties that we will later prove characterize it. This section introduces the domain category and its convex structure.
Definition 1. Let X and Y be finite sets. A stochastic map $f\colon X\to Y$ associates a probability measure $f_x$ on $Y$ to every $x\in X$. If $f$ is such that $f_x$ is a point-mass distribution for every $x\in X$, then f is said to be deterministic.
Notation 1. Given a stochastic map $f\colon X\to Y$ (also written as $X\xrightarrow{f}Y$), the value $f_x(y)$ will be denoted by $f_{yx}$. As there exists a canonical bijection between deterministic maps of the form $X\to Y$ and functions from $X$ to $Y$, deterministic maps from X to Y will be denoted by the functional notation $f(x)$.
Definition 2. A stochastic map of the form $p\colon\bullet\to X$ from a single-element set to a finite set X is a single probability measure on X. Its unique value at x will be denoted by $p_x$ for all $x\in X$. The set $\{x\in X : p_x=0\}$ will be referred to as the nullspace of p.
Definition 3. Let FinStoch be the category of stochastic maps between finite sets. Given a finite set X, the identity map of X in FinStoch corresponds to the identity function $\mathrm{id}_X$. Second, given stochastic maps $f\colon X\to Y$ and $g\colon Y\to Z$, the composite $g\circ f\colon X\to Z$ is given by the Chapman–Kolmogorov equation
$$(g\circ f)_{zx}=\sum_{y\in Y}g_{zy}\,f_{yx}.$$
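With the matrix convention used in the earlier snippets, this composition is simply matrix multiplication, as the following sketch (with illustrative entries) checks:

```python
import numpy as np

# Chapman-Kolmogorov: (g o f)[z, x] = sum_y g[z, y] * f[y, x],
# i.e., composition of column-stochastic matrices is matrix multiplication.
f = np.array([[0.7, 0.2],      # f : X -> Y with |X| = |Y| = 2
              [0.3, 0.8]])
g = np.array([[0.5, 0.1],      # g : Y -> Z with |Z| = 3
              [0.4, 0.2],
              [0.1, 0.7]])

gf = g @ f
print(gf)
print(gf.sum(axis=0))          # each column sums to 1, so g o f is again stochastic
```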
Definition 4. Let X be a finite set. The copy of X is the diagonal embedding $\Delta_X\colon X\to X\times X$, and the discard of X is the unique map from X to the terminal object • in FinStoch, which will be denoted by $!_X$. If Y is another finite set, the swap map is the map $\sigma\colon X\times Y\to Y\times X$ given by $\sigma(x,y)=(y,x)$. Given morphisms $f\colon X\to Y$ and $g\colon X'\to Y'$ in FinStoch, the product of f and g is the stochastic map $f\times g\colon X\times X'\to Y\times Y'$ given by
$$(f\times g)_{(y,y')(x,x')}=f_{yx}\,g_{y'x'}.$$
The product of stochastic maps endows FinStoch with the structure of a monoidal category. Together with the copy, discard, and swap maps, FinStoch is a Markov category [2,3].
Definition 5. Let FinPS (this stands for "finite probabilities and stochastic maps") be the co-slice category $\bullet/\mathbf{FinStoch}$, i.e., the category whose objects are pairs $(X,p)$ consisting of a finite set X equipped with a probability measure p, and a morphism from $(X,p)$ to $(Y,q)$ is a stochastic map $f\colon X\to Y$ such that $q_y=\sum_{x\in X}f_{yx}\,p_x$ for all $y\in Y$. The subcategory of deterministic maps in FinPS will then be denoted by FinPD (which stands for "finite probabilities and deterministic maps"). A pair $(f,g)$ of morphisms in FinPS is said to be a composable pair iff $g\circ f$ exists.
Note that the category FinPD was called FinProb in [1].
Remark 1. Though it is often the case that we will denote a morphism $(X,p)\xrightarrow{f}(Y,q)$ in FinPS simply by f, such notation is potentially ambiguous, as the morphism $(X,p)\xrightarrow{f}(Y,q)$ is distinct from the morphism $(X,p')\xrightarrow{f}(Y,q')$ whenever $p\neq p'$. As such, we will only employ the shorthand of denoting a morphism in FinPS by its underlying stochastic map whenever the source and target of the morphism are clear from the context.
Lemma 1. The object $(\bullet,1)$ given by a single-element set equipped with the unique probability measure is a zero object (i.e., terminal and initial) in FinPS.
Definition 6. Given an object $(X,p)$ in FinPS, the shriek and bloom of p are the unique maps to and from $(\bullet,1)$, respectively, which will be denoted $!_p\colon(X,p)\to(\bullet,1)$ and $¡_p\colon(\bullet,1)\to(X,p)$ (the former is deterministic, while the latter is stochastic). The underlying stochastic maps associated with $!_p$ and $¡_p$ are $!_X$ and $p$, respectively.
Example 1. Since $(\bullet,1)$ is a zero object, given any two objects $(X,p)$ and $(Y,q)$, there exists at least one morphism $(X,p)\to(Y,q)$, namely the composite $¡_q\circ{!_p}$.
Definition 7. Let $(X,p)\xrightarrow{f}(Y,q)$ be a morphism in FinPS. The joint distribution associated with f is the probability measure $\vartheta(f)$ on $X\times Y$ given by $\vartheta(f)_{(x,y)}=f_{yx}\,p_x$.
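In matrix terms, the joint distribution is obtained by weighting each column of the stochastic matrix by the prior, as in this short sketch (illustrative data):

```python
import numpy as np

# Joint distribution on X x Y induced by a morphism f : (X, p) -> (Y, q):
# joint[x, y] = f[y, x] * p[x].  Its marginals recover p and q = f @ p.
f = np.array([[0.7, 0.2, 0.0],
              [0.3, 0.8, 1.0]])
p = np.array([0.5, 0.25, 0.25])

joint = (f * p).T
assert np.allclose(joint.sum(axis=1), p)        # marginal on X is p
assert np.allclose(joint.sum(axis=0), f @ p)    # marginal on Y is q
```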
It is possible to take convex combinations of both objects and morphisms in FinPS, and such assignments will play a role in our characterization of conditional entropy.
Definition 8. Let $p\colon\bullet\to X$ be a probability measure and let $\{(Y_x,q_x)\}_{x\in X}$ be a collection of objects in FinPS indexed by X. The p-weighted convex sum $\bigoplus_{x\in X}p_x\,(Y_x,q_x)$ is defined to be the set $\coprod_{x\in X}Y_x$ equipped with the probability measure given by
$$\Big(\bigoplus_{x\in X}p_x\,q_x\Big)_y=p_x\,(q_x)_y\quad\text{for }y\in Y_x.$$
In addition, if $\{(Y_x,q_x)\xrightarrow{f_x}(Z_x,r_x)\}_{x\in X}$ is a collection of morphisms in FinPS indexed by X, the p-weighted convex sum $\bigoplus_{x\in X}p_x f_x$ is the morphism $\bigoplus_{x\in X}p_x\,(Y_x,q_x)\to\bigoplus_{x\in X}p_x\,(Z_x,r_x)$ which acts as $f_x$ on the summand indexed by $x$.

3. The Baez–Fritz–Leinster Characterization of Information Loss
In [1], Baez, Fritz, and Leinster (BFL) characterized the Shannon entropy difference associated with measure-preserving functions between finite probability spaces as the only non-vanishing, continuous, convex linear functor from FinPD to the non-negative reals (up to a multiplicative constant). It is then natural to ask whether there exist either extensions or analogues of their result by including the non-deterministic morphisms from the larger category FinPS. Before delving deeper into such an inquiry, we first recall in detail the characterization theorem of BFL.
Definition 9. Let $\mathbb{B}\mathbb{R}$ be the convex category consisting of a single object and whose set of morphisms is $\mathbb{R}$. The composition in $\mathbb{B}\mathbb{R}$ is given by addition. Convex combinations of morphisms are given by ordinary convex combinations of numbers. The subcategory of non-negative reals will be denoted $\mathbb{B}\mathbb{R}_{\geq 0}$.
In the rest of the paper, we will not necessarily assume that assignments from one category to another are functors. Nevertheless, we do assume they form (class) functions (see ([7], Section I.7) for more details). Furthermore, we assume that they respect or reflect sources and targets in the following sense. If $\mathcal{C}$ and $\mathcal{D}$ are two categories, all functions $F\colon\mathcal{C}\to\mathcal{D}$ are either covariant or contravariant in the sense that for any morphism $X\xrightarrow{f}Y$ in $\mathcal{C}$, $F(f)$ is a morphism from $F(X)$ to $F(Y)$ or from $F(Y)$ to $F(X)$, respectively. These are the only types of functions between categories we will consider in this work. As such, we therefore abuse terminology and use the term functions for such assignments throughout. If M is a commutative monoid and $\mathbb{B}M$ denotes its one-object category, then every covariant function $F\colon\mathcal{C}\to\mathbb{B}M$ is also contravariant and vice-versa.
We now define a notion of continuity for functions of the form $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$.
Definition 10. A sequence of morphisms $(X_n,p_n)\xrightarrow{f_n}(Y_n,q_n)$ in FinPS converges to a morphism $(X,p)\xrightarrow{f}(Y,q)$ if and only if the following two conditions hold.
- (a) There exists an $N\in\mathbb{N}$ for which $X_n=X$ and $Y_n=Y$ for all $n\geq N$.
- (b) The following limits hold: $\lim_{n\to\infty}p_n=p$ and $\lim_{n\to\infty}f_n=f$ (note that these limits necessarily imply $\lim_{n\to\infty}q_n=q$).
A function $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$ is continuous if and only if $\lim_{n\to\infty}F(f_n)=F(f)$ whenever $(f_n)$ is a sequence in FinPS converging to f.
Remark 2. In the subcategory FinPD, since the topology on the collection of functions from a finite set X to another finite set Y is discrete, one can equivalently assume that a sequence as in Definition 10, but this time with all $f_n$ deterministic, converges to $(X,p)\xrightarrow{f}(Y,q)$ if and only if the following two conditions hold.
- (a) There exists an $N\in\mathbb{N}$ for which $f_n=f$ for all $n\geq N$.
- (b) For $x\in X$, one has $\lim_{n\to\infty}(p_n)_x=p_x$.
In this way, our definition of convergence agrees with the definition of convergence of BFL on the subcategory FinPD [1].
Definition 11. A function $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$ is said to be convex linear if and only if for all objects $(X,p)$ in FinPS,
$$F\Big(\bigoplus_{x\in X}p_x f_x\Big)=\sum_{x\in X}p_x\,F(f_x)$$
for all collections $\{f_x\}_{x\in X}$ of morphisms in FinPS.
Definition 12. A function $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$ is said to be functorial if and only if it is in fact a functor, i.e., if and only if $F(g\circ f)=F(g)+F(f)$ for every composable pair $(f,g)$ in FinPS.
Definition 13. Let $p\colon\bullet\to X$ be a probability measure. The Shannon entropy of p is given by
$$H(p)=-\sum_{x\in X}p_x\log p_x.$$
When considering any entropic quantity, we will always adhere to the convention that $0\log 0=0$.
Definition 14. Given a map $(X,p)\xrightarrow{f}(Y,q)$ in FinPD, the Shannon entropy difference $K(f):=H(p)-H(q)$ will be referred to as the information loss of f. Information loss defines a functor $K\colon\mathbf{FinPD}\to\mathbb{B}\mathbb{R}_{\geq 0}$, henceforth referred to as the information loss functor on FinPD.
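As a quick numerical sanity check of these two definitions (illustrative data only): merging outcomes under a deterministic map can only decrease entropy, so the information loss is non-negative.

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = -sum_x p_x * log(p_x), with the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

# A deterministic map merging the first two of three points, written as a
# column-stochastic 0/1 matrix.
f = np.array([[1, 1, 0],
              [0, 0, 1]], dtype=float)
p = np.array([0.25, 0.25, 0.5])
q = f @ p                                        # pushforward: [0.5, 0.5]

print(shannon_entropy(p) - shannon_entropy(q))   # > 0: the information loss of f
```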
Theorem 1 (Baez–Fritz–Leinster [1]). Suppose $F\colon\mathbf{FinPD}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is a function which satisfies the following conditions.
- 1. F is functorial.
- 2. F is convex linear.
- 3. F is continuous.
Then F is a non-negative multiple of information loss. Conversely, the information loss functor is non-negative and satisfies conditions 1–3.
In light of Theorem 1, it is natural to question whether or not there exists a functor $\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ that restricts to FinPD as the information loss functor. It turns out that no such non-vanishing functor exists, as we prove in the following proposition.
Proposition 1. If $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is a functor, then $F(f)=0$ for all morphisms f in FinPS.
Proof. Let $(X,p)\xrightarrow{f}(Y,q)$ be a morphism in FinPS. Since F is a functor,
$$F(!_p)=F(!_q\circ f)=F(!_q)+F(f).$$
Let $(Y,q)\xrightarrow{g}(X,p)$ be any morphism in FinPS (which necessarily exists by Example 1, for instance). Then a similar calculation yields
$$F(!_q)=F(!_p\circ g)=F(!_p)+F(g).$$
Hence, $F(f)+F(g)=0$, and since both terms are non-negative, $F(f)=0$. □
4. Extending the Information Loss Functor
Proposition 1 shows it is not possible to extend the information loss functor to a functor on FinPS. Nevertheless, in this section, we define a non-vanishing function $K\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ that restricts to the information loss functor on FinPD, which we refer to as conditional information loss. While K is not functorial, we show that it satisfies many important properties, such as continuity, convex linearity, and invariance with respect to compositions with isomorphisms. Furthermore, in Section 5 we show K is functorial on a restricted class of composable pairs of morphisms (cf. Definition 18), which are definable in any Markov category. At the end of this section, we characterize conditional information loss as the unique extension of the information loss functor satisfying the reduction axiom 4(a) as stated in the introduction. In Section 8, we prove an intrinsic characterization theorem for K without reference to the deterministic subcategory FinPD inside FinPS. Appendix A provides an interpretation of the vanishing of conditional information loss in terms of correctable codes.
Definition 15. The conditional information loss of a morphism $(X,p)\xrightarrow{f}(Y,q)$ in FinPS is the real number $K(f)$ given by
$$K(f):=H(p)-H(q)+H(f|p),$$
where
$$H(f|p):=\sum_{x\in X}p_x\,H(f_x)$$
is the conditional entropy of $f$ given $p$.
Proposition 2. The function $K\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$, uniquely determined on morphisms by sending a morphism $(X,p)\xrightarrow{f}(Y,q)$ to $K(f)$, satisfies the following conditions.
- (i) $K(f)\geq 0$ for every morphism $f$ in FinPS.
- (ii) K restricted to FinPD agrees with the information loss functor (cf. Definition 14).
- (iii) K is convex linear.
- (iv) K is continuous.
- (v) Given $(X,p)\xrightarrow{f}(Y,q)$, then $K(f)=K(\pi_Y)$, where $\pi_Y\colon(X\times Y,\vartheta(f))\to(Y,q)$ is the projection and $\vartheta(f)$ is the joint distribution (cf. Definition 7).
Lemma 2. Let $(X,p)\xrightarrow{f}(Y,q)$ be a morphism in FinPS. Then $H(q)\leq H(p)+H(f|p)$.
Proof of Proposition 2.
- (i) The non-negativity of K follows from Lemma 2 and the equality $K(f)=H(p)-H(q)+H(f|p)$.
- (ii) This follows from the fact that $H(f|p)=0$ for all deterministic f.
- (iii) Let $p\colon\bullet\to X$ be a probability measure, and let $\{(Y_x,q_x)\xrightarrow{f_x}(Z_x,r_x)\}_{x\in X}$ be a collection of morphisms in FinPS indexed by X. Then the p-weighted convex sum $\bigoplus_{x\in X}p_x f_x$ is a morphism in FinPS of the form $(Y,q)\to(Z,r)$, where $Y=\coprod_{x\in X}Y_x$, $Z=\coprod_{x\in X}Z_x$, $q=\bigoplus_{x\in X}p_x q_x$, and $r=\bigoplus_{x\in X}p_x r_x$. A direct computation of the entropies of these convex sums then gives
$$K\Big(\bigoplus_{x\in X}p_x f_x\Big)=\sum_{x\in X}p_x\,K(f_x),$$
which shows that K is convex linear.
- (iv) Let $(X_n,p_n)\xrightarrow{f_n}(Y_n,q_n)$ be a sequence (indexed by $n\in\mathbb{N}$) of probability-preserving stochastic maps such that $X_n=X$ and $Y_n=Y$ for large enough n, and where $\lim_{n\to\infty}p_n=p$ and $\lim_{n\to\infty}f_n=f$. Then
$$\lim_{n\to\infty}K(f_n)=\lim_{n\to\infty}\Big(H(p_n)-H(q_n)+\sum_{x\in X}(p_n)_x\,H((f_n)_x)\Big)=H(p)-H(q)+H(f|p)=K(f),$$
where the last equality follows from the fact that the limit and the sum (which is finite) can be interchanged and all expressions are continuous on $[0,1]$.
- (v) This follows from the equality $H(\vartheta(f))=H(p)+H(f|p)$ and the fact that the pushforward of $\vartheta(f)$ along the projection $\pi_Y$ is $q$. □
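Item (v) is also easy to check numerically; the following sketch (illustrative data, building on the earlier snippets) verifies that $K(f)$ coincides with the entropy drop of the projection onto $Y$ under the joint distribution:

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def conditional_entropy(f, p):
    return float(sum(p[x] * shannon_entropy(f[:, x]) for x in range(len(p))))

f = np.array([[0.7, 0.2, 0.0],
              [0.3, 0.8, 1.0]])
p = np.array([0.5, 0.25, 0.25])
q = f @ p

# Conditional information loss K(f) = H(p) - H(q) + H(f|p) ...
K = shannon_entropy(p) - shannon_entropy(q) + conditional_entropy(f, p)

# ... equals the information loss of the projection onto Y computed with the
# joint distribution on X x Y (item (v)), and is non-negative (item (i)).
joint = (f * p).T
assert np.isclose(K, shannon_entropy(joint) - shannon_entropy(q))
assert K >= 0
```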
Remark 3. Since the conditional entropy vanishes for deterministic morphisms, conditional information loss restricts to FinPD as the information loss functor. It is important to note that if the term $H(f|p)$ were not included in the expression for $K(f)$, then the inequality $K(f)\geq 0$ would fail in general. When f is deterministic, Baez, Fritz, and Leinster proved $H(p)\geq H(q)$. However, when f is stochastic, this inequality does not hold in general. This has to do with the fact that stochastic maps may increase entropy, whereas deterministic maps always decrease it (while this claim holds in the classical setting as stated, it no longer holds for quantum systems [8]). As such, the term $H(f|p)$ is needed to retain non-negativity as one attempts to extend BFL's functor K on FinPD to a function on FinPS. Item (v) of Proposition 2 says that the conditional information loss of a map $(X,p)\xrightarrow{f}(Y,q)$ in FinPS is the information loss of the deterministic map $\pi_Y\colon(X\times Y,\vartheta(f))\to(Y,q)$ in FinPD, so that the conditional information loss of a morphism in FinPS may always be reduced to the information loss of a deterministic map in FinPD naturally associated with it and having the same target. This motivates the following definition.
Definition 16. A function $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$ is reductive if and only if $F(f)=F(\pi_Y)$ for every morphism $(X,p)\xrightarrow{f}(Y,q)$ in FinPS (cf. Proposition 2 item (v) for notation).
Proposition 3 (Reductive characterization of conditional information loss). Let $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ be a function satisfying the following conditions.
- (i)
F restricted to FinPD is functorial, convex linear, and continuous.
- (ii)
F is reductive.
Then F is a non-negative multiple of conditional information loss. Conversely, conditional information loss satisfies conditions (i) and (ii).
Proof. This follows immediately from Theorem 1 and item (v) of Proposition 2. □
In what follows, we will characterize conditional information loss without any explicit reference to the subcategory FinPD or the information loss functor of Baez, Fritz, and Leinster. To do this, we first need to develop some machinery.
5. Coalescable Morphisms and Semi-Functoriality
While conditional information loss is not functorial on FinPS, we know it acts functorially on deterministic maps. As such, it is natural to ask for which pairs of composable stochastic maps the conditional information loss acts functorially. In this section, we answer this question, and then we use our result to define a property of functions $\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$ that is a weakening of functoriality, and which we refer to as semi-functoriality. Our definitions are valid in any Markov category (cf. Appendix B).
Definition 17. A deterministic map $h\colon X\times Z\to Y$ is said to be a mediator for the composable pair $(X,p)\xrightarrow{f}(Y,q)\xrightarrow{g}(Z,r)$ in FinPS if and only if
$$g_{zy}\,f_{yx}=\delta_{y\,h(x,z)}\,(g\circ f)_{zx}\qquad\text{for all }y\in Y,\ z\in Z,\text{ and all }x\in X\text{ with }p_x>0.\qquad(1)$$
If in fact Equation (1) holds for all $x\in X$, then h is said to be a strong mediator for the composable pair in FinStoch.
Remark 4. Mediators do not exist for general composable pairs, as one can see by considering any composable pair such that (cf. Definitions 7 and 13).
Proposition 4. Let $(X,p)\xrightarrow{f}(Y,q)\xrightarrow{g}(Z,r)$ be a composable pair of morphisms in FinPS. Then the following statements are equivalent.
- (a) For every $x\in X$ and $z\in Z$, there exists at most one $y\in Y$ such that $g_{zy}\,f_{yx}\,p_x>0$.
- (b) The pair $(f,g)$ admits a mediator $h\colon X\times Z\to Y$.
- (c) There exists a function $h\colon X\times Z\to Y$ such that
$$g_{zy}\,f_{yx}\,p_x=\delta_{y\,h(x,z)}\,(g\circ f)_{zx}\,p_x\qquad\text{for all }x\in X,\ y\in Y,\ z\in Z.\qquad(2)$$
Proof. ((a)⇒(b)) For every $(x,z)\in X\times Z$ for which such a $y$ exists, set $h(x,z):=y$. If no such $y$ exists or if $p_x=0$, set $h(x,z)$ to be anything. Then h is a mediator for $(f,g)$.
((b)⇒(c)) Let h be a mediator for $(f,g)$. Since (2) holds automatically for $p_x=0$, suppose $p_x>0$, in which case (2) is equivalent to $g_{zy}\,f_{yx}=\delta_{y\,h(x,z)}\,(g\circ f)_{zx}$ for all $y\in Y$ and $z\in Z$. This follows from Equation (1) and the fact that h is a function.
((c)⇒(a)) Let $p_x>0$ and suppose $g_{zy}\,f_{yx}>0$. If h is the mediator, then $y=h(x,z)$. But since $(g\circ f)_{zx}=\sum_{y'\in Y}g_{zy'}\,f_{y'x}$ for all $z\in Z$, there is only one non-vanishing term in this sum, and it is precisely $g_{z\,h(x,z)}\,f_{h(x,z)x}$. □
Theorem 2 (Functoriality of Conditional Entropy). Let $(X,p)\xrightarrow{f}(Y,q)\xrightarrow{g}(Z,r)$ be a composable pair of morphisms in FinPS. Then
$$H(g\circ f|p)=H(f|p)+H(g|q)$$
holds if and only if there exists a mediator for $(f,g)$. We first prove two lemmas.
Lemma 3. Let be a pair of composable morphisms. ThenIn particular, if and only if . Proof of Lemma 3. On components,
. Hence,
Note that this equality still holds if
or
as each step in this calculation accounted for such possibilities. □
Lemma 4. Let be a pair of composable morphisms in . ThenNote that the order of the sums matters in this expression and also note that it is always well-defined since implies . Proof of Lemma 4. For convenience, temporarily set
. Then
which proves the claim due to the definition of the composition of stochastic maps. □
Proof of Theorem 2. Temporarily set
. In addition, note that the set of all
and
can be given a more explicit description in terms of the joint distribution
associated with the composite
and prior
p, namely
. Then,
(⇒) Suppose
, which is equivalent to Equation (
3) by Lemma 3. Then since each term in the sum from Lemma 4 is non-negative,
Hence, fix such an
. The expression here vanishes if and only if
Hence, for every
and
, there exists a unique
such that
. But by (
5), this means that for every
, there exists a unique
such that
. This defines a function
which can be extended in an
s-a.e. unique manner to a function
We now show the function
h is in fact a mediator for the composable pair
. The equality clearly holds if
since both sides vanish. Hence, suppose that
. Given
, the left-hand-side of (2) is given by
by Equation (
6). Similarly, if
and
, then
for all
because otherwise
would be nonzero. If instead
, then
and
for all
by (
6). Therefore, (
2) holds.
(⇐) Conversely, suppose a mediator
h exists and let
be the stochastic map given on components by
Then
as desired. □
Corollary 1 (Functoriality of Conditional Information Loss). Let $(X,p)\xrightarrow{f}(Y,q)\xrightarrow{g}(Z,r)$ be a composable pair of morphisms in FinPS. Then $K(g\circ f)=K(g)+K(f)$ if and only if there exists a mediator for the pair $(f,g)$.
Proof. Since the Shannon entropy difference is always functorial, the conditional information loss is functorial on a pair of morphisms if and only if the conditional entropy is functorial on that pair. Theorem 2 then completes the proof. □
Example 2. In the notation of Theorem 2, suppose that f is a.e. deterministic, which means $f_{yx}=\delta_{y\,f(x)}$ for all $x\in X$ with $p_x>0$, for some function f (abusive notation is used). In this case, the deviation from functoriality, (4), simplifies to zero. Therefore, if f is p-a.e. deterministic, $H(g\circ f|p)=H(f|p)+H(g|q)$. In this case, the mediator is given by $h(x,z)=f(x)$.
Definition 18. A pair $(f,g)$ of composable morphisms in FinPS is called a.e. coalescable if and only if it admits a mediator $h\colon X\times Z\to Y$. Similarly, a pair of composable morphisms in FinStoch is called coalescable iff it admits a strong mediator $h\colon X\times Z\to Y$.
Remark 5. Example 2 showed that if $f$ is p-a.e. deterministic, then the pair $(f,g)$ is a.e. coalescable for any g. In particular, every pair of composable morphisms in FinPD is coalescable.
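The a.e. coalescability condition is straightforward to test numerically. The sketch below (a direct transcription of the description given in the introduction, with illustrative matrices) checks whether the intermediate value can always be recovered from an input–output pair:

```python
import numpy as np

def is_ae_coalescable(f, g, p, tol=1e-12):
    """Check whether the composable pair (f, g) is a.e. coalescable w.r.t. p:
    for every input x with p[x] > 0 and every output z reachable from x,
    there must be a unique intermediate y with f[y, x] > 0 and g[z, y] > 0."""
    gf = g @ f
    for x in range(f.shape[1]):
        if p[x] <= tol:
            continue
        for z in range(g.shape[0]):
            if gf[z, x] <= tol:
                continue
            witnesses = [y for y in range(f.shape[0])
                         if f[y, x] > tol and g[z, y] > tol]
            if len(witnesses) != 1:
                return False
    return True

p = np.array([0.5, 0.5])
f = np.array([[1.0, 0.0],      # deterministic f: x0 -> y0, x1 -> y1
              [0.0, 1.0]])
g = np.array([[0.5, 0.3],      # an arbitrary stochastic g : Y -> Z
              [0.5, 0.7]])
print(is_ae_coalescable(f, g, p))                      # True: f is deterministic
print(is_ae_coalescable(np.full((2, 2), 0.5), g, p))   # False in general
```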
In light of Theorem 2 and Corollary 1, we make the following definition, which will serve as one of the axioms in our later characterizations of both conditional information loss and conditional entropy.
Definition 19. A function $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$ is said to be semi-functorial iff $F(g\circ f)=F(g)+F(f)$ for every a.e. coalescable pair $(f,g)$ in FinPS.
Example 3. By Theorem 2 and Corollary 1, conditional information loss and conditional entropy are both semi-functorial.
Proposition 5. Suppose $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is semi-functorial. Then the restriction of F to FinPD is functorial. In particular, if F is, in addition, convex linear, continuous, and reductive, then F is a non-negative multiple of conditional information loss.
Proof. By Example 2, every pair of composable morphisms in FinPD is a.e. coalescable. Therefore, F is functorial on FinPD. The second claim then follows from Proposition 3. □
The following lemma will be used in later sections and serves to illustrate some examples of a.e. coalescable pairs.
Lemma 5. Let $(W,t)\xrightarrow{e}(X,p)\xrightarrow{f}(Y,q)\xrightarrow{g}(Z,r)$ be a triple of composable morphisms with e deterministic and g invertible. Then each of the following pairs is a.e. coalescable:
- (i) $(e,f)$
- (ii) $(f,g)$
- (iii) $(e,g\circ f)$
- (iv) $(f\circ e,g)$
Proof. The proof that $(e,f)$ is a.e. coalescable was provided (in a stronger form) in Example 2. To see that $(f,g)$ is a.e. coalescable, note that since g is an isomorphism we have $f=g^{-1}\circ(g\circ f)$. Thus, $h(x,z):=g^{-1}(z)$ is a mediator function for $(f,g)$, so $(f,g)$ is a.e. coalescable. The last two claims follow from the proofs of the first two claims. □
6. Bayesian Inversion
In this section, we recall the concepts of a.e. equivalence and Bayesian inversion phrased in a categorical manner [2,3,9], as they will play a significant role moving forward.
Definition 20. Let $f$ and $g$ be two morphisms $(X,p)\to(Y,q)$ in FinPS with the same source and target. Then f and g are said to be almost everywhere equivalent (or p-a.e. equivalent) if and only if $f_x=g_x$ for every $x\in X$ with $p_x>0$. In such a case, the p-a.e. equivalence of f and g will be denoted $f\sim_p g$.
Theorem 3 (Bayesian Inversion [2,9,10]). Let $(X,p)\xrightarrow{f}(Y,q)$ be a morphism in FinPS. Then there exists a morphism $(Y,q)\xrightarrow{\bar f}(X,p)$ such that $\bar f_{xy}\,q_y=f_{yx}\,p_x$ for all $x\in X$ and $y\in Y$. Furthermore, for any other morphism $(Y,q)\xrightarrow{g}(X,p)$ satisfying this condition, $g\sim_q\bar f$.
Definition 21. The morphism $\bar f$ appearing in Theorem 3 will be referred to as a Bayesian inverse of $f$. It follows that $\bar f_{xy}=f_{yx}\,p_x/q_y$ for all $y\in Y$ with $q_y>0$.
Proposition 6. Bayesian inversion satisfies the following properties.
- (i) Suppose $f$ and $g$ are p-a.e. equivalent morphisms $(X,p)\to(Y,q)$, and let $\bar f$ and $\bar g$ be Bayesian inverses of f and g, respectively. Then $\bar f\sim_q\bar g$.
- (ii) Given two morphisms $(X,p)\xrightarrow{f}(Y,q)$ and $(Y,q)\xrightarrow{g}(X,p)$ in FinPS, then f is a Bayesian inverse of g if and only if g is a Bayesian inverse of f.
- (iii) Let $\bar f$ be a Bayesian inverse of $(X,p)\xrightarrow{f}(Y,q)$, and let $\sigma\colon X\times Y\to Y\times X$ be the swap map (as in Definition 4). Then $\vartheta(\bar f)=\sigma\circ\vartheta(f)$.
- (iv) Let $(X,p)\xrightarrow{f}(Y,q)\xrightarrow{g}(Z,r)$ be a composable pair of morphisms in FinPS, and suppose $\bar f$ and $\bar g$ are Bayesian inverses of f and g, respectively. Then $(\bar g,\bar f)$ is a composable pair, and $\bar f\circ\bar g$ is a Bayesian inverse of $g\circ f$.
Proof. These are immediate consequences of the categorical definition of a Bayesian inverse (see [3,10,11] for proofs). □
Definition 22. A contravariant function $\mathcal{B}\colon\mathbf{FinPS}\to\mathbf{FinPS}$ is said to be a Bayesian inversion functor if and only if $\mathcal{B}$ acts as the identity on objects and $\mathcal{B}(f)$ is a Bayesian inverse of f for all morphisms f in FinPS.
This is mildly abusive terminology since functoriality only holds in the a.e. sense, as explained in the following remark.
Remark 6. A Bayesian inversion functor exists. Given any $(X,p)\xrightarrow{f}(Y,q)$, set $\mathcal{B}(f)$ to be given by $\mathcal{B}(f)_{xy}=f_{yx}\,p_x/q_y$ for all $y\in Y$ with $q_y>0$ and $\mathcal{B}(f)_{xy}=1/|X|$ for all $y\in Y$ with $q_y=0$. Note that this does not define a functor. Indeed, if $(X,p)$ is a probability space with $p_x=0$ for some $x\in X$, then $\mathcal{B}(\mathrm{id}_X)_x$ is the uniform measure on X instead of the Dirac delta measure concentrated on $x$. In other words, $\mathcal{B}(\mathrm{id}_X)\neq\mathrm{id}_X$. Similar issues of measure zero occur, indicating that $\mathcal{B}(g\circ f)\neq\mathcal{B}(f)\circ\mathcal{B}(g)$ for a composable pair of morphisms $(f,g)$. Nevertheless, Bayesian inversion is a.e. functorial in the sense that $\mathcal{B}(\mathrm{id}_X)\sim_p\mathrm{id}_X$ and $\mathcal{B}(g\circ f)\sim_r\mathcal{B}(f)\circ\mathcal{B}(g)$.
Corollary 2. $\mathcal{B}(\mathcal{B}(f))\sim_p f$ for any Bayesian inversion functor $\mathcal{B}$ and every $(X,p)\xrightarrow{f}(Y,q)$ in FinPS.
Proposition 7. Let $\mathcal{B}$ be a Bayesian inversion functor on FinPS (as in Definition 22). Then $\mathcal{B}$ is a.e. convex linear in the sense that
$$\mathcal{B}\Big(\bigoplus_{x\in X}p_x f_x\Big)\;\sim\;\bigoplus_{x\in X}p_x\,\mathcal{B}(f_x),$$
where $\sim$ denotes a.e. equivalence with respect to the source measure and the other notation is as in Definition 8.
Proof. First note that it is immediate that $\mathcal{B}$ is convex linear on objects since Bayesian inversion acts as the identity on objects. Let $p\colon\bullet\to X$ be a probability measure, $\{(Y_x,q_x)\xrightarrow{f_x}(Z_x,r_x)\}_{x\in X}$ be a collection of morphisms in FinPS indexed by X, and suppose $\mathcal{B}$ is a Bayesian inversion functor. Then for $z\in Z_x$ with $p_x\,(r_x)_z>0$, we have
$$\mathcal{B}\Big(\bigoplus_{x\in X}p_x f_x\Big)_{yz}=\frac{(f_x)_{zy}\,p_x\,(q_x)_y}{p_x\,(r_x)_z}=\mathcal{B}(f_x)_{yz}=\Big(\bigoplus_{x\in X}p_x\,\mathcal{B}(f_x)\Big)_{yz}.$$
Proposition 8. Given a composable pair $(X,p)\xrightarrow{f}(Y,q)\xrightarrow{g}(Z,r)$ in FinPS, let $\bar f$ and $\bar g$ be Bayesian inverses of f and g, respectively. Then $(f,g)$ is a.e. coalescable if and only if $(\bar g,\bar f)$ is a.e. coalescable.
Proof. Since Bayesian inversion is a dagger functor on a.e. equivalence classes ([3], Remark 13.10), it suffices to prove one direction of this claim. Hence, suppose $(f,g)$ is a.e. coalescable and let $h\colon X\times Z\to Y$ be a mediator function realizing this. Then $h\circ\sigma\colon Z\times X\to Y$, where $\sigma$ is the swap map, is a mediator for $(\bar g,\bar f)$ because the defining condition for a mediator is carried over by Bayes' rule. A completely string-diagrammatic proof is provided in Appendix B. □
The following proposition is a reformulation of the conditional entropy identity in terms of Bayesian inversion.
Proposition 9. Let $(X,p)\xrightarrow{f}(Y,q)$ be a morphism in FinPS, and suppose $\bar f$ is a Bayesian inverse of f. Then
$$H(f|p)=K(\bar f).$$
Proof. This follows from the fact that both sides of (7) are equal to $H(\vartheta(f))-H(p)$. □
Proposition 9 implies Bayesian inversion takes conditional entropy to conditional information loss and vice versa, which is formally stated as follows.
Corollary 3. Let $K\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$ and $H\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}$ be given by conditional information loss and conditional entropy, respectively, and let $\mathcal{B}$ be a Bayesian inversion functor. Then, $H=K\circ\mathcal{B}$ and $K=H\circ\mathcal{B}$.
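This exchange of roles can be confirmed numerically; the following sketch (illustrative data with full-support output, for simplicity) checks both identities of Corollary 3 on a single example:

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def conditional_entropy(f, p):
    return float(sum(p[x] * shannon_entropy(f[:, x]) for x in range(len(p))))

def conditional_information_loss(f, p):
    q = f @ p
    return shannon_entropy(p) - shannon_entropy(q) + conditional_entropy(f, p)

f = np.array([[0.7, 0.2, 0.0],
              [0.3, 0.8, 1.0]])
p = np.array([0.5, 0.25, 0.25])
q = f @ p

# Bayesian inverse fbar : (Y, q) -> (X, p), assuming q has full support here.
fbar = (f * p).T / q

# H(f | p) = K(fbar)  and, symmetrically,  H(fbar | q) = K(f).
assert np.isclose(conditional_entropy(f, p), conditional_information_loss(fbar, q))
assert np.isclose(conditional_entropy(fbar, q), conditional_information_loss(f, p))
```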
Remark 7. If $(X,p)\xrightarrow{f}(Y,q)$ is a deterministic morphism in FinPD, Baez, Fritz, and Leinster point out that the information loss of f is in fact the conditional entropy of x given y [1]. Here, we see this duality as a special case of Corollary 3 applied to deterministic morphisms.

7. Bloom-Shriek Factorization
We now introduce a simple, but surprisingly useful, factorization for every morphism in FinPS, and we use it to prove some essential lemmas for our characterization theorems for conditional information loss and conditional entropy, which appear in the following sections.
Definition 23. Given a stochastic map $f\colon X\to Y$, the bloom of f is the stochastic map $¡_f\colon X\to X\times Y$ given by the composite $(\mathrm{id}_X\times f)\circ\Delta_X$, and the shriek of f is the deterministic map $X\times Y\to Y$ given by the projection $\pi_Y$.
Proposition 10. Let $(X,p)\xrightarrow{f}(Y,q)$ be a morphism in FinPS. Then the following statements hold.
- (i) The composite $\pi_X\circ ¡_f$ is equal to the identity $\mathrm{id}_X$.
- (ii) The morphism f equals the composite $\pi_Y\circ ¡_f$; moreover, $\sigma\circ ¡_{\bar f}$ is a Bayesian inverse of $\pi_Y\colon(X\times Y,\vartheta(f))\to(Y,q)$, where $\bar f$ denotes any Bayesian inverse of f and $\sigma$ is the swap map.
- (iii) The pair $(¡_f,\pi_Y)$ is coalescable.
Definition 24. The decomposition $f=\pi_Y\circ ¡_f$ in item (ii) of Proposition 10 will be referred to as the bloom-shriek factorization of f.
Proof of Proposition 10. Element-wise proofs are left as exercises. Appendix B contains an abstract proof using string diagrams in Markov categories. □
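The element-wise computations are short enough to script; here is a sketch (with an illustrative matrix, and with the product set flattened to a single index) of the bloom, the shriek, and the factorization they provide:

```python
import numpy as np

def bloom(f):
    """Bloom of f : X -> Y as a stochastic map X -> X x Y.
    X x Y is indexed by pairs (x, y) flattened in row-major order:
    bloom(f)[(x_out, y), x_in] = (1 if x_out == x_in else 0) * f[y, x_in]."""
    nY, nX = f.shape
    b = np.zeros((nX * nY, nX))
    for x in range(nX):
        for y in range(nY):
            b[x * nY + y, x] = f[y, x]
    return b

f = np.array([[0.7, 0.2, 0.0],
              [0.3, 0.8, 1.0]])
nY, nX = f.shape
b = bloom(f)

# Shriek of f: the deterministic projection X x Y -> Y, (x, y) |-> y.
proj_Y = np.tile(np.eye(nY), nX)            # proj_Y[y, x*nY + y'] = delta_{y, y'}
proj_X = np.repeat(np.eye(nX), nY, axis=1)  # projection X x Y -> X

assert np.allclose(proj_Y @ b, f)           # shriek o bloom recovers f  (item ii)
assert np.allclose(proj_X @ b, np.eye(nX))  # projecting onto X is the identity (item i)
```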
The bloom of f can be expressed as a convex combination of simpler morphisms up to isomorphism. To describe this and its behavior under convex linear semi-functors, we introduce the notion of an invariant and examine some of its properties.
Definition 25. A function $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is said to be an invariant if and only if for every triple $(e,f,g)$ of composable morphisms such that e and g are isomorphisms, $F(g\circ f\circ e)=F(f)$.
Lemma 6. If a function $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is semi-functorial, then F is an invariant.
Proof. Consider a composable triple $(e,f,g)$ such that e and g are isomorphisms. Then $F(g\circ f\circ e)=F(g)+F(f)+F(e)$ by Lemma 5. Secondly, since g and e are isomorphisms, and since the pairs $(g,g^{-1})$ and $(g^{-1},g)$ are coalescable, $F(g)+F(g^{-1})=F(\mathrm{id})=F(g^{-1})+F(g)$. But since $F(\mathrm{id})=F(\mathrm{id}\circ\mathrm{id})=F(\mathrm{id})+F(\mathrm{id})$ (by semi-functoriality), this requires that $F(g)=0$ for an isomorphism g since $F(g)\geq 0$ and $F(g^{-1})\geq 0$. The same is true for e. Hence, $F(g\circ f\circ e)=F(f)$. □
Lemma 7. Let $(X,p)\xrightarrow{f}(Y,q)$ be a morphism in FinPS, and suppose $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is semi-functorial and convex linear. Then the following statements hold.
- (i)
- (ii)
- (iii)
Proof. For items (ii) and (iii), note that
and
can be expressed as composites of isomorphisms and certain convex combinations, namely
Hence,
Proposition 11. Suppose $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is semi-functorial and convex linear. If $f$ and $g$ are two morphisms $(X,p)\to(Y,q)$ in FinPS such that $f\sim_p g$, then $F(f)=F(g)$.
Proof. Suppose
and
are such that
, and let
and
be Bayesian inverses for
f and
g. Then
as desired. □
8. An Intrinsic Characterization of Conditional Information Loss
Theorem 4. Suppose $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is a function satisfying the following conditions.
- 1. F is semi-functorial.
- 2. F is convex linear.
- 3. F is continuous.
- 4. $F(¡_p)=0$ for every probability distribution $p$.
Then F is a non-negative multiple of conditional information loss. Conversely, conditional information loss satisfies conditions 1–4.
Proof. Suppose F satisfies conditions 1–4, let $(X,p)\xrightarrow{f}(Y,q)$ be an arbitrary morphism in FinPS, and let $f=\pi_Y\circ ¡_f$ be its bloom-shriek factorization. Then, by semi-functoriality applied to the coalescable pair $(¡_f,\pi_Y)$,
$$F(f)=F(\pi_Y\circ ¡_f)=F(\pi_Y)+F(¡_f)=F(\pi_Y),$$
where $F(¡_f)=0$ follows from Lemma 7, convex linearity, and condition 4. Thus, F is reductive (see Definition 16) and Proposition 5 applies. □
Remark 8. Under the assumption that $F$ is semi-functorial and convex linear, one may show F satisfies condition 4 in Theorem 4 if and only if F is reductive (see Definition 16 and Proposition 5). While the reductive axiom specifies how the semi-functor acts on all morphisms in FinPS, condition 4 in Theorem 4 only specifies how it acts on morphisms from the initial object. This gives not just a simple mathematical criterion, but one with a simple intuitive interpretation as well. Namely, condition 4 says that if a process begins with no prior information, then there is no information to be lost in the process.
We now use Theorem 4 and Bayesian inversion to prove a statement dual to Theorem 4.
Theorem 5. Suppose $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is a function satisfying the following conditions.
- 1. F is semi-functorial.
- 2. F is convex linear.
- 3. F is continuous.
- 4. $F(!_p)=0$ for every probability distribution $p$.
Then F is a non-negative multiple of conditional entropy. Conversely, conditional entropy satisfies conditions 1–4.
Before giving a proof, we introduce some terminology and prove a few lemmas. We also would like to point out that condition 4 may be given an operational interpretation as follows: if a communication channel has a constant output, then it has no conditional entropy.
Definition 26. Let $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ be a function and let $\mathcal{B}$ be a Bayesian inversion functor. Then $F\circ\mathcal{B}$ will be referred to as a Bayesian reflection of F.
Remark 9. By Proposition 11, if $F$ is a convex linear semi-functor, then a Bayesian reflection of $F$ is independent of the choice of a Bayesian inversion functor, and as such, is necessarily unique.
Lemma 8. Let $(X,p)\xrightarrow{f}(Y,q)$ be a morphism in FinPS, suppose $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is a convex linear semi-functor, and let $\bar f$ be a Bayesian inverse of f. Then $(F\circ\mathcal{B})(f)=F(\bar f)$ for any Bayesian inversion functor $\mathcal{B}$.
Proof of Lemma 8. Let $\mathcal{B}$ be a Bayesian inversion functor, so that $\mathcal{B}(f)\sim_q\bar f$. Then $(F\circ\mathcal{B})(f)=F(\mathcal{B}(f))=F(\bar f)$, where the last equality follows from Proposition 11. □
Lemma 9. Let $\mathcal{B}$ be a Bayesian inversion functor and let $(f_n)$ be a sequence of morphisms in FinPS converging to $f$. Then $\mathcal{B}(f_n)$ converges to a morphism that is a.e. equivalent to $\mathcal{B}(f)$.
Proof of Lemma 9. Set $q_n:=f_n\circ p_n$ and $q:=f\circ p$. For all $y\in Y$ with $q_y>0$, we have $(q_n)_y>0$ for sufficiently large $n$, and hence $\lim_{n\to\infty}\mathcal{B}(f_n)_{xy}=\lim_{n\to\infty}(f_n)_{yx}\,(p_n)_x/(q_n)_y=f_{yx}\,p_x/q_y=\mathcal{B}(f)_{xy}$. □
Lemma 10. Suppose $F\colon\mathbf{FinPS}\to\mathbb{B}\mathbb{R}_{\geq 0}$ is a function satisfying conditions 1–4 of Theorem 5. Then the Bayesian reflection $F\circ\mathcal{B}$ is a non-negative multiple of conditional information loss.
Proof of Lemma 10. We show $F\circ\mathcal{B}$ satisfies conditions 1–4 of Theorem 4. Throughout the proof, let $\mathcal{B}$ denote a Bayesian inversion functor, so that the Bayesian reflection under consideration is $F\circ\mathcal{B}$.
Semi-functoriality: Suppose $(f,g)$ is an a.e. coalescable pair of composable morphisms in FinPS. Then, by Proposition 8, $(\bar g,\bar f)$ is a.e. coalescable, so that
$$(F\circ\mathcal{B})(g\circ f)=F(\mathcal{B}(g\circ f))=F(\mathcal{B}(f)\circ\mathcal{B}(g))=F(\mathcal{B}(f))+F(\mathcal{B}(g))=(F\circ\mathcal{B})(f)+(F\circ\mathcal{B})(g).$$
Thus, $F\circ\mathcal{B}$ is semi-functorial.
Convex Linearity: Given any probability space $(X,p)$ and a family of morphisms $\{f_x\}_{x\in X}$ in FinPS indexed by X, Propositions 7 and 11 give
$$(F\circ\mathcal{B})\Big(\bigoplus_{x\in X}p_x f_x\Big)=F\Big(\bigoplus_{x\in X}p_x\,\mathcal{B}(f_x)\Big)=\sum_{x\in X}p_x\,(F\circ\mathcal{B})(f_x).$$
Thus, $F\circ\mathcal{B}$ is convex linear.
Continuity: This follows from Lemma 9 and Proposition 11.
$(F\circ\mathcal{B})(¡_p)=0$ for every probability distribution $p$: This follows from Lemma 8, since $!_p$ is the unique Bayesian inverse of $¡_p$. □
Proof of Theorem 5. Suppose $F$ is a function satisfying conditions 1–4 of Theorem 5, and let $\mathcal{B}$ be a Bayesian inversion functor. Since F is semi-functorial and convex linear, it follows from Proposition 11 that $F=F\circ\mathcal{B}\circ\mathcal{B}$, and by Lemma 10 it follows that $F\circ\mathcal{B}=cK$ for some non-negative constant $c$. We then have $F=(F\circ\mathcal{B})\circ\mathcal{B}=cK\circ\mathcal{B}=cH$ by Corollary 3. Thus, F is a non-negative multiple of conditional entropy. □