In the previous section, we described a framework for information inequalities for discrete random variables and demonstrated a common proof technique. In this section, we will construct several different frameworks which are “equivalent” or “almost equivalent” to the earlier one. These equivalence relations among the different frameworks will turn out to be very useful in deriving new information-theoretic tools.
4.1. Differential Entropy
The previous framework for information inequalities assumes that all random variables are discrete. A very natural extension is thus to relax this restriction by allowing random variables to be continuous. To achieve this goal, we first need an analogue of discrete entropy for continuous random variables.
Definition 1 (Differential entropy) Let $\{X_i : i \in N\}$ be a set of continuous real-valued random variables. For any nonempty $\alpha \subseteq N$, let $f_\alpha$ be the joint density function of $X_\alpha = (X_i : i \in \alpha)$. Then the differential entropy of $X_\alpha$, denoted by $h(X_\alpha)$, is defined as
$$ h(X_\alpha) = -\int f_\alpha(x_\alpha) \log f_\alpha(x_\alpha) \, dx_\alpha. $$
Remark: For notational simplicity, we abuse notation by using $H$ to denote both discrete and differential entropies. However, its exact meaning should be clear from the context.
Discrete and differential entropies share some properties but differ in others. The main difference is that differential entropy, unlike discrete entropy, can be negative. However, mutual information and its conditional counterpart (defined analogously to (7)) remain nonnegative. In fact, as we shall see, the sets of information inequalities for discrete and continuous random variables are almost the same.
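To make the difference concrete, here is a minimal numerical sketch (the chosen density, the helper function, and the use of base-2 logarithms are assumptions for illustration, not from the text above): a random variable uniformly distributed on $[0, 1/2]$ has differential entropy $\log(1/2) < 0$, something that is impossible for discrete entropy.

```python
# A minimal numerical sketch (assumed example, base-2 logarithms): the uniform
# density on [0, 1/2] has differential entropy log2(1/2) = -1 bit, illustrating
# that differential entropy, unlike discrete entropy, can be negative.
import numpy as np

def differential_entropy(density, a, b, num=200000):
    """Approximate h(X) = -integral of f(x) * log2(f(x)) dx over [a, b]."""
    dx = (b - a) / num
    xs = a + (np.arange(num) + 0.5) * dx          # midpoints of the grid cells
    fx = density(xs)
    fx = fx[fx > 0]                               # treat 0 * log 0 as 0
    return -np.sum(fx * np.log2(fx)) * dx

# Density of the uniform distribution on [0, 1/2]: f(x) = 2 on that interval.
uniform_on_half = lambda x: np.full_like(x, 2.0)
print(differential_entropy(uniform_on_half, 0.0, 0.5))   # approximately -1.0 bits
```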
Definition 2 (Balanced inequalities) An information inequality $\sum_{\alpha} c_\alpha H(X_\alpha) \ge 0$ (for either discrete or continuous random variables) is called balanced if for all $i \in N$, $\sum_{\alpha : i \in \alpha} c_\alpha = 0$.
For any information inequality $\sum_{\alpha} c_\alpha H(X_\alpha) \ge 0$ or expression $\sum_{\alpha} c_\alpha H(X_\alpha)$, its residual weight with respect to variable $i \in N$ is defined as
$$ r_i = \sum_{\alpha : i \in \alpha} c_\alpha. $$
Clearly, an information inequality is balanced if and only if $r_i = 0$ for all $i \in N$.
Example 1 The residual weights of the information inequality $H(X_1, X_2) \ge 0$ are both equal to one. Hence, the inequality is not balanced.
For any information inequality $\sum_{\alpha} c_\alpha H(X_\alpha) \ge 0$, its balanced counterpart is the following inequality
$$ \sum_{\alpha} c_\alpha H(X_\alpha) - \sum_{i \in N} r_i H(X_i \mid X_{N \setminus i}) \ge 0, $$
which is balanced (as its name suggests).
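As a quick illustration of residual weights and balance, the following sketch (the dictionary encoding of an inequality is an assumed convention, not from the text) checks two Shannon-type inequalities.

```python
# A minimal sketch (assumed encoding): an expression sum_alpha c_alpha * H(X_alpha)
# is stored as a dict mapping frozensets alpha (subsets of N) to coefficients.
# The residual weight of variable i is r_i = sum of c_alpha over all alpha containing i.

def residual_weights(coeffs, N):
    return {i: sum(c for alpha, c in coeffs.items() if i in alpha) for i in N}

N = {1, 2}

# Nonnegativity of mutual information: H(X1) + H(X2) - H(X1, X2) >= 0.
mutual_info = {frozenset({1}): 1, frozenset({2}): 1, frozenset({1, 2}): -1}
print(residual_weights(mutual_info, N))    # {1: 0, 2: 0} -> balanced

# Nonnegativity of joint entropy: H(X1, X2) >= 0.
joint_entropy = {frozenset({1, 2}): 1}
print(residual_weights(joint_entropy, N))  # {1: 1, 2: 1} -> not balanced
```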
Proposition 1 (Necessity and sufficiency of balanced inequalities [6]) An information inequality $\sum_{\alpha} c_\alpha H(X_\alpha) \ge 0$ is a valid discrete information inequality if and only if
1. its residual weights satisfy $r_i \ge 0$ for all $i \in N$, and
2. its balanced counterpart is also valid.
Consequently, all valid discrete information inequalities are implied by the set of all valid balanced inequalities and the nonnegativity of (conditional) entropies.
It turns out that this set of balanced information inequalities also plays an equally significant role for inequalities involving continuous random variables.
Theorem 5 (Equivalence [6]) All information inequalities for continuous random variables are balanced. Furthermore, a balanced information inequality is valid for continuous random variables if and only if it is also valid for discrete random variables.
By Theorem 5, to characterise information inequalities, it is sufficient to consider only balanced information inequalities, which are the same for discrete and continuous random variables.
4.2. Inequalities for Kolmogorov Complexity
The second framework we will describe is quite different from the earlier information-theoretic frameworks. For information inequalities, the objects of interest are random variables. However, for the following Kolmogorov complexity framework, the objects of interest are deterministic strings instead.
To understand what Kolmogorov complexity is, let us consider the following example. Suppose that $x$ and $y$ are two long binary strings, where $x$ is highly regular (say, the pattern 01 repeated many times) and $y$ is a string of the same length obtained by flipping a fair coin. The Kolmogorov complexity of a string $x$, denoted by $K(x)$, is the minimal program length required to output that string [18]. In the above example, it is clear that the Kolmogorov complexity of $x$ is much smaller than that of $y$.
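The contrast can be made concrete with a small sketch (the specific strings below are illustrative assumptions, not the strings from the original example): a very short program suffices to print a long regular string, whereas a string of fair coin flips typically admits no description much shorter than itself.

```python
# A minimal illustration (assumed strings): a highly regular string is fully
# described by a short program such as "print '01' repeated 32 times", so its
# Kolmogorov complexity is small; a string of fair coin flips has no comparably
# short description with high probability.
import random

regular_string = "01" * 32                     # 64 symbols, tiny description
random.seed(1)
coin_flip_string = "".join(random.choice("01") for _ in range(64))

print(regular_string)
print(coin_flip_string)
```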
Although the objects of interest are different, [3] proved the surprising result that inequalities for Kolmogorov complexities and for entropies are essentially the same.
Theorem 6 (Equivalence [3]) An information inequality (for discrete random variables) is valid if and only if the corresponding Kolmogorov complexity inequality, obtained by replacing each entropy term with the Kolmogorov complexity of the corresponding strings, is also valid.
4.3. A Group-Theoretic Framework
Besides Kolmogorov complexities, information inequalities are also closely related to group-theoretic inequalities [4]. To understand their relation, we first illustrate how to construct a random variable from a subgroup.
Definition 3 (Group-theoretic construction of random variables) Let $G$ be a finite group and $U$ be a random variable that takes values in $G$ uniformly; in other words, $\Pr(U = g) = 1/|G|$ for all $g \in G$. Any subgroup $K$ of $G$ partitions $G$ into $|G|/|K|$ left (or right) cosets of $K$ in $G$, each containing exactly $|K|$ elements. Note that each left coset can be written as the subset
$$ a \circ K = \{ a \circ k : k \in K \} $$
for some element $a \in G$, where $\circ$ is the binary group operator. Let $G/K$ be the collection of all left cosets of $K$ in $G$. The subgroup $K$ induces a random variable $X_K$, defined as the random left coset of $K$ in $G$ that contains $U$. In fact, $X_K$ is equal to the coset $U \circ K$. Since $U$ is uniformly distributed over $G$, we can easily prove that $X_K$ is uniformly distributed over $G/K$ and that
$$ H(X_K) = \log \frac{|G|}{|K|}. $$
The above construction of a random variable from a subgroup can be extended naturally to multiple subgroups.
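Before turning to multiple subgroups, the following sketch (the choice of $G = \mathbb{Z}_6$ and $K = \{0, 3\}$ is an assumed example) carries out the single-subgroup construction numerically and confirms that $X_K$ is uniform over the cosets with $H(X_K) = \log(|G|/|K|)$.

```python
# A minimal sketch (assumed example): the cyclic group G = Z_6 under addition
# mod 6, with subgroup K = {0, 3}.  X_K is the coset of K containing a uniform
# U in G; it is uniform over the 3 cosets and H(X_K) = log2(|G| / |K|) = log2(3).
from fractions import Fraction
from math import log2

G = list(range(6))          # Z_6 under addition modulo 6
K = {0, 3}                  # a subgroup of order 2

def coset(u):
    """The left coset u + K in Z_6, represented as a frozenset."""
    return frozenset((u + k) % 6 for k in K)

# Distribution of X_K when U is uniform over G.
dist = {}
for u in G:
    c = coset(u)
    dist[c] = dist.get(c, Fraction(0)) + Fraction(1, len(G))

print(dist)                                          # three cosets, each with probability 1/3
entropy = -sum(float(p) * log2(float(p)) for p in dist.values())
print(entropy, log2(len(G) / len(K)))                # both equal log2(3) ≈ 1.585
```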
Theorem 7 (Group characterisable random variables [4]) Let $G$ be a finite group and $G_1, \ldots, G_n$ be a set of subgroups of $G$. For each $i \in N$, let $X_i$ be the random variable induced by the subgroup $G_i$ as defined above. Then for any nonempty $\alpha \subseteq N$,
1. the support of $X_\alpha$ has cardinality $|G| / |\bigcap_{i \in \alpha} G_i|$,
2. $H(X_\alpha) = \log \dfrac{|G|}{|\bigcap_{i \in \alpha} G_i|}$, and
3. $X_\alpha$ is uniformly distributed over its support. In other words, the value of the probability distribution function of $X_\alpha$ is either zero or a constant.
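The following sketch (the choice of $G = \mathbb{Z}_{12}$ and its two subgroups is an assumed example) numerically verifies the entropy formula of Theorem 7 for a pair of subgroups.

```python
# A minimal check of Theorem 7 (assumed example): G = Z_12 with subgroups
# G1 = {0,2,4,6,8,10} and G2 = {0,3,6,9}.  The induced X1 and X2 are the cosets
# containing a uniform U, and H(X1, X2) equals log2(|G| / |G1 ∩ G2|).
from collections import Counter
from math import isclose, log2

G = range(12)                       # Z_12 under addition modulo 12
G1 = {0, 2, 4, 6, 8, 10}            # subgroup of order 6
G2 = {0, 3, 6, 9}                   # subgroup of order 4

def coset(u, K):
    """The left coset u + K in Z_12, represented as a frozenset."""
    return frozenset((u + k) % 12 for k in K)

# Joint distribution of (X1, X2) induced by a uniform U over G.
joint = Counter((coset(u, G1), coset(u, G2)) for u in G)
probs = [count / 12 for count in joint.values()]

H_joint = -sum(p * log2(p) for p in probs)
predicted = log2(12 / len(G1 & G2))                  # log2(|G| / |G1 ∩ G2|)
print(H_joint, predicted, isclose(H_joint, predicted))   # ≈ 2.585, ≈ 2.585, True
```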
Definition 4 A function $h$ is called group characterisable if it is the entropy function of a set of random variables induced by a finite group $G$ and its subgroups $G_1, \ldots, G_n$. Furthermore, $h$ is
1. representable if $G, G_1, \ldots, G_n$ are all vector spaces, and
2. abelian if $G$ is abelian.
Clearly, random variables induced by a set of subgroups must satisfy all valid information inequalities. Therefore, we have the following theorem.
Theorem 8 (Group-theoretic inequalities [4]) Let $\sum_{\alpha} c_\alpha H(X_\alpha) \ge 0$ be a valid information inequality. Then for any finite group $G$ and its subgroups $G_1, \ldots, G_n$, we have
$$ \sum_{\alpha} c_\alpha \log \frac{|G|}{|\bigcap_{i \in \alpha} G_i|} \ge 0, $$
or equivalently,
$$ \prod_{\alpha} \left( \frac{|G|}{|\bigcap_{i \in \alpha} G_i|} \right)^{c_\alpha} \ge 1. $$
Theorem 8 shows that we can directly “translate” any information inequality into a group-theoretic inequality. A very surprising result proved in [4] was that the converse also holds.
Theorem 9 (Converse [4]) The information inequality (30) is valid if it is satisfied by all random variables induced by groups, or equivalently, if the group-theoretic inequality (32) is valid.
Theorems 8 and 9 show that, to prove an information inequality, it is necessary and sufficient to verify whether the inequality is satisfied by all random variables induced by groups. Later, we will further illustrate how to use the two theorems to derive a group-theoretic proof for information inequalities.
In the following, we will further prove that many statistical properties of random variables induced by groups have analogous algebraic interpretations.
Lemma 1 (Properties of group induced random variables) Suppose that $\{X_i : i \in N\}$ is a set of random variables induced by a finite group $G$ and its subgroups $G_1, \ldots, G_n$. Then
1. (Functional dependency) $H(X_j \mid X_i) = 0$ (i.e., $X_j$ is a function of $X_i$) if and only if $G_i \subseteq G_j$. Hence, functional dependency is equivalent to a subgroup inclusion relation;
2. (Independency) $X_i$ and $X_j$ are independent if and only if $|G_i| \, |G_j| = |G| \, |G_i \cap G_j|$;
3. (Conditioning preserves group characterisation) For any fixed $j \in N$, the group $G_j$ and its subgroups $G_i \cap G_j$ for $i \in N$ induce a set of random variables $\{Y_i : i \in N\}$ such that $H(Y_\alpha) = H(X_\alpha \mid X_j)$ for all $\alpha \subseteq N$. In other words, for any group characterisable $h$, let $g$ be defined such that $g(\alpha) = h(\alpha \cup \{j\}) - h(\{j\})$ for all $\alpha \subseteq N$. Then $g$ is also group characterisable.
Proposition 2 (Duality [19]) Let be a set of vector subspaces of over the finite field . Define the following subspace for :Then, for any ,Hence, if such that for all , then h is weakly representable. Remark: While and W are both subspaces of V and , in general. If , then (defined as in (34)) is the orthogonal complement of .
Theorems 8 and 9 show that proving an information inequality (30) is equivalent to proving a group-theoretic inequality (32). In the following, we will illustrate the idea by providing a group-theoretic proof for the nonnegativity of mutual information,
$$ H(X_1) + H(X_2) - H(X_1, X_2) \ge 0. \qquad (35) $$
Example 2 (Group-theoretic Proof) Let $G$ be a finite group and $G_1$ and $G_2$ be its subgroups. Let
$$ S = \{ a \circ b : a \in G_1, \ b \in G_2 \}, $$
where $\circ$ is the binary group operator. As $S$ is a subset of $G$, $|S| \le |G|$. With a simple counting argument (by removing duplications), it can be proved easily that
$$ |S| = \frac{|G_1| \, |G_2|}{|G_1 \cap G_2|}. $$
Therefore,
$$ \log \frac{|G|}{|G_1|} + \log \frac{|G|}{|G_2|} \ge \log \frac{|G|}{|G_1 \cap G_2|}. $$
Finally, according to Theorems 8 and 9, the inequality (35) follows.
It is worth mentioning that Theorems 8 and 9 also suggest an information-theoretic proof for group-theoretic inequalities: translating a valid information inequality via Theorem 8 yields a corresponding group-theoretic inequality. The meaning of some of the resulting inequalities and their implications in group theory are yet to be understood.
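As a sanity check of the group-theoretic inequality derived in Example 2, the following sketch (using the subgroups of $\mathbb{Z}_{12}$ as an assumed test case) verifies $|G_1|\,|G_2| \le |G|\,|G_1 \cap G_2|$ for every pair of subgroups.

```python
# A minimal numerical check (assumed example) of the group-theoretic inequality
# |G1| * |G2| <= |G| * |G1 ∩ G2| from Example 2, over all subgroup pairs of Z_12.
G_order = 12
# The subgroups of Z_12 are exactly the sets of multiples of each divisor of 12.
subgroups = [frozenset(range(0, 12, d)) for d in (1, 2, 3, 4, 6, 12)]

for G1 in subgroups:
    for G2 in subgroups:
        assert len(G1) * len(G2) <= G_order * len(G1 & G2)

print("|G1||G2| <= |G||G1 ∩ G2| holds for every pair of subgroups of Z_12")
```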
4.4. Combinatorial Perspective
Random variables that are induced by groups have many interesting properties. One interesting property is that they are quasi-uniform in nature.
Definition 5 (Quasi-uniform random variables) A set of random variables $\{X_i : i \in N\}$ is called quasi-uniform if for all nonempty $\alpha \subseteq N$, $X_\alpha$ is uniformly distributed over its support $\mathcal{S}_\alpha$. In other words,
$$ \Pr(X_\alpha = x_\alpha) = \begin{cases} 1/|\mathcal{S}_\alpha| & \text{if } x_\alpha \in \mathcal{S}_\alpha, \\ 0 & \text{otherwise.} \end{cases} $$
Since $X_\alpha$ is uniformly distributed for all $\alpha$, the entropy $H(X_\alpha)$ is thus equal to $\log |\mathcal{S}_\alpha|$.
According to the Asymptotic Equipartition Property (AEP) [12], for a sufficiently long sequence of independent and identically distributed random variables, the set of typical sequences has a total probability close to one and the probability of each typical sequence is approximately the same. In a certain sense, quasi-uniform random variables possess a non-asymptotic equipartition property: their probabilities are completely concentrated and uniformly distributed over their supports. As a result, quasi-uniform random variables can be fully characterised by their supports (because the probability distributions are uniform over the supports). This offers a combinatorial interpretation of quasi-uniform random variables, and it turns out that this interpretation offers a combinatorial approach to proving information inequalities.
Definition 6 (Box assignment) Let $\mathcal{X}_1, \ldots, \mathcal{X}_n$ be nonempty finite sets and $\mathcal{X}_N$ be their Cartesian product $\mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. A box assignment $\mathcal{A}$ in $\mathcal{X}_N$ is a nonempty subset of $\mathcal{X}_N$.
For any nonempty $\alpha \subseteq N$ and $x_\alpha \in \mathcal{X}_\alpha = \prod_{i \in \alpha} \mathcal{X}_i$, we define
$$ \mathcal{A}(x_\alpha) = \{ a \in \mathcal{A} : a_i = x_i \text{ for all } i \in \alpha \} \quad \text{and} \quad \mathcal{A}_\alpha = \{ x_\alpha \in \mathcal{X}_\alpha : \mathcal{A}(x_\alpha) \neq \emptyset \}. $$
Roughly speaking, $\mathcal{A}(x_\alpha)$ is the set of elements in $\mathcal{A}$ whose “$i$-coordinate” is $x_i$ for all $i \in \alpha$. The set $\mathcal{A}(x_\alpha)$ will be called the $x_\alpha$-layer of $\mathcal{A}$. Hence, $\mathcal{A}_\alpha$ contains all $x_\alpha$ such that the $x_\alpha$-layer of $\mathcal{A}$ is nonempty, and we will call $\mathcal{A}_\alpha$ the $\alpha$-projection of $\mathcal{A}$.
Definition 7 (Quasi-uniform box assignment) A box assignment $\mathcal{A}$ is called quasi-uniform if for any nonempty $\alpha \subseteq N$, the cardinality of $\mathcal{A}(x_\alpha)$ is the same for all $x_\alpha \in \mathcal{A}_\alpha$. We will denote this constant by $m_\alpha$ for simplicity.
The following proposition proves that quasi-uniform box assignments and quasi-uniform random variables are in fact equivalent.
Proposition 3 (Equivalence [7]) Let $\{X_i : i \in N\}$ be a set of quasi-uniform random variables and $\mathcal{A}$ be the support of its joint probability distribution. Then $\mathcal{A}$ is a quasi-uniform box assignment in $\mathcal{X}_N$. Furthermore, for all nonempty $\alpha \subseteq N$,
$$ H(X_\alpha) = \log |\mathcal{A}_\alpha|. $$
Conversely, for any quasi-uniform box assignment $\mathcal{A}$, there exists a set of quasi-uniform random variables whose joint probability distribution has support exactly $\mathcal{A}$.
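The following sketch (the parity-check box assignment and the helper functions are assumed illustrations) verifies quasi-uniformity of a small box assignment and computes $\log |\mathcal{A}_\alpha|$ for every $\alpha$, in line with Proposition 3.

```python
# A minimal sketch (assumed example): a quasi-uniform box assignment A in
# {0,1} x {0,1} x {0,1} whose third coordinate is the XOR of the first two.
from itertools import combinations
from math import log2

A = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
N = (0, 1, 2)

def projection(A, alpha):
    """The alpha-projection A_alpha: distinct alpha-coordinates appearing in A."""
    return {tuple(x[i] for i in alpha) for x in A}

def is_quasi_uniform(A, N):
    """Check that, for every alpha, all nonempty x_alpha-layers have equal size."""
    for r in range(1, len(N) + 1):
        for alpha in combinations(N, r):
            layer_sizes = {}
            for x in A:
                key = tuple(x[i] for i in alpha)
                layer_sizes[key] = layer_sizes.get(key, 0) + 1
            if len(set(layer_sizes.values())) != 1:
                return False
    return True

print(is_quasi_uniform(A, N))                              # True
for r in range(1, len(N) + 1):
    for alpha in combinations(N, r):
        print(alpha, log2(len(projection(A, alpha))))      # H(X_alpha) in bits
```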
As random variables induced by groups are quasi-uniform, by Theorems 8 and 9, we have the following combinatorial interpretation for information inequalities.
Theorem 10 (Combinatorial interpretation [7]) An information inequality $\sum_{\alpha} c_\alpha H(X_\alpha) \ge 0$ is valid if and only if the following box assignment inequality is valid:
$$ \sum_{\alpha} c_\alpha \log |\mathcal{A}_\alpha| \ge 0, $$
or equivalently,
$$ \prod_{\alpha} |\mathcal{A}_\alpha|^{c_\alpha} \ge 1, $$
for all quasi-uniform box assignments $\mathcal{A}$.
Again, in the following example, we will illustrate how to use the combinatorial interpretation to derive a “combinatorial proof” for an information inequality.
Example 3 (Combinatorial proof) Let $\mathcal{A}$ be a quasi-uniform box assignment in $\mathcal{X}_1 \times \mathcal{X}_2$. Suppose $(x_1, x_2) \in \mathcal{A}$. Then it is obvious that $x_1 \in \mathcal{A}_1$ and $x_2 \in \mathcal{A}_2$. In other words, $\mathcal{A} \subseteq \mathcal{A}_1 \times \mathcal{A}_2$ and consequently,
$$ \log |\mathcal{A}| \le \log |\mathcal{A}_1| + \log |\mathcal{A}_2|. $$
By Theorem 10, we prove that $H(X_1, X_2) \le H(X_1) + H(X_2)$.
4.5. Coding Perspective
We can also view a box assignment $\mathcal{A}$ as an error correcting code such that $\mathcal{A}$ is the set of all codewords. For each codeword $c = (c_1, \ldots, c_n) \in \mathcal{A}$, $c_i$ is the $i$-th symbol to be transmitted across a channel. Taking this coding perspective, in the following a box assignment will simply be called a code $C$. Also, a code $C$ is called a quasi-uniform code if it is a quasi-uniform box assignment. Again, each quasi-uniform code $C$ will induce a set of quasi-uniform random variables $\{X_i : i \in N\}$.
For any code $C$ (which is just a box assignment) and two codewords $b, c \in C$, the Hamming distance between codewords $b$ and $c$ is defined as
$$ d(b, c) = |\{ i \in N : b_i \neq c_i \}|. $$
In addition, the minimum Hamming distance of the code $C$ is defined as
$$ d_{\min}(C) = \min_{\substack{b, c \in C \\ b \neq c}} d(b, c). $$
The minimum Hamming distance of a code characterises how strong the error correcting capability of the code is. Specifically, a code with minimum Hamming distance $d$ can correct up to $\lfloor (d-1)/2 \rfloor$ symbol errors.
Example 4 Let $C = \{000, 111\}$ be a length-3 code containing only the two codewords $000$ and $111$. The minimum Hamming distance of this code is 3 and hence it can correct any single symbol error. For instance, suppose the codeword $000$ is transmitted. If a symbol error occurs, the receiver will receive either $100$, $010$, or $001$. In any case, the receiver can always determine which symbol is erroneous (by using a bounded-distance decoder) and hence can correct it.
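A minimal sketch of this example (the helper functions are assumed names, not from the text) computes the minimum Hamming distance of the code $\{000, 111\}$ and shows that a bounded-distance decoder corrects every single-symbol error.

```python
# A minimal sketch of Example 4 (assumed helpers): the length-3 repetition code
# {000, 111} has minimum Hamming distance 3, so a bounded-distance decoder
# corrects any single symbol error.
from itertools import combinations

code = ["000", "111"]

def hamming_distance(b, c):
    return sum(bi != ci for bi, ci in zip(b, c))

def minimum_distance(code):
    return min(hamming_distance(b, c) for b, c in combinations(code, 2))

def decode(received, code):
    """Bounded-distance decoding: return the closest codeword."""
    return min(code, key=lambda c: hamming_distance(received, c))

print(minimum_distance(code))            # 3
for received in ["100", "010", "001"]:   # single-error corruptions of 000
    print(received, "->", decode(received, code))   # all decode back to 000
```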
In addition to the minimum Hamming distance, in many cases a code's distance profile is also of great importance. Let $C$ be a code and $c$ be a codeword in $C$. The distance profile of $C$ centered at $c$ is the set of integers $\{ B_r(c) : r = 0, 1, \ldots, n \}$, where
$$ B_r(c) = |\{ b \in C : d(b, c) = r \}|. $$
In other words, $B_r(c)$ is the number of codewords in $C$ whose Hamming distance to the centering codeword $c$ is $r$.
The profile contains information about how likely a decoding error (i.e., the receiver decodes a wrong codeword) occurs if the transmitted codeword is $c$. In general, the distance profile depends on the choice of $c$. A code is called distance-invariant if its distance profile is independent of $c$. Roughly speaking, a distance-invariant code is one where the probability of decoding error is the same for all transmitted codewords $c$.
Theorem 11 (Distance invariance [20]) Quasi-uniform codes are distance-invariant.
Example 5 (Linear codes) Let $P$ be a parity check matrix (over a finite field $\mathbb{F}_q$) and the code $C$ be defined by
$$ C = \{ c : P c^{\top} = 0 \}. $$
Then $C$ is called a linear code. Note that, for a linear code, if $b, c \in C$, then $b + c$ is also contained in $C$. Linear codes are quasi-uniform codes and hence are also distance-invariant.
In the following, we will consider only quasi-uniform codes. For simplicity, we will assume without loss of generality that the code contains the zero codeword (by renaming symbols if necessary). Also, for any $c \in C$, we define the Hamming weight of the codeword $c$ (denoted by $\mathrm{wt}(c)$) as $d(c, \mathbf{0})$, the number of nonzero symbols in $c$.
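The following sketch (the length-3 single-parity-check code is an assumed example) computes the distance profile centered at every codeword of a small linear code and confirms the distance invariance asserted by Theorem 11.

```python
# A minimal check (assumed example) of Theorem 11 for a small linear code:
# the binary single-parity-check code of length 3 (all even-weight words).
from collections import Counter

code = ["000", "011", "101", "110"]     # c1 ^ c2 ^ c3 = 0: linear, hence quasi-uniform

def hamming_distance(b, c):
    return sum(bi != ci for bi, ci in zip(b, c))

def distance_profile(code, center):
    """B_r(center) = number of codewords at Hamming distance r from center."""
    return Counter(hamming_distance(center, c) for c in code)

profiles = [distance_profile(code, c) for c in code]
print(profiles[0])                               # Counter({2: 3, 0: 1})
print(all(p == profiles[0] for p in profiles))   # True: distance-invariant
```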
Definition 8 (Weight enumerator) The weight enumerator of a quasi-uniform code $C$ with length $n$ is
$$ W_C(x, y) = \sum_{c \in C} x^{\,n - \mathrm{wt}(c)} \, y^{\,\mathrm{wt}(c)}, $$
where $x$ and $y$ are indeterminates and $\mathrm{wt}(c)$ is the Hamming weight of $c$. Using simple counting, it is easy to prove that
$$ W_C(x, y) = \sum_{r = 0}^{n} B_r(\mathbf{0}) \, x^{\,n - r} y^{\,r}. $$
In many cases, it is more convenient to work with the weight enumerator than the distance profile. However, conceptually they are equivalent (i.e., they can be uniquely obtained from each other). Clearly, the weight enumerator is uniquely determined by the code $C$. However, what “structural property” of the code determines the weight enumerator? For example, suppose that we construct a new code from $C$ by exchanging the first and the second codeword symbols. It is obvious that this modification will not affect the weight enumerator. In other words, the ordering of the codeword symbols has no effect on the weight enumerator. The question therefore is: what property of a code has a direct effect on the weight enumerator?
To answer the question, let us use the earlier perspective that a quasi-uniform code is merely a quasi-uniform box assignment (and also its associated set of quasi-uniform random variables). These random variables have a simple interpretation here: suppose a codeword $C = (C_1, \ldots, C_n)$ is randomly and uniformly selected from the code. Then $X_i$ is the $i$-th symbol in the random codeword $C$, i.e., $X_i = C_i$.
Theorem 12 (Generalised Greene's Theorem [20]) Let $C$ be a quasi-uniform code and $\{X_i : i \in N\}$ be its induced quasi-uniform random variables. Suppose that $\rho$ is the entropy function of $\{X_i : i \in N\}$; in other words, $\rho(\alpha) = H(X_\alpha)$ for all $\alpha \subseteq N$. Then the weight enumerator $W_C(x, y)$ is completely determined by $\rho$, via an explicit formula given in [20]. Remark: Greene's Theorem is a special case of Theorem 12 when the code is a linear code.
By Theorem 12, the weight enumerator (and hence also the error-correcting capability) of a quasi-uniform code depends only on the entropy function induced by the codeword symbol random variables. By exploiting this relation between the entropy function of a set of quasi-uniform random variables and the weight enumerator of the induced code, we open a new door to harnessing coding theory results to derive new information theory results.
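For concreteness, the following sketch (the code is an assumed example) computes the weight enumerator of Definition 8 directly by counting codeword weights; by Theorem 12, the same polynomial is determined by the entropy function of the induced symbol random variables.

```python
# A minimal sketch (assumed example): computing the weight enumerator
# W_C(x, y) = sum over codewords of x^(n - wt(c)) * y^(wt(c)) for the
# single-parity-check code of length 3, directly from Definition 8.
from collections import Counter

code = ["000", "011", "101", "110"]
n = 3

weight_counts = Counter(c.count("1") for c in code)   # codewords of each weight
terms = [f"{count}*x^{n - w}*y^{w}" for w, count in sorted(weight_counts.items())]
print(" + ".join(terms))    # 1*x^3*y^0 + 3*x^1*y^2, i.e. W_C(x, y) = x^3 + 3*x*y^2
```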
Example 6 (Code-theoretic proof) Consider a set of quasi-uniform random variables $\{X_1, X_2\}$ which induces a length-2 quasi-uniform code $C$. By the Generalised Greene's Theorem, the number of codewords of Hamming weight 1 can be expressed in terms of the entropy function of $\{X_1, X_2\}$, as in (50). As this number is nonnegative, (50) implies the information inequality (51). Finally, by Theorem 10 (a variation of it, to be precise), an information inequality holds if and only if it also holds for all quasi-uniform random variables. Consequently, we prove that (51) holds for all random variables.