Article

Information Geometric Approach on Most Informative Boolean Function Conjecture

Department of Electronic and Electrical Engineering, Hongik University, Seoul 04066, Korea
Entropy 2018, 20(9), 688; https://doi.org/10.3390/e20090688
Submission received: 26 July 2018 / Revised: 6 September 2018 / Accepted: 8 September 2018 / Published: 10 September 2018

Abstract

Let $X^n$ be a memoryless uniform Bernoulli source and $Y^n$ be its output through a binary symmetric channel. Courtade and Kumar conjectured that the Boolean function $f: \{0,1\}^n \to \{0,1\}$ that maximizes the mutual information $I(f(X^n); Y^n)$ is a dictator function, i.e., $f(x^n) = x_i$ for some $i$. We propose a clustering problem that is equivalent to the above problem, emphasizing the information geometric aspects of this equivalent formulation. Moreover, we define a normalized geometric mean of measures and derive some of its interesting properties. We also show that the conjecture is true when the arithmetic and geometric means coincide on a specific set of measures.

1. Introduction

Let $X^n$ be an independent and identically distributed (i.i.d.) uniform Bernoulli source and $Y^n$ be its output through a memoryless binary symmetric channel with crossover probability $p < 1/2$. Recently, Courtade and Kumar conjectured that the most informative Boolean function is a dictator function.
Conjecture 1
([1]). For any Boolean function $f: \{0,1\}^n \to \{0,1\}$, we have:
$$I(f(X^n); Y^n) \le 1 - h_2(p)$$
where the maximum is achieved by a dictator function, i.e., $f(x^n) = x_i$ for some $1 \le i \le n$. Note that $h_2(p) = -p \log p - (1-p)\log(1-p)$ is the binary entropy function.
Although there has been some progress in this line of work [2,3], this simple conjecture still remains open. There are also a number of variations of this conjecture. Weinberger and Shayevitz [4] considered the optimal Boolean function under quadratic loss. Huleihel and Ordentlich [5] considered the complementary case and showed that $I(f(X^n); Y^n) \le (n-1)(1 - h_2(p))$ for all $f: \{0,1\}^n \to \{0,1\}^{n-1}$. Nazer et al. focused on information-distilling quantizers [6], which can be seen as a generalized version of the above problem.
Many of these works, including the original paper [1], are based on Fourier analysis. In this paper, we suggest an alternative approach, namely an information geometric approach. The mutual information can naturally be expressed in terms of Kullback–Leibler (KL) divergences. Thus, it can be shown that maximizing the mutual information is equivalent to clustering probability measures under the KL divergence.
In the equivalent clustering problem, the center of a cluster is an arithmetic mean of measures. We also describe the role of the geometric mean of measures (with appropriate normalization) in this setting. To the best of our knowledge, the geometric mean of measures has received less attention in the literature. We propose an equivalent formulation of the conjecture using the geometric mean of measures. The geometric mean also allows us to connect Conjecture 1 to another well-known clustering problem.
The rest of the paper is organized as follows. In Section 2, we briefly review the Jensen–Shannon divergence and $\mathcal{I}$-compressedness. In Section 3, we provide an equivalent clustering problem over probability measures. We introduce the geometric mean of measures in Section 4. We conclude the paper in Section 5.

Notations

$\mathcal{X}$ denotes the alphabet of the random variable $X$, and $\mathcal{M}(\mathcal{X})$ denotes the set of measures on $\mathcal{X}$. $X^n$ denotes a random vector $(X_1, X_2, \ldots, X_n)$, while $x^n$ denotes a specific realization of it. If it is clear from the context, $P_{Y|x}$ denotes the conditional distribution of $Y$ given $X = x$, i.e., $P_{Y|x}(y) = P_{Y|X}(y|x)$. Similarly, $P_{Y^n|x^n}$ denotes the conditional distribution of $Y^n$ given $X^n = x^n$, i.e., $P_{Y^n|x^n}(y^n) = P_{Y^n|X^n}(y^n|x^n)$. Let $\Omega = \{0,1\}^n$ be the set of all binary sequences of length $n$. For $A \subseteq \Omega$, the shifted version of $A$ is denoted by $A \oplus x^n = \{\tilde{x}^n \oplus x^n : \tilde{x}^n \in A\}$, where $\oplus$ is the element-wise XOR operator. The arithmetic mean of measures in the set $\{P_{Y^n|x^n} : x^n \in A\}$ is denoted by $\mu_A$. For $1 \le i \le n$, let $A_{i0}$ be the set of elements in $A$ that satisfy $x_i = 0$, i.e., $A_{i0} = \{x^n \in A : x_i = 0\}$, and $\Omega_{i0} = \{x^n \in \Omega : x_i = 0\}$. $A_{i1}$ is defined in a similar manner. A length-$n$ binary vector $x^{n-1}0$ denotes a vector $x^n$ with $x_n = 0$.

2. Preliminaries

2.1. Jensen–Shannon Divergence

For $\alpha_1, \alpha_2 \ge 0$ such that $\alpha_1 + \alpha_2 = 1$, the Jensen–Shannon (JS) divergence of two measures $P_1$ and $P_2$ is defined as:
$$\mathrm{JSD}_\alpha(P_1, P_2) = H(\alpha_1 P_1 + \alpha_2 P_2) - \alpha_1 H(P_1) - \alpha_2 H(P_2).$$
It is not hard to show that the following definition is equivalent:
$$\mathrm{JSD}_\alpha(P_1, P_2) = \alpha_1 D(P_1 \| \alpha_1 P_1 + \alpha_2 P_2) + \alpha_2 D(P_2 \| \alpha_1 P_1 + \alpha_2 P_2). \tag{3}$$
Lin proposed a generalized JS divergence [7]:
$$\mathrm{JSD}_\alpha(P_1, P_2, \ldots, P_n) = H\left(\sum_{i=1}^n \alpha_i P_i\right) - \sum_{i=1}^n \alpha_i H(P_i)$$
where $\alpha = (\alpha_1, \ldots, \alpha_n)$ is a weight vector such that $\sum_{i=1}^n \alpha_i = 1$. Similar to Equation (3), it has an equivalent definition:
$$\mathrm{JSD}_\alpha(P_1, P_2, \ldots, P_n) = \sum_{i=1}^n \alpha_i D(P_i \| \bar{P})$$
where $\bar{P} = \sum_{i=1}^n \alpha_i P_i$. Topsøe [8] pointed out an interesting property, the so-called compensation identity. It states that for any distribution $Q$,
$$\sum_{i=1}^n \alpha_i D(P_i \| Q) = \sum_{i=1}^n \alpha_i D(P_i \| \bar{P}) + D(\bar{P} \| Q) = \mathrm{JSD}_\alpha(P_1, P_2, \ldots, P_n) + D(\bar{P} \| Q). \tag{6}$$
Throughout the paper, we often use Equation (6) directly without invoking the notion of the JS divergence.
Remark 1.
The generalized JS divergence is the mutual information between the mixture random variable $X$ and the index $Z$. Let $Z$ be a random variable that takes values in $\{1, 2, \ldots, n\}$ with $P_Z(i) = \alpha_i$ and $P_{X|Z}(x|i) = P_i(x)$. Then, it is not hard to show that:
$$\mathrm{JSD}_\alpha(P_1, P_2, \ldots, P_n) = I(X; Z).$$
However, we introduced the generalized JS divergence to emphasize the information geometric perspective of our problem.
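As a concrete illustration (not part of the original paper), the following minimal Python sketch checks the equivalence of the two JSD definitions and the compensation identity (6) on randomly drawn measures; the helper names (`entropy`, `kl`, `jsd`) are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(P):
    """Shannon entropy in nats."""
    return float(-np.sum(P * np.log(P)))

def kl(P, Q):
    """KL divergence D(P || Q) in nats (strictly positive measures assumed)."""
    return float(np.sum(P * np.log(P / Q)))

def jsd(Ps, alphas):
    """Generalized JS divergence: H(sum_i a_i P_i) - sum_i a_i H(P_i)."""
    mix = alphas @ Ps
    return entropy(mix) - float(alphas @ np.array([entropy(P) for P in Ps]))

# Random measures on a 5-point alphabet and random weights.
Ps = rng.dirichlet(np.ones(5), size=3)
alphas = rng.dirichlet(np.ones(3))
mix = alphas @ Ps

# Equivalent definition: JSD_alpha = sum_i alpha_i D(P_i || P_bar).
assert np.isclose(jsd(Ps, alphas), sum(a * kl(P, mix) for a, P in zip(alphas, Ps)))

# Compensation identity (6): sum_i alpha_i D(P_i || Q) = JSD_alpha + D(P_bar || Q) for any Q.
Q = rng.dirichlet(np.ones(5))
lhs = sum(a * kl(P, Q) for a, P in zip(alphas, Ps))
assert np.isclose(lhs, jsd(Ps, alphas) + kl(mix, Q))
print("JSD equivalence and compensation identity verified numerically.")
```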

2.2. $\mathcal{I}$-Compressed Sets

Let $A$ be a subset of $\Omega$ and let $\mathcal{I} = \{i_1, i_2, \ldots, i_k\} \subseteq \{1, 2, \ldots, n\}$ be a set of indices. For $x^n \in \Omega$, the $\mathcal{I}$-section of $A$ is defined as:
$$A_{\mathcal{I}}(x^n) = \left\{ z^k : \exists\, y^n \in A \text{ with } y_i = \begin{cases} z_j & \text{if } i = i_j \in \mathcal{I} \\ x_i & \text{otherwise} \end{cases} \right\}.$$
The set $A$ is called $\mathcal{I}$-compressed if $A_{\mathcal{I}}(x^n)$ is an initial segment of the lexicographical ordering for all $x^n$. For example, if $A$ is $\mathcal{I}$-compressed for some $|\mathcal{I}| = 2$, then $A_{\mathcal{I}}(x^n)$ should be one of:
$$\{00\}, \{00, 01\}, \{00, 01, 10\}, \{00, 01, 10, 11\}.$$
This simply says that if $x^{n-2}10 \in A$, then $x^{n-2}00, x^{n-2}01 \in A$.
Courtade and Kumar showed that it is enough to consider $\mathcal{I}$-compressed sets.
Theorem 1
([1]). Let $\mathcal{S}_n$ be the set of functions $f: \Omega \to \{0,1\}$ for which $f^{-1}(0)$ is $\mathcal{I}$-compressed for all $\mathcal{I}$ with $|\mathcal{I}| \le 2$. In maximizing $I(f(X^n); Y^n)$, it is sufficient to consider functions $f \in \mathcal{S}_n$.
In this paper, we often restrict our attention to functions in the set $\mathcal{S}_n$.
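To make the definition concrete, here is a small Python sketch (ours, not from [1]) that computes $\mathcal{I}$-sections and checks $\mathcal{I}$-compressedness by brute force; coordinates are 0-indexed and all function names are hypothetical.

```python
from itertools import product

def section(A, I, xn):
    """I-section A_I(x^n): all z^k such that the vector agreeing with z on the
    coordinates in I and with x^n elsewhere belongs to A."""
    out = set()
    for z in product((0, 1), repeat=len(I)):
        y = list(xn)
        for j, i in enumerate(I):
            y[i] = z[j]
        if tuple(y) in A:
            out.add(z)
    return out

def is_compressed(A, I, n):
    """A is I-compressed if every I-section is an initial segment of the lexicographic order."""
    lex = list(product((0, 1), repeat=len(I)))
    for xn in product((0, 1), repeat=n):
        sec = section(A, I, xn)
        if sec != set(lex[:len(sec)]):
            return False
    return True

# Example from the text: if x^{n-2}10 is in A, then x^{n-2}00 and x^{n-2}01 must be in A as well.
print(is_compressed({(0, 0, 0), (0, 0, 1), (0, 1, 0)}, I=(1, 2), n=3))  # True
print(is_compressed({(0, 1, 0)}, I=(1, 2), n=3))  # False: {10} is not an initial segment
```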

3. Approach via Clustering

In this section, we provide an interesting approach toward Conjecture 1 via clustering. More precisely, we formulate an equivalent clustering problem.

3.1. Equivalence to Clustering

The following theorem establishes the relation between the original conjecture and the clustering problem.
Theorem 2.
Let $f: \mathcal{X} \to \mathcal{U}$ and let $U = f(X)$ be the induced random variable. Then,
$$I(f(X); Y) = I(X; Y) - \sum_x P_X(x) D(P_{Y|x} \| P_{Y|U}(\cdot|f(x))).$$
The proof of the theorem is provided in Appendix A. Note that:
$$\begin{aligned}
P_{Y|U}(y|u) &= \frac{P_{U|Y}(u|y) P_Y(y)}{P_U(u)} \\
&= \frac{P_Y(y)}{P_U(u)} \sum_{x \in f^{-1}(u)} P_{X|Y}(x|y) \\
&= \sum_{x \in f^{-1}(u)} \frac{P_X(x)}{P_U(u)} \cdot P_{Y|X}(y|x),
\end{aligned}$$
which is a weighted mean of $P_{Y|X}(y|x)$ over $x \in f^{-1}(u)$. The term $D(P_{Y|x} \| P_{Y|U}(\cdot|f(x)))$ is the distance from each element to its cluster center. This implies that maximizing $I(f(X); Y)$ is equivalent to clustering $\{P_{Y^n|x^n}\}$ under the KL divergence. Since the KL divergence is a Bregman divergence, all clusters are separated by hyperplanes [9].
In this paper, we focus on $\mathcal{U} = \{0,1\}$, where $X^n$ is i.i.d. Bern$(1/2)$.
Corollary 1.
Let $f: \Omega \to \{0,1\}$ and let $U = f(X^n)$ be a binary random variable. Then,
$$I(f(X^n); Y^n) = n - n h_2(p) - \frac{1}{2^n} \sum_{x^n} D(P_{Y^n|x^n} \| P_{Y^n|U}(\cdot|f(x^n))).$$
The equivalent clustering problem is minimizing:
$$\sum_{x^n} D(P_{Y^n|x^n} \| P_{Y^n|U}(\cdot|f(x^n))).$$
Let $A = \{x^n \in \Omega : f(x^n) = 0\}$; then we can simplify $P_{Y^n|U}$ further:
$$P_{Y^n|U}(y^n|0) = \frac{1}{|A|} \sum_{x^n \in A} P_{Y^n|X^n}(y^n|x^n) \triangleq \mu_A(y^n).$$
The cluster center $\mu_A$ is the arithmetic mean of measures in the set $\{P_{Y^n|x^n} : x^n \in A\}$. Then, we have:
$$\sum_{x^n} D(P_{Y^n|x^n} \| P_{Y^n|U}(\cdot|f(x^n))) = \sum_{x^n \in A} D(P_{Y^n|x^n} \| \mu_A) + \sum_{x^n \in A^c} D(P_{Y^n|x^n} \| \mu_{A^c}). \tag{19}$$
For simplicity, let:
$$D(A) \triangleq \sum_{x^n \in A} D(P_{Y^n|x^n} \| \mu_A),$$
which is the sum of distances from each element in $A$ to the cluster center. In short, finding the most informative Boolean function $f$ is equivalent to finding the set $A \subseteq \Omega$ that minimizes $D(A) + D(A^c)$.
Remark 2.
Conjecture 1 implies that $A = \Omega_{i0} = \{x^n : x_i = 0\}$ minimizes (19). Furthermore, Theorem 1 implies that it is enough to consider $A$ such that $A_{i1} \subseteq A_{i0}$ for all $i$ (where the inclusion is understood after dropping the $i$-th coordinate).
For any $Q_{Y^n} \in \mathcal{M}(\Omega)$, Equation (6) implies that:
$$\begin{aligned}
\sum_{x^n \in A} D(P_{Y^n|x^n} \| Q_{Y^n}) &= D(A) + |A| D(\mu_A \| Q_{Y^n}) \\
\sum_{x^n \in A^c} D(P_{Y^n|x^n} \| Q_{Y^n}) &= D(A^c) + |A^c| D(\mu_{A^c} \| Q_{Y^n}).
\end{aligned}$$
Thus, we have:
$$\sum_{x^n \in \Omega} D(P_{Y^n|x^n} \| Q_{Y^n}) = D(A) + D(A^c) + |A| D(\mu_A \| Q_{Y^n}) + |A^c| D(\mu_{A^c} \| Q_{Y^n}).$$
Note that $\sum_{x^n \in \Omega} D(P_{Y^n|x^n} \| Q_{Y^n})$ does not depend on $A$, and therefore, we have the following theorem.
Theorem 3.
For any $Q_{Y^n} \in \mathcal{M}(\Omega)$, minimizing $D(A) + D(A^c)$ is equivalent to maximizing:
$$|A| D(\mu_A \| Q_{Y^n}) + |A^c| D(\mu_{A^c} \| Q_{Y^n}).$$
The above theorem provides an alternative problem formulation of the original conjecture.
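The equivalent formulation can be checked numerically for small $n$. The following Python sketch (ours; `bsc_dist`, `kl`, and `D` are hypothetical helper names) brute-forces all nontrivial sets $A \subseteq \{0,1\}^3$ and reports the minimizer of $D(A) + D(A^c)$; the conjecture predicts a dictator set $\Omega_{i0}$ or its complement.

```python
import itertools
import numpy as np

n, p = 3, 0.1
Omega = list(itertools.product([0, 1], repeat=n))

def bsc_dist(xn):
    """P_{Y^n|X^n = xn} as a vector over all y^n, for a memoryless BSC(p)."""
    out = []
    for yn in Omega:
        d = sum(a != b for a, b in zip(xn, yn))
        out.append(p ** d * (1 - p) ** (n - d))
    return np.array(out)

dists = {x: bsc_dist(x) for x in Omega}

def kl(P, Q):
    return float(np.sum(P * np.log(P / Q)))

def D(A):
    """D(A) = sum_{x^n in A} D(P_{Y^n|x^n} || mu_A), with mu_A the arithmetic mean of the cluster."""
    mu_A = np.mean([dists[x] for x in A], axis=0)
    return sum(kl(dists[x], mu_A) for x in A)

best = min(
    (D(A) + D(tuple(x for x in Omega if x not in A)), A)
    for r in range(1, 2 ** n)
    for A in itertools.combinations(Omega, r)
)
print("min D(A) + D(A^c) =", best[0])
print("achieved by A =", best[1])  # expected: a dictator set {x^3 : x_i = 0} or its complement
```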

3.2. Connection to Clustering under Hamming Distance

In this section, we consider the duality between the above clustering problem under the KL divergence and clustering on $\Omega$ under the Hamming distance. The following theorem shows that the KL divergence on $\{P_{Y^n|x^n} : x^n \in \Omega\}$ corresponds to the Hamming distance on $\Omega$.
Theorem 4.
For all $x^n, \tilde{x}^n \in \Omega$, we have:
$$D(P_{Y^n|x^n} \| P_{Y^n|\tilde{x}^n}) = d_H(x^n, \tilde{x}^n) \cdot (1 - 2p) \log \frac{1-p}{p}$$
where $d_H(x^n, \tilde{x}^n)$ denotes the Hamming distance between $x^n$ and $\tilde{x}^n$.
This theorem implies that the distance between two measures $P_{Y^n|x^n}$ and $P_{Y^n|\tilde{x}^n}$ is proportional to the Hamming distance between the two binary vectors $x^n$ and $\tilde{x}^n$. The proof of the theorem is provided in Appendix B. Note that the KL divergence $D(\cdot \| \cdot)$ is symmetric on $\{P_{Y^n|x^n} : x^n \in \Omega\}$.
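Theorem 4 is easy to verify numerically. The short sketch below (ours; the helper names are hypothetical) checks the identity on random pairs of inputs.

```python
import itertools
import math
import random
import numpy as np

def bsc_dist(xn, p):
    """P_{Y^n|X^n=xn} over {0,1}^n for a memoryless BSC(p)."""
    n = len(xn)
    out = []
    for yn in itertools.product([0, 1], repeat=n):
        d = sum(a != b for a, b in zip(xn, yn))
        out.append(p ** d * (1 - p) ** (n - d))
    return np.array(out)

def kl(P, Q):
    return float(np.sum(P * np.log(P / Q)))

n, p = 5, 0.2
const = (1 - 2 * p) * math.log((1 - p) / p)
for _ in range(10):
    xn = tuple(random.randint(0, 1) for _ in range(n))
    xt = tuple(random.randint(0, 1) for _ in range(n))
    dH = sum(a != b for a, b in zip(xn, xt))
    assert math.isclose(kl(bsc_dist(xn, p), bsc_dist(xt, p)), dH * const)
print("D(P_{Y^n|x^n} || P_{Y^n|x~^n}) = d_H(x^n, x~^n) (1-2p) log((1-p)/p) on all random pairs tested.")
```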
In the above duality, we have a mapping between $\{P_{Y^n|x^n} : x^n \in \Omega\}$ and $\{0,1\}^n$; more precisely, $P_{Y^n|x^n} \leftrightarrow x^n$. This mapping naturally suggests an equivalent clustering problem of $n$-dimensional binary vectors. However, the cluster center $\mu_A$ is not an element of $\{P_{Y^n|x^n} : x^n \in \Omega\}$ in general. In order to formulate an equivalent clustering problem, we need to answer the question "Which $n$-dimensional vector corresponds to $\mu_A$?". A naive approach is to extend the set of binary vectors to $[0,1]^n$ under the $\ell_2$ distance instead of the Hamming distance. In such a case, the goal is to map $\mu_A$ to the arithmetic mean of the binary vectors in the set $A$. If this were valid, we could further simplify the problem into clustering a hypercube in $\mathbb{R}^n$. However, the following example shows that this naive extension is not valid.
Example 1.
Let $n = 2$, $A = \{00, 11\}$, and $B = \{01, 10\}$; then the arithmetic mean of the binary vectors in $A$ and that in $B$ are the same. However, $\mu_A$ is not equal to $\mu_B$.
Furthermore, the set $\Omega_{i0}$ is not the optimum choice when clustering the hypercube under $\ell_2$. Instead, we need to consider the set of measures directly. The following theorem provides a bit of geometric structure among the measures.
Theorem 5.
For all $x^n, \tilde{x}^n \in \Omega$ and $Q_{Y^n} \in \mathrm{conv}(\{P_{Y^n|x^n} : x^n \in \Omega\})$,
$$D(P_{Y^n|X^n=x^n} \| Q_{Y^n}) - D(P_{Y^n|X^n=\tilde{x}^n} \| Q_{Y^n}) \le k \cdot \left((1-p)^k - p^k\right) \log \frac{1-p}{p}$$
where $k = d_H(x^n, \tilde{x}^n)$.
The proof of the theorem is provided in Appendix C. Since $(1-p)^k - p^k \le 1 - 2p$ for all $k \ge 1$, Theorem 5 immediately implies the following corollary.
Corollary 2.
For all $x^n, \tilde{x}^n \in \Omega$ and $Q_{Y^n} \in \mathrm{conv}(\{P_{Y^n|x^n} : x^n \in \Omega\})$,
$$D(P_{Y^n|X^n=x^n} \| Q_{Y^n}) - D(P_{Y^n|X^n=\tilde{x}^n} \| Q_{Y^n}) \le D(P_{Y^n|X^n=x^n} \| P_{Y^n|X^n=\tilde{x}^n})$$
where $\mathrm{conv}(\mathcal{A})$ is the convex hull of the measures in the set $\mathcal{A}$.
This is a triangle inequality that can be useful when we consider the clustering problem of measures.

4. Geometric Mean of Measures

In the previous section, we formulated a clustering problem that is equivalent to the original mutual information maximization problem. In this section, we provide another approach using the geometric mean of measures. We define the geometric mean of measures formally and derive a nontrivial conjecture that is equivalent to Conjecture 1.

4.1. Definition of the Geometric Mean of Measures

For measures $P_1, P_2, \ldots, P_n \in \mathcal{M}(\mathcal{X})$ and weights $\alpha_i \ge 0$ such that $\sum_{i=1}^n \alpha_i = 1$, we considered the sum of KL divergences in (6):
$$\sum_{i=1}^n \alpha_i D(P_i \| Q). \tag{28}$$
We also observed that (28) is minimized when $Q$ is the arithmetic mean of the measures.
Since the KL divergence is asymmetric, it is natural to consider the sum of KL divergences in the other direction:
$$\begin{aligned}
\sum_{i=1}^n \alpha_i D(Q \| P_i) &= \sum_{i=1}^n \alpha_i \sum_{x \in \mathcal{X}} Q(x) \log \frac{Q(x)}{P_i(x)} \\
&= \sum_{x \in \mathcal{X}} \sum_{i=1}^n \alpha_i Q(x) \log \frac{Q(x)}{P_i(x)} \\
&= \sum_{x \in \mathcal{X}} Q(x) \log \frac{Q(x)}{\prod_{i=1}^n (P_i(x))^{\alpha_i}}.
\end{aligned}$$
Compared to the arithmetic mean that minimizes (28), $\prod_{i=1}^n (P_i(x))^{\alpha_i}$ can be considered as a geometric mean of measures. However, $\prod_{i=1}^n (P_i(x))^{\alpha_i}$ is not a probability measure in general, and normalization is required. With a normalizing constant $s$, we can define the geometric mean of measures by:
$$\bar{P}_G(x) = \frac{1}{s} \prod_{i=1}^n (P_i(x))^{\alpha_i}$$
where $s$ is a constant such that $\sum_{x \in \mathcal{X}} \bar{P}_G(x) = 1$, i.e.,
$$s = \sum_x \prod_{i=1}^n (P_i(x))^{\alpha_i}.$$
Then, we have:
$$\sum_{i=1}^n \alpha_i D(Q \| P_i) = D(Q \| \bar{P}_G) + \log \frac{1}{s},$$
which is minimized when $Q = \bar{P}_G$. Thus, for all $Q$,
$$\sum_{i=1}^n \alpha_i D(Q \| P_i) \ge \sum_{i=1}^n \alpha_i D(\bar{P}_G \| P_i) = \log \frac{1}{s}.$$
The above result provides a geometric compensation identity:
$$\sum_{i=1}^n \alpha_i D(Q \| P_i) = D(Q \| \bar{P}_G) + \sum_{i=1}^n \alpha_i D(\bar{P}_G \| P_i).$$
This also implies that $\log \frac{1}{s} \ge 0$.
Remark 3.
If $n = 2$, $s$ is called the $\alpha$-Chernoff coefficient, and it is called the Bhattacharyya coefficient when $\alpha_1 = \alpha_2 = 1/2$. The quantity $\log \frac{1}{s} = \sum_{i=1}^2 \alpha_i D(\bar{P}_G \| P_i)$ is known as the $\alpha$-Chernoff divergence. For more details, please see [10,11] and the references therein.
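The following Python sketch (ours, not from the paper) illustrates the normalized geometric mean and numerically checks the geometric compensation identity and the fact that the minimum value equals $\log\frac{1}{s}$; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(P, Q):
    return float(np.sum(P * np.log(P / Q)))

def geometric_mean(Ps, alphas):
    """Normalized geometric mean: P_G(x) = (1/s) prod_i P_i(x)^{alpha_i}."""
    g = np.prod(Ps ** alphas[:, None], axis=0)  # unnormalized geometric mean
    s = g.sum()                                 # normalizing constant
    return g / s, s

# Random measures on a 6-point alphabet and random weights.
Ps = rng.dirichlet(np.ones(6), size=4)
alphas = rng.dirichlet(np.ones(4))
PG, s = geometric_mean(Ps, alphas)

# Geometric compensation identity: sum_i a_i D(Q||P_i) = D(Q||P_G) + sum_i a_i D(P_G||P_i).
Q = rng.dirichlet(np.ones(6))
lhs = sum(a * kl(Q, P) for a, P in zip(alphas, Ps))
rhs = kl(Q, PG) + sum(a * kl(PG, P) for a, P in zip(alphas, Ps))
assert np.isclose(lhs, rhs)

# The minimum value equals log(1/s).
assert np.isclose(sum(a * kl(PG, P) for a, P in zip(alphas, Ps)), np.log(1 / s))
print("geometric compensation identity verified; log(1/s) =", np.log(1 / s))
```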
Under this definition, the geometric mean of measures in the set $\{P_{Y^n|\tilde{x}^n} : \tilde{x}^n \in B\}$ with uniform weights $\frac{1}{|B|}$ is given by:
$$\gamma_B(y^n) = \frac{1}{s_B} \prod_{\tilde{x}^n \in B} P_{Y^n|X^n}(y^n|\tilde{x}^n)^{1/|B|}$$
where:
$$s_B = \sum_{y^n} \prod_{\tilde{x}^n \in B} P_{Y^n|X^n}(y^n|\tilde{x}^n)^{1/|B|}.$$
Remark 4.
The original conjecture states that the Boolean function $f$ such that $f^{-1}(0) = \Omega_{i0} = \{x^n : x_i = 0\}$ maximizes the mutual information $I(f(X^n); Y^n)$. The geometric mean of measures in the set $\{P_{Y^n|x^n} : x^n \in \Omega_{i0}\}$ satisfies the following:
$$\gamma_{\Omega_{i0}} = \mu_{\Omega_{i0}}, \qquad s_{\Omega_{i0}} = 2^{n-1} \left(p(1-p)\right)^{(n-1)/2}.$$
Note that the geometric mean of measures in the set $\{P_{Y^n|x^n} : x^n \in \Omega\}$ satisfies:
$$\gamma_{\Omega} = \mu_{\Omega}, \qquad s_{\Omega} = 2^n \left(p(1-p)\right)^{n/2}.$$

4.2. Main Results

So far, we have seen two means of measures, $\mu_A$ and $\gamma_A$. It is natural to ask when they coincide. Our main theorem provides a connection to Conjecture 1.
Theorem 6.
Suppose $A$ is a nontrivial subset of $\Omega = \{0,1\}^n$ for $n > 0$ (i.e., $A \ne \emptyset, \Omega$), and $A$ is $\mathcal{I}$-compressed for all $|\mathcal{I}| = 1$. Then, $A = \Omega_{i0}$ for some $i$ if and only if $\mu_A = \gamma_A$ and $\mu_{A^c} = \gamma_{A^c}$.
The proof of the theorem is provided in Appendix D. Theorem 6 implies that the following conjecture is equivalent to Conjecture 1.
Conjecture 2.
Let $f: \Omega \to \{0,1\}$ and let $A = f^{-1}(0)$ be $\mathcal{I}$-compressed for all $|\mathcal{I}| = 1$. Then, $I(f(X^n); Y^n)$ is maximized if and only if $\mu_A = \gamma_A$ and $\mu_{A^c} = \gamma_{A^c}$.
Remark 5.
One of the main challenges of this problem is that the conjectured optimal sets are extremes, i.e., $A = \Omega_{i0}$ for some $i$. Our main theorem provides an alternative conjecture that seems more natural in the context of optimization.
Remark 6.
It is clear that $\mu_A = \gamma_A$ holds if $|A| = 1$. Thus, both conditions $\mu_A = \gamma_A$ and $\mu_{A^c} = \gamma_{A^c}$ are needed to guarantee $A = \Omega_{i0}$ for some $i$.
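Theorem 6 can be confirmed by brute force for small $n$. The sketch below (ours; the helpers and the tolerance-based equality test are our own choices) enumerates all nontrivial sets $A \subseteq \{0,1\}^3$ that are $\mathcal{I}$-compressed for every $|\mathcal{I}| = 1$ and reports those with $\mu_A = \gamma_A$ and $\mu_{A^c} = \gamma_{A^c}$.

```python
import itertools
import numpy as np

n, p = 3, 0.3
Omega = list(itertools.product([0, 1], repeat=n))

def bsc_dist(xn):
    out = []
    for yn in Omega:
        d = sum(a != b for a, b in zip(xn, yn))
        out.append(p ** d * (1 - p) ** (n - d))
    return np.array(out)

dists = {x: bsc_dist(x) for x in Omega}

def arith_mean(A):   # mu_A
    return np.mean([dists[x] for x in A], axis=0)

def geo_mean(A):     # gamma_A: normalized geometric mean
    g = np.exp(np.mean([np.log(dists[x]) for x in A], axis=0))
    return g / g.sum()

def one_compressed(A):
    """I-compressed for every |I| = 1: lowering any coordinate from 1 to 0 stays in A."""
    S = set(A)
    return all(x[:i] + (0,) + x[i + 1:] in S for x in S for i in range(n) if x[i] == 1)

matching = []
for r in range(1, 2 ** n):                       # nontrivial subsets only
    for A in itertools.combinations(Omega, r):
        if not one_compressed(A):
            continue
        Ac = tuple(x for x in Omega if x not in A)
        if np.allclose(arith_mean(A), geo_mean(A)) and np.allclose(arith_mean(Ac), geo_mean(Ac)):
            matching.append(A)
print(matching)  # Theorem 6: exactly the dictator sets {x^3 : x_i = 0}, i = 1, 2, 3
```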

4.3. Property of the Geometric Mean

We can derive a new identity by combining the original and geometric compensation identities. For $A, B \subseteq \Omega$, let $\pi(A, B)$ be:
$$\pi(A, B) = \sum_{(x^n, \tilde{x}^n) \in A \times B} D(P_{Y^n|x^n} \| P_{Y^n|\tilde{x}^n}).$$
Then,
$$\begin{aligned}
\pi(A, B) &= \sum_{(x^n, \tilde{x}^n) \in A \times B} D(P_{Y^n|x^n} \| P_{Y^n|\tilde{x}^n}) \\
&= \sum_{\tilde{x}^n \in B} \sum_{x^n \in A} D(P_{Y^n|x^n} \| P_{Y^n|\tilde{x}^n}) \\
&= \sum_{\tilde{x}^n \in B} \left( \sum_{x^n \in A} D(P_{Y^n|x^n} \| \mu_A) + |A| D(\mu_A \| P_{Y^n|\tilde{x}^n}) \right) \\
&= |B| D(A) + |A| \sum_{\tilde{x}^n \in B} D(\mu_A \| P_{Y^n|\tilde{x}^n})
\end{aligned}$$
where the third equality follows from the compensation identity (6). As we discussed in Section 4.1, the second term on the right-hand side is:
$$\begin{aligned}
\sum_{\tilde{x}^n \in B} D(\mu_A \| P_{Y^n|\tilde{x}^n}) &= \sum_{\tilde{x}^n \in B} \sum_{y^n} \mu_A(y^n) \log \frac{\mu_A(y^n)}{P_{Y^n|X^n}(y^n|\tilde{x}^n)} \\
&= |B| \sum_{y^n} \mu_A(y^n) \log \frac{\mu_A(y^n)}{\prod_{\tilde{x}^n \in B} P_{Y^n|X^n}(y^n|\tilde{x}^n)^{1/|B|}} \\
&= |B| D(\mu_A \| \gamma_B) + |B| \log \frac{1}{s_B} \\
&= |B| D(\mu_A \| \gamma_B) + \sum_{\tilde{x}^n \in B} D(\gamma_B \| P_{Y^n|\tilde{x}^n}).
\end{aligned}$$
Finally, we have:
$$\begin{aligned}
\frac{1}{|A||B|} \pi(A, B) &= \frac{1}{|A|} D(A) + D(\mu_A \| \gamma_B) + \log \frac{1}{s_B} \\
&= \frac{1}{|A|} \sum_{x^n \in A} D(P_{Y^n|x^n} \| \mu_A) + D(\mu_A \| \gamma_B) + \frac{1}{|B|} \sum_{\tilde{x}^n \in B} D(\gamma_B \| P_{Y^n|\tilde{x}^n}).
\end{aligned}$$
More interestingly, we can apply the original and geometric compensation identities:
$$\begin{aligned}
\frac{1}{|A||B|} \pi(A, B) &= \frac{1}{|A|} \sum_{x^n \in A} D(P_{Y^n|x^n} \| \mu_A) + \frac{1}{|B|} \sum_{\tilde{x}^n \in B} D(\mu_A \| P_{Y^n|\tilde{x}^n}) \\
&= \frac{1}{|A|} \sum_{x^n \in A} D(P_{Y^n|x^n} \| \gamma_B) + \frac{1}{|B|} \sum_{\tilde{x}^n \in B} D(\gamma_B \| P_{Y^n|\tilde{x}^n}).
\end{aligned}$$
From Theorem 4, we have $\pi(A, B) = \pi(B, A)$, and therefore, we can switch $A$ and $B$:
$$\begin{aligned}
\frac{1}{|A||B|} \pi(A, B) &= \frac{1}{|B|} \sum_{x^n \in B} D(P_{Y^n|x^n} \| \mu_B) + D(\mu_B \| \gamma_A) + \frac{1}{|A|} \sum_{\tilde{x}^n \in A} D(\gamma_A \| P_{Y^n|\tilde{x}^n}) \\
&= \frac{1}{|B|} \sum_{x^n \in B} D(P_{Y^n|x^n} \| \mu_B) + \frac{1}{|A|} \sum_{\tilde{x}^n \in A} D(\mu_B \| P_{Y^n|\tilde{x}^n}) \\
&= \frac{1}{|B|} \sum_{x^n \in B} D(P_{Y^n|x^n} \| \gamma_A) + \frac{1}{|A|} \sum_{\tilde{x}^n \in A} D(\gamma_A \| P_{Y^n|\tilde{x}^n}).
\end{aligned}$$
If we let $A = B$, we have:
$$\begin{aligned}
\frac{1}{|A|^2} \pi(A, A) &= \frac{1}{|A|} \sum_{x^n \in A} \left( D(P_{Y^n|x^n} \| \gamma_A) + D(\gamma_A \| P_{Y^n|x^n}) \right) \\
&= \frac{1}{|A|} \sum_{x^n \in A} \left( D(P_{Y^n|x^n} \| \mu_A) + D(\mu_A \| P_{Y^n|x^n}) \right) \\
&= \frac{1}{|A|} \sum_{x^n \in A} \left( D(P_{Y^n|x^n} \| \mu_A) + D(\mu_A \| \gamma_A) + D(\gamma_A \| P_{Y^n|x^n}) \right).
\end{aligned}$$
Note that $\frac{1}{|A|} \pi(A, A) + \frac{1}{|A^c|} \pi(A^c, A^c)$ is similar to a known clustering objective. In the clustering literature, the min-sum clustering problem [12] minimizes the sum of all edge weights within each cluster. Using $\pi$, we can describe the binary min-sum clustering problem on $\Omega$ as minimizing $\pi(A, A) + \pi(A^c, A^c)$.
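By Theorem 4, $\pi(A, B)$ is simply $(1-2p)\log\frac{1-p}{p}$ times the sum of pairwise Hamming distances, so the min-sum objective can be evaluated without touching the measures at all. A minimal sketch (ours) evaluating the objective for a dictator partition:

```python
import itertools
import math

def pi(A, B, p):
    """pi(A, B) = sum of D(P_{Y^n|x^n} || P_{Y^n|x~^n}) over A x B; by Theorem 4 this equals
    (1 - 2p) log((1-p)/p) times the sum of pairwise Hamming distances."""
    const = (1 - 2 * p) * math.log((1 - p) / p)
    return const * sum(sum(a != b for a, b in zip(x, xt)) for x in A for xt in B)

n, p = 3, 0.1
Omega = list(itertools.product([0, 1], repeat=n))
A = [x for x in Omega if x[0] == 0]   # dictator set
Ac = [x for x in Omega if x[0] == 1]
print("min-sum objective for the dictator partition:", pi(A, A, p) + pi(Ac, Ac, p))
```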

4.4. Another Application of the Geometric Mean

Using the geometric mean of measures, we can rewrite the clustering problem in a different form. Recall that $A \oplus x^n = \{\tilde{x}^n \oplus x^n : \tilde{x}^n \in A\}$. Then, we have:
$$D(A) = \sum_{x^n \in A} D(P_{Y^n|x^n} \| \mu_A) = \sum_{x^n \in A} D(P_{Y^n|0^n} \| \mu_{A \oplus x^n}).$$
Let $\tilde{\gamma}_A$ be the geometric mean of measures in the set $\{\mu_{A \oplus x^n} : x^n \in A\}$, i.e.,
$$\tilde{\gamma}_A(y^n) = \frac{1}{\tilde{s}_A} \prod_{x^n \in A} \mu_{A \oplus x^n}(y^n)^{1/|A|}$$
where:
$$\tilde{s}_A = \sum_{y^n} \prod_{x^n \in A} \mu_{A \oplus x^n}(y^n)^{1/|A|}.$$
Then, we have:
$$D(A) = |A| D(P_{Y^n|0^n} \| \tilde{\gamma}_A) + |A| \log \frac{1}{\tilde{s}_A}.$$
Needless to say,
$$D(A^c) = |A^c| D(P_{Y^n|0^n} \| \tilde{\gamma}_{A^c}) + |A^c| \log \frac{1}{\tilde{s}_{A^c}}.$$
The sum of the results is:
$$D(A) + D(A^c) = |A| D(P_{Y^n|0^n} \| \tilde{\gamma}_A) + |A^c| D(P_{Y^n|0^n} \| \tilde{\gamma}_{A^c}) + |A| \log \frac{1}{\tilde{s}_A} + |A^c| \log \frac{1}{\tilde{s}_{A^c}}.$$
This can be considered as a dual of Theorem 3.
Remark 7.
Let $\Omega_{i0} = \{x^n : x_i = 0\}$, which is the conjectured optimizer. Then,
$$\tilde{\gamma}_{\Omega_{i0}} = \tilde{\gamma}_{(\Omega_{i0})^c} = \mu_{\Omega_{i0}}, \qquad \tilde{s}_{\Omega_{i0}} = \tilde{s}_{(\Omega_{i0})^c} = 1.$$
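The sketch below (ours; helper names hypothetical) numerically verifies Remark 7 and the dual identity for $D(A)$ with $n = 3$ and $A = \Omega_{i0}$, $i = 1$.

```python
import itertools
import numpy as np

n, p = 3, 0.2
Omega = list(itertools.product([0, 1], repeat=n))

def bsc_dist(xn):
    out = []
    for yn in Omega:
        d = sum(a != b for a, b in zip(xn, yn))
        out.append(p ** d * (1 - p) ** (n - d))
    return np.array(out)

def kl(P, Q):
    return float(np.sum(P * np.log(P / Q)))

def mu(A):
    return np.mean([bsc_dist(x) for x in A], axis=0)

def gamma_tilde(A):
    """Normalized geometric mean of the shifted cluster centers mu_{A xor x^n}, x^n in A."""
    shifted = [mu([tuple(a ^ b for a, b in zip(xt, x)) for xt in A]) for x in A]
    g = np.exp(np.mean(np.log(shifted), axis=0))
    return g / g.sum(), g.sum()

A = [x for x in Omega if x[0] == 0]      # Omega_{i0} with i = 1
Ac = [x for x in Omega if x[0] == 1]
gA, sA = gamma_tilde(A)
gAc, sAc = gamma_tilde(Ac)
print(np.allclose(gA, mu(A)), np.allclose(gAc, mu(A)))  # both equal mu_{Omega_{i0}} (Remark 7)
print(np.isclose(sA, 1.0), np.isclose(sAc, 1.0))        # normalizing constants are 1

# D(A) computed directly vs. via the dual identity D(A) = |A| D(P_{Y^n|0^n} || gamma~_A) + |A| log(1/s~_A).
D_A = sum(kl(bsc_dist(x), mu(A)) for x in A)
dual = len(A) * kl(bsc_dist((0,) * n), gA) + len(A) * np.log(1 / sA)
print(np.isclose(D_A, dual))
```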

5. Concluding Remarks

In this paper, we have proposed a number of different formulations of the most informative Boolean function conjecture. Most of them are based on an information geometric approach. Furthermore, we focused on the (normalized) geometric mean of measures, which can simplify the problem formulation. More precisely, we showed that Conjecture 1 is true if and only if the maximum-achieving $f$ satisfies the following property: "the arithmetic and geometric means of measures are the same for $\{P_{Y^n|x^n} : x^n \in f^{-1}(0)\}$ as well as for $\{P_{Y^n|x^n} : x^n \in f^{-1}(1)\}$."

Funding

This work was supported by the Hongik University new faculty research support fund.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proof of Theorem 2

By the definition of mutual information, we have:
$$\begin{aligned}
I(X; Y) - I(f(X); Y) &= \mathbb{E}\left[\log \frac{P_{X,Y}(X, Y)}{P_X(X) P_Y(Y)}\right] - \mathbb{E}\left[\log \frac{P_{U,Y}(f(X), Y)}{P_U(f(X)) P_Y(Y)}\right] \\
&= \mathbb{E}\left[\log \frac{P_{Y|X}(Y|X)}{P_{Y|U}(Y|f(X))}\right] \\
&= \mathbb{E}\left[\mathbb{E}\left[\log \frac{P_{Y|X}(Y|X)}{P_{Y|U}(Y|f(X))} \,\Big|\, X\right]\right] \\
&= \sum_x P_X(x)\, \mathbb{E}\left[\log \frac{P_{Y|X}(Y|X)}{P_{Y|U}(Y|f(X))} \,\Big|\, X = x\right] \\
&= \sum_x P_X(x) D(P_{Y|x} \| P_{Y|U}(\cdot|f(x))).
\end{aligned}$$
This concludes the proof.

Appendix B. Proof of Theorem 4

Without loss of generality, we can assume that $x^n = 0^n$ and $\tilde{x}^n = 1^k 0^{n-k}$, where $k = d_H(x^n, \tilde{x}^n)$. Then, we have:
$$\begin{aligned}
D(P_{Y^n|x^n} \| P_{Y^n|\tilde{x}^n}) &= D(P_{Y^k|x^k} \times P_{Y_{k+1}^n|x_{k+1}^n} \,\|\, P_{Y^k|\tilde{x}^k} \times P_{Y_{k+1}^n|\tilde{x}_{k+1}^n}) \\
&= D(P_{Y^k|0^k} \times P_{Y_{k+1}^n|0_{k+1}^n} \,\|\, P_{Y^k|1^k} \times P_{Y_{k+1}^n|0_{k+1}^n}) \\
&= D(P_{Y^k|0^k} \| P_{Y^k|1^k}) + D(P_{Y_{k+1}^n|0_{k+1}^n} \| P_{Y_{k+1}^n|0_{k+1}^n}) \\
&= D(P_{Y^k|0^k} \| P_{Y^k|1^k}).
\end{aligned}$$
Thus,
$$\begin{aligned}
D(P_{Y^k|0^k} \| P_{Y^k|1^k}) &= k\, D(P_{Y|0} \| P_{Y|1}) \\
&= k \left( p \log \frac{p}{1-p} + (1-p) \log \frac{1-p}{p} \right) \\
&= k (1 - 2p) \log \frac{1-p}{p}.
\end{aligned}$$
Since $d_H(x^n, \tilde{x}^n) = k$, this concludes the proof.

Appendix C. Proof of Theorem 5

The following lemma bounds the ratio between $Q_{Y^n}(y^n)$ and $Q_{Y^n}(\tilde{y}^n)$, which will be crucial in our argument.
Lemma A1.
For $Q_{Y^n} \in \mathrm{conv}(\{P_{Y^n|x^n} : x^n \in \Omega\})$,
$$d_H(y^n, \tilde{y}^n) \cdot \log \frac{p}{1-p} \le \log \frac{Q_{Y^n}(y^n)}{Q_{Y^n}(\tilde{y}^n)} \le d_H(y^n, \tilde{y}^n) \cdot \log \frac{1-p}{p}.$$
Proof. 
Since $Q_{Y^n} \in \mathrm{conv}(\{P_{Y^n|x^n} : x^n \in \Omega\})$, we can write $Q_{Y^n} = \sum_{x^n} \pi(x^n) P_{Y^n|x^n}$ for some weights $\pi(x^n)$. Without loss of generality, we can assume that $\bar{y}^k = \tilde{y}^k$ and $y_{k+1}^n = \tilde{y}_{k+1}^n$ (i.e., $y^n$ and $\tilde{y}^n$ differ exactly in the first $k$ coordinates, where $\bar{y}_i = 1 - y_i$). Then,
$$\begin{aligned}
\frac{Q_{Y^n}(y^n)}{Q_{Y^n}(\bar{y}^k, y_{k+1}^n)} &= \frac{\sum_{x^n} \pi(x^n) P_{Y^n|X^n}(y^n|x^n)}{\sum_{x^n} \pi(x^n) P_{Y^n|X^n}(\bar{y}^k, y_{k+1}^n|x^n)} \\
&\le \left(\frac{1-p}{p}\right)^k \frac{\sum_{x^n} \pi(x^n) P_{Y^n|X^n}(y^n|x^n)}{\sum_{x^n} \pi(x^n) P_{Y^n|X^n}(y^n|x^n)} \\
&= \left(\frac{1-p}{p}\right)^k.
\end{aligned}$$
Similarly, we can show that:
$$\frac{Q_{Y^n}(y^n)}{Q_{Y^n}(\bar{y}^k, y_{k+1}^n)} \ge \left(\frac{p}{1-p}\right)^k.$$
This concludes the proof of the lemma. ☐
Without loss of generality, we can assume that $x^n = 0^n$ and $\tilde{x}^n = 1^k 0^{n-k}$. Then, we have:
$$P_{Y^n|X^n}(y^n|\tilde{x}^n) = P_{Y^n|X^n}(y^n|1^k 0^{n-k}) = P_{Y^n|X^n}(\bar{y}^k, y_{k+1}^n|0^n)$$
where $\bar{y}_i = 1 - y_i$. Thus, we have:
$$\begin{aligned}
& D(P_{Y^n|X^n=x^n} \| Q_{Y^n}) - D(P_{Y^n|X^n=\tilde{x}^n} \| Q_{Y^n}) \\
&\quad= \mathbb{E}\left[\log \frac{P_{Y^n|X^n}(Y^n|0^n)}{Q_{Y^n}(Y^n)}\right] - \mathbb{E}\left[\log \frac{P_{Y^n|X^n}(\bar{Y}^k, Y_{k+1}^n|0^n)}{Q_{Y^n}(Y^n)}\right] \\
&\quad= \mathbb{E}\left[\log \frac{P_{Y^n|X^n}(Y^n|0^n)}{Q_{Y^n}(Y^n)}\right] - \mathbb{E}\left[\log \frac{P_{Y^n|X^n}(Y^n|0^n)}{Q_{Y^n}(\bar{Y}^k, Y_{k+1}^n)}\right] \\
&\quad= \mathbb{E}\left[\log \frac{Q_{Y^n}(Y^n)}{Q_{Y^n}(\bar{Y}^k, Y_{k+1}^n)}\right].
\end{aligned}$$
Note that the two expectations in the first equality above are under different distributions ($P_{Y^n|X^n=0^n}$ and $P_{Y^n|X^n=\tilde{x}^n}$, respectively); on the other hand, the expectations in the second equality and the following equations are all under the same distribution $P_{Y^n|X^n=0^n}$.
The above expectation can be written as follows:
$$\begin{aligned}
& D(P_{Y^n|X^n=x^n} \| Q_{Y^n}) - D(P_{Y^n|X^n=\tilde{x}^n} \| Q_{Y^n}) \\
&\quad= \sum_{y^n} P_{Y^n|X^n}(y^n|0^n) \log \frac{Q_{Y^n}(y^n)}{Q_{Y^n}(\bar{y}^k, y_{k+1}^n)} \\
&\quad= \sum_{y_2^n} P_{Y^n|X^n}(0, y_2^n|0^n) \log \frac{Q_{Y^n}(0, y_2^n)}{Q_{Y^n}(1, \bar{y}_2^k, y_{k+1}^n)} + \sum_{y_2^n} P_{Y^n|X^n}(1, y_2^n|0^n) \log \frac{Q_{Y^n}(1, y_2^n)}{Q_{Y^n}(0, \bar{y}_2^k, y_{k+1}^n)} \\
&\quad= \sum_{y_2^n} P_{Y^n|X^n}(0, y_2^n|0^n) \log \frac{Q_{Y^n}(0, y_2^n)}{Q_{Y^n}(1, \bar{y}_2^k, y_{k+1}^n)} + \sum_{y_2^n} P_{Y^n|X^n}(1, \bar{y}_2^k, y_{k+1}^n|0^n) \log \frac{Q_{Y^n}(1, \bar{y}_2^k, y_{k+1}^n)}{Q_{Y^n}(0, y_2^n)} \\
&\quad= \sum_{y_2^n} \left( (1-p) P_{Y_2^n|X_2^n}(y_2^n|0^{n-1}) - p\, P_{Y_2^n|X_2^n}(\bar{y}_2^k, y_{k+1}^n|0^{n-1}) \right) \log \frac{Q_{Y^n}(0, y_2^n)}{Q_{Y^n}(1, \bar{y}_2^k, y_{k+1}^n)} \\
&\quad\le \sum_{y_2^n} \left( (1-p) P_{Y_2^n|X_2^n}(y_2^n|0^{n-1}) - p\, P_{Y_2^n|X_2^n}(\bar{y}_2^k, y_{k+1}^n|0^{n-1}) \right) k \log \frac{1-p}{p} \\
&\quad= \sum_{y_2^k} \left( (1-p) P_{Y_2^k|X_2^k}(y_2^k|0^{k-1}) - p\, P_{Y_2^k|X_2^k}(\bar{y}_2^k|0^{k-1}) \right) k \log \frac{1-p}{p}
\end{aligned}$$
where the inequality follows from Lemma A1.
Finally,
$$\begin{aligned}
\sum_{y_2^k} \left( (1-p) P_{Y_2^k|X_2^k}(y_2^k|0^{k-1}) - p\, P_{Y_2^k|X_2^k}(\bar{y}_2^k|0^{k-1}) \right)
&= \sum_{i=0}^{k-1} \binom{k-1}{i} \left( (1-p) p^i (1-p)^{k-1-i} - p \cdot p^{k-1-i} (1-p)^i \right) \\
&\le (1-p)^k - p^k.
\end{aligned}$$
This concludes the proof.

Appendix D. Proof of Theorem 6

From the first assumption, $\mu_A = \gamma_A$, we have:
$$\begin{aligned}
\sum_{y^{n-1}} \mu_A(y^{n-1}0) &= \sum_{y^{n-1}} \frac{1}{|A|} \sum_{x^n \in A} P_{Y^n|X^n}(y^{n-1}0|x^n) \\
&= \sum_{y^{n-1}} \frac{1}{|A|} \sum_{x^n \in A_{n0}} P_{Y^n|X^n}(y^{n-1}0|x^n) + \sum_{y^{n-1}} \frac{1}{|A|} \sum_{x^n \in A_{n1}} P_{Y^n|X^n}(y^{n-1}0|x^n) \\
&= \frac{1-p}{|A|} \sum_{x^n \in A_{n0}} \sum_{y^{n-1}} P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1}) + \frac{p}{|A|} \sum_{x^n \in A_{n1}} \sum_{y^{n-1}} P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1}) \\
&= (1-p) \frac{|A_{n0}|}{|A|} + p \frac{|A_{n1}|}{|A|}.
\end{aligned}$$
Clearly, we can get the following result in a similar manner:
$$\sum_{y^{n-1}} \mu_A(y^{n-1}1) = p \frac{|A_{n0}|}{|A|} + (1-p) \frac{|A_{n1}|}{|A|}.$$
The ratio of the two is given by:
$$\frac{\sum_{y^{n-1}} \mu_A(y^{n-1}1)}{\sum_{y^{n-1}} \mu_A(y^{n-1}0)} = \frac{p \frac{|A_{n0}|}{|A|} + (1-p) \frac{|A_{n1}|}{|A|}}{(1-p) \frac{|A_{n0}|}{|A|} + p \frac{|A_{n1}|}{|A|}}.$$
On the other hand, we can also marginalize $\gamma_A$:
$$\begin{aligned}
\sum_{y^{n-1}} \gamma_A(y^{n-1}0) &= \frac{1}{s_A} \sum_{y^{n-1}} \prod_{x^n \in A} P_{Y^n|X^n}(y^{n-1}0|x^n)^{1/|A|} \\
&= \frac{1}{s_A} \sum_{y^{n-1}} \left( \prod_{x^n \in A_{n0}} (1-p) P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1}) \prod_{x^n \in A_{n1}} p\, P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1}) \right)^{1/|A|} \\
&= \frac{1}{s_A} \sum_{y^{n-1}} \left( (1-p)^{|A_{n0}|} p^{|A_{n1}|} \prod_{x^n \in A} P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1}) \right)^{1/|A|} \\
&= \frac{1}{s_A} (1-p)^{|A_{n0}|/|A|} p^{|A_{n1}|/|A|} \sum_{y^{n-1}} \prod_{x^n \in A} P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})^{1/|A|}.
\end{aligned}$$
Similarly, we have:
$$\sum_{y^{n-1}} \gamma_A(y^{n-1}1) = \frac{1}{s_A} p^{|A_{n0}|/|A|} (1-p)^{|A_{n1}|/|A|} \sum_{y^{n-1}} \prod_{x^n \in A} P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})^{1/|A|}.$$
Thus, the ratio is given by:
$$\frac{\sum_{y^{n-1}} \gamma_A(y^{n-1}1)}{\sum_{y^{n-1}} \gamma_A(y^{n-1}0)} = \frac{p^{|A_{n0}|/|A|} (1-p)^{|A_{n1}|/|A|}}{(1-p)^{|A_{n0}|/|A|} p^{|A_{n1}|/|A|}}.$$
Since $\mu_A = \gamma_A$, both ratios should be the same. Let $x = |A_{n0}|/|A|$, which implies $|A_{n1}|/|A| = 1 - x$. Then, we have:
$$\frac{px + (1-p)(1-x)}{(1-p)x + p(1-x)} = \frac{p^x (1-p)^{1-x}}{(1-p)^x p^{1-x}}.$$
If we let $y = \frac{p}{1-p}$, then the above equation can be further simplified to:
$$\frac{xy + (1-x)}{x + y(1-x)} = y^{2x-1}.$$
Lemma A2.
For fixed $0 < y < 1$, the only solutions of the above equation are $x = 0, \frac{1}{2}, 1$.
Proof. 
It is clear that $x = 0, \frac{1}{2}, 1$ are solutions of the equation:
$$\frac{xy + (1-x)}{x + y(1-x)} = y^{2x-1}.$$
It is enough to show that:
$$g_y(x) = \log(xy + (1-x)) - \log(x + y(1-x)) - (2x-1)\log y = 0$$
can have at most three solutions. Consider the derivative:
$$\frac{\partial}{\partial x} g_y(x) = \frac{y-1}{xy + 1 - x} - \frac{1-y}{x + y - xy} - 2\log y = 0,$$
which is equivalent to:
$$2(xy + 1 - x)(x + y - xy)\log y = (1+y)(y-1).$$
This is a quadratic equation in $x$, and therefore, $\frac{\partial}{\partial x} g_y(x) = 0$ can have at most two solutions. Thus, $g_y(x) = 0$ can have at most three solutions. ☐
This implies that $|A_{n0}| = 0$, $|A|/2$, or $|A|$. It is clear that the above result holds for all $i$ and for $A^c$ as well, i.e., $|A_{i0}|$ is either $0$, $|A|/2$, or $|A|$, and $|A^c_{i0}|$ is either $0$, $|A^c|/2$, or $|A^c|$. These cardinalities should satisfy the following equations:
$$\begin{aligned}
|A_{i0}| + |A_{i1}| &= |A| \\
|A_{i0}| + |A^c_{i0}| &= 2^{n-1} \\
|A^c_{i0}| + |A^c_{i1}| &= |A^c| \\
|A_{i1}| + |A^c_{i1}| &= 2^{n-1}
\end{aligned}$$
for all $1 \le i \le n$. Since $\mathcal{I}$-compressedness implies $A_{i1} \subseteq A_{i0}$ (after dropping the $i$-th coordinate), we have $|A_{i1}| \le |A_{i0}|$. Thus, $|A_{i0}|$ should be either $|A|/2$ or $|A|$ for all $i$. If $|A_{i0}| = |A|$ for some $i$, then $|A_{i1}| = 0$. Since $|A^c_{i1}|$ is either $0$, $|A^c|/2$, or $|A^c|$, but $A^c \ne \emptyset, \Omega$, we have $|A^c_{i1}| = 2^{n-1}$. Thus, $A = A_{i0}$ and $|A| = 2^{n-1}$, which implies $A = \Omega_{i0}$.
On the other hand, assume that $|A_{i0}| = |A_{i1}| = |A|/2$ for all $i$. Since $A$ is $\mathcal{I}$-compressed, $x^{i-1}1x_{i+1}^n \in A_{i1}$ implies $x^{i-1}0x_{i+1}^n \in A_{i0}$. However, we have $|A_{i0}| = |A_{i1}|$, and therefore:
$$x^{i-1}1x_{i+1}^n \in A_{i1} \iff x^{i-1}0x_{i+1}^n \in A_{i0}$$
and equivalently,
$$x^{i-1}1x_{i+1}^n \in A \iff x^{i-1}0x_{i+1}^n \in A$$
for all $i$. This can only be true when $A = \emptyset$ or $\Omega$, which contradicts our original assumption. This concludes the proof.

References

  1. Courtade, T.A.; Kumar, G.R. Which Boolean functions maximize mutual information on noisy inputs? IEEE Trans. Inf. Theory 2014, 60, 4515–4525.
  2. Pichler, G.; Matz, G.; Piantanida, P. A tight upper bound on the mutual information of two Boolean functions. In Proceedings of the 2016 IEEE Information Theory Workshop (ITW), Cambridge, UK, 11–14 September 2016; pp. 16–20.
  3. Ordentlich, O.; Shayevitz, O.; Weinstein, O. An improved upper bound for the most informative Boolean function conjecture. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 500–504.
  4. Weinberger, N.; Shayevitz, O. On the optimal Boolean function for prediction under quadratic loss. IEEE Trans. Inf. Theory 2017, 63, 4202–4217.
  5. Huleihel, W.; Ordentlich, O. How to quantize n outputs of a binary symmetric channel to n-1 bits? In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 91–95.
  6. Nazer, B.; Ordentlich, O.; Polyanskiy, Y. Information-distilling quantizers. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 96–100.
  7. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
  8. Topsøe, F. An information theoretical identity and a problem involving capacity. Stud. Sci. Math. Hung. 1967, 2, 291–292.
  9. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
  10. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett. 2013, 20, 269–272.
  11. Nielsen, F.; Boltz, S. The Burbea–Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
  12. Guttmann-Beck, N.; Hassin, R. Approximation algorithms for min-sum p-clustering. Discrete Appl. Math. 1998, 89, 125–142.
