Article

Learnability for the Information Bottleneck

1
Department of Physics, MIT, 77 Massachusetts Ave, Cambridge, MA 02139, USA
2
Google Research, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA
*
Author to whom correspondence should be addressed.
Entropy 2019, 21(10), 924; https://doi.org/10.3390/e21100924
Submission received: 1 August 2019 / Revised: 29 August 2019 / Accepted: 12 September 2019 / Published: 23 September 2019
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)

Abstract
The Information Bottleneck (IB) method provides an insightful and principled approach for balancing compression and prediction in representation learning. The IB objective I(X;Z) − βI(Y;Z) employs a Lagrange multiplier β to tune this trade-off. However, in practice, not only is β chosen empirically without theoretical guidance, but there is also a lack of theoretical understanding of the relationship between β, learnability, the intrinsic nature of the dataset and model capacity. In this paper, we show that if β is improperly chosen, learning cannot happen: the trivial representation P(Z|X) = P(Z) becomes the global minimum of the IB objective. We show how this can be avoided, by identifying a sharp phase transition between the unlearnable and the learnable which arises as β is varied. This phase transition defines the concept of IB-Learnability. We prove several sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good β. We further show that IB-Learnability is determined by the largest confident, typical and imbalanced subset of the examples (the conspicuous subset), and discuss its relation with model capacity. We give practical algorithms to estimate the minimum β for a given dataset. We also empirically demonstrate our theoretical conditions with analyses of synthetic datasets, MNIST and CIFAR10.

1. Introduction

Tishby et al. [1] introduced the Information Bottleneck (IB) objective function which learns a representation Z of observed variables ( X , Y ) that retains as little information about X as possible but simultaneously captures as much information about Y as possible:
$$\min \, \mathrm{IB}_\beta(X,Y;Z) = \min\left[ I(X;Z) - \beta I(Y;Z) \right] \tag{1}$$
I ( · ) is the mutual information. The hyperparameter β controls the trade-off between compression and prediction, in the same spirit as Rate-Distortion Theory [2] but with a learned representation function P ( Z | X ) that automatically captures some part of the “semantically meaningful” information, where the semantics are determined by the observed relationship between X and Y. The IB framework has been extended to and extensively studied in a variety of scenarios, including Gaussian variables [3], meta-Gaussians [4], continuous variables via variational methods [5,6,7], deterministic scenarios [8,9], geometric clustering [10] and is used for learning invariant and disentangled representations in deep neural nets [11,12].
From the IB objective (Equation (1)) we see that when β → 0, it encourages I(X;Z) = 0, which leads to a trivial representation Z that is independent of X, while when β → +∞, it reduces to a maximum likelihood objective (e.g., in classification, it reduces to cross-entropy loss). Therefore, as we vary β from 0 to +∞, there must exist a point β₀ at which IB starts to learn a nontrivial representation where Z contains information about X.
As an example, we train multiple variational information bottleneck (VIB) models on binary classification of MNIST [13] digits 0 and 1 with 20% label noise, at different values of β. The accuracy vs. β is shown in Figure 1. We see that when β < 3.25, no learning happens and the accuracy equals that of random guessing. Beginning with β > 3.25, there is a clear phase transition where the accuracy sharply increases, indicating that the objective is able to learn a nontrivial representation. In general, we observe that different datasets and model capacities result in different values of β₀ at which IB starts to learn a nontrivial representation. How does β₀ depend on aspects of the dataset and model capacity, and how can we estimate it? What does an IB model learn at the onset of learning? Answering these questions may provide a deeper understanding of IB in particular and of learning on two observed variables in general.
In this work, we begin to answer the above questions. Specifically:
  • We introduce the concept of IB-Learnability and show that when we vary β , the IB objective will undergo a phase transition from the inability to learn to the ability to learn (Section 3).
  • Using the second-order variation, we derive sufficient conditions for IB-Learnability, which provide upper bounds for the learnability threshold β 0 (Section 4).
  • We show that IB-Learnability is determined by the largest confident, typical and imbalanced subset of the examples (the conspicuous subset), reveal its relationship with the slope of the Pareto frontier at the origin on the information plane I ( X ; Z ) vs. I ( Y ; Z ) and discuss its relation to model capacity (Section 5).
  • We prove a deep relationship between IB-Learnability, our upper bounds on β 0 , the hypercontractivity coefficient, the contraction coefficient and the maximum correlation (Section 5).
We also present an algorithm for estimating the onset of IB-Learnability and the conspicuous subset, which provides us with a tool for understanding a key aspect of the learning problem (X, Y) (Section 6). Finally, we use our main results to demonstrate on synthetic datasets, MNIST [13] and CIFAR10 [14] that the theoretical prediction for IB-Learnability closely matches experiment, and we show the conspicuous subset that our algorithm discovers (Section 7).

2. Related Work

The seminal IB work [1] provides a tabular method for exactly computing the optimal encoder distribution P(Z|X) for a given β and cardinality of the discrete representation, |Z|. They did not consider the IB learnability problem as addressed in this work. Chechik et al. [3] present the Gaussian Information Bottleneck (GIB) for learning a multivariate Gaussian representation Z of (X, Y), assuming that both X and Y are multivariate Gaussians. Under GIB, they derive an analytic formula for the optimal representation as a noisy linear projection onto eigenvectors of the normalized regression matrix Σ_{x|y}Σ_x^{−1}; the learnability threshold β₀ is then given by β₀ = 1/(1 − λ₁), where λ₁ is the smallest eigenvalue of Σ_{x|y}Σ_x^{−1}. This work provides deep insights about the relations between the dataset, β₀ and optimal representations in the Gaussian scenario, but the restriction to multivariate Gaussian datasets limits the generality of the analysis. Another analytic treatment of IB is given in [4], which reformulates the objective in terms of copula functions. As with the GIB approach, this formulation restricts the form of the data distributions: the copula functions for the joint distribution (X, Y) are assumed to be known, which is unlikely in practice.
Strouse and Schwab [8] present the Deterministic Information Bottleneck (DIB), which minimizes the coding cost of the representation, H ( Z ) , rather than the transmission cost, I ( X ; Z ) as in IB. This approach learns hard clusterings with different code entropies that vary with β . In this case, it is clear that a hard clustering with minimal H ( Z ) will result in a single cluster for all of the data, which is the DIB trivial solution. No analysis is given beyond this fact to predict the actual onset of learnability, however.
The first amortized IB objective is in the Variational Information Bottleneck (VIB) of Alemi et al. [5]. VIB replaces the exact, tabular approach of IB with variational approximations of the classifier distribution P(Y|Z) and marginal distribution P(Z). This approach cleanly permits learning a stochastic encoder, P(Z|X), that is applicable to any x ∈ X, rather than just the particular X seen at training time. The cost of this flexibility is the use of variational approximations that may be less expressive than the tabular method. Nevertheless, in practice, VIB learns easily and is simple to implement, so we rely on VIB models for our experimental confirmation.
Closely related to IB is the recently proposed Conditional Entropy Bottleneck (CEB) [7]. CEB attempts to explicitly learn the Minimum Necessary Information (MNI), defined as the point in the information plane where I(X;Y) = I(X;Z) = I(Y;Z). The MNI point may not be achievable even in principle for a particular dataset. However, the CEB objective provides an explicit estimate of how closely the model approaches the MNI point, by observing that a necessary condition for reaching the MNI point is I(X;Z|Y) = 0. The CEB objective I(X;Z|Y) − γI(Y;Z) is equivalent to IB at β = γ + 1, so our analysis of IB-Learnability applies equally to CEB.
Kolchinsky et al. [9] show that when Y is a deterministic function of X, the “corner point” of the IB curve (where I(X;Y) = I(X;Z) = I(Y;Z)) is the unique optimizer of the IB objective for all 0 < β′ < 1 (in the parameterization of Kolchinsky et al. [9], β′ = 1/β), which they consider to be a “trivial solution”. However, their use of the term “trivial solution” is distinct from ours. They are referring to the observation that all points on the IB curve are then uninteresting interpolations between two different but valid solutions on the optimal frontier, rather than demonstrating a non-trivial trade-off between compression and prediction as expected when varying the IB Lagrangian. Our use of “trivial” refers to whether IB is capable of learning at all for a given dataset and value of β.
Achille and Soatto [12] apply the IB Lagrangian to the weights of a neural network, yielding InfoDropout. In Achille and Soatto [11], the authors give a deep and compelling analysis of how the IB Lagrangian can yield invariant and disentangled representations. They do not, however, consider the question of the onset of learning, although they are aware that not all models will learn a non-trivial representation. More recently, Achille et al. [15] repurpose the InfoDropout IB Lagrangian as a Kolmogorov Structure Function to analyze the ease with which a previously-trained network can be fine-tuned for a new task. While that work is tangentially related to learnability, the question it addresses is substantially different from our investigation of the onset of learning.
Our work is also closely related to the hypercontractivity coefficient [16,17], defined as sup_{Z−X−Y} I(Y;Z)/I(X;Z), which by definition equals the inverse of β₀, our IB-Learnability threshold. In [16], the authors prove that the hypercontractivity coefficient equals the contraction coefficient η_KL(P_{Y|X}, P_X), and Kim et al. [18] propose a practical algorithm to estimate η_KL(P_{Y|X}, P_X), which provides a measure of potential influence in the data. Although our goal is different, the sufficient conditions we provide for IB-Learnability also yield lower bounds on the hypercontractivity coefficient.

3. IB-Learnability

We are given instances of (x, y) drawn from a distribution with probability (density) P(X, Y) with support X × Y, where, unless otherwise stated, both X and Y can be discrete or continuous variables. We use capital letters X, Y, Z for random variables and lowercase x, y, z to denote instances of these variables, with P(·) and p(·) denoting their probability or probability density, respectively. (X, Y) is our training data and may be characterized by different types of noise. The nature of this training data and the choice of β will be sufficient to predict the transition from unlearnable to learnable.
We can learn a representation Z of X with conditional probability p(z|x), such that X, Y, Z obey the Markov chain Z − X − Y. Equation (1) above gives the IB objective with Lagrange multiplier β, IB_β(X,Y;Z), which is a functional of p(z|x): IB_β(X,Y;Z) = IB_β[p(z|x)]. The IB learning task is to find a conditional probability p(z|x) that minimizes IB_β(X,Y;Z). The larger β, the more the objective favors making a good prediction for Y. Conversely, the smaller β, the more the objective favors learning a concise representation.
How can we select β such that the IB objective learns a useful representation? In practice, the selection of β is done empirically. Indeed, Tishby et al. [1] recommends “sweeping β ”. In this paper, we provide theoretical guidance for choosing β by introducing the concept of IB-Learnability and providing a series of IB-learnable conditions.
Definition 1.
(X, Y) is IB_β-learnable if there exists a Z given by some p₁(z|x), such that IB_β(X,Y;Z)|_{p₁(z|x)} < IB_β(X,Y;Z)|_{p(z|x)=p(z)}, where p(z|x) = p(z) characterizes the trivial representation such that Z = Z_trivial is independent of X.
If (X, Y) is IB_β-learnable, then when IB_β(X,Y;Z) is globally minimized, it will not learn a trivial representation. On the other hand, if (X, Y) is not IB_β-learnable, then when IB_β(X,Y;Z) is globally minimized, it may learn a trivial representation.

3.1. Trivial Solutions

Definition 1 defines trivial solutions in terms of representations where I(X;Z) = I(Y;Z) = 0. Another type of trivial solution occurs when I(X;Z) > 0 but I(Y;Z) = 0. This type of trivial solution is not directly encouraged by the IB objective, as I(X;Z) is being minimized, but it can be achieved by construction or by chance. It is possible that starting learning from I(X;Z) > 0, I(Y;Z) = 0 could give access to non-trivial solutions not available from I(X;Z) = 0. We do not attempt to investigate this type of trivial solution in this work.

3.2. Necessary Condition for IB-Learnability

From Definition 1, we can see that IB_β-Learnability for any dataset (X, Y) requires β > 1. In fact, from the Markov chain Z − X − Y, we have I(Y;Z) ≤ I(X;Z) via the data-processing inequality. If β ≤ 1, then since I(X;Z) ≥ 0 and I(Y;Z) ≥ 0, we have min(I(X;Z) − βI(Y;Z)) = 0 = IB_β(X,Y;Z_trivial). Hence (X, Y) is not IB_β-learnable for β ≤ 1.
Due to the reparameterization invariance of mutual information, we have the following lemma for IB_β-Learnability:
Lemma 1.
Let X′ = g(X) be an invertible map (if X is a continuous variable, g is additionally required to be continuous). Then (X′, Y) and (X, Y) have the same IB_β-Learnability.
The proof for Lemma 1 is in Appendix A.2. Lemma 1 implies a favorable property for any condition for IB β -Learnability: the condition should be invariant to invertible mappings of X. We will inspect this invariance in the conditions we derive in the following sections.

4. Sufficient Conditions for IB-Learnability

Given ( X , Y ) , how can we determine whether it is IB β -learnable? To answer this question, we derive a series of sufficient conditions for IB β -Learnability, starting from its definition. The conditions are in increasing order of practicality, while sacrificing as little generality as possible.
Firstly, Theorem 1 characterizes the IB β -Learnability range for β , with proof in Appendix A.3:
Theorem 1.
If ( X , Y ) is IB β 1 -learnable, then for any β 2 > β 1 , it is IB β 2 -learnable.
Based on Theorem 1, the range of β such that (X, Y) is IB_β-learnable has the form β ∈ (β₀, +∞). Thus, β₀ is the threshold of IB-Learnability.
Lemma 2.
p ( z | x ) = p ( z ) is a stationary solution for IB β ( X , Y ; Z ) .
The proof in Appendix A.6 shows that both first-order variations vanish at the trivial representation p(z|x) = p(z), i.e., δI(X;Z) = 0 and δI(Y;Z) = 0, so that δIB_β[p(z|x)] = 0 at the trivial representation.
Lemma 2 yields our strategy for finding sufficient conditions for learnability: find conditions such that p ( z | x ) = p ( z ) is not a local minimum for the functional IB β [ p ( z | x ) ] . Based on the necessary condition for the minimum (Appendix A.4), we have the following theorem (The theorems in this paper deal with learnability w.r.t. true mutual information. If parameterized models are used to approximate the mutual information, the limitation of the model capacity will translate into more uncertainty of Y given X, viewed through the lens of the model.):
Theorem 2
(Suff. Cond. 1). A sufficient condition for (X, Y) to be IB_β-learnable is that there exists a perturbation function h(z|x) (so that the perturbed probability (density) is p′(z|x) = p(z|x) + ε·h(z|x)) with ∫h(z|x)dz = 0, such that the second-order variation δ²IB_β[p(z|x)] < 0 at the trivial representation p(z|x) = p(z).
The proof for Theorem 2 is given in Appendix A.4. Intuitively, if δ²IB_β[p(z|x)]|_{p(z|x)=p(z)} < 0, we can always find a p′(z|x) = p(z|x) + ε·h(z|x) in the neighborhood of the trivial representation p(z|x) = p(z), such that IB_β[p′(z|x)] < IB_β[p(z|x)], thus satisfying the definition of IB_β-Learnability.
To make Theorem 2 more practical, we perturb p(z|x) around the trivial solution, p′(z|x) = p(z|x) + ε·h(z|x), and expand IB_β[p(z|x) + ε·h(z|x)] − IB_β[p(z|x)] to second order in ε. We can then prove Theorem 3:
Theorem 3
(Suff. Cond. 2). A sufficient condition for (X, Y) to be IB_β-learnable is that X and Y are not independent and

$$\beta > \inf_{h(x)} \beta_0[h(x)]$$

where the functional β₀[h(x)] is given by

$$\beta_0[h(x)] = \frac{\mathbb{E}_{x\sim p(x)}\!\left[h(x)^2\right] - \left(\mathbb{E}_{x\sim p(x)}[h(x)]\right)^2}{\mathbb{E}_{y\sim p(y)}\!\left[\left(\mathbb{E}_{x\sim p(x|y)}[h(x)]\right)^2\right] - \left(\mathbb{E}_{x\sim p(x)}[h(x)]\right)^2} \tag{2}$$

Moreover, (inf_{h(x)} β₀[h(x)])^{−1} is a lower bound on the slope of the Pareto frontier in the information plane I(Y;Z) vs. I(X;Z) at the origin.
The proof is given in Appendix A.7, which also shows that if the condition β > inf_{h(x)} β₀[h(x)] of Theorem 3 is satisfied, we can construct a perturbation function h(z|x) = h(x)h₂(z), with h(x) = arg min_{h(x)} β₀[h(x)] and some h₂(z) satisfying ∫h₂(z)dz = 0 and ∫(h₂²(z)/p(z))dz > 0, such that h(z|x) satisfies Theorem 2. It also shows that the converse is true: if there exists an h(z|x) such that the condition in Theorem 2 holds, then Theorem 3 is satisfied, that is, β > inf_{h(x)} β₀[h(x)]. (We do not claim that every h(z|x) satisfying Theorem 2 can be decomposed as h(x)h₂(z) at the onset of learning. But from the equivalence of Theorems 2 and 3 explained above, when there exists an h(z|x) such that Theorem 2 is satisfied, we can always construct an h(z|x) = h(x)h₂(z) that also satisfies Theorem 2.) Moreover, letting the perturbation function be h(z|x) = h(x)h₂(z) at the trivial solution, we have
$$p_\beta(y|x) = p(y) + \epsilon^2 C_z\left(h(x) - \bar{h}_x\right)\int p(x',y)\left(h(x') - \bar{h}_x\right)dx' \tag{3}$$
where p_β(y|x) is the p(y|x) estimated by IB for a given β, h̄_x = ∫h(x)p(x)dx, and C_z = ∫(h₂²(z)/p(z))dz > 0 is a constant. This shows how the p_β(y|x) given by IB explicitly depends on h(x) at the onset of learning. The proof is provided in Appendix A.8.
Theorem 3 suggests a method to estimate β₀: we can parameterize h(x), for example by a neural network, with the objective of minimizing β₀[h(x)]. At its minimum, β₀[h(x)] provides an upper bound for β₀, and h(x) provides a soft clustering of the examples, corresponding to a nontrivial perturbation of p(z|x) away from p(z|x) = p(z) that minimizes IB_β[p(z|x)].
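As a minimal illustration of this estimation procedure (our own sketch, not the implementation used for the experiments in this paper: the toy two-cluster data, the small MLP for h(x) and all hyperparameters are assumptions), β₀[h(x)] from Equation (2) can be minimized with SGD, with the class-conditional expectations estimated from the samples of each class:

import torch
import torch.nn as nn

# Toy two-cluster data with 20% class-conditional label noise (assumed for illustration).
torch.manual_seed(0)
n = 4000
y = torch.randint(0, 2, (n,))
x = torch.randn(n, 2) + 8.0 * y[:, None].float()   # two well-separated clusters
y = torch.where(torch.rand(n) < 0.2, 1 - y, y)     # flip 20% of the labels

h = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # h(x): X -> R
opt = torch.optim.Adam(h.parameters(), lr=1e-3)

def beta0_functional(hx, y, num_classes=2):
    # beta_0[h] = (E[h^2] - E[h]^2) / (E_y[(E_{x|y}[h])^2] - E[h]^2), Equation (2)
    mean_h = hx.mean()
    numer = (hx ** 2).mean() - mean_h ** 2
    between = sum((y == c).float().mean() * hx[y == c].mean() ** 2
                  for c in range(num_classes))     # E_{y~p(y)}[(E_{x~p(x|y)}[h])^2]
    return numer / (between - mean_h ** 2)

for step in range(2000):
    opt.zero_grad()
    loss = beta0_functional(h(x).squeeze(-1), y)   # an upper bound on beta_0
    loss.backward()
    opt.step()

print("upper bound on beta_0:", float(beta0_functional(h(x).squeeze(-1), y)))

For this balanced binary dataset with flip rate 0.2, the minimum should approach 1/(1 − 2·0.2)² ≈ 2.78 (cf. the class-conditional noise formula in Section 6.2.1).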
Alternatively, based on the properties of β₀[h(x)], we can use a specific functional form for h(x) in Equation (2) and obtain a stronger sufficient condition for IB_β-Learnability. We want to choose h(x) as near to the infimum as possible. To do this, we note the following characteristics of the R.H.S. of Equation (2):
  • We can set h(x) to be nonzero for x ∈ Ω_x, for some region Ω_x ⊂ X, and 0 otherwise. Then we obtain the following sufficient condition:

$$\beta > \inf_{h(x),\,\Omega_x\subset\mathcal{X}} \frac{\dfrac{\mathbb{E}_{x\sim p(x),x\in\Omega_x}\!\left[h(x)^2\right]}{\left(\mathbb{E}_{x\sim p(x),x\in\Omega_x}[h(x)]\right)^2}-1}{\displaystyle\int dy\,\frac{\left(\mathbb{E}_{x\sim p(x),x\in\Omega_x}\!\left[p(y|x)h(x)\right]\right)^2}{p(y)\left(\mathbb{E}_{x\sim p(x),x\in\Omega_x}[h(x)]\right)^2}-1} \tag{4}$$
  • The numerator of the R.H.S. of Equation (4) attains its minimum when h(x) is a constant within Ω_x. This can be proved using the Cauchy–Schwarz inequality ⟨u,u⟩⟨v,v⟩ ≥ ⟨u,v⟩², setting u(x) = h(x)√p(x), v(x) = √p(x) and defining the inner product as ⟨u,v⟩ = ∫u(x)v(x)dx. Therefore, the numerator of the R.H.S. of Equation (4) is at least 1/∫_{x∈Ω_x}p(x)dx − 1, with equality attained when u(x)/v(x) = h(x) is constant.
Based on these observations, we can let h(x) be a nonzero constant inside some region Ω_x ⊂ X and 0 otherwise; the infimum over an arbitrary function h(x) then simplifies to an infimum over Ω_x ⊂ X, and we obtain a sufficient condition for IB_β-Learnability that is a key result of this paper:
Theorem 4
(Conspicuous Subset Suff. Cond.). A sufficient condition for (X, Y) to be IB_β-learnable is that X and Y are not independent and

$$\beta > \inf_{\Omega_x \subset \mathcal{X}} \beta_0(\Omega_x)$$

where

$$\beta_0(\Omega_x) = \frac{\dfrac{1}{p(\Omega_x)} - 1}{\mathbb{E}_{y\sim p(y|\Omega_x)}\!\left[\dfrac{p(y|\Omega_x)}{p(y)}\right] - 1} \tag{5}$$

Here Ω_x denotes the event that x ∈ Ω_x, with probability p(Ω_x). Moreover, (inf_{Ω_x⊂X} β₀(Ω_x))^{−1} gives a lower bound on the slope of the Pareto frontier in the information plane I(Y;Z) vs. I(X;Z) at the origin.
The proof is given in Appendix A.9. In the proof we also show that this condition is invariant to invertible mappings of X.
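To make Equation (5) concrete, the following sketch (ours; the small joint distribution is an assumption chosen purely for illustration) evaluates β₀(Ω_x) for every proper subset of a discrete X and returns the conspicuous subset:

import itertools
import numpy as np

# A small discrete joint distribution p(x, y): rows are x in {0,..,3}, columns y in {0,1}.
p_xy = np.array([[0.30, 0.02],
                 [0.25, 0.03],
                 [0.05, 0.15],
                 [0.04, 0.16]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

def beta0_subset(omega):
    # beta_0(Omega_x) from Equation (5)
    idx = list(omega)
    p_omega = p_x[idx].sum()
    p_y_given_omega = p_xy[idx].sum(axis=0) / p_omega
    numer = 1.0 / p_omega - 1.0
    denom = (p_y_given_omega ** 2 / p_y).sum() - 1.0  # E_{y|Omega}[p(y|Omega)/p(y)] - 1
    return numer / denom

subsets = [s for r in range(1, len(p_x))              # proper subsets of X
           for s in itertools.combinations(range(len(p_x)), r)]
best = min(subsets, key=beta0_subset)
print("conspicuous subset:", best, "  beta_0 bound:", beta0_subset(best))

The minimizing subset is the confident, typical and imbalanced one, in line with the discussion in Section 5.1 below.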

5. Discussion

5.1. The Conspicuous Subset Determines β 0

From Equation (5), we see that three characteristics of the subset Ω_x ⊂ X lead to low β₀: (1) confidence: p(y|Ω_x) is large; (2) typicality and size: the number of elements in Ω_x is large, or the elements in Ω_x are typical, leading to a large probability p(Ω_x); (3) imbalance: p(y) is small for the subset Ω_x but large for its complement. In summary, β₀ is determined by the largest confident, typical and imbalanced subset of examples, or by an equilibrium of those characteristics. We term the Ω_x that minimizes β₀(Ω_x) the conspicuous subset.

5.2. Multiple Phase Transitions

Based on this characterization of Ω_x, we can hypothesize datasets with multiple learnability phase transitions. Specifically, consider a region Ω_{x0} that is small but “typical”, consists of all elements confidently predicted as y₀ by p(y|x), and where y₀ is the least common class. By construction, this Ω_{x0} will dominate the infimum in Equation (5), resulting in a small value of β₀. However, the remaining X ∖ Ω_{x0} effectively forms a new dataset, X₁. At exactly β₀, we may have that the current encoder, p₀(z|x), has no mutual information with the remaining classes in X₁; that is, I(Y₁;Z₀) = 0. In this case, Definition 1 applies to p₀(z|x) with respect to I(X₁;Z₁). We might expect to see that, at β₀, learning will plateau until we reach some β₁ > β₀ that defines the phase transition for X₁. Clearly this process could repeat many times, with each new dataset X_i being distinctly more difficult to learn than X_{i−1}.

5.3. Similarity to Information Measures

The denominator of β₀(Ω_x) in Equation (5) is closely related to mutual information. Using the inequality x − 1 ≥ log(x) for x > 0, it becomes:

$$\mathbb{E}_{y\sim p(y|\Omega_x)}\!\left[\frac{p(y|\Omega_x)}{p(y)}\right] - 1 \;\geq\; \mathbb{E}_{y\sim p(y|\Omega_x)}\!\left[\log\frac{p(y|\Omega_x)}{p(y)}\right] = \tilde{I}(\Omega_x; Y)$$
where Ĩ(Ω_x;Y) is the mutual information “density” at Ω_x ⊂ X. Of course, this quantity is also D_KL[p(y|Ω_x) ∥ p(y)], so we know that the denominator of Equation (5) is non-negative. Incidentally, E_{y∼p(y|Ω_x)}[p(y|Ω_x)/p(y)] − 1 is the density of “rational mutual information” [19] at Ω_x.
Similarly, the numerator of β₀(Ω_x) is related to the self-information of Ω_x:

$$\frac{1}{p(\Omega_x)} - 1 \;\geq\; \log\frac{1}{p(\Omega_x)} = -\log p(\Omega_x) = h(\Omega_x)$$
so we can estimate β₀ as:

$$\beta_0 \approx \inf_{\Omega_x \subset \mathcal{X}} \frac{h(\Omega_x)}{\tilde{I}(\Omega_x; Y)} \tag{6}$$
Since Equation (6) uses upper bounds on both the numerator and the denominator, it does not give us a bound on β 0 , only an estimate.
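The estimate of Equation (6) is equally easy to compute. A sketch (ours), reusing the assumed toy joint distribution p_xy from the Section 4 sketch:

import numpy as np

def beta0_entropy_estimate(omega, p_xy):
    # h(Omega_x) / I~(Omega_x; Y), the estimate of Equation (6)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    idx = list(omega)
    p_omega = p_x[idx].sum()
    p_y_given_omega = p_xy[idx].sum(axis=0) / p_omega
    h_omega = -np.log(p_omega)                        # self-information h(Omega_x)
    i_tilde = (p_y_given_omega *
               np.log(p_y_given_omega / p_y)).sum()   # KL[p(y|Omega) || p(y)]
    return h_omega / i_tilde

Because the numerator and the denominator of Equation (5) were bounded separately, this value can land on either side of the exact β₀(Ω_x).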

5.4. Estimating Model Capacity

The observation that a model cannot distinguish between cluster overlap in the data and its own lack of capacity gives an interesting way to use IB-Learnability to measure the capacity of a set of models relative to the task they are being used to solve. For example, for a classification task, we can use different model classes to estimate p ( y | x ) . For each such trained model, we can estimate the corresponding IB-learnability threshold β 0 . A model with smaller capacity than the task needs will translate to more uncertainty in p ( y | Ω x ) , resulting in a larger β 0 . On the other hand, models that give the same β 0 as each other all have the same capacity relative to the task, even if we would otherwise expect them to have very different capacities. For example, if two deep models have the same core architecture but one has twice the number of parameters at each layer and they both yield the same β 0 , their capacities are equivalent with respect to the task. Thus, β 0 provides a way to measure model capacity in a task-specific manner.

5.5. Learnability and the Information Plane

Many of our results can be interpreted in terms of the geometry of the Pareto frontier illustrated in Figure 2, which describes the trade-off between increasing I(Y;Z) and decreasing I(X;Z). At any point on this frontier that minimizes IB_β = I(X;Z) − βI(Y;Z), the frontier will have slope β^{−1} if it is differentiable. If the frontier is also concave (has negative second derivative), then this slope β^{−1} will take its maximum value β₀^{−1} at the origin, which implies IB_β-Learnability for β > β₀, so that the threshold for IB_β-Learnability is simply the inverse slope of the frontier at the origin. More generally, as long as the Pareto frontier is differentiable, the threshold for IB_β-learnability is the inverse of its maximum slope. Indeed, Theorems 3 and 4 give lower bounds on the slope of the Pareto frontier at the origin.

5.6. IB-Learnability, Hypercontractivity and Maximum Correlation

IB-Learnability and the sufficient conditions we provide harbor a deep connection with hypercontractivity and maximum correlation:

$$\frac{1}{\beta_0} = \xi(X;Y) = \eta_{\mathrm{KL}} \;\geq\; \sup_{h(x)}\frac{1}{\beta_0[h(x)]} = \rho_m^2(X;Y) \tag{7}$$

which we prove in Appendix A.11. Here ρ_m(X;Y) ≡ max_{f,g} E[f(X)g(Y)] s.t. E[f(X)] = E[g(Y)] = 0 and E[f²(X)] = E[g²(Y)] = 1 is the maximum correlation [20,21], ξ(X;Y) ≡ sup_{Z−X−Y} I(Y;Z)/I(X;Z) is the hypercontractivity coefficient, and η_KL(p(y|x), p(x)) ≡ sup_{r(x)≠p(x)} D_KL(r(y)∥p(y))/D_KL(r(x)∥p(x)) is the contraction coefficient. Our proof relies on Anantharam et al. [16]’s result that ξ(X;Y) = η_KL. Our work reveals the deep relationship between IB-Learnability and these earlier concepts, and provides additional insight into what aspects of a dataset give rise to high maximum correlation and hypercontractivity: the most confident, typical, imbalanced subset of (X, Y).
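For discrete X and Y, these quantities are straightforward to check numerically: it is a standard fact (not specific to this paper) that ρ_m(X;Y) is the second-largest singular value of the matrix Q with Q_{xy} = p(x,y)/√(p(x)p(y)). A sketch, reusing the assumed toy joint distribution from above:

import numpy as np

p_xy = np.array([[0.30, 0.02],
                 [0.25, 0.03],
                 [0.05, 0.15],
                 [0.04, 0.16]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

Q = p_xy / np.sqrt(np.outer(p_x, p_y))   # Q_xy = p(x,y) / sqrt(p(x) p(y))
s = np.linalg.svd(Q, compute_uv=False)   # s[0] = 1; s[1] = rho_m(X; Y)
rho_m_sq = s[1] ** 2
print("rho_m^2 =", rho_m_sq, " beta_0 bound:", 1.0 / rho_m_sq)

By Equation (7), 1/ρ_m² here equals inf_{h(x)} β₀[h(x)], the Theorem 3 upper bound on β₀.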

6. Estimating the IB-Learnability Condition

Theorem 4 not only reveals the relationship between the learnability threshold for β and the least noisy region of P ( Y | X ) but also provides a way to practically estimate β 0 , both in the general classification case and in more structured settings.

6.1. Estimation Algorithm

Based on Theorem 4, for general classification tasks we suggest Algorithm 1 to empirically estimate an upper bound β̃₀ ≥ β₀, as well as to discover the conspicuous subset that determines β̃₀.
We approximate the probability of each example p(x_i) by its empirical probability p̂(x_i). For example, for MNIST, p̂(x_i) = 1/N, where N is the number of examples in the dataset. The algorithm starts by learning a maximum likelihood model p_θ(y|x), using, for example, a feed-forward neural network. It then constructs a matrix P_{y|x} and a vector p_y to store the estimated p(y|x) and p(y) for all the examples in the dataset. To find a subset Ω that makes β̃₀ as small as possible, by the previous analysis we want a conspicuous subset whose p(y|x) is large for a certain class j (to make the denominator of Equation (5) large) and which contains as many elements as possible (to make the numerator small).
We suggest the following heuristic to discover such a conspicuous subset. For each class j, we sort the rows of P_{y|x} in decreasing order of their probability for the pivot class j and then perform a search over (i_left, i_right) for Ω = {i_left, i_left + 1, …, i_right}. Since β̃₀ is large when Ω contains too few or too many elements, the minimum of β̃₀^{(j)} for class j will typically be reached by some intermediate-sized subset, and we can use binary search or another discrete search algorithm for the optimization. The search for class j stops when β̃₀^{(j)} does not improve by more than the tolerance ε. The algorithm then returns β̃₀ as the minimum over all classes of β̃₀^{(1)}, …, β̃₀^{(C)}, as well as the conspicuous subset that determines this β̃₀.
After estimating β ˜ 0 , we can then use it for learning with IB, either directly or as an anchor for a region where we can perform a much smaller sweep than we otherwise would have. This may be particularly important for very noisy datasets, where β 0 can be very large.
Algorithm 1 Estimating the upper bound for β₀ and identifying the conspicuous subset

Require: Dataset D = {(x_i, y_i)}, i = 1, 2, …, N. The number of classes is C.
Require: ε: tolerance for estimating β₀
1: Learn a maximum likelihood model p_θ(y|x) using the dataset D.
2: Construct the matrix P_{y|x} such that (P_{y|x})_{ij} = p_θ(y = y_j | x = x_i).
3: Construct the vector p_y = (p_{y_1}, …, p_{y_C}) such that p_{y_j} = (1/N)·Σ_{i=1}^{N}(P_{y|x})_{ij}.
4: for j in {1, 2, …, C}:
5:   P_{y|x}^{(sort_j)} ← sort the rows of P_{y|x} in decreasing order of (P_{y|x})_{ij}.
6:   β̃₀^{(j)}, Ω^{(j)} ← search over (i_left, i_right) until β̃₀^{(j)} = Getβ(P_{y|x}^{(sort_j)}, p_y, Ω) is minimal within tolerance ε, where Ω = {i_left, i_left + 1, …, i_right}.
7: end for
8: j* ← arg min_j β̃₀^{(j)}, j = 1, 2, …, C.
9: β̃₀ ← β̃₀^{(j*)}.
10: P_{y|x}^{(β̃₀)} ← the rows of P_{y|x}^{(sort_{j*})} indexed by Ω^{(j*)}.
11: return β̃₀, P_{y|x}^{(β̃₀)}

subroutine Getβ(P_{y|x}, p_y, Ω):
s1: N ← number of rows of P_{y|x}.
s2: C ← number of columns of P_{y|x}.
s3: n ← number of elements of Ω.
s4: (p_{y|Ω})_j ← (1/n)·Σ_{i∈Ω}(P_{y|x})_{ij}, for j = 1, 2, …, C.
s5: β̃₀ ← (N/n − 1) / (Σ_j (p_{y|Ω})_j² / p_{y_j} − 1).
s6: return β̃₀
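For concreteness, the following is a compact NumPy transcription of Algorithm 1 (our sketch: it replaces the discrete search of step 6 with an exhaustive scan over (i_left, i_right), an implementation choice of ours, and all identifiers are ours):

import numpy as np

def get_beta(p_sorted, p_y, i_left, i_right):
    # Subroutine Get-beta: Equation (5) with Omega = rows i_left..i_right
    N = p_sorted.shape[0]
    n = i_right - i_left + 1
    p_y_omega = p_sorted[i_left:i_right + 1].mean(axis=0)   # (p_{y|Omega})_j
    denom = (p_y_omega ** 2 / p_y).sum() - 1.0
    return np.inf if denom <= 0 else (N / n - 1.0) / denom

def estimate_beta0(p_pred):
    # p_pred: (N, C) array with p_pred[i, j] = p_theta(y = j | x = x_i)
    N, C = p_pred.shape
    p_y = p_pred.mean(axis=0)                               # step 3
    best_beta, conspicuous = np.inf, None
    for j in range(C):                                      # steps 4-7
        order = np.argsort(-p_pred[:, j])                   # step 5
        p_sorted = p_pred[order]
        for i_left in range(N):                             # exhaustive scan (ours)
            for i_right in range(i_left, N):
                if i_right - i_left + 1 == N:
                    continue                                # exclude Omega = X
                b = get_beta(p_sorted, p_y, i_left, i_right)
                if b < best_beta:
                    best_beta, conspicuous = b, order[i_left:i_right + 1]
    return best_beta, conspicuous                           # steps 8-11

With the exact p(y|x) of a balanced binary task with 20% class-conditional label noise, this scan recovers β̃₀ = 1/(1 − 2·0.2)² ≈ 2.78, matching the class-conditional noise formula of Section 6.2.1.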

6.2. Special Cases for Estimating β 0

Theorem 4 may still be challenging to use in practice, due to the difficulty of accurately estimating p(Ω_x) and of searching over Ω_x ⊂ X. However, if the learning problem is more structured, we may be able to obtain a simpler formula for the sufficient condition.

6.2.1. Class-Conditional Label Noise

Classification with noisy labels is a common practical scenario. An important noise model is that the labels are randomly flipped with some hidden class-conditional probabilities, and we only observe the corrupted labels. This problem has been studied extensively [22,23,24,25,26]. If IB is applied to this scenario, how large a β do we need? The following corollary provides a simple formula.
Corollary 1.
Suppose that the true class labels are y* and the input space belonging to each y* has no overlap. We only observe the corrupted labels y, with class-conditional noise p(y|x, y*) = p(y|y*), and Y is not independent of X. Then a sufficient condition for IB_β-Learnability is:

$$\beta > \inf_{y^*} \frac{\dfrac{1}{p(y^*)}-1}{\displaystyle\sum_{y}\frac{p(y|y^*)^2}{p(y)}-1} \tag{8}$$
We see that under class-conditional noise, the sufficient condition reduces to a discrete formula which depends only on the noise rates p(y|y*) and the true class probabilities p(y*), which can be accurately estimated via, for example, the method of Northcutt et al. [26]. Additionally, if we know that the noise is class-conditional but the observed β₀ is greater than the R.H.S. of Equation (8), we can deduce that there is overlap between the true classes. The proof of Corollary 1 is provided in Appendix A.10.
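As a worked special case (our own check of Equation (8)): for balanced binary classes, p(y*) = 1/2, with symmetric flip rate ρ = p(y ≠ y*|y*), we also have p(y) = 1/2, and Equation (8) reduces to

$$\beta_0 = \frac{\frac{1}{1/2}-1}{\frac{(1-\rho)^2+\rho^2}{1/2}-1} = \frac{1}{2\left[(1-\rho)^2+\rho^2\right]-1} = \frac{1}{(1-2\rho)^2}$$

For ρ = 0.2 this gives β₀ ≈ 2.78; the somewhat higher observed threshold of 3.25 in Figure 1 is consistent with the capacity effect discussed in Section 5.4.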

6.2.2. Deterministic Relationships

Theorem 4 also reveals that β 0 relates closely to whether Y is a deterministic function of X, as shown by Corollary 2:
Corollary 2.
Assume that Y contains at least one value y such that its probability p ( y ) > 0 . If Y is a deterministic function of X and not independent of X, then a sufficient condition for IB β -Learnability is β > 1 .
The assumption in Corollary 2 is satisfied by classification and by certain regression problems. (The following scenario does not satisfy this assumption: for certain regression problems where Y is a continuous random variable with a bounded probability density function p_Y(y), for any y the event Y = y has probability 0.) This corollary generalizes the result in Reference [9], which proves it only for classification problems. Combined with the necessary condition β > 1 for any dataset (X, Y) to be IB_β-learnable (Section 3), we have that, under the assumption, if Y is a deterministic function of X, then a necessary and sufficient condition for IB_β-Learnability is β > 1; that is, its β₀ is 1. The proof of Corollary 2 is provided in Appendix A.10.
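A quick consistency check of Corollary 2 against Theorem 4 (our own verification): take Ω_x to be the preimage of any class y with p(y) > 0. Determinism gives p(Ω_x) = p(y) and p(y|Ω_x) = 1, so Equation (5) yields

$$\beta_0(\Omega_x) = \frac{\frac{1}{p(y)}-1}{1\cdot\frac{1}{p(y)}-1} = 1$$

so inf_{Ω_x} β₀(Ω_x) ≤ 1 and, by Theorem 4, (X, Y) is IB_β-learnable for all β > 1.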
Therefore, in practice, if we find that β 0 > 1 , we may infer that Y is not a deterministic function of X. For a classification task, we may infer that either some classes have overlap or the labels are noisy. However, recall that finite models may add effective class overlap if they have insufficient capacity for the learning task, as mentioned in Section 4. This may translate into a higher observed β 0 , even when learning deterministic functions.

7. Experiments

To test how the theoretical conditions for IB β -learnability match with experiment, we apply them to synthetic data with varying noise rates and class overlap, MNIST binary classification with varying noise rates and CIFAR10 classification, comparing with the β 0 found experimentally. We also compare with the algorithm in Kim et al. [18] for estimating the hypercontractivity coefficient (= 1 / β 0 ) via the contraction coefficient η KL . Experiment details are in Appendix A.12.

7.1. Synthetic Dataset Experiments

We construct a set of datasets from 2D mixtures of 2 Gaussians as X and the identity of the mixture component as Y. We simulate two practical scenarios with these datasets: (1) noisy labels with class-conditional noise and (2) class overlap. For (1), we vary the class-conditional noise rates. For (2), we vary class overlap by tuning the distance between the Gaussians. For each experiment, we sweep β with exponential steps and observe I ( X ; Z ) and I ( Y ; Z ) . We then compare the empirical β 0 indicated by the onset of above-zero I ( X ; Z ) with predicted values for β 0 .
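A sketch of the data generation used for both scenarios (ours; the sample count and seed are assumptions, the rest follows the setup described in the following subsections):

import numpy as np

def make_mixture(n=10000, dist=16.0, rho=0.1, weights=(0.5, 0.5), seed=0):
    # 2D mixture of two Gaussians with covariance diag(0.25, 0.25) (std 0.5),
    # centers `dist` apart; component identity is the label, flipped w.p. rho.
    rng = np.random.default_rng(seed)
    y_true = rng.choice(2, size=n, p=list(weights))
    centers = np.array([[0.0, 0.0], [dist, 0.0]])
    x = centers[y_true] + 0.5 * rng.standard_normal((n, 2))
    y = np.where(rng.random(n) < rho, 1 - y_true, y_true)   # class-conditional noise
    return x, y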

7.1.1. Classification with Class-Conditional Noise

In this experiment, we use a mixture of Gaussians with 2 components, each of which is a 2D Gaussian with diagonal covariance matrix Σ = diag(0.25, 0.25). The two components have distance 16 (hence virtually no overlap) and equal mixture weights. For each x, the label y ∈ {0, 1} is the identity of the component it belongs to. We create multiple datasets by randomly flipping the labels with a certain noise rate ρ = p(y = 0|y* = 1) = p(y = 1|y* = 0). For each dataset, we train VIB models across a range of β and observe the onset of learning via above-zero I(X;Z) (Observed). To test how different methods perform in estimating β₀, we apply the following methods: (1) Corollary 1, since this is classification with class-conditional noise and the two true classes have virtually no overlap; (2) Algorithm 1 with the true p(y|x); (3) the algorithm in Kim et al. [18] that estimates η̂_KL, provided with the true p(y|x); (4) β₀[h(x)] in Equation (2); (2′) Algorithm 1 with p(y|x) estimated by a neural net; (3′) η̂_KL with the same estimated p(y|x) as in (2′). The results are shown in Figure 3 and Table 1.
From Figure 3 and Table 1, we see the following. (A) When using the true p(y|x), both Algorithm 1 and η̂_KL generally upper bound the empirical β₀, and Algorithm 1 is generally tighter. (B) When using the true p(y|x), Algorithm 1 and Corollary 1 give the same result. (C) Comparing Algorithm 1 and η̂_KL, both using the same empirically estimated p(y|x): both approaches provide good estimates in the low-noise region; however, in the high-noise region, Algorithm 1 gives more precise values than η̂_KL, indicating that Algorithm 1 is more robust to the estimation error of p(y|x). (D) Equation (2) empirically upper bounds the experimentally observed β₀ and gives almost the same result as the theoretical estimates from Corollary 1 and from Algorithm 1 with the true p(y|x). In the classification setting, this approach does not require any learned estimate of p(y|x), as we can directly use the empirical p(y) and p(x|y) from SGD mini-batches.
This experiment also shows that for datasets where the signal-to-noise ratio is small, β₀ can be very high. Instead of blindly sweeping β, our results can provide guidance for setting β so that learning can happen.

7.1.2. Classification with Class Overlap

In this experiment, we test how different amounts of overlap between the classes influence β₀. We use a mixture of Gaussians with two components, each of which is a 2D Gaussian with diagonal covariance matrix Σ = diag(0.25, 0.25). The two components have weights 0.6 and 0.4. We vary the distance between the Gaussians from 8.0 down to 0.8 and observe the experimental β₀. Since we do not add noise to the labels, if there were no overlap and a deterministic map from X to Y, we would have β₀ = 1 by Corollary 2. The more overlap between the two classes, the more uncertain Y is given X; by Equation (5), we then expect β₀ to be larger, which is corroborated in Figure 4.

7.2. MNIST Experiments

We perform binary classification with digits 0 and 1 and, as before, add class-conditional noise to the labels with varying noise rates ρ. To explore how model capacity influences the onset of learning, for each dataset we train two sets of VIB models differing only in the number of neurons in the hidden layers of their encoders: one with n = 512 neurons, the other with n = 128 neurons. As we describe in Section 4, insufficient capacity results in more uncertainty of Y given X from the point of view of the model, so we expect the observed β₀ for the n = 128 model to be larger. This is confirmed by the experiment (Figure 5). In Figure 5 we also plot the β₀ given by the different estimation methods. We see that observations (A), (B), (C) and (D) from Section 7.1 still hold.

7.3. MNIST Experiments Using Equation (2)

To see what IB learns at its onset of learning for the full MNIST dataset, we optimize Equation (2) w.r.t. the full MNIST dataset and visualize the clustering of digits by h(x). Equation (2) can be optimized via SGD with any differentiable parameterized mapping h(x): X → ℝ. In this case, we chose to parameterize h(x) with a PixelCNN++ architecture [27,28], as PixelCNN++ is a powerful autoregressive model for images that gives a scalar output (normally interpreted as log p(x)). Optimizing Equation (2) should generally yield two clusters in the output space, as discussed in Section 4. In this setup, smaller values of h(x) correspond to the subset of the data that is easiest to learn. Figure 6 shows two strongly separated clusters, as well as the threshold we choose to divide them. Figure 7 shows the first 5776 MNIST training examples as sorted by our learned h(x), with the examples above the threshold highlighted in red. We can clearly see that our learned h(x) has separated the “easy” 1 digits from the rest of the MNIST training set.

7.4. CIFAR10 Forgetting Experiments

For CIFAR10 [14], we study how forgetting varies with β. In other words, given a VIB model trained at some high β₂, if we anneal it down to some much lower β₁, what I(Y;Z) does the model converge to? Using Algorithm 1, we estimated β₀ = 1.0483 on a version of CIFAR10 with 20% label noise, where P_{y|x} is estimated by maximum likelihood training with the same encoder and classifier architectures as used for VIB. For the VIB models, the lowest β with performance above chance was β = 1.048 (Figure 8), a very tight match with the estimate from Algorithm 1. See Appendix A.12 for details.

8. Conclusions

In this paper, we have presented theoretical results for predicting the onset of learning and have shown that it is determined by the conspicuous subset of the training examples. We gave a practical algorithm for predicting the transition as well as discovering this subset and showed that those predictions are accurate, even in cases of extreme label noise. We proved a deep connection between IB-learnability, our upper bounds on β 0 , the hypercontractivity coefficient, the contraction coefficient and the maximum correlation. We believe that these results provide a deeper understanding of IB, as well as a tool for analyzing a dataset by discovering its conspicuous subset and a tool for measuring model capacity in a task-specific manner. Our work also raises other questions, such as whether there are other phase transitions in learnability that might be identified. We hope to address some of those questions in future work.

Author Contributions

Conceptualization, T.W. and I.F.; methodology, T.W., I.F., I.L.C. and M.T.; software, T.W. and I.F.; validation, T.W. and I.F.; formal analysis, T.W. and I.F.; investigation, T.W. and I.F.; resources, T.W., I.F., I.L.C. and M.T.; data curation, T.W. and I.F.; writing–original draft preparation, T.W., I.F., I.L.C. and M.T.; writing–review and editing, T.W., I.F., I.L.C. and M.T.; visualization, T.W. and I.F.; supervision, I.F., I.L.C. and M.T.; project administration, I.F., I.L.C. and M.T.; funding acquisition, M.T.

Funding

T.W.’s work was supported by the The Casey and Family Foundation, the Foundational Questions Institute and the Rothberg Family Fund for Cognitive Science. He thanks the Center for Brains, Minds and Machines (CBMM) for hospitality.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments that contributed to improving the paper.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A

The structure of the Appendix is as follows. In Appendix A.1, we provide preliminaries on the first-order and second-order variations of functionals. We prove Lemma 1 and Theorem 1 in Appendix A.2 and Appendix A.3, respectively. In Appendix A.4, we prove Theorem 2, the first sufficient condition for IB-Learnability. In Appendix A.5, we calculate the first- and second-order variations of IB_β[p(z|x)] at the trivial representation p(z|x) = p(z), which are used in proving Lemma 2 (Appendix A.6) and the second sufficient condition for IB_β-Learnability (Appendix A.7). In Appendix A.8, we prove Equation (3) at the onset of learning. After these preparations, we prove the key result of this paper, Theorem 4, in Appendix A.9. Then the two important Corollaries 1 and 2 are proved in Appendix A.10. In Appendix A.11 we explore the deep relation between β₀, β₀[h(x)], the hypercontractivity coefficient, the contraction coefficient and maximum correlation. Finally, in Appendix A.12, we provide details of the experiments.
Below is an implicit convention of the paper: for integrals, whenever a variable W is discrete, we simply replace the integral ∫(·)dw by the summation Σ_w(·).

Appendix A.1. Preliminaries: First-Order and Second-Order Variations

Let the functional F[f(x)] be defined on some normed linear space R. Let us add a perturbative function ε·h(x) to f(x); the functional F[f(x) + ε·h(x)] can then be expanded as

$$\Delta F[f(x)] = F[f(x)+\epsilon\cdot h(x)] - F[f(x)] = \varphi_1[f(x)] + \varphi_2[f(x)] + O(\epsilon^3\|h\|^2)$$

where ‖h‖ denotes the norm of h, φ₁[f(x)] = ε·(dF[f(x)]/dε) is a linear functional of ε·h(x) called the first-order variation, denoted δF[f(x)], and φ₂[f(x)] = (1/2)ε²·(d²F[f(x)]/dε²) is a quadratic functional of ε·h(x) called the second-order variation, denoted δ²F[f(x)].
If δ F [ f ( x ) ] = 0 , we call f ( x ) a stationary solution for the functional F [ · ] .
If ΔF[f(x)] ≥ 0 for all h(x) such that f(x) + ε·h(x) is in the neighborhood of f(x), we call f(x) a (local) minimum of F[·].

Appendix A.2. Proof of Lemma 1

Proof. 
If (X, Y) is IB_β-learnable, then there exists a Z given by some p₁(z|x) such that IB_β(X,Y;Z) < IB_β(X,Y;Z_trivial) = 0, where Z_trivial satisfies p(z|x) = p(z). Since X′ = g(X) is an invertible map (if X is a continuous variable, g is additionally required to be continuous) and mutual information is invariant under such an invertible map [29], we have IB_β(X′,Y;Z) = I(X′;Z) − βI(Y;Z) = I(X;Z) − βI(Y;Z) = IB_β(X,Y;Z) < 0 = IB_β(X′,Y;Z_trivial), so (X′, Y) is IB_β-learnable. On the other hand, if (X, Y) is not IB_β-learnable, then for any Z we have IB_β(X,Y;Z) ≥ IB_β(X,Y;Z_trivial) = 0. Again using the invariance of mutual information under g, we have for all Z that IB_β(X′,Y;Z) = IB_β(X,Y;Z) ≥ 0 = IB_β(X′,Y;Z_trivial), so that (X′, Y) is not IB_β-learnable. Therefore, (X, Y) and (X′, Y) have the same IB_β-learnability. □

Appendix A.3. Proof of Theorem 1

Proof. 
At the trivial representation p(z|x) = p(z), we have I(X;Z) = 0 and, due to the Markov chain, I(Y;Z) = 0, so IB_β(X,Y;Z)|_{p(z|x)=p(z)} = 0 for any β. Since (X, Y) is IB_{β₁}-learnable, there exists a Z given by a p₁(z|x) such that IB_{β₁}(X,Y;Z)|_{p₁(z|x)} < 0. Since β₂ > β₁ and I(Y;Z) ≥ 0, we have IB_{β₂}(X,Y;Z)|_{p₁(z|x)} ≤ IB_{β₁}(X,Y;Z)|_{p₁(z|x)} < 0 = IB_{β₂}(X,Y;Z)|_{p(z|x)=p(z)}. Therefore, (X, Y) is IB_{β₂}-learnable. □

Appendix A.4. Proof of Theorem 2

Proof. 
To prove Theorem 2, we use Theorem 1 of Chapter 5 of Gelfand et al. [30], which gives a necessary condition for F[f(x)] to have a minimum at f₀(x). Adapting it to our notation, we have:
Theorem A1.
([30]). A necessary condition for the functional F[f(x)] to have a minimum at f(x) = f₀(x) is that, for f(x) = f₀(x) and all admissible ε·h(x),

$$\delta^2 F[f(x)] \geq 0$$
Applying this to our functional IB_β[p(z|x)], an immediate consequence of Theorem A1 is that if at p(z|x) = p(z) there exists an ε·h(z|x) such that δ²IB_β[p(z|x)] < 0, then p(z|x) = p(z) is not a minimum of IB_β[p(z|x)]. Using the definition of IB_β-Learnability, we have that (X, Y) is IB_β-learnable. □

Appendix A.5. First- and Second-Order Variations of IB β [ p ( z | x ) ]

In this section, we derive the first- and second-order variations of IB β [ p ( z | x ) ] , which are needed for proving Lemma 2 and Theorem 3.
Lemma A1.
Using perturbative function h ( z | x ) , we have
$$\delta\,\mathrm{IB}_\beta[p(z|x)] = \int dx\,dz\, p(x)\,h(z|x)\log\frac{p(z|x)}{p(z)} - \beta\int dx\,dy\,dz\, p(x,y)\,h(z|x)\log\frac{p(z|y)}{p(z)}$$

$$\begin{aligned}\delta^2\mathrm{IB}_\beta[p(z|x)] = \frac{1}{2}\bigg[&\int dx\,dz\,\frac{p(x)^2}{p(x,z)}h(z|x)^2 - \beta\int dx\,dx'\,dy\,dz\,\frac{p(x,y)\,p(x',y)}{p(y,z)}h(z|x)h(z|x')\\ &+(\beta-1)\int dx\,dx'\,dz\,\frac{p(x)\,p(x')}{p(z)}h(z|x)h(z|x')\bigg]\end{aligned}$$
Proof. 
Since IB_β[p(z|x)] = I(X;Z) − βI(Y;Z), let us calculate the first- and second-order variations of I(X;Z) and I(Y;Z) w.r.t. p(z|x), respectively. Throughout this derivation, we use ε·h(z|x) as the perturbative function, for ease of tracking the different orders of variation. We assume that h(z|x) is continuous and that there exists a constant M such that |h(z|x)/p(z|x)| < M for all (x,z) ∈ X × Z. We will finally absorb ε into h(z|x).
Denote I ( X ; Z ) = F 1 [ p ( z | x ) ] . We have
$$F_1[p(z|x)] = I(X;Z) = \int dx\,dz\, p(z|x)\,p(x)\log\frac{p(z|x)}{p(z)}$$
In this paper, we implicitly assume that integrals (or sums) are taken only over the support of p(x,y,z).
Since
$$p(z) = \int p(z|x)\,p(x)\,dx$$
We have
$$p(z)\big|_{p(z|x)+\epsilon h(z|x)} = p(z)\big|_{p(z|x)} + \epsilon\int h(z|x)\,p(x)\,dx$$
Expanding F₁[p(z|x) + εh(z|x)] to second order in ε, we have

$$\begin{aligned}
F_1[p(z|x)+\epsilon h(z|x)] &= \int dx\,dz\, p(x)\left[p(z|x)+\epsilon h(z|x)\right]\log\frac{p(z|x)+\epsilon h(z|x)}{p(z)+\epsilon\int h(z|x')p(x')dx'}\\
&= \int dx\,dz\, p(x)\,p(z|x)\left(1+\epsilon\frac{h(z|x)}{p(z|x)}\right)\log\left[\frac{p(z|x)}{p(z)}\cdot\frac{1+\epsilon\frac{h(z|x)}{p(z|x)}}{1+\epsilon\frac{\int h(z|x')p(x')dx'}{p(z)}}\right]\\
&= \int dx\,dz\, p(x)\,p(z|x)\left(1+\epsilon\frac{h(z|x)}{p(z|x)}\right)\Bigg[\log\frac{p(z|x)}{p(z)}+\epsilon\left(\frac{h(z|x)}{p(z|x)}-\frac{\int h(z|x')p(x')dx'}{p(z)}\right)\\
&\qquad+\epsilon^2\left(\frac{\int h(z|x')p(x')dx'}{p(z)}\right)^2-\epsilon^2\,\frac{h(z|x)}{p(z|x)}\cdot\frac{\int h(z|x')p(x')dx'}{p(z)}-\frac{\epsilon^2}{2}\left(\frac{h(z|x)}{p(z|x)}-\frac{\int h(z|x')p(x')dx'}{p(z)}\right)^2\Bigg]+O(\epsilon^3)
\end{aligned}$$
Collecting the first-order terms in ε, we have

$$\begin{aligned}
\delta F_1[p(z|x)] &= \epsilon\int dx\,dz\, p(x)\,p(z|x)\left(\frac{h(z|x)}{p(z|x)}-\frac{\int h(z|x')p(x')dx'}{p(z)}\right)+\epsilon\int dx\,dz\, p(x)\,h(z|x)\log\frac{p(z|x)}{p(z)}\\
&= \epsilon\int dx\,dz\, p(x)\,h(z|x)-\epsilon\int dx\,dz\, p(x)\,h(z|x)+\epsilon\int dx\,dz\, p(x)\,h(z|x)\log\frac{p(z|x)}{p(z)}\\
&= \epsilon\int dx\,dz\, p(x)\,h(z|x)\log\frac{p(z|x)}{p(z)}
\end{aligned}$$
Collecting the second-order terms in ε², we have

$$\begin{aligned}
\delta^2 F_1[p(z|x)] &= \epsilon^2\int dx\,dz\, p(x)\,p(z|x)\left[\left(\frac{\int h(z|x')p(x')dx'}{p(z)}\right)^2-\frac{h(z|x)}{p(z|x)}\cdot\frac{\int h(z|x')p(x')dx'}{p(z)}-\frac{1}{2}\left(\frac{h(z|x)}{p(z|x)}-\frac{\int h(z|x')p(x')dx'}{p(z)}\right)^2\right]\\
&\quad+\epsilon^2\int dx\,dz\, p(x)\,p(z|x)\,\frac{h(z|x)}{p(z|x)}\left(\frac{h(z|x)}{p(z|x)}-\frac{\int h(z|x')p(x')dx'}{p(z)}\right)\\
&= \frac{\epsilon^2}{2}\int dx\,dz\,\frac{p(x)^2}{p(x,z)}h(z|x)^2-\frac{\epsilon^2}{2}\int dx\,dx'\,dz\,\frac{p(x)\,p(x')}{p(z)}h(z|x)h(z|x')
\end{aligned}$$
Now let us calculate the first- and second-order variations of F₂[p(z|x)] = I(Y;Z). We have

$$F_2[p(z|x)] = I(Y;Z) = \int dy\,dz\, p(z|y)\,p(y)\log\frac{p(y,z)}{p(y)p(z)} = \int dx\,dy\,dz\, p(z|x)\,p(x,y)\log\frac{p(y,z)}{p(y)p(z)}$$
Using the Markov chain Z − X − Y, we have

$$p(y,z) = \int p(z|x)\,p(x,y)\,dx$$
Hence
$$p(y,z)\big|_{p(z|x)+\epsilon h(z|x)} = p(y,z)\big|_{p(z|x)} + \epsilon\int h(z|x)\,p(x,y)\,dx$$
Then, expanding F₂[p(z|x) + εh(z|x)] to second order in ε, we have

$$\begin{aligned}
F_2[p(z|x)+\epsilon h(z|x)] &= \int dx\,dy\,dz\, p(x,y)\,p(z|x)\left(1+\epsilon\frac{h(z|x)}{p(z|x)}\right)\log\left[\frac{p(y,z)}{p(y)p(z)}\cdot\frac{1+\epsilon\frac{\int h(z|x')p(x',y)dx'}{p(y,z)}}{1+\epsilon\frac{\int h(z|x')p(x')dx'}{p(z)}}\right]\\
&= \int dx\,dy\,dz\, p(x,y)\,p(z|x)\left(1+\epsilon\frac{h(z|x)}{p(z|x)}\right)\Bigg[\log\frac{p(y,z)}{p(y)p(z)}+\epsilon\left(\frac{\int h(z|x')p(x',y)dx'}{p(y,z)}-\frac{\int h(z|x')p(x')dx'}{p(z)}\right)\\
&\qquad+\epsilon^2\left(\frac{\int h(z|x')p(x')dx'}{p(z)}\right)^2-\epsilon^2\,\frac{\int h(z|x')p(x',y)dx'}{p(y,z)}\cdot\frac{\int h(z|x')p(x')dx'}{p(z)}\\
&\qquad-\frac{\epsilon^2}{2}\left(\frac{\int h(z|x')p(x',y)dx'}{p(y,z)}-\frac{\int h(z|x')p(x')dx'}{p(z)}\right)^2\Bigg]+O(\epsilon^3)
\end{aligned}$$
Collecting the first-order terms in ε, we have

$$\begin{aligned}
\delta F_2[p(z|x)] &= \epsilon\int dx\,dy\,dz\, p(x,y)\,h(z|x)\log\frac{p(y,z)}{p(y)p(z)}+\epsilon\int dx\,dy\,dz\, p(x,y)\,p(z|x)\,\frac{\int h(z|x')p(x',y)dx'}{p(y,z)}\\
&\quad-\epsilon\int dx\,dy\,dz\, p(x,y)\,p(z|x)\,\frac{\int h(z|x')p(x')dx'}{p(z)}\\
&= \epsilon\int dx\,dy\,dz\, p(x,y)\,h(z|x)\log\frac{p(y,z)}{p(y)p(z)}+\epsilon\int dx\,dy\,dz\, h(z|x)\,p(x,y)-\epsilon\int dx\,dz\, h(z|x)\,p(x)\\
&= \epsilon\int dx\,dy\,dz\, p(x,y)\,h(z|x)\log\frac{p(z|y)}{p(z)}
\end{aligned}$$
Collecting the second-order terms in ε², we have

$$\begin{aligned}
\delta^2 F_2[p(z|x)] &= \epsilon^2\int dx\,dy\,dz\, p(x,y)\,p(z|x)\left[\left(\frac{\int h(z|x')p(x')dx'}{p(z)}\right)^2-\frac{\int h(z|x')p(x',y)dx'}{p(y,z)}\cdot\frac{\int h(z|x')p(x')dx'}{p(z)}\right.\\
&\qquad\left.-\frac{1}{2}\left(\frac{\int h(z|x')p(x',y)dx'}{p(y,z)}-\frac{\int h(z|x')p(x')dx'}{p(z)}\right)^2\right]\\
&\quad+\epsilon^2\int dx\,dy\,dz\, p(x,y)\,p(z|x)\,\frac{h(z|x)}{p(z|x)}\left(\frac{\int h(z|x')p(x',y)dx'}{p(y,z)}-\frac{\int h(z|x')p(x')dx'}{p(z)}\right)\\
&= \frac{\epsilon^2}{2}\int dx\,dx'\,dy\,dz\,\frac{p(x,y)\,p(x',y)}{p(y,z)}h(z|x)h(z|x')-\frac{\epsilon^2}{2}\int dx\,dx'\,dz\,\frac{p(x)\,p(x')}{p(z)}h(z|x)h(z|x')
\end{aligned}$$
Finally, we have
$$\delta\,\mathrm{IB}_\beta[p(z|x)] = \delta F_1[p(z|x)] - \beta\,\delta F_2[p(z|x)] = \epsilon\left[\int dx\,dz\, p(x)\,h(z|x)\log\frac{p(z|x)}{p(z)} - \beta\int dx\,dy\,dz\, p(x,y)\,h(z|x)\log\frac{p(z|y)}{p(z)}\right]$$

$$\begin{aligned}
\delta^2\mathrm{IB}_\beta[p(z|x)] &= \delta^2 F_1[p(z|x)] - \beta\,\delta^2 F_2[p(z|x)]\\
&= \frac{\epsilon^2}{2}\int dx\,dz\,\frac{p(x)^2}{p(x,z)}h(z|x)^2-\frac{\epsilon^2}{2}\int dx\,dx'\,dz\,\frac{p(x)\,p(x')}{p(z)}h(z|x)h(z|x')\\
&\quad-\beta\epsilon^2\left[\frac{1}{2}\int dx\,dx'\,dy\,dz\,\frac{p(x,y)\,p(x',y)}{p(y,z)}h(z|x)h(z|x')-\frac{1}{2}\int dx\,dx'\,dz\,\frac{p(x)\,p(x')}{p(z)}h(z|x)h(z|x')\right]\\
&= \frac{\epsilon^2}{2}\left[\int dx\,dz\,\frac{p(x)^2}{p(x,z)}h(z|x)^2-\beta\int dx\,dx'\,dy\,dz\,\frac{p(x,y)\,p(x',y)}{p(y,z)}h(z|x)h(z|x')+(\beta-1)\int dx\,dx'\,dz\,\frac{p(x)\,p(x')}{p(z)}h(z|x)h(z|x')\right]
\end{aligned}$$
Absorbing ε into h(z|x), we get rid of the ε factor and obtain the final expressions in Lemma A1. □

Appendix A.6. Proof of Lemma 2

Proof. 
Using Lemma A1, we have
$$\delta\,\mathrm{IB}_\beta[p(z|x)] = \int dx\,dz\, p(x)\,h(z|x)\log\frac{p(z|x)}{p(z)} - \beta\int dx\,dy\,dz\, p(x,y)\,h(z|x)\log\frac{p(z|y)}{p(z)}$$

At the trivial representation p(z|x) = p(z), we have log(p(z|x)/p(z)) ≡ 0; moreover, the Markov chain gives p(z|y) = ∫p(z|x)p(x|y)dx = p(z), so log(p(z|y)/p(z)) ≡ 0 as well. Therefore, the two integrals are both 0. Hence,

$$\delta\,\mathrm{IB}_\beta[p(z|x)]\Big|_{p(z|x)=p(z)} \equiv 0$$

Therefore, p(z|x) = p(z) is a stationary solution for IB_β[p(z|x)]. □

Appendix A.7. Proof of Theorem 3

Proof. 
Firstly, from the necessary condition β > 1 in Section 3, we have that any sufficient condition for IB_β-learnability must imply β > 1. Now using Theorem 2, a sufficient condition for (X, Y) to be IB_β-learnable is that there exists h(z|x) with ∫h(z|x)dz = 0 such that δ²IB_β[p(z|x)] < 0 at p(z|x) = p(z).
At the trivial representation, p(z|x) = p(z) and hence p(x,z) = p(x)p(z). Due to the Markov chain Z − X − Y, we also have p(y,z) = p(y)p(z). Substituting these into the δ²IB_β[p(z|x)] of Lemma A1, the condition becomes: there exists h(z|x) with ∫h(z|x)dz = 0, such that

$$\begin{aligned}
0 > \delta^2\mathrm{IB}_\beta[p(z|x)] = \frac{1}{2}\bigg[&\int dx\,dz\,\frac{p(x)^2}{p(x)p(z)}h(z|x)^2-\beta\int dx\,dx'\,dy\,dz\,\frac{p(x,y)\,p(x',y)}{p(y)p(z)}h(z|x)h(z|x')\\
&+(\beta-1)\int dx\,dx'\,dz\,\frac{p(x)\,p(x')}{p(z)}h(z|x)h(z|x')\bigg]
\end{aligned}$$
Rearranging terms and simplifying, we have
$$\int \frac{dz}{p(z)}\,G[h(z|x)] = \int \frac{dz}{p(z)}\left[\int dx\, h(z|x)^2\,p(x) - \beta\int dy\, p(y)\left(\int dx\, h(z|x)\,p(x)\,\frac{p(y|x)}{p(y)}\right)^2 + (\beta-1)\left(\int dx\, h(z|x)\,p(x)\right)^2\right] < 0 \tag{A2}$$
where
$$G[h(x)] = \int dx\, h(x)^2\,p(x) - \beta\int dy\, p(y)\left(\int dx\, h(x)\,p(x)\,\frac{p(y|x)}{p(y)}\right)^2 + (\beta-1)\left(\int dx\, h(x)\,p(x)\right)^2$$
Now we prove that the condition that there exists h(z|x) s.t. ∫(dz/p(z)) G[h(z|x)] < 0 is equivalent to the condition that there exists h(x) s.t. G[h(x)] < 0.
If G[h(z|x)] ≥ 0 for all h(z|x), then ∫(dz/p(z)) G[h(z|x)] ≥ 0 for all h(z|x). Therefore, if there exists h(z|x) s.t. ∫(dz/p(z)) G[h(z|x)] < 0, then there exists h(z|x) s.t. G[h(z|x)] < 0. Since the functional G[h(z|x)] does not contain an integration over z, we can treat the z in G[h(z|x)] as a parameter, and it follows that there exists h(x) s.t. G[h(x)] < 0.
Conversely, if there exists a function h(x) such that G[h(x)] < 0, we can find some h₂(z) such that ∫h₂(z)dz = 0 and ∫h₂²(z)/p(z) dz > 0, and let h₁(z|x) = h(x)h₂(z). Then we have
$$
\int \frac{dz}{p(z)}\, G[h_1(z|x)] = \int \frac{dz}{p(z)}\, h_2^2(z)\, G[h(x)] = G[h(x)] \int \frac{h_2^2(z)}{p(z)}\, dz < 0
$$
In other words, the condition in Equation (A2) is equivalent to requiring that there exists an h(x) such that G[h(x)] < 0. Hence, a sufficient condition for IB_β-learnability is that there exists an h(x) such that
$$
G[h(x)] = \int dx\; h(x)^2\, p(x) - \beta \int dy\; p(y) \left( \int dx\; h(x)\, \frac{p(x)\, p(y|x)}{p(y)} \right)^2 + (\beta - 1) \left( \int dx\; h(x)\, p(x) \right)^2 < 0 \tag{A3}
$$
When h(x) = C is constant on the entire input space 𝒳, Equation (A3) becomes:
$$
C^2 - \beta C^2 + (\beta - 1)\, C^2 < 0
$$
whose left-hand side equals 0, so the inequality cannot hold. Therefore, a constant h(x) cannot satisfy Equation (A3).
Rearranging terms and simplifying, we have
$$
\beta \left[ \int dy\; p(y) \left( \int dx\; h(x)\, \frac{p(x)\, p(y|x)}{p(y)} \right)^2 - \left( \int dx\; h(x)\, p(x) \right)^2 \right] > \int dx\; h(x)^2\, p(x) - \left( \int dx\; h(x)\, p(x) \right)^2
$$
Written in the form of expectations, we have
$$
\beta \cdot \left[ \mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x)]^2 \right] - \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2 \right] > \mathbb{E}_{x \sim p(x)}[h(x)^2] - \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2 \tag{A5}
$$
Since the square function is convex, using Jensen’s inequality on the L.H.S. of Equation (A5), we have
$$
\mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x)]^2 \right] \ge \left( \mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x)] \right] \right)^2 = \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2
$$
The equality holds iff E_{x∼p(x|y)}[h(x)] is constant w.r.t. y, which is the case when Y is independent of X. Therefore, in order for Equation (A5) to be satisfiable, we require that Y is not independent of X.
Using Jensen’s inequality on the inner expectation on the L.H.S. of Equation (A5), we have
$$
\mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x)]^2 \right] \le \mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x)^2] \right] = \mathbb{E}_{x \sim p(x)}[h(x)^2]
$$
The equality holds when h(x) is constant. Since we require that h(x) is not constant, the equality cannot be attained.
Similarly, using Jensen’s inequality on the R.H.S. of Equation (A5), we have that
$$
\mathbb{E}_{x \sim p(x)}[h(x)^2] > \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2
$$
where we have used the requirement that h ( x ) cannot be constant.
Under the constraint that Y is not independent of X, we can divide both sides of Equation (A5) by the bracket on its L.H.S. and obtain the condition: there exists an h(x) such that
$$
\beta > \frac{ \mathbb{E}_{x \sim p(x)}[h(x)^2] - \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2 }{ \mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x)]^2 \right] - \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2 } \tag{A6}
$$
i.e.,
$$
\beta > \inf_{h(x)} \frac{ \mathbb{E}_{x \sim p(x)}[h(x)^2] - \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2 }{ \mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x)]^2 \right] - \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2 }
$$
which proves the condition of Theorem 3.
Furthermore, from Equation (A6) we have
$$
\beta_0[h(x)] > 1
$$
for any h(x) ≢ constant, which is consistent with the necessary condition β > 1 from Section 3.
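For discrete X and Y, the quantity β₀[h(x)] above can be evaluated directly. The following is a minimal numerical sketch, assuming a joint probability table `p_xy` with `p_xy[i, j] = p(x_i, y_j)`; the function and variable names are illustrative, not the authors' code:

```python
import numpy as np

def beta0_of_h(p_xy, h):
    """beta_0[h] = (E[h^2] - E[h]^2) / (E_y[E_{x|y}[h]^2] - E[h]^2).

    Assumes h is not constant and X, Y are not independent, so the
    denominator is strictly positive.
    """
    p_x = p_xy.sum(axis=1)                 # marginal p(x)
    p_y = p_xy.sum(axis=0)                 # marginal p(y)
    e_h = np.dot(p_x, h)                   # E_{x~p(x)}[h(x)]
    e_h2 = np.dot(p_x, h ** 2)             # E_{x~p(x)}[h(x)^2]
    p_x_given_y = p_xy / p_y               # column j is p(x|y_j)
    e_h_given_y = h @ p_x_given_y          # E_{x~p(x|y)}[h(x)] for each y
    denom = np.dot(p_y, e_h_given_y ** 2) - e_h ** 2
    return (e_h2 - e_h ** 2) / denom
```

Any particular h(x) yields a valid sufficient threshold β₀[h(x)]; scanning or optimizing over a family of candidate h tightens it toward the infimum.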
Proof of the lower bound on the slope of the Pareto frontier at the origin: Now we prove the second statement of Theorem 3. Since δI(X;Z) = 0 and δI(Y;Z) = 0 according to Lemma 2, we have $\left( \frac{\Delta I(Y;Z)}{\Delta I(X;Z)} \right)^{-1} = \left( \frac{\delta^2 I(Y;Z)}{\delta^2 I(X;Z)} \right)^{-1}$. Substituting the expressions for δ²I(Y;Z) and δ²I(X;Z) from Lemma A1, we have
$$
\begin{aligned}
\left( \frac{\Delta I(Y;Z)}{\Delta I(X;Z)} \right)^{-1} &= \left( \frac{\delta^2 I(Y;Z)}{\delta^2 I(X;Z)} \right)^{-1} = \frac{ \frac{\epsilon^2}{2} \int dx\, dz\; \frac{p(x)^2}{p(x)\, p(z)}\, h(z|x)^2 - \frac{\epsilon^2}{2} \int dx\, dx'\, dz\; \frac{p(x)\, p(x')}{p(z)}\, h(z|x)\, h(z|x') }{ \frac{\epsilon^2}{2} \int dx\, dx'\, dy\, dz\; \frac{p(x,y)\, p(x',y)}{p(y)\, p(z)}\, h(z|x)\, h(z|x') - \frac{\epsilon^2}{2} \int dx\, dx'\, dz\; \frac{p(x)\, p(x')}{p(z)}\, h(z|x)\, h(z|x') } \\
&= \frac{ \left[ \int dx\; p(x)\, h(x)^2 - \int dx\, dx'\; p(x)\, p(x')\, h(x)\, h(x') \right] \int \frac{h_2^2(z)}{p(z)}\, dz }{ \left[ \int dx\, dx'\, dy\; \frac{p(x,y)\, p(x',y)}{p(y)}\, h(x)\, h(x') - \int dx\, dx'\; p(x)\, p(x')\, h(x)\, h(x') \right] \int \frac{h_2^2(z)}{p(z)}\, dz } \\
&= \frac{ \mathbb{E}_{x \sim p(x)}[h(x)^2] - \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2 }{ \mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x)]^2 \right] - \left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2 } = \frac{ \frac{\mathbb{E}_{x \sim p(x)}[h(x)^2]}{\left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2} - 1 }{ \mathbb{E}_{y \sim p(y)} \left[ \left( \frac{\mathbb{E}_{x \sim p(x|y)}[h(x)]}{\mathbb{E}_{x \sim p(x)}[h(x)]} \right)^2 \right] - 1 } = \beta_0[h(x)]
\end{aligned}
$$
where in the second line we substituted h(z|x) = h(x)h₂(z).
Therefore, $\left( \inf_{h(x)} \beta_0[h(x)] \right)^{-1}$ gives the largest slope of ΔI(Y;Z) vs. ΔI(X;Z) over perturbation functions of the form h₁(z|x) = h(x)h₂(z) satisfying ∫h₂(z)dz = 0 and ∫h₂²(z)/p(z) dz > 0, which is a lower bound on the slope of ΔI(Y;Z) vs. ΔI(X;Z) over all possible perturbation functions h₁(z|x). The latter is the slope of the Pareto frontier of the I(Y;Z) vs. I(X;Z) curve at the origin.
Inflection point for general Z*: If we do not assume that Z is at the origin of the information plane, but at some general stationary solution Z* with conditional p*(z|x), we define
$$
\begin{aligned}
\beta^{(2)}[h(z|x)] &= \left( \frac{\delta^2 I(Y;Z)}{\delta^2 I(X;Z)} \right)^{-1} = \frac{ \frac{\epsilon^2}{2} \int dx\, dz\; \frac{p(x)^2}{p(x,z)}\, h(z|x)^2 - \frac{\epsilon^2}{2} \int dx\, dx'\, dz\; \frac{p(x)\, p(x')}{p(z)}\, h(z|x)\, h(z|x') }{ \frac{\epsilon^2}{2} \int dx\, dx'\, dy\, dz\; \frac{p(x,y)\, p(x',y)}{p(y,z)}\, h(z|x)\, h(z|x') - \frac{\epsilon^2}{2} \int dx\, dx'\, dz\; \frac{p(x)\, p(x')}{p(z)}\, h(z|x)\, h(z|x') } \\
&= \frac{ \int dz \left[ \int dx\; \frac{p(x)}{p^*(z|x)}\, h(z|x)^2 - \frac{1}{p^*(z)} \left( \int dx\; p(x)\, h(z|x) \right)^2 \right] }{ \int dz \left[ \int dy\; \frac{1}{p^*(z|y)\, p(y)} \left( \int dx\; p(x,y)\, h(z|x) \right)^2 - \frac{1}{p^*(z)} \left( \int dx\; p(x)\, h(z|x) \right)^2 \right] }
\end{aligned}
$$
using p(x,z) = p(x)p*(z|x) and p(y,z) = p*(z|y)p(y) at the stationary solution Z*,
which reduces to β₀[h(x)] when p*(z|x) = p(z). When
$$
\beta > \inf_{h(z|x)} \beta^{(2)}[h(z|x)]
$$
the solution Z* becomes non-stable (no longer a minimum), and there will exist other Z that achieve a better IB_β(X,Y;Z) than the current Z*. □

Appendix A.8. What IB First Learns at Its Onset of Learning

In this section, we prove that at the onset of learning, taking the perturbation h(z|x) = h*(x)h₂(z), we have
$$
p_\beta(y|x) = p(y) + \epsilon^2 C_z \left( h^*(x) - \bar{h}^*_x \right) \int p(x',y) \left( h^*(x') - \bar{h}^*_x \right) dx'
$$
where p_β(y|x) is the estimate of p(y|x) given by IB at a certain β, $h^*(x) = \arg\min_{h(x)} \beta_0[h(x)]$, $\bar{h}^*_x = \int h^*(x)\, p(x)\, dx$, and $C_z = \int \frac{h_2^2(z)}{p(z)}\, dz$ is a constant.
Proof. 
In IB, we use p_β(z|x) to obtain Z from X, and then obtain the prediction of Y from Z using p_β(y|z). Here the subscript β denotes the probability (density) at the optimum of IB_β[p(z|x)] for a specific β. We have
$$
p_\beta(y|x) = \int p_\beta(y|z)\, p_\beta(z|x)\, dz = \int dz\; \frac{p_\beta(y,z)\, p_\beta(z|x)}{p_\beta(z)} = \int dz\; \frac{p_\beta(z|x)}{p_\beta(z)} \int p(x',y)\, p_\beta(z|x')\, dx'
$$
When we have a small perturbation ε·h(z|x) at the trivial representation, so that p_β(z|x) = p_{β₀}(z) + ε·h(z|x), we have p_β(z) = p_{β₀}(z) + ε·∫h(z|x′)p(x′)dx′. Substituting, we have
$$
p_\beta(y|x) = \int dz\; \frac{ \left( p_{\beta_0}(z) + \epsilon\, h(z|x) \right) \int dx'\; p(x',y) \left( p_{\beta_0}(z) + \epsilon\, h(z|x') \right) }{ p_{\beta_0}(z) \left( 1 + \epsilon\, \dfrac{\int h(z|x')\, p(x')\, dx'}{p_{\beta_0}(z)} \right) }
$$
The 0th-order term is $\int dz \int dx'\; p(x',y)\, p_{\beta_0}(z) = p(y)$. The first-order term is
$$
\delta p_\beta(y|x) = \epsilon \left[ \int dz \int dx'\; p(x',y)\, h(z|x') + p(y) \int dz\; h(z|x) - p(y) \int dz \int dx'\; p(x')\, h(z|x') \right] = 0
$$
since ∫h(z|x)dz = 0 for any x, so each of the three terms vanishes.
For the second-order term, using h(z|x) = h(x)h₂(z) and $C_z = \int \frac{h_2^2(z)}{p_{\beta_0}(z)}\, dz$, it is
$$
\begin{aligned}
\delta^2 p_\beta(y|x) &= \epsilon^2 \int \frac{dz}{p_{\beta_0}(z)} \left[ h(z|x) \int dx'\; p(x',y)\, h(z|x') - p(y)\, h(z|x) \int dx'\; p(x')\, h(z|x') \right. \\
&\qquad \left. - \left( \int dx'\; p(x')\, h(z|x') \right) \left( \int dx'\; p(x',y)\, h(z|x') \right) + p(y) \left( \int dx'\; p(x')\, h(z|x') \right)^2 \right] \\
&= \epsilon^2 C_z \left[ h(x) \int dx'\; p(x',y)\, h(x') - p(y)\, h(x)\, \bar{h}_x - \bar{h}_x \int dx'\; p(x',y)\, h(x') + p(y)\, \bar{h}_x^2 \right] \\
&= \epsilon^2 C_z \left( h(x) - \bar{h}_x \right) \left[ \int dx'\; p(x',y)\, h(x') - \bar{h}_x\, p(y) \right] \\
&= \epsilon^2 C_z \left( h(x) - \bar{h}_x \right) \int dx'\; p(x',y) \left( h(x') - \bar{h}_x \right)
\end{aligned}
$$
where $\bar{h}_x = \int h(x)\, p(x)\, dx$. Combining everything, we have, up to second order,
$$
p_\beta(y|x) = p(y) + \epsilon^2 C_z \left( h(x) - \bar{h}_x \right) \int p(x',y) \left( h(x') - \bar{h}_x \right) dx'
$$
 □

Appendix A.9. Proof of Theorem 4

Proof. 
According to Theorem 3, a sufficient condition for (X,Y) to be IB_β-learnable is that X and Y are not independent, and
$$
\beta > \inf_{h(x)} \frac{ \dfrac{\mathbb{E}_{x \sim p(x)}[h(x)^2]}{\left( \mathbb{E}_{x \sim p(x)}[h(x)] \right)^2} - 1 }{ \mathbb{E}_{y \sim p(y)} \left[ \left( \dfrac{\mathbb{E}_{x \sim p(x|y)}[h(x)]}{\mathbb{E}_{x \sim p(x)}[h(x)]} \right)^2 \right] - 1 }
$$
We can assume a specific form of h ( x ) , and obtain a (potentially stronger) sufficient condition. Specifically, we let
$$
h(x) = \begin{cases} 1, & x \in \Omega_x \\ 0, & \text{otherwise} \end{cases} \tag{A10}
$$
for a certain Ω_x ⊂ 𝒳. Substituting this form of h(x) into the condition above, we have that a sufficient condition for (X,Y) to be IB_β-learnable is
$$
\beta > \inf_{\Omega_x \subset \mathcal{X}} \frac{ \dfrac{p(\Omega_x)}{p(\Omega_x)^2} - 1 }{ \displaystyle \int dy\; p(y) \left( \frac{\int_{x \in \Omega_x} p(x|y)\, dx}{p(\Omega_x)} \right)^2 - 1 } > 0 \tag{A11}
$$
where $p(\Omega_x) = \int_{x \in \Omega_x} p(x)\, dx$.
The denominator of Equation (A11) is
$$
\int dy\; p(y) \left( \frac{\int_{x \in \Omega_x} p(x|y)\, dx}{p(\Omega_x)} \right)^2 - 1 = \int dy\; p(y) \left( \frac{p(\Omega_x|y)}{p(\Omega_x)} \right)^2 - 1 = \int dy\; \frac{p(y|\Omega_x)^2}{p(y)} - 1 = \mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \frac{p(y|\Omega_x)}{p(y)} \right] - 1
$$
Using the inequality $x - 1 \ge \log x$, we have
$$
\mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \frac{p(y|\Omega_x)}{p(y)} \right] - 1 \ge \mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \log \frac{p(y|\Omega_x)}{p(y)} \right] \ge 0
$$
Both equalities hold iff p(y|Ω_x) ≡ p(y), in which case the denominator of Equation (A11) equals 0 and the expression inside the infimum diverges, so it does not contribute to the infimum. Apart from this scenario, the denominator is greater than 0. Substituting into Equation (A11), we have that a sufficient condition for (X,Y) to be IB_β-learnable is
$$
\beta > \inf_{\Omega_x \subset \mathcal{X}} \frac{ \dfrac{1}{p(\Omega_x)} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \dfrac{p(y|\Omega_x)}{p(y)} \right] - 1 } > 0 \tag{A12}
$$
Since Ω_x is a proper subset of 𝒳, by the definition of h(x) in Equation (A10), h(x) is not constant on the entire 𝒳. Hence the numerator of Equation (A12) is positive. Since its denominator is also positive, we can neglect the "> 0" and obtain the condition in Theorem 4.
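For discrete X and Y, the subset-based quantity in Equation (A12) is straightforward to evaluate. A small sketch, assuming a joint table `p_xy` and a boolean mask `omega` selecting Ω_x (names are illustrative, not the authors' code):

```python
import numpy as np

def beta0_of_subset(p_xy, omega):
    """beta_0(Omega) = (1/p(Omega) - 1) /
    (E_{y~p(y|Omega)}[p(y|Omega)/p(y)] - 1), per Equation (A12).
    Assumes p(y|Omega) differs from p(y), so the denominator is positive."""
    p_y = p_xy.sum(axis=0)
    p_omega = p_xy[omega].sum()                       # p(Omega_x)
    p_y_given_omega = p_xy[omega].sum(axis=0) / p_omega
    denom = np.sum(p_y_given_omega ** 2 / p_y) - 1.0
    return (1.0 / p_omega - 1.0) / denom
```

This is exactly the value of β₀[h(x)] for the indicator h(x) of Equation (A10), so minimizing it over candidate subsets (as in Algorithm 1) yields a sufficient learnability threshold.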
Since the class of h(x) used in this theorem is a subset of the class of h(x) used in Theorem 3, the infimum in Equation (5) is greater than or equal to the infimum in Equation (2). Therefore, according to the second statement of Theorem 3, $\left( \inf_{\Omega_x \subset \mathcal{X}} \beta_0(\Omega_x) \right)^{-1}$ is also a lower bound on the slope of the Pareto frontier of the I(Y;Z) vs. I(X;Z) curve.
Now we prove that the condition in Equation (5) is invariant to invertible mappings of X. Indeed, let X′ = g(X) be a uniquely invertible map (if X is continuous, g is additionally required to be continuous), let $\mathcal{X}' = \{ g(x)\; |\; x \in \mathcal{X} \}$, and denote $g(\Omega_x) \equiv \{ g(x)\; |\; x \in \Omega_x \}$ for any Ω_x ⊂ 𝒳. We have p(g(Ω_x)) = p(Ω_x) and p(y|g(Ω_x)) = p(y|Ω_x). Then for the dataset (X′,Y), letting Ω_{x′} = g(Ω_x), we have
$$
\frac{ \dfrac{1}{p(\Omega_{x'})} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_{x'})} \left[ \dfrac{p(y|\Omega_{x'})}{p(y)} \right] - 1 } = \frac{ \dfrac{1}{p(\Omega_x)} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \dfrac{p(y|\Omega_x)}{p(y)} \right] - 1 }
$$
Additionally, we have $\mathcal{X}' = g(\mathcal{X})$. Then
$$
\inf_{\Omega_{x'} \subset \mathcal{X}'} \frac{ \dfrac{1}{p(\Omega_{x'})} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_{x'})} \left[ \dfrac{p(y|\Omega_{x'})}{p(y)} \right] - 1 } = \inf_{\Omega_x \subset \mathcal{X}} \frac{ \dfrac{1}{p(\Omega_x)} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \dfrac{p(y|\Omega_x)}{p(y)} \right] - 1 } \tag{A14}
$$
For the dataset (X′,Y) = (g(X),Y), applying Theorem 4 we have that a sufficient condition for it to be IB_β-learnable is
$$
\beta > \inf_{\Omega_{x'} \subset \mathcal{X}'} \frac{ \dfrac{1}{p(\Omega_{x'})} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_{x'})} \left[ \dfrac{p(y|\Omega_{x'})}{p(y)} \right] - 1 } = \inf_{\Omega_x \subset \mathcal{X}} \frac{ \dfrac{1}{p(\Omega_x)} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \dfrac{p(y|\Omega_x)}{p(y)} \right] - 1 }
$$
where the equality is due to Equation (A14). Comparing with the condition for IB_β-learnability for (X,Y) (Equation (5)), we see that they are identical. Therefore, the condition given by Theorem 4 is invariant to invertible mappings of X. □

Appendix A.10. Proof of Corollary 1 and Corollary 2

Appendix A.10.1. Proof of Corollary 1

Proof. 
We use Theorem 4. Let Ω_x contain all elements x whose true class is y*, for a certain y*. Then we obtain a (potentially stronger) sufficient condition. Since the label noise is class-conditional, p(y|y*, x) = p(y|y*), and we have
$$
\inf_{\Omega_x \subset \mathcal{X}} \frac{ \dfrac{1}{p(\Omega_x)} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \dfrac{p(y|\Omega_x)}{p(y)} \right] - 1 } = \inf_{y^*} \frac{ \dfrac{1}{p(y^*)} - 1 }{ \mathbb{E}_{y \sim p(y|y^*)} \left[ \dfrac{p(y|y^*)}{p(y)} \right] - 1 }
$$
By requiring $\beta > \inf_{y^*} \left( \frac{1}{p(y^*)} - 1 \right) \Big/ \left( \mathbb{E}_{y \sim p(y|y^*)} \left[ \frac{p(y|y^*)}{p(y)} \right] - 1 \right)$, we obtain a sufficient condition for IB_β-learnability. □
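As a concrete check of Corollary 1: for two uniform classes with symmetric class-conditional noise rate ρ, the bound evaluates analytically to 1/(1 − 2ρ)², which matches column (1) of Table 1. A short sketch (the helper below is illustrative, not from the paper):

```python
def corollary1_binary(rho):
    """Corollary 1 bound for two uniform classes with symmetric label
    noise: p(y|y*) is 1-rho on the true class and rho on the other,
    and p(y) = 1/2. Algebraically this equals 1/(1 - 2*rho)**2."""
    p_ystar = 0.5
    e_ratio = ((1 - rho) ** 2 + rho ** 2) / 0.5   # E_{y~p(y|y*)}[p(y|y*)/p(y)]
    return (1 / p_ystar - 1) / (e_ratio - 1)

# e.g. rho = 0.20 -> 2.78 and rho = 0.40 -> 25.00, as in Table 1
assert abs(corollary1_binary(0.2) - 1 / (1 - 2 * 0.2) ** 2) < 1e-12
```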

Appendix A.10.2. Proof of Corollary 2

Proof. 
We again use Theorem 4. Since Y is a deterministic function of X, let Y = f(X). By the assumption that Y contains at least one value y* with probability p(y*) > 0, we let Ω_x contain only those x for which f(x) = y*. Substituting into Equation (5), we have
$$
\frac{ \dfrac{1}{p(\Omega_x)} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \dfrac{p(y|\Omega_x)}{p(y)} \right] - 1 } = \frac{ \dfrac{1}{p(y^*)} - 1 }{ \mathbb{E}_{y \sim p(y|\Omega_x)} \left[ \dfrac{1}{p(y)} \right] - 1 } = \frac{ \dfrac{1}{p(y^*)} - 1 }{ \dfrac{1}{p(y^*)} - 1 } = 1
$$
Therefore, the sufficient condition becomes β > 1. □

Appendix A.11. β₀, Hypercontractivity Coefficient, Contraction Coefficient, β₀[h(x)], and Maximum Correlation

In this section, we prove the relations between the IB-learnability threshold β₀, the hypercontractivity coefficient ξ(X;Y), the contraction coefficient η_KL(p(y|x), p(x)), the quantity β₀[h(x)] in Equation (2), and the maximum correlation ρ_m(X;Y), as follows:
$$
\frac{1}{\beta_0} = \xi(X;Y) = \eta_{\mathrm{KL}}(p(y|x), p(x)) \ge \sup_{h(x)} \frac{1}{\beta_0[h(x)]} = \rho_m^2(X;Y)
$$
Proof. 
The hypercontractivity coefficient ξ is defined as [16]:
$$
\xi(X;Y) \equiv \sup_{Z - X - Y} \frac{I(Y;Z)}{I(X;Z)}
$$
By our definition of IB-learnability, (X,Y) is IB_β-learnable iff there exists Z obeying the Markov chain Z − X − Y such that
$$
I(X;Z) - \beta \cdot I(Y;Z) < 0 = \mathrm{IB}_\beta(X,Y;Z) \Big|_{p(z|x) = p(z)}
$$
or equivalently, iff there exists Z obeying the Markov chain Z − X − Y such that
$$
0 < \frac{1}{\beta} < \frac{I(Y;Z)}{I(X;Z)} \tag{A17}
$$
By Theorem 1, the IB-learnability region for β is (β₀, +∞), or equivalently, the IB-learnability region for 1/β is
$$
0 < \frac{1}{\beta} < \frac{1}{\beta_0} \tag{A18}
$$
Comparing Equations (A17) and (A18), we have that
$$
\frac{1}{\beta_0} = \sup_{Z - X - Y} \frac{I(Y;Z)}{I(X;Z)} = \xi(X;Y)
$$
In Anantharam et al. [16], the authors prove that
$$
\xi(X;Y) = \eta_{\mathrm{KL}}(p(y|x), p(x))
$$
where the contraction coefficient η_KL(p(y|x), p(x)) is defined as
$$
\eta_{\mathrm{KL}}(p(y|x), p(x)) = \sup_{r(x) \ne p(x)} \frac{ D_{\mathrm{KL}}(r(y)\, \|\, p(y)) }{ D_{\mathrm{KL}}(r(x)\, \|\, p(x)) }
$$
where $p(y) = \mathbb{E}_{x \sim p(x)}[p(y|x)]$ and $r(y) = \mathbb{E}_{x \sim r(x)}[p(y|x)]$. Treating p(y|x) as a channel, the contraction coefficient measures how much closer the two distributions r(x) and p(x) become (as measured by the KL divergence) after passing through the channel.
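For a discrete channel, any particular r(x) gives a lower bound on η_KL, so random restarts yield a crude numerical estimate. A minimal sketch (this is not the estimator of [18]; the sampling scheme and names here are assumptions):

```python
import numpy as np

def kl(p, q):
    # KL divergence for discrete distributions; assumes q > 0 wherever p > 0
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def eta_kl_lower_bound(p_y_given_x, p_x, n_trials=10_000, seed=0):
    """Crude lower bound on eta_KL(p(y|x), p(x)) by random input
    distributions r(x). Rows of p_y_given_x are p(y|x)."""
    rng = np.random.default_rng(seed)
    p_y = p_x @ p_y_given_x
    best = 0.0
    for _ in range(n_trials):
        r_x = rng.dirichlet(np.ones_like(p_x))   # random r(x)
        r_y = r_x @ p_y_given_x                  # pushed through the channel
        d_x = kl(r_x, p_x)
        if d_x > 1e-12:
            best = max(best, kl(r_y, p_y) / d_x)
    return best
```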
In Anantharam et al. [16], the authors also provide a counterexample to an earlier result by Erkip and Cover [31] that incorrectly claimed ξ(X;Y) = ρ_m²(X;Y); in the specific counterexample constructed by Anantharam et al. [16], ξ(X;Y) > ρ_m²(X;Y).
The maximum correlation is defined as $\rho_m(X;Y) \equiv \max_{f,g} \mathbb{E}[f(X)\, g(Y)]$, where f(X) and g(Y) are real-valued random variables such that $\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0$ and $\mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1$ [20,21].
Now we prove that ξ(X;Y) ≥ ρ_m²(X;Y), based on Theorem 3. To see this, we use the alternative characterization of ρ_m(X;Y) by Rényi [32]:
$$
\rho_m^2(X;Y) = \max_{f(X):\; \mathbb{E}[f(X)] = 0,\; \mathbb{E}[f^2(X)] = 1} \mathbb{E} \left[ \mathbb{E}[f(X)|Y]^2 \right] \tag{A21}
$$
Denoting $\bar{h} = \mathbb{E}_{x \sim p(x)}[h(x)]$, we can transform β₀[h(x)] in Equation (2) as follows:
$$
\begin{aligned}
\beta_0[h(x)] &= \frac{ \mathbb{E}_{x \sim p(x)}[h(x)^2] - \bar{h}^2 }{ \mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x)]^2 \right] - \bar{h}^2 } = \frac{ \mathbb{E}_{x \sim p(x)} \left[ (h(x) - \bar{h})^2 \right] }{ \mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[h(x) - \bar{h}]^2 \right] } \\
&= \frac{1}{ \mathbb{E}_{y \sim p(y)} \left[ \mathbb{E}_{x \sim p(x|y)}[f(x)]^2 \right] } = \frac{1}{ \mathbb{E} \left[ \mathbb{E}[f(X)|Y]^2 \right] }
\end{aligned}
$$
where we denote $f(x) = \dfrac{h(x) - \bar{h}}{ \left( \mathbb{E}_{x \sim p(x)}[(h(x) - \bar{h})^2] \right)^{1/2} }$, so that $\mathbb{E}[f(X)] = 0$ and $\mathbb{E}[f^2(X)] = 1$.
Combining with Equation (A21), we have
$$
\sup_{h(x)} \frac{1}{\beta_0[h(x)]} = \rho_m^2(X;Y) \tag{A22}
$$
Our Theorem 3 states that
$$
\sup_{h(x)} \frac{1}{\beta_0[h(x)]} \le \frac{1}{\beta_0} \tag{A23}
$$
Combining Equations (A18), (A22) and (A23), we have
$$
\rho_m^2(X;Y) \le \xi(X;Y)
$$
In summary, the relations among the quantities are:
$$
\frac{1}{\beta_0} = \xi(X;Y) = \eta_{\mathrm{KL}}(p(y|x), p(x)) \ge \sup_{h(x)} \frac{1}{\beta_0[h(x)]} = \rho_m^2(X;Y)
$$
 □
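A practical consequence of these relations: for discrete X and Y, ρ_m(X;Y) equals the second-largest singular value of the matrix Q with $Q_{ij} = p(x_i, y_j)/\sqrt{p(x_i)\, p(y_j)}$ (the largest is always 1), so $\sup_{h(x)} 1/\beta_0[h(x)] = \rho_m^2$ can be computed directly. A sketch — the SVD characterization is a standard fact about maximum correlation, not code from the paper:

```python
import numpy as np

def max_correlation(p_xy):
    """rho_m(X;Y) via the second singular value of the normalized
    joint-distribution matrix Q[i,j] = p(x_i,y_j)/sqrt(p(x_i)p(y_j))."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    q = p_xy / np.sqrt(np.outer(p_x, p_y))
    s = np.linalg.svd(q, compute_uv=False)  # singular values, descending
    return s[1]

# Binary symmetric label noise with rate rho = 0.2: rho_m = 1 - 2*rho = 0.6,
# so the Theorem 3 threshold inf_h beta_0[h] = 1/rho_m^2 = 1/(1-2*rho)^2.
p_xy = 0.5 * np.array([[0.8, 0.2], [0.2, 0.8]])
assert abs(max_correlation(p_xy) - 0.6) < 1e-12
```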

Appendix A.12. Experiment Details

We use the Variational Information Bottleneck (VIB) objective from [5]. For the synthetic experiment, the latent Z has dimension 2. The encoder is a neural net with 2 hidden layers, each with 128 neurons and ReLU activation. The last layer has linear activation and 4 output neurons: the first two parameterize the mean of a Gaussian, and the last two parameterize the log variance. The decoder is a neural net with 1 hidden layer of 128 neurons with ReLU activation; its last layer has linear activation and outputs the logits for the class labels. The prior is a mixture of Gaussians with 500 components (256 components for the experiment with class overlap), each a 2D Gaussian with learnable mean and log variance; the mixture weights are also learnable. For the MNIST experiment, the architecture is mostly the same, except that (1) Z has dimension 256, and (2) the prior is a standard Gaussian with diagonal covariance matrix.
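For concreteness, a minimal sketch of the encoder/decoder just described, written here in PyTorch; the paper does not include code, so the framework and module names are assumptions:

```python
import torch
import torch.nn as nn

class VIBEncoder(nn.Module):
    """2-hidden-layer encoder producing a Gaussian mean and log-variance."""
    def __init__(self, in_dim=2, z_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),  # linear output: mean and log-variance
        )

    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparameterized sample of Z
        return z, mu, logvar

class VIBDecoder(nn.Module):
    """1-hidden-layer decoder mapping Z to class logits."""
    def __init__(self, z_dim=2, n_classes=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, z):
        return self.net(z)
```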
For all experiments, we use the Adam [33] optimizer with default parameters and no explicit regularization. We use a learning rate of 10⁻⁴ with a decay factor of $\frac{1}{1 + 0.01 \times \mathrm{epoch}}$. We train for 2000 epochs in total with a mini-batch size of 500.
For estimation of the observed β₀ in Figure 3, in the I(X;Z) vs. β_i curve (β_i denotes the i-th β), we take the mean and standard deviation of I(X;Z) over the lowest 5 β_i values, denoted μ_β and σ_β (I(Y;Z) behaves similarly, but since we are minimizing I(X;Z) − β·I(Y;Z), the onset of nonzero I(X;Z) is less prone to noise). When I(X;Z) exceeds μ_β + 3σ_β, we regard the model as having learned a non-trivial representation, and take the average of β_i and β_{i−1} as the experimentally estimated onset of learning. We also inspect the results manually and confirm that they are consistent with human intuition.
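A small sketch of this onset-detection heuristic; the array names are illustrative, and `betas` and `i_xz` are assumed to be paired arrays sorted by β:

```python
import numpy as np

def estimate_observed_beta0(betas, i_xz, n_baseline=5):
    """Return the first beta at which I(X;Z) exceeds the baseline
    mean + 3 std, averaged with the previous beta."""
    mu = np.mean(i_xz[:n_baseline])
    sigma = np.std(i_xz[:n_baseline])
    for i in range(1, len(betas)):
        if i_xz[i] > mu + 3 * sigma:          # non-trivial representation
            return 0.5 * (betas[i] + betas[i - 1])
    return None  # no onset found in the scanned range
```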
For estimating β₀ using Algorithm 1, at step 6 we use the following discrete search procedure. We fix i_left = 1 and gradually narrow down the range [a, b] for i_right, starting from [1, N]. At each iteration, we set a tentative new range [a′, b′] with a′ = 0.8a + 0.2b and b′ = 0.2a + 0.8b, and calculate β̃₀,a′ = Getβ(P_{y|x}, p_y, Ω_{a′}) and β̃₀,b′ = Getβ(P_{y|x}, p_y, Ω_{b′}), where Ω_{a′} = {1, 2, …, a′} and Ω_{b′} = {1, 2, …, b′}. If β̃₀,a′ < β̃₀,a, we let a ← a′. If β̃₀,b′ < β̃₀,b, we let b ← b′. In other words, we narrow down the range for i_right whenever the Ω given by the tentative left or right boundary yields a lower β̃₀ value. The process stops when neither β̃₀,a nor β̃₀,b improves (which we find always happens when b = a + 1), and we return the smaller of the final β̃₀,a and β̃₀,b as β̃₀.
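A sketch of this discrete search; the helper `get_beta(k)` stands in for Getβ(P_{y|x}, p_y, Ω) evaluated on Ω = {1, …, k}, and its exact signature in the authors' code is an assumption:

```python
def search_beta0(get_beta, n):
    """Shrink the range [a, b] for i_right toward the minimizing subset
    size, as described above, and return the best beta~_0 found."""
    a, b = 1, n
    beta_a, beta_b = get_beta(a), get_beta(b)
    while True:
        a2 = round(0.8 * a + 0.2 * b)   # tentative new left boundary
        b2 = round(0.2 * a + 0.8 * b)   # tentative new right boundary
        improved = False
        if get_beta(a2) < beta_a:
            a, beta_a, improved = a2, get_beta(a2), True
        if get_beta(b2) < beta_b:
            b, beta_b, improved = b2, get_beta(b2), True
        if not improved:                # in practice, once b == a + 1
            return min(beta_a, beta_b)
```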
For the estimation of p(y|x) used in (2′) Algorithm 1 and (3′) η̂_KL, for both the synthetic and MNIST experiments, we use a 3-layer neural net in which each hidden layer has 128 neurons with ReLU activation. The last layer has linear activation, and the objective is the cross-entropy loss. We use the Adam [33] optimizer with a learning rate of 10⁻⁴ and train for 100 epochs (after which the validation loss does not improve).
For estimating β₀ via (3′) η̂_KL with the algorithm in [18], we use the code from the GitHub repository provided by the paper (at https://github.com/wgao9/hypercontractivity), using the same p(y|x) employed for (2′) Algorithm 1. Since our datasets are classification tasks, we use A_ij = p(y_j|x_i)/p(y_j) instead of the kernel density estimate for matrix A; we take the maximum over 10 runs as the estimate of μ.

CIFAR10 Details

We trained a deterministic 28×10 wide ResNet [34,35], using the open-source implementation from Cubuk et al. [36]. However, we extended the final 10-dimensional logits of that model with an additional 3-layer MLP classifier, in order to keep the inference network architecture identical between this model and the VIB models described below. During training, we dynamically added label noise according to the class confusion matrix in Table A1; the mean label noise averaged across the 10 classes is 20%. After that model had converged, we used it to estimate β₀ with Algorithm 1. Even with 20% label noise, β₀ was estimated to be 1.0483.
Table A1. Class confusion matrix used in CIFAR10 experiments. The value in row i, column j means: for class i, the probability of labeling it as class j. The mean confusion across the classes is 20%.

|       | Plane   | Auto.   | Bird    | Cat     | Deer    | Dog     | Frog    | Horse   | Ship    | Truck   |
|-------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| Plane | 0.82232 | 0.00238 | 0.021   | 0.00069 | 0.00108 | 0       | 0.00017 | 0.00019 | 0.1473  | 0.00489 |
| Auto. | 0.00233 | 0.83419 | 0.00009 | 0.00011 | 0       | 0.00001 | 0.00002 | 0       | 0.00946 | 0.15379 |
| Bird  | 0.03139 | 0.00026 | 0.76082 | 0.0095  | 0.07764 | 0.01389 | 0.1031  | 0.00309 | 0.00031 | 0       |
| Cat   | 0.00096 | 0.0001  | 0.00273 | 0.69325 | 0.00557 | 0.28067 | 0.01471 | 0.00191 | 0.00002 | 0.0001  |
| Deer  | 0.00199 | 0       | 0.03866 | 0.00542 | 0.83435 | 0.01273 | 0.02567 | 0.08066 | 0.00052 | 0.00001 |
| Dog   | 0       | 0.00004 | 0.00391 | 0.2498  | 0.00531 | 0.73191 | 0.00477 | 0.00423 | 0.00001 | 0       |
| Frog  | 0.00067 | 0.00008 | 0.06303 | 0.05025 | 0.0337  | 0.00842 | 0.8433  | 0       | 0.00054 | 0       |
| Horse | 0.00157 | 0.00006 | 0.00649 | 0.00295 | 0.13058 | 0.02287 | 0       | 0.83328 | 0.00023 | 0.00196 |
| Ship  | 0.1288  | 0.01668 | 0.00029 | 0.00002 | 0.00164 | 0.00006 | 0.00027 | 0.00017 | 0.83385 | 0.01822 |
| Truck | 0.01007 | 0.15107 | 0       | 0.00015 | 0.00001 | 0.00001 | 0       | 0.00048 | 0.02549 | 0.81273 |
We then trained 73 different VIB models using the same 28×10 wide ResNet architecture for the encoder, parameterizing the mean of a 10-dimensional unit-variance Gaussian. Samples from the encoder distribution were fed to the same 3-layer MLP classifier architecture used in the deterministic model. The marginal distribution was a mixture of 500 fully-covariate 10-dimensional Gaussians, all parameters of which were trained. The VIB models had β ranging from 1.02 to 2.0 in steps of 0.02, plus an extra set ranging from 1.04 to 1.06 in steps of 0.001 to ensure we captured the empirical β₀ with high precision.
However, this particular VIB architecture does not start learning until β > 2.5, so none of these models would train as described. (A given architecture trained using maximum likelihood and with no stochastic layers will tend to have higher effective capacity than the same architecture with a stochastic layer that has a fixed but non-trivial variance, even though those two architectures have exactly the same number of learnable parameters.) Instead, we started all models at β = 100 and annealed β down to the corresponding target over 10,000 training gradient steps. The models continued to train for another 200,000 gradient steps after that. In all cases, the models converged to essentially their final accuracy within 20,000 additional gradient steps after annealing completed, and were stable over the remaining ∼180,000 gradient steps.
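A sketch of such an annealing schedule; the paper does not specify the interpolation curve, so the log-linear ramp below is an assumption:

```python
import math

def beta_schedule(step, target_beta, start_beta=100.0, anneal_steps=10_000):
    """Anneal beta from start_beta down to target_beta over anneal_steps
    gradient steps, then hold it fixed."""
    if step >= anneal_steps:
        return target_beta
    t = step / anneal_steps
    # interpolate in log space so beta shrinks quickly at first
    return math.exp((1 - t) * math.log(start_beta) + t * math.log(target_beta))
```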

References

  1. Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057.
  2. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
  3. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. 2005, 6, 165–188.
  4. Rey, M.; Roth, V. Meta-Gaussian information bottleneck. In Advances in Neural Information Processing Systems; NIPS: San Diego, CA, USA, 2012; pp. 1916–1924.
  5. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410.
  6. Chalk, M.; Marre, O.; Tkacik, G. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems; NIPS: San Diego, CA, USA, 2016; pp. 1957–1965.
  7. Fischer, I. The Conditional Entropy Bottleneck. 2018. Available online: https://openreview.net/forum?id=rkVOXhAqY7 (accessed on 20 September 2019).
  8. Strouse, D.; Schwab, D.J. The deterministic information bottleneck. Neural Comput. 2017, 29, 1611–1630.
  9. Kolchinsky, A.; Tracey, B.D.; Van Kuyk, S. Caveats for information bottleneck in deterministic scenarios. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 30 April 2019.
  10. Strouse, D.; Schwab, D.J. The information bottleneck and geometric clustering. arXiv 2017, arXiv:1712.09657.
  11. Achille, A.; Soatto, S. Emergence of invariance and disentanglement in deep representations. J. Mach. Learn. Res. 2018, 19, 1947–1980.
  12. Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018.
  13. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  14. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
  15. Achille, A.; Mbeng, G.; Soatto, S. The Dynamics of Differential Learning I: Information-Dynamics and Task Reachability. arXiv 2018, arXiv:1810.02440.
  16. Anantharam, V.; Gohari, A.; Kamath, S.; Nair, C. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv 2013, arXiv:1304.6133.
  17. Polyanskiy, Y.; Wu, Y. Strong data-processing inequalities for channels and Bayesian networks. In Convexity and Concentration; Springer: Berlin/Heidelberg, Germany, 2017; pp. 211–249.
  18. Kim, H.; Gao, W.; Kannan, S.; Oh, S.; Viswanath, P. Discovering potential correlations via hypercontractivity. In Advances in Neural Information Processing Systems; NIPS: San Diego, CA, USA, 2017; pp. 4577–4587.
  19. Lin, H.W.; Tegmark, M. Criticality in formal languages and statistical physics. arXiv 2016, arXiv:1606.06737.
  20. Hirschfeld, H.O. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society; Cambridge University Press: Cambridge, UK, 1935; Volume 31, pp. 520–524.
  21. Gebelein, H. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM J. Appl. Math. Mech. 1941, 21, 364–379.
  22. Angluin, D.; Laird, P. Learning from noisy examples. Mach. Learn. 1988, 2, 343–370.
  23. Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K.; Tewari, A. Learning with noisy labels. In Advances in Neural Information Processing Systems; NIPS: San Diego, CA, USA, 2013; pp. 1196–1204.
  24. Liu, T.; Tao, D. Classification with noisy labels by importance reweighting. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 447–461.
  25. Xiao, T.; Xia, T.; Yang, Y.; Huang, C.; Wang, X. Learning from massive noisy labeled data for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2691–2699.
  26. Northcutt, C.G.; Wu, T.; Chuang, I.L. Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv 2017, arXiv:1705.01936.
  27. van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Kavukcuoglu, K.; Vinyals, O.; Graves, A. Conditional Image Generation with PixelCNN Decoders. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 4790–4798.
  28. Salimans, T.; Karpathy, A.; Chen, X.; Kingma, D.P. PixelCNN++: A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
  29. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
  30. Gelfand, I.M.; Silverman, R.A. Calculus of Variations; Courier Corporation: North Chelmsford, MA, USA, 2000.
  31. Erkip, E.; Cover, T.M. The efficiency of investment information. IEEE Trans. Inf. Theory 1998, 44, 1026–1040.
  32. Rényi, A. On measures of dependence. Acta Math. Hung. 1959, 10, 441–451.
  33. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  35. Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2016, arXiv:1605.07146.
  36. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning augmentation policies from data. arXiv 2018, arXiv:1805.09501.
Figure 1. Accuracy for binary classification of MNIST digits 0 and 1 with 20% label noise and varying β. No learning happens for models trained at β < 3.25.
Figure 2. The Pareto frontier of the information plane, I(X;Z) vs. I(Y;Z), for the binary classification of MNIST digits 0 and 1 with 20% label noise described in Section 1 and Figure 1. For this problem, learning happens for models trained at β > 3.25. H(Y) = 1 bit since only two of ten digits are used, and I(Y;Z) ≤ I(X;Y) ≈ 0.5 bits < H(Y) because of the 20% label noise. The true frontier is differentiable; the figure shows a variational approximation that places an upper bound on both informations, horizontally offset to pass through the origin.
Figure 3. Predicted vs. experimentally identified β₀, for mixture of Gaussians with varying class-conditional noise rates.
Figure 4. I(Y;Z) vs. β, for mixture-of-Gaussians datasets with different distances between the two mixture components. The vertical lines are β₀,predicted computed by the R.H.S. of Equation (8). As Equation (8) does not make predictions w.r.t. class overlap, the vertical lines are always just above β₀,predicted = 1. However, as expected, decreasing the distance between the classes in X space also increases the true β₀.
Figure 5. I(Y;Z) vs. β for the MNIST binary classification with different hidden units per layer n and noise rates ρ: (upper left) ρ = 0.02, (upper right) ρ = 0.1, (lower left) ρ = 0.2, (lower right) ρ = 0.3. The vertical lines are β₀ estimated by different methods. n = 128 has insufficient capacity for the problem, so its observed learnability onset is pushed higher, similar to the class overlap case.
Figure 6. Histograms of the full MNIST training and validation sets according to h(X). Note that both are bimodal and the histograms are indistinguishable. In both cases, h(x) has learned to separate most of the ones into the smaller mode, but difficult ones are in the wide valley between the two modes. See Figure 7 for all of the training images to the left of the red threshold line, as well as the first few images to the right of the threshold.
Figure 7. The first 5776 MNIST training set digits when sorted by h(x). The digits highlighted in red are above the threshold drawn in Figure 6.
Figure 8. Plot of I(Y;Z) vs. β for the CIFAR10 training set with 20% label noise. Each blue cross corresponds to a fully-converged model starting with independent initialization. The vertical black line corresponds to the predicted β₀ = 1.0483 using Algorithm 1. The empirical β₀ = 1.048.
Table 1. Full table of values used to generate Figure 3. Columns (2) and (3) use the true p(y\|x); columns (2′) and (3′) use the learned estimate of p(y\|x).

| Noise Rate | Observed | (1) Corollary 1 | (2) Algorithm 1, true p(y\|x) | (3) η̂_KL, true p(y\|x) | (4) Equation (2) | (2′) Algorithm 1 | (3′) η̂_KL |
|------|--------|--------|--------|--------|--------|--------|--------|
| 0.02 | 1.06   | 1.09   | 1.09   | 1.10   | 1.08   | 1.08   | 1.10   |
| 0.04 | 1.20   | 1.18   | 1.18   | 1.21   | 1.18   | 1.19   | 1.20   |
| 0.06 | 1.26   | 1.29   | 1.29   | 1.33   | 1.30   | 1.31   | 1.33   |
| 0.08 | 1.40   | 1.42   | 1.42   | 1.45   | 1.42   | 1.43   | 1.46   |
| 0.10 | 1.52   | 1.56   | 1.56   | 1.60   | 1.55   | 1.58   | 1.60   |
| 0.12 | 1.70   | 1.73   | 1.73   | 1.78   | 1.71   | 1.73   | 1.77   |
| 0.14 | 1.99   | 1.93   | 1.93   | 1.99   | 1.90   | 1.91   | 1.95   |
| 0.16 | 2.04   | 2.16   | 2.16   | 2.24   | 2.15   | 2.15   | 2.16   |
| 0.18 | 2.41   | 2.44   | 2.44   | 2.49   | 2.43   | 2.42   | 2.49   |
| 0.20 | 2.74   | 2.78   | 2.78   | 2.86   | 2.76   | 2.77   | 2.71   |
| 0.22 | 3.15   | 3.19   | 3.19   | 3.29   | 3.19   | 3.21   | 3.29   |
| 0.24 | 3.75   | 3.70   | 3.70   | 3.83   | 3.71   | 3.75   | 3.72   |
| 0.26 | 4.40   | 4.34   | 4.34   | 4.48   | 4.35   | 4.31   | 4.17   |
| 0.28 | 5.16   | 5.17   | 5.17   | 5.37   | 5.12   | 4.98   | 4.55   |
| 0.30 | 6.34   | 6.25   | 6.25   | 6.49   | 6.24   | 6.03   | 5.58   |
| 0.32 | 8.06   | 7.72   | 7.72   | 8.02   | 7.63   | 7.19   | 7.33   |
| 0.34 | 9.77   | 9.77   | 9.77   | 10.13  | 9.74   | 8.95   | 7.37   |
| 0.36 | 12.58  | 12.76  | 12.76  | 13.21  | 12.51  | 11.11  | 10.09  |
| 0.38 | 16.91  | 17.36  | 17.36  | 17.96  | 16.97  | 14.55  | 10.49  |
| 0.40 | 24.66  | 25.00  | 25.00  | 25.99  | 25.01  | 20.36  | 17.27  |
| 0.42 | 39.08  | 39.06  | 39.06  | 40.85  | 39.48  | 30.12  | 10.89  |
| 0.44 | 64.82  | 69.44  | 69.44  | 71.80  | 76.48  | 51.95  | 21.95  |
| 0.46 | 163.07 | 156.25 | 156.26 | 161.88 | 173.15 | 114.57 | 21.47  |
| 0.48 | 599.45 | 625.00 | 625.00 | 651.47 | 838.90 | 293.90 | 8.69   |
