**Information Bottleneck Theory and Applications in Deep Learning**

Printed Edition of the Special Issue Published in *Entropy*

Edited by Bernhard C. Geiger and Gernot Kubin

## **Information Bottleneck: Theory and Applications in Deep Learning**


Editors

**Bernhard C. Geiger Gernot Kubin**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors* Bernhard C. Geiger Know-Center GmbH Austria

Gernot Kubin Graz University of Technology Austria

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Entropy* (ISSN 1099-4300) (available at: https://www.mdpi.com/journal/entropy/special_issues/information_theoretic_computational_intelligence).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-0802-3 (Hbk) ISBN 978-3-0365-0803-0 (PDF)**

© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


## **About the Editors**

**Bernhard C. Geiger** (Dipl.-Ing. Dr.) received the Dipl.-Ing. degree in Electrical Engineering (with distinction) and the Dr. techn. degree in Electrical and Information Engineering (with distinction) from Graz University of Technology, Austria, in 2009 and 2014, respectively. In 2009, he joined the Signal Processing and Speech Communication Laboratory, Graz University of Technology, as a Project Assistant and later accepted a position as Research and Teaching Associate at the same lab in 2010. He was a Senior Scientist and Erwin Schrödinger Fellow at the Institute for Communications Engineering, Technical University of Munich, Germany (2014–2017) and a postdoctoral researcher at the Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria (2017–2018). He is currently a Senior Researcher at Know-Center GmbH, Graz, Austria, where he also leads the Machine Learning Group within the Knowledge Discovery Area. Dr. Geiger's research interests cover information theory for machine learning, theory-assisted machine learning, and information-theoretic model reduction for Markov chains and hidden Markov models. He is a Senior Member of the IEEE.

**Gernot Kubin** (Dipl.-Ing. Dr.) has been University Professor and Founding Director of the Signal Processing and Speech Communication Laboratory at TU Graz, Austria, since 2000. He received the Dipl.-Ing. degree in 1982 and Dr. techn. degree (sub auspiciis praesidentis) in Electrical Engineering in 1990 from TU Vienna, Austria. At TU Graz, he has served as Dean of Studies in Electrical and Audio Engineering (2004–2007), Coordinator of the Key Research Area Smart Systems for a Mobile Society (2004–2011), Coordinator of the Field of Expertise Information, Communications, and Computing (2013–2015), Deputy Chair of the Curricular Committee for Electrical and Audio Engineering (2004–2008 and 2013–2019), Coordinator of the Doctoral School in Information and Communications Engineering (2007–now), Member of the Commission for Scientific Integrity and Ethics (2007–2011 and 2016–now), and Chair of the Senate (2007–2010 and 2013–now). Earlier international appointments include CERN Geneva (1980), TU Vienna (1983–2000), Erwin Schrödinger Fellow at Philips Research Labs Eindhoven (1985), AT&T Bell Labs Murray Hill (1992–1993 and 1995), KTH Stockholm (1998), Global IP Sound Stockholm (2000–2001) and San Francisco (2006), UC San Diego (2006), Danang UT (2009), and TU Munich (2015, 2017, and 2018).
At national level, he has demonstrated leadership in the Vienna Telecommunications Research Centre FTW (1999–2016), the Christian Doppler Laboratory for Nonlinear Signal Processing (2002–2010), the Competence Network for Advanced Speech Technologies COAST (2006–2010), the FWF Research Network on Signal and Information Processing in Science and Engineering SISE (2008–2011), the COMET Excellence Projects Advanced Audio Processing AAP (2008–2013) and Acoustic Sensing and Design ASD (2013–2017), the Higher-Education Conference HSK (2016–now), the TU Graz Excellence Project Dependable Internet of Things (2016–now), the Competence Network for Digital Humanities KONDE (2017–now), the Complexity Science Hub Vienna (2017–now), the Scientific Board of the Christian Doppler Forschungsgesellschaft (2020–now), and as Speaker of the Senate Chairs Conference (SVK) of the Universities of Austria (2019–now). His research interests are in nonlinear signal processing as well as speech and audio communication. He has co-authored over 180 peer-reviewed publications and advised over 30 PhD students. Dr. Kubin is the recipient of the 2015 Nikola Tesla medal for the highest number of patents awarded to a TU Graz scientist in 5 years. He has served as a Member of the Board, Austrian Acoustics Association (2000–2015), as an elected member of the IEEE Speech and Language Processing Technical Committee (2011–2016), and as an elected member of the Speech Acoustics and Speech Processing committees of the German Information Technology Society ITG since 2015. He was General Chair for the INTERSPEECH 2019 conference held in Graz, Austria, in September 2019.

## *Editorial* **Information Bottleneck: Theory and Applications in Deep Learning**

**Bernhard C. Geiger 1,\*,† and Gernot Kubin 2,†**


Received: 2 December 2020; Accepted: 9 December 2020; Published: 14 December 2020

**Keywords:** information bottleneck; deep learning; neural networks

The information bottleneck (IB) framework, proposed in [1], describes the problem of representing an observation $X$ in a lossy manner, such that its representation $T$ is informative of a relevance variable $Y$. Mathematically, the IB problem aims to find a lossy compression scheme, described by a conditional distribution $P_{T|X}$, that minimizes the following functional:

$$\min\_{P\_{T|X}} \left( I(X;T) - \beta I(Y;T) \right) \tag{1}$$

where the minimization is performed over a well-defined feasible set.
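For intuition, the functional in (1) can be evaluated directly when all variables are discrete. The following sketch (our own illustration; the function names are hypothetical) computes $I(X;T) - \beta I(Y;T)$ for a given joint $P_{X,Y}$ and encoder $P_{T|X}$:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in nats for a discrete joint distribution (2-D array)."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

def ib_objective(pxy, pt_given_x, beta):
    """Evaluate I(X;T) - beta * I(Y;T) for an encoder P_{T|X}."""
    px = pxy.sum(axis=1)                 # marginal P(X)
    pxt = px[:, None] * pt_given_x       # joint P(X,T)
    pyt = pxy.T @ pt_given_x             # joint P(Y,T), via the chain Y - X - T
    return mutual_information(pxt) - beta * mutual_information(pyt)
```

Minimizing this expression over row-stochastic matrices `pt_given_x` is then the IB problem itself; the sketch only evaluates the functional at a candidate encoder.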

The IB framework has received significant attention in information theory and machine learning; cf. [2,3]. Recently, it has also gained popularity in the analysis and design of neural networks (NNs): the framework has been proposed to investigate the stochastic optimization of NN parameters with information-theoretic quantities, e.g., [4,5], and the IB functional has been used as a cost function for NN training [6,7].

Based on this increased attention, this Special Issue aims to investigate the properties of the IB functional in this new context and to propose learning mechanisms inspired by the IB framework. More specifically, we invited authors to submit manuscripts that provide novel insight into the properties of the IB functional, that apply the IB principle for training deep (i.e., multi-layer) machine learning structures such as NNs, and that investigate the learning behavior of NNs using the IB framework. To cover the breadth of the current literature, we also solicited manuscripts that discuss frameworks inspired by the IB principle but depart from it in a well-motivated manner.

In the remainder of this Editorial, we provide a brief summary of the papers in this Special Issue, in order of their appearance.


We thank all the authors for their excellent contributions and the timely submission of their works. We look forward to many future developments that will build on the current bounty of insightful results and make machine learning more explainable.

**Funding:** The work of Bernhard C. Geiger was supported by the iDev40 project and by the COMET programs within the Know-Center and the K2 Center "Integrated Computational Material, Process and Product Engineering (IC-MPPE)" (Project No 859480). The iDev40 project has received funding from the ECSEL Joint Undertaking (JU) under Grant Agreement No 783163. The JU receives support from the European Union's Horizon 2020 research and innovation program. It is co-funded by the consortium members, grants from Austria, Germany, Belgium, Italy, Spain, and Romania. The COMET programs are supported by the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology, the Austrian Federal Ministry of Digital and Economic Affairs, and by the States of Styria, Upper Austria, and the Tyrol. COMET is managed by the Austrian Research Promotion Agency FFG.

**Acknowledgments:** We would like to express our gratitude to the Editorial Assistants of Entropy for their help in organizing this Special Issue.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Gaussian Mean Field Regularizes by Limiting Learned Information**

**Julius Kunze 1,\*, Louis Kirsch 1,2, Hippolyt Ritter 1 and David Barber 1,3**


Received: 14 June 2019; Accepted: 1 August 2019; Published: 3 August 2019

**Abstract:** Variational inference with a factorized Gaussian posterior estimate is a widely-used approach for learning parameters and hidden variables. Empirically, a regularizing effect can be observed that is poorly understood. In this work, we show how mean field inference improves generalization by limiting mutual information between learned parameters and the data through noise. We quantify a maximum capacity when the posterior variance is either fixed or learned and connect it to generalization error, even when the KL-divergence in the objective is scaled by a constant. Our experiments suggest that bounding information between parameters and data effectively regularizes neural networks on both supervised and unsupervised tasks.

**Keywords:** information theory; variational inference; machine learning

#### **1. Introduction**

Bayesian machine learning is a popular framework for dealing with uncertainty in a principled way by integrating over model parameters rather than finding point estimates [1–3]. Unfortunately, exact inference is usually not feasible due to the intractable normalization constant of the posterior. A popular alternative is variational inference [4], where a tractable approximate distribution is optimized to resemble the true posterior as closely as possible. Due to its amenability to stochastic gradient descent [5–8], variational inference is scalable to large models and datasets.

The most common choice for the variational posterior is a factorized Gaussian. Outside of Bayesian inference, parameter noise has been found to be an effective regularizer [9–11], e.g., for training neural networks. In combination with L2-regularization, additive Gaussian parameter noise corresponds to variational inference with a Gaussian approximate posterior with fixed variance. Interestingly, it has been observed that flexible posteriors can perform worse than simple ones [12–15].

Variational inference follows the Minimum Description Length (MDL) principle [16–18], a formalization of Occam's Razor. Loosely speaking, it states that of two models describing the data equally well, the "simpler" one should be preferred. However, MDL is only an objective for compressing the training data and the model and makes no formal statement about generalization to unseen data. Yet, generalization to new data is a key property of a machine learning algorithm.

Recent work [19–22] has proposed upper bounds on the generalization error as a function of the mutual information between model parameters and training data. These bounds state that the gap between training and test error can be reduced by limiting the mutual information. However, to the best of our knowledge, these bounds and specific inference methods have so far not been linked.

In this work, we show that Gaussian mean field inference in models with Gaussian priors can be reinterpreted as point estimation in corresponding noisy models. This leads to an upper bound on the mutual information between model parameters and data through the data processing inequality. Our result holds for both supervised and unsupervised models. We discuss the connection to generalization bounds from Xu and Raginsky [19] and Bu et al. [20], suggesting that the Gaussian mean field aids generalization. In our experiments, we show that limiting model capacity via mutual information is an effective measure of regularization, further supporting our theoretical framework.

#### **2. Regularization through the Mean Field**

In our derivation, we denote a generic model as $p(\theta, D) = p(\theta)\,p(D \mid \theta)$ with unobserved variables $\theta$ and data $D$. We refer to $\theta$ as the model parameters; however, in latent variable models, $\theta$ can also include the per-data-point latent variables. The model consists of a prior $p(\theta)$ and a likelihood $p(D \mid \theta)$. Ideally, one would like to find the posterior $p(\theta \mid D) = p(D \mid \theta)\,p(\theta)/Z$, where $Z = \int p(D \mid \theta)\,p(\theta)\,\mathrm{d}\theta$ is the normalizer. However, calculating $Z$ is typically intractable. Variational inference finds an approximation by maximizing the Evidence Lower Bound (ELBO)

$$\begin{split} \log p(D) &\geq \log p(D) - D\_{\text{KL}}\left( q(\theta) || p(\theta \mid D) \right) \\ &= \mathbb{E}\_{q(\theta)} \log p(D \mid \theta) - D\_{\text{KL}}\left( q(\theta) || p(\theta) \right) \end{split} \tag{1}$$

w.r.t. the approximate posterior $q(\theta)$. Our focus in this work lies on Gaussian mean field inference, so $q$ is a fully-factorized normal distribution with learnable mean $\mu$ and variance $\sigma^2$. The prior is also chosen to be component-wise independent, $p(\theta) = \mathcal{N}(0, \sigma_p^2 I)$. The generative and inference models for this setting are shown in Figure 1a.
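As a sanity check of the bound above, consider a conjugate toy model $\theta \sim \mathcal{N}(0, \sigma_p^2)$, $x \mid \theta \sim \mathcal{N}(\theta, 1)$, for which both the ELBO and the exact evidence are available in closed form (a sketch of ours; function names are illustrative):

```python
import numpy as np

def log_evidence(x, sp2):
    """Exact log p(x) for theta ~ N(0, sp2), x | theta ~ N(theta, 1)."""
    v = sp2 + 1.0
    return -0.5 * (np.log(2 * np.pi * v) + x**2 / v)

def elbo(x, mu, s2, sp2):
    """ELBO for the same model with q(theta) = N(mu, s2):
    expected log-likelihood under q minus KL(q || prior)."""
    exp_ll = -0.5 * (np.log(2 * np.pi) + (x - mu)**2 + s2)
    kl = 0.5 * (s2 / sp2 + mu**2 / sp2 - 1.0 - np.log(s2 / sp2))
    return exp_ll - kl
```

Evaluating the ELBO at the exact posterior parameters, $\mu = \sigma_p^2 x / (\sigma_p^2 + 1)$ and $s^2 = \sigma_p^2 / (\sigma_p^2 + 1)$, recovers $\log p(x)$ exactly; any other choice of $(\mu, s^2)$ gives a strictly smaller value.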


**Figure 1.** Gaussian mean field inference on model parameters *θ* with a Gaussian prior (**a**) can be reinterpreted as optimizing a point estimate on a model with injected noise, both when variance is fixed (**b**) and learned (**c**). For the latter case, we show this for the more general case where the complexity term in the objective is scaled by a constant *β* > 0, with *β* = 1 recovering variational inference.

#### *2.1. Fixed-Variance Gaussian Mean Field Inference*

When the variance $\sigma^2$ of the approximate posterior is fixed, the ELBO can be written as:

$$\mathbb{E}\_{\theta \sim \mathcal{N}(\mu, \sigma^2 I)} \log p(D \mid \theta) - \frac{1}{2\sigma\_p^2} \sum\_i \mu\_i^2 + c \tag{2}$$

which is optimized with respect to *μ*. We use *i* ∈ {1, ... , *K*} to denote the parameter index and *c* for constant terms.
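The objective above can be estimated by Monte Carlo sampling of the parameter noise; a minimal numpy sketch (our illustration, with hypothetical names, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def fixed_variance_elbo(mu, log_lik, sigma, sigma_p, n_samples=10):
    """Monte Carlo estimate of Equation (2): expected log-likelihood under
    additive Gaussian parameter noise, minus the L2 penalty from the prior."""
    expected_ll = np.mean([log_lik(mu + sigma * rng.standard_normal(mu.shape))
                           for _ in range(n_samples)])
    return expected_ll - np.sum(mu**2) / (2 * sigma_p**2)
```

In this form the equivalence noted in Section 3.1 is explicit: the objective is a log-likelihood with Gaussian weight noise plus weight decay with coefficient $1/(2\sigma_p^2)$.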

To show how the Gaussian mean field implicitly limits learned information, we extend the model with a noisy version of the parameters, $\tilde{\theta} \sim p(\tilde{\theta} \mid \theta)$, and let the likelihood depend on these noisy parameters. We choose the noise distribution to be the same as the inference distribution for the original model and find a lower bound on the log-joint of the noisy model. This leads to the same objective as mean field variational inference in the original model.

Specifically, we define the noisy model $p'(\theta, \tilde{\theta}, D) = p'(\theta)\,p'(\tilde{\theta} \mid \theta)\,p'(D \mid \tilde{\theta})$, as visualized in Figure 1b. We use $p'$ to emphasize the distinction between distributions of the modified noisy model and the original one. As in the original model, $\theta \sim \mathcal{N}(0, \sigma_p^2 I)$ represents the parameters (with the same prior), i.e., $p(\theta) = p'(\theta)$. We denote the noisy parameters as $\tilde{\theta} \sim \mathcal{N}(\theta, \sigma^2 I)$. The likelihood remains unchanged, i.e., $p'(D \mid \tilde{\theta}) = p(D \mid \theta)$, except that it now depends on the noisy parameters instead of the "clean" ones.

We now show that maximizing a lower bound on the log-joint probability of the noisy model yields an objective identical to that of variational inference in the clean model:

$$\log p'(D, \theta) \tag{3}$$

$$= \log \int p'(D \mid \tilde{\theta})\, p'(\tilde{\theta} \mid \theta) \, \mathrm{d}\tilde{\theta} + \log p'(\theta) \tag{4}$$

$$\geq \mathbb{E}\_{\tilde{\theta} \sim \mathcal{N}(\theta, \sigma^2 I)} \log p'(D \mid \tilde{\theta}) - \frac{1}{2\sigma\_p^2} \sum\_i \theta\_i^2 + c \tag{5}$$

$$= \mathbb{E}\_{\tilde{\theta} \sim \mathcal{N}(\mu, \sigma^2 I)} \log p'(D \mid \tilde{\theta}) - \frac{1}{2\sigma\_p^2} \sum\_i \mu\_i^2 + c \tag{6}$$

where Equation (5) follows from Jensen's inequality as in Equation (1). In the final equation, we have replaced *θ* with *μ* (which is simply a change of names, since we are maximizing the objective over this free variable) to emphasize that the objective functions are identical.

Since $D$ is independent of $\theta$ given $\tilde{\theta}$, the joint $p'(\theta, \tilde{\theta}, D)$ forms a Markov chain $\theta \to \tilde{\theta} \to D$, and the data processing inequality [23] limits the mutual information $I(D, \theta)$ between learned parameters and data through:

$$I(D, \theta) \le I(\tilde{\theta}, \theta) \tag{7}$$

The upper bound is given by:

$$I(\tilde{\theta}, \theta) = H(\tilde{\theta}) - H(\tilde{\theta} \mid \theta) = \frac{K}{2} \log \left( 1 + \frac{\sigma\_p^2}{\sigma^2} \right) \tag{8}$$

where $K$ denotes the number of parameters. Here, we exploit that $\tilde{\theta}$ and $\tilde{\theta} \mid \theta$ are Gaussian with $H(\tilde{\theta}) = \frac{K}{2} \log 2\pi e \left( \sigma^2 + \sigma_p^2 \right)$ and $H(\tilde{\theta} \mid \theta) = \frac{K}{2} \log 2\pi e \sigma^2$. This quantity is known as the capacity of channels with Gaussian noise in signal processing [23]. Intuitively, a high prior variance $\sigma_p^2$ corresponds to a large capacity, while a high noise variance $\sigma^2$ reduces it. Any desired capacity can be achieved by simply adjusting the signal-to-noise ratio $\sigma_p^2/\sigma^2$.
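The capacity formula in Equation (8) is easy to evaluate and to invert; the helpers below (a hypothetical sketch of ours, in bits rather than nats) convert between noise variance and total capacity:

```python
import numpy as np

def capacity_bits(K, sigma_p2, sigma2):
    """Channel-capacity bound of Equation (8), in bits, for K parameters."""
    return 0.5 * K * np.log2(1.0 + sigma_p2 / sigma2)

def noise_for_capacity(K, sigma_p2, bits):
    """Invert Equation (8): noise variance yielding a desired total capacity."""
    return sigma_p2 / (2.0 ** (2.0 * bits / K) - 1.0)
```

A target capacity per parameter thus directly fixes the required signal-to-noise ratio $\sigma_p^2/\sigma^2$, which is how specific capacities are set in the experiments of Section 4.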

#### *2.2. Generalization Error vs. Limited Information*

Intuitively, we characterize overfitting as learning too much information about the training data, suggesting that limiting the amount of information extracted from the training data into the hypothesis should improve generalization. This idea was recently formalized by Xu and Raginsky [19], Bu et al. [20], Bassily et al. [21], Russo and Zou [22], showing that limiting mutual information between data and learned parameters bounds the expected generalization error under certain assumptions.

Specifically, their work characterizes the following process: Assume that our training dataset is sampled from a true distribution $p_t(D)$. Based on this training set, a learning algorithm subsequently returns a distribution over hypotheses given by $p_t(\theta \mid D)$. The process defines mutual information $I_t(D, \theta)$ on the joint distribution $p_t(D, \theta) = p_t(D)\,p_t(\theta \mid D)$. Under certain assumptions on the loss function, Xu and Raginsky [19] derived a bound on the generalization error of the learning algorithm in expectation over this sampling process. Bu et al. [20] relaxed the condition on the loss and proved the applicability to a simple estimation algorithm involving the L2-loss.

Exact Bayesian inference returns the true posterior $p(\theta \mid D)$ on a model $p(\theta, D)$. The theorem then states that a bound on $I(D, \theta)$ limits the expected generalization error as described in Bu et al. [20] if the model captures the nature of the generating process in the marginal $p(D) = \int p(\theta)\,p(D \mid \theta)\,\mathrm{d}\theta$. This is a common assumption necessary to justify any (variational) Bayesian approach.

Exact inference is intractable in deep models, and instead, one typically learns variational or point estimates of the posterior. That is also true for the objective on the noisy model above, where we only used a point estimate as given by Equation (6). Therefore, the assumption of exact inference is not met. Yet, we believe that those bounds motivate the expectation that variational inference aids generalization by limiting the learned information. If we performed exact inference on the noisy model in the last section, the given mutual information would yield a bound on the generalization error as derived by Xu and Raginsky [19] and Bu et al. [20]. Therefore, we are optimistic that the gap between variational inference and those generalization bounds can be closed either by performing more accurate inference in the noisy model or by taking the dynamics of the training algorithm into account when bounding the mutual information (see Section 5.2 for further discussion).

#### *2.3. Learned-Variance Gaussian Mean Field Inference*

The variance in Gaussian mean field inference is typically learned for each parameter [8,24,25]. As in the fixed-variance case, one can obtain a capacity constraint. This holds even for a generalization of the objective from Equation (1) in which the divergence term $D_{\text{KL}}(q(\theta) \| p(\theta))$ is scaled by some factor $\beta > 0$. Higgins et al. [26] proposed using $\beta > 1$ to learn "disentangled" representations in variational autoencoders. Further, $\beta$ is commonly annealed from 0 to 1 for expressive models (e.g., Blundell et al. [25], Bowman et al. [27], Sønderby et al. [28]). In the following, we quantify a general capacity depending on $\beta$, where $\beta = 1$ recovers the standard variational objective. For notational simplicity, we here assume a prior variance of $\sigma_p^2 = 1$; it is straightforward to adapt the derivation to the general case.

In this case, the objective can be written as:

$$\mathbb{E}\_{\theta \sim \mathcal{N}(\mu, \sigma^2)} \log p(D \mid \theta) + \frac{\beta}{2} \sum\_{i} \left( \log \sigma\_i^2 - \sigma\_i^2 - \mu\_i^2 + 1 \right) \tag{9}$$

where now both $\mu$ and $\sigma^2$ represent learned vectors, and $\mathcal{N}(\mu, \sigma^2)$ denotes a variable composed of pairwise independent Gaussian components with means and variances given by the elements of $\mu$ and $\sigma^2$.
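The complexity term in the objective above is, up to the factor $\beta$, the negative KL divergence between $\mathcal{N}(\mu, \sigma^2)$ and the unit Gaussian prior (the constant per dimension does not affect the optimizer). A one-line numpy sketch with a hypothetical helper name:

```python
import numpy as np

def beta_complexity_term(mu, sigma2, beta):
    """Complexity term of Equation (9): equals -beta * KL(N(mu, sigma2) || N(0, I)),
    summed over the independent parameter dimensions."""
    return 0.5 * beta * np.sum(np.log(sigma2) - sigma2 - mu**2 + 1.0)
```

At $\mu = 0$, $\sigma^2 = 1$ the term vanishes, since the approximate posterior then matches the prior; any other setting is penalized in proportion to $\beta$.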

Similar to the previous section, we show that a lower bound on the log-joint of a new noisy model is identical to Equation (9). Specifically, we define the noisy model $p'(\theta, \sigma^2, \tilde{\theta}, D) = p'(\theta)\,p'(\sigma^2)\,p'(\tilde{\theta} \mid \theta, \sigma^2)\,p'(D \mid \tilde{\theta})$ (Figure 1c), with independent priors $\theta_i \sim \mathcal{N}\left(0, \frac{1}{\beta}\right)$ and $\sigma_i^2 \sim \Gamma\left(\frac{\beta}{2} + 1, \frac{\beta}{2}\right)$, where $\Gamma(\cdot, \cdot)$ denotes the Gamma distribution. As in Section 2.1, we define the noise-injected parameters as $\tilde{\theta} \sim \mathcal{N}(\theta, \sigma^2)$ and the likelihood as $p'(D \mid \tilde{\theta}) = p(D \mid \theta)$.

The priors are chosen so that with Jensen's inequality, we find a lower bound on the log-joint probability of this model that recovers the objective from Equation (9):

$$\begin{split} &\log p'(D, \theta, \sigma^2) \\ &= \log \int p'(D \mid \tilde{\theta})\, p'(\tilde{\theta} \mid \theta, \sigma^2) \, \mathrm{d}\tilde{\theta} + \log p'(\theta) + \log p'(\sigma^2) \\ &\geq \mathbb{E}\_{\tilde{\theta} \sim \mathcal{N}(\theta, \sigma^2)} \log p'(D \mid \tilde{\theta}) + \sum\_{i} \left( \log p'(\theta\_i) + \log p'(\sigma\_i^2) \right) \\ &= \mathbb{E}\_{\tilde{\theta} \sim \mathcal{N}(\mu, \sigma^2)} \log p'(D \mid \tilde{\theta}) + \frac{\beta}{2} \sum\_{i} \left( \log \sigma\_i^2 - \sigma\_i^2 - \mu\_i^2 \right) + c \end{split} \tag{10}$$

In the noisy model, the data processing inequality and the independence of dimensions imply a bound:

$$I(D, (\theta, \sigma^2)) \le I(\tilde{\theta}, (\theta, \sigma^2)) = \sum\_i I(\tilde{\theta}\_i, (\theta\_i, \sigma\_i^2)) \tag{11}$$

where the capacity $I(\tilde{\theta}_i, (\theta_i, \sigma_i^2))$ per dimension is derived in Appendix A.

Figure 2 shows numerical results for various values of $\beta$. Standard variational inference ($\beta = 1$) results in a capacity of 0.45 bits per dimension. We observe that a higher $\beta$ corresponds to a smaller capacity, which is given by the mutual information $I(\tilde{\theta}_i, (\theta_i, \sigma_i^2))$ between our new latent $(\theta_i, \sigma_i^2)$ and $\tilde{\theta}_i$. This formalizes the intuition that a higher weight of the complexity term in our objective increases regularization by decreasing a limit on the capacity.

**Figure 2.** Relationship between $\beta$ and the capacity $I(\tilde{\theta}_i, (\theta_i, \sigma_i^2))$ per parameter dimension in Gaussian mean field inference with learned variance and the complexity term scaled by $\beta > 0$.

#### *2.4. Supervised and Unsupervised Learning*

The above derivations apply to any learning algorithm that is purely trained with Gaussian mean field inference. This covers supervised and unsupervised tasks.

In supervised learning, the training data typically consist of pairs of inputs and labels, and a loss is assigned to each pair that depends on the trained model, e.g., neural network parameters. When all parameters are learned with one of the discussed mean field methods, the given bounds apply.

The derivation also comprises unsupervised methods with per-data latent variables and even amortized inference such as variational autoencoders [8,24], again as long as all learned variables are learned via Gaussian mean field inference. While this might be helpful to find generalizing representations, the focus of the experiments is on validating the generalizing behaviour of the Bayesian mean field variational approach on neural network parameters for overfitting regimes, namely a small dataset and complex models.

#### *2.5. Flexible Variational Distributions*

The objective function for variational inference is maximized when the approximate posterior is equal to the true one. This motivates the development of flexible families of posterior distributions [8,29–36]. In the case of exact inference, a bound on generalization as discussed in Section 2.2 only applies if the model itself has finite mutual information between data and parameters. However, estimating mutual information is generally a hard problem, particularly in high-dimensional, non-linear models. This makes it hard to state a generic bound, which is why we focus on the case of Gaussian mean field inference.

#### **3. Related Work**

#### *3.1. Regularization in Neural Networks*

The Gaussian mean field is intimately related to other popular regularization approaches in deep learning: As is apparent from Equation (6), the fixed-variance Gaussian mean field applied to training neural network weights is equivalent to L2-regularization (weight decay) combined with Gaussian parameter noise [9–11] on all network weights. Molchanov et al. [37] showed that additive parameter noise results in multiplicative noise on the unit activations. The resulting dependencies between noise components on the layer output can be ignored without significantly changing empirical results [38]. This is in turn equivalent to scaled Gaussian dropout [24].
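The observation of Molchanov et al. [37] can be illustrated numerically: additive Gaussian weight noise in a linear layer produces pre-activation noise whose standard deviation scales with the input norm, i.e., it acts multiplicatively on the unit activations. A small numpy sketch of ours (layer sizes and noise scale are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_linear(x, mu_w, sigma):
    """One forward pass of a linear layer whose weights carry additive
    Gaussian noise, w = mu_w + sigma * eps."""
    w = mu_w + sigma * rng.standard_normal(mu_w.shape)
    return x @ w

# Empirically, the output noise std is sigma * ||x||: it scales with the
# input, which is the multiplicative-noise view on the activations.
x = np.ones((1, 10))
mu_w = np.zeros((10, 1))
sigma = 0.5
samples = np.array([noisy_linear(x, mu_w, sigma)[0, 0] for _ in range(20000)])
empirical_std = samples.std()
theoretical_std = sigma * np.linalg.norm(x)
```

The match between `empirical_std` and `theoretical_std` is what links additive parameter noise to (scaled) Gaussian dropout in the discussion above.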

#### *3.2. Information Bottlenecks*

The information bottleneck principle by Tishby et al. [39] and Shamir et al. [40] aims to find a representation *Z* of some input *X* that is most useful for predicting an output *Y*. For this purpose, the objective is to maximize the amount of information *I*(*Y*, *Z*) the representation contains about the output under a bounded amount of information *I*(*X*, *Z*) about the input:

$$\max\_{I(X,Z) \le C} I(Y,Z) \tag{12}$$

They described a training procedure using the softly-constrained objective:

$$\min \mathcal{L}\_{IB} = \min I(X, Z) - \beta I(Y, Z) \tag{13}$$

where *β* > 0 controls the trade-off.

Alemi et al. [41] suggested a variational approximation for this objective. For the task of reconstruction, where labels *Y* are identical to inputs *X*, this results exactly in the *β*-VAE objective [42,43]. This is in accordance with our result from Section 2.3 that there is a maximum capacity per latent dimension that decreases for higher *β*. Setting *β* > 1, as suggested by Higgins et al. [26], for obtaining disentangled representations, corresponds to lower capacity per latent component than achieved by standard variational inference.

Both Tishby et al. [39] and Higgins et al. [26] introduced $\beta$ as a trade-off parameter without a quantitative interpretation. With our information-theoretic perspective, we quantify the implied capacity and provide a link to the generalization error. Further, both methods are concerned with the information in the latent representation. They do not limit the mutual information with the model parameters, leaving them vulnerable to model overfitting under our theoretical assumptions. We experimentally validate this vulnerability and explore the effect of filling this gap by applying Gaussian mean field inference to the model parameters.

#### *3.3. Information Estimation with Neural Networks*

Multiple recent techniques [44–46] proposed the use of neural networks for obtaining a lower bound on the mutual information. This is useful in settings when we want to maximize mutual information, e.g., between the data and a lower-dimensional representation. In contrast, we show that Gaussian variational inference on variables with a Gaussian prior implicitly places an upper bound on the mutual information between these variables and the data and explore its regularizing effect.

#### **4. Experiments**

In this section, we analyse the implications of applying Gaussian mean field inference with a fixed noise scale to the model parameters in supervised and unsupervised settings. Our theoretical results suggest that varying the capacity will affect the generalization capability; we show this effect in small-data regimes and how it changes with the training set size. Furthermore, we investigate whether capacity is the only predictor for generalization or whether varying priors and architectures also have an effect. Finally, we demonstrate qualitatively how the capacity bounds are reflected in Fashion-MNIST reconstruction.

#### *4.1. Supervised Learning*

We begin with a supervised classification task on the CIFAR10 dataset, training only on a subset of the first 5000 samples. We used six 3 × 3 convolutional layers with 128 channels each, followed by a ReLU activation function, every second of which implemented striding of 2 to reduce the input dimensionality. Finally, the last layer was a linear projection, which parameterized a categorical distribution. The capacity of each parameter in this network was set to specific values given by Equation (8).

Figure 3 shows that decreasing the model capacity per dimension (by increasing the noise) reduced the training log-likelihood and increased the test log-likelihood until both met at an optimal capacity. One can observe that very small capacities led to a signal that was too noisy, and good predictions were no longer possible. In short, regimes of underfitting and overfitting were generated depending on the capacity.

**Figure 3.** Classifying CIFAR10 with varying model capacities. Large capacities lead to overfitting, while small capacities drown the signal in noise. Each configuration was evaluated 5 times; the mean and standard deviation are displayed.

#### *4.2. Unsupervised Learning*

We now evaluate the regularizing effect of fixed-scale Gaussian mean field inference in an unsupervised setting for image reconstruction on the MNIST dataset. To this end, we used a VAE [6] with 2 latent dimensions and a 3-layer neural network parameterizing the conditional factorized Gaussian distribution. As usual, it was trained using the free energy objective but, unlike in the original work, we also applied Gaussian mean field inference to the model parameters. Again, we used a small training set of 200 examples for the following experiments unless noted otherwise.
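As a reference point for this setup, the free energy objective combines a reconstruction term with a closed-form Gaussian KL. A minimal NumPy sketch (our illustration, not the authors' code; the single-sample reconstruction estimate and Bernoulli likelihood over binarized pixels are assumptions):

```python
import numpy as np

def elbo(x, recon_logits, mu, log_var):
    """ELBO for a VAE with factorized Gaussian q(z|x), standard normal
    prior p(z), and a Bernoulli likelihood over binarized pixels."""
    # Reconstruction term: one-sample estimate of E_q[log p(x|z)]
    p = 1.0 / (1.0 + np.exp(-recon_logits))
    log_lik = np.sum(x * np.log(p + 1e-12) + (1 - x) * np.log(1 - p + 1e-12))
    # KL( N(mu, sigma^2) || N(0, 1) ) in closed form, summed over latents
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return log_lik - kl
```

Maximizing this objective (equivalently, minimizing the free energy) is the standard VAE training criterion; the approach described above additionally applies Gaussian mean field inference to the model parameters, which contributes a corresponding KL term for the weights.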

#### 4.2.1. Varying Model Capacity and Priors

In our first experiment, we analysed generalization by inspecting the test evidence lower bound (ELBO) when varying the model capacity, which can be seen in Figure 4a. Similar to the supervised case, we can observe that there was a certain model capacity range that explained the data very well, while less or more capacity resulted in noise drowning and overfitting, respectively. In the same figure, we also investigated whether the information-theoretic model capacity can predict generalization independently of the specific prior distribution. Since we merely state an upper bound on the mutual information in Section 2.1, the prior may have an effect in practice that cannot be explained by the capacity alone. Figure 4a shows that, while the general behaviour remained the same for different model priors, the generalization error was indeed not entirely independent of the prior. Furthermore, the observation that all curves descended at larger capacities, for all priors, suggests that weight decay [47] of fixed scale without parameter noise is not sufficient to regularize arbitrarily large networks. In Figure 4b, we investigated the extreme case of dropping the prior entirely and switching to maximum-likelihood learning by using an improper uniform prior. This approach recovers Gaussian dropout [24,48]. Dropping the prior sets the bottleneck capacity to infinity and should therefore lead to worse generalization. Comparing the test ELBO of this Gaussian dropout variant to the original Gaussian mean field inference in Figure 4b confirmed this for larger capacities. For larger noise scales, generalization still worked well, a result that is not explained by our information-theoretic framework but is plausible given the limited architecture deployed.

**Figure 4.** MNIST test reconstruction with a variational autoencoder training on 200 samples for various capacities and Gaussian priors. (**a**) The test Evidence Lower Bound (ELBO) is not invariant when varying the prior on the model parameters. Nevertheless, the first increasing and then decreasing trend when changing the capacity remains; (**b**) Using an improper prior, similar to just using Gaussian dropout on the weights, leads to an accelerated decrease of generalization for smaller noise scales.

#### 4.2.2. Varying Training Set Size

Figure 5a shows how limiting the capacity affects the test ELBO for varying amounts of training data. Models with very small capacity extracted less information from the data into the model, thus yielding a good test ELBO that was somewhat independent of the dataset size. This is visible as a graph that ascends very little with more training data (e.g., a total model capacity of 330 kbits). Note that we here report the capacity of the entire model, which is the sum of the capacities of each parameter. In order to improve the test ELBO, more information from the data had to be extracted into the model. Clearly, however, this led to non-generalizing information being extracted when the dataset was small, resulting in overfitting. Only for larger datasets did the extracted information generalize. This is visible as a strongly ascending test ELBO for larger dataset sizes and bad generalization for small datasets. We can therefore conclude that the information bottleneck needs to be chosen based on the amount of data available: intuitively, the more data are available, the more information should be extracted into the model.

**Figure 5.** MNIST test reconstruction with a VAE; training on varying dataset sizes, architectures, and model capacities. (**a**) Varying the number of samples. Depending on the size of the dataset, higher model capacities are required to fit all the data points; (**b**) Varying the architecture. Overfitting does not get worse with more layers if the capacity is low enough; more layers overfit only at higher capacities.

#### 4.2.3. Varying Model Size

Furthermore, we inspected how the size of the model (here, in terms of the number of layers) affected generalization in Figure 5b. Similar to varying the prior distribution, we were interested in how well the total capacity predicted generalization and the role the architecture plays. It can be observed that larger networks were more resilient to larger total capacities before they started overfitting. This indicates that the total capacity was less important than the individual capacity (i.e., noise) per parameter. Nevertheless, larger networks were more prone to overfitting for very large model capacities. This makes sense as their functional form was less constrained, an aspect that was not captured by our theory.

#### 4.2.4. Qualitative Reconstruction

Finally, we plot test reconstruction means for the binarized Fashion MNIST dataset under the same setup for various capacities in Figure 6. In accordance with the previous experiments, we observed that if the capacity was chosen too small, the model did not learn anything useful, while too large capacities resulted in overconfidence, visible in most means being close to either 0 or 1. An intermediate capacity, on the other hand, made sensible predictions (given that it was trained on only 200 samples) with sensible uncertainty, visible through grey pixels that correspond to high entropy.

**Figure 6.** Test reconstruction means for binarized Fashion MNIST trained on 200 samples with per-parameter capacities of 5, 2, and 1 bits (**top**) compared to the true data (**bottom**).

#### **5. Discussion**

In this section, we discuss how the capacity can be set, as well as the effect of the model architecture and learning dynamics.

#### *5.1. Choosing the Capacity*

We have obtained a new trade-off parameter, the capacity, that has a simple quantitative interpretation: it determines how many bits to extract maximally from the training set. In contrast, no clear interpretation is known for the *β* parameter introduced in Tishby et al. [39] and Higgins et al. [26]. Yet, it may still be hard to set the capacity optimally. Simple mechanisms, such as evaluation on a validation set, may be used to determine its value. We expect that more theoretically rigorous methods could be developed.

Furthermore, in this paper, we focused on the regularization that Gaussian mean field inference implies on the model parameters. The same concept is valid for data-dependent latent variables, for instance in VAEs, as discussed in Section 2.4. In VAEs, Gaussian mean field inference on the latent variables leads to a restricted latent capacity but leaves the capacity of the model unbounded. This leaves VAEs vulnerable to model overfitting, as demonstrated in the experiments, and setting *β* as done in Higgins et al. [26] is not sufficient to control complexity. This motivates limiting the capacity between the data and both the per-data-point latent variables and the model parameters. The interaction between the two is an interesting future research direction.

#### *5.2. Role of Learning Dynamics and Architecture*

As discussed in Section 2.2, it is necessary to perform exact inference in the noisy model for the bounds on the generalization error to hold. In practice, however, this assumption is not met: $p_t(\theta \mid D)$ encodes the complete learning algorithm, which in deep learning typically includes parameter initialization and the dynamics of the stochastic gradient descent optimization.

Our experiments confirmed the relevance of the aforementioned factors: L2-regularization works in practice, even though no noise was added to the parameters. This could be explained by the fact that noise was already implicitly added through stochastic gradient descent [49] or through the output distribution of the network. Similarly, Gaussian dropout [9–11] without a prior on the parameters helped generalization. Again, early stopping combined with a finite reach of gradient descent steps effectively shaped a prior of finite variance in the parameter space. This could also formalize why the annealing schedule employed by Blundell et al. [25], Bowman et al. [27] and Sønderby et al. [28] was effective.

Since these other factors affect generalization, quantifying the mutual information $I_t(\theta; D)$ of the actual distribution induced by the learning dynamics might be a promising approach to explain why neural networks often generalize well on their own. This idea is in accordance with recent work that links the learning dynamics of small neural networks to generalization behaviour [50].

On the other hand, the architecture choice also had an influence on generalization. This does not contradict our theory, since we only formulated an upper bound on mutual information. Tightening this bound based on the model architecture and output distribution is usually hard, as discussed in Section 2.5, but might be possible.

Another promising direction would be to sample approximately from the exact posterior on network parameters (e.g., as done by Marceau-Caron and Ollivier [51]) on a capacity-limited architecture, instead of the usual approach of point estimation. In the limit of infinite training time, this would fully realize the discussed bound on the expected generalization error.

#### **6. Conclusions**

We explored an information-theoretic perspective on the regularizing effects observed in Gaussian mean field approaches. The derivation featured a capacity that can be naturally interpreted as a limit on the amount of information extracted about the given data by the inferred model. We validated its practicality for both supervised and unsupervised learning.

How this capacity should be set for parameters and latent variables, depending on the task and data, is an interesting direction of research. We exploited a theoretical link between mutual information and generalization error. While this work is restricted to the Gaussian mean field, incorporating the effect of learning dynamics on mutual information in future work might help us understand why overparameterized neural networks still generalize well to unseen data.

**Author Contributions:** Conceptualization, methodology, software, validation, formal analysis, investigation, visualization, and writing of the original draft were performed by J.K. and L.K. with supervision by D.B., and review and editing was done by H.R.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Capacity in Learned-Variance Gaussian Mean Field Inference**

The capacity per dimension for the model discussed in Section 2.3 is given by:

$$\begin{split} I(\tilde{\theta}\_i; (\theta\_i, \sigma\_i^2)) &= H(\tilde{\theta}\_i) - H(\tilde{\theta}\_i \mid \theta\_i, \sigma\_i^2) \\ &= -\int\_{-\infty}^{\infty} p'(\tilde{\theta}\_i) \log p'(\tilde{\theta}\_i) \, \mathrm{d}\tilde{\theta}\_i - \int\_0^{\infty} p'(\sigma\_i^2) \frac{1}{2} \log 2\pi e \sigma\_i^2 \, \mathrm{d}\sigma\_i^2 \end{split} \tag{A1}$$

*Entropy* **2019**, *21*, 758

$\tilde{\theta}_i \mid \theta_i, \sigma_i^2 \sim \mathcal{N}(\theta_i, \sigma_i^2)$ with $\theta_i \sim \mathcal{N}\left(0, \frac{1}{\beta}\right)$ implies $\tilde{\theta}_i \mid \sigma_i^2 \sim \mathcal{N}\left(0, \sigma_i^2 + \frac{1}{\beta}\right)$. Together with $\sigma_i^2 \sim \Gamma\left(\frac{\beta}{2} + 1, \frac{\beta}{2}\right)$, this implies:

$$\begin{split} p'(\tilde{\theta}\_i) &= \int\_0^\infty \mathrm{d}\sigma\_i^2 \, p'(\sigma\_i^2) p'(\tilde{\theta}\_i \mid \sigma\_i^2) \\ &= \int\_0^\infty \mathrm{d}\sigma\_i^2 \, \frac{\left(\frac{\beta}{2}\right)^{\frac{\beta}{2}+1}}{\Gamma\left(\frac{\beta}{2}+1\right)} \left(\sigma\_i^2\right)^{\frac{\beta}{2}} e^{-\frac{\beta}{2}\sigma\_i^2} \cdot \left(2\pi \left(\sigma\_i^2 + \frac{1}{\beta}\right)\right)^{-\frac{1}{2}} e^{-\frac{\tilde{\theta}\_i^2}{2\left(\sigma\_i^2 + \frac{1}{\beta}\right)}} \end{split} \tag{A2}$$

Numerical results for the capacity $I(\tilde{\theta}_i; (\theta_i, \sigma_i^2))$ with varying $\beta$ are given below (see Table A1) and plotted in Figure 2.

**Table A1.** Capacity $I(\tilde{\theta}_i; (\theta_i, \sigma_i^2))$ in learned-variance Gaussian mean field inference with varying factor $\beta$ on the divergence term.


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Learnability for the Information Bottleneck**

#### **Tailin Wu 1,\*, Ian Fischer 2, Isaac L. Chuang <sup>1</sup> and Max Tegmark <sup>1</sup>**


Received: 1 August 2019; Accepted: 12 September 2019; Published: 23 September 2019

**Abstract:** The Information Bottleneck (IB) method provides an insightful and principled approach for balancing compression and prediction in representation learning. The IB objective $I(X; Z) - \beta I(Y; Z)$ employs a Lagrange multiplier $\beta$ to tune this trade-off. However, in practice, not only is $\beta$ chosen empirically without theoretical guidance, but there is also a lack of theoretical understanding of the relationship between $\beta$, learnability, the intrinsic nature of the dataset, and model capacity. In this paper, we show that if $\beta$ is improperly chosen, learning cannot happen: the trivial representation $P(Z|X) = P(Z)$ becomes the global minimum of the IB objective. We show how this can be avoided by identifying a sharp phase transition between the unlearnable and the learnable which arises as $\beta$ is varied. This phase transition defines the concept of IB-Learnability. We prove several sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good $\beta$. We further show that IB-Learnability is determined by the largest *confident*, *typical* and *imbalanced subset* of the examples (the *conspicuous subset*), and discuss its relation with model capacity. We give practical algorithms to estimate the minimum $\beta$ for a given dataset. We also empirically demonstrate our theoretical conditions with analyses of synthetic datasets, MNIST and CIFAR10.

**Keywords:** learnability; information bottleneck; representation learning; conspicuous subset

#### **1. Introduction**

Tishby et al. [1] introduced the *Information Bottleneck* (IB) objective function which learns a representation *Z* of observed variables (*X*,*Y*) that retains as little information about *X* as possible but simultaneously captures as much information about *Y* as possible:

$$\min \text{IB}\_{\beta}(X, Y; Z) = \min \left[ I(X; Z) - \beta I(Y; Z) \right] \tag{1}$$

$I(\cdot\,; \cdot)$ denotes mutual information. The hyperparameter *β* controls the trade-off between compression and prediction, in the same spirit as Rate-Distortion Theory [2], but with a learned representation function *P*(*Z*|*X*) that automatically captures some part of the "semantically meaningful" information, where the semantics are determined by the observed relationship between *X* and *Y*. The IB framework has been extended to and extensively studied in a variety of scenarios, including Gaussian variables [3], meta-Gaussians [4], continuous variables via variational methods [5–7], deterministic scenarios [8,9] and geometric clustering [10], and is used for learning invariant and disentangled representations in deep neural nets [11,12].

From the IB objective (Equation (1)) we see that when *<sup>β</sup>* <sup>→</sup> 0 it will encourage *<sup>I</sup>*(*X*; *<sup>Z</sup>*) = 0 which leads to a trivial representation *<sup>Z</sup>* that is independent of *<sup>X</sup>*, while when *<sup>β</sup>* <sup>→</sup> <sup>+</sup>∞, it reduces to a maximum likelihood objective (e.g., in classification, it reduces to cross-entropy loss). Therefore, as we vary *<sup>β</sup>* from 0 to +∞, there must exist a point *<sup>β</sup>*<sup>0</sup> at which IB starts to learn a nontrivial representation where *Z* contains information about *X*.

As an example, we train multiple variational information bottleneck (VIB) models on binary classification of MNIST [13] digits 0 and 1 with 20% label noise at different *β*. The accuracy vs. *β* is shown in Figure 1. We see that when *β* < 3.25, no learning happens and the accuracy is the same as random guessing. Beginning with *β* > 3.25, there is a clear phase transition where the accuracy sharply increases, indicating the objective is able to learn a nontrivial representation. In general, we observe that different datasets and model capacity will result in different *β*<sup>0</sup> at which IB starts to learn a nontrivial representation. How does *β*<sup>0</sup> depend on the aspects of the dataset and model capacity and how can we estimate it? What does an IB model learn at the onset of learning? Answering these questions may provide a deeper understanding of IB in particular and learning on two observed variables in general.

In this work, we begin to answer the above questions. Specifically:


**Figure 1.** Accuracy for binary classification of MNIST digits 0 and 1 with 20% label noise and varying *β*. No learning happens for models trained at *β* < 3.25.

We also present an algorithm for estimating the onset of IB-Learnability and the conspicuous subset, which provide us with a tool for understanding a key aspect of the learning problem (*X*,*Y*) (Section 6). Finally, we use our main results to demonstrate on synthetic datasets, MNIST [13] and CIFAR10 [14] that the theoretical prediction for IB-Learnability closely matches experiment, and show the conspicuous subset our algorithm discovers (Section 7).

#### **2. Related Work**

The seminal IB work [1] provides a tabular method for exactly computing the optimal encoder distribution $P(Z|X)$ for a given $\beta$ and cardinality $|Z|$ of the discrete representation. It did not consider the IB learnability problem as addressed in this work. Chechik et al. [3] present the Gaussian Information Bottleneck (GIB) for learning a multivariate Gaussian representation $Z$ of $(X, Y)$, assuming that both $X$ and $Y$ are multivariate Gaussians. Under GIB, they derive an analytic formula for the optimal representation as a noisy linear projection onto the eigenvectors of the normalized regression matrix $\Sigma_{x|y}\Sigma_x^{-1}$; the learnability threshold is then given by $\beta_0 = \frac{1}{1-\lambda_1}$, where $\lambda_1$ is the largest eigenvalue of $\Sigma_{x|y}\Sigma_x^{-1}$. This work provides deep insights into the relations between the dataset, $\beta_0$ and optimal representations in the Gaussian scenario, but the restriction to multivariate Gaussian datasets limits the generality of the analysis. Another analytic treatment of IB is given in [4], which reformulates the objective in terms of copula functions. As with the GIB approach, this formulation restricts the form of the data distributions: the copula functions of the joint distribution $(X, Y)$ are assumed to be known, which is unlikely in practice.

Strouse and Schwab [8] present the Deterministic Information Bottleneck (DIB), which minimizes the coding cost of the representation, *H*(*Z*), rather than the transmission cost, *I*(*X*; *Z*) as in IB. This approach learns hard clusterings with different code entropies that vary with *β*. In this case, it is clear that a hard clustering with minimal *H*(*Z*) will result in a single cluster for all of the data, which is the DIB trivial solution. No analysis is given beyond this fact to predict the actual onset of learnability, however.

The first amortized IB objective is in the Variational Information Bottleneck (VIB) of Alemi et al. [5]. VIB replaces the exact, tabular approach of IB with variational approximations of the classifier distribution (*P*(*Y*|*Z*)) and marginal distribution (*P*(*Z*)). This approach cleanly permits learning a stochastic encoder, *<sup>P</sup>*(*Z*|*X*), that is applicable to any *<sup>x</sup>* ∈ X , rather than just the particular *<sup>X</sup>* seen at training time. The cost of this flexibility is the use of variational approximations that may be less expressive than the tabular method. Nevertheless, in practice, VIB learns easily and is simple to implement, so we rely on VIB models for our experimental confirmation.

Closely related to IB is the recently proposed Conditional Entropy Bottleneck (CEB) [7]. CEB attempts to explicitly learn the Minimum Necessary Information (MNI), defined as the point in the information plane where *I*(*X*;*Y*) = *I*(*X*; *Z*) = *I*(*Y*; *Z*). The MNI point may not be achievable even in principle for a particular dataset. However, the CEB objective provides an explicit estimate of how closely the model is approaching the MNI point by observing that a necessary condition for reaching the MNI point occurs when *<sup>I</sup>*(*X*; *<sup>Z</sup>*|*Y*) = 0. The CEB objective *<sup>I</sup>*(*X*; *<sup>Z</sup>*|*Y*) <sup>−</sup> *<sup>γ</sup>I*(*Y*; *<sup>Z</sup>*) is equivalent to IB at *γ* = *β* + 1, so our analysis of IB-Learnability applies equally to CEB.

Kolchinsky et al. [9] show that when *Y* is a deterministic function of X, the "corner point" of the IB curve (where *I*(*X*;*Y*) = *I*(*X*; *Z*) = *I*(*Y*; *Z*)) is the unique optimizer of the IB objective for all 0 < *β* < 1 (with the parameterization of Kolchinsky et al. [9], *β* = 1/*β*), which they consider to be a "trivial solution". However, their use of the term "trivial solution" is distinct from ours. They are referring to the observation that all points on the IB curve contain uninteresting interpolations between two different but valid solutions on the optimal frontier, rather than demonstrating a non-trivial trade-off between compression and prediction as expected when varying the IB Lagrangian. Our use of "trivial" refers to whether IB is capable of learning at all given a certain dataset and value of *β*.

Achille and Soatto [12] apply the IB Lagrangian to the weights of a neural network, yielding InfoDropout. In Achille and Soatto [11], the authors give a deep and compelling analysis of how the IB Lagrangian can yield invariant and disentangled representations. They do not, however, consider the question of the onset of learning, although they are aware that not all models will learn a non-trivial representation. More recently, Achille et al. [15] repurpose the InfoDropout IB Lagrangian as a Kolmogorov Structure Function to analyze the ease with which a previously-trained network can be fine-tuned for a new task. While that work is tangentially related to learnability, the question it addresses is substantially different from our investigation of the onset of learning.

Our work is also closely related to the hypercontractivity coefficient [16,17], defined as $\sup_{Z-X-Y} \frac{I(Y;Z)}{I(X;Z)}$, which by definition equals the inverse of $\beta_0$, our IB-Learnability threshold. In [16], the authors prove that the hypercontractivity coefficient equals the contraction coefficient $\eta_{\mathrm{KL}}(P_{Y|X}, P_X)$, and Kim et al. [18] propose a practical algorithm to estimate $\eta_{\mathrm{KL}}(P_{Y|X}, P_X)$, which provides a measure of potential influence in the data. Although our goal is different, the sufficient conditions we provide for IB-Learnability are also lower bounds for the hypercontractivity coefficient.

#### **3. IB-Learnability**

We are given instances of (*x*, *y*) drawn from a distribution with probability (density) *P*(*X*,*Y*) with support $\mathcal{X} \times \mathcal{Y}$, where, unless otherwise stated, both *X* and *Y* can be discrete or continuous variables. We use capital letters *X*, *Y*, *Z* for random variables and lowercase *x*, *y*, *z* to denote instances of these variables, with *P*(·) and *p*(·) denoting their probability or probability density, respectively. (*X*,*Y*) is our *training data* and may be characterized by different types of noise. The nature of this training data and the choice of *β* will be sufficient to predict the transition from unlearnable to learnable.

We can learn a representation *<sup>Z</sup>* of *<sup>X</sup>* with conditional probability *<sup>p</sup>*(*z*|*x*), such that *<sup>X</sup>*,*Y*, *<sup>Z</sup>* obey the Markov chain *Z* ← *X* ↔ *Y*. Equation (1) above gives the IB objective with Lagrange multiplier *β*, IB*β*(*X*,*Y*; *<sup>Z</sup>*), which is a functional of *<sup>p</sup>*(*z*|*x*): IB*β*(*X*,*Y*; *<sup>Z</sup>*) = IB*β*[*p*(*z*|*x*)]. The IB learning task is to find a conditional probability *<sup>p</sup>*(*z*|*x*) that minimizes IB*β*(*X*,*Y*; *<sup>Z</sup>*). The larger *<sup>β</sup>*, the more the objective favors making a good prediction for *Y*. Conversely, the smaller *β*, the more the objective favors learning a concise representation.
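For discrete variables, every quantity in the objective can be computed exactly from probability tables. The following sketch (our own illustration; function and variable names are ours) evaluates $\text{IB}_\beta(X, Y; Z)$ for a tabular encoder $p(z|x)$:

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats for a discrete joint distribution (2-D array)."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    ratio = p_joint[mask] / (p_a * p_b)[mask]
    return float((p_joint[mask] * np.log(ratio)).sum())

def ib_objective(p_xy, p_z_given_x, beta):
    """IB_beta(X,Y;Z) = I(X;Z) - beta * I(Y;Z) for tabular distributions,
    with Z depending on Y only through X (Markov chain Z <- X <-> Y)."""
    p_x = p_xy.sum(axis=1)
    p_xz = p_x[:, None] * p_z_given_x    # joint of X and Z
    p_yz = p_xy.T @ p_z_given_x          # joint of Y and Z
    return mutual_information(p_xz) - beta * mutual_information(p_yz)

p_xy = np.array([[0.4, 0.1], [0.1, 0.4]])  # binary toy problem, 20% label noise
print(ib_objective(p_xy, np.eye(2), beta=10.0))  # identity encoder, large beta
```

The trivial encoder $p(z|x) = p(z)$ gives $I(X;Z) = I(Y;Z) = 0$ and hence an objective value of 0; a dataset is IB$_\beta$-learnable precisely when some encoder achieves a smaller value.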

How can we select *β* such that the IB objective learns a useful representation? In practice, the selection of *β* is done empirically. Indeed, Tishby et al. [1] recommend "sweeping *β*". In this paper, we provide theoretical guidance for choosing *β* by introducing the concept of IB-Learnability and deriving a series of conditions under which a dataset is IB-learnable.

**Definition 1.** $(X, Y)$ *is IB$_\beta$-learnable if there exists a $Z$ given by some $p_1(z|x)$, such that* $\text{IB}_\beta(X, Y; Z)\big|_{p_1(z|x)} < \text{IB}_\beta(X, Y; Z)\big|_{p(z|x)=p(z)}$*, where $p(z|x) = p(z)$ characterizes the trivial representation in which $Z = Z_{\text{trivial}}$ is independent of $X$.*

If (*X*;*Y*) is IB*β*-learnable, then when IB*β*(*X*,*Y*; *<sup>Z</sup>*) is globally minimized, it will *not* learn a trivial representation. On the other hand, if (*X*;*Y*) is not IB*β*-learnable, then when IB*β*(*X*,*Y*; *<sup>Z</sup>*) is globally minimized, it may learn a trivial representation.

#### *3.1. Trivial Solutions*

Definition 1 defines trivial solutions in terms of representations where *I*(*X*; *Z*) = *I*(*Y*; *Z*) = 0. Another type of trivial solution occurs when *I*(*X*; *Z*) > 0 but *I*(*Y*; *Z*) = 0. This type of trivial solution is not directly achievable by the IB objective, as *I*(*X*; *Z*) is minimized, but it can be achieved by construction or by chance. It is possible that starting learning from *I*(*X*; *Z*) > 0, *I*(*Y*; *Z*) = 0 could give access to non-trivial solutions not available from *I*(*X*; *Z*) = 0. We do not attempt to investigate this type of trivial solution in this work.

#### *3.2. Necessary Condition for IB-Learnability*

From Definition 1, we can see that IB*β*-Learnability for any dataset (*X*;*Y*) requires *<sup>β</sup>* > 1. In fact, from the Markov chain *<sup>Z</sup>* <sup>←</sup> *<sup>X</sup>* <sup>↔</sup> *<sup>Y</sup>*, we have *<sup>I</sup>*(*Y*; *<sup>Z</sup>*) <sup>≤</sup> *<sup>I</sup>*(*X*; *<sup>Z</sup>*) via the data-processing inequality. If *<sup>β</sup>* <sup>≤</sup> 1, then since *<sup>I</sup>*(*X*; *<sup>Z</sup>*) <sup>≥</sup> 0 and *<sup>I</sup>*(*Y*; *<sup>Z</sup>*) <sup>≥</sup> 0, we have that min(*I*(*X*; *<sup>Z</sup>*) <sup>−</sup> *<sup>β</sup>I*(*Y*; *<sup>Z</sup>*)) = <sup>0</sup> <sup>=</sup> IB*β*(*X*,*Y*; *Ztrivial*). Hence (*X*,*Y*) is not IB*β*-learnable for *<sup>β</sup>* <sup>≤</sup> 1.

Due to the reparameterization invariance of mutual information, we have the following theorem for IB*β*-Learnability:

**Lemma 1.** *Let $X' = g(X)$ be an invertible map (if $X$ is a continuous variable, $g$ is additionally required to be continuous). Then $(X, Y)$ and $(X', Y)$ have the same IB$_\beta$-Learnability.*

The proof for Lemma 1 is in Appendix A.2. Lemma 1 implies a favorable property for any condition for IB*β*-Learnability: the condition should be invariant to invertible mappings of *X*. We will inspect this invariance in the conditions we derive in the following sections.

#### **4. Sufficient Conditions for IB-Learnability**

Given (*X*,*Y*), how can we determine whether it is IB*β*-learnable? To answer this question, we derive a series of sufficient conditions for IB*β*-Learnability, starting from its definition. The conditions are in increasing order of practicality, while sacrificing as little generality as possible.

Firstly, Theorem 1 characterizes the IB*β*-Learnability range for *β*, with proof in Appendix A.3:

**Theorem 1.** *If $(X, Y)$ is IB$_{\beta_1}$-learnable, then for any $\beta_2 > \beta_1$, it is IB$_{\beta_2}$-learnable.*

Based on Theorem 1, the range of *<sup>β</sup>* such that (*X*,*Y*) is IB*β*-learnable has the form *<sup>β</sup>* <sup>∈</sup> (*β*0, <sup>+</sup>∞). Thus, *β*<sup>0</sup> is the *threshold* of IB-Learnability.

**Lemma 2.** *<sup>p</sup>*(*z*|*x*) = *<sup>p</sup>*(*z*) *is a stationary solution for IBβ*(*X*,*Y*; *<sup>Z</sup>*)*.*

The proof in Appendix A.6 shows that both first-order variations $\delta I(X; Z)$ and $\delta I(Y; Z)$ vanish at the trivial representation $p(z|x) = p(z)$, so $\delta \text{IB}_\beta[p(z|x)] = 0$ at the trivial representation.

Lemma 2 yields our strategy for finding sufficient conditions for learnability: find conditions such that *<sup>p</sup>*(*z*|*x*) = *<sup>p</sup>*(*z*) is not a local minimum for the functional IB*β*[*p*(*z*|*x*)]. Based on the necessary condition for the minimum (Appendix A.4), we have the following theorem (The theorems in this paper deal with learnability w.r.t. true mutual information. If parameterized models are used to approximate the mutual information, the limitation of the model capacity will translate into more uncertainty of *Y* given *X*, viewed through the lens of the model.):

**Theorem 2** (**Suff. Cond. 1**)**.** *A sufficient condition for $(X, Y)$ to be IB$_\beta$-learnable is that there exists a perturbation function $h(z|x)$ with $\int h(z|x)\,\mathrm{d}z = 0$ (so that the perturbed probability (density) is $p'(z|x) = p(z|x) + \epsilon \cdot h(z|x)$), such that the second-order variation $\delta^2 \text{IB}_\beta[p(z|x)] < 0$ at the trivial representation $p(z|x) = p(z)$.*

The proof for Theorem 2 is given in Appendix A.4. Intuitively, if $\delta^2 \text{IB}_\beta[p(z|x)]\big|_{p(z|x)=p(z)} < 0$, we can always find a $p'(z|x) = p(z|x) + \epsilon \cdot h(z|x)$ in the neighborhood of the trivial representation $p(z|x) = p(z)$, such that $\text{IB}_\beta[p'(z|x)] < \text{IB}_\beta[p(z|x)]$, thus satisfying the definition of IB$_\beta$-Learnability.

To make Theorem 2 more practical, we perturb $p(z|x)$ around the trivial solution, $p'(z|x) = p(z|x) + \epsilon \cdot h(z|x)$, and expand $\mathrm{IB}_\beta[p(z|x) + \epsilon \cdot h(z|x)] - \mathrm{IB}_\beta[p(z|x)]$ to second order in $\epsilon$. We can then prove Theorem 3:

**Theorem 3** (**Suff. Cond. 2**)**.** *A sufficient condition for* $(X,Y)$ *to be IB$_\beta$-learnable is that $X$ and $Y$ are not independent and*

$$\beta > \inf\_{h(\mathbf{x})} \beta\_0[h(\mathbf{x})] \tag{2}$$

*where the functional* $\beta_0[h(x)]$ *is given by*

$$\beta\_0[h(\mathbf{x})] = \frac{\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})}[h(\mathbf{x})^2] - \left(\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})}[h(\mathbf{x})]\right)^2}{\mathbb{E}\_{\mathbf{y} \sim p(\mathbf{y})}\left[\left(\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x}|\mathbf{y})}[h(\mathbf{x})]\right)^2\right] - \left(\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})}[h(\mathbf{x})]\right)^2}$$

*Moreover,* $\left(\inf_{h(x)} \beta_0[h(x)]\right)^{-1}$ *is a lower bound of the slope of the Pareto frontier in the information plane* $I(Y;Z)$ *vs.* $I(X;Z)$ *at the origin.*

The proof is given in Appendix A.7, which also shows that if $\beta > \inf_{h(x)} \beta_0[h(x)]$ in Theorem 3 is satisfied, we can construct a perturbation function $h(z|x) = h^*(x)h_2(z)$, with $h^*(x) = \arg\min_{h(x)} \beta_0[h(x)]$, $\int h_2(z)\,dz = 0$ and $\int \frac{h_2^2(z)}{p(z)}\,dz > 0$ for some $h_2(z)$, such that $h(z|x)$ satisfies Theorem 2. It also shows that the converse is true: if there exists an $h(z|x)$ such that the condition in Theorem 2 holds, then Theorem 3 is satisfied, that is, $\beta > \inf_{h(x)} \beta_0[h(x)]$. (We do not claim that any $h(z|x)$ satisfying Theorem 2 can be decomposed as $h^*(x)h_2(z)$ at the onset of learning. But by the equivalence of Theorems 2 and 3 explained above, when there exists an $h(z|x)$ such that Theorem 2 is satisfied, we can always construct an $h'(z|x) = h^*(x)h_2(z)$ that also satisfies Theorem 2.) Moreover, letting the perturbation function be $h(z|x) = h^*(x)h_2(z)$ at the trivial solution, we have

$$p\_{\boldsymbol{\beta}}(\boldsymbol{y}|\boldsymbol{x}) = p(\boldsymbol{y}) + \epsilon^{2} \mathbb{C}\_{z}(h^{\*}(\boldsymbol{x}) - \overline{h}\_{x}^{\*}) \int p(\boldsymbol{x}, \boldsymbol{y}) (h^{\*}(\boldsymbol{x}) - \overline{h}\_{x}^{\*}) d\boldsymbol{x} \tag{3}$$

where $p_\beta(y|x)$ is the estimate of $p(y|x)$ by IB for a certain $\beta$, $\overline{h}^*_x = \int h^*(x)p(x)\,dx$ and $\mathbb{C}_z = \int \frac{h_2^2(z)}{p(z)}\,dz > 0$ is a constant. This shows how the $p_\beta(y|x)$ found by IB explicitly depends on $h^*(x)$ at the onset of learning. The proof is provided in Appendix A.8.

Theorem 3 suggests a method to estimate $\beta_0$: we can parameterize $h(x)$, for example by a neural network, with the objective of minimizing $\beta_0[h(x)]$. At its minimum, $\beta_0[h(x)]$ provides an upper bound for $\beta_0$, and $h(x)$ provides a *soft clustering* of the examples, corresponding to a nontrivial perturbation of $p(z|x)$ at $p(z|x) = p(z)$ that minimizes $\mathrm{IB}_\beta[p(z|x)]$.
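As a minimal sketch of this estimation idea, the functional $\beta_0[h(x)]$ of Equation (2) can be evaluated directly when $X$ and $Y$ are discrete; below, a crude random search stands in for the SGD-trained neural network, and the toy joint distribution is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy joint distribution p(x, y): 4 inputs, 2 classes.
pxy = np.array([[0.40, 0.02],
                [0.28, 0.05],
                [0.05, 0.15],
                [0.02, 0.03]])
pxy /= pxy.sum()
px = pxy.sum(axis=1)        # p(x)
py = pxy.sum(axis=0)        # p(y)
px_given_y = pxy / py       # column j holds p(x | y_j)

def beta0(h):
    """beta_0[h(x)] of Eq. (2) for a vector h of values h(x)."""
    mean = px @ h
    num = px @ h**2 - mean**2           # Var_x(h)
    cond = h @ px_given_y               # E_{x ~ p(x|y)}[h] for each y
    den = py @ cond**2 - mean**2        # variance of the per-class means
    return num / den if den > 1e-12 else np.inf

# Crude random search standing in for SGD over a neural-net h(x).
beta0_upper = min(beta0(rng.standard_normal(4)) for _ in range(20000))
print(beta0_upper)   # an upper bound on beta_0
```

By the law of total variance the numerator always dominates the denominator, so every $h$ yields $\beta_0[h(x)] \ge 1$, consistent with the necessary condition $\beta > 1$ mentioned in Section 6.2.2.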

Alternatively, based on the properties of $\beta_0[h(x)]$, we can also use a specific functional form for $h(x)$ in Equation (2) and obtain a stronger sufficient condition for IB$_\beta$-Learnability. But we want to choose $h(x)$ as near to the infimum as possible. To do this, we note the following characteristics of the R.H.S. of Equation (2):

• We can set $h(x)$ to be nonzero for $x \in \Omega_x$, for some region $\Omega_x \subset \mathcal{X}$, and 0 otherwise. Then we obtain the following sufficient condition:

$$\beta > \inf\_{h(\mathbf{x}), \Omega\_{\mathbf{x}} \subset \mathcal{X}} \frac{\frac{\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x}), \mathbf{x} \in \Omega\_{\mathbf{x}}}[h(\mathbf{x})^2]}{\left(\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x}), \mathbf{x} \in \Omega\_{\mathbf{x}}}[h(\mathbf{x})]\right)^2} - 1}{\int \frac{dy}{p(y)} \left(\frac{\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x}), \mathbf{x} \in \Omega\_{\mathbf{x}}}[p(y|\mathbf{x})h(\mathbf{x})]}{\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x}), \mathbf{x} \in \Omega\_{\mathbf{x}}}[h(\mathbf{x})]}\right)^2 - 1} \tag{4}$$

• The numerator of the R.H.S. of Equation (4) attains its minimum when $h(x)$ is constant within $\Omega_x$. This can be proved using the Cauchy–Schwarz inequality $\langle u, u\rangle \langle v, v\rangle \ge \langle u, v\rangle^2$, setting $u(x) = h(x)\sqrt{p(x)}$, $v(x) = \sqrt{p(x)}$ and defining the inner product as $\langle u, v\rangle = \int u(x)v(x)\,dx$. Therefore, the numerator of the R.H.S. of Equation (4) is at least $\frac{1}{\int_{x \in \Omega_x} p(x)\,dx} - 1$, attaining equality when $\frac{u(x)}{v(x)} = h(x)$ is constant.

Based on these observations, we can let $h(x)$ be a nonzero constant inside some region $\Omega_x \subset \mathcal{X}$ and 0 otherwise; the infimum over an arbitrary function $h(x)$ then simplifies to an infimum over $\Omega_x \subset \mathcal{X}$, and we obtain a sufficient condition for IB$_\beta$-Learnability, which is a key result of this paper:

**Theorem 4** (**Conspicuous Subset Suff. Cond.**)**.** *A sufficient condition for* $(X,Y)$ *to be IB$_\beta$-learnable is that $X$ and $Y$ are not independent and*

$$\beta > \inf\_{\Omega\_{\mathbf{x}} \subseteq \mathcal{X}} \beta\_0(\Omega\_{\mathbf{x}}) \tag{5}$$

*where*

$$\beta\_0(\Omega\_x) = \frac{\frac{1}{p(\Omega\_x)} - 1}{\mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \frac{p(y|\Omega\_x)}{p(y)} - 1 \right]}$$

*where* $\Omega_x$ *denotes the event that* $x \in \Omega_x$*, with probability* $p(\Omega_x)$*.*

$\left(\inf_{\Omega_x \subset \mathcal{X}} \beta_0(\Omega_x)\right)^{-1}$ *gives a lower bound of the slope of the Pareto frontier in the information plane* $I(Y;Z)$ *vs.* $I(X;Z)$ *at the origin.*

The proof is given in Appendix A.9. In the proof we also show that this condition is invariant to invertible mappings of *X*.
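For discrete $X$ and $Y$, the quantity inside the infimum of Equation (5) is easy to evaluate directly. The sketch below (a hypothetical toy distribution; helper names are ours) computes $\beta_0(\Omega_x)$ for one candidate subset:

```python
import numpy as np

# Toy discrete joint distribution p(x, y) (hypothetical numbers):
# 6 inputs, 2 classes; the first three inputs are confidently class 0.
pxy = np.array([[0.20, 0.01],
                [0.18, 0.02],
                [0.15, 0.01],
                [0.04, 0.14],
                [0.03, 0.12],
                [0.02, 0.08]])

def beta0_omega(pxy, mask):
    """Evaluate beta_0(Omega_x) of Eq. (5) for the subset flagged by `mask`."""
    py = pxy.sum(axis=0)                        # p(y)
    p_omega = pxy[mask].sum()                   # p(Omega_x)
    py_omega = pxy[mask].sum(axis=0) / p_omega  # p(y | Omega_x)
    num = 1.0 / p_omega - 1.0
    den = py_omega @ (py_omega / py - 1.0)      # E_{y~p(y|Omega_x)}[p(y|Omega_x)/p(y) - 1]
    return num / den

# The confident, typical class-0 subset gives a small threshold value.
mask = np.array([True, True, True, False, False, False])
print(beta0_omega(pxy, mask))   # ≈ 1.85
```

Sweeping `mask` over candidate subsets and taking the minimum implements the infimum of Equation (5); Algorithm 1 in Section 6 does exactly this on an empirical estimate of $p(y|x)$.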

#### **5. Discussion**

#### *5.1. The Conspicuous Subset Determines $\beta_0$*

From Equation (5), we see that three characteristics of the subset $\Omega_x \subset \mathcal{X}$ lead to low $\beta_0$: **(1) confidence:** $p(y|\Omega_x)$ is large; **(2) typicality and size:** the number of elements in $\Omega_x$ is large, or the elements in $\Omega_x$ are typical, leading to a large probability $p(\Omega_x)$; **(3) imbalance:** $p(y)$ is small for the subset $\Omega_x$ but large for its complement. In summary, $\beta_0$ will be determined by the largest *confident*, *typical* and *imbalanced* subset of examples, or an equilibrium of those characteristics. We term the $\Omega_x$ that minimizes $\beta_0(\Omega_x)$ the *conspicuous subset*.

#### *5.2. Multiple Phase Transitions*

Based on this characterization of $\Omega_x$, we can hypothesize datasets with multiple learnability phase transitions. Specifically, consider a region $\Omega_{x_0}$ that is small but "typical", consists of all elements confidently predicted as $y_0$ by $p(y|x)$, and where $y_0$ is the least common class. By construction, this $\Omega_{x_0}$ will dominate the infimum in Equation (5), resulting in a small value of $\beta_0$. However, the remaining $\mathcal{X} - \Omega_{x_0}$ effectively forms a new dataset, $\mathcal{X}_1$. At exactly $\beta_0$, we may have that the current encoder, $p_0(z|x)$, has no mutual information with the remaining classes in $\mathcal{X}_1$; that is, $I(Y_1; Z_0) = 0$. In this case, Definition 1 applies to $p_0(z|x)$ with respect to $I(X_1; Z_1)$. We might expect that, at $\beta_0$, learning will plateau until we reach some $\beta_1 > \beta_0$ that defines the phase transition for $\mathcal{X}_1$. Clearly this process could repeat many times, with each new dataset $\mathcal{X}_i$ being distinctly more difficult to learn than $\mathcal{X}_{i-1}$.

#### *5.3. Similarity to Information Measures*

The denominator of $\beta_0(\Omega_x)$ in Equation (5) is closely related to mutual information. Using the inequality $x - 1 \ge \log(x)$ for $x > 0$, we obtain:

$$\mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \frac{p(y|\Omega\_x)}{p(y)} - 1 \right] \ge \mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \log \frac{p(y|\Omega\_x)}{p(y)} \right] = \overline{I}(\Omega\_x; Y)$$

where $\overline{I}(\Omega_x; Y)$ is the mutual information "density" at $\Omega_x \subset \mathcal{X}$. Of course, this quantity is also $D_{\mathrm{KL}}[p(y|\Omega_x)\,\|\,p(y)]$, so we know that the denominator of Equation (5) is non-negative. Incidentally, $\mathbb{E}_{y \sim p(y|\Omega_x)}\left[\frac{p(y|\Omega_x)}{p(y)} - 1\right]$ is the density of "rational mutual information" [19] at $\Omega_x$.

Similarly, the numerator of $\beta_0(\Omega_x)$ is related to the self-information of $\Omega_x$:

$$\frac{1}{p(\Omega\_x)} - 1 \ge \log \frac{1}{p(\Omega\_x)} = -\log p(\Omega\_x) = h(\Omega\_x)$$

so we can estimate $\beta_0$ as:

$$\beta \rho \simeq \inf\_{\Omega\_{\mathbf{x}} \subset \mathcal{X}} \frac{h(\Omega\_{\mathbf{x}})}{\overline{I}(\Omega\_{\mathbf{x}}; \mathcal{Y})} \tag{6}$$

Since Equation (6) uses upper bounds on both the numerator and the denominator, it does not give us a bound on *β*0, only an estimate.
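A quick numeric check on a toy, hypothetical discrete distribution (made-up numbers, our variable names) illustrates this: $x - 1 \ge \log x$ is applied to both the numerator and the denominator, so the ratio of the two information quantities need not bound $\beta_0(\Omega_x)$ from either side:

```python
import numpy as np

# Toy discrete joint p(x, y) (hypothetical numbers) and a candidate subset.
pxy = np.array([[0.20, 0.01],
                [0.18, 0.02],
                [0.15, 0.01],
                [0.04, 0.14],
                [0.03, 0.12],
                [0.02, 0.08]])
py = pxy.sum(axis=0)
mask = np.array([True, True, True, False, False, False])
p_omega = pxy[mask].sum()                       # p(Omega_x)
py_omega = pxy[mask].sum(axis=0) / p_omega      # p(y | Omega_x)

num_exact = 1.0 / p_omega - 1.0                 # numerator of beta_0(Omega_x)
num_info = -np.log(p_omega)                     # h(Omega_x), its lower bound
den_exact = py_omega @ (py_omega / py - 1.0)    # denominator of beta_0(Omega_x)
den_info = py_omega @ np.log(py_omega / py)     # I-bar(Omega_x; Y), its lower bound

# Each information quantity lower-bounds its exact counterpart,
# so their ratio is only an estimate of beta_0(Omega_x), not a bound.
print(num_exact / den_exact, num_info / den_info)
```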

#### *5.4. Estimating Model Capacity*

The observation that a model cannot distinguish between cluster overlap in the data and its own lack of capacity gives an interesting way to use IB-Learnability to measure the capacity of a set of models relative to the task they are being used to solve. For example, for a classification task, we can use different model classes to estimate $p(y|x)$. For each such trained model, we can estimate the corresponding IB-Learnability threshold $\beta_0$. A model with smaller capacity than the task needs will translate to more uncertainty in $p(y|\Omega_x)$, resulting in a larger $\beta_0$. On the other hand, models that give the same $\beta_0$ as each other all have the same capacity relative to the task, even if we would otherwise expect them to have very different capacities. For example, if two deep models have the same core architecture, but one has twice the number of parameters at each layer, and they both yield the same $\beta_0$, their capacities are equivalent with respect to the task. Thus, $\beta_0$ provides a way to measure model capacity in a task-specific manner.

#### *5.5. Learnability and the Information Plane*

Many of our results can be interpreted in terms of the geometry of the Pareto frontier illustrated in Figure 2, which describes the trade-off between increasing $I(Y;Z)$ and decreasing $I(X;Z)$. At any point on this frontier that minimizes $\mathrm{IB}^{\min}_\beta \equiv \min\, I(X;Z) - \beta I(Y;Z)$, the frontier will have slope $\beta^{-1}$ if it is differentiable. If the frontier is also concave (has negative second derivative), then this slope $\beta^{-1}$ will take its maximum $\beta_0^{-1}$ at the origin, which implies IB$_\beta$-Learnability for $\beta > \beta_0$, so that the threshold for IB$_\beta$-Learnability is simply the inverse slope of the frontier at the origin. More generally, as long as the Pareto frontier is differentiable, the threshold for IB$_\beta$-learnability is the inverse of its maximum slope. Indeed, Theorems 3 and 4 give lower bounds on the slope of the Pareto frontier at the origin.

**Figure 2.** The Pareto frontier of the information plane, $I(X;Z)$ vs. $I(Y;Z)$, for the binary classification of MNIST digits 0 and 1 with 20% label noise described in Section 1 and Figure 1. For this problem, learning happens for models trained at $\beta > 3.25$. $H(Y) = 1$ bit, since only two of ten digits are used, and $I(Y;Z) \le I(X;Y) \approx 0.5$ bits $< H(Y)$ because of the 20% label noise. The true frontier is differentiable; the figure shows a variational approximation that places an upper bound on both informations, horizontally offset to pass through the origin.

#### *5.6. IB-Learnability, Hypercontractivity and Maximum Correlation*

IB-Learnability and the sufficient conditions we provide harbor a deep connection with hypercontractivity and maximum correlation:

$$\frac{1}{\beta\_0} = \xi(X;Y) = \eta\_{\mathrm{KL}} \ge \sup\_{h(\mathbf{x})} \frac{1}{\beta\_0[h(\mathbf{x})]} = \rho\_m^2(X;Y) \tag{7}$$

which we prove in Appendix A.11. Here $\rho_m(X;Y) \equiv \max_{f,g} \mathbb{E}[f(X)g(Y)]$ s.t. $\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0$ and $\mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1$ is the *maximum correlation* [20,21], $\xi(X;Y) \equiv \sup_{Z - X - Y} \frac{I(Y;Z)}{I(X;Z)}$ is the *hypercontractivity coefficient* and $\eta_{\mathrm{KL}}(p(y|x), p(x)) \equiv \sup_{r(x) \neq p(x)} \frac{D_{\mathrm{KL}}(r(y)\|p(y))}{D_{\mathrm{KL}}(r(x)\|p(x))}$ is the *contraction coefficient*. Our proof relies on Anantharam et al. [16]'s result that $\xi(X;Y) = \eta_{\mathrm{KL}}$. Our work reveals the deep relationship between IB-Learnability and these earlier concepts and provides additional insights about which aspects of a dataset give rise to high maximum correlation and hypercontractivity: the most confident, typical, imbalanced subset of $(X,Y)$.
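For discrete variables, the rightmost identity in Equation (7) can be checked numerically: Witsenhausen's classical result gives $\rho_m(X;Y)$ as the second singular value of the matrix $B_{xy} = p(x,y)/\sqrt{p(x)p(y)}$, and a random search over $h(x)$ should drive $\beta_0[h(x)]$ down toward $1/\rho_m^2$. A sketch on a hypothetical toy distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy joint distribution p(x, y).
pxy = np.array([[0.40, 0.02],
                [0.28, 0.05],
                [0.05, 0.15],
                [0.02, 0.08]])
pxy /= pxy.sum()
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# Maximum correlation: second singular value of p(x,y)/sqrt(p(x)p(y)).
B = pxy / np.sqrt(np.outer(px, py))
rho_m = np.linalg.svd(B, compute_uv=False)[1]

def beta0(h):
    """beta_0[h(x)] of Eq. (2) for a vector h of values h(x)."""
    mean = px @ h
    num = px @ h**2 - mean**2
    cond = h @ (pxy / py)                # E_{x ~ p(x|y)}[h] for each y
    den = py @ cond**2 - mean**2
    return num / den if den > 1e-12 else np.inf

inf_beta0 = min(beta0(rng.standard_normal(4)) for _ in range(50000))
print(inf_beta0, 1.0 / rho_m**2)         # the two values should nearly coincide
```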

#### **6. Estimating the IB-Learnability Condition**

Theorem 4 not only reveals the relationship between the learnability threshold for $\beta$ and the least noisy region of $p(y|x)$, but also provides a way to practically estimate $\beta_0$, both in the general classification case and in more structured settings.

#### *6.1. Estimation Algorithm*

Based on Theorem 4, for general classification tasks we suggest Algorithm 1 to empirically estimate an upper bound $\tilde{\beta}_0 \ge \beta_0$, as well as to discover the conspicuous subset that determines $\beta_0$.

We approximate the probability of each example $p(x_i)$ by its empirical probability $\hat{p}(x_i)$. For example, for MNIST, $p(x_i) = \frac{1}{N}$, where $N$ is the number of examples in the dataset. The algorithm starts by learning a maximum likelihood model of $p_\theta(y|x)$, using, for example, a feed-forward neural network. It then constructs a matrix $P_{y|x}$ and a vector $p_y$ to store the estimated $p(y|x)$ and $p(y)$ for all the examples in the dataset. To make $\tilde{\beta}_0$ as small as possible, by the previous analysis we want to find a *conspicuous* subset $\Omega$ whose $p(y|x)$ is large for a certain class $j$ (to make the denominator of Equation (5) large) while containing as many elements as possible (to make the numerator small).

We suggest the following heuristic to discover such a conspicuous subset. For each class $j$, we sort the rows of $P_{y|x}$ in decreasing order of the probability of the pivot class $j$ and then search over $i_{\mathrm{left}}, i_{\mathrm{right}}$ for $\Omega = \{i_{\mathrm{left}}, i_{\mathrm{left}}+1, \ldots, i_{\mathrm{right}}\}$. Since $\tilde{\beta}_0$ is large when $\Omega$ contains too few or too many elements, the minimum of $\tilde{\beta}_0^{(j)}$ for class $j$ will typically be reached by some intermediate-sized subset, and we can use binary search or another discrete search algorithm for the optimization. The search stops when $\tilde{\beta}_0^{(j)}$ no longer improves by more than the tolerance $\varepsilon$. The algorithm then returns $\tilde{\beta}_0$ as the minimum over the classes of $\tilde{\beta}_0^{(1)}, \ldots, \tilde{\beta}_0^{(C)}$, as well as the conspicuous subset that determines this $\tilde{\beta}_0$.

After estimating *β*˜ 0, we can then use it for learning with IB, either directly or as an anchor for a region where we can perform a much smaller sweep than we otherwise would have. This may be particularly important for very noisy datasets, where *β*<sup>0</sup> can be very large.

#### **Algorithm 1** Estimating the upper bound for $\beta_0$ and identifying the conspicuous subset

**Require:** Dataset $\mathcal{D} = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$. The number of classes is $C$.
**Require:** $\varepsilon$: tolerance for estimating $\beta_0$
1: Learn a maximum likelihood model $p_\theta(y|x)$ using the dataset $\mathcal{D}$.
2: Construct the matrix $P_{y|x}$ such that $(P_{y|x})_{ij} = p_\theta(y = y_j \,|\, x = x_i)$.
3: Construct the vector $p_y = (p_{y_1}, \ldots, p_{y_C})$ such that $p_{y_j} = \frac{1}{N}\sum_{i=1}^{N} (P_{y|x})_{ij}$.
4: **for** $j$ **in** $\{1, 2, \ldots, C\}$:
5: &emsp; $P_{y|x}^{(\mathrm{sort}_j)} \leftarrow$ sort the rows of $P_{y|x}$ in decreasing order of $(P_{y|x})_{ij}$.
6: &emsp; $\tilde{\beta}_0^{(j)}, \Omega^{(j)} \leftarrow$ search over $i_{\mathrm{left}}, i_{\mathrm{right}}$ until $\tilde{\beta}_0^{(j)} = \mathbf{Get}\beta(P_{y|x}^{(\mathrm{sort}_j)}, p_y, \Omega)$ is minimal within tolerance $\varepsilon$, where $\Omega = \{i_{\mathrm{left}}, i_{\mathrm{left}}+1, \ldots, i_{\mathrm{right}}\}$.
7: **end for**
8: $j^* \leftarrow \arg\min_j \tilde{\beta}_0^{(j)}$, $j = 1, 2, \ldots, C$.
9: $\tilde{\beta}_0 \leftarrow \tilde{\beta}_0^{(j^*)}$.
10: $P_{y|x}^{(\tilde{\beta}_0)} \leftarrow$ the rows of $P_{y|x}^{(\mathrm{sort}_{j^*})}$ indexed by $\Omega^{(j^*)}$.
11: **return** $\tilde{\beta}_0$, $P_{y|x}^{(\tilde{\beta}_0)}$

**subroutine** $\mathbf{Get}\beta(P_{y|x}, p_y, \Omega)$:
s1: $N \leftarrow$ number of rows of $P_{y|x}$.
s2: $C \leftarrow$ number of columns of $P_{y|x}$.
s3: $n \leftarrow$ number of elements of $\Omega$.
s4: $(p_{y|\Omega})_j \leftarrow \frac{1}{n}\sum_{i \in \Omega} (P_{y|x})_{ij}$, for $j = 1, 2, \ldots, C$.
s5: $\tilde{\beta}_0 \leftarrow \dfrac{\frac{N}{n} - 1}{\sum_j \frac{(p_{y|\Omega})_j^2}{p_{y_j}} - 1}$
s6: **return** $\tilde{\beta}_0$
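The algorithm can be sketched in a few lines of NumPy; the exhaustive range search below is a simple stand-in for the binary search suggested in the text, and all names are ours:

```python
import numpy as np

def get_beta(Py_x, py, omega):
    """Subroutine Get_beta: beta~_0 for the row subset `omega` (Eq. (5))."""
    N = Py_x.shape[0]
    n = len(omega)
    py_omega = Py_x[omega].mean(axis=0)          # p(y | Omega)
    num = N / n - 1.0                            # 1/p(Omega) - 1, with p(x_i) = 1/N
    den = np.sum(py_omega**2 / py) - 1.0
    return num / den if den > 1e-12 else np.inf

def estimate_beta0(Py_x):
    """Search contiguous ranges of confidence-sorted rows, per class."""
    N, C = Py_x.shape
    py = Py_x.mean(axis=0)
    best, best_omega = np.inf, None
    for j in range(C):
        order = np.argsort(-Py_x[:, j])          # sort by confidence in class j
        for i_left in range(N):
            for i_right in range(i_left, N):
                omega = order[i_left:i_right + 1]
                b = get_beta(Py_x, py, omega)
                if b < best:
                    best, best_omega = b, omega
    return best, best_omega

# Tiny example: 10 examples, 2 classes, 20% class-conditional label noise.
Py_x = np.array([[0.8, 0.2]] * 6 + [[0.2, 0.8]] * 4)
beta0_hat, omega = estimate_beta0(Py_x)
print(beta0_hat)   # 77/27 ≈ 2.85, agreeing with Eq. (8) for this noise model
```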

#### *6.2. Special Cases for Estimating $\beta_0$*

The bound in Theorem 4 may still be challenging to estimate, due to the difficulty of accurately estimating $p(\Omega_x)$ and of searching over $\Omega_x \subset \mathcal{X}$. However, if the learning problem is more structured, we may be able to obtain a simpler formula for the sufficient condition.

#### 6.2.1. Class-Conditional Label Noise

Classification with noisy labels is a common practical scenario. An important noise model is that the labels are randomly flipped with some hidden class-conditional probabilities and we only observe the corrupted labels. This problem has been studied extensively [22–26]. If IB is applied to this scenario, how large *β* do we need? The following corollary provides a simple formula.

**Corollary 1.** *Suppose that the true class labels are $y^*$ and the regions of the input space belonging to each $y^*$ do not overlap. We only observe the corrupted labels $y$, with class-conditional noise $p(y|x, y^*) = p(y|y^*)$, and $Y$ is not independent of $X$. Then a sufficient condition for IB$_\beta$-Learnability is:*

$$\beta > \inf\_{y^\*} \frac{\frac{1}{p(y^\*)} - 1}{\sum\_{y} \frac{p(y|y^\*)^2}{p(y)} - 1} \tag{8}$$

We see that under class-conditional noise, the sufficient condition reduces to a discrete formula that depends only on the noise rates $p(y|y^*)$ and the true class probabilities $p(y^*)$, which can be accurately estimated via, for example, Northcutt et al. [26]. Additionally, if we know that the noise is class-conditional but the observed $\beta_0$ is greater than the R.H.S. of Equation (8), we can deduce that there is overlap between the true classes. The proof of Corollary 1 is provided in Appendix A.10.
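Equation (8) is straightforward to transcribe; the sketch below (our variable names) evaluates it for binary symmetric label noise with flip rate 0.2 and balanced classes:

```python
import numpy as np

def beta0_class_conditional(p_y_given_ystar, p_ystar):
    """R.H.S. of Eq. (8): rows of p_y_given_ystar index y*, columns index y."""
    p_y = p_ystar @ p_y_given_ystar              # observed label marginal p(y)
    num = 1.0 / p_ystar - 1.0                    # one numerator per y*
    den = np.sum(p_y_given_ystar**2 / p_y, axis=1) - 1.0
    return np.min(num / den)

# Binary symmetric label noise with flip rate rho = 0.2, balanced classes:
T = np.array([[0.8, 0.2],
              [0.2, 0.8]])
print(beta0_class_conditional(T, np.array([0.5, 0.5])))   # ≈ 2.78
```

In this balanced binary case the formula reduces to $\frac{1/0.5 - 1}{(0.8^2 + 0.2^2)/0.5 - 1} = 1/0.36 \approx 2.78$, so even moderate label noise pushes the learnability threshold well above 1.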

#### 6.2.2. Deterministic Relationships

Theorem 4 also reveals that $\beta_0$ relates closely to whether $Y$ is a deterministic function of $X$, as shown by Corollary 2:

**Corollary 2.** *Assume that Y contains at least one value y such that its probability p*(*y*) > 0*. If Y is a deterministic function of X and not independent of X, then a sufficient condition for IBβ-Learnability is β* > 1*.*

The assumption in Corollary 2 is satisfied by classification and by certain regression problems. (The following scenario does not satisfy this assumption: for a regression problem where $Y$ is a continuous random variable with a bounded probability density function $p_Y(y)$, the probability $P(Y = y)$ is 0 for every $y$.) This corollary generalizes the result in Reference [9], which proves it only for classification problems. Combined with the necessary condition $\beta > 1$ for any dataset $(X,Y)$ to be IB$_\beta$-learnable (Section 3), we have that, under the assumption, if $Y$ is a deterministic function of $X$, then a necessary and sufficient condition for IB$_\beta$-learnability is $\beta > 1$; that is, its $\beta_0$ is 1. The proof of Corollary 2 is provided in Appendix A.10.

Therefore, in practice, if we find that $\beta_0 > 1$, we may infer that $Y$ is not a deterministic function of $X$. For a classification task, we may infer that either some classes overlap or the labels are noisy. However, recall that finite models may add effective class overlap if they have insufficient capacity for the learning task, as mentioned in Section 4. This may translate into a higher observed $\beta_0$, even when learning deterministic functions.

#### **7. Experiments**

To test how well the theoretical conditions for IB$_\beta$-learnability match experiment, we apply them to synthetic data with varying noise rates and class overlap, to MNIST binary classification with varying noise rates and to CIFAR10 classification, comparing with the $\beta_0$ found experimentally. We also compare with the algorithm in Kim et al. [18] for estimating the hypercontractivity coefficient ($= 1/\beta_0$) via the contraction coefficient $\eta_{\mathrm{KL}}$. Experiment details are in Appendix A.12.

#### *7.1. Synthetic Dataset Experiments*

We construct a set of datasets from 2D mixtures of 2 Gaussians as $X$ and the identity of the mixture component as $Y$. We simulate two practical scenarios with these datasets: (1) noisy labels with class-conditional noise and (2) class overlap. For (1), we vary the class-conditional noise rates. For (2), we vary class overlap by tuning the distance between the Gaussians. For each experiment, we sweep $\beta$ with exponential steps and observe $I(X;Z)$ and $I(Y;Z)$. We then compare the empirical $\beta_0$, indicated by the onset of above-zero $I(X;Z)$, with the predicted values for $\beta_0$.

#### 7.1.1. Classification with Class-Conditional Noise

In this experiment, we use a mixture of Gaussians with 2 components, each of which is a 2D Gaussian with diagonal covariance matrix $\Sigma = \mathrm{diag}(0.25, 0.25)$. The two components have distance 16 (hence virtually no overlap) and equal mixture weights. For each $x$, the label $y \in \{0, 1\}$ is the identity of the component it belongs to. We create multiple datasets by randomly flipping the labels $y$ with a certain noise rate $\rho = P(y = 0 \,|\, y^* = 1) = P(y = 1 \,|\, y^* = 0)$. For each dataset, we train VIB models across a range of $\beta$ and record the observed onset of above-zero $I(X;Z)$ (Observed). To test how different methods perform in estimating $\beta_0$, we apply the following methods: **(1)** Corollary 1, since this is classification with class-conditional noise and the two true classes have virtually no overlap; **(2)** Algorithm 1 with the true $p(y|x)$; **(3)** the algorithm in Kim et al. [18] that estimates $\hat{\eta}_{\mathrm{KL}}$, provided with the true $p(y|x)$; **(4)** $\beta_0[h(x)]$ in Equation (2); **(2′)** Algorithm 1 with $p(y|x)$ estimated by a neural net; **(3′)** $\hat{\eta}_{\mathrm{KL}}$ with the same estimated $p(y|x)$ as in (2′). The results are shown in Figure 3 and in Table 1.
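A minimal sketch of this data-generation setup (the sample count, seed and function name are our choices, not from the paper):

```python
import numpy as np

def make_dataset(n=10000, distance=16.0, rho=0.2, seed=0):
    """2D mixture of two Gaussians (cov 0.25*I, equal weights, centers
    `distance` apart) with symmetric class-conditional label flips."""
    rng = np.random.default_rng(seed)
    y_true = rng.integers(0, 2, size=n)                  # component identity y*
    centers = np.array([[0.0, 0.0], [distance, 0.0]])
    X = centers[y_true] + rng.normal(scale=np.sqrt(0.25), size=(n, 2))
    flip = rng.random(n) < rho                           # flip with rate rho
    y = np.where(flip, 1 - y_true, y_true)               # observed noisy label
    return X, y, y_true

X, y, y_true = make_dataset()
print(X.shape, np.mean(y != y_true))                     # empirical flip rate ≈ rho
```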

**Figure 3.** Predicted vs. experimentally identified *β*0, for mixture of Gaussians with varying class-conditional noise rates.


**Table 1.** Full table of values used to generate Figure 3.

From Figure 3 and Table 1, we see the following. **(A)** When using the true $p(y|x)$, both Algorithm 1 and $\hat{\eta}_{\mathrm{KL}}$ generally upper bound the empirical $\beta_0$, and Algorithm 1 is generally tighter. **(B)** When using the true $p(y|x)$, Algorithm 1 and Corollary 1 give the same result. **(C)** Comparing Algorithm 1 and $\hat{\eta}_{\mathrm{KL}}$, both using the same empirically estimated $p(y|x)$: both approaches provide good estimates in the low-noise region; however, in the high-noise region, Algorithm 1 gives more precise values than $\hat{\eta}_{\mathrm{KL}}$, indicating that Algorithm 1 is more robust to the estimation error of $p(y|x)$. **(D)** Equation (2) empirically upper bounds the experimentally observed $\beta_0$ and gives almost the same result as the theoretical estimates of Corollary 1 and of Algorithm 1 with the true $p(y|x)$. In the classification setting, this approach does not require any learned estimate of $p(y|x)$, as we can directly use the empirical $p(y)$ and $p(x|y)$ from SGD mini-batches.

This experiment also shows that for datasets where the signal-to-noise ratio is small, $\beta_0$ can be very high. Instead of blindly sweeping $\beta$, our result can provide guidance for setting $\beta$ so that learning can happen.

#### 7.1.2. Classification with Class Overlap

In this experiment, we test how different amounts of overlap between classes influence $\beta_0$. We use a mixture of Gaussians with two components, each of which is a 2D Gaussian with diagonal covariance matrix $\Sigma = \mathrm{diag}(0.25, 0.25)$. The two components have weights 0.6 and 0.4. We vary the distance between the Gaussians from 8.0 down to 0.8 and observe the empirical $\beta_0$. Since we do not add noise to the labels, if there were no overlap and a deterministic map from $X$ to $Y$, we would have $\beta_0 = 1$ by Corollary 2. The more the two classes overlap, the more uncertain $Y$ is given $X$; by Equation (5), we then expect $\beta_0$ to be larger, which is corroborated by Figure 4.

**Figure 4.** $I(Y;Z)$ vs. $\beta$ for mixture of Gaussian datasets with different distances between the two mixture components. The vertical lines mark $\beta_{0,\mathrm{predicted}}$ computed by the R.H.S. of Equation (8). As Equation (8) does not make predictions w.r.t. class overlap, the vertical lines are always just above $\beta_{0,\mathrm{predicted}} = 1$. However, as expected, decreasing the distance between the classes in $X$ space also increases the true $\beta_0$.

#### *7.2. MNIST Experiments*

We perform binary classification with digits 0 and 1 and, as before, add class-conditional noise to the labels with varying noise rates $\rho$. To explore how model capacity influences the onset of learning, for each dataset we train two sets of VIB models differing only in the number of neurons in the hidden layers of the encoder: one with $n = 512$ neurons, the other with $n = 128$ neurons. As described in Section 4, insufficient capacity results in more uncertainty of $Y$ given $X$ from the point of view of the model, so we expect the observed $\beta_0$ for the $n = 128$ model to be larger. This is confirmed by the experiment (Figure 5). In Figure 5, we also plot the $\beta_0$ given by the different estimation methods. We see that observations (A), (B), (C) and (D) of Section 7.1 still hold.

**Figure 5.** $I(Y;Z)$ vs. $\beta$ for MNIST binary classification with different numbers of hidden units per layer $n$ and noise rates $\rho$: (upper left) $\rho = 0.02$, (upper right) $\rho = 0.1$, (lower left) $\rho = 0.2$, (lower right) $\rho = 0.3$. The vertical lines mark the $\beta_0$ estimated by different methods. $n = 128$ has insufficient capacity for the problem, so its observed learnability onset is pushed higher, similar to the class overlap case.

#### *7.3. MNIST Experiments Using Equation (2)*

To see what IB learns at its onset of learning for the full MNIST dataset, we optimize Equation (2) w.r.t. the full MNIST dataset and visualize the clustering of digits by $h(x)$. Equation (2) can be optimized with SGD using any differentiable parameterized mapping $h(x): \mathcal{X} \to \mathbb{R}$. In this case, we chose to parameterize $h(x)$ with a PixelCNN++ architecture [27,28], as PixelCNN++ is a powerful autoregressive model for images that gives a scalar output (normally interpreted as $\log p(x)$). Equation (2) should generally give two clusters in the output space, as discussed in Section 4. In this setup, smaller values of $h(x)$ correspond to the subset of the data that is easiest to learn. Figure 6 shows two strongly separated clusters, as well as the threshold we choose to divide them. Figure 7 shows the first 5776 MNIST training examples as sorted by our learned $h(x)$, with the examples above the threshold highlighted in red. We can clearly see that our learned $h(x)$ has separated the "easy" 1 digits from the rest of the MNIST training set.

**Figure 6.** Histograms of the full MNIST training and validation sets according to *h*(*X*). Note that both are bimodal and the histograms are indistinguishable. In both cases, *h*(*x*) has learned to separate most of the ones into the smaller mode but difficult ones are in the wide valley between the two modes. See Figure 7 for all of the training images to the left of the red threshold line, as well as the first few images to the right of the threshold.

**Figure 7.** The first 5776 MNIST training set digits when sorted by *h*(*x*). The digits highlighted in red are above the threshold drawn in Figure 6.

#### *7.4. CIFAR10 Forgetting Experiments*

For CIFAR10 [14], we study how *forgetting* varies with *β*. In other words, given a VIB model trained at some high *β*2, if we anneal it down to some much lower *β*1, what *I*(*Y*; *Z*) does the model converge to? Using Algorithm 1, we estimated *β*0 = 1.0483 on a version of CIFAR10 with 20% label noise, where *p*(*y*|*x*) is estimated by maximum likelihood training with the same encoder and classifier architectures as used for VIB. For the VIB models, the lowest *β* with performance above chance was *β* = 1.048 (Figure 8), a very tight match with the estimate from Algorithm 1. See Appendix A.12 for details.

**Figure 8.** Plot of *I*(*Y*; *Z*) vs. *β* for the CIFAR10 training set with 20% label noise. Each blue cross corresponds to a fully converged model starting from an independent initialization. The vertical black line corresponds to the predicted *β*0 = 1.0483 using Algorithm 1. The empirical *β*0 = 1.048.
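Reading the empirical *β*0 off such a sweep is mechanical: it is the smallest *β* whose converged *I*(*Y*; *Z*) lies above the chance level. A minimal sketch (the sweep values and the tolerance are made up for illustration):

```python
def empirical_beta0(sweep, tol=1e-3):
    """Smallest beta whose converged I(Y;Z) is above chance (> tol nats).

    `sweep` is a list of (beta, mutual_information) pairs, one per
    fully converged, independently initialized model."""
    learned = [(beta, info) for beta, info in sweep if info > tol]
    return min(learned)[0] if learned else None

# Hypothetical (beta, I(Y;Z)) pairs from independently initialized runs.
sweep = [(1.00, 0.0), (1.02, 0.0), (1.048, 0.21), (1.10, 0.35), (1.50, 0.60)]
print(empirical_beta0(sweep))  # -> 1.048
```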

#### **8. Conclusions**

In this paper, we have presented theoretical results for predicting the onset of learning and have shown that it is determined by the conspicuous subset of the training examples. We gave a practical algorithm for predicting the transition as well as discovering this subset and showed that those predictions are accurate, even in cases of extreme label noise. We proved a deep connection between IB-learnability, our upper bounds on *β*0, the hypercontractivity coefficient, the contraction coefficient and the maximum correlation. We believe that these results provide a deeper understanding of IB, as well as a tool for analyzing a dataset by discovering its conspicuous subset and a tool for measuring model capacity in a task-specific manner. Our work also raises other questions, such as whether there are other phase transitions in learnability that might be identified. We hope to address some of those questions in future work.

**Author Contributions:** Conceptualization, T.W. and I.F.; methodology, T.W., I.F., I.L.C. and M.T.; software, T.W. and I.F.; validation, T.W. and I.F.; formal analysis, T.W. and I.F.; investigation, T.W. and I.F.; resources, T.W., I.F., I.L.C. and M.T.; data curation, T.W. and I.F.; writing–original draft preparation, T.W., I.F., I.L.C. and M.T.; writing–review and editing, T.W., I.F., I.L.C. and M.T.; visualization, T.W. and I.F.; supervision, I.F., I.L.C. and M.T.; project administration, I.F., I.L.C. and M.T.; funding acquisition, M.T.

**Funding:** T.W.'s work was supported by the The Casey and Family Foundation, the Foundational Questions Institute and the Rothberg Family Fund for Cognitive Science. He thanks the Center for Brains, Minds and Machines (CBMM) for hospitality.

**Acknowledgments:** The authors would like to thank the anonymous reviewers for their constructive comments that contributed to improving the paper.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A**

The structure of the Appendix is as follows. In Appendix A.1, we provide preliminaries on the first-order and second-order variations of functionals. We prove Lemma 1 and Theorem 1 in Appendixes A.2 and A.3, respectively. In Appendix A.4, we prove Theorem 2, the Sufficient Condition 1 for IB-learnability. In Appendix A.5, we calculate the first and second variations of IB*β*[*p*(*z*|*x*)] at the trivial representation *p*(*z*|*x*) = *p*(*z*), which are used in proving Lemma 2 (Appendix A.6) and the Sufficient Condition 2 for IB*β*-learnability (Appendix A.7). In Appendix A.8, we prove Equation (3) at the onset of learning. After these preparations, we prove the key result of this paper, Theorem 4, in Appendix A.9. Two important corollaries, Corollaries 1 and 2, are then proved in Appendix A.10. In Appendix A.11, we explore the deep relation between *β*0, *β*0[*h*(*x*)], the hypercontractivity coefficient, the contraction coefficient, and maximum correlation. Finally, in Appendix A.12, we provide details for the experiments.

Below is an implicit convention of the paper: whenever a variable *W* is discrete, the integral ∫(·)*dw* is to be replaced by the summation ∑*w*(·).
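For example, for discrete *X* and *Z* this convention gives *I*(*X*; *Z*) = ∑*x*,*z* *p*(*x*, *z*) log [*p*(*x*, *z*)/(*p*(*x*)*p*(*z*))], summed only over the support. A minimal sketch:

```python
import numpy as np

def mutual_information(pxz):
    """I(X;Z) in nats for a discrete joint distribution (2-D array),
    summing only over the support of p(x, z)."""
    px = pxz.sum(axis=1, keepdims=True)
    pz = pxz.sum(axis=0, keepdims=True)
    mask = pxz > 0
    return float((pxz[mask] * np.log(pxz[mask] / (px * pz)[mask])).sum())

# Independent X and Z: I(X;Z) = 0.
p_indep = np.outer([0.3, 0.7], [0.5, 0.5])
# Perfectly correlated X and Z: I(X;Z) = log 2.
p_corr = np.array([[0.5, 0.0], [0.0, 0.5]])
```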

#### *Appendix A.1. Preliminaries: First-Order and Second-Order Variations*

Let the functional *F*[*f*(*x*)] be defined on some normed linear space R. Let us add a perturbative function *ε* · *h*(*x*) to *f*(*x*); the functional *F*[*f*(*x*) + *ε* · *h*(*x*)] can then be expanded as

$$\begin{aligned} \Delta F[f(\mathbf{x})] &= F[f(\mathbf{x}) + \boldsymbol{\epsilon} \cdot h(\mathbf{x})] - F[f(\mathbf{x})] \\ &= \varphi\_1[f(\mathbf{x})] + \varphi\_2[f(\mathbf{x})] + \mathcal{O}(\boldsymbol{\epsilon}^3 ||h||^2) \end{aligned}$$

where ||*h*|| denotes the norm of *h*, *ϕ*1[*f*(*x*)] = *ε* d*F*[*f*(*x*) + *ε* · *h*(*x*)]/d*ε*|*ε*=0 is a linear functional of *ε* · *h*(*x*), called the *first-order variation* and denoted *δF*[*f*(*x*)], and *ϕ*2[*f*(*x*)] = (1/2) *ε*² d²*F*[*f*(*x*) + *ε* · *h*(*x*)]/d*ε*²|*ε*=0 is a quadratic functional of *ε* · *h*(*x*), called the *second-order variation* and denoted *δ*²*F*[*f*(*x*)].

If *<sup>δ</sup>F*[ *<sup>f</sup>*(*x*)] = 0, we call *<sup>f</sup>*(*x*) a stationary solution for the functional *<sup>F</sup>*[·].

If Δ*F*[*f*(*x*)] ≥ 0 for all *h*(*x*) such that *f*(*x*) + *ε* · *h*(*x*) is in the neighborhood of *f*(*x*), we call *f*(*x*) a (local) minimum of *F*[·].
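These definitions can be checked numerically on a simple functional. For *F*[*f*] = ∫ *f*(*x*)² *dx* the expansion is exact: *δF* = 2*ε* ∫ *f h dx* and *δ*²*F* = *ε*² ∫ *h*² *dx*. A minimal sketch on a discrete grid (the grid and the choices of *f* and *h* are arbitrary):

```python
import numpy as np

# F[f] = ∫ f(x)^2 dx, discretized as a Riemann sum on a uniform grid.
x = np.linspace(0.0, 1.0, 1001)
dx = x[1] - x[0]
F = lambda f: (f**2).sum() * dx

f = np.sin(2 * np.pi * x)   # base function
h = np.cos(2 * np.pi * x)   # perturbation direction
eps = 1e-3

delta_F  = 2 * eps * (f * h).sum() * dx   # first-order variation δF
delta2_F = eps**2 * (h**2).sum() * dx     # second-order variation δ²F
total    = F(f + eps * h) - F(f)          # exact change ΔF

# For this quadratic functional, ΔF = δF + δ²F holds exactly
# (the O(ε³) remainder vanishes).
print(abs(total - (delta_F + delta2_F)))
```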

#### *Appendix A.2. Proof of Lemma 1*

**Proof.** If (*X*,*Y*) is IB*β*-learnable, then there exists *Z* ∈ Z given by some *p*1(*z*|*x*) such that IB*β*(*X*,*Y*; *Z*) < IB*β*(*X*,*Y*; *Ztrivial*) = 0, where *Ztrivial* satisfies *p*(*z*|*x*) = *p*(*z*). Since *X*′ = *g*(*X*) is an invertible map (if *X* is a continuous variable, *g* is additionally required to be continuous), and mutual information is invariant under such an invertible map [29], we have IB*β*(*X*′,*Y*; *Z*) = *I*(*X*′; *Z*) − *βI*(*Y*; *Z*) = *I*(*X*; *Z*) − *βI*(*Y*; *Z*) = IB*β*(*X*,*Y*; *Z*) < 0 = IB*β*(*X*′,*Y*; *Ztrivial*), so (*X*′,*Y*) is IB*β*-learnable. On the other hand, if (*X*,*Y*) is not IB*β*-learnable, then for all *Z* we have IB*β*(*X*,*Y*; *Z*) ≥ IB*β*(*X*,*Y*; *Ztrivial*) = 0. Again using the invariance of mutual information under *g*, we have for all *Z* that IB*β*(*X*′,*Y*; *Z*) = IB*β*(*X*,*Y*; *Z*) ≥ IB*β*(*X*,*Y*; *Ztrivial*) = 0, so (*X*′,*Y*) is not IB*β*-learnable. Therefore, (*X*,*Y*) and (*X*′,*Y*) have the same IB*β*-learnability.

#### *Appendix A.3. Proof of Theorem 1*

**Proof.** At the trivial representation *p*(*z*|*x*) = *p*(*z*), we have *I*(*X*; *Z*) = 0, and *I*(*Y*; *Z*) = 0 due to the Markov chain, so IB*β*(*X*,*Y*; *Z*)|*p*(*z*|*x*)=*p*(*z*) = 0 for any *β*. Since (*X*,*Y*) is IB*β*1-learnable, there exists a *Z* given by some *p*1(*z*|*x*) such that IB*β*1(*X*,*Y*; *Z*)|*p*1(*z*|*x*) < 0. Since *β*2 > *β*1 and *I*(*Y*; *Z*) ≥ 0, we have IB*β*2(*X*,*Y*; *Z*)|*p*1(*z*|*x*) ≤ IB*β*1(*X*,*Y*; *Z*)|*p*1(*z*|*x*) < 0 = IB*β*2(*X*,*Y*; *Z*)|*p*(*z*|*x*)=*p*(*z*). Therefore, (*X*,*Y*) is IB*β*2-learnable.

#### *Appendix A.4. Proof of Theorem 2*

**Proof.** To prove Theorem 2, we use Theorem 1 of Chapter 5 of Gelfand et al. [30], which gives a necessary condition for *F*[*f*(*x*)] to have a minimum at *f*0(*x*). Adapted to our notation, it reads:

**Theorem A1** ([30])**.** *A necessary condition for the functional F*[*f*(*x*)] *to have a minimum at f*(*x*) = *f*0(*x*) *is that for f*(*x*) = *f*0(*x*) *and all admissible ε* · *h*(*x*)*,*

$$
\delta^2 F[f(x)] \ge 0.
$$

Applying this to our functional IB*β*[*p*(*z*|*x*)], an immediate consequence of Theorem A1 is that, if at *p*(*z*|*x*) = *p*(*z*) there exists an *ε* · *h*(*z*|*x*) such that *δ*²IB*β*[*p*(*z*|*x*)] < 0, then *p*(*z*|*x*) = *p*(*z*) is not a minimum of IB*β*[*p*(*z*|*x*)]. By the definition of IB*β*-learnability, it follows that (*X*,*Y*) is IB*β*-learnable.

*Appendix A.5. First- and Second-Order Variations of IBβ*[*p*(*z*|*x*)]

In this section, we derive the first- and second-order variations of IB*β*[*p*(*z*|*x*)], which are needed for proving Lemma 2 and Theorem 3.

**Lemma A1.** *Using perturbative function h*(*z*|*x*)*, we have*

$$\begin{split} \delta \mathcal{I} B\_{\beta}[p(z|\mathbf{x})] &= \int d\mathbf{x} dz p(\mathbf{x}) h(z|\mathbf{x}) \log \frac{p(z|\mathbf{x})}{p(z)} - \beta \int d\mathbf{x} dy dz p(\mathbf{x}, y) h(z|\mathbf{x}) \log \frac{p(z|y)}{p(z)} \\ \delta^{2} \mathcal{I} B\_{\beta}[p(z|\mathbf{x})] &= \frac{1}{2} \left[ \int d\mathbf{x} dz \frac{p(\mathbf{x})^{2}}{p(\mathbf{x}, z)} h(z|\mathbf{x})^{2} - \beta \int d\mathbf{x} d\mathbf{x}' dy dz \frac{p(\mathbf{x}, y) p(\mathbf{x}', y)}{p(y, z)} h(z|\mathbf{x}) h(z|\mathbf{x}') \right] \\ &+ (\beta - 1) \int d\mathbf{x} d\mathbf{x}' dz \frac{p(\mathbf{x}) p(\mathbf{x}')}{p(z)} h(z|\mathbf{x}) h(z|\mathbf{x}') \Big] \end{split}$$

**Proof.** Since IB*β*[*p*(*z*|*x*)] = *I*(*X*; *Z*) − *βI*(*Y*; *Z*), let us calculate the first- and second-order variations of *I*(*X*; *Z*) and *I*(*Y*; *Z*) w.r.t. *p*(*z*|*x*), respectively. Throughout this derivation, we use *ε* · *h*(*z*|*x*) as the perturbative function, which makes it easy to keep track of the different orders of variation. We assume that *h*(*z*|*x*) is continuous and that there exists a constant *M* such that |*ε* *h*(*z*|*x*)/*p*(*z*|*x*)| < *M* for all (*x*, *z*) ∈ X × Z. We will finally absorb *ε* into *h*(*z*|*x*).

Denote *<sup>I</sup>*(*X*; *<sup>Z</sup>*) = *<sup>F</sup>*1[*p*(*z*|*x*)]. We have

$$F\_1[p(z|\mathbf{x})] = I(X;Z) = \int d\mathbf{x}dz p(z|\mathbf{x})p(\mathbf{x}) \log \frac{p(z|\mathbf{x})}{p(z)}$$

In this paper, we implicitly assume that the integrals (or sums) are taken only over the support of *p*(*x*, *y*, *z*).

Since

$$p(z) = \int p(z|x)p(x)dx$$

We have

$$p(z)|\_{p(z|\mathbf{x}) + \epsilon h(z|\mathbf{x})} = p(z)|\_{p(z|\mathbf{x})} + \epsilon \int h(z|\mathbf{x}) p(\mathbf{x}) d\mathbf{x}$$

Expanding *F*1[*p*(*z*|*x*) + *ε* *h*(*z*|*x*)] to second order in *ε*, we have

$$F\_1[p(z|\mathbf{x}) + \epsilon h(z|\mathbf{x})] = \int d\mathbf{x}dz\, p(\mathbf{x}) \left[ p(z|\mathbf{x}) + \epsilon h(z|\mathbf{x}) \right] \log \frac{p(z|\mathbf{x}) + \epsilon h(z|\mathbf{x})}{p(z) + \epsilon \int h(z|\mathbf{x}')p(\mathbf{x}')d\mathbf{x}'}$$

Writing $u\_1 = \epsilon h(z|\mathbf{x})/p(z|\mathbf{x})$ and $u\_2 = \epsilon \int h(z|\mathbf{x}')p(\mathbf{x}')d\mathbf{x}'/p(z)$, and using $\log(1+u) = u - \frac{u^2}{2} + \mathcal{O}(u^3)$, the integrand becomes

$$p(\mathbf{x})p(z|\mathbf{x})\left(1 + u\_1\right)\left[\log\frac{p(z|\mathbf{x})}{p(z)} + u\_1 - u\_2 - \frac{u\_1^2}{2} + \frac{u\_2^2}{2}\right] + \mathcal{O}(\epsilon^3)$$

Collecting the first-order terms in *ε*, we have

$$\begin{split} &\delta F\_{1}[p(z|\mathbf{x})] \\ &= \epsilon \int d\mathbf{x} d\mathbf{z} p(\mathbf{x}) p(\mathbf{z}|\mathbf{x}) \left( \frac{h(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})} - \frac{\int h(\mathbf{z}|\mathbf{x}') p(\mathbf{x}') d\mathbf{x}'}{p(\mathbf{z})} \right) + \epsilon \int d\mathbf{x} d\mathbf{z} p(\mathbf{x}) p(\mathbf{z}|\mathbf{x}) \frac{h(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})} \log \frac{p(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})} \\ &= \epsilon \int d\mathbf{x} d\mathbf{z} p(\mathbf{x}) h(\mathbf{z}|\mathbf{x}) - \epsilon \int d\mathbf{x}' d\mathbf{z} p(\mathbf{x}') h(\mathbf{z}|\mathbf{x}') + \epsilon \int d\mathbf{x} d\mathbf{z} p(\mathbf{x}) h(\mathbf{z}|\mathbf{x}) \log \frac{p(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})} \\ &= \epsilon \int d\mathbf{x} d\mathbf{z} p(\mathbf{x}) h(\mathbf{z}|\mathbf{x}) \log \frac{p(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})} \end{split}$$

Collecting the second-order terms (order *ε*²), we have

$$\begin{split} \delta^2 F\_1[p(z|\mathbf{x})] &= \frac{\epsilon^2}{2} \int d\mathbf{x}dz\, p(\mathbf{x})p(z|\mathbf{x}) \left( \frac{h(z|\mathbf{x})}{p(z|\mathbf{x})} - \frac{\int h(z|\mathbf{x}')p(\mathbf{x}')d\mathbf{x}'}{p(z)} \right)^2 \\ &= \frac{\epsilon^2}{2} \int d\mathbf{x}dz \frac{p(\mathbf{x})^2}{p(\mathbf{x},z)} h(z|\mathbf{x})^2 - \frac{\epsilon^2}{2} \int d\mathbf{x}d\mathbf{x}'dz \frac{p(\mathbf{x})p(\mathbf{x}')}{p(z)} h(z|\mathbf{x})h(z|\mathbf{x}') \end{split}$$
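The collected expression can be sanity-checked numerically on a discrete toy example: at the trivial representation *p*(*x*, *z*) = *p*(*x*)*p*(*z*) the first variation vanishes, so the exact change of *I*(*X*; *Z*) under a perturbation should match *δ*²*F*1 up to O(*ε*³). A minimal sketch (the distributions and the perturbation are arbitrary choices):

```python
import numpy as np

px = np.array([0.3, 0.7])            # p(x)
pz = np.array([0.4, 0.6])            # trivial p(z)
# Perturbation h[x, z] with sum_z h(z|x) = 0, so p(z|x) stays normalized.
h = np.array([[1.0, -1.0],
              [-0.5, 0.5]])
eps = 1e-3

pzx = pz[None, :] + eps * h          # perturbed p(z|x)
pxz = px[:, None] * pzx              # joint p(x, z)
pz_new = pxz.sum(axis=0)

# ΔI(X;Z), computed exactly (I = 0 at the trivial representation).
dI = (pxz * np.log(pxz / (px[:, None] * pz_new[None, :]))).sum()

# δ²F1 = (ε²/2)[Σ_{x,z} p(x)h(z|x)²/p(z) − Σ_z (Σ_x p(x)h(z|x))²/p(z)]
m = (px[:, None] * h).sum(axis=0)    # Σ_x p(x) h(z|x)
d2F1 = 0.5 * eps**2 * ((px[:, None] * h**2 / pz[None, :]).sum()
                       - (m**2 / pz).sum())
print(dI, d2F1)  # agree to relative O(ε)
```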

Now let us calculate the first and second-order variation of *<sup>F</sup>*2[*p*(*z*|*x*)] = *<sup>I</sup>*(*Z*;*Y*). We have

$$F\_2[p(z|\mathbf{x})] = I(\mathbf{Y}; \mathbf{Z}) = \int dy dz p(z|y)p(y) \log \frac{p(y, z)}{p(y)p(z)} = \int dx dy dz p(z|y)p(\mathbf{x}, y) \log \frac{p(y, z)}{p(y)p(z)}$$

Using the Markov chain *Z* ← *X* ↔ *Y*, we have

$$p(y, z) = \int p(z|x)p(x, y)dx$$

Hence

$$p(y, z)|\_{p(z|x) + \epsilon h(z|x)} = p(y, z)|\_{p(z|x)} + \epsilon \int h(z|x)p(x, y)dx$$

Then, expanding *F*2[*p*(*z*|*x*) + *ε* *h*(*z*|*x*)] to second order in *ε*, we have

$$F\_2[p(z|\mathbf{x}) + \epsilon h(z|\mathbf{x})] = \int d\mathbf{x}dydz\, p(\mathbf{x},y) \left[ p(z|\mathbf{x}) + \epsilon h(z|\mathbf{x}) \right] \log \frac{p(y,z) + \epsilon \int h(z|\mathbf{x}')p(\mathbf{x}',y)d\mathbf{x}'}{p(y)\left( p(z) + \epsilon \int h(z|\mathbf{x}')p(\mathbf{x}')d\mathbf{x}' \right)}$$

Writing $v\_1 = \epsilon \int h(z|\mathbf{x}')p(\mathbf{x}',y)d\mathbf{x}'/p(y,z)$ and $v\_2 = \epsilon \int h(z|\mathbf{x}')p(\mathbf{x}')d\mathbf{x}'/p(z)$, and expanding the logarithms as before, the integrand becomes

$$p(\mathbf{x},y)p(z|\mathbf{x})\left(1 + \frac{\epsilon h(z|\mathbf{x})}{p(z|\mathbf{x})}\right)\left[\log\frac{p(y,z)}{p(y)p(z)} + v\_1 - v\_2 - \frac{v\_1^2}{2} + \frac{v\_2^2}{2}\right] + \mathcal{O}(\epsilon^3)$$

Collecting the first-order terms in *ε*, we have

$$\begin{split} &\delta F\_{2}[p(z|\mathbf{x})] \\ &= \epsilon \int d\mathbf{x}dydz\, p(\mathbf{x},y)h(z|\mathbf{x}) \log \frac{p(y,z)}{p(y)p(z)} + \epsilon \int d\mathbf{x}dydz\, p(\mathbf{x},y)p(z|\mathbf{x}) \frac{\int h(z|\mathbf{x}')p(\mathbf{x}',y)d\mathbf{x}'}{p(y,z)} \\ &\quad - \epsilon \int d\mathbf{x}dydz\, p(\mathbf{x},y)p(z|\mathbf{x}) \frac{\int h(z|\mathbf{x}')p(\mathbf{x}')d\mathbf{x}'}{p(z)} \\ &= \epsilon \int d\mathbf{x}dydz\, p(\mathbf{x},y)h(z|\mathbf{x}) \log \frac{p(y,z)}{p(y)p(z)} + \epsilon \int d\mathbf{x}'dydz\, h(z|\mathbf{x}')p(\mathbf{x}',y) - \epsilon \int d\mathbf{x}'dz\, h(z|\mathbf{x}')p(\mathbf{x}') \\ &= \epsilon \int d\mathbf{x}dydz\, p(\mathbf{x},y)h(z|\mathbf{x}) \log \frac{p(z|y)}{p(z)} \end{split}$$

Collecting the second-order terms (order *ε*²), we have

$$\delta^2 F\_2[p(z|\mathbf{x})] = \frac{\epsilon^2}{2} \int d\mathbf{x}d\mathbf{x}'dydz \frac{p(\mathbf{x},y)p(\mathbf{x}',y)}{p(y,z)} h(z|\mathbf{x})h(z|\mathbf{x}') - \frac{\epsilon^2}{2} \int d\mathbf{x}d\mathbf{x}'dz \frac{p(\mathbf{x})p(\mathbf{x}')}{p(z)} h(z|\mathbf{x})h(z|\mathbf{x}')$$

Finally, we have

$$\begin{split} \delta \text{IB}\_{\beta}[p(\mathbf{z}|\mathbf{x})] &= \delta \text{F}\_{1}[p(\mathbf{z}|\mathbf{x})] - \beta \cdot \delta \text{F}\_{2}[p(\mathbf{z}|\mathbf{x})] \\ &= \epsilon \left( \int d\mathbf{x} d\mathbf{z} p(\mathbf{x}) h(\mathbf{z}|\mathbf{x}) \log \frac{p(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})} - \beta \int d\mathbf{x} dy d\mathbf{z} p(\mathbf{x}, y) h(\mathbf{z}|\mathbf{x}) \log \frac{p(\mathbf{z}|\mathbf{y})}{p(\mathbf{z})} \right) \end{split} \tag{A1}$$

$$\begin{split} \delta^2 \text{IB}\_{\beta}[p(z|\mathbf{x})] &= \delta^2 F\_1[p(z|\mathbf{x})] - \beta \cdot \delta^2 F\_2[p(z|\mathbf{x})] \\ &= \frac{\epsilon^2}{2} \left[ \int d\mathbf{x}dz \frac{p(\mathbf{x})^2}{p(\mathbf{x},z)} h(z|\mathbf{x})^2 - \beta \int d\mathbf{x}d\mathbf{x}'dydz \frac{p(\mathbf{x},y)p(\mathbf{x}',y)}{p(y,z)} h(z|\mathbf{x})h(z|\mathbf{x}') \right. \\ &\qquad \left. + (\beta - 1) \int d\mathbf{x}d\mathbf{x}'dz \frac{p(\mathbf{x})p(\mathbf{x}')}{p(z)} h(z|\mathbf{x})h(z|\mathbf{x}') \right] \end{split}$$

Absorbing *ε* into *h*(*z*|*x*), we get rid of the *ε* factors and obtain the final expressions in Lemma A1.
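Lemma A1 can likewise be verified end-to-end on a discrete toy problem (with *ε* absorbed into *h*): at the trivial representation, *δ*²IB*β* is the closed-form quadratic of Lemma A1, and its sign can be evaluated directly as a function of *β*. A minimal sketch for binary symmetric label noise with flip probability *ρ* (an illustrative choice; for this separable perturbation the sign flips at *β* = 1/(1 − 2*ρ*)²):

```python
import numpy as np

# Binary symmetric label noise: p(x) = 1/2, labels flipped with probability rho.
rho = 0.2
pxy = np.array([[(1 - rho) / 2, rho / 2],
                [rho / 2, (1 - rho) / 2]])   # p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
pz = np.array([0.5, 0.5])                    # trivial p(z)
h = np.array([[1.0, -1.0],
              [-1.0, 1.0]])                  # separable h(z|x) = h(x)h2(z), rows sum to 0

def d2_IB(beta):
    """Closed-form δ²IB_β of Lemma A1 at the trivial representation."""
    t1 = (px[:, None] * h**2 / pz[None, :]).sum()
    a = np.einsum('xy,xz->yz', pxy, h)       # Σ_x p(x,y) h(z|x)
    t2 = (a**2 / (py[:, None] * pz[None, :])).sum()
    m = (px[:, None] * h).sum(axis=0)        # Σ_x p(x) h(z|x)
    t3 = (m**2 / pz).sum()
    return 0.5 * (t1 - beta * t2 + (beta - 1) * t3)

beta_c = 1.0 / (1 - 2 * rho)**2              # ≈ 2.778 for rho = 0.2
print(d2_IB(beta_c - 0.1) > 0, d2_IB(beta_c + 0.1) < 0)  # True True
```

A negative value of `d2_IB(beta)` certifies IB*β*-learnability via Theorem 2.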

#### *Appendix A.6. Proof of Lemma 2*

**Proof.** Using Lemma A1, we have

$$\delta \text{IB}\_{\beta}[p(z|\mathbf{x})] = \int d\mathbf{x}d\mathbf{z}p(\mathbf{x})h(\mathbf{z}|\mathbf{x})\log\frac{p(\mathbf{z}|\mathbf{x})}{p(\mathbf{z})} - \beta \int d\mathbf{x}dydzp(\mathbf{x},y)h(\mathbf{z}|\mathbf{x})\log\frac{p(\mathbf{z}|y)}{p(\mathbf{z})}$$

Setting *p*(*z*|*x*) = *p*(*z*) (the trivial representation), we have log (*p*(*z*|*x*)/*p*(*z*)) ≡ 0; moreover, *p*(*z*|*y*) = ∫ *p*(*z*|*x*)*p*(*x*|*y*)*dx* = *p*(*z*), so log (*p*(*z*|*y*)/*p*(*z*)) ≡ 0 as well. Therefore, the two integrals are both 0. Hence,

$$\left. \delta \text{IB}\_{\beta} \left[ p(z|\mathbf{x}) \right] \right|\_{p(z|\mathbf{x}) = p(z)} \equiv \mathbf{0}$$

Therefore, *p*(*z*|*x*) = *p*(*z*) is a stationary solution of IB*β*[*p*(*z*|*x*)].

#### *Appendix A.7. Proof of Theorem 3*

**Proof.** First, recall from Section 3 that *β* > 1 is necessary for IB*β*-learnability; hence any sufficient condition for IB*β*-learnability must itself imply *β* > 1.

Now, using Theorem 2, a sufficient condition for (*X*,*Y*) to be IB*β*-learnable is that there exists an *h*(*z*|*x*) with ∫ *h*(*z*|*x*)*dz* = 0 such that *δ*²IB*β*[*p*(*z*|*x*)] < 0 at *p*(*z*|*x*) = *p*(*z*).

At the trivial representation, *p*(*z*|*x*) = *p*(*z*) and hence *p*(*x*, *z*) = *p*(*x*)*p*(*z*). Due to the Markov chain *Z* ← *X* ↔ *Y*, we also have *p*(*y*, *z*) = *p*(*y*)*p*(*z*). Substituting these into *δ*²IB*β*[*p*(*z*|*x*)] in Lemma A1, the condition becomes: there exists *h*(*z*|*x*) with ∫ *h*(*z*|*x*)*dz* = 0 such that

$$0 > \delta^2 \text{IB}\_{\beta}[p(z|\mathbf{x})] = \frac{1}{2} \left[ \int d\mathbf{x}dz \frac{p(\mathbf{x})^2}{p(\mathbf{x})p(z)} h(z|\mathbf{x})^2 - \beta \int d\mathbf{x}d\mathbf{x}'dydz \frac{p(\mathbf{x},y)p(\mathbf{x}',y)}{p(y)p(z)} h(z|\mathbf{x})h(z|\mathbf{x}') + (\beta - 1) \int d\mathbf{x}d\mathbf{x}'dz \frac{p(\mathbf{x})p(\mathbf{x}')}{p(z)} h(z|\mathbf{x})h(z|\mathbf{x}') \right] \tag{A2}$$

Rearranging terms and simplifying, we have

$$\int \frac{dz}{p(z)} G[h(z|\mathbf{x})] = \int \frac{dz}{p(z)} \left[ \int d\mathbf{x}\, h(z|\mathbf{x})^2 p(\mathbf{x}) - \beta \int \frac{dy}{p(y)} \left( \int d\mathbf{x}\, h(z|\mathbf{x}) p(\mathbf{x}) p(y|\mathbf{x}) \right)^2 + (\beta - 1) \left( \int d\mathbf{x}\, h(z|\mathbf{x}) p(\mathbf{x}) \right)^2 \right] < 0$$

where

$$G[h(\mathbf{x})] = \int d\mathbf{x} h(\mathbf{x})^2 p(\mathbf{x}) - \beta \int \frac{d\mathbf{y}}{p(\mathbf{y})} \left( \int d\mathbf{x} h(\mathbf{x}) p(\mathbf{x}) p(\mathbf{y}|\mathbf{x}) \right)^2 + (\beta - 1) \left( \int d\mathbf{x} h(\mathbf{x}) p(\mathbf{x}) \right)^2$$

Now we prove that the condition that ∃ *h*(*z*|*x*) s.t. ∫ (*dz*/*p*(*z*)) *G*[*h*(*z*|*x*)] < 0 is equivalent to the condition that ∃ *h*(*x*) s.t. *G*[*h*(*x*)] < 0.

If ∀ *h*(*z*|*x*), *G*[*h*(*z*|*x*)] ≥ 0, then ∀ *h*(*z*|*x*), ∫ (*dz*/*p*(*z*)) *G*[*h*(*z*|*x*)] ≥ 0. Therefore, if ∃ *h*(*z*|*x*) s.t. ∫ (*dz*/*p*(*z*)) *G*[*h*(*z*|*x*)] < 0, we have that ∃ *h*(*z*|*x*) s.t. *G*[*h*(*z*|*x*)] < 0. Since the functional *G*[*h*(*z*|*x*)] does not

*Entropy* **2019**, *21*, 924

contain integration over *<sup>z</sup>*, we can treat the *<sup>z</sup>* in *<sup>G</sup>*[*h*(*z*|*x*)] as a parameter and we have that <sup>∃</sup>*h*(*x*) s.t. *G*[*h*(*x*)] < 0.

Conversely, if there exists a certain function *h*(*x*) such that *G*[*h*(*x*)] < 0, we can find some *h*2(*z*) such that ∫ *h*2(*z*)*dz* = 0 and ∫ (*h*2²(*z*)/*p*(*z*))*dz* > 0, and let *h*1(*z*|*x*) = *h*(*x*)*h*2(*z*). Then we have

$$\int \frac{dz}{p(z)} \mathcal{G}[h(z|\mathbf{x})] = \int \frac{h\_2^2(z)dz}{p(z)} \mathcal{G}[h(\mathbf{x})] = \mathcal{G}[h(\mathbf{x})] \int \frac{h\_2^2(z)dz}{p(z)} < 0$$

In other words, the condition in Equation (A2) is equivalent to requiring that there exists an *h*(*x*) such that *G*[*h*(*x*)] < 0. Hence, a sufficient condition for IB*β*-learnability is that there exists an *h*(*x*) such that

$$\mathbb{G}[h(\mathbf{x})] = \int d\mathbf{x}h(\mathbf{x})^2 p(\mathbf{x}) - \beta \int \frac{d\mathbf{y}}{p(\mathbf{y})} \left( \int d\mathbf{x}h(\mathbf{x})p(\mathbf{x})p(\mathbf{y}|\mathbf{x}) \right)^2 + (\beta - 1) \left( \int d\mathbf{x}h(\mathbf{x})p(\mathbf{x}) \right)^2 < 0 \tag{A3}$$

When *<sup>h</sup>*(*x*) = *<sup>C</sup>* <sup>=</sup> constant in the entire input space <sup>X</sup> , Equation (A3) becomes:

$$\mathbf{C}^2 - \beta \mathbf{C}^2 + (\beta - 1)\mathbf{C}^2 < 0$$

which cannot be true. Therefore, *h*(*x*) = constant cannot satisfy Equation (A3).

Rearranging terms and simplifying, we have

$$\beta \left[ \int \frac{dy}{p(y)} \left( \int d\mathbf{x}\, h(\mathbf{x}) p(\mathbf{x}) p(y|\mathbf{x}) \right)^2 - \left( \int d\mathbf{x}\, h(\mathbf{x}) p(\mathbf{x}) \right)^2 \right] > \int d\mathbf{x}\, h(\mathbf{x})^2 p(\mathbf{x}) - \left( \int d\mathbf{x}\, h(\mathbf{x}) p(\mathbf{x}) \right)^2 \tag{A4}$$

Written in the form of expectations, we have

$$\beta \cdot \left( \mathbb{E}\_{y \sim p(y)} \left[ \left( \mathbb{E}\_{x \sim p(x|y)} [h(\mathbf{x})] \right)^2 \right] - \left( \mathbb{E}\_{x \sim p(x)} [h(\mathbf{x})] \right)^2 \right) > \mathbb{E}\_{x \sim p(x)} [h(\mathbf{x})^2] - \left( \mathbb{E}\_{x \sim p(x)} [h(\mathbf{x})] \right)^2 \tag{A5}$$

Since the square function is convex, using Jensen's inequality on the L.H.S. of Equation (A5), we have

$$\mathbb{E}\_{y \sim p(y)}\left[\left(\mathbb{E}\_{x \sim p(x|y)}[h(x)]\right)^2\right] \ge \left(\mathbb{E}\_{y \sim p(y)}\left[\mathbb{E}\_{x \sim p(x|y)}[h(x)]\right]\right)^2 = \left(\mathbb{E}\_{x \sim p(x)}[h(x)]\right)^2$$

The equality holds iff <sup>E</sup>*x*∼*p*(*x*|*y*)[*h*(*x*)] is constant w.r.t. *<sup>y</sup>*, i.e., *<sup>Y</sup>* is independent of *<sup>X</sup>*. Therefore, in order for Equation (A5) to hold, we require that *Y* is not independent of *X*.

Using Jensen's inequality on the inner expectation on the L.H.S. of Equation (A5), we have

$$\mathbb{E}\_{\boldsymbol{y}\sim p(\boldsymbol{y})}\left[\left(\mathbb{E}\_{\boldsymbol{x}\sim p(\boldsymbol{x}|\boldsymbol{y})}[h(\boldsymbol{x})]\right)^{2}\right] \leq \mathbb{E}\_{\boldsymbol{y}\sim p(\boldsymbol{y})}\left[\mathbb{E}\_{\boldsymbol{x}\sim p(\boldsymbol{x}|\boldsymbol{y})}[h(\boldsymbol{x})^{2}]\right] = \mathbb{E}\_{\boldsymbol{x}\sim p(\boldsymbol{x})}[h(\boldsymbol{x})^{2}]\tag{A6}$$

The equality holds only when *h*(*x*) is a constant. Since we require that *h*(*x*) is not a constant, the equality cannot be attained.

Similarly, using Jensen's inequality on the R.H.S. of Equation (A5), we have that

$$\mathbb{E}\_{\mathbf{x}\sim p(\mathbf{x})}[h(\mathbf{x})^2] \, > \, \left(\mathbb{E}\_{\mathbf{x}\sim p(\mathbf{x})}[h(\mathbf{x})]\right)^2$$

where we have used the requirement that *h*(*x*) cannot be constant.


Under the constraint that *Y* is not independent of *X*, we can divide both sides of Equation (A5) by the factor multiplying *β* and obtain the condition: there exists an *h*(*x*) such that

$$\beta > \frac{\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})}[h(\mathbf{x})^2] - \left(\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})}[h(\mathbf{x})]\right)^2}{\mathbb{E}\_{\mathbf{y} \sim p(\mathbf{y})}\left[\left(\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x}|\mathbf{y})}[h(\mathbf{x})]\right)^2\right] - \left(\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})}[h(\mathbf{x})]\right)^2}$$

i.e.,

$$\beta > \inf\_{h(\mathbf{x})} \frac{\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})} \left[ h(\mathbf{x})^2 \right] - \left( \mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})} \left[ h(\mathbf{x}) \right] \right)^2}{\mathbb{E}\_{\mathbf{y} \sim p(\mathbf{y})} \left[ \left( \mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x}|\mathbf{y})} \left[ h(\mathbf{x}) \right] \right)^2 \right] - \left( \mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})} \left[ h(\mathbf{x}) \right] \right)^2}$$

which proves the condition of Theorem 3.
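For a discrete joint *p*(*x*, *y*), the quantity *β*0[*h*(*x*)] in this condition can be computed directly. A minimal sketch for binary symmetric label noise with flip probability *ρ* (an illustrative choice; for binary *X*, the ratio is the same for every non-constant *h*, since numerator and denominator are both variances and hence affine-invariant in *h*):

```python
import numpy as np

rho = 0.2
pxy = np.array([[(1 - rho) / 2, rho / 2],
                [rho / 2, (1 - rho) / 2]])   # p(x, y)
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

def beta0_of_h(h):
    """β0[h(x)] from Theorem 3 for a discrete p(x, y): Var(h) / Var_y(E[h|y])."""
    Eh  = (px * h).sum()
    Eh2 = (px * h**2).sum()
    cond = (pxy * h[:, None]).sum(axis=0) / py   # E[h(x) | y]
    return (Eh2 - Eh**2) / ((py * cond**2).sum() - Eh**2)

h = np.array([0.0, 1.0])   # any non-constant h works for binary X
print(beta0_of_h(h))       # ≈ 2.7778 = 1/(1 - 2ρ)² for ρ = 0.2
```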

Furthermore, from Equation (A6) we have

$$
\beta\_0[h(\mathbf{x})] > 1
$$

for any *h*(*x*) that is not constant, which satisfies the necessary condition *β* > 1 from Section 3.

**Proof of the lower bound on the slope of the Pareto frontier at the origin:** Now we prove the second statement of Theorem 3. Since *δI*(*X*; *Z*) = 0 and *δI*(*Y*; *Z*) = 0 at the trivial representation according to Lemma 2, we have (Δ*I*(*Y*; *Z*)/Δ*I*(*X*; *Z*))<sup>−1</sup> = (*δ*²*I*(*Y*; *Z*)/*δ*²*I*(*X*; *Z*))<sup>−1</sup>. Substituting the expressions for *δ*²*I*(*Y*; *Z*) and *δ*²*I*(*X*; *Z*) from Lemma A1, we have

$$\left( \frac{\Delta I(Y;Z)}{\Delta I(X;Z)} \right)^{-1} = \left( \frac{\delta^2 I(Y;Z)}{\delta^2 I(X;Z)} \right)^{-1} = \frac{\int d\mathbf{x}dz \frac{p(\mathbf{x})^2}{p(\mathbf{x})p(z)} h\_1(z|\mathbf{x})^2 - \int d\mathbf{x}d\mathbf{x}'dz \frac{p(\mathbf{x})p(\mathbf{x}')}{p(z)} h\_1(z|\mathbf{x})h\_1(z|\mathbf{x}')}{\int d\mathbf{x}d\mathbf{x}'dydz \frac{p(\mathbf{x},y)p(\mathbf{x}',y)}{p(y)p(z)} h\_1(z|\mathbf{x})h\_1(z|\mathbf{x}') - \int d\mathbf{x}d\mathbf{x}'dz \frac{p(\mathbf{x})p(\mathbf{x}')}{p(z)} h\_1(z|\mathbf{x})h\_1(z|\mathbf{x}')}$$

For $h\_1(z|\mathbf{x}) = h(\mathbf{x})h\_2(z)$, the common factor $\int \frac{h\_2^2(z)}{p(z)}dz$ cancels between numerator and denominator, giving

$$\left( \frac{\Delta I(Y;Z)}{\Delta I(X;Z)} \right)^{-1} = \frac{\mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})}[h(\mathbf{x})^2] - \left( \mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})}[h(\mathbf{x})] \right)^2}{\mathbb{E}\_{y \sim p(y)}\left[ \left( \mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x}|y)}[h(\mathbf{x})] \right)^2 \right] - \left( \mathbb{E}\_{\mathbf{x} \sim p(\mathbf{x})}[h(\mathbf{x})] \right)^2} = \beta\_0[h(\mathbf{x})]$$

Therefore, (inf*h*(*x*) *β*0[*h*(*x*)])<sup>−1</sup> gives the largest slope of Δ*I*(*Y*; *Z*) vs. Δ*I*(*X*; *Z*) over perturbation functions of the form *h*1(*z*|*x*) = *h*(*x*)*h*2(*z*) satisfying ∫ *h*2(*z*)*dz* = 0 and ∫ (*h*2²(*z*)/*p*(*z*))*dz* > 0, which is a lower bound on the slope of Δ*I*(*Y*; *Z*) vs. Δ*I*(*X*; *Z*) over all possible perturbation functions *h*1(*z*|*x*). The latter is the slope of the Pareto frontier of the *I*(*Y*; *Z*) vs. *I*(*X*; *Z*) curve at the origin.

**Inflection point for general** *Z***:** If we *do not* assume that *Z* is at the origin of the information plane, but at some general stationary solution *<sup>Z</sup>*<sup>∗</sup> with *<sup>p</sup>*(*z*|*x*), we define

*<sup>β</sup>*(2) [*h*(*x*)] = *δ*<sup>2</sup> *I*(*Y*; *Z*) *δ*<sup>2</sup> *I*(*X*; *Z*) <sup>−</sup><sup>1</sup> = 2 2 *dxdz <sup>p</sup>*(*x*)<sup>2</sup> *<sup>p</sup>*(*x*,*z*) *<sup>h</sup>*(*z*|*x*)<sup>2</sup> <sup>−</sup> 2 2 *dxdx dz <sup>p</sup>*(*x*)*p*(*x* ) *<sup>p</sup>*(*z*) *<sup>h</sup>*(*z*|*x*)*h*(*z*|*x* ) 2 2 *dxdxdydz <sup>p</sup>*(*x*,*y*)*p*(*x*,*y*) *<sup>p</sup>*(*y*,*z*) *<sup>h</sup>*(*z*|*x*)*h*(*z*|*x*) <sup>−</sup> 2 2 *dxdxdz <sup>p</sup>*(*x*)*p*(*x*) *<sup>p</sup>*(*z*) *<sup>h</sup>*(*z*|*x*)*h*(*z*|*x*) = *dxdz <sup>p</sup>*(*x*)<sup>2</sup> *<sup>p</sup>*(*x*,*z*) *<sup>h</sup>*(*z*|*x*)<sup>2</sup> <sup>−</sup> *dxdx dz <sup>p</sup>*(*x*)*p*(*x* ) *<sup>p</sup>*(*z*) *<sup>h</sup>*(*z*|*x*)*h*(*z*|*x* ) *dxdxdydz <sup>p</sup>*(*x*,*y*)*p*(*x*,*y*) *<sup>p</sup>*(*y*,*z*) *<sup>h</sup>*(*z*|*x*)*h*(*z*|*x*) <sup>−</sup> *dxdxdz <sup>p</sup>*(*x*)*p*(*x*) *<sup>p</sup>*(*z*) *<sup>h</sup>*(*z*|*x*)*h*(*z*|*x*) = *dz p*(*z*) *dx <sup>p</sup>*(*x*)<sup>2</sup> *<sup>p</sup>*(*x*|*z*) *<sup>h</sup>*(*z*|*x*)<sup>2</sup> <sup>−</sup> ( *dxp*(*x*)*h*(*z*|*x*)) 2 *dz p*(*z*) *dy <sup>p</sup>*(*y*|*z*) ( *dxp*(*x*, *<sup>y</sup>*)*h*(*z*|*x*)) <sup>2</sup> <sup>−</sup> ( *dxp*(*x*)*h*(*z*|*x*)) 2 = *dz p*(*z*) *dx <sup>p</sup>*(*x*)<sup>2</sup> *<sup>p</sup>*(*x*|*z*) *<sup>h</sup>*(*z*|*x*)<sup>2</sup> ( *dxp*(*x*)*h*(*z*|*x*)) <sup>2</sup> − 1 *dz p*(*z*) *dy <sup>p</sup>*(*y*|*z*) ( *dxp*(*x*,*y*)*h*(*z*|*x*)) 2 ( *dxp*(*x*)*h*(*z*|*x*)) <sup>2</sup> − 1 = *dz dx <sup>p</sup>*(*x*) *<sup>p</sup>*(*z*|*x*) *<sup>h</sup>*(*z*|*x*)<sup>2</sup> ( *dxp*(*x*)*h*(*z*|*x*)) <sup>2</sup> <sup>−</sup> <sup>1</sup> *p*(*z*) *dz dy <sup>p</sup>*(*z*|*y*)*p*(*y*) ( *dxp*(*x*,*y*)*h*(*z*|*x*)) 2 ( *dxp*(*x*)*h*(*z*|*x*)) <sup>2</sup> <sup>−</sup> <sup>1</sup> *p*(*z*) = *dz dx <sup>p</sup>*(*x*) *<sup>p</sup>*(*z*|*x*) *<sup>h</sup>*(*z*|*x*)<sup>2</sup> <sup>−</sup> <sup>1</sup> *<sup>p</sup>*(*z*)( *dxp*(*x*)*h*(*z*|*x*))<sup>2</sup> *dz dy 
<sup>p</sup>*(*z*|*y*)*p*(*y*) ( *dxp*(*x*, *<sup>y</sup>*)*h*(*z*|*x*)) <sup>2</sup> <sup>−</sup> <sup>1</sup> *<sup>p</sup>*(*z*) ( *dxp*(*x*)*h*(*z*|*x*)) 2 

which reduces to *<sup>β</sup>*0[*h*(*x*)] when *<sup>p</sup>*(*z*|*x*) = *<sup>p</sup>*(*z*). When

$$\beta > \inf\_{h(z|\mathbf{x})} \beta^{(2)}[h(z|\mathbf{x})] \tag{A7}$$

the current stationary solution $Z^*$ becomes unstable (it is no longer a minimum), and there exists another *Z* that achieves a lower IB*β*(*X*,*Y*; *Z*) than the current $Z^*$.

#### *Appendix A.8. What IB First Learns at Its Onset of Learning*

In this section, we prove that at the onset of learning, if we let $h(z|\mathbf{x}) = h^*(\mathbf{x})h_2(z)$, then

$$p\_{\beta}(y|\mathbf{x}) = p(y) + \epsilon^{2} \mathbb{C}\_{z}(h^{\*}(\mathbf{x}) - \overline{h}\_{x}^{\*}) \int p(\mathbf{x}, y)(h^{\*}(\mathbf{x}) - \overline{h}\_{x}^{\*})d\mathbf{x} \tag{A8}$$

where $p_\beta(y|\mathbf{x})$ is the estimate of $p(y|\mathbf{x})$ given by IB at a certain $\beta$, $h^*(\mathbf{x}) = \arg\inf_{h(\mathbf{x})} \beta_0[h(\mathbf{x})]$, $\overline{h}^*_x = \int h^*(\mathbf{x})p(\mathbf{x})d\mathbf{x}$, and $C_z = \int \frac{h_2^2(z)}{p(z)}dz$ is a constant.

*Entropy* **2019**, *21*, 924

**Proof.** In IB, we use *<sup>p</sup>β*(*z*|*x*) to obtain *<sup>Z</sup>* from *<sup>X</sup>*, then obtain the prediction of *<sup>Y</sup>* from *<sup>Z</sup>* using *<sup>p</sup>β*(*y*|*z*). Here we use subscript *<sup>β</sup>* to denote the probability (density) at the optimum of IB*β*[*p*(*z*|*x*)] at a specific *β*. We have

$$\begin{aligned} p\_{\beta}(y|\mathbf{x}) &= \int p\_{\beta}(y|z)p\_{\beta}(z|\mathbf{x})dz \\ &= \int dz \frac{p\_{\beta}(y,z)p\_{\beta}(z|\mathbf{x})}{p\_{\beta}(z)} \\ &= \int dz \frac{p\_{\beta}(z|\mathbf{x})}{p\_{\beta}(z)} \int p(\mathbf{x}',y)p\_{\beta}(z|\mathbf{x}')d\mathbf{x}' \end{aligned}$$

When we have a small perturbation $\epsilon \cdot h(z|\mathbf{x})$ at the trivial representation, $p_\beta(z|\mathbf{x}) = p_{\beta_0}(z) + \epsilon \cdot h(z|\mathbf{x})$, we have $p_\beta(z) = p_{\beta_0}(z) + \epsilon \cdot \int h(z|\mathbf{x})p(\mathbf{x})d\mathbf{x}$. Substituting, we have

$$\begin{aligned} p_{\beta}(y|\mathbf{x}) &= \int dz\, \frac{p_{\beta_{0}}(z)\left(1 + \epsilon\cdot\frac{h(z|\mathbf{x})}{p_{\beta_{0}}(z)}\right)}{p_{\beta_{0}}(z)\left(1 + \epsilon\cdot\frac{\int h(z|\mathbf{x}'')p(\mathbf{x}'')d\mathbf{x}''}{p_{\beta_{0}}(z)}\right)} \int p(\mathbf{x}',y)\,p_{\beta_{0}}(z)\left(1 + \epsilon\cdot\frac{h(z|\mathbf{x}')}{p_{\beta_{0}}(z)}\right)d\mathbf{x}' \\ &= \int dz\, \frac{1 + \epsilon\cdot\frac{h(z|\mathbf{x})}{p_{\beta_{0}}(z)}}{1 + \epsilon\cdot\frac{\int h(z|\mathbf{x}'')p(\mathbf{x}'')d\mathbf{x}''}{p_{\beta_{0}}(z)}} \int p(\mathbf{x}',y)\,p_{\beta_{0}}(z)\left(1 + \epsilon\cdot\frac{h(z|\mathbf{x}')}{p_{\beta_{0}}(z)}\right)d\mathbf{x}' \end{aligned}$$

The 0th-order term is $\int dz\, d\mathbf{x}'\, p(\mathbf{x}', y)\, p_{\beta_0}(z) = p(y)$. The first-order term is

$$\begin{aligned} \delta p_{\beta}(y|\mathbf{x}) &= \epsilon \cdot \int dz\, d\mathbf{x}' \left( h(z|\mathbf{x}) + h(z|\mathbf{x}') - \int h(z|\mathbf{x}'')p(\mathbf{x}'')d\mathbf{x}'' \right) p(\mathbf{x}', y) \\ &= \epsilon \cdot p(y)\int dz\, h(z|\mathbf{x}) + \epsilon \cdot \int d\mathbf{x}'\, p(\mathbf{x}', y) \int dz\, h(z|\mathbf{x}') - \epsilon \cdot p(y) \int d\mathbf{x}''\, p(\mathbf{x}'') \int dz\, h(z|\mathbf{x}'') \\ &= 0 \end{aligned}$$

since we have $\int h(z|\mathbf{x})dz = 0$ for any $\mathbf{x}$.

For the second-order term, using $h(z|\mathbf{x}) = h^*(\mathbf{x})h_2(z)$ and $C_z = \int \frac{h_2^2(z)}{p_{\beta_0}(z)}dz$, it is

$$\begin{aligned} \delta^2 p_\beta(y|\mathbf{x}) &= \epsilon^2\cdot\int dz\,\frac{\left(\int h(z|\mathbf{x})p(\mathbf{x})d\mathbf{x}\right)^2}{p_{\beta_0}(z)^2}\int p(\mathbf{x}',y)\,p_{\beta_0}(z)\,d\mathbf{x}' - \epsilon^2\cdot\int dz\,\frac{h(z|\mathbf{x})\int h(z|\mathbf{x})p(\mathbf{x})d\mathbf{x}}{p_{\beta_0}(z)^2}\int p(\mathbf{x}',y)\,p_{\beta_0}(z)\,d\mathbf{x}' \\ &\quad + \epsilon^2\cdot\int dz\left(h(z|\mathbf{x}) - \int h(z|\mathbf{x})p(\mathbf{x})d\mathbf{x}\right)\int p(\mathbf{x}',y)\,\frac{h(z|\mathbf{x}')}{p_{\beta_0}(z)}\,d\mathbf{x}' \\ &= \epsilon^2 C_z\left(\int h^*(\mathbf{x})p(\mathbf{x})d\mathbf{x}\right)^2 p(y) - \epsilon^2 C_z\, h^*(\mathbf{x})\int h^*(\mathbf{x})p(\mathbf{x})d\mathbf{x}\; p(y) \\ &\quad + \epsilon^2 C_z\, h^*(\mathbf{x})\int p(\mathbf{x}',y)h^*(\mathbf{x}')d\mathbf{x}' - \epsilon^2 C_z\int h^*(\mathbf{x})p(\mathbf{x})d\mathbf{x}\int p(\mathbf{x}',y)h^*(\mathbf{x}')d\mathbf{x}' \\ &= \epsilon^2 C_z\left(h^*(\mathbf{x}) - \overline{h}^*_x\right)\left(\int p(\mathbf{x}',y)h^*(\mathbf{x}')d\mathbf{x}' - \overline{h}^*_x\, p(y)\right) \\ &= \epsilon^2 C_z\left(h^*(\mathbf{x}) - \overline{h}^*_x\right)\int p(\mathbf{x}',y)\left(h^*(\mathbf{x}') - \overline{h}^*_x\right)d\mathbf{x}' \end{aligned}$$

where $\overline{h}^*_x = \int h^*(\mathbf{x})p(\mathbf{x})d\mathbf{x}$. Combining everything, we have, up to the second order,

$$p\_{\beta}(y|\mathbf{x}) = p(y) + \epsilon^2 \mathbb{C}\_z(h^\*(\mathbf{x}) - \overline{h}\_{\mathbf{x}}^\*) \int p(\mathbf{x}, y) (h^\*(\mathbf{x}) - \overline{h}\_{\mathbf{x}}^\*) d\mathbf{x}$$

#### *Appendix A.9. Proof of Theorem 4*

**Proof.** According to Theorem 3, a sufficient condition for (*X*,*Y*) to be IB*β*-learnable is that *<sup>X</sup>* and *<sup>Y</sup>* are not independent, and

$$\beta > \inf\_{h(\boldsymbol{x})} \frac{\frac{\mathbb{E}\_{\boldsymbol{x} \sim p(\boldsymbol{x})}[h(\boldsymbol{x})^2]}{\left(\mathbb{E}\_{\boldsymbol{x} \sim p(\boldsymbol{x})}[h(\boldsymbol{x})]\right)^2} - 1}{\mathbb{E}\_{\boldsymbol{y} \sim p(\boldsymbol{y})}\left[\left(\frac{\mathbb{E}\_{\boldsymbol{x} \sim p(\boldsymbol{x}|\boldsymbol{y})}[h(\boldsymbol{x})]}{\mathbb{E}\_{\boldsymbol{x} \sim p(\boldsymbol{x})}[h(\boldsymbol{x})]}\right)^2\right] - 1} \tag{A9}$$

We can assume a specific form of *h*(*x*), and obtain a (potentially stronger) sufficient condition. Specifically, we let

$$h(\mathbf{x}) = \begin{cases} 1, & \mathbf{x} \in \Omega\_{\mathbf{x}} \\ 0, & \text{otherwise} \end{cases} \tag{A10}$$

for a certain $\Omega_x \subset \mathcal{X}$. Substituting Equation (A10) into Equation (A9), we have that a sufficient condition for (*X*,*Y*) to be IB*β*-learnable is

$$\beta > \inf_{\Omega_x \subset \mathcal{X}} \frac{\frac{p(\Omega_x)}{p(\Omega_x)^2} - 1}{\int dy\, p(y) \left( \frac{\int_{\mathbf{x} \in \Omega_x} p(\mathbf{x}|y)\,d\mathbf{x}}{p(\Omega_x)} \right)^2 - 1} > 0 \tag{A11}$$

where $p(\Omega_x) = \int_{\mathbf{x}\in\Omega_x} p(\mathbf{x})\,d\mathbf{x}$.
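For a discrete *X* with small support, the infimum over $\Omega_x$ in Equation (A11) can be evaluated by brute force over all non-empty proper subsets. A minimal sketch, with an illustrative joint distribution (not from the paper):

```python
import numpy as np
from itertools import combinations

# Hypothetical discrete joint p(x, y): 4 x-values (rows), 2 y-values (columns).
p_xy = np.array([[0.30, 0.05],
                 [0.25, 0.05],
                 [0.05, 0.15],
                 [0.05, 0.10]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def beta0_subset(omega):
    """Expression inside the infimum of Equation (A11)/(A12) for Omega_x = omega."""
    idx = list(omega)
    p_omega = p_x[idx].sum()
    numerator = 1.0 / p_omega - 1.0
    # p(y | Omega_x) = p(Omega_x, y) / p(Omega_x); denominator per Appendix A.9.
    p_y_given_omega = p_xy[idx].sum(axis=0) / p_omega
    denominator = np.sum(p_y_given_omega ** 2 / p_y) - 1.0
    return numerator / denominator if denominator > 0 else np.inf

# Search all non-empty proper subsets of the 4-point support of X.
subsets = [c for r in range(1, 4) for c in combinations(range(4), r)]
beta0_hat = min(beta0_subset(c) for c in subsets)
```

Since the indicator functions are a restricted class of $h(\mathbf{x})$, `beta0_hat` upper-bounds the bound obtainable from Theorem 3, consistent with the discussion below Equation (A12).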

The denominator of Equation (A11) is

$$\begin{aligned} &\int dy\, p(y) \left(\frac{\int_{\mathbf{x}\in\Omega_x} p(\mathbf{x}|y)\,d\mathbf{x}}{p(\Omega_x)}\right)^2 - 1 \\ &= \int dy\, p(y) \left(\frac{p(\Omega_x|y)}{p(\Omega_x)}\right)^2 - 1 \\ &= \int dy\, \frac{p(y|\Omega_x)^2}{p(y)} - 1 \\ &= \mathbb{E}_{y\sim p(y|\Omega_x)} \left[\frac{p(y|\Omega_x)}{p(y)} - 1\right] \end{aligned}$$

Using the inequality *x* − 1 ≥ log *x*, we have

$$\mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \frac{p(y|\Omega\_x)}{p(y)} - 1 \right] \ge \mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \log \frac{p(y|\Omega\_x)}{p(y)} \right] \ge 0$$

Both equalities hold iff $p(y|\Omega_x) \equiv p(y)$, in which case the denominator of Equation (A11) equals 0, the expression inside the infimum diverges, and it does not contribute to the infimum. Except for this scenario, the denominator is greater than 0. Substituting into Equation (A11), we have that a sufficient condition for (*X*,*Y*) to be IB*β*-learnable is

$$\beta > \inf\_{\Omega\_x \subset \mathcal{X}} \frac{\frac{p(\Omega\_x)}{p(\Omega\_x)^2} - 1}{\mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \frac{p(y|\Omega\_x)}{p(y)} - 1 \right]} \tag{A12}$$


Since $\Omega_x$ is a proper subset of $\mathcal{X}$, by the definition of $h(\mathbf{x})$ in Equation (A10), $h(\mathbf{x})$ is not constant on the entire $\mathcal{X}$. Hence the numerator of Equation (A12) is positive. Since its denominator is also positive, we can drop the "> 0" condition and obtain the condition in Theorem 4.

Since the class of $h(\mathbf{x})$ used in this theorem is a subset of the class of $h(\mathbf{x})$ used in Theorem 3, the infimum in Equation (5) is greater than or equal to the infimum in Equation (2). Therefore, according to the second statement of Theorem 3, $\left(\inf_{\Omega_x\subset\mathcal{X}}\beta_0(\Omega_x)\right)^{-1}$ is also a lower bound on the slope of the Pareto frontier of the *I*(*Y*; *Z*) vs. *I*(*X*; *Z*) curve.

Now we prove that the condition in Equation (5) is invariant to invertible mappings of *X*. In fact, let $X' = g(X)$ be a uniquely invertible map (if *X* is continuous, *g* is additionally required to be continuous), and denote $g(\Omega_x) \equiv \{g(\mathbf{x})\,|\,\mathbf{x} \in \Omega_x\}$ for any $\Omega_x \subset \mathcal{X}$. We have $p(g(\Omega_x)) = p(\Omega_x)$ and $p(y|g(\Omega_x)) = p(y|\Omega_x)$. Then for the dataset $(X', Y)$, letting $\Omega'_x = g(\Omega_x)$, we have

$$\frac{\frac{1}{p(\Omega\_x^{\prime})} - 1}{\mathbb{E}\_{y \sim p(y|\Omega\_x^{\prime})} \left[ \frac{p(y|\Omega\_x^{\prime})}{p(y)} - 1 \right]} = \frac{\frac{1}{p(\Omega\_x)} - 1}{\mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \frac{p(y|\Omega\_x)}{p(y)} - 1 \right]} \tag{A13}$$

Additionally, we have $\mathcal{X}' = g(\mathcal{X})$. Then

$$\inf\_{\Omega\_x^{\prime} \subset \mathcal{X}^{\prime}} \frac{\frac{1}{p(\Omega\_x^{\prime})} - 1}{\mathbb{E}\_{y \sim p(y|\Omega\_x^{\prime})} \left[ \frac{p(y|\Omega\_x^{\prime})}{p(y)} - 1 \right]} = \inf\_{\Omega\_x \subset \mathcal{X}} \frac{\frac{1}{p(\Omega\_x)} - 1}{\mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \frac{p(y|\Omega\_x)}{p(y)} - 1 \right]} \tag{A14}$$

For dataset (*X* ,*Y*)=(*g*(*X*),*Y*), applying Theorem 4 we have that a sufficient condition for it to be IB*β*-learnable is

$$\beta > \inf\_{\Omega\_x^\prime \subset \mathcal{X}^\prime} \frac{\frac{1}{p(\Omega\_x^\prime)} - 1}{\mathbb{E}\_{y \sim p(y|\Omega\_x^\prime)} \left[ \frac{p(y|\Omega\_x^\prime)}{p(y)} - 1 \right]} = \inf\_{\Omega\_x \subset \mathcal{X}} \frac{\frac{1}{p(\Omega\_x)} - 1}{\mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \frac{p(y|\Omega\_x)}{p(y)} - 1 \right]} \tag{A15}$$

where the equality is due to Equation (A14). Comparing with the condition for IB*β*-learnability for (*X*,*Y*) (Equation (5)), we see that they are the same. Therefore, the condition given by Theorem 4 is invariant to invertible mapping of *X*.

#### *Appendix A.10. Proof of Corollary 1 and Corollary 2*

#### Appendix A.10.1. Proof of Corollary 1

**Proof.** We use Theorem 4. Let $\Omega_x$ contain all elements $\mathbf{x}$ whose true class is $y^*$, for some fixed $y^*$. Then we obtain a (potentially stronger) sufficient condition. Since the label noise is class-conditional, $p(y|y^*, \mathbf{x}) = p(y|y^*)$, and we have

$$\inf\_{\Omega\_x \subset \mathcal{X}} \frac{\frac{1}{p(\Omega\_x)} - 1}{\mathbb{E}\_{y \sim p(y|\Omega\_x)} \left[ \frac{p(y|\Omega\_x)}{p(y)} - 1 \right]}$$

$$= \inf\_{y^\*} \frac{\frac{1}{p(y^\*)} - 1}{\mathbb{E}\_{y \sim p(y|y^\*)} \left[ \frac{p(y|y^\*)}{p(y)} - 1 \right]}$$

By requiring

$$\beta > \inf_{y^*} \frac{\frac{1}{p(y^*)} - 1}{\mathbb{E}_{y\sim p(y|y^*)}\left[\frac{p(y|y^*)}{p(y)} - 1\right]}$$

we obtain a sufficient condition for IB*β*-learnability.

#### Appendix A.10.2. Proof of Corollary 2

**Proof.** We again use Theorem 4. Since *Y* is a deterministic function of *X*, let $Y = f(X)$. By the assumption that *Y* contains at least one value *y* with $p(y) > 0$, we let $\Omega_x$ contain only those $\mathbf{x}$ with $f(\mathbf{x}) = y$. Substituting into Equation (5), we note that $p(y|\Omega_x) = 1$ and $p(\Omega_x) = p(y)$, so the expression inside the infimum evaluates to

$$\frac{\frac{1}{p(\Omega_x)} - 1}{\mathbb{E}_{y'\sim p(y'|\Omega_x)}\left[\frac{p(y'|\Omega_x)}{p(y')} - 1\right]} = \frac{\frac{1}{p(y)} - 1}{\frac{1}{p(y)} - 1} = 1$$

Therefore, the sufficient condition becomes $\beta > 1$.
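Both corollaries can be checked numerically. The sketch below (the deterministic mapping and probabilities are illustrative assumptions) verifies that, for $\Omega_x = f^{-1}(y^*)$, the expression inside the infimum of Equation (5) collapses to 1, as in the proof of Corollary 2:

```python
import numpy as np

# Hypothetical deterministic mapping Y = f(X) over 4 x-values and 2 classes.
f = np.array([0, 0, 1, 1])
p_x = np.array([0.4, 0.2, 0.3, 0.1])
p_y = np.array([p_x[f == c].sum() for c in range(2)])

# Corollary 2: choose Omega_x = f^{-1}(y*) for a class y* with p(y*) > 0.
y_star = 0
omega = np.flatnonzero(f == y_star)
p_omega = p_x[omega].sum()                    # equals p(y*) since Y = f(X)
p_y_given_omega = np.zeros(2)
p_y_given_omega[y_star] = 1.0                 # labels inside Omega_x are all y*

numerator = 1.0 / p_omega - 1.0
denominator = np.sum(p_y_given_omega ** 2 / p_y) - 1.0   # = 1/p(y*) - 1
ratio = numerator / denominator               # the expression collapses to 1
```

The same value is obtained for either class, so the infimum is 1 and the learnability threshold is $\beta_0 = 1$ in the deterministic case.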

#### *Appendix A.11. β0, Hypercontractivity Coefficient, Contraction Coefficient, β0[h(x)], and Maximum Correlation*

In this section, we prove the relations between the IB-Learnability threshold *β*0, the hypercontractivity coefficient *<sup>ξ</sup>*(*X*;*Y*), the contraction coefficient *<sup>η</sup>*KL(*p*(*y*|*x*), *<sup>p</sup>*(*x*)), *<sup>β</sup>*0[*h*(*x*)] in Equation (2), and maximum correlation *ρm*(*X*,*Y*), as follows:

$$\frac{1}{\beta_0} = \xi(X;Y) = \eta_{\text{KL}}(p(y|\mathbf{x}), p(\mathbf{x})) \ge \sup_{h(\mathbf{x})} \frac{1}{\beta_0[h(\mathbf{x})]} = \rho_m^2(X;Y) \tag{A16}$$

**Proof.** The hypercontractivity coefficient *ξ* is defined as [16]:

$$\xi(X;Y) \equiv \sup\_{Z-X-Y} \frac{I(Y;Z)}{I(X;Z)}$$

By our definition of IB-learnability, (*X*, *Y*) is IB-Learnable iff there exists *Z* obeying the Markov chain *Z* − *X* − *Y*, such that

$$I(X;Z) - \beta \cdot I(\mathcal{Y};Z) < 0 = I B\_{\beta}(X,\mathcal{Y};Z)|\_{p(z|x) = p(z)}$$

Or equivalently there exists *Z* obeying the Markov chain *Z* − *X* − *Y* such that

$$0 < \frac{1}{\beta} < \frac{I(\mathbf{Y}; Z)}{I(X; Z)}\tag{A17}$$

By Theorem 1, the IB-Learnability region for *<sup>β</sup>* is (*β*0, +∞), or equivalently the IB-Learnability region for 1/*β* is

$$0 < \frac{1}{\beta} < \frac{1}{\beta_0} \tag{A18}$$

Comparing Equations (A17) and (A18), we have that

$$\frac{1}{\beta_0} = \sup_{Z - X - Y} \frac{I(Y;Z)}{I(X;Z)} = \xi(X;Y) \tag{A19}$$

In Anantharam et al. [16], the authors prove that

$$\xi(\mathbf{X};\mathbf{Y}) = \eta\_{\text{KL}}(p(y|\mathbf{x}), p(\mathbf{x})) \tag{A20}$$

where the contraction coefficient *<sup>η</sup>*KL(*p*(*y*|*x*), *<sup>p</sup>*(*x*)) is defined as

$$\eta\_{\mathrm{KL}}(p(y|\mathbf{x}), p(\mathbf{x})) = \sup\_{r(\mathbf{x}) \neq p(\mathbf{x})} \frac{\mathbb{D}\_{\mathrm{KL}}(r(y)||p(y))}{\mathbb{D}\_{\mathrm{KL}}(r(\mathbf{x})||p(\mathbf{x}))}$$

where $p(y) = \mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}[p(y|\mathbf{x})]$ and $r(y) = \mathbb{E}_{\mathbf{x}\sim r(\mathbf{x})}[p(y|\mathbf{x})]$. Treating $p(y|\mathbf{x})$ as a channel, the contraction coefficient measures how much closer the two distributions $r(\cdot)$ and $p(\cdot)$ become (as measured by the KL divergence) after passing through the channel.

In Anantharam et al. [16], the authors also provide a counterexample to an earlier result by Erkip and Cover [31] that incorrectly proved *ξ*(*X*;*Y*) = *ρ*<sup>2</sup> *<sup>m</sup>*(*X*;*Y*). In the specific counterexample Anantharam et al. [16] design, *ξ*(*X*;*Y*) > *ρ*<sup>2</sup> *<sup>m</sup>*(*X*;*Y*).

The maximum correlation is defined as *<sup>ρ</sup>m*(*X*;*Y*) <sup>≡</sup> max*<sup>f</sup>* ,*<sup>g</sup>* <sup>E</sup>[ *<sup>f</sup>*(*X*)*g*(*Y*)] where *<sup>f</sup>*(*X*) and *<sup>g</sup>*(*Y*) are real-valued random variables such that E[ *f*(*X*)] = E[*g*(*Y*)] = 0 and E[ *f* <sup>2</sup>(*X*)] = E[*g*2(*Y*)] = 1 [20,21].
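For discrete variables, $\rho_m(X;Y)$ can be computed as the second-largest singular value of the matrix $Q$ with entries $Q_{xy} = p(x,y)/\sqrt{p(x)p(y)}$, a standard consequence of Rényi's characterization. A minimal sketch with an illustrative joint distribution:

```python
import numpy as np

# Hypothetical discrete joint p(x, y) (rows: x, columns: y).
p_xy = np.array([[0.30, 0.05],
                 [0.25, 0.05],
                 [0.05, 0.15],
                 [0.05, 0.10]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Q_{xy} = p(x, y) / sqrt(p(x) p(y)); its top singular value is exactly 1,
# with singular vectors sqrt(p_x) and sqrt(p_y).
Q = p_xy / np.sqrt(np.outer(p_x, p_y))
s = np.linalg.svd(Q, compute_uv=False)   # singular values in descending order
rho_m = s[1]                             # second singular value = rho_m(X;Y)
```

The corresponding left/right singular vectors, rescaled by $\sqrt{p}$, recover the optimal $f(X)$ and $g(Y)$ in the definition above.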

Now we prove *<sup>ξ</sup>*(*X*;*Y*) <sup>≥</sup> *<sup>ρ</sup>*<sup>2</sup> *<sup>m</sup>*(*X*;*Y*), based on Theorem 3. To see this, we use the alternate characterization of *ρm*(*X*;*Y*) by Rényi [32]:

$$\rho\_m^2(X;Y) = \max\_{f(X): \mathbb{E}[f(X)] = 0, \mathbb{E}[f^2(X)] = 1} \mathbb{E}[\left(\mathbb{E}[f(X)|Y]\right)^2] \tag{A21}$$

Denoting *<sup>h</sup>* <sup>=</sup> <sup>E</sup>*p*(*x*)[*h*(*x*)], we can transform *<sup>β</sup>*0[*h*(*x*)] in Equation (2) as follows:

$$\begin{split} \beta_{0}[h(\mathbf{x})] &= \frac{\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}[h(\mathbf{x})^{2}] - \left(\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}[h(\mathbf{x})]\right)^{2}}{\mathbb{E}_{y\sim p(y)}\left[\left(\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x}|y)}[h(\mathbf{x})]\right)^{2}\right] - \left(\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}[h(\mathbf{x})]\right)^{2}} \\ &= \frac{\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}[h(\mathbf{x})^{2}] - \overline{h}^{2}}{\mathbb{E}_{y\sim p(y)}\left[\left(\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x}|y)}[h(\mathbf{x})]\right)^{2}\right] - \overline{h}^{2}} \\ &= \frac{\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\left[(h(\mathbf{x}) - \overline{h})^{2}\right]}{\mathbb{E}_{y\sim p(y)}\left[\left(\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x}|y)}[h(\mathbf{x}) - \overline{h}]\right)^{2}\right]} \\ &= \frac{1}{\mathbb{E}_{y\sim p(y)}\left[\left(\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x}|y)}[f(\mathbf{x})]\right)^{2}\right]} \end{split}$$

where we denote $f(\mathbf{x}) = \frac{h(\mathbf{x}) - \overline{h}}{\left(\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\left[(h(\mathbf{x}) - \overline{h})^2\right]\right)^{1/2}}$, so that $\mathbb{E}[f(X)] = 0$ and $\mathbb{E}[f^2(X)] = 1$. Combined with Equation (A21), we have

$$\sup\_{h(\mathbf{x})} \frac{1}{\beta\_0[h(\mathbf{x})]} = \rho\_{\mathbf{m}}^2(X;Y) \tag{A22}$$

Our Theorem 3 states that

$$\sup\_{h(x)} \frac{1}{\beta\_0[h(x)]} \le \frac{1}{\beta\_0} \tag{A23}$$

Combining Equations (A19), (A22), and (A23), we have

$$\rho_m^2(X;Y) \le \xi(X;Y) \tag{A24}$$

In summary, the relations among the quantities are:

$$\frac{1}{\beta\_0} = \xi(X;Y) = \eta\_{\text{KL}}(p(y|\mathbf{x}), p(\mathbf{x})) \ge \sup\_{h(\mathbf{x})} \frac{1}{\beta\_0[h(\mathbf{x})]} = \rho\_m^2(X;Y) \tag{A25}$$

#### *Appendix A.12. Experiment Details*

We use the Variational Information Bottleneck (VIB) objective from [5]. For the synthetic experiment, the latent *Z* has dimension 2. The encoder is a neural net with 2 hidden layers, each with 128 ReLU neurons. The last layer has linear activation and 4 output neurons: the first two parameterize the mean of a Gaussian and the last two parameterize the log variance. The decoder is a neural net with 1 hidden layer of 128 ReLU neurons; its last layer has linear activation and outputs the logits for the class labels. The prior is a mixture of 500 Gaussians (256 components for the experiment with class overlap), each a 2D Gaussian with learnable mean and log variance; the mixture weights are also learnable. For the MNIST experiment, the architecture is mostly the same, except that: (1) *Z* has dimension 256; (2) the prior is a standard Gaussian with diagonal covariance matrix.
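A minimal forward-pass sketch of this encoder/decoder pair (in numpy, with the input dimension taken as 2 for illustration; the initialization and all names are assumptions, not the original implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    """One fully connected layer with He-style random init (illustrative only)."""
    return rng.normal(0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

# Encoder: 2 hidden layers of 128 ReLU units; 4 linear outputs = 2D mean + 2D log-variance.
enc = [dense(2, 128), dense(128, 128), dense(128, 4)]
# Decoder: 1 hidden layer of 128 ReLU units; linear logits over 2 classes.
dec = [dense(2, 128), dense(128, 2)]

def encode(x):
    h = np.maximum(0, x @ enc[0][0] + enc[0][1])
    h = np.maximum(0, h @ enc[1][0] + enc[1][1])
    out = h @ enc[2][0] + enc[2][1]
    mu, log_var = out[:, :2], out[:, 2:]
    # Reparameterized sample z ~ N(mu, diag(exp(log_var)))
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    return z, mu, log_var

def decode(z):
    h = np.maximum(0, z @ dec[0][0] + dec[0][1])
    return h @ dec[1][0] + dec[1][1]   # class logits

z, mu, log_var = encode(rng.standard_normal((5, 2)))
logits = decode(z)
```

The VIB loss would combine the cross-entropy of `logits` with the KL term between the encoder distribution and the (mixture) prior, weighted by $1/\beta$ in the parameterization used here.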

For all experiments, we use the Adam [33] optimizer with default parameters and do not add any explicit regularization. We use a learning rate of $10^{-4}$ with a decay factor of $\frac{1}{1 + 0.01\times\text{epoch}}$. We train for 2000 epochs in total with a mini-batch size of 500.

For estimation of the observed $\beta_0$ in Figure 3, in the $I(X;Z)$ vs. $\beta_i$ curve ($\beta_i$ denotes the *i*-th $\beta$), we take the mean and standard deviation of $I(X;Z)$ over the 5 lowest $\beta_i$ values, denoted $\mu_\beta$ and $\sigma_\beta$ ($I(Y;Z)$ behaves similarly, but since we are minimizing $I(X;Z) - \beta \cdot I(Y;Z)$, the onset of nonzero $I(X;Z)$ is less prone to noise). When $I(X;Z)$ exceeds $\mu_\beta + 3\sigma_\beta$, we regard the model as having learned a non-trivial representation, and take the average of $\beta_i$ and $\beta_{i-1}$ as the experimentally estimated onset of learning. We also inspect the curves manually and confirm that this is consistent with human intuition.
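The onset-detection procedure can be sketched on synthetic sweep data; the $I(X;Z)$ curve below is a stand-in with an onset at $\beta = 1.25$, not experimental data:

```python
import numpy as np

# Hypothetical sweep: I(X;Z) sits at a flat noise floor below the onset
# (beta <= 1.25), then grows roughly linearly above it.
betas = np.round(np.linspace(0.8, 2.0, 61), 2)   # step 0.02
noise_floor = 0.002
i_xz = np.where(betas > 1.25, 2.0 * (betas - 1.25), noise_floor)

mu = i_xz[:5].mean()                              # mean over the 5 lowest betas
sigma = i_xz[:5].std()
onset = int(np.argmax(i_xz > mu + 3 * sigma))     # first non-trivial I(X;Z)
beta0_observed = 0.5 * (betas[onset] + betas[onset - 1])
```

On this synthetic curve the estimate lands midway between the last trivial point (1.24) and the first non-trivial point (1.26), i.e., at 1.25.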

For estimating $\beta_0$ using Algorithm 1, at step 6 we use the following discrete search algorithm. We fix $i_{\text{left}} = 1$ and gradually narrow down the range $[a, b]$ for $i_{\text{right}}$, starting from $[1, N]$. At each iteration, we set a tentative new range $[a', b']$, where $a' = 0.8a + 0.2b$ and $b' = 0.2a + 0.8b$, and calculate $\tilde{\beta}_{0,a'} = \textbf{Get}\beta(P_{y|x}, p_y, \Omega_{a'})$ and $\tilde{\beta}_{0,b'} = \textbf{Get}\beta(P_{y|x}, p_y, \Omega_{b'})$, where $\Omega_{a'} = \{1, 2, \dots, a'\}$ and $\Omega_{b'} = \{1, 2, \dots, b'\}$. If $\tilde{\beta}_{0,a'} < \tilde{\beta}_{0,a}$, let $a \leftarrow a'$. If $\tilde{\beta}_{0,b'} < \tilde{\beta}_{0,b}$, let $b \leftarrow b'$. In other words, we narrow down the range for $i_{\text{right}}$ whenever the $\Omega$ given by the tentative left or right boundary yields a lower $\tilde{\beta}_0$ value. The process stops when both $\tilde{\beta}_{0,a}$ and $\tilde{\beta}_{0,b}$ stop improving (which we find always happens when $b = a + 1$), and we return the smaller of the final $\tilde{\beta}_{0,a}$ and $\tilde{\beta}_{0,b}$ as $\tilde{\beta}_0$.
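The narrowing search can be sketched as follows; `get_beta` is a hypothetical stand-in for **Get***β*, the quadratic score is only for illustration, and ceil/floor keep the tentative boundaries integral (the text does not specify the rounding):

```python
import math

def narrowing_search(get_beta, n):
    """Heuristic search over i_right (step 6), with i_left = 1 held fixed.

    get_beta(i) scores the candidate threshold for Omega = {1, ..., i}.
    The range [a, b] is narrowed while a boundary move lowers the score.
    """
    a, b = 1, n
    while b > a + 1:
        a_new = math.ceil(0.8 * a + 0.2 * b)
        b_new = math.floor(0.2 * a + 0.8 * b)
        improved = False
        if get_beta(a_new) < get_beta(a):
            a, improved = a_new, True
        if get_beta(b_new) < get_beta(b):
            b, improved = b_new, True
        if not improved:          # both boundaries stopped improving
            break
    return min(get_beta(a), get_beta(b))

# Hypothetical unimodal score with its minimum (1.0) at i = 37.
beta_tilde = narrowing_search(lambda i: 1.0 + 0.01 * (i - 37) ** 2, 100)
```

As a boundary-shrinking heuristic it returns an approximation of the minimum, not necessarily the exact minimizer; on well-behaved unimodal scores it lands close to it.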

To estimate $p(y|\mathbf{x})$ for (2) Algorithm 1 and (3) $\hat{\eta}_{\text{KL}}$, in both the synthetic and MNIST experiments, we use a 3-layer neural net in which each hidden layer has 128 ReLU neurons. The last layer has linear activation, and the objective is the cross-entropy loss. We use the Adam [33] optimizer with a learning rate of $10^{-4}$ and train for 100 epochs (after which the validation loss stops decreasing).

For estimating $\beta_0$ via (3) $\hat{\eta}_{\text{KL}}$ with the algorithm in [18], we use the code from the GitHub repository provided by the paper (at https://github.com/wgao9/hypercontractivity), using the same $p(y|\mathbf{x})$ employed for (2) Algorithm 1. Since our datasets are classification tasks, we use $A_{ij} = p(y_j|\mathbf{x}_i)/p(y_j)$ instead of the kernel density estimate for the matrix $A$; we take the maximum over 10 runs as the estimate of $\mu$.

#### CIFAR10 Details

We trained a deterministic 28 × 10 wide resnet [34,35], using the open source implementation from Cubuk et al. [36]. However, we extended the final 10 dimensional logits of that model through another 3 layer MLP classifier, in order to keep the inference network architecture identical between this model and the VIB models we describe below. During training, we dynamically added label noise according to the class confusion matrix in Table A1. The mean label noise averaged across the 10 classes is 20%. After that model had converged, we used it to estimate *β*<sup>0</sup> with Algorithm 1. Even with 20% label noise, *β*<sup>0</sup> was estimated to be 1.0483.

**Table A1.** Class confusion matrix used in CIFAR10 experiments. The value in row *i*, column *j* means for class *i*, the probability of labeling it as class *j*. The mean confusion across the classes is 20%.


We then trained 73 different VIB models using the same 28 × 10 wide resnet architecture for the encoder, parameterizing the mean of a 10-dimensional unit-variance Gaussian. Samples from the encoder distribution were fed to the same 3-layer MLP classifier architecture used in the deterministic model. The marginal distributions were mixtures of 500 full-covariance 10-dimensional Gaussians, all of whose parameters were trained. The VIB models had *β* ranging from 1.02 to 2.0 in steps of 0.02, plus an extra set ranging from 1.04 to 1.06 in steps of 0.001 to ensure we captured the empirical *β*<sup>0</sup> with high precision.

However, this particular VIB architecture does not start learning until *β* > 2.5, so none of these models would train as described. (A given architecture trained using maximum likelihood and with no stochastic layers will tend to have higher effective capacity than the same architecture with a stochastic layer that has a fixed but non-trivial variance, even though those two architectures have exactly the same number of learnable parameters.) Instead, we started them all at *β* = 100, and annealed *β* down to the corresponding target over 10,000 training gradient steps. The models continued to train for another 200,000 gradient steps after that. In all cases, the models converged to essentially their final accuracy within 20,000 additional gradient steps after annealing was completed. They were stable over the remaining ∼180,000 gradient steps.
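The annealing procedure can be sketched as a schedule function; the geometric interpolation is an assumption, since the text specifies only the endpoints (*β* = 100 down to the target) and the duration (10,000 gradient steps):

```python
def beta_schedule(step, target_beta, start_beta=100.0, anneal_steps=10_000):
    """Anneal beta from start_beta down to target_beta, then hold it fixed.

    The geometric interpolation between the endpoints is an illustrative
    assumption; only the endpoints and duration are given in the text.
    """
    if step >= anneal_steps:
        return target_beta
    t = step / anneal_steps
    return start_beta * (target_beta / start_beta) ** t

# Example: anneal toward a target beta of 1.05 (one of the swept values).
betas = [beta_schedule(s, target_beta=1.05) for s in (0, 5_000, 10_000, 200_000)]
```

After the 10,000-step anneal the schedule simply holds the target *β* for the remaining 200,000 training steps.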

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Markov Information Bottleneck to Improve Information Flow in Stochastic Neural Networks**

**Thanh Tang Nguyen 1,\*,† and Jaesik Choi 2,\*,‡**

† Part of the work was done at KAIST.


Received: 8 September 2019; Accepted: 30 September 2019; Published: 6 October 2019

**Abstract:** While rate distortion theory compresses data under a distortion constraint, information bottleneck (IB) generalizes rate distortion theory to learning problems by replacing a distortion constraint with a constraint of relevant information. In this work, we further extend IB to multiple Markov bottlenecks (i.e., latent variables that form a Markov chain), namely Markov information bottleneck (MIB), which particularly fits better in the context of stochastic neural networks (SNNs) than the original IB. We show that Markov bottlenecks cannot simultaneously achieve their information optimality in a non-collapse MIB, and thus devise an optimality compromise. With MIB, we take the novel perspective that each layer of an SNN is a bottleneck whose learning goal is to encode relevant information in a compressed form from the data. The inference from a hidden layer to the output layer is then interpreted as a variational approximation to the layer's decoding of relevant information in the MIB. As a consequence of this perspective, the maximum likelihood estimate (MLE) principle in the context of SNNs becomes a special case of the variational MIB. We show that, compared to MLE, the variational MIB can encourage better information flow in SNNs in both principle and practice, and empirically improve performance in classification, adversarial robustness, and multi-modal learning in MNIST.

**Keywords:** information bottleneck; stochastic neural networks; variational inference; machine learning

#### **1. Introduction**

The information bottleneck (IB) principle [1] extracts relevant information about a target variable *Y* from an input variable *X* via a *single* bottleneck variable *Z*. In particular, it constructs a *bottleneck* variable *Z* = *Z*(*X*) that is a *compressed* version of *X* but preserves as much *relevant* information in *X* about *Y* as possible. This principle of introducing relevant information under compression finds vast applications in clustering problems [2], neural network compression [3], disentanglement learning [4–7], and reinforcement learning [8,9]. In addition, there have been many variants of the original IB principle, such as multivariate IB [10], Gaussian IB [11], meta-Gaussian IB [12], deterministic IB [13], and variational IB [14]. Despite these vast applications and variants of IB, alongside the theoretical analysis of the IB principle in neural networks [15,16], the context of stochastic neural networks in which mutual information can be most naturally well-defined [17] has not been sufficiently studied from the IB insight. In this work, we are particularly interested in this context in which multiple stochastic variables are constructed for representation in the form of a Markov chain.

Stochastic neural networks (SNNs) are a general class of neural networks with stochastic neurons in the computation graph. There has been an active line of research in SNNs, including restricted Boltzmann machines (RBMs) [18], deep belief networks (DBNs) [19], sigmoid belief networks (SBNs) [20], and stochastic feed-forward neural networks (SFFNs) [21]. One of the advantages of SNNs is that they can induce rich multi-modal distributions in the output space [20] and enable exploration in reinforcement learning [22]. For learning SNNs (and deep neural networks in general), the maximum likelihood estimate (MLE) principle (in its various forms, such as maximum log-likelihood or Kullback–Leibler divergence) has generally been a de-facto standard. The MLE principle maximizes the likelihood of the model for observing the entire training data. However, this principle is generic and not specially tailored to the hierarchical structure of neural networks. Particularly, MLE treats the entire neural network as a whole without considering the explicit contribution of its hidden layers to model learning. As a result, the information contained within the hidden structure may not be adequately modified to capture the data regularities reflecting a target variable. Thus, it is reasonable to ask if the MLE principle effectively and sufficiently exploits a neural network's representative power, and whether there is a better alternative.

**Contributions**. In this paper, (i) we propose Markov information bottleneck (MIB), a variant of the IB principle for multiple Markov bottlenecks that directly offers an alternative learning principle for SNNs. In MIB, there are multiple bottleneck variables (as opposed to one single bottleneck variable in the original IB) that form a Markov chain. These multiple Markov bottlenecks sequentially extract relevant information for a learning task. From the perspective of MIB, each layer of an SNN is a bottleneck whose information is encoded from the data via the network parameters connecting the layer to the data layer. (ii) We show that in a non-collapse MIB, the information optimality is not simultaneously achievable for all bottlenecks; thus, an optimality compromise is devised. (iii) When applied to SNNs for a learning task, we interpret the inference from a hidden layer to the output layer in SNNs as a variational approximation to that layer's intractable decoding of relevant information. Consequently, the variational MIB in SNNs generalizes the MLE principle. We demonstrate via a simple analytical argument and synthetic experiment that MLE is unable to learn a good information representation, while the variational MIB can. (iv) We then empirically show that MIB improves the performance in classification, adversarial robust learning, and multi-modal learning in the standard hand-digit recognition data MNIST [23]. This work is an extended version of our preprint [24] and the first author's Master thesis [25].

#### **2. Related Work**

There have been many extensions of the original IB framework [1]. One natural consideration is to extend it to continuous variables, albeit under special settings where the optimal information representation is analytic [11,12]. Another direction uses alternative measures for compression and/or relevance in IB [13]. Since the optimal information representation in IB is tractable only in limited settings such as discrete variables [1], Gaussian variables [11], and meta-Gaussian variables [12], scaling the IB solution using neural networks and variational inference has been a very successful extension [14]. The closest extension to our MIB is multivariate IB [10], which defines multi-information to capture the dependence among the elements of a multivariate variable. However, in MIB, we do not focus on capturing such multi-information but rather on the optimal information sequentially processed by a Markov chain of (possibly multivariate) bottleneck variables.

The line of work applying the IB principle to learn information representations in neural networks is also relevant to our approach. For example, Reference [15] proposes using the mutual information of a hidden layer with the input layer and the output layer to quantify the performance of neural networks. However, it is not clear how the IB information optimality changes across multiple bottlenecks in a neural network, nor how we can approximate the IB solutions in this high-dimensional context. In addition, MLE is a standard learning principle for neural networks. It has been shown that the IB principle is mathematically equivalent to the MLE principle in the multinomial mixture model for the clustering problem when the input distribution *X* is uniform or has a large sample size [26]. However, it is also not clear how these two principles are related to each other in the context of neural networks. Moreover, regarding the feasibility of the IB principle for representation learning in neural networks, Reference [17] analyzes two critical issues of mutual information that representation learning might suffer from: it can be ill-defined under deterministic encodings, and it is invariant under bijective transformations. These are inherent properties of mutual information which are also studied in, for example, [7,27,28]. In MIB, we share the same insight as [17] regarding these caveats by considering only the scenario where mutual information is well defined. This also explains our rationale for applying MIB to stochastic neural networks.

Deep learning compression schemes [3,29] loosely bear some similarity to our work: both directions aim for more compressed and useful neural networks for given tasks. The critical distinction is that deep learning compression schemes attempt to produce a smaller-sized neural network with performance similar to a larger one, so that the network can be efficiently deployed on small devices such as mobile phones. This task therefore involves size-reduction techniques such as neural network pruning, low-rank factorization, transferred convolution filters, and knowledge distillation [29]. Our work, on the other hand, asks an important representation learning question: given a neural network, what learning principle best improves the information content learned from the data for a given task? In this work, we attempt to address this question via the perspective that a neural network is a set of stochastic variables that sequentially encode information into its layers. We then explicitly improve the information flow (in the sense of more compressed but relevant information) for each layer via our Markov information bottleneck framework.

#### **3. Preliminaries**

#### *3.1. Notations*

We denote random variables (RVs) by capital letters (e.g., *X*), and their specific realization values by the corresponding lowercase letters (e.g., *x*). We write *X* ⊥ *Y* (respectively, *X* ⊥̸ *Y*) to indicate that *X* and *Y* are independent (respectively, not independent). We denote a Markov chain by *Y* → *X* → *Z*, that is, *Y* and *Z* are conditionally independent given *X*, or *Y* ⊥ *Z*|*X*. We use the integral notation when taking expectations (e.g., ∫ *p*(*x*)*f*(*x*)*dx*) over the distribution of a random variable, regardless of whether the variable is discrete or continuous. We also adopt the following conventions from [27] for defining entropy (denoted by *H*), mutual information (denoted by *I*), and Kullback–Leibler (KL) divergence (denoted by *DKL*): 0 log(0/0) = 0, 0 log(0/*q*) = 0, and *p* log(*p*/0) = ∞.
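These conventions can be implemented directly. The following sketch (a hypothetical helper, not part of the paper's method) computes entropy and KL divergence for discrete distributions while honoring 0 log(0/0) = 0, 0 log(0/*q*) = 0, and *p* log(*p*/0) = ∞:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL[p || q] for discrete distributions, with the conventions
    0 log(0/0) = 0, 0 log(0/q) = 0, and p log(p/0) = inf."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p = 0 contribute 0
    if np.any(q[mask] == 0):          # p > 0 but q = 0: divergence is infinite
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    """Shannon entropy H(p) in nats, with 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))
```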

#### *3.2. Information Bottleneck*

Given a (possibly unknown) data joint distribution *p*(*X*, *Y*), the IB framework constructs a *bottleneck* variable *Z* = *Z*(*X*) that is a *compressed* version of *X* but preserves as much *relevant* information in *X* about *Y* as possible. The compression of the representation *Z* is quantified by *I*(*Z*; *X*), the mutual information of *Z* and *X*. The relevance of *Z*, the amount of information *Z* contains about *Y*, is specified by *I*(*Z*; *Y*). The optimal representation *Z* satisfying a certain compression–relevance trade-off constraint is then determined via minimization of the Lagrangian L*IB*[*p*(*z*|*x*)] = *I*(*Z*; *X*) − *βI*(*Z*; *Y*), where *β* is a positive Lagrange multiplier that controls the trade-off. Due to the convexity of the Lagrangian and the constraints with respect to the encoders {*p*(*z*|*x*)}, the Karush–Kuhn–Tucker (KKT) conditions for this constrained minimization problem are sufficient and necessary for finding the optimal encoders {*p*(*z*|*x*)}. By solving the KKT conditions, we obtain the optimal encoders, which can be expressed in an energy-based form as follows:

$$\underset{p(z|x)}{\text{arg min }} \mathcal{L}\_{IB}[p(z|x)] \propto p(z) \exp\left(-\beta D\_{KL}\left[p(Y|x) \| p(Y|z)\right]\right),\tag{1}$$

where *p*(*z*) = ∫ *p*(*z*|*x*)*p*(*x*)*dx*.
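For small discrete alphabets, the optimal encoder of Equation (1) can be found by iterating the self-consistent equations, in the spirit of the iterative algorithm of [1]. The following is a minimal sketch (the function name, initialization, and smoothing constant are illustrative, not from the paper):

```python
import numpy as np

def ib_iterate(p_xy, n_z, beta, n_iter=200, seed=0):
    """Self-consistent iteration toward the optimal IB encoder of Equation (1)
    for a discrete joint distribution p_xy of shape [n_x, n_y]."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                         # p(x)
    p_y_x = p_xy / p_x[:, None]                    # p(y|x)
    p_z_x = rng.dirichlet(np.ones(n_z), size=n_x)  # random init of p(z|x)
    for _ in range(n_iter):
        p_z = p_x @ p_z_x                          # p(z) = sum_x p(x) p(z|x)
        # p(y|z) = sum_x p(y|x) p(x|z), where p(x|z) = p(z|x) p(x) / p(z)
        p_y_z = (p_z_x * p_x[:, None]).T @ p_y_x / p_z[:, None]
        # KL[p(Y|x) || p(Y|z)] for every pair (x, z); small constant avoids log(0)
        log_ratio = np.log(p_y_x[:, None, :] + 1e-12) - np.log(p_y_z[None, :, :] + 1e-12)
        kl = np.sum(p_y_x[:, None, :] * log_ratio, axis=2)
        # energy-based update: p(z|x) proportional to p(z) exp(-beta * KL)
        p_z_x = p_z[None, :] * np.exp(-beta * kl)
        p_z_x /= p_z_x.sum(axis=1, keepdims=True)
    return p_z_x
```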

#### **4. Markov Information Bottleneck**

Given a data joint distribution *p*(*X*, *Y*), which is possibly only observed via a set of i.i.d. samples *S* = {(*x*1, *y*1), ... , (*xN*, *yN*)}, an information representation *Z* for *p*(*X*, *Y*) is said to be good if it encodes sufficient relevant information in *X* about *Y* in a compressed manner. Ideally, *Z* summarizes only the relevant information in *X* about *Y* and discards all the irrelevant information; more formally, *Z* is a minimal sufficient statistic for *Y*. Such an information representation is desirable because it can capture the regularities in the data and is helpful for generalization in learning problems [30,31]. Our main interest is in solving the optimal information representation for a latent variable *Z* that has Markov structure, that is, *Z* = (*Z*1, *Z*2, ... , *ZL*), where *Z*1 → *Z*2 → ··· → *ZL*. The Markov structure is common in deep neural networks, whose advantage is the powerful modeling capacity coming from multiple layers. In MIB, each encoder *p*(*zl*+1|*x*) relates to the encoders of the previous bottlenecks in the Markov chain via marginalization:

$$p(z\_{l+1}|\mathbf{x}) = \int p(z\_{l+1}, z\_{1:l}|\mathbf{x}) dz\_{1:l} = \int \prod\_{i=1}^{l+1} p(z\_i|z\_{i-1}) dz\_{1:l}, \forall 1 \le l \le L-1,\tag{2}$$

where *z*1:*l* := (*z*1, ... , *zl*) and *z*0 := *x*. In addition, each *encoder* *p*(*zl*|*x*) corresponds to a unique decoder, namely the *relevance decoder*, that decodes the relevant information in *x* about *y* from the representation *zl*:

$$p(y|z\_l) = \int p(\mathbf{x}, y) \frac{p(z\_l|\mathbf{x})}{p(z\_l)} d\mathbf{x}.\tag{3}$$
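For discrete layers, Equations (2) and (3) reduce to products of stochastic matrices. A minimal sketch (illustrative names, assuming row-stochastic transition matrices over finite alphabets):

```python
import numpy as np

def encoder_marginal(transitions):
    """Equation (2) for discrete layers: p(z_{l+1}|x) is the product of the
    per-layer stochastic matrices p(z_1|x), p(z_2|z_1), ..., p(z_{l+1}|z_l).
    transitions[i] has shape [states_in, states_out]; rows sum to 1."""
    out = transitions[0]
    for t in transitions[1:]:
        out = out @ t            # marginalize out the intermediate layer
    return out                   # rows: x values, columns: z_{l+1} values

def relevance_decoder(p_xy, p_zl_x):
    """Equation (3): p(y|z_l) = sum_x p(x, y) p(z_l|x) / p(z_l)."""
    p_x = p_xy.sum(axis=1)
    p_zl = p_x @ p_zl_x          # p(z_l)
    joint = p_zl_x.T @ p_xy      # p(z_l, y)
    return joint / p_zl[:, None] # p(y|z_l)
```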

In MIB, we further introduce a surrogate target variable *Y*ˆ (for the target variable *Y*) into the Markov chain: *Y* → *X* → *Zl* → *Zl*+1 → *Y*ˆ (Figure 1). The purpose of the surrogate target variable becomes clear in the section on variational MIB.

**Figure 1.** A directed graphical model for the Markov information bottleneck with two Markov bottlenecks. In a non-collapse Markov chain *Y* → *X* → *Z*1 → *Z*2 with 0 < *β*1, *β*2 < ∞, the information optimality in *Z*1 prevents the information optimality in *Z*2. Solid lines denote the encoders *pθ*(*zi*|*x*) (for *i* ∈ {1, 2}); dashed lines denote the variational approximations *pθ*(*y*ˆ|*zi*) to the intractable *relevance* decoders *pθ*(*y*|*zi*). The variational relevance decoder *pθ*(*y*ˆ|*zi*) encodes the information from *zi* into a surrogate target variable *y*ˆ. In the case of stochastic neural networks, *zi*, *θ*, and the surrogate target variable represent the hidden layers, the network weights, and the output layer, respectively.

A trivial solution to the optimal information representation problem for *Z* is to apply the original IB principle to *Z* as a whole by computing the optimal IB solution in Equation (1). However, this solution ignores the Markov structure of *Z*. As a principled approach, leveraging the intrinsic structure of a problem can generally provide insight that goes beyond what is available when such structure is ignored. Thus, in the Markov information bottleneck (MIB), we explicitly leverage the Markov structure of *Z* to derive a principled and tractable approximate solution to the optimal information representation. We then empirically show that leveraging the intrinsic structure in the case of MIB is indeed beneficial for learning.

In MIB, we reframe the optimal information representation problem as multiple IB problems for each of the bottlenecks *Zl*:

$$\min\_{p(z\_l|\mathbf{x})} \mathcal{L}\_l[p(z\_l|\mathbf{x})] := \min\_{p(z\_l|\mathbf{x})} \{ I(Z\_l; \mathbf{X}) - \beta\_l I(Z\_l; \mathbf{Y}) \},\tag{4}$$

for all 1 ≤ *l* ≤ *L*.

This extension is a natural approach for multiple bottlenecks because it aims for each bottleneck to achieve its own optimal information, and thus allows more relevant but compressed information to be encoded into *Z*. Another advantage is that we can leverage our understanding of the IB solution for each individual IB problem in Equation (4). Though this approach is promising and has good interpretation, there are two main challenges:

1. The information optimality of the individual bottlenecks may conflict: in a Markov chain, the optimal encoder for one bottleneck can prevent the downstream bottlenecks from reaching their own optima.
2. The mutual information terms in Equation (4) are intractable in the high-dimensional, multi-layer settings we target.
In what follows, we formally establish and present the first challenge, the conflicting property of information optimality in Markov structure, in Theorem 1 followed by a simple compromise to overcome the information conflict. After that, we present variational MIB to address the second challenge.

Without loss of generality, we consider the case *L* = 2 (the result trivially generalizes to *L* > 2). We first define the collapse mode of the representation *Z* to be the two extreme cases in which *Z*2 either contains all the information in *Z*1 about *X* or is simply random noise:

**Definition 1** (The *collapse* mode of MIB)**.** *<sup>Z</sup>* = (*Z*1, *<sup>Z</sup>*2) *is said to be in the collapse mode if it satisfies either of the following two conditions:*

1. *I*(*Z*2; *X*) = *I*(*Z*1; *X*) *and I*(*Z*2; *Y*) = *I*(*Z*1; *Y*) *(i.e., Z*2 *is a sufficient statistic of Z*1 *for X and Y); or*
2. *I*(*Z*2; *X*) = 0 *and I*(*Z*2; *Y*) = 0 *(i.e., Z*2 ⊥ *X and Z*2 ⊥ *Y).*
For example, if *<sup>Z</sup>*<sup>2</sup> = *<sup>f</sup>*(*Z*1) where *<sup>f</sup>* is a deterministic bijection, *<sup>Z</sup>*<sup>2</sup> is a sufficient statistic for *<sup>X</sup>*. We then establish the conflicting property of information optimality in the Markov representation *Z* via the following theorem:

**Theorem 1** (*Conflicting Markov Information Optimality*)**.** *Given X*,*Y*, *Z*1, *and Z*<sup>2</sup> *such that Y* → *X* → *Z*<sup>1</sup> → *<sup>Z</sup>*<sup>2</sup> *and H*(*Y*|*X*) <sup>&</sup>gt; <sup>0</sup>*, consider two constrained minimization problems:*

$$\arg\min \mathcal{L}\_l[p(z\_l|\mathbf{x})] := \arg\min \{ I(Z\_l; \mathbf{X}) - \beta\_l I(Z\_l; \mathbf{Y}) \}, l \in \{1, 2\}, \tag{5}$$

*where* 0 < *β*1 < ∞*,* 0 < *β*2 < ∞*, and p*(*z*2|*x*) = ∫ *p*(*z*2|*z*1)*p*(*z*1|*x*)*dz*1*. Then, the following two statements are equivalent:*

1. *Z is in the collapse mode.*
2. *There exist encoders* {*p*(*z*1|*x*), *p*(*z*2|*z*1)} *that simultaneously minimize both* L1 *and* L2 *in Equation (5) (i.e., the two problems are non-conflicting).*
Theorem 1 suggests that the Markov information optimality conflicts in most cases of interest (e.g., stochastic neural networks, which we present in detail in the next section). The values of *β*1 and *β*2 are important for controlling the ratio of relevant to irrelevant information present in the bottlenecks. These values also determine whether multiple bottlenecks conflict in the edge cases. Recall from the data processing inequality (DPI) [27] that for *Y* → *X* → *Z*, we have 0 ≤ *I*(*Z*; *X*) ≤ *H*(*X*) and 0 ≤ *I*(*Z*; *Y*) ≤ *I*(*X*; *Y*). If *β*1 and *β*2 go to infinity, the optimal bottlenecks *Z*1 and *Z*2 are both deterministic functions of *X* and they do not conflict. When *β*1 = *β*2 = 0, the information about *Y* in *X* is maximally compressed in *Z*1 and *Z*2 (i.e., *Z*1 ⊥ *X*, *Z*2 ⊥ *X*), and they do not conflict. The optimal solutions conflict when *β*1 = 0 and *β*2 > 0, as the former leads to a maximally compressed *Z*1 while the latter prefers an informative *Z*2 (this contradicts the Markov structure *X* → *Z*1 → *Z*2, which dictates that maximal compression in *Z*1 forces maximal compression in *Z*2).

We can also easily construct non-conflicting MIBs for 0 < *β*1, *β*2 < ∞ that violate the non-collapse condition. For example, if *X* and *Y* are jointly Gaussian, the optimal bottlenecks *Z*1 and *Z*2 are linear transforms of *X* and jointly Gaussian with *X* and *Y* [11]. In this case, *Z*2 is a sufficient statistic of *Z*1 for *X*. In the case of neural networks, we can also construct a simple but non-trivial network that obtains non-conflicting Markov information optimality. For example, consider a neural network with two hidden layers *Z*1 and *Z*2, where *Z*1 is arbitrarily mapped from the input layer *X* but *Z*2 is the sample mean of *n* samples drawn i.i.d. from the normal distribution N(*Z*1; Σ). This construction guarantees that *Z*2 is a sufficient statistic of *Z*1 for *X*, and thus there is no conflict in Markov information optimality.

Theorem 1 is a direct consequence of the DPI if *βi* ∈ {0, ∞}. In the case that 0 < *βi* < ∞, we trace the argument down to the Lagrange multipliers, as in the original IB [1], to complete the proof. Before proving Theorem 1, we first establish the following two lemmas. The first lemma expresses the uncertainty reduction in a Markov chain.

**Lemma 1.** *Given Y* → *X* → *Z*<sup>1</sup> → *Z*2*, we have*

$$I(Z\_2; X) = I(Z\_1; X) - I(Z\_1; X | Z\_2) \tag{6}$$

$$I(Z\_2; Y) = I(Z\_1; Y) - I(Z\_1; Y | Z\_2). \tag{7}$$

**Proof.** It follows from [27] that *I*(*X*; *Z*1; *Z*2) = *I*(*X*; *Z*2) + *I*(*X*; *Z*1|*Z*2) = *I*(*X*; *Z*1) + *I*(*X*; *Z*2|*Z*1), but *I*(*X*; *Z*2|*Z*1) = 0 since *X* ⊥ *Z*2|*Z*1, hence Equation (6). The proof of Equation (7) is similar, replacing variable *X* with variable *Y*. (Q.E.D.)
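Lemma 1 is easy to verify numerically. The sketch below (illustrative code, not part of the proof) draws a random discrete Markov chain *X* → *Z*1 → *Z*2 and checks the identity of Equation (6):

```python
import numpy as np

def mi(p):
    """I(A;B) from a joint distribution p(a,b)."""
    pa, pb = p.sum(1), p.sum(0)
    mask = p > 0
    return float(np.sum(p[mask] * np.log((p / np.outer(pa, pb))[mask])))

# Small Markov chain X -> Z1 -> Z2 with random stochastic maps
rng = np.random.default_rng(1)
p_x = rng.dirichlet(np.ones(3))
p_z1_x = rng.dirichlet(np.ones(3), size=3)       # rows: p(z1|x)
p_z2_z1 = rng.dirichlet(np.ones(3), size=3)      # rows: p(z2|z1)
joint = p_x[:, None, None] * p_z1_x[:, :, None] * p_z2_z1[None, :, :]  # p(x,z1,z2)

i_z2_x = mi(joint.sum(axis=1))                   # I(Z2;X)
i_z1_x = mi(joint.sum(axis=2))                   # I(Z1;X)
# I(Z1;X|Z2) = sum_z2 p(z2) I(Z1;X | Z2=z2)
p_z2 = joint.sum(axis=(0, 1))
i_z1_x_z2 = sum(p_z2[k] * mi(joint[:, :, k] / p_z2[k]) for k in range(3))
assert abs(i_z2_x - (i_z1_x - i_z1_x_z2)) < 1e-10   # Equation (6)
```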

**Lemma 2.** *Given Y* → *X* → *Z*1 → *Z*2*,* 0 < *β*2 < ∞*, and H*(*X*|*Y*) > 0*, let us define the conditional information bottleneck objective:*

$$\mathcal{L}^c := \mathcal{L}^c[p(z\_2|z\_1), p(z\_1|\mathbf{x})] := I(Z\_1; X|Z\_2) - \beta\_2 I(Z\_1; Y|Z\_2). \tag{8}$$

*If Z is not in the collapse mode, <sup>∂</sup>*L*c*/*∂p*(*z*1|*x*) *depends on* {*p*(*z*2|*z*1)}*.*

**Proof.** Informally, if *<sup>Z</sup>*<sup>2</sup> in the conditional information bottleneck objective <sup>L</sup>*<sup>c</sup>* is not a trivial transform of the bottleneck variable *Z*1, *Z*<sup>2</sup> induces a non-trivial topology into the conditional information bottleneck objective. Formally, by the definition of the conditional mutual information

$$I(Z\_1; X|Z\_2) = \int \int \int p(\mathbf{x}, z\_1, z\_2) \log \frac{p(z\_1, \mathbf{x}|z\_2)}{p(z\_1|z\_2)p(\mathbf{x}|z\_2)} dz\_2 dz\_1 d\mathbf{x},$$

*I*(*Z*1; *X*|*Z*2) depends on *p*(*x*, *z*1, *z*2) as long as the presence of *Z*2 in the conditional information bottleneck objective does not vanish (we discuss the conditions for *Z*2 to vanish in the final part of this proof). Note that due to the Markov chain *X* → *Z*1 → *Z*2, we have *p*(*x*, *z*1, *z*2) = *p*(*x*)*p*(*z*1|*x*)*p*(*z*2|*z*1).

Thus, *∂I*(*Z*1; *X*|*Z*2)/*∂p*(*z*1|*x*) depends on *p*(*z*2|*z*1) as long as *Z*2 does not vanish in the objective. Similarly, the same result applies to *∂I*(*Z*1; *Y*|*Z*2)/*∂p*(*z*1|*x*). Hence, *∂*L*c*/*∂p*(*z*1|*x*) depends on {*p*(*z*2|*z*1)} if *Z*2 does not vanish in the objective (note that *H*(*X*|*Y*) > 0 prevents the collapse of L*c* when summing the two mutual informations).

Now we discuss the vanishing condition for *Z*<sup>2</sup> in the objective. It follows from Lemma 1 that:

$$0 \le I(Z\_1; X|Z\_2) \le I(Z\_1; X),\tag{9}$$

$$0 \le I(Z\_1; \mathcal{Y} | Z\_2) \le I(Z\_1; \mathcal{Y}).\tag{10}$$

Note that *Z*2 vanishes in L*c* if and only if neither of the mutual informations in L*c* depends on *Z*2, which holds if and only if equality occurs in both (9) and (10). If *I*(*Z*1; *X*|*Z*2) = 0, we have *Y* → *X* → *Z*2 → *Z*1 (i.e., *Z*2 is a sufficient statistic for *X* and *Y*), which also implies that *I*(*Z*1; *Y*|*Z*2) = 0. Similarly, *I*(*Z*1; *X*|*Z*2) = *I*(*Z*1; *X*) implies that *Z*2 is independent of *Z*1, which in turn implies that *I*(*Z*1; *Y*|*Z*2) = *I*(*Z*1; *Y*). (Q.E.D.)

We now prove Theorem 1 using Lemmas 1 and 2.

**Proof of Theorem 1.** (⇐) This direction is obvious. When *I*(*Z*2; *X*) = *I*(*Z*1; *X*) and *I*(*Z*2; *Y*) = *I*(*Z*1; *Y*), or *I*(*Z*2; *X*) = 0 and *I*(*Z*2; *Y*) = 0, there is effectively only one optimization problem, L1, and this reduces to the original information bottleneck (with a single bottleneck) [1].

(⇒) First we prove that if *Z* is not in the collapse mode, the constrained minimization problems are conflicting. Assume, by contradiction, that there exists a solution that minimizes both L1 and L2 simultaneously, that is, ∃ *p*(*z*1|*x*), *p*(*z*2|*z*1) s.t. L1 has a minimum at {*p*(*z*1|*x*)} and L2 has a minimum at {*p*(*z*1|*x*), *p*(*z*2|*z*1)}. Note that {*p*(*z*1|*x*)} and {*p*(*z*2|*z*1)} are independent variables for the optimization. By introducing Lagrange multipliers *λ*1(*x*) and *λ*2(*x*) for the constraint ∫ *p*(*z*1|*x*)*dz*1 = 1 of L1 and L2, respectively, the stationarity in the Karush–Kuhn–Tucker (KKT) conditions becomes:

$$\frac{\partial L\_1}{\partial p(z\_1|\mathbf{x})} = 0,\tag{11}$$

$$\frac{\partial L\_2}{\partial p(z\_1|\mathbf{x})} = 0,\tag{12}$$

where *L*<sup>1</sup> and *L*<sup>2</sup> are the Lagrangians:

$$L\_1[p(z\_1|\mathbf{x}), \lambda\_1] := I(Z\_1; \mathbf{X}) - \beta\_1 I(Z\_1; \mathbf{Y}) - \int \int \lambda\_1(\mathbf{x}) p(z\_1|\mathbf{x}) dz\_1 d\mathbf{x} \tag{13}$$

$$L\_2[p(z\_1|\mathbf{x}), \lambda\_2] := I(Z\_2; \mathbf{X}) - \beta\_2 I(Z\_2; \mathbf{Y}) - \int \int \lambda\_2(\mathbf{x}) p(z\_1|\mathbf{x}) dz\_1 d\mathbf{x}.\tag{14}$$

It follows from Lemma 1 that:

$$L\_2 - L\_1 = (\beta\_1 - \beta\_2)I(Z\_1; Y) - \mathcal{L}^c - \int \int (\lambda\_2(\mathbf{x}) - \lambda\_1(\mathbf{x})) p(z\_1|\mathbf{x}) dz\_1 d\mathbf{x},\tag{15}$$

where L*c* = *I*(*Z*1; *X*|*Z*2) − *β*2 *I*(*Z*1; *Y*|*Z*2) (defined in Lemma 2). We take the derivative with respect to *p*(*z*1|*x*) on both sides of Equation (15) and use Equations (11) and (12):

$$\frac{\partial \mathcal{L}^{c}}{\partial p(z\_1|\mathbf{x})} = (\beta\_1 - \beta\_2) \frac{\partial I(Z\_1; \mathbf{Y})}{\partial p(z\_1|\mathbf{x})} + \lambda\_1(\mathbf{x}) - \lambda\_2(\mathbf{x}).\tag{16}$$

Notice that the left-hand side of Equation (16) strictly depends on *p*(*z*2|*z*1) (Lemma 2), while the right-hand side is independent of {*p*(*z*2|*z*1)}. This contradiction implies that the initial existence assumption is invalid, which yields the conclusion of Theorem 1. (Q.E.D.)

#### *4.1. Markov Information Optimality Compromise*

Due to Theorem 1, we cannot simultaneously achieve information optimality for all bottlenecks; we therefore need a compromise. We propose two simple compromise strategies, namely, JointMIB and GreedyMIB. JointMIB minimizes a weighted sum of the IB objectives, L*joint* := ∑*l γl*L*l* (summing over 0 ≤ *l* ≤ *L*), where *γl* ≥ 0. The main idea of JointMIB is to simultaneously optimize all encoders: even though each bottleneck might not achieve its individual optimality, the joint optimization encourages a joint compromise. On the other hand, GreedyMIB progressively solves the information optimality for each bottleneck given that the encoders of the previous bottlenecks are fixed. In other words, GreedyMIB seeks the conditional optimality of the current bottleneck, conditioned on the fixed greedy-optimal information of the previous bottlenecks.
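The difference between the two strategies can be illustrated on toy objectives. The sketch below (purely illustrative: scalar per-layer parameters, finite-difference gradients, and hypothetical function names) contrasts the greedy layer-by-layer solution with the jointly optimized one:

```python
def greedy_mib(layer_objectives, init_params, lr=0.05, steps=500, eps=1e-6):
    """GreedyMIB sketch: minimize each layer objective L_l over its own
    parameter only, with the earlier (greedy-optimal) layers frozen.
    layer_objectives[l] takes the parameter list of layers 0..l."""
    params = list(init_params)
    for l, obj in enumerate(layer_objectives):
        for _ in range(steps):
            theta = params[l]
            grad = (obj(params[:l] + [theta + eps])
                    - obj(params[:l] + [theta - eps])) / (2 * eps)
            params[l] = theta - lr * grad
    return params

def joint_mib(layer_objectives, gammas, init_params, lr=0.05, steps=500, eps=1e-6):
    """JointMIB sketch: minimize the weighted sum sum_l gamma_l * L_l over
    all layer parameters simultaneously."""
    params = list(init_params)
    total = lambda ps: sum(g * obj(ps[:l + 1])
                           for l, (g, obj) in enumerate(zip(gammas, layer_objectives)))
    for _ in range(steps):
        grads = []
        for i in range(len(params)):
            up = params[:i] + [params[i] + eps] + params[i + 1:]
            dn = params[:i] + [params[i] - eps] + params[i + 1:]
            grads.append((total(up) - total(dn)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```

On two quadratic toy objectives, the greedy solution fixes the first layer at its individual optimum, while the joint solution settles at a compromise between the two layers.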

#### *4.2. Variational Markov Information Bottleneck*

Due to the intractability of the encoders in Equation (2) and the relevance decoders in Equation (3), the resulting mutual information terms in Equation (4) are also intractable. In this section, we present variational methods that derive tractable bounds on the mutual information terms in MIB.

#### 4.2.1. Approximate Relevance

Note that *I*(*Zl*; *Y*) = *H*(*Y*) − *H*(*Y*|*Zl*), where *H*(*Y*) is a constant that can be ignored in the minimization of L*l*. It follows from the non-negativity of the KL divergence that:

$$H(\boldsymbol{Y}|\boldsymbol{Z}\_{l}) = -\int p(\boldsymbol{y}|\boldsymbol{z}\_{l})p(\boldsymbol{z}\_{l})\log p(\boldsymbol{y}|\boldsymbol{z}\_{l})d\boldsymbol{y}d\boldsymbol{z}\_{l} \leq -\int p(\boldsymbol{y}|\boldsymbol{z}\_{l})p(\boldsymbol{z}\_{l})\log p\_{v}(\boldsymbol{y}|\boldsymbol{z}\_{l})d\boldsymbol{y}d\boldsymbol{z}\_{l}$$

$$= -\mathbb{E}\_{(X,Y)}\mathbb{E}\_{Z\_l|X}\log p\_{v}(Y|Z\_l) = -\mathbb{E}\_{(X,Y)}\mathbb{E}\_{Z\_l|X}\log p(\hat{Y}|Z\_l) =: \tilde{H}(Y|Z\_l),\tag{17}$$

where we specifically use the relevance decoder for the surrogate target variable, *pv*(*y*|*zl*) = *p*(*y*ˆ|*zl*), as a variational approximation to the intractable distribution *p*(*y*|*zl*):

$$p\_{\mathcal{V}}(\boldsymbol{y}|\boldsymbol{z}\_{l}) := \mathbb{E}\_{\boldsymbol{Z}\_{L}|\boldsymbol{z}\_{l}}\left[p(\boldsymbol{\hat{y}}|\boldsymbol{Z}\_{L})\right].\tag{18}$$

The variational relevance ˜*I*(*Zl*; *Y*) := *H*(*Y*) − *H*˜(*Y*|*Zl*) is a lower bound on *I*(*Zl*; *Y*). This bound is tightest (i.e., has zero gap) when the variational relevance decoder *p*(*y*ˆ|*zl*) equals the relevance decoder *p*(*y*|*zl*). In what follows, we establish the relationship between the variational relevance and the log-likelihood function, thus connecting MIB with the MLE principle:

**Proposition 1** (Variational Relevance Inequalities)**.** *Given the variational relevance* ˜*I*(*Zl*; *Y*) = *H*(*Y*) − *H*˜(*Y*|*Zl*)*, where H*˜(*Y*|*Zl*) *is defined in Equation (17), and Z* = (*Z*1, ..., *ZL*)*, we have:*

$$H(Y) + \mathbb{E}\_{(X,Y)}\left[\log p(\hat{Y}|X)\right] = \tilde{I}(Z\_0; Y) \ge \tilde{I}(Z\_l; Y) \ge \tilde{I}(Z\_{l+1}; Y) \ge \tilde{I}(Z\_L; Y) = \tilde{I}(Z; Y), \tag{19}$$

*for all* 0 ≤ *l* ≤ *L* − 1*.*

Proposition 1 suggests that: (i) the log likelihood of *p*(*y*ˆ|*x*) (plus the constant output entropy *H*(*Y*)) is a special case of the variational relevance at bottleneck *Z*0 = *X*; (ii) the log-likelihood bound *H*(*Y*) + E(*X*,*Y*) log *p*(*Y*ˆ|*X*) is an upper bound on the variational relevance for all intermediate bottlenecks *Zl* and for the composite bottleneck *Z* = (*Z*1, ..., *ZL*). Therefore, maximizing the log likelihood, as in MLE, is not guaranteed to increase the variational relevance for the intermediate bottlenecks or the composite bottleneck.

**Proof.** It follows from Jensen's inequality and the Markov chain that:

$$\begin{aligned} \int p(z\_l|\mathbf{x}) \log p(\hat{y}|z\_l) dz\_l &= \int p(z\_l|\mathbf{x}) \log \left( \int p(\hat{y}|z\_{l+1}) p(z\_{l+1}|z\_l) dz\_{l+1} \right) dz\_l \\ &\ge \int p(z\_l|\mathbf{x}) \int p(z\_{l+1}|z\_l) \log p(\hat{y}|z\_{l+1}) dz\_{l+1} dz\_l \\ &= \int \int p(z\_l|\mathbf{x}) p(z\_{l+1}|z\_l) \log p(\hat{y}|z\_{l+1}) dz\_l dz\_{l+1} \\ &= \int p(z\_{l+1}|\mathbf{x}) \log p(\hat{y}|z\_{l+1}) dz\_{l+1} \end{aligned}$$

for all 0 ≤ *l* ≤ *L* − 1. Thus, we have:

$$\begin{aligned} \tilde{I}(Z\_l; Y) &= H(Y) - \tilde{H}(Y|Z\_l) \\ &= H(Y) + \mathbb{E}\_{(X,Y)} \mathbb{E}\_{Z\_l|X} \log p(\hat{Y}|Z\_l) \\ &\ge H(Y) + \mathbb{E}\_{(X,Y)} \mathbb{E}\_{Z\_{l+1}|X} \log p(\hat{Y}|Z\_{l+1}) \\ &= \tilde{I}(Z\_{l+1}; Y). \end{aligned}$$

It also follows from the Markov chain that:

$$p(\hat{y}|z) = p(\hat{y}|z\_L, z\_{L-1}, \dots, z\_1) = p(\hat{y}|z\_L).$$

Therefore, we have:

$$\begin{aligned} \tilde{I}(Z; Y) &= H(Y) + \mathbb{E}\_{(X,Y)} \mathbb{E}\_{Z\_1, \dots, Z\_L|X} \log p(\hat{Y}|Z\_L) \\ &= H(Y) + \mathbb{E}\_{(X,Y)} \mathbb{E}\_{Z\_L|X} \log p(\hat{Y}|Z\_L) \\ &= \tilde{I}(Z\_L; Y). \end{aligned}$$

Finally, by the definition in Equation (17), we have:

$$\begin{aligned} \tilde{I}(Z\_0; Y) &= H(Y) + \mathbb{E}\_{(X,Y)} \mathbb{E}\_{Z\_0|X} \log p(\hat{Y}|Z\_0) \\ &= H(Y) + \mathbb{E}\_{(X,Y)} \log p(\hat{Y}|X). \end{aligned}$$

(Q.E.D.)
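In practice, the variational conditional entropy *H*˜(*Y*|*Zl*) of Equation (17) is estimated by Monte Carlo: sample *zl* ∼ *p*(*zl*|*x*) and average the log-scores of the variational decoder. A minimal sketch (the sampler and decoder callables are assumed to be supplied by the model; all names are illustrative):

```python
import numpy as np

def variational_conditional_entropy(xs, ys, sample_zl, log_p_yhat_given_zl, m=32):
    """Monte Carlo estimate of H~(Y|Z_l) from Equation (17):
    -E_{(X,Y)} E_{Z_l|X} log p(Y^|Z_l).
    sample_zl(x) draws one sample z_l ~ p(z_l|x);
    log_p_yhat_given_zl(y, z) evaluates the variational relevance decoder."""
    total, n = 0.0, len(xs)
    for x, y in zip(xs, ys):
        zs = [sample_zl(x) for _ in range(m)]        # z_l^{(k)} ~ p(z_l|x)
        total += np.mean([log_p_yhat_given_zl(y, z) for z in zs])
    return -total / n
```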

#### 4.2.2. Approximate Compression

In practice (e.g., in the SNNs presented in the next section), we can model the encoding between consecutive layers *p*(*zl*|*zl*−1) in an analytical form. However, the encoding between non-consecutive layers *p*(*zl*|*x*) for *l* > 1 is generally not available in closed form, as it is a mixture over *p*(*zl*|*zl*−1). We thus avoid directly estimating *I*(*Zl*; *X*) by instead resorting to its upper bound *I*(*Zl*; *Zl*−1) as a surrogate in the optimization. However, *I*(*Zl*; *Zl*−1) is still intractable, as it involves the intractable marginal distribution *p*(*zl*) = ∫ *p*(*zl*|*x*)*p*(*x*)*dx*. We therefore bound *I*(*Zl*; *Zl*−1) using a mean-field (factorized) variational distribution *q*(*zl*) = ∏*i q*(*zl*,*i*), where *zl* = (*zl*,1, ... , *zl*,*nl*):

$$I(Z\_l;X) \le I(Z\_l;Z\_{l-1}) = \int p(z\_l|z\_{l-1})p(z\_{l-1}) \log \frac{p(z\_l|z\_{l-1})}{p(z\_l)} dz\_l dz\_{l-1}$$

$$\le \int p(z\_l|z\_{l-1})p(z\_{l-1}) \log \frac{p(z\_l|z\_{l-1})}{q(z\_l)} dz\_l dz\_{l-1} = \mathbb{E}\_{Z\_{l-1}} D\_{KL} \left[ p(Z\_l|Z\_{l-1}) || q(Z\_l) \right]$$

$$= \mathbb{E}\_{Z\_{l-1}} \sum\_{i=1}^{n\_l} D\_{KL} \left[ p(Z\_{l,i}|Z\_{l-1}) || q(Z\_{l,i}) \right] =: \tilde{I}(Z\_l; Z\_{l-1}). \tag{20}$$

The mean-field variational inference not only helps derive a tractable approximation but also encourages distributed representation by constraining each neuron to capture an independent factor of variation for the data [32]; thus, it can potentially represent an exponential number of concepts using independent factors.
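For binary stochastic neurons, the factorized KL in Equation (20) is available in closed form per neuron, so the bound can be estimated by averaging over samples of the previous layer. A minimal sketch (illustrative names, assuming Bernoulli neurons):

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """Element-wise KL between Bernoulli(p) and Bernoulli(q)."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def compression_bound(prev_samples, cond_prob, q_marginal):
    """Monte Carlo estimate of the mean-field bound I~(Z_l; Z_{l-1}) in
    Equation (20): an average over samples of z_{l-1} of the factorized
    sum_i D_KL[p(z_{l,i}|z_{l-1}) || q(z_{l,i})].
    cond_prob(z) returns the vector of Bernoulli means p(z_{l,i}=1|z_{l-1}=z);
    q_marginal is the vector of variational marginals q(z_{l,i}=1)."""
    kls = [bernoulli_kl(cond_prob(z), q_marginal).sum() for z in prev_samples]
    return float(np.mean(kls))
```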

#### **5. Case Study: Learning Binary Stochastic Neural Networks**

In this section, we formally connect the variational MIB of Section 4 to stochastic neural networks (SNNs). We consider an SNN with *L* hidden layers (without any feedback or skip connections), where the input layer *X*, the hidden layers *Zl* for 1 ≤ *l* ≤ *L*, and the output layer *Y*ˆ are considered as random variables. We use the convention that *Z*0 := *X*, *ZL*+1 := *Y*ˆ, and *Zl* = ∅ for all *l* ∉ {0, 1, ... , *L*, *L* + 1}. Without any feedback or skip connections, *Y*, *X*, *Zl*, *Zl*+1, and *Y*ˆ form a Markov chain in that order. The output layer *Y*ˆ is the surrogate target variable presented in Section 4. The role of the SNN is therefore reduced to transforming one random variable into another via the Markov chain *X* → *Z*1 → ··· → *ZL* → *Y*ˆ such that each layer achieves a good information representation (i.e., the compression–relevance tradeoff). With the MLE principle, learning in SNNs is performed by maximizing the log likelihood E(*X*,*Y*) log *p*(*Y*ˆ|*X*). However, maximizing the log likelihood is not guaranteed to improve the variational relevance for the intermediate bottlenecks and the composite bottleneck (Proposition 1).

#### **Algorithm 1:** JointMIB

**Input:** data *S0* ← {(*xi*, *yi*)}, *i* = 1, ... , *N*, sampled from *pD*(*x*, *y*); layer IB weights *γl*; information trade-off weights *βl*; number of particles for Monte Carlo simulation *M*.
**Output:** *θ*
**Initialization:** *θ*

```
while not converged do
    for l = 1 to L do
        S_l ← ∅
        for z_{l-1} ∈ S_{l-1} do
            /* Monte Carlo: simulate M particles z_l^(k) given each z_{l-1} */
            S_l ← S_l ∪ { z_l^(k) : 1 ≤ k ≤ M },  where z_l^(k) ~ p(z_l | z_{l-1})
        end
    end
    /* Estimate the variational relevance and compression using the Monte Carlo samples */
    I~(Z_l; Y)       ← Equation (17) and {S_l}, l = 0, ..., L
    I~(Z_l; Z_{l-1}) ← Equation (20) and {S_l}, l = 0, ..., L
    L~_joint(θ) ← Σ_{l=0}^{L} γ_l [ −I~(Z_l; Y) + β_l I~(Z_l; Z_{l-1}) ]
    g ← ∂L~_joint(θ) / ∂θ        /* using the Raiko estimator in binary SNNs */
    θ ← θ − ν g                  /* update using SGD */
end
```
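The sampling loop of Algorithm 1 can be sketched in a few lines of numpy; the network sizes and random weights below are illustrative assumptions. Note how |*Sl*| grows by a factor of *M* per layer, which is the exponential cost discussed in Section 7:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def simulate_particles(x, weights, biases, M=4):
    """Ancestral Monte Carlo sampling as in Algorithm 1: starting from
    S_0 = {x}, spawn M binary particles z_l^(k) ~ p(z_l | z_{l-1})
    for every ancestor particle in S_{l-1}."""
    particle_sets = [[x]]                          # S_0
    for W, b in zip(weights, biases):
        S_l = []
        for z_prev in particle_sets[-1]:
            p = sigmoid(W @ z_prev + b)            # p(z_l = 1 | z_{l-1})
            for _ in range(M):
                S_l.append((rng.random(p.shape) < p).astype(float))
        particle_sets.append(S_l)
    return particle_sets

# toy network with 2 binary hidden layers; |S_l| grows as M^l
x = np.array([1.0, 0.0, 1.0])
Ws = [rng.normal(size=(3, 3)), rng.normal(size=(2, 3))]
bs = [np.zeros(3), np.zeros(2)]
S = simulate_particles(x, Ws, bs, M=4)
print([len(s) for s in S])  # [1, 4, 16]
```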

We here instead combine the variational MIB and the MIB compromise to derive a practical learning principle that encourages compression and relevance for each layer, improving the information flow in SNNs. To keep things concrete and simple, we consider a simple network architecture: binary stochastic feed-forward (fully-connected) neural networks (SFNNs). In binary SFNNs, we use the sigmoid as the activation function: *p*(*zl* = 1|*zl−1*) = *σ*(*Wl−1zl−1* + *bl−1*), where *σ*(·) is the (element-wise) sigmoid function, *Wl−1* is the weight matrix connecting layer *l* − 1 to layer *l*, *bl−1* is a bias vector, and *Zl* ∈ {0, 1}<sup>*nl*</sup>. Let us define L˜*l* := −˜*I*(*Zl*; *Y*) + *βl* ˜*I*(*Zl*; *Zl−1*), where ˜*I*(*Zl*; *Y*) and ˜*I*(*Zl*; *Zl−1*) are the approximate relevance and compression defined in Equations (17) and (20), respectively. Note that the position of *βl* here differs slightly from its position in Equation (4): in Equation (4), *βl* is associated with the relevance term to respect the convention of the original IB, while here it is associated with the compression term for practical reasons, since in practice the contribution of ˜*I*(*Zl*; *Y*) is higher than that of ˜*I*(*Zl*; *Zl−1*). In computing ˜*I*(*Zl*; *Y*) and ˜*I*(*Zl*; *Zl−1*), any expectation with respect to *p*(*zl*|*zl−1*) is approximated by Monte Carlo simulation in which we sample *M* particles *zl* ∼ *p*(*zl*|*zl−1*). Regarding the information optimality compromise, we combine the variational MIB objectives into a weighted sum in JointMIB:

$$\tilde{\mathcal{L}}^{\text{joint}} := \sum\_{l=0}^{L} \gamma\_l \tilde{\mathcal{L}}\_{l}, \tag{21}$$

where *γl* ≥ 0. In GreedyMIB, we greedily minimize L˜*l* for each 0 ≤ *l* ≤ *L*. We also make each *q*(*Zl,i*) a learnable Bernoulli distribution. JointMIB is presented in Algorithm 1. The Monte Carlo sampling operation of Algorithm 1 in stochastic neural networks precludes backpropagation through the computation graph. It becomes even more challenging with binary stochastic neural networks, as gradients w.r.t. discrete-valued variables are not well defined. Fortunately, there are approximate gradient estimators that have proven efficient in practice: the REINFORCE estimator [33,34], the straight-through estimator [35], the generalized EM algorithm [20], and the Raiko (biased) estimator [21]. In particular, we found that the Raiko gradient estimator worked best in our specific setting and thus deployed it in this application. In the Raiko estimator, the gradient of a bottleneck particle *zl,i* ∼ *p*(*zl,i* = 1|*zl−1*) = *σ*(*a*(*l*)*i*) is propagated only through the deterministic term *σ*(*a*(*l*)*i*): ∂*zl,i*/∂*θ* ≈ ∂*σ*(*a*(*l*)*i*)/∂*θ*.
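The Raiko estimator can be sketched in pure numpy as follows (the pre-activation values are illustrative assumptions): the forward pass draws a hard binary sample, while the backward pass differentiates only the deterministic sigmoid term.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def binary_forward(a):
    """Forward pass: sample a hard binary activation z ~ Bernoulli(sigmoid(a))."""
    p = sigmoid(a)
    z = (rng.random(p.shape) < p).astype(float)
    return z, p

def raiko_grad(p, grad_z):
    """Backward pass (Raiko biased estimator): propagate the gradient only
    through the deterministic term sigma(a), i.e. dz/da ≈ sigma'(a) = p(1 - p)."""
    return grad_z * p * (1.0 - p)

a = np.array([0.5, -1.0, 2.0])            # pre-activations of one binary layer
z, p = binary_forward(a)
grad_a = raiko_grad(p, np.ones_like(a))   # with an upstream gradient of ones
print(set(np.unique(z)) <= {0.0, 1.0})    # True: hard binary sample forward
print(np.allclose(grad_a, p * (1 - p)))   # True: smooth surrogate gradient backward
```

The bias of this estimator comes precisely from replacing the non-differentiable sampling step with the smooth sigmoid derivative in the backward pass.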

#### **6. Experimental Evaluation**

We evaluated the effectiveness of the MIB framework on binary SNNs using synthetic data and the MNIST handwritten digit recognition data [23]. Each data sample in MNIST is a 28 × 28 gray-scale image representing a handwritten digit from 0 to 9. The dataset is split into 60,000 training samples and 10,000 test samples. On the synthetic data, we visualized the learning dynamics of the SNNs trained with the variational MIB variants (i.e., JointMIB and GreedyMIB) and those trained with MLE. On MNIST, we evaluated the effectiveness of the variational MIB variants by comparing them against the MLE and VIB [14] baselines in classification, adversarial robustness, and multi-modal learning problems. We make the code for our framework publicly available at https://github.com/thanhnguyentang/pib.

#### *6.1. Synthetic Data: Learning Dynamics of Variational MIB*

To better understand how MIB modifies the information within the layers during the learning process, we visualized the compression and relevance of each layer over the course of training for stochastic feed-forward neural networks (SFNNs) [21], JointMIB, and GreedyMIB on synthetic data. *SFNN* differs from *MIB* only in the objective function: *SFNN* is trained with the negative log likelihood while *MIB* is trained with the variational MIB objective. To simplify our analysis, we considered a binary decision problem where *X* consists of 12 binary inputs making up 2<sup>12</sup> = 4096 equally likely input patterns and *Y* is a binary variable equally distributed among the 4096 input patterns [16]. The base neural network architecture had 4 hidden layers with widths of 10–8–6–4 neurons. Since the network architecture was small, we could precisely compute the true compression *Ix* := *I*(*Zl*; *X*) and true relevance *Iy* := *I*(*Zl*; *Y*) over the training epochs. We fixed *βl* = *β* = 10<sup>−4</sup> for both JointMIB and GreedyMIB, trained five different randomly initialized neural networks for each comparative model with stochastic gradient descent (SGD) for up to 20,000 epochs on 80% of the data, and averaged the mutual information. In JointMIB, we set *γl* = *γ* = 1, ∀*l*.

Figure 2 provides a visualization of the learning dynamics of SFNN versus JointMIB on the information plane (*Ix*, *Iy*). Firstly, we observed a common trend in the learning dynamics of the MLE (in the SFNN model) and JointMIB frameworks. Both principles allow the network to gradually encode more information about *X* and relevant information about *Y* into the hidden layers at the beginning, as *I*(*Zl*; *X*) and *I*(*Zl*; *Y*) both increase. Intuitively, in order for the representations *Zl* to make sense of the task, they should encode enough information about *X*; thus, *I*(*Zl*; *X*) should increase. This is especially true for shallow layers because, due to the Markov chain property, the shallower a layer, the greater its burden of carrying enough information to make sense of the task. Notably, the increase of *I*(*Zl*; *X*) slowed down at some point for the deeper layers for both *SFNN* and *MIB*. This slowing effect was stronger in *MIB*, where compression is explicitly encouraged during learning. Secondly, MIB differed from MLE in the maximum level of relevance at each layer and in the number of epochs needed to encode the same level of relevance. JointMIB at *l* = 1 needed only about 4.68% of the training epochs to achieve at least the same level of relevance as all layers of SFNN at the final epoch. In addition, MLE was unable to drive the network layers to the maximum level of relevance enabled by MIB (we also trained SFNN for up to 100,000 epochs and observed that the level of relevance of each layer never reached the value of 0.8 bits).

**Figure 2.** The learning dynamics of the stochastic feed-forward (fully-connected) neural network (SFNN) (**left**) and JointMIB (**right**). The color indicates the training epoch, and each node of a given color represents (*I*(*Zl*; *X*), *I*(*Zl*; *Y*)) at the corresponding epoch. Note that at each epoch, *I*(*Zl*; *X*) ≥ *I*(*Zl+1*; *X*), ∀*l* (data processing inequality, DPI). JointMIB jointly encodes relevant information into every layer of stochastic neural networks (SNNs) while keeping each layer informatively concise. Compared to maximum likelihood estimation (MLE), the level of relevant information encoded by JointMIB increased more quickly over the training epochs and reached a higher value. MIB: Markov information bottleneck.

There is also a subtle observation in Figure 2: the relevance for MIB increased until some point before decreasing, while the relevance for SFNN increased until some point and then stayed almost the same without a noticeable decrease. This can be explained by the fact that the MIB objective can eventually encode relevant information into each layer up to its optimal information trade-off. After this point, if training is continued, the mismatch between the exact MIB objective and its variational bound means that further minimization of the variational bound decreases *I*(*Zl*; *Y*). Consequently, in order for *βlI*(*Zl*; *X*) − *I*(*Zl*; *Y*) to remain small, *I*(*Zl*; *X*) also needs to decrease after this point to compensate for the decrease in *I*(*Zl*; *Y*). In the case of SFNN (trained with MLE), the MLE objective reaches its local minimum before the information of each layer can even reach its optimal information trade-off (if it ever does). This also suggests that MIB is better than MLE at exploiting information in each layer during learning.

GreedyMIB also obtained representations of higher relevance compared to MLE (Figure 3). GreedyMIB at *l* = 1 needed only about 17.95% of the training epochs to achieve at least the same level of relevance as all layers of the SFNN at the final epoch. Recall that in GreedyMIB at *l* = 1, the MIB principle is applied only to the first hidden layer. The layer representation at the final epoch gradually shifts to the left (i.e., becomes more compressed) without degrading the relevance over the greedy training from layer 1 to layer 4 in Figure 3.

We also observed the compression effect: the compression constraints within the MIB framework prevented the layer representation from shifting to the right (in the information plane) during the encoding of relevant information (e.g., they slowed down the increase of *I*(*Zl*; *X*) during the information encoding, keeping the representation more concise). Compared with JointMIB, GreedyMIB obtained a comparable information representation.

To conclude, the two main advantages of MIB compared to MLE are: (i) MIB can improve the information representation in SNNs in terms of higher relevance while keeping the information in each layer concise during encoding; (ii) MIB needs far fewer training epochs to obtain such an information representation.

**Figure 3.** Subfigures (**a**), (**b**), (**c)**, and (**d**) represent GreedyMIB's encoding of relevant information into layers 1 ≤ *l* ≤ 4, respectively. GreedyMIB greedily encodes relevant information into each layer given the encoded information of the previous layers. GreedyMIB also achieved a significantly higher level of relevant information at each layer compared to MLE.

#### *6.2. Image Classification*

In this experiment, we compared JointMIB and GreedyMIB with three other comparative models that used the same network architecture without any explicit regularizer: (1) a standard deterministic neural network (DET), which simply treats each hidden layer as deterministic; (2) a stochastic feed-forward neural network (SFNN) [21], which is a binary stochastic neural network as in MIB but is trained with the MLE principle; and (3) the variational information bottleneck (VIB) [14], which uses the entire deterministic network as an encoder, adds an extra stochastic layer as an out-of-network bottleneck variable, and is then trained with the IB principle on that single bottleneck layer. The base network architecture in this experiment had two hidden layers with 512 sigmoid-activated neurons per layer. These models were trained on MNIST [23].

Following common practice, we used the last 10,000 images of the training set as a validation (holdout) set for tuning hyperparameters. We then retrained the models from scratch on the full training set with the best validated configuration. We trained each of the five models with the same set of five different initializations and report the average results over this set. For the stochastic models (all except DET), we drew *M* = 32 samples per stochastic layer during both training and inference, and performed inference 10 times at test time to report the mean classification errors for MNIST. The value *M* = 32 is empirically reasonable in this experiment, as illustrated in Figure 4.

For JointMIB and GreedyMIB, we set *γl* = 1 (in JointMIB only) and *βl* = *β*, ∀1 ≤ *l* ≤ *L*, and tuned *β* on a log scale, *β* ∈ {10<sup>−*i*</sup> : 1 ≤ *i* ≤ 10}. We found that *β* = 10<sup>−4</sup> worked best for both models (Figure 5). For VIB, we found that *β* = 10<sup>−3</sup> worked best on MNIST. We trained all the models on MNIST with Adadelta optimization [36], except for VIB, for which we used Adam optimization [37], as we found these choices worked best on the validation set.

The results are shown in Table 1. JointMIB substantially outperformed DET, SFNN (MLE), and VIB on MNIST, while GreedyMIB outperformed only DET and underperformed SFNN. Though JointMIB and GreedyMIB can have comparable information representations, as illustrated in the synthetic experiment in Section 6.1, in practice it can be harder to obtain a comparable information representation with GreedyMIB. In GreedyMIB, each layer must be trained greedily to obtain its information representation. This greedy nature makes it difficult to determine when to stop the training and finalize the information representation for each layer. In addition, greedy training is expensive. JointMIB is more efficient because it obtains a compromised information representation in each layer jointly, allowing the compromised information representations of all the layers to interact with each other during learning. In principle, it is also harder to obtain a good information representation with GreedyMIB: due to the conflicting information optimality in MIB (Theorem 1), a good encoder for the first layer does not guarantee a good information trade-off in the deeper layers. Though JointMIB also suffers from the conflicting information optimality, jointly and explicitly inducing relevant but compressed information into each layer of a neural network via MIBs, as in JointMIB, can ease the training.

**Figure 4.** The value of *M* versus validation error. *M* = 32 gave a reasonably good performance as compared to other larger values.

**Figure 5.** The learning curves of JointMIB and SFNN on the MNIST validation set. Either a too-large or a too-small value of *β* can hurt the generalization of learning: a large value of *β* introduces overly aggressive compression, while a smaller value allows more irrelevant information into the representation. We found that *β* = 10<sup>−4</sup> was the best trade-off hyperparameter for JointMIB in this experiment.

**Table 1.** The performance of the variational MIB variants (i.e., JointMIB and GreedyMIB) for classification and adversarial robustness on MNIST in comparison with MLE and the variational information bottleneck (VIB). MIB explicitly induces a compression–relevance trade-off in each layer during training; it outperforms the other models of the same architecture and is more adversarially robust. DET: deterministic neural network.


#### *6.3. Robustness against Adversarial Attacks*

We consider here the adversarial robustness of neural networks trained with MIBs. Neural networks are prone to adversarial attacks, which disturb the input pixels by small amounts that are imperceptible to humans [38,39]. Adversarial attacks generally fall into two categories: untargeted and targeted attacks. An untargeted adversarial attack A maps the target model *M* and an input image *x* into an adversarially perturbed image *x*′, i.e., A : (*M*, *x*) → *x*′, and is considered successful if it fools the model: *M*(*x*′) ≠ *M*(*x*). A targeted attack, on the other hand, takes an additional target label *l*, i.e., A : (*M*, *x*, *l*) → *x*′, and is considered successful if *M*(*x*′) = *l* ≠ *M*(*x*).

We performed adversarial attacks on the neural networks trained with MLE and MIB, and used the accuracy on adversarially perturbed versions of the test set to rank each model's robustness. We used the *L*<sub>2</sub> attack method for both targeted and untargeted attacks [40], which has been shown to be the most effective attack algorithm, requiring smaller perturbations. Specifically, we attacked the same four comparative models described in the previous experiment on the first 1000 samples of the MNIST test set. For the targeted attacks, we targeted each image at each of the nine labels other than the image's true label. We used the same hyperparameters as in the classification experiment. The value *β* = *βl* = 10<sup>−4</sup> was also reasonable for this adversarial robustness task (Figure 6).

**Figure 6.** Adversarial robustness of JointMIB for various values of *β*. Introducing aggressive compression (i.e., large values of *β*) reduced adversarial robustness, while smaller values of *β* gave comparable robustness. The best information trade-off for targeted attacks was at *β* = 10<sup>−4</sup> in this experiment.

The results are shown in Table 1. Firstly, it was expected that the adversarial robustness accuracy under the targeted attacks would be smaller than that under the untargeted attacks, because the targeted attacks are more challenging for the neural networks to overcome than untargeted ones. This is consistent with our experiment. Secondly, the deterministic model DET was totally fooled by all the attacks. It is known that stochasticity in neural networks improves adversarial robustness, which is consistent with our experiment, as SFNN was significantly more adversarially robust than DET. Thirdly, VIB had adversarial robustness comparable to SFNN even though VIB has "less stochasticity" than SFNN (VIB has one stochastic layer while all hidden layers of the SFNN are stochastic). We hypothesize that the IB principle applied to VIB's single stochastic layer compensated for this. Finally, JointMIB was more adversarially robust than the other models. Again, GreedyMIB was not very effective for adversarial robustness (it was worse than VIB in the targeted attacks and worse than SFNN in the untargeted attacks). We hypothesize that this relates to the difficulty for GreedyMIB of obtaining a good information representation for all layers. In conclusion, this experiment suggests that explicitly and jointly inducing compression and relevance into each layer has good potential to make neural networks more adversarially robust.

#### *6.4. Multi-Modal Learning*

One of the main advantages of stochastic neural networks is their ability to model a structured output space in which a one-to-many mapping is required. A binary stochastic variable *zl* of dimensionality *nl* can take on 2<sup>*nl*</sup> different states, each of which can give a different *y*ˆ. Thus, the conditional distribution *p*(*y*ˆ|*x*) in stochastic neural networks is multi-modal. Hence, in this experiment, we evaluated how MIB affects the multi-modal learning capability of SNNs.
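This one-to-many behaviour can be illustrated with a minimal numpy sketch (the layer sizes and random weights below are illustrative assumptions): drawing several predictions for a single input, different Bernoulli samples of the binary hidden state can decode to different output modes.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def sample_predictions(x, W1, b1, W2, b2, num_samples=9):
    """Draw several predictions y_hat for one input x: each Bernoulli
    sample of the binary hidden state z can decode to a different mode."""
    outs = []
    for _ in range(num_samples):
        p_z = sigmoid(W1 @ x + b1)                     # p(z = 1 | x)
        z = (rng.random(p_z.shape) < p_z).astype(float)  # hard binary sample
        outs.append(sigmoid(W2 @ z + b2))              # mean of p(y_hat | z)
    return np.array(outs)

# toy network: 4 inputs -> 8 binary hidden units -> 3 outputs
x = rng.random(4)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)
samples = sample_predictions(x, W1, b1, W2, b2)
print(len(np.unique(samples.round(3), axis=0)) > 1)  # stochastic z yields multiple modes
```

A deterministic network, by contrast, would return the same output for every draw, collapsing *p*(*y*ˆ|*x*) to a single mode.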

**Figure 7.** Samples drawn from the prediction of the lower half of the MNIST test data digits based on the upper half for JointMIB (**right**, after 60 epochs) and SFNN (**left**, after 200 epochs). The leftmost column is the original MNIST test digit followed by the masked out digits and nine samples. The rightmost column was obtained by averaging over all generated samples of bottlenecks drawn from the prediction. The figures illustrate the capability of modeling structured output space using JointMIB and SFNN. JointMIB generated more recognizable digits within much fewer training epochs.

In this experiment, we followed [21] and predicted the lower half of the MNIST digits using the upper half as input. We used the same neural network architecture of 392–512–512–392 for JointMIB and SFNN and trained them with SGD with a constant learning rate of 0.01 (due to the under-performance of GreedyMIB in the previous experiments and its expensive training, we compared only JointMIB with SFNN in this experiment). We trained the models on the full training set of 60,000 images and tested on the test set. For JointMIB, we again used *βl* = *β* = 10<sup>−4</sup>. The results of JointMIB at epoch 60 and MLE at epoch 200 are shown in Figure 7. Firstly, JointMIB generated digit variations that were more recognizable than those generated by MLE. In particular, some samples of digits 2, 4, 5, and 7 generated by MLE were distorted, while all digit samples generated by JointMIB were recognizable. Secondly, JointMIB needed far fewer epochs to achieve good samples. We trained JointMIB for only up to 60 epochs, while we trained MLE for up to 200 epochs and did not observe comparably good samples at any point in between. This further highlights the advantage of MIB in obtaining a good information representation within far fewer training epochs. Furthermore, we expect the advantage of inducing compression and relevance into each layer by JointMIB to be particularly helpful for multi-modal learning, because the modes generated in each hidden layer are critical for representing multiple output modes. While MLE ignores the explicit contribution of each layer to the information representation of the neural network, JointMIB explicitly takes the compression and relevance of each layer into account.

#### **7. Discussion and Future Work**

In this work, we introduced the Markov information bottleneck (MIB), an extension of the original information bottleneck to the setting where a representation consists of multiple stochastic variables that form a Markov chain. In this setting, we showed that one cannot simply apply the original IB principle directly to each variable, as their information optimalities conflict in most of the interesting cases. We suggested a simple but efficient fix via a joint compromise: we combine the information trade-offs of all the variables into a weighted sum, encouraging better information trade-offs for all the variables during learning. In the context of stochastic neural networks in particular, we presented a variational inference scheme to estimate the compression and relevance of each bottleneck. As a result, the variational MIB turns the intractable decoding of each bottleneck into an efficient approximate inference for that bottleneck. This variational approximation turns out to generalize the MLE principle in the context of stochastic neural networks. We empirically demonstrated the effectiveness of MIB by comparing it with baselines using the MLE principle and the variational information bottleneck in classification, adversarial robustness, and multi-modal learning. The empirical performance supports the potential benefit of explicitly inducing compression and relevance into each layer (e.g., in a joint manner), indicating a special link between information representation and performance in classification, adversarial robustness, and multi-modal learning.

One limitation of our current approach is that the number of samples generated via *zl* ∼ *p*(*zl*|*zl−1*) to estimate the variational compression and relevance scales exponentially with the number of layers. This is, however, a common drawback of performing inference in fully stochastic neural networks, and it can be overcome by using partially stochastic neural networks. In addition, the Monte Carlo estimate of the variational mutual information, though unbiased, has high variance and is sample-inefficient. This limitation can be overcome by resorting to more advanced methods of estimating mutual information, such as [41,42]. The MIB framework also admits several possible future extensions, including scaling the framework to bigger networks and to real-valued stochastic neural networks. The extension to real-valued stochastic neural networks is straightforward, e.g., by constructing a Gaussian layer to model *p*(*zl*|*zl−1*) and using reparameterization tricks [43] to perform back-propagation through sampling. Another dimension of improvement is to study the effect of MIB's hyperparameters. The current work only considers equal *γl* = *γ* for JointMIB and equal *βl* = *β*, with *β* tuned via grid search. One could use, e.g., Bayesian optimization [44] to efficiently tune *γl* and *βl* with expandable bounds. In addition, we believe that the challenge of applying our methods to more advanced datasets such as ImageNet [45] is partly that of scaling the stochastic neural network itself, as more challenging datasets tend to require more expressive models. From this perspective, the challenge of scaling to large datasets can be partially addressed with solutions for scaling stochastic neural networks, some of which we suggest above.
Furthermore, we believe that, as one of the main messages of our work, explicitly inducing compressed and relevant information (e.g., via mutual information as in MIB) into many intermediate layers can be more beneficial for large-scale tasks than simply resorting to the MLE principle. An intuition is to think of this as *information-theoretic regularization for intermediate layers*. Finally, an important follow-up question is whether there is a theoretical, and stronger empirical, link between an improved information representation (e.g., in the MIB sense) and the generalization of neural networks. This connection may be intuitively plausible, but a systematic empirical study or a theoretical account of it is an important direction for future research.

**Author Contributions:** Conceptualization by J.C. and T.T.N.; writing and conducting experiments supervised by J.C.; methodology, software, validation, formal analysis, investigation, visualization, and writing of the original draft by T.T.N.

**Funding:** This work was supported by the Institute for Information and Communications Technology Planning and Evaluation (IITP) grant (No.2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence).

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Nonlinear Information Bottleneck**

#### **Artemy Kolchinsky 1,\*, Brendan D. Tracey 1,2 and David H. Wolpert 1,3,4**


Received: 16 October 2019; Accepted: 28 November 2019; Published: 30 November 2019

**Abstract:** Information bottleneck (IB) is a technique for extracting information in one random variable *X* that is relevant for predicting another random variable *Y*. IB works by encoding *X* in a compressed "bottleneck" random variable *M* from which *Y* can be accurately decoded. However, finding the optimal bottleneck variable involves a difficult optimization problem, which until recently has been considered for only two limited cases: discrete *X* and *Y* with small state spaces, and continuous *X* and *Y* with a Gaussian joint distribution (in which case optimal encoding and decoding maps are linear). We propose a method for performing IB on arbitrarily-distributed discrete and/or continuous *X* and *Y*, while allowing for nonlinear encoding and decoding maps. Our approach relies on a novel non-parametric upper bound for mutual information. We describe how to implement our method using neural networks. We then show that it achieves better performance than the recently-proposed "variational IB" method on several real-world datasets.

**Keywords:** information bottleneck; mutual information; representation learning; neural networks

#### **1. Introduction**

Imagine that one has two random variables, an "input" random variable *X* and an "output" random variable *Y*, and that one wishes to use *X* to predict *Y*. In some situations, it is useful to extract a compressed representation of *X* that is relevant for predicting *Y*. This problem is formally considered by the *information bottleneck* (IB) method [1–3]. IB proposes to find a "bottleneck" variable *M* which maximizes prediction, formulated in terms of the mutual information *I*(*Y*; *M*), given a constraint on compression, formulated in terms of the mutual information *I*(*X*; *M*). Formally, this can be stated in terms of the constrained optimization problem

$$\underset{M \in \Delta}{\arg\max} \ I(Y; M) \quad \text{s.t.} \quad I(X; M) \le R,\tag{1}$$

where Δ is the set of random variables *M* that obey the Markov condition *Y* − *X* − *M* [4–6]. This Markov condition states that *M* is conditionally independent of *Y* given *X*, and it guarantees that any information that *M* has about *Y* is extracted from *X*. The maximal value of *I*(*Y*; *M*) for each possible compression value *R* forms what is called the *IB curve* [1].

The following example illustrates how IB might be used. Suppose that a remote weather station makes detailed recordings of meteorological data (*X*), which are then encoded and sent to a central server (*M*) and used to predict weather conditions for the next day (*Y*). If the channel between the weather station and server has low capacity, then the information transmitted from the weather station to the server must be compressed. Minimizing the IB objective amounts to finding a compressed representation of meteorological data which can be transmitted across a low capacity channel (have low *I*(*X*; *M*)) and used to optimally predict future weather (have high *I*(*Y*; *M*)). The IB curve specifies the trade-off between channel capacity and accurate prediction.

Numerous applications of IB exist in domains such as clustering [7,8], coding theory and quantization [9–12], speech and image recognition [13–17], and cognitive science [18]. Several recent papers have also drawn connections between IB and supervised learning, in particular, classification using neural networks [19,20]. In this context, *X* typically represents input vectors, *Y* the output classes, and *M* the intermediate representations used by the network, such as the activity of hidden layer(s) [21]. Existing research has considered whether intermediate representations that are optimal in the IB sense (i.e., close to the IB curve) may be better in terms of generalization error [21–23], robustness to adversarial inputs [24], detection of out-of-distribution data [25], or provide more "interesting" or "useful" intermediate representations of inputs [26]. Other related research has investigated whether stochastic gradient descent (SGD) training dynamics may drive hidden layer representations towards IB optimality [27,28].

In practice, optimal bottleneck variables are usually not found by solving the constrained optimization problem of Equation (1), but rather by finding *M* that maximize the so-called *IB Lagrangian* [1,6,22],

$$\mathcal{L}\_{\text{IB}}(M) := I(Y; M) - \beta I(X; M). \tag{2}$$

LIB is the Lagrangian relaxation [29] of the constrained optimization problem of Equation (1), and *β* is a Lagrange multiplier that enforces the constraint *I*(*X*; *M*) ≤ *R*. In practice, *β* ∈ [0, 1] serves as a parameter that controls the trade-off between compression and prediction. As *β* → 1, IB will favor maximal compression of *X*; for *β* = 1 (or any *β* ≥ 1), the optimal *M* will satisfy *I*(*X*; *M*) = *I*(*Y*; *M*) = 0. As *β* → 0, IB will favor prediction of *Y*; for *β* = 0 (or any *β* ≤ 0), there is no penalty on *I*(*X*; *M*) and the optimal *M* will satisfy *I*(*Y*; *M*) = *I*(*X*; *Y*), the maximum possible. It is typically easier to optimize LIB than Equation (1), since the latter involves a complicated non-linear constraint. For this reason, optimizing LIB has become standard in the IB literature [1,6,19,20,22,24,30,31].

However, in recent work [32] we showed that whenever *Y* is a deterministic function of *X* (or close to being one), optimizing LIB is no longer equivalent to optimizing Equation (1). In fact, when *Y* is a deterministic function of *X*, the same *M* will optimize LIB for all values of *β*, meaning that the IB curve cannot be explored by optimizing LIB while sweeping *β*. This is a serious issue in supervised learning scenarios (as well as some other domains), where it is very common for the output *Y* to be a deterministic function of the input *X*. Nonetheless, the IB curve can still be explored by optimizing the following simple modification of the IB Lagrangian, which we called the *squared-IB Lagrangian* [32],

$$\mathcal{L}\_{\text{sqIB}}(M) := I(Y; M) - \beta I(X; M)^2, \tag{3}$$

where *β* ≥ 0 is again a parameter that controls the trade-off between compression and prediction. Unlike the case for LIB, there is always a one-to-one correspondence between *M* that optimize LsqIB and solutions to Equation (1), regardless of the relationship between *X* and *Y*. In the language of optimization theory, the squared-IB Lagrangian is a "scalarization" of the multi-objective problem {min *I*(*X*; *M*), max *I*(*Y*; *M*)} [33]. Importantly, unlike LIB, there can be non-trivial optimizers of LsqIB even for *β* ≥ 1; the relationship between *β* and corresponding solutions on the IB curve has been analyzed in [34]. In that work, it was also shown that the objective function of Equation (3) is part of a general family of objectives *I*(*Y*; *M*) − *βF*(*I*(*X*; *M*)), where *F* is any monotonically-increasing and strictly convex function, all of which can be used to explore the IB curve.

Unfortunately, optimizing the IB Lagrangian and squared-IB Lagrangian remains a difficult problem. First, both objectives are non-convex, so there is no guarantee that a global optimum can be found. Second, finding even a local optimum requires evaluating the mutual information terms *I*(*X*; *M*) and *I*(*Y*; *M*), which can involve intractable integrals. For this reason, until recently IB has been mainly developed for two limited cases. The first case is where *X* and *Y* are discrete-valued and have a small number of possible outcomes [1]. There, one can explicitly represent the full *encoding map* (the conditional probability distribution of *M* given *X*) during optimization, and the relevant integrals become tractable finite sums. The second case is when *X* and *Y* are continuous-valued and jointly Gaussian. Here, the IB optimization problem can be solved analytically, and the resulting encoding and decoding maps are linear [31].

In this work, we propose a method for performing IB in much more general settings, which we call *nonlinear information bottleneck*, or *nonlinear IB* for short. Our method assumes that *M* is a continuous-valued random variable, but *X* and *Y* can be either discrete-valued (possibly with many states) or continuous-valued, and with any desired joint distribution. Furthermore, as suggested by the term nonlinear IB, the encoding and decoding maps can be nonlinear.

To carry out nonlinear IB, we derive a lower bound on LIB (or, where appropriate, LsqIB) which can be maximized using gradient-based methods. As we describe in the next section, our approach makes use of the following techniques:

- a variational lower bound on the prediction term *I*(*Y*; *M*), based on a parameterized decoding map;
- a non-parametric upper bound on the compression term *I*(*X*; *M*), based on modeling the bottleneck variable as a mixture of Gaussians;
- gradient-based optimization of the resulting objective, using neural networks to parameterize the encoding and decoding maps.


Note that three recent papers have suggested other ways of optimizing the IB Lagrangian in general settings [24,36,37]. These papers use variational upper bounds on the compression term *I*(*X*; *M*), which is different from our non-parametric upper bound. A detailed comparison is provided in Section 3. In that section, we also relate our approach to other work in machine learning.

In Section 4, we explain how to implement our approach using standard neural network techniques. We demonstrate its performance on several real-world datasets, and compare it to the recently-proposed *variational IB* method [24].

#### **2. Proposed Approach**

In the following, we use *H*(·) for Shannon entropy, *I*(·; ·) for mutual information [MI], and *D*<sub>KL</sub>(·‖·) for Kullback–Leibler [KL] divergence. All information-theoretic quantities are in units of bits, and all logs are base-2. We use N(*μ*, Σ) to indicate the probability density function of a multivariate Gaussian with mean *μ* and covariance matrix Σ. We use notation like E<sub>*P*(*X*)</sub>[*f*(*X*)] = ∫ *P*(*x*)*f*(*x*) *dx* to indicate expectations, where *f*(*x*) is some function and *P*(*x*) some probability distribution. We use *δ*(·, ·) for the Kronecker delta.

Let the input random variable *X* and the output random variable *Y* be distributed according to some joint distribution *Q*(*x*, *y*), with marginals indicated by *Q*(*y*) and *Q*(*x*). We assume that we are provided with a "training dataset" D = {(*x*<sub>1</sub>, *y*<sub>1</sub>), ... , (*x*<sub>*N*</sub>, *y*<sub>*N*</sub>)}, which contains *N* input–output pairs sampled IID from *Q*(*x*, *y*). Let *M* indicate the bottleneck random variable, with outcomes in R<sup>*d*</sup>. In the derivations in this section, we assume that *X* and *Y* are continuous-valued, but our approach extends immediately to the discrete case (with some integrals replaced by sums).

Let the conditional probability *P*<sub>*θ*</sub>(*m*|*x*) indicate a parameterized *encoding map* from input *X* to the bottleneck variable *M*, where *θ* is a vector of parameters. Given an encoding map, one can compute the MI between *X* and *M*, *I*<sub>*θ*</sub>(*X*; *M*), using the joint distribution *Q*<sub>*θ*</sub>(*x*, *m*) := *P*<sub>*θ*</sub>(*m*|*x*)*Q*(*x*). Similarly, one can compute the MI between *Y* and *M*, *I*<sub>*θ*</sub>(*Y*; *M*), using the joint distribution

$$Q\_{\theta}(y,m) := \int P\_{\theta}(m|\mathbf{x}) Q(\mathbf{x}, y) \, d\mathbf{x} \,. \tag{4}$$

We now consider the IB Lagrangian, Equation (2), as a function of the encoding map parameters,

$$\mathcal{L}\_{\text{IB}}(\theta) := I\_{\theta}(Y; M) - \beta I\_{\theta}(X; M) \,. \tag{5}$$

In this parametric setting, we seek parameter values that maximize LIB(*θ*). Unfortunately, this optimization problem is usually intractable due to the difficulty of computing the integrals in Equation (4) and in the MI terms of Equation (5). Nonetheless, it is possible to carry out an approximate form of IB by maximizing a tractable lower bound on LIB, which we now derive.

First, consider any conditional probability *P*<sub>*φ*</sub>(*y*|*m*) of outputs given the bottleneck variable, where *φ* is a vector of parameters, which we call the *(variational) decoding map*. Given *P*<sub>*φ*</sub>(*y*|*m*), the non-negativity of KL divergence leads to the following variational lower bound on the first MI term in Equation (5),

$$\begin{split} I\_{\theta}(Y;M) &= H(Q(Y)) - H(Q\_{\theta}(Y|M)) \\ &\geq H(Q(Y)) - H(Q\_{\theta}(Y|M)) - D\_{\text{KL}}(Q\_{\theta}(Y|M) \| P\_{\phi}(Y|M)) \\ &= H(Q(Y)) + \mathbb{E}\_{Q\_{\theta}(Y,M)} \left[ \log P\_{\phi}(Y|M) \right], \end{split} \tag{6}$$

where in the last line we've used the following identity,

$$-\mathbb{E}\_{Q\_{\theta}(Y,M)}\left[\log P\_{\phi}(Y|M)\right] = D\_{\text{KL}}(Q\_{\theta}(Y|M) \| P\_{\phi}(Y|M)) + H(Q\_{\theta}(Y|M)).\tag{7}$$

Note that the inequality of Equation (6) holds for any choice of *P*<sub>*φ*</sub>(*y*|*m*), and becomes an equality when *P*<sub>*φ*</sub>(*y*|*m*) is equal to the "optimal" decoding map *Q*<sub>*θ*</sub>(*y*|*m*) (as would be computed from Equation (4)). Moreover, the bound becomes tighter as the KL divergence between *P*<sub>*φ*</sub>(*y*|*m*) and *Q*<sub>*θ*</sub>(*y*|*m*) gets smaller. Below, we will maximize the RHS of Equation (6) with respect to *φ*, thereby bringing *P*<sub>*φ*</sub>(*y*|*m*) closer to *Q*<sub>*θ*</sub>(*y*|*m*).

It remains to upper bound the *I*<sub>*θ*</sub>(*X*; *M*) term in Equation (5). To proceed, we first approximate the joint distribution of *X* and *Y* with the empirical distribution in the training dataset,

$$Q(x, y) \approx \frac{1}{N} \sum\_{i} \delta(x\_i, x)\delta(y\_i, y). \tag{8}$$

We then assume that the encoding map is the sum of a deterministic function *f*<sub>*θ*</sub>(*x*) and Gaussian noise,

$$M = f\_{\theta}(X) + Z\,, \tag{9}$$

where (*Z*|*X* = *x*) ∼ N(0, Σ<sub>*θ*</sub>(*x*)). Note that the noise covariance Σ<sub>*θ*</sub>(*x*) can depend both on the parameters *θ* and the outcome of *X* (i.e., the noise can be heteroscedastic). Combining Equation (8) and Equation (9) implies that the bottleneck variable *M* will be distributed as a mixture of *N* equally-weighted Gaussian components, with component *i* having distribution N(*f*<sub>*θ*</sub>(*x*<sub>*i*</sub>), Σ<sub>*θ*</sub>(*x*<sub>*i*</sub>)). We can then employ the following non-parametric upper bound on MI, which was derived in a recent paper [35]:

$$I\_{\theta}(X;M) \le \hat{I}\_{\theta}(X;M) := -\frac{1}{N} \sum\_{i} \log \frac{1}{N} \sum\_{j} e^{-D\_{\text{KL}}\left[\mathcal{N}(f\_{\theta}(\mathbf{x}\_{i}), \Sigma\_{\theta}(\mathbf{x}\_{i}))\,\middle\|\,\mathcal{N}(f\_{\theta}(\mathbf{x}\_{j}), \Sigma\_{\theta}(\mathbf{x}\_{j}))\right]}\,. \tag{10}$$

(Note that the published version of [35] contains some typos which are corrected in the latest arXiv version at arxiv.org/abs/1706.02419.)

Equation (10) bounds the MI in terms of the pairwise KL divergences between the Gaussian components of the mixture distribution of *M*. It is useful because the KL divergence between two *d*-dimensional Gaussians has a closed-form expression,

$$D\_{\rm KL}\left[\mathcal{N}(\boldsymbol{\mu}',\boldsymbol{\Sigma}')\,\middle\|\,\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})\right] = \frac{1}{2}\left[\ln\frac{\det\boldsymbol{\Sigma}}{\det\boldsymbol{\Sigma}'} + (\boldsymbol{\mu}'-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}'-\boldsymbol{\mu}) + \text{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}') - d\right].\tag{11}$$

Furthermore, in the special case when all components have the same covariance and can be grouped into well-separated clusters, the upper bound of Equation (10) becomes tight [35]. As we will see below, this special case is a commonly encountered solution to the optimization problem considered here.
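Equations (10) and (11) can be transcribed directly into numpy. In the following sketch (function names and the test components are ours, not from the text), the KL divergences in the exponent are evaluated in nats, matching the e<sup>−*D*</sup> form of Equation (10), while the outer logarithm is base-2 so the bound is in bits:

```python
import numpy as np

def kl_gauss(mu1, cov1, mu0, cov0):
    """D_KL[N(mu1, cov1) || N(mu0, cov0)] in nats, per Equation (11)."""
    d = len(mu1)
    cov0_inv = np.linalg.inv(cov0)
    diff = mu1 - mu0
    return 0.5 * (np.log(np.linalg.det(cov0) / np.linalg.det(cov1))
                  + diff @ cov0_inv @ diff
                  + np.trace(cov0_inv @ cov1) - d)

def mi_upper_bound(mus, covs):
    """Non-parametric bound of Equation (10), in bits."""
    N = len(mus)
    total = 0.0
    for i in range(N):
        kls = np.array([kl_gauss(mus[i], covs[i], mus[j], covs[j])
                        for j in range(N)])
        total += np.log2(np.mean(np.exp(-kls)))
    return -total / N

# Three well-separated unit-covariance components: the bound approaches log2(3).
mus = [np.zeros(2), np.array([5.0, 0.0]), np.array([0.0, 5.0])]
covs = [np.eye(2) for _ in range(3)]
bound = mi_upper_bound(mus, covs)
```

The well-separated example illustrates the tightness property mentioned above: with equal covariances and distinct clusters, the bound approaches log<sub>2</sub> *N* bits.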

Combining Equation (6) and Equation (10) provides the following tractable lower bound for the IB Lagrangian,

$$\mathcal{L}\_{\text{IB}}(\theta) \ge \mathcal{L}\_{\text{IB}}(\theta, \phi) := \mathbb{E}\_{Q\_{\theta}(Y, M)} \left[ \log P\_{\phi}(Y|M) \right] - \beta \hat{I}\_{\theta}(X; M)\,, \tag{12}$$

where we dropped the additive constant *H*(*Q*(*Y*)) (which does not depend on the parameter values and is therefore irrelevant for optimization). We refer to Equation (12) as the *nonlinear IB* objective.

As mentioned in the introduction, in cases where *Y* is a deterministic function of *X* (or close to being one), it is no longer possible to explore the IB curve by optimizing the IB Lagrangian for different values of *β* [19,32,34]. Nonetheless, it is always possible to explore the IB curve by instead optimizing the squared-IB Lagrangian, Equation (3). The above derivations also lead to the following tractable lower bound for the squared-IB Lagrangian,

$$\mathcal{L}\_{\text{sqIB}}(\theta) \ge \mathcal{L}\_{\text{sqIB}}(\theta, \phi) := \mathbb{E}\_{Q\_{\theta}(Y, M)} \left[ \log P\_{\phi}(Y|M) \right] - \beta \left[ \hat{I}\_{\theta}(X; M) \right]^2. \tag{13}$$

Note that maximizing the expectation term E<sub>*Q*<sub>*θ*</sub>(*Y*,*M*)</sub>[log *P*<sub>*φ*</sub>(*Y*|*M*)] is equivalent to minimizing the usual cross-entropy loss in supervised learning. (Note that mean squared error, the typical loss function used for training regression models, can also be interpreted as a cross-entropy term [38] (pp. 132–134).) From this point of view, Equation (12) and Equation (13) can be interpreted as adding an information-theoretic regularization term to the regular objective of supervised learning.
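Concretely, for a classification decoder, the sample estimate of −E[log *P*<sub>*φ*</sub>(*Y*|*M*)] is the familiar cross-entropy of the softmax outputs; a small numpy illustration (the logits and labels are made up):

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Decoder outputs for 2 samples over 3 classes, and the true class labels.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2,  0.3]])
labels = np.array([0, 1])

probs = softmax(logits)
# Cross-entropy loss = average negative log-probability of the true class,
# i.e., a sample estimate of -E[log P_phi(Y|M)].
cross_entropy = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```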

For optimization purposes, the compression term *Î*<sub>*θ*</sub>(*X*; *M*) can be computed from data using Equations (10) and (11), while the expectation term E<sub>*Q*<sub>*θ*</sub>(*Y*,*M*)</sub>[log *P*<sub>*φ*</sub>(*Y*|*M*)] can be estimated as E<sub>*Q*<sub>*θ*</sub>(*Y*,*M*)</sub>[log *P*<sub>*φ*</sub>(*Y*|*M*)] ≈ (1/*N*) ∑<sub>*i*</sub> log *P*<sub>*φ*</sub>(*y*<sub>*i*</sub>|*m*<sub>*i*</sub>), where *m*<sub>*i*</sub> indicates samples from *P*<sub>*θ*</sub>(*m*|*x*<sub>*i*</sub>). Assuming that *f*<sub>*θ*</sub> is differentiable with respect to *θ* and *P*<sub>*φ*</sub> is differentiable with respect to *φ*, the optimal *θ* and *φ* can be selected by using gradient-based methods to maximize Equation (12) or Equation (13), as desired. In practice, this optimization will typically be done using stochastic gradient descent (SGD), i.e., by computing the gradient using randomly sampled mini-batches rather than the whole training dataset. In fact, mini-batching becomes necessary for large datasets, since evaluating *Î*<sub>*θ*</sub>(*X*; *M*) involves *O*(*n*<sup>2</sup>) operations, where *n* is the number of data points in the batch used to compute the gradient, which becomes prohibitively slow for very large *n*. At the same time, *Î*<sub>*θ*</sub>(*X*; *M*) is closely related to kernel-density estimators [35], and it is known that the number of samples required for accurate kernel-density estimates grows rapidly as dimensionality increases [39]. Thus, mini-batches should not be too small when *d* (the dimensionality of the bottleneck variable) is large. In some cases, it may be useful to estimate the gradient of E<sub>*Q*<sub>*θ*</sub>(*Y*,*M*)</sub>[log *P*<sub>*φ*</sub>(*Y*|*M*)] and the gradient of *Î*<sub>*θ*</sub>(*X*; *M*) using mini-batches of different sizes. More implementation details are discussed below in Section 4.1.

Note that the approach described here is somewhat different (and simpler) than in previous versions of this manuscript [40,41]. In previous versions, we represented the marginal distribution *Q*(*x*) with a mixture of Gaussians, rather than with the empirical distribution in the training data. However, we found that this added complexity without being necessary for good performance. Furthermore, we previously focused only on optimizing a bound on the IB Lagrangian, Equation (12). In subsequent work [32], we showed that the IB Lagrangian is inadequate for many supervised learning scenarios, including some of those explored in Section 4.2, and that the squared-IB Lagrangian should be used instead. In this work, we report performance when optimizing Equation (13), a bound on the squared-IB Lagrangian.

#### **3. Relation to Prior Work**

In this section, we relate our proposed method to prior work in machine learning.

#### *3.1. Variational IB*

Recently, there have been three other proposals for performing IB for continuous and possibly non-Gaussian random variables using neural networks [24,36,37], the most popular of which is called *variational IB* (VIB) [24]. As in our approach, these methods propose tractable lower bounds on the LIB objective. They employ the same variational bound for the prediction MI term *I*(*Y*; *M*) as our Equation (6). These methods differ from ours, however, in how they bound the compression term, *I*<sub>*θ*</sub>(*X*; *M*). In particular, they all use some form of the following variational upper bound,

$$I\_{\theta}(X;M) = D\_{\text{KL}}\left(P\_{\theta}(M|X)\|R(M)\right) - D\_{\text{KL}}\left(P\_{\theta}(M)\|R(M)\right) \leq D\_{\text{KL}}\left(P\_{\theta}(M|X)\|R(M)\right),\tag{14}$$

where *R* is some surrogate marginal distribution over the bottleneck variable *M*. Combining with Equation (6) leads to the following variational lower bound for LIB,

$$\mathcal{L}\_{\text{IB}}(M) \ge \mathbb{E}\_{Q\_{\theta}(Y,M)} \left[ \log P\_{\phi}(Y|M) \right] - \beta D\_{\text{KL}} \left( P\_{\theta}(M|X) \| R(M) \right) + \text{const.} \tag{15}$$

The three aforementioned papers differ in how they define the surrogate marginal distribution *R*. In [24], *<sup>R</sup>* is a standard multivariate normal distribution, <sup>N</sup> (0,**I**). In [36], *<sup>R</sup>* is a product of Student's *t*-distributions. The scale and shape parameters of each *t*-distribution are optimized during training, in this way tightening the bound in Equation (14). In [37], two surrogate distributions are considered, the improper log-uniform and the log-normal, with the appropriate choice depending on the particular activation function (non-linearity) used in the neural network.
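For the standard-normal surrogate *R*(*M*) = N(0, **I**) and a Gaussian encoder with diagonal covariance, the KL term on the right of Equation (14) has a well-known per-sample closed form, which is what makes the VIB penalty cheap to compute. A numpy sketch (function and variable names are ours):

```python
import numpy as np

def kl_to_std_normal(mu, log_var):
    """Per-sample D_KL[N(mu, diag(exp(log_var))) || N(0, I)] in nats.

    mu, log_var: arrays of shape (batch, d)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)

# Two illustrative 2-d encoder outputs with unit variance (log_var = 0).
mu = np.array([[0.0, 0.0],
               [1.0, -1.0]])
log_var = np.zeros((2, 2))

# The VIB compression penalty is the batch average -- O(n) in batch size,
# versus the O(n^2) pairwise computation of Equation (10).
penalty = kl_to_std_normal(mu, log_var).mean()
```

With unit variance, the per-sample KL reduces to ½‖*μ*‖², so the first sample contributes zero and the second contributes 1 nat.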

In addition, the encoding map *P*<sub>*θ*</sub>(*m*|*x*) in [36] and [24] is a deterministic function plus Gaussian noise, the same as in Equation (9). In [37], the encoding map consists of a deterministic function with multiplicative, rather than additive, noise.

These alternative methods have potential advantages and disadvantages compared to our approach. On one hand, they are more computationally efficient: our non-parametric estimator *Î*<sub>*θ*</sub>(*X*; *M*) requires *O*(*n*<sup>2</sup>) operations per mini-batch (where *n* is the size of the mini-batch), while the variational bound of Equation (14) requires *O*(*n*) operations. On the other hand, our non-parametric estimator is expected to give a better estimate of the true MI *I*(*X*; *M*) [35]. We provide a comparison between our approach and variational IB [24] in Section 4.2.

#### *3.2. Neural Networks and Kernel Density Entropy Estimates*

A key component of our approach is using a differentiable upper bound on MI, *Î*<sub>*θ*</sub>(*X*; *M*). As discussed in [35], this bound is related to non-parametric kernel-density estimators of MI. See [42–46] for related work on using neural networks to optimize non-parametric estimates of information-theoretic functions. This technique can also be related to kernel-based estimation of the likelihood of held-out data for neural networks (e.g., [47]). In these latter approaches, however, the likelihood of held-out data is estimated only once, as a diagnostic measure once learning is complete. We instead propose to train the network by directly incorporating our non-parametric estimator *Î*<sub>*θ*</sub>(*X*; *M*) into the objective function.

#### *3.3. Auto-Encoders*

Auto-encoders are unsupervised learning architectures that learn to reconstruct a copy of the input *X*, while using some intermediate representations (such as a hidden layer in a neural network). Auto-encoders have some conceptual relationships to IB, in that the intermediate representations are sometimes restricted in terms of dimensionality, or with information-theoretic penalties on hidden layer coding length [48,49]. Similar penalties have also been explored in a supervised learning scenario in [50]. In that work, however, hidden layer states were treated as discrete-valued, limiting the flexibility and information capacity of hidden representations.

More recently, *denoising auto-encoders* [51] have attracted attention. Denoising auto-encoders constrain the amount of information passing from input to hidden layers by injecting noise into the hidden layer activity, similarly to our noisy mapping from the input to the bottleneck layer. Previous work on auto-encoders has considered either penalizing hidden layer coding length *or* injecting noise into the map, rather than combining the two as we do here. Moreover, denoising auto-encoders do not have a notion of an "optimal" noise level, since less noise will always improve prediction error on the training data. Thus, they cannot directly adapt the noise level (as done in our method).

Finally, *variational auto-encoders* (VAEs) [52] are recently-proposed architectures which learn generative models from unsupervised data (i.e., after training, they can be used to generate new samples that resemble the training data). Interestingly, the objective optimized in VAE training, called the "ELBO", contains both a prediction term and a compression term, and can be seen as a special case of the variational IB objective [24,37,53,54]. In principle, it may be fruitful to replace the compression term in the ELBO with our MI estimator *Î*<sub>*θ*</sub>(*X*; *M*). Given our reported performance below, this may result in better compression, though it might also complicate sampling from the latent variable space. We leave this line of research for future work.

#### **4. Experiments**

In this section, we first explain how to implement nonlinear IB using neural network techniques. We then evaluate it on several datasets, and compare it to the variational IB (VIB) method. We demonstrate that, compared to VIB, nonlinear IB achieves better performance and uncovers different kinds of representations.

#### *4.1. Implementation*

Any implementation of nonlinear IB requires a way to compute the encoding map *P*<sub>*θ*</sub>(*m*|*x*) and decoding map *P*<sub>*φ*</sub>(*y*|*m*), as well as a way to choose the parameters of these maps so as to maximize the nonlinear IB objective. Here we explain how this can be done using standard neural network methods.

The encoding map *P*<sub>*θ*</sub>(*m*|*x*), Equation (9), is implemented in the following way: first, several neural network layers with parameters *θ* implement the (possibly nonlinear) deterministic function *f*<sub>*θ*</sub>(*x*). The output of these layers is then added to zero-centered Gaussian noise with covariance Σ<sub>*θ*</sub>(*x*), which becomes the state of the *bottleneck layer*. This is typically done via the "reparameterization trick" [52], in which samples of Gaussian noise are passed through several deterministic layers (whose parameters are also indicated by *θ*) and then added to *f*<sub>*θ*</sub>(*x*). Note that due to the presence of noise, the neural network is stochastic: even with parameters held constant, different states of the bottleneck layer are sampled during different network evaluations. This stochasticity guarantees that the mutual information *I*(*X*; *M*) is finite [26,28].
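The reparameterization trick amounts to sampling the noise externally, so that the bottleneck state remains a deterministic, differentiable function of the parameters. A minimal numpy sketch, with a single linear layer standing in for the deterministic encoder and a fixed homoscedastic noise scale (all names and shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(x, W, b):
    """Stand-in for the deterministic encoder layers (one linear layer here)."""
    return x @ W + b

def sample_bottleneck(x, W, b, sigma2):
    """M = f_theta(X) + Z with Z ~ N(0, sigma2 * I), via reparameterization."""
    eps = rng.standard_normal((x.shape[0], b.shape[0]))  # noise sampled outside
    return f_theta(x, W, b) + np.sqrt(sigma2) * eps      # differentiable in W, b

x = rng.standard_normal((4, 784))        # small batch of flattened inputs
W = rng.standard_normal((784, 5)) * 0.01 # encoder weights, 5-d bottleneck
b = np.zeros(5)
m = sample_bottleneck(x, W, b, sigma2=1.0)  # shape (4, 5), stochastic
```

Because the noise enters additively, gradients with respect to `W` and `b` flow through `f_theta` unchanged; in TensorFlow or PyTorch the same structure is used with automatic differentiation.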

In all of the experiments described below, the encoding map consists of two layers with 128 ReLU neurons each, followed by a layer of 5 linear neurons. In addition, for simplicity we use a simple homoscedastic noise model: Σ<sub>*θ*</sub>(*x*) = *σ*<sup>2</sup>**I**, where *σ*<sup>2</sup> is a parameter that sets the scale of the noise variance. This noise model permits us to rewrite the MI bound of Equation (10) in terms of the following simple expression,

$$\hat{I}\_{\theta}(X;M) = -\frac{1}{N} \sum\_{i} \log \frac{1}{N} \sum\_{j} e^{-\frac{1}{2\sigma^{2}} \|f\_{\theta}(\mathbf{x}\_{i}) - f\_{\theta}(\mathbf{x}\_{j})\|\_{2}^{2}}.\tag{16}$$

For purposes of comparison, we use this same homoscedastic noise model for both nonlinear IB and for VIB (note that the original VIB paper [24] used a heteroscedastic noise model; investigating the performance of nonlinear IB with heteroscedastic noise remains for future work).
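Equation (16) can be evaluated with one pairwise-distance computation followed by a numerically stabilized log-sum-exp. A numpy sketch (the function name is ours; in practice this runs on mini-batches, and `scipy.special.logsumexp` can replace the manual stabilization):

```python
import numpy as np

def mi_upper_bound_homoscedastic(fx, sigma2):
    """Equation (16): bound on I(X;M) in bits, given fx = f_theta(x_i) of shape (N, d)."""
    N = fx.shape[0]
    # Pairwise squared distances ||f(x_i) - f(x_j)||^2 as an (N, N) matrix.
    sq = np.sum((fx[:, None, :] - fx[None, :, :])**2, axis=-1)
    logits = -sq / (2.0 * sigma2)
    # Numerically stable log2 of mean_j exp(logits_ij).
    mx = logits.max(axis=1, keepdims=True)
    log_mean = (mx.squeeze(1) + np.log(np.exp(logits - mx).sum(axis=1))) / np.log(2) - np.log2(N)
    return -log_mean.mean()

# Three well-separated encoded points: the bound approaches log2(3) bits.
fx = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
bound = mi_upper_bound_homoscedastic(fx, sigma2=1.0)
```

If all encoded points coincide, the bound is 0 bits; if they are far apart relative to *σ*, it saturates at log<sub>2</sub> *N* bits, matching the discussion of the initial *σ*<sup>2</sup> below.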

In our runs, the noise parameter *σ*<sup>2</sup> was one of the trainable parameters in *θ*. The initial value of *σ*<sup>2</sup> should be chosen with some care. If the initial *σ*<sup>2</sup> is too small, the Gaussian components that make up the mixture distribution of *M* will be many standard deviations away from each other, and *Î*<sub>*θ*</sub>(*X*; *M*) (as well as *I*(*X*; *M*)) will be exponentially close to the constant log *N* [35]. In this case, the gradient of the compression term *Î*<sub>*θ*</sub>(*X*; *M*) with respect to *θ* will also be exponentially small, and the optimizer will not be able to learn to compress. On the other hand, when *σ*<sup>2</sup> is too large, the resulting noise can swamp gradient information arising from the accuracy (cross-entropy) term, causing the optimizer to collapse to a "trivial" maximally-compressed model in which *I*(*X*; *M*) ≈ *I*(*Y*; *M*) ≈ 0. Nonetheless, the optimization is robust to several orders of magnitude of variation of the initial value of *σ*<sup>2</sup>. In the experiments below, we use the initial value *σ*<sup>2</sup> = 1, which works sufficiently well in practice. (Note that the scale of the noise can also be trained by changing the parameters of the 5-neuron linear layer; thus, in our neural network architecture, having a trainable *σ*<sup>2</sup> is not strictly necessary.)

To implement the decoding map *P*<sub>*φ*</sub>(*y*|*m*), the bottleneck layer states are passed through several deterministic neural network layers with parameters *φ*. In the experiments described below, the decoding map is implemented with a single layer with 128 ReLU neurons, followed by a linear output layer. The log decoding probability log *P*<sub>*φ*</sub>(*y*|*m*) is then evaluated using the network output and an appropriately-chosen cost function: cross-entropy loss of the softmax of the output for classification, and mean squared error (MSE) of the output for regression.

In the experiments below, we use nonlinear IB to optimize the bound on the "squared-IB Lagrangian", Equation (13), rather than the bound on the IB Lagrangian, Equation (12). For comparison purposes, we also optimize the following "squared" version of the VIB objective, Equation (15),

$$\mathcal{L}\_{\text{sq-VIB}} := \mathbb{E}\_{Q\_{\theta}(Y,M)} \left[ \log P\_{\phi}(Y|M) \right] - \beta \left[ D\_{\text{KL}}(P\_{\theta}(M|X) \| R(M)) \right]^2. \tag{17}$$

As in the original VIB paper, we take *<sup>R</sup>*(*m*) to be the standard Gaussian <sup>N</sup> (0,**I**). We found that optimizing the squared-IB bounds, Equation (13) and Equation (17), produced quantitatively similar results to optimizing Equation (12) and Equation (15), but was more numerically robust when exploring the full range of the IB curve. For an explanation of why this occurs, see the discussion and analysis in [32]. We report performance of nonlinear IB and VIB when optimizing bounds on the IB Lagrangian, Equation (12) and Equation (15), in the Supplementary Material.

We use the Adam [55] optimizer with standard TensorFlow settings and mini-batches of size 256. To avoid over-fitting, we use early stopping: we split the training data into 80% actual training data and 20% validation data, and training is stopped once the objective on the validation dataset has not improved for 50 epochs.

A TensorFlow implementation of our approach is provided at https://github.com/artemyk/nonlinearIB. An independent PyTorch implementation is available at https://github.com/burklight/nonlinear-IB-PyTorch.

#### *4.2. Results*

We report the performance of nonlinear IB on two different classification datasets (MNIST and FashionMNIST) and one regression dataset (California housing prices). We also compare it with the recently-proposed variational IB (VIB) method [24]. Here we focus purely on the ability of these methods to optimize the IB objective on training and testing data. We leave for future work comparisons of these methods in terms of adversarial robustness [24], detection of out-of-distribution data [25], and other desirable characteristics that may emerge from IB training.

We optimize both the nonlinear IB (Equation (13)) and the VIB (Equation (17)) objectives for different values of *β*, producing a series of models that explore the trade-off between compression and prediction. We vary *β* ∈ [10<sup>−3</sup>, 2] for classification tasks and *β* ∈ [10<sup>−5</sup>, 2] for the regression task. These ranges were chosen empirically so that the resulting models fully explore the IB curve.

To report our results, we use *information plane* (info-plane) diagrams [27], which visualize the performance of different models in terms of the compression term (*I*(*X*; *M*), the x-axis) and the prediction term (*I*(*Y*; *M*), the y-axis), both on training and testing data. For the info-plane diagrams, we use Monte Carlo sampling to get an accurate estimate of the *I*(*X*; *M*) terms. To estimate the *I*(*Y*; *M*) = *H*(*Y*) − *H*(*Y*|*M*) term, we use two different approaches. For classification datasets, we approximate *H*(*Y*) using the empirical entropy of the class labels in the dataset, and approximate the conditional entropy with the cross-entropy loss, *H*(*Y*|*M*) ≈ −E<sub>*Q*<sub>*θ*</sub>(*Y*,*M*)</sub>[log *P*<sub>*φ*</sub>(*Y*|*M*)]. Note that the resulting MI estimate is an additive constant away from the cross-entropy loss. For the regression dataset, we approximate *H*(*Y*) via the entropy of a Gaussian with variance Var(*Y*), and approximate *H*(*Y*|*M*) via the entropy of a Gaussian with variance equal to the mean squared error. This results in the estimate *I*(*Y*; *M*) ≈ (1/2) log(Var(*Y*)/MSE). Finally, we also use scatter plots to visualize the activity of the hidden layer for models trained with different objectives.
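As a quick illustration of the regression estimate (using the base-2 logarithm, consistent with reporting in bits, and with hypothetical Var(*Y*) and MSE values):

```python
import numpy as np

def regression_mi_estimate(var_y, mse):
    """I(Y;M) ~ (1/2) log2(Var(Y) / MSE), in bits."""
    return 0.5 * np.log2(var_y / mse)

# Hypothetical values: a model that reduces the residual variance 4x relative
# to Var(Y) is credited with about one bit of information about Y.
i_ym = regression_mi_estimate(var_y=4.0, mse=1.0)
```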

We first consider the *MNIST* dataset of handwritten digits, which contains 60,000 training images and 10,000 testing images. Each image is 28-by-28 pixels (784 total pixels, so *X* ∈ ℝ<sup>784</sup>), and is classified into 1 of 10 classes corresponding to the digit identity (*Y* ∈ {1, . . . , 10}).

The top row of Figure 1 shows *I*(*Y*; *M*) and *I*(*X*; *M*) values achieved by nonlinear IB and VIB on the MNIST dataset. As can be seen, nonlinear IB achieves better prediction values at the same level of compression than VIB, both on training and testing data. The difference is especially marked near the "corner point" *I*(*X*; *M*) = *I*(*Y*; *M*) ≈ log 10 (which corresponds to maximal compression, given perfect prediction), where nonlinear IB achieves ≈0.1 bits better prediction at the same compression level (see also Table 1).

**Figure 1. Top row**: Info-plane diagrams for nonlinear IB and variational IB (VIB) on the MNIST training (**left**) and testing (**right**) data. The solid lines indicate means across five runs, and the shaded regions indicate the standard error of the mean. The black dashed line is the data-processing inequality bound *I*(*Y*; *M*) ≤ *I*(*X*; *M*); the black dotted line indicates the value of *I*(*Y*; *M*) achieved by a baseline model trained only to optimize cross-entropy. **Bottom row**: Principal component analysis (PCA) projection of bottleneck layer activity (on testing data, no noise) for models trained with regular cross-entropy loss (**left**), VIB (**middle**), and nonlinear IB (**right**) objectives. The location of the nonlinear IB and VIB models shown in the bottom row is indicated with the green vertical line in the top right panel.

Further insight is provided by considering the bottleneck representations found when training with nonlinear IB versus VIB versus regular cross-entropy loss. To visualize these bottleneck representations, we selected three models: a baseline model trained only to optimize cross-entropy loss, a model trained with nonlinear IB, and a model trained with VIB (the latter two models were chosen to both have *I*(*X*; *M*) ≈ log 10). We then measured the activity of their 5-neuron bottleneck hidden layer on the testing dataset, projected down to two dimensions using principal component analysis (PCA). Figure 1 visualizes these two-dimensional projections for these three models, with colors indicating class label (digit identity). Training with VIB and nonlinear IB objectives causes inputs corresponding to different digits to fall into well-separated clusters, unlike training with cross-entropy loss. Moreover, the clustering is particularly tight for nonlinear IB, meaning that the bottleneck states carry almost no information about input vectors beyond class identity. Note that in this regime, where Gaussian components are grouped into tightly separated clusters, our MI upper bound *Î<sub>θ</sub>*(*X*; *M*) becomes exact [35].
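The visualization step described above (projecting 5-neuron bottleneck activity onto its first two principal components) can be sketched as follows. The activations here are random stand-ins, not actual network outputs.

```python
import numpy as np

def pca_project(activations, k=2):
    """Project (N, d) activations onto their top-k principal components."""
    centered = activations - activations.mean(axis=0)
    # SVD of the centered data matrix; the rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

rng = np.random.default_rng(0)
# Stand-in 5-dimensional bottleneck activity with unequal axis scales.
acts = rng.normal(size=(1000, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
proj = pca_project(acts)  # (1000, 2) coordinates for a scatter plot
```

The first projected coordinate always carries at least as much variance as the second, which is what makes the 2-D scatter a faithful summary of the dominant structure.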


**Table 1.** Amount of prediction *I*(*Y*; *M*) achieved at compression level *I*(*X*; *M*) = log 10 for both nonlinear IB and VIB.

In the next experiment, we considered the recently-proposed *FashionMNIST* dataset. FashionMNIST has the same structure as the MNIST dataset (28 × 28 images grouped into 10 classes, with 60,000 training and 10,000 testing images). Instead of hand-written digits, however, FashionMNIST includes images of clothes labeled with classes such as "Dress", "Coat", and "Sneaker". This dataset was designed as a drop-in replacement for MNIST which addresses the problem that MNIST is too easy for modern machine learning (e.g., it is fairly straightforward to achieve ≈99% test accuracy on MNIST) [56]. FashionMNIST is a more difficult dataset, with typical test accuracies of ≈90%–95%.

The top row of Figure 2 shows *I*(*Y*; *M*) and *I*(*X*; *M*) values achieved by nonlinear IB and VIB on the FashionMNIST dataset. Compared to VIB, nonlinear IB again achieves better prediction values at the same level of compression, both on training and testing data. The difficulty of FashionMNIST is evident in the fact that neither method gets very close to the corner point *I*(*X*; *M*) = *I*(*Y*; *M*) ≈ log 10. Nonetheless, nonlinear IB performed better than VIB at a range of compression values, often extracting ≈0.15 additional bits of prediction at the same compression level (see also Table 1).

As for MNIST, we consider the bottleneck representations uncovered when training on FashionMNIST with cross-entropy loss only versus nonlinear IB versus VIB (the latter two models were chosen to have *I*(*X*; *M*) ≈ log 10). We measured the activity of the 5-neuron bottleneck layer on the testing dataset, projected down to two dimensions using PCA. The bottom row of Figure 2 visualizes these two-dimensional projections for these three models, with colors indicating class label (clothing category). It can again be seen that models trained with VIB and nonlinear IB map inputs into separated clusters, but that the clusters are significantly tighter for nonlinear IB.

**Figure 2. Top row**: Info-plane diagrams for nonlinear IB and VIB on the FashionMNIST dataset. **Bottom row**: PCA projection of bottleneck layer activations for models trained only to optimize cross-entropy (**left**), VIB (**middle**), and nonlinear IB (**right**) objectives. See caption of Figure 1 for details.

In our final experiment, we considered the *California housing prices* dataset. This is a regression dataset based on the 1990 California census, originally published in [57] (we use the version distributed with the scikit-learn package [58]). It consists of *N* = 20,640 total samples, with one dependent variable (the house price) and 8 independent variables (such as "longitude", "latitude", and "number of rooms"). We used the log-transformed house price as the dependent variable *Y* (this made the distribution of *Y* closer to a Gaussian). To prepare the training and testing data, we first dropped 992 samples in which the house price was equal to or greater than \$500,000 (prices were clipped at this upper value in the dataset, which distorted the distribution of the dependent variable). We then randomly split the remaining samples into an 80% training and 20% testing dataset (the training dataset was then further split into the actual training dataset and a validation dataset, see above).
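The preprocessing steps above can be sketched as follows. This is a minimal sketch on synthetic stand-in data (the real samples come from scikit-learn's California housing loader); the array shapes match the description, but the values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20640
X = rng.normal(size=(n, 8))                    # 8 independent variables
price = rng.uniform(15_000, 550_000, size=n)   # stand-in house prices

keep = price < 500_000            # drop samples clipped at $500,000
X, price = X[keep], price[keep]
y = np.log(price)                 # log-price as the dependent variable Y

perm = rng.permutation(len(y))    # random 80%/20% train/test split
cut = int(0.8 * len(y))
train_idx, test_idx = perm[:cut], perm[cut:]
```

With the real dataset, the filtering step removes the 992 clipped samples; here it simply removes whatever stand-in prices fall above the threshold.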

The top row of Figure 3 shows *I*(*Y*; *M*) and *I*(*X*; *M*) values achieved by nonlinear IB and VIB on the California housing prices dataset. Nonlinear IB achieves better prediction values at the same level of compression than VIB, both on training and testing data (see also Table 1). As for the other datasets, we also show the bottleneck representations uncovered when training on the California housing prices dataset with MSE loss only versus nonlinear IB versus VIB (the latter two models were chosen to have *I*(*X*; *M*) ≈ log 10). The bottom row of Figure 3 visualizes the two-dimensional PCA projections of bottleneck layer activity for these three models, with colors indicating the dependent variable (log housing price). The bottleneck representations uncovered when training with MSE loss only and when training with VIB were somewhat similar. Nonlinear IB, however, finds a different and almost perfectly one-dimensional bottleneck representation. In fact, for the nonlinear IB model, the first principal component explains 99.8% of the variance in bottleneck layer activity on testing data. For the models trained with MSE loss and VIB, the first principal component explains only 76.6% and 69% of the variance, respectively. The one-dimensional representation uncovered by nonlinear IB compresses away all information about the input vectors which is not relevant for predicting the dependent variable.

**Figure 3. Top row**: Information plane diagrams for nonlinear IB and VIB on the California housing prices dataset. **Bottom row**: PCA projection of bottleneck layer activations for models trained only to optimize mean squared error (MSE) (**left**), VIB (**middle**), and nonlinear IB (**right**) objectives. See caption of Figure 1 for details.

We finish by presenting some of our numerical results in Table 1. In particular, we quantify the amount of prediction, *I*(*Y*; *M*), achieved when training with nonlinear IB and VIB at the compression level *I*(*X*; *M*) = log 10, for training and testing datasets of the three datasets considered above. Nonlinear IB consistently achieves better prediction at a fixed level of compression.

#### **5. Conclusions**

We propose "nonlinear IB", a method for exploring the information bottleneck (IB) trade-off curve in a general setting. We allow the input and output variables to be discrete or continuous (though we assume a continuous bottleneck variable). We also allow for arbitrary (e.g., non-Gaussian) joint distributions over inputs and outputs and for non-linear encoding and decoding maps. We gain this generality by exploiting a new tractable and differentiable bound on the IB objective.

We describe how to implement our method using off-the-shelf neural network software and apply it to several standard classification and regression problems. We find that nonlinear IB effectively discovers the trade-off curve and finds solutions superior to those of competing methods. We also find that the intermediate representations discovered by nonlinear IB have visibly tighter clusters in the classification problems. In the regression problem, nonlinear IB discovers a one-dimensional intermediate representation.

We have successfully demonstrated the ability of nonlinear IB to explore the IB curve. It is possible that increased compression may lead to other benefits in supervised learning, such as improved generalization performance or increased robustness to adversarial inputs. Exploring its efficacy in these domains remains for future work.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/1099-4300/21/12/1181/ s1: Figure S1: Performance of nonlinear IB and VIB when optimizing bounds on regular IB objective.

**Author Contributions:** Conceptualization, A.K.; Funding acquisition, D.H.W.; Software, A.K. and B.D.T.; Visualization, A.K.; Writing—original draft, A.K.; Writing—review & editing, A.K., B.D.T. and D.H.W.

**Funding:** This research was funded by National Science Foundation: CHE-1648973; Foundational Questions Institute: FQXi-RFP-1622; Air Force Office of Scientific Research: A9550-15-1-0038.

**Acknowledgments:** We thank Steven Van Kuyk and Borja Rodríguez Gálvez for helpful comments. We would also like to thank the Santa Fe Institute for helping to support this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Pareto-Optimal Data Compression for Binary Classification Tasks**

#### **Max Tegmark \* and Tailin Wu**

Department of Physics, MIT Kavli Institute & Center for Brains, Minds & Machines, Massachusetts Institute of Technology, Cambridge, MA 02139 USA; tailin@mit.edu

**\*** Correspondence: Tegmark@mit.edu

Received: 18 October 2019; Accepted: 12 December 2019; Published: 19 December 2019

**Abstract:** The goal of lossy data compression is to reduce the storage cost of a data set *X* while retaining as much information as possible about something (*Y*) that you care about. For example, what aspects of an image *X* contain the most information about whether it depicts a cat? Mathematically, this corresponds to finding a mapping *X* → *Z* ≡ *f*(*X*) that maximizes the mutual information *I*(*Z*,*Y*) while the entropy *H*(*Z*) is kept below some fixed threshold. We present a new method for mapping out the Pareto frontier for classification tasks, reflecting the tradeoff between retained entropy and class information. We first show how a random variable *X* (an image, say) drawn from a class *Y* ∈ {1, ..., *n*} can be distilled into a vector *W* = *f*(*X*) ∈ ℝ<sup>*n*−1</sup> losslessly, so that *I*(*W*,*Y*) = *I*(*X*,*Y*); for example, for a binary classification task of cats and dogs, each image *X* is mapped into a single real number *W* retaining all information that helps distinguish cats from dogs. For the *n* = 2 case of binary classification, we then show how *W* can be further compressed into a discrete variable *Z* = *g<sub>β</sub>*(*W*) ∈ {1, ..., *m<sub>β</sub>*} by binning *W* into *m<sub>β</sub>* bins, in such a way that varying the parameter *β* sweeps out the full Pareto frontier, solving a generalization of the discrete information bottleneck (DIB) problem. We argue that the most interesting points on this frontier are "corners" maximizing *I*(*Z*,*Y*) for a fixed number of bins *m* = 2, 3, ..., which can conveniently be found without multiobjective optimization. We apply this method to the CIFAR-10, MNIST and Fashion-MNIST datasets, illustrating how it can be interpreted as an information-theoretically optimal image clustering algorithm. We find that these Pareto frontiers are not concave, and that recently reported DIB phase transitions correspond to transitions between these corners, changing the number of clusters.

**Keywords:** information; bottleneck; compression; classification

#### **1. Introduction**

A core challenge in science, and in life quite generally, is data distillation: Keeping only a manageably small fraction of our available data *X* while retaining as much information as possible about something (*Y*) that we care about. For example, what aspects of an image contain the most information about whether it depicts a cat (*Y* = 1) rather than a dog (*Y* = 2)? Mathematically, this corresponds to finding a deterministic mapping *X* → *Z* ≡ *g*(*X*) that maximizes the mutual information *I*(*Z*,*Y*) while the entropy *H*(*Z*) is kept below some fixed threshold. The tradeoff between *H*<sub>∗</sub> = *H*(*Z*) (bits stored) and *I*<sub>∗</sub> = *I*(*Z*,*Y*) (useful bits) is described by a Pareto frontier, defined as

$$I_*(H_*) \equiv \sup_{\{g \,:\, H[g(X)] \le H_*\}} I[g(X), Y],\tag{1}$$

and illustrated in Figure 1 (this is for a toy example described below; we compute the Pareto frontier for our cat/dog example in Section 3). The shaded region is impossible because *I*(*Z*,*Y*) ≤ *I*(*X*,*Y*) and *I*(*Z*,*Y*) ≤ *H*(*Z*). The colored dots correspond to random likelihood binnings into various numbers of bins, as described in the next section, and the upper envelope of all attainable points defines the Pareto frontier. Its "corners", which are marked by black dots and maximize *I*(*Z*,*Y*) for *M* bins (*M* = 1, 2, ...), are seen to lie close to the vertical dashed lines *H*(*Z*) = log *M*, corresponding to all bins having equal size. We plot the *H*-axis flipped to conform with the tradition that up and to the right are more desirable. The core goal of this paper is to present a method for computing such Pareto frontiers.
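The scatter of attainable (*H*(*Z*), *I*(*Z*,*Y*)) points can be reproduced in miniature by drawing random binnings of a toy joint distribution. The choice *P*(*Y* = 1|*W* = *w*) = *w* below is an illustrative assumption, not the paper's warmup dataset.

```python
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
N = 100_000
w = rng.uniform(size=N)                       # uniformized W on [0, 1]
y = (rng.uniform(size=N) < w).astype(int)     # toy label: P(Y=1|W=w) = w

points = []
for M in (2, 3, 4):                           # number of bins
    for _ in range(20):
        b = np.sort(rng.uniform(size=M - 1))  # random bin boundaries
        z = np.digitize(w, b)                 # Z = g(W), integer-valued
        # Empirical joint distribution of (Z, Y).
        joint = np.array([[np.mean((z == k) & (y == c)) for c in (0, 1)]
                          for k in range(M)])
        h_z = entropy_bits(joint.sum(axis=1))
        i_zy = (h_z + entropy_bits(joint.sum(axis=0))
                - entropy_bits(joint.ravel()))
        points.append((h_z, i_zy))            # one dot in the info plane
```

Every point obeys 0 ≤ *I*(*Z*,*Y*) ≤ *H*(*Z*) ≤ log *M*; the upper envelope of many such points approximates the Pareto frontier of this toy problem.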

**Figure 1.** The Pareto frontier (top panel) for compressed versions *Z* = *g*(*X*) of our warmup dataset *X* ∈ [0, 1]<sup>2</sup> with classes *Y* ∈ {1, 2}, showing the maximum attainable class information *I*(*Z*,*Y*) for a given entropy *H*(*Z*), mapped using the method described in this paper using the likelihood binning in the bottom panel.

#### *Objectives and Relation to Prior Work*

In other words, the goal of this paper is to analyze soft rather than hard classifiers: not to make the most accurate classifier, but rather to compute the Pareto frontier that reveals the most accurate (in an information-theoretic sense) classifier *Z* given a constraint on its bit content *H*(*Z*). These optimal soft classifiers that we will derive (corresponding to points on the Pareto frontier) are useful for the same reason that other lossy data compression methods are useful: overfitting less and therefore generalizing better, among other things.

This Pareto frontier challenge is thus part of the broader quest for data distillation: Lossy data compression that retains as much as possible of the information that is useful to us. Ideally, the information can be partitioned into a set of independent chunks and sorted from most to least useful, enabling us to select the number of chunks to retain so as to optimize our tradeoff between utility and data size. Consider two random variables *X* and *Y* which may each be vectors or scalars. For simplicity, consider them to be discrete with finite entropy. (This discreteness restriction loses us no generality in practice, since we can always discretize real numbers by rounding them to some very large number of significant digits.) For prediction tasks, we might interpret *Y* as the future state of a dynamical system that we wish to predict from the present state *X*. For classification tasks, we might interpret *Y* as a class label that we wish to predict from an image, sound, video or text string *X*. Let us now consider various forms of ideal data distillation, as summarized in Table 1.


**Table 1.** Data distillation: The relationship between principal component analysis (PCA), canonical correlation analysis (CCA), nonlinear autoencoders and nonlinear latent representations.

If we distill *X* as a whole, then we would ideally like to find a function *f* such that the so-called latent representation *Z* = *f*(*X*) retains the full entropy *H*(*X*) = *H*(*Z*) = ∑<sub>*i*</sub> *H*(*Z<sub>i</sub>*), decomposed into independent parts with vanishing mutual information: *I*(*Z<sub>i</sub>*, *Z<sub>j</sub>*) = *δ<sub>ij</sub>H*(*Z<sub>i</sub>*). (Note that when implementing any distillation algorithm in practice, there is always a one-parameter tradeoff between compression and information retention which defines a Pareto frontier. A key advantage of the latent variables (or variable pairs) being statistically independent is that this allows the Pareto frontier to be trivially computed, by simply sorting them by decreasing information content and varying the number retained.)

For the special case where *X* = **x** is a vector with a multivariate Gaussian distribution, the optimal solution is Principal Component Analysis (PCA) [1], which has long been a workhorse of statistical physics and many other disciplines: Here *f* is simply a linear function mapping into the eigenbasis of the covariance matrix of **x**. The general case remains unsolved, and it is easy to see that it is hard: If *X* = *c*(*Z*) where *c* implements some state-of-the-art cryptographic code, then finding *f* = *c*<sup>−1</sup> (to recover the independent pieces of information and discard the useless parts) would generically require breaking the code. Great progress has nonetheless been made for many special cases, using techniques such as nonlinear autoencoders [2] and Generative Adversarial Networks (GANs) [3].
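For the Gaussian case, the PCA solution can be checked numerically: rotating the data into the eigenbasis of its sample covariance matrix yields exactly decorrelated components (and hence, for a Gaussian, independent ones). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=(50_000, 3)) @ A.T     # correlated Gaussian samples

cov = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenbasis of the covariance
z = x @ eigvecs                            # f: rotation into the eigenbasis

cov_z = np.cov(z, rowvar=False)            # diagonal, up to round-off
```

Because the rotation is computed from the sample covariance itself, the covariance of *z* equals diag(eigvals) exactly, up to floating-point error.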

Now consider the case where we wish to distill *X* and *Y* separately, into *Z* ≡ *f*(*X*) and *Z*′ ≡ *g*(*Y*), retaining the mutual information between the two parts. Then we ideally have *I*(*X*,*Y*) = ∑<sub>*i*</sub> *I*(*Z<sub>i</sub>*, *Z*′<sub>*i*</sub>), *I*(*Z<sub>i</sub>*, *Z<sub>j</sub>*) = *δ<sub>ij</sub>H*(*Z<sub>i</sub>*), *I*(*Z*′<sub>*i*</sub>, *Z*′<sub>*j*</sub>) = *δ<sub>ij</sub>H*(*Z*′<sub>*i*</sub>), *I*(*Z<sub>i</sub>*, *Z*′<sub>*j*</sub>) = *δ<sub>ij</sub>I*(*Z<sub>i</sub>*, *Z*′<sub>*i*</sub>). This problem has attracted great interest, especially for time series where *X* = **u**<sub>*i*</sub> and *Y* = **u**<sub>*j*</sub> for some sequence of states **u**<sub>*k*</sub> (*k* = 0, 1, 2, ...) in physics or other fields, where one typically maps the state vectors **u**<sub>*i*</sub> into some lower-dimensional vectors *f*(**u**<sub>*i*</sub>), after which the prediction is carried out in this latent space. For the special case where *X* has a multivariate Gaussian distribution, the optimal solution is Canonical Correlation Analysis (CCA) [4]: Here both *f* and *g* are linear functions, computed via a singular value decomposition (SVD) [5] of the cross-correlation matrix after prewhitening *X* and *Y*. The general case remains unsolved, and is obviously even harder than the above-mentioned 1-vector autoencoding problem. The recent works [6,7] review the state of the art and present Contrastive Predictive Coding and Dynamic Component Analysis, powerful new distillation techniques for time series, following the long tradition of setting *f* = *g* even though this is provably not optimal for the Gaussian case, as shown in [8].

The goal of this paper is to make progress in the lower right quadrant of Table 1. We will first show that if *Y* ∈ {1, 2} (as in binary classification tasks) and we can successfully train a classifier that correctly predicts the conditional probability distribution *P*(*Y*|*X*), then it can be used to provide an exact solution to the distillation problem, losslessly distilling *X* into a single real variable *W* = *f*(*X*). We will generalize this to an arbitrary classification problem *Y* ∈ {1, ..., *n*} by losslessly distilling *X* into a vector *W* = *f*(*X*) ∈ ℝ<sup>*n*−1</sup>, although in this case, the components of the vector *W* may not be independent. We will then return to the binary classification case and provide a family of binnings that map *W* into an integer *Z*, allowing us to scan the full Pareto frontier reflecting the tradeoff between retained entropy and class information, illustrating the end-to-end procedure with the CIFAR-10, MNIST and Fashion-MNIST datasets. This is related to the work of [9] which maximizes *I*(*Z*,*Y*) for a fixed number of bins (instead of for a fixed entropy), which corresponds to the "corners" seen in Figure 1.

This work is closely related to the Information Bottleneck (IB) method [10], which provides an insightful, principled approach for balancing compression against prediction [11]. Just as in our work, the IB method aims to find a random variable *Z* = *f*(*X*) that loosely speaking retains as much information as possible about *Y* and as little other information as possible. The IB method implements this by maximizing the IB-objective

$$\mathcal{L}_{\text{IB}} = I(Z, Y) - \beta I(Z, X) \tag{2}$$

where the Lagrange multiplier *β* tunes the balance between knowing about *Y* and forgetting about *X*. Ref. [12] considered the alternative Deterministic Information Bottleneck (DIB) objective

$$\mathcal{L}_{\text{DIB}} = I(Z, Y) - \beta H(Z), \tag{3}$$

to close the loophole where *Z* retains random information that is independent of both *X* and *Y*. (This is possible if *f* is a function that contains random components rather than being fully deterministic. In contrast, if *Z* = *f*(*X*) for some deterministic function *f*, which is typically not the case in the popular variational IB implementation [13–15], then *H*(*Z*|*X*) = 0, so *I*(*Z*, *X*) ≡ *H*(*Z*) − *H*(*Z*|*X*) = *H*(*Z*), which means the two objectives (2) and (3) are identical.)
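The parenthetical remark can be verified numerically for a small discrete example: with a deterministic *f*, the plug-in mutual information *I*(*Z*, *X*) equals *H*(*Z*) exactly. A sketch (the distribution and map below are arbitrary illustrations):

```python
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px = np.array([0.1, 0.2, 0.3, 0.4])   # distribution of X over 4 states
f = np.array([0, 0, 1, 1])            # deterministic map f: X -> Z

joint = np.zeros((2, 4))              # p(z, x): all mass sits on z = f(x)
joint[f, np.arange(4)] = px
pz = joint.sum(axis=1)                # marginal of Z: here (0.3, 0.7)

h_z = entropy_bits(pz)
i_zx = h_z + entropy_bits(px) - entropy_bits(joint.ravel())
```

Since each *x* puts all its mass on a single *z*, *H*(*Z*, *X*) = *H*(*X*) and the identity *I*(*Z*, *X*) = *H*(*Z*) follows term by term.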

However, there is a well-known problem with this DIB objective that occurs when *Z* ∈ ℝ<sup>*n*</sup> is continuous [16]: *H*(*Z*) is strictly speaking infinite, since it requires an infinite amount of information to store the infinitely many decimals of a generic real number. While this infinity is normally regularized away by only defining *H*(*Z*) up to an additive constant, which is irrelevant when optimizing Equation (3), the problem is that we can define a new rescaled random variable

$$Z' = aZ \tag{4}$$

for a constant *a* ≠ 0 and obtain

$$I(Z',X) = I(Z,X) \tag{5}$$

and

$$H(Z') = H(Z) + n \log|a|. \tag{6}$$

(Throughout this paper, we take log to denote the logarithm in base 2, so that entropy and mutual information are measured in bits.) The last two equations imply that by choosing |*a*| ≪ 1, we can make *H*(*Z*′) arbitrarily negative while keeping *I*(*Z*′, *X*) unchanged, thus making *L*<sub>DIB</sub> arbitrarily large. The objective *L*<sub>DIB</sub> is therefore not bounded from above, and trying to maximize it will not produce an interesting result. We will eliminate this *Z*-rescaling problem by making *Z* discrete rather than continuous, so that *H*(*Z*) is always well-defined and finite. Another challenge with the DIB objective of Equation (3), which we will also overcome, is that it maximizes a linear combination of the two axes in Figure 1, and can therefore only discover concave parts of the Pareto frontier, not convex ones (which are seen to dominate in Figure 1).
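The rescaling argument of Equations (4)–(6) can be checked analytically for a one-dimensional Gaussian, whose differential entropy is known in closed form; the value of *a* below is an arbitrary illustration.

```python
import numpy as np

def gaussian_entropy_bits(sigma):
    """Differential entropy of N(0, sigma^2), in bits."""
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

a = 1 / 1024                           # a constant with |a| << 1
# Scaling Z by a shifts the entropy by n*log2|a| (here n = 1):
shift = gaussian_entropy_bits(abs(a) * 1.0) - gaussian_entropy_bits(1.0)
```

The shift equals log<sub>2</sub>|*a*| = −10 bits here; shrinking *a* further drives *H*(*Z*′), and with it the entropy term of the objective, to arbitrarily negative values while *I*(*Z*′, *X*) stays fixed.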

The rest of this paper is organized as follows: In Section 2.1, we will provide an exact solution for the binary classification problem where *Y* ∈ {1, 2} by losslessly distilling *X* into a single real variable *W* = *f*(*X*). We also generalize this to an arbitrary classification problem *Y* ∈ {1, ..., *n*} by losslessly distilling *X* into a vector *W* = *f*(*X*) ∈ ℝ<sup>*n*−1</sup>, although the components of the vector *W* may not be independent. In Section 2.2, we return to the binary classification case and provide a family of binnings that map *W* into an integer *Z*, allowing us to scan the full Pareto frontier reflecting the tradeoff between retained entropy and class information. We apply our method to various image datasets in Section 3 and discuss our conclusions in Section 4.

#### **2. Method**

Our algorithm for mapping the Pareto frontier transforms our original data set *X* in a series of steps, which will be described in turn below:

$$X \stackrel{w}{\longmapsto} W \longmapsto W\_{\text{uniform}} \longmapsto W\_{\text{binned}} \longmapsto W\_{\text{sorted}} \stackrel{B}{\longmapsto} Z. \tag{7}$$

As we will show, the first, second and fourth transformations retain all mutual information with the label *Y*, and the information loss about *Y* can be kept arbitrarily small in the third step. In contrast, the last step treats the information loss as a tuneable parameter that parameterizes the Pareto frontier.

#### *2.1. Lossless Distillation for Classification Tasks*

Our first step is to compress *X* (an image, say) into *W*, a set of *n* − 1 real numbers, in such a way that no class information is lost about *Y* ∈ {1, ..., *n*}.

**Theorem 1** (**Lossless Distillation Theorem**)**.** *For an arbitrary random variable X and a categorical random variable Y* ∈ {1, ..., *n*}*, we have*

$$P(Y|X) = P(Y|W), \tag{8}$$

*where W* ≡ *w*(*X*) ∈ ℝ<sup>*n*−1</sup> *is defined by*

$$w_i(X) \equiv P(Y=i|X). \tag{9}$$

Note that we ignore the *n*th component since it is redundant: *w<sub>n</sub>*(*X*) = 1 − ∑<sub>*i*=1</sub><sup>*n*−1</sup> *w<sub>i</sub>*(*X*).

**Proof.** Let *S* denote the domain of *X*, i.e., *X* ∈ *S*, and define the set-valued function

$$s(W) \equiv \{x \in S : w(x) = W\}.$$

These sets *s*(*W*) form a partition of *S* parameterized by *W*, since they are disjoint and

$$\bigcup_{W \in \mathbb{R}^{n-1}} s(W) = S.$$

For example, if *S* = ℝ<sup>2</sup> and *n* = 2, then the sets *s*(*W*) are simply contour curves of the conditional probability *W* ≡ *P*(*Y* = 1|*X*) ∈ ℝ. This partition enables us to uniquely specify *X* as the pair {*W*, *X<sub>W</sub>*} by first specifying which set *s*[*w*(*X*)] it belongs to (determined by *W* = *w*(*X*)), and then specifying the particular element within that set, which we denote *X<sub>W</sub>* ∈ *s*(*W*). This implies that

$$P(Y|X) = P(Y|W, X_W) = P(Y|W), \tag{11}$$

completing the proof. The last equality follows from the fact that the conditional probability *P*(*Y*|*X*) is independent of *X<sub>W</sub>*, since it is by definition constant throughout the set *s*(*W*).

*Entropy* **2020**, *22*, 7

The following corollary implies that *W* is an optimal distillation of the information *X* has about *Y*, in the sense that it constitutes a lossless compression of said information: *I*(*W*,*Y*) = *I*(*X*,*Y*) as shown, and the total information content (entropy) in *W* cannot exceed that of *X* since it is a deterministic function thereof.

**Corollary 1.** *With the same notation as above, we have*

$$I(X,Y) = I(W,Y). \tag{12}$$

**Proof.** For any two random variables, we have the identity *I*(*U*, *V*) = *H*(*V*) − *H*(*V*|*U*), where *I*(*U*, *V*) is their mutual information and *H*(*V*|*U*) denotes conditional entropy. We thus obtain

$$\begin{aligned} I(X, Y) &= H(Y) - H(Y|X) = H(Y) + \langle \log P(Y|X) \rangle_{X, Y} \\ &= H(Y) + \langle \log P(Y|W) \rangle_{W, X_W, Y} \\ &= H(Y) + \langle \log P(Y|W) \rangle_{W, Y} \\ &= H(Y) - H(Y|W) = I(W, Y), \end{aligned} \tag{13}$$

which completes the proof. We obtain the second line by using *P*(*Y*|*X*) = *P*(*Y*|*W*) from Theorem 1 and specifying *X* by *W* and *X<sub>W</sub>*, and the third line since *P*(*Y*|*W*) is independent of *X<sub>W</sub>*, as above.

In most situations of practical interest, the conditional probability distribution *P*(*Y*|*X*) is not precisely known, but can be approximated by training a neural-network-based classifier that outputs the probability distribution for *Y* given any input *X*. We present such examples in Section 3. The better the classifier, the smaller the information loss *I*(*X*,*Y*) − *I*(*W*,*Y*) will be, approaching zero in the limit of an optimal classifier.
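In practice, the distillation map *w*(*X*) of Theorem 1 is therefore just the classifier's predicted probability vector with its redundant last component dropped. A sketch with a random stand-in "classifier" (the shapes and the projection matrix are illustrative, not from the paper):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 784))           # 100 flattened images, say
logits = X @ rng.normal(size=(784, 3))    # stand-in 3-class classifier
probs = softmax(logits)                   # approximates P(Y = i | X)

W = probs[:, :-1]                         # W = w(X) in R^{n-1}, here n = 3
```

The dropped component is recoverable as 1 minus the sum of the retained ones, which is exactly why ignoring it loses no information.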

#### *2.2. Pareto-Optimal Compression for Binary Classification Tasks*

Let us now focus on the special case where *n* = 2, i.e., binary classification tasks. For example, *X* may correspond to images of equal numbers of felines and canines to be classified despite challenges with variable lighting, occlusion, etc., as in Figure 2, and *Y* ∈ {1, 2} may correspond to the labels "cat" and "dog". In this case, *Y* contains *H*(*Y*) = 1 bit of information, of which *I*(*X*,*Y*) ≤ 1 bit is contained in *X*. Theorem 1 shows that for this case, all of this information about whether an image contains a cat or a dog can be compressed into a single number *W*, which is not a bit like *Y*, but a real number between zero and one.

The goal of this section is to find a class of functions *g* that perform Pareto-optimal lossy compression of *W*, mapping it into an integer *Z* ≡ *g*(*W*) that maximizes *I*(*Z*,*Y*) for a fixed entropy *H*(*Z*). (Throughout this paper, we will use the term "Pareto-optimal" or "optimal" in this sense, i.e., maximizing *I*(*Z*,*Y*) for a fixed *H*(*Z*).) The only input we need for our work in this section is the joint probability distribution *f*<sub>*i*</sub>(*w*) = *P*(*Y* = *i*, *W* = *w*), whose marginal distributions are the discrete probability distribution *P*<sub>*i*</sub><sup>*Y*</sup> for *Y* and the probability distribution *f* for *W*, which we will henceforth assume to be continuous:

$$f(w) \quad \equiv \quad \sum\_{i=1}^{2} f\_i(w), \tag{14}$$

$$P\_i^Y \equiv P(Y=i) \quad = \int\_0^1 f\_i(w) dw. \tag{15}$$

**Figure 2.** Sample data from Section 3. Images from MNIST (**top**), Fashion-MNIST (**middle**) and CIFAR-10 (**bottom**) are mapped into integers (group labels) *Z* = *f*(*X*) retaining maximum mutual information with the class variable *Y* (ones/sevens, shirts/pullovers and cats/dogs, respectively) for 3, 5 and 5 groups, respectively. These mappings *f* correspond to Pareto frontier "corners".

#### 2.2.1. Uniformization of *W*

For convenience and without loss of generality, we will henceforth assume that *f*(*w*) = 1, i.e., that *W* has a uniform distribution on the unit interval [0, 1]. We can do this because if *W* were not uniformly distributed, we could make it so by using the standard statistical technique of applying its cumulative probability distribution function to it

$$W \mapsto W' \equiv F(W), \quad F(w) \equiv \int\_0^w f(w') dw',\tag{16}$$

retaining all information, *I*(*W*′,*Y*) = *I*(*W*,*Y*), since this procedure is invertible almost everywhere.
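As a concrete illustration (our sketch, not the authors' code), the uniformization of Equation (16) can be approximated on a finite sample by applying the empirical CDF; the beta-distributed sample below is an arbitrary stand-in for *W*:

```python
import numpy as np

# Sketch of Equation (16): uniformize a sample of W by applying its
# (here empirical) cumulative distribution function.
rng = np.random.default_rng(0)
w = rng.beta(2.0, 5.0, size=100_000)         # arbitrary continuous W on [0, 1]

def empirical_cdf_transform(w):
    """Map each sample to its empirical CDF value, yielding ~Uniform(0, 1)."""
    ranks = np.argsort(np.argsort(w))        # rank of each sample
    return (ranks + 0.5) / len(w)

w_uniform = empirical_cdf_transform(w)
counts, _ = np.histogram(w_uniform, bins=10, range=(0.0, 1.0))
print(counts / len(w))                       # each decile holds ~10% of points
```

Because the transform replaces each sample by its (mid-)rank, the output histogram is exactly flat on any sample without ties.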

#### 2.2.2. Binning *W*

Given a set of bin boundaries *b*<sub>1</sub> < *b*<sub>2</sub> < ... < *b*<sub>*N*−1</sub> grouped into a vector **b**, we define the integer-valued contiguous binning function

$$B(x, \mathbf{b}) \equiv \begin{cases} 1 & \text{if } x < b\_1, \\ k & \text{if } b\_{k-1} \le x < b\_k, \\ N & \text{if } x \ge b\_{N-1}, \end{cases} \tag{17}$$

*B*(*x*, **b**) can thus be interpreted as the ID of the bin into which *x* falls. Note that *B* is a monotonically increasing piecewise constant function of *x* that is shaped like an *N*-level staircase with *N* − 1 steps at *b*<sub>1</sub>, ..., *b*<sub>*N*−1</sub>.
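A minimal code sketch of this binning function (using NumPy's `digitize`, which follows the same left-closed bin convention; not the authors' implementation):

```python
import numpy as np

# Sketch of the contiguous binning function B(x, b) of Equation (17):
# boundaries b_1 < ... < b_{N-1} define N bins with 1-based IDs.
def B(x, b):
    # np.digitize returns 0 for x < b[0] and len(b) for x >= b[-1],
    # so adding 1 yields bin IDs in {1, ..., N}.
    return np.digitize(x, b) + 1

b = np.array([0.25, 0.5, 0.75])       # N = 4 equispaced bins on [0, 1]
x = np.array([0.1, 0.3, 0.5, 0.9])
print(B(x, b))                        # → [1 2 3 4]
```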

Let us now bin *W* into *N* equispaced bins, by mapping it into an integer *W*′ ∈ {1, ..., *N*} (the bin ID) defined by

$$W' \equiv W\_{\text{binned}} \equiv B(W, \mathbf{b}), \tag{18}$$

where **b** is the vector with elements *b*<sub>*j*</sub> = *j*/*N*, *j* = 1, ..., *N* − 1. As illustrated visually in Figure 3 and mathematically in Appendix A, binning *W* → *W*′ corresponds to creating a new random variable for which the conditional distribution *p*<sub>1</sub>(*w*) = *P*(*Y* = 1|*W* = *w*) is replaced by a piecewise constant function *p̄*<sub>1</sub>(*w*), which replaces the values in each bin by their average. The binned variable *W*′ thus retains only information about which bin *W* falls into, discarding all information about the precise location within that bin. In the *N* → ∞ limit of infinitesimal bins, *p̄*<sub>1</sub>(*w*) → *p*<sub>1</sub>(*w*), and we expect the above-mentioned discarded information to become negligible. This intuition is formalized by Theorem A1 in Appendix A, which shows, under mild smoothness assumptions ensuring that *p*<sub>1</sub>(*w*) is not pathological, that

$$I(\mathcal{W}',Y) \to I(\mathcal{W},Y) \quad \text{as} \quad N \to \infty,\tag{19}$$

i.e., that we can make the binned data *W*′ retain essentially all the class information from *W* as long as we use enough bins.
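This convergence is easy to verify numerically. The sketch below assumes, purely for illustration, the smooth conditional probability *p*<sub>1</sub>(*w*) = *w* for uniform *W*, and computes the binned information for increasingly fine binnings:

```python
import numpy as np

# Illustration of Equation (19): for uniform W and the assumed conditional
# probability p1(w) = w, the binned information I(W',Y) approaches
# I(W,Y) = 1 - 1/(2 ln 2) ≈ 0.2787 bits as the number of bins N grows.
def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def binned_information(N):
    w_mid = (np.arange(N) + 0.5) / N          # bin midpoints
    p1_bar = w_mid                            # bin averages of p1(w) = w
    # joint distribution P(W'=i, Y=j): each bin carries probability mass 1/N
    P = np.stack([p1_bar / N, (1 - p1_bar) / N], axis=1)
    return entropy(P.sum(0)) + entropy(P.sum(1)) - entropy(P.ravel())

for N in (2, 8, 32, 128):
    print(N, round(binned_information(N), 4))
```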

In practice, such as for the numerical experiments that we will present in Section 3, training data is never infinite and the conditional probability function *p*<sub>1</sub>(*w*) is never known to perfect accuracy. This means that the pedantic distinction between *I*(*W*′,*Y*) = *I*(*W*,*Y*) and *I*(*W*′,*Y*) ≈ *I*(*W*,*Y*) for very large *N* is completely irrelevant in practice. In the rest of this paper, we will therefore work with the unbinned (*W*) and binned (*W*′) data somewhat interchangeably for convenience, occasionally dropping the apostrophe from *W*′ when no confusion is caused.

**Figure 3.** Essentially all information about *Y* is retained if *W* is binned into sufficiently narrow bins. Sorting the bins (**left**) to make the conditional probability monotonically increasing (**right**) changes neither this information nor the entropy.

#### 2.2.3. Making the Conditional Probability Monotonic

For convenience and without loss of generality, we can assume that the conditional probability distribution *p̄*<sub>1</sub>(*w*) is a monotonically increasing function. We can do this because if this were not the case, we could make it so by sorting the bins by increasing conditional probability, as illustrated in Figure 3: both the entropy *H*(*W*′) and the mutual information *I*(*W*′,*Y*) are left invariant by this renumbering/relabeling of the bins. The "cat" probability *P*(*Y* = 1) (the total shaded area in Figure 3) is of course also left unchanged, both by this sorting and by the above-mentioned binning.
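The invariance under bin relabeling can be checked directly, since both quantities depend only on the joint probabilities and not on the bin order (a sketch with an arbitrary random joint distribution):

```python
import numpy as np

# Sorting/relabeling bins by conditional probability P(Y=1|W'=i) permutes
# the rows of the joint distribution, which changes neither H(W') nor I(W',Y).
def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_information(P):                    # P[i, j] = P(W'=i, Y=j)
    return entropy(P.sum(0)) + entropy(P.sum(1)) - entropy(P.ravel())

rng = np.random.default_rng(1)
P = rng.random((8, 2))
P /= P.sum()                                  # random joint distribution
order = np.argsort(P[:, 0] / P.sum(axis=1))   # sort bins by P(Y=1|W'=i)
P_sorted = P[order]

print(np.isclose(entropy(P.sum(1)), entropy(P_sorted.sum(1))))          # True
print(np.isclose(mutual_information(P), mutual_information(P_sorted)))  # True
```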

#### 2.2.4. Proof that Pareto Frontier is Spanned by Contiguous Binnings

We are now finally ready to tackle the core goal of this paper: mapping the Pareto frontier (*H*<sub>∗</sub>, *I*<sub>∗</sub>) of optimal data compression *X* → *Z* that reflects the tradeoff between *H*(*Z*) and *I*(*Z*,*Y*). While fine-grained binning has no effect on the entropy *H*(*Y*) and negligible effect on *I*(*W*,*Y*), it dramatically reduces the entropy of our data: whereas *H*(*W*) = ∞ since *W* is continuous, *H*(*W*′) = log *N* is finite, approaching infinity only in the limit of infinitely many infinitesimal bins. (Note that while this infinity, which reflects the infinite number of bits required to describe a single generic real number, is customarily eliminated by defining entropy only up to an overall additive constant, we will not follow that custom here, for the reason explained in the introduction.) Taken together, these scalings of *I* and *H* imply that the leftmost part of the Pareto frontier *I*<sub>∗</sub>(*H*<sub>∗</sub>), defined by Equation (1) and illustrated in Figure 1, asymptotes to a horizontal line of height *I*<sub>∗</sub> = *I*(*X*,*Y*) as *H*<sub>∗</sub> → ∞.

To reach the interesting parts of the Pareto frontier further to the right, we must destroy some information about *Y*. We do this by defining

$$Z = g(W'), \tag{20}$$

where the function *g* groups the tiny bins indexed by *W*′ ∈ {1, ..., *N*} into fewer ones indexed by *Z* ∈ {1, ..., *M*}, *M* < *N*. There are vast numbers of such possible groupings, since each group corresponds to one of the 2<sup>*N*</sup> − 2 nontrivial subsets of the tiny bins. Fortunately, as we will now prove, we need only consider the O(*N*<sup>*M*</sup>) contiguous groupings, since non-contiguous ones are inferior and cannot lie on the Pareto frontier. Indeed, we will see that for the examples in Section 3, *M* ≲ 5 suffices to capture the most interesting information.

**Theorem 2** (Contiguous binning theorem)**.** *If W has a uniform distribution and the conditional probability P*(*Y* = 1|*W* = *w*) *is monotonically increasing, then all points* (*H*<sub>∗</sub>, *I*<sub>∗</sub>) *on the Pareto frontier correspond to binning W into contiguous intervals, i.e., if*

$$I\_\*(H\_\*) \equiv \sup\_{\{g \,:\, H[g(W)] \le H\_\*\}} I[g(W), Y], \tag{21}$$

*then there exists a set of bin boundaries b*<sub>1</sub> < ... < *b*<sub>*M*−1</sub> *such that the binned variable Z* ≡ *B*(*W*, **b**) ∈ {1, ..., *M*} *satisfies H*(*Z*) = *H*<sub>∗</sub> *and I*(*Z*,*Y*) = *I*<sub>∗</sub>*.*

**Proof.** We prove this by contradiction: we will assume that there is a point (*H*<sub>∗</sub>, *I*<sub>∗</sub>) on the Pareto frontier to which we can come arbitrarily close with (*H*(*Z*), *I*(*Z*,*Y*)) for *Z* ≡ *g*(*W*), for a compression function *g* : R → {1, ..., *M*} that is not a contiguous binning function, and obtain a contradiction by using *g* to construct another compression function *g*′(*W*) lying above the Pareto frontier, with *H*[*g*′(*W*)] = *H*<sub>∗</sub> and *I*[*g*′(*W*),*Y*] > *I*<sub>∗</sub>. The joint probability distribution *P*<sub>*ij*</sub> for *Z* and *Y* is given by the Lebesgue integral

$$P\_{ij} \equiv P(Z=i, Y=j) = \int f\_j \, d\mu\_i, \tag{22}$$

where *f*<sub>*j*</sub>(*w*) is the joint probability distribution for *W* and *Y* introduced earlier and *μ*<sub>*i*</sub> is the set *μ*<sub>*i*</sub> ≡ {*w* ∈ [0, 1] : *g*(*w*) = *i*}, i.e., the set of *w*-values that are grouped together into the *i*th large bin. We define the marginal and conditional probabilities

$$P\_i \equiv P(Z=i) = P\_{i1} + P\_{i2}, \qquad p\_i \equiv P(Y=1|Z=i) = \frac{P\_{i1}}{P\_i}. \tag{23}$$

Figure 4 illustrates the case where the binning function *g* corresponds to *M* = 4 large bins, the second of which consists of two non-contiguous regions that are grouped together; the shaded rectangles in the bottom panel have width *P*<sub>*i*</sub>, height *p*<sub>*i*</sub> and area *P*<sub>*i*1</sub> = *P*<sub>*i*</sub>*p*<sub>*i*</sub>.

According to Theorem A2 in Appendix B, we obtain the contradiction required to complete our proof (an alternative compression *Z*′ ≡ *g*′(*W*) above the Pareto frontier with *H*(*Z*′) = *H*<sub>∗</sub> and *I*(*Z*′,*Y*) > *I*<sub>∗</sub>) if there are two different conditional probabilities *p*<sub>*k*</sub> ≠ *p*<sub>*l*</sub> and we can change *g* into *g*′ so that the joint distribution *P*′<sub>*ij*</sub> of *Z*′ and *Y* moves *p*<sub>*k*</sub> and *p*<sub>*l*</sub> further apart while leaving the marginal distributions unchanged.


Figure 4 shows how this can be accomplished for a non-contiguous binning: Let *k* be a bin with non-contiguous support set *μ*<sub>*k*</sub> (bin 2 in the illustrated example), let *l* be a bin whose support *μ*<sub>*l*</sub> (bin 4 in the example) contains a positive-measure subset *μ*<sub>*l*</sub><sup>mid</sup> ⊂ *μ*<sub>*l*</sub> lying between two parts *μ*<sub>*k*</sub><sup>left</sup> and *μ*<sub>*k*</sub><sup>right</sup> of *μ*<sub>*k*</sub>, and define a new binning function *g*′(*w*) that differs from *g*(*w*) only by swapping a set *μ*<sub>ε</sub> ⊂ *μ*<sub>*l*</sub><sup>mid</sup> of measure ε against a subset of either *μ*<sub>*k*</sub><sup>left</sup> or *μ*<sub>*k*</sub><sup>right</sup> of the same measure ε (in the illustrated example, the binning function change implementing this swap is shown with dotted lines). This swap leaves the total measure of both bins (and hence the marginal distribution *P*<sub>*i*</sub>) unchanged, and also leaves *P*(*Y* = 1) unchanged. If *p*<sub>*k*</sub> < *p*<sub>*l*</sub>, we perform this swap between *μ*<sub>*l*</sub><sup>mid</sup> and *μ*<sub>*k*</sub><sup>right</sup> (as in the figure), and if *p*<sub>*k*</sub> > *p*<sub>*l*</sub>, we instead perform this swap between *μ*<sub>*l*</sub><sup>mid</sup> and *μ*<sub>*k*</sub><sup>left</sup>, in both cases guaranteeing that *p*<sub>*l*</sub> and *p*<sub>*k*</sub> move further apart (since *p*<sub>1</sub>(*w*) is monotonically increasing). This completes our proof by contradiction, except for the case where *p*<sub>*k*</sub> = *p*<sub>*l*</sub>; in this case, we swap to entirely eliminate the discontiguity, and repeat our swapping procedure between other bins until we either increase the entropy (again obtaining a contradiction) or end up with a fully contiguous binning (if needed, *g*(*w*) can be changed to eliminate any measure-zero subsets that ruin contiguity, since they leave the Lebesgue integral in Equation (22) unchanged).

**Figure 4.** The reason that the Pareto frontier can never be reached using non-contiguous bins is that swapping parts of them against parts of an intermediate bin can increase *I*(*Z*,*Y*) while keeping *H*(*Z*) constant. In this example, the binning function *g* assigns two separate *W*-intervals (top panel) to the same bin (bin 2), as seen in the bottom panel. The shaded rectangles have widths *P*<sub>*i*</sub>, heights *p*<sub>*i*</sub> and areas *P*<sub>*i*1</sub> = *P*<sub>*i*</sub>*p*<sub>*i*</sub>. In the upper panel, the conditional probabilities *p*<sub>*i*</sub> are monotonically increasing because they are averages of the monotonically increasing curve *p*<sub>1</sub>(*w*).

#### *2.3. Mapping the Frontier*

Theorem 2 implies that we can in practice find the Pareto frontier for any random variable *X* by searching the space of contiguous binnings of *W* = *w*(*X*) after uniformization, binning and sorting. In practice, we can first try the two-bin case by scanning the bin boundary 0 < *b*<sub>1</sub> < 1, then the three-bin case by trying bin boundaries 0 < *b*<sub>1</sub> < *b*<sub>2</sub> < 1, then the four-bin case, etc., as illustrated in Figure 1. Each of these cases corresponds to a standard multi-objective optimization problem aiming to maximize the two objectives *I*(*Z*,*Y*) and *H*(*Z*). We perform this optimization numerically with the AWS algorithm of [17], as described in the next section.
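For small *N*, the search over contiguous binnings can even be done exhaustively. The sketch below (our illustration, not the AWS algorithm of [17]) enumerates every contiguous grouping of *N* fine bins into *M* bins and records the (*H*(*Z*), *I*(*Z*,*Y*)) points whose upper envelope traces the frontier; `p1` is an assumed piecewise-constant estimate of the conditional probability:

```python
import numpy as np
from itertools import combinations

# Exhaustive scan over contiguous binnings of N uniform fine bins into M bins.
def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def scan_contiguous(p1, M):
    """Return (H(Z), I(Z,Y)) for every contiguous grouping into M bins."""
    N = len(p1)
    points = []
    for cuts in combinations(range(1, N), M - 1):
        edges = (0, *cuts, N)
        # joint distribution P(Z=i, Y=j); each fine bin carries mass 1/N
        P = np.array([[p1[a:b].sum() / N, (1 - p1[a:b]).sum() / N]
                      for a, b in zip(edges[:-1], edges[1:])])
        H = entropy(P.sum(1))
        I = H + entropy(P.sum(0)) - entropy(P.ravel())
        points.append((H, I))
    return points

p1 = np.linspace(0.05, 0.95, 20)       # assumed monotonic P(Y=1 | fine bin)
points = scan_contiguous(p1, M=3)
H_best, I_best = max(points, key=lambda p: p[1])   # the M = 3 "corner"
print(round(H_best, 3), round(I_best, 3))
```

For realistic *N* this brute force is of course infeasible, which is why a dedicated multi-objective optimizer is used in the paper.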

While the uniformization, binning and sorting procedures are helpful for simplifying proofs, they are not necessary in practice. Since what we really care about is grouping into intervals containing similar conditional probabilities *p*<sub>1</sub>(*w*), not similar *w*-values, it is easy to see that binning horizontally after sorting is equivalent to binning vertically before sorting. In other words, we can eliminate the binning and sorting steps if we replace "horizontal" binning *g*(*W*) = *B*(*W*, **b**) by "vertical" binning

$$g(W) = B[p\_1(W), \mathbf{b}], \tag{24}$$

where *p*<sup>1</sup> denotes the conditional probability as before.

#### **3. Results**

The purpose of this section is to examine how our method for Pareto-frontier mapping works in practice on various datasets, both to compare its performance with prior work and to gain insight into the shape and structure of the Pareto frontiers for well-known datasets such as the CIFAR-10 image database [18], the MNIST database of hand-written digits [19] and the Fashion-MNIST database of garment images [20]. Before doing this, however, let us build intuition for how our method works by testing on a much simpler toy model that is analytically solvable, where the accuracy of all approximations can be exactly determined.

#### *3.1. Analytic Warmup Example*

Let the random variables *X* = (*x*<sub>1</sub>, *x*<sub>2</sub>) ∈ [0, 1]<sup>2</sup> and *Y* ∈ {1, 2} be defined by the bivariate probability distribution

$$f(X,Y) = \begin{cases} 2x\_1 x\_2 & \text{if } Y = 1, \\ 2(1-x\_1)(1-x\_2) & \text{if } Y = 2, \end{cases} \tag{25}$$

which corresponds to *x*<sub>1</sub> and *x*<sub>2</sub> being two independent and identically distributed random variables with triangle distribution *f*(*x*<sub>*i*</sub>) = 2*x*<sub>*i*</sub> if *Y* = 1, but flipped *x*<sub>*i*</sub> → 1 − *x*<sub>*i*</sub> if *Y* = 2. This gives *H*(*Y*) = 1 bit and mutual information

$$I(X,Y) = 1 - \frac{\pi^2 - 4}{16\ln 2} \approx 0.4707 \text{ bits}.\tag{26}$$

The compressed random variable *W* = *w*(*X*) ∈ R defined by Equation (9) is thus

$$W = P(Y=1|X) = \frac{x\_1 x\_2}{x\_1 x\_2 + (1 - x\_1)(1 - x\_2)}. \tag{27}$$

After defining *Z* ≡ *B*(*W*, **b**) for a vector **b** of bin boundaries, a straightforward calculation shows that the joint probability distribution of *Y* and the binned variable *Z* is given by

$$P\_{ij} \equiv P(Z=i, Y=j) = F\_j(b\_{i+1}) - F\_j(b\_i), \tag{28}$$

where the cumulative distribution function *F*<sub>*j*</sub>(*w*) ≡ *P*(*W* < *w*, *Y* = *j*) is given by

$$F\_1(w) = \frac{w^2 \left[ (2w-1)(5-4w) + 2(1-w^2) \log(w^{-1}-1) \right]}{2(2w-1)^4},$$

$$F\_2(w) = \frac{1}{2} - F\_1(1-w). \tag{29}$$

Computing *I*(*W*,*Y*) using this probability distribution recovers exactly the same mutual information *I* ≈ 0.4707 bits as in Equation (26), as we proved in Theorem 1.
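This agreement can also be checked by direct numerical integration (a sketch using `scipy`, not the authors' code): *I*(*X*,*Y*) = *H*(*Y*) − E[*h*<sub>2</sub>(*W*)], where *h*<sub>2</sub> is the binary entropy function and the expectation runs over the marginal density of *X* from Equation (25).

```python
import numpy as np
from scipy.integrate import dblquad

# Numerical check of Equation (26): I(X,Y) = H(Y) - H(Y|X) with H(Y) = 1 bit
# and H(Y|X) = E[h2(W)] for W = P(Y=1|X) from Equation (27).
def h2(p):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def integrand(x2, x1):
    u = x1 * x2
    v = (1 - x1) * (1 - x2)
    f = 2 * (u + v)          # marginal density of X from Equation (25)
    w = u / (u + v)          # W = P(Y=1|X), Equation (27)
    return f * h2(w)

H_cond, _ = dblquad(integrand, 0, 1, 0, 1)
I_exact = 1 - (np.pi**2 - 4) / (16 * np.log(2))
print(round(1 - H_cond, 4), round(I_exact, 4))   # both ≈ 0.4707
```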

#### *3.2. The Pareto Frontier*

Given any binning vector **b**, we can plot a corresponding point (*H*(*Z*), *I*(*Z*,*Y*)) in Figure 1 by computing *I*(*Z*,*Y*) = *H*(*Z*) + *H*(*Y*) − *H*(*Z*,*Y*), *H*(*Z*,*Y*) = − ∑ *P*<sub>*ij*</sub> log *P*<sub>*ij*</sub>, etc., where *P*<sub>*ij*</sub> is given by Equation (28).

The figure shows 6000 random binnings each for *M* = 3, ..., 8 bins; as we have proven, the upper envelope of points corresponding to all possible (contiguous) binnings defines the Pareto frontier. The Pareto frontier begins with the black dot at (0, 0) (the lower right corner), since *M* = 1 bin obviously destroys all information. The *M* = 2 bin case corresponds to a 1-dimensional closed curve parametrized by the single parameter *b*<sub>1</sub> that specifies the boundary between the two bins: it runs from (0, 0) when *b*<sub>1</sub> = 0, moves to the left until *H*(*Z*) = 1 when *b*<sub>1</sub> = 0.5, and returns to (0, 0) when *b*<sub>1</sub> = 1. The *b*<sub>1</sub> < 0.5 and *b*<sub>1</sub> > 0.5 branches are indistinguishable in Figure 1 because of the symmetry of our warmup problem, but in generic cases, a closed loop can be seen where only the upper part defines the Pareto frontier.

More generally, we see that the set of all binnings into *M* > 2 bins maps the vector **b** of *M* − 1 bin boundaries into a contiguous region in Figure 1. The inferior white region below can also be reached if we use non-contiguous binnings.

The Pareto frontier is seen to resemble the top of a circus tent, with convex segments separated by "corners" where the derivative vanishes, corresponding to a change in the number of bins. We can understand the origin of these corners by considering what happens when adding a new bin of infinitesimal size ε. As long as *p*<sub>*i*</sub>(*w*) is continuous, this changes all probabilities *P*<sub>*ij*</sub> by amounts *δP*<sub>*ij*</sub> = O(ε), and the probabilities corresponding to the new bin (which used to vanish) will now be O(ε). The function ε log ε has infinite derivative at ε = 0, blowing up as O(log ε), which implies that the entropy increases by *δH*(*Z*) = O(−ε log ε). In contrast, a straightforward calculation shows that all log ε terms cancel when computing the mutual information, which changes only by *δI*(*Z*,*Y*) = O(ε). As we birth a new bin and move leftward from one of the black dots in Figure 1, the initial slope of the Pareto frontier is thus

$$\lim\_{\varepsilon \to 0} \frac{\delta I(Z, \mathcal{Y})}{\delta H(Z)} = 0. \tag{30}$$

In other words, the Pareto frontier starts out *horizontally* to the left of each of its corners in Figure 1. Indeed, the corners are "soft" in the sense that the derivative of the Pareto frontier is continuous and vanishes at the corners: for a given number of bins, *I*(*Z*,*Y*) by definition takes its global maximum at the corresponding corner, so the derivative *∂I*(*Z*,*Y*)/*∂H*(*Z*) vanishes also as we approach the corner from the right. The first corner (the transition from 2 to 3 bins) can nonetheless look fairly sharp, because the 2-bin curve turns around rather abruptly, and the right derivative does not vanish in the limit where a symmetry causes the upper and lower parts of the 2-bin loop to coincide.
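The scaling argument above is easy to illustrate numerically (a sketch with an arbitrary two-bin example of our own choosing, not the paper's code): carving a new bin of mass ε out of an existing bin increases *H*(*Z*) by O(−ε log ε) but *I*(*Z*,*Y*) only by O(ε), so the ratio δ*I*/δ*H* shrinks toward zero.

```python
import numpy as np

# Splitting off a new bin of probability mass eps increases H(Z) by
# O(-eps log eps) but I(Z,Y) only by O(eps), so the slope dI/dH -> 0.
def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_information(P):                    # P[i, j] = P(Z=i, Y=j)
    return entropy(P.sum(0)) + entropy(P.sum(1)) - entropy(P.ravel())

P2 = np.array([[0.1, 0.4],                    # bin 1: P(Y=1|Z=1) = 0.2
               [0.4, 0.1]])                   # bin 2: P(Y=1|Z=2) = 0.8

ratios = []
for eps in (1e-2, 1e-4, 1e-6):
    # carve a bin of mass eps with conditional probability 0.3 out of bin 1;
    # the Y-marginal stays fixed
    P3 = np.array([[0.1 - 0.3 * eps, 0.4 - 0.7 * eps],
                   [0.3 * eps, 0.7 * eps],
                   [0.4, 0.1]])
    dH = entropy(P3.sum(1)) - entropy(P2.sum(1))
    dI = mutual_information(P3) - mutual_information(P2)
    ratios.append(dI / dH)
    print(f"eps={eps:.0e}  dI/dH={dI / dH:.5f}")
```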

Our theorems imply that in the *M* → ∞ limit of infinitely many bins, successive corners become gradually less pronounced (with ever smaller derivative discontinuities), because the left asymptote of the Pareto frontier simply approaches the horizontal line *I*<sub>∗</sub> = *I*(*X*,*Y*).

#### 3.2.1. Approximating *w*(*X*)

For our toy example, we knew the conditional probability distribution *P*(*Y*|*X*) and could therefore compute *W* = *w*(*X*) = *P*(*Y* = 1|*X*) exactly. For practical examples where this is not the case, we can instead train a neural network to implement a function *ŵ*(*X*) that approximates *P*(*Y* = 1|*X*). For our toy example, we train a fully connected feedforward neural network to predict *Y* from *X* using cross-entropy loss; it has two hidden layers, each with 256 neurons with ReLU activation, and a final linear layer with softmax activation, whose first neuron defines *ŵ*(*X*). As illustrated in Figure 5, the network prediction for the conditional probability *ŵ*(*X*) ≈ *P*(*Y* = 1|*X*) is fairly accurate, but slightly over-confident, tending to err on the side of predicting more extreme probabilities (further from 0.5). The average KL-divergence between the predicted and actual conditional probability distribution *P*(*Y*|*X*) is about 0.004, which causes negligible loss of information about *Y*.

**Figure 5.** Contour plot of the function *w*(*x*<sub>1</sub>, *x*<sub>2</sub>) computed both exactly using Equation (27) (solid curves) and approximately using a neural network (dashed curves).

#### 3.2.2. Approximating *f*<sub>1</sub>(*w*)

For practical examples where the conditional joint probability distribution *P*(*W*,*Y*) cannot be computed analytically, we need to estimate it from the observed distribution of *W*-values output by the neural network. For our examples, we do this by fitting each probability distribution by a beta-distribution times the exponential of a polynomial of degree *d*:

$$f(w, \mathbf{a}) \equiv \exp\left[\sum\_{k=0}^{d} a\_k w^k\right] w^{a\_{d+1}} (1 - w)^{a\_{d+2}}, \tag{31}$$

where the coefficient *a*<sub>0</sub> is fixed by the normalization requirement ∫<sub>0</sub><sup>1</sup> *f*(*w*, **a**) *dw* = 1. We use this simple parametrization because it can fit any smooth distribution arbitrarily well for sufficiently large *d*, and provides accurate fits for the probability distributions in our examples using quite modest *d*; for example, *d* = 3 gives *d*<sub>KL</sub>[*f*<sub>1</sub>(*w*), *f*(*w*, **a**)] ≈ 0.002 for

$$\mathbf{a} \equiv \operatorname\*{argmin}\_{\mathbf{a}'} d\_{\mathrm{KL}}[f\_1(w), f(w, \mathbf{a}')] = (-1.010, 2.319, -5.579, 4.887, 0.308, -0.307), \tag{32}$$

which causes rather negligible loss of information about *Y*. For our examples below, where we do not know the exact distribution *f*<sub>1</sub>(*w*) and merely have samples *W*<sub>*i*</sub> drawn from it, one for each element of the data set, we instead perform the fitting by the standard technique of minimizing the cross-entropy loss, i.e.,

$$\mathbf{a} \equiv \operatorname\*{argmin}\_{\mathbf{a}'} - \sum\_{k=1}^{n} \log f(\mathcal{W}\_k, \mathbf{a}'). \tag{33}$$
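A minimal sketch of such a fit (our illustration with a degree-1 polynomial, a numerically computed normalization constant, and Beta(2,3)-distributed stand-in samples; the authors' parametrization and optimizer may differ):

```python
import numpy as np
from scipy.optimize import minimize

# Fit f(w, a) ∝ exp(a[0] w) * w**a[1] * (1-w)**a[2] to samples W_k by
# minimizing the negative log-likelihood (Equation (33)); the normalization
# (the role of a_0 in Equation (31)) is computed numerically on a grid.
GRID = np.linspace(1e-4, 1 - 1e-4, 2000)

def log_f_unnorm(w, a):
    return a[0] * w + a[1] * np.log(w) + a[2] * np.log(1 - w)

def neg_log_likelihood(a, w):
    dw = GRID[1] - GRID[0]
    log_norm = np.log(np.exp(log_f_unnorm(GRID, a)).sum() * dw)
    return -(log_f_unnorm(w, a) - log_norm).sum()

rng = np.random.default_rng(0)
w_samples = rng.beta(2.0, 3.0, size=5000)     # stand-in for the observed W_k
res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(w_samples,),
               method="Nelder-Mead")
print(res.x)   # for Beta(2,3) data, exponents should come out near (1, 2)
```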

Table 2 lists the fitting coefficients used, and Figure 6 illustrates the fitting accuracy.


**Table 2.** Fits to the conditional probability distributions *P*(*W*|*Y*) for our experiments, in terms of the parameters *a*<sub>*i*</sub> defined by Equation (31).

#### *3.3. MNIST, Fashion-MNIST and CIFAR-10*

The MNIST database consists of 28 × 28 pixel greyscale images of handwritten digits: 60,000 training images and 10,000 testing images [19]. We use the digits 1 and 7, since they are the two that are most frequently confused, relabeled as *Y* = 1 (ones) and *Y* = 2 (sevens). To increase difficulty, we inject 30% pixel noise, i.e., randomly flip each pixel with 30% probability (see examples in Figure 2). For easy comparison with the other cases, we use the same number of samples for each class.

The Fashion-MNIST database has the exact same format (60,000 + 10,000 28 × 28 pixel greyscale images), depicting not digits but 10 classes of clothing [20]. Here we again use the two most easily confused classes: Pullovers (*Y* = 1) and shirts (*Y* = 2); see Figure 2 for examples.

We train a neural network classifier on our datasets using the architecture from https://github.com/pytorch/examples/blob/master/mnist/main.py, changing the number of output neurons from 10 to 2. We use two convolutional layers (kernel size 5, stride 1, ReLU activation) with 20 and 50 features, respectively, each of which is followed by max-pooling with kernel size 2. This is followed by a fully connected layer with 500 ReLU neurons and finally a softmax layer that produces the predicted probabilities for the two classes. After training, we apply the trained model to the test set to obtain *W*<sub>*i*</sub> = *P*(*Y* = 1|*X*<sub>*i*</sub>) for each dataset.

CIFAR-10 [21] is one of the most widely used datasets for machine learning research, and contains 60,000 32 × 32 color images in 10 different classes. We use only the cat (*Y* = 1) and dog (*Y* = 2) classes, which are the two that are empirically hardest to discriminate; see Figure 2 for examples. We use the ResNet18 model [22] adapted from https://github.com/kuangliu/pytorch-cifar; the only difference in architecture is that we use 2 rather than 10 output neurons. We train with a learning rate of 0.01 for the first 150 epochs, 0.001 for the next 100, and 0.0001 for the final 100 epochs; we keep all other settings the same as in the original repository.

Figure 6 shows observed cumulative distribution functions *F*<sub>*i*</sub>(*w*) (solid curves) for the *W*<sub>*i*</sub> = *P*(*Y* = 1|*X*<sub>*i*</sub>) generated by the neural network classifiers, together with our above-mentioned analytic fits (dashed curves). Figure 7 shows the corresponding conditional probability curves *P*(*Y* = 1|*W*) after remapping *W* to have a uniform distribution as described above. Figure 6 shows that the original *W*-distributions are strongly peaked around *W* ≈ 0 and *W* ≈ 1, so this remapping stretches the *W*-axis so as to shift probability toward more central values.

In the case of CIFAR-10, the observed distribution *f*(*w*) was so extremely peaked near the endpoints that we replaced Equation (31) by the more accurate fit

$$f(w) \equiv F'(w), \tag{34}$$

$$F(w) \equiv \begin{cases} a\_0^A F\_\*[w, \mathbf{a}^A] & \text{if } w < 1/2, \\ 1 - (1 - a\_0^A) F\_\*[1 - w, \mathbf{a}^B] & \text{otherwise,} \end{cases} \tag{35}$$

$$F\_\*(x) \equiv G\left[\frac{(2x)^{a\_1}}{2}\right], \tag{36}$$

$$G(x) \equiv \left[ \left( \frac{x}{a\_2} \right)^{a\_3 a\_4} + \left( a\_5 + a\_6 x \right)^{a\_4} \right]^{1/a\_4}, \tag{37}$$

$$a\_6 \equiv 2 \left[ \left(1 - (2a\_2)^{-a\_3 a\_4}\right)^{1/a\_4} - a\_5 \right], \tag{38}$$

where the parameter vectors **a**<sup>*A*</sup> and **a**<sup>*B*</sup> are given in Table 2 for both cats and dogs. For the cat case, this fit gives not *f*(*w*) but *f*(1 − *w*). Note that *F*<sub>∗</sub>(0) = 0 and *F*<sub>∗</sub>(1/2) = 1.

**Figure 6.** Cumulative distributions *F*<sub>*i*</sub>(*w*) ≡ *P*(*W* < *w*|*Y* = *i*) are shown for the analytic (blue/dark grey), Fashion-MNIST (red/grey) and CIFAR-10 (orange/light grey) examples. Solid curves show the observed cumulative histograms of *W* from the neural network, and dashed curves show the fits defined by Equation (31) and Table 2.

**Figure 7.** The solid curves show the actual conditional probability *P*(*Y* = 1|*W*) for CIFAR-10 (where the labels *Y* = 1 and 2 correspond to "cat" and "dog") and MNIST with 20% label noise (where the labels *Y* = 1 and 2 correspond to "1" and "7"), respectively. The color-matched dashed curves show the conditional probabilities predicted by the neural network; the reason that they are not diagonal lines *P*(*Y* = 1|*W*) = *W* is that *W* has been reparametrized to have a uniform distribution. If the neural network classifiers were optimal, then the solid and dashed curves would coincide.

The final result of our calculations is shown in Figure 8: The Pareto frontiers for our four datasets, computed using our method.

**Figure 8.** The Pareto frontier for compressed versions *Z* = *g*(*X*) of our four datasets *X*, showing the maximum attainable class information *I*(*Z*,*Y*) for a given entropy *H*(*Z*). The "corners" (dots) correspond to the maximum *I*(*Z*,*Y*) attainable when binning the likelihood *W* into a given number of bins (2, 3, ..., 8 from right to left). The horizontal dotted lines show the maximum available information *I*(*X*,*Y*) for each case, reflecting that there is simply less to learn in some examples than in others.

#### *3.4. Interpretation of Our Results*

To build intuition for our results, let us consider our CIFAR-10 example of images *X* depicting cats (*Y* = 1) and dogs (*Y* = 2) as in Figure 2, and ask what aspects *Z* = *g*(*X*) of an image *X* capture the most information about the species *Y*. Above, we estimated that *I*(*X*,*Y*) ≈ 0.69692 bits, so what *Z* captures the largest fraction of this information for a fixed entropy? Given a good neural network classifier, a natural guess might be the single bit *Z* containing its best guess, say "it's probably a cat". This corresponds to defining *Z* = 1 if *P*(*Y* = 1|*X*) > 0.5 and *Z* = 2 otherwise, and gives the joint distribution *P*<sub>*ij*</sub> ≡ *P*(*Y* = *i*, *Z* = *j*)

$$\mathbf{P} = \begin{pmatrix} 0.454555 & 0.045445 \\ 0.042725 & 0.457275 \end{pmatrix}.$$

corresponding to *I*(*Z*,*Y*) ≈ 0.56971 bits. However, our results show that we can improve things in two separate ways.

First of all, if we only want to store one bit *Z*, then we can do better, corresponding to the first "corner" in Figure 8: moving the likelihood cutoff from 0.5 to 0.51, i.e., redefining *Z* = 1 if *P*(*Y* = 1|*X*) > 0.51, increases the mutual information to *I*(*Z*,*Y*) ≈ 0.56974 bits.

More importantly, we are still falling far short of the 0.69692 bits of information we had without data compression, capturing only 82% of the available species information. Our Theorem 1 showed that we can retain all of this information if we instead define *Z* as the cat probability itself: *Z* ≡ *W* ≡ *P*(*Y* = 1|*X*). For example, a given image might be compressed not into "It's probably a cat" but into "I'm 94.2477796% sure it's a cat". However, it is clearly impractical to report the infinitely many decimals required to retain all the species information, which would make *H*(*Z*) infinite. Our results can, loosely speaking, be interpreted as the optimal way to round *Z*, so that the information *H*(*Z*) required to store it becomes finite. We found that simply rounding to a fixed number of decimals is suboptimal; for example, if we pick 2 decimals and say "I'm 94.25% sure it's a cat", then we have effectively binned the probability *W* into 10,000 bins of equal size, even though we can often do much better with bins of unequal size, as illustrated in the bottom panel of Figure 1. Moreover, when the probability *W* is approximated by a neural network, we found that what should be optimally binned is not *W* but the conditional probability *P*(*Y* = 1|*W*) illustrated in Figure 7 ("vertical binning").

It is convenient to interpret our Pareto-optimal data compression *X* → *Z* as clustering, i.e., as a method of grouping our images or other data *Xi* into clusters based on what information they contain about *Y*. For example, Figure 2 illustrates CIFAR-10 images clustered by their degree of "cattiness" into 5 groups *Z* = 1, ..., 5 that might be nicknamed "1.9% cat", "11.8% cat", "31.4% cat", "68.7% cat" and "96.7% cat". This gives the joint distribution *Pij* ≡ *P*(*Y* = *i*, *Z* = *j*) where

$$\mathbf{P} = \begin{pmatrix} 0.350685 & 0.053337 & 0.054679 & 0.034542 & 0.006756 \\ 0.007794 & 0.006618 & 0.032516 & 0.069236 & 0.383836 \end{pmatrix}.$$

corresponding to *I*(*Z*,*Y*) ≈ 0.6882 bits, thus increasing the fraction of species information retained from 82% to 99%.

This is a striking result: We can group the images into merely five groups and discard all information about all images except which group they are in, yet retain 99% of the information we cared about. Such grouping may be helpful in many contexts. For example, given a large sample of labeled medical images of potential tumors, our method can be used to define, say, five optimal clusters, after which future images can be classified into five degrees of cancer risk that collectively retain virtually all the malignancy information in the original images.

Given that the Pareto frontier is continuous and corresponds to an infinite family of possible clusterings, which one is most useful in practice? Just as in more general multi-objective optimization problems, the most interesting points on the frontier are arguably its "corners", indicated by dots in Figure 8, where we do notably well on both criteria. This point was also made in the important paper [23] in the context of the DIB frontier discussed below. We see that the parts of the frontier between corners tend to be convex and thus rather unappealing, since any weighted average of −*H*(*Z*) and *I*(*Z*,*Y*) will be maximized at a corner. Our results show that these corners can conveniently be computed without numerically tedious multi-objective optimization, by simply maximizing the mutual information *I*(*Z*,*Y*) for *m* = 2, 3, 4, ... bins. The first corner, at *H*(*Z*) = 1 bit, corresponds to the learnability phase transition for DIB, i.e., the largest *β* for which DIB is able to learn a non-trivial representation. In contrast to the IB learnability phase transition [24,25], where *I*(*Z*,*Y*) increases continuously from 0, here *I*(*Z*,*Y*) jumps from 0 to a positive value, due to the non-concave nature of the Pareto frontier.

Moreover, all the examples in Figure 8 are seen to get quite close to the *m* → ∞ asymptote *I*(*Z*,*Y*) → *I*(*X*,*Y*) for *m* ≳ 5, so the most interesting points on the Pareto frontier are simply the first handful of corners. For these examples, we also see that the greater the mutual information is, the fewer bins are needed to capture most of it.

An alternative way of interpreting the Pareto plane in Figure 8 is as a tradeoff between two evils:

**Information bloat:** *H*(*Z*|*Y*) ≡ *H*(*Z*) − *I*(*Z*,*Y*) ≥ 0; **Information loss:** Δ*I* ≡ *I*(*X*,*Y*) − *I*(*Z*,*Y*) ≥ 0.

What we are calling the "information bloat" has also been called "causal waste" [26]. It is simply the conditional entropy of *Z* given *Y*, and represents the excess bits we need to store in order to retain the desired information about *Y*. Geometrically, it is the horizontal distance to the impossible region on the right in Figure 8, and we see that for MNIST, it takes local minima at the corners for both 1 and 2 bins. The information loss is simply the information discarded by our lossy compression of *X*. Geometrically, it is the vertical distance to the impossible region at the top of Figure 1 (and, in Figure 8, it is the vertical distance to the corresponding dotted horizontal line). As we move from corner to corner adding more bins, we typically reduce the information loss at the cost of increased information bloat. For the examples in Figure 8, we see that going beyond a handful of bins essentially just adds bloat without significantly reducing the information loss.
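For concreteness, both quantities can be evaluated for the five-cluster CIFAR-10 example above. A sketch (Python, ours rather than the paper's MATLAB; the value *I*(*X*,*Y*) ≈ 0.69692 bits is taken from the text):

```python
import math

# Joint distribution P(Y = i, Z = j) for the five "cattiness" clusters (from the text)
P = [[0.350685, 0.053337, 0.054679, 0.034542, 0.006756],
     [0.007794, 0.006618, 0.032516, 0.069236, 0.383836]]

pZ = [sum(c) for c in zip(*P)]  # marginal P(Z = j)
pY = [sum(r) for r in P]        # marginal P(Y = i)
H_Z = -sum(p * math.log2(p) for p in pZ)
I_ZY = sum(p * math.log2(p / (pY[i] * pZ[j]))
           for i, r in enumerate(P) for j, p in enumerate(r))
I_XY = 0.69692                  # I(X,Y) as estimated in the text

bloat = H_Z - I_ZY              # H(Z|Y): excess bits we must store
loss = I_XY - I_ZY              # information about Y discarded by the compression
print(I_ZY, bloat, loss)        # ≈ 0.6882, ≈ 1.26, ≈ 0.009 bits
```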

#### *3.5. Real-World Issues*

We just discussed how lossy compression is a tradeoff between information bloat and information loss. Let us now elaborate on the latter, for the real-world situation where *W* ≡ *P*(*Y* = 1|*X*) is approximated by a neural network.

If the neural network learns to become perfect, then the function *w* that it implements will be such that *W* ≡ *w*(*X*) satisfies *P*(*Y* = 1|*W*) = *W*, which corresponds to the dashed curves in Figure 7 being identical to the solid curves. While we see that this is close to being the case for the analytic and MNIST examples, the neural networks are further from optimal for Fashion-MNIST and CIFAR-10. The figure illustrates that the general trend is for these neural networks to overfit and therefore be overconfident, predicting probabilities that are too extreme.

The fact that *P*(*Y* = 1|*W*) ≠ *W* probably indicates that our Fashion-MNIST and CIFAR-10 classifiers *W* = *w*(*X*) destroy information about *X*, but it does not prove this, because if we had a perfect lossless classifier *W* ≡ *w*(*X*) satisfying *P*(*Y* = 1|*W*) = *W*, then we could define an overconfident lossless classifier by an invertible (and hence information-preserving) reparameterization such as *W*′ ≡ *W*<sup>2</sup> that violates the condition *P*(*Y* = 1|*W*′) = *W*′.

So how much information does *X* contain about *Y*? One way to lower-bound *I*(*X*;*Y*) uses the classification accuracy: if we have a classification problem where *P*(*Y* = 1) = 1/2 and compress *X* into a single classification bit *Z* (corresponding to a binning of *W* into two bins), then we can write the joint probability distribution for *Y* and the guessed class *Z* as

$$P = \begin{pmatrix} \frac{1}{2} - \epsilon\_1 & \epsilon\_1 \\ \epsilon\_2 & \frac{1}{2} - \epsilon\_2 \end{pmatrix}.$$

For a fixed total error rate *ε* ≡ *ε*1 + *ε*2, Fano's inequality implies that the mutual information takes its minimum

$$I(Z,Y) = 1 + \epsilon \log \epsilon + (1 - \epsilon) \log(1 - \epsilon) \tag{39}$$

when *ε*1 = *ε*2 = *ε*/2, so if we can train a classifier that gives an error rate *ε*, then the right-hand side of Equation (39) places a lower bound on the mutual information *I*(*X*,*Y*). The prediction accuracy 1 − *ε* is shown for reference on the right side of Figure 8. Note that getting close to one bit of mutual information requires extremely high accuracy; for example, 99% prediction accuracy corresponds to only 0.92 bits of mutual information.
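As a sanity check on Equation (39), the following sketch (Python, for illustration) evaluates this Fano-type bound, reproducing the 0.92-bit figure quoted above:

```python
import math

def fano_lower_bound(eps):
    """Minimum I(Z,Y) in bits for a balanced binary task with total error
    rate eps, attained at eps1 = eps2 = eps/2 (Equation (39))."""
    return 1.0 + eps * math.log2(eps) + (1.0 - eps) * math.log2(1.0 - eps)

# 99% prediction accuracy pins down surprisingly little information:
print(round(fano_lower_bound(0.01), 3))  # 0.919 bits, i.e., the "0.92 bits" in the text
```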

We can obtain a stronger estimated lower bound on *I*(*X*,*Y*) from the cross-entropy loss function L used to train our classifiers:

$$\langle \mathcal{L} \rangle = -\left\langle \log P(Y = Y\_i | X = X\_i) \right\rangle = H(Y|X) + d\_{\text{KL}}, \tag{40}$$

where *d*KL ≥ 0 denotes the average KL divergence between true and predicted conditional probability distributions, and ⟨·⟩ denotes ensemble averaging over data points, which implies that

$$\begin{aligned} I(X,Y) &= H(Y) - H(Y|X) = H(Y) - \langle \mathcal{L} \rangle + d\_{\text{KL}} \\ &\ge H(Y) - \langle \mathcal{L} \rangle. \end{aligned} \tag{41}$$

If *P*(*Y* = 1|*W*) ≠ *W* as we discussed above, then *d*KL, and hence the loss, can be further reduced by recalibrating *W* as we have done, which increases the information bound from Equation (41) up to the value computed directly from the observed joint distribution *P*(*W*,*Y*).
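To illustrate Equations (40) and (41) and the effect of calibration, here is a small numerical sketch (ours, not from the paper): we take *W* uniform on [0, 1] with *P*(*Y* = 1|*W* = *w*) = *w*, and compare the bound *H*(*Y*) − ⟨L⟩ for a perfectly calibrated predictor (ŵ = *w*) with a miscalibrated one (ŵ = *w*<sup>2</sup>):

```python
import math

def bound(predict, n=200000):
    """H(Y) - <L> in bits, the Equation (41) lower bound on I(X,Y), for a
    predictor predict(w) of P(Y=1|W=w), with W uniform on [0,1] and the
    true conditional probability P(Y=1|W=w) = w. The expected cross-entropy
    loss <L> is computed by midpoint integration over w."""
    loss = 0.0
    for k in range(n):
        w = (k + 0.5) / n
        p = predict(w)
        loss += -(w * math.log2(p) + (1 - w) * math.log2(1 - p)) / n
    return 1.0 - loss  # H(Y) = 1 bit, since the average of W is 1/2

print(bound(lambda w: w))       # ≈ 0.2787: calibrated, equals I(W,Y) = 1 - 1/(2 ln 2)
print(bound(lambda w: w ** 2))  # ≈ 0.1146: miscalibrated, so the bound is weaker
```

As the text states, recalibrating the predictor (here, replacing *w*<sup>2</sup> by *w*) tightens the bound back up to the true mutual information.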

Unfortunately, without knowing the true probability *p*(*Y*|*X*), there is no rigorous and practically useful upper bound on the mutual information other than the trivial inequality *I*(*X*,*Y*) < *H*(*Y*) = 1 bit, as the following simple counterexample shows: Suppose our images *X* are encrypted with some encryption algorithm that is extremely time-consuming to crack, rendering the images for all practical purposes indistinguishable from random noise. Then any reasonable neural network will produce a useless classifier giving *I*(*W*,*Y*) ≈ 0 even though the true mutual information *I*(*X*,*Y*) could be as large as one bit. In other words, we generally cannot know the true information loss caused by compressing *X* → *W*, so the best we can do in practice is to pick a corner reasonably close to the upper asymptote in Figure 8.

#### *3.6. Performance Compared with Blahut–Arimoto Method*

The most commonly used technique to date for finding the Pareto frontier is the Blahut–Arimoto (BA) method [27,28] applied to the DIB objective of Equation (3), as described in [12]. Figure 9 and Table 3 show the BA method implemented as in [23], applied to our above-mentioned analytic toy example, after binning using 2000 equispaced *W*-bins and *Z* ∈ {1, ..., 8}, scanning the *β*-parameter from Equation (3) from 10<sup>−10</sup> to 1 in 20,000 logarithmically equispaced steps. Our method is seen to improve on the BA method in two ways. First, our method finds the entire continuous frontier, whereas the BA method finds only six discrete disconnected points. This is because the BA method tries to maximize the DIB objective from Equation (3) and thus cannot discover points where the Pareto frontier is convex, as discussed above. Second, our method finds the exact frontier, whereas the BA method finds only approximations, which are seen to generally lie below the true frontier.

**Table 3.** The approximate Pareto frontier points for our analytic example computed with the Blahut–Arimoto (BA) method compared with the points for those same six *H*-values computed with our exact method.


**Figure 9.** The Pareto frontier of our analytic example, computed exactly with our method (solid curve) and approximately with the Blahut–Arimoto method (dots).

#### **4. Conclusions and Discussion**

We have presented a method for mapping out the Pareto frontier for classification tasks (as in Figure 8), reflecting the tradeoff between retained entropy and class information. In other words, we have generalized the quest for maximizing raw classification accuracy to that of mapping the full Pareto frontier corresponding to the accuracy–complexity tradeoff. The optimal soft classifiers that we have studied (corresponding to points on the Pareto frontier) are useful for the same reason that the DIB method is useful, e.g., overfitting less and therefore generalizing better.

We first showed how a random variable *X* (an image, say) drawn from a class *Y* ∈ {1, ..., *n*} can be distilled into a vector *W* = *f*(*X*) ∈ ℝ<sup>*n*−1</sup> losslessly, so that *I*(*W*,*Y*) = *I*(*X*,*Y*). For the *n* = 2 case of binary classification, we then showed how the Pareto frontier is swept out by a one-parameter family of binnings of *W* into a discrete variable *Z* = *gβ*(*W*) ∈ {1, ..., *mβ*} that corresponds to binning *W* into *mβ* = 2, 3, ... bins, such that *I*(*Z*,*Y*) is maximized for each fixed entropy *H*(*Z*). Our method efficiently finds the exact Pareto frontier, significantly outperforming the Blahut–Arimoto (BA) method [27,28]. Our MATLAB code for computing the Pareto frontier is freely available at https://github.com/tailintalent/distillation.

#### *4.1. Relation to Information Bottleneck*

As mentioned in the introduction, the discrete information bottleneck (DIB) method [12] maximizes a linear combination *I*(*Z*,*Y*) − *βH*(*Z*) of the two axes in Figure 8. We have presented a method solving a generalization of the DIB problem. The generalization lies in switching the objective from Equation (3) to Equation (1), which has the advantage of discovering the full Pareto frontier in Figure 8 instead of merely the corners and concave parts (as mentioned, the DIB objective cannot discover convex parts of the frontier). The solution lies in our proof that the frontier is spanned by binnings of the likelihood into 2, 3, 4, etc. bins, which enables it to be computed more efficiently than with the iterative/variational method of [12].

The popular original information bottleneck (IB) method [10] generalizes DIB by allowing the compression function *g*(*X*) to be non-deterministic, thus adding noise that is independent of *X*. Starting with a Pareto-optimal *Z* ≡ *g*(*X*) and adding such noise will simply shift us straight to the left in Figure 8, away from the frontier (which is by definition monotonically decreasing) and into the Pareto-suboptimal region of the *I*(*Z*,*Y*) vs. *H*(*Z*) plane. As shown in [12], IB compressions tend to altogether avoid the rightmost part of Figure 8, with an entropy *H*(*Z*) that never drops below some fixed value independent of *β*.

#### *4.2. Relation to Phase Transitions in DIB Learning*

Recent work has revealed interesting phase transitions that occur during information bottleneck learning [12,24,25,29], as well as phase transitions in other objectives, e.g., *β*-VAE [30], infoDropout [31]. Specifically, when the *β*-parameter that controls the tradeoff between information retention and model simplicity is continuously adjusted, the resulting point in the IB-plane can sometimes "get stuck" or make discontinuous jumps. For the DIB case, our results provide an intuitive understanding of these phase transitions in terms of the geometry of the Pareto frontier.

Let us consider Figure 1 as an example. The DIB maximization of *I*(*Z*,*Y*) − *βH*(*Z*) geometrically corresponds to finding a tangent line to the Pareto frontier of slope −*β*.

If the Pareto frontier *I*∗(*H*) were everywhere continuous and concave, so that *I*∗′′(*H*) < 0, then its slope would range from some steepest value −*β*∗ at the right endpoint *H* = 0 and continuously flatten out as we move leftward, asymptotically approaching zero slope as *H* → ∞. The learnability phase transition studied in [24,25] would then occur when *β* = *β*∗: for any *β* ≥ *β*∗, the DIB method learns nothing, i.e., discovers as optimal the point (*H*, *I*) = (0, 0) where *Z* retains no information whatsoever about *Y*. As *β* ≤ *β*∗ is continuously reduced, the DIB-discovered point would then continuously move up and to the left along the Pareto frontier.

This was for the case of an everywhere concave frontier, but Figures 1 and 8 show that actual Pareto frontiers need not be concave; indeed, none of the frontiers that we have computed are concave. Instead, they are seen to consist of long convex segments joined together by short concave pieces near the "corners". This means that as *β* is continuously decreased, the DIB solution exhibits first-order phase transitions, making discontinuous jumps from corner to corner at certain critical *β*-values; these phase transitions correspond to increasing the number of clusters into which the data *X* is grouped.

#### *4.3. Outlook*

Our results suggest a number of opportunities for further work, ranging from information theory to machine learning, neuroscience and physics.

As to information theory, it will be interesting to try to generalize our method from binary classification to classification into more than two classes. Moreover, one can ask if there is a way of pushing the general information distillation problem all the way down to bits. It is easy to show that a discrete random variable *Z* ∈ {1, ..., *m*} can always be encoded as *m* − 1 independent random bits (Bernoulli variables) *B*1, ..., *Bm*−1 ∈ {0, 1}, defined by

$$P(B\_k = 1) = P(Z = k + 1) / P(Z \le k + 1),\tag{42}$$

although this generically requires some information bloat. The mapping *z* from bit strings **B** to integers *Z* ≡ *z*(**B**) is defined so that *z*(**B**) is the position of the last bit that equals one when **B** is preceded by a one. For example, for *m* = 4, the mapping from length-3 bit strings **B** ∈ {0, 1}<sup>3</sup> to integers *Z* ∈ {1, ..., 4} is *z*(001) = *z*(011) = *z*(101) = *z*(111) = 4, *z*(010) = *z*(110) = 3, *z*(100) = 2, *z*(000) = 1. So in the spirit of the introduction, is there some useful way of generalizing PCA, autoencoders, CCA and/or the method we have presented so that the quantities *Zi* and *Z*′*i* in Table 1 are not real numbers but bits?
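Both the bit probabilities of Equation (42) and the decoding map *z* are easy to verify numerically. A sketch (Python, our illustration) that checks the *m* = 4 example above and confirms that independent bits with these probabilities reproduce an arbitrary *P*(*Z*) exactly:

```python
from itertools import product

def bit_probs(p):
    """P(B_k = 1) = P(Z = k+1) / P(Z <= k+1), Equation (42); here p[j] = P(Z = j+1)."""
    cum = [sum(p[:k + 1]) for k in range(len(p))]
    return [p[k + 1] / cum[k + 1] for k in range(len(p) - 1)]

def z(bits):
    """Decode: position of the last 1 when the bit string is preceded by a 1."""
    return max((k + 2 for k, b in enumerate(bits) if b == 1), default=1)

# The m = 4 example from the text:
assert [z(b) for b in [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]] == [4, 4, 4, 4]
assert z((0, 1, 0)) == z((1, 1, 0)) == 3 and z((1, 0, 0)) == 2 and z((0, 0, 0)) == 1

# Independent bits with probabilities (42) reproduce P(Z) exactly:
p = [0.1, 0.2, 0.3, 0.4]
q = bit_probs(p)
induced = [0.0] * 4
for bits in product([0, 1], repeat=3):
    prob = 1.0
    for qk, b in zip(q, bits):
        prob *= qk if b else 1 - qk
    induced[z(bits) - 1] += prob
print(induced)  # [0.1, 0.2, 0.3, 0.4] up to floating-point rounding
```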

As to neural networks, it is interesting to explore novel classifier architectures that reduce the overfitting and resulting overconfidence revealed by Figure 7, as this might significantly increase the amount of information we can distill into our compressed data. It is important not to complacently declare victory just because classification accuracy is high; as mentioned, even 99% binary classification accuracy can waste 8% of the information.

As to neuroscience, our discovery of optimal "corner" binnings raises the question of whether evolution may have implemented such categorization in brains. For example, if some binary variable *Y* that can be inferred from visual imagery is evolutionarily important for a given species (say, whether potential food items are edible), might our method help predict how many distinct colors *m* their brains have evolved to classify hues into? In this example, *X* might be a triplet of real numbers corresponding to light intensity recorded by three types of retinal photoreceptors, and the integer *Z* might end up corresponding to some definitions of yellow, orange, etc. A similar question can be asked for other cases where brains define finite numbers of categories, for example categories defined by distinct words.

As to physics, it has been known ever since the introduction of Maxwell's demon that a physical system can use information about its environment to extract work from it. If we view an evolved life form as an intelligent agent seeking to perform such work extraction, then it faces a tradeoff between retaining too little relevant information (consequently extracting less work) and retaining too much (wasting energy on information processing and storage). Susanne Still recently proved the remarkable physics result [32] that the lossy data compression optimizing such work extraction efficiency is precisely that prescribed by the above-mentioned information bottleneck method [10]. As she puts it, an intelligent data representation strategy emerges from the optimization of a fundamental physical limit to information processing. This derivation made minimal and reasonable-seeming assumptions about the physical system, but did not include an energy cost for information encoding. We conjecture that this can be done such that an extra Shannon coding term proportional to *H*(*Z*) gets added to the loss function, which means that when this term dominates, the generalized Still criterion would instead prefer the deterministic information bottleneck or one of our Pareto-optimal data compressions.

While noise-adding IB-style data compression may turn out to be commonplace in many biological settings, it is striking that the types of data compression we typically associate with human perception and intelligence appear more deterministic, in the spirit of DIB and our work. For example, when we compress visual input into "this is probably a cat", we do not typically add noise by deliberately flipping our memory to "this is probably a dog". Similarly, the popular JPEG image compression algorithm dramatically reduces image sizes while retaining essentially all information that we humans find relevant, and does so deterministically, without adding noise.

It is striking that simple information-theoretical principles such as IB, DIB and Pareto-optimality appear relevant across the spectrum of known intelligence, ranging from extremely simple physical systems as in Still's work all the way up to high-level human perception and cognition. This motivates further work on the exciting quest for a deeper understanding of Pareto-optimal data compression and its relation to neuroscience and physics.

**Author Contributions:** Conceptualization, resources, supervision, project administration, funding acquisition, M.T.; methodology, software, validation, formal analysis, investigation, writing—original draft preparation, writing—review and editing, visualization, M.T. and T.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by The Casey and Family Foundation, the Ethics and Governance of AI Fund, the Foundational Questions Institute, the Rothberg Family Fund for Cognitive Science and by Theiss Research through TWCF grant #0322. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the funders.

**Acknowledgments:** The authors wish to thank Olivier de Weck for sharing the AWS multiobjective optimization software.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A. Binning Can Be Practically Lossless**

If the conditional probability distribution *p*1(*w*) ≡ *P*(*Y* = 1|*W* = *w*) is a slowly varying function and the range of *W* is divided into tiny bins, then *p*1(*w*) will be almost constant within each bin, so binning *W* (discarding information about the exact position of *W* within a bin) should destroy almost no information about *Y*. This intuition is formalized by the following theorem, which says that a random variable *W* can be binned into a finite number of bins at the cost of losing arbitrarily little information about *Y*.

**Theorem A1.** *Binning can be practically lossless: Given a random variable Y* ∈ {1, 2} *and a uniformly distributed random variable W* ∈ [0, 1] *such that the conditional probability distribution p*1(*w*) ≡ *P*(*Y* = 1|*W* = *w*) *is monotonic, there exists for any real number ε* > 0 *a vector* **b** ∈ ℝ<sup>*N*−1</sup> *of bin boundaries such that the information reduction*

$$\Delta I \equiv I[W, Y] - I[B(W, \mathbf{b}), Y] < \epsilon,$$

*where B is the binning function defined by Equation (17).*

**Proof.** The binned bivariate probability distribution is

$$P\_{ij} \equiv P(Z=j, Y=i) = \int\_{b\_{j-1}}^{b\_j} p\_i(w) dw \tag{A1}$$

with marginal distribution

$$P\_j^Z \equiv P(Z=j) = b\_j - b\_{j-1}.\tag{A2}$$

Let *p̄i*(*w*) denote the piecewise constant function that in the *j*th bin *bj*−1 < *w* ≤ *bj* takes the average value of *pi*(*w*) in that bin, i.e.,

$$\bar{p}\_i(w) \equiv \frac{1}{b\_j - b\_{j-1}} \int\_{b\_{j-1}}^{b\_j} p\_i(w) dw = \frac{P\_{ij}}{P\_j^Z}.\tag{A3}$$

These definitions imply that

$$-\sum\_{j=1}^{N} P\_{ij} \log \frac{P\_{ij}}{P\_j^Z} = \int\_0^1 h\left[\bar{p}\_i(w)\right] dw,\tag{A4}$$

where *<sup>h</sup>*(*x*) ≡ −*<sup>x</sup>* log *<sup>x</sup>*. Since *<sup>h</sup>*(*x*) vanishes at *<sup>x</sup>* <sup>=</sup> 0 and *<sup>x</sup>* <sup>=</sup> 1 and takes its intermediate maximum value at *x* = 1/*e*, the function

$$h\_\*(x) \equiv \begin{cases} h(x) & \text{if } x < e^{-1}, \\ 2h(e^{-1}) - h(x) & \text{otherwise}, \end{cases} \tag{A5}$$

is continuous and increases monotonically for *x* ∈ [0, 1], with |*h*∗′(*x*)| = |*h*′(*x*)|. This means that if we define the non-negative monotonic function

$$h\_+(w) \equiv h\_\*[p\_1(w)] - h\_\*[p\_2(w)],$$

it changes at least as fast as either of its terms, so that for any *<sup>w</sup>*1, *<sup>w</sup>*<sup>2</sup> <sup>∈</sup> [0, 1], we have

$$\begin{aligned} |h\left[p\_i(w\_2)\right] - h\left[p\_i(w\_1)\right]| &\leq & |h\_\*\left[p\_i(w\_2)\right] - h\_\*\left[p\_i(w\_1)\right]|\\ &\leq & |h\_+(w\_2) - h\_+(w\_1)|. \end{aligned} \tag{A6}$$

We will exploit this bound to limit how much *h*[*pi*(*w*)] can vary within a bin. Since *h*+(0) ≥ 0 and *h*+(1) ≤ 2*h*∗(1) = 4/(*e* ln 2) ≈ 2.12 < 3, we pick *N* − 1 bin boundaries *bj* implicitly defined by

$$h\_+(b\_j) = h\_+(0) + [h\_+(1) - h\_+(0)]\frac{j}{N} \tag{A7}$$

for some integer *N* ≫ 1. Using Equation (A6), this implies that

$$\left|h\left[\bar{p}\_i(w)\right] - h\left[p\_i(w)\right]\right| \le \frac{h\_+(1) - h\_+(0)}{N} < \frac{3}{N}.\tag{A8}$$

The mutual information between two variables is given by *I*(*Y*, *U*) = *H*(*Y*) − *H*(*Y*|*U*), where the second term (the conditional entropy) is given by the following expressions in the cases that we need:

$$H(Y|Z) = -\sum\_{i=1}^{2} \sum\_{j=1}^{N} P\_{ij} \log \frac{P\_{ij}}{P\_j^Z}, \tag{A9}$$

$$H(Y|W) \quad = \quad -\sum\_{i=1}^{2} \int\_{0}^{1} p\_i(w) \log p\_i(w) dw. \tag{A10}$$

The information loss caused by our binning is therefore

$$\begin{split} \Delta I &= I(W,Y) - I(Z,Y) = H(Y|Z) - H(Y|W) \\ &= -\sum\_{i=1}^{2} \left( \sum\_{j=1}^{N} P\_{ij} \log \frac{P\_{ij}}{P\_j^Z} + \int\_{0}^{1} h\left[p\_i(w)\right] dw \right) \\ &= \sum\_{i=1}^{2} \int\_{0}^{1} \left( h\left[\bar{p}\_i(w)\right] - h\left[p\_i(w)\right] \right) dw \\ &\leq \sum\_{i=1}^{2} \int\_{0}^{1} \left| h\left[\bar{p}\_i(w)\right] - h\left[p\_i(w)\right] \right| dw \\ &< \sum\_{i=1}^{2} \int\_{0}^{1} \frac{3}{N}\, dw = \frac{6}{N}, \end{split} \tag{A11}$$

where we used Equation (A4) to obtain the third row and Equation (A8) to obtain the last row. This means that however small an information loss tolerance *ε* we want, we can guarantee Δ*I* < *ε* by choosing *N* > 6/*ε* bins placed according to Equation (A7), which completes the proof.

Note that the proof still holds if the function *pi*(*w*) is not monotonic, as long as the number of times *M* that it changes direction is finite: in that case, we can simply repeat the above-mentioned binning procedure separately in the *M* + 1 intervals where *pi*(*w*) is monotonic, using *N* > 6/*ε* bins in each interval, i.e., a total of more than 6(*M* + 1)/*ε* bins.
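As a numerical illustration of Theorem A1 (our sketch, with the concrete choice *p*1(*w*) = *w* and equal-width bins, which are suboptimal but suffice to show the trend), the information loss Δ*I* indeed shrinks as the number of bins *N* grows:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# W uniform on [0, 1] with p1(w) = P(Y=1|W=w) = w, so H(Y) = 1 bit and
# I(W,Y) = 1 - integral of h2(w) dw = 1 - 1/(2 ln 2) ≈ 0.2787 bits.
I_WY = 1.0 - 1.0 / (2.0 * math.log(2.0))

deltas = []
for N in [2, 4, 8, 16, 32]:
    # N equal-width bins; within bin j the average of p1(w) = w is the bin midpoint,
    # so H(Y|Z) = (1/N) * sum of h2(midpoint_j), and Delta I = H(Y|Z) - H(Y|W).
    H_Y_given_Z = sum(h2((j + 0.5) / N) for j in range(N)) / N
    deltas.append(I_WY - (1.0 - H_Y_given_Z))
    print(N, round(deltas[-1], 5))  # the information loss decreases monotonically in N
```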

#### **Appendix B. More Varying Conditional Probability Boosts Mutual Information**

Mutual information is, loosely speaking, a measure of how far a probability distribution *Pij* is from being separable, i.e., a product of its two marginal distributions. (Specifically, the mutual information is the Kullback–Leibler divergence between the bivariate probability distribution and the product of its marginals.) If all conditional probabilities for one variable *Y* given the other variable *Z* are identical, then the distribution is separable and the mutual information *I*(*Z*,*Y*) vanishes, so one may intuitively expect that making conditional probabilities more different from each other will increase *I*(*Z*,*Y*). The following theorem formalizes this intuition in a way that enables Theorem 2.

**Theorem A2.** *Consider two discrete random variables Z* ∈ {1, ..., *n*} *and Y* ∈ {1, 2}*, and define Pi* ≡ *P*(*Z* = *i*)*, pi* ≡ *P*(*Y* = 1|*Z* = *i*)*, so that the joint probability distribution Pij* ≡ *P*(*Z* = *i*, *Y* = *j*) *is given by Pi*1 = *Pi pi*, *Pi*2 = *Pi*(1 − *pi*)*. If two conditional probabilities pk and pl differ, then we increase the mutual information I*(*Y*, *Z*) *if we bring them further apart by adjusting Pkj and Plj in such a way that both marginal distributions remain unchanged.*

**Proof.** The only such change that keeps the marginal distributions for both *Z* and *Y* unchanged takes the form

$$
\begin{pmatrix}
P\_1 p\_1 & \cdots & P\_k p\_k - \varepsilon & \cdots & P\_l p\_l + \varepsilon & \cdots \\
P\_1 (1 - p\_1) & \cdots & P\_k (1 - p\_k) + \varepsilon & \cdots & P\_l (1 - p\_l) - \varepsilon & \cdots \\
\end{pmatrix}
$$

where the parameter *ε* must be kept small enough for all probabilities to remain non-negative. Without loss of generality, we can assume that *pk* < *pl*, so that we make the conditional probabilities

$$P(Y=1|Z=k) = \frac{P\_{k1}}{P\_k} = p\_k - \varepsilon/P\_k, \tag{A12}$$

$$P(Y=1|Z=l) = \frac{P\_{l1}}{P\_l} = p\_l + \varepsilon / P\_l \tag{A13}$$

more different when increasing *ε* from zero. Computing and differentiating the mutual information with respect to *ε*, most terms cancel, and we find that

$$\left. \frac{\partial I(Z,Y)}{\partial \varepsilon} \right|\_{\varepsilon=0} = \log \left[ \frac{1/p\_k - 1}{1/p\_l - 1} \right] > 0, \tag{A14}$$

which means that adjusting the probabilities with a sufficiently tiny *ε* > 0 will increase the mutual information, completing the proof.
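Equation (A14) can also be double-checked numerically (our sketch): a central difference of *I*(*Z*,*Y*) at *ε* = 0 matches the closed form:

```python
import math

def mi(P):
    """Mutual information in bits of a joint distribution P[i][j]."""
    row = [sum(r) for r in P]
    col = [sum(c) for c in zip(*P)]
    return sum(p * math.log2(p / (row[i] * col[j]))
               for i, r in enumerate(P) for j, p in enumerate(r) if p > 0)

def joint(eps, Pk=0.5, Pl=0.5, pk=0.3, pl=0.7):
    """Joint P(Z=i, Y=j) for n = 2, perturbed as in the proof of Theorem A2;
    both marginal distributions are independent of eps."""
    return [[Pk * pk - eps, Pk * (1 - pk) + eps],
            [Pl * pl + eps, Pl * (1 - pl) - eps]]

eps = 1e-6
numeric = (mi(joint(eps)) - mi(joint(-eps))) / (2 * eps)  # central difference at eps = 0
exact = math.log2((1 / 0.3 - 1) / (1 / 0.7 - 1))          # Equation (A14)
print(numeric, exact)  # both ≈ 2.4448
```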

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **The Convex Information Bottleneck Lagrangian**

#### **Borja Rodríguez Gálvez \*,†, Ragnar Thobaben \*,† and Mikael Skoglund \*,†**

Department of Intelligent Systems, Division of Information Science and Engineering (ISE), KTH Royal Institute of Technology, 11428 Stockholm, Sweden

**\*** Correspondence: borjarg@kth.se (B.R.G.); ragnart@kth.se (R.T.); skoglund@kth.se (M.S.)

† Current address: Malvinas väg 10, 100 44 Stockholm, Sweden

Received: 9 December 2019; Accepted: 8 January 2020; Published: 14 January 2020

**Abstract:** The information bottleneck (IB) problem tackles the issue of obtaining relevant compressed representations *T* of some random variable *X* for the task of predicting *Y*. It is defined as a constrained optimization problem that maximizes the information the representation has about the task, *I*(*T*;*Y*), while ensuring that a certain level of compression *r* is achieved (i.e., *I*(*X*; *T*) ≤ *r*). For practical reasons, the problem is usually solved by maximizing the IB Lagrangian (i.e., L<sub>IB</sub>(*T*; *β*) = *I*(*T*;*Y*) − *βI*(*X*; *T*)) for many values of *β* ∈ [0, 1]. Then, the curve of maximal *I*(*T*;*Y*) for a given *I*(*X*; *T*) is drawn, and a representation with the desired predictability and compression is selected. It is known that when *Y* is a deterministic function of *X*, the IB curve cannot be explored in this way, and another Lagrangian has been proposed to tackle this problem: the squared IB Lagrangian, L<sub>sq-IB</sub>(*T*; *β*<sub>sq</sub>) = *I*(*T*;*Y*) − *β*<sub>sq</sub>*I*(*X*; *T*)<sup>2</sup>. In this paper, we (i) present a general family of Lagrangians which allow for the exploration of the IB curve in all scenarios; (ii) provide the exact one-to-one mapping between the Lagrange multiplier and the desired compression rate *r* for known IB curve shapes; and (iii) show we can approximately obtain a specific compression level with the convex IB Lagrangian for both known and unknown IB curve shapes. This eliminates the burden of solving the optimization problem for many values of the Lagrange multiplier. That is, we prove that we can solve the original constrained problem with a single optimization.

**Keywords:** information bottleneck; representation learning; mutual information; optimization

#### **1. Introduction**

Let *X* ∈ X and *Y* ∈ Y be two statistically dependent random variables with joint distribution *p*(*X*,*Y*). The information bottleneck (IB) [1] investigates the problem of extracting the relevant information from *X* for the task of predicting *Y*.

For this purpose, the IB defines a bottleneck variable *T* ∈ T obeying the Markov chain *Y* ↔ *X* ↔ *T*, so that *T* acts as a representation of *X*. Tishby et al. [1] define the relevant information as the information the representation keeps from *Y* after the compression of *X* (i.e., *I*(*T*; *Y*)), provided a certain level of compression is achieved (i.e., *I*(*X*; *T*) ≤ *r*). Therefore, we select the representation which yields the value of the IB curve that best fits our requirements.

**Definition 1 (IB Functional).** *Let X and Y be statistically dependent variables. Let* Δ *be the set of random variables T obeying the Markov condition Y* ↔ *X* ↔ *T. Then the IB functional is*

$$F\_{\text{IB,max}}(r) = \max\_{T \in \Delta} \left\{ I(T; Y) \right\} \text{ s.t. } I(X; T) \le r, \quad \forall r \in [0, \infty). \tag{1}$$

**Definition 2 (IB Curve).** *The IB curve is the set of points defined by the solutions of F*<sub>IB,max</sub>(*r*) *for varying values of r* ∈ [0, ∞)*.*

**Definition 3 (Information Plane).** *The information plane is the plane defined by the axes I*(*T*; *Y*) *and I*(*X*; *T*)*.*

This method has been successfully applied to problems from a variety of domains. For example:


Furthermore, it has been employed as a tool for development or explanation in other disciplines like reinforcement learning [12–14], attribution methods [15], natural language processing [16], linguistics [17] or neuroscience [18]. Moreover, it has connections with other problems such as source coding with side information (or the Wyner-Ahlswede-Körner (WAK) problem), the rate-distortion problem or the cost-capacity problem (see Sections 3, 6 and 7 from [19]).

In practice, solving a constrained optimization problem such as the IB functional is challenging. Thus, in order to avoid the non-linear constraints from the IB functional, the IB Lagrangian is defined.

**Definition 4 (IB Lagrangian).** *Let X and Y be statistically dependent variables. Let* Δ *be the set of random variables T obeying the Markov condition Y* ↔ *X* ↔ *T. Then we define the IB Lagrangian as*

$$\mathcal{L}\_{\text{IB}}(T; \beta) = I(T; Y) - \beta I(X; T). \tag{2}$$

Here *β* ∈ [0, 1] is the Lagrange multiplier which controls the trade-off between the information of *Y* retained and the compression of *X*. Note we consider *β* ∈ [0, 1] because (i) for *β* ≤ 0 many uncompressed solutions such as *T* = *X* maximize L<sub>IB</sub>(*T*; *β*), and (ii) for *β* ≥ 1 the IB Lagrangian is non-positive due to the data processing inequality (DPI) (Theorem 2.8.1 of Cover and Thomas [20]), and trivial solutions like *T* = const are maximizers with L<sub>IB</sub>(*T*; *β*) = 0 [21].

We know the solutions of the IB Lagrangian optimization (if existent) are solutions of the IB functional by Lagrange's sufficiency theorem (Theorem 5 in Appendix A of Courcoubetis [22]). Moreover, since the IB functional is concave (Lemma 5 of Gilad-Bachrach et al. [19]), we know they exist (Theorem 6 in Appendix A of Courcoubetis [22]).

Therefore, the problem is usually solved by maximizing the IB Lagrangian with adaptations of the Blahut-Arimoto algorithm [1], deterministic annealing approaches [23], or a bottom-up greedy agglomerative clustering [6] or its improved sequential counterpart [24]. However, when provided with high-dimensional random variables *X* such as images, these algorithms do not scale well, and deep learning-based techniques, where the IB Lagrangian is used as the objective function, prevail [2,25,26].
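To make the self-consistent-equation approach concrete, the following is a minimal numpy sketch (not the authors' code) of Blahut-Arimoto-style IB iterations on an invented discrete joint distribution. Note that the classical iterations are written for the formulation that minimizes *I*(*X*; *T*) − *β̃I*(*T*; *Y*), so the multiplier `beta_tilde` below plays the role of 1/*β* in Equation (2); the joint `p_xy` is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution p(x, y): 4 inputs (rows), 2 labels (columns).
p_xy = np.array([[0.30, 0.05],
                 [0.25, 0.05],
                 [0.05, 0.15],
                 [0.05, 0.10]])
p_x = p_xy.sum(axis=1)                    # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]         # conditional p(y|x)

def mutual_info(p_joint):
    """Mutual information between the two axes of a joint distribution (nats)."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return (p_joint[mask] * np.log(p_joint[mask] / (pa * pb)[mask])).sum()

def iterative_ib(beta_tilde, n_t=2, iters=300):
    """Self-consistent IB iterations for the encoder q(t|x)."""
    q_t_given_x = rng.random((4, n_t))
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    for _ in range(iters):
        q_t = p_x @ q_t_given_x                                  # q(t)
        # q(y|t) = sum_x p(x) q(t|x) p(y|x) / q(t)
        q_y_given_t = (q_t_given_x * p_x[:, None]).T @ p_y_given_x
        q_y_given_t /= q_t[:, None]
        # KL(p(y|x) || q(y|t)) for every (x, t) pair
        kl = (p_y_given_x[:, None, :]
              * np.log(p_y_given_x[:, None, :] / q_y_given_t[None, :, :])).sum(-1)
        q_t_given_x = q_t[None, :] * np.exp(-beta_tilde * kl)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x

q_tx = iterative_ib(beta_tilde=10.0)
p_xt = p_x[:, None] * q_tx                       # joint p(x, t)
p_ty = (q_tx * p_x[:, None]).T @ p_y_given_x     # joint p(t, y)
I_xt, I_ty, I_xy = mutual_info(p_xt), mutual_info(p_ty), mutual_info(p_xy)
print(I_xt, I_ty, I_xy)
```

By the DPI, any encoder produced this way satisfies *I*(*T*; *Y*) ≤ *I*(*X*; *T*) and *I*(*T*; *Y*) ≤ *I*(*X*; *Y*), which the computed values respect.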

Note the IB Lagrangian optimization yields a representation *T* with a given performance (*I*(*X*; *T*), *I*(*T*; *Y*)) for a given *β*. However, there is no one-to-one mapping between *β* and *I*(*X*; *T*). Hence, we cannot directly optimize for a desired compression level *r*; instead, we need to perform several optimizations for different values of *β* and select the representation with the desired performance (e.g., [2]). The Lagrange multiplier selection is important since (i) sometimes even choices of *β* < 1 lead to trivial representations such that *p*<sub>*T*|*X*</sub> = *p*<sub>*T*</sub>, and (ii) there exist some discontinuities in the performance level w.r.t. the values of *β* [27].

Moreover, Kolchinsky et al. [21] recently showed that in deterministic scenarios (such as many classification problems, where an input *x*<sub>*i*</sub> belongs to a single particular class *y*<sub>*i*</sub>) the IB Lagrangian cannot explore the IB curve. Particularly, they showed that multiple *β* yielded the same performance level and that a single value of *β* could result in different performance levels. To solve this issue, they introduced the squared IB Lagrangian, L<sub>sq-IB</sub>(*T*; *β*<sub>sq</sub>) = *I*(*T*; *Y*) − *β*<sub>sq</sub>*I*(*X*; *T*)², which is able to explore the IB curve in any scenario by optimizing for different values of *β*<sub>sq</sub>. However, even though they realized that a one-to-one mapping between *β*<sub>sq</sub> and the compression level existed, they did not find such a mapping. Hence, multiple optimizations of the Lagrangian were still required to find the best trade-off solution.

The main contributions of this article are:

- We present a general family of Lagrangians, the convex IB Lagrangians, which allow for the exploration of the IB curve in all scenarios.
- We provide the exact one-to-one mapping between the Lagrange multiplier and the desired compression rate *r* for known IB curve shapes.
- We show that we can approximately obtain a specific compression level with the convex IB Lagrangian for both known and unknown IB curve shapes.

Furthermore, we provide some insight into why there are discontinuities in the performance levels w.r.t. the values of the Lagrange multipliers. In a classification setting, we connect these discontinuities with the intrinsic clusterization of the representations when optimizing the IB objective.

The structure of the article is the following: in Section 2 we motivate the usage of the IB in supervised learning settings. Then, in Section 3 we outline the relevant known results about the IB curve in deterministic scenarios. Later, in Section 4 we introduce the convex IB Lagrangian and explain some of its properties, like the bijective mapping between its Lagrange multipliers and the compression level, and the range of such multipliers. After that, we support our (proven) claims with empirical evidence on the MNIST [28] and TREC-6 [29] datasets in Section 5. Finally, in Section 6 we discuss our claims and empirical results. A PyTorch [30] implementation of the article can be found at https://github.com/burklight/convex-IB-Lagrangian-PyTorch.

In Appendices A–F we provide the proofs of the theoretical results. Then, in Appendix G we show some alternative families of Lagrangians with similar properties. Later, in Appendix H we provide the precise experimental setup details required to reproduce the results of the paper, along with further experimentation on different datasets and neural network architectures. To conclude, in Appendix I we give some guidelines on how to set the convex information bottleneck Lagrangians for practical problems.

#### **2. The IB in Supervised Learning**

In this section, we will first give an overview of supervised learning in order to later motivate the usage of the information bottleneck in this setting.

#### *2.1. Supervised Learning Overview*

In supervised learning we are given a dataset 𝒟<sub>*n*</sub> = {(*x*<sub>*i*</sub>, *y*<sub>*i*</sub>)}<sup>*n*</sup><sub>*i*=1</sub> of *n* pairs of input features and task outputs. In this case, *X* and *Y* are the random variables of the input features and the task outputs, and we assume *x*<sub>*i*</sub> and *y*<sub>*i*</sub> are sampled i.i.d. from the true distribution *p*<sub>(*X*,*Y*)</sub> = *p*<sub>*Y*|*X*</sub>*p*<sub>*X*</sub>. The usual aim of supervised learning is to use the dataset 𝒟<sub>*n*</sub> to learn a particular conditional distribution *q*<sub>*Ŷ*|*X*</sub> of the task outputs given the input features, parametrized by *θ*, which is a good approximation of *p*<sub>*Y*|*X*</sub>. We use *Ŷ* and *ŷ* to indicate the predicted task output random variable and its outcome. We call a supervised learning task regression when *Y* is continuous-valued and classification when it is discrete.

Usually, supervised learning methods employ intermediate representations of the inputs before making predictions about the outputs; e.g., hidden layers in neural networks (Chapter 5 of Bishop [31]) or transformations in a feature space through the kernel trick in kernel machines like SVMs or RVMs (Sections 7.1 and 7.2 of Bishop [31]). Let *T* be a possibly stochastic function of the input features *X* with a parametrized conditional distribution *q*<sub>*T*|*X*</sub>; then *T* obeys the Markov condition *Y* ↔ *X* ↔ *T*. The mapping from the representation to the predicted task outputs is defined by the parametrized conditional distribution *q*<sub>*Ŷ*|*T*</sub>. Therefore, in representation-based machine learning methods, the full Markov chain is *Y* ↔ *X* ↔ *T* ↔ *Ŷ*. Hence, the overall estimation of the conditional probability *p*<sub>*Y*|*X*</sub> is given by marginalization over the representations; i.e., *q*<sub>*Ŷ*|*X*</sub> = 𝔼<sub>*t*∼*q*<sub>*T*|*X*</sub></sub>[*q*<sub>*Ŷ*|*T*=*t*</sub>]. (The notation *q*<sub>*Ŷ*|*T*=*t*</sub> represents the probability distribution *q*<sub>*Ŷ*|*T*</sub>(·|*t*; *θ*). For the rest of the text, we will use the same notation to represent conditional probability distributions where the conditioning argument is given.)

In order to achieve a good estimation of the conditional probability distribution *p*<sub>*Y*|*X*</sub>, we usually define an instantaneous cost function j : 𝒳 × 𝒴 → ℝ. The value of this function, j(*x*, *y*; *θ*), serves as a heuristic measure of the loss our algorithm, parametrized by *θ*, incurs when trying to predict the realization of the task output *y* from the input realization *x*.

Naturally, we are interested in minimizing the expectation of the instantaneous cost function over all possible input features and task outputs, which we call the cost function. However, since we only have access to a finite dataset 𝒟<sub>*n*</sub>, we instead minimize the empirical cost function.

**Definition 5 (Cost Function and Empirical Cost Function).** *Let X and Y be the input features and task output random variables and x* ∈ 𝒳 *and y* ∈ 𝒴 *their realizations. Let also* j *be the instantaneous cost function, θ the parametrization of our learning algorithm, and* 𝒟<sub>*n*</sub> = {(*x*<sub>*i*</sub>, *y*<sub>*i*</sub>)}<sup>*n*</sup><sub>*i*=1</sub> *the given dataset. Then, we define:*


$$J(p\_{(X,Y)}; \theta) = \mathbb{E}\_{(x,y) \sim p\_{(X,Y)}}\left[\mathrm{j}(x, y; \theta)\right], \tag{3}$$

$$\hat{J}(\mathcal{D}\_n; \theta) = \frac{1}{n} \sum\_{i=1}^{n} \mathrm{j}(x\_i, y\_i; \theta). \tag{4}$$
The discrepancy between the cost and empirical cost functions is called the generalization gap or generalization error (see Section 1 of Xu and Raginsky [32], for instance). Intuitively, the smaller this gap is, the better our model generalizes; i.e., the better it will perform on new, unseen samples in terms of our cost function.

**Definition 6 (Generalization Gap).** *Let J*(*p*<sub>(*X*,*Y*)</sub>; *θ*) *and Ĵ*(𝒟<sub>*n*</sub>; *θ*) *be the cost and the empirical cost functions as defined in Definition 5. Then, the generalization gap is defined as*

$$\text{gen}(\mathcal{D}\_n; \theta) = J(p\_{(X, Y)}; \theta) - \hat{J}(\mathcal{D}\_n; \theta), \tag{5}$$

*and it represents the error incurred when the selected distribution is the one parametrized by θ and the rule Ĵ*(𝒟<sub>*n*</sub>; *θ*) *is used instead of J*(*p*<sub>(*X*,*Y*)</sub>; *θ*) *as the function to minimize.*

Ideally, we would want to minimize the cost function. Hence, we usually try to minimize the empirical cost function and the generalization gap simultaneously. The modifications to our learning algorithm which intend to reduce the generalization gap but not hurt the performance on the empirical cost function are known as regularization.
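The gap between the cost and empirical cost functions can be made concrete with a toy example. The following sketch (not from the article; the Bernoulli task and the parameter `theta` are invented for illustration) computes the exact cost *J*, the empirical cost *Ĵ* on a finite sample, and their difference, the generalization gap of Equation (5), using the log loss as instantaneous cost:

```python
import math
import random

random.seed(0)

# Hypothetical toy task: Y ~ Bernoulli(p), and our model predicts q(Y=1) = theta.
p, theta = 0.7, 0.6

def j_logloss(y, q):
    """Instantaneous cost j(x, y; theta): here, the log loss of outcome y."""
    return -math.log(q if y == 1 else 1.0 - q)

# Cost function J: exact expectation under the true distribution.
J = p * j_logloss(1, theta) + (1 - p) * j_logloss(0, theta)

# Empirical cost on a finite dataset D_n of n i.i.d. samples.
n = 100
dataset = [1 if random.random() < p else 0 for _ in range(n)]
J_hat = sum(j_logloss(y, theta) for y in dataset) / n

gen_gap = J - J_hat   # generalization gap, Equation (5)
print(J, J_hat, gen_gap)
```

As *n* grows, *Ĵ* concentrates around *J* and the gap shrinks; regularization aims at keeping this gap small without hurting *Ĵ* itself.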

#### *2.2. Why Do We Use the IB?*

**Definition 7 (Representation cross-entropy cost function).** *Let X and Y be two statistically dependent variables with joint distribution p*<sub>(*X*,*Y*)</sub> = *p*<sub>*Y*|*X*</sub>*p*<sub>*X*</sub>*. Let also T be a random variable obeying the Markov condition Y* ↔ *X* ↔ *T, and let q*<sub>*T*|*X*</sub> *and q*<sub>*Ŷ*|*T*</sub> *be the encoding and decoding distributions of our model, parametrized by θ. Finally, let* C(*p*<sub>*Z*</sub>||*q*<sub>*Z*</sub>) = −𝔼<sub>*z*∼*p*<sub>*Z*</sub></sub>[log(*q*<sub>*Z*</sub>(*z*))] *be the cross entropy between two probability distributions p*<sub>*Z*</sub> *and q*<sub>*Z*</sub>*. Then, the cross-entropy cost function is*

$$J\_{\mathrm{CE}}(p\_{(X,Y)};\theta) = \mathbb{E}\_{(x,t)\sim q\_{T|X}p\_{X}}\left[\mathrm{C}(q\_{Y|T=t}||q\_{\hat{Y}|T=t})\right] = \mathbb{E}\_{(x,y)\sim p\_{(X,Y)}}\left[\mathrm{j}\_{\mathrm{CE}}(x,y;\theta)\right],\tag{6}$$

*where* j<sub>CE</sub>(*x*, *y*; *θ*) = −𝔼<sub>*t*∼*q*<sub>*T*|*X*=*x*</sub></sub>[log(*q*<sub>*Ŷ*|*T*=*t*</sub>(*y*|*t*; *θ*))] *is the instantaneous representation cross-entropy cost function, and q*<sub>*Y*|*T*</sub> = 𝔼<sub>*x*∼*p*<sub>*X*</sub></sub>[*p*<sub>*Y*|*X*=*x*</sub>*q*<sub>*T*|*X*=*x*</sub>/*q*<sub>*T*</sub>] *and q*<sub>*T*</sub> = 𝔼<sub>*x*∼*p*<sub>*X*</sub></sub>[*q*<sub>*T*|*X*=*x*</sub>]*.*

The cross-entropy is a widely used cost function in classification tasks (e.g., Teahan [8], Krizhevsky et al. [33], Shore and Gray [34]) which has many interesting properties [35]. Moreover, it is known that minimizing *J*<sub>CE</sub>(*p*<sub>(*X*,*Y*)</sub>; *θ*) maximizes the mutual information *I*(*T*; *Y*). That is:

**Proposition 1 (Minimizing the Cross Entropy Maximizes the Mutual Information).** *Let J*<sub>CE</sub>(*p*<sub>(*X*,*Y*)</sub>; *θ*) *be the representation cross-entropy cost function as defined in Definition 7. Let also I*(*T*; *Y*) *be the mutual information between the random variables T and Y in the setting of Definition 7. Then, minimizing J*<sub>CE</sub>(*p*<sub>(*X*,*Y*)</sub>; *θ*) *implies maximizing I*(*T*; *Y*)*.*

The proof of this proposition can be found in Appendix A.
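The intuition behind Proposition 1 is the decomposition *J*<sub>CE</sub> = *H*(*Y*|*T*) + 𝔼<sub>*t*</sub>[KL(*q*<sub>*Y*|*T*=*t*</sub>||*q*<sub>*Ŷ*|*T*=*t*</sub>)] ≥ *H*(*Y*|*T*) = *H*(*Y*) − *I*(*T*; *Y*), so driving the cross-entropy down drives *I*(*T*; *Y*) up. The following sketch checks this decomposition numerically on small invented distributions (the numbers have no connection to the article's experiments):

```python
import numpy as np

# Invented toy setting: T with 2 states, Y with 2 classes.
q_t = np.array([0.5, 0.5])                     # q(t)
q_y_given_t = np.array([[0.9, 0.1],            # q(y|t): decoder induced by the encoder
                        [0.2, 0.8]])
q_yhat_given_t = np.array([[0.8, 0.2],         # model decoder q(yhat|t)
                           [0.3, 0.7]])

# Cross-entropy cost: J_CE = E_t[ C(q(y|t) || q(yhat|t)) ], in nats.
J_CE = -(q_t[:, None] * q_y_given_t * np.log(q_yhat_given_t)).sum()

# Decomposition terms: H(Y|T) and E_t[KL(q(y|t) || q(yhat|t))].
H_y_given_t = -(q_t[:, None] * q_y_given_t * np.log(q_y_given_t)).sum()
kl = (q_t[:, None] * q_y_given_t * np.log(q_y_given_t / q_yhat_given_t)).sum()

# Since H(Y|T) = H(Y) - I(T;Y), minimizing J_CE pushes I(T;Y) up.
q_y = q_t @ q_y_given_t
H_y = -(q_y * np.log(q_y)).sum()
I_ty = H_y - H_y_given_t
print(J_CE, H_y_given_t + kl, I_ty)
```

The printed first two values coincide (the decomposition holds exactly), and *J*<sub>CE</sub> upper-bounds *H*(*Y*) − *I*(*T*; *Y*).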

**Definition 8 (Nuisance).** *A nuisance is any random variable that affects the observed data X but is not informative for the task we are trying to solve. That is,* Ξ *is a nuisance for Y if Y* ⊥ Ξ*, or I*(Ξ; *Y*) = 0*.*

Similarly, we know that minimizing *I*(*X*; *T*) minimizes the generalization gap for restricted classes when using the cross-entropy cost function (Theorem 1 of Vera et al. [36]), and when using *I*(*T*; *Y*) directly as an objective to maximize (Theorem 4 of Shamir et al. [37]). Furthermore, Achille and Soatto [38], in their Proposition 3.1, upper bound the information between the representations *T* and the nuisances Ξ that affect the observed data by *I*(*X*; *T*). Therefore, minimizing *I*(*X*; *T*) helps generalization by not keeping useless information about Ξ in our representations.

Thus, jointly maximizing *I*(*T*;*Y*) and minimizing *I*(*X*; *T*) is a good choice both in terms of performance in the available dataset and in new, unseen data, which motivates studies on the IB.

#### **3. The Information Bottleneck in Deterministic Scenarios**

Kolchinsky et al. [21] showed that when *Y* is a deterministic function of *X* (i.e., *Y* = *f*(*X*)), the IB curve is piecewise linear. More precisely, it is shaped as stated in Proposition 2.

**Proposition 2 (The IB Curve is Piecewise Linear in Deterministic Scenarios).** *Let X be a random variable and Y* = *f*(*X*) *be a deterministic function of X. Let also T be the bottleneck variable that solves the IB functional. Then the IB curve in the information plane is defined by the following equation:*

$$\begin{cases} I(T;Y) = I(X;T) & \text{if } I(X;T) \in [0, I(X;Y)) \\ I(T;Y) = I(X;Y) & \text{if } I(X;T) \ge I(X;Y) \end{cases} \tag{7}$$
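Proposition 2 can be illustrated on an invented deterministic task (not from the article): with *X* uniform on {0, 1, 2, 3} and *Y* = *f*(*X*) = *X* mod 2, we have *H*(*Y*|*X*) = 0, hence *I*(*X*; *Y*) = *H*(*Y*) = 1 bit, and the curve of Equation (7) rises with slope 1 until it saturates at that value:

```python
import math
from collections import Counter

# Hypothetical deterministic task: X uniform on {0,1,2,3}, Y = f(X) = X mod 2.
p_x = {x: 0.25 for x in range(4)}
f = lambda x: x % 2

# For deterministic Y = f(X): I(X;Y) = H(Y), since H(Y|X) = 0.
p_y = Counter()
for x, px in p_x.items():
    p_y[f(x)] += px
H_y = -sum(p * math.log2(p) for p in p_y.values())   # in bits

# Proposition 2: the IB curve is piecewise linear.
def f_IB(r, I_xy):
    return min(r, I_xy)

curve = [(r / 2, f_IB(r / 2, H_y)) for r in range(5)]  # r = 0, 0.5, ..., 2 bits
print(H_y, curve)
```

Every point with *I*(*X*; *T*) < *I*(*X*; *Y*) lies on the slope-1 segment, which is exactly why the linear IB Lagrangian cannot single them out (Theorem 1 below).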

Furthermore, they showed that the IB curve could not be explored by optimizing the IB Lagrangian for multiple *β* because the curve was not strictly concave. That is, there was not a one-to-one relationship between *β* and the performance level.

**Theorem 1 (In Deterministic Scenarios, the IB Curve cannot be Explored Using the IB Lagrangian).** *Let X be a random variable and Y* = *f*(*X*) *be a deterministic function of X. Let also* Δ *be the set of random variables T obeying the Markov condition Y* ↔ *X* ↔ *T. Then:*


*Note we use the supremum in this case since for β* = 0 *we have that I*(*X*; *T*) *could be infinite, and then the search set from Equation (1), i.e.,* {*T* : *Y* ↔ *X* ↔ *T*} ∩ {*T* : *I*(*X*; *T*) < ∞}*, is not compact anymore.*

*3. Any solution T* ∈ Δ *such that I*(*X*; *T*) = *I*(*T*; *Y*) = *I*(*X*; *Y*) *solves* arg max<sub>*T*∈Δ</sub>{L<sub>IB</sub>(*T*; *β*)} *for all β* ∈ (0, 1)*. That is, many different β achieve the same compression and performance level.*

An alternative proof for this theorem can be found in Appendix B.

#### **4. The Convex IB Lagrangian**

#### *4.1. Exploring the IB Curve*

Clearly, a situation like the one depicted in Theorem 1 is not desirable, since we cannot aim for different levels of compression or performance. For this reason, we generalize the effort of Kolchinsky et al. [21] and look for families of Lagrangians which are able to explore the IB curve. Inspired by the squared IB Lagrangian, L<sub>sq-IB</sub>(*T*; *β*<sub>sq</sub>) = *I*(*T*; *Y*) − *β*<sub>sq</sub>*I*(*X*; *T*)², we look at the conditions a function of *I*(*X*; *T*) requires in order to be able to explore the IB curve. In this way, we find that any monotonically increasing and strictly convex function is able to do so, and we call the family of Lagrangians with these characteristics the convex IB Lagrangians, due to the nature of the introduced function.

**Theorem 2 (Convex IB Lagrangians).** *Let* Δ *be the set of r.v. T obeying the Markov condition Y* ↔ *X* ↔ *T. Then, if u is a monotonically increasing and strictly convex function, the IB curve can always be recovered by the solutions of* arg max*T*∈Δ{LIB,*u*(*T*; *<sup>β</sup>u*)}*, with*

$$\mathcal{L}\_{\text{IB},u}(T;\beta\_u) = I(T;Y) - \beta\_u \, u(I(X;T)). \tag{8}$$

*That is, for each point* (*I*(*X*; *T*), *I*(*T*; *Y*)) *s.t. dI*(*T*; *Y*)/*dI*(*X*; *T*) > 0 *there is a unique β*<sub>*u*</sub> *for which maximizing* L<sub>IB,*u*</sub>(*T*; *β*<sub>*u*</sub>) *achieves this solution. Furthermore, β*<sub>*u*</sub> *is strictly decreasing w.r.t. I*(*X*; *T*)*. We call* L<sub>IB,*u*</sub>(*T*; *β*<sub>*u*</sub>) *the convex IB Lagrangian.*

The proof of this theorem can be found in Appendix C. Furthermore, by exploiting the IB curve duality (Lemma 10 of Gilad-Bachrach et al. [19]) we were able to derive other families of Lagrangians which allow for the exploration of the IB curve (Appendix G).

**Remark 1.** *Clearly, we can see how if u is the identity function (i.e., u*(*I*(*X*; *T*)) = *I*(*X*; *T*)*) then we end up with the normal IB Lagrangian. However, since the identity function is not strictly convex, it cannot ensure the exploration of the IB curve.*

During the proof of this theorem we observed a relationship between the Lagrange multipliers and the solutions obtained with the normal IB Lagrangian L<sub>IB</sub>(*T*; *β*) and with the convex IB Lagrangian L<sub>IB,*u*</sub>(*T*; *β*<sub>*u*</sub>). This relationship is formalized in the following corollary.

**Corollary 1 (Connection between the IB Lagrangian and the Convex IB Lagrangian).** *Let* L<sub>IB</sub>(*T*; *β*) *be the IB Lagrangian and* L<sub>IB,*u*</sub>(*T*; *β*<sub>*u*</sub>) *the convex IB Lagrangian. Then, maximizing* L<sub>IB</sub>(*T*; *β*) *and* L<sub>IB,*u*</sub>(*T*; *β*<sub>*u*</sub>) *can obtain the same point in the IB curve if β*<sub>*u*</sub> = *β*/*u*′(*I*(*X*; *T*))*, where u*′ *is the derivative of u.*

This corollary allows us to better understand why the addition of *u* allows for the exploration of the IB curve in deterministic scenarios. If we note that for *β* = 1 we can obtain any point in the increasing region of the curve, then we clearly see how evaluating *u*′ at different values of *I*(*X*; *T*) defines different values of *β*<sub>*u*</sub> that obtain such points. Moreover, it lets us see that if for *β* = 0 maximizing the IB Lagrangian can obtain any point with *I*(*T*; *Y*) = *I*(*X*; *Y*) and *I*(*X*; *T*) > *I*(*X*; *Y*), then the same happens for the convex IB Lagrangian.

#### *4.2. Aiming for a Specific Compression Level*

Let *B*<sub>*u*</sub> denote the domain of Lagrange multipliers *β*<sub>*u*</sub> for which we can find solutions in the IB curve with the convex IB Lagrangian. The convex IB Lagrangians not only allow us to explore the IB curve with different *β*<sub>*u*</sub>; they also allow us to identify the specific *β*<sub>*u*</sub> that obtains a given point (*I*(*X*; *T*), *I*(*T*; *Y*)), provided we know the IB curve in the information plane. Conversely, the convex IB Lagrangian allows finding the specific point (*I*(*X*; *T*), *I*(*T*; *Y*)) that is obtained by a given *β*<sub>*u*</sub>.

**Proposition 3 (Bijective Mapping between IB Curve Point and Convex IB Lagrange Multiplier).** *Let the IB curve in the information plane be known; i.e., I*(*T*; *Y*) = *f*<sub>IB</sub>(*I*(*X*; *T*)) *is known. Then there is a bijective mapping from Lagrange multipliers β*<sub>*u*</sub> ∈ *B*<sub>*u*</sub> \ {0} *of the convex IB Lagrangian to points in the IB curve* (*I*(*X*; *T*), *f*<sub>IB</sub>(*I*(*X*; *T*)))*. Furthermore, these mappings are:*

$$\beta\_{\rm u} = \frac{df\_{\rm IB}(I(X;T))}{dI(X;T)} \frac{1}{u'(I(X;T))} \quad \text{and} \quad I(X;T) = (u')^{-1} \left( \frac{df\_{\rm IB}(I(X;T))}{dI(X;T)} \frac{1}{\beta\_{\rm u}} \right), \tag{9}$$

*where u*′ *is the derivative of u and* (*u*′)<sup>−1</sup> *is the inverse of u*′*.*

This is especially interesting since in deterministic scenarios we know the shape of the IB curve (Proposition 2) and the convex IB Lagrangians allow for the exploration of the IB curve (Theorem 2). A proof of Proposition 3 can be found in Appendix D.
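In the deterministic case the increasing part of the curve has slope *df*<sub>IB</sub>/*dI*(*X*; *T*) = 1 (Proposition 2), so Equation (9) collapses to *β*<sub>*u*</sub> = 1/*u*′(*r*) and *r* = (*u*′)<sup>−1</sup>(1/*β*<sub>*u*</sub>). The following sketch (not the authors' code; the hyperparameter values *α* = *η* = 1 are chosen arbitrarily) checks this round trip for a power-type and an exponential-type *u*:

```python
import math

# Deterministic scenario, increasing region: beta_u = 1/u'(r), r = (u')^{-1}(1/beta_u).
alpha, eta = 1.0, 1.0   # hypothetical hyperparameters

# Power-type u(r) = r^(1+alpha):  u'(r) = (1+alpha) * r^alpha.
u_prime_pow = lambda r: (1 + alpha) * r ** alpha
u_prime_pow_inv = lambda v: (v / (1 + alpha)) ** (1 / alpha)

# Exponential-type u(r) = exp(eta*r):  u'(r) = eta * exp(eta*r).
u_prime_exp = lambda r: eta * math.exp(eta * r)
u_prime_exp_inv = lambda v: math.log(v / eta) / eta

errors = []
for r in [0.5, 1.0, 2.0]:
    for u_p, u_p_inv in [(u_prime_pow, u_prime_pow_inv),
                         (u_prime_exp, u_prime_exp_inv)]:
        beta_u = 1.0 / u_p(r)              # multiplier that targets compression r
        r_back = u_p_inv(1.0 / beta_u)     # compression level recovered from beta_u
        errors.append(abs(r_back - r))
print(max(errors))
```

Since *u*′ is strictly increasing for a strictly convex *u*, the map is invertible and the recovered compression level matches the targeted one exactly.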

**Remark 2.** *Note that the definition from Tishby et al. [1], β* = *df*<sub>IB</sub>(*I*(*X*; *T*))/*dI*(*X*; *T*)*, only allows for a bijection between β and I*(*X*; *T*) *if f*<sub>IB</sub> *is a strictly concave, known function, and we have seen this is not the case in deterministic scenarios (Theorem 1).*

A direct result of this proposition is that, if the shape of the IB curve is known, we know the domain of Lagrange multipliers *B*<sub>*u*</sub> which allows for its exploration. Furthermore, if the shape is not known, we can at least bound that range.

**Corollary 2 (Domain of Convex IB Lagrange Multiplier with Known IB Curve Shape).** *Let the IB curve in the information plane be I*(*T*; *Y*) = *f*<sub>IB</sub>(*I*(*X*; *T*)) *and let I*<sub>max</sub> = *I*(*X*; *Y*)*. Let also I*(*X*; *T*) = *r*<sub>max</sub> *be the minimum mutual information s.t. f*<sub>IB</sub>(*r*<sub>max</sub>) = *I*<sub>max</sub>*; i.e., r*<sub>max</sub> = arg inf<sub>*r*</sub>{*f*<sub>IB</sub>(*r*)} s.t. *f*<sub>IB</sub>(*r*) = *I*<sub>max</sub>*. Then, the range of Lagrange multipliers that allow the exploration of the IB curve with the convex IB Lagrangian is B*<sub>*u*</sub> = [*β*<sub>*u*,min</sub>, *β*<sub>*u*,max</sub>]*, with*

$$\beta\_{u,\min} = \lim\_{r \to r\_{\max}^{-}} \left\{ \frac{f\_{\text{IB}}'(r)}{u'(r)} \right\} \quad \text{and} \quad \beta\_{u,\max} = \lim\_{r \to 0^{+}} \left\{ \frac{f\_{\text{IB}}'(r)}{u'(r)} \right\},\tag{10}$$

*where f*′<sub>IB</sub>(*r*) *and u*′(*r*) *are the derivatives of f*<sub>IB</sub>(*I*(*X*; *T*)) *and u*(*I*(*X*; *T*)) *w.r.t. I*(*X*; *T*)*, evaluated at r, respectively. Also, note that there are some scenarios where r*<sub>max</sub> → ∞ *(see, e.g., [39]); in these scenarios β*<sub>*u*,min</sub> = lim<sub>*r*→∞</sub>{*f*′<sub>IB</sub>(*r*)/*u*′(*r*)} ≥ 0*.*
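For a known deterministic curve *f*<sub>IB</sub>(*r*) = min(*r*, *I*<sub>max</sub>) we have *f*′<sub>IB</sub>(*r*) = 1 for *r* < *r*<sub>max</sub> = *I*<sub>max</sub>, so Corollary 2 gives *B*<sub>*u*</sub> = [1/*u*′(*I*<sub>max</sub>), 1/*u*′(0⁺)]. The following sketch (illustrative only; *I*<sub>max</sub> is chosen as *H*(*Y*) of a balanced 10-class task, and *α* = *η* = 1 are arbitrary) evaluates these endpoints for exponential- and power-type *u*:

```python
import math

# Known deterministic curve: f_IB(r) = min(r, I_max), so f'_IB(r) = 1 below r_max.
I_max = math.log2(10)          # e.g., H(Y) for 10 balanced classes (MNIST-like)
eta, alpha = 1.0, 1.0          # hypothetical hyperparameters

# Exponential-type u(r) = exp(eta*r): u'(r) = eta * exp(eta*r).
b_min_exp = 1.0 / (eta * math.exp(eta * I_max))
b_max_exp = 1.0 / eta          # u'(0) = eta
print("exponential B_u:", (b_min_exp, b_max_exp))

# Power-type u(r) = r^(1+alpha): u'(r) = (1+alpha) * r^alpha.
b_min_pow = 1.0 / ((1 + alpha) * I_max ** alpha)
b_max_pow = math.inf           # u'(0+) = 0, so the range is unbounded above
print("power B_u:", (b_min_pow, b_max_pow))
```

The exponential choice yields a finite multiplier range, while the power choice does not; this difference is discussed again in Section 5.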

**Corollary 3 (Domain of Convex IB Lagrange Multiplier Bound).** *The range of Lagrange multipliers that allow the exploration of the IB curve is contained in* [0, *β*<sub>*u*,top</sub>]*, which is in turn contained in* [0, *β*<sup>+</sup><sub>*u*,top</sub>]*, where*

$$\beta\_{u,\text{top}} = \frac{(\inf\_{\Omega\_x \subset \mathcal{X}} \{\beta\_0(\Omega\_x)\})^{-1}}{\lim\_{r \to 0^+} \{u'(r)\}}, \quad \text{and} \quad \beta\_{u,\text{top}}^{+} = \frac{1}{\lim\_{r \to 0^+} \{u'(r)\}}, \tag{11}$$

*where u*′(*r*) *is the derivative of u*(*I*(*X*; *T*)) *w.r.t. I*(*X*; *T*) *evaluated at r,* 𝒳 *is the set of possible realizations of X, and β*<sub>0</sub> *and* Ω<sub>*x*</sub> *are defined as in [27]. (Note that [27] considers the dual problem (see Appendix G), so their β*<sup>−1</sup> *translates to β in this article.) That is, B*<sub>*u*</sub> ⊆ [0, *β*<sub>*u*,top</sub>] ⊆ [0, *β*<sup>+</sup><sub>*u*,top</sub>]*.*

Corollaries 2 and 3 allow us to reduce the range of the search for *β*<sub>*u*</sub> when we want to explore the IB curve. In practice, inf<sub>Ω<sub>*x*</sub>⊂𝒳</sub>{*β*<sub>0</sub>(Ω<sub>*x*</sub>)} might be difficult to calculate, so Wu et al. [27] derived an algorithm to approximate it. However, we still recommend setting the numerator to 1 for simplicity. The proofs of both corollaries are found in Appendices E and F.

#### **5. Experimental Support**

In order to showcase our claims we use the MNIST [28] and TREC-6 [29] datasets. We modify the nonlinear-IB method [26], which is a neural network that minimizes the cross-entropy while also minimizing a differentiable kernel-based estimate of *I*(*X*; *T*) [40]. We then use this technique to maximize a lower bound on the convex IB Lagrangians by applying the function *u* to the *I*(*X*; *T*) estimate.

The network structure is the following: first, a stochastic encoder *T* = *f*<sub>enc</sub>(*X*; *θ*) + *W* with *p*<sub>*W*</sub> = 𝒩(0, *I*<sub>*d*</sub>), such that *T* ∈ ℝ<sup>*d*</sup>, where *d* is the dimension of the bottleneck variable. (Note that the encoder needs to be stochastic to (i) ensure a finite and well-defined mutual information [21,41] and (ii) make gradient-based optimization methods over the IB Lagrangian useful [41].) Second, a deterministic decoder *q*<sub>*Ŷ*|*T*</sub> = *f*<sub>dec</sub>(*T*; *θ*). For the MNIST dataset both the encoder and the decoder are fully-connected networks, for a fair comparison with [26]. For the TREC-6 dataset, the encoder is a set of convolutions over word embeddings followed by a fully-connected network, and the decoder is also a fully-connected network. For further details about the experimental setup, additional results for different values of *α* and *η*, and supplementary experimental results for different datasets and network architectures, please refer to Appendix H.
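The additive-noise construction of the stochastic encoder can be sketched in a few lines of numpy (this is not the article's PyTorch code; the weights, layer sizes, and batch are hypothetical, and in the article *f*<sub>enc</sub> is learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2                                   # bottleneck dimension

def f_enc(x, W1, W2):
    """Toy deterministic encoder mean: a 2-layer fully-connected net."""
    return np.tanh(x @ W1) @ W2

# Hypothetical (untrained) weights, for illustration only.
W1 = 0.05 * rng.normal(size=(784, 128))
W2 = 0.05 * rng.normal(size=(128, d))

x = rng.random((32, 784))               # a batch of flattened MNIST-like inputs
noise = rng.normal(size=(32, d))        # W ~ N(0, I_d)
t = f_enc(x, W1, W2) + noise            # stochastic representation T = f_enc(X) + W
print(t.shape)
```

The injected Gaussian noise is what keeps *I*(*X*; *T*) finite and makes the kernel-based estimate (and its gradients) well behaved.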

In Figure 1 we show our results for two particularizations of the convex IB Lagrangians:

- the power IB Lagrangians: L<sub>pow-IB</sub>(*T*; *β*<sub>pow</sub>) = *I*(*T*; *Y*) − *β*<sub>pow</sub>*I*(*X*; *T*)<sup>1+*α*</sup>, with *α* > 0; and
- the exponential IB Lagrangians: L<sub>exp-IB</sub>(*T*; *β*<sub>exp</sub>) = *I*(*T*; *Y*) − *β*<sub>exp</sub> exp(*ηI*(*X*; *T*)), with *η* > 0.

**Figure 1.** The top row shows the results for the power information bottleneck (IB) Lagrangian with *α* = 1, and the bottom row for the exponential IB Lagrangian with *η* = 1, both on the MNIST dataset. In each row, from left to right, we show (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are the dashed orange line; (ii) *I*(*T*; *Y*) as a function of *β*<sub>*u*</sub>; and (iii) the compression *I*(*X*; *T*) as a function of *β*<sub>*u*</sub>. In all plots, the red crosses joined by a dotted line represent the values computed with the training set, the blue dots the values computed with the validation set, and the green stars the theoretical values dictated by Proposition 3. Moreover, in all plots, *I*(*X*; *Y*) = *H*(*Y*) = log<sub>2</sub>(10) is indicated by a dashed orange line. All values are shown in bits.

We can clearly see how both Lagrangians are able to explore the IB curve (first column of Figure 1) and how the theoretical performance trend of the Lagrangians matches the experimental results (second and third columns of Figure 1). There are small mismatches between the theoretical and experimental performance. This is because using the nonlinear-IB, as stated by Kolchinsky et al. [21], does not guarantee that we find optimal representations, due to factors like (i) inaccurate estimation of *I*(*X*; *T*), (ii) restrictions on the structure of *T*, (iii) use of an estimate of the decoder instead of the real one, and (iv) the typical non-convex optimization issues that arise with gradient-based methods. The main difference comes from the discontinuities in performance for increasing *β*, whose cause is still unknown (cf. Wu et al. [27]). It has been observed, however, that the bottleneck variable performs an intrinsic clusterization in classification tasks (see, for instance, [21,26,42] or Figure 2b). We observed how this clusterization matches the quantized performance levels (e.g., compare Figure 2a with the top center graph in Figure 1), with maximum performance when the number of clusters is equal to the cardinality of *Y* and decreasing performance as the number of clusters is reduced, which is in line with the concurrent work of Wu and Fischer [43]. We do not have a mathematical proof of the exact relationship between these two phenomena; however, we agree with Wu et al. [27] that it is an interesting matter and hope this observation serves as motivation to derive new theory.

**Figure 2.** Depiction of the clusterization behavior of the bottleneck variable for the power IB Lagrangian on the MNIST dataset with *α* = 1. (**a**) Number of clusters for different *β*<sub>pow</sub>. (**b**) Example of clusters for different *β*<sub>pow</sub>. The clusters were obtained using the DBSCAN algorithm [44,45].

In practice, there are different criteria for choosing the function *u*. For instance, the exponential IB Lagrangian can be more desirable than the power IB Lagrangian when we want to draw the IB curve, since it has a finite range of *β*<sub>*u*</sub>: *B*<sub>*u*</sub> = [(*η* exp(*ηI*<sub>max</sub>))<sup>−1</sup>, *η*<sup>−1</sup>] for the exponential IB Lagrangian vs. *B*<sub>*u*</sub> = [((1 + *α*)*I*<sup>*α*</sup><sub>max</sub>)<sup>−1</sup>, ∞) for the power IB Lagrangian. Furthermore, there is a trade-off between (i) how much the selected *u* function resembles a linear function in our region of interest (e.g., *α* or *η* close to zero), since it will then suffer from problems similar to those of the original IB Lagrangian; and (ii) how fast it grows in our region of interest (e.g., higher values of *α* or *η*), since it will then suffer from value convergence; i.e., optimizing for separate values of *β*<sub>*u*</sub> will achieve similar levels of performance (Figure 3). Please refer to Appendix I for a more thorough explanation of these two phenomena.

**Figure 3.** Example of value convergence with the exponential IB Lagrangian with *η* = 3. We show the intersection of the isolines of L<sub>IB,exp</sub>(*T*; *β*<sub>exp</sub>) for different *β*<sub>exp</sub> ∈ *B*<sub>exp</sub> ≈ [1.56 × 10<sup>−5</sup>, 3<sup>−1</sup>], using Corollary 2.

In particular, the value convergence phenomenon can be exploited to approximately obtain a particular level of compression *r*∗, both for known and unknown IB curves (see Appendix I or the example in Figure 4). For known IB curves, we also know the achieved predictability *I*(*T*;*Y*), since it is the same as the level of compression *I*(*X*; *T*). For this exploitation, we can employ the shifted version of the exponential IB Lagrangian (which is also a particular case of the convex IB Lagrangian):

• the shifted exponential IB Lagrangian, obtained by shifting the argument of the exponential by the desired compression level *r*∗.

**Figure 4.** Example of value convergence exploitation with the shifted exponential Lagrangian with *η* = 200. The top row is for the MNIST dataset, aiming for a compression level *r*<sup>∗</sup> = 2, and the bottom row for the TREC-6 dataset, aiming for a compression level *r*<sup>∗</sup> = 16. In each row, from left to right are shown (i) the information plane, where the region of possible solutions of the IB problem is shadowed in light orange and the information-theoretic limits are the dashed orange line; (ii) *I*(*T*;*Y*) as a function of *βu*; and (iii) the compression *I*(*X*; *T*) as a function of *βu*. In all plots, the red crosses joined by a dotted line represent the values computed with the training set, the blue dots the values computed with the validation set, and the green stars the theoretical values computed as dictated by Proposition 3. Moreover, in all plots, *H*(*Y*) is indicated with a dashed orange line. All values are shown in bits.

For this Lagrangian, the optimization procedure converges to representations with approximately the desired compression level *r*∗ if the hyperparameter *η* is set to a large value.

In Figure 4, we show the results of aiming for a compression level of *r*<sup>∗</sup> = 2 bits in the MNIST dataset and of *r*<sup>∗</sup> = 16 bits in the TREC-6 dataset, both with *η* = 200. We can see how different values of *β*sh-exp yield the same desired compression level, which makes this method robust to the selection of the Lagrange multiplier.

To sum up, in order to achieve a desired level of performance with the convex IB Lagrangian as an objective one should:
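In the deterministic case, where $f'_{\mathrm{IB}} \equiv 1$, Proposition 3 lets one pinpoint the multiplier for a desired compression level directly. The following sketch (our own illustrative helpers, for the exponential IB Lagrangian) shows the forward mapping and its inverse from Equation (A18):

```python
import math

def beta_u_for_target(r_target, eta):
    """Proposition 3 with f'_IB = 1 (deterministic case) and u(x) = exp(eta x):
    beta_u = f'_IB(r) / u'(r) = 1 / (eta * exp(eta * r))."""
    return 1.0 / (eta * math.exp(eta * r_target))

def compression_for_beta(beta_u, eta):
    """Inverse mapping, Equation (A18): r = (u')^{-1}(f'_IB / beta_u)."""
    return math.log(1.0 / (eta * beta_u)) / eta

# Round trip: the multiplier picked for r* = 2 bits recovers r* = 2.
beta = beta_u_for_target(2.0, eta=1.0)
assert math.isclose(compression_for_beta(beta, eta=1.0), 2.0)
```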


#### **6. Conclusions**

The information bottleneck is a widely used and studied technique. However, it is known that the IB Lagrangian cannot be used to achieve varying levels of performance in deterministic scenarios. Moreover, in order to achieve a particular level of performance, multiple optimizations with different Lagrange multipliers must be done to draw the IB curve and select the best traded-off representation.

In this article we introduced a general family of Lagrangians which allow one to (i) achieve varying levels of performance in any scenario, and (ii) pinpoint a specific Lagrange multiplier *β<sup>u</sup>* to optimize for a specific performance level in known-IB-curve scenarios, e.g., deterministic ones. Furthermore, we derived the *β<sup>u</sup>* domain when the IB curve is known and a bound on the *β<sup>u</sup>* domain for exploring the IB curve when it is unknown. This way we can reduce and/or avoid multiple optimizations and, hence, reduce the computational effort for finding well traded-off representations. Moreover, (iii) we saw how the value convergence issue of the convex IB Lagrangian can be exploited to approximately obtain a specific compression level for both known and unknown IB curve shapes. Finally, (iv) we provided some insight into the discontinuities of the performance levels w.r.t. the Lagrange multipliers by connecting them with the intrinsic clusterization of the bottleneck variable.

**Author Contributions:** Conceptualization, B.R.G. and R.T.; formal analysis, B.R.G.; funding acquisition, M.S.; methodology, B.R.G. and R.T.; resources, M.S.; software, B.R.G.; supervision, R.T. and M.S.; visualization, B.R.G.; writing—original draft, B.R.G.; writing—review and editing, B.R.G., R.T. and M.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the Swedish Research Council.

**Acknowledgments:** We want to thank the anonymous reviewers for their insightful comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Proof of Proposition 1**

**Proof.** We can easily prove this statement by showing that *I*(*T*;*Y*) is lower bounded by *γJ*CE(*p*(*X*,*Y*); *θ*) + *C*, where *γ* < 0 and *C* does not depend on *T*. This way, maximizing this lower bound is equivalent to minimizing *J*CE(*p*(*X*,*Y*); *θ*) and, moreover, implies maximizing *I*(*T*;*Y*).

We can find such an expression as follows:

$$I(T;Y) = \mathbb{E}_{(y,t)\sim q_{Y|T}q_T}\left[\log\left(\frac{q_{Y|T=t}(y|t;\theta)}{p_Y(y)}\right)\right] = H(Y) + \mathbb{E}_{(y,t)\sim q_{Y|T}q_T}\left[\log(q_{Y|T=t}(y|t;\theta))\right] \tag{A1}$$

$$= H(Y) + \mathbb{E}_{t\sim q_T}\left[D_{\mathrm{KL}}\left(q_{Y|T=t}\,\|\,q_{\hat{Y}|T=t}\right)\right] + \mathbb{E}_{(y,t)\sim q_{Y|T}q_T}\left[\log(q_{\hat{Y}|T}(y|t;\theta))\right] \tag{A2}$$

$$\ge H(Y) + \mathbb{E}_{(x,y,t)\sim q_{T|X}p_{X,Y}}\left[\log(q_{\hat{Y}|T=t}(y|t;\theta))\right] = H(Y) - \mathbb{E}_{(x,t)\sim q_{T|X}p_X}\left[\mathrm{CE}\left(q_{Y|T=t}\,\|\,q_{\hat{Y}|T=t}\right)\right] \tag{A3}$$

$$= H(Y) - J_{\mathrm{CE}}(p_{(X,Y)};\theta). \tag{A4}$$

Here, in Equation (A1) we simply used the definition of the mutual information between two random variables and then decoupled it using the definition of the entropy of a variable. (Note we used *H*(·), which is usually employed for discrete variables. However, in this setting *H*(·) could also refer to the differential entropy *h*(·) of a continuous random variable, since we employed the general definition using the expectation.) Then, in Equation (A2) we multiplied and divided by $q_{\hat{Y}|T}$ inside the logarithm and employed the definition of the Kullback–Leibler divergence. Finally, in Equation (A3) we first used the fact that the Kullback–Leibler divergence is non-negative (Theorem 2.6.3 from Cover and Thomas [20]) and then the properties of the Markov chain *T* ↔ *X* ↔ *Y*.

Therefore, since *H*(*Y*) does not depend on *T* and *J*CE(*p*(*X*,*Y*); *θ*) appears with a negative multiplicative term, the proposition is proved.
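The chain of (in)equalities in Equations (A1)–(A4) can be checked numerically on a toy discrete problem; the joint distribution, encoder, and decoder below are arbitrary illustrative choices:

```python
import math

# Toy joint p(x, y), stochastic encoder q(t|x), and decoder q_hat(y|t);
# all distributions are arbitrary illustrative choices.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
q_t_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
q_hat = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.35, 1: 0.65}}

# Joint p(y, t) through the Markov chain T <-> X <-> Y, plus marginals.
p_yt = {(y, t): sum(p_xy[(x, y)] * q_t_given_x[x][t] for x in (0, 1))
        for y in (0, 1) for t in (0, 1)}
p_y = {y: p_yt[(y, 0)] + p_yt[(y, 1)] for y in (0, 1)}
p_t = {t: p_yt[(0, t)] + p_yt[(1, t)] for t in (0, 1)}

# Left-hand side I(T;Y) and the bound H(Y) - E[CE] of Equations (A3)-(A4).
i_ty = sum(p * math.log(p / (p_y[y] * p_t[t])) for (y, t), p in p_yt.items())
h_y = -sum(p * math.log(p) for p in p_y.values())
cross_entropy = -sum(p_yt[(y, t)] * math.log(q_hat[t][y])
                     for y in (0, 1) for t in (0, 1))
bound = h_y - cross_entropy
assert i_ty >= bound  # the cross-entropy bound never exceeds I(T;Y)
```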

#### **Appendix B. Alternative Proof of Theorem 1**

**Proof.** We will prove the enumerated statements sequentially, since the third one requires the first two to be proved.

- (a) Since the IB curve is concave, we know *β* is non-increasing in $I(X;T) \in \mathbb{R}^+$. We also know *β* = 1 at the points in the IB curve where $I(X;T) \le \lim_{\epsilon\to 0^+}\{I(X;Y) - \epsilon\}$ and *β* = 0 at the points in the IB curve where $I(X;T) \ge \lim_{\epsilon\to 0^+}\{I(X;Y) + \epsilon\}$. Hence, if we achieve a solution with *β* ∈ (0, 1), this solution is *I*(*X*; *T*) = *I*(*T*;*Y*) = *I*(*X*;*Y*).
	- (b) We can upper bound the IB Lagrangian by

$$\mathcal{L}\_{\text{IB}}(T;\beta) = I(T;Y) - \beta I(X;T) \le (1-\beta)I(T;Y) \le (1-\beta)I(X;Y),\tag{A5}$$

where the first and second inequalities use the DPI (Theorem 2.8.1 from Cover and Thomas [20]).

Then, we can consider the point (*I*(*X*;*Y*), *I*(*X*;*Y*)) of the IB curve. Since the function is concave, a tangent line at (*I*(*X*;*Y*), *I*(*X*;*Y*)) exists such that all other points in the curve lie below this line. Let *β* be the slope of this tangent line (which, from Tishby et al. [1], we know is the Lagrange multiplier). Then,

$$I(X;Y) - \beta I(X;Y) = (1 - \beta)I(X;Y) \ge F\_{\text{IB,max}}(r) - \beta r, \; \forall r \in [0, \infty). \tag{A6}$$

As we see, by the upper bound on the IB Lagrangian from Equation (A5), if the point (*I*(*X*;*Y*), *I*(*X*;*Y*)) exists, any *β* can be the slope of the tangent line to (*I*(*X*;*Y*), *I*(*X*;*Y*)) that ensures concavity.

#### **Appendix C. Proof of Theorem 2**

**Proof.** We start the proof by remembering the optimization problem at hand (Definition 1):

$$F\_{\text{IB},\text{max}}(r) = \max\_{T \in \Delta} \{ I(T; \mathcal{Y}) \} \text{ s.t. } I(X; T) \le r \tag{A7}$$

We can modify the optimization problem by

$$\max\_{T \in \Delta} \{ I(T; Y) \} \text{ s.t. } u(I(X; T)) \le u(r) \tag{A8}$$

iff *u* is a monotonically non-decreasing function, since otherwise $u(I(X;T)) \le u(r)$ would not necessarily hold. Now, let us assume $\exists T^* \in \Delta$ and $\beta_u^*$ s.t. $T^*$ maximizes $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u^*)$ over all $T \in \Delta$, and $I(X;T^*) \le r$. Then, we can operate as follows:

$$\max_{\substack{T\in\Delta \\ u(I(X;T))\le u(r)}}\left\{I(T;Y)\right\} = \max_{\substack{T\in\Delta \\ u(I(X;T))\le u(r)}}\left\{I(T;Y) - \beta_u^*(u(I(X;T)) - u(r) + \xi)\right\} \tag{A9}$$

$$\le \max_{T\in\Delta}\left\{I(T;Y) - \beta_u^*(u(I(X;T)) - u(r) + \xi)\right\} \tag{A10}$$

$$= I(T^*;Y) - \beta_u^*(u(I(X;T^*)) - u(r) + \xi) = I(T^*;Y). \tag{A11}$$

Here, the equality in Equation (A9) comes from the fact that, since $I(X;T) \le r$, there exists $\xi \ge 0$ s.t. $u(I(X;T)) - u(r) + \xi = 0$. Then, the inequality in Equation (A10) holds since we have expanded the optimization search space. Finally, in Equation (A11) we use that $T^*$ maximizes $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u^*)$ and that $I(X;T^*) \le r$.

Now, we can exploit the fact that *u*(*r*) and *ξ* do not depend on *T* and drop them in the maximization in Equation (A10). We then realize we are maximizing $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u^*)$; i.e.,

$$\max_{\substack{T\in\Delta \\ u(I(X;T))\le u(r)}}\left\{I(T;Y)\right\} \le \max_{T\in\Delta}\{I(T;Y) - \beta_u^*(u(I(X;T)) - u(r) + \xi)\} \tag{A12}$$

$$= \max_{T\in\Delta}\{I(T;Y) - \beta_u^* u(I(X;T))\} = \max_{T\in\Delta}\{\mathcal{L}_{\mathrm{IB},u}(T;\beta_u^*)\}. \tag{A13}$$

Therefore, since $I(T^*;Y)$ satisfies both the maximization with $T^* \in \Delta$ and the constraint $I(X;T^*) \le r$, maximizing $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u^*)$ obtains $F_{\mathrm{IB,max}}(r)$.

Now, we know that if such a $\beta_u^*$ exists, then the solution of the Lagrangian will be a solution for $F_{\mathrm{IB,max}}(r)$. Then, if we consider Theorem 6 from the Appendix of Courcoubetis [22] and consider the maximization problem instead of the minimization problem, we know that if both $I(T;Y)$ and $-u(I(X;T))$ are concave functions, then a set of Lagrange multipliers $S_u^*$ with these conditions exists. We can make this consideration because $f$ is concave if $-f$ is convex and $\max\{f\} = \min\{-f\}$. We know $I(T;Y)$ is a concave function of $T$ for $T \in \Delta$ (Lemma 5 of Gilad-Bachrach et al. [19]) and $I(X;T)$ is convex w.r.t. $T$ given that $p_X$ is fixed (Theorem 2.7.4 of Cover and Thomas [20]). Thus, if we want $-u(I(X;T))$ to be concave, we need $u$ to be a convex function.

Finally, we will look at the conditions on *u* such that for every point $(I(X;T), I(T;Y))$ in the IB curve there exists a unique $\beta_u^*$ s.t. $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u^*)$ is maximized; that is, the conditions on $u$ s.t. $|S_u^*| = 1$. For this purpose we will look at the solutions of the Lagrangian optimization:

$$\frac{d\mathcal{L}\_{\text{IB},u}(T;\beta\_{\text{u}})}{dT} = \frac{d(I(T;Y) - \beta\_{\text{u}}u(I(X;T)))}{dT} = \frac{dI(T;Y)}{dT} - \beta\_{\text{u}}\frac{du(I(X;T))}{dI(X;T)}\frac{dI(X;T)}{dT} = 0 \tag{A14}$$

Now, if we integrate both sides of Equation (A14) over all *T* ∈ Δ we obtain

$$\beta\_{\boldsymbol{u}} = \frac{dI(T;Y)}{dI(X;T)} \left(\frac{d\boldsymbol{u}(I(X;T))}{dI(X;T)}\right)^{-1} = \frac{\beta}{\boldsymbol{u}'(I(X;T))},\tag{A15}$$

where *β* is the Lagrange multiplier from the IB Lagrangian [1] and $u'(I(X;T))$ denotes $du(I(X;T))/dI(X;T)$. Also, if we want to avoid indeterminations of $\beta_u$, we need $u'(I(X;T))$ not to be 0. Since we already imposed *u* to be monotonically non-decreasing, we can solve this issue by strengthening this condition; that is, we will require *u* to be monotonically increasing.

We would like $\beta_u$ to be continuous; this way there would be a unique $\beta_u$ for each value of $I(X;T)$. We know *β* is a non-increasing function of $I(X;T)$ (Lemma 6 of Gilad-Bachrach et al. [19]). Hence, if we want $\beta_u$ to be a strictly decreasing function of $I(X;T)$, we will require $u'$ to be a strictly increasing function of $I(X;T)$. Therefore, we will require *u* to be a strictly convex function.

Thus, if *u* is a strictly convex and monotonically increasing function, for each point (*I*(*X*; *T*), *I*(*T*;*Y*)) in the IB curve s.t. *dI*(*T*;*Y*)/*dI*(*X*; *T*) > 0 there is a unique *β<sup>u</sup>* for which maximizing <sup>L</sup>IB,u(*T*; *<sup>β</sup>u*) achieves this solution.
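A quick numerical illustration of this uniqueness: for a toy concave curve $f_{\mathrm{IB}}(r) = \sqrt{r}$ (not the IB curve of any real dataset) and the strictly convex, increasing $u(x) = x^2$, the mapping $\beta_u = f'_{\mathrm{IB}}(r)/u'(r)$ from Equation (A15) is strictly decreasing, so each point of the curve has exactly one multiplier:

```python
import math

# Toy concave curve f_IB(r) = sqrt(r) and convex u(x) = x^2 (illustrative).
def f_ib_prime(r):
    return 1.0 / (2.0 * math.sqrt(r))

def u_prime(r):
    return 2.0 * r

# Equation (A15): beta_u = f'_IB(r) / u'(r), evaluated on a grid of r values.
rs = [0.25 * k for k in range(1, 20)]
betas = [f_ib_prime(r) / u_prime(r) for r in rs]
assert all(b1 > b2 for b1, b2 in zip(betas, betas[1:]))  # strictly decreasing
```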

#### **Appendix D. Proof of Proposition 3**

**Proof.** In Theorem 2 we showed how each point of the IB curve (*I*(*X*; *T*), *I*(*T*;*Y*)) can be found with a unique *<sup>β</sup><sup>u</sup>* maximizing <sup>L</sup>IB,*u*(*T*; *<sup>β</sup>u*). Therefore, since we also proved <sup>L</sup>IB,*u*(*T*; *<sup>β</sup>u*) is strictly concave w.r.t. *T* we can find the values of *β<sup>u</sup>* that maximize the Lagrangian for fixed *I*(*X*; *T*).

First, we look at the solutions of the Lagrangian maximization:

$$\frac{d\mathcal{L}_{\mathrm{IB},u}(T;\beta_u)}{dT} = \frac{d(f_{\mathrm{IB}}(I(X;T)) - \beta_u u(I(X;T)))}{dT} = \frac{df_{\mathrm{IB}}(I(X;T))}{dT} - \beta_u\frac{du(I(X;T))}{dI(X;T)}\frac{dI(X;T)}{dT} = 0. \tag{A16}$$

Then as before we can integrate at both sides for all *T* ∈ Δ and solve for *βu*:

$$\beta\_u = \frac{df\_{\rm IB}(I(X;T))}{dI(X;T)} \frac{1}{u'(I(X;T))}.\tag{A17}$$

Moreover, since *u* is a strictly convex function, its derivative $u'$ is strictly increasing. Hence, $u'$ is an invertible function (since a strictly increasing function is injective and, thus, a bijection onto its image). Now, if we consider $\beta_u > 0$ to be known and $I(X;T)$ to be the unknown, we can solve for $I(X;T)$ and get:

$$I(X;T) = (u')^{-1}\left(\frac{df_{\mathrm{IB}}(I(X;T))}{dI(X;T)}\frac{1}{\beta_u}\right). \tag{A18}$$

Note we require *β<sup>u</sup>* not to be 0 so the mapping is defined.

#### **Appendix E. Proof of Corollary 2**

**Proof.** We will start the proof by proving the following useful Lemma.

**Lemma A1.** *Let* $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u)$ *be a convex IB Lagrangian; then* $\sup_{T\in\Delta}\{\mathcal{L}_{\mathrm{IB},u}(T;0)\} = I(X;Y)$*.*

**Proof.** Since <sup>L</sup>IB,*u*(*T*; 0) = *<sup>I</sup>*(*T*;*Y*), maximizing this Lagrangian is directly maximizing *<sup>I</sup>*(*T*;*Y*). We know *<sup>I</sup>*(*T*;*Y*) is a concave function of *<sup>T</sup>* for *<sup>T</sup>* <sup>∈</sup> <sup>Δ</sup> (Theorem 2.7.4 from Cover and Thomas [20]); hence it has a supremum. We also know *<sup>I</sup>*(*T*;*Y*) <sup>≤</sup> *<sup>I</sup>*(*X*;*Y*). Moreover, we know *<sup>I</sup>*(*X*;*Y*) can be achieved if, for example, *Y* is a deterministic function of *T* (since then the Markov Chain *X* ↔ *T* ↔ *Y* is formed). Thus, sup*T*∈Δ{LIB,*u*(*T*; 0)} <sup>=</sup> *<sup>I</sup>*(*X*;*Y*).

For $\beta_u = 0$, we know that maximizing $\mathcal{L}_{\mathrm{IB},u}(T;0)$ obtains the point $(r_{\max}, I_{\max})$ in the IB curve (Lemma A1). Moreover, we know that for every point $(I(X;T), f_{\mathrm{IB}}(I(X;T)))$ such that $df_{\mathrm{IB}}(I(X;T))/dI(X;T) > 0$, $\exists!\beta_u$ s.t. $\max\{\mathcal{L}_{\mathrm{IB},u}(T;\beta_u)\}$ achieves that point (Theorem 2). Thus, $\exists!\beta_{u,\min}$ s.t. $\lim_{r\to r_{\max}^-}(r, f_{\mathrm{IB}}(r))$ is achieved. From Proposition 3 we know this $\beta_{u,\min}$ is given by

$$\beta\_{u,\min} = \lim\_{r \to r\_{\max}^{-}} \left\{ \frac{f\_{\text{IB}}'(r)}{u'(r)} \right\}. \tag{A19}$$

Since we know $f_{\mathrm{IB}}(I(X;T))$ is a concave non-decreasing function in $(0, r_{\max})$ (Lemma 5 of Gilad-Bachrach et al. [19]), we know it is continuous in this interval. In addition, we know $\beta_u$ is strictly decreasing w.r.t. $I(X;T)$ (Theorem 2). Furthermore, by the definition of $r_{\max}$ and knowing $I(T;Y) \le I(X;Y)$, we know $f'_{\mathrm{IB}}(r) = 0$, $\forall r > r_{\max}$. Therefore, we cannot ensure the exploration of the IB curve for $\beta_u$ s.t. $0 < \beta_u < \beta_{u,\min}$.

Then, since *u* is a strictly increasing function in $(0, r_{\max})$, $u'$ is positive in that interval. Hence, taking into account that $\beta_u$ is strictly decreasing, we can find a maximum $\beta_u$ as $I(X;T)$ approaches 0; that is,

$$\beta_{u,\max} = \lim_{r\to 0^+}\left\{\frac{f_{\mathrm{IB}}'(r)}{u'(r)}\right\}. \tag{A20}$$

#### **Appendix F. Proof of Corollary 3**

**Proof.** If we use Corollary 2, it is straightforward to see that $B_u \subseteq [L^-, L^+]$ if $\beta_{u,\min} \ge L^-$ and $\beta_{u,\max} \le L^+$ for all IB curves $f_{\mathrm{IB}}$ and functions $u$. Therefore, we look at a domain bound dependent on the function choice. That is, if we can find $\beta_{\min} \le f'_{\mathrm{IB}}(r)$ and $\beta_{\max} \ge f'_{\mathrm{IB}}(r)$ for all IB curves and all values of $r$, then

$$B\_{\rm u} \subseteq \left[ \frac{\beta\_{\rm min}}{\lim\_{r \to r\_{\rm max}^{-}} \{u'(r)\}}, \frac{\beta\_{\rm max}}{\lim\_{r \to 0^{+}} \{u'(r)\}} \right]. \tag{A21}$$

The region of all possible IB curves, regardless of the relationship between *X* and *Y*, is depicted in Figure A1. The hard limits are imposed by the DPI (Theorem 2.8.1 from Cover and Thomas [20]) and the fact that mutual information is non-negative (Corollary with Equation 2.90 for discrete and first Corollary of Theorem 8.6.1 for continuous random variables from Cover and Thomas [20]). Hence, minimum and maximum values of $f'_{\mathrm{IB}}$ are given by the minimum and maximum values of the slope of the Pareto frontier, which means

$$B_u \subseteq \left[0, \frac{1}{\lim_{r\to 0^+}\{u'(r)\}}\right]. \tag{A22}$$

Note $0/(\lim_{r\to r_{\max}^-}\{u'(r)\}) = 0$, since *u* is monotonically increasing and, thus, $u'$ will never be 0.

**Figure A1.** Graphical representation of the IB curve in the information plane. Dashed lines in orange represent tight bounds confining the region (in light orange) of possible IB curves (delimited by the red line, also known as the Pareto frontier). Black dotted lines are informative values. In blue we show an example of a possible IB curve confining a region (in darker orange) of an IB curve that does not achieve the Pareto frontier. Finally, the yellow star represents the point where the representation keeps the same information about the input and the output.

Then, we can tighten the bound using the results of Wu et al. [27], where, in Theorem 2, they showed the slope of the Pareto frontier at the origin can be bounded by $f'_{\mathrm{IB}} \le (\inf_{\Omega_x \subset \mathcal{X}}\{\beta_0(\Omega_x)\})^{-1}$. Finally, we know that in deterministic classification tasks $\inf_{\Omega_x \subset \mathcal{X}}\{\beta_0(\Omega_x)\} = 1$, which aligns with Kolchinsky et al. [21] and with what we can observe in Figure A1. Therefore,

$$B\_{\mathfrak{u}} \subseteq \left[ 0, \frac{(\inf\_{\Omega\_{\mathfrak{x}} \subset \mathcal{X}} \{\beta\_0(\Omega\_{\mathfrak{x}})\})^{-1}}{\lim\_{r \to 0^+} \{u'(r)\}} \right] \subseteq \left[ 0, \frac{1}{\lim\_{r \to 0^+} \{u'(r)\}} \right]. \tag{A23}$$

#### **Appendix G. Other Lagrangian Families**

We can use the same ideas we used for the convex IB Lagrangian to formulate new families of Lagrangians that allow the exploration of the IB curve. For that, we will use the duality of the IB curve (Lemma 10 of [19]). That is:

**Definition A1 (IB Dual Functional).** *Let X and Y be statistically dependent variables. Let also* Δ *be the set of random variables T obeying the Markov condition Y* ↔ *X* ↔ *T. Then the IB dual functional is*

$$F_{\mathrm{IB,min}}(i) = \min_{T\in\Delta}\{I(X;T)\} \text{ s.t. } I(T;Y) \ge i,\ \forall i \in [0, I(X;Y)). \tag{A24}$$

**Theorem A1 (IB Curve Duality).** *Let the IB curve be defined by the solutions of <sup>F</sup>*IB,max(*r*) *for varying <sup>r</sup>* <sup>∈</sup> [0, <sup>∞</sup>)*. Then,*

$$\forall r\,\exists i \text{ s.t. } (r, F_{\mathrm{IB,max}}(r)) = (F_{\mathrm{IB,min}}(i), i) \tag{A25}$$

*and*

$$\forall i \exists r \, s.t. \,(F\_{\text{IB}, \text{min}}(i), i) = (r, F\_{\text{IB}, \text{max}}(r)). \tag{A26}$$

From this definition, it follows that minimizing the dual IB Lagrangian, <sup>L</sup>IB,dual(*T*; *<sup>β</sup>*dual) = *<sup>I</sup>*(*X*; *<sup>T</sup>*) <sup>−</sup> *<sup>β</sup>*dual*I*(*T*;*Y*), for *<sup>β</sup>*dual <sup>=</sup> *<sup>β</sup>*−<sup>1</sup> is equivalent to maximizing the IB Lagrangian. In fact, the original Lagrangian for solving the problem was defined this way [1]. We decided to use the maximization version because the domain of useful *β* is bounded while it is not for *β*dual.
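The equivalence is easy to verify on a grid of candidate points, since the dual objective is just $-\beta^{-1}$ times the primal one; the toy curve below is purely illustrative:

```python
# Candidate points (I(X;T), I(T;Y)) on a toy concave curve f_IB(r) = sqrt(r).
points = [(r / 10.0, (r / 10.0) ** 0.5) for r in range(1, 51)]

beta = 0.4
# Maximizing the IB Lagrangian I(T;Y) - beta * I(X;T) ...
primal = max(points, key=lambda p: p[1] - beta * p[0])
# ... and minimizing the dual IB Lagrangian with beta_dual = 1/beta ...
dual = min(points, key=lambda p: p[0] - (1.0 / beta) * p[1])
# ... select the same point: the dual objective equals -(1/beta) times the primal.
assert primal == dual
```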

Following the same reasoning as we did in the proof of Theorem 2, we can ensure the IB curve can be explored if:


Here, *u* is a monotonically increasing strictly convex function, *v* is a monotonically increasing strictly concave function, and *βv*, *βv*,dual, *βu*,dual are the Lagrange multipliers of the families of Lagrangians defined above.

In a similar manner, one could obtain relationships between the Lagrange multipliers of the IB Lagrangian and the convex IB Lagrangian and these Lagrangian families. For instance, the convex IB Lagrangian $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u)$ is related to the concave IB Lagrangian $\mathcal{L}_{\mathrm{IB},v}(T;\beta_v)$ as defined by Proposition A1.

**Proposition A1 (Relationship between the convex and concave IB Lagrangians).** *Consider the convex and concave IB Lagrangians* <sup>L</sup>IB,*u*(*T*; *<sup>β</sup>u*)*,* <sup>L</sup>IB,*v*(*T*; *<sup>β</sup>v*)*. Let the IB curve defined as in Definition <sup>2</sup> be <sup>f</sup>*IB*. Then, if we fix the functions <sup>u</sup> and <sup>v</sup> we can obtain the same point in the IB curve* (*r*, *<sup>f</sup>*IB(*r*)) *with both Lagrangians when*

$$\beta\_v^{-1} = f\_{\rm IB}'(r) v' \left( f\_{\rm IB} \left( (u')^{-1} \left( \frac{f\_{\rm IB}'(r)}{\beta\_u} \right) \right) \right), \tag{A27}$$

*or equivalently,*

$$\beta\_u^{-1} = \frac{1}{f\_{\rm IB}'(r)} u' \left( f\_{\rm IB}^{-1} \left( (v')^{-1} \left( \frac{\beta\_v^{-1}}{f\_{\rm IB}'(r)} \right) \right) \right) . \tag{A28}$$

**Proof.** If we proceed like we did in the proof of Proposition 3 we can find the mapping between *I*(*X*; *T*) and *β<sup>u</sup>* and between *I*(*T*;*Y*) and *βv*. That is,

$$I(\mathbf{X};T) = (u')^{-1} \left( \frac{df\_{\mathrm{IB}}(I(\mathbf{X};T))}{dI(\mathbf{X};T)} \frac{1}{\beta\_{\mathrm{u}}} \right) \text{ and } I(T;Y) = (v')^{-1} \left( \left( \frac{df\_{\mathrm{IB}}(I(\mathbf{X};T))}{dI(\mathbf{X};T)} \right)^{-1} \frac{1}{\beta\_{\mathrm{v}}} \right). \tag{A29}$$

Then, if we recall that *<sup>I</sup>*(*T*;*Y*) = *<sup>f</sup>*IB(*I*(*X*; *<sup>T</sup>*)), we can directly obtain that

$$f\_{\rm IB}\left((\mathsf{u}')^{-1}\left(\frac{df\_{\rm IB}(I(X;T))}{dI(X;T)}\frac{1}{\beta\_{\mathsf{u}}}\right)\right) = (\mathsf{v}')^{-1}\left(\left(\frac{df\_{\rm IB}(I(X;T))}{dI(X;T)}\right)^{-1}\frac{1}{\beta\_{\mathsf{v}}}\right). \tag{A30}$$

Then, if we solve Equation (A30) with a fixed point (*I*(*X*; *<sup>T</sup>*) = *<sup>r</sup>*, *<sup>I</sup>*(*T*;*Y*) = *<sup>f</sup>*IB(*r*)) for *<sup>β</sup><sup>v</sup>* we obtain Equation (A27), and if we solve it for *β<sup>u</sup>* we obtain Equation (A28).
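Equation (A27) can be sanity-checked numerically with toy choices of $f_{\mathrm{IB}}$, $u$, and $v$ (ours, purely illustrative): the $\beta_v$ recovered from $\beta_u$ through Equation (A27) matches the one obtained directly from Equation (A29):

```python
import math

# Toy instantiation: f_IB(r) = sqrt(r), u(x) = x^2 (convex), v(x) = sqrt(x) (concave).
f_ib = math.sqrt
f_ib_prime = lambda r: 1.0 / (2.0 * math.sqrt(r))
u_prime = lambda x: 2.0 * x
u_prime_inv = lambda y: y / 2.0              # (u')^{-1} for u'(x) = 2x
v_prime = lambda x: 1.0 / (2.0 * math.sqrt(x))

r = 1.0
beta_u = f_ib_prime(r) / u_prime(r)                 # Equation (A17)
beta_v = 1.0 / (f_ib_prime(r) * v_prime(f_ib(r)))   # solved from Equation (A29)
# Equation (A27): recover beta_v^{-1} from beta_u and the curve.
beta_v_inv = f_ib_prime(r) * v_prime(f_ib(u_prime_inv(f_ib_prime(r) / beta_u)))
assert math.isclose(beta_v_inv, 1.0 / beta_v)
```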

Also, one could find a range of values for these Lagrangians that allows for IB curve exploration and define a bijective mapping between their Lagrange multipliers and the IB curve. However, (i) as mentioned in Section 2.2, *I*(*T*;*Y*) is particularly interesting to maximize without transformations because of its meaning. Moreover, (ii) like *β*dual, the domain of useful *β<sup>v</sup>* and *βu*,dual is not upper bounded. These two reasons make these other Lagrangians less preferable; we only include them here for completeness. Nonetheless, we encourage the curious reader to explore these families of Lagrangians too. For example, an interesting line of research would be to investigate whether some particularization of the concave IB Lagrangian suffers from an issue like value convergence that can be exploited to approximately obtain any predictability level *I*(*T*;*Y*) = *i*∗ for many values of *βv*.

#### **Appendix H. Experimental Setup Details and Further Experiments**

In order to generate empirical support for our claims, we performed several experiments on different datasets with different neural network architectures and different ways of calculating the information bottleneck.

#### *Appendix H.1. Information Bottleneck Calculations*

The information bottleneck is calculated by modifying the nonlinear-IB [26]. This method trains a neural network that minimizes the cross-entropy while also minimizing an upper-bound estimate of the mutual information $I_\theta \approx I(X;T)$. The nonlinear-IB relies on a kernel-based estimate of this mutual information [40]. We modify this calculation method by applying the function *u* to the $I(X;T)$ estimate.

For the nonlinear-IB calculations, we estimated the gradients of both $I_\theta(X;T)$ and the cross-entropy with the same mini-batch. Moreover, we did not learn the covariance of the mixture of Gaussians used for the kernel density estimation of $I_\theta(X;T)$; we set it to $(\exp(-1))^2$.
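For intuition, a simplified kernel-density-style estimator in the spirit of the nonlinear-IB can be sketched as follows; the constants and the function name are illustrative assumptions of ours, not the exact bound derived in [40]:

```python
import numpy as np

def kde_mi_upper(mu, sigma):
    """Illustrative KDE-style estimate (in nats) of I(X;T) for a Gaussian
    encoder T = f_enc(X) + W, W ~ N(0, sigma^2 I): it treats q_T as a mixture
    of Gaussians centered at the per-sample means mu_i. The constants here are
    illustrative; the actual bound used by the nonlinear-IB is derived in [40]."""
    sq = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # ||mu_i - mu_j||^2
    return float(-np.log(np.exp(-sq / (4.0 * sigma ** 2)).mean(axis=1)).mean())

# Identical means carry no information about X ...
assert kde_mi_upper(np.zeros((8, 2)), 1.0) == 0.0
# ... while well-separated means approach log(N) nats.
mu = 100.0 * np.arange(8, dtype=float)[:, None]
assert abs(kde_mi_upper(mu, 1.0) - np.log(8)) < 1e-6
```

Any estimator of this kind interpolates between 0 (fully compressed representations) and $\log N$ nats for a mini-batch of $N$ samples.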

For all the experiments, we assumed a Gaussian stochastic encoder $T = f_{\mathrm{enc}}(X;\theta) + W$ with $p_W = \mathcal{N}(0, I_d)$, where $d$ is the number of dimensions of the representations. We trained the neural networks with the Adam optimization algorithm [46] with a learning rate of $10^{-4}$ and a 0.6 decay rate every 10 epochs. We used a batch size of 128 samples, and all the weights were initialized according to the method described by Glorot and Bengio [47] using a Gaussian distribution.
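A minimal sketch of this stochastic encoder, with an illustrative stand-in for the trained network $f_{\mathrm{enc}}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_encoder(x, f_enc, d):
    """Gaussian stochastic encoder T = f_enc(X; theta) + W with p_W = N(0, I_d)."""
    return f_enc(x) + rng.standard_normal(d)

# f_enc below is an illustrative stand-in for the trained network (d = 2).
f_enc = lambda x: np.tanh(x[:2])
x = np.ones(784)
samples = np.stack([stochastic_encoder(x, f_enc, 2) for _ in range(5000)])
assert samples.shape == (5000, 2)
assert np.allclose(samples.mean(axis=0), np.tanh(1.0), atol=0.1)  # mean ~ f_enc(x)
assert np.allclose(samples.std(axis=0), 1.0, atol=0.1)            # unit noise variance
```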

Then, we used the DBSCAN algorithm [44,45] for clustering. In particular, we used the scikit-learn [48] implementation with eps = 0.3 and min\_samples = 50.
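A minimal example of this clustering step on synthetic, well-separated 2-D points (the data below is ours; the DBSCAN parameters follow the text):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight, well-separated blobs of 60 points each, mimicking the clusters
# that the 2-D bottleneck variable forms in the information plane.
offsets = 0.01 * np.array([(i, j) for i in range(6) for j in range(10)], dtype=float)
points = np.vstack([offsets, offsets + 5.0])

labels = DBSCAN(eps=0.3, min_samples=50).fit_predict(points)
assert len(set(labels) - {-1}) == 2   # two clusters found
assert (labels != -1).all()           # no point flagged as noise
```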

The reader can find the PyTorch [30] implementation in the following link: https://github.com/ burklight/convex-IB-Lagrangian-PyTorch.

#### *Appendix H.2. The Experiments*

We performed experiments in four different datasets:

• A Classification Task on the MNIST Dataset [28] (Figures 1, 2, and A2–A4 and top row from Figure 3). This dataset contains 60,000 training samples and 10,000 testing samples of hand-written digits. The samples are 28 × 28 pixels and are labeled from 0 to 9; i.e., $\mathcal{X} = \mathbb{R}^{784}$ and $\mathcal{Y} = \{0, 1, ..., 9\}$. The data is pre-processed so that the input has zero mean and unit variance. This is a deterministic setting; hence the experiment is designed to showcase how the convex IB Lagrangians allow us to explore the IB curve in a setting where the normal IB Lagrangian cannot, and the relationship between the performance plateaus and the clusterization phenomena. Furthermore, it intends to showcase the behavior of the power and exponential Lagrangians with different parameters *α* and *η*. Finally, it aims to demonstrate how value convergence can be employed to approximately obtain a specific compression value. In this experiment, the encoder *f*enc is a three fully-connected-layer encoder with 800 ReLU units on the first two layers and two linear units on the last layer ($T \in \mathbb{R}^2$), and the decoder *f*dec is a fully-connected layer of 800 ReLU units followed by an output layer with 10 softmax units. The convex IB Lagrangian was calculated using the nonlinear-IB.

**Figure A2.** Results for the power IB Lagrangian in the MNIST dataset with *α* = {0.5, 1, 2}, from top to bottom. In each row, from left to right are shown (i) the information plane, where the region of possible solutions of the IB problem is shadowed in light orange and the information-theoretic limits are the dashed orange line; (ii) *I*(*T*;*Y*) as a function of *βu*; and (iii) the compression *I*(*X*; *T*) as a function of *βu*. In all plots, the red crosses joined by a dotted line represent the values computed with the training set, the blue dots the values computed with the validation set, and the green stars the theoretical values computed as dictated by Proposition 3. Moreover, in all plots, $I(X;Y) = H(Y) = \log_2(10)$ is indicated with a dashed orange line. All values are shown in bits.

In Figure A2 we show how the IB curve can be explored with different values of *α* for the power IB Lagrangian and in Figure A3 for different values of *η* and the exponential IB Lagrangian.

Finally, in Figure A4 we show the clusterization for the same values of *α* and *η* as in Figures A2 and A3. In this way, the connection between the performance discontinuities and the clusterization is more evident. Furthermore, we can also observe how the exponential IB Lagrangian maintains the theoretical performance better than the power IB Lagrangian (see Appendix I for an explanation of why).

**Figure A3.** Results for the exponential IB Lagrangian in the MNIST dataset with *η* = {log(2), 1, 1.5}, from top to bottom. In each row, from left to right are shown (i) the information plane, where the region of possible solutions of the IB problem is shadowed in light orange and the information-theoretic limits are the dashed orange line; (ii) *I*(*T*;*Y*) as a function of *βu*; and (iii) the compression *I*(*X*; *T*) as a function of *βu*. In all plots, the red crosses joined by a dotted line represent the values computed with the training set, the blue dots the values computed with the validation set, and the green stars the theoretical values computed as dictated by Proposition 3. Moreover, in all plots, $I(X;Y) = H(Y) = \log_2(10)$ is indicated with a dashed orange line. All values are shown in bits.

**Figure A4.** Depiction of the clustering behavior of the bottleneck variable in the MNIST dataset. In the first row, from left to right, the power IB Lagrangian with values *α* = {0.5, 1, 2}. In the second row, from left to right, the exponential IB Lagrangian with values *η* = {log(2), 1, 1.5}.

• A Classification Task on the Fashion-MNIST Dataset [49] (Figure A5). Like MNIST, this dataset contains 60,000 training and 10,000 testing samples of 28×28 pixel images labeled from 0 to 9, and it constitutes a deterministic setting. The difference is that this dataset contains fashion products instead of hand-written digits, and it represents a harder classification task [49]. The data are also pre-processed so that the input has zero mean and unit variance. For this experiment, the encoder *f*enc is composed of a two-layer convolutional neural network (CNN) with 32 filters on the first layer and 128 filters on the second, with kernels of size 5 and stride 2. This CNN is followed by two fully-connected layers of 128 linear units (*T* ∈ R<sup>128</sup>). After the first convolution and the first fully-connected layer, a ReLU activation is employed. The decoder *f*dec is a fully-connected 128 ReLU unit layer followed by an output layer with 10 softmax units. The convex IB Lagrangian was computed using the nonlinear-IB. Therefore, this experiment intends to showcase how the convex IB Lagrangian can explore the IB curve for different neural network architectures and harder datasets.

**Figure A5.** Results for the exponential IB Lagrangian in the Fashion-MNIST dataset with *η* = 1. From left to right: (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are the dashed orange line; (ii) *I*(*T*; *Y*) as a function of *β<sub>u</sub>*; and (iii) the compression *I*(*X*; *T*) as a function of *β<sub>u</sub>*. In all plots, the red crosses joined by a dotted line represent the values computed with the training set and the blue dots the values computed with the validation set. Moreover, all plots indicate *I*(*X*; *Y*) = *H*(*Y*) = log<sub>2</sub>(10). All values are shown in bits.

• A Regression Task on the California Housing Dataset [50] (Figure A6). This dataset contains 20,640 samples of 8 real-valued input variables, like the longitude and latitude of the house (i.e., *X* ∈ R<sup>8</sup>), and a real-valued task output representing the price of the house (i.e., *Y* ∈ R). We used the log-transformed house price as the target variable and dropped the 992 samples in which the house price was equal to or greater than \$500,000 so that the output distribution was closer to a Gaussian, as done in [26]. The input variables were processed so that they had zero mean and unit variance, and we randomly split the samples into a 70% training and 30% test dataset. As in [40], for regression tasks we approximate *H*(*Y*) with the entropy of a Gaussian with variance Var(*Y*) and *H*(*Y*|*T*) with the entropy of a Gaussian with variance equal to the mean-squared error (MSE). This leads to the estimate *I*(*T*; *Y*) ≈ 0.5 log(Var(*Y*)/MSE). The encoder *f*enc consists of three fully-connected layers with 128 ReLU units on the first two layers and 2 linear units on the last layer (*T* ∈ R<sup>2</sup>), and the decoder *f*dec is a fully-connected 128 ReLU unit layer followed by an output layer with 1 linear unit. The convex IB Lagrangian was computed using the nonlinear-IB. Hence, this experiment was designed to showcase that the convex IB Lagrangian can explore the IB curve in stochastic scenarios for regression tasks.
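The Gaussian-entropy estimate of *I*(*T*; *Y*) above can be sketched in a few lines (the arrays below are hypothetical toy values; base-2 logarithms are assumed so that the result is in bits):

```python
import numpy as np

def mi_estimate_bits(y_true, y_pred):
    """Gaussian-entropy estimate I(T;Y) ~ 0.5 * log2(Var(Y) / MSE), in bits."""
    mse = np.mean((y_true - y_pred) ** 2)
    return 0.5 * np.log2(np.var(y_true) / mse)

# Toy targets with Var(Y) = 1.0 and predictions with MSE = 0.25
y_true = np.array([0.0, 2.0, 0.0, 2.0])
y_pred = np.array([0.5, 1.5, 0.5, 1.5])
print(mi_estimate_bits(y_true, y_pred))  # 1.0 (half of log2(1.0 / 0.25))
```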

**Figure A6.** The top row shows the results for the normal IB Lagrangian, and the bottom row for the exponential IB Lagrangian with *η* = 1, both in the California housing dataset. In each row, from left to right: (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are the dashed orange line; (ii) *I*(*T*; *Y*) as a function of *β<sub>u</sub>*; and (iii) the compression *I*(*X*; *T*) as a function of *β<sub>u</sub>*. In all plots, the red crosses joined by a dotted line represent the values computed with the training set and the blue dots the values computed with the validation set. Moreover, all plots indicate *I*(*X*; *Y*) as the empirical value obtained by maximizing *I*(*T*; *Y*) without compression limitations, as in [26]. All values are shown in bits.

• A Classification Task on the TREC-6 Dataset [29] (Figure A7 and bottom row of Figure 3). This dataset is the six-class version of the TREC [51] dataset. It contains 5452 training and 500 test samples of text questions. Each question is labeled with one of six semantic categories based on what the answer is; namely: abbreviations, descriptions and abstract concepts, entities, human beings, locations, and numeric values. This dataset does not constitute a deterministic setting, since there are examples that could belong to more than one class and examples that are wrongly labeled (e.g., "What is a fear of parasites?" could belong to the description and abstract concepts category; however, it is labeled into the entity category), and hence *H*(*Y*|*X*) > 0. Following Ben Trevett's tutorial on sentiment analysis [52], the encoder *f*enc is composed of a 100-dimensional GloVe word embedding [53] pre-trained on 6 billion tokens, followed by a concatenation of three convolutions with kernel sizes 2, 3, and 4, respectively, and finalized with a fully-connected 128 linear unit layer (*T* ∈ R<sup>128</sup>). The decoder *f*dec is a single fully-connected 6 softmax unit layer. The convex IB Lagrangian was computed using the nonlinear-IB. Thus, this experiment intends to show an example where the classification task does not constitute a deterministic scenario, that the convex IB Lagrangian can recover the IB curve in complex stochastic tasks with complex neural network architectures, and that value convergence can be employed to obtain a specific compression value even in stochastic settings where the IB curve is unknown.

**Figure A7.** The top row shows the results for the normal IB Lagrangian, and the bottom row for the power IB Lagrangian with *α* = 0.1, both in the TREC-6 dataset. In each row, from left to right: (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are the dashed orange line; (ii) *I*(*T*; *Y*) as a function of *β<sub>u</sub>*; and (iii) the compression *I*(*X*; *T*) as a function of *β<sub>u</sub>*. In all plots, the red crosses joined by a dotted line represent the values computed with the training set and the blue dots the values computed with the validation set. Moreover, all plots indicate *H*(*Y*) = log<sub>2</sub>(6). All values are shown in bits.

#### **Appendix I. Guidelines for Selecting A Proper Function in the Convex IB Lagrangian**

When choosing the right *u* function, it is important to find the right balance between avoiding value convergence and aiming for strong convexity. Practically, this balance is found by looking at how much faster *u* grows w.r.t. the identity function.

When the aim is not to draw the IB curve but to reach a specific level of performance, we can exploit the value convergence phenomenon in order to design a stable, performance-targeted *u* function.

#### *Appendix I.1. Avoiding Value Convergence*

In order to explain this issue we are going to use the example of classification on MNIST [28], where *I*(*X*; *Y*) = *H*(*Y*) = log<sub>2</sub>(10), and again the power and exponential IB Lagrangians.

If we use Proposition 3 on both Lagrangians we obtain the bijective mapping between their Lagrange multipliers and a certain level of compression in the classification setting:
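Under the assumption that Proposition 3 yields *β<sub>u</sub>* = *f*′<sub>IB</sub>(*I*(*X*; *T*))/*u*′(*I*(*X*; *T*)), with *f*′<sub>IB</sub> = 1 in this deterministic setting (Theorem 2), these mappings would read, for the power (*u*(*r*) = *r*<sup>1+*α*</sup>) and exponential (*u*(*r*) = exp(*ηr*)) IB Lagrangians, respectively:

$$I(X;T) = \left( (1+\alpha)\beta\_u \right)^{-1/\alpha} \qquad \text{and} \qquad I(X;T) = -\frac{\log(\eta\beta\_u)}{\eta}.$$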


Hence, we can simply plot the curves of *I*(*X*; *T*) vs. *β<sub>u</sub>* for different hyperparameters *α* and *η* (see Figure A8). In this way, we can observe how increasing the growth of the function too much (e.g., increasing *α* or *η* in this case) causes many different values of *β<sub>u</sub>* to converge to very similar values of *I*(*X*; *T*). This is an issue both for drawing the curve (for obvious reasons) and for aiming at a specific performance level. Due to the nature of the estimation of the IB Lagrangian, the theoretical and practical values of *β<sub>u</sub>* that yield a specific *I*(*X*; *T*) may vary slightly (see Figure 1). Then, if we select a function that grows too fast, a small change in *β<sub>u</sub>* can result in a large change in the performance obtained.

**Figure A8.** Theoretical bijection between *I*(*X*; *T*) and different *α* from *β*<sub>*u*,min</sub> to 1.5 in the power IB Lagrangian (**top**), and different *η* in the domain *B<sub>u</sub>* in the exponential IB Lagrangian (**bottom**).

*Appendix I.2. Aiming for Strong Convexity*

**Definition A2 (***μ***-Strong Convexity).** *If a function f*(*r*) *is twice continuously differentiable and its domain is contained in the real line, then it is μ-strongly convex if f*′′(*r*) ≥ *μ* ≥ 0 ∀*r*.

Experimentally, we observed that when the growth of our function *u*(*r*) is small in the domain of interest *r* > 0, the convex IB Lagrangian does not perform well (see the first row of Figures A2 and A3). Later, we realized that this is closely related to the strength of the convexity of our function.

In Theorem 2 we required the function *u* to be strictly convex in order to ensure a unique *β<sub>u</sub>* for each value of *I*(*X*; *T*). Since in practice we do not compute the Lagrangian exactly but only an estimate of it (e.g., with the nonlinear IB [26]), we require strong convexity in order to be able to explore the IB curve.

We now look at the second derivatives of the power and exponential functions: *u*′′(*r*) = (1 + *α*)*αr*<sup>*α*−1</sup> and *u*′′(*r*) = *η*<sup>2</sup> exp(*ηr*), respectively. Here we see how both functions are inherently 0-strongly convex for *r* > 0 and *α*, *η* > 0. However, values of *α* < 1 and *η* < 1 can lead to low *μ*-strong convexity in certain domains of *r*. In particular, the case *α* < 1 is dangerous because the function approaches 0-strong convexity as *r* increases, so the power IB Lagrangian performs poorly when low values of *α* are used to target high performance.

#### *Appendix I.3. Exploiting Value Convergence*

When the aim is not to draw or explore the IB curve, but to obtain a specific level of performance, the power or exponential IB Lagrangians mentioned above might not be the best choice, due to the problems with value convergence or non-strong convexity. However, we can exploit the former in order to design a performance-targeted *u* function.

For instance, if we look at Figure A8 we can see how a modification of the exponential IB Lagrangian could result in such a function. More precisely, a shifted exponential *u*(*r*) = exp(*η*(*r* − *r*<sup>∗</sup>)), with *η* > 0 sufficiently large, converges to the compression level *r*<sup>∗</sup>. We can see this more clearly if we consider the shifted exponential IB Lagrangian L<sub>IB,sh-exp</sub>(*T*; *β*<sub>sh-exp</sub>, *η*, *r*<sup>∗</sup>) = *I*(*T*; *Y*) − *β*<sub>sh-exp</sub> exp(*η*(*I*(*X*; *T*) − *r*<sup>∗</sup>)), since then the application of Proposition 3 results in *I*(*X*; *T*) = −log(*ηβ*<sub>sh-exp</sub>/*f*′<sub>IB</sub>(*I*(*X*; *T*)))/*η* + *r*<sup>∗</sup>, where *f*′<sub>IB</sub>(*I*(*X*; *T*)) is the derivative of *f*<sub>IB</sub> evaluated at *I*(*X*; *T*). We know that *f*′<sub>IB</sub> = 1 in deterministic scenarios (Theorem 2) and that *f*′<sub>IB</sub> < 1 otherwise (see, e.g., [27]). Then, for large enough *η*, *I*(*X*; *T*) ≈ *r*<sup>∗</sup> regardless of the value of *f*′<sub>IB</sub>.

For instance, if we consider a deterministic scenario like the MNIST dataset [28] with *I*(*X*; *Y*) = *H*(*Y*) = log<sub>2</sub>(10), for *η* = 200 and *r*<sup>∗</sup> = 2 the range of Lagrange multipliers that allows the exploration of the IB curve, according to Corollary 2, is *β*<sub>sh-exp</sub> ∈ [7.54 × 10<sup>−178</sup>, 2.61 × 10<sup>171</sup>]. Furthermore, *I*(*X*; *T*) is close to 2 for many values of *β*<sub>sh-exp</sub>: for instance, *I*(*X*; *T*) = 1.974 for *β*<sub>sh-exp</sub> = 1 and *I*(*X*; *T*) = 1.963 for *β*<sub>sh-exp</sub> = 8. This ensures stability in the performance level obtained, so that small changes in the choice of *β*<sub>sh-exp</sub> do not result in significant changes in the performance (e.g., see the top row of Figure 4).

If we now consider a stochastic scenario like the TREC-6 dataset [29] with *H*(*Y*) = log<sub>2</sub>(6), for *η* = 200 and *r*<sup>∗</sup> = 16 the range of Lagrange multipliers that allows the exploration of the IB curve, according to Corollary 3, is *β*<sub>sh-exp</sub> ∈ [0, 2.76(inf<sub>Ω*x*⊂X</sub>{*β*<sub>0</sub>(Ω*<sub>x</sub>*)})<sup>−1</sup> × 10<sup>1287</sup>], where *β*<sub>0</sub> and Ω*<sub>x</sub>* are defined as in [27]. Then, unless (inf<sub>Ω*x*⊂X</sub>{*β*<sub>0</sub>(Ω*<sub>x</sub>*)})<sup>−1</sup> is of the order of 10<sup>−1287</sup>, the range of possible betas is wide. Moreover, *I*(*X*; *T*) is close to 16 for many values of *β*<sub>sh-exp</sub>. For example, for *β*<sub>sh-exp</sub> = 1, *I*(*X*; *T*) = 15.939 if *f*′<sub>IB</sub> = 0.001 at that point and *I*(*X*; *T*) = 15.973 if *f*′<sub>IB</sub> = 0.9; and for *β*<sub>sh-exp</sub> = 8, *I*(*X*; *T*) = 15.929 if *f*′<sub>IB</sub> = 0.001 at that point and *I*(*X*; *T*) = 15.963 if *f*′<sub>IB</sub> = 0.9. Hence, as in the deterministic scenario, the performance level obtained is stable under changes in the choice of *β*<sub>sh-exp</sub> (e.g., see the bottom row of Figure 4).
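The stability claims above can be checked numerically by evaluating the mapping *I*(*X*; *T*) = −log(*ηβ*<sub>sh-exp</sub>/*f*′<sub>IB</sub>)/*η* + *r*<sup>∗</sup> directly (the function below is an illustrative sketch, not part of the original code):

```python
import numpy as np

def shifted_exp_compression(beta, eta, r_star, f_prime=1.0):
    """Compression level predicted for the shifted exponential IB Lagrangian.

    Solves beta * eta * exp(eta * (I - r_star)) = f_prime for I = I(X; T),
    i.e., I = -log(eta * beta / f_prime) / eta + r_star, where f_prime is
    the slope of the IB curve at the solution (1 in deterministic settings).
    """
    return -np.log(eta * beta / f_prime) / eta + r_star

# MNIST-like deterministic setting (f'_IB = 1): eta = 200, r* = 2
print(round(shifted_exp_compression(1.0, 200, 2), 3))  # 1.974
print(round(shifted_exp_compression(8.0, 200, 2), 3))  # 1.963

# TREC-6-like stochastic setting: eta = 200, r* = 16, unknown slope f'_IB
print(round(shifted_exp_compression(1.0, 200, 16, f_prime=0.001), 3))  # 15.939
print(round(shifted_exp_compression(8.0, 200, 16, f_prime=0.9), 3))    # 15.963
```

Sweeping `beta` over several orders of magnitude shows how weakly the resulting compression level depends on it once *η* is large.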

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Probabilistic Ensemble of Deep Information Networks**

#### **Giulio Franzese \* and Monica Visintin**

Electronics and Telecommunications, Politecnico di Torino, 10100 Torino, Italy; monica.visintin@polito.it **\*** Correspondence: giulio.franzese@polito.it

Received: 22 November 2019; Accepted: 13 January 2020; Published: 14 January 2020

**Abstract:** We describe a classifier made of an ensemble of decision trees, designed using information theory concepts. In contrast to algorithms C4.5 or ID3, the tree is built from the leaves instead of the root. Each tree is made of nodes trained independently of the others, to minimize a local cost function (information bottleneck). The trained tree outputs the estimated probabilities of the classes given the input datum, and the outputs of many trees are combined to decide the class. We show that the system is able to provide results comparable to those of the tree classifier in terms of accuracy, while it shows many advantages in terms of modularity, reduced complexity, and memory requirements.

**Keywords:** information theory; information bottleneck; classifier; decision tree; ensemble

#### **1. Introduction**

Supervised classification is at the core of many modern applications of machine learning. The history of classifiers is rich and many variants have been proposed, such as decision trees, logistic regression, Bayesian networks, and neural networks (for an overview of general methods, see [1–3]). Despite the power of modern deep learning, for many problems involving categorical structured datasets, decision trees [4–7] or Bayesian networks [8–10] usually outperform neural network based approaches.

Decision trees are particularly interesting because they can be easily interpreted. Various types of tree classifiers can be distinguished according to the metric used for the iterative construction and selection of features [4]: popular tree classifiers, such as ID3 and C4.5 [6,7], are based on information-theoretic metrics. However, it is known that the greedy splitting procedure at each node can be sub-optimal [11], and that decision trees are prone to overfitting when dealing with small datasets. When a classifier is not strong enough, there are, roughly speaking, two possibilities: choosing a more sophisticated classifier or combining multiple "weak" classifiers [12,13]. This second approach is usually called the *ensemble* method. By using multiple classifiers simultaneously, we improve classification performance at the price of a loss of interpretability.

The so-called "information bottleneck", described by Tishby and Zaslavsky [14] and Tishby et al. [15], was proposed in [16] to build a classifier (Deep Information Network, DIN) with a tree topology that compresses the input data and generates the estimated class. DINs [16] are based on the so-called information node that, using the input samples of a feature *Xin*, generates samples of a new feature *Xout*, according to the conditional probabilities *<sup>P</sup>*(*Xout* <sup>=</sup> *<sup>j</sup>*|*Xin* <sup>=</sup> *<sup>i</sup>*) obtained by minimizing the mutual information I(*Xin*; *Xout*), with the constraint of a given mutual information I(*Xout*;*Y*) between *Xout* and the target/class *Y* (information bottleneck [14]). The outputs of two or more nodes are combined, without information loss, to generate samples of a new feature passed to a subsequent information node. The final node (root) directly outputs the class of each input datum. The tree structure of the network is thus built from the leaves, whereas C4.5 and ID3 build it from the root.

We here propose an improved implementation of the DIN scheme in [16] that only requires the propagation through the tree of small matrices containing conditional probabilities. Notice that the previous version of the DIN was stochastic, while the one we propose here is deterministic. Moreover, we use an ensemble (e.g., [12,13]) of trees with randomly permuted features and weigh their outputs to improve classification accuracy.

The proposed architecture has several advantages in terms of:


With respect to the DINs in [16], the main difference is that samples of the random variables in the inner layers of the tree are never generated, which is an advantage in the case of large datasets. On the other hand, an assumption of statistical independence (see Section 2.3) is necessary to build the probability matrices, and this might be seen as a limitation of the newly proposed method. However, experimental results (see Section 5) show that this approximation does not compromise the performance.

We underline similarities and differences of the proposed classifier with respect to the methods described in [6,7], since they are among the best performing ones. When using decision trees, as well as DINs, categorical and missing data are easily managed, but continuous random variables are not: quantization of these input features is necessary in a pre-processing phase, and it can be performed as in C4.5 [6], using other heuristics, or manually. Concerning differences, instead, the first one is that normally a hierarchical decision tree is built starting from the root and splitting at each node, whereas we here propose a way to build a tree starting from the leaves. The topology of our network implies that, once the initial ordering of the features has been set, there is no need, after each node is trained, to search for the best possible next node. The second important difference is that we do not directly use mutual information as a metric for building the tree, but base our algorithm on the information bottleneck principle [14,15,17–21]. This allows us to extract all the relevant information (the *sufficient statistic*) while removing the redundant one, which helps avoid overfitting. As in [12,13], we use an ensemble method. We choose the simplest possible form of ensemble combination: we train independently many structurally equivalent networks, using the same single dataset but permuting the order of the features, and produce a weighted average of the outputs based on a simple rule described in Section 3.1. Notice that we use a one-shot procedure, i.e., we do not iterate more than once over the entire dataset, and we exploit techniques similar to those in [22,23]. We leave the study of more sophisticated techniques to future works.

Sections 2 and 3 describe the structure of the DIN and how it works in more detail, Section 4 gives some insight into its theoretical properties, and Section 5 comments on the results obtained with standard datasets. Conclusions are finally drawn in Section 6.

#### **2. The DIN Architecture and Its Training**

The information network is made of input nodes (Section 2.1), information nodes (Section 2.2), and combiners joined together through a tree network described in Section 2.3. Moreover, an ensemble of *Nmach* trees is built, based on which the final estimated class is produced (Section 3.1). In [16], the input nodes are not present, the information node has a slightly different role, the combiners are much simpler than those described here, and just one tree was considered. As already stated, the new version of the DIN is more efficient when a large dataset with relatively few features is analyzed.

In the following, it is assumed that all the features take a finite number of discrete values; a case of continuous random variables is discussed in Section 5.2.

It is also assumed that *Ntrain* points are used in the training phase, *Ntest* points in the testing phase, and that *D* features are present. The *n*th training point corresponds to one of *Nclass* possible classes.

#### *2.1. The Input Node*

Each input node (see Figure 1) has two input vectors:


**Figure 1.** Schematic representation of an input node: the inputs are two vectors and the outputs are matrices that statistically describe the random variables *Xin* and *Y*.

The notation we use in the equations below is the following: *Y*, *Xin* represent random variables; **<sup>y</sup>**(*n*) and **<sup>x</sup>***in*(*n*) are the *<sup>n</sup>*th elements of vectors **<sup>y</sup>** and **<sup>x</sup>***in*, respectively; and **<sup>1</sup>**(*c*) is equal to 1 if *<sup>c</sup>* is true, and is otherwise equal to 0. Using Laplace smoothing [2], the input node estimates the following probabilities (the probability mass function of *Y* in Equation (1) is common to all the input nodes: it can be evaluated only by the first one and passed to the others):

$$\hat{P}(Y=m) \simeq \frac{1 + \sum\_{n=0}^{N\_{train}-1} \mathbf{1}(\mathbf{y}(n) = m)}{N\_{train} + N\_{class}} \quad m = 0, \ldots, N\_{class} - 1 \tag{1}$$

$$\hat{P}(X\_{in} = i) \simeq \frac{1 + \sum\_{n=0}^{N\_{train}-1} \mathbf{1}(\mathbf{x}\_{in}(n) = i)}{N\_{train} + N\_{in}}, \quad i = 0, \ldots, N\_{in} - 1 \tag{2}$$

$$\hat{P}(Y=m, X\_{in}=i) \simeq \frac{1 + \sum\_{n=0}^{N\_{train}-1} \mathbf{1}(\mathbf{y}(n) = m)\mathbf{1}(\mathbf{x}\_{in}(n) = i)}{N\_{train} + N\_{class}N\_{in}} \tag{3}$$

From basic application of probability rules, *<sup>P</sup>*ˆ(*<sup>Y</sup>* <sup>=</sup> *<sup>m</sup>*|*Xin* <sup>=</sup> *<sup>i</sup>*) and *<sup>P</sup>*ˆ(*Xin* <sup>=</sup> *<sup>i</sup>*|*<sup>Y</sup>* <sup>=</sup> *<sup>m</sup>*) are then computed. From now on, for simplicity, we denote all the estimated probabilities *P*ˆ simply as *P*.

All the above probabilities can be organized in matrices defined as follows:

$$\mathbf{P}\_Y \in \mathbb{R}^{1 \times N\_{class}}, \quad \mathbf{P}\_Y(m) = P(Y = m) \tag{4}$$

$$\mathbf{P}\_{X\_{in}} \in \mathbb{R}^{1 \times N\_{in}}, \quad \mathbf{P}\_{X\_{in}}(i) = P(X\_{in} = i) \tag{5}$$

$$\mathbf{P}\_{X\_{in}|Y} \in \mathbb{R}^{N\_{class} \times N\_{in}}, \quad \mathbf{P}\_{X\_{in}|Y}(m, i) = P(X\_{in} = i | Y = m) \tag{6}$$

$$\mathbf{P}\_{Y|X\_{in}} \in \mathbb{R}^{N\_{in} \times N\_{class}}, \quad \mathbf{P}\_{Y|X\_{in}}(i, m) = P(Y = m | X\_{in} = i) \tag{7}$$

Note that vectors **x***in* and **y** are not needed by the subsequent elements in the tree; only the input nodes have access to them.

Notice also that the following equalities hold:

$$\mathbf{P}\_{X\_{in}} = \mathbf{P}\_Y \mathbf{P}\_{X\_{in}|Y} \tag{8}$$

$$\mathbf{P}\_Y = \mathbf{P}\_{X\_{in}} \mathbf{P}\_{Y|X\_{in}} \tag{9}$$
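As a sketch, these estimates and the identities of Equations (8) and (9) can be checked numerically. The data below are hypothetical; the marginals are taken from the smoothed joint of Equation (3), so that the identities hold exactly (the separately smoothed Equations (1) and (2) would differ from them by small smoothing terms):

```python
import numpy as np

# Hypothetical training vectors: N_class = 3 classes, N_in = 4 input symbols
y = np.array([0, 0, 1, 1, 2, 2, 0, 1])
x_in = np.array([0, 1, 2, 3, 0, 1, 2, 3])
n_class, n_in, n_train = 3, 4, len(y)

# Laplace-smoothed joint estimate, Eq. (3)
joint = np.ones((n_class, n_in))                 # the "+1" smoothing term
for yn, xn in zip(y, x_in):
    joint[yn, xn] += 1
joint /= n_train + n_class * n_in

# Marginals obtained from the smoothed joint
p_y = joint.sum(axis=1)                          # P_Y,    shape (N_class,)
p_x = joint.sum(axis=0)                          # P_X_in, shape (N_in,)

# Conditionals by Bayes' rule, Eqs. (6) and (7)
p_x_given_y = joint / p_y[:, None]               # N_class x N_in
p_y_given_x = joint.T / p_x[:, None]             # N_in x N_class

# The consistency identities of Eqs. (8) and (9) then hold exactly
assert np.allclose(p_x, p_y @ p_x_given_y)
assert np.allclose(p_y, p_x @ p_y_given_x)
```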

#### *2.2. The Information Node*

The information node is schematically shown in Figure 2: the input discrete random variable *Xin* is stochastically mapped into another discrete random variable *Xout* (see [16] for further details) through probability matrices:


Compression (source encoding) is obtained by setting *Nout* < *Nin*.

In the training phase, the information node generates the conditional probability mass function that satisfies the following equation (see [14]):

$$P(X\_{\rm out} = j | X\_{\rm in} = i) = \frac{1}{Z(i; \beta)} P(X\_{\rm out} = j) e^{-\beta d(i, j)}, \quad i = 0, \dots, N\_{\rm in} - 1, j = 0, \dots, N\_{\rm out} - 1 \tag{10}$$

where

• *P*(*Xout* = *j*) is the probability mass function of the output random variable *Xout*

$$P(X\_{\text{out}} = j) = \sum\_{i=0}^{N\_{\text{in}}-1} P(X\_{\text{in}} = i) P(X\_{\text{out}} = j | X\_{\text{in}} = i), \quad j = 0, \dots, N\_{\text{out}} - 1 \tag{11}$$

• *d*(*i*, *j*) is the Kullback–Leibler divergence

$$d(i,j) = \sum\_{m=0}^{N\_{class}-1} P(Y=m|X\_{in}=i) \log\_2 \frac{P(Y=m|X\_{in}=i)}{P(Y=m|X\_{out}=j)}$$

$$= \mathbb{KL}(P(Y|X\_{in}=i) || P(Y|X\_{out}=j))\tag{12}$$

and

$$P(Y=m|X\_{out}=j) = \sum\_{i=0}^{N\_{in}-1} P(Y=m|X\_{in}=i)P(X\_{in}=i|X\_{out}=j),$$

$$m = 0, \dots, N\_{class}-1, j = 0, \dots, N\_{out}-1 \tag{13}$$


• *Z*(*i*; *β*) is a normalization factor ensuring that

$$\sum\_{j=0}^{N\_{out}-1} P(X\_{out} = j | X\_{in} = i) = 1. \tag{14}$$

The probabilities *<sup>P</sup>*(*Xout* <sup>=</sup> *<sup>j</sup>*|*Xin* <sup>=</sup> *<sup>i</sup>*) can be iteratively found using the Blahut–Arimoto algorithm [14,24,25].
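A minimal sketch of this self-consistent iteration, assuming strictly positive probabilities (e.g., after Laplace smoothing) and using base-2 exponentials so that the KL divergences of Equation (12) stay in bits (this only rescales *β* with respect to the *e*<sup>−*βd*</sup> of Equation (10)):

```python
import numpy as np

def info_bottleneck(p_x, p_y_given_x, n_out, beta, n_iter=300, seed=0):
    """Self-consistent iteration of Eqs. (10), (11) and (13), Blahut-Arimoto style.

    p_x         : (N_in,) marginal of X_in, strictly positive
    p_y_given_x : (N_in, N_class), row i holds P(Y | X_in = i), strictly positive
    Returns the encoder P(X_out | X_in) as an (N_in, n_out) row-stochastic matrix.
    """
    rng = np.random.default_rng(seed)
    q = rng.random((len(p_x), n_out))
    q /= q.sum(axis=1, keepdims=True)              # random row-stochastic start
    for _ in range(n_iter):
        p_t = p_x @ q                              # Eq. (11): P(X_out = j)
        p_x_given_t = (q * p_x[:, None]) / p_t     # Bayes: P(X_in = i | X_out = j)
        p_y_given_t = p_x_given_t.T @ p_y_given_x  # Eq. (13): P(Y = m | X_out = j)
        # Eq. (12): d(i, j) = KL(P(Y|X_in=i) || P(Y|X_out=j)), in bits
        d = (p_y_given_x[:, None, :]
             * np.log2(p_y_given_x[:, None, :] / p_y_given_t[None, :, :])).sum(axis=2)
        q = p_t * np.exp2(-beta * d)               # Eq. (10), up to Z(i; beta)
        q /= q.sum(axis=1, keepdims=True)          # Eq. (14): normalize each row
    return q

# Toy problem: 4 input symbols, 3 classes, compress to n_out = 2 output symbols
p_x = np.full(4, 0.25)
p_y_given_x = np.array([[0.8, 0.1, 0.1],
                        [0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.1, 0.1, 0.8]])
encoder = info_bottleneck(p_x, p_y_given_x, n_out=2, beta=5.0)
```

With a large *β* the iteration favors preserving I(*Y*; *X<sub>out</sub>*), so inputs with similar class-conditional distributions tend to be mapped to the same output symbol.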

Equation (10) solves the information bottleneck: it minimizes the mutual information I(*Xin*; *Xout*) under the constraint of a given mutual information I(*Y*; *Xout*). In particular, Equation (10) is the solution of the minimization of the Lagrangian

$$\mathcal{L} = \mathbb{I}(X\_{in}; X\_{out}) - \beta \mathbb{I}(Y; X\_{out}).\tag{15}$$

If the Lagrange multiplier *β* is increased, then the constraint is privileged and the information node tends to maximize the mutual information between its output *Xout* and the class *Y*; if *β* is reduced, then minimization of I(*Xin*; *Xout*) is obtained (compression). The information node must actually balance compression from *Xin* to *Xout* and propagation of the information about *Y*. In our implementation, the compression is also imposed by the fact that the cardinality of the output alphabet *Nout* is smaller than that of the input alphabet *Nin*.

The role of the information node is thus that of finding the conditional probability matrices

$$\mathbf{P}\_{X\_{out}|X\_{in}} \in \mathbb{R}^{N\_{\rm in} \times N\_{\rm out}}, \quad \mathbf{P}\_{X\_{out}|X\_{in}}(i,j) = P(X\_{out} = j | X\_{in} = i) \tag{16}$$

$$\mathbf{P}\_{Y|X\_{\text{out}}} \in \mathbb{R}^{N\_{\text{out}} \times N\_{\text{class}}}, \quad \mathbf{P}\_{Y|X\_{\text{out}}}(j, m) = P(Y = m | X\_{\text{out}} = j) \tag{17}$$

$$\mathbf{P}\_{X\_{out}} \in \mathbb{R}^{1 \times N\_{out}}, \quad \mathbf{P}\_{X\_{out}}(j) = P(X\_{out} = j) \tag{18}$$

**Figure 2.** Schematic representation of an information node, showing the input and output matrices.

#### *2.3. The Combiner*

Consider the case depicted in Figure 3, where the two information nodes *a* and *b* feed a combiner (shown as a triangle) that generates the input of the information node *c*. The random variables *Xout*,*<sup>a</sup>* and *Xout*,*b*, both having alphabet with cardinality *N*1, are combined together as

$$X\_{in,c} = X\_{out,a} + N\_1 \ X\_{out,b} \tag{19}$$

that has an alphabet with cardinality *N*1 × *N*1.

The combiner actually does not generate *Xin*,*c*; it simply evaluates the probability matrices that describe *Xin*,*c* and *Y*. In particular, the information node *c* needs **P***Xin*,*c*|*Y*, which can be evaluated **assuming** that *Xout*,*a* and *Xout*,*b* are conditionally independent given *Y* (notice that in the implementation of [16] this assumption was not necessary):

$$\begin{split} P(X\_{in,c} = k | Y = m) &= P(X\_{out,a} = k\_{a}, X\_{out,b} = k\_{b} | Y = m) \\ &= P(X\_{out,a} = k\_{a} | Y = m) P(X\_{out,b} = k\_{b} | Y = m) \end{split} \tag{20}$$

where *k* = *ka* + *N*1*kb*. In particular, the *m*th row of **P***Xin*,*c*|*Y* is the Kronecker product of the *m*th rows of **P***Xout*,*a*|*Y* and **P***Xout*,*b*|*Y*

$$\mathbf{P}\_{X\_{in,c}|Y}(m,\cdot) = \mathbf{P}\_{X\_{out,a}|Y}(m,\cdot) \otimes \mathbf{P}\_{X\_{out,b}|Y}(m,\cdot) \quad m = 0,\dots,N\_{class}-1 \tag{21}$$

*Entropy* **2020**, *22*, 100

(here **A**(*m*, ·) denotes the *m*th row of matrix **A**). The probability vector **P***Xin*,*c* can be evaluated considering that

$$P(X\_{in, \varepsilon} = k) = \sum\_{m=0}^{N\_{class}-1} P(X\_{in, \varepsilon} = k, Y = m) = \sum\_{m=0}^{N\_{class}-1} P(X\_{in, \varepsilon} = k | Y = m) P(Y = m) \tag{22}$$

so that

$$\mathbf{P}\_{X\_{in,c}} = \mathbf{P}\_Y \mathbf{P}\_{X\_{in,c}|Y} \tag{23}$$

At this point, matrix **P***Y*|*Xin*,*<sup>c</sup>* can be evaluated element by element since

$$P(Y=m|X\_{in,c}=k) = \frac{P(X\_{in,c}=k|Y=m)P(Y=m)}{P(X\_{in,c}=k)},$$

$$m = 0, \dots, N\_{class} - 1, \ k = 0, \dots, N\_1 \times N\_1 - 1 \tag{24}$$

It is straightforward to extend the equations to the case in which *Xin*,*<sup>a</sup>* and *Xin*,*<sup>b</sup>* have different cardinalities.
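A small numerical sketch of the combiner (Equations (20)–(24)), with hypothetical cardinalities and probabilities:

```python
import numpy as np

n_class, n1 = 3, 2   # hypothetical: X_out,a and X_out,b both take n1 values

# Row-stochastic conditionals P(X_out | Y) of the two feeding info nodes
p_a_given_y = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
p_b_given_y = np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])
p_y = np.array([0.5, 0.3, 0.2])

# Eq. (21): row m of P(X_in,c | Y) is the Kronecker product of the m-th rows.
# With k = k_a + n1 * k_b, the index k_a varies fastest, hence kron(row_b, row_a).
p_c_given_y = np.stack([np.kron(p_b_given_y[m], p_a_given_y[m])
                        for m in range(n_class)])

# Eq. (23): P(X_in,c) = P_Y P(X_in,c | Y)
p_c = p_y @ p_c_given_y

# Eq. (24): Bayes' rule gives P(Y | X_in,c)
p_y_given_c = (p_c_given_y * p_y[:, None]).T / p_c[:, None]

assert np.allclose(p_c_given_y.sum(axis=1), 1.0)  # each row is a distribution
assert np.allclose(p_y_given_c.sum(axis=1), 1.0)
```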

**Figure 3.** Sub-network: *Xin*,*a*, *Xout*,*a*, *Xin*,*b*, *Xout*,*b*, *Xin*,*c*, and *Xout*,*<sup>c</sup>* are all random variables; *N*<sup>0</sup> is the number of values taken by *Xin*,*<sup>a</sup>* and *Xin*,*b*; *N*<sup>1</sup> is the number of values taken by *Xout*,*<sup>a</sup>* and *Xout*,*b*; and *N*<sup>2</sup> is the number of values taken by *Xout*,*c*.

#### *2.4. The Tree Architecture*

Figure 4 shows an example of a DIN, where we assume that the dataset has *D* = 8 features and that training is thus performed using a matrix **X***train* with *Ntrain* rows and *D* = 8 columns, with a corresponding class vector **y**. The *k*th column **x**(*k*) of matrix **X***train* feeds, together with vector **y**, the input node *I*(*k*), *k* = 0, . . . , *D* − 1.

**Figure 4.** Example of a DIN for *D* = 8: the input nodes are represented as rectangles, the info nodes as circles, and the combiners as triangles. The numbers inside each circle identify the node (position inside the layer and layer number), *N*<sup>(*k*)</sup><sub>*in*</sub> is the number of values taken by the input of the info node at layer *k*, and *N*<sup>(*k*)</sup><sub>*out*</sub> is the number of values taken by the output of the info node at layer *k*. In this example, the info nodes at a given layer all have the same input and output cardinalities.

Information node $(k, 0)$ at layer 0 processes the probability matrices generated by the input node $I(k)$, with $N^{(0)}_{in}$ possible values of $X_{in}(k,0)$, and evaluates the conditional probability matrices with $N^{(0)}_{out}$ possible values of $X_{out}(k,0)$, using the algorithm described in Section 2.2. The outputs of info nodes $(2k,0)$ and $(2k+1,0)$ are given to a combiner that outputs the probability matrices for $X_{in}(k,1)$, with an alphabet of cardinality $N^{(1)}_{in} = N^{(0)}_{out} \times N^{(0)}_{out}$, using the equations described in Section 2.3. The sequence of combiners and information nodes is iterated, decreasing the number of information nodes from layer to layer, until the final root node is obtained. In the previous implementation of the DINs in [16], the root information node outputs the estimated class of the input, so the output cardinality of the root info node had to be equal to $N_{class}$. In the current implementation, this cardinality can be larger than $N_{class}$, since classification is based on the output probability matrix $\mathbf{P}_{Y|X_{out}}$.

For a number of features $D = 2^d$, the number of layers is $d$. If $D$ is not a power of 2, combiners with 3 or more inputs can be used (the changes to the equations in Section 2.3 are straightforward, since a combiner with three inputs can be seen as two cascaded combiners with two inputs each).

The overall binary topology proposed in Figure 4 requires a number of information nodes equal to

$$N\_{\text{nodes}} = D + \frac{D}{2} + \frac{D}{4} + \dots + 2 + 1 = 2D - 1 \tag{25}$$

and a number of combiners equal to

$$N\_{comb} = \frac{D}{2} + \frac{D}{4} + \dots + 2 + 1 = D - 1\tag{26}$$

All the info nodes run exactly the same algorithm and all the combiners are equal, apart from the input/output alphabet cardinalities. If the cardinalities of the alphabets are all equal, i.e., $N^{(i)}_{in}$ and $N^{(i)}_{out}$ do not depend on the layer $i$, then all the nodes and all the combiners are exactly equal, which might help in a possible hardware implementation; in this case, the number of parameters of the network is $(N_{out} - 1) \times N_{in} \times N_{nodes}$.
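Equations (25) and (26) and the parameter count above can be checked with a tiny sketch (homogeneous cardinalities assumed, as in the text):

```python
# Node, combiner, and parameter counts for a binary DIN (Equations (25)-(26));
# homogeneous cardinalities N_in, N_out across layers are assumed.
def din_counts(D, N_in, N_out):
    n_nodes = 2 * D - 1                      # Equation (25)
    n_comb = D - 1                           # Equation (26)
    n_params = (N_out - 1) * N_in * n_nodes  # parameters of the whole network
    return n_nodes, n_comb, n_params

n_nodes, n_comb, n_params = din_counts(8, 4, 2)  # the D = 8 example of Figure 4
```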

The network performance depends on how the features are coupled in subsequent layers: a random shuffling of the columns of matrix $\mathbf{X}_{train}$ can provide significantly different results. This property is used in Section 3.1 for building the ensemble of networks.

#### *2.5. A Note on Computational Complexity and Memory Requirements*

The modular structure of the proposed method has several advantages in terms of both memory footprint and computational cost. The topology considered in this explanation is binary, similar to the one depicted in Figure 4. For simplicity, we furthermore take the cardinalities of the $D$ input features all equal to $N_{in}$, and the input/output cardinalities of the information nodes at subsequent layers to be fixed constants $N^*_{in}$ and $N^*_{out}$, with $N^*_{in} = (N^*_{out})^2$. As we show in the experiments (Section 5), small values for $N^*_{in}$ and $N^*_{out}$, such as 2, 3, or 4, are sufficient in the considered cases. Straightforward generalizations are possible for inhomogeneous cases.

At the first layer (the input node layer), each of the $D$ input nodes stores the joint probabilities of the target variable $Y$ and its input feature. Each node thus includes a simple counter that fills the probability matrix of dimension $N_{in} \times N_{class}$. Both the computational cost and the memory requirements for this first stage are the same as for the Naive Bayes algorithm. Notice that, from the memory point of view, it is not necessary to store all the training data, but just counters with the number of joint occurrences of features and classes. If new data are observed after training, it is in fact sufficient to update the counters and renormalize the values to obtain the updated probability matrices. In this paper, we do not cover online learning or possible strategies to reduce the computational complexity in such a scenario.
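The counter-based input node described above can be sketched as follows (the feature/class cardinalities are illustrative); the update rule also shows why an online refresh only needs to touch the counters:

```python
import numpy as np

class InputNode:
    """Estimates the joint probability matrix P(X_in = k, Y = m) by counting."""

    def __init__(self, n_in, n_class):
        self.counts = np.zeros((n_in, n_class))

    def update(self, x, y):
        self.counts[x, y] += 1  # new data only touch the counters

    @property
    def P_XY(self):
        # Renormalize the counters to obtain the current probability matrix.
        return self.counts / self.counts.sum()

node = InputNode(n_in=3, n_class=2)
for x, y in [(0, 0), (0, 0), (1, 1), (2, 1)]:
    node.update(x, y)
```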

At the second layer (the first information node layer), each node receives as input the joint probability matrix of feature and target variable and runs the Blahut–Arimoto algorithm. The internal memory requirement of this node is the space needed to store two probability matrices of dimensions $N^*_{in} \times N_{class}$ and $N^*_{in} \times N^*_{out}$, respectively. The cost per iteration of Blahut–Arimoto is dominated by multiplications of matrices of sizes $N^*_{in} \times N^*_{out}$ and $N^*_{in} \times N_{class}$, so the complexity scales with the number of classes of the considered classification problem. To the best of our knowledge, the convergence rate of the Blahut–Arimoto algorithm applied to information bottleneck problems is unknown. In this study, however, we found empirically that, for the considered datasets, 5–6 iterations per node are sufficient, as discussed in Section 5.5.
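Section 2.2 is not reproduced here, but the iterative updates an info node runs are, in essence, the self-consistent information bottleneck equations of Tishby et al.; the following is a generic sketch of those iterations (not the authors' exact implementation), mapping the joint matrix $\mathbf{P}_{X_{in},Y}$ to $\mathbf{P}_{X_{out}|X_{in}}$:

```python
import numpy as np

def ib_iterations(P_XY, n_out, beta, n_iter=6, seed=0):
    """Iterative IB sketch: returns P(X_out | X_in), shape (N_in, n_out).

    P_XY is the joint probability matrix P(X_in, Y), shape (N_in, N_class).
    """
    rng = np.random.default_rng(seed)
    p_x = P_XY.sum(axis=1)                    # P(x)
    p_y_x = P_XY / p_x[:, None]               # P(y | x), rows sum to 1
    q_t_x = rng.random((len(p_x), n_out))     # random init of P(t | x)
    q_t_x /= q_t_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        q_t = p_x @ q_t_x                     # marginal P(t)
        q_y_t = (q_t_x * p_x[:, None]).T @ p_y_x / q_t[:, None]  # P(y | t)
        # KL(p(y|x) || q(y|t)) for every (x, t) pair:
        log_ratio = np.log(p_y_x[:, None, :] + 1e-12) - np.log(q_y_t[None, :, :] + 1e-12)
        kl = (p_y_x[:, None, :] * log_ratio).sum(axis=2)          # shape (N_in, n_out)
        q_t_x = q_t[None, :] * np.exp(-beta * kl)                 # self-consistent update
        q_t_x /= q_t_x.sum(axis=1, keepdims=True)
    return q_t_x
```

With 5–6 iterations, as reported in Section 5.5, inputs with identical class-conditional distributions are already mapped to identical output distributions.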

Each combiner processes the matrices generated by two information nodes: the memory requirement is zero, and the computational cost is roughly $N_{class}$ Kronecker products between rows of probability matrices. Since for ease of explanation we chose $N^*_{in} = (N^*_{out})^2$, the output probability matrix again has dimensions $N^*_{in} \times N_{class}$.

The overall memory requirement and computational complexity (for a single DIN) thus scale as $D$ times the requirements of an input node, $2D-1$ times the requirements of an information node, and $D-1$ times the requirements of a combiner. To complete the discussion, remember that a further multiplicative factor of $N_{mach}$ is required to take into account that we are considering an ensemble of networks (actually, at the first layer, the set of input nodes can be shared by the different architectures, since only the relative position of the input nodes changes; see Section 3.1).

#### **3. The Running Phase**

During the running phase, the columns of matrix **X** with *N* rows and *D* columns are used as inputs. Assume again that the network architecture is that depicted in Figure 4 with *D* = 8, and consider the *n*th input row **X**(*n*, :).

In particular, assume that **X**(*n*, 2*k*) = *i* and **X**(*n*, 2*k* + 1) = *j*. Then,

1. info node $(2k, 0)$ passes the probability vector $\mathbf{p}_a = \mathbf{P}_{X_{out}(2k,0)|X_{in}(2k,0)}(i,:)$ (its $i$th row) to the combiner; $\mathbf{p}_a$ stores the conditional probabilities $P(X_{out}(2k,0) = h \mid \mathbf{X}(n,2k) = i)$ for $h = 0, \dots, N^{(0)}_{out}-1$;
2. info node $(2k+1, 0)$ passes the probability vector $\mathbf{p}_b = \mathbf{P}_{X_{out}(2k+1,0)|X_{in}(2k+1,0)}(j,:)$ (its $j$th row) to the combiner; $\mathbf{p}_b$ stores the conditional probabilities $P(X_{out}(2k+1,0) = h \mid \mathbf{X}(n,2k+1) = j)$ for $h = 0, \dots, N^{(0)}_{out}-1$;
3. the combiner generates the Kronecker product

$$\mathbf{p}_c = \mathbf{p}_a \otimes \mathbf{p}_b, \tag{27}$$

which stores the conditional probabilities $P(X_{in}(k,1) = s \mid \mathbf{X}(n,2k) = i, \mathbf{X}(n,2k+1) = j)$ for $s = 0, \dots, N^{(1)}_{in}-1$, where $N^{(1)}_{in} = N^{(0)}_{out} \times N^{(0)}_{out}$;
4. info node $(k, 1)$ generates the probability vector

$$\mathbf{p} = \mathbf{p}_c\, \mathbf{P}_{X_{out}(k,1)|X_{in}(k,1)} \tag{28}$$

which stores the conditional probabilities $P(X_{out}(k,1) = r \mid \mathbf{X}(n,2k) = i, \mathbf{X}(n,2k+1) = j)$ for $r = 0, \dots, N^{(1)}_{out}-1$. The procedure is iterated through the subsequent layers, until the root info node $(0,3)$ generates the vector


$$\mathbf{p}_{out}(n) = \mathbf{p}\, \mathbf{P}_{X_{out}(0,3)|X_{in}(0,3)}\, \mathbf{P}_{Y|X_{out}(0,3)} \tag{29}$$

which stores the estimated probabilities $P(Y = m \mid \mathbf{X}(n,:))$ for $m = 0, \dots, N_{class}-1$.

According to the MAP criterion, the estimated class of the input point **X**(*n*, :) is

$$\hat{Y}(n) = \arg\max \mathbf{p}\_{out}(n) \tag{30}$$

but we propose to use an improved method, as described in Section 3.1.
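The running-phase steps above, for one sub-network, amount to a row selection, a Kronecker product, and a matrix product; a sketch with random row-stochastic matrices and hypothetical cardinalities:

```python
import numpy as np

rng = np.random.default_rng(0)

def row_stochastic(rows, cols):
    """Random conditional probability matrix: each row is a distribution."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

N0_in, N0_out, N1_out = 3, 2, 2
P_a = row_stochastic(N0_in, N0_out)              # info node (2k, 0)
P_b = row_stochastic(N0_in, N0_out)              # info node (2k+1, 0)
P_k1 = row_stochastic(N0_out * N0_out, N1_out)   # info node (k, 1)

i, j = 1, 2                         # observed feature values X(n, 2k), X(n, 2k+1)
p_a, p_b = P_a[i, :], P_b[j, :]     # rows selected by the inputs
p_c = np.kron(p_a, p_b)             # Equation (27): combiner output
p = p_c @ P_k1                      # Equation (28): next info node
```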

#### *3.1. The DIN Ensemble*

At the end of the training phase, when all the conditional matrices have been generated in each information node and combiner, the network is run with input matrix $\mathbf{X}_{train}$ ($N_{train}$ rows and $D$ columns), and the probability vector $\mathbf{p}_{out}$ is obtained for each input point $\mathbf{X}_{train}(n,:)$. As anticipated at the end of Section 2.4, the DIN classification accuracy depends on how the input features are combined. By permuting the columns of $\mathbf{X}_{train}$, a different probability vector $\mathbf{p}_{out}$ is typically obtained. We thus propose to generate an ensemble of DINs by randomly permuting the columns of $\mathbf{X}_{train}$, and then combine their outputs.

Since in the training phase $\mathbf{y}(n)$ is known, it is possible to get for each DIN $v$ the probability vector $\mathbf{p}^v_{out}(n)$; ideally, $\mathbf{p}^v_{out}(n, \mathbf{y}(n))$, the estimated probability corresponding to the true class $\mathbf{y}(n)$, should be equal to one. The weights

$$w^v = \frac{\sum_{n=0}^{N_{train}-1} \mathbf{p}^v_{out}(n, \mathbf{y}(n))}{\sum_{j=0}^{N_{mach}-1} \sum_{n=0}^{N_{train}-1} \mathbf{p}^j_{out}(n, \mathbf{y}(n))} \tag{31}$$

thus represent the reliability of the *v*th DIN.

In the running phase, feeding the $N_{mach}$ machines each with the correctly permuted vector $\mathbf{X}(n,:)$, the final estimated probability vector is determined as

$$\hat{\mathbf{p}}_{ens}(n) = \sum_{v=0}^{N_{mach}-1} w^v\, \hat{\mathbf{p}}^v_{out}(n) \tag{32}$$

and the estimated class is

$$\hat{Y}(n) = \arg\max \hat{\mathbf{p}}_{ens}(n). \tag{33}$$
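Equations (31)–(33) can be sketched as follows; `p_out` holds, for each DIN and each training point, the estimated class probabilities (randomly generated here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N_mach, N_train, N_class = 3, 5, 2
y = rng.integers(0, N_class, size=N_train)           # true training classes
p_out = rng.random((N_mach, N_train, N_class))
p_out /= p_out.sum(axis=2, keepdims=True)            # each row a distribution over classes

# Equation (31): weight of each DIN = mass it assigns to the true classes, normalized.
mass = p_out[:, np.arange(N_train), y].sum(axis=1)   # shape (N_mach,)
w = mass / mass.sum()

# Equations (32)-(33): weighted ensemble probabilities and MAP decision for a test point.
p_test = rng.random((N_mach, N_class))
p_test /= p_test.sum(axis=1, keepdims=True)          # each DIN's output for the test point
p_ens = w @ p_test                                   # shape (N_class,)
y_hat = int(np.argmax(p_ens))
```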

#### **4. The Probabilistic Point of View**

This section underlines the difference, in terms of probability formulation, between the Naive Bayes classifier [2,26] and the proposed scheme, since both use the assumption of conditional independence of the input features. Both classifiers build, in a simplified way, the probability matrix $\mathbf{P}_{Y|X_0,\dots,X_{D-1}}$ with $N_{class}$ rows and $\prod_{i=0}^{D-1} N^{(i)}_{in}$ columns, where $N^{(i)}_{in}$ is the cardinality of the input feature $X_i$. In the next subsections, we show the different structure of these two probability matrices.

#### *4.1. Assumption of Conditionally Independent Features*

The Naive Bayes assumption allows writing the output estimated probability of the Naive Bayes classifier as follows:

$$\begin{split}P(Y=m \mid \mathbf{x} = \mathbf{x}_0) &= \frac{P(\mathbf{x} = \mathbf{x}_0 \mid Y=m)\, P(Y=m)}{P(\mathbf{x} = \mathbf{x}_0)}\\ &= \frac{\left[\prod_{k=0}^{D-1} P(X_k = x_{k0} \mid Y=m)\right] P(Y=m)}{\sum_{s=0}^{N_{class}-1} \left[\prod_{k=0}^{D-1} P(X_k = x_{k0} \mid Y=s)\right] P(Y=s)}\end{split} \tag{34}$$

which is very easily implemented, without the need to generate the tree network. We now rewrite this output probability in a more complex form to show the difference between the Naive Bayes probability matrix and the DIN one. Consider the $n$th feature $x(n)$, which can take values in the set $\{c^0_n, \dots, c^{D_n-1}_n\}$. Define $\mathbf{p}_{x(n)|y=m} = [P(x(n) = c^0_n \mid Y=m), \dots, P(x(n) = c^{D_n-1}_n \mid Y=m)]$; then,

$$\mathbf{P}_{X_{in}|Y}(m,:) = \bigotimes_{k=0}^{D-1} \mathbf{p}_{x(k)|y=m} \tag{35}$$

and thus obviously

$$\mathbf{P}_{X_{in}|Y} = \begin{bmatrix} \bigotimes_{k=0}^{D-1} \mathbf{p}_{x(k)|y=0} \\ \bigotimes_{k=0}^{D-1} \mathbf{p}_{x(k)|y=1} \\ \vdots \\ \bigotimes_{k=0}^{D-1} \mathbf{p}_{x(k)|y=N_{class}-1} \end{bmatrix} \tag{36}$$

We can write the joint probability matrix as

$$\mathbf{P}_{X_{in},Y} = \operatorname{diag}(\mathbf{P}_Y)\, \mathbf{P}_{X_{in}|Y} \tag{37}$$

and the probability matrix of target class given observation as

$$\mathbf{P}_{Y|X_{in}} = \left(\mathbf{P}_{X_{in},Y}\, \operatorname{diag}\!\left(\mathbf{P}_{X_{in}}^{\circ(-1)}\right)\right)^T \tag{38}$$

The hypothesis of conditional statistical independence of the features is not always correct, and an obvious performance degradation can thus be incurred.
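Equations (35)–(38) say that the Naive Bayes matrix is a per-class Kronecker product of the per-feature conditionals; a small sketch with hypothetical sizes, writing the element-wise inverse of Equation (38) explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)
D, N_class, card = 3, 2, 2                       # hypothetical sizes
# Per-feature conditionals p_{x(k)|y=m}, shape (D, N_class, card), rows normalized.
cond = rng.random((D, N_class, card))
cond /= cond.sum(axis=2, keepdims=True)
P_Y = np.array([0.5, 0.5])

# Equations (35)-(36): each row m of P_{X_in|Y} is a Kronecker product over features.
rows = []
for m in range(N_class):
    r = cond[0, m]
    for k in range(1, D):
        r = np.kron(r, cond[k, m])
    rows.append(r)
P_Xin_given_Y = np.array(rows)                   # shape (N_class, card**D)

# Equations (37)-(38): joint matrix, marginal, and posterior.
P_XinY = np.diag(P_Y) @ P_Xin_given_Y
P_Xin = P_XinY.sum(axis=0)
P_Y_given_Xin = (P_XinY @ np.diag(1.0 / P_Xin)).T   # shape (card**D, N_class)
```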

#### *4.2. The Overall Probability Matrix*

We now compute the output estimated probability for the DIN classifier. Consider again the sub-network in Figure 3, made of info nodes $a$, $b$, and $c$. Info node $a$ is characterized by matrix $\mathbf{P}_a$, whose element $P_a(i,j)$ is $P(X_{out,a} = j \mid X_{in,a} = i)$; similar definitions hold for $\mathbf{P}_b$ and $\mathbf{P}_c$. Note that $\mathbf{P}_a$ and $\mathbf{P}_b$ have $N_0$ rows and $N_1$ columns, whereas $\mathbf{P}_c$ has $N_1 \times N_1$ rows and $N_2$ columns; the overall probability matrix between the inputs $X_{in,a}$, $X_{in,b}$ and the output $X_{out,c}$ is $\tilde{\mathbf{P}}$, with $N_0 \times N_0$ rows and $N_2$ columns. Then,

$$\begin{split} &P(X_{out,c}=i \mid X_{in,a}=j, X_{in,b}=k) \\ &= \sum_{r=0}^{N_1-1}\sum_{s=0}^{N_1-1} P(X_{out,c}=i, X_{out,a}=r, X_{out,b}=s \mid X_{in,a}=j, X_{in,b}=k) \\ &= \sum_{r=0}^{N_1-1}\sum_{s=0}^{N_1-1} P(X_{out,c}=i \mid X_{out,a}=r, X_{out,b}=s)\, P(X_{out,a}=r \mid X_{in,a}=j)\, P(X_{out,b}=s \mid X_{in,b}=k) \\ &= \sum_{r=0}^{N_1-1}\sum_{s=0}^{N_1-1} P(X_{out,c}=i \mid X_{out,a}=r, X_{out,b}=s)\, \mathbf{P}_a(j,r)\, \mathbf{P}_b(k,s). \end{split}\tag{39}$$

It can be shown that

$$\tilde{\mathbf{P}} = (\mathbf{P}_a \otimes \mathbf{P}_b)\, \mathbf{P}_c \tag{40}$$

where $\otimes$ denotes the Kronecker product; note that $\mathbf{P}_a \otimes \mathbf{P}_b$ has $N_0 \times N_0$ rows and $N_1 \times N_1$ columns. By iteratively applying the above rule, we can get the expression of the overall matrix $\tilde{\mathbf{P}}$ for the exact topology of Figure 4, with eight input nodes and four layers:

$$\tilde{\mathbf{P}} = \Big(\big\{[(\mathbf{P}_{0,0} \otimes \mathbf{P}_{1,0})\mathbf{P}_{0,1}] \otimes [(\mathbf{P}_{2,0} \otimes \mathbf{P}_{3,0})\mathbf{P}_{1,1}]\big\}\mathbf{P}_{0,2} \otimes \big\{[(\mathbf{P}_{4,0} \otimes \mathbf{P}_{5,0})\mathbf{P}_{2,1}] \otimes [(\mathbf{P}_{6,0} \otimes \mathbf{P}_{7,0})\mathbf{P}_{3,1}]\big\}\mathbf{P}_{1,2}\Big)\, \mathbf{P}_{0,3}. \tag{41}$$

The overall output probability matrix $\mathbf{P}_{Y|X_{in}}$ can finally be computed as

$$\mathbf{P}_{Y|X_{in}} = \tilde{\mathbf{P}}\, \mathbf{P}_{Y|X_{out}(0,3)}. \tag{42}$$

The DIN then behaves as a one-layer system that generates the output according to matrix $\mathbf{P}_{Y|X_{in}}$, whose size might be impractically large. The system can also be interpreted as a sophisticated way of factorizing and approximating the exponentially large true probability matrix. In fact, the proposed layered structure needs smaller probability matrices, which makes the system computationally efficient. The equivalent probability matrix is thus different in the DIN (Equation (42)) and Naive Bayes (Equation (38)) cases.
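Equation (40) can be checked numerically: running the sub-network of Figure 3 layer by layer gives the same result as the single matrix $(\mathbf{P}_a \otimes \mathbf{P}_b)\mathbf{P}_c$ (a sketch with random row-stochastic matrices):

```python
import numpy as np

rng = np.random.default_rng(3)

def row_stochastic(rows, cols):
    """Random conditional probability matrix: each row is a distribution."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

N0, N1, N2 = 3, 2, 4
P_a, P_b = row_stochastic(N0, N1), row_stochastic(N0, N1)
P_c = row_stochastic(N1 * N1, N2)

P_tilde = np.kron(P_a, P_b) @ P_c              # Equation (40), shape (N0*N0, N2)

# Layered computation for inputs (j, k): select rows, combine, then apply P_c;
# row j*N0 + k of the Kronecker product is kron(P_a[j], P_b[k]).
for j in range(N0):
    for k in range(N0):
        layered = np.kron(P_a[j], P_b[k]) @ P_c
        assert np.allclose(layered, P_tilde[j * N0 + k])
```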

#### **5. Experiments**

In this section, we analyze the results obtained with benchmark datasets. In particular, we consider the DIN ensemble when: (a) each DIN is based on the probability matrices (the scheme described in this paper); and (b) each information node of the DIN randomly generates the symbols, as described in the previous work [16]. We refer to these two variants in captions and labels as DIN(Prob) and DIN(Gen), respectively. The reason for this comparison is that conditional statistical independence is not required in the case of DIN(Gen), and the classification accuracy could be different in the two cases. Note that Franzese and Visintin [16] considered just one DIN, not an ensemble of DINs. In the following, we introduce three datasets on which we tested the method (Sections 5.1–5.3) and propose some examples of DIN architectures. A complete analysis of the numerical results is given in Section 5.4. Sections 5.5 and 5.6 analyze the impact of changing the maximum number of iterations of the Blahut–Arimoto algorithm and the Lagrangian coefficient *β*, respectively. Finally, a synthetic multiclass experiment is described in Section 5.7. In all experiments, the value of *β* was optimized on the training set, similarly to what is described in Section 5.6.

#### *5.1. UCI Congressional Voting Records Dataset*

The first experiment on real data was conducted on the UCI Congressional Voting Records dataset [27], which collects the votes given by each member of the U.S. House of Representatives on 16 key laws (in 1985). Each vote can take three values corresponding to (roughly, see [27] for more details) yes, no, and missing value; each datum belongs to one of two classes (Democrat or Republican). The aim of the network is, given the list of 16 votes, to decide whether the voter is Republican or Democrat. In this dataset, we thus have $D = 16$ features and 435 data points, split into $N_{train}$ data for training and $N_{test} = 435 - N_{train}$ data for testing. The architecture of the network is the same as the one described in Section 2.4, except that there are 16 input features instead of 8 (the network thus has one more layer). The input cardinality in the first layer is $N^{(0)}_{in} = 3$ (yes/no/missing) and the output cardinality is set to $N^{(0)}_{out} = 2$. From the second layer on, the input cardinality for each information node is $N^*_{in} = 4$ and the output cardinality is $N^*_{out} = 2$. In the majority of cases, the size of the probability matrices is therefore $4 \times 2$ or $2 \times 2$. In this example, we used $N_{mach} = 30$ and $N_{train} = 218$ (roughly 50% of the data). The value of $\beta$ was set to 2.2.

#### *5.2. UCI Kidney Disease Dataset*

The second considered dataset was the UCI Kidney Disease dataset [28]. The dataset has a total of 24 medical features, consisting of mixed categorical, integer, and real values, with missing values. Quantization of non-categorical features of the dataset was performed according to the thresholds in Appendix A, agreed upon by a medical doctor.

The aim of the experiment is to correctly classify patients affected by chronic kidney disease. We performed 100 different trials, training the algorithms using only $N_{train} = 50$ out of 400 samples. Layer zero has 24 input nodes; the outputs of layer zero are then mixed two at a time to get 12 information nodes at Layer 1, 6 at Layer 2, and 3 at Layer 3; the last three nodes are combined into a unique final node. The output cardinality of all nodes is equal to $N^*_{out} = 2$. The value of $\beta$ was set equal to 5.6. Also in this case, we used an ensemble of $N_{mach} = 30$ DINs.

#### *5.3. UCI Mushroom dataset*

The last considered dataset was the UCI Mushroom dataset [29]. This dataset comprises 22 categorical features with different cardinalities, which describe some properties of mushrooms, and one target variable that defines whether the considered mushroom is edible or poisonous/unsafe. There are 8124 entries in the dataset. We padded the dataset with two null features to reach a cardinality of 24 and used exactly the same architecture as in the kidney disease experiment. We selected $N_{train} = 50$, $\beta = 2.7$, and $N_{mach} = 15$ DINs.

#### *5.4. Misclassification Probability Analysis*

We hereafter report results in terms of misclassification probability for the proposed method and several classification methods implemented using the MATLAB® Classification Learner. All datasets were randomly split 100 times into training and testing subsets, thus generating 100 different experiments. The proposed method shows competitive results in the considered cases, as can be observed in Table 1. It is interesting to compare the performance of the proposed algorithm with that of the Naive Bayes classifier, i.e., Equation (34), and of the Bagged Trees algorithm, which is conceptually the closest algorithm to the one we propose. In general, the two variants of the DINs perform similarly to the Bagged Trees, while outperforming Naive Bayes. For Bagged Trees and KNN-Ensemble, the same number of learners as in the DIN ensembles was used.

**Table 1.** Mean misclassification probability (over 100 random experiments) for the three datasets with the considered classifiers.


#### *5.5. The Impact of Number of Iterations of Blahut–Arimoto on The Performance*

As anticipated in Section 2.5, the computational complexity of a single node scales with the number of iterations of the Blahut–Arimoto algorithm. To the best of our knowledge, a provable convergence rate for the Blahut–Arimoto algorithm in the information bottleneck setting does not exist. We hereafter (Figure 5) present empirical results on the impact of limiting the number of iterations of the Blahut–Arimoto algorithm (for simplicity, the same bound is applied to all nodes in the networks). When the number of iterations is too small, there is a drastic decrease in performance, because the probability matrices in the information nodes have not yet converged; 5–6 iterations are sufficient, and a further increase in the number of iterations yields no performance improvement.

**Figure 5.** Misclassification probability versus number of iterations (average over 10 different trials) for the considered UCI datasets.

#### *5.6. The Role of β: Underfitting, Optimality, and Overfitting*

As usual with almost all machine learning algorithms, the choice of hyperparameters is of fundamental importance. For simplicity, in all experiments described in the previous sections, we kept the value of *β* constant through the network. To gain some intuition, Figure 6 shows the misclassification probability for different *β* for the three considered datasets (each time keeping *β* constant through the network). While the three curves are quantitatively different, we can notice the same qualitative trend: when *β* is too small, not enough information about the target variable is propagated, and then by increasing *β* above a certain threshold, the misclassification probability drops. Increasing *β* too much however induces overfitting, as expected, and the classification error (slowly) increases again. Remember (from Equation (15)) that the Lagrangian we are minimizing is

$$\mathcal{L} = \mathbb{I}(X_{in}; X_{out}) - \beta\, \mathbb{I}(Y; X_{out}).$$

Information theory tells us that at every information node we should propagate only the sufficient statistic about the target variable *Y*. In practice, this is reflected in the role of *β*: when it is too small, we neglect the term I(*Y*; *Xout*) and just minimize I(*Xin*; *Xout*) (that corresponds to underfitting), while increasing *β* allows passing more information about the target variable through the bottleneck. It is important to remember, however, that we do not have direct access to the true mutual information values but just to an empirical estimate based on a finite dataset. Especially when the cardinalities of inputs and outputs are high, this translates into an increased probability of spotting spurious correlations that, if learned by the nodes, induce overfitting. The overall message is that *β* has an extremely important role in the proposed method, and its value should be chosen to modulate between underfitting and overfitting.
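The two quantities traded off by *β* can be computed directly from a node's distributions; a sketch, where `P_X`, `P_T_given_X`, and `P_Y_given_X` are hypothetical inputs and the Markov chain $Y - X_{in} - X_{out}$ is assumed:

```python
import numpy as np

def mutual_information(P_joint):
    """I(A;B) in nats from a joint probability matrix."""
    P_a = P_joint.sum(axis=1, keepdims=True)
    P_b = P_joint.sum(axis=0, keepdims=True)
    mask = P_joint > 0
    return float((P_joint[mask] * np.log(P_joint[mask] / (P_a @ P_b)[mask])).sum())

def ib_lagrangian(P_X, P_T_given_X, P_Y_given_X, beta):
    """L = I(X_in; X_out) - beta * I(Y; X_out), as in Equation (15)."""
    P_XT = P_X[:, None] * P_T_given_X       # joint of X_in and X_out
    P_XY = P_X[:, None] * P_Y_given_X       # joint of X_in and Y
    P_TY = P_T_given_X.T @ P_XY             # joint of X_out and Y via the Markov chain
    return mutual_information(P_XT) - beta * mutual_information(P_TY)
```

For instance, when $X_{out}$ is a copy of a binary uniform $X_{in}$ that determines $Y$, both mutual informations equal $\ln 2$, so the Lagrangian vanishes at $\beta = 1$.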

**Figure 6.** Misclassification probability versus *β* (average over 20 different trials) for the considered UCI datasets.

#### *5.7. A Synthetic Multiclass Experiment*

In this section we present results on a multiclass synthetic dataset. We generated 64-dimensional feature vectors **z** drawn from multivariate Gaussian distributions with mean and covariance depending on a target class *y* and a control parameter *ρ*:

$$p(\mathbf{z} \mid y = l) = |2\pi\rho\Sigma_l|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(\mathbf{z}-\boldsymbol{\mu}_l)^T(\rho\Sigma_l)^{-1}(\mathbf{z}-\boldsymbol{\mu}_l)\right), \quad l = 1, \dots, N_{class} \tag{43}$$

where for the considered experiment $N_{class} = 8$. The mean $\boldsymbol{\mu}_l$ is sampled from a 64-dimensional normal random vector, and $\Sigma_l$ is randomly generated as $\Sigma_l = \mathbf{A}\mathbf{A}^T$ (where $\mathbf{A}$ is sampled from a matrix normal distribution) and normalized to have unit norm. The parameter $\rho$ modulates the signal-to-noise ratio of the generated samples: a smaller value of $\rho$ corresponds to smaller feature variances, more distinct (less overlapping) pdfs $p(\mathbf{z} \mid y = l)$, and an easier classification task. We then quantize the result using 1 bit, i.e., the input of the ensemble of DINs is the following random vector:

$$\mathbf{x} = \mathcal{U}(\mathbf{z})\tag{44}$$

where $\mathcal{U}(\cdot)$ is the element-wise Heaviside step operator. The designed architecture has 64 input nodes at the first layer, followed by layers of 32, 16, 4, 2, and 1 information nodes. The output cardinalities are equal to 2 for the first three layers, 4 for the fourth and fifth layers, and 8 at the last layer. We selected $N_{train} = 1000$, $\beta = 7$ (constant through the network), and $N_{mach} = 10$ DINs. Figure 7 shows the classification accuracy (on a test set of 1000 samples) for different values of $\rho$. As expected, when the value of $\rho$ is small, we can reach almost perfect classification accuracy, whereas, by increasing it, the performance drops to the point where the useful signal is completely buried in noise and the classification accuracy reaches the asymptotic level of $\frac{1}{8}$ (which corresponds to random guessing with 8 classes).
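The data generation of Equations (43) and (44) can be sketched as follows (dimensions reduced for brevity; `rho` is the control parameter of the text):

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n_class, n_samples, rho = 8, 4, 100, 0.1   # reduced sizes for illustration

# Random per-class means and unit-norm covariances Sigma_l = A A^T.
mus = rng.standard_normal((n_class, dim))
sigmas = []
for _ in range(n_class):
    A = rng.standard_normal((dim, dim))
    S = A @ A.T
    sigmas.append(S / np.linalg.norm(S))

# Equation (43): sample z | y = l from a Gaussian with covariance rho * Sigma_l.
y = rng.integers(0, n_class, size=n_samples)
z = np.array([rng.multivariate_normal(mus[l], rho * sigmas[l]) for l in y])

# Equation (44): 1-bit quantization with the element-wise Heaviside step.
x = (z > 0).astype(int)
```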

**Figure 7.** Classification accuracy for different values of the control parameter *ρ*.

#### **6. Conclusions**

The proposed ensemble Deep Information Network (DIN) shows good results in terms of accuracy and represents a new simple, flexible, and modular structure. The required hyperparameters are the cardinality of the alphabet at the output of each information node, the value of the Lagrangian multiplier *β*, and the structure of the tree itself (number of input information nodes of each combiner).

Simplistic architecture choices made for the experiments (such as equal cardinality of all node outputs, *β* constant through the network, etc.) performed comparably to finely tuned networks. However, we expect that, similar to what happened in neural network applications, a domain specific design of the architectures will allow for consistent improvements in terms of performance on complex datasets.

Despite the local assumption of conditionally independent features, the proposed method always outperforms Naive Bayes. As discussed in Section 4, the induced equivalent probability matrix is different in the two cases. Intuitively, we can understand the difference in performance from the point of view of probability matrix factorization. On one side, we have the true, exponentially large, joint probability matrix of all features and target class. On the other side, we have the Naive Bayes one, which is extremely simple in terms of complexity but obviously less performing. In between, we have the proposed method, where the complexity is still reasonable but the quality of the approximation is much better. The DIN(Gen) algorithm does not require the assumption of statistical independence, but its classification accuracy is very close to that of DIN(Prob), which further suggests that the assumption is acceptable from a practical point of view.

The proposed method leaves open the possibility of devising a custom hardware implementation. Differently from classical decision trees, in fact, the execution times of all branches as well as the precise number of operations are fixed per datum and known a priori, helping in various system design choices. With classical trees, where a node's utilization depends on the datum, we are forced to design the system for the worst case, even if in the vast majority of cases not all nodes are used. With DINs, there is no such problem.

Finally, a clearly open point concerns the quantization of continuous random variables. One possible self-consistent approach could be an information-bottleneck-based method (similar to the method for continuous random variables in [20]).

Further studies on extremely large datasets will help in understanding principled ways of tuning hyperparameters and architecture choices, and their relationship to performance.

**Author Contributions:** Conceptualization, G.F. and M.V.; methodology, G.F. and M.V.; software, G.F. and M.V.; validation, G.F. and M.V.; formal analysis, G.F. and M.V.; investigation, G.F. and M.V.; resources, G.F. and M.V.; data curation, G.F. and M.V.; writing–original draft preparation, G.F. and M.V.; writing–review and editing, G.F. and M.V.; visualization, G.F. and M.V.; All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** Special thanks to Gabriella Olmo, MD, who suggested a quantization of the continuous feature values in the experiment of Section 5.2 that is correct from a medical point of view.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Quantization**

Hereafter, we present the quantization scheme used for the numerical features of chronic kidney disease dataset.


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Convergence Behavior of DNNs with Mutual-Information-Based Regularization**

#### **Hlynur Jónsson, Giovanni Cherubini \* and Evangelos Eleftheriou**

IBM Research Zurich, 8803 Rüschlikon, Switzerland; hlynur4@gmail.com (H.J.); ele@zurich.ibm.com (E.E.) **\*** Correspondence: cbi@zurich.ibm.com; Tel.: +41-44-724-8518

Received: 17 June 2020; Accepted: 26 June 2020; Published: 30 June 2020

**Abstract:** Information theory concepts are leveraged with the goal of better understanding and improving Deep Neural Networks (DNNs). The information plane of neural networks describes the behavior, during training, of the mutual information at various depths between input/output and hidden-layer variables. Previous analysis revealed that most of the training epochs are spent on compressing the input, at least in those networks where finiteness of the mutual information can be established. However, the estimation of mutual information is nontrivial for high-dimensional continuous random variables. Therefore, the computation of the mutual information for DNNs and its visualization on the information plane have mostly focused on low-complexity fully connected networks. In fact, even the existence of the compression phase in complex DNNs has been questioned and viewed as an open problem. In this paper, we present the convergence of mutual information on the information plane for a high-dimensional VGG-16 Convolutional Neural Network (CNN) by resorting to Mutual Information Neural Estimation (MINE), thus confirming and extending the results obtained with low-dimensional fully connected networks. Furthermore, we demonstrate the benefits of regularizing a network, especially for a large number of training epochs, by adopting mutual information estimates as additional terms in the loss function characteristic of the network. Experimental results show that the regularization stabilizes the test accuracy and significantly reduces its variance.

**Keywords:** deep neural networks; information bottleneck; regularization methods

#### **1. Introduction**

Deep Neural Networks (DNNs) have revolutionized several application domains of machine learning, including computer vision, natural language processing and recommender systems. Despite their success, the internal learning process of these networks is still an active field of research. One of the goals of this paper is to leverage information theoretical concepts to analyze and further improve DNNs. The analysis of DNNs through the information plane, i.e., the plane of mutual information values that each layer preserves at various learning stages on the input and the output random variables, was proposed in [1,2]. Previous approaches for the visualization of the information plane applied non-parametric estimation methods that do not work well with high-dimensional data [2–4], as in this case the estimation of mutual information is nontrivial. The information plane for small fully connected networks was visualized in [2]. The results in [2] suggested that a DNN spends most of its training epochs, the "compression phase", compressing the input variables after an initial "fitting phase". The existence of the compression phase was later questioned in [4–7], in case the finiteness of the mutual information between input/output and hidden-layer variables cannot be established. In this paper, we focus on Convolutional Neural Networks (CNNs) with high complexity. After briefly discussing non-parametric mutual information estimation methods, we present the convergence of mutual information on the information plane for a high-dimensional VGG-16 CNN [8] by resorting to Mutual Information Neural Estimation (MINE) [9]. The compression

phase is evident from the obtained results, which confirm and extend the results previously found with low-dimensional fully connected networks, where methods with lower computational complexity were adopted to study the convergence in the information plane.

Furthermore, we consider DNNs with mutual-information-based regularization. The use of the mutual information between the input and a hidden layer of a DNN as a regularizer was suggested in [9–11]. The idea is based on the Information Bottleneck (IB) approach [12], which provides a maximally compressed version of the input random variable, while still retaining as much information as possible on a relevant random variable. Here we compare the accuracy achieved by a VGG-16 CNN, using well-known regularization techniques, such as dropout, batch normalization and data augmentation, with that of the same VGG-16 network that is enhanced by applying mutual-information-based regularization, by resorting either to MINE or the variational information bottleneck (VIB) technique [10], and demonstrate the advantages of mutual-information-based regularization, especially for a large number of training epochs.

The remainder of the paper is structured as follows. Basic definitions of mutual information and the formulation of the IB approach are recalled in Section 2. Non-parametric methods for the estimation of mutual information in DNNs are addressed in Section 3. The visualization of DNN convergence on the information plane using MINE is described in Section 4, whereas the advantages of long-term DNN regularization using mutual-information-based techniques are presented in Section 5. Finally, conclusions are given in Section 6.

#### **2. The Information Bottleneck in DNNs**

The mutual information is a measure of the mutual dependence of two random variables. In essence, it measures the relationship between them and may be regarded as the reduction in uncertainty in a random variable given the knowledge available about another one. The mutual information between two discrete random variables *X* and *Y* is defined as

$$I(X;Y) = \sum\_{x \in X} \sum\_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \tag{1}$$

where *p*(*x*, *y*) is the joint probability distribution and *p*(*x*) and *p*(*y*) are the marginal probability distributions. The mutual information may also be expressed as the Kullback–Leibler (KL) divergence between the joint distribution and the product of the marginal distributions of the random variables *X* and *Y*

$$I(X;Y) = D\_{KL}\left(p(x,y) \,\|\, p(x)p(y)\right) \tag{2}$$

Computing the mutual information of two random variables is in general challenging. Exact computation can be done in the discrete setting, where the sum in Equation (1) can be calculated exhaustively, or in the continuous setting, provided the probability distributions are known in advance. Several methods to estimate the mutual information have been introduced. The most common ones are non-parametric, including discretizing the values using binning and resorting to non-parametric kernel-density estimators, as will be discussed in more detail in Section 3.
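For discrete random variables with a known joint distribution, Equation (1) can be evaluated directly by summing exhaustively. The following minimal sketch (function name ours) recovers the two limiting cases of independence and identity:

```python
import numpy as np

def mutual_information(p_xy):
    """Evaluate Equation (1) for a discrete joint distribution.

    p_xy: 2-D array with p_xy[i, j] = p(x_i, y_j); entries sum to 1.
    Returns I(X;Y) in bits.
    """
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p(y)
    mask = p_xy > 0                         # 0 log 0 = 0 by convention
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# X and Y independent binary variables: I(X;Y) = 0
print(mutual_information(np.full((2, 2), 0.25)))                 # 0.0
# X = Y uniform binary: I(X;Y) = H(X) = 1 bit
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))    # 1.0
```

Exact evaluation of this kind is only feasible for small alphabets with known distributions, which is precisely why the estimation methods discussed in Section 3 are needed.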

The IB method was introduced in [12] to find a maximally compressed representation of an input random variable, *X*, that preserves as much information as possible on a relevant random variable, Ψ. Let us denote by *X*ˆ the compressed version of the input random variable *X*, parameterized by *θ*. What the IB method effectively does is to solve the following optimization problem

$$\begin{aligned} \min\_{\theta} \ & I\_{\theta}(X; \hat{X}) \\ \text{subject to:} \quad & I\_{\theta}(\hat{X}; \Psi) = I\_{c} \le I(X; \Psi) \end{aligned} \tag{3}$$

where *Ic* is the information constraint. In a non-deterministic classification scenario, an equivalent formulation of the IB problem is obtained by introducing the Lagrange multiplier *β* ≥ 0 and by minimizing the Lagrangian

$$\mathcal{L}(\theta) = I\_{\theta}(X; \hat{X}) - \beta I\_{\theta}(\hat{X}; \Psi) \tag{4}$$

In a deterministic classification scenario, however, where neural networks do not exhibit any randomization of the hidden layer output variables, the two IB formulations given above are in general not equivalent, as demonstrated in [13].

The goal of any supervised learning method is to efficiently capture the relevant information about an input random variable, typically a label for classification tasks, in the output random variable [1]. Let us consider a DNN for image recognition, with input *X*, output *Ŷ*, and *i*-th hidden layer denoted by *hi*. The classification task is related to the interpretation of an image that is generated from a relevant random variable *Y*.

In case the hidden layer, *hi*, only processes the output of the previous layer, *hi*−1, the layers form a Markov chain of successive internal representations of the input. By applying the Data Processing Inequality (DPI) [14] to a DNN that consists of *L* layers, we have

$$I(X;h\_1) \ge I(X;h\_2) \ge \dots \ge I(X;h\_L) \ge I(X;\hat{Y})\tag{5}$$
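The chain of inequalities in Equation (5) can be checked numerically on a toy example: if each "layer" is a deterministic function of the previous one, the empirical mutual information with the input can only decrease. The maps below are illustrative and not part of the network studied in this paper:

```python
import math
from collections import Counter

def mi_bits(pairs):
    """I(A;B) in bits from an empirical list of (a, b) samples."""
    n = len(pairs)
    p_ab = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum(c / n * math.log2((c / n) / ((p_a[a] / n) * (p_b[b] / n)))
               for (a, b), c in p_ab.items())

# Toy "network": X uniform on {0..7}; each layer is a deterministic map.
xs = list(range(8))
h1 = [x % 4 for x in xs]   # layer 1 keeps 2 bits of X
h2 = [h % 2 for h in h1]   # layer 2 keeps 1 bit

print(mi_bits(list(zip(xs, h1))))   # 2.0 bits
print(mi_bits(list(zip(xs, h2))))   # 1.0 bit (DPI: cannot exceed the layer above)
```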

As mentioned in the Introduction, DNNs may be analyzed through mutual information values on the information plane. However, estimating mutual information across layers in DNNs is a nontrivial task, as the outputs of the hidden layer neurons are continuous-valued high-dimensional random vectors.

A further difficulty arises if the input *X* is a continuous random variable and the neural network is deterministic. In this case, it has been shown, e.g., in [4–6], that *I*(*X*; *hi*) = ∞ for commonly used activation functions. For the image classification problem considered here, one might argue that the input random variable is a discrete random variable, where the pixels have a discrete distribution, and *hi* is also a discrete random variable given by a deterministic transformation of *X* via finite-precision arithmetic. The training and test sets, however, have a cardinality that is typically much lower than the alphabet size of *X*, thus rendering the estimation of the mutual information very difficult. To cope with the challenge of estimating the divergence in Equation (2), we will resort to MINE, as discussed in Section 4.

#### **3. Non-Parametric Estimation of Mutual Information**

As stated in Section 2, the most common methods for the estimation of mutual information are non-parametric. As our focus is on CNNs, we consider a VGG-16 network [8] to evaluate the effectiveness of non-parametric estimation methods. The block diagram of a VGG-16 network is illustrated in Figure 1. The loss function adopted to train the network is the cross-entropy loss obtained from the softmax output probabilities, that is

$$\mathcal{L}\_{\text{CE}} = -\sum\_{m=1}^{M} y\_m \log p\_m \tag{6}$$

where *ym* is the binary value corresponding to the *m*-th class of a one-hot encoded vector defined by the class labels, *pm* is the softmax probability output for class *m*, and *M* is the number of classes in the classification task. In all experiments the dataset considered is CIFAR-10 [15], i.e., *M* = 10, with a batch size of 128. CIFAR-10 is a dataset consisting of 60,000 images in 10 classes. Fifty thousand of those images are used for training and 10,000 for testing. Each image of the dataset is of size 32 × 32 with three color channels.
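As a minimal illustration of Equation (6), the batch-averaged cross-entropy can be computed directly from the one-hot labels and the softmax outputs (the small `eps` guard is our addition for numerical safety):

```python
import numpy as np

def cross_entropy(y_onehot, p_softmax, eps=1e-12):
    """Cross-entropy loss of Equation (6), averaged over a batch.

    y_onehot:  (batch, M) one-hot class labels
    p_softmax: (batch, M) softmax output probabilities
    """
    return float(-(y_onehot * np.log(p_softmax + eps)).sum(axis=1).mean())

# One CIFAR-10-style sample (M = 10): true class 3 predicted with p = 0.9
y = np.eye(10)[[3]]
p = np.full((1, 10), 0.1 / 9)
p[0, 3] = 0.9
print(cross_entropy(y, p))   # -log(0.9), roughly 0.105
```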

**Figure 1.** The Visual Geometry Group (VGG)-16 network [8] architecture from input to predicted output. CONV-64 is shorthand for a 2D convolutional layer with 64 filters. FC-512 is shorthand for a fully connected layer with 512 neurons.

#### *3.1. Activation Binning*

The mutual-information estimation method adopted in [2] resorts to binning the activations of each layer into equal-sized bins. Binning ensures that the values are discretized, which makes it possible to calculate the joint distribution and the marginal distributions by counting the activations in each bin.
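A minimal sketch of this binning estimator, under the assumption of equal-sized bins over the observed activation range (the hashing of each discretized activation vector to a single symbol is our implementation choice):

```python
import numpy as np
from collections import Counter

def binned_mi(x_ids, activations, n_bins=30):
    """Binning estimator of I(X; h), as in the activation-binning approach.

    x_ids:       (n,) discrete identifiers of the input samples
    activations: (n, d) hidden-layer activations for those samples
    Returns an MI estimate in bits.
    """
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    # Discretize each activation vector, then hash it to one symbol per sample.
    digitized = np.digitize(activations, edges[1:-1])
    h_ids = [hash(row.tobytes()) for row in digitized]
    # Empirical joint and marginal distributions over the discretized symbols.
    n = len(x_ids)
    p_xh = Counter(zip(x_ids, h_ids))
    p_x, p_h = Counter(x_ids), Counter(h_ids)
    return sum(c / n * np.log2((c / n) / ((p_x[a] / n) * (p_h[b] / n)))
               for (a, b), c in p_xh.items())

# Four distinct inputs with four distinct (deterministic) activations:
# the estimate equals H(X) = 2 bits.
print(binned_mi([0, 1, 2, 3], np.array([[0.0], [1.0], [2.0], [3.0]]), n_bins=4))
```

The exponential growth of the number of discretized symbols with the dimensionality `d` is exactly why this approach does not scale to CNN-sized layers, as discussed below.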

Despite the promising results reported in [2], this method has a number of limitations as a general method for mutual-information estimation. Firstly, the experiments in [2] were conducted using a fully connected network with only a few neurons, which has far fewer synaptic weights and neurons than typical CNN architectures. Another shortcoming is the use of the *tanh* function as a non-linear activation function, which bounds the activations between −1 and 1. The ReLU activation function [16] is more commonly used and allows unbounded positive activations. This limitation is pointed out in [4], where the authors presented counterexamples to the compression phase using ReLU non-linear activations. Furthermore, in [2] the input is limited to a vector of 12 binary elements and the output is also binary. In this case binning is convenient because of the low number of dimensions of the input and output random variables.

We found that the method of binning the activations does not scale well with higher dimensionality. For example, we experimented with the activations of VGG-16 trained on CIFAR-10. For the classification of a CIFAR-10 image, the input random variable has a total of 3072 dimensions as opposed to 12 in [2]. Varying the number of bins where the activations are allocated was also found not to have a significant impact on the results. The mutual-information estimates for layers with high input dimensionality turned out not to satisfy the DPI. In addition, the estimates of both *I*(*X*; *hi*) and *I*(*hi*;*Y*) for the last few CNN layers converged to values close to zero. If *I*(*hi*;*Y*) approaches zero, the accuracy of the model should be roughly the same as for random guessing. This contrasts with the measured accuracy, which is higher than 90%, compared to 10% for random guessing.

#### *3.2. Non-Parametric Kernel Density Estimation*

We also conducted experiments on the kernel-density estimation method described in [4,17]. The assumption made is that the hidden-layer activities are distributed as a mixture of Gaussian random variables with covariance matrix *σ*²*I*. In [4], it is further assumed for analysis purposes that Gaussian noise is added to each hidden layer *Ti*, which is expressed as *Ti* = *hi* + *ε*, where *ε* ∼ N(0, *σ*²*I*). A mutual information upper bound with respect to the input is proposed as

$$I(T\_i; X) \le -\frac{1}{P} \sum\_j \log \frac{1}{P} \sum\_k \exp\left(-\frac{1}{2} \frac{||h\_{ij} - h\_{ik}||\_2^2}{\sigma^2}\right) \tag{7}$$

where *P* denotes the number of samples and *hij* the hidden layer activities of layer *i* for sample *j*. Furthermore, the mutual information with respect to the output random variable is upper bounded as follows

$$I(T\_i;Y) \le -\frac{1}{P} \sum\_{j} \log \frac{1}{P} \sum\_{k} \exp\left(-\frac{1}{2} \frac{||h\_{ij} - h\_{ik}||\_2^2}{\sigma^2}\right) - \sum\_{m} p\_m \left[ -\frac{1}{P\_m} \sum\_{j: Y\_j = m} \log \frac{1}{P\_m} \sum\_{k: Y\_k = m} \exp\left(-\frac{1}{2} \frac{||h\_{ij} - h\_{ik}||\_2^2}{\sigma^2}\right) \right] \tag{8}$$

where *Pm* is the number of samples with class label *m*, and *pm* = *Pm*/*P* is the empirical probability of class *m*.
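The bound of Equation (7) depends only on the pairwise distances between hidden-layer activities and can be sketched as follows (in nats; names ours):

```python
import numpy as np

def kde_mi_upper_bound(h, sigma2=0.1):
    """Pairwise-distance upper bound of Equation (7) on I(T_i; X), in nats.

    h: (P, d) hidden-layer activities, one row per sample.
    """
    # Squared Euclidean distances ||h_j - h_k||^2 between all sample pairs.
    sq_dists = ((h[:, None, :] - h[None, :, :]) ** 2).sum(-1)
    inner = np.exp(-0.5 * sq_dists / sigma2).mean(axis=1)   # (1/P) sum_k exp(...)
    return float(-np.log(inner).mean())                     # -(1/P) sum_j log(...)

# Identical activities carry no information: the bound is 0.
print(kde_mi_upper_bound(np.zeros((3, 2))))                 # 0.0
# Two well-separated samples: the bound approaches log 2.
print(kde_mi_upper_bound(np.array([[0.0], [10.0]])))        # roughly 0.693
```

The class-conditional second term of Equation (8) is computed in the same way, restricting the sums to samples sharing the same label.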

The same experiment as in [2] was conducted in [4] by using the non-parametric kernel density estimation.

We also tested this estimation method on a VGG-16 network, adopting a variance of *σ*² = 0.1, as in [4]. However, as with the binning method in Section 3.1, we did not obtain satisfactory results for a convolutional network of high complexity.

#### *3.3. Rényi's α-Entropy*

A multivariate matrix-based Rényi's *α*-entropy method was proposed in [3] for application to a LeNet-5 network. This approach is suitable for CNNs, as each CNN layer has several channels, which all contain some information on the input and output random variables. However, two distinct channels of a single layer can contain the same information on the input and output random variables. Therefore, summing the mutual information estimates between each channel and the input or output random variable only gives an upper bound that may bear little relation to the true mutual information value. The experiments in [3] for LeNet-5 result in mutual information estimates for the various layers that satisfy the DPI. However, our experiments for VGG-16, which has up to 512 channels, did not yield estimates that comply with the DPI.

A method for the estimation of mutual information in complex DNNs was proposed in [18], which relies on matrix-based Rényi's entropy and tensor kernels to estimate the mutual information in a VGG-16 network. The method in [18] augments the multivariate extension of the matrix-based Rényi's *α*-order entropy presented in [19] by introducing tensor kernels. In that manner, the tensor-based nature of convolutional layers in DNNs is recognized and the numerical difficulties arising by a straightforward application of the multivariate extension of the matrix-based Rényi's entropy are avoided. However, the convergence in the information plane is affected by the overfitting that takes place when the training is conducted for a large number of epochs. Therefore, the compression phase needs to be limited by an early stopping technique to prevent overfitting.

#### **4. DNN Convergence Analysis Using MINE**

Our goal is to visualize the information plane for networks with high-dimensional variables, as previous work focuses on networks with much lower complexity [2,4,10]. The methods discussed in Section 3 for estimating the mutual information do not perform well with high-dimensional random variables. Furthermore, the existence of a compression phase during training has been disputed in [4] for networks where the finiteness of the mutual information between input/output and hidden-layer variables cannot be established. Therefore, to clarify these issues, we consider a VGG-16 network with ReLU activation function. For the estimation of the mutual information for all layers in the network, we resort to the MINE method [9].

MINE is a method first proposed in [9] for the estimation of mutual information between high-dimensional continuous random variables. The method takes advantage of the Donsker–Varadhan dual representation of the KL-divergence [20] and utilizes the lower bound

$$D\_{KL}\left(\mathbb{P}\|\mathbb{Q}\right) \geq \sup\_{T \in \mathcal{F}} \mathbb{E}\_{\mathbb{P}}[T] - \log\left(\mathbb{E}\_{\mathbb{Q}}[e^{T}]\right) \tag{9}$$

where <sup>F</sup> is any class of functions *<sup>T</sup>* : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>R</sup> such that the two expectations are finite, and <sup>P</sup> and <sup>Q</sup> are probability distributions. The main idea of MINE is to choose *T* as a function parameterized by a deep neural network with parameters *θ* ∈ Θ. By defining P as the joint probability distribution and Q as the product of the marginal distributions of the random vectors *X* and *Y*, by combining Equations (9) and (2) we get the MINE lower bound

$$\hat{I}(\mathbf{X};\mathbf{Y}) = \sup\_{\theta \in \Theta} \mathbb{E}\_{p(\mathbf{x},\mathbf{y})}[T\_{\theta}] - \log(\mathbb{E}\_{p(\mathbf{x})p(\mathbf{y})}[\mathbf{e}^{T\_{\theta}}]) \tag{10}$$

The lower bound given by Equation (10) is finite, even if the input *X* is a continuous random vector and the neural network under investigation is deterministic, in which case the mutual information between the input and a hidden layer, *I*(*X*; *hi*), is infinite, as discussed in Section 2. If the mutual information is not finite, the MINE estimate may nevertheless be regarded as a well-defined estimate of the statistical divergence between the two probability distributions *p*(*X*, *hi*) and *p*(*X*)*p*(*hi*), which assume nonzero (possibly infinite) values over different supports. This is analogous to the evaluation of the optimal cost in applications of optimal transport theory, which is obtained by resorting to a dual representation of the original problem; see, e.g., [21].

The expectation over the product of the marginal distributions is estimated by shuffling the samples from which the empirical joint distribution is obtained. For the estimation of *I*(*X*; *hi*) the samples from *hi* are shuffled, whereas for the estimation of *I*(*hi*;*Y*) the samples from *Y* are shuffled. The objective function from Equation (10) is adopted and optimized by gradient ascent. For the visualization of the information plane for the *i*-th layer in the network, two estimates are needed, namely those of *I*(*X*; *hi*) and *I*(*hi*;*Y*). Each of these estimates is parameterized by a separate deep neural network. As stated in [9], more training samples are needed as the complexity of the MINE network increases. Therefore, very deep networks with high complexity are infeasible as MINE networks. Here we adopt a network and an overall training approach capable of accurately estimating mutual information while resorting to networks with moderate complexity.
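The shuffling step and the bound of Equation (10) can be sketched for a fixed statistics network *T*; here a hand-crafted quadratic critic replaces the trained MINE network, which is an illustrative assumption only:

```python
import numpy as np

rng = np.random.default_rng(0)

def mine_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan lower bound of Equation (10) for a fixed network T.

    t_joint:    T evaluated on paired samples drawn from p(x, y)
    t_marginal: T evaluated on shuffled pairs, i.e., samples from p(x)p(y)
    """
    return float(np.mean(t_joint) - np.log(np.mean(np.exp(t_marginal))))

# Correlated Gaussian pair with rho = 0.9; the simple critic T(x, y) = 0.5*x*y
# stands in for a trained MINE network (an illustrative assumption only).
n = 100_000
x = rng.standard_normal(n)
y = 0.9 * x + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(n)
perm = rng.permutation(n)                 # shuffling y breaks the dependence
estimate = mine_lower_bound(0.5 * x * y, 0.5 * x * y[perm])
true_mi = -0.5 * np.log(1 - 0.9 ** 2)     # exact I(X;Y) in nats, about 0.83
print(estimate < true_mi)                 # the estimate stays below the true value
```

Training the critic by gradient ascent on this objective, as MINE does, tightens the bound; the fixed critic above only illustrates the mechanics of the estimator.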

For the experiments in this section, we train a VGG-16 network on CIFAR-10 images. Minor data augmentation is applied in the form of random cropping and randomly flipping the images horizontally. The size of each CIFAR-10 image is 32 × 32 pixels. For random cropping, we pad the original image to 40 × 40 pixels and randomly take a crop of size 32 × 32 pixels. In our experiments, each convolutional layer has a 3 × 3 pixel receptive field. In addition, batch-normalization is used for all convolutional layers. Furthermore, dropout regularization is applied after all convolutional layers that do not precede a pooling layer (dropout rate of 0.3) and after the first fully connected layer (dropout rate of 0.5). The ReLU activation function is adopted for all layers with the exception of the last one, which is a linear dense layer. The hyperparameters are chosen using a validation set obtained by extracting 10,000 samples from the training set. The MINE loss function used to train each MINE network is defined as

$$\mathcal{L}\_{MINE} = \frac{1}{n} \sum\_{i=1}^{n} T\_{\theta}(i) - \log\left(\frac{1}{n} \sum\_{j=1}^{n} e^{T\_{\theta}(j)}\right) \tag{11}$$

where *n* is the batch size and *Tθ*(*i*) and *Tθ*(*j*) are the individual network outputs referring to the expectation over the joint distribution and over the product of the marginal distributions (see Equation (10)), respectively. An illustration of the process, including input and output encoders and referring to the hidden layer *h*3 as an example, is shown in Figure 2.
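The data augmentation described at the beginning of this section, padding the 32 × 32 CIFAR-10 image to 40 × 40 pixels, taking a random 32 × 32 crop, and randomly flipping horizontally, can be sketched as follows (helper name ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Random crop with 4-pixel zero padding plus random horizontal flip,
    matching the augmentation described for the 32 x 32 CIFAR-10 inputs.

    img: (32, 32, 3) array.
    """
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)))   # 32x32 -> 40x40
    top, left = rng.integers(0, 9, size=2)           # crop offsets in [0, 8]
    crop = padded[top:top + 32, left:left + 32]
    if rng.random() < 0.5:                           # horizontal flip
        crop = crop[:, ::-1]
    return crop

out = augment(np.zeros((32, 32, 3)))
print(out.shape)   # (32, 32, 3)
```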

#### *4.1. Visualization of the Information Plane*

As discussed above with reference to Equation (10), each estimate of mutual information by MINE requires a separate neural network to learn both the expectation over the joint distribution and that over the product of the marginal distributions. To characterize the convergence behavior of the VGG-16 on the information planes, we need to estimate both *I*(*X*; *hi*) and *I*(*hi*;*Y*) for each layer, i.e., a total of 32 networks are needed. Additionally, we adopt two encoders, which are employed across all layers, to encode the input and output random variables. Each MINE network encodes the respective hidden layer. As illustrated in Figure 2, all hidden layer encoders and input/output encoders output a 64-dimensional vector. The hidden layer encoder output is concatenated with the corresponding encoder output for the input/output random variable, resulting in a 128-dimensional vector. A fully connected network takes the concatenated vector as input and outputs a single value, from which the expected value over either distribution is obtained, depending on whether the input is shuffled or not. The expected values obtained from the network yield the mutual information estimate by Equation (10). Further details on the architectures of the experimental MINE networks are given in [22].

**Figure 2.** Example of the Mutual Information Neural Estimation (MINE) networks considered for both *I*(*X*; *h*3) and *I*(*h*3;*Y*) in layer 3 of VGG-16. The same input and output encoders are employed for all layers. The four expectations are applied as indicated in Equation (10) to estimate both *I*(*X*; *h*3) and *I*(*h*3;*Y*). FC indicates a fully connected layer, whereas CONV indicates a convolutional layer.

An information plane shows the mutual information estimates for all epochs within a certain layer. To obtain unbiased estimates for each epoch, the training procedure is conducted as follows. Initially, the VGG-16 network is trained up to a certain epoch. Then all MINE networks are trained for a total of 1000 epochs. During MINE network training, the MINE networks use the outputs of the hidden-layer neurons of the trained VGG-16 as input, without updating the weights of the VGG-16 through back-propagation. In this phase, gradient-ascent updating by back-propagation is only performed on the weights of the MINE networks. After training the MINE networks for 1000 epochs, the expectations are evaluated to find the estimates of mutual information on the information plane. Therefore, each dot on the information plane of the *i*-th layer represents the estimates of the MINE networks for *I*(*X*; *hi*) and *I*(*hi*;*Y*), for a single epoch of VGG-16, after training the MINE networks for 1000 epochs. Each mutual information estimate shown on the information plane is obtained with the same number of training iterations. To visualize the information plane, we consider the first 50 epochs of the VGG-16 training phase; beyond that point, the mutual information values do not provide further insight into the compression phase. The above procedure is repeated for all 50 VGG-16 epochs shown in the information plane.

The information planes of the VGG-16 layers are shown in Figure 3. The mutual information estimates are expressed in bits. The compression phase is evident, especially in the high-order layers, which is consistent with previous work presented in [2]; here, however, it is shown for the first time in a CNN of such high complexity. A further difference with respect to [2] is that, for the VGG-16 network trained on CIFAR-10, the compression phase appears earlier in the training process. We see that *I*(*X*; *hi*) starts decreasing after the first VGG-16 epoch for the high-order layers and continues to exhibit a decreasing trend until convergence. The estimate of *I*(*hi*;*Y*), for all layers *i* = 2, ..., 16, converges towards the upper bound equal to log₂ 10, which is the desired value of the mutual information in bits, as CIFAR-10 contains 10 classes. An exception is constituted by the first layer, which seems to slightly underestimate the mutual information with the output. It can also be seen how the input is compressed successively in each layer. This behavior is more evident from layer 7 onward, as the estimates of *I*(*X*; *hi*) decrease between subsequent layers, as demanded by the DPI. While training the MINE networks, it was observed that the mutual information estimates converged to zero during training for some VGG-16 epochs. The occurrence of such events was significantly mitigated by lowering the learning rate. The lower learning rate, however, slows down the training process. Therefore, training over 1000 epochs was needed to allow the MINE mutual information estimates to reliably converge.

Figure 4 shows the decrease of the mutual information estimates *I*(*X*; *hi*) and *I*(*hi*;*Y*) as a function of the layer index, for the 1st and the 40th epoch, i.e., towards the beginning and the end of the considered training interval, respectively, thus indicating that the DPI is well approximated by the MINE.

**Figure 3.** *Cont*.

**Figure 3.** Information planes for the VGG-16 layers trained on Canadian Institute for Advanced Research (CIFAR)-10 image set.

**Figure 4.** Mutual information estimates as a function of the layer index, for the 1st and the 40th epoch. The subscript *n* in *In*(*hi*;*Y*) indicates the epoch.

We remark that the results presented in this section are qualitative. Proper quantitative assessment of the variance in the trajectories and a comparative study of the convergence of DNNs having different architectures will be the subject of further investigation.

#### **5. Long-Term DNN Regularization**

#### *5.1. MINE-Based Regularization*

Using MINE as a regularizer was proposed in [9] for a small fully connected network trained on MNIST. The authors replaced the variational approximation of the mutual information in [10] with a MINE network for the mutual information estimate. We also consider MINE networks to estimate the mutual information, however with a VGG-16 network of significantly higher complexity. In our experiments, we estimated the mutual information of two layers by applying MINE networks. We trained a VGG-16 network for a total of 10,000 epochs to investigate how the MINE-based regularizer affects the test accuracy over the entire training period. An additional loss term was included in the objective function of the network, representing the estimates of *I*(*X*; *h*14) and *I*(*X*; *h*15) by MINE. The network parameters of the MINE networks were the same as described in Section 4.1. We performed gradient descent on the cross-entropy loss with a regularization term that equals the sum of the mutual information estimates of *I*(*X*; *h*14) and *I*(*X*; *h*15), multiplied by a regularization coefficient chosen as *β* = 10⁻³. The overall loss function is defined as

$$\mathcal{L} = \mathcal{L}\_{\text{CE}} + \sum\_{l=1}^{L} \mathcal{L}\_{\text{MINE}}(l) \tag{12}$$

where L*CE* and L*MINE* are defined in Equations (6) and (11), respectively, and *L* is the number of layers in the VGG-16 over which the regularization takes place.

The results without and with the MINE-based regularizer are shown in Figure 6a,b, respectively. The test accuracy increases and its variance decreases with respect to the experiments without MINE-based regularization. The maximum test accuracy achieved with the MINE-based regularizer is 93.9%, whereas a baseline accuracy for a VGG-16 network is reported as 93.25% in [23], which is similar to our results shown in Figure 6a.

#### *5.2. VIB-based Regularization*

As an alternative mutual-information-based estimation method between consecutive layers in CNNs, a Variational Information Bottleneck (VIB) method was proposed in [10] for fully connected networks with low complexity. The VIB technique was also used in [24] to reduce network complexity. Here we extend VIB-based regularization to CNNs with substantially higher complexity. We investigate the performance of the regularizer when training a VGG-16 for a large number of training epochs, up to 10,000, in which case overfitting is a common issue.

We adopt the same formulation of the VIB as in [24]. In a feed-forward neural network like VGG-16, each hidden layer, *hi*, takes as input the output of the previous hidden layer, *hi*−1. Therefore, each layer only extracts information from the previous layer, which typically contains some information that is not relevant to the output. The aim of a VIB-based regularizer is therefore to reduce the amount of redundant information extracted from the previous layer. This is accomplished by minimizing the estimated mutual information between subsequent layers, *I*(*hi*; *hi*−1). The information bottleneck objective then becomes

$$\mathcal{L} = \sum\_{i=1}^{L} \beta\_i I(h\_i; h\_{i-1}) - I(h\_L; Y) \tag{13}$$

where the coefficient *β<sup>i</sup>* ≥ 0 represents the strength of the VIB-based regularization in the *i*-th layer, and *L* is the number of layers in the network over which the regularization takes place. An upper bound on the objective given by Equation (13) can be derived as

$$\hat{\mathcal{L}} = \sum\_{i=1}^{L} \beta\_i \mathbb{E}\_{h\_{i-1} \sim p(h\_{i-1})} \left[ D\_{\text{KL}} \left[ p(h\_i | h\_{i-1}) \parallel q(h\_i) \right] \right] - \mathbb{E}\_{x, y \sim \mathcal{D}} [\log q(y | h\_L)] \tag{14}$$

where $\mathcal{D}$ denotes the input data distribution and $q(h_i)$ and $q(y|h_L)$ are variational distributions that approximate $p(h_i)$ and $p(y|h_L)$, respectively. To optimize the network, a parametric form of the distributions $p(h_i|h_{i-1})$, $q(h_i)$ and $q(y|h_L)$ is specified. In [24], it is assumed that each conditional distribution $p(h_i|h_{i-1})$ is defined via the relation

$$h_i = (\mu_i + \epsilon_i \odot \sigma_i) \odot f_i(h_{i-1}) \tag{15}$$

where the parameters $\mu_i$ and $\sigma_i$ are learnable for each layer where the VIB is applied and $\epsilon_i \sim \mathcal{N}(0, I)$. The function $f_i$ represents the network processing that takes place at the $i$-th layer, consisting of a linear transformation (or a convolution operation for convolutional layers), plus batch normalization and a non-linear activation function. Furthermore, the distribution $q(h_i)$ is specified as a Gaussian distribution, such that

$$q(h\_i) = \mathcal{N}(h\_i; 0, \text{diag}[\xi\_i])\tag{16}$$

where $\xi_i$ is a vector of variances learned from the data.
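The reparameterized forward pass of Equations (15) and (16) can be sketched in NumPy as follows; the linear-plus-ReLU choice for $f_i$, the toy dimensions, and the parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def vib_layer(h_prev, W, mu, sigma, rng):
    """Forward pass of Equation (15): h_i = (mu_i + eps_i * sigma_i) * f_i(h_{i-1}).

    Here f_i is taken as a linear map followed by ReLU; eps_i ~ N(0, I).
    """
    f_out = np.maximum(W @ h_prev, 0.0)   # f_i(h_{i-1})
    eps = rng.standard_normal(mu.shape)   # eps_i ~ N(0, I)
    gate = mu + eps * sigma               # multiplicative Gaussian gate
    return gate * f_out                   # element-wise product of Equation (15)

# toy dimensions: 4 inputs -> 3 units
h_prev = rng.standard_normal(4)
W = rng.standard_normal((3, 4))
mu = np.ones(3)            # learnable, one entry per unit (per channel for conv layers)
sigma = 0.1 * np.ones(3)   # learnable std of the gate
h = vib_layer(h_prev, W, mu, sigma, rng)
print(h.shape)  # (3,)
```

At test time the gate can simply be replaced by its mean $\mu_i$, which removes the stochasticity while keeping the learned scaling.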

The process is illustrated in Figure 5. The element-wise multiplications in Equation (15) are applied differently in fully connected and convolutional layers, as the convolutional layers have several channels. For each convolutional layer, the learned parameters $\mu_i$ and $\sigma_i$ are vectors with dimensionality equal to the number of channels in the layer, so we obtain a learned Gaussian distribution for each channel. The matrix used for the element-wise multiplications has the same dimensions as the convolutional layer output, and is generated by sampling from the distribution of each channel $n^2$ times, where $n \times n$ is the feature map size. For the fully connected layers, the vectors of learned parameters have dimensionality equal to the number of neurons in the layer, so each element is associated with a separate learned Gaussian distribution. Thus, the loss function is expressed as

$$\mathcal{L} = \gamma \sum\_{i=1}^{L} \beta\_i \sum\_{j=1}^{r\_i} \log \left( 1 + \frac{\mu\_{i,j}^2}{\sigma\_{i,j}^2} \right) - \mathbb{E}\_{\mathbf{x}, \mathbf{y} \sim \mathcal{D}}[\log \ q(\mathbf{y}|h\_L)] \tag{17}$$

where $r_i$ denotes the number of channels for convolutional layers and the number of neurons for fully connected layers. The coefficient $\gamma$ scales the regularizing term. Scaling is crucial in deep networks, as the accumulated loss from all layers may otherwise become too large.
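The regularizing term of Equation (17) can be sketched in a few lines of NumPy; the two-layer setup and the parameter values below are assumptions for illustration only.

```python
import numpy as np

def vib_penalty(mus, sigmas, betas, gamma=1e-5):
    """Regularizer of Equation (17): gamma * sum_i beta_i * sum_j log(1 + mu_ij^2 / sigma_ij^2)."""
    total = 0.0
    for mu, sigma, beta in zip(mus, sigmas, betas):
        total += beta * np.sum(np.log1p(mu**2 / sigma**2))
    return gamma * total

# two toy layers with 3 and 2 units (channels/neurons)
mus = [np.array([1.0, 0.5, 0.0]), np.array([2.0, 1.0])]
sigmas = [np.array([1.0, 1.0, 1.0]), np.array([1.0, 2.0])]
betas = [2**-5, 1.0]
penalty = vib_penalty(mus, sigmas, betas)
print(penalty)
```

Note that a unit with $\mu_{i,j} \to 0$ contributes nothing to the penalty, which is what allows this term to prune uninformative channels in [24].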

**Figure 5.** Illustration of how the VIB is incorporated into each layer using the formulation from [24] and Equation (15), where $z_i = \mu_i + \epsilon_i \odot \sigma_i$. The noise variable $\epsilon_i$ is sampled randomly from a Gaussian distribution with zero mean and unit variance.

As in the previous sections, the network adopted in our experiments was a VGG-16, trained on the CIFAR-10 dataset with the same data augmentation described in Section 4. The regularization constants were chosen as $\gamma = 10^{-5}$ and $\{\beta_i\}_{i=1,\dots,15} = \{2^{-5}, 2^{-5}, 2^{-4}, 2^{-4}, 2^{-3}, 2^{-3}, 2^{-3}, 2^{-2}, 2^{-2}, 2^{-2}, 2^{-1}, 2^{-1}, 2^{-1}, 1, 1\}$.

We trained the VGG-16 network without and with the VIB objective and compared the results. The output of the $i$-th layer in the experiment with the VIB was calculated as shown in Equation (15). Both models were trained for a total of 10,000 epochs on the CIFAR-10 dataset. To update the weights by back-propagation, we used the Adam [25] optimizer with exponential decay rates $\beta_{A,1} = 0.9$ and $\beta_{A,2} = 0.999$, and $\epsilon_A = 10^{-8}$. The learning rate was fixed to 0.001, and all models were trained with a mini-batch size of 128. The results without and with the VIB objective are illustrated in Figure 6a,c, respectively.

The results in Figure 6c show that the test accuracy of the model increases with the additional VIB-based regularizer, reaching 94.1%; we recall that the baseline accuracy for a VGG-16 network in [23] is 93.25%. Furthermore, the VIB-based regularizer prevents the model from overfitting. When trained for enough epochs, the model without the VIB-based regularizer eventually starts to overfit, even though it applies several regularization methods such as dropout, batch normalization and data augmentation (see Figure 6a). In contrast, the test accuracy exhibits substantially lower variance when the VIB objective is used. To obtain the best accuracy from the model trained without the VIB, early stopping is required, whereas the test accuracy of the model with the VIB is much more stable, and this stability is maintained even after 10,000 epochs of training.

We remark that the results in Figure 6a,b are obtained with the exact same VGG-16 network architecture depicted in Figure 1. The VIB block illustrated in Figure 5 is added for regularization into each layer to obtain the results shown in Figure 6c, which modifies the loss function as well as the overall network architecture. An interesting question regarding the application of the VIB for regularization is whether the observed performance improvement is due to the modified loss function and architecture, or rather to the injection of noise alone. To investigate this question, we resort to a simpler LeNet-5 network on CIFAR-10. First, a comparison of the test accuracy over 400 epochs without mutual-information-based regularizer, with the MINE-based regularizer including the mutual information estimates $I(X; h_i)$, $i = 1, \dots, 4$, and with the VIB-based regularizer is shown in Figure 7. The regularization coefficients for MINE and VIB are chosen similarly to the VGG-16 case. As observed for VGG-16 on CIFAR-10, the VIB-based regularizer leads to better performance, albeit significantly lower than that achieved by VGG-16, as LeNet-5 is a much simpler network. For the same reason, overfitting is generally not an issue for a LeNet-5 network, and the possible improvements due to regularization are therefore marginal. Second, the performance of the VIB-based regularizer for LeNet-5 is compared with that of a network where a Gaussian noise signal with zero mean and fixed standard deviation is added at each hidden layer. The accuracy obtained after 400 training epochs is reported in Table 1 for various values of the Gaussian noise standard deviation $\sigma$, and compared with that achieved by either VIB, MINE, or no regularization. The addition of noise alone is not sufficient to achieve the performance of the other regularizers.

**Figure 6.** Test accuracies over 10,000 epochs for VGG-16 trained on CIFAR-10 (**a**) without mutual-information-based regularizer, (**b**) with MINE-based regularizer and (**c**) with Variational Information Bottleneck (VIB)-based regularizer.

**Figure 7.** Test accuracies over 400 epochs for LeNet-5 trained on CIFAR-10 (**a**) without mutual-information-based regularizer, (**b**) with MINE-based regularizer and (**c**) with VIB-based regularizer.



#### **6. Conclusions**

Information-theoretic concepts were adopted to analyze and improve high-complexity CNNs. We demonstrated the convergence of mutual information on the information plane and the existence of a compression phase for VGG-16, thus extending the results of [1] for low-complexity fully connected networks. Furthermore, our experiments highlighted the advantages of regularizing DNNs with additional mutual-information-based terms in the network loss function. Specifically, mutual-information-based regularization improves and stabilizes the test accuracy, significantly reduces its variance, and prevents the model from overfitting, especially for large numbers of training epochs.

**Author Contributions:** All authors collectively conceived the analysis of the convergence behavior of DNNs with mutual-information-based techniques. H.J. performed the experiments and derived the results under the supervision of G.C. and E.E. H.J. wrote the manuscript with input from all authors. All authors have read and agreed to the submitted version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Variational Information Bottleneck for Semi-Supervised Classification**

**Slava Voloshynovskiy 1,\*, Olga Taran 1, Mouad Kondah 1, Taras Holotyak <sup>1</sup> and Danilo Rezende <sup>2</sup>**


Received: 22 July 2020; Accepted: 24 August 2020; Published: 27 August 2020

**Abstract:** In this paper, we consider an information bottleneck (IB) framework for semi-supervised classification with several families of priors on the latent space representation. We apply a variational decomposition of the mutual information terms of the IB. Using this decomposition, we analyze several regularizers and practically demonstrate the impact of different components of the variational model on the classification accuracy. We propose a new formulation of the semi-supervised IB with hand-crafted and learnable priors and link it to previous methods such as the semi-supervised versions of VAE (M1 + M2), AAE, and CatGAN. We show that the resulting model allows a better understanding of the role of various previously proposed regularizers in semi-supervised classification in the light of the IB framework. The proposed semi-supervised IB model with hand-crafted and learnable priors is experimentally validated on MNIST under different amounts of labeled data.

**Keywords:** information bottleneck principle; deep networks; semi-supervised classification; latent space representation; hand-crafted priors; learnable priors; regularization

#### **Notations**

We denote the joint generative distribution as $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{z}) p_\theta(\mathbf{x}|\mathbf{z})$, where the marginal $p_\theta(\mathbf{z})$ is interpreted as the targeted distribution of the latent space and the marginal $p_\theta(\mathbf{x}) = \mathbb{E}_{p_\theta(\mathbf{z})}[p_\theta(\mathbf{x}|\mathbf{z})] = \int_{\mathbf{z}} p_\theta(\mathbf{x}|\mathbf{z}) p_\theta(\mathbf{z}) \, d\mathbf{z}$ as the generated data distribution, with the generative model described by $p_\theta(\mathbf{x}|\mathbf{z})$; $\mathbb{E}$ stands for the expected value. The joint data distribution is $q_\phi(\mathbf{x}, \mathbf{z}) = p_\mathcal{D}(\mathbf{x}) q_\phi(\mathbf{z}|\mathbf{x})$, where $p_\mathcal{D}(\mathbf{x})$ denotes the empirical data distribution, $q_\phi(\mathbf{z}|\mathbf{x})$ is an inference or encoding model, and the marginal $q_\phi(\mathbf{z})$ denotes the "true" or "aggregated" distribution of the latent space. We denote the parameters of the encoders as $\phi_\mathbf{a}$ and $\phi_\mathbf{z}$, and those of the decoders as $\theta_\mathbf{c}$ and $\theta_\mathbf{x}$. The discriminators corresponding to Kullback–Leibler divergences are denoted as $\mathcal{D}_\mathbf{x}$, where the subscript indicates the space to which the discriminator is applied. The cross-entropy metrics are denoted as $\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}$, where the subscript indicates the corresponding vectors. $\mathbf{X}$ denotes a random vector, while the corresponding realization is denoted as $\mathbf{x}$.

#### **1. Introduction**

Deep supervised classifiers demonstrate impressive performance when the amount of labeled data is large. However, their performance deteriorates significantly as the number of labeled samples decreases. Recently, semi-supervised classifiers based on deep generative models such as VAE (M1 + M2) [1], AAE [2], and CatGAN [3], along with several other approaches based on multi-view and contrastive metrics [4,5], to mention just the most recent ones, have been considered as a solution to this problem. Despite the remarkable reported results, the information-theoretic analysis of

semi-supervised classifiers based on generative models and the role of the different priors aiming to fill the gap left by the lack of labeled data remain little studied. Therefore, in this paper we address these issues using the IB principle [6] and practically compare different priors on the same classifier architecture.

Instead of considering the latent space of generative models such as VAE (M1 + M2) [1] and AAE [2], trained in an unsupervised way, as suitable features for classification, we depart from the IB formulation of supervised classification, where we consider an encoder-decoder formulation of the classifier and impose priors on its latent space. Thus, we study an approach to semi-supervised classification based on an IB formulation with a variational decomposition of the IB compression and classification mutual information terms. To better understand the role and impact of the different elements of the variational IB on the classification accuracy, we consider two types of priors on the latent space of the classifier: (i) hand-crafted and (ii) learnable priors. *Hand-crafted* latent space priors impose constraints on the distribution of the latent space by fitting it to some targeted distribution according to the variational decomposition of the compression term of the IB. This type of latent space prior is well known from information dropout [7]. One can also apply the same variational decomposition to the classification term of the IB, where the distribution of labels is supposed to follow some targeted class distribution to maximize the mutual information between the inferred labels and the targeted ones. This type of class label space regularization reflects the adversarial classification used in AAE [2] and CatGAN [3]. In contrast, *learnable* latent space priors aim at minimizing the need for human expertise in imposing priors on the latent space. Instead, the priors are learned directly from unlabeled data using the auto-encoding (AE) principle. In this way, the learnable priors are supposed to compensate for the lack of labeled data in semi-supervised learning while minimizing the need for hand-crafted control of the latent space distribution.

We demonstrate that several state-of-the-art models such as AAE [2], CatGAN [3], and VAE (M1 + M2) [1] can be considered instances of the variational IB with learnable priors. At the same time, the role of different regularizers in hand-crafted semi-supervised learning is generalized and linked to known frameworks such as information dropout [7].

We evaluate our model on the standard MNIST dataset with both hand-crafted and learnable features. Besides revealing the impact of different components of the variational IB factorization, we demonstrate that the proposed model outperforms prior works on this dataset.

Our main contribution is threefold: (i) we propose a new formulation of the IB for semi-supervised classification and use a variational decomposition to convert it into a practically tractable setup with learnable parameters; (ii) we develop the variational IB for two classes of priors on the latent space of the classifier, hand-crafted and learnable, and show its link to state-of-the-art semi-supervised methods; (iii) we investigate the role of these priors and of different regularizers in the classification, latent and reconstruction spaces for the same fixed architecture under different amounts of training data.

#### **2. Related Work**

**Regularization techniques in semi-supervised learning**: Semi-supervised learning tries to find a way to benefit from the large number of unlabeled samples available for training. The most common way to leverage unlabeled data is to add a special regularization term or mechanism that helps the model generalize to unseen data. The recent work [8] identifies three ways to construct such a regularization: (i) entropy minimization, (ii) consistency regularization and (iii) generic regularization. Entropy minimization [9,10] encourages the model to output confident predictions on unlabeled data. More recent work [3] extends this concept to adversarially generated samples, or fakes, for which the entropy of the class label distribution is maximized. Adversarial regularization of the label space was considered in [2], where a discriminator was trained to ensure that the labels produced by the classifier follow a prior distribution, defined to be categorical. Consistency regularization [11,12] encourages the model to produce the same output distribution when its inputs are perturbed. Finally, generic regularization encourages the

model to generalize well and avoid overfitting the training data. It can be achieved by imposing regularizers and corresponding priors on the model parameters or feature vectors.

In this work, we implicitly use the concepts of all three forms of regularization. However, instead of adding additional regularizers to a baseline classifier as suggested by the framework in [8], we derive the corresponding counterparts from a semi-supervised IB framework. In this way, we justify their origin and investigate their impact on the overall classification accuracy for the same system architecture.

**Information bottleneck:** In recent years, the IB framework [6] has been considered a theoretical framework for the analysis and explanation of supervised deep learning systems. However, as shown in [13], the original IB framework faces several practical issues: (i) for deterministic deep networks, the IB functional is either infinite for the network parameters, which leads to an ill-posed optimization problem, or piecewise constant, hence not admitting gradient-based optimization methods, and (ii) the invariance of the IB functional under bijections prevents it from capturing properties of the learned representation that are desirable for classification. In the same work, the authors demonstrate that these issues can be partly resolved for stochastic deep networks, for networks that include a (hard or soft) decision rule, or by replacing the IB functional with related but better-behaved cost functions. It is important to mention that the same authors also note that, rather than trying to repair the inherent problems of the IB functional, a better approach may be to design regularizers on the latent representation that enforce the desired properties directly.

In our work, we extend these ideas using the variational approximation approach suggested in [14], which was also applied to unsupervised models in previous works [15,16]. More particularly, we extend the IB framework to semi-supervised classification and, as discussed above, consider two different ways of regularizing the latent space of the classifier, i.e., either traditional hand-crafted priors or the suggested learnable priors. Although we do not consider semi-supervised clustering and conditional generation in this work, the proposed findings can be extended to these problems in a way similar to prior works such as AAE [2], ADGM [17] and SeGMA [18].

**The closest works:** The proposed framework is closely related to several families of semi-supervised classifiers based on generative models. VAE (M1 + M2) [1] combines the latent-feature discriminative model M1 and the generative semi-supervised model M2: a latent representation is learned using the generative model M1, and subsequently the generative semi-supervised model M2 is trained using embeddings from this first latent representation instead of the raw data. The semi-supervised AAE classifier [2] is based on the AE architecture, where the encoder of the AE outputs two latent representations: one representing the class and another the style. The latent class representation is regularized by an adversarial loss forcing it to follow a categorical distribution, which is claimed to play an essential role in the overall classification performance. The latent style representation is regularized to follow a Gaussian distribution. In both VAE and AAE, the mean square error (MSE) metric is used for the reconstruction space loss. CatGAN [3] is an extension of the GAN and is based on an objective function that trades off the mutual information between observed examples and their predicted categorical class distribution against the robustness of the classifier to an adversarial generative model.

In contrast to the above approaches and following the IB framework, we formulate the semi-supervised classification problem as the training of a classifier that compresses the input **x** to a latent representation **a** via an encoding that is supposed to retain only class-relevant information, controlled by a decoder as shown in Figure 1. If the amount of labeled data is sufficiently large, a supervised classifier can achieve this goal. However, when the amount of labeled examples is small, such an encoder-decoder pair representing an IB-driven classifier is regularized by latent space and adversarial label space regularizers to fill the gap in training data. Adversarial label space regularization was already used in AAE and CatGAN. Latent space regularization in the scope of the IB framework was reported in [7]. In this paper, we demonstrate that both label and latent space regularizations are instances of the generalized IB formulation developed in Section 3. At the same time, in contrast to the hypothesis that the considered label space and latent space regularizations

are the driving factors behind the success of semi-supervised classifiers, we demonstrate that the hand-crafted priors considered in these models cannot fully compensate for the lack of labeled data and lead to relatively poor performance in comparison to a fully supervised system based on a sole cross-entropy metric. For these reasons, we analyze another mechanism of latent space regularization based on learnable priors, as shown in Figure 2 and developed in Section 4. Along this line, we provide an IB formulation of the AAE and explain the driving mechanisms behind its success as an instance of the IB with learnable priors. Finally, we present several extensions that explain the IB origin and role of adversarial regularization in the reconstruction space.

**Figure 1.** Classification with the hand-crafted latent space regularization.

**Figure 2.** Classification with the learnable latent space regularization.

**Summary:** The considered methods of semi-supervised learning can be differentiated based on: (i) *the targeted tasks* (auto-encoding, clustering, generation or classification that can be accomplished depending on available labeled data); (ii) *the architecture in terms of the latent space representation* (with

a single representation vector or with multiple representation vectors); (iii) *the usage of IB or other underlying frameworks* (methods derived from the IB directly or using regularization techniques); (iv) *the label space regularization* (based on available unlabeled data, augmented labeled data, synthetically generated labeled and unlabeled data, especially designed adversarial examples); (v) *the latent space regularization* (hand-crafted regularizers and priors or learnable priors under the reconstruction and contrastive setups) and (vi) *the reconstruction space regularization in case of reconstruction setup* (based on unlabeled and labeled data, augmented data under certain perturbations, synthetically generated examples).

In this work, our main focus is latent space regularization with hand-crafted and learnable priors under the reconstruction setup within the IB framework. Our main task is semi-supervised classification. We do not consider any augmentation or adversarial techniques besides a simple stochastic encoding based on the addition of data-independent noise at the system input, or even deterministic encoding without any form of augmentation. The regularization of the label space and reconstruction space is solely based on terms derived from the IB framework and only includes the available labeled and unlabeled data without any form of augmentation. In this way, we want to investigate the role and impact of latent space regularization as such in IB-based semi-supervised classification. The use of the above-mentioned augmentation techniques should be further investigated and will likely provide additional performance improvement.

#### **3. IB with Hand-Crafted Priors (HCP)**

We assume that a semi-supervised classifier has access to $N$ labeled training samples $\{\mathbf{x}_m, \mathbf{c}_m\}_{m=1}^{N}$, where $\mathbf{x}_m \in \mathbb{R}^D$ denotes the $m$-th data sample and $\mathbf{c}_m$ the corresponding encoded class label from the set $\{1, 2, \cdots, M_c\}$, generated from the joint distribution $p(\mathbf{c}, \mathbf{x})$, and to unlabeled data samples $\{\mathbf{x}_j\}_{j=1}^{J}$ with $J \gg N$. To integrate the knowledge about the labeled and unlabeled data at training, one can formulate the IB as:

$$\mathcal{L}^{\text{HCP}}(\phi_\mathbf{a}) = I_{\phi_\mathbf{a}}(\mathbf{X}; \mathbf{A}) - \beta_\mathbf{c} I_{\phi_\mathbf{a}}(\mathbf{A}; \mathbf{C}), \tag{1}$$

where $\mathbf{a}$ denotes the latent representation, $\beta_\mathbf{c}$ is a Lagrangian multiplier, and the IB terms are defined as $I_{\phi_\mathbf{a}}(\mathbf{X}; \mathbf{A}) = \mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{x},\mathbf{a})}\left[\log \frac{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}{q_{\phi_\mathbf{a}}(\mathbf{a})}\right]$ and $I_{\phi_\mathbf{a}}(\mathbf{A}; \mathbf{C}) = \mathbb{E}_{p(\mathbf{c},\mathbf{x})}\left[\mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}\left[\log \frac{q_{\phi_\mathbf{a}}(\mathbf{c}|\mathbf{a})}{p(\mathbf{c})}\right]\right]$.

According to the above IB formulation, the encoder $q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})$ is trained to minimize the mutual information between $\mathbf{X}$ and $\mathbf{A}$ while ensuring that the decoder $q_{\phi_\mathbf{a}}(\mathbf{c}|\mathbf{a})$ can reliably decide on the labels $\mathbf{C}$ from the compressed representation $\mathbf{A}$. The trade-off between the compression and recognition terms is controlled by $\beta_\mathbf{c}$. Thus, it is assumed that the information retained in the latent representation $\mathbf{A}$ represents a sufficient statistic for the class labels $\mathbf{C}$.

However, since the optimal $q_{\phi_\mathbf{a}}(\mathbf{c}|\mathbf{a})$ is unknown, the second term $I_{\phi_\mathbf{a}}(\mathbf{A}; \mathbf{C})$ is lower bounded by $I_{\phi_\mathbf{a},\theta_\mathbf{c}}(\mathbf{A}; \mathbf{C})$ using a variational approximation $p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})$:

$$\begin{split} I_{\phi_\mathbf{a}}(\mathbf{A}; \mathbf{C}) &= \mathbb{E}_{p(\mathbf{c},\mathbf{x})}\left[\mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}\left[\log \frac{q_{\phi_\mathbf{a}}(\mathbf{c}|\mathbf{a})}{p(\mathbf{c})} \frac{p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})}{p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})}\right]\right] \\ &= \mathbb{E}_{p(\mathbf{c},\mathbf{x})}\left[\mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}\left[\log \frac{p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})}{p(\mathbf{c})}\right]\right] + \mathbb{E}_{p(\mathbf{c},\mathbf{x})}\left[D_{\text{KL}}\left(q_{\phi_\mathbf{a}}(\mathbf{c}|\mathbf{a}) \,\|\, p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})\right)\right] \\ &\geq \mathbb{E}_{p(\mathbf{c},\mathbf{x})}\left[\mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}\left[\log \frac{p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})}{p(\mathbf{c})}\right]\right], \end{split} \tag{2}$$

where $D_{\text{KL}}\left(q_{\phi_\mathbf{a}}(\mathbf{c}|\mathbf{a}) \,\|\, p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})\right) = \mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}\left[\log \frac{q_{\phi_\mathbf{a}}(\mathbf{c}|\mathbf{a})}{p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})}\right]$ and the inequality follows from the fact that $D_{\text{KL}}\left(q_{\phi_\mathbf{a}}(\mathbf{c}|\mathbf{a}) \,\|\, p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})\right) \geq 0$. Denoting $I_{\phi_\mathbf{a},\theta_\mathbf{c}}(\mathbf{A}; \mathbf{C}) = \mathbb{E}_{p(\mathbf{c},\mathbf{x})}\left[\mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}\left[\log \frac{p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})}{p(\mathbf{c})}\right]\right]$, we thus have

$$I_{\phi_\mathbf{a}}(\mathbf{A}; \mathbf{C}) \geq I_{\phi_\mathbf{a},\theta_\mathbf{c}}(\mathbf{A}; \mathbf{C}).$$
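The variational lower bound in Equation (2) can be checked numerically on a toy discrete example; the three distributions below are arbitrary illustrative assumptions, chosen only to show that the gap between the exact term and its bound equals an expected KL-divergence.

```python
import numpy as np

# For a fixed latent value a, compare the exact term with its variational bound:
# E_{q(c|a)}[log q(c|a)/p(c)]  >=  E_{q(c|a)}[log p_theta(c|a)/p(c)],
# with the gap equal to D_KL(q(c|a) || p_theta(c|a)) >= 0.
q_c_given_a = np.array([0.7, 0.2, 0.1])   # "true" decoder posterior (assumed)
p_theta = np.array([0.5, 0.3, 0.2])       # variational approximation (assumed)
p_c = np.array([1/3, 1/3, 1/3])           # uniform class prior

exact = np.sum(q_c_given_a * np.log(q_c_given_a / p_c))
bound = np.sum(q_c_given_a * np.log(p_theta / p_c))
kl = np.sum(q_c_given_a * np.log(q_c_given_a / p_theta))

print(exact >= bound)                  # True
print(np.isclose(exact - bound, kl))   # True
```

Maximizing the bound over the parameters of $p_{\theta_\mathbf{c}}(\mathbf{c}|\mathbf{a})$ thus tightens the gap, recovering the exact mutual information term when the approximation matches the decoder posterior.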

Thus, the IB (1) can be reformulated as:

$$\mathcal{L}^{\text{HCP}_L}(\phi_\mathbf{a}, \theta_\mathbf{c}) = I_{\phi_\mathbf{a}}(\mathbf{X}; \mathbf{A}) - \beta_\mathbf{c} I_{\phi_\mathbf{a},\theta_\mathbf{c}}(\mathbf{A}; \mathbf{C}). \tag{3}$$

The considered IB is schematically shown in Figure 1; we next develop each component of the IB formulation in detail.

#### *3.1. Decomposition of the First Term: Hand-Crafted Regularization*

The first mutual information term $I_{\phi_\mathbf{a}}(\mathbf{X}; \mathbf{A})$ in (3) can be decomposed using a factorization by a parametric marginal distribution $p_{\theta_\mathbf{a}}(\mathbf{a})$ that represents a prior on the latent representation $\mathbf{a}$:

$$\begin{split} I_{\phi_\mathbf{a}}(\mathbf{X};\mathbf{A}) &= \mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{x},\mathbf{a})} \left[ \log \frac{q_{\phi_\mathbf{a}}(\mathbf{x},\mathbf{a})}{q_{\phi_\mathbf{a}}(\mathbf{a})\, p_{\mathcal{D}}(\mathbf{x})} \right] = \mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{x},\mathbf{a})} \left[ \log \frac{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}{q_{\phi_\mathbf{a}}(\mathbf{a})} \frac{p_{\theta_\mathbf{a}}(\mathbf{a})}{p_{\theta_\mathbf{a}}(\mathbf{a})} \right] \\ &= \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})} \underbrace{\left[ D_{\text{KL}} \left( q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{X}=\mathbf{x}) \,\|\, p_{\theta_\mathbf{a}}(\mathbf{a}) \right) \right]}_{\mathcal{D}_{\mathbf{a}|\mathbf{x}}} - \underbrace{D_{\text{KL}} \left( q_{\phi_\mathbf{a}}(\mathbf{a}) \,\|\, p_{\theta_\mathbf{a}}(\mathbf{a}) \right)}_{\mathcal{D}_{\mathbf{a}}}, \end{split} \tag{4}$$

where the first term denotes the KL-divergence $\mathcal{D}_{\mathbf{a}|\mathbf{x}} \triangleq D_{\text{KL}}\left(q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{X}=\mathbf{x}) \,\|\, p_{\theta_\mathbf{a}}(\mathbf{a})\right) = \mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}\left[\log \frac{q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})}{p_{\theta_\mathbf{a}}(\mathbf{a})}\right]$ and the second term denotes the KL-divergence $\mathcal{D}_\mathbf{a} \triangleq D_{\text{KL}}\left(q_{\phi_\mathbf{a}}(\mathbf{a}) \,\|\, p_{\theta_\mathbf{a}}(\mathbf{a})\right) = \mathbb{E}_{q_{\phi_\mathbf{a}}(\mathbf{a})}\left[\log \frac{q_{\phi_\mathbf{a}}(\mathbf{a})}{p_{\theta_\mathbf{a}}(\mathbf{a})}\right]$.
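The decomposition in Equation (4) can be verified numerically for a small discrete toy model; the encoder table and the prior below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Discrete check of Equation (4):
# I(X;A) = E_x[D_KL(q(a|x) || p_theta(a))] - D_KL(q(a) || p_theta(a))
p_x = np.array([0.6, 0.4])            # empirical data distribution p_D(x)
q_a_given_x = np.array([[0.9, 0.1],   # encoder rows: q(a|x) for each x
                        [0.2, 0.8]])
p_theta_a = np.array([0.5, 0.5])      # hand-crafted prior on the latent space

q_a = p_x @ q_a_given_x               # aggregated latent marginal q(a)

def kl(p, q):
    return np.sum(p * np.log(p / q))

mi = sum(p_x[i] * kl(q_a_given_x[i], q_a) for i in range(2))            # I(X;A)
decomp = (sum(p_x[i] * kl(q_a_given_x[i], p_theta_a) for i in range(2))
          - kl(q_a, p_theta_a))                                         # RHS of (4)

print(np.isclose(mi, decomp))  # True
```

Note that the identity holds for *any* choice of prior $p_{\theta_\mathbf{a}}(\mathbf{a})$: the prior only redistributes the mutual information between the two KL terms, which is what makes the hand-crafted regularization interpretation possible.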

It should be pointed out that the encoding $q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})$ can be either stochastic or deterministic. *Stochastic encoding* $q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})$ can be implemented via: (a) *multiplicative encoding* applied to the input $\mathbf{x}$ as $\mathbf{a} = f_{\phi_\mathbf{a}}(\mathbf{x} \odot \boldsymbol{\epsilon})$ or in the latent space as $\mathbf{a} = f_{\phi_\mathbf{a}}(\mathbf{x}) \odot \boldsymbol{\epsilon}$, where $f_{\phi_\mathbf{a}}(\mathbf{x})$ is the output of the encoder, $\odot$ denotes the element-wise product and $\boldsymbol{\epsilon}$ follows some data-independent or data-dependent distribution as in information dropout [7]; (b) *additive encoding* applied to the input $\mathbf{x}$ as $\mathbf{a} = f_{\phi_\mathbf{a}}(\mathbf{x} + \boldsymbol{\epsilon})$ with data-independent perturbations, e.g., as in PixelGAN [19], or in the latent space with generally data-dependent perturbations of the form $\mathbf{a} = f_{\phi_\mathbf{a}}(\mathbf{x}) + \sigma_{\phi_\mathbf{a}}(\mathbf{x}) \odot \boldsymbol{\epsilon}$, where $f_{\phi_\mathbf{a}}(\mathbf{x})$ and $\sigma_{\phi_\mathbf{a}}(\mathbf{x})$ are outputs of the encoder and $\boldsymbol{\epsilon}$ is assumed to be a zero-mean unit-variance vector as in VAE [1]; or (c) *concatenative/mixing encoding* $\mathbf{a} = f_{\phi_\mathbf{a}}([\mathbf{x}, \boldsymbol{\epsilon}])$ that is generally applied at the input of the encoder. *Deterministic encoding* is based on the mapping $\mathbf{a} = f_{\phi_\mathbf{a}}(\mathbf{x})$, i.e., no randomization is introduced, e.g., as in one of the encoding modalities of AAE [2].
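The encoding variants above can be sketched as follows; the `tanh` stand-in for the encoder $f_{\phi_\mathbf{a}}$, the noise scale 0.1, and the data-dependent scale are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.tanh  # stand-in for the encoder network f_phi_a (assumption)

x = rng.standard_normal(5)    # input sample
eps = rng.standard_normal(5)  # data-independent Gaussian noise

a_mult = f(x * eps)                       # (a) multiplicative encoding at the input
a_add = f(x + 0.1 * eps)                  # (b) additive encoding at the input
a_lat = f(x) + 0.1 * np.abs(f(x)) * eps   # (b) data-dependent latent perturbation
a_cat = f(np.concatenate([x, eps]))       # (c) concatenative/mixing encoding
a_det = f(x)                              # deterministic encoding
```

The deterministic variant is the limit of the stochastic ones as the noise scale goes to zero; the choice among them changes only how $q_{\phi_\mathbf{a}}(\mathbf{a}|\mathbf{x})$ spreads probability mass, not the IB objective itself.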

#### *3.2. Decomposition of the Second Term*

In this section, we factorize the second term in (3) to address semi-supervised training, i.e., to integrate the knowledge of both unlabeled and labeled data available at training time:

$$\begin{split} I_{\phi_{\mathbf{a}},\theta_{\mathbf{c}}}(\mathbf{A};\mathbf{C}) &\triangleq \mathbb{E}_{p(\mathbf{c},\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[\log\frac{p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})}{p(\mathbf{c})}\frac{p_{\theta_{\mathbf{c}}}(\mathbf{c})}{p_{\theta_{\mathbf{c}}}(\mathbf{c})}\right]\right] \\ &= -\mathbb{E}_{p(\mathbf{c})}\left[\log p_{\theta_{\mathbf{c}}}(\mathbf{c})\right] - \mathbb{E}_{p(\mathbf{c})}\left[\log\frac{p(\mathbf{c})}{p_{\theta_{\mathbf{c}}}(\mathbf{c})}\right] + \mathbb{E}_{p(\mathbf{c},\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[\log p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})\right]\right] \\ &= H(p(\mathbf{c});p_{\theta_{\mathbf{c}}}(\mathbf{c})) - D_{\mathrm{KL}}\left(p(\mathbf{c})\,\|\,p_{\theta_{\mathbf{c}}}(\mathbf{c})\right) - H_{\theta_{\mathbf{c}},\phi_{\mathbf{a}}}(\mathbf{C}|\mathbf{A}), \end{split} \tag{5}$$

with $H(p(\mathbf{c}); p_{\theta_{\mathbf{c}}}(\mathbf{c})) = -\mathbb{E}_{p(\mathbf{c})}[\log p_{\theta_{\mathbf{c}}}(\mathbf{c})]$ denoting the cross-entropy between $p(\mathbf{c})$ and $p_{\theta_{\mathbf{c}}}(\mathbf{c})$, and $\mathcal{D}_{\mathbf{c}} \triangleq D_{\mathrm{KL}}\left(p(\mathbf{c})\,\|\,p_{\theta_{\mathbf{c}}}(\mathbf{c})\right) = \mathbb{E}_{p(\mathbf{c})}\left[\log\frac{p(\mathbf{c})}{p_{\theta_{\mathbf{c}}}(\mathbf{c})}\right]$ denoting the KL-divergence between the prior class label distribution $p(\mathbf{c})$ and the estimated one $p_{\theta_{\mathbf{c}}}(\mathbf{c})$. One can assume different encodings of the labels $\mathbf{c}$, but one of the most frequently used is one-hot encoding, which leads to the categorical distribution $p(\mathbf{c}) = \mathrm{cat}(\mathbf{c})$.

Finally, the conditional entropy is defined as $\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}} \triangleq H_{\theta_{\mathbf{c}},\phi_{\mathbf{a}}}(\mathbf{C}|\mathbf{A}) = -\mathbb{E}_{p(\mathbf{c},\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[\log p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})\right]\right]$.

Since $H(p(\mathbf{c}); p_{\theta_{\mathbf{c}}}(\mathbf{c})) \geq 0$, one can lower bound (5) as $I_{\phi_{\mathbf{a}},\theta_{\mathbf{c}}}(\mathbf{A};\mathbf{C}) \geq I^{\mathrm{L}}_{\phi_{\mathbf{a}},\theta_{\mathbf{c}}}(\mathbf{A};\mathbf{C})$, where:

$$I^{\mathrm{L}}_{\phi_{\mathbf{a}},\theta_{\mathbf{c}}}(\mathbf{A};\mathbf{C}) \triangleq -\underbrace{D_{\mathrm{KL}}\left(p(\mathbf{c})\,\|\,p_{\theta_{\mathbf{c}}}(\mathbf{c})\right)}_{\mathcal{D}_{\mathbf{c}}} - \underbrace{H_{\theta_{\mathbf{c}},\phi_{\mathbf{a}}}(\mathbf{C}|\mathbf{A})}_{\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}}}. \tag{6}$$
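The step from (5) to (6) rests only on the nonnegativity of the dropped cross-entropy term; for categorical variables this is easy to verify numerically (the distributions below are illustrative):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p_c = np.array([0.5, 0.3, 0.2])      # prior class distribution p(c) (illustrative)
p_theta = np.array([0.4, 0.4, 0.2])  # estimated distribution p_theta_c(c) (illustrative)

H_pq = cross_entropy(p_c, p_theta)   # the term dropped when passing from (5) to (6)
D_c = kl(p_c, p_theta)               # the D_c term kept in (6)

# cross-entropy decomposes as H(p; q) = H(p) + D_KL(p || q), hence H(p; q) >= 0,
# so removing it from (5) can only decrease the value: (6) is a lower bound
assert np.isclose(H_pq, entropy(p_c) + D_c)
assert H_pq >= 0.0
```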

#### *3.3. Supervised and Semi-Supervised Models with/without Hand-Crafted Priors*

Summarizing the above variational decomposition of (3) with the terms (4) and (6), we proceed with four practical scenarios.

*Supervised training without latent space regularization* (**baseline**): based on the term $\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}}$ in (6):

$$\mathcal{L}^{\mathrm{HCP}}_{\mathrm{S-NoReg}}(\theta_{\mathbf{c}}, \phi_{\mathbf{a}}) = \mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}}. \tag{7}$$

*Semi-supervised training without latent space regularization* is based on the terms $\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}}$ and $\mathcal{D}_{\mathbf{c}}$ in (6):

$$\mathcal{L}^{\mathrm{HCP}}_{\mathrm{SS-NoReg}}(\theta_{\mathbf{c}}, \phi_{\mathbf{a}}) = \mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}} + \mathcal{D}_{\mathbf{c}}. \tag{8}$$

*Supervised training with latent space regularization* is based on the term $\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}}$ in (6) and either $\mathcal{D}_{\mathbf{a}|\mathbf{x}}$, or $\mathcal{D}_{\mathbf{a}}$, or both $\mathcal{D}_{\mathbf{a}|\mathbf{x}}$ and $\mathcal{D}_{\mathbf{a}}$ jointly in (4):

$$\mathcal{L}^{\mathrm{HCP}}_{\mathrm{S-Reg}}(\theta_{\mathbf{c}}, \phi_{\mathbf{a}}) = \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathcal{D}_{\mathbf{a}|\mathbf{x}}\right] + \mathcal{D}_{\mathbf{a}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}}. \tag{9}$$

*Semi-supervised training with latent space regularization* deploys all terms in (4) and (6):

$$\mathcal{L}^{\mathrm{HCP}}_{\mathrm{SS-Reg}}(\theta_{\mathbf{c}}, \phi_{\mathbf{a}}) = \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathcal{D}_{\mathbf{a}|\mathbf{x}}\right] + \mathcal{D}_{\mathbf{a}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}}. \tag{10}$$

The empirical evaluation of these setups on the MNIST dataset is given in Section 5. The same encoder and decoder architectures were used throughout to establish the impact of each term as a function of the amount of available labeled data.
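As a sketch, the four objectives (7)–(10) differ only in which per-batch terms they combine; assuming the divergence estimates have already been computed as scalars (the numeric values below are illustrative, not from the paper), their assembly might look like:

```python
def loss_s_noreg(D_cc, **_):
    # (7): supervised baseline, no latent space regularization
    return D_cc

def loss_ss_noreg(D_cc, D_c, **_):
    # (8): semi-supervised, no latent space regularization
    return D_cc + D_c

def loss_s_reg(D_cc, D_ax, D_a, beta_c=1.0, **_):
    # (9): supervised with latent space regularization
    return D_ax + D_a + beta_c * D_cc

def loss_ss_reg(D_cc, D_c, D_ax, D_a, beta_c=1.0, **_):
    # (10): semi-supervised with latent space regularization
    return D_ax + D_a + beta_c * D_cc + beta_c * D_c

# illustrative per-batch values for the divergence terms (hypothetical)
terms = dict(D_cc=0.75, D_c=0.25, D_ax=0.5, D_a=0.125)
assert loss_ss_noreg(**terms) == 1.0
assert loss_ss_reg(**terms, beta_c=2.0) == 0.5 + 0.125 + 2.0 * (0.75 + 0.25)
```

In a training loop, each `D_*` scalar would come from the corresponding KL or cross-entropy estimator on the current mini-batch.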

#### **4. IB with Learnable Priors (LP)**

In this section, we extend the results obtained for the hand-crafted priors to learnable priors. Instead of applying the hand-crafted regularization of the latent representation **a** as suggested by the IB (3) and shown in Figure 1, we assume that the latent representation **a** is regularized by a specially designed AE, as shown in Figure 2. The AE-based regularization has two components: (i) the latent space **z** regularization and (ii) the observation space regularization. The design and training of this latent space regularizer in the form of an AE is guided by its own IB. In the general case, all elements of the AE, i.e., its encoder-decoder pair and its latent and observation space regularizers, are conditioned on the learned class label **c**. The resulting Lagrangian with the learnable prior is (formally, one should consider $I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}}}(\mathbf{X};\mathbf{Z}|\mathbf{C})$ for the term A; however, since $I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}}}(\mathbf{X};\mathbf{Z}|\mathbf{C}) \leq I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}}}(\mathbf{A};\mathbf{Z}|\mathbf{C})$ due to the Markovianity of the considered architecture ([20], Data Processing Inequality, Theorem 2.8.1), we consider the decomposition starting from **A**):

$$\mathcal{L}^{\mathrm{LP}}(\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}) = \underbrace{I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}}}(\mathbf{A};\mathbf{Z}|\mathbf{C})}_{\mathrm{A}} - \beta_{\mathbf{x}}\underbrace{I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}}(\mathbf{X};\mathbf{Z}|\mathbf{C})}_{\mathrm{B}} - \beta_{\mathbf{c}}\underbrace{I^{\mathrm{L}}_{\phi_{\mathbf{a}},\theta_{\mathbf{c}}}(\mathbf{A};\mathbf{C})}_{\mathrm{C}}, \tag{11}$$

where $\beta_{\mathbf{x}}$ is a Lagrange multiplier controlling the reconstruction of **x** at the decoder and $\beta_{\mathbf{c}}$ is the same as in (1).

*Entropy* **2020**, *22*, 943

The terms A and B, conditioned on the class **c**, play the role of the latent space regularizer by imposing learnable constraints on the vector **a**. These two terms correspond to the hand-crafted counterpart $I_{\phi_{\mathbf{a}}}(\mathbf{X};\mathbf{A})$ in (3). The term C in the learnable IB formulation corresponds to the classification part of the hand-crafted IB in (3) and can be factorized along the same lines as in (6). Therefore, we proceed with the factorization of terms A and B.

One can also consider the following IB formulation with learnable priors and no conditioning on **c** in term A of (11), leading to an unconditional counterpart D below that can be viewed as an IB generalization of the semi-supervised AAE [2]:

$$\mathcal{L}^{\mathrm{LP}}_{\mathrm{AAE}}(\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}) = \underbrace{I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}}}(\mathbf{A};\mathbf{Z})}_{\mathrm{D}} - \beta_{\mathbf{x}}\underbrace{I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}}(\mathbf{X};\mathbf{Z}|\mathbf{C})}_{\mathrm{B}} - \beta_{\mathbf{c}}\underbrace{I^{\mathrm{L}}_{\phi_{\mathbf{a}},\theta_{\mathbf{c}}}(\mathbf{A};\mathbf{C})}_{\mathrm{C}}. \tag{12}$$

#### *4.1. Decomposition of Latent Space Regularizer*

We denote $p_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}}}(\mathbf{x},\mathbf{a},\mathbf{c},\mathbf{z}) = p_{\mathcal{D}}(\mathbf{x})\, q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})\, p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})\, q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a},\mathbf{c})$ and decompose the term A in (11) using a variational factorization as:

$$\begin{split} I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}}}(\mathbf{A};\mathbf{Z}|\mathbf{C}) &= \mathbb{E}_{p_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}}}(\mathbf{x},\mathbf{a},\mathbf{c},\mathbf{z})}\left[\log\frac{q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a},\mathbf{c})}{q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{c})}\frac{p_{\theta_{\mathbf{z}}}(\mathbf{z})}{p_{\theta_{\mathbf{z}}}(\mathbf{z})}\right] \\ &= \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[\mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})}\Big[\underbrace{D_{\mathrm{KL}}\left(q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{A}=\mathbf{a},\mathbf{C}=\mathbf{c})\,\|\,p_{\theta_{\mathbf{z}}}(\mathbf{z})\right)}_{\mathcal{D}_{\mathbf{z}|\mathbf{a},\mathbf{c}}}\Big]\right]\right] \\ &\quad- \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[\mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})}\Big[\underbrace{D_{\mathrm{KL}}\left(q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{C}=\mathbf{c})\,\|\,p_{\theta_{\mathbf{z}}}(\mathbf{z})\right)}_{\mathcal{D}_{\mathbf{z}|\mathbf{c}}}\Big]\right]\right], \end{split} \tag{13}$$

where $\mathcal{D}_{\mathbf{z}|\mathbf{a},\mathbf{c}} \triangleq D_{\mathrm{KL}}\left(q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a},\mathbf{c})\,\|\,p_{\theta_{\mathbf{z}}}(\mathbf{z})\right) = \mathbb{E}_{q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a},\mathbf{c})}\left[\log\frac{q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a},\mathbf{c})}{p_{\theta_{\mathbf{z}}}(\mathbf{z})}\right]$ and $\mathcal{D}_{\mathbf{z}|\mathbf{c}} \triangleq D_{\mathrm{KL}}\left(q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{c})\,\|\,p_{\theta_{\mathbf{z}}}(\mathbf{z})\right) = \mathbb{E}_{q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{c})}\left[\log\frac{q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{c})}{p_{\theta_{\mathbf{z}}}(\mathbf{z})}\right]$ denote the KL-divergence terms, and $q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{c}) = \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a},\mathbf{c})\right]\right]$.
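In practice, $\mathcal{D}_{\mathbf{z}|\mathbf{a},\mathbf{c}}$ admits a closed form under one common (but not mandated) modelling choice, namely when $q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a},\mathbf{c})$ is a diagonal Gaussian and the prior $p_{\theta_{\mathbf{z}}}(\mathbf{z})$ is standard normal; a minimal sketch with illustrative encoder outputs:

```python
import numpy as np

def kl_diag_gauss_std(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ): the standard closed form
    for a diagonal Gaussian posterior against a standard normal prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# toy encoder outputs for one (a, c) pair (illustrative values)
mu = np.array([0.3, -0.1])
log_var = np.array([-0.2, 0.1])

d_z_ac = kl_diag_gauss_std(mu, log_var)
assert d_z_ac >= 0.0                                        # KL is nonnegative
assert kl_diag_gauss_std(np.zeros(2), np.zeros(2)) == 0.0   # q = prior gives zero
```

For priors without a closed-form KL, the divergence can instead be estimated adversarially, as done in the experiments via the density ratio estimator [24].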

#### *4.2. Decomposition of Reconstruction Space Regularizer*

Denoting $p_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}}(\mathbf{x},\mathbf{a},\mathbf{c},\mathbf{z}) = p_{\mathcal{D}}(\mathbf{x})\, q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})\, p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})\, q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a},\mathbf{c})\, p_{\theta_{\mathbf{x}}}(\mathbf{x}|\mathbf{z},\mathbf{c})$, we decompose the term B in (11) as:

$$\begin{split} I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}}(\mathbf{X};\mathbf{Z}|\mathbf{C}) &= \mathbb{E}_{p_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}}(\mathbf{x},\mathbf{a},\mathbf{c},\mathbf{z})}\left[\log\frac{p_{\theta_{\mathbf{x}}}(\mathbf{x}|\mathbf{z},\mathbf{c})}{p_{\mathcal{D}}(\mathbf{x}|\mathbf{c})}\frac{p_{\theta_{\mathbf{x}}}(\mathbf{x})}{p_{\theta_{\mathbf{x}}}(\mathbf{x})}\right] \\ &= \mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c})}\left[H(p_{\mathcal{D}}(\mathbf{x}|\mathbf{c});p_{\theta_{\mathbf{x}}}(\mathbf{x}))\right] \\ &\quad- \mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c})}\Big[\underbrace{D_{\mathrm{KL}}\left(p_{\mathcal{D}}(\mathbf{x}|\mathbf{C}=\mathbf{c})\,\|\,p_{\theta_{\mathbf{x}}}(\mathbf{x})\right)}_{\mathcal{D}_{\mathbf{x}|\mathbf{c}}}\Big] - \underbrace{H_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}}(\mathbf{X}|\mathbf{Z},\mathbf{C})}_{\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}}, \end{split} \tag{14}$$

where $p_{\theta_{\mathbf{c}}}(\mathbf{c}) = \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})\right]\right]$. The terms are defined as $H(p_{\mathcal{D}}(\mathbf{x}|\mathbf{c});p_{\theta_{\mathbf{x}}}(\mathbf{x})) = -\mathbb{E}_{p_{\mathcal{D}}(\mathbf{x}|\mathbf{c})}[\log p_{\theta_{\mathbf{x}}}(\mathbf{x})]$, $\mathcal{D}_{\mathbf{x}|\mathbf{c}} \triangleq D_{\mathrm{KL}}\left(p_{\mathcal{D}}(\mathbf{x}|\mathbf{C}=\mathbf{c})\,\|\,p_{\theta_{\mathbf{x}}}(\mathbf{x})\right) = \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x}|\mathbf{c})}\left[\log\frac{p_{\mathcal{D}}(\mathbf{x}|\mathbf{c})}{p_{\theta_{\mathbf{x}}}(\mathbf{x})}\right]$ and $\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}} \triangleq H_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}}(\mathbf{X}|\mathbf{Z},\mathbf{C}) = -\mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[\mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})}\left[\mathbb{E}_{q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a},\mathbf{c})}\left[\log p_{\theta_{\mathbf{x}}}(\mathbf{x}|\mathbf{z},\mathbf{c})\right]\right]\right]\right]$. Since $\mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c})}\left[H(p_{\mathcal{D}}(\mathbf{x}|\mathbf{c});p_{\theta_{\mathbf{x}}}(\mathbf{x}))\right] \geq 0$, we can lower bound $I_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}}(\mathbf{X};\mathbf{Z}|\mathbf{C}) \geq I^{\mathrm{L}}_{\phi_{\mathbf{a}},\phi_{\mathbf{z}},\theta_{\mathbf{c}},\theta_{\mathbf{x}}}(\mathbf{X};\mathbf{Z}|\mathbf{C}) \triangleq -\mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c})}\left[\mathcal{D}_{\mathbf{x}|\mathbf{c}}\right] - \mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}$.

#### *4.3. Semi-Supervised Models with Learnable Priors*

Summarizing the above variational decomposition of (11) with the terms (13) and (14), we will consider semi-supervised training with latent space regularization as:

$$\begin{split} \mathcal{L}^{\mathrm{LP}}_{\mathrm{SS-Reg}}(\theta_{\mathbf{c}},\theta_{\mathbf{x}},\phi_{\mathbf{a}},\phi_{\mathbf{z}}) &= \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[\mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})}\left[\mathcal{D}_{\mathbf{z}|\mathbf{a},\mathbf{c}}\right]\right]\right] + \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})}\left[\mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c}|\mathbf{a})}\left[\mathcal{D}_{\mathbf{z}|\mathbf{c}}\right]\right]\right] \\ &\quad+ \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}} + \beta_{\mathbf{x}}\mathbb{E}_{p_{\theta_{\mathbf{c}}}(\mathbf{c})}\left[\mathcal{D}_{\mathbf{x}|\mathbf{c}}\right] + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}}. \end{split} \tag{15}$$

To create a link to the semi-supervised AAE [2], we also consider (12), where all latent and reconstruction space regularizers are independent of **c**, i.e., do not contain conditioning on **c**. *Semi-supervised training with latent space regularization and MSE reconstruction* based on (12):

$$\mathcal{L}^{\mathrm{LP}}_{\mathrm{SS-AAE}}(\theta_{\mathbf{c}},\theta_{\mathbf{x}},\phi_{\mathbf{a}},\phi_{\mathbf{z}}) = \mathcal{D}_{\mathbf{z}} + \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}}, \tag{16}$$

where $\mathcal{D}_{\mathbf{z}} \triangleq D_{\mathrm{KL}}\left(q_{\phi_{\mathbf{z}}}(\mathbf{z})\,\|\,p_{\theta_{\mathbf{z}}}(\mathbf{z})\right) = \mathbb{E}_{q_{\phi_{\mathbf{z}}}(\mathbf{z})}\left[\log\frac{q_{\phi_{\mathbf{z}}}(\mathbf{z})}{p_{\theta_{\mathbf{z}}}(\mathbf{z})}\right]$.

*Semi-supervised training with latent space regularization and with MSE and adversarial reconstruction* based on (12) deploys all terms:

$$\mathcal{L}^{\mathrm{LP}}_{\mathrm{SS-AAE-adv}}(\theta_{\mathbf{c}},\theta_{\mathbf{x}},\phi_{\mathbf{a}},\phi_{\mathbf{z}}) = \mathcal{D}_{\mathbf{z}} + \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}} + \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}}, \tag{17}$$

where $\mathcal{D}_{\mathbf{x}} \triangleq D_{\mathrm{KL}}\left(p_{\mathcal{D}}(\mathbf{x})\,\|\,p_{\theta_{\mathbf{x}}}(\mathbf{x})\right) = \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\log\frac{p_{\mathcal{D}}(\mathbf{x})}{p_{\theta_{\mathbf{x}}}(\mathbf{x})}\right]$.

#### *4.4. Links to State-Of-The-Art Models*

The considered HCP and LP models can be linked to several state-of-the-art unsupervised models, such as VAE [21,22], *β*-VAE [23], AAE [2] and BIB-AE [15], and semi-supervised models, such as AAE [2], CatGAN [3], VAE (M1 + M2) [1] and SeGMA [18].

#### 4.4.1. Links to Unsupervised Models

The proposed LP model (11) generalizes unsupervised models without the categorical latent representation. In addition, unsupervised models in the form of an auto-encoder are used as a latent space regularizer in the LP setup. For these reasons, we briefly consider four models of interest, namely VAE, *β*-VAE, AAE, and BIB-AE.

Before we proceed with the analysis, we define an unsupervised IB for these models. We assume the fused encoders $q_{\phi_{\mathbf{a}}}(\mathbf{a}|\mathbf{x})$ and $q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{a})$ without conditioning on **c** in the inference model, according to Figure 2. We also assume no conditioning on **c** in the generative model.

The Lagrangian of unsupervised IB is defined according to [15]:

$$\mathcal{L}^{\mathrm{U}}(\theta_{\mathbf{x}},\phi_{\mathbf{z}}) = I_{\phi_{\mathbf{z}}}(\mathbf{X};\mathbf{Z}) - \beta_{\mathbf{x}} I_{\phi_{\mathbf{z}},\theta_{\mathbf{x}}}(\mathbf{Z};\mathbf{X}), \tag{18}$$

where similarly to the supervised counterpart (4), we define the first term as:

$$\begin{split} I_{\phi_{\mathbf{z}}}(\mathbf{X};\mathbf{Z}) &= \mathbb{E}_{q_{\phi_{\mathbf{z}}}(\mathbf{x},\mathbf{z})}\left[\log\frac{q_{\phi_{\mathbf{z}}}(\mathbf{x},\mathbf{z})}{q_{\phi_{\mathbf{z}}}(\mathbf{z})p_{\mathcal{D}}(\mathbf{x})}\right] = \mathbb{E}_{q_{\phi_{\mathbf{z}}}(\mathbf{x},\mathbf{z})}\left[\log\frac{q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{x})}{q_{\phi_{\mathbf{z}}}(\mathbf{z})}\frac{p_{\theta_{\mathbf{z}}}(\mathbf{z})}{p_{\theta_{\mathbf{z}}}(\mathbf{z})}\right] \\ &= \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\Big[\underbrace{D_{\mathrm{KL}}\left(q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{X}=\mathbf{x})\,\|\,p_{\theta_{\mathbf{z}}}(\mathbf{z})\right)}_{\mathcal{D}_{\mathbf{z}|\mathbf{x}}}\Big] - \underbrace{D_{\mathrm{KL}}\left(q_{\phi_{\mathbf{z}}}(\mathbf{z})\,\|\,p_{\theta_{\mathbf{z}}}(\mathbf{z})\right)}_{\mathcal{D}_{\mathbf{z}}} \end{split} \tag{19}$$

and similarly to (14) the second term is defined as:

$$\begin{split} I_{\phi_{\mathbf{z}},\theta_{\mathbf{x}}}(\mathbf{Z};\mathbf{X}) &= \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathbb{E}_{q_{\phi_{\mathbf{z}}}(\mathbf{z}|\mathbf{x})}\left[\log\frac{p_{\theta_{\mathbf{x}}}(\mathbf{x}|\mathbf{z})}{p_{\mathcal{D}}(\mathbf{x})}\frac{p_{\theta_{\mathbf{x}}}(\mathbf{x})}{p_{\theta_{\mathbf{x}}}(\mathbf{x})}\right]\right] \\ &= H(p_{\mathcal{D}}(\mathbf{x});p_{\theta_{\mathbf{x}}}(\mathbf{x})) - \underbrace{D_{\mathrm{KL}}\left(p_{\mathcal{D}}(\mathbf{x})\,\|\,p_{\theta_{\mathbf{x}}}(\mathbf{x})\right)}_{\mathcal{D}_{\mathbf{x}}} - \underbrace{H_{\phi_{\mathbf{z}},\theta_{\mathbf{x}}}(\mathbf{X}|\mathbf{Z})}_{\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}}, \end{split} \tag{20}$$

where the definitions of all terms follow from the above equations. Since $H(p_{\mathcal{D}}(\mathbf{x});p_{\theta_{\mathbf{x}}}(\mathbf{x})) \geq 0$, we can lower bound $I_{\phi_{\mathbf{z}},\theta_{\mathbf{x}}}(\mathbf{Z};\mathbf{X}) \geq -\mathcal{D}_{\mathbf{x}} - \mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}$.
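The decomposition (19) is an exact identity for any choice of the prior $p_{\theta_{\mathbf{z}}}(\mathbf{z})$, which also makes explicit why dropping the nonnegative $\mathcal{D}_{\mathbf{z}}$ yields a VAE-style upper bound on $I_{\phi_{\mathbf{z}}}(\mathbf{X};\mathbf{Z})$. A toy discrete check (all distributions chosen arbitrarily for illustration):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p_x = np.array([0.6, 0.4])           # data distribution p_D(x) over 2 symbols
q_z_x = np.array([[0.7, 0.2, 0.1],   # encoder q(z|x); rows sum to 1
                  [0.1, 0.3, 0.6]])
q_z = p_x @ q_z_x                    # marginal q(z)
p_z = np.array([1/3, 1/3, 1/3])      # an arbitrary prior p_theta_z(z)

I_xz = sum(p_x[i] * kl(q_z_x[i], q_z) for i in range(2))   # true I(X; Z)
upper = sum(p_x[i] * kl(q_z_x[i], p_z) for i in range(2))  # E_x[ D_{z|x} ]
decomp = upper - kl(q_z, p_z)                              # right-hand side of (19)

assert np.isclose(I_xz, decomp)  # (19) holds exactly, for this arbitrary prior
assert upper >= I_xz             # dropping D_z >= 0 gives the upper bound
```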

Having defined the variationally bounded decomposition of the unsupervised IB, we can proceed with an analysis of the related state-of-the-art methods, along the lines of the analysis introduced in the Summary part of Section 2.

**VAE** [21,22] and *β***-VAE** [23]:


$$\mathcal{L}_{\beta\text{-VAE}}(\theta_{\mathbf{x}},\phi_{\mathbf{z}}) = \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathcal{D}_{\mathbf{z}|\mathbf{x}}\right] + \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}, \tag{21}$$

where $\beta_{\mathbf{x}} = 1$ for the VAE. It can be noted that the VAE and *β*-VAE are based on an upper bound on the mutual information term, $I_{\phi_{\mathbf{z}}}(\mathbf{X};\mathbf{Z}) \leq \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathcal{D}_{\mathbf{z}|\mathbf{x}}\right]$, since $D_{\mathrm{KL}}\left(q_{\phi_{\mathbf{z}}}(\mathbf{z})\,\|\,p_{\theta_{\mathbf{z}}}(\mathbf{z})\right) \geq 0$. Similar considerations apply to the second term, since $D_{\mathrm{KL}}\left(p_{\mathcal{D}}(\mathbf{x})\,\|\,p_{\theta_{\mathbf{x}}}(\mathbf{x})\right) \geq 0$.


#### **Unsupervised AAE** [2]:


$$\mathcal{L}_{\mathrm{AAE}}(\theta_{\mathbf{x}},\phi_{\mathbf{z}}) = \mathcal{D}_{\mathbf{z}} + \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}, \tag{22}$$

where $\beta_{\mathbf{x}} = 1$ in the original AAE formulation. It should be pointed out that the IB formulation of the AAE contains the term $\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}$, whose origin can be explained in the same way as for the VAE. Although the term $\mathcal{D}_{\mathbf{z}}$ indeed appears in (22) with the opposite sign, it cannot be interpreted as an upper bound on $I_{\phi_{\mathbf{z}}}(\mathbf{X};\mathbf{Z})$ in the way it is for the VAE, nor as a lower bound. The goal of the AAE is to minimize the reconstruction loss, or to maximize the log-likelihood, while ensuring that the latent space marginal distribution $q_{\phi_{\mathbf{z}}}(\mathbf{z})$ matches the prior $p_{\theta_{\mathbf{z}}}(\mathbf{z})$. The latter corresponds to the minimization of $D_{\mathrm{KL}}\left(q_{\phi_{\mathbf{z}}}(\mathbf{z})\,\|\,p_{\theta_{\mathbf{z}}}(\mathbf{z})\right)$, i.e., the $\mathcal{D}_{\mathbf{z}}$ term.


6. *The reconstruction space regularization in the case of reconstruction loss*: based on the MSE.

#### **BIB-AE** [15]:


$$\mathcal{L}_{\mathrm{BIB\text{-}AE}}(\theta_{\mathbf{x}},\phi_{\mathbf{z}}) = \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathcal{D}_{\mathbf{z}|\mathbf{x}}\right] - \mathcal{D}_{\mathbf{z}} + \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}} + \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}. \tag{23}$$


In summary, BIB-AE includes the VAE and the AAE as two particular cases. In turn, it should be clear that the regularizer of the semi-supervised model considered in this paper resembles the BIB-AE model and extends it to the conditional case considered below.

#### 4.4.2. Links to Semi-Supervised Models

The proposed LP model (11) is also related to several state-of-the-art semi-supervised models used for classification. As pointed out in the introduction, we only consider available labeled and unlabeled samples in our analysis. The extension to augmented samples, i.e., perturbations, to synthetically generated samples, i.e., fakes, and to adversarial examples for both latent space and label space regularization can be performed along the same lines of analysis, but it goes beyond the scope and focus of this paper.

**Semi-supervised AAE** [2]:


*Remark*: It should be pointed out that in our architecture we consider the latent space to be represented by the vector **a**, which is fed to both the classifier and the regularizer; this gives a natural setting for the IB and the corresponding regularization and priors. In the semi-supervised AAE, the latent space is represented directly by the class and style representations. Therefore, to make it coherent with our case, one should assume that the class vector of the semi-supervised AAE corresponds to the vector **c** and the style vector to the vector **z**.


6. *The reconstruction space regularization in the case of reconstruction loss*: based only on the MSE.

**CatGAN** [3]: is based on an extension of the classical binary GAN discriminator, designed to distinguish between original images and fake images generated from the latent space distribution, to a multi-class discriminator. The author assumes one-hot encoding of the class labels. The system is considered in both unsupervised and semi-supervised modes, and in both modes one-hot encoding is used to encode the class labels. In the unsupervised mode, the system has access only to unlabeled data, and the output of the classifier is considered a clustering into a predefined number of clusters/classes. The main idea behind the unsupervised training is to train the discriminator such that any sample from the set of original images is assigned to one of the classes with high fidelity, whereas any fake or adversarial sample is assigned to all classes almost equiprobably. The regularization in the label space is thus based on the considered and extended framework of entropy minimization-based regularization. In the absence of fakes, this regularization coincides with the semi-supervised AAE label space regularization under the categorical distribution and an adversarial discriminator, which is equivalent to enforcing minimum entropy of the label space. The encoding of fake samples, in contrast, amounts to a form of rejection option, expressed via class activations with maximum entropy, i.e., a uniform distribution over the classes. Equivalently, the above types of encoding can be viewed as a maximization of the mutual information between the original data and the encoded class labels and a minimization of the mutual information between the fakes/adversarial samples and the class labels. The semi-supervised CatGAN model adds a cross-entropy term computed for the true labeled samples.

Therefore, in summary:


**SeGMA** [18]: is a semi-supervised clustering and generative system with a single latent vector representation auto-encoder, similar in spirit to the unsupervised version of the AAE, that can also be used for classification. The latent space of SeGMA is assumed to follow a mixture of Gaussians. Using a small labeled data set, classes are assigned to components of this mixture of Gaussians by minimizing the cross-entropy loss induced by the class posterior distribution of a simple Gaussian classifier. The resulting mixture describes the distribution of the whole data, and representatives of individual classes are generated by sampling from its components. In the classification setup, SeGMA uses the latent space clustering scheme for classification.

Therefore, in summary:


$$\mathcal{L}_{\mathrm{SeGMA}}(\theta_{\mathbf{c}},\theta_{\mathbf{x}},\phi_{\mathbf{z}}) = \mathcal{D}_{\mathbf{z}} + \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}}, \tag{24}$$

where the latent space term $\mathcal{D}_{\mathbf{z}}$ is assumed to be the maximum mean discrepancy (MMD) penalty, which is analytically defined for the mixture-of-Gaussians pdf, $\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}}$ is represented by the MSE, and $\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}}$ represents the cross-entropy for the labeled data, defined over class labels deduced from the latent space representation.
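The MMD penalty used by SeGMA can be illustrated with a generic sample-based estimator (the analytic mixture-of-Gaussians form used in [18] is not reproduced here; the RBF kernel, bandwidth and sample sizes below are illustrative):

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Biased sample estimate of MMD^2 between sample sets x and y,
    with RBF kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    def k(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(0)
z_enc = rng.normal(0.0, 1.0, size=(500, 2))    # samples from the encoder marginal q(z)
z_prior = rng.normal(0.0, 1.0, size=(500, 2))  # samples from the prior p(z)
z_far = rng.normal(3.0, 1.0, size=(500, 2))    # a badly matched latent distribution

# a matched latent space yields a much smaller penalty than a mismatched one
assert rbf_mmd2(z_enc, z_prior) < rbf_mmd2(z_enc, z_far)
```

Minimizing such a penalty over the encoder parameters pushes $q_{\phi_{\mathbf{z}}}(\mathbf{z})$ toward the prior, playing the role of $\mathcal{D}_{\mathbf{z}}$ in (24).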


**VAE (M1 + M2)** [1]: is based on the combination of several models. The model M1 is a vanilla VAE, considered in Section 4.4.1; therefore, model M1 is a particular case of the considered unsupervised IB. The model M2 combines an encoder producing a continuous, Gaussian-distributed latent representation with a classifier that takes the original data as input, in parallel to model M1. The class labels are encoded using one-hot representations and follow a categorical distribution with a hyper-parameter following the symmetric Dirichlet distribution. The decoder of model M2 takes as input the continuous latent representation and the output of the classifier, and is trained under the MSE distortion metric. It is important to point out that the classifier works directly on the input data, not on the common latent space as in the considered LP model. For this reason, there is an obvious analogy with the considered LP model (11) under the assumption that **a** = **x**, and all of the performed IB analysis applies directly. However, as pointed out by the authors, the performance of model M2 in semi-supervised classification with a limited number of labeled samples is relatively poor. That is why the third, hybrid model M1 + M2 is considered, in which the models M1 and M2 are used in a stacked way. At the first stage, the model M1 is learned as a usual VAE. Then the latent space of model M1 is used as an input to the model M2, which is trained in a semi-supervised way. Such a two-stage approach closely resembles the learnable prior architecture presented in Figure 2. However, our model is trained end-to-end with an explainable common latent space and an IB origin, while the model M1 + M2 is trained in two stages, with a regularized ELBO used for the derivation of model M2.

1. *The targeted tasks*: auto-encoding, clustering, (conditional) generation and classification.


$$\mathcal{L}^{\mathrm{LP}}_{\mathrm{SS-VAE\,M1+M2}}(\theta_{\mathbf{c}},\theta_{\mathbf{x}},\phi_{\mathbf{a}},\phi_{\mathbf{z}}) = \mathbb{E}_{p_{\mathcal{D}}(\mathbf{x})}\left[\mathcal{D}_{\mathbf{z}|\mathbf{x}}\right] + \beta_{\mathbf{x}}\mathcal{D}_{\mathbf{x}\hat{\mathbf{x}}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}\hat{\mathbf{c}}} + \beta_{\mathbf{c}}\mathcal{D}_{\mathbf{c}}. \tag{25}$$


#### **5. Experimental Results**

#### *5.1. Experimental Setup*

The tested system is based on (i) a deterministic encoder and decoder, and (ii) a stochastic encoder of the type $\mathbf{a} = f_{\phi_{\mathbf{a}}}(\mathbf{x} + \boldsymbol{\epsilon})$ with data-independent perturbations and a deterministic decoder. The density ratio estimator [24] is used to measure all KL-divergences. The results of semi-supervised classification on the MNIST dataset are reported in Table 1, where the symbol *D* indicates the deterministic setup (i) and the symbol *S* corresponds to the stochastic one (ii). To choose the optimal parameters of the systems, e.g., the Lagrange multipliers of the considered models, we used a 3-run cross-validation with randomly chosen labeled examples, as shown in Appendices B–G. Once the model parameters were chosen, we ran a 10-run cross-validation, and the average results are shown in Table 1.
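The density-ratio trick behind the KL estimates can be sketched on a 1-D toy problem: a discriminator is trained to separate samples of $q$ from samples of $p$, and at the optimum its logit approximates $\log q(x)/p(x)$, so averaging it over $q$-samples estimates $D_{\mathrm{KL}}(q\|p)$. The discriminator below is a hand-rolled logistic regression, not the architecture used in the experiments; for $q = \mathcal{N}(1,1)$ and $p = \mathcal{N}(0,1)$, the true KL is $0.5$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
xq = rng.normal(1.0, 1.0, n)  # samples from q = N(1, 1)
xp = rng.normal(0.0, 1.0, n)  # samples from p = N(0, 1); true KL(q || p) = 0.5

# logistic discriminator d(x) = sigmoid(w . [x, x^2, 1]) trained to tell q from p
X = np.concatenate([xq, xp])
feats = np.stack([X, X**2, np.ones_like(X)], axis=1)
y = np.concatenate([np.ones(n), np.zeros(n)])

w = np.zeros(3)
for _ in range(5000):  # plain gradient descent on the logistic loss
    p_hat = 1.0 / (1.0 + np.exp(-feats @ w))
    w -= 0.2 * feats.T @ (p_hat - y) / len(y)

# at the optimum, logit d(x) ~ log q(x)/p(x), so its mean over q-samples
# is a Monte Carlo estimate of KL(q || p)
feats_q = np.stack([xq, xq**2, np.ones_like(xq)], axis=1)
kl_est = float(np.mean(feats_q @ w))
assert 0.25 < kl_est < 0.75  # in the vicinity of the true value 0.5
```

In the paper's setting the same idea is applied with neural discriminators to the latent, label and observation spaces.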

Additionally, we performed a 10-run cross-validation on the SVHN dataset [25]. We used the same architecture as for MNIST, with the same encoders, decoders and discriminators. In contrast to VAE M1 + M2, we used normalized raw data without any pre-processing. Additionally, in contrast to the AAE, where an extra set of 531,131 unlabeled images was used for semi-supervised training, in our experiments only the training set of 73,257 images was used. Moreover, the experiments were performed (i) with the optimal parameters chosen after the 3-run cross-validation on the MNIST dataset, with no special adaptation to the SVHN dataset, and (ii) with network architectures using exactly the same number of filters as given in Appendices B–G for the MNIST dataset. In summary, our goal is to test the generalization capacity of the proposed approach, not merely to achieve the best performance by fine-tuning the network parameters. The obtained results are reported in Table 1.

We compare the considered architectures with several state-of-the-art semi-supervised methods, such as AAE [2], CatGAN [3], VAE (M1 + M2) [1], IB multiview [5], MV-InfoMax [5] and InfoMax [3], with 100, 1000 and 60,000 labeled training samples. The expected training times for the considered models are given in Table 2. The source code is available at https://github.com/taranO/IB-semisupervised-classification. The analysis of the latent space of the trained models for the MNIST dataset is given in Appendix A.

#### *5.2. Discussion MNIST*

The deterministic and stochastic systems based on learnable priors clearly achieve state-of-the-art performance in comparison to the considered semi-supervised counterparts.

*Baseline Neural Network (NN):* the obtained results allow us to conclude that, if the amount of labeled training data is large, as shown in the "all" column (Table 1), the latent space regularization has no practically significant impact on the classification performance for either hand-crafted or learnable priors. The deep classifier is capable of learning a latent representation that retains only sufficient statistics, based solely on the cross-entropy component of the IB classification term decomposition, as shown in Figure A1, row Dccˆ, column "all". The classes appear well separable under this form of visualization. At the same time, decreasing the number of labeled samples degrades the classification accuracy, as shown in Table 1, columns "1000" and "100". This degradation is also clearly visible in Figure A1, row Dccˆ, column "100", where the overlap between the classes is larger than in the column "all". Stochastic encoding via the addition of noise to the input samples does not enhance the performance over deterministic encoding when the amount of labeled examples is small. One possible explanation is that additive noise is not a typical distortion for the considered data, whose samples differ mainly in geometrical appearance. We therefore conjecture that random geometrical perturbations would be a more interesting alternative to additive noise perturbations/encoding.
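The stochastic encoding discussed above amounts to feeding the encoder several data-independent noisy copies of each input while keeping its label fixed; a minimal sketch of this batch construction (names and values are illustrative, not taken from the paper's code):

```python
import numpy as np

def noisy_copies(x, labels, n_realisations=3, noise_std=0.1, seed=0):
    """Replicate each sample with additive Gaussian noise; labels are preserved."""
    rng = np.random.default_rng(seed)
    x_rep = np.repeat(x, n_realisations, axis=0)        # each sample repeated n times
    y_rep = np.repeat(labels, n_realisations, axis=0)   # labels follow their samples
    return x_rep + rng.normal(0.0, noise_std, size=x_rep.shape), y_rep

x = np.zeros((4, 28 * 28))           # four flattened MNIST-sized images
y = np.array([0, 1, 2, 3])
xn, yn = noisy_copies(x, y)
print(xn.shape, yn.shape)            # (12, 784) (12,)
```

The noise std of 0.1 and three realisations match the settings reported in the appendix tables.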

**Table 1.** Semi-supervised classification performance (percentage error) for the optimal parameters (Appendices B–G) defined on the MNIST (*D*—deterministic; *S*—stochastic).


*No priors on latent space:* to investigate the impact of unlabeled data, we add the adversarial regularizer D<sup>c</sup> to the baseline classifier based on Dccˆ. The term D<sup>c</sup> enforces the distribution of class labels for the unlabeled samples to follow the categorical distribution. At this stage, no regularization of the latent space is applied. The addition of the adversarial regularizer Dc, see the "100" column (Table 1), reduces the classification error in comparison to the baseline classifier. Moreover, the stochastic encoder slightly outperforms the deterministic one for all numbers of labeled samples. However, the achieved classification error is still far from the performance of the baseline classifier trained on the whole labeled data set. Thus, the cross-entropy and adversarial classification terms alone can hardly cope with the lack of labeled data, and proper regularization of the latent space is the main mechanism capable of retaining the most relevant representation.
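In the implementation, D<sup>c</sup> is estimated with an adversarially trained discriminator; a simpler closed-form proxy conveys what the regularizer enforces: the batch-average predicted class distribution on unlabeled samples should match the categorical (uniform) prior. A minimal sketch (the function name is ours, and the explicit KL here stands in for the adversarial estimate):

```python
import numpy as np

def categorical_prior_penalty(probs, eps=1e-12):
    """KL between the batch-average class distribution and the uniform prior.

    A simplified, non-adversarial proxy for the D_c regularizer; in the paper
    D_c is estimated with an adversarially trained discriminator.
    """
    k = probs.shape[1]
    avg = probs.mean(axis=0)                      # aggregate predicted label usage
    uniform = np.full(k, 1.0 / k)
    return float(np.sum(avg * (np.log(avg + eps) - np.log(uniform))))

balanced = np.eye(10)                             # each class predicted once
collapsed = np.tile(np.eye(10)[[0]], (10, 1))     # all mass on one class
print(categorical_prior_penalty(balanced))        # ≈ 0: matches the prior
print(categorical_prior_penalty(collapsed))       # ≈ log 10 ≈ 2.30: collapse is penalized
```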

*Hand-crafted latent space priors:* along this line, we investigate the impact of hand-crafted regularization in the form of an added discriminator D<sup>a</sup> imposing a Gaussian prior on the latent representation **a**. The sole regularization of the latent space with a hand-crafted Gaussianity prior does not reflect the complex nature of the latent space of real data. As a result, the regularized classifier *β*cDccˆ + Da does not yield a remarkable improvement over the non-regularized counterpart Dccˆ for either stochastic or deterministic encoding. When the label space regularization D<sup>c</sup> is additionally included, yielding the final classifier *β*cDccˆ + Da + *β*cDc, the classification error is reduced by a factor of 2 over the cross-entropy baseline classifier, but it is still far from the fully supervised baseline trained on the fully labeled data set. At the same time, there is no significant difference between the stochastic and deterministic types of encoding.

*Learnable latent space priors:* along this line, we investigate the impact of learnable priors by adding the corresponding regularizations of the latent space of the auto-encoder and of the data reconstruction. We investigate the role of reconstruction space regularization based on the MSE expressed via Dxxˆ, and on Dxxˆ jointly with Dx. The addition of the discriminator D<sup>x</sup> slightly enhances the classification but almost doubles the training time, as shown in Table 2. Stochastic encoding does not show any obvious advantage over deterministic encoding in this setup. The separability of classes shown in Figure A1, row *β*cDccˆ + *β*cDc + Dz + *β*xDxxˆ + *β*xDx, column "100", is very close to that of column "all", row Dccˆ, i.e., the semi-supervised system with 100 labeled examples closely approximates the fully supervised one. We show the t-SNE plot only for this setup since it practically coincides with *β*cDccˆ + *β*cDc + Dz + *β*xDxxˆ. However, it should be pointed out that the learnable priors ensure the reconstruction of data from the compressed latent space, so the learned representation is a sufficient statistic for the data reconstruction task but not for the classification one. Since the entropy of the classification task is significantly lower than that of reconstruction, such a learned representation contains more information than is actually needed for classification. A fraction of the retained information is irrelevant to the classification problem and might be a potential source of classification errors. This likely explains the gap in performance between the considered semi-supervised system and the fully supervised one.

**Table 2.** Execution time (hours) per 100 epochs on one NVIDIA GPU. For SVHN, the models with the learnable latent space priors were trained with a learning rate of 0.0001, which explains the longer time, but without optimization of the Lagrangians, i.e., the Lagrangians were re-used from the pre-trained MNIST model. All other models were trained with a learning rate of 0.001.


#### *5.3. Discussion SVHN*

In the SVHN test, we did not try to optimize the Lagrangian coefficients as was done for MNIST. However, to compensate for potential non-optimality, we performed the model training with a reduced learning rate, as indicated in Table 2. As a result, the training time on the SVHN dataset is longer. The 10-run validation of the proposed framework on the SVHN dataset was therefore done with the optimal Lagrangian multipliers determined on the MNIST dataset. In this respect, one might observe a small degradation of the obtained results compared to the state-of-the-art. Additionally, we did not apply any pre-processing such as the PCA used in VAE M1 + M2, nor did we use the extended unlabeled dataset as was done in the case of AAE. One can clearly observe the same behavior of the semi-supervised classifiers as for the MNIST dataset discussed in Section 5.2. Therefore, we can clearly confirm the role of learnable priors in the overall performance observed on both datasets.

#### **6. Conclusions and Future Work**

We have introduced a novel formulation of the variational information bottleneck for semi-supervised classification. To overcome the problem of the original bottleneck and to compensate for the lack of labeled data in the semi-supervised setting, we considered two models of latent space regularization, via hand-crafted and via learnable priors. Using the MNIST dataset as a toy example, we investigated how the parameters of the proposed framework influence the performance of the classifier. Through end-to-end training, we demonstrated how the proposed framework compares to state-of-the-art methods and approaches the performance of a fully supervised classifier.

The envisioned future work aims at stronger compression that preserves only classification-relevant information, since retaining more task-irrelevant information does not provide distinguishable classification features; it only ensures reliable data reconstruction. In this work, we have considered the IB for the predictive latent space model. We think that a contrastive multi-view IB formulation would be an interesting candidate for the regularization of the latent space. Additionally, we did not use adversarially generated examples to impose a constraint minimizing the mutual information between them and the class labels or, equivalently, maximizing the entropy of the class label distribution for these adversarial examples, following the framework of entropy minimization. This line of "adversarial" regularization seems to be a very interesting complement to the considered variational bottleneck. In this work, we considered a particular form of stochastic encoding: the addition of data-independent noise to the input with preservation of the class labels. This also corresponds to consistency regularization, where samples can be perturbed more generally, including by geometrical transformations. It is also interesting to point out that the same form of generic perturbations is used in unsupervised contrastive-loss-based multi-view formulations for continuous latent space representations, as opposed to the categorical ones in consistency regularization. Finally, conditional generation is an interesting line of research, considering generation from discrete labels and the continuous latent space of the autoencoder.

**Author Contributions:** Conceptualization, S.V. and O.T.; methodology, O.T., M.K., T.H. and D.R.; software, O.T.; validation, O.T.; formal analysis, M.K., T.H. and D.R.; investigation, O.T.; writing—original draft preparation, S.V. and O.T.; writing—review and editing, all authors; visualization, S.V. and O.T.; supervision, S.V.; project administration, S.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Swiss National Science Foundation SNF No. 200021\_182063.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A. Analysis of the Latent Space of the Trained Models**

**Figure A1.** Latent space **a** of the classifier for 100, 1000 and all MNIST training labels (columns) and for the models from Dccˆ to Dccˆ + Dc + Dz + Dxxˆ + *α*xDx (rows).

**Figure A2.** Latent space **z** (of size 20) of auto-encoder.

In this section, we consider the properties of the classifier's latent space for both the hand-crafted and learnable priors under different amounts of training samples. Figures A1 and A2 show t-SNE plots with perplexity 30 for 100, 1000 and 60,000 ("all") training labels of the MNIST dataset.

The first row of Figure A1, with the label "Dccˆ", corresponds to the classifier considered in Appendix B. The latent space **a** of the classifier trained with "all" labels demonstrates perfect separability of the classes. The classes are far from each other and there are practically no outliers leading to misclassification. Decreasing the number of labels in the supervised setup, see the columns 1000 and 100, leads to a visible degradation of the separability between the classes.

The regularization of the class label space by the regularizer D<sup>c</sup>, or of the latent space by the hand-crafted regularizer D<sup>a</sup>, shown in the rows "Dccˆ + *α*cDc" (considered in Appendix C) and "Dccˆ + *α*aDa" (considered in Appendix D), does not significantly enhance the class separability with respect to "Dccˆ" for a small number of training samples such as 100.

At the same time, the joint usage of the above regularizers in the model "Dccˆ + *α*cD<sup>c</sup> + *α*aDa" of Appendix E leads to better separability of the classes for 100 labels in comparison with the previous cases. However, the addition of these regularizers does not have any impact on the latent space in the "all" label case.

The introduction of learnable regularization of the latent space along with the class label regularization, according to the model "Dccˆ + D<sup>c</sup> + D<sup>z</sup> + Dxxˆ + *α*xDx" considered in Appendix G, enhances the class separability in the latent space of the classifier for the 100 label case, which is also very close to the fully supervised case.

For comparison, we also visualize the latent space **z** of the auto-encoder for the above model in Figure A2.

#### **Appendix B. Supervised Training without Latent Space Regularization (Baseline)**

The baseline architecture is based on the cross-entropy term Dccˆ (see (7) in the main part of the paper) and is depicted in Figure A3; the training objective is:

$$\mathcal{L}\_{\text{S-NoReg}}^{\text{HCP}}(\theta\_{\text{c}}, \phi\_{\text{a}}) = \mathcal{D}\_{\text{c}\hat{\text{c}}}.\tag{A1}$$

The parameters of the encoder and decoder are shown in Table A1. The performance of the baseline supervised classifier with and without batch normalization corresponds to the parameter *α*<sup>c</sup> = 0 in Table A3 (deterministic scenario) and Table A4 (stochastic scenario).
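The cross-entropy term Dccˆ in (A1) can be sketched directly from classifier logits and integer labels (our own illustration, not the paper's implementation):

```python
import numpy as np

def cross_entropy(logits, labels, eps=1e-12):
    """Mean cross-entropy between one-hot labels and softmax predictions."""
    z = logits - logits.max(axis=1, keepdims=True)   # stabilised softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + eps)))

logits = np.array([[5.0, 0.0, 0.0],
                   [0.0, 5.0, 0.0]])
labels = np.array([0, 1])
print(cross_entropy(logits, labels))   # ≈ 0.013: confident correct predictions
```

For uniform logits the value is log *K* (with *K* classes), the entropy of a random guess.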

**Figure A3.** Baseline classifier based on Dccˆ. The blue shadowed regions are not used.


**Table A1.** The network parameters of baseline classifier trained on Dccˆ. The encoder is trained with and without batch normalization (BN) after Conv2D layers.

#### **Appendix C. Semi-Supervised Training without Latent Space Regularization and with Class Label Regularizer**

This model is based on the terms Dccˆ and D<sup>c</sup> in (8) in the main part of the paper and is schematically shown in Figure A4:

$$
\mathcal{L}\_{\text{SS-NoReg}}^{\text{HCP}}(\theta\_{\text{c}}, \phi\_{\text{a}}) = \mathcal{D}\_{\text{c}\hat{\text{c}}} + \alpha\_{\text{c}} \mathcal{D}\_{\text{c}}.\tag{A2}
$$

The parameters of the encoder, decoder and discriminator are shown in Table A2. The KL-divergence term D<sup>c</sup> is implemented in the form of a density ratio estimator (DRE). In the considered practical implementation, the parameter *α*<sup>c</sup> controls the trade-off between the cross-entropy and class discriminator terms. The discriminator D<sup>c</sup> is trained in an adversarial way based on samples generated by the decoder and samples from the targeted distribution.

The performance of semi-supervised classifier with and without batch normalization is shown in Table A3 (deterministic scenario) and Table A4 (stochastic scenario).

**Figure A4.** Semi-supervised classifier based on the cross-entropy Dccˆ and categorical class discriminator Dc. No latent space regularization is applied. The blue shadowed regions are not used.


**Table A2.** The network parameters of semi-supervised classifier trained on Dccˆ and Dc. The encoder is trained with and without batch normalization (BN) after Conv2D layers.

**Table A3.** The performance (percentage error) of the **deterministic** classifier based on Dccˆ + *α*cD<sup>c</sup> for the encoder with and without batch normalization, as a function of the Lagrangian multiplier *α*c and the number of labeled examples.


**Table A3.** *Cont.*


**Table A4.** The performance (percentage error) of the **stochastic** classifier with supervised noisy data (noise std = 0.1, # noise realisations = 3) based on Dccˆ + *α*cD<sup>c</sup> for the encoder with and without batch normalization, as a function of the Lagrangian multiplier *α*c and the number of labeled examples.


#### **Appendix D. Supervised Training with Hand Crafted Latent Space Regularization**

This model is based on the cross-entropy term Dccˆ and either the term Da|<sup>x</sup> or D<sup>a</sup>, or jointly Da|<sup>x</sup> and D<sup>a</sup>, as defined by (9) in the main part of the paper. In our implementation, we consider the regularization based on the adversarial term D<sup>a</sup>, similar to AAE, due to the flexibility of imposing different priors on the latent space distribution. The implemented system is shown in Figure A5 and the training is based on:

$$
\mathcal{L}\_{\text{S-Reg}}^{\text{HCP}}(\theta\_{\text{c}}, \phi\_{\text{a}}) = \mathcal{D}\_{\text{c}\hat{\text{c}}} + \alpha\_{\text{a}} \mathcal{D}\_{\text{a}},\tag{A3}
$$

where *α*<sup>a</sup> is a regularization parameter controlling the trade-off between the cross-entropy term and the latent space regularization term. With respect to (9) in the main part of the paper, we have moved the Lagrange multiplier in front of D<sup>a</sup>, in contrast to the original formulation. This is done to keep the term Dccˆ free of a multiplier, as the reference to the baseline classifier.

The parameters of encoder, decoder and discriminator are summarized in Table A5. The performance of this classifier without and with batch normalization is shown in Table A6 (deterministic scenario) and Table A7 (stochastic scenario).

**Figure A5.** Supervised classifier based on the cross-entropy Dccˆ and hand-crafted latent space regularization Da. The blue shadowed parts are not used.

**Table A5.** The network parameters of supervised classifier trained on Dccˆ and Da. The encoder is trained with and without batch normalization (BN) after Conv2D layers. D<sup>a</sup> is trained in the adversarial way.



**Table A6.** The performance (percentage error) of the **deterministic** classifier based on Dccˆ + *α*aD<sup>a</sup> for the encoder with and without batch normalization, as a function of the Lagrangian multiplier.

**Table A7.** The performance (percentage error) of the **stochastic** classifier with supervised noisy data (noise std = 0.1, # noise realisations = 3) based on Dccˆ + *α*aD<sup>a</sup> for the encoder with and without batch normalization, as a function of the Lagrangian multiplier.


#### **Appendix E. Semi-Supervised Training with Hand Crafted Latent Space and Class Label Regularizations**

This model is based on the cross-entropy term Dccˆ, either the term Da|<sup>x</sup> or D<sup>a</sup> or jointly Da|<sup>x</sup> and D<sup>a</sup>, and the class label regularizer D<sup>c</sup>, as defined by (10) in the main part of the paper. In our implementation, we consider the regularization based on the adversarial term D<sup>a</sup> only, as shown in Figure A6. The training is based on:

$$\mathcal{L}\_{\text{SS-Reg}}^{\text{HCP}}(\theta\_{\text{c}}, \phi\_{\text{a}}) = \mathcal{D}\_{\text{c}\hat{\text{c}}} + \alpha\_{\text{c}} \mathcal{D}\_{\text{c}} + \alpha\_{\text{a}} \mathcal{D}\_{\text{a}}.\tag{A4}$$

The parameters of encoder, decoder and both discriminators are shown in Table A8. The performance of this classifier without and with batch normalization is shown in Table A9 (deterministic scenario) and Table A10 (stochastic scenario).

**Table A8.** The network parameters of semi-supervised classifier trained on Dccˆ, D<sup>a</sup> and Dc. The encoder is trained with and without batch normalization (BN) after Conv2D layers. D<sup>a</sup> and D<sup>c</sup> are trained in the adversarial way.


**Figure A6.** Semi-supervised classifier based on the cross-entropy Dccˆ and hand-crafted latent space regularization Da. The blue shadowed parts are not used.

**Table A9.** The performance (percentage error) of the **deterministic** classifier based on Dccˆ + *α*aD<sup>a</sup> + *α*cD<sup>c</sup> for the encoder with and without batch normalization.



**Table A10.** The performance (percentage error) of the **stochastic** classifier with supervised noisy data (noise std = 0.1, # noise realisations = 3) based on Dccˆ + *α*aD<sup>a</sup> + *α*cD<sup>c</sup> for the encoder with and without batch normalization.

#### **Appendix F. Semi-Supervised Training with Learnable Latent Space Regularization**

This model is based on the cross-entropy term Dccˆ, the MSE term representing Dxxˆ, the class label regularizer D<sup>c</sup> and either the term Dz|<sup>x</sup> or D<sup>z</sup> or jointly Dz|<sup>x</sup> and D<sup>z</sup>, as defined by (16) in the main part of the paper. In our implementation, we consider the regularization of the latent space based on the adversarial term D<sup>z</sup> only, to compare it with the vanilla AAE as shown in Figure A7. The encoder is also not conditioned on **c**, in contrast to the original semi-supervised AAE. Thus, the tested system is based on:

$$\mathcal{L}\_{\text{SS-AAE}}^{\text{LP}}(\theta\_{\text{c}}, \theta\_{\text{x}}, \phi\_{\text{a}}, \phi\_{\text{z}}) = \beta\_{\text{c}}\mathcal{D}\_{\text{c}\hat{\text{c}}} + \beta\_{\text{c}}\mathcal{D}\_{\text{c}} + \mathcal{D}\_{\text{z}} + \beta\_{\text{x}}\mathcal{D}\_{\text{x}\hat{\text{x}}}.\tag{A5}$$

We set the parameters *β*<sup>x</sup> = *β*<sup>c</sup> = 1 to compare our system with the vanilla AAE. However, these parameters can also be optimized in practice.
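The weighting in (A5) then amounts to a plain weighted sum of the individual loss terms, with *β*x = *β*c = 1 recovering the vanilla-AAE comparison setting; a trivial sketch (names are ours):

```python
def total_loss(d_cc, d_c, d_z, d_xx, beta_c=1.0, beta_x=1.0):
    """Combine the loss terms of (A5); the latent discriminator term D_z is unweighted."""
    return beta_c * d_cc + beta_c * d_c + d_z + beta_x * d_xx

# With beta_x = beta_c = 1 every term enters with unit weight:
print(round(total_loss(d_cc=0.2, d_c=0.1, d_z=0.3, d_xx=0.4), 6))  # 1.0
```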

The parameters of encoder and decoder are shown in Table A11. The performance of this classifier without and with batch normalization is shown in Table A12 (deterministic scenario) and Table A13 (stochastic scenario).


**Table A11.** The encoder and decoder of semi-supervised classifier trained based on Dccˆ, D<sup>c</sup> and Dz. The encoder is trained with and without batch normalization (BN) after Conv2D layers. D<sup>c</sup> and D<sup>z</sup> are trained in the adversarial way.

**Table A12.** The performance (percentage error) of the **deterministic** classifier based on Dccˆ + D<sup>c</sup> + D<sup>z</sup> + Dxxˆ for the encoder with and without batch normalization.


**Table A13.** The performance (percentage error) of the **stochastic** classifier with supervised noisy data (noise std = 0.1, # noise realisations = 3) based on Dccˆ + D<sup>c</sup> + D<sup>z</sup> + Dxxˆ for the encoder with and without batch normalization.

**Figure A7.** Semi-supervised classifier with learnable priors: the cross-entropy Dccˆ, MSE Dxxˆ, class label D<sup>c</sup> and latent space regularization Dz. The blue shadowed parts are not used.

#### **Appendix G. Semi-Supervised Training with Learnable Latent Space Regularization and Adversarial Reconstruction**

This model is similar to the previously considered one but, in addition to the MSE reconstruction term representing Dxxˆ, it also contains the adversarial reconstruction term D<sup>x</sup>, as defined by (17) in the main part of the paper. In our implementation, we consider the regularization of the latent space based on the adversarial term D<sup>z</sup>, as shown in Figure A8. The training is based on:

$$\mathcal{L}\_{\text{SS-AAE}}^{\text{LP}}(\theta\_{\text{c}}, \theta\_{\text{x}}, \phi\_{\text{a}}, \phi\_{\text{z}}) = \beta\_{\text{c}}\mathcal{D}\_{\text{c}\hat{\text{c}}} + \beta\_{\text{c}}\mathcal{D}\_{\text{c}} + \mathcal{D}\_{\text{z}} + \beta\_{\text{x}}\mathcal{D}\_{\text{x}\hat{\text{x}}} + \alpha\_{\text{x}}\mathcal{D}\_{\text{x}}.\tag{A6}$$

The parameters of encoder and decoder are shown in Table A14. The performance of this classifier without and with batch normalization is shown in Table A15 (deterministic scenario) and Table A16 (stochastic scenario).

**Figure A8.** Semi-supervised classifier with learnable priors: the cross-entropy Dccˆ, MSE Dxxˆ, adversarial reconstruction Dx, class label D<sup>c</sup> and latent space regularizer Dz. The blue shadowed parts are not used.

**Table A14.** The network parameters of semi-supervised classifier trained based on Dccˆ, D<sup>c</sup> and Dz. The encoder is trained with and without batch normalization (BN) after Conv2D layers. D<sup>c</sup> and D<sup>z</sup> are trained in the adversarial way.


**Table A14.** *Cont.*


**Table A15.** The performance (percentage error) of the **deterministic** classifier based on Dccˆ + D<sup>c</sup> + D<sup>z</sup> + Dxxˆ + *α*xD<sup>x</sup> for the encoder with and without batch normalization.


**Table A16.** The performance (percentage error) of the **stochastic** classifier with supervised noisy data (noise std = 0.1, # noise realisations = 3) based on Dccˆ + D<sup>c</sup> + D<sup>z</sup> + Dxxˆ + *α*xD<sup>x</sup> for the encoder with and without batch normalization.


#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **The Conditional Entropy Bottleneck**

#### **Ian Fischer**

Google Research, Mountain View, CA 94043, USA; iansf@google.com

Received: 30 July 2020; Accepted: 28 August 2020; Published: 8 September 2020

**Abstract:** Much of the field of Machine Learning exhibits a prominent set of failure modes, including vulnerability to adversarial examples, poor out-of-distribution (OoD) detection, miscalibration, and willingness to memorize random labelings of datasets. We characterize these as failures of *robust generalization*, which extends the traditional measure of generalization as accuracy or related metrics on a held-out set. We hypothesize that these failures to robustly generalize are due to the learning systems retaining *too much* information about the training data. To test this hypothesis, we propose the *Minimum Necessary Information* (MNI) criterion for evaluating the quality of a model. In order to train models that perform well with respect to the MNI criterion, we present a new objective function, the *Conditional Entropy Bottleneck* (CEB), which is closely related to the *Information Bottleneck* (IB). We experimentally test our hypothesis by comparing the performance of CEB models with deterministic models and Variational Information Bottleneck (VIB) models on a variety of different datasets and robustness challenges. We find strong empirical evidence supporting our hypothesis that MNI models improve on these problems of robust generalization.

**Keywords:** information theory; information bottleneck; machine learning

#### **1. Introduction**

Despite excellent progress in classical generalization (e.g., accuracy on a held-out set), the field of Machine Learning continues to struggle with the following issues:


We consider these to be problems of *robust generalization*, which we define and discuss in Section 2.1. In this work, we hypothesize that these problems of robust generalization all have a common cause: models retain *too much* information about the training data. We formalize this by introducing the *Minimum Necessary Information* (MNI) criterion for evaluating a learned representation (Section 2.2). We then introduce an objective function that directly optimizes the MNI, the *Conditional Entropy Bottleneck* (CEB) (Section 2.3) and compare it with the closely-related *Information Bottleneck* (IB) objective [11] in Section 2.5. In Section 2.6, we describe practical ways to optimize CEB in a variety of settings.

Finally, we give empirical evidence for the following claims:


#### **2. Materials and Methods**

#### *2.1. Robust Generalization*

In classical generalization, we are interested in a model's performance on held-out data on some task of interest, such as classification accuracy. In *robust generalization*, we want: **(RG1)** *to maintain the model's performance in the classical generalization setting*; **(RG2)** *to ensure the model's performance in the presence of an adversary (unknown at training time)*; and **(RG3)** *to detect adversarial and non-adversarial data that strongly differ from the training distribution*.

Adversarial training approaches considered in the literature so far [12–14] violate **(RG1)**, as they typically result in substantial decreases in accuracy. Similarly, provable robustness approaches (e.g., Cohen et al. [15], Wong et al. [16]) provide guarantees only for a particular adversary known at training time, also at a cost to test accuracy. To our knowledge, neither approach provides any mechanism to satisfy **(RG3)**. On the other hand, approaches for detecting adversarial and non-adversarial out-of-distribution (OoD) examples [4–9] are either known to be vulnerable to adversarial attack [1,2] or do not demonstrate robustness against unknown adversaries, both of which violate **(RG2)**.

Training on information-free datasets [10] provides an additional way to check if a learning system is compatible with **(RG1)**, as memorization of such datasets necessarily results in maximally poor performance on any test set. Model calibration is not obviously a necessary condition for robust generalization, but if a model is well-calibrated on a held-out set, its confidence may provide some signal for distinguishing OoD examples, so we mention it as a relevant metric for **(RG3)**.

To our knowledge, the only works to date that have demonstrated progress on robust generalization for modern machine learning datasets are the *Variational Information Bottleneck* [17,18] (VIB), and *Information Dropout* [19]. Alemi et al. [17] presented preliminary results that VIB improves adversarial robustness on image classification tasks while maintaining high classification accuracy (**(RG1)** and **(RG2)**). Alemi et al. [18] showed that VIB models provide a useful signal, the *Rate*, *R*, for detecting OoD examples (**(RG3)**). Achille and Soatto [19] also showed preliminary results on adversarial robustness and demonstrated failure to train on information-free datasets.

In this work, we do not claim to "solve" robust generalization, but we do show notable improvement on all three conditions simply by changing the training objective. This evidence supports our core hypothesis that problems of robust generalization are caused in part by retaining too much information about the training data.

#### *2.2. The Minimum Necessary Information*

We define the *Minimum Necessary Information* (MNI) criterion for a learned representation in three parts:


*Necessity* can be defined as *<sup>I</sup>*(*X*;*Y*) <sup>≤</sup> *<sup>I</sup>*(*Y*; *<sup>Z</sup>*). Any less information than that would prevent *<sup>Z</sup>* from solving the task of predicting *<sup>Y</sup>* from *<sup>X</sup>*. *Minimality* can be defined as *<sup>I</sup>*(*X*;*Y*) <sup>≥</sup> *<sup>I</sup>*(*X*; *<sup>Z</sup>*). Any more information than that would result in *Z* capturing information from *X* that is either redundant or irrelevant for predicting *Y*. Since the information captured by *Z* is constrained from above and below, we have the following necessary and sufficient conditions for perfectly achieving the Minimum Necessary Information, which we call the *MNI Point*:

$$I(X;Y) = I(X;Z) = I(Y;Z) \tag{1}$$

The MNI point defines a unique point in the information plane. The geometry of the information plane can be seen in Figure 1. The MNI criterion does not make any Markov assumptions on the models or algorithms that learn the representations. However, the algorithms we discuss here all rely on the standard Markov chain *Z* ← *X* ↔ *Y*. See Fischer [22] for an example of an objective that does not rely on a Markov chain during training.
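Under the Markov chain *Z* ← *X* ↔ *Y*, the two defining inequalities collapse into a single chain (a one-line check, our own, using only the definitions above):

$$I(X;Y) \le I(Y;Z) \le I(X;Z) \le I(X;Y),$$

where the first inequality is necessity, the second is the data processing inequality for the Markov chain, and the third is minimality. Hence all three quantities must be equal, recovering Equation (1).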

A closely related concept to Necessity is called *sufficiency* by Achille and Soatto [19] and other authors. We avoid the term due to potential confusion with minimum sufficient statistics, which maintain the mutual information between a model and the data it generates [21] (p. 35). The primary difference between necessity and sufficiency is the reliance on the Markov constraint to define sufficiency. Ref. [19] also does not identify the MNI point as an idealized target, instead defining the optimization problem: minimize *I*(*X*; *Z*) s.t. *H*(*Y*|*Z*) = *H*(*Y*|*X*).

In general it may not be possible to satisfy Equation (1). As discussed in Anantharam et al. [23–25], for any given dataset (*X*,*Y*), there is a maximum achievable ratio of *I*(*Y*; *Z*) to *I*(*X*; *Z*) over all possible representations *Z*:

$$1 \ge \eta\_{\rm KL} = \sup\_{Z \leftarrow X \rightarrow Y} \frac{I(Y; Z)}{I(X; Z)} \tag{2}$$

with equality only when *X* → *Y* is a *deterministic* map. Training datasets are often deterministic in one direction or the other; for example, common image datasets map each distinct image to a single label. Thus, in practice, we can often get very close to the MNI on the training set given a sufficiently powerful model.

**Figure 1.** Geometry of the feasible regions in the (*I*(*X*; *Z*), *I*(*Y*; *Z*)) information plane for any algorithm, with key points and edges labeled. The **black** edges bound the feasible region for an (*X*,*Y*) pair where *H*(*X*|*Y*) > *I*(*X*;*Y*) > *H*(*Y*|*X*), which would generally be the case in an image classification task, for example. The **gray** dashed lines bound the feasible regions when the underlying model depends on a Markov chain. The *I*(*X*; *Z*) = *I*(*Y*; *Z*) and *I*(*Y*; *Z*) = *I*(*X*;*Y*) lines are the upper bound for *Z* ← *X* ↔ *Y*. The *I*(*X*; *Z*) = *I*(*Y*; *Z*) and *I*(*X*; *Z*) = *I*(*X*;*Y*) lines are the right bound for *Z* ← *Y* ↔ *X*. The **blue** points correspond to the best possible Maximum Likelihood Estimates (MLE) for the corresponding Markov chain models. The **red** point corresponds to the maximum information *Z* could ever capture about (*X*,*Y*). The Minimum Necessary Information (MNI) point is **green**. As *I*(*X*; *Z*) increases, *Z* captures more information that is either redundant or irrelevant with respect to predicting *Y*. Similarly, any variation in *Y* that remains once we know *X* is just noise as far as the task is concerned. The MNI point is the unique point that has no redundant or irrelevant information from *X*, and everything but the noise from *Y*.

#### 2.2.1. MNI and Robust Generalization

To satisfy **(RG1)** (classical generalization), a model must have *I*(*X*; *Z*) ≥ *I*(*X*;*Y*) = *I*(*Y*; *Z*) on the *test* dataset. Shamir et al. [26] show that |*I*(*X*; *Z*) − *Î*(*X*; *Z*)| ≈ *O*(2<sup>*Î*(*X*;*Z*)</sup>/√*N*), where *Î*(·) indicates the training set information and *N* is the size of the training set. More recently, Bassily et al. [27] gave a similar result in a PAC setting. Both results indicate that models that are *compressed on the training data* should do *better at generalizing* to similar test data.

Less clear is how an MNI model might improve on **(RG2)** (adversarial robustness). In this work, we treat it as a hypothesis that we investigate empirically rather than theoretically. The intuition behind the hypothesis can be described in terms of the idea of *robust* and *non-robust features* from Ilyas et al. [28]: non-robust features in *X* should be compressed as much as possible when we learn *Z*, whereas robust features should be retained as much as is necessary. If Equation (1) is satisfied, *Z* must have "scaled" the features in *X* according to their importance for predicting *Y*. Consequently, an attacker that tries to take advantage of a non-robust feature will have to change it much more in order to confuse the model, possibly exceeding the constraints of the attack before it succeeds.

For **(RG3)** (detection), the MNI criterion does not directly apply, as detection will be a property of specific modeling choices. However, if the model provides an accurate way to measure *I*(*X* = *x*; *Z* = *z*) for a particular pair (*x*, *z*), Alemi et al. [18] suggest that it can be a valuable signal for OoD detection.

#### *2.3. The Conditional Entropy Bottleneck*

We would like to learn a representation *Z* of *X* that will be useful for predicting *Y*. We can represent this problem setting with the Markov chain *Z* ← *X* ↔ *Y*. We would like *Z* to satisfy Equation (1). Given the conditional independence *Z* ⊥ *Y* | *X* in our Markov chain, *I*(*Y*; *Z*) ≤ *I*(*X*;*Y*) by the data processing inequality. Thus, maximizing *I*(*Y*; *Z*) is consistent with the MNI criterion.

However, *I*(*X*; *Z*) does not clearly have a constraint that targets *I*(*X*;*Y*), as 0 ≤ *I*(*X*; *Z*) ≤ *H*(*X*). Instead, we can notice the following identities at the MNI point:

$$I(X;Y|Z) = I(X;Z|Y) = I(Y;Z|X) = 0 \tag{3}$$

The conditional mutual information is always non-negative, so learning a compressed representation *Z* of *X* is equivalent to minimizing *I*(*X*; *Z*|*Y*). Using our Markov chain and the chain rule of mutual information [21]:

$$I(X;Z|Y) = I(X,Y;Z) - I(Y;Z) = I(X;Z) - I(Y;Z) \tag{4}$$

This leads us to the general *Conditional Entropy Bottleneck*:

$$\text{CEB} \equiv \min\_{Z} I(X;Z|Y) - \gamma I(Y;Z) \tag{5}$$

$$= \min\_{Z} H(Z) - H(Z|X) - H(Z) + H(Z|Y) - \gamma \left( H(Y) - H(Y|Z) \right) \tag{6}$$

$$\Leftrightarrow \min\_{Z} -H(Z|X) + H(Z|Y) + \gamma H(Y|Z) \tag{7}$$

In Equation (7), we drop the −*γH*(*Y*) term, as it is constant with respect to *Z*. Here, any *γ* > 0 is valid, but for deterministic datasets (Section 2.2), *γ* = 1 will achieve the MNI for a sufficiently powerful model. Further, we should expect *γ* = 1 to yield *consistent* models and other values of *γ* not to: since *I*(*Y*; *Z*) shows up in two forms in the objective, weighting them differently forces the optimization procedure to count bits of *I*(*Y*; *Z*) in two different ways, potentially leading to a situation where *H*(*Z*) − *H*(*Z*|*Y*) ≠ *H*(*Y*) − *H*(*Y*|*Z*) at convergence. Given knowledge of those four entropies, we can define a consistency metric for *Z*:

$$\mathcal{C}\_{I(Y;Z)}(Z) \equiv \left| H(Z) - H(Z|Y) - H(Y) + H(Y|Z) \right| \tag{8}$$
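As a sanity check, the chain-rule identity in Equation (4) can be verified numerically for any discrete joint satisfying the Markov chain *Z* ← *X* ↔ *Y*. The sketch below (our own illustration, with arbitrary alphabet sizes) builds a random *p*(*x*, *y*) and encoder *e*(*z*|*x*) and confirms *I*(*X*; *Z*|*Y*) = *I*(*X*; *Z*) − *I*(*Y*; *Z*):

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 4, 3, 5  # arbitrary alphabet sizes

# Random joint p(x,y) and random stochastic encoder e(z|x); the Markov
# chain Z <- X <-> Y gives p(x,y,z) = p(x,y) e(z|x).
p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()
e_zx = rng.random((nx, nz)); e_zx /= e_zx.sum(axis=1, keepdims=True)
p_xyz = p_xy[:, :, None] * e_zx[:, None, :]

def mi(joint):
    """I(A;B) in nats from a 2D joint probability table."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    m = joint > 0
    return float((joint[m] * np.log(joint[m] / (pa * pb)[m])).sum())

def cond_mi(p):
    """I(X;Z|Y) = sum_y p(y) I(X;Z | Y=y) for an (x,y,z) joint table."""
    total = 0.0
    for y in range(p.shape[1]):
        p_y = p[:, y, :].sum()
        total += p_y * mi(p[:, y, :] / p_y)
    return total

i_xz = mi(p_xyz.sum(axis=1))  # marginalize out y
i_yz = mi(p_xyz.sum(axis=0))  # marginalize out x
assert np.isclose(cond_mi(p_xyz), i_xz - i_yz)  # Equation (4)
```

The identity holds exactly (up to floating point) because *I*(*X*, *Y*; *Z*) = *I*(*X*; *Z*) under the conditional independence *Z* ⊥ *Y* | *X*.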

#### *2.4. Variational Bound on CEB*

We will variationally upper bound the first term of Equation (5) and lower bound the second term using three distributions: *e*(*z*|*x*), the *encoder*, which defines the joint distribution we will use for sampling, *p*(*x*, *y*, *z*) ≡ *p*(*x*, *y*)*e*(*z*|*x*); *b*(*z*|*y*), the *backward encoder*, an approximation of *p*(*z*|*y*); and *c*(*y*|*z*), the *classifier*, an approximation of *p*(*y*|*z*) (the name is arbitrary, as *Y* may not be labels). All of *e*(·), *b*(·), and *c*(·) may have learned parameters, just like the encoder and decoder of a VAE [29], or the encoder, classifier, and marginal in VIB.

In the following, we write expectations as ⟨·⟩, e.g., ⟨log *e*(*z*|*x*)⟩. They are always with respect to the joint distribution; here, that is *p*(*x*, *y*, *z*) ≡ *p*(*x*, *y*)*e*(*z*|*x*). The first term of Equation (5):

$$I(X;Z|Y) = -H(Z|X) + H(Z|Y) = \langle \log e(z|x) \rangle - \langle \log p(z|y) \rangle \tag{9}$$

$$= \langle \log e(z|x) \rangle - \langle \log b(z|y) \rangle - \langle \text{KL}[p(z|y) \| b(z|y)] \rangle \tag{10}$$

$$\leq \langle \log e(z|x) \rangle - \langle \log b(z|y) \rangle \tag{11}$$

The second term of Equation (5):

$$I(Y;Z) = H(Y) - H(Y|Z) \geq -H(Y|Z) = \langle \log p(y|z) \rangle \tag{12}$$

$$= \langle \log c(y|z) \rangle + \langle \text{KL}[p(y|z) \| c(y|z)] \rangle \tag{13}$$

$$\geq \langle \log c(y|z) \rangle \tag{14}$$

These variational bounds give us a tractable objective function for amortized inference, the *Variational Conditional Entropy Bottleneck* (VCEB):

$$\text{VCEB} \equiv \min\_{e,b,c} \langle \log e(z|x) \rangle - \langle \log b(z|y) \rangle - \gamma \langle \log c(y|z) \rangle \tag{15}$$

There are a number of other ways to optimize Equation (5). We describe a few of them in Section 2.6 and Appendices B and C.
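To make Equation (15) concrete, the sketch below (our own illustration using NumPy and diagonal Gaussians; the paper's experiments use full-covariance Normals and neural network parameterizations) computes a one-sample Monte Carlo estimate of the VCEB objective for a single batch. The classifier logits are taken as given; in a real model they would be produced by a network applied to the sampled *z*:

```python
import numpy as np

rng = np.random.default_rng(1)

def diag_gaussian_logpdf(z, mean, log_var):
    """Log density of a diagonal Gaussian, summed over latent dimensions."""
    return -0.5 * np.sum(
        log_var + np.log(2 * np.pi) + (z - mean) ** 2 / np.exp(log_var),
        axis=-1)

def vceb_loss(mu_e, lv_e, mu_b, lv_b, logits, labels, gamma=1.0):
    """One-sample estimate of Equation (15):
    <log e(z|x)> - <log b(z|y)> - gamma <log c(y|z)>.

    mu_e, lv_e: encoder e(z|x) mean / log-variance, shape (batch, d)
    mu_b, lv_b: backward encoder b(z|y) mean / log-variance, (batch, d)
    logits:     classifier c(y|z) logits, shape (batch, n_classes)
    """
    # Sample z ~ e(z|x) with the reparameterization trick.
    z = mu_e + rng.standard_normal(mu_e.shape) * np.exp(0.5 * lv_e)
    log_e = diag_gaussian_logpdf(z, mu_e, lv_e)  # <log e(z|x)>
    log_b = diag_gaussian_logpdf(z, mu_b, lv_b)  # <log b(z|y)>
    # log c(y|z): log softmax of the classifier logits at the true label.
    m = logits.max(axis=1)
    log_norm = m + np.log(np.exp(logits - m[:, None]).sum(axis=1))
    log_c = logits[np.arange(len(labels)), labels] - log_norm
    return float(np.mean(log_e - log_b - gamma * log_c))
```

When the backward encoder matches the encoder exactly, the residual term ⟨log *e*⟩ − ⟨log *b*⟩ vanishes and only the classification term remains, mirroring the discussion of *ReX* converging to 0.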

#### *2.5. Comparison to the Information Bottleneck*

The Information Bottleneck (IB) [11] learns a representation *Z* from *X* subject to a soft constraint:

$$IB \equiv \min\_{Z} I(X; Z) - \beta I(Y; Z) \tag{16}$$

where *β*<sup>−1</sup> controls the strength of the constraint. As *β* → ∞, IB recovers the standard cross-entropy loss.

In Figure 2 we show information diagrams comparing which regions IB and CEB maximize and minimize. See Yeung [30] for a theoretical explanation of information diagrams. CEB avoids trying to both minimize and maximize the central region at the same time. In Figure 3 we show the feasible regions for CEB and IB, labeling the MNI point on both. CEB's rectification of the information plane means that we can always measure in absolute terms how much more we could compress our representation *at the same predictive performance*: *I*(*X*; *Z*|*Y*) ≥ 0. For IB, it is not possible to tell *a priori* how far we are from optimal compression.

**Figure 2.** Information diagrams showing how IB and CEB maximize and minimize different regions: regions inaccessible to the objective due to the Markov chain *Z* ← *X* ↔ *Y*; regions being maximized by the objective (*I*(*Y*; *Z*) in both cases); and regions being minimized by the objective. **IB** minimizes the intersection between *Z* and both *H*(*X*|*Y*) and *I*(*X*;*Y*). **CEB** only minimizes the intersection between *Z* and *H*(*X*|*Y*).

**Figure 3.** Geometry of the feasible regions for IB and CEB, with all points labeled. CEB rectifies IB's parallelogram by subtracting *I*(*Y*; *Z*) at every point. Everything outside of the black lines is unattainable by any model on any dataset. Compare the IB feasible region to the dashed region in Figure 1.

From Equations (4), (5) and (16), it is clear that CEB and IB are equivalent for *γ* = *β* − 1. To simplify comparison of the two objectives, we can parameterize them with:

$$\rho = \log \gamma = \log(\beta - 1) \tag{17}$$

Under this parameterization, for deterministic datasets, sufficiently powerful models will target the MNI point at *ρ* = 0. As *ρ* increases, more information is captured by the model. *ρ* < 0 *may* capture less than the MNI. *ρ* > 0 *may* capture more than the MNI.
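Taking the logarithm in Equation (17) as natural (to match the nats used elsewhere in the paper), converting between *ρ*, *γ*, and *β* is a pair of one-liners:

```python
import math

def gamma_from_rho(rho):
    """gamma = e^rho, from rho = log(gamma) in Equation (17)."""
    return math.exp(rho)

def beta_from_rho(rho):
    """beta = gamma + 1, from rho = log(beta - 1) in Equation (17)."""
    return math.exp(rho) + 1.0

# rho = 0 targets the MNI point: gamma = 1, equivalent IB beta = 2.
assert gamma_from_rho(0.0) == 1.0
assert beta_from_rho(0.0) == 2.0
```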

#### 2.5.1. Amortized IB

As described in Tishby et al. [11], IB is a tabular method, so it is not usable for amortized inference. The tabular optimization procedure used for IB trivially applies to CEB, just by setting *β* = *γ* + 1. Two recent works have extended IB for amortized inference. Achille and Soatto [19] present *InfoDropout*, which uses IB to motivate a variation on Dropout [31]. Alemi et al. [17] present the *Variational Information Bottleneck* (VIB):

$$\text{VIB} \equiv \langle \log e(z|x) \rangle - \langle \log m(z) \rangle - \beta \langle \log c(y|z) \rangle \tag{18}$$

Instead of the backward encoder, VIB has a *marginal posterior*, *m*(*z*), which is a variational approximation to *e*(*z*) = ∫ *dx p*(*x*)*e*(*z*|*x*).

Following Alemi et al. [32], we define the *Rate* (*R*):

$$R \equiv \langle \log e(z|x) \rangle - \langle \log m(z) \rangle \geq I(X;Z) \tag{19}$$

We similarly define the *Residual Information* (*ReX*):

$$\text{Re}\_X \equiv \langle \log e(z|x) \rangle - \langle \log b(z|y) \rangle \geq I(X;Z|Y) \tag{20}$$

During optimization, observing *R* does not tell us how tightly we are adhering to the MNI. However, observing *ReX* tells us exactly how many bits we are from the MNI point, assuming that our current classifier is optimal.

For convenience, define *CEBx* ≡ *CEBρ*=*x*, and likewise for VIB. We can compare variational CEB with VIB by taking their difference at *ρ* = 0:

$$\text{CEB}\_0 - \text{VIB}\_0 = \langle \log b(z|y) \rangle - \langle \log m(z) \rangle \tag{21}$$

$$- \langle \log c(y|z) \rangle + \langle \log p(y) \rangle \tag{22}$$

Solving for *m*(*z*) when that difference is 0:

$$m(z) = \frac{b(z|y)p(y)}{c(y|z)}\tag{23}$$

Since the optimal *m*∗(*z*) is the marginalization of *e*(*z*|*x*), at convergence we must have:

$$m^\*(z) = \int dx \, p(x)e(z|x) = \frac{p(z|y)p(y)}{p(y|z)}\tag{24}$$

This solution may be difficult to find, as *m*(*z*) only gets information about *y* indirectly through *e*(*z*|*x*). For otherwise equivalent models, we may expect *VIB*0 to converge to a looser approximation of *I*(*X*; *Z*) = *I*(*Y*; *Z*) = *I*(*X*;*Y*) than CEB. Since VIB optimizes an upper bound on *I*(*X*; *Z*), *VIB*0 will report *R* converging to *I*(*X*;*Y*), but may capture less than the MNI. In contrast, if *ReX* converges to 0, the variational tightness of *b*(*z*|*y*) to the optimal *p*(*z*|*y*) depends only on the tightness of *c*(*y*|*z*) to the optimal *p*(*y*|*z*).

#### *2.6. Model Variants*

We introduce some variants on the basic variational CEB classification model that we will use in Section 3.1.6.

#### 2.6.1. Bidirectional CEB

We can learn a shared representation *Z* that can be used to predict both *X* and *Y* with the following bidirectional CEB model: *ZX* ← *X* ↔ *Y* → *ZY*. This corresponds to the following joint: *p*(*x*, *y*, *zX*, *zY*) ≡ *p*(*x*, *y*)*e*(*zX*|*x*)*b*(*zY*|*y*). The main CEB objective can then be applied in both directions:

$$\begin{aligned} \text{CEB}\_{\text{bidir}} & \equiv \min \, -H(Z\_X|X) + H(Z\_X|Y) + \gamma\_X H(Y|Z\_X) \\ & \quad - H(Z\_Y|Y) + H(Z\_Y|X) + \gamma\_Y H(X|Z\_Y) \end{aligned} \tag{25}$$

For the two latent representations to be useful, we want them to be consistent with each other (minimally, they must have the same parametric form). Fortunately, that consistency is trivial to encourage by making the natural variational substitutions: *p*(*zY*|*x*) → *e*(*zY*|*x*) and *p*(*zX*|*y*) → *b*(*zX*|*y*). This gives variational CEBbidir:

$$\begin{aligned} \min & \left< \log e(z\_X|x) \right> - \left< \log b(z\_X|y) \right> - \gamma\_X \left< \log c(y|z\_X) \right> \\ & + \left< \log b(z\_Y|y) \right> - \left< \log e(z\_Y|x) \right> - \gamma\_Y \left< \log d(x|z\_Y) \right> \end{aligned} \tag{26}$$

where *d*(*x*|*z*) is a *decoder* distribution. At convergence, we learn a unified *Z* that is consistent with both *ZX* and *ZY*, permitting generation of either output given either input in the trained model, in the same spirit as Vedantam et al. [33], but without needing to train a joint encoder *q*(*z*|*x*, *y*).

#### 2.6.2. Consistent Classifier

We can reuse the backward encoder as a classifier: *c*(*y*|*z*) ∝ *b*(*z*|*y*)*p*(*y*). We refer to this as the *Consistent Classifier*: *c*(*y*|*z*) ≡ softmax over *y* of *b*(*z*|*y*)*p*(*y*). If the labels are uniformly distributed, the *p*(*y*) factor can be dropped; otherwise, it suffices to use the empirical *p*(*y*). Using the consistent classifier for classification problems results in a model that only needs parameters for the two encoders, *e*(*z*|*x*) and *b*(*z*|*y*). This classifier differs from the more common *maximum a posteriori* (MAP) classifier because *b*(*z*|*y*) is not the sampling distribution of either *Z* or *Y*.
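A minimal sketch of the consistent classifier (our own illustration; the array names are hypothetical): given log *b*(*z*|*y*) evaluated at a sampled *z* for every candidate label *y*, plus the empirical log *p*(*y*), the classifier is a softmax over labels:

```python
import numpy as np

def consistent_classifier(log_b, log_p_y):
    """c(y|z) = softmax over y of (log b(z|y) + log p(y)), Section 2.6.2.

    log_b:   (batch, n_classes) log densities b(z|y) at the sampled z
    log_p_y: (n_classes,) empirical log label marginal
    """
    logits = log_b + log_p_y
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

As noted above, with uniformly distributed labels `log_p_y` is constant across classes and can be dropped without changing the result.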

#### 2.6.3. CatGen Decoder

We can further generalize the idea of the consistent classifier to arbitrary prediction tasks by relaxing the requirement that we perfectly marginalize *Y* in the softmax. Instead, we can marginalize *Y* over any minibatch of size *K* we see at training time, under an assumption of a uniform distribution over the training examples we sampled:

$$p(y|z) = \frac{p(z|y)p(y)}{\int dy' \, p(z|y')p(y')} \tag{27}$$

$$\approx \frac{p(z|y)\frac{1}{K}}{\sum\_{k=1}^{K} p(z|y\_k)\frac{1}{K}} = \frac{p(z|y)}{\sum\_{k=1}^{K} p(z|y\_k)} \tag{28}$$

$$\approx \frac{b(z|y)}{\sum\_{k=1}^{K} b(z|y\_k)} \equiv c(y|z) \tag{29}$$

We can immediately see that this definition of *c*(*y*|*z*) gives a valid distribution, as it is just a softmax over the minibatch. That means it can be directly used in the original objective without violating the variational bound. We call this decoder *CatGen*, for *Categorical Generative Model*, because it can trivially "generate" *Y*: the softmax defines a categorical distribution over the batch; sampling from it gives indices of *Y* = *yj* that most closely correspond to *Z* = *zi*.
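A sketch of the CatGen decoder (our own illustration): with a minibatch of *K* sampled *z*'s and their paired *y*'s, evaluate log *b*(*zi*|*yk*) for every pair and take a softmax over the batch, as in Equation (29); the matched pairs sit on the diagonal:

```python
import numpy as np

def catgen_log_c(log_b_matrix):
    """Equation (29) applied over a minibatch.

    log_b_matrix[i, k] = log b(z_i | y_k) for all pairs in the batch.
    Returns log c(y_i | z_i), the log probability of the correct
    (diagonal) pairing under the batch softmax.
    """
    row_max = log_b_matrix.max(axis=1, keepdims=True)
    log_norm = row_max + np.log(
        np.exp(log_b_matrix - row_max).sum(axis=1, keepdims=True))
    log_c = log_b_matrix - log_norm  # log softmax over the K candidates
    return np.diagonal(log_c)

# A strongly diagonal matrix: each z_i matches its own y_i.
scores = np.full((4, 4), -10.0) + np.eye(4) * 20.0
assert np.all(catgen_log_c(scores) > np.log(0.9))
```

Training maximizes the mean of these diagonal log probabilities, which is exactly the InfoNCE-style contrastive loss discussed below.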

Maximizing *I*(*Y*; *Z*) in this manner is a universal task, in that it can be applied to any paired data *X*, *Y*. This includes images and labels: the CatGen model may be used in place of both *c*(*y*|*zX*) and *d*(*x*|*zY*) in the CEBbidir model (using *e*(*z*|*x*) for *d*(*x*|*zY*)). This avoids a common concern when dealing with multivariate predictions: if predicting *X* is disproportionately harder than predicting *Y*, it can be difficult to balance the model [33,34]. For CatGen models, predicting *X* is never any harder than predicting *Y*, since in both cases we are just trying to choose the correct example out of *K* possibilities.

It turns out that CatGen is mathematically equivalent to *Contrastive Predictive Coding* (CPC) [35] after an offset of log *K*. We can see this using the proof from Poole et al. [36], and substituting log *b*(*z*|*y*) for *f*(*y*, *z*):

$$I(Y;Z) \geq \frac{1}{K} \sum\_{k=1}^{K} \mathbb{E}\_{y\_{1:K},\, z \sim \prod\_{j} p(y\_j)\, p(x\_k|y\_k)\, e(z|x\_k)} \left[ \log \frac{e^{f(y\_k, z)}}{\frac{1}{K} \sum\_{i=1}^{K} e^{f(y\_i, z)}} \right] \tag{30}$$

$$= \frac{1}{K} \sum\_{k=1}^{K} \mathbb{E}\_{y\_{1:K},\, z \sim \prod\_{j} p(y\_j)\, p(x\_k|y\_k)\, e(z|x\_k)} \left[ \log \frac{b(z|y\_k)}{\frac{1}{K} \sum\_{i=1}^{K} b(z|y\_i)} \right] \tag{31}$$

The advantage of the CatGen approach over CPC in the CEB setting is that we have already parameterized the forward and backward encoders to compute *I*(*X*; *Z*|*Y*), so we don't need to introduce any new parameters when using CatGen to maximize the *I*(*Y*; *Z*) term.

As with CPC, the CatGen bound is constrained by log *K*, but when targeting the MNI, it is more likely that we can train with log *K* ≥ *I*(*X*;*Y*). This is trivially the case for the datasets we explore here, where *I*(*X*;*Y*) ≤ log 10. It is also practical for larger datasets like ImageNet, where models are routinely trained with batch sizes in the thousands (e.g., Goyal et al. [37]), and *I*(*X*;*Y*) ≤ log 1000.

#### **3. Results**

We evaluate deterministic, VIB, and CEB models on Fashion MNIST [38] and CIFAR10 [39]. Our experiments focus on comparing the performance of otherwise *identical* models when we change only the objective function and vary *ρ*. Thus, we are interested in relative differences in performance that can be directly attributed to the difference in objective and *ρ*. These experiments cover the three aspects of *Robust Generalization* (Section 2.1): **(RG1)** (classical generalization) in Sections 3.1 and 3.1.6; **(RG2)** (adversarial robustness) in Sections 3.1 and 3.1.6; and **(RG3)** (detection) in Section 3.1.

#### *3.1. (RG1), (RG2), and (RG3): Fashion MNIST*

Fashion MNIST [38] is an interesting dataset in that it is visually complex and challenging, but small enough to train in a reasonable amount of time. We trained 60 different models on Fashion MNIST, four each for the following 15 types: a deterministic model (*Determ*); seven VIB models (VIB−1, ..., VIB5); and seven CEB models (CEB−1, ..., CEB5). Subscripts indicate *ρ*. All 60 models share the same inference architecture and are trained with otherwise identical hyperparameters. See Appendix A for details.

#### 3.1.1. (RG1): Accuracy and Compression

In Figure 4 we see that both VIB and CEB have improved accuracy over the deterministic baseline, consistent with compressed representations generalizing better. Also, CEB outperforms VIB at every *ρ*, which we can attribute to the tighter variational bound given by minimizing *ReX* rather than *R*. In the case of a simple classification problem with a uniform distribution over classes in the training set (like Fashion MNIST), we can directly compute *I*(*X*;*Y*) = log *C*, where *C* is the number of classes. In order to compare the relative complexity of the learned representations for the VIB and CEB models, in the second panel of Figure 4 we show the maximum *rate lower bound* seen during training:

$$R\_X \equiv \left\langle \log \frac{e(z|x)}{\frac{1}{K} \sum\_{k=1}^{K} e(z|x\_k)} \right\rangle \leq I(X;Z)$$

using the encoder's minibatch marginal for both VIB and CEB. This lower bound on *I*(*X*; *Z*) is the "InfoNCE with a tractable encoder" bound from Poole et al. [36]. The two sets of models show nearly the same *RX* at each value of *ρ*. Both models converge to exactly *I*(*X*;*Y*) = log 10 ≈ 2.3 nats at *ρ* = 0, as predicted by the derivation of CEB.

**Figure 4.** Test accuracy, maximum rate lower bound *RX* ≤ *I*(*X*; *Z*) seen during training, and robustness to targeted PGD L2 and L∞ attacks on CEB, VIB, and Deterministic models trained on Fashion MNIST. At every *ρ* the CEB models outperform the VIB models on both accuracy and robustness, while having essentially identical maximum rates. *None of these models is adversarially trained.*

#### 3.1.2. (RG2): Adversarial Robustness

The bottom two panels of Figure 4 show robustness to targeted *Projected Gradient Descent* (PGD) L2 and L∞ attacks [14]. All of the attacks target the *trouser* class of Fashion MNIST, as that is the most distinctive class. Targeting a less distinctive class, such as one of the shirt classes, would conflate the difficulty of classifying the different shirts with the robustness of the model to adversaries. To measure robustness to the targeted attacks, we count the number of predictions that changed from a correct prediction on the clean image to an incorrect prediction of the target class on the adversarial image, and divide by the original number of correct predictions. Consistent with testing **(RG2)**, these adversaries are completely unknown to the models at training time: none of these models sees any adversarial examples during training. CEB again outperforms VIB at every *ρ*, and the deterministic baseline at all but the least-compressed model (*ρ* = 5). We also see for both models that as *ρ* decreases, the robustness to both attacks increases, indicating that more compressed models are more robust.
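The robustness measurement described above can be sketched as follows (our own code; the array names are hypothetical):

```python
import numpy as np

def targeted_attack_success(clean_pred, adv_pred, labels, target):
    """Fraction of originally-correct predictions that the attack flips
    to an incorrect prediction of the target class (Section 3.1.2)."""
    correct = clean_pred == labels
    flipped = correct & (adv_pred == target) & (labels != target)
    return flipped.sum() / correct.sum()

labels = np.array([0, 1, 2, 2])
clean = np.array([0, 1, 1, 2])   # 3 of 4 clean predictions correct
adv = np.array([1, 1, 0, 1])     # attack targets class 1
assert np.isclose(targeted_attack_success(clean, adv, labels, 1), 2 / 3)
```

Examples whose true label already is the target class are excluded, since a "flip" to the target there would still be a correct prediction.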

Consistent with the MNI hypothesis, at *ρ* = 0 we end up with CEB models that have hit exactly 2.3 nats for the rate lower bound, have maintained high accuracy, and have strong robustness to both attacks. Moving to *ρ* = −1 gives only a small improvement to robustness, at the cost of a large decrease in accuracy.

#### 3.1.3. (RG3): Out-of-Distribution Detection

We compare the ability of Determ, CEB0, VIB0, and VIB4 to detect examples from four different out-of-distribution (OoD) datasets. *U*(0, 1) is uniform noise in the image domain. MNIST uses the MNIST test set. Vertical Flip is the most challenging, using vertically flipped Fashion MNIST test images, as originally proposed in Alemi et al. [18]. CW is the Carlini-Wagner L2 attack [40] at the default settings found in Papernot et al. [41], and additionally includes the adversarial attack success rate against each model.

We use two different metrics for thresholding, proposed in Alemi et al. [18]. *H* is the classifier entropy. *R* is the rate, defined in Section 2.5. These two threshold scores are used with the standard suite of proper scoring rules [42]: *False Positive Rate at 95% True Positive Rate* (FPR 95% TPR), *Area Under the ROC Curve* (AUROC), and *Area Under the Precision-Recall Curve* (AUPR).
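Two of these detection metrics can be sketched with plain NumPy (our own illustration, ignoring tied scores, and treating higher scores as "more OoD"):

```python
import numpy as np

def auroc(scores_in, scores_ood):
    """Area under the ROC curve via the Mann-Whitney rank identity."""
    s = np.concatenate([scores_in, scores_ood])
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    ood_rank_sum = ranks[len(scores_in):].sum()
    n_in, n_ood = len(scores_in), len(scores_ood)
    return (ood_rank_sum - n_ood * (n_ood + 1) / 2) / (n_in * n_ood)

def fpr_at_95_tpr(scores_in, scores_ood):
    """In-distribution false positive rate at the threshold that flags
    95% of the OoD examples."""
    threshold = np.quantile(scores_ood, 0.05)
    return float((scores_in >= threshold).mean())
```

For a model like CEB0, `scores_ood` would be the rate *R* (or classifier entropy *H*) on the OoD dataset and `scores_in` the same score on the Fashion MNIST test set.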

Table 1 shows that using *R* to detect OoD examples can be much more effective than using classifier-based approaches. The deterministic baseline model is far weaker at detection using *H* than either of the high-performing stochastic models (CEB0 and VIB4). Those models both saturate detection performance, providing reliable signals for all four OoD datasets. However, as VIB0 demonstrates, simply having *R* available as a signal does not guarantee good detection. As we saw above, the VIB0 models had noticeably worse classification performance, indicating that they had not achieved the MNI point: *I*(*Y*; *Z*) < *I*(*X*; *Z*) for those models. These results indicate that for detection, violating the MNI criterion by having *I*(*X*; *Z*) > *I*(*X*;*Y*) may not be harmful, but violating the criterion in the opposite direction *is* harmful.

**Table 1.** Results for out-of-distribution detection (*OoD*). *Thrsh.* is the threshold score used: *H* is the entropy of the classifier; *R* is the rate. Determ cannot compute *R*, so only *H* is shown. For VIB and CEB models, *H* is always inferior to *R*, similar to findings in Alemi et al. [18], so we omit it. *Adv. Success* is attack success of the CW adversary (bottom four rows). Arrows denote whether higher or lower scores are better. **Bold** indicates the best score in that column for that OoD dataset.


#### 3.1.4. (RG3): Calibration

A *well-calibrated* model is correct half of the time when it gives a confidence of 50% for its prediction. In Figure 5, we show calibration plots at various points during training for four models. Calibration curves help analyze whether models are underconfident or overconfident. Each point in the plots corresponds to a 5% confidence bin. Accuracy is averaged for each bin. All four networks move from under- to overconfidence during training. However, CEB0 and VIB0 end up only slightly overconfident, while *ρ* = 2 is already sufficient to make VIB and CEB (not shown) nearly as overconfident as the deterministic model.
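A calibration curve of the kind plotted in Figure 5 can be computed as follows (our own sketch; 5% bins as in the figure):

```python
import numpy as np

def calibration_curve(confidences, correct, n_bins=20):
    """Mean confidence vs. accuracy per confidence bin (5% bins by
    default). Returns arrays for the non-empty bins only."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(confidences, edges) - 1, 0, n_bins - 1)
    mean_conf, accuracy = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_conf.append(confidences[mask].mean())
            accuracy.append(correct[mask].mean())
    return np.array(mean_conf), np.array(accuracy)

# A perfectly calibrated bin: 50% confidence, correct half the time.
conf = np.full(100, 0.5)
corr = np.array([1, 0] * 50)
mc, acc = calibration_curve(conf, corr)
assert np.allclose(mc, 0.5) and np.allclose(acc, 0.5)
```

Points where accuracy falls below mean confidence correspond to the overconfident region below the diagonal in Figure 5.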

#### 3.1.5. (RG1): Overfitting Experiments

We replicate the basic experiment from Zhang et al. [10] by using the images from Fashion MNIST, but replacing the training labels with fixed random labels. This dataset is *information-free* because *I*(*X*;*Y*) = 0. We use that dataset to train multiple deterministic models, as well as CEB and VIB models at *ρ* from 0 through 7. We find that the CEB and VIB models with *ρ* < 6 *never* learn, even after 100 epochs of training, but the deterministic models *always* learn: after about 40 epochs of training they begin to memorize the random labels, indicating severe overfitting and a perfect *failure* to generalize.

**Figure 5.** Calibration plots with 90% confidence intervals for four of the models after 2000 steps, 20,000 steps, and 40,000 steps (left, center, and right of each trio): (**a**) is CEB0; (**b**) is VIB0; (**c**) is VIB2; (**d**) is Determ. *Perfect calibration* corresponds to the dashed diagonal lines. *Underconfidence* occurs when the points are above the diagonal. *Overconfidence* is below the diagonal. The *ρ* = 0 models are nearly perfectly calibrated still at 20,000 steps, but even at *ρ* = 2, the VIB model is almost as overconfident as Determ.

#### 3.1.6. (RG1) and (RG2): CIFAR10 Experiments

For CIFAR10 [39] we trained the largest Wide ResNet [43] we could fit on a single GPU with a batch size of 250. This was a 62×7 model trained using AutoAugment [44]. We trained 3 CatGen CEBbidir models each of CEB0 and CEB5 and then selected the two models with the highest test accuracy for the adversarial robustness experiments. We evaluated the CatGen models using the consistent classifier, since CatGen models only train *<sup>e</sup>*(*z*|*x*) and *<sup>b</sup>*(*z*|*y*). CEB0 reached **97.51%** accuracy. This result is better than the 28×10 Wide ResNet from AutoAugment by 0.19 percentage points, although it is still worse than the Shake-Drop model from that paper. We additionally tested the model on the CIFAR-10.1 test set [45], getting accuracy of 93.6%. This is a gap of only **3.9** percentage points, which is better than all of the results reported in that paper, and substantially better than the Wide ResNet results (but still inferior to the Shake-Drop AutoAugment results). The CEB5 model reached 97.06% accuracy on the normal test set and 91.9% on the CIFAR-10.1 test set, showing that increased *ρ* gave substantially worse generalization.

To test the robustness of these models, we swept *ε* for both PGD attacks (Figure 6). The CEB0 model not only has substantially higher accuracy than the adversarially-trained Wide ResNet from Madry et al. [14] (*Madry*), it also beats the Madry model on both the L2 and the L∞ attacks at almost all values of *ε*. We also show that this model is even more robust to two transfer attacks, where we used the CEB5 model and the Madry model to generate PGD attacks, and then tested them on the CEB0 model. This result indicates that these models are not doing "gradient masking", a failure mode of some attempts at adversarial defense [2], since these are black-box attacks that do not rely on taking gradients through the target model.

**Figure 6. Left:** Accuracy on untargeted *L*∞ attacks at different values of *ε* for all 10,000 CIFAR10 test set examples. CEB0 is the model with the highest accuracy (97.51%) trained at *ρ* = 0. CEB5 is the model with the highest accuracy (97.06%) trained at *ρ* = 5. Madry is the best adversarially-trained model from Madry et al. [14], with 87.3% accuracy (values provided by Aleksander Madry). CEB5 ⇒ CEB0 denotes transfer attacks from the CEB5 model to the CEB0 model. Madry ⇒ CEB0 denotes transfer attacks from the Madry model to the CEB0 model. Madry was trained with 7 steps of PGD at *ε* = 8 (grey dashed line). Chance is 10% (grey dotted line). **Right:** Accuracy on untargeted *L*2 attacks at different values of *ε*. All values are collected at 7 steps of PGD. CEB0 outperforms Madry everywhere except the region of *L*∞ *ε* ∈ [2, 7]. Madry appears to have overfit to L∞, given its poor performance on *L*2 attacks relative to either CEB model. *None of the CEB models is adversarially trained.*

#### **4. Conclusions**

We have presented the Conditional Entropy Bottleneck (CEB), motivated by the Minimum Necessary Information (MNI) criterion and the hypothesis that failures of *robust generalization* are due in part to learning models that retain *too much* information about the training data. We have shown empirically that simply by switching to CEB, models may substantially improve their robust generalization, including **(RG1)** higher accuracy, **(RG2)** better adversarial robustness, and **(RG3)** stronger OoD detection. We believe that the MNI criterion and CEB offer a promising path forward for many tasks in machine learning by permitting fast amortized inference in an easy-to-implement framework that improves robust generalization.

**Funding:** This research received no external funding.

**Acknowledgments:** We would like to thank Alex Alemi and Kevin Murphy for valuable discussions in the preparation of this work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Model Details**

Here we collect a number of results that are not critical to the core of the paper, but may be of interest to particular audiences.

#### *Appendix A.1. Fashion MNIST*

All of the models in our Fashion MNIST experiments have the same core architecture: a 7 × 2 Wide ResNet [43] for the encoder, with a final layer of *D* = 4 dimensions for the latent representation, followed by a two-layer MLP classifier using ELU [46] activations with a final categorical distribution over the 10 classes.

The stochastic models parameterize the mean and covariance of a *D* = 4 full-covariance multivariate Normal distribution with the output of the encoder. Samples from that distribution are passed into the classifier MLP. Apart from that difference, the stochastic models do not differ from Determ during evaluation. None of the five models uses any form of regularization (e.g., *L*1, *L*2, DropOut [31], BatchNorm [47]).

The VIB models have an additional learned marginal, *m*(*z*), which is a mixture of 240 *D* = 4 full-covariance multivariate Normal distributions. The CEB model instead has the backward encoder, *b*(*z*|*y*), which is a *D* = 4 full-covariance multivariate Normal distribution parameterized by a one-layer MLP mapping the label, *Y* = *y*, to the mean and covariance. In order to simplify comparisons, for CEB we additionally train a marginal *m*(*z*) identical in form to that used by the VIB models. However, for CEB, *m*(*z*) is trained using a separate optimizer so that it does not impact training of the CEB objective in any way. Having *m*(*z*) for both CEB and VIB allows us to compare the rate, *R*, of each model except Determ.

#### *Appendix A.2. CIFAR-10*

For the 62×7 CEB CIFAR-10 models, we used the AutoAugment policies for CIFAR-10. We trained the models for 800 epochs, lowering the learning rate by a factor of 10 at 400 and 600 epochs. We trained all of the models using Adam [48] at a base learning rate of 10<sup>−3</sup>.

#### *Appendix A.3. Distributional Families*

Any distributional family may be used for the encoder. Reparameterizable distributions [29,49] are convenient, but it is also possible to use the score function trick [50] to get a high-variance estimate of the gradient for distributions that have no explicit or implicit reparameterization. In general, a good choice for *b*(*z*|*y*) is the same distributional family as *e*(*z*|*x*), or a mixture thereof. These are modeling choices that need to be made by the practitioner, as they depend on the dataset. In this work, we chose Normal distributions because they are easy to work with and will be the common choice for many problems, particularly when parameterized with neural networks, but that choice is incidental rather than fundamental.

#### **Appendix B. Mutual Information Optimization**

As an objective function, CEB is independent of the methods used to optimize it. Here we focus on variational objectives because they are simple, tractable, and well-understood, but any approach to optimizing mutual information terms can work, so long as it respects the direction of the bounds required by the objective. For example, the objectives of both Oord et al. [35] and Hjelm et al. [51] could be used to maximize *I*(*Y*; *Z*).

#### *Appendix B.1. Finiteness of the Mutual Information*

The conditions for infinite mutual information given in Amjad and Geiger [52] do not apply to either CEB or VIB, as they both use stochastic encoders *e*(*z*|*x*). In our experiments using continuous representations, we did not encounter mutual information terms that diverged to infinity, although it is possible to make modeling and data choices that make numerical instabilities more likely. This is not a flaw specific to CEB or VIB, however, and we found numerical instability to be almost non-existent across a wide variety of modeling and architectural choices for both variational objectives.

#### **Appendix C. Additional CEB Objectives**

Here we describe a few more variants of the CEB objective.

#### *Appendix C.1. Hierarchical CEB*

Thus far, we have focused on learning a single latent representation (possibly composed of multiple latent variables at the same level). Here, we consider one way to learn a hierarchical model with CEB.

Consider the graphical model *Z*<sub>2</sub> ← *Z*<sub>1</sub> ← *X* ↔ *Y*. This is the simplest hierarchical supervised representation learning model. The general form of its information diagram is given in Figure A1.

**Figure A1.** Information diagram for the basic hierarchical CEB model, *Z*<sub>2</sub> ← *Z*<sub>1</sub> ← *X* ↔ *Y*.

The key observation for generalizing CEB to hierarchical models is that the target mutual information does not change. By this, we mean that all of the *Z<sub>i</sub>* in the hierarchy should cover *I*(*X*; *Y*) at convergence, which means maximizing *I*(*Y*; *Z<sub>i</sub>*). It is reasonable to ask why we would want to train such a model, given that the final set of representations are presumably all effectively identical in terms of information content. Doing so allows us to train deep models in a principled manner such that all layers of the network are consistent with each other and with the data. We need to be more careful when considering the residual information terms, though: it is not the case that we want to minimize *I*(*X*; *Z<sub>i</sub>*|*Y*), which is not consistent with the graphical model. Instead, we want to minimize *I*(*Z<sub>i−1</sub>*; *Z<sub>i</sub>*|*Y*), defining *Z*<sub>0</sub> = *X*.

This gives the following simple *Hierarchical CEB* objective:

$$CEB\_{\text{hier}} \equiv \min \sum\_{i} I(Z\_{i-1}; Z\_i | Y) - I(Y; Z\_i) \tag{A1}$$

$$\Leftrightarrow \min \sum\_{i} -H(Z\_i|Z\_{i-1}) + H(Z\_i|Y) + H(Y|Z\_i) \tag{A2}$$

Because all of the *Z<sub>i</sub>* are targeting *Y*, this objective is as stable as regular CEB.

#### *Appendix C.2. Sequence Learning*

Many of the richest problems in machine learning vary over time. In Bialek et al. [53], the authors define the *Predictive Information*:

$$I(X\_{\text{past}}; X\_{\text{future}}) = \left\langle \log \frac{p(\mathbf{x}\_{\text{past}}, \mathbf{x}\_{\text{future}})}{p(\mathbf{x}\_{\text{past}})p(\mathbf{x}\_{\text{future}})} \right\rangle$$

This is of course just the mutual information between the past and the future. However, under an assumption of temporal invariance (any window of fixed length is expected to have the same entropy), they are able to characterize the predictive information, and show that it is a subextensive quantity: lim<sub>*T*→∞</sub> *I*(*T*)/*T* → 0, where *I*(*T*) is the predictive information over a time window of length 2*T* (*T* steps of the past predicting *T* steps into the future). This concise statement tells us that past observations contain vanishingly small information about the future as the time window increases.

The application of CEB to extracting the predictive information is straightforward. Given the Markov chain *X*<sub>&lt;*t*</sub> → *X*<sub>≥*t*</sub>, we learn a representation *Z<sub>t</sub>* that optimally covers *I*(*X*<sub>&lt;*t*</sub>; *X*<sub>≥*t*</sub>) in *Predictive CEB*:

$$CEB\_{\text{pred}} \equiv \min I(X\_{<t}; Z\_t | X\_{\geq t}) - I(X\_{\geq t}; Z\_t) \tag{A3}$$

$$\Rightarrow \min -H(Z\_t|X\_{<t}) + H(Z\_t|X\_{\geq t}) + H(X\_{\geq t}|Z\_t) \tag{A4}$$

Given a dataset of sequences, CEB<sub>pred</sub> may be extended to a bidirectional model. In this case, two representations are learned, *Z*<sub>&lt;*t*</sub> and *Z*<sub>≥*t*</sub>. Both representations are for timestep *t*, the first representing the observations before *t*, and the second representing the observations from *t* onwards. As in the normal bidirectional model, using the same encoder and backwards encoder for both parts of the bidirectional CEB objective ties the two representations together.

#### Appendix C.2.1. Modeling and Architectural Choices

As with all of the variants of CEB, whatever entropy remains in the data after the mutual information is captured in the representation must be modeled by the decoder. In this case, a natural modeling choice would be a probabilistic RNN with powerful decoders per time-step to be predicted. However, it is worth noting that such a decoder would need to sample at each future step to decode the subsequent step. An alternative, if the prediction horizon is short or the predicted data are small, is to decode the entire sequence from *Z<sub>t</sub>* in a single feed-forward network (possibly as a single autoregression over all outputs in some natural sequence). Given the subextensivity of the predictive information, that may be a reasonable choice in stochastic environments, as the useful prediction window may be small.

Likely a better alternative, however, is to use the CatGen decoder, as no generation of the long future sequences is required in that case.

#### Appendix C.2.2. Multi-Scale Sequence Learning

As in WaveNet [54], it is natural to consider sequence learning at multiple different temporal scales. Combining an architecture like time-dilated WaveNet with CEB is as simple as combining CEB<sub>pred</sub> with CEB<sub>hier</sub> (Appendix C.1). In this case, each of the *Z<sub>i</sub>* would represent a wider time dilation conditioned on the aggregate *Z*<sub>i−1</sub>.

#### *Appendix C.3. Unsupervised CEB*

For unsupervised learning, it seems challenging to put the decision about what information should be kept into objective function hyperparameters, as in the *β* VAE and penalty VAE [32] objectives. That work showed that it is possible to constrain the amount of information in the learned representation, but it is unclear how those objective functions keep only the "correct" bits of information for the downstream tasks you might care about. This is in contrast to supervised learning while targeting the MNI point, where the task clearly defines both the correct amount of information and which bits are likely to be important.

Our perspective on the importance of defining a task in order to constrain the information in the representation suggests that we can turn the problem into a data modeling problem in which the practitioner who selects the dataset also "models" the likely form of the useful bits in the dataset for the downstream task of interest.

In particular, given a dataset *X*, we propose selecting a function *f*(*X*) → *X*′ that transforms *X* into a new random variable *X*′. This defines a paired dataset, *P*(*X*, *X*′), on which we can use CEB as normal. Note that choosing the identity function for *f* results in maximal mutual information between *X* and *X*′ (*H*(*X*) nats), which will result in a representation that is far from the MNI for normal downstream tasks.

It may seem that we have not proposed anything useful, as the selection of *f*(·) is unconstrained, and seems much more daunting than selecting *β* in a *β* VAE or *σ* in a penalty VAE. However, there is a very powerful class of functions that makes this problem much simpler, and that also makes it clear that using CEB will *only* select bits from *X* that are useful: the noise functions.

#### Appendix C.3.1. Denoising CEB Autoencoder

Given a dataset *X* without labels or other targets, and some set of tasks in mind to be solved by a learned representation, we may select a random noise variable *U*, and a function *X*′ = *f*(*X*, *U*) that we believe will destroy the irrelevant information in *X*. We may then add representation variables *Z<sub>X</sub>*, *Z<sub>X′</sub>* to the model, giving the joint distribution *p*(*x*, *x*′, *u*, *z<sub>X</sub>*, *z<sub>X′</sub>*) ≡ *p*(*x*)*p*(*u*)*p*(*x*′|*f*(*x*, *u*))*e*(*z<sub>X</sub>*|*x*)*b*(*z<sub>X′</sub>*|*x*′). This joint distribution is represented in Figure A2.

**Figure A2.** Graphical model for the Denoising CEB Autoencoder.

*Denoising Autoencoders* were originally proposed in Vincent et al. [55]. In that work, the authors argue informally that reconstruction of corrupted inputs is a desirable property of learned representations. In this paper's notation, we could describe their proposed objective as min *H*(*X*|*Z<sub>X′</sub>*), or equivalently min ⟨−log *d*(*x*|*z<sub>X′</sub>* = *f*(*x*, *η*))⟩<sub>*x*,*η*∼*p*(*x*)*p*(*η*)</sub>.

We also note that, practically speaking, we would like to learn a representation that is consistent with uncorrupted inputs as well. Consequently, we are going to use a bidirectional model.

$$CEB\_{\text{denoise}} \equiv \min I(X; Z\_X | X') - I(X'; Z\_X) \tag{A5}$$

$$+I(X'; Z\_{X'}|X) - I(X; Z\_{X'}) \tag{A6}$$

$$\Rightarrow \min -H(Z\_X|X) + H(Z\_X|X') + H(X'|Z\_X) \tag{A7}$$

$$-H(Z\_{X'}|X') + H(Z\_{X'}|X) + H(X|Z\_{X'})\tag{A8}$$

This requires two encoders and two decoders, which may seem expensive, but it permits a consistent learned representation that can be used cleanly for downstream tasks. Using a single encoder/decoder pair would result in either an encoder that does not work well with uncorrupted inputs, or a decoder that only generates noisy outputs.

If you are only interested in the learned representation and not in generating good reconstructions, the objective simplifies to the first three terms. In that case, the objective is properly called a *Noising CEB Autoencoder*, as the model predicts the noisy *X*′ from *X*:

$$CEB\_{\text{noise}} \equiv \min I(X; Z\_X | X') - I(X'; Z\_X) \tag{A9}$$

$$\Rightarrow \min -H(Z\_X|X) + H(Z\_X|X') + H(X'|Z\_X) \tag{A10}$$

In these models, the noise function, *X*′ = *f*(*X*, *U*), must encode the practitioner's assumptions about the structure of information in the data. This will obviously vary per type of data, and even per desired downstream task.

However, we don't need to work too hard to find the perfect noise function initially. A reasonable choice for *f* is:

$$f(\mathbf{x}, \boldsymbol{\eta}) = \text{clip}(\mathbf{x} + \boldsymbol{\eta}, \mathcal{D}) \tag{A11}$$

$$\eta \sim \lambda\, \mathcal{U}(-1, 1) \ast \mathcal{D} \tag{A12}$$

$$\mathcal{D} = \text{domain}(X) \tag{A13}$$

In other words, add uniform noise scaled to the domain of *X* and by a hyperparameter *λ*, and clip the result to the domain of *X*. When *λ* = 1, *X*′ is indistinguishable from uniform noise. As *λ* → 0, *X*′ retains more and more of the original information from *X*. For some value of *λ* > 0, most of the irrelevant information is destroyed and most of the relevant information is maintained, if we assume that higher frequency content in the domain of *X* is less likely to contain the desired information. That information is what will be retained in the learned representation.
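As an illustration, the noise function of Equations (A11)–(A13) can be sketched in a few lines of NumPy. This is our own sketch, not code from the paper; the function name is ours, and we assume inputs live in the domain [0, 1]:

```python
import numpy as np

def noisy(x, lam, lo=0.0, hi=1.0, rng=None):
    """Corrupt x with uniform noise scaled by lam and by the domain width,
    then clip back into the domain, as in Equations (A11)-(A13)."""
    rng = np.random.default_rng() if rng is None else rng
    width = hi - lo
    eta = lam * rng.uniform(-1.0, 1.0, size=np.shape(x)) * width
    return np.clip(x + eta, lo, hi)
```

At `lam=0` this is the identity; at `lam=1` the output covers the full domain regardless of the input.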

#### Theoretical Optimality of Noise Functions

Above we claimed that this learning procedure will only select bits that are useful for the downstream task, given that we select the proper noise function. Here we prove that claim constructively. Imagine an oracle that knows which bits of information should be destroyed, and which retained, in order to solve the future task of interest. Further imagine, for simplicity, that the task of interest is classification. What noise function must that oracle implement in order to ensure that *CEB<sub>denoise</sub>* can only learn exactly the bits needed for classification? The answer is simple: for every *X* = *x<sub>i</sub>*, select *X*′ = *x*′<sub>i</sub> uniformly at random from among all of the *X* = *x<sub>j</sub>* that should have the same class label as *X* = *x<sub>i</sub>*. Now, the only way for CEB to maximize *I*(*X*′; *Z<sub>X</sub>*) and minimize *I*(*X*; *Z<sub>X</sub>*|*X*′) is by learning a representation that is isomorphic to classification, and that encodes exactly *I*(*X*; *Y*) nats of information, even though it was only trained "unsupervisedly" on (*X*, *X*′) pairs. Thus, if we can choose the correct noise function that destroys only the bits we don't care about, *CEB<sub>denoise</sub>* will learn the desired representation and nothing else (caveated by model, architecture, and optimizer selection, as usual).
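The oracle's noise function is easy to state concretely. The following sketch (our own illustration with a hypothetical function name, not the paper's code) pairs each example with another index drawn uniformly from the same class:

```python
import numpy as np

def oracle_noise(labels, rng=None):
    """For each example i, return an index j chosen uniformly at random
    among all examples whose label matches labels[i] (the oracle's
    label-preserving noise function described in the text)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    return np.array([rng.choice(np.flatnonzero(labels == y)) for y in labels])
```

Training on pairs `(x[i], x[oracle_noise(labels)[i]])` then shares exactly the class information between `X` and `X'`.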

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **CEB Improves Model Robustness**

#### **Ian Fischer \* and Alexander A. Alemi**

Google Research, Mountain View, CA 94043, USA; alemi@google.com

**\*** Correspondence: iansf@google.com

Received: 31 July 2020; Accepted: 21 September 2020; Published: 25 September 2020

**Abstract:** Intuitively, one way to make classifiers more robust to their input is to have them depend less sensitively on their input. The Information Bottleneck (IB) tries to learn compressed representations of input that are still predictive. Scaling up IB approaches to large scale image classification tasks has proved difficult. We demonstrate that the Conditional Entropy Bottleneck (CEB) can not only scale up to large scale image classification tasks, but can additionally improve model robustness. CEB is an easy strategy to implement and works in tandem with data augmentation procedures. We report results of a large scale adversarial robustness study on CIFAR-10, as well as the ImageNet-C Common Corruptions Benchmark, ImageNet-A, and PGD attacks.

**Keywords:** information theory; information bottleneck; machine learning

#### **1. Introduction**

We aim to learn models that make meaningful predictions beyond the data they were trained on. Generally we want our models to be robust. Broadly, robustness is the ability of a model to continue making valid predictions as the distribution the model is tested on moves away from the empirical training set distribution. The most commonly reported robustness metric is simply test set performance, where we verify that our model continues to make valid predictions on what we hope represents valid draws from the same data generating procedure as the training set.

Adversarial attacks test robustness in a worst-case setting, where an attacker [1] makes limited targeted modifications to the input that are as fooling as possible. Many adversarial attacks have been proposed and studied (e.g., Szegedy et al. [1], Carlini and Wagner [2,3], Kurakin et al. [4], Madry et al. [5]). Most machine-learned systems appear to be vulnerable to adversarial examples. Many defenses have been proposed, but few have demonstrated robustness against a powerful, general-purpose adversary [3,6]. Recent discussions have emphasized the need to consider forms of robustness besides adversarial [7]. The Common Corruptions Benchmark [8] measures image models' robustness to more mild real-world perturbations. Even these modest perturbations can fool traditional architectures.

One general-purpose strategy that has been shown to improve model robustness is data augmentation [9–11]. Intuitively, by performing modifications of the inputs at training time, the model is prevented from being too sensitive to particular features of the inputs that do not survive the augmentation procedure. We would like to identify complementary techniques for further improving robustness.

One approach is to try to make our models more robust by making them less sensitive to the inputs in the first place. The goal of this work is to experimentally investigate whether, by systematically limiting the complexity of the extracted representation using the Conditional Entropy Bottleneck (CEB), we can make our models more robust in all three of these senses: test set generalization (e.g., classification accuracy on "clean" test inputs), worst-case robustness, and typical-case robustness.

This paper is primarily empirical. We demonstrate:

• CEB models are easy to implement and train.


We also show that adversarially-trained models *fail* to generalize to attacks they were not trained on, by comparing the results on L2 PGD attacks from Madry et al. [5] to our results on the same baseline architecture. This result underscores the importance of finding ways to make models robust that do not rely on knowing the form of the attack ahead of time. Finally, for readers who are curious about theoretical and philosophical perspectives that may give insights into why CEB improves robustness, we recommend Fischer [12], which introduced CEB, as well as Achille and Soatto [13], Achille and Soatto [14], and Pensia et al. [15].

#### **2. Materials and Methods**

#### *2.1. Information Bottlenecks*

The Information Bottleneck (IB) objective [16] aims to learn a stochastic representation *Z* ∼ *p*(*z*|*x*) of some input *X* that retains as much information about a target variable *Y* as possible, while being as compressed as possible. The objective:

$$IB \equiv \max\_{p(z|x)} I(Z; Y) - \sigma(-\rho) I(Z; X), \tag{1}$$

uses a Lagrange multiplier *σ*(−*ρ*) to trade off the relevant information, *I*(*Z*; *Y*), against the complexity of the representation, *I*(*Z*; *X*). The IB objective is ordinarily written with a Lagrange multiplier *β* ≡ *σ*(−*ρ*) with a natural range from 0 to 1. Here we use the sigmoid function *σ*(−*ρ*) ≡ 1/(1 + *e<sup>ρ</sup>*) to reparameterize the objective in terms of a control parameter *ρ* on the whole real line. As *ρ* → ∞ the bottleneck turns off.
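To make the reparameterization concrete, here is a minimal sketch (our own, not the authors' code) of the mapping from *ρ* to *β* = *σ*(−*ρ*), including the identity *σ*(−*ρ*)/*σ*(*ρ*) = *e*<sup>−*ρ*</sup> that produces the coefficient of Equation (3):

```python
import math

def beta(rho):
    """IB trade-off coefficient beta = sigma(-rho) = 1 / (1 + e^rho).
    rho -> -inf gives beta -> 1 (maximal bottleneck); rho -> +inf gives
    beta -> 0 (bottleneck off)."""
    return 1.0 / (1.0 + math.exp(rho))

# Dividing Equation (1) by sigma(rho) = 1 - beta(rho) turns the multiplier
# into beta(rho) / (1 - beta(rho)) = e^{-rho}, the coefficient in Eq. (3).
```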

Because *Z* depends only on *X* (*Z* ← *X* ↔ *Y*), *Z* and *Y* are independent given *X*:

$$I(Z; X, Y) = I(Z; X) + \underbrace{I(Z; Y|X)}\_{=0} = I(Z; Y) + I(Z; X|Y). \tag{2}$$

This allows us to write Equation (1) in an equivalent form:

$$\max\_{p(z|x)} I(Z; Y) - e^{-\rho} I(Z; X|Y). \tag{3}$$

Just as the original IB objective (Equation (1)) admits a natural variational lower bound [17], so does this form. We can variationally lower bound the mutual information between our representation and the targets with a variational decoder *<sup>q</sup>*(*y*|*z*):

$$I(Z; Y) = \mathbb{E}\_{p(x,y)p(z|x)} \left[ \log \frac{p(y|z)}{p(y)} \right] \geq H(Y) + \mathbb{E}\_{p(x,y)p(z|x)} \left[ \log q(y|z) \right]. \tag{4}$$

While we may not know *H*(*Y*) exactly for real world datasets, in the IB formulation it is a constant outside of our control and so can be dropped in our objective. We can variationally upper bound our residual information:

$$\begin{split} I(Z;X|Y) &= \mathbb{E}\_{p(\mathbf{x},\mathbf{y})p(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p(\mathbf{z}|\mathbf{x},\mathbf{y})}{p(\mathbf{z}|\mathbf{y})} \right] \\ &\leq \mathbb{E}\_{p(\mathbf{x},\mathbf{y})p(\mathbf{z}|\mathbf{x})} \left[ \log \frac{p(\mathbf{z}|\mathbf{x})}{q(\mathbf{z}|\mathbf{y})} \right], \end{split} \tag{5}$$

with a variational class-conditional marginal *q*(*z*|*y*) that approximates ∫ *dx p*(*z*|*x*)*p*(*x*|*y*). Putting both bounds together gives us the Conditional Entropy Bottleneck objective [12]:

$$\min\_{p(z|x)} \mathbb{E}\_{p(x,y)p(z|x)} \left[ -\log q(y|z) + e^{-\rho} \log \frac{p(z|x)}{q(z|y)} \right]. \tag{6}$$

Compare this with the Variational Information Bottleneck (VIB) objective [17]:

$$\min\_{p(z|x)} \mathbb{E}\_{p(x,y)p(z|x)} \left[ -\log q(y|z) + \sigma(-\rho) \log \frac{p(z|x)}{q(z)} \right].\tag{7}$$

The difference between CEB and VIB is the presence of a class-conditional versus unconditional variational marginal. As can be seen in Equation (5), using an unconditional marginal provides a looser variational upper bound on *I*(*Z*; *X*|*Y*). CEB (Equation (6)) can be thought of as a tighter variational approximation than VIB (Equation (7)) to Equation (3). Since Equation (3) is equivalent to the IB objective (Equation (1)), CEB can be thought of as a tighter variational approximation to the IB objective than VIB.

#### *2.2. Implementing a CEB Model*

In practice, turning an existing classifier architecture into a CEB model is very simple. For the stochastic representation *p*(*z*|*x*) we simply use the original architecture, replacing the final softmax layer with a dense layer with *d* outputs. These outputs are then used to specify the means of a *d*-dimensional Gaussian distribution with unit diagonal covariance. That is, to form the stochastic representation, independent standard normal noise is simply added to the output of the network (*z* = *f*(*x*) + *ε*). For every input, this stochastic encoder will generate a random *d*-dimensional output vector. For the variational classifier *q*(*y*|*z*) any classifier network can be used, including just a linear softmax classifier as done in these experiments. For the variational conditional marginal *q*(*z*|*y*) it helps to use the same distribution as output by the classifier. For the simple unit-variance Gaussian encoding we used in these experiments, this requires learning just *d* parameters per class. For ease of implementation, this can be represented as a single dense linear layer mapping from a one-hot representation of the labels to the *d*-dimensional output, interpreted as the mean of the corresponding class marginal.
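The recipe above can be sketched with toy NumPy stand-ins. All of the names and shapes below are hypothetical placeholders for a real backbone; only the structure (a *d*-output encoder, additive unit-variance noise, a linear classifier, and *d* learned parameters per class) reflects the construction described in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes, n_features, batch = 8, 10, 32, 4

W_enc = rng.normal(size=(n_features, d))   # stand-in for a real backbone
W_cls = rng.normal(size=(d, n_classes))    # linear softmax classifier q(y|z)
mu = rng.normal(size=(n_classes, d))       # q(z|y) means: d parameters per class

def encoder(x):
    """Toy encoder f(x): in a real model this is the full network with the
    final softmax layer replaced by a d-output dense layer."""
    return x @ W_enc

x = rng.normal(size=(batch, n_features))      # a toy input batch
z = encoder(x) + rng.normal(size=(batch, d))  # z = f(x) + eps, unit covariance
logits = z @ W_cls                            # variational classifier logits
```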

In this setup the CEB loss takes a particularly simple form:

$$\mathbb{E}\left[w\_y \cdot (f(x)+\epsilon) - \log \sum\_{y'} e^{w\_{y'} \cdot (f(x)+\epsilon)} - \frac{e^{-\rho}}{2} \left(f(x)-\mu\_y\right) \cdot \left(f(x)-\mu\_y+2\epsilon\right)\right]. \tag{8}$$

The first two terms of Equation (8) are the usual softmax classifier loss, but acting on our stochastic representation *z* = *f*(*x*) + *ε*, which is simply the output of our encoder network *f*(*x*) with additive Gaussian noise. Here *w<sub>y</sub>* is the *y*th row of weights in the final linear layer outputting the logits, *μ<sub>y</sub>* are the learned class-conditional means for our marginal, and *ε* are standard normal draws from an isotropic unit-variance Gaussian with the same dimension as our encoding *f*(*x*). The last term of Equation (8) is a stochastic sampling of the KL divergence between our encoder likelihood and the class-conditional marginal likelihood. *ρ* controls the strength of the bottleneck and can vary on the whole real line. As *ρ* → ∞ the bottleneck is turned off. In practice we find that *ρ* values near but above 0 tend to work best for modest size models, with the tendency for the best *ρ* to approach 0 as the model capacity increases. Notice that in expectation the last term in the loss is proportional to (*f*(*x*) − *μ<sub>y</sub>*)<sup>2</sup>, which encourages the learned means *μ<sub>y</sub>* to converge to the average of the representations of each element in the class. During testing we use the mean encodings and remove the stochasticity.
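A direct NumPy transcription of Equation (8), written as a per-example quantity to minimize (the negative log-softmax plus *e*<sup>−*ρ*</sup> times the sampled KL term), might look as follows. This is our illustrative sketch, not the authors' implementation:

```python
import numpy as np

def ceb_loss(fx, eps, y, W, mu, rho):
    """Per-example CEB loss from Equation (8): softmax cross-entropy on
    z = f(x) + eps, plus e^{-rho} times a single-sample estimate of the
    KL between the encoder and the class-conditional marginal."""
    z = fx + eps                                   # stochastic representation
    logits = z @ W.T                               # w_y . z for every class y
    log_softmax = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_softmax[np.arange(len(y)), y]       # -(w_y . z - logsumexp)
    diff = fx - mu[y]                              # f(x) - mu_y
    kl = 0.5 * np.sum(diff * (diff + 2.0 * eps), axis=-1)  # sampled KL term
    return nll + np.exp(-rho) * kl
```

In expectation over `eps`, the `kl` term reduces to ½‖*f*(*x*) − *μ<sub>y</sub>*‖², matching the observation above.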

In its simplest form, training a CEB classifier amounts to injecting Gaussian random noise in the penultimate layer and learning estimates of the class-averaged output of that layer. In Appendix B we show simple modifications to the TPU-compatible ResNet implementation available on GitHub from the Google TensorFlow Team [18] that produce the same core ResNet50 models we use for our ImageNet experiments.

#### *2.3. Consistent Classifier*

An alternative classifier to the standard linear layer described in Section 2.2 performs the Bayesian inversion on the true class-conditional marginal:

$$p(y|z) = \frac{p(z|y)p(y)}{\sum\_{y'} p(z|y')p(y')}.\tag{9}$$

Substituting *<sup>q</sup>*(*z*|*y*) and using the empirical distribution over labels, we can define our variational classifier as:

$$q(y|z) \equiv \text{softmax}\_{\mathcal{Y}}(q(z|y)p(y))\tag{10}$$

In the case that the labels are uniformly distributed, this further simplifies to *q*(*y*|*z*) ≡ softmax<sub>*y*</sub>(*q*(*z*|*y*)). We call this the *consistent classifier* because it is Bayes-consistent with the variational conditional marginal. This is in contrast to the standard feed-forward classifier, which may choose to classify a region of the latent space differently from the highest-density class given by the conditional marginal.
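For concreteness, the consistent classifier can be sketched for unit-variance Gaussian class marginals *q*(*z*|*y*) = N(*μ<sub>y</sub>*, I): taking the softmax over log *q*(*z*|*y*) + log *p*(*y*) is a numerically stable way to evaluate the Bayesian inversion of Equation (9). The function names here are our own:

```python
import numpy as np

def consistent_logits(z, mu, log_prior):
    """Unnormalized log q(y|z) = log q(z|y) + log p(y) for unit-variance
    Gaussian class marginals; the constant part of the Gaussian
    log-density is shared by all classes and can be dropped."""
    sq_dist = np.sum((z[:, None, :] - mu[None, :, :]) ** 2, axis=-1)
    return -0.5 * sq_dist + log_prior[None, :]

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))  # stable normalization
    return e / e.sum(axis=-1, keepdims=True)
```

Each latent point is thus assigned to classes in proportion to the class-marginal density at that point, weighted by the label prior.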

#### *2.4. Adversarial Attacks and Defenses*

#### 2.4.1. Attacks

The first adversarial attacks were proposed in Szegedy et al. [1], Goodfellow et al. [19]. Since those seminal works, an enormous variety of attacks has been proposed (Carlini and Wagner [2], Kurakin et al. [4], Madry et al. [5], Kurakin et al. [20], Moosavi-Dezfooli et al. [21], Eykholt et al. [22], Baluja and Fischer [23], etc.). In this work, we will primarily consider the Projected Gradient Descent (PGD) attack [5], which is a multi-step variant of the early Fast Gradient Method [19]. The attack can be viewed as having four parameters: *p*, the norm of the attack (typically 2 or ∞); *ε*, the radius of the *p*-norm ball within which the attack is permitted to make changes to an input; *n*, the number of gradient steps the adversary is permitted to take; and *ε<sub>i</sub>*, the per-step limit to modifications of the current input. In this work, we consider L2 and L<sup>∞</sup> attacks of varying *ε* and *n*, and with *ε<sub>i</sub>* = (4/3)*ε*/*n*.
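A minimal sketch of an L∞ PGD attack using the step-size convention above may help fix ideas. This is our own illustration, not the evaluation code used in the paper; `grad_fn` is a hypothetical callable returning the loss gradient with respect to the input:

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, n):
    """L-infinity PGD sketch: take n signed gradient ascent steps of size
    eps_i = (4/3) * eps / n, projecting back onto the eps-ball around x
    after each step."""
    step = (4.0 / 3.0) * eps / n
    x_adv = x.copy()
    for _ in range(n):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))  # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)        # project to the ball
    return x_adv
```

A practical attack would additionally clip to the valid pixel range and typically start from a random point in the ball.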

#### 2.4.2. Defenses

A common defense against adversarial examples is adversarial training. Adversarial training was originally proposed in Szegedy et al. [1], but was not practical until the Fast Gradient Method was introduced. It has been studied in detail, with varied techniques [5,20,24,25]. Adversarial training can clearly be viewed as a form of data augmentation [26], where instead of using some fixed set of functions to modify the training examples, we use the model itself in combination with one or more adversarial attacks to modify the training examples. As the model changes, the distribution of modifications changes as well. However, unlike with non-adversarial data augmentation techniques, such as AutoAugment (AutoAug) [9], adversarial training techniques considered in the literature so far cause substantial reductions in accuracy on clean test sets. For example, the CIFAR10 model described in Madry et al. [5] gets 95.5% accuracy when trained normally, but only 87.3% when trained on L∞ adversarial examples. More recently, Xie et al. [25] adversarially trained ImageNet models with impressive robustness to targeted PGD L∞ attacks, but at only 62.32% accuracy on the non-adversarial test set, compared to 78.81% accuracy for the same model trained only on clean images.

#### *2.5. Common Corruptions*

The Common Corruptions Benchmark [8] offers a test of model robustness to common image processing pipeline corruptions. ImageNet-C modifies the ImageNet test set with the 15 corruptions applied at five different strengths. Within each corruption type we evaluate the average error over the five levels (*E<sub>c</sub>* = (1/5) ∑<sup>5</sup><sub>*s*=1</sub> *E<sub>cs</sub>*). To summarize the performance across all corruptions, we report both the average corruption error (avg = (1/15) ∑<sub>*c*</sub> *E<sub>c</sub>*) and the *Mean Corruption Error* (mCE) [8]:

$$\text{mCE} = \frac{1}{15} \sum\_{c} \frac{\sum\_{s=1}^{5} E\_{cs}}{\sum\_{s=1}^{5} E\_{cs}^{\text{AlexNet}}}.\tag{11}$$

The mCE weights the errors on each task against the performance of a baseline AlexNet model. Slightly different pipelines have been used for the ImageNet-C task [10]. In this work we used the AlexNet normalization numbers and data formulation from Yin et al. [11].
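The per-corruption average *E<sub>c</sub>*, the unweighted avg, and the AlexNet-normalized mCE of Equation (11) can all be computed directly from a 15 × 5 table of errors. The helper below is an illustrative sketch, not the evaluation code used in the experiments.

```python
import numpy as np

def corruption_summaries(err, err_alexnet):
    """err, err_alexnet: arrays of shape (15, 5) holding E_cs, the test error of
    the model (or the AlexNet baseline) on corruption c at severity s."""
    E_c = err.mean(axis=1)                 # per-corruption average over the 5 severities
    avg = E_c.mean()                       # unweighted average corruption error
    # Equation (11): per-corruption error normalized by the AlexNet baseline
    mCE = np.mean(err.sum(axis=1) / err_alexnet.sum(axis=1))
    return E_c, avg, mCE
```

Note that avg weights all corruptions equally, while mCE down-weights corruptions that the AlexNet baseline also finds hard.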

#### *2.6. Natural Adversarial Examples*

The ImageNet-A Benchmark [27] is a dataset of 7500 naturally-occurring "adversarial" examples across 200 ImageNet classes. The images exploit commonly-occurring weaknesses in ImageNet models, such as relying on textures often seen with certain class labels.

#### *2.7. Calibration*

One approach to estimating a model's robustness is to look at how well *calibrated* the model is. The *Expected Calibration Error* (ECE) [28] gives an intuitive metric of calibration:

$$\text{ECE} \equiv \sum\_{s=1}^{S} \frac{|B\_s|}{N} \left|\text{acc}(B\_s) - \text{conf}(B\_s)\right|,\tag{12}$$

where *S* is the number of confidence bins (30 in our experiments), *N* is the number of examples (50,000 for ImageNet and for each ImageNet-C corruption), |*Bs*| is the number of examples in the *s*th bin, acc(*Bs*) is the mean accuracy in the *s*th bin, and conf(*Bs*) is the mean confidence of the model's predictions in the *s*th bin. The ECE ranges between 0 and 1. A perfectly calibrated model would have an ECE of 0. See Ovadia et al. [29] for further details.
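Equation (12) translates directly into code. The sketch below (illustrative only, with *S* = 30 bins as in our experiments) bins predictions by confidence and accumulates the weighted accuracy/confidence gaps.

```python
import numpy as np

def expected_calibration_error(confidences, correct, num_bins=30):
    """ECE per Eq. (12): partition predictions into equal-width confidence bins,
    then take the |B_s|/N-weighted average of |acc(B_s) - conf(B_s)|."""
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # bin index in [0, num_bins - 1] for each prediction
    idx = np.clip(np.digitize(confidences, edges[1:-1]), 0, num_bins - 1)
    n = len(confidences)
    ece = 0.0
    for b in range(num_bins):
        mask = idx == b
        if not mask.any():
            continue
        acc = correct[mask].mean()          # acc(B_s)
        conf = confidences[mask].mean()     # conf(B_s)
        ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```

A perfectly calibrated model (bin accuracy equal to bin confidence everywhere) yields ECE = 0; a model that is always fully confident and always wrong yields ECE = 1.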

#### **3. Results**

#### *3.1. CIFAR10 Experiments*

We trained a set of 25 28 × 10 Wide ResNet (WRN) CEB models on CIFAR10 at *ρ* ∈ {−1, −0.75, ..., 5}, as well as a deterministic baseline. They trained for 1500 epochs, lowering the learning rate by a factor of 0.3 after 500, 1000, and 1250 epochs. This long training regime was due to our use of the original AutoAug policies, which require longer training. The only additional modification we made to the basic 28 × 10 WRN architecture was the removal of all Batch Normalization [30] layers. Every small CIFAR10 model we have trained with Batch Normalization enabled has had substantially worse robustness to L∞ PGD adversaries, even though typically the accuracy is much higher. For example, 28 × 10 WRN CEB models rarely exceeded more than 10% adversarial accuracy. However, it was always still the case that lower values of *ρ* gave higher robustness. As a baseline comparison, a deterministic 28 × 10 WRN with BatchNorm, trained with AutoAug, reaches 97.3% accuracy on clean images, but 0% accuracy on L∞ PGD attacks at *ε* = 8 and *n* = 20. Interestingly, that model was noticeably more robust to L2 PGD attacks than the deterministic baseline without BatchNorm, getting 73% accuracy compared to 66%. However, it was still much weaker than the CEB models, which get over 80% accuracy on the same attack (Figure 1). Additional training details are in Appendix A.1.

Figure 1 demonstrates the adversarial robustness of CEB models to both targeted L2 and L<sup>∞</sup> attacks. The CEB models show a marked improvement in robustness to L2 attacks compared to an adversarially-trained baseline from Madry et al. [5] (denoted Madry). The attack parameters were selected to be about equally difficult for the adversarially-trained WRN 28 × 10 model from Madry et al. [5] (grey dashed and dotted lines in Figure 1). The deterministic baseline (Det.) only gets 8% accuracy on the L<sup>∞</sup> attacks, but gets 66% on the L2 attack, substantially better than the 45.7% of the adversarially-trained model, which makes it clear that the adversarially-trained model failed to generalize in any reasonable way to the L2 attack. The CEB models are always substantially more robust than Det., and many of them outperform Madry even on the L∞ attack the Madry model was trained on, but for both attacks there is a clear general trend toward more robustness as *ρ* decreases. Finally, the CEB and Det. models all reach about the same accuracy, ranging from 93.9% to 95.1%, with Det. at 94.4%. In comparison, Madry only gets 87.3%.

Figure 2 shows the robustness of five of those models to PGD attacks as *ε* is varied. We selected the four CEB models to represent the most robust models across most of the range of *ρ* we trained. All values in the figure are collected at 20 steps of PGD. The Madry model [5] was trained with 7 steps of L<sup>∞</sup> PGD at *ε* = 8 (grey dashed line in the figure). All of the CEB models with *ρ* ≤ 4 outperform Madry across most of the values of *ε*, even though they were not adversarially-trained. It is interesting to note that the Det. model eventually outperforms the CEB5 model on L2 attacks at relatively high accuracies. This result indicates that the CEB5 model may be under-compressed.

Of the 25 CEB models we trained, only the models with *ρ* ≥ 0 successfully trained; the remainder collapsed to chance performance. This is something we observe on all datasets when training models that have too little capacity: only by increasing model capacity does it become possible to train at low *ρ*. Note that this result is predicted by the theory of the onset of learning in IB and its relationship to model capacity from Wu et al. [31].

We additionally tested two models (*ρ* = 0 and *ρ* = 5) on the CIFAR10 Common Corruptions test sets. At the time of training, we were unaware that AutoAug's default policies for CIFAR10 contain brightness and contrast augmentations that amount to training on those two corruptions from Common Corruptions (as mentioned in Yin et al. [11]), so our results are not appropriate for direct comparison with other results in the literature. However, they still allow us to compare the effect of bottlenecking the information between the two models. The *ρ* = 5 model reached an mCE of 61.2. The *ρ* = 0 model reached an mCE of 52.0, which is a dramatic relative improvement. Note that the mCE is computed relative to a baseline model. We use the baseline model from Yin et al. [11].

**Figure 1.** Conditional Entropy Bottleneck (CEB) *ρ* vs. test set accuracy, and L2 and L<sup>∞</sup> Projected Gradient Descent (PGD) adversarial attacks on CIFAR10. *None of the CEB models is adversarially trained.*

**Figure 2.** Untargeted adversarial attacks on CIFAR10 models showing both strong robustness to PGD L2 and L<sup>∞</sup> attacks, as well as good test accuracy of up to 95.1%. (**Left**): Accuracy on untargeted L<sup>∞</sup> attacks at different values of *ε* for all 10,000 test set examples. (**Right**): Accuracy on untargeted L2 attacks at different values of *ε*. Note the switch to log scale on the x axis at L2 *ε* = 100. 28 × 10 indicates the Wide ResNet size. CEB*x* indicates a CEB model trained at *ρ* = *x*. Madry is the adversarially-trained model from Madry et al. [5] (values provided by Aleksander Madry). *None of the CEB models is adversarially-trained.*

#### *3.2. ImageNet Experiments*

To demonstrate CEB's ability to improve robustness, we trained four different ResNet architectures on ImageNet at 224×224 resolution, with and without AutoAug, using three different objective functions, and then tested them on ImageNet-C, ImageNet-A, and targeted PGD attacks.

As a simple baseline we trained ResNet50 with no data augmentation using the standard cross-entropy loss (XEnt). We then trained the same network with CEB at ten different values of *ρ* ∈ {1, 2, ..., 10}. AutoAug [9] has previously been demonstrated to improve robustness markedly on ImageNet-C, so next we trained ResNet50 with AutoAug using XEnt. We similarly trained these AutoAug ResNet50 networks using CEB at the same ten values of *ρ*. ImageNet-C numbers are also sensitive to model capacity. To assess whether CEB can benefit larger models, we repeated the experiments with a modified ResNet50 network in which every layer was made twice as wide, training an XEnt model and ten CEB models, all with AutoAug. To see if there is any additional benefit or cost to using the consistent classifier (Section 2.3), we took the same wide architecture using AutoAug and trained ten consistent classifier CEB (cCEB) models. Finally, we repeated all of the previous experiments using ResNet152: XEnt and CEB models without AutoAug; with AutoAug; with AutoAug and twice as wide; and cCEB with AutoAug and twice as wide. All other hyperparameters (learning rate schedule, L2 weight decay scale, etc.) remained the same across all models; they were taken from the ResNet hyperparameters given in the AutoAug paper. In total we trained 86 ImageNet models: 6 deterministic XEnt models varying augmentation, width, and depth; 60 CEB models additionally varying *ρ*; and 20 cCEB models also varying *ρ*. The results for the ResNet50 models are summarized in Figure 3. For ResNet152, see Figure 4. See Table 1 for detailed results across the matrix of experiments. Additional experimental details are given in Appendix A.2.

**Figure 3.** Summary of the ResNet50 ImageNet-C experiments. Lower is better in all cases. In the main part of the figure (in blue), the average errors across corruption magnitude are shown for 33 different networks for each of the labeled Common Corruptions, ImageNet-A, and targeted PGD attacks. The networks come in paired sets, with the vertical lines denoting the baseline XEnt network's performance, and then in the corresponding color the errors for each of 10 different CEB networks are shown with varying *ρ* ∈ {1, 2, ..., 10}, arranged from 10 at the top to 1 at the bottom. The light blue lines indicate ResNet50 models trained without AutoAug. The blue lines show the same network trained with AutoAug. The dark blue lines show ResNet50 AutoAug networks that were made twice as wide. For these models, we display cCEB rather than CEB, which gave qualitatively similar but slightly weaker performance. The figure separately shows the effects of data augmentation, enlarging the model, and the additive effect of CEB on each model. At the top in red are shown the same data for three summary statistics. clean denotes the clean top-1 errors of each of the networks. mCE denotes the AlexNet-normalized average corruption errors. avg shows an equally-weighted average error across all common corruptions. The dots denote the value for each CEB network and each corruption at *ρ*∗, the optimum *ρ* for the network as measured in terms of clean error. The values at these dots and the baseline values are given in detail in Table 1. Figure 4 shows the same data for the ResNet152 models.

**Figure 4.** Replication of Figure 3 but for ResNet152. Lower is better in all cases. The light blue lines indicate ResNet152 models trained without AutoAug. The blue lines show the same network trained with AutoAug. The dark blue lines show ResNet152 AutoAug networks that were made twice as wide. As in Figure 3, we show the cCEB models for the largest network to reduce visual clutter. The deeper model shows marked improvement across the board compared to ResNet50, but the improvements due to CEB and cCEB are even more striking. Notice in particular the adversarial robustness to L∞ and L2 PGD attacks for the CEB models over the XEnt baselines. The L<sup>∞</sup> baselines all have error rates above 99%, so they sit at the right edge of the figure. See Table 1 for details of the best-performing models, which correspond to the dots in this figure.

The CEB models highlighted in Figures 3 and 4 and Table 1 were selected by cross-validation: we chose the values of *ρ* that gave the best *clean* test set accuracy. Despite being selected for classical generalization, these models also demonstrate a high degree of robustness on both average- and worst-case perturbations. In the case that more than one model reached the same test set accuracy, we chose the model with the lower *ρ*, since we know that lower *ρ* correlates with higher robustness. The only model where we had to make this decision was ResNet152 with AutoAug, where five models were all within 0.1% of each other, so we chose the *ρ* = 3 model rather than *ρ* ∈ {5, ..., 8}.
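The selection rule just described (best clean accuracy, with accuracies within 0.1 percentage points treated as tied and ties broken toward lower *ρ*) can be stated as a small helper. The function name and tolerance parameter below are illustrative, not from our experiment code.

```python
def select_best_rho(results, tolerance=0.1):
    """results: list of (rho, clean_test_accuracy_percent) pairs.
    Find the best clean accuracy, treat models within `tolerance` percentage
    points of it as tied, and break ties toward lower rho (more compression)."""
    best_acc = max(acc for _, acc in results)
    tied = [rho for rho, acc in results if best_acc - acc <= tolerance]
    return min(tied)
```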

**Table 1.** Baseline and cross-validated CEB values for the ImageNet experiments. **cCEB** uses the consistent classifier. **XEnt** is the baseline cross entropy objective. "**-aa**" indicates AutoAug is not used during training. "**x2**" indicates the ResNet architecture is twice as wide. The CEB values reported here are denoted with the dots in Figures 3 and 4. Lower values are better in all cases, and the lowest value for each architecture is shown in bold. All values are percentages.


#### 3.2.1. Accuracy, ImageNet-C, and ImageNet-A

Increasing model capacity and using AutoAug have positive effects on classification accuracy, as well as on robustness to ImageNet-C and ImageNet-A, but for all three classes of models CEB gives substantial additional improvements. cCEB gives a small but noticeable additional gain for all three cases (except indistinguishable performance compared to CEB on ImageNet-A with the wide ResNet152 architecture), indicating that enforcing variational consistency is a reasonable modification to the CEB objective. In Table 1 we can see that CEB's relative accuracy gains increase as the architecture gets larger, from gains of 1.2% for ResNet50 and ResNet152 without AutoAug, to 1.6% and 1.8% for the consistent wide models with AutoAug. This indicates that even larger relative gains may be possible when using CEB to train larger architectures than those considered here. We can also see that for the XEnt 152x2 and 152 models, the smaller model (152) actually has better mCE and equally good top-1 accuracy, indicating that the wider model may be overfitting, but the 152x2 CEB and cCEB models substantially outperform both of them across the board. cCEB gives a noticeable boost over CEB for clean accuracy and mCE in both wide architectures.

#### 3.2.2. Targeted PGD Attacks

We tested on the random-target version of the PGD L2 and L<sup>∞</sup> attacks [4]. The L<sup>∞</sup> attack used *ε* = 16, *n* = 10, and *ε<sub>i</sub>* = 2, which is still considered to be a strong attack [25]. The L2 attack used *ε* = 200, *n* = 10, and *ε<sub>i</sub>* = 220. Those parameters were chosen by attempting to match the baseline XEnt ResNet50 without AutoAug model's performance on the L∞ attack; the performance of the CEB models was not considered when selecting the L2 attack strength. Interestingly, for the PGD attacks, AutoAug was detrimental: the ResNet50 models without AutoAug were substantially more robust than those with AutoAug, and the ResNet152 models without AutoAug were nearly as robust as the AutoAug and wide models, in spite of having much worse test set accuracy. The ResNet50 CEB models show a dramatic improvement over the XEnt model, with top-1 accuracy under attack increasing from 0.3% to 19.8% between the XEnt baseline without AutoAug and the corresponding *ρ* = 4 CEB model, a relative increase of 66 times. Interestingly, the CEB ResNet50 models *without* AutoAug are much more robust to the adversarial attacks than the AutoAug and wide ResNet50 models. As with the accuracy results above, the robustness gains due to CEB increase as model capacity increases, indicating that further gains are possible.

#### 3.2.3. Calibration and ImageNet-C

Following the experimental setup in Reference [29], in Figure 5 we compare accuracy and ECE on ResNet models for both the clean ImageNet test set and the collection of 15 ImageNet-C corruptions at each of the five different corruption intensities. It is easy to see in the figure that the CEB models always have superior mean accuracy and ECE for all six different sets of test sets.

**Figure 5.** Comparison of accuracy and Expected Calibration Error (ECE) between XEnt baseline models and corresponding CEB models at the value of *ρ* that gives the closest accuracy to the XEnt baseline. Higher is better for accuracy; lower is better for ECE. The box plots show the minimum, 25th percentile, mean, 75th percentile, and maximum values across the 15 different ImageNet-C corruptions for the given shift intensity. XEnt baseline models are always the lighter color, with the corresponding CEB model having the darker color.

Because accuracy can have a strong impact on ECE, we use a different model selection procedure than in the previous experiments. Rather than selecting the CEB model with the highest accuracy, we instead select the CEB model with the *closest* accuracy to the corresponding XEnt model. This resulted in selecting models with lower *ρ* than in the previous experiments for four out of the six CEB model classes. We note that by selecting models with lower *ρ* (which are more compressed), we see more dramatic differences in ECE, but even if we select the CEB models with highest accuracy as in the previous experiments, all six CEB models outperform the corresponding XEnt baselines on all six different sets of test sets.

#### **4. Conclusions**

The Conditional Entropy Bottleneck (CEB) provides a simple mechanism to improve robustness of image classifiers. We have shown a strong trend toward increased robustness as *ρ* decreases in the standard 28 × 10 Wide ResNet model on CIFAR10, and that this increased robustness does not come at the expense of accuracy relative to the deterministic baseline. We have shown that CEB models at a range of *ρ* outperform an adversarially-trained baseline model, even on the attack the adversarial model was trained on, and have incidentally shown that the adversarially-trained model generalizes to at least one other attack *less well* than a deterministic baseline. Finally, we have shown that on ImageNet, CEB provides substantial gains over deterministic baselines in validation set accuracy, robustness to Common Corruptions, Natural Adversarial Examples, and targeted Projected Gradient Descent attacks, and gives large improvements to model calibration, all without any change to the inference architecture. We hope these empirical demonstrations inspire further theoretical and practical study of the use of bottlenecking techniques to encourage improvements to both classical generalization and robustness.

**Author Contributions:** Conceptualization, I.F.; methodology, I.F. and A.A.A.; software, I.F.; validation, I.F. and A.A.A.; formal analysis, A.A.A. and I.F.; investigation, I.F.; writing—original draft preparation, I.F. and A.A.A.; writing—review and editing, I.F. and A.A.A.; visualization, I.F. and A.A.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We would like to thank Justin Gilmer for helpful conversations on the use of ImageNet-C.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Experiment Details**

Here we give additional technical details for the CIFAR10 and ImageNet experiments.

#### *Appendix A.1. CIFAR10 Experiment Details*

We trained all of the models using Adam [32] at a base learning rate of 10<sup>−3</sup>. We lowered the learning rate three times, by a factor of 0.3 each time. The only additional trick needed to train the CIFAR10 models was a jump-start: start at *ρ* = 100, anneal down to *ρ* = 10 over 2 epochs, and then anneal to the target *ρ* over one epoch once training accuracy exceeded a threshold of 20%. This jump-start method is inspired by experiments on VIB in Wu et al. [31]. It makes it much easier to train models at low *ρ*, and appears not to negatively impact final performance.
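A minimal sketch of the jump-start schedule is given below, stated per training step under the assumption that the 20% training-accuracy threshold is detected externally and passed in as `anneal_start`. All names are illustrative; this is not the code used in our experiments.

```python
def rho_schedule(step, steps_per_epoch, target_rho, anneal_start=None,
                 start_rho=100.0, mid_rho=10.0):
    """Piecewise-linear jump-start: anneal rho from 100 to 10 over the first two
    epochs, hold at 10, then anneal to the target rho over one epoch starting at
    anneal_start (the step at which training accuracy first exceeded 20%)."""
    two_epochs = 2 * steps_per_epoch
    if step < two_epochs:                       # phase 1: 100 -> 10 over 2 epochs
        frac = step / two_epochs
        return start_rho + frac * (mid_rho - start_rho)
    if anneal_start is None:                    # phase 2: hold until threshold fires
        return mid_rho
    frac = min((step - anneal_start) / steps_per_epoch, 1.0)
    return mid_rho + frac * (target_rho - mid_rho)  # phase 3: 10 -> target over 1 epoch
```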

#### *Appendix A.2. ImageNet Experiment Details*

We follow the learning rate schedule for ResNet50 from Cubuk et al. [9], which has a top learning rate of 1.6, trains for 270 epochs, and drops the learning rate by a factor of 10 at 90, 180, and 240 epochs. The only difference for all of our models is that we train at a batch size of 8192 rather than 4096. Similar to the CIFAR10 models, in order to ensure that the ImageNet models train at low *ρ*, we employ a simple jump-start: we start at *ρ* = 100 and anneal down to the target *ρ* over 12,000 steps. The first learning rate drop occurs a bit after 14,000 steps. Also similar to the CIFAR10 28 × 10 WRN experiments, none of the models we trained at *ρ* = 0 succeeded, indicating that ResNet50 and wide ResNet50 both have insufficient capacity to fully learn ImageNet. We were able to train ResNet152 at *ρ* = 0, but only by disabling L2 weight decay and using a slightly lower learning rate. Since that involved additional hyperparameter tuning, we do not report those results here, beyond noting that it is possible, and that those models reached top-1 accuracy around 72%.
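The stepwise learning rate schedule above can be written as a simple lookup. The base rate and drop epochs follow the values quoted from Cubuk et al. [9]; the function itself is an illustrative sketch, not the training code.

```python
def learning_rate(epoch, base_lr=1.6, drop_epochs=(90, 180, 240)):
    """Step schedule: drop the learning rate by a factor of 10 at each epoch
    in drop_epochs (training runs for 270 epochs in total)."""
    drops = sum(epoch >= e for e in drop_epochs)
    return base_lr * (0.1 ** drops)
```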

#### **Appendix B. CEB Example Code**

In Listings 1 to 3 we give annotated code changes needed to make ResNet CEB models, based on the TPU-compatible ResNet implementation from the Google TensorFlow Team [18].

**Listing 1.** Modifications to the model.py file.

```