Article

The Supervised Information Bottleneck

1
Department of Computer Science, Reichman University, Herzliya 4610101, Israel
2
Toga Networks, a Huawei Company, Tel Aviv 4524075, Israel
*
Author to whom correspondence should be addressed.
Entropy 2025, 27(5), 452; https://doi.org/10.3390/e27050452
Submission received: 8 March 2025 / Revised: 7 April 2025 / Accepted: 12 April 2025 / Published: 22 April 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The Information Bottleneck (IB) framework offers a theoretically optimal approach to data modeling, although it is often intractable. Recent efforts have optimized supervised deep neural networks (DNNs) using a variational upper bound on the IB objective, leading to enhanced robustness to adversarial attacks. In these studies, supervision assumes a dual role: sometimes as a presumably constant and observed random variable and at other times as its variational approximation. This work proposes an extension to the IB framework, together with a derivation of its variational bound, that resolves this duality. Applying the resulting bound as an objective for supervised DNNs induces empirical improvements and provides an information-theoretic motivation for decoder regularization.

1. Introduction

The Variational Information Bottleneck (VIB) [1] adapts the theoretically optimal, yet mostly intractable, Information Bottleneck (IB) [2] to supervised DNNs. However, the IB is a method for unsupervised learning and requires knowledge of the underlying joint distribution $p(x,y)$ [3]. This requirement is relaxed in the original VIB derivation, resulting in a dual usage of the downstream RV $Y$, which is treated both as an observed RV when sampled from the training data and as a variational approximation when optimized. This work proposes a new adaptation of the IB and VIB frameworks for supervised tasks and, consequently, an information-theoretic motivation for decoder regularization.
We begin by describing the IB and how it can be adapted as an objective for DNNs. Classic information theory provides rate distortion [4] for optimal data compression. However, rate distortion regards all information as equal and cannot account for which information is more relevant to a specified downstream task unless tailored distortion functions are constructed. The Information Bottleneck (IB) [2] resolves this limitation by defining mutual information (MI) between the learned representation and a designated downstream random variable (RV) as a universal distortion function. Yet, learning representations using the IB method is possible for discrete distributions and some continuous ones, but not in the general case [5]. Moreover, MI is either difficult or impossible to optimize when considering deterministic models, such as DNNs [6,7]. Nonetheless, the promise of the IB remains alluring, and recent efforts have utilized VAE-inspired [8] variational methods to approximate upper bounds on the IB objective, allowing for its utilization as a loss function for DNNs, where the underlying distributions are both continuous and unknown [1,9,10,11]. These approaches learn representations in supervised settings without knowledge of the underlying distribution $p(x,y)$, utilizing the learned variational conditional $p(y|x)$ to approximate MI. In contrast, non-variational IB methods learn representations in unsupervised settings, where the stochastic process underlying the observed data is known [2,5,12]. When deriving the variational IB objectives, however, previous research considered the learned representation as the only optimized RV when, in practice, a variational classifier is also optimized.
This work proposes an extension of the IB and variational IB objectives by setting the downstream RV as a parameterized model in the problem definition. An empirical comparison of models trained with the proposed objective, and identical models trained using previous IB adaptations, demonstrates improved performance in several cases across various challenging tasks and modalities. Finally, interpreting our findings in the context of previous work in the field leads us to propose a novel information-theoretic interpretation of overfitting in supervised DNNs.
The reader is encouraged to refer to the preliminaries provided in Appendix A before proceeding.

2. Materials and Methods

2.1. Related Work

Deterministic Information Bottleneck. Classic information theory offers rate distortion [4] to mitigate signal loss during compression. A source $X$ is compressed to an encoding $Z$, such that maximal compression is achieved while keeping the encoding quality above a certain threshold. Encoding quality is measured by a task-specific distortion function $d : X \times Z \to \mathbb{R}^{+}$. Rate distortion suggests a mapping that minimizes the rate of bits to the source sample, measured by $I(X;Z)$, that adheres to a chosen allowed expected distortion $D \geq 0$. The Information Bottleneck (IB) [2] extends rate distortion by replacing the tailored distortion functions with MI over a target distribution. Let $Y$ be the target signal for some specific downstream task, such that the joint distribution $p(x,y)$ is known, and define the distortion function as MI between $Z$ and $Y$. The IB is the solution to the optimization problem $Z : \min_{p(z|x)} I(X;Z)$ subject to $I(Z;Y) \geq D$, which can be optimized by minimizing the IB objective $L_{IB} = \beta I(X;Z) - I(Z;Y)$ over $p(z|x)$. The solution to this objective is a function of the Lagrange multiplier $\beta$, and is a theoretical limit for representation quality, given mutual information as an accepted metric, as elaborated in more detail in Appendix B. The IB is, in fact, an unsupervised soft clustering problem in which each data point $x$ is assigned a probability $z$ of belonging to different clusters, given the joint distribution of the input and target tasks $p(x,y)$ [3]. Chechik et al. [5] showed that computing the IB for continuous distributions is hard in the general case and provided a method for optimizing the IB objective in the case where $X, Y$ are jointly Gaussian and known. Painsky and Tishby [12] offered a limited linear approximation of the IB for any distribution by extracting the joint Gaussian element of given distributions. Saxe et al. [6] considered the application of the IB objective as a loss function for DNNs and concluded that computing mutual information in deterministic DNNs is problematic as the entropy term $H(Z|X)$ for a continuous $Z$ is infinite. Amjad and Geiger [7] extended this observation and pointed out that for a discrete $Z$, MI becomes a piecewise constant function of its parameters, making gradient descent limited and difficult.
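To make the IB objective concrete, the following minimal sketch evaluates $L_{IB} = \beta I(X;Z) - I(Z;Y)$ for fully known discrete distributions, the setting in which the IB is tractable. The joint distribution, the two candidate encoders, and the function names are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint PMF given as a nested list joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (pa[a] * pb[b]))
    return mi

def ib_objective(p_xy, enc, beta):
    """L_IB = beta * I(X;Z) - I(Z;Y) for an encoder enc[x][z] = p(z|x)."""
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(row) for row in p_xy]
    # p(x, z) = p(x) p(z|x)
    p_xz = [[p_x[x] * enc[x][z] for z in range(n_z)] for x in range(n_x)]
    # p(z, y) = sum_x p(x, y) p(z|x), since Z depends on Y only through X
    p_zy = [[sum(p_xy[x][y] * enc[x][z] for x in range(n_x))
             for y in range(n_y)] for z in range(n_z)]
    return beta * mutual_information(p_xz) - mutual_information(p_zy)

# Hypothetical joint over four inputs and two labels.
p_xy = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]
# Encoder 1 copies X; encoder 2 merges inputs that share the same p(y|x).
identity = [[1.0 if z == x else 0.0 for z in range(4)] for x in range(4)]
merged = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

loss_identity = ib_objective(p_xy, identity, beta=0.1)
loss_merged = ib_objective(p_xy, merged, beta=0.1)
```

In this toy case both encoders retain the same information about $Y$, but the merged encoder uses fewer bits about $X$, so it attains the lower objective value.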
Considering the supervised problem, Geiger and Fischer [13] suggested regarding the classification output as an additional random variable, leading to an extended underlying Markov chain: $Y \to X \to Z \to \tilde{Y}$. A similar approach has also been suggested by Piran et al. [14], where a dual IB formulation was proposed that, while still considering the minimization of $I(X;Z)$, replaces the constraints with one that takes into account $\tilde{Y}$. The approach suggested here follows these ideas but adds the additional objective of reducing overfitting during the classification step.
Variational Information Bottleneck. Alemi et al. [1] introduced the Variational Information Bottleneck (VIB), a variational approximation for an upper bound to the IB objective for DNN optimization. Bounds for $I(X;Z)$ and $I(Z;Y)$ are derived from the non-negativity of KL divergence and are used to form an upper bound for the IB objective. Variational approximations are then used to replace intractable distributions in the upper bound. Using the reparameterization trick [8], a discrete empirical estimation of the variational upper bound is used as a loss function for classifier DNN optimization, resulting in an objective that is equivalent to the $\beta$-autoencoder loss [15]. VIB was evaluated over image classification tasks and displayed substantial improvements in robustness to adversarial attacks while inflicting a slight reduction in test set accuracy when compared to equivalent deterministic MLE models. The improved robustness is attributed to an improvement in representation quality and, subsequently, better generalization. Achille and Soatto [11] extended VIB with a total correlation term, designed to increase latent disentanglement.
Kolchinsky et al. [10] and Cheng et al. [16] derived variational upper bounds for the IB objective that match the VIB formulation for $I(Z;Y)$, while leveraging different MI estimators [16,17] to bound $I(X;Z)$. These approaches demonstrated improved performance over VIB on several tasks, although they can be challenging to scale to high-dimensional settings [10,18]. In a complementary direction, Yu et al. [19] proposed a non-variational method for estimating $I(X;Z)$, showing improvements over VIB on several low-dimensional datasets.
Fischer [9] proposed an IB-based loss function called the Conditional Entropy Bottleneck (CEB), in which the conditional mutual information of $X$ and $Z$ given $Y$ is minimized instead of the unconditional mutual information. The CEB loss, $L_{CEB} = \min_{Z} I(X;Z|Y) - \gamma I(Y;Z)$, is designed to minimize all information in $Z$ that is not relevant to the downstream task $Y$, by conditioning over $Y$. CEB is equivalent to IB for $\gamma = \beta - 1$, following the chain rule of mutual information [20] and the IB Markov chain, as established in Appendix B. However, its variational approximation, VCEB, differs from VIB in how the marginal is approximated. Geiger and Fischer [13] showed that VCEB is a tighter variational approximation for IB under certain conditions but not in the general case. Empirical studies carried out in [9] demonstrated that VCEB improves accuracy and robustness to targeted PGD attacks [21] for F-MNIST [22], and increased robustness to untargeted PGD on CIFAR-10 [23], compared to VIB and deterministic MLE models. Later work [24] involved an extensive experimental investigation that further substantiated the gains of VCEB, demonstrating that models trained with VCEB achieved improved robustness over deterministic MLE models for the targeted PGD attack [21] on ImageNet [25], and improved classification accuracy on the ImageNet-A and ImageNet-C datasets, two flavors of ImageNet that assess model performance on challenging edge cases and robustness to common corruptions.
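The step from the conditional to the unconditional mutual information can be written out explicitly. The short derivation below is a sketch that assumes only the chain rule of mutual information and the IB Markov chain over $Y$, $X$, and $Z$ (so that $Z$ is independent of $Y$ given $X$); the full treatment is the one the text attributes to Appendix B.

```latex
\begin{aligned}
I(Z; X, Y) &= I(Z; X) + I(Z; Y \mid X) = I(Z; Y) + I(Z; X \mid Y), \\
I(Z; Y \mid X) &= 0 \quad \text{(Markov chain over } Y, X, Z\text{)}, \\
\Rightarrow \quad I(X; Z \mid Y) &= I(X; Z) - I(Y; Z), \\
I(X; Z \mid Y) - \gamma\, I(Y; Z) &= I(X; Z) - (1 + \gamma)\, I(Y; Z).
\end{aligned}
```

Under these assumptions, the CEB objective is an IB-type trade-off between $I(X;Z)$ and $I(Y;Z)$ in which the relevance term carries the weight $1 + \gamma$.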
The work carried out in [24] offers an in-depth comparison of VCEB to deterministic MLE models, training 80 VCEB and six MLE ImageNet classifiers from the ground up while exploring varying architectures, hyperparameters, pretraining strategies, and optimization techniques to compare the best possible VCEB model to its deterministic MLE counterpart. In contrast, the experiments in [1,9] and in the present study focus on comparing the performance of otherwise identical models using different objective functions—deterministic MLE, VIB, and VCEB, as well as SVIB in the current study.
Information-Theoretic Regularization. Label smoothing [26] and entropy regularization [27] regularize classifier DNNs by increasing classifier entropy, either by adding a scaled conditional entropy term to the objective or by smoothing the training labels. Applying either method improved test accuracy and model calibration on various challenging tasks.
Alemi et al. [28] extended the information plane [2] to VAE [8] settings, measuring distortion as MI between input and reconstructed images and rate as the KL divergence between variational representation and marginal. The limits of representation quality in VAEs are looser than the theoretical IB limits and heavily depend on the chosen variational families of the marginal and decoder distributions. The closer the families are to the true distributions, the tighter the gap to the theoretical IB limit for representation quality. Alemi et al. [28] also showed that given a strong enough decoder, the ELBO loss is prone to produce low-quality representations, as the ELBO KL regularization term might induce completely uninformative representations that are then overfitted by the powerful decoder, as elaborated in detail in Appendix B.
In the current study, a conditional entropy term [27] emerges during the derivation of our proposed adaptation of the IB objective, providing a possible remedy to the discrepancies in the ELBO loss described in [28], and subsequently VIB and VCEB loss functions.

2.2. From VIB to SVIB

Problem Definition. As elaborated in Section 2.1, the IB objective, $L_{IB} = \beta I(X;Z) - I(Z;Y)$, is computed over the joint distribution $p(x,y,z)$. When $p(x,y)$ is given, this expression is optimized over the distribution $p(z|x,y)$, as proposed by Tishby et al. [2]:
$$\min_{p(z|x,y)} I(Z;X) \quad \text{s.t.} \quad I(Z;Y) \geq D_1. \tag{1}$$
However, adapting IB to supervised tasks admits the learned classifier as a new RV to the optimization problem [13,14]. Thus, we consider the extended Markov chain $Y \to X \to Z \to \tilde{Y}$ for the supervised IB, distinguishing between the true unknown RV $Y$ and the learned classifier $\tilde{Y}$. We follow this approach and also assume that $\tilde{Y}$ and $Y$ share the same support. The IB framework connects the underlying joint distribution of the input and objective data, $p(x,y)$, with a learned representation $Z$. We claim that when applying IB to supervised tasks, one must also consider the connection to the classifier defined by the output RV $\tilde{Y}$. Thus, we also want to consider the joint distribution over the pair $Z, \tilde{Y}$ during optimization. Following the IB method logic, we seek a $\tilde{Y}$ that minimizes mutual information with $Z$ while keeping a distortion metric with respect to the true $Y$ below a defined threshold. That is, we seek a second bottleneck that minimizes the passage of information between $Z$ and $\tilde{Y}$ so as to limit it to the minimum required to ensure that $\tilde{Y}$ is similar enough to $Y$, given the transition through both $X$ and $Z$. Since, in this case, we are optimizing over the joint conditional distribution $p(z,\tilde{y}|x,y)$, instead of the conditional $p(\tilde{y}|z,x,y)$, this problem is not simply an IB problem over the Markov chain $Y \to Z \to \tilde{Y}$. Additionally, contrary to the standard IB, $X$ plays a significant role, controlling the distribution of $Z$, and the entire chain of four random variables must be taken into consideration. We thus define a second bottleneck for the true distribution $p(x,y)$ and modeled distribution $c(\tilde{y}|z)\,p(z|x,y)$. We choose KL divergence as a distortion metric, as we assume $Y$ and $\tilde{Y}$ share the same support. For some positive scalar $D_2$ we have:
$$\min_{c(\tilde{y}|z)\,p(z|x,y)} I(Z;\tilde{Y}) \quad \text{s.t.} \quad D_{KL}\big(p(y=\tilde{y}|z,x) \,\|\, c(\tilde{y}|z)\,p(z|x)\big) \leq D_2. \tag{2}$$
Combining the two bottlenecks results in a new optimization problem, which we denote the Supervised Information Bottleneck (SIB), which minimizes the following objective:
$$L_{SIB} \equiv \beta I(X;Z) - I(Z;Y) + \lambda I(Z;\tilde{Y}) + D_{KL}\big(p(y=\tilde{y}|z,x) \,\|\, c(\tilde{y}|z)\,p(z|x)\big). \tag{3}$$
Optimization Objective. We proceed to derive a tractable variational upper bound for $L_{SIB}$, which we can use as an objective function for classifier DNNs. We begin by deriving the first bottleneck (1) as done in VIB [1], and proceed to derive the second (2).
  • Consider $I(Z;X)$:
$$I(Z;X) = \int p(x,z) \log p(z|x)\,dx\,dz - \int p(z) \log p(z)\,dz. \tag{4}$$
For any probability distribution $r$, we have that $D_{KL}\big(p(z) \,\|\, r(z)\big) \geq 0$. It follows that:
$$\int p(z) \log p(z)\,dz \geq \int p(z) \log r(z)\,dz. \tag{5}$$
So, by Equation (5):
$$I(Z;X) \leq \int p(x)\,p(z|x) \log \frac{p(z|x)}{r(z)}\,dx\,dz. \tag{6}$$
  • Consider $I(Z;Y)$: By the Barber–Agakov inequality [29], we have that for any probability distribution $c$:
$$I(Z;Y) \geq \int p(y,z) \log c(y|z)\,dy\,dz - \int p(y) \log(p(y))\,dy. \tag{7}$$
Note that Equations (6) and (7) hold for any distribution $r$ over the support of $Z$, and for any conditional distribution $c(\cdot|z)$ whose support equals the support of $Y$ for every given value $z$ in the support of $Z$. We link the two bottlenecks by choosing $c$ to be $\tilde{Y}|Z=z \sim c(\cdot|z)$, meaning the variational classifier distribution. This connection is implicit in [1], where $\tilde{Y}$ is not formally defined. We now move on to the second bottleneck.
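The two bounds in Equations (6) and (7) can be checked numerically on a small discrete example. The sketch below uses hypothetical distributions (`p_xy`, `enc`) and hypothetical helper names; it shows that the bounds are tight when $r$ and $c$ equal the true $p(z)$ and $p(y|z)$, and loosen for any other choice.

```python
import math

p_xy = [[0.30, 0.10], [0.05, 0.25], [0.10, 0.20]]  # p(x, y), hypothetical
enc = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]         # p(z | x), hypothetical
n_x, n_y, n_z = 3, 2, 2

p_x = [sum(row) for row in p_xy]
p_y = [sum(p_xy[x][y] for x in range(n_x)) for y in range(n_y)]
p_z = [sum(p_x[x] * enc[x][z] for x in range(n_x)) for z in range(n_z)]
# p(y, z) = sum_x p(x, y) p(z|x), using the Markov chain Y -> X -> Z
p_yz = [[sum(p_xy[x][y] * enc[x][z] for x in range(n_x))
         for z in range(n_z)] for y in range(n_y)]

def rate_upper_bound(r):
    """Right-hand side of Equation (6) for a marginal guess r(z)."""
    return sum(p_x[x] * enc[x][z] * math.log(enc[x][z] / r[z])
               for x in range(n_x) for z in range(n_z))

def relevance_lower_bound(c):
    """Right-hand side of Equation (7) for a classifier guess c[z][y]."""
    cross = sum(p_yz[y][z] * math.log(c[z][y])
                for y in range(n_y) for z in range(n_z))
    entropy_y = -sum(p * math.log(p) for p in p_y)
    return cross + entropy_y

# Plugging in the true p(z) and p(y|z) recovers the exact MI values.
i_zx = rate_upper_bound(p_z)
true_c = [[p_yz[y][z] / p_z[z] for y in range(n_y)] for z in range(n_z)]
i_zy = relevance_lower_bound(true_c)

# Any other choice of r or c loosens the corresponding bound.
loose_rate = rate_upper_bound([0.5, 0.5])
loose_relevance = relevance_lower_bound([[0.6, 0.4], [0.4, 0.6]])
```

The gap between `loose_rate` and `i_zx` equals $D_{KL}(p(z) \| r(z))$, which is why a better variational marginal tightens the bound.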
Consider $I(Z;\tilde{Y})$:
$$I(Z;\tilde{Y}) = H(\tilde{Y}) - H(\tilde{Y}|Z). \tag{8}$$
Choosing a discrete random variable for $\tilde{Y}$, as in labeled classification, we have $H(\tilde{Y}) \leq \log|\tilde{Y}|$. Otherwise, choosing a continuous RV with finite support $[a,b]$, we have that $H(\tilde{Y}) \leq \log(b-a)$. In both cases, $I(Z;\tilde{Y})$ is bounded from above by some constant $J = \log(b-a)$, or $J = \log|\tilde{Y}|$, and the negative conditional entropy term $-H(\tilde{Y}|Z)$:
$$I(Z;\tilde{Y}) \leq J - H(\tilde{Y}|Z) = J + \int p(\tilde{y},z) \log(c(\tilde{y}|z))\,d\tilde{y}\,dz. \tag{9}$$
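As a quick sanity check of this bound in the discrete case, the following sketch compares $I(Z;\tilde{Y})$ with $J - H(\tilde{Y}|Z)$ for a three-class classifier. The marginal `p_z` and classifier table `c` are hypothetical values chosen for illustration.

```python
import math

p_z = [0.5, 0.3, 0.2]                                     # p(z), hypothetical
c = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]   # c(y~ | z), hypothetical
n_classes = len(c[0])

def entropy(p):
    """Shannon entropy in nats of a PMF given as a list."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Marginal of the classifier output: p(y~) = sum_z p(z) c(y~|z)
p_yt = [sum(p_z[z] * c[z][k] for z in range(len(p_z))) for k in range(n_classes)]

h_yt = entropy(p_yt)
h_yt_given_z = sum(p_z[z] * entropy(c[z]) for z in range(len(p_z)))
mi = h_yt - h_yt_given_z        # I(Z; Y~) = H(Y~) - H(Y~|Z)
J = math.log(n_classes)         # upper bound on H(Y~) for a discrete Y~
bound = J - h_yt_given_z
```

The slack `bound - mi` equals $J - H(\tilde{Y})$, so the bound is tight exactly when the classifier's marginal output is uniform over the classes.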
Consider $D_{KL}\big(p(y=\tilde{y}|z,x) \,\|\, c(\tilde{y}|z)\,p(z|x)\big)$:
$$D_{KL}\big(p(y=\tilde{y}|z,x) \,\|\, c(\tilde{y}|z)\,p(z|x)\big) = \int p(y,z,x) \log p(y|z,x)\,dy\,dx\,dz - \int p(y,z,x) \log c(y|z,x)\,dy\,dx\,dz. \tag{10}$$
Applying the Markov chain $Y \to X \to Z \to \tilde{Y}$, and total probability, we get:
$$D_{KL}\big(p(y=\tilde{y}|z,x) \,\|\, c(\tilde{y}|z)\,p(z|x)\big) = \int p(y,x) \log p(y|x)\,dy\,dx - \int p(y,z) \log c(y|z)\,dy\,dz. \tag{11}$$
Finally, we attain an upper bound for $L_{SIB}$ by combining Equations (6), (7), (9), and (11):
$$\begin{aligned} L_{SIB} \leq{} & \beta \int p(x)\,p(z|x) \log \frac{p(z|x)}{r(z)}\,dx\,dz - 2 \int p(y,z) \log(c(y|z))\,dy\,dz \\ & + \lambda \int c(y|z)\,p(z) \log(c(y|z))\,dy\,dz + \int p(y,x) \log p(y|x)\,dy\,dx \\ & + \int p(y) \log p(y)\,dy + \lambda J. \end{aligned} \tag{12}$$
Note that $p(x,y)$ and $J$ are constants, so the last three terms in Equation (12) can be ignored in the course of optimization.
Variational Approximation and Empirical Estimation. We further develop the upper bound in Equation (12) using the IB Markov chain $Y \to X \to Z \to \tilde{Y}$ and total probability, and define tractable variational distributions to replace intractable ones. Let $e(z|x)$ be a variational encoder approximating the conditional $p(z|x)$, let $r(z)$ be a variational approximation for the marginal, and let $c(y|z)$ be a variational classifier approximating $p(y|z)$. We define the variational approximation $L_{SVIB}$:
$$\begin{aligned} L_{SVIB} \equiv{} & \beta \int p(x)\,e(z|x) \log \frac{e(z|x)}{r(z)}\,dx\,dz - 2 \int p(x)\,p(y|x)\,e(z|x) \log c(y|z)\,dx\,dy\,dz \\ & + \lambda \int p(x)\,e(z|x)\,c(y|z) \log c(y|z)\,dx\,dy\,dz. \end{aligned} \tag{13}$$
As is common in the VIB and VAE literature, we choose a standard Gaussian for the variational marginal $r(z)$, a spherical Gaussian for the variational encoder $e(z|x)$, and a categorical distribution for the variational classifier $c(y|z)$. We use DNNs to model these distributions as follows: Let $e_{\phi}(z|x) \sim N(\mu, \Sigma)$ be a stochastic DNN encoder with parameters $\phi$, and a final layer of dimension $2K$, such that for each forward pass, the first $K$ entries are used to encode $\mu$, and the last $K$ entries to encode a diagonal $\Sigma$, after a soft-plus transformation. Let $C_{\gamma}$ be a discrete classifier neural net parameterized by $\gamma$, such that $C_{\gamma}(y|z) \sim \text{Categorical}$. $r(z)$ is constant and unparameterized. We use Monte Carlo sampling over some discrete dataset $S$ to empirically estimate $L_{SVIB}$. The true and possibly continuous distribution $p(x,y) = p(y|x)\,p(x)$ can be sampled from $S$. Distributions featuring $Z$ are sampled from the stochastic encoder using the reparameterization trick [8], such that for each $x_n \in S$ we generate a sample $\hat{z}_n$. Finally, we use the variational classifier to attain instances $\tilde{y}_n$, given an instance $\hat{z}_n$.
$$\hat{L}_{SVIB} \equiv \frac{1}{N} \sum_{n=1}^{N} \Big[ \beta\, D_{KL}\big(e_{\phi}(z|x_n) \,\|\, r(z)\big) - \log C_{\gamma}(y_n|\hat{z}_n) + \lambda \log C_{\gamma}(\tilde{y}_n|\hat{z}_n) \Big]. \tag{14}$$
The only difference between VIB and SVIB lies in the additional conditional entropy term $\lambda \log C_{\gamma}(\tilde{y}_n|\hat{z}_n)$. This term can be computed at each iteration using the existing logits, preserving the overall asymptotic complexity of $O(N \cdot |Y| \cdot |X|)$. Detailed runtime performance measurements are provided in Appendix C.
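The empirical loss can be sketched per sample in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the soft-plus output is assumed to parameterize the standard deviation of the diagonal Gaussian, and the entropy term is computed as the exact expectation $\sum_{y} c(y|z) \log c(y|z)$ from the logits rather than from a single sampled $\tilde{y}_n$.

```python
import math
import random

def softplus(a):
    # Numerically stable log(1 + exp(a)).
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def encode(head, rng):
    """Split a 2K-dim encoder output into (mu, sigma); draw z = mu + sigma * eps."""
    k = len(head) // 2
    mu = head[:k]
    sigma = [softplus(a) for a in head[k:]]  # assumption: soft-plus gives std
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    return mu, sigma, z

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def svib_loss(mu, sigma, logits, label, beta, lam):
    """Per-sample loss: beta * KL + cross-entropy + lam * E_c[log c(y|z)]."""
    log_c = log_softmax(logits)
    cross_entropy = -log_c[label]
    # E_{y ~ c}[log c(y|z)] is the negative entropy of the classifier output.
    neg_entropy = sum(math.exp(lc) * lc for lc in log_c)
    return beta * kl_to_standard_normal(mu, sigma) + cross_entropy + lam * neg_entropy

rng = random.Random(0)
mu, sigma, z = encode([0.3, -0.2, 0.1, -0.4], rng)   # K = 2
logits = [2.0 * z[0], -1.0 * z[1], 0.5]              # stand-in for a classifier net
vib = svib_loss(mu, sigma, logits, label=0, beta=0.01, lam=0.0)
svib = svib_loss(mu, sigma, logits, label=0, beta=0.01, lam=0.1)
```

Setting `lam=0.0` recovers a VIB-style loss, so the two values differ only by the entropy term, which is non-positive and therefore rewards less confident classifier outputs.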
Motivation. Tishby et al. [2] proposed that representations are optimal if they contain just enough information for a required downstream task, and that the Information Bottleneck is a method to obtain such representations. However, in the supervised case, an additional information processing stage is added, where representations are decoded by a learned decoder (here, decoder in the general sense, including classifiers and other decoders) in a joint training process. As mentioned in Section 2.1, Alemi et al. [28] observed that the ELBO loss function [8] may learn uninformative representations even when strong KL regularization is imposed, since an overpowerful decoder can overfit the learned embeddings. This observation holds for all VIB loss functions [1,9,16], as VIB is equivalent to the ELBO loss, as shown in [1]. Our proposed extension to the IB and VIB frameworks aims to resolve this conflict. By appending an additional bottleneck between representation $Z$ and learned classifier $\tilde{Y}$, we learn a classifier that holds the minimum information about the representation that is required to meet a designated distortion target over the true downstream RV. Extending the work in [28], we propose to define a decoder $\tilde{Y}$ as overfitting if a substantial amount of its information about $Z$ lacks relevance about $Y$. The conditional MI $I(Z;\tilde{Y}|Y)$ measures the amount of information $Z$ and $\tilde{Y}$ share, which is uninformative about $Y$. Hence, we have that $\tilde{Y}$ overfits $Z$ if:
$$\begin{aligned} I(Z;\tilde{Y}) &\gg I(Z;\tilde{Y}) - I(Z;\tilde{Y}|Y) \\ H(\tilde{Y}|Y) &\gg H(\tilde{Y}|Z), \end{aligned} \tag{15}$$
where the last line follows from the SIB Markov chain.
By deriving the second bottleneck, $L_{SVIB}$ introduces a modulated conditional entropy term to the loss function, $-\lambda H(\tilde{Y}|Z)$, inducing an increase in the right-hand side of Equation (15). At the same time, we expect that the left-hand side conditional entropy will be reduced by the power of the cross entropy term. Applying these two forces together prevents decoders from overfitting embeddings, as illustrated in Figure 1.
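The quantity behind this definition can be computed directly for discrete variables. The sketch below evaluates $I(Z;\tilde{Y}|Y) = H(\tilde{Y}|Y) - H(\tilde{Y}|Z)$, assuming that $\tilde{Y}$ depends on $Y$ only through $Z$ as in the SIB Markov chain. The joint `p_yz`, the two decoders, and the function names are hypothetical.

```python
import math

def cond_entropy(joint):
    """H(B|A) in nats for a joint PMF given as joint[a][b]."""
    h = 0.0
    for row in joint:
        pa = sum(row)
        for p in row:
            if p > 0:
                h -= p * math.log(p / pa)
    return h

def overfit_gap(p_yz, c):
    """I(Z;Y~|Y) = H(Y~|Y) - H(Y~|Z) for a decoder c[z][k] = c(y~=k | z)."""
    n_y, n_z, n_k = len(p_yz), len(p_yz[0]), len(c[0])
    p_z = [sum(p_yz[y][z] for y in range(n_y)) for z in range(n_z)]
    # p(z, y~) = p(z) c(y~|z)
    p_z_yt = [[p_z[z] * c[z][k] for k in range(n_k)] for z in range(n_z)]
    # p(y, y~) = sum_z p(y, z) c(y~|z), since Y~ sees Y only through Z
    p_y_yt = [[sum(p_yz[y][z] * c[z][k] for z in range(n_z))
               for k in range(n_k)] for y in range(n_y)]
    return cond_entropy(p_y_yt) - cond_entropy(p_z_yt)

# A representation that is only weakly informative about the label.
p_yz = [[0.3, 0.2], [0.2, 0.3]]                 # p(y, z), hypothetical
sharp = [[0.99, 0.01], [0.01, 0.99]]            # confident decoder
soft = [[0.6, 0.4], [0.4, 0.6]]                 # decoder matching p(y|z)

gap_sharp = overfit_gap(p_yz, sharp)
gap_soft = overfit_gap(p_yz, soft)
```

In this toy setting the confident decoder extracts far more information from $Z$ than $Z$ carries about $Y$, which yields the larger gap; raising $H(\tilde{Y}|Z)$, as the entropy term does, shrinks it.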

3. Results

We follow the experimental setup proposed by Alemi et al. [1], extending it to NLP tasks as well. We trained image classification models on the ImageNet 2012 dataset [25], and text classification models on the IMDB sentiment analysis dataset [30]. For each dataset, we compared a competitive deterministic MLE model with VIB models trained over eight different $\beta$ values ranging from $10^{-4}$ to $0.5$, VCEB models trained with $\rho$ values ranging from 1 to 7, and SVIB models trained with different combinations of $\beta$ and $\lambda$ values. Each model was trained and evaluated five times per setting. As in [1], all models were trained over a frozen encoder of the deterministic model instead of training models from the ground up, enabling extensive testing while meeting resource constraints. Models were evaluated for test set accuracy and robustness to various adversarial attacks, showing consistent performance. For image classification, we employed the untargeted Fast Gradient Sign (FGS) attack [31], as well as the targeted CW $L_2$ attack [32,33]. For text classification, we used the untargeted Deep Word Bug attack [34,35] as well as the untargeted PWWS attack [36]. The empirical results presented in Figure 2 confirm that while VIB, VCEB, and SVIB models mostly decrease test set accuracy compared to the deterministic MLE model, they significantly improve robustness to the applied adversarial attacks. SVIB consistently attains higher test set accuracy than VIB and VCEB, and in one case higher than the deterministic model as well, while demonstrating improved or on-par robustness in all attacks, apart from the targeted CW attack. A comparison of the best MLE, VIB, VCEB, and SVIB models further substantiates these findings, with statistical significance confirmed by a p-value of less than 0.05 on a Wilcoxon rank sum test.
As in [1,9], the experiments performed in the current study compare identical models that differ only in their objective function, ensuring that any performance differences arise solely from these variations. Rather than benchmarking the best possible performance, these experiments serve to validate the proposed information-theoretic approach. This method enabled us to perform a direct comparison of four objective functions across high-dimensional tasks in two modalities while systematically exploring the entire range of useful $\beta$, $\lambda$, and $\rho$ values, with five runs per setting. We leave for future work the task of benchmarking SVIB for state-of-the-art performance. As carried out in [24], this would involve training ImageNet classifiers from the ground up without a pre-trained encoder, experimenting with larger model architectures, using Gaussian mixture models for $Z$, training for more epochs, and incorporating training techniques such as AutoAugment [37], $L_2$ weight decay, and annealing strategies for $\beta$ and $\lambda$.
Elaboration on the experimental setup, detailed results, and further insights from the experiments are available in Appendix C. The code for reconstructing the experiments is provided in the following GitHub repository: github.com/nirweingarten/svib (accessed on 16 April 2025).

3.1. Image Classification

A pre-trained InceptionV3 [26] base model was used and achieved a 77.21% accuracy on the ImageNet 2012 validation set. Image classification evaluation results are shown in Figure 2, and examples of successful attacks are shown in Figure 3 and Figure 4. The ImageNet 2012 validation set was used for evaluation as the test set for ImageNet is unavailable. InceptionV3 yields a slightly worse single shot accuracy than InceptionV2 (80.4%) when run in a single model and single crop setting; however, we have used InceptionV3 over V2 for simplicity. Each model was trained for 100 epochs. The entire validation set was used to measure accuracy and robustness to FGS attacks, while only 1% of it was used for CW attacks, as they are computationally expensive. Complete results are available in Appendix C.1. t-SNE [38] visualization of the latent space of each model is presented in Figure 5.

3.2. Text Classification

A fine-tuned BERT uncased [39] base model was used and achieved a 93.0% accuracy on the IMDB sentiment analysis test set. Text classification evaluation results are shown in Figure 2. Each model was trained for 150 epochs. The entire test set was used to measure accuracy, while only the first 200 entries in the test set were used for adversarial attacks, as they are computationally expensive. Complete results are available in Appendix C.1. Examples of successful attacks are shown in Table 1 and Table 2.

4. Discussion

The IB is a special case of rate distortion, and was initially designed to optimize compressed representations. Applying the IB objective for supervised tasks results in the optimization of a classifier distribution as well, and requires a reformulation of the initial problem to include both representation and classification. We propose the Supervised IB (SIB), an extension to the original IB that considers the classifier distribution as well, and adds an additional bottleneck to mitigate information flow between representations and the classifier. We derive SVIB, a tractable variational approximation for SIB, and show that it induces empirical gains in terms of classification accuracy and robustness to several adversarial attacks over high-dimensional tasks of different modalities, with high statistical significance. We apply previous information-theoretic frameworks for deep learning [26,27,28] to interpret our findings and propose a definition for decoder overfitting, and a new motivation for conditional entropy regularization. While other advancements have been achieved in recent years [1,9,10,11], none proposes a reformulation of the IB, which is required in our opinion.
This study opens many opportunities for further research: applying SVIB in self-supervised learning and, in particular, measuring whether representations learned with SVIB capture better semantics than representations learned with non-IB-inspired loss functions; conducting an in-depth empirical study that includes training an encoder from the ground up with SVIB, exploring different architectures, using a full covariance matrix and Gaussian mixture model for $Z$, applying different training techniques such as AutoAugment, $\beta$ and $\lambda$ annealing, and $L_2$ weight decay, as well as testing against PGD [21] and AutoAttack [40]; and combining SVIB with VCEB.

Limitations

SVIB falls short of VIB and VCEB in robustness to the targeted CW attack, introduces an additional hyperparameter, and requires a slightly longer training time than VIB due to the additional entropy computation.

Author Contributions

Conceptualization, N.Z.W. and R.B.; methodology, N.Z.W. and R.B.; software, N.Z.W.; validation, N.Z.W. and R.B.; formal analysis, N.Z.W. and R.B.; investigation, N.Z.W., R.B., Z.Y. and M.B.; original draft preparation, N.Z.W.; review and editing, N.Z.W., R.B., Z.Y. and M.B.; visualization, N.Z.W., R.B., Z.Y. and M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

This paper uses ImageNet 2012 [25] and the IMDB sentiment analysis [30] datasets which are both publicly available. The paper’s code repository features code that downloads, preprocesses and uses these datasets.

Acknowledgments

We would like to thank Ran Gilad-Bachrach for his advice, and Kfir Bar for the generous allocation of GPUs.

Conflicts of Interest

Author Ronit Bustin is employed by Toga Networks. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Preliminaries

Appendix A.1. Notation

We denote random variables (RVs) with uppercase letters $X, Y$, and their realizations in lowercase $x, y$. Denote discrete Probability Mass Functions (PMFs) with an uppercase $P(x)$ and continuous Probability Density Functions (PDFs) with a lowercase $p(x)$. Subscripts are written where the RV identities are not clear from the context, and hat notation denotes empirical measurements.
Let $X, Y$ be two observed random variables with a true and unknown joint distribution $p(x,y)$, and true marginals $p(x)$, $p(y)$. We can attempt to approximate these distributions using a model $p_{\theta}$ with parameters $\theta$, such that for generative tasks $p_{\theta}(x) \approx p(x)$ and for discriminative tasks $p_{\theta}(y|x) \approx p(y|x)$, using a dataset of $N$ i.i.d. observation pairs $S = \{(x_1,y_1), \ldots, (x_N,y_N)\}$ to fit our model. One can also assume the existence of an additional unobserved RV $Z \sim p(z)$ that influences or generates the observed RVs $X, Y$. Since $Z$ is unobserved, it is absent from the dataset $S$, and so cannot be modeled directly. Denote $p_{\theta}(x) = \int p_{\theta}(x|z)\,p_{\theta}(z)\,dz = \int p_{\theta}(x,z)\,dz$ as the marginal, $p_{\theta}(z)$ as the prior as it is not conditioned over any other RV, and $p_{\theta}(z|x)$ as the posterior following Bayes' rule.

Appendix A.2. Variational Approximations

When modeling an unobserved variable of an unknown distribution, we encounter a problem as the marginal $p_{\theta}(x) = \int p_{\theta}(x,z)\,dz$ does not have an analytic solution. This intractability can be overcome by choosing some tractable parametric variational distribution $q_{\phi}(z|x)$ to approximate the posterior $p_{\theta}(z|x)$, such that $q_{\phi}(z|x) \approx p_{\theta}(z|x)$, and estimate $p_{\theta}(x,z)$ or $p_{\theta}(x,z|y)$ by fitting the dataset $S$ [41].

Appendix A.3. Learning Tasks

Vapnik [42] defines supervised learning as follows:
  • A generator of random vectors $x \in \mathbb{R}^d$, drawn independently from an unknown probability distribution $p(x)$.
  • A supervisor who returns a scalar output value $y \in \mathbb{R}$ according to an unknown conditional probability distribution $p(y|x)$. We note that these probabilities can indeed be soft labels, where $y$ is a continuous probability vector rather than the more commonly used hard labels.
  • A learning machine capable of implementing a predefined set of functions, $f(x, \theta) : \mathbb{R}^d \times \Theta \to \mathbb{R}$, where $\Theta$ is a set of parameters.
The problem of supervised learning is choosing from the given set of functions the one that best approximates the supervisor’s response, based on observation pairs from the training set S , drawn according to p ( x , y ) = p ( x ) p ( y | x ) .
Slonim [3] defines unsupervised learning as the task of constructing a compact representation of a set of unlabeled data points $\{x_1, \ldots, x_N\}$, $x_i \in \mathbb{R}^d$, which in some sense reveals their hidden structure. This representation can be used further to achieve a variety of goals, including reasoning, prediction, communication, etc. In particular, unsupervised clustering partitions the data points into exhaustive and mutually exclusive clusters, where each cluster can be represented by a centroid, typically a weighted average of the cluster's members. Soft clustering assigns cluster probabilities to each data point and fits an assignment by minimizing the expected loss for these probabilities, usually a distance metric such as MSE.

Appendix A.4. Information Theoretic Functions

In this work, information-theoretic functions share the same notation for discrete and continuous settings, and are denoted as follows:
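| | Notation | Differential | Discrete |
|---|---|---|---|
| Entropy | $H_p(X)$ | $-\int p(x) \log p(x)\, dx$ | $-\sum_{x \in \mathcal{X}} P(x) \log P(x)$ |
| Conditional entropy | $H_p(X \mid Y)$ | $-\iint p(x, y) \log p(x \mid y)\, dx\, dy$ | $-\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log P(x \mid y)$ |
| Cross entropy | $CE(p, q)$ | $-\int p(x) \log q(x)\, dx$ | $-\sum_{x \in \mathcal{X}} P(x) \log Q(x)$ |
| Joint entropy | $H_p(X, Y)$ | $-\iint p(x, y) \log p(x, y)\, dx\, dy$ | $-\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log P(x, y)$ |
| KL divergence | $D_{KL}(p \,\Vert\, q)$ | $\int p(x) \log \frac{p(x)}{q(x)}\, dx$ | $\sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$ |
| Mutual information (MI) | $I(X; Y)$ | $\iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy$ | $\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}$ |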
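As a minimal sketch of the discrete column above, the following NumPy snippet evaluates these quantities for a small joint PMF. The function names and the example PMF are illustrative assumptions and are not taken from the paper; logarithms are natural, so results are in nats.

```python
import numpy as np

def entropy(p):
    """Discrete entropy H(X) = -sum_x P(x) log P(x), in nats."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(-np.sum(p[nz] * np.log(p[nz])))

def cross_entropy(p, q):
    """Cross entropy CE(p, q) = -sum_x P(x) log Q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz])))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x P(x) log(P(x) / Q(x)) = CE(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

def mutual_information(pxy):
    """I(X;Y) = D_KL(P(x, y) || P(x) P(y)) for a joint PMF stored as a 2D array."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl_divergence(pxy.ravel(), (px * py).ravel())

# Hypothetical joint PMF over two binary RVs (rows index x, columns index y).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

h_joint = entropy(pxy.ravel())        # H(X, Y)
h_x_given_y = h_joint - entropy(py)   # H(X|Y) = H(X, Y) - H(Y)
mi = mutual_information(pxy)          # I(X;Y)
print(h_joint, h_x_given_y, mi)
```

The identity $I(X;Y) = H(X) - H(X \mid Y)$ gives a quick consistency check between the table rows.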

Appendix B. Related Work Elaboration

This appendix supplements the related work presented in Section 2.1 by providing a deeper review of the IB, the IB theory of deep learning, and variational approximations for the IB.

Appendix B.1. The Information Plane

As mentioned in Section 2.1, the solution to the IB objective, $L_{IB} = I(X;Z) - \beta I(Z;Y)$, depends on the Lagrange multiplier $\beta$. Hence, the IB objective has no single solution, and its solutions can be plotted as a function of $\beta$ and of $Z$'s cardinality over a Cartesian system composed of the axes $I(X;Z)$ (rate) and $I(Z;Y)$ (distortion). We call the resulting curve the information curve, and its Cartesian system the information plane [2], as illustrated in Figure A1. When $\beta$ approaches 0, the distortion term is nullified and we learn a representation that has maximal compression but no information about the downstream task (such a representation may be a null vector); when $\beta$ approaches $\infty$, we learn a representation that has the maximum possible information about the downstream task but minimal compression. The region above the information curve is unreachable by any possible representation. The different bifurcations of the information curve, illustrated in Figure A1, correspond to the different possible cardinalities of the compressed representation. Under the assumption that optimizing a precision-complexity trade-off will yield a model that is closer in nature to the real underlying process, and that mutual information is a sufficient metric for this purpose, the information curve represents optimal representations for a given cardinality and rate-distortion trade-off [3].
Figure A1. The information plane and curve: rate-distortion ratio over $\beta$. At $\beta = 0$, the representation is compressed but uninformative (maximal compression); at $\beta \to \infty$, the representation is informative but potentially overfitted (maximal information). Adapted from [3].
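The role of $\beta$ in the trade-off can be illustrated with a toy discrete example. The sketch below assumes a hypothetical binary joint PMF and two hand-picked encoders $p(z|x)$, one that ignores $X$ and one that copies it; none of these values come from the paper. It evaluates $L_{IB} = I(X;Z) - \beta I(Z;Y)$ under the Markov chain $Z \leftrightarrow X \leftrightarrow Y$.

```python
import numpy as np

def mutual_information(pab):
    """I(A;B) in nats for a joint PMF stored as a 2D array."""
    pab = np.asarray(pab, dtype=float)
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    nz = pab > 0  # zero-probability cells contribute nothing
    return float(np.sum(pab[nz] * np.log(pab[nz] / (pa * pb)[nz])))

def ib_objective(pxy, enc, beta):
    """L_IB = I(X;Z) - beta * I(Z;Y), where enc[x, z] = p(z|x)."""
    px = pxy.sum(axis=1)
    pxz = px[:, None] * enc   # p(x, z) = p(x) p(z|x)
    pzy = enc.T @ pxy         # p(z, y) = sum_x p(z|x) p(x, y)
    return mutual_information(pxz) - beta * mutual_information(pzy)

# Hypothetical joint PMF p(x, y) over two binary RVs.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])

constant_enc = np.array([[1.0, 0.0],
                         [1.0, 0.0]])  # Z ignores X: maximal compression
identity_enc = np.eye(2)               # Z = X: maximal information

for beta in (1.0, 10.0):
    print(beta,
          ib_objective(pxy, constant_enc, beta),
          ib_objective(pxy, identity_enc, beta))
```

With these assumed values, the uninformative encoder attains the lower objective at the smaller $\beta$, and the copying encoder attains the lower objective at the larger $\beta$, mirroring the two ends of the information curve.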

Appendix B.2. Fixing a Broken ELBO

Kingma and Welling [8] introduced variational autoencoders (VAEs) as a latent model-based generative DNN architecture. In VAEs, an unobserved RV $Z$ is assumed to generate evidence $X$, a variational DNN encoder $e(z|x)$ is used to approximate the intractable posterior $p(z|x)$, and a variational DNN decoder $d(\hat{x}|z)$ is used to reconstruct $X$. The log probability $\log p(x)$ is developed into the tractable Evidence Lower Bound (ELBO) loss: $-\log p(x) \le L_{\mathrm{ELBO}}(x) := -\mathbb{E}_{e(z|x)}\left[\log d(x|z)\right] + D_{KL}\left(e(z|x) \,\Vert\, m(z)\right)$, consisting of a reconstruction error term (cross-entropy) and a KL regularization term between the encoder and the variational marginal $m(z)$.
Alemi et al. [28] adapt the information plane [2] to VAEs by defining an additional theoretical bound for the ratio between rate and distortion, imposed by the limits of finite parametric families of variational approximations. Instead of true rate and distortion, the proposed information plane features the variational rate $R := D_{KL}\left(e(z|x) \,\Vert\, m(z)\right)$ and the variational distortion $D := -\iint p(x)\, e(z|x) \log d(x|z)\, dx\, dz$. Figure A2 illustrates the suggested information plane, which is divided into three sub-planes: (1) Infeasible: This is the IB theoretical limit (as per Figure A1); (2) Feasible: Attainable given an infinite model family and complete variety of $e(z|x)$, $d(x|z)$ and $m(z)$; (3) Realizable: Attainable given a finite parametric and tractable variational family. The black diagonal line at the lower left satisfies $H_p(X) - D = R$, resulting in tight variational bounds on the mutual information.
Alemi et al. [28] observed that the variational rate $R$ does not depend on the variational decoder distribution $d(x|z)$. As $R$ is used as the ELBO KL regularizer, high variational compression rates can be attained regardless of the MI between the decoder and the learned representation. Equivalently, good reconstruction does not directly depend on good representation. Empirical evidence suggests that VAEs are prone to learn uninformative representations while still achieving low ELBO loss, a degeneration made possible by overpowerful decoders that are able to overfit the little information captured by the encoder. $D_{KL}\left(e(z|x) \,\Vert\, m(z)\right)$ approaches 0 iff $e(z|x) \approx m(z)$, making $e(z|x)$ close to independent of $x$ and resulting in a latent representation that fails to encode information about the input. However, a suitably powerful decoder could possibly learn to overfit encoded traces of the training examples and reach a low distortion score during optimization.
In the current study, we extended this theoretical framework to explain the advancements of our proposed loss function.
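The observation that the variational rate depends only on the encoder and the marginal can be made concrete with a small numerical sketch. It assumes a diagonal Gaussian encoder and a standard normal marginal $m(z) = \mathcal{N}(0, I)$, a common VAE choice that is an assumption here rather than a detail confirmed by the source; the function name and the batch values are hypothetical.

```python
import numpy as np

def variational_rate(mu, log_var):
    """Batch-averaged KL( N(mu, diag(exp(log_var))) || N(0, I) ), in nats."""
    kl_per_dim = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return float(np.mean(np.sum(kl_per_dim, axis=1)))

rng = np.random.default_rng(0)
batch, dim = 8, 4

# Encoder whose mean varies with the input, so z carries information about x.
mu_informative = rng.normal(size=(batch, dim))
log_var_informative = np.full((batch, dim), -2.0)

# Collapsed encoder: e(z|x) = m(z) for every x, so z is independent of x.
mu_collapsed = np.zeros((batch, dim))
log_var_collapsed = np.zeros((batch, dim))

rate_informative = variational_rate(mu_informative, log_var_informative)
rate_collapsed = variational_rate(mu_collapsed, log_var_collapsed)
print(rate_informative, rate_collapsed)
```

No decoder appears anywhere in this computation: the collapsed encoder reaches a rate of exactly zero whatever the decoder does, which is the degenerate case described above.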
Figure A2. Phase diagram, a proposed information plane interpretation of VAEs. Axes are variational rate and distortion. The IB theoretical limit is extended by an additional limit induced by the constraint of a finite parametric variational family. Once a family is chosen, we seek to learn an optimal marginal m ( z ) and decoder d ( x | z ) in order to approach the new limit. Adapted from [28].

Appendix B.3. IB Theory of Deep Learning

The following is a summary of work leveraging the IB framework for deterministic DNN optimization and interpretation. For a more comprehensive review of this opinion-splitting topic, the reader is advised to consult the work of Goldfeld and Polyanskiy [43].
Tishby and Zaslavsky [44] proposed a representation-learning interpretation of DNNs using the IB framework, regarding DNNs as Markov cascades of intermediate representations between hidden layers. Under this notion, comparing the optimal and the achieved rate-distortion ratios between DNN layers will indicate if a model is too complex or too simple for a given task and training set. Shwartz-Ziv and Tishby [45] visualized and analyzed the information plane behavior of DNNs over a toy problem with a known joint distribution. Mutual information of the different layers was estimated and used to analyze the training process. The learning process over Stochastic Gradient Descent (SGD) exhibited two separate and sequential behaviors—a short Empirical Error Minimization phase (ERM) characterized by a rapid decrease in distortion, followed by a long compression phase with an increase in rate until convergence to an optimal IB limit.
Saxe et al. [6] reproduced the experiments described in [45], expanding them to different activation functions, different datasets and different methods to estimate mutual information. It was found that double-sided saturated nonlinear activations, such as the tanh, produced a distinct compression stage when mutual information was measured by binning, as performed in [45], while other activations did not. It was also shown that DNN generalization did not depend on a distinct compression stage, and that DNNs do forget task-irrelevant information, but this happens concurrently with the learning of task-relevant information and not necessarily separately. Amjad and Geiger [7] argued against the use of the IB as an objective for deterministic DNNs, as mutual information in deterministic DNNs is either infinite or steplike because of mutual information’s invariance to invertible transformations and the absence of a decision function in the objective. Using IB as an objective in stochastic DNNs, such as the variational IB family, is suggested as a possible solution.

Appendix B.4. Conditional Entropy Bottleneck

As mentioned in Section 2.1, Fischer [9] showed that the conditional entropy bottleneck is equivalent to IB for $\gamma = \beta - 1$, following the chain rule of mutual information [20] and the IB Markov chain. We develop this equivalence in detail:
$$
\begin{aligned}
CEB &= I(X;Z|Y) - \gamma I(Z;Y) \\
&\overset{\text{MI chain rule}}{=} H(Z|Y) - H(Z|X,Y) - \gamma I(Z;Y) \\
&\overset{Z \leftrightarrow X \leftrightarrow Y}{=} H(Z|Y) - H(Z|X) - \gamma I(Z;Y) \\
&\overset{\gamma := \beta - 1}{=} H(Z|Y) - H(Z|X) - (\beta - 1) I(Z;Y) \\
&= H(Z|Y) - H(Z|X) - \beta I(Z;Y) + I(Z;Y) \\
&= H(Z|Y) - H(Z|X) - \beta I(Z;Y) + H(Z) - H(Z|Y) \\
&= H(Z) - H(Z|X) + H(Z|Y) - H(Z|Y) - \beta I(Z;Y) \\
&= I(X;Z) - \beta I(Z;Y).
\end{aligned}
$$
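The equivalence can also be checked numerically. The sketch below draws a random discrete joint $p(x, y)$ and a random encoder $p(z|x)$ (both hypothetical, as are the alphabet sizes and the value of $\beta$), builds the joint $p(x, y, z) = p(x, y)\, p(z|x)$ so that the Markov chain holds by construction, and compares the two objectives.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, nz = 4, 3, 5  # hypothetical alphabet sizes

# Random joint p(x, y) and random encoder p(z|x); Z depends on Y only through X.
pxy = rng.random((nx, ny))
pxy /= pxy.sum()
enc = rng.random((nx, nz))
enc /= enc.sum(axis=1, keepdims=True)

# Joint p(x, y, z) = p(x, y) p(z|x), indexed as [x, y, z].
pxyz = pxy[:, :, None] * enc[:, None, :]

def H(p):
    """Entropy in nats of a PMF stored in an array of any shape."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

h_x = H(pxyz.sum(axis=(1, 2)))
h_y = H(pxyz.sum(axis=(0, 2)))
h_z = H(pxyz.sum(axis=(0, 1)))
h_xy = H(pxyz.sum(axis=2))
h_xz = H(pxyz.sum(axis=1))
h_yz = H(pxyz.sum(axis=0))
h_xyz = H(pxyz)

i_xz = h_x + h_z - h_xz                    # I(X;Z)
i_zy = h_z + h_y - h_yz                    # I(Z;Y)
i_xz_given_y = h_xy + h_yz - h_y - h_xyz   # I(X;Z|Y)

beta = 3.0
gamma = beta - 1.0
ceb = i_xz_given_y - gamma * i_zy
ib = i_xz - beta * i_zy
print(ceb, ib)
```

The two printed values agree up to floating-point error, since under the Markov chain $I(X;Z|Y) = I(X;Z) - I(Z;Y)$.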

Appendix C. Experiments Elaboration

Image classification models were trained on the first 500,000 samples of the ImageNet 2012 dataset [25], and text classification was performed over the entire IMDB sentiment analysis dataset [30]. For each dataset, a competitive pre-trained model (Vanilla model) was evaluated and then used to encode embeddings. These embeddings were then used as a dataset for a new stochastic classifier net with either a VIB or an SVIB loss function. Stochastic classifiers consisted of two ReLU-activated linear layers of the same dimensions as the pre-trained model's logits (2048 for image and 768 for text classification), followed by reparameterization and a final softmax-activated FC layer. The learning rate was $10^{-4}$ and decayed exponentially with a factor of 0.97 every two epochs. Batch sizes were 32 for ImageNet and 16 for IMDB. All models were trained using an Nvidia RTX3080 GPU, with approximately 1–2 days per single experiment run, including training, test set evaluation, and adversarial attacks. Run time for a single ImageNet epoch was 2:41.37 min for VIB, 3:05.15 min for SVIB, and 3:31.31 min for VCEB. Beta values of $\beta = 10^{-i}$ for $i \in \{1, 2, 3\}$ were tested, and we used a single forward pass per sample for inference, since previous studies indicated that these are the best range and sample rate for VIB [1,28]. Each model was trained and evaluated five times per $\beta$ value with consistent performance. Statistical significance was demonstrated in all comparisons using the Wilcoxon rank-sum test, with all compared metrics attaining a p-value of less than 0.05. The rank sum was computed as follows: a sorted vector of results was prepared for each compared metric, where each entry featured the result attained in one of the five i.i.d. experiments per algorithm and a Boolean indicator of the algorithm type.
For example, let $r := \big((0.94, 1), (0.935, 1), (0.93, 1), (0.93, 1), (0.925, 1), (0.92, 0), (0.915, 0), (0.915, 0), (0.91, 0), (0.89, 0)\big)$ be a sorted vector of (test accuracy, algorithm) tuples, 1 being SVIB and 0 VIB. We compute the rank-sum as follows:
$$\mu_T = \frac{5 \cdot 11}{2} = 27.5, \quad \sigma_T = \sqrt{\frac{5 \cdot 5 \cdot 11}{12}} \approx 4.78, \quad Z(T) = \frac{15 - 27.5}{4.78} \approx -2.61$$
$$\Phi^{-1}(p_{val}) = -2.61, \quad p_{val} = 0.0045 < 0.05$$
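In practice, these were computed with the Python 3 SciPy library (version 1.13.0) as follows:

    import scipy.stats as stats

    # Test accuracies from the five runs of each algorithm (example values from above)
    vib_scores = [0.915, 0.915, 0.91, 0.92, 0.89]
    svib_scores = [0.93, 0.935, 0.925, 0.93, 0.94]
    # One-sided rank-sum test: are the SVIB scores greater than the VIB scores?
    pvalue = stats.ranksums(svib_scores, vib_scores, alternative='greater').pvalue
    assert pvalue < 0.05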
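The hand computation above can also be reproduced without SciPy. The following is a minimal sketch using only the standard library; the helper `average_ranks` is a hypothetical name, and ranking in descending order with averaged ties is an assumption chosen to match the sorted vector $r$ in the example.

```python
import math

vib_scores = [0.915, 0.915, 0.91, 0.92, 0.89]
svib_scores = [0.93, 0.935, 0.925, 0.93, 0.94]

def average_ranks(values):
    """1-based ranks in descending order; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

n1, n2 = len(svib_scores), len(vib_scores)
ranks = average_ranks(svib_scores + vib_scores)

T = sum(ranks[:n1])                                # rank sum of the SVIB runs
mu_T = n1 * (n1 + n2 + 1) / 2                      # mean of T under the null
sigma_T = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # std of T under the null
z = (T - mu_T) / sigma_T
p_val = 0.5 * math.erfc(-z / math.sqrt(2))         # standard normal CDF at z
print(T, mu_T, sigma_T, z, p_val)
```

Because every SVIB score exceeds every VIB score in this example, the SVIB runs occupy ranks 1 to 5 and the rank sum is 15, as in the formula above.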

Appendix C.1. Complete Empirical Results

The following tables contain the results of all experiments run in this study.
Table A1. Complete ImageNet evaluation scores for vanilla and SVIB models averaged over five runs with standard deviation. The first column is performance on the ImageNet validation set, the second and third columns are the percentage of unsuccessful FGS attacks at ϵ = 0.1 , 0.5 , and the fourth column is the average L 2 distance for a successful Carlini Wagner L 2 targeted attack. For all columns, higher is better ↑.
| $\beta$ | $\lambda$ | Val ↑ | FGS $\epsilon = 0.1$ ↑ | FGS $\epsilon = 0.5$ ↑ | CW ↑ |
|---|---|---|---|---|---|
| Vanilla model | | | | | |
| - | - | 77.2% | 31.1% | 32.3% | 788 |
| SVIB models | | | | | |
| $10^{-4}$ | 2 | 75.4% ± 0.01% | 40.1% ± 0.08% | 33.7% ± 2.1% | 3401 ± 267 |
| $10^{-3}$ | 0.5 | 74.9% ± 0.06% | 38.4% ± 0.06% | 33.8% ± 0.1% | 3293 ± 140 |
| $10^{-3}$ | 1 | 75.5% ± 0.03% | 37.2% ± 0.1% | 33.6% ± 0.1% | 2666 ± 140 |
| $10^{-3}$ | 2.0 | 75.4% ± 0.07% | 38.1% ± 0.1% | 33.7% ± 0.1% | 2981 ± 260 |
| $10^{-3}$ | 2.5 | 75.3% ± 0.01% | 38.3% ± 0.2% | 33.8% ± 0.15% | 3095 ± 407 |
| $10^{-3}$ | 3.0 | 75.3% ± 0.03% | 38.5% ± 0.2% | 33.9% ± 0.16% | 3078 ± 443 |
| $10^{-2}$ | 0.5 | 74.2% ± 0.11% | 42.0% ± 0.13% | 35.2% ± 0.06% | 2354 ± 394 |
| $10^{-2}$ | 1 | 75.0% ± 0.05% | 42.4% ± 0.2% | 35.7% ± 0.1% | 1564 ± 218 |
| $10^{-2}$ | 2.0 | 75.3% ± 0.07% | 43.1% ± 0.1% | 36.3% ± 0.1% | 1748 ± 160 |
| $10^{-2}$ | 2.5 | 75.4% ± 0.06% | 43.0% ± 0.13% | 36.0% ± 0.1% | 1814 ± 144 |
| $10^{-2}$ | 3.0 | 75.4% ± 0.07% | 42.9% ± 0.18% | 36.2% ± 0.12% | 1749 ± 138 |
| $10^{-1}$ | 0.5 | 73.1% ± 0.04% | 39.1% ± 0.2% | 32.6% ± 0.19% | 3738 ± 138 |
| $10^{-1}$ | 1 | 74.8% ± 0.09% | 42.1% ± 0.5% | 35.2% ± 0.5% | 3575 ± 456 |
| $10^{-1}$ | 2.0 | 75.4% ± 0.03% | 46.6% ± 1.8% | 39.8% ± 2.1% | 3332 ± 443 |
| $10^{-1}$ | 2.5 | 75.4% ± 0.03% | 45.6% ± 1.2% | 38.7% ± 1.3% | 3581 ± 243 |
| $10^{-1}$ | 3.0 | 75.1% ± 0.09% | 46.0% ± 0.8% | 39.3% ± 1.0% | 3536 ± 315 |
Table A2. Complete ImageNet evaluation scores for VIB and CEB models, averaged over five runs with standard deviation. The first column is performance on the ImageNet validation set, the second and third columns are the percentage of unsuccessful FGS attacks at ϵ = 0.1 , 0.5 , and the fourth column is the average L 2 distance for a successful Carlini Wagner L 2 targeted attack. For all columns, higher is better ↑.
| $\beta$ | $\rho$ | Val ↑ | FGS $\epsilon = 0.1$ ↑ | FGS $\epsilon = 0.5$ ↑ | CW ↑ |
|---|---|---|---|---|---|
| VIB models | | | | | |
| $10^{-4}$ | - | 74.8% ± 0.01% | 28.3% ± 0.2% | 29.3% ± 0.2% | 1554 ± 280 |
| $5 \cdot 10^{-4}$ | - | 74.1% ± 0.01% | 37.7% ± 0.01% | 34.8% ± 0.01% | 3104 ± 529 |
| $10^{-3}$ | - | 73.7% ± 0.1% | 40.5% ± 0.2% | 36.1% ± 0.2% | 3917 ± 291 |
| $5 \cdot 10^{-3}$ | - | 73.0% ± 0.04% | 44.9% ± 0.13% | 37.8% ± 0.21% | 3358 ± 245 |
| $10^{-2}$ | - | 72.8% ± 0.1% | 46.5% ± 0.2% | 38.0% ± 0.1% | 3318 ± 293 |
| $5 \cdot 10^{-2}$ | - | 72.3% ± 0.07% | 44.7% ± 0.3% | 34.9% ± 0.32% | 3654 ± 333 |
| $10^{-1}$ | - | 72.1% ± 0.01% | 41.6% ± 0.1% | 38.0% ± 0.1% | 3318 ± 293 |
| $5 \cdot 10^{-1}$ | - | 0.1% ± 0% | 0% ± 0% | 0% ± 0% | 0 ± 0 |
| CEB models | | | | | |
| - | 1 | 73.0% ± 0.07% | 26.5% ± 0.22% | 28.7% ± 0.15% | 4527 ± 64 |
| - | 2 | 73.2% ± 0% | 26.4% ± 0.21% | 29.0% ± 0.03% | 4342 ± 173 |
| - | 3 | 73.4% ± 0% | 26.7% ± 0.12% | 29.3% ± 0.18% | 4556 ± 177 |
| - | 4 | 73.8% ± 0.08% | 27.0% ± 0% | 29.9% ± 0.07% | 3689 ± 347 |
| - | 5 | 74.3% ± 0.05% | 27.6% ± 0.13% | 30.1% ± 0.22% | 1776 ± 146 |
| - | 6 | 74.6% ± 0.03% | 27.7% ± 0.35% | 30.0% ± 0.13% | 1103 ± 154 |
| - | 7 | 74.6% ± 0.04% | 28.0% ± 0.02% | 30.1% ± 0.02% | 847 ± 16 |
Table A3. Complete IMDB evaluation scores for vanilla and SVIB models, averaged over five runs with standard deviation. The first column is performance over the test set, the second is the percentage of unsuccessful Deep Word Bug attacks, and the third column is the percentage of unsuccessful PWWS attacks. For all columns, higher is better ↑.
| $\beta$ | $\lambda$ | Test ↑ | DWB ↑ | PWWS ↑ |
|---|---|---|---|---|
| Vanilla model | | | | |
| - | - | 93.0% | 45.7% | 0.0% |
| SVIB models | | | | |
| $10^{-4}$ | 1 | 92.4% ± 0.01% | 68.4% ± 1.7% | 63.9% ± 3.3% |
| $10^{-3}$ | 0.5 | 92.3% ± 0.07% | 70.7% ± 2.3% | 68.3% ± 3.3% |
| $10^{-3}$ | 1 | 93.2% ± 0.5% | 72.5% ± 2.0% | 71.6% ± 1.3% |
| $10^{-3}$ | 2.0 | 92.3% ± 0.07% | 74.7% ± 3.5% | 73.1% ± 3.4% |
| $10^{-3}$ | 2.5 | 92.4% ± 0.07% | 75.9% ± 1.9% | 72.4% ± 1.8% |
| $10^{-3}$ | 3.0 | 92.3% ± 0.04% | 74.5% ± 1.7% | 74.4% ± 0.9% |
| $10^{-2}$ | 0.5 | 92.4% ± 0.06% | 66.1% ± 4.2% | 68.3% ± 3.3% |
| $10^{-2}$ | 1 | 92.6% ± 0.8% | 69.2% ± 2.0% | 50.0% ± 4.8% |
| $10^{-2}$ | 2.0 | 92.4% ± 0.1% | 64.8% ± 4.7% | 40.3% ± 7.4% |
| $10^{-2}$ | 2.5 | 92.3% ± 0.1% | 58.1% ± 2.5% | 28.9% ± 2.45% |
| $10^{-2}$ | 3.0 | 92.3% ± 0.1% | 54.0% ± 3.3% | 22.5% ± 2.6% |
| $10^{-1}$ | 0.5 | 92.2% ± 0.02% | 1.1% ± 1.1% | 0.0% ± 0% |
| $10^{-1}$ | 1 | 89.2% ± 2.0% | 0.8% ± 0.5% | 0.0% ± 0% |
| $10^{-1}$ | 2.0 | 92.3% ± 0.2% | 0.0% ± 0% | 0.0% ± 0% |
| $10^{-1}$ | 2.5 | 92.4% ± 0.1% | 0.0% ± 0% | 0.0% ± 0% |
| $10^{-1}$ | 3.0 | 92.4% ± 0.1% | 0.0% ± 0% | 0.0% ± 0% |
Table A4. Complete IMDB evaluation scores for VIB and CEB models, averaged over five runs with standard deviation. The first column is performance over the test set, the second is the percentage of unsuccessful Deep Word Bug attacks, and the third column is the percentage of unsuccessful PWWS attacks. For all columns, higher is better ↑.
| $\beta$ | $\rho$ | Test ↑ | DWB ↑ | PWWS ↑ |
|---|---|---|---|---|
| VIB models | | | | |
| $10^{-4}$ | - | 92.1% ± 1.1% | 67.0% ± 3.2% | 60.8% ± 1.4% |
| $5 \cdot 10^{-4}$ | - | 92.2% ± 0.07% | 68.2% ± 3.0% | 64.3% ± 1.3% |
| $10^{-3}$ | - | 91.0% ± 1.0% | 64.9% ± 4.4% | 58.4% ± 6.6% |
| $5 \cdot 10^{-3}$ | - | 92.2% ± 0.07% | 62.9% ± 3.9% | 48.3% ± 7.5% |
| $10^{-2}$ | - | 90.8% ± 0.5% | 59.0% ± 4.8% | 37.1% ± 14.3% |
| $5 \cdot 10^{-2}$ | - | 92.4% ± 0.1% | 14.4% ± 5.5% | 1.0% ± 0.3% |
| $10^{-1}$ | - | 89.4% ± 0.9% | 10.0% ± 8.0% | 0.9% ± 0.9% |
| CEB models | | | | |
| - | 0.1 | 92.7% ± 0.04% | 46.7% ± 0.68% | 1.65% ± 0.27% |
| - | 1 | 92.7% ± 0% | 43.2% ± 1.45% | 1.53% ± 0.8% |
| - | 2 | 92.5% ± 0% | 40.8% ± 0.72% | 0% ± 0% |
| - | 3 | 92.3% ± 0% | 36.2% ± 0% | 0% ± 0% |
| - | 4 | 92.1% ± 0% | 38.8% ± 0% | 0% ± 0% |
| - | 5 | 92.2% ± 0% | 39.6% ± 0% | 1.0% ± 0% |
| - | 6 | 92.1% ± 0% | 41.9% ± 0% | 0% ± 0% |
| - | 7 | 92.2% ± 0% | 41.9% ± 0% | 2.15% ± 0% |
| - | 8 | 92.2% ± 0% | 45.9% ± 0% | 0% ± 0% |

References

  1. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep Variational Information Bottleneck. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–27 April 2017.
  2. Tishby, N.; Pereira, F.C.; Bialek, W. The Information Bottleneck Method. In Proceedings of the 37th annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 22–24 September 1999.
  3. Slonim, N. The Information Bottleneck: Theory and Applications. Ph.D. Thesis, Hebrew University of Jerusalem, Jerusalem, Israel, 2002.
  4. Shannon, C.E. Coding Theorems for a Discrete Source with a Fidelity Criterion. In Proceedings of the IRE National Convention, San Francisco, CA, USA, 18–21 August 1959.
  5. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information Bottleneck for Gaussian Variables. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 8–13 December 2003.
  6. Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the Information Bottleneck Theory of Deep Learning. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
  7. Amjad, R.A.; Geiger, B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2225–2239.
  8. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, (ICLR), Banff, AB, Canada, 14–16 April 2014.
  9. Fischer, I. The Conditional Entropy Bottleneck. Entropy 2020, 22, 999.
  10. Kolchinsky, A.; Tracey, B.D.; Wolpert, D.H. Nonlinear Information Bottleneck. Entropy 2019, 21, 1181.
  11. Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2897–2905.
  12. Painsky, A.; Tishby, N. Gaussian Lower Bound for the Information Bottleneck Limit. J. Mach. Learn. Res. 2017, 18, 213:1–213:29.
  13. Geiger, B.C.; Fischer, I.S. A Comparison of Variational Bounds for the Information Bottleneck Functional. Entropy 2020, 22, 1229.
  14. Piran, Z.; Shwartz-Ziv, R.; Tishby, N. The Dual Information Bottleneck. arXiv 2020, arXiv:2006.04641.
  15. Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–27 April 2017.
  16. Cheng, P.; Hao, W.; Dai, S.; Liu, J.; Gan, Z.; Carin, L. CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; Volume 119, pp. 1779–1788.
  17. Kolchinsky, A.; Tracey, B.D. Estimating Mixture Entropy with Pairwise Distances. Entropy 2017, 19, 361.
  18. Czyż, P.; Grabowski, F.; Vogt, J.; Beerenwinkel, N.; Marx, A. Beyond Normal: On the Evaluation of Mutual Information Estimators. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), New Orleans, LA, USA, 10–16 December 2023; Volume 36, pp. 16957–16990.
  19. Yu, X.; Yu, S.; Príncipe, J.C. Deep Deterministic Information Bottleneck with Matrix-Based Entropy Functional. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3160–3164.
  20. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999; p. 22.
  21. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
  22. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747.
  23. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009.
  24. Fischer, I.; Alemi, A.A. CEB Improves Model Robustness. Entropy 2020, 22, 1081.
  25. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  26. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
  27. Pereyra, G.; Tucker, G.; Chorowski, J.; Kaiser, L.; Hinton, G.E. Regularizing Neural Networks by Penalizing Confident Output Distributions. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–27 April 2017.
  28. Alemi, A.; Poole, B.; Fischer, I.; Dillon, J.; Saurous, R.A.; Murphy, K. Fixing a Broken ELBO. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 159–168.
  29. Barber, D.; Agakov, F.V. The IM algorithm: A variational approach to Information Maximization. In Proceedings of the Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 8–13 December 2003.
  30. Maas, A.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 142–150.
  31. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
  32. Carlini, N.; Wagner, D.A. Towards Evaluating the Robustness of Neural Networks. In Proceedings of the IEEE Symposium on Security and Privacy, San Jose, CA, USA, 22–24 May 2017; pp. 39–57.
  33. Kaiwen. pytorch-cw2. GitHub Repository. 2018. Available online: https://github.com/kkew3/pytorch-cw2 (accessed on 16 April 2025).
  34. Gao, J.; Lanchantin, J.; Soffa, M.L.; Qi, Y. Black-box generation of adversarial text sequences to evade deep learning classifiers. In Proceedings of the IEEE Security and Privacy Workshops (SPW), San Jose, CA, USA, 24 May 2018; pp. 50–56.
  35. Morris, J.; Lifland, E.; Yoo, J.Y.; Grigsby, J.; Jin, D.; Qi, Y. TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Virtual Conference (Online), 16–20 November 2020; pp. 119–126.
  36. Ren, S.; Deng, Y.; He, K.; Che, W. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1085–1097.
  37. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 113–123.
  38. van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
  39. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–9 June 2019; pp. 4171–4186.
  40. Croce, F.; Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual (Online), 12–18 July 2020.
  41. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392.
  42. Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995; Chapter 1; pp. 17–22.
  43. Goldfeld, Z.; Polyanskiy, Y. The Information Bottleneck Problem and its Applications in Machine Learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38.
  44. Tishby, N.; Zaslavsky, N. Deep Learning and the Information Bottleneck Principle. arXiv 2015, arXiv:1503.02406.
  45. Shwartz-Ziv, R.; Tishby, N. Opening the Black Box of Deep Neural Networks via Information. arXiv 2017, arXiv:1703.00810.
Figure 1. Venn diagrams illustrating decoder overfitting. The left diagram depicts an overfitted decoder where $\tilde{Y}$ holds no information about $Y$, and $H(\tilde{Y}|Y) \gg H(\tilde{Y}|Z)$. The right diagram depicts a regularized decoder where $H(\tilde{Y}|Y)$ is not much greater than $H(\tilde{Y}|Z)$.
Figure 2. Performance comparison across models and metrics for IMDB and ImageNet. Higher is better ↑ in all plots. Analyzing accuracy and robustness against adversarial attacks for deterministic MLE, SVIB, VIB, and VCEB models under varying β and ρ values, averaged over five runs with standard deviation. Left column features IMDB tasks, right column features ImageNet tasks. Upper row shows accuracy over test set, and bottom rows depict robustness under various adversarial attacks, presented as the rate of deflected attacks or as the average L 2 distance required for a successful CW attack. Results show that SVIB consistently attains higher test set accuracy and higher or on-par robustness in all attacks apart from the targeted CW attack. ρ values apply to VCEB models, while β values apply to SVIB and VIB models. SVIB results are presented for λ = 1 in IMDB and λ = 2 in ImageNet. For all experimental results please see Appendix C.1.
Figure 3. Successful untargeted FGS attack examples. Images are perturbations of previously successfully classified instances from the ImageNet validation set. Perturbation magnitude is determined by the parameter ϵ shown on the left—the higher, the more perturbed. Original and wrongly assigned labels are listed at the top of each image. Notice the deterioration of image quality as ϵ increases.
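To make the role of ϵ concrete, the following is a minimal NumPy sketch of a single untargeted fast gradient sign step. It is an illustration only, not the attack implementation used in the experiments: the toy logistic model, the helper names `fgs_perturb` and `logistic_loss_grad`, and the [0, 1] input range are all assumptions introduced here.

```python
import numpy as np

def fgs_perturb(x, grad, eps, lo=0.0, hi=1.0):
    """One untargeted fast-gradient-sign step (hypothetical helper).

    Moves every input coordinate by eps in the direction that increases
    the loss, then clips back to the assumed valid input range [lo, hi].
    """
    return np.clip(x + eps * np.sign(grad), lo, hi)

def logistic_loss_grad(x, w, b, y):
    """Gradient of binary cross-entropy w.r.t. the input x for a toy
    logistic model p = sigmoid(w.x + b), with label y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return (p - y) * w

# Toy input, weights, and label (assumed values, for illustration only).
x = np.array([0.2, 0.8, 0.5])
w = np.array([1.0, -2.0, 0.5])
b = 0.1
y = 1

grad = logistic_loss_grad(x, w, b, y)
x_adv = fgs_perturb(x, grad, eps=0.1)
```

Because only the sign of the gradient is used, no coordinate changes by more than ϵ, which is why larger ϵ values produce more visibly degraded images.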
Figure 4. Successful targeted CW attack examples. Images are perturbations of previously successfully classified instances from the ImageNet validation set. The target label is ‘Soccer ball’. The average L₂ distance required for a successful attack is shown on the left. The higher the required L₂ distance, the greater the visible change required to fool the model. Original and wrongly assigned labels are listed at the top of each image. Note the difference in noticeable change compared to the FGS perturbations presented in Figure 3.
Figure 5. ImageNet embeddings of the different models projected to 2D using the t-SNE algorithm [38]. Each plot shows 5000 datapoints from the first 500 ImageNet labels. The VIB and VCEB projections share similar traits of well-separated clusters, while the deterministic MLE projection shows some clustering that appears less well formed and less separated. The SVIB projection shows very little clustering and features the most dispersed distribution. The visualization suggests that the conditional entropy term in SVIB has negated the clustering effect of the ELBO loss and induced a more uniform representation.
Table 1. Example of a successful PWWS attack on a vanilla BERT model, fine-tuned over the IMDB dataset. The original label is ‘positive sentiment’. The substituted word, marked in italic font, changed the classification to ‘negative sentiment’. SVIB and VIB classifiers are far less susceptible to these perturbations, as shown in Figure 2.
Original Text: the acting, costumes, music, cinematography and sound are all astounding given the production’s austere locales.
Perturbed Text: the acting, costumes, music, cinematography and sound are all dumbfounding given the production’s austere locales.
Table 2. Example of a successful Deep Word Bug attack on a vanilla BERT model, fine-tuned over the IMDB dataset. The original label is ‘positive sentiment’. Perturbations, marked in italic font, change the classification to ‘negative sentiment’. SVIB and VIB classifiers are far less susceptible to these perturbations, as shown in Figure 2.
Original Text: great historical movie, will not allow a viewer to leave once you begin to watch. View is presented differently than displayed by most school books on this subject. My only fault for this movie is it was photographed in black and white; wished it had been in color wow !
Perturbed Text: gnreat historical movie, will not allow a viewer to leave once you begin to watch. View is presented differently than displayed by most school books on this sSbject. My only fault for this movie is it was photographed in black and white; wished it had been in color wow !
Weingarten, N.Z.; Yakhini, Z.; Butman, M.; Bustin, R. The Supervised Information Bottleneck. Entropy 2025, 27, 452. https://doi.org/10.3390/e27050452