Elastic Information Bottleneck

Ni, Yuyan; Lan, Yanyan; Liu, Ao; Ma, Zhiming

doi:10.3390/math10183352

Open AccessArticle

Elastic Information Bottleneck

by

Yuyan Ni

¹,

Yanyan Lan

^2,*,

Ao Liu

³ and

Zhiming Ma

¹

Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

²

Institute for AI Industry Research, Tsinghua University, Beijing 100084, China

³

School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China

^*

Author to whom correspondence should be addressed.

Mathematics 2022, 10(18), 3352; https://doi.org/10.3390/math10183352

Submission received: 21 July 2022 / Revised: 25 August 2022 / Accepted: 9 September 2022 / Published: 15 September 2022

(This article belongs to the Special Issue Artificial Intelligence and Scientific Computing: Mathematical Techniques and Applications)

Download

Browse Figures

Versions Notes

Abstract

:

Information bottleneck is an information-theoretic principle of representation learning that aims to learn a maximally compressed representation that preserves as much information about labels as possible. Under this principle, two different methods have been proposed, i.e., information bottleneck (IB) and deterministic information bottleneck (DIB), and have gained significant progress in explaining the representation mechanisms of deep learning algorithms. However, these theoretical and empirical successes are only valid with the assumption that training and test data are drawn from the same distribution, which is clearly not satisfied in many real-world applications. In this paper, we study their generalization abilities within a transfer learning scenario, where the target error could be decomposed into three components, i.e., source empirical error, source generalization gap (SG), and representation discrepancy (RD). Comparing IB and DIB on these terms, we prove that DIB’s SG bound is tighter than IB’s while DIB’s RD is larger than IB’s. Therefore, it is difficult to tell which one is better. To balance the trade-off between SG and the RD, we propose an elastic information bottleneck (EIB) to interpolate between the IB and DIB regularizers, which guarantees a Pareto frontier within the IB framework. Additionally, simulations and real data experiments show that EIB has the ability to achieve better domain adaptation results than IB and DIB, which validates the correctness of our theories.

Keywords:

information bottleneck; transfer learning; generalization bound

MSC:

68T07

1. Introduction

Representation learning has recently become a core problem in machine learning, especially with the development of deep learning methods. Different from other statistical representation learning approaches, the information bottleneck principle formalizes the extraction of relevant features about Y from X as an information-theoretic optimization problem:

{min}_{p (t | x)} L = {min}_{p (t | x)} f (X; T) - β g (Y; T)

, where

p (t | x)

amounts to the encoder of the input signal X, f stands for the compression of representation T with respect to input X, g stands for the preserved information of T with respect to output Y,

Y \leftrightarrow X \leftrightarrow T

forms a Markov chain, and

β

is the tradeoff parameter. The basic idea of the information bottleneck principle is to obtain the information that X provides about Y through a ‘bottleneck’ representation T. The Markov constraint requires that T is a (possibly stochastic) function of X and can only obtain information about Y through X.

Under this principle, various methods have been proposed, such as information bottleneck (IB) [1], conditional entropy bottleneck (CEB) [2], Gaussian IB [3], multivariate IB [4], distributed IB [5], squared IB [6], deterministic information bottleneck (DIB) [7], etc. Almost all previous methods use the mutual information

I (Y; T)

as the information preserving function g. As for the compression function f, there are two typical functions, which categorize these methods into two groups. The first group uses the mutual information

I (X; T)

, a common measure of representation cost in channel coding, as the compression function. Typical examples include IB, CEB, Gaussian IB, multivariate IB, squared IB, etc. Please note that CEB uses the conditional mutual information

I (X; T | Y)

as the compression function, however, it has been proven to be equivalent to mutual information. Similarly, squared IB uses the square of the mutual information

I^{2} (X; T)

as the compression function; still, we put it into the same category. Instead of the mutual information, the second group, including DIB, uses entropy

H (T)

as the compression function, which is another common measure to quantify the representation cost as in source coding. The reason for this is that entropy is directly related to the quantity to be constrained such as the number of clusters, which is expected to achieve better compression results.

IB has been extensively studied over recent years in theory and application. In theory, IB has been proven to be able to enhance generalization [8] and adversarial robustness [9] by providing a generalization bound and an adversarial robustness bound. In application, IB has been successfully applied in evaluating the representations learned by deep neural networks (DNN) [10,11,12] and completing various tasks such as geometric clustering by iteratively solving a set of self-consistent equations to obtain the optimal solution of IB optimization problem [13,14], classification [2,15], and generation [16] by serving as the loss function of DNN via variational methods. Recent research also shows that if we merge the loss function of IB with other models, such as BERT and Graph Neural Network, better generalization and adversarial robustness results will be obtained [9,17]. Moreover, IB inspires new methods that improve generalization [18,19,20]. Likewise, DIB has been applied in geometric clustering [21] and has the potential to be used in similar applications to IB. More wide-ranging work on IB in deep learning and communication is comprehensively summarized in the surveys found in [22,23,24].

Still, the theoretical results are only valid when the training and test data are drawn from the same distribution, but this is rarely the case in practice. Therefore, it is unclear, especially theoretically, whether IB and DIB are able to learn a representation from a source domain that performs well on the target domain. Moreover, it is worth studying which objective function is better. This is exactly the motivation of this paper. To this end, we formulate the problem as transfer learning, and the target domain test error could be decomposed into three parts: the source domain training error, the source domain generalization gap (SG), and the representation discrepancy (RD), based on the transfer learning theory [25]. Without loss of generality, we assume that both IB and DIB have the ability to achieve small source domain training errors, so our goals become calculating SG and RD and comparing the two methods on these terms.

For SG, Shamir et al. [8] have provided an upper bound related to

I (X; T)

, indicating that minimizing

I (X; T)

leads to better generalization. However, this theory is not applicable for comparing IB and DIB because DIB minimizes

H (T) = I (X; T) + H (T | X)

, and it is unclear whether further minimizing

H (T | X)

may bring some advantages. Therefore, we need to derive a new bound. The difficulty lies in that the new bound needs to not only include both

I (X, T)

and

H (T | X)

for convenient comparison but also be tighter than the previous one. Since the previous bound in Shamir et al. [8] was represented as a

ϕ

function of the variance, we tackle this problem by introducing a different analysis of these two factors. Specifically for the variance, different from relating the variance to the

L_{1}

distance; then, to KL-divergence; and lastly, to mutual information in the previous proof, we bound the variances by functions of expectations and successfully relate them to entropy

H (T)

. Furthermore, we prove a tighter bound for the

ϕ

function. Consequently, we prove a tighter generalization bound, suggesting that minimizing

H (T)

is better than

I (X; T)

. Therefore, our results indicate that DIB may generalize better than IB in the source domain.

As for RD, it is measured by the

H Δ H

-distance, as in Ben-David et al. [26]. However, this term is difficult to compute because

H

is the hypothesis space of classifiers and is diverse for different models. Inspired by the fact that IB and DIB solutions are mainly different with the variance in the representations, we assume that data are generated with a Gaussian distribution. Therefore, we define a pair-wise

L_{1}

distance to bound the

H Δ H

-distance and relate RD with the variance in representations. Specifically, IB’s representations have larger randomness and thus have smaller RDs. Moreover, the closer the two domains are, the greater the difference between IB and DIB on RD is.

From the above theoretical findings, we conclude that there exists a better objective function under the IB principle. However, how to obtain the optimal objective function remains a challenging problem. Inspired by the trade-off between SG and RD, we propose an elastic information bottleneck (EIB) to interpolate between IB and DIB, i.e.,

min L_{E I B} = (1 - α) H (T) + α I (X; T) - β I (Y; T)

. We can see that EIB includes IB and DIB as special cases. In addition, we provide a variational method to optimize EIB by DNN. We conduct both simulations and real data experiments in this paper. Our results show that EIB is more flexible to different kinds of data and achieves better accuracy than IB and DIB on classification tasks. We also provide an example of combining EIB with a previous domain adaptation algorithm by substituting the cross entropy loss with the EIB objective function, which suggests a promising application of EIB. Our contributions are summarized as follows:

We derive a transfer learning theory for IB frameworks and find a trade-off between SG and RD. Consequently, We propose a novel representation learning method EIB for better transfer learning results under the IB principle, which is flexible to suit different kinds of data and can be merged into domain adaptation algorithms as a regularizer.
In the study of SG, we provide a tighter SG upper bound, which serves as a theoretical guarantee for DIB and further develop the generalization theory in the IB framework.
Comprehensive simulations and experiments validate our theoretic results and demonstrate that EIB outperforms IB and DIB in many cases.

2. Problem Formulation

In domain adaptation, a scenario of transductive transfer learning, we have labeled training data from a source domain, and we wish to learn an algorithm on the training data that performs well on a target domain [27]. The learning task is the same on the two domains, but the population distribution of the source domain and the target domain are different. Specifically, the source and the target instances come from the same instance space

X

, and the same instance corresponds to the same label, even if they lie in different domains. However, the population instance distribution on the source and the target domain are different, denoted as

p (X)

and

q (X)

. (We use “instance” (X) to distinguish from “label” (Y) and “example” (X,Y) and use “population distribution” (p or q) to distinguish from “empirical distribution” (

\hat{p}

or

\hat{q}

)). Assume that there exists a generic feature space so it can be utilized to transfer knowledge between different domains. That is to say, both the IB and DIB methods involve an encoder with a transition probability

p (t | x)

to convert instance

X

to an intermediate feature

T \in T

. The induced marginal distribution is then denoted as

p (t)

and

q (t)

for the source and target domains, respectively. The IB and DIB methods also have a decoder or classifier

h \in H

to map from the feature space

T

to the label space

Y

.

p (y | x)

denotes the ground-truth labeling function and

p (y | t) = \sum_{x} p (y | x) \frac{p (t | x) p (x)}{\sum_{x^{'}} p (t | x^{'}) p (x^{'})}

is a labeling function induced by

p (y | x)

and

p (t | x)

. In classification scenarios, deterministic labels are commonly used. Therefore, we define the deterministic ground-truth-induced labeling function as

f : T \to Y, f (t) = a r g m a x_{y} p (y | t)

. If the maximum probability is not unique, i.e.,

{y_{1}, \dots, y_{n}} = a r g m a x_{y} p (y | t)

, randomly choose a

y_{i, i \in {1 \dots n}}

as the output. With the above definitions, the expected error on the source domain could be written as

ϵ_{S} (h) = E_{t \sim p (t)} I_{{f (t) \neq h (t)}}

, where

I_{{\cdot}}

is the indicator function. Similarly, the expected error on target domain is

ϵ_{T} (h) = E_{t \sim q (t)} I_{{f (t) \neq h (t)}}

. Then, our problem could be formulated as follows: When the IB and DIB methods are trained on the source domain, which method achieves a lower target domain expected error

ϵ_{T} (h)

?

According to previous work [26], the target domain error could be decomposed to three parts, as shown in the following theorem. A detailed proof is provided in Appendix A.1.

Theorem 1

(Target Error Decomposition). Suppose that

h^{*}

is the classifier in

H

, which minimizes the sum of the expected error in two domains, i.e.,

h^{*} = a r g m i n_{h \in H} ϵ_{S} (h) + ϵ_{T} (h)

. Let us denote

ϵ_{S} (h^{*}) + ϵ_{T} (h^{*}) ≜ λ_{S} + λ_{T} = λ

. Then, for a classifier h, we have

ϵ_{T} (h) \leq {\hat{ϵ}}_{S} (h) + δ_{S} (h) + d_{H Δ H} (p (t), q (t)) + λ,

(1)

where

\hat{ϵ}

is the empirical error;

δ_{S} (h)

is the source generalization error gap, i.e.,

δ_{S} (h) = | {\hat{ϵ}}_{S} (h) - ϵ_{S} (h) |

; and

d_{H Δ H} (p (t), q (t))

is the RD defined by the

A

-distance, i.e.,

d_{H Δ H} (p (t), q (t)) = {sup}_{h \in H} | E_{t \sim p (t)} I_{{h^{*} (t) \neq h (t)}} - E_{t \sim q (t)} I_{{h^{*} (t) \neq h (t)}} |

.

Please note that we assume that the ground-truth labeling function on the two domains remains the same, which is common, as used in many previous studies. If there is a conditional shift, i.e., a distance between

f_{S}

and

f_{T}

, where

f_{S}

and

f_{T}

are the deterministic ground-truth-induced labeling function on the source and target domains, respectively, as defined in Zhao et al. [28], the above decomposition is not valid any more. The assumption is reasonable since the conditional shift phenomenon is not observed in our experiments.

According to the above theorem, we need to minimize the following three terms to achieve a low expected error, i.e., the training error on source domain, and SG and RD between the marginal distributions on the two domains (RD). Assume that both the IB and DIB methods have the ability to achieve comparable small source training errors, so we focus on SG and RD to compare the two methods.

3. Main Results

3.1. Source Generalization Study of IB and DIB

In this subsection, we study the generalization of IB and DIB on the source domain. The generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit the training data. The generalization error gap in statistical learning theory is defined as the expected difference between the population risk and the empirical risk. Russo and Zou [29], Xu and Raginsky [30] provided a generalization upper bound with mutual information of the input and output. Sefidgaran et al. [31], Sefidgaran et al. [32] further related the mutual information with two other studies of generalization error, i.e., compressibility and fractal dimensions, providing a unifying framework for the three directions of studies.

In the theoretic study of generalization and adversarial robustness of IB, however, the measure is not exactly the same for the convenience of illustrating the characteristic of IB [8,9]. Specifically, in Section 4.1 in Shamir et al. [8], the classification error is proven to be exponentially upper bounded by

- I (Y; T)

, indicating that

I (Y; T)

is a measure of performance. Therefore, the difference between the population performance

I (Y; T)

and empirical performance

\hat{I} (Y; T)

is used to measure the ability of generalization. Shamir et al. [8] provided the following generalization upper bound, which is relevant to the mutual information

I (X; T)

.

Theorem 2

(The previous generalization bound). Denote the empirical estimates by

\hat{\cdot}

. For any probability distribution

p (x, y)

, with a probability of at least

1 - δ

over the draw of the sample of size m from p(x,y), we have that for all T,

\begin{matrix} | I (Y; T) - \hat{I} (Y; T) | \leq \\ \sqrt{\frac{B_{0} log (| Y | / δ)}{m}} (B_{1} log (m) \sqrt{| T | I (X; T)} + B_{2} {| T |}^{3 / 4} {(I (X; T))}^{1 / 4} + B_{3} \hat{I} (X; T)) \end{matrix}

(2)

where

B_{0}

is a constant and

B_{1}, B_{2}, B_{3}

only depend on m, δ,

| Y |

,

{min}_{x} p (x)

, and

{min}_{y} p (y)

, where min refers to the minimum value other than 0.

However, this upper bound cannot be directly used to compare IB and DIB because both IB and DIB minimizes

I (X; T)

. Moreover, the regularizer of DIB:

H (T) = I (X; T) + H (T | X)

further minimizes the term

H (T | X)

. However, it is not clear whether it may bring any advantage from the previous theoretical result. To tackle this problem, we prove a new generalization bound in this paper, as shown in the following theorem.

Lemma 1.

If

X \in [0, 1]

,

E X = μ

, then

V a r (X) \leq (1 - μ) μ

.

Lemma 2.

If

\sum x = 1

,

\forall 0 \leq x \leq 1

, then

\sum_{x}

- \sqrt{x (1 - x)}

l n (\sqrt{x (1 - x)}) \leq \sum_{x} - \sqrt{x} l n x .

Theorem 3

(Our generalization bound). For any probability distribution

p (x, y)

, with a probability of at least

1 - δ

over the draw of the sample of size m from p(x, y), we have that for all T,

\begin{matrix} | I (Y; T) - \hat{I} (Y; T) | \leq \\ \frac{1}{\sqrt{m}} ((C_{1} + C_{3}) \sqrt{| T | - 1} + C_{2} H (T) + C_{4} H (T | Y) + C_{5} \sqrt{(log | T | - \hat{H} (T | Y)) \hat{H} (T | Y)}) \end{matrix}

(3)

where

C_{1}, C_{2}, C_{3}, C_{4}, C_{5}

only depend on δ,

| Y |

,

{min}_{x, y} p (x | y)

,

{min}_{t, y} p (t | y)

,

{min}_{x} p (x)

, and

{min}_{t} p (t)

.

Proof.

Here, we show a proof sketch to help us understand how our bound is different from the previous one. The complete proof is in the appendix. Similarly to the previous proof, SG is firstly divided into three parts.

\begin{matrix} | I (Y; T) - \hat{I} (Y; T) | \leq \\ | H (T) - \hat{H} (T) | + | \sum_{y} p (y) (H (T | y) - \hat{H} (T | y)) | + | \sum_{y} (p (y) - \hat{p} (y)) \hat{H} (T | y) | \end{matrix}

(4)

Denote

Δ_{1} = | H (T) - \hat{H} (T) |

,

Δ_{2} = | \sum_{y} p (y) (H (T | y) - \hat{H} (T | y)) |

,

Δ_{3} = | \sum_{y} (p (y) - \hat{p} (y)) \hat{H} (T | y) |

. Then, we can summarize the previous proof and our proof in Figure 1.

The main idea of the previous proof is to bound these parts by

ϕ

functions of sample variances in (5) and (9), where

V_{x} (p (t | x)) ≜ \frac{1}{| X |} \sum_{x} {(p (t | x) - \frac{1}{| X |} \sum_{x} p (t | x))}^{2}

, and

ϕ (x)

is defined by (A1). The variances are then related to the

L_{1}

distances in (6) and (10); KL-divergence in (7); and lastly, mutual information in (8) and (11).

We also use the idea of bounding by

ϕ

functions of variances, as shown in (12), (15), and (18). However, the variance is an exact one, instead of the previous estimation on the samples, i.e.,

V a r_{X} (p (t | x)) ≜ E_{p (x)} {[p (t | x) - E_{p (x)} [p (t | x)]]}^{2}

. Consequently, we can obtain a tighter bound. Specifically, we first use Lemma 1 to convert the variance to the expectation in (13), (16), and (19). Then, with the help of Lemma 2, we turn the

ϕ

functions into entropy, like from (13) to (14). With (14), (17), and (20), we finish the proof of Theorem 3. □

We can see that this bound is similar to the previous one, but the new bound is related to the regularization term

H (T)

,

H (T | Y)

and

\hat{H} (T | Y)

, instead of the mutual information

I (X; T)

. Note that

H (T | Y)

and

\hat{H} (T | Y)

are closely related by

| H (T | Y) - \hat{H} (T | Y) | \leq (17) + (20)

. Additionally,

H (T)

is equivalent to

H (T | Y)

as a regularizer because

L_{D I B} = H (T) - β I (Y; T) = H (T | Y) - (β - 1) I (Y; T)

. With these two observations above, this bound is directly dependent on the entropy

H (T)

, i.e., the regularizer of DIB. That is to say, compressing the entropy of the representation is beneficial to reducing SG.

Now we will illustrate why DIB generalizes better. Intuitively, since DIB directly minimizes

H (T)

while IB only minimizes a part of them, i.e.,

I (X; T)

, DIB may have a better generalization than IB. Owing to the lack of close form solutions, we cannot explicitly see the difference between IB and DIB with respect to the concerning information quantities. However, we can solve the IB and DIB problems through a self-consistent iterative algorithm, with

α = 0

or 1 in Appendix B.1. The IB and DIB solutions are shown on the information plain in Appendix C with

α = 0

or 1, showing that the IB solutions have larger

H (T)

than DIB solutions, and thus, DIB generalizes better in the sense of Theorem 3 in our paper.

Furthermore, our bound is tighter than the previous one. We provide a comparison of the order in Appendix A.2, which shows that the IB bound is

\sqrt{| X |}

times larger than the DIB bound for the first two terms and

\sqrt{| Y |}

times larger for the third term. There is also an empirical comparison in Section 4.1.2. Moreover, Theorems 2 and 3 require some constraints for sample size. They are summarized in the Appendix A.2 too. The empirical results show that our bound requires a smaller sample size.

To sum up, we provide a tighter generalization bound in the IB framework, which serves as a theoretical support for DIB’s generalization performance. Experiments on MNIST will validate this theoretical result in Section 4.2.1.

3.2. Representation Discrepancy of IB and DIB

In this section, we compare the RD of IB and DIB. According to the target error decomposition theorem in Section 1, RD is measured by

H Δ H

-distance, which is however difficult for direct computation because of the complex hypothesis space

H

. To remove the dependence of

H

, we propose to bound RD by a pair-wise

L_{1}

distance.

To start with the simplest case, assume that the sample sizes on the source and target domains are both m, and the samples on two domains have a one-to-one correspondence (i.e., semantically close to each other). Then, RD can be bounded by the following pair-wise

L_{1}

distance.

Proposition 1.

The distance of overall representations on the source and target domains is bounded by the distance of individual instance representations.

\begin{matrix} d_{H Δ H} (p (t), q (t)) \leq & \frac{1}{m} \sum_{(x_{S}, x_{T}) \in CP} {∥ p (t | x_{S}) - p (t | x_{T}) ∥}_{1} + ϵ, \end{matrix}

(21)

where

CP

stands for the set of all correspondence pairs and

ϵ = 2 \sqrt{| T |} \frac{2 + \sqrt{2 log (1 / δ)}}{\sqrt{m}}

, which is small when the sample size is large.

In fact, Proposition 1 is valid for any

(x_{S}, x_{T})

pair, so there can be different upper bounds. To obtain the lowest upper bound, we define correspondent pairs such that

\sum_{(x_{S}, x_{T}) \in CP} {∥ p (t | x_{S}) - p (t | x_{T}) ∥}_{1}

is the lowest, i.e.,

CP = {{(x_{S, i}, x_{T, i})}_{i = 1 \dots m} = a r g min \sum_{i = 1}^{m} ∥ p (t | x_{S, i}) - p (t | x_{T, i}) ∥_{1} | {x_{S, i}}_{i = 1 \dots m} = X_{S}, {x_{T, i}}_{i = 1 \dots m} = X_{T}}

.

We parameterize

p (t | x)

to be a d-dimensional Gaussian with a diagonal covariance matrix, which is usually assumed in some variational methods such as Alemi et al. [15]. Compared with IB, DIB additionally reduces

H (T | X)

, so its

p (t | x)

are almost deterministic. Therefore, the variance in

p_{D I B} (t | x)

in each dimension are significantly smaller than that of

p_{I B} (t | x)

. On the other hand, since the only difference between IB and DIB is altering

H (T | X)

and the entropy of Gaussian random variable is only dependent on its variance, the expectations of

p_{I B} (t | x)

and

p_{D I B} (t | x)

are comparable. Therefore, we need to find how the variances affect RD.

Consider the representations of the instances from the source and target domains in the same model.

\forall (x_{S}, x_{T}) \in CP

, since

x_{S}

and

x_{T}

are semantically close, the expectations of their representations are close. Additionally, the discrepancy between their variances are small compared to the discrepancy in the representation variances between IB and DIB, so we can neglect them. Therefore, denote that

p (t | x_{S}) \sim N (μ_{1}, Σ)

,

p (t | x_{T}) \sim N (μ_{2}, Σ)

,

Σ = d i a g (σ_{1}^{2}, \dots, σ_{d}^{2})

; then,

∥ p (t | x_{S}) - p (t | x_{T}) ∥_{1} = Π_{i = 1}^{d} (4 Φ (\frac{| μ_{1 i} - μ_{2 i} |}{2 σ_{i}}) - 2)

(22)

where

Φ

is the cumulative distribution function of a standard Gaussian distribution.

As discussed above, compared with IB, DIB has significantly smaller variances in the representations and

μ_{1} - μ_{2}

are comparable for IB and DIB. Therefore, the term

\frac{| μ_{1 i} - μ_{2 i} |}{2 σ_{i}}

for IB is remarkably smaller than DIB. With

Φ

monotonically increasing,

∥ p (t | x_{S}) - p (t | x_{T}) ∥

for IB is smaller than that for DIB. Figure 2 provides an intuitive understanding about how randomness helps reduce RD. Suppose that the blue and red lines are

p (t | x_{S})

and

p (t | x_{T})

, where

(x_{S}, x_{T}) \in CP

. We can see clearly that their

L_{1}

distance drops with the growth in variances. Moreover, because the derivative of

ϕ

monotonically decreases on

[0, + \infty]

, when

μ_{2} - μ_{1}

is smaller, the difference between IB and DIB will be larger. This phenomenon is also found in simulations; see Section 4.1.1 and Section 4.1.3.

When the sample size on the two domains are different and the “pair-wise” correspondence does not hold, we can take the correlated instances in the two domains as a general form of correspondent pairs; then, the above comparison result is also valid. Details are found in Appendix A.3.

Please note that the assumption of correspondent pairs is reasonable. In transfer learning, it is widely assumed that there exists shared (high level) features and common distributions of representation on the two domains. The correspondent pairs can be viewed as the instance pairs with the closest distributions of representations from the two domains.When the distributions are distinct, feature alignment is usually implemented in practice.

According to the above theoretical results, we can obtain the following comparisons between IB and DIB, with respect to the three terms for the target error decomposition, as illustrated in Table 1. Clearly, there is a trade-off between IB and DIB, in terms of SG and RD.

3.3. Elastic Information Bottleneck Method

From the previous results, IB and DIB still have room for improvement, respectively, on SG and RD. To obtain a Pareto optimal solution under the IB principle, we propose a new bottleneck method as follows, namely EIB. Please note that the generalized IB objective function in [7] is the same as the objective function of elastic IB, but they are derived from different perspectives. The generalized IB is constructed for the convenience of solving the DIB problem, while EIB is proposed to balance SG and RD.

Definition 1

(Elastic Information bottleneck). The objective function of the elastic information bottleneck method is as follows:

\begin{matrix} min L_{E I B} = (1 - α) H (T) & + α I (X; T) - β I (Y; T), 0 \leq α \leq 1, β > 1 \end{matrix}

(23)

We can see that EIB is a linear combination of IB and DIB, which covers IB and DIB as special cases. Specifically, EIB reduces to IB when

α = 1

and DIB when

α = 0

. Since IB and DIB perform better than the other one on SG or RD, respectively, a linear combination may lead to better target performance. In fact, the global optimal solution of the linear combination of the objectives is in the Pareto optimal solution set. Therefore, by adjusting

α

in

[0, 1]

, we can obtain a balance between SG and RD and achieve a Pareto frontier within the IB framework.

As a bottleneck method similar to IB, its optimal solution can be calculated by iterative algorithms when the joint distribution of instances and labels are known. The algorithm and EIB’s optimal solutions on information planes are provided in Appendix B. However, the iterative methods are intractable when the data are numerous and high-dimensional. Therefore, we use the variational approaches similar to the method in Fischer [2] to optimize IB, DIB, and EIB objective functions by neural networks. Assume the representation is a k-dimension Gaussian distribution and all dimensions are independent. The network contains an encoder

p (t | x)

, a decoder

q (y | t)

, and a backward encoder

b (t | y)

. To be specific, the encoder is an MLP which outputs the K-dimensional expectations

μ_{1}

and K-dimensional diagonal elements of covariance matrix

σ_{1}

and then yield the Gaussian representation with the reparameterization trick [15]. The decoder is a simple logistic regression model. The backward encoder is an MLP that uses the onehot encoding of the classify outcome as an input and the K-dimensional expectations

μ_{2}

and K-dimensional diagonal elements of covariance matrix

σ_{2}

as an output. The loss function of variational EIB is as follows:

\begin{matrix} L o s s_{E I B} = \int p (t | x) p (x, y) (log \frac{p^{α} (t | x)}{b (t | y)} - \tilde{β} log q (y | t)) d x d t d y, 0 \leq α \leq 1, β > 0 . \\ L o s s_{E I B} \approx \frac{1}{m} \sum_{n = 1}^{m} [\sum_{j = 1}^{k} [\frac{(1 - α)}{2} l n (2 π) - α l n σ_{1, j, n} + l n σ_{2, j, n} - \frac{α}{2} \\ + \frac{μ_{1, j, n}^{2} + μ_{2, j, n}^{2} - 2 μ_{1, j, n} μ_{2, j, n} + σ_{1, j, n}^{2}}{2 σ_{2, j, n}^{2}}] + \tilde{β} C E [s o f t m a x (y_{n}) | | q (y_{n} | t_{n})]], \end{matrix}

(24)

where

C E

is the cross entropy loss.

Our simulations and real data experiments show that EIB outperforms IB and DIB. It is also worth noticing that EIB can be plugged into previous transfer learning algorithms to expand its application. An example is given in the next section.

4. Experiments

In this section, we evaluate IB, DIB, and EIB by both toy data simulations and real data experiments.

4.1. Toy Data Simulations

We conduct simulations to study the performance of EIB in the transfer learning scenario, a comparison of RD between IB and DIB, and the comparison between our generalization bound and the previous one.

4.1.1. Performance of EIB in Transfer Learning

We design a toy binary classification problem, where instance

X \in {0, 1}^{10}

and label

Y \in {0, 1}

. Define

X_{0}

= [1,1,1,1,1,0,0,0,0,0],

X_{1}

= [0,0,0,0,0,1,1,1,1,1],

Y_{0} = 0

, and

Y_{1} = 1

. First, in the source and target datasets, half of the examples is

(X_{0}, Y_{0})

and the other half is

(X_{1}, Y_{1})

. Then we introduce random noise into the instances in the following way. We define the reverse digit operation as first choosing one digit in the instance and then reversing the digit (changing from 0 to 1 or the opposite direction). For each instance, we perform the reverse digit operation

⌊ N ⌋

times (round down function), where N is a real-valued uniform random variable

N \sim U [0, R]

. R is named as the noise level because as R increases, more instances are very different from

X_{0}

and

X_{1}

in the dataset. As a result, we can adjust the similarity between two domains by modifying the parameter R. Lastly, we set the

R \in (1, 3)

for the source domain and

R = 3

for the target domain so that we obtain the toy data in a transfer learning scenario.

We test our EIB model on the toy data. The parameter

α

ranges in [0,1], indicating different EIB models. The other parameter

β

is chosen to be

10^{4}

for

R = 2, 1.5, 1.433

and

β = 5 \times 10^{3}

for

R = 1.375, 1.25

in order to discriminate the model performance and to obtain high accuracy. The results are shown in Table 2. From the results, we can see that EIB outperforms IB and DIB when

R = 1.5, 1.433, 1.375, 1.25

. To be more specific, when the two domains become more similar, i.e., the noise level R of the source domain becomes closer to

R = 3

of the target domain,

α

of the best model changes from 0 to 1, i.e., IB gradually becomes more advantageous. This is in accordance with our theory that when the two domains are similar, the effect of RD is apparent and IB has more advantages in transfer learning. This result also provides an empirical rule to tune the parameter

α

.

4.1.2. Our Generalization Bound vs. the Previous One

In this simulation, we compare the DIB generalization bound (ours) and the IB generalization bound [8] in terms of number size and required sample size.

Assume that the data, representations, and labels are discrete, i.e.,

| X |

= 3,

| T |

= 2,

| Y |

= 2,

δ = 0.1

. The sample size increases exponentially from

10^{1}

to

10^{6}

. The bound and error rate are computed as follows. First, we generate distributions

p (x, y)

,

p (t | x)

and sample an empirical distribution

\hat{p} (x, y)

. To compare the bounds in a general case,

p (x, y)

is randomly valued by numbers generated from a uniform or a normal distribution and the value of

p (t | x)

is also randomly generated from a uniform distribution or a normal distribution.

\hat{p} (x, y)

is the empirical probability of

p (x, y)

. Second, we determine whether the constraints for the bound in Appendix A.3 are met. If so, we calculate the value of the bound. If not, we record the number of times the constraints are not satisfied. Third, we repeat this process for 100 times to obtain the rate of constraint violations (error rate) and two average bound values. Finally, we repeat the previous three steps under different sample sizes.

Figure 3 (left) shows that our bound consistently needs less samples than the previous one to satisfy the constraints. Moreover, the more uniform the distribution is, the less samples are needed to satisfy the constraints, which is obvious from the formula of constraints. Figure 3 (right) shows that the generalization bound decreases as the sample size increases, and the DIB generalization bound is always smaller than the IB generalization bound.

4.1.3. Comparison of RD between IB and DIB

To support the claims about RD, we approximate RD by using a classifier to predict which domain the representation sample t comes from, which is proven to be an adequate approximation of RD in [26].

The data generation, model, and parameters (R,

β

) are consistent with the simulation in Section 4.1.1. The data are generated two times under different noise levels R and are trained by EIB models with

α

= 0 or 1. Then, five samples are drawn from the Gaussian representations

p (t | x)

for each instance x. If x comes from the source domain, we label the samples as positive; otherwise, we label it as negative. Then, we train a linear classifier to classify the samples. Each classifier is trained 20 times. The average error rate is shown in Table 3. The standard error of the mean (SEM) is in parentheses.

The results show that IB has a larger error rate than DIB in most cases, indicating that IB has a smaller RD, which is consistent with the main result in Section 3.2. Furthermore, the gap between IB’s and DIB’s error rates becomes larger with the growth of R, which validates that when two domains become more similar, IB’s advantage over RD becomes more significant, as is claimed in Section 3.2 and Section 4.1.1. This trend does not seem to be consistent between

R = 1.433

and

R = 1.375

because

β

, which is chosen to minimizes the error rate, is different.

4.2. Real Data Experiments

Similar to IB, EIB can also be utilized in many representation learning algorithms as a regularizer. As an example, we combine EIB with DFA-MCD in Wang et al. [33], which is an adversarial domain adaptation algorithm with feature alignment. We replace the features in DFA-MCD with Gaussian representations and add a backward encoder after the source classifier as in variational EIB. Since the feaures are already Gaussian, we remove the KLD regularizer in DFA-MCD, which is the KL-divergence penalty between source domain feature and a Gaussian prior. Then, we substitute cross entropy loss with EIB loss. The adversary training and feature alignment designs are retained.

We test the model on a common transfer learning task, where the source dataset is MNIST [34] and the target dataset is USPS [35]. We only use 2/55 of the training data, and

λ

= 15,

β = 10^{9}

, other parameters remain the same as in Wang et al. [33]. The experiment is repeated with six random seeds, and the results are shown in Figure 4 (left). For each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually using the ‘+’ symbol. It is automatically generated by the boxplot function in matlab. We can see that the model with EIB performs better than IB and DIB in most cases, e.g.,

α = 0.2, 0.3, 0.4, 0.6, 0.7, 0.8

, and some of them beats the DFA-MCD baseline model, which suggests that EIB works as an effective regularizer for transfer learning. Please note that this example is utilized to demonsrate EIB can be combined with domain adaptation algorithms and perform better than IB and DIB, so we simply inherit the structure and hyper-parameters of the original networks in Wang et al. [33]. Further parameter tuning can be conducted to achieve better experimental results.

4.2.1. Source Generalization Analysis

We use variational EIB on MNIST to compare SG of IB and DIB. The networks were trained for 100 epochs and converge at about 60 epochs. The average generalization gap is calculated by the mean discrepancy of the training error and the testing error of 25 epochs with the least testing error. We randomly initialize the network 14 times and utilize EIB models with

α = 0

(DIB),

α = 0.5

,

α = 1

(IB) under the same initialization. The results are shown in Figure 4 (right) and Table 4. Baseline is a model without a regularizer i.e.,

β = + \infty

in EIB. First, let us analyze the results in terms of

β

. When

β

is small, models over-compress the representations so that the error rate is large. When

β

is large, the weight of regularization term in objective function is small so the three models’ performance become close. When

β = 10^{5}

, the three models have the best accuracy. Then, in terms of

α

, with the decrease in

α

, the generalization gap becomes smaller. When

β = 10^{4}, 5 \times 10^{4}, 10^{5}

, the p value of the t-test on “DIB’s SG < IB’s SG” is less than 0.05, indicating that the source generalization error of DIB is statistically significantly smaller than that of IB. (The p value is a term in statistical test that is defined as the probability of obtaining a test statistic as extreme or more extreme than the test statistic of actual observations if the null hypothesis is true. Usually, if p value < 0.05, the null hypothesis should be rejected and the result is considered statistically significant) This validates that DIB generalizes better than IB.

5. Conclusions and Future Work

This work studies the two objective functions of the information bottleneck principle. The motivation comes from our theoretical analysis that neither IB nor DIB is the optimal solution in terms of the generalization ability under the transfer learning scenario. Specifically, we theoretically analyze SG and RD of IB and DIB, and find that there is a trade-off between them. To tackle this problem, we propose a new method, EIB, to interpolate IB and DIB. Consequently, EIB can not only achieve better transfer learning performances but also be plugged into existing domain adaptation methods as a regularizer to suit different kinds of data, which have been shown by our simulations and real data experiments.

We believe that our results take an important step towards understanding different information bottleneck methods and provide some insights into the design of stronger deep domain adaptation algorithms. We qualitatively suggest choosing

α

, but in practice, when the distance of the two domain is fixed, the optimum still needs careful tuning. Therefore, how to choose the best parameters

α

and

β

efficiently in EIB remain questions that we plan to study in the future.

Author Contributions

Methodology, Y.N. and Y.L.; Formal analysis, Y.N.; Software, A.L.; Writing—original draft preparation, Y.N.; Writing—review and editing, Y.L.; Supervision, Y.L. and Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by National Key R&D Program of China No. 2021YFF1201600, Vanke Special Fund for Public Health and Health Discipline Development, Tsinghua University (No. 2022-1080053), and Beijing Academy of Artificial Intelligence (BAAI).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code is available at https://github.com/nyyxxx/elastic-information-bottleneck, accessed on 20 July 2022.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Missing Proofs

Appendix A.1. Target Error Decomposition

Lemma A1.

The triangle inequality for classification error holds.

ϵ (f_{1}, f_{2}) \leq ϵ (f_{1}, f_{3}) + ϵ (f_{2}, f_{3})

, where

ϵ (f_{1}, f_{2}) = E_{z} I_{{f_{1} (z) \neq f_{2} (z)}}

.

Proof.

A ≜ {f_{1} \neq f_{2}}, B ≜ {f_{1} \neq f_{3}}, C ≜ {f_{2} \neq f_{3}}

. Obviously,

B^{c} \cap C^{c} \subseteq A^{c}

. This leads to the inverse proposition

B \cup C \supseteq A

. Then,

I_{B} + I_{C} \geq I_{B \cup C} \geq I_{A}

. Add expectation

E_{z} I_{B} + E_{z} I_{C} \geq E_{z} I_{A}

. With the definition of the classification error, the triangle inequality is proven. □

Proof of Theorem 1.

\begin{matrix} ϵ_{T} (h) & \leq ϵ_{T} (h^{*}, f_{T}) + ϵ_{T} (h, h^{*}) = λ_{T} + E_{z \sim q (t)} I_{{h^{*} (z) \neq h (z)}} \\ = λ_{T} + E_{z \sim p (t)} I_{{h^{*} (z) \neq h (z)}} + (E_{z \sim q (t)} I_{{h^{*} (z) \neq h (z)}} - E_{z \sim p (t)} I_{{h^{*} (z) \neq h (z)}}) \\ \leq λ_{T} + ϵ_{S} (h, h^{*}) + d_{H Δ H} (p (t), q (t)) \leq λ_{T} + λ_{S} + ϵ_{S} (h) + d_{H Δ H} (p (t), q (t)) \\ \leq λ + {\hat{ϵ}}_{S} (h) + δ_{S} (h) + d_{H Δ H} (p (t), q (t)) \end{matrix}

The first and the fifth inequalities use Lemma A1. □

Note: The original theorem in Ben David et al., 2006, is slightly different in that their encoder is deterministic and that the decoder is stochastic while ours are in an opposite situation. However, this do not substantially affect the result.

Appendix A.2. Generalization Bound

Note: The logarithms in this work are all based on e, i.e., the unit of entropy is nat.

Lemma A2

(Plug-in Estimation Error [8]). Let ρ be a distribution vector of arbitrary (possible countably infinite) cardinality, and let

\hat{ρ}

be an empirical estimation of ρ based on a sample of size m. Then, with a probability of at least

1 - δ

over the samples,

| | ρ - \hat{ρ} {| |}_{2} \leq \frac{2 + \sqrt{2 l o g (1 / δ)}}{\sqrt{m}}

We apply this bound simultaneously to totally

| Y | + 2

terms:

| | p (x) - \hat{p} {(x) | |}_{2}, | | p (y) - \hat{p} {(y) | |}_{2}, | | p (x | y) - \hat{p} (x | y) {| |}_{2}

. Therefore, with a probability of at least

1 - δ

over the samples, the

| Y | + 2

terms above are respectively bounded by

\frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}}

.

Lemma A3

(Shamir et al. [8]). A real-valued function ϕ is defined as follows

ϕ (x) = \{\begin{matrix} 0 & x = 0 \\ x l o g (1 / x) & 0 < x \leq 1 / e \\ 1 / e & x > 1 / e \end{matrix}

(A1)

ϕ is a continuous, monotonically increasing, concave function satisfying

\forall a, b \in [0, 1],

| a l o g (a) - b l o g (b) | \leq ϕ (| a - b |),

(A2)

Lemma A4.

If

X \in [a, b]

,

E X = μ

, then

V a r X \leq (b - μ) (μ - a)

.

Proof.

(1) If

X \in [0, 1], E X = μ

, then

V a r X = E X^{2} - {(E X)}^{2} \leq E X - {(E X)}^{2} = μ (1 - μ)

. Equality holds when

E X^{2} = E X

, i.e., X = 0 or 1.

(2) Generally, if

X \in [a, b]

, then

\frac{X - a}{b - a} \in [0, 1], E \frac{X - a}{b - a} = \frac{μ - a}{b - a}

. With the results in (1), we have

V a r \frac{X - a}{b - a} \leq \frac{μ - a}{b - a} (1 - \frac{μ - a}{b - a})

, i.e.,

V a r X \leq (b - μ) (μ - a)

. Equality holds when

\frac{X - a}{b - a}

= 0 or 1, i.e., X = a or b. □

Lemma A5.

If

x_{i} \in [0, 1], i = 1, 2, \dots n, \sum_{i}^{n} x_{i} = 1

, then

\sum_{i}^{n} (\sqrt{x_{i} (1 - x_{i})} l n (\sqrt{x_{i} (1 - x_{i})}) - \sqrt{x_{i}} l n x_{i}) \geq 0

Proof.

Let

f (x) = \sqrt{x (1 - x)} l n (\sqrt{x (1 - x)}) - \sqrt{x} l n x

. Let

x_{0} \in (0, 1)

be the zero point of

f (x)

. It is easy to verify that

x_{0} \geq 1 / 2

,

f (x)

is non-negative and concave on

[0, x_{0}]

.

\forall x_{1}, x_{2} \in [0, x_{0}]

, satisfying

x_{1} + x_{2} \in [0, x_{0}]

; by the property of concave function, we have

\frac{f (x_{1} + x_{2}) - f (x_{1})}{x_{2}} \leq \frac{f (x_{2})}{x_{2}}, f (x_{1} + x_{2}) \leq f (x_{1}) + f (x_{2})

.

Now, we prove if

x_{i} \in [0, 1], i = 1, 2, \dots n, \sum_{i}^{n} x_{i} = 1

, then

f (x_{1}) + \dots + f (x_{n}) \geq 0

(*).

(1) If

x_{i} \in [0, 1 / 2]

, since

f (x)

is non-negative on

[0, 1 / 2]

, (*) holds.

(2) If there exists

x_{i} > 1 / 2

, since

\sum_{i}^{n} x_{i} = 1

, there is at most one

x_{i}

that is larger than 1/2. We denote it by

x_{n}

. It is easy to verify that

f (x) + f (1 - x) = \sqrt{x} (\sqrt{1 - x} - 1) l n x + \sqrt{1 - x} (\sqrt{x} - 1) l n (1 - x) \geq 0

,

x \in [0, 1]

. Then,

f (x_{1}) + \dots + f (x_{n}) \geq f (x_{1} + x_{2}) + f (x_{3}) + \dots + f (x_{n}) \geq f (x_{1} + . . + x_{n - 1}) + f (x_{n}) = f (x_{n}) + f (1 - x_{n}) \geq 0,

i.e., (*) holds. □

Proof of Theorem 3.

\begin{matrix} | I (Y; T) - \hat{I} (Y; T) | \leq | H (T) - \hat{H} (T) | + | H (T | Y) - \hat{H} (T | Y) | \\ = | H (T) - \hat{H} (T) | + | \sum_{y} (p (y) H (T | y) - \hat{p} (y) \hat{H} (T | y)) | \\ \leq | H (T) - \hat{H} (T) | + | \sum_{y} p (y) (H (T | y) - \hat{H} (T | y)) | + | \sum_{y} (p (y) - \hat{p} (y)) \hat{H} (T | y) | \end{matrix}

(A3)

The process of these three parts has some similarities, and we use the first part as an example to illustrate where our proof differs from the original proof. For the first term of (A3),

\begin{matrix} | H (T) - \hat{H} (T) | = | \sum_{t} p (t) l o g (p (t)) - \hat{p} (t) l o g (\hat{p} (t)) | \end{matrix}

(A4)

\begin{matrix} \leq \sum_{t} ϕ (| p (t) - \hat{p} (t) |) = \sum_{t} ϕ (| \sum_{x} p (t | x) (p (x) - \hat{p} (x)) |) \end{matrix}

(A5)

\begin{matrix} = \sum_{t} ϕ (| \sum_{x} (p (t | x) - E_{X} [p (t | x)]) (p (x) - \hat{p} (x)) |) \end{matrix}

(A6)

\begin{matrix} = \sum_{t} ϕ (| E_{X} [(p (t | x) - E_{X} [p (t | x)]) (1 - \frac{\hat{p} (x)}{p (x)})] |) \end{matrix}

(A7)

\begin{matrix} \leq \sum_{t} ϕ (\sqrt{V a r_{X} (p (t | x)) E_{X} [{(1 - \frac{\hat{p} (x)}{p (x)})}^{2}]}) \end{matrix}

(A8)

\begin{matrix} \leq \sum_{t} ϕ (\sqrt{V a r_{X} (p (t | x)) \sum_{x} \frac{1}{m i n_{x} p (x)} {(p (x) - \hat{p} (x))}^{2}}) \end{matrix}

(A9)

\begin{matrix} \leq \sum_{t} ϕ (C \sqrt{V a r_{X} (p (t | x))}) \end{matrix}

(A10)

where

C = \sqrt{\frac{1}{m i n_{x} p (x)}} \frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}}

. () uses

\sum_{x} p (x) = \sum_{x} \hat{p} (x) = 1

. (A6) uses Cauchy–Schwarz inequality. (A8) uses Lemma A2. Suppose the number of sample m is large enough subject to

0 < C \leq 1, C \sqrt{(1 - p (t)) p (t)} \leq 1 / e

.

Our proof begins to deviate from the original proof in (A10), where the original proof add the sample mean and then derive the sample variation: (A5)

= \sum_{t} ϕ (| \sum_{x} (p (t | x) - \frac{1}{| X |} \sum_{x} p (t | x)) (p (x) - \hat{p} (x)) |) \leq \sum_{t} ϕ (\sqrt{| X | V_{x} (p (t | x)) \sum_{x} {(p (x) - \hat{p} (x))}^{2}})

. Since here, the two proofs take two totally different ways to process the variance, leading to two different bounds, we treat the variance with our lemmas, while the proof of the original bound uses triangle inequality and an inequality linking KL-divergence and the L1 norm. To grasp the detailed distinction between the two proofs, we recommend reading them. Now, we first continue our proof from (A10) and then show how Shamir et al. [8] processed the variance afterwards.

\begin{matrix} \sum_{t} ϕ (C \sqrt{V a r_{X} (p (t | x))}) \end{matrix}

(A11)

\begin{matrix} \leq \sum_{t} ϕ (C \sqrt{(1 - p (t)) p (t)}) \end{matrix}

(A12)

\begin{matrix} = - \sum_{t} (C \sqrt{(1 - p (t)) p (t)}) l o g (C \sqrt{(1 - p (t)) p (t)}) \end{matrix}

(A13)

\begin{matrix} = - C l o g C \sum_{t} \sqrt{(1 - p (t)) p (t)} \end{matrix}

(A14)

\begin{matrix} - C \sum_{t} \sqrt{(1 - p (t)) p (t)} l o g (\sqrt{(1 - p (t)) p (t)}) \end{matrix}

(A15)

\begin{matrix} \leq - \sqrt{(| T | - 1)} C l o g C - C \sum_{t} \sqrt{p (t)} l o g (p (t)) \end{matrix}

(A16)

\begin{matrix} \leq - \sqrt{(| T | - 1)} C l o g C - C \sqrt{\frac{1}{m i n_{t} p (t)}} \sum_{t} p (t) l o g (p (t)) \end{matrix}

(A17)

\begin{matrix} = C_{1} \frac{1}{\sqrt{m}} \sqrt{| T | - 1} + C_{2} \frac{1}{\sqrt{m}} H (T) \end{matrix}

(A18)

where

C_{1} = - \sqrt{m} C l o g C = \sqrt{\frac{1}{m i n_{x} p (x)}} (2 + \sqrt{2 l o g ((| Y | + 2) / δ)}) l o g (1 / \sqrt{\frac{1}{m i n_{x} p (x)}} \frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}})

,

C_{2} = \sqrt{m} C \sqrt{\frac{1}{m i n_{t} p (t)}} = \sqrt{\frac{1}{m i n_{x} p (x) m i n_{t} p (t)}} (2 + \sqrt{2 l o g ((| Y | + 2) / δ)})

.

If

p (t) = 0

, then

p (t) l o g p (t) = 0

. Therefore, there is no need to consider the terms where

p (t) = 0

in (A17). Therefore,

C_{1}

,

C_{2}

only depend on

m, δ, | Y |, m i n_{x} p (x), m i n_{t} p (t)

(here,

m i n

refers to the minimum value other than 0). Equation (A12) uses Lemma 1: Since

p (t | x) \in [0, 1], E_{X} [p (t | x)] = p (t)

, then, we have

V a r_{X} (p (t | x)) \leq (1 - p (t)) p (t)

. Equation (A13) uses the definition of

ϕ

in Lemma A3. Equation (A16) uses Lemma 2 and the Jensen inequality: since

\sqrt{x (1 - x)}

is concave, we have

\sum_{t} \sqrt{(1 - p (t)) p (t)} \leq | T | \sqrt{(1 - \frac{1}{| T |} \sum_{t} p (t)) \frac{1}{| T |} \sum_{t} p (t)} = \sqrt{| T | - 1}

.

Here is how Shamir et al. [8] processed the sample variance:

\begin{matrix} \sqrt{V_{x} (p (t | x)} = p (t) \sqrt{\sum_{x} {(\frac{p (x | t)}{p (x)} - \frac{1}{| X |} \sum_{x^{'}} \frac{p (x^{'} | t)}{p (x^{'})})}^{2}} \\ \leq p (t) [\sqrt{\sum_{x} {(\frac{p (x | t)}{p (x)} - 1)}^{2}} + \sqrt{\sum_{x} {(1 - \frac{1}{| X |} \sum_{x^{'}} \frac{p (x^{'} | t)}{p (x^{'})})}^{2}}] \\ \leq (1 + \frac{1}{\sqrt{| X |}}) | | \frac{p (x | t)}{p (x)} - {1 | |}_{1} \leq \frac{2}{m i n_{x} p (x)} | | p (x | t) - p (x) {| |}_{1} \\ \leq \frac{2 \sqrt{2 l o g 2}}{m i n_{x} p (x)} \sqrt{D_{K L} [p (x | t) | | p (x)]} \end{matrix}

For the second term of (A3),

\begin{matrix} | \sum_{y} p (y) (H (T | y) - \hat{H} (T | y)) | \end{matrix}

(A19)

\begin{matrix} \leq | \sum_{y} p (y) \sum_{t} (\hat{p} (t | y) log (\hat{p} (t | y)) - p (t | y) l o g (p (t | y))) | \end{matrix}

(A20)

\begin{matrix} \leq \sum_{y} p (y) \sum_{t} ϕ (| \hat{p} (t | y) - p (t | y) |) \end{matrix}

(A21)

\begin{matrix} = \sum_{y} p (y) \sum_{t} ϕ (| \sum_{x} p (t | x) (\hat{p} (x | y) - p (x | y)) |) \end{matrix}

(A22)

Now use the same technique as that used when processing

| H (T) - \hat{H} (T) |

.

\begin{matrix} \sum_{y} p (y) \sum_{t} ϕ (| \sum_{x} p (t | x) (\hat{p} (x | y) - p (x | y)) |) \end{matrix}

(A23)

\begin{matrix} = & \sum_{y} p (y) \sum_{t} ϕ (| \sum_{x} p (x | y) (p (t | x) - \sum_{x} p (t | x) p (x | y)) (\frac{\hat{p} (x | y)}{p (x | y)} - 1) |) \end{matrix}

(A24)

\begin{matrix} = & \sum_{y} p (y) \sum_{t} ϕ (\sqrt{\sum_{x} p (x | y) {(p (t | x) - p (t | y))}^{2}} \sqrt{\sum_{x} p (x | y) {(\frac{\hat{p} (x | y)}{p (x | y)} - 1)}^{2}}) \end{matrix}

(A25)

\begin{matrix} \leq & \sum_{y} p (y) \sum_{t} ϕ (\sqrt{V a r_{X | Y} (p (t | x)) \frac{1}{m i n_{x} p (x | y)} | | \hat{p} (x | y) - p (x | y) {| |}_{2}^{2}}) \end{matrix}

(A26)

\begin{matrix} \leq & \sum_{y} p (y) \sum_{t} ϕ (C \sqrt{p (t | y) (1 - p (t | y))}) \end{matrix}

(A27)

\begin{matrix} = & - \sum_{y} p (y) \sum_{t} C \sqrt{p (t | y) (1 - p (t | y))} l o g (C \sqrt{p (t | y) (1 - p (t | y))}) \\ = & - \sum_{y} p (y) \sum_{t} C l o g C \sqrt{p (t | y) (1 - p (t | y))} - \sum_{y} p (y) \sum_{t} C \sqrt{p (t | y) (1 - p (t | y))} \end{matrix}

(A28)

\begin{matrix} l o g (\sqrt{p (t | y) (1 - p (t | y))}) \end{matrix}

(A29)

\begin{matrix} \leq & - C l o g C \sqrt{| T | - 1} + C \sqrt{\frac{1}{m i n_{t, y} p (t | y)}} H (T | Y) \end{matrix}

(A30)

\begin{matrix} = & C_{3} \frac{1}{\sqrt{m}} \sqrt{| T | - 1} + C_{4} \frac{1}{\sqrt{m}} H (T | Y) \end{matrix}

(A31)

where

C = \sqrt{\frac{1}{m i n_{x, y} p (x | y)}} \frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}}

. Suppose that the number of sample m is large enough subject to

0 < C \leq 1

,

C \sqrt{p (t | y) (1 - p (t | y))} \leq 1 / e

.

C_{3} = - \sqrt{m} C l o g C = \sqrt{\frac{1}{m i n_{x, y} p (x | y)}}

(2 + \sqrt{\frac{2 l o g ((| Y | + 2)}{δ})}) l o g (1 / \sqrt{\frac{1}{m i n_{x, y} p (x | y)}} \frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}})

,

C_{4} = C \sqrt{\frac{1}{m i n_{t, y} p (t | y)}}

=

\sqrt{\frac{1}{m i n_{x, y} p (x | y) m i n_{t, y} p (t | y))}} (2 + \sqrt{2 l o g ((| Y | + 2) / δ)})

only depending on m,

δ

,

| Y |

,

m i n_{x, y} p (x | y)

,

m i n_{t, y} p (t | y)

(here,

m i n

refers to the minimum value other than 0).

With the same technique as above, the third term of (A3) can be bounded as follows:

\begin{matrix} | \sum_{y} (p (y) - \hat{p} (y)) \hat{H} (T | y) | \end{matrix}

(A32)

\begin{matrix} = | \sum_{y} \hat{p} (y) (\frac{p (y)}{\hat{p} (y)} - 1) (\hat{H} (T | y) - \sum_{y} \hat{p} (y) \hat{H} (T | y) | \end{matrix}

(A33)

\begin{matrix} \leq | | p (y) - \hat{p} (y) | | \sqrt{\frac{1}{m i n_{y} \hat{p} (y)} V a r_{\hat{Y}} (\hat{H} (T | y))} \end{matrix}

(A34)

\begin{matrix} = C_{5} \frac{1}{\sqrt{m}} \sqrt{V a r_{\hat{Y}} (\hat{H} (T | y))} \end{matrix}

(A35)

\begin{matrix} \leq C_{5} \frac{1}{\sqrt{m}} \sqrt{(l o g | T | - \hat{H} (T | Y)) \hat{H} (T | Y)} \end{matrix}

(A36)

where

C_{5} = \sqrt{\frac{1}{m i n_{y} \hat{p} (y)}} (2 + \sqrt{2 l o g ((| Y | + 2) / δ)})

.

From (A18), (A31), and (A36), we conclude Theorem 3:

\begin{matrix} | I (Y; T) - \hat{I} (Y; T) | \leq \\ \frac{1}{\sqrt{m}} ((C_{1} + C_{3}) \sqrt{| T | - 1} + C_{2} H (T) + C_{4} H (T | Y) + C_{5} \sqrt{(l o g | T | - \hat{H} (T | Y)) \hat{H} (T | Y)}) \end{matrix}

□

Here we provide some discussions of the results.

A. A comparison of the order of the previous bound and our bound.

\begin{matrix} B o u n d_{P r e v i o u s} \sim & - 4 \sqrt{2 l o g (2)} D | X | l o g (2 \sqrt{2 l o g (2)} D | X |) \sqrt{| T |} \sqrt{I (X; T)} + \\ 4 \sqrt{2 l o g (2)} {D | X | | T |}^{3 / 4} I {(X; T)}^{1 / 4} + 2 D | Y | I (X; T) \end{matrix}

\begin{matrix} B o u n d_{O u r s} \sim & - 2 D \sqrt{| X |} l o g (D \sqrt{| X |}) \sqrt{| T |} + D \sqrt{| X |} \sqrt{| T |} (H (T | Y) + H (T | Y)) + \\ D \sqrt{| Y |} \sqrt{(l o g | T | - H (T | Y)) H (T | Y)} \end{matrix}

Some explanations

$D = \frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}}$ is a constant.
The sample size m should be large enough such that $l o g (D \sqrt{| X |}) < 0$ and $l o g (2 \sqrt{2 l o g (2)} D | X |) < 0$ .
Since $| X |$ , $| T |$ , $| Y |$ can be very large in real-world data, the magnitude of the two bounds is mainly controlled by them.
$I (X; T)$ , $H (T | Y)$ , $H (T)$ are bounded by $l o g | T |$ . Their effects to the magnitude of the bounds are small. They can be viewed as the same order of $l o g | T |$ .

Finally, the conclusion is that the IB bound is

\sqrt{| X |}

times larger than the DIB bound for the first two terms and

\sqrt{| Y |}

times larger for the third term.

B. Constraints for sample size

0 < \sqrt{\frac{1}{m i n_{x} p (x)}} \frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}} \leq 1

(A37)

\sqrt{\frac{1}{m i n_{x} p (x)}} \frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}} \sqrt{(1 - p (t)) p (t)} \leq 1 / e

(A38)

0 < \sqrt{\frac{1}{m i n_{x} p (x | y)}} \frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}} \leq 1

(A39)

\sqrt{\frac{1}{m i n_{x} p (x | y)}} \frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{\sqrt{m}} \sqrt{p (t | y) (1 - p (t | y))} \leq 1 / e

(A40)

The IB bound in Shamir et al. [8] has similar constraints.

0 < \sqrt{\frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{m}} \frac{2 \sqrt{2 l o g (2)}}{m i n_{x} p (x)} < 1

(A41)

\begin{matrix} \sqrt{\frac{2 + \sqrt{2 l o g ((| Y | + 2) / δ)}}{m}} \frac{2 \sqrt{2 l o g (2)}}{m i n_{x} p (x)} \\ p (t) \sqrt{D_{K L} [p (x | t) | | p (x)]} < 1 / e \end{matrix}

(A42)

They discussed how IB bound trivially holds when (A42) is not satisfied, but neglecting the constraint (A41). Our simulations in Section 4.1.2 compare the sample size the two bounds needs.

C. Why does DIB have a better generalization than IB? The main conclusion in Section 3.1 is explained as follows.

The SG bound in Theorem 3 is controlled by $H (T)$ and $H (T | Y)$ , which suggests that minimizing $H (T)$ and $H (T | Y)$ narrows the source generalization gap.
$H (T)$ and $H (T | Y)$ are equivalently DIB regularizers (see Section 3.2).
$I (X; T)$ and $I (X; T | Y)$ are equivalently IB regularizers. This can be proved in the same way as (2): As is assumed in IB, $Y - X - T$ is a Markov chain, so $H (T | X, Y) = H (T | X)$ .

$\begin{matrix} I (X; T) & = H (T) - H (T | X) = I (Y; T) + H (T | Y) - H (T | X) \\ = I (Y; T) + H (T | Y) - H (T | X, Y) = I (Y; T) + I (X; T | Y) \end{matrix}$

$L_{I B} = I (X; T) - β I (Y; T) = I (X; T | Y) - (β - 1) I (Y; T), β > 1$
Therefore, $I (X; T | Y)$ is an equivalently IB regularizer, which is also known as CEB (Fischer 2020).
DIB achieves lower $H (T)$ and $H (T | Y)$ than IB.
$H (T) = I (X; T) + H (T | X)$ , $H (T | Y) = I (X; T | Y) + H (T | X)$ . DIB minimizes $H (T)$ and $H (T | Y)$ , while IB minimizes part of them according to (2) and (3). Specifically, DIB’s optimal $p (t | x)$ is deterministic, i.e., $0 = H {(T | X)}_{D I B} \leq H {(T | X)}_{I B}$ , while $I {(X; T)}_{D I B} \approx I {(X; T)}_{I B}$ (Strouse and Schwab, 2017b). Therefore, DIB’s optimal solution has smaller $H (T)$ and $H (T | Y)$ than that of IB, as is found in the simulations (Figure A3 in the Appendix C and Figure 2 in [7]).
Finally according to (1) and (4), DIB achieves lower SG than IB in the sense of Theorem 3.

Appendix A.3. Representation Discrepancy

Proof of Proposition 1.

\begin{matrix} d_{H Δ H} (p (t), q (t)) = s u p_{h \in H} | E_{t \sim p (t)} I_{{h^{*} (t) \neq h (t)}} - E_{t \sim q (t)} I_{{h^{*} (t) \neq h (t)}} | \end{matrix}

(A43)

\begin{matrix} \leq & s u p_{h \in H} \sum_{{t | h^{*} (t) \neq h (t)}} | p (t) - q (t) | \leq s u p_{h \in H} \sum_{{t | h^{*} (t) \neq h (t)}} | \hat{p} (t) - \hat{q} (t) | + ϵ \end{matrix}

(A44)

\begin{matrix} = & s u p_{h \in H} \sum_{{t | h^{*} (t) \neq h (t)}} | \frac{1}{m} \sum_{x \in X_{S}} p (t | x) - \frac{1}{m} \sum_{x \in X_{T}} p (t | x) | + ϵ \end{matrix}

(A45)

\begin{matrix} \leq & \frac{1}{m} \sum_{(x_{S}, x_{T}) \in CP} s u p_{h \in H} \sum_{{t | h^{*} (t) \neq h (t)}} | p (t | x_{S}) - p (t | x_{T}) | + ϵ \end{matrix}

(A46)

\begin{matrix} \leq & \frac{1}{m} \sum_{(x_{S}, x_{T}) \in CP} {∥ p (t | x_{S}) - p (t | x_{T}) ∥}_{1} + ϵ \end{matrix}

(A47)

Equation (A44) uses the Cauchy–Schwarz inequality and Lemma A2.

ϵ

is small when the sample size is large. □

Proof of (22).

\begin{matrix} ∥ p (t | x_{S}) - p (t | x_{T}) ∥_{1} \end{matrix}

(A48)

\begin{matrix} = Π_{i = 1}^{d} \int_{R} | \frac{1}{\sqrt{2 π} σ_{i}} (e^{- \frac{{(x_{i} - μ_{1 i})}^{2}}{2 σ_{i}^{2}}} - e^{- \frac{{(x_{i} - μ_{2 i})}^{2}}{2 σ_{i}^{2}}}) | d x_{i} \end{matrix}

(A49)

\begin{matrix} = Π_{i = 1}^{d} \int_{R} | \frac{1}{\sqrt{2 π} σ_{i}} (e^{- \frac{x_{i}^{2}}{2 σ_{i}^{2}}} - e^{- \frac{{(x_{i} - μ_{2 i} + μ_{1 i})}^{2}}{2 σ_{i}^{2}}}) | d x_{i} \end{matrix}

(A50)

\begin{matrix} = Π_{i = 1}^{d} 2 (\int_{- \infty}^{\frac{| μ_{1 i} - μ_{2 i} |}{2 σ_{i}}} \frac{1}{\sqrt{2 π}} e^{- \frac{x_{i}^{2}}{2}} - \int_{\frac{| μ_{1 i} - μ_{2 i} |}{2 σ_{i}}}^{\infty} \frac{1}{\sqrt{2 π}} e^{- \frac{x_{i}^{2}}{2}}) d x_{i} \end{matrix}

(A51)

\begin{matrix} = Π_{i = 1}^{d} (4 \int_{- \infty}^{\frac{| μ_{1 i} - μ_{2 i} |}{2 σ_{i}}} \frac{1}{\sqrt{2 π}} e^{- \frac{x_{i}^{2}}{2}} d x_{i} - 2) \end{matrix}

(A52)

\begin{matrix} = Π_{i = 1}^{d} (4 Φ (\frac{| μ_{1 i} - μ_{2 i} |}{2 σ_{i}}) - 2) \end{matrix}

(A53)

□

In order to simplify the calculation, we assumed that the source and target domains have the same sample sizes and a pair-wise correspondence, but in fact, the “pair-wise” assumption is not essential. What is essential is the intrinsic correspondence between the two domains. For example, all the zeros in MNIST digits and all the zeros in USPS digits have a semantic correspondence. This is a reasonable assumption for transfer learning because when the source domain and target domain are not related to each other, brute-force transfer may be unsuccessful (Pan et al., 2009).

When the sample size on two domains are different and the “pair-wise” correspondence does not hold, we can take the correlated instances in two domains as a general form of correspondent pairs, named correspondent group pairs. Specifically, the instances are partitioned into, say, n parts:

X_{S} = ⨆_{i}^{n} G_{S, i}

,

X_{T} = ⨆_{i}^{n} G_{T, i}

, where ⨆ is a disjoint union. ∀ fixed i, the instances in

G_{S, i}

and the instances in

G_{T, i}

are correlated. The collection of all the correspondent group pairs is denoted by

\tilde{CP} ≜ {{(G_{S, i}, G_{T, i})}_{i = 1 \dots n}}

. Take digit recognition as an example; the instances in two domains can both be partitioned into n = 10 groups.

G_{S, i}

and

G_{T, i}

contain the instances of digit

(i - 1)

. Then,

(G_{S, i}, G_{T, i})

is a correspondent group pair.

The comparison of IB and DIB on RD can be explicitly proven under the assumption of diagonal Gaussian and assumption of balanced samples (

\forall i, j \in {1 \dots n}, | G_{S, i} | = | G_{S, j} |, | G_{T, i} | = | G_{T, j} |

, where

| \cdot |

denotes the cardinality of a set, i.e.,

| X_{S} | = n | G_{S, i} |

,

| X_{T} | = n | G_{T, i} |

). This proof is a generalized form of the proof of Proposition 1.

\begin{matrix} (A 46) = s u p_{h \in H} \sum_{{t | h^{*} (t) \neq h (t)}} | \hat{p} (t) - \hat{q} (t) | + ϵ \end{matrix}

(A54)

\begin{matrix} = s u p_{h \in H} \sum_{{t | h^{*} (t) \neq h (t)}} | \frac{1}{| X_{S} |} \sum_{x_{S} \in X_{S}} p (t | x_{S}) - \frac{1}{| X_{T} |} \sum_{x_{T} \in X_{T}} p (t | x_{T}) | + ϵ \end{matrix}

(A55)

\begin{matrix} = s u p_{h \in H} \sum_{{t | h^{*} (t) \neq h (t)}} \frac{1}{n} \sum_{i = 1}^{n} | \frac{1}{| G_{S, i} |} \sum_{x_{S} \in G_{S, i}} p (t | x_{S}) - \frac{1}{| G_{T, i} |} \sum_{x_{T} \in G_{T, i}} p (t | x_{T}) | + ϵ \end{matrix}

(A56)

\begin{matrix} = s u p_{h \in H} \sum_{{t | h^{*} (t) \neq h (t)}} \frac{1}{n} \sum_{i = 1}^{n} | \frac{1}{| G_{S, i} | \cdot | G_{T, i} |} \sum_{x_{S} \in G_{S, i}, x_{T} \in G_{T, i}} (p (t | x_{S}) - p (t | x_{T})) | + ϵ \end{matrix}

(A57)

\begin{matrix} \leq \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{| G_{S, i} | \cdot | G_{T, i} |} \sum_{x_{S} \in G_{S, i}, x_{T} \in G_{T, i}} {∥ p (t | x_{S}) - p (t | x_{T}) ∥}_{1} + ϵ \end{matrix}

(A58)

Then, with Equation (22) and the analysis in our paper, IB’s RD is smaller than that of DIB.

Appendix B. Iterative and Variational Algorithm for EIB

Please note that Strouse and Schwab [7] constructed a generalized IB expression that has the same form as EIB. They used a generalized IB to solve the iterative algorithm of DIB and to raise it as an potential method for soft clustering, but did not study it in depth, let alone use it in transfer learning. In this section, we provide a traditional iterative algorithm of EIB and a proof of convergence, the proof of which is quite similar to that of IB. After that, we give a variantional algorithm for EIB, which is more useful in practice.

Appendix B.1. An Iterative Algorithm of EIB and a Proof of Convergence

Proposition A1.

The first-order variation of

L_{E I B}

at

p (t | x)

along

h (t | x)

is as follows:

\begin{matrix} δ L [p (t | x)] & = \sum_{x, t} p (x) h (t | x) log \frac{p {(t | x)}^{α}}{p (t)} + (α - 1) \sum_{x, t} h (t | x) p (x) - β \sum_{x, y, t} p (x, y) h (t | x) log \frac{p (t | y)}{p (t)} \end{matrix}

(A59)

Let the first-order variation equal 0; the expression of the optimal solution

p (t | x)

can then be obtained:

p (t | x) = \frac{1}{Z (x, α, β)} e x p (\frac{1}{α} (l o g p (t) - β D_{K L} [p (y | x) | | p (y | t)]))

(A60)

where

Z (x, α, β)

is a normalization coefficient. Now, we obtain the iterative algorithm for solving EIB in Algorithm 1.

Algorithm 1:The elastic information bottleneck method.

input:: $p (x, y)$ : Joint probability distribution;
1:: $α$ , $β$ : Parameters
2:: $| T |$ : Cardinality of set T/maximum number of categories*;
3:: Initialization: $p^{(0)} (t | x)$ ;n=0;
4:: repeat
5:: $p^{(n)} (t) = \sum_{x} p (x) p^{(n)} (t | x), \forall t \in T$ ;
6:: $p^{(n)} (y | t) = \frac{\sum_{x} p^{(n)} (t | x) p (x, y)}{p^{(n)} (t)}, \forall t \in T, \forall y \in Y$ ;
7:: $p^{(n + 1)} (t | x) = \frac{1}{Z (x, α, β)} e x p (\frac{1}{α} (l o g p^{(n)} (t) - β D_{K L} [p (y | x) | | p^{(n)} (y | t)]))$ ;
8:: n=n+1;
9:: until $p (t | x)$ converge.
output:: Optimal solution $p (t | x)$ ;

Comments.

Note that the actual number of representation clusters can be smaller than

| T |

because EIB minimizes H(T), resulting in

p (t) = 0

for some t.

Proposition A2

(Convergence of EIB algorithm). For the three iterative formulas of EIB,

L_{E I B}

will decrease every time an iterative formula is updated. Given that α, β,

L_{E I B}

have lower bounds, the EIB algorithm converges.

Proof.

Let

F_{E I B}

be the negative logarithmic expectation of the normalized coefficient, which is called the free energy of the system in physics.

F_{E I B} = - \sum_{x, t} p (x) p (t | x) l o g Z (x, β, α)

□

Lemma A6.

F_{E I B}

is non-negative. If any two terms of

p (t | x), p (t), p (t | y)

are fixed,

F_{E I B}

is convex with respect to the third term.

Proof.

\begin{matrix} F_{E I B} = & \sum_{x, t} p (x) p (t | x) l o g \frac{p {(t | x)}^{α}}{p (t)} + β \sum_{x, t, y} p (t | x) p (y, x) l o g \frac{p (y | x)}{p (y | t)} = \sum_{x} α D_{K L} [p (t | x) | | p (t)] \\ + (1 - α) \sum_{x, t} p (x) p (t | x) l o g \frac{1}{p (t)} + β \sum_{x, t} p (x) p (t | x) D_{K L} [p (y | x) | | p (y | t)] \end{matrix}

(A61)

Because KL-divergence is non-negative,

F_{E I B}

is non-negative. Since the function

g_{1} (x) = x l n x, 0 < x < 1

is strictly convex with respect to x, and the function

g_{2} (x) = - l n x, 0 < x < 1

is strictly convex with respect to x; obviously,

F_{E I B}

is strictly convex with respect to

p (t | x), p (t), p (t | y)

. □

Lemma A7.

When

p (t | x), p (t), p (t | y)

is updated according to the EIB iteration formula,

F_{E I B}

will not increase.

Proof.

First, we prove that when

p (t | x)

is updated,

F_{E I B}

will not increase. Let us add the normalization condition,

{\hat{F}}_{E I B} ≜ F_{E I B} + \sum_{x} λ (x) [\sum_{t} p (t | x) - 1]

, and let the derivative of

{\hat{F}}_{E I B}

with respect to

p (t | x)

be 0:

\begin{matrix} p (x) l o g \frac{p {(t | x)}^{α}}{p (t)} + α p (x) + β \sum_{y} p (y, x) l o g \frac{p (y | x)}{p (y | t)} + λ (x) = 0 \\ l o g \frac{p {(t | x)}^{α}}{p (t)} + β \sum_{y} p (y | x) l o g \frac{p (y | x)}{p (y | t)} + \hat{λ} (x) = 0 \\ p (t | x) = \frac{1}{Z (x, α, β)} e x p (\frac{l o g p (t) - β D_{K L} [p (y | x) | | p (y | t)]}{α}) \end{matrix}

(A62)

The above formula happens to be the iterative formula of

p (t | x)

. Since

{\hat{F}}_{E I B}

is strictly convex, when

p (t | x)

is updated,

F_{E I B}

will not increase.

Similarly, it is easy to verify that the derivative of

F_{E I B} + λ [\sum_{t} p (t) - 1]

with respect to

p (t)

being 0 is exactly the iterative formula of

p (t)

and that the derivative of

F_{E I B} + \sum_{t} λ (t) [\sum_{y} p (y | t) - 1]

with respect to

p (y | t)

being 0 is exactly the iterative formula of

p (y | t)

.

According to the above two lemmas, every iteration update makes

F_{E I B}

non-increasing. Since

F_{E I B}

has a lower bound, the iterative process converges. □

Appendix B.2. Variantional EIB

VIB [15] is a classical variational approach for IB, but its assumption that p(t) is standard Gaussian distribution contradicts minimizing H(T) in DIB. Therefore, we utilize the variational approaches similar to CEB [2] for variantional EIB. Figure A1 illustrates the network structure.

Figure A1. The network structure of variantional EIB.

The loss function of variantional EIB is derived as follows:

\begin{matrix} L_{E I B} = & H (T | Y) - α H (T | X) - β I (Y, T) \end{matrix}

(A63)

\begin{matrix} = & \int p (t | x) p (x, y) (l o g \frac{p^{α} (t | x)}{p (t | y)} - β l o g p (y | t)) d x d t d y - H (Y) \end{matrix}

(A64)

\begin{matrix} \leq & \int p (t | x) p (x, y) (l o g \frac{p^{α} (t | x)}{b (t | y)} - β l o g q (y | t)) d x d t d y - H (Y) \end{matrix}

(A65)

where the decoder

q (y | t)

is a variantional approximation of

p (y | t)

and the backward encoder

b (t | y)

is a variantional approximation of

p (t | y)

. The encoder outputs the K-dimensional expectations

μ_{1}

and K-dimensional diagonal elements of covariance matrix

σ_{1}

. Similarly, the backward encoder outputs

μ_{2}

and

σ_{2}

. Then, the regularization term can be written as follows:

\begin{matrix} \int p (t | x) l o g \frac{p^{α} (t | x)}{b (t | y)} d t \end{matrix}

(A66)

\begin{matrix} = & \prod_{i = 1}^{K} \int_{R} \frac{1}{\sqrt{2 π} σ_{1, i}} e x p (- \frac{{(x_{i} - μ_{1, i})}^{2}}{2 σ_{1, i}^{2}}) l n (\frac{{(\prod_{j = 1}^{K} \frac{1}{\sqrt{2 π} σ_{1, j}} e x p (- \frac{{(x_{j} - μ_{1, j})}^{2}}{2 σ_{1, j}^{2}}))}^{α}}{\prod_{j = 1}^{K} \frac{1}{\sqrt{2 π} σ_{2, j}} e x p (- \frac{{(x_{j} - μ_{2, j})}^{2}}{2 σ_{2, j}^{2}})}) d x_{i} \end{matrix}

(A67)

\begin{matrix} = & \sum_{j = 1}^{K} [\frac{(1 - α)}{2} l n (2 π) - α l n σ_{1, j} + l n σ_{2, j} - \frac{α}{2} + \frac{μ_{1, j}^{2} + μ_{2, j}^{2} - 2 μ_{1, j} μ_{2, j} + σ_{1, j}^{2}}{2 σ_{2, j}^{2}}] \end{matrix}

(A68)

Using the empirical distribution to approximate the joint distribution p(x,y), the loss function of variational EIB is as follows:

\begin{matrix} L = \frac{1}{m} \sum_{n = 1}^{m} \int p (t | x_{n}) (l o g \frac{p^{α} (t | x_{n})}{p (t | y_{n})} - β l o g q (y_{n} | t)) d t = β C E [s o f t m a x (y_{n}) | | q (y_{n} | t_{n})] + \end{matrix}

(A69)

\begin{matrix} \frac{1}{m} \sum_{n = 1}^{m} \sum_{j = 1}^{K} [\frac{μ_{1, j, n}^{2} + μ_{2, j, n}^{2} - 2 μ_{1, j, n} μ_{2, j, n} + σ_{1, j, n}^{2}}{2 σ_{2, j, n}^{2}} + \frac{(1 - α)}{2} l n (2 π) - α l n σ_{1, j, n} + l n σ_{2, j, n} - \frac{α}{2}] \end{matrix}

(A70)

Appendix C. Additional Experiments: EIB on Information Plane

We calculate the optimal solutions of EIB by the iterative algorithm and plot them on the information planes. Figure A2 reveals that when

α

decreases or

β

increases, I(Y;T) and I(X;T) grow while

H (T | X)

drops. Figure A3 shows that

α

and

β

can precisely adjust the position of the optimal solution on the information plane, which is closely related to the clustering property.

Figure A2. Top:

α

= 1;

β

takes 500 points evenly in [1.5, 51.5];

β

from small to large corresponds to the colors started from blue to red. Bottom:

β

= 4.5;

α

takes 201 points evenly in [0, 1];

α

from small to large corresponds to the colors starting from blue to red. The blue horizontal line is H(Y), which is the upper bound of I(Y,T).

Figure A2. Top:

α

= 1;

β

takes 500 points evenly in [1.5, 51.5];

β

from small to large corresponds to the colors started from blue to red. Bottom:

β

= 4.5;

α

takes 201 points evenly in [0, 1];

α

from small to large corresponds to the colors starting from blue to red. The blue horizontal line is H(Y), which is the upper bound of I(Y,T).

Figure A3. Left: Each curve is drawn by altering

β

. Right: Each curve is drawn by altering

α

.

Figure A3. Left: Each curve is drawn by altering

β

. Right: Each curve is drawn by altering

α

.

Appendix D. Experimental Details

Our codes of variational EIB are based on the code of VIB [15].

For variational EIB on toy data, the dimension of Gaussian representations is

K = 5

. The model was trained with 50 epochs, a 100 batch size, and a 0.01 initial learning rate, and the experiments were repeated with different random seeds.

For variational EIB on MNIST, the dimension of Gaussian representations is

K = 256

. We trained the variational EIB model, with 50 training epochs, a 100 batch size, and a 1e-4 initial learning rate, and the experiments were repeated with different random seeds.

Our codes of EIB-DFA-MCD are based on the code of DFA-MCD Wang et al. [33]. The pseudocode of EIB-DFA-MCD is shown in Algorithm 2:

Algorithm 2:EIB-DFA-MCD.

Input:: Source samples:( $X_{s}$ , $Y_{s}$ ); Target samples:( $X_{t}$ ); Parameters: $α$ , $β$ , $λ$ .
1:: for epoch=1 to epoch-number do
2:: for batch=1 to batch-number do
3:: Step 1: Update E, $D_{1}$ , $D_{2}$ , $B E_{1}$ , $B E_{2}$ to minimize
4:: $L o s s_{1} = L o s s_{E I B} (E, D_{1}, B E_{1}) + L o s s_{E I B} (E, D_{2}, B E_{2})$ ;
5:: Step 2: Update $D_{1}$ , $D_{2}$ , $B E_{1}$ , $B E_{2}$ to minimize
6:: $L o s s_{2} = L o s s_{E I B} (E, D_{1}) + L o s s_{E I B} (E, D_{2}) - K L [{\hat{Y}}_{t 1} | | {\hat{Y}}_{t 2}]$ ;
7:: Step 3: Update E to minimize $L o s s_{3} = K L [{\hat{Y}}_{t 1} | | {\hat{Y}}_{t 2}] + λ K L [{\hat{X}}_{s} | | {\hat{X}}_{t}]$ ;
8:: end for
9:: end for
Output:: E, $D_{1}$ , $D_{2}$ .

where E stands for the encoder,

D_{1}

and

D_{2}

stand for the decoders,

B E_{1}

and

B E_{2}

stand for the backward encoders,

{\hat{Y}}_{t 1}

and

{\hat{Y}}_{t 2}

stand for the target label predictions by the two decoders, and

{\hat{X}}_{s}

and

{\hat{X}}_{t}

stand for the target and source reconstructed instances. The dimension of Gaussian representations is

K = 768

. The model was trained with 200 epochs, a 128 batch size, and a 0.0002 initial learning rate, and the experiments were repeated with different random seeds. An OUr computing infrastructure was used to run the experiments: GPU model; memory: 24,268 MiB; operating system: linux; and Pytorch: 1.9.0+cu111.

References

Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057. [Google Scholar]
Fischer, I. The conditional entropy bottleneck. Entropy 2020, 22, 999. [Google Scholar] [CrossRef] [PubMed]
Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. 2005, 6, 165–188. [Google Scholar]
Slonim, N.; Friedman, N.; Tishby, N. Multivariate Information Bottleneck. Neural Comput. 2006, 18, 1739–1789. [Google Scholar] [CrossRef]
Aguerri, I.E.; Zaidi, A. Distributed variational representation learning. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 120–138. [Google Scholar] [CrossRef] [PubMed]
Kolchinsky, A.; Tracey, B.D.; Van Kuyk, S. Caveats for information bottleneck in deterministic scenarios. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Strouse, D.; Schwab, D.J. The deterministic information bottleneck. Neural Comput. 2017, 29, 1611–1630. [Google Scholar] [CrossRef]
Shamir, O.; Sabato, S.; Tishby, N. Learning and generalization with the information bottleneck. Theor. Comput. Sci. 2010, 411, 2696–2711. [Google Scholar] [CrossRef]
Wang, B.; Wang, S.; Cheng, Y.; Gan, Z.; Jia, R.; Li, B.; Liu, J. Infobert: Improving robustness of language models from an information theoretic perspective. arXiv 2020, arXiv:2010.02329. [Google Scholar]
Shwartz-Ziv, R.; Tishby, N. Opening the black box of deep neural networks via information. arXiv 2017, arXiv:1703.00810. [Google Scholar]
Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jeju Island, Korea, 11–15 October 2015; pp. 1–5. [Google Scholar]
Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 124020. [Google Scholar] [CrossRef]
Slonim, N. The Information Bottleneck: Theory and Applications. Ph.D. Thesis, Hebrew University of Jerusalem, Jerusalem, Israel, 2002. [Google Scholar]
Slonim, N.; Tishby, N. Agglomerative information bottleneck. Adv. Neural Inf. Process. Syst. 1999, 12, 617–623. [Google Scholar]
Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410. Available online: https://github.com/1Konny/VIB-pytorch (accessed on 21 April 2012).
Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017. [Google Scholar]
Wu, T.; Ren, H.; Li, P.; Leskovec, J. Graph Information Bottleneck. Adv. Neural Inf. Process. Syst. 2020, 33, 20437–20448. [Google Scholar]
Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2897–2905. [Google Scholar] [CrossRef]
Dubois, Y.; Kiela, D.; Schwab, D.J.; Vedantam, R. Learning optimal representations with the decodable information bottleneck. Adv. Neural Inf. Process. Syst. 2020, 33, 18674–18690. [Google Scholar]
Wang, Z.; Huang, S.L.; Kuruoglu, E.E.; Sun, J.; Chen, X.; Zheng, Y. PAC-Bayes Information Bottleneck. arXiv 2021, arXiv:2109.14509. [Google Scholar]
Strouse, D.; Schwab, D.J. The information bottleneck and geometric clustering. arXiv 2017, arXiv:1712.09657. [Google Scholar] [CrossRef]
Goldfeld, Z.; Polyanskiy, Y. The information bottleneck problem and its applications in machine learning. IEEE J. Sel. Areas Inf. Theory 2020, 1, 19–38. [Google Scholar] [CrossRef]
Zaidi, A.; Estella-Aguerri, I.; Shamai (Shitz), S. On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. Entropy 2020, 22, 151. [Google Scholar] [CrossRef] [Green Version]
Lewandowsky, J.; Bauch, G.; Stark, M. Information Bottleneck Signal Processing and Learning to Maximize Relevant Information for Communication Receivers. Entropy 2022, 24, 972. [Google Scholar] [CrossRef]
Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; Vaughan, J.W. A theory of learning from different domains. Mach. Learn. 2010, 79, 151–175. [Google Scholar] [CrossRef]
Ben-David, S.; Blitzer, J.; Crammer, K.; Pereira, F. Analysis of representations for domain adaptation. In Proceedings of the International Conference on Neural Information Processing Systems, Hong Kong, China, 3–6 October 2006. [Google Scholar]
Pan, S.J.; Qiang, Y. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
Zhao, H.; Combes, R.; Zhang, K.; Gordon, G.J. On Learning Invariant Representation for Domain Adaptation. arXiv 2019, arXiv:1901.09453. [Google Scholar]
Russo, D.; Zou, J. How much does your data exploration overfit? controlling bias via information usage. IEEE Trans. Inf. Theory 2019, 66, 302–323. [Google Scholar] [CrossRef]
Xu, A.; Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. Adv. Neural Inf. Process. Syst. 2017, 30, 2521–2530. [Google Scholar]
Sefidgaran, M.; Gohari, A.; Richard, G.; Simsekli, U. Rate-distortion theoretic generalization bounds for stochastic learning algorithms. In Proceedings of the Conference on Learning Theory, London, UK, 2–5 July 2022; pp. 4416–4463. [Google Scholar]
Sefidgaran, M.; Chor, R.; Zaidi, A. Rate-Distortion Theoretic Bounds on Generalization Error for Distributed Learning. arXiv 2022, arXiv:2206.02604. [Google Scholar]
Wang, J.; Chen, J.; Lin, J.; Sigal, L.; Silva, C.W.D. Discriminative Feature Alignment: Improving Transferability of Unsupervised Domain Adaptation by Gaussian-guided Latent Alignment. Pattern Recognit. 2021, 116. Available online: https://github.com/JingWang18/Discriminative-Feature-Alignment/tree/master/Digit_Classification/DFAMCD (accessed on 20 July 2022). [CrossRef]
Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
Hull, J. Database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 550–554. [Google Scholar] [CrossRef]

Figure 1. The previous proof vs. our proof.

Figure 2. The effect of randomness on RD.

Figure 3. Pre. denotes the previous bound. Left: Comparisons of the two bounds with respect to the constraint error as the sample size grows. Right: Comparisons of the two bounds with respect to the tightness when samples are sufficient.

Figure 4. Left: Target domain accuracy by EIB in transferring MNIST to USPS. BL (baseline) is the DFA-MCD method. Right: SG on MNIST by EIB.

Table 1. A comparison of IB and DIB.

Decomposition Terms	IB vs. DIB
Source Training Error	IB ≈ DIB
Source Generalization Gap (SG)	IB > DIB
Representation Discrepancy (RD)	IB < DIB

Table 2. Accuracy of elastic information bottleneck (EIB) on simulated data with different noise levels R. The highest accuracy under each noise levels are marked in bold.

$α$	R = 2	R = 1.5	R = 1.433	R = 1.375	R = 1.25
0	98.85	97.20	98.55	97.65	98.30
0.1	98.30	97.40	98.70	97.30	98.35
0.2	98.45	97.40	98.75	98.15	98.05
0.3	98.45	97.70	98.65	97.85	98.20
0.4	98.50	97.75	98.60	97.45	98.20
0.5	99.10	97.40	98.65	97.55	98.25
0.6	99.15	97.45	98.55	97.40	98.25
0.7	98.60	97.65	98.60	97.35	98.25
0.8	98.65	97.55	98.60	97.25	98.00
0.9	98.60	98.53	98.65	97.20	98.10
1.0	99.25	97.75	98.65	97.50	97.90

Table 3. The average error rate of classification between representation samples on two domains with difference noise levels. A larger error rate indicates smaller RD. The highest error rate under each noise levels are marked in bold.

R	2	1.5	1.433	1.375	1.25
$β$	$10^{4}$	$10^{4}$	$10^{4}$	$5 \times 10^{3}$	$5 \times 10^{3}$
DIB ( $α$ = 0)	43.65% ± 0.20%	28.05% ± 0.15%	28.14% ± 0.35%	28.99% ± 0.71%	27.72% ± 0.50%
IB ( $α$ = 1)	45.46% ± 0.26%	28.58% ± 0.18%	28.54% ± 0.35%	32.84% ± 1.25%	27.46% ± 0.35%

Table 4. T-test on “DIB’s SG < IB’s SG”. Statistically significant results are marked in bold.

$β$	$10^{4}$	$5 \times 10^{4}$	$10^{5}$	$10^{6}$	$10^{7}$	$10^{8}$
pvalue	0.021	0.020	0.042	0.66	0.085	0.39

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ni, Y.; Lan, Y.; Liu, A.; Ma, Z. Elastic Information Bottleneck. Mathematics 2022, 10, 3352. https://doi.org/10.3390/math10183352

AMA Style

Ni Y, Lan Y, Liu A, Ma Z. Elastic Information Bottleneck. Mathematics. 2022; 10(18):3352. https://doi.org/10.3390/math10183352

Chicago/Turabian Style

Ni, Yuyan, Yanyan Lan, Ao Liu, and Zhiming Ma. 2022. "Elastic Information Bottleneck" Mathematics 10, no. 18: 3352. https://doi.org/10.3390/math10183352

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Elastic Information Bottleneck

Abstract

1. Introduction

2. Problem Formulation

3. Main Results

3.1. Source Generalization Study of IB and DIB

3.2. Representation Discrepancy of IB and DIB

3.3. Elastic Information Bottleneck Method

4. Experiments

4.1. Toy Data Simulations

4.1.1. Performance of EIB in Transfer Learning

4.1.2. Our Generalization Bound vs. the Previous One

4.1.3. Comparison of RD between IB and DIB

4.2. Real Data Experiments

4.2.1. Source Generalization Analysis

5. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Missing Proofs

Appendix A.1. Target Error Decomposition

Appendix A.2. Generalization Bound

Appendix A.3. Representation Discrepancy

Appendix B. Iterative and Variational Algorithm for EIB

Appendix B.1. An Iterative Algorithm of EIB and a Proof of Convergence

Appendix B.2. Variantional EIB

Appendix C. Additional Experiments: EIB on Information Plane

Appendix D. Experimental Details

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI