1. Introduction
Deep learning models have gained enormous attention thanks to their impressive performance compared with traditional learning models in a variety of areas, such as computer vision, speech processing, natural language processing, and many more [1,2]. However, despite their stunning performance, we still do not fully understand how deep neural networks work [3].
A number of recent approaches have been proposed to study the generalization/optimization properties of over-parameterized models, such as deep neural networks [4,5]. However, these approaches do not fully capture certain neural network representation properties, including how these properties evolve during training. Such an understanding of the role of different components of the model and their impact on the learning process can be essential for selecting or designing better neural network models and associated learning algorithms.
Another popular approach to studying the generalization/optimization dynamics of deep neural networks has been the information bottleneck (IB). This approach, which is based on the information bottleneck theory [6,7], employs the mutual information (MI) between the data and their neural network representation, as well as the MI between the labels and the neural network representation, to capture neural network behavior. In particular, in classification problems, it is typical to model the relationship between the data label Y, the data themselves X, and some neural network intermediate data representation Z via a Markov chain Y → X → Z, where Y, X, and Z represent random variables/vectors associated with these different objects. Then, the IB principle is described via two MIs: (1) I(X; Z), which measures the amount of information contained in the data representation about the input data, and (2) I(Y; Z), which measures the information in the data representation that could contribute to the prediction of ground-truth labels. One can capture how the values of I(X; Z) and I(Y; Z) evolve as a function of the number of training epochs for a neural network by plotting pairs of these mutual information values on a two-dimensional plane [8]. The plane defined by these MI terms is called the information plane (IP), and the trace of the MI values versus training epochs is called the information plane dynamic (IP-dynamic).
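To make the IP construction concrete, the following sketch estimates a single information-plane point with a simple plug-in (binning) MI estimator, in the spirit of the binning approach used in [8]. The data, the tanh "representation", and all names here are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def binned_mi(a, b, bins=10):
    """Plug-in (binning) estimate of I(A;B) in nats from paired scalar samples."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    nz = p > 0  # skip empty cells so the log is well defined
    return float((p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)                            # "input data" X
z = np.tanh(2.0 * x) + 0.1 * rng.normal(size=5000)   # a noisy "representation" Z
y = (x > 0).astype(float)                            # a label Y derived from X

# One point on the information plane: (I(X;Z), I(Y;Z)).
ip_point = (binned_mi(x, z), binned_mi(y, z))
```

Repeating this computation after every training epoch, and plotting the resulting pairs, traces out the IP-dynamic described above.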
This approach has led to the identification of some trends associated with the optimization of neural networks. In particular, by observing the IP-dynamics of networks trained on a synthetic dataset and the MNIST dataset, ref. [8] found that, in early epochs, both I(X; Z) and I(Y; Z) increase, whereas, in later epochs, I(Y; Z) keeps increasing while I(X; Z) decreases. This led to the conjecture that the training of a neural network contains two different phases: (1) a fitting phase, where the network representation Z fits the input data X as much as possible, and (2) a subsequent compression phase, in which the network compresses away the information in the representation Z that is useless for predicting the labels Y.
However, the IB approach requires estimating I(X; Z) and I(Y; Z), which is notoriously difficult to accomplish because the inputs and representations typically lie in very high-dimensional spaces. For example, non-parametric mutual information estimators—such as [9,10]—suffer from either high bias or high variance, especially in high-dimensional settings [10]. This directly affects any conclusions extracted from the IP-dynamics because high bias prevents recognizing the existence of fitting or compression phases, whereas high variance leads to inconsistent results across different numerical experiments. Indeed, with different mutual information estimators, researchers have drawn diverse or opposite conclusions about trends in IP-dynamics [8,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]. For instance, Saxe et al. [24] argued that the phenomena of fitting and compression reported in Shwartz et al.'s study [8] are highly dependent on the simple binning MI estimator adopted. Therefore, the trends that one often extracts from an IB analysis may not always hold.
1.1. Paper Contributions
This paper attempts to resolve these issues by introducing a different approach to studying the dynamics of neural networks. Our main contributions are as follows:
1. First, we propose more tractable measures to capture the relationship between an intermediate network data representation and the original data, and between the intermediate representation and the data label. In particular, we used the minimum mean-squared error (MMSE) between the intermediate data representation and the original data to capture the fitting and compression phenomena occurring in a neural network; we also used the well-known cross-entropy between the intermediate data representation and the data label to capture performance.
2. Second, by building upon the variational representations of these quantities, we also propose to estimate such measures using neural networks. In particular, our experimental results demonstrate that this approach leads to consistent estimates of the measures across different estimator neural network architectures and initializations.
3. Finally, using our proposed approach, we conducted an empirical study to reveal the influence of various factors on neural network learning processes, including compression, fitting, and generalization phenomena. Specifically, we considered the impact of (1) the machine learning model, (2) the learning algorithm (optimizer and regularization techniques), and (3) the data.
The main findings derived from our empirical study—along with the literature that explored similar network architecture, training algorithm, or data setups—are summarized in Table 1. In particular, we highlight that our study suggests that (1) a neural network's generalization performance improves with the magnitude of the network's fitting and compression phases; (2) a network tends to undergo a fitting phase followed by a compression phase, regardless of the activation function; and (3) the specific behavior of the fitting/compression phases depends on a number of factors, including the network architecture, the learning algorithm, and the nature of the data.
1.2. Scope of Study
Finally, we note that the information bottleneck technique has been used as a tool to cast insight into other machine learning paradigms, including semi-supervised learning [30] and unsupervised learning [31,32,33]. However, we focused exclusively on supervised learning settings—with an emphasis on neural networks—in order to contribute to a deeper understanding of deep learning techniques.
1.3. Paper Organization
This paper is organized as follows:
Section 2 offers an overview of the literature that relates to our work.
Section 3 proposes our approach to studying the compression, fitting, and generalization dynamics of neural networks, whereas
Section 4 discusses practical implementation details associated with our proposed approach.
Section 5 leverages our approach to conduct an empirical study of the impact of various factors on the compression, fitting, and generalization behavior of a neural network, including the underlying architecture, learning algorithm, and nature of the data. Finally, we summarize the paper, discuss its limitations, and propose future directions in
Section 6.
1.4. Paper Notation
We adopt the following convention for random variables and their distributions throughout the paper. A random variable (or vector) is denoted by an upper-case letter (e.g., Z), and its space of possible values is denoted by the corresponding calligraphic letter (e.g., 𝒵). The probability distribution of the random variable Z is denoted by P_Z. The joint distribution of a pair of random variables (Z_1, Z_2) is denoted by P_{Z_1 Z_2}. H(Z) represents the entropy (or differential entropy) of the random variable Z, H(Z_1 | Z_2) represents the entropy (or differential entropy) of the random variable Z_1 given the random variable Z_2, and I(Z_1; Z_2) represents the mutual information between the random variables Z_1 and Z_2. We denote the set of integers from 1 to n by [n].
2. Related Work
There are various lines of research that connect to our work.
Information bottleneck (IB) and information plane (IP) dynamics: Many works have adopted the IB and the IP to study the optimization dynamics of neural networks. Refs. [8,18,19,26,28] concluded that there are distinct fitting and compression phases during the training of a deep neural network, while [24,34] claimed that neural networks with saturating activation functions exhibit a fitting phase but do not exhibit a compression phase. Ref. [11] reported that the network may compress only occasionally, for some random initializations. On the other hand, ref. [11] found that weight decay regularization increases the magnitude of the compression, while [14] did not observe compression unless weight decay was applied. Finally, overfitting was observed from the IP associated with hidden layers in [8,23,34].
While the works mentioned above explore various aspects of deep learning techniques, such as how network behavior is affected by varying training dataset sizes and regularization techniques, their conclusions may not always be reliable because MI estimation can be inaccurate and unstable in high-dimensional settings, as argued in [12].
IB and IP based on other information measures: Many works have also adopted IBs/IPs based on other information measures to study the dynamics of neural networks. Motivated by source coding, ref. [35] proposes to replace I(X; Z) with the entropy of the representation Z. The authors in [36] introduced a generalized IB based on f-divergence. The authors also proposed an estimation bottleneck based on -information, but this quantity is difficult to estimate in practice, preventing its applicability in various problems. The paper [37] proposed an information bottleneck approach based on MMSE and Fisher information to develop robust neural networks. However, the authors utilized MMSE to substitute the mutual information between the representation and the ground-truth label, whereas we employed it to evaluate the association between the representation and the data. Inspired by [38], ref. [39] introduced a new IB—called the -information bottleneck—that articulates the amount of useful information a representation embodies about a target usable by a classifier drawn from a given family of classifiers. Recently, refs. [40,41] have used sliced mutual information to study fitting in neural networks. However, these works mainly focused on the fitting phase and did not explore the role of compression and its relationship with generalization.
Mutual information estimation: Relying on mutual information to study the dynamics of neural networks leads to various challenges. The first challenge relates to the fact that the MI between two quantities that lie in a continuous space and are linked by a functional relationship, such as the input and the output of a neural network, is theoretically infinite [42]. This limits its use, since a neural network representation is typically a deterministic function of the neural network input [8,11,21,24]. Many works have circumvented this issue by adding noise to the random variables. For instance, kernel density estimation (KDE) [43,44] was used by [11,13,24,45], and the k-nearest-neighbor-based Kraskov estimator [46] was used in [18,24,47]. Other works using variational mutual information estimators address the challenge by adding noise to the neural network representations [14,19]. However, adding noise to the representations of a neural network is not a widespread practice in most deep learning implementations. An alternative measure of dependence between two variables is sliced mutual information, proposed by [48]. This method involves random projections and the averaging of mutual information across pairs of projected scalar variables. Our approach differs from this method in that we directly processed the random variables in high-dimensional space.
The second challenge relates to the fact that many mutual information estimators exhibit high bias and/or high variance in high-dimensional settings. For example, simple binning methods [8,49] are known to lead to mutual information estimates that vary greatly depending on the bin size choice. Further, variational mutual information estimators, such as MINE [9], are also known to produce mutual information estimates that suffer from high bias or high variance [10,50].
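The bin-size sensitivity can be illustrated with a toy sketch (an illustrative assumption on our part, not the experiments of the cited works): for two independent Gaussian samples the true MI is zero, yet the plug-in estimate grows with the number of bins.

```python
import numpy as np

def binned_mi(a, b, bins):
    """Plug-in (binning) MI estimate in nats; upward-biased for finite samples."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum())

rng = np.random.default_rng(0)
a, b = rng.normal(size=1000), rng.normal(size=1000)  # independent: true MI = 0

coarse = binned_mi(a, b, bins=5)   # small upward bias
fine = binned_mi(a, b, bins=100)   # bias dominates: far from the true value 0
```

With 1000 samples, the 100-bin estimate reports substantial "mutual information" between two independent variables, which is exactly the kind of artifact that can masquerade as fitting or compression in an IP analysis.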
Our work departs from existing work because we propose to study the evolution of two more stable measures during a neural network optimization process: (1) the minimum mean-squared error associated with the estimation of the original data given some intermediate network representation and (2) the cross-entropy associated with the original data label given an intermediate data representation. This offers a more reliable lens for studying compression, fitting, and generalization phenomena occurring in neural networks.
3. Proposed Framework
We now introduce our approach to studying the compression, fitting, and generalization dynamics of neural networks. We focused exclusively on classification problems characterized by a pair of random variables (X, Y), where X is the input data and Y is the ground-truth label, that follow a distribution P_{XY}. We delivered an estimate Ŷ of the ground-truth label given the data X using an L-layer neural network as follows:

Ŷ = f_L( f_{L−1}( ⋯ f_1(X; θ_1) ⋯ ; θ_{L−1}) ; θ_L ),

where f_l(·; θ_l) models the operation of the l-th (l ∈ [L]) network layer and θ_l represents the parameters of this layer (the weights and biases). The network parameters were optimized using standard procedures given a (training) dataset containing various (training) samples.
The optimized network can then be used to make new output predictions given new input data X.
The network optimization procedure involves the application of iterative learning algorithms such as stochastic gradient descent. Therefore, at a certain epoch i associated with the learning algorithm, we can model the flow of information in the neural network via a Markov chain as follows:

X → Z_1^(i) → Z_2^(i) → ⋯ → Z_L^(i),

where the random variable Z_l^(i) represents the d_l-dimensional network representation at layer l at epoch i (with the convention that Z_0^(i) = X). Our goal was to examine how certain quantities—capturing the compression, fitting, and generalization behavior—associated with the network optimization process evolve as a function of the number of training epochs.
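This layered view can be sketched in a few lines (a minimal numpy stand-in with made-up dimensions, not our subject networks): run the network once and collect every intermediate representation Z_1, …, Z_L at a given epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-layer MLP: layer l computes Z_l = tanh(Z_{l-1} @ W_l + b_l),
# with the convention Z_0 = X. The dimensions d_l here are illustrative.
dims = [8, 16, 16, 4]
params = [(rng.normal(scale=0.5, size=(d_in, d_out)), np.zeros(d_out))
          for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward_with_representations(x, params):
    """Return the list [Z_1, ..., Z_L] of intermediate representations."""
    reps, z = [], x
    for w, b in params:
        z = np.tanh(z @ w + b)
        reps.append(z)
    return reps

x = rng.normal(size=(32, dims[0]))  # a batch of 32 inputs
reps = forward_with_representations(x, params)
```

Recording these representations epoch by epoch is what allows the quantities introduced next to be evaluated as functions of training time.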
Z-X measure: Our first quantity describes the difficulty in recovering the original data X from some intermediate network representation Z_l^(i), as follows:

min_{g ∈ G} E[ ℓ(X, g(Z_l^(i))) ],

where g is an estimator living in the function space G and ℓ(·,·) is a loss function. We will take the loss function to correspond to the squared error so that the Z-X measure reduces to the well-known minimum mean-squared error, given by:

mmse(X | Z_l^(i)) = min_{g ∈ G} E[ ‖X − g(Z_l^(i))‖² ],

where the function g that minimizes the right-hand side is the well-known conditional mean estimator g(z) = E[X | Z_l^(i) = z]. Our rationale for adopting this quantity to capture the relationship between the network representation and the data in lieu of mutual information—which is used in the conventional IB—is manifold:
First, the minimum mean-squared error can act as a proxy to capture fitting—the lower the MMSE, the easier it is to recover the data from the representation—and compression—the higher the MMSE, the more difficult it is to estimate the data from the representation.
Second, this quantity is also easier to estimate than mutual information, allowing us to capture the phenomena above reliably (see
Section 5.1).
Finally, the minimum mean-squared error is also connected to mutual information (see
Section 3.1).
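To illustrate how the Z-X measure behaves, here is a simplified sketch: we restrict the estimator function class to affine maps, which yields the linear MMSE, an upper bound on the true MMSE (the paper itself trains neural estimators instead). The representations below are illustrative assumptions.

```python
import numpy as np

def affine_mse(z, x):
    """MSE of the best affine estimator of x from z; upper-bounds the MMSE
    attained by the conditional mean estimator E[X | Z]."""
    z1 = np.hstack([z, np.ones((len(z), 1))])
    coef, *_ = np.linalg.lstsq(z1, x, rcond=None)
    resid = x - z1 @ coef
    return float(np.mean(np.sum(resid**2, axis=1)))

rng = np.random.default_rng(1)
x = rng.normal(size=(4000, 4))
z_faithful = x @ rng.normal(size=(4, 8))  # invertible linear "representation"
z_lossy = x[:, :1] @ np.ones((1, 8))      # keeps only one coordinate of X

# A lossier representation makes X harder to recover: higher Z-X measure.
```

Here `affine_mse(z_faithful, x)` is near zero while `affine_mse(z_lossy, x)` is large, matching the intuition that a rising Z-X measure signals compression.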
Z-Y measure: Our second quantity describes the difficulty in recovering the original label Y from some intermediate network representation Z_l^(i), as follows:

min_{h ∈ H} E[ ℓ(Y, h(Z_l^(i))) ],

where h is an estimator living in the function space H and ℓ(·,·) is a loss function. We will take the loss function to correspond to the cross-entropy so that the Z-Y measure reduces to the well-known conditional entropy, given by:

H(Y | Z_l^(i)) = min_{h ∈ H} E[ −log h(Y | Z_l^(i)) ],

where the function h that minimizes the right-hand side should model the distribution of the label given the representation. We also adopted this measure because it connects directly to performance—hence the ability of the network to generalize—but also to mutual information (see
Section 3.1).
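Analogously, for discrete variables the Z-Y measure has a closed form, which the following toy sketch computes directly (synthetic labels and representations assumed for illustration; in the paper the conditional distribution is modeled by an estimator network):

```python
import numpy as np

def conditional_entropy(y, z):
    """Plug-in estimate of H(Y | Z) in nats for discrete y and z: the value
    the Z-Y measure approaches when the estimator models P(Y | Z) well."""
    h, n = 0.0, len(y)
    for zv in np.unique(z):
        mask = z == zv
        _, counts = np.unique(y[mask], return_counts=True)
        p = counts / counts.sum()
        h -= (mask.sum() / n) * np.sum(p * np.log(p))
    return h

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=10_000)
z_informative = y.copy()                     # Z determines Y: H(Y|Z) ≈ 0
z_useless = rng.integers(0, 2, size=10_000)  # Z independent of Y: H(Y|Z) ≈ log 2
```

A representation that determines the label drives the Z-Y measure to zero, while an uninformative one leaves it at the label entropy, which is why this measure tracks predictive performance.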
Plane and Dynamics of the Z-X and Z-Y Measures: Equipped with the two measures above, one can immediately construct a two-dimensional plane plotting the Z-X measure against the Z-Y measure as a function of the number of network training epochs in order to understand (empirically) how a particular neural network operates. Such a plane and the associated dynamics are the analogue of the IB plane and the IB dynamics introduced in [8].
3.1. Connecting Our Approach to the Information Bottleneck
Our approach is also intimately connected to the conventional information bottleneck because—as alluded to earlier—our adopted measures are also connected to mutual information. First, in accordance with [51] (Theorem 10), we can bound the mutual information between the data X and the representation Z_l^(i) in terms of the MMSE and Var(X), where Var(X) represents the variance of the random variable X.
Second, we can also trivially express the mutual information between the label Y and the representation Z_l^(i) as follows:

I(Y; Z_l^(i)) = H(Y) − H(Y | Z_l^(i)).
However, the main advantage of our approach relative to the traditional IB is that the proposed Z-X and Z-Y measures are much easier to estimate than the corresponding mutual information in high-dimensional settings; see Section 5.1.
5. Results
We now build upon the proposed framework to explore the dynamics of the Z-X and Z-Y measures and their relationship with fitting/compression (F/C) phases and generalization in a range of neural network models. In particular, the fitting phase refers to the initial phase of training where the Z-X measure decreases with the number of epochs, indicating that the network is attempting to fit the dataset; this phase commonly occurs early in training. The compression phase, in turn, refers to the subsequent increase in the Z-X measure, indicating the compression of information in the network.
Firstly, we experimentally examined whether the estimation of the proposed measures is stable. Then, we examined the impact of (1) the model architecture; (2) the learning algorithm including optimizer and regularization techniques; and (3) the data on the dynamics of the measures.
The results will be presented using Z-X and/or Z-Y dynamics, and the tables show the losses, accuracy, and generalization error of each experiment. In the figures, the x-axes or y-axes will be shared unless specified otherwise by the presence of ticks.
5.1. Z-X and Z-Y Measure Estimation Stability
The reliability of the estimation of the proposed measures is critical for extracting robust conclusions about the behavior of the Z-X and Z-Y dynamics in a neural network. Such studies are, however, largely absent in the information bottleneck literature [12].
5.1.1. Criteria to Describe the Stability of Estimated Measures
We assessed the stability of the Z-X and Z-Y measure estimation using two criteria:
Stability with regard to the initialization of estimator networks: First, we explored how different initializations of an estimator network affect the Z-X and Z-Y measures.
Stability with regard to the architecture of estimator networks: Second, we also explored how (estimator) neural network architectures—with different depths—affect the estimation of the Z-X and Z-Y measures.
5.1.2. Subject Networks, Estimator Networks, and Datasets Involved
We examined the stability of the Z-X and Z-Y measure estimates in both fully connected and convolutional subject networks. In particular, we used: (1) a Tishby-net (which has an MLP-like architecture) trained on the Tishby-dataset classification task with a standard stochastic gradient descent (SGD) optimizer, and (2) a CNN trained on the CIFAR-10 classification task with an Adam optimizer. However, we noticed that the Tishby-net may not always converge due to its simple architecture and small dataset size of 4096 samples. Therefore, we repeated the training process multiple times with different initializations and only retained converged subject networks to ensure meaningful results. We built estimator networks as elaborated in the previous sections, and their architectures are detailed in Appendix A.
To verify the first stability criterion, we tested different initializations by modifying the random seed of the Xavier initializer. For the second stability criterion, we experimented with estimators at different depths.
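The first criterion can be sketched as follows (an illustrative numpy stand-in for our estimator networks: a tiny two-layer estimator re-trained from several random initializations on a fixed representation; the architecture, data, and hyper-parameters are all assumptions for this sketch):

```python
import numpy as np

def zx_estimate(z, x, hidden=32, seed=0, steps=500, lr=0.05):
    """Validation MSE of a two-layer estimator of x from z, trained by
    full-batch gradient descent; a stand-in for a Z-X estimator network."""
    rng = np.random.default_rng(seed)
    n = len(z) // 2
    ztr, xtr, zva, xva = z[:n], x[:n], z[n:], x[n:]
    w1 = rng.normal(scale=1 / np.sqrt(z.shape[1]), size=(z.shape[1], hidden))
    w2 = rng.normal(scale=1 / np.sqrt(hidden), size=(hidden, x.shape[1]))
    for _ in range(steps):
        h = np.tanh(ztr @ w1)
        err = h @ w2 - xtr
        g2 = (h.T @ err) / n
        g1 = (ztr.T @ ((err @ w2.T) * (1 - h**2))) / n
        w2 -= lr * g2
        w1 -= lr * g1
    pred = np.tanh(zva @ w1) @ w2
    return float(np.mean(np.sum((pred - xva) ** 2, axis=1)))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 2))
z = np.tanh(x @ rng.normal(size=(2, 6)))  # a fixed subject-network representation

# Stability w.r.t. initialization: re-train with different seeds and compare.
estimates = [zx_estimate(z, x, seed=s) for s in range(3)]
```

A small spread across seeds, as we also observe for our neural estimators, indicates that the reported dynamics are not artifacts of a particular initialization.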
5.1.3. Are the Measures Stable in the MLP-like Subject Neural Networks?
Figure 3 depicts the Z-X and Z-Y measure estimates on the Tishby-net. Specifically, panels (a) and (b) display the behavior of such measures under different initializations of a one-layer and two-layer estimator network, respectively. Our results indicate that these measures are robust to changes in the initialization of the estimator network (for a given estimator network architecture).
In turn, panels (c) and (d) depict the behavior of the Z-X and Z-Y measure estimates for different estimator network architectures. It is clear that the capacity of the estimator (which depends on the number of estimator network layers) may affect the exact values of the Z-X and Z-Y measure estimates, indicating the presence of a bias; however, such estimators can still capture consistent trends (such as increases and decreases in the measures that are critical to identifying fitting or compression behavior; see panel (d)).
We note, however—as elaborated previously—that the estimator networks need to be sufficiently complex to emulate a conditional mean estimator (to estimate the Z-X measure) or the conditional distribution of the label given the representation (to estimate the Z-Y measure). This may not always be possible depending on the complexity/capacity of the estimator network; e.g., one-layer estimator networks are only capable of representing linear estimators, whereas two-layer networks can represent more complex estimators (therefore, linear one-layer networks cannot reliably estimate the minimum mean-squared error unless the random variables are Gaussian). However, our results suggest that, with a two-layer network, we may already obtain a reliable estimate since—except for some representations—the measures estimated using a two-layer network do not differ much from those estimated using a three-layer network. Naturally, with an increase in the capacity of the estimator networks, one may also need additional data to optimize the estimator network and deliver a reliable estimate, but our results also suggest that the variance of the estimates is relatively low for both two-layer and three-layer estimators. Further, the results in [57] suggest that the difference between the estimated value and the true value of our Z-X measure decays rapidly with the number of points in the (validation) dataset (note, however, that these results only apply to scalar random variables). Therefore, we adopt a two-layer estimator network in our study of MLPs in the sequel.
We conducted a more robust analysis of the efficacy of different estimators using a Gaussian mixture data model in Appendix B, where we can also directly compute the mean-squared error analytically for comparison purposes.
5.1.4. Are the Measures Stable in the Convolutional Subject Neural Networks?
Figure 4 shows the Z-X and Z-Y measure estimates on the CNN. To test the stability criteria, we again used different estimator network initializations (varying the random seed of the Xavier initializer) and different estimator network architectures. We first plotted the Z-X and Z-Y dynamics based on the setup described in Section 4, and the results are shown in the left column of Figure 4. Then, for comparison, we added an extra convolutional layer to all Z-X estimators and a fully connected layer to all Z-Y estimators; the results are displayed in the right column of Figure 4.
The results show that both estimator networks lead to relatively consistent and stable measure estimates. This suggests that our proposed measures can be reliably inferred using such estimator networks—under different initializations—even in this high-dimensional setting, which poses significant challenges to mutual information estimators. Comparing the dynamics estimated by the standard estimator architecture and the one with an extra layer, we observed that the trends of the dynamics are similar. Hence, we used the standard setup in the rest of the paper due to its higher computational efficiency, as illustrated in Figure A3.
We next relied on this approach to estimate the Z-X and the Z-Y dynamics for different (subject) neural network models and algorithms in order to cast further insights into the compression, fitting, and generalization dynamics of deep learning.
5.2. The Impact of Model Architectures on the Network Dynamics
We started our study by investigating the effect of the neural network model on the Z-X and Z-Y dynamics. We considered MLPs with different activation functions, depths, and widths, as well as CNN and ResNet architectures. Our study allows us to identify possible fitting, compression, and generalization behaviors.
5.2.1. Does the Activation Function Affect the Existence of F/C Phases?
We began by examining whether the presence of the fitting and compression (F/C) phases is dependent on the activation function used in the network. This topic has been explored in previous studies using the IB approach [8,11,24,27], but different studies have led to different conclusions [27].
Setups: We deployed the Tishby-net architecture with various activation functions, including both saturating (tanh and softsign [58]) and non-saturating (ReLU [59], ELU [60], GELU [58], swish [61,62], PELU [63], and leaky-ReLU [64]) options. The Tishby-net was trained on the Tishby-dataset using the same optimizer and hyper-parameter setups as described in the literature [8,24]. The Z-X and Z-Y measures were estimated using two-layer estimators, as argued in Section 5.1.
Results: Figure 5 reveals that the Z-X dynamics exhibit a consistent pattern across all Tishby-nets, characterized by an initial decrease in the Z-X measure followed by an increase. Note that the initial decrease happens prior to the decrease in the subject network loss. There can be a long period of epochs during which the network struggles to converge and the changes in the Z-X measure are not easily visible. The Z-X dynamics in some experimental setups, such as PELU, display fluctuations, which we attribute to the unstable convergence of the subject network, as evidenced by the fluctuations in the subject network loss. Moreover, the increases in the Z-X measure coincide with the epochs where the network experiences a decrease in loss. These observations suggest that the F/C phases are likely to occur in the network regardless of the activation function employed. Our observation is in line with some of the previous studies that used MI measures, such as [8,11].
5.2.2. How Do the Width and Depth of an MLP Impact Network Dynamics?
We now examine the effect of the MLP width (number of neurons per layer) and depth on the Z-X and Z-Y dynamics.
Setups: For the MLP width analysis, we constructed four-layer MLPs with different numbers of neurons per layer: 16, 64, and 512. For the MLP depth experiment, we fixed the width of the subject network to 64 and varied its depth from two to six hidden layers. All models were trained on the full MNIST dataset using a standard SGD optimizer with a fixed learning rate of 0.001. We also used two-layer estimator networks to estimate the Z-X and Z-Y measures.
Figure 6 depicts the dynamics of the Z-X measure against the Z-Y measure for MLP networks with four layers and different widths. As shown in Table 2, the best generalization performance is associated with the MLP 512 × 4 model. We can observe that all MLP networks exhibit fitting and compression phases. However, wider networks (e.g., MLP 512 × 4) tend to begin compressing earlier, while thinner ones (e.g., MLP 16 × 4) tend to have a longer fitting phase. This trend suggests that wider networks are able to fit data more quickly. We can also observe that networks with more neurons per layer (MLP 512 × 4) exhibit more compression than networks with fewer neurons per layer (MLP 16 × 4). Interestingly, the MLP 512 × 4 model also exhibits the best generalization performance, so one can potentially infer that significant compression may be necessary for good generalization [8,29].
Figure 7 depicts the dynamics of the Z-X measure (associated with the first and last layers) of MLPs with a width of 64 and different depths (we note that the best generalization performance is associated with the MLP 64 × 3 model). In terms of fitting, we can observe that the different MLPs all experience a fitting phase. However, deeper models such as MLP 64 × 5 and MLP 64 × 6 appear to experience a more pronounced fitting phase than shallower models, though deeper models still exhibit a higher Z-X measure than shallower ones toward the end of this fitting phase. In terms of compression, we find that deeper networks (e.g., MLP 64 × 5, MLP 64 × 6) compress data more aggressively than shallower ones. Indeed, the gap between the Z-X measure values of the last layer and the first layer of the network is much larger for deeper models than for shallower ones.
We also highlight that the MLP 64 × 3 network, which demonstrated the best generalization performance (refer to Table 2), exhibited a significant fitting phase similar to MLP 64 × 2, as well as a notable compression phase close to that of MLP 64 × 4.
Overall, shallow networks may have difficulty compressing data effectively, while the layers close to the output in deep networks may lose important information and fail to fit the data well. We hypothesize that both of these phenomena—which are present in the MLP networks—can have an impact on a network's ability to generalize effectively.
5.2.3. How Do the Number of Kernels and Kernel Size of a CNN Impact Network Dynamics?
We now examine the effect of the kernels, including their number and size, on the Z-X and Z-Y dynamics in a CNN.
Setups: To analyze the impact of the number of kernels on the F/C phases in CNNs, we scaled the number of kernels by a factor relative to the baseline CNN architecture shown in Figure 2. To analyze the impact of the kernel size, we used 1 × 1, 3 × 3 (baseline), 5 × 5, and 7 × 7 kernel sizes for all convolutional layers. The CNN models were trained on the CIFAR-10 dataset using the Adam optimizer with a learning rate of 0.001. We utilized minimal estimator networks, as described in the previous section.
Results: Figure 8 depicts the Z-X dynamics of our CNN with different numbers of kernels. We observe that a low number of kernels (e.g., /4, /8) seems to impair both the fitting and compression processes, particularly in the early layers (e.g., layers 1 and 2). In contrast, a high number of kernels does not significantly impact the F/C phases or the generalization performance. Indeed, as shown in Table 3, CNNs with more kernels (e.g., ×2, ×4) have a test loss performance similar to the baseline model (note that the best test loss corresponds to the ×4 model, whose generalization performance is also similar to that of the baseline model). This suggests that adding more kernels to a well-generalizing CNN may not significantly impact the F/C phases and may not lead to improved generalization.
Figure 9 depicts the Z-X dynamics of our CNN with different kernel sizes. Networks with large kernels fail to fit and compress, but networks with small kernels also exhibit little fitting and compression. Indeed, the best test loss and generalization performance are associated with the CNN model with a 3 × 3 kernel size, which also exhibits more pronounced fitting and compression phases (refer to Table 3).
Overall, we hypothesize that selecting an appropriate kernel size can improve a network’s ability to both fit and compress data, leading to a better generalization performance, which is in line with the conclusion in [
8,
29].
5.2.4. How Do Residual Connections Affect the Network Dynamics?
We finally assessed the impact of residual connections—introduced in [
65]—on neural network learning dynamics, since these have frequently been used to address the vanishing gradient problem in very deep neural networks. We note that some works [
13,
18] have studied the behavior of ResNet or DenseNet (which also contain residual connections [
66]). However, these studies did not delve into how residual connections may impact the information bottleneck of hidden layer representations and their relation to generalization.
Setup: We deployed a ResCNN, as elaborated in the previous section, that was trained using an Adam optimizer with a learning rate of 0.001 on the CIFAR-10 dataset. We also used the standard estimator network setups elaborated in
Section 4.2 and shown in
Appendix A Figure A3.
Results: We first analyzed the behavior of the Z-X dynamics at the output of the residual blocks (e.g.,
) and the fully connected layers, and compared it with the CNN with a similar architecture but without residual connections; see
Figure 10.
We notice that the ResCNN tends to have less pronounced compression in the (residual) convolutional blocks, e.g., the Z-X dynamic of (without residual connection) shows a more pronounced increase than that of (with residual connection). Additionally, we can see that the model with residual connections depends more on the fully connected layers to compress the Z-X measure, as demonstrated by the significantly wider gap between representations and , as well as between and in the residual model.
We then inspected the behavior of the Z-X measure and the Z-Y measure within each residual block; see
Figure 11 (note that the dynamics of the Z-X and Z-Y measures associated with
are flat because
corresponds to
X).
We can observe that, within each residual block (i.e., for a given index l), the Z-X measure of is generally lower than that of and . This is because the representation is the sum of and and thus retains more information associated with the data.
We can also observe that, in every residual block, the Z-X dynamics of and have a pronounced increase over the epochs, while the Z-X dynamics of and are relatively stable. This suggests that each residual block may learn to form a mini-bottleneck. However, the overall network does not exhibit a visible compression phase when observing the output of the residual blocks alone. Our experiments demonstrate the distinct behavior of networks with residual connections compared to those without.
5.3. The Impact of the Training Algorithm on the Network Dynamics
A neural network's generalization ability also tends to depend on the training procedure, including the learning algorithm and regularizers. Therefore, we now explore how different learning settings affect the dynamics of the neural network Z-X and Z-Y measures.
5.3.1. How Does the Optimizer Impact the Network Dynamics?
It was suggested by [
29] that the Adam optimizer leads to a better performance during the fitting phase, but it tends to perform worse during the compression phase. We investigated, under the lens of our approach, the effect of Adam and various other optimizers on neural network learning dynamics.
Setup: Our experiments were conducted on CNNs (with the standard architecture illustrated in
Figure 2) trained on the CIFAR-10 dataset using different optimizers. Specifically, we experimented with non-adaptive optimizers such as SGD and SGD-momentum [
67], as well as adaptive optimizers such as RMSprop [
68]. We also considered the Adam optimizer [
69], which can be viewed as a combination of a momentum optimizer and RMSprop optimizer, representing a hybrid approach. We used standard hyper-parameters commonly used for CIFAR-10 classification tasks, setting the learning rate to 0.001 for all optimizers and a momentum parameter of 0.9 (if applicable). Our estimator networks are akin to those used in previous studies.
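To make the non-adaptive/adaptive distinction concrete, the per-parameter update rules can be sketched as follows (a simplified NumPy illustration of the textbook rules, not the library implementations used in our experiments):

```python
import numpy as np

def sgd(w, g, lr=0.001):
    """Plain SGD: step against the raw gradient."""
    return w - lr * g

def momentum(w, g, v, lr=0.001, beta=0.9):
    """SGD-momentum: step against an exponentially averaged gradient."""
    v = beta * v + g
    return w - lr * v, v

def rmsprop(w, g, s, lr=0.001, rho=0.9, eps=1e-8):
    """RMSprop: scale the step per parameter by a running RMS of gradients."""
    s = rho * s + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(s) + eps), s

def adam(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum-style first moment + RMSprop-style second moment."""
    m = b1 * m + (1 - b1) * g          # first moment (momentum buffer)
    v = b2 * v + (1 - b2) * g ** 2     # second moment (RMSprop accumulator)
    m_hat = m / (1 - b1 ** t)          # bias corrections for step t >= 1
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

The first moment m mirrors the momentum buffer, while the second moment v mirrors the RMSprop accumulator, which is the sense in which Adam is a hybrid of the two.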
Results: Figure 12 shows the behavior of the normalized Z-X measure for CNNs trained with different optimizers. We normalized this measure using min-max normalization to allow for a better visualization of relative changes in performance. Specifically, each Z-X dynamic curve was normalized individually, and the minimum and maximum values were taken from the curve after the 50th epoch, as we observed that all Z-X dynamics enter the compression phase before this epoch.
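Concretely, the normalization applied to each curve can be sketched as follows (a reconstruction of the procedure described above; the function name is ours):

```python
import numpy as np

def normalize_curve(curve, start_epoch=50):
    """Min-max normalize one Z-X dynamic curve, taking the minimum and
    maximum from the portion of the curve after `start_epoch`."""
    curve = np.asarray(curve, dtype=float)
    tail = curve[start_epoch:]
    lo, hi = tail.min(), tail.max()
    return (curve - lo) / (hi - lo)
```

Note that values before the 50th epoch may fall outside [0, 1], since the extrema are taken from the tail of the curve only.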
We observe that SGD and SGD-momentum exhibit similar fitting phases, while Adam and RMSprop also display similar fitting phases. We can also note that, when trained with the Adam and RMSprop optimizers—which are adaptive optimizers—the representations associated with the various layers exhibit major compression; in contrast, when trained with the SGD optimizer, the representations {
,
} do not show noticeable compression and, likewise, when trained with SGD-momentum optimizers, the representations {
,
} also do not exhibit much compression. Note that, in our experiment with the CNN trained on the CIFAR classification task, we can see from
Table 4 that the model trained with the RMSprop optimizer achieved the best generalization performance, followed closely by the model trained with Adam. Therefore, it appears that adaptive optimizers—which adjust the learning rate per parameter—may be critical for leading to network compression, and hence generalization [
70].
5.3.2. How Does Regularization Impact the Network Dynamics?
It has been suggested by [
11,
12] that weight decay regularization can significantly enhance the compression phase associated with a neural network learning dynamic. It has also been argued by others [
18] that compression is only possible with regularization. Therefore, we also investigated, under the lens of our approach, the effect of regularization on the learning dynamics of MLPs and CNNs.
Setup: We deployed MLP 64 × 4 models trained on the MNIST dataset with and without weight decay (WD) regularization, and CNN models trained on the CIFAR-10 dataset with and without dropout regularization. The weight decay was applied to all layers in the MLP 64 × 4 model with its hyper-parameter set to 0.001, while dropout was only adopted in the first fully connected layer in the CNN with a
,
, or
dropout rate (which is a common approach in the literature [
54]). The MLP with weight decay regularization requires more epochs to converge. Therefore, we trained the MLP 64 × 4 without weight decay for 300 epochs and the model with weight decay for 1200 epochs.
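For reference, the two regularizers act on training in quite different ways; below is a minimal NumPy sketch of their mechanics (an illustration of the standard update rules with a generic dropout rate, not our exact training code):

```python
import numpy as np

def weight_decay_step(w, grad, lr=0.001, wd=0.001):
    """One gradient step with L2 weight decay: the wd term shrinks the
    weights toward zero on top of the usual gradient update."""
    return w - lr * (grad + wd * w)

def dropout(z, rate, rng):
    """Inverted dropout on a representation z: zero out a fraction `rate`
    of the units and rescale the survivors to preserve the expectation."""
    mask = rng.random(z.shape) >= rate
    return z * mask / (1.0 - rate)
```

Weight decay acts on the parameters at every step, whereas dropout perturbs the representation itself during training, which is consistent with the different dynamics we observe below.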
Results: We offer the dynamics of the Z-X and Z-Y measures associated with the MLP setting in
Figure 13. We infer that weight decay regularization does not significantly impact the fitting phase; however, it does seem to affect network compression, leading networks to compress more aggressively. Moreover, weight decay not only prevents the network itself from overfitting [
2] but also prevents its representations from overfitting. Therefore, we conjecture that the weight decay regularization boosts the compression in MLPs (as also observed in [
11]) and prevents the representation overfitting to improve the generalization performance (shown in
Table 5), which is also in line with [
11].
We also offer the dynamics of the Z-X measure associated with the CNN setting in
Figure 14 (
Table 5 shows that the best generalization performance is obtained for a CNN with dropout regularization at a 60% dropout rate on the first fully connected layer). Our results suggest that tuning the dropout rate on the first fully connected layer affects not only the dynamics of its representation (
) but also the dynamics of other layers. When a high dropout rate (e.g., 90%) is used, we observe less pronounced fitting and compression phases, which also lead to a worse generalization performance (refer to
Table 5). Conversely, a low dropout rate (30%) showed similar fitting phases to the no-dropout group, but with more compression. These results support our conjecture that the F/C phases are linked to the generalization behavior of the model.
On the other hand, it can be observed that adopting dropout regularization diminishes the visibility of fitting phases across multiple layers. This suggests that the training algorithm effectively leverages the neurons and connections within the model, enabling rapid dataset fitting.
5.4. The Impact of the Dataset on the Network Dynamics
It is well established that the size of the training set directly affects a machine learning model’s generalization performance [
71]. Our goal was to also understand how the dataset size affects neural network model learning dynamics, including its fitting and compression phases.
Setup: We compared the learning dynamics of CNN models trained on three different datasets: 1% of CIFAR-10 (0.5k samples), CINIC [
56] (which has the same classes as CIFAR-10 but contains 180k samples), and the full CIFAR-10 dataset (50k samples).
We used the Adam optimizer with a learning rate of 0.001 to train the neural networks. We also estimated the Z-X and Z-Y measures using the network in
Figure A3 using the CIFAR-10 validation and test sets.
Results: Figure 15 shows the Z-X dynamics of CNNs trained on datasets of different sizes. We can observe from
Table 6 that the model trained on the CINIC dataset achieves the best generalization performance, while the model trained on the smallest dataset (1% CIFAR-10) performs the worst.
Our experiments show that the fitting behavior of the network trained on the small dataset is identical to that of the network trained on the standard CIFAR-10 dataset. However, the degree of compression exhibited by the network optimized on the 1% CIFAR-10 dataset was much less pronounced than that of the model trained on richer datasets. This suggests that compression may only be possible for sufficiently large datasets. Our experiments also show that the behavior of the Z-X measure associated with the network trained on the CINIC dataset rapidly increases during the optimization process. This indicates a significant F/C phase that may also justify the superior generalization performance.
Overall, these observations suggest that providing sufficient training data can amplify the magnitude of compression. This in turn helps the model learn to abstract key information for predicting labels more effectively, leading to a better generalization performance. Therefore, we conclude that compression may be a crucial factor for effective generalization in neural networks, and providing sufficient training data is essential for amplifying this phase [
8].
6. Conclusions
In this paper, we proposed to replace the mutual information measures associated with information bottleneck studies with other measures capable of capturing fitting, compression, and generalization behavior. The proposed method includes: (1) the Z-X measure, corresponding to an approximation of the minimum mean-squared error associated with the recovery of the network input (X) from some intermediate network representation (Z), and (2) the Z-Y measure, associated with the cross-entropy of the data label/target (Y) given some intermediate data representation (Z). We also proposed to estimate such measures using neural-network-based estimators. The proposed approach can handle representations in high-dimensional spaces and is both computationally stable and computationally affordable.
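In practice, once the estimator networks are trained, the two measures reduce to simple empirical losses on held-out data; the final computation can be sketched as follows (the function names are ours, and the estimator outputs are assumed given):

```python
import numpy as np

def zx_measure(x_true, x_reconstructed):
    """Z-X measure: empirical mean-squared error of recovering the input X
    from the representation Z via the trained reconstruction network."""
    x_true = np.asarray(x_true, dtype=float)
    x_rec = np.asarray(x_reconstructed, dtype=float)
    return float(np.mean((x_true - x_rec) ** 2))

def zy_measure(y_true, y_pred_probs, eps=1e-12):
    """Z-Y measure: empirical cross-entropy of the label Y given Z via the
    trained classifier head (probabilities clipped for stability)."""
    p = np.clip(np.asarray(y_pred_probs, dtype=float), eps, 1.0)
    idx = np.arange(len(y_true))
    return float(-np.mean(np.log(p[idx, np.asarray(y_true)])))
```

A falling Z-X measure indicates fitting of the input, a rising one indicates compression, and a falling Z-Y measure indicates improved label prediction.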
Our series of experiments explored—via the dynamics between the Z-X and Z-Y measure estimates—the interplay between network fitting, compression, and generalization on different neural networks, with varying architectures, learning algorithms, and datasets, that are as complex or more complex than those used in traditional IB studies [
12]. Our main findings are as follows:
Impact of Neural Network Architecture:
- We have found that MLPs appear to compress regardless of the non-linear activation function.
- We have observed that MLP generalization, fitting, and compression behavior depends on the number of neurons per layer and the number of layers. In general, the MLPs offering the best generalization performance exhibit more pronounced fitting and compression phases.
- We have also observed that CNN generalization, fitting, and compression behavior depends on the number and size of the kernels. In general, CNNs exhibiting the best generalization performance also exhibit pronounced fitting and compression phases.
- Finally, we have seen that the fitting/compression behavior exhibited by networks with residual connections is rather distinct from that of networks without such connections.
Impact of Neural Network Algorithms: We have observed that adaptive optimizers seem to lead to more compression and better generalization than non-adaptive ones. Likewise, we have also observed that regularization can help with compression/generalization.
Impact of Dataset: Our main observation is that insufficient training data may prevent a model from compressing and hence generalizing; in contrast, models trained with sufficient training data exhibit a fitting phase followed by a compression phase, resulting in a higher generalization performance.
Overall, our findings are in line with an open conjecture that good neural network generalization is associated with the presence of a neural network fitting phase followed by a compression phase during the learning process [
8,
11,
29].
There are some interesting directions for further research. First, it would be intriguing to explore the dynamics of state-of-the-art machine learning models, including transformers, which have demonstrated exceptional performance in various tasks. By analyzing the behavior of transformers under the lens of the information bottleneck theory, we may be able to gain additional insights into how these advanced models learn, compress information, and generalize.
Second, it would also be interesting to extend the study to other learning paradigms, such as semi-supervised or unsupervised tasks. In semi-supervised learning, where a limited amount of labeled data is available along with a larger unlabeled dataset, using the proposed approach to study the learning process may help to uncover effective strategies for leveraging unlabeled data. Similarly, in unsupervised learning tasks, where the goal is to discover patterns and structure in unlabeled data, a similar approach could potentially uncover the interplay between compression and fitting and their role in forming meaningful representations that capture essential information.
Finally, although our study has shed some light on the interplay between compression and generalization using the proposed method, it would be interesting to conduct a specialized study and analysis to obtain a more comprehensive understanding of the relationship between these two factors.