
Gradient Regularization as Approximate Variational Inference

Ali Unlu 1 and Laurence Aitchison 2,*
1 Department of Informatics, University of Sussex, Brighton BN1 9QJ, UK
2 Department of Computer Science, University of Bristol, Bristol BS8 1UB, UK
* Author to whom correspondence should be addressed.
Entropy 2021, 23(12), 1629; https://doi.org/10.3390/e23121629
Submission received: 20 September 2021 / Revised: 31 October 2021 / Accepted: 1 November 2021 / Published: 3 December 2021
(This article belongs to the Special Issue Probabilistic Methods for Deep Learning)

Abstract
We developed Variational Laplace for Bayesian neural networks (BNNs), which exploits a local approximation of the curvature of the likelihood to estimate the ELBO without stochastic sampling of the neural-network weights. The Variational Laplace objective is simple to evaluate, as it is the log-likelihood plus weight-decay plus a squared-gradient regularizer. Variational Laplace gave better test performance and expected calibration errors than maximum a posteriori inference and standard sampling-based variational inference, despite using the same variational approximate posterior. Finally, we emphasize the care needed in benchmarking standard VI, as there is a risk of stopping before the variance parameters have converged. We show that such early stopping can be avoided by increasing the learning rate for the variance parameters.

1. Introduction

Neural networks are increasingly being used in safety-critical settings, such as self-driving cars [1] and medical diagnosis [2]. In these settings, it is critical to be able to reason about uncertainty in the parameters of the network—for instance, so that the system can call for additional human input when necessary [3]. Several approaches to Bayesian inference in neural networks are available, including stochastic gradient Langevin dynamics [4], Laplace’s method [5,6,7] and variational inference [8,9].
Here, we focus on combining the advantages of Laplace’s method [5,6,7] and variational inference (VI; [10]). Laplace’s method is very fast, as it begins by finding a mode using a standard gradient descent procedure, and then computes a local Gaussian approximation around that mode by performing a second-order Taylor expansion. However, as the mode is found by standard gradient descent, it may be a narrow mode that generalizes poorly [11]. In contrast, VI [8] is slower, as it requires stochastic sampling of the weights, but that stochastic sampling forces it to find a broad, flat mode that presumably generalizes better.
We developed a new Variational Laplace (VL) method that combines the best of both worlds: it finds broad, flat modes even in the absence of stochastic sampling. The resulting objective is composed of the log-likelihood, standard weight-decay regularization and a squared-gradient regularizer weighted by the variance of the approximate posterior. VL displayed improved performance over VI and MAP on standard benchmark tasks.
Our squared-gradient regularizer relates closely to work on the effectiveness of stochastic gradient descent. In particular, recent work has shown that gradient descent implicitly imposes a squared-gradient regularizer [12], and that full-batch gradient descent with an explicit squared-gradient regularizer can recover much of the benefit of the implicit regularization from minibatched stochastic gradient descent [13]. Our work implies that these regularizers can be interpreted as a form of approximate inference over the neural-network weights.

2. Background

2.1. Variational Inference (VI) for Bayesian Neural Networks

To perform variational inference for neural networks, we follow the usual approach [8,14], using independent Gaussian priors, $P$, and approximate posteriors, $Q$, for all parameters, $\mathbf{w}$:

$$P(w_\lambda) = \mathcal{N}\left(w_\lambda;\, 0,\, s_\lambda^2\right), \tag{1}$$

$$Q(w_\lambda) = \mathcal{N}\left(w_\lambda;\, \mu_\lambda,\, \sigma_\lambda^2\right), \quad \text{or equivalently} \quad Q(\mathbf{w}) = \mathcal{N}\left(\mathbf{w};\, \boldsymbol{\mu},\, \Sigma\right), \tag{2}$$

where $\mu_\lambda$ and $\sigma_\lambda^2$ are learned parameters of the approximate posterior, and $\Sigma$ is a diagonal matrix with $\Sigma_{\lambda\lambda} = \sigma_\lambda^2$. We fit the approximate posterior by optimizing the evidence lower bound objective (ELBO) with respect to the parameters of the variational posterior, $\mu_\lambda$ and $\sigma_\lambda^2$:

$$\mathcal{L}_{\text{VI}} = \mathbb{E}_{Q(\mathbf{w})}\left[\log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}\right)\right] + \beta \sum_\lambda \mathbb{E}_{Q(\mathbf{w})}\left[\log P\left(w_\lambda\right) - \log Q\left(w_\lambda\right)\right]. \tag{3}$$
Here, $\mathbf{x}$ denotes all training inputs, $\mathbf{y}$ all training outputs, and $\beta$ is the tempering parameter: $\beta = 1$ gives a close approximation to Bayesian inference, but smaller values are often used to “temper” the posterior, which frequently improves empirical performance [15,16] and has theoretical justification as accounting for the data-curation process [17].
We need to optimize the expectation in Equation (3) with respect to the parameters of $Q(\mathbf{w})$, the distribution over which the expectation is taken. To perform this optimization efficiently, the usual approach is the reparameterization trick [8,18,19]: we write $\mathbf{w}$ in terms of $\boldsymbol{\epsilon}$,

$$w_\lambda(\epsilon_\lambda) = \mu_\lambda + \sigma_\lambda \epsilon_\lambda, \tag{4}$$

where $\epsilon_\lambda \sim \mathcal{N}(0, 1)$. Thus, the ELBO can be written as an expectation over $\boldsymbol{\epsilon}$:

$$\mathcal{L}_{\text{VI}} = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}(\boldsymbol{\epsilon})\right) + \beta \sum_\lambda \left(\log P\left(w_\lambda(\epsilon_\lambda)\right) - \log Q\left(w_\lambda(\epsilon_\lambda)\right)\right)\right], \tag{5}$$
where the distribution over $\boldsymbol{\epsilon}$ is now fixed. Critically, the expected gradient of the term inside the expectation now equals the gradient of $\mathcal{L}_{\text{VI}}$, so samples of $\boldsymbol{\epsilon}$ give unbiased estimates of the gradient of the ELBO.
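To make this concrete, the following is a minimal PyTorch sketch of a reparameterized variational linear layer; the class name `VILinear` and the initialization constants are our illustrative choices, not part of the paper’s released code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class VILinear(nn.Module):
    """Linear layer with a factorized Gaussian posterior over its weights."""
    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        # Parameterize sigma via log sigma, initialized 3 below the log prior std.
        self.log_sigma = nn.Parameter(
            torch.full((out_features, in_features), math.log(prior_std) - 3.0))
        self.prior_std = prior_std

    def forward(self, x):
        # Reparameterization trick (Equation (4)): w = mu + sigma * eps.
        eps = torch.randn_like(self.mu)
        w = self.mu + self.log_sigma.exp() * eps
        return F.linear(x, w)

    def kl(self):
        # Closed-form KL(Q(w) || P(w)), summed over weights (see Equation (12)).
        s2 = self.prior_std ** 2
        sigma2 = (2.0 * self.log_sigma).exp()
        return 0.5 * ((sigma2 + self.mu ** 2) / s2 - 1.0
                      + math.log(s2) - 2.0 * self.log_sigma).sum()
```

A single-sample minibatch estimate of the tempered ELBO is then `loglik - beta * layer.kl() / n_minibatches`, maximized with any stochastic optimizer.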

2.2. Laplace’s Method

Laplace’s method [5,6,7] first finds a mode by doing gradient ascent on the log-joint:

$$\mathbf{w}^* = \arg\max_{\mathbf{w}} \left[\log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}\right) + \log P\left(\mathbf{w}\right)\right], \tag{6}$$

and uses a Gaussian approximate posterior around that mode,

$$Q(\mathbf{w}) = \mathcal{N}\left(\mathbf{w};\, \mathbf{w}^*,\, -\mathbf{H}^{-1}(\mathbf{w}^*)\right), \tag{7}$$

where $\mathbf{H}(\mathbf{w}^*)$ is the Hessian of the log-joint at $\mathbf{w}^*$ (negative definite at a mode, so $-\mathbf{H}^{-1}$ is a valid covariance).
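As a toy illustration of Equations (6) and (7), the sketch below finds the mode of a made-up two-dimensional log-joint by gradient ascent and then builds the Gaussian approximation from the Hessian; the log-joint itself is hypothetical.

```python
import torch

# Toy concave log-joint over a 2D parameter vector (purely illustrative).
def log_joint(w):
    return -(w[0] - 1.0) ** 2 - 0.5 * (w[1] + 2.0) ** 2 + 0.1 * w[0] * w[1]

w = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.1)
for _ in range(500):
    opt.zero_grad()
    loss = -log_joint(w)        # gradient ascent on the log-joint (Equation (6))
    loss.backward()
    opt.step()

w_star = w.detach()
H = torch.autograd.functional.hessian(log_joint, w_star)
cov = torch.linalg.inv(-H)      # Gaussian approximation N(w*, -H^{-1}) (Equation (7))
print(w_star, cov)
```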

3. Related Work

There is past work on Variational Laplace [20,21,22], which learns the mean parameters, $\boldsymbol{\mu}$, of a Gaussian approximate posterior,

$$Q_{\boldsymbol{\mu}}(\mathbf{w}) = \mathcal{N}\left(\mathbf{w};\, \boldsymbol{\mu},\, -\mathbf{H}^{-1}(\boldsymbol{\mu})\right), \tag{8}$$

and obtains the covariance matrix as a function of the mean parameters using the Hessian, as in Laplace’s method. However, instead of taking the approximation to be centered on a MAP solution, $\mathbf{w}^*$, they take the approximate posterior to be centered on learned mean parameters, $\boldsymbol{\mu}$. Importantly, they simplify the ELBO by substituting this approximate posterior into Equation (3) and approximating the log-joint by its Taylor series expansion. Ultimately, they obtain

$$\mathcal{L}_{\text{VI}} \approx \log P\left(\mathbf{y} \mid \mathbf{w}{=}\boldsymbol{\mu}, \mathbf{x}\right) + \log P\left(\mathbf{w}{=}\boldsymbol{\mu}\right) - \tfrac{1}{2} \log \left|-\mathbf{H}(\boldsymbol{\mu})\right| + \text{const}. \tag{9}$$
However, there are two problems with this approach when applied to neural networks. First, the algebraic manipulations required to derive Equation (9) require the full $N \times N$ Hessian, $\mathbf{H}(\boldsymbol{\mu})$, for all $N$ parameters, and neural networks have too many parameters for this to be feasible. Second, the $\log\left|-\mathbf{H}(\boldsymbol{\mu})\right|$ term in Equation (9) cannot be minibatched, as we need the full sum over minibatches inside the log-determinant:

$$\log \left|-\mathbf{H}(\boldsymbol{\mu})\right| = \log \left|-\sum_j \mathbf{H}_j(\boldsymbol{\mu})\right|, \tag{10}$$

where $\mathbf{H}_j(\boldsymbol{\mu})$ is the contribution to the Hessian from an individual minibatch. Due to these issues, past Variational Laplace methods did not scale to large neural networks.
An alternative deterministic approach to variational inference in Bayesian neural networks approximates the distribution over activations induced by stochasticity in the weights [23]. However, it is important to capture the covariance over features induced by that weight stochasticity. In fully connected networks, this is feasible, as we usually have a small number of features at each layer. In convolutional networks, however, we have a large number of features, channels × height × width. In the lower layers of a ResNet, we may have 64 channels and a 32 × 32 feature map, resulting in 64 × 32² = 65,536 features and a 65,536 × 65,536 covariance matrix. These scalability issues prevented them from applying their approach to convolutional networks. In contrast, our approach is highly scalable and readily applicable to the convolutional setting. Subsequent work such as Haußmann et al. [24] introduced other deterministic approximations, based on decomposing the ReLU into a linear function and a Heaviside step; however, their approach had errors of ∼30% on CIFAR-10.
Ritter et al. [7] and MacKay [25] used Laplace’s method in Bayesian neural networks, first finding the mode by gradient ascent on the log-joint probability and then expanding around that mode. As usual for Laplace’s method, they risk finding a narrow mode that generalizes poorly. In contrast, we find a mode using an approximation to the ELBO that takes the curvature into account and is hence biased towards broad, flat modes that presumably generalize better.
Our approach gives a squared-gradient regularizer that is similar to those discovered in past work [12,26]. They showed that squared-gradient regularizers connect to gradient descent, in that approximation errors due to finite step sizes in gradient descent imply an effective squared-gradient regularization. The similarity of our objectives raises profound questions about the extent to which gradient descent can be said to perform Bayesian inference. That said, there are three key differences. First, their approach connects full-batch gradient descent to squared-gradient regularizers, whereas most neural network training uses stochastic gradient descent on minibatches. Given that the stationary distribution of SGD is an isotropic Gaussian (for a quadratic loss function; Appendix A), we are able to connect stochastic gradient descent to squared-gradient regularizers, and hence to approximate variational inference. This is especially important in light of recent work showing that full-batch gradient descent gives poor performance, but that performance can be improved by including explicit squared-gradient regularization [13]; our work indicates that such explicit regularization mimics the implicit regularization from stochastic gradient descent. Second, our method uses the Fisher (i.e., gradients for data sampled from the model), whereas their approach uses the empirical Fisher (i.e., gradients for the observed data) to form the squared-gradient regularizer [27]. Third, our approach gives a principled method to learn a separate weighting of the squared gradient for each parameter, whereas the connection to SGD forces Barrett and Dherin [12] to use a uniform weighting across all parameters.
Our work differs from, e.g., Khan et al. [28] by explicitly providing a simple-to-implement loss function in terms of a squared-gradient regularizer, instead of working with NTK-inspired approximations to the Hessian.
Other approaches include “Broad Bayesian Learning” [29], which optimizes the architecture of a Bayesian neural network, exploiting information from previously trained but different networks. Of course, quantification of uncertainty for Bayesian neural networks is always fraught [30]. As such, we followed standard practice in the literature of reporting OOD detection performance and a measure of calibration accuracy [31].

4. Methods

To combine the best of VI and Laplace’s method, we begin by noting that the ELBO can be rewritten in terms of the KL divergence between the approximate posterior and prior:

$$\mathcal{L}_{\text{VI}} = \mathbb{E}_{Q(\mathbf{w})}\left[\log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}\right)\right] - \beta \sum_\lambda D_{\text{KL}}\left(Q\left(w_\lambda\right) \,\|\, P\left(w_\lambda\right)\right), \tag{11}$$

where the KL divergence can be evaluated analytically:

$$D_{\text{KL}}\left(Q\left(w_\lambda\right) \,\|\, P\left(w_\lambda\right)\right) = \frac{1}{2}\left[\frac{\sigma_\lambda^2 + \mu_\lambda^2}{s_\lambda^2} - 1 + \log \frac{s_\lambda^2}{\sigma_\lambda^2}\right]. \tag{12}$$
As such, the only term we need to approximate is the expected log-likelihood.
To approximate the expectation, we begin by taking a second-order Taylor series expansion of the log-likelihood around the current setting of the mean parameters, $\boldsymbol{\mu}$:

$$\mathbb{E}_{Q(\mathbf{w})}\left[\log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}\right)\right] \approx \log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}{=}\boldsymbol{\mu}\right) + \mathbb{E}_{Q(\mathbf{w})}\left[\sum_{j=1}^{B} \mathbf{g}_j^T \left(\mathbf{w} - \boldsymbol{\mu}\right)\right] + \mathbb{E}_{Q(\mathbf{w})}\left[\tfrac{1}{2} \left(\mathbf{w} - \boldsymbol{\mu}\right)^T \mathbf{H} \left(\mathbf{w} - \boldsymbol{\mu}\right)\right], \tag{13}$$
where $B$ is the number of minibatches, $\mathbf{g}_j$ is the gradient for minibatch $j$ and $\mathbf{H}$ is the Hessian for the full dataset:

$$g_{j;\lambda} = \frac{\partial}{\partial w_\lambda} \log P\left(\mathbf{y}_j \mid \mathbf{x}_j, \mathbf{w}\right), \tag{14}$$

$$H_{\lambda\nu} = \frac{\partial^2 \log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}\right)}{\partial w_\lambda\, \partial w_\nu}. \tag{15}$$
Here, $\mathbf{x}$ and $\mathbf{y}$ are the inputs and outputs for the full dataset, whereas $\mathbf{x}_j$ and $\mathbf{y}_j$ are the inputs and outputs for minibatch $j$. We now consider the expectation of each of these terms under the approximate posterior, $Q(\mathbf{w})$. The first term is constant and independent of $\mathbf{w}$. The second (linear) term is zero, because the expectation of $\mathbf{w} - \boldsymbol{\mu}$ under the approximate posterior is zero:

$$\mathbb{E}_{Q(\mathbf{w})}\left[\mathbf{g}_j^T \left(\mathbf{w} - \boldsymbol{\mu}\right)\right] = \mathbf{g}_j^T\, \mathbb{E}_{Q(\mathbf{w})}\left[\mathbf{w} - \boldsymbol{\mu}\right] = 0. \tag{16}$$
The third (quadratic) term might at first appear difficult to evaluate, because it involves the $N \times N$ matrix of second derivatives, where $N$ is the number of parameters in the model. However, using properties of the trace, and noting that the expectation of $(\mathbf{w} - \boldsymbol{\mu})(\mathbf{w} - \boldsymbol{\mu})^T$ is the covariance of the approximate posterior, we obtain

$$\mathbb{E}_{Q(\mathbf{w})}\left[\tfrac{1}{2} \left(\mathbf{w} - \boldsymbol{\mu}\right)^T \mathbf{H} \left(\mathbf{w} - \boldsymbol{\mu}\right)\right] = \mathbb{E}_{Q(\mathbf{w})}\left[\tfrac{1}{2} \operatorname{Tr}\left(\mathbf{H} \left(\mathbf{w} - \boldsymbol{\mu}\right) \left(\mathbf{w} - \boldsymbol{\mu}\right)^T\right)\right] = \tfrac{1}{2} \operatorname{Tr}\left(\mathbf{H} \Sigma\right). \tag{17}$$

Writing the trace in index notation and substituting the (diagonal) posterior covariance, $\Sigma$:

$$\tfrac{1}{2} \operatorname{Tr}\left(\mathbf{H} \Sigma\right) = \tfrac{1}{2} \sum_{\lambda\nu} H_{\lambda\nu} \Sigma_{\nu\lambda} = \tfrac{1}{2} \sum_\lambda H_{\lambda\lambda} \sigma_\lambda^2. \tag{18}$$
Thus, our first approximation of the expected log-likelihood is

$$\mathbb{E}_{Q(\mathbf{w})}\left[\log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}\right)\right] \approx \log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}{=}\boldsymbol{\mu}\right) + \tfrac{1}{2} \sum_\lambda \sigma_\lambda^2 H_{\lambda\lambda}, \tag{19}$$

and substituting this into Equation (11) gives

$$\mathcal{L}_{\text{VI}} \approx \mathcal{L}_{\text{VL}(\mathbf{H})} = \log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}{=}\boldsymbol{\mu}\right) + \tfrac{1}{2} \sum_\lambda \sigma_\lambda^2 H_{\lambda\lambda} - \beta \sum_\lambda D_{\text{KL}}\left(Q\left(w_\lambda\right) \,\|\, P\left(w_\lambda\right)\right). \tag{20}$$
This resolves most of the issues with the original Variational Laplace method: it requires only the diagonal of the Hessian, it can be minibatched and it does not blow up if H λ λ is zero.
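As a quick numerical sanity check of Equations (17) and (18), the sketch below (with made-up values of H, μ and σ²) compares a Monte Carlo estimate of the quadratic term with the closed-form ½Tr(HΣ):

```python
import torch

torch.manual_seed(0)
N = 5
A = torch.randn(N, N)
H = -(A @ A.T) - torch.eye(N)          # a negative-definite "Hessian" (made up)
sigma2 = torch.rand(N) + 0.1           # diagonal posterior variances
mu = torch.randn(N)

# Monte Carlo estimate of E_Q[(w - mu)^T H (w - mu) / 2] (left side of Eq. (17))
w = mu + sigma2.sqrt() * torch.randn(500_000, N)
mc = 0.5 * torch.einsum('bi,ij,bj->b', w - mu, H, w - mu).mean()

# Analytic value (1/2) Tr(H Sigma) = (1/2) sum_l H_ll sigma2_l (Equations (17)-(18))
exact = 0.5 * (H.diagonal() * sigma2).sum()
print(mc.item(), exact.item())         # the two should agree to within MC error
```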

4.1. Pathological Optima When Using the Hessian

However, a new issue arises. $H_{\lambda\lambda}$ is usually negative, in which case the approximation in Equation (20) can be expected to work well, but there is nothing to stop $H_{\lambda\lambda}$ from becoming positive. If we, e.g., took the log-determinant of the negative Hessian, a positive diagonal element would immediately break the optimization (we would be taking the logarithm of a negative number). In our context, there is no immediate issue, as Equation (20) takes a well-defined value even when one or more of the $H_{\lambda\lambda}$ are positive. That said, we rapidly encounter similar problems in the form of pathological optimal values of $\sigma_\lambda^2$. In particular, picking out the terms in the objective that depend on $\sigma_\lambda^2$, absorbing the others into the constant, and taking $\beta = 1$ for simplicity, we have

$$\mathcal{L}_{\text{VL}(\mathbf{H})} = \tfrac{1}{2} \sum_\lambda \left[-\left(\tfrac{1}{s_\lambda^2} - H_{\lambda\lambda}\right) \sigma_\lambda^2 + \log \sigma_\lambda^2\right] + \text{const}. \tag{21}$$
Thus, the gradient with respect to a single variance parameter is

$$\frac{\partial \mathcal{L}_{\text{VL}(\mathbf{H})}}{\partial \sigma_\lambda^2} = \frac{1}{2}\left[-\left(\frac{1}{s_\lambda^2} - H_{\lambda\lambda}\right) + \frac{1}{\sigma_\lambda^2}\right]. \tag{22}$$
In the typical case, $H_{\lambda\lambda}$ is negative, so $\frac{1}{s_\lambda^2} - H_{\lambda\lambda}$ is positive, and we can find the optimum by solving for the value of $\sigma_\lambda^2$ at which the gradient is zero:

$$\sigma_\lambda^2 = \frac{1}{\frac{1}{s_\lambda^2} - H_{\lambda\lambda}}. \tag{23}$$
However, if $H_{\lambda\lambda}$ is positive and sufficiently large, $H_{\lambda\lambda} > \frac{1}{s_\lambda^2}$, then $\frac{1}{s_\lambda^2} - H_{\lambda\lambda}$ becomes negative; not only is the optimum in Equation (23) undefined, but the gradient is always positive:

$$0 < \frac{\partial \mathcal{L}_{\text{VL}(\mathbf{H})}}{\partial \sigma_\lambda^2} = \frac{1}{2}\left[\left(H_{\lambda\lambda} - \frac{1}{s_\lambda^2}\right) + \frac{1}{\sigma_\lambda^2}\right], \tag{24}$$

as both terms in the sum, $H_{\lambda\lambda} - \frac{1}{s_\lambda^2}$ and $\frac{1}{\sigma_\lambda^2}$, are positive. As such, when $H_{\lambda\lambda} > \frac{1}{s_\lambda^2}$, the variance $\sigma_\lambda^2$ grows without bound.
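This pathology is easy to reproduce numerically. The sketch below (our construction, with made-up values of $H_{\lambda\lambda}$ and $s^2$) optimizes the per-parameter terms of Equation (21): with negative $H_{\lambda\lambda}$ the variance converges to the optimum of Equation (23), while with $H_{\lambda\lambda} > 1/s^2$ it grows until it overflows.

```python
import torch

def fit_sigma2(H_ll, s2=1.0, steps=500, lr=0.2):
    # Gradient ascent on the per-parameter terms of L_VL(H) (Equation (21)).
    log_sig2 = torch.tensor(0.0, requires_grad=True)
    opt = torch.optim.SGD([log_sig2], lr=lr)
    for _ in range(steps):
        sigma2 = log_sig2.exp()
        obj = 0.5 * (-(1.0 / s2 - H_ll) * sigma2 + log_sig2)
        opt.zero_grad()
        (-obj).backward()
        opt.step()
    return log_sig2.exp().item()

print(fit_sigma2(H_ll=-3.0))  # converges to 1/(1/s2 - H_ll) = 0.25 (Equation (23))
print(fit_sigma2(H_ll=2.0))   # H_ll > 1/s2: sigma2 grows without bound (prints inf)
```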

4.2. Avoiding Pathologies with the Fisher

To avoid pathologies arising from the fact that the Hessian is not necessarily negative definite, a common approach is to approximate the Hessian using the (negated) Fisher information matrix:

$$\mathbf{H} \approx -\mathbf{F} = -\sum_{j=1}^{B} \mathbb{E}_{P\left(\tilde{\mathbf{y}}_j \mid \mathbf{x}_j, \mathbf{w}{=}\boldsymbol{\mu}\right)}\left[\tilde{\mathbf{g}}_j\left(\tilde{\mathbf{y}}_j\right) \tilde{\mathbf{g}}_j^T\left(\tilde{\mathbf{y}}_j\right)\right]. \tag{25}$$
Importantly, $\tilde{\mathbf{g}}_j$ is the gradient of the log-likelihood for data sampled from the model, $\tilde{\mathbf{y}}_j$, not for the true data:

$$\tilde{g}_{j;\lambda}\left(\tilde{\mathbf{y}}_j\right) = \frac{\partial}{\partial w_\lambda} \log P\left(\tilde{\mathbf{y}}_j \mid \mathbf{x}_j, \mathbf{w}\right). \tag{26}$$
This gives us the Fisher, which is a commonly used and well-understood approximation to the Hessian [27]. Importantly, this contrasts with the empirical Fisher [27], which uses gradients conditioned on the actual data (not data sampled from the model):

$$\mathbf{F}_{\text{emp}} = \sum_{j=1}^{B} \mathbf{g}_j \mathbf{g}_j^T, \tag{27}$$

which is problematic because it contains a large rank-1 component in the direction of the mean gradient, which disrupts the estimated matrix specifically in the directions of interest for problems such as optimization [27].
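In code, the distinction is simply where the labels come from. The sketch below (our helper, not the paper’s released code) computes per-parameter squared gradients for a classifier, either for labels sampled from the model (Fisher) or for the observed labels (empirical Fisher):

```python
import torch
import torch.nn.functional as F

def squared_grads(model, x, y=None):
    """Per-parameter squared gradients of the log-likelihood for one minibatch.
    If y is None, labels are sampled from the model (Fisher, Equation (26));
    otherwise the observed labels are used (empirical Fisher, Equation (27))."""
    logits = model(x)
    if y is None:
        y = torch.distributions.Categorical(logits=logits.detach()).sample()
    loglik = F.log_softmax(logits, dim=-1).gather(1, y[:, None]).sum()
    grads = torch.autograd.grad(loglik, list(model.parameters()))
    return [g ** 2 for g in grads]
```

Within the VL objective, the Fisher variant is the one used, and the gradients must be taken with `create_graph=True` so that the resulting regularizer can itself be differentiated (see the sketch after Equation (30)).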
Using the Fisher information (Equation (25)) in Equation (19), we obtain an approximate expected log-likelihood:

$$\mathbb{E}_{Q(\mathbf{w})}\left[\log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}\right)\right] \approx \log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}{=}\boldsymbol{\mu}\right) - \tfrac{1}{2} \sum_\lambda \sigma_\lambda^2 \sum_{j=1}^{B} \tilde{g}_{j;\lambda}^2. \tag{28}$$
Substituting this into Equation (11) gives the final VL objective, $\mathcal{L}_{\text{VL}}$, which is an approximation of the ELBO:

$$\mathcal{L}_{\text{VI}} \approx \mathcal{L}_{\text{VL}} = \log P\left(\mathbf{y} \mid \mathbf{x}, \mathbf{w}{=}\boldsymbol{\mu}\right) - \tfrac{1}{2} \sum_\lambda \sigma_\lambda^2 \sum_{j=1}^{B} \tilde{g}_{j;\lambda}^2 - \beta \sum_\lambda D_{\text{KL}}\left(Q\left(w_\lambda\right) \,\|\, P\left(w_\lambda\right)\right). \tag{29}$$
In practice, we typically take the objective for a single minibatch, divided by the number of datapoints in the minibatch, $S$:

$$\frac{1}{S} \mathcal{L}_{\text{VL};j} = \frac{1}{S} \log P\left(\mathbf{y}_j \mid \mathbf{x}_j, \mathbf{w}{=}\boldsymbol{\mu}\right) - \frac{S}{2} \sum_\lambda \sigma_\lambda^2 \left(\frac{1}{S} \tilde{g}_{j;\lambda}\right)^2 - \frac{\beta}{2 S B} \sum_\lambda \left[\frac{\sigma_\lambda^2 + \mu_\lambda^2}{s_\lambda^2} - 1 + \log \frac{s_\lambda^2}{\sigma_\lambda^2}\right], \tag{30}$$

where $\frac{1}{S} \tilde{g}_{j;\lambda}$ is the gradient of the log-likelihood for the minibatch averaged across datapoints, i.e., the gradient of $\frac{1}{S} \log P\left(\tilde{\mathbf{y}}_j \mid \mathbf{x}_j, \mathbf{w}{=}\boldsymbol{\mu}\right)$. Remember that $B$ is the number of minibatches, so $S B$ is the total number of training datapoints. A sketch of this objective in code is given below.
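The following is a minimal PyTorch sketch of the minibatched objective in Equation (30) for a classifier; the function name, the `log_sigma2` dictionary, and the use of `torch.autograd.grad` with `create_graph=True` (double backpropagation) are our illustrative choices under stated assumptions, not the paper’s released implementation.

```python
import math
import torch
import torch.nn.functional as F

def vl_minibatch_objective(model, log_sigma2, x, y, n_minibatches,
                           prior_std=1.0, beta=1.0):
    """Sketch of the per-datapoint VL objective, Equation (30).
    `log_sigma2` maps parameter name -> log-variance tensor of the same shape.
    The model should use smooth activations (e.g., softplus); see Section 4.3."""
    S = x.shape[0]                                   # minibatch size
    params = dict(model.named_parameters())          # the means mu
    logits = model(x)
    logprobs = F.log_softmax(logits, dim=-1)
    loglik = logprobs.gather(1, y[:, None]).mean()   # (1/S) log P(y_j | x_j, mu)

    # Squared-gradient regularizer: gradients for labels sampled from the model,
    # with create_graph=True so the regularizer is itself differentiable wrt mu.
    y_tilde = torch.distributions.Categorical(logits=logits.detach()).sample()
    loglik_tilde = logprobs.gather(1, y_tilde[:, None]).mean()
    grads = torch.autograd.grad(loglik_tilde, list(params.values()),
                                create_graph=True)
    sq_grad = sum((log_sigma2[n].exp() * g ** 2).sum()
                  for n, g in zip(params.keys(), grads))

    # KL to the prior (Equation (12)), scaled by the S*B total datapoints.
    s2 = prior_std ** 2
    kl = sum(0.5 * ((log_sigma2[n].exp() + p ** 2) / s2 - 1.0
                    + math.log(s2) - log_sigma2[n]).sum()
             for n, p in params.items())

    return loglik - (S / 2.0) * sq_grad - beta * kl / (S * n_minibatches)
```

Both the network parameters (the means) and the `log_sigma2` tensors are then updated by maximizing this objective, e.g., with Adam.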

4.3. Constraints on the Network Architecture

Importantly, here the regularizer is the squared gradient of the log-likelihood with respect to the parameters, so optimizing the objective implicitly involves second derivatives of the log-likelihood. We therefore cannot use piecewise-linear activation functions such as ReLU, which have pathological second derivatives. In particular, the second derivative has a delta-function “spike” at zero:

$$\frac{d^2}{dx^2} \phi(x) = \frac{d}{dx}\left(\frac{d}{dx} \phi(x)\right) = \frac{d}{dx} \Theta(x) = \delta(x), \tag{31}$$
where $\phi$ is the ReLU nonlinearity, $\Theta(x)$ is the Heaviside step function (zero for $x < 0$ and one for $0 < x$), and $\delta(x)$ is the Dirac delta function. As the function is almost never evaluated at exactly zero, it is not possible to sensibly take into account the contribution of the infinitely high spike in the second derivative at zero. Interestingly, this issue is very similar to the one that arises when differentiating step (i.e., $\Theta(x)$) activations: the derivative is well-defined and zero almost everywhere, but there are delta-function spikes in the gradient at zero that gradient descent cannot reasonably work with. Instead, we used a softplus activation function (see the sketch below), though any activation with well-behaved second derivatives is admissible.
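As a sketch of this architectural constraint, the helper below (ours, not the paper’s code) recursively swaps module-style ReLUs for Softplus; note that it would not catch functional `F.relu` calls inside a network’s forward pass, which would need editing by hand.

```python
import torch.nn as nn

def replace_relu_with_softplus(module, beta=10.0):
    # Recursively swap module-style ReLUs for Softplus, which has a smooth,
    # well-behaved second derivative everywhere; larger `beta` hugs ReLU closer.
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.Softplus(beta=beta))
        else:
            replace_relu_with_softplus(child, beta)
    return module
```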

5. Results

We compared MAP, VI and our method (VL) on four datasets (CIFAR-10, CIFAR-100 [32], SVHN [33] and Fashion-MNIST [34], which is MIT licensed) using a PreactResNet-18 [35], an initial learning rate of 10⁻⁴ (decreased by a factor of 10 after 100 and 150 epochs), a batch size of 128, and all other optimizer hyperparameters at their default values. We tried two variants of variational inference: evaluating test performance using the mean network, VI (mean), and evaluating test performance by drawing 10 samples from the approximate posterior, VI (sampled). We swept across different degrees of posterior tempering, β; using β < 1 is normatively justified in the Bayesian framework as accounting for the effect of data curation [17]. For many values of β, VL gave better test accuracies, test log-likelihoods and expected calibration errors [31,36] than VI or MAP inference (Figure 1). Importantly, at the optimal value of β, VL almost always gave the best performance on these metrics (Table 1). These experiments took ∼480 GPU-hours, run on a mixture of NVIDIA 1080 and 2080 GPUs in an internal cluster.
The runtimes of the methods are listed in Table 2. VL (i.e., gradient regularization) was around a factor of three slower than either VI or MAP, due to the need to compute second derivatives. It is still eminently feasible, especially in comparison to past methods for deterministic variational inference, which have fundamental difficulties scaling to convolutional networks [23]. Furthermore, we did not find that increasing the number of epochs improved performance for either VI or MAP, as we were already training to convergence.

Early-Stopping and Poor Performance in VI

Before performing comparisons in which we learn the approximate posterior variance, it is important to understand the pitfalls of optimizing variational Bayesian neural networks using adaptive optimizers such as Adam. In particular, there is a strong danger of stopping the optimization before the variances have converged. To illustrate this risk, note that Adam [38] updates take the form

$$\Delta\theta = -\eta \frac{m}{\sqrt{v} + \epsilon}, \tag{32}$$

where $\eta$ is the learning rate, $m$ is an unbiased estimator of the mean gradient, $g$, $v$ is an unbiased estimator of the squared gradient, $g^2$, and $\epsilon$ is a small positive constant to avoid division by zero. The magnitude of the updates, $|\Delta\theta|$, is maximized by having exactly the same gradient on each step, in which case, neglecting $\epsilon$, we have $|\Delta\theta| = \eta$. As such, with a learning rate of $\eta = 10^{-4}$, a training set of 50,000 datapoints and a batch size of 128, a parameter can move at most $50{,}000/128 \times 10^{-4} \approx 0.04$ per epoch. Over 100 epochs at this learning rate, a parameter can therefore change by at most 4 before the first learning rate step.
This is fine for the weights, which typically have very small values. However, the parameters underlying the variances typically take on larger values. In our case, we use $\log \sigma_\lambda$ as the parameter and initialize it to three less than the log prior standard deviation, $\log s_\lambda - 3$. To ensure reasonable convergence, $\log \sigma_\lambda$ should be able to revert back to the prior, implying that it must be able to change by at least three during training. Unfortunately, 3 is very close to the maximum possible change of 4, raising the possibility that the variance parameters will not actually converge. To check whether early stopping was indeed an issue, we plotted the (tempered) ELBO for VI (Figure 2A) and VL (Figure 2B). For VI (Figure 2A), with the standard setup (lightest line, learning rate multiplier of 1), the ELBO clearly has not converged at 100 epochs, indicating early stopping. This was still an issue for VL (Figure 2B), especially if we were to train for fewer epochs; however, the effect was smaller for VL, perhaps because the gradients were more consistent, as VL does not sample the weights. These issues can be rectified by increasing the learning rate specifically for the $\log \sigma_\lambda$ parameters (darker lines).
We then plotted the test log-likelihood (Figure 2C), test accuracy (Figure 2D) and ELBO (Figure 2E) against the learning rate multiplier. Again, performance for VL (orange) was reasonably robust to changes in the learning rate multiplier. However, the performance of VI (blue) was very sensitive to the multiplier: as the multiplier increased, test performance fell but the ELBO rose. As we ultimately care about test performance, these results would suggest using the lowest multiplier (1) and accepting the possibility of early stopping. That may be a perfectly good choice in many cases. However, VI is supposed to be an approximate Bayesian method, and using an alternative form for the ELBO,

$$\mathcal{L}_{\text{VI}} = \log P\left(\mathbf{y} \mid \mathbf{x}\right) - D_{\text{KL}}\left(Q\left(\mathbf{w}\right) \,\|\, P\left(\mathbf{w} \mid \mathbf{y}, \mathbf{x}\right)\right), \tag{33}$$

we can see that the ELBO measures the KL divergence between the approximate and true posterior, and hence the quality of our approximate Bayesian inference. Very poor ELBOs therefore imply that this KL divergence is very large, and hence that the “approximate posterior” is no longer actually approximating the true posterior. If we are to retain a Bayesian interpretation of VI, we need to use larger learning rate multipliers, which give better values of the ELBO (Figure 2E); but in doing so, we get worse test performance (Figure 2C,D). This conflict between approximate-posterior quality and test performance is very problematic: the Bayesian framework suggests that as Bayesian inference becomes more accurate, performance should improve, whereas for VI, performance gets worse. Concretely, by initializing $\log \sigma_\lambda$ to a small value and then early stopping, we leave $\log \sigma_\lambda$ small throughout training, in which case VI becomes equivalent to MAP inference with a negligibly small amount of noise added to the weights. We would therefore expect early-stopped VI to behave (and be) very similar to MAP inference.
In subsequent experiments, we chose to use a learning rate multiplier of 10, as this largely eliminated early-stopping (though see VI with β = 0.1 ; Figure 2E). Overall, this indicates that we have to be very careful to avoid early stopping when running standard, sampling-based variational inference.
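In PyTorch-style code, the learning rate multiplier amounts to giving the variance parameters their own optimizer parameter group; the helper below is our sketch, not the paper’s training code.

```python
import torch

def make_vi_optimizer(mean_params, log_sigma_params, lr=1e-4, multiplier=10.0):
    # Give the variance parameters their own group with a larger learning rate,
    # so log sigma can move the ~3 units back toward the prior before the first
    # learning rate step.
    return torch.optim.Adam([
        {"params": mean_params, "lr": lr},
        {"params": log_sigma_params, "lr": lr * multiplier},
    ])
```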

6. Conclusions

We gave a novel Variational Laplace approach to inference in Bayesian neural networks which combines the best of previous approaches based on variational inference and Laplace’s Method. This method gave excellent empirical performance compared to VI.

Author Contributions

Conceptualization, L.A.; methodology, A.U.; software, A.U.; validation, L.A.; formal analysis, L.A.; investigation, A.U.; resources, L.A.; data curation, A.U.; writing—original draft preparation, L.A.; writing—review and editing, L.A.; visualization, A.U.; supervision, L.A.; project administration, L.A.; funding acquisition, L.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Code is available at https://github.com/LaurenceA/fitr (accessed on 2 December 2021).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Stationary Distribution of SGD

We sought to relate these gradient regularizers back to work on SGD. In particular, we looked to work on the stationary distribution of SGD, which noted that under quadratic loss functions, SGD samples from an isotropic Gaussian (i.e., with covariance proportional to the identity matrix). Consider a loss function that is locally well approximated by a quadratic. Without loss of generality, we consider a mode at $\mathbf{w} = 0$:

$$\log P\left(\{\mathbf{y}_i\}_{i=1}^{P} \mid \{\mathbf{x}_i\}_{i=1}^{P}, \mathbf{w}\right) = -\frac{P}{2} \mathbf{w}^T \mathbf{H} \mathbf{w} + \text{const}, \tag{A1}$$
where $P$ is the total number of datapoints and $\mathbf{H}$ is positive definite. Typically, the objective used in SGD is the loss for a minibatch of size $S_{\text{SGD}}$. Following Mandt et al. [39], we use the Fisher information to identify the noise in the minibatch gradient:

$$\nabla_{\mathbf{w}}\left[-\frac{1}{S_{\text{SGD}}} \log P\left(\mathbf{y}_j \mid \mathbf{x}_j, \mathbf{w}\right)\right] = \mathbf{H} \mathbf{w} + \frac{1}{\sqrt{S_{\text{SGD}}}} \mathbf{H}^{1/2} \boldsymbol{\xi}(t), \tag{A2}$$
where $\boldsymbol{\xi}(t)$ is sampled from a standard IID Gaussian. For SGD, this gradient is multiplied by a learning rate, $\eta_{\text{SGD}}$:

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta_{\text{SGD}} \mathbf{H} \mathbf{w}(t) + \frac{\eta_{\text{SGD}}}{\sqrt{S_{\text{SGD}}}} \mathbf{H}^{1/2} \boldsymbol{\xi}(t). \tag{A3}$$
This is a multivariate Gaussian autoregressive process, so we can solve for the stationary distribution of the weights. In particular, taking the expectation of the outer product of the update in Equation (A3) (the noise term is independent of $\mathbf{w}(t)$), the covariance at time $t+1$ is

$$C_{\mathbf{w}}(t+1) = \left(\mathbf{I} - \eta_{\text{SGD}} \mathbf{H}\right) C_{\mathbf{w}}(t) \left(\mathbf{I} - \eta_{\text{SGD}} \mathbf{H}\right)^T + \frac{\eta_{\text{SGD}}^2}{S_{\text{SGD}}} \mathbf{H}. \tag{A4}$$
Following Mandt et al. [39], when the learning rate is small, the term quadratic in $\eta_{\text{SGD}} \mathbf{H}$ can be neglected:

$$C_{\mathbf{w}}(t+1) \approx C_{\mathbf{w}}(t) - \eta_{\text{SGD}} \mathbf{H} C_{\mathbf{w}}(t) - \eta_{\text{SGD}} C_{\mathbf{w}}(t) \mathbf{H} + \frac{\eta_{\text{SGD}}^2}{S_{\text{SGD}}} \mathbf{H}. \tag{A5}$$
We then solve for the steady state, in which $\Sigma_{\text{SGD}} = C_{\mathbf{w}}(t+1) = C_{\mathbf{w}}(t)$:

$$0 \approx -\eta_{\text{SGD}} \left(\mathbf{H} \Sigma_{\text{SGD}} + \Sigma_{\text{SGD}} \mathbf{H}\right) + \frac{\eta_{\text{SGD}}^2}{S_{\text{SGD}}} \mathbf{H}, \tag{A6}$$

so

$$\Sigma_{\text{SGD}} \approx \frac{\eta_{\text{SGD}}}{2 S_{\text{SGD}}} \mathbf{I}. \tag{A7}$$
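As a numerical check of Equation (A7), the sketch below (with a made-up positive-definite H; not from the paper) simulates the process in Equation (A3) on a quadratic and estimates the stationary covariance, which should be close to $(\eta_{\text{SGD}}/2 S_{\text{SGD}})\mathbf{I}$:

```python
import torch

torch.manual_seed(0)
d, eta, S = 3, 0.01, 8
A = torch.randn(d, d)
H = A @ A.T + torch.eye(d)              # positive-definite Hessian (made up)
H_sqrt = torch.linalg.cholesky(H)       # any L with L L^T = H works as H^{1/2}

w = torch.zeros(d)
samples = []
for t in range(100_000):
    # One SGD step on the quadratic, with Fisher-shaped minibatch noise (Eq. (A3)).
    w = w - eta * (H @ w) + (eta / S ** 0.5) * (H_sqrt @ torch.randn(d))
    if t >= 20_000:                     # discard burn-in
        samples.append(w.clone())

cov = torch.stack(samples).T.cov()
print(cov)                              # approx. (eta / (2 S)) I = 0.000625 I
```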

Appendix B. Varying the Batch Size

In the main text, we used a batch size of 128. Here, we replicate those experiments with batch sizes of 64 and 256 (Figures A1 and A2) and find that the batch size has no effect on the relative performance of the methods.
Figure A1. Replication of Figure 1 with a batch size of 64.
Figure A2. Replication of Figure 1 with a batch size of 256.

References

1. Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to end learning for self-driving cars. arXiv 2016, arXiv:1604.07316.
2. Amato, F.; López, A.; Peña-Méndez, E.M.; Vanhara, P.; Hampl, A.; Havel, J. Artificial neural networks in medical diagnosis. J. Appl. Biomed. 2013, 11, 47–58.
3. McAllister, R.; Gal, Y.; Kendall, A.; Van Der Wilk, M.; Shah, A.; Cipolla, R.; Weller, A. Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
4. Welling, M.; Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011; pp. 681–688.
5. Azevedo-Filho, A.; Shachter, R.D. Laplace’s method approximations for probabilistic inference in belief networks with continuous variables. In Uncertainty Proceedings 1994; Elsevier: Amsterdam, The Netherlands, 1994; pp. 28–36.
6. MacKay, D.J. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
7. Ritter, H.; Botev, A.; Barber, D. A scalable Laplace approximation for neural networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 30 April–3 May 2018.
8. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural networks. arXiv 2015, arXiv:1505.05424.
9. Ober, S.W.; Aitchison, L. Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes. arXiv 2020, arXiv:2005.08140.
10. Wainwright, M.J.; Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference; Now Publishers Inc.: Delft, The Netherlands, 2008.
11. Neyshabur, B.; Bhojanapalli, S.; McAllester, D.; Srebro, N. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5947–5956.
12. Barrett, D.G.; Dherin, B. Implicit gradient regularization. arXiv 2020, arXiv:2009.11162.
13. Geiping, J.; Goldblum, M.; Pope, P.E.; Moeller, M.; Goldstein, T. Stochastic training is not necessary for generalization. arXiv 2021, arXiv:2109.14119.
14. Hinton, G.E.; Van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; pp. 5–13.
15. Huang, C.W.; Tan, S.; Lacoste, A.; Courville, A.C. Improving explorability in variational inference with annealed variational objectives. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 9701–9711.
16. Wenzel, F.; Roth, K.; Veeling, B.S.; Swiatkowski, J.; Tran, L.; Mandt, S.; Snoek, J.; Salimans, T.; Jenatton, R.; Nowozin, S. How good is the Bayes posterior in deep neural networks really? arXiv 2020, arXiv:2002.02405.
17. Aitchison, L. A statistical theory of cold posteriors in deep neural networks. arXiv 2020, arXiv:2008.05912.
18. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114.
19. Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv 2014, arXiv:1401.4082.
20. Friston, K.; Mattout, J.; Trujillo-Barreto, N.; Ashburner, J.; Penny, W. Variational free energy and the Laplace approximation. NeuroImage 2007, 34, 220–234.
21. Daunizeau, J.; Friston, K.J.; Kiebel, S.J. Variational Bayesian identification and prediction of stochastic nonlinear dynamic causal models. Phys. D Nonlinear Phenom. 2009, 238, 2089–2118.
22. Daunizeau, J. The Variational Laplace approach to approximate Bayesian inference. arXiv 2017, arXiv:1703.02089.
23. Wu, A.; Nowozin, S.; Meeds, E.; Turner, R.E.; Hernández-Lobato, J.M.; Gaunt, A.L. Deterministic variational inference for robust Bayesian neural networks. arXiv 2018, arXiv:1810.03958.
24. Haußmann, M.; Hamprecht, F.A.; Kandemir, M. Sampling-free variational inference of Bayesian neural networks by variance backpropagation. In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, Tel Aviv, Israel, 22–25 July 2019; pp. 563–573.
25. MacKay, D.J. A practical Bayesian framework for backpropagation networks. Neural Comput. 1992, 4, 448–472.
26. Smith, S.L.; Dherin, B.; Barrett, D.G.; De, S. On the origin of implicit regularization in stochastic gradient descent. arXiv 2021, arXiv:2101.12176.
27. Kunstner, F.; Hennig, P.; Balles, L. Limitations of the empirical Fisher approximation for natural gradient descent. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 4156–4167.
28. Khan, M.E.; Immer, A.; Abedi, E.; Korzepa, M. Approximate inference turns deep networks into Gaussian processes. arXiv 2019, arXiv:1906.01930.
29. Kuok, S.C.; Yuen, K.V. Broad Bayesian learning (BBL) for nonparametric probabilistic modeling with optimized architecture configuration. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 1270–1287.
30. Yao, J.; Pan, W.; Ghosh, S.; Doshi-Velez, F. Quality of uncertainty quantification for Bayesian neural network inference. arXiv 2019, arXiv:1906.09686.
31. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia, 6–11 August 2017; pp. 1321–1330.
32. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009.
33. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading digits in natural images with unsupervised feature learning. In Proceedings of the NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, Granada, Spain, 16–17 December 2011.
34. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747.
35. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645.
36. Naeini, M.P.; Cooper, G.; Hauskrecht, M. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015.
37. Chrabaszcz, P.; Loshchilov, I.; Hutter, F. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv 2017, arXiv:1707.08819.
38. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
39. Mandt, S.; Hoffman, M.D.; Blei, D.M. Stochastic gradient descent as approximate Bayesian inference. J. Mach. Learn. Res. 2017, 18, 4873–4907.
Figure 1. Training a PreactResNet-18 on various datasets, displaying the test accuracy, test log-likelihood, expected calibration error (ECE) [31,36] and OOD detection metric (AUROC) for CIFAR-10, CIFAR-100, SVHN and Fashion-MNIST. Downsampled ImageNet [37] was used as OOD data. See Appendix B.
Figure 2. Analysis of early stopping in VI and VL. The first row is untempered (β = 1), and the second row is tempered (β = 0.1). (A) ELBO over epochs 0–100 (with the highest initial learning rate) for VI; lines correspond to networks with learning rate multipliers for log σλ of 1, 3, 10 and 30. (B) As (A), but for VL. (C–E) Final test log-likelihood (C), test accuracy (D) and ELBO (E) after 200 epochs for different learning rate multipliers.
Table 1. Best values of test NLL, test accuracy and ECE for a variety of datasets, across different values of the tempering parameter, β.

Dataset        Method            Test NLL    Test Acc.    ECE
CIFAR-10       VL                0.23        92.4%        0.017
               VI (Mean)         0.37        91.1%        0.053
               VI (10 Samples)   0.35        90.2%        0.044
               MAP               0.43        90.8%        0.058
CIFAR-100      VL                1.00        71.4%        0.024
               VI (Mean)         1.29        68.8%        0.100
               VI (10 Samples)   1.49        67.3%        0.026
               MAP               1.61        67.5%        0.159
SVHN           VL                0.14        97.1%        0.009
               VI (Mean)         0.16        96.3%        0.012
               VI (10 Samples)   0.22        95.5%        0.022
               MAP               0.24        95.7%        0.028
Fashion MNIST  VL                0.16        94.6%        0.010
               VI (Mean)         0.23        94.0%        0.034
               VI (10 Samples)   0.29        93.6%        0.016
               MAP               0.29        93.6%        0.096
Table 2. Time per epoch for different methods on CIFAR-10.

Method   Time per Epoch (s)
VL       114.9
VI       43.2
MAP      41.8