Article

Stochastic Control for Bayesian Neural Network Training

Ludwig Winkler, César Ojeda and Manfred Opper
1 Machine Learning Group, Technische Universität Berlin, 10623 Berlin, Germany
2 Artificial Intelligence Group, Technische Universität Berlin, 10623 Berlin, Germany
3 Centre for Systems Modelling and Quantitative Biomedicine, University of Birmingham, Birmingham B15 2TT, UK
* Author to whom correspondence should be addressed.
Entropy 2022, 24(8), 1097; https://doi.org/10.3390/e24081097
Submission received: 4 July 2022 / Revised: 24 July 2022 / Accepted: 30 July 2022 / Published: 9 August 2022
(This article belongs to the Special Issue Probabilistic Models in Machine and Human Learning)

Abstract

In this paper, we propose to leverage the Bayesian uncertainty information encoded in parameter distributions to inform the learning procedure for Bayesian models. We derive a first-principles stochastic differential equation for the training dynamics of the mean and uncertainty parameters in the variational distributions. On the basis of the derived Bayesian stochastic differential equation, we apply the methodology of stochastic optimal control to the variational parameters to obtain individually controlled learning rates. We show that the resulting optimizer, StochControlSGD, is significantly more robust to large learning rates and can adaptively and individually control the learning rates of the variational parameters. The evolution of the control suggests separate and distinct dynamical behaviours in the training regimes for the mean and uncertainty parameters in Bayesian neural networks.

1. Introduction

Deep Bayesian neural networks (BNNs) aim to leverage the advantages of two different methodologies. First, in recent years, deep representations have been incredibly successful in fields as diverse as computer vision, speech recognition and natural language processing [1,2,3]. Much of this success, however, revolves around prediction accuracy. Second, Bayesian methodologies are required to obtain an estimate of model uncertainty, a crucial feature that allows deep neural networks to perform risk assessment and make informed decisions. The role of model uncertainty in the training procedure of BNNs, however, remains unaddressed; the present investigation seeks to exploit the model uncertainty in Bayesian neural networks for the development of new learning algorithms.
For the training of BNNs, the approximate posterior over the model parameters is obtained by maximizing the variational lower bound.
Such a posterior introduces a form of uncertainty in the parameters which is different from that injected by random batches of data. In this investigation, we seek to exploit both the data uncertainty (aleatoric) and the model uncertainty (epistemic) to solve a control problem aimed at maximizing the evidence lower bound (ELBO), where the control parameters gauge the dynamics of the gradient during descent.
The contributions of our work are threefold:
  • We provide a derivation, from first principles, of the stochastic differential equation that governs the evolution of the parameters in variational distributions trained with variational inference, and we decompose the uncertainty of the gradients into their aleatoric and epistemic components.
  • We derive a stochastic optimal control optimization algorithm which incorporates the uncertainty in the gradients to optimally control the learning rates for each variational parameter.
  • The evolution of the control exhibits distinct dynamical behaviour and demonstrates different fluctuation and dissipation regimes for the variational mean and uncertainty parameters.
Section 1 offers an introduction to the topic. In Section 2, we provide an overview of probabilistic models and Bayesian neural networks. Section 3 details the derivation of the stochastic differential equation governing the dynamics of the frequentist and variational parameters. Subsequently, we derive a stochastic optimal control algorithm in Section 4 on the basis of the dynamics of the variational parameters. Finally, Section 5 summarizes the experiments undertaken and the performance of the stochastic optimal control optimizer, as well as the distinct behaviour of the control parameters.

2. Variational Inference for Bayesian Neural Networks

For a training dataset D , the Bayesian formulation of a neural network places a posterior distribution p ( θ | D ) and a prior p ( θ ) on each of its parameters θ . The quintessential task in Bayesian inference is to compute the posterior p ( θ | D ) according to Bayes’ rule:
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\,p(\theta)}{p(\mathcal{D})}.$$
Given a likelihood function p(D|θ) and the parameter prior p(θ), we can make predictions by marginalizing over the parameters. For the most common application of supervised learning with label y and data x, $\mathcal{D} = \{y_m, x_m\}_{m=1}^{N}$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$, this gives us
$$p(y \mid x) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta,$$
where p(y|x, θ) is the likelihood function of the output y given the input x and p(θ|D) is the posterior parameter distribution. For highly parameterized models, the inference of the posterior distribution p(θ|D) requires the computation of a high-dimensional integral, which is numerically intractable for most complex models, as they can easily have millions of parameters.
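In practice, this marginalization is approximated by Monte Carlo over samples from the (approximate) posterior. A minimal sketch of such a predictive average, assuming a PyTorch classifier whose forward pass draws a fresh θ ∼ q(θ|ϕ) on each call (the helper name is illustrative, not from the paper):

```python
import torch

def predictive(model, x, n_posterior_samples=100):
    """Monte Carlo estimate of p(y|x) = ∫ p(y|x, θ) p(θ|D) dθ.
    Each forward pass is assumed to resample θ from the variational posterior."""
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_posterior_samples)])
    return probs.mean(dim=0)  # average the per-sample class probabilities
```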
There are two main approaches for inferring the posterior distribution p(θ|D) in Bayesian neural networks: sampling from the posterior distribution in proportion to the data likelihood and prior [4], and variational inference, which optimizes a bound on the evidence and approximates the true posterior with a tractable distribution q(θ|ϕ) ≈ p(θ|D) with the variational parameters ϕ [5].
Our approach focuses on the variational inference formulation, which scales well to large data regimes as the bound is amenable to gradient-based optimization schemes [6]. Variational inference infers the posterior distribution by optimizing the Kullback–Leibler divergence between the true posterior p ( θ | D ) and a variational distribution q ( θ | ϕ ) . The important detail is that the variational distribution is assumed to be independent of the data D , which makes the solution approximate yet tractable. The optimization problem is then
$$\arg\min_{\phi}\,\mathrm{KL}\big[\,q(\theta \mid \phi)\,\|\,p(\theta \mid \mathcal{D})\,\big] = \arg\min_{\phi}\; -\mathbb{E}_{q(\theta\mid\phi)}\big[\log p(\mathcal{D}\mid\theta)\big] + \mathrm{KL}\big[\,q(\theta\mid\phi)\,\|\,p(\theta)\,\big].$$
We are, thus, left to optimize the ELBO, derived in full in Appendix A, in the form $\mathbb{E}_{q(\theta\mid\phi)}[\log p(\mathcal{D}\mid\theta)] - \mathrm{KL}[\,q(\theta\mid\phi)\,\|\,p(\theta)\,]$ as a surrogate loss function. The ELBO is optimized numerically through gradient descent algorithms, which bring their own set of challenges with respect to gradient step size, directional sensitivity and exploding and vanishing gradients. We propose a stochastic optimal control algorithm for gradient descent optimization which controls the learning rate for every variational parameter ϕ based on the local surface of the Kullback–Leibler divergence.
For the remainder of this paper, we assume that the variational distribution q ( θ | ϕ ) for each parameter θ follows an independent normal or Laplace distribution with the location of the distribution μ and the scale σ as the variational parameters ϕ = { μ , σ } of the parameter θ . Since the scale parameter σ is constrained to be positive, we employ an additional reparameterization σ = log ( exp ( ρ ) + 1 ) which allows us to compute derivatives for ρ during optimization while keeping σ strictly positive.
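As an illustration of this parameterization, a minimal sketch of one reparameterized draw (our own helper, assuming PyTorch; not the authors' implementation):

```python
import torch

def sample_parameter(mu: torch.Tensor, rho: torch.Tensor) -> torch.Tensor:
    """Draw theta = mu + eps * sigma with sigma = log(1 + exp(rho)),
    keeping sigma strictly positive while rho stays unconstrained."""
    sigma = torch.nn.functional.softplus(rho)  # log(1 + exp(rho))
    eps = torch.randn_like(mu)                 # externalized noise eps ~ N(0, 1)
    return mu + eps * sigma                    # differentiable in mu and rho
```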

2.1. Stochastic Differential Equations for Frequentist Models

During optimization, the model parameters follow a dynamical process; in the following section, we show how this dynamic can be approximated by an SDE. We start with a frequentist version, in which no distribution is imposed on the parameters (unlike in a BNN) and the stochasticity is injected by the dataset and the samples drawn from it. Given a probabilistic model $p(y_m \mid x_m, \theta): \mathcal{X} \to \mathcal{Y}$ with a set of scalar parameters $\theta \in \mathbb{R}$, input $x_m \in \mathcal{X}$ and output $y_m \in \mathcal{Y}$, we compute the derivative of a scalar loss function $L_m: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ to obtain the per-data-point gradient $\nabla_\theta L_m$ with respect to each parameter of the probabilistic model. Gradient descent requires us to calculate the derivative of the loss over the entire training dataset $\mathcal{D}$ at each iteration, $\nabla_\theta L = \frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|} \nabla_\theta L_i$. This gradient has an associated variance:
$$\sigma_{\mathcal{D}} = \mathbb{V}_{\mathcal{D}}\big[\nabla_\theta L\big] = \frac{1}{|\mathcal{D}|}\sum_{i \in \mathcal{D}}\big(\nabla_\theta L_i - \nabla_\theta L\big)^2.$$
The computational cost of calculating gradients over entire training datasets is prohibitively expensive, which has favoured the use of mini-batch sampled gradients. Now, a mini-batch with $M \ll |\mathcal{D}|$ data points is sampled [7]. The assumption is that a mini-batch is computationally tractable while providing a representative sample of the training dataset on which to compute a sufficiently good gradient. We denote $\mathcal{D}_m$ as a single data sample and $\mathcal{D}_M$ as the mini-batch sample. The sampling of the mini-batches introduces stochasticity into the gradient estimation. The first and second moments, denoted as $\mathbb{E}[\cdot]$ and $\mathbb{V}[\cdot]$, for each scalar parameter θ of the mini-batch gradients are:
$$\nabla_\theta L_M = \frac{1}{M}\sum_{m=1}^{M} \nabla_\theta L_m$$
$$\nabla_\theta L = \mathbb{E}_{M \sim p(\mathcal{D})}\big[\nabla_\theta L_M\big]$$
$$\mathbb{V}_M\big[\nabla_\theta L_M\big] \approx \frac{1}{M}\,\mathbb{V}_{\mathcal{D}}\big[\nabla_\theta L\big] = \frac{1}{M}\,\sigma_{\mathcal{D}}.$$
It is easy to see that we can decrease the variance in the gradient estimation by increasing the size M of the mini-batch. The change in the parameters $\Delta\theta_t = \theta_{t+1} - \theta_t$ in gradient-based optimization consequently follows a noisy estimate of the true gradient $\nabla_\theta L$, which is distributed according to the first- and second-order moments in (4) and (5). The central limit theorem implies that the derivatives follow a Gaussian distribution, $\nabla_\theta L_M \sim \mathcal{N}(\nabla_\theta L, \frac{1}{M}\sigma_{\mathcal{D}})$ [8]. Given the distribution of the gradients, the evolution of the parameter through time with the learning rate η can be approximated by:
$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L_M$$
$$\Delta\theta_t = -\eta\,\nabla_\theta L + \eta\,\sqrt{\frac{\sigma_{\mathcal{D}}}{M}}\;\epsilon; \qquad \epsilon \sim \mathcal{N}(0, 1).$$
This formulation of the parameter dynamics during training has strong similarities with the Euler–Maruyama discretization of an Ito drift–diffusion process. Indeed, for an SDE with drift $b(\theta_t)$ and diffusion $\sigma(\theta_t)$:
$$d\theta_t = b(\theta_t)\,dt + \sigma(\theta_t)\,dW_t,$$
we have the associated Euler–Maruyama discretization:
$$\theta_{t+1} = \theta_t + b(\theta_t)\,\Delta t + \sqrt{\Delta t}\;\sigma(\theta_t)\,\epsilon.$$
We proceed by setting $\eta \triangleq \Delta t$, $b(\theta_t) \triangleq -\nabla_\theta L$ and $\sigma(\theta_t) \triangleq \sqrt{\eta\,\sigma_{\mathcal{D}}/M}$ to denote the equivalence, as further described in [8,9,10]. This identification allows us to apply the stochastic analysis of Ito drift–diffusion processes; see [11] for a more thorough discussion of the relationship between the learning rate and the diffusion of SGD. If we additionally consider learning in the infinitesimal limit of η → 0, we arrive at a formulation for the instantaneous change in time, which is given by
$$d\theta_t = -\nabla_\theta L\,dt + \sqrt{\frac{\eta\,\sigma_{\mathcal{D}}}{M}}\;dW_t,$$
which is a stochastic differential equation, where $dW_t$ is a Wiener process that originates from the noise term ϵ in the limit [12]. We can, thus, conclude that the change in the parameters $\theta_t$, for an infinitesimally small learning rate η, follows a stochastic differential equation in the form of an Ito drift–diffusion process over time, in which the sampling of the mini-batches contributes the diffusion [12].
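To make the discretized dynamics concrete, here is a toy simulation of the update $\Delta\theta_t = -\eta\nabla_\theta L + \eta\sqrt{\sigma_{\mathcal{D}}/M}\,\epsilon$ on a quadratic loss $L = \theta^2/2$ (our own sketch; the constant gradient variance is an assumption of the toy setup):

```python
import numpy as np

rng = np.random.default_rng(0)
eta, M, steps = 0.1, 32, 1000      # learning rate, mini-batch size, iterations
theta, sigma_D = 5.0, 4.0          # initial parameter and assumed gradient variance

for _ in range(steps):
    grad = theta                                             # drift: gradient of L = theta^2 / 2
    eps = rng.standard_normal()                              # mini-batch sampling noise
    theta += -eta * grad + eta * np.sqrt(sigma_D / M) * eps  # noisy Euler-Maruyama-style step
```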

2.2. Stochastic Differential Equations for Bayesian Models

In BNN models, each scalar parameter θ is modelled by a univariate distribution θ ∼ q(θ|ϕ). The use of the distribution q(θ|ϕ) extends the loss L to the form of the ELBO, which is additive in the mini-batch samples m and has a closed-form regularization term (the Kullback–Leibler divergence between the variational distribution q(θ|ϕ) and the prior p(θ)), the derivation of which can be found in Appendix A. Not only do we choose data samples at random but, concurrently, we sample the parameter θ from the distribution q(θ|ϕ) following the reparametrization trick. The parameter θ is, thus, a random variable itself. Consequently, the derivative $\nabla_\theta L_m$ for a single data sample m exhibits randomness originating both from the randomly sampled mini-batches and from the stochasticity of the parameters sampled from the variational distribution.
The uncertainty of the parameter derivative $\nabla_\theta L$ can be decomposed into aleatoric and epistemic uncertainty. The aleatoric uncertainty arises from the variance in the data and is irreducible, whereas the epistemic uncertainty arises from the uncertainty of the parameter θ and can, in principle, be reduced to zero, since the parameter θ can be sampled arbitrarily often.
Employing the tractable univariate variational distribution q(θ|ϕ) to achieve a scalable optimization, for a derivative $\nabla_\theta L_m$ which depends on the random parameter θ and the randomly chosen data sample m, we can decompose the uncertainty of $\nabla_\theta L$ into a sum of the data uncertainty and the parameter uncertainty, which follows from the law of total variance [13]:
$$\mathbb{V}\big[\nabla_\theta L\big] = \underbrace{\mathbb{V}_{p(\mathcal{D}_M)}\Big[\mathbb{E}_{q(\theta\mid\phi)}\big[\nabla_\theta L_m \mid \mathcal{D}_m\big]\Big]}_{\text{Aleatoric Uncertainty}} + \underbrace{\mathbb{E}_{p(\mathcal{D}_M)}\Big[\mathbb{V}_{q(\theta\mid\phi)}\big[\nabla_\theta L_m \mid \mathcal{D}_m\big]\Big]}_{\text{Epistemic Uncertainty}}.$$
In effect, we draw samples twice in BNNs to obtain ‘per data sample, per variational sample’ derivatives: data samples for the mini-batch and parameter samples θ from the variational distribution. The aleatoric uncertainty first computes the expectation over the ‘variationally sampled’ derivatives per data sample $\mathcal{D}_m$ and subsequently computes the variance over the mini-batch $\mathcal{D}_M$. The epistemic uncertainty first computes the variance over the ‘variationally sampled’ gradients and finally computes the expected derivative over the mini-batch $\mathcal{D}_M$.
It is important over which source of randomness the variance is computed in the uncertainty decomposition. The first term, $\mathbb{V}[\mathbb{E}[\nabla_\theta L_m \mid \mathcal{D}_m]]$, represents the aleatoric uncertainty and measures the data uncertainty: how much the average gradient varies over the dataset. The second term, $\mathbb{E}[\mathbb{V}[\nabla_\theta L_m \mid \mathcal{D}_m]]$, is called the epistemic uncertainty and measures the uncertainty originating from the model parameter distribution. For the epistemic uncertainty, the variance is computed over the source of parameter uncertainty and averaged over the data samples. In BNNs, this is explicitly modelled through the use of distributions for every parameter θ. Frequentist models exhibit only aleatoric uncertainty, as the variance over the deterministic gradients in the epistemic uncertainty evaluates to zero.
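This decomposition is straightforward to estimate from a grid of per-sample gradients. A minimal sketch (our own illustration), with grads[s, m] holding the gradient for parameter sample s ∼ q(θ|ϕ) and data sample m:

```python
import numpy as np

def decompose_gradient_variance(grads: np.ndarray):
    """Law-of-total-variance split of a (n_param_samples, n_data_samples) gradient grid."""
    aleatoric = grads.mean(axis=0).var()  # V_m[E_theta[grad | m]]: spread of the average gradient over data
    epistemic = grads.var(axis=0).mean()  # E_m[V_theta[grad | m]]: parameter-induced spread, averaged over data
    return aleatoric, epistemic           # the two terms sum exactly to grads.var()
```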
For a univariate variational distribution θ ∼ q(θ|ϕ), we can now formulate the stochastic differential equation (SDE) that governs the dynamics of the variational parameters ϕ = {μ, σ}.
The first modification, with respect to the SDE of a frequentist model in Equation (10), is that, for every parameter θ in the frequentist model, we have, in fact, two separate variational parameters ϕ = {μ, σ} in the Bayesian model, corresponding to the mean and scale of the variational distribution from which we sample θ. We, thus, have the two differential equations for the variational parameters {μ, σ}:
$$d\mu_t = -\mathbb{E}\big[\nabla_\mu L\big]\,dt + \mathbb{V}\big[\nabla_\mu L\big]^{\frac{1}{2}}\,dW_t$$
$$d\sigma_t = -\mathbb{E}\big[\nabla_\sigma L\big]\,dt + \mathbb{V}\big[\nabla_\sigma L\big]^{\frac{1}{2}}\,dW_t',$$
in which $d\sigma_t$ has a separate Wiener process $dW_t'$ due to the externalized noise in the reparameterization; the details can be found in Appendix C. The second modification is the separation of uncertainty, given the additional source of uncertainty from the distribution q(θ|ϕ). We can, thus, employ the uncertainty decomposition to obtain
$$d\mu_t = -\mathbb{E}\big[\nabla_\mu L\big]\,dt + \Big(\mathbb{V}\big[\mathbb{E}[\nabla_\mu L_m \mid \mathcal{D}_m]\big] + \mathbb{E}\big[\mathbb{V}[\nabla_\mu L_m \mid \mathcal{D}_m]\big]\Big)^{\frac{1}{2}}\,dW_t$$
$$d\sigma_t = -\mathbb{E}\big[\nabla_\sigma L\big]\,dt + \Big(\mathbb{V}\big[\mathbb{E}[\nabla_\sigma L_m \mid \mathcal{D}_m]\big] + \mathbb{E}\big[\mathbb{V}[\nabla_\sigma L_m \mid \mathcal{D}_m]\big]\Big)^{\frac{1}{2}}\,dW_t'.$$
We can now see that the only difference in the SDEs that govern the training dynamics in frequentist and Bayesian models is the added epistemic uncertainty in the diffusion term of the Bayesian stochastic differential equation.
Figure 1 exemplifies the different terms in the Bayesian stochastic differential equation and how the uncertainty in stochastic gradient descent for a variational distribution can be decomposed, for a toy example in one dimension. The details of the derivation of the Bayesian stochastic differential equation are given in Appendix C.

3. Stochastic Control for Learning Rates

Having derived and characterized the training dynamics of the variational parameters $x_t = (\mu_t, \sigma_t)$ from first principles, we now construct our proposed stochastic optimal control algorithm for BNNs. Our approximation methodology relies on the limit η → 0. We first introduce a new control variable that respects this limit, namely an adaptive diagonal control matrix U that adjusts the learning rates during training, which leads to the full SDE for the training dynamics:
$$dx_t = -U\,\mathbb{E}\big[\nabla_x L\big]\,dt + U\sqrt{\eta}\;\mathbb{V}\big[\nabla_x L\big]^{\frac{1}{2}}\,dW_t,$$
which is an Ito drift–diffusion process in which both the drift and the diffusion are controlled by the diagonal control matrix U and in which the diffusion term is estimated from the variance of the gradients; we essentially scale it on a per-parameter basis with the control matrix U. We clip the individual control parameters $U_i$ on the diagonal of U to the range $U_i \in [0, 1]$, bounding the step size to $\eta U_i \in [0, \eta]$. We pose our problem as follows: given the gradients $\nabla_x L$, how do we choose the policy for adjusting the control parameter U to minimize the loss at the end of training? Essentially,
$$\min_U\;\mathbb{E}\big[L(X_T)\big] \quad \text{subject to (19)},$$
i.e., provided that X follows Equation (19). The general optimal control formalism requires us to minimize the cost C for the optimal control parameter U, accumulated over time $t \in [t_0, t_1]$, together with the final cost $C(x_t, U, t_1)$, under the constraint of the dynamics of $x_t$.

3.1. Simplifying the Loss

It is known that the loss surface L of deep neural network architectures is highly non-linear, which makes global optimization nearly impossible. In a similar way to [14,15], we therefore approximate the loss surface locally with a quadratic function of the form
$$g(x_t) = \frac{1}{2}\,(x_t - b)^\top A\,(x_t - b).$$
The quadratic approximation, as seen in Figure 2, trades the global loss surface for a local approximation in which the relevant quantities can be computed optimally in the sense of the local approximation. This simplification is chosen such that a tractable stochastic optimal control algorithm can be derived. Intuitively, given a local quadratic approximation of the loss surface, the offset parameter b denotes the optimum of the quadratic approximation $g(x_t)$, whereas the curvature A denotes how flat or steep the loss surface is locally.
Consequently, we want to move the variational parameters { μ t , ρ t } in the observable state vector x t as close as possible to this local optimum which coincides with the offset parameter b.
The curvature A and the offset parameter b of the local quadratic approximation of the loss surface can be conveniently calculated via ordinary least squares with the gradient relation (see Appendix D for details)
$$\nabla_\phi L \approx \nabla_x g(x_t) = A\,(x_t - b).$$
We maintain running averages of the gradients and the parameters to prevent abrupt changes in the control. The quadratic approximation of the loss surface is maintained for each parameter distribution in the BNN architectures.
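A compact sketch of this estimation step (our own illustration; the paper's exact running-average scheme may differ), using the closed-form least-squares moments derived in Appendix F for the two-dimensional state x = (μ, ρ):

```python
import numpy as np

def fit_local_quadratic(xs: np.ndarray, grads: np.ndarray):
    """Fit grad L ≈ A (x - b) from recent iterates (Appendix F).
    xs, grads: arrays of shape (T, 2) holding (mu, rho) values and their gradients."""
    Ex, Eg = xs.mean(axis=0), grads.mean(axis=0)
    cov_gx = grads.T @ xs / len(xs) - np.outer(Eg, Ex)  # E[g x^T] - E[g] E[x]^T
    cov_xx = xs.T @ xs / len(xs) - np.outer(Ex, Ex)     # E[x x^T] - E[x] E[x]^T
    A = cov_gx @ np.linalg.inv(cov_xx)                  # A = Cov(g, x) Cov(x, x)^{-1}
    b = Ex - np.linalg.inv(A) @ Eg                      # b = E[x] - A^{-1} E[grad]
    return A, b
```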

3.2. Our Control Problem

Taking inspiration from the local quadratic approximation $g(x_t)$, we wish to minimize the distance of the observable state variables $x_t$ to the optimum b of the quadratic approximation g(x). We introduce an auxiliary variable M, which allows us to simplify the classical control problem that would otherwise require the solution of the Hamilton–Jacobi–Bellman equation (Appendix E).
By definition, $M_{ij} = (x_i - b_i)(x_j - b_j)$ is a stochastic variable. We can obtain a relationship between M and the approximation of the error g:
$$g(x_t) = \frac{1}{2}\,M_{11}A_{11} + M_{12}A_{12} + \frac{1}{2}\,M_{22}A_{22}.$$
We make use of Ito’s lemma, detailed in Appendix B, to obtain the dynamics of the error dM, and we define the diffusion matrix $D = \eta\,\mathbb{V}[\nabla_{x_t} L]$, which gives us
$$dM = \Big(-\nabla_x M\,U\,\nabla_x L + \frac{\eta}{2}\,\mathrm{Tr}\big[D\,U^\top \nabla_x^2 M\,U\big]\Big)\,dt + \nabla_x M^\top\,U\,\sqrt{\eta}\;\mathbb{V}\big[\nabla_x L\big]^{\frac{1}{2}}\,dW,$$
which is again an Ito drift–diffusion process and for which we provide the relevant gradient calculations in Appendix D. Separating the drift and evaluating the matrix derivatives, the details of which are provided in Appendix D, we obtain
$$f(M, U, t) = -\big(U A M + M A U\big) + \eta\,U D U.$$
The dynamics of the error $f(M, U, t)$ denote the drift of the Ito drift–diffusion process dM and represent the average dynamics of the error function over time, given the dynamics $dx_t$ of the parameters $x_t = (\mu_t, \rho_t)$. The task we want to achieve is to minimize the loss in (23) in such a way that we arrive at the optimum after the control period $t \in [t_0, t_1]$:
$$C(M, U, t_1) = \frac{1}{2}\int_{t_0}^{t_1} \mathrm{Tr}\big[A\,\dot{M}\big]\,dt,$$
where A is the curvature of the local approximation $g(x_t)$. The motivation for this formulation is that M measures the distance of the state vector $x_t$ to the local optimum b in the quadratic approximation, scaled by the curvature A. Thus, minimizing the distance M at each time step is equivalent to minimizing the entire cost C. The full derivation can be found in Appendix E.
The optimization of the final cost C can be solved by minimizing the cost $\mathrm{Tr}[A\dot{M}]$ at each step, which, in turn, minimizes C. Taking the derivative of $\frac{1}{2}\mathrm{Tr}[A\dot{M}]$ with respect to the individual control parameters $U_{ii}$ and setting it to zero gives us
$$U = \frac{(A \circ D)^{-1}\,\mathrm{Diag}\big[A M A\big]}{\eta},$$
where $U = [U_{11}, U_{22}]^\top$ is a vector of the corresponding control parameters, ∘ is the Hadamard product and Diag[·] extracts the diagonal elements of a matrix. For indefinite matrices A, we project U onto the eigenvector corresponding to the positive eigenvalue to ensure that the optimality condition is met [16]. The full derivation can be reviewed in Appendix E.
We compute the control parameter U jointly for the variational parameters {μ, σ}, which results in the matrices A, D and M being in $\mathbb{R}^{2\times 2}$. The inversion of the 2 × 2 matrices can be performed analytically, as detailed at the end of Appendix F. Comparing the operations required per parameter in ADAM (addition, subtraction, division, etc.) with those in StochControlSGD (mostly 2 × 2 matrix multiplications and analytical inversions), we arrive at an approximately 2.5× increase in computations for StochControlSGD compared to ADAM. It is important to note that ADAM has to be applied to both variational parameters independently, whereas StochControlSGD computes the control parameters jointly, thus saving computation.
The StochControlSGD algorithm is detailed in its entirety in Algorithm 1.
Algorithm 1: StochControlSGD
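Since the algorithm figure itself is not reproduced here, the following is a compact sketch of one StochControlSGD step for a single (μ, ρ) pair, assembled from Sections 3.1 and 3.2 (the helper names are illustrative, not the authors' code):

```python
import numpy as np

def stoch_control_step(x, grad_samples, A, b, eta):
    """One controlled update for x = (mu, rho), given per-sample gradients of the
    ELBO (shape (S, 2)) and a local quadratic fit (A, b) as in fit_local_quadratic."""
    g = grad_samples.mean(axis=0)
    D = eta * np.cov(grad_samples.T)                     # diffusion matrix D = eta * V[grad]
    M = np.outer(x - b, x - b)                           # squared distance to the local optimum b
    U = np.linalg.inv(A * D) @ np.diag(A @ M @ A) / eta  # U = (A ∘ D)^{-1} Diag[A M A] / eta
    U = np.clip(U, 0.0, 1.0)                             # clip to bound the step size to [0, eta]
    return x - eta * U * g                               # individually controlled gradient step
```

The projection step for indefinite A [16] is omitted for brevity.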

4. Experiments

We evaluate the proposed stochastic optimal control SGD, which we abbreviate as StochControlSGD, on the MNIST [17], FashionMNIST [18] and CIFAR10 [19] datasets. In Table 1, we compare the final performance of StochControlSGD with the performance of ADAM, controlled SGD (cSGD), SGD and SGD with cosine learning rate scheduling, as proposed by [15].
Learning rate scheduling was chosen as cosine annealing, with the learning rate initialized at $10^{-1}$ and decreased to $10^{-5}$. The experimental setup is detailed in Appendix G.
ADAM provides a strong baseline for the frequentist models when the learning rate is chosen to be appropriately small. Following the notion of learning rate scheduling, we initialized the learning rate of both cSGD and StochControlSGD as η = 0.5 and the control as $u_0 = 1.0$. Both cSGD and StochControlSGD are able to adaptively and individually set their control parameters over the course of optimization.
Additionally, we plot the convergence of the ADAM, cSGD and StochControlSGD in Figure 3. The results are portrayed more concisely in Figure 4, for which five runs for each learning rate and each optimizer are combined in a boxplot format.
In contrast to cSGD and StochControlSGD, ADAM does not have the ability to modify the a priori chosen learning rate η. Coupled with the first- and second-order moments from which the surrogate gradient is computed, ADAM is sensitive to large learning rates, with significantly worse performance at η = 0.5 and η = 0.1. The larger learning rates do not pose a problem for the optimal control optimizers cSGD and StochControlSGD, as they can adaptively and individually control their learning rates. We consider only optimizers which rely on the gradient information to accelerate the gradient descent and forego learning rate scheduling algorithms which incorporate performance information, such as schedulers which decrease the learning rate when a performance plateau is detected.
Among the optimal control optimizers, StochControlSGD provides tighter bounds on the lower and upper performance while offering higher performance overall. Especially on the CIFAR10 dataset in Figure 3, StochControlSGD improves upon cSGD with better absolute performance and less variation between the largest learning rate of η = 0.5 and the smallest learning rate of η = 0.01. Furthermore, it can be seen that the performance of StochControlSGD and cSGD improves with larger learning rates. As can be seen in Figure 3, the largest learning rate of η = 0.5 yields, in fact, their best performance, whereas it yields the worst performance for ADAM.
The direct comparison of ADAM with StochControlSGD connects to recent work by [20] on the fundamental optimization of deep Bayesian models with gradient optimization algorithms developed for frequentist models. Due to its reliance on normal priors, the methodology of BNNs is limited in the amount of information in the uncertainty that is relevant to the learning optimization. Modern frequentist deep neural networks rely on custom layer architectures, such as BatchNorm [21], and additional data augmentation schemes, which have no clear Bayesian interpretation, raising additional questions about the applicability of porting frequentist ideas, such as layer designs, from deep neural networks to their Bayesian formulations.

Behaviour of Control Parameter

The evolution of the control parameter U allows insight into the descent and fluctuation behaviour of the variational parameters $\mu_t$ and $\rho_t$ with respect to the ELBO. More specifically, it allows us to shed some light on the dynamics between the data log-likelihood and the KL divergence.
The data log-likelihood aims at minimizing the uncertainty parameter $\rho_t$ of each variational distribution as much as possible. The gradients of the KL divergence, in turn, prioritize an uncertainty parameter which corresponds to the prior, which we chose as N(0, I). The relative weighting of the data log-likelihood and the KL divergence with respect to the number of samples in the ELBO heavily favours the gradients of the data log-likelihood during the descent phase for large datasets. As the gradients of the KL divergence are independent of the data by definition, the importance of their gradients increases proportionally as the gradients of the converging data log-likelihood diminish.
The uncertainty parameters were initialized to $\rho_0 = -6.9$ in all our experiments, which allows the BNN to increase the uncertainty of select parameters if the KL divergence dominates the gradients of the specific parameter in question. The intuition is that deep neural networks, in fact, only use a few weights [22], and, thus, the uncertainty parameters can be maximized by the KL divergence for parameters for which the gradients of the KL divergence are stronger than the gradients originating from the data log-likelihood.
We can observe this behaviour in Figure 5, where the median control parameters of $\mu_t$ decrease quickly alongside the control parameters of the uncertainty parameter $\rho_t$. However, as the data log-likelihood converges, the median control parameter of the uncertainty parameter is increased, as the relative importance of the gradients originating from the data log-likelihood decreases and the gradients from the KL divergence dominate.
This indicates two different dynamical regimes in the optimization of the uncertainty parameter of the variational distribution. The mean control parameter remains small during the descent and fluctuation dynamics whereas the uncertainty control is, in fact, increased by the stochastic control optimization algorithm in the fluctuation phase.

5. Related Work

The authors of [15] derived an optimal control algorithm for frequentist models which incorporated the variance into the learning rate scheduling. In [23], it was argued that, instead of decreasing the learning rate in the dissipation phase of the optimization, the batch size should be increased to reduce the uncertainty in the gradients. The authors of [24] and [25] examined adaptive learning rate schemes for changing loss surfaces. The idea of a priori cyclical scaling of the learning rates was pioneered in [26].
The use of the reparameterization of the Gaussian variational distribution in deep Bayesian neural networks to arrive at a scalable optimization algorithm based on variational inference was proposed in [27]. The authors of [28] examined the behaviour of DropOut [29] as approximate Bayesian inference. The authors of [30] demonstrated that the dropout rate can be learned as an approximate uncertainty parameter.

6. Conclusions

We have examined the potential for incorporating Bayesian uncertainty information directly into a learning algorithm. For this, we derived the SDEs for the variational parameters from first principles. With both aleatoric and epistemic uncertainty present in the optimization process, we decomposed the diffusion parameter of the SDE into its data and parameter uncertainties.
Having identified the underlying dynamics of the variational parameters during optimization, we proceeded to formulate a stochastic optimal control algorithm for Bayesian models which was able to incorporate the Bayesian uncertainty information into an adaptive and selective learning rate schedule. An analysis of the control parameters indicated separate dynamical behaviours during optimization of the mean and uncertainty parameters. This can be investigated further to examine the dynamics of the ELBO as a loss function for other probabilistic models.

Author Contributions

Conceptualization, C.O. and M.O.; methodology, L.W. and C.O.; software, L.W.; validation, L.W., C.O. and M.O.; formal analysis, L.W.; investigation, L.W.; writing—original draft preparation, L.W. and C.O.; writing—review and editing, M.O.; visualization, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

The research of M.O. was partially funded by the Deutsche Forschungsgemeinschaft (DFG)—Project-ID 318763901—SFB1294. The research of L.W. and C.O. was funded by the BIFOLD-Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A).

Data Availability Statement

The training data and code base used in this study are available upon reasonable request from the authors.

Acknowledgments

Ludwig Winkler would like to thank Klaus-Robert Müller for fruitful discussions and proof reading and Jason Salomon Rinnert for technical support. We acknowledge support by the German Research Foundation and the Open Access Publication Fund of TU Berlin.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Evidence Lower Bound

In variational inference for Bayesian neural networks, we want to minimize the Kullback–Leibler divergence between the true posterior distribution p ( θ | D ) and the variational distribution q ( θ | ϕ ) , which is easy to work with and from which we can sample easily:
$$\arg\min_{\phi}\;\mathrm{KL}\big[\,q(\theta\mid\phi)\,\|\,p(\theta\mid\mathcal{D})\,\big]$$
$$= \arg\min_{\phi}\;\underbrace{\mathbb{E}_{q(\theta\mid\phi)}\Big[\log\frac{q(\theta\mid\phi)}{p(\theta\mid\mathcal{D})}\Big]}_{\geq\,0}.$$
The posterior distribution p ( θ | D ) conditioned on the data D can be rewritten according to the Bayes theorem as
$$p(\theta\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\theta)\,p(\theta)}{p(\mathcal{D})},$$
with which we can rewrite the Kullback–Leibler divergence as
$$\mathbb{E}_{q(\theta\mid\phi)}\Big[\log\frac{q(\theta\mid\phi)}{p(\theta\mid\mathcal{D})}\Big]$$
$$= \mathbb{E}_{q(\theta\mid\phi)}\Big[\log\frac{q(\theta\mid\phi)}{p(\mathcal{D}\mid\theta)\,p(\theta)}\Big] + \log p(\mathcal{D})$$
$$= \mathbb{E}_{q(\theta\mid\phi)}\Big[\log\frac{q(\theta\mid\phi)}{p(\theta)} - \log p(\mathcal{D}\mid\theta)\Big] + \log p(\mathcal{D})$$
$$= \mathrm{KL}\big[\,q(\theta\mid\phi)\,\|\,p(\theta)\,\big] - \mathbb{E}_{q(\theta\mid\phi)}\big[\log p(\mathcal{D}\mid\theta)\big] + \log p(\mathcal{D}).$$
Since the Kullback–Leibler divergence is greater than or equal to zero at all times, we have
$$0 \leq \mathrm{KL}\big[\,q(\theta\mid\phi)\,\|\,p(\theta)\,\big] - \mathbb{E}_{q(\theta\mid\phi)}\big[\log p(\mathcal{D}\mid\theta)\big] + \log p(\mathcal{D})$$
$$-\log p(\mathcal{D}) \leq \mathrm{KL}\big[\,q(\theta\mid\phi)\,\|\,p(\theta)\,\big] - \mathbb{E}_{q(\theta\mid\phi)}\big[\log p(\mathcal{D}\mid\theta)\big]$$
$$\log p(\mathcal{D}) \geq \mathbb{E}_{q(\theta\mid\phi)}\big[\log p(\mathcal{D}\mid\theta)\big] - \mathrm{KL}\big[\,q(\theta\mid\phi)\,\|\,p(\theta)\,\big].$$
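For the independent normal variational distributions with the N(0, I) prior used in our experiments, the KL term in this bound is available per parameter in closed form; a small sketch (our own helper):

```python
import numpy as np

def kl_normal_std(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), the per-parameter
    regularizer in the ELBO for a standard-normal prior."""
    return 0.5 * (sigma**2 + mu**2 - 1.0) - np.log(sigma)
```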

Appendix B. Ito’s Lemma

Let X t be an Ito drift-diffusion process that satisfies the SDE
$$dX_t = \mu_t\,dt + \sigma_t\,dW_t,$$
where $W_t$ is a Wiener process. For a scalar function $f(X_t, t)$, which is twice differentiable in $X_t$ and once differentiable in time t, we can apply the Taylor expansion up to second order to obtain
$$df(X_t, t) = \partial_t f(X_t, t)\,dt + \partial_X f(X_t, t)\,dX_t + \frac{1}{2}\,\partial_X^2 f(X_t, t)\,dX_t^2.$$
We can then substitute the Ito drift–diffusion process $dX_t$ into the Taylor expansion. In the limit dt → 0, the terms $dt^2$ and $dt\,dW_t$ tend to zero faster than $dW_t^2 = dt$ due to their higher exponent. Multiplying out the terms and setting the relevant infinitesimal terms to zero, we obtain
$$df(X_t, t) = \partial_t f(X_t, t)\,dt + \partial_X f(X_t, t)\,dX_t + \frac{1}{2}\,\partial_X^2 f(X_t, t)\,dX_t^2$$
$$= \partial_t f\,dt + \partial_X f\,(\mu_t\,dt + \sigma_t\,dW_t) + \frac{1}{2}\,\partial_X^2 f\,(\mu_t\,dt + \sigma_t\,dW_t)^2$$
$$= \partial_t f\,dt + \partial_X f\,(\mu_t\,dt + \sigma_t\,dW_t) + \frac{1}{2}\,\partial_X^2 f\,\big(\underbrace{\mu_t^2\,dt^2 + 2\,\mu_t\sigma_t\,dt\,dW_t}_{=\,0} + \underbrace{\sigma_t^2\,dW_t^2}_{=\,\sigma_t^2\,dt}\big)$$
$$= \partial_t f\,dt + \partial_X f\,(\mu_t\,dt + \sigma_t\,dW_t) + \frac{1}{2}\,\partial_X^2 f\,\sigma_t^2\,dt$$
$$= \Big(\partial_t f + \mu_t\,\partial_X f + \frac{1}{2}\,\sigma_t^2\,\partial_X^2 f\Big)\,dt + \sigma_t\,\partial_X f\,dW_t,$$
which is again an Ito drift-diffusion process, albeit with more complex drift and diffusion terms.
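As a quick worked example of the lemma (our own check, not from the original text), take $f(X_t) = X_t^2$, so that $\partial_t f = 0$, $\partial_X f = 2X_t$ and $\partial_X^2 f = 2$:
$$d\big(X_t^2\big) = \big(2\,\mu_t\,X_t + \sigma_t^2\big)\,dt + 2\,\sigma_t\,X_t\,dW_t,$$
which is the scalar analogue of the quadratic error dynamics dM used in Section 3.2.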

Appendix C. Bayesian Stochastic Differential Equation of a Variational Distribution

To validate the derivation of the SDE that governs the dynamics of the variational parameters, we evaluate the proposed SDE on a simplified example before moving on to more general Bayesian models, such as deep Bayesian neural networks.
In this intuitive example, we model a parameter θ with the distribution $\theta \sim \mathcal{N}(\mu_t, \sigma_t(\rho_t)^2)$ and reparameterize the uncertainty parameter $\sigma_t \in \mathbb{R}_+$ with an unbounded surrogate parameter $\rho_t \in \mathbb{R}$ [27,31,32,33]. The subscript t indicates the parameters at time t during the optimization process. By making use of the reparameterization trick, we obtain a sampling scheme which is differentiable with respect to the parameters $\phi_t = \{\mu_t, \rho_t\}$. For a normal distribution, this takes the form
$$\theta_t = \mu_t + \epsilon\,\sigma_t(\rho_t) = \mu_t + \epsilon\,\log(1 + \exp(\rho_t)),$$
with ϵ ∼ N(0, 1) being the externalized stochasticity.
The reparameterization allows us to compute the derivatives for any differentiable loss function L with respect to the variational parameters ϕ t . In our simplified case, we proceed with the quadratic loss function
$$L = \frac{1}{2}\,\theta_t^2 = \frac{1}{2}\,\big(\mu_t + \epsilon\,\log(1 + \exp(\rho_t))\big)^2,$$
for which we can compute the gradients
$$\nabla_\mu L = \mu_t + \epsilon\,\sigma_t(\rho_t)$$
$$\nabla_\rho L = \mu_t\,\epsilon\,\sigma(\rho_t) + \epsilon^2\,\sigma_t(\rho_t)\,\sigma(\rho_t),$$
where σ ( · ) is the sigmoid function and σ t ( · ) is the reparameterization of the uncertainty parameter ρ t .
The first- and second-order moments of the gradients are
$$\mathbb{E}\big[\nabla_\mu L\big] = \mu_t$$
$$\mathbb{E}\big[\nabla_\rho L\big] = \sigma_t(\rho_t)\,\sigma(\rho_t)$$
and, using the fact that, by definition, $\mathbb{E}[\epsilon^2] = 1$,
$$\mathbb{V}\big[\nabla_\mu L\big] = \sigma_t(\rho_t)^2$$
$$\mathbb{V}\big[\nabla_\rho L\big] = \mathbb{V}\big[\mu_t\,\epsilon\,\sigma(\rho_t)\big] + \mathbb{V}\big[\epsilon^2\,\sigma_t(\rho_t)\,\sigma(\rho_t)\big] = \mu_t^2\,\sigma(\rho_t)^2 + \mathbb{V}\big[\epsilon^2\big]\,\sigma_t(\rho_t)^2\,\sigma(\rho_t)^2.$$
As the loss function L does not include any data uncertainty, the aleatoric uncertainty is reduced to zero. We can now expand the loss function L to include aleatoric uncertainty from the data. For this, we sample from two loss functions $L_m$, which are the original cost function L shifted by ±b; their purpose is to simulate the data uncertainty that occurs in gradient optimization with mini-batches:
$$L_1 = \frac{1}{2}\,(\theta_t - b)^2 - 1 = \frac{1}{2}\,\big(\mu_t + \epsilon\,\log(1 + e^{\rho_t}) - b\big)^2 - 1$$
$$L_2 = \frac{1}{2}\,(\theta_t + b)^2 - 1 = \frac{1}{2}\,\big(\mu_t + \epsilon\,\log(1 + e^{\rho_t}) + b\big)^2 - 1.$$
Furthermore, we shift each loss function $L_m$ by −1, such that the local optima of the loss functions are lower than the optimum that balances both loss functions. By computing the gradients from randomly sampled $L_1$ and $L_2$ during training, we include a simplified version of mini-batch sampling from a dataset with two samples. Both $L_1$ and $L_2$ provide local minima at ±b, while the global optimum still lies in the middle between them at 0.
We can now estimate the aleatoric variance $\mathbb{E}\big[\mathbb{V}[\nabla_\phi L_m \mid L_m]\big]$ in the gradient with
$$\mathbb{E}\big[\mathbb{V}[\nabla_\mu L_m \mid L_m]\big] = \frac{1}{2}\sum_{i=1}^{2}\big(\nabla_\mu L_i(\mu_t, \rho_t) - \nabla_\mu L(\mu_t, \rho_t)\big)^2 = b^2$$
$$\mathbb{E}\big[\mathbb{V}[\nabla_\rho L_m \mid L_m]\big] = \frac{1}{2}\sum_{i=1}^{2}\big(\nabla_\rho L_i(\mu_t, \rho_t) - \nabla_\rho L(\mu_t, \rho_t)\big)^2 = b^2\,\epsilon^2\,\sigma(\rho_t)^2.$$
We can, thus, define the Ito drift–diffusion process for the variational parameters with the decomposed diffusion as
$$d\mu_t = -\mathbb{E}\big[\nabla_\mu L_m\big]\,dt + \Big(\mathbb{V}\big[\mathbb{E}[\nabla_\mu L_m \mid L_m]\big] + \mathbb{E}\big[\mathbb{V}[\nabla_\mu L_m \mid L_m]\big]\Big)^{\frac{1}{2}}\,dW_t$$
$$d\rho_t = -\mathbb{E}\big[\nabla_\rho L_m\big]\,dt + \Big(\mathbb{V}\big[\mathbb{E}[\nabla_\rho L_m \mid L_m]\big] + \mathbb{E}\big[\mathbb{V}[\nabla_\rho L_m \mid L_m]\big]\Big)^{\frac{1}{2}}\,dW_t'.$$
For this simplified example, we chose to derive a decoupled set of SDEs. In both cases, the chain rule passes the gradients through the sampled parameter $\theta_t$. As it turns out, and as we make use of this in the subsequently derived stochastic optimal control optimization algorithm, both $\nabla_\mu L$ and $\nabla_\rho L$ share the gradient $\nabla_\theta L$ when applying the chain rule:
$$\nabla_\mu L = \nabla_\theta L\;\partial_\mu\theta_t$$
$$\nabla_\rho L = \nabla_\theta L\;\partial_\sigma\theta_t\;\partial_\rho\sigma.$$
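The moments derived above are easy to verify numerically. A minimal Monte Carlo check of $\mathbb{E}[\nabla_\mu L] = \mu_t$ and $\mathbb{V}[\nabla_\mu L] = \sigma_t(\rho_t)^2$ (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, rho = 1.5, -1.0
sigma = np.log1p(np.exp(rho))   # softplus reparameterization sigma_t(rho_t)
eps = rng.standard_normal(100_000)

grad_mu = mu + eps * sigma      # per-sample gradient of L = theta^2 / 2 w.r.t. mu
print(grad_mu.mean(), mu)       # matches E[grad_mu] = mu_t
print(grad_mu.var(), sigma**2)  # matches V[grad_mu] = sigma_t(rho_t)^2
```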

Appendix D. Gradient Derivations

We employ the Einstein summation convention to derive the relevant gradients. We use the lower index for the horizontal indices and the upper index for the vertical indices of a matrix.
$$f(z) = z^\top A z = z_i\,A^i_j\,z^j$$
$$\nabla f(z) = A z = A^i_j\,z^j$$
$$M = z z^\top = z^i z_j$$
$$\nabla_z M = \frac{\partial M^i_j}{\partial z^l} = z_j\,\delta^i_l + z^i\,\delta^l_j$$
$$\nabla M\,U\,\nabla f = \big(z_j\,\delta^i_l + z^i\,\delta^l_j\big)\,U^l_m\,A^m_n\,z^n$$
$$= z_j\,\delta^i_l\,U^l_m\,A^m_n\,z^n + z^i\,\delta^l_j\,U^l_m\,A^m_n\,z^n$$
$$= z_j\,U^i_m\,A^m_n\,z^n + z^i\,U^j_m\,A^m_n\,z^n$$
$$= U^i_m\,A^m_n\,z^n\,z_j + z^i\,z^n\,A^m_n\,U^j_m$$
$$= U A M + M A U$$
$$\frac{\partial^2 M^i_j}{\partial z^l\,\partial z^m} = \delta^i_l\,\delta^j_m + \delta^i_m\,\delta^j_l$$
$$\mathrm{Tr}\big[D\,U\,\nabla^2 M\,U\big] = \mathrm{Tr}\big[D_{pq}\,U^p_m\,(\delta^i_l\,\delta^j_m + \delta^i_m\,\delta^j_l)\,U^l_k\big]$$
$$= \mathrm{Tr}\big[D_{pq}\,U^p_m\,\delta^j_m\,\delta^i_l\,U^l_k + D_{pq}\,U^p_m\,\delta^i_m\,\delta^j_l\,U^l_k\big]$$
$$= \mathrm{Tr}\big[D_{pq}\,U^p_j\,U^i_k + D_{pq}\,U^p_i\,U^j_k\big]$$
$$= \mathrm{Tr}\big[D_{pq}\,U^p_j\,U^i_k\big] + \mathrm{Tr}\big[D_{pq}\,U^p_i\,U^j_k\big]$$
$$= D_{pq}\,U^p_j\,U^i_q + D_{pq}\,U^p_i\,U^j_q$$
$$= U^i_q\,D_{pq}\,U^p_j + U^j_q\,D_{pq}\,U^p_i$$
$$= U D U + (U D U)^\top$$
$$= 2\,U D U$$

Appendix E. Stochastic Control

We have the dynamics given as
$$\dot{M} = -\big(U A M + M A U\big) + \eta\,U D U.$$
We want to minimize the final cost
$$C(M, U, t_1) = \frac{1}{2}\,\mathrm{Tr}\big[A M\big].$$
Alternatively, we can directly minimize the dynamics M ˙ over time, such that
$$\min_U C = \min_U \frac{1}{2}\int_{t_0}^{t_1}\mathrm{Tr}\big[A\dot{M}\big]\,dt,$$
which requires us to minimize Tr [ A M ˙ ] at every step. Both A and M should be positive semi-definite to guarantee a bound from below.
Optimizing $\frac{1}{2}\mathrm{Tr}[A\dot{M}]$ gives us
$$\frac{1}{2}\,\mathrm{Tr}\big[A\dot{M}\big] = -\frac{1}{2}\,\mathrm{Tr}\big[A U A M + A M A U\big] + \frac{\eta}{2}\,\mathrm{Tr}\big[A U D U\big]$$
$$= -\frac{1}{2}\,\mathrm{Tr}\big[A U A M\big] - \frac{1}{2}\,\mathrm{Tr}\big[A M A U\big] + \frac{\eta}{2}\,\mathrm{Tr}\big[A U D U\big]$$
$$= -\frac{1}{2}\,\mathrm{Tr}\big[A M A U\big] - \frac{1}{2}\,\mathrm{Tr}\big[A M A U\big] + \frac{\eta}{2}\,\mathrm{Tr}\big[A U D U\big]$$
$$= -\mathrm{Tr}\big[A M A U\big] + \frac{\eta}{2}\,\mathrm{Tr}\big[A U D U\big].$$
The control matrix U is diagonal. The trace is defined as $\mathrm{Tr}[AB] = A^i_j\,B^j_i$ and the noise matrix is symmetric by definition, so $D_{ij} = D_{ji}$. In index notation, we can arbitrarily shuffle the individual terms in a product to obtain generalizable formulations in terms of matrices and vectors. This gives us the index notation
$$-\mathrm{Tr}\big[A M A U\big] + \frac{\eta}{2}\,\mathrm{Tr}\big[A U D U\big] = -\sum_i (A M A)_{ii}\,U_{ii} + \frac{\eta}{2}\sum_{i,j} A_{ij}\,U_{jj}\,D_{ij}\,U_{ii}$$
$$= -\sum_i (A M A)_{ii}\,U_{ii} + \frac{\eta}{2}\sum_{i,j} U_{ii}\,\underbrace{A_{ij}\,D_{ij}}_{Q_{ij}}\,U_{jj}$$
$$= -\sum_i (A M A)_{ii}\,U_{ii} + \frac{\eta}{2}\sum_{i,j} U_{ii}\,Q_{ij}\,U_{jj}.$$
Taking the derivative with respect to the control parameters $U_{ii}$ and setting it to zero yields
$$0 = -(A M A)_{ii} + \eta\sum_j Q_{ij}\,U_{jj}$$
$$0 = -\mathrm{Diag}\big[A M A\big] + \eta\,Q\,U$$
$$U = \frac{Q^{-1}\,\mathrm{Diag}\big[A M A\big]}{\eta} = \frac{(A \circ D)^{-1}\,\mathrm{Diag}\big[A M A\big]}{\eta},$$
where Diag[A] is a vector of the diagonal elements of the matrix A and $U = [U_1, U_2]^\top$ is a vector of the individual control parameters $U_1$ and $U_2$.

Appendix F. Estimation of Local Quadratic Approximation

We compute the offset b and the curvature A via the following relations:
$$\nabla_b\,\frac{1}{2}\,\mathbb{E}\big[(\nabla_\phi L - A(x - b))^2\big] = \nabla_b\,\frac{1}{2}\,\mathbb{E}\big[(\nabla_\phi L - A x + A b)^2\big]$$
$$= \mathbb{E}\big[A^\top(\nabla_\phi L - A x + A b)\big]$$
$$= A^\top\,\mathbb{E}\big[\nabla_\phi L\big] - A^\top A\,\mathbb{E}\big[x\big] + A^\top A\,b \overset{!}{=} 0$$
$$A^\top A\,b = A^\top A\,\mathbb{E}\big[x\big] - A^\top\,\mathbb{E}\big[\nabla_\phi L\big]$$
$$b = \mathbb{E}\big[x\big] - (A^\top A)^{-1} A^\top\,\mathbb{E}\big[\nabla_\phi L\big]$$
$$b = \mathbb{E}\big[x\big] - A^{-1}\,(A^\top)^{-1}\,A^\top\,\mathbb{E}\big[\nabla_\phi L\big]$$
$$b = \mathbb{E}\big[x\big] - A^{-1}\,\mathbb{E}\big[\nabla_\phi L\big]$$
$$\nabla_A\,\frac{1}{2}\,\mathbb{E}\big[(\nabla_\phi L - A(x - b))^2\big] = \mathbb{E}\big[\nabla_\phi L\,(x - b)^\top - A\,(x - b)(x - b)^\top\big] \overset{!}{=} 0.$$
Writing out the products, taking the expectations where necessary, substituting $b = \mathbb{E}[x] - A^{-1}\mathbb{E}[\nabla_\phi L]$ and following through with the long and tedious algebra, in which the rank-one terms $\mathbb{E}[\nabla_\phi L]\,\mathbb{E}[\nabla_\phi L]^\top A^{-\top}$ cancel, we finally arrive at
$$\mathbb{E}\big[\nabla_\phi L\,x^\top\big] - \mathbb{E}\big[\nabla_\phi L\big]\,\mathbb{E}\big[x\big]^\top - A\,\Big(\mathbb{E}\big[x x^\top\big] - \mathbb{E}\big[x\big]\,\mathbb{E}\big[x\big]^\top\Big) \overset{!}{=} 0$$
$$A = \Big(\mathbb{E}\big[\nabla_\phi L\,x^\top\big] - \mathbb{E}\big[\nabla_\phi L\big]\,\mathbb{E}\big[x\big]^\top\Big)\,\Big(\mathbb{E}\big[x x^\top\big] - \mathbb{E}\big[x\big]\,\mathbb{E}\big[x\big]^\top\Big)^{-1}.$$
As we deal with matrices in $\mathbb{R}^{2\times 2}$ per set of variational parameters ϕ = {μ, σ}, the matrix inversions can be performed cheaply and analytically with the identity
$$A^{-1} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix},$$
which requires only two multiplications, one subtraction and a repositioning of values with two negations.
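As a sketch, the identity translates directly into code (an illustrative helper):

```python
def inv2x2(a, b, c, d):
    """Analytic inverse of [[a, b], [c, d]]; assumes det = ad - bc != 0."""
    det = a * d - b * c
    return [[ d / det, -b / det],
            [-c / det,  a / det]]
```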

Appendix G. Experimental Setup

In all the experiments, we trained a convolutional Bayesian neural network (CBNN) with four blocks of 2D convolution, 2D BatchNorm and 2D MaxPooling, followed by two linear layers. The convolutional filters were in the sequence 96, 128, 256 and 128, and the subsequent linear layers had 200 and 10 neurons, respectively. This corresponds to the experimental setup of [15]. As a baseline, we optimized a Bayesian feed-forward neural network (BNN) with four layers and 200 neurons in each layer. As frequentist baselines, we trained the same architectures but without the regularization corresponding to the KL divergence in the ELBO for Bayesian neural networks. We used a consistent batch size of 64 due to the memory constraints of computing per-sample gradients and used leaky ReLU with a negative slope of 0.01 in all architectures. The convolutional and linear layer parameters were parameterized with normal distributions, and the ELBO was used as the objective function.
Experimentally and theoretically, it has been found that large learning rates in the early phase of optimization help to provide a large initial improvement before the learning rate is decreased to finetune the model parameters [34,35]. For our experiments, we initialized the control parameter with a value of one, in the spirit of learning rate scheduling [26]. Similarly to the learning rate itself, learning rate scheduling requires an a priori decision on how the scheduling should be conducted. Our optimizer StochControlSGD instead determines the optimal learning rate adaptively, via the principle of stochastic optimal control, independently for each parameter.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25, pp. 1097–1105. [Google Scholar]
  2. Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. Deep speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567. [Google Scholar]
  3. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  4. Andrieu, C.; De Freitas, N.; Doucet, A.; Jordan, M.I. An introduction to MCMC for machine learning. Mach. Learn. 2003, 50, 5–43. [Google Scholar] [CrossRef]
  5. Wainwright, M.J.; Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference; Now Publishers Inc.: Norwell, MA, USA, 2008. [Google Scholar]
  6. Hoffman, M.D.; Blei, D.M.; Wang, C.; Paisley, J. Stochastic variational inference. J. Mach. Learn. Res. 2013, 14, 1303–1347. [Google Scholar]
  7. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris, France, 22–27 August 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 177–186. [Google Scholar]
  8. Liu, G.H.; Theodorou, E.A. Deep learning theory review: An optimal control and dynamical systems perspective. arXiv 2019, arXiv:1908.10920. [Google Scholar]
  9. Orvieto, A.; Kohler, J.; Lucchi, A. The role of memory in stochastic optimization. In Proceedings of the Uncertainty in Artificial Intelligence (PMLR), Virtual, 3–6 August 2020; pp. 356–366. [Google Scholar]
  10. Mandt, S.; Hoffman, M.D.; Blei, D.M. Stochastic gradient descent as approximate Bayesian inference. arXiv 2017, arXiv:1704.04289. [Google Scholar]
  11. Yaida, S. Fluctuation-dissipation relations for stochastic gradient descent. arXiv 2018, arXiv:1810.00004. [Google Scholar]
  12. Oksendal, B. Stochastic Differential Equations: An Introduction with Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  13. Depeweg, S.; Hernandez-Lobato, J.M.; Doshi-Velez, F.; Udluft, S. Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden, 10–15 July 2018; pp. 1184–1193. [Google Scholar]
  14. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  15. Li, Q.; Tai, C.; Weinan, E. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the International Conference on Machine Learning (PMLR), Sydney, Australia, 6–11 August 2017; pp. 2101–2110. [Google Scholar]
  16. Stengel, R.F. Optimal Control and Estimation; Courier Corporation: Chelmsford, MA, USA, 1994. [Google Scholar]
  17. LeCun, Y. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 4 March 2022).
  18. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar]
  19. Krizhevsky, A.; Hinton, G. Convolutional Deep Belief Networks on Cifar-10. 2010. Available online: https://www.cs.toronto.edu/~kriz/conv-cifar10-aug2010.pdf (accessed on 4 March 2022).
  20. Wenzel, F.; Roth, K.; Veeling, B.S.; Swiatkowski, J.; Tran, L.; Mandt, S.; Snoek, J.; Salimans, T.; Jenatton, R.; Nowozin, S. How good is the bayes posterior in deep neural networks really? arXiv 2020, arXiv:2002.02405. [Google Scholar]
  21. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (PMLR), Lille, France, 7–9 July 2015; pp. 448–456. [Google Scholar]
  22. Frankle, J.; Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv 2018, arXiv:1803.03635. [Google Scholar]
  23. Smith, S.L.; Kindermans, P.J.; Ying, C.; Le, Q.V. Don’t decay the learning rate, increase the batch size. arXiv 2017, arXiv:1711.00489. [Google Scholar]
  24. Murata, N.; Kawanabe, M.; Ziehe, A.; Müller, K.R.; Amari, S.-I. On-line learning in changing environments with applications in supervised and unsupervised learning. Neural Netw. 2002, 15, 743–760. [Google Scholar] [CrossRef]
  25. Murata, N.; Müller, K.R.; Ziehe, A.; Amari, S.-I. Adaptive on-line learning in changing environments. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 1–6 December 1997; pp. 599–605. [Google Scholar]
  26. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  27. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight uncertainty in neural network. In Proceedings of the International Conference on Machine Learning (PMLR), Lille, France, 7–9 July 2015; pp. 1613–1622. [Google Scholar]
  28. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning (PMLR), New York, NY, USA, 20–22 June 2016; pp. 1050–1059. [Google Scholar]
  29. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  30. Kingma, D.P.; Salimans, T.; Welling, M. Variational dropout and the local reparameterization trick. arXiv 2015, arXiv:1506.02557. [Google Scholar]
  31. Ranganath, R.; Gerrish, S.; Blei, D. Black box variational inference. In Proceedings of the Artificial Intelligence and Statistics (PMLR), Bejing, China, 22–24 June 2014; pp. 814–822. [Google Scholar]
  32. Baydin, A.G.; Pearlmutter, B.A.; Radul, A.A.; Siskind, J.M. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 2018, 18, 1–43. [Google Scholar]
  33. Kucukelbir, A.; Tran, D.; Ranganath, R.; Gelman, A.; Blei, D.M. Automatic differentiation variational inference. J. Mach. Learn. Res. 2017, 18, 430–474. [Google Scholar]
  34. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
  35. Welling, M.; Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011; pp. 681–688. [Google Scholar]
Figure 1. The components of the stochastic differential equation for the variational parameters $\mu_t$ and $\sigma_t$ over time. The empirical drift and diffusion estimates shown in blue are unbiased estimates of the true, analytically derived drift and diffusion terms. The loss was $L = \frac{1}{2}(\theta_t - b)^2$, where b was sampled randomly from b ∈ {−2, +2} to simulate aleatoric uncertainty. The aleatoric uncertainty from the data in the gradients remains constant, whereas the epistemic uncertainty from the parameter distribution is reduced to zero.
Figure 2. A one-dimensional illustration of how the optimal stochastic control u is determined from the gradient and parameter information. The parameters ϕ and their gradient information $\nabla_\phi L$ are used to estimate the curvature A and offset b of the quadratic approximation g, through which the optimal control parameter u is determined. In our experiments with Bayesian neural networks, each parameter θ has two variational parameters ϕ = {μ, σ}, such that $A \in \mathbb{R}^{2\times 2}$ and $b \in \mathbb{R}^2$.
Figure 3. Comparison of StochControlSGD with SGD, controlled SGD and ADAM. StochControlSGD offers very robust performance over varying learning rates.
Figure 4. Combined performance of the optimizers over different learning rates. StochControlSGD provides reliable performance over a wide range of learning rates without the necessity of hyperparameter tuning.
Figure 5. The median control parameter over time, plotted with the training ELBO used to compute the gradients, for a BNN trained on FashionMNIST.
Table 1. Test accuracy on the MNIST, FMNIST and CIFAR10 datasets. We abbreviate StochControlSGD as scSGD and SGD with cosine learning rate scheduling as LRSGD for notational brevity. A "/" indicates that scSGD was not run for the frequentist models. The best-performing optimization algorithm per model and dataset is denoted in bold.

MNIST

| Model | SGD | ADAM | cSGD | scSGD | LRSGD |
|---|---|---|---|---|---|
| NN | 0.959 | **0.987** | 0.961 | / | 0.985 |
| CNN | 0.989 | **0.993** | 0.981 | / | 0.990 |
| BNN (Normal) | 0.956 | 0.963 | 0.970 | **0.971** | 0.069 |
| CBNN (Normal) | 0.982 | 0.988 | 0.982 | **0.990** | 0.989 |
| BNN (Laplace) | 0.976 | **0.978** | 0.974 | 0.977 | 0.975 |
| CBNN (Laplace) | 0.989 | 0.987 | 0.985 | **0.991** | 0.989 |

FMNIST

| Model | SGD | ADAM | cSGD | scSGD | LRSGD |
|---|---|---|---|---|---|
| NN | 0.818 | **0.890** | 0.851 | / | 0.878 |
| CNN | 0.904 | **0.918** | 0.912 | / | 0.907 |
| BNN (Normal) | 0.865 | 0.870 | 0.876 | **0.900** | **0.900** |
| CBNN (Normal) | 0.869 | 0.914 | 0.903 | **0.921** | 0.915 |
| BNN (Laplace) | 0.890 | 0.875 | **0.903** | 0.901 | 0.900 |
| CBNN (Laplace) | 0.899 | 0.916 | 0.907 | **0.918** | 0.912 |

CIFAR10

| Model | SGD | ADAM | cSGD | scSGD | LRSGD |
|---|---|---|---|---|---|
| NN | 0.461 | **0.512** | 0.432 | / | 0.499 |
| CNN | 0.853 | **0.865** | 0.857 | / | 0.855 |
| BNN (Normal) | 0.441 | 0.442 | 0.451 | **0.471** | 0.462 |
| CBNN (Normal) | 0.615 | **0.854** | 0.836 | 0.853 | 0.801 |
| BNN (Laplace) | **0.501** | 0.452 | 0.461 | 0.479 | 0.500 |
| CBNN (Laplace) | 0.627 | **0.857** | 0.829 | **0.857** | 0.853 |