1. Introduction
Question: How does a mathematician find a needle in a haystack?
Answer: Keep halving the haystack and discarding the “wrong” half.
This trick relies on having a test for whether the needle is or is not in the chosen half. With that test in hand, the mathematician expects to find the needle in $\log_2 N$ steps instead of the $N$ trials of direct point-by-point search through $N$ possibilities.
The programmer is faced with a similar problem when trying to locate a small target in a large space. We do not generally have volume-wide global tests available to us, being instead restricted to point-wise evaluations of some quality function $Q(\mathbf{x})$ at selected locations $\mathbf{x}$. A successful algorithm should have two parts.
One part uses quality differences to drive successive trial locations towards better (larger) quality values. This iteration reduces the available possibilities by progressively eliminating bad (low quality) locations.
Nested sampling [1] accomplishes this without needing to interpret $Q$ as energy or anything else. By relying only on comparisons ($>$ or $=$ or $<$), it is invariant to any monotonic regrade of $Q$, thereby preserving generality. Its historical precursor was simulated annealing [2], in which $-\log Q$ was restrictively interpreted as energy in a thermal equilibrium.
The other part of a successful algorithm concerns how to move location without decreasing the quality attained so far. Here, it will often be more efficient to move systematically for several steps in a chosen direction, rather than diffuse slowly around with randomly directed individual steps. After $n$ steps, the aim is to have moved a distance of order $n$, not just of order $\sqrt{n}$.
Galilean Monte Carlo (GMC) accomplishes this with steady (“Galilean”) motion controlled by quality value. Its historical precursor was Hamiltonian Monte Carlo (HMC) [3], in which motion was controlled by “Hamiltonian” forces restrictively defined by a quality gradient which sometimes does not exist.
In both parts, nested sampling compression and GMC exploration, generality and power are retained by avoiding presentation in terms of physics. After all, elementary ideas underlie our understanding of physics, not the other way round, and discarding what isn’t needed ought to be helpful.
2. Compression by Nested Sampling
Question: How does a programmer find a small target in a big space?
Answer: ……
There may be very many (N) possible “locations” to choose from. For clarity, assume these have equal status a priori—there is no loss of generality because unequal status can always be modelled by retreating to an appropriate substratum of equivalent points. For the avoidance of doubt, we are investigating practical computation so N is finite, though with no specific limit.
As the first step, the programmer with no initial guidance available can at best select a random location $\mathbf{x}_1$ for the first evaluation $Q_1 = Q(\mathbf{x}_1)$. The task of locating larger values of $Q$ carries no assumption of geometry or even topology. Locations could be shuffled arbitrarily and the task would remain just the same. Accordingly, we are allowed to shuffle the locations into decreasing $Q$-order without changing the task (Figure 1). If ordering is ambiguous because different locations have equal quality, the ambiguity can be resolved by assigning each location its own (random) key value to break the degeneracy.
Being chosen at random, $\mathbf{x}_1$'s shuffled rank $r_1$ marks an equally random fraction $X_1 = r_1/N$ of the ordered $N$ locations. Our knowledge of $X_1$ is uniform: $\Pr(X_1) = \mathrm{Uniform}(0,1)$. We can encode this knowledge as one or (better) more samples that simulate what the position might actually have been. If the programmer deems a single simulation too crude and many too bothersome, the mean and standard deviation $\log X_1 = -1 \pm 1$ often suffice.
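As a check on the quoted statistic (reconstructed here as $\log X_1 = -1 \pm 1$, the standard nested-sampling abbreviation), the moments follow from elementary integrals over the uniform distribution:

$$
\langle \log X_1 \rangle = \int_0^1 \log X \, dX = -1,
\qquad
\langle (\log X_1)^2 \rangle = \int_0^1 (\log X)^2 \, dX = 2,
$$

so that $\operatorname{std}(\log X_1) = \sqrt{2 - 1} = 1$.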
The next step is to discard the “wrong” points with $Q < Q_1$ and select a second location $\mathbf{x}_2$ randomly from the surviving $r_1$ possibilities. Being similarly random, $\mathbf{x}_2$'s rank $r_2$ marks a random fraction $X_2/X_1 = r_2/r_1$ of those $r_1$, with $\Pr(X_2 \mid X_1) = \mathrm{Uniform}(0, X_1)$ (Figure 2).
And so on. After $k$ steps of increasing quality $Q_1 < Q_2 < \cdots < Q_k$, the net compression ratio $X_k$ can be simulated as several samples from

$$X_k = t_1 t_2 \cdots t_k, \qquad t_i \sim \mathrm{Uniform}(0,1), \qquad\qquad (2)$$

to get results fully faithful to our knowledge, or simply abbreviated as mean and standard deviation

$$\log X_k = -k \pm \sqrt{k}.$$
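As an illustrative numerical check of this abbreviation (not code from the paper), one can simulate equation (2) directly; a minimal Python sketch, assuming only NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 50          # number of nested-sampling steps
n_sim = 10000   # number of independent simulations of the compression

# Equation (2): X_k is a product of k independent Uniform(0,1) factors,
# so log X_k is a sum of k terms each with mean -1 and standard deviation 1.
log_X = np.log(rng.uniform(size=(n_sim, k))).sum(axis=1)

print("mean of log X_k :", log_X.mean())   # close to -k
print("std  of log X_k :", log_X.std())    # close to sqrt(k)
```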
Compression proceeds exponentially until the user decides that $Q$ has been adequately maximised. At that stage, the evaluated sequence $Q_1 < Q_2 < \cdots < Q_k$ of qualities has been paired with the corresponding sequence $X_1 > X_2 > \cdots > X_k$ of compressions (Figure 3), either severally simulated or abbreviated in mean.
$Q$ can then be integrated as

$$Z = \int_0^1 Q \, dX \approx \sum_k Q_k \, \Delta X_k, \qquad \Delta X_k = X_{k-1} - X_k \ \ (X_0 = 1), \qquad\qquad (4)$$

so that quantification is available for Bayesian or other purposes. Any function of $Q$ can be integrated too from the same run, changing the $Q_k$ to some related $f(Q_k)$ while leaving the $X_k$ fixed. And the statistical uncertainty in any integral $Z$ is trivially acquired from the repeated simulations (2) of what the compressions $X$ might have been according to their known distributions.
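The following Python sketch (an illustration under the assumptions above, not code from the paper) combines the two ingredients: repeated simulation of the compressions via equation (2) and the weighted sum (4), giving $Z$ together with its statistical uncertainty. The quality sequence `Q_seq` is a hypothetical stand-in for whatever a run produced.

```python
import numpy as np

def evidence_with_uncertainty(Q_seq, n_sim=1000, rng=None):
    """Z = sum_k Q_k * dX_k, averaged over simulated compression sequences."""
    rng = np.random.default_rng() if rng is None else rng
    k = len(Q_seq)
    # Simulate the compressions: X_k = t_1 t_2 ... t_k with t_i ~ Uniform(0,1).
    t = rng.uniform(size=(n_sim, k))
    X = np.cumprod(t, axis=1)                      # shape (n_sim, k)
    X_prev = np.hstack([np.ones((n_sim, 1)), X[:, :-1]])
    dX = X_prev - X                                # widths dX_k = X_{k-1} - X_k
    Z = (dX * np.asarray(Q_seq)).sum(axis=1)       # one Z per simulated compression
    return Z.mean(), Z.std()

# Hypothetical quality sequence, increasing as nested sampling requires.
Q_seq = np.exp(np.linspace(0.0, 10.0, 100))
Z_mean, Z_std = evidence_with_uncertainty(Q_seq)
print(f"Z = {Z_mean:.3g} +- {Z_std:.3g}")
```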
That’s nested sampling. It requires two user procedures additional to the quality function $Q(\mathbf{x})$. The first is to sample an initially random location to start the procedure. The second—which we next address—is to move to a new random location obeying a lower bound $Q^*$ on $Q$. Note that there is no modulation within a constrained domain. Locations are either acceptable, $Q \ge Q^*$, or not, $Q < Q^*$.
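A skeletal driver illustrating how the two user procedures plug in might look as follows. This is a minimal single-particle sketch under the conventions above, not the paper's own code; `sample_prior` and `move_above_bound` are hypothetical names for the two user-supplied procedures, and termination is left to the user.

```python
import numpy as np

def nested_sampling(Q, sample_prior, move_above_bound, n_steps):
    """Single-particle nested-sampling compression.

    Q                : quality function, Q(x) -> float
    sample_prior     : user procedure returning a random initial location
    move_above_bound : user procedure (x, Q_star) -> new random location with Q >= Q_star
    """
    x = sample_prior()
    Q_seq, logX_seq = [], []
    log_X = 0.0
    for _ in range(n_steps):
        Q_star = Q(x)
        Q_seq.append(Q_star)
        log_X -= 1.0                        # "abbreviated in mean": log X shrinks by ~1 per step
        logX_seq.append(log_X)
        x = move_above_bound(x, Q_star)     # explore subject to Q(x) >= Q_star
    return np.array(Q_seq), np.array(logX_seq)
```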
3. Exploration by Galilean Monte Carlo
The obvious beginners’ MCMC procedure for moving from one acceptable location to another, while obeying detailed balance but not moving so far that $Q$ always disobeys the lower bound $Q^*$, is: propose a small random displacement to a trial location, accept it if the trial obeys $Q \ge Q^*$, and otherwise stay put.
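A minimal sketch of that step (our reconstruction of the obvious procedure, not the paper's listing; the proposal scale `step` is a hypothetical tuning parameter):

```python
import numpy as np

def mcmc_step(x, Q, Q_star, step, rng):
    """One constrained Metropolis step: symmetric random proposal,
    accepted only if the quality bound Q >= Q_star is obeyed."""
    x_trial = x + step * rng.standard_normal(x.shape)
    return x_trial if Q(x_trial) >= Q_star else x
```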
However, randomising the direction at every step is diffusive and slow, with net distance travelled increasing only as the square root of the number of steps.
All locations within the constrained domain are equally acceptable, so the program might better try to proceed in a straight line, changing velocity only when necessary in an attempt to reflect specularly off the boundary (Figure 4, left). The user is asked to ensure that the imposed geometry makes sense in the context of the particular application, otherwise there will be no advantage.
With finite step lengths, it’s not generally possible to hit the boundary exactly whilst simultaneously being sure that it had not already been encountered earlier, so the ideal path is impractical. Instead, we take a current location $\mathbf{x}$ and suppose that a corresponding unit vector $\mathbf{n}(\mathbf{x})$ can be defined there as a proxy for where the surface normal would be if that surface was close at hand (Figure 4, right). Again, it is the user's responsibility to ensure that $\mathbf{n}$ makes sense in the context of the particular application: exploration procedures cannot anticipate the quirks of individual applications.
Reflection from a plane orthogonal to $\mathbf{n}$ (drawn horizontally in Figure 5) modifies an incoming velocity $\mathbf{v}$ (in northward direction from the South) to a reflected $\mathbf{v}'$. Depending on the circumstances, the incoming particle may proceed straight on to the North ($+\mathbf{v}$), or be reflected to the East ($+\mathbf{v}'$), or back-reflected to the West ($-\mathbf{v}'$), or reversed back to the South ($-\mathbf{v}$).
Mistakenly, the author’s earlier introduction of GMC in 2011 [4] reduced the possibilities by eliminating West, but at the cost of allowing the particle to escape the constraint temporarily, which damaged the performance and cancelled its potential superiority.
Figure 5. North – East – West – South, the four Galilean outcomes.
If the potential destination North is acceptable (bottom left in Figure 5), the particle should move there and not change its direction (so $\mathbf{n}$ need not be computed). Otherwise, the particle needs to change its direction but not its position.
For a North-South oriented velocity to divert into East-West, either East or West must be acceptable, but not both, because East-West particles would then pass straight through without interacting with North-South, so the proposed diversion would break detailed balance. Likewise, for an East-West velocity to divert North-South, either North or South must be acceptable but not both. These conditions yield the procedure sketched below.
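The following Python sketch is one consistent reading of those conditions (our reconstruction, not the paper's listing): the particle either advances North with unchanged velocity, or stays put and turns East, West, or South according to which destinations obey the bound. The normal proxy `normal` and the quality bound `Q_star` are user-supplied, as described above.

```python
import numpy as np

def reflect(v, n):
    """Specular reflection of velocity v in the plane orthogonal to unit vector n."""
    return v - 2.0 * n * np.dot(n, v)

def gmc_step(x, v, Q, Q_star, normal):
    """One Galilean Monte Carlo step subject to the constraint Q >= Q_star."""
    north = x + v
    if Q(north) >= Q_star:              # North acceptable: move on, direction unchanged
        return north, v
    n = normal(x)                       # unit-vector proxy for the surface normal
    v_ref = reflect(v, n)
    east_ok = Q(x + v_ref) >= Q_star
    west_ok = Q(x - v_ref) >= Q_star
    if east_ok and not west_ok:         # divert East: change direction, not position
        return x, v_ref
    if west_ok and not east_ok:         # divert West
        return x, -v_ref
    return x, -v                        # otherwise reverse back to the South
```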
Any self-inverse reflection operator will do, though the reflection idea suggests $\mathbf{v} \mapsto \mathbf{v}' = \mathbf{v} - 2\,\mathbf{n}\,(\mathbf{n} \cdot \mathbf{v})$.
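Self-inversion is immediate in matrix form (a one-line check, assuming $\mathbf{n}$ is a unit vector):

$$
\mathbf{v}' = R\,\mathbf{v}, \qquad R = I - 2\,\mathbf{n}\mathbf{n}^{\mathsf T},
\qquad
R^2 = I - 4\,\mathbf{n}\mathbf{n}^{\mathsf T} + 4\,\mathbf{n}\,(\mathbf{n}^{\mathsf T}\mathbf{n})\,\mathbf{n}^{\mathsf T} = I .
$$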
That’s Galilean Monte Carlo. The trajectory is explored uniformly, with each step yielding an acceptable (though correlated) sample.
4. Compression and Exploration
GMC was originally designed for nested-sampling compression, from which probability distributions can be built up after a run by identifying quality as likelihood $L$ in the weighted sequence (4) of successively compressed locations. However, GMC can also be used when exploring a weighted distribution directly.
For compression (standard nested sampling, Figure 6, left), only the domain size $X$ is iterated, albeit under likelihood control.
For exploration (standard reversible MCMC, Figure 6, right), the likelihood is relaxed as well through a preliminary random number $u$ drawn uniformly from $(0,1)$, which sets the lower bound $L^* = u\,L(\mathbf{x})$ for the next move. This is equivalent to standard Metropolis balancing “Accept $\mathbf{x}'$ if $L(\mathbf{x}') \ge u\,L(\mathbf{x})$”, the only difference being that the lower bound is set beforehand instead of checked afterwards.
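In code, the exploration variant only wraps the constrained step with that preliminary draw. This sketch reuses the hypothetical `gmc_step` and `normal` from the earlier listing and identifies the quality with the likelihood `L`, following the text:

```python
def gmc_explore_step(x, v, L, normal, rng):
    """One exploration step: relax the likelihood bound with u ~ Uniform(0,1),
    then take one Galilean step (gmc_step above) subject to L >= u * L(x)."""
    u = rng.uniform()          # preliminary random number
    L_star = u * L(x)          # lower bound set beforehand, Metropolis-style
    return gmc_step(x, v, L, L_star, normal)
```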
5. Exploration by Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) [3] uses a physical analogy with the kinetic theory of gases, in which a thermal distribution of moving particles has a position/velocity probability distribution that factorises into space and velocity parts,

$$\Pr(\mathbf{x}, \mathbf{v}) \propto e^{-V(\mathbf{x})}\; e^{-\frac{1}{2}|\mathbf{v}|^2}, \qquad\qquad (9)$$

with potential energy $V$ defining the spatial target distribution $\Pr(\mathbf{x}) \propto e^{-V(\mathbf{x})}$ and kinetic energy $\frac{1}{2}|\mathbf{v}|^2$ distributed as the Boltzmann thermal equilibrium $\Pr(\mathbf{v}) \propto e^{-\frac{1}{2}|\mathbf{v}|^2}$.
The usual dynamics (Figure 7),

$$\dot{\mathbf{x}} = \mathbf{v}, \qquad \dot{\mathbf{v}} = -\nabla V(\mathbf{x}),$$

relaxes an initial setting towards the joint equilibrium (9) under occasional collisions which reset $\mathbf{v}$ according to $\Pr(\mathbf{v}) \propto e^{-\frac{1}{2}|\mathbf{v}|^2}$, leaving $\mathbf{x}$ as a sample from the target $\Pr(\mathbf{x}) \propto e^{-V(\mathbf{x})}$.
Between collisions, the force field is necessarily digitised into impulses at discrete time intervals $\tau$, so the computation obeys the leapfrog scheme

$$\mathbf{v} \leftarrow \mathbf{v} - \tau\,\nabla V(\mathbf{x}), \qquad \mathbf{x} \leftarrow \mathbf{x} + \tau\,\mathbf{v}.$$

To make the trajectory reversible and increase the accuracy order, the impulses are halved at the start ($\mathbf{v} \leftarrow \mathbf{v} - \tfrac{\tau}{2}\nabla V$) and end, but even this does not ensure full accuracy because the dynamics has been approximated (Figure 8).
To correct this, the destination $(\mathbf{x}', \mathbf{v}')$, whose total energy $H' = V(\mathbf{x}') + \frac{1}{2}|\mathbf{v}'|^2$ ought to agree with the initial $H = V(\mathbf{x}) + \frac{1}{2}|\mathbf{v}|^2$, is subjected to the usual Metropolis balancing.
In practice, the correction is often ignored because the (reversible) algorithm explores “level sets” whose contours are often an adequately good approximation to those of the true Hamiltonian, provided the fixed timestep is not too large.
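A compact Python sketch of one HMC update, with the half-impulses at start and end and the final Metropolis correction. This is a generic textbook implementation supplied for comparison with GMC rather than code from the paper; `grad_V` is the user-supplied gradient of the potential.

```python
import numpy as np

def hmc_step(x, V, grad_V, tau, n_leapfrog, rng):
    """One HMC update: thermal velocity, leapfrog trajectory, Metropolis correction."""
    v = rng.standard_normal(x.shape)            # collision: v ~ exp(-|v|^2 / 2)
    H0 = V(x) + 0.5 * np.dot(v, v)              # initial total energy
    x_new, v_new = x.copy(), v.copy()
    v_new -= 0.5 * tau * grad_V(x_new)          # half impulse at the start
    for _ in range(n_leapfrog - 1):
        x_new += tau * v_new
        v_new -= tau * grad_V(x_new)            # full impulses in between
    x_new += tau * v_new
    v_new -= 0.5 * tau * grad_V(x_new)          # half impulse at the end
    H1 = V(x_new) + 0.5 * np.dot(v_new, v_new)  # destination total energy
    if rng.uniform() < np.exp(H0 - H1):         # usual Metropolis balancing
        return x_new
    return x                                    # trajectory rejected
```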
That’s Hamiltonian Monte Carlo. The trajectory is explored non-uniformly, with successive steps lying closer together at altitude (large $V$) where the particles move more slowly, so that sampling is densest where the probability density is smallest, a mismatch which needs to be overcome by the equilibrating collisions. And, of course, HMC requires the potential (a.k.a. the negative log-likelihood) to be differentiable and generally smooth.
6. Compression versus Simulated Annealing
Simulated annealing uses a physical analogy to thermal equilibrium to compress from a prior probability distribution to a posterior. As in HMC, though without the complication of kinetic energy, the likelihood (or quality) is identified as the exponential $L = e^{-E}$ of an energy. In annealing, the energy is scaled by a parameter $\beta$ so that the quality becomes the exponential $L^\beta = e^{-\beta E}$ of this scaled energy, with $\beta$ used to connect posterior (where $\beta = 1$) with prior (where $\beta = 0$).
This “simulates” thermal equilibrium at coolness (inverse temperature) $\beta$, and “annealing” refers to sufficiently gradual cooling from prior to posterior that equilibrium is locally preserved. A few lines of algebra, familiar in statistical mechanics, show that the evidence (or partition function) can be accumulated from successive thermal equilibria as

$$\log Z = \int_0^1 \langle \log L \rangle_\beta \, d\beta, \qquad\qquad (13)$$

where $\langle \log L \rangle_\beta$ is the average log-likelihood as determined by sampling the equilibrium appropriate to coolness $\beta$. Equilibrium is defined by weights proportional to $L^\beta$ over the prior and can be explored either by GMC or (traditionally) by HMC. There is seldom any attempt to evaluate the statistical uncertainty in $\log Z$, the necessary fluctuations being poorly defined in the simulations.
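The “few lines of algebra” can be filled in as follows (a standard thermodynamic-integration identity, supplied here for completeness): with $Z(\beta) = \int L^\beta\, dX$,

$$
\frac{d \log Z(\beta)}{d\beta}
= \frac{1}{Z(\beta)} \int L^\beta \log L \; dX
= \langle \log L \rangle_\beta ,
\qquad
\log Z = \log\frac{Z(1)}{Z(0)} = \int_0^1 \langle \log L \rangle_\beta \, d\beta ,
$$

where $Z(0) = \int dX = 1$ because the prior is normalised.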
At coolness $\beta$, the equilibrium distribution of locations $\mathbf{x}$, initially uniform over the prior, is modulated by $L^\beta$ so that the samples have probability distribution $\Pr(\mathbf{x}) \propto L(\mathbf{x})^\beta$, which corresponds to $dP \propto L^\beta\, dX$ in terms of compression. Consequently, samples cluster around the maximum of $L^\beta X$, where the $\log L$ versus $\log X$ curve has slope $-1/\beta$ (Figure 9, left). Clearly this only works properly if $\log L$ is a concave function of $\log X$ (second derivative negative). Any zone of convexity (second derivative positive) is unstable, with samples heading toward either larger $L$ at little cost to $X$ or toward larger $X$ at little cost to $L$. A simulated-annealing program cannot enter a convex region, and the steady cooling assumed in (13) cannot occur.
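For the record, the quoted slope follows from one line of calculus (our reconstruction of the step being summarised): the weight per unit $\log X$ is $L^\beta X$, which is stationary where

$$
\frac{d}{d\log X}\left(\beta \log L + \log X\right) = 0
\quad\Longrightarrow\quad
\frac{d\log L}{d\log X} = -\frac{1}{\beta},
$$

and this stationary point is a maximum only where $\log L(\log X)$ is concave.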
In the physics analogy, this behaviour is a phase change and can be exemplified by the transition from steam to water at 100 °C (Figure 9, right). Because of the different volumes (exponentially different in large problems), a simulated-annealing program will be unable to place the two phases in algorithmic contact, so will be unable to model the latent heat of the transition. Correspondingly, computation of the evidence $Z$ will fail.
Yet our world is full of interesting phase changes, and a method that cannot cope with them cannot be recommended for general use.

Nested sampling, on the other hand, compresses steadily with respect to the abscissa $\log X$ regardless of the (monotonic) behaviour of the ordinate $\log L$, so is impervious to this sort of phase change. Contrariwise, simulated annealing cools through the slope $d\log L / d\log X = -1/\beta$, which need not change monotonically. By using thermal equilibria, which average over the whole of $X$, simulated annealing views the system through the lens of a Laplace transform, which is notorious for its ill-conditioning. Far better to deal with the direct situation.
7. Conclusions
The author suggests that, just as nested sampling dominates simulated annealing, …
Nested sampling                | Simulated annealing
Steady compression             | Arbitrary cooling schedule for $\beta$
Invariant to relabelling $Q$   | $Q$ has fixed form $e^{-\beta E}$
Can deal with phase changes    | Cannot deal with phase changes
Evidence $Z$ with uncertainty  | Evidence $Z$
… so does Galilean Monte Carlo dominate Hamiltonian.
Galilean Monte Carlo                     | Hamiltonian Monte Carlo
No rejection                             | Trajectories can be rejected
Any metric is OK                         | Riemannian metric required
Invariant to relabelling $Q$             | Trajectory explores non-uniformly
Quality function is arbitrary            | Quality must be differentiable
Step functions OK (nested sampling)      | Cannot use step functions
Can sample any probability distribution  | Probability distribution must be smooth
Needs 2 work vectors                     | Needs 3 work vectors
In each case, reverting to elementary principles by discarding physical analogies enhances generality and power.