1. Introduction
Markov chain Monte Carlo (MCMC) methods are techniques for estimating expectations with respect to some probability measure π, which need not be normalised. This is done by sampling a Markov chain which has limiting distribution π, and computing empirical averages. A popular form of MCMC is the Metropolis–Hastings algorithm [1,2], in which at each time step a 'proposed' move is drawn from some candidate distribution and then accepted with some probability; otherwise the chain stays at the current point. Interest lies in finding choices of candidate distribution that will produce sensible estimators for expectations with respect to π.
The quality of these estimators can be assessed in many different ways, but a common approach is to understand conditions on π that will result in a chain which converges to its limiting distribution at a geometric rate. If such a rate can be established, and the chain is reversible, then a Central Limit Theorem exists for expectations of functionals with finite second absolute moment under π.
A simple yet often effective choice is a symmetric candidate distribution centred at the current point in the chain (with a fixed variance), resulting in the Random Walk Metropolis (RWM) (e.g., [3]). The convergence properties of a chain produced by the RWM are well studied. In one dimension, convergence is geometric essentially when π decays at an exponential or faster rate in the tails [4], while in higher dimensions an additional curvature condition is required [5]. Slower rates of convergence have also been established in the case of heavier tails [6].
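The RWM step is compact enough to sketch in full. In the following minimal implementation the standard normal target, step size and chain length are illustrative choices, not anything prescribed by the text:

```python
import numpy as np

def rwm(log_pi, x0, h=1.0, n=10000, rng=None):
    """Random Walk Metropolis with fixed proposal variance h^2.

    log_pi: unnormalised log-density of the target.
    Returns the sampled chain as an array of length n.
    """
    rng = np.random.default_rng(rng)
    x = x0
    chain = np.empty(n)
    for t in range(n):
        y = x + h * rng.standard_normal()          # symmetric proposal
        # accept with probability min{1, pi(y)/pi(x)}
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y
        chain[t] = x
    return chain

# Standard normal target: empirical moments should be close to (0, 1).
chain = rwm(lambda x: -0.5 * x ** 2, x0=0.0, h=2.4, n=20000, rng=1)
```

The step size h = 2.4 is in the range commonly recommended for one-dimensional Gaussian targets; any moderate value gives a valid (if less efficient) chain.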
Recently, some MCMC methods have been proposed which generalise the RWM, whereby proposals are still centred at the current point x and symmetric, but the variance changes with x [7,8,9,10,11]. An extension to infinite-dimensional Hilbert spaces is also suggested in Reference [12]. The motivation is that the chain can become more 'local', perhaps making larger jumps when out in the tails, or mimicking the local dependence structure of π to propose more intelligent moves. Designing MCMC methods of this nature is particularly relevant for modern Bayesian inference problems, where posterior distributions are often high dimensional and exhibit nonlinear correlations [13]. We term this approach the Position-dependent Random Walk Metropolis (PDRWM), although technically this is a misnomer, since proposals are no longer random walks. Other choices of candidate distribution designed for distributions that exhibit nonlinear correlations were introduced in Reference [13]. Although powerful, these require derivative information for π, something which can be unavailable in modern inference problems (e.g., [14]). We note that no such information is required for the PDRWM, as shown by the particular cases suggested in References [7,8,9,10,11]. There are, however, relations between the approaches, to the extent that understanding how the properties of the PDRWM differ from those of the standard RWM should also aid understanding of the methods introduced in Reference [13].
In this article, we consider the convergence rate of a Markov chain generated by the PDRWM to its limiting distribution. Our main interest lies in whether this generalisation can change these ergodicity properties compared to the standard RWM with fixed covariance. We focus on the case in which the candidate distribution is Gaussian, and illustrate that such changes can occur in several different ways, either for better or worse. Our aim is not to give a complete characterisation of the approach, but rather to illustrate the possibilities through carefully chosen examples, which are known to be indicative of more general behaviour.
In Section 2 necessary concepts about Markov chains are briefly reviewed, before the PDRWM is introduced in Section 3. Some results in the one-dimensional case are given in Section 4, before a higher-dimensional model problem is examined in Section 5. Throughout, π denotes a probability measure (we use the terms probability measure and distribution synonymously), and π(x) its density with respect to Lebesgue measure.
Since an early version of this work appeared online, some contributions to the literature have been made that are worthy of mention. A Markov kernel constructed as a state-dependent mixture is introduced in Reference [15] and its properties are studied in some cases that are similar in spirit to the model problem of Section 5. An algorithm called Directional Metropolis–Hastings, which encompasses a specific instance of the PDRWM, is introduced and studied in Reference [16], and a modification of the same idea is used to develop the Hop kernel within the Hug and Hop algorithm of Reference [17]. Kamatani considers an algorithm designed for the infinite-dimensional setting in Reference [18], of a similar design to that discussed in Reference [12], and studies its ergodicity properties.
2. Markov Chains and Geometric Ergodicity
We will work on the Borel space (X, B), with X ⊆ R^d for some d ≥ 1, so that each X_t ∈ X for a discrete-time Markov chain {X_t}_{t ≥ 0} with time-homogeneous transition kernel P : X × B → [0, 1], where P(x, A) := Pr[X_{t+1} ∈ A | X_t = x] and P^n(x, A) is defined similarly for X_{t+n}. All chains we consider will have invariant distribution π, and be both π-irreducible and aperiodic, meaning π is the limiting distribution from π-almost any starting point [19]. We use | · | to denote the Euclidean norm.
In Markov chain Monte Carlo the objective is to construct estimators of E_π[f] := ∫ f(x) π(dx), for some f : X → R, by computing

f̂_n := (1/n) Σ_{t=1}^{n} f(X_t).

If π is the limiting distribution for the chain then P will be ergodic, meaning f̂_n → E_π[f] almost surely from π-almost any starting point. For finite n the quality of f̂_n intuitively depends on how quickly P^n(x, ·) approaches π(·). We call the chain geometrically ergodic if

‖P^n(x, ·) − π(·)‖_TV ≤ M(x) ρ^n    (1)

from π-almost any starting point x, for some M(x) < ∞ and ρ < 1, where ‖μ(·) − ν(·)‖_TV := sup_{A ∈ B} |μ(A) − ν(A)| is the total variation distance between distributions μ and ν [19].
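For a chain on finitely many states the total variation decay in (1) can be computed exactly, which makes the geometric rate visible. A small sketch (the transition matrix is an illustrative toy, not from the text):

```python
import numpy as np

# A small reversible transition matrix (illustrative, not from the text).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])

# Invariant distribution: the left eigenvector of P with eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

def tv_from(x, n):
    """||P^n(x, .) - pi||_TV = (1/2) sum_y |P^n(x, y) - pi(y)|."""
    return 0.5 * np.abs(np.linalg.matrix_power(P, n)[x] - pi).sum()

# Successive ratios of TV distances settle at the second-largest
# eigenvalue modulus (here 0.5), consistent with a bound M(x) rho^n.
dists = [tv_from(0, n) for n in range(1, 12)]
ratios = [b / a for a, b in zip(dists, dists[1:])]
```

For this matrix pi = (1/4, 1/2, 1/4) and the decay is exactly geometric with rate 1/2 after the first step.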
For π-reversible Markov chains geometric ergodicity implies that if ∫ f(x)^2 π(dx) < ∞, then

√n ( f̂_n − E_π[f] ) → N(0, σ_f^2)    (2)

in distribution, for some asymptotic variance σ_f^2 < ∞ [20]. Equation (2) enables the construction of asymptotic confidence intervals for f̂_n.
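Equation (2) is what licenses reporting a Monte Carlo standard error. A common way to estimate the asymptotic variance from a single run is the method of batch means; a minimal sketch, in which the AR(1) surrogate chain, batch count and seed are illustrative assumptions:

```python
import numpy as np

def batch_means_ci(fx, n_batches=30, z=1.96):
    """Asymptotic confidence interval for E_pi[f] from correlated MCMC
    output fx, using the batch-means estimate of the variance in (2)."""
    n = len(fx) // n_batches * n_batches
    batches = np.asarray(fx[:n]).reshape(n_batches, -1).mean(axis=1)
    est = batches.mean()
    se = batches.std(ddof=1) / np.sqrt(n_batches)
    return est - z * se, est + z * se

# AR(1) chain with N(0, 1) limiting distribution: a stand-in for MCMC
# output. The interval should usually cover the true mean, 0.
rng = np.random.default_rng(3)
x, xs = 0.0, []
for _ in range(50000):
    x = 0.9 * x + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal()
    xs.append(x)
lo, hi = batch_means_ci(xs)
```

Batch means is only consistent when batches are long relative to the chain's correlation length; more refined spectral estimators exist but the idea is the same.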
In practice, geometric ergodicity does not guarantee that f̂_n will be a sensible estimator, as M(x) can be arbitrarily large if the chain is initialised far from the typical set under π, and ρ may be very close to 1. However, chains which are not geometrically ergodic can often either get 'stuck' for a long time in low-probability regions or fail to explore the entire distribution adequately, sometimes in ways that are difficult to diagnose using standard MCMC diagnostics.
Establishing Geometric Ergodicity
It is shown in Chapter 15 of Reference [21] that Equation (1) is equivalent to the condition that there exists a Lyapunov function V : X → [1, ∞) and some λ < 1, b < ∞ such that

PV(x) ≤ λ V(x) + b 1_C(x),    (3)

where PV(x) := ∫ V(y) P(x, dy). The set C ⊂ X must be small, meaning that for some m ∈ N, ε > 0 and probability measure ν,

P^m(x, A) ≥ ε ν(A)    (4)

for any x ∈ C and A ∈ B. Equations (3) and (4) are referred to as drift and minorisation conditions. Intuitively, C can be thought of as the centre of the space, and Equation (3) ensures that some one-dimensional projection of the chain drifts towards C at a geometric rate when outside it. In fact, Equation (3) is sufficient for the return time distribution to C to have geometric tails [21]. Once in C, (4) ensures that with some probability the chain forgets its past and hence regenerates. This regeneration allows the chain to couple with another chain initialised from π, giving a bound on the total variation distance through the coupling inequality (e.g., [19]). More intuition is given in Reference [22].
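The drift condition (3) can be probed numerically for a concrete kernel. Below, PV(x)/V(x) is evaluated by discretising the proposal noise for an RWM targeting a standard normal; the target, Lyapunov function and tuning values are all illustrative assumptions:

```python
import numpy as np

def pv_over_v(x, h=1.0, s=0.5, grid=4001, width=8.0):
    """Discretised evaluation of P V(x)/V(x) for the RWM targeting
    N(0, 1), with Lyapunov function V(x) = exp(s|x|). Values below 1
    for large |x| are the behaviour the drift condition (3) asks for."""
    eps = np.linspace(-width, width, grid)
    w = np.exp(-0.5 * eps ** 2)
    w /= w.sum()                                  # proposal noise weights
    y = x + h * eps
    # RWM acceptance probability for a N(0, 1) target
    alpha = np.minimum(1.0, np.exp(-0.5 * (y ** 2 - x ** 2)))
    ratio = np.exp(s * (np.abs(y) - np.abs(x)))   # V(y)/V(x)
    # rejected moves leave the chain (and V) unchanged
    return 1.0 + np.sum(w * alpha * (ratio - 1.0))

vals = [pv_over_v(x) for x in (2.0, 4.0, 8.0)]
```

At x = 0 the ratio exceeds 1 (the chain moves away from the mode), while far out in the tails it sits strictly below 1: inward moves are always accepted and shrink V, outward moves are rarely accepted.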
Transition kernels considered here will be of the Metropolis–Hastings type, given by

P(x, dy) = α(x, y) Q(x, dy) + r(x) δ_x(dy),    (5)

where Q is some candidate kernel with density q(x, ·), α(x, y) is called the acceptance rate and r(x) := 1 − ∫ α(x, y) Q(x, dy). Here we choose

α(x, y) = min{ 1, [π(y) q(y, x)] / [π(x) q(x, y)] },    (6)

where min{a, b} denotes the minimum of a and b. This choice implies that P satisfies detailed balance for π [23], and hence the chain is π-reversible (note that other choices for α can result in non-reversible chains; see Reference [24] for details).
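A generic Metropolis–Hastings transition implementing this acceptance rate can be sketched as follows; the deliberately asymmetric Gaussian proposal used in the demonstration is purely illustrative, chosen to show the q-ratio correction at work:

```python
import numpy as np

def mh_step(x, log_pi, sample_q, log_q, rng):
    """One Metropolis-Hastings transition with candidate kernel Q.

    sample_q(x, rng) draws y ~ Q(x, .); log_q(x, y) is log q(x, y).
    The ratio q(y, x)/q(x, y) corrects for asymmetry of the proposal.
    """
    y = sample_q(x, rng)
    log_a = (log_pi(y) + log_q(y, x)) - (log_pi(x) + log_q(x, y))
    if np.log(rng.uniform()) < log_a:
        return y          # accept
    return x              # reject: chain stays at x

# Example: an asymmetric proposal y ~ N(x + 0.5, 1) still targets a
# standard normal once the q-ratio correction is applied.
rng = np.random.default_rng(0)
sample_q = lambda x, rng: x + 0.5 + rng.standard_normal()
log_q = lambda x, y: -0.5 * (y - x - 0.5) ** 2
log_pi = lambda x: -0.5 * x ** 2
x, draws = 0.0, []
for _ in range(50000):
    x = mh_step(x, log_pi, sample_q, log_q, rng)
    draws.append(x)
```

Dropping the log_q terms here would bias the chain towards large values; with them, the empirical moments match the target.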
Roberts and Tweedie [5], following on from Reference [21], introduced the following regularity conditions.

Theorem 1. (Roberts and Tweedie). Suppose that π is bounded away from 0 and ∞ on compact sets, and there exist δ_q > 0 and ε_q > 0 such that for every x

q(x, y) ≥ ε_q whenever |x − y| ≤ δ_q.

Then the chain with kernel (5) is π-irreducible and aperiodic, and every nonempty compact set is small.

For the choices of Q considered in this article these conditions hold, and we will restrict ourselves to forms of π for which the same is true (apart from a specific case in Section 5). Under Theorem 1, then, (1) only holds if a Lyapunov function V with V(x) → ∞ as |x| → ∞ exists such that

lim sup_{|x| → ∞} PV(x)/V(x) < 1.    (7)

When P is of the Metropolis–Hastings type, (7) can be written

lim sup_{|x| → ∞} [ ∫ α(x, y) (V(y)/V(x)) q(x, y) dy + r(x) ] < 1.    (8)

In this case, a simple criterion for lack of geometric ergodicity is

lim_{|x| → ∞} ∫ α(x, y) q(x, y) dy = 0.    (9)

Intuitively this implies that the chain is likely to get 'stuck' in the tails of a distribution for large periods.

Jarner and Tweedie [25] introduce a necessary condition for geometric ergodicity through a tightness condition.

Theorem 2. (Jarner and Tweedie). If for any δ > 0 there is an ε < ∞ such that Q(x, B_ε(x)) ≥ 1 − δ for all x, where B_ε(x) := {y : |y − x| ≤ ε}, then a necessary condition for P to produce a geometrically ergodic chain is that ∫ e^{s|x|} π(dx) < ∞ for some s > 0.

The result highlights that when π is heavy-tailed the chain must be able to make very large moves and still be capable of returning to the centre quickly for (1) to hold.
3. Position-Dependent Random Walk Metropolis
In the RWM, Q(x, ·) = N(x, h^2 Σ) with q(x, y) = q(y, x), meaning (6) reduces to α(x, y) = min{1, π(y)/π(x)}. A common choice is Σ = Σ̂, with Σ̂ chosen to mimic the global covariance structure of π [3]. Various results exist concerning the optimal choice of h in a given setting (e.g., [26]). It is straightforward to see that Theorem 2 holds here, so that the tails of π must be uniformly exponential or lighter for geometric ergodicity. In one dimension this is in fact a sufficient condition [4], while for higher dimensions additional conditions are required [5]. We return to this case in Section 5.
In the PDRWM, Q(x, ·) = N(x, h^2 Σ(x)), so (6) becomes

α(x, y) = min{ 1, [π(y) |Σ(y)|^{-1/2} exp(−(x − y)^T Σ(y)^{-1}(x − y)/(2h^2))] / [π(x) |Σ(x)|^{-1/2} exp(−(x − y)^T Σ(x)^{-1}(x − y)/(2h^2))] }.

The motivation for designing such an algorithm is that proposals are better able to reflect the local dependence structure of π. In some cases this dependence may vary greatly in different parts of the state-space, making a global choice of Σ ineffective [9].
Readers familiar with differential geometry will recognise the volume element |Σ(x)|^{-1/2} and the linear approximations to the distance between x and y taken at each point through (x − y)^T Σ(x)^{-1}(x − y) and (x − y)^T Σ(y)^{-1}(x − y), if X is viewed as a Riemannian manifold with metric G(x) = Σ(x)^{-1}. We do not explore these observations further here, but the interested reader is referred to Reference [27] for more discussion.
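As a concrete sketch, the log of the ratio appearing in the PDRWM acceptance rate can be evaluated with a few lines of NumPy; the quadratic target and the particular Σ functions below are illustrative assumptions:

```python
import numpy as np

def log_q_gauss(x, y, h, Sigma):
    """log N(y; x, h^2 Sigma(x)), up to an additive constant
    (constants cancel in the acceptance ratio)."""
    S = h * h * Sigma(x)
    _, logdet = np.linalg.slogdet(S)
    d = y - x
    return -0.5 * (logdet + d @ np.linalg.solve(S, d))

def pdrwm_log_accept(x, y, log_pi, h, Sigma):
    """log of the ratio inside (6): both the target and the two
    position-dependent candidate densities enter, unlike in the RWM."""
    return (log_pi(y) + log_q_gauss(y, x, h, Sigma)) \
         - (log_pi(x) + log_q_gauss(x, y, h, Sigma))

# With a constant Sigma the q terms cancel and min{1, pi(y)/pi(x)} is
# recovered; with a position-dependent Sigma they do not.
log_pi = lambda p: -0.5 * p @ p
x, y = np.array([0.0, 0.0]), np.array([1.0, 1.0])
r = pdrwm_log_accept(x, y, log_pi, 1.0, lambda p: np.eye(2))
```

Here r equals log π(y) − log π(x) exactly, while replacing the constant covariance by, say, (1 + |p|^2) I changes the value through both the determinant and exponent terms.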
The choice of Σ(x) is an obvious question. In fact, specific variants of this method have appeared on many occasions in the literature, some of which we now summarise.
Tempered Langevin diffusions [8]. Σ(x) = π(x)^{-1} I. The authors highlight that the diffusion with dynamics dX_t = π(X_t)^{-1/2} dB_t has invariant distribution π, motivating the choice. The method was shown to perform well for a bi-modal π, as larger jumps are proposed in the low-density region between the two modes.
State-dependent Metropolis [7]. Σ(x) = (1 + |x|)^b for some b ≥ 0. Here the intuition is simply that a variance which grows with |x| means larger jumps will be made in the tails. In one dimension the authors compare the expected squared jumping distance E[(X_{t+1} − X_t)^2] empirically for chains exploring a fixed target distribution, choosing b adaptively, and identify an optimal value of b.
Regional adaptive Metropolis–Hastings [7,11]. Σ(x) = Σ_i whenever x ∈ X_i. In this case the state-space is partitioned into X = ∪_i X_i, and a different proposal covariance Σ_i is learned adaptively in each region X_i. An extension which allows for some errors in choosing an appropriate partition is discussed in Reference [11].
Localised Random Walk Metropolis [10]. Σ(x) = Σ_k w_k(x) Σ_k. Here the w_k(x) are weights based on approximating π with some mixture of Normal/Student's t distributions, using the approach suggested in Reference [28]. At each iteration of the algorithm a mixture component k is sampled from the weights w_k(x), and the covariance Σ_k is used for the proposal.
Kernel adaptive Metropolis–Hastings [9]. Σ(x) = γ^2 I + ν^2 M_x H M_x^T, where M_x = 2[∇_x k(x, z_1), ..., ∇_x k(x, z_n)] for some kernel function k and n past samples {z_1, ..., z_n}, H := I − (1/n) 1 1^T is a centering matrix (the n × n matrix 1 1^T has 1 as each element), and γ, ν are tuning parameters. The approach is based on performing nonlinear principal components analysis on past samples from the chain to learn a local covariance. Illustrative examples for the case of a Gaussian kernel show that Σ(x) acts as a weighted empirical covariance of samples z, with larger weights given to the z_i which are closer to x [9].
The latter cases also motivate any choice of the form Σ(x) = Σ_i w_i(x) (z_i − x)(z_i − x)^T for some past samples {z_i} and weights w_i(x) that decay as |z_i − x| grows, which would also mimic the local curvature of π (taking care to appropriately regularise and diminish adaptation so as to preserve ergodicity, as outlined in Reference [10]).
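A minimal sketch of such a weighted empirical covariance follows; the Gaussian weight function, bandwidth and ridge regularisation are illustrative assumptions, not choices made in the text:

```python
import numpy as np

def local_covariance(x, z, bandwidth=1.0, reg=1e-3):
    """Sigma(x) as a weighted empirical covariance of past samples z,
    with weights decaying in |z_i - x|. The Gaussian weight and the
    reg * I regularisation are illustrative choices."""
    d2 = ((z - x) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)
    w = w / w.sum()
    mu = (w[:, None] * z).sum(axis=0)             # weighted local mean
    zc = z - mu
    S = (w[:, None] * zc).T @ zc                  # weighted covariance
    return S + reg * np.eye(z.shape[1])           # keep S positive-definite

rng = np.random.default_rng(0)
z = rng.standard_normal((500, 2))                 # stand-in for past samples
S = local_covariance(np.zeros(2), z, bandwidth=5.0)
```

With a large bandwidth the weights are nearly uniform and S approaches the ordinary sample covariance; shrinking the bandwidth makes the estimate increasingly local to x.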
Some of the above schemes are examples of adaptive MCMC, in which a candidate from among a family of Markov kernels {P_θ : θ ∈ Θ} is selected by learning the parameter θ during the simulation [10]. Additional conditions on the adaptation process (i.e., the manner in which θ is learned) are required to establish ergodicity results for the resulting stochastic processes. We consider the decision of how to learn θ appropriately to be a separate problem and beyond the scope of the present work, and instead focus attention on establishing geometric ergodicity of the base kernels P_θ for any fixed θ. We note that this is typically a pre-requisite for establishing convergence properties of any adaptive MCMC method [10].
4. Results in One Dimension
Here we consider two different general scenarios for Σ(x) as |x| → ∞: (i) Σ(x) is bounded above and below, and (ii) Σ(x) → ∞ at some specified rate. Of course there is also the possibility that Σ(x) → 0, though intuitively this would result in chains that spend a long time in the tails of a distribution, so we do not consider it (if Σ(x) → 0 then chains will in fact exhibit the negligible moves property studied in Reference [29]). Proofs of the Propositions in Section 4 and Section 5 can be found in Appendix A.
We begin with a result that emphasises that a growing variance is a necessary requirement for geometric ergodicity in the heavy-tailed case.

Proposition 1. If π(x) ∝ |x|^{-(1+r)} in the tails for some r > 0, then unless lim sup_{|x| → ∞} Σ(x) = ∞ the PDRWM cannot produce a geometrically ergodic Markov chain.
The above is a simple extension of a result that is well known in the RWM case. Essentially the tails of the distribution should be exponential or lighter to ensure fast convergence. This motivates consideration of three different types of behaviour for the tails of π.
Assumption 1. The density satisfies π(x) ≤ c e^{-γ|x|} for some c, γ > 0, for all x such that |x| ≥ M, for some finite M.

Assumption 2. The density satisfies π(x) ≤ c e^{-γ|x|^β} for some c, γ > 0 and β ∈ (0, 1), for all x such that |x| ≥ M, for some finite M.

Assumption 3. The density satisfies π(x) ≤ c|x|^{-(1+r)} for some c, r > 0, for all x such that |x| ≥ M, for some finite M.

Naturally Assumption 1 implies 2 and Assumption 2 implies 3. If Assumption 1 is not satisfied then π is generally called heavy-tailed. When π satisfies Assumption 2 or 3 but not 1, then the RWM typically fails to produce a geometrically ergodic chain [4]. We show in the sequel, however, that this is not always the case for the PDRWM. We assume the below condition on Σ to hold throughout this section.

Assumption 4. The function Σ(·) is bounded above by some σ̄^2 < ∞ for all x with |x| ≤ M, and bounded below by some σ^2 > 0 for all x ∈ X, for some finite M.
The heavy-tailed case is known to be a challenging scenario, but the RWM will produce a geometrically ergodic Markov chain if π is log-concave in the tails. Next we extend this result to the case of sub-quadratic variance growth in the tails.

Proposition 2. If there are ε > 0 and ξ < ∞ such that Σ(x) ≤ ξ|x|^{2−ε} whenever |x| ≥ M, then the PDRWM will produce a geometrically ergodic chain in both of the following cases:

1. π satisfies Assumption 1 and is log-concave for all |x| ≥ M;
2. π satisfies Assumption 2 for some β ∈ (0, 1) and Σ(x) ≥ ξ′|x|^{2(1−β)} whenever |x| ≥ M, for some ξ′ > 0.
The second part of Proposition 2 is not true for the RWM, for which Assumption 2 alone is not sufficient for geometric ergodicity [4].
We do not provide a complete proof that the PDRWM will not produce a geometrically ergodic chain when only Assumption 3 holds and Σ(x) ∝ |x|^η for some η < 2, but do show informally that this will be the case. Assuming that in the tails π(x) = c|x|^{-(1+r)} and Σ(x) = |x|^η for some η < 2, then for large x

π(y) q(y, x) / [π(x) q(x, y)] = (|y|/|x|)^{-(1+r)} (|x|/|y|)^{η/2} exp( −[(x − y)^2/(2h^2)] (|y|^{-η} − |x|^{-η}) ).    (10)

The first expression on the right-hand side converges to 1 as |x| → ∞, which is akin to the case of fixed proposal covariance. The second term will be larger than one for |y| < |x| and less than one for |y| > |x|. So the algorithm will exhibit the same 'random walk in the tails' behaviour which is often characteristic of the RWM in this scenario, meaning that the acceptance rate fails to enforce a geometric drift back into the centre of the space.
When Σ(x) ∝ |x|^2 the above intuition will not necessarily hold, as the terms in Equation (10) will be roughly constant with x. When only Assumption 3 holds, it is, therefore, tempting to make the choice Σ(x) = x^2 for |x| ≥ M. Informally we can see that such behaviour may lead to a favourable algorithm if a small enough h is chosen. For any fixed x a typical proposal will now take the form y = x(1 + hε), where ε ~ N(0, 1). It therefore holds that

y = x e^{hε} − x (e^{hε} − 1 − hε),    (11)

where for any fixed x and ε the term e^{hε} − 1 − hε → 0 as h → 0. The first term on the right-hand side of Equation (11) corresponds to the proposal of the multiplicative Random Walk Metropolis, which is known to be geometrically ergodic under Assumption 3 (e.g., [3]), as this equates to taking a logarithmic transformation of x, which 'lightens' the tails of the target density to the point where it becomes log-concave. So in practice we can expect good performance from this choice of Σ(x). The above intuition does not, however, provide enough to establish geometric ergodicity, as the final term on the right-hand side of (11) grows unboundedly with x for any fixed choice of h. The difference between the acceptance rates of the multiplicative Random Walk Metropolis and the PDRWM with Σ(x) = x^2 will be the exponential term in Equation (10). This will instead become polynomial by letting the proposal noise ε follow a distribution with polynomial tails (e.g., Student's t), which is known to be a favourable strategy for the RWM when only Assumption 3 holds [6]. One can see that if the heaviness of the proposal distribution is carefully chosen then the acceptance rate may well enforce a geometric drift into the centre of the space, though for brevity we restrict attention to Gaussian proposals in this article.
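The multiplicative Random Walk Metropolis mentioned above is easy to sketch directly; here on a Pareto-type target (the target and tuning values are illustrative), with the extra factor y/x in the acceptance ratio arising as the proposal ratio q(y, x)/q(x, y):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    """Pareto-type target on [1, inf): pi(x) ∝ x^{-2} (illustrative)."""
    return -2.0 * np.log(x) if x >= 1.0 else -np.inf

h, x, n, n_acc = 0.5, 2.0, 20000, 0
xs = []
for _ in range(n):
    y = x * np.exp(h * rng.standard_normal())    # multiplicative proposal
    # log acceptance ratio for pi(y) y / (pi(x) x); the y/x factor is
    # q(y, x)/q(x, y), the Jacobian of the multiplicative move.
    if np.log(rng.uniform()) < log_pi(y) + np.log(y) - log_pi(x) - np.log(x):
        x, n_acc = y, n_acc + 1
    xs.append(x)
acc_rate = n_acc / n    # equivalently: an RWM on u = log x
```

Writing u = log x, this chain is a plain RWM on the transformed density π(e^u) e^u, which for this target is exponential, so the acceptance rate stays bounded away from zero even far out in the tails. The target median is 2, so the empirical median gives a quick sanity check (the mean does not exist for this target).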
The final result of this section provides a note of warning that lack of care in choosing Σ(x) can have severe consequences for the method.

Proposition 3. If Σ(x)/|x|^2 → ∞ as |x| → ∞, then the PDRWM will not produce a geometrically ergodic Markov chain.
The intuition for this result is straightforward when explained. In the tails, typically |y − x| will be of the same order of magnitude as Σ(x)^{1/2}, meaning |y − x|/|x| grows arbitrarily large as |x| grows. As such, proposals will 'overshoot' the typical set of the distribution, sending the sampler further out into the tails, and will therefore almost always be rejected. The result can be related superficially to a lack of geometric ergodicity for Metropolis–Hastings algorithms in which the proposal mean is comprised of the current state translated by a drift function (often based on ∇ log π) when this drift function grows faster than linearly with |x| (e.g., [30,31]).
5. A Higher-Dimensional Case Study
An easy criticism of the above analysis is that the one-dimensional scenario is sometimes not indicative of the more general behaviour of a method. We note, however, that typically the geometric convergence properties of Metropolis–Hastings algorithms do carry over somewhat naturally to more than one dimension when π is suitably regular (e.g., [5,32]). Because of this we expect that the growth conditions specified above could be transplanted onto the determinant of Σ(x) when the dimension is greater than one (leaving the details of this argument for future work).
A key difference in the higher-dimensional setting is that Σ(x) now dictates both the size and direction of proposals. In the case Σ(x) = Σ, some additional regularity conditions on π are required for geometric ergodicity in more than one dimension, outlined in References [5,32]. An example is also given in Reference [5] of the simple two-dimensional density π(x_1, x_2) ∝ exp(−x_1^2 − x_1^2 x_2^2 − x_2^2), which fails to meet these criteria. The difficult models are those for which probability concentrates on a ridge in the tails, which becomes ever narrower as |x| increases. In this instance, proposals from the RWM are less and less likely to be accepted as |x| grows. Another well-known example of this phenomenon is the funnel distribution introduced in Reference [33].
To explore the behaviour of the PDRWM in this setting, we design a model problem, the staircase distribution, with density

π(x_1, x_2) ∝ 3^{−⌊x_2⌋} 1{ x_2 ≥ 0, |x_1| ≤ 3^{−⌊x_2⌋}/2 },

where ⌊x_2⌋ denotes the integer part of x_2. Graphically the density is a sequence of cuboids on the upper-half plane of R^2 (starting at x_2 = 0), each centred on the vertical axis, with each successive cuboid one third of the width and height of the previous. The density resembles an ever-narrowing staircase, as shown in Figure 1.
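The staircase construction can be encoded directly as an unnormalised log-density; the factor-of-three shrinkage of width and height per stair follows the description above (the coordinate names are ours):

```python
import numpy as np

def log_pi_staircase(x1, x2):
    """Unnormalised staircase log-density: stair k = floor(x2) has
    height proportional to 3^{-k} and width 3^{-k}, so both shrink by
    a factor of three at each step up the vertical axis."""
    if x2 < 0.0:
        return -np.inf
    k = int(np.floor(x2))
    half_width = 0.5 * 3.0 ** (-k)
    if abs(x1) > half_width:
        return -np.inf
    return -k * np.log(3.0)
```

Each stair carries one ninth the mass of the one below it (width and height each shrink by a third), so the density is normalisable despite having unbounded support in the vertical direction.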
We denote by Q_RWM the proposal kernel associated with the Random Walk Metropolis algorithm with fixed covariance h^2 Σ. In fact, the specific choice of h and Σ does not matter provided that the result is positive-definite. For the PDRWM we denote by Q_PD the proposal kernel with covariance matrix

Σ(x_1, x_2) = h^2 diag( 3^{−2⌊x_2⌋}, 1 ),

which will naturally adapt the scale of the first coordinate to the width of the ridge.

Proposition 4. The Metropolis–Hastings algorithm with proposal Q_RWM does not produce a geometrically ergodic Markov chain when π is the staircase distribution.

The design of the PDRWM proposal kernel in this instance is such that the proposal covariance reduces at the same rate as the width of the stairs, therefore naturally adapting the proposal to the width of the ridge on which the density concentrates. This state-dependent adaptation results in a geometrically ergodic chain, as shown in the below result.

Proposition 5. The Metropolis–Hastings algorithm with proposal Q_PD produces a geometrically ergodic Markov chain when π is the staircase distribution.
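The contrast between these two results can be probed empirically: started far up the staircase, fixed-covariance proposals are essentially never accepted, whereas proposals whose first coordinate is scaled to the local stair width are accepted at a healthy rate. A sketch, in which the step sizes, starting point and diagonal covariances are illustrative assumptions:

```python
import numpy as np

def log_pi(p):
    """Unnormalised staircase log-density (stair widths/heights 3^{-k})."""
    x1, x2 = p
    if x2 < 0.0:
        return -np.inf
    k = int(np.floor(x2))
    if abs(x1) > 0.5 * 3.0 ** (-k):
        return -np.inf
    return -k * np.log(3.0)

def diag_fixed(p):
    return np.array([1.0, 1.0])               # RWM: constant covariance

def diag_pd(p):
    """PDRWM: first coordinate scaled to the local stair width."""
    k = max(int(np.floor(p[1])), 0)
    return np.array([3.0 ** (-2 * k), 1.0])

def accept_rate(x0, h, diag, n, rng):
    """How often a single proposal from x0 would be accepted."""
    acc = 0
    for _ in range(n):
        y = x0 + h * np.sqrt(diag(x0)) * rng.standard_normal(2)
        # full ratio: target terms plus the q(y, x)/q(x, y) correction
        s_x, s_y = h * h * diag(x0), h * h * diag(y)
        d = y - x0
        log_q_xy = -0.5 * (np.sum(np.log(s_x)) + np.sum(d * d / s_x))
        log_q_yx = -0.5 * (np.sum(np.log(s_y)) + np.sum(d * d / s_y))
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x0) + log_q_yx - log_q_xy:
            acc += 1
    return acc / n

rng = np.random.default_rng(2)
x0 = np.array([0.0, 10.2])                    # far up the staircase
rwm_rate = accept_rate(x0, 0.5, diag_fixed, 2000, rng)
pdrwm_rate = accept_rate(x0, 0.5, diag_pd, 2000, rng)
```

On stair 10 the ridge has width 3^{-10} ≈ 1.7e-5, so a unit-scale proposal in the first coordinate lands off the support almost surely, while the position-scaled proposal stays on (or steps down) the staircase a substantial fraction of the time.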
6. Discussion
In this paper we have analysed the ergodic behaviour of a Metropolis–Hastings method with proposal kernel Q(x, ·) = N(x, h^2 Σ(x)). In one dimension we have characterised the behaviour in terms of growth conditions on Σ(x) and tail conditions on the target distribution, and in higher dimensions a carefully constructed model problem is discussed. The fundamental question of interest was whether generalising an existing Metropolis–Hastings method by allowing the proposal covariance to change with position can alter the ergodicity properties of the sampler. We can confirm that this is indeed possible, for better or for worse, depending on the choice of covariance. The take-home points for practitioners are (i) lack of sufficient care in the design of Σ(x) can have severe consequences (as in Proposition 3), and (ii) careful choice of Σ(x) can have much more beneficial ones, perhaps the most surprising of which arise in the higher-dimensional setting, as shown in Section 5.
We feel that such results can also offer insight into similar generalisations of different Metropolis–Hastings algorithms (e.g., [13,34]). For example, it seems intuitive that any method in which the variance grows at a faster than quadratic rate in the tails is unlikely to produce a geometrically ergodic chain. There are connections between the PDRWM and some extensions of the Metropolis-adjusted Langevin algorithm [34], the ergodicity properties of which are discussed in Reference [35]. The key difference between the schemes is the inclusion of the drift term (h^2/2) Σ(x) ∇ log π(x) in the proposal mean of the latter. It is this term which in the main governs the behaviour of the sampler, which is why the behaviour of the PDRWM differs from that of this scheme. Markov processes are also used in a wide variety of application areas beyond the design of Metropolis–Hastings algorithms (e.g., [36]), and we hope that some of the results established in the present work prove to be beneficial in some of these other settings.
We can apply these results to the specific variants discussed in Section 3. Provided that sensible choices of regions/weights are made and that an adaptation scheme which obeys the diminishing adaptation criterion is employed, the Regional adaptive Metropolis–Hastings, Localised Random Walk Metropolis and Kernel adaptive Metropolis–Hastings samplers should all satisfy Σ(x) → Σ, a fixed matrix, as |x| → ∞, meaning they can be expected to inherit the ergodicity properties of the standard RWM (the behaviour in the centre of the space, however, will likely be different). In the State-dependent Metropolis method, provided the resulting variance growth is at most quadratic, the sampler should also behave reasonably. Whether or not a large enough value of b would be found by a particular adaptation rule is not entirely clear, and this could be an interesting direction of further study. The Tempered Langevin diffusion scheme, however, will fail to produce a geometrically ergodic Markov chain whenever the tails of π are lighter than those of a Cauchy distribution. To allow reasonable tail exploration when this is the case, two pragmatic options would be to upper bound Σ(x) manually or to use this scheme in conjunction with another, as there is evidence that the sampler can perform favourably when exploring the centre of a distribution [8]. None of the specific variants discussed here are able to mimic the local curvature of π in the tails, so as to enjoy the favourable behaviour exemplified in Proposition 5. This is possible using Hessian information as in Reference [13], but should also be possible in some cases using appropriate surrogates.