Article

Geometric Ergodicity of the Random Walk Metropolis with Position-Dependent Proposal Covariance

by
Samuel Livingstone
Department of Statistical Science, University College London, London WC1E 6BT, UK
Mathematics 2021, 9(4), 341; https://doi.org/10.3390/math9040341
Submission received: 19 January 2021 / Revised: 1 February 2021 / Accepted: 4 February 2021 / Published: 8 February 2021

Abstract

We consider a Metropolis–Hastings method with proposal $N(x, hG(x)^{-1})$, where $x$ is the current state, and study its ergodicity properties. We show that suitable choices of $G(x)$ can change these ergodicity properties compared to the Random Walk Metropolis case $N(x, h\Sigma)$, either for better or worse. We find that if the proposal variance is allowed to grow unboundedly in the tails of the distribution then geometric ergodicity can be established when the target distribution for the algorithm has tails that are heavier than exponential, in contrast to the Random Walk Metropolis case, but that the growth rate must be carefully controlled to prevent the rejection rate approaching unity. We also illustrate that a judicious choice of $G(x)$ can result in a geometrically ergodic chain when probability concentrates on an ever narrower ridge in the tails, something that is again not true for the Random Walk Metropolis.

1. Introduction

Markov chain Monte Carlo (MCMC) methods are techniques for estimating expectations with respect to some probability measure π ( · ) , which need not be normalised. This is done by sampling a Markov chain which has limiting distribution π ( · ) , and computing empirical averages. A popular form of MCMC is the Metropolis–Hastings algorithm [1,2], where at each time step a ‘proposed’ move is drawn from some candidate distribution, and then accepted with some probability, otherwise the chain stays at the current point. Interest lies in finding choices of candidate distribution that will produce sensible estimators for expectations with respect to π ( · ) .
The quality of these estimators can be assessed in many different ways, but a common approach is to understand conditions on π ( · ) that will result in a chain which converges to its limiting distribution at a geometric rate. If such a rate can be established, then a Central Limit Theorem will exist for expectations of functionals with finite second absolute moment under π ( · ) if the chain is reversible.
A simple yet often effective choice is a symmetric candidate distribution centred at the current point in the chain (with a fixed variance), resulting in the Random Walk Metropolis (RWM) (e.g., [3]). The convergence properties of a chain produced by the RWM are well-studied. In one dimension, essentially convergence is geometric if π ( x ) decays at an exponential or faster rate in the tails [4], while in higher dimensions an additional curvature condition is required [5]. Slower rates of convergence have also been established in the case of heavier tails [6].
Recently, some MCMC methods were proposed which generalise the RWM, whereby proposals are still centred at the current point x and symmetric, but the variance changes with x [7,8,9,10,11]. An extension to infinite-dimensional Hilbert spaces is also suggested in Reference [12]. The motivation is that the chain can become more ‘local’, perhaps making larger jumps when out in the tails, or mimicking the local dependence structure of π ( · ) to propose more intelligent moves. Designing MCMC methods of this nature is particularly relevant for modern Bayesian inference problems, where posterior distributions are often high dimensional and exhibit nonlinear correlations [13]. We term this approach the Position-dependent Random Walk Metropolis (PDRWM), although technically this is a misnomer, since proposals are no longer random walks. Other choices of candidate distribution designed with distributions that exhibit nonlinear correlations were introduced in Reference [13]. Although powerful, these require derivative information for log π ( x ) , something which can be unavailable in modern inference problems (e.g., [14]). We note that no such information is required for the PDRWM, as shown by the particular cases suggested in References [7,8,9,10,11]. However, there are relations between the approaches, to the extent that understanding how the properties of the PDRWM differ from the standard RWM should also aid understanding of the methods introduced in Reference [13].
In this article, we consider the convergence rate of a Markov chain generated by the PDRWM to its limiting distribution. Our main interest lies in whether this generalisation can change these ergodicity properties compared to the standard RWM with fixed covariance. We focus on the case in which the candidate distribution is Gaussian, and illustrate that such changes can occur in several different ways, either for better or worse. Our aim is not to give a complete characterisation of the approach, but rather to illustrate the possibilities through carefully chosen examples, which are known to be indicative of more general behaviour.
In Section 2 necessary concepts about Markov chains are briefly reviewed, before the PDRWM is introduced in Section 3. Some results in the one-dimensional case are given in Section 4, before a higher-dimensional model problem is examined in Section 5. Throughout π ( · ) denotes a probability measure (we use the terms probability measure and distribution synonymously), and π ( x ) its density with respect to Lebesgue measure d x .
Since an early version of this work appeared online, some contributions to the literature were made that are worthy of mention. A Markov kernel constructed as a state-dependent mixture is introduced in Reference [15] and its properties are studied in some cases that are similar in spirit to the model problem of Section 5. An algorithm called Directional Metropolis–Hastings, which encompasses a specific instance of the PDRWM, is introduced and studied in Reference [16], and a modification of the same idea is used to develop the Hop kernel within the Hug and Hop algorithm of Reference [17]. Kamatani considers an algorithm designed for the infinite-dimensional setting in Reference [18] of a similar design to that discussed in Reference [12] and studies the ergodicity properties.

2. Markov Chains and Geometric Ergodicity

We will work on the Borel space $(\mathcal{X}, \mathcal{B})$, with $\mathcal{X} \subset \mathbb{R}^d$ for some $d \geq 1$, so that each $X_t \in \mathcal{X}$ for a discrete-time Markov chain $\{X_t\}_{t \geq 0}$ with time-homogeneous transition kernel $P : \mathcal{X} \times \mathcal{B} \to [0, 1]$, where $P(x, A) = \mathbb{P}[X_{i+1} \in A \mid X_i = x]$ and $P^n(x, A)$ is defined similarly for $X_{i+n}$. All chains we consider will have invariant distribution $\pi(\cdot)$, and be both $\pi$-irreducible and aperiodic, meaning $\pi(\cdot)$ is the limiting distribution from $\pi$-almost any starting point [19]. We use $|\cdot|$ to denote the Euclidean norm.
In Markov chain Monte Carlo the objective is to construct estimators of $E_\pi[f]$, for some $f : \mathcal{X} \to \mathbb{R}$, by computing
$$\hat{f}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i), \qquad X_i \sim P^i(x_0, \cdot).$$
If $\pi(\cdot)$ is the limiting distribution for the chain then $P$ will be ergodic, meaning $\hat{f}_n \xrightarrow{a.s.} E_\pi[f]$ from $\pi$-almost any starting point. For finite $n$ the quality of $\hat{f}_n$ intuitively depends on how quickly $P^n(x, \cdot)$ approaches $\pi(\cdot)$. We call the chain geometrically ergodic if
$$\| P^n(x, \cdot) - \pi(\cdot) \|_{TV} \leq M(x) \rho^n, \tag{1}$$
from $\pi$-almost any $x \in \mathcal{X}$, for some $M > 0$ and $\rho < 1$, where $\| \mu(\cdot) - \nu(\cdot) \|_{TV} := \sup_{A \in \mathcal{B}} |\mu(A) - \nu(A)|$ is the total variation distance between distributions $\mu(\cdot)$ and $\nu(\cdot)$ [19].
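As a toy illustration of the geometric bound on the total variation distance, the following sketch iterates a hypothetical two-state transition matrix (not taken from the article) and records the distance to its invariant distribution after each step. On a finite state space the supremum over sets equals half the $L^1$ distance, which is what `tv_distance` computes.

```python
def tv_distance(mu, nu):
    """Total variation distance between two distributions on a finite set.

    On a finite set, sup_A |mu(A) - nu(A)| equals half the L1 distance.
    """
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))


def step(dist, P):
    """Push the row vector `dist` one step through the transition matrix P."""
    k = len(P)
    return [sum(dist[i] * P[i][j] for i in range(k)) for j in range(k)]


# Hypothetical two-state chain; its invariant distribution is (2/3, 1/3).
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2.0 / 3.0, 1.0 / 3.0]

dist = [1.0, 0.0]  # chain started in state 0
distances = []
for n in range(1, 11):
    dist = step(dist, P)
    distances.append(tv_distance(dist, pi))

# Ratios of successive distances; for this chain they equal the second
# eigenvalue of P (0.7), so the geometric bound holds with rho = 0.7.
ratios = [distances[i + 1] / distances[i] for i in range(len(distances) - 1)]
```

In this example the bound is attained with $\rho = 0.7$ and $M = 1/3$ for the chosen starting state; on general state spaces no such exact computation is usually available, which is why drift and minorisation arguments are used instead.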
For $\pi$-reversible Markov chains geometric ergodicity implies that if $E_\pi[f^2] < \infty$ for some $f : \mathcal{X} \to \mathbb{R}$, then
$$\sqrt{n} \left( \hat{f}_n - E_\pi[f] \right) \xrightarrow{d} N\left( 0, v(P, f) \right), \tag{2}$$
for some asymptotic variance $v(P, f)$ [20]. Equation (2) enables the construction of asymptotic confidence intervals for $\hat{f}_n$.
In practice, geometric ergodicity does not guarantee that f ^ n will be a sensible estimator, as M ( x ) can be arbitrarily large if the chain is initialised far from the typical set under π ( · ) , and ρ may be very close to 1. However, chains which are not geometrically ergodic can often either get ‘stuck’ for a long time in low-probability regions or fail to explore the entire distribution adequately, sometimes in ways that are difficult to diagnose using standard MCMC diagnostics.

Establishing Geometric Ergodicity

It is shown in Chapter 15 of Reference [21] that Equation (1) is equivalent to the condition that there exists a Lyapunov function $V : \mathcal{X} \to [1, \infty)$ and some $\lambda < 1$, $b < \infty$ such that
$$PV(x) \leq \lambda V(x) + b \mathbb{I}_C(x), \tag{3}$$
where $PV(x) := \int V(y) P(x, dy)$. The set $C \subset \mathcal{X}$ must be small, meaning that for some $m \in \mathbb{N}$, $\varepsilon > 0$ and probability measure $\nu(\cdot)$
$$P^m(x, A) \geq \varepsilon \nu(A), \tag{4}$$
for any $x \in C$ and $A \in \mathcal{B}$. Equations (3) and (4) are referred to as drift and minorisation conditions. Intuitively, $C$ can be thought of as the centre of the space, and Equation (3) ensures that some one-dimensional projection of $\{X_t\}_{t \geq 0}$ drifts towards $C$ at a geometric rate when outside. In fact, Equation (3) is sufficient for the return time distribution to $C$ to have geometric tails [21]. Once in $C$, (4) ensures that with some probability the chain forgets its past and hence regenerates. This regeneration allows the chain to couple with another initialised from $\pi(\cdot)$, giving a bound on the total variation distance through the coupling inequality (e.g., [19]). More intuition is given in Reference [22].
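To make the drift condition concrete, here is a minimal sketch for a chain that is not studied in the article: a Gaussian autoregression $X' = aX + \sigma\xi$ with $\xi \sim N(0,1)$ and the Lyapunov function $V(x) = 1 + x^2$, for which $PV(x) = 1 + a^2 x^2 + \sigma^2$ in closed form. The values of `a`, `sigma` and `lam` are assumptions chosen for illustration, and the set $C$ is taken to be a compact interval.

```python
import math

# Hypothetical AR(1) chain: X' = a * X + sigma * xi, with xi ~ N(0, 1).
a, sigma = 0.5, 1.0
lam = 0.5 * (1.0 + a * a)          # any lambda in (a^2, 1) would do
b = 1.0 + sigma ** 2 - lam         # sup over C of PV - lam * V, attained at x = 0
R = math.sqrt(b / (lam - a * a))   # C = [-R, R]


def V(x):
    """Lyapunov function V(x) = 1 + x^2, which is >= 1 everywhere."""
    return 1.0 + x * x


def PV(x):
    """E[V(a x + sigma xi)], available in closed form for this chain."""
    return 1.0 + a * a * x * x + sigma ** 2


def drift_rhs(x):
    """Right-hand side lam * V(x) + b * 1_C(x) of the drift condition."""
    return lam * V(x) + (b if abs(x) <= R else 0.0)


# Check the inequality PV <= lam * V + b * 1_C on a grid of points.
grid = [i / 10.0 for i in range(-500, 501)]
drift_holds = all(PV(x) <= drift_rhs(x) + 1e-12 for x in grid)
```

Outside $C$ the chain contracts $V$ by the factor $\lambda$ on average, while inside $C$ the constant $b$ absorbs the excess; the grid check is only a sanity test of the algebra, not a proof.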
Transition kernels considered here will be of the Metropolis–Hastings type, given by
$$P(x, dy) = \alpha(x, y) Q(x, dy) + r(x) \delta_x(dy), \tag{5}$$
where $Q(x, dy) = q(y|x) dy$ is some candidate kernel, $\alpha$ is called the acceptance rate and $r(x) = 1 - \int \alpha(x, y) Q(x, dy)$. Here we choose
$$\alpha(x, y) = 1 \wedge \frac{\pi(y) q(x|y)}{\pi(x) q(y|x)}, \tag{6}$$
where $a \wedge b$ denotes the minimum of $a$ and $b$. This choice implies that $P$ satisfies detailed balance for $\pi(\cdot)$ [23], and hence the chain is $\pi$-reversible (note that other choices for $\alpha$ can result in non-reversible chains, see Reference [24] for details).
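The kernel (5) with acceptance rate (6) can be sketched as a single transition function. This is a minimal one-dimensional illustration working on the log scale; the function names and the Gaussian target in the usage lines are assumptions made for the example, not part of the article.

```python
import math
import random


def mh_step(x, log_pi, propose, log_q, rng):
    """One Metropolis-Hastings transition from the state x.

    log_pi(x)       : log target density, up to an additive constant
    propose(x, rng) : draws y from Q(x, .)
    log_q(y, x)     : log q(y | x), up to a constant free of x and y
    Returns the new state and whether the proposal was accepted.
    """
    y = propose(x, rng)
    log_ratio = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    if math.log(u) < min(0.0, log_ratio):
        return y, True
    return x, False


# Illustrative use: Random Walk Metropolis on a standard Gaussian target.
rng = random.Random(0)
h = 1.0
log_pi = lambda x: -0.5 * x * x
propose = lambda x, rng: x + math.sqrt(h) * rng.gauss(0.0, 1.0)
log_q = lambda y, x: -0.5 * (y - x) ** 2 / h  # symmetric, so it cancels

x, chain, n_accepted = 0.0, [], 0
for _ in range(20000):
    x, accepted = mh_step(x, log_pi, propose, log_q, rng)
    chain.append(x)
    n_accepted += accepted

sample_mean = sum(chain) / len(chain)
acceptance_rate = n_accepted / len(chain)
```

Rejected proposals leave the chain at the current point, which is the $r(x)\delta_x(dy)$ term in (5).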
Roberts and Tweedie [5], following on from Reference [21], introduced the following regularity conditions.
Theorem 1.
(Roberts and Tweedie). Suppose that $\pi(x)$ is bounded away from 0 and ∞ on compact sets, and there exists $\delta_q > 0$ and $\varepsilon_q > 0$ such that for every $x$
$$|x - y| \leq \delta_q \;\Rightarrow\; q(y|x) \geq \varepsilon_q.$$
Then the chain with kernel (5) is $\mu^{Leb}$-irreducible and aperiodic, and every nonempty compact set is small.
For the choices of $Q$ considered in this article these conditions hold, and we will restrict ourselves to forms of $\pi(x)$ for which the same is true (apart from a specific case in Section 5). Under the conditions of Theorem 1, (1) holds only if a Lyapunov function $V : \mathcal{X} \to [1, \infty]$ with $E_\pi[V] < \infty$ exists such that
$$\limsup_{|x| \to \infty} \frac{PV(x)}{V(x)} < 1. \tag{7}$$
When $P$ is of the Metropolis–Hastings type, (7) can be written
$$\limsup_{|x| \to \infty} \int \left[ \frac{V(y)}{V(x)} - 1 \right] \alpha(x, y) Q(x, dy) < 0. \tag{8}$$
In this case, a simple criterion for lack of geometric ergodicity is
$$\limsup_{|x| \to \infty} r(x) = 1.$$
Intuitively this implies that the chain is likely to get ‘stuck’ in the tails of a distribution for large periods.
Jarner and Tweedie [25] introduce a necessary condition for geometric ergodicity through a tightness condition.
Theorem 2.
(Jarner and Tweedie). If for any $\varepsilon > 0$ there is a $\delta > 0$ such that for all $x \in \mathcal{X}$
$$P(x, B_\delta(x)) > 1 - \varepsilon,$$
where $B_\delta(x) := \{ y \in \mathcal{X} : d(x, y) < \delta \}$, then a necessary condition for $P$ to produce a geometrically ergodic chain is that for some $s > 0$
$$\int e^{s|x|} \pi(dx) < \infty.$$
The result highlights that when π ( · ) is heavy-tailed the chain must be able to make very large moves and still be capable of returning to the centre quickly for (1) to hold.

3. Position-Dependent Random Walk Metropolis

In the RWM, $Q(x, dy) = q(y - x) dy$ with $q(y - x) = q(x - y)$, meaning (6) reduces to $\alpha(x, y) = 1 \wedge \pi(y)/\pi(x)$. A common choice is $Q(x, \cdot) = N(x, h\Sigma)$, with $\Sigma$ chosen to mimic the global covariance structure of $\pi(\cdot)$ [3]. Various results exist concerning the optimal choice of $h$ in a given setting (e.g., [26]). It is straightforward to see that Theorem 2 holds here, so that the tails of $\pi(x)$ must be uniformly exponential or lighter for geometric ergodicity. In one dimension this is in fact a sufficient condition [4], while for higher dimensions additional conditions are required [5]. We return to this case in Section 5.
In the PDRWM $Q(x, \cdot) = N(x, hG(x)^{-1})$, so (6) becomes
$$\alpha(x, y) = 1 \wedge \frac{\pi(y) |G(y)|^{1/2}}{\pi(x) |G(x)|^{1/2}} \exp\left( -\frac{1}{2h} (x - y)^T [G(y) - G(x)] (x - y) \right).$$
The motivation for designing such an algorithm is that proposals are more able to reflect the local dependence structure of π ( · ) . In some cases this dependence may vary greatly in different parts of the state-space, making a global choice of Σ ineffective [9].
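A minimal one-dimensional sketch of this acceptance rate is given below, with the step size $h$ written explicitly. The closed-form log ratio is compared against a direct evaluation of the two Gaussian proposal densities; the target, the choice $G(x) = (1 + |x|)^{-b}$ and all numerical values are assumptions made only for illustration.

```python
import math
import random


def pdrwm_log_ratio(x, y, log_pi, G, h):
    """Log Hastings ratio for the 1-d proposal N(x, h / G(x))."""
    return (log_pi(y) - log_pi(x)
            + 0.5 * (math.log(G(y)) - math.log(G(x)))
            - (x - y) ** 2 * (G(y) - G(x)) / (2.0 * h))


def pdrwm_step(x, log_pi, G, h, rng):
    """One PDRWM transition: propose from N(x, h / G(x)), then accept or reject."""
    y = x + math.sqrt(h / G(x)) * rng.gauss(0.0, 1.0)
    u = 1.0 - rng.random()  # uniform on (0, 1]
    if math.log(u) < min(0.0, pdrwm_log_ratio(x, y, log_pi, G, h)):
        return y
    return x


def log_normal_pdf(y, mean, var):
    """Log density of N(mean, var) at y, used only as a cross-check."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (y - mean) ** 2 / var


# Illustrative choices: standard Gaussian target, G(x) = (1 + |x|)^(-b).
log_pi = lambda x: -0.5 * x * x
b, h = 1.0, 0.5
G = lambda x: (1.0 + abs(x)) ** (-b)

x0, y0 = 0.3, -1.7
direct = (log_pi(y0) + log_normal_pdf(x0, y0, h / G(y0))
          - log_pi(x0) - log_normal_pdf(y0, x0, h / G(x0)))
closed_form = pdrwm_log_ratio(x0, y0, log_pi, G, h)

rng = random.Random(1)
xs = [0.0]
for _ in range(1000):
    xs.append(pdrwm_step(xs[-1], log_pi, G, h, rng))
```

When $G$ is constant the last two terms of the log ratio vanish and the expression reduces to the RWM ratio $\pi(y)/\pi(x)$.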
Readers familiar with differential geometry will recognise the volume element $|G(x)|^{1/2} dx$ and the linear approximations to the distance between $x$ and $y$ taken at each point through $G(x)$ and $G(y)$ if $\mathcal{X}$ is viewed as a Riemannian manifold with metric $G$. We do not explore these observations further here, but the interested reader is referred to Reference [27] for more discussion.
The choice of G ( x ) is an obvious question. In fact, specific variants of this method have appeared on many occasions in the literature, some of which we now summarise.
  • Tempered Langevin diffusions [8]. $G(x) = \pi(x) I$. The authors highlight that the diffusion with dynamics $dX_t = \pi^{-1/2}(X_t) dW_t$ has invariant distribution $\pi(\cdot)$, motivating the choice. The method was shown to perform well for a bi-modal $\pi(x)$, as larger jumps are proposed in the low density region between the two modes.
  • State-dependent Metropolis [7]. $G(x) = (1 + |x|)^{-b}$. Here the intuition is simply that $b > 0$ means larger jumps will be made in the tails. In one dimension the authors compare the expected squared jumping distance $E[(X_{i+1} - X_i)^2]$ empirically for chains exploring a $N(0, 1)$ target distribution, choosing $b$ adaptively, and found $b \approx 1.6$ to be optimal.
  • Regional adaptive Metropolis–Hastings [7,11]. $G(x)^{-1} = \sum_{i=1}^{m} \mathbb{I}(x \in \mathcal{X}_i) \Sigma_i$. In this case the state-space is partitioned into $\mathcal{X}_1 \cup \dots \cup \mathcal{X}_m$, and a different proposal covariance $\Sigma_i$ is learned adaptively in each region $1 \leq i \leq m$. An extension which allows for some errors in choosing an appropriate partition is discussed in [11].
  • Localised Random Walk Metropolis [10]. $G(x)^{-1} = \sum_{k=1}^{m} \check{q}_\theta(k|x) \Sigma_k$. Here $\check{q}_\theta(k|x)$ are weights based on approximating $\pi(x)$ with some mixture of Normal/Student’s t distributions, using the approach suggested in Reference [28]. At each iteration of the algorithm a mixture component $k$ is sampled from $\check{q}_\theta(\cdot|x)$, and the covariance $\Sigma_k$ is used for the proposal $Q(x, dy)$.
  • Kernel adaptive Metropolis–Hastings [9]. $G(x)^{-1} = \gamma^2 I + \nu^2 M_x H M_x^T$, where $M_x = 2[\nabla_x k(z_1, x), \dots, \nabla_x k(z_n, x)]$ for some kernel function $k$ and $n$ past samples $\{z_1, \dots, z_n\}$, $H = I - (1/n) 1_{n \times n}$ is a centering matrix (the $n \times n$ matrix $1_{n \times n}$ has 1 as each element), and $\gamma$, $\nu$ are tuning parameters. The approach is based on performing nonlinear principal components analysis on past samples from the chain to learn a local covariance. Illustrative examples for the case of a Gaussian kernel show that $M_x H M_x^T$ acts as a weighted empirical covariance of samples $z$, with larger weights given to the $z_i$ which are closer to $x$ [9].
The latter cases also motivate any choice of the form
$$G(x)^{-1} = \sum_{i=1}^{n} w(x, z_i) (z_i - x)^T (z_i - x)$$
for some past samples $\{z_1, \dots, z_n\}$ and weight function $w : \mathcal{X} \times \mathcal{X} \to [0, \infty)$ with $\sum_i w(x, z_i) = 1$ that decays as $|x - z_i|$ grows, which would also mimic the local curvature of $\pi(\cdot)$ (taking care to appropriately regularise and diminish adaptation so as to preserve ergodicity, as outlined in Reference [10]).
Some of the above schemes are examples of adaptive MCMC, in which a candidate from among a family of Markov kernels $\{P_\theta : \theta \in \Theta\}$ is selected by learning the parameter $\theta \in \Theta$ during the simulation [10]. Additional conditions on the adaptation process (i.e., the manner in which $\theta$ is learned) are required to establish ergodicity results for the resulting stochastic processes. We consider the decisions on how to learn $\theta$ appropriately to be a separate problem and beyond the scope of the present work, and instead focus attention on establishing geometric ergodicity of the base kernels $P_\theta$ for any fixed $\theta \in \Theta$. We note that this is typically a pre-requisite for establishing convergence properties of any adaptive MCMC method [10].
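The weighted form of $G(x)^{-1}$ can be sketched as follows. The Gaussian weight function, the `bandwidth` parameter and the sample points are all hypothetical choices for illustration; the samples are treated as row vectors so that each term in the sum is a $d \times d$ outer product, and no regularisation is applied.

```python
import math


def local_covariance(x, samples, bandwidth=1.0):
    """Weighted sum of outer products of (z_i - x), with weights summing to one.

    The weights come from a Gaussian kernel in |x - z_i| (an assumed choice),
    so samples closer to x contribute more.
    """
    d = len(x)
    raw = []
    for z in samples:
        sq_dist = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        raw.append(math.exp(-0.5 * sq_dist / bandwidth ** 2))
    total = sum(raw)
    weights = [r / total for r in raw]

    cov = [[0.0] * d for _ in range(d)]
    for w, z in zip(weights, samples):
        diff = [zi - xi for zi, xi in zip(z, x)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += w * diff[i] * diff[j]
    return cov


# Four hypothetical past samples placed symmetrically around x = (0, 0).
samples = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
cov = local_covariance((0.0, 0.0), samples)
```

With few or collinear samples the resulting matrix can be singular, which is one reason a regularising term such as a multiple of the identity is typically added in practice.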

4. Results in One Dimension

Here we consider two different general scenarios as $|x| \to \infty$: (i) $G(x)$ is bounded above and below, and (ii) $G(x) \to 0$ at some specified rate. Of course there is also the possibility that $G(x) \to \infty$, though intuitively this would result in chains that spend a long time in the tails of a distribution, so we do not consider it (if $G(x) \to \infty$ then chains will in fact exhibit the negligible moves property studied in Reference [29]). Proofs of the Propositions in Section 4 and Section 5 can be found in Appendix A.
We begin with a result that emphasizes that a growing variance is a necessary requirement for geometric ergodicity in the heavy-tailed case.
Proposition 1.
If $G(x) \geq \sigma^{-2}$ for some $\sigma^2 > 0$, then unless $\int e^{\eta |x|} \pi(dx) < \infty$ for some $\eta > 0$ the PDRWM cannot produce a geometrically ergodic Markov chain.
The above is a simple extension of a result that is well-known in the RWM case. Essentially the tails of the distribution should be exponential or lighter to ensure fast convergence. This motivates consideration of three different types of behaviour for the tails of π ( · ) .
Assumption 1.
The density $\pi(x)$ satisfies one of the following tail conditions for all $y, x \in \mathcal{X}$ such that $|y| > |x| > t$, for some finite $t > 0$.
  • $\pi(y)/\pi(x) \leq \exp\{ -a (|y| - |x|) \}$ for some $a > 0$
  • $\pi(y)/\pi(x) \leq \exp\{ -a (|y|^\beta - |x|^\beta) \}$ for some $a > 0$ and $\beta \in (0, 1)$
  • $\pi(y)/\pi(x) \leq (|x| / |y|)^p$ for some $p > 1$.
Naturally the first condition implies the second, and the second implies the third. In what follows, Assumptions 1, 2 and 3 on $\pi$ refer to these three tail conditions in the order listed. If Assumption 1 is not satisfied then $\pi(\cdot)$ is generally called heavy-tailed. When $\pi(x)$ satisfies Assumption 2 or 3 but not 1, then the RWM typically fails to produce a geometrically ergodic chain [4]. We show in the sequel, however, that this is not always the case for the PDRWM. We assume that the following conditions on $G(x)$ hold throughout this section.
Assumption 2.
The function $G : \mathcal{X} \to (0, \infty)$ is bounded above by some $\sigma_b^2 < \infty$ for all $x \in \mathcal{X}$, and bounded below for all $x \in \mathcal{X}$ with $|x| < t$, for some $t > 0$.
The heavy-tailed case is known to be a challenging scenario, but the RWM will produce a geometrically ergodic Markov chain if π ( x ) is log-concave. Next we extend this result to the case of sub-quadratic variance growth in the tails.
Proposition 2.
If $\exists\, r < \infty$ such that $G(x) \propto |x|^{-\gamma}$ whenever $|x| > r$, then the PDRWM will produce a geometrically ergodic chain in both of the following cases:
  • $\pi(x)$ satisfies Assumption 1 and $\gamma \in [0, 2)$
  • $\pi(x)$ satisfies Assumption 2 for some $\beta \in (0, 1)$ and $\gamma \in (2(1 - \beta), 2)$
The second part of Proposition 2 is not true for the RWM, for which Assumption 2 alone is not sufficient for geometric ergodicity [4].
We do not provide a complete proof that the PDRWM will not produce a geometrically ergodic chain when only Assumption 3 holds and $G(x) \propto |x|^{-\gamma}$ for some $\gamma < 2$, but do show informally that this will be the case. Assuming that in the tails $\pi(x) \propto |x|^{-p}$ for some $p > 1$ then for large $x$
$$\alpha(x, x + c x^{\gamma/2}) = 1 \wedge \left( \frac{x}{x + c x^{\gamma/2}} \right)^{p + \gamma/2} \exp\left( -\frac{c^2 x^{\gamma}}{2h} \left[ \frac{1}{(x + c x^{\gamma/2})^{\gamma}} - \frac{1}{x^{\gamma}} \right] \right). \tag{10}$$
The first expression on the right hand side converges to 1 as $x \to \infty$, which is akin to the case of fixed proposal covariance. The second term will be larger than one for $c > 0$ and less than one for $c < 0$. So the algorithm will exhibit the same ‘random walk in the tails’ behaviour which is often characteristic of the RWM in this scenario, meaning that the acceptance rate fails to enforce a geometric drift back into the centre of the space.
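The behaviour of the two factors in Equation (10) can be checked numerically. The sketch below evaluates them for a polynomial-tailed target $\pi(x) \propto |x|^{-p}$ and $G(x) = |x|^{-\gamma}$; the values of `gamma`, `p`, `h` and `c` are arbitrary assumptions chosen for illustration.

```python
import math


def eq10_factors(x, c, gamma, p, h):
    """Return the power factor and the exponential factor of Equation (10)."""
    y = x + c * x ** (gamma / 2.0)
    power_term = (x / y) ** (p + gamma / 2.0)
    exp_term = math.exp(-(c * c * x ** gamma / (2.0 * h))
                        * (y ** (-gamma) - x ** (-gamma)))
    return power_term, exp_term


gamma, p, h = 1.0, 2.0, 1.0
for x in (1e2, 1e4, 1e6):
    outward = eq10_factors(x, 1.0, gamma, p, h)    # proposal further into the tail
    inward = eq10_factors(x, -1.0, gamma, p, h)    # proposal back towards the centre
    print(x, outward, inward)

power_out, exp_out = eq10_factors(1e6, 1.0, gamma, p, h)
power_in, exp_in = eq10_factors(1e6, -1.0, gamma, p, h)
```

Both factors approach one as $x$ grows, so inward and outward proposals of this size are accepted with nearly equal probability far out in the tails.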
When $\gamma = 2$ the above intuition will not necessarily hold, as the terms in Equation (10) will be roughly constant with $x$. When only Assumption 3 holds, it is, therefore, tempting to make the choice $G(x) = x^{-2}$ for $|x| > r$. Informally we can see that such behaviour may lead to a favourable algorithm if a small enough $h$ is chosen. For any fixed $x > r$ a typical proposal will now take the form $y = (1 + \xi \sqrt{h}) x$, where $\xi \sim N(0, 1)$. It therefore holds that
$$y = e^{\xi \sqrt{h}} x + r(x, h, \xi), \tag{11}$$
where for any fixed $x$ and $\xi$ the term $r(x, h, \xi)/\sqrt{h} \to 0$ as $h \to 0$. The first term on the right-hand side of Equation (11) corresponds to the proposal of the multiplicative Random Walk Metropolis, which is known to be geometrically ergodic under Assumption 3 (e.g., [3]), as this equates to taking a logarithmic transformation of $x$, which ‘lightens’ the tails of the target density to the point where it becomes log-concave. So in practice we can expect good performance from this choice of $G(x)$. The above intuition does not, however, provide enough to establish geometric ergodicity, as the final term on the right-hand side of (11) grows unboundedly with $x$ for any fixed choice of $h$. The difference between the acceptance rates of the multiplicative Random Walk Metropolis and the PDRWM with $G(x) = x^{-2}$ will be the exponential term in Equation (10). This will instead become polynomial by letting the proposal noise $\xi$ follow a distribution with polynomial tails (e.g., Student’s t), which is known to be a favourable strategy for the RWM when only Assumption 3 holds [6]. One can see that if the heaviness of the proposal distribution is carefully chosen then the acceptance rate may well enforce a geometric drift into the centre of the space, though for brevity we restrict attention to Gaussian proposals in this article.
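The two properties of the remainder term in Equation (11) can be seen numerically. The sketch below computes $r(x, h, \xi) = (1 + \xi\sqrt{h})x - e^{\xi\sqrt{h}}x$ directly; the particular values of $x$, $\xi$ and $h$ are arbitrary assumptions.

```python
import math


def linearisation_remainder(x, h, xi):
    """r(x, h, xi) = (1 + xi * sqrt(h)) * x - exp(xi * sqrt(h)) * x."""
    u = xi * math.sqrt(h)
    # expm1(u) = exp(u) - 1, evaluated accurately for small u.
    return x * (u - math.expm1(u))


# For fixed x and xi, r / sqrt(h) shrinks as h decreases.
x, xi = 10.0, 1.5
scaled = [abs(linearisation_remainder(x, h, xi)) / math.sqrt(h)
          for h in (1e-1, 1e-2, 1e-4, 1e-6)]

# For fixed h and xi, |r| grows linearly (hence without bound) in x.
growth = [abs(linearisation_remainder(x_val, 0.01, xi))
          for x_val in (1.0, 10.0, 100.0)]
```

The first list decreases towards zero roughly like $x\xi^2\sqrt{h}/2$, while the second grows in proportion to $x$, which is the obstacle to turning the heuristic into a proof.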
The final result of this section provides a note of warning that lack of care in choosing G ( x ) can have severe consequences for the method.
Proposition 3.
If $G(x) x^2 \to 0$ as $|x| \to \infty$, then the PDRWM will not produce a geometrically ergodic Markov chain.
The intuition for this result is straightforward. In the tails, typically $|y - x|$ will be the same order of magnitude as $\sqrt{G(x)^{-1}}$, meaning $|y - x| / |x|$ grows arbitrarily large as $|x|$ grows. As such, proposals will ‘overshoot’ the typical set of the distribution, sending the sampler further out into the tails, and will therefore almost always be rejected. The result can be related superficially to a lack of geometric ergodicity for Metropolis–Hastings algorithms in which the proposal mean is comprised of the current state translated by a drift function (often based on $\nabla \log \pi(x)$) when this drift function grows faster than linearly with $|x|$ (e.g., [30,31]).
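The overshooting effect can be illustrated with a small Monte Carlo estimate of the average acceptance probability at a fixed current state. The sketch assumes a standard Gaussian target and $G(x) = \max(1, |x|)^{-\gamma}$ with $\gamma = 3$, so that $G(x)x^2 \to 0$; these choices, the sample size and the seed are illustrative assumptions rather than anything prescribed in the text.

```python
import math
import random


def mean_acceptance(x, gamma, h=1.0, n=20000, seed=0):
    """Monte Carlo estimate of the expected acceptance probability at x."""
    rng = random.Random(seed)
    G = lambda u: max(1.0, abs(u)) ** (-gamma)   # assumed form of G
    log_pi = lambda u: -0.5 * u * u              # assumed Gaussian target
    total = 0.0
    for _ in range(n):
        y = x + math.sqrt(h / G(x)) * rng.gauss(0.0, 1.0)
        log_ratio = (log_pi(y) - log_pi(x)
                     + 0.5 * (math.log(G(y)) - math.log(G(x)))
                     - (x - y) ** 2 * (G(y) - G(x)) / (2.0 * h))
        # Cap at zero before exponentiating to avoid overflow.
        total += math.exp(min(0.0, log_ratio))
    return total / n


acc_near = mean_acceptance(2.0, gamma=3.0)
acc_far = mean_acceptance(1e2, gamma=3.0)
acc_very_far = mean_acceptance(1e4, gamma=3.0)
```

Under these assumptions the estimated acceptance probability falls steadily as the current state moves outwards, since the proposal scale $|x|^{3/2}$ dwarfs $|x|$ and most proposals land far beyond the bulk of the target.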

5. A Higher-Dimensional Case Study

An easy criticism of the above analysis is that the one-dimensional scenario is sometimes not indicative of the more general behaviour of a method. We note, however, that typically the geometric convergence properties of Metropolis–Hastings algorithms do carry over somewhat naturally to more than one dimension when $\pi(\cdot)$ is suitably regular (e.g., [5,32]). Because of this we expect that the growth conditions specified above could be transplanted onto the determinant of $G(x)$ when the dimension is greater than one (leaving the details of this argument for future work).
A key difference in the higher-dimensional setting is that $G(x)$ now dictates both the size and direction of proposals. In the case $G(x)^{-1} = \Sigma$, some additional regularity conditions on $\pi(x)$ are required for geometric ergodicity in more than one dimension, outlined in References [5,32]. An example is also given in Reference [5] of the simple two-dimensional density $\pi(x, y) \propto \exp(-x^2 - y^2 - x^2 y^2)$, which fails to meet these criteria. The difficult models are those for which probability concentrates on a ridge in the tails, which becomes ever narrower as $|x|$ increases. In this instance, proposals from the RWM are less and less likely to be accepted as $|x|$ grows. Another well-known example of this phenomenon is the funnel distribution introduced in Reference [33].
To explore the behaviour of the PDRWM in this setting, we design a model problem, the staircase distribution, with density
$$s(x) \propto 3^{-\lfloor x_2 \rfloor} \mathbb{I}_R(x), \qquad R := \{ y \in \mathbb{R}^2 ; \; y_2 \geq 1, \; |y_1| \leq 3^{1 - \lfloor y_2 \rfloor} \},$$
where $\lfloor z \rfloor$ denotes the integer part of $z > 0$. Graphically the density is a sequence of cuboids on the upper-half plane of $\mathbb{R}^2$ (starting at $y_2 = 1$), each centred on the vertical axis, with each successive cuboid one third of the width and height of the previous. The density resembles an ever narrowing staircase, as shown in Figure 1.
We denote by $Q_R$ the proposal kernel associated with the Random Walk Metropolis algorithm with fixed covariance $h\Sigma$. In fact, the specific choice of $h$ and $\Sigma$ does not matter provided that the result is positive-definite. For the PDRWM we denote by $Q_P$ the proposal kernel with covariance matrix
$$h G(x)^{-1} = \begin{pmatrix} 3^{-2 \lfloor x_2 \rfloor} & 0 \\ 0 & 1 \end{pmatrix},$$
which will naturally adapt the scale of the first coordinate to the width of the ridge.
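The staircase density and the position-dependent proposal scale can be written down directly. The following is a minimal sketch of the unnormalised density and of the per-coordinate standard deviations implied by the diagonal covariance above; the function names are hypothetical.

```python
import math


def staircase_density(x1, x2):
    """Unnormalised staircase density: 3^(-floor(x2)) on the region R, else 0."""
    if x2 < 1.0:
        return 0.0
    k = math.floor(x2)
    if abs(x1) > 3.0 ** (1 - k):
        return 0.0
    return 3.0 ** (-k)


def pdrwm_proposal_sd(x2):
    """Standard deviations of the two coordinates under diag(3^(-2 floor(x2)), 1)."""
    return 3.0 ** (-math.floor(x2)), 1.0


# Each step up the staircase divides both the half-width and the height by three.
for k in range(1, 5):
    half_width = 3.0 ** (1 - k)
    print(k, half_width, staircase_density(0.0, k + 0.5), pdrwm_proposal_sd(k + 0.5))
```

On stair $k$ the half-width of the support is $3^{1-k}$ and the first-coordinate proposal scale is $3^{-k}$, so the ratio between the two stays fixed at three however far up the staircase the chain travels.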
Proposition 4.
The Metropolis–Hastings algorithm with proposal $Q_R$ does not produce a geometrically ergodic Markov chain when $\pi(x) = s(x)$.
The design of the PDRWM proposal kernel $Q_P$ in this instance is such that the proposal covariance reduces at the same rate as the width of the stairs, therefore naturally adapting the proposal to the width of the ridge on which the density concentrates. This state-dependent adaptation results in a geometrically ergodic chain, as shown in the below result.
Proposition 5.
The Metropolis–Hastings algorithm with proposal $Q_P$ produces a geometrically ergodic Markov chain when $\pi(x) = s(x)$.
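The contrast between Propositions 4 and 5 can be illustrated, though of course not proved, by estimating the average acceptance probability from a point in the middle of stair $k$. The sketch below assumes $h\Sigma = I$ for the Random Walk Metropolis and uses the diagonal covariance given above for the PDRWM; the starting point $(0, k + 1/2)$, sample size and seed are arbitrary choices.

```python
import math
import random


def log_staircase(y1, y2):
    """Log of the unnormalised staircase density (-inf outside its support)."""
    if y2 < 1.0:
        return -math.inf
    k = math.floor(y2)
    if abs(y1) > 3.0 ** (1 - k):
        return -math.inf
    return -k * math.log(3.0)


def acceptance_rate(k, position_dependent, n=20000, seed=1):
    """Estimated mean acceptance probability from the point (0, k + 0.5)."""
    rng = random.Random(seed)
    x1, x2 = 0.0, k + 0.5
    total = 0.0
    for _ in range(n):
        sd1 = 3.0 ** (-k) if position_dependent else 1.0
        y1 = x1 + sd1 * rng.gauss(0.0, 1.0)
        y2 = x2 + rng.gauss(0.0, 1.0)
        log_s = log_staircase(y1, y2)
        if log_s == -math.inf:
            continue  # proposal falls outside the support: rejected
        log_ratio = log_s - log_staircase(x1, x2)
        if position_dependent:
            ky = math.floor(y2)
            # Determinant and quadratic-form terms of the proposal density ratio.
            log_ratio += (ky - k) * math.log(3.0)
            log_ratio -= 0.5 * (y1 - x1) ** 2 * (9.0 ** ky - 9.0 ** k)
        total += math.exp(min(0.0, log_ratio))
    return total / n


rwm_low, rwm_high = acceptance_rate(2, False), acceptance_rate(6, False)
pd_low, pd_high = acceptance_rate(2, True), acceptance_rate(6, True)
```

In this experiment the fixed-covariance acceptance rate collapses as $k$ increases, because almost every proposal misses the narrowing ridge, while the position-dependent rate stays bounded away from zero.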

6. Discussion

In this paper we have analysed the ergodic behaviour of a Metropolis–Hastings method with proposal kernel $Q(x, \cdot) = N(x, hG(x)^{-1})$. In one dimension we have characterised the behaviour in terms of growth conditions on $G(x)^{-1}$ and tail conditions on the target distribution, and in higher dimensions a carefully constructed model problem is discussed. The fundamental question of interest was whether generalising an existing Metropolis–Hastings method by allowing the proposal covariance to change with position can alter the ergodicity properties of the sampler. We can confirm that this is indeed possible, either for the better or worse, depending on the choice of covariance. The take home points for practitioners are (i) lack of sufficient care in the design of $G(x)$ can have severe consequences (as in Proposition 3), and (ii) careful choice of $G(x)$ can have much more beneficial ones, perhaps the most surprising of which are in the higher-dimensional setting, as shown in Section 5.
We feel that such results can also offer insight into similar generalisations of different Metropolis–Hastings algorithms (e.g., [13,34]). For example, it seems intuitive that any method in which the variance grows at a faster than quadratic rate in the tails is unlikely to produce a geometrically ergodic chain. There are connections between the PDRWM and some extensions of the Metropolis-adjusted Langevin algorithm [34], the ergodicity properties of which are discussed in Reference [35]. The key difference between the schemes is the inclusion of the drift term $G(x)^{-1} \nabla \log \pi(x) / 2$ in the latter. It is this term which in the main governs the behaviour of the sampler, which is why the behaviour of the PDRWM is different to this scheme. Markov processes are also used in a wide variety of application areas beyond the design of Metropolis–Hastings algorithms (e.g., [36]), and we hope that some of the results established in the present work prove to be beneficial in some of these other settings.
We can apply these results to the specific variants discussed in Section 3. Provided that sensible choices of regions/weights are made and that an adaptation scheme which obeys the diminishing adaptation criterion is employed, the Regional adaptive Metropolis–Hastings, Localised Random Walk Metropolis and Kernel adaptive Metropolis–Hastings samplers should all satisfy $G(x) \to \Sigma$ as $|x| \to \infty$, meaning they can be expected to inherit the ergodicity properties of the standard RWM (the behaviour in the centre of the space, however, will likely be different). In the State-dependent Metropolis method, provided $b < 2$ the sampler should also behave reasonably. Whether or not a large enough value of $b$ would be found by a particular adaptation rule is not entirely clear, and this could be an interesting direction of further study. The Tempered Langevin diffusion scheme, however, will fail to produce a geometrically ergodic Markov chain whenever the tails of $\pi(x)$ are lighter than those of a Cauchy distribution. To allow reasonable tail exploration when this is the case, two pragmatic options would be to upper bound $G(x)^{-1}$ manually or use this scheme in conjunction with another, as there is evidence that the sampler can perform favourably when exploring the centre of a distribution [8]. None of the specific variants discussed here are able to mimic the local curvature of $\pi(x)$ in the tails, so as to enjoy the favourable behaviour exemplified in Proposition 5. This is possible using Hessian information as in Reference [13], but should also be possible in some cases using appropriate surrogates.

Funding

This research was supported by a UCL IMPACT PhD scholarship co-funded by Xerox Research Centre Europe and EPSRC.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The author thanks Alexandros Beskos, Krzysztof Łatuszyński and Gareth Roberts for several useful discussions, Michael Betancourt for proofreading the paper, and Mark Girolami for general supervision and guidance.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proofs

Proof of Proposition 1.
In this case, for any choice of $\varepsilon > 0$ there is a $\delta > 0$ such that $Q(x, B_\delta(x)) > 1 - \varepsilon$. Noting that $P(x, B_\delta(x)) \geq Q(x, B_\delta(x))$ when $P$ is of Metropolis–Hastings type, Theorem 2 can be applied directly. □
Proof of Proposition 2.
For the log-concave case, take $V(x) = e^{s|x|}$ for some $s > 0$, and let $B_A$ denote the integral (8) over the set $A$. We first break up $\mathcal{X}$ into $(-\infty, 0] \cup (0, x - c x^{\gamma/2}] \cup (x - c x^{\gamma/2}, x + c x^{\gamma/2}] \cup (x + c x^{\gamma/2}, x + c x^{\gamma}] \cup (x + c x^{\gamma}, \infty)$ for some $x > 0$ and fixed constant $c \in (0, \infty)$, and show that the integral is strictly negative on at least one of these sets, and can be made arbitrarily small as $x \to \infty$ on all others. The $-x$ case is analogous from the tail conditions on $\pi(x)$. From the conditions we can choose $x > r$ and therefore write $G(x)^{-1} = \eta x^{\gamma}$ for some fixed $\eta < \infty$.
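On $(-\infty, 0]$, we have
$$B_{(-\infty,0]} = e^{-sx}\int_{-\infty}^{0} e^{s|y|}\alpha(x,y)\,Q(x,dy) - \int_{-\infty}^{0}\alpha(x,y)\,Q(x,dy) \le e^{-sx}\int_{-\infty}^{0} e^{-sy}\,Q(x,dy).$$
The integral is now proportional to the moment generating function of a truncated Gaussian distribution (see Appendix B), so is given by
$$e^{-sx + h\eta x^{\gamma}s^{2}/2}\left[1 - \Phi\left(\frac{x^{1-\gamma/2}}{\sqrt{h\eta}} - \sqrt{h\eta}\,s\,x^{\gamma/2}\right)\right].$$
A simple bound on the Gaussian tail function $\Phi^{c}(x) := 1 - \Phi(x)$ is $\sqrt{2\pi}\,x\,\Phi^{c}(x) < e^{-x^{2}/2}$ for $x > 0$ [37], so setting $\vartheta = x^{1-\gamma/2}/\sqrt{h\eta} - \sqrt{h\eta}\,s\,x^{\gamma/2}$ we have
$$B_{(-\infty,0]} \le \frac{1}{\sqrt{2\pi}}\exp\left(-2sx + \frac{h\eta s^{2}}{2}x^{\gamma} - \frac{1}{2}\left[\frac{1}{h\eta}x^{2-\gamma} - 2sx + h\eta s^{2}x^{\gamma}\right] - \log\vartheta\right) = \frac{1}{\sqrt{2\pi}}\exp\left(-sx - \frac{1}{2h\eta}x^{2-\gamma} - \log\vartheta\right),$$
which $\to 0$ as $x \to \infty$, so can be made arbitrarily small.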
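The tail bound used above can be checked numerically. The short sketch below compares $\Phi^{c}(x)$ with $e^{-x^{2}/2}/(\sqrt{2\pi}\,x)$ at a few arbitrarily chosen points; the helper names are illustrative only.

```python
import math

def gauss_tail(x):
    # Phi^c(x) = P(Z > x) for a standard Gaussian Z.
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def tail_bound(x):
    # Upper bound e^{-x^2/2} / (sqrt(2*pi) * x), valid for x > 0.
    return math.exp(-0.5 * x * x) / (math.sqrt(2.0 * math.pi) * x)

checks = [(x, gauss_tail(x), tail_bound(x)) for x in (0.5, 1.0, 2.0, 4.0, 8.0)]
for x, tail, bound in checks:
    print(f"x={x:4.1f}  tail={tail:.3e}  bound={bound:.3e}  ratio={tail / bound:.4f}")
```

The ratio of the tail to the bound approaches one as $x$ grows, so the bound is tight in exactly the regime needed for the limit $x \to \infty$.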
On $(0, x - cx^{\gamma/2}]$, note that $e^{s(|y| - |x|)} - 1$ is clearly negative throughout this region provided that $c < x^{1-\gamma/2}$, which can be enforced by choosing $x$ large enough for any given $c < \infty$. So the integral is straightforwardly bounded as $B_{(0, x - cx^{\gamma/2}]} \le 0$ for all $x \in X$.
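On $(x - cx^{\gamma/2}, x + cx^{\gamma/2}]$, provided $x - cx^{\gamma/2} > r$ then for any $y$ in this region we can either upper or lower bound $\alpha(x,y)$ with the expression
$$\exp\left(-a(y - x) + \frac{\gamma}{2}\log\frac{x}{y} - \frac{1}{2h\eta}\left[\frac{(x-y)^{2}}{y^{\gamma}} - \frac{(x-y)^{2}}{x^{\gamma}}\right]\right).$$
A Taylor expansion of $y^{-\gamma}$ about $x$ gives
$$y^{-\gamma} = x^{-\gamma} - \gamma x^{-\gamma-1}(y - x) + \frac{\gamma(\gamma+1)}{2}x^{-\gamma-2}(y - x)^{2} + \dots$$
and multiplying by $(y - x)^{2}$ gives
$$\frac{(y-x)^{2}}{y^{\gamma}} = \frac{(y-x)^{2}}{x^{\gamma}} - \gamma\frac{(y-x)^{3}}{x^{\gamma+1}} + \frac{\gamma(\gamma+1)}{2}\frac{(y-x)^{4}}{x^{\gamma+2}} + \dots$$
If $|y - x| = cx^{\gamma/2}$ then this is:
$$c^{2}\frac{x^{\gamma}}{x^{\gamma}} - \gamma c^{3}\frac{x^{3\gamma/2}}{x^{\gamma+1}} + \frac{\gamma(\gamma+1)}{2}c^{4}\frac{x^{2\gamma}}{x^{\gamma+2}} + \dots$$
As $\gamma < 2$ then $3\gamma/2 < \gamma + 1$, and similarly for successive terms, meaning each gets smaller as $|x| \to \infty$. So we have for large $x$, $y \in (x - cx^{\gamma/2}, x + cx^{\gamma/2})$ and any $\delta > 0$
$$\left|\frac{(y-x)^{2}}{y^{\gamma}} - \frac{(y-x)^{2}}{x^{\gamma}} + \gamma\frac{(y-x)^{3}}{x^{\gamma+1}}\right| \le 2h\eta\delta.$$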
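The size of this Taylor remainder can be inspected numerically. The sketch below evaluates the left-hand side at the edge of the region, $y = x + cx^{\gamma/2}$, for increasing $x$; the values $c = 1$ and $\gamma = 1$ are arbitrary illustrative choices with $\gamma < 2$.

```python
def remainder(x, c=1.0, gamma=1.0, sign=1.0):
    # |(y-x)^2 / y^gamma - (y-x)^2 / x^gamma + gamma (y-x)^3 / x^(gamma+1)|
    # evaluated at the edge of the region, y = x + sign * c * x^(gamma/2).
    d = sign * c * x ** (gamma / 2.0)
    y = x + d
    return abs(d ** 2 / y ** gamma - d ** 2 / x ** gamma
               + gamma * d ** 3 / x ** (gamma + 1.0))

xs = [1e1, 1e2, 1e3, 1e4]
values = [remainder(x) for x in xs]
for x, v in zip(xs, values):
    print(f"x={x:8.0f}  remainder={v:.3e}")
```

With these choices the leading remainder term scales like $x^{\gamma-2} = 1/x$, which matches the decay seen in the printed values.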
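So we can analyse how the acceptance rate behaves. First note that for fixed $\epsilon > 0$
$$\alpha(x, x + \epsilon) \le \exp\left(-a\epsilon + \frac{\gamma}{2}\log\frac{x}{x+\epsilon} + \frac{\gamma\epsilon^{3}}{2h\eta x^{\gamma+1}} + \delta\right) \to \exp(-a\epsilon + \delta),$$
recalling that $\delta$ can be made arbitrarily small. In fact, it holds that the $e^{-a\epsilon}$ term will be dominant for any $\epsilon$ for which $\epsilon^{3}/x^{\gamma+1} \to 0$, i.e., any $\epsilon = o(x^{(\gamma+1)/3})$. If $\gamma < 2$ then $\epsilon = cx^{\gamma/2}$ satisfies this condition. So for any $y > x$ in this region we can choose an $x$ such that
$$\alpha(x,y) \le \exp\left(-a(y - x) + \delta_{x}\right),$$
where $\delta_{x} \to 0$ as $x \to \infty$. Similarly we have (for any fixed $\epsilon > 0$)
$$\alpha(x, x - \epsilon) \ge \exp\left(a\epsilon + \frac{\gamma}{2}\log\frac{x}{x-\epsilon} - \frac{\gamma\epsilon^{3}}{2h\eta x^{\gamma+1}} - \delta\right) \to \exp(a\epsilon - \delta).$$
So by a similar argument we have $\alpha(x,y) \to 1$ here as $x \to \infty$. Combining gives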
$$B_{(x - cx^{\gamma/2}, x + cx^{\gamma/2}]} \le \int_{0}^{cx^{\gamma/2}}\left[e^{(s-a)z + \delta_{x}} - e^{-az + \delta_{x}} + e^{-sz} - 1\right]q_{x}(dz),$$
where $q_{x}(\cdot)$ denotes a zero mean Gaussian distribution with the same variance as $Q(x, \cdot)$. Using the change of variables $z' = z/(\sqrt{h\eta}\,x^{\gamma/2})$ we can write the above integral
$$\int_{0}^{c/\sqrt{h\eta}}\left[e^{(s-a)\sqrt{h\eta}\,x^{\gamma/2}z' + \delta_{x}} - e^{-a\sqrt{h\eta}\,x^{\gamma/2}z' + \delta_{x}} + e^{-s\sqrt{h\eta}\,x^{\gamma/2}z'} - 1\right]\mu(dz'),$$
where $\mu(\cdot)$ denotes a Gaussian distribution with zero mean and variance one. Provided $s < a$, then by dominated convergence as $x \to \infty$ this asymptotes to
$$-\int_{0}^{c/\sqrt{h\eta}}\mu(dz) = -\frac{1}{2}\operatorname{erf}\left(\frac{c}{\sqrt{2h\eta}}\right) < 0,$$
where $\operatorname{erf}(z) := (2/\sqrt{\pi})\int_{0}^{z}e^{-t^{2}}dt$ is the Gaussian error function.
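The dominated convergence step can be illustrated numerically. The sketch below evaluates the rescaled integral by a composite Simpson rule for increasing $x$ and compares it with the limit $-\tfrac{1}{2}\operatorname{erf}(c/\sqrt{2h\eta})$. The parameter values ($a = 1$, $s = 0.5$, $c = h = \eta = \gamma = 1$) and the simplification $\delta_{x} = 0$ are assumptions made purely for illustration.

```python
import math

def phi(z):
    # Standard Gaussian density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def region_integral(x, a=1.0, s=0.5, c=1.0, h=1.0, eta=1.0, gamma=1.0, n=20000):
    # Composite Simpson rule on [0, c / sqrt(h * eta)], with delta_x set to 0.
    scale = math.sqrt(h * eta) * x ** (gamma / 2.0)
    upper = c / math.sqrt(h * eta)
    step = upper / n  # n must be even for Simpson's rule

    def integrand(z):
        return (math.exp((s - a) * scale * z) - math.exp(-a * scale * z)
                + math.exp(-s * scale * z) - 1.0) * phi(z)

    total = integrand(0.0) + integrand(upper)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(i * step)
    return total * step / 3.0

limit = -0.5 * math.erf(1.0 / math.sqrt(2.0))  # c = h = eta = 1
values = {x: region_integral(x) for x in (1e2, 1e4, 1e6)}
for x, v in values.items():
    print(f"x={x:9.0f}  integral={v:.5f}  limit={limit:.5f}")
```

For $s < a$ the integrand factorises as $(e^{sz} - 1)(e^{-az} - e^{-sz})$ times the Gaussian density (in the unscaled variable), so it is negative for every $z > 0$ and the integral is negative for each finite $x$ as well as in the limit.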
On $(x + cx^{\gamma/2}, x + cx^{\gamma}]$ we can upper bound the acceptance rate as
$$\alpha(x,y) \le \frac{\pi(y)}{\pi(x)}\exp\left(\frac{1}{2}\log\frac{|G(y)|}{|G(x)|} + \frac{G(x)}{2h}(x - y)^{2}\right).$$
If $y \ge x$ and $x > x_{0}$ we have
$$\alpha(x,y) \le \exp\left(-a(|y| - |x|) + \frac{1}{2h\eta}\frac{(x-y)^{2}}{x^{\gamma}}\right).$$
For $|y - x| = cx^{\ell}$ this becomes
$$\alpha(x,y) \le \exp\left(-acx^{\ell} + \frac{c^{2}}{2h\eta}x^{2\ell-\gamma}\right).$$
So provided $\gamma > \ell$ the first term inside the exponential will dominate the second for large enough $x$. In the equality case we have
$$\alpha(x,y) \le \exp\left(\left(\frac{c^{2}}{2h\eta} - ac\right)x^{\gamma}\right),$$
so provided we choose $c$ such that $a > c^{2}/(2h\eta)$ then the acceptance rate will also decay exponentially. Because of this we have
$$B_{(x + cx^{\gamma/2}, x + cx^{\gamma}]} \le \int_{x + cx^{\gamma/2}}^{x + cx^{\gamma}} e^{s(y-x)}\alpha(x,y)\,Q(x,dy) \le e^{(c^{2}/(2h\eta) + s - a)cx^{\gamma/2}}\,Q\left(x, (x + cx^{\gamma/2}, x + cx^{\gamma}]\right),$$
so provided $a > c^{2}/(2h\eta) + s$ then this term can be made arbitrarily small.
On $(x + cx^{\gamma}, \infty)$ using the same properties of truncated Gaussians we have
$$B_{(x + cx^{\gamma}, \infty)} \le e^{-sx}\int_{x + cx^{\gamma}}^{\infty} e^{sy}\,Q(x,dy) = e^{s^{2}h\eta x^{\gamma}/2}\,\Phi^{c}\left(\left(\frac{c}{\sqrt{h\eta}} - \sqrt{h\eta}\,s\right)x^{\gamma/2}\right),$$
which can be made arbitrarily small provided that $s$ is chosen to be small enough, using the same simple bound on $\Phi^{c}$ as for the case of $B_{(-\infty,0]}$.
Combining gives that the integral (8) is bounded above by $-\operatorname{erf}(c/\sqrt{2h\eta})/2$, which is strictly less than zero as $c$, $h$ and $\eta$ are all positive. This completes the proof under Assumption 1.
Under Assumption 2 the proof is similar. Take $V(x) = e^{s|x|^{\beta}}$, and divide $X$ up into the same regions. Outside of $(x - cx^{\gamma/2}, x + cx^{\gamma/2}]$ the same arguments show that the integral can be made arbitrarily small. On this set, note that in the tails
$$(x + cx^{\ell})^{\beta} - x^{\beta} = \beta c x^{\ell+\beta-1} + \frac{\beta(\beta-1)}{2}c^{2}x^{2\ell+\beta-2} + \dots$$
For $y - x = cx^{\ell}$, then for $\ell < 1 - \beta$ this becomes negligible. So in this case we further divide the typical set into $(x, x + cx^{1-\beta}] \cup (x + cx^{1-\beta}, x + cx^{\gamma/2})$. On $(x - cx^{1-\beta}, x + cx^{1-\beta})$ the integral is bounded above by $e^{c_{1}}Q(x, (x - cx^{1-\beta}, x + cx^{1-\beta})) \to 0$, for some suitably chosen $c_{1} > 0$. On $(x - cx^{\gamma/2}, x - cx^{1-\beta}] \cup (x + cx^{1-\beta}, x + cx^{\gamma/2}]$ then for $y > x$ we have $\alpha(x,y) \le e^{-c_{2}(y^{\beta} - x^{\beta})}$, so we can use the same argument as in the log-concave case to show that the integral will be strictly negative in the limit. □
Proof of Proposition 3.
First note that in this case for any $g : \mathbb{R} \to (0, \infty)$ such that as $|x| \to \infty$ it holds that $g(x)/|x| \to \infty$ but $g(x)\sqrt{G(x)} \to 0$, then
$$Q\left(x, [x - g(x), x + g(x)]\right) = \Phi\left(g(x)\sqrt{G(x)}\right) - \Phi\left(-g(x)\sqrt{G(x)}\right) \to 0$$
as $|x| \to \infty$. The chain therefore has the property that $P(\{|X_{i+1}| > g(X_{i})/2\} \cup \{X_{i+1} = X_{i}\})$ can be made arbitrarily close to 1 as $|X_{i}|$ grows, which leads to two possible behaviours. If the form of $\pi(\cdot)$ enforces such large jumps to be rejected then $r(x) \to 1$ and lack of geometric ergodicity follows from (9). If this is not the case then the chain will be transient (this can be made rigorous using a standard Borel–Cantelli argument, see e.g., the proof of Theorem 12.2.2 on p. 299 of [21]). □
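To make the vanishing central mass concrete, the sketch below evaluates $\Phi(t) - \Phi(-t) = \operatorname{erf}(t/\sqrt{2})$ with $t = g(x)\sqrt{G(x)}$ for one hypothetical pair of functions: $G(x)^{-1} = |x|^{3}$, which grows faster than quadratically, and $g(x) = |x|^{1.25}$, which satisfies $g(x)/|x| \to \infty$ and $g(x)\sqrt{G(x)} \to 0$. Both choices, and taking $h = 1$, are assumptions made only for this illustration.

```python
import math

def central_mass(x, growth=3.0, g_power=1.25):
    # Q(x, [x - g(x), x + g(x)]) = Phi(t) - Phi(-t) = erf(t / sqrt(2)),
    # with t = g(x) * sqrt(G(x)), G(x)^{-1} = |x|^growth and g(x) = |x|^g_power.
    t = abs(x) ** g_power * abs(x) ** (-growth / 2.0)
    return math.erf(t / math.sqrt(2.0))

masses = [central_mass(x) for x in (1e1, 1e2, 1e4, 1e8)]
for x, m in zip((1e1, 1e2, 1e4, 1e8), masses):
    print(f"x={x:12.0f}  mass within g(x) of x: {m:.5f}")
```

Here $t = |x|^{-1/4}$, so the proposal places almost all of its mass further than $g(x)$ from the current point once $|x|$ is large, even though $g(x)$ itself grows faster than $|x|$.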
Proof of Proposition 4.
It is sufficient to construct a sequence of points $x_{p} \in \mathbb{R}^{2}$ such that $|x_{p}| \to \infty$ as $p \to \infty$, and show that $r(x_{p}) \to 1$ in the same limit, then apply (9). Take $x_{p} = (0, p)$ for $p \in \mathbb{N}$. In this case
$$r(x_{p}) = 1 - \int \alpha(x_{p}, y)\,Q_{R}(x_{p}, dy).$$
Note that for every $\epsilon > 0$ there is a $\delta < \infty$ such that $Q_{R}(x_{p}, B_{\delta}^{c}(x_{p})) < \epsilon$ for all $x_{p}$, where $B_{\delta}(x) := \{y \in \mathbb{R}^{2} : |y - x| \le \delta\}$. The set $A(x_{p}, \delta) := B_{\delta}(x_{p}) \cap R$ denotes the possible values of $y \in B_{\delta}(x_{p})$ for which the acceptance rate is non-zero. Note that $A(x_{p}, \delta) \subset S(x_{p}, \delta) := \{y \in B_{\delta}(x_{p}) : |y_{1}| \le 3^{1 - \lfloor p - \delta \rfloor}\}$, which is simply a strip that can be made arbitrarily narrow for any fixed $\delta$ by taking $p$ large enough. Combining these ideas gives
$$\int \alpha(x_{p}, y)\,Q_{R}(x_{p}, dy) \le \int_{A(x_{p},\delta)} \alpha(x_{p}, y)\,Q_{R}(x_{p}, dy) + \epsilon \le Q_{R}(x_{p}, S(x_{p}, \delta)) + \epsilon.$$
Both of the quantities on the last line can be made arbitrarily small by choosing $p$ suitably large. Thus, $r(x_{p}) \to 1$ as $|x_{p}| \to \infty$, as required. □
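The narrowing of the strip can be quantified with a small sketch. Assuming, purely for illustration, that $Q_{R}(x, \cdot)$ is the isotropic Gaussian $N(x, hI)$ and that the strip has half-width $3^{1 - \lfloor p - \delta \rfloor}$, the mass $Q_{R}(x_{p}, S(x_{p}, \delta))$ is at most $P(|Y_{1}| \le 3^{1 - \lfloor p - \delta \rfloor})$ with $Y_{1} \sim N(0, h)$. The function name and the values of $\delta$ and $h$ are hypothetical.

```python
import math

def strip_mass_bound(p, delta=3.0, h=1.0):
    # Half-width of the strip S(x_p, delta), then P(|Y_1| <= width) for
    # Y_1 ~ N(0, h), which upper bounds Q_R(x_p, S(x_p, delta)) under the
    # assumed isotropic proposal.
    width = 3.0 ** (1 - math.floor(p - delta))
    return math.erf(width / math.sqrt(2.0 * h))

bounds = {p: strip_mass_bound(p) for p in (5, 10, 20, 40)}
for p, b in bounds.items():
    print(f"p={p:3d}  bound on Q_R(x_p, S)={b:.3e}")
```

The bound decays geometrically in $p$, so a fixed-scale proposal almost never lands inside the region where the acceptance rate is non-zero.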
Proof of Proposition 5.
First note that $\inf_{x \in R} Q_{P}(x, R)$ is bounded away from zero, unlike in the case of $Q_{R}$, owing to the design of $Q_{P}$. The acceptance rate here simplifies, since for any $y \in R$
$$\frac{s(y)|G(y)|^{1/2}}{s(x)|G(x)|^{1/2}} = 1,$$
meaning only the expression $\exp\left(-\frac{1}{2}(y - x)^{T}[G(y) - G(x)](y - x)\right)$ needs to be considered. In this case the expression is simply
$$\exp\left(-\frac{1}{2}\left(3^{2\lfloor y_{2} \rfloor} - 3^{2\lfloor x_{2} \rfloor}\right)(y_{1} - x_{1})^{2}\right).$$
Provided that $x_{1} \ne y_{1}$, then when $1 \le y_{2} < x_{2}$ this expression is strictly greater than 1, whereas in the reverse case it is strictly less than one. The resulting Metropolis–Hastings kernel $P$ using proposal kernel $Q_{P}$ will therefore satisfy $\int y_{2}\,P(x, dy) < x_{2}$ for large enough $x_{2}$, and hence geometric ergodicity follows by taking the Lyapunov function $V(x) = e^{s|x_{2}|}$ (which can be used here since the domain of $x_{1}$ is compact) and following an identical argument to that given on pages 404–405 of Reference [21] for the case of the proof of geometric ergodicity of the random walk on the half-line model for suitably small $s > 0$, taking the small set $C := [0, 1] \times [1, r]$ for suitably large $r < \infty$ and $\nu(\cdot) = \int_{\cdot} s(x)\,dx$. □

Appendix B. Needed Facts about Truncated Gaussian Distributions

Here we collect some elementary facts used in the article. For more detail see e.g., [38]. If $X$ follows a truncated Gaussian distribution $N^{T}_{[a,b]}(\mu, \sigma^{2})$ then it has density
$$f(x) = \frac{1}{\sigma Z_{a,b}}\,\phi\left(\frac{x - \mu}{\sigma}\right)I_{[a,b]}(x),$$
where $\phi(x) = e^{-x^{2}/2}/\sqrt{2\pi}$, $\Phi(x) = \int_{-\infty}^{x}\phi(y)\,dy$ and $Z_{a,b} = \Phi((b - \mu)/\sigma) - \Phi((a - \mu)/\sigma)$. Defining $B = (b - \mu)/\sigma$ and $A = (a - \mu)/\sigma$, we have
$$E[X] = \mu + \frac{\phi(A) - \phi(B)}{Z_{a,b}}\,\sigma$$
and
$$E[e^{tX}] = e^{\mu t + \sigma^{2}t^{2}/2}\,\frac{\Phi(B - \sigma t) - \Phi(A - \sigma t)}{Z_{a,b}}.$$
In the special case $b = \infty$, $a = 0$ this becomes $e^{\mu t + \sigma^{2}t^{2}/2}\,\Phi(\mu/\sigma + \sigma t)/Z_{a,b}$, which reduces to $e^{\sigma^{2}t^{2}/2}\,\Phi(\sigma t)/Z_{a,b}$ when $\mu = 0$.
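These closed forms are easy to check against direct numerical integration of the truncated density. The sketch below does so with a midpoint rule; the parameter values and function names are arbitrary illustrative choices.

```python
import math

def phi(x):
    # Standard Gaussian density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    # Standard Gaussian distribution function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def closed_forms(mu, sigma, a, b, t):
    # Mean and moment generating function from the formulas above.
    A, B = (a - mu) / sigma, (b - mu) / sigma
    Z = Phi(B) - Phi(A)
    mean = mu + sigma * (phi(A) - phi(B)) / Z
    mgf = (math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)
           * (Phi(B - sigma * t) - Phi(A - sigma * t)) / Z)
    return mean, mgf

def by_quadrature(mu, sigma, a, b, t, n=20000):
    # Midpoint rule applied to the truncated density on [a, b].
    Z = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)
    step = (b - a) / n
    mean = mgf = 0.0
    for i in range(n):
        x = a + (i + 0.5) * step
        w = phi((x - mu) / sigma) / (sigma * Z) * step
        mean += x * w
        mgf += math.exp(t * x) * w
    return mean, mgf

exact = closed_forms(mu=0.5, sigma=2.0, a=-1.0, b=3.0, t=0.3)
numeric = by_quadrature(mu=0.5, sigma=2.0, a=-1.0, b=3.0, t=0.3)
print("closed form:", exact)
print("quadrature: ", numeric)
```

The same check can be repeated for a one-sided truncation by taking `b` large, which is the setting used for the sets $(-\infty, 0]$ and $(x + cx^{\gamma}, \infty)$ in the proof of Proposition 2.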

References

  1. Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of state calculations by fast computing machines. J. Chem. Phys. 1953, 21, 1087–1092.
  2. Hastings, W.K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 1970, 57, 97–109.
  3. Sherlock, C.; Fearnhead, P.; Roberts, G.O. The random walk Metropolis: Linking theory and practice through a case study. Stat. Sci. 2010, 25, 172–190.
  4. Mengersen, K.L.; Tweedie, R.L. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat. 1996, 24, 101–121.
  5. Roberts, G.O.; Tweedie, R.L. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 1996, 83, 95–110.
  6. Jarner, S.F.; Roberts, G.O. Convergence of Heavy-tailed Monte Carlo Markov Chain Algorithms. Scand. J. Stat. 2007, 34, 781–815.
  7. Roberts, G.O.; Rosenthal, J.S. Examples of adaptive MCMC. J. Comput. Graph. Stat. 2009, 18, 349–367.
  8. Roberts, G.O.; Stramer, O. Langevin diffusions and Metropolis–Hastings algorithms. Methodol. Comput. Appl. Probab. 2002, 4, 337–357.
  9. Sejdinovic, D.; Strathmann, H.; Garcia, M.L.; Andrieu, C.; Gretton, A. Kernel Adaptive Metropolis-Hastings. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Xing, E.P., Jebara, T., Eds.; PMLR: Beijing, China, 2014; Volume 32, pp. 1665–1673.
  10. Andrieu, C.; Thoms, J. A tutorial on adaptive MCMC. Stat. Comput. 2008, 18, 343–373.
  11. Craiu, R.V.; Rosenthal, J.; Yang, C. Learn from thy neighbor: Parallel-chain and regional adaptive MCMC. J. Am. Stat. Assoc. 2009, 104, 1454–1466.
  12. Rudolf, D.; Sprungk, B. On a generalization of the preconditioned Crank–Nicolson Metropolis algorithm. Found. Comput. Math. 2018, 18, 309–343.
  13. Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol. 2011, 73, 123–214.
  14. Brooks, S.; Gelman, A.; Jones, G.; Meng, X.L. Handbook of Markov Chain Monte Carlo; CRC Press: Boca Raton, FL, USA, 2011.
  15. Maire, F.; Vandekerkhove, P. On Markov chain Monte Carlo for sparse and filamentary distributions. arXiv 2018, arXiv:1806.09000.
  16. Mallik, A.; Jones, G.L. Directional Metropolis-Hastings. arXiv 2017, arXiv:1710.09759.
  17. Ludkin, M.; Sherlock, C. Hug and Hop: A discrete-time, non-reversible Markov chain Monte Carlo algorithm. arXiv 2019, arXiv:1907.13570.
  18. Kamatani, K. Ergodicity of Markov chain Monte Carlo with reversible proposal. J. Appl. Probab. 2017, 638–654.
  19. Roberts, G.O.; Rosenthal, J.S. General state space Markov chains and MCMC algorithms. Probab. Surv. 2004, 1, 20–71.
  20. Roberts, G.O.; Rosenthal, J.S. Geometric ergodicity and hybrid Markov chains. Electron. Comm. Probab. 1997, 2, 13–25.
  21. Meyn, S.P.; Tweedie, R.L. Markov Chains and Stochastic Stability; Cambridge University Press: Cambridge, UK, 2009.
  22. Jones, G.L.; Hobert, J.P. Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Stat. Sci. 2001, 16, 312–334.
  23. Tierney, L. Markov chains for exploring posterior distributions. Annal. Stat. 1994, 22, 1701–1728.
  24. Bierkens, J. Non-reversible Metropolis–Hastings. Stat. Comput. 2016, 26, 1213–1228.
  25. Jarner, S.F.; Tweedie, R.L. Necessary conditions for geometric and polynomial ergodicity of random-walk-type Markov chains. Bernoulli 2003, 9, 559–578.
  26. Roberts, G.O.; Rosenthal, J.S. Optimal scaling for various Metropolis-Hastings algorithms. Stat. Sci. 2001, 16, 351–367.
  27. Livingstone, S.; Girolami, M. Information-geometric Markov chain Monte Carlo methods using diffusions. Entropy 2014, 16, 3074–3102.
  28. Andrieu, C.; Moulines, É. On the ergodicity properties of some adaptive MCMC algorithms. Ann. Appl. Probab. 2006, 16, 1462–1505.
  29. Livingstone, S.; Faulkner, M.F.; Roberts, G.O. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. Biometrika 2019, 106, 303–319.
  30. Roberts, G.O.; Tweedie, R.L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 1996, 2, 341–363.
  31. Livingstone, S.; Betancourt, M.; Byrne, S.; Girolami, M. On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli 2019, 25, 3109–3138.
  32. Jarner, S.F.; Hansen, E. Geometric ergodicity of Metropolis algorithms. Stoch. Process. Their Appl. 2000, 85, 341–361.
  33. Neal, R.M. Slice sampling. Annal. Stat. 2003, 705–741.
  34. Xifara, T.; Sherlock, C.; Livingstone, S.; Byrne, S.; Girolami, M. Langevin diffusions and the Metropolis-adjusted Langevin algorithm. Stat. Probab. Lett. 2014, 91, 14–19.
  35. Latuszyński, K.; Roberts, G.O.; Thiery, A.; Wolny, K. Discussion on ‘Riemann manifold Langevin and Hamiltonian Monte Carlo methods’ (by Girolami, M. and Calderhead, B.). J. R. Stat. Soc. Ser. B Statist. Methodol. 2011, 73, 188–189.
  36. Chen, S.; Tao, Y.; Yu, D.; Li, F.; Gong, B. Distributed learning dynamics of Multi-Armed Bandits for edge intelligence. J. Syst. Archit. 2020, 101919. Available online: https://www.sciencedirect.com/science/article/abs/pii/S1383762120301806 (accessed on 29 May 2015).
  37. Cook, J.D. Upper and Lower Bounds on the Normal Distribution Function; Technical Report. 2009. Available online: http://www.johndcook.com/normalbounds.pdf (accessed on 29 May 2015).
  38. Johnson, N.L.; Kotz, S. Distributions in Statistics: Continuous Univariate Distributions; Houghton Mifflin: Boston, MA, USA, 1970; Volume 1.
Figure 1. The staircase distribution, with density given by Equation (12).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Livingstone, S. Geometric Ergodicity of the Random Walk Metropolis with Position-Dependent Proposal Covariance. Mathematics 2021, 9, 341. https://doi.org/10.3390/math9040341