Convergence Rates for the Constrained Sampling via Langevin Monte Carlo

Zhu, Yuanzheng

doi:10.3390/e25081234

Open AccessArticle

Convergence Rates for the Constrained Sampling via Langevin Monte Carlo

by

Yuanzheng Zhu

School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China

Entropy 2023, 25(8), 1234; https://doi.org/10.3390/e25081234

Submission received: 26 June 2023 / Revised: 8 August 2023 / Accepted: 15 August 2023 / Published: 18 August 2023

(This article belongs to the Collection Advances in Applied Statistical Mechanics)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

Sampling from constrained distributions has posed significant challenges in terms of algorithmic design and non-asymptotic analysis, which are frequently encountered in statistical and machine-learning models. In this study, we propose three sampling algorithms based on Langevin Monte Carlo with the Metropolis–Hastings steps to handle the distribution constrained within some convex body. We present a rigorous analysis of the corresponding Markov chains and derive non-asymptotic upper bounds on the convergence rates of these algorithms in total variation distance. Our results demonstrate that the sampling algorithm, enhanced with the Metropolis–Hastings steps, offers an effective solution for tackling some constrained sampling problems. The numerical experiments are conducted to compare our methods with several competing algorithms without the Metropolis–Hastings steps, and the results further support our theoretical findings.

Keywords:

Bayesian computation; constrained sampling; convex support; Langevin Monte Carlo; MCMC; mixing time bound

1. Introduction

Sampling from distributions with some constraints has extensive applications in statistics, machine-learning, and operations research, among other areas. Some distributions have bounded support, such as the simple but versatile uniform distribution which serves as the foundation for a series of Monte Carlo methods, as discussed in [1]. Furthermore, many statistical inference problems involve estimating parameters subject to constraints on the parameter space, which defines a posterior distribution with bounded support in a Bayesian setting. Examples include Latent Dirichlet Allocation [2], truncated data problems in failure and survival time studies [3], ordinal data models [4], constrained lasso and ridge regressions [5], and non-negative matrix factorization [6]. In Bayesian learning, sampling from posterior distributions is a fundamental primitive, used for exploring posterior distributions, identifying the unknown parameters [7,8], obtaining credible intervals, and solving inverse problems [7,8]. Finally, constrained sampling has great potential in solving constrained optimization problems [9,10].

Many Markov Chain Monte Carlo (MCMC) algorithms have been extensively studied for sampling from probability distributions with convex support or more generally with constrained parameters, mainly in the fields of Bayesian statistics and theoretical computer science. Early work includes, among others, [1,11,12,13,14]. Firstly, based on MCMC algorithms, a direct solution involves discarding samples that violate the constraints, thereby exclusively retaining samples that satisfy the constraints; see, for example, [1,15,16]. However, these rejection-type approaches may encounter an excessive number of rejections or an extremely large acceptance rate within some local subspace that satisfies the constraints, which leads to poor mixing and computational inefficiency, especially for complicated constraints and the high dimensional distributions [17,18]. Secondly, some literature draws inspiration from penalty functions in optimization problems and considers the construction of barriers along the boundaries of the constrained domain, effectively constraining the sampling process within the constrained area. These approaches encounter a major challenge when the samples reach the boundaries of the constraints, necessitating the implementation of a mechanism based on reflection to redirect them back into the constrained region. To address this issue, Ref. [19] extended the Hamiltonian Monte Carlo (HMC) method by setting the potential energy outside the constraint region to infinity, restricting the states to the desired domain. Ref. [20] extended the HMC method to sample from truncated multivariate Gaussian distributions, and Ref. [21] proposed an approach that involves mapping the constrained domain onto a sphere in an augmented space. Thirdly, motivated by the constrained optimization methods, the constrained sampling problem can be reformulated as an unconstrained sampling problem via suitable transformations. Following this idea, Ref. [22] proposed a family of novel algorithms based on HMC through the introduction of Lagrange multipliers that address a broader range of constrained sampling problems. More recently, Ref. [23] tackled the constrained sampling problem via the mirror-Langevin algorithm. In spite of the widespread adoption of these MCMC methods, most of them have primarily focused on the algorithm design and lack the rigorous theoretical analysis of convergence rates.

Among all the MCMC algorithms, a class of algorithms based on the Langevin dynamics has garnered significant attention in both practical applications and theoretical analyses [24,25,26,27]. It has recently witnessed a notable increase in non-asymptotic analyses of these algorithms, initiated by the seminal work of [28]. In the setting of unconstrained sampling, Ref. [29] extended the theoretical analysis of convergence rates by studying with decreasing step size, and Refs. [30,31] derived corresponding convergence results based on alternative distances. These theoretical analyses focus on the Langevin algorithm without the Metropolis–Hastings step. More recently, Refs. [32,33] have shown that incorporating the Metropolis–Hastings step can significantly improve the convergence rate of the associated Langevin algorithm. In the setting of constrained sampling, Ref. [34] suggested a Euclidean projection step in the Langevin algorithms for the constrained case (PLMC) and derived the convergence rate of the associated Markov chain. Ref. [35] presented a detailed theoretical analysis for a proximal version of the Langevin algorithm that incorporates the Moreau-Yosida envelope of the indicator function (MYULA) to handle the distributions that are restricted to a convex body. Ref. [36] constructed the mirrored Langevin algorithm (MLD) using a mirror map to constrain the domain, which achieves the same convergence rate as its unconstrained counterpart [28]. However, these constrained sampling algorithms are all developed based on the Langevin algorithm without incorporating the Metropolis–Hastings steps, thus not leveraging the fast mixing advantages of them.

In this paper, we considered the constrained Langevin Monte Carlo with the Metropolis–Hastings step for sampling from the distributions restricted to some convex support. Firstly, for certain constraints, we re-examine the simple and intuitive rejection-type methods for sampling from constrained distributions, and reach a surprising discovery that the corresponding algorithm still retained the advantage of rapid convergence by carefully selecting the step size parameter. Subsequently, for the more generally constrained domain, we build upon the framework proposed in [35], incorporating the Metropolis–Hastings step for further refinement, and analyze the convergence rate of the corresponding Markov chain. We present detailed non-asymptotic analysis for these constrained algorithms and achieve notably enhanced convergence rates in the total variation distance. Compared with the best rate in [36], our results show that adopting the Metropolis–Hastings step in some constrained MCMC algorithms can also lead to an exponentially improved dependence on the error tolerance.

The rest of the paper is organized as follows. In Section 2, we introduce the preliminaries and the problem set-up of our study. Then, we propose the constrained sampling algorithms tailored to different types of constraint regions in Section 3. Section 4 provides the non-asymptotic theoretical results of the proposed algorithms. The numerical experiments and comparisons are presented in Section 5. Some Markov chain basics are provided in Appendix A and all the technical proofs are deferred to Appendix B.

Notation: Let

⌈ a ⌉

represent the smallest integer not less than

a \in R

. For a vector

x \in R^{d}

, we use

{| x |}_{2}

to denote its Euclidean norm. For a

q \times q

symmetric matrix A, denote by

λ_{min} (A)

and

λ_{max} (A)

the smallest and largest eigenvalues of A, respectively, and let

A^{T}

be its transpose. For two square matrices A and B, we write

A ⪯ B

if

(B - A)

is a positive semi-definite matrix. Denote by

I (\cdot)

the indicator function. For

r > 0

, let

B (x, r) = {y \in R^{d} : | y - x |_{2} \leq r}

denote a closed Euclidean ball with center x and radius r. For two real-valued sequences

a_{n}

and

b_{n}

, we say

a_{n} = O (b_{n})

if there exists a universal constant c such that

a_{n} \leq c b_{n}

, and

a_{n} = \tilde{O} (b_{n})

if

a_{n} \leq c_{n} b_{n}

where the sequence

c_{n}

grows at most poly-logarithmically with n. For any two probability measures

μ

and

ν

, denote by

{∥ μ - ν ∥}_{TV}

the total variation distance between

μ

and

ν

.

2. Preliminaries and Problem Set-Up

In this section, we introduce the MCMC sampling methods with its mixing analysis, the traditional unconstrained Metropolis-Adjusted Langevin Algorithm (MALA), and our problem set-up for this paper.

2.1. Markov Chain Monte Carlo and Mixing

Consider a distribution

Π

equipped with a density

π : R^{d} \mapsto R_{+}

such that

π (x) \propto e^{- U (x)}

(1)

for some potential function

U : R^{d} \mapsto R

. In certain scenarios, it is necessary to perform sampling from this distribution. For example, many statistical applications involve estimating the expectation of a function

g (X)

for

X \sim π

, where analytical and numerical computation is infeasible. Monte Carlo approximation provides a solution by generating samples from

Π

and using sample mean to estimate the population expectation. Hence, the key point is to access samples from

Π

.

MCMC represents a class of popular sampling algorithms, which construct an appropriate Markov chain whose stationary distribution is

Π

or close to

Π

in certain metrics. The class of the Metropolis–Hastings algorithms refers to a type of MCMC method that ensures the corresponding Markov chain converges to the target distribution by incorporating the Metropolis–Hastings step. The Metropolis–Hastings algorithms usually take two steps to generate a Markov chain: a proposal step and a reject-accept step. At each iteration, a sample is generated from the proposal distribution in the proposal step, and it is updated as a new state of the Markov chain with probability determined by the Metropolis–Hastings correction in the reject-accept step.

Given an error tolerance

ε \in (0, 1)

, in order to obtain an

ε

-accurate sample with respect to some metric, one simulates the Markov chain for a certain number of steps k, as determined by a mixing time analysis. Specifically, we are concerned about how many steps the chain needs to take such that the current distribution of the chain is

ε

-close to the target distribution

Π

. Based on this, we define the

ε

-mixing time with respect to the target distribution

Π

as

τ (ε; P^{0}, Π) = min {k \in N : ∥ T^{k} (P^{0}) - Π ∥_{TV} \leq ε}

(2)

for the error tolerance

ε \in (0, 1)

, where

T

is the transition operator of the Markov chain and

T^{k} (P^{0})

is the distribution of the Markov chain at k-th step from an initial distribution

P^{0}

.

2.2. Metropolis-Adjusted Langevin Algorithm

Consider the problem of sampling from the distribution with density defined as (1). MALA [26,27] adopts the Gaussian distribution

N {x_{k} - h \nabla U (x_{k}), 2 h I_{p}}

as the proposal distribution for the k-th step, where

x_{k}

is the current state and

h > 0

is a proper step size, and performs a Metropolis–Hastings accept-reject step. MALA is the standard Metropolis–Hastings algorithm applied to the Langevin dynamics, and the associated Langevin-type algorithms belong to a family of gradient-based MCMC sampling algorithms [37]. The Langevin-type algorithms can be understood as the Euler discretization of the Langevin dynamics:

d X_{t} = - \nabla U (X_{t}) d t + \sqrt{2} d W_{t},

where

W_{t} (t \geq 0)

is the standard Brownian motion on

R^{d}

.

Algorithm 1 provides the unconstrained MALA for sampling from the distribution supported on

R^{d}

, where

ϕ_{h} (\cdot | x)

denotes the probability density function of

N {x - h \nabla U (x), 2 h I_{d}}

.

Algorithm 1 Metropolis-adjusted Langevin algorithm

Input: a sample

x^{0} \in R^{d}

from an initial distribution

P^{0}

, the step size h

for $k = 0, 1, 2, \dots, K - 1$ do
Proposal step: $y^{k + 1} \leftarrow x^{k} - h \nabla U (x^{k}) + ξ$ , where $ξ \sim N (0, 2 h I_{p})$
Accept-reject step:
compute $α^{k + 1} = min \{1, \frac{ϕ_{h} (x^{k} | y^{k + 1}) π (y^{k + 1})}{ϕ_{h} (y^{k + 1} | x^{k}) π (x^{k})}\}$
sample $u^{k + 1}$ from the uniform distribution on $[0, 1]$
if $α^{k + 1} \geq u^{k + 1}$ , then $x^{k + 1} \leftarrow y^{k + 1}$
else $x^{k + 1} \leftarrow x^{k}$
end if
end for

Output:

x^{1}, x^{2}, \dots, x^{K}

2.3. Problem Set-Up

In this part, we consider the problem of sampling from a target distribution or posterior

Π^{*}

supported on a compact set

X \subset R^{d}

equipped with a density

π^{*}

. It can be written in the form

π^{*} (x) = \frac{exp {- U (x)} I (x \in X)}{\int_{X} exp {- U (y)} d y}

(3)

for some potential function

U : R^{d} \mapsto R

. Assume that the function

U (\cdot)

and the set

X

satisfy the following assumptions:

Assumption 1.

U (\cdot)

is a twice continuously differentiable, L-smooth and m-strongly convex function on

R^{d}

. That is, there exist universal constants

L \geq m > 0

such that

\frac{m}{2} {| y - x |}_{2}^{2} \leq U (y) - U (x) - {\nabla U (x)}^{T} (y - x) \leq \frac{L}{2} {| y - x |}_{2}^{2}

for any

x, y \in R^{d}

.

Assumption 2.

X \subset R^{d}

is a compact and convex set satisfying

B (x^{*}, r) \subset X \subset B (x^{*}, R)

for some universal constants

0 < r \leq R

and

x^{*} \in X

.

Hereafter, we assume that the above two assumptions hold, which is frequently used in the literature for the analysis of constrained sampling algorithms [34,35,36]. We will modify the MALA in Algorithm 1 to adapt to sampling from the above constrained distribution, and analyse its non-asymptotic theoretical properties and derive the mixing time bound in terms of the problem dimension d and the error tolerance

ε

.

3. The Constrained Langevin Algorithms

In this section, we present three sampling algorithms based on MALA to handle the distribution constrained within some convex body

X

. As discussed in [34], the inherent challenges in constrained sampling problems arise from the complex properties on the boundary of the constraint region, and the lack of the curvature in the potential function. To tackle these challenges, Ref. [34] initially studied constrained sampling from the uniform distribution on

X

, and then extended the exploration to more general distributions. Similarly, we begin our investigation by examining some simple constrained regions and progressively extend our analysis to more complex constraint scenarios.

3.1. Constrained Langevin Algorithm via Rejection

We initially discuss the case where the constraint region

X

is an Euclidean ball on

R^{d}

, where the boundary can be characterized by a curve equation. If

X = B (x^{*}, R)

for some universal constant

R > 0

and

x^{*} \in R^{d}

, we consider the simple and intuitive rejection-type methods via the Metropolis–Hastings accept-reject step for sampling from the distribution with density defined as (3). The constrained MALA for

X = B (x^{*}, R)

outlined in Algorithm 2 as follows, where

ϕ_{h} (\cdot | x)

denotes the probability density function of the Gaussian distribution

N {x - h \nabla U (x), 2 h I_{d}}

.

Algorithm 2 The MALA for Euclidean ball constrained domain

Input: a sample

x^{0} \in X

from an initial distribution

P^{0}

, the step size h

for $k = 0, 1, 2, \dots, K - 1$ do
Proposal step: $y^{k + 1} \leftarrow x^{k} - h \nabla U (x^{k}) + ξ$ , where $ξ \sim N (0, 2 h I_{p})$
Accept-reject step:
if $y^{k + 1} \in X$ then
compute $α^{k + 1} = min \{1, \frac{ϕ_{h} (x^{k} | y^{k + 1}) π^{*} (y^{k + 1})}{ϕ_{h} (y^{k + 1} | x^{k}) π^{*} (x^{k})}\}$
sample $u^{k + 1}$ from the uniform distribution on $[0, 1]$
if $α^{k + 1} \geq u^{k + 1}$ , then $x^{k + 1} \leftarrow y^{k + 1}$
else $x^{k + 1} \leftarrow x^{k}$
end if
else $x^{k + 1} \leftarrow x^{k}$
end if
end for

Output:

x^{1}, x^{2}, \dots, x^{K}

Compared with Algorithm 1, this modified algorithm forces the Markov chain to stay at the current state when it jumps out of the limited state space

X = B (x^{*}, R)

, which is a quite natural extension of the unconstrained MALA. This idea is not completely novel. Ref. [34] suggested a projection step in unadjusted Langevin algorithm for sampling from a log-concave distribution with compact support. Ref. [10] proposed an MALA for constrained optimization, where they used a similar step to constrain the Markov chain to stay at a given state space. Due to the favorable properties on the boundary of constrained domain

X = B (x^{*}, R)

, we can establish the theoretical results of Algorithm 2; see Lemma A1 in Appendix B for details.

3.2. Norm-Constrained Domain

Regularization is a technique commonly used in machine-learning and statistical modeling. As discussed in [38], some models with regularization can be reformulated as the distributions with norm-constraint on the parameters. Notice that the

L_{p}

-norm for the vector

x = {(x_{1}, x_{2}, \dots, x_{d})}^{T} \in R^{d}

is defined as

{| x |}_{p} = \{\begin{matrix} {(\sum_{i = 1}^{d} {| x_{i} |}^{p})}^{1 / p}, & p \in (0, \infty) \\ max_{1 \leq i \leq d} | x_{i} |, & p = \infty . \end{matrix}

For the norm-constrained domain

X = {x \in R^{d} : | x |_{p} \leq C}

with some universal constant

C > 0

, we can transform it into the Euclidean ball

B (0, 1)

via a vector-valued function

f : X \mapsto B (0, 1)

. Specifically, for any

x = {(x_{1}, x_{2}, \dots, x_{d})}^{T} \in X

, we have

y = f (x) = : {f_{1} (x), f_{2} (x), \dots, f_{d} (x)}^{T}

with

f_{i} (x) = \{\begin{matrix} C^{- p / 2} sgn (x_{i}) {| x_{i} |}^{p / 2}, & p \in (0, \infty) \\ x_{i} \frac{{| x |}_{\infty}}{{C | x |}_{2}}, & p = \infty \end{matrix}, 1 \leq i \leq d

such that

y \in B (0, 1)

. Due to the bijective nature of the function

f : X \mapsto B (0, 1)

, its inverse function

f^{- 1} = : g : B (0, 1) \mapsto X

can be defined accordingly. Similarly, for any

y = {(y_{1}, y_{2}, \dots, y_{d})}^{T} \in B (0, 1)

, we have

x = g (y) = : {g_{1} (y), g_{2} (y), \dots, g_{d} (y)}^{T}

with

g_{i} (y) = \{\begin{matrix} C sgn (y_{i}) {| y_{i} |}^{2 / p}, & p \in (0, \infty) \\ C y_{i} \frac{{| y |}_{2}}{{| y |}_{\infty}}, & p = \infty \end{matrix}, 1 \leq i \leq d

such that

x \in X

. By utilizing the vector-valued functions

f (\cdot)

and

g (\cdot)

defined above, we can employ the Euclidean ball constrained sampling algorithm, as described in Section 3.1, to tackle the norm-constrained domain

X = {x \in R^{d} : | x |_{p} \leq C}

. The computational process is outlined in Algorithm 3, where

π^{B (0, 1)} (x) = \frac{exp {- U (x)} I {x \in B (0, 1)}}{\int_{B (0, 1)} exp {- U (y)} d y}

with the potential function

U (\cdot)

.

Algorithm 3 The MALA for norm-constrained domain

Input: a sample

x^{0} \in X

from an initial distribution

P^{0}

, the step size h

for $k = 0, 1, 2, \dots, K - 1$ do
Transformation step: $y^{k} \leftarrow f (x^{k})$
Proposal step: $z^{k + 1} \leftarrow y^{k} - h \nabla U (y^{k}) + ξ$ , where $ξ \sim N (0, 2 h I_{p})$
Accept-reject step:
if $z^{k + 1} \in B (0, 1)$ then
compute $α^{k + 1} = min \{1, \frac{ϕ_{h} (y^{k} | z^{k + 1}) π^{B (0, 1)} (z^{k + 1})}{ϕ_{h} (z^{k + 1} | y^{k}) π^{B (0, 1)} (y^{k})}\}$
sample $u^{k + 1}$ from the uniform distribution on $[0, 1]$
if $α^{k + 1} \geq u^{k + 1}$ , then $y^{k + 1} \leftarrow z^{k + 1}$
else $y^{k + 1} \leftarrow y^{k}$
end if
else $y^{k + 1} \leftarrow y^{k}$
end if
Transformation step: $x^{k + 1} \leftarrow g (y^{k + 1})$
end for

Output:

x^{1}, x^{2}, \dots, x^{K}

Compared with Algorithm 2, the Algorithm 3 achieves the

X \to B (0, 1) \to X

transformation by incorporating two transformation steps, thereby addressing the norm-constrained sampling problems. The main purpose of this approach is to facilitate theoretical analysis by leveraging the well-understood properties of the boundary of the Euclidean ball compared to the boundary of the norm-constrained domain; see Appendix B.7 for details.

3.3. Constrained Langevin Algorithm via an Approximation of the Indicator Function

We proceed to discuss the constrained sampling for more general constraint regions. Given

X \in R^{d}

, define

ι_{X} (x) = : - log {I (x \in X)} = \{\begin{matrix} 0, & If x \in X \\ \infty, & If x \notin X \end{matrix}

(4)

for any

x \in R^{d}

. Then, the target distribution

Π^{*}

with density defined as (3) can be reformulated as

π^{*} (x) = \frac{exp {- V_{X} (x)}}{\int_{X} exp {- V (y)} d y}

(5)

with the potential function

V_{X} : R^{d} \mapsto R

satisfying

V_{X} (\cdot) = U (\cdot) + ι_{X} (\cdot),

(6)

where

ι_{X} (\cdot)

is defined in (4). Notice that

ι_{X} (\cdot)

is a convex function on

R^{d}

. Under Assumption 1, we then know that the potential function

V_{X} (\cdot)

is smooth and strongly convex on

R^{d}

. By this transformation, the problem of constrained sampling is apparently converted into an unconstrained counterpart. However, the non-differentiability of the function

V_{X} (\cdot)

on the boundary of

X

poses a challenge when applying the gradient-based unconstrained sampling algorithms. To address this issue, we can approximate the function

ι_{X} (\cdot)

by a differentiable function such as the Moreau-Yosida (MY) envelope [35]. The MY envelope of

ι_{X} (\cdot)

is defined as

ι_{X}^{λ} (x) = inf_{y \in R^{d}} {ι_{X} (x) + {(2 λ)}^{- 1} {| x - y |}_{2}^{2} {} = (2 λ)}^{- 1} {| x - {Pro}_{X} (x) |}_{2}^{2}

(7)

for any

x \in R^{d}

, where

λ > 0

is a regularization parameter and

{Pro}_{X} (\cdot)

is the projection function onto

X

. By [35], the function

ι_{X}^{λ} (\cdot)

is convex and continuously differentiable with the gradient

\nabla ι_{X}^{λ} (x) = λ^{- 1} {x - {Pro}_{X} (x)}

(8)

for any

x \in R^{d}

, and it holds that

| \nabla ι_{X}^{λ} (x) - \nabla ι_{X}^{λ} {(y) |}_{2} \leq λ^{- 1} {| x - y |}_{2}

(9)

for any

x, y \in R^{d}

. Then the approximation of

V_{X} (\cdot)

defined as (6) can be given by

V_{X}^{λ} (\cdot) = U (\cdot) + ι_{X}^{λ} (\cdot),

(10)

which is continuously differentiable, smooth and strongly convex on

R^{d}

if

U (\cdot)

satisfying Assumption 1. Define the distribution

Π^{*, λ}

with density

π^{*, λ} (x) = \frac{exp {- V_{X}^{λ} (x)}}{\int_{R^{d}} exp {- V^{λ} (y)} d y} .

(11)

Recall that the target distribution

Π^{*}

with the reformulated density defined as (5). As discussed in [35], under some mild conditions including Assumptions 1 and 2, the approximation error between

Π^{*}

and

Π^{*, λ}

in total variation distance can be made arbitrarily small by adjusting the regularization parameter

λ

. Therefore, we can utilize the gradient-based unconstrained sampling algorithms, such as the MALA presented in Algorithm 1, for constructing an appropriate Markov chain whose stationary distribution is close to

Π^{*}

; see Algorithm 4 for details, where

ϕ_{h}^{λ} (\cdot | x)

denotes probability density function of the Gaussian distribution

N {x - h {\nabla U (x) + \nabla ι_{X}^{λ} (x)}, 2 h I_{d}}

with

\nabla ι_{X}^{λ} (\cdot)

defined as (8).

Algorithm 4 The MALA for convex constrained domain

Input: a sample

x^{0} \in R^{d}

from an initial distribution

P^{0}

, the step size h

for $k = 0, 1, 2, \dots, K - 1$ do
Proposal step: $y^{k + 1} \leftarrow x^{k} - h {\nabla U (x^{k}) + \nabla ι_{X}^{λ} (x^{k})} + ξ$ , where $ξ \sim N (0, 2 h I_{p})$
Accept-reject step:
compute $α^{k + 1} = min \{1, \frac{ϕ_{h}^{λ} (x^{k} | y^{k + 1}) π^{*, λ} (y^{k + 1})}{ϕ_{h}^{λ} (y^{k + 1} | x^{k}) π^{*, λ} (x^{k})}\}$
sample $u^{k + 1}$ from the uniform distribution on $[0, 1]$
if $α^{k + 1} \geq u^{k + 1}$ , then $x^{k + 1} \leftarrow y^{k + 1}$
else $x^{k + 1} \leftarrow x^{k}$
end if
end for

Output:

x^{1}, x^{2}, \dots, x^{K}

4. Theoretical Results

In this section, we first analyze the properties of the Markov chains determined by the three constrained sampling algorithms presented in Section 3, and then establish the mixing time bounds of these Markov chains.

4.1. Properties of the Markov Chains

The outcomes

{x^{1}, \dots, x^{K}}

from each algorithm presented in Section 3 form a Markov chain, whose properties are established in Propositions 1, 2, and 3, respectively, as below.

Proposition 1.

For

X = B (x^{*}, R)

with some universal constant

R > 0

and

x^{*} \in R^{d}

, the Markov chain determined by Algorithm 2 is

Π^{*}

-irreducible, smooth, and reversible with respect to the stationary distribution

Π^{*}

with density

π^{*}

defined as (3) (The definition of the

Π^{*}

-irreducible, reversible, and smooth Markov chain is deferred to Appendix A).

Remark 1.

Proposition 1 shows that the Markov chain determined by Algorithm 2 enjoys a series of nice properties as the unconstrained MALA, which form the basis for the study of the mixing time bounds of such Markov chain.

The similar properties hold for the Markov chains determined by Algorithms 3 and 4 as well.

Proposition 2.

For

X = {x \in R^{d} : | x |_{p} \leq C}

with some universal constant

C > 0

, the Markov chain determined by Algorithm 3 is

Π^{*}

-irreducible, smooth, and reversible with respect to the stationary distribution

Π^{*}

with density

π^{*}

defined as (3).

Proposition 3.

Under Assumption 2, the Markov chain determined by Algorithm 4 is

Π^{*, λ}

-irreducible, smooth, and reversible with respect to the distribution

Π^{*, λ}

with density

π^{*, λ}

defined as (11).

4.2. Mixing Time Bounds of the Markov Chains

For a distribution

Π

supported on

X \subset R^{d}

with the density

π

, recall that the

ε

-mixing time with respect to

Π

is defined as (2). A

β

-warm initial distribution

P^{0}

with density

p^{0}

with respect to the distribution

Π

is commonly used for the mixing time analysis, which satisfies

sup_{x \in X} \frac{p^{0} (x)}{π (x)} \leq β

for some finite constant

β > 0

. We say that the Markov chain is

ς

-lazy if at each iteration the chain is forced to stay at the previous state with probability at least

ς

. It is a convenient assumption for theoretical analysis of the convergence rate, but not likely to be used in practice since the lazy steps slow down the mixing rate of Markov chain. Given the definitions above and some Markov chain basics in Appendix A, we can obtain the following results for some well-behaved Markov chains defined on

{X, B (X)}

.

Lemma 1.

Consider a reversible, Π-irreducible, ς-lazy, and smooth Markov chain defined on

{X, B (X)}

with stationary distribution Π supported on

X

. For any error tolerance

ε \in (0, 1)

and β-warm initial distribution

P^{0}

, the ε-mixing time with respect to Π satisfying

τ (ε; P^{0}, Π) \leq ⌈\frac{4}{ς} \int_{4 β^{- 1}}^{ε^{- 2}} \frac{d v}{v {\tilde{Ω}}^{2} (v)}⌉,

where

τ (ε; P^{0}, Π)

and

\tilde{Ω} (\cdot)

are defined, respectively, in (2) and (A4).

Remark 2.

Lemma 1 provides a control on the mixing time of a Markov chain on

X

in terms of

\tilde{Ω} (\cdot)

. This result can be seen as an extension of Lemma 3 in [33] to the case where a Markov chain defined on

{X, B (X)}

. We then can readily derive the mixing time bound if a lower bound for

\tilde{Ω} (\cdot)

is known.

The following lemma gives a lower bound for

Ω (\cdot)

.

Lemma 2.

Assume that the distribution Π supported on

X

with the density π satisfy the log-isoperimetry inequality defined as (A1) for some constant

\hat{c} > 0

. If a reversible Markov chain with stationary distribution Π satisfies

{sup}_{{x, y \in X : | x - y |}_{2} \leq Δ} {∥ T_{x} - T_{y} ∥}_{TV} \leq 1 - δ

for some

δ \in (0, 1)

and

Δ > 0

, it then holds that

Ω (v) \geq \frac{δ}{4} min \{1, \frac{Δ}{4 \hat{c}} {log}^{1 / 2} (1 + \frac{1}{v})\}

for any

v \in (0, 1 / 2]

, where

T_{x}

is the one-step transition distribution of this Markov chain at

x \in X

and

Ω (\cdot)

is the conductance profile of this Markov chain defined in (A3).

Remark 3.

Lemma 2 states a lower bound for the conductance profile of a Markov chain on

X

. Similar results can be found in the [33,39,40]. Lemma 2, together with Lemma 1, provides a general framework for obtaining mixing time bound of a well-behaved Markov chain on

X

.

Based on Lemmas 1 and 2, we can drive the upper bounds for each

ε

-mixing time of the Markov chains determined by the three constrained sampling algorithms presented in Section 3.

Theorem 1.

For

X = B (x^{*}, R)

with some universal constant

R > 0

and

x^{*} \in R^{d}

, let Assumption 1 hold with

L^{3 / 8} R^{3 / 4} \geq 16 / \sqrt{d} + 8

and

L^{- 15 / 8} m^{2} R^{1 / 4} \geq 12 d

. Given a β-warm initial distribution

P^{0}

and an error tolerance

ε \in (0, 1)

, the Markov chain determined by Algorithm 2 satisfies

τ (ε; P^{0}, Π^{*}) = O (\frac{L^{7 / 4} R^{3 / 2} d}{m} log \frac{log β}{ε})

for any step size h satisfying

\frac{1}{L^{7 / 4} R^{3 / 2} d} \leq h \leq min [\frac{R^{2} {(1 - \tilde{c})}^{2}}{4 {{log}^{1 / 2} (16 / u) + \sqrt{d}}^{2}}, \frac{\sqrt{u}}{4 \sqrt{3} L^{3 / 2} R}, \frac{u}{128 L {{log}^{1 / 2} (16 / u) + \sqrt{d}}^{2}}]

with

\tilde{c} = {1 + (L^{- 7 / 2} R^{- 3} d^{- 2} - L^{- 11 / 4} R^{- 3 / 2} d^{- 1}) m^{2}}^{1 / 2}

and some constant

u \in (1 / 2, 1)

, where

Π^{*}

with density

π^{*}

defined as (3).

Remark 4.

Theorem 1 presents a sharp mixing time bound for Algorithm 2 with a

β

-warm initial distribution as

\tilde{O} {d log (1 / ε)}

up to β and L, m, R which are specified in Assumptions 1 and 2. This result improves upon the previously known mixing time bounds for constrained sampling algorithms in [34,35,36]; see Table 1 for details.

For sampling from the norm-constrained domain

X = {x \in R^{d} : | x |_{p} \leq C}

with some universal constant

C > 0

, we transform it into the sampling from Euclidean ball

B (0, 1)

as shown in Algorithm 3; then, the similar result holds for the Markov chain determined by Algorithm 3 as well.

Corollary 1.

For

X = {x \in R^{d} : | x |_{p} \leq C}

with some universal constant

C > 0

, let Assumption 1 hold with

L^{3 / 8} \geq 16 / \sqrt{d} + 8

and

L^{- 15 / 8} m^{2} \geq 12 d

. Given a β-warm initial distribution

P^{0}

and an error tolerance

ε \in (0, 1)

, the Markov chain determined by Algorithm 3 satisfies

τ (ε; P^{0}, Π^{*}) = O (\frac{L^{7 / 4} d}{m} log \frac{log β}{ε})

for any step size h satisfying

\frac{1}{L^{7 / 4} d} \leq h \leq min [\frac{{(1 - \bar{c})}^{2}}{4 {{log}^{1 / 2} (16 / u) + \sqrt{d}}^{2}}, \frac{\sqrt{u}}{4 \sqrt{3} L^{3 / 2}}, \frac{u}{128 L {{log}^{1 / 2} (16 / u) + \sqrt{d}}^{2}}]

with

\bar{c} = {1 + (L^{- 7 / 2} d^{- 2} - L^{- 11 / 4} d^{- 1}) m^{2}}^{1 / 2}

and some constant

u \in (1 / 2, 1)

, where

Π^{*}

with density

π^{*}

defined as (3).

For the Markov chain determined by Algorithm 4, we can also derive a sharp mixing time bound by the mixing time analysis for sampling from log-concave distribution without constraints in [33] and the approximation error between

Π^{*}

and

Π^{*, λ}

in [35].

Theorem 2.

Let Assumptions 1 and 2 hold, and assume that there exists a universal constant

\tilde{C} > 0

such that

exp {{inf}_{x \in X^{c}} U (x) - {sup}_{x \in X} U (x)} \geq \tilde{C}

. Given the initial distribution

P^{0} = N {x^{★}, {(L + λ^{★ - 1})}^{- 1} I_{d}}

with

x^{★} = arg {min}_{x \in R^{d}} V_{X}^{λ^{★}} (x)

and an error tolerance

ε \in (0, 1)

, the Markov chain determined by Algorithm 4 satisfies

τ (ε; P^{0}, Π^{*}) = O [\frac{(L + λ^{★ - 1}) d}{m} log \frac{d}{ε} \cdot max \{1, \sqrt{\frac{L + λ^{★ - 1}}{d m}}\}]

for the step size h satisfying

h = c \frac{1}{(L + λ^{★ - 1}) d \cdot max \{1, \sqrt{\frac{L + λ^{★ - 1}}{d m}}\}}

with some universal constant

c > 0

, where

V_{X}^{λ^{★}} (\cdot)

is defined as in (10) with

λ^{★} : = 8 π^{- 1} ε^{2} r^{2} d^{- 2} {\tilde{C}}^{2}

, and

Π^{*}

with density

π^{*}

defined as (3).

Remark 5.

Theorem 2 presents a mixing time bound for Algorithm 4 with a feasible initial distribution as

O {d^{3} ε^{- 2} log (d / ε)}

up to L, m, r which are specified in Assumptions 1 and 2 if we choose the regularization parameter

λ = λ^{★}

. This result improves upon the mixing time bound for constrained sampling algorithm without incorporating the Metropolis–Hastings step in [35]; see Table 1 for details.

5. Numerical Experiments

In this section, we conduct numerical experiments to validate the theoretical properties derived in Section 4 and compare the constrained sampling algorithms presented in Section 3 with three competing MCMC algorithms for sampling from constrained log-concave distributions listed in Table 1 under various simulation settings. The implementation of these algorithms involves the selection of a step size. For Algorithms 2 and 3, we follow Theorem 1 and Corollary 3, respectively, to select the step size. For Algorithm 4, we choose the step size as that in [32] for the MALA for sampling from log-concave distribution without constraints. The step size choice of the other three MCMC algorithms follows the recommendation in the associated papers; see Table 2 for details.

5.1. Sampling from the Euclidean Ball Constrained Domain

We consider the problem of sampling from a truncated multivariate Gaussian distribution on

X

, which admits the density

π^{*} (x) \propto exp \{- \frac{{(x - μ)}^{T} Σ^{- 1} (x - μ)}{2}\} I (x \in X),

where the mean

μ = 0

and covariance matrix

Σ \in R^{d \times d}

is a diagonal matrix with

λ_{max} (Σ) = 10

and

λ_{min} (Σ) = 1

. For this target distribution, the potential function

U (\cdot)

and its derivatives are given as

U (x) = 2^{- 1} x^{T} Σ^{- 1} x

,

\nabla U (x) = Σ^{- 1} x

, and

\nabla^{2} U (x) = Σ^{- 1}

. Therefore, the function

U (\cdot)

is smooth with parameter

L = λ_{min}^{- 1} (Σ)

and strongly convex with parameter

m = λ_{max}^{- 1} (Σ)

on

R^{d}

. We select

X = B (0, R)

with

R = 5

, the initial distribution

P^{0} = N_{X} {0, {(2 L)}^{- 1} I_{d}}

, and use the inverse transformation algorithm [14] to generate an initial point from

P^{0}

. We compare Algorithm 2 with the three sampling algorithms in literature given in Table 2, and follow the recommendation in the associated papers to choose the initial points of the three sampling algorithms.

5.1.1. The Trace Graphs of Sampling Algorithms

To initiate a preliminary assessment of the convergence properties of these algorithms, we commence with simple sample trace plots. Write

x = {(x_{1}, \dots, x_{d})}^{T} \in R^{d}

and

μ = {(μ_{1}, \dots, μ_{d})}^{T} \in R^{d}

. Figure 1 depicts the traces of

x_{1}

of the Markov chains determined by the four sampling algorithms under dimension

d = 10

. Evidently, in comparison to the other three algorithms, Algorithm 2 exhibits a notably faster mixing time, as evidenced by the trace consistently remaining around its mean

μ_{1} = 0

. Conversely, the traces of the other three sampling algorithms exhibit greater fluctuations and deviate more from

μ_{1} = 0

.

Figure 2 illustrates the histograms and densities corresponding to these traces of

x_{1}

. Similarly, it is evident that Algorithm 2 achieves sample means closer to

μ_{1} = 0

, along with the least variance. Conversely, the sample means obtained from the other three sampling algorithms exhibit a certain degree of deviation from

μ_{1} = 0

, accompanied by heavier tails.

5.1.2. Dimension and Error Dependence of Algorithm 2

The goal of this simulation is to demonstrate that the dimension and error tolerance dependence of the mixing time bound for Algorithm 2 both conform to the theoretical results shown in Theorem 1.

Since the total variation distance between continuous measures is hard to estimate, we use the error in quantiles along some direction for convergence diagnostics in the experiments. In the spirit of [33], we measure the error in the

95 %

quantile of the sample distribution and the true distribution in the direction along the eigenvector of

Σ

corresponding to

λ_{min} (Σ)

. The approximate mixing time

{\hat{k}}_{mix} (ε)

is then defined as the smallest iteration k when such error between the distribution of the Markov chain at iteration k and the target distribution falls below the error tolerance

ε

. We simulate 20 independent runs of the Markov chain of the algorithms with N = 20,000 samples at each run to determine the approximate mixing time

{\hat{k}}_{mix} (ε)

. Then the final

{\hat{k}}_{mix} (ε)

is the average of these 20 independent runs.

Figure 3a shows the dependence of the approximate mixing time

{\hat{k}}_{mix} (0.2)

as a function of dimension d for Algorithm 2. By the linear regression for

{\hat{k}}_{mix} (0.2)

with respect to d, we conclude that the mixing time of Algorithm 2 is linear in d with slope

4.137

and R-squared

0.991

. Figure 3b presents the dependence of the approximate mixing time

{\hat{k}}_{mix} (ε)

on the inverse of the error tolerance

ε^{- 1}

for Algorithm 2 under

d = 4

. The linear regression for the approximate mixing time

{\hat{k}}_{mix} (ε)

with respect to

ε^{- 1}

suggests that the mixing time of Algorithm 2 is linear in

log (ε^{- 1})

with slope

15.854

and R-squared

0.994

, which is consistent with the theoretical results given in Theorem 1.

5.1.3. Comparison with Competitive Algorithms

Figure 4a shows the dependence of the approximate mixing time

{\hat{k}}_{mix} (0.2)

on the problem dimension d for the four sampling algorithms. Compared with the other three algorithms, the approximate mixing time of Algorithm 2 seems more robust to dimension. When d is small, the approximate mixing time of the four algorithms is comparatively close. However, as the dimension d increases, the approximate mixing time of PLMC and MYULA increases rapidly, showing a polynomial order with respect to d. Moreover, the dimension dependence of MLD and Algorithm 2 both indicate a linear growth trend, and MLD needs a few more steps than Algorithm 2 to reach the same error tolerance.

Figure 4b presents the dependence of the approximate mixing time

{\hat{k}}_{mix} (ε)

on the inverse of the error tolerance

ε^{- 1}

for the four sampling algorithms under

d = 4

. The regression analysis shows that the approximate mixing time

{\hat{k}}_{mix} (ε)

of PLMC and MYULA increases in polynomial order of

ε^{- 1}

. When

ε^{- 1}

is relatively small, MLD and Algorithm 2 have similar approximate mixing time. With the increase in

ε^{- 1}

, the strength of Algorithm 2 gets more significant. For MLD, the linear regression for the approximate mixing time

{\hat{k}}_{mix} (ε)

with respect to

ε^{- 2}

yields a slope of

1.934

and R-squared

0.984

, suggesting the error tolerance dependence of order

ε^{- 2}

.

It is noteworthy that the above analysis not only suggests significantly better dimension and error tolerance dependence of the constrained MALA but also partly verifies the theoretical convergence rates of the three methods for comparison.

5.2. Bayesian Regularized Regression

The regularized regression involves adding a penalty term on the objective function of the regression model, which helps to control the complexity of the model and prevent it from fitting the noise in the data. In this section, we validate the effectiveness of Algorithm 3 for constrained sampling involving the Bayesian regularized regression.

Given the independent and identically observations

y = {(y_{1}, y_{2}, \dots, y_{n})}^{T} \in R^{n}

which follow from the Gaussian distribution with mean

X β

and covariance matrix

σ^{2} I_{n}

, we consider the regression models where the parameter are obtain by minimizing the square of Euclidean norm of the residual subject to a norm-constraint on the regression parameter as follows:

min_{β \in R^{d}} {| y - X β |}_{2}^{2} subject to {| β |}_{p} \leq C

for some universal constant

C > 0

, where

X \in R^{n \times d}

is the design matrix,

β \in R^{d}

is the regression parameter, and

{| β |}_{p}

is the

L_{p}

-norm of

β

. In Bayesian setting, many regularization techniques correspond to imposing certain prior distributions on model parameters. We then consider sampling from the distribution with density

π^{*} (x) \propto exp \{- \frac{{| y - X β |}_{2}^{2}}{2 σ^{2}}\} I (x \in X),

and obtaining the parameter estimates

\hat{β}

via the maximum a posteriori probability (MAP) estimate, where

X = {x \in R^{d} : | x |_{p} \leq C}

. We use the diabetes data studied in [41], and set the burn-in period to be

10^{3}

iterations and

σ^{2} = 1

. Figure 5 presents the paths of the parameter estimates under different norm constraints, which demonstrate that Algorithm 3 can effectively handle the norm-constrained sampling problems.

5.3. Truncated Multivariate Gaussian Distribution

The final comparison was made by examining the sampling performance of MYULA in [35] and Algorithm 4 in the setting of a more general truncated multivariate Gaussian distribution. We consider the same setup as in [35]. Specifically, the density of the target distribution is defined as follows:

π^{*} (x) \propto exp \{- \frac{{(x - μ)}^{T} Σ^{- 1} (x - μ)}{2}\} I (x \in X),

where

X

is a convex set and the origin 0 is on its boundary. Let

μ = 0

, the covariance matrix

Σ \in R^{d \times d}

with

(i, j)

-th element given by

{(Σ)}_{i, j} = 1 / (1 + | i - j |)

, and

X = [0, 5] \times [0, 1]

. We generate

10^{6}

samples for Algorithm 4, and set the burn-in period to be the initial

10 %

iterations.

Table 3 presents the mean and covariance estimation results of the target distribution based on the samples generated by MYULA and Algorithm 4. For comparison purposes, the results of MYULA align with those reported in [35]. With the same number of iterations, Algorithm 4 outperforms MYULA in terms of the estimation results. This indicates that incorporating the Metropolis–Hastings step in Algorithm 4 leads to improvements in the mixing time.

6. Discussion and Conclusions

In this article, we propose three sampling algorithms based on Langevin Monte Carlo with the Metropolis–Hastings steps to handle the distribution constrained within some convex body, and establish the mixing time bounds of these algorithms for sampling from strongly log-concave distributions. Under certain conditions, these bounds are sharper than existing algorithms in the literature. Furthermore, in comparison to existing algorithms, the suggested constrained sampling algorithms are simpler, more intuitive, and easier to operate in some cases.

Our results demonstrate that the sampling algorithm, enhanced with the Metropolis–Hastings step, offers an effective solution for tackling some constrained sampling problems. Numerical experiments fully illustrate the advantages of the proposed algorithms. Although we focus on the strongly log-concave distributions in the theoretical analysis, the proposed algorithm can be readily applied to weakly log-concave distributions or non-convex potential functions. Simultaneously, we recognize that there are various aspects of the sampling algorithms that can be further improved. For instance, potential enhancements could involve the multiple importance sampling methods or adaptive techniques. We leave the investigation of its theoretical properties under such scenarios for future work.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Some Markov Chain Basics

Consider the time-homogeneous (We say that a Markov chain is time-homogeneous in which the probability of any state transition is independent of time.) Markov chains defined on a measurable state space

{X, B (X)}

with a transition probability

Ψ : X \times B (X) \mapsto [0, 1]

. The transition probability satisfies

Ψ (x, d y) \geq 0 \forall x \in X, and \int_{y \in X} Ψ (x, d y) = 1 .

The k-th step transition probability defined recursively as

Ψ^{k} (x, d y) = \int_{z \in X} Ψ^{k - 1} (x, d z) Ψ (z, d y) .

For a distribution

Π

on

X

, a Markov chain defined on

{X, B (X)}

is called

Π

-irreducible if for each

A \in B (X)

with

Π (A) > 0

and each

x \in X

, there exists

k \in N

such that

Ψ^{k} (x, A) > 0

. A Markov chain defined on

{X, B (X)}

with transition probability

Ψ

:

X \times B (X) \mapsto [0, 1]

and stationary distribution

Π

is called reversible if it satisfies the detailed balance condition

Π (d x) Ψ (x, d y) = Π (d y) Ψ (y, d x)

for any

x, y \in X

.

Smooth chain assumption. We say that the Markov chain satisfies the smooth chain condition if its transition probability

Ψ : X \times B (X) \mapsto [0, 1]

can be expressed in the form

Ψ (x, d y) = ψ (x, y) d y + ι_{x} δ_{x} (d y)

for any

x, y \in X

, where

ψ (\cdot, \cdot)

is the transition kernel satisfying

ψ (x, y) \geq 0

for any

x, y \in X

,

ι_{x}

denotes the one-step probability of the chain to stay at its current state x, and

δ_{x} (\cdot)

is the Dirac-delta function at x.

Log-isoperimetric inequality. A distribution

Π

supported on

X

with density

π

is said to satisfy the log-isoperimetry inequality with some constant

\hat{c} > 0

if

Π (S_{3}) \geq \frac{d (S_{1}, S_{2})}{2 \hat{c}} min {Π (S_{1}), Π (S_{2})} {log}^{1 / 2} [1 + \frac{1}{min {Π (S_{1}), Π (S_{2})}}]

(A1)

for any partition

(S_{1}

,

S_{2}

,

S_{3})

of

X

, where

Π (S_{i}) = \int_{S_{i}} π (x) d x

and

d (S_{1}, S_{2}) =

{inf}_{x \in S_{1}, y \in S_{2}} {| x - y |}_{2}

.

Conductance profile. Given a Markov chain with transition probability

Ψ : X \times B (X) \mapsto [0, 1]

and stationary distribution

Π

with density

π

, its stationary flow

ω (\cdot) : B (X) \mapsto R

is defined as

ω (S) = \int_{S} Ψ (x, S^{c}) π (x) d x

(A2)

for any

S \in B (X)

. For any

v \in (0, 1 / 2]

, the conductance profile is given by

Ω (v) = inf_{S : Π (S) \in (0, v]} \frac{ω (S)}{Π (S)} .

(A3)

Furthermore, the extended conductance profile is defined as

\tilde{Ω} (v) = \{\begin{matrix} Ω (v), & v \in (0, 1 / 2], \\ Ω (1 / 2), & v \in (1 / 2, \infty) . \end{matrix}

(A4)

Appendix B. Proofs

Appendix B.1. Proof of Proposition 1

Proof of Proposition 1.

Denote by

Ψ (x, \cdot)

the transition probability of the Markov chain at

x \in X

determined by Algorithm 2. For any

x \in X

, let

P_{x, h} = N {x - h \nabla U (x), 2 h I_{d}}

with the step size h. Write the density of

P_{x, h}

as

ϕ_{h} (\cdot | x)

. For any

x \in X

, denote by

α_{x} (y) = min {1, R_{x} (y)}

the acceptance probability for any

y \in R^{d}

, where

R_{x} (y) = \frac{π^{*} (y) ϕ_{h} (x | y)}{π^{*} (x) ϕ_{h} (y | x)} I (y \in X) .

Then, the transition probability of the associated Markov chain at

x \in X

has a probability mass

ψ_{x} = 1 - \int_{X} ϕ_{h} (y | x) α_{x} (y) d y

. Define the transition kernel

ψ (x, y) = ϕ_{h} (y | x) α_{x} (y) I (y \in X \ {x})

for

x \in X

. Then, the transition probability

Ψ : X \times B (X) \mapsto [0, 1]

satisfies

Ψ (x, d y) = ψ_{x} δ_{x} (d y) + ψ (x, y) d y,

(A5)

where

δ_{x} (\cdot)

is the Dirac-delta function at x. By the smooth chain condition given in Appendix A, we know the Markov chain with the transition probability

Ψ (\cdot, \cdot)

is smooth.

Recall that

Π^{*}

is the distribution on

X

with the density

π^{*}

defined as (3). Since

α_{x} (y) π^{*} (x) ϕ_{h} (y | x) = α_{y} (x) π^{*} (y) ϕ_{h} (x | y)

for any

x, y \in X

, then

π^{*} (x) ψ (x, y) = π^{*} (y) ψ (y, x)

for any

x, y \in X

. Together with (A5), for any

A, B \in B (X)

, it holds that

\begin{matrix} \int_{A} π^{*} (x) Ψ (x, B) d x = & \int_{A \cap B} π^{*} (x) ψ_{x} d x + \int_{(x, y) \in A \times B} π^{*} (x) ψ (x, y) d x d y \\ = & \int_{B} π^{*} (x) ψ_{x} δ_{x} (A) d x + \int_{(x, y) \in A \times B} π^{*} (y) ψ (y, x) d x d y \\ = & \int_{B} π^{*} (x) Ψ (x, A) d x \end{matrix}

with

δ_{x} (A) = I (x \in A)

, which implies

Π^{*} (A) = \int_{A} π^{*} (x) Ψ (x, X) d x = \int_{X} π^{*} (x) Ψ (x, A) d x

for any

A \in B (X)

. Thus,

Π^{*}

is the stationary distribution of the Markov chain with the transition probability

Ψ (\cdot, \cdot)

. Hence, such Markov chain is reversible.

Furthermore, by (A5), we have

Ψ (x, A) = ψ_{x} δ_{x} (A) + \int_{A} ψ (x, y) d y

for any

x \in X

and

A \in B (X)

. For any

A \in B (X)

with

Π^{*} (A) > 0

, due to

Π^{*} (A) = \int_{A} π^{*} (x) d x

, we know the Lebesgue measure of A is nonzero. Since

α_{x} (y) \leq 1

and

X = B (x^{*}, R)

for some universal constant

R > 0

and

x^{*} \in R^{d}

, we know

ψ_{x} \geq 1 - \int_{X} ϕ_{h} (y | x) d y > 0

for any

x \in X

. If

A = {x}

,

Ψ (x, A) \geq ψ_{x} > 0

. If

A \neq {x}

, we know the Lebesgue measure of

A \ {x}

is also nonzero, which implies

Ψ (x, A) \geq \int_{A \ {x}} ψ (x, y) d y > 0

. Thus, the Markov chain with the transition probability

Ψ (\cdot, \cdot)

is

Π^{*}

-irreducible. We complete the proof of Proposition 1. □

Appendix B.2. Proof of Proposition 2

Proof of Proposition 2.

Recall

X = {x \in R^{d} : | x |_{p} \leq C}

for some universal constant

C > 0

. Notice that the additional two steps are introduced in Algorithm 3 only for the purpose of establishing a one-to-one mapping between

{x \in R^{d} : | x |_{p} \leq C}

and

B (0, 1)

, and they do not affect the properties of the Markov chain. Using the same arguments in the proof of Proposition 1, we can obtain the results of Proposition 2. □

Appendix B.3. Proof of Proposition 3

Proof of Proposition 3.

The proof is almost identical to that of Proposition 1. Recall the distribution

Π^{*, λ}

with density

π^{*, λ} (x) = \frac{exp {- V_{X}^{λ} (x)}}{\int_{X} exp {- V^{λ} (y)} d y},

where

V_{X}^{λ} (\cdot) = U (\cdot) + ι_{X}^{λ} (\cdot)

with

\nabla ι_{X}^{λ} (\cdot)

defined as (8). Let

ϕ_{h}^{λ} (\cdot | x)

be the probability density function of the Gaussian distribution

N {x - h {\nabla U (x) + \nabla ι_{X}^{λ} (x)}, 2 h I_{d}}

. We only need to replace

{Π^{*}, π^{*}, ϕ_{h} (\cdot | x)}

which appeared in the proof of Proposition 1 by

{Π^{*, λ}, π^{*, λ}, ϕ_{h}^{λ} (\cdot | x)}

and all the arguments still hold. □

Appendix B.4. Proof of Lemma 1

Proof of Lemma 1.

We introduce some notation first. Denote by

π

the density function of

Π

, and

L_{2} (π)

the space of square integrable functions defined on

X

under the density

π

, that is,

\int_{X} g^{2} (x) π (x) d x < \infty

for any

g \in L_{2} (π)

. The Dirichlet form

E_{Ψ} : L_{2} (π) \times L_{2} (π) \mapsto R

associated with the transition probability

Ψ (\cdot, \cdot)

is defined as follows:

\begin{matrix} E_{Ψ} (g, h) = \frac{1}{2} \int_{(x, y) \in X^{2}} {g (x) - h (y)}^{2} Ψ (x, d y) π (x) d x . \end{matrix}

(A6)

For any

g \in L_{2} (π)

, let

E_{π} (g) = \int_{X} g (x) π (x) d x and {Var}_{π} (g) = \int_{X} {g (x) - E_{π} (g)}^{2} π (x) d x .

For a measurable non-empty subset

S \subset X

, the spectral gap is defined as

λ (S) = inf_{g \in c_{0}^{+} (S)} \frac{E_{Ψ} (g, g)}{{Var}_{π} (g)},

where

c_{0}^{+} (S) = {g \in L_{2} (π) : supp (g) \subset S, g \geq 0, {Var}_{π} (g) > 0}

. Define the spectral profile

Λ (\cdot)

as

\begin{matrix} Λ (v) = inf_{S : Π (S) \in (0, v]} λ (S) \end{matrix}

(A7)

for any

v \in (0, \infty)

. If the current state of a Markov chain admits the distribution

P

with density p, we write

T (p)

as the distribution of its next state. The proof of Lemma 1 includes two steps. The first step is to show

τ (ε; P^{0}, Π) \leq \frac{1}{ς} \int_{4 β^{- 1}}^{ε^{- 2}} \frac{d v}{v Λ (v)} .

The second step is to show that the spectral profile and the conductance profile defined in (A3) are related as

Λ (v) \geq \{\begin{matrix} \frac{Ω^{2} (v)}{2}, & v \in (0, 1 / 2], \\ \frac{Ω^{2} (1 / 2)}{4}, & v \in (1 / 2, \infty) . \end{matrix}

Notice that

Π (X) = 1

. Replacing the restricted conductance profile and restricted spectral gap in the proof of Lemma 1 in [33] by the conductance profile and spectral gap, respectively, and using the similar arguments in the proof of Lemma 1 in [33], we can obtain the results of the two steps. Then, Lemma 1 can be constructed immediately. □

Appendix B.5. Proof of Lemma 2

Proof of Lemma 2.

Denote by

π

the density function of the distribution

Π

. For any measurable non-empty subset

A_{1} \subset X

such that

0 < Π (A_{1}) \leq 1 / 2

, we have

Π (A_{2}) \geq 1 / 2 \geq Π (A_{1})

, where

A_{2} = X \ A_{1}

. Given

δ > 0

, we define the following sets

A_{1}^{'} = {x \in A_{1} : Ψ (x, A_{2}) < δ / 2}, A_{2}^{'} = {x \in A_{2} : Ψ (x, A_{1}) < δ / 2}

and

A_{3}^{'} = X \ (A_{1}^{'} ⋃ A_{2}^{'})

, where

Ψ : X \times B (X) \mapsto [0, 1]

is the transition probability of the considered Markov chain.

On the one hand, if

Π (A_{1}^{'}) \leq Π (A_{1}) / 2

, then

Π (A_{1} \ A_{1}^{'}) \geq Π (A_{1}) / 2

. Thus,

\int_{A_{1}} Ψ (x, A_{2}) π (x) d x \geq \int_{A_{1} \ A_{1}^{'}} Ψ (x, A_{2}) π (x) d x \geq \frac{δ}{2} \int_{A_{1} \ A_{1}^{'}} π (x) d x \geq \frac{δ}{4} Π (A_{1}) .

Similarly, if

Π (A_{2}^{'}) \leq Π (A_{2}) / 2

, we have

\int_{A_{2}} Ψ (x, A_{1}) π (x) d x \geq δ Π (A_{2}) / 4

. By the detailed balance condition and the Fubini’s theorem, it holds that

\begin{matrix} \int_{A_{1}} Ψ (x, A_{2}) π (x) d x & = \int_{x \in A_{1}} \int_{y \in A_{2}} Ψ (x, d y) π (x) d x \\ = \int_{x \in A_{1}} \int_{y \in A_{2}} Ψ (y, d x) π (y) d y \\ = \int_{A_{2}} Ψ (y, A_{1}) π (y) d y = \int_{A_{2}} Ψ (x, A_{1}) π (x) d x . \end{matrix}

(A8)

Therefore, if

Π (A_{1}^{'}) \leq Π (A_{1}) / 2

or

Π (A_{2}^{'}) \leq Π (A_{2}) / 2

, we have

\int_{A_{1}} Ψ (x, A_{2}) π (x) d x \geq \frac{δ}{4} min {Π (A_{1}), Π (A_{2})} = \frac{δ}{4} Π (A_{1}) .

On the other hand, we consider the case with

Π (A_{1}^{'}) > Π (A_{1}) / 2

and

Π (A_{2}^{'}) > Π (A_{2}) / 2

. Notice that

T_{x} (\cdot) = Ψ (x, \cdot)

. By the definition of the total variation distance, for any

x \in A_{1}^{'}

and

y \in A_{2}^{'}

, we have

∥ T_{x} - T_{y} ∥_{TV} \geq Ψ (x, A_{1}) - Ψ (y, A_{1}) = 1 - Ψ (x, A_{2}) - Ψ (y, A_{1}) > 1 - δ .

Since

{sup}_{{x, y \in X : | x - y |}_{2} \leq Δ} {∥ T_{x} - T_{y} ∥}_{TV} \leq 1 - δ

, we know

{| x - y |}_{2} > Δ

, which implies

d (A_{1}^{'}, A_{2}^{'}) : = {inf}_{x \in A_{1}^{'}, y \in A_{2}^{'}} {| x - y |}_{2} \geq Δ

. Recall

A_{3}^{'} = X \ (A_{1}^{'} ⋃ A_{2}^{'})

. By (A8),

\begin{matrix} \int_{A_{1}} Ψ (x, A_{2}) π (x) d x & = \frac{1}{2} \int_{A_{1}} Ψ (x, A_{2}) π (x) d x + \frac{1}{2} \int_{A_{2}} Ψ (x, A_{1}) π (x) d x \\ \geq \frac{1}{2} \int_{A_{1} \ A_{1}^{'}} Ψ (x, A_{2}) π (x) d x + \frac{1}{2} \int_{A_{2} \ A_{2}^{'}} Ψ (x, A_{1}) π (x) d x \\ \geq \frac{δ}{4} Π (A_{3}^{'}) . \end{matrix}

(A9)

Since

Π (A_{1}^{'}) > Π (A_{1}) / 2

,

Π (A_{2}^{'}) > Π (A_{2}) / 2

and the sets

(A_{1}^{'}, A_{2}^{'}, A_{3}^{'})

partition

X

, by the log-isoperimetry inequality given in (A1), it holds that

\begin{matrix} Π (A_{3}^{'}) & \geq \frac{d (A_{1}^{'}, A_{2}^{'})}{2 \hat{c}} min {Π (A_{1}^{'}), Π (A_{2}^{'})} {log}^{1 / 2} [1 + \frac{1}{min {Π (A_{1}^{'}), Π (A_{2}^{'})}}] \\ \geq \frac{Δ}{4 \hat{c}} min {Π (A_{1}), Π (A_{2})} {log}^{1 / 2} [1 + \frac{2}{min {Π (A_{1}), Π (A_{2})}}] \\ \geq \frac{Δ}{4 \hat{c}} Π (A_{1}) {log}^{1 / 2} \{1 + \frac{1}{Π (A_{1})}\}, \end{matrix}

(A10)

where the second inequality follows from the fact that

x {log}^{1 / 2} (1 + x^{- 1})

is non-decreasing in

x > 0

. By (A9) and (A10), we have

\int_{A_{1}} Ψ (x, A_{2}) π (x) d x \geq \frac{δ Δ}{16 \hat{c}} Π (A_{1}) {log}^{1 / 2} \{1 + \frac{1}{Π (A_{1})}\} .

Putting the two cases together, it holds that

ω (A_{1}) = \int_{A_{1}} Ψ (x, A_{2}) π (x) d x \geq \frac{δ}{4} Π (A_{1}) min [1, \frac{Δ}{4 \hat{c}} {log}^{1 / 2} \{1 + \frac{1}{Π (A_{1})}\}]

for any measurable non-empty subset

A_{1} \subset X

with

0 < Π (A_{1}) \leq 1 / 2

. Due to

{inf}_{x \in (0, v]} {log}^{1 / 2} (1 + x^{- 1}) = {log}^{1 / 2} (1 + v^{- 1})

, by the definition of the conductance profile given in (A9), we have

Ω (v) \geq \frac{δ}{4} min \{1, \frac{Δ}{4 \hat{c}} {log}^{1 / 2} (1 + \frac{1}{v})\}

for any

v \in (0, 1 / 2]

. We complete the proof of Lemma 2. □

Appendix B.6. Proof of Theorem 1

For any

x \in X

, let

P_{x, h} = N {x - h \nabla U (x), 2 h I_{d}}

with the step size h. For

X = B (x^{*}, R)

with some universal constant

R > 0

and

x^{*} \in R^{d}

, without loss of generality, we set

x^{*} = arg {min}_{x \in R^{d}} U (x)

. Under Assumption 1, we know

\nabla U (x^{*}) = 0

.

Lemma A1.

Let

X = B (x^{*}, R)

for some universal constant

R > 0

and

x^{*} = arg {min}_{x \in R^{d}} U (x)

, and Assumption 1 hold. For any step size

h \in (0, 2 L^{- 1}]

with L specified in Assumption 1 , it holds that

∥ P_{x, h} - P_{x, h} ∥_{TV} \leq \frac{{| x - y |}_{2}}{\sqrt{2 h}}

(A11)

for any

x, y \in X

. Furthermore, if

L^{3 / 8} R^{3 / 4} \geq 16 d^{- 1 / 2} + 8

and

L^{- 15 / 8} m^{2} R^{1 / 4} \geq 12 d

, for any

u \in (1 / 2, 1)

, it holds that

{sup}_{x \in X} {∥ P_{x, h} - T_{x} ∥}_{TV} \leq \frac{u}{4}

(A12)

for any step size h satisfying

\frac{1}{L^{7 / 4} R^{3 / 2} d} \leq h \leq min [\frac{R^{2} {(1 - \tilde{c})}^{2}}{4 {{log}^{1 / 2} (16 u^{- 1}) + \sqrt{d}}^{2}}, \frac{\sqrt{u}}{4 \sqrt{3} L^{3 / 2} R}, \frac{u}{128 L {{log}^{1 / 2} (16 u^{- 1}) + \sqrt{d}}^{2}}]

with

\tilde{c} = {1 + (L^{- 7 / 2} R^{- 3} d^{- 2} - L^{- 11 / 4} R^{- 3 / 2} d^{- 1}) m^{2}}^{1 / 2}

, where m is specified in Assumption 1, and

T_{x}

is the one-step transition distribution of the associated Markov chain involved in Algorithm 2 at

x \in X

.

Proof of Lemma A1.

Firstly, we prove the first claim (A11) of this lemma. Recall

P_{x, h} = N {x - h \nabla U (x), 2 h I_{d}}

with the step size h. For any

x, y \in X

, by the Pinsker’s inequality, we have

∥ P_{x, h} - P_{y, h} ∥_{TV} \leq \sqrt{2 KL (P_{x, h} ∥ P_{y, h})} = {(2 h)}^{- 1 / 2} {| {x - h \nabla U (x)} - {y - h \nabla U (y)} |}_{2},

where

KL (P_{x, h} ∥ P_{y, h})

is the Kullback–Leibler divergence between

P_{x, h}

and

P_{y, h}

. Under Assumption 1, by the Taylor expansion, it holds that

\begin{matrix} {| {x - h \nabla U (x)} - {y - h \nabla U (y)} |}_{2} = & | {I_{d} - h \nabla^{2} U (z)} {(x - y) |}_{2} \\ \leq & ∥ I_{d} - h \nabla^{2} {U (z) ∥}_{2} {| x - y |}_{2} \end{matrix}

for some z lying on the jointing line between x and y. Since

X = B (x^{*}, R)

for some universal constant

R > 0

and

U (\cdot)

is L-smooth and m-strongly convex on

X

, by Theorems 2.1.6 and 2.1.11 of [42], we have

m I_{d} ⪯ \nabla^{2} U (z) ⪯ L I_{d}

. Due to

h \in (0, 2 L^{- 1}]

, then

\begin{matrix} λ_{max} {I_{d} - h \nabla^{2} U (z)} \leq λ_{max} (I_{d}) + λ_{max} {- h \nabla^{2} U (z)} \leq 1 - m h \leq 1, \end{matrix}

and

\begin{matrix} λ_{min} {I_{d} - h \nabla^{2} U (z)} \geq λ_{min} (I_{d}) + λ_{min} {- h \nabla^{2} U (z)} \geq 1 - L h \geq - 1 \end{matrix}

for all

z \in X

. Therefore, we can obtain

{sup}_{z \in X} ∥ I_{d} - h \nabla^{2} U (z) {} ∥}_{2} \leq 1

, which implies that

\begin{matrix} ∥ P_{x, h} - P_{y, h} ∥_{TV} \leq \frac{{| x - y |}_{2}}{\sqrt{2 h}} \end{matrix}

for any

x, y \in X

. It yields the claim (A11).

Next, we will prove the second claim (A12) of this lemma. Write the density of

P_{x, h}

as

ϕ_{h} (\cdot | x)

. Notice that the one-step transition distribution of the associated Markov chain at

x \in X

has a probability mass

T_{x} ({x}) = 1 - \int_{X} ϕ_{h} (z | x) α_{x} (z) d z,

and admits a transition kernel

ϕ_{h} (z | x) α_{x} (z) I (z \in X \ {x})

, where

α_{x} (z) = min \{1, \frac{π^{*} (z) ϕ_{h} (x | z)}{π^{*} (x) ϕ_{h} (z | x)} I (z \in X)\} .

By the definition of the total variation distance, we have

\begin{matrix} ∥ P_{x, h} - T_{x} ∥_{TV} & = \frac{1}{2} T_{x} ({x}) + \frac{1}{2} \int_{R^{d}} | ϕ_{h} (z | x) - ϕ_{h} (z | x) α_{x} (z) I (z \in X \ {x}) | d z \\ = 1 - \int_{X} ϕ_{h} (z | x) α_{x} (z) d z \\ = 1 - E_{z \sim P_{x, h}} α_{x} (z) \end{matrix}

for any

x \in X

. By the Markov’s inequality, it holds that

\begin{matrix} E_{z \sim P_{x, h}} α_{x} (z) \geq C P_{z \sim P_{x, h}} \{\frac{π^{*} (z) ϕ_{h} (x | z) I (z \in X)}{π^{*} (x) ϕ_{h} (z | x)} \geq C\} \end{matrix}

(A13)

for any

C \in (0, 1]

. In the sequel, we will derive a lower bound for this tail probability.

Notice that

\frac{π^{*} (z) ϕ_{h} (x | z)}{π^{*} (x) ϕ_{h} (z | x)} = exp [\frac{{4 h {U (x) - U (z)} + | z - x + h \nabla U (x) |}_{2}^{2} - {| x - z + h \nabla U (z) |}_{2}^{2}}{4 h}] .

For the numerator of this exponent, we have

\begin{matrix} {4 h {U (x) - U (z)} + | z - x + h \nabla U (x) |}_{2}^{2} - {| x - z + h \nabla U (z) |}_{2}^{2} \\ = & {4 h {U (x) - U (z)} + | z - x |}_{2}^{2} + {| h \nabla U (x) |}_{2}^{2} + 2 h {(z - x)}^{T} \nabla U (x) \\ {- | x - z |}_{2}^{2} - {| h \nabla U (z) |}_{2}^{2} - 2 h {(x - z)}^{T} \nabla U (z) \\ = & 2 h {U (x) - U (z) - {(x - z)}^{T} \nabla U (x)} + 2 h {U (x) - U (z) - {(x - z)}^{T} \nabla U (z)} \\ + h^{2} {{| \nabla U (x) |}_{2}^{2} - {| \nabla U (z) |}_{2}^{2}} . \end{matrix}

Since

U (\cdot)

is L-smooth and m-strongly convex on

X

, it holds that

U (x) - U (z) - {(x - z)}^{T} \nabla U (x) \geq - \frac{L}{2} {| x - z |}_{2}^{2}, U (x) - U (z) - {(x - z)}^{T} \nabla U (z) \geq \frac{m}{2} {| x - z |}_{2}^{2}

for any

x, z \in X

. By the Cauchy–Schwarz’s inequality, triangle inequality, and Theorem 2.1.5 of [42], we know

\begin{matrix} {| \nabla U (x) |}_{2}^{2} - {| \nabla U (z) |}_{2}^{2} & = {\nabla U (x) + \nabla U (z)}^{T} {\nabla U (x) - \nabla U (z)} \\ \geq {- | \nabla U (x) + \nabla U (z) |}_{2} {| \nabla U (x) - \nabla U (z) |}_{2} \\ \geq {- | \nabla U (x) + \nabla U (z) - \nabla U (x) + \nabla U (x) |}_{2} L {| x - z |}_{2} \\ \geq - {{2 | \nabla U (x) |}_{2} {+ L | x - z |}_{2} {} L | x - z |}_{2} \end{matrix}

for any

x, z \in X

. Since

X = B (x^{*}, R)

for some universal constant

R > 0

and

x^{*} = arg {min}_{x \in R^{d}} U (x)

, by Assumption 1, it holds that

{| \nabla U (x) |}_{2} = | \nabla U (x) - \nabla U (x^{*}) |_{2} \leq L {| x - x^{*} |}_{2} \leq L R

for any

x \in X

. Thus,

\begin{matrix} \frac{π^{*} (z) ϕ_{h} (x | z)}{π^{*} (x) ϕ_{h} (z | x)} \geq exp \{\underset{T}{\underset{︸}{- \frac{L - m}{4} {| x - z |}_{2}^{2} - \frac{h L^{2} R}{2} {| x - z |}_{2} - \frac{h L^{2}}{4} {| x - z |}_{2}^{2}}}\} \end{matrix}

(A14)

for any

x, z \in X

. Since

z \sim P_{x, h} = N {x - h \nabla U (x), 2 h I_{d}}

and

\nabla U (x^{*}) = 0

, we have

{| x - z |}_{2} {= | h \nabla U (x) - (2 h)}^{1 / 2} {ξ |}_{2} \leq {h | \nabla U (x) |}_{2} + {(2 h)}^{1 / 2} {| ξ |}_{2} \leq h L R + {(2 h)}^{1 / 2} {| ξ |}_{2}

and

{| x - z |}_{2}^{2} \leq 2 h^{2} L^{2} R^{2} + 4 h {| ξ |}_{2}^{2}

for some

ξ \sim N (0, I_{d})

, which implies

T \geq - \frac{3}{2} h^{2} L^{3} R^{2} - {2 h L | ξ |}_{2}^{2} - \frac{1}{\sqrt{2}} h^{3 / 2} L^{2} R {| ξ |}_{2}

if

h L \leq 1

. Recall

X = B (x^{*}, R)

. Under Assumption 1, by Theorems 2.1.5, 2.1.9 and 2.1.10 of [42], it holds that

\begin{matrix} | x - h \nabla U (x) - x^{*} |_{2}^{2} & = | x - x^{*} |_{2}^{2} - 2 h {(x - x^{*})}^{T} \nabla U (x) + h^{2} {| \nabla U (x) |}_{2}^{2} \\ \leq | x - x^{*} |_{2}^{2} + (h^{2} - h L^{- 1}) {| \nabla U (x) |}_{2}^{2} \\ \leq {1 + (h^{2} - h L^{- 1}) m^{2}} R^{2} \leq R^{2} \end{matrix}

for any

x \in X

if

h L \leq 1

. Recall

z = x - h \nabla U (x) + {(2 h)}^{1 / 2} ξ

. Select

\tilde{c} \in (0, 1)

satisfying

{\tilde{c}}^{2} = 1 + (L^{- 7 / 2} R^{- 3} d^{- 2} - L^{- 11 / 4} R^{- 3 / 2} d^{- 1}) m^{2}

, which can be guaranteed by

L \geq m

and

L^{3 / 8} R^{3 / 4} \geq 16 d^{- 1 / 2} + 8

. Then

| z - x^{*} |_{2} \leq R \tilde{c} + {(2 h)}^{1 / 2} {| ξ |}_{2}

for any

h \in [L^{- 7 / 4} R^{- 3 / 2} d^{- 1}, L^{- 1} - L^{- 7 / 4} R^{- 3 / 2} d^{- 1}]

. For such selected h, we have

{{| ξ |}_{2} \leq {(2 h)}^{- 1 / 2} R (1 - \tilde{c})} \subset {z \in X} .

Since

L^{3 / 8} R^{3 / 4} \geq 16 d^{- 1 / 2} + 8

and

L^{- 15 / 8} m^{2} R^{1 / 4} \geq 12 d

, by Lemma 1 of [43], for any given

u \in (1 / 2, 1)

, we have

\begin{matrix} P_{z \sim P_{x, h}} (T \geq - \frac{u}{8}, z \in X) & \geq P \{T \geq - \frac{u}{8}, {| ξ |}_{2} \leq \frac{R (1 - \tilde{c})}{\sqrt{2 h}}\} \\ \geq P \{{| ξ |}_{2}^{2} \leq \frac{R^{2} {(1 - \tilde{c})}^{2}}{2 h}\} - P \{{(\sqrt{\frac{3}{2}} h L^{3 / 2} R + \sqrt{2 h L} {| ξ |}_{2})}^{2} \geq \frac{u}{8}\} \\ \geq P [{| ξ |}_{2}^{2} \leq 2 {\{{log}^{1 / 2} (\frac{16}{u}) + \sqrt{d}\}}^{2}] - P ({| ξ |}_{2}^{2} \geq \frac{u}{64 h L}) \\ \geq 1 - \frac{u}{8} \end{matrix}

for any step size h satisfying

\frac{1}{L^{7 / 4} R^{3 / 2} d} \leq h \leq min [\frac{R^{2} {(1 - \tilde{c})}^{2}}{4 {{log}^{1 / 2} (16 u^{- 1}) + \sqrt{d}}^{2}}, \frac{\sqrt{u}}{4 \sqrt{3} L^{3 / 2} R}, \frac{u}{128 L {{log}^{1 / 2} (16 u^{- 1}) + \sqrt{d}}^{2}}] .

Together with (A14), it holds that

P_{z \sim P_{x, h}} \{\frac{π^{*} (z) ϕ_{h} (x | z) I (z \in X)}{π^{*} (x) ϕ_{h} (z | x)} \geq exp (- \frac{u}{8})\} \geq 1 - \frac{u}{8}

for any

x \in X

. Select

C = exp (- u / 8)

in (A13). Due to

exp (- u / 8) \geq 1 - u / 8

, we have

E_{z \sim P_{x, h}} α_{x} (z) \geq {(1 - \frac{u}{8})}^{2} \geq 1 - \frac{u}{4},

which implies

∥ P_{x, h} - T_{x} ∥_{TV} \leq u / 4

for any

x \in X

. Therefore, we have the result (A12). We complete the proof of Lemma A1. □

Lemma A2.

Let

X = B (x^{*}, R)

for some universal constant

R > 0

and

x^{*} \in R^{d}

, and Assumption1hold. The target distribution

Π^{*}

with density

π^{*}

defined as (3) satisfies the log-isoperimetry inequality given in (A1) with constant

\hat{c} = m^{- 1 / 2}

, where m is specified in Assumption 1.

Proof of Lemma A2.

Let p denote the density of the Gaussian distribution

N (0, σ^{2} I_{d})

, and let

Π

be a distribution with density

π = q \cdot p

, where q is a log-concave function supported on

X

. From Lemma 16 in [33], it holds that

\begin{matrix} Π (S_{3}) \geq \frac{d (S_{1}, S_{2})}{2 σ} min {Π (S_{1}), Π (S_{2})} {log}^{1 / 2} [1 + \frac{1}{min {Π (S_{1}), Π (S_{2})}}] \end{matrix}

(A15)

for any partition

S_{1}

,

S_{2}

,

S_{3}

of

X

.

We now prove that the target distribution

Π^{*}

with density

π^{*}

defined as (3) satisfies the log-isoperimetry inequality defined as (A1). Notice that

\begin{matrix} π^{*} (x) = {(\frac{2 π}{m})}^{d / 2} \frac{exp {- U (x) + m | x |_{2}^{2} / 2}}{\int_{X} exp {- U (y)} d y} I (x \in X) \cdot \frac{exp (- m | x |_{2}^{2} / 2)}{{(2 π / m)}^{d / 2}}, \end{matrix}

where

U (\cdot)

is m-strongly convex on

X

. By Theorem 2.1.11 of [42], we know

{U (\cdot) - m | \cdot |}_{2}^{2} / 2

is convex on

X

. Since the indicator function

I (\cdot \in X)

is log-concave on

X

and the class of log-concave functions is closed under multiplication, then

π^{*}

can be expressed as the product of a log-concave function and the density of the normal distribution

N (0, m^{- 1} I_{d})

. By (A15), the distribution

Π^{*}

satisfies the log-isoperimetry inequality defined as (A1) with constant

\hat{c} = m^{- 1 / 2}

. We complete the proof of Lemma A2. □

Proof of Theorem 1.

Let

T_{x}^{L}

be the one-step transition distribution of the Markov chain determined by the

1 / 2

-lazy version of Algorithm 2, at

x \in X

. Then we have

T_{x}^{L} (A) = \frac{1}{2} δ_{x} (A) + \frac{1}{2} T_{x} (A)

for any

A \in B (X)

, where

δ_{x} (\cdot)

is the Dirac-delta function at

x \in X

and

T_{x}

is the one-step transition distribution of the associated Markov chain determined by Algorithm 2, at

x \in X

. By the definition of lazy chain and Proposition 1, we know that the Markov chain with transition distribution

T_{x}^{L}

is

1 / 2

-lazy,

Π^{*}

-irreducible, smooth, and reversible with respect to the distribution

Π^{*}

with density

π^{*}

defined as (3).

Recall

P_{x, h}

is the proposal distribution involved in Algorithm 2 and the

1 / 2

-lazy version of Algorithm 2. For any

x, y \in X

such that

{| x - y |}_{2} \leq {(2^{- 1} h)}^{1 / 2} u

for some

u \in (1 / 2, 1)

and the step size h satisfying

h \geq L^{- 7 / 4} R^{- 3 / 2} d^{- 1}

and

\begin{matrix} h \leq min [\frac{R^{2} {(1 - \tilde{c})}^{2}}{4 {{log}^{1 / 2} (16 u^{- 1}) + \sqrt{d}}^{2}}, \frac{\sqrt{u}}{4 \sqrt{3} L^{3 / 2} R}, \frac{u}{128 L {{log}^{1 / 2} (16 u^{- 1}) + \sqrt{d}}^{2}}] \end{matrix}

with

\tilde{c} = {1 + (L^{- 7 / 2} R^{- 3} d^{- 2} - L^{- 11 / 4} R^{- 3 / 2} d^{- 1}) m^{2}}^{1 / 2}

, by the triangle inequality and Lemma A1, it holds that

\begin{matrix} ∥ T_{x}^{L} - T_{y}^{L} ∥_{TV} \leq & \frac{1}{2} + \frac{1}{2} {∥ T_{x} - T_{y} ∥}_{TV} \\ \leq & \frac{1}{2} + \frac{1}{2} (∥ T_{x} - P_{x, h} ∥_{TV} + ∥ P_{x, h} - P_{y, h} ∥_{TV} + ∥ P_{y, h} - T_{y} ∥_{TV}) \\ \leq & \frac{1 + u}{2} . \end{matrix}

Recall

X = B (x^{*}, R)

for some universal constant

R > 0

and

x^{*} \in R^{d}

. Under Assumption 1, Lemma A2 implies that the distribution

Π^{*}

with density

π^{*}

satisfies the log-isoperimetry inequality given in (A1) with constant

\hat{c} = m^{- 1 / 2}

. Using Lemma 2 with

δ = 2^{- 1} (1 - u)

and

Δ = {(2 h)}^{1 / 2} u

, we have

\begin{matrix} Ω (v) \geq \frac{1 - u}{8} min \{1, \frac{u \sqrt{h m}}{4 \sqrt{2}} {log}^{1 / 2} (1 + \frac{1}{v})\} \end{matrix}

for any

v \in (0, 1 / 2]

, where

Ω (\cdot)

is the conductance profile defined in (A3) for Markov chain with transition distribution

T_{x}^{L}

. For the above selected u and h, define the function

Υ (v) = \{\begin{matrix} \frac{1 - u}{8} min \{1, \frac{u \sqrt{h m}}{4 \sqrt{2}} {log}^{1 / 2} (\frac{1}{v})\}, & v \in (0, 1 / 2] \\ \frac{1 - u}{8} min \{1, \frac{u \sqrt{h m}}{4 \sqrt{2}} {(log 2)}^{1 / 2}\}, & v \in (1 / 2, \infty) \end{matrix}

for any

v > 0

. Recall that

τ (ε; P^{0}, Π^{*}) = min {k \in N : ∥ T^{k} (P^{0}) - Π^{*} ∥_{TV} \leq ε}

for an error tolerance

ε \in (0, 1)

, where

T^{k} (P^{0})

is the distribution of the Markov chain with transition distribution

T_{x}^{L}

at the k-th step. Let

\tilde{Ω} (v) = \{\begin{matrix} Ω (v), & v \in (0, 1 / 2], \\ Ω (1 / 2), & v \in (1 / 2, \infty) . \end{matrix}

be the extended conductance profile of such Markov chain. By Lemma 1, it holds that

\begin{matrix} τ (ε; P^{0}, Π^{*}) \leq ⌈8 \int_{4 β^{- 1}}^{ε^{- 2}} \frac{d v}{v {\tilde{Ω}}^{2} (v)}⌉ \leq ⌈8 \int_{4 β^{- 1}}^{ε^{- 2}} \frac{d v}{v Υ^{2} (v)}⌉ . \end{matrix}

If

β > 8

and

h \leq 32 u^{- 2} {m log (β / 4)}^{- 1}

, it then holds that

\frac{u \sqrt{h m}}{4 \sqrt{2}} {(log 2)}^{1 / 2} < \frac{u \sqrt{h m}}{4 \sqrt{2}} {log}^{1 / 2} (\frac{β}{4}) \leq 1,

which implies

τ (ε; P^{0}, Π^{*}) = O (\frac{1}{h m} log \frac{log β}{ε}) .

Together with

h \geq L^{- 7 / 4} R^{- 3 / 2} d^{- 1}

, we complete the proof of Theorem 1. □

Appendix B.7. Proof of Corollary 1

Proof of Corollary 1.

Recall

X = {x \in R^{d} : | x |_{p} \leq C}

for some universal constant

C > 0

. Since the additional two steps are introduced in Algorithm 3 only transforming the sampling from the norm-constrained region

{x \in R^{d} : | x |_{p} \leq C}

to the Euclidean ball

B (0, 1)

, the convergence rate of the two processes remains consistent. Using the same arguments in the proof of Theorem 1 with

R = 1

and

x^{*} = 0

, we can obtain the results of Corollary 1. □

Appendix B.8. Proof of Theorem 2

Proof of Theorem 2.

Recall that the distribution

Π^{*, λ}

with density

π^{*, λ} (x) = \frac{exp {- V_{X}^{λ} (x)}}{\int_{R^{d}} exp {- V^{λ} (y)} d y}

for a regularization parameter

λ > 0

, where

V_{X}^{λ} (\cdot)

is defined as in (10), and the target distribution

Π^{*}

with density

π^{*} (x) = \frac{exp {- U (x)} I (x \in X)}{\int_{X} exp {- U (y)} d y}

for some potential function

U : R^{d} \mapsto R

. Under Assumptions 1 and 2, if there exists a universal constant

\tilde{C} > 0

such that

exp {{inf}_{x \in X^{c}} U (x) - {sup}_{x \in X} U (x)} \geq \tilde{C}

, by Proposition 4 in [35], we have

∥ Π^{*, λ} - Π^{*} ∥_{TV} \leq ε

(A16)

for

λ = 8 π^{- 1} ε^{2} r^{2} d^{- 2} {\tilde{C}}^{2}

with the error tolerance

ε \in (0, 1)

, where

r > 0

is specified in Assumption 2.

Notice that

V_{X}^{λ} (\cdot) = U (\cdot) + ι_{X}^{λ} (\cdot)

with

ι_{X}^{λ} (\cdot)

defined as (7). Under Assumption 1, by (9) and Theorem 2.1.5 in [42], we know that the function

V_{X}^{λ} (\cdot)

is twice continuously differentiable,

(L + λ^{- 1})

-smooth and m-strongly convex on

R^{d}

. Given the initial distribution

P^{0} = N {x^{★}, {(L + λ^{- 1})}^{- 1} I_{d}}

with

x^{★} = arg {min}_{x \in R^{d}} V_{X}^{λ} (x)

and an error tolerance

ε \in (0, 1)

, by Theorem 5 of [33], the Markov chain determined by Algorithm 4 satisfies

τ (ε; P^{0}, Π^{*, λ}) = O [\frac{(L + λ^{- 1}) d}{m} log \frac{d}{ε} \cdot max \{1, \sqrt{\frac{L + λ^{- 1}}{d m}}\}]

with the step size

h = c \frac{1}{(L + λ^{- 1}) d \cdot max \{1, \sqrt{\frac{L + λ^{- 1}}{d m}}\}},

where

c > 0

is a universal constant. Together with (A16), by the definition of

ε

-mixing time and the triangle inequality, we have

τ (ε; P^{0}, Π^{*}) = O [\frac{(L + λ^{- 1}) d}{m} log \frac{d}{ε} max \{1, \sqrt{\frac{L + λ^{- 1}}{d m}}\}]

with

λ = 8 π^{- 1} ε^{2} r^{2} d^{- 2} {\tilde{C}}^{2}

. Hence, we complete the proof of Theorem 2. □

References

Gelfand, A.E.; Smith, A.F.; Lee, T.M. Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. J. Am. Stat. Assoc. 1992, 87, 523–532. [Google Scholar] [CrossRef]
Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
Klein, J.P.; Moeschberger, M.L. Survival Analysis: Techniques for Censored and Truncated Data; Springer: New York, NY, USA, 2005; pp. 5–18. [Google Scholar]
Johnson, V.E.; Albert, J.H. Ordinal Data Modeling; Springer: New York, NY, USA, 2006; pp. 126–157. [Google Scholar]
Celeux, G.; El Anbari, M.; Marin, J.M.; Robert, C.P. Regularization in regression: Comparing Bayesian and frequentist methods in a poorly informative situation. Bayesian Anal. 2012, 7, 477–502. [Google Scholar] [CrossRef]
Paisley, J.W.; Blei, D.M.; Jordan, M.I. Bayesian nonnegative matrix factorization with stochastic variational inference. In Handbook of Mixed Membership Models and Their Applications; Airoldi, E.M., Blei, D.M., Erosheva, E.A., Fienberg, S.E., Eds.; CRC Press: Boca Raton, FL, USA, 2014; pp. 205–224. [Google Scholar]
Khodadadian, A.; Parvizi, M.; Teshnehlab, M.; Heitzinger, C. Rational design of field-effect sensors using partial differential equations, Bayesian inversion, and artificial neural networks. Sensors 2022, 22, 4785. [Google Scholar] [CrossRef] [PubMed]
Noii, N.; Khodadadian, A.; Ulloa, J.; Aldakheel, F.; Wick, T.; François, S.; Wriggers, P. Bayesian inversion with open-source codes for various one-dimensional model problems in computational mechanics. Arch. Comput. Methods Eng. 2022, 29, 4285–4318. [Google Scholar] [CrossRef]
Ma, Y.A.; Chen, Y.; Jin, C.; Flammarion, N.; Jordan, M.I. Sampling can be faster than optimization. Proc. Natl. Acad. Sci. USA 2019, 116, 20881–20885. [Google Scholar] [CrossRef] [PubMed]
Mangoubi, O.; Vishnoi, N.K. Nonconvex sampling with the Metropolis-adjusted Langevin algorithm. In Proceedings of the 32nd Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 2259–2293. [Google Scholar]
Dyer, M.; Frieze, A. Computing the volume of convex bodies: A case where randomness provably helps. Probabilistic Comb. Its Appl. 1991, 44, 123–170. [Google Scholar]
Rodriguez-Yam, G.; Davis, R.A.; Scharf, L.L. Efficient Gibbs sampling of truncated multivariate normal with application to constrained linear regression. In Technical Report; Unpublished Manuscript; Colorado State University: Fort Collins, CO, USA, 2004. [Google Scholar]
Lovász, L.; Vempala, S. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms 2007, 30, 307–358. [Google Scholar] [CrossRef]
Chen, M.H.; Shao, Q.M.; Ibrahim, J.G. Monte Carlo Methods in Bayesian Computation; Springer: New York, NY, USA, 2012; pp. 191–212. [Google Scholar]
Dyer, M.; Frieze, A.; Kannan, R. A random polynomial-time algorithm for approximating the volume of convex bodies. J. ACM 1991, 38, 1–17. [Google Scholar] [CrossRef]
Lang, L.; Chen, W.S.; Bakshi, B.R.; Goel, P.K.; Ungarala, S. Bayesian estimation via sequential Monte Carlo sampling—Constrained dynamic systems. Automatica 2007, 43, 1615–1622. [Google Scholar] [CrossRef]
Chaudhry, S.; Lautzenheiser, D.; Ghosh, K. An efficient scheme for sampling in constrained domains. arXiv 2021, arXiv:2110.10840. [Google Scholar]
Lan, S.; Kang, L. Sampling constrained Continuous probability distributions: A review. arXiv 2021, arXiv:2209.12403. [Google Scholar] [CrossRef]
Neal, R.M. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo; Brooks, S., Gelman, A., Jones, G., Meng, X.L., Eds.; CRC Press: Boca Raton, FL, USA, 2011; pp. 113–162. [Google Scholar]
Pakman, A.; Paninski, L. Exact hamiltonian Monte Carlo for truncated multivariate gaussians. J. Comput. Graph. Stat. 2014, 23, 518–542. [Google Scholar] [CrossRef]
Lan, S.; Shahbaba, B. Sampling constrained probability distributions using spherical augmentation. In Algorithmic Advances in Riemannian Geometry and Applications; Minh, H.Q., Murino, V., Eds.; Springer: New York, NY, USA, 2016; pp. 25–71. [Google Scholar]
Brubaker, M.; Salzmann, M.; Urtasun, R. A family of MCMC methods on implicitly defined manifolds. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, La Palma, Canary Islands, Spain, 21–23 April 2012; pp. 161–172. [Google Scholar]
Ahn, K.; Chewi, S. Efficient constrained sampling via the mirror-Langevin algorithm. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; pp. 28405–28418. [Google Scholar]
Parisi, G. Correlation functions and computer simulations. Nucl. Phys. B 1981, 180, 378–384. [Google Scholar] [CrossRef]
Grenander, U.; Miller, M.I. Representations of knowledge in complex systems. J. R. Stat. Soc. Ser. B (Methodol.) 1994, 56, 549–581. [Google Scholar] [CrossRef]
Roberts, G.O.; Tweedie, R.L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 1996, 2, 341–363. [Google Scholar] [CrossRef]
Roberts, G.O.; Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab. 2002, 4, 337–357. [Google Scholar] [CrossRef]
Dalalyan, A.S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B (Methodol.) 2017, 79, 651–676. [Google Scholar] [CrossRef]
Durmus, A.; Moulines, E. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Bernoulli 2017, 27, 1551–1587. [Google Scholar] [CrossRef]
Cheng, X.; Bartlett, P. Convergence of Langevin MCMC in KL-divergence. In Proceedings of the Machine Learning Research, Lanzarote, Spain, 7–9 April 2018; pp. 186–211. [Google Scholar]
Durmus, A.; Moulines, E. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 2019, 25, 2854–2882. [Google Scholar] [CrossRef]
Dwivedi, R.; Chen, Y.; Wainwright, M.J.; Yu, B. Log-concave sampling: Metropolis-Hastings algorithms are fast. J. Mach. Learn. Res. 2019, 20, 1–42. [Google Scholar]
Chen, Y.; Dwivedi, R.; Wainwright, M.J.; Yu, B. Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients. J. Mach. Learn. Res. 2020, 21, 3647–3717. [Google Scholar]
Bubeck, S.; Eldan, R.; Lehec, J. Finite-time analysis of projected Langevin Monte Carlo. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1243–1251. [Google Scholar]
Brosse, N.; Durmus, A.; Moulines, É.; Pereyra, M. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Proceedings of the 2017 Conference on Learning Theory, Amsterdam, The Netherlands, 7–10 July 2017; pp. 319–342. [Google Scholar]
Hsieh, Y.P.; Kavis, A.; Rolland, P.; Cevher, V. Mirrored langevin dynamics. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 1–10. [Google Scholar]
Roberts, G.O.; Rosenthal, J.S. General state space Markov chains and MCMC algorithms. Probab. Surv. 2004, 1, 20–71. [Google Scholar] [CrossRef]
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
Kannan, R.; Lovász, L.; Montenegro, R. Blocking conductance and mixing in random walks. Comb. Probab. Comput. 2006, 15, 541–570. [Google Scholar] [CrossRef]
Lee, Y.T.; Vempala, S.S. Stochastic localization + Stieltjes barrier = tight bound for log-Sobolev. In Proceedings of the Annual ACM SIGACT Symposium on Theory of Computing, Los Angeles, CA, USA, 25–29 June 2018; pp. 1122–1129. [Google Scholar]
Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499. [Google Scholar] [CrossRef]
Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course; Springer: New York, NY, USA, 2003; pp. 51–101. [Google Scholar]
Laurent, B.; Massart, P. Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 2000, 28, 1302–1338. [Google Scholar] [CrossRef]

Figure 1. The trace graphs of

x_{1}

of the Markov chain determined by the four sampling algorithms.

Figure 1. The trace graphs of

x_{1}

of the Markov chain determined by the four sampling algorithms.

Figure 2. The densities of

x_{1}

of the Markov chain determined by the four sampling algorithms.

Figure 2. The densities of

x_{1}

of the Markov chain determined by the four sampling algorithms.

Figure 3. Approximate mixing time with respect to dimension and error tolerance of Algorithm 2. (a) Dimension dependence for fixed error tolerance. (b) Error tolerance dependence for fixed dimension.

Figure 4. Approximate mixing time with respect to dimension and error tolerance dependence of the four sampling algorithms. (a) Dimension dependence for fixed error tolerance. (b) Error tolerance dependence for fixed dimension.

Figure 5. Bayesian regularized regression via Algorithm 3, where distinct colors represent various trajectories of parameter estimates for distinct variables. (a)

L_{1}

—norm-constraint. (b)

L_{1.5}

—norm-constraint. (c)

L_{2}

—norm-constraint.

Figure 5. Bayesian regularized regression via Algorithm 3, where distinct colors represent various trajectories of parameter estimates for distinct variables. (a)

L_{1}

—norm-constraint. (b)

L_{1.5}

—norm-constraint. (c)

L_{2}

—norm-constraint.

Table 1. Convergence rates for sampling from log-concave distributions with bounded support.

Assumptions	${∥ \cdot ∥}_{TV}$ Rate	Algorithms
$0 I_{d} ⪯ \nabla^{2} U (x) ⪯ L I_{d}$	$\tilde{O} (d^{12} ε^{- 12})$	PLMC in [34]
$m I_{d} ⪯ \nabla^{2} U (x) ⪯ L I_{d}$	$\tilde{O} (d^{5} ε^{- 6})$	MYULA in [35]
$m I_{d} ⪯ \nabla^{2} U (x)$	$\tilde{O} (d ε^{- 2})$	MLD in [36]
$m I_{d} ⪯ \nabla^{2} U (x) ⪯ L I_{d}$	$\tilde{O} {d log (1 / ε)}$	Algorithms 2 and 3 in this paper
$m I_{d} ⪯ \nabla^{2} U (x) ⪯ L I_{d}$	$\tilde{O} (d^{3} ε^{- 2})$	Algorithm 4 in this paper

Table 2. Step sizes for sampling from log-concave distributions with bounded support.

Algorithms	Step Size
PLMC in [34]	$L^{- 1} d^{- 2}$
MYULA in [35]	${d max (d, L)}^{- 1}$
MLD in [36]	the grid search
Algorithm 2 in this paper	$L^{- 7 / 4} R^{- 3 / 2} d^{- 1}$
Algorithm 3 in this paper	$L^{- 7 / 4} d^{- 1}$
Algorithm 4 in this paper	${(L + λ^{★ - 1}) max [d, {m^{- 1} d (L + λ^{★ - 1})}^{1 / 2}]}^{- 1}$

Table 3. The mean and covariance estimation results obtained by MYULA and Algorithm 4.

Assumptions	Mean	Covariance
The truth	$(\begin{matrix} 0.790 \\ 0.488 \end{matrix})$	$(\begin{matrix} 0.326 & 0.017 \\ 0.017 & 0.080 \end{matrix})$
MYULA	$(\begin{matrix} 0.758 \pm 0.052 \\ 0.484 \pm 0.016 \end{matrix})$	$(\begin{matrix} 0.309 \pm 0.038 & 0.017 \pm 0.009 \\ 0.017 \pm 0.009 & 0.088 \pm 0.002 \end{matrix})$
Algorithm 4	$(\begin{matrix} 0.781 \pm 0.034 \\ 0.491 \pm 0.009 \end{matrix})$	$(\begin{matrix} 0.317 \pm 0.012 & 0.017 \pm 0.004 \\ 0.017 \pm 0.004 & 0.082 \pm 0.003 \end{matrix})$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhu, Y. Convergence Rates for the Constrained Sampling via Langevin Monte Carlo. Entropy 2023, 25, 1234. https://doi.org/10.3390/e25081234

AMA Style

Zhu Y. Convergence Rates for the Constrained Sampling via Langevin Monte Carlo. Entropy. 2023; 25(8):1234. https://doi.org/10.3390/e25081234

Chicago/Turabian Style

Zhu, Yuanzheng. 2023. "Convergence Rates for the Constrained Sampling via Langevin Monte Carlo" Entropy 25, no. 8: 1234. https://doi.org/10.3390/e25081234

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Convergence Rates for the Constrained Sampling via Langevin Monte Carlo

Abstract

1. Introduction

2. Preliminaries and Problem Set-Up

2.1. Markov Chain Monte Carlo and Mixing

2.2. Metropolis-Adjusted Langevin Algorithm

2.3. Problem Set-Up

3. The Constrained Langevin Algorithms

3.1. Constrained Langevin Algorithm via Rejection

3.2. Norm-Constrained Domain

3.3. Constrained Langevin Algorithm via an Approximation of the Indicator Function

4. Theoretical Results

4.1. Properties of the Markov Chains

4.2. Mixing Time Bounds of the Markov Chains

5. Numerical Experiments

5.1. Sampling from the Euclidean Ball Constrained Domain

5.1.1. The Trace Graphs of Sampling Algorithms

5.1.2. Dimension and Error Dependence of Algorithm 2

5.1.3. Comparison with Competitive Algorithms

5.2. Bayesian Regularized Regression

5.3. Truncated Multivariate Gaussian Distribution

6. Discussion and Conclusions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Some Markov Chain Basics

Appendix B. Proofs

Appendix B.1. Proof of Proposition 1

Appendix B.2. Proof of Proposition 2

Appendix B.3. Proof of Proposition 3

Appendix B.4. Proof of Lemma 1

Appendix B.5. Proof of Lemma 2

Appendix B.6. Proof of Theorem 1

Appendix B.7. Proof of Corollary 1

Appendix B.8. Proof of Theorem 2

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI