Article

Convergence Rates for the Constrained Sampling via Langevin Monte Carlo

School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China
Entropy 2023, 25(8), 1234; https://doi.org/10.3390/e25081234
Submission received: 26 June 2023 / Revised: 8 August 2023 / Accepted: 15 August 2023 / Published: 18 August 2023
(This article belongs to the Collection Advances in Applied Statistical Mechanics)

Abstract

Sampling from constrained distributions poses significant challenges in terms of algorithmic design and non-asymptotic analysis, and such distributions are frequently encountered in statistical and machine learning models. In this study, we propose three sampling algorithms based on Langevin Monte Carlo with Metropolis–Hastings steps to handle distributions constrained within some convex body. We present a rigorous analysis of the corresponding Markov chains and derive non-asymptotic upper bounds on the convergence rates of these algorithms in total variation distance. Our results demonstrate that the sampling algorithm, enhanced with the Metropolis–Hastings steps, offers an effective solution for tackling some constrained sampling problems. Numerical experiments are conducted to compare our methods with several competing algorithms without Metropolis–Hastings steps, and the results further support our theoretical findings.

1. Introduction

Sampling from distributions with constraints has extensive applications in statistics, machine learning, and operations research, among other areas. Some distributions have bounded support, such as the simple but versatile uniform distribution, which serves as the foundation for a series of Monte Carlo methods, as discussed in [1]. Furthermore, many statistical inference problems involve estimating parameters subject to constraints on the parameter space, which defines a posterior distribution with bounded support in a Bayesian setting. Examples include Latent Dirichlet Allocation [2], truncated data problems in failure and survival time studies [3], ordinal data models [4], constrained lasso and ridge regressions [5], and non-negative matrix factorization [6]. In Bayesian learning, sampling from posterior distributions is a fundamental primitive, used for exploring posterior distributions, identifying unknown parameters, obtaining credible intervals, and solving inverse problems [7,8]. Finally, constrained sampling has great potential in solving constrained optimization problems [9,10].
Many Markov Chain Monte Carlo (MCMC) algorithms have been extensively studied for sampling from probability distributions with convex support or, more generally, with constrained parameters, mainly in the fields of Bayesian statistics and theoretical computer science. Early work includes, among others, [1,11,12,13,14]. Firstly, a direct solution based on MCMC algorithms involves discarding samples that violate the constraints, thereby exclusively retaining samples that satisfy them; see, for example, [1,15,16]. However, these rejection-type approaches may encounter an excessive number of rejections, or an extremely high acceptance rate within some local subspace that satisfies the constraints, which leads to poor mixing and computational inefficiency, especially for complicated constraints and high-dimensional distributions [17,18]. Secondly, part of the literature draws inspiration from penalty functions in optimization problems and considers the construction of barriers along the boundaries of the constrained domain, effectively confining the sampling process to the constrained area. These approaches encounter a major challenge when the samples reach the boundaries of the constraints, necessitating a reflection-based mechanism to redirect them back into the constrained region. To address this issue, Ref. [19] extended the Hamiltonian Monte Carlo (HMC) method by setting the potential energy outside the constraint region to infinity, restricting the states to the desired domain. Ref. [20] extended the HMC method to sample from truncated multivariate Gaussian distributions, and Ref. [21] proposed an approach that maps the constrained domain onto a sphere in an augmented space. Thirdly, motivated by constrained optimization methods, the constrained sampling problem can be reformulated as an unconstrained one via suitable transformations. Following this idea, Ref. [22] proposed a family of novel algorithms based on HMC that introduce Lagrange multipliers and address a broader range of constrained sampling problems. More recently, Ref. [23] tackled the constrained sampling problem via the mirror-Langevin algorithm. In spite of the widespread adoption of these MCMC methods, most of them have primarily focused on algorithm design and lack a rigorous theoretical analysis of convergence rates.
Among all the MCMC algorithms, a class of algorithms based on the Langevin dynamics has garnered significant attention in both practical applications and theoretical analyses [24,25,26,27]. Recent years have witnessed a notable increase in non-asymptotic analyses of these algorithms, initiated by the seminal work of [28]. In the setting of unconstrained sampling, Ref. [29] extended the theoretical analysis of convergence rates by studying decreasing step sizes, and Refs. [30,31] derived corresponding convergence results under alternative distances. These theoretical analyses focus on the Langevin algorithm without the Metropolis–Hastings step. More recently, Refs. [32,33] have shown that incorporating the Metropolis–Hastings step can significantly improve the convergence rate of the associated Langevin algorithm. In the setting of constrained sampling, Ref. [34] suggested a Euclidean projection step in the Langevin algorithm for the constrained case (PLMC) and derived the convergence rate of the associated Markov chain. Ref. [35] presented a detailed theoretical analysis for a proximal version of the Langevin algorithm that incorporates the Moreau–Yosida envelope of the indicator function (MYULA) to handle distributions restricted to a convex body. Ref. [36] constructed the mirrored Langevin algorithm (MLD) using a mirror map to constrain the domain, which achieves the same convergence rate as its unconstrained counterpart [28]. However, these constrained sampling algorithms are all built on the Langevin algorithm without the Metropolis–Hastings steps, and thus do not leverage the fast-mixing advantages of the latter.
In this paper, we consider constrained Langevin Monte Carlo with the Metropolis–Hastings step for sampling from distributions restricted to some convex support. Firstly, for certain constraints, we re-examine the simple and intuitive rejection-type methods for sampling from constrained distributions, and reach the perhaps surprising conclusion that the corresponding algorithm still retains the advantage of rapid convergence when the step size parameter is carefully selected. Subsequently, for more general constrained domains, we build upon the framework proposed in [35], incorporating the Metropolis–Hastings step for further refinement, and analyze the convergence rate of the corresponding Markov chain. We present a detailed non-asymptotic analysis for these constrained algorithms and achieve notably enhanced convergence rates in the total variation distance. Compared with the best rate in [36], our results show that adopting the Metropolis–Hastings step in some constrained MCMC algorithms can also lead to an exponentially improved dependence on the error tolerance.
The rest of the paper is organized as follows. In Section 2, we introduce the preliminaries and the problem set-up of our study. Then, we propose the constrained sampling algorithms tailored to different types of constraint regions in Section 3. Section 4 provides the non-asymptotic theoretical results of the proposed algorithms. The numerical experiments and comparisons are presented in Section 5. Some Markov chain basics are provided in Appendix A and all the technical proofs are deferred to Appendix B.
Notation: Let $\lceil a \rceil$ represent the smallest integer not less than $a \in \mathbb{R}$. For a vector $x \in \mathbb{R}^d$, we use $|x|_2$ to denote its Euclidean norm. For a $q \times q$ symmetric matrix $A$, denote by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ the smallest and largest eigenvalues of $A$, respectively, and let $A^{\mathrm{T}}$ be its transpose. For two square matrices $A$ and $B$, we write $A \preceq B$ if $(B - A)$ is a positive semi-definite matrix. Denote by $I(\cdot)$ the indicator function. For $r > 0$, let $B(x, r) = \{y \in \mathbb{R}^d : |y - x|_2 \le r\}$ denote a closed Euclidean ball with center $x$ and radius $r$. For two real-valued sequences $a_n$ and $b_n$, we say $a_n = O(b_n)$ if there exists a universal constant $c$ such that $a_n \le c\, b_n$, and $a_n = \tilde{O}(b_n)$ if $a_n \le c_n b_n$, where the sequence $c_n$ grows at most poly-logarithmically with $n$. For any two probability measures $\mu$ and $\nu$, denote by $\|\mu - \nu\|_{\mathrm{TV}}$ the total variation distance between $\mu$ and $\nu$.

2. Preliminaries and Problem Set-Up

In this section, we introduce MCMC sampling methods and their mixing analysis, the traditional unconstrained Metropolis-Adjusted Langevin Algorithm (MALA), and the problem set-up for this paper.

2.1. Markov Chain Monte Carlo and Mixing

Consider a distribution $\Pi$ equipped with a density $\pi : \mathbb{R}^d \to \mathbb{R}_+$ such that
$$\pi(x) \propto e^{-U(x)} \tag{1}$$
for some potential function $U : \mathbb{R}^d \to \mathbb{R}$. In certain scenarios, it is necessary to perform sampling from this distribution. For example, many statistical applications involve estimating the expectation of a function $g(X)$ for $X \sim \pi$, where analytical and numerical computation is infeasible. Monte Carlo approximation provides a solution by generating samples from $\Pi$ and using the sample mean to estimate the population expectation. Hence, the key point is to access samples from $\Pi$.
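As a toy illustration of this Monte Carlo primitive (not part of the original analysis), the following Python sketch estimates $\mathbb{E}\{g(X)\}$ by a sample mean when direct draws from $\Pi$ are available; the Gaussian target and the test function $g(x) = x^4$ are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo approximation of E{g(X)} for X ~ N(0, 1) with g(x) = x^4.
# The sample mean of g over draws from the target estimates the expectation.
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 4)  # true value is 3 for a standard Gaussian
print(estimate)
```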
MCMC represents a class of popular sampling algorithms, which construct an appropriate Markov chain whose stationary distribution is Π or close to Π in certain metrics. The class of the Metropolis–Hastings algorithms refers to a type of MCMC method that ensures the corresponding Markov chain converges to the target distribution by incorporating the Metropolis–Hastings step. The Metropolis–Hastings algorithms usually take two steps to generate a Markov chain: a proposal step and a reject-accept step. At each iteration, a sample is generated from the proposal distribution in the proposal step, and it is updated as a new state of the Markov chain with probability determined by the Metropolis–Hastings correction in the reject-accept step.
Given an error tolerance $\varepsilon \in (0, 1)$, in order to obtain an $\varepsilon$-accurate sample with respect to some metric, one simulates the Markov chain for a certain number of steps $k$, as determined by a mixing time analysis. Specifically, we are concerned with how many steps the chain needs to take such that the current distribution of the chain is $\varepsilon$-close to the target distribution $\Pi$. Based on this, we define the $\varepsilon$-mixing time with respect to the target distribution $\Pi$ as
$$\tau(\varepsilon; P_0, \Pi) = \min\{k \in \mathbb{N} : \|\mathcal{T}^k(P_0) - \Pi\|_{\mathrm{TV}} \le \varepsilon\} \tag{2}$$
for the error tolerance $\varepsilon \in (0, 1)$, where $\mathcal{T}$ is the transition operator of the Markov chain and $\mathcal{T}^k(P_0)$ is the distribution of the Markov chain at the $k$-th step from an initial distribution $P_0$.

2.2. Metropolis-Adjusted Langevin Algorithm

Consider the problem of sampling from the distribution with density defined as (1). MALA [26,27] adopts the Gaussian distribution $N\{x_k - h\nabla U(x_k), 2hI_d\}$ as the proposal distribution at the $k$-th step, where $x_k$ is the current state and $h > 0$ is a proper step size, and performs a Metropolis–Hastings accept-reject step. MALA is the standard Metropolis–Hastings algorithm applied to the Langevin dynamics, and the associated Langevin-type algorithms belong to a family of gradient-based MCMC sampling algorithms [37]. The Langevin-type algorithms can be understood as the Euler discretization of the Langevin dynamics:
$$\mathrm{d}X_t = -\nabla U(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t,$$
where $W_t$ $(t \ge 0)$ is the standard Brownian motion on $\mathbb{R}^d$.
Algorithm 1 provides the unconstrained MALA for sampling from a distribution supported on $\mathbb{R}^d$, where $\phi_h(\cdot|x)$ denotes the probability density function of $N\{x - h\nabla U(x), 2hI_d\}$.
Algorithm 1 Metropolis-adjusted Langevin algorithm
Input: a sample $x_0 \in \mathbb{R}^d$ from an initial distribution $P_0$, the step size $h$
  •   for $k = 0, 1, 2, \ldots, K-1$ do
  •       Proposal step: $y_{k+1} \leftarrow x_k - h\nabla U(x_k) + \xi$, where $\xi \sim N(0, 2hI_d)$
  •       Accept-reject step:
  •       compute $\alpha_{k+1} = \min\left\{1, \dfrac{\phi_h(x_k \mid y_{k+1})\,\pi(y_{k+1})}{\phi_h(y_{k+1} \mid x_k)\,\pi(x_k)}\right\}$
  •       sample $u_{k+1}$ from the uniform distribution on $[0, 1]$
  •       if $\alpha_{k+1} \ge u_{k+1}$, then $x_{k+1} \leftarrow y_{k+1}$
  •       else $x_{k+1} \leftarrow x_k$
  •       end if
  •   end for
Output: $x_1, x_2, \ldots, x_K$
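To fix ideas, here is a minimal Python sketch of Algorithm 1. It is not the authors' implementation; the quadratic example target, the step size, and the function names are illustrative assumptions.

```python
import numpy as np

def mala(grad_U, log_pi, x0, h, n_steps, rng):
    """Sketch of Algorithm 1: MALA with proposal N(x - h * grad_U(x), 2h I_d).

    grad_U : gradient of the potential U
    log_pi : unnormalized log-density of the target, i.e., -U
    """
    d = x0.shape[0]
    x = x0.copy()
    chain = np.empty((n_steps, d))
    for k in range(n_steps):
        # Proposal step
        y = x - h * grad_U(x) + np.sqrt(2 * h) * rng.standard_normal(d)
        # Log-densities of the Gaussian proposals phi_h(y|x) and phi_h(x|y),
        # up to a common normalizing constant, which cancels in the ratio.
        log_fwd = -np.sum((y - x + h * grad_U(x)) ** 2) / (4 * h)
        log_bwd = -np.sum((x - y + h * grad_U(y)) ** 2) / (4 * h)
        # Metropolis-Hastings accept-reject step
        log_alpha = min(0.0, log_pi(y) - log_pi(x) + log_bwd - log_fwd)
        if np.log(rng.uniform()) <= log_alpha:
            x = y
        chain[k] = x
    return chain

# Illustrative use: standard Gaussian target, U(x) = |x|_2^2 / 2
rng = np.random.default_rng(1)
out = mala(grad_U=lambda x: x, log_pi=lambda x: -0.5 * np.sum(x ** 2),
           x0=np.zeros(2), h=0.1, n_steps=5000, rng=rng)
```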

2.3. Problem Set-Up

In this part, we consider the problem of sampling from a target distribution or posterior $\Pi^*$ supported on a compact set $\mathcal{X} \subseteq \mathbb{R}^d$, equipped with a density $\pi^*$. It can be written in the form
$$\pi^*(x) = \frac{\exp\{-U(x)\}\, I(x \in \mathcal{X})}{\int_{\mathcal{X}} \exp\{-U(y)\}\,\mathrm{d}y} \tag{3}$$
for some potential function $U : \mathbb{R}^d \to \mathbb{R}$. Assume that the function $U(\cdot)$ and the set $\mathcal{X}$ satisfy the following assumptions:
Assumption 1.
$U(\cdot)$ is a twice continuously differentiable, $L$-smooth, and $m$-strongly convex function on $\mathbb{R}^d$. That is, there exist universal constants $L \ge m > 0$ such that
$$\frac{m}{2}|y - x|_2^2 \le U(y) - U(x) - \{\nabla U(x)\}^{\mathrm{T}}(y - x) \le \frac{L}{2}|y - x|_2^2$$
for any $x, y \in \mathbb{R}^d$.
Assumption 2.
$\mathcal{X} \subseteq \mathbb{R}^d$ is a compact and convex set satisfying
$$B(x^*, r) \subseteq \mathcal{X} \subseteq B(x^*, R)$$
for some universal constants $0 < r \le R$ and some $x^* \in \mathcal{X}$.
Hereafter, we assume that the above two assumptions hold; they are frequently used in the literature for the analysis of constrained sampling algorithms [34,35,36]. We will modify the MALA in Algorithm 1 to adapt it to sampling from the above constrained distribution, analyze its non-asymptotic theoretical properties, and derive the mixing time bound in terms of the problem dimension $d$ and the error tolerance $\varepsilon$.

3. The Constrained Langevin Algorithms

In this section, we present three sampling algorithms based on MALA to handle distributions constrained within some convex body $\mathcal{X}$. As discussed in [34], the inherent challenges in constrained sampling problems arise from the complex behavior at the boundary of the constraint region and the lack of curvature in the potential function. To tackle these challenges, Ref. [34] initially studied constrained sampling from the uniform distribution on $\mathcal{X}$, and then extended the exploration to more general distributions. Similarly, we begin our investigation by examining some simple constraint regions and progressively extend our analysis to more complex constraint scenarios.

3.1. Constrained Langevin Algorithm via Rejection

We initially discuss the case where the constraint region $\mathcal{X}$ is a Euclidean ball in $\mathbb{R}^d$, whose boundary can be characterized by a curve equation. If $\mathcal{X} = B(x^*, R)$ for some universal constant $R > 0$ and $x^* \in \mathbb{R}^d$, we consider the simple and intuitive rejection-type method via the Metropolis–Hastings accept-reject step for sampling from the distribution with density defined as (3). The constrained MALA for $\mathcal{X} = B(x^*, R)$ is outlined in Algorithm 2 below, where $\phi_h(\cdot|x)$ denotes the probability density function of the Gaussian distribution $N\{x - h\nabla U(x), 2hI_d\}$.
Algorithm 2 The MALA for Euclidean ball constrained domain
Input: a sample $x_0 \in \mathcal{X}$ from an initial distribution $P_0$, the step size $h$
  •   for $k = 0, 1, 2, \ldots, K-1$ do
  •       Proposal step: $y_{k+1} \leftarrow x_k - h\nabla U(x_k) + \xi$, where $\xi \sim N(0, 2hI_d)$
  •       Accept-reject step:
  •       if $y_{k+1} \in \mathcal{X}$ then
  •           compute $\alpha_{k+1} = \min\left\{1, \dfrac{\phi_h(x_k \mid y_{k+1})\,\pi^*(y_{k+1})}{\phi_h(y_{k+1} \mid x_k)\,\pi^*(x_k)}\right\}$
  •           sample $u_{k+1}$ from the uniform distribution on $[0, 1]$
  •           if $\alpha_{k+1} \ge u_{k+1}$, then $x_{k+1} \leftarrow y_{k+1}$
  •           else $x_{k+1} \leftarrow x_k$
  •           end if
  •       else $x_{k+1} \leftarrow x_k$
  •       end if
  •   end for
Output: $x_1, x_2, \ldots, x_K$
Compared with Algorithm 1, this modified algorithm forces the Markov chain to stay at its current state when the proposal jumps out of the limited state space $\mathcal{X} = B(x^*, R)$, which is a quite natural extension of the unconstrained MALA; a minimal code sketch of this step is given below. This idea is not completely novel. Ref. [34] suggested a projection step in the unadjusted Langevin algorithm for sampling from a log-concave distribution with compact support. Ref. [10] proposed a MALA for constrained optimization, where a similar step is used to constrain the Markov chain to a given state space. Due to the favorable properties of the boundary of the constrained domain $\mathcal{X} = B(x^*, R)$, we can establish the theoretical results of Algorithm 2; see Lemma A1 in Appendix B for details.
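The helper below is hypothetical, reusing the conventions of the MALA sketch in Section 2.2; only the out-of-ball rejection rule comes from Algorithm 2.

```python
import numpy as np

def mala_ball_step(x, grad_U, log_pi, h, center, radius, rng):
    """One step of Algorithm 2: proposals leaving B(center, radius) are
    rejected outright, so the chain stays at its current state."""
    d = x.shape[0]
    y = x - h * grad_U(x) + np.sqrt(2 * h) * rng.standard_normal(d)
    if np.linalg.norm(y - center) > radius:
        return x  # proposal outside the constraint region
    log_fwd = -np.sum((y - x + h * grad_U(x)) ** 2) / (4 * h)
    log_bwd = -np.sum((x - y + h * grad_U(y)) ** 2) / (4 * h)
    log_alpha = min(0.0, log_pi(y) - log_pi(x) + log_bwd - log_fwd)
    return y if np.log(rng.uniform()) <= log_alpha else x
```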

3.2. Norm-Constrained Domain

Regularization is a technique commonly used in machine learning and statistical modeling. As discussed in [38], some models with regularization can be reformulated as distributions with a norm constraint on the parameters. Notice that the $L_p$-norm of the vector $x = (x_1, x_2, \ldots, x_d)^{\mathrm{T}} \in \mathbb{R}^d$ is defined as
$$|x|_p = \begin{cases} \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}, & p \in (0, \infty), \\ \max_{1 \le i \le d} |x_i|, & p = \infty. \end{cases}$$
For the norm-constrained domain $\mathcal{X} = \{x \in \mathbb{R}^d : |x|_p \le C\}$ with some universal constant $C > 0$, we can transform it into the Euclidean ball $B(0, 1)$ via a vector-valued function $f : \mathcal{X} \to B(0, 1)$. Specifically, for any $x = (x_1, x_2, \ldots, x_d)^{\mathrm{T}} \in \mathcal{X}$, we have $y = f(x) =: \{f_1(x), f_2(x), \ldots, f_d(x)\}^{\mathrm{T}}$ with
$$f_i(x) = \begin{cases} C^{-p/2}\,\mathrm{sgn}(x_i)\,|x_i|^{p/2}, & p \in (0, \infty), \\ \dfrac{x_i\,|x|_\infty}{C\,|x|_2}, & p = \infty, \end{cases} \qquad 1 \le i \le d,$$
such that $y \in B(0, 1)$. Due to the bijective nature of the function $f : \mathcal{X} \to B(0, 1)$, its inverse function $f^{-1} =: g : B(0, 1) \to \mathcal{X}$ can be defined accordingly. Similarly, for any $y = (y_1, y_2, \ldots, y_d)^{\mathrm{T}} \in B(0, 1)$, we have $x = g(y) =: \{g_1(y), g_2(y), \ldots, g_d(y)\}^{\mathrm{T}}$ with
$$g_i(y) = \begin{cases} C\,\mathrm{sgn}(y_i)\,|y_i|^{2/p}, & p \in (0, \infty), \\ \dfrac{C\,y_i\,|y|_2}{|y|_\infty}, & p = \infty, \end{cases} \qquad 1 \le i \le d,$$
such that $x \in \mathcal{X}$. By utilizing the vector-valued functions $f(\cdot)$ and $g(\cdot)$ defined above, we can employ the Euclidean ball constrained sampling algorithm, as described in Section 3.1, to tackle the norm-constrained domain $\mathcal{X} = \{x \in \mathbb{R}^d : |x|_p \le C\}$. The computational process is outlined in Algorithm 3, where
$$\pi_{B(0,1)}(x) = \frac{\exp\{-U(x)\}\, I\{x \in B(0, 1)\}}{\int_{B(0,1)} \exp\{-U(y)\}\,\mathrm{d}y}$$
with the potential function $U(\cdot)$.
Algorithm 3 The MALA for norm-constrained domain
Input: a sample $x_0 \in \mathcal{X}$ from an initial distribution $P_0$, the step size $h$
  •   for $k = 0, 1, 2, \ldots, K-1$ do
  •       Transformation step: $y_k \leftarrow f(x_k)$
  •       Proposal step: $z_{k+1} \leftarrow y_k - h\nabla U(y_k) + \xi$, where $\xi \sim N(0, 2hI_d)$
  •       Accept-reject step:
  •       if $z_{k+1} \in B(0, 1)$ then
  •           compute $\alpha_{k+1} = \min\left\{1, \dfrac{\phi_h(y_k \mid z_{k+1})\,\pi_{B(0,1)}(z_{k+1})}{\phi_h(z_{k+1} \mid y_k)\,\pi_{B(0,1)}(y_k)}\right\}$
  •           sample $u_{k+1}$ from the uniform distribution on $[0, 1]$
  •           if $\alpha_{k+1} \ge u_{k+1}$, then $y_{k+1} \leftarrow z_{k+1}$
  •           else $y_{k+1} \leftarrow y_k$
  •           end if
  •       else $y_{k+1} \leftarrow y_k$
  •       end if
  •       Transformation step: $x_{k+1} \leftarrow g(y_{k+1})$
  •   end for
Output: $x_1, x_2, \ldots, x_K$
Compared with Algorithm 2, Algorithm 3 achieves the $\mathcal{X} \to B(0, 1) \to \mathcal{X}$ transformation by incorporating two transformation steps, thereby addressing the norm-constrained sampling problems; a sketch of the two transforms follows. The main purpose of this approach is to facilitate theoretical analysis by leveraging the well-understood properties of the boundary of the Euclidean ball, as opposed to the boundary of the norm-constrained domain; see Appendix B.7 for details.
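In the Python transcription below, the vectorized implementation and the zero-vector guard in the $p = \infty$ branch are our own choices.

```python
import numpy as np

def f_to_ball(x, p, C):
    """Map x with |x|_p <= C into the unit Euclidean ball B(0, 1)."""
    if np.isinf(p):
        sup = np.max(np.abs(x))
        return x * sup / (C * np.linalg.norm(x)) if sup > 0 else x
    return np.sign(x) * np.abs(x) ** (p / 2) / C ** (p / 2)

def g_from_ball(y, p, C):
    """Inverse map from B(0, 1) back to the L_p ball of radius C."""
    if np.isinf(p):
        sup = np.max(np.abs(y))
        return C * y * np.linalg.norm(y) / sup if sup > 0 else y
    return C * np.sign(y) * np.abs(y) ** (2 / p)

# Round-trip check for p = 1, C = 1
x = np.array([0.3, -0.4, 0.1])
y = f_to_ball(x, p=1, C=1.0)
assert np.linalg.norm(y) <= 1.0
assert np.allclose(g_from_ball(y, p=1, C=1.0), x)
```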

3.3. Constrained Langevin Algorithm via an Approximation of the Indicator Function

We proceed to discuss constrained sampling for more general constraint regions. Given $\mathcal{X} \subseteq \mathbb{R}^d$, define
$$\iota_{\mathcal{X}}(x) := -\log\{I(x \in \mathcal{X})\} = \begin{cases} 0, & \text{if } x \in \mathcal{X}, \\ \infty, & \text{if } x \notin \mathcal{X}, \end{cases} \tag{4}$$
for any $x \in \mathbb{R}^d$. Then, the target distribution $\Pi^*$ with density defined as (3) can be reformulated as
$$\pi^*(x) = \frac{\exp\{-V_{\mathcal{X}}(x)\}}{\int_{\mathcal{X}} \exp\{-V_{\mathcal{X}}(y)\}\,\mathrm{d}y} \tag{5}$$
with the potential function $V_{\mathcal{X}} : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ satisfying
$$V_{\mathcal{X}}(\cdot) = U(\cdot) + \iota_{\mathcal{X}}(\cdot), \tag{6}$$
where $\iota_{\mathcal{X}}(\cdot)$ is defined in (4). Notice that $\iota_{\mathcal{X}}(\cdot)$ is a convex function on $\mathbb{R}^d$. Under Assumption 1, we then know that the potential function $V_{\mathcal{X}}(\cdot)$ is strongly convex on $\mathbb{R}^d$. By this transformation, the problem of constrained sampling is apparently converted into an unconstrained counterpart. However, the non-differentiability of the function $V_{\mathcal{X}}(\cdot)$ on the boundary of $\mathcal{X}$ poses a challenge when applying gradient-based unconstrained sampling algorithms. To address this issue, we can approximate the function $\iota_{\mathcal{X}}(\cdot)$ by a differentiable function such as the Moreau–Yosida (MY) envelope [35]. The MY envelope of $\iota_{\mathcal{X}}(\cdot)$ is defined as
$$\iota_{\mathcal{X}}^{\lambda}(x) = \inf_{y \in \mathbb{R}^d}\{\iota_{\mathcal{X}}(y) + (2\lambda)^{-1}|x - y|_2^2\} = (2\lambda)^{-1}|x - \mathrm{Pro}_{\mathcal{X}}(x)|_2^2 \tag{7}$$
for any $x \in \mathbb{R}^d$, where $\lambda > 0$ is a regularization parameter and $\mathrm{Pro}_{\mathcal{X}}(\cdot)$ is the projection function onto $\mathcal{X}$. By [35], the function $\iota_{\mathcal{X}}^{\lambda}(\cdot)$ is convex and continuously differentiable with the gradient
$$\nabla\iota_{\mathcal{X}}^{\lambda}(x) = \lambda^{-1}\{x - \mathrm{Pro}_{\mathcal{X}}(x)\} \tag{8}$$
for any $x \in \mathbb{R}^d$, and it holds that
$$|\nabla\iota_{\mathcal{X}}^{\lambda}(x) - \nabla\iota_{\mathcal{X}}^{\lambda}(y)|_2 \le \lambda^{-1}|x - y|_2$$
for any $x, y \in \mathbb{R}^d$. Then the approximation of $V_{\mathcal{X}}(\cdot)$ defined as (6) can be given by
$$V_{\mathcal{X}}^{\lambda}(\cdot) = U(\cdot) + \iota_{\mathcal{X}}^{\lambda}(\cdot), \tag{10}$$
which is continuously differentiable, smooth, and strongly convex on $\mathbb{R}^d$ if $U(\cdot)$ satisfies Assumption 1. Define the distribution $\Pi^{*,\lambda}$ with density
$$\pi^{*,\lambda}(x) = \frac{\exp\{-V_{\mathcal{X}}^{\lambda}(x)\}}{\int_{\mathbb{R}^d} \exp\{-V_{\mathcal{X}}^{\lambda}(y)\}\,\mathrm{d}y}. \tag{11}$$
Recall that the target distribution $\Pi^*$ has the reformulated density defined as (5). As discussed in [35], under some mild conditions including Assumptions 1 and 2, the approximation error between $\Pi^*$ and $\Pi^{*,\lambda}$ in total variation distance can be made arbitrarily small by adjusting the regularization parameter $\lambda$. Therefore, we can utilize gradient-based unconstrained sampling algorithms, such as the MALA presented in Algorithm 1, to construct an appropriate Markov chain whose stationary distribution is close to $\Pi^*$; see Algorithm 4 for details, where $\phi_h^{\lambda}(\cdot|x)$ denotes the probability density function of the Gaussian distribution $N\big(x - h\{\nabla U(x) + \nabla\iota_{\mathcal{X}}^{\lambda}(x)\}, 2hI_d\big)$ with $\nabla\iota_{\mathcal{X}}^{\lambda}(\cdot)$ defined as (8).
Algorithm 4 The MALA for convex constrained domain
Input: a sample $x_0 \in \mathbb{R}^d$ from an initial distribution $P_0$, the step size $h$
  •   for $k = 0, 1, 2, \ldots, K-1$ do
  •       Proposal step: $y_{k+1} \leftarrow x_k - h\{\nabla U(x_k) + \nabla\iota_{\mathcal{X}}^{\lambda}(x_k)\} + \xi$, where $\xi \sim N(0, 2hI_d)$
  •       Accept-reject step:
  •       compute $\alpha_{k+1} = \min\left\{1, \dfrac{\phi_h^{\lambda}(x_k \mid y_{k+1})\,\pi^{*,\lambda}(y_{k+1})}{\phi_h^{\lambda}(y_{k+1} \mid x_k)\,\pi^{*,\lambda}(x_k)}\right\}$
  •       sample $u_{k+1}$ from the uniform distribution on $[0, 1]$
  •       if $\alpha_{k+1} \ge u_{k+1}$, then $x_{k+1} \leftarrow y_{k+1}$
  •       else $x_{k+1} \leftarrow x_k$
  •       end if
  •   end for
Output: $x_1, x_2, \ldots, x_K$
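For intuition, here is a hedged Python sketch of one step of Algorithm 4. Only the envelope gradient (8) and the drift $\nabla U + \nabla\iota_{\mathcal{X}}^{\lambda}$ come from the text; the box-shaped convex body, its projection, and the helper names are illustrative assumptions.

```python
import numpy as np

def grad_iota_lam(x, proj_X, lam):
    """Gradient (8) of the Moreau-Yosida envelope: (x - Pro_X(x)) / lambda."""
    return (x - proj_X(x)) / lam

def proj_box(x, lo, hi):
    """Projection onto the box [lo, hi]^d, one concrete convex body X."""
    return np.clip(x, lo, hi)

def mala_my_step(x, grad_U, log_pi_lam, h, lam, proj_X, rng):
    """One step of Algorithm 4: MALA on the smoothed potential
    V^lambda = U + iota^lambda; log_pi_lam must return -V^lambda."""
    d = x.shape[0]
    drift_x = grad_U(x) + grad_iota_lam(x, proj_X, lam)
    y = x - h * drift_x + np.sqrt(2 * h) * rng.standard_normal(d)
    drift_y = grad_U(y) + grad_iota_lam(y, proj_X, lam)
    log_fwd = -np.sum((y - x + h * drift_x) ** 2) / (4 * h)
    log_bwd = -np.sum((x - y + h * drift_y) ** 2) / (4 * h)
    log_alpha = min(0.0, log_pi_lam(y) - log_pi_lam(x) + log_bwd - log_fwd)
    return y if np.log(rng.uniform()) <= log_alpha else x
```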

4. Theoretical Results

In this section, we first analyze the properties of the Markov chains determined by the three constrained sampling algorithms presented in Section 3, and then establish the mixing time bounds of these Markov chains.

4.1. Properties of the Markov Chains

The outputs $\{x_1, \ldots, x_K\}$ of each algorithm presented in Section 3 form a Markov chain, whose properties are established in Propositions 1, 2, and 3, respectively, as below.
Proposition 1.
For $\mathcal{X} = B(x^*, R)$ with some universal constant $R > 0$ and $x^* \in \mathbb{R}^d$, the Markov chain determined by Algorithm 2 is $\Pi^*$-irreducible, smooth, and reversible with respect to the stationary distribution $\Pi^*$ with density $\pi^*$ defined as (3). (The definitions of $\Pi^*$-irreducible, reversible, and smooth Markov chains are deferred to Appendix A.)
Remark 1.
Proposition 1 shows that the Markov chain determined by Algorithm 2 enjoys the same nice properties as the unconstrained MALA, which form the basis for the study of the mixing time bounds of such a Markov chain.
Similar properties hold for the Markov chains determined by Algorithms 3 and 4 as well.
Proposition 2.
For $\mathcal{X} = \{x \in \mathbb{R}^d : |x|_p \le C\}$ with some universal constant $C > 0$, the Markov chain determined by Algorithm 3 is $\Pi^*$-irreducible, smooth, and reversible with respect to the stationary distribution $\Pi^*$ with density $\pi^*$ defined as (3).
Proposition 3.
Under Assumption 2, the Markov chain determined by Algorithm 4 is $\Pi^{*,\lambda}$-irreducible, smooth, and reversible with respect to the distribution $\Pi^{*,\lambda}$ with density $\pi^{*,\lambda}$ defined as (11).

4.2. Mixing Time Bounds of the Markov Chains

For a distribution $\Pi$ supported on $\mathcal{X} \subseteq \mathbb{R}^d$ with density $\pi$, recall that the $\varepsilon$-mixing time with respect to $\Pi$ is defined as (2). A $\beta$-warm initial distribution $P_0$ with density $p_0$ with respect to the distribution $\Pi$ is commonly used for the mixing time analysis; it satisfies
$$\sup_{x \in \mathcal{X}} \frac{p_0(x)}{\pi(x)} \le \beta$$
for some finite constant $\beta > 0$. We say that a Markov chain is $\varsigma$-lazy if at each iteration the chain is forced to stay at the previous state with probability at least $\varsigma$. This is a convenient assumption for the theoretical analysis of the convergence rate, but it is not likely to be used in practice, since the lazy steps slow down the mixing of the Markov chain. Given the definitions above and some Markov chain basics in Appendix A, we can obtain the following results for some well-behaved Markov chains defined on $\{\mathcal{X}, \mathcal{B}(\mathcal{X})\}$.
Lemma 1.
Consider a reversible, $\Pi$-irreducible, $\varsigma$-lazy, and smooth Markov chain defined on $\{\mathcal{X}, \mathcal{B}(\mathcal{X})\}$ with stationary distribution $\Pi$ supported on $\mathcal{X}$. For any error tolerance $\varepsilon \in (0, 1)$ and $\beta$-warm initial distribution $P_0$, the $\varepsilon$-mixing time with respect to $\Pi$ satisfies
$$\tau(\varepsilon; P_0, \Pi) \le \frac{4}{\varsigma} \int_{4/\beta}^{1/\varepsilon^2} \frac{\mathrm{d}v}{v\,\tilde{\Omega}^2(v)},$$
where $\tau(\varepsilon; P_0, \Pi)$ and $\tilde{\Omega}(\cdot)$ are defined, respectively, in (2) and (A4).
Remark 2.
Lemma 1 provides control on the mixing time of a Markov chain on $\mathcal{X}$ in terms of $\tilde{\Omega}(\cdot)$. This result can be seen as an extension of Lemma 3 in [33] to the case of a Markov chain defined on $\{\mathcal{X}, \mathcal{B}(\mathcal{X})\}$. We can then readily derive the mixing time bound if a lower bound for $\tilde{\Omega}(\cdot)$ is known.
The following lemma gives a lower bound for $\Omega(\cdot)$.
Lemma 2.
Assume that the distribution $\Pi$ supported on $\mathcal{X}$ with density $\pi$ satisfies the log-isoperimetry inequality defined as (A1) for some constant $\hat{c} > 0$. If a reversible Markov chain with stationary distribution $\Pi$ satisfies $\sup_{x, y \in \mathcal{X}:\, |x - y|_2 \le \Delta} \|\mathcal{T}_x - \mathcal{T}_y\|_{\mathrm{TV}} \le 1 - \delta$ for some $\delta \in (0, 1)$ and $\Delta > 0$, then it holds that
$$\Omega(v) \ge \frac{\delta}{4} \min\left\{1, \frac{\Delta}{4\hat{c}} \log^{1/2}\left(1 + \frac{1}{v}\right)\right\}$$
for any $v \in (0, 1/2]$, where $\mathcal{T}_x$ is the one-step transition distribution of this Markov chain at $x \in \mathcal{X}$ and $\Omega(\cdot)$ is the conductance profile of this Markov chain defined in (A3).
Remark 3.
Lemma 2 states a lower bound for the conductance profile of a Markov chain on $\mathcal{X}$. Similar results can be found in [33,39,40]. Lemma 2, together with Lemma 1, provides a general framework for obtaining the mixing time bound of a well-behaved Markov chain on $\mathcal{X}$.
Based on Lemmas 1 and 2, we can derive upper bounds for the $\varepsilon$-mixing times of the Markov chains determined by the three constrained sampling algorithms presented in Section 3.
Theorem 1.
For $\mathcal{X} = B(x^*, R)$ with some universal constant $R > 0$ and $x^* \in \mathbb{R}^d$, let Assumption 1 hold with $L^{3/8} R^{3/4} \ge 16 d^{-1/2} + 8$ and $L^{15/8} m^{-2} R^{1/4} \ge 12 d$. Given a $\beta$-warm initial distribution $P_0$ and an error tolerance $\varepsilon \in (0, 1)$, the Markov chain determined by Algorithm 2 satisfies
$$\tau(\varepsilon; P_0, \Pi^*) = O\left(\frac{L^{7/4} R^{3/2} d}{m} \log\left(\frac{\log\beta}{\varepsilon}\right)\right)$$
for any step size $h$ satisfying
$$\frac{1}{L^{7/4} R^{3/2} d} \le h \le \min\left\{\frac{R^2 (1 - \tilde{c})^2}{4\{\log^{1/2}(16/u) + \sqrt{d}\}^2},\ \frac{u}{4\sqrt{3}\, L^{3/2} R},\ \frac{u}{128 L \{\log^{1/2}(16/u) + \sqrt{d}\}^2}\right\}$$
with $\tilde{c} = \{1 + (L^{-7/2} R^{-3} d^{-2} - 2 L^{-11/4} R^{-3/2} d^{-1})\, m^2\}^{1/2}$ and some constant $u \in (1/2, 1)$, where $\Pi^*$ is the distribution with density $\pi^*$ defined as (3).
Remark 4.
Theorem 1 presents a sharp mixing time bound for Algorithm 2 with a $\beta$-warm initial distribution as $\tilde{O}\{d \log(1/\varepsilon)\}$, up to $\beta$ and the constants $L$, $m$, $R$ specified in Assumptions 1 and 2. This result improves upon the previously known mixing time bounds for constrained sampling algorithms in [34,35,36]; see Table 1 for details.
For sampling from the norm-constrained domain $\mathcal{X} = \{x \in \mathbb{R}^d : |x|_p \le C\}$ with some universal constant $C > 0$, we transform the problem into sampling from the Euclidean ball $B(0, 1)$ as shown in Algorithm 3; a similar result then holds for the Markov chain determined by Algorithm 3 as well.
Corollary 1.
For $\mathcal{X} = \{x \in \mathbb{R}^d : |x|_p \le C\}$ with some universal constant $C > 0$, let Assumption 1 hold with $L^{3/8} \ge 16 d^{-1/2} + 8$ and $L^{15/8} m^{-2} \ge 12 d$. Given a $\beta$-warm initial distribution $P_0$ and an error tolerance $\varepsilon \in (0, 1)$, the Markov chain determined by Algorithm 3 satisfies
$$\tau(\varepsilon; P_0, \Pi^*) = O\left(\frac{L^{7/4} d}{m} \log\left(\frac{\log\beta}{\varepsilon}\right)\right)$$
for any step size $h$ satisfying
$$\frac{1}{L^{7/4} d} \le h \le \min\left\{\frac{(1 - \bar{c})^2}{4\{\log^{1/2}(16/u) + \sqrt{d}\}^2},\ \frac{u}{4\sqrt{3}\, L^{3/2}},\ \frac{u}{128 L \{\log^{1/2}(16/u) + \sqrt{d}\}^2}\right\}$$
with $\bar{c} = \{1 + (L^{-7/2} d^{-2} - 2 L^{-11/4} d^{-1})\, m^2\}^{1/2}$ and some constant $u \in (1/2, 1)$, where $\Pi^*$ is the distribution with density $\pi^*$ defined as (3).
For the Markov chain determined by Algorithm 4, we can also derive a sharp mixing time bound by combining the mixing time analysis for sampling from log-concave distributions without constraints in [33] with the approximation error between $\Pi^*$ and $\Pi^{*,\lambda}$ established in [35].
Theorem 2.
Let Assumptions 1 and 2 hold, and assume that there exists a universal constant $\tilde{C} > 0$ such that $\exp\{\inf_{x \in \mathcal{X}^c} U(x) - \sup_{x \in \mathcal{X}} U(x)\} \le \tilde{C}$. Given the initial distribution $P_0 = N\{x^{\dagger}, (L + \lambda^{-1})^{-1} I_d\}$ with $x^{\dagger} = \arg\min_{x \in \mathbb{R}^d} V_{\mathcal{X}}^{\lambda}(x)$ and an error tolerance $\varepsilon \in (0, 1)$, the Markov chain determined by Algorithm 4 satisfies
$$\tau(\varepsilon; P_0, \Pi^*) = O\left(\frac{(L + \lambda^{-1})\, d}{m} \log\left(\frac{d}{\varepsilon}\right) \cdot \max\left\{1, \sqrt{\frac{L + \lambda^{-1}}{d\, m}}\right\}\right)$$
for the step size $h$ satisfying
$$h = \frac{c}{(L + \lambda^{-1})\, d} \cdot \max\left\{1, \sqrt{\frac{L + \lambda^{-1}}{d\, m}}\right\}^{-1}$$
with some universal constant $c > 0$, where $V_{\mathcal{X}}^{\lambda}(\cdot)$ is defined as in (10) with $\lambda := \lambda^{\star} = 8 \pi^{-1} \varepsilon^2 r^2 d^{-2} \tilde{C}^{-2}$, and $\Pi^*$ is the distribution with density $\pi^*$ defined as (3).
Remark 5.
Theorem 2 presents a mixing time bound for Algorithm 4 with a feasible initial distribution as $O\{d^3 \varepsilon^{-2} \log(d/\varepsilon)\}$, up to the constants $L$, $m$, $r$ specified in Assumptions 1 and 2, if we choose the regularization parameter $\lambda = \lambda^{\star}$. This result improves upon the mixing time bound for the constrained sampling algorithm without the Metropolis–Hastings step in [35]; see Table 1 for details.

5. Numerical Experiments

In this section, we conduct numerical experiments to validate the theoretical properties derived in Section 4 and compare the constrained sampling algorithms presented in Section 3 with three competing MCMC algorithms for sampling from constrained log-concave distributions, listed in Table 1, under various simulation settings. The implementation of these algorithms involves the selection of a step size. For Algorithms 2 and 3, we follow Theorem 1 and Corollary 1, respectively, to select the step size. For Algorithm 4, we choose the step size as in [32] for the MALA for sampling from log-concave distributions without constraints. The step size choices of the other three MCMC algorithms follow the recommendations in the associated papers; see Table 2 for details.

5.1. Sampling from the Euclidean Ball Constrained Domain

We consider the problem of sampling from a truncated multivariate Gaussian distribution on $\mathcal{X}$, which admits the density
$$\pi^*(x) \propto \exp\left\{-\frac{(x - \mu)^{\mathrm{T}} \Sigma^{-1} (x - \mu)}{2}\right\} I(x \in \mathcal{X}),$$
where the mean $\mu = 0$ and the covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ is a diagonal matrix with $\lambda_{\max}(\Sigma) = 10$ and $\lambda_{\min}(\Sigma) = 1$. For this target distribution, the potential function $U(\cdot)$ and its derivatives are given as $U(x) = 2^{-1} x^{\mathrm{T}} \Sigma^{-1} x$, $\nabla U(x) = \Sigma^{-1} x$, and $\nabla^2 U(x) = \Sigma^{-1}$. Therefore, the function $U(\cdot)$ is smooth with parameter $L = \lambda_{\min}^{-1}(\Sigma)$ and strongly convex with parameter $m = \lambda_{\max}^{-1}(\Sigma)$ on $\mathbb{R}^d$. We select $\mathcal{X} = B(0, R)$ with $R = 5$ and the initial distribution $P_0 = N_{\mathcal{X}}\{0, (2L)^{-1} I_d\}$, and use the inverse transformation algorithm [14] to generate an initial point from $P_0$. We compare Algorithm 2 with the three sampling algorithms from the literature given in Table 2, and follow the recommendations in the associated papers to choose the initial points of these three algorithms.
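The setup above can be coded directly; the following sketch is our own, with the unspecified interior eigenvalues of $\Sigma$ filled in by a linear interpolation, which is an assumption rather than the paper's choice.

```python
import numpy as np

d, R = 10, 5.0
# Diagonal covariance with lambda_max(Sigma) = 10 and lambda_min(Sigma) = 1;
# the remaining eigenvalues are interpolated linearly (our own choice).
Sigma_diag = np.linspace(10.0, 1.0, d)
Sigma_inv_diag = 1.0 / Sigma_diag

grad_U = lambda x: Sigma_inv_diag * x            # grad U(x) = Sigma^{-1} x
log_pi = lambda x: -0.5 * np.sum(Sigma_inv_diag * x ** 2)
L = 1.0 / Sigma_diag.min()                       # smoothness parameter
m = 1.0 / Sigma_diag.max()                       # strong convexity parameter
# These ingredients plug into the ball-constrained step of Section 3.1 with
# center 0 and radius R = 5.
```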

5.1.1. The Trace Graphs of Sampling Algorithms

To initiate a preliminary assessment of the convergence properties of these algorithms, we commence with simple sample trace plots. Write $x = (x_1, \ldots, x_d)^{\mathrm{T}} \in \mathbb{R}^d$ and $\mu = (\mu_1, \ldots, \mu_d)^{\mathrm{T}} \in \mathbb{R}^d$. Figure 1 depicts the traces of $x_1$ for the Markov chains determined by the four sampling algorithms under dimension $d = 10$. Evidently, in comparison to the other three algorithms, Algorithm 2 exhibits notably faster mixing, as evidenced by its trace consistently remaining around the mean $\mu_1 = 0$. Conversely, the traces of the other three sampling algorithms exhibit greater fluctuations and deviate more from $\mu_1 = 0$.
Figure 2 illustrates the histograms and densities corresponding to these traces of $x_1$. Similarly, it is evident that Algorithm 2 achieves sample means closer to $\mu_1 = 0$, along with the smallest variance. Conversely, the sample means obtained from the other three sampling algorithms exhibit a certain degree of deviation from $\mu_1 = 0$, accompanied by heavier tails.

5.1.2. Dimension and Error Dependence of Algorithm 2

The goal of this simulation is to demonstrate that the dimension and error tolerance dependence of the mixing time bound for Algorithm 2 both conform to the theoretical results shown in Theorem 1.
Since the total variation distance between continuous measures is hard to estimate, we use the error in quantiles along some direction for convergence diagnostics in the experiments. In the spirit of [33], we measure the error in the $95\%$ quantile between the sample distribution and the true distribution along the direction of the eigenvector of $\Sigma$ corresponding to $\lambda_{\min}(\Sigma)$. The approximate mixing time $\hat{k}_{\mathrm{mix}}(\varepsilon)$ is then defined as the smallest iteration $k$ at which this error between the distribution of the Markov chain at iteration $k$ and the target distribution falls below the error tolerance $\varepsilon$. We simulate 20 independent runs of the Markov chain of each algorithm with $N = 20{,}000$ samples per run to determine the approximate mixing time $\hat{k}_{\mathrm{mix}}(\varepsilon)$, and the final $\hat{k}_{\mathrm{mix}}(\varepsilon)$ is the average over these 20 independent runs.
Figure 3a shows the dependence of the approximate mixing time $\hat{k}_{\mathrm{mix}}(0.2)$ on the dimension $d$ for Algorithm 2. By the linear regression of $\hat{k}_{\mathrm{mix}}(0.2)$ with respect to $d$, we conclude that the mixing time of Algorithm 2 is linear in $d$, with slope $4.137$ and R-squared $0.991$. Figure 3b presents the dependence of the approximate mixing time $\hat{k}_{\mathrm{mix}}(\varepsilon)$ on the inverse error tolerance $\varepsilon^{-1}$ for Algorithm 2 under $d = 4$. The linear regression of $\hat{k}_{\mathrm{mix}}(\varepsilon)$ with respect to $\log(\varepsilon^{-1})$ suggests that the mixing time of Algorithm 2 is linear in $\log(\varepsilon^{-1})$, with slope $15.854$ and R-squared $0.994$, which is consistent with the theoretical results given in Theorem 1.
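A sketch of this diagnostic is given below; the helper name and the estimation of the iteration-$k$ quantile across parallel runs are our own simplifications of the procedure described above.

```python
import numpy as np

def approx_mixing_time(chains, true_q95, eps, direction):
    """Approximate mixing time from parallel runs of a sampler.

    chains    : array of shape (n_runs, n_iters, d), one chain per run
    true_q95  : true 95% quantile of the target along `direction`
    direction : unit vector, e.g., the eigenvector of Sigma for lambda_min
    """
    proj = chains @ direction  # shape (n_runs, n_iters)
    for k in range(proj.shape[1]):
        # Empirical 95% quantile of the distribution at iteration k,
        # estimated across the independent runs.
        q95_k = np.quantile(proj[:, k], 0.95)
        if abs(q95_k - true_q95) < eps:
            return k
    return proj.shape[1]  # did not mix within the simulated horizon
```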

5.1.3. Comparison with Competitive Algorithms

Figure 4a shows the dependence of the approximate mixing time $\hat{k}_{\mathrm{mix}}(0.2)$ on the problem dimension $d$ for the four sampling algorithms. Compared with the other three algorithms, the approximate mixing time of Algorithm 2 appears more robust to dimension. When $d$ is small, the approximate mixing times of the four algorithms are comparatively close. However, as the dimension $d$ increases, the approximate mixing times of PLMC and MYULA increase rapidly, showing a polynomial order with respect to $d$. Moreover, the dimension dependence of MLD and of Algorithm 2 both indicate a linear growth trend, and MLD needs a few more steps than Algorithm 2 to reach the same error tolerance.
Figure 4b presents the dependence of the approximate mixing time $\hat{k}_{\mathrm{mix}}(\varepsilon)$ on the inverse error tolerance $\varepsilon^{-1}$ for the four sampling algorithms under $d = 4$. The regression analysis shows that the approximate mixing time $\hat{k}_{\mathrm{mix}}(\varepsilon)$ of PLMC and MYULA increases in polynomial order of $\varepsilon^{-1}$. When $\varepsilon^{-1}$ is relatively small, MLD and Algorithm 2 have similar approximate mixing times. With the increase in $\varepsilon^{-1}$, the strength of Algorithm 2 becomes more significant. For MLD, the linear regression of the approximate mixing time $\hat{k}_{\mathrm{mix}}(\varepsilon)$ with respect to $\varepsilon^{-2}$ yields a slope of $1.934$ and R-squared $0.984$, suggesting an error tolerance dependence of order $\varepsilon^{-2}$.
It is noteworthy that the above analysis not only suggests significantly better dimension and error tolerance dependence of the constrained MALA but also partly verifies the theoretical convergence rates of the three methods for comparison.

5.2. Bayesian Regularized Regression

Regularized regression involves adding a penalty term to the objective function of the regression model, which helps to control the complexity of the model and prevent it from fitting the noise in the data. In this section, we validate the effectiveness of Algorithm 3 for constrained sampling in Bayesian regularized regression.
Given independent and identically distributed observations $y = (y_1, y_2, \ldots, y_n)^{\mathrm{T}} \in \mathbb{R}^n$ following a Gaussian distribution with mean $X\beta$ and covariance matrix $\sigma^2 I_n$, we consider regression models where the parameters are obtained by minimizing the squared Euclidean norm of the residual subject to a norm constraint on the regression parameter:
$$\min_{\beta \in \mathbb{R}^d} |y - X\beta|_2^2 \quad \text{subject to} \quad |\beta|_p \le C$$
for some universal constant $C > 0$, where $X \in \mathbb{R}^{n \times d}$ is the design matrix, $\beta \in \mathbb{R}^d$ is the regression parameter, and $|\beta|_p$ is the $L_p$-norm of $\beta$. In the Bayesian setting, many regularization techniques correspond to imposing certain prior distributions on the model parameters. We then consider sampling from the distribution with density
$$\pi^*(\beta) \propto \exp\left\{-\frac{|y - X\beta|_2^2}{2\sigma^2}\right\} I(\beta \in \mathcal{X}),$$
and obtaining the parameter estimates $\hat{\beta}$ via the maximum a posteriori probability (MAP) estimate, where $\mathcal{X} = \{\beta \in \mathbb{R}^d : |\beta|_p \le C\}$. We use the diabetes data studied in [41], and set the burn-in period to $10^3$ iterations and $\sigma^2 = 1$. Figure 5 presents the paths of the parameter estimates under different norm constraints, which demonstrate that Algorithm 3 can effectively handle norm-constrained sampling problems.

5.3. Truncated Multivariate Gaussian Distribution

The final comparison examines the sampling performance of MYULA [35] and Algorithm 4 in the setting of a more general truncated multivariate Gaussian distribution. We consider the same setup as in [35]. Specifically, the density of the target distribution is defined as follows:
$$\pi^*(x) \propto \exp\left\{-\frac{(x - \mu)^{\mathrm{T}} \Sigma^{-1} (x - \mu)}{2}\right\} I(x \in \mathcal{X}),$$
where $\mathcal{X}$ is a convex set with the origin $0$ on its boundary. Let $\mu = 0$, let the covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ have $(i, j)$-th element $(\Sigma)_{i,j} = 1/(1 + |i - j|)$, and let $\mathcal{X} = [0, 5] \times [0, 1]$. We generate $10^6$ samples for Algorithm 4 and set the burn-in period to the initial $10\%$ of iterations.
Table 3 presents the mean and covariance estimation results for the target distribution based on the samples generated by MYULA and Algorithm 4. For comparison purposes, the results of MYULA align with those reported in [35]. With the same number of iterations, Algorithm 4 outperforms MYULA in terms of the estimation results. This indicates that incorporating the Metropolis–Hastings step in Algorithm 4 leads to improvements in mixing.

6. Discussion and Conclusions

In this article, we propose three sampling algorithms based on Langevin Monte Carlo with Metropolis–Hastings steps to handle distributions constrained within some convex body, and establish the mixing time bounds of these algorithms for sampling from strongly log-concave distributions. Under certain conditions, these bounds are sharper than those of existing algorithms in the literature. Furthermore, in comparison to existing algorithms, the suggested constrained sampling algorithms are simpler, more intuitive, and easier to implement in some cases.
Our results demonstrate that the sampling algorithm, enhanced with the Metropolis–Hastings step, offers an effective solution for tackling some constrained sampling problems. Numerical experiments fully illustrate the advantages of the proposed algorithms. Although we focus on strongly log-concave distributions in the theoretical analysis, the proposed algorithms can be readily applied to weakly log-concave distributions or non-convex potential functions. At the same time, we recognize that various aspects of the sampling algorithms can be further improved. For instance, potential enhancements could involve multiple importance sampling methods or adaptive techniques. We leave the investigation of the theoretical properties under such scenarios for future work.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Some Markov Chain Basics

Consider time-homogeneous Markov chains defined on a measurable state space $\{\mathcal{X}, \mathcal{B}(\mathcal{X})\}$ with a transition probability $\Psi : \mathcal{X} \times \mathcal{B}(\mathcal{X}) \to [0, 1]$. (We say that a Markov chain is time-homogeneous if the probability of any state transition is independent of time.) The transition probability satisfies
$$\Psi(x, \mathrm{d}y) \ge 0 \quad \text{for all } x \in \mathcal{X}, \qquad \text{and} \qquad \int_{y \in \mathcal{X}} \Psi(x, \mathrm{d}y) = 1.$$
The $k$-th step transition probability is defined recursively as
$$\Psi^k(x, \mathrm{d}y) = \int_{z \in \mathcal{X}} \Psi^{k-1}(x, \mathrm{d}z)\, \Psi(z, \mathrm{d}y).$$
For a distribution $\Pi$ on $\mathcal{X}$, a Markov chain defined on $\{\mathcal{X}, \mathcal{B}(\mathcal{X})\}$ is called $\Pi$-irreducible if for each $A \in \mathcal{B}(\mathcal{X})$ with $\Pi(A) > 0$ and each $x \in \mathcal{X}$, there exists $k \in \mathbb{N}$ such that $\Psi^k(x, A) > 0$. A Markov chain defined on $\{\mathcal{X}, \mathcal{B}(\mathcal{X})\}$ with transition probability $\Psi : \mathcal{X} \times \mathcal{B}(\mathcal{X}) \to [0, 1]$ and stationary distribution $\Pi$ is called reversible if it satisfies the detailed balance condition $\Pi(\mathrm{d}x)\Psi(x, \mathrm{d}y) = \Pi(\mathrm{d}y)\Psi(y, \mathrm{d}x)$ for any $x, y \in \mathcal{X}$.
Smooth chain assumption. We say that a Markov chain satisfies the smooth chain condition if its transition probability $\Psi : \mathcal{X} \times \mathcal{B}(\mathcal{X}) \to [0, 1]$ can be expressed in the form
$$\Psi(x, \mathrm{d}y) = \psi(x, y)\,\mathrm{d}y + \iota_x\, \delta_x(\mathrm{d}y)$$
for any $x, y \in \mathcal{X}$, where $\psi(\cdot, \cdot)$ is a transition kernel satisfying $\psi(x, y) \ge 0$ for any $x, y \in \mathcal{X}$, $\iota_x$ denotes the one-step probability of the chain staying at its current state $x$, and $\delta_x(\cdot)$ is the Dirac delta function at $x$.
Log-isoperimetric inequality. A distribution $\Pi$ supported on $\mathcal{X}$ with density $\pi$ is said to satisfy the log-isoperimetry inequality with some constant $\hat{c} > 0$ if
$$\Pi(S_3) \ge \frac{d(S_1, S_2)}{2\hat{c}} \min\{\Pi(S_1), \Pi(S_2)\}\, \log^{1/2}\left(1 + \frac{1}{\min\{\Pi(S_1), \Pi(S_2)\}}\right) \tag{A1}$$
for any partition $(S_1, S_2, S_3)$ of $\mathcal{X}$, where $\Pi(S_i) = \int_{S_i} \pi(x)\,\mathrm{d}x$ and $d(S_1, S_2) = \inf_{x \in S_1,\, y \in S_2} |x - y|_2$.
Conductance profile. Given a Markov chain with transition probability $\Psi : \mathcal{X} \times \mathcal{B}(\mathcal{X}) \to [0, 1]$ and stationary distribution $\Pi$ with density $\pi$, its stationary flow $\omega(\cdot) : \mathcal{B}(\mathcal{X}) \to \mathbb{R}$ is defined as
$$\omega(S) = \int_S \Psi(x, S^c)\, \pi(x)\,\mathrm{d}x \tag{A2}$$
for any $S \in \mathcal{B}(\mathcal{X})$. For any $v \in (0, 1/2]$, the conductance profile is given by
$$\Omega(v) = \inf_{S :\, \Pi(S) \in (0, v]} \frac{\omega(S)}{\Pi(S)}. \tag{A3}$$
Furthermore, the extended conductance profile is defined as
$$\tilde{\Omega}(v) = \begin{cases} \Omega(v), & v \in (0, 1/2], \\ \Omega(1/2), & v \in (1/2, \infty). \end{cases} \tag{A4}$$

Appendix B. Proofs

Appendix B.1. Proof of Proposition 1

Proof of Proposition 1.
Denote by $\Psi(x, \cdot)$ the transition probability at $x \in \mathcal{X}$ of the Markov chain determined by Algorithm 2. For any $x \in \mathcal{X}$, let $P_{x,h} = N\{x - h\nabla U(x), 2hI_d\}$ with step size $h$, and write the density of $P_{x,h}$ as $\phi_h(\cdot|x)$. For any $x \in \mathcal{X}$, denote by $\alpha_x(y) = \min\{1, R_x(y)\}$ the acceptance probability for any $y \in \mathbb{R}^d$, where
$$R_x(y) = \frac{\pi^*(y)\,\phi_h(x|y)}{\pi^*(x)\,\phi_h(y|x)}\, I(y \in \mathcal{X}).$$
Then, the transition probability of the associated Markov chain at $x \in \mathcal{X}$ has a probability mass $\psi_x = 1 - \int_{\mathcal{X}} \phi_h(y|x)\,\alpha_x(y)\,\mathrm{d}y$. Define the transition kernel
$$\psi(x, y) = \phi_h(y|x)\,\alpha_x(y)\, I(y \in \mathcal{X} \setminus \{x\})$$
for $x \in \mathcal{X}$. Then, the transition probability $\Psi : \mathcal{X} \times \mathcal{B}(\mathcal{X}) \to [0, 1]$ satisfies
$$\Psi(x, \mathrm{d}y) = \psi_x\, \delta_x(\mathrm{d}y) + \psi(x, y)\,\mathrm{d}y, \tag{A5}$$
where $\delta_x(\cdot)$ is the Dirac delta function at $x$. By the smooth chain condition given in Appendix A, we know the Markov chain with transition probability $\Psi(\cdot, \cdot)$ is smooth.
Recall that $\Pi^*$ is the distribution on $\mathcal{X}$ with density $\pi^*$ defined as (3). Since
$$\alpha_x(y)\,\pi^*(x)\,\phi_h(y|x) = \alpha_y(x)\,\pi^*(y)\,\phi_h(x|y)$$
for any $x, y \in \mathcal{X}$, we have $\pi^*(x)\psi(x, y) = \pi^*(y)\psi(y, x)$ for any $x, y \in \mathcal{X}$. Together with (A5), for any $A, B \in \mathcal{B}(\mathcal{X})$, it holds that
$$\int_A \pi^*(x)\,\Psi(x, B)\,\mathrm{d}x = \int_{A \cap B} \pi^*(x)\,\psi_x\,\mathrm{d}x + \iint_{(x,y) \in A \times B} \pi^*(x)\,\psi(x, y)\,\mathrm{d}x\,\mathrm{d}y = \int_B \pi^*(x)\,\psi_x\,\delta_x(A)\,\mathrm{d}x + \iint_{(x,y) \in A \times B} \pi^*(y)\,\psi(y, x)\,\mathrm{d}x\,\mathrm{d}y = \int_B \pi^*(x)\,\Psi(x, A)\,\mathrm{d}x$$
with $\delta_x(A) = I(x \in A)$, which implies $\Pi^*(A) = \int_A \pi^*(x)\,\Psi(x, \mathcal{X})\,\mathrm{d}x = \int_{\mathcal{X}} \pi^*(x)\,\Psi(x, A)\,\mathrm{d}x$ for any $A \in \mathcal{B}(\mathcal{X})$. Thus, $\Pi^*$ is the stationary distribution of the Markov chain with transition probability $\Psi(\cdot, \cdot)$, and such a Markov chain is reversible.
Furthermore, by (A5), we have
$$\Psi(x, A) = \psi_x\,\delta_x(A) + \int_A \psi(x, y)\,\mathrm{d}y$$
for any $x \in \mathcal{X}$ and $A \in \mathcal{B}(\mathcal{X})$. For any $A \in \mathcal{B}(\mathcal{X})$ with $\Pi^*(A) > 0$, due to $\Pi^*(A) = \int_A \pi^*(x)\,\mathrm{d}x$, we know the Lebesgue measure of $A$ is nonzero. Since $\alpha_x(y) \le 1$ and $\mathcal{X} = B(x^*, R)$ for some universal constant $R > 0$ and $x^* \in \mathbb{R}^d$, we know $\psi_x \ge 1 - \int_{\mathcal{X}} \phi_h(y|x)\,\mathrm{d}y > 0$ for any $x \in \mathcal{X}$. If $A = \{x\}$, then $\Psi(x, A) \ge \psi_x > 0$. If $A \ne \{x\}$, the Lebesgue measure of $A \setminus \{x\}$ is also nonzero, which implies $\Psi(x, A) \ge \int_{A \setminus \{x\}} \psi(x, y)\,\mathrm{d}y > 0$. Thus, the Markov chain with transition probability $\Psi(\cdot, \cdot)$ is $\Pi^*$-irreducible. We complete the proof of Proposition 1. □

Appendix B.2. Proof of Proposition 2

Proof of Proposition 2.
Recall $\mathcal{X} = \{x \in \mathbb{R}^d : |x|_p \le C\}$ for some universal constant $C > 0$. Notice that the two additional steps introduced in Algorithm 3 serve only to establish a one-to-one mapping between $\{x \in \mathbb{R}^d : |x|_p \le C\}$ and $B(0, 1)$, and they do not affect the properties of the Markov chain. Using the same arguments as in the proof of Proposition 1, we obtain the results of Proposition 2. □

Appendix B.3. Proof of Proposition 3

Proof of Proposition 3.
The proof is almost identical to that of Proposition 1. Recall the distribution $\Pi^{*,\lambda}$ with density
$$\pi^{*,\lambda}(x) = \frac{\exp\{-V_{\mathcal{X}}^{\lambda}(x)\}}{\int_{\mathbb{R}^d} \exp\{-V_{\mathcal{X}}^{\lambda}(y)\}\,\mathrm{d}y},$$
where $V_{\mathcal{X}}^{\lambda}(\cdot) = U(\cdot) + \iota_{\mathcal{X}}^{\lambda}(\cdot)$ with $\iota_{\mathcal{X}}^{\lambda}(\cdot)$ defined as (7). Let $\phi_h^{\lambda}(\cdot|x)$ be the probability density function of the Gaussian distribution $N\big(x - h\{\nabla U(x) + \nabla\iota_{\mathcal{X}}^{\lambda}(x)\}, 2hI_d\big)$. We only need to replace $\{\Pi^*, \pi^*, \phi_h(\cdot|x)\}$ in the proof of Proposition 1 by $\{\Pi^{*,\lambda}, \pi^{*,\lambda}, \phi_h^{\lambda}(\cdot|x)\}$, and all the arguments still hold. □

Appendix B.4. Proof of Lemma 1

Proof of Lemma 1.
We first introduce some notation. Denote by $\pi$ the density function of $\Pi$, and by $L^2(\pi)$ the space of square-integrable functions defined on $\mathcal{X}$ under the density $\pi$, that is,
$$\int_{\mathcal{X}} g^2(x)\,\pi(x)\,\mathrm{d}x < \infty$$
for any $g \in L^2(\pi)$. The Dirichlet form $\mathcal{E}_{\Psi} : L^2(\pi) \times L^2(\pi) \to \mathbb{R}$ associated with the transition probability $\Psi(\cdot, \cdot)$ is defined as follows:
$$\mathcal{E}_{\Psi}(g, h) = \frac{1}{2} \iint_{(x,y) \in \mathcal{X}^2} \{g(x) - h(y)\}^2\, \Psi(x, \mathrm{d}y)\,\pi(x)\,\mathrm{d}x.$$
For any $g \in L^2(\pi)$, let
$$\mathbb{E}_{\pi}(g) = \int_{\mathcal{X}} g(x)\,\pi(x)\,\mathrm{d}x \qquad \text{and} \qquad \mathrm{Var}_{\pi}(g) = \int_{\mathcal{X}} \{g(x) - \mathbb{E}_{\pi}(g)\}^2\,\pi(x)\,\mathrm{d}x.$$
For a measurable non-empty subset $S \subseteq \mathcal{X}$, the spectral gap is defined as
$$\lambda(S) = \inf_{g \in c_0^+(S)} \frac{\mathcal{E}_{\Psi}(g, g)}{\mathrm{Var}_{\pi}(g)},$$
where $c_0^+(S) = \{g \in L^2(\pi) : \mathrm{supp}(g) \subseteq S,\ g \ge 0,\ \mathrm{Var}_{\pi}(g) > 0\}$. Define the spectral profile $\Lambda(\cdot)$ as
$$\Lambda(v) = \inf_{S :\, \Pi(S) \in (0, v]} \lambda(S)$$
for any $v \in (0, \infty)$. If the current state of a Markov chain admits the distribution $P$ with density $p$, we write $\mathcal{T}(p)$ for the distribution of its next state. The proof of Lemma 1 includes two steps. The first step is to show
$$\tau(\varepsilon; P_0, \Pi) \le \frac{1}{\varsigma} \int_{4/\beta}^{1/\varepsilon^2} \frac{\mathrm{d}v}{v\,\Lambda(v)}.$$
The second step is to show that the spectral profile and the conductance profile defined in (A3) are related as
$$\Lambda(v) \ge \begin{cases} \dfrac{\Omega^2(v)}{2}, & v \in (0, 1/2], \\[3pt] \dfrac{\Omega^2(1/2)}{4}, & v \in (1/2, \infty). \end{cases}$$
Notice that $\Pi(\mathcal{X}) = 1$. Replacing the restricted conductance profile and restricted spectral gap in the proof of Lemma 1 in [33] by the conductance profile and spectral gap, respectively, and using similar arguments to those in the proof of Lemma 1 in [33], we obtain the results of the two steps, and Lemma 1 then follows immediately. □

Appendix B.5. Proof of Lemma 2

Proof of Lemma 2.
Denote by $\pi$ the density function of the distribution $\Pi$. For any measurable non-empty subset $A_1 \subseteq \mathcal{X}$ such that $0 < \Pi(A_1) \le 1/2$, we have $\Pi(A_2) \ge 1/2 \ge \Pi(A_1)$, where $A_2 = \mathcal{X} \setminus A_1$. Given $\delta > 0$, we define the sets
$$A_1' = \{x \in A_1 : \Psi(x, A_2) < \delta/2\}, \qquad A_2' = \{x \in A_2 : \Psi(x, A_1) < \delta/2\},$$
and $A_3' = \mathcal{X} \setminus (A_1' \cup A_2')$, where $\Psi : \mathcal{X} \times \mathcal{B}(\mathcal{X}) \to [0, 1]$ is the transition probability of the considered Markov chain.
On the one hand, if $\Pi(A_1') \le \Pi(A_1)/2$, then $\Pi(A_1 \setminus A_1') \ge \Pi(A_1)/2$. Thus,
$$\int_{A_1} \Psi(x, A_2)\,\pi(x)\,\mathrm{d}x \ge \int_{A_1 \setminus A_1'} \Psi(x, A_2)\,\pi(x)\,\mathrm{d}x \ge \frac{\delta}{2} \int_{A_1 \setminus A_1'} \pi(x)\,\mathrm{d}x \ge \frac{\delta}{4}\,\Pi(A_1).$$
Similarly, if $\Pi(A_2') \le \Pi(A_2)/2$, we have $\int_{A_2} \Psi(x, A_1)\,\pi(x)\,\mathrm{d}x \ge \delta\,\Pi(A_2)/4$. By the detailed balance condition and Fubini's theorem, it holds that
$$\int_{A_1} \Psi(x, A_2)\,\pi(x)\,\mathrm{d}x = \int_{x \in A_1} \int_{y \in A_2} \Psi(x, \mathrm{d}y)\,\pi(x)\,\mathrm{d}x = \int_{x \in A_1} \int_{y \in A_2} \Psi(y, \mathrm{d}x)\,\pi(y)\,\mathrm{d}y = \int_{A_2} \Psi(y, A_1)\,\pi(y)\,\mathrm{d}y = \int_{A_2} \Psi(x, A_1)\,\pi(x)\,\mathrm{d}x.$$
Therefore, if $\Pi(A_1') \le \Pi(A_1)/2$ or $\Pi(A_2') \le \Pi(A_2)/2$, we have
$$\int_{A_1} \Psi(x, A_2)\,\pi(x)\,\mathrm{d}x \ge \frac{\delta}{4} \min\{\Pi(A_1), \Pi(A_2)\} = \frac{\delta}{4}\,\Pi(A_1).$$
On the other hand, we consider the case with $\Pi(A_1') > \Pi(A_1)/2$ and $\Pi(A_2') > \Pi(A_2)/2$. Notice that $\mathcal{T}_x(\cdot) = \Psi(x, \cdot)$. By the definition of the total variation distance, for any $x \in A_1'$ and $y \in A_2'$, we have
$$\|\mathcal{T}_x - \mathcal{T}_y\|_{\mathrm{TV}} \ge \Psi(x, A_1) - \Psi(y, A_1) = 1 - \Psi(x, A_2) - \Psi(y, A_1) > 1 - \delta.$$
Since $\sup_{x, y \in \mathcal{X}:\, |x - y|_2 \le \Delta} \|\mathcal{T}_x - \mathcal{T}_y\|_{\mathrm{TV}} \le 1 - \delta$, we know $|x - y|_2 > \Delta$ for such $x$ and $y$, which implies $d(A_1', A_2') := \inf_{x \in A_1',\, y \in A_2'} |x - y|_2 \ge \Delta$. Recall $A_3' = \mathcal{X} \setminus (A_1' \cup A_2')$. By (A8),
$$\int_{A_1} \Psi(x, A_2)\,\pi(x)\,\mathrm{d}x = \frac{1}{2} \int_{A_1} \Psi(x, A_2)\,\pi(x)\,\mathrm{d}x + \frac{1}{2} \int_{A_2} \Psi(x, A_1)\,\pi(x)\,\mathrm{d}x \ge \frac{1}{2} \int_{A_1 \setminus A_1'} \Psi(x, A_2)\,\pi(x)\,\mathrm{d}x + \frac{1}{2} \int_{A_2 \setminus A_2'} \Psi(x, A_1)\,\pi(x)\,\mathrm{d}x \ge \frac{\delta}{4}\,\Pi(A_3').$$
Since $\Pi(A_1') > \Pi(A_1)/2$, $\Pi(A_2') > \Pi(A_2)/2$, and the sets $(A_1', A_2', A_3')$ partition $\mathcal{X}$, by the log-isoperimetry inequality given in (A1), it holds that
$$\Pi(A_3') \ge \frac{d(A_1', A_2')}{2\hat{c}} \min\{\Pi(A_1'), \Pi(A_2')\}\, \log^{1/2}\left(1 + \frac{1}{\min\{\Pi(A_1'), \Pi(A_2')\}}\right) \ge \frac{\Delta}{4\hat{c}} \min\{\Pi(A_1), \Pi(A_2)\}\, \log^{1/2}\left(1 + \frac{2}{\min\{\Pi(A_1), \Pi(A_2)\}}\right) \ge \frac{\Delta}{4\hat{c}}\,\Pi(A_1)\, \log^{1/2}\left(1 + \frac{1}{\Pi(A_1)}\right),$$
where the second inequality follows from the fact that $x \mapsto x \log^{1/2}(1 + x^{-1})$ is non-decreasing in $x > 0$. By (A9) and (A10), we have
$$\int_{A_1} \Psi(x, A_2)\,\pi(x)\,\mathrm{d}x \ge \frac{\delta\Delta}{16\hat{c}}\,\Pi(A_1)\, \log^{1/2}\left(1 + \frac{1}{\Pi(A_1)}\right).$$
Putting the two cases together, it holds that
$$\omega(A_1) = \int_{A_1} \Psi(x, A_2)\,\pi(x)\,\mathrm{d}x \ge \frac{\delta}{4}\,\Pi(A_1)\, \min\left\{1, \frac{\Delta}{4\hat{c}}\, \log^{1/2}\left(1 + \frac{1}{\Pi(A_1)}\right)\right\}$$
for any measurable non-empty subset $A_1 \subseteq \mathcal{X}$ with $0 < \Pi(A_1) \le 1/2$. Due to $\inf_{x \in (0, v]} \log^{1/2}(1 + x^{-1}) = \log^{1/2}(1 + v^{-1})$, by the definition of the conductance profile given in (A3), we have
$$\Omega(v) \ge \frac{\delta}{4} \min\left\{1, \frac{\Delta}{4\hat{c}}\, \log^{1/2}\left(1 + \frac{1}{v}\right)\right\}$$
for any $v \in (0, 1/2]$. We complete the proof of Lemma 2. □

Appendix B.6. Proof of Theorem 1

For any $x \in \mathcal{X}$, let $P_{x,h} = N\{x - h\nabla U(x), 2hI_d\}$ with step size $h$. For $\mathcal{X} = B(x^*, R)$ with some universal constant $R > 0$ and $x^* \in \mathbb{R}^d$, without loss of generality, we set $x^* = \arg\min_{x \in \mathbb{R}^d} U(x)$. Under Assumption 1, we know $\nabla U(x^*) = 0$.
Lemma A1.
Let $\mathcal{X} = B(x^*, R)$ for some universal constant $R > 0$ and $x^* = \arg\min_{x \in \mathbb{R}^d} U(x)$, and let Assumption 1 hold. For any step size $h \in (0, 2L^{-1}]$ with $L$ specified in Assumption 1, it holds that
$$\|P_{x,h} - P_{y,h}\|_{\mathrm{TV}} \le \frac{|x - y|_2}{\sqrt{2h}} \tag{A11}$$
for any $x, y \in \mathcal{X}$. Furthermore, if $L^{3/8} R^{3/4} \ge 16 d^{-1/2} + 8$ and $L^{15/8} m^{-2} R^{1/4} \ge 12 d$, then for any $u \in (1/2, 1)$, it holds that
$$\sup_{x \in \mathcal{X}} \|P_{x,h} - \mathcal{T}_x\|_{\mathrm{TV}} \le \frac{u}{4} \tag{A12}$$
for any step size $h$ satisfying
$$\frac{1}{L^{7/4} R^{3/2} d} \le h \le \min\left\{\frac{R^2 (1 - \tilde{c})^2}{4\{\log^{1/2}(16 u^{-1}) + \sqrt{d}\}^2},\ \frac{u}{4\sqrt{3}\, L^{3/2} R},\ \frac{u}{128 L \{\log^{1/2}(16 u^{-1}) + \sqrt{d}\}^2}\right\}$$
with $\tilde{c} = \{1 + (L^{-7/2} R^{-3} d^{-2} - 2 L^{-11/4} R^{-3/2} d^{-1})\, m^2\}^{1/2}$, where $m$ is specified in Assumption 1 and $\mathcal{T}_x$ is the one-step transition distribution at $x \in \mathcal{X}$ of the Markov chain involved in Algorithm 2.
Proof of Lemma A1.
Firstly, we prove the first claim (A11) of this lemma. Recall $P_{x,h} = N\{x - h\nabla U(x), 2hI_d\}$ with step size $h$. For any $x, y \in \mathcal{X}$, by Pinsker's inequality, we have
$$\|P_{x,h} - P_{y,h}\|_{\mathrm{TV}} \le \sqrt{2\,\mathrm{KL}(P_{x,h} \| P_{y,h})} = (2h)^{-1/2}\, |\{x - h\nabla U(x)\} - \{y - h\nabla U(y)\}|_2,$$
where $\mathrm{KL}(P_{x,h} \| P_{y,h})$ is the Kullback–Leibler divergence between $P_{x,h}$ and $P_{y,h}$. Under Assumption 1, by the Taylor expansion, it holds that
$$|\{x - h\nabla U(x)\} - \{y - h\nabla U(y)\}|_2 = |\{I_d - h\nabla^2 U(z)\}(x - y)|_2 \le \|I_d - h\nabla^2 U(z)\|_2\, |x - y|_2$$
for some $z$ lying on the line joining $x$ and $y$. Since $\mathcal{X} = B(x^*, R)$ for some universal constant $R > 0$ and $U(\cdot)$ is $L$-smooth and $m$-strongly convex on $\mathcal{X}$, by Theorems 2.1.6 and 2.1.11 of [42], we have $mI_d \preceq \nabla^2 U(z) \preceq LI_d$. Due to $h \in (0, 2L^{-1}]$, then
$$\lambda_{\max}\{I_d - h\nabla^2 U(z)\} \le \lambda_{\max}(I_d) + \lambda_{\max}\{-h\nabla^2 U(z)\} \le 1 - mh \le 1,$$
and
$$\lambda_{\min}\{I_d - h\nabla^2 U(z)\} \ge \lambda_{\min}(I_d) + \lambda_{\min}\{-h\nabla^2 U(z)\} \ge 1 - Lh \ge -1$$
for all $z \in \mathcal{X}$. Therefore, we obtain $\sup_{z \in \mathcal{X}} \|I_d - h\nabla^2 U(z)\|_2 \le 1$, which implies that
$$\|P_{x,h} - P_{y,h}\|_{\mathrm{TV}} \le \frac{|x - y|_2}{\sqrt{2h}}$$
for any $x, y \in \mathcal{X}$. This yields the claim (A11).
Next, we prove the second claim (A12) of this lemma. Write the density of $P_{x,h}$ as $\phi_h(\cdot|x)$. Notice that the one-step transition distribution of the associated Markov chain at $x \in \mathcal{X}$ has a probability mass
$$\mathcal{T}_x(\{x\}) = 1 - \int_{\mathcal{X}} \phi_h(z|x)\,\alpha_x(z)\,\mathrm{d}z,$$
and admits the transition kernel $\phi_h(z|x)\,\alpha_x(z)\, I(z \in \mathcal{X} \setminus \{x\})$, where
$$\alpha_x(z) = \min\left\{1, \frac{\pi^*(z)\,\phi_h(x|z)}{\pi^*(x)\,\phi_h(z|x)}\right\} I(z \in \mathcal{X}).$$
By the definition of the total variation distance, we have
$$\|P_{x,h} - \mathcal{T}_x\|_{\mathrm{TV}} = \frac{1}{2}\,\mathcal{T}_x(\{x\}) + \frac{1}{2} \int_{\mathbb{R}^d} |\phi_h(z|x) - \phi_h(z|x)\,\alpha_x(z)\, I(z \in \mathcal{X} \setminus \{x\})|\,\mathrm{d}z = 1 - \int_{\mathcal{X}} \phi_h(z|x)\,\alpha_x(z)\,\mathrm{d}z = 1 - \mathbb{E}_{z \sim P_{x,h}}\{\alpha_x(z)\}$$
for any $x \in \mathcal{X}$. By Markov's inequality, it holds that
$$\mathbb{E}_{z \sim P_{x,h}}\{\alpha_x(z)\} \ge C\, \mathbb{P}_{z \sim P_{x,h}}\left\{\frac{\pi^*(z)\,\phi_h(x|z)\, I(z \in \mathcal{X})}{\pi^*(x)\,\phi_h(z|x)} \ge C\right\} \tag{A13}$$
for any $C \in (0, 1]$. In the sequel, we derive a lower bound for this tail probability.
Notice that
$$\frac{\pi^*(z)\,\phi_h(x|z)}{\pi^*(x)\,\phi_h(z|x)} = \exp\left[\frac{4h\{U(x) - U(z)\} + |z - x + h\nabla U(x)|_2^2 - |x - z + h\nabla U(z)|_2^2}{4h}\right].$$
For the numerator of this exponent, we have
$$4h\{U(x) - U(z)\} + |z - x + h\nabla U(x)|_2^2 - |x - z + h\nabla U(z)|_2^2 = 4h\{U(x) - U(z)\} + |z - x|_2^2 + h^2 |\nabla U(x)|_2^2 + 2h (z - x)^{\mathrm{T}} \nabla U(x) - |x - z|_2^2 - h^2 |\nabla U(z)|_2^2 - 2h (x - z)^{\mathrm{T}} \nabla U(z) = 2h\{U(x) - U(z) - (x - z)^{\mathrm{T}} \nabla U(x)\} + 2h\{U(x) - U(z) - (x - z)^{\mathrm{T}} \nabla U(z)\} + h^2\{|\nabla U(x)|_2^2 - |\nabla U(z)|_2^2\}.$$
Since $U(\cdot)$ is $L$-smooth and $m$-strongly convex on $\mathcal{X}$, it holds that
$$U(x) - U(z) - (x - z)^{\mathrm{T}} \nabla U(x) \ge -\frac{L}{2}|x - z|_2^2, \qquad U(x) - U(z) - (x - z)^{\mathrm{T}} \nabla U(z) \ge \frac{m}{2}|x - z|_2^2$$
for any $x, z \in \mathcal{X}$. By the Cauchy–Schwarz inequality, the triangle inequality, and Theorem 2.1.5 of [42], we know
$$|\nabla U(x)|_2^2 - |\nabla U(z)|_2^2 = \{\nabla U(x) + \nabla U(z)\}^{\mathrm{T}} \{\nabla U(x) - \nabla U(z)\} \ge -|\nabla U(x) + \nabla U(z)|_2\, |\nabla U(x) - \nabla U(z)|_2 \ge -|\nabla U(x) + \nabla U(z) - \nabla U(x) + \nabla U(x)|_2 \cdot L|x - z|_2 \ge -L|x - z|_2\, \{2|\nabla U(x)|_2 + L|x - z|_2\}$$
for any $x, z \in \mathcal{X}$. Since $\mathcal{X} = B(x^*, R)$ for some universal constant $R > 0$ and $x^* = \arg\min_{x \in \mathbb{R}^d} U(x)$, by Assumption 1, it holds that
$$|\nabla U(x)|_2 = |\nabla U(x) - \nabla U(x^*)|_2 \le L|x - x^*|_2 \le LR$$
for any $x \in \mathcal{X}$. Thus,
$$\frac{\pi^*(z)\,\phi_h(x|z)}{\pi^*(x)\,\phi_h(z|x)} \ge \exp\left\{-\frac{L - m}{4}|x - z|_2^2 - \frac{hL^2 R}{2}|x - z|_2 - \frac{hL^2}{4}|x - z|_2^2\right\} =: \exp(-T)$$
for any $x, z \in \mathcal{X}$. Since $z \sim P_{x,h} = N\{x - h\nabla U(x), 2hI_d\}$ and $\nabla U(x^*) = 0$, we have
$$|x - z|_2 = |h\nabla U(x) - \sqrt{2h}\,\xi|_2 \le h|\nabla U(x)|_2 + \sqrt{2h}\,|\xi|_2 \le hLR + \sqrt{2h}\,|\xi|_2$$
and $|x - z|_2^2 \le 2h^2 L^2 R^2 + 4h|\xi|_2^2$ for some $\xi \sim N(0, I_d)$, which implies
T 3 2 h 2 L 3 R 2 2 h L | ξ | 2 2 1 2 h 3 / 2 L 2 R | ξ | 2
if $h \le L^{-1}$. Recall $\mathcal{X} = B(x^*, R)$. Under Assumption 1, by Theorems 2.1.5, 2.1.9 and 2.1.10 of [42], it holds that
$$|x - h\nabla U(x) - x^*|_2^2 = |x - x^*|_2^2 - 2h(x - x^*)^{\mathrm{T}}\nabla U(x) + h^2|\nabla U(x)|_2^2 \le |x - x^*|_2^2 + (h^2 - hL^{-1})|\nabla U(x)|_2^2 \le \{1 + (h^2 - hL^{-1})m^2\}R^2 \le R^2$$
for any $x \in \mathcal{X}$ if $h \le L^{-1}$.
for any x X if h L 1 . Recall z = x h U ( x ) + ( 2 h ) 1 / 2 ξ . Select c ˜ ( 0 , 1 ) satisfying c ˜ 2 = 1 + ( L 7 / 2 R 3 d 2 L 11 / 4 R 3 / 2 d 1 ) m 2 , which can be guaranteed by L m and L 3 / 8 R 3 / 4 16 d 1 / 2 + 8 . Then
| z x * | 2 R c ˜ + ( 2 h ) 1 / 2 | ξ | 2
for any h [ L 7 / 4 R 3 / 2 d 1 , L 1 L 7 / 4 R 3 / 2 d 1 ] . For such selected h, we have
{ | ξ | 2 ( 2 h ) 1 / 2 R ( 1 c ˜ ) } { z X } .
Since $L^{3/8}R^{3/4} \le 16d^{1/2} + 8$ and $L^{15/8}m^{-2}R^{1/4} \le 12d$, by Lemma 1 of [43], for any given $u \in (1/2, 1)$, we have
$$\mathbb{P}_{z\sim P_{x,h}}\left\{T \ge -\frac{u}{8},\ z \in \mathcal{X}\right\} \ge \mathbb{P}\left\{T \ge -\frac{u}{8},\ |\xi|_2 \le \frac{R(1 - \tilde{c})}{\sqrt{2h}}\right\} \ge \mathbb{P}\left\{|\xi|_2^2 \le \frac{R^2(1 - \tilde{c})^2}{2h}\right\} - \mathbb{P}\left\{\sqrt{\frac{3}{2}}\,hL^{3/2}R + 2hL|\xi|_2^2 \ge \frac{u}{8}\right\} \ge \mathbb{P}\big[|\xi|_2^2 \le 2\{\log^{1/2}(16u^{-1}) + d\}^2\big] - \mathbb{P}\left(|\xi|_2^2 \ge \frac{u}{64hL}\right) \ge 1 - \frac{u}{8}$$
for any step size $h$ satisfying
$$L^{-7/4}R^{3/2}d^{-1} \le h \le \min\left[\frac{R^2(1 - \tilde{c})^2}{4\{\log^{1/2}(16u^{-1}) + d\}^2},\ \frac{u}{4\sqrt{3}\,L^{3/2}R},\ \frac{u}{128L\{\log^{1/2}(16u^{-1}) + d\}^2}\right].$$
Together with (A14), it holds that
$$\mathbb{P}_{z\sim P_{x,h}}\left\{\frac{\pi^*(z)\,\phi_h(x\,|\,z)}{\pi^*(x)\,\phi_h(z\,|\,x)}\,I(z \in \mathcal{X}) \ge \exp\left(-\frac{u}{8}\right)\right\} \ge 1 - \frac{u}{8}$$
for any $x \in \mathcal{X}$. Select $C = \exp(-u/8)$ in (A13). Since $\exp(-u/8) \ge 1 - u/8$, we have
$$\mathbb{E}_{z\sim P_{x,h}}\{\alpha_x(z)\} \ge \left(1 - \frac{u}{8}\right)^2 \ge 1 - \frac{u}{4},$$
which implies $\|P_{x,h} - T_x\|_{\mathrm{TV}} \le u/4$ for any $x \in \mathcal{X}$. This establishes claim (A12) and completes the proof of Lemma A1. □
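The transition mechanism analyzed in Lemma A1 is easy to state in code. The following is a minimal sketch of one step of the Metropolis-adjusted Langevin chain with proposal $P_{x,h} = N\{x - h\nabla U(x), 2hI_d\}$ and acceptance probability $\alpha_x(z)$, specialized to $\mathcal{X} = B(x^*, R)$; the function names and the gradient oracle grad_U are ours, and this is an illustration of the analyzed kernel rather than the paper's reference implementation.

```python
import numpy as np

def mala_ball_step(x, U, grad_U, h, center, R, rng):
    """One step of the Metropolis-adjusted Langevin chain analyzed in
    Lemma A1: propose z ~ N(x - h * grad_U(x), 2h * I_d), then accept
    with probability alpha_x(z) = min{1, ratio} * I(z in B(center, R))."""
    d = x.shape[0]
    z = x - h * grad_U(x) + np.sqrt(2.0 * h) * rng.standard_normal(d)
    if np.linalg.norm(z - center) > R:   # I(z in X) = 0: reject, stay at x
        return x
    # log phi_h(x|z) and log phi_h(z|x) for the Gaussian proposal
    log_bwd = -np.sum((x - z + h * grad_U(z)) ** 2) / (4.0 * h)
    log_fwd = -np.sum((z - x + h * grad_U(x)) ** 2) / (4.0 * h)
    # log of pi*(z) phi_h(x|z) / {pi*(x) phi_h(z|x)}
    log_ratio = U(x) - U(z) + log_bwd - log_fwd
    return z if np.log(rng.uniform()) < min(0.0, log_ratio) else x
```

Note that the indicator $I(z \in \mathcal{X})$ is applied before the Metropolis–Hastings correction, exactly as in the kernel $\phi_h(z\,|\,x)\,\alpha_x(z)\,I(z \in \mathcal{X}\setminus\{x\})$ above: a proposal outside the ball is rejected outright, and the chain holds at $x$.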
Lemma A2.
Let $\mathcal{X} = B(x^*, R)$ for some universal constant $R > 0$ and $x^* \in \mathbb{R}^d$, and let Assumption 1 hold. The target distribution $\Pi^*$ with density $\pi^*$ defined in (3) satisfies the log-isoperimetry inequality given in (A1) with constant $\hat{c} = m^{-1/2}$, where $m$ is specified in Assumption 1.
Proof of Lemma A2.
Let $p$ denote the density of the Gaussian distribution $N(0, \sigma^2I_d)$, and let $\Pi$ be a distribution with density $\pi = q \cdot p$, where $q$ is a log-concave function supported on $\mathcal{X}$. By Lemma 16 in [33], it holds that
$$\Pi(S_3) \ge \frac{d(S_1, S_2)}{2\sigma}\,\min\{\Pi(S_1), \Pi(S_2)\}\,\log^{1/2}\left[1 + \frac{1}{\min\{\Pi(S_1), \Pi(S_2)\}}\right]$$
for any partition $S_1, S_2, S_3$ of $\mathcal{X}$.
We now verify that the target distribution $\Pi^*$ with density $\pi^*$ defined in (3) satisfies the log-isoperimetry inequality (A1). Notice that
$$\pi^*(x) = \left(\frac{2\pi}{m}\right)^{d/2} \frac{\exp\{-U(x) + m|x|_2^2/2\}}{\int_{\mathcal{X}} \exp\{-U(y)\}\,\mathrm{d}y}\, I(x \in \mathcal{X}) \cdot \frac{\exp(-m|x|_2^2/2)}{(2\pi/m)^{d/2}},$$
where $U(\cdot)$ is $m$-strongly convex on $\mathcal{X}$. By Theorem 2.1.11 of [42], $U(\cdot) - m|\cdot|_2^2/2$ is convex on $\mathcal{X}$, so $\exp\{-U(\cdot) + m|\cdot|_2^2/2\}$ is log-concave on $\mathcal{X}$. Since the indicator function $I(\cdot \in \mathcal{X})$ of the convex body $\mathcal{X}$ is also log-concave and the class of log-concave functions is closed under multiplication, $\pi^*$ is the product of a log-concave function supported on $\mathcal{X}$ and the density of the normal distribution $N(0, m^{-1}I_d)$. By (A15), the distribution $\Pi^*$ therefore satisfies the log-isoperimetry inequality (A1) with constant $\hat{c} = m^{-1/2}$. We complete the proof of Lemma A2. □
Proof of Theorem 1.
Let $T_x^{\mathrm{L}}$ be the one-step transition distribution at $x \in \mathcal{X}$ of the Markov chain determined by the $1/2$-lazy version of Algorithm 2. Then we have
$$T_x^{\mathrm{L}}(A) = \frac{1}{2}\,\delta_x(A) + \frac{1}{2}\,T_x(A)$$
for any $A \in \mathcal{B}(\mathcal{X})$, where $\delta_x(\cdot)$ is the Dirac measure at $x \in \mathcal{X}$ and $T_x$ is the one-step transition distribution at $x$ of the Markov chain determined by Algorithm 2. By the definition of a lazy chain and Proposition 1, the Markov chain with transition distribution $T_x^{\mathrm{L}}$ is $1/2$-lazy, $\Pi^*$-irreducible, smooth, and reversible with respect to the distribution $\Pi^*$ with density $\pi^*$ defined in (3).
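In code, passing from $T_x$ to its $1/2$-lazy version $T_x^{\mathrm{L}}$ amounts to a fair coin flip before each update. A minimal sketch, with step_fn standing for one step of the underlying chain (for example, the hypothetical mala_ball_step above with its arguments bound):

```python
def lazy_step(x, step_fn, rng):
    """1/2-lazy chain: with probability 1/2 hold at x (the Dirac part
    of T^L), otherwise take one step of the underlying chain T."""
    return x if rng.uniform() < 0.5 else step_fn(x)
```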
Recall that $P_{x,h}$ is the proposal distribution used in both Algorithm 2 and its $1/2$-lazy version. For any $x, y \in \mathcal{X}$ such that $|x - y|_2 \le (2^{-1}h)^{1/2}u$ for some $u \in (1/2, 1)$, and for any step size $h$ satisfying $h \ge L^{-7/4}R^{3/2}d^{-1}$ and
$$h \le \min\left[\frac{R^2(1 - \tilde{c})^2}{4\{\log^{1/2}(16u^{-1}) + d\}^2},\ \frac{u}{4\sqrt{3}\,L^{3/2}R},\ \frac{u}{128L\{\log^{1/2}(16u^{-1}) + d\}^2}\right]$$
with $\tilde{c} = \{1 + (L^{-7/2}R^3d^{-2} - L^{-11/4}R^{3/2}d^{-1})m^2\}^{1/2}$, the triangle inequality and Lemma A1 yield
$$\|T_x^{\mathrm{L}} - T_y^{\mathrm{L}}\|_{\mathrm{TV}} \le \frac{1}{2} + \frac{1}{2}\|T_x - T_y\|_{\mathrm{TV}} \le \frac{1}{2} + \frac{1}{2}\left(\|T_x - P_{x,h}\|_{\mathrm{TV}} + \|P_{x,h} - P_{y,h}\|_{\mathrm{TV}} + \|P_{y,h} - T_y\|_{\mathrm{TV}}\right) \le \frac{1 + u}{2}.$$
Recall $\mathcal{X} = B(x^*, R)$ for some universal constant $R > 0$ and $x^* \in \mathbb{R}^d$. Under Assumption 1, Lemma A2 implies that the distribution $\Pi^*$ with density $\pi^*$ satisfies the log-isoperimetry inequality given in (A1) with constant $\hat{c} = m^{-1/2}$. Applying Lemma 2 with $\delta = 2^{-1}(1 - u)$ and $\Delta = (2^{-1}h)^{1/2}u$, we have
$$\Omega(v) \ge \frac{1 - u}{8}\min\left\{1,\ \frac{u\sqrt{hm}}{4\sqrt{2}}\log^{1/2}\left(1 + \frac{1}{v}\right)\right\}$$
for any $v \in (0, 1/2]$, where $\Omega(\cdot)$ is the conductance profile defined in (A3) for the Markov chain with transition distribution $T_x^{\mathrm{L}}$. For the above selected $u$ and $h$, define the function
$$\Upsilon(v) = \begin{cases} \dfrac{1 - u}{8}\min\left\{1,\ \dfrac{u\sqrt{hm}}{4\sqrt{2}}\log^{1/2}\dfrac{1}{v}\right\}, & v \in (0, 1/2],\\[8pt] \dfrac{1 - u}{8}\min\left\{1,\ \dfrac{u\sqrt{hm}}{4\sqrt{2}}(\log 2)^{1/2}\right\}, & v \in (1/2, \infty), \end{cases}$$
which lower-bounds the (extended) conductance profile for all $v > 0$. Recall that
$$\tau(\varepsilon; P_0, \Pi^*) = \min\{k \in \mathbb{N}: \|T^k(P_0) - \Pi^*\|_{\mathrm{TV}} \le \varepsilon\}$$
for an error tolerance $\varepsilon \in (0, 1)$, where $T^k(P_0)$ is the distribution of the Markov chain with transition distribution $T_x^{\mathrm{L}}$ at the $k$-th step, initialized from $P_0$.
Let
$$\widetilde{\Omega}(v) = \begin{cases} \Omega(v), & v \in (0, 1/2],\\ \Omega(1/2), & v \in (1/2, \infty), \end{cases}$$
be the extended conductance profile of this Markov chain. By Lemma 1, it holds that
$$\tau(\varepsilon; P_0, \Pi^*) \le 8\int_{4\beta^{-1}}^{\varepsilon^{-2}} \frac{\mathrm{d}v}{v\,\widetilde{\Omega}^2(v)} \le 8\int_{4\beta^{-1}}^{\varepsilon^{-2}} \frac{\mathrm{d}v}{v\,\Upsilon^2(v)}.$$
If $\beta > 8$ and $h \le 32u^{-2}\{m\log(\beta/4)\}^{-1}$, it then holds that
$$\frac{u\sqrt{hm}}{4\sqrt{2}}(\log 2)^{1/2} < \frac{u\sqrt{hm}}{4\sqrt{2}}\log^{1/2}\frac{\beta}{4} \le 1,$$
so the minimum in $\Upsilon(\cdot)$ is attained by its second argument over the whole integration range, and a direct computation of the integral yields
$$\tau(\varepsilon; P_0, \Pi^*) = O\left\{\frac{1}{hm}\log\left(\frac{\log\beta}{\varepsilon}\right)\right\}.$$
Together with $h \ge L^{-7/4}R^{3/2}d^{-1}$, we complete the proof of Theorem 1. □
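The quantity $\tau(\varepsilon; P_0, \Pi^*)$ bounded above can also be approximated numerically, which is how the mixing-time figures later in the paper are naturally read. As an illustration only (not the estimator used in the experiments, whose protocol is not reproduced in this appendix), one can run many chains in parallel and record the first iteration at which a histogram estimate of the total variation distance of a one-dimensional marginal from the target falls below $\varepsilon$:

```python
import numpy as np

def approx_mixing_time(chains, target_samples, eps, n_bins=30):
    """Crude empirical proxy for the eps-mixing time: the first step k at
    which a histogram-based TV estimate between the marginal law of the
    chain at step k and the target drops below eps. `chains` has shape
    (n_chains, n_steps) and holds one coordinate of each chain."""
    lo = min(chains.min(), target_samples.min())
    hi = max(chains.max(), target_samples.max())
    bins = np.linspace(lo, hi, n_bins + 1)
    p_target, _ = np.histogram(target_samples, bins=bins)
    p_target = p_target / p_target.sum()
    for k in range(chains.shape[1]):
        p_k, _ = np.histogram(chains[:, k], bins=bins)
        p_k = p_k / p_k.sum()
        if 0.5 * np.abs(p_k - p_target).sum() < eps:
            return k
    return chains.shape[1]  # did not mix within the recorded horizon
```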

Appendix B.7. Proof of Corollary 1

Proof of Corollary 1.
Recall $\mathcal{X} = \{x \in \mathbb{R}^d: |x|_p \le C\}$ for some universal constant $C > 0$. The two additional steps introduced in Algorithm 3 merely transform sampling from the norm-constrained region $\{x \in \mathbb{R}^d: |x|_p \le C\}$ into sampling from the Euclidean unit ball $B(0, 1)$, so the convergence rates of the two processes coincide. Applying the same arguments as in the proof of Theorem 1 with $R = 1$ and $x^* = 0$, we obtain the results of Corollary 1. □
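As an illustration of such a reduction, one standard radial bijection between $\{x: |x|_p \le C\}$ and $B(0, 1)$ rescales each point so that its Euclidean norm becomes $|x|_p/C$. The sketch below is ours; Algorithm 3 (not reproduced in this appendix) may use a different map, and any such change of variables in general also entails the corresponding Jacobian correction to the target density.

```python
import numpy as np

def lp_ball_to_unit_ball(x, p, C):
    """Radial map sending {|x|_p <= C} onto the Euclidean unit ball B(0, 1):
    x is rescaled so that its l2 norm equals |x|_p / C."""
    nx2 = np.linalg.norm(x)
    return x if nx2 == 0.0 else (np.linalg.norm(x, ord=p) / (C * nx2)) * x

def unit_ball_to_lp_ball(y, p, C):
    """Inverse radial map from B(0, 1) back onto the l_p ball of radius C."""
    ny2 = np.linalg.norm(y)
    return y if ny2 == 0.0 else (C * ny2 / np.linalg.norm(y, ord=p)) * y
```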

Appendix B.8. Proof of Theorem 2

Proof of Theorem 2.
Recall that the distribution $\Pi^{*,\lambda}$ has density
$$\pi^{*,\lambda}(x) = \frac{\exp\{-V_{\mathcal{X}}^{\lambda}(x)\}}{\int_{\mathbb{R}^d} \exp\{-V_{\mathcal{X}}^{\lambda}(y)\}\,\mathrm{d}y}$$
for a regularization parameter $\lambda > 0$, where $V_{\mathcal{X}}^{\lambda}(\cdot)$ is defined as in (10), and the target distribution $\Pi^*$ has density
$$\pi^*(x) = \frac{\exp\{-U(x)\}\,I(x \in \mathcal{X})}{\int_{\mathcal{X}} \exp\{-U(y)\}\,\mathrm{d}y}$$
for some potential function $U: \mathbb{R}^d \to \mathbb{R}$. Under Assumptions 1 and 2, if there exists a universal constant $\tilde{C} > 0$ such that $\exp\{\inf_{x\in\mathcal{X}^{\mathrm{c}}} U(x) - \sup_{x\in\mathcal{X}} U(x)\} \ge \tilde{C}$, then by Proposition 4 in [35], we have
$$\|\Pi^{*,\lambda} - \Pi^*\|_{\mathrm{TV}} \le \varepsilon$$
for $\lambda = 8\pi^{-1}\varepsilon^2 r^2 d^{-2}\tilde{C}^2$ with the error tolerance $\varepsilon \in (0, 1)$, where $r > 0$ is specified in Assumption 2.
Notice that $V_{\mathcal{X}}^{\lambda}(\cdot) = U(\cdot) + \iota_{\mathcal{X}}^{\lambda}(\cdot)$ with $\iota_{\mathcal{X}}^{\lambda}(\cdot)$ defined as in (7). Under Assumption 1, by (9) and Theorem 2.1.5 in [42], the function $V_{\mathcal{X}}^{\lambda}(\cdot)$ is twice continuously differentiable, $(L + \lambda^{-1})$-smooth, and $m$-strongly convex on $\mathbb{R}^d$. Given the initial distribution $P_0 = N\{x^{\star}, (L + \lambda^{-1})^{-1}I_d\}$ with $x^{\star} = \arg\min_{x\in\mathbb{R}^d} V_{\mathcal{X}}^{\lambda}(x)$ and an error tolerance $\varepsilon \in (0, 1)$, by Theorem 5 of [33], the Markov chain determined by Algorithm 4 satisfies
$$\tau(\varepsilon; P_0, \Pi^{*,\lambda}) = O\left[\frac{(L + \lambda^{-1})d}{m}\log\left(\frac{d}{\varepsilon}\right)\cdot\max\left\{1, \left(\frac{L + \lambda^{-1}}{dm}\right)^{1/2}\right\}\right]$$
with the step size
$$h = \frac{c}{(L + \lambda^{-1})\,d\cdot\max\left\{1, \left(\frac{L + \lambda^{-1}}{dm}\right)^{1/2}\right\}},$$
where $c > 0$ is a universal constant. Together with (A16), by the definition of the $\varepsilon$-mixing time and the triangle inequality, we have
$$\tau(\varepsilon; P_0, \Pi^*) = O\left[\frac{(L + \lambda^{-1})d}{m}\log\left(\frac{d}{\varepsilon}\right)\max\left\{1, \left(\frac{L + \lambda^{-1}}{dm}\right)^{1/2}\right\}\right]$$
with $\lambda = 8\pi^{-1}\varepsilon^2 r^2 d^{-2}\tilde{C}^2$. Hence, we complete the proof of Theorem 2. □
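For concreteness, the smoothed potential $V_{\mathcal{X}}^{\lambda}$ can be sketched under the common convention that $\iota_{\mathcal{X}}^{\lambda}$ is the Moreau envelope of the indicator of $\mathcal{X}$, i.e. $\iota_{\mathcal{X}}^{\lambda}(x) = \mathrm{dist}(x, \mathcal{X})^2/(2\lambda)$, whose gradient $\{x - \mathrm{proj}_{\mathcal{X}}(x)\}/\lambda$ is $\lambda^{-1}$-Lipschitz; this is consistent with the $(L + \lambda^{-1})$-smoothness used above, but whether (7) takes exactly this form is an assumption on our part. The sketch specializes to $\mathcal{X} = B(0, R)$:

```python
import numpy as np

def proj_ball(x, R):
    """Euclidean projection onto X = B(0, R)."""
    n = np.linalg.norm(x)
    return x if n <= R else (R / n) * x

def grad_V_lambda(x, grad_U, lam, R):
    """Gradient of the regularized potential
    V(x) = U(x) + dist(x, X)^2 / (2 * lam): the penalty term pushes the
    unconstrained chain of Algorithm 4 back toward X with force ~ 1/lam."""
    return grad_U(x) + (x - proj_ball(x, R)) / lam
```

Smaller $\lambda$ enforces the constraint more tightly but worsens the smoothness constant $L + \lambda^{-1}$, which is exactly the trade-off visible in the mixing-time bound of Theorem 2.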

References

  1. Gelfand, A.E.; Smith, A.F.; Lee, T.M. Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. J. Am. Stat. Assoc. 1992, 87, 523–532.
  2. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
  3. Klein, J.P.; Moeschberger, M.L. Survival Analysis: Techniques for Censored and Truncated Data; Springer: New York, NY, USA, 2005; pp. 5–18.
  4. Johnson, V.E.; Albert, J.H. Ordinal Data Modeling; Springer: New York, NY, USA, 2006; pp. 126–157.
  5. Celeux, G.; El Anbari, M.; Marin, J.M.; Robert, C.P. Regularization in regression: Comparing Bayesian and frequentist methods in a poorly informative situation. Bayesian Anal. 2012, 7, 477–502.
  6. Paisley, J.W.; Blei, D.M.; Jordan, M.I. Bayesian nonnegative matrix factorization with stochastic variational inference. In Handbook of Mixed Membership Models and Their Applications; Airoldi, E.M., Blei, D.M., Erosheva, E.A., Fienberg, S.E., Eds.; CRC Press: Boca Raton, FL, USA, 2014; pp. 205–224.
  7. Khodadadian, A.; Parvizi, M.; Teshnehlab, M.; Heitzinger, C. Rational design of field-effect sensors using partial differential equations, Bayesian inversion, and artificial neural networks. Sensors 2022, 22, 4785.
  8. Noii, N.; Khodadadian, A.; Ulloa, J.; Aldakheel, F.; Wick, T.; François, S.; Wriggers, P. Bayesian inversion with open-source codes for various one-dimensional model problems in computational mechanics. Arch. Comput. Methods Eng. 2022, 29, 4285–4318.
  9. Ma, Y.A.; Chen, Y.; Jin, C.; Flammarion, N.; Jordan, M.I. Sampling can be faster than optimization. Proc. Natl. Acad. Sci. USA 2019, 116, 20881–20885.
  10. Mangoubi, O.; Vishnoi, N.K. Nonconvex sampling with the Metropolis-adjusted Langevin algorithm. In Proceedings of the 32nd Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 2259–2293.
  11. Dyer, M.; Frieze, A. Computing the volume of convex bodies: A case where randomness provably helps. Probabilistic Comb. Its Appl. 1991, 44, 123–170.
  12. Rodriguez-Yam, G.; Davis, R.A.; Scharf, L.L. Efficient Gibbs sampling of truncated multivariate normal with application to constrained linear regression. Technical Report, Unpublished Manuscript; Colorado State University: Fort Collins, CO, USA, 2004.
  13. Lovász, L.; Vempala, S. The geometry of logconcave functions and sampling algorithms. Random Struct. Algorithms 2007, 30, 307–358.
  14. Chen, M.H.; Shao, Q.M.; Ibrahim, J.G. Monte Carlo Methods in Bayesian Computation; Springer: New York, NY, USA, 2012; pp. 191–212.
  15. Dyer, M.; Frieze, A.; Kannan, R. A random polynomial-time algorithm for approximating the volume of convex bodies. J. ACM 1991, 38, 1–17.
  16. Lang, L.; Chen, W.S.; Bakshi, B.R.; Goel, P.K.; Ungarala, S. Bayesian estimation via sequential Monte Carlo sampling—Constrained dynamic systems. Automatica 2007, 43, 1615–1622.
  17. Chaudhry, S.; Lautzenheiser, D.; Ghosh, K. An efficient scheme for sampling in constrained domains. arXiv 2021, arXiv:2110.10840.
  18. Lan, S.; Kang, L. Sampling constrained continuous probability distributions: A review. arXiv 2022, arXiv:2209.12403.
  19. Neal, R.M. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo; Brooks, S., Gelman, A., Jones, G., Meng, X.L., Eds.; CRC Press: Boca Raton, FL, USA, 2011; pp. 113–162.
  20. Pakman, A.; Paninski, L. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comput. Graph. Stat. 2014, 23, 518–542.
  21. Lan, S.; Shahbaba, B. Sampling constrained probability distributions using spherical augmentation. In Algorithmic Advances in Riemannian Geometry and Applications; Minh, H.Q., Murino, V., Eds.; Springer: New York, NY, USA, 2016; pp. 25–71.
  22. Brubaker, M.; Salzmann, M.; Urtasun, R. A family of MCMC methods on implicitly defined manifolds. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, La Palma, Canary Islands, Spain, 21–23 April 2012; pp. 161–172.
  23. Ahn, K.; Chewi, S. Efficient constrained sampling via the mirror-Langevin algorithm. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; pp. 28405–28418.
  24. Parisi, G. Correlation functions and computer simulations. Nucl. Phys. B 1981, 180, 378–384.
  25. Grenander, U.; Miller, M.I. Representations of knowledge in complex systems. J. R. Stat. Soc. Ser. B (Methodol.) 1994, 56, 549–581.
  26. Roberts, G.O.; Tweedie, R.L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 1996, 2, 341–363.
  27. Roberts, G.O.; Stramer, O. Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab. 2002, 4, 337–357.
  28. Dalalyan, A.S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B (Methodol.) 2017, 79, 651–676.
  29. Durmus, A.; Moulines, E. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab. 2017, 27, 1551–1587.
  30. Cheng, X.; Bartlett, P. Convergence of Langevin MCMC in KL-divergence. In Proceedings of Machine Learning Research, Lanzarote, Spain, 7–9 April 2018; pp. 186–211.
  31. Durmus, A.; Moulines, E. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 2019, 25, 2854–2882.
  32. Dwivedi, R.; Chen, Y.; Wainwright, M.J.; Yu, B. Log-concave sampling: Metropolis-Hastings algorithms are fast. J. Mach. Learn. Res. 2019, 20, 1–42.
  33. Chen, Y.; Dwivedi, R.; Wainwright, M.J.; Yu, B. Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients. J. Mach. Learn. Res. 2020, 21, 3647–3717.
  34. Bubeck, S.; Eldan, R.; Lehec, J. Finite-time analysis of projected Langevin Monte Carlo. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1243–1251.
  35. Brosse, N.; Durmus, A.; Moulines, É.; Pereyra, M. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Proceedings of the 2017 Conference on Learning Theory, Amsterdam, The Netherlands, 7–10 July 2017; pp. 319–342.
  36. Hsieh, Y.P.; Kavis, A.; Rolland, P.; Cevher, V. Mirrored Langevin dynamics. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 1–10.
  37. Roberts, G.O.; Rosenthal, J.S. General state space Markov chains and MCMC algorithms. Probab. Surv. 2004, 1, 20–71.
  38. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
  39. Kannan, R.; Lovász, L.; Montenegro, R. Blocking conductance and mixing in random walks. Comb. Probab. Comput. 2006, 15, 541–570.
  40. Lee, Y.T.; Vempala, S.S. Stochastic localization + Stieltjes barrier = tight bound for log-Sobolev. In Proceedings of the Annual ACM SIGACT Symposium on Theory of Computing, Los Angeles, CA, USA, 25–29 June 2018; pp. 1122–1129.
  41. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
  42. Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course; Springer: New York, NY, USA, 2003; pp. 51–101.
  43. Laurent, B.; Massart, P. Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 2000, 28, 1302–1338.
Figure 1. The trace plots of $x_1$ of the Markov chains determined by the four sampling algorithms.
Figure 2. The densities of $x_1$ of the Markov chains determined by the four sampling algorithms.
Figure 3. Approximate mixing time with respect to dimension and error tolerance of Algorithm 2. (a) Dimension dependence for fixed error tolerance. (b) Error tolerance dependence for fixed dimension.
Figure 4. Approximate mixing time with respect to dimension and error tolerance for the four sampling algorithms. (a) Dimension dependence for fixed error tolerance. (b) Error tolerance dependence for fixed dimension.
Figure 5. Bayesian regularized regression via Algorithm 3, where distinct colors represent the trajectories of parameter estimates for distinct variables. (a) $L_1$-norm constraint. (b) $L_{1.5}$-norm constraint. (c) $L_2$-norm constraint.
Table 1. Convergence rates for sampling from log-concave distributions with bounded support.

Assumptions | $\|\cdot\|_{\mathrm{TV}}$ Rate | Algorithms
$0 \preceq \nabla^2 U(x) \preceq L I_d$ | $\widetilde{O}(d^{12}\varepsilon^{-12})$ | PLMC in [34]
$m I_d \preceq \nabla^2 U(x) \preceq L I_d$ | $\widetilde{O}(d^{5}\varepsilon^{-6})$ | MYULA in [35]
$m I_d \preceq \nabla^2 U(x)$ | $\widetilde{O}(d\,\varepsilon^{-2})$ | MLD in [36]
$m I_d \preceq \nabla^2 U(x) \preceq L I_d$ | $\widetilde{O}\{d\log(1/\varepsilon)\}$ | Algorithms 2 and 3 in this paper
$m I_d \preceq \nabla^2 U(x) \preceq L I_d$ | $\widetilde{O}(d^{3}\varepsilon^{-2})$ | Algorithm 4 in this paper
Table 2. Step sizes for sampling from log-concave distributions with bounded support.

Algorithms | Step Size
PLMC in [34] | $L^{-1}d^{-2}$
MYULA in [35] | $\{d\max(d, L)\}^{-1}$
MLD in [36] | grid search
Algorithm 2 in this paper | $L^{-7/4}R^{3/2}d^{-1}$
Algorithm 3 in this paper | $L^{-7/4}d^{-1}$
Algorithm 4 in this paper | $\{(L + \lambda^{-1})\max[d, \{m^{-1}d(L + \lambda^{-1})\}^{1/2}]\}^{-1}$
Table 3. The mean and covariance estimation results obtained by MYULA and Algorithm 4 (covariance entries listed row-wise as $\Sigma_{11}, \Sigma_{12}; \Sigma_{21}, \Sigma_{22}$).

Methods | Mean | Covariance
The truth | (0.790, 0.488) | (0.326, 0.017; 0.017, 0.080)
MYULA | (0.758 ± 0.052, 0.484 ± 0.016) | (0.309 ± 0.038, 0.017 ± 0.009; 0.017 ± 0.009, 0.088 ± 0.002)
Algorithm 4 | (0.781 ± 0.034, 0.491 ± 0.009) | (0.317 ± 0.012, 0.017 ± 0.004; 0.017 ± 0.004, 0.082 ± 0.003)