Article

Variational Characterization of Free Energy: Theory and Algorithms

1 Institut für Mathematik, Brandenburgische Technische Universität Cottbus-Senftenberg, D-03046 Cottbus, Germany
2 Institut für Mathematik, Freie Universität Berlin, D-14195 Berlin, Germany
3 Zuse Institute Berlin, D-14195 Berlin, Germany
* Author to whom correspondence should be addressed.
Entropy 2017, 19(11), 626; https://doi.org/10.3390/e19110626
Submission received: 25 September 2017 / Revised: 7 November 2017 / Accepted: 15 November 2017 / Published: 20 November 2017
(This article belongs to the Special Issue Understanding Molecular Dynamics via Stochastic Processes)

Abstract

The article surveys and extends variational formulations of the thermodynamic free energy and discusses their information-theoretic content from the perspective of mathematical statistics. We revisit the well-known Jarzynski equality for nonequilibrium free energy sampling within the framework of importance sampling and Girsanov change-of-measure transformations. The implications of the different variational formulations for designing efficient stochastic optimization and nonequilibrium simulation algorithms for computing free energies are discussed and illustrated.

1. Introduction

A standard problem in statistical physics and its computational applications, e.g., in molecular dynamics, is to compute the expected value of an observable f with respect to a given (equilibrium) probability density π,

$$\mathbb{E}_\pi[f] = \int f(x)\,\pi(x)\,\mathrm{d}x.$$
Even if samples from the density π are available, the simplest Monte Carlo estimator, the sample mean, may suffer from a large variance (compared to the quantity that one tries to estimate), such that the accurate estimation of $\mathbb{E}_\pi[f]$ requires an unreasonably large sample size. Various approaches to circumvent this problem and to reduce the variance of an estimator are available, one of the most prominent representatives being importance sampling, where samples are drawn from another probability density ρ and reweighted with the likelihood ratio π/ρ [1,2]. It is well known that, theoretically (and under certain assumptions), there exists an optimal importance sampling density ρ such that the resulting estimator has zero variance. By a clever choice of the importance sampling proposal density, it is thus possible to completely remove the stochasticity from the problem and to obtain what is sometimes called certainty equivalence. Yet, drawing samples from the optimal density (or an approximation of it) is a difficult problem in itself, so that the striking variance reduction due to importance sampling often does not pay off in practice.
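To make the reweighting idea concrete, the following minimal Python sketch compares plain Monte Carlo with importance sampling for a toy expectation; the Gaussian target, the tail-concentrated observable and the shifted proposal are illustrative choices, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: pi = N(0,1) and f(x) = exp(-(x-4)^2), so that f is
# concentrated far in the tail of pi and the plain Monte Carlo estimator
# of E_pi[f] has a large relative variance.
f = lambda x: np.exp(-(x - 4.0) ** 2)
N = 100_000

# Plain Monte Carlo: draw directly from pi.
x = rng.standard_normal(N)
plain = f(x)

# Importance sampling: draw from the proposal rho = N(4,1) and reweight
# with the likelihood ratio pi/rho (normalization constants cancel).
y = 4.0 + rng.standard_normal(N)
weights = np.exp(-0.5 * y**2 + 0.5 * (y - 4.0) ** 2)
weighted = f(y) * weights

print("plain MC:", plain.mean(), "+/-", plain.std(ddof=1) / np.sqrt(N))
print("IS      :", weighted.mean(), "+/-", weighted.std(ddof=1) / np.sqrt(N))
```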
The zero variance property of importance sampling and the challenge to utilize it algorithmically are the starting point of this article, where the focus is on its generalization to path sampling problems and its algorithmic realization. Regarding the former, we will show that the Donsker–Varadhan variational principle, a well-known measure-theoretic characterization of cumulant generating functions [3] that gives rise to a variational characterization of the thermodynamic free energy [4,5], permits several striking applications of the importance sampling framework to path sampling problems; examples involve trajectory-dependent expectations like expected hitting times or free energy differences [6,7]. We will see that finding the optimal change of measure in path space is equivalent to solving an optimal control problem for the underlying dynamical system, in which the dynamics is controlled by external driving forces and thus driven out of equilibrium [8,9].
One of the central contributions of this paper is a proof that the resulting path space importance sampling scheme features zero variance estimators under quite general assumptions. We furthermore elaborate on the connection between optimized importance sampling and the famous Jarzynski fluctuation relation for the thermodynamic free energy [10]. In particular, we will explore this connection to devise better non-equilibrium free energy algorithms and, hopefully, obtain a better understanding of Jarzynski-based estimators; cf. [11,12,13,14].
Regarding the algorithmic realization, the theoretical insight into the relation between (adaptive) importance sampling and optimal control leads to novel algorithms that aim at utilizing the zero variance property without having to sample from the optimal importance sampling density. We will demonstrate how this can be achieved by discretizing the optimal control problem, using ideas from stochastic approximation and stochastic optimization [9,15]; see [16,17,18] for an alternative approach using ideas from the theory of large deviations. The examples we present are mainly pedagogical and admittedly very simple, but they highlight important features of the importance sampling scheme, such as the exponential tilting of the (path space) probability measure or the uniqueness of the solution to the stochastic approximation problem within a certain parametric family of trial probability measures, and this is why we confine our attention to such low-dimensional examples. Regarding the application of our approach to molecular dynamics simulation, we allude to the relevant literature.

Outline

The article is organized as follows: Firstly, in Section 2, we review certainty equivalence and the zero variance property of optimized importance sampling in state space, starting from the Donsker–Varadhan principle and its relation to importance sampling, and comment on some algorithmic issues. Then, in Section 3, we consider the generalization to path space, discuss the relation to stochastic optimal control and revisit Jarzynski-based estimators for thermodynamic free energies. Section 4 surveys and discusses novel algorithms that exploit the theoretical properties of the control-based importance sampling scheme. We briefly discuss some of these algorithms with simple toy examples in Section 5, before the article concludes in Section 6 with a brief summary and a discussion of open issues. The article contains four appendices that record various technical identities, including a brief derivation of Girsanov's change of measure formula and the proof of the main theorem: the zero-variance property of optimized importance sampling on path space.

2. Certainty Equivalence

In mathematical finance, the guaranteed payment that an investor would accept instead of a potentially higher, but uncertain return on an asset is called a certainty equivalent. In physics, certainty equivalence amounts to finding a deterministic surrogate system that reproduces averages of certain fluctuating thermodynamic quantities with probability one. One such example is the thermodynamic free energy difference between two equilibrium states that can be either computed by an exponential average over the fluctuating nonequilibrium work done on the system or by measuring the work of an adiabatic transformation between these states.

2.1. Donsker–Varadhan Variational Principle

Before getting into the technical details, we briefly review the classical Donsker–Varadhan variational principle for the cumulant generating function of a random variable. To this end, let X be an $\mathbb{R}^n$-valued random variable with smooth probability density π, and call

$$\mathbb{E}_\pi[f(X)] = \int_{\mathbb{R}^n} f(x)\,\pi(x)\,\mathrm{d}x$$

the expectation with respect to π for any integrable function $f\colon \mathbb{R}^n \to \mathbb{R}$.
Definition 1.
Let $W\colon \mathbb{R}^n \to \mathbb{R}$ be a bounded random variable. The quantity

$$\gamma\colon \mathcal{B}(\mathbb{R}^n) \to \mathbb{R}, \quad W \mapsto -\log \mathbb{E}_\pi\big[\exp(-W)\big]$$

is called the free energy of the random variable W = W(X) with respect to π, where $\mathcal{B}(\mathbb{R}^n)$ is the set of bounded and measurable, real-valued functions on $\mathbb{R}^n$ (“If you can write it down, it’s measurable!”, S. R. Srinivasa Varadhan).
Definition 2.
Let ρ be another probability density on $\mathbb{R}^n$. Then

$$D(\rho\,|\,\pi) = \int_{\mathbb{R}^n} \log\frac{\rho(x)}{\pi(x)}\,\rho(x)\,\mathrm{d}x$$

is called the relative entropy of ρ with respect to π (or: Kullback–Leibler divergence), provided that π(x) = 0 implies ρ(x) = 0 for every $x \in \mathbb{R}^n$. Otherwise, we set $D(\rho\,|\,\pi) = +\infty$.
The requirement that π ( x ) must not be zero without ρ ( x ) being zero is known as absolute continuity and guarantees that the likelihood ratio L = ρ / π is well defined. In what follows, we may assume without loss of generality that π > 0 . (Otherwise we may exclude those states x R n for which π ( x ) = 0 .)
A well-known thermodynamic principle states that the free energy is the Legendre transform of the entropy. The following variant of this principle is due to Donsker and Varadhan and says that (e.g., see [3] and the references therein)
$$-\log \mathbb{E}_\pi\big[\exp(-W)\big] = \min_{\rho \geq 0} \Big\{ \mathbb{E}_\rho[W] + D(\rho\,|\,\pi) \Big\},$$
where the minimum is over all probability density functions ρ on $\mathbb{R}^n$. The inequality “≤” easily follows from Jensen’s inequality by noting that

$$-\log \int_{\mathbb{R}^n} \exp(-W)\,\pi\,\mathrm{d}x = -\log \int_{\mathbb{R}^n} \exp(-W)\,\frac{\pi}{\rho}\,\rho\,\mathrm{d}x = -\log \int_{\mathbb{R}^n} \exp\Big(-W - \log\frac{\rho}{\pi}\Big)\,\rho\,\mathrm{d}x \leq \int_{\mathbb{R}^n} \Big(W + \log\frac{\rho}{\pi}\Big)\,\rho\,\mathrm{d}x = \int_{\mathbb{R}^n} W\,\rho\,\mathrm{d}x + \int_{\mathbb{R}^n} \log\frac{\rho}{\pi}\,\rho\,\mathrm{d}x.$$
Additionally, it can be readily seen that equality is attained if and only if

$$\rho^* = \exp(\gamma - W)\,\pi,$$

which defines a probability measure with γ given in Equation (2).

Importance Sampling

The relevance of Equations (4) and (5) lies in the fact that, by sampling X from the probability distribution with density ρ*, one removes the stochasticity from the problem, since the random variable

$$U = W(X) + \log\frac{\rho^*(X)}{\pi(X)}$$

is almost surely (a.s.) constant. As a consequence, the Monte Carlo scheme for computing the free energy on the left-hand side of Equation (4), based on the empirical mean of independent draws of U = U(X) with X ∼ ρ*, will have zero variance. This zero-variance property is a consequence of Jensen’s inequality and the strict concavity of the logarithm, which implies that equality is attained if and only if the random variable inside the expectation is almost surely constant. The next statement makes this precise.
Theorem 1 (Optimal importance sampling).
Let ρ* be the probability density given in Equation (5). Then the random variable $Z = \exp(-W)\,\pi/\rho^*$ has zero variance under ρ*, and we have:

$$Z = \mathbb{E}_\pi\big[\exp(-W)\big], \quad \rho^*\text{-a.s.}$$
Proof. 
We need to show that $\operatorname{Var}_{\rho^*}(Z) = \mathbb{E}_{\rho^*}[Z^2] - (\mathbb{E}_{\rho^*}[Z])^2 = 0$. Using Equation (5) and noting that ρ* > 0 since W is bounded and π > 0, it follows that Z has finite second moment and

$$\mathbb{E}_{\rho^*}[Z^2] - \big(\mathbb{E}_{\rho^*}[Z]\big)^2 = \mathbb{E}_\pi\big[\exp(-2W)\,\pi/\rho^*\big] - \big(\mathbb{E}_\pi[\exp(-W)]\big)^2 = \exp(-\gamma)\,\mathbb{E}_\pi\big[\exp(-W)\big] - \exp(-2\gamma) = 0,$$

where we have used that $\exp(-\gamma) = \mathbb{E}_\pi[\exp(-W)]$. ☐
The above theorem asserts that ρ*-almost surely (ρ*-a.s.)

$$Z = \mathbb{E}_{\rho^*}[Z],$$

which means that the importance sampling scheme based on estimating $\mathbb{E}_{\rho^*}[Z]$ using draws from the density ρ* is a zero-variance estimator of $\mathbb{E}_\pi[\exp(-W)]$. We will discuss the problem of drawing from an approximation of the optimal distribution ρ* later on in Section 4.
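The zero-variance property of Theorem 1 is easy to verify numerically in a case where ρ* is available in closed form. The following sketch uses the illustrative choice π = N(0,1) and W(x) = βx, for which $\gamma = -\beta^2/2$ and $\rho^* = N(-\beta, 1)$; the reweighted random variable Z then turns out constant sample by sample:

```python
import numpy as np

rng = np.random.default_rng(1)

beta = 2.0
N = 10
x = -beta + rng.standard_normal(N)     # draws from rho* = N(-beta, 1)

# Z = exp(-W) * pi / rho*; the common normalization of pi and rho* cancels.
log_pi = -0.5 * x**2
log_rho_star = -0.5 * (x + beta) ** 2
Z = np.exp(-beta * x + log_pi - log_rho_star)

print(Z)                               # constant, sample by sample
print(np.exp(0.5 * beta**2))           # E_pi[exp(-W)] = exp(beta^2/2)
```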
Remark 1.
Equation (4) furnishes the famous relation F = U − TS for the Helmholtz free energy F, with U being the internal energy, T the temperature and S denoting the Gibbs entropy. If we modify the previous assumptions by setting π ≡ 1 and W = βE, where $\beta = (k_B T)^{-1}$ with $k_B > 0$ being Boltzmann’s constant and E denoting a smooth potential energy function that is bounded from below and growing at infinity, then

$$-\beta^{-1}\log\int \exp(-\beta E)\,\mathrm{d}x = F = \min_{\rho > 0}\bigg\{\underbrace{\int E\,\rho\,\mathrm{d}x}_{=\,U} + \beta^{-1}\underbrace{\int \rho\log\rho\,\mathrm{d}x}_{=\,-TS}\bigg\},$$

with the unique minimizer being the Gibbs–Boltzmann density $\rho^* = \exp(-\beta E)/Z$ with normalization constant $Z = \exp(-\beta F)$. In the language of statistics, ρ* is a probability distribution from the exponential family with sufficient statistic E(X) and parameter β > 0.
An alternative variational characterization of expectations is discussed in Appendix A.

2.2. Computational Issues

In practice, the above result is of limited use, because the optimal importance sampling distribution is only known up to the normalizing constant C, where the latter is just the sought quantity $C = \exp(-\gamma)$. Clearly, we can resort to Markov chain Monte Carlo (MCMC) to generate samples $(\hat{Y}_i)_{i\geq 1}$ that are asymptotically distributed according to (see, e.g., [19])

$$\rho^*(y) = \frac{\exp(-\Phi(y))}{C}, \quad \Phi(y) = V(y) + W(y),$$

where π ∝ exp(−V).
However, in the situation at hand, we wish to estimate $\mathbb{E}_{\rho^*}[Z]$ in Equation (6), where $Z = \exp(-W)\,\pi/\rho^*$ is given in Theorem 1, and the problem is that the likelihood ratio π/ρ* is only known up to the normalizing factor. In this case, the self-normalized importance sampling estimator must be used (see, e.g., [20]):

$$\hat{C}_N = \frac{\sum_{i=1}^N \exp(-W(\hat{Y}_i))\,\exp(W(\hat{Y}_i))}{\sum_{i=1}^N \exp(W(\hat{Y}_i))} = \frac{N}{\sum_{i=1}^N \exp(W(\hat{Y}_i))},$$

which is a consistent estimator for $C = \exp(-\gamma)$. Note that, unlike the importance sampling estimators with known likelihood ratio, the self-normalized estimator is only asymptotically unbiased, even if we can draw exactly from ρ* (see Appendix B for details).
To avoid the bias due to the self-normalization, it is helpful to note that $\exp(\gamma) = \mathbb{E}_{\rho^*}[\exp(W)]$. As a consequence,

$$\hat{C}_N^{-1} = \frac{1}{N}\sum_{i=1}^N \exp(W(\hat{Y}_i))$$

is an unbiased estimator of $C^{-1} = \exp(\gamma)$, provided that we can generate i.i.d. samples from ρ*. Taking the logarithm, it follows that

$$\hat{\gamma}_N = -\log N + \log\sum_{i=1}^N \exp(W(\hat{Y}_i))$$

is a consistent estimator for γ, which by Jensen’s inequality and the strict concavity of the logarithm will again be only asymptotically unbiased.

Comparison with the Standard Monte Carlo Estimator

In most cases, the samples from ρ* will be generated by MCMC or the like. If we consider the advantages of $\hat{\gamma}_N$ as compared to the plain vanilla Monte Carlo estimator

$$\tilde{\gamma}_N = \log N - \log\sum_{i=1}^N \exp(-W(\hat{X}_i)),$$

with $(\hat{X}_i)_{i\geq 1}$ being a sample that is drawn from the reference distribution π, there are two aspects that will influence the efficiency of Equation (9) relative to Equation (10), namely:
(a)
the speed of convergence towards the stationary distribution and
(b)
the (asymptotic) variance of the estimator.
By construction, the asymptotic variance of the importance sampling estimator is zero (or close to zero once numerical discretization errors are taken into account); hence, the efficiency of the estimator (9) is solely determined by the speed of convergence of the corresponding MCMC algorithm to the stationary distribution ρ*, which, depending on the problem at hand, may be larger or smaller than the speed of convergence to π. It may even happen that π is unimodal, whereas $\rho^* \propto e^{-W}\pi$ is multimodal and hence difficult to sample from, for example, when π is the standard Gaussian density and $W = (x^2 - d)^2$ with d ≫ 0 is a bistable (energy) function. We refrain from going into details here and instead refer to the review article [21] for an in-depth discussion of the asymptotic properties of reversible diffusions.
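As a quick consistency check, the following sketch evaluates both estimators (9) and (10) under exact i.i.d. sampling, thereby sidestepping the MCMC question; for the simple linear W used here (an illustrative choice for which ρ* is a shifted Gaussian), the two estimators happen to perform comparably:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative choice: pi = N(0,1), W(x) = beta*x, so that
# gamma = -beta^2/2 and rho* = N(-beta, 1).
beta, N = 1.5, 100_000
gamma_exact = -0.5 * beta**2

y = -beta + rng.standard_normal(N)     # exact draws from rho*
gamma_hat = -np.log(N) + np.log(np.sum(np.exp(beta * y)))    # estimator (9)

x = rng.standard_normal(N)             # draws from pi
gamma_tilde = np.log(N) - np.log(np.sum(np.exp(-beta * x)))  # estimator (10)

print(gamma_exact, gamma_hat, gamma_tilde)
```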
In Section 4 and Section 5, we discuss alternatives to Monte Carlo sampling based on stochastic optimization and approximation algorithms that are feasible even for large-scale systems.
Remark 2.
The comparison of Equations (9) and (10) suggests that the importance sampling estimator (9) is an instance of Bennett’s bidirectional estimator for a positive random variable, called “weighting function” in Bennett’s language [22]. As a consequence, Theorem 1 implies that Bennett’s bidirectional estimator has zero variance when the negative logarithm of the weighting function equals the bias potential.

3. Certainty Equivalence in Path Space

The previous considerations nicely generalize from the case of real-valued random variables to time-dependent problems and path functionals.

3.1. Donsker–Varadhan Variational Principle in Path Space

Let $(X_s)_{s\geq 0}$ with $X_0 = x \in \mathbb{R}^n$ be the solution of the stochastic differential equation (SDE)

$$\mathrm{d}X_s = b(X_s, s)\,\mathrm{d}s + \sigma(X_s)\,\mathrm{d}B_s, \quad X_0 = x,$$

where $b\colon \mathbb{R}^n \times [0,\infty) \to \mathbb{R}^n$ is a smooth, possibly time-dependent vector field, $\sigma\colon \mathbb{R}^n \to \mathbb{R}^{n\times m}$ is a smooth matrix field and B is an m-dimensional Brownian motion. Our standard example will be an SDE with $b(x,s) = -\nabla V(x)$ for a smooth potential energy function V and $\sigma(x) = \sqrt{2}\,I_{n\times n}$, so that $X_s$ satisfies a gradient dynamics. We assume throughout this paper that the functions b, σ, V are such that Equation (11) or the corresponding gradient dynamics have unique strong solutions for all s ≥ 0.
Now, suppose that we want to compute the free energy (2), where W is now considered to be a functional of the paths $X = \{X_s : 0 \leq s \leq \tau\}$ for some bounded stopping time τ:

$$W_\tau(X) = \int_0^\tau f(X_s, s)\,\mathrm{d}s + g(X_\tau),$$

for some bounded and sufficiently smooth, real-valued functions f, g. We assume throughout the rest of the paper that f, g are bounded from below and that W is integrable.
We define P to be the probability measure on the space $\Omega = C([0,\infty), \mathbb{R}^n)$ of continuous trajectories that is induced by the Brownian motion $(B_s)_{s\geq 0}$ that drives the SDE (11). We call P a path space measure, and we denote the expectation with respect to P by $\mathbb{E}_P[\cdot]$.
Definition 3 (Path space free energy).
Let $(X_s)_{s\geq 0}$ be the solution of Equation (11) and $W_\tau = W_\tau(X) \geq 0$ be integrable and defined by Equation (12). The quantity

$$\gamma = -\log \mathbb{E}_P\big[\exp(-W_\tau)\big] = -\log \mathbb{E}_P\Big[\exp\Big(-\int_0^\tau f(X_s,s)\,\mathrm{d}s - g(X_\tau)\Big)\Big]$$
is called the free energy of W τ with respect to the path space measure P.
Note that Equation (13) is simply the path space version of Equation (2), which now implicitly depends on the initial condition $X_0 = x$. The Donsker–Varadhan variational principle now reads

$$\gamma = \inf_{Q \ll P} \Big\{ \mathbb{E}_Q\Big[\int_0^\tau f(X_s,s)\,\mathrm{d}s + g(X_\tau)\Big] + D(Q\,|\,P) \Big\},$$

where $Q \ll P$ stands for absolute continuity of Q with respect to P, which means that P(E) = 0 implies Q(E) = 0 for any measurable set $E \subset \Omega$, as a consequence of which

$$D(Q\,|\,P) = \int_\Omega \log\frac{\mathrm{d}Q}{\mathrm{d}P}(\omega)\,\mathrm{d}Q(\omega)$$

exists. Note that Equation (15) is just the generalization of the relative entropy (3) from probability densities on $\mathbb{R}^n$ to probability measures on the measurable space $(\Omega, \mathcal{E})$, with $\mathcal{E}$ being a σ-algebra containing the measurable subsets of Ω, where we again declare that $D(Q\,|\,P) = \infty$ when Q is not absolutely continuous with respect to P. Therefore, it is sufficient that the infimum in Equation (14) is taken over all path space measures $Q \ll P$.
If $W_\tau \geq 0$, it is again a simple convexity argument (see, e.g., [4]) which shows that the minimum in Equation (14) is attained at Q* given by:

$$\frac{\mathrm{d}Q^*}{\mathrm{d}P}\bigg|_{[0,\tau]} = \exp(\gamma - W_\tau),$$

where $\varphi|_{[0,\tau]}$ denotes the restriction of the path space density $\varphi(X) = (\mathrm{d}Q^*/\mathrm{d}P)(X)$ to trajectories $X = (X_s)_{s\geq 0}$ of length τ. More precisely, $\varphi|_{[0,\tau]}$ is understood as the restriction of the measure Q* defined by $\mathrm{d}Q^* = \varphi\,\mathrm{d}P$ to the σ-algebra $\mathcal{F}_\tau$ that contains all measurable sets $E \in \mathcal{E}$ with the property that, for every t ≥ 0, the set $E \cap \{\tau \leq t\}$ is an element of the σ-algebra $\mathcal{F}_t = \sigma(X_s : 0 \leq s \leq t)$ that is generated by all trajectories $(X_s)_{0\leq s\leq t}$ of length t. In other words, $\mathcal{F}_\tau \subset \mathcal{E}$ is a σ-algebra that contains the history of the trajectories up to the (random) length τ.

Even though Equation (16) is the direct analogue of Equation (5), this result is not particularly useful if we do not know how to sample from Q*. Therefore, let us first characterize the admissible path space measures $Q \ll P$ and discuss the practical implications later on.

3.1.1. Likelihood Ratio of Path Space Measures

It turns out that the only admissible change of measure from P to Q such that $D(Q\,|\,P) < \infty$ results in a change of the drift in Equation (11). Let $(u_s)_{s\geq 0}$ be an $\mathbb{R}^m$-valued stochastic process that is adapted, in that $u_t$ depends only on the Brownian motion $(B_s)_{s\leq t}$ up to time t, and that satisfies the Novikov condition (see, e.g., [23]):

$$\mathbb{E}_P\Big[\exp\Big(\frac{1}{2}\int_0^\tau |u_s|^2\,\mathrm{d}s\Big)\Big] < \infty.$$
Now, define the auxiliary process

$$B_t^u = B_t - \int_0^t u_s\,\mathrm{d}s.$$

Using the definition of $B_t^u$, we may write Equation (11) as

$$\mathrm{d}X_s = \big(b(X_s,s) + \sigma(X_s)\,u_s\big)\,\mathrm{d}s + \sigma(X_s)\,\mathrm{d}B_s^u, \quad X_0 = x.$$
Note that Equations (11) and (18) govern the same process $(X_s)_{s\geq 0}$, because the extra drift $\sigma(\cdot)\,u$ is absorbed by the shifted mean of the process $(\sigma(X_s)B_s^u)_{s\geq 0}$. By construction, $(B_s^u)_{s\geq 0}$ is not a Brownian motion under P, because its expectation with respect to P is not zero in general. On the other hand, $(B_s)_{s\geq 0}$ is a Brownian motion under the measure P, and our aim is to find a measure Q under which $(B_s^u)_{s\geq 0}$ is a Brownian motion. To this end, let $(Z_s^u)_{s\geq 0}$ be the process defined by
$$Z_t^u = \int_0^t u_s\cdot\mathrm{d}B_s - \frac{1}{2}\int_0^t |u_s|^2\,\mathrm{d}s,$$

or, equivalently,

$$Z_t^u = \int_0^t u_s\cdot\mathrm{d}B_s^u + \frac{1}{2}\int_0^t |u_s|^2\,\mathrm{d}s.$$

Girsanov’s theorem (see, e.g., [23], Theorem 8.6.4, or Appendix C) now states that $(B_s^u)_{0\leq s\leq\tau}$ is a standard Brownian motion under the probability measure Q with likelihood ratio

$$\frac{\mathrm{d}Q}{\mathrm{d}P}\bigg|_{[0,\tau]} = \exp(Z_\tau^u)$$
with respect to P, where the Novikov condition (17) guarantees that $\mathbb{E}_P[\exp(Z_\tau^u)] = 1$, i.e., that Q is a probability measure. Inserting Equations (20) and (21) into the Donsker–Varadhan formula, Equation (14), and using that $B_s^u$ is a Brownian motion with respect to Q, it follows that the stochastic integral term in $Z_\tau^u$ in the expression of the relative entropy, which is linear in u, drops out (it has zero expectation under Q), and what remains is (cf. [4,6]):

$$\gamma = \inf_u\,\mathbb{E}_Q\Big[\int_0^\tau \Big(f(X_s,s) + \frac{1}{2}|u_s|^2\Big)\,\mathrm{d}s + g(X_\tau)\Big],$$

with $X_s$ being the solution of Equation (18). Since the distribution of $B^u$ under Q is the same as the distribution of B under P, an equivalent representation of the last equation is

$$\gamma = \inf_u\,\mathbb{E}_P\Big[\int_0^\tau \Big(f(X_s^u,s) + \frac{1}{2}|u_s|^2\Big)\,\mathrm{d}s + g(X_\tau^u)\Big],$$
where $X_s^u$ is the solution of the controlled SDE

$$\mathrm{d}X_s^u = \big(b(X_s^u,s) + \sigma(X_s^u)\,u_s\big)\,\mathrm{d}s + \sigma(X_s^u)\,\mathrm{d}B_s, \quad X_0^u = x,$$

with $B_s$ being our standard, m-dimensional Brownian motion (under P). See Appendix C for a sketch of the derivation of Girsanov’s formula.
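The change of measure can be illustrated with a minimal Euler–Maruyama sketch: simulating the controlled SDE (23) and accumulating $Z_\tau^u$ along the trajectories (with the simulated driving noise playing the role of $B^u$ in Equation (20)) reproduces the uncontrolled expectation $\mathbb{E}_P[\exp(-W_\tau)]$ for any admissible u. The double-well potential, the constant control and all numerical parameters below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# 1d example: b = -V' with V(x) = (x^2 - 1)^2, sigma = sqrt(2), f = 0 and
# terminal cost g, so that W_tau = g(X_T) with tau = T deterministic.
V_prime = lambda x: 4.0 * x * (x**2 - 1.0)
g = lambda x: (x - 1.0) ** 2

T, dt = 1.0, 1e-3
n = int(T / dt)
sigma = np.sqrt(2.0)
u = 1.0                  # constant (suboptimal but admissible) control
M = 10_000               # number of trajectories

x = -np.ones(M)          # controlled trajectories, X_0 = -1
Z = np.zeros(M)          # accumulates Z_tau^u as in Equation (20)
xr = -np.ones(M)         # reference: uncontrolled trajectories

for _ in range(n):
    dB = np.sqrt(dt) * rng.standard_normal(M)
    Z += u * dB + 0.5 * u**2 * dt
    x += (-V_prime(x) + sigma * u) * dt + sigma * dB
    xr += -V_prime(xr) * dt + sigma * np.sqrt(dt) * rng.standard_normal(M)

# Both estimators target E_P[exp(-W_tau)]:
print(np.mean(np.exp(-Z - g(x))))   # controlled and reweighted
print(np.mean(np.exp(-g(xr))))      # plain Monte Carlo
```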

3.1.2. Importance Sampling in Path Space

Similarly to the finite-dimensional case considered in the last section, we can derive optimal importance sampling strategies from the Donsker–Varadhan principle. To this end, we consider the case that τ is a random stopping time, which is a case that is often relevant in applications (e.g., when computing transition rates or committor functions [7]), but that is rarely considered in the importance sampling literature. Let T > 0 and $O \subset \mathbb{R}^n$ be an open and bounded set with smooth boundary ∂O. We define

$$\tau_O = \inf\{s > 0 : X_s^u \notin O\}$$

as the first exit time from the set O and define the stopping time

$$\tau = \tau_O \wedge T$$

to be the minimum of $\tau_O$ and T, i.e., the exit from the set O or the end of the maximum time interval, whatever comes first. For ease of notation, we will use the same symbol τ to denote the stopping time under the controlled or uncontrolled process (i.e., for u = 0) throughout the article. Unless otherwise noted, it should be clear from the context whether τ is understood with respect to $X^u$ or $X = X^{u=0}$. Here, $X_s^u$ satisfies the controlled SDE (23).
We will argue that the optimal Q*, which yields zero variance in the reweighting scheme

$$\mathbb{E}_Q\big[Y_\tau^u\big] = \mathbb{E}_P\big[\exp(-W_\tau)\big]$$

via $Y_\tau^u = \exp(-Z_\tau^u - W_\tau^u)$, where $W_\tau^u = W_\tau(X^u)$, can be generated by a feedback control of the form

$$u_s = c(X_s^u, s),$$

with a suitable function $c\colon D \to \mathbb{R}^m$ defined on a subset $D \subset \mathbb{R}^n \times [0,\infty)$. Finding u turns the Donsker–Varadhan variational principle (14) into an optimal control problem by virtue of Equations (22) and (23). The following statement characterizes the optimal control by which the infimum in Equation (14) is attained and which, as a consequence, provides a zero variance reweighting scheme (or: change of measure).
Theorem 2.
Let

$$\Psi(x,t) = \mathbb{E}_P\Big[\exp\Big(-\int_t^\tau f(X_s,s)\,\mathrm{d}s - g(X_\tau)\Big)\,\Big|\,X_t = x\Big]$$

be the exponential of the negative free energy, considered as a function of the initial condition $X_t = x$ with $0 \leq t \leq \tau \leq T$. Then, the path space measure Q* induced by the feedback control

$$u_s^* = \sigma(X_s^{u^*})^T\,\nabla_x\log\Psi(X_s^{u^*}, s)$$

yields a zero variance estimator, i.e.,

$$Y_\tau^{u^*} = \Psi(x,0), \quad Q^*\text{-a.s.}$$
Proof. 
See Appendix D. ☐
Remark 3.
We should mention that Theorem 2 also covers the special cases that either τ = T is a deterministic stopping time (see, e.g., [24], Proposition 5.4.4) or, by sending T → ∞, that $\tau = \tau_O$ is the first exit time from the set O, assuming that the stopping time $\tau_O$ is a.s. finite (but not necessarily bounded).

3.2. Revisiting Jarzynski’s Identity

The Donsker–Varadhan variational principle shares some features with the nonequilibrium free energy formula of Jarzynski [10], and, in fact, the variational form makes this formula amenable to the analysis of the previous paragraphs, with the aim of improving the quality of the corresponding statistical estimators. Jarzynski’s identity relates the Helmholtz equilibrium free energy to averages that are taken over an ensemble of non-equilibrium trajectories generated by forcing the dynamics.
We discuss a possible application of importance sampling to free energy calculation à la Jarzynski with a simple standard example, but we stress that all considerations easily generalize to more general situations than the one treated below.
As an example, let $(V_\lambda)_{0\leq\lambda\leq 1}$ be a parametric family of smooth potential energy functions $V_\lambda\colon \mathbb{R}^n \to \mathbb{R}$ and define the free energy difference between the two equilibrium densities $\pi_0 \propto \exp(-V_0)$ and $\pi_1 \propto \exp(-V_1)$ as the log-ratio

$$\Delta F = -\log\frac{\int_{\mathbb{R}^n}\exp(-V_1(x))\,\mathrm{d}x}{\int_{\mathbb{R}^n}\exp(-V_0(x))\,\mathrm{d}x}.$$
(Often, $\pi_0$ and $\pi_1$ are called thermodynamic states.) Defining the energy difference $V_{\mathrm{diff}} = V_1 - V_0$ and the equilibrium probability density

$$\pi_0(x) = \frac{\exp(-V_0(x))}{\int_{\mathbb{R}^n}\exp(-V_0(x))\,\mathrm{d}x},$$

the Helmholtz free energy difference is seen to be an exponential average of the familiar form (2):

$$\Delta F = -\log\mathbb{E}_{\pi_0}\big[\exp(-V_{\mathrm{diff}})\big].$$
Jarzynski’s formula [10] states that the last equation can be represented as an exponential average over non-stationary realizations of a parameter-dependent process $X^\lambda = (X_s^\lambda)_{0\leq s\leq T}$. Specifically, letting $W_T^\lambda = W_T^\lambda(X^\lambda)$ denote the nonequilibrium work done on the system by varying the parameter from λ = 0 to λ = 1 within the time interval [0, T], Jarzynski’s equality states that

$$\Delta F = -\log\mathbb{E}\big[\exp(-W_T^\lambda)\big],$$
where $W_T^\lambda$ will be specified below. In the last equation, the expectation is taken over all realizations of $X^\lambda$, with initial conditions distributed according to the equilibrium density $\pi_0$. To be specific, we assume that the parametric process $X_s^\lambda$ is the solution of the SDE

$$\mathrm{d}X_s^\lambda = -(1-\lambda_s)\,\nabla V_0(X_s^\lambda)\,\mathrm{d}s - \lambda_s\,\nabla V_1(X_s^\lambda)\,\mathrm{d}s + \sqrt{2}\,\mathrm{d}B_s,$$

with $(\lambda_s)_{0\leq s\leq T}$ being a differentiable parameter process (called: protocol) that interpolates between $\lambda_0 = 0$ and $\lambda_T = 1$. Further, let the work exerted by the protocol be given by

$$W_T^\lambda = \int_0^T \frac{\partial V_\lambda}{\partial\lambda}\bigg|_{\lambda=\lambda_s}(X_s^\lambda)\,\dot{\lambda}_s\,\mathrm{d}s = \int_0^T V_{\mathrm{diff}}(X_s^\lambda)\,\dot{\lambda}_s\,\mathrm{d}s,$$

where $V_\lambda(x) = (1-\lambda)V_0(x) + \lambda V_1(x)$ is the interpolated potential and $\dot{\lambda}_s = \mathrm{d}\lambda_s/\mathrm{d}s$ denotes the time derivative of $\lambda_s$. Note that $W_T^\lambda$ is a path functional of the standard form (12), with bounded deterministic stopping time τ = T and cost functions

$$f(x, s) = V_{\mathrm{diff}}(x)\,\dot{\lambda}_s, \quad g \equiv 0.$$
Letting now P denote the path space measure that is generated by the Brownian motion $(B_s)_{s\geq 0}$ in the parameter-dependent SDE (31), we can express Jarzynski’s equality (30) by

$$\Delta F = -\log\mathbb{E}\big[\exp(-W_T^\lambda)\big] = -\log\int_{\mathbb{R}^n}\mathbb{E}_P\big[\exp(-W_T^\lambda)\big]\,\pi_0(x)\,\mathrm{d}x,$$

where the (conditional) expectation $\mathbb{E}_P[\exp(-W_T^\lambda)] = \mathbb{E}_P[\exp(-W_T^\lambda(X^\lambda))\,|\,X_0^\lambda = x]$ is understood over all realizations of (31) with initial condition $X_0^\lambda = x$.
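For a toy system in which ΔF is known in closed form, the plain (uncontrolled) Jarzynski estimator can be sketched in a few lines. The quadratic potentials $V_0(x) = x^2/2$ and $V_1(x) = x^2$ (for which $\Delta F = \frac{1}{2}\log 2$), the linear protocol $\lambda_s = s/T$ and all numerical parameters are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

V_diff = lambda x: 0.5 * x**2              # V1 - V0
grad_V = lambda x, lam: (1.0 + lam) * x    # gradient of (1-lam)*V0 + lam*V1

T, dt = 2.0, 1e-3
n = int(T / dt)
M = 50_000
lam_dot = 1.0 / T                          # linear protocol lambda_s = s/T

x = rng.standard_normal(M)                 # X_0 ~ pi_0 = N(0,1)
work = np.zeros(M)

for k in range(n):
    lam = k * dt / T
    work += V_diff(x) * lam_dot * dt       # W += V_diff(X_s) lambda_dot ds
    x += -grad_V(x, lam) * dt + np.sqrt(2.0 * dt) * rng.standard_normal(M)

dF_hat = -np.log(np.mean(np.exp(-work)))
print(dF_hat, 0.5 * np.log(2.0))           # estimate vs. exact value
```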

Optimized Protocols by Adaptive Importance Sampling

The applicability of Jarzynski’s formula heavily depends on the choice of the protocol $\lambda_s$. The observation that an uneducated choice of a protocol may render the corresponding statistical estimator virtually useless, because of a dramatic increase of its variance, is in accordance with what one observes in importance sampling. An attempt to optimize the protocol by minimizing the variance of the estimator has been carried out in [12], but here we shall follow an alternative route, exploiting the fact that Jarzynski’s formula has the familiar exponential form considered in this paper; cf. also [11,13,14].
This said, and recalling Theorem 2, it is plausible that there exists a zero variance estimator for $\mathbb{E}_P[\exp(-W_T^\lambda)]$, which appeared in the integrand of Jarzynski’s equality (33), under certain assumptions on the functional $W_T^\lambda$. For simplicity, we confine the following considerations to the above example of a diffusion process of the form (31) with a deterministic protocol $(\lambda_s)_{s\in[0,T]}$. To make the idea of optimizing the protocol more precise, we introduce the shorthand $Y_s = X_s^\lambda$ for the solution of Equation (31) and define
$$\gamma(x,t) = \min_v\,\mathbb{E}_P\Big[\int_t^T \Big(f(Y_s^v,s) + \frac{1}{2}|v_s|^2\Big)\,\mathrm{d}s\,\Big|\,Y_t^v = x\Big],$$

with $f(x,s) = V_{\mathrm{diff}}(x)\,\dot{\lambda}_s$ and the expectation taken over all realizations of $Y_s^v$. The process $Y_s^v$ solves a controlled variant of the SDE (31), specifically,

$$\mathrm{d}Y_s^v = \big(b_\lambda(Y_s^v,s) + \sqrt{2}\,v_s\big)\,\mathrm{d}s + \sqrt{2}\,\mathrm{d}B_s, \quad Y_t^v = x.$$
Here, we have used the shorthand $b_\lambda(x,s) = -(1-\lambda_s)\nabla V_0(x) - \lambda_s\nabla V_1(x)$. Theorem 2, which specifies the zero-variance importance sampling estimator in terms of a feedback control policy, can be adapted to our situation (see, e.g., [7,18]) by letting $O \nearrow \mathbb{R}^n$, so that $\tau = \tau_O \wedge T \to T$ a.s. The zero-variance estimator is generated by the feedback control

$$v_s^* = -\sqrt{2}\,\nabla_y\gamma(Y_s^{v^*}, s),$$

with γ(x,t) given by Equation (34), and thus by the SDE

$$\mathrm{d}Y_s^v = \big(b_\lambda(Y_s^v,s) - 2\,\nabla_y\gamma(Y_s^v,s)\big)\,\mathrm{d}s + \sqrt{2}\,\mathrm{d}B_s, \quad Y_0^v = x.$$
Specifically, given N independent draws $x_1,\ldots,x_N \sim \pi_0$ from the equilibrium distribution and N corresponding independent trajectories $(Y_s^v)_{s\in[0,T]}$ of the SDE (36) with initial conditions $Y_0^v = x_i$, an asymptotically unbiased, minimum-variance estimator of the free energy is given by

$$\Delta\hat{F}_N = -\log\bigg(\frac{1}{N}\sum_{i=1}^N G_T(x_i)\bigg),$$

where $G_T = \exp\big(-Z_T^v - W_T^\lambda(Y^v)\big)$, with $Z_T^{u=v}$ given by Equation (19) and $W_T^\lambda(Y^v)$ being the nonequilibrium work (32) under the controlled process (36).
Remark 4.
Generally, the discretization of the work $W_T^\lambda$ requires some care, because the discretization error may introduce some “shadow work” that may spoil the properties of the importance sampling estimator [25]. Further note that, even if time-discretization errors are ignored, the estimator (37) is not a zero-variance estimator, because we have minimized only the variance of the conditional estimator (for a fixed initial condition). Moreover, the estimator is only asymptotically unbiased, by Jensen’s inequality and the strict concavity of the logarithm.
Further, notice that the estimator hinges on the availability of γ(x,t), which is typically difficult to compute. An idea, inspired by the adaptive biasing force (ABF) algorithm [26,27,28], is to estimate γ on the fly and then iteratively refine the estimate in the course of the simulation using a suitable parametric representation [29,30]. If good collective variables or reaction coordinates are known, it is further possible to choose a representation that depends only on these variables and still obtain low variance estimators [31,32].

4. Algorithms: Gradient Descent, Cross Entropy Minimization and beyond

According to Theorem 2, designing reweighting (importance sampling) schemes on path space that feature zero variance estimators comes at the price of solving an optimal control problem of the following form: minimize the cost functional

$$J(u) = \mathbb{E}_P\Big[\int_0^\tau \Big(f(X_s^u,s) + \frac{1}{2}|u_s|^2\Big)\,\mathrm{d}s + g(X_\tau^u)\Big]$$

over all admissible controls and subject to the dynamics

$$\mathrm{d}X_s^u = \big(b(X_s^u,s) + \sigma(X_s^u)\,u_s\big)\,\mathrm{d}s + \sigma(X_s^u)\,\mathrm{d}B_s.$$
Here, admissible controls are Markovian feedback controls $u_s = c(X_s^u, s)$ such that Equation (39) has a unique strong solution. Leaving all technical details aside (see Section IV.3 in [8]), it can be shown that the value function (or: optimal cost-to-go)

$$\gamma(x,t) = \mathbb{E}_P\Big[\int_t^\tau \Big(f(X_s^{u^*},s) + \frac{1}{2}|u_s^*|^2\Big)\,\mathrm{d}s + g(X_\tau^{u^*})\,\Big|\,X_t^{u^*} = x\Big],$$

with u* being the unique optimal control given by Equation (26), is the solution of a nonlinear partial differential equation of Hamilton–Jacobi–Bellman type. Solving this equation numerically is typically even more difficult than solving the original sampling problem by brute-force Monte Carlo (especially when the state space dimension n is large).
Note that Equations (38) and (39) are simply the concrete form of the Donsker–Varadhan principle when the path space measure is generated by a diffusion. Therefore, the agreement with the path space free energy (13) or (34) is not a coincidence, because by definition the value function is the free energy, considered as a function of the initial conditions. In other words, and in view of Theorem 2, there is no need for further sampling once the value function is known.
We will now discuss concrete numerical algorithms to minimize Equations (38) and (39) without resorting to the associated Hamilton–Jacobi–Bellman equation.

4.1. Gradient Descent

The fact that solving the optimal control problem can be as difficult as solving the sampling problem suggests combining the two in an iterative fashion, using a parametric representation of the value function (or: free energy). To this end, notice that the optimal control is essentially a gradient force that can be approximated by

$$\hat{u}_s = -\sigma(X_s^{\hat{u}})^T\sum_{i=1}^N \alpha_i\,\nabla_x\phi_i(X_s^{\hat{u}}, s),$$

based on a finite-dimensional approximation

$$\hat{\gamma}(x,t) = \sum_{i=1}^N \alpha_i\,\phi_i(x,t)$$

of the value function, with suitable smooth basis functions $\{\phi_i\colon \bar{D} \to \mathbb{R} : i = 1,\ldots,N\}$ that span an N-dimensional subspace of the space $C^{2,1}(D) \cap C(\bar{D})$ of classical solutions of the associated Hamilton–Jacobi–Bellman equation. Here, we denote by $C^{r,s}(\mathbb{R}^n \times [0,\infty))$ the Banach space of functions that are r and s times continuously differentiable in their first and second arguments, respectively, and $C(\mathbb{R}^n \times [0,\infty)) = C^{0,0}(\mathbb{R}^n \times [0,\infty))$ the space of continuous functions. Plugging the above representation into Equations (38) and (39) yields the following finite-dimensional optimization problem: minimize

$$J(\hat{u}) = \mathbb{E}_P\Big[\int_0^\tau \Big(f(X_s^{\hat{u}},s) + \frac{1}{2}|\hat{u}_s|^2\Big)\,\mathrm{d}s + g(X_\tau^{\hat{u}})\Big]$$

over the controls û, where $X^{\hat{u}}$ is the solution of the SDE (23) with control $u = \hat{u}$.
Let us define $\hat{J}(\alpha) = J(\hat{u}(\alpha))$, with the shorthand $\alpha = (\alpha_1,\ldots,\alpha_N)^T \in \mathbb{R}^N$. Because of the dependence of the process $X^\alpha$ and the random stopping time $\tau = \tau^\alpha$ on the parameter α, the functional Ĵ is not quadratic in α, but it has been shown [33] that it is strongly convex if the basis functions $\phi_i$ are non-overlapping. In this case, Ĵ has a unique minimum, which suggests a gradient descent in the parameter α:

$$\alpha^{(m+1)} = \alpha^{(m)} - h_m\,\nabla\hat{J}\big(\alpha^{(m)}\big).$$
Here, $(h_m)_{m\geq 0}$ is a sequence of step sizes that goes to zero as $m \to \infty$, and the gradient $\nabla\hat{J}(\alpha)$ must be interpreted in the sense of a functional derivative:

$$\delta J(\hat{u})\cdot\xi = \frac{\mathrm{d}}{\mathrm{d}\epsilon}J(\hat{u} + \epsilon\xi)\Big|_{\epsilon=0},$$

for suitable test functions $\xi \in \mathcal{V}$ (i.e., square-integrable and adapted to the Brownian motion). Then, the gradient $\nabla\hat{J}(\alpha)$ has the components

$$\frac{\partial\hat{J}}{\partial\alpha_k} = -\,\delta J(\hat{u}(\alpha))\cdot\big(\sigma^T\nabla_x\phi_k\big).$$
Introducing the shorthand

$$\ell(X^{\hat{u}}, \hat{u}) = \int_0^\tau \Big(f(X_s^{\hat{u}},s) + \frac{1}{2}|\hat{u}_s|^2\Big)\,\mathrm{d}s + g(X_\tau^{\hat{u}})$$

for the cost and the convention $\mathbb{E}[\cdot] = \mathbb{E}_P[\cdot]$ for the expectation with respect to P, the derivative (45) can again be found by means of Girsanov’s formula: there exists a measure $Q_\epsilon$ that is absolutely continuous with respect to the reference measure P, such that

$$\frac{\mathrm{d}}{\mathrm{d}\epsilon}J(\hat{u}+\epsilon\xi)\Big|_{\epsilon=0} = \frac{\mathrm{d}}{\mathrm{d}\epsilon}\mathbb{E}\big[\ell(X^{\hat{u}+\epsilon\xi}, \hat{u}+\epsilon\xi)\big]\Big|_{\epsilon=0} = \frac{\mathrm{d}}{\mathrm{d}\epsilon}\mathbb{E}\Big[\ell(X, \hat{u}+\epsilon\xi)\,\frac{\mathrm{d}Q_\epsilon}{\mathrm{d}P}\Big]\Big|_{\epsilon=0},$$

with the likelihood ratio

$$\frac{\mathrm{d}Q_\epsilon}{\mathrm{d}P}\bigg|_{[0,\tau]} = \exp\big(Z_\tau^{\hat{u}+\epsilon\xi}\big).$$

Assuming that the derivative and the expectation in Equation (47) commute, we can differentiate inside the expectation $\mathbb{E}[\cdot]$, which is independent of the parameter ε, and then switch back to the controlled process $X^{\hat{u}}$ under the reference measure P, by which we obtain (see [33]):

$$\delta J(\hat{u})\cdot\xi = \mathbb{E}\Big[\ell(X^{\hat{u}},\hat{u})\int_0^\tau \xi_s\cdot\mathrm{d}B_s + \int_0^\tau \hat{u}_s\cdot\xi_s\,\mathrm{d}s\Big].$$
Hence, using Equation (46), we find

$$\frac{\partial\hat{J}}{\partial\alpha_k} = -\,\mathbb{E}\Big[\ell(X^{\hat{u}},\hat{u})\int_0^\tau \big(\sigma^T\nabla_x\phi_k\big)(X_s^{\hat{u}},s)\cdot\mathrm{d}B_s + \int_0^\tau \hat{u}_s\cdot\big(\sigma^T\nabla_x\phi_k\big)(X_s^{\hat{u}},s)\,\mathrm{d}s\Big],$$

where the last expression can be estimated by Monte Carlo, possibly in combination with variance-minimizing strategies to improve the convergence of the gradient estimation in the course of the gradient descent [33,34]. Before we conclude, we shall briefly explain why the gradient vanishes when the variance is zero.
Lemma 1.
Under the optimal control u*, it holds that:

$$\delta J(u^*)\cdot\xi = 0 \quad \forall\,\xi \in \mathcal{V}.$$
Proof. 
By the Itô isometry (see [23], Corollary 3.1.7), we can recast Equation (48) as:

$$\delta J(u^*)\cdot\xi = \mathbb{E}\Big[\Big(\ell(X^{u^*}, u^*) + \int_0^\tau u_s^*\cdot\mathrm{d}B_s\Big)\int_0^\tau \xi_s\cdot\mathrm{d}B_s\Big] = \mathbb{E}\Big[\big(W_\tau(X^{u^*}) + Z_\tau^{u^*}\big)\int_0^\tau \xi_s\cdot\mathrm{d}B_s\Big] = \big(W_\tau(X^{u^*}) + Z_\tau^{u^*}\big)\,\mathbb{E}\Big[\int_0^\tau \xi_s\cdot\mathrm{d}B_s\Big],$$

where in the last equality we have used that W + Z is a.s. constant under the optimal control. Since B is a Brownian motion under P, the expectation is zero, and it follows that

$$\delta J(u^*)\cdot\xi = 0 \quad \forall\,\xi \in \mathcal{V},$$

and hence the assertion is proven. ☐
We summarize the above considerations in Algorithm 1 below.
Algorithm 1 Gradient descent
  • Set maximum no. of iterations Maxit and tolerance ε > 0
  • Initialize m = 0, $\alpha^{(0)} \in \mathbb{R}^N$ and $h_0 > 0$
  • Evaluate $G_0 = \nabla\hat{J}(\alpha^{(0)})$
  • while m < Maxit and $h_m > \epsilon$
  •   $\alpha^{(m+1)} = \alpha^{(m)} - h_m G_m$
  •   Evaluate $G_{m+1} = \nabla\hat{J}(\alpha^{(m+1)})$
  •   Evaluate step size, e.g., $h_{m+1} = \dfrac{\big(\alpha^{(m+1)} - \alpha^{(m)}\big)\cdot\big(G_{m+1} - G_m\big)}{|G_{m+1} - G_m|^2}$
  •   m ← m + 1
  • end while
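A compact numerical sketch of the gradient descent is given below for a one-dimensional toy problem with a single basis function $\phi(x) = x$, in which the gradient is estimated by Monte Carlo according to the reconstruction of Equation (50) above; for this linear-quadratic setting one can check by hand that the minimizer is $\alpha^* = -0.8$, which the iteration should approach. All modelling choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy problem: b = 0, sigma = sqrt(2), f = 0, g(x) = (x - 2)^2 on [0, T],
# with one basis function phi(x) = x, so u_hat = -sigma*alpha is constant.
T, dt = 1.0, 1e-2
n, M = int(T / dt), 5_000
sigma = np.sqrt(2.0)
g = lambda x: (x - 2.0) ** 2

def grad_estimate(alpha):
    """Monte Carlo estimate of dJ/dalpha via Equation (50), phi(x) = x."""
    u = -sigma * alpha                  # u_hat = -sigma * alpha * phi'(x)
    x = np.zeros(M)
    cost = np.zeros(M)                  # running cost, becomes ell(X^u, u)
    stoch = np.zeros(M)                 # int (sigma phi') dB
    cross = np.zeros(M)                 # int u (sigma phi') ds
    for _ in range(n):
        dB = np.sqrt(dt) * rng.standard_normal(M)
        cost += 0.5 * u**2 * dt
        stoch += sigma * dB
        cross += u * sigma * dt
        x += sigma * u * dt + sigma * dB   # dX = sigma*u dt + sigma dB
    cost += g(x)
    return -np.mean(cost * stoch + cross)  # cf. Equation (50)

alpha = 0.0
for m in range(100):                    # plain descent, constant step size
    alpha -= 0.02 * grad_estimate(alpha)
print(alpha)                            # should approach alpha* = -0.8
```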
Remark 5.
The step size control in Algorithm 1 follows the Barzilai–Borwein procedure, which guarantees convergence as $m \to \infty$ when the functional is convex. An alternative is to do a line search after each iterate in the descent direction and then determine $h_{m+1}$ so that it satisfies the Wolfe conditions; see [35] for further details.
Remark 6.
In practice, it may be advantageous to pick basis functions that are not explicitly time-dependent (e.g., Gaussians, Chebyshev polynomials or the like). If the associated control problem is stationary, as is, for example, the case when the SDE is homogeneous and the stopping time is a hitting time, the value function will be stationary too and, as a consequence, the control policy will be stationary. If, however, the problem is explicitly time-dependent, one may change the ansatz (42) to have stationary basis functions but time-dependent coefficients $\alpha_i$, where the time-dependence is mediated by the initial data; see [29] for a discussion.

4.2. Cross-Entropy Minimization

Another algorithm for minimizing J(û) is based on an entropy representation of J(u), namely,

$$J(u) = J(u^*) + D(Q\,|\,Q^*),$$

where u is any admissible control for Equations (38) and (39), u* is the optimal control, and Q = Q(u) and Q* = Q(u*) are the corresponding path space measures. Equation (51) is a consequence of the zero-variance property of the optimal change of measure, since Equation (27) implies that

$$\exp(-W_\tau)\,\frac{\mathrm{d}P}{\mathrm{d}Q^*} = \Psi(x,0) = \exp(-J(u^*))$$
and hence

$$W_\tau - \log\Big(\frac{\mathrm{d}P}{\mathrm{d}Q}\,\frac{\mathrm{d}Q}{\mathrm{d}Q^*}\Big) = J(u^*).$$

Taking the expectation with respect to Q and using that both Q and Q* are absolutely continuous with respect to P, and vice versa, yields Equation (51).
The idea now is to seek a minimizer of $D(Q\,|\,Q^*)$ in the set of probability measures $Q \in \hat{\mathcal{M}}$ that are generated by the discretized controls û, i.e., one would like to minimize

$$\hat{I}(\alpha) = D\big(Q(\hat{u}(\alpha))\,\big|\,Q^*\big)$$

over $\alpha \in \mathbb{R}^N$, such that $\hat{Q} = Q(\hat{u}(\alpha))$ is absolutely continuous with respect to Q*. By Equation (16), the optimal change of measure is only known up to the normalizing factor $\exp(-\gamma)$, which enters Equation (54) only as an additive constant; note that we call $\exp(-\gamma)$ or $\exp(-J(u^*))$ a normalizing factor, even though it is clearly a function of the initial conditions (x,t) or (x,0). Nevertheless, minimizing Î is not easily possible, since the functional may have several local minima. With a little trick, however, we can turn the minimization of Equation (54) into a feasible minimization problem, simply by flipping the arguments. To this end, we define:

$$\hat{H}(\alpha) = D\big(Q^*\,\big|\,Q(\hat{u}(\alpha))\big).$$

Clearly, Equation (51) does not hold with the arguments in the Kullback–Leibler (or: KL) divergence reversed, since $D(\cdot\,|\,\cdot)$ is not symmetric; nevertheless, it holds that

$$\hat{I}(\alpha) \geq 0, \quad \hat{H}(\alpha) \geq 0, \quad \text{and} \quad \hat{I}(\alpha) = 0 \text{ if and only if } \hat{H}(\alpha) = 0,$$

where the minimum is attained if and only if $\hat{Q} = Q^*$. Hence, by continuity of the relative entropy, we may expect that by minimizing the “wrong” functional Ĥ we get close to the optimal change of measure, provided that the optimal Q* can be approximated by our parametric family Q̂. We have the following handy result (see [15]).
Lemma 2 (Cross-entropy minimization).
The minimization of (55) is equivalent to the minimization of the cross-entropy functional

$$CE(\alpha) = -\,\mathbb{E}\big[\log\varphi(\hat{u})\,e^{-W(X)}\big],$$

where the log-likelihood ratio $\log\varphi = \log(\mathrm{d}\hat{Q}/\mathrm{d}P)$ between controlled and uncontrolled trajectories is quadratic in the unknown α and can be computed via Girsanov’s theorem.
Proof. 
By definition of the KL divergence, we have

$$\hat{H}(\alpha) = \int \log\Big(\frac{\mathrm{d}Q^*}{\mathrm{d}P}\Big)\,\frac{\mathrm{d}Q^*}{\mathrm{d}P}\,\mathrm{d}P - \int \log\Big(\frac{\mathrm{d}\hat{Q}}{\mathrm{d}P}\Big)\,\frac{\mathrm{d}Q^*}{\mathrm{d}P}\,\mathrm{d}P,$$

since all measures are mutually absolutely continuous. The first term in the last equation is independent of α, and the second term is proportional to the cross-entropy functional $-\mathbb{E}[\log\varphi\,\exp(-W)]$ up to the unknown normalizing factor $\exp(\gamma)$. ☐
The fact that the cross-entropy functional is quadratic in α implies that the necessary optimality condition is of the form

$$S\alpha = b,$$

where $S = (S_{ij})_{1\leq i,j\leq N}$ and $b = (b_i)_{1\leq i\leq N}$ are given by:

$$S_{ij} = \mathbb{E}\Big[e^{-W(X)}\int_0^\tau \big(\sigma^T\nabla_x\phi_i\big)(X_s,s)\cdot\big(\sigma^T\nabla_x\phi_j\big)(X_s,s)\,\mathrm{d}s\Big], \quad b_i = -\,\mathbb{E}\Big[e^{-W(X)}\int_0^\tau \big(\sigma^T\nabla_x\phi_i\big)(X_s,s)\cdot\mathrm{d}B_s\Big].$$

Note that the average in Equation (59) is over the uncontrolled realizations X. It is easy to see that the matrix S is positive definite if the basis functions $\phi_i$ are linearly independent, which implies that Equation (58) has a unique solution, and our necessary condition is in fact sufficient. Nevertheless, it may happen in practice that the coefficient matrix S is badly conditioned, in which case it may be advisable to evaluate the coefficients using importance sampling or a suitable annealing strategy; see [15,29] for further details.
A simple, iterative variant of the cross-entropy algorithm is summarized in Algorithm 2.
Algorithm 2 Simple cross-entropy method
  • Set maximum no. of iterations Maxit and $\alpha^{(0)} = 0$
  • Evaluate $S = S^{(0)}$ and $b = b^{(0)}$ according to (59)
  • for m = 0 to Maxit do
  •   Solve the linear system of equations $S^{(m)}\alpha^{(m+1)} = b^{(m)}$
  •   Evaluate $S^{(m+1)}$ and $b^{(m+1)}$ by importance sampling, using realizations of $X^{\alpha^{(m+1)}}$
  • end for
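Since the entries of S and b in Equation (59) are plain expectations over uncontrolled trajectories, they can be estimated by straightforward Monte Carlo. The sketch below computes the first iterate of Algorithm 2 for the same illustrative toy problem as in the gradient descent sketch above (one basis function, so S and b are scalars); note that the cross-entropy minimizer within such a restricted family need not coincide exactly with the minimizer of Ĵ:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy problem: b = 0, sigma = sqrt(2), f = 0, g(x) = (x - 2)^2, tau = T,
# single basis function phi(x) = x (all choices illustrative).
T, dt = 1.0, 1e-2
n, M = int(T / dt), 50_000
sigma = np.sqrt(2.0)
g = lambda x: (x - 2.0) ** 2

x = np.zeros(M)
int_dB = np.zeros(M)      # int (sigma phi') dB, i.e. sigma * B_T here
int_ds = np.zeros(M)      # int (sigma phi')^2 ds, i.e. sigma^2 * T here
for _ in range(n):
    dB = np.sqrt(dt) * rng.standard_normal(M)
    int_dB += sigma * dB
    int_ds += sigma**2 * dt
    x += sigma * dB       # uncontrolled dynamics dX = sigma dB

w = np.exp(-g(x))         # exp(-W(X)) with W = g(X_T)
S = np.mean(w * int_ds)   # scalar version of S_ij in Equation (59)
b = -np.mean(w * int_dB)  # scalar version of b_i in Equation (59)
print(b / S)              # first iterate alpha^(1) from S alpha = b
```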

4.3. Other Monte Carlo-Based Methods

We refrain from listing all possibilities to compute the optimal change of measure or the optimal control, and mention only two more approaches that are feasible in situations in which grid-based discretization methods (e.g., for solving the nonlinear Hamilton–Jacobi–Bellman equation) are infeasible. The strength of the methods described below is that they can be combined with model reduction methods such as averaging, homogenization or Markov state modeling if either suitable collective variables, a reaction coordinate or some dominant metastable sets are known; see, e.g., [7,15,29,32,36,37] for the general approach and the application to molecular dynamics.

4.3.1. Approximate Policy Iteration

The first option is based on a successive linearization of the Hamilton–Jacobi–Bellman equation of the underlying optimal control problem. The idea is the following: given any admissible control $u_s = c(X_s^u, s)$, the Feynman–Kac theorem [23] (Theorem 8.2.1) states that the cost functional J(u), considered as a function Θ(c) of the initial data (x,t) of the controlled process $X^u = (X_s^u)_{s\geq t}$ with $X_t^u = x$, solves a linear boundary value problem of the form

$$A(c)\,\Theta(c) = \ell(x, c),$$

where A(c) is a linear differential operator that depends on the chosen control policy and whose precise form (e.g., parabolic, elliptic or hypoelliptic) depends on the problem at hand. Clearly, $\gamma(x,t) = \min_c \Theta(c; x, t)$ is the value function (or free energy), i.e., the solution we seek. For an arbitrary initial choice of a control policy $c_0 \neq c^*$, we have $\gamma < \Theta(c_0)$, and a successive improvement of the policy can be obtained by iterating

$$c_{n+1}(x, s) = -\sigma(x)^T\,\nabla_x\Theta(c_n; x, s), \quad n \geq 0.$$
Under suitable assumptions on the drift and diffusion coefficients, iterating Equations (60) and (61) yields a sequence of control policies $c_n$ that converges to the unique optimal control; hence the name of the method, policy iteration. Clearly, solving the linear partial differential Equation (60) by any grid-based method will be infeasible if the state space dimension is larger than, say, three or four. In this case, it is possible to approximate the infinitesimal generators A(c) by a sparse and grid-free Markov State Model (MSM) that captures the underlying dynamics $X^u = X^u(c)$; see, e.g., [36] for the error analysis of the corresponding nonequilibrium MSM and an application to molecular dynamics. In this case, one speaks of an approximate policy iteration. For further details on approximate policy iteration algorithms, we refer to the article [38] and the references therein.

4.3.2. Least-Squares Monte Carlo

If τ = T is a finite deterministic stopping time, another alternative is to exploit that the value function of the control problem, Equations (38) and (39), can be computed as the solution to a forward-backward stochastic differential equation (FBSDE) of the form

$$\mathrm{d}X_s = b(X_s,s)\,\mathrm{d}s + \sigma(X_s)\,\mathrm{d}B_s, \quad X_t = x$$
$$\mathrm{d}Y_s = \Big(-f(X_s,s) + \frac{1}{2}|Z_s|^2\Big)\,\mathrm{d}s + Z_s\cdot\mathrm{d}B_s, \quad Y_T = g(X_T),$$

where $t \leq s \leq T$ and the second equation must be interpreted as an equation that runs backwards in time. A solution of the FBSDE (62) is a triplet $(X_s, Y_s, Z_s)$, with the property that $Y_s$ and $Z_s$ at time $s \in [t,T]$ depend only on the history of the forward process $(X_r)_{t\leq r\leq s}$ up to time s. In particular, since $X_t = x$, the backward process $Y_t$ is a deterministic function of the initial data (x,t) only, and it holds that (e.g., [39])

$$\gamma(x,t) = Y_t.$$
The specific structure of the control problem, Equations (38) and (39), implies that the forward equation is decoupled from the backward equation and that the backward process $(Y_s, Z_s)$ can be expressed as

$$Y_s = \gamma(X_s, s), \quad Z_s = \sigma(X_s)^T\,\nabla_x\gamma(X_s, s),$$

where $X_s$ is the uncontrolled forward process. Since we can simulate the forward process and we know the functional dependence of $(Y_s, Z_s)$ on $X_s$, the idea here is again to use the representation Equation (42) of γ in terms of a finite basis. It turns out that the coefficient vector $\alpha \in \mathbb{R}^N$ can be computed by solving a least-squares problem in every time step of the time-discretized backward SDE, which is why methods for solving an FBSDE like Equation (62) are termed least squares Monte Carlo; for the general approach, we refer to [40,41]; details for the situation at hand will be addressed in a forthcoming paper [42].

5. Illustrative Examples

From a measure-theoretic viewpoint, changing the drift of an SDE (also known as Girsanov transformation) is an exponential tilting of a Gaussian measure on an infinite-dimensional space. Here, for illustration purposes, we consider a one-dimensional paradigm that is in the spirit of Section 2 and that illustrates the basic features of Gaussian measure changes, Girsanov transformations and the cross-entropy method.
To this end, let π = N(0,1) be the density of the standard Gaussian distribution on $\mathbb{R}$, and define an exponential family of “tilted” probability densities by

$$\rho_\alpha(x) = \exp\Big(\alpha x - \frac{\alpha^2}{2}\Big)\,\pi(x).$$
It can be readily checked that ρ α is the density of the normal distribution N ( α , 1 ) with mean α and unit variance, in other words, the exponential tilting results in a shift of the mean, which represents a change of the drift in the case of an SDE (compare Equations (19) and (21)).

5.1. Example 1 (Moment Generating Function)

Let β > 0 and define

$$\psi_\beta = \mathbb{E}_\pi\big[\exp(\beta X)\big].$$

By Jensen’s inequality, it follows that

$$-\beta^{-1}\log\mathbb{E}_\pi\big[\exp(\beta X)\big] \leq \mathbb{E}_\alpha\Big[-X + \beta^{-1}\log\frac{\rho_\alpha}{\pi}\Big],$$

where $\mathbb{E}_\alpha[\cdot]$ denotes the expectation with respect to $\rho_\alpha$. A simple calculation shows that the inequality is sharp, where equality is attained for α = β, i.e., when $\rho_\alpha = \rho^*$, with

$$\rho^* = N(\beta, 1).$$
As a consequence, the Donsker–Varadhan variational principle (4) holds when the minimum is taken over the exponential family (64) with sufficient statistic X.
We will now show that ρ* can be computed by the cross-entropy method. To this end, let

$$J(\alpha) = \mathbb{E}_\alpha\Big[-X + \beta^{-1}\log\frac{\rho_\alpha}{\pi}\Big].$$

As we have just argued, there exists a unique minimizer α* = β of J that by Theorem 1 has the zero variance property, which implies that

$$J(\alpha) = J(\alpha^*) + \beta^{-1}D(\rho_\alpha\,|\,\rho^*).$$
The associated cross-entropy functional has the form (see Section 4.2):

$$CE(\alpha) = -\,\mathbb{E}_\pi\Big[\log\frac{\rho_\alpha}{\pi}\,\exp(\beta X)\Big].$$

Using Equation (64), it is easily seen that the cross-entropy functional is quadratic,

$$CE(\alpha) = \mathbb{E}_\pi\Big[\Big(\frac{\alpha^2}{2} - \alpha X\Big)\exp(\beta X)\Big],$$

with unique minimizer:

$$\hat{\alpha} = \frac{\mathbb{E}_\pi\big[X\exp(\beta X)\big]}{\mathbb{E}_\pi\big[\exp(\beta X)\big]} = \nabla_\beta\log\psi_\beta,$$

where the second equality follows from Equation (65), using the fact that the derivative and the expectation commute, because π is Gaussian and hence the moment-generating function $\psi_\beta$ exists for all $\beta \in \mathbb{R}$. Rearranging the terms in the last equation, we obtain

$$\nabla_\beta\log\psi_\beta = \mathbb{E}_\pi\Big[X\exp\Big(\beta X - \frac{\beta^2}{2}\Big)\Big] = \mathbb{E}_\beta[X] = \beta,$$

showing that:

$$\hat{\alpha} = \alpha^* = \beta.$$
The above considerations readily generalize to the multidimensional Gaussian case, and hence this simple example illustrates that the cross-entropy method yields the same result as the direct minimization of the functional (68), at least in the finite-dimensional case.
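The identity $\hat{\alpha} = \beta$ is also easy to check by plain Monte Carlo; a short sketch with an illustrative value of β:

```python
import numpy as np

rng = np.random.default_rng(7)

beta, N = 1.5, 1_000_000
x = rng.standard_normal(N)          # draws from pi = N(0,1)
w = np.exp(beta * x)
print(np.sum(x * w) / np.sum(w))    # estimates alpha_hat; close to beta = 1.5
```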

5.2. Example 2 (Rare Event Probabilities)

The following example illustrates that the cross-entropy method can be used and produces meaningful results, even though the Donsker–Varadhan principle does not hold. To this end, consider again the case of a real-valued random variable X ∼ P with density π = N(0,1) and $W = -\log\mathbf{1}_{\{X>d\}}$ with d ≫ 0. Then

$$P(X > d) = \mathbb{E}_\pi\big[\exp(-W)\big]$$

is a small probability that is difficult to compute by brute-force Monte Carlo. In this case, a zero-variance change of measure exists, but it is not of the form (64). As a consequence, equality in Equation (66) cannot be attained within the exponential family $\{\rho_\alpha : \alpha \in \mathbb{R}\}$ given by Equation (64). Instead, the optimal density in this case would be the conditional density

$$\rho^*(x) = \frac{\mathbf{1}_{\{x>d\}}}{p}\,\pi(x),$$

where the normalization constant p = P(X > d) is of course the quantity we want to compute (cf. Section 2.1). Note that this expression formally agrees with the optimal density (5), which was, however, derived under different assumptions.
The idea now is to minimize the distance between $\rho_\alpha$ and ρ* in the sense of relative entropy, i.e., we seek a minimizer of the Kullback–Leibler divergence $D(\rho^*\,|\,\rho_\alpha)$ in the exponential family $\{\rho_\alpha : \alpha \in \mathbb{R}\}$. The associated cross-entropy functional is given by

$$CE(\alpha) = \mathbb{E}_\pi\Big[\Big(\frac{\alpha^2}{2} - \alpha X\Big)\mathbf{1}_{\{X>d\}}\Big],$$

with unique minimizer

$$\alpha^* = \frac{\mathbb{E}_\pi\big[X\,\mathbf{1}_{\{X>d\}}\big]}{\mathbb{E}_\pi\big[\mathbf{1}_{\{X>d\}}\big]} = \mathbb{E}_\pi\big[X\,|\,X>d\big].$$

Comparing Equations (77) and (75), we observe that both densities $\rho_{\alpha^*}$ and ρ* have the same mean (namely α*); hence, the suboptimal density $\rho_{\alpha^*}$ is concentrated around the typical values that the optimal density ρ* would produce when samples were drawn from it.
Clearly, the optimal tilting parameter (77) is probably as difficult to compute by brute-force Monte Carlo as the probability p = P(X > d), since {X > d} is a rare event when d ≫ 0 is far away from the mean. The strength of both the gradient descent and the cross-entropy method is, however, that the optimal tilting parameter can be computed iteratively. This is illustrated numerically in Figure 1 for the choice d = 5, where we use Algorithm 1 with a constant step size and Algorithm 2 as specified. In each iteration m, we draw a sample of size $N = 10^8$ from the density $\rho_{\alpha^{(m)}}$ and estimate the mean,

$$\hat{p} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}_{\{X_i > d\}}\,\frac{\pi(X_i)}{\rho_{\alpha^{(m)}}(X_i)},$$

and the sample variance in each sample. The latter is proportional to the normalized variance $K\operatorname{Var}(\hat{p})$ of an estimator that has been estimated K times.

For this (admittedly simple) example, both the gradient descent and the cross-entropy method converge well and lead to a drastic reduction of the normalized relative error $\delta = \sqrt{\operatorname{Var}(\hat{p})}/\hat{p}$ of the estimator by a factor of about 1000, from about 2000 without importance sampling to about δ ≈ 2.38 under (suboptimal) importance sampling with exponential tilting, indicating that both methods can handle situations in which the optimal (i.e., δ = 0) change of measure is not available within the set of trial densities.
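The iteration can be sketched as follows. Since an untilted sample of moderate size contains no points beyond d = 5, the sketch anneals the threshold towards d over the iterations (a standard device in cross-entropy methods, not spelled out in the article), and the sample size is much smaller than the $N = 10^8$ used for Figure 1:

```python
import numpy as np

rng = np.random.default_rng(8)

# Iterative cross-entropy for p = P(X > d), d = 5, within the tilted
# family rho_alpha = N(alpha, 1).
d, N = 5.0, 100_000
alpha = 0.0
for level in np.linspace(1.0, d, 9):           # anneal the rare-event level
    x = alpha + rng.standard_normal(N)         # draws from rho_alpha
    lr = np.exp(-alpha * x + 0.5 * alpha**2)   # likelihood ratio pi/rho_alpha
    w = (x > level) * lr
    alpha = np.sum(x * w) / np.sum(w)          # estimate of E_pi[X | X > level]

# final importance sampling estimate of p with the optimized tilt
x = alpha + rng.standard_normal(N)
p_hat = np.mean((x > d) * np.exp(-alpha * x + 0.5 * alpha**2))
print(alpha, p_hat)   # for reference: alpha* = 5.186..., p = 2.87e-7
```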

6. Conclusions

We have presented a method for constructing minimum-variance importance sampling estimators. The method is based on a variational characterization of the thermodynamic free energy and essentially replaces a Monte Carlo sampling problem by a stochastic approximation problem for the optimal importance sampling density. For path sampling, the stochastic approximation problem boils down to a Markov control problem, which again can be solved by stochastic optimization techniques. We have proved that for a large class of path sampling problems that are relevant in, e.g., molecular dynamics or rare events simulation, the (unique) solution to the optimal control problem can yield zero-variance importance sampling schemes.
The computational gain when replacing the sampling problem by a variational principle is, besides the improved convergence due to the variance reduction and an often higher hitting rate of the relevant events, due to the fact that the variational problem can be solved iteratively, which makes it amenable to multilevel approaches. The cross-entropy method, as an example of such an approach, has been presented in some detail. A substantial difficulty is still the choice of basis functions, which is highly problem-specific; hence, future research should address non-parametric approaches, as well as model reduction methods, in combination with the stochastic optimization/approximation tools that can be used to solve the underlying variational problems.

Acknowledgments

This work was funded by the Einstein Center for Mathematics (ECMath) and by the Deutsche Forschungsgemeinschaft (DFG) under Grant DFG-SFB 1114 “Scaling Cascades in Complex Systems”.

Author Contributions

C.H. and W.Z. conceived and designed the research; L.R. performed the numerical experiments; C.H. and C.S. wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Yet Another Certainty Equivalence

A similar variational characterization of expected values as in Equation (4), based on convexity arguments and Jensen’s inequality, can be formulated for non-negative random variables W = W(X) ≥ 0. For simplicity, and as in Section 2.1, we assume W to be bounded and measurable and π > 0. Then, for all $p \in [1,\infty)$, it holds that

$$\big(\mathbb{E}_\pi[W^p]\big)^{1/p} = \max_{\eta \geq 0}\,\mathbb{E}_\eta\Big[W\Big(\frac{\eta}{\pi}\Big)^{-1/p}\Big],$$

where η is any non-negative probability density. If we exclude the somewhat pathological case W = 0 a.s., it follows that $\mathbb{E}_\pi[W^p] > 0$ and the supremum is attained for

$$\eta^* = \frac{W^p}{\mathbb{E}_\pi[W^p]}\,\pi.$$

The proof is along the lines of the proof of the Donsker–Varadhan principle, Equation (4). Indeed, applying Jensen’s inequality and noting that η/π is non-zero η-a.s., it readily follows that

$$\big(\mathbb{E}_\pi[W^p]\big)^{1/p} \geq \mathbb{E}_\eta\Big[W\Big(\frac{\eta}{\pi}\Big)^{-1/p}\Big],$$

and it is easy to verify that the supremum in Equation (A1) is attained at η = η* given by Equation (A2).
Similarly as in Section 2.1, the above discussion can be applied to study importance sampling schemes for the p-th moment of the random variable W. We have:
Theorem A1 (Optimal importance sampling, cont’d).
Let p > 1 and η* be defined by Equation (A2). Then the random variable $Y = W^p\,(\eta^*/\pi)^{-1}$ has zero variance under η*, and we have:

$$Y = \mathbb{E}_\pi[W^p], \quad \eta^*\text{-a.s.}$$

Again, Theorem A1 implies that drawing the random variable X from η* and then estimating the reweighted expectation $\mathbb{E}_{\eta^*}[Y]$ provides a zero variance estimator for the quantity $\mathbb{E}_\pi[W^p]$.
Remark A1.
When $0 < p \leq 1$, the function $f(u) = u^p$ is concave for $u \geq 0$, and the variational principle (A1) needs to be modified as (see [4])

$$ \left( \mathbb{E}_\pi\left[ W^p \right] \right)^{1/p} = \min_{\eta \geq 0}\, \mathbb{E}_\eta\!\left[ W \left( \frac{\eta}{\pi} \right)^{-1/p} \right], $$

where the minimizer $\eta^*$ is given by Equation (A2). If $W > 0$ a.s., then $\eta^*$ belongs to the exponential family with sufficient statistic $S(X) = p \log W(X)$ and reference density π.

Appendix B. Ratio Estimators

We shall briefly discuss the properties of the self-normalized importance sampling estimator (7) that is based on estimating a ratio of expected values by

$$ q_N = \frac{1}{N} \sum_{i=1}^N Q_i, \qquad p_N = \frac{1}{N} \sum_{i=1}^N P_i, $$

where $Q_i$ and $P_i$ are i.i.d. random variables living on a joint probability space and having finite variances $\sigma_Q^2$ and $\sigma_P^2$ and covariance $\sigma_{QP}$. Further, assume that $q = \mathbb{E}[Q_1] \neq 0$; then, by the strong law of large numbers, the ratio $p_N/q_N$ converges a.s. to $p/q$, where $p = \mathbb{E}[P_1]$.

Appendix B.1. The Delta Method

We can apply the delta method (e.g., [20] (Section 4.1)) to analyze the behavior of the ratio estimator in more detail. Roughly speaking, the delta method says that, for a sum $S_N = X_1 + \cdots + X_N$, $N \in \mathbb{N}$, of square-integrable i.i.d. random variables $X_k$ with mean $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$, and for a sufficiently smooth function $\phi \colon \mathbb{R}^n \to \mathbb{R}$ that can be Taylor expanded about μ, a central limit theorem applies. Specifically, using the mean value theorem, it is easily seen that

$$ \phi(S_N / N) - \phi(\mu) = \nabla \phi(\zeta_N) \cdot \left( S_N / N - \mu \right) $$

for some $\zeta_N \in \mathbb{R}^n$ lying component-wise in the half-open interval between $S_N/N$ and μ. By the continuity of $\nabla \phi$ at μ, the fact that $S_N/N \to \mu$ a.s. as $N \to \infty$, and the fact that $\sqrt{N}\,(S_N/N - \mu)$ is asymptotically Gaussian with mean zero and covariance Σ, we have

$$ \sqrt{N} \left( \phi(S_N / N) - \phi(\mu) \right) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}\!\left( 0,\, \nabla\phi(\mu)^T\, \Sigma\, \nabla\phi(\mu) \right), \quad N \to \infty, $$

where “$\xrightarrow{\;\mathcal{L}\;}$” denotes convergence in law (or: convergence in distribution), and $\mathcal{N}(m, C)$ denotes a Gaussian distribution with mean m and covariance C.

Appendix B.2. Asymptotic Properties of Ratio Estimators

Applying the delta method to the function $\phi \colon \mathbb{R}^2 \to \mathbb{R}$, $(u, v) \mapsto u/v$, and assuming that $|v|$ is bounded away from zero, we find that the ratio estimator satisfies a central limit theorem, too. Specifically, assuming that $q \neq 0$, so that $|q_N|$ is asymptotically bounded away from zero, the delta method yields

$$ \sqrt{N} \left( \frac{p_N}{q_N} - \frac{p}{q} \right) \xrightarrow{\;\mathcal{L}\;} \mathcal{N}(0, \sigma^2), $$

with variance

$$ \sigma^2 = \frac{\operatorname{Var}\!\left( P_1 - \frac{p}{q}\, Q_1 \right)}{q^2}. $$

In particular, the estimator is asymptotically unbiased.
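The asymptotic formula is easy to check by simulation: the predicted $\sigma^2$ should match the empirical variance of $\sqrt{N}\,(p_N/q_N - p/q)$ over many independent replicas. A short sketch, where the correlated toy distribution of $(P_1, Q_1)$ is our own choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_pq(n):
    """Correlated pair (P_1, Q_1) with E[P_1] = 1 and E[Q_1] = 2."""
    z = rng.normal(size=n)
    return (1.0 + z + 0.5 * rng.normal(size=n),
            2.0 + 0.8 * z + 0.3 * rng.normal(size=n))

p, q = 1.0, 2.0        # exact means
N, M = 1_000, 5_000    # sample size per ratio estimator, number of replicas

# predicted asymptotic variance: sigma^2 = Var(P_1 - (p/q) Q_1) / q^2
P, Q = sample_pq(1_000_000)
sigma2_pred = np.var(P - (p / q) * Q) / q**2

# empirical variance of sqrt(N) * (p_N / q_N - p / q)
P, Q = sample_pq(N * M)
ratios = P.reshape(M, N).mean(axis=1) / Q.reshape(M, N).mean(axis=1)
sigma2_emp = N * np.var(ratios)

print(f"predicted sigma^2 = {sigma2_pred:.4f}, empirical = {sigma2_emp:.4f}")
```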

Appendix C. Finite-Dimensional Change of Measure Formula

We will explain the basic idea behind Girsanov’s theorem and the change of measure Formula (21). To keep the presentation easily accessible, we present only a vanilla version of the theorem based on finite-dimensional Gaussian measures, partly following an idea in [43].

Appendix C.1. Gaussian Change of Measure

Let P be a probability measure on a measurable space $(\Omega, \mathcal{E})$, on which an m-dimensional random variable $B \colon \Omega \to \mathbb{R}^m$ is defined. Further, suppose that B has standard Gaussian distribution $P_B = P \circ B^{-1}$. Given a (deterministic) vector $b \in \mathbb{R}^n$ and a matrix $\sigma \in \mathbb{R}^{n \times m}$, we define a new random variable $X \colon \Omega \to \mathbb{R}^n$ by

$$ X(\omega) = b + \sigma B(\omega). \tag{A4} $$

The similarity to the SDE (11) is no coincidence. Since B is Gaussian, so is X, with mean b and covariance $C = \sigma\sigma^T$. Now, let $u \in \mathbb{R}^m$ and define the shifted Gaussian random variable

$$ B^u(\omega) = B(\omega) - u, $$

and consider the alternative representation

$$ X(\omega) = b_u + \sigma B^u(\omega) $$

of X that is equivalent to Equation (A4) if and only if

$$ \sigma u = b_u - b $$

has a solution (that need not be unique, though). Following the line of Section 3.1, we seek a probability measure Q equivalent to P such that $B^u$ is standard Gaussian under Q, and we claim that such a Q should have the property

$$ \frac{dQ}{dP}(\omega) = \exp\left( u \cdot B(\omega) - \frac{1}{2}\, |u|^2 \right) $$

or, equivalently,

$$ \frac{dQ}{dP}(\omega) = \exp\left( u \cdot B^u(\omega) + \frac{1}{2}\, |u|^2 \right), $$
in accordance with Equations (19)–(21). To show that $B^u$ is indeed standard Gaussian under the above-defined measure Q, it is sufficient to check that, for any measurable (Borel) set $A \subseteq \mathbb{R}^m$, the probability $Q(B^u \in A)$ is given by the integral against the standard Gaussian density:

$$ Q(B^u \in A) = \frac{1}{(2\pi)^{m/2}} \int_A \exp\left( -\frac{|x|^2}{2} \right) dx. $$

Indeed, since B is standard Gaussian under P, it follows that

$$
\begin{aligned}
Q(B^u \in A) &= \int_{\{\omega\,:\; B^u(\omega) \in A\}} \exp\left( u \cdot B(\omega) - \frac{1}{2}\, |u|^2 \right) dP(\omega) \\
&= \int_{\{\omega\,:\; B(\omega) - u \in A\}} \exp\left( u \cdot B(\omega) - \frac{1}{2}\, |u|^2 \right) dP(\omega) \\
&= \frac{1}{(2\pi)^{m/2}} \int_{\{x\,:\; x - u \in A\}} \exp\left( u \cdot x - \frac{1}{2}\, |u|^2 - \frac{1}{2}\, |x|^2 \right) dx \\
&= \frac{1}{(2\pi)^{m/2}} \int_{\{x\,:\; x - u \in A\}} \exp\left( -\frac{|x - u|^2}{2} \right) dx \\
&= \frac{1}{(2\pi)^{m/2}} \int_A \exp\left( -\frac{|y|^2}{2} \right) dy,
\end{aligned}
$$

showing that $B^u$ has a standard Gaussian distribution under Q.
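In simulation terms, the computation above says that reweighting standard Gaussian samples of B with the density dQ/dP turns $B^u = B - u$ into a standard Gaussian. A brief numerical check (the dimension m and the shift u are our own test values):

```python
import numpy as np

rng = np.random.default_rng(3)
m, N = 2, 1_000_000
u = np.array([0.5, -1.0])            # arbitrary shift vector (test value)

B = rng.normal(size=(N, m))          # standard Gaussian under P
w = np.exp(B @ u - 0.5 * u @ u)      # dQ/dP = exp(u.B - |u|^2 / 2)
Bu = B - u                           # shifted variable B^u

# weighted moments approximate expectations under Q; they should match N(0, I)
mean_Q = (w[:, None] * Bu).mean(axis=0)
cov_Q = (w[:, None, None] * Bu[:, :, None] * Bu[:, None, :]).mean(axis=0)
print("E_Q[B^u]   ~", np.round(mean_Q, 3))    # close to (0, 0)
print("Cov_Q[B^u] ~\n", np.round(cov_Q, 3))   # close to the identity
```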

Appendix C.2. Reweighting

Clearly, by the definition of Q, it holds that

$$ \mathbb{E}[f(X)] = \mathbb{E}_Q\left[ f(X) \exp\left( -u \cdot B^u(\omega) - \frac{1}{2}\, |u|^2 \right) \right] \tag{A8} $$

for any bounded and measurable function $f \colon \mathbb{R}^n \to \mathbb{R}$, where $\mathbb{E}[\cdot] = \mathbb{E}_P[\cdot]$ denotes the expectation with respect to the reference measure P. Now, let

$$ X^u(\omega) = b_u + \sigma B(\omega). $$

Since the distribution of the pair $(X^u, B)$ under P is the same as the distribution of the pair $(X, B^u)$ with $X = b_u + \sigma B^u$ under Q, the reweighting identity Equation (A8) entails that

$$ \mathbb{E}[f(X)] = \mathbb{E}\left[ f(X^u) \exp\left( -u \cdot B(\omega) - \frac{1}{2}\, |u|^2 \right) \right], \tag{A9} $$

with $\mathbb{E}[B] = 0$. Equation (A9) is the finite-dimensional analogue of the reweighting identity that has been used to convert the Donsker–Varadhan Formula (14) into its final form (22).
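Equation (A9) can be tested in the same spirit, now with both expectations taken under P; the vectors, matrices, and the test function f below are again our own choices:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1_000_000

# test values for b, sigma, u (n = m = 2); b_u satisfies sigma u = b_u - b
b = np.array([1.0, -0.5])
sigma = np.array([[1.0, 0.3],
                  [0.0, 0.8]])
u = np.array([0.4, -0.7])
b_u = b + sigma @ u

f = lambda x: np.cos(x[:, 0]) + x[:, 1] ** 2   # test observable

B = rng.normal(size=(N, 2))    # standard Gaussian under P
X = b + B @ sigma.T            # X   = b   + sigma B
Xu = b_u + B @ sigma.T         # X^u = b_u + sigma B

lhs = f(X).mean()
rhs = (f(Xu) * np.exp(-B @ u - 0.5 * u @ u)).mean()
print(f"E[f(X)] ~ {lhs:.4f},  E[f(X^u) exp(-u.B - |u|^2/2)] ~ {rhs:.4f}")
```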
Remark A2.
If σ(·) in the SDE (18) is square and invertible, then an alternative derivation of Girsanov’s theorem and Equation (21) can be based on the Euler–Maruyama discretization of the SDE and a change of measure for the corresponding Markov chain.

Appendix D. Proof of Theorem 2

The proof is based on the Feynman–Kac formula and Itô’s lemma. Here, we give only a sketch of the proof and leave aside all technical details regarding the regularity of solutions of partial differential equations, for which we refer to [8] (Section VI.5). Recall the definition

$$ \Psi(x, t) = \mathbb{E}_P\left[ \exp\left( -\int_t^\tau f(X_s)\, ds - g(X_\tau) \right) \Big|\; X_t = x \right]. $$

By the Feynman–Kac formula, the function Ψ solves the parabolic boundary value problem

$$ (A - f)\, \Psi = 0 \;\; \text{for } (x, t) \in O \times [0, T), \qquad \Psi = \exp(-g) \;\; \text{for } (x, t) \in D^+ \tag{A10} $$

on the domain $D = O \times [0, T)$, where $D^+ = (\partial O \times [0, T)) \cup (\bar{O} \times \{T\})$ denotes the terminal set of the augmented (control-free) process $(X_t, t)$, and

$$ A = \frac{\partial}{\partial t} + \frac{1}{2}\, \sigma\sigma^T : \nabla_x^2 + b \cdot \nabla_x $$

is its infinitesimal generator under the probability measure P.
By construction, the stopping time τ is bounded, and we assume that Ψ is of class $C^{2,1}$ on D, and continuous and uniformly bounded away from zero on the closure $\bar{D}$. Now, let us define the process

$$ \zeta_s^u = -\log \Psi(X_s^u, s), $$

with $X_s^u$ given by Equation (18). Then, using Itô’s lemma (e.g., [23] (Theorem 4.2.1)) and introducing the shorthands

$$ \Psi_s^u = \Psi(X_s^u, s), \qquad b_s^u = b(X_s^u, s), \qquad \sigma_s^u = \sigma(X_s^u), $$

we see that $(\zeta_s^u)_{0 \le s < \tau}$ satisfies the SDE

$$
\begin{aligned}
d\zeta_s^u &= -\partial_t \log \Psi_s^u \, ds - \nabla_x \log \Psi_s^u \cdot \left( b_s^u + \sigma_s^u u_s \right) ds - \frac{1}{2}\, \sigma_s^u (\sigma_s^u)^T : \nabla_x^2 \left( \log \Psi_s^u \right) ds - (\sigma_s^u)^T \nabla_x \log \Psi_s^u \cdot dB_s^u \\
&= -\left( \frac{A \Psi_s^u}{\Psi_s^u} + \frac{(\sigma_s^u)^T \nabla_x \Psi_s^u}{\Psi_s^u} \cdot u_s - \frac{1}{2}\, \frac{\left| (\sigma_s^u)^T \nabla_x \Psi_s^u \right|^2}{(\Psi_s^u)^2} \right) ds - \frac{(\sigma_s^u)^T \nabla_x \Psi_s^u}{\Psi_s^u} \cdot dB_s^u \\
&= -\left( f(X_s^u, s) + \frac{(\sigma_s^u)^T \nabla_x \Psi_s^u}{\Psi_s^u} \cdot u_s - \frac{1}{2}\, \frac{\left| (\sigma_s^u)^T \nabla_x \Psi_s^u \right|^2}{(\Psi_s^u)^2} \right) ds - \frac{(\sigma_s^u)^T \nabla_x \Psi_s^u}{\Psi_s^u} \cdot dB_s^u.
\end{aligned}
$$
In the last equation, we have used that the first equality in Equation (A10) holds in the interior of the bounded domain D, i.e., for s < τ. Choosing $u_s = u_s^*$ for $0 \le s < \tau$ to be the optimal control

$$ u_s^* = \left( \sigma_s^u \right)^T \nabla_x \log \Psi(X_s^u, s), $$

as in Equation (26), the last equation can be recast as

$$ d\zeta_s^u = -\left( f(X_s^u, s) + \frac{1}{2}\, |u_s^*|^2 \right) ds - u_s^* \cdot dB_s^u. $$
Similarly to Equation (20), if we introduce

$$ Z_{s,\tau}^u = \int_s^\tau u_r^* \cdot dB_r^u + \frac{1}{2} \int_s^\tau |u_r^*|^2 \, dr, $$

then $Z_{0,\tau}^u = Z_\tau^u$, and we have

$$ d\zeta_s^u = -f(X_s^u, s)\, ds + dZ_{s,\tau}^u, $$

where the differential of $Z_{s,\tau}^u$ is taken with respect to the lower integration limit s. As a consequence, using the continuity of the process as $s \to 0$,

$$ \zeta_\tau^u = \zeta_0^u - Z_\tau^u - \int_0^\tau f(X_s^u, s)\, ds. \tag{A11} $$
By definition of $\zeta_s^u$, the initial value $\zeta_0^u = -\log \Psi(X_0^u, 0) = -\log \Psi(x, 0)$ is deterministic. Moreover, $\zeta_\tau^u = -\log \Psi(X_\tau^u, \tau) = g(X_\tau^u)$, which in combination with Equation (A11) yields

$$ -\log \Psi(x, 0) = g(X_\tau^u) + \int_0^\tau f(X_s^u, s)\, ds + Z_\tau^u. $$

Rearranging the terms in the last equation, we find

$$ \Psi(x, 0) = \exp\left( -Z_\tau^u - \int_0^\tau f(X_s^u, s)\, ds - g(X_\tau^u) \right) $$

with probability one, which yields the assertion of Theorem 2.
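The zero-variance mechanism of the proof can be observed directly in a tractable special case, which is our own toy example rather than one from the paper: b = 0, σ = 1, f ≡ 0 and g(x) = x with deterministic terminal time τ = T, so that $\Psi(x, t) = \mathbb{E}[\exp(-B_T) \,|\, B_t = x] = \exp(-x + (T - t)/2)$ and the optimal control $u^* = \partial_x \log \Psi = -1$ is constant. Along the controlled paths, the estimator $\exp(-Z_\tau^u - g(X_\tau^u))$ is constant up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(5)
x0, T, K, N = 0.0, 1.0, 100, 10_000   # initial value, horizon, steps, paths
dt = T / K
u = -1.0                              # optimal control for f = 0, g(x) = x

psi_exact = np.exp(-x0 + T / 2)       # Psi(x0, 0)

dB = rng.normal(scale=np.sqrt(dt), size=(N, K))  # Brownian increments
S = dB.sum(axis=1)                               # B_T - B_0 per path

X_T = x0 + u * T + S                  # endpoint of dX = u ds + dB (u constant)
Z = u * S + 0.5 * u**2 * T            # Girsanov exponent Z_T^u

naive = np.exp(-(x0 + S))             # plain MC estimator exp(-g(B_T))
controlled = np.exp(-Z - X_T)         # reweighted estimator, constant per path

print(f"exact Psi(x0, 0) = {psi_exact:.6f}")
print(f"plain MC         = {naive.mean():.6f} (std {naive.std():.3f})")
print(f"controlled       = {controlled.mean():.6f} (std {controlled.std():.2e})")
```

For nonlinear g or f ≠ 0, the control $u_s^* = \sigma^T \nabla_x \log \Psi$ is no longer constant and must be approximated, e.g., by the stochastic optimization methods discussed in the main text; the time discretization then introduces a small residual variance.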
Remark A3.
Letting $T \to \infty$ in the proof of the theorem, it follows that $\tau \to \tau_O$ a.s., where $\tau_O$ is the first exit time from the set O. As a consequence, the zero-variance property of the importance sampling estimator carries over to the case of a.s. finite (but not necessarily bounded) hitting times or first exit times.

References

1. Hammersley, J.M.; Morton, K.W. Poor Man’s Monte Carlo. J. R. Stat. Soc. Ser. B 1954, 16, 23–38.
2. Rosenbluth, M.N.; Rosenbluth, A.W. Monte Carlo Calculations of the Average Extension of Molecular Chains. J. Chem. Phys. 1955, 23, 356–359.
3. Deuschel, J.D.; Stroock, D.W. Large Deviations; Academic Press: New York, NY, USA, 1989.
4. Dai Pra, P.; Meneghini, L.; Runggaldier, W.J. Connections between stochastic control and dynamic games. Math. Control Signals Syst. 1996, 9, 303–326.
5. Delle Site, L.; Ciccotti, G.; Hartmann, C. Partitioning a macroscopic system into independent subsystems. J. Stat. Mech. Theory Exp. 2017, 2017, 83201.
6. Boué, M.; Dupuis, P. A variational representation for certain functionals of Brownian motion. Ann. Probab. 1998, 26, 1641–1659.
7. Hartmann, C.; Banisch, R.; Sarich, M.; Badowski, T.; Schütte, C. Characterization of rare events in molecular dynamics. Entropy 2014, 16, 350–376.
8. Fleming, W.H.; Soner, H.M. Controlled Markov Processes and Viscosity Solutions; Springer: New York, NY, USA, 2006.
9. Hartmann, C.; Schütte, C. Efficient rare event simulation by optimal nonequilibrium forcing. J. Stat. Mech. Theory Exp. 2012, 2012.
10. Jarzynski, C. Nonequilibrium equality for free energy differences. Phys. Rev. Lett. 1997, 78, 2690–2693.
11. Sivak, D.A.; Crooks, G.E. Thermodynamic Metrics and Optimal Paths. Phys. Rev. Lett. 2012, 108, 190602.
12. Oberhofer, H.; Dellago, C. Optimum bias for fast-switching free energy calculations. Comput. Phys. Commun. 2008, 179, 41–45.
13. Rotskoff, G.M.; Crooks, G.E. Optimal control in nonequilibrium systems: Dynamic Riemannian geometry of the Ising model. Phys. Rev. E 2015, 92, 60102.
14. Vaikuntanathan, S.; Jarzynski, C. Escorted Free Energy Simulations: Improving Convergence by Reducing Dissipation. Phys. Rev. Lett. 2008, 100, 190601.
15. Zhang, W.; Wang, H.; Hartmann, C.; Weber, M.; Schütte, C. Applications of the cross-entropy method to importance sampling and optimal control of diffusions. SIAM J. Sci. Comput. 2014, 36, A2654–A2672.
16. Dupuis, P.; Wang, H. Importance sampling, large deviations, and differential games. Stoch. Int. J. Probab. Stoch. Proc. 2004, 76, 481–508.
17. Dupuis, P.; Wang, H. Subsolutions of an Isaacs equation and efficient schemes for importance sampling. Math. Oper. Res. 2007, 32, 723–757.
18. Vanden-Eijnden, E.; Weare, J. Rare Event Simulation of Small Noise Diffusions. Commun. Pure Appl. Math. 2012, 65, 1770–1803.
19. Roberts, G.O.; Tweedie, R.L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 1996, 2, 341–363.
20. Glasserman, P. Monte Carlo Methods in Financial Engineering; Springer: New York, NY, USA, 2004.
21. Lelièvre, T.; Stoltz, G. Partial differential equations and stochastic methods in molecular dynamics. Acta Numer. 2016, 25, 681–880.
22. Bennett, C.H. Efficient estimation of free energy differences from Monte Carlo data. J. Comput. Phys. 1976, 22, 245–268.
23. Øksendal, B. Stochastic Differential Equations: An Introduction with Applications; Springer: Berlin, Germany, 2003.
24. Lapeyre, B.; Pardoux, E.; Sentis, R. Méthodes de Monte Carlo pour les Équations de Transport et de Diffusion; Springer: Berlin, Germany, 1998. (In French)
25. Sivak, D.A.; Chodera, J.D.; Crooks, G.E. Using Nonequilibrium Fluctuation Theorems to Understand and Correct Errors in Equilibrium and Nonequilibrium Simulations of Discrete Langevin Dynamics. Phys. Rev. X 2013, 3, 11007.
26. Darve, E.; Rodriguez-Gomez, D.; Pohorille, A. Adaptive biasing force method for scalar and vector free energy calculations. J. Chem. Phys. 2008, 128, 144120.
27. Lelièvre, T.; Rousset, M.; Stoltz, G. Computation of free energy profiles with parallel adaptive dynamics. J. Chem. Phys. 2007, 126, 134111.
28. Lelièvre, T.; Rousset, M.; Stoltz, G. Long-time convergence of an adaptive biasing force method. Nonlinearity 2008, 21, 1155–1181.
29. Hartmann, C.; Schütte, C.; Zhang, W. Model reduction algorithms for optimal control and importance sampling of diffusions. Nonlinearity 2016, 29, 2298–2326.
30. Zhang, W.; Hartmann, C.; Schütte, C. Effective dynamics along given reaction coordinates, and reaction rate theory. Faraday Discuss. 2016, 195, 365–394.
31. Hartmann, C.; Latorre, J.C.; Pavliotis, G.A.; Zhang, W. Optimal control of multiscale systems using reduced-order models. J. Comput. Dyn. 2014, 1, 279–306.
32. Hartmann, C.; Schütte, C.; Weber, M.; Zhang, W. Importance sampling in path space for diffusion processes with slow-fast variables. Probab. Theory Relat. Fields 2017.
33. Lie, H.C. On a Strongly Convex Approximation of a Stochastic Optimal Control Problem for Importance Sampling of Metastable Diffusions. Ph.D. Thesis, Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany, 2016.
34. Richter, L. Efficient Statistical Estimation Using Stochastic Control and Optimization. Master’s Thesis, Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany, 2016.
35. Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 1999.
36. Banisch, R.; Hartmann, C. A sparse Markov chain approximation of LQ-type stochastic control problems. Math. Control Relat. Fields 2016, 6, 363–389.
37. Schütte, C.; Winkelmann, S.; Hartmann, C. Optimal control of molecular dynamics using Markov state models. Math. Program. Ser. B 2012, 134, 259–282.
38. Bertsekas, D.P. Approximate policy iteration: A survey and some new methods. J. Control Theory Appl. 2011, 9, 310–355.
39. El Karoui, N.; Hamadène, S.; Matoussi, A. Backward stochastic differential equations and applications. Appl. Math. Optim. 2008, 27, 267–320.
40. Bender, C.; Steiner, J. Least-Squares Monte Carlo for BSDEs. In Numerical Methods in Finance; Carmona, R., Del Moral, P., Hu, P., Oudjane, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 257–289.
41. Gobet, E.; Turkedjiev, P. Adaptive importance sampling in least-squares Monte Carlo algorithms for backward stochastic differential equations. Stoch. Proc. Appl. 2017, 127, 1171–1203.
42. Hartmann, C.; Kebiri, O.; Neureither, L. Importance sampling of rare events using least squares Monte Carlo. 2018; in preparation.
43. Papaspiliopoulos, O.; Roberts, G.O. Importance sampling techniques for estimation of diffusion models. Centre for Research in Statistical Methodology Working Papers, No. 28; University of Warwick: Coventry, UK, 2009.
Figure 1. Comparison of the cross-entropy (green) and the gradient descent method (blue) for a rare event with probability $p \approx 2.867 \times 10^{-7}$ and fixed sample size $N = 10^8$. Both algorithms quickly converge to the optimal tilting parameter $\alpha^* \approx 5.187$ for the family $\mathcal{N}(\alpha, 1)$ of importance sampling distributions (left panel) and lead to a drastic reduction of the normalized relative error by a factor of 1000, from about 2000 to 2.38 after a few iterations (right panel).
