1. Introduction
This paper presents probably approximately correct (PAC)-Bayesian bounds on variational Bayesian (VB) approximations of fractional or tempered posterior distributions for Markov data generation models. Exact computation of either standard or tempered posterior distributions is a hard problem that has, broadly speaking, spawned two classes of computational methods. The first, Markov chain Monte Carlo (MCMC), constructs ergodic Markov chains to approximately sample from the posterior distribution. MCMC is known to suffer from high variance and complex diagnostics, leading to the development of variational Bayesian (VB) [
1] methods as an alternative in recent years. VB methods pose posterior computation as a variational optimization problem, approximating the posterior distribution of interest by the ‘closest’ element of an appropriately defined class of ‘simple’ probability measures. Typically, the measure of closeness used by VB methods is the Kullback–Leibler (KL) divergence. Excellent introductions to this so-called
KL-VB method can be found in [
2,
3,
4]. More recently, there has also been interest in alternative divergence measures, particularly the
-Rényi divergence [
5,
6,
7], though in this paper, we focus on the KL-VB setting.
Theoretical properties of VB approximations, and in particular asymptotic frequentist consistency, have been studied extensively under the assumption of an independent and identically distributed (i.i.d.) data generation model [
4,
8,
9]. On the other hand, the common setting where data sets display temporal dependencies presents unique challenges. In this paper, we focus on homogeneous Markov chains with parameterized transition kernels, representing a parsimonious class of data generation models with a wide range of applications. We work in the Bayesian framework, focusing on the posterior distribution over the unknown parameters of the transition kernel. Our theory develops PAC bounds that link the ergodic and mixing properties of the data generating Markov chain to the Bayes risk associated with approximate posterior distributions.
Frequentist consistency of Bayesian methods, in the sense of concentration of the posterior distribution around neighborhoods of the ‘true’ data generating distribution, have been established in significant generality, in both the i.i.d. [
10,
11,
12] and in the non-i.i.d. data generation setting [
13,
14]. More recent work [
14,
15,
16] has studied fractional or tempered posteriors, a class of generalized Bayesian posteriors obtained by combining the likelihood function raised to a fractional power with an appropriate prior distribution using Bayes’ theorem. Tempered posteriors are known to be robust against model misspecification: in the Markov setting we consider, the associated stationary distribution as well as mixing properties are sensitive to model parameterization. Further, tempered posteriors are known to be much simpler to analyze theoretically [
14,
16]. Therefore, following [
14,
15,
16] we focus on tempered posterior distributions on the transition kernel parameters, and study the rate of concentration of variational approximations to the tempered posterior. Equivalently, as shown in [
16] and discussed in
Section 1.1, our results also apply to so-called
-variational approximations to standard posterior distributions over kernel parameters. The latter are modifications of the standard KL-VB algorithm to address the well-known problem of overconfident posterior approximations.
While there have been a number of recent papers studying the consistency of approximate variational posteriors [
5,
8,
15] in the large sample limit, rates of convergence have received less attention. Exceptions include [
9,
15,
17], where an i.i.d. data generation model is assumed. [
15] establishes PAC-Bayes bounds on the convergence of a variational tempered posterior with fractional powers in the range
, while [
9] considers the standard variational posterior case (where the fractional power equals 1). [
17], on the other hand, establishes PAC-Bayes bounds for risk-sensitive Bayesian decision making problems in the standard variational posterior setting. The setting in [
15] allows for model misspecification and the analysis is generally more straightforward than that in [
9,
17]. Our work extends [
15] to the setting of a discrete-time Markov data generation model.
Our first results in Theorem 1 and Corollary 1 of
Section 2 establish PAC-Bayes bounds for sequences with arbitrary temporal dependence. Our results generalize [
15], [Theorem 2.4] to the non-i.i.d. data setting in a straightforward manner. Note that Theorem 1 also recovers ([
16], [Theorem 3.3]), which is established under different ‘existence of test’ conditions. Our objective in this paper is to explicate how the ergodic and mixing properties of the Markov data generating process influences the PAC-Bayes bound. The sufficient conditions of our theorem, bounding the mean and variance of the log-likelihood ratio of the data, allows for developing this understanding, without the technicalities of proving the existence of test conditions intruding on the insights.
In
Section 3, we study the setting where the data generating model is a stationary
-mixing Markov chain. Stationarity means that the Markov chain is initialized with the invariant distribution corresponding to the parameterized transition kernel, implying all subsequent states also follow this marginal distribution. The
-mixing condition ensures that the variance of the likelihood ratio of the Markov data does not grow faster than linear in the sample size. Our main results in this setting are applicable when the state space of the Markov chain is either continuous or discrete. The primary requirement on the class of data generating Markov models is for the log-likelihood ratio of the parameterized transition kernel and invariant distribution to satisfy a Lipschitz property. This condition implies a decoupling between the model parameters and the random samples, affording a straightforward verification of the mean and variance bounds. We highlight this main result by demonstrating that it is satisfied by a finite state Markov chain, a birth-death Markov chain on the positive integers, and a one-dimensional Gaussian linear model.
In practice, the assumption that the data generating model is stationary is unlikely to be satisfied. Typically, the initial distribution is arbitrary, with the state distribution of the Markov sequence converging weakly to the stationary distribution. In this setting, we must further assume that the class of data generating Markov chains are geometrically ergodic. We show that this implies the boundedness of the mean and variance of the log-likelihood ratio of the data generating Markov chain. Alternatively, in Theorem 4 we directly impose a drift condition on random variables that bound the log-likelihood ratio. Again, in this more general nonstationary setting, we illustrate the main results by showing that the PAC-Bayes bound is satisfied by a finite state Markov chain, a birth-death Markov chain on the positive integers, and a one-dimensional Gaussian linear model.
In preparation for our main technical results starting in
Section 2 we first note relevant notations and definitions in the next section.
1.1. Notations and Definitions
We broadly adopt the notation in [
15]. Let the sequence of random variables
represent a dataset of
observations drawn from a joint distribution
, where
is the ‘true’ parameter underlying the data generation process. We assume the state space
of the random variables
is either discrete-valued or continuous, and write
for a realization of the dataset. We also adopt the convention that
.
For each , we will write as the probability density of with respect to some measure , i.e., , where is either Lebesgue measure or the counting measure. Unless stated otherwise, all probabilities, expectations and variances, which we represent as P, and , are with respect to the true distribution .
Let
be a
prior distribution with support
. The
-
fractional posterior is defined as
where, for
,
, is the log-likelihood ratio of the corresponding density functions, and
is a tempering coefficient. Setting
recovers the standard Bayesian posterior. Note that we will use superscripts to distinguish different quantities that are referred to just as
in the literature.
The
Kullback–Leibler (KL) divergence between distributions
is defined as
where
are the densities corresponding to
on some sample space
. In particular, the KL divergence between the distributions parameterized by
and
is
The
-Rényi divergence is defined as
where
. As
, the
-Rényi divergence recovers the KL divergence.
Let be some class of distributions with support in and such that any distribution P in is absolutely continuous with respect to the tempered posterior: .
Many choices of
exist; for instance (see also [
15]),
can be the set of Gaussian measures, denoted
:
where P.D. references the class of positive definite matrices. Alternately,
can be the family of
mean-field or factored distributions where the components
of
are independent of each other. Let
be the variational approximation to the tempered posterior, defined as
It is easy to see that finding
in Equation (
5) is equivalent to the following optimization problem:
Setting
again recovers the usual variational solution that seeks to approximate the posterior distribution with the closest element of
(the right-hand side above is called the evidence lower bound (ELBO)). Other settings of
constitute
-variational inference [
16], which seeks to regularize the ‘overconfident’ approximate posteriors that standard variational methods tend to produce.
Our results in this paper focus on parametrized Markov chains. We term a Markov chain as ‘parameterized’ if the transition kernel
is parametrized by some
. Let
be the initial density (defined with respect to the Lebesgue measure over
) or initial probability mass function. Then, the joint density is
; recall, this joint density
corresponds to the walk probability of a time-homogeneous Markov chain. We assume that corresponding to each transition kernel
there exists an invariant distribution
that satisfies
We will also use to designate the density of the invariant measure (as before, this is with respect to the Lebesgue or counting measure for continuous or discrete state spaces, respectively). A Markov chain is stationary if its initial distribution is the invariant probability distribution, that is, .
Our results in the ensuing sections will be established under strong mixing conditions [
18] on the Markov chain. Specifically, recall the definition of the
-mixing coefficients of a Markov chain
:
Definition 1 (
-mixing coefficient).
Let denote the σ-field generated by the Markov chain parameterized by . Then, the α-mixing coefficient is defined as Informally speaking, the -mixing coefficients measure the dependence between any two events A (in the ‘history’ -algebra) and B (in the ‘future’ -algebra) with a time lag k. We note that we do not use superscripts to identify these parameters, since they are the only ones with subscripts, and can be identified through this.
2. A Concentration Bound for the -Rényi Divergence
The object of analysis in what follows is the probability measure
, the variational approximation to the tempered posterior. Our main result establishes a bound on the Bayes risk of this distribution; in particular, given a sequence of loss functions
, we bound
. Following recent work in both the i.i.d. and dependent sequence settings [
14,
15,
16], we will use
, the
-Rényi divergence between
and
as our loss function. Unlike loss functions like Euclidean distance, Rényi divergence compares
and
through their effect on observed sequences, so that issues like parameter identifiability no longer arise. Our first result generalizes [
15], [Theorem 2.1] to a general non-i.i.d. data setting.
Proposition 1. Let be a subset of all probability distributions on Θ. For any , and , the following probabilistic uniform upper bound on the expected -Rényi divergence holds: The proof of Proposition 1 follows easily from [
15], and we include it in
Appendix B.1.1 for completeness. Mirroring the comments in [
15], when
this result is precisely [
14, Theorem 3.4]. We also note from [
14] that
-Rényi divergences are all equivalent through the following inequality
. Hence, for the subsequent results, we simplify by assuming that
. This probabilistic bound implies the following PAC-Bayesian concentration bound on the model risk computed with respect to the fractional variational posterior:
Theorem 1. Let be a subset of all probability distributions parameterized by Θ, and assume there exist and such that
- i.
,
- ii.
, and
- iii.
.
Then, for any and , The proof of Theorem 1 is a generalization of [
15] (Theorem 2.4) to the non-i.i.d. setting, and a special case of [
16] (Theorem 3.1), where the problem setting includes latent variables. We include a proof for completeness. As noted in [
15], the sufficient conditions follow closely from [
13] and we will show that they hold for a variety of Markov chain models.
A direct corollary of Theorem 1 follows by setting
,
and using the fact that
. Note that Equation (
9) is vacuous if
. Therefore, without loss of generality, we restrict ourselves to the condition
.
Corollary 1. Assume , such that the following conditions hold:
- i.
,
- ii.
, and
- iii.
.
Then, for any , We observe that Theorem 1 and Corollary 1 place no assumptions on the nature of the statistical dependence between data points. However, verification of the sufficient conditions is quite hard, in general. One of our key contributions is to verify that under reasonable assumptions on the smoothness of the transition kernel, the sufficient conditions of Theorem 1 and Corollary 1 are satisfied by ergodic Markov chains.
Observe that the first two conditions in Corollary 1 ensure that the distribution concentrates on parameters around the true parameter , while the third condition requires that not diverge from the prior rapidly as a function of the sample size n. In general, verifying the first and third conditions is relatively straightforward. The second condition, on the other hand, is significantly more complicated in the current setting of dependent data, as the variance of includes correlations between the observations . In the next section, we will make assumptions on the transition kernels (and corresponding invariant densities) that ’decouple’ the temporal correlations and the model parameters in the setting of strongly mixing and ergodic Markov chain models, and allow for the verification of the conditions in Corollary 1. Towards this, Propositions 2 and 3 below characterize the expectation and variance of the log-likelihood ratio in terms of the one-step transition kernels of the Markov chain. First, consider the expectation of in condition (i).
Proposition 2. Fix and consider the parameterized Markov transition kernels and , and initial distributions and . Let and be the corresponding joint probability densities; that is,for . Then, for any , the log-likelihood ratio satisfieswhere . The expectation in the first term is with respect to the joint density function where the marginal density satisfiesIf the Markov chain is also stationary under , then Equation (12) simplifies to Notice that
is precisely the KL divergence,
. Next, the following proposition uses [
19] (Lemma 1.3) to upper bound the variance of the log-likelihood ratio.
Proposition 3. Fix and consider parameterized Markov transition kernels and , with initial distributions and . Let and be the corresponding joint probability densities of the sequence , and the marginal density for and . Fix and, for each , defineSimilarly, define , and Suppose the Markov chain corresponding to is α-mixing with coefficients . Then, Note that this result holds for any parameterized Markov chain. In particular, when the Markov chain is stationary,
and
, and Equation (14) simplifies to
If the sum
is infinite, the bound is trivially true. For it to be finite, of course, the coefficients
must decay to zero sufficiently quickly. For instance, Theorem A.1.2 shows that if the Markov chain is geometrically ergodic, then the
-mixing coefficients are geometrically decreasing. We will use this fact when the Markov chain is non-stationary, as in
Section 4. In the next section, however, we first consider the simpler stationary Markov chain setting where geometric ergodic conditions are not explicitly imposed. We also note that unless only a finite number of
are nonzero, the sum
is infinite when
, and our results will typically require
.
3. Stationary Markov Data-Generating Models
Observe that the PAC-Bayesian concentration bound in Corollary 1 specifically requires bounding the mean and variance of the log-likelihood ratio . We ensure this by imposing regularity conditions on the log-ratio of the one-step transition kernels and the corresponding invariant densities. Specifically, we assume the following conditions that decouple the model parameters from the random samples, allowing us to verify the bounds in Corollary 1.
Assumption 1. There exist positive functions and , such that for any parameters , the log of the ratio of one-step transition kernels and the log of the ratio of the invariant distributions satisfy, respectively,We further assume that for some , the functions and satisfy the following: - i
there exist constants and measures such that for , and , and
- ii
there exists a constant B such that and .
The following examples illustrate Equations (
17) and (
18) for discrete and continuous state Markov chains.
Example 1. Suppose is generated by the birth-death chain with parameterized transition probability mass function,In this example, the parameter θ denotes the probability of birth. We shall see that, : , , and . We also define , and set and both to . Let , , , , and . The derivation of these terms and that they satisfy the conditions of Assumption 1 is provided in the proof of Proposition 6. Example 2. Suppose is generated by the ‘simple linear’ Gauss–Markov modelwhere is a sequence of i.i.d. standard Gaussian random variables. Then, , with , and . Corresponding to these, we have and . The derivation of these quantities and that these satisfy the conditions of Assumption 1 under appropriate choice of is shown in the proof of Proposition 10. Note that assuming the same number
m of
and
involves no loss of generality, since these functions can be set to 0. Both Equations (
17) and (
18) can be viewed as generalized Lipschitz-smoothness conditions, recovering the usual Lipschitz-smoothness when
and when
is Euclidean distance. Our generalized conditions are useful for distributions like the Gaussian, where Lipschitz smoothness does not apply. From Jensen’s inequality we have
, and Assumption 1(i) above implies that for some constant
and
,
Assumption 1(i) is satisfied in a variety of scenarios, for example, under mild assumptions on the partial derivatives of the functions
. To illustrate this, we present the following proposition.
Proposition 4. Let be a function on a bounded domain with bounded partial derivatives with . Let be a sequence of probability densities on θ such that and for some . Then, for some , Proof. Define as the partial derivative of the function f. By the mean value theorem, , for some . Since the partial derivatives are bounded, there exists such that , and . Choose be such that , then . Therefore, . Now choosing as C completes the proof. □
If is continuous and is compact, then is always bounded. Furthermore, observe that if , without loss of generality we can use Jensen’s inequality to conclude that, for all , .
We can now state the main theorem of this section.
Theorem 2. Let be generated by a stationary, α-mixing Markov chain parametrized by . Suppose that Assumption 1 holds and that the α-mixing coefficients satisfy . Furthermore, assume that for some constant . Then, the conditions of Corollary 1 are satisfied with .
Theorem 2 is satisfied by a large class of Markov chains, including chains with countable and continuous state spaces. In particular, if the Markov chain is geometrically ergodic, then it follows from Equation (A4) (in the
appendix) that
. Observe that in order to achieve
convergence, we need
. Key to the proof of Theorem 2 is the fact that the variance of the log-likelihood ratio can be controlled via the application of Assumption 1 and Proposition 3. Note also that as
decreases, satisfying the condition
requires the Markov chain to be faster mixing.
We now illustrate Theorem 2 for a number of Markov chain models. First, consider a birth-death Markov chain on a finite state space.
Proposition 5. Suppose the data-generating process is a birth-death Markov chain, with one-step transition kernel parametrized by the birth probability . Let be the set of all Beta distributions. We choose the prior to be a Beta distribution. Then, the conditions of Theorem 2 are satisfied and .
Proof. The proof of Proposition 5 follows from the more general Proposition 8, by fixing the initial distribution to the invariant distribution under
. Therefore it has been omitted. We simply refer to the proof of Proposition 8 under a more general setup in
Appendix B.3. □
The birth-death chain on the finite state space is, of course, geometrically ergodic and the -mixing coefficients decay geometrically. Note that the invariant distribution of this Markov chain is uniform over the state space, and consequently this is a particularly simple example. A more complicated and more realistic example is a birth-death Markov chain on the nonnegative integers. We note that if the probability of birth in a birth-death Markov chain on positive integers is greater than , then the Markov chain is transient, and consequently, not ergodic. Hence, our prior should be chosen to have support within . For that purpose, we define the class of scaled beta distributions.
Definition 2 (Scaled Beta).
If X is a beta distribution on with parameters a and b, then Y is said to be a scaled beta distribution with same parameters on the interval ifand in that case, the pdf of Y is obtained as Here, and . For the birth-death chain, we set and giving it support on . Setting and gives a beta distribution rescaled to have support on .
Proposition 6. Suppose the data-generating process is a positive recurrent birth-death Markov chain on the positive integers parameterized by the birth probability . Further let be the set of all Beta distributions rescaled to have support . We choose the prior to be a scaled Beta distribution on with parameters a and b. Then, the conditions of Theorem 2 are satisfied with .
Proof. The proof of Proposition 6 (for the stationary case) follows from the more general Proposition 9 (the nonstationary case) by fixing the initial distribution to the invariant distribution under
. We omit the proof and simply refer to the proof of Proposition 9 under a more general setup in
Appendix B.3. □
Unlike with the finite state-space, the invariant distribution now depends on the parameter
, and verification of the conditions of the proposition is more involved. In
Appendix A.2, we prove that the class of scaled beta distributions satisfy the condition
when the prior
is a beta or an uniform distribution. This fact will allow us to prove the above propositions.
Both Proposition 5 and Proposition 6 assume a discrete state space. The next example considers a strictly stationary simple linear model (as defined in Example 2), which has a continuous, unbounded state space.
Proposition 7. Suppose the data-generating model is a stationary simple linear model:where are i.i.d. standard Gaussian random variables and . Suppose that is the class of all beta distributions rescaled to have the support . Then, the conditions of Theorem 2 are satisfied with . Proof. This is a special case of the more general non-stationary simple linear model which is detailed in Proposition 10. Therefore, the proof of the fact that the simple linear model satisfies Assumption 1 when starting from stationarity is deferred to the proof of Proposition 10. The simple linear model with
has geometrically decreasing (and therefore summable)
-mixing coefficients as a consequence of [
20] (eq. (15.49)) and Theorem A.1.2. Combining these two facts, it follows that the conditions of Theorem 2 are satisfied. □
Observe that Theorem 1 (and Corollary 1) are general, and hold for any dependent data-generating process. Therefore, there can be Markov chains that satisfy these, but do not satisfy Assumption 1 which entails some loss of generality. However, as our examples demonstrate, common Markov chain models do indeed satisfy the latter assumption.
4. Non-Stationary, Ergodic Markov Data-Generating Models
We call a time-homogeneous Markov chain
non-stationary if the initial distribution
is not the invariant distribution. There are two sets of results in this setting: in Theorem 3 and Theorem 4 we explicitly impose the
-mixing condition, while in Theorem 5 we impose a
f-geometric ergodicity condition (Definition A.1.2 in the
appendix). As seen in Equation (A4) (in the
appendix) if the Markov chain is also geometrically ergodic, then
,
. This condition can be relaxed, albeit at the risk of more complicated calculations that, nonetheless, mirror those in the geometrically ergodic setting. A common thread through these results is that we must impose some integrability or regularity conditions on the functions
.
First, in Theorem 3 we assume that the functions in Assumption 1 are uniformly bounded and that the -mixing condition is satisfied. This result holds for both discrete and continuous state space settings.
Theorem 3. Let be generated by an α-mixing Markov chain parametrized by with transition probabilities satisfying Assumption 1 and with known initial distribution . Let be the α-mixing coefficients under , and assume that . Suppose that there exists such that for all in Assumption 1. Furthermore, assume that there exists such that for some constant . If the initial distribution satisfies for all , then the conditions of Corollary 1 are satisfied with .
The following result in Proposition 8 illustrates Theorem 3 in the setting of a finite state birth-death Markov chain.
Proposition 8. Suppose the data-generating process is a finite state birth-death Markov chain, with one-step transition kernel parametrized by the birth probability . Let be the set of all Beta distributions. We choose the prior on to be a Beta distribution. Then, the conditions of Theorem 3 are satisfied with for any initial distribution .
Theorem 3 also applies to data generated by Markov chains with countably infinite state spaces, so long as the class of data-generating Markov chains is strongly ergodic and the initial distribution has finite second moments. The following example demonstrates this in the setting of a birth-death Markov chain on the positive integers, where the initial distribution is assumed to have finite second moments.
Proposition 9. Suppose the data-generating process is a birth-death Markov chain on the non-negative integers, parameterized by the probability of birth . Further let be the set of all Beta distributions rescaled upon the support . Let be a probability mass function on non-negative integers such that . We choose the prior to be a scaled Beta distribution on with parameters a and b. Then, the conditions of Theorem 3 are satisfied with .
Since continuous functions on a compact domain are bounded, we have the following (easy) corollary (stated without proof).
Corollary 2. Let be generated by an α-mixing Markov chain parametrized by on a compact state space, and with initial distribution . Suppose the α-mixing coefficients satisfy , and that Assumption 1 holds with continuous functions , . Furthermore, assume that there exists such that for some constant C. Then, Theorem 3 is satisfied with .
In general, the
functions will not be uniformly bounded (consider the case of the Gauss–Markov simple linear model in Example 2), and stronger conditions must be imposed on the data-generating Markov chain itself. The following assumption imposes a ‘drift’ condition from [
21]. Specifically, [
21] (Theorem 2.3) shows that under the conditions of Assumption 2, the moment generating function of an aperiodic Markov chain
can be upper bounded by a function of the moment generating function of
. Together with the
-mixing condition, Assumption 2 implies that this Markov data generating process satisfies Corollary 1.
Assumption 2. Consider a Markov chain parameterized by . Let denote the σ-field generated by . Denote the stochastic process ; recall , for each , are defined in Assumption 1. For each , assume the process satisfies the following conditions:
The drift condition holds for , i.e., for some .
For some and , .
Under this drift condition, the next theorem shows that Corollary 1 is satisfied.
Theorem 4. Let be generated by an aperiodic α-mixing Markov chain parametrized by and initial distribution . Suppose that Assumption 1 and Assumption 2 hold, and that the α-mixing coefficients satisfy . Furthermore, assume for some constant . If for all , then the conditions of Corollary 1 are satisfied with .
Verifying the conditions in Theorem 4 can be quite challenging. Instead, we suggest a different approach that requires f-geometric ergodicity. Unlike the drift condition in Assumption 2, f-geometric ergodicity additionally requires the existence of a petite set. As noted before, geometric ergodicity implies -mixing with geometrically decaying mixing coefficients. As with Theorem 4, we assume for simplicity that the Markov chain is aperiodic.
Theorem 5. Let be generated by an aperiodic Markov chain parametrized by with known initial distribution , and assumed to be V-geometrically ergodic for some . Suppose that Assumption 1 holds and . Furthermore, assume that for some constant . If the initial distribution satisfies , then the conditions of Corollary 1 are satisfied with .
The following Proposition 10 shows, the simple linear model satisfies Theorem 5 when the parameter is suitably restricted.
Proposition 10. Consider the simple linear model satisfying the equationwhere are i.i.d. standard Gaussian random variables and for . Let be the space of all scaled Beta distributions on and suppose the prior π is a uniform distribution on . Then, the conditions of Theorem 5 are satisfied with , if the initial distribution satisfies .