PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models

Banerjee, Imon; Rao, Vinayak A.; Honnappa, Harsha

doi:10.3390/e23030313

Open AccessArticle

PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models

by

Imon Banerjee

¹,

Vinayak A. Rao

^1,*,† and

Harsha Honnappa

^2,†

¹

Department of Statistics, Purdue University, West Lafayette, IN 47907, USA

²

School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, USA

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Entropy 2021, 23(3), 313; https://doi.org/10.3390/e23030313

Submission received: 8 February 2021 / Revised: 4 March 2021 / Accepted: 4 March 2021 / Published: 6 March 2021

(This article belongs to the Special Issue Approximate Bayesian Inference)

Download Versions Notes

Abstract

:

Datasets displaying temporal dependencies abound in science and engineering applications, with Markov models representing a simplified and popular view of the temporal dependence structure. In this paper, we consider Bayesian settings that place prior distributions over the parameters of the transition kernel of a Markov model, and seek to characterize the resulting, typically intractable, posterior distributions. We present a Probably Approximately Correct (PAC)-Bayesian analysis of variational Bayes (VB) approximations to tempered Bayesian posterior distributions, bounding the model risk of the VB approximations. Tempered posteriors are known to be robust to model misspecification, and their variational approximations do not suffer the usual problems of over confident approximations. Our results tie the risk bounds to the mixing and ergodic properties of the Markov data generating model. We illustrate the PAC-Bayes bounds through a number of example Markov models, and also consider the situation where the Markov model is misspecified.

Keywords:

ergodicity; Markov chain; probably approximately correct; variational Bayes

1. Introduction

This paper presents probably approximately correct (PAC)-Bayesian bounds on variational Bayesian (VB) approximations of fractional or tempered posterior distributions for Markov data generation models. Exact computation of either standard or tempered posterior distributions is a hard problem that has, broadly speaking, spawned two classes of computational methods. The first, Markov chain Monte Carlo (MCMC), constructs ergodic Markov chains to approximately sample from the posterior distribution. MCMC is known to suffer from high variance and complex diagnostics, leading to the development of variational Bayesian (VB) [1] methods as an alternative in recent years. VB methods pose posterior computation as a variational optimization problem, approximating the posterior distribution of interest by the ‘closest’ element of an appropriately defined class of ‘simple’ probability measures. Typically, the measure of closeness used by VB methods is the Kullback–Leibler (KL) divergence. Excellent introductions to this so-called KL-VB method can be found in [2,3,4]. More recently, there has also been interest in alternative divergence measures, particularly the

α

-Rényi divergence [5,6,7], though in this paper, we focus on the KL-VB setting.

Theoretical properties of VB approximations, and in particular asymptotic frequentist consistency, have been studied extensively under the assumption of an independent and identically distributed (i.i.d.) data generation model [4,8,9]. On the other hand, the common setting where data sets display temporal dependencies presents unique challenges. In this paper, we focus on homogeneous Markov chains with parameterized transition kernels, representing a parsimonious class of data generation models with a wide range of applications. We work in the Bayesian framework, focusing on the posterior distribution over the unknown parameters of the transition kernel. Our theory develops PAC bounds that link the ergodic and mixing properties of the data generating Markov chain to the Bayes risk associated with approximate posterior distributions.

Frequentist consistency of Bayesian methods, in the sense of concentration of the posterior distribution around neighborhoods of the ‘true’ data generating distribution, have been established in significant generality, in both the i.i.d. [10,11,12] and in the non-i.i.d. data generation setting [13,14]. More recent work [14,15,16] has studied fractional or tempered posteriors, a class of generalized Bayesian posteriors obtained by combining the likelihood function raised to a fractional power with an appropriate prior distribution using Bayes’ theorem. Tempered posteriors are known to be robust against model misspecification: in the Markov setting we consider, the associated stationary distribution as well as mixing properties are sensitive to model parameterization. Further, tempered posteriors are known to be much simpler to analyze theoretically [14,16]. Therefore, following [14,15,16] we focus on tempered posterior distributions on the transition kernel parameters, and study the rate of concentration of variational approximations to the tempered posterior. Equivalently, as shown in [16] and discussed in Section 1.1, our results also apply to so-called

α

-variational approximations to standard posterior distributions over kernel parameters. The latter are modifications of the standard KL-VB algorithm to address the well-known problem of overconfident posterior approximations.

While there have been a number of recent papers studying the consistency of approximate variational posteriors [5,8,15] in the large sample limit, rates of convergence have received less attention. Exceptions include [9,15,17], where an i.i.d. data generation model is assumed. [15] establishes PAC-Bayes bounds on the convergence of a variational tempered posterior with fractional powers in the range

[0, 1)

, while [9] considers the standard variational posterior case (where the fractional power equals 1). [17], on the other hand, establishes PAC-Bayes bounds for risk-sensitive Bayesian decision making problems in the standard variational posterior setting. The setting in [15] allows for model misspecification and the analysis is generally more straightforward than that in [9,17]. Our work extends [15] to the setting of a discrete-time Markov data generation model.

Our first results in Theorem 1 and Corollary 1 of Section 2 establish PAC-Bayes bounds for sequences with arbitrary temporal dependence. Our results generalize [15], [Theorem 2.4] to the non-i.i.d. data setting in a straightforward manner. Note that Theorem 1 also recovers ([16], [Theorem 3.3]), which is established under different ‘existence of test’ conditions. Our objective in this paper is to explicate how the ergodic and mixing properties of the Markov data generating process influences the PAC-Bayes bound. The sufficient conditions of our theorem, bounding the mean and variance of the log-likelihood ratio of the data, allows for developing this understanding, without the technicalities of proving the existence of test conditions intruding on the insights.

In Section 3, we study the setting where the data generating model is a stationary

α

-mixing Markov chain. Stationarity means that the Markov chain is initialized with the invariant distribution corresponding to the parameterized transition kernel, implying all subsequent states also follow this marginal distribution. The

α

-mixing condition ensures that the variance of the likelihood ratio of the Markov data does not grow faster than linear in the sample size. Our main results in this setting are applicable when the state space of the Markov chain is either continuous or discrete. The primary requirement on the class of data generating Markov models is for the log-likelihood ratio of the parameterized transition kernel and invariant distribution to satisfy a Lipschitz property. This condition implies a decoupling between the model parameters and the random samples, affording a straightforward verification of the mean and variance bounds. We highlight this main result by demonstrating that it is satisfied by a finite state Markov chain, a birth-death Markov chain on the positive integers, and a one-dimensional Gaussian linear model.

In practice, the assumption that the data generating model is stationary is unlikely to be satisfied. Typically, the initial distribution is arbitrary, with the state distribution of the Markov sequence converging weakly to the stationary distribution. In this setting, we must further assume that the class of data generating Markov chains are geometrically ergodic. We show that this implies the boundedness of the mean and variance of the log-likelihood ratio of the data generating Markov chain. Alternatively, in Theorem 4 we directly impose a drift condition on random variables that bound the log-likelihood ratio. Again, in this more general nonstationary setting, we illustrate the main results by showing that the PAC-Bayes bound is satisfied by a finite state Markov chain, a birth-death Markov chain on the positive integers, and a one-dimensional Gaussian linear model.

In preparation for our main technical results starting in Section 2 we first note relevant notations and definitions in the next section.

1.1. Notations and Definitions

We broadly adopt the notation in [15]. Let the sequence of random variables

X^{n} = (X_{0}, \dots, X_{n}) \subset R^{m \times (n + 1)}

represent a dataset of

n + 1

observations drawn from a joint distribution

P_{θ_{0}}^{(n)}

, where

θ_{0} \in Θ \subseteq R^{d}

is the ‘true’ parameter underlying the data generation process. We assume the state space

S \subseteq R^{m}

of the random variables

X_{i}

is either discrete-valued or continuous, and write

{x_{0}, \dots, x_{n}}

for a realization of the dataset. We also adopt the convention that

0 log (0 / 0) = 0

.

For each

θ \in Θ

, we will write

p_{θ}^{(n)}

as the probability density of

P_{θ}^{(n)}

with respect to some measure

Q^{(n)}

, i.e.,

p_{θ}^{(n)} : = \frac{d P_{θ}^{(n)}}{d Q^{(n)}}

, where

Q^{(n)}

is either Lebesgue measure or the counting measure. Unless stated otherwise, all probabilities, expectations and variances, which we represent as P,

E [X]

and

Var [X]

, are with respect to the true distribution

P_{θ_{0}}^{(n)}

.

Let

π (θ)

be a prior distribution with support

Θ

. The

α^{t e}

-fractional posterior is defined as

\begin{matrix} π_{n, α^{t e} | X^{n}} (d θ) & : = \frac{e^{- α^{t e} r_{n} (θ, θ_{0}) (X^{n})} π (d θ)}{\int e^{- α^{t e} r_{n} (θ, θ_{0}) (X^{n})} π (d θ)}, \end{matrix}

(1)

where, for

θ_{0}, θ \in Θ

,

r_{n} (θ, θ_{0}) (\cdot) : = log (\frac{p_{θ_{0}}^{(n)} (\cdot)}{p_{θ}^{(n)} (\cdot)})

, is the log-likelihood ratio of the corresponding density functions, and

α^{t e} \in (0, \infty)

is a tempering coefficient. Setting

α^{t e} = 1

recovers the standard Bayesian posterior. Note that we will use superscripts to distinguish different quantities that are referred to just as

α

in the literature.

The Kullback–Leibler (KL) divergence between distributions

P, Q

is defined as

\begin{matrix} K (P, Q) : = \int_{X} log (\frac{p (x)}{q (x)}) p (x) d x, \end{matrix}

where

p, q

are the densities corresponding to

P, Q

on some sample space

X

. In particular, the KL divergence between the distributions parameterized by

θ_{0}

and

θ

is

\begin{matrix} K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) & : = \int log (\frac{p_{θ_{0}}^{(n)} (x_{0}, \dots, x_{n})}{p_{θ}^{(n)} (x_{0}, \dots, x_{n})}) p_{θ_{0}}^{(n)} (x_{0}, \dots, x_{n}) d x_{0} \dots d x_{n} \\ = \int r_{n} (θ, θ_{0}) (x_{0}, \dots, x_{n}) p_{θ_{0}}^{n} (x_{0}, \dots, x_{n}) d x_{0} \dots d x_{n} . \end{matrix}

(2)

The

α^{r e}

-Rényi divergence

D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)})

is defined as

\begin{matrix} D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) & : = \frac{1}{α^{r e} - 1} log \int exp (- α^{r e} r_{n} (θ, θ_{0}) (x_{0}, \dots, x_{n})) p_{θ_{0}}^{(n)} (x_{0}, \dots, x_{n}) d x_{0} \dots d x_{n}, \end{matrix}

(3)

where

α^{r e} \in (0, 1)

. As

α^{r e} \to 1

, the

α^{r e}

-Rényi divergence recovers the KL divergence.

Let

F

be some class of distributions with support in

R^{d}

and such that any distribution P in

F

is absolutely continuous with respect to the tempered posterior:

P ≪ π_{n, α^{t e} | X^{n}}

.

Many choices of

F

exist; for instance (see also [15]),

F

can be the set of Gaussian measures, denoted

F_{i d}^{Φ}

:

\begin{matrix} F_{i d}^{Φ} & = {Φ (d θ; μ, Σ) : μ \in R^{d}, Σ_{d \times d} \in P . D .}, \end{matrix}

(4)

where P.D. references the class of positive definite matrices. Alternately,

F

can be the family of mean-field or factored distributions where the components

θ_{i}

of

θ

are independent of each other. Let

{\tilde{π}}_{n, α^{t e} | X^{n}}

be the variational approximation to the tempered posterior, defined as

\begin{matrix} {\tilde{π}}_{n, α^{t e} | X^{n}} & : = \underset{ρ \in F}{arg min} K (ρ, π_{n, α^{t e} | X^{n}}) \end{matrix}

(5)

It is easy to see that finding

{\tilde{π}}_{n, α^{t e} | X^{n}}

in Equation (5) is equivalent to the following optimization problem:

\begin{matrix} {\tilde{π}}_{n, α^{t e} | X^{n}} & : = \underset{ρ \in F}{arg max} [\int r_{n} (θ, θ_{0}) (x_{0}, \dots, x_{n}) ρ (d θ) - {(α^{t e})}^{- 1} K (ρ, π)] . \end{matrix}

(6)

Setting

α^{t e} = 1

again recovers the usual variational solution that seeks to approximate the posterior distribution with the closest element of

F

(the right-hand side above is called the evidence lower bound (ELBO)). Other settings of

α^{t e}

constitute

α^{t e}

-variational inference [16], which seeks to regularize the ‘overconfident’ approximate posteriors that standard variational methods tend to produce.

Our results in this paper focus on parametrized Markov chains. We term a Markov chain as ‘parameterized’ if the transition kernel

p_{θ} (\cdot | \cdot)

is parametrized by some

θ \in Θ \subseteq R^{d}

. Let

q^{(0)} (\cdot)

be the initial density (defined with respect to the Lebesgue measure over

R^{m}

) or initial probability mass function. Then, the joint density is

p_{θ}^{(n)} (x_{0}, \dots, x_{n}) = q^{(0)} (x_{0}) \prod_{i = 0}^{n - 1} p_{θ} (x_{i + 1} | x_{i})

; recall, this joint density

p_{θ}^{(n)} (x_{0}, \dots, x_{n})

corresponds to the walk probability of a time-homogeneous Markov chain. We assume that corresponding to each transition kernel

p_{θ}, θ \in Θ,

there exists an invariant distribution

q_{θ}^{(\infty)} \equiv q_{θ}

that satisfies

q_{θ} (x) = \int p_{θ} (x | y) q_{θ} (d y) \forall x \in R^{m}, θ \in Θ .

We will also use

q_{θ}

to designate the density of the invariant measure (as before, this is with respect to the Lebesgue or counting measure for continuous or discrete state spaces, respectively). A Markov chain is stationary if its initial distribution is the invariant probability distribution, that is,

X_{0} \sim q_{θ}

.

Our results in the ensuing sections will be established under strong mixing conditions [18] on the Markov chain. Specifically, recall the definition of the

α

-mixing coefficients of a Markov chain

{X_{n}}

:

Definition 1

(

α

-mixing coefficient). Let

M_{i}^{j}

denote the σ-field generated by the Markov chain

{X_{k} : i \leq k \leq j}

parameterized by

θ \in Θ

. Then, the α-mixing coefficient is defined as

\begin{matrix} α_{k} = sup_{t > 0} sup_{(A, B) \in M_{- \infty}^{t} \times M_{t + k}^{\infty}} |P_{θ} (A \cap B) - P_{θ} (A) P_{θ} (B)| . \end{matrix}

(7)

Informally speaking, the

α

-mixing coefficients

{α_{k}}

measure the dependence between any two events A (in the ‘history’

σ

-algebra) and B (in the ‘future’

σ

-algebra) with a time lag k. We note that we do not use superscripts to identify these

α

parameters, since they are the only ones with subscripts, and can be identified through this.

2. A Concentration Bound for the $α^{re}$ -Rényi Divergence

The object of analysis in what follows is the probability measure

{\tilde{π}}_{n, α^{t e} | X^{n}} (θ)

, the variational approximation to the tempered posterior. Our main result establishes a bound on the Bayes risk of this distribution; in particular, given a sequence of loss functions

ℓ_{n} (θ, θ_{0})

, we bound

\int ℓ_{n} (θ, θ_{0}) {\tilde{π}}_{n, α^{t e} | X^{n}} (θ) d θ

. Following recent work in both the i.i.d. and dependent sequence settings [14,15,16], we will use

ℓ_{n} (θ, θ_{0}) = D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)})

, the

α^{r e}

-Rényi divergence between

P_{θ}^{(n)}

and

P_{θ_{0}}^{(n)}

as our loss function. Unlike loss functions like Euclidean distance, Rényi divergence compares

θ

and

θ_{0}

through their effect on observed sequences, so that issues like parameter identifiability no longer arise. Our first result generalizes [15], [Theorem 2.1] to a general non-i.i.d. data setting.

Proposition 1.

Let

F

be a subset of all probability distributions on Θ. For any

α^{r e} \in (0, 1)

,

ϵ \in (0, 1)

and

n \geq 1

, the following probabilistic uniform upper bound on the expected

α^{r e}

-Rényi divergence holds:

\begin{matrix} P [sup_{ρ \in F} \int D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) ρ (d θ) \leq \frac{α^{r e}}{1 - α^{r e}} \int r_{n} (θ, θ_{0}) ρ (d θ) + \frac{K (ρ, π) + log (\frac{1}{ϵ})}{1 - α^{r e}}] \geq 1 - ϵ . \end{matrix}

(8)

The proof of Proposition 1 follows easily from [15], and we include it in Appendix B.1.1 for completeness. Mirroring the comments in [15], when

ρ = {\tilde{π}}_{n, α^{t e}}

this result is precisely [14, Theorem 3.4]. We also note from [14] that

\forall α^{r e}, β \in (0, 1]

α^{r e}

-Rényi divergences are all equivalent through the following inequality

\frac{α^{r e} (1 - β)}{β (1 - α^{r e})} D_{β} \leq D_{α^{r e}} \leq D_{β} \forall α^{r e} \leq β

. Hence, for the subsequent results, we simplify by assuming that

α^{t e} = α^{r e}

. This probabilistic bound implies the following PAC-Bayesian concentration bound on the model risk computed with respect to the fractional variational posterior:

Theorem 1.

Let

F

be a subset of all probability distributions parameterized by Θ, and assume there exist

ϵ_{n} > 0

and

ρ_{n} \in F

such that

i.: $\int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) = \int E [r_{n} (θ, θ_{0})] ρ_{n} (d θ) \leq n ϵ_{n}$ ,
ii.: $\int Var (r_{n} (θ, θ_{0})) ρ_{n} (d θ) \leq n ϵ_{n}$ , and
iii.: $K (ρ_{n}, π) \leq n ϵ_{n}$ .

Then, for any

α^{r e} \in (0, 1)

and

(ϵ, η) \in (0, 1) \times (0, 1)

,

\begin{matrix} P [\int D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) {\tilde{π}}_{n, α^{r e}} (d θ | X^{(n)}) \leq \frac{(α^{r e} + 1) n ϵ_{n} + α^{r e} \sqrt{\frac{n ϵ_{n}}{η}} - log (ϵ)}{1 - α^{r e}}] \geq 1 - ϵ - η . \end{matrix}

(9)

The proof of Theorem 1 is a generalization of [15] (Theorem 2.4) to the non-i.i.d. setting, and a special case of [16] (Theorem 3.1), where the problem setting includes latent variables. We include a proof for completeness. As noted in [15], the sufficient conditions follow closely from [13] and we will show that they hold for a variety of Markov chain models.

A direct corollary of Theorem 1 follows by setting

η = \frac{1}{n ϵ_{n}}

,

ϵ = e^{- n ϵ_{n}}

and using the fact that

e^{- n ϵ_{n}} \geq \frac{1}{n ϵ_{n}}

. Note that Equation (9) is vacuous if

η + ϵ > 1

. Therefore, without loss of generality, we restrict ourselves to the condition

\frac{2}{n ϵ_{n}} < 1

.

Corollary 1.

Assume

\exists ϵ_{n} > 0

,

ρ_{n} \in F

such that the following conditions hold:

i.: $\int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) = \int E [r_{n} (θ, θ_{0})] ρ_{n} (d θ) \leq n ϵ_{n}$ ,
ii.: $\int Var (r_{n} (θ, θ_{0})) ρ_{n} (d θ) \leq n ϵ_{n}$ , and
iii.: $K (ρ_{n}, π) \leq n ϵ_{n}$ .

Then, for any

α^{r e} \in (0, 1)

,

\begin{matrix} P [\int D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) {\tilde{π}}_{n, α^{r e}} (d θ | X^{(n)}) \leq \frac{2 (α^{r e} + 1) ϵ_{n}}{1 - α^{r e}}] \geq 1 - \frac{2}{n ϵ_{n}} . \end{matrix}

(10)

We observe that Theorem 1 and Corollary 1 place no assumptions on the nature of the statistical dependence between data points. However, verification of the sufficient conditions is quite hard, in general. One of our key contributions is to verify that under reasonable assumptions on the smoothness of the transition kernel, the sufficient conditions of Theorem 1 and Corollary 1 are satisfied by ergodic Markov chains.

Observe that the first two conditions in Corollary 1 ensure that the distribution

ρ_{n}

concentrates on parameters

θ \in Θ

around the true parameter

θ_{0}

, while the third condition requires that

ρ_{n}

not diverge from the prior

π

rapidly as a function of the sample size n. In general, verifying the first and third conditions is relatively straightforward. The second condition, on the other hand, is significantly more complicated in the current setting of dependent data, as the variance of

r_{n} (θ, θ_{0})

includes correlations between the observations

{X_{0}, \dots, X_{n}}

. In the next section, we will make assumptions on the transition kernels (and corresponding invariant densities) that ’decouple’ the temporal correlations and the model parameters in the setting of strongly mixing and ergodic Markov chain models, and allow for the verification of the conditions in Corollary 1. Towards this, Propositions 2 and 3 below characterize the expectation and variance of the log-likelihood ratio

r_{n} (\cdot, \cdot)

in terms of the one-step transition kernels of the Markov chain. First, consider the expectation of

r_{n} (\cdot, \cdot)

in condition (i).

Proposition 2.

Fix

θ_{1}, θ_{2} \in Θ

and consider the parameterized Markov transition kernels

p_{θ_{1}}

and

p_{θ_{2}}

, and initial distributions

q_{θ_{1}}^{(0)}

and

q_{θ_{2}}^{(0)}

. Let

p_{θ_{1}}^{(n)}

and

p_{θ_{2}}^{(n)}

be the corresponding joint probability densities; that is,

p_{θ_{j}}^{(n)} (x_{0}, \dots, x_{n}) = q_{θ_{j}}^{(0)} (x_{0}) \prod_{i = 1}^{n} p_{θ_{i}} (x_{i} | x_{i - 1})

(11)

for

j \in {1, 2}

. Then, for any

n \geq 1

, the log-likelihood ratio

r_{n} (θ_{2}, θ_{1})

satisfies

\begin{matrix} E_{θ_{1}} [r_{n} (θ_{2}, θ_{1})] & = \sum_{i = 1}^{n} E_{θ_{1}} [log (\frac{p_{θ_{1}} (X_{i} | X_{i - 1})}{p_{θ_{2}} (X_{i} | X_{i - 1})})] + E_{θ_{1}} [Z_{0}], \end{matrix}

(12)

where

Z_{0} : = log (\frac{q_{θ_{1}}^{(0)} (X_{0})}{q_{θ_{2}}^{(0)} (X_{0})})

. The expectation in the first term is with respect to the joint density function

p_{θ_{1}} (y, x) = p_{θ_{1}} (y | x) q_{θ_{1}}^{(i - 1)} (x)

where the marginal density satisfies

\begin{matrix} q_{θ_{1}}^{(i - 1)} (x) = \{\begin{matrix} \int p_{θ_{1}}^{(i - 1)} (x_{0}, \dots, x_{i - 2}, x) d x_{0} \dots d x_{i - 2} & f o r i > 1, a n d \\ q_{θ_{1}}^{(0)} (x) & f o r i = 1 . \end{matrix} \end{matrix}

If the Markov chain is also stationary under

θ_{1}

, then Equation (12) simplifies to

\begin{matrix} E_{θ_{1}} [r_{n} (θ_{2}, θ_{1})] = n E_{θ_{1}} [log (\frac{p_{θ_{1}} (X_{1} | X_{0})}{p_{θ_{2}} (X_{1} | X_{0})})] + E_{θ_{1}} [Z_{0}] . \end{matrix}

(13)

Notice that

E_{θ_{1}} [r_{n} (θ_{2}, θ_{1})]

is precisely the KL divergence,

K (P_{θ_{1}}^{(n)}, P_{θ_{2}}^{(n)})

. Next, the following proposition uses [19] (Lemma 1.3) to upper bound the variance of the log-likelihood ratio.

Proposition 3.

Fix

θ_{1}, θ_{2} \in Θ

and consider parameterized Markov transition kernels

p_{θ_{1}}

and

p_{θ_{2}}

, with initial distributions

q_{θ_{1}}^{(0)}

and

q_{θ_{2}}^{(0)}

. Let

p_{θ_{1}}^{(n)}

and

p_{θ_{2}}^{(n)}

be the corresponding joint probability densities of the sequence

(x_{0}, \dots, x_{n})

, and

q_{θ_{j}}^{(i)}

the marginal density for

i \in {1, \dots, n}

and

j \in {1, 2}

. Fix

δ > 0

and, for each

i \in {1, \dots, n}

, define

\begin{matrix} C_{θ_{1}, θ_{2}}^{(i)} & : = \int {|log (\frac{p_{θ_{1}} (x_{i} | x_{i - 1})}{p_{θ_{2}} (x_{i} | x_{i - 1})})|}^{2 + δ} p_{θ_{1}} (x_{i} | x_{i - 1}) q_{θ_{1}}^{(i - 1)} (x_{i - 1}) d x_{i} d x_{i - 1} . \end{matrix}

Similarly, define

Z_{0} : = log (\frac{q_{θ_{1}}^{(0)} (X_{0})}{q_{θ_{2}}^{(0)} (X_{0})})

, and

D_{1, 2} : = E_{θ_{1}} {|Z_{0}|}^{2 + δ} .

Suppose the Markov chain corresponding to

θ_{1}

is α-mixing with coefficients

{α_{k}}

. Then,

\begin{matrix} Var (r_{n} (θ_{1}, θ_{2})) & < & \sum_{i, j = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} (C_{θ_{1}, θ_{2}}^{(i)} + C_{θ_{1}, θ_{2}}^{(j)} + \sqrt{C_{θ_{1}, θ_{2}}^{(i)} C_{θ_{1}, θ_{2}}^{(j)}})) (α_{| i - j | - 1}^{δ / (2 + δ)}) \\ (14) & + \sum_{i = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} (C_{θ_{1}, θ_{2}}^{(i)} + D_{1, 2} + \sqrt{C_{θ_{1}, θ_{2}}^{(i)} D_{1, 2}})) (α_{i - 1}^{δ / (2 + δ)}) \\ (15) & + Cov (Z_{0}, Z_{0}) . \end{matrix}

Note that this result holds for any parameterized Markov chain. In particular, when the Markov chain is stationary,

C_{θ_{1}, θ_{2}}^{(i)} = C_{θ_{1}, θ_{2}}^{(1)} \forall i

and

\forall θ \in Θ

, and Equation (14) simplifies to

\begin{matrix} Var (r_{n} (θ_{1}, θ_{2})) & < n (\frac{4}{n} + 6 n^{δ / 2} C_{θ_{1}, θ_{2}}^{(1)}) (\sum_{k \geq 0} α_{k}^{δ / (2 + δ)}) \\ + (\frac{4}{n} + 2 n^{δ / 2} (C_{θ_{1}, θ_{2}}^{(1)} + D_{1, 2} + \sqrt{C_{θ_{1}, θ_{2}}^{(1)} D_{1, 2}})) (\sum_{k \geq 1} α_{k}^{δ / (2 + δ)}) \\ + Cov (Z_{0}, Z_{0}) . \end{matrix}

(16)

If the sum

\sum_{k \geq 0} α_{k}^{δ / (2 + δ)}

is infinite, the bound is trivially true. For it to be finite, of course, the coefficients

α_{k}

must decay to zero sufficiently quickly. For instance, Theorem A.1.2 shows that if the Markov chain is geometrically ergodic, then the

α

-mixing coefficients are geometrically decreasing. We will use this fact when the Markov chain is non-stationary, as in Section 4. In the next section, however, we first consider the simpler stationary Markov chain setting where geometric ergodic conditions are not explicitly imposed. We also note that unless only a finite number of

α_{k}

are nonzero, the sum

\sum_{k \geq 0} α_{k}^{δ / (2 + δ)}

is infinite when

δ = 0

, and our results will typically require

δ > 0

.

3. Stationary Markov Data-Generating Models

Observe that the PAC-Bayesian concentration bound in Corollary 1 specifically requires bounding the mean and variance of the log-likelihood ratio

r_{n} (θ, θ_{0})

. We ensure this by imposing regularity conditions on the log-ratio of the one-step transition kernels and the corresponding invariant densities. Specifically, we assume the following conditions that decouple the model parameters from the random samples, allowing us to verify the bounds in Corollary 1.

Assumption 1.

There exist positive functions

M_{k}^{(1)} (\cdot, \cdot)

and

M_{k}^{(2)} (\cdot)

,

k \in {1, 2, \dots, m}

such that for any parameters

θ_{1}, θ_{2} \in Θ

, the log of the ratio of one-step transition kernels and the log of the ratio of the invariant distributions satisfy, respectively,

\begin{matrix} | log p_{θ_{1}} (x_{1} | x_{0}) - log p_{θ_{2}} (x_{1} | x_{0}) | \leq \sum_{k = 1}^{m} M_{k}^{(1)} (x_{1}, x_{0}) | f_{k}^{(1)} (θ_{2}, θ_{1}) | \forall (x_{0}, x_{1}), a n d \end{matrix}

(17)

\begin{matrix} | log q_{θ_{1}} (x) - log q_{θ_{2}} (x) | \leq \sum_{k = 1}^{m} M_{k}^{(2)} (x) | f_{k}^{(2)} (θ_{2}, θ_{1}) | \forall x . \end{matrix}

(18)

We further assume that for some

δ > 0

, the functions

f_{k}^{(1)}, f_{k}^{(2)}

and

M_{k}^{(1)}

satisfy the following:

i: there exist constants $C_{k}^{(t)}$ and measures $ρ_{n} \in F$ such that $\int | f_{k}^{(t)} (θ, θ_{0}) |^{2 + δ} ρ_{n} (d θ) < \frac{C_{k}^{(t)}}{n}$ for $t \in {1, 2}$ , $n \geq 1$ and $k \in {1, 2, \dots, m}$ , and
ii: there exists a constant B such that $\int M_{k}^{(1)} {(x_{1}, x_{0})}^{2 + δ} p_{θ_{j}} (x_{1} | x_{0}) q_{θ_{j}}^{(0)} (x_{0}) d x_{1} d x_{0} < B, k \in {1, \dots, m}$ and $j \in {1, 2}$ .

The following examples illustrate Equations (17) and (18) for discrete and continuous state Markov chains.

Example 1.

Suppose

{X_{0}, \dots, X_{n}}

is generated by the birth-death chain with parameterized transition probability mass function,

\begin{matrix} p_{θ} (j | i) = \{\begin{matrix} θ & i f j = i - 1, \\ 1 - θ & i f j = i + 1 . \end{matrix} \end{matrix}

In this example, the parameter θ denotes the probability of birth. We shall see that,

m = 3

:

M_{1}^{(1)} (X_{1}, X_{0}) = I_{[X_{1} = X_{0} + 1]}

,

M_{2}^{(1)} (X_{1}, X_{0}) = I_{[X_{1} = X_{0} - 1]}

, and

M_{3}^{(1)} (X_{1}, X_{0}) = 1

. We also define

M_{1}^{(2)} (X_{0}) = 1

, and set

M_{2}^{(2)} (X_{0})

and

M_{3}^{(2)} (X_{0})

both to

X_{0} - 1

. Let

f_{1}^{(1)} (θ, θ_{0}) = log [\frac{θ_{0}}{θ}]

,

f_{2}^{(1)} (θ, θ_{0}) = log [\frac{1 - θ_{0}}{1 - θ}]

,

f_{3}^{(1)} (θ, θ_{0}) = 0

,

f_{1}^{(2)} (θ, θ_{0}) = - f_{3}^{(2)} (θ, θ_{0}) = log [\frac{1 - θ_{0}}{1 - θ}]

, and

f_{2}^{(2)} (θ, θ_{0}) = log [\frac{θ_{0}}{θ}]

. The derivation of these terms and that they satisfy the conditions of Assumption 1 is provided in the proof of Proposition 6.

Example 2.

Suppose

{X_{0}, \dots, X_{n}}

is generated by the ‘simple linear’ Gauss–Markov model

X_{n} = θ X_{n - 1} + W_{n},

where

{W_{n}}

is a sequence of i.i.d. standard Gaussian random variables. Then,

m = 2

, with

M_{1}^{(1)} (X_{n}, X_{n - 1}) = | X_{n} X_{n - 1} |

,

M_{2}^{(1)} (X_{n}, X_{n - 1}) = X_{n}^{2}, M_{1}^{(2)} (x) = \frac{x^{2}}{2}

and

M_{2}^{(2)} (X) = 0

. Corresponding to these, we have

f_{1}^{(1)} (θ, θ_{0}) = (θ - θ_{0}), f_{2}^{(1)} (θ, θ_{0}) = (θ_{0}^{2} - θ^{2}), f_{1}^{(2)} (θ_{0}, θ_{0}) = (θ_{0}^{2} - θ^{2})

and

f_{2}^{(2)} (θ_{0}, θ_{0}) = 0

. The derivation of these quantities and that these satisfy the conditions of Assumption 1 under appropriate choice of

ρ_{n}

is shown in the proof of Proposition 10.

Note that assuming the same number m of

M_{k}^{(1)}

and

M_{k}^{(2)}

involves no loss of generality, since these functions can be set to 0. Both Equations (17) and (18) can be viewed as generalized Lipschitz-smoothness conditions, recovering the usual Lipschitz-smoothness when

m = 1

and when

f_{k}^{(t)}

is Euclidean distance. Our generalized conditions are useful for distributions like the Gaussian, where Lipschitz smoothness does not apply. From Jensen’s inequality we have

\int | f_{k}^{(t)} (θ, θ_{0}) | ρ_{n} (d θ) | \leq {[\int | f_{k}^{(t)} (θ, θ_{0}) |^{2 + δ} ρ_{n} (d θ)]}^{\frac{1}{2 + δ}}

, and Assumption 1(i) above implies that for some constant

C > 0

and

k \in {1, 2, \dots, m}, t \in {1, 2}

,

\begin{matrix} \int | f_{k}^{(t)} (θ, θ_{0}) | ρ_{n} (d θ) & \leq \frac{C}{n^{1 / (2 + δ)}} < \frac{C}{\sqrt{n}} . \end{matrix}

(19)

Assumption 1(i) is satisfied in a variety of scenarios, for example, under mild assumptions on the partial derivatives of the functions

f_{k}^{(t)}

. To illustrate this, we present the following proposition.

Proposition 4.

Let

f (θ, θ_{0})

be a function on a bounded domain with bounded partial derivatives with

f (θ_{0}, θ_{0}) = 0

. Let

{ρ_{n} (\cdot)}

be a sequence of probability densities on θ such that

E_{ρ_{n}} [θ] = θ_{0}

and

{Var}_{ρ_{n}} [θ] = \frac{σ^{2}}{n}

for some

σ > 0

. Then, for some

C > 0

,

\int | f (θ, θ_{0}) |^{2 + δ} ρ_{n} (d θ) < \frac{C}{n} .

(20)

Proof.

Define

\partial_{θ} f (θ, θ_{0}) : = \frac{\partial f (θ, θ_{0})}{\partial θ}

as the partial derivative of the function f. By the mean value theorem,

| f (θ, θ_{0}) | = | θ - θ_{0} | | \partial_{θ} f (θ^{*}, θ_{0}) |

, for some

θ^{*} \in [min {θ, θ_{0}}, max {θ, θ_{0}}]

. Since the partial derivatives are bounded, there exists

L \in R

such that

\partial_{θ} f (θ^{*}, θ_{0}) < L

, and

\int | f (θ, θ_{0}) |^{2 + δ} ρ_{n} (d θ) < L^{2 + δ} \int {| θ - θ_{0} |}^{2 + δ} ρ_{n} (d θ)

. Choose

G > 0

be such that

| θ | < G

, then

{|\frac{θ - θ_{0}}{2 G}|}^{2 + δ} < {|\frac{θ - θ_{0}}{2 G}|}^{2}

. Therefore,

\int | θ - θ_{0} |^{2 + δ} ρ_{n} (d θ) < {(2 G)}^{2 + δ} Var [\frac{θ}{2 G}] < {(2 G)}^{δ} \frac{σ^{2}}{n}

. Now choosing

{(2 G)}^{δ} σ^{2}

as C completes the proof. □

If

\partial_{θ} f_{k}^{(t)}

is continuous and

Θ

is compact, then

\partial_{θ} f_{k}^{(t)}

is always bounded. Furthermore, observe that if

E [M_{k}^{(1)} {(X_{1}, X_{0})}^{2 + δ}] < B

, without loss of generality we can use Jensen’s inequality to conclude that, for all

0 < a < 2 + δ

,

E [M_{k}^{(1)} {(X_{1}, X_{0})}^{a}] < B^{\frac{a}{2 + δ}} < B

.

We can now state the main theorem of this section.

Theorem 2.

Let

{X_{0}, \dots, X_{n}}

be generated by a stationary, α-mixing Markov chain parametrized by

θ_{0} \in Θ

. Suppose that Assumption 1 holds and that the α-mixing coefficients satisfy

\sum_{k \geq 1} α_{k}^{δ / (2 + δ)} < + \infty

. Furthermore, assume that

K (ρ_{n}, π) \leq \sqrt{n} C

for some constant

C > 0

. Then, the conditions of Corollary 1 are satisfied with

ϵ_{n} = O (max (\frac{1}{\sqrt{n}}, \frac{n^{δ / 2}}{n}))

.

Theorem 2 is satisfied by a large class of Markov chains, including chains with countable and continuous state spaces. In particular, if the Markov chain is geometrically ergodic, then it follows from Equation (A4) (in the appendix) that

\sum_{k \geq 1} α_{k}^{δ / (2 + δ)} < + \infty

. Observe that in order to achieve

O (\frac{1}{\sqrt{n}})

convergence, we need

δ \leq 1

. Key to the proof of Theorem 2 is the fact that the variance of the log-likelihood ratio can be controlled via the application of Assumption 1 and Proposition 3. Note also that as

δ

decreases, satisfying the condition

\sum_{k \geq 1} α_{k}^{δ / (2 + δ)}

requires the Markov chain to be faster mixing.

We now illustrate Theorem 2 for a number of Markov chain models. First, consider a birth-death Markov chain on a finite state space.

Proposition 5.

Suppose the data-generating process is a birth-death Markov chain, with one-step transition kernel parametrized by the birth probability

θ_{0} \in Θ

. Let

F

be the set of all Beta distributions. We choose the prior to be a Beta distribution. Then, the conditions of Theorem 2 are satisfied and

ϵ_{n} = O (\frac{1}{\sqrt{n}})

.

Proof.

The proof of Proposition 5 follows from the more general Proposition 8, by fixing the initial distribution to the invariant distribution under

θ_{0}

. Therefore it has been omitted. We simply refer to the proof of Proposition 8 under a more general setup in Appendix B.3. □

The birth-death chain on the finite state space is, of course, geometrically ergodic and the

α

-mixing coefficients

α_{k}

decay geometrically. Note that the invariant distribution of this Markov chain is uniform over the state space, and consequently this is a particularly simple example. A more complicated and more realistic example is a birth-death Markov chain on the nonnegative integers. We note that if the probability of birth

θ

in a birth-death Markov chain on positive integers is greater than

0.5

, then the Markov chain is transient, and consequently, not ergodic. Hence, our prior should be chosen to have support within

(0, 0.5)

. For that purpose, we define the class of scaled beta distributions.

Definition 2

(Scaled Beta). If X is a beta distribution on with parameters a and b, then Y is said to be a scaled beta distribution with same parameters on the interval

(c, m + c)

if

\begin{matrix} Y & = m x + c; (m, c) \in R^{2} \end{matrix}

and in that case, the pdf of Y is obtained as

\begin{matrix} f (y) = \{\begin{matrix} \frac{1}{m Beta (a, b)} {(\frac{y - c}{m})}^{a - 1} {(1 - \frac{y - c}{m})}^{b - 1} & if y \in (c, m + c), \\ 0 & otherwise . \end{matrix} . \end{matrix}

Here,

E [Y] = m \frac{a}{a + b} + c

and

Var [Y] = m^{2} \frac{a b}{{(a + b)}^{2} (a + b + 1)}

. For the birth-death chain, we set

m = 0.5

and

c = 0

giving it support on

(0, \frac{1}{2})

. Setting

m = 2

and

c = - 1

gives a beta distribution rescaled to have support on

(- 1, 1)

.

Proposition 6.

Suppose the data-generating process is a positive recurrent birth-death Markov chain on the positive integers parameterized by the birth probability

θ_{0} \in (0, \frac{1}{2})

. Further let

F

be the set of all Beta distributions rescaled to have support

(0, \frac{1}{2})

. We choose the prior to be a scaled Beta distribution on

(0, 1 / 2)

with parameters a and b. Then, the conditions of Theorem 2 are satisfied with

ϵ_{n} = O (\frac{1}{\sqrt{n}})

.

Proof.

The proof of Proposition 6 (for the stationary case) follows from the more general Proposition 9 (the nonstationary case) by fixing the initial distribution to the invariant distribution under

θ_{0}

. We omit the proof and simply refer to the proof of Proposition 9 under a more general setup in Appendix B.3. □

Unlike with the finite state-space, the invariant distribution now depends on the parameter

θ \in Θ

, and verification of the conditions of the proposition is more involved. In Appendix A.2, we prove that the class of scaled beta distributions satisfy the condition

K (ρ_{n}, π) \leq n ϵ_{n}

when the prior

π

is a beta or an uniform distribution. This fact will allow us to prove the above propositions.

Both Proposition 5 and Proposition 6 assume a discrete state space. The next example considers a strictly stationary simple linear model (as defined in Example 2), which has a continuous, unbounded state space.

Proposition 7.

Suppose the data-generating model is a stationary simple linear model:

\begin{matrix} X_{n} & = θ_{0} X_{n - 1} + W_{n}, \end{matrix}

(21)

where

{W_{n}}

are i.i.d. standard Gaussian random variables and

| θ_{0} | < 1

. Suppose that

F

is the class of all beta distributions rescaled to have the support

(- 1, 1)

. Then, the conditions of Theorem 2 are satisfied with

ϵ_{n} = O (\frac{1}{\sqrt{n}})

.

Proof.

This is a special case of the more general non-stationary simple linear model which is detailed in Proposition 10. Therefore, the proof of the fact that the simple linear model satisfies Assumption 1 when starting from stationarity is deferred to the proof of Proposition 10. The simple linear model with

| θ_{0} | < 1

has geometrically decreasing (and therefore summable)

α

-mixing coefficients as a consequence of [20] (eq. (15.49)) and Theorem A.1.2. Combining these two facts, it follows that the conditions of Theorem 2 are satisfied. □

Observe that Theorem 1 (and Corollary 1) are general, and hold for any dependent data-generating process. Therefore, there can be Markov chains that satisfy these, but do not satisfy Assumption 1 which entails some loss of generality. However, as our examples demonstrate, common Markov chain models do indeed satisfy the latter assumption.

4. Non-Stationary, Ergodic Markov Data-Generating Models

We call a time-homogeneous Markov chain non-stationary if the initial distribution

q^{(0)}

is not the invariant distribution. There are two sets of results in this setting: in Theorem 3 and Theorem 4 we explicitly impose the

α

-mixing condition, while in Theorem 5 we impose a f-geometric ergodicity condition (Definition A.1.2 in the appendix). As seen in Equation (A4) (in the appendix) if the Markov chain is also geometrically ergodic, then

\forall δ > 0

,

\sum α_{k}^{δ / (2 + δ)} < \infty

. This condition can be relaxed, albeit at the risk of more complicated calculations that, nonetheless, mirror those in the geometrically ergodic setting. A common thread through these results is that we must impose some integrability or regularity conditions on the functions

M_{k}^{(1)}

.

First, in Theorem 3 we assume that the

M_{k}^{(1)}

functions in Assumption 1 are uniformly bounded and that the

α

-mixing condition is satisfied. This result holds for both discrete and continuous state space settings.

Theorem 3.

Let

{X_{0}, \dots, X_{n}}

be generated by an α-mixing Markov chain parametrized by

θ_{0} \in Θ

with transition probabilities satisfying Assumption 1 and with known initial distribution

q^{(0)}

. Let

{α_{k}}

be the α-mixing coefficients under

θ_{0}

, and assume that

\sum_{k \geq 1} α_{k}^{δ / (2 + δ)} < + \infty

. Suppose that there exists

B \in R

such that

{sup}_{x, y} | M_{k}^{(1)} (x, y) | < B

for all

k \in {1, 2, \dots, m}

in Assumption 1. Furthermore, assume that there exists

ρ_{n} \in F

such that

K (ρ_{n}, π) \leq \sqrt{n} C

for some constant

C > 0

. If the initial distribution

q^{(0)}

satisfies

E_{q^{(0)}} {| M_{k}^{(2)} (X_{0}) |}^{2} < + \infty

for all

k \in {1, 2, \dots, m}

, then the conditions of Corollary 1 are satisfied with

ϵ_{n} = O (max (\frac{1}{\sqrt{n}}, \frac{n^{δ / 2}}{n}))

.

The following result in Proposition 8 illustrates Theorem 3 in the setting of a finite state birth-death Markov chain.

Proposition 8.

Suppose the data-generating process is a finite state birth-death Markov chain, with one-step transition kernel parametrized by the birth probability

θ_{0}

. Let

F

be the set of all Beta distributions. We choose the prior on

θ_{0}

to be a Beta distribution. Then, the conditions of Theorem 3 are satisfied with

ϵ_{n} = O (\frac{1}{\sqrt{n}})

for any initial distribution

q^{(0)}

.

Theorem 3 also applies to data generated by Markov chains with countably infinite state spaces, so long as the class of data-generating Markov chains is strongly ergodic and the initial distribution has finite second moments. The following example demonstrates this in the setting of a birth-death Markov chain on the positive integers, where the initial distribution is assumed to have finite second moments.

Proposition 9.

Suppose the data-generating process is a birth-death Markov chain on the non-negative integers, parameterized by the probability of birth

θ_{0} \in (0, \frac{1}{2})

. Further let

F

be the set of all Beta distributions rescaled upon the support

(0, \frac{1}{2})

. Let

q^{(0)}

be a probability mass function on non-negative integers such that

\sum_{i = 1}^{\infty} i^{2} q^{(0)} (i) < + \infty

. We choose the prior to be a scaled Beta distribution on

(0, 1 / 2)

with parameters a and b. Then, the conditions of Theorem 3 are satisfied with

ϵ_{n} = O (\frac{1}{\sqrt{n}})

.

Since continuous functions on a compact domain are bounded, we have the following (easy) corollary (stated without proof).

Corollary 2.

Let

{X_{0}, \dots, X_{n}}

be generated by an α-mixing Markov chain parametrized by

θ_{0} \in Θ

on a compact state space, and with initial distribution

q^{(0)}

. Suppose the α-mixing coefficients satisfy

\sum_{k \geq 1} α_{k}^{δ / (2 + δ)} < + \infty

, and that Assumption 1 holds with continuous functions

M_{k}^{(1)} (\cdot, \cdot)

,

k \in {1, 2, \dots, m}

. Furthermore, assume that there exists

ρ_{n}

such that

K (ρ_{n}, π) \leq \sqrt{n} C

for some constant C. Then, Theorem 3 is satisfied with

ϵ_{n} = O (max (\frac{1}{\sqrt{n}}, \frac{n^{δ / 2}}{n}))

.

In general, the

M_{k}^{(1)}

functions will not be uniformly bounded (consider the case of the Gauss–Markov simple linear model in Example 2), and stronger conditions must be imposed on the data-generating Markov chain itself. The following assumption imposes a ‘drift’ condition from [21]. Specifically, [21] (Theorem 2.3) shows that under the conditions of Assumption 2, the moment generating function of an aperiodic Markov chain

{X_{n}}

can be upper bounded by a function of the moment generating function of

X_{0}

. Together with the

α

-mixing condition, Assumption 2 implies that this Markov data generating process satisfies Corollary 1.

Assumption 2.

Consider a Markov chain

{X_{n}}

parameterized by

θ_{0} \in Θ

. Let

M_{- \infty}^{n}

denote the σ-field generated by

{X_{- \infty}, \dots, X_{n - 1}, X_{n}}

. Denote the stochastic process

{M_{n}^{k}} : = {M_{k}^{(1)} (X_{n}, X_{n - 1})}

; recall

M_{k}^{(1)}

, for each

k = 1, \dots, m_{1}

, are defined in Assumption 1. For each

k = 1, \dots, m

, assume the process

{M_{n}^{k}}

satisfies the following conditions:

The drift condition holds for ${M_{n}^{k}}$ , i.e., $E [M_{n}^{k} - M_{n - 1}^{k} | M_{- \infty}^{n - 1}, M_{n - 1}^{k} > a] \leq - ϵ$ for some $ϵ, a > 0$ .
For some $λ > 0$ and $D > 0$ , $E [e^{λ (M_{n}^{k} - M_{n - 1}^{k})} | M_{- \infty}^{n - 1}] \leq D$ .

Under this drift condition, the next theorem shows that Corollary 1 is satisfied.

Theorem 4.

Let

{X_{0}, \dots, X_{n}}

be generated by an aperiodic α-mixing Markov chain parametrized by

θ_{0} \in Θ

and initial distribution

q^{(0)}

. Suppose that Assumption 1 and Assumption 2 hold, and that the α-mixing coefficients satisfy

\sum_{k \geq 1} α_{k}^{δ / (2 + δ)} < + \infty

. Furthermore, assume

K (ρ_{n}, π) \leq \sqrt{n} C

for some constant

C > 0

. If

\int e^{λ M_{k}^{(1)} (y, x)} p_{θ_{0}} (y | x) q_{1}^{(0)} (x) d x < + \infty

for all

k = 1, \dots, m_{1}

, then the conditions of Corollary 1 are satisfied with

ϵ_{n} = O (max (\frac{1}{\sqrt{n}}, \frac{n^{δ / 2}}{n}))

.

Verifying the conditions in Theorem 4 can be quite challenging. Instead, we suggest a different approach that requires f-geometric ergodicity. Unlike the drift condition in Assumption 2, f-geometric ergodicity additionally requires the existence of a petite set. As noted before, geometric ergodicity implies

α

-mixing with geometrically decaying mixing coefficients. As with Theorem 4, we assume for simplicity that the Markov chain is aperiodic.

Theorem 5.

Let

{X_{0}, \dots, X_{n}}

be generated by an aperiodic Markov chain parametrized by

θ_{0} \in Θ

with known initial distribution

q^{(0)}

, and assumed to be V-geometrically ergodic for some

V : R^{m} \to [1, \infty)

. Suppose that Assumption 1 holds and

\int M_{k}^{(1)} {(y, x)}^{2 + δ} p_{θ_{0}} (y | x) d y < V (x) \forall k, x and some δ > 0

. Furthermore, assume that

K (ρ_{n}, π) \leq \sqrt{n} C

for some constant

C > 0

. If the initial distribution

q^{(0)}

satisfies

E_{q^{(0)}} [V (X_{0})] < + \infty

, then the conditions of Corollary 1 are satisfied with

ϵ_{n} = O (max (\frac{1}{\sqrt{n}}, \frac{n^{δ / 2}}{n}))

.

The following Proposition 10 shows, the simple linear model satisfies Theorem 5 when the parameter

θ_{0}

is suitably restricted.

Proposition 10.

Consider the simple linear model satisfying the equation

\begin{matrix} X_{n} & = θ_{0} X_{n - 1} + W_{n}, \end{matrix}

(22)

where

{W_{n}}

are i.i.d. standard Gaussian random variables and

| θ_{0} | < 2^{\frac{1}{4 + 2 δ} - 1}

for

δ > 0

. Let

F

be the space of all scaled Beta distributions on

(- 1, 1)

and suppose the prior π is a uniform distribution on

(- 1, 1)

. Then, the conditions of Theorem 5 are satisfied with

ϵ_{n} = O (max (\frac{1}{\sqrt{n}}, \frac{n^{δ / 2}}{n}))

, if the initial distribution

q^{(0)}

satisfies

E_{q^{(0)}} [X_{0}^{4 + 2 δ}] < + \infty

.

5. Misspecified Models

We show next how our results can be extended to the misspecified model setting. Assume that the true data generating distribution is parametrized by

θ_{0} \notin Θ

. Let

θ_{n}^{*} : = arg {min}_{θ \in Θ} K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)})

represent the closest parametrized distribution in the variational family to the data-generating distribution. Further, assume our usual conditions:

i.: $\int E [r_{n} (θ, θ_{n}^{*})] ρ_{n} (d θ) \leq n ϵ_{n}$ ,
ii.: $\int Var (r_{n} (θ, θ_{n}^{*})) ρ_{n} (d θ) \leq n ϵ_{n}$ .

Now, since

r_{n} (θ, θ_{0}) = r_{n} (θ, θ_{n}^{*}) + r_{n} (θ_{n}^{*}, θ_{0})

, we have

\int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) \leq E [r_{n} (θ_{0}, θ_{n}^{*})] + n ϵ_{n} .

(23)

Similarly, decomposing the variance it follows that

\begin{matrix} Var [r_{n} (θ, θ_{0})] & = Var [r_{n} (θ, θ_{n}^{*})] + Var [r_{n} (θ_{n}^{*}, θ_{0})] + 2 Cov [r_{n} (θ, θ_{n}^{*}), r_{n} (θ_{n}^{*}, θ_{0})] . \end{matrix}

(24)

Using the fact that

2 a b \leq a^{2} + b^{2}

on the covariance term

2 Cov [r_{n} (θ, θ_{n}^{*}), r_{n} (θ_{n}^{*}, θ_{0})] = 2 E [(r_{n} (θ, θ_{n}^{*}) - E [r_{n} (θ, θ_{n}^{*})]) (r_{n} (θ_{n}^{*}, θ_{0}) - E [r_{n} (θ_{n}^{*}, θ_{0})])]

, we have

\begin{matrix} Var [r_{n} (θ, θ_{0})] & \leq 2 Var [r_{n} (θ, θ_{n}^{*})] + 2 Var [r_{n} (θ_{n}^{*}, θ_{0})] . \end{matrix}

(25)

Integrating both sides with respect to

ρ_{n} (d θ)

we get

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) & \leq 2 \int Var [r_{n} (θ, θ_{n}^{*})] ρ_{n} (d θ) + 2 \int Var [r_{n} (θ_{n}^{*}, θ_{0})] ρ_{n} (d θ) \\ \leq 2 n ϵ_{n} + 2 Var [r_{n} (θ_{n}^{*}, θ_{0})] . \end{matrix}

(26)

Consequently, we arrive at the following result:

Theorem 6.

Let

F

be a subset of all probability distributions parameterized by Θ. Let

θ_{n}^{*} = arg {min}_{θ \in Θ} K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)})

and assume there exist

ϵ_{n} > 0

and

ρ_{n} \in F

such that

i.: $\int E [r_{n} (θ, θ_{n}^{*})] ρ_{n} (d θ) \leq n ϵ_{n}$ ,
ii.: $\int Var (r_{n} (θ, θ_{n}^{*})) ρ_{n} (d θ) \leq n ϵ_{n}$ , and
iii.: $K (ρ_{n}, π) \leq n ϵ_{n}$ .

Then, for any

α^{r e} \in (0, 1)

and

(ϵ, η) \in (0, 1) \times (0, 1)

,

\begin{matrix} P [\int D_{α^{r e}} & (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) {\tilde{π}}_{n, α^{r e}} (d θ | X^{(n)}) \leq \\ \frac{(α^{r e} + 1) n ϵ_{n} + E [r_{n} (θ_{0}, θ_{n}^{*})] + α^{r e} \sqrt{\frac{2 n ϵ_{n} + 2 Var [r_{n} (θ_{n}^{*}, θ_{0})]}{η}} - log (ϵ)}{1 - α^{r e}}] \geq 1 - ϵ - η . \end{matrix}

(27)

The proof of this theorem is straightforward and follows from the proof of Theorem 1 by plugging in the upper bounds for KL-divergence from Equation (23), and variance from Equation (26) to (A13). A sketch of the proof is presented in the appendix.

6. Conclusions

Concentration of the KL-VB model risk, in terms of the expected

α^{r e}

-Rényi divergence, is well established under the i.i.d. data generating model assumption. Here, we extended this to the setting of Markov data generating models, linking the concentration rate to the mixing and ergodic properties of the Markov model. Our results apply to both stationary and non-stationary Markov chains, as well as to the situation with misspecified models. There remain a number of open questions. An immediate one is to extend the current analysis to continuous-time Markov chains and Markov jump processes, possibly using uniformization of the continuous time model. Another direction is to extend this to the setting of non-homogeneous Markov chains, where analogues of notions such as stationarity are less straightforward. Further, as noted in the introduction, [14] establish PAC-Bayes bounds under slightly weaker ‘existence of test functions’ conditions, while our results are established under the stronger conditions used by [15] for the i.i.d. setting. Weakening the conditions in our analysis is important, but complicated. A possible path is to build on results from [22], who provides conditions form the existence of exponentially powerful test functions exist for distinguishing between two Markov chains. It is also known that there exists a likelihood ratio test separating any two ergodic measures [23]. However, leveraging these to establish the PAC-Bayes bounds for the KL-VB posterior is a challenging effort that we leave to future papers. Finally it is of interest to generalize our PAC-bounds to posterior approximations beyond KL-variational inference, such as

α^{r e}

-Rényi posterior approximations [6], and loss-calibrated posterior approximations [24,25].

Author Contributions

Formal analysis, I.B.; Investigation, I.B.; Methodology, I.B., V.A.R. and H.H.; Resources, V.A.R. and H.H.; Validation, V.A.R. and H.H. All authors have read and agreed to the published version of the manuscript.

Funding

National Science Foundation: IIS-1816499; DMS-1812197.

Acknowledgments

Rao and Honnappa acknowledge support from NSF DMS-1812197. In addition, Rao acknowledges NSF IIS-1816499 for supporting this project.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Technical Desiderata

Appendix A.1. Definitions Related to Markov Chains

As noted before, ergodicity plays an acute role in establishing our results. We consolidate various definitions used throughout the paper in this appendix. Recall that we assume the parameterized Markov chain possesses an invariant probability density or mass function

q_{θ}

under parameter

θ \in Θ

. Our results in Section 4 also rely on the ergodic properties of the Markov chain, and we assume that the Markov chain is f-geometrically ergodic [20] (Chapter 15). First, refer to the definition of the functional norm

{∥ \cdot ∥}_{f}

, from Definition A.1.1,

Definition A.1.1

(f-norm). The functional norm in f-metric of a measure v, or the f-norm of v is

{∥ v ∥}_{f} = sup_{g : | g | < f} |\int g d v|,

(A1)

where f and g are any two functions.

An immediate consequence of this definition is that if

f_{1}, f_{2}

are two functions such that

f_{1} < f_{2}

(i.e., for all points in the support of the functions), then

{∥ v ∥}_{f_{1}} \leq {∥ v ∥}_{f_{2}} .

(A2)

Now that we have defined the

{∥ \cdot ∥}_{f}

norm, we can now define f-geometric ergodicity. In the following, we assume the Markov chain is positive Harris; see [20] for a definition. This is a mild and fairly standard assumption in Markov chain theory.

Definition A.1.2

(f-geometric ergodicity). For any function f, Markov chain

{X_{n}}

parameterized by

θ \in Θ

is said to be f-geometrically ergodic if it is positive Harris and there exists a constant

r_{f} > 1

, that depends on f, such that for any

A \in B (X)

,

\sum_{n = 1}^{n} r_{f}^{n} {∥P_{θ} (X_{n} \in A | X_{0} = x) - \int_{A} q_{θ} (y) d y∥}_{f} < \infty .

(A3)

It is straightforward to see that this is equivalent to

{∥P_{θ} (X_{n} \in A | X_{0} = x) - \int q_{θ} (y) d y∥}_{f} \leq C r_{f}^{- n}

for an appropriate constant C (which may depend on the state x), that is, the Markov chain approaches steady state at a geometrically fast rate. If a Markov chain is f-geometrically ergodic for

f \equiv 1

, then, it is simply termed as geometrically ergodic. It is straightforward to see (via Theorem A.1.2 in the Appendix) that a geometrically ergodic Markov chain is also

α

-mixing, with mixing coefficients satisfying

\sum_{k \geq 0} α_{k}^{υ} < \infty \forall υ > 0,

(A4)

showing that, under geometric ergodicity, the

α

-mixing coefficients raised to any positive power

υ

are finitely summable. We note here that the most standard procedure to establish f-geometric ergodicity for any Markov chain is through the verification of the drift condition. The drift condition is a sufficient condition for a Markov chain to be f-geometrically ergodic, as long as there exists a set (called petite set) towards which the Markov chain drifts to (see Assumption A.1.1 in the appendix). If a Markov chain is f-geometrically ergodic with

f \equiv V

, for some particular function V, then we call it V-geometrically ergodic.

We defined V-geometric ergodicity in the previous sections. In this section, we provide a sufficient condition for a Markov chain to be V-geometrically ergodic. First, we recall the definition of resolvent from [20] (Chapter 5).

Definition A.1.3

(Resolvent). Let

n \in {0, 1, 2, \dots}

and

q_{n}

be such that

q_{n} \geq 0 \forall n

and

\sum_{n = 1}^{\infty} q_{n} = 1

. Note that

q_{n}

can be thought of being a probability mass function for a random variable "q" taking values on non-negative integers. Then, the resolvent of a Markov chain with respect to q is given by

K_{q} (x, A)

where,

\begin{matrix} K_{q} (x, A) & = \sum_{n = 0}^{\infty} q_{n} P (X_{n} \in A | X_{0} = x) . \end{matrix}

(A5)

Then, the definition of petite sets follows (see, for Reference, [20] (Chapter 5)).

Definition A.1.4

(Petite Sets). Let

X_{0}, \dots, X_{n}

be n samples from a Markov chain taking values on the state space

X

. Let C be a set. We shall call C to be

v_{q}

petite if

\begin{matrix} K_{q} (x, B) \geq υ_{q} (B) \end{matrix}

for all

x \in C

and

B \in B (X)

, and a non-trivial measure

υ_{q}

on

B (X)

, and a probability mass function q on

{1, 2, 3, \dots}

Now, let

Δ V (x) : = E [V (X_{n}) | X_{n - 1} = x] - V (x)

for

V : S \to [1, \infty)

.

Assumption A.1.1

(Drift condition). [20] (Chapter 5) Suppose the chain

{X_{n}}

is, aperiodic and ψ-irreducible. Let there exists a petite set C, constants

b < \infty, β > 0

, and a non-trivial function

V : S \to [1, \infty)

satisfying

\begin{matrix} Δ V (x) \leq - β V (x) + b I_{x \in C} \forall x \in S . \end{matrix}

(A6)

If a Markov chain drifts towards a petite set then it is V-geometrically ergodic. Suppose, for simplicity, that

V (x) = | X |

. Then, the drift condition becomes

E [| X_{n} ∥ X_{x - 1}] - | X_{n - 1} | = - β | X_{n} | + b I_{X_{n} \in C}

. The left hand side of this equation represents the change in the state of the Markov chain in one time epoch. Thus, the condition in Assumption A.1.1 essentially states that the Markov chain drifts towards a petite set C and then, once it reaches that set, moves to any point in the state space with at least some probability independent of C.

Theorem A.1.1

(Geometrically ergodic theorem). Suppose that

{X_{n}}

is satisfies Assumption A.1.1. Then, the set

S_{V} = {x : V (x) < \infty}

is absorbing, i.e.,

P_{θ} (X_{1} \in S_{V} | X_{0} = x) = 1

\forall x \in S_{V}

, and full, i.e.,

ψ (S_{V}^{c}) = 0

. Furthermore,

\exists constants r > 1, R < \infty

such that, for any

A \in B (S)

,

{∥P_{θ} (X_{n} \in A | X_{0} = x) - \int_{A} q_{θ} (y) d y∥}_{V} \leq R r^{- n} V (x) .

(A7)

Any aperiodic and

ψ

-irreducible Markov chain satisfying the drift condition is geometrically ergodic. A consequence of Equation (A2) is that if,

{X_{n}}

is V-geometrically ergodic, then for any other function

U, such that | U | < V

, it is also U-geometrically ergodic. In essence, a geometrically ergodic Markov chain is asymptotically uncorrelated in a precise sense. Recall

ρ

-mixing coefficients defined as follows. Let

A

be a sigma field and

L^{2} (A)

be the set of square integrable, real valued,

A

measurable functions.

Definition A.1.5

(

ρ

-mixing coefficient). Let

M_{i}^{j}

denote the sigma field generated by the measures

X_{k}, where i \leq k \leq j

. Then,

ρ_{k} = sup_{t > 0} sup_{(f, g) \in L^{2} (M_{- \infty}^{t}) \times L^{2} (M_{t + k}^{\infty})} |Corr (f, g)|,

(A8)

where

Corr

is the correlation function.

Theorem A.1.2.

If

X_{n}

is geometrically ergodic, then it is α-mixing. That is, there exists a constant

c > 0

such that

α_{k} = O (e^{- c k})

.

Proof.

By [26] (Theorem 2) it follows that a geometrically ergodic Markov chain is asymptotically uncorrelated with

ρ

-mixing coefficients (see Definition A.1.5) that satisfy

ρ_{k} = O (e^{- c k})

. Furthermore, it is well known that [18,26]

α_{k} \leq \frac{1}{4} ρ_{k}

, implying

α_{k} = O (e^{- c k})

. □

Appendix A.2. Bounding the KL-Divergence between Beta Distributions

The following results will be utilized in the proofs of Propositions 8–10.

Lemma A.2.1.

Let

θ_{0} \in (0, 1)

. Let,

ρ_{n}

be a sequence of Beta distributions with parameters

a_{n} = n θ_{0}

and

b_{n} = n (1 - θ_{0})

. Let π denote an uniform distribution,

U (0, 1)

. Then,

K (ρ_{n}, π) < C + \frac{1}{2} log (n)

, for some constant

C > 0

.

Proof.

Without loss of generality, we can assume

a_{n} > 1

and

b_{n} > 1

. The same form of the result can be obtained in all the other cases, by appropriate use of the bounds presented in the proof. We write the KL divergence

K (ρ_{n}, π)

as

\int log (\frac{ρ_{n}}{π}) ρ_{n} (d θ)

. Since

π

is uniform,

π (θ) = 1

whenever

θ \in (0, 1)

. Hence, the KL-divergence can be written as the negative of the entropy of

ρ_{n}

\int_{0}^{1} log (ρ_{n} (θ)) ρ_{n} (d θ),

which can be written as

\begin{matrix} K (ρ_{n}, π) = (a_{n} - 1) ψ (a_{n}) + (b_{n} - 1) ψ (b_{n}) - (a_{n} + b_{n} - 2) ψ (a_{n} + b_{n}) \\ - log Beta (a_{n}, b_{n}), \end{matrix}

(A9)

where

ψ

is the digamma function. Using Stirling’s approximation on

Beta (a_{n}, b_{n})

yields,

\begin{matrix} Beta (a_{n}, b_{n}) = \sqrt{2 π} \frac{a_{n}^{a_{n} - 1 / 2} b_{n}^{b_{n} - 1 / 2}}{{(a_{n} + b_{n})}^{a_{n} + b_{n} - 1 / 2}} (1 + o (1)) . \end{matrix}

Hence, setting

C_{1} = log (2 \sqrt{π})

, we can write

- log Beta (a_{n}, b_{n})

as,

\begin{matrix} - log Beta (a_{n}, b_{n}) & = C_{1} - (a_{n} - \frac{1}{2}) log (a_{n}) - (b_{n} - \frac{1}{2}) log (b_{n}) \\ + (a_{n} + b_{n} - \frac{1}{2}) log (a_{n} + b_{n}) + log (1 + o (1)) . \end{matrix}

From [27] we have that

log (x) - \frac{1}{x} < ψ (x) < log (x) - \frac{1}{2 x} \forall x > 0

. Since we assumed

a_{n} > 1

and

b_{n} > 1

, the fact that

ψ (x) < log (x) - \frac{1}{2 x}

implies

\begin{matrix} (a_{n} - 1) ψ (a_{n}) & < (a_{n} - 1) log (a_{n}) - \frac{a_{n} - 1}{2 a_{n}} and, \\ (b_{n} - 1) ψ (b_{n}) & < (b_{n} - 1) log (b_{n}) - \frac{b_{n} - 1}{2 b_{n}} . \end{matrix}

Finally, using the fact that

log (x) - \frac{1}{x} < ψ (x)

, we get,

\begin{matrix} - (a_{n} + b_{n} - 2) ψ (a_{n} + b_{n}) < - (a_{n} + b_{n} - 2) log (a_{n} + b_{n}) + \frac{a_{n} + b_{n} - 2}{a_{n} + b_{n}} . \end{matrix}

Therefore, after much cancellation, the KL-divergence

\begin{matrix} (a_{n} - 1) ψ (a_{n}) + (b_{n} - 1) ψ (b_{n}) - (a_{n} + b_{n} - 2) ψ (a_{n} + b_{n}) - log Beta (a_{n}, b_{n}) \end{matrix}

can be upper bounded by

\begin{matrix} - \frac{1}{2} log (a_{n}) - \frac{1}{2} log (b_{n}) + \frac{3}{2} log (a_{n} + b_{n}) + \frac{a_{n} + b_{n} - 2}{a_{n} + b_{n}} - \frac{a_{n} - 1}{2 a_{n}} - \frac{b_{n} - 1}{2 b_{n}} . \end{matrix}

Now, plugging in the values of

a_{n}

and

b_{n}

, we get Plugging in the values of

a_{n}

and

b_{n}

, we get as upper bound for the KL-divergence as,

\begin{matrix} K (ρ_{n}, π) & < - \frac{1}{2} log (n θ_{0}) - \frac{1}{2} log (n (1 - θ_{0})) + \frac{3}{2} log (n) + \frac{n - 2}{n} - \frac{n θ_{0} - 1}{2 n θ_{0}} - \frac{n (1 - θ_{0}) - 1}{2 n (1 - θ_{0})} \\ = \frac{1}{2} log (n) - \frac{1}{2} (log (θ_{0}) + log (1 - θ_{0})) + 3 - \frac{2}{n} - \frac{1}{2 n θ_{0}} - \frac{1}{2 n (1 - θ_{0})} \\ < C + \frac{1}{2} log (n), \end{matrix}

for some large enough positive constant C. This completes our proof. □

Proposition A.2.1.

Let

θ_{0} \in (0, 1)

. Let,

ρ_{n}

be a sequence of Beta distributions with parameters

a_{n} = n θ_{0}

and

b_{n} = n (1 - θ_{0})

. Let π denote an Beta distribution, with parameters

(a, b)

. Then,

K (ρ_{n}, π) < C + \frac{1}{2} log (n)

, for some constant

C > 0

.

Proof.

Without loss of generality, we assume

a > 1

and

b > 1

. As mentioned in the proof of Lemma A.2.1, the other cases follows similarly. We write the KL-divergence between

ρ_{n}

and

π

as,

\begin{matrix} K (ρ_{n}, π) = & \int log (\frac{ρ_{n}}{π}) ρ_{n} (d θ) = \int log (\frac{ρ_{n}}{U}) ρ_{n} (d θ) + \int log (\frac{U}{π}) ρ_{n} (d θ), \end{matrix}

where, U is an uniform distribution on

(0, 1)

. We analyze the second term in the above expression. The second term can be written as,

\begin{matrix} \int log (\frac{U}{π}) ρ_{n} (d θ) & = \int log (\frac{1}{\frac{1}{Beta (a, b)} θ^{a - 1} {(1 - θ)}^{b - 1}}) ρ_{n} (d θ) \\ = C_{1} - (a - 1) \int log (θ) ρ_{n} (d θ) - (b - 1) \int log (1 - θ) ρ_{n} (d θ), \end{matrix}

where

C_{1}

is

log (Beta (a, b))

. Since,

ρ_{n}

follows a Beta distribution with parameters

a_{n} = n θ_{0}

and

b_{n} = n (1 - θ_{0})

, we get that,

\begin{matrix} \int log (\frac{U}{π}) ρ_{n} (d θ) & = C_{1} - (a - 1) [ψ (a_{n}) - ψ (a_{n} + b_{n})] - (b - 1) [ψ (b_{n}) - ψ (a_{n} + b_{n})] \end{matrix}

Since,

log (x) - \frac{1}{x} < ψ (x) < log (x) - \frac{1}{2 x}

, looking at the term

[ψ (a_{n}) - ψ (a_{n} + b_{n})]

, we get that,

\begin{matrix} - [ψ (a_{n}) - ψ (a_{n} + b_{n})] & = - [ψ (n θ_{0}) - ψ (n θ_{0} + n (1 - θ_{0}))] \\ = - [ψ (n θ_{0}) - ψ (n)] . \end{matrix}

Using the lower bound on

ψ (n θ_{0})

and the upper bound on

ψ (n)

, we get

\begin{matrix} - [ψ (a_{n}) - ψ (a_{n} + b_{n})] & < - log (n θ_{0}) + \frac{1}{n θ_{0}} + log (n) - \frac{1}{2 n} \\ = - log (θ_{0}) + \frac{2 - θ_{0}}{2 n θ_{0}} . \end{matrix}

Furthermore, similarly, we get that,

\begin{matrix} - [ψ (b_{n}) - ψ (a_{n} + b_{n})] & < - log (1 - θ_{0}) + \frac{2 - (1 - θ_{0})}{2 n (1 - θ_{0})} . \end{matrix}

Therefore it follows that

\begin{matrix} max & \{- (a - 1) [ψ (a_{n}) - ψ (a_{n} + b_{n})], - (b - 1) [ψ (b_{n}) - ψ (a_{n} + b_{n})]\} \\ < max \{(a - 1) [- log (θ_{0}) + \frac{2 - θ_{0}}{2 n θ_{0}}], (b - 1) [- log (1 - θ_{0}) + \frac{2 - (1 - θ_{0})}{2 n (1 - θ_{0})}]\} \\ < C, \end{matrix}

for a large positive constant C. Using the above bounds, we finally show that,

\begin{matrix} C_{1} - (a - 1) [ψ (a_{n}) - ψ (a_{n} + b_{n})] - (b - 1) [ψ (b_{n}) - ψ (a_{n} + b_{n})] \\ < C_{1} + 2 C, \end{matrix}

which can be upper bounded by

C^{'}

for some large constant

C^{'}

. Finally, we upper bound

\int log (\frac{ρ_{n}}{U}) ρ_{n} (d θ)

by Lemma A.2.1 thereby completing the proof. □

Appendix B. Proofs of Main Results

Appendix B.1. Proofs for A Concentration Bound for the α^re-Rényi Divergence

Appendix B.1.1. Proof of Proposition 1

We start by recalling the variational formula of Donsker and Varadhan [28].

Lemma B.1.1

(Donsker-Varadhan)b. For any probability distribution function π on Θ, and for any measurable function

h : Θ \to R

, if

\int e^{h} d π < \infty

, then

\begin{matrix} log \int e^{h} d π & = sup_{ρ \in M^{+} (Θ)} \{\int h d ρ - K (ρ, π)\} \end{matrix}

(A10)

Now, fix

α^{r e} \in (0, 1)

, and

θ \in Θ

. First, observe that by the definition of the

α^{r e}

-Rényi divergence we have

\begin{matrix} E_{θ_{0}}^{(n)} [exp (- α^{r e} r_{n} (θ, θ_{0}))] = exp [- (1 - α^{r e}) D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)})] \end{matrix}

Multiplying both sides of the equation by

exp [(1 - α^{r e}) D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)})

and integrating with respect to (w.r.t.)

π (θ)

it follows that

\begin{matrix} \int E_{θ_{0}}^{(n)} [exp (- α^{r e} r_{n} (θ, θ_{0}) + (1 - α^{r e}) D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}))] π (d θ) = 1, or \end{matrix}

\begin{matrix} E_{θ_{0}}^{(n)} [\int exp (- α^{r e} r_{n} (θ, θ_{0}) + (1 - α^{r e}) D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)})) π (d θ)] = 1 . \end{matrix}

Define

h (θ) : = - α^{r e} r_{n} (θ, θ_{0}) + (1 - α^{r e}) D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)})

. Then, applying Lemma B.1.1 to the integrand on the left hand side (l.h.s.) above, it follows that

\begin{matrix} E_{θ_{0}}^{(n)} [exp (sup_{ρ \in M^{+} (Θ)} [\int h (θ) ρ (d θ) - K (ρ, π)])] = 1 . \end{matrix}

Multiply both sides of this equation by

ϵ > 0

to obtain

\begin{matrix} E_{θ_{0}}^{(n)} [exp (sup_{ρ \in M^{+} (Θ)} [\int h (θ) ρ (d θ) - K (ρ, π) + log (ϵ)])] = ϵ . \end{matrix}

Now, by Markov’s inequality, we have

P_{θ_{0}}^{(n)} [sup_{ρ \in M^{+} (Θ)} \int (- α^{r e} r_{n} (θ, θ_{0}) + (1 - α^{r e}) D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)})) ρ (d θ) - K (ρ, π) + log (ϵ) \geq 0] \leq ϵ .

(A11)

Thus, it follows via complementation that

\begin{matrix} P_{θ_{0}}^{(n)} [\forall ρ \in F (Θ) \int D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) ρ (d θ) \leq \frac{α^{r e}}{(1 - α^{r e})} \int r_{n} (θ, θ_{0}) ρ (d θ) + & \frac{K (ρ, π) - log (ϵ)}{1 - α^{r e}}] \\ \geq 1 - ϵ, \end{matrix}

thereby completing the proof.

Appendix B.1.2. Proof of Theorem 1

Recall the definition of the fractional posterior and the VB approximation,

\begin{matrix} π_{n, α^{r e} | X^{n}} = \frac{{exp}^{- α^{r e} r_{n} (θ, θ_{0}) (X^{n})} π (d θ)}{\int {exp}^{- α^{r e} r_{n} (γ, θ_{0}) (X^{n})} π (d γ)}, {\tilde{π}}_{n, α^{r e} | X^{n}} = \underset{ρ \in F}{arg min} K (ρ, π_{n, α^{r e} | X^{(n)}}) . \end{matrix}

It follows by definition of the KL divergence that

\begin{matrix} {\tilde{π}}_{n, α^{r e} | X^{n}} & = \underset{ρ \in F}{arg min} \{- α^{r e} \int r_{n} (θ, θ_{0}) ρ (d θ) + K (ρ, π)\}, \end{matrix}

(A12)

where

π

is the prior distribution. Following Proposition 1 it follows that for any

ϵ > 0

\int D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) \tilde{π} (d θ | X^{n}) \leq \frac{α^{r e}}{(1 - α^{r e})} \int r_{n} (θ, θ_{0}) ρ (d θ) + \frac{K (ρ, π) - log (ϵ)}{1 - α^{r e}},

with probability

1 - ϵ

. We fix an

η \in (0, 1)

. Using Chebychev’s inequality, we have

\begin{matrix} P_{θ_{0}}^{(n)} [\frac{α^{r e}}{1 - α^{r e}} \int r_{n} (θ, θ_{0}) ρ_{n} (d θ) \geq \frac{α^{r e}}{1 - α^{r e}} \int E [r_{n} (θ, θ_{0})] ρ_{n} (d θ) \\ + \frac{α^{r e}}{1 - α^{r e}} \sqrt{\frac{Var [\int r_{n} (θ, θ_{0}) ρ_{n} (d θ)]}{η}} + \frac{K (ρ_{n}, π)}{1 - α^{r e}}] \\ = P_{θ_{0}}^{(n)} [\frac{α^{r e}}{1 - α^{r e}} \int r_{n} (θ, θ_{0}) ρ_{n} (d θ) - \frac{α^{r e}}{1 - α^{r e}} \int E [r_{n} (θ, θ_{0})] ρ_{n} (d θ) - \frac{K (ρ_{n}, π)}{1 - α^{r e}} \\ \geq \frac{α^{r e}}{1 - α^{r e}} \sqrt{\frac{Var [\int r_{n} (θ, θ_{0}) ρ_{n} (d θ)]}{η}}] \\ \leq \frac{Var [\frac{α^{r e}}{1 - α^{r e}} \int r_{n} (θ, θ_{0}) ρ_{n} (d θ) - \frac{α^{r e}}{1 - α^{r e}} \int E [r_{n} (θ, θ_{0})] ρ_{n} (d θ) - \frac{K (ρ_{n}, π)}{1 - α^{r e}}]}{\frac{{(α^{r e})}^{2}}{{(1 - α^{r e})}^{2}} \frac{Var [\int r_{n} (θ, θ_{0}) ρ_{n} (d θ)]}{η}} . \end{matrix}

Note that

\frac{α^{r e}}{1 - α^{r e}} \int E (r_{n} (θ, θ_{0})) ρ_{n} (d θ)

and

\frac{K (ρ_{n}, π)}{1 - α^{r e}}

are constants with respect to the data, implying

\begin{matrix} Var [\frac{α^{r e}}{1 - α^{r e}} & \int r_{n} (θ, θ_{0}) ρ_{n} (d θ) - \frac{α^{r e}}{1 - α^{r e}} \int E [r_{n} (θ, θ_{0})] ρ_{n} (d θ) - \frac{K (ρ_{n}, π)}{1 - α^{r e}}] \\ = \frac{{(α^{r e})}^{2}}{{(1 - α^{r e})}^{2}} Var [\int r_{n} (θ, θ_{0}) ρ_{n} (d θ)] . \end{matrix}

Therefore, we have

\begin{matrix} P_{θ_{0}}^{(n)} [\frac{α^{r e}}{1 - α^{r e}} \int r_{n} (θ, θ_{0}) ρ_{n} (d θ) \geq \frac{α^{r e}}{1 - α^{r e}} & \int E [r_{n} (θ, θ_{0})] ρ_{n} (d θ) \\ + \frac{α^{r e}}{1 - α^{r e}} \sqrt{\frac{Var [\int r_{n} (θ, θ_{0}) ρ_{n} (d θ)]}{η}} + \frac{K (ρ_{n}, π)}{1 - α}] \leq η . \end{matrix}

From Proposition 1, with probability

1 - ϵ

the following holds

\int D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) {\tilde{π}}_{n, α^{r e} | X^{n}} (d θ) \leq \frac{α^{r e} \int r_{n} (θ, θ_{0}) ρ_{n} (d θ) + K (ρ_{n}, π) - log (ϵ)}{1 - α^{r e}} .

Therefore, with probability

1 - η - ϵ

the following statement holds

\begin{matrix} \int D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) {\tilde{π}}_{n, α^{r e} | X^{n}} (d θ) & \leq \frac{α^{r e}}{1 - α^{r e}} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) \\ + \frac{α^{r e}}{1 - α^{r e}} \sqrt{\frac{Var [\int r_{n} (θ, θ_{0}) ρ_{n} (d θ)]}{η}} \\ + \frac{K (ρ_{n}, π) - log (ϵ)}{1 - α^{r e}} . \end{matrix}

(A13)

Next, we observe that

\begin{matrix} Var [\int r_{n} (θ, θ_{0}) ρ_{n} (d θ)] & = E_{θ_{0}}^{(n)} [{|\int r_{n} (θ, θ_{0}) ρ_{n} (d θ) - E [\int r_{n} (θ, θ_{0}) ρ_{n} (d θ)]|}^{2}] \\ \leq \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ), \end{matrix}

by a straightforward application of Jensen’s inequality to the inner integral on the left hand side. Finally, following the hypotheses (i), (ii) and (iii), we have,

\begin{matrix} \int D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) {\tilde{π}}_{n, α^{r e} | X^{n}} (d θ) & \leq \frac{α^{r e}}{1 - α^{r e}} \int (K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) + \sqrt{\frac{\int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ)}{η}}) ρ_{n} (d θ) \\ + \frac{1}{α^{r e}} (K (ρ_{n}, π) - log (ϵ)) \\ \leq \frac{α^{r e} (ϵ_{n} + \sqrt{\frac{n ϵ_{n}}{η}})}{1 - α^{r e}} + \frac{n ϵ_{n} - log (ϵ)}{1 - α^{r e}}, \end{matrix}

thereby concluding the proof. □

Appendix B.1.3. Proof of Proposition 2

We define

Y_{i} : = log (\frac{p_{θ_{1}} (X_{i} | X_{i - 1})}{p_{θ_{2}} (X_{i} | X_{i - 1})})

for

i = 1, \dots, n

, and

Z_{0} = log (\frac{q_{1}^{(0)} (X_{0})}{q_{2}^{(0)} (X_{0})})

. Then, using the Markov property we can see that the Kullback–Leibler divergence between the joint distributions

P_{θ_{1}}^{(n)}

and

P_{θ_{2}}^{(n)}

satisfies

K (P_{θ_{1}}^{(n)}, P_{θ_{2}}^{(n)}) = \sum_{i = 1}^{n} E_{θ_{1}} [Y_{i}] + E_{θ_{1}} [Z_{0}] .

If the Markov chain

{X_{i}}

is stationary under

θ_{1}

, so is

{Y_{i}}

. Hence

Y_{i} \overset{d}{=} Y_{1}

and the above equation reduces to,

\begin{matrix} K (P_{θ_{1}}^{(n)}, P_{θ_{2}}^{(n)}) = n E_{θ_{1}} [Y_{1}] + E_{θ_{1}} [Z_{0}] . \end{matrix}

(A14)

□

Appendix B.1.4. Proof of Proposition 3

First, recall the following result from [19].

Lemma B.1.2.

[19] (Lemma 1.2) Let

X_{- \infty}, \dots, X_{1}, X_{2}, \dots

be an α-mixing Markov chain with α-mixing coefficients given by

α_{k}

. Let

M_{a}^{b}

be the sigma-field generated by the subsequence

(X_{a}, X_{a + 1}, \dots, X_{b})

. Let

η_{t} \in M_{- \infty}^{t}

and

τ_{t} \in M_{t + k}^{\infty}

be adapted random variables such that

| η_{t} | \leq 1, | τ_{t} | \leq 1

. Then,

\begin{matrix} sup_{t} sup_{η_{t}, τ_{t}} | E [η_{t} τ_{t}] - E [η_{t}] E [τ_{t}] | \leq 4 α_{k} . \end{matrix}

(A15)

This lemma provides an upper bound on the covariance of events

η and τ

, as shown next.

Lemma B.1.3.

Let

η \in M_{- \infty}^{t}

τ \in M_{t + k}^{\infty}

be such that,

{E | η |}^{2 + δ} \leq C_{1}, E {| τ |}^{2 + δ} \leq C_{2} for some δ > 0

. Then, for a fixed

n < + \infty

, we have

| E η τ - E η E τ | \leq (\frac{4}{n} + 2 n^{δ / 2} (C_{1} + C_{2}) + 2 n^{δ / 2} \sqrt{C_{1} C_{2}}) α_{k}^{2 δ / (2 + δ)} .

(A16)

Proof.

Let

N < + \infty

be a fixed number. We get from the triangle inequality that

\begin{matrix} | E η τ - E η E τ | & \leq | E η τ I_{[| η | \leq N, | τ | \leq N]} - E η I_{[| η | \leq N]} E τ I_{[| τ | \leq N]} | \\ + | E η τ I_{[| η | \geq N, | τ | \leq N]} - E η I_{[| η | \geq N]} E τ I_{[| τ | \leq N]} | \\ + | E η τ I_{[| η | \leq N, | τ | \geq N]} - E η I_{[| η | \leq N]} E τ I_{[| τ | \geq N]} | \\ + | E η τ I_{[| η | \geq N, | τ | \geq N]} - E η I_{[| η | \geq N]} E τ I_{[| τ | \geq N]} | . \end{matrix}

(A17)

Multiplying and dividing the first term by

N^{2}

and applying Lemma B.1.2, we get

| E η τ I_{[| η | \leq N, | τ | \leq N]} - E η I_{[| η | \leq N]} E τ I_{[| τ | \leq N]} | \leq 4 N^{2} α_{k}

. For the second term, if

| τ | \leq N

, then

τ \leq N

and

τ \geq - N

. Plugging this in the second term we get,

\begin{matrix} (A18) & | E η τ I_{[| η | \geq N, | τ | \leq N]} - & E η I_{[| η | \geq N]} E τ I_{[| τ | \leq N]} | \leq |N E η I_{[| η | \geq N} + N [E η I_{[| η | \geq N]}]| \\ (A19) & = 2 N | E η I_{[| η | \geq N]} | . \end{matrix}

Since

| η | \geq N

, we have

1 \leq \frac{{| η |}^{1 + δ}}{N^{1 + δ}}

. Following this,

\begin{matrix} (A20) & | 2 N E η I_{[| η | \geq N]} | & \leq 2 N |E [\frac{{| η |}^{2 + δ}}{N^{1 + δ}} I_{[| η | \geq N]}]| \\ (A21) & \leq 2 N \frac{1}{N^{1 + δ}} | E η^{2 + δ} | \leq 2 \frac{C_{1}}{N^{δ}} . \end{matrix}

Similarly, we can also write for the third term,

| E η τ I_{[| η | \leq N, | τ | \geq N]} - E η I_{[| η | \leq N]} E τ I_{[| τ | \geq N]} | \leq 2 \frac{C_{2}}{N^{δ}}

. Finally, for the last term we get that by Cauchy-Schwarz inequality,

\begin{matrix} | E η τ I_{[| η | \geq N, | τ | \geq N]} - E η I_{[| η | \geq N]} E τ I_{[| τ | \geq N]} | & \leq \sqrt{Var [η I_{[| η | \geq N]}] Var [τ I_{[| τ | \geq N]}]} \end{matrix}

(A22)

\begin{matrix} < 2 \sqrt{Var [η I_{[| η | \geq N]}] Var [τ I_{[| τ | \geq N]}]} \end{matrix}

(A23)

\begin{matrix} \leq 2 \sqrt{E [η^{2} I_{[| η | \geq N]}] E [τ^{2} I_{[| τ | \geq N]}]} . \end{matrix}

(A24)

Since

| η | > N

,

1 < \frac{{| η |}^{δ}}{N^{δ}}

. Similarly,

1 < \frac{{| τ |}^{δ}}{N^{δ}}

. Plugging these in the previous equation, we get,

\begin{matrix} \sqrt{E [η^{2} I_{[| η | \geq N]}] E [τ^{2} I_{[| τ | \geq N]}]} & \leq \sqrt{\frac{1}{N^{2 δ}} E [{| η |}^{2 + δ} I_{[| η | \geq N]}] E [{| τ |}^{2 + δ} I_{[| τ | \geq N]}]} \end{matrix}

(A25)

\begin{matrix} \leq \frac{1}{N^{δ}} \sqrt{C_{1} C_{2}} . \end{matrix}

(A26)

Combining the four upper bounds above, we get,

\begin{matrix} | E η τ - E η E τ | & \leq 4 N^{2} α_{k} + \frac{2}{N^{δ}} (C_{1} + C_{2}) + \frac{2}{N^{δ}} \sqrt{C_{1} C_{2}} . \end{matrix}

(A27)

Now, in particular, setting

N = n^{- 1 / 2} α_{k}^{- 1 / (2 + δ)}

it follows that

\begin{matrix} | E η τ - E η E τ | & \leq \frac{4}{n} α_{k}^{δ / (2 + δ)} + 2 n^{δ / 2} α_{k}^{δ / (2 + δ)} (C_{1} + C_{2}) + 2 n^{δ / 2} α_{k}^{δ / (2 + δ)} \sqrt{C_{1} C_{2}} \end{matrix}

(A28)

\begin{matrix} = (\frac{4}{n} + 2 n^{δ / 2} (C_{1} + C_{2}) + 2 n^{δ / 2} \sqrt{C_{1} C_{2}}) α_{k}^{δ / (2 + δ)} . \end{matrix}

(A29)

□

Lemma B.1.4.

Let

{X_{t}}

be an α-mixing Markov chain with mixing coefficient

α_{k}

. Further assume that

E | X_{t} |^{2 + δ} \leq C_{1} and E {| X_{t + k} |}^{2 + δ} \leq C_{2}

for some

δ > 0

. Then, for any t and any

n > 0

| Cov (X_{t}, X_{t + k}) | \leq (\frac{4}{n} + 2 n^{δ / 2} (C_{1} + C_{2}) + 2 n^{δ / 2} \sqrt{C_{1} C_{2}}) α_{k}^{δ / (2 + δ)} .

(A30)

Proof.

Set

η = X_{t}, τ = X_{t + k}

Lemma B.1.3. □

We also need to establish the following technical lemma.

Lemma B.1.5.

Let

{X_{t}}

be an α-mixing Markov Chain with mixing coefficients

{α_{t}}

. Then the process

{Y_{t}}

where

Y_{t} : = log (\frac{p_{θ_{0}} (X_{t} | X_{t - 1})}{p_{θ} (X_{t} | X_{t - 1})})

is also α-mixing with mixing coefficients

{{\tilde{α}}_{t}}

where

{\tilde{α}}_{t} = α_{t - 1}

.

Proof.

By

Z_{i}

denote the paired random measure

(X_{i}, X_{i - 1})

. Let

M_{i}^{j}

denote the sigma field generated by the measures

X_{k}, where i \leq k \leq j

. By

G_{i}^{j}

denote the sigma field generated by the measures

Z_{k}, where i \leq k \leq j

. Let

C \in M_{i - 1}^{j}

. Then, C can be expressed as

(C_{i - 1} \times C_{i} \times \dots \times C_{j})

. for

C_{i - 1} \in M_{i - 1}^{i - 1}, C_{i} \in M_{i}^{i} \dots

and so on. Now, consider a map.

T_{i}^{j} : (C_{i - 1} \times C_{i} \times \dots \times C_{j}) \overset{}{\to} (C_{i - 1} \times C_{i} \times C_{i} \times \dots \times C_{j - 1} \times C_{j - 1} \times C_{j})

. Note that,

T_{i}^{j} (C) \in G_{i}^{j}

. It is easy to see that

G_{i}^{j} = T_{i}^{j} (M_{i - 1}^{j}) \cup M_{i - 1}^{* j}

, where

T_{i}^{j} (M_{i - 1}^{j})

is obtained by applying the map

T_{i}^{j}

to each element of

M_{i - 1}^{j}

. If we assume this latter set to be the range and

M_{i - 1}^{j}

to be the domain, then, by construction,

T_{i}^{j}

is a bijection. Furthermore, the two classes are made of disjoint sets, i.e., if

A \in T_{i}^{j} (M_{i - 1}^{j}) and A^{*} \in M_{i - 1}^{* j}

, then

A \cap A^{*} = ϕ

. Furthermore, note that

M_{i - 1}^{j *}

is made of impossible sets. i.e.,

P (A^{*}) = 0 \forall A^{*} \in M_{i - 1}^{j *}

. Now consider the

α

-mixing coefficients for

Z_{i}

. By definition, it is given by

\begin{matrix} α_{k}^{z} & = sup_{i} sup_{A \in G_{- \infty}^{i}, B \in G_{i + k}^{\infty}} | P (A \cap B) - P (A) P (B) | \\ = sup_{i} sup_{A \in G_{- \infty}^{i}, B \in G_{i + k}^{\infty}} | P ((A^{o} \cup A^{*}) \cap (B^{o} \cup B^{*})) - P ((A^{o} \cup A^{*})) P ((B^{o} \cup B^{*})) | . \end{matrix}

where,

$A = (A^{o} \cup A^{*})$	$B = (B^{o} \cup B^{*})$
$A^{o} \in T_{- \infty}^{i} (M_{- \infty}^{i})$	$A^{} \in M_{- \infty}^{ i}$
$B^{o} \in T_{i + k - 1}^{\infty} (M_{j + k - 1}^{\infty})$	$B^{} \in M_{j + k - 1}^{ \infty} .$

Then, the expression for the

α

-mixing coefficient can be reduced into

\begin{matrix} α_{k}^{z} & = sup_{i} sup_{A^{o} \in T_{- \infty}^{i} (M_{- \infty}^{i}), B^{o} \in T_{i + k - 1}^{\infty} (M_{i + k - 1}^{\infty})} | P (A^{o} \cap B^{o}) - P (A^{o}) P (B^{o}) | . \end{matrix}

Note that, by bijection property of

T_{i}^{j}

, we can find

A^{'} \in M_{- \infty}^{i}

and

B^{'} \in M_{i + k - 1}^{\infty}

such that

\begin{matrix} α_{k}^{z} & = sup_{i} sup_{A^{'} \in M_{- \infty}^{i}, B^{'} \in M_{i + k - 1}^{\infty}} | P (T_{- \infty}^{i} (A^{'}) \cap T_{i + k - 1}^{\infty} (B^{'})) - P (T_{- \infty}^{i} (A^{'})) P (T_{i + k - 1}^{\infty} (B^{'})) | . \\ = α_{k - 1} . \end{matrix}

Now,

log (\frac{p_{θ_{0}} (X_{n} | X_{n - 1})}{p_{θ} (X_{n} | X_{n - 1})})

is just a function of the paired Markov chain

Z_{i}

, therefore it has

α

-mixing coefficient

α_{k - 1}

. □

We now proceed to the proof of Proposition 3. Let

{X_{k}}

be a stationary

α

-mixing Markov chain under

θ_{1}

with mixing coefficients

{α_{k}}

. Observe that the log-likelihood can be expressed as

\begin{matrix} r_{n} (θ_{2}, θ_{1}) & = \sum_{i = 1}^{n} log (\frac{p_{θ_{1}} (X_{i} | X_{i - 1})}{p_{θ_{2}} (X_{i} | X_{i - 1})}) + log (\frac{q_{1}^{(0)} (X_{0})}{q_{2}^{(0)} (X_{0})}) \\ \equiv \sum_{i = 1}^{n} Y_{i} + Z_{0} . \end{matrix}

Therefore, the variance of the log-likelihood ratio is simply

\begin{matrix} {Var}_{θ_{1}} [r_{n} (θ_{2}, θ_{1})] & = {Var}_{θ_{1}} [\sum_{i = 1}^{n} Y_{i} + Z_{0}] \\ = \sum_{i, j = 1}^{n} {Cov}_{θ_{1}} (Y_{i}, Y_{j}) + \sum_{i, j = 1}^{n} {Cov}_{θ_{1}} (Y_{i}, Z_{0}) + {Cov}_{θ_{1}} (Z_{0}, Z_{0}) . \end{matrix}

It follows from Lemma B.1.5 that

{Y_{k}}

is a stochastic process with

α

-mixing coefficients

α_{k - 1}

. Therefore, using Lemma B.1.4 we have

\begin{matrix} | {Cov}_{θ_{1}} (Y_{i}, Y_{j}) | & = | E_{θ_{1}} Y_{i} Y_{j} - E_{θ_{1}} Y_{i} E_{θ_{1}} Y_{j} | \\ < (\frac{4}{n} + 2 n^{δ / 2} (E_{θ_{1}} | Y_{i} |^{2 + δ} + E_{θ_{1}} {| Y_{j} |}^{2 + δ} \\ + \sqrt{E_{θ_{1}} | Y_{i} |^{2 + δ} E_{θ_{1}} {| Y_{j} |}^{2 + δ}})) α_{| j - i | - 1}^{δ / (2 + δ)} \\ = (\frac{4}{n} + 2 n^{δ / 2} (C_{θ_{1}, θ_{2}}^{(i)} + C_{θ_{1}, θ_{2}}^{(j)} + \sqrt{C_{θ_{1}, θ_{2}}^{(i)} C_{θ_{1}, θ_{2}}^{(j)}})) α_{| j - i | - 1}^{δ / (2 + δ)} . \end{matrix}

Similarly, as above we can also say

\begin{matrix} | {Cov}_{θ_{1}} (Y_{i}, Z_{0}) | < & (\frac{4}{n} + 2 n^{δ / 2} (C_{θ_{1}, θ_{2}}^{(i)} + D_{1, 2} + \sqrt{C_{θ_{1}, θ_{2}}^{(i)} D_{1, 2}})) (α_{i - 1}^{δ / (2 + δ)}) \end{matrix}

Combining, the two upper bounds above, we get the first result:

\begin{matrix} {Var}_{θ_{1}} [r_{n} (θ_{2}, θ_{1})] & < \sum_{i, j = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} (C_{θ_{1}, θ_{2}}^{(i)} + C_{θ_{1}, θ_{2}}^{(j)} + \sqrt{C_{θ_{1}, θ_{2}}^{(i)} C_{θ_{1}, θ_{2}}^{(j)}})) (α_{| i - j | - 1}^{δ / (2 + δ)}) \\ + \sum_{i = 1}^{n} (\frac{4}{n^{2}} + 2 n^{δ / 2} (C_{θ_{1}, θ_{2}}^{(i)} + D_{1, 2} + \sqrt{C_{θ_{1}, θ_{2}}^{(i)} D_{1, 2}})) (α_{i - 1}^{δ / (2 + δ)}) \\ + Var [Z_{0}, Z_{0}] . \end{matrix}

If

{X_{i}}

is stationary under

θ_{1}

, so is

{Y_{i}}

. Therefore,

E_{θ_{1}} | Y_{i} |^{2 + δ} = E_{θ_{1}} {| Y_{1} |}^{2 + δ} = C_{θ_{1}, θ_{2}}^{(1)} \forall i

, and

\begin{matrix} \sum_{i, j = 1}^{n} {Cov}_{θ_{1}} (Y_{i}, Y_{j}) & \leq \sum_{i, j = 1}^{n} (\frac{4}{n} + 6 n^{δ / 2} C_{θ_{1}, θ_{2}}^{(1)}) α_{| j - i | - 1}^{δ / (2 + δ)} \\ \leq n (\frac{4}{n} + 6 n^{δ / 2} C_{θ_{1}, θ_{2}}^{(1)}) (\sum_{h \geq 1} α_{h - 1}^{δ / (2 + δ)}) . \end{matrix}

(A31)

Again, using Lemma B.1.4 on

{Cov}_{θ_{1}} (Y_{i}, Z_{0})

, yields

\sum_{i = 1}^{n} {Cov}_{θ_{1}} (Y_{i}, Z_{0}) \leq (\frac{4}{n} + 2 n^{δ / 2} (C_{θ} + D_{1, 2} + \sqrt{C_{θ} D_{1, 2}})) (\sum_{h \geq 1} α_{h}^{δ / (2 + δ)}) .

(A32)

Finally, using Equations (A31) and (A32) we have

\begin{matrix} {Var}_{θ_{1}} [r_{n} (θ_{2}, θ_{1})] & \leq n (\frac{4}{n} + 6 n^{δ / 2} C_{θ_{1}, θ_{2}}^{(1)}) (\sum_{h \geq 1} α_{h - 1}^{δ / (2 + δ)}) + \\ (\frac{4}{n} + 2 n^{δ / 2} (C_{θ_{1}, θ_{2}}^{(1)} + D_{1, 2} + \sqrt{C_{θ_{1}, θ_{2}}^{(1)} D_{1, 2}})) (\sum_{h \geq 1} α_{h}^{δ / (2 + δ)}) \\ + {Cov}_{θ_{1}} (Z_{0}, Z_{0}) . \end{matrix}

□

Appendix B.2. Proofs for Stationary Markov Data-Generating Models

Proof of Theorem 2

Part 1: Verifying condition (i) of Corollary 1.

We substitute the true parameter

θ_{0}

for

θ_{1}

and

θ

for

θ_{2}

. We also set

q_{1}^{(0)}

to be the invariant distribution of the Markov chain under

θ_{0}

,

q_{0}

, and

q_{2}^{(0)}

as the invariant distribution of the Markov chain under

θ

,

q_{θ}

. Applying the fact that these Markov chains are stationary to Proposition 2, we have

\begin{matrix} K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) & = n E [log (\frac{p_{θ_{0}} (X_{1} | X_{0})}{p_{θ} (X_{1} | X_{0})})] + E [Z_{0}], \\ \leq n \sum_{j = 1}^{m} E [M_{j}^{(1)} (X_{1}, X_{0})] | f_{j}^{(1)} (θ, θ_{0}) | + \sum_{k = 1}^{m} E [M_{k}^{(2)} (X_{0})] | f_{k}^{(2)} (θ, θ_{0}) |, \end{matrix}

(A33)

where the inequality follows from Assumption 1. Therefore, it follows that

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq n \sum_{j = 1}^{m} E [M_{j}^{(1)} (X_{1}, X_{0})] \int | f_{j}^{(1)} (θ, θ_{0}) | ρ_{n} (d θ) \\ + \sum_{k = 1}^{m} E [M_{k}^{(2)} (X_{0})] | \int f_{k}^{(2)} (θ, θ_{0}) | ρ_{n} (d θ) . \end{matrix}

By Assumption 1(i), it follows that

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq n \sum_{j = 1}^{m} E [M_{j}^{(1)} (X_{1}, X_{0})] \frac{C}{\sqrt{n}} + \sum_{k = 1}^{m} E [M_{k}^{(2)} (X_{0})] \frac{C}{\sqrt{n}} \leq n ϵ_{n}^{(1)}, \end{matrix}

where

ϵ_{n}^{(1)} = O (\frac{1}{\sqrt{n}})

.

Part 2: Verifying condition (ii) of 1. Again, using Proposition 3 along with the fact that the Markov chain is stationary we have

\begin{matrix} Var [r_{n} (θ, θ_{0})] & \leq n (\frac{4}{n} + 6 n^{δ / 2} C_{θ_{0}, θ}^{(1)}) (\sum_{k \geq 0} α_{k}^{δ / (2 + δ)}) \\ + (\frac{4}{n^{2}} + 2 n^{δ / 2} (C_{θ_{0}, θ}^{(1)} + D_{θ_{0}, θ} + \sqrt{C_{θ_{0}, θ}^{(1)} D_{θ_{0}, θ}})) (\sum_{k \geq 1} α_{k}^{δ / (2 + δ)}) \\ + Var [Z_{0}] . \end{matrix}

It then follows that

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) & \leq n (\frac{4}{n} + 6 n^{δ / 2} \int C_{θ_{0}, θ}^{(1)} ρ_{n} (d θ)) (\sum_{k \geq 1} α_{k - 1}^{δ / (2 + δ)}) + \int Var [Z_{0}] ρ_{n} (d θ) \\ + (\frac{4}{n^{2}} + 2 n^{δ / 2} (\int C_{θ_{0}, θ}^{(1)} ρ_{n} (d θ) \\ + \int D_{θ_{0}, θ} ρ_{n} (d θ) + \int \sqrt{C_{θ_{0}, θ}^{(1)} D_{θ_{0}, θ}} ρ_{n} (d θ))) (\sum_{k \geq 1} α_{k}^{δ / (2 + δ)}) . \end{matrix}

First, consider the term

\int C_{θ_{0}, θ}^{(1)} ρ_{n} (θ)

, and observe that

\begin{matrix} \int C_{θ_{0}, θ}^{(1)} ρ_{n} (d θ) & = \int E log {|\frac{p_{θ_{0}} (X_{1} | X_{0})}{p_{θ} (X_{1} | X_{0})}|}^{2 + δ} ρ_{n} (d θ) . \end{matrix}

By Assumption 1, we have

\begin{matrix} \int E log {|\frac{p_{θ_{0}} (X_{1} | X_{0})}{p_{θ} (X_{1} | X_{0})}|}^{2 + δ} ρ_{n} (d θ) & \leq \int E {[\sum_{j = 1}^{m} M_{j}^{(1)} (X_{1}, X_{0}) | f_{k}^{(1)} (θ, θ_{0}) |]}^{2 + δ} ρ_{n} (d θ) . \end{matrix}

Since the function

x \mapsto x^{2 + δ}

is convex, we can apply Jensen’s inequality to obtain,

\begin{matrix} {(\sum_{j = 1}^{m} M_{j}^{(1)} (X_{1}, X_{0}) | f_{k}^{(1)} (θ, θ_{0}) |)}^{2 + δ} & \leq m^{1 + δ} \sum_{k = 1}^{m} M_{j}^{(1)} {(X_{1}, X_{0})}^{2 + δ} {| f_{k}^{(1)} (θ, θ_{0}) |}^{2 + δ} . \end{matrix}

Therefore, it follows that

\begin{matrix} \int E log {|\frac{p_{θ_{0}} (X_{1} | X_{0})}{p_{θ} (X_{1} | X_{0})}|}^{2 + δ} ρ_{n} (d θ) & \leq m^{1 + δ} \sum_{k = 1}^{m} E [M_{k}^{(1)} {(X_{1}, X_{0})}^{2 + δ}] \\ \times \int | f_{k}^{(1)} (θ, θ_{0}) |^{2 + δ} ρ_{n} (d θ) . \end{matrix}

By Assumption 1,

\int | f_{k} (θ, θ_{0}) |^{2 + δ} ρ_{n} (d θ) < \frac{C}{n}

and

E [M_{k}^{(1)} {(X_{1}, X_{0})}^{2 + δ}] < B

, implying that

\begin{matrix} \int C_{θ_{0}, θ}^{(1)} ρ_{n} (d θ) & \leq m^{1 + δ} \sum_{k = 1}^{m} B \frac{C}{n} = m^{2 + δ} \frac{B C}{n} . \end{matrix}

Since

(\sum_{k \geq 0} α_{k}^{δ / (2 + δ)}) < \infty

, it follows that

(\frac{4}{n} + 6 n^{δ / 2} \int C_{θ_{0}, θ}^{(1)} ρ_{n} (d θ)) (\sum_{k \geq 1} α_{k - 1}^{δ / (2 + δ)}) = O (\frac{n^{δ / 2}}{n})

. Similarly, we can show that

\int D_{θ_{0}, θ} ρ_{n} (d θ) = O (\frac{1}{n})

, and

\int Var [Z_{0}] ρ_{n} (d θ) = O (\frac{1}{n})

.

For the final term

\int \sqrt{C_{θ_{0}, θ}^{(1)} D_{θ_{0}, θ}} ρ_{n} (d θ)

, use the Cauchy-Schwarz inequality to obtain the upper bound

{(\int C_{θ_{0}, θ}^{(1)} ρ_{n} (d θ) \int D_{θ_{0}, θ} ρ_{n} (d θ))}^{1 / 2}

which is also of order

O (\frac{1}{n})

. Combining all of these together we have

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) \leq n ϵ_{n}^{(2)}, \end{matrix}

for some

ϵ_{n}^{(2)} = O (\frac{n^{δ / 2}}{n})

.

Since

K (ρ_{n}, π) < \sqrt{n} C = n \frac{C}{\sqrt{n}},

it follows that

K (ρ_{n}, π) < n ϵ_{n}^{(3)},

where

ϵ_{n}^{(3)} = O (1 / \sqrt{n})

as before. Finally, by choosing

ϵ_{n} = max (ϵ_{n}^{(1)}, ϵ_{n}^{(2)}, ϵ_{n}^{(3)})

, our theorem is proved. □

Appendix B.3. Proofs for Non-Stationary, Ergodic Markov Data-Generating Models

Appendix B.3.1. Proof of Theorem 3

Part 1: Verifying condition (i) of Corollary 1: As in the proof of Theorem 2 substitute the true parameter

θ_{0}

for

θ_{1}

and

θ

for

θ_{2}

in. We also set

q_{1}^{(0)}

and

q_{2}^{(0)}

to the distribution

q^{(0)}

. Applying Proposition 2 to the corresponding transition kernels and initial distribution we have,

\begin{matrix} K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) & = \sum_{i = 1}^{n} E [log (\frac{p_{θ_{0}} (X_{i} | X_{i - 1})}{p_{θ} (X_{i} | X_{i - 1})})] + E [log (\frac{D (X_{0})}{D (X_{0})})] \\ = \sum_{i = 1}^{n} E [log (\frac{p_{θ_{0}} (X_{i} | X_{i - 1})}{p_{θ} (X_{i} | X_{i - 1})})] . \end{matrix}

(A34)

Now, applying Assumption 1, we can bound the previous equation as follows,

\begin{matrix} K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) & \leq \sum_{i = 1}^{n} E [\sum_{k = 1}^{m} M_{k}^{(1)} (X_{i}, X_{i - 1}) | f_{k}^{(1)} (θ, θ_{0}) |] \\ = \sum_{i = 1}^{n} \sum_{k = 1}^{m} E [M_{k}^{(1)} (X_{i}, X_{i - 1})] | f_{k}^{(1)} (θ, θ_{0}) | . \end{matrix}

(A35)

Since

M_{k}^{(1)}

’s are bounded there exists a constant Q so that,

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq Q \int \sum_{i = 1}^{n} \sum_{k = 1}^{m} | f_{k}^{(1)} (θ, θ_{0}) | ρ_{n} (d θ) \\ = Q n \sum_{k = 1}^{m} \int | f_{k}^{(1)} (θ, θ_{0}) | ρ_{n} (d θ) . \end{matrix}

By Assumption 19 in Assumption 1, it follows that

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq Q n \sum_{k = 1}^{m} \frac{C}{\sqrt{n}} = n m Q \frac{C}{\sqrt{n}} = n ϵ_{n}^{(1)}, \end{matrix}

for some

ϵ_{n}^{(1)} = O (\frac{1}{\sqrt{n}})

.

Part 2: Verifying condition (ii) of Corollary 1: As in the previous part,

Z_{0} = 0

, implying that

D_{θ, θ_{0}}

. Applying Proposition 3 and integrating with respect to

ρ_{n}

, we obtain

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) & \leq \sum_{i = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} \int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ)) (α_{i - 1}^{δ / (2 + δ)}) \\ + \sum_{i, j = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} (\int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ) + \int C_{θ_{0}, θ}^{(j)} ρ_{n} (d θ) + \int \sqrt{C_{θ_{0}, θ}^{(i)} C_{θ_{0}, θ}^{(j)}} ρ_{n} (d θ))) \\ \times (α_{| i - j | - 1}^{δ / (2 + δ)}) . \end{matrix}

(A36)

First, consider the term

\int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ)

. Using Assumption 1, we can upper bound

C_{θ_{0}, θ}^{(i)}

as,

\begin{matrix} C_{θ_{0}, θ}^{(i)} & \leq E {[\sum_{k = 1}^{m} M_{k}^{(1)} (X_{i}, X_{i - 1}) | f_{k}^{(1)} (θ, θ_{0}) |]}^{2 + δ} \\ \leq \sum_{k = 1}^{m} m^{1 + δ} E [{(M_{k}^{(1)} (X_{i}, X_{i - 1}) | f_{k}^{(1)} (θ, θ_{0}) |)}^{2 + δ}] (by Jensen ’ s inequality) \\ = \sum_{k = 1}^{m} m^{1 + δ} E [M_{k}^{(1)} {(X_{i}, X_{i - 1})}^{2 + δ}] {| f_{k}^{(1)} (θ, θ_{0}) |}^{2 + δ} . \end{matrix}

Since

M_{k}^{(1)}

’s are upper bounded by Q, it follows from the previous expression that,

C_{θ_{0}, θ}^{(i)} \leq \sum_{k = 1}^{m} m^{1 + δ} Q^{2 + δ} {| f_{k}^{(1)} (θ, θ_{0}) |}^{2 + δ}

.

Hence, from Assumption 1, we get,

\begin{matrix} \int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ) & \leq \sum_{k = 1}^{m} m^{1 + δ} Q^{2 + δ} \int {| f_{k}^{(1)} (θ, θ_{0}) |}^{2 + δ} ρ_{n} (d θ) \leq {(m Q)}^{2 + δ} \frac{C}{n} . \end{matrix}

Using the upper bound above, we can say for an L large enough,

\int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ) \leq \frac{L}{n}

. Next, by the Cauchy-Schwarz inequality, we have that

\int \sqrt{C_{θ_{0}, θ}^{(i)} C_{θ_{0}, θ}^{(j)} ρ_{n} (d θ))} < \sqrt{\int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ) \int C_{θ_{0}, θ}^{(j)} ρ_{n} (d θ))} \leq \frac{L}{n}

. Thus, we have the following upper bound.

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) & \leq \sum_{i = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} \frac{L}{n}) (α_{i - 1}^{δ / (2 + δ)}) \\ + \sum_{i, j = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} (\frac{L}{n} + \frac{L}{n} + \frac{L}{n})) (α_{| i - j | - 1}^{δ / (2 + δ)}) \\ = (\frac{4}{n} + 2 n^{δ / 2} \frac{L}{n}) (\sum_{i = 1}^{n} α_{i - 1}^{δ / (2 + δ)}) \\ + (\frac{4}{n} + 6 n^{δ / 2} \frac{L}{n}) (\sum_{i, j = 1}^{n} α_{| i - j | - 1}^{δ / (2 + δ)}) . \end{matrix}

Since

\sum_{i, j = 1}^{n} α_{| i - j | - 1}^{δ / (2 + δ)} < n \sum_{k \geq 1} α_{k - 1}^{δ / (2 + δ)} < \infty

, we have that for some

ϵ_{n}^{(2)} = O (\frac{n^{δ / 2}}{n})

,

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) < n ϵ_{n}^{(2)} . \end{matrix}

Since

K (ρ_{n}, π) \leq \sqrt{n} C

, following the concluding argument in Theorem 2 completes the proof. □

Appendix B.3.2. Proof of Proposition 8

We verify Assumption 1 and the proof follows from Theorem 3. For

i \in {1, 2, \dots, K - 1}

,

\begin{matrix} p_{θ} (j | i) = \{\begin{matrix} θ & if j = i - 1, \\ 1 - θ & if j = i + 1 . \end{matrix} \end{matrix}

If

i = 0

or

i = K

, then the Markov chain goes back to 1 or

K - 1

, respectively, with probability 1. With the convention

log \frac{0}{0} = 0

, the log ratio of the transition probabilities becomes,

\begin{matrix} | log p_{θ_{0}} (X_{1} | X_{0}) - log p_{θ} (X_{1} | X_{0}) | = I_{[X_{1} = X_{0} + 1]} log (\frac{θ_{0}}{θ}) + I_{[X_{1} = X_{0} - 1]} log (\frac{1 - θ_{0}}{1 - θ}) . \end{matrix}

In this case,

m = 2

.

M_{1}^{(1)} (X_{1}, X_{0}) = I_{[X_{1} = X_{0} + 1]}

and

M_{2}^{(1)} (X_{1}, X_{0}) = I_{[X_{1} = X_{0} - 1]}

, both of which are bounded. Let

f_{1}^{(1)} (θ, θ_{0}) : = log (\frac{θ_{0}}{θ})

suppose

f_{2}^{(1)} (θ, θ_{0}) : = log (\frac{1 - θ_{0}}{1 - θ})

.

The stationary distribution

q_{θ} (i) = \frac{1}{K} \forall i \in 1, 2, \dots, K

. Hence the log of the ratio of the invariant distribution becomes

\begin{matrix} log q_{0} (x) - log q_{θ} (x) & = 0, \end{matrix}

(A37)

and we can set

M_{i}^{(2)} (\cdot) : = 1

and

f_{i}^{(2)} (\cdot, \cdot) : = 0

for

i \in {1, 2}

. Thus, to prove the concentration bound for this Markov chain it is enough to assume that

δ = 1

and show that

\int {[f_{1}^{(1)} (θ, θ_{0})]}^{3} ρ_{n} (d θ) < \frac{C}{n}

and

\int {[f_{2}^{(1)} (θ, θ_{0})]}^{3} ρ_{n} (d θ) < \frac{C}{n}

for some constant

C > 0

.

As given,

{ρ_{n}}

is a sequence of beta probability distribution functions, with parameters

a_{n}, b_{n}

that satisfy the constraint

\frac{a_{n}}{a_{n} + b_{n}} = θ_{0}

. Specifically, we choose

a_{n} = n θ_{0}

and (therefore)

b_{n} = n (1 - θ_{0})

. Thus, we get the following,

\begin{matrix} \int | f_{1}^{(1)} (θ, θ_{0}) |^{3} ρ_{n} (d θ) & = \int {|log (\frac{θ_{0}}{θ})|}^{3} ρ_{n} (d θ) \\ < \int {|\frac{θ_{0}}{θ} - 1|}^{3} ρ_{n} (d θ) \\ = \frac{1}{Beta (a_{n}, b_{n})} \int_{0}^{1} {|\frac{θ_{0} - θ}{θ}|}^{3} θ^{a_{n} - 1} {(1 - θ)}^{b_{n} - 1} d θ . \end{matrix}

Since

θ_{0}, θ \in (0, 1)

, so is

\frac{| θ_{0} - θ |}{2}

, giving

| θ_{0} {- θ |}^{3} < 2 {(θ_{0} - θ)}^{2}

. We use that fact to arrive at

\begin{matrix} \int | f_{1}^{(1)} (θ, θ_{0}) |^{3} ρ_{n} (d θ) & \leq \frac{2}{Beta (a_{n}, b_{n})} \int_{0}^{1} {(θ_{0} - θ)}^{2} θ^{a_{n} - 4} {(1 - θ)}^{b_{n} - 1} d θ \\ = \frac{2 Beta (a_{n} - 3, b_{n})}{Beta (a_{n}, b_{n})} \frac{(a_{n} - 3) (b_{n})}{{(a_{n} + b_{n} - 3)}^{2} (a_{n} + b_{n} - 2)} . \end{matrix}

From our choice of

a_{n}

and

b_{n}

,

\frac{2 Beta (a_{n} - 3, b_{n})}{Beta (a_{n}, b_{n})} = O (1)

, and plugging the values of

a_{n}

and

b_{n}

into

\frac{(a_{n} - 3) (b_{n})}{{(a_{n} + b_{n} - 3)}^{2} (a_{n} + b_{n} - 2)}

, we get

\frac{(a_{n} - 3) (b_{n})}{{(a_{n} + b_{n} - 3)}^{2} (a_{n} + b_{n} - 2)} = \frac{1}{n} \frac{(θ_{0} - \frac{3}{n}) (1 - θ_{0})}{{(1 - \frac{3}{n})}^{2} (1 - \frac{2}{n})}

, which is upper bounded by

\frac{C_{1}}{n}

for some constant

C_{1} > 0

. Hence,

\begin{matrix} \int | f_{1}^{(1)} (θ, θ_{0}) |^{3} ρ_{n} (d θ) & < \frac{C_{1}}{n} . \end{matrix}

Similarly, we can also show that,

\begin{matrix} \int | f_{2}^{(1)} (θ, θ_{0}) |^{3} ρ_{n} (d θ) & < \frac{C_{2}}{n} . \end{matrix}

Finally, from Proposition A.2.1, we get that

K (ρ_{n}, π) < C + \frac{1}{2} log (n)

for some large constant C. Hence,

K (ρ_{n}, π) < C_{3} \sqrt{n}

for some constant

C_{3} > 0

. Choosing

C = max (C_{1}, C_{2}, C_{3})

, we satisfy all the conditions of Assumption 1 and Theorem 3. □

Appendix B.3.3. Proof of Proposition 9

For the purpose of this proof, we choose

ρ_{n}

’s with scaled Beta distribution with parameters

a_{n} = n (θ_{0} / 2)

and

b_{n} = n (1 - θ_{0} / 2)

. Since,

ρ_{n}

is a scaled Beta distribution with the scaling factors

m = 0.5

and

c = 0

, the pdf of

ρ_{n}

is given by

\begin{matrix} ρ_{n} (θ) & = \frac{2}{Beta (a_{n}, b_{n})} {(2 θ)}^{a_{n}} {(1 - 2 θ)}^{b_{n}} \end{matrix}

Since this is a scaled distribution,

E_{ρ_{n}} [θ] = 2 \frac{a_{n}}{a_{n} + b_{n}} = θ_{0}

and there exists a constant

σ > 0

,

{Var}_{ρ_{n}} [θ] = \frac{σ^{2}}{n}

. Now, we analyse the transition probabilities. For

i \in {1, 2, \dots}

, the Birth-Death process has transition probabilities

\begin{matrix} p_{θ} (j | i) = \{\begin{matrix} θ & if j = i - 1, \\ 1 - θ & if j = i + 1 . \end{matrix} \end{matrix}

If

i = 0

, then the Markov chain goes to 1 with probability 1. Hence with the convention

log \frac{0}{0} = 0

the ratio of the log of the transition probabilities becomes,

\begin{matrix} | log p_{θ_{0}} (X_{1} | X_{0}) - log p_{θ} (X_{1} | X_{0}) | = I_{[X_{1} = X_{0} + 1]} log [\frac{θ_{0}}{θ}] + I_{[X_{1} = X_{0} - 1]} log [\frac{1 - θ_{0}}{1 - θ}] . \end{matrix}

In this case,

m = 3

.

M_{1}^{(1)} (X_{1}, X_{0}) = I_{[X_{1} = X_{0} + 1]}

and

M_{2}^{(1)} (X_{1}, X_{0}) = I_{[X_{1} = X_{0} - 1]}

. Define

M_{3}^{(1)} (X_{1}, X_{0}) : = 1

. All these random variables are bounded. Define

f_{1}^{(1)} (θ, θ_{0}) : = log [\frac{θ_{0}}{θ}], f_{2}^{(1)} (θ, θ_{0}) : = log [\frac{1 - θ_{0}}{1 - θ}]

and

f_{3}^{(1)} (θ, θ_{0}) : = 0

. Similarly as in the proof on Proposition 8,

\begin{matrix} \int {[f_{1}^{(1)} (θ, θ_{0})]}^{3} ρ_{n} (d θ) & < \frac{C_{1}}{n}, and \\ \int {[f_{2}^{(1)} (θ, θ_{0})]}^{3} ρ_{n} (d θ) & < \frac{C_{2}}{n} . \end{matrix}

The stationary distribution is given by

q_{θ} (i) = {(\frac{θ}{1 - θ})}^{i - 1} q_{θ} (1) \forall i \in 1, 2, \dots

, so that

q_{θ} (i) = (1 - θ) {(\frac{θ}{1 - θ})}^{i - 1}

Hence the log of the ratio of the invariant distribution becomes

\begin{matrix} log q_{0} (i) - log q_{θ} (i) & = log [\frac{1 - θ_{0}}{1 - θ}] + (i - 1) log [\frac{θ_{0}}{θ}] - (i - 1) log [\frac{1 - θ_{0}}{1 - θ}] \end{matrix}

(A38)

We define

M_{1}^{(2)} (X_{0}) : = 1

, and

M_{2}^{(2)} (X_{0}) = M_{3}^{(2)} (X_{0}) : = X_{0} - 1

. We can write

E_{q^{(0)}} {[M_{2}^{(2)} (X_{0})]}^{2} = \sum_{i = 1}^{\infty} {(i - 1)}^{2} q^{(0)} (i) < \sum_{i = 1}^{\infty} i^{2} q^{(0)} (i)

. We have chosen

q^{(0)}

such that

\sum_{i = 1}^{\infty} i^{2} q^{(0)} (i)

is bounded. Hence,

E_{q^{(0)}} {[M_{2}^{(2)} (X_{0})]}^{2} < \infty

. To verify Assumption i define,

f_{1}^{(2)} (θ, θ_{0}) = - f_{3}^{(2)} (θ, θ_{0}) : = log [\frac{1 - θ_{0}}{1 - θ}]

, and define

f_{2}^{(2)} (θ, θ_{0}) : = log [\frac{θ_{0}}{θ}]

. Therefore following the proof of Proposition 8,

\begin{matrix} \int | f_{1}^{(2)} (θ, θ_{0}) |^{3} ρ_{n} (d θ) = \int {| f_{3}^{(2)} (θ, θ_{0}) |}^{3} ρ_{n} (d θ) = & \int | f_{2}^{(1)} (θ, θ_{0}) |^{3} ρ_{n} (d θ) < \frac{C_{2}}{n}, and, \\ \int | f_{2}^{(2)} (θ, θ_{0}) |^{3} ρ_{n} (d θ) = & \int | f_{1}^{(1)} (θ, θ_{0}) |^{3} ρ_{n} (d θ) < \frac{C_{1}}{n} . \end{matrix}

Finally, we take the KL-divergence

K (ρ_{n}, π)

.

ρ_{n}

follows a scaled Beta distribution on

(0, 1 / 2)

with parameters

a_{n} = n (θ_{0} / 2)

and

b_{n} = n (1 - θ_{0} / 2)

, while

π

follows a scaled Beta distribution on

(0, 1 / 2)

with parameters a and b. Thus,

\begin{matrix} K (ρ_{n}, π) & = \int_{0}^{\frac{1}{2}} log (\frac{ρ_{n} (θ)}{π (θ)}) ρ_{n} (d θ), \end{matrix}

which, by substituting

t = 2 θ

, we get,

\begin{matrix} K (ρ_{n}, π) & = 2 \int_{0}^{1} log (\frac{ρ_{n} (t)}{π (t)}) ρ_{n} (d t) . \end{matrix}

\int_{0}^{1} log (\frac{ρ_{n} (t)}{π (t)}) ρ_{n} (d t)

is the KL-divergence between a Beta distribution with parameters

a_{n}

and

b_{n}

and a Beta distribution with parameters a and b. An application of Proposition A.2.1 gives us for a constant

C_{1} > 0

,

\begin{matrix} \int_{0}^{1} log (\frac{ρ_{n} (t)}{π (t)}) ρ_{n} (d t) < C_{1} + \frac{1}{2} log (n) . \end{matrix}

Hence we can say,

K (ρ_{n}, π) < 2 [C_{1} + \frac{1}{2} log (n)]

. Thus, we now get that for some constant

C_{3} > 0

,

\begin{matrix} K (ρ_{n}, π) < C_{3} \sqrt{n} . \end{matrix}

Choosing

C = max (C_{1}, C_{2}, C_{3})

we satisfy all of the conditions of Assumption 1 and thus by Theorem 3, we are complete the proof. □

Appendix B.3.4. Proof of Theorem 4

Part 1: Verifying condition (i) of Corollary 1 As in the proof of Theorem 2 substitute the true parameter

θ_{0}

for

θ_{1}

and

θ

for

θ_{2}

. We also set our initial distributions

q_{1}^{(0)}

and

q_{2}^{(0)}

to the known initial distribution

q^{(0)}

. A method similar to Equation (A35), yields

\begin{matrix} K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) & \leq \sum_{i = 1}^{n} \sum_{k = 1}^{m} E [M_{k}^{(1)} (X_{i}, X_{i - 1})] | f_{k}^{(1)} (θ, θ_{0}) | . \end{matrix}

Because

M_{k}^{(1)}

s satisfy Assumption 2, it follows by the application of Theorem 2.3, [21] that

\exists λ > 0

such that for any

0 < κ \leq λ

, and for some

ζ \in (0, 1)

possibly depending upon

λ

,

\begin{matrix} E [e^{κ M_{k}^{(1)} (X_{i}, X_{i - 1})}| X_{1}, X_{0}] \leq ζ^{i - 1} e^{κ M_{k}^{(1)} (X_{1}, X_{0})} + \frac{1 - ζ^{i}}{1 - ζ} D e^{κ a} for all i > 1 . \end{matrix}

We rewrite

E [M_{k}^{(1)} (X_{i}, X_{i - 1}) | X_{1}, X_{0}]

as follows:

\begin{matrix} E [M_{k}^{(1)} (X_{i}, X_{i - 1}) | X_{1}, X_{0}] & = \frac{E [κ M_{k}^{(1)} (X_{i}, X_{i - 1}) | X_{1}, X_{0}]}{κ} \\ \leq \frac{E [e^{κ M_{k}^{(1)} (X_{i}, X_{i - 1})} | X_{1}, X_{0}]}{κ} . \end{matrix}

Therefore,

\sum_{i = 1}^{n} E [M_{k}^{(1)} (X_{i}, X_{i - 1})]

can be upper bounded as,

\begin{matrix} \sum_{i = 1}^{n} E [M_{k}^{(1)} (X_{i}, X_{i - 1})] & = \sum_{i = 1}^{n} E [κ M_{k}^{(1)} (X_{i}, X_{i - 1}) | X_{1}, X_{0}] κ^{- 1} \\ \leq \sum_{i = 1}^{n} [ζ^{i - 1} E e^{κ M_{k}^{(1)} (X_{1}, X_{0})} + \frac{1 - ζ^{i}}{1 - ζ} D e^{κ a}] κ^{- 1} . \end{matrix}

Since,

ζ \in (0, 1)

,

ζ^{i} < 1

. Hence, we can write that,

\begin{matrix} \sum_{i = 1}^{n} [ζ^{i - 1} E e^{κ M_{k}^{(1)} (X_{1}, X_{0})} + \frac{1 - ζ^{i}}{1 - ζ} D e^{κ a}] κ^{- 1} & \leq \sum_{i = 1}^{n} [ζ^{i - 1} E e^{κ M_{k}^{(1)} (X_{1}, X_{0})} + \frac{1}{1 - ζ} D e^{κ a}] κ^{- 1} \\ = [\frac{1 - ζ^{n}}{1 - ζ} E e^{κ M_{k}^{(1)} (X_{1}, X_{0})} + \frac{n}{1 - ζ} D e^{κ a}] κ^{- 1} \\ \leq n L, \end{matrix}

for a large constant L. Therefore

\int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ)

can be upper bounded as follows,

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq \int \sum_{k = 1}^{m} n L | f_{k}^{(1)} (θ, θ_{0}) | ρ_{n} (d θ) \\ = \sum_{k = 1}^{m} n L \int | f_{k}^{(1)} (θ, θ_{0}) | ρ_{n} (d θ) . \end{matrix}

By Assumption 1,

\int | f_{k}^{(1)} (θ, θ_{0}) | ρ_{n} (d θ) < \frac{C}{n}

, hence,

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq n L \frac{C}{\sqrt{n}} . \end{matrix}

Hence, for some

ϵ_{n}^{(1)} = O (\frac{1}{\sqrt{n}})

, we have obtained that,

\int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) \leq n ϵ_{n}^{(1)}

.

Part 2: Verifying condition (ii) of Corollary 1: Similar to as in the proof of Theorem 3, we upper bound

\int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ)

by

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) & \leq \sum_{i, j = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} (\int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ) + \int C_{θ_{0}, θ}^{(j)} ρ_{n} (d θ) \end{matrix}

(A39)

\begin{matrix} + \int \sqrt{C_{θ_{0}, θ}^{(i)} C_{θ_{0}, θ}^{(j)}} ρ_{n} (d θ))) (α_{| i - j | - 1}^{δ / (2 + δ)}) \\ + \sum_{i = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} \int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ)) (α_{i - 1}^{δ / (2 + δ)}), \end{matrix}

(A40)

where

C_{θ_{0}, θ}

is upper bounded as

\begin{matrix} C_{θ_{0}, θ}^{(i)} & \leq \sum_{k = 1}^{m} m^{1 + δ} E {[M_{k}^{(1)} (X_{i}, X_{i - 1})]}^{2 + δ} {| f_{k}^{(1)} (θ, θ_{0}) |}^{2 + δ} . \end{matrix}

There exists a constant

C_{δ}

depending upon

δ

such that,

\begin{matrix} {[M_{k}^{(1)}]}^{2 + δ} (X_{i}, X_{i - 1}) & = \frac{κ^{2 + δ} {[M_{k}^{(1)}]}^{2 + δ} {(X_{i}, X_{i - 1})}^{2 + δ}}{κ^{2 + δ}} \\ \leq \frac{e^{κ M_{k}^{(1)} (X_{i}, X_{i - 1})} + C_{δ}}{κ^{2 + δ}} . \end{matrix}

By expressing

E [M_{k}^{(1)} {(X_{i}, X_{i - 1})}^{2 + δ}] = E [E [M_{k}^{(1)} {(X_{i}, X_{i - 1})}^{2 + δ} | X_{1}, X_{0}]]

and following a method similar to the previous part, we get,

\begin{matrix} E [M_{k}^{(1)} {(X_{i}, X_{i - 1})}^{2 + δ}] & \leq \frac{[ζ^{i} E e^{κ M_{k}^{(1)} (X_{1}, X_{0})} + \frac{1 - ζ^{i}}{1 - ζ} D e^{κ a}] + C_{δ}}{κ^{2 + δ}} . \end{matrix}

The fact that

0 < ζ < 1

implies that

0 < ζ^{i} < ζ

. This gives us the following,

\begin{matrix} E [M_{k}^{(1)} {(X_{i}, X_{i - 1})}^{2 + δ}] & \leq \frac{[ζ E e^{κ M_{k}^{(1)} (X_{1}, X_{0})} + \frac{1}{1 - ζ} D e^{κ a}] + C_{δ}}{κ^{2 + δ}} . \end{matrix}

Since

κ < λ

, by the application of Jensen’s inequality, we get

\begin{matrix} E [M_{k}^{(1)} {(X_{i}, X_{i - 1})}^{2 + δ}] & \leq \frac{[ζ E e^{λ M_{k}^{(1)} (X_{1}, X_{0})} + \frac{1}{1 - ζ} D e^{κ a}] + C_{δ}}{κ^{2 + δ}} \\ = \frac{[ζ \int e^{λ M_{k}^{(1)} (x_{1}, x_{0})} p_{θ_{0}} (x_{1} | x_{0}) D (x_{0}) d x_{1} d x_{0} + \frac{1}{1 - ζ} D e^{κ a}] + C_{δ}}{κ^{2 + δ}} . \end{matrix}

We know that

\int | f_{k}^{(1)} (θ, θ_{0}) |^{2 + δ} ρ_{n} (d θ) < \frac{C}{n}

. Thus, following Assumption 1 we can say that, for a large constant L,

\int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ) \leq \frac{L}{n}

. The rest of the proof follows similarly as in the proof of Theorem 3, and we obtain an

ϵ_{n}^{(2)} = O (\frac{n^{δ / 2}}{n})

, such that,

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) < n ϵ_{n}^{(2)} . \end{matrix}

Since,

K (ρ_{n}, π) \leq \sqrt{n} C

, similar arguments as in the proof of Theorem 2 holds. The theorem is thus proved.

Appendix B.3.5. Proof of Theorem 5

Part 1: Verifying condition (i) of Corollary 1 As in the proof of Theorem 2 substitute the true parameter

θ_{0}

for

θ_{1}

and

θ

for

θ_{2}

. We also set

q_{1}^{(0)}

and

q_{2}^{(0)}

to the known initial distribution

q^{(0)}

. Similar to the steps leading to Equation (A35), we get

\begin{matrix} K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) & \leq \sum_{i = 1}^{n} \sum_{k = 1}^{m} E [M_{k}^{(1)} (X_{i}, X_{i - 1})] | f_{k}^{(1)} (θ, θ_{0}) | . \end{matrix}

Consider the term

E [M_{k}^{(1)} (X_{i}, X_{i - 1})]

. With

q_{θ_{0}}^{(i - 1)}

the marginal distribution of

X_{i - 1}

, we have

\begin{matrix} E [M_{k}^{(1)} (X_{i}, X_{i - 1})] & = \int M_{k}^{(1)} (x_{i}, x_{i - 1}) p_{θ_{0}} (x_{i} | x_{i - 1}) q_{θ_{0}}^{(i - 1)} (x_{i - 1}) d x_{i} d x_{i - 1} . \\ E [M_{k}^{(1)} (X_{i}, X_{i - 1})] & = \int M_{k}^{(1)} (x_{i}, x_{i - 1}) p_{θ_{0}} (x_{i} | x_{i - 1}) p_{θ_{0}}^{i - 1} (x_{i - 1} | x_{0}) q_{θ_{0}}^{(0)} (x_{0}) d x_{0} d x_{i} d x_{i - 1} \end{matrix}

Recall that the marginal density satisfies

q_{θ_{0}}^{(i - 1)} (x_{i - 1}) = \int p_{θ_{0}}^{i - 1} (x_{i - 1} | x_{0}) q_{θ_{0}}^{(0)} (x_{0}) d (x_{0})

, where

p_{θ_{0}}^{i} (\cdot | x_{0})

is the i-step transition probability. Then

\begin{matrix} E [M_{k}^{(1)} (X_{i}, X_{i - 1})] & = \int E [M_{k}^{(1)} (X_{i}, x_{i - 1}) | x_{i - 1}] p_{θ_{0}}^{i - 1} (x_{i - 1} | x_{0}) q_{θ_{0}}^{(0)} (x_{0}) d x_{0} d x_{i - 1} . \end{matrix}

Since the Markov chain

{X_{n}}

satisfies Assumption A.1.1, we know by the application of Theorem A.1.1 that

{X_{n}}

is V-geometrically ergodic. Hence,

\exists τ < 1

,

R < \infty

such that

\forall | f | < V

\begin{matrix} | \int f (x_{i - 1}) p_{θ_{0}}^{i - 1} (x_{i - 1} | x_{0}) d x_{i - 1} & - \int f (x_{i - 1}) q_{θ_{0}} (x_{i - 1}) d x_{i - 1} | < R V (x_{0}) τ^{i - 1}, \end{matrix}

where

q_{θ_{0}}

is the stationary distribution, implying that

\begin{matrix} \int f (x_{i - 1}) p_{θ_{0}}^{i - 1} (x_{i - 1} | x_{0}) d x_{i - 1} & < \int f (x_{i - 1}) q_{θ_{0}} (x_{i - 1}) d x_{i - 1} + R V (x_{0}) τ^{i - 1} . \end{matrix}

By the application of Jensen’s inequality we get

{(E [M_{k}^{(1)} (X_{i}, X_{i - 1}) | X_{i - 1}])}^{2 + δ} \leq E [M_{k}^{(1)} {(X_{i}, X_{i - 1})}^{2 + δ} | X_{i - 1}] < V (X_{i - 1})

. Since

V (\cdot) \geq 1

, it follows from the previous expression that

E [M_{k}^{(1)} (X_{i}, X_{i - 1}) | X_{i - 1}] < V {(X_{i - 1})}^{1 / (2 + δ)} \leq V (X_{i - 1})

. Thus, setting

f (x) = E [M_{k}^{(1)} (X_{i}, X_{i - 1}) | X_{i - 1} = x]

, we obtain

\begin{matrix} E [M_{k}^{(1)} (X_{i}, X_{i - 1})] & < \int [E [M_{k}^{(1)} (X_{i}, X_{i - 1}) | X_{i - 1}] q_{θ_{0}} (x_{i}) d x_{i - 1} + R V (x_{0}) τ^{i - 1}] q^{(0)} (x_{0}) d x_{0} \\ = E [M_{k}^{(1)} (X_{1}, X_{0})] + τ^{i - 1} \int R V (x_{0}) q^{(0)} (x_{0}) d x_{0} . \end{matrix}

Summing from

i = 1

to n, we get

\begin{matrix} \sum_{i = 1}^{n} E [M_{k}^{(1)} (x_{i}, x_{i - 1})] & < n E [M_{k}^{(1)} (X_{1}, X_{0})] + \sum_{i = 1}^{n} τ^{i - 1} \int R V (x_{0}) q^{(0)} (x_{0}) d x_{0} \end{matrix}

\begin{matrix} = n E [M_{k}^{(1)} (X_{1}, X_{0})] + \frac{1 - τ^{n}}{1 - τ} \int R V (x_{0}) q^{(0)} (x_{0}) d x_{0} . \end{matrix}

This gives us the following bound on

\int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ)

:

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq \sum_{k = 1}^{m} [n E [M_{k}^{(1)} (X_{1}, X_{0})] + \frac{1 - τ^{n}}{1 - τ} \int R V (x_{0}) D (x_{0}) d x_{0}] \\ \times \int | f_{k}^{(1)} (θ, θ_{0}) | ρ_{n} (d θ) . \end{matrix}

By Assumption 1,

\int | f_{k}^{(1)} (θ, θ_{0}) | ρ_{n} (d θ) < \frac{C}{\sqrt{n}}

. Hence, we can rewrite the previous expression as

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq \sum_{k = 1}^{m} [n E [M_{k}^{(1)} (X_{1}, X_{0})] + \frac{1 - τ^{n}}{1 - τ} \int R V (x_{1}) D (x_{1}) d x_{1}] \frac{C}{\sqrt{n}} \\ = n m [E [M_{k}^{(1)} (X_{1}, X_{0})] + \frac{1 - τ^{n}}{n (1 - τ)} \int R V (x_{0}) D (x_{0}) d x_{0}] \frac{C}{\sqrt{n}} . \end{matrix}

Since,

τ < 1

,

0 < 1 - τ^{n} < 1

, and we rewrite the previous equation as,

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq n m [E [M_{k}^{(1)} (X_{1}, X_{0})] + \frac{1}{n (1 - τ)} \int R V (x_{0}) D (x_{0}) d x_{0}] \frac{C}{\sqrt{n}} . \end{matrix}

Hence, there exists an

ϵ_{n}^{(1)} = O (\frac{1}{\sqrt{n}})

such that

\int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) \leq n ϵ_{n}^{(1)}

.

Part 2: Verifying condition (ii) of Corollary 1: Similar to as in the proof of Theorem 3, we upper bound

\int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ)

by

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) & \leq \sum_{i, j = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} (\int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ) + \int C_{θ_{0}, θ}^{(j)} ρ_{n} (d θ) \end{matrix}

(A41)

\begin{matrix} + \int \sqrt{C_{θ_{0}, θ}^{(i)} C_{θ_{0}, θ}^{(j)}} ρ_{n} (d θ))) (α_{| i - j | - 1}^{δ / (2 + δ)}) \\ + \sum_{i = 1}^{n} (\frac{4}{n} + 2 n^{δ / 2} \int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ)) (α_{i - 1}^{δ / (2 + δ)}), \end{matrix}

(A42)

where

C_{θ_{0}, θ}

is upper bounded as

\begin{matrix} C_{θ_{0}, θ}^{(i)} & \leq \sum_{k = 1}^{m} m^{1 + δ} E {[M_{k}^{(1)} (X_{i}, X_{i - 1})]}^{2 + δ} {| f_{k}^{(1)} (θ, θ_{0}) |}^{2 + δ} . \end{matrix}

Since

E [M_{k}^{(1)} {(X_{i}, X_{i - 1})}^{2 + δ} | X_{i - 1}] < V (X_{i - 1})

, by a similar application of V-geometric ergodicity, we can say that,

\exists 0 < τ < 1

, such that

\begin{matrix} E {[M_{k}^{(1)} (X_{i}, X_{i - 1})]}^{2 + δ} & \leq n E {[M_{k}^{(1)} (X_{1}, X_{0})]}^{2 + δ} + τ^{i - 1} \int R V (x_{0}) D (x_{0}) d x_{0}, \end{matrix}

which, by the fact that

τ^{i - 1} < τ

, gives us,

\begin{matrix} E {[M_{k}^{(1)} (X_{i}, X_{i - 1})]}^{2 + δ} & \leq E {[M_{k}^{(1)} (X_{1}, X_{0})]}^{2 + δ} + τ \int R V (x_{0}) D (x_{0}) d x_{0} . \end{matrix}

By Assumption 1, we know that,

\int | f_{k}^{(1)} (θ, θ_{0}) |^{2 + δ} ρ_{n} (d θ) < \frac{C}{n}

. Hence, for a large constant L,

\int C_{θ_{0}, θ}^{(i)} ρ_{n} (d θ) \leq \frac{L}{n}

. We also see that since the chain is geometrically ergodic, by the application of Equation (A4),

\sum_{k \geq 1} α_{k}^{δ / (2 + δ)} < + \infty

. The rest of the proof follows similarly as in the proof of Theorem 3, and we obtain an

ϵ_{n}^{(2)} = O (\frac{n^{δ / 2}}{n})

, such that,

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) < n ϵ_{n}^{(2)} . \end{matrix}

Since,

K (ρ_{n}, π) \leq \sqrt{n} C

, similar arguments as in the proof of Theorem 2 holds. The theorem is thus proved. □

Appendix B.3.6. Proof of Proposition 10

For the purpose of the proof, we choose

ρ_{n}

’s with scaled Beta distribution with parameters

a_{n} = n \frac{1 + θ_{0}}{2}

and

b_{n} = n \frac{1 - θ_{0}}{2}

. Since,

ρ_{n}

is a scaled Beta distribution with the scaling factors

m = 2

and

c = - 1

, the pdf of

ρ_{n}

is given by

\begin{matrix} ρ_{n} (θ) & = \frac{1}{2 Beta (a_{n}, b_{n})} {(\frac{1 + θ}{2})}^{a_{n}} {(\frac{1 - θ}{2})}^{b_{n}} \end{matrix}

Since this is a scaled distribution,

E_{ρ_{n}} [θ] = 2 \frac{a_{n}}{a_{n} + b_{n}} - 1 = θ_{0}

and there exists a constant

σ > 0

,

{Var}_{ρ_{n}} [θ] = \frac{σ^{2}}{n}

. We now analyse the log-ratio of the transition probabilities for the Markov chain,

\begin{matrix} log p_{θ_{0}} (X_{n} | X_{n - 1}) - log p_{θ} (X_{n} | X_{n - 1}) = 2 X_{n} X_{n - 1} (θ - θ_{0}) + X_{n - 1}^{2} (θ_{0}^{2} - θ^{2}) . \end{matrix}

Observe that in this setting,

M_{1}^{(1)} (X_{n}, X_{n - 1}) = | X_{n} X_{n - 1} |

and

M_{2}^{(1)} (X_{n}, X_{n - 1}) = X_{n}^{2}

. Next, using the fact that

\begin{matrix} E [| X_{n} |^{2 + δ} | X_{n - 1}] & = E [| X_{n} - θ_{0} X_{n - 1} + θ_{0} X_{n - 1} |^{2 + δ} | X_{n - 1}], \end{matrix}

and by an application of triangle inequality, we obtain

\begin{matrix} E [| X_{n} |^{2 + δ} | X_{n - 1}] & \leq E [{(| X_{n} - θ_{0} X_{n - 1} | + | θ_{0} X_{n - 1} |)}^{2 + δ} | X_{n - 1}] \\ = E [{(2 \frac{| X_{n} - θ_{0} X_{n - 1} | + | θ_{0} X_{n - 1} |}{2})}^{2 + δ} | X_{n - 1}] \\ = E [2^{2 + δ} {(\frac{| X_{n} - θ_{0} X_{n - 1} | + | θ_{0} X_{n - 1} |}{2})}^{2 + δ} | X_{n - 1}] . \end{matrix}

Now by using Jensen’s inequality we get,

\begin{matrix} E [| X_{n} |^{2 + δ} | X_{n - 1}] & \leq E [2^{2 + δ} (\frac{| X_{n} - θ_{0} X_{n - 1} |^{2 + δ} + {| θ_{0} X_{n - 1} |}^{2 + δ}}{2}) | X_{n - 1}] \\ = 2^{1 + δ} E [| X_{n} - θ_{0} X_{n - 1} |^{2 + δ} | X_{n - 1}] + 2^{1 + δ} | θ_{0} X_{n - 1} | . \end{matrix}

We know if

Y \sim N (μ, σ^{2})

, then

{E | Y - μ |}^{p} = σ^{p} \frac{2^{\frac{p}{2} Γ (\frac{p + 1}{2})}}{\sqrt{π}}

. Consequently,

\begin{matrix} E [| X_{n} |^{2 + δ} | X_{n - 1}] & \leq 2^{1 + δ} [\frac{2^{\frac{2 + δ}{2} Γ (\frac{3 + δ}{2})}}{\sqrt{π}}] + 2^{1 + δ} {| θ_{0} X_{n - 1} |}^{2 + δ} . \end{matrix}

(A43)

It follows that,

\begin{matrix} E [M_{1}^{(1)} {(X_{n}, X_{n - 1})}^{2 + δ} | X_{n - 1}] & \leq 2^{1 + δ} [\frac{2^{\frac{2 + δ}{2} Γ (\frac{3 + δ}{2})}}{\sqrt{π}}] | X_{n - 1} |^{2 + δ} + 2^{1 + δ} | θ_{0} |^{2 + δ} {| X_{n - 1} |}^{4 + 2 δ} \\ \leq (2^{1 + δ} [\frac{2^{\frac{2 + δ}{2} Γ (\frac{3 + δ}{2})}}{\sqrt{π}}] + 2^{1 + δ} {| θ_{0} |}^{2 + δ}) (| X_{n - 1} |^{4 + 2 δ} + 1) . \end{matrix}

Since

θ_{0} < 1

, we can say,

\begin{matrix} E [M_{1}^{(1)} {(X_{n}, X_{n - 1})}^{2 + δ} | X_{n - 1}] & \leq (2^{1 + δ} [\frac{2^{\frac{2 + δ}{2} Γ (\frac{3 + δ}{2})}}{\sqrt{π}}] + 2^{1 + δ}) (| X_{n - 1} |^{4 + 2 δ} + 1) . \end{matrix}

Define a constant

C_{δ} : = (2^{1 + δ} [\frac{2^{\frac{2 + δ}{2} Γ (\frac{3 + δ}{2})}}{\sqrt{π}}] + 2^{1 + δ})

. The above term then becomes,

\begin{matrix} E [M_{1}^{(1)} {(X_{n}, X_{n - 1})}^{2 + δ} | X_{n - 1}] & \leq C_{δ} (| X_{n - 1} |^{4 + 2 δ} + 1) . \end{matrix}

Next we analyse the term

M_{2}^{(1)} (X_{n}, X_{n - 1})

.

\begin{matrix} E [M_{2}^{(1)} {(X_{n}, X_{n - 1})}^{2 + δ} | X_{n - 1}] & = E [X_{n - 1}^{4 + 2 δ} | X_{n - 1}] \\ = X_{n - 1}^{4 + 2 δ} \\ \leq C_{δ} (X_{n - 1}^{4 + 2 δ} + 1) . \end{matrix}

Then, defining

V (x) : = C_{δ} (x^{4 + 2 δ} + 1)

it follows that,

\begin{matrix} E [V (X_{n}) | X_{n - 1}] & = E [C_{δ} (X_{n}^{4 + 2 δ} + 1) | X_{n - 1}] . \end{matrix}

Using a technique similar to Equation (A43) we get,

\begin{matrix} E [C_{δ} (X_{n}^{4 + 2 δ} + 1) | X_{n - 1}] & \leq [C_{δ} (2^{3 + 2 δ} [\frac{2^{\frac{4 + 2 δ}{2} Γ (\frac{5 + 2 δ}{2})}}{\sqrt{π}}] + 2^{3 + 2 δ} | θ_{0} X_{n - 1} |^{4 + 2 δ} + 1)] . \end{matrix}

Define another constant

C_{δ}^{'} : = C_{δ} (2^{3 + 2 δ} [\frac{2^{\frac{4 + 2 δ}{2} Γ (\frac{5 + 2 δ}{2})}}{\sqrt{π}}] - 2^{3 + 2 δ} {| θ_{0} |}^{4 + 2 δ} + 1)

. Since

δ > 0

,

\frac{2^{\frac{4 + 2 δ}{2} Γ (\frac{5 + 2 δ}{2})}}{\sqrt{π}} > 1

. Furthermore, since

| θ_{0} | < 1

, so is

| θ_{0} |^{4 + 2 δ}

. Hence,

2^{3 + 2 δ} [\frac{2^{\frac{4 + 2 δ}{2} Γ (\frac{5 + 2 δ}{2})}}{\sqrt{π}}] - 2^{3 + 2 δ} {| θ_{0} |}^{4 + 2 δ} > 0 .

Hence, we have shown that,

\begin{matrix} E [V (X_{n}) | X_{n - 1}] & \leq (2^{3 + 2 δ} | θ_{0} |^{4 + 2 δ}) C_{δ} (X_{n - 1}^{4 + 2 δ} + 1) + C_{δ}^{'} . \end{matrix}

Since

| θ_{0} | < 2^{\frac{1}{4 + 2 δ} - 1}

,

2^{3 + 2 δ} {| θ_{0} |}^{4 + 2 δ} < 1

, and we can express the above equation as,

\begin{matrix} E [V (X_{n}) | X_{n - 1}] & \leq V (X_{n - 1}) + C_{δ}^{'} . \end{matrix}

Define the set

C (m) : = {x : | x |^{4 + 2 δ} + 1 \leq m}

. From Proposition 11.4.2, [20], for a large enough m,

C (m)

forms a petite set. Thus, we have proved that

V (x)

as defined in this example satisfies Assumption A.1.1, and

{X_{n}}

is V-geometrically ergodic. The

f_{j}^{(1)}

’s corresponding to Assumption 1 are given by

f_{1}^{(1)} (θ, θ_{0}) = (θ - θ_{0})

and

f_{2}^{(1)} (θ, θ_{0}) = (θ_{0}^{2} - θ^{2})

. Therefore, it follows that,

\begin{matrix} \partial_{θ} f_{1}^{(1)} & = 1, \\ \partial_{θ} f_{2}^{(1)} & = - 2 θ and \\ - 2 & < - 2 θ < 2 . \end{matrix}

Since

f_{1}^{(1)} (θ_{0}, θ_{0}) = f_{2}^{(1)} (θ_{0}, θ_{0}) = 0

, We just showed that they also have bounded partial derivatives. We also know that

| θ | < 1

. Hence, by Proposition 4

f_{j}^{(1)}

’s satisfy the conditions of Assumption 1.

The invariant distribution for the simple linear model Markov-chain under parameter

θ

is given by a gaussian distribution with mean 0 and variance

\frac{1}{1 - θ^{2}}

. In other words,

\begin{matrix} q_{θ} (x) = \frac{1}{\sqrt{2 π}} e^{- \frac{1 - θ^{2}}{2} x^{2}} . \end{matrix}

Analyzing the log likelihood yields,

\begin{matrix} log q_{0} (x) - log q_{θ} (x) & = - \frac{x^{2}}{2} (1 - θ_{0}^{2}) + \frac{x^{2}}{2} (1 - θ^{2}) \\ = \frac{x^{2}}{2} (θ_{0}^{2} - θ^{2}) . \end{matrix}

Let

f_{1}^{(2)} (θ_{0}, θ_{0}) = (θ_{0}^{2} - θ^{2})

and

f_{1}^{(2)} (θ_{0}, θ_{0}) = 0

. Since

f_{1}^{(2)} (θ_{0}, θ_{0}) = f_{2}^{(1)} (θ_{0}, θ_{0})

, by following arguments similar as before, can conclude that

f_{1}^{(2)} (θ_{0}, θ_{0})

also satisfies the requirements of Assumption 1. Let

M_{1}^{(2)} (x) = \frac{x^{2}}{2}

and define

M_{2}^{(2)} (x) : = 1

. Let

X_{0} \sim q_{1}^{(0)}

. As long as

\int x^{4 + 2 δ} q_{1}^{(0)} (x) d x < \infty

, we satisfy all the conditions required for Theorem 5. Finally we need to verify the condition that

K (ρ_{n}, π) < C \sqrt{n}

for some constant

C > 0

. The KL-divergence

\int log (\frac{ρ_{n} (θ)}{π (θ)}) ρ_{n} (d θ)

becomes,

\begin{matrix} K (ρ_{n}, π) & = \int_{- 1}^{1} log (\frac{1}{2 Beta (a_{n}, b_{n})} {(\frac{1 + θ}{2})}^{a_{n}} {(\frac{1 - θ}{2})}^{b_{n}}) \\ \times \frac{1}{2 Beta (a_{n}, b_{n})} {(\frac{1 + θ}{2})}^{a_{n}} {(\frac{1 - θ}{2})}^{b_{n}} d θ . \end{matrix}

Substituting,

y = \frac{1 + θ}{2}

, we get,

\begin{matrix} K (ρ_{n}, π) & = \int_{0}^{1} log (\frac{1}{2 Beta (a_{n}, b_{n})} {(y)}^{a_{n}} {(1 - y)}^{b_{n}}) \frac{1}{2 Beta (a_{n}, b_{n})} {(y)}^{a_{n}} {(1 - y)}^{b_{n}} d y \\ = \int_{0}^{1} log (\frac{1}{2}) \frac{1}{Beta (a_{n}, b_{n})} {(y)}^{a_{n}} {(1 - y)}^{b_{n}} d y \\ + \int_{0}^{1} log (\frac{1}{Beta (a_{n}, b_{n})} {(y)}^{a_{n}} {(1 - y)}^{b_{n}}) \frac{1}{Beta (a_{n}, b_{n})} {(y)}^{a_{n}} {(1 - y)}^{b_{n}} . \end{matrix}

The first term integrates up to

log (1 / 2)

. The second term is the KL-divergence between a Uniform and Beta distribution with parameters

a_{n} = n \frac{1 + θ_{0}}{2}

and

b_{n} = n (1 - \frac{1 + θ_{0}}{2})

and support

[0, 1]

. Following Lemma A.2.1 it follows that

K (ρ_{n}, π)

is upper bounded by,

\begin{matrix} K (ρ_{n}, π) & < log (1 / 2) + C_{1} + \frac{1}{2} log (n) < C \sqrt{n}, \end{matrix}

for some large constant C. This completes the proof. □

Appendix B.4. Proofs for Misspecified Models

Proof of Theorem 6

As in the proof of Theorem 1, following Equation (A13), we note that,

\begin{matrix} \int D_{α^{r e}} (P_{θ}^{(n)}, P_{θ_{0}}^{(n)}) {\tilde{π}}_{n, α^{r e} | X^{n}} (d θ) & \leq \frac{α^{r e}}{1 - α^{r e}} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) \\ + \frac{α^{r e}}{1 - α^{r e}} \sqrt{\frac{Var [\int r_{n} (θ, θ_{0}) ρ_{n} (d θ)]}{η}} + \frac{K (ρ_{n}, π) - log (ϵ)}{1 - α^{r e}} . \end{matrix}

(A44)

Following from Equations (23) and (26), we get that,

\begin{matrix} \int K (P_{θ_{0}}^{(n)}, P_{θ}^{(n)}) ρ_{n} (d θ) & \leq E [r_{n} (θ_{0}, θ_{n}^{*})] + n ϵ_{n}, \end{matrix}

and

\begin{matrix} \int Var [r_{n} (θ, θ_{0})] ρ_{n} (d θ) & \leq 2 n ϵ_{n} + 2 Var [r_{n} (θ_{n}^{*}, θ_{0})] . \end{matrix}

Plugging these into Equation (A44), we are done. □

References

Wainwright, M.J.; Jordan, M.I. Introduction to Variational Methods for Graphical Models. Found. Trends Mach. Learn. 2008, 1, 1–103. [Google Scholar] [CrossRef] [Green Version]
Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin, Germany, 2006. [Google Scholar]
Ormerod, J.T.; Wand, M.P. Explaining variational approximations. Am. Stat. 2010, 64, 140–153. [Google Scholar] [CrossRef] [Green Version]
Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef] [Green Version]
Jaiswal, P.; Rao, V.; Honnappa, H. Asymptotic Consistency of α-Rényi-Approximate Posteriors. J. Mach. Learn. Res. 2020, 21, 1–42. [Google Scholar]
Li, Y.; Turner, R.E. Rényi divergence variational inference. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29, pp. 1073–1081. [Google Scholar]
Dieng, A.B.; Tran, D.; Ranganath, R.; Paisley, J.; Blei, D. Variational Inference via χ Upper Bound Minimization. In Proceedings of the 31th Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Wang, Y.; Blei, D.M. Frequentist consistency of variational Bayes. J. Am. Stat. Assoc. 2019, 114, 1147–1161. [Google Scholar] [CrossRef] [Green Version]
Zhang, F.; Gao, C. Convergence rates of variational posterior distributions. Ann. Stat. 2020, 48, 2180–2207. [Google Scholar] [CrossRef]
Ghosal, S.; Ghosh, J.K.; Van Der Vaart, A.W. Convergence rates of posterior distributions. Ann. Stat. 2000, 28, 500–531. [Google Scholar] [CrossRef]
Shen, X.; Wasserman, L. Rates of convergence of posterior distributions. Ann. Stat. 2001, 29, 687–714. [Google Scholar]
Rousseau, J. On the frequentist properties of Bayesian nonparametric methods. Annu. Rev. Stat. Its Appl. 2016, 3, 211–231. [Google Scholar] [CrossRef] [Green Version]
Ghosal, S.; Van Der Vaart, A.W. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of Normal densities. Ann. Stat. 2001, 1233–1263. [Google Scholar]
Bhattacharya, A.; Pati, D.; Yang, Y. Bayesian fractional posteriors. Ann. Stat. 2019, 47, 39–66. [Google Scholar] [CrossRef] [Green Version]
Alquier, P.; Ridgway, J. Concentration of tempered posteriors and of their variational approximations. Ann. Stat. 2020, 48, 1475–1497. [Google Scholar] [CrossRef]
Yang, Y.; Pati, D.; Bhattacharya, A. α-variational inference with statistical guarantees. Ann. Stat. 2020, 48, 886–905. [Google Scholar] [CrossRef]
Jaiswal, P.; Honnappa, H.; Rao, V.A. Risk-sensitive variational Bayes: Formulations and bounds. arXiv 2019, arXiv:1903.05220. [Google Scholar]
Bradley, R.C. Basic Properties of Strong Mixing Conditions. A Survey and Some Open Questions. Probab. Surv. 2005, 2, 107–144. [Google Scholar] [CrossRef] [Green Version]
Ibragimov, I.A. Some limit theorems for stationary processes. Theory Probab Appl. 1962, 7, 349–382. [Google Scholar] [CrossRef]
Meyn, S.P.; Tweedie, R.L. Markov Chains and Stochastic Stability; Springer: Berlin, Germany, 2012. [Google Scholar]
Hajek, B. Hitting-time and occupation-time bounds implied by drift analysis with applications. Adv. Appl. Probab. 1982, 502–525. [Google Scholar] [CrossRef]
Birgé, L. Robust testing for independent non identically distributed variables and Markov chains. In Specifying Statistical Models; Springer: Berlin, Germany, 1983; pp. 134–162. [Google Scholar]
Ryabko, D. Testing statistical hypotheses about ergodic processes. In Proceedings of the IEEE Region 8 International Conference on Computational Technologies in Electrical and Electronics Engineering, Novosibirsk, Russia, 21–25 July 2008. [Google Scholar]
Lacoste-Julien, S.; Huszár, F.; Ghahramani, Z. Approximate inference for the loss-calibrated Bayesian. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011. [Google Scholar]
Jaiswal, P.; Honnappa, H.; Rao, V.A. Asymptotic consistency of loss-calibrated variational Bayes. Stat 2020, 9, e258. [Google Scholar] [CrossRef] [Green Version]
Jones, G.L. On the Markov chain central limit theorem. Probab. Survey. 2004, 1, 299–320. [Google Scholar] [CrossRef]
Alzer, H. On some inequalities for the gamma and psi functions. Math. Comput. 1997, 66, 373–389. [Google Scholar] [CrossRef] [Green Version]
Donsker, M.D.; Varadhan, S.S. Asymptotic evaluation of certain Markov process expectations for large time, I. Commun. Pure Appl. Math. 1975, 28, 1–47. [Google Scholar] [CrossRef]

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Banerjee, I.; Rao, V.A.; Honnappa, H. PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models. Entropy 2021, 23, 313. https://doi.org/10.3390/e23030313

AMA Style

Banerjee I, Rao VA, Honnappa H. PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models. Entropy. 2021; 23(3):313. https://doi.org/10.3390/e23030313

Chicago/Turabian Style

Banerjee, Imon, Vinayak A. Rao, and Harsha Honnappa. 2021. "PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models" Entropy 23, no. 3: 313. https://doi.org/10.3390/e23030313

APA Style

Banerjee, I., Rao, V. A., & Honnappa, H. (2021). PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models. Entropy, 23(3), 313. https://doi.org/10.3390/e23030313

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models

Abstract

1. Introduction

1.1. Notations and Definitions

2. A Concentration Bound for the α re -Rényi Divergence

3. Stationary Markov Data-Generating Models

4. Non-Stationary, Ergodic Markov Data-Generating Models

5. Misspecified Models

6. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

Appendix A. Technical Desiderata

Appendix A.1. Definitions Related to Markov Chains

Appendix A.2. Bounding the KL-Divergence between Beta Distributions

Appendix B. Proofs of Main Results

Appendix B.1. Proofs for A Concentration Bound for the αre-Rényi Divergence

Appendix B.1.1. Proof of Proposition 1

Appendix B.1.2. Proof of Theorem 1

Appendix B.1.3. Proof of Proposition 2

Appendix B.1.4. Proof of Proposition 3

Appendix B.2. Proofs for Stationary Markov Data-Generating Models

Proof of Theorem 2

Appendix B.3. Proofs for Non-Stationary, Ergodic Markov Data-Generating Models

Appendix B.3.1. Proof of Theorem 3

Appendix B.3.2. Proof of Proposition 8

Appendix B.3.3. Proof of Proposition 9

Appendix B.3.4. Proof of Theorem 4

Appendix B.3.5. Proof of Theorem 5

Appendix B.3.6. Proof of Proposition 10

Appendix B.4. Proofs for Misspecified Models

Proof of Theorem 6

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

2. A Concentration Bound for the $α^{re}$ -Rényi Divergence

Appendix B.1. Proofs for A Concentration Bound for the α^re-Rényi Divergence