**Approximate Bayesian Inference**

Editor

**Pierre Alquier**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* Pierre Alquier RIKEN Center for Advanced Intelligence Project (AIP) Japan

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Entropy* (ISSN 1099-4300) (available at: https://www.mdpi.com/journal/entropy/special_issues/approx_Bayes_inference).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-3789-4 (Hbk) ISBN 978-3-0365-3790-0 (PDF)**

Cover image courtesy of Pierre Alquier

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**



## **About the Editor**

**Pierre Alquier** obtained his PhD in 2006 from Université Pierre et Marie Curie in Paris (today part of Sorbonne Université). He worked as a research and teaching assistant at Université Paris Dauphine (2006–2007), as a Maître de Conférences (lecturer) at Université Paris Diderot (2007–2012), as a lecturer at UCD Dublin (2012–2014), and then as a professor of statistics at ENSAE Paris (2014–2019). There, he taught Bayesian statistics, stochastic processes, introductory machine learning, estimator aggregation, and online learning. He joined the RIKEN Center for Advanced Intelligence Project (AIP) in Tokyo in 2019 as a research scientist on the Approximate Bayesian Inference team. His research interests are high-dimensional statistics; Bayesian inference and machine learning; variational inference; and PAC-Bayes bounds. He has served as an Area Chair for machine learning conferences such as NeurIPS and AISTATS. He is currently a member of the Topical Advisory Panel of *Entropy* and an action editor for the *Journal of Machine Learning Research* as well as for *Transactions on Machine Learning Research*.

## *Editorial* **Approximate Bayesian Inference**

## **Pierre Alquier**

Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo 103-0027, Japan; pierrealain.alquier@riken.jp

Received: 28 October 2020; Accepted: 6 November 2020; Published: 10 November 2020

**Abstract:** This is the Editorial article summarizing the scope of the Special Issue: Approximate Bayesian Inference.

**Keywords:** Bayesian statistics; machine learning; variational approximations; PAC-Bayes; expectation-propagation; Markov chain Monte Carlo; Langevin Monte Carlo; sequential Monte Carlo; Laplace approximations; approximate Bayesian computation; Gibbs posterior

## **1. Introduction**

Extremely popular for statistical inference, Bayesian methods are gaining importance in machine learning and artificial intelligence problems. Indeed, in many applications, it is important for any device not only to predict well, but also to provide a quantification of the uncertainty of the prediction.

The main problem when applying Bayesian statistics is that the computation of the estimators is expensive and sometimes not feasible. Bayesian estimators are based on the posterior distribution of the parameter *θ*, given by:

$$
\pi(\theta|\mathbf{x}) = \frac{\mathcal{L}(\theta; \mathbf{x})\pi(\theta)}{\int \mathcal{L}(\theta; \mathbf{x})\pi(\mathbf{d}\theta)} \tag{1}
$$

where *π* is the prior, *x* the observations, and L(*θ*; *x*) the likelihood function. For example, the computation of the posterior mean ∫ *θ π*(d*θ*|*x*) requires a difficult evaluation of the integrals. Thanks to the development of computational power, Bayesian estimation became feasible in the 1980s and 1990s through Markov chain Monte Carlo (MCMC) methods, such as the Metropolis–Hastings algorithm [1] and the Gibbs sampler [2,3]. These algorithms target the exact posterior distribution. They proved useful in many contexts and are still an active area of research. The performance and applicability of MCMC were improved by variants such as Hamiltonian MCMC [4,5], adaptive MCMC [6–8], etc. We refer the reader to the review [9], the books [10–12], and Part III of [13] for detailed introductions to MCMC. The surveys [14,15] provide an overview of more recent advances. The asymptotic theory of Markov chains, ensuring the consistency of these algorithms, is covered in the monographs [16,17]. A few non-asymptotic results are also available [18].
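As a concrete illustration of how MCMC targets (1) without ever computing the normalizing integral, here is a minimal random-walk Metropolis–Hastings sampler. The toy model (a standard normal prior with a N(*θ*, 1) likelihood) and all function names are our own illustrative assumptions, not taken from the cited references.

```python
import numpy as np

def log_posterior(theta, x):
    """Unnormalized log-posterior log L(theta; x) + log pi(theta) for a
    N(theta, 1) likelihood and a standard normal prior on theta."""
    return -0.5 * np.sum((x - theta) ** 2) - 0.5 * theta ** 2

def metropolis_hastings(x, n_iter=20_000, step=0.3, seed=0):
    """Random-walk Metropolis-Hastings chain targeting the exact posterior (1)."""
    rng = np.random.default_rng(seed)
    theta, chain = 0.0, np.empty(n_iter)
    for t in range(n_iter):
        proposal = theta + step * rng.standard_normal()
        # Accept with probability min(1, pi(proposal|x) / pi(theta|x));
        # the intractable normalizing constant cancels in this ratio.
        if np.log(rng.uniform()) < log_posterior(proposal, x) - log_posterior(theta, x):
            theta = proposal
        chain[t] = theta
    return chain

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=50)
chain = metropolis_hastings(x)
# For this conjugate toy model the posterior is N(sum(x)/(n+1), 1/(n+1)),
# so the chain mean should approach sum(x)/(n+1).
posterior_mean_estimate = chain[5000:].mean()
```

The accept/reject step only ever uses log-posterior differences, which is exactly why MCMC sidesteps the integral in the denominator of (1).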

Sequential Monte Carlo emerged in the 1990s as a way to sequentially update (that is, as each new data point arrives) samples from the posterior in hidden state models. They thus allow the computation of a Bayesian version of filters (such as the Kalman filter [19]). For this reason, they are also referred to as "particle filters". We refer the reader to [20] for the state of the art of the early years and to the recent books [21,22] for pedagogical introductions and an overview of the most recent progress.
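A minimal bootstrap particle filter can be sketched as follows; the linear Gaussian state-space model and all names used here are illustrative assumptions of ours, not taken from [20–22].

```python
import numpy as np

def bootstrap_filter(y, n_particles=2000, rho=0.9, sx=1.0, sy=1.0, seed=0):
    """Bootstrap particle filter ("particle filter") for the hidden state model
    X_t = rho * X_{t-1} + N(0, sx^2),  Y_t = X_t + N(0, sy^2).
    Returns the filtering means E[X_t | y_1, ..., y_t]."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, sx, n_particles)
    means = np.empty(len(y))
    for t, yt in enumerate(y):
        # Propagate each particle through the state transition
        particles = rho * particles + sx * rng.standard_normal(n_particles)
        # Reweight by the likelihood of the new observation (sequential update)
        logw = -0.5 * (yt - particles) ** 2 / sy**2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means[t] = np.sum(w * particles)
        # Multinomial resampling to fight weight degeneracy
        particles = rng.choice(particles, size=n_particles, p=w)
    return means

# Simulate a short trajectory and filter it
rng = np.random.default_rng(3)
x_true = np.zeros(100)
for t in range(1, 100):
    x_true[t] = 0.9 * x_true[t - 1] + rng.standard_normal()
y = x_true + rng.standard_normal(100)
filter_means = bootstrap_filter(y)
```

Since this toy model is linear Gaussian, the particle estimates could be checked against the exact Kalman filter; for genuinely nonlinear or non-Gaussian hidden state models, the particle filter is the standard option.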

However, many modern models in statistics are simply too complex to use such methodologies. In machine learning, the volume of the data used in practice makes MCMC too slow to be used: first, each iteration of the algorithm requires accessing all the data, then the number of iterations required to reach convergence explodes when the dimension is large. In these cases, it seems that targeting the exact posterior is no longer a realistic objective. This motivated the development of many new methodologies, where the target is no longer the exact posterior, but simply a part of the information contained in it, or an approximation.

Before giving a short overview of these approximation techniques, let us mention two important examples where approximations were an essential ingredient in the application of Bayesian methods. In 2006, Netflix released a dataset containing movie ratings by its users and challenged the machine learning community to improve on its own predictions for movies that were not rated [23]. Many algorithms were proposed, including methods based on matrix factorization. Bayesian matrix factorization is computationally intensive. The first success at scaling Bayesian methods to the Netflix dataset was based on a mean-field variational approximation of the posterior by [24]. Such approximations will be discussed below.

In computer vision problems, the best performance is reached by deep neural networks [25]. Bayesian neural networks have become a popular research direction. A new field of Bayesian deep learning has emerged that relies on approximate Bayesian inference to provide uncertainty estimates for neural networks without increasing the computation cost too much [26–29]. In particular, References [28,29] scaled these algorithms to the size of benchmark datasets such as CIFAR-10 and ImageNet.

### **2. Approximation in the Modelization**

In many practical situations, the statistician is not interested in building a complete model describing the data, but simply in learning some aspects of it. One can think for example of a classification problem where one does not want to learn the full distribution of the data, but only a good classifier. A natural idea is to replace *π*(*θ*|*x*) in (1) by:

$$\tilde{\pi}(\theta|\mathbf{x}) = \frac{\exp\left[-\ell(\mathbf{x};\theta)\right]\pi(\theta)}{\int \exp\left[-\ell(\mathbf{x};\theta)\right]\pi(\mathbf{d}\theta)}\tag{2}$$

where ℓ(*x*; *θ*) is a loss function—for example, the classification error. When ℓ(*x*; *θ*) = − logL(*θ*; *x*), we recover (1) as a special case. When ℓ(*x*; *θ*) = −*α* logL(*θ*; *x*) for some *α* ≠ 1, we obtain tempered posteriors, which appeared for various computational and theoretical reasons in the statistical literature; see [30–34]. The use of the general form (2) was advocated to the statistical community by [35].
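On a discretized parameter space, the Gibbs posterior (2) can be computed exactly by normalizing the weights. Below is a toy sketch for a one-dimensional threshold classifier under the cumulative 0–1 loss; the model, the true threshold 0.3, and all names are our own illustration.

```python
import numpy as np

def gibbs_posterior(thetas, loss, log_prior, alpha=1.0):
    """Discretized Gibbs posterior (Equation (2)): weights proportional to
    exp(-alpha * loss(theta)) * prior(theta) on a grid of candidates."""
    logw = np.array([-alpha * loss(t) for t in thetas]) + log_prior
    w = np.exp(logw - logw.max())     # subtract max for numerical stability
    return w / w.sum()

# Toy classification task: the label is 1 when the feature exceeds the
# (unknown) threshold 0.3
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = (x > 0.3).astype(int)

thetas = np.linspace(-1.0, 1.0, 401)                     # grid of thresholds
log_prior = np.full(len(thetas), -np.log(len(thetas)))   # uniform prior
loss = lambda t: np.sum((x > t).astype(int) != y)        # cumulative 0-1 error

post = gibbs_posterior(thetas, loss, log_prior)
theta_hat = thetas[np.argmax(post)]   # posterior mode, near the true 0.3
```

Note that no likelihood is ever specified: the 0–1 loss alone defines the Gibbs posterior, which is exactly the point of the generalization (2).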

It appears that this idea was already popular in the machine learning theory community, where distributions like *π*˜(*θ*|*x*) are often referred to as Gibbs posteriors or aggregation rules. The PAC-Bayesian theory was developed to provide upper bounds on the prediction risk of such distributions [36–38]. We refer the reader to the nice tutorials on PAC-Bayes bounds [39,40]. References [41–43] emphasized the connection to information theory. Note that the dropout technique used in deep learning to improve the performance of neural networks [44] was studied with PAC-Bayes bounds in [40]; see also [26]. Many publications in the past few years indeed confirmed that PAC-Bayes bounds are very well suited to analyzing the performance of deep learning [45–51]. See [52] for a recent survey on PAC-Bayes bounds.

Such distributions were also well known in game theory and in prediction with expert advice since the 1990s [53,54]. We refer to the book [55], the recent work [56], and to connected problems such as bandits [57,58].

Finally, many aggregation procedures studied in high-dimensional statistics can also be written in the form (2); see [59–64] with various regression or classification losses. Reference [65] used a Gibbs posterior based on the quantile loss to estimate the VaR (Value at Risk, a measure of risk in finance).

#### **3. Approximation in the Computations**

Much work has been done in the past few years to compute estimators based on *π*(*θ*|*x*) or *π*˜(*θ*|*x*) in complex problems, or with very large datasets. Very often, this comes at the cost of targeting an approximation rather than the exact posterior. It is then important to analyze the accuracy of the approximation.

The nature and accuracy of these approximations are extremely different from one algorithm to another, and some of them are not well understood theoretically. Below, we organize these algorithms into three groups. In Section 3.1, we present methods that still essentially rely on simulations. In Section 3.2, we present asymptotic approximations. Finally, in Section 3.3, we present optimization-based methods (this grouping is for ease of exposition and is of course a little crude; each subsection mentions methods that have little to do with each other).

## *3.1. Non-Exact Monte Carlo Methods*

Monte Carlo methods based on Langevin diffusions were introduced in physics in the 1970s [66]. Let (*U<sub>t</sub>*)<sub>*t*≥0</sub> be a diffusion process given by the stochastic differential equation:

$$
\mathrm{d}U_t = \nabla \log \pi(U_t|\mathbf{x})\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t,
$$

where (*W<sub>t</sub>*)<sub>*t*≥0</sub> is a standard Brownian motion. It turns out that the invariant distribution of (*U<sub>t</sub>*) is *π*(·|*x*). A discretization scheme with step *h* > 0 leads to the Markov chain *Ũ*<sub>*n*+1</sub> = *Ũ*<sub>*n*</sub> + *h*∇ log *π*(*Ũ*<sub>*n*</sub>|*x*) + √(2*h*) *ξ*<sub>*n*</sub>, where the (*ξ*<sub>*n*</sub>) are i.i.d. standard Gaussian variables. However, it is important to note that (*Ũ*<sub>*n*</sub>) does not admit *π*(·|*x*) as an invariant distribution. Thus, the Langevin Monte Carlo method is not exact (it would become exact in the limit *h* → 0). Reference [67] proposed a correction of this method based on the Metropolis–Hastings algorithm, which leads to an exact algorithm known as MALA (the Metropolis Adjusted Langevin Algorithm). Langevin Monte Carlo and MALA became popular in statistics and machine learning following [68], which studied the asymptotic properties of both algorithms. Surprisingly, the exact method does not necessarily enjoy the best asymptotic guarantees. More recently, in the case where log *π*(·|*x*) is concave, non-asymptotic guarantees were proven for Langevin Monte Carlo with a running time that depends only polynomially on the dimension of the parameter *θ*; see [69–74]. Such results are usually not available for exact MCMC methods.
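The discretized chain above (often called the unadjusted Langevin algorithm) can be sketched in a few lines; the standard normal target below is an illustrative assumption of ours, chosen so the small bias of the scheme is visible against known moments.

```python
import numpy as np

def ula(grad_log_post, theta0=0.0, h=0.01, n_iter=100_000, seed=0):
    """Unadjusted Langevin algorithm: the Euler discretization
    U_{n+1} = U_n + h * grad log pi(U_n | x) + sqrt(2h) * xi_n.
    Its invariant law only approximates pi(.|x); the bias vanishes as
    h -> 0 (MALA removes it with a Metropolis accept/reject step)."""
    rng = np.random.default_rng(seed)
    theta, chain = float(theta0), np.empty(n_iter)
    for n in range(n_iter):
        theta = theta + h * grad_log_post(theta) + np.sqrt(2 * h) * rng.standard_normal()
        chain[n] = theta
    return chain

# Toy log-concave target: pi(.|x) = N(0, 1), so grad log pi(u) = -u
chain = ula(lambda u: -u)
mean_est, var_est = chain[10_000:].mean(), chain[10_000:].var()
```

Only the score ∇ log *π*(·|*x*) is needed, never the normalizing constant, which is what makes Langevin methods attractive in high dimension.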

The implementation of the classical Metropolis–Hastings algorithm requires being able to compute the ratio L(*θ*; *x*)/L(*θ*′; *x*) for any *θ*, *θ*′. In some models with complex likelihoods, or with intractable normalization constants, this is not possible. This led to a new direction: approximating this likelihood ratio. A surprising and beautiful fact is that, if each likelihood is computed by an unbiased Monte Carlo estimator, the algorithm remains exact: this was studied under the name pseudo-marginal MCMC in [75]. Still, it sometimes requires much work to get unbiased estimates [76,77], when possible at all. Some authors proposed more general approximations of the likelihood ratio, leading to non-exact algorithms. References [78–81] proposed estimators based on subsampling when the data *x* are too large. Reference [82] proposed an estimator of the likelihood ratio when the likelihood has intractable constants, as in the exponential random graph model, and proved that, even if the resulting MCMC is inexact, it remains asymptotically close to the exact chain. Further theory was developed in [83–85]. More on MCMC for big data can be found in [86].

Finally, the ABC (Approximate Bayesian Computation) algorithm was proposed in population genetics for models where the likelihood is far too complex to be computed, but where it is relatively easy to sample from the model [87,88]. It became extremely popular in some applications; we refer the reader to the survey [89], to Section 3 in [15], and, more recently, to the book [90]. Some theoretical results were proven in [91]; we also refer the reader to [92–94] for some recent advances.
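A basic ABC rejection sampler can be sketched as follows. The Gaussian toy model (where the likelihood is in fact tractable) and the choice of the sample mean as summary statistic are our own illustrative assumptions; in real ABC applications, only the simulator is available.

```python
import numpy as np

def abc_rejection(x_obs, simulate, prior_sample, eps, n_prop=50_000, seed=0):
    """ABC rejection sampling: draw theta from the prior, simulate a synthetic
    dataset, and keep theta when the simulated summary statistic (here, the
    sample mean) falls within eps of the observed one."""
    rng = np.random.default_rng(seed)
    s_obs = x_obs.mean()
    accepted = []
    for _ in range(n_prop):
        theta = prior_sample(rng)
        x_sim = simulate(theta, len(x_obs), rng)
        if abs(x_sim.mean() - s_obs) < eps:
            accepted.append(theta)
    return np.array(accepted)

# Toy model: x_i ~ N(theta, 1) with a N(0, 2^2) prior on theta; the ABC
# sample should concentrate near the observed sample mean
rng = np.random.default_rng(1)
x_obs = rng.normal(1.5, 1.0, size=100)
abc_sample = abc_rejection(
    x_obs,
    simulate=lambda th, n, r: r.normal(th, 1.0, size=n),
    prior_sample=lambda r: r.normal(0.0, 2.0),
    eps=0.1,
)
```

The accepted draws approximate the posterior given the summary statistic only; shrinking `eps` reduces the approximation error at the cost of a lower acceptance rate.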

## *3.2. Asymptotic Approximations*

Laplace's method provides a Gaussian approximation of the posterior centered on the Maximum Likelihood Estimator (MLE) and whose covariance matrix is the inverse of the Fisher information. This approximation can be theoretically justified in parametric models under appropriate regularity conditions thanks to the Bernstein–von Mises theorem. We refer the reader to Chapter 13 in [95] for a complete statement of this result. Integrated Nested Laplace Approximations (INLA) became very popular in Gaussian latent models to compute approximations of the posterior marginals [96].

The extension of the Bernstein–von Mises theorem to nonparametric or semiparametric models is a quite technical and important research direction; see for example [97–101] and Chapter 10 in the monograph [102]. It is important to keep in mind that, even in parametric models, when the assumptions of the theorem are not met, the Laplace approximation can be wrong. The asymptotics of the posterior in such models were studied in detail in [103].
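For a toy posterior, Laplace's method can be sketched numerically as follows; the Newton-based mode search and the finite-difference Hessian are implementation choices of ours, not prescribed by the references.

```python
import numpy as np

def grad_hess(f, theta, h=1e-4):
    """Central finite-difference gradient and Hessian of f at theta."""
    d = len(theta)
    I = np.eye(d)
    g = np.array([(f(theta + h * I[i]) - f(theta - h * I[i])) / (2 * h)
                  for i in range(d)])
    H = np.empty((d, d))
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(theta + h * I[i] + h * I[j]) - f(theta + h * I[i] - h * I[j])
                       - f(theta - h * I[i] + h * I[j]) + f(theta - h * I[i] - h * I[j])) / (4 * h * h)
    return g, H

def laplace_approximation(neg_log_post, theta0, n_newton=20):
    """Laplace's method: the Gaussian approximation N(theta_hat, H^{-1}), where
    theta_hat is the posterior mode (found by Newton iterations) and H is the
    Hessian of the negative log-posterior at the mode."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_newton):
        g, H = grad_hess(neg_log_post, theta)
        theta = theta - np.linalg.solve(H, g)
    _, H = grad_hess(neg_log_post, theta)
    return theta, np.linalg.inv(H)

# Toy posterior that is exactly N(2, 0.25), so Laplace's method is exact here
neg_log_post = lambda t: 0.5 * (t[0] - 2.0) ** 2 / 0.25
mode, cov = laplace_approximation(neg_log_post, theta0=np.zeros(1))
```

On a non-Gaussian posterior, the same routine returns the best local Gaussian fit at the mode, which is exactly where the approximation can fail when the regularity conditions of the Bernstein–von Mises theorem do not hold.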

#### *3.3. Approximations via Optimization*

A huge number of methods are based on the idea of using optimization algorithms to find the best approximation of *π*(·|*x*), or *π*˜(·|*x*), in a set of probability distributions Q fixed by the statistician. The difference between the various methods is in the choice of the criterion used to define the "best" approximation. The set Q can be parametric (e.g., Gaussian distributions, inspired by Laplace's method) or not, the choice being prescribed by the feasibility of the optimization problem.

Variational approximations are based on the Kullback–Leibler divergence *KL*:

$$
\hat{\pi}(\theta|\mathbf{x}) = \operatorname*{argmin}_{q \in \mathcal{Q}} KL(q \| \pi(\cdot|\mathbf{x})) \tag{3}
$$

$$
= \operatorname*{argmin}_{q \in \mathcal{Q}} \left\{ \mathbb{E}_{\theta \sim q}\left[-\log \mathcal{L}(\theta; \mathbf{x})\right] + KL(q \| \pi) \right\}, \tag{4}
$$

where we remind that *KL*(*q*||*p*) = ∫ log(d*q*/d*p*) d*q* when *q* is absolutely continuous with respect to *p*, and *KL*(*q*||*p*) = +∞ otherwise. We refer the reader to the seminal papers [104,105], to the tutorial [106], and to the recent review of the huge literature on variational approximations [107]. Note that the approximation used in [108] in the early days of neural networks can also be interpreted as a variational approximation. Besides the aforementioned applications to recommender systems and to deep learning, variational inference was successfully used in network data analysis [109], economics and econometrics [110–113], finance [114], natural language processing [115], and video processing [116], among others. A huge range of optimization algorithms were used, from the coordinate-wise optimization in the original publications to message passing [117], gradient and stochastic gradient algorithms [27,115,118], and the natural gradient [119]. The convexity and smoothness of the minimization problem were discussed in [120]. The scope of these methods was extended to models with intractable likelihoods in [121]. Reference [122] pointed out a connection between (4) and PAC-Bayes bounds, which led to the first generalization error bounds for variational inference for some Gibbs posteriors, as in (2). The analysis was extended to various settings, including regular posteriors, as in (1), by [123–131]. In particular, Reference [132] proved that variational inference leads to the optimal estimation of some classes of functions with deep learning. Note that even when Q is the set of all Gaussian distributions on the parameter space, the approximation can be very different from the Laplace approximation. Indeed, Reference [129] contains an example of a mixture model where the MLE is not consistent, but Gaussian variational inference is.
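A minimal sketch of variational inference with Q the family of Gaussians, using the reparameterization trick to obtain stochastic gradients of the ELBO (the negative of the objective in (4)); the conjugate toy model and all names are our own illustration.

```python
import numpy as np

def gaussian_vi(grad_log_post, n_iter=5000, lr=0.05, n_mc=64, seed=0):
    """Variational inference over Q = {N(mu, sigma^2)}: maximize the ELBO by
    stochastic gradient ascent with the reparameterization
    theta = mu + sigma * eps, eps ~ N(0, 1). grad_log_post is the derivative
    of the unnormalized log-posterior."""
    rng = np.random.default_rng(seed)
    mu, rho = 0.0, 0.0                       # sigma = exp(rho) keeps sigma > 0
    for _ in range(n_iter):
        eps = rng.standard_normal(n_mc)
        theta = mu + np.exp(rho) * eps        # reparameterized samples from q
        score = grad_log_post(theta)
        g_mu = score.mean()                                # d ELBO / d mu
        g_rho = (score * np.exp(rho) * eps).mean() + 1.0   # + d entropy / d rho
        mu += lr * g_mu
        rho += lr * g_rho
    return mu, np.exp(rho)

# Conjugate toy model: N(0,1) prior, x_i ~ N(theta, 1); the exact posterior is
# N(sum(x)/(n+1), 1/(n+1)), which belongs to Q, so VI can recover it
rng = np.random.default_rng(1)
x = rng.normal(1.0, 1.0, size=20)
grad_log_post = lambda th: x.sum() - (len(x) + 1) * th   # score of the posterior
mu_q, sigma_q = gaussian_vi(grad_log_post)
```

Because the exact posterior lies inside Q here, the minimizer of (3) coincides with it; in general Q excludes the posterior and the fit is only the closest member in *KL*.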

The choice of the Kullback–Leibler divergence in (3) and (4) was initially motivated by the tractability of the computational program to which it leads. Recently, many authors questioned that choice and proposed extended definitions of variational inference using other divergences; for a presentation of the most popular divergences in statistics, see the introduction to information geometry [133]. Note that if we replace *KL* by another divergence, (3) and (4) are in general no longer equivalent, which leads to two possible ways to extend the definition. Reference [134] extended (3) by replacing the *KL* term by a Rényi divergence, and Reference [135] used the *χ*<sup>2</sup> divergence. However, Reference [136] discussed the computational difficulties induced by these changes, which might outweigh the benefits. Reference [137] discussed other criteria, including the Wasserstein distance, and provided some theoretical guarantees. On the other hand, References [138–141] proposed to use more general divergences in (3). This can be related to the generalized exponential family of [142] and the PAC-Bayes bounds in [143,144].

The very popular Expectation Propagation (EP) algorithm was introduced by [145]. EP can be interpreted as the minimization of the reverse *KL*, *KL*(*π*(·|*x*)||*q*), instead of (3). This was detailed in [146], where the author also proposed an extension with *α*-divergences called power EP. Algorithmic issues were discussed in [147] and by [148], who proposed stochastic optimization methods. A first theoretical analysis of EP was proposed in [149]. Let us mention that the textbook [150], a generalist introduction to machine learning, contains a chapter entirely devoted to a pedagogical introduction to variational approximations and EP. The paper [151] focuses on the application of EP to hierarchical models, but also contains a very nice introduction to EP and the conditions ensuring its stability.

Finally, let us mention approximations by discrete distributions of the form *q* = (1/*M*) ∑<sub>*i*=1</sub><sup>*M*</sup> *δ*<sub>*θ<sub>i</sub>*</sub>, where *δ<sub>x</sub>* is the Dirac mass at *x*. Note that this is typically the kind of approximation provided by MCMC and sequential Monte Carlo methods, but in those methods, the *θ<sub>i</sub>* are sampled. It is also possible to minimize a distance criterion between *q* and *π*(·|*x*) directly. Unfortunately, when *π*(·|*x*) is continuous, both divergences are infinite, *KL*(*π*(·|*x*)||*q*) = *KL*(*q*||*π*(·|*x*)) = +∞, so it is not possible to use variational inference or EP in this case. An energy-based criterion was proposed in [152]. Reference [153] proposed to use Stein divergences between *q* and *π*(·|*x*), and the technique became quite successful [154–156]. Another possible research direction is to use the Wasserstein distance [157].
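As an illustration of the Stein-divergence direction, here is a minimal one-dimensional sketch of a kernelized Stein particle update (often called Stein variational gradient descent). The Gaussian target and all tuning constants are our own illustrative assumptions, not the exact algorithms of [153–156].

```python
import numpy as np

def svgd(grad_log_post, theta_init, n_iter=1000, lr=0.1, bw=1.0):
    """Kernelized Stein particle update: deterministically transport M
    particles so that q = (1/M) sum_i delta_{theta_i} approximates pi(.|x).
    Each step combines attraction along the score with a kernel repulsion
    term that keeps the particles spread out."""
    theta = np.array(theta_init, dtype=float)     # shape (M,)
    M = len(theta)
    for _ in range(n_iter):
        diff = theta[:, None] - theta[None, :]    # diff[i, j] = theta_i - theta_j
        K = np.exp(-diff**2 / (2 * bw**2))         # RBF kernel k(theta_i, theta_j)
        dK = -diff / bw**2 * K                     # d k(theta_i, theta_j) / d theta_i
        # phi(theta_j) = (1/M) sum_i [ k(theta_i, theta_j) grad log pi(theta_i)
        #                              + d k(theta_i, theta_j)/d theta_i ]
        phi = (K @ grad_log_post(theta) + dK.sum(axis=0)) / M
        theta += lr * phi
    return theta

# Toy posterior N(2, 1): grad log pi(u) = -(u - 2); particles spread to
# match the posterior's location and scale
particles = svgd(lambda u: -(u - 2.0), np.linspace(-3.0, 3.0, 50))
```

Unlike MCMC, the particles move deterministically, and unlike variational inference, the discrete measure needs no density, which is precisely why a Stein (rather than *KL*) criterion is required here.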

#### **4. Scope of This Special Issue**

The objective of this Special Issue is to provide the latest advances in approximate Monte Carlo methods and in approximations of the posterior: the design of efficient algorithms, the study of the statistical properties of these algorithms, and challenging applications.

**Funding:** This research received no external funding.

**Acknowledgments:** The author gratefully thanks Emtiyaz Khan (RIKEN AIP) and Nicolas Chopin (ENSAE Paris) for useful comments.

**Conflicts of Interest:** The author declares no conflict of interest.



#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Coupled VAE: Improved Accuracy and Robustness of a Variational Autoencoder**

**Shichen Cao <sup>1</sup>, Jingjing Li <sup>2</sup>, Kenric P. Nelson <sup>3,\*</sup> and Mark A. Kon <sup>2</sup>**

	- mkon@bu.edu (M.A.K.)

**Abstract:** We present a coupled variational autoencoder (VAE) method, which improves the accuracy and robustness of the model representation of handwritten numeral images. The improvement is measured both in increasing the likelihood of the reconstructed images and in reducing divergence between the posterior and a prior latent distribution. The new method weighs outlier samples with a higher penalty by generalizing the original evidence lower bound function using a coupled entropy function based on the principles of nonlinear statistical coupling. We evaluated the performance of the coupled VAE model using the Modified National Institute of Standards and Technology (MNIST) dataset and its corrupted modification C-MNIST. Histograms of the likelihood that the reconstruction matches the original image show that the coupled VAE improves the reconstruction, and this improvement is more substantial when seeded with corrupted images. All five corruptions evaluated showed improvement. For instance, with the Gaussian corruption seed, the accuracy improves by a factor of 10<sup>14</sup> (from 10<sup>−57.2</sup> to 10<sup>−42.9</sup>) and robustness improves by a factor of 10<sup>22</sup> (from 10<sup>−109.2</sup> to 10<sup>−87.0</sup>). Furthermore, the divergence between the posterior and prior distribution of the latent distribution is reduced. Thus, in contrast to the *β*-VAE design, the coupled VAE algorithm improves model representation, rather than trading off the performance of the reconstruction and latent distribution divergence.

**Keywords:** machine learning; entropy; robustness; statistical mechanics; complex systems

## **1. Introduction**

An overarching challenge in machine learning is the development of methodologies that ensure the accuracy and robustness of models given limited training data. By accuracy, we refer to the metrics of information theory, such as minimizing the cross-entropy or divergence of an algorithm. In this paper, we define a measure of robustness based on a generalization of information theory. The variational autoencoder (VAE) contributes to improved learning of models by utilizing approximate variational inference [1,2]. By storing a statistical model rather than a deterministic model at the latent layer, the algorithm has increased flexibility in its use for reconstruction and other applications. The variational inference is optimized by minimization of a loss function, the so-called negative evidence lower bound, which has two components. The first component is a cross-entropy between the generated and the source data, also known as the expected negative log-likelihood, while the second is a divergence between the prior and the posterior distributions of the latent layer.

**Citation:** Cao, S.; Li, J.; Nelson, K.P.; Kon, M.A. Coupled VAE: Improved Accuracy and Robustness of a Variational Autoencoder. *Entropy* **2022**, *24*, 423. https://doi.org/10.3390/e24030423

Academic Editor: Pierre Alquier

Received: 15 February 2022; Accepted: 13 March 2022; Published: 18 March 2022

Our goal in this research is to provide an evaluation as to whether a generalization of information theory can be applied to improving the robustness of machine learning algorithms. Robustness of autoencoders to outliers is critical for generating a reliable representation of particular data types in the encoded space when using corrupted training data [3]. In this paper, a generalized entropy function is used to modify the negative evidence lower bound loss function of a variational autoencoder. With the MNIST handwritten numerals dataset, we are able to measure the improvement in the robustness of the reconstruction, using a metric also derived from the generalization of information theory. In addition, we find that the accuracy of the reconstruction, as measured by Shannon information theory, is also improved. Furthermore, the divergence between the latent distribution posterior and prior is also reduced. This is important to ensure that the reconstruction improvement is not a result of degrading the latent layer.

Our study builds from the work of Kingma and Welling [4] on variational autoencoders and Tran et al. [5] on deep probabilistic programming. Variational autoencoders are an unsupervised learning method for training encoder and decoder neural networks. Between the encoder and decoder, the parameters of a multidimensional distribution are learned to form a compressed latent representation of the training data [6]. It is an effective method for generating complex datasets such as images and speech. Zalger [7] implemented the application of VAEs for aircraft turbomachinery design and Xu et al. [8] used VAEs to achieve unsupervised anomaly detection for seasonal key performance indicators (KPIs) in web applications. VAEs have been used to construct probabilistic models of complex physical phenomena [9]. Autoencoders can use a variety of latent variable models, but restricting the models can enhance performance. Sparse autoencoders add a penalty for the number of active hidden layer nodes used in the model. Variational autoencoders further restrict the model to a probability distribution *q<sub>φ</sub>*(**z**|**x**) specified by a set of encoder parameters *φ* which approximates the actual conditional probability *p*(**z**|**x**). Variational inference, as reviewed by Blei et al. [10], is used to learn this approximation by minimizing an objective function such as the Kullback–Leibler divergence. The decoder learns a set of parameters *θ* for a generative distribution *q<sub>θ</sub>*(**x**′|**z**), where **z** is the latent variable and **x**′ is the output generated data.
The complexity of the data distribution *p*(**x**) makes direct computation of the divergence between the approximate and exact latent conditional probabilities intractable; however, a variational or evidence lower bound (ELBO) is computable and consists of two components, the expected reconstruction log-likelihood of the generated data (cross-entropy) and the negative of the divergence between the latent posterior conditional probability *qφ*(**z**|**x**) and a latent prior distribution *p*(**z**), which is typically a standard normal distribution but can be more sophisticated for particular model requirements.

Recently, Higgins et al. [11] proposed a *β*-VAE framework, which can provide a more disentangled latent representation **z** [12] by increasing the weight of the KL-divergence term of the ELBO. Since the KL-divergence is a regularization that constrains the capacity of the latent information channel **z**, increasing the weight of the regularization with *β* > 1 puts pressure on the learnt posterior so it is more tightly packed. The effect seems to be an encouragement of each dimension to store distinct information and excess dimensions as highly packed noise. However, this improvement is a trade-off between the divergence and reconstruction components of the ELBO metric. We will show that the coupled VAE algorithm improves both components of the ELBO.

The next section provides an introduction to the design of the variational autoencoder. A comparison with other generative algorithms is included. Section 3 introduces nonlinear statistical coupling and its application to defining metrics for the robustness, accuracy, and decisiveness of decision algorithms. In this paper, use of the uppercase letter for the terms 'Robustness', 'Accuracy', and 'Decisiveness' refers to the specific metrics, which will be introduced in Section 3.1. Lowercase letters for these terms will be used when referring to the general properties. Following the definition of the reconstruction assessment metrics, the generalization of the negative ELBO is defined. This coupled negative ELBO provides control over the weighting of rare versus common samples in the distribution of the training set. Additional details of the derivation of the generalized negative ELBO function and metrics are provided in Appendices A.1 and A.2, respectively. In Section 4, the improved autoencoder is evaluated using the MNIST handwritten numeral test set. Measurements of the reconstruction and the characteristics of the posterior latent variables

are analyzed. Section 5 provides a visualization of the changes in the latent distribution using a 2-dimensional distribution. Section 6 demonstrates that the coupled VAE algorithm provides significantly improved stability in the model performance when the input image is corrupted from the training set. This provides evidence of the improved robustness of the algorithm. Section 7 provides a discussion, conclusion, and suggestions for future research.

#### **2. The Variational Autoencoder**

A variational autoencoder consists of an encoder, a decoder, and a loss function. Figure 1 represents the basic structure of an autoencoder. The encoder *Q* is a neural network that converts high-dimensional information from the input data into a low-dimensional hidden, latent representation **z**. Some information is lost during this data compression because the dimension is reduced. The decoder *P* decompresses from the latent space **z** to reconstruct the data. While, in general, autoencoders can learn a variety of representations, VAEs specifically learn the parameters of a probability distribution. The model used here learns the means and standard deviations *θ* of a collection of multivariate Gaussian distributions and stores this information in a two-layered space. The training loss function, which is the negative evidence lower bound, is optimized using stochastic gradient descent.

**Figure 1.** The variational autoencoder consists of an encoder, a probability model, and a decoder.

#### *2.1. VAE Loss Function*

The encoder reads the input data and compresses and transforms it into a fixed-shape latent representation **z**, while the decoder decompresses and reconstructs the information from this latent representation, outputting the parameters of a distribution used to generate a new reconstruction **x**′. The true posterior distribution $p(\mathbf{z}|\mathbf{x}^{(i)})$ of **z** given the $i$th datapoint $\mathbf{x}^{(i)}$ is unknown, so we use the Gaussian approximation $q(\mathbf{z}|\mathbf{x}^{(i)})$ with mean vector $\boldsymbol{\mu}^{(i)}$ and covariance matrix $\mathrm{diag}(\sigma\_1^2, \cdots, \sigma\_d^2)^{(i)}$ instead. The goal of the algorithm is to maximize the variational or evidence lower bound (ELBO) on the marginal density of individual datapoints.

For a dataset $\mathbf{X} = \{\mathbf{x}^{(i)}\}\_{i=1}^{N}$ consisting of *N* independent and identically distributed samples, the variational lower bound for the $i$th datapoint or image $\mathbf{x}^{(i)}$ in the original VAE algorithm [4] is

$$ELBO\left(\mathbf{x}^{(i)}\right) = -D\_{KL}\left(q\left(\mathbf{z}|\mathbf{x}^{(i)}\right) \parallel p(\mathbf{z})\right) + \mathbb{E}\_{q\left(\mathbf{z}|\mathbf{x}^{(i)}\right)}\left[\log p\left(\mathbf{x}^{(i)}|\mathbf{z}\right)\right].\tag{1}$$

The first term on the right-hand side is the negative Kullback–Leibler divergence between the posterior variational approximation $q(\mathbf{z}|\mathbf{x})$ and a prior distribution $p(\mathbf{z})$, which is selected to be a standard Gaussian distribution. The second term on the right-hand side is denoted as the expected reconstruction log-likelihood, and is referred to as the cross-entropy. Let $n\_z$ be the dimensionality of **z**; then, the Kullback–Leibler divergence simplifies to

$$-D\_{KL}\left(q\left(\mathbf{z}|\mathbf{x}^{(i)}\right) \parallel p(\mathbf{z})\right) = \int q\left(\mathbf{z}|\mathbf{x}^{(i)}\right)\left(\log p(\mathbf{z}) - \log q\left(\mathbf{z}|\mathbf{x}^{(i)}\right)\right)d\mathbf{z} \tag{2}$$

$$=\frac{1}{2}\sum\_{j=1}^{n\_z}\left(1+\log\left(\left(\sigma\_j^{(i)}\right)^2\right)-\left(\mu\_j^{(i)}\right)^2-\left(\sigma\_j^{(i)}\right)^2\right).\tag{3}$$
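The closed form of Equation (3) is straightforward to evaluate numerically. A minimal sketch, assuming per-dimension arrays of means and log-variances as inputs:

```python
import numpy as np

def neg_kl_gaussian(mu, logvar):
    """Closed-form -D_KL(q(z|x) || p(z)) of Equation (3) for a diagonal
    Gaussian q with per-dimension means mu and log-variances logvar,
    against a standard Gaussian prior p(z)."""
    return 0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

# When the posterior equals the prior (mu = 0, sigma = 1), -D_KL is 0;
# any other (mu, sigma) makes it negative, since D_KL >= 0.
print(neg_kl_gaussian(np.zeros(20), np.zeros(20)))   # 0.0
print(neg_kl_gaussian(np.ones(3), np.zeros(3)))      # -1.5
```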

The expected reconstruction log-likelihood (cross-entropy) $\mathbb{E}\_{q(\mathbf{z}|\mathbf{x}^{(i)})}\left[\log p\left(\mathbf{x}^{(i)}|\mathbf{z}\right)\right]$ can be estimated by sampling, i.e.,

$$\mathbb{E}\_{q\left(\mathbf{z}\mid\mathbf{x}^{(i)}\right)}\left[\log p\left(\mathbf{x}^{(i)}\mid\mathbf{z}\right)\right] = \frac{1}{L} \sum\_{l=1}^{L} \left(\log p\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i,l)}\right)\right),\tag{4}$$

where *L* denotes the number of samples for each datapoint; we set *L* = 1 in our study. Supposing the data **x** given **z** follow a Bernoulli distribution, the log-likelihood is

$$\log p(\mathbf{x}|\mathbf{z}) = \sum\_{i=1}^{n\_x} (\mathbf{x}\_i \log y\_i + (1 - \mathbf{x}\_i) \log(1 - y\_i)),\tag{5}$$

where **y** is the output of the decoder. Therefore, the loss function can be calculated by

$$\mathcal{L}\left(\mathbf{x}^{(i)}\right) = -ELBO\left(\mathbf{x}^{(i)}\right) = D\_{KL}\left(q\left(\mathbf{z}|\mathbf{x}^{(i)}\right) \parallel p(\mathbf{z})\right) - \frac{1}{L} \sum\_{l=1}^{L} \left(\log p\left(\mathbf{x}^{(i)}|\mathbf{z}^{(i,l)}\right)\right). \tag{6}$$
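Combining Equations (3), (5), and (6) with a single Monte Carlo sample (*L* = 1) gives a compact loss computation. This is a sketch, assuming binary-valued inputs; the decoder output is clipped to avoid log(0):

```python
import numpy as np

def bernoulli_log_likelihood(x, y, eps=1e-7):
    """log p(x|z) from Equation (5): Bernoulli log-likelihood of the
    pixels x under the decoder output probabilities y."""
    y = np.clip(y, eps, 1.0 - eps)
    return np.sum(x * np.log(y) + (1.0 - x) * np.log(1.0 - y))

def neg_elbo(x, y, mu, logvar):
    """Loss of Equation (6) with L = 1: the KL term minus the
    reconstruction log-likelihood."""
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return kl - bernoulli_log_likelihood(x, y)

# Perfect reconstruction with the posterior equal to the prior gives a
# loss that is numerically zero (up to the clipping epsilon).
x = np.array([0.0, 1.0, 1.0, 0.0])
loss = neg_elbo(x, y=x, mu=np.zeros(2), logvar=np.zeros(2))
```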

For our work, the loss function is modified to improve the robustness of the variational autoencoder, something that will be discussed in Section 4.

#### *2.2. Comparison with Other Generative Machine Learning Methods*

The paradigm of generative adversarial networks (GANs) is a recent advance in generative machine learning methods. The basic idea of GANs was described in a 2010 blog post by Niemitalo [13], and the name 'GAN' was introduced by Goodfellow et al. [14]. In comparison with variational autoencoders, generative adversarial networks are optimized specifically for generative tasks. GANs can produce models with true latent spaces, as in the bidirectional GAN (BiGAN) and adversarially learned inference (ALI) [15,16], which are designed to improve the performance of GANs. However, GANs cannot generate reasonable results when data are high-dimensional [17]. By contrast, as a probabilistic model, a variational autoencoder has the specific goal of marginalizing out noninformative variables during the training process. The ability to use complex priors in the latent space enables existing expert knowledge to be incorporated.

Bayesian networks form another generative model. Pearl [18] proposed the Bayesian network paradigm in 1985. Bayesian networks have a strong ability to capture the symbolic structure of input information and to combine objective probabilities with subjective estimates for both qualitative and quantitative modeling. The basic concept of Bayesian networks is built on Bayes's theorem. Another effective way to approximate the posterior distribution of a neural-network model is to train and predict using variational inference techniques [19]. Compared with the original Bayesian network, the basic building blocks of deep networks provide multiple loss functions for making multitarget predictions, for transfer learning, and for varying outputs depending on the situation. Improvements to these deeper architectures, the VAE in particular, continue to be made.

Other generative models are now commonly combined with a variational autoencoder to improve performance. Ebbers et al. [20] developed a VAE with a hidden Markov model (HMM) as the latent model for discovering acoustic units. Dilokthanakul et al. [2] studied the use of Gaussian mixture models as the prior distribution of the VAE to perform unsupervised clustering through deep generative models. They proposed a heuristic called the 'minimum information constraint' that is capable of improving the unsupervised clustering performance of this model. Srivastava and Sutton [1] presented an effective autoencoding variational Bayes-based inference method for latent Dirichlet allocation (LDA). This model solves the problems caused in autoencoding variational Bayes by the Dirichlet prior and by component collapsing. Additionally, it matches the accuracy of traditional methods with much better inference time.

#### **3. Accounting for Risk with Coupled Entropy**

Machine learning algorithms, including the VAE, have achieved efficient learning and inference for many image processing applications. Nevertheless, assuring accurate forecasts of the uncertainty is still a challenge. Problems such as outliers and overfitting impact the robustness of scientific prediction and engineering systems. This paper concentrates on assessing and improving the robustness of the VAE algorithm.

In this study, we draw upon the principles of nonlinear statistical coupling (NSC) [21,22] to define a generalization of information theory and apply the resulting entropic functions to the definition of the negative ELBO loss function for the training of the variational autoencoder [23]. NSC is derived from nonextensive statistical mechanics [24], which generalizes the variational calculus of maximum entropy to include constraints related to the nonlinear dynamics of complex systems and in turn to the nonexponential decay of the maximizing distributions. The NSC framework focuses this theory on the role of the nonlinear coupling *κ* in generalizing entropy and its related functions. The approach defines a family of heavy-tailed (positive coupling) and compactly supported (negative coupling) distributions which maximize a generalized entropy function referred to as the coupled entropy. The variational methods underlying NSC can be applied to a variety of problems in mathematical physics [25,26]. Here, we examine how NSC can broaden the role of approximate variational inference in machine learning to include sensitivity to the risks of outlier events occurring in the tail of the distribution of the phenomena being learned.

#### *3.1. Assessing Probabilistic Forecasts with the Generalized Mean*

First, proper metrics are needed to evaluate the accuracy and robustness of machine learning algorithms, such as VAE. The arithmetic mean and the standard deviation are widely used to measure the central tendency and fluctuation, respectively, of a random variable. Nevertheless, these are inappropriate for probabilities, which are formed by ratios. A random variable formed by the ratio of two independent random variables has a central tendency determined by the geometric mean, as described by McAlister [27]. Information theory addresses this issue by taking the logarithm of the probabilities, then the arithmetic mean; however, we will show that the generalizations of information theory are easier to report and visualize in the probability domain.

In [28], a risk profile was introduced, which is the spectrum of the generalized means of probabilities and provides an assessment of the central tendency and fluctuations of probabilistic inferences. The generalized mean $\left(\frac{1}{N}\sum\_{i=1}^{N} p\_i^r\right)^{\frac{1}{r}}$ is a translation of generalized information-theoretic metrics back to the probability domain, and is derived in the next section. Its use as a metric for evaluating and training inference algorithms is related to the Wasserstein distance [29], which incorporates the generalized mean. The accuracy of the likelihoods is measured with robust, neutral, and decisive risk bias using the $r = -\frac{2}{3}$, $r = 0$ (geometric), and $r = 1$ (arithmetic) means, respectively. With no risk bias ($r = 0$), the geometric mean is equivalent to transforming the cross-entropy between the forecast $p\_i$ and the distribution of the test samples to the probability domain. The arithmetic mean ($r = 1$) is a simple measure of the Decisiveness (i.e., were the class probabilities in the right order so that a correct decision can be made?). This measure de-weights probabilities near zero, since increasing $r$ reduces the influence of small probabilities on the average.

To complement the arithmetic mean, we choose a negative conjugate value. The conjugate is not the harmonic mean ($r = -1$) because this turns out to be too severe a test. Instead, $r = -\frac{2}{3}$ is chosen based on a dual transformation between the heavy-tail (positive *κ*) and compact-support (negative *κ*) domains of the coupled Gaussian distribution. The risk sensitivity $r$ can be decomposed into the nonlinear coupling and the power and dimension of the variable, $r(\kappa, \alpha, d) = \frac{-\alpha\kappa}{1+d\kappa}$. The dual transformation between the positive/negative domains of the coupled Gaussians has the following relationship: $\hat{\kappa} \Leftrightarrow \frac{-\kappa}{1+d\kappa}$. Taking $\alpha = 2$ and $d = 1$, the coupling for a risk bias of one is $1 = \frac{-2\kappa}{1+\kappa} \Rightarrow \kappa = -\frac{1}{3}$, and the conjugate values are $\hat{\kappa} = \frac{1/3}{1-1/3} = \frac{1}{2}$ and $\hat{r} = \frac{-2\cdot\frac{1}{2}}{1+\frac{1}{2}} = -\frac{2}{3}$ [23]. The Robustness metric increases the weight of probabilities near zero, since negative powers invert the probabilities prior to the average.
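The decomposition and dual transformation above can be checked with exact rational arithmetic. A small verification sketch; the function names `risk_bias` and `dual_coupling` are ours:

```python
from fractions import Fraction

def risk_bias(kappa, alpha=2, d=1):
    """r(kappa, alpha, d) = -alpha*kappa / (1 + d*kappa): the risk bias
    induced by nonlinear coupling kappa, for variable power alpha and
    dimension d."""
    return Fraction(-alpha) * kappa / (1 + d * kappa)

def dual_coupling(kappa, d=1):
    """Dual transformation kappa_hat <=> -kappa / (1 + d*kappa) between
    the heavy-tailed (kappa > 0) and compact-support (kappa < 0) domains."""
    return -kappa / (1 + d * kappa)

kappa = Fraction(-1, 3)            # coupling whose risk bias is r = 1
r = risk_bias(kappa)               # 1: the arithmetic mean (Decisiveness)
kappa_hat = dual_coupling(kappa)   # conjugate coupling 1/2
r_hat = risk_bias(kappa_hat)       # -2/3: the Robustness mean
```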

For simplicity, we refer to these three metrics as the Robustness, Accuracy, and Decisiveness. The label 'accuracy' is used for the neutral accuracy, since 'neutralness' is not appropriate and 'neutral' does not express that this metric is the central tendency of the accuracy. Summarizing:

$$Decisiveness \text{ (arithmetic mean)}: \frac{1}{N} \sum\_{i=1}^{N} p\_i. \tag{7}$$

$$Accuracy\text{ (geometric mean)}:\prod\_{i=1}^{N}p\_i^{\frac{1}{N}}.\tag{8}$$

$$Robustness\ (-2/3\ \text{mean}):\left(\frac{1}{N}\sum\_{i=1}^{N}p\_i^{-\frac{2}{3}}\right)^{-\frac{3}{2}}.\tag{9}$$
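Equations (7)–(9) are all instances of the generalized mean. A small sketch with made-up likelihood values shows how a single outlier probability affects each metric differently:

```python
import numpy as np

def generalized_mean(p, r):
    """((1/N) * sum p_i**r)**(1/r); the limit r -> 0 is the geometric mean."""
    p = np.asarray(p, dtype=float)
    if r == 0:
        return float(np.exp(np.mean(np.log(p))))  # geometric mean
    return float(np.mean(p**r) ** (1.0 / r))

# Toy likelihoods: three well-reconstructed samples and one rare outlier.
probs = [0.9, 0.8, 0.5, 1e-6]

decisiveness = generalized_mean(probs, 1)     # arithmetic mean, Eq. (7)
accuracy = generalized_mean(probs, 0)         # geometric mean, Eq. (8)
robustness = generalized_mean(probs, -2/3)    # -2/3 mean, Eq. (9)
# The outlier barely moves the Decisiveness but collapses the Robustness.
```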

Similar to the standard deviation, the arithmetic mean and the −2/3 mean play roles as measures of the fluctuation. Figure 2 shows an example of input images from the MNIST dataset and the generated output images produced by the VAE. Despite the blur in some output images, the VAE succeeds in generating images very similar to the input. However, the histogram in Figure 3, which plots the frequency of the likelihoods over a log scale, shows that the probabilities of the ground truth range over a large scale. The geometric mean or Accuracy captures the central tendency of the distribution at $10^{-37}$. The Robustness and the Decisiveness capture the span of the fluctuation in the distribution. The −2/3 mean or Robustness is $10^{-77}$ and the arithmetic mean or Decisiveness is $10^{-15}$. The minimal value of the −2/3 mean metric is an indicator of the poor robustness of the VAE model, which can be improved. We measure and display the performance in the probability space in order to simplify the comparison between the three metrics. In the next subsection, we will show their relationship with a generalization of the log-likelihood. If, however, we were to plot histograms in the log-space, separate histograms would be required for each metric. By using the probability space, we can display one histogram overlaid with three different means. Appendix A.2 describes the origin of the Robustness–Accuracy–Decisiveness metrics.

**Figure 2.** Example set of (**a**) MNIST input images and (**b**) VAE-generated output images.

**Figure 3.** A histogram of the likelihoods that the VAE-reconstructed images match the input images. The objective of the coupled VAE research is to demonstrate that the Robustness, which is the −2/3 generalized mean, can be increased by penalizing the cost of producing outlier reconstructions. The Accuracy is the exponential of the average log-likelihood and the Decisiveness is the arithmetic mean.

In order to improve performance against the robust metric, the training of the variational autoencoder needs to incorporate this generalized metric. To do so, we derive a coupled loss function in the next subsection.

#### *3.2. Definition of Negative Coupled ELBO*

As we discussed in Section 2, the goal of a VAE algorithm is to optimize a low-dimensional model of a high-dimensional input dataset. This is accomplished using approximate variational inference by maximizing an evidence lower bound (ELBO). Equivalently, the negative ELBO defines a loss function which can be minimized, $\mathcal{L}(\mathbf{x}^{(i)}) = -ELBO(\mathbf{x}^{(i)})$. In this paper, we provide initial evidence that the accuracy and robustness of the variational inference can be improved by generalizing the negative ELBO to account for the risk of outlier events. Here, we provide a definition of the generalization; a derivation is provided in Appendix A.1.

The generalized loss function in the coupled variational autoencoder (VAE) method is defined as follows.

**Definition 1.** *(Negative Coupled ELBO). Given the $i$th datapoint $\mathbf{x}^{(i)}$, the corresponding latent variable value* **z***, and the output value* **y** *of the decoder using the Bernoulli distribution, the loss function for the coupled VAE algorithm is given by*

$$\mathcal{L}\_{\kappa}\left(\mathbf{x}^{(i)}\right) = D\_{\kappa}\left(q\left(\mathbf{z}|\mathbf{x}^{(i)}\right) \parallel p(\mathbf{z})\right) + H\_{\kappa}(\mathbf{x}, \mathbf{y}),\tag{10}$$

*where*

$$\begin{split} &D\_{\kappa}\left(q\left(\mathbf{z}|\mathbf{x}^{(i)}\right) \parallel p(\mathbf{z})\right) \\ = & \prod\_{j=1}^{n\_{z}} \int \frac{q\left(z\_{j}|\mathbf{x}^{(i)}\right)^{1+\frac{2\kappa}{1+\kappa}}}{\int q\left(z\_{j}|\mathbf{x}^{(i)}\right)^{1+\frac{2\kappa}{1+\kappa}} dz\_{j}} \frac{1}{2}\left(\ln\_{\kappa}\left(q\left(z\_{j}|\mathbf{x}^{(i)}\right)^{-\frac{2}{1+\kappa}}\right) - \ln\_{\kappa}\left(p(z\_{j})^{-\frac{2}{1+\kappa}}\right)\right) dz\_{j} \\ = & \prod\_{j=1}^{n\_{z}} \frac{1}{2\kappa} \int \frac{\left(\frac{1}{\sigma\_{j}\sqrt{2\pi}}e^{-\frac{(z\_{j}-\mu\_{j})^{2}}{2\sigma\_{j}^{2}}}\right)^{1+\frac{2\kappa}{1+\kappa}}}{\int \left(\frac{1}{\sigma\_{j}\sqrt{2\pi}}e^{-\frac{(z\_{j}-\mu\_{j})^{2}}{2\sigma\_{j}^{2}}}\right)^{1+\frac{2\kappa}{1+\kappa}} dz\_{j}} \left(\left(\frac{1}{\sigma\_{j}\sqrt{2\pi}}e^{-\frac{(z\_{j}-\mu\_{j})^{2}}{2\sigma\_{j}^{2}}}\right)^{-\frac{2\kappa}{1+\kappa}} - \left(\frac{1}{\sqrt{2\pi}}e^{-\frac{z\_{j}^{2}}{2}}\right)^{-\frac{2\kappa}{1+\kappa}}\right) dz\_{j} \end{split} \tag{11}$$

*is the generalized (coupled) KL-divergence in the original loss function in Equation (6), and*

$$H\_{\kappa}(\mathbf{x}, \mathbf{y}) \equiv -\frac{1}{2L} \sum\_{l=1}^{L} \sum\_{i=1}^{n\_{x}} \left( \mathbf{x}\_{i} \ln\_{\kappa}\left( (y\_{i})^{\frac{2}{1+\kappa}} \right) + (1 - \mathbf{x}\_{i}) \ln\_{\kappa}\left( (1 - y\_{i})^{\frac{2}{1+\kappa}} \right) \right) \tag{12}$$

*is the generalized reconstruction loss (coupled cross-entropy) in the original loss function in Equation (6).*
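Assuming the NSC convention for the coupled logarithm, $\ln\_\kappa(u) = (u^\kappa - 1)/\kappa$ with $\ln\_0 = \log$ (our assumption here; see [21,22] for the formal definitions), Equation (12) can be evaluated directly, and it reduces to the standard cross-entropy of Equation (5) as $\kappa \to 0$. A sketch:

```python
import numpy as np

def coupled_log(u, kappa):
    """Coupled logarithm ln_kappa(u) = (u**kappa - 1) / kappa, recovering
    the ordinary logarithm in the limit kappa -> 0. (Convention assumed
    from the NSC literature.)"""
    if kappa == 0:
        return np.log(u)
    return (u**kappa - 1.0) / kappa

def coupled_cross_entropy(x, y, kappa, eps=1e-7):
    """H_kappa(x, y) of Equation (12) with a single sample (L = 1)."""
    y = np.clip(y, eps, 1.0 - eps)
    power = 2.0 / (1.0 + kappa)
    return -0.5 * np.sum(
        x * coupled_log(y**power, kappa)
        + (1.0 - x) * coupled_log((1.0 - y)**power, kappa)
    )

x = np.array([1.0, 0.0, 1.0, 0.0])   # binary 'pixels'
y = np.array([0.9, 0.2, 0.6, 0.4])   # decoder output probabilities

h0 = coupled_cross_entropy(x, y, kappa=0.0)       # standard cross-entropy
h_eps = coupled_cross_entropy(x, y, kappa=1e-6)   # nearly identical
```

At $\kappa = 0$ the inner power is 2, the factor $\frac{1}{2}$ cancels it, and the expression collapses to the negative Bernoulli log-likelihood of Equation (5).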

In the next section, we show preliminary experimental evidence that the negative coupled ELBO can be used to improve the robustness and accuracy of the variational inference. We show that increasing the coupling parameter of the loss function has the effect of increasing the Accuracy (8) and Robustness (9) metrics of the generated data. Additionally, we show that the improvement in the generation process is not at the expense of the divergence between the posterior and prior latent distributions. Thus, the overall ELBO is improved, indicating an improvement in the approximate variational inference. Furthermore, in Section 6, we show that the improvements are more substantial when the algorithm is seeded by images from the corrupted MNIST database. While the experimental results of this report focus on a two-layer dense neural network and the (corrupted) MNIST datasets, the generalization of information-theoretic cost functions for machine learning training is applicable to a broader range of architectures and datasets. For instance, CIFAR-10 reconstruction, which is typically processed with a deep neural network [30], is planned for future research.

#### **4. Results Using the MNIST Handwritten Numerals**

The MNIST handwritten digit database is a large database of handwritten digits consisting of a training set of 60,000 images and a test set of 10,000 images widely used for evaluating machine learning and pattern recognition methods. The digits have been size-normalized and centered in fixed-size images. Each image in the database contains 28 by 28 grayscale pixels. Pixel values vary from 0 to 255. Zero means the pixel is white, or background, while 255 means the pixel is black, or foreground [31]. In this and the next section, we examine the performance of the coupled VAE algorithm in reconstructing images of the MNIST database. In Section 6, we show the stability of the coupled VAE when reconstruction is distorted by samples from the corrupted MNIST database.
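Since the decoder models each pixel as a Bernoulli probability, the 0–255 grayscale values are typically rescaled to [0, 1] and flattened to 784-dimensional vectors before training. An illustrative preprocessing step (a random stand-in batch, not necessarily the paper's exact pipeline):

```python
import numpy as np

# Stand-in for a batch of MNIST images (4 images of 28x28 uint8 pixels).
rng = np.random.default_rng(0)
images_uint8 = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Rescale 0-255 grayscale values into [0, 1] so each pixel can serve as
# a Bernoulli probability target, then flatten to 784-dimensional vectors.
images = images_uint8.astype(np.float32) / 255.0
flat = images.reshape(len(images), 28 * 28)
```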

For this research, we used the MNIST database as the input since it was used in the traditional VAE. Specifically, the input **x** is a batch of 28 by 28 pixel images of handwritten numerals. The encoder encodes the data, which are 784-dimensional for each image in a batch, into the latent space. For our experiments, the dimension of the latent variable **z** ranges from 2 to 20. Taking the latent variables **z** as the input, the decoder computes the probability distribution of each pixel using a Bernoulli or Gaussian distribution and outputs the corresponding 784 parameters to reconstruct an image. We used a specific number of images from the training set as the batch size and a fixed number of epochs. Additionally, for the learned MNIST manifold, visualizations of the learned data and reproduced results were plotted. The algorithm and experiments were developed with Python and the TensorFlow library. Our Python code can be found in the Data Availability Statement.

The input images and output images for different values of coupling *κ* are shown in Figure 4. *κ* = 0 represents the original VAE model. Compared with the original algorithm, output images generated by the modified coupled VAE model show small improvements in detail and clarity. For instance, the fifth digit in the first row of the input images is '4', but the output image in the original VAE is more like '9' rather than '4', while the coupled VAE method generates '4' correctly. For the seventh digit '4' in the first row, the generated image in the coupled VAE has an improved clarity compared to the traditional VAE.

Figure 5 shows the likelihood histograms for 5000 input images with coupling values of *κ* = 0, 0.025, 0.05, 0.1. The red, blue, and green lines represent the arithmetic mean (Decisiveness), geometric mean (Accuracy), and −2/3 mean (Robustness), respectively. When *κ* = 0, the minimal value of the Robustness metric indicates that the original VAE suffers from poor robustness. As *κ* becomes large, the geometric mean and the −2/3 mean metrics start to increase while the arithmetic mean metric mostly stays the same. Since the probability of producing a correct image by uniform random sampling is $\frac{1}{2^{28\times 28}} = 9.8 \times 10^{-237}$, the accuracy achieved by the VAE algorithm is significantly improved, even though the absolute value of the Accuracy metric seems small. As the coupling *κ* increases, the coupled loss function approaches infinity faster. This eventually causes computational errors. For instance, when *κ* = 0.2, the loss function has a computational error at the 53rd epoch; when *κ* = 0.5, the loss function has a computational error at the 8th epoch. Further investigations of the computational bounds of the algorithm are planned. The specific relationship between the coupling *κ* and the probabilities for input images is shown in Table 1. The increased Robustness metric shows that the modified loss does improve the robustness of the reconstructed image. In the next section, we also examine the performance of the divergence between the posterior and prior distributions of the latent layer.

Furthermore, compared with the original VAE model, the geometric mean, which measures the accuracy of the input image likelihood, is larger for the coupled algorithm. The improvement of this metric means that the input images (truth) are assigned to higher likelihoods on average by the coupled VAE model.

The standard deviation *σ* of latent variables **z** is shown in rose plots in Figure 6. The angular position of a bar represents the value of *σ*, clockwise from 0 to 1. The radius of the bar measures the frequency of different *σ* values from 0 to 100. As the coupling *κ* increases, the range and the average value of these standard deviations decrease. To be specific, when *κ* = 0, *σ* of all dimensions in all 5000 batches ranges from 0.09 to 0.72; when *κ* = 0.025, *σ* ranges from 0.02 to 0.3; when *κ* = 0.05, *σ* ranges from 0.001 to 0.09; when *κ* = 0.1, *σ* ranges from 0.00007 to 0.06.

**Figure 4.** (**a**) The MNIST input images and (**b**) the output images generated by the original VAE. (**c**–**e**) The output images generated by the modified coupled VAE model show small improvements in detail and clarity. For instance, the fifth digit in the first row of the input images is '4', but the output image in the original VAE is more like '9' rather than '4', while the coupled VAE method produced '4' correctly.

**Figure 5.** The histograms of likelihood for the reconstruction of the input images with various coupling *κ* values. The red, blue, and green lines represent the arithmetic mean (Decisiveness), geometric mean (Accuracy), and −2/3 mean (Robustness), respectively. The minimal value of the Robustness metric indicates that the original VAE suffers from poor robustness. As *κ* increases, the Robustness and Accuracy improve while the Decisiveness is mostly unchanged.

**Table 1.** The Decisiveness, Accuracy, and Robustness of the reconstruction likelihood as a function of the coupling *κ*.



We note that as the coupling parameter *κ* increases, the variability of the latent space diminishes. One possible method to address this problem is to use a heavy-tailed distribution in the latent layer. Chen et al. [32] and Nelson [23] used the Student's *t* distribution [33] for the latent layer to incorporate heavy-tailed decay.

We choose samples in which the likelihoods of the input images are close to the three metrics and plot the standard deviation *σ* of each dimension of the latent variable **z** of these samples in Figure 7. The red, blue, and green lines represent samples near the Decisiveness, Accuracy, and Robustness, respectively. When *κ* = 0, the standard deviations of **z** range from 0.1 to 0.7. However, as *κ* increases, the values of *σ* fluctuate less and decrease toward 0. Magnified plots are shown to visualize the results further. The general trend is for *σ* to be largest for samples near the Decisiveness, intermediate near the Accuracy, and smallest for samples near the Robustness. An exception is *κ* = 0.025, where *σ* overlaps for samples near the Robustness and Accuracy. The histogram likelihood plots with a two-dimensional latent variable are shown in Figure 8. The increased values of the arithmetic mean metric and the −2/3 mean metric show that the accuracy and robustness of the output MNIST images in the VAE model have been improved, consistent with the result in the 20-D model. While the performance improvements are modest, we will show in Section 6 that they are much more substantial when the algorithm is seeded with corrupted images. First, we provide a visualization of the changes in the latent distribution using two dimensions.

**Figure 6.** Rose plots of the standard deviation *σ* of the latent variables **z**; (**c**) *κ* = 0.05, (**d**) *κ* = 0.1.

**Figure 7.** The standard deviation of latent variable samples near the three generalized mean metrics. The red, blue, and green lines represent samples near the Decisiveness, Accuracy, and Robustness, respectively. As *κ* increases, values of *σ* fluctuate less and decrease toward 0. Magnified plots are shown to visualize the results further.

**Figure 8.** The histogram likelihood plots with a two-dimensional latent variable. Like the 20-D model, the increased values of the arithmetic mean metric and −2/3 mean metric show that the accuracy and robustness of the VAE model have been improved.

#### **5. Visualization of Latent Distribution**

In order to understand the relationship between increasing the coupling of the loss function and the means and standard deviations of the Gaussian model, we examine a two-dimensional model which can be visualized. Compared with the high-dimensional model, the probability likelihoods for the two-dimensional model are lower, indicating that the higher dimensions do improve the model. Nevertheless, as in the 20-dimensional model, the distribution of the likelihood is compressed toward higher values as the coupling increases and can therefore be used to analyze the results further. As the coupling parameter of the loss function is increased, the primary characteristics are a larger likelihood of the input images, means closer to the origin, and smaller standard deviations of the latent variables. As a result, both the robustness and accuracy of the likelihoods increase. To be specific, when *κ* increases from 0 to 0.075, the geometric mean metric increases from $1.20 \times 10^{-63}$ to $4.67 \times 10^{-55}$, and the −2/3 mean metric increases from $5.03 \times 10^{-170}$ to $5.17 \times 10^{-144}$, while the arithmetic metric does not change very much. In this case, the reconstructed images have a higher probability of replicating the input image using the coupled VAE method.

The rose plots in Figure 9 show that the range and variability of the mean values of latent variables decrease as the coupling *κ* increases. From the view of means, the posterior distribution of the latent space is closer to the prior, the standard Gaussian distribution. From the view of standard deviations, the posterior distribution of the latent space is further from the prior.

**Figure 9.** Rose plots of the mean (top four panels) and standard deviation (bottom four panels) values in 2 dimensions. The range of the means is reduced and the mean values become closer to 0 as the coupling increases.

The latent space plots shown in Figure 10 are the visualizations of images of the numerals from 0 to 9. Images are embedded in a 2D map where the axis is the values of the 2D latent variable. The same color represents images that belong to the same numeral, and they cluster together since they have higher similarity to each other. The distances between spots represent the similarities of images. The latent space plots show that the different clusters shrink together more tightly when coupling becomes larger. The plots shown in Figure 11 are the visualizations of the learned data manifold generated by the decoder network of the coupled VAE model. A grid of values from a two-dimensional Gaussian distribution is sampled. The distinct digits each exist in different regions of the latent space and smoothly transform from one digit to another. This smooth transformation can be quite useful when the interpolation between two observations is needed. Additionally, the distribution of distinct digits in the plot becomes more even, and the sharpness of the digits increases when *κ* increases.

**Figure 10.** The plot of the latent space of VAE trained for 200 epochs on MNIST with various *κ* values. Different numerals cluster together more tightly as coupling *κ* increases.

**Figure 11.** The plot of visualization of learned data manifold for generative models with the axes as the values of each dimension of latent variables. The distinct digits each exist in different regions of the latent space and smoothly transform from one digit to another.

As shown in Table 2, as the coupling increases from 0 to 0.075, the negative ELBO (the loss) decreases from 172.3 to 146.7, the coupled KL-divergence decreases from 5.8 to 5.6, and the coupled reconstruction loss decreases from 166.5 to 141.1. This shows that the reconstruction loss plays a dominant role (with a proportion over 96%), while the divergence term has a much lower effect (with a proportion under 4%) in the loss function. The overall improvement of the coupled loss is based on both the smaller coupled KL-divergence and the smaller coupled reconstruction error, rather than a trade-off between them. There is a high degree of variability in this improvement, so there are reasons to be cautious about its extent. In addition, since the coupled loss function adjusts the metric, the property being measured also changes. Part of our future research plan is to explore how the relative performance between the reconstruction and the latent space can be compared.

**Table 2.** Components of coupled ELBO with a 2-dimensional latent layer under different values of coupling. The improvement in the coupled KL-divergence is very slight, while it is larger for the coupled reconstruction loss.


#### **6. Performance with Corrupted Images**

We also evaluate the performance of the coupled VAE algorithm when seeded by images from the corrupted MNIST (C-MNIST) dataset [34]. The reconstructed images under five different corruptions (Gaussian, glass blur, impulse noise, shot noise, and shear) with two coupling values, *κ* = 0.0 and *κ* = 0.1, are shown in Figure 12. Based on the visualization of the generated images, the qualitative improvement in clarity from the coupling is modest.

We also conduct further analyses of the performance of the coupled VAE under each corruption. For the MNIST images with Gaussian corruption, as shown in Figure 13, when the coupling parameter *κ* increases, all three metrics—robustness, central tendency, and decisiveness—increase. Robustness improves the most, central tendency is next, and decisiveness improves the least. Furthermore, we confirm that the reconstruction improvement is not a trade-off against the latent distribution divergence, as shown in Table 3. This is in contrast to the *β*-VAE [11] method, which merely alters the weighting between the reconstruction and divergence components of the negative ELBO cost function.

In Table 3, analyses of the components of the coupled ELBO are provided. Comparisons as the coupling changes are somewhat delicate because the metric itself is changing; as the coupling increases, performance therefore becomes more difficult to measure. Nevertheless, even with this caveat, there is still an overall tendency towards improved performance. The second column shows that the coupled KL-divergence initially increases when moving away from the standard VAE design at *κ* = 0; however, it then steadily decreases with increasing *κ*. This may be due to the distinct difference between the logarithm and even a slight deviation from the logarithm. The coupled reconstruction loss (column three) shows steady improvement. The overall negative coupled ELBO shows consistent improvement as the coupling increases. The relative importance of the divergence and reconstruction varies as the coupling increases, but in each case it is approximately a 15% to 85% relative weighting.

The improvement of the three metrics under glass blur, impulse noise, shot noise and shear corruption is also observed, as shown in Figures 14–17, respectively. As with the Gaussian corruption, all three metrics gradually increase as the coupling parameter *κ* increases from 0 to 0.1. The respective analyses of the components of the coupled ELBO under glass blur, impulse noise, shot noise and shear corruption are provided in Tables 4–7. The four corruptions share consistent results: the coupled KL-divergence initially increases when moving away from the standard VAE design (for *κ* ≤ 0.025), but it then steadily decreases with increasing *κ*. The overall negative coupled ELBO shows consistent improvement as *κ* increases. This means that if the coupling parameter is relatively large (> 0.025), both the KL-divergence and the reconstruction loss are improved, so the overall improvement of the algorithm is not a trade-off between reconstruction accuracy and latent distribution divergence.

**Table 3.** The components of the coupled ELBO under **Gaussian** corruptions are provided in the table. The coupled KL-divergence initially increases when moving away from the standard VAE design with *κ* = 0 to *κ* = 0.025, however, it then steadily decreases with increasing *κ*. The coupled reconstruction loss (column three) shows steady improvement. The overall negative coupled ELBO shows consistent improvement as the coupling increases. The relative importance of the divergence and reconstruction varies as the coupling increases but in each case it is approximately a 15% to 85% relative weighting.


**Figure 12.** The images with 5 different corruptions are shown in the first row. The reconstructed images when *κ* = 0.0 and *κ* = 0.1 are shown in the second and third rows, respectively. The qualitative visual improvement in clarity using the coupling is modest.

**Figure 13.** The histograms of marginal likelihood for the MNIST images with **Gaussian** corruption are shown. All three metrics increase as the coupling parameter *κ* increases. The robustness improves the most, central tendency is the next, and decisiveness has the least improvement. From *κ* = 0.0 to *κ* = 0.1, the Robustness improves from 10<sup>−109.2</sup> to 10<sup>−87.0</sup>, the Accuracy improves from 10<sup>−57.2</sup> to 10<sup>−42.9</sup>, and the Decisiveness improves from 10<sup>−16.8</sup> to 10<sup>−13.6</sup>.

**Figure 14.** The histograms of marginal likelihood for the MNIST images with **glass blur** corruption are shown. All the three metrics increase as the coupling parameter *κ* increases from 0 to 0.1.

**Table 4.** The components of the coupled ELBO under **glass blur** corruptions are provided in the table. The coupled KL-divergence initially increases when moving away from the standard VAE design with *κ* ≤ 0.025, but it then steadily decreases with increasing *κ*. The coupled reconstruction loss shows steady improvement. The overall negative coupled ELBO shows consistent improvement as *κ* increases.


**Figure 15.** The histograms of marginal likelihood for the MNIST images with **impulse noise** corruption are shown. All the three metrics increase as the coupling *κ* increases from 0 to 0.1.

**Table 5.** The components of the coupled ELBO under **impulse noise** corruptions are provided in the table. The coupled KL-divergence initially increases when moving away from the standard VAE design with *κ* ≤ 0.025, but it then steadily decreases with increasing *κ*. The overall negative coupled ELBO shows consistent improvement as *κ* increases.


**Figure 16.** The histograms of marginal likelihood for the MNIST images with **shot noise** corruption are shown. All the three metrics increase as the coupling parameter *κ* increases from 0 to 0.1.

**Table 6.** The components of the coupled ELBO under **shot noise** corruptions are provided in the table. The coupled KL-divergence increases when moving away from the standard VAE design with *κ* ≤ 0.025, but it then steadily decreases with increasing *κ*. The coupled reconstruction loss shows steady improvement. The overall negative coupled ELBO shows consistent improvement as *κ* increases.


**Figure 17.** The histograms of marginal likelihood for the MNIST images with **shear** corruption are shown. All the three metrics increase as the coupling parameter *κ* increases from 0 to 0.1.

**Table 7.** The components of the coupled ELBO under **shear** corruptions are provided. The coupled KL-divergence increases when moving away from the standard VAE design with *κ* ≤ 0.025, but it then steadily decreases with increasing *κ*. The coupled reconstruction loss shows steady improvement. The overall negative coupled ELBO shows consistent improvement as *κ* increases.


#### **7. Discussion and Conclusions**

This investigation sought to determine whether the accuracy and robustness of variational autoencoders can be improved using certain statistical methods developed within the area of complex systems theory. Our investigation provides evidence that the tail shape of the negative evidence lower bound can be controlled in such a way that the cost of outlier events is adjustable. We refer to this method as a coupled VAE, since the control parameter models the nonlinear deviation from the exponential and logarithmic functions of linear analysis. A positive coupling parameter increases the cost of these tail events and thereby trains the algorithm to be robust against such outliers. Additionally, this improves both the accuracy of reconstructed images and reduces the divergence of the posterior latent distribution from the prior. We have been able to document this improvement using the histogram of the reconstructed marginal likelihoods. Metrics of the histogram are formed from the arithmetic mean, geometric mean, and −2/3 mean, which represent Decisiveness, Accuracy, and Robustness, respectively. Both the accuracy and the robustness are improved by increasing the coupling of the loss function. There is a limit to such increases in the coupling beyond which the training process no longer converges.
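As a concrete illustration of these three metrics, the following sketch computes the arithmetic, geometric, and −2/3 generalized means of a set of reconstruction likelihoods. The likelihood values and the function name are illustrative placeholders, not values from the paper:

```python
import numpy as np

def generalized_mean(p, r):
    """Power mean of positive likelihoods p: r = 1 gives the arithmetic mean
    (Decisiveness), r -> 0 the geometric mean (Accuracy), and r = -2/3 a
    robustness-weighted mean (Robustness)."""
    p = np.asarray(p, dtype=float)
    if r == 0:
        return float(np.exp(np.mean(np.log(p))))   # geometric-mean limit
    return float(np.mean(p ** r) ** (1.0 / r))

# Hypothetical reconstruction likelihoods for four test images,
# including one outlier (1e-6).
likelihoods = [1e-3, 2e-3, 5e-4, 1e-6]

decisiveness = generalized_mean(likelihoods, 1.0)       # dominated by large values
accuracy = generalized_mean(likelihoods, 0.0)           # central tendency
robustness = generalized_mean(likelihoods, -2.0 / 3.0)  # penalizes the outlier
```

Because the power mean is nondecreasing in its power, the three metrics always satisfy Decisiveness ≥ Accuracy ≥ Robustness, with the gaps widening as outliers appear.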

These performance improvements have been evaluated for the MNIST handwritten numeral dataset and its corrupted modification C-MNIST. We used a two-layer dense neural network for the encoder/decoder. The latent layer is a 20-dimensional Gaussian distribution, and for visualization a 2-dimensional distribution was also examined. Without the corruption, we observed improvements in both components of the negative coupled ELBO loss function, namely the image reconstruction loss (marginal likelihood) and the latent distribution (divergence between the prior and posterior). Thus, the coupled VAE is able to improve the model representation, rather than just trading off reconstruction and divergence performance, as does the highly cited *β*-VAE design. The likelihood of the reconstructed image matching the original improves in Accuracy by 10<sup>10</sup> and in Robustness by 10<sup>8</sup> when the coupling parameter is increased from *κ* = 0 (the standard VAE) to *κ* = 0.1 (the largest value of the coupled VAE reported). The Decisiveness did not change significantly, though negative values of the coupling could potentially influence this metric. The performance improvements when the algorithm is evaluated on the C-MNIST dataset are far more significant, demonstrating the improved stability of the algorithm. All five corruptions examined (Gaussian, glass blur, impulse noise, shot noise, and shear) show significant improvement in Robustness and Accuracy and some improvement in Decisiveness. For example, under the Gaussian corruption, the improvements in the reconstruction likelihood are 10<sup>14</sup> for Accuracy and 10<sup>20</sup> for Robustness when the coupling parameter is increased from *κ* = 0 (the standard VAE) to *κ* = 0.1.
The significant improvement in Robustness using the corrupted MNIST dataset demonstrates that the coupled negative ELBO cost function reduces the risk of overfitting by forcing the network to learn general solutions that are less likely to create outliers.

The modifications of the latent posterior distributions have been further examined using a two-dimensional representation. We show that the latent variables have both a tighter distribution of the mean about its prior value of zero, and a movement of standard deviations towards zero, away from the prior of one, as coupling *κ* increases. Overall, the coupled KL-divergence does indeed decrease as the coupling is increased, indicating improvement in the latent representation. Thus, improvements in the reconstruction evident from both visual clarity of images and increased accuracy in measured likelihoods are not due to a trade-off with the latent representation. Rather, the negative coupled ELBO metric shows improvement in both latent layer divergence and output image reconstruction. This improvement in the two components of the evidence lower bound provides evidence that the coupled VAE improves the approximate variational inference of the model.

In future research, we plan to study the coupled Gaussian distribution as the prior and posterior distribution of the latent layer. This may be helpful for achieving greater separation between the images into distinct clusters similar to what has been achieved with t-stochastic neighborhood embedding methods [35]. If so, it may be possible to improve the decisiveness of the likelihoods in addition to further improvements in the accuracy and robustness. Since our approach generalizes the training of the decoder and encoder networks, it is expected to be seamlessly applicable to other datasets and neural network architectures. We are conducting research to apply our method to a convolutional neural network design that can process more complex datasets such as CIFAR-10. This first demonstration of the coupled ELBO cost function has provided experimental results applied to a shallow neural network but the approach is also applicable to the training of deep neural networks.

**Author Contributions:** S.C. implemented the coupled VAE algorithm programming structure, generated essential output data for analysis, and drafted and modified the paper. J.L. conducted statistical analysis of the results and derived the coupled ELBO based on the concept of nonlinear statistical coupling. K.P.N. originated the concept of nonlinear statistical coupling and mentored the team in applying the methods to the design of a variational autoencoder. M.A.K. provided oversight of the statistical analysis and the writing of the results, discussion, and conclusions. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The algorithm and data can be accessed at https://github.com/Photrek/Coupled-VAE-Improved-Robustness-and-Accuracy-of-a-Variational-Autoencoder, accessed on 5 February 2022.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A**

## *Appendix A.1. Derivation of Negative Coupled ELBO*

Generalizing the negative ELBO is accomplished using the principles of nonlinear statistical coupling (NSC) to generalize information theory. As described in Section 2.1, the negative ELBO consists of two components, the KL-divergence between the prior and posterior latent distribution, and the cross-entropy or negative log-likelihood of the reconstructed image in relation to the original image. NSC is an approach to modeling the statistics of complex systems that unifies heavy-tailed distributions, generalized information metrics, and fusion of information. Its application to the cost functions of a VAE provides control over the trade-off between decisive and robust generative models. Decisive refers to the characteristic of confident probabilities and robust refers to the characteristic of dampening extremes in the probabilities.

In the VAE algorithm, the loss function consists of the KL-divergence between the posterior approximation $q\left(\mathbf{z}|\mathbf{x}^{(i)}\right)$ and a prior $p(\mathbf{z})$, and the cross-entropy between the reported probabilities and the training sample distribution.

$$\mathcal{L}\left(\mathbf{x}^{(i)}\right) = D\_{KL}\left(q\left(\mathbf{z}|\mathbf{x}^{(i)}\right) \parallel p(\mathbf{z})\right) - \frac{1}{L} \sum\_{l=1}^{L} \left(\log p\left(\mathbf{x}^{(i)}|\mathbf{z}^{(i,l)}\right)\right),\tag{A1}$$

where *L* is the number of reconstructions per test sample, and the KL-divergence is given by

$$D\_{KL}\left(q\left(\mathbf{z}|\mathbf{x}^{(i)}\right)\parallel p(\mathbf{z})\right) = \int q\left(\mathbf{z}|\mathbf{x}^{(i)}\right)\left(\log q\left(\mathbf{z}|\mathbf{x}^{(i)}\right) - \log p(\mathbf{z})\right)d\mathbf{z}.\tag{A2}$$

Even though **x**<sup>(*i*)</sup> given **z** is a grayscale value, which is not Bernoulli distributed, we can still use the probability mass function of the Bernoulli distribution; the cross-entropy term is then given by

$$-\frac{1}{L}\sum\_{l=1}^{L}\left(\log p\left(\mathbf{x}^{(i)}|\mathbf{z}^{(i,l)}\right)\right) = -\frac{1}{L}\sum\_{l=1}^{L}\sum\_{i=1}^{n\_x}\left[x\_i\log y\_i + (1-x\_i)\log(1-y\_i)\right],\tag{A3}$$

where **y** = Sigmoid(**f**<sub>2</sub>(tanh(**f**<sub>1</sub>(**z**)))), with *f*<sub>1</sub> and *f*<sub>2</sub> linear models and *n<sub>x</sub>* the dimensionality of **x**.
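A minimal NumPy sketch of this decoder map follows; the layer sizes and random weights are illustrative placeholders, not the trained network of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the paper's network uses larger layers.
n_z, n_h, n_x = 2, 8, 16

# f1 and f2 are affine ("linear") maps; random placeholder weights.
W1, b1 = rng.normal(size=(n_h, n_z)), np.zeros(n_h)
W2, b2 = rng.normal(size=(n_x, n_h)), np.zeros(n_x)

def decoder(z):
    """y = Sigmoid(f2(tanh(f1(z)))): the Bernoulli mean of each pixel."""
    h = np.tanh(W1 @ z + b1)                 # f1, then tanh activation
    logits = W2 @ h + b2                     # f2
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid keeps outputs in (0, 1)

y = decoder(rng.normal(size=n_z))            # one decoded sample
```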

The negative ELBO loss function is modified by coupled generalizations of the KL-divergence and cross-entropy. The purpose is to increase the weighting of rare events in the training dataset and thereby improve the robustness of the VAE model. The connection with the assessment metrics defined in Section 3.1 is that the power of the generalized mean can be decomposed into functions of the coupling and a second parameter *α*, related to the power in the distribution of the random variable. For Gaussians and their generalizations, known as coupled Gaussians, *α* = 2. Making use of $r(\kappa, \alpha, d) = \frac{-\alpha\kappa}{1+d\kappa}$ with *α* = 2 and *d* = 1, the generalized mean is

$$\left(\sum\_i p\_i^{1+r}\right)^{\frac{1}{r}} = \left(\sum\_i p\_i^{1-\frac{2\kappa}{1+\kappa}}\right)^{-\frac{1+\kappa}{2\kappa}}.$$

When the coupling *κ* → 0, the generalized mean is asymptotically equal to the geometric mean.
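The coupled logarithm and the equally weighted generalized mean can be sketched numerically, and the κ → 0 limits checked directly. The function names are ours; the equal-weight form below matches the assessment metrics rather than the probability-weighted mean:

```python
import numpy as np

def coupled_log(x, kappa):
    """ln_kappa(x) = (x**kappa - 1) / kappa; reduces to ln(x) as kappa -> 0."""
    if kappa == 0:
        return np.log(x)
    return (x ** kappa - 1.0) / kappa

def coupled_mean(p, kappa):
    """Equally weighted generalized mean with risk bias r = -2k/(1+k)."""
    p = np.asarray(p, dtype=float)
    if kappa == 0:
        return float(np.exp(np.mean(np.log(p))))   # geometric mean at kappa = 0
    r = -2.0 * kappa / (1.0 + kappa)
    return float(np.mean(p ** r) ** (1.0 / r))
```

For small κ the coupled mean is numerically indistinguishable from the geometric mean, in line with the asymptotic statement above.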

The coupled entropy function takes the form of a generalized logarithmic function applied to the generalized mean [22].

$$H\_{\kappa}(\mathbf{p}) \equiv \frac{1}{2} \ln\_{\kappa}\left( \left( \sum\_i p\_i^{1 + \frac{2\kappa}{1+\kappa}} \right)^{\frac{-1}{\kappa}} \right) = \sum\_i \frac{p\_i^{1 + \frac{2\kappa}{1+\kappa}}}{2 \sum\_j p\_j^{1 + \frac{2\kappa}{1+\kappa}}} \ln\_{\kappa}\left( p\_i^{-\frac{2}{1+\kappa}} \right) = \frac{1}{2\kappa}\left( \left( \sum\_i p\_i^{1 + \frac{2\kappa}{1+\kappa}} \right)^{-1} - 1 \right),\tag{A4}$$

where ln*κ*(*x*) is the generalization of the logarithm function in Equation (A15).

Similar to the generalization of the coupled entropy function, the generalized logarithm is applied to the KL-divergence. The first term in the KL-divergence becomes

$$-\int q(\mathbf{z}|\mathbf{x}^{(i)}) \log q(\mathbf{z}|\mathbf{x}^{(i)})\, d\mathbf{z} \Rightarrow \frac{1}{2} \prod\_{j=1}^{n\_z} \int \frac{q(z\_j|\mathbf{x}^{(i)})^{1 + \frac{2\kappa}{1+\kappa}}}{\int q(z\_j|\mathbf{x}^{(i)})^{1 + \frac{2\kappa}{1+\kappa}}\, dz\_j} \ln\_{\kappa}\left( q(z\_j|\mathbf{x}^{(i)})^{-\frac{2}{1+\kappa}} \right) dz\_j,\tag{A5}$$

and the second term in KL-divergence becomes

$$-\int q(\mathbf{z}|\mathbf{x}^{(i)}) \log p(\mathbf{z})\, d\mathbf{z} \Rightarrow \frac{1}{2} \prod\_{j=1}^{n\_z} \int \frac{q(z\_j|\mathbf{x}^{(i)})^{1 + \frac{2\kappa}{1+\kappa}}}{\int q(z\_j|\mathbf{x}^{(i)})^{1 + \frac{2\kappa}{1+\kappa}}\, dz\_j} \ln\_{\kappa}\left( p(z\_j)^{-\frac{2}{1+\kappa}} \right) dz\_j.\tag{A6}$$

Therefore, the coupled divergence with *nz* as the dimensionality of **z** can be written as

$$\begin{split} &D\_{\kappa}\left(q(\mathbf{z}|\mathbf{x}^{(i)}) \parallel p(\mathbf{z})\right) \\ =& \prod\_{j=1}^{n\_z} \int \frac{q(z\_j|\mathbf{x}^{(i)})^{1+\frac{2\kappa}{1+\kappa}}}{\int q(z\_j|\mathbf{x}^{(i)})^{1+\frac{2\kappa}{1+\kappa}}\,dz\_j} \, \frac{1}{2}\left( \ln\_{\kappa}\left(q(z\_j|\mathbf{x}^{(i)})^{-\frac{2}{1+\kappa}}\right) - \ln\_{\kappa}\left(p(z\_j)^{-\frac{2}{1+\kappa}}\right) \right) dz\_j \\ =& \prod\_{j=1}^{n\_z} \frac{1}{2\kappa} \int \frac{\left(\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(z\_j-\mu\_j)^2}{2\sigma^2}}\right)^{1+\frac{2\kappa}{1+\kappa}}}{\int \left(\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(z\_j-\mu\_j)^2}{2\sigma^2}}\right)^{1+\frac{2\kappa}{1+\kappa}} dz\_j} \left( \left(\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(z\_j-\mu\_j)^2}{2\sigma^2}}\right)^{-\frac{2\kappa}{1+\kappa}} - \left(\frac{1}{\sqrt{2\pi}} e^{-\frac{z\_j^2}{2}}\right)^{-\frac{2\kappa}{1+\kappa}} \right) dz\_j \end{split} \tag{A7}$$

The original cross-entropy can also be modified in a similar way. Applying the generalization of the logarithmic function, the terms $\log(y\_i)$ and $\log(1-y\_i)$ are modified to $\frac{1}{2}\ln\_{\kappa}\left((y\_i)^{\frac{2}{1+\kappa}}\right)$ and $\frac{1}{2}\ln\_{\kappa}\left((1-y\_i)^{\frac{2}{1+\kappa}}\right)$, thus

$$\log p\left(\mathbf{x}^{(i)} | \mathbf{z}^{(i,l)}\right) \Rightarrow \sum\_{i=1}^{n\_x} \left( x\_i \frac{1}{2}\ln\_{\kappa}\left((y\_i)^{\frac{2}{1+\kappa}}\right) + (1-x\_i)\frac{1}{2}\ln\_{\kappa}\left((1-y\_i)^{\frac{2}{1+\kappa}}\right)\right).\tag{A8}$$

Therefore, the coupled cross-entropy is the generalization of the cross-entropy term in Equation (A14), which is defined as

$$H\_{\kappa}(\mathbf{x}, \mathbf{y}) \equiv -\frac{1}{2L} \sum\_{l=1}^{L} \sum\_{i=1}^{n\_x} \left( x\_i \ln\_{\kappa}\left( (y\_i)^{\frac{2}{1+\kappa}} \right) + (1 - x\_i) \ln\_{\kappa}\left( (1 - y\_i)^{\frac{2}{1+\kappa}} \right) \right). \tag{A9}$$

Adding Equations (A7) and (A9) gives the negative coupled ELBO,

$$\mathcal{L}\_{\kappa}\left(\mathbf{x}^{(i)}\right) = D\_{\kappa}\left(q\left(\mathbf{z}|\mathbf{x}^{(i)}\right) \parallel p(\mathbf{z})\right) + H\_{\kappa}(\mathbf{x}, \mathbf{y}),\tag{A10}$$

as defined in Equations (10)–(12).
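A sketch of the coupled cross-entropy of Equation (A9) for a single reconstruction (*L* = 1), built on the coupled logarithm of Equation (A15); at κ = 0 it reduces to the standard binary cross-entropy. Function names are ours:

```python
import numpy as np

def coupled_log(x, kappa):
    """ln_kappa(x) = (x**kappa - 1) / kappa (Equation (A15))."""
    if kappa == 0:
        return np.log(x)
    return (x ** kappa - 1.0) / kappa

def coupled_cross_entropy(x, y, kappa):
    """Coupled binary cross-entropy of Equation (A9) for one sample (L = 1).

    x holds the target pixel values in [0, 1], y the decoder outputs in
    (0, 1); kappa = 0 recovers the standard cross-entropy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = 2.0 / (1.0 + kappa)
    return float(-np.sum(x * coupled_log(y ** a, kappa) / 2.0
                         + (1.0 - x) * coupled_log((1.0 - y) ** a, kappa) / 2.0))
```

A positive κ makes the coupled logarithm diverge faster as its argument approaches zero, which is how tail (low-likelihood) pixels incur a larger cost during training.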

#### *Appendix A.2. Origin of the Generalized Probability Metrics*

The generalized probability metrics derive from a translation of a generalized entropy function back to the probability domain. Use of the geometric mean for Accuracy derives from the Boltzmann–Gibbs–Shannon entropy, which measures the average uncertainty of a system and is equal to the arithmetic average of the negative logarithm of the probability distribution,

$$H(\mathbf{P}) \equiv -\sum\_{i=1}^{N} p\_i \ln p\_i = -\ln \left( \prod\_{i=1}^{N} p\_i^{p\_i} \right). \tag{A11}$$

Translating the entropy back to the probability domain via the inverse of the negative logarithm, which is the exponential of the negative, results in the weighted geometric mean of the probabilities

$$P\_{\text{avg}} \equiv \exp\left(-H(\mathbf{P})\right) = \exp\left(\ln\left(\prod\_{i=1}^{N} p\_i^{p\_i}\right)\right) = \prod\_{i=1}^{N} p\_i^{p\_i}.\tag{A12}$$

The role of this function in defining the central tendency of the y-axis of a density is illustrated with the Gaussian distribution. Utilizing the continuous definition of entropy for a density *f*(*x*) for a random variable *x*, the neutral accuracy or central tendency of the density is

$$f\_{\text{avg}} \equiv \exp(-H(f(\mathbf{x}))) = \exp\left(\int\_{\mathcal{X}} f(\mathbf{x}) \ln f(\mathbf{x}) d\mathbf{x}\right). \tag{A13}$$

For the Gaussian, the average density equals the density at one standard deviation from the mean, *f*(*μ* ± *σ*).

The use of the geometric mean as a metric for the neutral accuracy in the previous section is related to the cross-entropy between the reported probability of the algorithm and the probability distribution of the test set. The cross-entropy between a 'quoted' or predicted probability distribution **q** and the distribution of the test set **p** is

$$H(\mathbf{p}, \mathbf{q}) \equiv -\sum\_{i} p\_i \ln q\_i. \tag{A14}$$

In evaluating an algorithm, the actual distribution is defined by the test samples which, for equally probable independent samples, each have probability $p\_i = \frac{1}{N}$. Translated to the probability domain, the cross-entropy becomes the geometric mean of the reported probabilities (8), thus showing that the use of the geometric mean of the probabilities as a measure of Accuracy for reported probabilities is equivalent to the use of cross-entropy as a metric of forecasting performance.

Likewise, the use of the generalized mean as a metric for Robustness and Decisiveness derives from a generalization of the cross-entropy. While a variety of generalizations of information theory have been proposed, in [22,36–38] the Rényi and Tsallis entropies were both shown to translate to a generalized mean upon transformation to the probability domain. Here, we show that the derivation of this transformation uses the coupled entropy, which derives from the Tsallis entropy but utilizes a modified normalization. The nonlinear statistical coupling (or simply the coupling) has been shown to (a) quantify the relative variance of a superstatistics model in which the variance of an exponential distribution fluctuates according to a gamma distribution, and (b) be equal to the inverse of the degree of freedom of the Student's *t* distribution. The coupling is related to the risk bias by the expression $r = \frac{-2\kappa}{1+\kappa}$, where the numeral 2 is associated with the power 2 of the Student's *t* distribution, and the ratio itself reflects a duality between the positive and negative domains of the coupling. The coupled entropy uses a generalization of the logarithmic function,

$$\ln\_{\kappa}(x) \equiv \frac{1}{\kappa}\left(x^{\kappa} - 1\right), \quad x > 0,\tag{A15}$$

which provides a continuous family of power functions. The coupled entropy aggregates the probabilities of a distribution using the generalized mean and translates this to the entropy domain using the generalized logarithm. Using equiprobable sample probabilities, $p\_i = \frac{1}{N}$, the coupled cross-entropy 'score' for the forecasted probabilities **q** of the event labels **e** is

$$S\_{\kappa}(\mathbf{e}, \mathbf{q}) \equiv \ln\_{\kappa}\left( \left( \left( \frac{1}{N} \sum\_{i=1}^{N} q\_i^{\frac{-2\kappa}{1+\kappa}} \right)^{\frac{-(1+\kappa)}{2\kappa}} \right)^{\frac{-2}{1+\kappa}} \right) = \frac{1}{\kappa}\left( \frac{1}{N} \sum\_{i=1}^{N} q\_i^{\frac{-2\kappa}{1+\kappa}} - 1 \right), \tag{A16}$$

where *q<sub>i</sub>* is the probability of the event *e<sub>i</sub>* that occurred. Thus, the coupled cross-entropy is a local scoring rule, dependent only on the probabilities of the actual events.
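A numerical sketch of the score in Equation (A16); as κ → 0 it approaches twice the mean negative log-likelihood, the factor 2 reflecting *α* = 2. The function name is ours:

```python
import numpy as np

def coupled_score(q, kappa):
    """Coupled cross-entropy score of Equation (A16), computed from the
    probabilities q_i that were assigned to the events that occurred.
    As kappa -> 0 it approaches twice the mean negative log-likelihood
    (the factor 2 reflects alpha = 2)."""
    q = np.asarray(q, dtype=float)
    if kappa == 0:
        return float(-2.0 * np.mean(np.log(q)))
    m = np.mean(q ** (-2.0 * kappa / (1.0 + kappa)))
    return float((m - 1.0) / kappa)
```

Because the score depends only on the probabilities of the events that actually occurred, it is local in the scoring-rule sense noted above.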

#### **References**


## *Article* **Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly**

**Le Li <sup>1</sup> and Benjamin Guedj 2,\***


**\*** Correspondence: b.guedj@ucl.ac.uk

**Abstract:** When confronted with massive data streams, summarizing data with dimension reduction methods such as PCA raises theoretical and algorithmic pitfalls. A principal curve acts as a nonlinear generalization of PCA, and the present paper proposes a novel algorithm to automatically and sequentially learn principal curves from data streams. We show that our procedure is supported by regret bounds with optimal sublinear remainder terms. A greedy local search implementation (called slpc, for sequential learning principal curves) that incorporates both sleeping experts and multi-armed bandit ingredients is presented, along with its regret computation and performance on synthetic and real-life data.

**Keywords:** sequential learning; principal curves; data streams; regret bounds; greedy algorithm; sleeping experts

## **1. Introduction**

Numerous methods have been proposed in the statistics and machine learning literature to sum up information and represent data by condensed and simpler-to-understand quantities. Among those methods, principal component analysis (PCA) aims at identifying the maximal-variance axes of data. This serves as a way to represent data in a more compact fashion and, hopefully, to reveal their variability as well as possible. PCA was introduced by [1,2] and further developed by [3]. It is one of the most widely used procedures in multivariate exploratory analysis targeting dimension reduction or feature extraction. Nonetheless, PCA is a linear procedure, and the need for more sophisticated nonlinear techniques has led to the notion of principal curve. Principal curves may be seen as a nonlinear generalization of the first principal component. The goal is to obtain a curve which passes "in the middle" of the data, as illustrated by Figure 1. This notion of skeletonization of data clouds has been at the heart of numerous applications in many different domains, such as physics [4,5], character and speech recognition [6,7], and mapping and geology [5,8,9], to name but a few.

**Figure 1.** A principal curve.

## *1.1. Earlier Works on Principal Curves*

The original definition of a principal curve dates back to [10]. A principal curve is a smooth (*C*<sup>∞</sup>) parameterized curve $\mathbf{f}(s) = (f\_1(s), \ldots, f\_d(s))$ in $\mathbb{R}^d$ which does not intersect

**Citation:** Li, L.; Guedj, B. Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly. *Entropy* **2021**, *23*, 1534. https:// doi.org/10.3390/e23111534

Academic Editor: Mohamed Medhat Gaber

Received: 22 August 2021 Accepted: 1 November 2021 Published: 18 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

itself, has finite length inside any bounded subset of $\mathbb{R}^d$ and is self-consistent. This last requirement means that $\mathbf{f}(s) = \mathbb{E}[X \mid s\_{\mathbf{f}}(X) = s]$, where $X \in \mathbb{R}^d$ is a random vector and the so-called projection index $s\_{\mathbf{f}}(x)$ is the largest real number *s* minimizing the squared Euclidean distance between **f**(*s*) and *x*, defined by

$$s\_{\mathbf{f}}(x) = \sup \left\{ s : \left\| x - \mathbf{f}(s) \right\|\_{2}^{2} = \inf\_{\tau} \left\| x - \mathbf{f}(\tau) \right\|\_{2}^{2} \right\}.$$

Self-consistency means that each point of **f** is the average (under the distribution of *X*) of all data points projected on **f**, as illustrated by Figure 2.

**Figure 2.** A principal curve and projections of data onto it.
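The projection index can be approximated numerically by discretizing the curve. The following sketch (the example curve, point, and grid are illustrative) returns the largest minimizer, per the definition above:

```python
import numpy as np

def projection_index(x, f, s_grid):
    """Approximate the projection index s_f(x) on a discretized curve.

    f maps a parameter s to a point of R^d; s_grid holds candidate
    parameter values. Per the definition, the LARGEST s attaining the
    minimal squared Euclidean distance is returned."""
    pts = np.stack([np.asarray(f(s), dtype=float) for s in s_grid])
    d2 = np.sum((pts - np.asarray(x, dtype=float)) ** 2, axis=1)
    minimizers = np.isclose(d2, d2.min())    # all (near-)minimizing s values
    return float(np.asarray(s_grid)[minimizers][-1])  # largest one

# Example: project the point (1, 1) onto the parabola f(s) = (s, s^2);
# the exact projection index here is s = 1.
s_star = projection_index((1.0, 1.0), lambda s: (s, s ** 2),
                          np.linspace(0.0, 2.0, 2001))
```

Taking the largest minimizer matters when several curve points are equidistant from *x*, which is exactly the ambiguity the sup in the definition resolves.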

However, an unfortunate consequence of this definition is that existence is not guaranteed in general for a particular distribution, let alone for an online sequence for which no probabilistic assumption is made. In order to handle complex data structures, Ref. [11] proposed principal curves (PCOP) of principal oriented points (POPs), which are defined as the fixed points of an expectation function of points projected onto a hyperplane minimizing the total variance. To obtain POPs, a cluster analysis is performed on the hyperplane and only data in the local cluster are considered. Ref. [12] introduced the local principal curve (LPC), whose concept is similar to that of [11], but which accelerates the computation of POPs by calculating local centers of mass instead of performing cluster analysis, and local principal components instead of principal directions. Later, Ref. [13] also considered LPC in data compression and regression, to reduce the dimension of the predictor space to a low-dimensional manifold. Ref. [14] extended the idea of localization to independent component analysis (ICA) by proposing a local-to-global nonlinear ICA framework for visual and auditory signals. Ref. [15] considered principal curves from a different perspective: as the ridge of a smooth probability density function (PDF) generating the dataset, where the ridge is the collection of all points at which the local gradient of the PDF is an eigenvector of the local Hessian and the eigenvalues corresponding to the remaining eigenvectors are negative. To estimate principal curves based on this definition, the subspace constrained mean shift (SCMS) algorithm was proposed. All the local methods above require strong assumptions on the PDF, such as twice continuous differentiability, which may be difficult to satisfy in the setting of online sequential data. Ref. [16] proposed a new concept of principal curves which ensures their existence for a large class of distributions.
Principal curves $\mathbf{f}^\star$ are defined as the curves minimizing the expected squared distance over a class $\mathcal{F}\_L$ of curves whose length is smaller than *L* > 0; namely,

$$\mathbf{f}^\star \in \underset{\mathbf{f} \in \mathcal{F}\_L}{\arg\inf}\, \Delta(\mathbf{f}),$$

where

$$\Delta(\mathbf{f}) = \mathbb{E}[\Delta(\mathbf{f}, X)] = \mathbb{E}\left[\inf\_{s} \|\mathbf{f}(s) - X\|\_2^2\right].$$

If $\mathbb{E}\|X\|\_2^2 < \infty$, $\mathbf{f}^\star$ always exists but may not be unique. In practical situations where only i.i.d. copies $X\_1, \ldots, X\_n$ of *X* are observed, the method of [16] considers classes $\mathcal{F}\_{k,L}$ of all polygonal lines with *k* segments and length not exceeding *L*, and chooses an estimator $\hat{\mathbf{f}}\_{k,n}$ of $\mathbf{f}^\star$ as the one within $\mathcal{F}\_{k,L}$ which minimizes the empirical counterpart

$$\Delta_n(\mathbf{f}) = \frac{1}{n} \sum_{i=1}^n \Delta(\mathbf{f}, X_i)$$

of Δ(**f**). It is proved in [17] that if *X* is almost surely bounded and *k* ∝ *n*<sup>1/3</sup>, then

$$
\Delta\left(\hat{\mathbf{f}}_{k,n}\right) - \Delta(\mathbf{f}^\star) = \mathcal{O}\left(n^{-1/3}\right).
$$

As the task of finding a polygonal line with *k* segments and a length of at most *L* that minimizes Δ*n*(**f**) is computationally costly, Ref. [17] proposed a polygonal line algorithm. This iterative algorithm proceeds by fitting a polygonal line with *k* segments and considerably speeds up the exploration part by resorting to gradient descent. The two steps (projection and optimization) are similar to those of the *k*-means algorithm. However, the polygonal line algorithm is not supported by theoretical bounds, and its performance varies with the distribution of the observations.
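For concreteness, the projection step above boils down to evaluating Δ(**f**, *x*): the squared distance from a point to the closest of the *k* segments of the polygonal line. A minimal sketch in Python (function and variable names are ours, not from [17]):

```python
import numpy as np

def sq_dist_to_segment(x, a, b):
    """Squared Euclidean distance from point x to the segment [a, b]."""
    ab = b - a
    denom = np.dot(ab, ab)
    # Degenerate segment: distance to the single point a.
    t = 0.0 if denom == 0 else np.clip(np.dot(x - a, ab) / denom, 0.0, 1.0)
    proj = a + t * ab
    return float(np.dot(x - proj, x - proj))

def loss(vertices, x):
    """Delta(f, x): squared distance from x to the polygonal line
    given by the (k+1) x d array of consecutive vertices."""
    return min(sq_dist_to_segment(x, vertices[i], vertices[i + 1])
               for i in range(len(vertices) - 1))

def empirical_risk(vertices, X):
    """Delta_n(f): average squared distance of the sample X to the curve."""
    return float(np.mean([loss(vertices, x) for x in X]))
```

The projection step of the *k*-means-like algorithm assigns each observation to the segment achieving the minimum above, before the optimization step moves the vertices.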

As the number *k* of segments plays a crucial role (too small a *k* leads to a poor summary of the data, whereas too large a *k* yields overfitting; see Figure 3), Ref. [18] aimed to fill this gap by selecting an optimal *k* from both theoretical and practical perspectives.

**Figure 3.** Principal curves with different numbers (*k*) of segments. (**a**) Too small a *k*. (**b**) The right *k*. (**c**) Too large a *k*.

Their approach relies strongly on the theory of model selection by penalization introduced by [19] and further developed by [20]. By considering countable classes {F*k*,ℓ}*k*,ℓ of polygonal lines with *k* segments, total length ℓ ≤ *L* and vertices on a lattice, the optimal (*k̂*, ℓ̂) is obtained as the minimizer of the criterion

$$\mathrm{crit}(k,\ell) = \Delta_n\left(\hat{\mathbf{f}}_{k,\ell}\right) + \mathrm{pen}(k,\ell),$$

where

$$\text{pen}(k,\ell) = c\_0 \sqrt{\frac{k}{n}} + c\_1 \frac{\ell}{n} + c\_2 \frac{1}{\sqrt{n}} + \delta^2 \sqrt{\frac{w\_{k,\ell}}{2n}}$$

is a penalty function, where *δ* stands for the diameter of the observations and *wk*,ℓ denotes the weight attached to the class F*k*,ℓ; the constants *c*0, *c*1, *c*<sup>2</sup> depend on *δ*, on the maximum length *L* and on the dimension of the observations. Ref. [18] then proved that

$$\mathbb{E}\left[\Delta(\hat{\mathbf{f}}_{\hat{k},\hat{\ell}})-\Delta(\mathbf{f}^{\star})\right] \leq \inf_{k,\ell} \left\{ \mathbb{E}\left[\Delta(\hat{\mathbf{f}}_{k,\ell})-\Delta(\mathbf{f}^{\star})\right] + \mathrm{pen}(k,\ell) \right\} + \frac{\delta^{2}\Sigma}{2^{3/2}}\sqrt{\frac{\pi}{n}},\tag{1}$$

where Σ is a numerical constant. The expected loss of the final polygonal line **f̂***k̂*,ℓ̂ is close to the minimal loss achievable over the classes F*k*,ℓ, up to a remainder term decaying as 1/√*n*.
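In code, once the curves **f̂***k*,ℓ have been fitted, selecting (*k̂*, ℓ̂) is a plain grid minimization of the penalized criterion. A sketch with illustrative constants (the true *c*0, *c*1, *c*2 depend on *δ*, *L* and the dimension; see [18]):

```python
import numpy as np

def penalty(k, ell, n, delta, w, c0=1.0, c1=1.0, c2=1.0):
    """pen(k, ell) with placeholder constants c0, c1, c2; w is the
    weight w_{k,ell} attached to the class F_{k,ell}."""
    return (c0 * np.sqrt(k / n) + c1 * ell / n + c2 / np.sqrt(n)
            + delta**2 * np.sqrt(w / (2 * n)))

def select_model(empirical_risks, n, delta, weights):
    """Pick (k, ell) minimizing crit = Delta_n(f_hat_{k,ell}) + pen(k, ell).
    empirical_risks maps (k, ell) -> Delta_n of the fitted curve."""
    return min(empirical_risks,
               key=lambda ke: empirical_risks[ke]
               + penalty(ke[0], ke[1], n, delta, weights[ke]))
```

The penalty grows with *k* and ℓ, so the selected pair trades goodness of fit against curve complexity, in the spirit of the oracle inequality (1).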

## *1.2. Motivation*

The big data paradigm, where collecting, storing and analyzing massive amounts of large and complex data becomes the new standard, calls for revisiting some classical statistical and machine learning techniques. The tremendous improvement of data acquisition infrastructures generates new continuous streams of data rather than batch datasets, which has drawn great interest to sequential learning. Extending the notion of principal curves to the sequential setting opens up immediate practical applications. As an example, path planning from passengers' locations can help taxi companies better optimize their fleets; online algorithms yielding instantaneous path summarization would be adapted to the sequential nature of geolocalized data. Existing theoretical works and practical implementations of principal curves are designed for the batch setting [7,16–18,21], and their adaptation to the sequential setting is not straightforward. As an example, consider the algorithm in [18]. It assumes that the vertices of principal curves are located on a lattice, and its computational complexity is of order O(*nNp*), where *n* is the number of observations, *N* the number of points on the lattice and *p* the maximum number of vertices. When *p* is large, running this algorithm at each epoch yields a massive computational cost. In general, if data are not identically distributed, or are even adversarial, algorithms that work well in the batch setting may not be ideal when cast into the online setting (see [22], Chapter 4). To the best of our knowledge, little effort has been devoted so far to extending principal curve algorithms to the sequential context.

Ref. [23] provided an incremental version of the SCMS algorithm [15], which is based on the definition of a principal curve as the ridge of the smooth probability density function generating the observations. They apply SCMS to the input points whose associated output points are close to the new incoming sample, and leave the remaining outputs unchanged; the algorithm can hence deal with sequential data. As presented in the next section, our algorithm, which sequentially updates the principal curve vertices that are close to new data, is similar in spirit to incremental SCMS. A difference, however, is that our algorithm outputs polygonal lines; in addition, the computational complexity of our method is O(*n*<sup>2</sup>), whereas incremental SCMS has O(*n*<sup>3</sup>) complexity, where *n* is the number of observations. Ref. [24] considered sequential principal curve analysis in a fairly different setting, in which the goal was to derive, in an adaptive fashion, a set of nonlinear sensors by using a set of preliminary principal curves: sequentially unfolding principal curves and a sequential path for Jacobian integration were considered. "Sequential" in that setting refers to the generalization of principal curves to principal surfaces, or even principal manifolds of higher dimension. This way of sequentially exploiting principal curves was first proposed by [11] and later extended by [14,25,26] to give curvilinear representations using sequences of local-to-global curves. In addition, Refs. [15,27,28] presented, respectively, principal polynomial and non-parametric regressions to capture the nonlinear nature of data. However, these methods were not originally designed for treating sequential data. The present paper aims to fill this gap: our goal is to propose an online perspective on principal curves by automatically and sequentially learning the best principal curve summarizing a data stream. Sequential learning takes advantage of the latest collected (set of) observations and therefore incurs a much smaller computational cost.

Sequential learning operates as follows: a blackbox reveals at each time *t* some deterministic value *xt*, *t* = 1, 2, ... , and a forecaster attempts to predict sequentially the next value based on past observations (and possibly other available information). The performance of the forecaster is no longer evaluated by its generalization error (as in the batch setting) but rather by a regret bound, which quantifies the cumulative loss of the forecaster in the first *T* rounds with respect to some reference minimal loss. In sequential learning, the speed of algorithms may be favored over statistical precision. An immediate use of the aforementioned techniques [17,18,21] at each time round *t* (treating the data collected until *t* as a batch dataset) would result in a prohibitive algorithmic cost. Rather, we propose a novel algorithm which adapts to the sequential nature of data, i.e., which takes advantage of previous computations.

The contributions of the present paper are twofold. We first propose a sequential principal curve algorithm, for which we derive regret bounds. We then present an implementation, illustrated on a toy dataset and a real-life dataset (seismic data). Our procedure operates as follows. At each time round *t*, the number *kt* of segments is chosen automatically, and the number *kt*+1 of segments in the next round is obtained using only information about *kt* and a small number of past observations. The core of our procedure relies on computing a quantity linked to the mode of the so-called Gibbs quasi-posterior and is inspired by quasi-Bayesian learning. The use of quasi-Bayesian estimators is especially advocated by the PAC-Bayesian theory, which originated in the machine learning community in the late 1990s in the seminal works of [29] and McAllester [30,31]. The PAC-Bayesian theory has been successfully adapted to sequential learning problems; see, for example, Ref. [32] for online clustering. We refer to [33,34] for recent overviews of the field.

The paper is organized as follows. Section 2 presents our notation and our online principal curve algorithm, for which we provide regret bounds with sublinear remainder terms in Section 3. A practical implementation is proposed in Section 4, and we illustrate its performance on synthetic and real-life datasets in Section 5. Proofs of all original results claimed in the paper are collected in Section 6.

#### **2. Notation**

A parameterized curve in <sup>R</sup>*<sup>d</sup>* is a continuous function **<sup>f</sup>** : *<sup>I</sup>* −→ <sup>R</sup>*<sup>d</sup>* where *<sup>I</sup>* = [*a*, *<sup>b</sup>*] is a closed interval of the real line. The length of **f** is given by

$$\mathcal{L}(\mathbf{f}) = \lim\_{M \to \infty} \left\{ \sup\_{a = s\_0 < s\_1 < \cdots < s\_M = b} \sum\_{i=1}^M ||\mathbf{f}(s\_i) - \mathbf{f}(s\_{i-1})||\_2 \right\}.$$

Let *x*1, *x*2, ... , *xT* ∈ *B*(0, √*dR*) ⊂ R*<sup>d</sup>* be a sequence of data, where *B*(**c**, *R*) stands for the ℓ2-ball centered at **c** ∈ R*<sup>d</sup>* with radius *R* > 0. Let Q*<sup>δ</sup>* be a grid over *B*(0, √*dR*), i.e., Q*<sup>δ</sup>* = *B*(0, √*dR*) ∩ Γ*<sup>δ</sup>*, where Γ*<sup>δ</sup>* is a lattice in R*<sup>d</sup>* with spacing *δ* > 0. Let *L* > 0 and define, for each *k* ∈ ⟦1, *p*⟧, the collection F*k*,*<sup>L</sup>* of polygonal lines **f** with *k* segments whose vertices are in Q*<sup>δ</sup>* and such that L(**f**) ≤ *L*. Denote by F*<sup>p</sup>* the union of all F*k*,*<sup>L</sup>* for *k* ∈ ⟦1, *p*⟧, i.e., the set of all polygonal lines with a number of segments ≤ *p*, whose vertices are in Q*<sup>δ</sup>* and whose length is at most *L*. Finally, let K(**f**) denote the number of segments of **f** ∈ F*p*. This strategy is illustrated by Figure 4.

**Figure 4.** An example of a lattice <sup>Γ</sup>*<sup>δ</sup>* in R<sup>2</sup> with *<sup>δ</sup>* = 1 (spacing between blue points) and *<sup>B</sup>*(0, 10) (black circle). The red polygonal line is composed of vertices in Q*<sup>δ</sup>* = *B*(0, 10) ∩ Γ*δ*.

Our goal is to learn a time-dependent polygonal line which passes through the "middle" of the data and summarizes all available observations *x*1, ... , *xt*−<sup>1</sup> (denoted by (*xs*)1:(*t*−1) hereafter) before time *t*. Our output at time *t* is a polygonal line **f̂***<sup>t</sup>* ∈ F*<sup>p</sup>* depending on past information (*xs*)1:(*t*−1) and past predictions (**f̂***s*)1:(*t*−1). When *xt* is revealed, the instantaneous loss at time *t* is computed as

$$\Delta\left(\hat{\mathbf{f}}_{t},\mathbf{x}_{t}\right) = \inf_{s\in I} \|\hat{\mathbf{f}}_{t}(s) - \mathbf{x}_{t}\|_{2}^{2}.\tag{2}$$

In what follows, we investigate regret bounds for the cumulative loss based on (2). Given a measurable space Θ (embedded with its Borel *σ*-algebra), we let P(Θ) denote the set of probability distributions on Θ, and for some reference measure *π*, we let P*π*(Θ) be the set of probability distributions absolutely continuous with respect to *π*.

For any *k* ∈ ⟦1, *p*⟧, let *π<sup>k</sup>* denote a probability distribution on F*k*,*L*. We define the *prior π* on F*<sup>p</sup>* as

$$\pi(\mathbf{f}) = \sum_{k=1}^{p} w_k\, \pi_k(\mathbf{f})\, \mathbb{1}_{\{\mathbf{f} \in \mathcal{F}_{k,L}\}}, \quad \mathbf{f} \in \mathcal{F}_{p},$$

where *w*1, ... , *wp* ≥ 0 and *w*1 + ··· + *wp* = 1.

We adopt a quasi-Bayesian-flavored procedure: consider the Gibbs quasi-posterior (note that this is not a proper posterior in all generality, hence the term "quasi"):

$$
\hat{\rho}_{t+1}(\cdot) \propto \exp(-\lambda S_t(\cdot))\,\pi(\cdot),
$$

where

$$S\_t(\mathbf{f}) = S\_{t-1}(\mathbf{f}) + \Delta(\mathbf{f}, \mathbf{x}\_t) + \frac{\lambda}{2} \left(\Delta(\mathbf{f}, \mathbf{x}\_t) - \Delta(\hat{\mathbf{f}}\_t, \mathbf{x}\_t)\right)^2$$

as advocated by [32,35], who then considered realizations from this quasi-posterior. In the present paper, we rather focus on a quantity linked to the mode of this quasi-posterior. Indeed, the mode of the quasi-posterior *ρ*ˆ*t*+<sup>1</sup> is

$$\arg\min_{\mathbf{f}\in\mathcal{F}_{p}}\left\{\underbrace{\sum_{s=1}^{t}\Delta(\mathbf{f},\mathbf{x}_{s})}_{(i)}+\underbrace{\frac{\lambda}{2}\sum_{s=1}^{t}\left(\Delta(\mathbf{f},\mathbf{x}_{s})-\Delta(\hat{\mathbf{f}}_{s},\mathbf{x}_{s})\right)^{2}}_{(ii)}\ \underbrace{-\ \frac{\ln\pi(\mathbf{f})}{\lambda}}_{(iii)}\right\},$$

where *(i)* is a cumulative loss term, *(ii)* controls the variance of the prediction **f** with respect to the past predictions **f̂***s*, *s* ≤ *t*, and *(iii)* can be regarded as a penalty on the complexity of **f** if *π* is well chosen. This mode is hence similar in flavor to the *follow the best expert* or *follow the perturbed leader* strategies in the setting of prediction with experts (see [22,36], Chapters 3 and 4) if we consider each **f** ∈ F*<sup>p</sup>* as an expert which always delivers constant advice. These remarks yield Algorithm 1.

**Algorithm 1** Sequentially learning principal curves.

1: **Input parameters**: *p* > 0, *η* > 0, *π*(*z*) = e<sup>−*z*</sup>𝟙{*z*>0} and penalty function *h* : F*<sup>p</sup>* → R<sup>+</sup>

2: **Initialization**: For each **f** ∈ F*p*, draw *z***<sup>f</sup>** ∼ *π* and set Δ**f**,0 = (1/*η*)(*h*(**f**) − *z***f**)

3: **For** *t* = 1, . . . , *T*

4: Obtain

$$\hat{\mathbf{f}}_{t} = \underset{\mathbf{f} \in \mathcal{F}_{p}}{\arg\inf} \left\{ \sum_{s=0}^{t-1} \Delta_{\mathbf{f},s} \right\},$$

where Δ**f**,*<sup>s</sup>* = Δ(**f**, *xs*), *s* ≥ 1.

5: Get data *xt*

6: **End for**
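Treating each candidate curve in F*p* as an action with its own loss sequence, Algorithm 1 is a follow-the-perturbed-leader scheme over a finite set. A minimal sketch (the action set, loss sequence and penalty below are placeholders, not our actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def algorithm1(actions, losses, h, eta):
    """actions: list of candidate curves; losses[t][i] = Delta(f_i, x_t);
    h: penalty per action; eta: learning rate. Returns chosen indices."""
    # Step 2: one exponential perturbation z_f per action, folded into
    # the initial 'loss' Delta_{f,0} = (h(f) - z_f) / eta.
    cum = np.array([(h[i] - rng.exponential()) / eta
                    for i in range(len(actions))])
    picks = []
    for loss_t in losses:                 # rounds t = 1, ..., T
        i = int(np.argmin(cum))           # f_hat_t = arg inf of past losses
        picks.append(i)
        cum += loss_t                     # Delta_{f,t} revealed after playing
    return picks
```

The perturbation *z***f** randomizes the tie-breaking among near-optimal actions, which is what makes the regret analysis of Section 3 possible.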

#### **3. Regret Bounds for Sequential Learning of Principal Curves**

We now present our main theoretical results.

**Theorem 1.** *For any sequence* (*xt*)1:*<sup>T</sup>* ∈ *B*(0, √*dR*)*, R* ≥ 0*, and any penalty function h* : F*<sup>p</sup>* → R<sup>+</sup>*, let π*(*z*) = e<sup>−*z*</sup>𝟙{*z*>0}*. Let* 0 < *η* ≤ 1/(*d*(2*R* + *δ*)<sup>2</sup>)*; then the procedure described in Algorithm 1 satisfies*

$$\sum_{t=1}^{T} \mathbb{E}_{\pi}\left[ \Delta(\hat{\mathbf{f}}_{t}, \mathbf{x}_{t}) \right] \leq (1 + c_{0}(\mathrm{e} - 1)\eta)\, S_{T,h,\eta} + \frac{1}{\eta}\left( 1 + \ln \sum_{\mathbf{f} \in \mathcal{F}_{p}} \mathrm{e}^{-h(\mathbf{f})} \right),$$

*where c*<sup>0</sup> = *d*(2*R* + *δ*)<sup>2</sup> *and*

$$S_{T,h,\eta} = \inf_{k \in \llbracket 1,p\rrbracket} \left\{ \inf_{\substack{\mathbf{f} \in \mathcal{F}_p \\ \mathcal{K}(\mathbf{f})=k}} \left\{ \sum_{t=1}^T \Delta(\mathbf{f}, \mathbf{x}_t) + \frac{h(\mathbf{f})}{\eta} \right\} \right\}.$$

The expectation of the cumulative loss of the polygonal lines **f̂**1, ... , **f̂***<sup>T</sup>* is upper-bounded by the smallest penalized cumulative loss over all *k* ∈ {1, ... , *p*}, up to a multiplicative term (1 + *c*0(e − 1)*η*) which can be made arbitrarily close to 1 by choosing a small enough *η*. However, a small *η* leads to both a large *h*(**f**)/*η* in *ST*,*h*,*<sup>η</sup>* and a large (1/*η*)(1 + ln ∑**f**∈F*<sup>p</sup>* e<sup>−*h*(**f**)</sup>). Another important issue is the choice of the penalty function *h*: for each **f** ∈ F*p*, *h*(**f**) should be large enough to ensure a small ∑**f**∈F*<sup>p</sup>* e<sup>−*h*(**f**)</sup>, but not so large as to overpenalize and inflate *ST*,*h*,*η*. We therefore set

$$h(\mathbf{f}) \ge \ln(p\mathrm{e}) + \ln\left|\{\mathbf{f} \in \mathcal{F}_p : \mathcal{K}(\mathbf{f}) = k\}\right|\tag{3}$$

for each **f** with *k* segments (where |*M*| denotes the cardinality of a set *M*) since it leads to

$$\sum_{\mathbf{f}\in\mathcal{F}_{p}} \mathrm{e}^{-h(\mathbf{f})} = \sum_{k=1}^{p} \sum_{\substack{\mathbf{f}\in\mathcal{F}_{p}\\ \mathcal{K}(\mathbf{f})=k}} \mathrm{e}^{-h(\mathbf{f})} \le \sum_{k=1}^{p} \frac{1}{p\mathrm{e}} \le \frac{1}{\mathrm{e}}.$$

The penalty function *h*(**f**) = *c*1K(**f**) + *c*2*L* + *c*<sup>3</sup> satisfies (3), where *c*1, *c*2, *c*<sup>3</sup> are constants depending on *R*, *d*, *δ*, *p* (this is proven in Lemma 3, in Section 6). We therefore obtain the following corollary.

**Corollary 1.** *Under the assumptions of Theorem 1, let*

$$\eta = \min \left\{ \frac{1}{d(2R+\delta)^2},\ \sqrt{\frac{c_1 p + c_2 L + c_3}{c_0(\mathrm{e}-1) \inf_{\mathbf{f} \in \mathcal{F}_p} \sum_{t=1}^T \Delta(\mathbf{f}, \mathbf{x}_t)}} \right\}.$$

*Then*

$$\begin{split} \sum_{t=1}^{T} \mathbb{E}\left[ \Delta\left( \hat{\mathbf{f}}_{t}, \mathbf{x}_{t} \right) \right] \leq &\inf_{k \in \llbracket 1, p\rrbracket} \left\{ \inf_{\substack{\mathbf{f} \in \mathcal{F}_{p} \\ \mathcal{K}(\mathbf{f}) = k}} \left\{ \sum_{t=1}^{T} \Delta(\mathbf{f}, \mathbf{x}_{t}) + \sqrt{c_{0}(\mathrm{e} - 1) r_{T,k,L}} \right\} \right\} \\ &+ \sqrt{c_{0}(\mathrm{e} - 1) r_{T,p,L}} + c_{0}(\mathrm{e} - 1)(c_{1} p + c_{2} L + c_{3}), \end{split}$$

*where*
$$r_{T,k,L} = \left(\inf_{\mathbf{f}\in\mathcal{F}_p} \sum_{t=1}^T \Delta(\mathbf{f}, \mathbf{x}_t)\right)(c_1 k + c_2 L + c_3).$$

**Proof.** Note that

$$\sum_{t=1}^{T} \mathbb{E}\left[\Delta(\hat{\mathbf{f}}_{t}, \mathbf{x}_{t})\right] \le S_{T,h,\eta} + \eta\, c_{0}(\mathrm{e} - 1) \inf_{\mathbf{f} \in \mathcal{F}_p} \sum_{t=1}^T \Delta(\mathbf{f}, \mathbf{x}_t) + c_0(\mathrm{e}-1)(c_1 p + c_2 L + c_3),$$

and we conclude by setting

$$\eta = \sqrt{\frac{c\_1 p + c\_2 L + c\_3}{c\_0 (\mathbf{e} - 1) \inf\_{\mathbf{f} \in \mathcal{F}\_p} \sum\_{t=1}^T \Delta(\mathbf{f}, \mathbf{x}\_t)}}.$$

Sadly, Corollary 1 is not of much practical use, since the optimal value for *η* depends on inf**f**∈F*<sup>p</sup>* ∑*t*=1,...,*T* Δ(**f**, *xt*), which is obviously unknown, even more so at time *t* = 0. We therefore provide an adaptive refinement of Algorithm 1 in the following Algorithm 2.

**Algorithm 2** Sequentially and adaptively learning principal curves.


1: **Input parameters**: *p* > 0, *L* > 0, *π*(*z*) = e<sup>−*z*</sup>𝟙{*z*>0}, penalty function *h* and *η*0

2: **Initialization**: For each **f** ∈ F*p*, draw *z***<sup>f</sup>** ∼ *π* and set Δ**f**,0 = (1/*η*0)(*h*(**f**) − *z***f**)

3: **For** *t* = 1, . . . , *T*

4: Compute

$$\eta_t = \frac{\sqrt{c_1 p + c_2 L + c_3}}{c_0\sqrt{(\mathrm{e}-1)t}}$$

5: Get data *xt* and compute

$$\Delta_{\mathbf{f},t} = \Delta(\mathbf{f},\mathbf{x}_t) + \left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\right)(h(\mathbf{f}) - z_{\mathbf{f}})$$

6: Obtain

$$\hat{\mathbf{f}}_t = \underset{\mathbf{f} \in \mathcal{F}_p}{\arg\inf} \left\{ \sum_{s=0}^{t-1} \Delta_{\mathbf{f},s} \right\}. \tag{4}$$

7: **End for**
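The only difference with Algorithm 1 is that the initial perturbation is re-weighted on the fly as *ηt* decreases. A sketch of this adaptive correction for one round (function names and constant values are ours):

```python
import math

def eta(t, p, L, c0, c1, c2, c3):
    """Adaptive learning rate eta_t of Algorithm 2 (eta_0 at t = 0)."""
    scale = math.sqrt(c1 * p + c2 * L + c3)
    return scale / (c0 * math.sqrt(math.e - 1)) if t == 0 \
        else scale / (c0 * math.sqrt((math.e - 1) * t))

def adaptive_increment(loss_ft, h_f, z_f, t, **consts):
    """Delta_{f,t} = Delta(f, x_t) + (1/eta_t - 1/eta_{t-1}) (h(f) - z_f):
    the new loss plus a correction re-weighting the initial perturbation."""
    return loss_ft + (1 / eta(t, **consts) - 1 / eta(t - 1, **consts)) \
        * (h_f - z_f)
```

Summing these increments reproduces the perturbed cumulative loss that Algorithm 1 would have used with the current *ηt*, which is why no rerun over past data is needed.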

**Theorem 2.** *For any sequence* (*xt*)1:*<sup>T</sup>* ∈ *B*(0, √*dR*), *R* ≥ 0*, let h*(**f**) = *c*1K(**f**) + *c*2*L* + *c*<sup>3</sup>*, where c*1*, c*2*, c*<sup>3</sup> *are constants depending on R*, *d*, *δ*, ln *p. Let π*(*z*) = e<sup>−*z*</sup>𝟙{*z*>0} *and*

$$\eta_0 = \frac{\sqrt{c_1 p + c_2 L + c_3}}{c_0\sqrt{\mathrm{e} - 1}}, \quad \eta_t = \frac{\sqrt{c_1 p + c_2 L + c_3}}{c_0\sqrt{(\mathrm{e} - 1)t}},$$

*where t* ≥ 1 *and c*<sup>0</sup> = *d*(2*R* + *δ*)<sup>2</sup>*. Then the procedure described in Algorithm 2 satisfies*

$$\begin{split} \sum_{t=1}^{T} \mathbb{E}\left[ \Delta(\hat{\mathbf{f}}_{t}, \mathbf{x}_{t}) \right] \leq &\inf_{k \in \llbracket 1, p \rrbracket} \left\{ \inf_{\substack{\mathbf{f} \in \mathcal{F}_{p} \\ \mathcal{K}(\mathbf{f}) = k}} \left\{ \sum_{t=1}^{T} \Delta(\mathbf{f}, \mathbf{x}_{t}) + c_{0} \sqrt{(\mathrm{e} - 1)T(c_{1}k + c_{2}L + c_{3})} \right\} \right\} \\ &+ 2c_{0} \sqrt{(\mathrm{e} - 1)T(c_{1}p + c_{2}L + c_{3})}. \end{split}$$

The message of this regret bound is that the expected cumulative loss of the polygonal lines **f̂**1, ... , **f̂***<sup>T</sup>* is upper-bounded by the minimal cumulative loss over all *k* ∈ {1, ... , *p*}, up to an additive term which is sublinear in *T*. The actual magnitude of this remainder term is √*kT*. When *L* is fixed, the number *k* of segments is a measure of the complexity of the retained polygonal line. This bound therefore has the same magnitude as (1), which is the most refined bound in the literature so far ([18], where the optimal values for *k* and *L* were obtained in a model selection fashion).

#### **4. Implementation**

The argument of the infimum in Algorithm 2 is taken over F*<sup>p</sup>*, whose cardinality is of order |Q*δ*|*<sup>p</sup>*, making any greedy search prohibitively time-consuming. We instead turn to the following strategy: given a polygonal line **f̂***<sup>t</sup>* ∈ F*kt*,*<sup>L</sup>* with *kt* segments, we restrict, with a certain proportion of the time, the search for **f̂***t*+1 to a neighborhood U(**f̂***t*) (see the formal definition below) of **f̂***t*. This restriction is well suited to the principal curve setting: if observation *xt* is close to **f̂***t*, one can expect that a polygonal line fitting the observations *xs*, *s* = 1, ... , *t* well lies in a neighborhood of **f̂***t*. In addition, if each polygonal line **f** is regarded as an action, we no longer assume that all actions are available at all times, and we allow the set of available actions to vary over time. This is a model known as "sleeping experts (or actions)" in prior work [37,38]. In this setting, defining the regret with respect to the best action in hindsight over the whole action set is difficult, since that action might sometimes be unavailable. Hence, it is natural to define the regret with respect to the best ranking of all actions in hindsight according to their losses or rewards; at each round, one chooses among the available actions the one which ranks highest. Ref. [38] introduced this notion of regret and studied both the full-information (best action) and partial-information (multi-armed bandit) settings, with stochastic and adversarial rewards and adversarial action availability. They pointed out that the **EXP4** algorithm [37] attains the optimal regret in the adversarial rewards case but has a runtime exponential in the number of actions. Ref. [39] considered full and partial information with stochastic action availability and proposed an algorithm that runs in polynomial time. In what follows, we materialize our implementation by resorting to "sleeping experts", i.e., a special set of available actions adapted to the principal curve setting.

Let *σ* denote an ordering of |F*p*| actions, and A*<sup>t</sup>* a subset of the available actions at round *t*. We let *σ*(A*t*) denote the highest ranked action in A*t*. In addition, for any action **f** ∈ F*<sup>p</sup>* we define the reward *r***f**,*<sup>t</sup>* of **f** at round *t*, *t* ≥ 0 by

$$r\_{\mathbf{f},t} = c\_0 - \Delta(\mathbf{f}, \mathbf{x}\_t).$$

It is clear that *r***f**,*<sup>t</sup>* ∈ (0, *c*0). This conversion from losses to rewards is made in order to facilitate the subsequent performance analysis. The reward of an ordering *σ* is the cumulative reward of the selected action at each time:

$$\sum_{t=1}^{T} r_{\sigma(\mathcal{A}_{t}),t},$$

and the reward of the best ordering is max*<sup>σ</sup>* ∑*t*=1,...,*T* *rσ*(A*t*),*<sup>t</sup>* (respectively, E[max*<sup>σ</sup>* ∑*t*=1,...,*T* *rσ*(A*t*),*t*] when A*<sup>t</sup>* is stochastic).
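Selecting the action played at round *t* thus amounts to scanning the ordering *σ* until an available action is met. A toy sketch (the scores stand for the perturbed cumulative rewards; names are ours):

```python
def best_available(scores, available):
    """sigma(A_t): among the available actions, return the one ranked
    highest by the ordering sigma induced by descending scores."""
    order = sorted(scores, key=scores.get, reverse=True)  # the ordering sigma
    for action in order:
        if action in available:
            return action
    raise ValueError("A_t is empty")
```

Note that the ordering is computed from all actions, while the selection only touches A*t*; this is exactly the "best ranking" notion of regret recalled above.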

Our procedure starts with a **partition** step which aims at identifying the "relevant" neighborhood of an observation *<sup>x</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* with respect to a given polygonal line, and then proceeds with the definition of the **neighborhood** of an action **f**. We then provide the full implementation and prove a regret bound.

**Partition.** For any polygonal line **f** with *k* segments, we denote by **V** = (*v*1, ... , *vk*+1) its vertices and by *si*, *i* = 1, ... , *k* the line segments connecting *vi* and *vi*+1. In the sequel, we use **f**(**V**) to represent the polygonal line formed by connecting consecutive vertices in **V** if no confusion arises. Let *Vi*, *i* = 1, ... , *k* + 1 and *Si*, *i* = 1, ... , *k* be the Voronoi partitions of R*<sup>d</sup>* with respect to **f**, i.e., the regions consisting of all points closest to vertex *vi* or to segment *si*, respectively. Figure 5 shows an example of a Voronoi partition with respect to an **f** with three segments.

**Neighborhood.** For any *x* ∈ R*<sup>d</sup>*, we define the neighborhood N(*x*) with respect to **f** as the union of all Voronoi partitions whose closures intersect the two vertices connected to the projection **f**(*s***f**(*x*)) of *x* onto **f**. For example, for the point *x* in Figure 5, its neighborhood N(*x*) is the union of *S*2, *V*3, *S*<sup>3</sup> and *V*4. In addition, let N*t*(*x*) = {*xs* ∈ N(*x*), *s* = 1, . . . , *t*} be the set of observations *x*1:*<sup>t</sup>* belonging to N(*x*) and N̄*<sup>t</sup>*(*x*) be their average. Let D(*M*) = sup*x*,*y*∈*<sup>M</sup>* ||*x* − *y*||<sup>2</sup> denote the diameter of a set *M* ⊂ R*<sup>d</sup>*. We finally define the local grid Q*δ*,*t*(*x*) of *x* ∈ R*<sup>d</sup>* at time *t* as

$$\mathcal{Q}_{\delta,t}(\mathbf{x}) = B\left(\bar{\mathcal{N}}_t(\mathbf{x}),\, \mathcal{D}(\mathcal{N}_t(\mathbf{x}))\right) \cap \Gamma_{\delta}.$$

**Figure 5.** An example of a Voronoi partition.
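Deciding which cell of this Voronoi partition a point falls into only requires comparing its distance to each vertex and to the interior of each segment. A minimal sketch (names are ours; ties between cells are broken arbitrarily):

```python
import numpy as np

def voronoi_cell(vertices, x):
    """Return ('V', i) if x is closest to vertex v_i, or ('S', i) if x is
    closest to the open segment s_i joining v_i and v_{i+1} (1-indexed)."""
    best = (np.inf, None)
    for i, v in enumerate(vertices, start=1):
        d = np.dot(x - v, x - v)
        if d < best[0]:
            best = (d, ("V", i))
    for i in range(len(vertices) - 1):
        a, b = vertices[i], vertices[i + 1]
        ab = b - a
        t = np.dot(x - a, ab) / np.dot(ab, ab)
        if 0.0 < t < 1.0:            # projection falls strictly inside s_{i+1}
            proj = a + t * ab
            d = np.dot(x - proj, x - proj)
            if d < best[0]:
                best = (d, ("S", i + 1))
    return best[1]
```

Collecting, for every past observation, the cells returned by this routine is enough to build the sets N*t*(*x*) used in the definition of the local grid.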

We can finally proceed to the definition of the neighborhood U(**f̂***t*) of **f̂***t*. Assume **f̂***t* has *kt* + 1 vertices **V** = (*v*1:*it*−1, *vit*:*jt*−1, *vjt*:*kt*+1), grouped into (*i*) *v*1:*it*−1, (*ii*) *vit*:*jt*−1 and (*iii*) *vjt*:*kt*+1, where the vertices of (*ii*) belong to Q*δ*,*t*(*xt*) while those of (*i*) and (*iii*) do not. The neighborhood U(**f̂***t*) consists of all **f** sharing vertices (*i*) and (*iii*) with **f̂***t* but equipped with different vertices (*ii*) in Q*δ*,*t*(*xt*); i.e.,

$$\mathcal{U}(\hat{\mathbf{f}}_{t}) = \left\{ \mathbf{f}(\mathbf{V}'),\quad \mathbf{V}' = \left( v_{1:i_t-1},\, v'_{1:m},\, v_{j_t:k_t+1} \right) \right\},$$

where the new vertices *v*′1:*<sup>m</sup>* range over Q*δ*,*t*(*xt*) and *m* is given by

$$m = \begin{cases} j\_t - i\_t - 1 & \text{reduce segments by 1 unit,} \\ j\_t - i\_t & \text{same number of segments,} \\ j\_t - i\_t + 1 & \text{increase segments by 1 unit.} \end{cases}$$

In Algorithm 3, we initialize the principal curve **f̂**<sup>1</sup> as the first-component line segment whose vertices are the two farthest projections of the data *x*1:*t*<sup>0</sup> (*t*<sup>0</sup> can be set to 20 in practice) onto the first principal component line. The reward of **f** at round *t* in this setting is therefore *r***f**,*<sup>t</sup>* = *c*<sup>0</sup> − Δ(**f**, *xt*0+*t*). Algorithm 3 has an exploration phase (when *It* = 1) and an exploitation phase (*It* = 0). In the exploration phase, the rewards of all actions can be observed, and an optimal perturbed action is chosen from the set F*<sup>p</sup>* of all actions. In the exploitation phase, only the rewards of some actions can be accessed, the rewards of the others being estimated by a constant, and we update our action within the neighborhood U(**f̂***t*−1) of the previous action **f̂***t*−1. This local update (or search) greatly reduces the computational complexity, since |U(**f̂***t*−1)| ≪ |F*p*| when *p* is large. In addition, this local search suffices to account for the case where *xt* is located in U(**f̂***t*−1). The parameter *β* needs to be carefully calibrated: it should not be too large, to ensure that the set *cond*(*t*) is non-empty; otherwise, all rewards are estimated by the same constant and thus lead to the same descending ordering of tuples for both ∑*s*=1,...,*t*−1 *r̂***f**,*s*, **f** ∈ F*<sup>p</sup>* and ∑*s*=1,...,*t* *r̂***f**,*s*, **f** ∈ F*<sup>p</sup>*. We would then face the risk of having **f̂***t*+1 in the neighborhood of **f̂***t* even if we are in the exploration phase at time *t* + 1. Conversely, a very small *β* could result in a large bias of the estimator *r***f**,*t*/P(**f̂***<sup>t</sup>* = **f**|H*t*) of *r***f**,*t*.
Note that the exploitation phase is close to, yet different from, label efficient prediction ([40], Remark 1.1), since we allow the action at time *t* to differ from the previous one. Ref. [41] proposed the *geometric resampling* method to estimate the conditional probability P(**f̂***<sup>t</sup>* = **f**|H*t*), since this quantity often does not have an explicit form. However, thanks to the simple exponential distribution of *z***<sup>f</sup>** chosen in our case, an explicit form of P(**f̂***<sup>t</sup>* = **f**|H*t*) is straightforward.
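The estimate used in the exploitation phase (step 9 of Algorithm 3 below) is a plain inverse-probability weighting: the observed reward of the played action is divided by its selection probability whenever that probability exceeds *β*, and every other reward is replaced by the constant *α*. A toy sketch (names are ours):

```python
def ipw_estimates(actions, probs, rewards, played, alpha, beta):
    """r_hat_{f,t}: inverse-probability-weighted reward of the played
    action (when its selection probability exceeds beta), and the
    constant alpha for every other action."""
    return {f: rewards[f] / probs[f] if (f == played and probs[f] > beta)
            else alpha
            for f in actions}
```

Averaged over the randomness of the played action, the weighted term has expectation *r***f**,*t* when *α* = 0; the constant *α* trades this unbiasedness against variance, which is the bias discussed above.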

**Algorithm 3** A locally greedy algorithm for sequentially learning principal curves.


5: Let

$$\hat{\sigma}\_{t} = \mathrm{sort}\left(\mathbf{f}, \quad \sum\_{s=1}^{t-1} \hat{r}\_{\mathbf{f},s} - \frac{1}{\eta\_{t-1}} h(\mathbf{f}) + \frac{1}{\eta\_{t-1}} z\_{\mathbf{f}}\right),$$

i.e., sort all $\mathbf{f} \in \mathcal{F}\_p$ in descending order according to their perturbed cumulative reward up to $t-1$.

6: If $I\_t = 1$, set $\mathcal{A}\_t = \mathcal{F}\_p$, $\hat{\mathbf{f}}\_t = \hat{\sigma}\_t(\mathcal{A}\_t)$ and observe $r\_{\hat{\mathbf{f}}\_t,t}$.

7: Set

$$
\hat{r}\_{\mathbf{f},t} = r\_{\mathbf{f},t} \quad \text{for} \quad \mathbf{f} \in \mathcal{F}\_p.
$$

8: If $I\_t = 0$, set $\mathcal{A}\_t = \mathcal{U}(\hat{\mathbf{f}}\_{t-1})$, $\hat{\mathbf{f}}\_t = \hat{\sigma}\_t(\mathcal{A}\_t)$ and observe $r\_{\hat{\mathbf{f}}\_t,t}$.

9: Set

$$
\hat{r}\_{\mathbf{f},t} = \begin{cases}
\frac{r\_{\mathbf{f},t}}{\mathbb{P}\left(\hat{\mathbf{f}}\_{t} = \mathbf{f} \middle| \mathcal{H}\_{t}\right)} & \text{if } \mathbf{f} \in \mathcal{U}(\hat{\mathbf{f}}\_{t-1}) \cap \operatorname{cond}(t) \text{ and } \hat{\mathbf{f}}\_{t} = \mathbf{f}, \\
\alpha & \text{otherwise},
\end{cases}
$$

where $\mathcal{H}\_t$ denotes all the randomness before time $t$ and $cond(t) = \left\{\mathbf{f} \in \mathcal{F}\_p : \mathbb{P}\left(\hat{\mathbf{f}}\_t = \mathbf{f} \,\middle|\, \mathcal{H}\_t\right) > \beta\right\}$. In particular, when $t = 1$, we set $\hat{r}\_{\mathbf{f},1} = r\_{\mathbf{f},1}$ for all $\mathbf{f} \in \mathcal{F}\_p$, $\mathcal{U}(\hat{\mathbf{f}}\_0) = \emptyset$ and $\hat{r}\_{\hat{\sigma}\_1(\mathcal{U}(\hat{\mathbf{f}}\_0)),1} \equiv 0$.

10: **End for**
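The exploration/exploitation loop above can be sketched as follows. This is a schematic Python illustration over an abstract finite action set, not the authors' R implementation: the function names, the default parameter values, and the Monte Carlo estimation of the selection probability (the geometric-resampling idea of [41], used here only because it is generic) are all ours.

```python
import random

def ranked(A, cum, h, z, eta):
    """Step 5: rank the actions of A by perturbed cumulative reward, descending."""
    return sorted(A, key=lambda f: cum[f] - h(f) / eta + z[f] / eta, reverse=True)

def choice_prob(f, A, cum, h, eta, m=200):
    """Monte Carlo estimate of P(f_hat_t = f | H_t) obtained by redrawing the
    perturbations; with exponential z this probability is explicit in the paper,
    so the resampling here is purely illustrative."""
    hits = 0
    for _ in range(m):
        z = {g: random.expovariate(1.0) for g in A}
        hits += ranked(A, cum, h, z, eta)[0] == f
    return hits / m

def algorithm3_sketch(T, actions, reward, neighborhood, h,
                      eta=0.5, eps=0.7, alpha=1.0, beta=0.05, seed=0):
    """Schematic main loop of Algorithm 3 over an abstract finite action set."""
    random.seed(seed)
    z = {f: random.expovariate(1.0) for f in actions}  # perturbations, drawn once
    cum = {f: 0.0 for f in actions}                    # cumulative estimated rewards
    chosen = []
    for t in range(1, T + 1):
        explore = (t == 1) or (random.random() < eps)  # I_t = 1 with probability eps
        A = list(actions) if explore else list(neighborhood(chosen[-1]))
        f_t = ranked(A, cum, h, z, eta)[0]
        if explore:
            # Steps 6-7: full information, every reward is observed.
            for f in actions:
                cum[f] += reward(f, t)
        else:
            # Steps 8-9: only the chosen action is observed; it is weighted by the
            # inverse of its selection probability, all other actions receive alpha.
            for f in actions:
                if f == f_t:
                    p = choice_prob(f_t, A, cum, h, eta)
                    cum[f] += reward(f_t, t) / p if p > beta else alpha
                else:
                    cum[f] += alpha
        chosen.append(f_t)
    return chosen
```

For instance, with five abstract actions whose reward favors action 2 and neighborhoods `{f-1, f, f+1}`, the loop restricts most of its updates to the vicinity of the previously chosen action, which is the computational saving discussed above.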

**Theorem 3.** *Assume that $p > 6$, $T \ge 2|\mathcal{F}\_p|^2$, and let $\beta = |\mathcal{F}\_p|^{-\frac{1}{2}} T^{-\frac{1}{4}}$, $\alpha = \frac{c\_0}{\beta}$, $\hat{c}\_0 = \frac{2c\_0}{\beta}$, $\epsilon = 1 - |\mathcal{F}\_p|^{\frac{1}{2}-\frac{3}{p}} T^{-\frac{1}{4}}$ and*

$$
\eta\_1 = \eta\_2 = \dots = \eta\_T = \frac{\sqrt{c\_1 p + c\_2 L + c\_3}}{\sqrt{T(\mathrm{e}-1)}\, \hat{c}\_0}.
$$

*Then the procedure described in Algorithm 3 satisfies the regret bound*

$$\sum\_{t=1}^T \mathbb{E}\left[\Delta\left(\hat{\mathbf{f}}\_{t}, \mathbf{x}\_t\right)\right] \le \inf\_{\mathbf{f} \in \mathcal{F}\_p} \mathbb{E}\left[\sum\_{t=1}^T \Delta(\mathbf{f}, \mathbf{x}\_t)\right] + \mathcal{O}\left(T^{\frac{3}{4}}\right).$$

The proof of Theorem 3 is presented in Section 6. The regret is upper bounded by a term of order $|\mathcal{F}\_p|^{\frac{1}{2}} T^{\frac{3}{4}}$, sublinear in $T$. The term $(1-\epsilon) c\_0 T = c\_0 |\mathcal{F}\_p|^{\frac{1}{2}-\frac{3}{p}} T^{\frac{3}{4}}$ is the price to pay for the local search (performed in a proportion $1-\epsilon$ of the rounds) of the polygonal line $\hat{\mathbf{f}}\_t$ in the neighborhood of the previous $\hat{\mathbf{f}}\_{t-1}$. If $\epsilon = 1$, we would have $\hat{c}\_0 = c\_0$, and the last two terms in the first inequality of Theorem 3 would vanish; the upper bound then reduces to that of Theorem 2. In addition, our algorithm achieves an order that is smaller (with respect to both the number $|\mathcal{F}\_p|$ of actions and the total number of rounds $T$) than that of [39], since at each time the set of available actions for our algorithm is either the whole action set or a neighborhood of the previous action, whereas [39] considers at each time only a partial and independent stochastic set of available actions generated from a predefined distribution.

#### **5. Numerical Experiments**

We illustrate the performance of Algorithm 3 on synthetic and real-life data. Our implementation (hereafter denoted by slpc, for Sequential Learning of Principal Curves) is written in the R language, and our most natural competitors are therefore the R package princurve, which implements the algorithm from [10], and incremental SCMS, the algorithm from [23]. We let $p = 50$, $R = \max\_{t=1,\dots,T} \|x\_t\|\_2/\sqrt{d}$, and $L = 0.1\, p \sqrt{d} R$. The spacing $\delta$ of the lattice is adjusted to the scale of the data.
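As a minimal sketch of this experimental setup (the function and variable names are ours, not part of any of the packages mentioned), the constants can be computed from a $(T, d)$ data matrix as:

```python
import numpy as np

def experiment_constants(X, p=50):
    """Constants of Section 5: R = max_t ||x_t||_2 / sqrt(d) and L = 0.1 p sqrt(d) R,
    for a (T, d) array X of observations."""
    _, d = X.shape
    R = np.max(np.linalg.norm(X, axis=1)) / np.sqrt(d)  # largest rescaled norm
    L = 0.1 * p * np.sqrt(d) * R                        # length budget of the curve
    return R, L
```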

**Synthetic data.** We generate a dataset $x\_t \in \mathbb{R}^2$, $t = 1, \dots, 500$, uniformly along the curve $y = 0.05 \times (x-5)^3$, $x \in [0, 10]$. Table 1 shows the regret (first row) for each method. The mean computation times for different values of the time horizon $T$ are also reported.
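The synthetic dataset described above can be reproduced as follows (a sketch with our own function name; sampling is taken uniform in $x$ for simplicity, whereas uniform arc-length sampling would require reparametrizing the curve):

```python
import numpy as np

def synthetic_data(T=500, seed=0):
    """Draw T points along y = 0.05 (x - 5)^3 with x uniform on [0, 10]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=T)
    return np.column_stack([x, 0.05 * (x - 5.0) ** 3])
```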

**Table 1.** The first line reports the regret (cumulative loss) on synthetic data (average over 10 trials, with standard deviation in brackets). The second and third lines report the average computation time for two values of the time horizon *T*. princurve and incremental SCMS are deterministic, hence the zero standard deviation for the regret.


Table 1 demonstrates the advantages of our method slpc, as it achieves the best tradeoff between performance (in terms of regret) and runtime. Although princurve outperforms the other two algorithms in terms of computation time, it yields the largest regret, since it outputs a curve which does not pass through "the middle of the data" but rather bends towards the curvature of the data cloud, as shown in Figure 6, where the predicted principal curves $\hat{\mathbf{f}}\_{t+1}$ for princurve, incremental SCMS and slpc are presented. incremental SCMS and slpc both yield satisfactory results, although the mean computation time of slpc is significantly smaller than that of incremental SCMS (the reason being that incremental SCMS needs to compute eigenvectors of the Hessian of the PDF). Figure 7 shows the estimated regret of slpc and its per-round value (i.e., the cumulative loss divided by the number of rounds), both as functions of the round $t$. The jumps in the per-round curve occur at the beginning, due to the initialization from a first principal component and to the collection of new data. As data accumulate, the vanishing pattern of the per-round curve illustrates that the regret is sublinear in $t$, which matches the aforementioned theoretical results.

**Figure 6.** Synthetic data. Black dots represent data *x*1:*t*. The red point is the new observation *xt*+1. princurve (solid red) and slpc (solid green). (**a**) *t* = 150, princurve. (**b**) *t* = 450, princurve. (**c**) *t* = 150, incremental SCMS. (**d**) *t* = 450, incremental SCMS. (**e**) *t* = 150, slpc. (**f**) *t* = 450, slpc.

In addition, to better illustrate the way slpc works between two epochs, Figure 8 focuses on the impact of collecting a new data point on the principal curve. We see that only a local vertex is impacted, whereas the rest of the principal curve remains unaltered. This cutdown in algorithmic complexity is one of the key assets of slpc.

**Figure 7.** Mean estimation of regret and per-round regret of slpc with respect to time round *t*, for the horizon *T* = 500. (**a**) Mean estimation of the regret of slpc over 20 trials (black line) and a bisection line (green) with respect to time round *t*. (**b**) Per-round of estimated regret of slpc with respect to *t*.

**Figure 8.** Synthetic data. Zooming in: how a new data point impacts the principal curve only locally. (**a**) At time *t* = 97. (**b**) And at time *t* = 98.

**Synthetic data in high dimension.** We also apply our algorithm to a dataset $\{x\_t \in \mathbb{R}^6,\; t = 1, 2, \dots, 200\}$ in higher dimension. It is generated uniformly along a parametric curve whose coordinates are

$$\begin{pmatrix} 0.5\, t \cos(t) \\ 0.5\, t \sin(t) \\ 0.5\, t \\ -t \\ \sqrt{t} \\ 2 \ln(t+1) \end{pmatrix},$$

where $t$ takes 100 equidistant values in $[0, 2\pi]$. To the best of our knowledge, [10,16,18] only tested their algorithms on 2-dimensional data. This example illustrates that our algorithm also works on higher-dimensional data. Table 2 shows the regret for the ground truth, princurve and slpc.
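The parametric curve above can be evaluated as follows (a sketch, with our own function name, computing the six coordinates at $n$ equidistant parameter values):

```python
import numpy as np

def highdim_curve(n=100):
    """The 6-dimensional parametric curve of the high-dimensional experiment,
    evaluated at n equidistant parameter values t in [0, 2*pi]."""
    t = np.linspace(0.0, 2.0 * np.pi, n)
    return np.column_stack([0.5 * t * np.cos(t), 0.5 * t * np.sin(t),
                            0.5 * t, -t, np.sqrt(t), 2.0 * np.log(t + 1.0)])
```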

**Table 2.** Regret (cumulative loss) on synthetic high-dimensional data (average over 10 trials, with standard deviation in brackets). princurve and incremental SCMS are deterministic, hence the zero standard deviation.


In addition, Figure 9 shows the behaviour of slpc (green) on each dimension.


**Figure 9.** slpc (green line) on synthetic high-dimensional data from different perspectives. Black dots represent recordings *x*1:199; the red dot is the new recording *x*200. (**a**) slpc, *t* = 199, 1st and 2nd coordinates. (**b**) slpc, *t* = 199, 3rd and 5th coordinates. (**c**) slpc, *t* = 199, 4th and 6th coordinates.

**Seismic data.** Seismic data spanning long periods of time are essential for a thorough understanding of earthquakes. The "Centennial Earthquake Catalog" [42] aims at providing a realistic picture of the seismicity distribution on Earth. It consists of a global catalog of locations and magnitudes of instrumentally recorded earthquakes from 1900 to 2008. We focus on a particularly representative seismically active zone (a lithospheric border close to Australia) whose longitude lies between E130° and E180° and latitude between S70° and N30°, with *T* = 218 seismic recordings. As shown in Figure 10, slpc recovers the tectonic plate boundary nicely, whereas both princurve and incremental SCMS, even with a well-calibrated bandwidth, fail to do so.

Lastly, since no ground truth is available, we used the *R*<sup>2</sup> coefficient to assess the performance (residuals are replaced by the squared distances between data points and their projections onto the principal curve). The average over 10 trials was 0.990.
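This pseudo-*R*<sup>2</sup> can be sketched as below. The function name and the exact normalization (total sum of squares around the mean) are our reading of the text, not the authors' code:

```python
import numpy as np

def pseudo_r2(X, proj):
    """Pseudo R^2 of Section 5: ordinary residuals are replaced by the squared
    distances between the points X and their projections proj onto the curve."""
    rss = np.sum((X - proj) ** 2)               # residual sum of squares
    tss = np.sum((X - X.mean(axis=0)) ** 2)     # total sum of squares
    return 1.0 - rss / tss
```

A perfect fit (each point equal to its projection) gives a coefficient of exactly 1.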

**Figure 10.** Seismic data. Black dots represent seismic recordings *x*1:*t*; the red dot is the new recording *xt*+1. (**a**) princurve, *t* = 100. (**b**) princurve, *t* = 125. (**c**) incremental SCMS, *t* = 100. (**d**) incremental SCMS, *t* = 125. (**e**) slpc, *t* = 100. (**f**) slpc, *t* = 125.

**Back to Seismic Data.** Figure 11 was taken from the USGS website (https://earthquake. usgs.gov/data/centennial/) and gives the global locations of earthquakes for the period 1900–1999. The seismic data (latitude, longitude, magnitude of earthquakes, etc.) used in the present paper may be downloaded from this website.

**Figure 11.** Seismic data from https://earthquake.usgs.gov/data/centennial/.

**Daily Commute Data.** The identification of segments of personal daily commuting trajectories can help taxi or bus companies to optimize their fleets and increase frequencies on segments with high commuting activity. Sequential principal curves appear to be an ideal tool to address this learning problem: we tested our algorithm on trajectory data from the University of Illinois at Chicago (https://www.cs.uic.edu/~boxu/mp2p/gps\_data.html). The data were obtained from the GPS reading systems carried by two of the laboratory members during their daily commute for 6 months in Cook County and DuPage County, Illinois. Figure 12 presents the learning curves yielded by princurve and slpc on geolocalization data for the first person, on May 30. A particularly remarkable asset of slpc is that abrupt curvature in the data sequence is perfectly captured, whereas princurve does not enjoy the same flexibility. Again, we used the *R*<sup>2</sup> coefficient to assess the performance (where residuals are replaced by the squared distances between data points and their projections onto the principal curve). The average over 10 trials was 0.998.

**Figure 12.** Daily commute data. Black dots represent collected locations *x*1:*t*. The red point is the new observation *xt*+1. princurve (solid red) and slpc (solid green). (**a**) *t* = 10, princurve. (**b**) *t* = 127, princurve. (**c**) *t* = 10, slpc. (**d**) *t* = 127, slpc.

#### **6. Proofs**

This section contains the proof of Theorem 2 (note that Theorem 1 is a straightforward consequence, obtained with $\eta\_t = \eta$, $t = 0, \dots, T$) and the proof of Theorem 3 (which involves intermediary lemmas). Let us first define, for each $t = 0, \dots, T$, the following forecaster sequence $(\hat{\mathbf{f}}^{\star}\_t)\_t$:

$$\begin{split} \hat{\mathbf{f}}\_{0}^{\star} &= \underset{\mathbf{f} \in \mathcal{F}\_{p}}{\arg\inf} \left\{ \Delta\_{\mathbf{f},0} \right\} = \underset{\mathbf{f} \in \mathcal{F}\_{p}}{\arg\inf} \left\{ \frac{1}{\eta\_{0}}h(\mathbf{f}) - \frac{1}{\eta\_{0}}z\_{\mathbf{f}} \right\}, \\ \hat{\mathbf{f}}\_{t}^{\star} &= \underset{\mathbf{f} \in \mathcal{F}\_{p}}{\arg\inf} \left\{ \sum\_{s=0}^{t} \Delta\_{\mathbf{f},s} \right\} = \underset{\mathbf{f} \in \mathcal{F}\_{p}}{\arg\inf} \left\{ \sum\_{s=1}^{t} \Delta(\mathbf{f}, \mathbf{x}\_{s}) + \frac{1}{\eta\_{t-1}}h(\mathbf{f}) - \frac{1}{\eta\_{t-1}}z\_{\mathbf{f}} \right\}, \quad t \ge 1. \end{split}$$

Note that $\hat{\mathbf{f}}^{\star}\_t$ is an "illegal" forecaster, since it peeks into the future. In addition, denote by

$$\mathbf{f}^{\star} = \underset{\mathbf{f} \in \mathcal{F}\_p}{\arg\inf} \left\{ \sum\_{t=1}^{T} \Delta(\mathbf{f}, \mathbf{x}\_{t}) + \frac{1}{\eta\_T} h(\mathbf{f}) \right\},$$

the polygonal line in $\mathcal{F}\_p$ which minimizes the cumulative loss in the first $T$ rounds plus a penalty term. Note that $\mathbf{f}^{\star}$ is deterministic, while $\hat{\mathbf{f}}^{\star}\_t$ is a random quantity (since it depends on the $z\_{\mathbf{f}}$, $\mathbf{f} \in \mathcal{F}\_p$, drawn from $\pi$). If several $\mathbf{f}$ attain the infimum, we choose $\mathbf{f}^{\star}$ as the one having the smallest complexity. We now state the first (out of three) intermediary technical results.

**Lemma 1.** *For any sequence $x\_1, \dots, x\_T$ in $B(0, \sqrt{d}R)$,*

$$\sum\_{t=0}^{T} \Delta\_{\hat{\mathbf{f}}\_{t}^{\star}, t} \le \sum\_{t=0}^{T} \Delta\_{\hat{\mathbf{f}}\_{T}^{\star}, t}, \qquad \pi\text{-almost surely.} \tag{5}$$

**Proof.** We proceed by induction on $T$. Clearly, (5) holds for $T = 0$. Assume that (5) holds for $T - 1$:

$$\sum\_{t=0}^{T-1} \Delta\_{\hat{\mathbf{f}}\_{t}^{\star},t} \le \sum\_{t=0}^{T-1} \Delta\_{\hat{\mathbf{f}}\_{T-1}^{\star},t}.$$

Since $\hat{\mathbf{f}}^{\star}\_{T-1}$ minimizes $\sum\_{t=0}^{T-1} \Delta\_{\mathbf{f},t}$ over $\mathcal{F}\_p$, the right-hand side is at most $\sum\_{t=0}^{T-1} \Delta\_{\hat{\mathbf{f}}\_{T}^{\star},t}$; adding $\Delta\_{\hat{\mathbf{f}}\_{T}^{\star},T}$ to both sides concludes the proof.

By (5) and the definition of $\hat{\mathbf{f}}^{\star}\_T$, for $T \ge 1$, we have $\pi$-almost surely that

$$\begin{split} \sum\_{t=1}^{T} \Delta(\hat{\mathbf{f}}\_{t}^{\star}, \mathbf{x}\_{t}) &\leq \sum\_{t=1}^{T} \Delta(\hat{\mathbf{f}}\_{T}^{\star}, \mathbf{x}\_{t}) + \frac{1}{\eta\_{T}} h(\hat{\mathbf{f}}\_{T}^{\star}) - \frac{1}{\eta\_{T}} Z\_{\hat{\mathbf{f}}\_{T}^{\star}} + \sum\_{t=0}^{T} \left( \frac{1}{\eta\_{t-1}} - \frac{1}{\eta\_{t}} \right) \left( h(\hat{\mathbf{f}}\_{t}^{\star}) - Z\_{\hat{\mathbf{f}}\_{t}^{\star}} \right) \\ &\leq \sum\_{t=1}^{T} \Delta(\mathbf{f}^{\star}, \mathbf{x}\_{t}) + \frac{1}{\eta\_{T}} h(\mathbf{f}^{\star}) - \frac{1}{\eta\_{T}} Z\_{\mathbf{f}^{\star}} + \sum\_{t=0}^{T} \left( \frac{1}{\eta\_{t-1}} - \frac{1}{\eta\_{t}} \right) \left( h(\hat{\mathbf{f}}\_{t}^{\star}) - Z\_{\hat{\mathbf{f}}\_{t}^{\star}} \right) \\ &= \inf\_{\mathbf{f} \in \mathcal{F}\_p} \left\{ \sum\_{t=1}^{T} \Delta(\mathbf{f}, \mathbf{x}\_{t}) + \frac{1}{\eta\_{T}} h(\mathbf{f}) \right\} - \frac{1}{\eta\_{T}} Z\_{\mathbf{f}^{\star}} + \sum\_{t=0}^{T} \left( \frac{1}{\eta\_{t-1}} - \frac{1}{\eta\_{t}} \right) \left( h(\hat{\mathbf{f}}\_{t}^{\star}) - Z\_{\hat{\mathbf{f}}\_{t}^{\star}} \right), \end{split}$$

where $1/\eta\_{-1} = 0$ by convention. The second inequality and the final equality are due, respectively, to the definitions of $\hat{\mathbf{f}}^{\star}\_T$ and $\mathbf{f}^{\star}$. Hence

$$\begin{split} \mathbb{E}\left[\sum\_{t=1}^{T}\Delta\left(\hat{\mathbf{f}}\_{t}^{\star},\mathbf{x}\_{t}\right)\right] &\leq \inf\_{\mathbf{f}\in\mathcal{F}\_{T}}\left\{\sum\_{t=1}^{T}\Delta(\mathbf{f},\mathbf{x}\_{t})+\frac{1}{\eta\_{T}}h(\mathbf{f})\right\}-\frac{1}{\eta\_{T}}\mathbb{E}[Z\_{\mathbf{f}\_{T}^{\star}}] \\ &\quad + \sum\_{t=0}^{T}\mathbb{E}\left[\left(\frac{1}{\eta\_{t}}-\frac{1}{\eta\_{t-1}}\right)\left(-h(\mathbf{\hat{f}}\_{t}^{\star})+Z\_{\mathbf{\hat{f}}\_{t}^{\star}}\right)\right] \\ &\leq \inf\_{\mathbf{f}\in\mathcal{F}\_{T}}\left\{\sum\_{t=1}^{T}\Delta(\mathbf{f},\mathbf{x}\_{t})+\frac{1}{\eta\_{T}}h(\mathbf{f})\right\}+\sum\_{t=1}^{T}\left(\frac{1}{\eta\_{t}}-\frac{1}{\eta\_{t-1}}\right)\mathbb{E}\left[\sup\_{\mathbf{f}\in\mathcal{F}\_{T}}\left(-h(\mathbf{f})+Z\_{\mathbf{f}}\right)\right] \\ &= \inf\_{\mathbf{f}\in\mathcal{F}\_{T}}\left\{\sum\_{t=1}^{T}\Delta(\mathbf{f},\mathbf{x}\_{t})+\frac{1}{\eta\_{T}}h(\mathbf{f})\right\}+\frac{1}{\eta\_{T}}\mathbb{E}\left[\sup\_{\mathbf{f}\in\mathcal{F}\_{T}}\left(-h(\mathbf{f})+Z\_{\mathbf{f}}\right)\right], \end{split}$$

where the second inequality is due to $\mathbb{E}[Z\_{\mathbf{f}^{\star}}] = 0$ and $\frac{1}{\eta\_t} - \frac{1}{\eta\_{t-1}} > 0$ for $t = 0, 1, \dots, T$, since $\eta\_t$ is decreasing in $t$ in Theorem 2. In addition, for $y \ge 0$, one has

$$\mathbb{P}(-h(\mathbf{f}) + Z\_{\mathbf{f}} > y) = \mathbf{e}^{-h(\mathbf{f}) - y}.$$

Hence, for any *y* ≥ 0

$$\mathbb{P}\left(\sup\_{\mathbf{f}\in\mathcal{F}\_{\mathcal{P}}}(-h(\mathbf{f})+Z\_{\mathbf{f}})>y\right)\leq\sum\_{\mathbf{f}\in\mathcal{F}\_{\mathcal{P}}}\mathbb{P}(Z\_{\mathbf{f}}\geq h(\mathbf{f})+y)=\sum\_{\mathbf{f}\in\mathcal{F}\_{\mathcal{P}}}\mathbf{e}^{-h(\mathbf{f})}\mathbf{e}^{-y}=ue^{-y},$$

where *<sup>u</sup>* <sup>=</sup> <sup>∑</sup>**f**∈F*<sup>p</sup>* <sup>e</sup>−*h*(**f**). Therefore, we have

$$\begin{split} \mathbb{E}\left[\sup\_{\mathbf{f}\in\mathcal{F}\_{p}}(-h(\mathbf{f})+Z\_{\mathbf{f}})-\ln u\right] &\leq \mathbb{E}\Big[\max\Big(0,\sup\_{\mathbf{f}\in\mathcal{F}\_{p}}(-h(\mathbf{f})+Z\_{\mathbf{f}}-\ln u)\Big)\Big] \\ &\leq \int\_{0}^{\infty}\mathbb{P}\Big(\max\Big(0,\sup\_{\mathbf{f}\in\mathcal{F}\_{p}}(-h(\mathbf{f})+Z\_{\mathbf{f}}-\ln u)\Big)>y\Big)\mathrm{d}y \\ &\leq \int\_{0}^{\infty}\mathbb{P}\Big(\sup\_{\mathbf{f}\in\mathcal{F}\_{p}}(-h(\mathbf{f})+Z\_{\mathbf{f}})>y+\ln u\Big)\mathrm{d}y \\ &\leq \int\_{0}^{\infty} u\,\mathrm{e}^{-(y+\ln u)}\,\mathrm{d}y = 1. \end{split}$$

We thus obtain

$$\mathbb{E}\left[\sum\_{t=1}^{T}\Delta\left(\hat{\mathbf{f}}\_{t}^{\star},\mathbf{x}\_{t}\right)\right] \leq \inf\_{\mathbf{f}\in\mathcal{F}\_p}\left\{\sum\_{t=1}^{T}\Delta(\mathbf{f},\mathbf{x}\_{t}) + \frac{1}{\eta\_{T}}h(\mathbf{f})\right\} + \frac{1}{\eta\_{T}}\left(1 + \ln\sum\_{\mathbf{f}\in\mathcal{F}\_p}\mathrm{e}^{-h(\mathbf{f})}\right). \tag{6}$$

Next, we control the regret of Algorithm 2.

**Lemma 2.** *Assume that $z\_{\mathbf{f}}$ is sampled from the standard exponential distribution on $\mathbb{R}$, i.e., $\pi(z) = \mathrm{e}^{-z}\mathbb{1}\_{\{z>0\}}$. Assume that $\sup\_{t=1,\dots,T} \eta\_{t-1} \le \frac{1}{d(2R+\delta)^2}$, and define $c\_0 = d(2R+\delta)^2$. Then for any sequence $(x\_t) \in B(0, \sqrt{d}R)$, $t = 1, \dots, T$,*

$$\sum\_{t=1}^{T} \mathbb{E}\left[\Delta\left(\hat{\mathbf{f}}\_{t}, \mathbf{x}\_{t}\right)\right] \le \sum\_{t=1}^{T} (1 + \eta\_{t-1} c\_{0}(\mathrm{e} - 1))\, \mathbb{E}\left[\Delta\left(\hat{\mathbf{f}}\_{t}^{\star}, \mathbf{x}\_{t}\right)\right]. \tag{7}$$

**Proof.** Let us denote by

$$F\_{\mathbf{f}}(Z\_{\mathbf{f}}) = \Delta \left(\hat{\mathbf{f}}\_{t}, \mathbf{x}\_{t}\right) = \Delta \left(\underset{\mathbf{f} \in \mathcal{F}\_p}{\operatorname{arg\,inf}} \left(\sum\_{s=1}^{t-1} \Delta(\mathbf{f}, \mathbf{x}\_{s}) + \frac{1}{\eta\_{t-1}} h(\mathbf{f}) - \frac{1}{\eta\_{t-1}} Z\_{\mathbf{f}}\right), \mathbf{x}\_{t}\right),$$

the instantaneous loss suffered by the polygonal line ˆ **f***t* when *xt* is obtained. We have

$$\begin{aligned} \mathbb{E}[\Delta \Big(\hat{\mathbf{f}}\_{t}^{\star}, \mathbf{x}\_{t}\Big)] &= \int F\_{\mathbf{f}}(z - \eta\_{t-1} \Delta(\mathbf{f}, \mathbf{x}\_{t})) \pi(z) \mathrm{d}z \\ &= \int F\_{\mathbf{f}}(z) \pi(z + \eta\_{t-1} \Delta(\mathbf{f}, \mathbf{x}\_{t})) \mathrm{d}z \\ &= \int F\_{\mathbf{f}}(z) \mathbf{e}^{-(z + \eta\_{t-1} \Delta(\mathbf{f}, \mathbf{x}\_{t}))} \mathrm{d}z \\ &\ge \mathbf{e}^{-\eta\_{t-1} d (2R + \delta)^{2}} \int F\_{\mathbf{f}}(z) \mathbf{e}^{-z} \mathrm{d}z \\ &= \mathbf{e}^{-\eta\_{t-1} d (2R + \delta)^{2}} \mathbb{E}[\Delta \Big(\hat{\mathbf{f}}\_{t}, \mathbf{x}\_{t}\Big)], \end{aligned}$$

where the inequality is due to the fact that $\Delta(\mathbf{f}, x) \le d(2R+\delta)^2$ holds uniformly for any $\mathbf{f} \in \mathcal{F}\_p$ and $x \in B(0, \sqrt{d}R)$. Finally, summing over $t$ on both sides and using the elementary inequality $\mathrm{e}^x \le 1 + (\mathrm{e}-1)x$ for $x \in (0,1)$ concludes the proof.
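The elementary inequality used in the last step follows from convexity of $x \mapsto \mathrm{e}^x$: on $[0,1]$ the chord $1 + (\mathrm{e}-1)x$ lies above the curve, with equality at the endpoints. A quick numeric check:

```python
import math

# Verify e^x <= 1 + (e - 1) x on a fine grid of [0, 1]; the gap vanishes at
# x = 0 and x = 1 and is positive in between.
xs = [i / 1000.0 for i in range(1001)]
gaps = [1.0 + (math.e - 1.0) * x - math.exp(x) for x in xs]
assert min(gaps) >= -1e-12
assert abs(gaps[0]) < 1e-12 and abs(gaps[-1]) < 1e-12
```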

**Lemma 3.** *For $k \in \{1, \dots, p\}$, we control the cardinality of the set $\{\mathbf{f} \in \mathcal{F}\_p : \mathcal{K}(\mathbf{f}) = k\}$ as*

$$\ln\left|\left\{\mathbf{f}\in\mathcal{F}\_{p},\mathcal{K}(\mathbf{f})=k\right\}\right|\leq\left(\ln(8p\mathrm{e}V\_{d})+3d^{\frac{3}{2}}-d\right)k+\left(\frac{\ln 2}{\delta\sqrt{d}}+\frac{d}{\delta}\right)L+d\ln\left(\frac{\sqrt{d}(2R+\delta)}{\delta}\right) \triangleq c\_{1}k+c\_{2}L+c\_{3},$$

*where Vd denotes the volume of the unit ball in* R*d.*

**Proof.** First, let $N\_{k,\delta}$ denote the set of polygonal lines with $k$ segments whose vertices are in $Q\_\delta$. Notice that $N\_{k,\delta}$ differs from $\{\mathbf{f} \in \mathcal{F}\_p, \mathcal{K}(\mathbf{f}) = k\}$ and that

$$\left| \{ \mathbf{f} \in \mathcal{F}\_{p}, \mathcal{K}(\mathbf{f}) = k \} \right| \leq \binom{p}{k} \left| N\_{k, \delta} \right|.$$

Hence

$$\begin{split} \ln \left| \{ \mathbf{f} \in \mathcal{F}\_p, \mathcal{K}(\mathbf{f}) = k \} \right| &\leq \ln \binom{p}{k} + \ln \left| N\_{k, \delta} \right| \\ &\leq k \ln \frac{p \mathbf{e}}{k} + k \left( \ln 8V\_d + 3d^{\frac{3}{2}} - d \right) + \left( \frac{\ln 2}{\sqrt{d} \delta} + \frac{d}{\delta} \right) L + d \ln \left( \frac{\sqrt{d} (2R + \delta)}{\delta} \right) \\ &\leq k \ln (p \mathbf{e}) + k \left( \ln 8V\_d + 3d^{\frac{3}{2}} - d \right) + \left( \frac{\ln 2}{\sqrt{d} \delta} + \frac{d}{\delta} \right) L + d \ln \left( \frac{\sqrt{d} (2R + \delta)}{\delta} \right) . \end{split}$$

where the second inequality is a consequence of the elementary bound $\binom{p}{k} \le \left(\frac{p\mathrm{e}}{k}\right)^k$ combined with Lemma 2 in [16].
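The elementary bound invoked here can be checked numerically over a range of values:

```python
import math

# Sanity check of binom(p, k) <= (p e / k)^k for all 1 <= k <= p < 40.
for p in range(1, 40):
    for k in range(1, p + 1):
        assert math.comb(p, k) <= (p * math.e / k) ** k
```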

We now have all the ingredients to prove Theorem 1 and Theorem 2.

First, combining (6) and (7) yields that

$$\begin{split} \sum\_{t=1}^{T} \mathbb{E}\left[\Delta(\hat{\mathbf{f}}\_{t},\mathbf{x}\_{t})\right] &\leq \inf\_{\mathbf{f}\in\mathcal{F}\_p} \left\{ \sum\_{t=1}^{T} \Delta(\mathbf{f},\mathbf{x}\_{t}) + \frac{1}{\eta\_{T}} h(\mathbf{f}) \right\} + \frac{1}{\eta\_{T}} \left(1 + \ln \sum\_{\mathbf{f}\in\mathcal{F}\_p} \mathrm{e}^{-h(\mathbf{f})} \right) \\ &\qquad + c\_{0}(\mathrm{e}-1) \sum\_{t=1}^{T} \eta\_{t-1} \mathbb{E}\left[\Delta(\hat{\mathbf{f}}\_{t}^{\star},\mathbf{x}\_{t}) \right] \\ &\leq \inf\_{k\in\{1,\dots,p\}} \left\{ \inf\_{\substack{\mathbf{f}\in\mathcal{F}\_p \\ \mathcal{K}(\mathbf{f})=k}} \left\{ \sum\_{t=1}^{T} \Delta(\mathbf{f},\mathbf{x}\_{t}) + \frac{h(\mathbf{f})}{\eta\_{T}} \right\} \right\} + \frac{1}{\eta\_{T}} \left(1 + \ln \sum\_{\mathbf{f}\in\mathcal{F}\_p} \mathrm{e}^{-h(\mathbf{f})} \right) \\ &\qquad + c\_{0}(\mathrm{e}-1) \sum\_{t=1}^{T} \eta\_{t-1} \mathbb{E}\left[\Delta(\hat{\mathbf{f}}\_{t}^{\star},\mathbf{x}\_{t}) \right]. \end{split}$$

Assume that $\eta\_t = \eta$, $t = 0, \dots, T$, and $h(\mathbf{f}) = c\_1\mathcal{K}(\mathbf{f}) + c\_2 L + c\_3$ for $\mathbf{f} \in \mathcal{F}\_p$; then $\left(1 + \ln \sum\_{\mathbf{f}\in\mathcal{F}\_p} \mathrm{e}^{-h(\mathbf{f})}\right) \le 0$ and moreover

$$\begin{split} \sum\_{t=1}^{T} \mathbb{E}\left[\Delta(\hat{\mathbf{f}}\_{t}, \mathbf{x}\_{t})\right] &\leq S\_{T, h, \eta} + \frac{1}{\eta} \left(1 + \ln \sum\_{\mathbf{f} \in \mathcal{F}\_p} \mathrm{e}^{-h(\mathbf{f})}\right) + c\_{0}(\mathrm{e} - 1)\eta \sum\_{t=1}^{T} \mathbb{E}\left[\Delta(\hat{\mathbf{f}}\_{t}^{\star}, \mathbf{x}\_{t})\right] \\ &\leq S\_{T, h, \eta} + c\_{0}(\mathrm{e} - 1)\eta S\_{T, h, \eta} \\ &\leq S\_{T, h, \eta} + \eta c\_{0}(\mathrm{e} - 1) \inf\_{\mathbf{f} \in \mathcal{F}\_p} \sum\_{t=1}^{T} \Delta(\mathbf{f}, \mathbf{x}\_{t}) + c\_{0}(\mathrm{e} - 1)(c\_{1}p + c\_{2}L + c\_{3}), \end{split}$$

where

$$S\_{T, h, \eta} = \inf\_{k \in \{1,\dots,p\}} \left\{ \inf\_{\substack{\mathbf{f} \in \mathcal{F}\_p \\ \mathcal{K}(\mathbf{f}) = k}} \left\{ \sum\_{t=1}^T \Delta(\mathbf{f}, \mathbf{x}\_t) + \frac{h(\mathbf{f})}{\eta} \right\} \right\}$$

and the second inequality is obtained with Lemma 1. By setting

$$\eta = \sqrt{\frac{c\_1 p + c\_2 L + c\_3}{c\_0 (\mathbf{e} - 1) \inf\_{\mathbf{f} \in \mathcal{F}\_p} \sum\_{t=1}^T \Delta(\mathbf{f}, \mathbf{x}\_t)}}$$

we obtain

$$\begin{split} \sum\_{t=1}^{T} \mathbb{E} \left[ \Delta \left( \hat{\mathbf{f}}\_{t}, \mathbf{x}\_{t} \right) \right] &\leq \inf\_{k \in \{1,\dots,p\}} \left\{ \inf\_{\substack{\mathbf{f} \in \mathcal{F}\_{p} \\ \mathcal{K}(\mathbf{f}) = k}} \left\{ \sum\_{t=1}^{T} \Delta (\mathbf{f}, \mathbf{x}\_{t}) + \sqrt{c\_{0} (\mathrm{e} - 1) r\_{T, k, L}} \right\} \right\} \\ &\qquad + \sqrt{c\_{0} (\mathrm{e} - 1) r\_{T, p, L}} + c\_{0} (\mathrm{e} - 1) (c\_{1} p + c\_{2} L + c\_{3}), \end{split}$$

where $r\_{T,k,L} = \inf\_{\mathbf{f}\in\mathcal{F}\_p} \sum\_{t=1}^{T} \Delta(\mathbf{f}, \mathbf{x}\_t)\,(c\_1 k + c\_2 L + c\_3)$. This proves Theorem 1.

Finally, assume that

$$\eta\_0 = \frac{\sqrt{c\_1 p + c\_2 L + c\_3}}{c\_0 \sqrt{(\mathbf{e} - 1)}} \quad \text{and} \quad \eta\_t = \frac{\sqrt{c\_1 p + c\_2 L + c\_3}}{c\_0 \sqrt{(\mathbf{e} - 1)t}}, \qquad t = 1, \dots, T.$$

Since $\mathbb{E}\left[\Delta\left(\hat{\mathbf{f}}\_{t}^{\star}, x\_t\right)\right] \le c\_0$ for any $t = 1, \dots, T$, we have

$$\begin{split} \sum\_{t=1}^{T} \mathbb{E}\left[\Delta(\hat{\mathbf{f}}\_{t},\mathbf{x}\_{t})\right] &\leq \inf\_{k \in \{1,\dots,p\}} \left\{ \inf\_{\substack{\mathbf{f} \in \mathcal{F}\_{p} \\ \mathcal{K}(\mathbf{f}) = k}} \left\{ \sum\_{t=1}^{T} \Delta(\mathbf{f},\mathbf{x}\_{t}) + \frac{h(\mathbf{f})}{\eta\_{T}} \right\} \right\} + \frac{1}{\eta\_{T}} \left(1 + \ln \sum\_{\mathbf{f} \in \mathcal{F}\_{p}} \mathrm{e}^{-h(\mathbf{f})}\right) \\ &\qquad + c\_{0}^{2}(\mathrm{e} - 1) \sum\_{t=1}^{T} \eta\_{t-1} \\ &\leq \inf\_{k \in \{1,\dots,p\}} \left\{ \inf\_{\substack{\mathbf{f} \in \mathcal{F}\_{p} \\ \mathcal{K}(\mathbf{f}) = k}} \left\{ \sum\_{t=1}^{T} \Delta(\mathbf{f},\mathbf{x}\_{t}) + c\_{0} \sqrt{(\mathrm{e} - 1)T(c\_{1}k + c\_{2}L + c\_{3})} \right\} \right\} \\ &\qquad + 2c\_{0} \sqrt{(\mathrm{e} - 1)T(c\_{1}p + c\_{2}L + c\_{3})}, \end{split}$$

which concludes the proof of Theorem 2.

**Lemma 4.** *Using Algorithm 3, if $0 < \epsilon \le 1$, $0 < \beta < 1$, $\alpha \ge \frac{(1-\beta)c\_0}{\beta}$ and $\left|\mathcal{U}\left(\hat{\mathbf{f}}\_{t-1}\right)\right| \ge 2$ for all $t \ge 2$, where $\left|\mathcal{U}\left(\hat{\mathbf{f}}\_{t-1}\right)\right|$ is the cardinality of $\mathcal{U}\left(\hat{\mathbf{f}}\_{t-1}\right)$, then we have*

$$\sum\_{t=1}^{T} \mathbb{E}\left[r\_{\hat{\mathbf{f}}\_t,t}\right] \ge \sum\_{t=1}^{T} \mathbb{E}\left[\hat{r}\_{\hat{\sigma}\_t(\mathcal{A}\_t),t}\right] - 2(1-\epsilon)\alpha\beta \sum\_{t=1}^{T} \left|\mathcal{U}\left(\hat{\mathbf{f}}\_{t-1}\right)\right|.$$

**Proof.** First notice that $\mathcal{A}\_t = \mathcal{U}\left(\hat{\mathbf{f}}\_{t-1}\right)$ if $I\_t = 0$, and that for $t \ge 2$,

$$\begin{split} \mathbb{E}\left[r\_{\hat{\mathbf{f}}\_t,t} \,\middle|\, \mathcal{H}\_t, I\_t = 0\right] &= \mathbb{E}\left[r\_{\hat{\sigma}\_t(\mathcal{A}\_t),t} \,\middle|\, \mathcal{H}\_t, I\_t = 0\right] \\ &= \sum\_{\mathbf{f} \in \mathcal{A}\_t \cap cond(t)} r\_{\mathbf{f},t}\, \mathbb{P}\left(\hat{\sigma}\_t(\mathcal{A}\_t) = \mathbf{f} \,\middle|\, \mathcal{H}\_t\right) + \sum\_{\mathbf{f} \in \mathcal{A}\_t \cap cond(t)^c} r\_{\mathbf{f},t}\, \mathbb{P}\left(\hat{\sigma}\_t(\mathcal{A}\_t) = \mathbf{f} \,\middle|\, \mathcal{H}\_t\right) \\ &\ge \sum\_{\mathbf{f} \in \mathcal{A}\_t \cap cond(t)} r\_{\mathbf{f},t} + \sum\_{\mathbf{f} \in \mathcal{A}\_t \cap cond(t)^c} \alpha\, \mathbb{P}\left(\hat{\sigma}\_t(\mathcal{A}\_t) = \mathbf{f} \,\middle|\, \mathcal{H}\_t\right) \\ &\qquad - (1-\beta) \sum\_{\mathbf{f} \in \mathcal{A}\_t \cap cond(t)} r\_{\mathbf{f},t} - \sum\_{\mathbf{f} \in \mathcal{A}\_t \cap cond(t)^c} (\alpha - r\_{\mathbf{f},t})\, \mathbb{P}\left(\hat{\sigma}\_t(\mathcal{A}\_t) = \mathbf{f} \,\middle|\, \mathcal{H}\_t\right) \\ &= \mathbb{E}\left[\hat{r}\_{\hat{\sigma}\_t(\mathcal{A}\_t),t} \,\middle|\, \mathcal{H}\_t, I\_t = 0\right] - (1-\beta) \sum\_{\mathbf{f} \in \mathcal{A}\_t \cap cond(t)} r\_{\mathbf{f},t} - \sum\_{\mathbf{f} \in \mathcal{A}\_t \cap cond(t)^c} (\alpha - r\_{\mathbf{f},t})\, \mathbb{P}\left(\hat{\sigma}\_t(\mathcal{A}\_t) = \mathbf{f} \,\middle|\, \mathcal{H}\_t\right) \\ &\ge \mathbb{E}\left[\hat{r}\_{\hat{\sigma}\_t(\mathcal{A}\_t),t} \,\middle|\, \mathcal{H}\_t, I\_t = 0\right] - (1-\beta)c\_0 |\mathcal{A}\_t| - \alpha\beta |\mathcal{A}\_t| \\ &\ge \mathbb{E}\left[\hat{r}\_{\hat{\sigma}\_t(\mathcal{A}\_t),t} \,\middle|\, \mathcal{H}\_t, I\_t = 0\right] - 2\alpha\beta |\mathcal{A}\_t|, \end{split}$$

where $cond(t)^c$ denotes the complement of the set $cond(t)$. The first inequality above is due to the assumption that for all $\mathbf{f} \in \mathcal{A}\_t \cap cond(t)$, we have $\mathbb{P}\left(\hat{\sigma}\_t(\mathcal{A}\_t) = \mathbf{f} \,\middle|\, \mathcal{H}\_t\right) \ge \beta$. For $t = 1$, the above inequality is trivial since $\hat{r}\_{\hat{\sigma}\_1(\mathcal{U}(\hat{\mathbf{f}}\_0)),1} \equiv 0$ by its definition. Hence, for $t \ge 1$, one has

$$\begin{split} \mathbb{E}\left[r\_{\hat{\mathbf{f}}\_t,t} \,\middle|\, \mathcal{H}\_t\right] &= \epsilon\, \mathbb{E}\left[r\_{\hat{\sigma}\_t(\mathcal{F}\_p),t} \,\middle|\, \mathcal{H}\_t, I\_t = 1\right] + (1-\epsilon)\, \mathbb{E}\left[r\_{\hat{\sigma}\_t(\mathcal{A}\_t),t} \,\middle|\, \mathcal{H}\_t, I\_t = 0\right] \\ &\ge \mathbb{E}\left[\hat{r}\_{\hat{\sigma}\_t(\mathcal{A}\_t),t} \,\middle|\, \mathcal{H}\_t\right] - 2(1-\epsilon)\alpha\beta \left|\mathcal{U}\left(\hat{\mathbf{f}}\_{t-1}\right)\right|. \end{split} \tag{8}$$

Summing on both sides of inequality (8) over *t* terminates the proof of Lemma 4.

**Lemma 5.** *Let $\hat{c}\_0 = \frac{c\_0}{\beta} + \alpha$. If $0 < \eta\_1 = \eta\_2 = \dots = \eta\_T = \eta < \frac{1}{\hat{c}\_0}$, then we have*

$$\mathbb{E}\left[\max\_{\hat{\sigma}} \left\{ \sum\_{t=1}^{T} \hat{r}\_{\hat{\sigma}(\mathcal{A}\_t),t} - \frac{1}{\eta} h(\hat{\sigma}(\mathcal{A}\_t)) \right\} \right] - \sum\_{t=1}^{T} \mathbb{E}\left[\hat{r}\_{\hat{\sigma}\_t(\mathcal{A}\_t),t}\right] \le \hat{c}\_0^2(\mathrm{e}-1)\eta T + \hat{c}\_0(\mathrm{e}-1)(c\_1 p + c\_2 L + c\_3).$$

**Proof.** By the definition of *r*ˆ**f**,*<sup>t</sup>* in Algorithm 3, for any **f** ∈ F*<sup>p</sup>* and *t* ≥ 1, we have

$$
\hat{r}\_{\mathbf{f},t} \le \max\left\{ \frac{r\_{\mathbf{f},t}}{\mathbb{P}\left(\hat{\mathbf{f}}\_{t} = \mathbf{f} \, \middle| \, \mathcal{H}\_{t}\right)}, \alpha, r\_{\mathbf{f},t} \right\} \le \max\left\{ \frac{c\_{0}}{\beta}, \alpha \right\} \le \hat{c}\_{0},
$$

where in the second inequality we used that $r\_{\mathbf{f},t} \le c\_0$ for all $\mathbf{f}$ and $t$, and that $\mathbb{P}\left(\hat{\mathbf{f}}\_t = \mathbf{f} \,\middle|\, \mathcal{H}\_t\right) \ge \beta$ when $\mathbf{f} \in \mathcal{U}(\hat{\mathbf{f}}\_{t-1}) \cap cond(t)$. The rest of the proof is similar to those of Lemmas 1 and 2. In fact, if we define $\hat{\Delta}(\mathbf{f}, x\_t) = \hat{c}\_0 - \hat{r}\_{\mathbf{f},t}$, then one easily observes the following relation when $I\_t = 1$ (a similar relation holds in the case $I\_t = 0$):

$$\begin{split} \hat{\mathbf{f}}\_{t} = \hat{\sigma}\_t\left(\mathcal{F}\_{p}\right) &= \underset{\mathbf{f} \in \mathcal{F}\_{p}}{\arg\max} \left\{ \sum\_{s=1}^{t-1} \hat{r}\_{\mathbf{f},s} + \frac{1}{\eta} (z\_{\mathbf{f}} - h(\mathbf{f})) \right\} \\ &= \underset{\mathbf{f} \in \mathcal{F}\_{p}}{\arg\min} \left\{ \sum\_{s=1}^{t-1} \hat{\Delta}(\mathbf{f}, x\_{s}) + \frac{1}{\eta} (h(\mathbf{f}) - z\_{\mathbf{f}}) \right\}. \end{split}$$

Then applying Lemmas 1 and 2 to the newly defined sequence $\hat{\Delta}\left(\hat{\mathbf{f}}\_t, x\_t\right)$, $t = 1, \dots, T$, leads to the result of Lemma 5.

The proof of the upcoming Lemma 6 requires the following submartingale inequality: let *Y*0, ... *YT* be a sequence of random variable adapted to random events H0, ... , H*<sup>T</sup>* such that for 1 ≤ *t* ≤ *T*, the following three conditions hold:

$$\mathbb{E}[Y_t \mid \mathcal{H}_t] \le 0, \quad \mathrm{Var}(Y_t \mid \mathcal{H}_t) \le a^2, \quad Y_t - \mathbb{E}[Y_t \mid \mathcal{H}_t] \le b.$$

Then for any *λ* > 0,

$$\mathbb{P}\left(\sum_{t=1}^T Y_t > Y_0 + \lambda\right) \le \exp\left(-\frac{\lambda^2}{2T(a^2 + b^2)}\right).$$

The proof can be found in Chung and Lu [43] (Theorem 7.3).

**Lemma 6.** *Assume that* $0 < \beta < \frac{1}{|\mathcal{F}_p|}$, $\alpha \ge \frac{c_0}{\beta}$ *and* $\eta > 0$*; then we have*

$$\begin{split} \mathbb{E}\left[\max_{\sigma}\left\{\sum_{t=1}^{T} r_{\sigma(\mathcal{A}_{t}),t} - \frac{1}{\eta}h(\sigma(\mathcal{A}_{t}))\right\}\right] &- \mathbb{E}\left[\max_{\sigma}\left\{\sum_{t=1}^{T} \hat{r}_{\sigma(\mathcal{A}_{t}),t} - \frac{1}{\eta}h(\sigma(\mathcal{A}_{t}))\right\}\right] \\ &\leq \left(1 - \left|\mathcal{F}_{p}\right|\beta\right)\sqrt{2T\left[\frac{c_{0}^{2}}{\beta} + \alpha^{2}(1-\beta) + (c_{0} + 2\alpha)^{2}\right]\ln\left(\frac{1}{\beta}\right)} + \left|\mathcal{F}_{p}\right|\beta c_{0}T. \end{split}$$

**Proof.** First, we have almost surely that

$$\max_{\sigma}\left\{\sum_{t=1}^{T} r_{\sigma(\mathcal{A}_{t}),t} - \frac{1}{\eta}h(\sigma(\mathcal{A}_{t}))\right\} - \max_{\sigma}\left\{\sum_{t=1}^{T} \hat{r}_{\sigma(\mathcal{A}_{t}),t} - \frac{1}{\eta}h(\sigma(\mathcal{A}_{t}))\right\} \le \max_{\mathbf{f} \in \mathcal{F}_{p}}\sum_{t=1}^{T}\left(r_{\mathbf{f},t} - \hat{r}_{\mathbf{f},t}\right).$$

Denote $Y_{\mathbf{f},t} = r_{\mathbf{f},t} - \hat{r}_{\mathbf{f},t}$. Since

$$\mathbb{E}\left[\hat{r}_{\mathbf{f},t} \mid \mathcal{H}_{t}\right] = \begin{cases} r_{\mathbf{f},t} + (1 - \epsilon)\alpha\left(1 - \mathbb{P}\left(\hat{\mathbf{f}}_{t} = \mathbf{f} \mid \mathcal{H}_{t}\right)\right) & \text{if } \mathbf{f} \in \mathrm{U}(\hat{\mathbf{f}}_{t-1}) \cap \mathrm{cond}(t), \\ \epsilon r_{\mathbf{f},t} + (1 - \epsilon)\alpha & \text{otherwise}, \end{cases}$$

and $\alpha > c_0 \ge r_{\mathbf{f},t}$ uniformly for any $\mathbf{f}$ and $t$, we have $\mathbb{E}[Y_{\mathbf{f},t} \mid \mathcal{H}_t] \le 0$ in both cases, so the first condition is satisfied.

For the second condition, if $\mathbf{f} \in \mathrm{U}(\hat{\mathbf{f}}_{t-1}) \cap \mathrm{cond}(t)$, then

$$\begin{split} \mathrm{Var}(Y_{\mathbf{f},t} \mid \mathcal{H}_{t}) &= \mathbb{E}\left[\hat{r}_{\mathbf{f},t}^{2} \mid \mathcal{H}_{t}\right] - \left(\mathbb{E}[\hat{r}_{\mathbf{f},t} \mid \mathcal{H}_{t}]\right)^{2} \\ &\le \epsilon r_{\mathbf{f},t}^{2} + (1-\epsilon)\left[\frac{r_{\mathbf{f},t}^{2}}{\mathbb{P}\left(\hat{\mathbf{f}}_{t} = \mathbf{f} \mid \mathcal{H}_{t}\right)} + \alpha^{2}\left(1 - \mathbb{P}\left(\hat{\mathbf{f}}_{t} = \mathbf{f} \mid \mathcal{H}_{t}\right)\right)\right] \\ &\quad - \left[r_{\mathbf{f},t} + (1-\epsilon)\alpha\left(1 - \mathbb{P}\left(\hat{\mathbf{f}}_{t} = \mathbf{f} \mid \mathcal{H}_{t}\right)\right)\right]^{2} \\ &\le \frac{r_{\mathbf{f},t}^{2}}{\beta} + \alpha^{2}(1-\beta) \le \frac{c_{0}^{2}}{\beta} + \alpha^{2}(1-\beta). \end{split}$$

Similarly, for $\mathbf{f} \notin \mathrm{U}(\hat{\mathbf{f}}_{t-1}) \cap \mathrm{cond}(t)$, one obtains $\mathrm{Var}(Y_{\mathbf{f},t} \mid \mathcal{H}_t) \le \alpha^2$. Moreover, for the third condition, since

$$\mathbb{E}[Y_{\mathbf{f},t} \mid \mathcal{H}_{t}] \ge -2\alpha,$$

then

$$Y_{\mathbf{f},t} - \mathbb{E}[Y_{\mathbf{f},t} \mid \mathcal{H}_{t}] \le r_{\mathbf{f},t} + 2\alpha \le c_{0} + 2\alpha.$$

Setting $\lambda = \sqrt{2T\left[\frac{c_{0}^{2}}{\beta} + \alpha^{2}(1-\beta) + (c_{0} + 2\alpha)^{2}\right]\ln\left(\frac{1}{\beta}\right)}$ in the submartingale inequality leads to

$$\mathbb{P}\left(\sum_{t=1}^{T} Y_{\mathbf{f},t} \ge \lambda\right) \le \beta.$$

Hence, by a union bound over $\mathbf{f} \in \mathcal{F}_p$, the following inequality holds with probability at least $1 - \left|\mathcal{F}_p\right|\beta$:

$$\max_{\mathbf{f}\in\mathcal{F}_{p}}\sum_{t=1}^{T}\left(r_{\mathbf{f},t}-\hat{r}_{\mathbf{f},t}\right) \le \sqrt{2T\left[\frac{c_{0}^{2}}{\beta}+\alpha^{2}(1-\beta)+(c_{0}+2\alpha)^{2}\right]\ln\left(\frac{1}{\beta}\right)}.$$

Finally, noticing that $\max_{\mathbf{f}\in\mathcal{F}_p}\sum_{t=1}^{T}(r_{\mathbf{f},t} - \hat{r}_{\mathbf{f},t}) \le c_0 T$ almost surely, we conclude the proof of Lemma 6.

**Proof of Theorem 3.** Assume that $p > 6$ and $T \ge 2|\mathcal{F}_p|^2$, and let

$$\begin{aligned} \beta &= \left|\mathcal{F}_{p}\right|^{-\frac{1}{2}} T^{-\frac{1}{4}}, \qquad \alpha = \frac{c_{0}}{\beta}, \qquad \hat{c}_{0} = \frac{2c_{0}}{\beta}, \\ \eta_{1} &= \eta_{2} = \dots = \eta_{T} = \frac{\sqrt{c_{1}p + c_{2}L + c_{3}}}{\sqrt{T(e - 1)}\,\hat{c}_{0}}, \qquad \epsilon = 1 - \left|\mathcal{F}_{p}\right|^{\frac{1}{2} - \frac{2}{p}} T^{-\frac{1}{4}}. \end{aligned}$$

With these values, the assumptions of Lemmas 4, 5 and 6 are satisfied. Combining their results leads to the following:

$$\begin{split} \sum_{t=1}^{T}\mathbb{E}\left[\hat{r}_{\hat{\mathbf{f}}_{t},t}\right] &\ge \mathbb{E}\left[\max_{\sigma}\left\{\sum_{t=1}^{T} r_{\sigma(\mathcal{A}_{t}),t} - \frac{1}{\eta}h(\sigma(\mathcal{A}_{t}))\right\}\right] - 2\alpha\beta(1-\epsilon)\sum_{t=1}^{T}\left|\mathrm{U}(\hat{\mathbf{f}}_{t-1})\right| \\ &\quad - \hat{c}_{0}^{2}(e-1)\eta T - \hat{c}_{0}(e-1)(c_{1}p + c_{2}L + c_{3}) \\ &\quad - \left(1 - \left|\mathcal{F}_{p}\right|\beta\right)\sqrt{2T\left[\frac{c_{0}^{2}}{\beta} + \alpha^{2}(1-\beta) + (c_{0}+2\alpha)^{2}\right]\ln\left(\frac{1}{\beta}\right)} - \left|\mathcal{F}_{p}\right|\beta c_{0}T \\ &\ge \mathbb{E}\left[\max_{\sigma}\left\{\sum_{t=1}^{T} r_{\sigma(\mathcal{A}_{t}),t} - \frac{1}{\eta}h(\sigma(\mathcal{A}_{t}))\right\}\right] - 2c_{0}(1-\epsilon)\left|\mathcal{F}_{p}\right|^{\frac{3}{p}}T \\ &\quad - \hat{c}_{0}^{2}(e-1)\eta T - \hat{c}_{0}(e-1)(c_{1}p + c_{2}L + c_{3}) \\ &\quad - \left(1 - \left|\mathcal{F}_{p}\right|\beta\right)\sqrt{2T\left[\frac{c_{0}^{2}}{\beta} + \alpha^{2}(1-\beta) + (c_{0}+2\alpha)^{2}\right]\ln\left(\frac{1}{\beta}\right)} - \left|\mathcal{F}_{p}\right|\beta c_{0}T \\ &\ge \mathbb{E}\left[\max_{\sigma}\left\{\sum_{t=1}^{T} r_{\sigma(\mathcal{A}_{t}),t} - \frac{1}{\eta}h(\sigma(\mathcal{A}_{t}))\right\}\right] - \mathcal{O}\left(\left|\mathcal{F}_{p}\right|^{\frac{1}{2}}T^{\frac{3}{4}}\right), \end{split}$$

where the second inequality is due to the fact that the cardinality $\left|\mathrm{U}(\hat{\mathbf{f}}_{t-1})\right|$ is upper bounded by $\left|\mathcal{F}_p\right|^{\frac{3}{p}}$ for $t \ge 1$, together with $\alpha\beta = c_0$. In addition, using the definition of $r_{\mathbf{f},t}$, namely $r_{\mathbf{f},t} = c_0 - \Delta(\mathbf{f}, x_t)$, concludes the proof of Theorem 3.

**Author Contributions:** Conceptualization, L.L. and B.G.; Formal analysis, L.L. and B.G.; Methodology, B.G.; Project administration, B.G.; Software, L.L.; Supervision, B.G.; Writing—original draft, L.L. and B.G.; Writing—review and editing, L.L. and B.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** LL is funded and supported by the Fundamental Research Funds for the Central Universities (Grant No. 30106210158), the National Natural Science Foundation of China (Grant No. 61877023), and the Fundamental Research Funds for the Central Universities (CCNU19TD009). BG is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1. BG acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23-0015-02.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


## *Article* **Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds**

**Benjamin Guedj 1,2,\* and Louis Pujol 3,\***


**Abstract:** "No free lunch" results state the impossibility of obtaining meaningful bounds on the error of a learning algorithm without prior assumptions and modelling, which is more or less realistic for a given problem. Some models are "expensive" (strong assumptions, such as sub-Gaussian tails), others are "cheap" (simply finite variance). As it is well known, the more you pay, the more you get: in other words, the most expensive models yield the more interesting bounds. Recent advances in robust statistics have investigated procedures to obtain tight bounds while keeping the cost of assumptions minimal. The present paper explores and exhibits what the limits are for obtaining tight probably approximately correct (PAC)-Bayes bounds in a robust setting for cheap models.

**Keywords:** statistical learning theory; PAC-Bayes theory; no free lunch theorems

## **1. Introduction**

For the sake of clarity, we focus on the supervised learning problem. We collect a sequence of input–output pairs $(X_i, Y_i)_{i=1}^{N} \in (\mathcal{X} \times \mathcal{Y})^N$, which we assume to be $N$ independent realisations of a random variable drawn from a distribution $\mathcal{P}$ on $\mathcal{X} \times \mathcal{Y}$. The overarching goal in statistics and machine learning is to select a hypothesis $f$ over a space $\mathcal{F}$ which, given a new input $x$ in $\mathcal{X}$, delivers an output $f(x)$ in $\mathcal{Y}$, hopefully close (in a certain sense) to the unknown true output $y$. The quality of $f$ is assessed through a loss function $\ell$ which characterises the discrepancy between the true output $y$ and its prediction $f(x)$, and we define a global notion of risk as

$$\mathcal{R}(f) = \mathbb{E}_{(X, Y) \sim \mathcal{P}}[\ell(f(X), Y)].$$

The aim of machine learning is to find a good (in the sense of a low risk) hypothesis $f \in \mathcal{F}$. In the generalised Bayes setting, the learning algorithm does not output a single hypothesis but rather a *distribution* $\rho$ over the hypothesis space $\mathcal{F}$, and the associated bounds are called PAC-Bayesian bounds (see [1] for a survey of the topic).

As with many probabilistic bounds stated in the statistics and machine learning literature, PAC-Bayesian bounds (where PAC stands for probably approximately correct, see [2]) commonly require strong assumptions to hold, such as sub-Gaussian behaviour of some random variables. These assumptions can be misleading when dealing with real data, as they do not account for some practical situations, such as outlier contamination. Many efforts have been made recently to keep tight generalisation bounds valid under a minimal set of assumptions about the underlying distribution: this is known as robust learning [see [3] for a survey of the topic].

In this work we explore the possibility of establishing a connection between recent techniques introduced in robust machine learning and PAC-Bayesian generalisation bounds. The result of our work is negative, as we were not able to prove a PAC-Bayes bound in a

**Citation:** Guedj, B.; Pujol, L. Still No Free Lunches: The Price to Pay for Tighter PAC-Bayes Bounds. *Entropy* **2021**, *23*, 1529. https://doi.org/ 10.3390/e23111529

Academic Editor: Wray Buntine

Received: 29 August 2021 Accepted: 3 November 2021 Published: 18 November 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).


robust statistics setting. However, we found it useful to write down our findings in order to give the interested reader a review of the material involved in both robust statistics and PAC-Bayes theory, and to present the fundamental issues we faced, as we believe this to be useful to the community.

**Organisation of the paper.** We introduce an elementary example and set basic notation to illustrate the problem of robustness in Section 2, before providing an overview of recent advances in robust statistics in Section 3 and briefly introducing the field of PAC-Bayes learning in Section 4. We then propose in Section 5 a detailed study of the structural limits which prevent PAC-Bayes bounds from being simultaneously tight and free of strong assumptions. The paper closes with a discussion in Section 6.

## **2. About the "No Free Lunch" Results**

A class of results in statistics is known as "no free lunch" statements [see [4], Chapter 7]. These results typically state that, without restrictions on the modelling of the data-generating process, one cannot obtain meaningful deviation bounds in a non-asymptotic regime. The well-known trade-off is that the more restrictive the assumptions, the tighter the bounds. Let us illustrate this classical phenomenon with a simple example.

Assume that we have a dataset consisting of $N$ real observations $x_1, \ldots, x_N \in \mathbb{R}$, and consider them to be independent, identically distributed (iid) realisations of a random variable $X$. Our goal is to estimate the mean of $X$ and build a confidence interval for this estimate. As a start, let us focus on the empirical mean, denoted by $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$. As "no free lunch" results state, we have to consider a class of distributions to which the data-generating distribution $\mathcal{P}$ belongs.

#### *2.1. Expensive and Cheap Models*

While there is always a price to pay in order to derive insightful results, there is a variety of degrees of restriction. In the remainder of the paper, we will focus on two classical models corresponding to different levels of demand on the random variables.

A first type of restriction we can make is an "expensive modelling". For $\sigma > 0$, let $\mathcal{P}^{\sigma}_{\mathrm{expensive}}$ be the set of all real-valued random variables $X$ satisfying, for all $\lambda \in \mathbb{R}$:

$$\log\left(\mathbb{E}\left[\exp\left\{\lambda\left(X - \mathbb{E}[X]\right)\right\}\right]\right) \le \frac{\lambda^2 \sigma^2}{2}.$$

This $\mathcal{P}^{\sigma}_{\mathrm{expensive}}$ is the class of sub-Gaussian random variables with variance factor $\sigma^2$ [see [5] for a complete coverage of the topic]. We call this model "expensive" as this restriction is often considered unrealistic for real-life datasets and is hard or impossible to check in practice.

An alternative type of restriction is a "cheap modelling". For $\sigma > 0$, let $\mathcal{P}^{\sigma}_{\mathrm{cheap}}$ be the set of real-valued random variables with a finite variance, upper bounded by $\sigma^2$. We call this model "cheap" as it is considerably less restrictive than the expensive one and much more likely to hold in practice.

#### *2.2. Confidence Interval for the Empirical Mean*

**Proposition 1** (Confidence intervals)**.** *If we assume that $X \in \mathcal{P}^{\sigma}_{\mathrm{expensive}}$, then for all $\delta \in (0, 1/2)$, the following random interval is a confidence interval for the mean of $X$ at level $1 - \delta$:*

$$
\left[\bar{x} \pm \frac{\sigma}{\sqrt{N}} \sqrt{2} \times \sqrt{2 \log\left(\frac{1}{\delta}\right)}\,\right].\tag{1}
$$

*If we assume that $X \in \mathcal{P}^{\sigma}_{\mathrm{cheap}}$, then for all $\delta \in (0, 1)$, the following random interval is a confidence interval for the mean of $X$ at level $1 - \delta$:*

$$
\left[\bar{x} \pm \frac{\sigma}{\sqrt{N}} \sqrt{\frac{1}{\delta}}\,\right].\tag{2}
$$

*In the case of the cheap model, there is no hope of obtaining a significantly tighter confidence interval with respect to $\delta$ if one uses the empirical mean [as proved in [6], Proposition 6.2].*

**Proof.** To establish the first confidence interval (1), we first remark that if $X \in \mathcal{P}^{\sigma}_{\mathrm{expensive}}$, then $\bar{x} \in \mathcal{P}^{\sigma/\sqrt{N}}_{\mathrm{expensive}}$ and $\mathbb{E}[\bar{x}] = \mathbb{E}[X]$. So, applying Theorem 2.1 of [5] to $\bar{x} - \mathbb{E}[X]$, we obtain, for all $a > 0$:

$$\begin{split} \mathbb{P}(|\bar{x} - \mathbb{E}[X]| > a) &= \mathbb{P}(\bar{x} - \mathbb{E}[X] > a) + \mathbb{P}(\bar{x} - \mathbb{E}[X] < -a) \\ &\le 2\max\left(\mathbb{P}(\bar{x} - \mathbb{E}[X] > a), \mathbb{P}(\bar{x} - \mathbb{E}[X] < -a)\right) \\ &\le 2\exp\left(-\frac{Na^2}{2\sigma^2}\right). \end{split}$$

Setting $\delta = 2\exp\left(-\frac{Na^2}{2\sigma^2}\right)$, i.e., $a = \frac{\sigma}{\sqrt{N}}\sqrt{2\log(2/\delta)}$, and noticing that $\log(2/\delta) \le 2\log(1/\delta)$ for $\delta \le 1/2$, leads to the expected result. The second confidence interval (2) is obtained through Chebyshev's inequality: $\mathbb{E}[\bar{x}] = \mathbb{E}[X]$ and, as $X \in \mathcal{P}^{\sigma}_{\mathrm{cheap}}$, $\mathrm{Var}(\bar{x}) = \frac{\mathrm{Var}(X)}{N} \le \frac{\sigma^2}{N}$. So for all $a > 0$,

$$\mathbb{P}(|\bar{x} - \mathbb{E}[X]| > a) \le \frac{\sigma^2}{Na^2}.$$

Now, setting $\delta = \frac{\sigma^2}{Na^2}$, we get

$$\mathbb{P}\left(|\bar{x} - \mathbb{E}[X]| > \frac{\sigma}{\sqrt{N}}\sqrt{\frac{1}{\delta}}\right) \le \delta.$$

Note that the dependence on $\delta$ is fairly different in the two confidence intervals defined in (1) and (2): for fixed $\sigma^2$ and $N$, the $\sqrt{2} \times \sqrt{2\log(1/\delta)}$ regime (following the lunch metaphor, the "good lunch") is much more favourable than the $1/\sqrt{\delta}$ regime (the "bad lunch"). We illustrate this in Figure 1, where we plot $\sqrt{2} \times \sqrt{2\log(1/\delta)}$ and $1/\sqrt{\delta}$ as a function of $\delta \in (0, 1/2)$. We remark that for small values of $\delta$, corresponding to a higher confidence level, the interval (1) will be much tighter than (2).

**Figure 1.** $\sqrt{2} \times \sqrt{2\log(1/\delta)}$ and $1/\sqrt{\delta}$ with respect to $\delta$.
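The gap between the two regimes plotted in Figure 1 is easy to quantify numerically. The sketch below (plain Python, with illustrative values of $\delta$) compares the two interval-width factors:

```python
import math

def width_expensive(delta):
    """Sub-Gaussian ("good lunch") width factor from (1)."""
    return math.sqrt(2) * math.sqrt(2 * math.log(1 / delta))

def width_cheap(delta):
    """Chebyshev ("bad lunch") width factor from (2)."""
    return 1 / math.sqrt(delta)

for delta in (0.1, 0.01, 0.001):
    print(f"delta={delta}: expensive={width_expensive(delta):.2f}, "
          f"cheap={width_cheap(delta):.2f}")
```

For $\delta = 0.001$ the Chebyshev factor is already about six times larger, and the ratio grows without bound as $\delta \to 0$.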

So, while it is clear that the best confidence interval requires more stringent assumptions, there have been attempts at relaxing those assumptions—or in other words, keeping equally good lunches at a cheaper cost.

## **3. Robust Statistics**

Robust statistics addresses the following question: can we obtain tight bounds with minimal assumptions, or in other words, can we get a good cheap lunch? In the mean estimation case hinted at in Section 2, the question becomes the following: if $\mathcal{P} \in \mathcal{P}^{\sigma}_{\mathrm{cheap}}$, can we build a confidence interval at level $1 - \delta$ with a size proportional to $\frac{\sigma}{\sqrt{N}}\sqrt{2\log(1/\delta)}$?

As mentioned above, there is no hope of achieving this goal with the empirical mean. Different alternative estimators have thus been considered in robust statistics, such as M-estimators [6] or median-of-means (MoM) estimators [see [7] for a recent survey, and references therein].

The key idea of MoM estimators is to achieve a compromise between the unbiased but non-robust empirical mean and the biased but robust median. As before, let us consider a sample of $N$ real numbers $x_1, \ldots, x_N$, assumed to be an iid sequence drawn from a distribution $\mathcal{P}$. Let $K \le N$ be a positive integer and assume for simplicity that $K$ is a divisor of $N$. To compute the MoM estimator, the first step consists of dividing the sample $(x_1, \ldots, x_N)$ into $K$ non-overlapping blocks $B_1, \ldots, B_K$, each of length $N/K$. For each block, we then compute the empirical mean

$$\bar{x}_{B_i} = \frac{K}{N} \sum_{j \in B_i} x_j.$$

The MoM estimator is defined as the median of those means:

$$\mathrm{MoM}_K(x_1, \ldots, x_N) = \mathrm{median}\{\bar{x}_{B_1}, \ldots, \bar{x}_{B_K}\}.$$
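As a concrete illustration, here is a minimal implementation of this estimator in plain Python (the block count and the contaminated sample below are illustrative choices, not taken from [7]):

```python
import random
import statistics

def median_of_means(xs, k):
    """MoM_K: split the sample into k non-overlapping blocks,
    average each block, then return the median of the block means."""
    n = len(xs)
    if n % k != 0:
        raise ValueError("assume k divides the sample size")
    b = n // k  # block length N/K
    block_means = [sum(xs[i * b:(i + 1) * b]) / b for i in range(k)]
    return statistics.median(block_means)

# A standard Gaussian sample corrupted by one huge outlier:
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(999)] + [1e6]
print(sum(xs) / len(xs))        # empirical mean: dragged to ~1000
print(median_of_means(xs, 10))  # MoM: stays close to the true mean 0
```

The outlier lands in a single block, so it corrupts at most one of the $K$ block means; the median of the remaining means is unaffected.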

This estimator has the following nice property.

**Proposition 2** ([7], Proposition 12)**.** *Assume* $\mathcal{P} \in \mathcal{P}^{\sigma}_{\mathrm{cheap}}$*; for* $\delta = \exp\left(-\frac{K}{8}\right)$*,*

$$
\left[\mathrm{MoM}_K \pm \frac{\sigma}{\sqrt{N}} \times 4\sqrt{2\log\left(\frac{1}{\delta}\right)}\right] \tag{3}
$$
*is a confidence interval for the mean of X at the level* 1 − *δ.*

This property is quite encouraging, as for a cheap model we obtain a confidence interval similar, up to a numerical constant, to the best one (1) in Section 2. However, we also spot here an important limitation. The confidence interval (3) for MoM is only valid for the particular error threshold *δ* = exp(−*K*/8), which depends on the number of blocks *K* (a parameter for the estimator MoM*K*). The estimator must be changed each time we want to evaluate a different confidence level.

An even more limiting feature is that the error threshold $\delta$ is constrained and cannot be set arbitrarily small, as in (1) or (2). Obviously, the number of blocks cannot exceed the sample size $N$, so the error threshold reaches its lowest tolerable value at $\exp(-N/8)$. In other words, the interval defined in (3) can have confidence at most $1 - \exp(-N/8)$.

Is this strong limitation specific to MoM estimators? No, say [8] [Theorem 3.2 and the following remark]. This limitation is universal: over the class $\mathcal{P}^{\sigma}_{\mathrm{cheap}}$, there is no estimator $\hat{x}$ of the mean and constant $L > 1$ such that

$$
\left[\hat{x} \pm \frac{\sigma}{\sqrt{N}} \times L\sqrt{2\log\left(\frac{1}{\delta}\right)}\right]
$$

is a confidence interval at level $1 - \delta$ for $\delta$ lower than $e^{-\mathcal{O}(N)}$.

To sum up, a good and cheap lunch is possible, with the limitation that the bound is no longer valid for all confidence levels.

## **4. PAC-Bayes**

We now briefly introduce the generalised Bayesian setting in machine learning, and the resulting generalisation bounds, the PAC-Bayesian bounds. PAC-Bayes is a sophisticated framework to derive new learning algorithms and obtain (often state-of-the-art) generalisation bounds, while maintaining probability distributions over hypotheses; as such, we are interested in studying how PAC-Bayes is compatible with good and cheap lunches. We refer the reader to [1,9] and the many references therein for recent surveys on PAC-Bayes including historical notes and main bounds. We focus on classical bounds from the PAC-Bayes literature, based on the empirical risk as a risk estimator—and we instantiate those bounds in two regimes matching the "expensive" and "cheap" models introduced in Section 2.

#### *4.1. Notation*

For any $f \in \mathcal{F}$, we define the empirical risk $R_N(f)$ as:

$$R_N(f) = \frac{1}{N} \sum_{i=1}^{N} \ell(f(X_i), Y_i).$$

In the following, we consider integrals over the hypothesis space $\mathcal{F}$. To keep the notation as compact as possible, we will write $\mu[g] = \int g \, \mathrm{d}\mu$ if $\mu$ is a measure over $\mathcal{F}$ and $g : \mathcal{F} \to \mathbb{R}$ is a $\mu$-integrable function.

#### *4.2. Generalised Bayes and PAC Bounds*

The main advantage of PAC-Bayes over deterministic approaches which output single hypotheses (through optimisation of a particular criterion such as in model selection, etc.) is that the distributions allow us to capture uncertainty on hypotheses, and take into account correlations among possible hypotheses.

Denoting by *ρ* the posterior distribution, the quantity to control is:

$$\rho[\mathcal{R}] = \int_{\mathcal{F}} \mathcal{R}(f) \, \mathrm{d}\rho(f)$$

which is an aggregated risk over the class F and represents the expected risk if the predictor *f* is drawn from *ρ* for each new prediction. The distribution *ρ* is usually data-dependent and is referred to as a "posterior" distribution (by analogy with Bayesian statistics). We also fix a reference measure *π* over F, called the "prior" (for similar reasons). We refer to [1,10] for in-depth discussions on the choice of the prior: a recent streamline of work has further investigated the choice of data-dependent priors [11–14].

The generalisation bounds associated to this setting are known as "PAC-Bayesian" bounds, where PAC stands for probably approximately correct. One important feature of PAC-Bayes bounds is that they hold true for any prior *π* and posterior *ρ*. In practice, bounds are optimised with respect to *ρ* and possibly *π*. In the following, we focus on establishing bounds for any choice of *π* and *ρ* and do not mean to optimise.

#### *4.3. Notion of Divergence*

An important notion used in PAC-Bayesian theory is the divergence between two probability distributions [see [15], for example, for a survey on divergences]. Let $E$ be a measurable space and $\mu$ and $\nu$ two probability distributions on $E$. Let $f$ be a non-negative convex function defined on $\mathbb{R}^+$ such that $f(1) = 0$; we define the $f$-divergence between $\mu$ and $\nu$ by

$$\mathcal{D}_f(\mu, \nu) = \begin{cases} \int f\left(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\right) \mathrm{d}\nu & \text{if } \mu \ll \nu, \\ +\infty & \text{otherwise}. \end{cases}$$

Note that we also use the notation *f* to denote hypotheses elsewhere in the paper, but we believe the context to always be clear enough to avoid ambiguity.

Applying Jensen's inequality, we have that $\mathcal{D}_f(\mu, \nu)$ is always non-negative and (for strictly convex $f$) equal to zero if and only if $\mu = \nu$. The class of $f$-divergences includes many celebrated divergences, such as the Kullback–Leibler (KL) divergence, the reversed KL, the Hellinger distance, the total variation distance, $\chi^2$-divergences, $\alpha$-divergences, etc. Most PAC-Bayesian generalisation bounds involve the KL divergence.

A divergence can be thought of as a transport cost between two probability distributions. This interpretation will be useful for explaining PAC-Bayesian inequalities, where the divergence plays the role of a complexity term. In the following, we will use just two types of divergence. The first is the Kullback–Leibler divergence, corresponding to the choice $f(x) = x \log x$, which we denote by

$$\mathrm{KL}(\mu, \nu) = \begin{cases} \int \log\left(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\right) \mathrm{d}\mu & \text{if } \mu \ll \nu, \\ +\infty & \text{otherwise}. \end{cases}$$

The second is linked to Pearson's $\chi^2$-divergence and corresponds to the choice $f(x) = x^2 - 1$. It is referred to as $\mathcal{D}_2$:

$$\mathcal{D}_2(\mu, \nu) = \begin{cases} \int \left(\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\right)^2 \mathrm{d}\nu - 1 & \text{if } \mu \ll \nu, \\ +\infty & \text{otherwise}. \end{cases}$$

To illustrate the behaviour of these two divergences, consider the case where $\mu$ and $\nu$ are normal distributions on $\mathbb{R}^d$.

**Proposition 3.** *If $E = \mathbb{R}^d$, $\mu = \mathcal{N}(a, I)$, and $\nu = \mathcal{N}(0, I)$ (where $I$ stands for the $d \times d$ identity matrix), we have*

$$\begin{cases} \mathcal{D}\_2(\mu, \nu) = e^{||a||^2} - 1, \\ \text{KL}(\mu, \nu) = \frac{1}{2}||a||^2. \end{cases}$$

**Proof.** We have:

$$\begin{cases} \mathrm{d}\mu(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-a)^{\mathrm{T}}(\mathbf{x}-a)\right) \mathrm{d}\mathbf{x}, \\ \mathrm{d}\nu(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2}\mathbf{x}^{\mathrm{T}}\mathbf{x}\right) \mathrm{d}\mathbf{x}, \\ \frac{\mathrm{d}\mu}{\mathrm{d}\nu}(\mathbf{x}) = \exp\left(-\frac{1}{2}\left[-2\mathbf{x}^{\mathrm{T}}a + a^{\mathrm{T}}a\right]\right) = \exp\left(-\|a\|^2/2\right)\exp\left(\mathbf{x}^{\mathrm{T}}a\right). \end{cases}$$

Then:

$$\begin{split} \mathcal{D}\_{2}(\boldsymbol{\mu},\boldsymbol{\nu}) &= \exp\left(-\|\boldsymbol{a}\|^{2}\right) \int \exp\Big(2\mathbf{x}^{\mathrm{T}}\boldsymbol{a}\Big) \frac{1}{(2\pi)^{d/2}} \exp\Big(-\frac{1}{2}\mathbf{x}^{\mathrm{T}}\mathbf{x}\Big) \mathrm{d}\mathbf{x} - 1 \\ &= \exp\Big(-\|\boldsymbol{a}\|^{2}\Big) \int \frac{1}{(2\pi)^{d/2}} \exp\Big(-\frac{1}{2}\mathbf{x}^{\mathrm{T}}\mathbf{x} + 2\mathbf{x}^{\mathrm{T}}\boldsymbol{a}\Big) \mathrm{d}\mathbf{x} - 1 \\ &= \exp\Big(-\|\boldsymbol{a}\|^{2}\Big) \exp\Big(2\|\boldsymbol{a}\|^{2}\Big) \int \frac{1}{(2\pi)^{d/2}} \exp\Big(-\frac{1}{2}(\mathbf{x} - 2\boldsymbol{a})^{\mathrm{T}}(\mathbf{x} - 2\boldsymbol{a})\Big) \mathrm{d}\mathbf{x} - 1 \\ &= e^{\|\boldsymbol{a}\|^{2}} - 1. \end{split}$$

And finally:

$$\begin{split} \operatorname{KL}(\mu,\nu) &= \int \left( -\frac{\|a\|^2}{2} + \mathbf{x}^\mathrm{T} a \right) \frac{1}{(2\pi)^{d/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - a)^\mathrm{T} (\mathbf{x} - a) \right) \mathrm{d}\mathbf{x} \\ &= -\frac{\|a\|^2}{2} + \int \mathbf{x}^\mathrm{T} a \frac{1}{(2\pi)^{d/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - a)^\mathrm{T} (\mathbf{x} - a) \right) \mathrm{d}\mathbf{x} \\ &= -\frac{\|a\|^2}{2} + \|a\|^2 = \frac{\|a\|^2}{2} .\end{split}$$

We therefore see that the divergence D<sup>2</sup> penalises much more strongly the gap between the means of both distributions than the Kullback–Leibler divergence.
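Proposition 3 is also easy to check numerically. The sketch below (plain Python, $d = 1$, with an illustrative value of $a$) approximates both integrals by a Riemann sum, evaluating the density ratio $\mathrm{d}\mu/\mathrm{d}\nu = \exp(ax - a^2/2)$ in closed form to avoid dividing two underflowing tail densities:

```python
import math

def n_pdf(x, mean):
    """Density of N(mean, 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

a = 1.5                              # illustrative mean shift
xs = [-20.0 + i * 1e-3 for i in range(40001)]
dx = 1e-3

# KL(mu, nu) = integral of log(dmu/dnu) dmu, with log-ratio a*x - a**2/2
kl = sum(n_pdf(x, a) * (a * x - a * a / 2) for x in xs) * dx

# D2(mu, nu) = integral of (dmu/dnu)**2 dnu - 1, ratio**2 = exp(2*a*x - a**2)
d2 = sum(math.exp(2 * a * x - a * a) * n_pdf(x, 0.0) for x in xs) * dx - 1.0

print(kl, a * a / 2)                 # both ≈ 1.125
print(d2, math.exp(a * a) - 1.0)     # both ≈ 8.488
```

Already for $\|a\|^2 = 2.25$, $\mathcal{D}_2$ is roughly seven times larger than KL, in line with the exponential versus quadratic growth of the closed forms.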

The following technical lemma involving the Kullback–Leibler divergence and a change of measure from posterior to prior distribution is pivotal in the PAC-Bayes literature:

**Lemma 1** ([5,16], Corollary 4.15)**.** *Let $g$ be a measurable function $g : \mathcal{F} \to \mathbb{R}$ such that $\pi[e^g]$ is finite. Let $\pi$ and $\rho$ be, respectively, prior and posterior measures as defined in Section 4.1. The following inequality holds:*

$$
\rho[g] \le \log \pi[e^{g}] + \mathrm{KL}(\rho, \pi).
$$
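This change-of-measure inequality is a direct consequence of the variational (Donsker–Varadhan) formula for the KL divergence, and a quick sanity check on a finite hypothesis space (with arbitrary illustrative numbers) is straightforward:

```python
import math

# Four hypotheses; prior pi, posterior rho, and an arbitrary function g
pi  = [0.25, 0.25, 0.25, 0.25]
rho = [0.70, 0.10, 0.10, 0.10]
g   = [1.0, -0.5, 2.0, 0.3]

rho_g = sum(r * gi for r, gi in zip(rho, g))                       # rho[g]
log_mgf = math.log(sum(p * math.exp(gi) for p, gi in zip(pi, g)))  # log pi[e^g]
kl = sum(r * math.log(r / p) for r, p in zip(rho, pi))             # KL(rho, pi)

print(rho_g, log_mgf + kl)  # left-hand side <= right-hand side
```

Equality is approached when $\rho$ is the Gibbs posterior $\mathrm{d}\rho \propto e^{g}\,\mathrm{d}\pi$, which is why this lemma is the workhorse behind the bounds below.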

#### *4.4. Expensive PAC-Bayesian Bound*

The first PAC-Bayesian bound we present is called the "expensive PAC-Bayesian bound" in the spirit of Section 2: it is obtained under a sub-Gaussian tails assumption. More precisely, we suppose here that for any $f \in \mathcal{F}$, the distribution of the random variable $\ell(f(X), Y)$ belongs to $\mathcal{P}^{\sigma}_{\mathrm{expensive}}$, which means

$$\log \mathbb{E}[\exp\{\lambda(\ell(f(X), Y) - \mathbb{R}(f))\}] \le \frac{\lambda^2 \sigma^2}{2}, \quad \forall \lambda \in \mathbb{R}.$$

In this setting, we have the following bound, close to the ones obtained by [10].

**Proposition 4.** *Assume that for any $f \in \mathcal{F}$, $\ell(f(X), Y) \in \mathcal{P}^{\sigma}_{\mathrm{expensive}}$. For any prior $\pi$, posterior $\rho$, and any $\delta \in (0, 1)$, the following inequality holds true with probability greater than $1 - \delta$:*

$$
\rho[\mathcal{R}] \le \rho[\mathcal{R}\_N] + \frac{\sigma}{\sqrt{N}} \sqrt{2\left(\log\left(\frac{1}{\delta}\right) + \text{KL}(\rho, \pi)\right)}.
$$

**Proof.** The proof is decomposed into two steps. The first leverages Lemma 1. Let $\lambda$ be a positive number and apply Lemma 1 to the function $\lambda(R - R_N)$:

$$
\rho[R] \le \rho[R_N] + \frac{1}{\lambda} \left( \log \pi[e^{\lambda(R-R_N)}] + \mathrm{KL}(\rho, \pi) \right).
$$

The second step is to control the deviations of $\log \pi\left[e^{\lambda(R - R_N)}\right]$. With probability $1 - \delta$, we have, by Markov's inequality,

$$\pi\left[e^{\lambda(R-R\_N)}\right] \le \frac{\mathbb{E}\left[\pi\left[e^{\lambda(R-R\_N)}\right]\right]}{\delta}.$$

By Fubini's theorem, we can exchange the symbols $\mathbb{E}$ and $\pi$. Using the assumption $\mathcal{P}^{\sigma}_{\mathrm{expensive}}$, we obtain, with probability greater than $1 - \delta$,

$$\pi \left[ e^{\lambda (R - R\_N)} \right] \le \frac{\exp \left\{ \lambda^2 \sigma^2 / 2N \right\}}{\delta}.$$

Now, putting these results together and setting

$$\lambda = \frac{\sqrt{2N\left(\log\left(\frac{1}{\delta}\right) + \text{KL}(\rho, \pi)\right)}}{\sigma}$$

we obtain the desired bound.

A PAC-Bayesian inequality is a bound which treats the complexity in the following manner:


#### *4.5. Cheap PAC-Bayesian Bounds*

#### 4.5.1. Using the $\chi^2$ Divergence

The vast majority of works in the PAC-Bayesian literature focuses on an expensive model. The main reason is that it includes the situation where the loss is bounded, a common (yet debatable) assumption in machine learning. The case where $\ell(f(X), Y)$ belongs to a cheap model has attracted far less attention; recently, ref. [17] obtained the following bound.

**Proposition 5** ([17], Theorem 1)**.** *Assume that for any f* ∈ F*,* ℓ(*f*(*X*),*Y*) ∈ P<sup>*σ*</sup><sub>cheap</sub>*. For any prior π, posterior ρ, and any δ* ∈ (0, 1)*, the following inequality holds true with a probability greater than* 1 − *δ:*

$$
\rho[\mathcal{R}] \le \rho[\mathcal{R}_N] + \frac{\sigma}{\sqrt{N}} \sqrt{\frac{\mathcal{D}_2(\rho, \pi) + 1}{\delta}}.
$$

The proof (see [17]) uses the same elementary ingredients as in the expensive case, replacing the Kullback–Leibler divergence by D<sub>2</sub>; the dependence on *δ* moves from √(2 log(1/*δ*)) to √(1/*δ*). Note the correspondence between these two bounds and the confidence intervals introduced in Section 2.
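The deviation term of Proposition 5 is just as easy to evaluate for discrete measures. In this sketch (our own, with hypothetical names), the χ²-type divergence D<sub>2</sub> replaces the KL term:

```python
import math

def chi2_divergence(rho, pi):
    # D_2(rho, pi) = sum_i pi_i * (rho_i / pi_i - 1)^2 for discrete rho, pi
    return sum(p * (r / p - 1) ** 2 for r, p in zip(rho, pi))

def cheap_deviation_term(rho, pi, sigma, N, delta):
    # Deviation term of Proposition 5: (sigma / sqrt(N)) * sqrt((D_2 + 1) / delta)
    return sigma / math.sqrt(N) * math.sqrt((chi2_divergence(rho, pi) + 1) / delta)
```

For *ρ* = *π* the term reduces to (σ/√N)·√(1/δ), which blows up polynomially as δ → 0, whereas the KL-based term of Proposition 4 grows only logarithmically.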

#### 4.5.2. Using Huber-Type Losses

With a different approach, ref. [18] obtained asymptotic PAC-Bayesian bounds for *δ*-dependent risk estimators based on the empirical mean of Huber-type influence functions. The author of [18] worked in a slightly more restrictive model than P<sub>cheap</sub>, assuming in addition that the third-order moment of ℓ(*f*(*X*),*Y*) is bounded for *f* ∈ F. We rephrase here Theorem 9 of [18]: with a probability greater than 1 − *δ*,

$$
\rho[\mathcal{R}] \le \rho[\widehat{\mathcal{R}}_{\delta,N}] + \frac{1}{\sqrt{N}} \left( \mathrm{KL}(\rho, \pi) + \frac{\log(8\pi\sigma\delta^{-2})}{2} + \sigma + \pi_N^*(\mathcal{F}) - 1 \right) + o\left(\frac{1}{N}\right),
$$

where *π*<sup>∗</sup><sub>*N*</sub>(F) is a term depending on the quality of the prior. In Remark 10, the author notes that, assuming only finite moments for ℓ(*f*(*X*),*Y*), it is impossible in practice to choose a prior such that *π*<sup>∗</sup><sub>*N*</sub>(F)/√*N* decreases at rate 1/√*N* or faster. The dominant term therefore converges at a slower rate than that of Proposition 4. However, this bound leads to the definition of a robust PAC-Bayes estimator which proves efficient on simulated data (see Section 5 of [18]).

## **5. A Good Cheap Lunch: Towards a Robust PAC-Bayesian Bound?**

If we take a closer look at the aforementioned PAC-Bayesian bounds from a robust statistics perspective, the following question arises: **can we obtain a PAC-Bayesian bound with a** √log(1/*δ*) **dependence (possibly up to a numerical constant) on the confidence level in the cheap model?** In this section, we shed light on some structural issues. In the following, we assume the existence of *σ* > 0 such that for any *f* ∈ F, ℓ(*f*(*X*),*Y*) ∈ P<sup>*σ*</sup><sub>cheap</sub>.

#### *5.1. A Necessary Condition*

Let *R̂* be an estimator of the risk (not necessarily the classical empirical risk). Here is a prototype of the inequality we are looking for: for any *δ* ∈ (0, 1), with probability 1 − *δ*,

$$
\rho[\mathcal{R}] \le \rho\left[\widehat{\mathcal{R}}\right] + \frac{\sigma}{\sqrt{N}} \mathrm{A}(\rho, \pi, \delta),
$$

where

$$\mathrm{A}(\rho,\pi,\delta) \underset{\delta \to 0}{=} \mathcal{O}\left(\sqrt{\log(1/\delta)}\right).$$

If we choose *ρ* = *π* = *δ*<sub>{*f*}</sub> (the Dirac mass at the single hypothesis *f*), the existence of such a PAC-Bayesian bound valid for all *δ* implies that

$$\left[\hat{\mathcal{R}}(f) \pm \frac{\sigma}{\sqrt{N}} \times c\sqrt{\log(1/\delta)}\right]$$

is a confidence interval for the risk *R*(*f*) for any level 1 − *δ*, where *c* is a constant.

Thus, a necessary condition for a PAC-Bayesian bound to be valid at every risk level *δ* is to have tight confidence intervals for any *f* ∈ F.

However, as covered in Section 3, such estimators do not exist over the class P<sup>*σ*</sup><sub>cheap</sub>, and the possibility of deriving a tight confidence interval is limited by the fact that the level *δ* must be greater than a positive constant of the form *e*<sup>−O(*N*)</sup>.

#### *5.2. A δ-Dependent PAC-Bayesian Bound?*

As a consequence, there is simply no hope for a robust PAC-Bayesian bound valid for any error threshold *δ*, for essentially the same reason which prevents it in the mean estimation case. The question we address now is the possibility of obtaining a robust PAC-Bayesian bound, with a dependence of magnitude √(2 log(1/*δ*)) (possibly up to a constant), with a possible limitation on the error threshold *δ*. In the following, we assume that we have an estimator *R̂* of the risk and an error threshold *δ* > 0 for which there exists a constant *C* > 0 such that, for any *f* ∈ F,

$$\left[\widehat{\mathcal{R}}(f) \pm \frac{\sigma}{\sqrt{N}} \times C\sqrt{\log(1/\delta)}\right]$$

is a confidence interval for *R*(*f*) at level 1 − *δ*. MoM is an example of such an estimator. Let us stress that *δ* is fixed and cannot be used as a free parameter.
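For concreteness, here is a minimal median-of-means sketch (our own illustration; the block-splitting convention, which drops the remainder, is one of several possible choices):

```python
import statistics

def median_of_means(xs, k):
    # Split the sample into k blocks, average each block, and return the
    # median of the k block means. Taking k of order log(1/delta) yields
    # the sub-Gaussian-type deviations used in the text.
    b = len(xs) // k  # block size; the remainder is dropped for simplicity
    means = [sum(xs[i * b:(i + 1) * b]) / b for i in range(k)]
    return statistics.median(means)
```

Unlike the empirical mean, a single extreme observation can corrupt at most one block, and hence at most one of the *k* block means.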

As seen above, a PAC-Bayesian bound proof proceeds in two steps: first, a change of measure (Lemma 1), which makes the complexity term KL(*ρ*, *π*) appear; second, a control in high probability of the remaining term log *π*[*e*<sup>*λ*(*R*−*R̂*)</sup>].

The first step does not require any stochastic model on the data, and is always valid, regardless of whether we have a cheap or an expensive model. The second step uses the model and introduces the dependence on the error rate *δ* on the right-hand side of the bound: *g*<sup>−1</sup>(1/*δ*). In the case of the "expensive bound", we had *g* = exp and the dependence was log(1/*δ*); the final rate √log(1/*δ*) was obtained by choosing a relevant value for *λ*.

Let us follow this scheme to obtain a robust PAC-Bayesian bound. The first step gives

$$
\rho[\mathcal{R}] \le \rho[\hat{\mathcal{R}}] + \frac{1}{\lambda} \Big( \log \pi \left[ e^{\lambda(\mathcal{R} - \hat{\mathcal{R}})} \right] + \text{KL}(\rho, \pi) \Big).
$$

Our goal is now to control *π*[*e*<sup>*λ*(*R*−*R̂*)</sup>] in high probability.

5.2.1. The Case *π* = *δ*<sub>{*f*}</sub>

Let us start with a very special case, where the prior is a Dirac mass on some hypothesis *f* ∈ F. Then

$$\frac{1}{\lambda} \log \pi \left[ e^{\lambda ( R - \widehat{R} )} \right] = R(f) - \widehat{R}(f).$$

By the defining property of *R̂*, we can bound this quantity as follows: with probability 1 − *δ*,

$$
\mathcal{R}(f) - \widehat{\mathcal{R}}(f) \le \frac{\sigma}{\sqrt{N}} \times C\sqrt{\log(1/\delta)}.
$$

Another way to formulate this result is to say that there exists an event A<sub>*f*</sub> with probability greater than 1 − *δ* such that, for all *ω* ∈ A<sub>*f*</sub>, the following holds:

$$R(f) - \widehat{R}(f, \omega) \le \frac{\sigma}{\sqrt{N}} \times C \sqrt{\log(1/\delta)}.$$

In this example, we can control log *π*[*e*<sup>*λ*(*R*−*R̂*)</sup>], at the price of a maximal constraint on the choice of the posterior. Indeed, the only choice of *ρ* for which the Kullback–Leibler divergence KL(*ρ*, *π*) makes sense is *ρ* = *π* = *δ*<sub>{*f*}</sub>.

5.2.2. The Case *π* = *αδ*<sub>{*f*1}</sub> + (1 − *α*)*δ*<sub>{*f*2}</sub>

Consider now a somewhat more sophisticated choice of prior which is a mixture of two Dirac masses in two distinct hypotheses. We do not fix the mixing proportion *α* and allow it to move freely between 0 and 1. The goal is to control the quantity

$$\pi \left[ e^{\lambda(R-\widehat{R})} \right] = \alpha\, e^{\lambda(R(f_1) - \widehat{R}(f_1))} + (1 - \alpha)\, e^{\lambda(R(f_2) - \widehat{R}(f_2))}.$$

More precisely, for all *α* ∈ (0, 1), we want to find an event A<sub>*α*</sub> on which this quantity is under control. In view of the prior's structure, the only way to ensure such control is to have A<sub>*α*</sub> ⊂ A<sub>*f*1</sub> ∩ A<sub>*f*2</sub>, where A<sub>*f*1</sub> (resp. A<sub>*f*2</sub>) is the favourable event for the concentration of *R̂*(*f*<sub>1</sub>) (resp. *R̂*(*f*<sub>2</sub>)) around its mean.

By the union bound, we have that with a probability greater than 1 − 2*δ*

$$\frac{1}{\lambda} \log \pi \left[ e^{\lambda(R-\widehat{R})} \right] \le \frac{\sigma}{\sqrt{N}} \times C \sqrt{\log(1/\delta)}.$$

We face a double problem here. As above, if we want the final bound to be non-vacuous, we have to ensure that KL(*ρ*, *π*) is finite, which restricts the support of the posterior to the set {*f*<sub>1</sub>, *f*<sub>2</sub>}. In addition, the PAC-Bayesian bound holds with a probability greater than 1 − 2*δ*...

#### 5.2.3. Limitation

. . . which hints at the fact that this probability becomes 1 − *Kδ* if the support of the prior contains *K* distinct hypotheses. If *K* ≥ 1/*δ*, the bound becomes vacuous. In particular, we cannot obtain a relevant bound using this approach when the cardinality of F is infinite (which is the case in most PAC-Bayes works).

This limitation highlights that, to derive PAC-Bayesian bounds, we cannot rely on the construction of confidence intervals for all *R*(*f*) at a fixed error threshold *δ*. The issue is that when we want to transfer this local property into a global one (valid for any mixture of hypotheses by the prior *π*), we cannot avoid a worst-case reasoning through the union bound.

The established bounds in the PAC-Bayesian literature, in both cheap and expensive models, repeatedly use the fact that when we assume that, for any *f* ∈ F,

$$\log \mathbb{E}\left[e^{\lambda(R(f)-\ell(f(X),Y))}\right] \le \frac{\lambda^2 \sigma^2}{2}, \forall \lambda \in \mathbb{R}$$

or

$$\operatorname{var}(\ell(f(X), Y)) \le \sigma^2,$$

we make an implicit assumption on the integrability of the tail of the distribution of ℓ(*f*(*X*),*Y*). This argument is crucial for the second step of the PAC-Bayesian proof because, by Fubini's theorem, it allows us to convert a local property (the tail behaviour of each ℓ(*f*(*X*),*Y*)) into a global one (the control of *π*[*e*<sup>*λ*(*R*−*R<sub>N</sub>*)</sup>] or of *π*[(*R* − *R<sub>N</sub>*)<sup>2</sup>] in high probability).

#### *5.3. Is That the End of the Story?*

We have identified a structural limitation to deriving a tight PAC-Bayesian bound in a cheap model, and we have made the case that we cannot replicate the PAC-Bayesian proof presented in Section 4. To conclude this section, we want to highlight the fact that, to the best of our knowledge, no proof of PAC-Bayesian bounds avoids these two steps (see, for example, the general presentation in [19]).

What if we try to avoid the change-of-measure step and control *ρ*[*R*] − *ρ*[*R̂*] directly in high probability? We remark that *ρ* can only be chosen with the information given by the observation of *R̂*(*f*), *f* ∈ F. In particular, we cannot obtain any information on the concentration of each *R̂*(*f*) around *R*(*f*), as such knowledge would require knowing the true risk. So, it seems that a direct control cannot avoid starting from a "worst-case" bound:

$$\rho[\mathcal{R}] - \rho[\widehat{\mathcal{R}}] \le \sup_{f \in \mathcal{F}} \left\{ \mathcal{R}(f) - \widehat{\mathcal{R}}(f) \right\}.$$

Then, we have to control sup<sub>*f*∈F</sub>{*R*(*f*) − *R̂*(*f*)} in high probability (see [20] for a general presentation of such controls, and [7] for recent results in the special case where *R̂* is a MoM estimator). However, the obtained bound will take the following prototypical form:

$$
\rho[\mathcal{R}] \le \rho[\widehat{\mathcal{R}}] + \text{complexity term},
$$

where the complexity term does not depend on the distribution *ρ*. Thus, optimising the right-hand side leads to choosing *ρ* as the Dirac mass at arg min<sub>*f*∈F</sub> *R̂*(*f*). The overall procedure therefore amounts to a slightly modified empirical risk minimisation (where the empirical mean is replaced with another estimator of the risk), and does not fall into the category of generalised Bayesian approaches which take into account the uncertainty on hypotheses. Pretty much all the strengths of PAC-Bayes would then be lost.

#### **6. Conclusions**

The present paper contributes to a better understanding of the profound structural reasons why good cheap lunches (tight bounds under minimal assumptions) are not possible with PAC-Bayes, by walking gently through elementary examples.

From a theoretical perspective, PAC-Bayesian bounds require assumptions that are too strong to adapt results from robust statistics (where almost-good lunches can be obtained in cheap models, with the limitation that the confidence level is constrained). The second step of the proof we have shown requires us to transform a local hypothesis, a control of some moments of ℓ(*f*(*X*),*Y*), into a global one, valid for all mixtures of hypotheses by the prior *π*. As covered above, this transformation seems impossible.

To close on a more positive note after this negative result, let us stress that even if reconciling PAC-Bayes and robust statistics appears challenging, we believe that recent ideas from robust statistics could be used in practical algorithms inspired by PAC-Bayes. In particular, we leave as an avenue for future work the empirical study of PAC-Bayesian posteriors (such as the Gibbs measure defined as *ρ* ∝ exp(−*γR̂*)*π* for an inverse temperature *γ* > 0) where the risk estimator is not the empirical mean (as in most PAC-Bayes works) but rather a robust estimator, such as MoM.
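To illustrate the direction we have in mind, the hypothetical sketch below forms a Gibbs measure ρ ∝ exp(−γR̂)π over a finite hypothesis class, where R̂ is a MoM estimate of each risk; none of the names come from an existing implementation:

```python
import math
import statistics

def mom_risk(losses, k):
    # Median-of-means estimate of the risk from per-example losses
    b = len(losses) // k
    means = [sum(losses[i * b:(i + 1) * b]) / b for i in range(k)]
    return statistics.median(means)

def robust_gibbs_posterior(loss_table, prior, gamma, k):
    # rho(f) proportional to exp(-gamma * R_hat(f)) * pi(f),
    # with R_hat the MoM risk estimator instead of the empirical mean
    risks = [mom_risk(losses, k) for losses in loss_table]
    weights = [p * math.exp(-gamma * r) for p, r in zip(prior, risks)]
    z = sum(weights)
    return [w / z for w in weights]
```

Such a posterior keeps the Gibbs-measure form while inheriting MoM's insensitivity to a few corrupted losses; its theoretical guarantees are precisely what the present paper shows to be out of reach of the standard proof technique.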

**Author Contributions:** Conceptualization, B.G. and L.P.; Formal analysis, B.G. and L.P.; Supervision, B.G.; Writing—original draft, L.P.; Writing—review & editing, B.G. and L.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** B.G. is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1. B.G. acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23- 0015-02.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


## *Article* **A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization**

**Giulio Franzese \*, Dimitrios Milios, Maurizio Filippone and Pietro Michiardi**

Data Science Department, Eurecom, 06410 Biot, France; dimitrios.milios@eurecom.fr (D.M.); maurizio.filippone@eurecom.fr (M.F.); pietro.michiardi@eurecom.fr (P.M.)

**\*** Correspondence: giulio.franzese@eurecom.fr

**Abstract:** Stochastic gradient (SG)-based algorithms for Markov chain Monte Carlo sampling (SG-MCMC) tackle large-scale Bayesian modeling problems by operating on mini-batches and injecting noise into SG steps. The sampling properties of these algorithms are determined by user choices, such as the covariance of the injected noise and the learning rate, and by problem-specific factors, such as assumptions on the loss landscape and the covariance of SG noise. However, current SG-MCMC algorithms applied to popular complex models such as Deep Nets cannot simultaneously satisfy the assumptions on loss landscapes and on the behavior of the covariance of the SG noise while operating with the practical requirement of non-vanishing learning rates. In this work we propose a novel practical method, which makes the SG noise isotropic, using a fixed learning rate that we determine analytically. Extensive experimental validations indicate that our proposal is competitive with the state of the art on SG-MCMC.

**Keywords:** Bayesian sampling; stochastic gradients; Monte Carlo integration

**Citation:** Franzese, G.; Milios, D.; Filippone, M.; Michiardi, P. A Scalable Bayesian Sampling Method Based on Stochastic Gradient Descent Isotropization. *Entropy* **2021**, *23*, 1426. https://doi.org/10.3390/e23111426

Academic Editor: Pierre Alquier

Received: 21 September 2021 Accepted: 26 October 2021 Published: 28 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

## **1. Introduction**

Stochastic gradient (SG) methods have been extensively studied as a means for MCMC-based Bayesian posterior sampling algorithms to scale to large data regimes. Variants of SG-MCMC algorithms have been studied through the lens of first- [1–3] or second-order [4,5] Langevin dynamics, which are mathematically convenient continuous-time processes that correspond to discrete-time gradient methods without and with momentum, respectively. The common traits underlying many methods from the literature can be summarized as follows: they address large data requirements using SG and mini-batching, they inject Gaussian noise throughout the algorithm execution, and they avoid the expensive Metropolis–Hastings accept/reject tests that use the whole data [1,2,4].

Despite mathematical elegance and some promising results restricted to simple models, current approaches fall short in dealing with the complexity of the loss landscape typical of popular modern machine learning models, e.g., neural networks [6,7], for which stochastic optimization poses some serious challenges [8,9].

In general, SG-MCMC algorithms inject random noise into SG descent algorithms: the covariance of such noise and the learning rate (or step size, in the stochastic differential equation simulation community) are tightly related to the assumptions on the loss landscape, which, together with the SG noise, determine the sampling properties of these methods [5]. However, current SG-MCMC algorithms applied to popular complex models such as Deep Nets cannot simultaneously satisfy the assumptions on posterior distribution geometry and on the behavior of the covariance of the SG noise while operating with the practical requirement of non-vanishing learning rates. In this paper, in accordance with most of the neural-network-related literature, we refer to the posterior distribution geometry as the loss landscape. Some recent work [10], instead, argues for fixed step sizes, but settles for variational approximations of quadratic losses. Although we are not the first to highlight these issues, including the lack of a unified notation [5], we believe that studying the role of noise in SG-MCMC algorithms has not received enough attention, and a deeper understanding is truly desirable, as it can clarify how various methods compare. Most importantly, this endeavor can suggest novel and more practical algorithms relying on fewer parameters and less restrictive assumptions.

In this work we chose a mathematical notation that emphasizes the role of the noise covariances and of the learning rate in the behavior of SG-MCMC algorithms (Section 2). As a result, the equivalence between learning rate annealing and an extremely large injected noise covariance becomes apparent, and this allows us to propose a novel practical SG-MCMC algorithm (Section 3). We derive our proposal by first analyzing the case where we inject the smallest complementary noise such that its combined effect with the SG noise results in isotropic noise. Thanks to this isotropic property of the noise, it is possible to deal with the intricate loss surfaces typical of deep models and to produce samples from the true posterior without learning rate annealing. This, however, comes at the expense of cubic-complexity matrix operations. We address this issue through a practical variant of our scheme, which employs well-known approximations to the SG noise covariance (see, e.g., [11]). The result is an algorithm that produces approximate posterior samples with a fixed, theoretically derived learning rate. Please note that in the generic Bayesian deep learning setting, none of the existing implementations of SG-MCMC methods converge to the true posterior without learning rate annealing. In contrast, our method automatically determines an appropriate learning rate through a simple estimation procedure. Furthermore, our approach can be readily applied to pre-trained models: after a "warmup" phase to compute SG noise estimates, it can efficiently perform Bayesian posterior sampling.

We evaluate SG-MCMC algorithms (Section 4) through an extensive experimental campaign, where we compare our approach to several alternatives, including Monte Carlo Dropout (MCD) [12] and Stochastic Weighted Averaging Gaussians (SWAG, [9]), which have been successfully applied to the Bayesian deep learning setting. Our results indicate that our approach offers performance that is competitive with the state of the art, according to metrics that aim at assessing predictive accuracy and uncertainty.

#### **2. Preliminaries and Related Work**

Consider a dataset of *m*-dimensional observations D = {*U*<sub>*i*</sub>}<sup>*N*</sup><sub>*i*=1</sub>. Given a prior *p*(*θ*) for a *d*-dimensional set of parameters, and a likelihood model *p*(D|*θ*), the posterior is obtained by means of Bayes' theorem as follows:

$$p(\boldsymbol{\theta}|\mathcal{D}) = \frac{p(\mathcal{D}|\boldsymbol{\theta})\,\, p(\boldsymbol{\theta})}{p(\mathcal{D})} \tag{1}$$

where *p*(D) is also known as the model evidence, defined as the integral *p*(D) = ∫ *p*(D|*θ*) *p*(*θ*) d*θ*. Except when the prior and the likelihood function are conjugate, Equation (1) is analytically intractable [13]. However, the joint likelihood term in the numerator is typically not hard to compute; this is a key element of many MCMC algorithms, since the normalization constant *p*(D) does not affect the shape of the distribution in any way other than scaling. The posterior distribution is necessary to obtain predictive distributions for new test observations *U*<sub>∗</sub>, as:

$$p(\mathcal{U}\_\*|\mathcal{D}) = \int p(\mathcal{U}\_\*|\theta)p(\theta|\mathcal{D})d\theta\tag{2}$$

We focus in particular on Monte Carlo methods to obtain an estimate of this predictive distribution, by averaging over *N*MC samples obtained from the posterior over *θ*, i.e., *<sup>θ</sup>*(*i*) <sup>∼</sup> *<sup>p</sup>*(*θ*|D)

$$p(\mathcal{U}\_\*|\mathcal{D}) \approx \frac{1}{N\_{\text{MC}}} \sum\_{i=1}^{N\_{\text{MC}}} p(\mathcal{U}\_\*|\theta^{(i)}) \tag{3}$$
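Equation (3) is a plain average over posterior samples; a minimal sketch (hypothetical names, with `likelihood(u, theta)` standing in for p(U<sub>∗</sub>|θ)):

```python
def mc_predictive(u_star, posterior_samples, likelihood):
    # Equation (3): average p(u_star | theta) over samples theta^(i) ~ p(theta | D)
    return sum(likelihood(u_star, th) for th in posterior_samples) / len(posterior_samples)
```

All the difficulty, of course, lies in producing the posterior samples themselves, which is the subject of the rest of the paper.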

We develop our work by working with an unnormalized version of the logarithm of the posterior density, by expressing the negative logarithm of the joint distribution of the dataset D and parameters *θ* as:

$$-f(\boldsymbol{\theta}) = \sum_{i=1}^{N} \log p(\boldsymbol{\mathcal{U}}_i | \boldsymbol{\theta}) + \log p(\boldsymbol{\theta}).\tag{4}$$

For computational efficiency, we use a minibatch stochastic gradient *g*(*θ*), which guarantees that the estimated gradient is an unbiased estimate of the true gradient ∇ *f*(*θ*), and we assume that the randomness due to the minibatch introduces a Gaussian noise:

$$\mathbf{g}(\theta) \sim \mathcal{N}(\nabla f(\theta), \mathbf{B}(\theta)),\tag{5}$$

where the matrix *B***(***θ***)** denotes the SG noise covariance, which depends on the parametric model, the data distribution and the minibatch size.
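As a rough illustration of what B(θ) captures, one can estimate the spread of minibatch gradients empirically. The sketch below is our own (it returns only a diagonal estimate and assumes a user-supplied `grad_fn(theta, batch)`); it is not part of the method proposed in this paper:

```python
import random

def sg_noise_diagonal(grad_fn, data, theta, batch_size, n_draws=200):
    # Draw n_draws minibatch gradients at a fixed theta and return the
    # per-coordinate sample variance: a diagonal estimate of B(theta).
    draws = [grad_fn(theta, random.sample(data, batch_size)) for _ in range(n_draws)]
    d = len(draws[0])
    mean = [sum(g[j] for g in draws) / n_draws for j in range(d)]
    return [sum((g[j] - mean[j]) ** 2 for g in draws) / (n_draws - 1)
            for j in range(d)]
```

In practice this diagonal surrogate is the kind of approximation referenced later in the paper; the full covariance would require O(d²) storage.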

A survey of algorithms to sample from the posterior using SG methods can be found in Ma et al. [5]. In Appendix A we report some well-known facts which are relevant for the derivations in our paper. As shown in the literature [10,14], there are structural similarities between SG-MCMC algorithms and stochastic optimization methods, and both can be used to draw samples from posterior distributions. Notice that the original goal of stochastic optimization is to find the minimum of a given cost function, and the stochasticity is introduced by sub-sampling the dataset to scale. SG-MCMC methods instead aim at sampling from a given distribution, i.e., collecting multiple values, and the stochasticity is necessary to explore the whole landscape. In what follows, we use a unified notation to compare many existing algorithms in light of the role played by their noise components.

It is well-known [15–17] that stochastic gradient descent (SGD), with and without momentum, can be studied through the following stochastic differential equation (SDE), when the learning rate *η* is small enough (In this work we do not consider discretization errors. The reader can refer to classical SDE texts such as [18] to investigate the topic in greater depth.):

$$d\mathbf{z}_t = \mathbf{s}(\mathbf{z}_t)dt + \sqrt{2\eta \mathbf{D}(\mathbf{z}_t)}dW_t. \tag{6}$$

where *s* is usually referred to as the driving force and *D* as the diffusion matrix. We use a generic form of the SDE, with variable *z* instead of *θ*, which accommodates SGD variants with and without momentum. By doing this, we will be able to easily cast the expression for the two cases in what follows. (The operator ∇ applied to matrix *D*(*z*) produces a row vector whose elements are the divergences of the *D*(*z*) columns. Our notation is aligned with Chen et al. [4].)

**Definition 1.** *A distribution ρ*(*z*) ∝ exp(−*φ*(*z*)) *is said to be a stationary distribution for the* SDE *of the form* (6)*, if and only if it satisfies the following Fokker-Planck equation (*FPE*):*

$$0 = \text{Tr}\left\{\nabla \left[ -\mathbf{s}(\mathbf{z})^\top \boldsymbol{\rho}(\mathbf{z}) + \nabla^\top \left( \mathbf{D}(\mathbf{z}) \boldsymbol{\rho}(\mathbf{z}) \right) \right] \right\}.\tag{7}$$

Please note that, in general, the stationary distribution does not coincide with the desired posterior distribution, i.e., *φ*(*z*) ≠ *f*(*z*), as shown by Chaudhari and Soatto [8]. Additionally, given an initial condition for *z<sub>t</sub>*, its distribution is going to converge to *ρ*(*z*) only for *t* → ∞. In practice, we observe the SDE dynamics for a finite amount of time: then, we declare that the process is approximately in the stationary regime once the potential has reached low and stable values.

Next, we briefly overview known approaches to Bayesian posterior sampling, and interpret them as variants of an SGD process, using the FPE formalism.

#### *2.1. Gradient Methods without Momentum*

The generalized update rule of SGD, described as a discrete-time stochastic process, writes as:

$$\delta\theta_n = -\eta P(\theta_{n-1}) \left(g(\theta_{n-1}) + w_n\right),\tag{8}$$

where *P*(*θn*−1) is a user-defined preconditioning matrix, and *w<sup>n</sup>* is a noise term, distributed as *w<sup>n</sup>* ∼ *N*(**0**, 2*C*(*θn*)), with a user-defined covariance matrix *C*(*θn*). Then, the corresponding continuous-time SDE is [15]:

$$d\theta_t = -P(\theta_t)\nabla f(\theta_t)dt + \sqrt{2\eta P(\theta_t)^2 \Sigma(\theta_t)}dW_t. \tag{9}$$

In this paper we use the symbol *n* to indicate discrete time, and *t* for continuous time. We denote by *C*(*θ*) the covariance of the *injected noise* and by **Σ**(*θ*) the *composite noise* covariance. Please note that **Σ**(*θ<sub>t</sub>*) = *B*(*θ<sub>t</sub>*) + *C*(*θ<sub>t</sub>*) combines the SG and the injected noise. Notice that our choice of notation differs from the standard one, in which the starting discrete-time process is in the form *δθ<sub>n</sub>* = −*ηP*(*θ<sub>n−1</sub>*)*g*(*θ<sub>n−1</sub>*) + *w<sub>n</sub>*. By directly grouping the injected noise with the stochastic gradient, we can better appreciate the relationship between annealing the learning rate and an extremely large injected noise covariance. Moreover, as will be explained in Section 3, this allows the derivation of a new sampling algorithm.

We define the stationary distribution of the SDE in Equation (9) as *ρ*(*θ*) ∝ exp(−*φ*(*θ*)). Please note that when *C* = **0**, the potential *φ*(*θ*) differs from the desired posterior *f*(*θ*) [8]. The following theorem, which is an adaptation of known results in light of our formalism, states the conditions for which the *noisy* SGD converges to the true posterior distribution (proof in Appendix A).

**Theorem 1.** *Consider dynamics of the form* (9) *and define the stationary distribution ρ*(*θ*) ∝ exp(−*φ*(*θ*))*. If*

$$\nabla^{\top} \left( \Sigma(\boldsymbol{\theta})^{-1} \right) = \mathbf{0}^{\top} \quad \text{and} \quad \eta P(\boldsymbol{\theta}) = \Sigma(\boldsymbol{\theta})^{-1}, \tag{10}$$

*then φ*(*θ*) = *f*(*θ*)*.*

Stochastic Gradient Langevin Dynamics (SGLD) [1] is a simple approach to satisfying Equation (10); it uses no preconditioning, *P*(*θ*) = *I*, and sets the injected noise covariance to *C*(*θ*) = *η*<sup>−1</sup>*I*. In the limit *η* → 0, it holds that **Σ**(*θ*) = *B*(*θ*) + *η*<sup>−1</sup>*I* ≈ *η*<sup>−1</sup>*I*. Then, ∇<sup>⊤</sup>(**Σ**(*θ*)<sup>−1</sup>) ≈ *η*∇<sup>⊤</sup>*I* = **0**<sup>⊤</sup>, and *ηP*(*θ*) = **Σ**(*θ*)<sup>−1</sup>. Although SGLD succeeds in (asymptotically) generating samples from the true posterior, its mixing rate is unnecessarily slow, due to the extremely small learning rate [2].
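The SGLD update just described can be sketched as follows (our own minimal illustration with P = I; `grad` is assumed to return the minibatch gradient of f, and the noise term ηw<sub>n</sub> has covariance 2ηI):

```python
import math
import random

def sgld_step(theta, grad, eta):
    # theta <- theta - eta * g(theta) + N(0, 2 * eta * I)
    # (injected covariance C = eta^{-1} I in the notation of the text)
    return [t - eta * g + math.sqrt(2 * eta) * random.gauss(0.0, 1.0)
            for t, g in zip(theta, grad(theta))]

# Toy run: sampling from N(0, 1), i.e., f(theta) = theta^2 / 2, grad f = theta
random.seed(0)
theta, samples = [0.0], []
for _ in range(5000):
    theta = sgld_step(theta, lambda th: th, eta=0.05)
    samples.append(theta[0])
```

With a moderate η the chain's mean and variance approach 0 and 1; shrinking η removes the residual discretization bias at the cost of slower mixing, which is exactly the trade-off discussed above.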

An extension of SGLD is Stochastic Gradient Fisher Scoring (SGFS) [2], which can be tuned to switch between sampling from an approximate posterior, using a non-vanishing learning rate, and the true posterior, by annealing the learning rate to zero. SGFS uses preconditioning, *P*(*θ*) ∝ *B*(*θ*)<sup>−1</sup>. In practice, however, *B*(*θ*) is ill-conditioned for complex models such as deep neural networks: many of its eigenvalues are almost zero [8], and computing *B*(*θ*)<sup>−1</sup> is problematic. An in-depth analysis of SGFS reveals that conditions (10) would be met with a non-vanishing learning rate only if, at convergence, ∇<sup>⊤</sup>(*B*(*θ*)<sup>−1</sup>) = **0**<sup>⊤</sup>, which would be trivially true if *B*(*θ*) were constant. However, recent work [6,7] suggests that this condition is difficult to justify for deep neural networks.

The Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) algorithm [3] extends SGFS to the setting in which ∇<sup>⊤</sup>(*B*(*θ*)<sup>−1</sup>) ≠ **0**<sup>⊤</sup>. The process dynamics are adjusted by adding the term ∇<sup>⊤</sup>(*B*(*θ*)<sup>−1</sup>). However, this term does not have a clear estimation procedure, restricting SGRLD to cases where it can be computed analytically.

The work by [10] investigates constant-rate SGD (with no injected noise), and determines analytically the learning rate and preconditioning that minimize the Kullback–Leibler (KL) divergence between an approximation and the true posterior. Moreover, it shows that the preconditioning used in SGFS is optimal, in the sense that it converges to the true posterior, when *B*(*θ*) is constant and the true posterior has a quadratic form.

In summary, to claim convergence to the true posterior distribution, existing approaches require either vanishing learning rates or assumptions on the SG noise covariance that are difficult to verify in practice, especially when considering deep models. We instead propose a novel practical method that induces isotropic SG noise and thus satisfies Theorem 1. We determine analytically a fixed learning rate, and we require weaker assumptions on the loss shape.

## *2.2. Gradient Methods with Momentum*

Momentum-corrected methods emerge as a natural extension to SGD approaches. The general set of update equations for (discrete-time) momentum-based algorithms is:

$$\begin{cases} \delta \theta_n = \eta P(\theta_{n-1}) M^{-1} r_{n-1} \\ \delta r_n = -\eta A(\theta_{n-1}) M^{-1} r_{n-1} - \eta P(\theta_{n-1}) \left(g(\theta_{n-1}) + w_n\right), \end{cases}$$

where *P*(*θ<sub>n−1</sub>*) is a preconditioning matrix, *M* is the mass matrix, and *A*(*θ<sub>n−1</sub>*) is the friction matrix, as shown by [4,19]. As with the first-order counterpart, the noise term is distributed as *w<sub>n</sub>* ∼ *N*(**0**, 2*C*(*θ<sub>n</sub>*)). Then, the SDE describing the continuous-time system dynamics is:

$$\begin{cases} d\theta_t = P(\theta_t) M^{-1} r_t dt \\ dr_t = -\left(A(\theta_t) M^{-1} r_t + P(\theta_t) \nabla f(\theta_t)\right) dt + \sqrt{2\eta P(\theta_t)^2 \Sigma(\theta_t)} dW_t. \end{cases} \tag{11}$$

where *P*(*θt*)<sup>2</sup> = *P*(*θt*)*P*(*θt*) and we assume *P*(*θt*) to be symmetric. The theorem hereafter describes the conditions for which noisy SGD with momentum converges to the true posterior distribution (Appendix A).

**Theorem 2.** *Consider dynamics of the form* (11) *and define the stationary distribution for θ<sup>t</sup> as ρ*(*θ*) ∝ exp(−*φ*(*θ*))*. If*

$$\nabla^\top P(\boldsymbol{\theta}) = \mathbf{0}^\top \quad \text{and} \quad \mathbf{A}(\boldsymbol{\theta}) = \eta P(\boldsymbol{\theta})^2 \Sigma(\boldsymbol{\theta}), \tag{12}$$

*then φ*(*θ*) = *f*(*θ*) *.*
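For intuition, the dynamics (11) can be simulated with a simple Euler–Maruyama scheme. The sketch below is an illustrative toy under assumptions of our own choosing (*P* = *M* = *I*, constant friction *A* = *γI*, and noise scaled so that *A* = *ηP*<sup>2</sup>**Σ** as required by (12)); it is not the paper's sampling method.

```python
import numpy as np

def simulate_momentum_dynamics(grad_f, d, eta=0.01, gamma=1.0,
                               n_steps=50_000, seed=0):
    """Euler-Maruyama simulation of dynamics (11) with P = M = I and a
    constant friction A = gamma * I. The injected-noise scale is chosen so
    that A = eta * P^2 * Sigma (i.e., eta * Sigma = gamma * I), which
    satisfies conditions (12): theta_t then targets rho ~ exp(-f)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    r = np.zeros(d)
    samples = np.empty((n_steps, d))
    for n in range(n_steps):
        theta = theta + eta * r                  # d(theta) = M^{-1} r dt
        # sqrt(2 eta P^2 Sigma) dW discretizes to sqrt(2 gamma eta) * xi
        xi = rng.standard_normal(d)
        r = r - eta * (gamma * r + grad_f(theta)) + np.sqrt(2.0 * gamma * eta) * xi
        samples[n] = theta
    return samples

# Quadratic loss f(theta) = ||theta||^2 / 2: the marginal of theta is N(0, I)
samples = simulate_momentum_dynamics(lambda th: th, d=2)
```

With a quadratic loss the empirical variance of the second half of the trajectory should sit close to 1 per coordinate, matching the stationary distribution exp(−*f*).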

In the naive case, where *P*(*θ*) = *I*, *A*(*θ*) = **0**, *C*(*θ*) = **0**, the conditions (12) are not satisfied and the stationary distribution does not correspond to the true posterior [4]. To generate samples from the true posterior it is sufficient to set *P*(*θ*) = *I*, *A*(*θ*) = *ηB*(*θ*), *C*(*θ*) = **0** (as in Equation (9) in [4]).

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) [4] suggests that estimating *B*(*θ*) can be costly. Hence, the injected noise *C*(*θ*) is chosen such that *C*(*θ*) = *η*<sup>−1</sup>*A*(*θ*), where *A*(*θ*) is user-defined. When *η* → 0, the approximation **Σ**(*θ*) ≈ *C*(*θ*) holds. It is then trivial to check that conditions (12) hold without explicitly estimating *B*(*θ*). A further practical reason to avoid setting *A*(*θ*) = *ηB*(*θ*) is that the operation *A*(*θ*<sub>*n*−1</sub>)*M*<sup>−1</sup>*r*<sub>*n*−1</sub> has O(*D*<sup>2</sup>) complexity, whereas if *C*(*θ*) is diagonal, this is reduced to O(*D*). This choice, however, severely slows down the sampling process.

Stochastic Gradient Riemannian Hamiltonian Monte Carlo (SGRHMC) is an extension of SGHMC [5], which considers a generic, space-varying preconditioning matrix *P*(*θ*) derived from information-geometric arguments [20]. SGRHMC suggests setting *P*(*θ*) = *G*(*θ*)<sup>−1/2</sup>, where *G*(*θ*) is the Fisher information matrix. To meet the requirement ∇<sup>⊤</sup>*P*(*θ*) = **0**<sup>⊤</sup>, it includes a correction term, −∇*P*(*θ*). The injected noise is set to *C*(*θ*) = *η*<sup>−1</sup>*I* − *B*(*θ*), consequently **Σ** = *η*<sup>−1</sup>*I*, and the friction matrix is set to *A*(*θ*) = *P*(*θ*)<sup>2</sup>. With all these choices, Theorem 2 is satisfied. Although appealing, the main drawbacks of this method are the need for an analytical expression of ∇*P*(*θ*), and the assumption that *B*(*θ*) is known.

From a practical standpoint, momentum-based methods suffer from the requirement to tune many hyperparameters, including the learning rate, and the parameters that govern the simulation of a second-order Langevin dynamics.

The method we propose in this work can be applied to momentum-based algorithms; in this case, it could be viewed as an extension of the work in [11], albeit addressing the complex loss landscapes typical of deep neural networks. However, we leave this avenue of research for future work.

#### **3. Sampling by Layer-Wise Isotropization**

We present a simple and practical approach to inject noise into SGD iterates to perform Bayesian posterior sampling. Our goal is to sample from the true posterior distribution (or approximations thereof) using a *constant* learning rate, relying on more lenient assumptions about the shape of the loss landscape of deep models than previous works. In general, modern machine learning applications deal with multi-layer neural networks [21]. We exploit the natural subdivision of the parameters of these architectures into different layers to propose a practical sampling scheme.

Careful inspection of Theorem 1 reveals that the matrices *P***(***θ***)**, **Σ(***θ***)** are instrumental in determining the convergence properties of SG methods to the true posterior. Therefore, we consider the constructive approach of *designing ηP***(***θ***)** to obtain a sampling scheme that meets our goals; we set *ηP***(***θ***)** to be a constant, diagonal matrix which we constrain to be layer-wise uniform:

$$\eta P(\theta) = \Lambda^{-1} = \operatorname{diag}\big(\underbrace{\lambda^{(1)}, \dots, \lambda^{(1)}}_{\text{layer } 1}, \dots, \underbrace{\lambda^{(N_l)}, \dots, \lambda^{(N_l)}}_{\text{layer } N_l}\big)^{-1}. \tag{13}$$

By properly selecting the set of parameters {*λ*<sup>(*p*)</sup>} we can simultaneously achieve a non-vanishing learning rate and a well-conditioned preconditioning matrix. This implies a layer-wise learning rate *η*<sup>(*p*)</sup> = 1/*λ*<sup>(*p*)</sup> for the *p*-th layer, without further preconditioning.
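To make the construction in (13) concrete, the following numpy snippet builds the layer-wise constant diagonal matrix and the implied per-layer learning rates. The layer sizes and *λ* values are hypothetical placeholders, not values estimated by the method.

```python
import numpy as np

# Hypothetical two-layer model: 4 parameters in layer 1, 3 in layer 2,
# with illustrative (not estimated) per-layer noise levels lambda^{(p)}.
layer_sizes = [4, 3]
lambdas = [2.0, 0.5]

# Lambda is diagonal and constant within each layer; eta * P(theta) = Lambda^{-1}.
Lambda = np.diag(np.concatenate(
    [np.full(size, lam) for size, lam in zip(layer_sizes, lambdas)]))
eta_P = np.linalg.inv(Lambda)

# Implied per-layer learning rates eta^{(p)} = 1 / lambda^{(p)}:
per_layer_lr = [1.0 / lam for lam in lambdas]
print(per_layer_lr)  # [0.5, 2.0]
```

Note how a larger *λ*<sup>(*p*)</sup> directly translates into a smaller learning rate for that layer.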

We can now prove (see Appendix B), as a corollary to Theorem 1, that our design choices can guarantee convergence to the true posterior distribution.

**Corollary 1.** *(Theorem 1) Consider dynamics of the form* (9) *and define the stationary distribution ρ*(*θ*) ∝ exp(−*φ*(*θ*))*. If ηP*(*θ*) = **Λ**<sup>−1</sup> *as in* (13)*, C*(*θ*) = **Λ** − *B*(*θ*) *and C*(*θ*) ⪰ **0** ∀*θ, then φ*(*θ*) = *f*(*θ*)*.*

If the aforementioned conditions are satisfied, it is in fact simple to show that the relevant matrices satisfy the conditions in Equation (10). The covariance matrix of the composite noise is said to be *isotropic* within the layers of (deep) models: **Σ**(*θ*) = *C*(*θ*) + *B*(*θ*) = diag(*λ*<sup>(1)</sup>, ..., *λ*<sup>(1)</sup>, ..., *λ*<sup>(*N<sub>l</sub>*)</sup>, ..., *λ*<sup>(*N<sub>l</sub>*)</sup>). From a practical point of view, we choose **Λ** to be, among all valid matrices satisfying **Λ** − *B*(*θ*) ⪰ **0**, the smallest (the one with the smallest *λ*'s). Indeed, a larger **Λ** induces a smaller learning rate, thus unnecessarily reducing sampling speed.

Now, let us consider an ideal case, in which we assume the SG noise covariance *B***(***θ***)** and **Λ** to be known in advance. The procedure described in Algorithm 1 illustrates a naive SG method that uses the *injected noise* covariance *C***(***θ***)** to sample from the true posterior.


This deceptively simple procedure generates samples from the true posterior with a non-vanishing learning rate, as shown earlier. However, it cannot be used in practice, as *B*(*θ*) and **Λ** are unknown. Furthermore, the algorithm requires computationally expensive operations: computing (**Σ** − *B*(*θ*))<sup>1/2</sup> requires O(*d*<sup>3</sup>) operations, and applying *C*(*θ*)<sup>1/2</sup> costs O(*d*<sup>2</sup>) multiplications.

Next, we describe a practical variant of our approach, where we use approximations at the expense of generating samples from the true posterior distribution. We note that [10] suggest exploring a related preconditioning, but do not develop this path in their work. Moreover, the proposed method shares similarities with a scheme proposed in [22] although the analysis we perform here is different.

## *3.1. A Practical Method: Isotropic SGD*

To render the idealized sampling method practical, it is necessary to consider some additional assumptions. As we explain at the end of this section, the assumptions that follow are less strict than other approaches in the literature.

**Assumption 1.** *The* SG *noise covariance B***(***θ***)** *can be approximated with a diagonal matrix, i.e., B***(***θ***)** = diag(*b*(*θ*))*.*

**Assumption 2.** *The signal-to-noise ratio (SNR) of a gradient is small enough that, in the stationary regime, the second-order moment of the gradient is a good estimate of the true variance. Hence, combining with Assumption 1, b*(*θ*) ≈ E[*g*(*θ*) ⊙ *g*(*θ*)]/2*, where* ⊙ *indicates the elementwise product.*

**Assumption 3.** *The sum of the variances of the noise components, layer by layer, can be assumed to be constant in the stationary regime. Then, β*<sup>(*p*)</sup> = ∑<sub>*j*∈*I<sub>p</sub>*</sub> *b<sub>j</sub>*(*θ*)*, where I<sub>p</sub> is the set of indices of parameters belonging to the p-th layer.*

The diagonal covariance assumption (i.e., Assumption 1) is common in other works, such as [2,11]. The small signal-to-noise ratio stated in Assumption 2 is in line with recent studies, such as [11,23]. Assumption 3 is similar to assumptions that appeared in earlier work, such as [24]. Please note that Assumptions 2 and 3 must hold in the stationary regime, when the process reaches the bottom valley of the loss landscape. The matrix diag(*b*(*θ*)) has been associated in the literature with the *empirical* Fisher information matrix [2,25]. As we do not consider this matrix for preconditioning purposes, we do not further investigate this connection.

Given our assumptions and design choices, it is possible to show (see Appendix B) that the optimal (i.e., the smallest possible) **Λ** = diag(*λ*<sup>(1)</sup>, ..., *λ*<sup>(1)</sup>, ..., *λ*<sup>(*N<sub>l</sub>*)</sup>, ..., *λ*<sup>(*N<sub>l</sub>*)</sup>) satisfying Corollary 1 can be obtained as *λ*<sup>(*p*)</sup> = *β*<sup>(*p*)</sup>. Please note that we do not assume *B*(*θ*) to be known; instead, we estimate its components with a simple procedure by computing *λ*<sup>(*p*)</sup> = ∑<sub>*j*∈*I<sub>p</sub>*</sub> *b<sub>j</sub>*(*θ*) = ||*g*<sup>(*p*)</sup>(*θ*)||<sup>2</sup><sub>2</sub>, where *g*<sup>(*p*)</sup>(*θ*) is the portion of the stochastic gradient corresponding to the *p*-th layer. The composite noise matrix **Σ** = **Λ** is then a layer-wise isotropic covariance matrix, which inspires the name of our proposed method, *Isotropic* SGD (I-SGD). The practical implementation of I-SGD is shown in Algorithm 2. I-SGD can either be used to obtain posterior samples starting from a pre-trained model, or to train a model from scratch. In either case, the estimates of *B*(*θ*) are used to compute **Λ**, as discussed above. An important consideration is that once all *λ*<sup>(*p*)</sup> have been estimated, the learning rate, layer by layer, is determined *automatically*: for the *p*-th layer, the learning rate is *η*<sup>(*p*)</sup> = (*λ*<sup>(*p*)</sup>)<sup>−1</sup>. A simpler approach is to use a unique learning rate for all layers, where the equivalent *λ* is the sum of all *λ*<sup>(*p*)</sup>.

#### **Algorithm 2** I-SGD: practical posterior sampling

**SAMPLE**(*θ*<sub>0</sub>):
- *θ* ← *θ*<sub>0</sub>
- **loop**
  - *g* ← ∇*f̃*(*θ*)
  - **for** *p* ← 1 **to** *N<sub>l</sub>* **do**
    - *n* ∼ *N*(**0**, *I*)
    - *C*(*θ*)<sup>1/2</sup> ← (*λ*<sup>(*p*)</sup> − (1/2) *g*<sup>(*p*)</sup> ⊙ *g*<sup>(*p*)</sup>)<sup>1/2</sup>
    - *g*<sup>(*p*)</sup> ← (1/*λ*<sup>(*p*)</sup>)(*g*<sup>(*p*)</sup> + √2 *C*(*θ*)<sup>1/2</sup> *n*)
  - **end for**
  - *θ* ← *θ* − *g*
- **end loop**
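A single I-SGD update in the spirit of Algorithm 2 can be sketched in a few lines of numpy. This is an illustrative toy, not a reference implementation: parameters and gradients are plain per-layer arrays, *λ*<sup>(*p*)</sup> is estimated from the current minibatch gradient only, and negative injected-noise variances (an assumption-violating corner case) are clipped to zero.

```python
import numpy as np

def isgd_step(params, grads, rng):
    """One I-SGD update (sketch of Algorithm 2). `params` and `grads` are
    lists of per-layer arrays. lambda^{(p)} = ||g^{(p)}||^2 and the diagonal
    SG-noise estimate is g^2 / 2 (Assumption 2); negative entries of the
    injected-noise variance are clipped to zero."""
    updated = []
    for theta_p, g_p in zip(params, grads):
        lam = float(np.sum(g_p ** 2))                 # lambda^{(p)}
        c_half = np.sqrt(np.maximum(lam - 0.5 * g_p ** 2, 0.0))  # C^{1/2}
        n = rng.standard_normal(g_p.shape)            # n ~ N(0, I)
        step = (g_p + np.sqrt(2.0) * c_half * n) / lam  # lr = 1/lambda^{(p)}
        updated.append(theta_p - step)
    return updated

rng = np.random.default_rng(0)
params = [np.zeros(5), np.zeros(3)]       # hypothetical two-layer model
grads = [np.ones(5), 0.1 * np.ones(3)]    # stand-in minibatch gradients
params = isgd_step(params, grads, rng)
```

In practice one would iterate this step, collecting the successive *θ* as posterior samples.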

A Remark on Convergence

In summary, I-SGD is a practical method to perform approximate Bayesian posterior sampling, backed by solid theoretical foundations. Our assumptions, which are at the origin of the approximate nature of I-SGD, are less strict than those used in the literature on SG-MCMC methods. More precisely, the theory behind I-SGD can explain convergence to the true posterior with a non-vanishing learning rate in the particular case when Assumption 1 holds and the estimation of *B*(*θ*) is perfect. Even with perfect estimates, this is not the case for SGFS, which requires the correction term ∇(*B*(*θ*)<sup>−1</sup>) = **0**. Additionally, both SGRLD and SGRHMC are more demanding than I-SGD because they require computing ∇(*B*(*θ*)<sup>−1</sup>), for which an estimation procedure is elusive. Finally, the method by Springenberg et al. [11] needs a *constant*, diagonal *B*(*θ*), a condition that does not necessarily hold for deep models.

#### *3.2. Computational Cost*

The computational cost of I-SGD is as follows. As with [4], we define the cost of computing a gradient minibatch as *Cg*(*Nb*, *d*). Thanks to Assumptions 1 and 2, the computational cost for estimating the noise covariance scales as O(*d*) multiplications. The computational cost of generating random samples with the desired covariance scales as O(*d*) square roots and O(*d*) multiplications (without considering the cost of generating random numbers). The overall cost of our method is the sum of the above terms. Notice that the cost of estimating the noise covariance does not depend on the minibatch size *Nb*. We would like to stress that in many modern models, the real computational bottleneck is the backward propagation for the computation of the gradients. As all the SG-MCMC methods considered in this work require one gradient evaluation per step, the different methods have in practice the same complexity.

The space complexity of I-SGD is the same as that of SGHMC, SGFS and variants: it scales as O(*N*<sub>sam</sub>*d*), where *N*<sub>sam</sub> is the number of posterior samples.

## **4. Experiments**

The empirical analysis of our method, and its comparison to alternative approaches from the literature, is organized as follows. First, we proceed with a validation of I-SGD using the standard UCI datasets [26] and a shallow neural network. Then we move to the case of deeper models: we begin with a simple CNN used on the MNIST [27] dataset, then move to the standard RESNET-18 [28] deep network using the CIFAR-10 [29] dataset.

We compare I-SGD to other Bayesian sampling methods such as SGHMC [4], SGLD [2], and to alternative approaches to approximate Bayesian inference, including MCD [12], SWAG [9] and VSGD [10]. In general, our results indicate that: (1) I-SGD achieves similar or superior performance compared to competitors when measuring uncertainty quantification, even with simple datasets and models; (2) I-SGD is simple to tune compared to alternatives; (3) I-SGD is competitive when used for deep Bayesian modeling, even when compared to standard methods used in the literature. In particular, the proposed method shares some of the strengths of VSGD, such as automatically determined learning rates, and the simplicity of SGLD. Appendix B includes additional implementation details on I-SGD. Appendix C presents detailed configurations of all methods we compare, and additional experimental results.

#### *4.1. A Disclaimer on Performance Characterization*

It is important to stress a detail about the analysis of the experimental campaign. The discussion is usually focused on how well the various methods represent the true posterior distribution. Different methods can or cannot claim convergence to the true posterior according to certain assumptions and the nature of the hyperparameters. In the experimental validation, however, we do not have access to the form of the true posterior, as finding it is exactly the problem we are trying to solve. The practical solution adopted is to compare the different methods in terms of *proxy* metrics evaluated on the test sets, such as accuracy and uncertainty metrics. Being better in terms of these performance metrics does not imply that the sampling method is better at approximating the posterior distribution, and outperforming competitors in terms of these metrics does not provide sufficient information about the intrinsic quality of the sampling scheme.

#### *4.2. Regression Tasks, with Simple Models*

We consider several regression tasks defined on the UCI datasets. We use a simple neural network configuration with two fully connected layers and a ReLU activation function; the hidden layer includes 50 units. In this set of experiments, we use the following metrics: the root mean square error (RMSE) to judge the model predictive performance and the mean negative log-likelihood (MNLL) as a proxy for uncertainty quantification. We note that the task of tuning our competitors was far from trivial. We used our own version of SGHMC, based on [11], to ensure a proper understanding of the implementation internals, and we proceeded with a tuning process to find appropriate values for the numerous hyperparameters. In this set of experiments, we omit results for SWAG, which we keep for more involved scenarios.

Tables 1 and 2 report a complete overview of our results for a selection of UCI datasets. For each method and each dataset, we also report how many of the 10 splits considered failed to converge, indicated as *F* = ... . As explained in Appendix C, we implemented a temperature-scaled version of VSGD. A clear picture emerges from this first set of experiments: while the RMSE performance is similar across methods, for the MNLL averaging over multiple samples clearly improves the uncertainty quantification capabilities. SGHMC is in many cases better than the alternatives; however, considering the standard deviation of the results, it is difficult to claim clear superiority of one method over the others.


#### **Table 1.** RMSE results for regression on UCI datasets.


#### *4.3. Classification Tasks, with Deeper Models*

Next, we compare I-SGD against competitors on image classification tasks. First, we use the MNIST dataset and a simple LENET-5 CNN [30]. All methods are compared based on the test accuracy (ACC), MNLL, and the expected calibration error (ECE, [31]). Additionally, at test time, we carry out predictions on both MNIST and NOT-MNIST; the latter is a dataset equivalent to MNIST, but representing letters rather than digits (http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html, accessed on 24 October 2021). This experimental setup is often used to check whether the entropy of the predictions on NOT-MNIST is higher than the entropy of the predictions on MNIST (the entropy of the output of an *N<sub>cl</sub>*-class classifier, represented by the vector **p**, is defined as

$$-\sum_{i=1}^{N_{cl}} p_i \log p_i).$$

Table 3 indicates that all methods are essentially equivalent in terms of accuracy and MNLL. Together with the classical in- and out-of-distribution entropies, we consider receiver operating characteristic (ROC) diagrams comparing the detection of out-of-distribution samples and false alarms when using the entropy as the test statistic. Results, reported in Figure 1, clearly show that: (1) collecting multiple samples improves the uncertainty quantification capabilities; (2) I-SGD is competitive (although not the best scheme) and, importantly, outperforms the closest approach to ours, i.e., VSGD. The experimental results show that I-SGD improves the quality of the BASELINE model with respect to all metrics. To test whether the improvements are due merely to "*additional training*" or are intrinsically due to the Bayesian averaging properties, we consider alternative deterministic baselines (details in Appendix C). For this set of experiments the best performing is BASELINE R. As can be appreciated by comparing Table 3 and Figure 1, while it is possible to improve the classical metrics, I-SGD (and other methods) still outperforms the baselines by a large margin in terms of detection of out-of-distribution samples.


**Table 3.** Results for classification on MNIST dataset.

**Figure 1.** Detection/False alarm diagrams for different methods.

We now move on to a classical image classification problem with deep convolutional networks, whereby we use the CIFAR10 dataset and the RESNET-18 network architecture. For this set of experiments, we compare I-SGD, SGHMC, SWAG, and VSGD using again the test accuracy and MNLL, which we report in Table 4. As usual, we compare the results against the baseline of the individual network resulting from the pre-training phase. Results are obtained by averaging over three independent seeds. Notice, as explained in Appendix C, that for SWAG we consider two variants: the Bayesian-correct one (SWAG) and a second variant that has better performance (SWAG wd). We stress again, as highlighted in Section 4.1, that goodness of approximation of the posterior and performance do not always correlate positively. In this case as well, we found I-SGD to be competitive with other methods and superior to the baseline. Among the competitors, we found I-SGD to be the easiest to tune, given its fixed learning rate informed by theoretical considerations; we believe this is an important aspect for wide adoption of our proposal by practitioners.


**Table 4.** Results for classification on the CIFAR10 dataset.

#### **5. Conclusions**

SG methods allowed Bayesian posterior sampling algorithms, such as MCMC, to regain relevance in an age when datasets have reached extremely large sizes. However, despite mathematical elegance and promising results, current approaches from the literature are restricted to simple models. Indeed, the sampling properties of these algorithms are determined by simplifying assumptions on the loss landscape, which do not hold for the kind of complex models which are popular these days, such as deep models. Meanwhile, SG-MCMC algorithms require vanishing learning rates, which force practitioners to develop creative annealing schedules that are often model specific and difficult to justify.

We have attempted to target these weaknesses by suggesting a simpler algorithm that relies on fewer parameters and less strict assumptions compared to the literature on SG-MCMC. We used a unified mathematical notation to deepen our understanding of the role of the covariance of the noise of stochastic gradients and learning rate on the behavior of SG-MCMC algorithms. We then presented a practical variant of the SGD algorithm, which uses a constant learning rate, and an additional noise to perform Bayesian posterior sampling. Our proposal is derived from the ideal method, in which it is guaranteed that samples are generated from the true posterior. When the learning rate and noise terms are empirically estimated, with no user intervention, our method offers a very good approximation to the posterior, as demonstrated by the extensive experimental campaign.

We verified empirically the quality of our approach, and compared its performance to state-of-the-art SG-MCMC and alternative methods. Results, which span a variety of settings, indicated that our method is competitive to the alternatives from the state-of-the-art, while being much simpler to use.

**Author Contributions:** Formal analysis, G.F., D.M., M.F. and P.M.; Methodology, G.F., D.M., M.F. and P.M.; Software, G.F., D.M., M.F. and P.M.; Writing—original draft, G.F., D.M., M.F. and P.M.; Writing—review & editing, G.F., D.M., M.F. and P.M. All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

**Funding:** MF gratefully acknowledges support from the AXA Research Fund and the Agence Nationale de la Recherche (grant ANR-18-CE46-0002 and ANR-19-P3IA-0002).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Background and Related Material**

*Appendix A.1. The Minibatch Gradient Approximation*

Starting from the gradient of the logarithm of the posterior density:

$$-\nabla f(\theta) = \sum_{i=1}^{N} \nabla \log p(\mathcal{U}_i \mid \theta) + \nabla \log p(\theta),$$

it is possible to define its *minibatch* version by computing the gradient on a random subset I*Nb* with cardinality *Nb* of all the indices. The minibatch gradient *g*(*θ*) is computed as

$$-g(\theta) = \frac{N}{N_b} \sum_{i \in \mathcal{I}_{N_b}} \nabla \log p(\mathcal{U}_i \mid \theta) + \nabla \log p(\theta).$$

By simple calculations it is possible to show that the estimate is unbiased (E[*g*(*θ*)] = ∇*f*(*θ*)). The estimation error covariance is defined as E[(*g*(*θ*) − ∇*f*(*θ*))(*g*(*θ*) − ∇*f*(*θ*))<sup>⊤</sup>] = 2*B*(*θ*).

If the minibatch size is large enough, invoking the central limit theorem, we can state that the minibatch gradient is normally distributed:

$$\mathbf{g}(\boldsymbol{\theta}) \sim N(\nabla f(\boldsymbol{\theta}), 2B(\boldsymbol{\theta})) .$$
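The unbiasedness claim is easy to check numerically. The snippet below uses a hypothetical one-dimensional Gaussian likelihood *N*(*θ*, 1) with a flat prior, so that −∇*f*(*θ*) = ∑<sub>*i*</sub>(*x<sub>i</sub>* − *θ*); the dataset, minibatch size, and number of draws are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Nb = 1000, 50
x = rng.standard_normal(N)     # hypothetical 1-D dataset
theta = 0.3

# For a N(theta, 1) likelihood and a flat prior: -grad f = sum_i (x_i - theta)
full_grad = np.sum(x - theta)

# Minibatch estimates: (N / Nb) * sum over a random subset of the data
draws = np.array([
    (N / Nb) * np.sum(x[rng.choice(N, Nb, replace=False)] - theta)
    for _ in range(20_000)])

print(abs(draws.mean() - full_grad))   # small: the estimator is unbiased
```

The spread of `draws` around `full_grad` is exactly the SG noise whose covariance is denoted 2*B*(*θ*) above.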

*Appendix A.2. Gradient Methods without Momentum*

Appendix A.2.1. The SDE from Discrete Time

We start from the generalized updated rule of SGD:

$$
\delta\theta\_n = -\eta P(\theta\_{n-1}) (\mathbf{g}(\theta\_{n-1}) + w\_n).
$$

Since *g*(*θn*−1) ∼ *N*(∇ *f*(*θn*−1), 2*B*(*θn*−1)) we can rewrite the above equation as:

$$\delta\theta_n = -\eta P(\theta_{n-1}) \left(\nabla f(\theta_{n-1}) + w'_n\right),$$

where *w*′<sub>*n*</sub> ∼ *N*(**0**, 2**Σ**(*θ*<sub>*n*−1</sub>)). If we separate the deterministic and random components we can equivalently write:

$$\delta\theta_n = -\eta P(\theta_{n-1}) \nabla f(\theta_{n-1}) - \eta P(\theta_{n-1}) w'_n = -\eta P(\theta_{n-1}) \nabla f(\theta_{n-1}) + \sqrt{2\eta P^2(\theta_{n-1}) \Sigma(\theta_{n-1})} \, v_n,$$

where *<sup>v</sup><sup>n</sup>* <sup>∼</sup> *<sup>N</sup>*(0, <sup>√</sup>*ηI*). When *<sup>η</sup>* is small enough (*<sup>η</sup>* <sup>→</sup> *dt*) we can interpret the above equation as the discrete-time simulation of the following SDE [15]:

$$d\theta_t = -P(\theta_t) \nabla f(\theta_t) \, dt + \sqrt{2\eta P(\theta_t)^2 \Sigma(\theta_t)} \, dW_t,$$

where *dW<sup>t</sup>* is a *d*−dimensional Brownian motion.
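As a concrete check, this SDE can be simulated with an Euler–Maruyama discretization. In the sketch below (an illustrative toy under assumptions of our own: *P* = *I* and **Σ** = *η*<sup>−1</sup>*I*, so that *ηP* = **Σ**<sup>−1</sup> as in Theorem 1), the diffusion term reduces to √2 *dW<sub>t</sub>* and the target is exp(−*f*).

```python
import numpy as np

def langevin_samples(grad_f, d=1, eta=0.01, n_steps=50_000, seed=0):
    """Euler-Maruyama discretization of the SDE above with P = I and
    Sigma = eta^{-1} I: theta_{n+1} = theta_n - eta * grad f
    + sqrt(2 * eta) * xi, targeting exp(-f)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    out = np.empty((n_steps, d))
    for n in range(n_steps):
        theta = (theta - eta * grad_f(theta)
                 + np.sqrt(2.0 * eta) * rng.standard_normal(d))
        out[n] = theta
    return out

# f(theta) = theta^2 / 2: the stationary distribution is N(0, 1)
s = langevin_samples(lambda th: th)
```

With the quadratic loss, the empirical variance of the second half of the chain should be close to 1, the variance of the target.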

Appendix A.2.2. Proof of Theorem 1

The stationary distribution of the above SDE, *ρ*(*θ*) ∝ exp(−*φ*(*θ*)), satisfies the following FPE

$$0 = \text{Tr}\left\{\nabla \left[\nabla^{\top}(f(\theta))P(\theta)\rho(\theta) + \eta \nabla^{\top}(P(\theta)^2 \Sigma(\theta)\rho(\theta))\right]\right\},$$

that we rewrite as

$$0 = \text{Tr}\{\nabla[\nabla^\top(f(\theta))P(\theta)\rho(\theta) - \eta \nabla^\top(\phi(\theta))P(\theta)^2\Sigma(\theta)\rho(\theta) + \eta \nabla^\top(P(\theta)^2\Sigma(\theta))\rho(\theta)]\}.$$

The above equation is verified with ∇ *f*(*θ*) = ∇*φ*(*θ*) if

$$\begin{cases} \nabla^{\top} \left( P(\theta)^{2} \Sigma(\theta) \right) = \mathbf{0} \\ \eta P(\theta)^{2} \Sigma(\theta) = P(\theta) \to \eta P(\theta) = \Sigma(\theta)^{-1} \end{cases}$$

that proves Theorem 1.

*Appendix A.3. Gradient Methods with Momentum*

Appendix A.3.1. The SDE from Discrete Time

The general set of update equations for (discrete-time) momentum-based algorithms is:

$$\begin{cases} \delta\theta_n = \eta P(\theta_{n-1}) M^{-1} r_{n-1} \\ \delta r_n = -\eta A(\theta_{n-1}) M^{-1} r_{n-1} - \eta P(\theta_{n-1}) \left(g(\theta_{n-1}) + w_n\right). \end{cases}$$

Similarly to the case without momentum, we rewrite the second equation of the system as

$$\begin{aligned} \delta r_n &= -\eta A(\theta_{n-1}) M^{-1} r_{n-1} - \eta P(\theta_{n-1}) \left(g(\theta_{n-1}) + w_n\right) \\ &= -\eta A(\theta_{n-1}) M^{-1} r_{n-1} - \eta P(\theta_{n-1}) \nabla f(\theta_{n-1}) + \sqrt{2\eta P^2(\theta_{n-1}) \Sigma(\theta_{n-1})} \, v_n, \end{aligned}$$

where again *<sup>v</sup><sup>n</sup>* <sup>∼</sup> *<sup>N</sup>*(0, <sup>√</sup>*ηI*). If we define the supervariable *<sup>z</sup>* <sup>=</sup> [*θ*,*r*] we can rewrite the system as

$$\delta z_n = -\eta \begin{bmatrix} \mathbf{0} & -P(\theta_{n-1}) \\ P(\theta_{n-1}) & A(\theta_{n-1}) \end{bmatrix} s(z_{n-1}) + \sqrt{2\eta D(z_{n-1})} \, \nu_n,$$

where

$$s(z) = \begin{bmatrix} \nabla f(\theta) \\ M^{-1} r \end{bmatrix}, \quad D(z) = \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & P(\theta)^2 \Sigma(\theta) \end{bmatrix} \quad \text{and} \quad \nu_n \sim N(0, \sqrt{\eta} I).$$

As the learning rate goes to zero (*η* → *dt*), similarly to the previous case, we can interpret the above difference equation as a discretization of the following SDE

$$dz_t = -\begin{bmatrix} \mathbf{0} & -P(\theta_t) \\ P(\theta_t) & A(\theta_t) \end{bmatrix} s(z_t) \, dt + \sqrt{2\eta D(z_t)} \, dW_t.$$

Appendix A.3.2. Proof of Theorem 2

As before we assume that the stationary distribution has form *ρ*(*z*) ∝ exp(−*φ*(*z*)). The corresponding FPE is

$$0 = \text{Tr}\left(\nabla \begin{pmatrix} \mathbf{s}(\mathbf{z})^\top \begin{bmatrix} \mathbf{0} & -P(\theta) \\ P(\theta) & A(\theta) \end{bmatrix} \rho(\mathbf{z}) + \eta \left(\nabla^\top \left(\mathbf{D}(\mathbf{z})\rho(\mathbf{z})\right)\right)\right).$$

Notice that since ∇*D*(*z*) = 0 we can rewrite

$$\begin{split} 0 &= \text{Tr}\left(\nabla\left(s(\boldsymbol{z})^{\top}\begin{bmatrix} \mathbf{0} & -\boldsymbol{P}(\boldsymbol{\theta}) \\ \boldsymbol{P}(\boldsymbol{\theta}) & \boldsymbol{A}(\boldsymbol{\theta}) \end{bmatrix}\rho(\boldsymbol{z}) + \eta\nabla^{\top}(\rho(\boldsymbol{z}))\boldsymbol{D}(\boldsymbol{z})\right)\right) \\ &= \text{Tr}\left(\nabla\left(s(\boldsymbol{z})^{\top}\begin{bmatrix} \mathbf{0} & -\boldsymbol{P}(\boldsymbol{\theta}) \\ \boldsymbol{P}(\boldsymbol{\theta}) & \boldsymbol{A}(\boldsymbol{\theta}) \end{bmatrix}\rho(\boldsymbol{z}) - \eta\nabla^{\top}(\phi(\boldsymbol{z}))\boldsymbol{D}(\boldsymbol{z})\rho(\boldsymbol{z})\right)\right) \\ &= \text{Tr}\left(\nabla\left(s(\boldsymbol{z})^{\top}\begin{bmatrix} \mathbf{0} & -\boldsymbol{P}(\boldsymbol{\theta}) \\ \boldsymbol{P}(\boldsymbol{\theta}) & \boldsymbol{A}(\boldsymbol{\theta}) \end{bmatrix}\rho(\boldsymbol{z}) - \eta\nabla^{\top}(\phi(\boldsymbol{z}))\begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{P}(\boldsymbol{\theta})^{2}\boldsymbol{\Sigma}(\boldsymbol{\theta}) \end{bmatrix}\rho(\boldsymbol{z})\right)\right). \end{split}$$

that is verified with ∇*φ*(*z*) = *s*(*z*) if

$$\begin{cases} \nabla^\top P(\theta) = \mathbf{0}^\top \\ A(\theta) = \eta P(\theta)^2 \Sigma(\theta). \end{cases}$$

Indeed, if ∇<sup>⊤</sup>*P*(*θ*) = **0**<sup>⊤</sup>, then

$$\begin{split} &\operatorname{Tr}\left(\nabla\left(\nabla^{\top}(\boldsymbol{\phi}(\boldsymbol{z}))\boldsymbol{\rho}(\boldsymbol{z})\begin{bmatrix} \mathbf{0} & -\boldsymbol{P}(\boldsymbol{\theta})\\ \boldsymbol{P}(\boldsymbol{\theta}) & \mathbf{0} \end{bmatrix}\right)\right) = \nabla^{\top}\left(\begin{bmatrix} \mathbf{0} & -\boldsymbol{P}(\boldsymbol{\theta})\\ \boldsymbol{P}(\boldsymbol{\theta}) & \mathbf{0} \end{bmatrix} \nabla(\boldsymbol{\phi}(\boldsymbol{z}))\boldsymbol{\rho}(\boldsymbol{z})\right) = \\ &\nabla^{\top}\left(\begin{bmatrix} \mathbf{0} & -\boldsymbol{P}(\boldsymbol{\theta})\\ \boldsymbol{P}(\boldsymbol{\theta}) & \mathbf{0} \end{bmatrix}\right) \nabla(\boldsymbol{\phi}(\boldsymbol{z}))\boldsymbol{\rho}(\boldsymbol{z}) + \operatorname{Tr}\left(\begin{bmatrix} \mathbf{0} & -\boldsymbol{P}(\boldsymbol{\theta})\\ \boldsymbol{P}(\boldsymbol{\theta}) & \mathbf{0} \end{bmatrix} \nabla\left(\nabla^{\top}(\boldsymbol{\phi}(\boldsymbol{z}))\boldsymbol{\rho}(\boldsymbol{z})\right)\right) = 0, \end{split}$$

since $\nabla^\top \begin{bmatrix} \mathbf{0} & -P(\theta) \\ P(\theta) & \mathbf{0} \end{bmatrix} = \mathbf{0}^\top$ and the second term is zero because $\begin{bmatrix} \mathbf{0} & -P(\theta) \\ P(\theta) & \mathbf{0} \end{bmatrix}$ is anti-symmetric while $\nabla\big(\nabla^\top(\phi(z)) \rho(z)\big)$ is symmetric.

Thus, we can rewrite

$$\begin{aligned} &\operatorname{Tr}\left(\nabla\left(s(z)^\top \begin{bmatrix} \mathbf{0} & -P(\theta) \\ P(\theta) & A(\theta) \end{bmatrix} \rho(z) - \eta \nabla^\top(\phi(z)) \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & P(\theta)^2 \Sigma(\theta) \end{bmatrix} \rho(z)\right)\right) \\ =\; &\operatorname{Tr}\left(\nabla\left(s(z)^\top \begin{bmatrix} \mathbf{0} & -P(\theta) \\ P(\theta) & A(\theta) \end{bmatrix} \rho(z) - \nabla^\top(\phi(z)) \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \eta P(\theta)^2 \Sigma(\theta) \end{bmatrix} \rho(z)\right)\right) \\ =\; &\operatorname{Tr}\left(\nabla\left(s(z)^\top \begin{bmatrix} \mathbf{0} & -P(\theta) \\ P(\theta) & A(\theta) \end{bmatrix} \rho(z) - \nabla^\top(\phi(z)) \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & A(\theta) \end{bmatrix} \rho(z)\right)\right) \\ =\; &\operatorname{Tr}\left(\nabla\left(\big(s(z) - \nabla(\phi(z))\big)^\top \begin{bmatrix} \mathbf{0} & -P(\theta) \\ P(\theta) & A(\theta) \end{bmatrix} \rho(z)\right)\right) = 0, \end{aligned}$$

and then ∇*φ*(*z*) = *s*(*z*), proving Theorem 2.

#### **Appendix B. I-SGD Method Proofs and Details**

*Appendix B.1. Proof of Corollary 1*

The requirement *C***(***θ***)** ⪰ **0** ∀*θ* ensures that the injected noise covariance is valid. The composite noise matrix is equal to **Σ(***θ***)** = **Λ**. Since ∇**Σ(***θ***)** = ∇**Λ** = **0** and *ηP***(***θ***)** = **Λ**−<sup>1</sup> by construction, Theorem 1 is satisfied.

## *Appendix B.2. Proof of Optimality of* **Λ**

Our design choice is to select *λ*(*p*) = *β*(*p*). By the assumptions, the matrix *B***(***θ***)** is diagonal, and consequently *C***(***θ***)** = **Λ** − *B***(***θ***)** is diagonal as well. The preconditioner **Λ** must be chosen to satisfy the positive semidefinite constraint, i.e., *C***(***θ***)***ii* ≥ 0 ∀*i*, ∀*θ*. Equivalently, we must satisfy *λ*(*p*) − *b<sub>j</sub>*(*θ*) ≥ 0 ∀*j* ∈ *I<sub>p</sub>*, ∀*p*, ∀*θ*, where *I<sub>p</sub>* is the set of indices of parameters belonging to the *p*-th layer. By Assumption 3, i.e., *β*(*p*) = ∑<sub>*k*∈*I<sub>p</sub>*</sub> *b<sub>k</sub>*(*θ*), it is easy to show that *b<sub>j</sub>*(*θ*), *j* ∈ *I<sub>p</sub>*, is upper bounded as *b<sub>j</sub>*(*θ*) ≤ *β*(*p*). To satisfy the positive semidefinite requirement in all cases, the minimum valid choice of *λ*(*p*) is then *λ*(*p*) = *β*(*p*).

## *Appendix B.3. Algorithmic Details*

In this section, we provide further details about the practical implementation of the proposed scheme. At any (discrete) time instant, a minibatch version of the gradient is computed that is distributed, according to the hypotheses of the main paper, as *g*(*θ*) ∼ *N*(∇ *f*(*θ*), 2*b*(*θ*)). Since we assumed that the second-order moment is a good approximation of the variance, we can estimate *b*(*θ*) as ½ (*g*(*θ*) ⊙ *g*(*θ*)). In practice, we found the following running-average estimation procedure to be the most robust:

$$b(\theta) \leftarrow \mu b(\theta) + (1 - \mu)\frac{1}{2}(\mathbf{g}(\theta) \odot \mathbf{g}(\theta))\tag{A1}$$

where *μ* ∈ (0, 1]. In all experiments we considered *μ* = 0.5.
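As a concrete illustration, the running average of Equation (A1) can be sketched in a few lines of NumPy (the function name and toy dimensions below are our own):

```python
import numpy as np

def update_noise_estimate(b, g, mu=0.5):
    """One step of the running average of Equation (A1):
    b <- mu * b + (1 - mu) * 0.5 * (g ⊙ g).

    Assumes the gradient second-order moment is a good proxy for
    (twice) the minibatch-gradient variance, per the paper's hypotheses."""
    return mu * b + (1.0 - mu) * 0.5 * (g * g)

# toy usage: with a constant gradient g = 1, the estimate converges to 0.5
b = np.zeros(4)
for _ in range(50):
    b = update_noise_estimate(b, np.ones(4), mu=0.5)
```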

After a warmup period, the various *λ*(*p*) are estimated, layer per layer, as *λ*(*p*) = ∑<sub>*k*∈*I<sub>p</sub>*</sub> *b<sub>k</sub>*(*θ*) and kept constant until the end. The estimation procedure continues during the sampling phase, as the quantity *λ*(*p*) − *b*(*θ*) is needed at every step. As the learning rate is derived as 2/*λ*(*p*), we found that the use of second-order moments instead of variances, and in certain cases temperature scaling, kept the simulated trajectories more stable.

#### **Appendix C. Methodology**

We hereafter present additional implementation details.

#### *Appendix C.1. Regression Tasks, with Simple Models*

For this set of experiments, the BASELINE is obtained by running the ADAM optimizer for 20,000 steps with learning rate 0.01 and default parameters. At test time we use 100 samples to estimate the predictive posterior distribution, using Equation (3), for the *sampling* methods (I-SGD, SGLD, SGHMC, VSGD), with a keep-every value equal to 1000. The I-SGD and VSGD sampling methods are started from the BASELINE. For I-SGD we selected temperature 0.01, while for SGHMC and SGLD we performed experiments for temperatures 1 and 0.01. We modified the implementation of VSGD, as the original implementation produced unstable learning rates (as also noticed in [9]). A simple and effective solution, which we implemented and kept throughout the experimental campaigns, is to divide the learning rate by the number of parameters (thus performing variational inference on a tempered version of the posterior). For SGLD, the learning rate decay is the one suggested in [2], with initial and final learning rates equal to 10<sup>−6</sup> and 10<sup>−8</sup>, respectively. For MCD we collected 1000 samples with a standard dropout rate of 0.5. All our experiments use 10 splits. The batch size is 64 for all methods.

#### *Appendix C.2. Classification Task,* CONVNET

For the LENET-5 on MNIST experiment, we also consider the SWAG algorithm. At test time we use 30 samples for all methods. Baselines are again trained using the ADAM optimizer for 20,000 steps with learning rate 0.01 and default parameters. For I-SGD and SGHMC we collected samples for the two temperatures 1 and 0.01. SGLD has initial and final learning rates of 10<sup>−3</sup> and 10<sup>−5</sup>. For all the sampling methods we collect 100 samples with a keep-every of 10,000 steps. SWAG results are obtained by collecting the statistics over 300 epochs using the ADAM optimizer and decreasing the learning rate every epoch in accordance with the original paper's schedule [9]. DROP results are obtained by training the networks with SGD, with learning rate 0.005 and momentum 0.5. The number of collected samples for this method is 1000. The batch size for all methods is 128.

As explained in the main text, we performed an ablation study on the considered baselines. In Table A1 we report the results for the additional variants obtained by early stopping BASELINE S (10,000 iterations instead of 20,000), to ablate overfitting, and BASELINE L, trained for 30,000 iterations. Finally, we include the best-performing BASELINE R, obtained by starting from BASELINE, reducing the learning rate by a factor of 10 and training for 10,000 more iterations.


**Table A1.** Baselines comparison for classification on MNIST dataset.

#### *Appendix C.3. Classification Task, Deeper Models*

We here report details for the RESNET-18 on CIFAR10 experiments. The BASELINE is obtained with the ADAM optimizer with learning rate 0.01, decreased by a factor of 10 every 50 epochs for a total of 200 epochs, and weight decay of 0.05. For this set of experiments no temperature scaling was required. We could not find good hyperparameters for the SGLD scheme. Concerning I-SGD, SGHMC and VSGD, the keep-every value is chosen as 10,000 and the number of collected samples is 30. For SWAG we used the default parameters described in [9]. Notice that for SWAG we performed the following ablation study: we trained the networks considering as loss function the joint log-likelihood, with and without the weight decay suggested in the original work [9]. From a purely Bayesian perspective no weight decay should be considered, as this information is implicit in the prior; however, we found that without the extra decay SWAG was not able to obtain competitive results. As underlined in Section 4.1, a better posterior approximation does not necessarily translate into better empirical results.

## *Appendix C.4. Definition of the Metrics*

For regression datasets, we consider RMSE and MNLL. Consider a single datapoint *U<sub>i</sub>* = (*x<sub>i</sub>*, *y<sub>i</sub>*), with *x<sub>i</sub>* the input of the model and *y<sub>i</sub>* the true corresponding output. The output of the model, for a single sample of parameters *θ<sub>j</sub>*, is $\hat{y}_{\theta_j}(x_i)$. RMSE is defined as

$$\sqrt{\frac{1}{N}\sum_{i=1}^{N} \left\| y_i - \mu(x_i) \right\|^2},$$

where $\mu(x_i)$ is the empirical mean $\frac{1}{N_{MC}}\sum_{j=1}^{N_{MC}} \hat{y}_{\theta_j}(x_i)$. MNLL is defined instead as

$$\frac{1}{N}\sum_{i=1}^{N} \left( \frac{1}{2}\log\left(2\pi\sigma_i^2\right) + \frac{1}{2}\frac{\left\| y_i - \mu(x_i) \right\|^2}{\sigma_i^2} \right),$$

where $\sigma_i^2$ is the empirical variance.

For classification datasets, we consider ACC, MNLL and entropy. Consider a single datapoint *U<sub>i</sub>* = (*x<sub>i</sub>*, *y<sub>i</sub>*), with *x<sub>i</sub>* the input of the model and *y<sub>i</sub>* the true corresponding label. The output of the model, for a single sample of parameters *θ<sub>j</sub>*, is the *N<sub>cl</sub>*-dimensional probability vector $\mathbf{p}_{\theta_j}(x_i)$. The averaged probability vector for a single datapoint is $\mathbf{p}(x_i) = \frac{1}{N_{MC}}\sum_{j=1}^{N_{MC}} \mathbf{p}_{\theta_j}(x_i)$. ACC is defined as $\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left(\arg\max \mathbf{p}(x_i) = y_i\right)$. MNLL is computed as $-\frac{1}{N}\sum_{i=1}^{N} \log \mathbf{p}_{y_i}(x_i)$. Entropy, as stated in the main text, is instead computed according to $-\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{N_{cl}} \mathbf{p}_k(x_i)\log\left(\mathbf{p}_k(x_i)\right)$.

## **References**


## *Article* **PAC-Bayes Unleashed: Generalisation Bounds with Unbounded Losses**

**Maxime Haddouche 1, Benjamin Guedj 2,3,\*, Omar Rivasplata <sup>2</sup> and John Shawe-Taylor <sup>2</sup>**


**Abstract:** We present new PAC-Bayesian generalisation bounds for learning problems with unbounded loss functions. This extends the relevance and applicability of the PAC-Bayes learning framework, where most of the existing literature focuses on supervised learning problems with a bounded loss function (typically assumed to take values in the interval [0;1]). In order to relax this classical assumption, we propose to allow the range of the loss to depend on each predictor. This relaxation is captured by our new notion of *HYPothesis-dependent rangE* (HYPE). Based on this, we derive a novel PAC-Bayesian generalisation bound for unbounded loss functions, and we instantiate it on a linear regression problem. To make our theory usable by the largest audience possible, we include discussions on actual computation, practicality and limitations of our assumptions.

**Keywords:** statistical learning theory; PAC-Bayes; generalisation bounds

## **Citation:** Haddouche, M.; Guedj, B.; Rivasplata, O.; Shawe-Taylor, J. PAC-Bayes Unleashed: Generalisation Bounds with Unbounded Losses. *Entropy* **2021**, *23*, 1330. https://doi.org/10.3390/ e23101330

Academic Editor: Boris Ryabko

Received: 22 August 2021 Accepted: 25 September 2021 Published: 12 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

## **1. Introduction**

Since its emergence in the late 1990s, the PAC-Bayes theory (see the seminal works of [1–3], the recent survey by [4] and work by [5]) has been a powerful tool to obtain generalisation bounds and to derive efficient learning algorithms. Generalisation bounds are helpful for understanding how a learning algorithm may perform on future similar batches of data. While the classical generalization bounds typically address the performance of individual predictors from a given hypothesis class, PAC-Bayes bounds typically address a randomized predictor defined by a distribution over the hypothesis class.

PAC-Bayes bounds were originally meant for binary classification problems [6–8], but the literature now includes many contributions involving any bounded loss function (without loss of generality, with values in [0; 1]), not just the binary loss. Our goal is to provide new PAC-Bayes bounds that are valid for unbounded loss functions, and thus extend the usability of PAC-Bayes to a much larger class of learning problems. To do so, we reformulate the general PAC-Bayes theorem of [9] and use it as basic building block to derive our new PAC-Bayes bound.

Some ways to circumvent the bounded range assumption on the losses have been explored in the recent literature. For instance, one approach consists of assuming a tail decay rate on the loss, such as sub-gaussian or sub-exponential tails [10,11]; however, this approach requires the knowledge of additional parameters. Some other works have also looked into the analysis for heavy-tailed losses, e.g., ref. [12] proposed a polynomial moment-dependent bound with *f*-divergences, while [13] devised an exponential bound that assumes the second (uncentered) moment of the loss is bounded by a constant (with a truncated risk estimator, as recalled in Section 4 below). A somewhat related approach was explored by [14], who do not assume boundedness of the loss, but instead control higher-order moments of the generalization gap through the Efron-Stein variance proxy. See also [5].


We investigate a different route here. We introduce the *HYPothesis-dependent rangE* (HYPE) condition, which means that the loss is upper-bounded by a term that depends on the chosen predictor (but does not depend on the data). Thus, effectively, the loss may have an arbitrarily large range. The HYPE condition allows us to derive an upper bound on the exponential moment of a suitably chosen functional, which, combined with the general PAC-Bayes theorem, leads to our new PAC-Bayes bound. To illustrate it, we instantiate the new bound on a linear regression problem, which additionally serves the purpose of illustrating that our HYPE condition is easy to verify in practice, given an explicit formulation of the loss function. In particular, we shall see in the linear regression setting that a mere use of the triangle inequality is enough to check the HYPE condition. The technical assumptions on which our results are based are comparable to those of the classical PAC-Bayes bounds; we state them in full detail, with discussions, for the sake of clarity and to make our work accessible.

**Our contributions are twofold.** (i) We propose PAC-Bayesian bounds holding with unbounded loss functions, therefore overcoming a limitation of the mainstream PAC-Bayesian literature for which a bounded loss is usually assumed. (ii) We analyse the bound, its implications, limitations of our assumptions, and their usability by practitioners. We hope this will extend the PAC-Bayes framework into a widely usable tool for a significantly wider range of problems, such as unbounded regression or reinforcement learning problems with unbounded rewards.

**Outline.** Section 2 introduces our notation and definition of the HYPE condition and provides a general PAC-Bayesian bound, which is valid for any learning problem complying with a mild assumption. For the sake of completeness, we present how our approach (designed for the unbounded case) behaves in the bounded case (Section 3). This section is not the core of our work, but rather serves as a safety check and particularises our bound to more classical PAC-Bayesian assumptions. We also provide numerical experiments. Section 4 introduces the notion of *softening functions* and particularises Section 2's PAC-Bayesian bound. In particular, we make explicit all terms in the right-hand side. Section 5.1 extends our results to linear regression (which has been studied from the perspective of PAC-Bayes in the literature, most recently by [15]). We also experimentally illustrate the behaviour of our bound. Finally, Section 6 presents, in detail, related works and Section 7 contains all proofs of the original claims we make in the paper.

#### **2. Framework and Preliminary Results**

The learning problem is specified by three variables $(\mathcal{H}, \mathcal{Z}, \ell)$ consisting of a set $\mathcal{H}$ of predictors, the data space $\mathcal{Z}$, and a loss function $\ell : \mathcal{H}\times\mathcal{Z} \to \mathbb{R}^+$.

For a given positive integer *m*, we consider size-*m* datasets. The space of all possible datasets of this fixed size is $\mathcal{S} = \mathcal{Z}^m$; an arbitrary element of this space is $s = (z_1, \dots, z_m)$. We denote by *S* a random dataset: $S = (Z_1, \dots, Z_m)$, where the random data points $Z_i$ are independent and sampled from the same distribution *μ* over $\mathcal{Z}$. We call *μ* the data-generating distribution. The assumption that the $Z_i$'s are *independent and identically distributed* is typically called the i.i.d. data assumption. It means that the random sample *S* (of size *m*) has distribution $\mu^{\otimes m}$, which is the product of *m* copies of *μ*.

For any predictor *h* ∈ H, we define the *empirical risk* of *h* over a sample *s*, denoted *Rs*(*h*), and the *theoretical risk* of *h*, denoted *R*(*h*), as:

$$R_s(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h, z_i) \qquad \text{and} \qquad R(h) = \mathbb{E}_{\mu}[\ell(h, Z)]$$

respectively, where $\mathbb{E}_{\mu}[\ell(h, Z)]$ denotes the expectation with respect to $Z \sim \mu$. Finally, we define the *risk gap* $\Delta_s(h) = R(h) - R_s(h)$ for any $h \in \mathcal{H}$ and $s \in \mathcal{S}$. Often, $\Delta_s(h)$ is referred to as the generalisation gap.

Notice that for a random dataset *S*, the empirical risk $R_S(h)$ is random, with expected value $\mathbb{E}_{\mu^{\otimes m}}[R_S(h)] = R(h)$, where $\mathbb{E}_{\mu^{\otimes m}}$ denotes the expectation under the distribution of the random sample *S*.

In general, $\mathbb{E}_{\mu}[\cdot]$ denotes an expectation under the distribution *μ*. When we want to emphasise the role of the random variable $Z \sim \mu$, we write $\mathbb{E}_Z[\cdot]$ or $\mathbb{E}_{Z\sim\mu}[\cdot]$ instead of $\mathbb{E}_{\mu}[\cdot]$. We use a similar convention for expectations related to any other distributions and random quantities. We now introduce the key concept of our analysis.

**Definition 1.** (HYPE). *A loss function* $\ell : \mathcal{H}\times\mathcal{Z} \to \mathbb{R}^+$ *is said to satisfy the hypothesis-dependent range (*HYPE*) condition if there exists a function* $K : \mathcal{H} \to \mathbb{R}^+\setminus\{0\}$ *such that* $\sup_{z\in\mathcal{Z}} \ell(h, z) \le K(h)$ *for every predictor h. We then say that* $\ell$ *is* HYPE(*K*) *compliant.*
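As a quick illustration of how easy the HYPE condition is to verify, consider the squared loss with bounded data: if $\|x\| \le B_x$ and $|y| \le B_y$, Cauchy-Schwarz and the triangle inequality give $\ell(w, (x, y)) = (\langle w, x\rangle - y)^2 \le (\|w\| B_x + B_y)^2 =: K(w)$. The sketch below (bounds and dimensions are our own choices) checks this numerically:

```python
import numpy as np

# HYPE check for the squared loss with bounded data:
# ell(w, (x, y)) = (<w, x> - y)^2 <= (||w|| * Bx + By)^2 = K(w).
rng = np.random.default_rng(0)
Bx, By = 1.0, 1.0
w = rng.normal(size=5)
K_w = (np.linalg.norm(w) * Bx + By) ** 2   # hypothesis-dependent range

for _ in range(1000):
    x = rng.normal(size=5)
    x /= max(1.0, np.linalg.norm(x) / Bx)  # enforce ||x|| <= Bx
    y = rng.uniform(-By, By)               # |y| <= By
    assert (w @ x - y) ** 2 <= K_w + 1e-12
```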

Let $\mathcal{M}_1^+(\mathcal{H})$ be the set of probability distributions on $\mathcal{H}$. We assume that all considered probability measures on $\mathcal{H}$ are defined on a fixed *σ*-algebra over $\mathcal{H}$, while the notation $\mathcal{M}_1^+(\mathcal{H})$ hides the *σ*-algebra, for simplicity. For $P, P' \in \mathcal{M}_1^+(\mathcal{H})$, the notation $P' \ll P$ indicates that $P'$ is absolutely continuous with respect to $P$ (i.e., $P'(A) = 0$ if $P(A) = 0$ for measurable $A \subset \mathcal{H}$). We write $P' \sim P$ to indicate that $P' \ll P$ and $P \ll P'$, i.e., these two distributions are absolutely continuous with respect to each other.

We now recall a result from Germain et al. [9]. Note that while implicit in many PAC-Bayes works (including theirs), we make it explicit that both the prior *P* and the posterior *Q* must be absolutely continuous with respect to each other. We discuss this restriction below.

**Theorem 1.** (Adapted from [9], Theorem 2.1.) *For any* $P \in \mathcal{M}_1^+(\mathcal{H})$ *with no dependency on data, for any function* $F : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$*, define the exponential moment:*

$$\chi := \mathbb{E}_S \, \mathbb{E}_{h \sim P} \left[ e^{F(R_S(h),\, R(h))} \right].$$

*If F is convex, then for any* $\delta \in [0; 1]$*, with probability of at least* $1 - \delta$ *over random samples S, simultaneously for all* $Q \in \mathcal{M}_1^+(\mathcal{H})$ *such that* $Q \sim P$ *we have:*

$$F(\mathbb{E}\_{h \sim \mathbb{Q}}[R\_S(h)], \mathbb{E}\_{h \sim \mathbb{Q}}[R(h)]) \le \text{KL}(\mathbb{Q}||P) + \log\left(\frac{\chi}{\delta}\right).$$

The proof is deferred to Section 7.1. Note that the proof in [9] requires that $P \ll Q$, although this is not explicitly stated; we highlight it in our own proof. While $Q \ll P$ is classical and necessary for $\mathrm{KL}(Q||P)$ to be meaningful, $P \ll Q$ appears to be more restrictive. In particular, we have to choose *Q* with the exact same support as *P* (e.g., pairing a Gaussian with a truncated Gaussian is not possible). However, we can still apply our theorem when *P* and *Q* belong to the same parametric family of distributions, e.g., both 'full-support' Gaussian or Laplace distributions, but these are just two examples and there are many others.

Note that Alquier et al. [10] (Theorem 4.1) adapted a result from Catoni [8], which only requires $Q \ll P$. This comes at the expense of what Alquier et al. [10] (Definition 2.3) called a *Hoeffding assumption*: the exponential moment *χ* is assumed to be bounded by a function depending only on hyperparameters (such as the dataset size *m* or parameters given by Hoeffding's assumption). Our analysis does not require this assumption, which might prove restrictive in practice.

Theorem 1 may be seen as a basis to recover many classical PAC-Bayesian bounds. For instance, $F(x, y) = 2m(x - y)^2$ recovers McAllester's bound as recalled in [4] (Theorem 1). To get a usable bound, the outstanding task is to bound the exponential moment *χ*. Note that a previous attempt was made in [11], as described in Section 6.1 below. Furthermore, under the assumption that the distribution *P* has no dependency on the data, we may swap the order of integration in the exponential moment thanks to the Fubini-Tonelli theorem and the positivity of the exponential:

$$\chi = \mathbb{E}_{h \sim P} \, \mathbb{E}_S \left[ e^{F(R_S(h),\, R(h))} \right].$$

This is the starting point for the way that the exponential moment was handled in several works in the PAC-Bayes literature. Essentially, for a fixed *h*, one may upper-bound the innermost expectation (with respect to *S*) using standard exponential moment inequalities.

In this work, we will use Theorem 1 with $F(x, y) = m^{\alpha}D(x, y)$, where $\alpha > 0$ and $D : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$ is a convex function. In this case, the high-probability inequality of the theorem takes the form:

$$D\left( \mathbb{E}_{h\sim Q}[R_S(h)], \mathbb{E}_{h\sim Q}[R(h)] \right) \leq \frac{1}{m^{\alpha}} \left( \mathrm{KL}(Q||P) + \log \left( \frac{1}{\delta}\, \mathbb{E}_{h\sim P}\, \mathbb{E}_S \, e^{m^{\alpha} D\left(R_S(h),\, R(h)\right)} \right) \right). \tag{1}$$

Our goal is to control $\mathbb{E}_S \, e^{m^{\alpha}D(R_S(h),\, R(h))}$ for a fixed *h*, when $D(x, y) = y - x$. This will readily give us control of the exponential moment *χ*. To do so, we propose the following theorem:

**Theorem 2.** *Let* $h \in \mathcal{H}$ *be a fixed predictor and* $\alpha \in \mathbb{R}$*. If the loss function* $\ell$ *is* HYPE(*K*) *compliant, then for* $\Delta_S(h) = R(h) - R_S(h)$ *we have:*

$$\mathbb{E}\_{\mathcal{S}}\left[e^{m^{\alpha}\Delta\_{\mathcal{S}}(h)}\right] \le \exp\left(\frac{K(h)^2}{2m^{1-2\alpha}}\right).$$

**Proof.** Let *h* ∈ H. Then:

$$\begin{aligned} \mathbb{E}_{S}\left[e^{m^{\alpha}\Delta_{S}(h)}\right] &= \mathbb{E}\left[\exp\left(m^{\alpha-1}\sum_{i=1}^{m}\left(\ell(h,Z_{i})-R(h)\right)\right)\right] \\ &= \mathbb{E}\left[\prod_{i=1}^{m}\exp\left(m^{\alpha-1}\left(\ell(h,Z_{i})-R(h)\right)\right)\right] \\ &= \prod_{i=1}^{m}\mathbb{E}\left[\exp\left(m^{\alpha-1}\left(\ell(h,Z_{i})-R(h)\right)\right)\right]. \end{aligned}$$

We now apply Hoeffding's lemma: for any $i \in \{1, \dots, m\}$, the random (in $Z_i$) variable $\ell(h, Z_i) - R(h)$ is centered, taking values in $[-K(h); K(h)]$, so that:

$$\mathbb{E}\left[\exp\left(m^{\alpha-1}(\ell(h,Z_i)-R(h))\right)\right] \le \exp\left(m^{2\alpha-2}\frac{4K(h)^2}{8}\right),$$

and finally:

$$\mathbb{E}_S\left[e^{m^{\alpha} \Delta_S(h)}\right] \le \prod_{i=1}^m \exp\left(m^{2\alpha-2} \frac{4K(h)^2}{8}\right) = \exp\left(\frac{K(h)^2}{2m^{1-2\alpha}}\right).$$

The strength of this result lies in the fact that $\frac{K(h)^2}{m^{1-2\alpha}}$ is decreasing in *m* when $\alpha < 1/2$; more generally, the choice of the hyperparameter *α* controls how fast the exponential moment explodes as *m* grows.
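Theorem 2 is easy to sanity-check by Monte Carlo for a toy bounded loss; the sketch below (a uniform loss of our own choosing, so that K(h) = K) verifies the inequality empirically:

```python
import numpy as np

# Monte Carlo check of Theorem 2 for ell(h, Z) ~ Uniform[0, K],
# so K(h) = K and Delta_S(h) = R(h) - R_S(h) with R(h) = K / 2.
rng = np.random.default_rng(1)
K, m, alpha = 2.0, 50, 0.5
R = K / 2.0                                    # true risk E[ell]

samples = rng.uniform(0.0, K, size=(100_000, m))
delta = R - samples.mean(axis=1)               # generalisation gap per dataset
emp_moment = np.mean(np.exp(m**alpha * delta))
bound = np.exp(K**2 / (2 * m**(1 - 2 * alpha)))
assert emp_moment <= bound                     # the Hoeffding-based bound holds
```

With α = 1/2, the bound reduces to exp(K²/2), comfortably above the empirical exponential moment.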

For convenient cross-referencing, we state the following rewriting of Theorem 1.

**Theorem 3.** *Let the loss* $\ell$ *be* HYPE(*K*) *compliant. For any* $P \in \mathcal{M}_1^+(\mathcal{H})$ *with no data dependency, for any* $\alpha \in \mathbb{R}$ *and for any* $\delta \in [0; 1]$*, with probability of at least* $1 - \delta$ *over size-m random samples S, simultaneously for all Q such that* $Q \sim P$ *we have:*

$$\mathbb{E}_{h\sim Q}[R(h)] \le \mathbb{E}_{h\sim Q}[R_S(h)] + \frac{1}{m^{\alpha}} \left( \mathrm{KL}(Q||P) + \log \frac{\mathbb{E}_{h\sim P}\left[\exp\left(\frac{K(h)^2}{2m^{1-2\alpha}}\right)\right]}{\delta} \right).$$

**Proof.** We first apply Theorem 1 with $F(x, y) = m^{\alpha}(y - x)$. More precisely, we use Equation (1) with $D(x, y) = y - x$. We then conclude with Theorem 2.

#### **3. Safety Check: The Bounded Loss Case**

*3.1. Theoretical Results*

At this stage, the reader might wonder whether this new approach allows for the recovery of known results in the bounded case: the answer is yes.

In this section, we study the case where $\ell$ is bounded by some constant $C \in \mathbb{R}^+ \setminus \{0\}$. In other words, we consider the case that $\sup_h \sup_z \ell(h, z) \le C$. We provide a bound, valid for any choice of "priors" *P* and "posteriors" *Q* such that $P \sim Q$, which is an immediate corollary of Theorem 3.

**Proposition 1.** *Let* $\ell$ *be* HYPE(*K*) *compliant, with* $K(h) = C$ *constant, and let* $\alpha \in \mathbb{R}$*. Let* $P \in \mathcal{M}_1^+(\mathcal{H})$ *be a distribution with no data dependency. Then, for any* $\delta \in [0; 1]$*, with probability of at least* $1 - \delta$ *over random m-samples S, simultaneously for all* $Q \in \mathcal{M}_1^+(\mathcal{H})$ *such that* $Q \sim P$ *we have:*

$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[R_S(h)] + \frac{\mathrm{KL}(Q||P) + \log(1/\delta)}{m^{\alpha}} + \frac{C^2}{2m^{1-\alpha}}.$$

**Remark 1.** *We provide Proposition 1 to evaluate the robustness of our approach, for instance by comparing it with the PAC-Bayesian bound found in Germain et al. [11]. This discussion can be found in Section 6.1, where the bound from Germain et al. [11] is presented in detail.*

**Remark 2.** *At first glance, a naive remark: in order to control the rate of convergence of all the terms of the bound in Proposition 1 (as is often the case in classical PAC-Bayesian bounds), the only case of interest is in fact* $\alpha = \frac{1}{2}$*. However, one may notice that the factor* $C^2$ *is not optimisable, while the KL term is. Hence, if* $C^2$ *turns out to be too big in practice, one wants the ability to attenuate its influence as much as possible, and this may lead us to consider* $\alpha < 1/2$*. The following lemma answers this question.*

**Lemma 1.** *For any given* $K_1 > 0$*, the function* $f_{K_1}(\alpha) := \frac{K_1}{m^{\alpha}} + \frac{C^2}{2m^{1-\alpha}}$ *reaches its minimum at*

$$\alpha\_0 = \frac{1}{2} + \frac{1}{2\log(m)}\log\left(\frac{2K\_1}{C^2}\right).$$

**Proof.** Differentiating $f_{K_1}$ explicitly and solving $f_{K_1}'(\alpha) = 0$ provides the result.
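The closed form for $\alpha_0$ can be verified numerically against a grid search (the constants below are arbitrary):

```python
import numpy as np

def f(alpha, K1, C, m):
    # Objective of Lemma 1: K1 / m^alpha + C^2 / (2 * m^(1 - alpha)),
    # matching the last two terms of the bound in Proposition 1.
    return K1 / m**alpha + C**2 / (2 * m**(1 - alpha))

def alpha_star(K1, C, m):
    # Closed-form minimiser from Lemma 1
    return 0.5 + np.log(2 * K1 / C**2) / (2 * np.log(m))

K1, C, m = 10.0, 1.0, 1000
a0 = alpha_star(K1, C, m)
grid = np.linspace(0.01, 0.99, 999)
# The closed form is never worse than any grid point
assert f(a0, K1, C, m) <= f(grid, K1, C, m).min() + 1e-12
```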

**Remark 3.** *Lemma 1 indicates that with a fixed "prior" P and "posterior" Q, taking K*<sup>1</sup> = KL(*Q*||*P*) + log(1/*δ*)*, gives the optimised value of the bound in Proposition 1. We numerically show in Section 3.2 (first experiment there) that optimising α leads to significantly better results.*

Now the only remaining question is how to optimise the KL divergence. To do so, we may need to fix an "informed prior" to minimise the KL divergence with an interesting posterior. This idea has been studied by [16,17] and, more recently, by Mhammedi et al. [18], Rivasplata et al. [5], among others. We will adapt it to our problem in the simplest way.

We now introduce some additional notation. For a sample *s* = (*z*1, ... , *zm*) and *<sup>k</sup>* ∈ {1..*m*}, we define *<sup>s</sup>*≤*<sup>k</sup>* := {*z*1, ... , *zk*} and *<sup>s</sup>*>*<sup>k</sup>* := {*zk*+1, ... , *zm*}. Then, similarly, for a random sample *S*, we have the splits *S*≤*<sup>k</sup>* and *S*>*k*.

**Proposition 2.** *Let* $\ell$ *be* HYPE(*K*) *compliant, with constant* $K(h) = C$*, and* $\alpha_1, \alpha_2 \in \mathbb{R}$*. Consider any "priors"* $P_1 \in \mathcal{M}_1^+(\mathcal{H})$ *(possibly dependent on* $S_{>m/2}$*) and* $P_2 \in \mathcal{M}_1^+(\mathcal{H})$ *(possibly dependent on* $S_{\le m/2}$*). Then, for any* $\delta \in [0; 1]$*, with probability of at least* $1 - \delta$ *over random size-m samples S, simultaneously for all* $Q \in \mathcal{M}_1^+(\mathcal{H})$ *such that* $Q \sim P_1$ *and* $Q \sim P_2$ *we have:*

$$\begin{split} \mathbb{E}_{h \sim Q}[R(h)] &\leq \mathbb{E}_{h \sim Q}[R_S(h)] + \frac{1}{2} \left( \frac{\mathrm{KL}(Q || P_1) + \log(2/\delta)}{(m/2)^{\alpha_1}} + \frac{C^2}{2(m/2)^{1-\alpha_1}} \right) \\ &+ \frac{1}{2} \left( \frac{\mathrm{KL}(Q || P_2) + \log(2/\delta)}{(m/2)^{\alpha_2}} + \frac{C^2}{2(m/2)^{1-\alpha_2}} \right). \end{split}$$

**Proof.** Let *P*1, *P*2, *Q* be as stated in Proposition 2. We first notice that by using Proposition 1 on the two halves of the sample, we obtain, with a probability of at least 1 − *δ*/2:

$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q} \left[ \frac{1}{m/2} \sum_{i=1}^{m/2} \ell(h, Z_i) \right] + \frac{\mathrm{KL}(Q || P_1) + \log(2/\delta)}{(m/2)^{\alpha_1}} + \frac{C^2}{2(m/2)^{1-\alpha_1}}$$

and also with probability at least 1 − *δ*/2:

$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q} \left[ \frac{1}{m/2} \sum_{i=1}^{m/2} \ell(h, Z_{m/2+i}) \right] + \frac{\mathrm{KL}(Q || P_2) + \log(2/\delta)}{(m/2)^{\alpha_2}} + \frac{C^2}{2(m/2)^{1-\alpha_2}}.$$

Hence, with a probability of at least 1 − *δ*, both inequalities hold, and the result follows by adding them and dividing by 2.

**Remark 4.** *One can notice that the main difference between Proposition 2 and Proposition 1 lies in the implicit PAC-Bayesian paradigm that our priors must not depend on the data. With this last proposition, we implicitly allow P*<sup>1</sup> *to depend on S*>*m*/2 *and P*<sup>2</sup> *on S*≤*m*/2*, which can in practice lead to far more accurate priors. We numerically show this fact in Section 3.2's second experiment. Note that this idea is not new and has been studied, for instance, in [19] for the specific case of SVMs.*

#### *3.2. Numerical Experiments*

Our experimental framework has been inspired by the work of [18].

**Settings.** We generate synthetic data for classification, and we use the 0–1 loss. The data space is $\mathcal{Z} = \mathcal{X} \times \mathcal{Y} = \mathbb{R}^d \times \{0, 1\}$ with $d \in \mathbb{N}$. The set of predictors $\mathcal{H}$ is parameterised with *d*-dimensional 'weight' vectors: $\mathcal{H} = \{h_w : \mathcal{X} \to \mathcal{Y} \mid w \in \mathbb{R}^d\}$. For simplicity, we identify $h_w$ with *w*, and we also identify the space $\mathcal{H}$ with the weight space $\mathcal{W} = \mathbb{R}^d$. For $z = (x, y) \in \mathcal{Z}$ and $w \in \mathcal{W}$, we define the loss as $\ell(w, z) := |\mathbb{1}\{\phi(w^\top x) > 1/2\} - y|$, where $\phi(r) = \frac{1}{1+e^{-r}}$. We want to learn an optimised predictor given a dataset $S = (Z_i)_{i=1..m}$, where $Z_i = (X_i, Y_i)$. To do so, we use *regularised logistic regression* and compute:

$$\psi(S) := \arg\min\_{w \in \mathcal{W}} \lambda \frac{||w||^2}{2} - \frac{1}{m} \sum\_{i=1}^{m} y\_i \log \left( \phi(w^\top \mathbf{x}\_i) \right) + (1 - y\_i) \log \left( 1 - \phi(w^\top \mathbf{x}\_i) \right) \tag{2}$$

where *λ* is a fixed regularisation parameter.

We also restrict the probability distributions (over W = R^d) considered for this learning problem. We consider the Gaussian distributions N(*w*, σ² I_d) with centre *w* ∈ R^d and diagonal covariance σ² I_d ∈ R^{d×d}, with σ² > 0.

**Parameters.** We set *δ* = 0.05 and *λ* = 0.01. We approximately solve Equation (2) by using the minimize function of the optimisation module in Python, with the Powell method. To approximate Gaussian expectations, we use Monte Carlo sampling.
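The fitting step just described can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names (`phi`, `fit_w`) and the test data are our own choices; only the objective (the regularised negative log-likelihood of Equation (2)) and the use of `scipy.optimize.minimize` with the Powell method come from the text.

```python
# Sketch of the estimator in Equation (2): regularised logistic regression
# minimised with scipy's derivative-free Powell method.
import numpy as np
from scipy.optimize import minimize

def phi(r):
    """Logistic sigmoid phi(r) = 1 / (1 + e^{-r})."""
    return 1.0 / (1.0 + np.exp(-r))

def fit_w(X, y, lam=0.01):
    """Minimise lam*||w||^2/2 - (1/m) * log-likelihood, as in Equation (2)."""
    m, d = X.shape
    def objective(w):
        p = np.clip(phi(X @ w), 1e-12, 1 - 1e-12)  # avoid log(0)
        nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        return lam * np.dot(w, w) / 2.0 + nll
    res = minimize(objective, x0=np.zeros(d), method="Powell")
    return res.x

# Small synthetic check: linearly separable labels from a random w_true.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = (phi(X @ w_true) > 0.5).astype(float)
w_hat = fit_w(X, y)
```

The objective is convex and smooth, so Powell (which needs no gradients) is a reasonable, if not the fastest, choice here.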

**Synthetic data.** We generate synthetic data for *d* = 10 according to the following process: for a fixed sample size *m*, we draw X_1, ..., X_m from the multivariate Gaussian distribution N(0, I_d), and for each *i* we compute the label of X_i as Y_i = 1{φ(⟨w∗, X_i⟩) > 1/2}, where w∗ is the vector formed by the first *d* digits of the number *π*.
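The generator above can be sketched in a few lines. The function name `make_data` and the seed handling are ours; the hard-coded digits of π, the standard Gaussian inputs, and the labelling rule are taken from the description above.

```python
# Minimal sketch of the synthetic data generator: inputs X_i ~ N(0, I_d),
# labels Y_i = 1{phi(<w*, X_i>) > 1/2}, with w* built from the digits of pi.
import numpy as np

def phi(r):
    return 1.0 / (1.0 + np.exp(-r))

def make_data(m, d=10, seed=0):
    # w* is built from the first d digits of pi = 3.141592653...
    digits = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
    w_star = np.array(digits[:d], dtype=float)
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(d), np.eye(d), size=m)
    Y = (phi(X @ w_star) > 0.5).astype(int)  # equivalently 1{<w*, x> > 0}
    return X, Y, w_star

X, Y, w_star = make_data(m=100)
```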

**Normalisation trick.** Given the shape of the predictors, we notice that for any *w* ∈ W:

$$\mathbb{1}\{\phi(w^\top x) > 1/2\} = 1 \quad \Leftrightarrow \quad \frac{1}{1 + \exp(-w^\top x)} > \frac{1}{2} \quad \Leftrightarrow \quad w^\top x > 0.$$

Thus, the value of the prediction is exclusively determined by the sign of the inner product, which is not influenced by the norm of the vector. Then, for any sample *S*, we call the **normalisation trick** the fact of considering *w*ˆ(*S*)/||*w*ˆ(*S*)|| instead of *w*ˆ(*S*) in our calculations. This process does not deteriorate the quality of the prediction, yet it considerably reduces the value of the KL divergence.
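A quick numerical illustration of both claims. The closed-form KL between isotropic Gaussians is standard; the helper names and the sampled vectors are our own choices, not taken from the paper.

```python
# Sketch of the normalisation trick: predictions depend only on sign(<w, x>),
# so replacing w by w/||w|| leaves them unchanged, while it shrinks
# KL(N(w, s2*I_d) || N(0, s02*I_d)) whenever ||w|| > 1.
import numpy as np

def kl_gauss(w, s2, s02):
    """KL between isotropic Gaussians N(w, s2*I_d) and N(0, s02*I_d)."""
    d = len(w)
    return 0.5 * (d * s2 / s02 + np.dot(w, w) / s02 - d + d * np.log(s02 / s2))

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
w = 5.0 * rng.standard_normal(10)        # a weight vector with ||w|| >> 1
w_norm = w / np.linalg.norm(w)

pred = (X @ w > 0)                        # 0-1 predictions before ...
pred_norm = (X @ w_norm > 0)              # ... and after normalisation
```

With equal variances the KL reduces to ||w||²/(2σ₀²), so normalising the centre shrinks it to 1/(2σ₀²).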

#### 3.2.1. First Experiment

Our goal here is to highlight the point discussed in Remark 2, i.e., the influence of the parameter *α* in Proposition 1. We arbitrarily fix σ₀² = 1/2, and define our *naive prior* as P₀ = N(0, σ₀² I_d). For a fixed dataset *S*, we define our posterior as P(S) := N(*w*ˆ(*S*), σ² I_d), with σ² ∈ {1/2, ..., 1/2^J} (for J = log₂(m)) chosen to minimise the bound among these candidates. We computed two curves: first, Proposition 1 with *α* = 1/2; second, Proposition 1 again with *α* equal to the value proposed in Lemma 1. Notice that to compute this last bound, we first optimised our choice of posterior with *α* = 1/2 and then optimised *α*, to be consistent with Lemma 1. Indeed, we proved this lemma by assuming that the KL divergence was already fixed; hence our optimisation process is in two steps. Note that we chose to apply the normalisation trick here; we then obtained the left curve of Figure 1.
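The grid search just described can be sketched as follows. Proposition 1 itself is not reproduced in this excerpt, so the bound shape used below (empirical term + (KL + log(1/δ))/m^α + C²/(2m^{1−α}), with C = 1 for the 0–1 loss) follows the inequalities quoted at the start of this section; the stand-in posterior centre, the empirical risk value, and all names are illustrative assumptions.

```python
# Hedged sketch of evaluating a Proposition-1-style bound over the grids of
# sigma^2 and alpha used in the experiment.
import numpy as np

def kl_gauss(w, s2, w0, s02):
    """KL(N(w, s2*I) || N(w0, s02*I)) for isotropic Gaussians."""
    d = len(w)
    return 0.5 * (d * s2 / s02 + np.sum((w - w0) ** 2) / s02
                  - d + d * np.log(s02 / s2))

def bound(emp_risk, kl, m, alpha, delta=0.05, C=1.0):
    """Bound shape: empirical term + (KL + log(1/delta))/m^a + C^2/(2 m^(1-a))."""
    return (emp_risk + (kl + np.log(1.0 / delta)) / m**alpha
            + C**2 / (2 * m**(1 - alpha)))

m, d = 1000, 10
w_hat = np.ones(d) / np.sqrt(d)    # stand-in (normalised) posterior centre
emp_risk = 0.05                    # stand-in empirical risk estimate
J = int(np.log2(m))

best = np.inf
for s2 in [0.5 ** j for j in range(1, J + 1)]:          # sigma^2 candidates
    kl = kl_gauss(w_hat, s2, np.zeros(d), 0.5)
    for alpha in np.linspace(0.05, 0.95, 19):           # alpha candidates
        best = min(best, bound(emp_risk, kl, m, alpha))
```

Since the α-grid contains 1/2, the optimised bound can never be worse than the fixed α = 1/2 one, which is the point of the first experiment.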

**Discussion.** From this curve, we formulate several remarks. First, we remark that in this specific case, our theorem provides a tight result in practice (with an error rate of less than 10% for the bound with optimised *α*). Second, we can now confirm that choosing an optimised *α* leads to a tighter bound. In further studies, it will be relevant to adjust *α* with regard to the different terms of our bound instead of looking for an identical convergence rate for all terms.

#### 3.2.2. Second Experiment

We now study Proposition 2 to see if an informed prior effectively provides a tighter bound than a naive one. We use the notations introduced in Proposition 2. For a dataset *S*, we define w₁(S) = *w*ˆ(S_{>m/2}) as the vector resulting from the optimisation of Equation (2) on S_{>m/2}. Similarly, we define w₂(S) := *w*ˆ(S_{≤m/2}). We arbitrarily fix σ₀² = 1/2, and define our *informed priors* as P₁ = N(w₁(S), σ₀² I_d) and P₂ = N(w₂(S), σ₀² I_d). Finally, we define our posterior as P(S) := N(*w*ˆ(*S*), σ² I_d), with σ² ∈ {1/2, ..., 1/2^J} (for J = log₂(m)) chosen to minimise the bound among the same candidates as in the first experiment. We computed two curves: first, Proposition 1 with *α* optimised according to Lemma 1; second, Proposition 2 with α₁, α₂ optimised as well, and informed priors as defined above. We chose not to apply the normalisation trick here; we then obtained the right curve of Figure 1.

**Discussion.** It is clear that, within this framework, having an informed prior is a powerful tool to enhance the quality of our bound. Notice that we voluntarily chose not to apply the normalisation trick here. The reason is that this trick appears to be too powerful in practice, and applying it leads to counterproductive results; to highlight our point, the bound without an informed prior would be tighter than the one with an informed prior. Furthermore, this trick is linked to the specific structure of our problem and is not valid for an arbitrary classification problem. Thus, the idea of providing informed priors remains an interesting tool for most cases.

**Figure 1.** Above, the result of the first experiment, which highlights the importance of optimising *α*. Below, the result of the second experiment, which shows how effective an informed prior is.

#### **4. PAC Bayesian Bounds with Smoothed Estimator**

We now move on to controlling the right-hand side term in Theorem 3 when *K* is not constant. A first step is to consider a transformed estimate of the risk, inspired by the truncated estimator from [20], also used in [21] and, more recently, in [13]. The following is inspired by the results of [13], which we summarise in Section 6.

The idea is to modify the estimator R_S(h) for any *h* by introducing a threshold *t* and a function *ψ* which attenuates the influence of the empirical losses (ℓ(h, Z_i))_{i=1..m} that exceed *t*.

**Definition 2.** (*ψ*-risks). *For every t > 0 and ψ* : R⁺ → R⁺*, for any h* ∈ H*, we define the empirical ψ-risk R_{S,ψ,t} and the theoretical ψ-risk R_{ψ,t} as follows:*

$$R\_{S, \psi, t}(h) := \frac{t}{m} \sum\_{i=1}^{m} \psi\left(\frac{\ell(h, Z\_i)}{t}\right) \quad \text{and} \quad R\_{\psi, t}(h) = \mathbb{E}\_{\mu} \left[t \, \psi\left(\frac{\ell(h, Z)}{t}\right)\right]$$

*where Z* ∼ *μ. Notice that* E_S[R_{S,ψ,t}(h)] = R_{ψ,t}(h)*.*

We now focus on what we call *softening functions*, i.e., functions that temper high values of the loss function ℓ.

**Definition 3.** (Softening function). *We say that ψ* : R⁺ → R⁺ *is a softening function if:*

- *ψ is non-decreasing;*
- *for all x ∈ [0, 1], ψ(x) = x;*
- *for all x > 1, ψ(x) ≤ x.*
*We let* F *denote the set of all softening functions.*

**Remark 5.** *Notice that these three assumptions ensure that ψ is continuous at* 1*. For instance, the functions f* : *x* ↦ *x*1{*x* ≤ 1} + 1{*x* > 1} *and g* : *x* ↦ *x*1{*x* ≤ 1} + (2√x − 1)1{*x* > 1} *are in* F*. In Section 6 we compare these softening functions with those used by Holland [13].*
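The two examples from Remark 5 are easy to write out and check numerically. The function names `f` and `g` follow the remark; the property checks are ours.

```python
# The two softening functions of Remark 5:
#   f(x) = x on [0,1], then constant 1;
#   g(x) = x on [0,1], then 2*sqrt(x) - 1.
# Both equal the identity on [0,1], are non-decreasing, and stay below the
# identity on [1, +inf), so they satisfy Definition 3.
import numpy as np

def f(x):
    return np.where(np.asarray(x, dtype=float) <= 1.0, x, 1.0)

def g(x):
    x = np.asarray(x, dtype=float)
    # maximum() guards the sqrt branch, which np.where evaluates everywhere
    return np.where(x <= 1.0, x, 2.0 * np.sqrt(np.maximum(x, 0.0)) - 1.0)
```

Note that g(x) ≤ x for x ≥ 1 since 2√x − 1 ≤ x is equivalent to (√x − 1)² ≥ 0.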

Using *ψ* ∈ F, for a fixed threshold *t* > 0, the softened loss function *t ψ*(ℓ(h, z)/t) satisfies, for any *h* ∈ H, *z* ∈ Z:

$$t\,\psi\left(\frac{\ell(h,z)}{t}\right) \le t\,\psi\left(\frac{K(h)}{t}\right).$$

because *ψ* is non-decreasing. In this way, the exponential moment in Theorem 3 becomes far more controllable. The trade-off lies in the fact that softening ℓ (instead of taking ℓ directly) deteriorates our ability to distinguish between two bad predictions when both of them are greater than *t*. For instance, if we choose *ψ* ∈ F such that *ψ* = 1 on [1; +∞) and *t* > 0, then if *ψ*(ℓ(h, z)/t) = 1 for a certain pair (h, z), we cannot tell how far ℓ(h, z) is from *t* and can only affirm that ℓ(h, z) ≥ t.

We now move on to the following lemma, which controls the shortfall between E_{h∼Q}[R(h)] and E_{h∼Q}[R_{ψ,t}(h)] for all Q ∈ M₁⁺(H), for a given *ψ* and *t* > 0. To do that, we assume that *K* admits a finite moment under any posterior distribution:

$$\forall Q \in \mathcal{M}\_1^+(\mathcal{H}), \quad \mathbb{E}\_{h \sim Q}[K(h)] < +\infty. \tag{3}$$

For instance, in the case of <sup>H</sup> identified with a weight space <sup>W</sup> <sup>=</sup> <sup>R</sup>*N*, and if *<sup>K</sup>* is polynomial in ||*w*|| (where ||.|| denotes the Euclidean norm), then this assumption holds if we consider Gaussian priors and posteriors.

**Lemma 2.** *Assume that Equation (3) holds, and let ψ* ∈ F*, Q* ∈ M₁⁺(H)*, t* > 0*. We have:*

$$\mathbb{E}\_{h\sim Q}[R(h)] \le \mathbb{E}\_{h\sim Q}[R\_{\psi,t}(h)] + \mathbb{E}\_{h\sim Q}[K(h)\,\mathbb{1}\{K(h)\geq t\}].$$

**Proof.** Let *ψ* ∈ F, Q ∈ M₁⁺(H), *t* > 0. We have, for *h* ∈ H:

$$\begin{aligned} &R(h) - R\_{\psi,t}(h) \\ &= \mathbb{E}\_{Z \sim \mu} \left[ \ell(h, Z) - t\psi\left(\frac{\ell(h, Z)}{t}\right) \right], \end{aligned}$$

and using that ∀*x* ∈ [0, 1], *ψ*(*x*) = *x*,

$$=\mathbb{E}\_{Z\sim\mu}\left[\left(\ell(h,Z)-t\psi\left(\frac{\ell(h,Z)}{t}\right)\right)\mathbb{1}\{\ell(h,Z)\geq t\}\right]$$

while using that ℓ(h, z) ≤ K(h),

$$=\mathbb{E}\_{Z\sim\mu}\left[\left(\ell(h,Z)-t\psi\left(\frac{\ell(h,Z)}{t}\right)\right)\mathbb{1}\{\ell(h,Z)\geq t\}\mathbb{1}\{K(h)\geq t\}\right]$$

and continuing:

$$\le \mathbb{E}\_{Z\sim\mu}[\ell(h, Z)\,\mathbb{1}\{\ell(h, Z) \geq t\}]\,\mathbb{1}\{K(h) \geq t\} \qquad (\psi \geq 0)$$

$$\le K(h)\,\mathbb{P}\_{Z\sim\mu}\{\ell(h, Z) \geq t\}\,\mathbb{1}\{K(h) \geq t\} \qquad (\ell(h, Z) \leq K(h)).$$

Finally, by crudely bounding the probability by 1, we get:

$$R(h) \le R\_{\psi,t}(h) + K(h)\,\mathbb{1}\{K(h) \ge t\}.$$

Hence the result by integrating over H with respect to *Q*.

Finally we present the following theorem, which provides a PAC-Bayesian inequality bounding the theoretical risk by the empirical *ψ*-risk for *ψ* ∈ F.

**Theorem 4.** *Let ℓ be* HYPE(*K*) *compliant, and assume K satisfies Equation (3). Then for any P* ∈ M₁⁺(H) *with no data dependency, for any α* ∈ R*, for any ψ* ∈ F*, any t* > 0*, and any δ* ∈ [0; 1]*, with probability of at least* 1 − *δ over size-m random samples S, simultaneously for all Q such that Q* ∼ *P we have:*

$$\begin{split} \mathbb{E}\_{h \sim Q}[R(h)] &\leq \mathbb{E}\_{h \sim Q}\left[R\_{S, \psi, t}(h)\right] + \mathbb{E}\_{h \sim Q}[K(h) \mathbb{1}\{K(h) \geq t\}] \\ &+ \frac{\mathrm{KL}(Q || P) + \log\left(\frac{1}{\delta}\right)}{m^{\alpha}} \\ &+ \frac{1}{m^{\alpha}} \log\left(\mathbb{E}\_{h \sim P}\left[\exp\left(\frac{t^{2}}{2m^{1-2\alpha}} \psi\left(\frac{K(h)}{t}\right)^{2}\right)\right]\right). \end{split}$$

**Proof.** Let *ψ* ∈ F; we define the *ψ*-loss:

$$\ell\_2(h, z) = t\psi\left(\frac{\ell(h, z)}{t}\right).$$

Since *ψ* is non-decreasing, we have for all (h, z) ∈ H × Z:

$$\ell\_2(h, z) \le t \psi\left(\frac{K(h)}{t}\right) := K\_2(h).$$

Thus, we apply Theorem 3 to the learning problem defined with ℓ₂: for any *α* and *δ* ∈ (0, 1), with probability at least 1 − *δ* over size-*m* random samples *S*, simultaneously for all *Q* such that *Q* ∼ *P* we have:

$$\mathbb{E}\_{h\sim Q}\left[R\_{\psi,t}(h)\right] \le \mathbb{E}\_{h\sim Q}\left[R\_{S,\psi,t}(h)\right] + \frac{\mathrm{KL}(Q||P) + \log\frac{1}{\delta}}{m^{\alpha}} + \frac{1}{m^{\alpha}}\log\left(\mathbb{E}\_{h\sim P}\left[\exp\left(\frac{K\_2(h)^2}{2m^{1-2\alpha}}\right)\right]\right).$$

We then add E_{h∼Q}[K(h)1{K(h) ≥ t}] to both sides of the latter inequality and apply Lemma 2.

**Remark 6.** *Notice that the function ψ* : *x* ↦ *x*1{*x* ≤ 1} + 1{*x* > 1} *is such that for any given prior P we have* E_{h∼P}[exp((t²/(2m^{1−2α})) ψ(K(h)/t)²)] < +∞*. So the exponential moment can be controlled with a good choice of ψ. Thus, the strength of Theorem 4 is to provide a PAC-Bayesian bound valid for any set of posterior measures verifying Equation (3). The choice of the ψ minimising the bound is still an open problem.*

#### **5. The Linear Regression Problem**

#### *5.1. Theoretical Result*

We now focus on the celebrated linear regression problem and see how our theory translates to that particular learning problem. We assume that the data is a size-*m* random sample *S* = (*Zi*)*i*=1..*<sup>m</sup>* where the *Zi* are i.i.d. drawn from the distribution *μ*, and *Zi* = (*Xi*,*Yi*) with *Xi* <sup>∈</sup> <sup>R</sup>*N*, *Yi* <sup>∈</sup> <sup>R</sup>.

Our goal here is to find the most accurate predictor h_w (with w ∈ R^N) with respect to the loss function ℓ(h_w, z) = |⟨w, x⟩ − y|, where z = (x, y). We make the following mild assumption: there exist B, C ∈ R⁺\{0} such that for all z = (x, y) drawn under *μ*:

$$||x|| \le B \quad \text{and} \quad |y| \le C$$

where ||.|| is the norm associated to the classical inner product of <sup>R</sup>*N*. Under this assumption we note that for all *z* = (*x*, *y*) drawn according to *μ*, we have:

$$\ell(h\_w, z) = |\langle w, x \rangle - y| \le |\langle w, x \rangle| + |y| \le ||w|| \cdot ||x|| + |y| \le B||w|| + C.$$

Thus we define K(h_w) = B||w|| + C for w ∈ R^N. If we first restrict ourselves to the framework of Section 2, we want to use Theorem 3; in doing so, our goal is to bound ξ := E_{w∼P}[exp(K(w)²/(2m^{1−2α}))]. The shape of *K* invites us to consider a Gaussian prior. Indeed, we notice that if P = N(0, σ²I_N) with 0 < σ² < m^{1−2α}/B², then ξ < +∞. Notice that we cannot take just any Gaussian prior; however, with a small *α*, the condition 0 < σ² < m^{1−2α}/B² may become quite loose. Thus, we have the following:

**Theorem 5.** *Let α* ∈ R *and N* ≥ 6*. Assume that the loss is* HYPE(*K*) *compliant with K(h) = B||h|| + C, with B > 0, C ≥ 0. For a prior distribution, consider any Gaussian P = N(0, σ²I_N) with σ² = t m^{1−2α}/B², 0 < t < 1. Then, for any δ* ∈ [0; 1]*, with probability of at least* 1 − *δ over size-m random samples S, simultaneously for all Q* ∈ M₁⁺(H) *such that Q* ∼ *P we have:*

$$\begin{split} \mathbb{E}\_{h \sim Q}[R(h)] &\leq \mathbb{E}\_{h \sim Q}[R\_S(h)] + \frac{\mathrm{KL}(Q || P) + \log(2/\delta)}{m^{\alpha}} + \frac{C^2}{2m^{1-\alpha}}\left(1 + f(t)^{-1}\right) \\ &+ \frac{N}{m^{\alpha}}\left(\log\left(1 + \frac{C}{\sqrt{2f(t)m^{1-2\alpha}}}\right) + \log\left(\frac{1}{\sqrt{1-t}}\right)\right) \end{split}$$

*where f*(*t*) = (1 − *t*)/*t.*

The proof is deferred to Section 7.2. To compare our result with those found in the literature, we can fix *α* = 1/2. In doing so, we lose the dependency on *m* in the choice of the prior variance (which now only depends on *B*), but we recover the classic decreasing factor 1/√m.
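The role of *t* in Theorem 5 can be explored numerically: the sketch below evaluates only the *t*-dependent slack terms of the bound (the KL term, which pulls in the opposite direction through σ² = t m^{1−2α}/B², is deliberately omitted). The parameter values and function names are illustrative choices of ours.

```python
# Sketch of the t-dependent slack terms of Theorem 5:
#   C^2/(2 m^(1-a)) * (1 + 1/f(t))
#   + (N/m^a) * ( log(1 + C/sqrt(2 f(t) m^(1-2a))) + log(1/sqrt(1-t)) ),
# with f(t) = (1-t)/t. Both terms grow as t -> 1.
import numpy as np

def f(t):
    return (1.0 - t) / t

def slack(t, m=1000, alpha=0.5, N=10, C=1.0):
    a = C**2 / (2 * m**(1 - alpha)) * (1.0 + 1.0 / f(t))
    b = (N / m**alpha) * (
        np.log(1.0 + C / np.sqrt(2 * f(t) * m**(1 - 2 * alpha)))
        + np.log(1.0 / np.sqrt(1.0 - t))
    )
    return a + b

ts = np.linspace(0.1, 0.9, 9)
vals = [slack(t) for t in ts]
```

This makes the trade-off explicit: a small *t* shrinks these slack terms but forces a small prior variance, which tends to inflate KL(Q||P).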

**Remark 7.** *Notice that so far we did not use Section 4, even though we could (because K is polynomial in* ||*w*|| *and we consider Gaussian priors and posteriors, so Equation (3) is satisfied). Doing so, we obtained a bound which appears to depend linearly on the dimension N. In practice, N may be too big, and in this case, introducing an adapted softening function ψ (one can think for instance of ψ(x) = x*1{*x* ≤ 1} + 1{*x* > 1}*) is a powerful tool to attenuate the weight of the exponential moment. This also extends the class of authorised Gaussian priors, by avoiding the need to stick with a variance σ² = t m^{1−2α}/B², 0 < t < 1.*

#### *5.2. Numerical Experiment*

#### 5.2.1. Setting

In this section we apply Theorem 5 to a concrete linear regression problem. The situation is as follows: we want to approximate the function f(x) = √|⟨w∗, x⟩|, where w∗ ∈ R^d. We assume that W = [−c, c]^d, so that w∗ lies in a hypercube centred at 0 of half-side *c* > 0, i.e., the set {(w_i)_{i=1,...,d} | ∀i, |w_i| ≤ c}. Doing so we have ||w∗|| ≤ c√d.

Furthermore, we assume that input data are drawn inside a hypercube of half-side *e* > 0, i.e., X = [−*e*,*e*] *<sup>d</sup>*. Doing so we have for any data *<sup>x</sup>*, ||*x*|| ≤ *<sup>e</sup>* <sup>√</sup>*d*.

For any data point x ∈ R^d, we define y = f(x). As before, we identify the hypothesis set H with the weight space W = R^d. As described in Section 5.1, we set ℓ(h_w, (x, y)) = |⟨w, x⟩ − y|. We then remark that for any (w, x, y):

$$\begin{aligned} \ell(h\_w, (x, y)) &\leq |\langle w, x \rangle| + |y| \leq ||w|| \cdot ||x|| + \sqrt{|\langle w^\*, x \rangle|} \\ &\leq e\sqrt{d}\,||w|| + \sqrt{||w^\*|| \cdot ||x||} \leq e\sqrt{d}\,||w|| + \sqrt{c\sqrt{d}\,e\sqrt{d}} \\ &= e\sqrt{d}\,||w|| + \sqrt{cde}. \end{aligned}$$

Then we can define B = e√d and C = √(cde) to apply Theorem 5. We restrict (as before) the class of distributions over W to *d*-dimensional Gaussians:

$$\left\{ \mathcal{N}(w, \sigma^2 I\_d) \mid w \in \mathcal{H}, \sigma^2 \in \mathbb{R}^+ \right\},$$

which is the set of candidate distributions for this learning problem. Recall that in practice, given a fixed α ∈ R, we are only allowed to consider priors whose variance satisfies σ² ∈ (0, m^{1−2α}/B²). We want to learn an optimised predictor (posterior) given a random dataset S = ((X_i, Y_i))_{i=1,...,m}. To do so, we consider synthetic data.
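The envelope constants B = e√d and C = √(cde) derived above are easy to check numerically on random draws. The sampling scheme and names below are our own; only the inequality ℓ(h_w, (x, y)) ≤ B||w|| + C comes from the derivation.

```python
# Numerical check of the loss envelope: for x in [-e, e]^d and
# y = sqrt(|<w*, x>|), the loss |<w, x> - y| never exceeds B*||w|| + C
# with B = e*sqrt(d) and C = sqrt(c*d*e).
import numpy as np

d, c, e = 10, 10.0, 10.0
B = e * np.sqrt(d)
C = np.sqrt(c * d * e)

rng = np.random.default_rng(2)
w_star = rng.uniform(-c, c, size=d)        # w* in the hypercube [-c, c]^d
X = rng.uniform(-e, e, size=(1000, d))     # inputs in [-e, e]^d
y = np.sqrt(np.abs(X @ w_star))            # targets y = sqrt(|<w*, x>|)
w = rng.uniform(-c, c, size=d)             # an arbitrary predictor

losses = np.abs(X @ w - y)
```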

#### 5.2.2. Synthetic Data

We draw w∗ under a Gaussian (with mean 0 and standard deviation equal to 5) truncated to the hypercube centred at 0 of half-side *c* > 0. We generate synthetic data according to the following process: for a fixed sample size *m*, we draw X_1, ..., X_m under a Gaussian (with mean 0 and standard deviation equal to 5) truncated to the hypercube centred at 0 of half-side *e* > 0.

#### 5.2.3. Experiment

First, we fix *c* = *e* = 10. Our goal here is to obtain a generalisation bound for our problem. For a fixed α ∈ R, we arbitrarily fix t₀ = 1/2 and σ₀² = t₀ m^{1−2α}/B², and we define our *naive prior* as P₀ = N(0, σ₀² I_d). For a given dataset *S*, we define our posterior as Q(S) := N(*w*ˆ(*S*), σ² I_d), with σ² ∈ {σ₀²/2, ..., σ₀²/2^J} (J = log₂(m)) chosen to minimise the bound among candidates. Note that all the previously defined parameters depend on *α*, which is why we choose α ∈ {i/step | 0 ≤ i ≤ step} for step a fixed integer (in practice step = 8 or 16) and take the value of *α* minimising the bound among these candidates as well. Figure 2 contains two plots, one with *d* = 10, the other with *d* = 50. Each plot shows the right-hand side of Theorem 5 with an optimised *α*.

#### 5.2.4. Discussion

To the best of our knowledge, this is the first attempt to numerically compute PAC-Bayes bounds for unbounded problems, which makes a comparison with other results impossible. We stress, however, that obtaining numerical values for the bound without assuming a bounded loss is a significant first step. Furthermore, we consider a rather hard problem: *f* is not linear, so we cannot rely on a linear approximation fitting the data perfectly, and the larger the dimension, the larger the error, as illustrated by Figure 2. Thus, for any posterior *Q*, the quantity E_{h∼Q}[R(h)] is potentially large in practice and our bound might not be tight. Finally, notice that optimising *α* (instead of taking *α* = 1/2 to recover a classic convergence rate) leads to a significantly better bound. A numerical example of this assertion is presented in Section 3.2. We aim to conduct further studies treating the convergence rate as a hyperparameter to optimise, rather than selecting the same rate for all terms in the bound.

**Figure 2.** Evaluation of the right-hand side in Theorem 5 with *d* = 10 and *d* = 50.

## **6. Existing Work**

*6.1. Germain et al., 2016*

In Germain et al. [11] (Section 4), a PAC-Bayesian bound has been provided for all *sub-gamma* losses with a variance t² and scale parameter *c* > 0, under a data distribution *μ* and a prior *P*, i.e., losses such that for every λ ∈ (0, 1/c) the following is satisfied:

$$\log\left(\frac{1}{\delta}\mathbb{E}\_{h\sim P}\mathbb{E}\_{\mathcal{S}}\,e^{\lambda\left(R(h)-R\_{\mathcal{S}}(h)\right)}\right) \leq \frac{t^2}{c^2}(-\log(1-c\lambda)-\lambda c) \leq \frac{\lambda^2 t^2}{2(1-c\lambda)}.$$

Note that a sub-gamma loss (with regards to *μ* and *P*) is potentially unbounded. Germain et al. then propose the following PAC-Bayesian bound:

**Theorem 6.** *Ref. [11]. If the loss is sub-gamma with a variance t² and scale parameter c, under the data distribution μ and a fixed prior P* ∈ M₁⁺(H)*, then for any δ* ∈ [0; 1]*, with probability* 1 − *δ over size-m random samples, simultaneously for all Q such that Q* ∼ *P we have:*

$$\mathbb{E}\_{h \sim Q}[R(h)] \le \mathbb{E}\_{h \sim Q}[R\_S(h)] + \frac{\text{KL}(Q||P) + \log(1/\delta)}{m} + \frac{t^2}{2(1-c)}$$


Theorem 6 will be quoted several times in this paper, as it is a concrete PAC-Bayesian bound designed precisely to overcome the constraint of a bounded loss. It is also one of the few such bounds found in the literature.

Can we apply this theorem to the bounded case? The answer is yes: we remark that, thanks to Hoeffding's lemma, if ℓ is bounded by *C* > 0, then for any *h* ∈ H it holds that R_S(h) − R(h) ∈ [−C, C] almost surely. So, ∀λ ∈ R, log E_{z∼μ}[e^{λ(R(h)−R_S(h))}] ≤ λ²C²/2. Therefore, for any prior *P*, we have:

$$\log \mathbb{E}\_{h \sim P} \mathbb{E}\_{z \sim \mu} \left[ \mathcal{e}^{\lambda(R(h) - R\_S(h))} \right] \le \frac{\lambda^2 C^2}{2}.$$

Thus, ℓ is sub-gamma with variance C² and scale parameter 0, and Theorem 6 can be applied with t² = C², c = 0.

**Comparison with Proposition 1.** We remark that by taking K = C and α = 1 in Proposition 1, we recover Theorem 6. However, our approach allows us to say that if we can obtain a more precise form of *K* such that ∀h ∈ H, K(h) ≤ C with *K* non-constant, then Theorem 3 will ensure that:

$$\frac{1}{m^{\alpha}}\log\left(\mathbb{E}\_{h\sim P}\left[\exp\left(\frac{K(h)^2}{2m^{1-2\alpha}}\right)\right]\right) \le \frac{C^2}{2m^{1-\alpha}}.$$

Thus, having precise information on the behaviour of the loss function ℓ with regard to the predictor *h* allows us to obtain a tighter control of the exponential moment, and hence a tighter bound.
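A Monte Carlo illustration of this comparison: for any non-constant K bounded by C, the exponential-moment term is no larger than the crude C²/(2m^{1−α}). The particular bounded K used below (a tanh of the norm, under a Gaussian prior) is an arbitrary example of ours, chosen only so that K(h) < C everywhere.

```python
# Monte-Carlo illustration: with a non-constant K(h) <= C, the
# exponential-moment term (1/m^a) * log E_P[exp(K(h)^2 / (2 m^(1-2a)))]
# of Theorem 3 stays below the crude bound C^2/(2 m^(1-a)).
import numpy as np

m, alpha, C = 1000, 0.5, 2.0
rng = np.random.default_rng(3)
h = rng.standard_normal((100000, 5))           # samples from a Gaussian prior
K = C * np.tanh(np.linalg.norm(h, axis=1))     # an example K with 0 < K(h) < C

moment_term = np.log(np.mean(np.exp(K**2 / (2 * m**(1 - 2 * alpha))))) / m**alpha
crude_bound = C**2 / (2 * m**(1 - alpha))
```

Since every sample satisfies exp(K²/(2m^{1−2α})) < exp(C²/(2m^{1−2α})), the comparison holds for any sample size, not just in expectation.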

**Remark 8.** *We can see that Theorem 6 cannot control the factor C²/2. However, Ref. [11] remarked on this apparent weakness and partially corrected the issue [11] (Section 4, Equations (13) and (14)). Indeed, they proposed to balance the influence of m between the different terms of the PAC-Bayes bound by imposing the same convergence rate of* 1/√m *on all terms.*

*We can then see Proposition 1 as a proper generalisation of Germain et al. [11] (Section 4, Equations (13) and (14)). Indeed, our bound properly exhibits the influence of the parameter α. Thus, we understand (and Lemma 1 proves it) that the choice of α deserves a study in itself, since it is now a parameter of our optimisation problem. This fact has already been highlighted in Alquier et al. [10] (Theorem 4.1) (where λ* := *m^α).*

## *6.2. Holland, 2019*

In [13], Holland proposed a PAC-Bayesian inequality with unbounded loss. For that, he introduced a function *ψ* verifying a few specific conditions, different from those used in Section 4 to define our set of softening functions. Indeed, he considered a function *ψ* such that:


$$-\log\left(1-u+\frac{u^2}{b}\right) \le \psi(u) \le \log\left(1+u+\frac{u^2}{b}\right).\tag{4}$$

We remark that, as Holland did, we assume that our softening functions are non-decreasing. We chose softening functions to be equal to the identity function (x ↦ x) on [0, 1], which is quite restrictive. However, we only impose softening functions to be less than the identity on [1, +∞), whereas Holland supposed *ψ* to be bounded and to satisfy Equation (4). A concrete example of such a function *ψ* lies in the piecewise polynomial function of Catoni and Giulini [21], defined by:

$$\psi(u) = \begin{cases} -2\sqrt{2}/3 & \text{if } u \le -\sqrt{2} \\ u - u^3/6 & \text{if } u \in [-\sqrt{2}, \sqrt{2}] \\ 2\sqrt{2}/3 & \text{otherwise.} \end{cases}$$
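This function is easy to implement and sanity-check: at u = ±√2 the middle branch gives u − u³/6 = ±2√2/3, so the three pieces join continuously, and the derivative 1 − u²/2 is nonnegative on [−√2, √2], so ψ is non-decreasing. The implementation below is our own sketch of that definition.

```python
# The Catoni-Giulini softening function: cubic u - u^3/6 on [-sqrt(2), sqrt(2)],
# clipped to the constants -2*sqrt(2)/3 and 2*sqrt(2)/3 outside.
import numpy as np

SQRT2 = np.sqrt(2.0)

def psi(u):
    u = np.asarray(u, dtype=float)
    out = u - u**3 / 6.0                              # middle branch
    out = np.where(u <= -SQRT2, -2.0 * SQRT2 / 3.0, out)
    out = np.where(u >= SQRT2, 2.0 * SQRT2 / 3.0, out)
    return out
```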

As in Section 4, we consider the empirical *ψ*-risk R_{S,ψ,t} for any *t* > 0. Holland stated his theorem under the following assumptions:


**Theorem 7.** *Ref. [13]. Let P be a prior distribution on the model* H*. Let the three assumptions listed above hold. Setting t² = mM₂/(2 log(δ⁻¹)), then for any δ* ∈ [0; 1]*, with probability of at least* 1 − *δ over the random draw of the size-m sample S, simultaneously for all Q it holds that:*

$$\begin{split} \mathbb{E}\_{h \sim Q}[R(h)] &\leq \mathbb{E}\_{h \sim Q} \left[ R\_{S, \psi, t}(h) \right] + \frac{1}{\sqrt{m}} \left( \mathrm{KL}(Q || P) + \frac{1}{2} \log \left( \frac{8 \pi M\_2}{\delta^2} \right) - 1 \right) \\ &+ \frac{1}{\sqrt{m}} \nu^\*(\mathcal{H}) + O\left(\frac{1}{m}\right) \end{split}$$

*where:*

$$\nu^\*(\mathcal{H}) := \frac{\mathbb{E}\_{\hbar \sim P} \Big[ \exp \left( \sqrt{m} (R(\hbar) - R\_{S, \psi, t}(\hbar)) \right) \Big]}{\mathbb{E}\_{\hbar \sim P} \Big[ \exp \left( R(\hbar) - R\_{S, \psi, t}(\hbar) \right) \Big]}.$$

## **7. Proofs**

*7.1. Proof of Theorem 1*

**Proof.** Let F : R⁺ × R⁺ → R be a convex function, *P* a fixed prior, and δ ∈ [0, 1]. Since E_{h∼P}[e^{F(R_S(h),R(h))}] is a nonnegative random variable, Markov's inequality gives:

$$\mathbb{P}\left(\mathbb{E}\_{\mathrm{h}\sim P}\left[e^{F(\mathcal{R}\_{\mathrm{S}}(h),\mathcal{R}(h))}\right] > \frac{1}{\delta}\mathbb{E}\_{\mathrm{S}}\,\mathbb{E}\_{\mathrm{h}\sim P}\left[e^{F(\mathcal{R}\_{\mathrm{S}}(h),\mathcal{R}(h))}\right]\right) \leq \delta.$$

So with probability of at least 1 − *δ*, we have:

$$\mathbb{E}\_{h\sim P}\left[e^{F(R\_S(h),R(h))}\right] \leq \frac{1}{\delta}\,\mathbb{E}\_{S}\,\mathbb{E}\_{h\sim P}\left[e^{F(R\_S(h),R(h))}\right] = \frac{\mathcal{X}}{\delta}.$$

Applying the log function on each side of this inequality gives us with probability of at least 1 − *δ* over samples *S*:

$$\log\left(\mathbb{E}\_{h\sim P}\left[e^{F(R\_S(h),R(h))}\right]\right) \le \log\left(\frac{\mathcal{X}}{\delta}\right).$$

We now rename A := log(E_{h∼P}[e^{F(R_S(h),R(h))}]).

Furthermore, if we denote by dQ/dP the Radon–Nikodym derivative of *Q* with respect to *P* when Q ≪ P, we then have, for all *Q* such that Q ∼ P:

$$\begin{split} A &= \log \left( \mathbb{E}\_{h \sim Q} \left[ \frac{dP}{dQ} e^{F(R\_S(h), R(h))} \right] \right) \\ &= \log \left( \mathbb{E}\_{h \sim Q} \left[ \left( \frac{dQ}{dP} \right)^{-1} e^{F(R\_S(h), R(h))} \right] \right) \end{split} \tag{4}$$

and by concavity of log and Jensen's inequality,

$$\begin{aligned} &\geq -\mathbb{E}\_{h\sim Q} \Big[ \log \left( \frac{dQ}{dP} \right) \Big] + \mathbb{E}\_{h\sim Q} [F(R\_S(h), R(h))] \\ &= -\operatorname{KL}(Q||P) + \mathbb{E}\_{h\sim Q} [F(R\_S(h), R(h))] \end{aligned}$$

while by convexity of *F* with Jensen's inequality,

$$\geq -\mathrm{KL}(Q||P) + F\left(\mathbb{E}\_{h \sim Q}[R\_S(h)], \mathbb{E}\_{h \sim Q}[R(h)]\right).$$

Hence, for *Q* such that *Q* ∼ *P*,

$$F\left(\mathbb{E}\_{h\sim Q}[R\_S(h)], \mathbb{E}\_{h\sim Q}[R(h)]\right) \le \text{KL}(Q||P) + A.$$

So with probability 1 − *δ*, for *Q* such that *Q* ∼ *P*,

$$F\left(\mathbb{E}\_{h\sim\mathcal{Q}}[R\_S(h)], \mathbb{E}\_{h\sim\mathcal{Q}}[R(h)]\right) \leq \mathsf{KL}(Q||P) + \log\left(\frac{\mathcal{X}}{\delta}\right).$$

This completes the proof of Theorem 1.

## *7.2. Proof of Theorem 5*

We first provide a technical property. Recall that:

$$\xi = \mathbb{E}\_{h \sim P} \left[ \exp \left( \frac{K(h)^2}{2m^{1-2\alpha}} \right) \right].$$

**Proposition 3.** *Let α* ∈ R*. Suppose the loss is* HYPE(*K*) *compliant with K(h) = B||h|| + C, with B > 0, C ≥ 0. Then, for any Gaussian prior P = N(0, σ²I_N) with σ² = t m^{1−2α}/B², 0 < t < 1, and N* ≥ 6*, we have:*

$$\xi \le 2 \exp\left(\frac{C^2}{2m^{1-2\alpha}f(t)}(1+f(t))\right) \frac{1}{\left(\sqrt{1-t}\right)^N} \left(1+\frac{C}{\sqrt{2f(t)m^{1-2\alpha}}}\right)^{N-1}$$

*with $f(t) = \frac{1-t}{t}$.*

**Proof.** We recall that $\sigma^2 = \frac{t\,m^{1-2\alpha}}{B^2}$. Writing out the expectation and $K(h)$ explicitly, we obtain:

$$\begin{split} \xi &= \left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)^{N} \int\_{h\in\mathbb{R}^{N}} \exp\left(\frac{(B||h||+C)^{2}}{2m^{1-2\alpha}} - \frac{||h||^{2}B^{2}}{2tm^{1-2\alpha}}\right) dh \\ &= \left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)^{N} \int\_{h\in\mathbb{R}^{N}} \exp\left(-\frac{1}{2m^{1-2\alpha}} \left(f(t)B^{2}||h||^{2} - 2BC||h|| - C^{2}\right)\right) dh \\ &= \left(\frac{1}{\sqrt{2\pi\sigma^{2}}}\right)^{N} \int\_{h\in\mathbb{R}^{N}} \exp\left(-\frac{B^{2}f(t)}{2m^{1-2\alpha}} \left(||h||^{2} - \frac{2C||h||}{Bf(t)} - \frac{C^{2}}{B^{2}f(t)}\right)\right) dh \\ &= \exp\left(\frac{C^{2}}{2m^{1-2\alpha}f(t)}(1+f(t))\right) \frac{1}{(\sqrt{2\pi\sigma^{2}})^{N}} \int\_{h\in\mathbb{R}^{N}} \exp\left(-\frac{B^{2}f(t)}{2m^{1-2\alpha}} \left(||h|| - \frac{C}{Bf(t)}\right)^{2}\right) dh. \end{split}$$

We will use the spherical coordinates in *N*-dimensional Euclidean space given in [22]:

$$
\varphi: (h\_1, \dots, h\_N) \to (r, \varphi\_1, \dots, \varphi\_{N-1}),
$$

where in particular $r = ||h||$, and the Jacobian of $\varphi$ gives the volume element:

$$d^N V = r^{N-1} \prod\_{k=1}^{N-2} \sin^k(\varphi\_{N-1-k}) = r^{N-1} d\_{S^{N-1}} V.$$

Let us also note that, as given in Blumenson [22] (page 66), the surface area of the sphere of radius 1 in $N$-dimensional space is:

$$\int\_{\varphi\_1,\dots,\varphi\_{N-1}} d\_{S^{N-1}} V \, d\varphi\_1 \dots d\varphi\_{N-1} = \frac{2\sqrt{\pi}^N}{\Gamma\left(\frac{N}{2}\right)}$$

where Γ is the Gamma function defined as:

$$
\Gamma(x) = \int\_0^{+\infty} t^{x-1} e^{-t}\, dt \quad \text{for } x > 0.
$$

Then, if we set:

$$A := \int\_{h \in \mathbb{R}^N} \exp\left(-\frac{B^2 f(t)}{2m^{1-2\alpha}} \left(||h|| - \frac{C}{B f(t)}\right)^2\right) dh$$

we obtain by a change of variable:

$$\begin{split} A &= \int\_{r,\varphi\_{1},\ldots,\varphi\_{N-1}} \exp\left(-\frac{B^{2}f(t)}{2m^{1-2\alpha}} \left(r-\frac{C}{Bf(t)}\right)^{2}\right) d^{N}V \, dr\, d\varphi\_{1} \ldots d\varphi\_{N-1} \\ &= \left(\frac{2\sqrt{\pi}^{N}}{\Gamma\left(\frac{N}{2}\right)}\right) \int\_{r=0}^{+\infty} \exp\left(-\frac{B^{2}f(t)}{2m^{1-2\alpha}} \left(r-\frac{C}{Bf(t)}\right)^{2}\right) r^{N-1}\, dr \\ &= \left(\frac{2\sqrt{\pi}^{N}}{\Gamma\left(\frac{N}{2}\right)}\right) \int\_{r=-\frac{C}{Bf(t)}}^{+\infty} \left(r+\frac{C}{Bf(t)}\right)^{N-1} \exp\left(-\frac{B^{2}f(t)}{2m^{1-2\alpha}}r^{2}\right) dr \\ &= \left(\frac{2\sqrt{\pi}^{N}}{\Gamma\left(\frac{N}{2}\right)}\right) \sum\_{k=0}^{N-1} \binom{N-1}{k} \left(\frac{C}{Bf(t)}\right)^{N-k-1} \int\_{r=-\frac{C}{Bf(t)}}^{+\infty} r^{k} \exp\left(-\frac{B^{2}f(t)}{2m^{1-2\alpha}}r^{2}\right) dr. \end{split}$$

We fix a random variable *X* such that:

$$X \sim \mathcal{N}\left(0, \frac{m^{1-2\alpha}}{B^2 f(t)}\right).$$

We then have, for any positive integer *k*: if *k* is even,

$$\begin{split} \int\_{r=-\frac{C}{Bf(t)}}^{+\infty} r^k \exp\left(-\frac{B^2f(t)}{2m^{1-2\alpha}}r^2\right) dr &\leq \int\_{r=-\infty}^{+\infty} r^k \exp\left(-\frac{B^2f(t)}{2m^{1-2\alpha}}r^2\right) dr\\ &\leq \sqrt{2\pi \frac{m^{1-2\alpha}}{B^2f(t)}} \mathbb{E}[|X|^k]. \end{split}$$

And if *k* is odd:

$$\begin{split} \int\_{r=-\frac{C}{Bf(t)}}^{+\infty} r^k \exp\left(-\frac{B^2f(t)}{2m^{1-2\alpha}}r^2\right) dr &\leq \int\_{r=0}^{+\infty} r^k \exp\left(-\frac{B^2f(t)}{2m^{1-2\alpha}}r^2\right) dr\\ &\leq \sqrt{2\pi \frac{m^{1-2\alpha}}{B^2f(t)}} \mathbb{E}[|X|^k \mathbb{1}(X \geq 0)] \\ &\leq \sqrt{2\pi \frac{m^{1-2\alpha}}{B^2f(t)}} \mathbb{E}[|X|^k]. \end{split}$$

So we have:

$$A \le \left(\frac{2\sqrt{\pi}^N}{\Gamma\left(\frac{N}{2}\right)}\right) \sum\_{k=0}^{N-1} \binom{N-1}{k} \left(\frac{C}{Bf(t)}\right)^{N-k-1} \sqrt{2\pi \frac{m^{1-2\alpha}}{B^2f(t)}} \mathbb{E}[|X|^k].$$

As noted in [23], we have for any *k*:

$$\mathbb{E}[|X|^k] = \left(\sqrt{\frac{m^{1-2\alpha}}{B^2 f(t)}}\right)^k 2^{k/2} \frac{\Gamma\left(\frac{k+1}{2}\right)}{\sqrt{\pi}}.$$
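As a quick sanity check of this absolute-moment formula (an illustration we add here, with an arbitrary variance), it can be compared against the classical closed forms for low-order Gaussian moments:

```python
import math

def abs_moment(k: int, s2: float) -> float:
    """E|X|^k for X ~ N(0, s2), via the Gamma-function formula above."""
    return math.sqrt(s2) ** k * 2 ** (k / 2) * math.gamma((k + 1) / 2) / math.sqrt(math.pi)

s2 = 2.5  # arbitrary illustrative variance
assert math.isclose(abs_moment(1, s2), math.sqrt(2 * s2 / math.pi))  # E|X| = sqrt(2 s2 / pi)
assert math.isclose(abs_moment(2, s2), s2)                           # E[X^2] = Var(X)
assert math.isclose(abs_moment(4, s2), 3 * s2 ** 2)                  # E[X^4] = 3 s2^2
```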

So finally:

$$A \le 2\sqrt{\pi}^N \sum\_{k=0}^{N-1} \binom{N-1}{k} \left(\frac{C}{Bf(t)}\right)^{N-k-1} \left(\sqrt{\frac{2m^{1-2\alpha}}{B^2f(t)}}\right)^{k+1} \frac{\Gamma\left(\frac{k+1}{2}\right)}{\Gamma\left(\frac{N}{2}\right)}.$$

**Lemma 3.** *If N* ≥ 6*, then:*

$$\max\_{k=0\ldots N-1} \frac{\Gamma\left(\frac{k+1}{2}\right)}{\Gamma\left(\frac{N}{2}\right)} = 1.$$

**Proof.** As noted in the introduction of Srinivasan and Zvengrowski [24], Gauss [25] (page 147) proved that $\Gamma$ is monotonically increasing on the interval $[x\_0, +\infty)$, where $x\_0 \in [1.46, 1.47]$. So, for $2 \leq k \leq N-1$, $\Gamma\left(\frac{k+1}{2}\right) \leq \Gamma\left(\frac{N}{2}\right)$. And because $\Gamma(1/2) = \sqrt{\pi}$ and $\Gamma(1) = 1$, we have:

$$\max\_{k=0\ldots N-1} \frac{\Gamma\left(\frac{k+1}{2}\right)}{\Gamma\left(\frac{N}{2}\right)} = \max\left(\frac{\sqrt{\pi}}{\Gamma\left(\frac{N}{2}\right)}, \frac{\Gamma\left(\frac{N-1+1}{2}\right)}{\Gamma\left(\frac{N}{2}\right)}\right) = \max\left(\frac{\sqrt{\pi}}{\Gamma\left(\frac{N}{2}\right)}, 1\right)$$

Because $N \geq 6$, and $\Gamma$ is monotone increasing on $[3, +\infty)$, we have $\Gamma(N/2) \geq \Gamma(3) = 2 \geq \sqrt{\pi}$. Hence the result.
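Lemma 3 is also easy to confirm numerically (an illustrative check we add here, using the standard library's `math.gamma`):

```python
import math

# For N >= 6, max_{k=0..N-1} Gamma((k+1)/2) / Gamma(N/2) = 1 (attained at k = N-1).
for N in range(6, 40):
    ratios = [math.gamma((k + 1) / 2) / math.gamma(N / 2) for k in range(N)]
    assert math.isclose(max(ratios), 1.0)

# The assumption N >= 6 matters: for N = 3 the maximum exceeds 1,
# since Gamma(1/2) = sqrt(pi) > Gamma(3/2).
assert max(math.gamma((k + 1) / 2) / math.gamma(3 / 2) for k in range(3)) > 1.0
```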

Using Lemma 3 allows us to write:

$$A \le 2\sqrt{\pi}^N \sum\_{k=0}^{N-1} \binom{N-1}{k} \left(\frac{C}{Bf(t)}\right)^{N-k-1} \left(\sqrt{\frac{2m^{1-2\alpha}}{B^2f(t)}}\right)^{k+1}.$$

We recall that $\sigma^2 = \frac{t\,m^{1-2\alpha}}{B^2}$ and $f(t) = \frac{1-t}{t}$. Then we can write:

$$A \le 2\sqrt{\pi}^N \sum\_{k=0}^{N-1} \binom{N-1}{k} \left(\frac{C}{Bf(t)}\right)^{N-k-1} \left(\sqrt{\frac{2\sigma^2}{1-t}}\right)^{k+1}.$$

We now conclude with the final bound on *ξ*:

$$\begin{split} \xi &\le \exp\left(\frac{C^2}{2m^{1-2\alpha}f(t)}(1+f(t))\right) \frac{1}{(\sqrt{2\pi\sigma^2})^N}\, A \\ &\le \exp\left(\frac{C^2}{2m^{1-2\alpha}f(t)}(1+f(t))\right) \frac{2\sqrt{\pi}^N}{(\sqrt{2\pi\sigma^2})^N} \sum\_{k=0}^{N-1}\binom{N-1}{k}\left(\frac{C}{Bf(t)}\right)^{N-k-1}\left(\sqrt{\frac{2\sigma^2}{1-t}}\right)^{k+1} \\ &\le 2\exp\left(\frac{C^2}{2m^{1-2\alpha}f(t)}(1+f(t))\right)\sum\_{k=0}^{N-1}\binom{N-1}{k}\left(\frac{C}{Bf(t)}\right)^{N-k-1}\left(\sqrt{\frac{1}{1-t}}\right)^{k+1}\left(\sqrt{\frac{B^2}{2tm^{1-2\alpha}}}\right)^{N-k-1} \\ &\le 2\exp\left(\frac{C^2}{2m^{1-2\alpha}f(t)}(1+f(t))\right)\sum\_{k=0}^{N-1}\binom{N-1}{k}\left(\frac{C\sqrt{t}}{(1-t)\sqrt{2m^{1-2\alpha}}}\right)^{N-k-1}\left(\sqrt{\frac{1}{1-t}}\right)^{k+1} \\ &\le \frac{2\exp\left(\frac{C^2}{2m^{1-2\alpha}f(t)}(1+f(t))\right)}{\left(\sqrt{1-t}\right)^N}\sum\_{k=0}^{N-1}\binom{N-1}{k}\left(\frac{C}{\sqrt{2f(t)m^{1-2\alpha}}}\right)^{N-k-1} \\ &\le \frac{2\exp\left(\frac{C^2}{2m^{1-2\alpha}f(t)}(1+f(t))\right)}{\left(\sqrt{1-t}\right)^N}\left(1+\frac{C}{\sqrt{2f(t)m^{1-2\alpha}}}\right)^{N-1}. \end{split}$$

This completes the proof of Proposition 3.
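As an illustrative check of Proposition 3 (ours, with arbitrary parameter values satisfying its assumptions), the bound can be compared against a Monte Carlo estimate of $\xi$:

```python
import math, random

# Parameters (arbitrary, satisfying 0 < t < 1 and N >= 6); alpha = 0, so m^{1-2*alpha} = m.
N, m, B, C, t = 6, 100, 1.0, 1.0, 0.2
f = (1 - t) / t
M = float(m)                       # m^{1 - 2*alpha} with alpha = 0
sigma = math.sqrt(t * M / B ** 2)  # prior standard deviation

rng = random.Random(0)
T = 100_000
acc = 0.0
for _ in range(T):
    norm_h = math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(N)))
    acc += math.exp((B * norm_h + C) ** 2 / (2 * M))  # exp(K(h)^2 / (2 m^{1-2a}))
xi_mc = acc / T

bound = (2 * math.exp(C ** 2 * (1 + f) / (2 * M * f))
         / math.sqrt(1 - t) ** N
         * (1 + C / math.sqrt(2 * f * M)) ** (N - 1))
assert xi_mc <= bound
print(f"Monte Carlo xi = {xi_mc:.3f} <= bound = {bound:.3f}")
```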

**Proof of Theorem 5.** We combine Theorem 3 with Proposition 3. We also upper-bound *N* − 1 by *N*.

**Author Contributions:** Conceptualization, M.H., B.G. and J.S.-T.; Formal analysis, M.H., B.G. and O.R.; Project administration, B.G.; Supervision, B.G.; Writing—original draft, M.H., B.G. and O.R.; Writing—review and editing, M.H., B.G., O.R. and J.S.-T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported in part by the U.S. Army Research Laboratory and the U. S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1. BG acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23- 0015-02.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


## *Article* **Differentiable PAC–Bayes Objectives with Partially Aggregated Neural Networks**

**Felix Biggs <sup>1</sup> and Benjamin Guedj 1,2,\***


**\*** Correspondence: benjamin.guedj@inria.fr

**Abstract:** We make two related contributions motivated by the challenge of training stochastic neural networks, particularly in a PAC–Bayesian setting: (1) we show how averaging over an ensemble of stochastic neural networks enables a new class of partially-aggregated estimators, proving that these lead to unbiased lower-variance output and gradient estimators; (2) we reformulate a PAC–Bayesian bound for signed-*output* networks to derive in combination with the above a directly optimisable, differentiable objective and a generalisation guarantee, without using a surrogate loss or loosening the bound. We show empirically that this leads to competitive generalisation guarantees and compares favourably to other methods for training such networks. Finally, we note that the above leads to a simpler PAC–Bayesian training scheme for sign-*activation* networks than previous work.

**Keywords:** statistical learning theory; PAC–Bayes theory; deep learning

## **1. Introduction**

**Citation:** Biggs, F.; Guedj, B. Differentiable PAC–Bayes Objectives with Partially Aggregated Neural Networks. *Entropy* **2021**, *23*, 1280. https://doi.org/10.3390/e23101280

Academic Editor: Udo Von Toussaint

Received: 22 August 2021 Accepted: 27 September 2021 Published: 29 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

The use of stochastic neural networks has become widespread in the PAC–Bayesian and Bayesian deep learning [1] literature as a way to quantify predictive uncertainty and obtain generalisation bounds. PAC–Bayesian theorems generally bound the expected loss of *randomised* estimators, so it has proven easier to obtain non-vacuous numerical guarantees on generalisation in such networks.

However, we observe that when training these in the PAC–Bayesian setting, the objective used is generally somewhat divorced from the bound on misclassification loss itself, often because non-differentiability leads to difficulties with direct optimisation. For example, Langford and Caruana [2], Zhou et al. [3], and Dziugaite and Roy [4] all initially train non-stochastic networks before using them as the mode of a distribution, with variance chosen, respectively, through a computationally-expensive sensitivity analysis, as a proportion of weight norms, or by optimising an objective with both a surrogate loss function and a different dependence on the Kullback–Leibler (KL) divergence from their bound.

In exploring methods to circumvent this gap, we also note that PAC–Bayesian bounds can often be straightforwardly adapted to aggregates or averages of estimators, leading directly to analytic and differentiable objective functions (for example, [5]). Unfortunately, averages over deep stochastic networks are usually intractable or, if possible, very costly (as found by [6]).

Motivated by these observations, our main contribution is to obtain a compromise by defining new and general "*partially-aggregated*" Monte Carlo estimators for the average output and gradients of deep stochastic networks (Section 3), with the direct optimisation of PAC–Bayesian bounds in mind. Although our main focus here is on the use of this estimator in a PAC–Bayesian application, we emphasise that the technique applies generally to stochastic networks and thus has links to other variance-reduction techniques for training them, such as the pathwise estimator used in the context of neural networks by [7] amongst many others, or Flipout [8]; indeed, it can be used in combination with these techniques. We provide proofs (Section 4) that this application leads to lower variances than a Monte Carlo forward pass and lower variance final-layer gradients than REINFORCE [9].

A further contribution of ours is a first application of this general estimator to non-differentiable "signed-output" networks (with a final output ∈ {−1, +1} and arbitrarily complex other structure; see Section 4). As well as reducing variances as stated above, a small amount of additional structure in combination with partial aggregation enables us to extend the pathwise estimator, which usually requires a fully differentiable network, to the other layers; this eases training by reducing the variance of gradient estimates.

We adapt a binary classification bound (Section 5) from Catoni [10] to these networks, yielding straightforward and directly differentiable objectives when used in combination with aggregation. Closing this gap between objectives and bounds leads to improved theoretical properties.

Further, since most of the existing PAC–Bayes bounds for neural networks have a heavy dependency on the distance from initialisation of the obtained solution, we would intuitively expect these lower variances to lead to faster convergence and tighter bounds (from finding low-error solutions nearer to the initialisation). We indeed observe this experimentally, showing that training PAC–Bayesian objectives in combination with partial aggregation leads to competitive experimental generalisation guarantees (Section 6), and improves upon naive Monte Carlo and REINFORCE.

As a useful corollary, this application also leads us to a similar but simpler PAC– Bayesian training method for sign-activation neural networks than Letarte et al. [6], which successfully aggregated networks with all sign activation functions ∈ {+1, −1} and a non-standard tree structure, but incurred an exponential KL divergence penalty and a heavy computational cost (so that in practice they often resorted to a Monte Carlo estimate). Further, the lower variance of our obtained estimator predictions enables us to use the Gibbs estimator directly (where we draw a single sample function for every new example), leading to a modified bound on the misclassification loss which is a factor of two tighter without a significant performance penalty.

We discuss further and outline future work in Section 7.

#### **2. Background**

We begin here by setting out our notation and the requisite background.

Generally, we consider parameterised functions, $\{f\_\theta : \mathcal{X} \to \mathcal{Y} \mid \theta \in \Theta \subset \mathbb{R}^N\}$, in a specific form, choosing $\mathcal{X} \subset \mathbb{R}^{d\_0}$ and an arbitrary output space $\mathcal{Y}$, which could be for example $\{-1, +1\}$ or $\mathbb{R}$. We wish to find functions minimizing the out-of-sample risk, $R(f) = \mathbb{E}\_{(x,y)\sim\mathcal{D}}\,\ell(f(x), y)$, for some loss function $\ell$: for example, the 0–1 misclassification loss for classification, $\ell\_{0-1}(y, y') = \mathbf{1}\{y \neq y'\}$, or the binary linear loss, $\ell\_{\mathrm{lin}}(y, y') = \frac{1}{2}(1 - yy')$, with $\mathcal{Y} = \{+1, -1\}$. These must be chosen based on an i.i.d. sample $S = \{(x\_i, y\_i)\}\_{i=1}^m \sim \mathcal{D}^m$ from the data distribution $\mathcal{D}$, using the surrogate of in-sample empirical risk, $R\_S(f) = \frac{1}{m}\sum\_{i=1}^m \ell(f(x\_i), y\_i)$. We denote the expected and empirical risks under the misclassification and linear losses by $R^{0-1}$, $R^{\mathrm{lin}}$, $R\_S^{0-1}$ and $R\_S^{\mathrm{lin}}$, respectively.

In this paper, we consider learning a distribution (PAC–Bayesian posterior), *Q*, over the parameters *θ*. PAC–Bayesian theorems then provide bounds on the expected generalization risk of randomised classifiers, where every prediction is made using a newly sampled function from our posterior, *fθ*, *θ* ∼ *Q*.

We also consider averaging the above to obtain *Q*-aggregated prediction functions,

$$F\_Q(\mathfrak{x}) := \mathbb{E}\_{\theta \sim Q} f\_\theta(\mathfrak{x}). \tag{1}$$

In the case of a convex loss function, Jensen's inequality lower bounds the risk of the randomised function by that of its *Q*-aggregate: $\ell(F\_Q(x), y) \leq \mathbb{E}\_{f\sim Q}\,\ell(f(x), y)$. Equality is achieved by the linear loss, a fact we will exploit to obtain an easier PAC–Bayesian optimisation objective in Section 5.

#### *2.1. Analytic Q-Aggregates for Signed Linear Functions*

*Q*-aggregate predictors are analytically tractable for "signed-output" functions (here the sign function and "signed" functions have outputs ∈ {+1, −1}; the terminology "binary", sometimes used in the literature, suggests to us too strongly an output ∈ {0, 1}) of the form $f\_w(x) = \operatorname{sign}(w \cdot x)$ under a normal distribution, $Q(w) = \mathcal{N}(\mu, \mathbb{I})$, as specifically considered in a PAC–Bayesian context for binary classification by [5], obtaining a differentiable objective similar to the SVM. Provided $x \neq \mathbf{0}$:

$$F\_Q(\mathbf{x}) := \mathbb{E}\_{\mathbf{w} \sim N(\mu, \mathbb{I})} \operatorname{sign}(\mathbf{w} \cdot \mathbf{x}) = \operatorname{erf}\left(\frac{\mu \cdot \mathbf{x}}{\sqrt{2} ||\mathbf{x}||}\right). \tag{2}$$

In Section 4, we will consider aggregating signed output (*f*(*x*) ∈ {+1, −1}) functions of a more general form.
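Equation (2) is the key closed form used throughout; as an illustrative check (ours, with arbitrary $\mu$ and $x$), the erf expression can be compared to a Monte Carlo estimate of the aggregate:

```python
import math, random

# Check E_{w ~ N(mu, I)}[sign(w . x)] = erf(mu . x / (sqrt(2) ||x||)) by Monte Carlo.
rng = random.Random(1)
mu = [0.8, -0.3, 0.5]  # arbitrary mean vector
x = [1.0, 2.0, -0.5]   # arbitrary nonzero input

norm_x = math.sqrt(sum(xi * xi for xi in x))
exact = math.erf(sum(mi * xi for mi, xi in zip(mu, x)) / (math.sqrt(2) * norm_x))

T = 100_000
mc = 0.0
for _ in range(T):
    dot = sum((mi + rng.gauss(0.0, 1.0)) * xi for mi, xi in zip(mu, x))
    mc += 1.0 if dot > 0 else -1.0
mc /= T

assert abs(mc - exact) < 0.02
print(f"Monte Carlo: {mc:.4f}, closed form: {exact:.4f}")
```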

#### *2.2. Monte Carlo Estimators for More Complex Q-Aggregates*

The framework of *Q*-aggregates can be extended to less tractable cases (for example, with *f<sup>θ</sup>* a randomised or a "Bayesian" neural network, see, e.g., [1]) through a simple and unbiased Monte Carlo approximation:

$$F\_Q(\mathbf{x}) = \mathbb{E}\_{\theta \sim Q} f\_\theta(\mathbf{x}) \approx \frac{1}{T} \sum\_{t=1}^T f\_{\theta^t}(\mathbf{x}) =: \hat{F}\_Q(\mathbf{x}), \qquad \theta^t \sim Q. \tag{3}$$

If we go on to parameterize our posterior $Q$ by $\phi \in \Phi \subset \mathbb{R}^N$ as $Q\_\phi$ and wish to obtain gradients without a closed form for $F\_{Q\_\phi}(x) = \mathbb{E}\_{\theta\sim Q\_\phi} f\_\theta(x)$, there are two possibilities. One is REINFORCE [9], which requires only a differentiable density function, $q\_\phi(\theta)$, and makes a Monte Carlo approximation to the right hand side of the identity $\nabla\_\phi \mathbb{E}\_{\theta\sim q\_\phi} f\_\theta(x) = \mathbb{E}\_{\theta\sim q\_\phi}\left[f\_\theta(x)\,\nabla\_\phi \log q\_\phi(\theta)\right]$.

The other is the pathwise estimator, which additionally requires that $f\_\theta(x)$ be differentiable w.r.t. $\theta$, and that the probability distribution chosen has a standardization function, $S\_\phi$, which removes the $\phi$ dependence, turning a parameterised $q\_\phi$ into a non-parameterised distribution $p$: for example, $S\_{\mu,\sigma}(X) = (X - \mu)/\sigma$ transforms a general normal distribution into a standard normal. If this exists, the right hand side of $\nabla\_\phi \mathbb{E}\_{\theta\sim q\_\phi} f\_\theta(x) = \mathbb{E}\_{\epsilon\sim p}\,\nabla\_\phi f\_{S\_\phi^{-1}(\epsilon)}(x)$ generally yields lower-variance estimates than REINFORCE (see [11] for a modern survey).

The variance introduced by REINFORCE can make it difficult to train neural networks when the pathwise estimator is not available, for example when non-differentiable activation functions are used. Below we find a compromise between the analytically closed form of (2) and the above estimator that enables us to make differentiable certain classes of network and extend the pathwise estimator where otherwise it could not be used. Through this we are able to stably train a new class of network.
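The variance gap between the two estimators is easy to see on a toy one-dimensional problem (our illustration, not from the paper): estimating $\frac{d}{d\mu}\mathbb{E}\_{\theta\sim\mathcal{N}(\mu,1)}[\theta^2] = 2\mu$ with both estimators:

```python
import random, statistics

# Estimate d/dmu E_{theta ~ N(mu, 1)}[theta^2] = 2*mu with both estimators.
rng = random.Random(2)
mu = 1.0

def reinforce_sample():
    th = rng.gauss(mu, 1.0)
    return th ** 2 * (th - mu)  # f(theta) * d/dmu log q_mu(theta)

def pathwise_sample():
    eps = rng.gauss(0.0, 1.0)
    return 2.0 * (mu + eps)     # d/dmu f(mu + eps), with theta = mu + eps

rf = [reinforce_sample() for _ in range(100_000)]
pw = [pathwise_sample() for _ in range(100_000)]

# Both are unbiased for 2*mu = 2.0 ...
assert abs(statistics.mean(rf) - 2.0) < 0.1
assert abs(statistics.mean(pw) - 2.0) < 0.1
# ... but the pathwise estimator has far lower variance on this problem.
assert statistics.variance(pw) < statistics.variance(rf)
```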

## *2.3. PAC–Bayesian Approach*

We use PAC–Bayes in this paper to obtain generalisation guarantees and theoretically-motivated training methods. The primary bound utilised is based on the following theorem, valid for a loss taking values in [0, 1]:

**Theorem 1** ([10], Theorem 1.2.6)**.** *Given a probability measure $P$ on hypothesis space $\mathcal{F}$ and $\alpha > 1$, for all $Q$ on $\mathcal{F}$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$,*

$$\mathbb{E}\_{f \sim Q} R(f) \le \inf\_{\lambda > 1} \Phi\_{\lambda/m}^{-1} \left[ \mathbb{E}\_{f \sim Q} R\_S(f) + \frac{\alpha}{\lambda} \Delta \right]$$
 
$$\text{with } \Phi\_{\gamma}^{-1}(t) = \frac{1 - \exp(-\gamma t)}{1 - \exp(-\gamma)} \text{ and } \Delta = \mathrm{KL}(Q|P) - \log \delta + 2 \log \left( \frac{\log (\alpha^2 \lambda)}{\log \alpha} \right).$$

This slightly opaque formulation (used previously by [3]) gives essentially identical results when KL/*m* is large to the better-known "small-kl" PAC–Bayes bounds originated by Langford and Seeger [12], Seeger et al. [13]. It is chosen because it leads to objectives that are *linear* in the empirical loss and KL divergence, like

$$\mathbb{E}\_{f \sim Q} R\_S(f) + \frac{\mathrm{KL}(Q|P)}{\lambda}. \tag{4}$$

This objective is minimised by a Gibbs distribution and is closely related to the evidence lower bound (ELBO) usually optimised by Bayesian Neural Networks [1]. Such a connection has been noted throughout the PAC–Bayesian literature; we refer the reader to [14] or [15] for a formalised treatment.
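To make the bound concrete, the inversion $\Phi\_\gamma^{-1}$ and the right-hand side of Theorem 1 can be evaluated numerically; the sketch below (ours, with illustrative values for the empirical risk, KL term and sample size, and a finite grid standing in for the infimum over $\lambda$) shows the computation:

```python
import math

def phi_inv(gamma: float, t: float) -> float:
    """Phi_gamma^{-1}(t) = (1 - exp(-gamma*t)) / (1 - exp(-gamma))."""
    return (1.0 - math.exp(-gamma * t)) / (1.0 - math.exp(-gamma))

def catoni_bound(emp_risk: float, kl: float, m: int, delta: float, alpha: float = 2.0) -> float:
    """Right-hand side of Theorem 1, with a finite grid of lambda > 1
    standing in for the infimum (a coarser grid only loosens the bound)."""
    best = 1.0
    for lam in [m * 2.0 ** i / 64 for i in range(12)]:
        delta_term = (kl - math.log(delta)
                      + 2 * math.log(math.log(alpha ** 2 * lam) / math.log(alpha)))
        best = min(best, phi_inv(lam / m, emp_risk + alpha / lam * delta_term))
    return best

# Illustrative values (not experimental results from the paper):
b = catoni_bound(emp_risk=0.05, kl=500.0, m=60_000, delta=0.05)
assert 0.05 < b < 1.0
print(f"bound on E_Q R(f): {b:.4f}")
```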

#### **3. The Partial Aggregation Estimator**

Here we outline our main contribution: a reformulation of *Q*-aggregation for neural networks leading to different, lower-variance, Monte Carlo estimators for their outputs and gradients. These estimators apply to networks with a dense final layer, and arbitrary stochastic other structure (for example convolutions, residual layers or a non-feedforward structure). Specifically, they take the form

$$f\_{\theta}(\mathbf{x}) = A(\mathbf{w}\cdot\boldsymbol{\eta}\_{\theta^{\neg \mathbf{w}}}(\mathbf{x})) \tag{5}$$

with $\theta = \operatorname{vec}(w, \theta^{\neg w}) \in \Theta \subset \mathbb{R}^D$, $w \in \mathbb{R}^d$, and $\theta^{\neg w} \in \Theta^{\neg w} \subset \mathbb{R}^{D-d}$ the parameter set excluding $w$, for the non-final layers of the network. These non-final layers are included in $\eta\_{\theta^{\neg w}} : \mathcal{X} \to \mathcal{A}^d \subseteq \mathbb{R}^d$ and the final activation is $A : \mathbb{R} \to \mathcal{Y}$. For simplicity we have used a one-dimensional output, but we note that the formulation and below derivations trivially extend to a vector-valued function with elementwise activations. We require the distribution over parameters to factorise like $Q(\theta) = Q^w(w)\,Q^{\neg w}(\theta^{\neg w})$, which is consistent with the literature.

We recover a similar functional form to that considered in Section 2.1 by rewriting the function as $A(w \cdot a)$ with $a \in \mathcal{A}^d$ the randomised hidden-layer activations. The "aggregated" activation function on the final layer, which we define as $I(a) := \int A(w \cdot a)\, \mathrm{d}Q^w(w)$, may then be analytically tractable. For example, with $w \sim \mathcal{N}(\mu, \mathbb{I})$ and a sign final activation, we recall (2), where $I(a) = \operatorname{erf}\left(\frac{\mu \cdot a}{\sqrt{2}\,||a||}\right)$.

Using these definitions, we can write the $Q$-aggregate in terms of the conditional distribution on the activations, $a$, which takes the form $\tilde{Q}^{\neg w}(a|x) := (\eta\_{(\cdot)}(x)) \circ Q^{\neg w}$ (i.e., the distribution of $\eta\_{\theta^{\neg w}}(x)$ given $x$, with $\theta^{\neg w} \sim Q^{\neg w}$). The $Q$-aggregate can then be stated as

$$\begin{split} F\_{Q}(\mathbf{x}) &:= \mathbb{E}\_{\theta \sim Q}[f\_{\theta}(\mathbf{x})] \\ &= \int\_{\theta^{\neg w}} \left[ \int\_{\mathbb{R}^{d}} A(w \cdot \eta\_{\theta^{\neg w}}(\mathbf{x})) \, \mathrm{d}Q^{w}(w) \right] \mathrm{d}Q^{\neg w}(\theta^{\neg w}) \\ &= \int\_{\theta^{\neg w}} I(\eta\_{\theta^{\neg w}}(\mathbf{x})) \, \mathrm{d}Q^{\neg w}(\theta^{\neg w}) \\ &= \int\_{\mathcal{A}^{d}} I(\mathbf{a}) \, \mathrm{d}\{ (\eta\_{(\cdot)}(\mathbf{x})) \circ Q^{\neg w} \}(\mathbf{a}) \\ &=: \int\_{\mathcal{A}^{d}} I(\mathbf{a}) \, \mathrm{d}\tilde{Q}^{\neg w}(\mathbf{a}|\mathbf{x}). \end{split}$$

In most cases, the final integral cannot be calculated exactly or involves a large summation, so we resort to a Monte Carlo estimate, for each $x$ drawing $T$ samples of the randomised activations, $\{a^t\}\_{t=1}^T \sim \tilde{Q}^{\neg w}(a|x)$, to obtain the "partially-aggregated" estimator

$$F\_Q(\mathbf{x}) = \int\_{\mathcal{A}^d} I(\mathbf{a}) \, \mathrm{d}\tilde{Q}^{\neg w}(\mathbf{a}|\mathbf{x}) \approx \frac{1}{T} \sum\_{t=1}^T I(\mathbf{a}^t) =: \hat{F}\_Q^\*(\mathbf{x}). \tag{6}$$

This is quite similar to the original estimator from (3), but in fact the aggregation of the final layer may significantly reduce the variance of the outputs and also make better gradient estimates possible, as we will show below.

#### *3.1. Reduced Variance Estimates*

**Proposition 1.** Lower variance outputs: *For a neural network as defined by Equation* (5) *and the unbiased Q-aggregation estimators defined by Equations* (3) *and* (6)*,*

$$\mathbb{V}\_Q[\hat{F}\_Q^\*(\mathbf{x})] \le \mathbb{V}\_Q[\hat{F}\_Q(\mathbf{x})].$$

**Proof.** Treating *a* as a random variable, always conditioned on *x*, we have

$$\begin{split} & \mathbb{V}\_{Q}[\hat{F}\_{Q}(\mathbf{x})] - \mathbb{V}\_{Q}[\hat{F}\_{Q}^{\*}(\mathbf{x})] = \mathbb{E}\_{Q} |\hat{F}\_{Q}(\mathbf{x})|^{2} - \mathbb{E}\_{Q} |\hat{F}\_{Q}^{\*}(\mathbf{x})|^{2} \\ &= \frac{1}{T} \mathbb{E}\_{\mathbf{a}|\mathbf{x}} \left[ \mathbb{E}\_{\mathbf{w}} |A(\mathbf{w} \cdot \mathbf{a})|^{2} - |\mathbb{E}\_{\mathbf{w}} A(\mathbf{w} \cdot \mathbf{a})|^{2} \right] \\ &= \frac{1}{T} \mathbb{E}\_{\mathbf{a}|\mathbf{x}} [\mathbb{V}\_{\mathbf{w}}[A(\mathbf{w} \cdot \mathbf{a})]] \ge 0. \end{split}$$


From the above we see that the aggregate outputs estimated through partial aggregation have lower variances. Next, we consider the two unbiased gradient estimators for the distribution over final-layer weights, $w$, arising from partial aggregation or REINFORCE (as would be used, for example, where the final layer is non-differentiable). Assuming $Q^w$ has a density, $q\_\phi(w)$, parameterised by $\phi$, these use forward samples of $\{w^t, \theta^{\neg w,(t)}\}\_{t=1}^T$ as:

$$\begin{aligned} \hat{G}(\mathbf{x}) &:= \frac{1}{T} \sum\_{t=1}^{T} A \left( \mathbf{w}^{t} \cdot \boldsymbol{\eta}^{t} \right) \nabla\_{\phi} \log q\_{\phi}(\mathbf{w}^{t}), \\ \hat{G}^{\*}(\mathbf{x}) &:= \frac{1}{T} \sum\_{t=1}^{T} \nabla\_{\phi} I\_{q\_{\phi}}(\boldsymbol{\eta}^{t}). \end{aligned}$$

**Proposition 2.** Lower variance gradients: *Under the conditions of Proposition 1 and the above definitions,*

$$\operatorname{Cov}\_{Q}[\hat{G}^\*(\mathbf{x})] \preceq \operatorname{Cov}\_{Q}[\hat{G}(\mathbf{x})],$$

*where $A \preceq B \iff B - A$ is positive semi-definite. Thus, for all $u \neq \mathbf{0}$, $\mathbb{V}[\hat{G}^\*(x) \cdot u] \leq \mathbb{V}[\hat{G}(x) \cdot u]$.*

**Proof.** Writing *v* := ∇*<sup>φ</sup>* log *qφ*(*w*) and using the unbiasedness of the estimators,

$$\begin{split} & \operatorname{Cov}\_{Q}[\hat{G}(\mathbf{x})] - \operatorname{Cov}\_{Q}[\hat{G}^{\*}(\mathbf{x})] \\ &= \mathbb{E}\_{Q}[\hat{G}(\mathbf{x})\hat{G}(\mathbf{x})^{T}] - \mathbb{E}\_{Q}[\hat{G}^{\*}(\mathbf{x})\hat{G}^{\*}(\mathbf{x})^{T}] \\ &= \frac{1}{T} \mathbb{E}\_{\mathbf{a}|\mathbf{x}} \left[ \mathbb{E}\_{\mathbf{w}}[A(\mathbf{w}\cdot\mathbf{a})^{2}\,\mathbf{v}\mathbf{v}^{T}] - \nabla\_{\phi}I\_{q\_{\phi}}(\mathbf{a}) \left(\nabla\_{\phi}I\_{q\_{\phi}}(\mathbf{a})\right)^{T} \right] \\ &= \frac{1}{T} \mathbb{E}\_{\mathbf{a}|\mathbf{x}} \left[ \operatorname{Cov}\_{\mathbf{w}}[A(\mathbf{w}\cdot\mathbf{a})\,\nabla\_{\phi}\log q\_{\phi}(\mathbf{w})] \right] \succeq 0 \end{split}$$

where in the final line we have used that $\nabla\_\phi I\_{q\_\phi}(\mathbf{a}) = \nabla\_\phi \mathbb{E}\_{\mathbf{w}}[A(\mathbf{w} \cdot \mathbf{a})] = \mathbb{E}\_{\mathbf{w}}[A(\mathbf{w} \cdot \mathbf{a})\,\mathbf{v}]$.
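A small numerical illustration of Proposition 2 (ours, with arbitrary $\mu$ and a single fixed activation vector $a$): for the sign/erf case of Section 2.1, the aggregated estimator is deterministic given $a$, while the REINFORCE estimator fluctuates around the same gradient:

```python
import math, random, statistics

rng = random.Random(3)
mu = [0.5, -0.2]  # arbitrary final-layer mean
a = [1.0, 2.0]    # one fixed draw of the hidden activations

norm_a = math.sqrt(sum(ai * ai for ai in a))
u = sum(mi * ai for mi, ai in zip(mu, a)) / (math.sqrt(2) * norm_a)
# d/dmu_1 of erf(mu.a / (sqrt(2)||a||)): the aggregated gradient, deterministic given a.
agg_grad = (2 / math.sqrt(math.pi)) * math.exp(-u * u) * a[0] / (math.sqrt(2) * norm_a)

def reinforce_grad():
    w = [mi + rng.gauss(0.0, 1.0) for mi in mu]
    s = 1.0 if sum(wi * ai for wi, ai in zip(w, a)) > 0 else -1.0
    return s * (w[0] - mu[0])  # A(w.a) * d/dmu_1 log q_mu(w)

grads = [reinforce_grad() for _ in range(100_000)]
assert abs(statistics.mean(grads) - agg_grad) < 0.015  # unbiased for the same gradient...
assert statistics.variance(grads) > 0.1                # ...but with non-zero variance
```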

## *3.2. Single Hidden Layer*

For clarity (and to introduce notation to be used in Section 4.2) we will briefly consider the case of a neural network with one hidden layer, $f\_\theta(x) = A\_2(w\_2 \cdot A\_1(W\_1 x))$. The randomised parameters are $\theta = \operatorname{vec}(w\_2, W\_1)$, $W\_1 \in \mathbb{R}^{d\_1 \times d\_0}$, $w\_2 \in \mathbb{R}^{d\_1}$, and the elementwise activations are $A\_1 : \mathbb{R}^{d\_1} \to \mathcal{A}\_1^{d\_1} \subseteq \mathbb{R}^{d\_1}$ and $A\_2 : \mathbb{R} \to \mathcal{Y}$. We choose the distribution $Q(\theta) =: Q\_2(w\_2)\,Q\_1(W\_1)$ to factorise over the layers. This is identical to the above and sets $\eta\_{W\_1}(x) = A\_1(W\_1 x)$.

Sampling $a$ is straightforward if sampling $W\_1$ is. Further, if the final layer aggregate is differentiable, and so is the hidden layer activation $A\_1$, we may be able to use the lower-variance pathwise gradient estimator for gradients with respect to $Q\_1$. We note that this may be possible even if the activation $A\_2$ is not differentiable, as in Section 4, where we extend the pathwise estimator where we could not otherwise use it.

Computationally, we may implement the above by analytically finding the distribution on the "pre-activations" *W*1*x* (trivial for the normal distribution) before sampling this and passing through the activation. With the pathwise estimator this is known as the "local reparameterization trick" [16], which can lead to considerable computational savings on parallel minibatches compared to direct hierarchical sampling, *a* = *A*1(*W*1*x*) with *W*<sup>1</sup> ∼ *Q*1. We will utilise this in all our reparameterizable dense networks, and a variation on it to save computation when using REINFORCE in Sections 4.2 and 6.
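A minimal sketch of the local reparameterization trick for a dense layer with independent Gaussian weights (our illustration; all names and values are arbitrary): rather than sampling the weight matrix, one samples each pre-activation directly from its induced normal distribution:

```python
import math, random

rng = random.Random(4)

def local_reparam(mu, sigma, x):
    """Sample W x with independent W_ij ~ N(mu[i][j], sigma[i][j]^2),
    drawing one Gaussian per output unit instead of one per weight."""
    out = []
    for mu_i, s_i in zip(mu, sigma):
        mean = sum(m * xj for m, xj in zip(mu_i, x))
        var = sum((s * xj) ** 2 for s, xj in zip(s_i, x))
        out.append(mean + math.sqrt(var) * rng.gauss(0.0, 1.0))
    return out

# Arbitrary illustrative parameters for a 2 -> 2 dense layer:
mu = [[0.2, -0.1], [0.4, 0.3]]
sigma = [[0.5, 0.5], [0.5, 0.5]]
x = [1.0, 2.0]
pre = local_reparam(mu, sigma, x)   # distributed exactly as W x
acts = [math.tanh(p) for p in pre]  # then apply the hidden activation A_1
```

On a minibatch this replaces one Gaussian draw per weight with one per output unit, which is the source of the computational saving mentioned above.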

#### **4. Aggregating Signed-Output Networks**

Here we consider a first practical application of the aggregation estimator to stochastic neural networks with a final dense sign-activated layer. We have seen above that this partial aggregation leads to better-behaved training objectives and lower-variance gradient estimates across arbitrary other network structure. It may also allow the use of pathwise gradients for the other layers, which would not be possible otherwise due to the non-differentiability of the final layer.

Specifically, these networks take the form of Equation (5) with the final layer activation a sign function and weights drawn from a unit-variance normal distribution, $Q_w(w) = \mathcal{N}(\mu, I)$. The aggregate $I(a)$ is given by Equation (2). Normally-distributed weights are chosen because of the simple analytic forms for the aggregate (reminiscent of the tanh activation occasionally used for neural networks) and the KL divergence (effectively an $L_2$ regularisation penalty); we note, however, that closed forms are available for other commonly-used distributions such as the Laplace.

Using Equations (3) and (6) with independent samples $\{(w^t, \theta^{\neg w,(t)})\}_{t=1}^{T} \sim Q$ and $\eta^t := \eta_{\theta^{\neg w,(t)}}(x)$ leads to the two unbiased estimators for the output (henceforth assuming the technical condition $P_{\eta|x}\{\eta = \mathbf{0}\} = 0$ that allows aggregation to be well-defined).

$$\hat{F}_Q(\mathbf{x}) := \frac{1}{T}\sum_{t=1}^{T} \operatorname{sign}(\mathbf{w}^t \cdot \boldsymbol{\eta}^t) \tag{7}$$

$$\hat{F}_Q^*(\mathbf{x}) := \frac{1}{T}\sum_{t=1}^{T} \operatorname{erf}\left(\frac{\boldsymbol{\mu} \cdot \boldsymbol{\eta}^t}{\sqrt{2}\,\|\boldsymbol{\eta}^t\|}\right). \tag{8}$$

It follows immediately from Propositions 1 and 2 that the latter and the associated gradient estimators have lower variances than the former or the REINFORCE gradient estimates (which we would otherwise have to use due to the non-differentiability of the final layer).
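To make the two estimators concrete, a small NumPy sketch (names are ours; the activations $\eta^t$ are stand-in $\pm 1$ vectors rather than the output of a real network):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
d, T = 10, 200
mu = 0.1 * rng.standard_normal(d)            # final-layer weight mean
eta = rng.choice([-1.0, 1.0], size=(T, d))   # stand-in samples of eta^t

# Equation (7): also sample the final-layer weights w^t ~ N(mu, I)
w = mu + rng.standard_normal((T, d))
F_hat = np.mean(np.sign(np.sum(w * eta, axis=1)))

# Equation (8): the final layer is aggregated analytically
z = (eta @ mu) / (sqrt(2) * np.linalg.norm(eta, axis=1))
F_star = np.mean([erf(v) for v in z])
print(F_hat, F_star)
```

Both estimate the same aggregated output; only the second integrates out the final-layer weights.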

#### *4.1. Lower Variance Estimates of Aggregated Sign-Output Networks*

We clarify the variance reduction further below. In particular, we find that the reduction in variance from using the partial-aggregation estimator is controlled by the norm $\|\mu\|$, so that for small $\|\mu\|$ (as could be expected early in training) the difference can be large, while as $\|\mu\|$ grows, the difference in variance is controlled and we could reasonably revert to the Monte Carlo (or Gibbs) estimator. Note also that as $F_Q(x) \to \pm 1$ (as expected after training), both variances vanish.

We also show that a stricter condition than Proposition 2 holds on the variances of the aggregated gradients here, and thus that the non-aggregated gradients are noisier in all cases than the aggregate.

**Proposition 3.** *With the definitions given by Equations* (7) *and* (8)*, for all $x \in \mathbb{R}^{d_0}$, $T \in \mathbb{N}$, and $Q$ with normally-distributed final layer,*

$$0 \le \mathbb{V}_Q[\hat{F}_Q(\mathbf{x})] - \mathbb{V}_Q[\hat{F}_Q^*(\mathbf{x})] \le \frac{1}{T}\left(1 - \left|\operatorname{erf}\left(\frac{\|\boldsymbol{\mu}\|}{\sqrt{2}}\right)\right|^2\right).$$

**Proof.** The left inequality follows directly from Proposition 1. We also have

$$\begin{aligned} \mathbb{V}_Q[\hat{F}_Q(\mathbf{x})] - \mathbb{V}_Q[\hat{F}_Q^*(\mathbf{x})] &= \frac{1}{T}\mathbb{E}_{\boldsymbol{\eta}|\mathbf{x}}\bigl[\mathbb{V}_{\mathbf{w}}[\operatorname{sign}(\mathbf{w}\cdot\boldsymbol{\eta})]\bigr] \\ &= \frac{1}{T}\left(1 - \mathbb{E}_{\boldsymbol{\eta}|\mathbf{x}}\left|\operatorname{erf}\left(\frac{\boldsymbol{\mu}\cdot\boldsymbol{\eta}}{\sqrt{2}\,\|\boldsymbol{\eta}\|}\right)\right|^2\right) \\ &\le \frac{1}{T}\left(1 - \left|\operatorname{erf}\left(\frac{\|\boldsymbol{\mu}\|}{\sqrt{2}}\right)\right|^2\right). \end{aligned}$$

**Proposition 4.** *Under the conditions of Proposition 3,*

$$\operatorname{Cov}[\hat{\mathbf{G}}^*(\mathbf{x})] + \frac{1 - 2/\pi}{T}\,\mathbb{I} \preceq \operatorname{Cov}[\hat{\mathbf{G}}(\mathbf{x})].$$

*Thus, for all $\mathbf{u}$ with $\|\mathbf{u}\| = 1$,*

$$\mathbb{V}[\hat{\mathbf{G}}^*(\mathbf{x}) \cdot \mathbf{u}] + \frac{1 - 2/\pi}{T} \le \mathbb{V}[\hat{\mathbf{G}}(\mathbf{x}) \cdot \mathbf{u}].$$

**Proof.** It is straightforward to show that

$$\begin{aligned} \hat{\mathbf{G}}(\mathbf{x}) &:= \frac{1}{T}\sum_{t=1}^{T}\operatorname{sign}(\mathbf{w}^t\cdot\boldsymbol{\eta}^t)\,(\boldsymbol{\mu}-\mathbf{w}^t) \\ \hat{\mathbf{G}}^*(\mathbf{x}) &:= \frac{1}{T}\sum_{t=1}^{T}\frac{\boldsymbol{\eta}^t}{\|\boldsymbol{\eta}^t\|}\sqrt{\frac{2}{\pi}}\exp\left[-\frac{1}{2}\left(\frac{\boldsymbol{\mu}\cdot\boldsymbol{\eta}^t}{\|\boldsymbol{\eta}^t\|}\right)^2\right] \\ \operatorname{Cov}[\hat{\mathbf{G}}(\mathbf{x})] &= \frac{1}{T}\left(\mathbb{I}-\mathbf{G}\mathbf{G}^T\right) \\ \operatorname{Cov}[\hat{\mathbf{G}}^*(\mathbf{x})] &= \frac{1}{T}\left(\mathbb{E}\left[\frac{\boldsymbol{\eta}\boldsymbol{\eta}^T}{\|\boldsymbol{\eta}\|^2}\,\frac{2}{\pi}\,e^{-\left(\frac{\boldsymbol{\mu}\cdot\boldsymbol{\eta}}{\|\boldsymbol{\eta}\|}\right)^2}\right]-\mathbf{G}\mathbf{G}^T\right) \end{aligned}$$

so for $\mathbf{u} \neq \mathbf{0}$,

$$\begin{aligned} &T\,\mathbf{u}^T\left(\operatorname{Cov}[\hat{\mathbf{G}}(\mathbf{x})]-\operatorname{Cov}[\hat{\mathbf{G}}^*(\mathbf{x})]\right)\mathbf{u} \\ &= \|\mathbf{u}\|^2 - \frac{2}{\pi}\,\mathbb{E}\left[\frac{|\mathbf{u}\cdot\boldsymbol{\eta}|^2}{\|\boldsymbol{\eta}\|^2}\,e^{-\left(\frac{\boldsymbol{\mu}\cdot\boldsymbol{\eta}}{\|\boldsymbol{\eta}\|}\right)^2}\right] \ge \|\mathbf{u}\|^2\left(1-\frac{2}{\pi}\right) > 0. \end{aligned}$$

Above we have brought $\mathbf{u}$ inside the term with an expectation, which is then bounded using Cauchy–Schwarz, $|\mathbf{u}\cdot\boldsymbol{\eta}|/\|\boldsymbol{\eta}\| \le \|\mathbf{u}\|$, and $e^{-|t|} \le 1$ for all $t \in \mathbb{R}$.
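A quick numerical illustration of this covariance ordering, using the estimator definitions from the proof (a sketch with our own names; $\eta$ is again a stand-in $\pm 1$ activation rather than a real network's output):

```python
import numpy as np
from math import pi

rng = np.random.default_rng(2)
d, N = 5, 20000
mu = 0.2 * rng.standard_normal(d)

samples = []
for _ in range(N):
    eta = rng.choice([-1.0, 1.0], size=d)
    u = eta / np.linalg.norm(eta)
    w = mu + rng.standard_normal(d)
    g = np.sign(w @ eta) * (mu - w)                # non-aggregated (T = 1)
    g_star = u * np.sqrt(2 / pi) * np.exp(-0.5 * (mu @ u) ** 2)  # aggregated
    samples.append((g, g_star))

G = np.array(samples)                       # shape (N, 2, d)
tr_cov = G[:, 0, :].var(axis=0).sum()       # trace of empirical Cov[G-hat]
tr_cov_star = G[:, 1, :].var(axis=0).sum()  # trace of empirical Cov[G-hat*]
print(tr_cov > tr_cov_star)   # aggregated gradients have smaller covariance trace
```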

#### *4.2. All Sign Activations*

Here we consider an important special case previously examined by Letarte et al. [6]: a feed-forward network with all sign activations and normal weights. This takes the form

$$f_\theta(\mathbf{x}) = \operatorname{sign}(\mathbf{w}_L \cdot \operatorname{sign}(W_{L-1} \cdots \operatorname{sign}(W_1\mathbf{x})\cdots)),$$

with $\theta := \mathrm{vec}(w_L, \ldots, W_1)$ and $W_l := [w_{l,1} \cdots w_{l,d_l}]^T$; $l \in \{1, \ldots, L\}$ are the layer indices. We choose unit-variance normal distributions on the weights, which factorise into $Q_l(W_l) = \prod_{i=1}^{d_l} q_{l,i}(w_{l,i})$ with $q_{l,i} = \mathcal{N}(\mu_{l,i}, I_{d_{l-1}})$. In the notation of Section 3, $\eta_{\theta^{\neg w}}(x) = \operatorname{sign}(W_{L-1}\cdots\operatorname{sign}(W_1 x)\cdots)$ is the final layer activation, which could easily be sampled by mapping $x$ through the first $L-1$ layers with draws from the weight distribution.

Instead, we go on to make an iterative replacement of the weight distributions on each layer by conditionals on the layer activations to obtain the summation

$$F_Q(\mathbf{x}) = \sum_{\mathbf{a}_1 \in \{+1,-1\}^{d_1}} \tilde{Q}_1(\mathbf{a}_1|\mathbf{x}) \times \cdots \times \sum_{\mathbf{a}_{L-1} \in \{+1,-1\}^{d_{L-1}}} \tilde{Q}_{L-1}(\mathbf{a}_{L-1}|\mathbf{a}_{L-2})\, \operatorname{erf}\left(\frac{\boldsymbol{\mu}_L \cdot \mathbf{a}_{L-1}}{\sqrt{2}\,\|\mathbf{a}_{L-1}\|}\right). \tag{9}$$

The number of terms is exponential in the depth, so we instead hierarchically sample the $a_l$. Like local reparameterisation, this leads to a considerable computational saving over sampling a separate weight matrix for every input. The conditionals can be found in closed form: we can factorise over individual hidden units, $\tilde{Q}_l(a_l|a_{l-1}) := \prod_{i=1}^{d_l} \tilde{q}_{l,i}(a_{l,i}|a_{l-1})$, and find their activation distributions (with $a_0 := x$ and $z$ a dummy variable):

$$\begin{split} \tilde{q}\_{l,i}(a\_{l,i} = \pm 1 \mid a\_{l-1}) &= \int\_{0}^{\infty} \mathcal{N}\left(z; \pm \mu\_{l,i} \cdot a\_{l-1}, \lVert a\_{l-1} \rVert^{2}\right) dz \\ &= \frac{1}{2} \left[ 1 \pm \text{erf}\left(\frac{\mu\_{l,i} \cdot a\_{l-1}}{\sqrt{2} \lVert a\_{l-1} \rVert}\right) \right] .\end{split}$$
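A minimal sketch of sampling one layer from these closed-form conditionals (helper and variable names are ours, not the paper's):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

def sample_sign_layer(a_prev, mu, rng):
    # q~(a_i = +1 | a_prev) = (1 + erf(mu_i . a_prev / (sqrt(2) ||a_prev||))) / 2
    n = np.linalg.norm(a_prev)
    p_plus = np.array([0.5 * (1 + erf((m @ a_prev) / (sqrt(2) * n))) for m in mu])
    return np.where(rng.random(len(mu)) < p_plus, 1.0, -1.0)

x = rng.standard_normal(4)           # a_0 := x
mu1 = rng.standard_normal((3, 4))    # means of the first-layer weight rows
a1 = sample_sign_layer(x, mu1, rng)  # one draw of a_1 ~ Q~_1(.|x)
print(a1.shape)
```

Stacking such draws layer by layer gives the hierarchical sample used in place of the exponential sum (9).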

A marginalised REINFORCE-style gradient estimator for *conditional* distributions can then be used; this does not necessarily have better statistical properties, but in combination with the above it is much more computationally efficient. This idea of "conditional sampling" is inspired by the local reparameterisation trick. Using samples $\{(a_1^t, \ldots, a_{L-1}^t)\}_{t=1}^{T} \sim \tilde{Q}$,

$$\frac{\partial F_Q(\mathbf{x})}{\partial \boldsymbol{\mu}_{l,i}} \approx \frac{1}{T}\sum_{t=1}^{T} \operatorname{erf}\left(\frac{\boldsymbol{\mu}_L \cdot \mathbf{a}_{L-1}^t}{\sqrt{2}\,\|\mathbf{a}_{L-1}^t\|}\right) \frac{\partial}{\partial \boldsymbol{\mu}_{l,i}} \log \tilde{q}_{l,i}(a_{l,i}^t|\mathbf{a}_{l-1}^t). \tag{10}$$
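A sketch of the estimator (10) for a single hidden unit of a one-hidden-layer network (all names are ours; the score function is obtained by differentiating $\log \tilde{q}_{l,i}$ above, and this is an illustrative implementation rather than the paper's code):

```python
import numpy as np
from math import erf, exp, pi, sqrt

rng = np.random.default_rng(4)
d0, d1, T = 4, 3, 500
x = rng.standard_normal(d0)
mu1 = rng.standard_normal((d1, d0))   # hidden-layer weight means
muL = rng.standard_normal(d1)         # output-layer weight mean

def dlogq_dmu(mu_li, a_prev, a):
    # Score of q~_{l,i}(a | a_prev) with respect to mu_li, for a in {+1, -1}.
    n = np.linalg.norm(a_prev)
    s = (mu_li @ a_prev) / (sqrt(2) * n)
    return a * (2 / sqrt(pi)) * exp(-s * s) / (sqrt(2) * n) * a_prev / (1 + a * erf(s))

grad = np.zeros(d0)
for _ in range(T):
    p = np.array([0.5 * (1 + erf((m @ x) / (sqrt(2) * np.linalg.norm(x)))) for m in mu1])
    a1 = np.where(rng.random(d1) < p, 1.0, -1.0)    # a_1^t ~ Q~_1(.|x)
    out = erf((muL @ a1) / (sqrt(2) * np.linalg.norm(a1)))
    grad += out * dlogq_dmu(mu1[0], x, a1[0]) / T   # Equation (10) for unit (1, 1)

print(grad.shape)
```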

This formulation, along with Equation (9), resembles the PBGNet model of [6], but is derived in a very different way. Indeed, both are equivalent in the single-hidden-layer case, but with more layers PBGNet uses an unusual tree-structured network to make the individual activations independent and avoid an exponential computational dependency on the depth in Equation (9). This makes the above summation exactly calculable, but it is still not efficient enough in practice, so they resort further to a Monte Carlo approximation: informally, this draws new samples for every layer $l$ based on an average over those from the previous layer, $a_l \mid \{a_{l-1}^{(t)}\}_{t=1}^{T} \sim \frac{1}{T}\sum_{t=1}^{T} \tilde{Q}(a_l \mid a_{l-1}^{(t)})$.

This is all justified within the tree-structured framework but leads to an exponential KL penalty which, as hinted by Letarte et al. [6] and shown empirically in Section 6, makes PAC–Bayes bound optimisation strongly favour shallower such networks. Our formulation avoids this, is more general (applying to alternative network structures), and we believe it is significantly easier to understand and use in practice.

#### **5. PAC–Bayesian Objectives with Signed-Outputs**

We now move to obtain binary classifiers with guarantees for the expected misclassification error, $R^{0-1}$, which we do by optimizing PAC–Bayesian bounds. Such bounds (as in Theorem 1) will usually involve the non-differentiable and non-convex misclassification loss $\ell_{0-1}$. However, to train a neural network we need to replace this by a differentiable surrogate, as discussed in the introduction.

Here we adopt a different approach by using our signed-output networks, where since $f(x) \in \{+1, -1\}$, there is an exact equivalence between the linear and misclassification losses, $\ell_{0-1}(f(x), y) = \ell_{\mathrm{lin}}(f(x), y)$, avoiding an extra factor of two from the inequality $\ell_{0-1} \le 2\ell_{\mathrm{lin}}$.
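This equivalence is easy to check directly, assuming the usual convention $\ell_{\mathrm{lin}}(f, y) = (1 - yf)/2$ (the function names below are ours):

```python
def l_lin(f, y):
    # Linear loss with the standard convention (1 - y f) / 2 (our assumption here)
    return (1 - y * f) / 2

def l_01(f, y):
    # Misclassification loss
    return float(f != y)

# Exact equivalence on signed outputs f in {+1, -1}
for f in (+1, -1):
    for y in (+1, -1):
        assert l_lin(f, y) == l_01(f, y)
print("exact equivalence on signed outputs")
```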

Although we have only moved the non-differentiability into *f*, the form of a PAC–Bayesian bound and the linearity of the loss and expectation allow us to go further and aggregate,

$$\mathbb{E}_{f \sim Q}\,\ell_{0-1}(f(\mathbf{x}), y) = \mathbb{E}_{f \sim Q}\,\ell_{\mathrm{lin}}(f(\mathbf{x}), y) = \ell_{\mathrm{lin}}(F_Q(\mathbf{x}), y) \tag{11}$$

which allows us to use the tools discussed in Section 4 to obtain lower-variance estimates and gradients. Below we prove a small result to show the utility of this:

**Proposition 5.** *Under the conditions of Proposition 3 and y* ∈ {+1, −1}*,*

$$\begin{aligned} \mathbb{V}_Q[\ell_{\mathrm{lin}}(\hat{F}_Q^*(\mathbf{x}), y)] &\le \mathbb{V}_Q[\ell_{\mathrm{lin}}(\hat{F}_Q(\mathbf{x}), y)] \\ &\le \mathbb{V}_{f \sim Q}[\ell_{0-1}(f(\mathbf{x}), y)] = \frac{1}{4}\left(1 - |F_Q(\mathbf{x})|^2\right). \end{aligned}$$

**Proof.**

$$\mathbb{V}_Q[\ell_{\mathrm{lin}}(\hat{F}_Q(\mathbf{x}), y)] = \mathbb{E}_Q\left|\frac{1}{2}\bigl(y\hat{F}_Q(\mathbf{x}) - yF_Q(\mathbf{x})\bigr)\right|^2 = \frac{1}{4}\mathbb{V}_Q[\hat{F}_Q(\mathbf{x})]$$

and a similar result holds for $\hat{F}_Q^*$. When $T = 1$, $f = \hat{F}_Q$ and $\ell_{\mathrm{lin}}(f(\mathbf{x}), y) = \ell_{0-1}(f(\mathbf{x}), y)$. The result then follows from this and Proposition 3.

Combining (11) with Theorem 1, we obtain a directly optimizable, differentiable bound on the misclassification loss without introducing the above-mentioned factor of 2.

**Theorem 2.** *Given $P$ on $\theta$ and $\alpha > 1$, for all $Q$ on $\theta$ and $\lambda > 1$ simultaneously with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$,*

$$\mathbb{E}\_{\theta \sim Q} R^{0-1}(f\_{\theta}) \le \Phi\_{\lambda/m}^{-1} \left[ R\_S^{\text{lin}}(F\_Q) + \frac{\alpha}{\lambda} \Delta \right]$$

*with* $\Phi_\gamma^{-1}(t) = \frac{1 - \exp(-\gamma t)}{1 - \exp(-\gamma)}$, $f_\theta : \mathbb{R}^d \to \{+1, -1\}$, $\theta \in \Theta$, *and* $\Delta = \mathrm{KL}(Q\|P) - \log\delta + 2\log\left(\frac{\log \alpha^2\lambda}{\log\alpha}\right)$.

Thus, for each *λ*, which can be held fixed ("**fix-***λ*") or simultaneously optimized throughout training for automatic regularisation tuning ("**optim-***λ*"), we obtain a gradient descent objective:

$$R_S^{\mathrm{lin}}(\hat{F}_Q^*) + \frac{\mathrm{KL}(Q\|P)}{\lambda}. \tag{12}$$
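A sketch of evaluating the right-hand side of Theorem 2 for a trained model (helper names and the example input values are ours, chosen only for illustration):

```python
from math import exp, log

def phi_inv(t, gamma):
    # Phi_gamma^{-1}(t) = (1 - exp(-gamma t)) / (1 - exp(-gamma))
    return (1 - exp(-gamma * t)) / (1 - exp(-gamma))

def theorem2_rhs(lin_loss, kl, m, lam, delta, alpha):
    # Right-hand side of Theorem 2: Phi_{lam/m}^{-1}[ R_S^lin + (alpha/lam) Delta ]
    delta_term = kl - log(delta) + 2 * log(log(alpha ** 2 * lam) / log(alpha))
    return phi_inv(lin_loss + (alpha / lam) * delta_term, lam / m)

b = theorem2_rhs(lin_loss=0.1, kl=100.0, m=60000, lam=60000.0, delta=0.05, alpha=1.1)
print(0.0 < b < 1.0)   # a non-vacuous bound for these illustrative values
```

The objective (12) keeps only the $\lambda$-dependent, $Q$-dependent terms of this bound, since $\Phi^{-1}$ is monotone.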

#### **6. Experiments**

All experiments (Table 1) run on "binary"-MNIST, dividing MNIST into two classes, of digits 0–4 and 5–9. Neural networks had three hidden layers with 100 units per layer and **sign**, sigmoid (**sgmd**) or **relu** activations, before a single-unit final layer with sign activation. *Q* was chosen as an isotropic, unit-variance normal distribution with initial means drawn from a truncated normal distribution of variance 0.05. The data-free prior *P* was fixed equal to the initial *Q*, as motivated by Dziugaite and Roy [4] (Section 5 and Appendix B).


**Table 1.** Average (from ten runs) binary-MNIST losses and bounds (*δ* = 0.05) for the best epoch and optimal hyperparameter settings of various algorithms. Hyperparameters and epochs were chosen by bound if available and non-vacuous, otherwise by training linear loss. Bold numbers indicate the best values and standard deviation is reported in italics.

The objectives **fix-***λ* and **optim-***λ* from Section 5 were used for batch-size 256 gradient descent with Adam [17] for 200 epochs. Every five epochs, the bound (for a minimising *λ*) was evaluated using the entire training set; the learning rate was then halved if the bound was unimproved from the previous two evaluations. The best hyperparameters were selected using the best bound achieved in these evaluations through a grid search of initial learning rates ∈ {0.1, 0.01, 0.001} and sample sizes *T* ∈ {1, 10, 50, 100}. Once these were selected, training was repeated 10 times to obtain the values in Table 1.

*λ* in **optim-***λ* was optimised through Theorem 2 on alternate mini-batches with SGD and a fixed learning rate of $10^{-4}$ (whilst still using the objective (12), to avoid effectively scaling the learning rate with respect to the empirical loss by the varying *λ*). After preliminary experiments with **fix-***λ*, we set *λ* = *m* = 60,000, the training set size, as is common in Bayesian deep learning.

We also report the values of three baselines: **reinforce**, which uses the fix-*λ* objective without partial-aggregation, forcing the use of REINFORCE gradients everywhere; **mlp**, an unregularised non-stochastic relu neural network with tanh output activation; and the PBGNet model (**pbg**) from Letarte et al. [6]. For the latter, a misclassification error bound obtained through $\ell_{0-1} \le 2\ell_{\mathrm{lin}}$ must be used, as their test predictions were made through the sign of a prediction function ∈ [−1, +1], not ∈ {+1, −1}. Further, despite significant additional hyperparameter exploration, we were unable to train a three-layer network through the PBGNet algorithm directly comparable to our method, likely because of the exponential KL penalty (their Equation (17)) within that framework; to enable comparison, we therefore allowed the number of hidden layers in this scenario to vary in {1, 2, 3}. Other baseline tuning and setup was similar to the above; see Appendix A for more details.

During evaluation, **reinforce** draws a new set of weights for every test example, equivalent to the evaluation of the other models; but doing so during training, with multiple parallel samples, is prohibitively expensive. Two different approaches to straightforward (not partially-aggregated) gradient estimation for this case suggest themselves, arising from different approximations to the *Q*-expected loss of the minibatch $B \subseteq S$ (with data indices $i \in B$). From the identities

$$\begin{aligned} \nabla_\phi\,\mathbb{E}_{\theta \sim q_\phi} R_B(f_\theta) &= \mathbb{E}_{\theta \sim q_\phi}\frac{1}{|B|}\sum_{i \in B} \ell(f_\theta(\mathbf{x}_i), y_i)\,\nabla_\phi \log q_\phi(\theta) \\ &= \frac{1}{|B|}\sum_{i \in B} \mathbb{E}_{\theta \sim q_\phi}\,\ell(f_\theta(\mathbf{x}_i), y_i)\,\nabla_\phi \log q_\phi(\theta) \end{aligned}$$

we obtain two slightly different estimators for $\nabla_\phi\,\mathbb{E}_{\theta \sim q_\phi} R_B(f_\theta)$:

$$\frac{1}{T|B|}\sum_{t=1}^{T}\sum_{i \in B} \ell(f_{\theta^{(t,i)}}(\mathbf{x}_i), y_i)\,\nabla_\phi \log q_\phi(\theta^{(t,i)})$$

$$\frac{1}{T|B|}\sum_{i \in B}\sum_{t=1}^{T} \ell(f_{\theta^t}(\mathbf{x}_i), y_i)\,\nabla_\phi \log q_\phi(\theta^t).$$

The first draws many more samples and has lower variance but is much slower computationally; even aside from the *O*(|*B*|) increase in computation, there is a slowdown as the optimised BLAS matrix routines cannot be used, and the very large matrices involved may not fit in memory (see [16] for more information).

Therefore, as is standard in the Bayesian neural network literature with the pathwise estimator, we use the latter formulation, which has a similar computational complexity to local reparameterisation and our marginalised REINFORCE estimator (10). We note, though, that in preliminary experiments the alternative estimator did not appear to lead to improved results. This underlines the advantage of marginalised sampling, which can achieve lower variance at a similar computational cost.
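The two estimators can be sketched as follows for a toy signed linear model (all names and the model itself are illustrative stand-ins, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(5)
B, T, d = 8, 4, 3
X = rng.standard_normal((B, d))
y = rng.choice([-1.0, 1.0], size=B)
mu = rng.standard_normal(d)      # q_phi = N(mu, I), so grad log q = theta - mu

def loss(theta, x, yi):
    # Linear loss of a toy signed linear predictor
    return 0.5 * (1 - yi * np.sign(theta @ x))

# Variant 1: a fresh theta^{(t,i)} for every (sample, example) pair -- T*B draws
g1 = np.zeros(d)
for t in range(T):
    for i in range(B):
        th = mu + rng.standard_normal(d)
        g1 += loss(th, X[i], y[i]) * (th - mu) / (T * B)

# Variant 2: one theta^t shared across the whole minibatch -- only T draws
g2 = np.zeros(d)
for t in range(T):
    th = mu + rng.standard_normal(d)
    g2 += sum(loss(th, X[i], y[i]) for i in range(B)) * (th - mu) / (T * B)

print(g1.shape, g2.shape)
```

Both are unbiased for the same gradient; the first averages over many more weight draws at a much higher cost.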

#### **7. Discussion**

The experiments demonstrate that partial-aggregation enables training of multi-layer non-differentiable neural networks in a PAC–Bayesian context, which is not possible with REINFORCE gradients or with a multiple-hidden-layer PBGNet [6]: these obtained only vacuous bounds. Our misclassification bounds also improve those of a single-hidden-layer PBGNet.

Our experiments raise a couple of questions. Firstly, why do lower-variance estimates empirically lead to tighter bounds? We speculate that the faster convergence of SGD in this case takes us to a more "local" minimum of the objective, closer to our initialisation. Since most existing PAC–Bayes bounds for neural networks have a very strong dependence on this distance from initialisation through the KL term, this leads to tighter bounds. This distance could also be reduced through other methods we consider out-of-scope, such as the data-dependent bounds employed by Dziugaite and Roy [18] and Letarte et al. [6].

A second and harder question is why the non-stochastic **mlp** model obtains a lower overall error. Bound optimisation is empirically quite conservative and does not necessarily lead to better generalisation; understanding this gap is a key question in the theory of deep learning.

In future work we will develop significant new tools to extend partial aggregation to multi-class classification, and to improve test prediction bounds for $\operatorname{sign}(\hat{F}_Q(\mathbf{x}))$ with $T > 1$, as in PBGNet, which gave slightly improved predictive performance despite the inferior theoretical guarantees.

**Author Contributions:** Conceptualization, F.B. and B.G.; Formal analysis, F.B. and B.G.; Methodology, F.B. and B.G.; Project administration, B.G.; Software, F.B.; Writing—original draft, F.B.; Writing—review and editing, F.B. and B.G. Both authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported in part by the U.S. Army Research Laboratory and the U.S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1. BG acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23-0015-02. FB acknowledges support from the Foundational Artificial Intelligence Centre for Doctoral Training at University College London.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Further Experimental Details**

*Appendix A.1. Aggregating Biases with the Sign Function*

We used a bias term in our network layers, leading to a simple extension of the above formulation, omitted in the main text for conciseness:

$$\mathbb{E}\_{w \sim \mathcal{N}(\mu, \Sigma), b \sim \mathcal{N}(\beta, \sigma^2)} \operatorname{sign}(w \cdot \mathbf{x} + b) = \operatorname{erf}\left(\frac{\mu \cdot \mathbf{x} + \beta}{\sqrt{2(\mathbf{x}^T \Sigma \mathbf{x} + \sigma^2)}}\right)$$

since $w \cdot x + b \sim \mathcal{N}(\mu \cdot x + \beta,\; x^T\Sigma x + \sigma^2)$ and

$$\begin{aligned} \mathbb{E}_{z \sim \mathcal{N}(\alpha, \beta^2)} \operatorname{sign} z &= P(z \ge 0) - P(z < 0) \\ &= \left[1 - \Phi(-\alpha/\beta)\right] - \Phi(-\alpha/\beta) \\ &= 2\Phi(\alpha/\beta) - 1 = \operatorname{erf}\!\left(\frac{\alpha}{\sqrt{2}\,\beta}\right). \end{aligned}$$

The bias and weight covariances were chosen to be diagonal with a scale of 1, which leads to some simplification in the above.
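A Monte Carlo sanity check of this closed form in the diagonal unit-variance case used here (variable names are ours):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
d, N = 3, 200000
x = rng.standard_normal(d)
mu, beta = rng.standard_normal(d), 0.5   # weight and bias means; unit variances

# Closed form with Sigma = I and sigma^2 = 1:
# E sign(w.x + b) = erf((mu.x + beta) / sqrt(2 (||x||^2 + 1)))
closed = erf((mu @ x + beta) / sqrt(2 * (x @ x + 1)))

# Monte Carlo estimate of the same expectation
w = mu + rng.standard_normal((N, d))
b = beta + rng.standard_normal(N)
mc = np.mean(np.sign(w @ x + b))
print(abs(closed - mc) < 0.02)
```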

## *Appendix A.2. Dataset Details*

We used the MNIST dataset, version 3.0.1, available online at http://yann.lecun.com/exdb/mnist/ (accessed on 4 June 2021), which contains 60,000 training examples and 10,000 test examples; these were used without any further split. For the "binary"-MNIST task, the labels +1 and −1 were assigned to digits in {5, 6, 7, 8, 9} and {0, 1, 2, 3, 4}, respectively, and images were rescaled to lie in the interval [0, 1].
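The label mapping can be sketched as (function name is ours):

```python
import numpy as np

def binarize_labels(digits):
    # binary-MNIST targets: digits 5-9 -> +1, digits 0-4 -> -1
    return np.where(np.asarray(digits) >= 5, 1, -1)

print(binarize_labels([0, 4, 5, 9]).tolist())  # [-1, -1, 1, 1]
```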

#### *Appendix A.3. Hyperparameter Search for Baselines*

The baseline comparison values offered with our experiments were optimized similarly to the above; for completeness, we report the full details here.

The MLP model had three hidden ReLU layers of 100 units each, trained with Adam, a learning rate ∈ {0.1, 0.01, 0.001}, and a batch size of 256 for 100 epochs. Complete test and train evaluation was performed after every epoch, and in the absence of a bound, the model and epoch with the lowest training linear loss were selected.

For PBGNet we chose the hyperparameter values from within these grids giving the smallest bound value. Note that, unlike in [6], we do not allow the hidden layer size to vary ∈ {10, 50, 100}, and we use the entire MNIST training set as we do not need a validation set. While attempting to train a three-hidden-layer network, we also searched through the hyperparameter settings with a batch size of 64 as in the original, but after this failed, we returned to the original batch size of 256 with Adam. All experiments were performed using the code from the original paper, available at https://github.com/gletarte/dichotomize-and-generalize (accessed on 4 June 2021).

Since we were unable to train a multiple-hidden-layer network through the PBGNet algorithm, for this model only we explored different numbers of hidden layers ∈ {1, 2, 3}.

#### *Appendix A.4. Final Hyperparameter Settings*

In Table A1 we report the hyperparameter settings used for the experiments in Table 1 after exploration. To save computation, hyperparameter settings that were not learning (defined as having a whole-train-set linear loss of > 0.45 after ten epochs) were terminated early. This was also done on the later evaluation runs, where in a few instances the fix-*λ* sigmoid network failed to train after ten epochs; to handle this we reset the network to obtain the main experimental results.


**Table A1.** Chosen hyperparameter settings and additional details for results in Table 1. Best hyperparameters were chosen by bound if available and non-vacuous, otherwise by best training linear loss through a grid search as described in Section 6 and Appendix A.3. Run times are rounded to nearest 5 min.

For clarity we repeat here the hyperparameter settings and search space:


## *Appendix A.5. Implementation and Runtime*

Experiments were implemented using Python and the TensorFlow library [19]. Reported approximate runtimes are for execution on an NVIDIA GeForce RTX 2080 Ti GPU.

#### **References**


# **Meta-Strategy for Learning Tuning Parameters with Guarantees**

**Dimitri Meunier <sup>1</sup> and Pierre Alquier <sup>2,</sup>\***


**Abstract:** Online learning methods, such as the online gradient algorithm (OGA) and exponentially weighted aggregation (EWA), often depend on tuning parameters that are difficult to set in practice. We consider an online meta-learning scenario, and we propose a meta-strategy to learn these parameters from past tasks. Our strategy is based on the minimization of a regret bound. It allows us to learn the initialization and the step size in OGA with guarantees. It also allows us to learn the prior or the learning rate in EWA. We provide a regret analysis of the strategy, which allows us to identify settings where meta-learning indeed improves on learning each task in isolation.

**Keywords:** meta-learning; hyperparameters; priors; online learning; Bayesian inference; online optimization; gradient descent

## **1. Introduction**

In many applications of modern supervised learning, such as medical imaging or robotics, a large number of tasks is available but many of them are associated with a small amount of data. With few datapoints per task, learning them in isolation would give poor results. In this paper, we consider the problem of learning from a (large) sequence of regression or classification tasks with small sample size. By exploiting their similarities we seek to design algorithms that can utilize previous experience to rapidly learn new skills or adapt to new environments.

Inspired by human ingenuity in solving new problems by leveraging prior experience, *meta-learning* is a subfield of machine learning whose goal is to automatically adapt a learning mechanism from past experiences to rapidly learn new tasks with little available data. Since it "learns the learning mechanism" it is also referred to as *learning-to-learn* [1]. It is seen as a critical problem for the future of machine learning [2]. Numerous formulations exist for meta-learning, and we focus on the problem of *online meta-learning*, where the tasks arrive one at a time and the goal is to efficiently transfer information from the previous tasks to the new ones such that we learn the new tasks as efficiently as possible (this has also been referred to as *lifelong learning*). Each task is in turn processed *online*. To sum up, we have a stream of tasks and, for each task, a stream of observations.

In order to solve online tasks, diverse well-established strategies exist: the perceptron, the online gradient algorithm (OGA), online mirror descent, follow-the-regularized-leader, exponentially weighted aggregation (EWA, also referred to as *generalized Bayes*), etc. We refer the reader to [3–6] for introductions to these algorithms and to so-called regret bounds, which control their generalization errors. We refer to these algorithms as the *within-task* strategies. The big challenge is to design a meta-strategy that uses past experiences to adapt a within-task strategy to perform better on the next tasks.

In this paper, we propose a new meta-learning strategy. The main idea to learn the tuning parameters is to minimize its regret bound. We provide a meta-regret analysis for our strategy. We illustrate our results in the case where the within-task strategy is the online gradient algorithm, and exponentially weighted aggregation. In the case of OGA, the tuning parameters considered are the initialization and the gradient steps. For EWA,

**Citation:** Meunier, D.; Alquier, P. Meta-Strategy for Learning Tuning Parameters with Guarantees. *Entropy* **2021**, *23*, 1257. https://doi.org/ 10.3390/e23101257

Academic Editor: Gholamreza Anbarjafari

Received: 9 August 2021 Accepted: 23 September 2021 Published: 27 September 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

we consider either the learning rate, or the prior. In each case, we compare the regret incurred when learning the tasks in isolation to our meta-regret bound. This allows us to identify settings where meta-learning indeed improves on learning in isolation.

#### *1.1. Related Works*

Meta-learning is similar to multitask learning [7–9] in the sense that the learner faces many tasks to solve. However, in multitask learning, the learner is given a fixed number of tasks, and can learn the connections between these tasks. In meta-learning, the learner must prepare to face future tasks that are not given yet.

Meta-learning is often referred to as learning-to-learn or lifelong learning. The authors of [10] proposed the following distinction: "learning-to-learn" for situations where the tasks are presented simultaneously, and "lifelong learning" for situations where they are presented sequentially. Following this terminology, learning-to-learn algorithms were proposed very early in the literature, with generalization guarantees [11–16].

On the other hand, in the lifelong learning scenario, until recently, algorithms were proposed without generalization guarantees [17,18]. A theoretical study was proposed by [10], but the strategies in that paper are not feasible in practice. This situation was recently improved [19–26]. In a similar context, in [27], the authors propose an efficient strategy to learn the starting point of OGA. However, an application of this strategy to learning the step size does not show any improvement over learning in isolation [28]. The closest work to this paper is [29], which also suggests a regret-bound minimization strategy and provides a meta-regret bound for learning both the initialization and the gradient step. Note, however, that that work remains specific to OGA, while ours can potentially be applied to any online learning algorithm. Indeed, we provide another example: the generalized Bayesian algorithm EWA, for which we learn the prior or the learning rate. Learning the prior is new in the online setting, to our knowledge. It can be related to works in the batch setting [11,13,15,16], but the improvement with respect to learning in isolation is not quantified in those works.

Finally, it is important to note that we focus on the case where the number of tasks *T* is large, while the sample size *n* and the algorithmic complexity of each task are moderately small. When each task is extremely complex, for example training a deep neural network on a huge dataset, our procedure (as well as those discussed above) becomes too expensive. Alternative approaches were proposed, based on optimization via multi-armed bandits [30,31].

#### *1.2. Organization of the Paper*

In Section 2, we introduce the formalism of meta-learning and the notations that will be used throughout the paper. In Section 3, we introduce our meta-learning strategy and its theoretical analysis. In Section 4, we provide the details of our method in the case of meta-learning the initialization and the step size in the online gradient algorithm. Based on our theoretical results, we exhibit explicit situations where meta-learning indeed improves on learning the tasks independently. This is confirmed by experiments reported in this section. In Section 5, we provide the details of our methodology when the algorithm used within tasks is a generalized Bayesian algorithm: EWA. We show how our meta-strategy can be used to tune the learning rate; we also discuss how it can be used to learn priors. The proofs of the main results are given in Section 6.

#### **2. Notations and Preliminaries**

By convention, vectors $v \in \mathbb{R}^d$ are seen as $d \times 1$ matrices (columns). Let $\|v\|$ denote the Euclidean norm of $v$. Let $A^T$ denote the transpose of any $d \times k$ matrix $A$, and $I_d$ the $d \times d$ identity matrix. For two real numbers $a$ and $b$, let $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For $z \in \mathbb{R}$, $z_+ = z \vee 0$ is its positive part. Given a finite set $S$, we let $\mathrm{card}(S)$ denote the cardinality of $S$.

The learner has to solve tasks $t = 1, \dots, T$ sequentially. Each task $t$ consists of $n$ rounds $i = 1, \dots, n$. At each round $i$ of task $t$, the learner has to take a decision $\theta_{t,i}$ in a decision space $\Theta \subseteq \mathbb{R}^d$ for some $d > 0$. Then, a convex loss function $\ell_{t,i} : \Theta \to \mathbb{R}$ is revealed to the learner, who incurs the loss $\ell_{t,i}(\theta_{t,i})$. Classical examples with $\Theta \subset \mathbb{R}^d$ include regression tasks, where $\ell_{t,i}(\theta) = (y_{t,i} - x_{t,i}^T \theta)^2$ for some $x_{t,i} \in \mathbb{R}^d$ and $y_{t,i} \in \mathbb{R}$, and classification tasks, where $\ell_{t,i}(\theta) = (1 - y_{t,i} x_{t,i}^T \theta)_+$ for some $x_{t,i} \in \mathbb{R}^d$, $y_{t,i} \in \{-1, +1\}$.

Throughout the paper, we will assume that the learner uses, for each task, an online decision strategy called the *within-task strategy*, parametrized by a tuning parameter $\lambda \in \Lambda$, where $\Lambda$ is a closed, convex subset of $\mathbb{R}^p$ for some $p > 0$. Examples of such strategies include the online gradient algorithm, given by $\theta_{t,i} = \theta_{t,i-1} - \gamma \nabla \ell_{t,i}(\theta_{t,i-1})$. In this case, the tuning parameters are the initialization (or starting point) $\theta_{t,1} = \vartheta$ and the learning rate (or step size) $\gamma$. That is, $\lambda = (\vartheta, \gamma)$, so $p = d + 1$. The parameter $\lambda$ is kept fixed during the whole task. It is of course possible to use the same parameter $\lambda$ in *all* the tasks. However, we will be interested here in defining *meta-strategies* that allow us to improve $\lambda$ task after task, based on the information available so far. In Section 3, we will define such strategies. For now, let $\lambda_t$ denote the tuning parameter used by the learner throughout task $t$. Figure 1 provides a recap of all the notations.

**Figure 1.** The dynamics of meta-learning.
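As an illustration, the within-task OGA strategy can be sketched in a few lines of Python. This is a minimal sketch; the function name and the quadratic-loss example in the test are ours, not from the paper.

```python
import numpy as np

def oga_task(loss_grads, vartheta, gamma, proj=None):
    """Run the within-task online gradient strategy for one task.

    loss_grads: list of functions, loss_grads[i](theta) returning the
                (sub-)gradient of the i-th loss at theta
    vartheta:   initialization theta_{t,1} (part of the tuning parameter)
    gamma:      step size, kept fixed during the whole task
    proj:       optional projection onto the decision space Theta
    """
    theta = np.array(vartheta, dtype=float)
    decisions = []
    for g in loss_grads:
        decisions.append(theta.copy())    # decision taken at this round
        theta = theta - gamma * g(theta)  # update after the loss is revealed
        if proj is not None:
            theta = proj(theta)
    return decisions
```

For squared losses $\ell_i(\theta) = (y_i - x_i^T\theta)^2$, the gradient is $2 x_i (x_i^T\theta - y_i)$, and the decisions drift from the starting point $\vartheta$ toward the task's parameter.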

Let $\theta_{t,i}^{\lambda}$ denote the decision at round $i$ of task $t$ when the online strategy is used with parameter $\lambda$. We will assume that a regret bound is available for the within-task strategy. By this, we mean that there is a set $\Theta_0 \subset \Theta$ of parameters of interest, and that the learner knows a function $\mathcal{B}_n : \Theta \times \Lambda \to \mathbb{R}$ such that, for any task $t$ and any $\lambda \in \Lambda$,

$$\sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\lambda}) \le \underbrace{\inf\_{\theta \in \Theta\_{0}} \left\{ \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + \mathcal{B}\_{n}(\theta, \lambda) \right\}}\_{=: \mathcal{L}\_{t}(\lambda)}.\tag{1}$$

For OGA, regret bounds can be found, for example, in [4,6] (in this case, $\Theta_0 = \Theta$). Other examples include exponentially weighted aggregation (bounds in [3]; here $\Theta_0$ is a finite set of predictors, while the decisions $\Theta$ are probability distributions on $\Theta_0$). More examples will be discussed in the paper. For a fixed parameter $\theta$, the quantity $\sum_{i=1}^n \ell_{t,i}(\theta_{t,i}^{\lambda}) - \sum_{i=1}^n \ell_{t,i}(\theta)$ measures the difference between the total loss suffered during task $t$ and the loss one would have suffered using the parameter $\theta$. It is thus called "the regret with respect to parameter $\theta$", and $\mathcal{B}_n(\theta, \lambda)$ is usually referred to as a "regret bound". We will call $\mathcal{L}_t(\lambda)$ the "meta-loss". In [29], the authors study a meta-strategy that minimizes the meta-loss of OGA. Indeed, if (1) is tight, minimizing the right-hand side is a good way to ensure that the left-hand side, that is, the cumulated loss, is small. In this work, we will focus on meta-strategies minimizing the meta-loss in a more general context.

The simplest meta-strategy is learning in isolation; that is, we keep $\lambda_t = \lambda_0 \in \Lambda$ for all tasks. The total loss after task $T$ is then bounded by:

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\lambda\_0}) \le \sum\_{t=1}^{T} \mathcal{L}\_t(\lambda\_0). \tag{2}$$

However, when the learner uses a meta-strategy to improve the tuning parameter at the end of each task, the total loss is given by $\sum_{t=1}^T \sum_{i=1}^n \ell_{t,i}(\theta_{t,i}^{\lambda_t})$. We will, in this paper, investigate strategies with meta-regret bounds; that is, bounds of the form

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\lambda\_t}) \le \inf\_{\lambda \in \Lambda} \left\{ \sum\_{t=1}^{T} \mathcal{L}\_t(\lambda) + \mathcal{C}\_T(\lambda) \right\}.\tag{3}$$

Of course, such bounds will be relevant only if the right-hand side of (3) is not larger than the right-hand side of (2), and is significantly smaller in some favourable settings. We show when this is the case in Section 4.

#### **3. Meta-Learning Algorithms**

In this section, we provide two meta-strategies to update *λ* at the end of each task. The first one is a direct application of OGA to meta-learning. It is computationally simpler, but feasible only in the special case where we have an explicit formula for the (sub-)gradient of each $\mathcal{L}_t(\lambda)$; in Section 4, we provide an example where this is the case. The second one is an application of implicit online learning to our setting, and can be used without this assumption. In both cases, we provide a regret bound as in (3), under the following condition.

**Assumption 1.** *For any $t \in \{1, \dots, T\}$, the function $\lambda \mapsto \mathcal{L}_t(\lambda)$ is $L$-Lipschitz and convex.*

## *3.1. Special Case: The Gradient of the Meta-Loss Is Available in Closed Form*

As each $\mathcal{L}_t$ is convex, its subdifferential at each point of $\Lambda$ is non-empty. For the sake of simplicity, we will use the notation $\nabla \mathcal{L}_t(\lambda)$ in the following formulas to denote *any* element of the subdifferential of $\mathcal{L}_t$ at $\lambda$. We define the online gradient meta-strategy (OGMS) with step $\alpha > 0$ and starting point $\lambda_1 \in \Lambda$: for any $t > 1$,

$$
\lambda\_t = \Pi\_\Lambda [\lambda\_{t-1} - \alpha \nabla \mathcal{L}\_{t-1}(\lambda\_{t-1})] \tag{4}
$$

where $\Pi_\Lambda$ denotes the orthogonal projection on $\Lambda$.
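One OGMS step can be sketched in Python as follows. For illustration, we assume that $\Lambda$ is a Euclidean ball, for which the projection has a closed form; the function name is ours.

```python
import numpy as np

def ogms_step(lam_prev, grad_meta_loss, alpha, radius):
    """One OGMS update (Equation (4)): a gradient step on the previous
    meta-loss, followed by orthogonal projection onto Lambda, taken here
    to be the Euclidean ball of the given radius."""
    lam = lam_prev - alpha * grad_meta_loss(lam_prev)
    norm = np.linalg.norm(lam)
    if norm > radius:
        lam = lam * (radius / norm)  # projection onto the ball
    return lam
```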

#### *3.2. The General Case*

We now cover the general case, where a formula for the gradient of L*t*(*λ*) might not be available. We propose to apply a strategy that was first defined in [32] for online learning, and studied under the name "implicit online learning" (we refer the reader to [33] and the references therein). In the meta-learning context, this gives the online proximal meta-strategy (OPMS) with step *α* > 0 and starting point *λ*<sup>1</sup> ∈ Λ, defined by:

$$\lambda\_t = \operatorname\*{argmin}\_{\lambda \in \Lambda} \left\{ \mathcal{L}\_{t-1}(\lambda) + \frac{\|\lambda - \lambda\_{t-1}\|^2}{2\alpha} \right\}. \tag{5}$$

Using classical notations, e.g., [34], we can rewrite this definition with the proximal operator (hence the name of the method). Indeed, $\lambda_t = \mathrm{prox}_{\alpha \mathcal{L}_{t-1}}(\lambda_{t-1})$, where the proximal operator is given, for any $x \in \Lambda$ and any convex function $f : \Lambda \to \mathbb{R}$, by

$$\text{prox}\_f(x) = \underset{\lambda \in \Lambda}{\text{argmin}} \left\{ f(\lambda) + \frac{\|x - \lambda\|^2}{2} \right\}.\tag{6}$$
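To make the operator concrete, here is a small numerical check in Python: for $f(\lambda) = \tau|\lambda|$ in one dimension (an example of ours, not from the paper), the proximal operator is the classical soft-thresholding map, which we compare against a brute-force minimization of (6).

```python
import numpy as np

def prox_numeric(f, x, grid):
    """Brute-force evaluation of prox_f(x) = argmin_lam f(lam) + (x - lam)^2 / 2
    over a fine grid (illustration only)."""
    vals = f(grid) + 0.5 * (x - grid) ** 2
    return grid[np.argmin(vals)]

def soft_threshold(x, tau):
    """Closed-form prox of lam -> tau * |lam| (soft thresholding)."""
    return np.sign(x) * max(abs(x) - tau, 0.0)
```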

This strategy is feasible in practice in the regime we are interested in; that is, when $n$ is small or moderately large, and $T \to \infty$. The learner has to store all the losses of the current task, $\ell_{t-1,1}, \dots, \ell_{t-1,n}$. At the end of the task, the learner can use any convex optimization algorithm to minimize, with respect to $(\theta, \lambda) \in \Theta \times \Lambda$, the function

$$F\_t(\theta,\lambda) = \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + \mathcal{B}\_{n}(\theta,\lambda) + \frac{\|\lambda - \lambda\_{t-1}\|^2}{2\alpha}.\tag{7}$$

We can use a (projected) gradient descent on $F_t$ or its accelerated variants [35].

## *3.3. Regret Analysis*

A direct application of known results to the setting of this paper leads to the following proposition. For the sake of completeness, we still provide the proofs in Section 6.

**Proposition 1.** *Under Assumption 1, using either OGMS or OPMS with step $\alpha > 0$ and starting point $\lambda_1 \in \Lambda$ leads to*

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\lambda\_t}) \le \inf\_{\lambda \in \Lambda} \left\{ \sum\_{t=1}^{T} \mathcal{L}\_t(\lambda) + \frac{\alpha T L^2}{2} + \frac{\|\lambda - \lambda\_1\|^2}{2\alpha} \right\}.\tag{8}$$

The proof can be found in Section 6.

#### **4. Example: Learning the Tuning Parameters of Online Gradient Descent**

In all this section, we work under the following condition.

**Assumption 2.** *For any $(t, i) \in \{1, \dots, T\} \times \{1, \dots, n\}$, the function $\ell_{t,i}$ is $\Gamma$-Lipschitz and convex.*

## *4.1. Explicit Meta-Regret Bound*

We study the situation where the learner uses (projected) OGA as the within-task strategy; that is, $\Theta = \{\theta \in \mathbb{R}^d : \|\theta\| \le C\}$ and, for any $i > 1$,

$$\theta\_{t,i} = \Pi\_{\Theta}[\theta\_{t,i-1} - \gamma \nabla \ell\_{t,i}(\theta\_{t,i-1})].\tag{9}$$

With such a strategy, we already mentioned that $\lambda = (\vartheta, \gamma) \in \Lambda \subset \Theta \times \mathbb{R}_+$ contains an initialization and a step size. An application of the results in Chapter 11 of [3] gives $\mathcal{B}_n(\theta, \lambda) = \mathcal{B}_n(\theta, (\vartheta, \gamma)) = \gamma \Gamma^2 n / 2 + \|\theta - \vartheta\|^2 / (2\gamma)$. So

$$\mathcal{L}\_t((\vartheta,\gamma)) = \inf\_{\|\theta\| \le C} \left\{ \sum\_{i=1}^n \ell\_{t,i}(\theta) + \frac{\gamma \Gamma^2 n}{2} + \frac{\|\theta - \vartheta\|^2}{2\gamma} \right\}. \tag{10}$$

It is quite direct to check that Assumption 1 holds in this setting. We summarize this in the following proposition.

**Proposition 2.** *Under Assumption 2, assume that the learner uses OGA as the inner algorithm. Assume $\Lambda = \{\vartheta \in \mathbb{R}^d : \|\vartheta\| \le C\} \times [\underline{\gamma}, \bar{\gamma}]$ for some $C > 0$ and $0 < \underline{\gamma} < \bar{\gamma} < \infty$. Then Assumption 1 is satisfied with*

$$L := \sqrt{\frac{n^2 \Gamma^4}{4} + \frac{4C^2}{\underline{\gamma}^2} + \frac{4C^4}{\underline{\gamma}^4}}.\tag{11}$$

So, when the learner uses one of the meta-strategies OGMS or OPMS, we can apply Proposition 1. This leads to the following theorem.

**Theorem 1.** *Under the assumptions of Proposition 2, with $\underline{\gamma} = 1/n^\beta$ for some $\beta > 0$ and $\bar{\gamma} = C^2$, when the learner uses either OGMS or OPMS with*

$$\alpha = \frac{C}{L} \sqrt{\frac{4 + C^2}{T}} \tag{12}$$

*(where L is given by* (11)*), we have:*

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\lambda\_t}) \le \inf\_{\theta\_1, \dots, \theta\_T \in \Theta} \left\{ \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t) + \mathcal{C}(\Gamma, C) \left[ n^{1 \vee 2\beta} \sqrt{T} + \left( n^{1-\beta} + \sigma(\theta\_1^T) \sqrt{n} \right) T \right] \right\} \tag{13}$$

*where* C(Γ, *C*) > 0 *depends only on* (Γ, *C*) *and where:*

$$\sigma(\theta\_1^T) = \sqrt{\frac{1}{T} \sum\_{t=1}^T \left\| \theta\_t - \frac{1}{T} \sum\_{s=1}^T \theta\_s \right\|^2}. \tag{14}$$

Let us compare this result with learning in isolation, as defined in (2); that is, solving the sequence of tasks with a constant hyperparameter $\lambda = (\vartheta, \gamma)$. For the usual choice $\vartheta = 0$ and $\gamma = c/\sqrt{n}$, where $c$ is a constant that depends on neither $n$ nor $T$, OGA leads to a regret in $\mathcal{O}(\sqrt{n})$. After $T$ tasks, learning in isolation thus leads to a regret in $T\sqrt{n}$. Our strategies with $\beta = 1$ lead to a regret in

$$n^2\sqrt{T} + \left(1 + \sigma(\theta\_1^T)\sqrt{n}\right)T.\tag{15}$$

The term $n^2\sqrt{T}$ is the price to pay for meta-learning; in the regime we are interested in (small $n$, large $T$), it is smaller than $T\sqrt{n}$. Consider now the leading term. In the worst-case scenario, it is also in $T\sqrt{n}$. However, when there are good predictors $\theta_1, \dots, \theta_T$ for tasks $1, \dots, T$, respectively, such that $\sigma(\theta_1^T)$ is small, we see the improvement with respect to learning in isolation. The extreme case is when there is a good predictor $\theta^*$ that predicts well for all tasks. In this case, the regret with respect to $\theta_1 = \cdots = \theta_T = \theta^*$ is in $n^2\sqrt{T} + T$, which improves significantly on learning in isolation. Note, however, that using a different meta-strategy, specifically designed for OGA, the authors of [29] obtain a better dependence on $T$ when $\sigma(\theta_1^T) = 0$.

Let us now discuss the implementation of our meta-strategy. We first remark that, under the quadratic loss, it is possible to derive a formula for $\mathcal{L}_t$, which allows us to use OGMS. We then discuss OPMS for the general case.

## *4.2. Special Case: Quadratic Loss*

First, consider $\ell_{t,i}(\theta) = (y_{t,i} - x_{t,i}^T \theta)^2$ for some $y_{t,i} \in \mathbb{R}$ and $x_{t,i} \in \mathbb{R}^d$. Assumption 2 is satisfied if we assume, moreover, that all $|y_{t,i}| \le c$ and $\|x_{t,i}\| \le b$, with $\Gamma = 2bc + 2b^2 C$. In this case,

$$\mathcal{L}\_t((\vartheta,\gamma)) = \inf\_{\|\theta\| \le C} \left\{ \sum\_{i=1}^n (y\_{t,i} - x\_{t,i}^T \theta)^2 + \frac{\gamma \Gamma^2 n}{2} + \frac{\|\theta - \vartheta\|^2}{2\gamma} \right\}.\tag{16}$$

Define $Y_t = (y_{t,1}, \dots, y_{t,n})^T$ and $X_t = (x_{t,1} | \dots | x_{t,n})^T$. The minimizer of $\sum_{i=1}^n (y_{t,i} - x_{t,i}^T \theta)^2 + \|\theta - \vartheta\|^2/(2\gamma)$ with respect to $\theta$ is known as the ridge regression estimator:

$$\hat{\theta}\_{t} = \left(X\_{t}^{T}X\_{t} + \frac{I\_{d}}{2\gamma}\right)^{-1} \left(X\_{t}^{T}Y\_{t} + \frac{\vartheta}{2\gamma}\right). \tag{17}$$

This also coincides with the minimizer in the right-hand side of (16) on the condition that $\|\hat{\theta}_t\| \le C$. In this case, by plugging $\hat{\theta}_t$ into (16), we have a closed-form formula for $\mathcal{L}_t((\vartheta, \gamma))$, and an explicit (but cumbersome) formula for its gradient. It is thus possible to use the OGMS strategy to update $\lambda = (\vartheta, \gamma)$.
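A direct transcription of the estimator (17) in Python (a sketch under our own naming conventions; $X_t$ has the $x_{t,i}^T$ as rows and $Y_t$ stacks the $y_{t,i}$):

```python
import numpy as np

def ridge_theta_hat(X, Y, vartheta, gamma):
    """Minimizer of sum_i (y_i - x_i^T theta)^2 + ||theta - vartheta||^2 / (2 gamma):
    the ridge regression estimator of Equation (17), shrunk toward vartheta."""
    d = X.shape[1]
    A = X.T @ X + np.eye(d) / (2.0 * gamma)
    b = X.T @ Y + vartheta / (2.0 * gamma)
    return np.linalg.solve(A, b)
```

One can verify that the gradient of the objective vanishes at the returned point.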

## *4.3. The General Case*

In the general case, denote $\lambda_{t-1} = (\vartheta_{t-1}, \gamma_{t-1})$; then $\lambda_t = (\vartheta_t, \gamma_t)$ is obtained by minimizing

$$F\_t(\theta, (\vartheta, \gamma)) = \sum\_{i=1}^n \ell\_{t,i}(\theta) + \frac{\gamma \Gamma^2 n}{2} + \frac{\|\theta - \vartheta\|^2}{2\gamma} + \frac{\|\vartheta - \vartheta\_{t-1}\|^2 + (\gamma - \gamma\_{t-1})^2}{2\alpha} \tag{18}$$

with respect to $\theta$, $\vartheta$, $\gamma$. Any efficient minimization procedure can be used. In our experiments, we used a projected gradient descent, the gradient being given by:

$$\frac{\partial F\_t}{\partial \theta} = \sum\_{i=1}^n \nabla \ell\_{t,i}(\theta) + \frac{\theta - \vartheta}{\gamma}, \tag{19}$$

$$\frac{\partial F\_t}{\partial \vartheta} = \frac{\vartheta - \theta}{\gamma} + \frac{\vartheta - \vartheta\_{t-1}}{\alpha},\tag{20}$$

$$\frac{\partial F\_t}{\partial \gamma} = \frac{\Gamma^2 n}{2} - \frac{\|\theta - \vartheta\|^2}{2\gamma^2} + \frac{\gamma - \gamma\_{t-1}}{\alpha}. \tag{21}$$

Note that even though we do not obtain, *stricto sensu*, the minimizer of $F_t$, we can get arbitrarily close to it by taking a large enough number of steps. The main difference between this algorithm and the strategy suggested in [29] is that ours is obtained by applying the general proximal update introduced in Equation (7), while they decoupled the updates of the initialization and of the learning rate.
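The projected gradient descent described above can be sketched as follows. This is a sketch with our own step size, iteration count, and function name; it follows the gradient formulas (19)–(21), projecting $\theta$ and $\vartheta$ onto the ball of radius $C$ and $\gamma$ onto an interval.

```python
import numpy as np

def opms_update(loss_grads, lam_prev, Gamma, alpha, C, gamma_bounds,
                n_steps=200, lr=0.01):
    """Approximately minimize F_t over (theta, vartheta, gamma) by projected
    gradient descent, then return the new tuning parameter (vartheta, gamma).

    loss_grads: list of n functions theta -> gradient of the i-th loss
    lam_prev:   (vartheta_{t-1}, gamma_{t-1})
    """
    n = len(loss_grads)
    vartheta_prev, gamma_prev = lam_prev
    theta = np.array(vartheta_prev, dtype=float)
    vartheta = np.array(vartheta_prev, dtype=float)
    gamma = float(gamma_prev)
    lo, hi = gamma_bounds
    for _ in range(n_steps):
        g_theta = sum(g(theta) for g in loss_grads) + (theta - vartheta) / gamma
        g_vartheta = (vartheta - theta) / gamma + (vartheta - vartheta_prev) / alpha
        g_gamma = (Gamma ** 2 * n / 2.0
                   - float(np.dot(theta - vartheta, theta - vartheta)) / (2.0 * gamma ** 2)
                   + (gamma - gamma_prev) / alpha)
        theta -= lr * g_theta
        vartheta -= lr * g_vartheta
        gamma -= lr * g_gamma
        for v in (theta, vartheta):   # project onto the ball of radius C
            nv = np.linalg.norm(v)
            if nv > C:
                v *= C / nv
        gamma = min(max(gamma, lo), hi)  # project onto [lo, hi]
    return vartheta, gamma
```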

## *4.4. Experimental Study*

In this section, we use simulated data to compare the numerical performance of OPMS with learning the tasks in isolation with online gradient descent (I-OGA). To measure the impact of learning the gradient step $\gamma$, we also introduce mean-OPMS, which uses the same strategy as OPMS but only learns the starting point $\vartheta$ (it is thus close to [27]). We present the results for regression tasks with the mean-squared-error loss, and then for classification with the hinge loss. The notebooks of the experiments can be found online: https://dimitri-meunier.github.io/ (accessed on 26 September 2021).

## 4.4.1. Synthetic Regression

At each round $t = 1, \dots, T$, the meta-learner sequentially receives a regression task that corresponds to a dataset $(x_{t,i}, y_{t,i})_{i=1,\dots,n}$ generated as $y_{t,i} = x_{t,i}^T \theta_t + \epsilon_{t,i}$, $x_{t,i} \in \mathbb{R}^d$. The noise is $\epsilon_{t,i} \sim \mathcal{U}([-\sigma^2, \sigma^2])$ and the $\epsilon_{t,i}$ are all independent; the inputs are uniformly sampled on the $(d-1)$-unit sphere $\mathbb{S}^{d-1}$; and $\theta_t = r u + \theta_0$, with $u \sim \mathcal{U}(\mathbb{S}^{d-1})$, $\theta_0 \in \mathbb{R}^d$, $r \in \mathbb{R}_+$. We take $d = 20$, $n = 30$, $T = 200$, $\sigma^2 = 0.5$ and $\theta_0$ with all components equal to 5. In this setting, $\theta_0$ is a common bias between the tasks, $\sigma^2$ is the inter-task variance, and $r$ characterizes the task similarity. We experiment with different values of $r \in \{0, 5, 10, 30\}$ to observe the impact of task similarity on the meta-learning process. The smaller $r$, the closer the tasks; in the extreme case $r = 0$, the tasks are identical, in the sense that the parameters $\theta_t$ of the tasks are all the same. We draw attention to the fact that a cross-validation procedure to select $\alpha$ (the parameter of OGMS or OPMS, see Equation (5)) or $\gamma$ is not valid in the online setting, as it would require knowledge of several tasks in advance for the former, and of several datapoints in advance within each task for the latter. Moreover, the theoretical values are based on a worst-case analysis and lead in practice to slow learning. In practice, setting these values to the correct order of magnitude, without adjusting the constants, led to better results.
So, for mean-OPMS and OPMS we set $\alpha = 1/\sqrt{T}$; for OPMS and I-OGA we set $\gamma = 1/\sqrt{n}$. Instead of cross-validation, one can launch several online learners in parallel with different parameter values and pick the best one (or aggregate them). That is the strategy we use to select $\Gamma$ for OPMS. Note that the exact value of $\Gamma$ is usually unknown in practice; its automatic calibration is an important open question. To solve (18), after each task we use the exact solution for mean-OPMS and a projected Newton descent with 10 steps for OPMS. We observed that not reaching the exact solution of (18) does not harm the performance of the algorithm, and 10 steps are sufficient to reach convergence. The results are displayed in Table 1 and Figure 2. In Figure 2, for each task $t = 1, \dots, T$, we report the average end-of-task loss $MSE_t = \sum_{i=1}^n \ell_{t,i}(\theta_{t,n})/n$, averaged over 50 independent runs (with confidence intervals). Table 1 reports $MSE_t$ averaged over the 100 most recent tasks. The results confirm our theoretical findings: learning $\gamma$ can bring a substantial benefit over just learning the starting point, which in turn brings a considerable benefit with respect to learning the tasks in isolation. Learning the gradient step makes the meta-learner more robust to task dissimilarities (i.e., when $r$ increases), as shown in Figure 2. In the regime where $r$ is low, learning the gradient step does not help the meta-learner, as it takes more steps to reach convergence. Overall, both meta-learners are consistently better than learning the tasks in isolation, since the number of observations per task is low.
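The task generator of this experiment can be reproduced as follows (a sketch with our own function name; uniform directions on the sphere are obtained by normalizing Gaussian vectors):

```python
import numpy as np

def sample_task(rng, d=20, n=30, r=5.0, sigma2=0.5):
    """Generate one synthetic regression task: inputs uniform on the unit
    sphere S^{d-1}, theta_t = r*u + theta_0 with u uniform on S^{d-1},
    and noise uniform on [-sigma2, sigma2]."""
    theta0 = np.full(d, 5.0)                  # common bias between tasks
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                    # uniform direction on the sphere
    theta_t = r * u + theta0
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    eps = rng.uniform(-sigma2, sigma2, size=n)
    Y = X @ theta_t + eps
    return X, Y, theta_t
```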

**Figure 2.** Performance of learning in isolation with OGA (**I-OGA**), OPMS to learn initialization (**mean-OPMS**) and OPMS to learn initialization and step size (**OPMS**). We report the average end-of-task MSE losses at the end of each task, for different values of the task-similarity index *r* ∈ {0, 5, 10, 30}. The results are averaged over 50 independent runs to get confidence intervals.

**Table 1.** Average end-of-task MSE of the 100 last tasks (averaged over 50 independent runs).


4.4.2. Synthetic Classification

At each round $t = 1, \dots, T$, the meta-learner sequentially receives a binary classification task with the hinge loss, corresponding to a dataset $(x_{t,i}, y_{t,i})_{i=1,\dots,n}$. The binary labels $\{-1, 1\}$ are generated from a logistic model, $\mathbb{P}(y = 1) = (1 + \exp(-x^T \theta_t))^{-1}$. The task parameters $\theta_t$ and the inputs are generated as in the regression setting. To add some noise, we shuffle 10% of the labels. We take $d = 10$, $n = 100$, $T = 500$, $r = 2$. For mean-OPMS and OPMS we set $\alpha = 1/\sqrt{T}$; for OPMS and I-OGA we set $\gamma = 1/\sqrt{n}$. For the optimisation of $F_t$ (18), with both OPMS and mean-OPMS, we use a projected gradient descent with 50 steps.

In Figure 3, for each task $t = 1, \dots, T$, we report the regret on the end-of-task losses, $R(t) = \frac{1}{nt} \sum_{k=1}^t \sum_{i=1}^n \ell_{k,i}(\theta_{k,n})$, averaged over 10 independent runs (with confidence intervals). As in the regression setting, the results confirm our theoretical findings: by learning $\gamma$ (OPMS), we reach a better overall performance than by just learning the initialization (mean-OPMS), and a substantially better one than by learning the tasks independently (I-OGA). Note that, in the classification setting, there is no known closed-form expression for the meta-gradient; therefore, OGMS cannot be used.

## **5. Second Example: Learning the Prior or the Learning Rate in Exponentially Weighted Aggregation**

In this section, we will study a generalized Bayesian method: exponentially weighted aggregation (EWA). Consider a *finite* set $\Theta_0 = \{\theta_1, \dots, \theta_M\} \subset \mathbb{R}^d$. EWA depends on a prior distribution $\pi$ on $\Theta_0$ and on a learning rate $\eta > 0$, and returns a decision in $\Theta = \mathrm{conv}(\theta_1, \dots, \theta_M)$, the convex hull of $\Theta_0$. In this section, we work under the following condition.

**Assumption 3.** *There is a $B \in \mathbb{R}_+^*$ such that, for any $(t, i) \in \{1, \dots, T\} \times \{1, \dots, n\}$, the function $\ell_{t,i} : \Theta \to [0, B]$ is convex.*

We will sometimes use a stronger assumption.

**Assumption 4.** *There is a $C \in \mathbb{R}_+^*$ such that, for any $(t, i) \in \{1, \dots, T\} \times \{1, \dots, n\}$, the function $\theta \mapsto \exp(-\ell_{t,i}(\theta)/C)$ is concave.*

Examples of situations in which Assumption 4 is satisfied are provided in [3]. Note that Assumption 4 implies Assumption 3.

## *5.1. Reminder on EWA*

The update in EWA is given by:

$$\theta\_{t,i} = \sum\_{\theta \in \Theta\_0} p\_{t,i}(\theta)\,\theta \tag{22}$$

where the $p_{t,i}$ are weights defined by

$$p\_{t,i}(\theta) = \frac{\exp\left[-\eta \sum\_{j=1}^{i-1} \ell\_{t,j}(\theta)\right] \pi(\theta)}{\sum\_{\theta' \in \Theta\_0} \exp\left[-\eta \sum\_{j=1}^{i-1} \ell\_{t,j}(\theta')\right] \pi(\theta')}. \tag{23}$$
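In code, the weights (23) are a softmax of the negative cumulative losses shifted by the log prior; a numerically stable sketch (the function name is ours):

```python
import numpy as np

def ewa_weights(cum_losses, log_prior, eta):
    """Weights p_{t,i} of Equation (23). cum_losses[k] holds
    sum_{j=1}^{i-1} loss_{t,j}(theta_k); log_prior[k] = log pi(theta_k)."""
    s = -eta * cum_losses + log_prior
    s -= s.max()                  # shift for numerical stability
    w = np.exp(s)
    return w / w.sum()
```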

The strategy is studied in detail in [3]. We refer the reader to [36] and the references therein for connections to Bayesian inference. We recall the following regret bounds from [3]. First, under Assumption 3,

$$\sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}) \le \min\_{\theta \in \Theta\_0} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + \frac{\eta n B^2}{8} + \frac{\log \frac{1}{\pi(\theta)}}{\eta} \right]. \tag{24}$$

Moreover, under the stronger Assumption 4,

$$\sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}) \le \min\_{\theta \in \Theta\_0} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + C \log \frac{1}{\pi(\theta)} \right]. \tag{25}$$

In Section 5.2, we work in the general setting (Assumption 3) and use our meta-strategy OPMS or OGMS to learn $\eta$. In Section 5.3, we use OPMS or OGMS to learn $\pi$ under Assumption 4.

#### *5.2. Learning the Rate η*

Consider the uniform prior, $\pi(\theta) = 1/M$ for any $\theta \in \Theta_0$. Then, the regret bound (24) becomes:

$$\sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}) \le \min\_{\theta \in \Theta\_0} \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + \frac{\eta n B^2}{8} + \frac{\log M}{\eta} \tag{26}$$

and it is then possible to optimize it explicitly with respect to $\eta$. The value minimizing the bound is $\eta = (2/B)\sqrt{2\log(M)/n}$, and the regret bound becomes:

$$\sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}) \le \min\_{\theta \in \Theta\_0} \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + B\sqrt{\frac{n\log M}{2}}.\tag{27}$$
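This optimization is elementary calculus; the following snippet checks numerically that $\eta = (2/B)\sqrt{2\log(M)/n}$ minimizes the overhead of the bound (26) and attains the value in (27). The values of $n$, $B$, $M$ are ours, for illustration.

```python
import numpy as np

def ewa_overhead(eta, n, B, M):
    """Overhead of the regret bound (26): eta * n * B^2 / 8 + log(M) / eta."""
    return eta * n * B ** 2 / 8.0 + np.log(M) / eta

n, B, M = 100, 1.0, 50
eta_star = (2.0 / B) * np.sqrt(2.0 * np.log(M) / n)  # minimizer of the bound
best = ewa_overhead(eta_star, n, B, M)
# at eta_star, the overhead equals B * sqrt(n * log(M) / 2), as in (27)
```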

In practice, however, while it is often reasonable to assume that the loss function is bounded (as in Assumption 3), one very often does not know a tight upper bound. Thus, one may use a constant $B$ that satisfies Assumption 3 but is far too large. Even though one does not know a better upper bound than $B$, one would like a regret bound that depends on the tightest possible upper bound.

In the meta-learning framework, define:

$$\mathcal{L}\_t(\eta) = \min\_{\theta \in \Theta\_0} \sum\_{i=1}^n \ell\_{t,i}(\theta) + \frac{\eta n \left[ \max\_{\theta \in \Theta\_0, 1 \le i \le n} \ell\_{t,i}(\theta) \right]^2}{8} + \frac{\log M}{\eta} \tag{28}$$

for $\eta \in \Lambda = [1/n, 1]$. It is immediate to prove that this function is convex and $L$-Lipschitz with $L = n^2 \log(M) + nB^2/8$. So, Assumption 1 is satisfied, allowing the use of the OPMS or OGMS strategy without needing a tight upper bound on the losses. Note that, in this context, the OGMS strategy is given by:

$$\eta\_t = \frac{1}{n} \vee \left[ \eta\_{t-1} - \alpha \left( \frac{n \left[ \max\_{\theta \in \Theta\_0,\, 1 \le i \le n} \ell\_{t,i}(\theta) \right]^2}{8} - \frac{\log M}{\eta\_{t-1}^2} \right) \right] \wedge 1.$$

**Theorem 2.** *Under Assumption 3, using OGMS or OPMS on $\mathcal{L}_t(\eta)$ as in (28), with $\eta_1 = 1$, $L = n^2\log(M) + nB^2/8$ and*

$$
\alpha = \frac{1}{L} \sqrt{\frac{2}{T}} \tag{29}
$$

*we have*

$$\begin{split} \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\eta\_t}) \leq \sum\_{t=1}^{T} \min\_{\theta \in \Theta\_{0}} \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + bT \sqrt{\frac{n \log(M)}{2}} \\ &+ T \log(M) + \frac{b^{2}T}{8} + \left(n^{2} \log M + \frac{nB^{2}}{8}\right) \sqrt{2T} \end{split} \tag{30}$$

*where*

$$b = \max\_{\theta \in \Theta\_0,\, 1 \le t \le T,\, 1 \le i \le n} |\ell\_{t,i}(\theta)|. \tag{31}$$

Let us compare learning in isolation with meta-learning in this context. When learning in isolation, the hyperparameter $\eta$ is fixed (as in (2)). If we fix it to the value $\eta_0 = (2/B)\sqrt{2\log(M)/n}$ as in (27), the meta-regret is in $BT\sqrt{n\log(M)/2}$. On the other hand, meta-learning leads to a meta-regret in $bT\sqrt{n\log(M)/2} + n^2\log(M)\sqrt{2T} + \mathcal{O}(nB^2\sqrt{T} + T)$. In other words, we replace the potentially loose upper bound $B$ by the tightest possible bound $b$, at the cost of an additional $n^2\log(M)\sqrt{2T} + \mathcal{O}(nB^2\sqrt{T} + T)$ term. Here again, when $T$ is large enough with respect to $n$, this term is negligible.

#### *5.3. Learning the Prior π*

Under Assumption 4, we have the regret bound (25). Without any information on $\Theta_0$, it seems natural to use the uniform prior $\pi$ on $\Theta_0 = \{\theta_1, \dots, \theta_M\}$, which leads to

$$\sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}) \le \min\_{\theta \in \Theta\_0} \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + C \log M. \tag{32}$$

If some additional information were available, such as, for example, "the best $\theta$ is always either $\theta_1$ or $\theta_2$", one would rather choose the uniform prior on $\{\theta_1, \theta_2\}$ and obtain the bound:

$$\sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}) \le \min\_{\theta \in \Theta\_0} \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + C \log 2. \tag{33}$$

Unfortunately, such information is generally not available. However, in the context of meta-learning, we can take advantage of the previous tasks to learn such information.

Thus, let us define, for any task *t*,

$$\theta\_t^\* = \underset{\theta \in \Theta\_0}{\text{argmin}} \sum\_{i=1}^n \ell\_{t,i}(\theta) \tag{34}$$

and

$$\mathcal{L}\_t(\pi) = \sum\_{i=1}^n \ell\_{t,i}(\theta\_t^\*) + \mathbb{C} \log \frac{1}{\pi(\theta\_t^\*)} \tag{35}$$

for $\pi = (\pi(\theta\_1), \dots, \pi(\theta\_M)) \in \Lambda$ with

$$\Lambda = \left\{ \mathbf{x} \in (\mathbb{R}\_+)^M \colon \sum\_{h=1}^M \mathbf{x}\_h = 1 \text{ and } \mathbf{x}\_h \ge \frac{1}{2M} \text{ for all } h \right\}. \tag{36}$$

One can check that $\mathcal{L}\_t$ is convex and $L$-Lipschitz on $\Lambda$ with $L = 2CM$; this allows us to use OPMS (or OGMS).
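Running OPMS or OGMS over $\Lambda$ requires a Euclidean projection onto $\Lambda$. Since $\Lambda$ is the probability simplex with a mass floor of $1/(2M)$, one can shift by the floor and reuse the classical sorting-based simplex projection. A sketch (the function names are ours, and the simplex-projection routine is the standard one, not specific to the paper):

```python
import numpy as np

def project_simplex(v, z=1.0):
    # Euclidean projection of v onto {x >= 0, sum(x) = z},
    # via the standard sorting-based algorithm.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - z))[0][-1]
    tau = (css[rho] - z) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def project_Lambda(pi):
    # Projection onto Lambda = {x : sum(x) = 1, x_h >= 1/(2M)}:
    # subtract the floor 1/(2M) from every coordinate and project the
    # remainder onto the simplex of total mass 1 - M/(2M) = 1/2.
    M = len(pi)
    floor = 1.0 / (2 * M)
    return floor + project_simplex(pi - floor, z=1.0 - M * floor)
```

For example, projecting a weight vector that violates the floor, such as `[0.9, 0.05, 0.03, 0.02]` with $M = 4$, returns a vector that sums to 1 with every coordinate at least $1/8$.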

**Theorem 3.** *Under Assumption 4, using OPMS on $\mathcal{L}\_t(\pi)$ as in (35) with $\pi\_1 = (1/M, \dots, 1/M)$, $L = 2CM$ and*

$$\alpha = \frac{1}{2CM\sqrt{T}},\tag{37}$$

*define $I^\* = \{\theta\_1^\*, \dots, \theta\_T^\*\}$ where each $\theta\_t^\*$ is as in (34), and $m^\* = \mathrm{card}(I^\*)$. We have*

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\pi\_t}) \le \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t^\*) + \mathbb{C}T \log(2m^\*) + 2\mathbb{C}M\sqrt{T}.\tag{38}$$

When learning in isolation with a uniform prior, the meta-regret is $TC\log(M)$. On the other hand, if $m^\*$ is small (that is, many of the $\theta\_t^\*$'s coincide), meta-learning leads to a meta-regret in $CT\log(2m^\*) + 2CM\sqrt{T}$. For a $T$ that is large enough, this is an important improvement.

#### *5.4. Discussion on the Continuous Case*

Let us now discuss the possibility of meta-learning for generalized Bayesian methods when Θ<sup>0</sup> is no longer a finite set. There is a general formula for EWA, given by

$$\rho\_{t,i}(\mathbf{d}\theta) = \underset{\rho}{\operatorname{argmin}} \left\{ \mathbb{E}\_{\theta \sim \rho} \left[ \sum\_{j=1}^{i-1} \ell\_{t,j}(\theta) \right] + \frac{\mathcal{K}(\rho, \pi)}{\eta} \right\} \tag{39}$$

where the minimum is taken over all probability distributions that are absolutely continuous with respect to $\pi$, and where $\pi$ is a prior distribution, $\eta > 0$ a learning rate and $\mathcal{K}$ is the Kullback–Leibler (KL) divergence. Meta-learning for such an update rule is studied in [10,37], but it usually does not lead to feasible strategies. Online variational inference [38,39] consists in replacing the minimization over the set of all probability distributions by a minimization over a smaller set, in order to define a feasible approximation of $\rho\_{t,i}$. For example, let $(q\_{\mu})\_{\mu \in \mathcal{M}}$ be a parametric family of probability distributions; we then define:

$$\mu\_{t,i} = \underset{\mu \in M}{\operatorname{argmin}} \left\{ \mathbb{E}\_{\theta \sim q\_{\mu}} \left[ \sum\_{j=1}^{i-1} \ell\_{t,j}(\theta) \right] + \frac{\mathcal{K}(q\_{\mu}, \pi)}{\eta} \right\}. \tag{40}$$

It is discussed in [40] that, generally, when $\mu$ is a location-scale parameter and $\ell\_{t,i}$ is $\Gamma$-Lipschitz and convex, then $\bar{\ell}\_{t,i}(\mu) := \mathbb{E}\_{\theta \sim q\_{\mu}}[\ell\_{t,i}(\theta)]$ is $2\Gamma$-Lipschitz and convex. In this case, under the assumption that $\mathcal{K}(q\_{\mu}, \pi)$ is $\alpha$-strongly convex in $\mu$, a regret bound for such strategies was derived in [39]:

$$\sum\_{i=1}^{n} \mathbb{E}\_{\theta \sim q\_{\mu\_{t,i}}}[\ell\_{t,i}(\theta)] \le \inf\_{\mu \in \mathcal{M}} \left\{ \mathbb{E}\_{\theta \sim q\_{\mu}} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta) \right] + \frac{4\eta \Gamma^2 n}{\alpha} + \frac{\mathcal{K}(q\_{\mu}, \pi)}{\eta} \right\}. \tag{41}$$

A complete study of the meta-learning of the rate $\eta > 0$ and of the prior $\pi$ in this context is an important objective (possibly with the restriction that $\pi \in \{q\_{\mu}, \mu \in \mathcal{M}\}$). However, this raises many problems. For example, the KL divergence $\mathcal{K}(q\_{\mu}, q\_{\mu'})$ is not always convex with respect to the parameter $\mu'$. In this case, it might help to replace it by a convex relaxation that would allow for the use of OGMS or OPMS. This relates to [41,42], who advocate going beyond the KL divergence in (39); see also [36] and the references therein. This will be the object of future work.
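For intuition, here is a toy instance of update (40) in which the minimiser is available in closed form. All choices below (Gaussian family $q\_\mu = \mathcal{N}(\mu, I\_d)$, prior $\pi = \mathcal{N}(0, I\_d)$, squared loss $\ell\_{t,j}(\theta) = \|x\_j - \theta\|^2$) are our illustrative assumptions: then $\mathbb{E}\_{\theta \sim q\_\mu}[\ell\_{t,j}(\theta)] = \|x\_j - \mu\|^2 + d$ and $\mathcal{K}(q\_\mu, \pi) = \|\mu\|^2/2$, so (40) reduces to a ridge-type formula.

```python
import numpy as np

def online_vi_gaussian(xs, eta):
    # Toy instance of update (40): q_mu = N(mu, I), pi = N(0, I),
    # ell_j(theta) = ||x_j - theta||^2.  Minimising
    #   sum_{j < i} (||x_j - mu||^2 + d) + ||mu||^2 / (2 * eta)
    # over mu gives mu_i = 2 * sum_{j < i} x_j / (2 * (i - 1) + 1 / eta).
    mus, s = [], np.zeros_like(xs[0], dtype=float)
    for i, x in enumerate(xs):
        mus.append(2 * s / (2 * i + 1 / eta))
        s = s + x
    return mus
```

For a large learning rate $\eta$, the variational mean approaches the running mean of past observations; a small $\eta$ keeps it shrunk toward the prior mean $0$.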

#### **6. Proofs**

We start with a preliminary lemma that will be used in the proof of Proposition 1.

**Lemma 1.** *Let a*, *b*, *c be three vectors in* R*p. Then:*

$$(a-b)^T(b-c) = \frac{||a-c||^2 - ||a-b||^2 - ||b-c||^2}{2}.\tag{42}$$

**Proof.** Expand $\|a - c\|^2 = \|a\|^2 + \|c\|^2 - 2a^Tc$ in the r.h.s., as well as $\|a - b\|^2$ and $\|b - c\|^2$. Then simplify.
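The identity is also easy to check numerically:

```python
import numpy as np

# Numerical sanity check of Lemma 1 on random vectors in R^5.
rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 5))

lhs = (a - b) @ (b - c)
rhs = (np.sum((a - c)**2) - np.sum((a - b)**2) - np.sum((b - c)**2)) / 2
assert np.isclose(lhs, rhs)
```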

We now prove Proposition 1, first for OPMS and then for OGMS.

**Proof of Proposition 1 for OPMS.** As mentioned earlier, this strategy is an application to the meta-learning setting of implicit online learning [32,33]. We follow here a proof from Chapter 11 in [3]. We refer the reader to [43] and the references therein for tighter bounds under stronger assumptions.

First, $\lambda\_t$ is defined as the minimizer of a convex function in (5). So, the subdifferential of this function at $\lambda\_t$ contains 0. In other words, there is a $z\_t \in \partial\mathcal{L}\_{t-1}(\lambda\_t)$, such that

$$z\_t = \frac{\lambda\_{t-1} - \lambda\_t}{\alpha}. \tag{43}$$

By convexity, for any $\lambda$, for any $z \in \partial\mathcal{L}\_{t-1}(\lambda\_t)$,

$$\mathcal{L}\_{t-1}(\lambda) \ge \mathcal{L}\_{t-1}(\lambda\_t) + \left(\lambda - \lambda\_t\right)^T z. \tag{44}$$

The choice *z* = *zt* gives:

$$\mathcal{L}\_{t-1}(\lambda) \ge \mathcal{L}\_{t-1}(\lambda\_t) + \frac{(\lambda - \lambda\_t)^T(\lambda\_{t-1} - \lambda\_t)}{\alpha},\tag{45}$$

that is,

$$\begin{split} \mathcal{L}\_{t-1}(\lambda\_t) &\leq \mathcal{L}\_{t-1}(\lambda) + \frac{(\lambda - \lambda\_t)^T (\lambda\_t - \lambda\_{t-1})}{\alpha} \\ &= \mathcal{L}\_{t-1}(\lambda) + \frac{\|\lambda - \lambda\_{t-1}\|^2 - \|\lambda - \lambda\_t\|^2 - \|\lambda\_t - \lambda\_{t-1}\|^2}{2\alpha} \\ &= \mathcal{L}\_{t-1}(\lambda) + \frac{\|\lambda - \lambda\_{t-1}\|^2 - \|\lambda - \lambda\_t\|^2}{2\alpha} - \frac{\alpha \|z\_t\|^2}{2} \end{split} \tag{46}$$

where we used Lemma 1. Then, note that

$$\begin{split} \mathcal{L}\_{t-1}(\lambda\_{t-1}) &= \mathcal{L}\_{t-1}(\lambda\_t) + [\mathcal{L}\_{t-1}(\lambda\_{t-1}) - \mathcal{L}\_{t-1}(\lambda\_t)] \\ &\leq \mathcal{L}\_{t-1}(\lambda\_t) + \|\lambda\_{t-1} - \lambda\_t\| L \\ &\leq \mathcal{L}\_{t-1}(\lambda\_t) + \alpha \|z\_t\| L. \end{split} \tag{47}$$

Combining this inequality with (46) gives

$$\mathcal{L}\_{t-1}(\lambda\_{t-1}) \le \mathcal{L}\_{t-1}(\lambda) + \frac{\|\lambda - \lambda\_{t-1}\|^2 - \|\lambda - \lambda\_t\|^2}{2\alpha} + \alpha\left(\|z\_t\| L - \frac{\|z\_t\|^2}{2}\right). \tag{48}$$

Now, for any $x \in \mathbb{R}$, $-x^2/2 + xL - L^2/2 \le 0$. In particular, $\|z\_t\| L - \|z\_t\|^2/2 \le L^2/2$, and so the above can be rewritten:

$$\mathcal{L}\_{t-1}(\lambda\_{t-1}) \le \mathcal{L}\_{t-1}(\lambda) + \frac{\|\lambda - \lambda\_{t-1}\|^2 - \|\lambda - \lambda\_t\|^2}{2\alpha} + \frac{\alpha L^2}{2}.\tag{49}$$

Summing the inequality for *t* = 2 to *T* + 1 leads to:

$$\sum\_{t=1}^{T} \mathcal{L}\_t(\lambda\_t) \le \sum\_{t=1}^{T} \mathcal{L}\_t(\lambda) + \frac{\|\lambda - \lambda\_1\|^2 - \|\lambda - \lambda\_{T+1}\|^2}{2\alpha} + \frac{\alpha T L^2}{2}. \tag{50}$$

This ends the proof.

**Proof of Proposition 1 for OGMS.** The beginning of the proof follows the proof of Theorem 11.1 in [3].

Note that we can rewrite (4) as

$$\begin{cases} \tilde{\lambda}\_t = \lambda\_{t-1} - \alpha \nabla \mathcal{L}\_{t-1}(\lambda\_{t-1}), \\ \lambda\_t = \Pi\_\Lambda(\tilde{\lambda}\_t). \end{cases}$$

Rearranging the first line, we obtain:

$$\nabla \mathcal{L}\_{t-1}(\lambda\_{t-1}) = \frac{\lambda\_{t-1} - \tilde{\lambda}\_t}{\alpha}. \tag{51}$$

By convexity, for any *λ*,

$$\mathcal{L}\_{t-1}(\lambda) \ge \mathcal{L}\_{t-1}(\lambda\_{t-1}) + (\lambda - \lambda\_{t-1})^T \nabla \mathcal{L}\_{t-1}(\lambda\_{t-1}) \tag{52}$$

$$= \mathcal{L}\_{t-1}(\lambda\_{t-1}) + \frac{(\lambda - \lambda\_{t-1})^T(\lambda\_{t-1} - \tilde{\lambda}\_t)}{\alpha},\tag{53}$$

that is,

$$\mathcal{L}\_{t-1}(\lambda\_{t-1}) \le \mathcal{L}\_{t-1}(\lambda) - \frac{(\lambda - \lambda\_{t-1})^T (\lambda\_{t-1} - \tilde{\lambda}\_t)}{\alpha}.\tag{54}$$

Lemma 1 gives:

$$\begin{split} (\lambda - \lambda\_{t-1})^T (\lambda\_{t-1} - \tilde{\lambda}\_t) &= \frac{\|\lambda - \tilde{\lambda}\_t\|^2 - \|\lambda - \lambda\_{t-1}\|^2 - \|\lambda\_{t-1} - \tilde{\lambda}\_t\|^2}{2} \\ &= \frac{\|\lambda - \tilde{\lambda}\_t\|^2 - \|\lambda - \lambda\_{t-1}\|^2 - \alpha^2 \|\nabla \mathcal{L}\_{t-1}(\lambda\_{t-1})\|^2}{2} \end{split} \tag{55}$$

$$\geq \frac{\|\lambda - \lambda\_t\|^2 - \|\lambda - \lambda\_{t-1}\|^2 - \alpha^2 \|\nabla \mathcal{L}\_{t-1}(\lambda\_{t-1})\|^2}{2},\tag{56}$$

the last step being justified by:

$$\|\lambda - \bar{\lambda}\_t\|^2 \ge \|\lambda - \Pi\_\Lambda(\bar{\lambda}\_t)\|^2 = \|\lambda - \lambda\_t\|^2 \tag{57}$$

for any $\lambda \in \Lambda$. Plugging (56) into (54), we obtain:

$$\mathcal{L}\_{t-1}(\lambda\_{t-1}) \le \mathcal{L}\_{t-1}(\lambda) + \frac{\|\lambda - \lambda\_{t-1}\|^2 - \|\lambda - \lambda\_t\|^2}{2\alpha} + \frac{\alpha\|\nabla \mathcal{L}\_{t-1}(\lambda\_{t-1})\|^2}{2} \tag{58}$$

and the Lipschitz assumption gives:

$$\mathcal{L}\_{t-1}(\lambda\_{t-1}) \le \mathcal{L}\_{t-1}(\lambda) + \frac{\|\lambda - \lambda\_{t-1}\|^2 - \|\lambda - \lambda\_t\|^2}{2\alpha} + \frac{\alpha L^2}{2} \tag{59}$$

Summing the inequality for $t = 2$ to $T + 1$, we obtain:

$$\sum\_{t=1}^{T} \mathcal{L}\_t(\lambda\_t) \le \sum\_{t=1}^{T} \mathcal{L}\_t(\lambda) + \frac{\|\lambda - \lambda\_1\|^2 - \|\lambda - \lambda\_{T+1}\|^2}{2\alpha} + \frac{\alpha T L^2}{2}. \tag{60}$$

This ends the proof of the statement for OGMS.
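In code, the OGMS update analysed above (a gradient step, then a projection) is only a few lines; `grad` and `project` below stand in for $\nabla\mathcal{L}\_{t-1}$ and $\Pi\_\Lambda$ and are assumptions of this sketch:

```python
import numpy as np

def ogms_step(lam, grad, alpha, project):
    # One OGMS update: gradient step on the previous meta-loss,
    # then projection back onto the feasible set Lambda.
    lam_tilde = lam - alpha * grad(lam)  # tilde-lambda_t
    return project(lam_tilde)            # lambda_t = Pi_Lambda(tilde-lambda_t)

# Toy usage: minimise ||lam - target||^2 over the box [0, 1]^2.
target = np.array([2.0, -1.0])
grad = lambda lam: 2 * (lam - target)
project = lambda lam: np.clip(lam, 0.0, 1.0)

lam = np.array([0.5, 0.5])
for _ in range(200):
    lam = ogms_step(lam, grad, alpha=0.1, project=project)
print(lam)  # [1. 0.], the projection of the target onto the box
```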

We now provide a lemma that will be useful for the proof of Proposition 2.

**Lemma 2.** *Let G*(*u*, *v*) *be a convex function of* (*u*, *v*) ∈ *U* × *V. Define g*(*u*) = inf*v*∈*<sup>V</sup> G*(*u*, *v*)*. Then g is convex.*

**Proof.** Indeed, let $\lambda \in [0, 1]$ and $(x, y) \in U^2$. For any $(x', y') \in V^2$,

$$g\left(\lambda x + (1 - \lambda)y\right) = \inf\_{v \in V} G\left(\lambda x + (1 - \lambda)y, v\right) \tag{61}$$

$$\leq G(\lambda x + (1 - \lambda)y, \lambda x' + (1 - \lambda)y') \tag{62}$$

$$\leq \lambda G(x, x') + (1 - \lambda)G(y, y'). \tag{63}$$

Taking the infimum with respect to $(x', y') \in V^2$ on both sides gives:

$$g(\lambda x + (1 - \lambda)y) \le \inf\_{x' \in V} \lambda G(x, x') + \inf\_{y' \in V} (1 - \lambda)G(y, y') \tag{64}$$

$$= \lambda g(x) + (1 - \lambda) g(y), \tag{65}$$

that is, *g* is convex.
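Lemma 2 can be illustrated numerically. Below, $G(u, v) = (u - v)^2 + v^2$ is jointly convex (a choice of ours for the example), and minimising over $v$ gives $g(u) = u^2/2$, which is indeed convex:

```python
import numpy as np

def g(u, vs=np.linspace(-10.0, 10.0, 100_001)):
    # g(u) = inf_v G(u, v) for G(u, v) = (u - v)^2 + v^2,
    # approximated by a minimum over a fine grid (exact value: u**2 / 2).
    return np.min((u - vs)**2 + vs**2)

# Midpoint convexity check on a few pairs (u, w).
for u, w in [(-3.0, 5.0), (0.0, 2.0), (-1.0, -0.25)]:
    assert g(0.5 * (u + w)) <= 0.5 * g(u) + 0.5 * g(w) + 1e-6
```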

**Proof of Proposition 2.** Apply Lemma 2 to *u* = (*ϑ*, *γ*), *v* = *θ*, *U* = Λ, *V* = Θ and

$$G(u, v) = \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + \frac{\gamma \Gamma^2 n}{2} + \frac{\|\theta - \vartheta\|^2}{2\gamma}. \tag{66}$$

This shows that $g(u) = \mathcal{L}\_t((\vartheta, \gamma))$ is convex with respect to $(\vartheta, \gamma)$. Additionally, $G$ is differentiable w.r.t. $u = (\vartheta, \gamma)$, with

$$\frac{\partial G}{\partial \vartheta} = \frac{\vartheta - \theta}{\gamma}, \quad \text{and} \quad \frac{\partial G}{\partial \gamma} = \frac{n\Gamma^2}{2} - \frac{\|\theta - \vartheta\|^2}{2\gamma^2}. \tag{67}$$

As a consequence, for $(\theta, \vartheta) \in \Theta^2$ and $\underline{\gamma} \le \gamma \le \overline{\gamma}$,

$$\left\|\frac{\partial G}{\partial \vartheta}\right\|^2 \le \frac{4C^2}{\underline{\gamma}^2}, \quad \text{and} \quad \left|\frac{\partial G}{\partial \gamma}\right|^2 \le \frac{n^2 \Gamma^4}{4} + \frac{4C^4}{\underline{\gamma}^4}.\tag{68}$$

This leads to

$$\|\nabla\_u G(u, v)\| = \sqrt{\left\|\frac{\partial G}{\partial \vartheta}\right\|^2 + \left|\frac{\partial G}{\partial \gamma}\right|^2} \tag{69}$$

$$\leq \sqrt{\frac{n^2 \Gamma^4}{4} + \frac{4C^2}{\underline{\gamma}^2} + \frac{4C^4}{\underline{\gamma}^4}} =: L,\tag{70}$$

that is, for each *v*, *G*(*u*, *v*) is *L*-Lipschitz in *u*. So, *g*(*u*) = inf*v*∈*<sup>V</sup> G*(*u*, *v*) is *L*-Lipschitz in *u*.

**Proof of Theorem 1.** Thanks to Assumption 2, we can apply Proposition 2. That is, Assumption 1 is satisfied, and we can apply Proposition 1. This gives:

$$\begin{split} \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\lambda\_t}) \leq \inf\_{\theta\_1, \ldots, \theta\_T \in \Theta} \inf\_{(\vartheta, \gamma) \in \Lambda} \Bigg\{ \sum\_{t=1}^{T} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t) + \frac{\gamma \Gamma^2 n}{2} + \frac{\|\theta\_t - \vartheta\|^2}{2\gamma} \right] \\ + \frac{\alpha T L^2}{2} + \frac{\|\vartheta - \vartheta\_1\|^2 + |\gamma - \gamma\_1|^2}{2\alpha} \Bigg\}. \end{split} \tag{71}$$

We use direct bounds for the last two terms: $\|\vartheta - \vartheta\_1\|^2 \le 4C^2$ and $|\gamma - \gamma\_1|^2 \le |\overline{\gamma} - \underline{\gamma}|^2 \le \overline{\gamma}^2 = C^4$. Then note that

$$\sum\_{t=1}^{T} \left\| \theta\_t - \vartheta \right\|^2 = T \left\| \vartheta - \frac{1}{T} \sum\_{s=1}^{T} \theta\_s \right\|^2 + \sum\_{t=1}^{T} \left\| \theta\_t - \frac{1}{T} \sum\_{s=1}^{T} \theta\_s \right\|^2 \tag{72}$$

$$= T \left\| \vartheta - \frac{1}{T} \sum\_{s=1}^{T} \theta\_s \right\|^2 + T\sigma^2(\theta\_1^T). \tag{73}$$

Upper bounding the infimum on $\vartheta$ in (71) by the choice $\vartheta = \frac{1}{T}\sum\_{s=1}^{T} \theta\_s$ leads to

$$\begin{split} \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\lambda\_t}) \leq \inf\_{\theta\_1, \dots, \theta\_T \in \Theta} \inf\_{\gamma \in [\underline{\gamma}, \overline{\gamma}]} \Bigg\{ \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t) + \frac{\gamma \Gamma^2 n T}{2} \\ + \frac{T \sigma^2(\theta\_1^T)}{2\gamma} + \frac{\alpha T L^2}{2} + \frac{C^2 (4 + C^2)}{2\alpha} \Bigg\}. \end{split} \tag{74}$$

The right-hand side of (74) is minimized with respect to $\alpha$ for $\alpha = \frac{C\sqrt{4 + C^2}}{L\sqrt{T}}$, which is the value proposed in the theorem, and we obtain:

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\lambda\_t}) \le \inf\_{\theta\_1, \dots, \theta\_T \in \Theta} \inf\_{\gamma \in [\underline{\gamma}, \overline{\gamma}]} \left\{ \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t) + \frac{\gamma \Gamma^2 n T}{2} + \frac{T\sigma^2(\theta\_1^T)}{2\gamma} + CL\sqrt{(4 + C^2)T} \right\}.\tag{75}$$

The infimum with respect to *γ* in the r.h.s is reached for

$$\gamma^\* = \left(\underline{\gamma} \vee \frac{\sigma(\theta\_1^T)}{\Gamma \sqrt{n}}\right) \wedge \overline{\gamma}. \tag{76}$$

First, note that

$$\frac{\gamma^\* \Gamma^2 n T}{2} \le \left(\underline{\gamma} \vee \frac{\sigma(\theta\_1^T)}{\Gamma \sqrt{n}}\right) \frac{\Gamma^2 n T}{2} \tag{77}$$

$$\leq \left(\underline{\gamma} + \frac{\sigma(\theta\_1^T)}{\Gamma\sqrt{n}}\right) \frac{\Gamma^2 nT}{2} \tag{78}$$

$$=\frac{\Gamma^2 T n^{1-\beta}}{2} + \frac{\sigma(\theta\_1^T) \Gamma T \sqrt{n}}{2},\tag{79}$$

using $\underline{\gamma} = n^{-\beta}$. Then,

$$\frac{T\sigma^2(\theta\_1^T)}{2\gamma^\*} \le \frac{T\sigma^2(\theta\_1^T)}{2} \left(\frac{1}{\overline{\gamma}} \vee \frac{\Gamma\sqrt{n}}{\sigma(\theta\_1^T)}\right) \tag{80}$$

$$\leq \frac{T\sigma^2(\theta\_1^T)}{2} \left(\frac{1}{\overline{\gamma}} + \frac{\Gamma\sqrt{n}}{\sigma(\theta\_1^T)}\right) \tag{81}$$

$$= \frac{T\sigma^2(\theta\_1^T)}{2C^2} + \frac{\sigma(\theta\_1^T)\Gamma T \sqrt{n}}{2} \tag{82}$$

$$\leq \frac{T\sigma(\theta\_1^T)}{C} + \frac{\sigma(\theta\_1^T)\Gamma T \sqrt{n}}{2},\tag{83}$$

using $\overline{\gamma} = C^2$ and $\sigma(\theta\_1^T) \le 2C$. Plugging (77), (80) and the definition of $L$ into (75) gives

$$\begin{split} \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\lambda\_t}) \le \inf\_{\theta\_1, \dots, \theta\_T \in \Theta} \Bigg\{ \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t) + C \sqrt{\left(\frac{n^2 \Gamma^4}{4} + 4C^2 n^{2\beta} + 4C^4 n^{4\beta}\right) (4 + C^2) T} \\ + \frac{\Gamma^2 T n^{1-\beta}}{2} + \sigma(\theta\_1^T) T \left(\Gamma \sqrt{n} + \frac{1}{C}\right) \Bigg\} \end{split} \tag{84}$$

$$\begin{split} = \inf\_{\theta\_1, \dots, \theta\_T \in \Theta} \Bigg\{ \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t) + C \sqrt{(4 + C^2) \left( \frac{n^2 \Gamma^4}{4 n^{2 \vee 4\beta}} + \frac{4C^2 n^{2\beta}}{n^{2 \vee 4\beta}} + \frac{4C^4 n^{4\beta}}{n^{2 \vee 4\beta}} \right)} \, n^{1 \vee 2\beta} \sqrt{T} \\ + \left[ \frac{\Gamma^2}{2} n^{1-\beta} + \left( \Gamma + \frac{1}{C\sqrt{n}} \right) \sigma(\theta\_1^T) \sqrt{n} \right] T \Bigg\} \end{split} \tag{86}$$

$$\begin{split} \leq \inf\_{\theta\_1, \dots, \theta\_T \in \Theta} \Bigg\{ \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t) + C \sqrt{(4 + C^2) \left( \frac{\Gamma^4}{4} + 4C^2 + 4C^4 \right)} \, n^{1 \vee 2\beta} \sqrt{T} \\ + \left[ \frac{\Gamma^2}{2} n^{1-\beta} + \left( \Gamma + \frac{1}{C} \right) \sigma(\theta\_1^T) \sqrt{n} \right] T \Bigg\} \end{split} \tag{88}$$

$$\leq \inf\_{\theta\_1, \dots, \theta\_T \in \Theta} \left\{ \sum\_{t=1}^T \sum\_{i=1}^n \ell\_{t,i}(\theta\_t) + \mathcal{C}(\Gamma, C) \left[ n^{1 \vee 2\beta} \sqrt{T} + \left( n^{1-\beta} + \sigma(\theta\_1^T) \sqrt{n} \right) T \right] \right\} \tag{90}$$

where we took

$$\mathcal{C}(\Gamma, C) = \max \left( C \sqrt{(4 + C^2) \left( \frac{\Gamma^4}{4} + 4C^2 + 4C^4 \right)},\ \frac{\Gamma^2}{2},\ \Gamma + \frac{1}{C} \right). \tag{91}$$

This ends the proof.

**Proof of Theorem 2.** A direct application of Proposition 1 gives

$$\begin{split} \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\eta\_t}) \leq \inf\_{\eta \geq \frac{1}{n}} \Bigg\{ \sum\_{t=1}^{T} \min\_{\theta \in \Theta\_0} \Bigg[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta) \\ + \frac{\eta n \left[ \max\_{\theta \in \Theta\_0, 1 \leq i \leq n} \ell\_{t,i}(\theta) \right]^2}{8} + \frac{\log M}{\eta} \Bigg] + \frac{\alpha T L^2}{2} + \frac{(\eta - 1)^2}{2\alpha} \Bigg\}. \end{split} \tag{92}$$

Thus, we have

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\eta\_t}) \le \inf\_{\eta \ge \frac{1}{n}} \left\{ \sum\_{t=1}^{T} \min\_{\theta \in \Theta\_0} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + \frac{\eta n b^2}{8} + \frac{\log M}{\eta} \right] + \frac{\alpha T L^2}{2} + \frac{(\eta - 1)^2}{2\alpha} \right\}. \tag{93}$$

Now, plugging in the right-hand side

$$\eta = \frac{1}{n} \vee \left(\frac{2}{b} \sqrt{\frac{2 \log M}{n}}\right) \wedge 1,\tag{94}$$

we obtain:

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\eta\_t}) \le \sum\_{t=1}^{T} \min\_{\theta \in \Theta\_0} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + \frac{b^2}{8} + b\sqrt{\frac{n\log(M)}{2}} + \log(M) \right] + \frac{\alpha T L^2}{2} + \frac{1}{2\alpha}.\tag{95}$$

Now, we see that the value $\alpha = \sqrt{2/(TL^2)}$ leads to:

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\eta\_t}) \le \sum\_{t=1}^{T} \min\_{\theta \in \Theta\_0} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + \frac{b^2}{8} + b\sqrt{\frac{n\log(M)}{2}} + \log(M) \right] + L\sqrt{2T}.\tag{96}$$

Rearranging terms, and replacing *L* by its value,

$$\begin{split} \sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\eta\_t}) \leq \sum\_{t=1}^{T} \min\_{\theta \in \Theta\_{0}} \sum\_{i=1}^{n} \ell\_{t,i}(\theta) + bT \sqrt{\frac{n \log(M)}{2}} + \frac{b^{2}T}{8} + T \log(M) \\ &+ \left(n^{2} \log M + \frac{nB^{2}}{8}\right) \sqrt{2T}, \end{split} \tag{97}$$

that is the statement of the theorem.

**Proof of Theorem 3.** An application of Proposition 1 leads to

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\pi\_t}) \le \min\_{\pi \in \Lambda} \left\{ \sum\_{t=1}^{T} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t^\*) + \mathbb{C} \log \frac{1}{\pi(\theta\_t^\*)} \right] + \frac{\alpha T L^2}{2} + \frac{\|\pi - \pi\_1\|^2}{2\alpha} \right\} \tag{98}$$

and so

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\pi\_t}) \le \min\_{\pi \in \Lambda} \left\{ \sum\_{t=1}^{T} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t^\*) + \mathbb{C} \log \frac{1}{\pi(\theta\_t^\*)} \right] + \frac{\alpha T L^2}{2} + \frac{1}{2\alpha} \right\}. \tag{99}$$

Define $\pi\_{I^\*}$ such that $\pi\_{I^\*}(\theta\_j) = 1/(2m^\*)$ if $\theta\_j \in I^\*$ and $\pi\_{I^\*}(\theta\_j) = 1/(2(M - m^\*))$ otherwise. We have $\pi\_{I^\*} \in \Lambda$ and thus

$$\sum\_{t=1}^{T} \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_{t,i}^{\pi\_t}) \le \sum\_{t=1}^{T} \left[ \sum\_{i=1}^{n} \ell\_{t,i}(\theta\_t^\*) + \mathbb{C} \log(2m^\*) \right] + \frac{\alpha T L^2}{2} + \frac{1}{2\alpha}.\tag{100}$$

Replace *L* and *α* by their values to get the theorem.

#### **7. Conclusions**

We proposed two simple meta-learning strategies together with their theoretical analysis. Our results clearly show an improvement over learning in isolation when the tasks are similar enough. These theoretical findings are confirmed by our numerical experiments. Important questions remain open. In [27], a purely online method is proposed, in the sense that it does not require storing all of the information of the current task. In the case of OGA, this method allows us to learn the starting point. However, its application to learning the step size is not direct [28]. An important question is, then: is there a purely online method that would provably improve on learning in isolation in this case? Another important question is the automatic calibration of Γ. Finally, as mentioned in Section 5, we believe that a very general and efficient meta-learning method for learning priors in Bayesian statistics (or in generalized Bayesian inference) would be extremely valuable in practice.

**Author Contributions:** Investigation, D.M. and P.A.; Software, D.M.; Writing—original draft, D.M. and P.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** This project was initiated as Dimitri Meunier's internship project at RIKEN AIP, in the Approximate Bayesian Inference team. The internship was cancelled because of the pandemic. We would like to thank Arnak Dalalyan (ENSAE Paris), who provided funding so that the internship could take place at ENSAE Paris instead. We would like to thank Emtiyaz Khan (RIKEN AIP), Sébastien Gerchinovitz (IRT Saint-Exupéry, Toulouse), Vianney Perchet (ENSAE Paris) and all the members of the Approximate Bayesian Inference team for valuable feedback.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **References**


## *Article* **Fast Compression of MCMC Output**

**Nicolas Chopin \*,† and Gabriel Ducrocq †**

Institut Polytechnique de Paris, ENSAE Paris, CEDEX, 92247 Malakoff, France; gabriel.ducrocq@ensae.fr

**\*** Correspondence: nicolas.chopin@ensae.fr

† These authors contributed equally to this work.

**Abstract:** We propose cube thinning, a novel method for compressing the output of an MCMC (Markov chain Monte Carlo) algorithm when control variates are available. It allows resampling of the initial MCMC sample (according to weights derived from control variates), while imposing equality constraints on the averages of these control variates, using the cube method (an approach that originates from survey sampling). The main advantage of cube thinning is that its complexity does not depend on the size of the compressed sample. This compares favourably to previous methods, such as Stein thinning, the complexity of which is quadratic in that quantity.

**Keywords:** control variates; Markov chain Monte Carlo; thinning

## **1. Introduction**

MCMC (Markov chain Monte Carlo) remains, to this day, the most popular approach to sampling from a target distribution *p*, in particular in Bayesian computations [1].

Standard practice is to run a single chain, $X\_1, \dots, X\_N$, according to a Markov kernel that leaves $p$ invariant. It is also common to discard part of the simulated chain, either to reduce its memory footprint, or to reduce the CPU cost of later post-processing operations, or more generally for the user's convenience. Historically, the two common recipes for compressing an MCMC output are:

- **burn-in**: discard the first $b$ states of the chain;
- **thinning**: of the remaining states, keep only one state every $t$ states.
The impact of these two recipes on the statistical properties of the subsampled estimates is markedly different. Burn-in reduces the bias introduced by the discrepancy between $p$ and the distribution of the initial state $X\_1$ (since $X\_b \approx p$ for $b$ large enough). On the other hand, thinning always increases the (asymptotic) variance of MCMC estimators [2].
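Both recipes amount to simple slicing operations on the stored chain; a minimal sketch (the function name is ours):

```python
import numpy as np

def compress_chain(chain, burn_in, thin):
    # Discard the first `burn_in` states, then keep one state
    # every `thin` states of the remainder.
    return chain[burn_in::thin]

chain = np.arange(100)  # stand-in for an MCMC trace X_1, ..., X_N
sub = compress_chain(chain, burn_in=10, thin=5)
print(len(sub))  # 18 states survive
```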

Practitioners often choose $b$ (the burn-in period) and $t$ (the thinning frequency) separately, in a somewhat ad hoc fashion (i.e., through visual inspection of the initial chain), or using convergence diagnostics such as, e.g., those reviewed in [3].

Two recent papers [4,5] cast a new light on the problem of compressing an MCMC chain by considering, more generally, the problem, for a given *M*, of selecting the subsample of size *M* that best represents (according to a certain criterion) the target distribution *p*. We focus for now on [5], for reasons we explain below.

Stein thinning, the method developed in [5], chooses the subsample S of size *M* which minimises the following criterion:

$$D(\mathcal{S}) := \frac{1}{M^2} \sum\_{m,n \in \mathcal{S}} k\_p(X\_m, X\_n), \quad \mathcal{S} \subset \{1, \dots, N\}, \quad |\mathcal{S}| = M \tag{1}$$

where *kp* is a *p*-dependent kernel function derived from another kernel function *k*: X × X → <sup>R</sup>, as follows:

$$k\_p(\mathbf{x}, \mathbf{y}) = \nabla\_\mathbf{x} \cdot \nabla\_\mathbf{y} k(\mathbf{x}, \mathbf{y}) + \langle \nabla\_\mathbf{x} k(\mathbf{x}, \mathbf{y}), \mathbf{s}\_p(\mathbf{y}) \rangle + \langle \nabla\_\mathbf{y} k(\mathbf{x}, \mathbf{y}), \mathbf{s}\_p(\mathbf{x}) \rangle + k(\mathbf{x}, \mathbf{y}) \langle \mathbf{s}\_p(\mathbf{x}), \mathbf{s}\_p(\mathbf{y}) \rangle$$

**Citation:** Chopin, N.; Ducrocq, G. Fast Compression of MCMC Output. *Entropy* **2021**, *23*, 1017. https:// doi.org/10.3390/e23081017

Academic Editor: Cathy W. S. Chen

Received: 6 July 2021 Accepted: 3 August 2021 Published: 6 August 2021



with $\langle \cdot, \cdot \rangle$ being the Euclidean inner product, $s\_p(x) := \nabla \log p(x)$ the so-called score function (the gradient of the log target density), and $\nabla$ the gradient operator.
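For concreteness, $k\_p$ can be computed in closed form once a base kernel is fixed. The sketch below assumes an RBF base kernel $k(x, y) = \exp(-\|x - y\|^2 / (2h^2))$ with bandwidth $h$; this is our illustrative choice, not one prescribed by [5]:

```python
import numpy as np

def k_p(x, y, score, h=1.0):
    # Stein kernel derived from the RBF base kernel
    # k(x, y) = exp(-||x - y||^2 / (2 h^2)) and a score function s_p.
    d = x.size
    diff = x - y
    sq = diff @ diff
    k = np.exp(-sq / (2 * h**2))
    grad_x = -diff / h**2 * k           # nabla_x k(x, y)
    grad_y = diff / h**2 * k            # nabla_y k(x, y)
    div = k * (d / h**2 - sq / h**4)    # nabla_x . nabla_y k(x, y)
    sx, sy = score(x), score(y)
    return div + grad_x @ sy + grad_y @ sx + k * (sx @ sy)

# Example: standard Gaussian target, so s_p(x) = -x.
score = lambda x: -x
print(k_p(np.array([1.0, 2.0]), np.array([0.5, -1.0]), score))
```

Note that $k\_p$ is symmetric, and that on the diagonal (with $h = 1$) it reduces to $d + \|x\|^2$ for this target.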

The rationale behind criterion (1) is that it may be interpreted as the KSD (kernel Stein discrepancy) between the true distribution *p* and the empirical distribution of subsample *S*. We refer to [5] for more details on the theoretical background of the KSD, and its connection to Stein's method.

Stein thinning is appealing, as it seems to offer a principled, quasi-automatic way to compress an MCMC output. However, closer inspection reveals the following three limitations.

First, it requires computing the gradient of the log-target density, *sp*(*x*) = ∇ log *p*(*x*). This restricts the method to problems where this gradient exists and is tractable (and, in particular, to <sup>X</sup> <sup>=</sup> <sup>R</sup>*d*).

Second, its CPU cost is $\mathcal{O}(NM^2)$. This makes it nearly impossible to use Stein thinning for $M \gtrsim 100$. This cost stems from the greedy algorithm proposed in [5] (see their Algorithm 1), which adds, at iteration $t$, the state $X\_i$ that minimises $k\_p(X\_i, X\_i) + \sum\_{j \in \mathcal{S}\_{t-1}} k\_p(X\_i, X\_j)$, where $\mathcal{S}\_{t-1}$ is the sample obtained from the $t - 1$ previous iterations.
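A naive implementation of this greedy rule makes the $\mathcal{O}(NM^2)$ cost apparent. The criterion below follows the description above; the exact constant on the diagonal term may differ in [5], so treat this as a sketch:

```python
import numpy as np

def stein_thin_greedy(X, k_p, M):
    # Greedy Stein thinning sketch: at step t, add the state X_i
    # minimising k_p(X_i, X_i) + sum_{j in S_{t-1}} k_p(X_i, X_j).
    # Each step scans all N candidates against up to M already-selected
    # states, hence the overall O(N M^2) cost.
    selected = []
    for _ in range(M):
        scores = [k_p(x, x) + sum(k_p(x, X[j]) for j in selected)
                  for x in X]
        selected.append(int(np.argmin(scores)))
    return selected
```

Note that the same index may be selected more than once; the output is a weighted subsample rather than a set.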

Third, its performance seems to depend in a non-trivial way on the original kernel function *k*; the authors of [5] propose several strategies for choosing and scaling *k*, but none of them seem to perform uniformly well in their numerical experiments.

We propose a different approach in this paper, which we call cube thinning, and which addresses these shortcomings to some extent. Assuming the availability of *J* control variates (that is, of functions *hj* with known expectation under *p*), we cast the problem of MCMC compression as that of resampling the initial chain under constraints based on these control variates. The main advantage of cube thinning is that its complexity is <sup>O</sup>(*N J*3); in particular, it does not depend on *M*. That makes it possible to use it for much larger values of *M*. We shall discuss the choice of *J*, but, by and large, *J* should be of the same order as *d*, the dimension of the sampling space. The name stems from the cube method of [6], which plays a central part in our approach, as we explain in the body of the paper.

The availability of control variates may seem like a strong requirement. However, if we assume we are able to compute *sp*(*x*) = ∇ log *p*(*x*), then (for a large class of functions *<sup>φ</sup>* : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>R</sup>*d*, which we define later)

$$\mathbb{E}\_p\left[\langle \phi(x), s\_p(x) \rangle + \nabla\_x \cdot \phi(x)\right] = 0$$

where $\nabla\_x \cdot \phi$ denotes the divergence of $\phi$. In other words, the availability of the score function implies, automatically, the availability of control variates. The converse is not true: there exist control variates, e.g., [7], that are not gradient-based. One of the examples we consider in our numerical experiments features such non-gradient-based control variates; as a result, we are able to apply cube thinning, although Stein thinning is not applicable.
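This identity is easy to verify by Monte Carlo. Below, the target is $p = \mathcal{N}(0, I\_2)$ (so $s\_p(x) = -x$) and $\phi(x) = x$ (so $\nabla\_x \cdot \phi(x) = 2$); both are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target p = N(0, I_2): s_p(x) = -x.  Take phi(x) = x, so div phi = 2.
# The identity E_p[<phi(x), s_p(x)> + div_x phi(x)] = 0 becomes
# E[-||x||^2 + 2] = 0, which we check by Monte Carlo.
X = rng.standard_normal((200_000, 2))
vals = -np.sum(X**2, axis=1) + 2.0
print(abs(vals.mean()))  # close to 0
```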

The support points method of [4] does not require control variates. It is thus more generally applicable than either cube thinning or Stein thinning. On the other hand, when gradients (and thus control variates) are available, the numerical experiments of [5] suggest that Stein thinning outperforms support points. From now on, we focus on situations where control variates are available.

This paper is organised as follows. Section 2 recalls the concept of control variates, and explains how control variates may be used to reweight an MCMC sample. Section 3 describes the cube method of [6]. Section 4 explains how to combine control variates and the cube method to perform cube thinning. Section 5 assesses the statistical performance of cube thinning through two numerical experiments.

We use the following notations throughout: $p$ denotes both the target distribution and its probability density; $p(f)$ is a short-hand for the expectation of $f(X)$ under $p$. The gradient of a function $f$ is denoted by $\nabla\_x f(x)$, or simply $\nabla f(x)$ when there is no ambiguity. The $i$-th component of a vector $v \in \mathbb{R}^d$ is denoted by $v[i]$, and its transpose by $v^t$. The vectors of the canonical basis of $\mathbb{R}^d$ are denoted by $e\_i$, i.e., $e\_i[j] = 1$ if $j = i$, and 0 otherwise. Matrices are written in upper-case; the kernel (null space) of a matrix $A$ is denoted by $\ker A$. The set of continuously differentiable functions $f: \Omega \to \mathbb{R}^d$ is denoted by $C^1(\Omega, \mathbb{R}^d)$.

#### **2. Control Variates**

*2.1. Definition*

Control variates are a very well known way to reduce the variance of Monte Carlo estimates—see, e.g., the books of [1,8,9].

Suppose we want to estimate the quantity *<sup>p</sup>*(*f*) = <sup>E</sup>*p*[ *<sup>f</sup>*(*X*)] for a suitable *<sup>f</sup>* : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>R</sup>, based on an IID (independent and identically distributed) sample {*X*1, ... , *XN*} from distribution *p*. The generalisation of control variates to MCMC will be discussed in Section 4.

The usual Monte Carlo estimator of *p*(*f*) is

$$\hat{p}(f) = \frac{1}{N} \sum_{n=1}^{N} f(X_n). \tag{2}$$

Assume we know *<sup>J</sup>* <sup>∈</sup> <sup>N</sup> functions *hj* : <sup>R</sup>*<sup>d</sup>* <sup>→</sup> <sup>R</sup> for *<sup>j</sup>* ∈ {1, ... , *<sup>J</sup>*} such that *<sup>p</sup>*(*hj*) = 0. Functions with this property are called control variates. We can use this property to build an estimate with a lower variance: let us denote *h*(*X*)=(*h*1(*X*), ... , *hJ*(*X*))*<sup>t</sup>* and write our new estimate:

$$\hat{p}_{\beta}(f) = \frac{1}{N} \sum_{n=1}^{N} \left\{ f(X_n) + \beta^t h(X_n) \right\} \tag{3}$$

with *<sup>β</sup>* <sup>∈</sup> <sup>R</sup>*<sup>J</sup>* . Then it is straightforward to show that E[*p*ˆ*β*(*f*)] = E[*p*ˆ(*f*)] = *p*(*f*). Depending on the choice of *β*, we may have Var[*p*ˆ*β*(*f*)] ≤ Var[*p*ˆ(*f*)]. The next section discusses how to choose such a *β*.
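As an illustration, here is a minimal sketch of estimators (2) and (3) in Python; the toy target *p* = *N*(0, 1), the function *f*(*x*) = *x* + *x*², and the single control variate *h*(*x*) = *x* are our choices for illustration only. For one control variate, the variance-minimising coefficient is −Cov(*h*, *f*)/Var(*h*).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_normal(N)          # IID sample from p = N(0, 1)

f = x + x**2                        # f(x) = x + x^2, so p(f) = 1
h = x                               # control variate: p(h) = E[X] = 0

# Plain Monte Carlo estimate (2), and the control-variate estimate (3)
# with the variance-minimising coefficient beta* = -Cov(h, f)/Var(h).
p_hat = f.mean()
beta_star = -np.cov(h, f)[0, 1] / h.var(ddof=1)
p_hat_beta = (f + beta_star * h).mean()
```

Both estimates are unbiased for *p*(*f*) = 1; the second has lower variance because the fluctuations of *f* explained by *h* are subtracted out.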

#### *2.2. Control Variates as a Weighting Scheme*

The standard approach to choosing *β* consists of two steps. First, one shows easily that the value that minimises the variance of estimator (3) is:

$$\beta^\star(f) = -\operatorname{Var}(h(X))^{-1}\operatorname{Cov}(h(X), f(X)) \tag{4}$$

where Var(*h*(*X*)) is the *J* × *J* covariance matrix of the vector *h*(*X*) and Cov(*h*(*X*), *f*(*X*)) is the *J* × 1 vector whose *i*-th component is Cov(*h_i*(*X*), *f*(*X*)).

Second, one realises that this quantity may be estimated from the sample *X*1,..., *XN* through a simple linear regression model, where the *f*(*Xn*)s are the outcome, and the *hj*(*Xn*)s are the predictors:

$$f(X_n) = \mu + \beta^t h(X_n) + \epsilon_n, \quad \mathbb{E}[\epsilon_n] = 0. \tag{5}$$

More precisely, let *γ* ∈ R^{J+1} be the vector such that *γ*^t = (*μ*, *β*^t), *H* = (*H_{ij}*) the *N* × (*J* + 1) design matrix such that *H_{i1}* = 1, *H_{i(j+1)}* = *h_j*(*X_i*), and *F* = (*f*(*X*₁), ... , *f*(*X_N*))^t. Then, the OLS (ordinary least squares) estimate of *γ* is

$$
\hat{\gamma}\_{\rm OLS} = \left(H^t H\right)^{-1} H^t F. \tag{6}
$$

Since E[*f*(*X_n*)] = *μ* in this artificial regression model, the first component of *γ̂*_OLS,

$$\hat{p}_\star(f) := \hat{\gamma}_{\mathrm{OLS}}^t \, e_1, \tag{7}$$

actually corresponds to estimate (3) when *β* = −*β̂*_OLS.

At first glance, the approach described above seems to require implementing a different linear regression for each function *f* of interest. Ref. [9] noted, however, that one may re-express (7) as a weighted average:

$$\hat{p}_\star(f) = \sum_{n=1}^{N} w_n f(X_n) \tag{8}$$

where the weights *wn* sum to one and do not depend on *f*. It is thus possible to compute these weights once from a given sample (for a given choice of control variates), and then quickly compute *p*ˆ⋆(*f*) for any function *f* of interest.

The exact expression of the weights is easily deduced from (7) and (6): *w* = (*wn*) with

$$w = H(H^t H)^{-1} e\_1.$$
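The weight computation can be sketched as follows (a toy example with a single control variate *h*(*x*) = *x*; the variable names are ours). By construction, the weights sum to one and annihilate each control variate exactly, whatever the sample:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5_000
x = rng.standard_normal(N)                      # IID sample from p = N(0, 1)

H = np.column_stack([np.ones(N), x])            # design matrix: [1, h(X_n)]
e1 = np.array([1.0, 0.0])
w = H @ np.linalg.solve(H.T @ H, e1)            # weights of (8)

w_sum = w.sum()                                 # = 1 (exactly, up to rounding)
constraint = w @ x                              # = sum_n w_n h(X_n) = 0
p_est = w @ (x + x**2)                          # estimate of p(x + x^2) = 1
```

The same weights `w` can then be reused to estimate *p*(*f*) for any other *f* without refitting a regression.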

## *2.3. Gradient-Based Control Variates*

In this section and the next, we recall generic methods to construct control variates. This section specifically considers control variates that are derived from the score function, *sp*(*x*) = ∇ log *p*(*x*). (We therefore assume that this quantity is tractable.)

Under the following two conditions: (i) *p* and *φ* are continuously differentiable; and (ii) the boundary term vanishes, i.e., *p*(*x*)*φ*(*x*) → 0 as *x* approaches the boundary of the domain, the function

$$h(\mathbf{x}) = \nabla_{\mathbf{x}} \cdot \phi(\mathbf{x}) + \phi(\mathbf{x}) \cdot s_p(\mathbf{x}) \tag{9}$$

is a control variate: *p*(*h*) = 0; see, e.g., [10] or [11] for further details. To gain some insight, note that in dimension 1, and assuming the domain of integration is an interval ]*a*, *b*[ ⊂ R, this amounts to an integration by parts with the condition that *φ*(*b*)*p*(*b*) − *φ*(*a*)*p*(*a*) = 0.

Thus, whenever the score function is available (and the conditions above hold), we are able to construct an infinite number of control variates (one for each function *φ*). For simplicity, we shall focus on the following standard classes of such functions. First, for *i* = 1, . . . , *d*,

$$\phi_i \colon \mathbb{R}^d \to \mathbb{R}^d, \quad \mathbf{x} \mapsto e_i$$

which leads to the following *d* control variates:

$$h\_i(\mathbf{x}) = s\_p(\mathbf{x})[i]. \tag{10}$$

For a Gaussian target, *N*(*μ*, Σ), the score is *sp*(*x*) = −Σ−1(*x* − *μ*), and the control variates above make it possible to reweight the Monte Carlo sample so that its expectation matches that of the target distribution.

Second, we consider, for *i*, *j* = 1, . . . , *d*:

$$\phi_{ij} \colon \mathbb{R}^d \to \mathbb{R}^d, \quad \mathbf{x} \mapsto \mathbf{x}[i]\, e_j$$

which leads to the following *d*<sup>2</sup> control variates:

$$h_{ij}(\mathbf{x}) = \mathbb{1}\{i = j\} + \mathbf{x}[i]\, s_p(\mathbf{x})[j]. \tag{11}$$

Again, for a Gaussian target *N*(*μ*, Σ), these control variates make it possible to match the empirical covariance matrix to the true covariance Σ.

In our simulations, we consider two sets of control variates: the 'full' set, consisting of the *d* control variates defined by (10) plus the *d*<sup>2</sup> control variates defined by (11), and a 'diagonal' set of 2*d* control variates, where for (11) we only consider the cases where *i* = *j*. Of course, the former set should lead to better performance (lower variance), but since the complexity of our approach is O(*J*<sup>3</sup>), where *J* is the number of control variates, taking *J* = O(*d*<sup>2</sup>) may be too expensive whenever the dimension *d* is large. In fact, when *d* is very large, one might even consider using only control variates that depend on the few components of *x* that are of interest.
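A sketch of how the two sets can be assembled for a Gaussian target, where the score is available in closed form (the sizes *d* and *N* and the parameter values are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 3, 50_000
mu = np.array([1.0, -1.0, 0.5])
A = rng.standard_normal((d, d))
Sigma = A @ A.T + d * np.eye(d)                 # a generic SPD covariance
X = rng.multivariate_normal(mu, Sigma, size=N)

score = -(X - mu) @ np.linalg.inv(Sigma)        # s_p(x) = -Sigma^{-1}(x - mu)

# (10): h_i(x) = s_p(x)[i];  (11): h_ij(x) = 1{i=j} + x[i] s_p(x)[j]
h_first = [score[:, i] for i in range(d)]
h_second = [(i == j) + X[:, i] * score[:, j]
            for i in range(d) for j in range(d)]
full = h_first + h_second                       # 'full' set: d + d^2 variates
diagonal = h_first + [h_second[i * d + i] for i in range(d)]  # 2d variates

# Each control variate has expectation zero under the target.
means = [abs(h.mean()) for h in full]
```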

#### *2.4. MCMC-Based Control Variates*

We mention in passing other ways to construct control variates, in particular in the context of MCMC.

For instance, [7] noted that, for a Markov chain {*Xn*}, the quantity

$$\phi(X\_n) - \mathbb{E}[\phi(X\_n) | X\_{n-1}]$$

has zero expectation. In particular, if the MCMC kernel is a Gibbs sampler, one is typically able to compute this conditional expectation for each component, i.e., for *φ*(*x*) = *x*[*i*], *i* = 1, . . . , *d*.

See also [12,13] for other ways to construct control variates when the *Xn*s are simulated from a Metropolis kernel.
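A minimal sketch of the construction of [7] for a systematic-scan Gibbs sampler targeting a bivariate Gaussian with correlation ρ, a toy setting of ours in which the conditional expectations are available in closed form (the full conditionals are *x*₁ | *x*₂ ∼ *N*(ρ*x*₂, 1 − ρ²)):

```python
import numpy as np

rng = np.random.default_rng(3)
rho, N = 0.8, 50_000
s = np.sqrt(1 - rho**2)

x1, x2 = 0.0, 0.0
h = np.empty(N)
for n in range(N):
    x2_prev = x2
    x1 = rho * x2 + s * rng.standard_normal()   # update x1 | x2
    x2 = rho * x1 + s * rng.standard_normal()   # update x2 | x1
    # phi(x) = x[1]: its conditional expectation given the previous state
    # is rho * x2_prev, so the difference below has zero expectation.
    h[n] = x1 - rho * x2_prev

cv_mean = h.mean()                              # close to zero
```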

#### **3. The Cube Method**

We review in this section the cube method of [6]. This method originated from survey sampling; it is a way to sample from a finite population under constraints. The first subsections give some definitions; the following ones describe the flight phase and the landing phase of the method.

#### *3.1. Definitions*

Suppose we have a finite population {1, ... , *N*} of *N* individuals, and that to each individual *n* = 1, ... , *N* is associated a variable of interest *y_n* and *J* auxiliary variables, *v_n* = (*v*_{n1}, ... , *v*_{nJ}). Without loss of generality, suppose also that the *J* vectors (*v*_{1j}, ... , *v*_{Nj}) are linearly independent. We are interested in estimating the quantity *Y* = ∑_{n=1}^{N} *y_n* using a subsample of {1, ... , *N*}. Furthermore, we know the exact value of each sum *V_j* = ∑_{n=1}^{N} *v*_{nj}, and we wish to use this auxiliary information to better estimate *Y*.

We assign, to each individual *n*, a sampling probability *π<sup>n</sup>* ∈ [0, 1]. We consider random variables *Sn* such that, marginally, P(*Sn* = 1) = *πn*. We may then define the Horvitz–Thompson estimator of *Y* as

$$\hat{Y} = \sum\_{n=1}^{N} \frac{S\_n y\_n}{\pi\_n} \tag{12}$$

which is unbiased, and which depends only on selected individuals (i.e., *Sn* = 1).
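The unbiasedness of (12) can be checked exactly on a toy population by enumerating all subsamples; here we use Poisson sampling (independent Bernoulli inclusions), our choice for illustration, since (12) is unbiased for any design with the right marginal inclusion probabilities:

```python
from itertools import product

# Toy population: values y_n and inclusion probabilities pi_n (ours).
y = [2.0, 5.0, 3.0]
pi = [0.5, 0.25, 0.8]
Y = sum(y)

# Enumerate all 2^3 subsamples s, weight each Horvitz-Thompson estimate
# by the probability of s: the average equals Y exactly.
expected = 0.0
for s in product([0, 1], repeat=3):
    prob = 1.0
    for sn, pn in zip(s, pi):
        prob *= pn if sn == 1 else 1.0 - pn
    ht = sum(sn * yn / pn for sn, yn, pn in zip(s, y, pi))
    expected += prob * ht
```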

We define similarly the Horvitz–Thompson estimator of *Vj* as

$$\hat{V}_j = \sum_{n=1}^{N} \frac{S_n v_{nj}}{\pi_n}. \tag{13}$$

Our objective is to construct a joint distribution *ξ* for the inclusion variables *Sn* such that P*<sup>ξ</sup>* (*Sn* = 1) = *π<sup>n</sup>* for all *n* = 1, . . . , *N*, and

$$\hat{V} = V \quad \text{almost surely,} \tag{14}$$

where *V* = (*V*1, ... , *VJ*), *V*ˆ = (*V*ˆ 1, ... , *V*ˆ*J*). Such a probability distribution is called a balanced sampling design.

#### *3.2. Subsamples as Vertices*

We can view all the possible samples from {1, ... , *N*} as the vertices of the hypercube C = [0, 1] *<sup>N</sup>* in R*N*. A sampling design with inclusion probabilities *π<sup>n</sup>* = P*<sup>ξ</sup>* (*Sn* = <sup>1</sup>) is then a distribution over the set of these vertices such that E[*S*] = *π*, where *S* = (*S*1, ... , *SN*)*<sup>t</sup>* , and *π* = (*π*1, ... , *πN*)*<sup>t</sup>* is the vector of inclusion probabilities. Hence, *π* is expressed as a convex combination of the vertices of the hypercube.

We can think of a sampling algorithm as finding a way to reach any vertex of the cube, starting at *π*, while satisfying the balancing Equation (14). However, before we describe such a sampling algorithm, we may wonder if it is possible to find a vertex such that (14) is satisfied.

#### *3.3. Existence of a Solution*

The balancing equation, Equation (14), defines a linear system. Indeed, we can re-express (14) as the requirement that *S* be a solution of *As* = *V*, where *A* = (*A_{jn}*) is of dimension *J* × *N*, with *A_{jn}* = *v*_{nj}/*π_n*. This system defines a hyperplane *Q* of dimension *N* − *J* in R^*N*.

What we want is to find vertices of the hypercube C that also belong to the hyperplane *Q*. Unfortunately, this is not always possible, as it depends on how the hyperplane *Q* intersects cube C. In addition, there is no way to know beforehand whether such a vertex exists. Since *π* ∈ *Q*, we know that K := C ∩ *Q* ≠ ∅ and is of dimension *N* − *J*. The only thing we can say is stated in Proposition 1 of [6]: if *r* is a vertex of K, then in general *q* = card({*n* : 0 < *r*[*n*] < 1}) ≤ *J*.

The next section describes the flight phase of the cube algorithm, which generates a vertex in K when such vertices exist, or which, alternatively, returns a point in K with most (but not all) components set to zero or one. In the latter case, one needs to implement a landing phase, which is discussed in Section 3.5.

## *3.4. Flight Phase*

The flight phase simulates a process *π*(*t*) which takes values in K = C ∩ *Q* and starts at *π*(0) = *π*. At every time *t*, one selects a unit vector *u*(*t*), then chooses randomly between the two points that lie in the intersection of the hypercube C and the line parallel to *u*(*t*) passing through *π*(*t* − 1). The probabilities of selecting these two points are set to ensure that *π*(*t*) is a martingale; in that way, we have E[*π*(*t*)] = *π* at every time step. The random direction *u*(*t*) must be generated to fulfil the following two requirements: (a) the two points must be in *Q*, i.e., *u*(*t*) ∈ ker*A*; and (b) whenever *π*(*t*) reaches one of the faces of the hypercube, it must stay within that face; thus, *u*(*t*)[*k*] = 0 if *π*(*t* − 1)[*k*] = 0 or 1.

Algorithm 1 describes one step of the flight phase.


The flight phase stops when Step 1 of Algorithm 1 cannot be performed (i.e., no vector *u*(*t*) fulfils these conditions). Until this happens, each iteration increases the number of components of *π*(*t*) that are either zero or one by at least one. Thus, the flight phase completes in at most *N* steps.

In practice, to generate *u*(*t*), one may proceed as follows: first generate a random vector *<sup>v</sup>*(*t*) <sup>∈</sup> <sup>R</sup>*N*, then project it in the constraint hyperplane: *<sup>u</sup>*(*t*) = *<sup>I</sup>*(*t*)*v*(*t*) <sup>−</sup> *<sup>I</sup>*(*t*)*A<sup>t</sup>* (*AI*(*t*)*A<sup>t</sup>* )− *AI*(*t*)*v*(*t*), where *I*(*t*) is a diagonal matrix such that *Ikk*(*t*) is 0 if *πk*(*t*) is an integer and 1 otherwise, and *M*− denotes the pseudo-inverse of the matrix *M*.

The authors of [14] propose a particular method to generate vector *v*(*t*), which ensures that the complexity of a single iteration of the flight phase is <sup>O</sup>(*J*3). This leads to an overall complexity of <sup>O</sup>(*N J*3) for the flight phase, since it terminates in at most *<sup>N</sup>* iterations.
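To make the description above concrete, here is a rough sketch of the whole flight phase in Python with toy sizes; the variable names and the random direction scheme are ours (not the faster construction of [14], which is available in the R package BalancedSampling). The balancing constraints *Aπ*(*t*) = *Aπ* are preserved at every step, and at termination at most *J* components remain non-integer:

```python
import numpy as np

rng = np.random.default_rng(4)
N, J, eps = 8, 2, 1e-9
A = rng.standard_normal((J, N))      # constraint matrix (toy example)
pi = np.full(N, 0.5)                 # initial inclusion probabilities
target = A @ pi                      # balancing constraints to preserve

def max_step(p, w):
    """Largest l >= 0 such that p + l*w stays inside [0, 1]^N."""
    lam = np.inf
    for pn, wn in zip(p, w):
        if wn > eps:
            lam = min(lam, (1.0 - pn) / wn)
        elif wn < -eps:
            lam = min(lam, pn / -wn)
    return lam

for _ in range(2 * N):               # the flight phase takes at most N steps
    free = (pi > eps) & (pi < 1.0 - eps)
    I = np.diag(free.astype(float))
    v = rng.standard_normal(N)
    # u = I v - I A^t (A I A^t)^- A I v: in ker(A), zero on fixed components
    u = I @ v - I @ A.T @ np.linalg.pinv(A @ I @ A.T) @ A @ I @ v
    if np.max(np.abs(u)) < 1e-8:
        break                        # no admissible direction left: stop
    l1, l2 = max_step(pi, u), max_step(pi, -u)
    # martingale choice: E[pi(t) | pi(t-1)] = pi(t-1)
    pi = pi + l1 * u if rng.random() < l2 / (l1 + l2) else pi - l2 * u
    pi = np.clip(pi, 0.0, 1.0)

n_frac = int(np.sum((pi > eps) & (pi < 1.0 - eps)))   # at most J
```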

## *3.5. Landing Phase*

Denote by *π*<sup>⋆</sup> the value of process *π*(*t*) when the flight phase terminates. If *π*<sup>⋆</sup> is a vertex of C (i.e., all its components are either zero or one), one may stop and return *π*<sup>⋆</sup> as the output of the cube algorithm. If *π*<sup>⋆</sup> is not a vertex, this informs us that no vertex belongs to K. One may then implement a landing phase, which aims at choosing randomly a vertex which is close to *π*<sup>⋆</sup>, and such that the variance of the components of *V*ˆ is small.

Appendix A gives more details on the landing phase. Note that its worst-case complexity is O(2<sup>*J*</sup>). However, in practice, it is typically either much faster, or not required at all (i.e., *π*<sup>⋆</sup> is already a vertex), as soon as *J* ≪ *N*.

## **4. Cube Thinning**

We now explain how the previous ingredients (control variates, and the cube method) may be combined in order to thin a Markov chain, *X*1, ... , *XN*, into a subsample of size *M*. As before, the invariant distribution of the chain is denoted by *p*, and we assume we know of *J* control variates *hj*, i.e., *p*(*hj*) = 0 for *j* = 1, . . . , *J*.

#### *4.1. First Step: Computing the Weights*

The first step of our method is to use the *J* control variates to compute the *N* weights *wn*, as defined at the end of Section 2.2. Recall that these weights sum to one, and that they automatically fulfil the constraints:

$$\sum\_{n=1}^{N} w\_n h\_j(X\_n) = 0\tag{15}$$

for *j* = 1, . . . , *J*, and that we use them to compute

$$\hat{p}_\star(f) = \sum_{n=1}^{N} w_n f(X_n) \tag{16}$$

as a low-variance estimate for *p*(*f*) for any *f* .

Recall that the control variate procedure described in Section 2 assumes that the input variables, *X*1, ... , *XN*, are IID. This is obviously not the case in an MCMC context; however, we follow the common practice [10,11] of applying the procedure to MCMC points as if they were IID. This implies that the weighted estimate above corresponds to a value of *β* in (3) that does not minimise the (asymptotic) variance of estimator (3). It is actually possible to estimate the value of *β* that minimises the asymptotic variance of an MCMC estimate [7,15]. However, this type of approach is specific to certain MCMC samplers and, critically for us, it cannot be cast as a weighting scheme. Thus, we stick to the standard approach.

We note in passing that, in our experiments (see Figure 1 and the surrounding discussion), the weights *wn* make it easy to visually assess the convergence (and thus the burn-in) of the Markov chain. In fact, since the MCMC points of the burn-in phase are far from the mass of the target distribution, the procedure must assign a small or negative weight to these points in order to respect the constraints based on the control variates. Again, see Section 5.2 for more discussion on this issue. The fact that control variates may be used to assess MCMC convergence has been known for a long time (e.g., [16]), but the visualisation of weights makes this idea more expedient.

**Figure 1.** Lotka–Volterra example: first 5000 weights of the cube methods, based on full (**top**) or diagonal (**bottom**) set of covariates.

#### *4.2. Second Step: Cube Resampling*

The second step consists in resampling the weighted sample (*wn*, *Xn*)*n*=1,...,*N* to obtain a subsample S = {*Xn* : *Sn* = 1}, where the *Sn* are random variables such that (a) E[*Sn*] = *Mwn*; (b) ∑_{n=1}^{N} *Sn* = *M*; and (c) for *j* = 1, . . . , *J*:

$$\sum\_{S\_n=1} h\_j(X\_n) = 0.$$

Condition (a) ensures that the procedure does not introduce any bias:

$$\mathbb{E}\left[\frac{1}{\mathcal{M}}\sum\_{S\_n=1}f(X\_n)\,\middle|\,X\_{1:N}\right]=\sum\_{n=1}^N w\_n f(X\_n).$$

Condition (b) ensures that the subsample is exactly of size *M*.

We would like to use the cube method to generate the *Sn*'s. Specifically, we would like to set the inclusion probabilities to *πn* = *Mwn*, and impose the (*J* + 1) constraints defined above by conditions (b) and (c). There is one caveat, however: the quantities *Mwn* do not necessarily lie in [0, 1] (the weights *wn* may even be negative).

## *4.3. Dealing with Weights Outside of* [0, 1]

We rewrite (16) as:

$$\hat{p}_\star(f) = \frac{\Omega}{M} \sum_{n=1}^{N} W_n \operatorname{sgn}(w_n) f(X_n) \tag{17}$$

where Ω = ∑_{n=1}^{N} |*wn*| and *Wn* = *M*|*wn*|/Ω. We now have *Wn* ≥ 0 and ∑_{n=1}^{N} *Wn* = *M*, which is required for condition (b) in the previous section. We might still have a few points such that *Wn* > 1. In that case, we replace each of them by ⌈*Wn*⌉ copies, with adjusted weights *Wn*/⌈*Wn*⌉.
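A small sketch of this normalisation step, on toy weights chosen by us (one weight is negative and one adjusted probability exceeds one, to exercise both cases):

```python
import numpy as np

w = np.array([0.4, 0.35, -0.05, 0.2, 0.1])      # signed weights, sum to 1
M = 3                                           # target subsample size

Omega = np.abs(w).sum()
W = M * np.abs(w) / Omega                       # tentative inclusion probs

# Split any point with W_n > 1 into ceil(W_n) copies of weight W_n/ceil(W_n).
idx, probs, signs = [], [], []
for n, Wn in enumerate(W):
    k = max(1, int(np.ceil(Wn)))                # number of copies of point n
    idx += [n] * k
    probs += [Wn / k] * k
    signs += [float(np.sign(w[n]))] * k

probs = np.array(probs)                         # valid inclusion probabilities
```

The resulting `probs` all lie in [0, 1] and still sum to *M*, as required by condition (b).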

It then becomes possible to implement the cube method, using the *Wn*s as inclusion probabilities, and, as the matrix that defines the *J* + 1 constraints, the matrix *A* = (*A_{jn}*) such that *A*_{1n} = 1, *A*_{(j+1)n} = sgn(*wn*)*hj*(*Xn*). The cube method samples variables *Sn*, which may be used to compute the subsampled estimate

$$\hat{\nu}(f) = \frac{\Omega}{M} \sum_{S_n=1} \operatorname{sgn}(w_n) f(X_n). \tag{18}$$

More generally, in our numerical experiments, we shall evaluate to which extent the random signed measure

$$\hat{\nu}(dx) = \frac{\Omega}{M} \sum_{S_n=1} \operatorname{sgn}(w_n)\, \delta_{X_n}(dx) \tag{19}$$

is a good approximation of the target distribution *p*.

#### **5. Experiments**

We consider two examples. The first example is taken from [5], and is used to compare cube thinning with KSD thinning. The second example illustrates cube thinning when used in conjunction with control variates that are not gradient-based. We also include standard thinning in our comparisons.

Note that there is little point in comparing these methods in terms of CPU cost, as KSD thinning is considerably slower than cube thinning and standard thinning whenever *M* ≳ 100. (In one of our experiments, for *M* = 1000, KSD took close to 7 h to run, while cube thinning with all the covariates took about 30 s.) Thus, our comparison will be in terms of statistical error or, more precisely, in terms of how representative of *p* the selected subsample is.

In the following (in particular in the plots), "cubeFull" (resp. "cubeDiagonal") will refer to our approach based on the full (resp. diagonal) set of control variates, as discussed in Section 2.3. "NoBurnin" means that burn-in has been discarded manually (hence, no burn-in in the inputs). Finally, "thinning" denotes the usual thinning approach, "SMPCOV", "MED" and "SCLMED" are the same names used in [5] for KSD thinning, based on three different kernels.

To implement the cube method, we used R package BalancedSampling.

## *5.1. Evaluation Criteria*

We could compare the three different methods in terms of variance of the estimates of *p*(*f*) for certain functions *f* . However, it is easy to pick functions *f* that are strongly correlated with the chosen control variates; this would bias the comparison in favour of our approach. In fact, as soon as the target is Gaussian-like, the control variates we chose in Section 2.3 should be strongly correlated with the expectation of any polynomial function of order two, as we discussed in that section.

Rather, we consider criteria that are indicative of the performance of the methods for a general class of functions. Specifically, we consider three such criteria. The first one is the kernel Stein discrepancy (KSD) as defined in [5] and recalled in the introduction; see (1). Note that this criterion is particularly favourable to KSD thinning, since this approach specifically minimises this quantity. (We use the particular version based on the median kernel in Riabiz et al. [5].)

The second criterion is the energy distance (ED) between *p* and the empirical distribution defined by the thinning method, e.g., (19) for cube thinning. Recall that the ED between two distributions *F* and *G* is:

$$ED(F,G) = 2\mathbb{E}||Z - X||\_2 - \mathbb{E}||Z - Z'||\_2 - \mathbb{E}||X - X'||\_2\tag{20}$$

where *Z*, *Z*′ are IID draws from *F* and *X*, *X*′ are IID draws from *G*, and that this quantity is actually a pseudo-distance: *ED*(*F*, *G*) ≥ 0, *ED*(*F*, *G*) = 0 ⇒ *F* = *G*, and *ED*(*F*, *G*) = *ED*(*G*, *F*), but ED does not fulfil the triangle inequality [17,18].

One technical difficulty is that (19) is a signed measure, not a probability measure; see Appendix B on how we dealt with this issue.
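A plug-in (V-statistic) estimate of (20) between two empirical distributions, represented as point clouds, can be sketched as follows (our implementation, for illustration; the point clouds are arbitrary):

```python
import numpy as np

def energy_distance(Z, X):
    """Plug-in estimate of ED (20) between the empirical laws of Z and X."""
    def mean_dist(U, V):
        diff = U[:, None, :] - V[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()
    return 2 * mean_dist(Z, X) - mean_dist(Z, Z) - mean_dist(X, X)

rng = np.random.default_rng(5)
Z = rng.standard_normal((200, 2))
X = rng.standard_normal((200, 2)) + 1.0          # shifted point cloud

ed_same = energy_distance(Z, Z)                  # 0 for identical samples
ed_diff = energy_distance(Z, X)                  # > 0 for distinct samples
```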

Our third criterion is inspired by the star discrepancy, a well-known measure of the uniformity of *N* points *un* ∈ [0, 1] *<sup>d</sup>* in the context of quasi-Monte Carlo sampling [9] (Chapter 15). Specifically, we consider the quantity

$$d^\star(\hat{P}, \hat{\nu}) = \sup_{B \in \mathcal{B}} \left| \hat{P}_\psi(B) - \hat{\nu}_\psi(B) \right|$$

where *ψ* : R^*d* → [0, 1]^*d* is a fixed transformation, *P*ˆ_*ψ* and *ν*ˆ_*ψ* are the push-forward measures, under *ψ*, of the empirical distributions *P*ˆ = (*N* − *b*)^{−1} ∑_{n=b+1}^{N} δ_{X_n}(*dx*) (where *b* is the burn-in period) and *ν*ˆ as defined in (19), and B is the set of hyper-rectangles *B* = ∏_{i=1}^{d} [0, *b_i*]. In practice, we defined the function *ψ* as follows: we applied the linear transform that gives the considered sample zero mean and unit variance, and then applied the CDF (cumulative distribution function) of a unit Gaussian to each component.

Additionally, since the sup above is not tractable, we replace it by a maximum over a finite number of *bi* (simulated uniformly).
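This approximate criterion can be sketched as follows; the transform is fitted on the reference sample, and the function names are ours:

```python
import math
import numpy as np

def star_discrepancy(X, Y, n_rect=500, seed=0):
    """Approximate star discrepancy between the empirical laws of X and Y."""
    rng = np.random.default_rng(seed)
    mu, sd = X.mean(axis=0), X.std(axis=0)        # transform fitted on X
    phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    U, V = phi((X - mu) / sd), phi((Y - mu) / sd)  # mapped into [0, 1]^d
    d_star = 0.0
    for _ in range(n_rect):
        b = rng.random(X.shape[1])                # random upper corner b_i
        pU = np.mean(np.all(U <= b, axis=1))
        pV = np.mean(np.all(V <= b, axis=1))
        d_star = max(d_star, abs(pU - pV))
    return d_star

rng = np.random.default_rng(6)
X = rng.standard_normal((500, 2))
d_same = star_discrepancy(X, X)                   # 0 by construction
d_diff = star_discrepancy(X, X + 2.0)             # large for a shifted cloud
```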

#### *5.2. Lotka–Volterra Model*

This example is taken from [5]. The Lotka–Volterra model describes the evolution of a prey–predator system in a closed environment. We denote the number of prey by *u*<sup>1</sup> and the number of predators by *u*2. The growth rate of the prey is controlled by a parameter *θ*<sup>1</sup> > 0 and its death rate—due to the interactions with the predators—is controlled by a parameter *θ*<sup>2</sup> > 0. In the same way, the predator population has a death rate of *θ*<sup>3</sup> > 0 and a growth rate of *θ*<sup>4</sup> > 0. Given these parameters, the evolution of the system is described by a system of ODEs:

$$\begin{aligned} \frac{du_1}{dt} &= \theta_1 u_1 - \theta_2 u_1 u_2 \\ \frac{du_2}{dt} &= \theta_4 u_1 u_2 - \theta_3 u_2 \end{aligned}$$

Ref. [5] set *θ* = (*θ*1, *θ*2, *θ*3, *θ*4) = (0.67, 1.33, 1, 1) and the initial condition *u*<sup>0</sup> = (1, 1), and simulated synthetic data. They assume the populations of prey and predators are observed at times *ti*, *i* = 1, ... , 2400, where the *ti* are taken uniformly on [0, 25], and that these observations are corrupted with centered Gaussian noise with covariance matrix *C* = diag(0.2<sup>2</sup>, 0.2<sup>2</sup>). Finally, the model is parametrised in terms of *x* = (log *θ*1, log *θ*2, log *θ*3, log *θ*4) ∈ R<sup>4</sup>, and a standard normal distribution is used as a prior on *x*.
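For concreteness, the forward model can be simulated with a standard fourth-order Runge-Kutta scheme; this is a generic sketch of ours, not the ODE solver used in [5]:

```python
import numpy as np

def lv_rhs(u, theta):
    """Right-hand side of the Lotka-Volterra ODE system above."""
    u1, u2 = u
    t1, t2, t3, t4 = theta
    return np.array([t1 * u1 - t2 * u1 * u2, t4 * u1 * u2 - t3 * u2])

def rk4(f, u0, theta, T=25.0, n_steps=5000):
    """Classical RK4 integration on [0, T]; returns the trajectory."""
    dt, u, traj = T / n_steps, np.array(u0, dtype=float), []
    for _ in range(n_steps):
        k1 = f(u, theta)
        k2 = f(u + 0.5 * dt * k1, theta)
        k3 = f(u + 0.5 * dt * k2, theta)
        k4 = f(u + dt * k3, theta)
        u = u + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(u.copy())
    return np.array(traj)

theta = (0.67, 1.33, 1.0, 1.0)
traj = rk4(lv_rhs, (1.0, 1.0), theta)   # prey/predator populations on [0, 25]
```

Both populations remain positive and bounded, as expected for the closed orbits of this system.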

The authors have provided their code, as well as the sampled values they obtained by running different MCMC chains for a long time. We use the exact same experimental set-up, and do not run any MCMC chain on our own; instead, we use the chains they provide, specifically the simulated chain, of length 2 × 10<sup>6</sup>, from preconditioned MALA.

We compress this chain into a subsample of size either *M* = 100 or *M* = 1000. For each value of *M*, we run different variations of our cube method 50 times and make a comparison with the usual thinning method and with the KSD thinning method with different kernels, see [5]. In Figure 1, we show the first 5000 weights of the cube method. We can see that after 1000 iterations, the weights seem to stabilise. Based on visual examination of these weights, we chose a conservative burn-in period of 2000 iterations for the variants where burn-in is removed manually.

We plot the results of the experiment in Figures 2–4.

First, we see that, regarding the kernel Stein discrepancy metric (Figure 2), the KSD method performs better than the standard thinning procedure and the cube method. This is not surprising since, even if this method does not exactly minimise the kernel Stein discrepancy, this is still its target. We also see that, for *M* = 1000, the KSD method performs a bit better than our cube method, which in turn performs better than the standard thinning procedure. Note that the relative performance of the KSD method compared to our cube methods depends on the kernel being used, and that there is no way to determine which kernel will perform best before running the experiment.

The picture is different for *M* = 100: KSD thinning outperforms standard thinning, which in turn outperforms all of our cube thinning variations. Once again, the fact that the KSD method performs better than any other method seems reasonable: since it aims at minimising the kernel Stein discrepancy, the KSD method is "playing at home" on this metric.

If we look at Figure 4, we see that all of our cube methods outperform the KSD method with any kernel. Interestingly, the standard thinning method has a similar energy distance to the cube methods with "diagonal" control variates. These observations are true for both *M* = 100 and *M* = 1000. We can also note that the cube method with the full set of control variates tends to perform much better than its "diagonal" counterpart, whatever the value of *M*.

Finally, looking at Figure 3, it is clear that the KSD method—with any kernel—performs worse than any cube method in terms of star discrepancy.

**Figure 2.** Lotka–Volterra example: box-plots of the kernel Stein discrepancy for all the cube method variations, compared with the KSD method for three kernels and the usual thinning method (horizontal lines). **Top**: *M* = 100. **Bottom**: *M* = 1000. (In the top plot, standard thinning is omitted to improve clarity, as the corresponding value is too high.)

**Figure 3.** Lotka–Volterra example: box-plots of the star discrepancy for all the cube method variations, compared with the KSD method for three kernels and the usual thinning method (horizontal lines). **Top**: *M* = 100. **Bottom**: *M* = 1000.


**Figure 4.** Lotka–Volterra example: boxplots of the energy distance for all the cube method variations, compared with the KSD method for three kernels and the usual thinning method (horizontal lines). **Top**: *M* = 100. **Bottom**: *M* = 1000.

Overall, the relative performance of the cube methods and KSD methods can change a lot depending on the metric used and the number of points we keep. In addition, while all the cube methods tend to perform roughly the same, this is not the case for the KSD method, whose performance depends on the kernel used. Unfortunately, we have no way to determine beforehand which kernel will perform best. This is a problem, since the KSD method is computationally expensive for subsamples of cardinality *M* ≳ 100.

Thus, by and large, cube thinning seems much more convenient to use (both in terms of CPU time and sensitivity to tuning parameters) while offering, roughly, the same level of statistical performance.

#### *5.3. Truncated Normal*

In this example, we use the (random-scan version of the) Gibbs sampler of [1] to sample from a 10-dimensional multivariate normal distribution truncated to [0, ∞)<sup>10</sup>. We generated the parameters of this truncated normal as follows: the mean was set to a realisation of a 10-dimensional standard normal distribution, while for the covariance matrix Σ, we first generated a matrix *M* whose entries were realisations of a standard normal distribution, and then set Σ = *M<sup>t</sup>M*.

Since we used a Gibbs sampler, we have access to the Gibbs control variates of [7], based on the expectation of each update (which amounts to simulating from a univariate Gaussian). Thus, we consider 10 control variates.

The Gibbs sampler was run for *N* = 10<sup>5</sup> iterations and no burn-in was performed. We compare the following estimators of the expectation of the target distribution: the standard estimator, based on the whole chain ("usualEstim" in the plots); the estimator based on standard thinning ("thinEstim"); the control variate estimator based on the whole chain, i.e., (7) ("regressionEstim"); and finally our cube estimator described in Section 4 ("cubeEstim"). For standard thinning and cube thinning, the thinned sample size was set to *M* = 100, which corresponds to a compression factor of 10<sup>3</sup>.

The results are shown in Figure 5. First, we can see that the control variates we chose led to a substantial decrease in the variance of the estimates for regressionEstim compared to usualEstim. Second, the cube estimator performed worse than the regression estimator in terms of variance, but this was expected, as explained in Section 4. More interestingly, while we cannot say that the cube estimator performs better than the usual MCMC estimator in general, we can see that for some components it performed as well or even better, even though the cube estimator used only *M* = 100 points while the usual estimator used 10<sup>5</sup> points. This is largely due to the good choice of control variates. Finally, the cube estimator outperformed the regular thinning estimator for every component, sometimes significantly.

**Figure 5.** Truncated normal example: box-plots over 100 independent replicates of each estimator; see text for more details.

**Author Contributions:** Conceptualization, N.C.; Formal analysis, N.C. and G.D.; Investigation, G.D.; Methodology, G.D.; Software, G.D.; Writing—original draft, G.D.; Writing—review and editing, N.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** The PhD grant of the second author is funded by the French National Research Agency (ANR) contract ANR-17-C23-0002-01 (project B3DCMB).

**Data Availability Statement:** The data that support the findings of the first numerical experiment are openly available in stein.thinning at https://github.com/wilson-ye-chen/stein.thinning (accessed on 2 August 2021).

**Acknowledgments:** We are grateful to the editor and the referees for their supportive and useful comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **Appendix A. Details on the Landing Phase**

The landing phase seeks to generate a random vector *S* in {0, 1}<sup>*N*</sup>, with expectation *π*<sup>⋆</sup> (the output of the flight phase), which minimises the criterion tr(*M* Var(*V*ˆ | *π*<sup>⋆</sup>)) for a certain matrix *M*. (The notation · | *π*<sup>⋆</sup> refers to the distribution of *S* conditional on *π*(*t*) = *π*<sup>⋆</sup> at the end of the flight phase.)

Since Var(*V*ˆ) = Var(E[*V*ˆ | *π*<sup>⋆</sup>]) + E[Var(*V*ˆ | *π*<sup>⋆</sup>)] by the law of total variance, and since the first term is zero (as E[*S* | *π*<sup>⋆</sup>] = *π*<sup>⋆</sup>, and *Aπ*<sup>⋆</sup> is kept constant by the flight phase), we have

$$\operatorname{Var}(\hat{V}) = \mathbb{E}[\operatorname{Var}(\hat{V}\,|\,\pi^{\star})] = \mathbb{E}[A\operatorname{Var}(S\,|\,\pi^{\star})A^{t}].\tag{A1}$$

and thus:

$$\operatorname{tr}(M\operatorname{Var}(\hat{V}\,|\,\pi^{\star})) = \sum_{s \in \{0,1\}^{N}} p(s\,|\,\pi^{\star})\, (s - \pi^{\star})^{t} A^{t} M A (s - \pi^{\star}).\tag{A2}$$

Choosing *M* = (*AA<sup>t</sup>*)<sup>−1</sup>, as recommended by [6], amounts to minimising the distance to the hyperplane 'on average'. Let *C*(*s*) = (*s* − *π*<sup>⋆</sup>)<sup>*t*</sup>*A<sup>t</sup>*(*AA<sup>t</sup>*)<sup>−1</sup>*A*(*s* − *π*<sup>⋆</sup>); then the minimisation problem is equivalent to the following linear programming problem over *q* variables only:

$$\min_{\xi^{\star}(\cdot)} \sum_{s^{\star} \in \mathcal{S}^{\star}} C(s^{\star})\, \xi^{\star}(s^{\star}) \tag{A3}$$

with constraints ∑<sub>*s*<sup>⋆</sup>∈S<sup>⋆</sup></sub> *ξ*<sup>⋆</sup>(*s*<sup>⋆</sup>) = 1, 0 ≤ *ξ*<sup>⋆</sup>(*s*<sup>⋆</sup>) ≤ 1, and ∑<sub>*s*<sup>⋆</sup>∈S<sup>⋆</sup>: *s*<sup>⋆</sup><sub>*k*</sub>=1</sub> *ξ*<sup>⋆</sup>(*s*<sup>⋆</sup>) = *π*<sup>⋆</sup><sub>*k*</sub> for every *k* ∈ *U*, where S<sup>⋆</sup> = {0, 1}<sup>*q*</sup>, *q* = card(*U*), and *U* = {*k* : 0 < *π*<sup>⋆</sup><sub>*k*</sub> < 1}. Here, *ξ*<sup>⋆</sup> denotes the marginal distribution of the components *U* of the sampling design *ξ*, and *C*(*s*<sup>⋆</sup>) must be understood as *C*(*s*) with the components of *s* outside *U* fixed by the result of the flight phase; thus, in this minimisation problem, *C* depends only on the components of *s* that are in *U*.

The constraints define a bounded polyhedron. By the fundamental theorem of linear programming, this optimisation problem has at least one solution with minimal support; see [6].

The flight phase ends on a vertex of K and, by Proposition 1 in [6], *q* ≤ *J*—typically *J* ≪ *N*. This means that we are solving a linear programming problem in a dimension *q* potentially much lower than the population size *N*; as long as we do not have too many auxiliary variables, this optimisation problem is not computationally expensive. In practice, a simplex algorithm is used to find the solution.
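To make the landing-phase objective concrete, the following is a minimal sketch of the cost *C*(*s*) on a toy problem. All values (*N*, the number of auxiliary variables, *A*, *π*<sup>⋆</sup>) are illustrative assumptions, and the brute-force scan deliberately ignores the marginal constraints just to show the cost landscape; the simplex algorithm described above solves the constrained problem on a support of dimension *q* only.

```python
import itertools
import numpy as np

# Hypothetical toy setting: N units, J auxiliary variables.
rng = np.random.default_rng(0)
N, J = 6, 2
A = rng.normal(size=(J, N))          # one row per auxiliary variable
pi = rng.uniform(0.2, 0.8, size=N)   # inclusion probabilities after the flight phase

M = np.linalg.inv(A @ A.T)           # M = (A A^t)^{-1}, as recommended by [6]

def landing_cost(s):
    """C(s) = (s - pi)^t A^t (A A^t)^{-1} A (s - pi)."""
    d = A @ (s - pi)
    return float(d @ M @ d)

# Brute-force scan over all 2^N candidate samples (no marginal constraints).
candidates = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=N)]
costs = np.array([landing_cost(s) for s in candidates])
best = candidates[int(np.argmin(costs))]
```

Since *M* is positive semi-definite, every cost is non-negative, and the minimiser is the 0–1 vector whose auxiliary totals *As* come closest to *Aπ*<sup>⋆</sup> in the metric defined by *M*.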

#### **Appendix B. Estimation of the Energy Distance**

There are two difficulties with computing (20). First, it involves intractable expectations. Second, as pointed out at the end of Section 4.3, the empirical distribution generated by cube thinning, (19), is actually a signed measure.

Regarding the first issue, we can approximate (20) from our MCMC sample *X*<sub>1</sub>, ... , *X<sub>N</sub>*. That is, if our subsampled empirical measure is written *ν*ˆ = ∑<sup>*M*</sup><sub>*m*=1</sub> *w<sub>m</sub>δ<sub>Z<sub>m</sub></sub>* and we approximate the distribution associated with *p* by *P*ˆ = (*N* − *b*)<sup>−1</sup> ∑<sup>*N*</sup><sub>*n*=*b*+1</sub> *δ<sub>X<sub>n</sub></sub>*, where 1 ≤ *b* ≤ *N* is the burn-in of the chain, then we can estimate *ED*(*ν*ˆ, *p*) with *ED*(*ν*ˆ, *P*ˆ).

Regarding the second issue, we can generalize the energy distance to finite measures: suppose we have two finite and potentially signed measures *ν*<sub>1</sub> and *ν*<sub>2</sub>, both defined on the same measurable space (Ω, P(Ω)), where Ω = {*X*<sub>1</sub>, ... , *X<sub>N</sub>*} and P(Ω) denotes the power set of Ω. Suppose, in addition, that *ν*<sub>1</sub>(Ω) = *α*<sub>1</sub> and *ν*<sub>2</sub>(Ω) = *α*<sub>2</sub> with *α*<sub>1</sub> ≠ 0 and *α*<sub>2</sub> ≠ 0. We define the generalized energy distance as:

$$\begin{split} ED^\star(\nu_1, \nu_2) &= \frac{2}{\alpha_1 \alpha_2} \int_{\Omega^2} \|x - y\|_2 \, d\nu_1(x)\, d\nu_2(y) \\ &\quad- \frac{1}{\alpha_1^2} \int_{\Omega^2} \|x - x'\|_2 \, d\nu_1(x)\, d\nu_1(x') \\ &\quad- \frac{1}{\alpha_2^2} \int_{\Omega^2} \|y - y'\|_2 \, d\nu_2(y)\, d\nu_2(y'). \end{split}$$

Then, by negative definiteness of the map *φ*(*x*, *y*) = ||*x* − *y*||<sub>2</sub> on R<sup>*N*</sup> × R<sup>*N*</sup>, *ED*<sup>⋆</sup>(*ν*<sub>1</sub>, *ν*<sub>2</sub>) ≥ 0, with equality if and only if (1/*α*<sub>1</sub>)*ν*<sub>1</sub> = (1/*α*<sub>2</sub>)*ν*<sub>2</sub>. This means that the generalized energy distance is zero if and only if the two measures are equal up to a non-zero multiplicative constant—see [17] for a proof. This generalized energy distance is also symmetric, but the triangle inequality does not hold: it is a pseudo-distance.

Thus, we will use the following criterion, which we will call the energy distance:

$$\begin{split} ED^\star(\hat{\nu}, \hat{P}) &= \frac{2}{(N-b)\,\alpha_1} \sum_{k=1}^N \sum_{n=b+1}^N \frac{N}{M} \operatorname{sgn}(w_k)\, \|X_k - X_n\|_2\, \mathbf{1}_{\{\xi_k = 1\}} \\ &\quad- \frac{1}{\alpha_1^2} \sum_{n=1}^N \sum_{k=1}^N \left(\frac{N}{M}\right)^2 \operatorname{sgn}(w_n) \operatorname{sgn}(w_k)\, \|X_k - X_n\|_2\, \mathbf{1}_{\{\xi_k = 1\}} \mathbf{1}_{\{\xi_n = 1\}}, \end{split}$$

where *ν*ˆ is defined in (19); we dropped the last term because it does not depend on *ν*ˆ and is a potentially expensive sum of (*N* − *b*)<sup>2</sup> terms.
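The generalized energy distance above can be sketched directly for two weighted empirical measures. The function below is a minimal illustration (the point sets and weights are made-up test data, not from the paper's experiments):

```python
import numpy as np

def energy_distance(x1, w1, x2, w2):
    """Generalized energy distance ED*(nu1, nu2) between two finite, possibly
    signed, weighted empirical measures nu_i = sum_k w_i[k] * delta_{x_i[k]}."""
    a1, a2 = w1.sum(), w2.sum()
    assert a1 != 0 and a2 != 0, "ED* is undefined when a measure has zero total mass"

    def cross(x, wx, y, wy):
        # sum_{k,l} wx[k] * wy[l] * ||x_k - y_l||_2
        d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
        return wx @ d @ wy

    return (2.0 / (a1 * a2)) * cross(x1, w1, x2, w2) \
        - (1.0 / a1 ** 2) * cross(x1, w1, x1, w1) \
        - (1.0 / a2 ** 2) * cross(x2, w2, x2, w2)

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))
w = rng.uniform(0.5, 1.5, size=50)
y = x + 1.0                       # a shifted point cloud
w2 = np.ones(50)
```

As stated above, the distance vanishes when the two measures agree up to a non-zero multiplicative constant, and is strictly positive otherwise.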

Note that the probability of *ν*ˆ(Ω) being zero is non-zero, in which case *ED*<sup>⋆</sup>(*ν*ˆ, *P*ˆ) is undefined. However, this event is unlikely in practice.

## **References**


## *Article* **Accelerated Diffusion-Based Sampling by the Non-Reversible Dynamics with Skew-Symmetric Matrices**

**Futoshi Futami 1,\*, Tomoharu Iwata 1, Naonori Ueda <sup>1</sup> and Issei Sato <sup>2</sup>**


**\*** Correspondence: futoshi.futami.uk@hco.ntt.co.jp

**Abstract:** Langevin dynamics (LD) has been extensively studied theoretically and practically as a basic sampling technique. Recently, the incorporation of non-reversible dynamics into LD has been attracting attention because it accelerates the mixing speed of LD. Popular choices for non-reversible dynamics include underdamped Langevin dynamics (ULD), which uses second-order dynamics, and perturbations with skew-symmetric matrices. Although ULD has been widely used in practice, the application of skew acceleration remains limited, even though it is expected to show superior performance theoretically. Current work lacks a theoretical understanding of issues that are important to practitioners, including the selection criteria for skew-symmetric matrices, quantitative evaluations of acceleration, and the large memory cost of storing skew matrices. In this study, we theoretically and numerically clarify these problems by analyzing acceleration, focusing on how the skew-symmetric matrix perturbs the Hessian matrix of potential functions. We also present a practical algorithm that accelerates standard LD and ULD, which uses novel memory-efficient skew-symmetric matrices under parallel-chain Monte Carlo settings.

**Keywords:** Markov Chain Monte Carlo; Langevin dynamics; Hamilton Monte Carlo; non-reversible dynamics

## **1. Introduction**

Sampling is one of the most widely used techniques for approximating posterior distributions in Bayesian inference [1]. Markov Chain Monte Carlo (MCMC) is widely used to obtain samples. Within MCMC, Langevin dynamics (LD) is a popular choice for sampling from high-dimensional distributions. Each sample in LD moves in the gradient direction with added Gaussian noise. LD efficiently explores around a mode of a target distribution using the gradient information, without being trapped in local minima thanks to the added Gaussian noise. Many previous studies theoretically and numerically established LD's superior performance [2–5]. Since non-reversible dynamics generally improves mixing performance [6,7], research on introducing non-reversible dynamics into LD for better sampling performance is attracting attention [8].

There are two widely known non-reversible dynamics for LD. One is underdamped Langevin dynamics (ULD) [9], which uses second-order dynamics. The other introduces a perturbation, which consists of multiplying the gradient by a skew-symmetric matrix [8]. Here, we refer to such matrices as skew matrices for simplicity and to this perturbation technique as skew acceleration. Much research has been done on ULD theoretically [9–11], and ULD is widely used in practice; it is also known as stochastic gradient Hamilton Monte Carlo [12]. In contrast, the application of skew acceleration to standard Bayesian models is quite limited, even though it is expected to show superior performance theoretically [8].

**Citation:** Futami, F.; Iwata, T.; Ueda, N.; Sato, I. Accelerated Diffusion-Based Sampling by the Non-Reversible Dynamics with Skew-Symmetric Matrices. *Entropy* **2021**, *23*, 993. https://doi.org/ 10.3390/e23080993

Academic Editor: Pierre Alquier

Received: 21 June 2021 Accepted: 27 July 2021 Published: 30 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

For example, skew acceleration has been analyzed focusing on sampling from Gaussian distributions [13–17], although assuming Gaussian distributions in Bayesian models is restrictive in practice. A recent study [8] theoretically showed that skew acceleration accelerates the dynamics around the local minima and saddle points for non-convex functions. Another work [18] clarified that the skew acceleration theoretically and numerically improves mixing speed when used as interactions between chains in parallel sampling schemes for non-convex Bayesian models.

Compared to ULD, what is lacking for skew acceleration is a theoretical understanding of issues that are important to practitioners. The most significant problem is that no theory exists for selecting skew matrices. In existing studies, introducing a skew matrix into LD results in equal or faster convergence, meaning that a bad choice of skew matrix results in no acceleration. Thus, choosing appropriate skew matrices is critical. Furthermore, although ULD's acceleration has been analyzed quantitatively, existing studies have only analyzed skew acceleration qualitatively. Thus, it is difficult to justify the usefulness of skew acceleration in practice compared to ULD. Another issue is that introducing skew matrices incurs a vast memory cost in many practical Bayesian models.

The purpose of this study is to solve these problems from theoretical and numerical viewpoints and establish a practical algorithm for skew acceleration. The following are the two major contributions of this work.

Our contribution 1: We present a convergence analysis of skew acceleration for standard Bayesian model settings, including non-convex potential functions using Poincaré constants [19]. The major advantage of Poincaré constants is that we can analyze skew acceleration through a Hessian matrix and its eigenvalues and develop a practical theory about the selection of *J* and the quantitative assessment of skew acceleration.

Furthermore, we propose skew acceleration for ULD and present convergence analysis for the first time. Since ULD shows faster convergence than LD, combining skew acceleration with ULD is promising.

Our contribution 2: We develop a practical skew-accelerated sampling algorithm for a parallel sampling setting with novel memory-efficient skew matrices. Since a naive implementation of skew acceleration requires a large memory cost to store skew matrices, memory efficiency is critical in practice. We also present a non-asymptotic theoretical analysis for our algorithm in both LD and ULD settings under a stochastic gradient and Euler discretization. We clarify that introducing skew matrices accelerates the convergence of the continuous dynamics, although it increases the discretization and stochastic gradient error. Then, to the best of our knowledge, we propose the first algorithm that adaptively controls this trade-off using the empirical distribution of the parallel sampling scheme.

Finally, we verify our algorithm and theory in practical Bayesian problems and compare it with other sampling methods.

Notations: *I<sub>d</sub>* denotes a *d* × *d* identity matrix. Capital letters such as *X* represent random variables, and lowercase letters such as *x* represent non-random real values. ⟨·, ·⟩, ∥·∥, and |·| denote the Euclidean inner product, norm, and absolute value, respectively.

#### **2. Preliminaries**

In this section, we briefly introduce the basic settings of LD and non-reversible dynamics for the posterior distribution sampling in Bayesian inference.

### *2.1. LD and Stochastic Gradient LD*

First, we introduce the notation and the basic settings of LD and stochastic gradient LD (SGLD), which is a practical extension of LD. Here *z<sub>i</sub>* denotes a data point in space Z, |Z| denotes the total number of data points, and *x* ∈ R<sup>*d*</sup> corresponds to the parameters of a given model, which we want to sample. Our goal is to sample from the target distribution with density *dπ*(*x*) ∝ *e*<sup>−*βU*(*x*)</sup>*dx*, where the potential function *U*(*x*) is the average of *u* : R<sup>*d*</sup> × Z → R over the data, i.e., *U*(*x*) = (1/|Z|) ∑<sup>|Z|</sup><sub>*i*=1</sub> *u*(*x*, *z<sub>i</sub>*). Function *u*(·, ·) is continuous and possibly non-convex. The explicit assumptions made for it are discussed in Section 3.1. The SGLD algorithm [2,3] is given as a recursion:

$$X_{k+1} = X_k - h\nabla \hat{U}(X_k) + \sqrt{2h\beta^{-1}}\,\varepsilon_k, \tag{1}$$

where *<sup>h</sup>* <sup>∈</sup> <sup>R</sup><sup>+</sup> is a step size, *<sup>k</sup>* <sup>∈</sup> <sup>R</sup>*<sup>d</sup>* is a standard Gaussian random vector, *<sup>β</sup>* is a temperature parameter of *<sup>π</sup>*, and <sup>∇</sup>*U*<sup>ˆ</sup> (*Xk*) is a conditionally unbiased estimator of true gradient ∇*U*(*Xk*). This unbiased estimate of the true gradient is suitable for large-scale data set since we can use not the full gradient, but a stochastic version obtained through a randomly chosen subset of data at each time step. This means that we can reduce the computational cost to calculate the gradient at each time step.

The discrete-time Markov process in Equation (1) is a discretization of the continuous-time LD [2]:

$$dX_t = -\nabla U(X_t)\,dt + \sqrt{2\beta^{-1}}\,dw_t, \tag{2}$$

where *w<sub>t</sub>* denotes the standard Brownian motion in R<sup>*d*</sup>. The stationary measure of Equation (2) is *dπ*(*x*) ∝ *e*<sup>−*βU*(*x*)</sup>*dx*.

#### *2.2. Poincaré Inequality and Convergence Speed*

In sampling, we are interested in the speed of convergence to the stationary measure. This speed is often characterized by the *generator* associated with Equation (2), defined as:

$$\begin{split} \mathcal{L}f(X_{t}) &:= \lim_{s \to 0^{+}} \frac{\mathbb{E}[f(X_{t+s})\,|\,X_{t}] - f(X_{t})}{s} \\ &= \left( -\nabla U(X_{t}) \cdot \nabla + \beta^{-1} \Delta \right) f(X_{t}), \end{split} \tag{3}$$

where <sup>Δ</sup> denotes a standard Laplacian on <sup>R</sup>*<sup>d</sup>* and *<sup>f</sup>* ∈ D(L) and <sup>D</sup>(L) <sup>⊂</sup> <sup>L</sup>2(*π*) denote the L domain. This −L is a self-adjoint operator, which has only discrete spectrums (eigenvalues). *π* with L has a *spectral gap* if the smallest eigenvalue of −L (other than 0) is positive. We refer to it as *ρ*0(>0). This spectral gap is closely related to Poincaré inequality. Internal energy is defined:

$$\mathcal{E}(f) := -\int\_{\mathbb{R}^d} f \mathcal{L}f d\pi. \tag{4}$$

Please note that E(*f*) ≥ 0 is satisfied. Then *π* with L satisfies the Poincaré inequality with constant *c* if, for any *f* ∈ D(L), *π* with L satisfies:

$$\int f^2 d\pi - \left(\int f d\pi\right)^2 \le c\mathcal{E}(f). \tag{5}$$

The spectral gap characterizes this constant: *c* ≤ 1/*ρ*<sub>0</sub> holds (see Appendix A.2 for details). We refer to the best constant *c* as the Poincaré constant [19]. For notational simplicity, we define *m*<sub>0</sub> := 1/*c* and also refer to this *m*<sub>0</sub> as the Poincaré constant.

In sampling, crucially, the Poincaré inequality governs the convergence speed in *χ*<sup>2</sup> divergence:

$$\chi^2(\mu_t \|\pi) := \int \left(\frac{d\mu_t}{d\pi} - 1\right)^2 d\pi \le e^{-\frac{2m_0}{\beta}t} \chi^2(\mu_0 \|\pi),\tag{6}$$

where *μ<sub>t</sub>* denotes the measure at time *t* induced by Equation (2) and *μ*<sub>0</sub> is the initial measure (see Appendix A.3 for details). Thus, the larger the Poincaré constant *m*<sub>0</sub>, the faster the convergence.

### *2.3. Non-Reversible Dynamics*

In this section, we introduce the non-reversible dynamics. *π* with L is reversible if for any test function *f* , *g* ∈ D(L), *π* with L satisfies

$$\int_{\mathbb{R}^d} f \mathcal{L}g\, d\pi = \int_{\mathbb{R}^d} g \mathcal{L}f\, d\pi. \tag{7}$$

If this is not satisfied, *π* with L is non-reversible [19]. We introduce two non-reversible dynamics for LD. The first is ULD, which is given as

$$\begin{aligned} dX_t &= \Sigma^{-1} V_t\, dt, \\ dV_t &= -\nabla U(X_t)\, dt - \gamma \Sigma^{-1} V_t\, dt + \sqrt{2\gamma \beta^{-1}}\, dw_t, \end{aligned} \tag{8}$$

where *V* ∈ R<sup>*d*</sup> is an auxiliary random variable, *γ* ∈ R is a positive constant, and Σ is the variance of the stationary distribution of the auxiliary random variable *V*. The stationary distribution is *π*˜ := *π* ⊗ N(0, Σ) ∝ exp(−*βU*(*x*) − *v*<sup>*t*</sup>Σ<sup>−1</sup>*v*/2), where N denotes a Gaussian distribution. The superior performance of ULD compared with LD has been studied rigorously [9–11]. ULD's convergence speed is also characterized by the Poincaré constant [20]. In practice, we use discretization and the stochastic gradient for ULD, which is called stochastic gradient Hamilton Monte Carlo (SGHMC) [10]. The second non-reversible dynamics is skew acceleration, given as

$$dX_t = -(I + \alpha J)\nabla U(X_t)\,dt + \sqrt{2\beta^{-1}}\,dw_t, \tag{9}$$

where *J* is a real-valued skew matrix and *α* ∈ R<sup>+</sup> is a positive constant. We call this dynamics S-LD. The stationary distribution of S-LD is still *π*, and S-LD shows faster convergence and smaller asymptotic variance [13–15,18].
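An Euler-discretized sketch of S-LD (Equation (9)) next to plain LD (Equation (2)) may help fix ideas. The quadratic potential *U*(*x*) = *x*<sup>*t*</sup>*Hx*/2 and the particular *H*, *J*, and *α* below are illustrative assumptions; both chains target the same Gaussian N(0, *H*<sup>−1</sup>), since the skew perturbation leaves *π* invariant:

```python
import numpy as np

rng = np.random.default_rng(2)
H = np.array([[1.0, 0.0], [0.0, 3.0]])    # Hessian of U(x) = x^t H x / 2
J = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric: J^t = -J
I = np.eye(2)
alpha, h, beta = 1.0, 1e-2, 1.0

def run(drift, n_steps=20000):
    """Euler scheme x <- x - h * drift * grad U(x) + sqrt(2h/beta) * noise."""
    x, out = np.array([3.0, 3.0]), []
    for _ in range(n_steps):
        x = x - h * drift @ H @ x + np.sqrt(2.0 * h / beta) * rng.normal(size=2)
        out.append(x.copy())
    return np.array(out)

ld = run(I)                  # plain LD, Eq. (2)
sld = run(I + alpha * J)     # skew-accelerated S-LD, Eq. (9)
```

Both trajectories settle around the origin; the non-reversible chain simply circulates around the mode while doing so.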

#### **3. Theoretical Analysis of Skew Acceleration**

In this section, we present a theoretical analysis of skew acceleration in LD and ULD in standard Bayesian settings. We analyze acceleration through the Poincaré constant and connect it with the eigenvalues of the Hessian matrix, which allows us to obtain a practical criterion for choosing skew matrices and to quantitatively evaluate acceleration. We focus here on the setting where the continuous SDE and the full gradient of the potential function are used. The discretized SDE and stochastic gradient are discussed in Section 4.

#### *3.1. Acceleration Characterization by the Poincaré Constant*

First, we introduce the same four assumptions as previous work [2], which showed the existence of the Poincaré constant *m*<sub>0</sub> for LD (see Appendix C for details).

**Assumption 1.** *(Upper bound of the potential function at the origin) Function u takes nonnegative real values and is twice continuously differentiable on* R*d, and constants A and B exist such that for all z* <sup>∈</sup> <sup>Z</sup>*,*

$$|u(0, z)| \le A, \quad \|\nabla u(0, z)\| \le B. \tag{10}$$

**Assumption 2.** *(Smoothness) Function <sup>u</sup> has Lipschitz continuous gradients; for all <sup>z</sup>* <sup>∈</sup> <sup>Z</sup>*, positive constant M exists for all x*, *<sup>y</sup>* <sup>∈</sup> <sup>R</sup>*d,*

$$\left\|\nabla u(\mathbf{x},\mathbf{z}) - \nabla u(\mathbf{y},\mathbf{z})\right\| \le M \|\mathbf{x} - \mathbf{y}\|.\tag{11}$$

**Assumption 3.** *(Dissipative condition) Function u satisfies the (m,b)-dissipative condition for all <sup>z</sup>* <sup>∈</sup> <sup>Z</sup>*; for all x* <sup>∈</sup> <sup>R</sup>*d, m* <sup>&</sup>gt; <sup>0</sup> *and b* <sup>≥</sup> <sup>0</sup> *exist such that*

$$-x \cdot \nabla u(x, z) \le -m \|x\|^2 + b.\tag{12}$$

**Assumption 4.** *(Initial condition) Initial probability distribution μ*<sup>0</sup> *of X*<sup>0</sup> *has a bounded and strictly positive density p*0*, and for all x* <sup>∈</sup> <sup>R</sup>*d,*

$$\kappa_0 := \log \int_{\mathbb{R}^d} e^{\|x\|^2} p_0(x)\, dx < \infty. \tag{13}$$

Please note that these assumptions allow us to consider the non-convex potential functions, which are common in practical Bayesian models. Furthermore, we make the following assumption about *J*.

**Assumption 5.** *The operator norm of J is bounded:*

$$\|J\|_{2} \le 1.\tag{14}$$

*This means that the largest singular value of J is below* 1*.*

Under these assumptions, we present the convergence behavior of skew acceleration using the Poincaré constant. First, we present the following S-LD result.

**Theorem 1.** *Under Assumptions 1–5, the S-LD of Equation (9) has exponential convergence,*

$$\chi^2(\mu_t^\alpha \| \pi) \le e^{-\frac{2m(\alpha)}{\beta}t} \chi^2(\mu_0 \| \pi),\tag{15}$$

*where μ<sub>t</sub><sup>α</sup> is the measure at time t induced by S-LD and m*(*α*) *is the Poincaré constant of S-LD, defined by its generator*

$$\mathcal{L}_{\alpha}f(x) := \left(-(I+\alpha J)\nabla U(x) \cdot \nabla + \beta^{-1}\Delta\right)f(x).\tag{16}$$

*Furthermore, m*(*α*) *satisfies m*(*α*) ≥ *m*0*.*

The proof is shown in Appendix C. This theorem states that introducing skew matrices accelerates the convergence of LD by improving the convergence rate from *m*<sub>0</sub> to *m*(*α*). Although [18] obtained a similar result, we used the Poincaré constant and derived an explicit criterion for when *m*(*α*) = *m*<sub>0</sub> holds, as we discuss below.

Next, we also introduce skew acceleration in ULD. Since ULD shows faster convergence than LD in standard Bayesian settings [10,11], it is promising to combine skew acceleration with ULD to obtain a more efficient sampling algorithm. For that purpose, we propose the following SDE:

$$dX_t = \Sigma^{-1} V_t\, dt + \alpha_1 J_1 \nabla U(X_t)\, dt,\tag{17}$$

$$dV_t = -\nabla U(X_t)\,dt - \gamma(\Sigma^{-1} + \alpha_2 J_2)V_t\, dt + \sqrt{2\gamma\beta^{-1}}\,dw_t, \tag{18}$$

where *J*<sub>1</sub> and *J*<sub>2</sub> are real-valued skew matrices and *α*<sub>1</sub> and *α*<sub>2</sub> are positive constants. We assume that *J*<sub>1</sub> and *J*<sub>2</sub> satisfy Assumption 5. We refer to this method as skew underdamped Langevin dynamics (S-ULD); its stationary distribution is *π*˜ = *π* ⊗ N(0, Σ) ∝ exp(−*βU*(*x*) − *v*<sup>*t*</sup>Σ<sup>−1</sup>*v*/2). See Appendix B for details, which include discussions of other combinations of skew matrices. For S-ULD, we need an additional assumption about the initial condition of *V*<sub>0</sub>:

**Assumption 6.** *(Initial condition) Initial probability distribution μ*0(*x*, *v*) *of* (*X*0, *V*0) *has a bounded and strictly positive density p*<sup>0</sup> *that satisfies,*

$$\kappa_0 := \log \int_{\mathbb{R}^{2d}} e^{\|x\|^2 + \|v\|^2} p_0(x, v)\, dx\, dv < \infty. \tag{19}$$

We then provide the following convergence theorem that resembles S-LD.

**Theorem 2.** *Under Assumptions 1–3, 5 and 6, S-ULD has exponential convergence in χ*<sup>2</sup> *divergence, and its convergence rate is also characterized by m*(*α*) *as defined in Theorem 1. S-ULD's convergence thus equals or exceeds that of ULD, whose convergence rate is characterized by m*<sub>0</sub>*.*

See Appendix C.2 for details. From these theorems, we confirmed that skew acceleration is effective in both S-LD and S-ULD, and the convergence speed is characterized by Poincaré constant *m*(*α*) defined by Equation (16).
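A discretized sketch of the S-ULD dynamics in Equations (17) and (18) on a toy quadratic potential may be useful. The Hessian, the skew matrices, and all constants below are illustrative assumptions (with Σ = *I<sub>d</sub>* for simplicity); the x-marginal of the stationary distribution is the same Gaussian N(0, *H*<sup>−1</sup>) as for LD:

```python
import numpy as np

rng = np.random.default_rng(5)
H = np.diag([1.0, 3.0])                      # U(x) = x^t H x / 2
grad_U = lambda x: H @ x
J1 = np.array([[0.0, 1.0], [-1.0, 0.0]])     # skew matrices J1, J2
J2 = J1.copy()
alpha1, alpha2, gamma, beta, h = 0.5, 0.5, 1.0, 1.0, 1e-2
Sigma_inv = np.eye(2)                        # Sigma = I_d for simplicity

x, v, xs = np.ones(2), np.zeros(2), []
for _ in range(20000):
    # Euler discretization of Eqs. (17)-(18)
    x = x + h * (Sigma_inv @ v + alpha1 * J1 @ grad_U(x))
    v = v + h * (-grad_U(x) - gamma * (Sigma_inv + alpha2 * J2) @ v) \
          + np.sqrt(2.0 * gamma * h / beta) * rng.normal(size=2)
    xs.append(x.copy())
xs = np.array(xs)
```

The skew terms add a non-reversible circulation in both position and velocity while leaving the target *π*˜ invariant, which is exactly the mechanism exploited by the convergence results above.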

#### *3.2. Skew Acceleration from the Hessian Matrix*

Our goal is to clarify which choices of *J* induce *m*(*α*) > *m*<sub>0</sub>, i.e., lead to acceleration. Therefore, we discuss how the Poincaré constant *m*(*α*) is connected to the eigenvalues and eigenvectors of the perturbed Hessian matrix (*I* + *αJ*)∇<sup>2</sup>*U*(*x*). We first introduce notation. We write the Hessian of *U*(*x*) as *H*(*x*) and the perturbed Hessian matrix as *H<sub>α</sub>*(*x*) := (*I* + *αJ*)*H*(*x*). Please note that *H* is a real symmetric matrix, so it has real eigenvalues and is diagonalizable. On the other hand, since *H<sub>α</sub>* is not symmetric, it can have complex eigenvalues, and diagonalizability is not assured (see Appendix E). We write the pairs of eigenvectors and eigenvalues of *H<sub>α</sub>*(*x*) as {(*v<sub>i</sub><sup>α</sup>*(*x*), *λ<sub>i</sub><sup>α</sup>*(*x*))}<sup>*d*</sup><sub>*i*=1</sub>, ordered so that Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) ≤ ··· ≤ Re(*λ<sub>d</sub><sup>α</sup>*(*x*)), where Re(·) and Im(·) denote the real and imaginary parts of a complex number. We write those of *H*(*x*) as {(*v<sub>i</sub><sup>0</sup>*(*x*), *λ<sub>i</sub><sup>0</sup>*(*x*))}<sup>*d*</sup><sub>*i*=1</sub>, ordered as *λ*<sub>1</sub><sup>0</sup>(*x*) ≤ ··· ≤ *λ<sub>d</sub><sup>0</sup>*(*x*).

#### 3.2.1. Strongly Convex Potential Function

Assume that *U* is an *m*-strongly convex function, i.e., for all *x* ∈ R<sup>*d*</sup>, *m* ≤ *λ*<sub>1</sub><sup>0</sup>(*x*) holds. The Poincaré constant *m*<sub>0</sub> of LD then satisfies *m*<sub>0</sub> = *m* [19]. For skew acceleration, the Poincaré constant satisfies *m*(*α*) = *m*′(*α*), where *m*′(*α*) is the best constant that satisfies *m*′(*α*) ≤ Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) for all *x* (see Appendix D.1). Therefore, studying the Poincaré constant is equivalent to studying the smallest real part among the eigenvalues of the perturbed Hessian matrix. Thus, the relation between *λ*<sub>1</sub><sup>0</sup>(*x*) and Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) must be studied. The following theorem describes how skew matrices change the smallest eigenvalue.

**Theorem 3.** *For all x* ∈ R<sup>*d*</sup>*, the real parts of the eigenvalues of the perturbed Hessian* (*I* + *αJ*)*H*(*x*) *satisfy*

$$m \le \lambda_1^0(x) \le \operatorname{Re}(\lambda_1^\alpha(x)) \le \cdots \le \operatorname{Re}(\lambda_d^\alpha(x)) \le \lambda_d^0(x).\tag{20}$$

*The condition for λ*<sub>1</sub><sup>0</sup>(*x*) = Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) *is given in Remark 1.*

**Remark 1.** *Denote the set of eigenvectors associated with eigenvalue λ*<sub>1</sub><sup>0</sup>(*x*) *as V*<sub>1</sub><sup>0</sup>*. If V*<sub>1</sub><sup>0</sup> = {*v*} *and Jv* = 0*, then λ*<sub>1</sub><sup>0</sup>(*x*) = Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) *holds. If the cardinality of the set V*<sub>1</sub><sup>0</sup> *is larger than* 1*, and vectors v*, *v*′ ∈ *V*<sub>1</sub><sup>0</sup> *exist such that λ*<sub>1</sub><sup>0</sup>*αJv* = (Im *λ*<sub>1</sub><sup>*α*</sup>)*v*′ *and λ*<sub>1</sub><sup>0</sup>*αJv*′ = −(Im *λ*<sub>1</sub><sup>*α*</sup>)*v, then λ*<sub>1</sub><sup>0</sup>(*x*) = Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) *holds.*

Refer to Appendix F for the proof. This is an extension of previous work [8,13]. If *λ*<sub>1</sub><sup>0</sup>(*x*) < Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) is satisfied for all *x*, we have *m*<sub>0</sub> < *m*(*α*), i.e., acceleration occurs. We discuss how to construct *J* such that *λ*<sub>1</sub><sup>0</sup>(*x*) < Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) holds in Section 3.3.
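The interlacing bound of Theorem 3 is easy to check numerically. The sketch below draws a random strongly convex Hessian and a random skew matrix (all matrices are illustrative, with ∥*J*∥<sub>2</sub> ≤ 1 enforced as in Assumption 5) and compares the real parts of the perturbed spectrum with the unperturbed one:

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha = 5, 0.5
B = rng.normal(size=(d, d))
H = B @ B.T + np.eye(d)              # symmetric positive definite Hessian
A = rng.normal(size=(d, d))
J = A - A.T                          # skew-symmetric: J^t = -J
J = J / np.linalg.norm(J, 2)         # enforce ||J||_2 <= 1 (Assumption 5)

lam0 = np.sort(np.linalg.eigvalsh(H))                          # real spectrum of H
lam_a = np.sort(np.linalg.eigvals((np.eye(d) + alpha * J) @ H).real)
# Theorem 3 predicts: lam0[0] <= lam_a[0]  and  lam_a[-1] <= lam0[-1]
```

In this experiment the smallest real part moves up and the largest moves down, which is exactly the compression of the spectrum that yields the improved rate *m*(*α*) ≥ *m*<sub>0</sub>.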

3.2.2. Non-Convex Potential Function

Previous work [21] clarified that the Poincaré constant of a non-convex function is characterized by the negative eigenvalue at a saddle point. As shown in Figure 1, denote *x*<sub>1</sub> as the global minimum and *x*<sub>2</sub> as the local minimum with the second smallest value of *U*(*x*). We write the saddle point with index one (i.e., with exactly one negative eigenvalue of the Hessian at that point) between *x*<sub>1</sub> and *x*<sub>2</sub> as *x*<sup>∗</sup>. This means that the eigenvalues of *H*(*x*<sup>∗</sup>) satisfy *λ*<sub>1</sub><sup>0</sup>(*x*<sup>∗</sup>) < 0 < *λ*<sub>2</sub><sup>0</sup>(*x*<sup>∗</sup>) < ··· < *λ<sub>d</sub><sup>0</sup>*(*x*<sup>∗</sup>). Ref. [21] clarified that the saddle point *x*<sup>∗</sup> characterizes the Poincaré constant as

$$m_0^{-1} \approx \frac{1}{|\lambda_1^0(x^*)|} e^{\beta\left(U(x^*) - U(x_1) - U(x_2)\right)}.\tag{21}$$

When skew matrices are introduced, [8] clarified the following relation:

**Theorem 4.** *([8]) λ*<sub>1</sub><sup>*α*</sup>(*x*<sup>∗</sup>) ≤ *λ*<sub>1</sub><sup>0</sup>(*x*<sup>∗</sup>) < 0*, and equality holds only if Jv*<sub>1</sub><sup>*α*</sup>(*x*<sup>∗</sup>) = 0*.*

Note that *λ*<sub>1</sub><sup>*α*</sup>(*x*<sup>∗</sup>) is not a complex number. Thus, skew acceleration makes the negative eigenvalue at the saddle point more negative, which leads to a larger Poincaré constant (see Appendix D.2) and results in faster convergence.


**Figure 1.** Double-well potential example: the Poincaré constant is related to the eigenvalue at *x*<sup>∗</sup>.

In conclusion, introducing a skew matrix changes the Hessian's eigenvalues and increases the Poincaré constant. If *λ*<sub>1</sub><sup>0</sup>(*x*) ≠ Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) is satisfied, this leads to faster convergence for both convex and non-convex potential functions.

## *3.3. Choosing J*


In this section, we present a method for choosing *J* that leads to *λ*<sub>1</sub><sup>0</sup>(*x*) ≠ Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)), ensuring acceleration based on the equality conditions in Theorems 3 and 4. Combining these theorems, we obtain the following criterion:

**Remark 2.** *Given a point x, λ*<sub>1</sub><sup>0</sup>(*x*) ≠ Re(*λ*<sub>1</sub><sup>*α*</sup>(*x*)) *holds if either of the following conditions is satisfied:* (*i*) *when V*<sub>1</sub><sup>0</sup> = {*v*}*, Jv* ≠ 0 *is satisfied;* (*ii*) *when* |*V*<sub>1</sub><sup>0</sup>| > 1*, Jv* ≠ 0 *holds for any v* ∈ *V*<sub>1</sub><sup>0</sup>*, and for any v*, *v*′ ∈ *V*<sub>1</sub><sup>0</sup>*, λ*<sub>1</sub><sup>0</sup>*αJv* = (Im *λ*<sub>1</sub><sup>*α*</sup>)*v*′ *and λ*<sub>1</sub><sup>0</sup>*αJv*′ = −(Im *λ*<sub>1</sub><sup>*α*</sup>)*v are not simultaneously satisfied.*

The first condition (*i*) is easily satisfied if we choose *J* such that Ker *J* = {0}. On the other hand, the second condition (*ii*) is difficult to verify, since the perturbed Hessian and its eigenvalues and eigenvectors generally depend on the current position *X<sub>t</sub>*. Instead of evaluating the eigenvalues and eigenvectors of *H* and the perturbed Hessian directly, we use the random matrix property shown in the next theorem.

**Theorem 5.** *Suppose the upper triangular entries of $J$ follow a probability distribution that is absolutely continuous with respect to the Lebesgue measure. If $\operatorname{Ker} J = \{0\}$ is satisfied, then given a point $x \in \mathbb{R}^d$, $\lambda_1^{0}(x) \neq \operatorname{Re} \lambda_1^{\alpha}(x)$ holds with probability 1.*

The proof is given in Appendix G.1. Following this theorem, we simply generate $J$ from some probability distribution, such as the Gaussian distribution, and then check whether $\operatorname{Ker} J = \{0\}$ holds. If it does not, we generate a new random matrix $J$.
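As a minimal numpy sketch of this generate-and-check procedure (the function names are ours; note that an odd-dimensional skew-symmetric matrix always has a nontrivial kernel, since $\det J = (-1)^d \det J$, so the dimension should be even):

```python
import numpy as np

def sample_skew_matrix(d, rng):
    """Random skew-symmetric J: the upper-triangular entries are i.i.d.
    Gaussian (absolutely continuous w.r.t. the Lebesgue measure), J^T = -J."""
    upper = np.triu(rng.standard_normal((d, d)), k=1)
    return upper - upper.T

def has_trivial_kernel(J, tol=1e-10):
    """Ker J = {0} iff J has full rank."""
    return np.linalg.matrix_rank(J, tol=tol) == J.shape[0]

def generate_valid_J(d, rng, max_tries=100):
    """Resample J until Ker J = {0}, as suggested by Theorem 5."""
    for _ in range(max_tries):
        J = sample_skew_matrix(d, rng)
        if has_trivial_kernel(J):
            return J
    raise RuntimeError("no full-rank skew matrix found; is d even?")
```

For even $d$, a randomly drawn skew matrix has full rank with probability 1, so the loop almost surely terminates on the first try.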

The above theorem is valid only at a given evaluation point $x$. We can extend it to all points along the path of the discretized dynamics (see Appendix G.3). With this procedure, we can theoretically ensure that acceleration occurs with probability one for the discretized dynamics.

#### *3.4. Quantitative Evaluation of the Acceleration*

So far, we have discussed skew acceleration qualitatively but not quantitatively. Although a quantitative evaluation of the acceleration is critical for practical purposes, to the best of our knowledge, no existing work has addressed it. In this section, we present a formula that quantitatively assesses skew acceleration by analyzing the eigenvalues of the Hessian matrix.

**Theorem 6.** *With the same notation as in Theorem 3, for all $x$, we have*

$$\operatorname{Re}(\lambda_1^{\alpha}(x)) = \lambda_1^{0}(x) + \alpha^2 \sum_{k=2}^{d} \frac{\lambda_1^{0}(x)\lambda_k^{0}(x)\,|v_k^{0}(x)^{\top} J v_1^{0}(x)|^2}{\lambda_k^{0}(x) - \lambda_1^{0}(x)} + \mathcal{O}(\alpha^3). \tag{22}$$

*In particular, at a saddle point $x^*$, we have*

$$\lambda_1^{\alpha}(x^*) = \lambda_1^{0}(x^*) + \alpha^2 \sum_{k=2}^{d} \frac{\lambda_1^{0}(x^*)\lambda_k^{0}(x^*)\,|v_k^{0}(x^*)^{\top} J v_1^{0}(x^*)|^2}{\lambda_k^{0}(x^*) - \lambda_1^{0}(x^*)} + \mathcal{O}(\alpha^3). \tag{23}$$

The proofs are shown in Appendix H. Focusing on Equation (22), if $U(x)$ is a strongly convex function, then $\lambda_k^{0}(x) > \lambda_1^{0}(x) > 0$ holds for all $k > 1$, so the second term in Equation (22) is positive. From this, $\operatorname{Re} \lambda_1^{\alpha}(x) > \lambda_1^{0}(x)$ holds. A similar relation holds for $\operatorname{Re}(\lambda_d^{\alpha}(x))$. In Equation (23), $\lambda_1^{\alpha}(x^*) < \lambda_1^{0}(x^*) < 0$ holds. Thus, the changes of the Poincaré constant are proportional to $\alpha^2$. With these formulas, we can quantitatively evaluate the acceleration. We present numerical experiments confirming these theoretical findings in Section 6.1.
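This eigenvalue shift is easy to check numerically. The following sketch (our own construction: a fixed strongly convex quadratic potential with a random Gaussian skew perturbation, assuming numpy) compares the smallest real part of the eigenvalues of $(I + \alpha J)H$ with $\lambda_1^{0} = \lambda_{\min}(H)$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.standard_normal((2 * d, d))
H = A.T @ A / (2 * d) + 0.1 * np.eye(d)   # SPD Hessian of a strongly convex U
upper = np.triu(rng.standard_normal((d, d)), k=1)
J = upper - upper.T                        # skew-symmetric perturbation

def min_real_eig(alpha):
    """Smallest real part among the eigenvalues of (I + alpha*J) H."""
    return np.linalg.eigvals((np.eye(d) + alpha * J) @ H).real.min()

lam0 = min_real_eig(0.0)        # lambda_1^0, the smallest eigenvalue of H
lam_alpha = min_real_eig(0.1)   # Re lambda_1^alpha after the skew perturbation
```

For a strongly convex potential, $\operatorname{Re}\lambda_1^{\alpha} \geq \lambda_1^{0}$ should hold, and for small $\alpha$ the gap grows roughly quadratically in $\alpha$ as in Equation (22).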

#### **4. Practical Algorithm for Skew Acceleration**

In this section, we discuss skew acceleration in more practical settings than in Section 3. First, we discuss the memory cost of storing $J$, the discretization of the SDE, and the stochastic gradient, which are widely used techniques in Bayesian inference. Finally, we present a practical algorithm for skew acceleration.

#### *4.1. Memory Issue of Skew Acceleration and Ensemble Sampling*

For $d$-dimensional Bayesian models, we need $\mathcal{O}(d^2)$ memory to store the skew matrix $J$, which is prohibitive for high-dimensional models. Instead of storing $J$, we could randomly generate a new $J$ at each time step following Theorem 5. However, we experimentally confirmed that using a different $J$ at each step does not accelerate the convergence (see Section 6). Thus, we need to use a fixed $J$ during the iterations.

As discussed below, we found that the previously proposed accelerated parallel sampling [18] can serve as a practical algorithm to resolve this memory issue. In that method, $N$ samples of the model's parameters are updated simultaneously with correlation. Because a correlation exists among the multiple Markov chains, such a parallel sampling scheme is more efficient than naive parallel-chain MCMC, where the samples are independent. We express the $n$-th sample at time $t$ as $X_t^{(n)} \in \mathbb{R}^d$ and the joint state of all samples at time $t$ as $X_t^{\otimes N} := (X_t^{(1)}, \dots, X_t^{(N)}) \in \mathbb{R}^{dN}$. We express the joint stationary measure as $\pi^{\otimes N} := \pi \otimes \cdots \otimes \pi$, so that $\pi^{\otimes N}(x^{\otimes N}) \propto e^{-\beta \sum_{i=1}^{N} U(x^{(i)})}$, and the sum of the potential functions as $U^{\otimes N} := \sum_{i=1}^{N} U(x^{(i)})$. We then consider the following dynamics:

$$dX_t^{\otimes N} = -(I_{dN} + \alpha J)\nabla U^{\otimes N}(X_t^{\otimes N})\,dt + \sqrt{2\beta^{-1}}\,dw_t, \tag{24}$$

$$\nabla U^{\otimes N}(X_t^{\otimes N}) := \left(\nabla U(X_t^{(1)}), \dots, \nabla U(X_t^{(N)})\right)^{\top}. \tag{25}$$

We call this dynamics skew parallel LD (S-PLD): $N$ independent parallel LD (PLD) chains coupled through the skew matrix. Since each chain in PLD is independent of the others, the Poincaré constant of PLD is also $m_0$. Ref. [18] argued that the Poincaré constant of S-PLD, $m(\alpha, N)$, satisfies $m(\alpha, N) \geq m_0$, which means that S-PLD converges faster than PLD. As discussed in Section 3.2, these Poincaré constants are characterized by the smallest eigenvalues of the Hessian matrices $\nabla^2 U^{\otimes N}(x^{\otimes N})$ and $(I_{dN} + \alpha J)\nabla^2 U^{\otimes N}(x^{\otimes N})$, where $x^{\otimes N} \in \mathbb{R}^{dN}$. We denote these smallest eigenvalues by $\lambda_1^{0}(x^{\otimes N})$ and $\operatorname{Re} \lambda_1^{\alpha}(x^{\otimes N})$. As discussed in Section 3.2, acceleration occurs if $\lambda_1^{0}(x^{\otimes N}) \neq \operatorname{Re} \lambda_1^{\alpha}(x^{\otimes N})$ is satisfied.

Ref. [18] did not specify the choice of $J$, and a naive construction of $J$ requires $\mathcal{O}(d^2 N^2)$ memory. To reduce the memory cost, we propose the following skew matrix:

$$J := J_0 \otimes I_d, \tag{26}$$

where $J_0$ is an $N \times N$ skew matrix and $\otimes$ is the Kronecker product. We then have the following lemma:

**Lemma 1.** *If $J_0$ is generated based on Theorem 5 and $\operatorname{Ker} J_0 = \{0\}$ is satisfied, then given a point $x^{\otimes N}$, $J$ does not satisfy the equality conditions in Theorems 3 and 4, which means $\lambda_1^{0}(x^{\otimes N}) \neq \operatorname{Re} \lambda_1^{\alpha}(x^{\otimes N})$ holds with probability 1.*

See Appendix G.2 for the proof. Thus, from this lemma, we only need to prepare and store $J_0$, which requires $\mathcal{O}(N^2)$ memory, independent of $d$. In practical settings, this is a significant reduction in memory, since the number of parallel chains is usually smaller than the dimension of the model. Note that the acceleration is still ensured with this $J$.
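The Kronecker structure also means $J$ never has to be materialized: if the joint state is stored as an $N \times d$ array (one sample per row), applying $J = J_0 \otimes I_d$ reduces to a single $N \times N$ matrix product. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def apply_kron_skew(J0, X):
    """Apply J = J0 ⊗ I_d to the stacked state without forming the dN x dN
    matrix. X has shape (N, d), one sample per row; block i of the product
    is sum_j (J0)_{ij} X[j], i.e., simply J0 @ X -- an O(N^2 d) operation."""
    return J0 @ X

# sanity check against the explicit Kronecker product on a small instance
rng = np.random.default_rng(2)
N, d = 4, 3
upper = np.triu(rng.standard_normal((N, N)), k=1)
J0 = upper - upper.T
X = rng.standard_normal((N, d))
explicit = np.kron(J0, np.eye(d)) @ X.reshape(-1)
```

The blockwise product agrees with multiplying by the explicit $dN \times dN$ Kronecker matrix, while never allocating it.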

**Lemma 2.** *Under Assumptions 1–5, assume J satisfies the condition of Lemma 1. Then S-PLD shows*

$$\chi^2(\mu_t^{\alpha,\otimes N} \| \pi^{\otimes N}) \le e^{-\frac{2m(\alpha,N)}{\beta}t}\, \chi^2(\mu_0^{\otimes N} \| \pi^{\otimes N}), \tag{27}$$

*where $\mu_t^{\alpha,\otimes N}$ is the measure at time $t$ induced by S-PLD, and $\mu_0^{\otimes N}$ is the initial measure defined as the product measure of $\mu_0$.*

See Appendix I.1 for the proof. Thus, combining Lemmas 1 and 2, S-PLD converges faster than PLD. We also considered the ensemble version of ULD (parallel ULD (PULD)) and its skew-accelerated version:

$$\begin{split} dX_t^{\otimes N} &= \Sigma^{-1} V_t^{\otimes N}\,dt + \alpha_1 J_1 \nabla U^{\otimes N}(X_t^{\otimes N})\,dt, \\ dV_t^{\otimes N} &= -\nabla U^{\otimes N}(X_t^{\otimes N})\,dt - \gamma(\Sigma^{-1} + \alpha_2 J_2) V_t^{\otimes N}\,dt + \sqrt{2\gamma\beta^{-1}}\,dw_t, \end{split} \tag{28}$$

where $J_1, J_2 \in \mathbb{R}^{dN \times dN}$ are real-valued skew-symmetric matrices, $\alpha_1, \alpha_2 \in \mathbb{R}^+$ are positive constants, and $V_t^{\otimes N} = (V_t^{(1)}, \dots, V_t^{(N)}) \in \mathbb{R}^{dN}$. We refer to this dynamics as skew PULD (S-PULD); its faster convergence can be assured similarly to Lemma 2, as shown in Appendix I.2.

## *4.2. Discussion of the Discretization of SDE and Stochastic Gradient and Practical Algorithm*

In this section, we consider further practical settings for S-PLD and S-PULD. We discretize these continuous dynamics, e.g., by the Euler–Maruyama method, and approximate the gradient by the stochastic gradient. Although introducing skew matrices accelerates the convergence of the continuous dynamics, it simultaneously increases the discretization and stochastic gradient errors, resulting in a trade-off. We present a practical algorithm that controls this trade-off.

#### 4.2.1. Trade-Off Caused by Discretization and Stochastic Gradient

We consider the following discretization and stochastic gradient for S-PLD and S-PULD:

$$X_{k+1}^{\otimes N} = X_k^{\otimes N} - h(I_{dN} + \alpha J)\nabla \hat{U}^{\otimes N}(X_k^{\otimes N}) + \sqrt{2h\beta^{-1}}\,\epsilon_k, \tag{29}$$

and

$$\begin{split} X_{k+1}^{\otimes N} &= X_k^{\otimes N} + \Sigma^{-1} V_k^{\otimes N} h + \alpha J \nabla \hat{U}^{\otimes N}(X_k^{\otimes N}) h, \\ V_{k+1}^{\otimes N} &= V_k^{\otimes N} - \nabla \hat{U}^{\otimes N}(X_k^{\otimes N}) h - \gamma \Sigma^{-1} V_k^{\otimes N} h + \sqrt{2\gamma\beta^{-1}h}\,\epsilon_k, \end{split} \tag{30}$$

where $\epsilon_k \in \mathbb{R}^{dN}$ is a standard Gaussian random vector and $\nabla \hat{U}^{\otimes N}(X^{\otimes N})$ is an unbiased estimator of the gradient $\nabla U^{\otimes N}(X^{\otimes N})$. We refer to Equation (29) as skew-SGLD and to Equation (30) as skew-SGHMC. For skew-SGHMC, we dropped the $J_2$ term of S-PULD to reduce the number of parameters, as shown in Appendix B. Note that skew-SGLD is identical to the previously proposed dynamics [18]. We introduce an assumption about the stochastic gradient:

**Assumption 7.** *(Stochastic gradient) There exists a constant δ* ∈ [0, 1) *such that*

$$\mathbb{E}[\|\nabla \hat{\mathcal{U}}(\mathbf{x}) - \nabla \mathcal{U}(\mathbf{x})\|^2] \le 2\delta \left(M^2 \|\mathbf{x}\|^2 + B^2\right). \tag{31}$$
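As an illustration, one skew-SGLD update (Equation (29)) can be sketched in numpy as follows; `grad_U` stands for the (possibly stochastic) gradient estimate, and the names are ours:

```python
import numpy as np

def skew_sgld_step(X, grad_U, J0, alpha, h, beta, rng):
    """One skew-SGLD step (Eq. (29)) for N coupled chains.
    X: (N, d) array of current samples; grad_U: function R^d -> R^d
    (an unbiased stochastic estimate in practice); J0: (N, N) skew matrix.
    (I_{dN} + alpha * J0 ⊗ I_d) applied to the stacked gradients is
    computed blockwise as G + alpha * J0 @ G."""
    G = np.stack([grad_U(x) for x in X])    # stacked gradients, shape (N, d)
    drift = G + alpha * (J0 @ G)            # skew-coupled gradient
    noise = np.sqrt(2 * h / beta) * rng.standard_normal(X.shape)
    return X - h * drift + noise
```

For a standard Gaussian target, `grad_U` is the identity map $x \mapsto x$, and in the low-temperature limit (very large $\beta$) the coupled chains contract toward the mode.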

Given a test function $f$ that is $L_f$-Lipschitz, we approximate $\int f\,d\pi$ by skew-SGLD or skew-SGHMC with the estimator $\frac{1}{N}\sum_{n=1}^{N} f(X_k^{(n)})$. The bias of skew-SGLD is upper bounded as follows.

**Theorem 7.** *Under Assumptions 1–7, for any $k \in \mathbb{N}$ and any $h \in (0, 1 \wedge \frac{m}{4M^2})$ obeying $kh \geq 1$ and $\beta m \geq 2$, we have*

$$\left| \mathbb{E}\,\frac{1}{N}\sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f\,d\pi \right| \le L_f\Big(\underbrace{C_1(\alpha)kh}_{(i)} + \underbrace{C_2 e^{-\beta^{-1}m(\alpha,N)kh}}_{(ii)}\Big), \tag{32}$$

*where $C_1$ and $C_2$ depend on the constants in Assumptions 1–7; for the details, see Appendix J.*

We present a tighter bias bound in Section 4.3 under a stronger assumption. We can show a similar upper bound for skew-SGHMC using the same proof strategy. This bound resembles a previous one [18], but ours shows an improved dependency on $kh$. The previous results of [18] are also limited to LD and do not cover skew-SGHMC.

Note that $(i)$ corresponds to the discretization and stochastic gradient error and $(ii)$ corresponds to the convergence behavior of the continuous dynamics S-PLD. Since $C_1(\alpha) \geq C_1(\alpha = 0)$, skew acceleration increases the discretization and stochastic gradient error. On the other hand, since $m(\alpha, N) \geq m_0$, the convergence of the continuous dynamics is accelerated. Thus, skew acceleration causes a trade-off. When $\alpha$ is sufficiently small, we can derive the explicit dependency of this trade-off on $\alpha$ from an asymptotic expansion. Using the quantitative evaluation of skew acceleration in Theorem 6, we obtain

$$\left| \mathbb{E}\,\frac{1}{N}\sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f\,d\pi \right| \le \underbrace{(d_1 + d_2\alpha^2)kh}_{(i)} - \underbrace{\alpha^2 d_0 e^{-\beta^{-1}m_0 kh}}_{(ii)} + \mathcal{O}(\alpha^3) + \text{const}, \tag{33}$$

where $d_0$, $d_1$, and $d_2$ are positive constants obtained from the asymptotic expansion; see Appendix K for the details. In the above expression, $(i)$ and $(ii)$ correspond to $(i)$ and $(ii)$ of Equation (32). Thus, by choosing an appropriate $\alpha$, we can control the trade-off.

#### 4.2.2. Practical Algorithm Controlling the Trade-Off

Since calculating the optimal $\alpha$ that minimizes Equation (33) at each step is computationally demanding, we adaptively tune the value of $\alpha$ by measuring the acceleration with the kernelized Stein discrepancy (KSD) [22]. Our idea is to update samples under different strengths $\alpha$ and $\alpha + \eta$, and to compare the KSD between the stationary and empirical distributions for these two interaction strengths. Here, $\eta \in \mathbb{R}^+$ is a small increment of $\alpha$. We denote the samples at the $(k+1)$-th step obtained by Equation (29) as $X_{k+1,\alpha}^{\otimes N} := X_{k,\alpha}^{\otimes N} - h(I_{dN} + \alpha J)\nabla \hat{U}^{\otimes N}(X_{k,\alpha}^{\otimes N}) + \sqrt{2h\beta^{-1}}\,\epsilon_k$ (or by Equation (30) as $X_{k+1,\alpha}^{\otimes N} := X_k^{\otimes N} + \Sigma^{-1}V_k^{\otimes N}h + \alpha J \nabla \hat{U}^{\otimes N}(X_k^{\otimes N})h$). We denote the samples obtained by replacing $\alpha$ with $\alpha + \eta$ as $X_{k+1,\alpha+\eta}^{\otimes N}$. We denote the KSD between the measure of $X_{k+1,\alpha}^{\otimes N}$ and the stationary measure $\pi$ as $KSD(k+1, \alpha)$ and estimate the difference of the empirical KSDs:

$$\Delta := \widehat{KSD}(k+1, \alpha) - \widehat{KSD}(k+1, \alpha + \eta), \tag{34}$$

where KSD is estimated by

$$\widehat{KSD}(k, \alpha) = \frac{1}{N(N-1)}\sum_{i \neq j}^{N} u_{\pi}\big(X_{k,\alpha}^{(i)}, X_{k,\alpha}^{(j)}\big), \tag{35}$$

$$\begin{split} u_{\pi}(x, x') := \nabla_x \log \pi(x)^{\top} l(x, x') \nabla_{x'} \log \pi(x') + \nabla_x \log \pi(x)^{\top} \nabla_{x'} l(x, x') \\ + \nabla_x l(x, x')^{\top} \nabla_{x'} \log \pi(x') + \operatorname{Tr} \nabla_{x,x'} l(x, x'), \end{split} \tag{36}$$

where $l$ denotes a kernel; we use an RBF kernel. If $\Delta > 0$, the empirical distribution of $X_{k+1,\alpha+\eta}^{\otimes N}$ is closer to the stationary distribution than that of $X_{k+1,\alpha}^{\otimes N}$, so we increase the interaction strength from $\alpha$ to $\alpha + \eta$. If $\Delta < 0$, we decrease it to $\alpha - \eta$. We also update $\eta$ to $c\eta$, where $c \in (0, 1]$. The overall process is shown in Algorithm 1. Detailed discussions of the algorithm, including how to select $\alpha_0$, $\eta_0$, and $c$, are given in Appendix L.
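A self-contained numpy sketch of the U-statistic KSD estimator in Equations (35) and (36), for a target with known score $\nabla \log \pi$ and an RBF kernel (function names are ours):

```python
import numpy as np

def ksd_u_stat(X, score, sigma=1.0):
    """U-statistic KSD estimate (Eqs. (35)-(36)) with the RBF kernel
    l(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
    X: (N, d) samples; score(x) returns grad_x log pi(x)."""
    N, d = X.shape
    S = np.stack([score(x) for x in X])
    total = 0.0
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            diff = X[i] - X[j]
            r2 = diff @ diff
            l = np.exp(-r2 / (2 * sigma ** 2))
            grad_x = -diff / sigma ** 2 * l            # grad_x l(x, x')
            grad_xp = diff / sigma ** 2 * l            # grad_x' l(x, x')
            trace = l * (d / sigma ** 2 - r2 / sigma ** 4)  # Tr grad_x grad_x' l
            total += S[i] @ S[j] * l + S[i] @ grad_xp + S[j] @ grad_x + trace
    return total / (N * (N - 1))
```

Samples drawn far from the target should yield a larger estimate than samples drawn from the target itself, which is exactly the comparison Equation (34) performs.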

#### **Algorithm 1** Tuning *α*

**Input:** $X_k^{\otimes N}$, $\eta_k$, $\alpha_k$, $c$. **Output:** $\alpha_{k+1}$, $\eta_{k+1}$

1: Calculate $X_{k+1,\alpha_k}^{\otimes N}$ and $X_{k+1,\alpha_k+\eta_k}^{\otimes N}$.

2: Calculate $\Delta := \widehat{KSD}(k+1, \alpha_k) - \widehat{KSD}(k+1, \alpha_k + \eta_k)$.

3: **if** $\Delta > 0$ **then**

4: Update $\alpha_{k+1} = \alpha_k + \eta_k$

5: Update $\eta_{k+1} = \eta_k$

6: **else**

7: Update $\alpha_{k+1} = |\alpha_k - \eta_k|$

8: Update $\eta_{k+1} = c\eta_k$

9: **end if**
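The tuning step of Algorithm 1 is a few lines of code. A sketch (our function names), taking as input the two KSD estimates compared in Equation (34):

```python
def tune_alpha(alpha_k, eta_k, ksd_alpha, ksd_alpha_eta, c=0.9):
    """One step of Algorithm 1: increase alpha if the larger interaction
    strength alpha + eta yielded a smaller KSD, otherwise decrease alpha
    and shrink the increment eta by the factor c in (0, 1]."""
    delta = ksd_alpha - ksd_alpha_eta        # Eq. (34)
    if delta > 0:                            # alpha + eta is closer to pi
        return alpha_k + eta_k, eta_k
    return abs(alpha_k - eta_k), c * eta_k
```

The absolute value in the decrease branch keeps $\alpha$ non-negative even when the increment overshoots.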

Finally, we present Algorithm 2, which describes the whole process. We update the value of $\alpha$ once every $k'$ steps. Note that its computational cost is not much larger than that of Equation (30). We only calculate the eigenvalues of $J_0$ once, which requires $\mathcal{O}(N^3)$. The calculation of the different KSDs is computationally inexpensive, since we can re-use the gradient, which is the most computationally demanding part.

**Algorithm 2** Proposed algorithm

**Input:** $X_0^{\otimes N}$, $h$, $\alpha_0$, $\eta_0$, $k'$, $K$, $c$, ($V_0^{\otimes N}$, $\gamma$, $\Sigma^{-1}$). **Output:** $X_K^{\otimes N}$

1: Generate an $N \times N$ random skew matrix $J_0$ and check $\operatorname{Ker} J_0 = \{0\}$

2: Set $J = J_0 \otimes I_d$

3: **for** $k = 0$ to $K$ **do**

4: **if** $k \bmod k' = 0$ **then**

5: Update $\alpha$ by Algorithm 1

6: **end if**

7: Update $X_k^{\otimes N}$ by Equation (29) (for skew-SGLD)

8: (Update $(X_k^{\otimes N}, V_k^{\otimes N})$ by Equation (30) for skew-SGHMC)

9: **end for**

## *4.3. Refined Analysis for the Bias of Skew-SGLD*

When using a constant step size for skew-SGLD, the bound in Theorem 7 is vacuous, since the first term of Equation (32) diverges as $k$ grows. Here, following [23], we present a tighter bound for the bias of skew-SGLD under a stronger assumption.

**Theorem 8.** *Under Assumptions 1–7, for any $k \in \mathbb{N}$ and any $h \in (0, 1 \wedge \frac{\lambda(\alpha,N)}{4\sqrt{2}M^2} \wedge \frac{m}{4M^2})$ obeying $kh \geq 1$ and $\beta m \geq 2$, we have*

$$\left| \mathbb{E}\,\frac{1}{N}\sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f\,d\pi \right| \leq L_f \sqrt{\frac{2}{\lambda(\alpha,N)}} \sqrt{e^{-\lambda(\alpha,N)kh}\,\mathrm{KL}(\mu_0\|\pi) + \frac{C_3(\alpha)}{\lambda(\alpha,N)}}, \tag{37}$$

*where*

$$\lambda(\alpha, N) := \left( \left(1 + m(\alpha, N)^{-1} \beta C(m_0)\right) 2\pi e^2 + \frac{3}{2}\, m(\alpha, N)^{-1} \right)^{-1} \tag{38}$$

*where the constants $C_3(\alpha)$ and $C(m_0)$ depend on the constants in Assumptions 1–7. Moreover, $\lambda(\alpha, N)$ satisfies $\lambda(\alpha, N) \geq \lambda(\alpha = 0, N)$. For the details, see Appendix M.*

The proof is shown in Appendix M. Note that even if we use a constant step size for skew-SGLD, the bound in Theorem 8 does not diverge, although we need a stronger assumption on the step size than in Theorem 7. From Equation (37), the convergence behavior is characterized by $\lambda(\alpha, N)$, and the bias bound becomes smaller as $\lambda(\alpha, N)$ becomes larger. From the definition of $\lambda(\alpha, N)$, the larger $m(\alpha, N)$ is, the larger $\lambda(\alpha, N)$ we obtain. Thus, as we have seen so far, introducing the skew matrices leads to a larger Poincaré constant, and this in turn leads to a larger $\lambda(\alpha, N)$.

Previous work [18] clarified that if $\alpha$ is sufficiently small, introducing skew matrices improves the Poincaré constant by a constant factor, i.e., $m(\alpha, N) - m_0 \approx \mathcal{O}(\alpha^2)$, where the $\mathcal{O}(\alpha^2)$ term depends on the eigenvectors and eigenvalues of the generator $\mathcal{L}$. On the other hand, from Theorem 8, for any $\xi > 0$, to achieve a bias smaller than $\xi$, it suffices to run skew-SGLD for at least $k \geq \frac{2}{\lambda(\alpha,N)h} \ln\left(\frac{L_f}{\xi}\sqrt{\frac{2\,\mathrm{KL}(\mu_0\|\pi)}{\lambda(\alpha,N)}}\right)$ iterations with an appropriate step size $h$, under the assumption that $\delta$ and $\alpha$ are small enough (see Appendix M.2 for details). Combining these observations, introducing skew matrices into SGLD improves the computational complexity by a constant factor. Our numerical experiments show that even a constant-factor improvement results in faster convergence in practical Bayesian models.

## **5. Related Work**

In this section, we discuss the relationship between our method and other sampling methods.

#### *5.1. Relation to Non-Reversible Methods*

As we discussed in Section 1, our work extends the existing analyses of non-reversible dynamics [8,18] and presents a practical algorithm. Compared to those previous works, we focus on the practical setting of Bayesian sampling and derive an explicit condition on $J$ for acceleration. We also derived a formula to quantitatively evaluate skew acceleration based on the asymptotic expansion of the eigenvalues of the perturbed Hessian matrix. A previous work [24] derived the optimal skew matrices when the target distribution is Gaussian, but it requires $\mathcal{O}(d^3)$ computational cost, and it is unclear whether it works for non-convex potential functions. In contrast, our construction method for skew matrices is simple, computationally cheap, and applicable to general Bayesian models.

Our work analyzes skew acceleration for ULD, which is more effective than LD in practical problems; previous works [8,18] only analyzed skew acceleration for LD. A previous work [17] combined a non-reversible drift term with ULD; unlike our method, its purpose was to reduce the asymptotic variance of the expectation of a test function, and it mainly focused on sampling from Gaussian distributions.

To the best of our knowledge, our work is the first to focus on the memory issue of skew acceleration and to develop a memory-efficient skew matrix for ensemble sampling. Our work is also the first to present an algorithm that controls the trade-off. Another work [18] identified the trade-off and handled it by cross-validation, which is unfortunately computationally inefficient.

Finally, we point out an interesting connection between our skew-SGHMC and the magnetic HMC (M-HMC) [25]. M-HMC accelerates HMC's mixing time by introducing a "magnetic" term into the Hamiltonian, expressed by special skew matrices. Although a previous work [25] argued that M-HMC is numerically superior to standard HMC, its theoretical properties remain unclear. Our work can thus serve as a tool for analyzing the theoretical behavior of M-HMC.

#### *5.2. Relation to Ensemble Methods*

Our proposed algorithm is based on ensemble sampling [26], in which multiple samples are simultaneously updated with interaction; this approach has attracted attention both numerically and theoretically because of improvements in memory size, computational power, and parallel processing schemes [26]. There are successful, widely used ensemble methods, including SVGD [27] and SPOS [28], with which we compare our proposed method numerically in Section 6. Although both show good numerical performance, it is unclear how the interaction term theoretically accelerates convergence, since they are formulated as McKean–Vlasov processes, which are non-linear dynamics, complicating the establishment of a finite-sample convergence rate. Our algorithm extends another work [18], where the interaction is composed of a skew-acceleration term and can be rigorously analyzed. Compared to that previous work [18], we analyzed skew acceleration through the Hessian matrix, developed practical algorithms as discussed in Section 4.2, and derived the explicit condition under which acceleration occurs, which was previously unclear [18].

Another difference among SPOS, SVGD, and [18] is that they use first-order methods, whereas our approach uses a second-order method. Little work has been done on ensemble sampling for second-order dynamics. Recently, a second-order ensemble method was proposed [29] based on gradient flow analysis. Although it showed good numerical performance, its theoretical properties for finite samples remain unclear, since the scheme is a finite-sample approximation of the gradient flow. In contrast, our proposed method is a valid sampling scheme with a non-asymptotic guarantee.

## **6. Numerical Experiments**

The purpose of our numerical experiments is to confirm the acceleration of the algorithm proposed in Section 4 on various commonly used Bayesian models, including a Gaussian distribution (toy data), latent Dirichlet allocation (LDA), and Bayesian neural network regression and classification (BNN). We compared our algorithm's performance with other ensemble sampling methods: SVGD, SPOS, standard SGLD, and SGHMC. In all the experiments, the reported values and error bars are the means and standard deviations over repeated trials. For all the experiments, we set $\gamma = 1$ and $\Sigma^{-1} = 300$ for SGHMC and skew-SGHMC. The selection criteria for the hyperparameters of our proposed algorithm are discussed in Appendix L.

### *6.1. Toy Data Experiment*

The target distribution is the multivariate Gaussian distribution $\pi = \mathcal{N}(\mu, \Omega)$, where we generated $\Omega^{-1} = A^{\top}A$ and each element of $A \in \mathbb{R}^{2d \times d}$ is drawn from the standard Gaussian distribution. The dimension of the target distribution is $d = 50$, and we approximate it with 20 samples using the proposed ensemble methods. We chose these toy data because the LD for this target distribution is the Ornstein–Uhlenbeck process, whose theoretical properties have been studied extensively, e.g., [30]. Thus, by studying the convergence behavior on these toy data, we can understand our proposed method more clearly.

First, we confirmed how the skew-symmetric matrix affects the eigenvalues of the Hessian matrix, as discussed in Section 3, where we only showed the asymptotic expansion for the smallest real part of the eigenvalues and for the saddle point. Here we show a similar expansion for the largest real part:

$$\operatorname{Re}(\lambda_{dN}^{\alpha}) = \lambda_{dN}^{0} + \alpha^2 \sum_{k=1}^{dN-1} \frac{\lambda_{dN}^{0}\lambda_k^{0}\,|v_k^{0\top} J v_{dN}^{0}|^2}{\lambda_k^{0} - \lambda_{dN}^{0}} + \mathcal{O}(\alpha^3). \tag{39}$$

Since $\lambda_k^{0} < \lambda_{dN}^{0}$ for $k < dN$, the second term is negative, and $\operatorname{Re} \lambda_{dN}^{\alpha} \leq \lambda_{dN}^{0}$ holds.

We then observed how the largest and smallest real parts of the eigenvalues of $(I + \alpha J)\Omega^{-1}$ depend on $\alpha$. The results are shown in Figure 2, where we averaged 10 trials over randomly generated $J$ with fixed $A$. The upper-left, upper-right, and lower panels show $\operatorname{Re}(\lambda_1(\alpha))$, $\operatorname{Re}(\lambda_{dN}(\alpha))$, and $\operatorname{Re}(\lambda_1(\alpha))/\operatorname{Re}(\lambda_{dN}(\alpha))$, respectively. These behaviors are consistent with Theorem 6: when $\alpha$ is small, the dependence on $\alpha$ is close to the quadratic behavior proved there.

Next, we observed the convergence behavior of skew-SGLD and skew-SGHMC. We measured the convergence by maximum mean discrepancy (MMD) [31] between the empirical and stationary distributions. For MMD, we used 2000 samples for the target distribution, and we used the Gaussian kernel whose bandwidth is set to the median distance of these 2000 samples. We used gradient descent (GD), with step size *<sup>h</sup>* <sup>=</sup> <sup>1</sup> <sup>×</sup> <sup>10</sup><sup>−</sup>4. The results are shown in Figure 3. The proposed method shows faster convergence than naive parallel sampling, which is consistent with Table 2.
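For reference, the MMD criterion used here can be sketched as follows: a biased (V-statistic) estimate with a Gaussian kernel, plus the median heuristic for the bandwidth (numpy assumed; function names are ours):

```python
import numpy as np

def mmd2_rbf(X, Y, bandwidth):
    """Biased (V-statistic) estimate of squared MMD between samples
    X (n, d) and Y (m, d) with k(x, y) = exp(-||x - y||^2 / (2 bw^2))."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

def median_heuristic(X):
    """Median pairwise distance, used as the kernel bandwidth."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.sqrt(np.median(d2[np.triu_indices(len(X), k=1)]))
```

A sample set far from the reference sample yields a large estimate, while two sample sets from the same distribution yield an estimate near zero, which is what makes MMD a usable convergence diagnostic.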

**Figure 2.** Eigenvalue changes (averaged over ten trials).

**Figure 3.** Convergence behavior of toy data in MMD (averaged over ten trials).

#### *6.2. LDA Experiment*

We tested an LDA model on the ICML dataset [32], following the same setting as [33]. We used 20 samples for all the methods, a minibatch size of 100, and a step size of $h = 5 \times 10^{-4}$. First, we confirmed the effectiveness of our proposed Algorithm 1, which adaptively tunes the value of $\alpha$. For that purpose, we compared the final performance of our method with a previous method [18], in which $\alpha$ is selected by cross-validation (CV); here, instead of CV, we simply fixed $\alpha$ during the sampling and refer to this as fixed $\alpha$. We also tested the case where $J$ is generated randomly at each step with fixed $\alpha$, as discussed in Section 4.1, and refer to this as random $J$. The results obtained with skew-SGLD are shown in Figure 4. We found that our method is competitive with the best performance over the fixed $\alpha$ values. For the computational cost, we used $k' = 2$ in Algorithm 2, and our method needed twice the wall-clock time of each fixed-$\alpha$ run. This means that our algorithm greatly reduces the total computational time, since CV over fixed $\alpha$ requires trying more than two values of $\alpha$. We also found that using a different $J$ at each step did not accelerate convergence, so we need to store a fixed $J$ during the sampling for acceleration. Next, we compared our method with other ensemble sampling schemes and observed the convergence speed. The results are shown in Figure 5. Skew-SGLD and skew-SGHMC outperformed SGLD and SGHMC, which is consistent with our theory.

**Figure 4.** Final performances of LDA under different values of *α* (averaged over ten trials).

**Figure 5.** LDA experiments (Averaged over 10 trials).

#### *6.3. BNN Regression and Classification*

We tested the BNN regression task on the UCI datasets [34], following the setting of Liu and Wang [27]. We used a one-hidden-layer neural network with ReLU activation and 100 hidden units, 10 samples for all the methods, a minibatch size of 100, and a step size of $h = 5 \times 10^{-5}$. The results are shown in Tables 1 and 2. We also tested the BNN classification task on the MNIST dataset; the results are shown in Figure 6. There, we used a one-hidden-layer neural network with ReLU activation and 100 hidden units, a minibatch size of 500, and a step size of $h = 5 \times 10^{-5}$. Our proposed methods outperformed the other ensemble methods. Note that skew-SGHMC and skew-SGLD consistently outperformed SGHMC and SGLD.


**Table 1.** Benchmark results on test RMSE for regression task.

**Table 2.** Benchmark results on test negative log likelihood for regression task.


**Figure 6.** MNIST classification (Averaged over ten trials).

#### **7. Conclusions**

We studied skew acceleration for LD and ULD from a practical viewpoint. We showed that the improved eigenvalues of the perturbed Hessian matrix cause the acceleration and derived an explicit condition for acceleration. We described a novel ensemble sampling method that couples multiple SGLD or SGHMC chains with memory-efficient skew matrices. We also proposed a practical algorithm that controls the trade-off between faster convergence and larger discretization and stochastic gradient errors, and we numerically confirmed its effectiveness.

**Author Contributions:** Conceptualization, F.F. and T.I.; methodology, F.F. and T.I.; software, F.F.; validation, F.F., T.I., N.U. and I.S.; formal analysis, F.F. and I.S.; writing—original draft preparation, F.F.; project administration, F.F.; funding acquisition, F.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** JST ACT-X: Grant Number JPMJAX190R.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. This data can be found here: http://archive.ics.uci.edu/ml (accessed on 21 June 2021).

**Acknowledgments:** FF was supported by JST ACT-X Grant Number JPMJAX190R.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A. Additional Backgrounds**

We introduce additional background material used in our proofs.

#### *Appendix A.1. Wasserstein Distance and Kullback–Leibler Divergence*

In this paper, we use the Wasserstein distance, defined as follows. Let (*E*, *d*) be a metric space (an appropriately regular space, such as a Polish space) with *σ*-field A, where *d*(·, ·) is A × A-measurable. Let *μ* and *ν* be probability measures on *E*, and let *p* ≥ 1. The Wasserstein distance of order *p* with cost function *d* between *μ* and *ν* is defined as

$$\mathcal{W}\_p^d(\mu, \nu) = \inf\_{\pi \in \Pi(\mu, \nu)} \left( \int \int d(\mathbf{x}, y)^p d\pi(\mathbf{x}, y) \right)^{1/p},\tag{A1}$$

where Π(*μ*, *ν*) is the set of all joint probability measures on *E* × *E* with marginals *μ* and *ν*. In this paper, we work on the space R*d* and use the Euclidean distance ‖·‖ as the cost. For simplicity, we write the *p*-Wasserstein distance with the Euclidean cost as *Wp*. The various properties of the Wasserstein distance are summarized in [35]. We define the Kullback–Leibler (KL) divergence as

$$\mathrm{KL}(\nu \| \mu) = \begin{cases} \int \log \frac{d\nu}{d\mu} \, d\nu, & \nu \ll \mu, \\ +\infty, & \text{otherwise}. \end{cases} \tag{A2}$$
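As a concrete illustration of these two quantities, both have closed forms for one-dimensional Gaussians, and the order-1 Wasserstein distance can be estimated from samples with `scipy.stats.wasserstein_distance` (a sketch; the parameter values below are arbitrary):

```python
import numpy as np
from scipy.stats import wasserstein_distance

mu1, s1, mu2, s2 = 0.0, 1.0, 2.0, 1.5  # nu = N(mu1, s1^2), mu = N(mu2, s2^2)

# Empirical W_1 (Equation (A1) with p = 1, Euclidean cost) from samples
rng = np.random.default_rng(0)
w1 = wasserstein_distance(rng.normal(mu1, s1, 100_000), rng.normal(mu2, s2, 100_000))

# Closed forms for Gaussians
w2 = np.sqrt((mu1 - mu2) ** 2 + (s1 - s2) ** 2)  # W_2 between two 1-D Gaussians
kl = np.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5  # KL(nu || mu), Equation (A2)

print(w1, w2, kl)  # W_1 <= W_2 by Jensen's inequality
```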

#### *Appendix A.2. Markov Diffusion and Generator*

Here we give additional details on the generator of a Markov diffusion process. Consider the SDE

$$dX\_t = -\nabla \mathcal{U}(X\_t)dt + \sqrt{2\beta^{-1}}dw(t),\tag{A3}$$

We denote the corresponding Markov semigroup by *P* = {*Pt*}*t*>0, where the operator *Ps* is defined by *Ps f*(*Xt*) = E[*f*(*Xt*+*s*) | *Xt*] for any bounded test function *f* : R*d* → R in *L*2(*μ*). The property *Ps*+*t* = *Ps* ◦ *Pt* is called the semigroup (Markov) property. A probability measure *π* is the stationary distribution when, for all bounded measurable functions *f* and all *t*, ∫_{R^d} *Pt f dπ* = ∫_{R^d} *f dπ*.
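The diffusion in Equation (A3) can be simulated with the Euler–Maruyama scheme. The sketch below (with an illustrative quadratic potential and step size, both assumptions of this sketch) converges to the stationary distribution π ∝ e^{−βU}:

```python
import numpy as np

def euler_maruyama(grad_U, x0, h, n_steps, beta=1.0, seed=0):
    """Discretize dX_t = -grad U(X_t) dt + sqrt(2 / beta) dw_t with step size h."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    path = np.empty((n_steps + 1, x.size))
    path[0] = x
    for k in range(n_steps):
        x = x - h * grad_U(x) + np.sqrt(2.0 * h / beta) * rng.normal(size=x.shape)
        path[k + 1] = x
    return path

# U(x) = ||x||^2 / 2, so grad U(x) = x and the stationary law is N(0, I / beta).
path = euler_maruyama(lambda x: x, x0=[3.0, -3.0], h=1e-2, n_steps=20_000)
stationary_var = path[10_000:].var(axis=0)  # discard burn-in; should be close to 1
```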

We denote the infinitesimal generator of the associated Markov semigroup by L, and call it the generator for simplicity. The linearity of the operators *Pt*, together with the semigroup property, shows that L is the derivative of *Pt*:

$$\frac{1}{h}(P_{t+h} - P_t) = P_t \frac{1}{h}(P_h - \mathrm{Id}) = \frac{1}{h}(P_h - \mathrm{Id})P_t, \tag{A4}$$

where Id is the identity map. Taking *h* → 0, we obtain *∂t Pt* = L*Pt* = *Pt*L. From the Hille–Yosida theory [19], there exists a dense linear subspace of *L*2(*π*) on which L is defined; we refer to it as D(L). If the Markov semigroup is associated with the SDE of Equation (A3), the generator can be written as

$$\mathcal{L}f(X_t) := \lim_{h \to 0^{+}} \frac{\mathbb{E}[f(X_{t+h})\,|\,X_t] - f(X_t)}{h} = \left(-\nabla \mathcal{U}(X_t) \cdot \nabla + \beta^{-1} \Delta \right) f(X_t), \tag{A5}$$

where Δ is the Laplacian in the standard Euclidean space. The generator satisfies L1 = 0 and ∫_{R^d} L*f dπ* = 0.

## *Appendix A.3. Poincaré Inequality*

We use the Poincaré inequality to measure the speed of convergence to the stationary distribution. In this section, we summarize its definition and useful properties; see [19] for more details. We define the Dirichlet form E(*f*), for all bounded functions *f* ∈ D(L), where D(L) denotes the domain of L, as

$$\mathcal{E}(f) := -\int\_{\mathbb{R}^d} f \mathcal{L}f d\pi. \tag{A6}$$

Note that E(*f*) ≥ 0. By integration by parts, we have E(*f*) = −∫_{R^d} *f*L*f dπ* = (1/*β*) ∫_{R^d} ‖∇*f*‖² *dπ*. We define the Dirichlet domain D(E) as the set of functions *f* ∈ *L*2(*π*) satisfying E(*f*) < ∞.

We say that *π* with L satisfies *a Poincaré inequality* with a positive constant *c* if, for any *f* ∈ D(E),

$$\int f^2 d\pi - \left(\int f d\pi\right)^2 \le c\mathcal{E}(f). \tag{A7}$$

This constant *c* is closely related to a spectral gap. If the smallest nonzero eigenvalue *λ* of −L is greater than 0, it is called the spectral gap, and it can be written as

$$\lambda := \inf\_{f \in \mathcal{D}(\mathcal{E})} \left\{ \frac{\mathcal{E}(f)}{\int f^2 d\pi} : f \neq 0, \int f d\pi = 0 \right\}. \tag{A8}$$

From this, any constant *c* satisfying *c* ≥ 1/*λ* also satisfies the Poincaré inequality. To check the existence of the spectral gap, one approach is to use the Lyapunov function technique developed by Bakry et al. [36].

We can also express the Poincaré inequality via the chi-squared divergence. Let us define the *χ*2 divergence, for *μ* ≪ *π*, as

$$\chi^2(\mu \| \pi) := \left\| \frac{d\mu}{d\pi} - 1 \right\|\_{L^2\_{\pi}}^2 = \int\_{\mathbb{R}^d} \left| \frac{d\mu}{d\pi} - 1 \right|^2 d\pi. \tag{A9}$$

Then, the Poincaré inequality with a constant *c* can be expressed, for all *μ* ≪ *π*, as

$$
\chi^2(\mu \| \pi) \le c \, \mathcal{E}\left(\sqrt{\frac{d\mu}{d\pi}}\right). \tag{A10}
$$

We obtain the following exponential convergence results from the above functional inequalities for measures.

**Theorem A1.** *(Exponential convergence in variance, Theorem 4.2.5 in [19]) If π satisfies the Poincaré inequality with a constant c, then exponential convergence in variance holds with rate* 2/*c; i.e., for every bounded function f* : R*d* → R*,*

$$\mathrm{Var}_{\pi}(P_t f) \le e^{-2t/c}\, \mathrm{Var}_{\pi}(f), \tag{A11}$$

*where* Var*π*(*f*) := ∫_{R^d} *f*² *dπ* − (∫_{R^d} *f dπ*)²*.*
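As a worked example of Theorem A1, consider the Ornstein–Uhlenbeck case *U*(*x*) = *mx*²/2, for which the semigroup acts on the linear test function *f*(*x*) = *x* as *Pt f*(*x*) = *e*^{−*mt*}*x*; the bound (A11) then holds with equality for *c* = 1/*m*. The sketch below checks this in closed form under these assumptions:

```python
import numpy as np

# OU diffusion dX_t = -m X_t dt + sqrt(2 / beta) dw_t, with pi = N(0, 1 / (m beta)).
m, beta, t = 2.0, 1.0, 0.7
var_pi = 1.0 / (m * beta)  # Var_pi(f) for f(x) = x

# P_t f(x) = exp(-m t) x for f(x) = x, so Var_pi(P_t f) = exp(-2 m t) Var_pi(f).
var_Ptf = np.exp(-2.0 * m * t) * var_pi

# The Poincaré constant of this generator is c = 1/m, so the bound (A11) reads:
c = 1.0 / m
bound = np.exp(-2.0 * t / c) * var_pi

print(var_Ptf, bound)  # equal: the linear test function saturates the bound
```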

We also introduce an important property of the Poincaré inequality with respect to product measures. This relation plays an important role in our analysis.

**Theorem A2.** *(Stability under products, Proposition 4.3.1 in [19]) If μ*1 *and μ*2 *on* R*d satisfy Poincaré inequalities with constants c*1 *and c*2*, respectively, then the product μ*1 ⊗ *μ*2 *on* R*d* × R*d satisfies the Poincaré inequality with the constant* max(*c*1, *c*2)*.*

## **Appendix B. Generator of the Underdamped Langevin Dynamics (ULD)**

Following [10], we define the infinitesimal generator of the ULD as

$$\mathcal{L}f(x,v) := -(\gamma v + \nabla \mathcal{U}(x)) \cdot \nabla_{v}f(x,v) + \gamma\beta^{-1} \Delta_{v} f(x,v) + v \cdot \nabla_{x}f(x,v). \tag{A12}$$

Then, we define the generator of S-ULD as

$$\begin{split} \mathcal{L}f(x,v) := -(\gamma v + \nabla \mathcal{U}(x)) \cdot \nabla_{v}f(x,v) + \gamma\beta^{-1} \Delta_{v} f(x,v) \\ + v \cdot \nabla_{x}f(x,v) + \alpha_{1}J_{1} \nabla \mathcal{U}(x) \cdot \nabla_{x}f(x,v) + \alpha_{2}\Sigma^{-1} J_{2} v \cdot \nabla_{v}f(x,v), \end{split} \tag{A13}$$

where the second line corresponds to the interaction terms. It is then easy to confirm that ∫_{R^{2d}} L*f*(*x*, *v*) *dπ̃* = 0, where *π̃* := *π* ⊗ N(0, Σ) ∝ *e*^{−*βU*(*x*) − (1/2)*v*^⊤Σ^{−1}*v*}; this can be shown simply by integration by parts and the properties of the skew-symmetric matrices. Thus, the stationary distribution of S-ULD is *π̃*.

We can also consider other combinations of the skew matrices with ULD. For example, consider the following more general dynamics:

$$\begin{split} dX_{t} &= \Sigma^{-1} V_{t} dt + \alpha_{1} J_{1} \nabla \mathcal{U}(X_{t}) dt + \alpha_{2} \Sigma^{-1} J_{2} V_{t} dt, \\ dV_{t} &= -\nabla \mathcal{U}(X_{t}) dt - \gamma \Sigma^{-1} V_{t} dt + \alpha_{3} J_{3} V_{t} dt + \alpha_{4} J_{4} \nabla \mathcal{U}(X_{t}) dt + \sqrt{2\gamma \beta^{-1}} dw_{t}, \end{split} \tag{A14}$$

Compared to S-ULD, two new terms are included. We can again derive the infinitesimal generator of this Markov process, which we denote by L̃. Calculating the infinitesimal change of the expectation of *f*, we find

$$\int_{\mathbb{R}^{2d}} \tilde{\mathcal{L}} f(x, v)\, d\tilde{\pi} \neq 0, \tag{A15}$$

which shows that the stationary distribution of Equation (A14) is different from *π̃*.

It is widely known that underdamped Langevin dynamics converges to (overdamped) Langevin dynamics in the small-mass limit. Here we observe that S-ULD converges to the skew-LD of [18]. The limiting procedure is standard; see, for example, [17,37,38]. We apply Proposition 1 in [17]: given the stochastic process

$$\begin{aligned} dX_t &= \Sigma^{-1} V_t dt + \alpha_1 J_1 \nabla \mathcal{U}(X_t) dt, \\ dV_t &= -\nabla \mathcal{U}(X_t) dt - \gamma \Sigma^{-1} V_t dt - \alpha_2 \Sigma^{-1} J_2 V_t dt + \sqrt{2\gamma} dw_t, \end{aligned} \tag{A16}$$

we rescale it by introducing *ε*, which expresses the small-mass limit, as

$$\begin{split} dX_{t} &= \frac{1}{\varepsilon} \Sigma^{-1} V_{t} dt + \alpha_{1} J_{1} \nabla \mathcal{U}(X_{t}) dt, \\ dV_{t} &= -\frac{1}{\varepsilon} \nabla \mathcal{U}(X_{t}) dt - \frac{1}{\varepsilon^{2}} \gamma \Sigma^{-1} V_{t} dt - \frac{1}{\varepsilon^{2}} \alpha_{2} \Sigma^{-1} J_{2} V_{t} dt + \frac{1}{\varepsilon} \sqrt{2\gamma} dw_{t}, \end{split} \tag{A17}$$

and by taking the limit *ε* → 0, the dynamics converge to

$$dX_t = -\left(\alpha_2 J_2 + \gamma\right)^{-1} \nabla \mathcal{U}(X_t) dt - \alpha_1 J_1 \nabla \mathcal{U}(X_t) dt + \left(\alpha_2 J_2 + \gamma\right)^{-1} \sqrt{2\gamma} dw_t. \tag{A18}$$

See Proposition 1 in [17] for the precise statement. Please note that the term involving *J*2 acts as a preconditioner. Thus, if we set *α*2*J*2 = 0, the obtained dynamics are equivalent to the continuous dynamics of skew-SGLD. Hence, our skew-SGHMC is a natural extension of skew-SGLD.

#### **Appendix C. Proof of Theorem 1**

#### *Appendix C.1. Proof for S-LD*

First, under Assumptions 1–5, LD has a spectral gap, and its Poincaré constant is bounded as

$$\frac{1}{m\_0} \le \frac{2\mathcal{C}(d+b\beta)}{m\beta} \exp\left(\frac{2}{m}(M+B)(b\beta+d) + \beta(A+B)\right) + \frac{1}{m\beta(d+b\beta)}.\tag{A19}$$

This bound is derived in [2].

Next, we introduce the generator of S-LD:

$$\mathcal{L}_{\alpha}f(x) = \left(-\nabla \mathcal{U}_{\alpha}(x) \cdot \nabla + \beta^{-1} \Delta \right) f(x),$$

where ∇*Uα*(*x*) := ∇*U*(*x*) + *αJ*∇*U*(*x*).

The proof closely follows that of Theorem 12 in [18].

**Proof of Theorem 1.** Since the generator L*α*=0 is self-adjoint and satisfies a suitable growth condition, the spectrum of L*α*=0 is discrete [19]. We denote the spectrum of L*α*=0 by {*λk*} ⊂ R and the corresponding normalized eigenfunctions by {*ek*}, which are real functions. We order the spectrum as 0 > *λ*0 > *λ*1 > · · ·. Thus, *m*0 = −*λ*0.

As for L*α*, although it is not a self-adjoint operator, Proposition 1 in Franke et al. [39] shows that it has a discrete complex spectrum. We denote an element of the spectrum of L*α* by *λ* + *iμ* ∈ C, where *λ*, *μ* ∈ R, and the corresponding normalized eigenfunction by *u* + *iv*, where *u*, *v* are real functions. Then we have

$$
\mathcal{L}_{\alpha}(u + iv) = (\lambda + i\mu)(u + iv). \tag{A20}
$$

From this definition, by separating the real and imaginary parts, the following relations are derived:

$$
\mathcal{L}_{\alpha}u = \lambda u - \mu v, \tag{A21}
$$

$$
\mathcal{L}_{\alpha}v = \lambda v + \mu u. \tag{A22}
$$

Due to the divergence-free drift property, for any bounded real-valued test function *g*(*x*),

$$
\int g(\mathcal{L}_{\alpha=0} - \mathcal{L}_{\alpha}) g\, d\pi = \int g\, \alpha J \nabla \mathcal{U} \cdot \nabla g\, d\pi = -\int g\, \alpha J \nabla \mathcal{U} \cdot \nabla g\, d\pi = 0, \tag{A23}
$$

where we used integration by parts. This means that, for any bounded real function *g*(*x*),

$$
\int g\, \mathcal{L}_{\alpha=0}\, g\, d\pi = \int g\, \mathcal{L}_{\alpha}\, g\, d\pi. \tag{A24}
$$

(Note that this only holds for real functions.) Then, we can evaluate the real part *λ* of the eigenvalue as follows:

$$
\int u \mathcal{L}_{\alpha=0} u\, d\pi + \int v \mathcal{L}_{\alpha=0} v\, d\pi = \int u \mathcal{L}_{\alpha} u\, d\pi + \int v \mathcal{L}_{\alpha} v\, d\pi = \lambda \left( \int u^{2} d\pi + \int v^{2} d\pi \right) = \lambda. \tag{A25}
$$

Then, expanding the eigenfunctions *u*, *v* in the basis {*ek*}, we obtain

$$\begin{split} \lambda = \int u \mathcal{L}_{\alpha=0} u\, d\pi + \int v \mathcal{L}_{\alpha=0} v\, d\pi &= \sum_{k} \lambda_{k} \left( \left( \int u e_{k}\, d\pi \right)^{2} + \left( \int v e_{k}\, d\pi \right)^{2} \right) \\ &\leq \lambda_{0} \sum_{k} \left( \left( \int u e_{k}\, d\pi \right)^{2} + \left( \int v e_{k}\, d\pi \right)^{2} \right) \leq \lambda_{0}. \end{split} \tag{A26}$$

Thus, the real part of any eigenvalue of L*α* is at most *λ*0, the largest eigenvalue of L*α*=0. This means that the spectral gap of L*α* is at least as large as that of L*α*=0; i.e., *m*(*α*) ≥ *m*0 holds.

## *Appendix C.2. Proof of Theorem 2 (S-ULD)*

**Proof of Theorem 2.** To prove the result for S-ULD, we use the result of [20], which characterizes the convergence of ULD via the Poincaré constant. Let *μ̃t* denote the measure induced by ULD. Then, from Theorem 1 of [20], if *π* with L has the Poincaré constant *m*0, we have

$$
\chi^2(\tilde{\mu}_t \| \tilde{\pi}) \le \frac{1+\bar{\varepsilon}}{1-\bar{\varepsilon}}\, e^{-\lambda_\gamma t}\, \chi^2(\tilde{\mu}_0 \| \tilde{\pi}), \tag{A27}
$$

where *ε̄* and *λγ* are given as follows:

$$
\lambda_{\gamma} = \frac{\Lambda(\gamma, \bar{\varepsilon} \min(\gamma, \gamma^{-1}))}{1 + \bar{\varepsilon} \min(\gamma, \gamma^{-1})}, \tag{A28}
$$

where

$$\Lambda(\gamma, \varepsilon) = \frac{\gamma \Sigma^{-1} - \frac{\varepsilon}{1 + \frac{m_0 \Sigma^{-1}}{\beta}}}{2} - \frac{1}{2} \sqrt{(S_{--} - S_{++})^2 + (S_{-+})^2}, \tag{A29}$$

$$S\_{--} = \varepsilon \lambda\_{\text{ham}}.\tag{A30}$$

$$S\_{-+} = -\epsilon (R\_{\text{ham}} + \gamma \Sigma^{-1} / 2),\tag{A31}$$

$$S_{++} = \gamma \Sigma^{-1} - \varepsilon, \tag{A32}$$

$$
\lambda\_{\text{ham}} = 1 - \left( 1 + \frac{m\_0 \Sigma^{-1}}{\beta} \right)^{-1},
\tag{A33}
$$

$$
\epsilon = \bar{\epsilon} \min(\gamma, \gamma^{-1}),
\tag{A34}
$$

where *ε̄* is an arbitrary, sufficiently small positive value such that Λ(*γ*, *ε̄* min(*γ*, *γ*−1)) > 0 is satisfied. As for *Rham*, if there exists a positive constant *K* such that ∇2*U* ≥ −*K I*, then *Rham* ≤ √max{*K*, 2}. Under our assumptions, *K* corresponds to *βM*; thus *Rham* ≤ √max{*βM*, 2}. From the above definitions, we see that the larger *m*0, i.e., the larger the Poincaré constant, the faster ULD converges.

This can also be confirmed numerically: see Figure A1, which shows how Λ changes under different *m*0. We set Σ−1 = 100. From the figure, the larger the Poincaré constant, the larger Λ becomes.

**Figure A1.** The convergence rate of ULD under the different Poincaré constants.

So far, we have confirmed that the convergence speed of ULD is characterized by the Poincaré constant of L. When we consider S-ULD, we simply add the skew-matrix terms to the generator of ULD in the proof of Proposition 1 in [20]. This amounts to replacing the Poincaré constant *m*0 with *m*(*α*) in that proof, which indicates faster convergence.

#### **Appendix D. Eigenvalue and Poincaré Constant**

In this section, we discuss the relation between eigenvalues of the Hessian matrix and Poincaré constant.

#### *Appendix D.1. Strongly Convex Potential Function*

When we consider LD with an *m*-strongly convex potential function, the Poincaré constant is *m*, which implies exponential convergence with rate *m* (see [19] for details).

We then consider S-LD with an *m*-strongly convex potential. In this setting, using the synchronous coupling technique [11], we can show that the variance decays exponentially at a rate given by the smallest real part of the eigenvalues. To see this, we prepare two copies of S-LD, (*Xt*, *Yt*), given as

$$dX_t = -(I + \alpha J)\nabla \mathcal{U}(X_t)dt + \sqrt{2\beta^{-1}}dw_t, \qquad dY_t = -(I + \alpha J)\nabla \mathcal{U}(Y_t)dt + \sqrt{2\beta^{-1}}dw_t. \tag{A35}$$

Then we evaluate the behavior of ‖*Xt* − *Yt*‖². From Itô's lemma and the synchronous coupling (both processes are driven by the same Brownian motion), we obtain

$$\frac{d}{dt} \|X_t - Y_t\|^2 = -(X_t - Y_t) \cdot \frac{(I + \alpha J)}{\beta} (\nabla \mathcal{U}(X_t) - \nabla \mathcal{U}(Y_t)) \le -\frac{2m(\alpha)}{\beta} \|X_t - Y_t\|^2, \tag{A36}$$

where *m*(*α*) is a constant satisfying *m*(*α*) ≤ Re *λ*1^*α*(*x*) for all *x*; see Appendix E for details. This means that the variance decays exponentially at the rate 2*m*(*α*)/*β*. From the fundamental property of the Poincaré constant (Theorem 4.2.5 in [19]), *m*(*α*) is the Poincaré constant. Thus, the imaginary parts have no effect on the continuous dynamics, and the Poincaré constant is given by the smallest real part of the eigenvalues of the perturbed Hessian matrix.
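The contraction in Equation (A36) is easy to observe numerically for a quadratic potential *U*(*x*) = *x*^⊤*Hx*/2, where ∇*U*(*x*) = *Hx*: two discretized chains driven by the *same* noise contract at a rate set by the real parts of the eigenvalues of (*I* + *αJ*)*H*. This is a sketch; the matrices *H*, *J* and the step size below are arbitrary choices.

```python
import numpy as np

H = np.array([[2.0, 0.0], [0.0, 1.0]])   # Hessian of U, smallest eigenvalue m = 1
J = np.array([[0.0, 1.0], [-1.0, 0.0]])  # skew-symmetric perturbation
alpha, h, beta = 1.0, 1e-3, 1.0
A = (np.eye(2) + alpha * J) @ H          # perturbed drift matrix (I + alpha J) H

rng = np.random.default_rng(1)
x, y = np.array([5.0, 5.0]), np.array([-5.0, -5.0])
d0 = np.linalg.norm(x - y)
for _ in range(5000):
    xi = np.sqrt(2.0 * h / beta) * rng.normal(size=2)  # synchronous coupling: shared noise
    x = x - h * (A @ x) + xi
    y = y - h * (A @ y) + xi
ratio = np.linalg.norm(x - y) / d0
print(ratio)  # small: the gap contracts like exp(-Re(lambda) * h * n_steps)
```

Because the noise is shared, the difference *x* − *y* evolves deterministically, so the decay isolates exactly the drift contraction analyzed above.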

#### *Appendix D.2. Non-Convex Potential Function*

As discussed in Section 3.1, [21] derived a sharper estimate of the Poincaré constant for non-convex potential functions. It is easy to verify that their assumptions are satisfied under our Assumptions 1–5. Following the main paper, we denote by *x*1 the global minimum and by *x*2 the local minimum attaining the second smallest value of *U*(*x*). We denote the saddle point between *x*1 and *x*2 by *x*∗. More precisely, the saddle point that characterizes the Poincaré constant is the critical point with index one, defined as

$$\mathcal{U}(\mathbf{x}^\*) = \inf \left\{ \max\_{s \in [0, 1]} \mathcal{U}(\gamma(s)) : \gamma \in \mathcal{C}([0, 1], \mathbb{R}^d), \gamma(0) = \mathbf{x}\_1, \gamma(1) = \mathbf{x}\_2 \right\},\tag{A37}$$

and the Hessian ∇2*U*(*x*∗) has one negative eigenvalue and *d* − 1 positive eigenvalues, which we write as *λ*1(*x*∗) < 0 < *λ*2(*x*∗) ≤ · · · ≤ *λd*(*x*∗).

Ref. [21] studied the Poincaré constant by decomposing the non-convex potential with a focus on attractors. They showed that the non-convex potential can be decomposed into a sum of *approximately* Gaussian distributions, and that the Poincaré constant is characterized by local Poincaré constants derived from the approximate Gaussian distributions on the attractors and their surrounding regions. In addition, they proved that the dominant term of the Poincaré constant is determined by the saddle point between the global minimum and the point attaining the second smallest value of *U*(*x*). From Theorem 2.12 and Corollary 2.15 in [21], the Poincaré constant is characterized by

$$m_0^{-1} \approx \frac{\sqrt{\det H(x^*)}}{\sqrt{Z|\lambda_1(x^*)|\det H(x_1)}\sqrt{\det H(x_2)}} e^{\beta(\mathcal{U}(x^*) - \mathcal{U}(x_1) - \mathcal{U}(x_2))} \propto \frac{1}{|\lambda_1(x^*)|} e^{\beta(\mathcal{U}(x^*) - \mathcal{U}(x_1) - \mathcal{U}(x_2))},\tag{A38}$$

where *Z* is the normalizing constant of *e*−*βU*(*x*).

Next, we discuss how this estimate changes when the skew matrices are applied. When the skew matrices are introduced, Lemma A.1 in [40] shows that, at the saddle point, there exists a unique negative real eigenvalue *λ*1^*α*(*x*∗) < 0 of the perturbed Hessian matrix, even though (*I* + *αJ*)*H* is not a symmetric matrix.

Then, from Proposition 5 in [8], the negative eigenvalue of the perturbed Hessian is smaller than that of the unperturbed Hessian matrix at the saddle point; i.e., *λ*1^*α*(*x*∗) ≤ *λ*1(*x*∗) < 0 holds.

Finally, from Theorem 5.1 in [41] and Theorem 2.12 in [21], this improvement of the negative eigenvalue at the saddle point directly leads to a larger Poincaré constant.

#### **Appendix E. Properties of a Skew-Symmetric Matrix**

Here, we introduce basic properties of the skew-symmetric matrices. Assume that the *d* × *d* matrix *H*′ = (*I* + *αJ*)*H* is diagonalizable, and that *H*′ has *l* real eigenvalues *λ*1, ... , *λl* and 2*m* complex eigenvalues *μ*1 = *α*1 ± *iβ*1, ... , *μm* = *αm* ± *iβm*; thus *d* = *l* + 2*m*. We denote the corresponding eigenvectors by {*vj*} for the real eigenvalues, by {*wj* = *aj* + *ibj*} for the complex eigenvalues {*μj*}, and by {*w̄j*} for the conjugate eigenvalues. Then, let us define a *d* × *d* matrix *V* as

$$V = [\upsilon\_1, \dots, \upsilon\_l, a\_1, b\_1, \dots, a\_m, b\_m]. \tag{A39}$$

Then, we can decompose *H*′ into a block-diagonal form [42]:

$$H'V = VD \tag{A40}$$

where *D* := *A* + *B* is block diagonal: *A* carries the real eigenvalues and the real parts *αj*, and *B* carries the imaginary parts ±*βj* in 2 × 2 skew-symmetric blocks. Then, from the Taylor expansion, expressing its residual in integral form and defining *H*(*x*) := ∇2*U*(*x*), we have

$$(x - y)^\top (I + \alpha J)(\nabla \mathcal{U}(x) - \nabla \mathcal{U}(y)) = (x - y)^\top \left( \int_0^1 (I + \alpha J)H(y + \tau(x - y))(x - y)\, d\tau \right). \tag{A42}$$

Then, let us apply the Jordan canonical form here. If (*I* + *αJ*)*H* is diagonalizable, it is decomposable into the block-diagonal form shown in Equation (A40). Then, we can decompose (*I* + *αJ*)*H* as

$$(I + \alpha J)H(y + \tau(x - y)) = VDV^{-1}. \tag{A43}$$

Then, we obtain

$$\begin{split} (x - y)^\top (I + \alpha J)(\nabla \mathcal{U}(x) - \nabla \mathcal{U}(y)) &= (x - y)^\top \left( \int_0^1 (I + \alpha J)H(y + \tau(x - y))(x - y)\, d\tau \right) \\ &= \int_0^1 (x - y)^\top V(A + B)V^{-1}(x - y)\, d\tau \\ &= \int_0^1 (x - y)^\top V A V^{-1}(x - y)\, d\tau \\ &\ge m(\alpha) \|x - y\|^2, \end{split} \tag{A44}$$

where *m*(*α*) is a constant satisfying *m*(*α*) ≤ min{*λ*1, ... , *λl*, *α*1, ... , *αm*} for all *x*. Thus, the imaginary parts never appear in the bound, and we only need to focus on the smallest real part of the eigenvalues when the matrix is diagonalizable. The following appendices describe when the non-symmetric matrix *H*′ is diagonalizable, focusing on random matrices.

#### **Appendix F. Proof of Theorem 3**

**Proof.** Since the potential function is *m*-strongly convex, the smallest eigenvalue of the Hessian matrix *H* is at least *m* > 0. Thus, *H* and *H*1/2 are regular (invertible) matrices. With this in mind, we consider *H* + *H*1/2*JH*1/2 as a matrix similar to *H*′ := (*I* + *J*)*H*. This is easily confirmed by

$$H^{-1/2}(H + H^{1/2} J H^{1/2}) H^{1/2} = H'. \tag{A45}$$

This means that to study the eigenvalues of *H*′, we only need to study the similar matrix *A* := *H* + *H*1/2*JH*1/2. The advantage is that *A* is the sum of a symmetric and a skew-symmetric matrix, which is easier to treat than *H*′, where the term *JH* is difficult to analyze. For simplicity, we omit the dependency of *H* and *H*′ on *x* in this section.
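The similarity relation (A45), and the resulting equality of spectra, is straightforward to verify numerically (a sketch; the random *H* and *J* below are arbitrary choices):

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
d = 4
M = rng.normal(size=(d, d))
H = M @ M.T + d * np.eye(d)   # symmetric positive definite, as for a strongly convex U
S = rng.normal(size=(d, d))
J = S - S.T                   # skew-symmetric
Hs = np.real(sqrtm(H))        # H^{1/2}

H_prime = (np.eye(d) + J) @ H
A = H + Hs @ J @ Hs           # symmetric part plus a skew-symmetric part

# Equation (A45): H^{-1/2} A H^{1/2} = H', so A and H' are similar and share eigenvalues.
sim = np.linalg.solve(Hs, A @ Hs)  # H^{-1/2} A H^{1/2}
ev_A = np.sort_complex(np.linalg.eigvals(A))
ev_H = np.sort_complex(np.linalg.eigvals(H_prime))
print(np.allclose(sim, H_prime), np.allclose(ev_A, ev_H))
```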

**Remark A1.** *Please note that we can eliminate the strong convexity of U whenever H is a regular matrix, that is, whenever H does not have* 0 *as an eigenvalue.*

For simplicity, we assume that the dimension *d* is an even number. We assume that the eigenvalues and eigenvectors of *A* are expressed as

$$Aw_j = \mu_j w_j \Leftrightarrow A(a_j + ib_j) = (\alpha_j + i\beta_j)(a_j + ib_j), \tag{A46}$$

where the *αj* are ordered as *α*1 ≤ *α*2 ≤ · · ·. In this section, we only consider the setting where all eigenvalues and eigenvectors are complex, for notational simplicity; the extension to the general setting, as in Appendix E, and to odd *d* is straightforward.

We denote the eigenvalues and eigenvectors of *H* by {*λj*, *vj*}, where the *vj* are linearly independent, and we assume *λ*1 ≤ · · · ≤ *λd*. From this definition, by separating the real and imaginary parts, the following relations are derived:

$$Aa_j = \alpha_j a_j - \beta_j b_j, \tag{A47}$$

$$Ab_j = \alpha_j b_j + \beta_j a_j. \tag{A48}$$

Thus, by the skew-symmetric property,

$$a_j^\top A a_j + b_j^\top A b_j = \alpha_j (\|a_j\|^2 + \|b_j\|^2) = \alpha_j, \tag{A49}$$

$$\alpha_j = a_j^\top H a_j + b_j^\top H b_j, \tag{A50}$$

where we used the property

$$a_j^\top H^{1/2} J H^{1/2} a_j = b_j^\top H^{1/2} J H^{1/2} b_j = 0, \tag{A51}$$

since *H*1/2*JH*1/2 is a skew-symmetric matrix. Then, we expand *aj* and *bj* in terms of the *vj* as

$$a_k = \sum_{j=1}^d (a_k^\top v_j)\, v_j, \tag{A52}$$

$$b_k = \sum_{j=1}^d (b_k^\top v_j)\, v_j, \tag{A53}$$

since the *vj* are eigenvectors of *H* and can be used as an orthonormal basis for R*d*. Substituting this into Equation (A50), we have

$$\alpha_k = \sum_{j=1}^d \lambda_j (a_k^\top v_j)^2 + \sum_{j=1}^d \lambda_j (b_k^\top v_j)^2 \ge \lambda_1 \sum_{j=1}^d \left( (a_k^\top v_j)^2 + (b_k^\top v_j)^2 \right) = \lambda_1. \tag{A54}$$

This means that the real part of any eigenvalue of *A* is no smaller than *λ*1, the smallest eigenvalue of *H*. In particular, *α*1, the smallest real part of an eigenvalue of *A*, is no smaller than the smallest eigenvalue of *H*. This concludes the proof.

In the same way,

$$\alpha_k = \sum_{j=1}^d \lambda_j (a_k^\top v_j)^2 + \sum_{j=1}^d \lambda_j (b_k^\top v_j)^2 \le \lambda_d \sum_{j=1}^d \left( (a_k^\top v_j)^2 + (b_k^\top v_j)^2 \right) = \lambda_d, \tag{A55}$$

which means that the real part of any eigenvalue of *A* is no larger than *λd*, the largest eigenvalue of *H*. Thus, the largest real part of the eigenvalues of *A* is at most the largest eigenvalue of *H*.
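Both bounds (the real parts of the eigenvalues of *A* are sandwiched between *λ*1 and *λd*) can be checked numerically; this is a sketch with arbitrary random choices of *H* and *J*:

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(42)
d = 6
M = rng.normal(size=(d, d))
H = M @ M.T + d * np.eye(d)       # strongly convex Hessian: all eigenvalues > 0
S = rng.normal(size=(d, d))
J = S - S.T                       # skew-symmetric
Hs = np.real(sqrtm(H))            # H^{1/2}

A = H + Hs @ J @ Hs               # similar to (I + J) H, hence the same eigenvalues
lam = np.linalg.eigvalsh(H)       # eigenvalues of H, ascending: lam[0] = lambda_1
re_alpha = np.sort(np.linalg.eigvals(A).real)

# Every real part lies in [lambda_1, lambda_d], as Equations (A54) and (A55) assert.
print(lam[0] <= re_alpha[0] and re_alpha[-1] <= lam[-1])
```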

**Equality condition.**

Next, we discuss when the equality *α*1 = *λ*1 holds. First, we assume that the eigenvalues of *H* are distinct; thus, there is only one eigenvector for *λ*1. The case of non-distinct eigenvalues is discussed afterwards. From Equation (A54), we have

$$\alpha_1 = \sum_{j=1}^d \lambda_j (a_1^\top v_j)^2 + \sum_{j=1}^d \lambda_j (b_1^\top v_j)^2 \ge \lambda_1 \sum_{j=1}^d \left( (a_1^\top v_j)^2 + (b_1^\top v_j)^2 \right) \tag{A56}$$

in general. Note that if *a*1 and *b*1 are not proportional to *v*1, then some *λj* with *j* ≠ 1, which satisfies *λj* > *λ*1, must appear in the summation, and equality never holds. So, the condition

$$a_1, b_1 \parallel v_1 \tag{A57}$$

must hold for the equality.

Based on this, let us assume that *w*1 = *ca*1 + *ic*′*b*1, where *c*2 + *c*′2 = 1. We first consider the case *a*1 = *b*1 = *v*1. Then we need to solve the simultaneous equations

$$A(ca\_1 + ic'b\_1) = (\lambda\_1 + i\beta\_1)(ca\_1 + ic'b\_1) = (\lambda\_1 c - c'\beta\_1)v\_1 + i(c\beta\_1 + \lambda\_1 c')v\_1,\tag{A58}$$

this is obtained by the definition of the eigenvalue of *A* and

$$A(ca_1 + ic'b_1) = \lambda_1^{1/2} c(\lambda_1^{1/2} I + \alpha H^{1/2}J)v_1 + i\lambda_1^{1/2}c'(\lambda_1^{1/2} I + \alpha H^{1/2}J)v_1, \tag{A59}$$

which is obtained from the definition of the eigenvalues of *H*. Multiplying by *v*1 from the left, we obtain *cβ*1 = 0 and *c*′*β*1 = 0; thus, *β*1 = 0. Since *β*1 = 0 implies *b*1 = 0 by the properties of complex eigenvectors, we obtain *w*1 = *a*1 = *v*1 when *λ*1 = *α*1. Then, the following relation holds:

$$
\lambda_1 v_1 = Av_1 = Hv_1 + \alpha H^{1/2} J H^{1/2} v_1 = \lambda_1 v_1 + \alpha \lambda_1^{1/2} H^{1/2} J v_1. \tag{A60}
$$

Since *λ*1 ≠ 0 and *H*1/2 is invertible, this condition indicates that

$$
\alpha J v_1 = 0. \tag{A61}
$$

This is the condition under which *λ*1 = *α*1 holds. The same relation can be derived for *λd* = *αd*. Next, we assume that the eigenvalues of *H* are not distinct. Let us denote the eigenspace of the eigenvalue *λ*1 by *V*₁⁰. Note that if *a*1 and *b*1 are not contained in *V*₁⁰, then some *λj* > *λ*1 must appear and equality never holds. Thus,

$$a\_1, b\_1 \in V\_1^0 \tag{A62}$$

must hold for equality. Based on this, let us assume that *w*1 = *ca*1 + *ic*′*b*1, where *c*2 + *c*′2 = 1. We consider the case *a*1 ≠ *b*1. Then

$$\begin{aligned} H^{-1/2}A(ca_1 + ic'b_1) &= \lambda_1^{-1/2}(\lambda_1 + i\beta_1)(ca_1 + ic'b_1), \\ H^{-1/2}(H + \alpha H^{1/2}JH^{1/2})(ca_1 + ic'b_1) &= \lambda_1^{1/2}c(I + \alpha J)a_1 + i\lambda_1^{1/2}c'(I + \alpha J)b_1, \end{aligned} \tag{A63}$$

from which we obtain the conditions

$$
\lambda_1 c \alpha J a_1 = -\beta_1 c' b_1, \tag{A64}
$$

$$
\lambda_1 c' \alpha J b_1 = \beta_1 c a_1. \tag{A65}
$$

#### **Appendix G. Proofs of Random Matrices**

*Appendix G.1. Proof of Theorem 5*

**Proof.** The proof is a straightforward consequence of the following lemma in [43].

**Lemma** ([43])**.** *If f*(*x*1, ... , *xm*) *is a polynomial in the real variables x*1, ... , *xm that is not identically zero, then the subset Nm* = {(*x*1, ... , *xm*) | *f*(*x*1, ... , *xm*) = 0} *of the Euclidean m-space* R*m has Lebesgue measure zero.*

We use this lemma to prove that the probability of *λ*1 = *α*1 is 0, by showing that the set of matrices for which *λ*1 = *α*1 holds has Lebesgue measure zero.

We use the same notation as in Appendix F. Recall Equation (A64), which is the equality condition for *λ*1 = *α*1. We express the elements of *a*1 and *b*1 as *a*1 = (*a*1<sup>1</sup>, ... , *a*1<sup>*d*</sup>) and *b*1 = (*b*1<sup>1</sup>, ... , *b*1<sup>*d*</sup>). Then the equality condition can be written as

$$\sum\_{i=1}^{d} \left(\sum\_{j=1}^{d} \lambda\_1 c \alpha J\_{i,j} a\_1^j + \beta\_1 c' b\_1^i \right)^2 + \sum\_{i=1}^{d} \left(\sum\_{j=1}^{d} \lambda\_1 c' \alpha J\_{i,j} b\_1^j - \beta\_1 c a\_1^i \right)^2 = 0. \tag{A66}$$

Then we define the polynomial in the variables {*Ji*,*j*}

$$f(J\_{1,2},\ldots,J\_{d-1,d}) = \sum\_{i=1}^{d} \left(\sum\_{j=1}^{d} \lambda\_1 c \alpha J\_{i,j} a\_1^j + \beta\_1 c' b\_1^i\right)^2 + \sum\_{i=1}^{d} \left(\sum\_{j=1}^{d} \lambda\_1 c' \alpha J\_{i,j} b\_1^j - \beta\_1 c a\_1^i\right)^2. \tag{A67}$$

To apply the lemma of [43], we must confirm that *f*(*J*1,2, ... , *Jd*−1,*d*) is not identically 0. This is clear from the definition of *f*, since we generate *J*1,2, ... , *Jd*−1,*d* randomly from a distribution that is absolutely continuous with respect to the Lebesgue measure, *λ*1 ≠ 0, *c*2 + *c'*2 = 1, and at least one of *a*1, *b*1 is nonzero.

Then, given an evaluation point *x*, from the lemma of [43], the subset of {*Ji*,*j*} ∈ R*<sup>d</sup>*(*d*−1)/2 that satisfies *f*(*J*1,2, ... , *Jd*−1,*d*) = 0 has Lebesgue measure zero. Thus, if we generate {*Ji*,*j*} from a probability measure which is absolutely continuous with respect to the Lebesgue measure (such as a Gaussian distribution), *f*(*J*1,2, ... , *Jd*−1,*d*) = 0 holds with probability 0. This concludes the proof.
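The lemma of [43] can be illustrated numerically. In the sketch below (illustrative only, not part of the proof), the determinant plays the role of the polynomial *f*: it is not identically zero, so matrices with independent Gaussian entries are almost surely non-singular.

```python
import numpy as np

# The determinant is a polynomial in the d*d matrix entries that is not
# identically zero; by the lemma of [43], its zero set has Lebesgue measure
# zero, so Gaussian random matrices are almost surely non-singular.
rng = np.random.default_rng(0)
d = 6
singular = 0
for _ in range(1000):
    M = rng.standard_normal((d, d))
    if abs(np.linalg.det(M)) < 1e-12:
        singular += 1
# Up to floating point, no draw lands in the measure-zero set.
```

The same reasoning applies to the polynomial *f*(*J*1,2, ... , *Jd*−1,*d*) in Equation (A67).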

#### *Appendix G.2. Proof of Lemma 1*

**Proof.** We first discuss the condition Ker *J*0 = {0}. Since *J* = *J*0 ⊗ *Id*, we denote the set of eigenvalues of *J*0 as {*ωi*}. In general, the eigenvalues of a matrix composed of the Kronecker product of two matrices, e.g., *A* and *B*, are given as the products of the eigenvalues of *A* and *B* [44]. Thus, since *J* is the Kronecker product of *J*0 and *Id*, if *J*0 does not have 0 as an eigenvalue, *J* does not have 0 as an eigenvalue.
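The Kronecker-product eigenvalue property [44] can be checked numerically; the sketch below is illustrative only (symmetric matrices are used so that the spectra are real).

```python
import numpy as np

# Eigenvalues of kron(A, B) are the pairwise products of the eigenvalues
# of A and B; in particular, J0 ⊗ I_d has a zero eigenvalue iff J0 does.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); A = A + A.T   # symmetric -> real spectrum
B = rng.standard_normal((4, 4)); B = B + B.T
eig_kron = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
eig_prod = np.sort(np.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel())
# The two sorted spectra coincide up to numerical error.
```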

Next, we discuss the other equality condition. We use similar notation to Appendix F, but now the dimension of the matrix *J* is *dN*. We express the eigenvalue with the smallest real part as *λ*1<sup>*α*</sup> and its eigenvector as *w*1<sup>*α*</sup> = *a*1 + *ib*1. We express the elements of *a*1 and *b*1 as *a*1 = (*a*1<sup>1</sup>, ... , *a*1<sup>*dN*</sup>) ∈ R*<sup>dN</sup>* and *b*1 = (*b*1<sup>1</sup>, ... , *b*1<sup>*dN*</sup>). We also express these in blocks as *a*1 = (*a*1<sup>(1)</sup>, ... , *a*1<sup>(*N*)</sup>) ∈ R*<sup>dN</sup>*, where *a*1<sup>(*i*)</sup> = (*a*1<sup>(*i*−1)*d*+1</sup>, ... , *a*1<sup>*id*</sup>) ∈ R*<sup>d</sup>*. We use the Kronecker product property:

$$Ja\_1 = (J\_0 \otimes I\_d)a\_1 = \left(\sum\_{i=1}^N J\_{0|i,1}\, a\_1^{(i)}, \ \dots, \ \sum\_{i=1}^N J\_{0|i,N}\, a\_1^{(i)}\right)^\top \tag{A68}$$

where *J*0|*i*,*j* indicates the element in the *i*-th row and *j*-th column of *J*0, and we used the property of the Kronecker product and the Vec operator in the second equality [44].

The proof is almost identical to that of Appendix G.1. The equality condition can be written as

$$\sum\_{n=1}^{N} \left\| \lambda\_1 c \alpha \sum\_{i} J\_{0|i,n} a\_1^{(i)} + \beta\_1 c' b\_1^{(n)} \right\|^2 + \sum\_{n=1}^{N} \left\| \lambda\_1 c' \alpha \sum\_{i} J\_{0|i,n} b\_1^{(i)} - \beta\_1 c a\_1^{(n)} \right\|^2 = 0,\tag{A69}$$

where ‖·‖ is the *d*-dimensional Euclidean norm, since *a*1<sup>(*n*)</sup>, *b*1<sup>(*n*)</sup> ∈ R*<sup>d</sup>*. Then we define the polynomial in the variables {*Ji*,*j*}

$$f(J\_{1,2},\ldots,J\_{N-1,N}) = \sum\_{n=1}^{N} \left\| \lambda\_1 c \alpha \sum\_{i} J\_{0|i,n} a\_1^{(i)} + \beta\_1 c' b\_1^{(n)} \right\|^2 + \sum\_{n=1}^{N} \left\| \lambda\_1 c' \alpha \sum\_{i} J\_{0|i,n} b\_1^{(i)} - \beta\_1 c a\_1^{(n)} \right\|^2. \tag{A70}$$

By the same argument as in Appendix G.1, it is clear that *f* is not identically 0. Thus, given an evaluation point *x*, from the lemma of [43], the subset of {*Ji*,*j*} ∈ R*<sup>N</sup>*(*N*−1)/2 that satisfies *f*(*J*1,2, ... , *JN*−1,*N*) = 0 has Lebesgue measure zero. Thus, if we generate {*Ji*,*j*} from a probability measure which is absolutely continuous with respect to the Lebesgue measure (such as a Gaussian distribution), *f*(*J*1,2, ... , *JN*−1,*N*) = 0 holds with probability 0. This concludes the proof.

#### *Appendix G.3. Extending the Theorem to the Path*

Theorem 5 and Lemma 1 hold when we fix an evaluation point *x*. To ensure the acceleration, we need to extend Theorem 5 and Lemma 1 from a single evaluation point to the path of the stochastic process for S-LD, S-PLD, S-ULD, and S-PULD.

First, the condition Ker *J*0 = {0} is not related to the evaluation point. Thus, we only need to consider the equality condition Re *λ*1<sup>*α*</sup> = *λ*1. As for this condition, as we saw in Theorem 5 and Lemma 1, if we generate a random matrix *J* whose law is absolutely continuous with respect to the Lebesgue measure, then the equality condition is not satisfied with probability 1 at the given evaluation point. The important point in those proofs is to show, using the lemma of [43], that the event where equality holds has Lebesgue measure 0 at the given evaluation point.

Let us consider the case when two evaluation points are given (e.g., *x*1, *x*2), and check whether the random matrix *J* satisfies the above equality condition or not. We can easily prove that at each evaluation point, such an event (we denote them *S*1 and *S*2) has Lebesgue measure 0 using the lemma of [43] (we write this as *P*(*S*1) = 0 and *P*(*S*2) = 0, where *P* is the law induced by generating the random matrix with *d*(*d* − 1)/2 independent elements). So, the union of *S*1 and *S*2 also has measure 0 (*P*(*S*1 ∪ *S*2) = 0). By repeating this procedure, given a finite number of evaluation points (*x*1, ... , *xk*), the probability of the union of such events is 0 (that is, *P*(*S*1 ∪ *S*2 ∪ ··· ∪ *Sk*) = 0).

When we consider the discretized dynamics of S-LD, S-PLD, and so on, and update samples up to *k* iterations, there exist *k* evaluation points. So, by applying the above discussion, we can ensure that along the path of the discretized dynamics, the equality condition does not hold with probability 1. On the other hand, for the continuous dynamics, the number of evaluation points is infinite, so we cannot conclude that the probability that the equality does not hold is 1.

#### **Appendix H. Proof of Theorem 6**

We use the same notation as in Appendix F. We consider the expansion with respect to *α* in the following setting,

$$w\_j := v\_j + \delta v\_j \tag{A71}$$

$$
\mu\_j := \lambda\_j + \delta \lambda\_j, \tag{A72}
$$

which indicates that, by introducing the skew-acceleration terms, the pairs of eigenvalues and eigenvectors of *H'* are expressed as small perturbations of the eigenvalues and eigenvectors of *H*. Since the eigenvectors {*vj*} of *H* can be used as an orthogonal basis, we expand *δvj* in this basis. We obtain

$$
\delta v\_j = \sum\_{k \neq j}^d c\_{jk} v\_k, \tag{A73}
$$

where *cjk* = *δvj*⊤*vk*.

#### *Appendix H.1. Asymptotic Expansion When the Smallest Eigenvalue of H*(*x*) *Is Positive*

We work on the matrix similar to *H'*, that is, *H* + *αV* where *V* := *H*1/2 *JH*1/2. See Appendix G.1 for the details. Please note that this similar matrix only exists when the smallest eigenvalue of *H*(*x*) is positive. Thus, the following discussion does not apply at a saddle point, where negative eigenvalues appear. We discuss the saddle-point expansion later.

From the definition, we have

$$H'w\_j = Hw\_j + \alpha Vw\_j = \mu\_j w\_j = (\lambda\_j + \delta\lambda\_j)(v\_j + \delta v\_j),\tag{A74}$$

We rearrange this equation as

$$Hv\_j + H\delta v\_j + \alpha Vv\_j + \alpha V\delta v\_j = \lambda\_j v\_j + \delta\lambda\_j v\_j + \lambda\_j \delta v\_j + \delta\lambda\_j \delta v\_j.\tag{A75}$$

First, we focus on the first-order expansion, which means we neglect higher-order terms. Then, we have

$$Hv\_j + H\delta v\_j + \alpha Vv\_j = \lambda\_j v\_j + \delta \lambda\_j v\_j + \lambda\_j \delta v\_j. \tag{A76}$$

By multiplying Equation (A76) by *vj*⊤ from the left-hand side, we have

$$
\lambda\_j + \lambda\_j v\_j^\top \delta v\_j + \alpha v\_j^\top V v\_j = \lambda\_j + \delta \lambda\_j + \lambda\_j v\_j^\top \delta v\_j, \tag{A77}
$$

Since *vj*⊤*Vvj* = 0 due to the skew-symmetry of *V*, we have

$$
\delta\lambda\_j = 0,\tag{A78}
$$

up to the first-order expansion. Then we substitute this into Equation (A76) and, multiplying by *vi*⊤ where *i* ≠ *j*, we have

$$
\lambda\_i c\_{ji} + \alpha v\_i^\top V v\_j = \lambda\_j c\_{ji}.\tag{A79}
$$

Then we have

$$c\_{ji} = \frac{\alpha v\_i^\top V v\_j}{\lambda\_j - \lambda\_i}.\tag{A80}$$

Then we obtain

$$
\delta v\_j = \alpha \sum\_{i \neq j}^{d} \frac{v\_i^{\top} V v\_j}{\lambda\_j - \lambda\_i} v\_i.\tag{A81}
$$

We substitute this into Equation (A75) and, multiplying by *vj*⊤, we have

$$\begin{split} v\_{j}^{\top} H \alpha \sum\_{i \neq j}^{d} \frac{v\_{i}^{\top} V v\_{j}}{\lambda\_{j} - \lambda\_{i}} v\_{i} + \alpha v\_{j}^{\top} V v\_{j} + \alpha v\_{j}^{\top} V \alpha \sum\_{i \neq j}^{d} \frac{v\_{i}^{\top} V v\_{j}}{\lambda\_{j} - \lambda\_{i}} v\_{i} \\ = \delta \lambda\_{j} v\_{j}^{\top} v\_{j} + \lambda\_{j} v\_{j}^{\top} \alpha \sum\_{i \neq j}^{d} \frac{v\_{i}^{\top} V v\_{j}}{\lambda\_{j} - \lambda\_{i}} v\_{i} + \delta \lambda\_{j} v\_{j}^{\top} \alpha \sum\_{i \neq j}^{d} \frac{v\_{i}^{\top} V v\_{j}}{\lambda\_{j} - \lambda\_{i}} v\_{i}. \end{split} \tag{A82}$$

Since *vj*⊤*Vvj* = 0, *vj*⊤*vi* = 0 for *i* ≠ *j*, and *vj*⊤*vj* = 1, we have

$$
\alpha^2 \sum\_{i \neq j}^d \frac{v\_i^\top V v\_j}{\lambda\_j - \lambda\_i} v\_j^\top V v\_i = \delta \lambda\_j. \tag{A83}
$$

Thus, we have

$$
\mu\_j - \lambda\_j = \alpha\_j + i\beta\_j - \lambda\_j = -\alpha^2 \sum\_{i \neq j}^d \frac{(v\_i^\top V v\_j)^2}{\lambda\_j - \lambda\_i}. \tag{A84}
$$

Thus, by taking the real part and noting that Re *λj*(*α*) = *αj*, we have

$$\mathrm{Re}\lambda\_{j}(\alpha) - \lambda\_{j} = \alpha^2 \mathrm{Re} \sum\_{i \neq j}^d \frac{(v\_i^\top V v\_j)^2}{\lambda\_i - \lambda\_{j}} + \mathcal{O}(\alpha^3) = \alpha^2 \sum\_{i \neq j}^d \frac{\lambda\_i \lambda\_{j} (v\_i^\top J v\_j)^2}{\lambda\_i - \lambda\_{j}} + \mathcal{O}(\alpha^3). \tag{A85}$$

This concludes the proof.
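The expansion (A85) can be checked numerically. The sketch below (with hypothetical dimensions and spectra, not part of the proof) builds *H* with known eigenpairs, draws a random skew-symmetric *J*, and compares the real parts of the eigenvalues of (*I* + *αJ*)*H*, which is similar to *H* + *αV*, with the second-order prediction.

```python
import numpy as np

# Numerical sketch of Equation (A85): H' = (I + alpha*J)H is similar to
# H + alpha*H^{1/2} J H^{1/2}, so Re eig(H') should match
# lam_j + alpha^2 * sum_{i!=j} lam_i lam_j (v_i^T J v_j)^2 / (lam_i - lam_j)
# up to O(alpha^3).
rng = np.random.default_rng(2)
d = 5
lam = np.array([0.5, 1.0, 1.7, 2.6, 4.0])       # distinct positive eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(lam) @ Q.T                      # eigenvectors are columns of Q
A = rng.standard_normal((d, d))
J = A - A.T                                     # skew-symmetric
J /= np.linalg.norm(J, 2)                       # normalize spectral norm
alpha = 1e-3
mu = np.linalg.eigvals((np.eye(d) + alpha * J) @ H)
pred = lam.copy()
for j in range(d):
    for i in range(d):
        if i != j:
            pred[j] += alpha**2 * lam[i] * lam[j] * (Q[:, i] @ J @ Q[:, j])**2 / (lam[i] - lam[j])
err0 = np.max(np.abs(np.sort(mu.real) - lam))            # zeroth-order error
err2 = np.max(np.abs(np.sort(mu.real) - np.sort(pred)))  # with the alpha^2 term
# err2 should be markedly smaller than err0.
```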

#### *Appendix H.2. Expansion of the Eigenvalue at the Saddle Point*

Here we derive the formula for the expansion of the eigenvalue at a saddle point. Since the smallest eigenvalue is negative, we cannot use the similar matrix as above. Instead, we use the relation

$$
\mu\_j H w\_j = H \mu\_j w\_j = H(I + \alpha J) H w\_j \tag{A86}
$$

where we used the definition of the eigenvalues and eigenvectors. Here, we write *H'* := (*I* + *αJ*)*H* and denote its pairs of eigenvalues and eigenvectors as {(*μi*, *wi*)}. As introduced above, we substitute the expansion into Equation (A86), then we obtain

$$(\lambda\_j + \delta \lambda\_j)H(v\_j + \delta v\_j) = H(I + \alpha J)H(v\_j + \delta v\_j) \tag{A87}$$

Then, in the same way as above, since the eigenvectors {*vj*} of *H* can be used as an orthogonal basis, we expand *δvj* in this basis. This means

$$
\delta v\_j = \sum\_{k=1}^{d} c\_{jk} v\_k, \tag{A88}
$$

where *cjk* = *δvj*⊤*vk*. By multiplying Equation (A87) by *vi*⊤ with *i* ≠ *j* from the left-hand side and neglecting higher-order terms, we have

$$c\_{ji} = \frac{\lambda\_j}{\lambda\_j - \lambda\_i} (v\_i^\top \alpha J v\_j). \tag{A89}$$

Then, by multiplying Equation (A87) by *vj*⊤ from the left-hand side, we have

$$
v\_j^\top H(\alpha J) H \delta v\_j = \delta \lambda\_j (\lambda\_j + \lambda\_j v\_j^\top \delta v\_j) \tag{A90}
$$

Then, by substituting *δvj* with the coefficients of Equation (A89), we have

$$\delta\lambda\_j = \alpha^2 \sum\_{i \neq j}^d \frac{\lambda\_i \lambda\_j (v\_i^\top J v\_j)^2}{\lambda\_i - \lambda\_j} + \mathcal{O}(\alpha^3) \tag{A91}$$

This concludes the proof.

### **Appendix I. Convergence Rate of Parallel Sampling Schemes**

#### *Appendix I.1. Proof of Lemma 2*

First, we introduce the notation. We express the random variables of S-PLD as *Yt*<sup>⊗*N*</sup>, and the measure induced by S-PLD as *μt*<sup>⊗*N*</sup>(*α*), which uses *αJ* as the interaction term. Correspondingly, we express the measure of PLD as *μt*<sup>⊗*N*</sup>(0), which can be decomposed into its marginals. We also denote the marginal measure of S-PLD for *Yt*<sup>(*n*)</sup> as *νt*<sup>(*n*)</sup>(*α*). Please note that the initial distribution is *μ*0<sup>⊗*N*</sup> and its marginals are *μ*0, as defined in Assumption 4.

Please note that the marginal measures of PLD are the same as those of LD if the initial measures are all the same; thus, each marginal satisfies the Poincaré inequality with constant *m*0. This is also a consequence of the tensorization property of the spectral gap (Proposition 4.3.1 in Bakry et al. [19]).
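The tensorization property can be illustrated in the Gaussian case, where the Poincaré constant is available in closed form; the sketch below assumes a Gaussian target *N*(0, Σ), whose Poincaré constant is 1/*λ*max(Σ), and is illustrative only.

```python
import numpy as np

# The product of N identical Gaussian marginals N(0, Sigma) has covariance
# I_N ⊗ Sigma, so its Poincare constant equals that of a single marginal,
# in line with the tensorization of the spectral gap (Bakry et al. [19]).
rng = np.random.default_rng(3)
B = rng.standard_normal((3, 3))
Sigma = B @ B.T + 0.1 * np.eye(3)            # SPD covariance of one chain
m0 = 1.0 / np.linalg.eigvalsh(Sigma)[-1]     # Poincare constant of one chain
N = 4
Sigma_N = np.kron(np.eye(N), Sigma)          # covariance of the product measure
m0_N = 1.0 / np.linalg.eigvalsh(Sigma_N)[-1]
# m0_N equals m0: the spectral gap does not degrade under tensorization.
```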

As for the initial condition, from the fact that the *χ*2 divergence is a special case of the Rényi divergence (order *α* = 2), and from the tensorization property of the Rényi divergence (see Theorem 28 in [45]), we have

$$\chi^2(\mu\_t^{\otimes N}(0), \pi^{\otimes N}) \le e^{-2\beta^{-1}m\_0t} \chi^2(\mu\_0^{\otimes N}, \pi^{\otimes N}) = \sum\_{n=1}^N e^{-2\beta^{-1}m\_0t} \chi^2(\mu\_0, \pi). \tag{A92}$$

Then we have

$$
\chi^2(\mu\_t^{\otimes N}(0), \pi^{\otimes N}) \le e^{-2\beta^{-1}m\_0t} \chi^2(\mu\_0^{\otimes N}, \pi^{\otimes N}) = N e^{-2\beta^{-1}m\_0t} \chi^2(\mu\_0, \pi). \tag{A93}
$$

If the skew acceleration is applied, by the same argument as for S-LD (see Appendix C.1), S-PLD has a Poincaré constant that is at least *m*0. We express it as *m*(*α*, *N*)(≥ *m*0). Then we have

$$
\chi^2(\mu\_t^{\otimes N}(\alpha), \pi^{\otimes N}) \le N e^{-2\beta^{-1}m(\alpha,N)t} \chi^2(\mu\_0, \pi). \tag{A94}
$$

At first sight, the constant *N* in the convergence bound makes it seem useless. However, as we discuss below, when we bound the bias or variance, these bounds are meaningful. For example, let us consider approximating the true expectation ∫ *f*(*x*)*dπ*(*x*) by the ensemble average <sup>1</sup>⁄*N* ∑*<sup>N</sup>n*=1 *f*(*Xt*<sup>(*n*)</sup>). Then we are interested in bounding the error

$$\left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_k^{(n)}) - \int\_{\mathbb{R}^d} f d\pi \right|. \tag{A95}$$

For this purpose, we can bound this error by the 2-Wasserstein distance as

$$\left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_k^{(n)}) - \int\_{\mathbb{R}^d} f d\pi \right| \le \frac{L\_f}{\sqrt{N}} \mathcal{W}\_2(\mu\_{kh}^{\otimes N}(\alpha), \pi^{\otimes N}) \tag{A96}$$

where we assumed that *f* is *Lf*-Lipschitz and used the fact that *x*<sup>⊗*N*</sup> ↦ <sup>1</sup>⁄*N* ∑*<sup>N</sup>n*=1 *f*(*x*<sup>(*n*)</sup>) is (*Lf*/√*N*)-Lipschitz.
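The (*Lf*/√*N*)-Lipschitz property of the ensemble average can be verified numerically; the sketch below is illustrative, with *f* the Euclidean norm (so *Lf* = 1).

```python
import numpy as np

# F(x^{(1)},...,x^{(N)}) = (1/N) sum_n f(x^{(n)}) is (L_f/sqrt(N))-Lipschitz
# on R^{dN}: |F(x)-F(y)| <= (1/N) sum_n ||x_n - y_n|| <= ||x - y|| / sqrt(N)
# by the triangle and Cauchy-Schwarz inequalities.
rng = np.random.default_rng(4)
d, N = 3, 10
f = np.linalg.norm                               # 1-Lipschitz on R^d
for _ in range(100):
    x = rng.standard_normal((N, d))
    y = rng.standard_normal((N, d))
    F_diff = abs(np.mean([f(xi) for xi in x]) - np.mean([f(yi) for yi in y]))
    bound = np.linalg.norm(x - y) / np.sqrt(N)   # (L_f/sqrt(N)) * ||x - y||
    assert F_diff <= bound + 1e-12
```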

To bound the distance, we use the basic relation

$$\mathcal{W}\_2^2(\mu\_{kh}^{\otimes N}(\alpha), \pi^{\otimes N}) \le \frac{2}{m(\alpha, N)} \chi^2(\mu\_{kh}^{\otimes N}(\alpha), \pi^{\otimes N}),\tag{A97}$$

where *m*(*α*, *N*) is the Poincaré constant. This follows from the definitions of the Wasserstein distance and the *χ*2-divergence; see [46] for the details. Then, combined with the above relations, we obtain the bias bound of S-PLD as

$$\left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_k^{(n)}) - \int\_{\mathbb{R}^d} f d\pi \right| \leq L\_f \sqrt{\frac{2}{m(\alpha,N)}} e^{-\beta^{-1} m(\alpha,N) k h} \chi^2 (\mu\_0, \pi)^{1/2}. \tag{A98}$$

In the same way, we obtain the bias bound of PLD as

$$\left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_k^{(n)}) - \int\_{\mathbb{R}^d} f d\pi \right| \leq L\_f \sqrt{\frac{2}{m\_0}} e^{-\beta^{-1} m\_0 k h} \chi^2(\mu\_0, \pi)^{1/2}. \tag{A99}$$

Thus, while the explicit dependency on *N* disappears, S-PLD shows faster convergence through the relation *m*(*α*, *N*) ≥ *m*0. Moreover, if we use skew matrices that do not satisfy the equality condition, we have *m*(*α*, *N*) > *m*0.

#### *Appendix I.2. Proof for S-ULD*

We can characterize the convergence rate almost in the same way as in Appendix C.2. The derivation is the same as above, so we only state the result:

$$\left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_k^{(n)}) - \int\_{\mathbb{R}^d} f d\pi \right| \leq L\_f \sqrt{\frac{2}{m(\alpha, N)}} \sqrt{\frac{1 + \bar{\epsilon}}{1 - \bar{\epsilon}}}\, e^{-\lambda\_\gamma kh/2}\, \chi^2(\mu\_0, \pi)^{1/2}, \tag{A100}$$

where the constants ¯*ϵ* and *λγ* are given as follows:

$$
\lambda\_{\gamma} = \frac{\Lambda(\gamma, \bar{\epsilon} \min(\gamma, \gamma^{-1}))}{1 + \bar{\epsilon} \min(\gamma, \gamma^{-1})},
\tag{A101}
$$

and

$$\Lambda(\gamma, \epsilon) = \frac{\gamma \Sigma^{-1} - \frac{\epsilon}{1 + \frac{m(\alpha,N)\Sigma^{-1}}{\beta}}}{2} - \frac{1}{2} \sqrt{(S\_{--} - S\_{++})^2 + (S\_{-+})^2},\tag{A102}$$

$$S\_{--} = \epsilon \lambda\_{ham}, \tag{A103}$$

$$S\_{-+} = -\epsilon (R\_{\text{ham}} + \gamma \Sigma^{-1} / 2),\tag{A104}$$

$$S\_{++} = \gamma \Sigma^{-1} - \varepsilon,\tag{A105}$$

$$
\lambda\_{\text{ham}} = 1 - \left( 1 + \frac{m(\mathfrak{a}, N)\Sigma^{-1}}{\beta} \right)^{-1}, \tag{A106}
$$

$$
\epsilon = \bar{\epsilon} \min(\gamma, \gamma^{-1}),
\tag{A107}
$$

where ¯*ϵ* is an arbitrary, sufficiently small positive value such that Λ(*γ*, ¯*ϵ* min(*γ*, *γ*−1)) > 0 is satisfied, and

$$R\_{ham} \le \sqrt{\max\{M, 2\}}.\tag{A108}$$

#### **Appendix J. Proof of Theorem 7**

We restate our theorem with explicit constants.

**Theorem A3.** *Under Assumptions 1–7, for any k* ∈ N *and any h* ∈ (0, 1 ∧ *m*/(4*M*2)) *satisfying kh* ≥ 1 *and βm* ≥ 2*, we have*

$$\begin{split} & \left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_{k}^{(n)}) - \int\_{\mathbb{R}^{d}} f d\pi \right| \\ & \leq L\_{f} \sqrt{\mathcal{C}\_{0}^{2} \sqrt{\delta} + \mathcal{C}\_{1}^{2} \sqrt{h}}\, kh + L\_{f} \sqrt{\frac{2}{m(\alpha, N)}} \chi^{2}(\mu\_{0}, \pi)^{1/2} e^{-\beta^{-1} m(\alpha, N) kh}. \end{split} \tag{A109}$$

*where*

$$\mathcal{C}\_{0}^{2} = \left(12 + 8\left(\kappa\_0 + 2b + \frac{2d}{\beta}\right)\right)\left(\beta C\_0 + \sqrt{\beta C\_0}\right),\tag{A110}$$

$$
\mathcal{C}\_1^2 = \left(12 + 8\left(\kappa\_0 + 2b + \frac{2d}{\beta}\right)\right)\left(C\_1 + \sqrt{C\_1}\right) \tag{A111}
$$

$$C\_0 = (1+\alpha)^2 \left( M^2 \left( \kappa\_0 + 2 \left( 1 \vee \frac{1}{m} \right) \left( b + 2(1+\alpha)^2 B^2 + \frac{d}{\beta} \right) \right) + B^2 \right), \tag{A112}$$

$$C\_{1} = 6(1+\alpha^{2})M^{2}(\beta C\_{0} + d),\tag{A113}$$

The obtained bound is O(*kh* · *h*1/4), which is independent of *N*. Thus, this result is much better than those in [18]. Additionally, note that we can derive a similar bias bound for skew-SGHMC in the same way as for skew-SGLD.

**Proof.** For notational simplicity, we express the random variables of skew-SGLD, which uses *αJ* as an interaction term, as *Xk*<sup>⊗*N*</sup>, and those of S-PLD as *Yk*<sup>⊗*N*</sup>. In this section, for simplicity, we write them as *Xk* and *Yk*. We denote the measures of *Xk* and *Yk* as *νkh*<sup>⊗*N*</sup> and *μkh*<sup>⊗*N*</sup>, respectively. We also denote the marginal measures of *Xk*<sup>(*n*)</sup> and *Yk*<sup>(*n*)</sup> as *νkh*<sup>(*n*)</sup> and *μkh*<sup>(*n*)</sup>.

Then, we first decompose the bias as

$$\begin{split} & \left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_{k}^{(n)}) - \int\_{\mathbb{R}^{d}} f d\pi \right| \\ &= \left| \mathbb{E} \frac{\sum\_{n=1}^{N} f(X\_{k}^{(n)})}{N} - \mathbb{E} \frac{\sum\_{n=1}^{N} f(Y\_{k}^{(n)})}{N} + \mathbb{E} \frac{\sum\_{n=1}^{N} f(Y\_{k}^{(n)})}{N} - \int\_{\mathbb{R}^{d}} f d\pi \right| \\ & \leq \left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_{k}^{(n)}) - \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(Y\_{k}^{(n)}) \right| + \left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(Y\_{k}^{(n)}) - \int\_{\mathbb{R}^{d}} f d\pi \right| \\ & \leq \frac{L\_{f}}{N} \sum\_{n=1}^{N} \mathcal{W}\_{2}(\nu\_{kh}^{(n)}(\alpha), \mu\_{kh}^{(n)}(\alpha)) + \frac{L\_{f}}{\sqrt{N}} \underbrace{\mathcal{W}\_{2}(\mu\_{kh}^{\otimes N}(\alpha), \pi^{\otimes N})}\_{(i)} \end{split} \tag{A114}$$

where we used the Jensen inequality for the first term in the last inequality, moving <sup>1</sup>⁄*N* ∑*<sup>N</sup>n*=1 outside the |·|. In addition, each expectation in the first term only depends on the marginal measures, and we used the property of the 2-Wasserstein (2-W) distance. Furthermore, we decompose the first term as

$$\frac{L\_f}{N} \sum\_{n=1}^{N} \mathcal{W}\_2(\nu\_{kh}^{(n)}(\alpha), \mu\_{kh}^{(n)}(\alpha)) \le \frac{L\_f}{N} \sum\_{n=1}^{N} \left( \underbrace{\mathcal{W}\_2(\nu\_{kh}^{(n)}(\alpha), \mu\_{kh}^{(n)}(0))}\_{(ii)} + \underbrace{\mathcal{W}\_2(\mu\_{kh}^{(n)}(\alpha), \mu\_{kh}^{(n)}(0))}\_{(iii)} \right), \tag{A115}$$

where *μkh*<sup>(*n*)</sup>(0) denotes the measure induced by PLD, the naive parallel sampling without a skew-symmetric interaction.

In conclusion, our task is to bound the terms (*i*), (*ii*), and (*iii*) above. Bounding (*i*) was already discussed in Appendix I.1.

Next, we work on (*ii*) and (*iii*). Following [10], we use the weighted CKP inequality to bound the 2-W distance. From Bolley and Villani [47], using the weighted CKP inequality, we can bound each 2-W distance by the relative entropy (KL divergence). The weighted CKP inequality states that

$$\mathcal{W}\_2(\nu\_{kh}^{(n)}(a), \mu\_{kh}^{(n)}(0)) \le \mathbb{C}\_{\mu\_{kh}^{(n)}(0)} \left( \mathrm{KL}(\nu\_{kh}^{(n)}(a) | \mu\_{kh}^{(n)}(0))^{1/2} + \left( \frac{\mathrm{KL}(\nu\_{kh}^{(n)}(a) | \mu\_{kh}^{(n)}(0))}{2} \right)^{1/4} \right), \tag{A116}$$

with

$$\mathcal{C}\_{\mu\_{\text{kh}}^{(n)}(0)} = 2 \inf\_{\lambda > 0} \left( \frac{1}{\lambda} \left( \frac{3}{2} + \log \int\_{\mathbb{R}^d} e^{\lambda \|\mathbf{x}^{(n)}\|^2} d\mu\_{\text{kh}}^{(n)}(0) \right) \right)^{1/2}. \tag{A117}$$

and

$$\mathcal{W}\_2(\mu\_{kh}^{(n)}(a), \mu\_{kh}^{(n)}(0)) \le \mathbb{C}\_{\mu\_{kh}^{(n)}(0)} \left( \mathrm{KL}(\mu\_{kh}^{(n)}(a) | \mu\_{kh}^{(n)}(0))^{1/2} + \left( \frac{\mathrm{KL}(\mu\_{kh}^{(n)}(a) | \mu\_{kh}^{(n)}(0))}{2} \right)^{1/4} \right), \text{ (A118)}$$

with

$$\mathcal{C}\_{\mu\_{\text{kh}}^{(n)}(0)} = 2 \inf\_{\lambda > 0} \left( \frac{1}{\lambda} \left( \frac{3}{2} + \log \int\_{\mathbb{R}^d} e^{\lambda \|x^{(n)}\|^2} d\mu\_{\text{kh}}^{(n)}(0) \right) \right)^{1/2}. \tag{A119}$$

We point out that it is important to use the constant associated with *μkh*<sup>(*n*)</sup>(0), and not with *νkh*<sup>(*n*)</sup>(*α*) or *μkh*<sup>(*n*)</sup>(*α*), in the weighted CKP inequality. Since *μkh*<sup>(*n*)</sup>(0) is induced by parallel-chain Monte Carlo without the skew-symmetric term, the parallel chain decomposes into independent chains. Thus, the constant for *μkh*<sup>(*n*)</sup>(0) does not depend on *n*, does not depend on *N*, and shows O(*d*) dependency. However, the constants for *νkh*<sup>(*n*)</sup>(*α*) and *μkh*<sup>(*n*)</sup>(*α*) show O(*dN*), that is, linear dependency on *N*, since there is an interaction term between the parallel chains and we cannot decompose them easily. This would result in an unsatisfactory dependency on *N*, which is the reason we introduced *μkh*<sup>(*n*)</sup>(0) in our theoretical analysis.

Please note that since *μkh*<sup>(*n*)</sup>(0) is induced by the naive parallel chain, the marginals are independent of each other and share the same measure if the initial measures are the same. Thus, *μkh*<sup>(1)</sup>(0) = ··· = *μkh*<sup>(*N*)</sup>(0). From now on, we write the marginal as *μkh*(0) for simplicity; accordingly, the constants coincide and we denote them all by *Cμkh*(0).

Then, substituting the above weighted CKP inequalities and using the Jensen inequality, we obtain

$$\begin{split} & \left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_k^{(n)}) - \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(Y\_k^{(n)}) \right| \\ & \leq L\_f C\_{\mu\_{kh}(0)} \frac{1}{N} \sum\_{n=1}^{N} \left( \mathrm{KL}(\nu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0))^{1/2} + \left( \frac{\mathrm{KL}(\nu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0))}{2} \right)^{1/4} + \mathrm{KL}(\mu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0))^{1/2} + \left( \frac{\mathrm{KL}(\mu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0))}{2} \right)^{1/4} \right) \\ & \leq L\_f C\_{\mu\_{kh}(0)} \left( \left( \sum\_{n=1}^{N} \frac{\mathrm{KL}(\nu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0))}{N} \right)^{\frac{1}{2}} + \left( \sum\_{n=1}^{N} \frac{\mathrm{KL}(\nu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0))}{2N} \right)^{\frac{1}{4}} \right. \\ & \quad \left. + \left( \sum\_{n=1}^{N} \frac{\mathrm{KL}(\mu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0))}{N} \right)^{\frac{1}{2}} + \left( \sum\_{n=1}^{N} \frac{\mathrm{KL}(\mu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0))}{2N} \right)^{\frac{1}{4}} \right). \end{split} \tag{A120}$$

To analyze the discretization error, we use the following key lemma:

**Lemma A1.** *Assume that there exist random variables* {*Xi* ∈ Ω*i*}*<sup>N</sup>i*=1 *and* {*Yi* ∈ Ω*i*}*<sup>N</sup>i*=1*. We denote the product space as* Ω<sup>⊗*N*</sup> := Ω1 × ··· × Ω*N. Let us introduce X* = (*X*1, ... , *XN*) ∈ Ω<sup>⊗*N*</sup> *and Y* = (*Y*1, ... , *YN*) ∈ Ω<sup>⊗*N*</sup>*. Let us express their joint probability measures as P*(*X*) := *P*(*X*1, ... , *XN*) *and Q*(*Y*) := *Q*(*Y*1, ... , *YN*)*, and denote the marginal measures of the Xi's and Yi's as* {*Pi*(*Xi*)}*<sup>N</sup>i*=1 *and* {*Qi*(*Yi*)}*<sup>N</sup>i*=1*. If Pi* ≪ *Qi holds, we have*

$$\sum\_{i=1}^{N} \text{KL}(P\_i(X\_i) \| Q\_i(Y\_i)) \le \text{KL}(P(X) \| Q(Y)),\tag{A121}$$
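Lemma A1 can be illustrated with zero-mean Gaussians, for which the KL divergence to the standard normal has the closed form KL(*N*(0, *S*)‖*N*(0, *I*)) = (tr *S* − *k* − log det *S*)/2; the sketch below is illustrative only, not part of the proof.

```python
import numpy as np

# KL superadditivity over marginals (Lemma A1), Gaussian closed form:
# take a correlated joint P = N(0, S) and Q the standard-normal product
# measure; the sum of the marginal KLs never exceeds the joint KL.
def kl_gauss(S):
    k = S.shape[0]
    return 0.5 * (np.trace(S) - k - np.log(np.linalg.det(S)))

S = np.array([[1.5, 0.6],
              [0.6, 0.8]])                    # joint covariance of (X1, X2)
kl_joint = kl_gauss(S)                        # KL(P || Q)
kl_marg = kl_gauss(S[:1, :1]) + kl_gauss(S[1:, 1:])   # sum of marginal KLs
# kl_marg <= kl_joint; the gap is 0.5*log(S11*S22/det(S)) >= 0.
```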

A proof is given in Appendix J.1. We apply this lemma as

$$\sum\_{n=1}^{N} \text{KL}(\nu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0)) \le \text{KL}(\nu\_{kh}^{\otimes N}(\alpha) | \mu\_{kh}^{\otimes N}(0)),\tag{A122}$$

$$\sum\_{n=1}^{N} \text{KL}(\mu\_{kh}^{(n)}(\alpha) | \mu\_{kh}(0)) \le \text{KL}(\mu\_{kh}^{\otimes N}(\alpha) | \mu\_{kh}^{\otimes N}(0)). \tag{A123}$$

Combining these results with the above bias bound, we obtain

$$\begin{split} & \left| \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(X\_{k}^{(n)}) - \mathbb{E} \frac{1}{N} \sum\_{n=1}^{N} f(Y\_{k}^{(n)}) \right| \\ & \leq L\_{f} C\_{\mu\_{kh}(0)} \left( \left( \frac{\mathrm{KL}(\nu\_{kh}^{\otimes N}(\alpha) | \mu\_{kh}^{\otimes N}(0))}{N} \right)^{\frac{1}{2}} + \left( \frac{\mathrm{KL}(\nu\_{kh}^{\otimes N}(\alpha) | \mu\_{kh}^{\otimes N}(0))}{2N} \right)^{\frac{1}{4}} \right. \\ & \quad \left. + \left( \frac{\mathrm{KL}(\mu\_{kh}^{\otimes N}(\alpha) | \mu\_{kh}^{\otimes N}(0))}{N} \right)^{\frac{1}{2}} + \left( \frac{\mathrm{KL}(\mu\_{kh}^{\otimes N}(\alpha) | \mu\_{kh}^{\otimes N}(0))}{2N} \right)^{\frac{1}{4}} \right) \end{split} \tag{A124}$$

Thus, we need to bound KL(*μkh*<sup>⊗*N*</sup>(*α*)|*μkh*<sup>⊗*N*</sup>(0)), KL(*νkh*<sup>⊗*N*</sup>(*α*)|*μkh*<sup>⊗*N*</sup>(0)), and *Cμkh*(0). We can upper-bound them using the results of [2]. For that purpose, we need to replace the constants in [2] as shown below. Here, we discuss how the constants in the assumptions change in the ensemble scheme. We define

$$\nabla u^{\otimes N}(\mathbf{x}^{\otimes N}) := (\nabla u(\mathbf{x}^{(1)}), \dots, \nabla u(\mathbf{x}^{(N)})) \tag{A125}$$

First, we focus on the smoothness condition. From Assumption 2 and Lemma 8 in [18], we have

$$\left\| (I + \alpha J) \nabla u^{\otimes N} (x^{\otimes N}, z) - (I + \alpha J) \nabla u^{\otimes N} (y^{\otimes N}, z) \right\| \le M(1 + \alpha) \| x^{\otimes N} - y^{\otimes N} \|. \tag{A126}$$

where the norm on the right-hand side is the Euclidean norm in R*<sup>dN</sup>*.

Next, we discuss the dissipativity condition. Define ∇*Uα*(*x*<sup>⊗*N*</sup>) := ∇*U*<sup>⊗*N*</sup>(*x*<sup>⊗*N*</sup>) + *αJ*∇*U*<sup>⊗*N*</sup>(*x*<sup>⊗*N*</sup>). Then, for *x*<sup>⊗*N*</sup> ∈ R*<sup>dN</sup>*, under Assumptions 1 to 6, we have

$$\langle \mathbf{x}^{\otimes N}, \nabla U_\alpha(\mathbf{x}^{\otimes N}) \rangle \ge m \|\mathbf{x}^{\otimes N}\|^2 - bN. \tag{A127}$$

Next, we check the condition on the drift function at the origin: $\|\nabla \hat{U}(0, z)\| \le B$. We can proceed in the same way as for the smoothness condition. Then we have

$$\|(I + \alpha J) \nabla U^{\otimes N}(0^{\otimes N})\| \le B \sqrt{N} (1 + \alpha). \tag{A128}$$

Next, we study the condition on the stochastic gradient: $\mathbb{E}[\|\nabla \hat{U}(x) - \nabla U(x)\|^2] \le 2\delta \left(M^2 \|x\|^2 + B^2\right)$. This can be easily modified to

$$\begin{split} & \mathbb{E}[\|(I + \alpha J)\nabla \hat{U}^{\otimes N}(\mathbf{x}^{\otimes N}) - (I + \alpha J)\nabla U^{\otimes N}(\mathbf{x}^{\otimes N})\|^2] \\ & \le (1+\alpha)^2 \mathbb{E}[\|\nabla \hat{U}^{\otimes N}(\mathbf{x}^{\otimes N}) - \nabla U^{\otimes N}(\mathbf{x}^{\otimes N})\|^2] \\ & \le (1+\alpha)^2 \sum_{i=1}^{N} \mathbb{E}[\|\nabla \hat{U}(\mathbf{x}^{(i)}) - \nabla U(\mathbf{x}^{(i)})\|^2] \\ & \le (1+\alpha)^2 \sum_{i=1}^{N} 2\delta\left(M^2 \|\mathbf{x}^{(i)}\|^2 + B^2\right) \\ & \le 2\delta(1+\alpha)^2 \left(M^2 \|\mathbf{x}^{\otimes N}\|^2 + N B^2\right). \end{split} \tag{A129}$$

Finally, we discuss the initial condition: $\kappa_0 := \log \int_{\mathbb{R}^d} e^{\|x\|^2} \mu_0(x) dx < \infty$. We assume that the initial probability distribution factorizes as $\mu_0^{\otimes N}(X_0^{\otimes N}) = \mu_0(X_0^{(1)}) \times \cdots \times \mu_0(X_0^{(N)})$, i.e., all the marginals are identical. Then

$$\kappa_0^{\otimes N} := \log \int_{\mathbb{R}^{dN}} e^{\|\mathbf{x}^{\otimes N}\|^2} \mu_0^{\otimes N}(\mathbf{x}^{\otimes N}) d\mathbf{x}^{\otimes N} = \log \prod_{n=1}^{N} \left( \int_{\mathbb{R}^d} e^{\|\mathbf{x}^{(n)}\|^2} \mu_0(\mathbf{x}^{(n)}) d\mathbf{x}^{(n)} \right) = N \kappa_0. \tag{A130}$$

In this way, the constants in the assumptions are modified and expressed in terms of $N$ and $\alpha$. Combined with the results of [2], we can derive the following relations:

$$\mathcal{C}_{\mu_{kh}(0)} \le 12 + 8\left(\kappa_0 + 2b + \frac{2d}{\beta}\right), \tag{A131}$$

$$\mathrm{KL}(\nu_{kh}^{\otimes N}(\alpha)|\mu_{kh}^{\otimes N}(0)) \le N (\mathcal{C}_0 \beta \delta + \mathcal{C}_1 \eta) k \eta, \tag{A132}$$

$$\mathrm{KL}(\mu_{kh}^{\otimes N}(\alpha)|\mu_{kh}^{\otimes N}(0)) \le N \frac{\beta}{2} \alpha^2 M^2 \left(\kappa_0 + \frac{b + d/\beta}{m}\right) k \eta, \tag{A133}$$

where

$$\mathcal{C}_0 = (1+\alpha)^2 \left( M^2 \left( \kappa_0 + 2\left(1 \vee \frac{1}{m}\right)\left(b + 2(1+\alpha)^2 B^2 + \frac{d}{\beta}\right) \right) + B^2 \right), \tag{A134}$$

$$\mathcal{C}_1 = 6(1+\alpha^2) M^2 (\beta \mathcal{C}_0 + d). \tag{A135}$$

This concludes the proof.

## *Appendix J.1. Proof of Lemma A1*

**Proof.** We prove this lemma using the Donsker–Varadhan representation of the relative entropy [48], which admits the dual form:

$$\mathrm{KL}(P(X)\|Q(Y)) = \sup_{T : \Omega \to \mathbb{R}} \mathbb{E}_{P(X)}[T] - \log \mathbb{E}_{Q(Y)}[e^T], \tag{A136}$$

where the supremum is taken over all functions $T$ for which the expectations of $e^T$ and $T$ are finite. We then restrict the function class to $\mathcal{F} = \{T(X) \mid \exists T_i : \Omega_i \to \mathbb{R} \text{ s.t. } T(X) = \sum_{i=1}^{N} T_i(X_i)\}$, where each expectation of $e^{T_i}$ and $T_i$ is finite. Then, by definition,

$$\mathrm{KL}(P(X)\|Q(Y)) = \sup_{T : \Omega \to \mathbb{R}} \mathbb{E}_{P(X)}[T] - \log \mathbb{E}_{Q(Y)}[e^T] \ge \sup_{T \in \mathcal{F}} \mathbb{E}_{P(X)}\left[\sum_i T_i\right] - \log \mathbb{E}_{Q(Y)}\left[e^{\sum_i T_i}\right]. \tag{A137}$$

Then we have

$$\begin{split} \mathrm{KL}(P(X)\|Q(Y)) &\ge \sup_{T \in \mathcal{F}} \sum_i \mathbb{E}_{P_i(X_i)}[T_i] - \log \prod_i \mathbb{E}_{Q_i(Y_i)}\left[e^{T_i}\right] \\ &= \sup_{T \in \mathcal{F}} \sum_i \left( \mathbb{E}_{P_i(X_i)}[T_i] - \log \mathbb{E}_{Q_i(Y_i)}\left[e^{T_i}\right] \right) \\ &= \sum_i \sup_{T_i : \Omega_i \to \mathbb{R}} \mathbb{E}_{P_i(X_i)}[T_i] - \log \mathbb{E}_{Q_i(Y_i)}\left[e^{T_i}\right] \\ &= \sum_{i=1}^{N} \mathrm{KL}(P_i(X_i)\|Q_i(Y_i)). \end{split} \tag{A138}$$
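The tensorization property of Lemma A1 can be sanity-checked numerically in a case where KL divergences are available in closed form; for product measures, the inequality in (A138) in fact holds with equality. A minimal sketch, assuming diagonal Gaussians (all function names are ours):

```python
import numpy as np

def kl_gauss_1d(m1, s1, m2, s2):
    # KL(N(m1, s1^2) || N(m2, s2^2)), closed form
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def kl_gauss_mvn(m1, C1, m2, C2):
    # KL between multivariate Gaussians N(m1, C1) and N(m2, C2)
    d = len(m1)
    C2inv = np.linalg.inv(C2)
    diff = np.asarray(m2) - np.asarray(m1)
    return 0.5 * (np.trace(C2inv @ C1) + diff @ C2inv @ diff - d
                  + np.log(np.linalg.det(C2) / np.linalg.det(C1)))

m1, s1 = np.array([0.0, 1.0, -2.0]), np.array([1.0, 0.5, 2.0])
m2, s2 = np.array([0.5, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])

# product (diagonal) measures: joint KL equals the sum of marginal KLs
joint = kl_gauss_mvn(m1, np.diag(s1**2), m2, np.diag(s2**2))
marginals = sum(kl_gauss_1d(a, b, c, d) for a, b, c, d in zip(m1, s1, m2, s2))
assert np.isclose(joint, marginals)
```

For non-product couplings, only the inequality $\mathrm{KL}(P\|Q) \ge \sum_i \mathrm{KL}(P_i\|Q_i)$ survives, which is exactly what the proof above establishes.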

## **Appendix K. Order Expansion**

*Bias Expansion for S-PLD*

Recall that the bias of S-PLD is

$$\begin{split} & \left| \mathbb{E} \frac{1}{N} \sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f d\pi \right| \\ & \le L_f \sqrt{\tilde{\mathcal{C}}_0^2 \sqrt{\delta} + \tilde{\mathcal{C}}_1^2 \sqrt{h}}\, k\eta + L_f \sqrt{\frac{2}{m(\alpha, N)}} \chi^2(\mu_0, \pi)^{1/2} e^{-\beta^{-1} m(\alpha, N) k h}, \end{split} \tag{A139}$$

where

$$\tilde{\mathcal{C}}_0^2 = \left(12 + 8\left(\kappa_0 + 2b + \frac{2d}{\beta}\right)\right)\left(\beta \mathcal{C}_0 + \sqrt{\beta \mathcal{C}_0}\right), \tag{A140}$$

$$\tilde{\mathcal{C}}_1^2 = \left(12 + 8\left(\kappa_0 + 2b + \frac{2d}{\beta}\right)\right)\left(\mathcal{C}_1 + \sqrt{\mathcal{C}_1}\right), \tag{A141}$$

$$\mathcal{C}_0 = (1+\alpha)^2 \left( M^2 \left( \kappa_0 + 2\left(1 \vee \frac{1}{m}\right)\left(b + 2(1+\alpha)^2 B^2 + \frac{d}{\beta}\right) \right) + B^2 \right), \tag{A142}$$

$$\mathcal{C}_1 = 6(1+\alpha^2) M^2 (\beta \mathcal{C}_0 + d). \tag{A143}$$

First, we discuss the convergence of the continuous dynamics. Using the eigenvalue expansion in Theorem 6, with some positive constant $d_0$, we have

$$m(\alpha, N) = m_0 + \alpha^2 d_0 + \mathcal{O}(\alpha^3). \tag{A144}$$

Then, assuming $\alpha^2$ is small enough and considering the Taylor expansion, we have

$$L_f \sqrt{\frac{2}{m(\alpha, N)}} \chi^2(\mu_0, \pi)^{1/2} e^{-\beta^{-1} m(\alpha, N) t} \approx L_f \chi^2(\mu_0, \pi)^{1/2} \sqrt{2}\left(\frac{1}{\sqrt{m_0}} - \frac{d_0}{2 m_0^{3/2}} \alpha^2\right) e^{-\beta^{-1} m_0 t}. \tag{A145}$$

As for the discretization and stochastic gradient error, using the Taylor expansion, there exist positive constants $d_1$ and $d_2$ such that

$$L_f \sqrt{\tilde{\mathcal{C}}_0^2 \sqrt{\delta} + \tilde{\mathcal{C}}_1^2 \sqrt{h}}\, k\eta \approx (d_1 \alpha + d_2 \alpha^2 + \mathrm{Const})\, kh. \tag{A146}$$

Combining these terms, we have

$$\left| \mathbb{E} \frac{1}{N} \sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f d\pi \right| \le (d_1 \alpha + d_2 \alpha^2) kh - \alpha^2 L_f \chi^2(\mu_0, \pi)^{1/2} \frac{d_0}{\sqrt{2}\, m_0^{3/2}} e^{-\beta^{-1} m_0 t} + \mathrm{Const}. \tag{A147}$$

Thus, there exists an optimal $\alpha^*$ that minimizes the bias. Please note that at $k = 0$, acceleration always occurs. As $k$ goes to infinity, the second term goes to 0, so the first term becomes dominant, which means we incur larger discretization and stochastic gradient error.

#### **Appendix L. Hyperparameters of the Proposed Algorithm**

Here we discuss how to set the hyperparameters of the algorithm. There are three hyperparameters: $\alpha_0$, $\eta$, and $c$. We numerically found that setting $c = 0.95$ works well for real datasets, including the LDA experiment and Bayesian neural network regression and classification. For toy datasets, we set $c = 0.9$.

As for $\alpha_0$ and $\eta$, we empirically found that the following scaling trick works well for real datasets, including the LDA experiment and Bayesian neural network regression and classification:

$$\alpha_0 \approx \frac{1}{\sqrt{\frac{1}{N^2} \sum_n \nabla U(x_0^{(n)})^2}} N h, \tag{A148}$$

with $\eta \approx 0.1 \alpha_0$. The intuition is that the magnitude of the gradient can be very different in each dimension, so we introduce a scaling by the gradient. We also multiply by $h$ so that the stochastic gradient and discretization error of the skew term does not dominate the usual gradient term. Finally, we multiply by a constant so that $\alpha_0$ is not too small.
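As an illustration, the scaling trick of Equation (A148) might be implemented as follows; the per-dimension scaling and the names (`grad_U`, `init_hyperparams`) are our own reading of the heuristic, not the authors' code:

```python
import numpy as np

def init_hyperparams(grad_U, x0, h):
    """Scaling trick of Equation (A148); x0 holds the N initial particles, shape (N, d)."""
    N = x0.shape[0]
    grads = np.stack([grad_U(x) for x in x0])          # (N, d) gradients at the particles
    scale = np.sqrt(np.sum(grads**2, axis=0) / N**2)   # per-dimension gradient magnitude
    alpha0 = N * h / scale                             # initial skew strength, damped by h
    eta = 0.1 * alpha0                                 # decay hyperparameter, eta ~ 0.1 alpha0
    return alpha0, eta

# toy quadratic potential U(x) = 0.5 ||x||^2, so grad_U(x) = x
rng = np.random.default_rng(0)
x0 = rng.normal(size=(10, 3))
alpha0, eta = init_hyperparams(lambda x: x, x0, h=1e-2)
```

The multiplication by `h` keeps the skew term from dominating the usual gradient step, matching the intuition above.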

#### **Appendix M. Proof of Theorem 8**

In this section, we derive an upper bound on the bias of skew-SGLD based on [23]. This approach requires the logarithmic Sobolev inequality [19], which is stronger than the Poincaré inequality. First, we present its definition. We say that $\pi$ on $\mathbb{R}^d$ with generator $\mathcal{L}$ satisfies the logarithmic Sobolev inequality with constant $\lambda$ if, for all functions $f$ on $\mathbb{R}^d$ with $\int_{\mathbb{R}^d} f^2 d\pi = 1$,

$$\int\_{\mathbb{R}^d} f^2 \ln f^2 d\pi \le \frac{2}{\lambda} \int\_{\mathbb{R}^d} -f \mathcal{L} f d\pi. \tag{A149}$$

This logarithmic Sobolev inequality is stronger than the Poincaré inequality and induces the convergence in KL divergence. See [19] for details. It was proved in [2,18] that our dynamics, LD, SLD, PLD, S-PLD, and skew-SGLD satisfy the logarithmic Sobolev inequalities under our assumptions. We express the constant of the logarithmic Sobolev inequality for skew-SGLD as *λ*(*α*, *N*). This constant depends on the skew matrices and the Poincaré constant. We estimate this constant in Appendix M.1.

To upper-bound the bias, we control the KL divergence. We denote the law of skew-SGLD at iteration $k$ with interaction strength $\alpha$ by $\mu_{kh}^{\otimes N}(\alpha)$. We first upper-bound the bias by the 2-Wasserstein distance:

$$\left| \mathbb{E} \frac{1}{N} \sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f d\pi \right| \le \frac{L_f}{\sqrt{N}} \mathcal{W}_2(\mu_{kh}^{\otimes N}(\alpha), \pi^{\otimes N}). \tag{A150}$$

Then, from the transportation inequality [19],

$$\mathcal{W}_2(\mu_{kh}^{\otimes N}(\alpha), \pi^{\otimes N}) \le \sqrt{\frac{2}{\lambda(\alpha, N)} \mathrm{KL}(\mu_{kh}^{\otimes N}(\alpha)|\pi^{\otimes N})}. \tag{A151}$$

Thus, we upper-bound the KL divergence using the technique in [23]. However, the original proof uses the full gradient $\nabla U$, so we replace it with the stochastic gradient; moreover, we introduce the skew interaction term.

First, Lemma 11 in [23] is modified to

$$\mathbb{E}_{\pi^{\otimes N}} \|\nabla U^{\otimes N}\|^2 \le \frac{d N M}{\beta}. \tag{A152}$$

Then Lemma 12 in [23] is modified to

$$\mathbb{E}_{\mu} \|\nabla U^{\otimes N}\|^2 \le 4 M^2 \lambda\, \mathrm{KL}(\mu|\pi^{\otimes N}) + \frac{2 d N M}{\beta}, \tag{A153}$$

for any integrable *μ*.

Hereinafter, we drop $\otimes N$ from $X^{\otimes N}$, $\nabla U^{\otimes N}$, and $\nabla \hat{U}^{\otimes N}$ for notational simplicity. Focusing on skew-SGLD at iteration $k$, we consider the following SDE for $t \in (kh, (k+1)h]$:

$$dX_t = -(I + \alpha J) \nabla \hat{U}(X_k) dt + \sqrt{2\beta^{-1}} dw_t, \tag{A154}$$

where $\nabla \hat{U}(X_k)$ is the stochastic gradient conditioned on $X_k$. The solution of this SDE is

$$X_{(k+1)h} = X_k - (I + \alpha J) \nabla \hat{U}(X_k) h + \sqrt{2h\beta^{-1}}\, \epsilon. \tag{A155}$$
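In code, one step of this discretized dynamics might look as follows (a sketch with our own names; `J` stands for a skew-symmetric interaction matrix and `grad_U_hat` for a stochastic gradient oracle):

```python
import numpy as np

def skew_sgld_step(x, grad_U_hat, J, alpha, h, beta, rng):
    """One skew-SGLD step: x <- x - (I + alpha*J) grad_U_hat(x) h + sqrt(2h/beta) eps."""
    d = x.shape[0]
    drift = (np.eye(d) + alpha * J) @ grad_U_hat(x)
    eps = rng.normal(size=d)
    return x - h * drift + np.sqrt(2.0 * h / beta) * eps

# toy run: Gaussian target U(x) = 0.5 ||x||^2 (exact gradient), 2-D skew matrix J
rng = np.random.default_rng(1)
J = np.array([[0.0, 1.0], [-1.0, 0.0]])   # skew-symmetric: J^T = -J
x = np.zeros(2)
for _ in range(1000):
    x = skew_sgld_step(x, lambda z: z, J, alpha=0.1, h=0.01, beta=1.0, rng=rng)
```

Since $J^\top = -J$, the skew term leaves the stationary distribution unchanged in the continuous-time limit while reshaping the transient dynamics, which is the mechanism the analysis above quantifies.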

We would like to derive the continuity equation corresponding to Equation (A154). Following [23], we write $X_t$ as $x_t$ and $X_k$ as $x_0$ for simplicity. Let $\rho_{0t}(x_0, x_t)$ denote the joint distribution of $(x_0, x_t)$. Then, the conditional and marginal relations are

$$\rho_{0t}(x_0, x_t) = \rho_0(x_0) \rho_{t|0}(x_t|x_0) = \rho_t(x_t) \rho_{0|t}(x_0|x_t). \tag{A156}$$

The conditional density $\rho_{t|0}(x_t|x_0)$ follows the Fokker–Planck equation

$$\frac{\partial \rho_{t|0}(x_t|x_0)}{\partial t} = \nabla \cdot \left(\rho_{t|0}(x_t|x_0)(I + \alpha J) \nabla \hat{U}(x_0)\right) + \beta^{-1} \Delta \rho_{t|0}(x_t|x_0). \tag{A157}$$

Then, following [23], to derive the evolution of $\rho_t$, we take the expectation over $\rho_0(x_0)$:

$$\begin{split} \frac{\partial \rho_t(x)}{\partial t} &= \int_{\mathbb{R}^d} \frac{\partial \rho_{t|0}(x_t|x_0)}{\partial t} \rho_0(x_0) dx_0 \\ &= \nabla \cdot \left(\rho_t(x)\, \mathbb{E}_{\rho_{0|t}}[(I + \alpha J) \nabla \hat{U}(x_0) \mid x_t = x]\right) + \beta^{-1} \Delta \rho_t(x). \end{split} \tag{A158}$$

Then, we take the expectation with respect to the stochastic gradient in the above equation and absorb it into $\mathbb{E}_{\rho_{0|t}}$ for notational simplicity. Following the discussion of Lemma 3 in [23], we obtain

$$\begin{split} \frac{\partial\, \mathrm{KL}(\mu_t^{\otimes N}|\pi^{\otimes N})}{\partial t} \le & -\frac{3}{4} I(\mu_t^{\otimes N}|\pi^{\otimes N}) + 2 \mathbb{E}_{\rho_{0t}}[\|\nabla U(X_t) - \nabla U(X_0)\|^2] \\ & + 2(1+\alpha)^2 \mathbb{E}_{\rho_{0t}}[\|\nabla U(X_0) - \nabla \hat{U}(X_0)\|^2] + 2\alpha^2 \mathbb{E}_{\rho_{0t}}[\|\nabla \hat{U}(X_0)\|^2], \end{split} \tag{A159}$$

where *t* ∈ (*kh*,(*k* + 1)*h*] and

$$X_t = X_k - t(I + \alpha J) \nabla \hat{U}(X_k) + \sqrt{2t\beta^{-1}}\, \epsilon. \tag{A160}$$

Then, from [18], we can upper-bound the second term by

$$\mathbb{E}_{\rho_0}\left[\|\nabla U(X_0) - \nabla \hat{U}(X_0)\|^2\right] \le N \mathcal{C}_0' \delta, \tag{A161}$$

$$\mathcal{C}_0' := 2\left( M^2 \left( \kappa_0 + 2\left(1 \vee \frac{1}{m}\right)\left(b + 2(1+\alpha)^2 B^2 + \frac{d}{\beta}\right)\right) + B^2 \right) \tag{A162}$$

and the third term is upper-bounded by

$$\begin{split} \mathbb{E}_{\rho_{0t}}\left[\|\nabla \hat{U}(X_0)\|^2\right] &\le 2M^2 \mathbb{E}\left[\|X_0\|^2\right] + 2NB^2 \\ &\le N \mathcal{C}_0', \end{split} \tag{A163}$$

where we used Lemmas 2 and 7 in [2]. Finally, from the original proof of [23], we obtain

$$2 \mathbb{E}_{\rho_{0t}}[\|\nabla U(X_t) - \nabla U(X_0)\|^2] \le 8 t^2 M^4 \lambda\, \mathrm{KL}(\mu_k^{\otimes N}|\pi^{\otimes N}) + \frac{4 t^2 d N M^3}{\beta} + \frac{4 t d N M^2}{\beta}. \tag{A164}$$

Then, in conclusion, for $h \in \left(0,\, 1 \wedge \frac{m}{4M^2}\right)$ with $kh \ge 1$ and $\beta m \ge 2$, we obtain

$$\begin{split} \frac{d}{dt} \mathrm{KL}(\mu_t^{\otimes N}|\pi^{\otimes N}) \le & -\frac{3}{4} I(\mu_t^{\otimes N}|\pi^{\otimes N}) + 8 t^2 M^4 \lambda(\alpha, N)\, \mathrm{KL}(\mu_k^{\otimes N}|\pi^{\otimes N}) \\ & + \frac{4 t^2 d N M^3}{\beta} + \frac{4 t d N M^2}{\beta} + 2 N \mathcal{C}_0'(\delta(1+\alpha)^2 + \alpha^2). \end{split} \tag{A165}$$

For simplicity, we assume that $h \in \left(0, \frac{m}{4M^2}\right)$ and $\frac{m}{4M^2} < 1$; then we obtain

$$\begin{split} \frac{d}{dt} \mathrm{KL}(\mu_t^{\otimes N}|\pi^{\otimes N}) \le & -\frac{3}{4} I(\mu_t^{\otimes N}|\pi^{\otimes N}) + 8 t^2 M^4 \lambda(\alpha, N)\, \mathrm{KL}(\mu_k^{\otimes N}|\pi^{\otimes N}) \\ & + \frac{t^2 d N M}{\beta}(m + 4M) + 2 N \mathcal{C}_0'(\delta(1+\alpha)^2 + \alpha^2). \end{split} \tag{A166}$$

Then using *t* ∈ (*kh*,(*k* + 1)*h*], we obtain

$$\begin{split} \mathrm{KL}(\mu_{k+1}^{\otimes N}|\pi^{\otimes N}) \le & e^{-\frac{3}{2}\lambda(\alpha,N)h}\left(1 + 16 h^3 M^4 \lambda\right) \mathrm{KL}(\mu_k^{\otimes N}|\pi^{\otimes N}) \\ & + e^{-\frac{3}{2}\lambda(\alpha,N)h}\left(\frac{2 h^2 d N M}{\beta}(m + 4M) + 8 h N \mathcal{C}_0'(\delta(1+\alpha)^2 + \alpha^2)\right). \end{split} \tag{A167}$$

If *<sup>h</sup>* <sup>∈</sup> (0, *<sup>λ</sup>*(*α*,*N*) 4 <sup>√</sup>2*M*<sup>2</sup> ), we obtain

$$\mathrm{KL}(\mu_{k+1}^{\otimes N}|\pi^{\otimes N}) \le e^{-\lambda(\alpha,N)h}\, \mathrm{KL}(\mu_k^{\otimes N}|\pi^{\otimes N}) + \frac{2 h^2 d N M}{\beta}(m + 4M) + 8 h N \mathcal{C}_0'(\delta(1+\alpha)^2 + \alpha^2). \tag{A168}$$

From this one-step inequality, we obtain

$$\begin{split} & \mathrm{KL}(\mu_k^{\otimes N}|\pi^{\otimes N}) \\ & \le e^{-\lambda(\alpha,N)kh}\, \mathrm{KL}(\mu_0^{\otimes N}|\pi^{\otimes N}) + \frac{1}{1 - e^{-\lambda(\alpha,N)h}} \left( \frac{2 h^2 d N M}{\beta}(m + 4M) + 8 h N \mathcal{C}_0'(\delta(1+\alpha)^2 + \alpha^2) \right) \\ & \le e^{-\lambda(\alpha,N)kh}\, \mathrm{KL}(\mu_0^{\otimes N}|\pi^{\otimes N}) + \frac{2N}{\lambda(\alpha,N)} \left( \frac{h d M}{\beta}(m + 4M) + 4 \mathcal{C}_0'(\delta(1+\alpha)^2 + \alpha^2) \right). \end{split} \tag{A169}$$
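Unrolling the one-step inequality (A168) into the geometric-series bound (A169) can be checked numerically with placeholder constants (the values of `lam`, `h`, and `c` below are arbitrary illustrations):

```python
import numpy as np

lam, h, c = 0.5, 0.1, 1e-3   # placeholders for lambda(alpha, N), step size h, additive error
kl = 10.0                    # stands in for KL(mu_0 || pi)
K = 500
for _ in range(K):
    kl = np.exp(-lam * h) * kl + c   # one-step inequality (A168), taken with equality

# unrolled bound: e^{-lam*K*h} KL_0 + c * sum_j e^{-lam*h*j} <= e^{-lam*K*h} KL_0 + c/(1 - e^{-lam*h})
bound = np.exp(-lam * K * h) * 10.0 + c / (1.0 - np.exp(-lam * h))
assert kl <= bound + 1e-12
```

The recursion contracts geometrically toward a noise floor of order $c/(1 - e^{-\lambda h})$, which is exactly the second term of (A169) before the final $1/(1-e^{-x}) \le 2/x$ simplification.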

Then, finally we obtain

$$\begin{split} & \left| \mathbb{E} \frac{1}{N} \sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f d\pi \right| \\ & \le \frac{L_f}{\sqrt{N}} \sqrt{\frac{2}{\lambda(\alpha,N)} \mathrm{KL}(\mu_{kh}^{\otimes N}(\alpha)|\pi^{\otimes N})} \\ & \le L_f \sqrt{\frac{2}{\lambda(\alpha,N)}} \sqrt{e^{-\lambda(\alpha,N)kh}\, \mathrm{KL}(\mu_0|\pi) + \frac{2}{\lambda(\alpha,N)}\left(\frac{h d M}{\beta}(m + 4M) + 4 \mathcal{C}_0'(\delta(1+\alpha)^2 + \alpha^2)\right)} \\ & \le L_f \sqrt{\frac{2}{\lambda(\alpha,N)}} \sqrt{e^{-\lambda(\alpha,N)kh}\, \mathrm{KL}(\mu_0|\pi) + \frac{\mathcal{C}_3(\alpha)}{\lambda(\alpha,N)}}, \end{split} \tag{A170}$$

where

$$\mathcal{C}_3(\alpha) := \frac{2 h d M}{\beta}(m + 4M) + 8 \mathcal{C}_0'(\delta(1+\alpha)^2 + \alpha^2), \tag{A171}$$

$$\mathcal{C}_0' := 2\left( M^2 \left( \kappa_0 + 2\left(1 \vee \frac{1}{m}\right)\left(b + 2(1+\alpha)^2 B^2 + \frac{d}{\beta}\right)\right) + B^2 \right). \tag{A172}$$

Moreover, from Appendix M.1, the logarithmic Sobolev constant is

$$\lambda(\alpha, N) := \left( \frac{1}{(1 + \beta m(\alpha, N)^{-1} |C(m_0)|)\, 2\pi e^2} + \frac{3}{2 m(\alpha, N)} \right)^{-1}, \tag{A173}$$

where

$$-\mathbb{C}(m\_0) := \mathbb{E}\_{\pi^{\otimes N}}[\|\nabla \mathcal{U}^{\otimes N}(\mathbf{x})\|]^{1/2} + \sqrt{\frac{8}{m\_0} \mathbb{E}\_{\pi^{\otimes N}}[\|\nabla \mathcal{U}^{\otimes N}(\mathbf{x})\|^2]^{1/2}}.\tag{A174}$$

#### *Appendix M.1. Estimation of the Logarithmic Sobolev Constant*

In this section, we estimate the logarithmic Sobolev constants using the technique of restricted logarithmic Sobolev inequality, which was introduced in [49].

The technique of [49] estimates the logarithmic Sobolev constant as follows. Assume that $\pi$ on $\mathbb{R}^d$ with generator $\mathcal{L}$ satisfies the Poincaré inequality with constant $m$. Then, for any function $u$ on $\mathbb{R}^d$ that satisfies

$$\int\_{\mathbb{R}^d} \mathfrak{u} d\pi = 0 \quad \text{and} \quad \int\_{\mathbb{R}^d} \mathfrak{u}^2 d\pi = 1,\tag{A175}$$

we find a constant *b* that satisfies

$$\int\_{\mathbb{R}^d} \mathfrak{u}^2 \ln \mathfrak{u}^2 d\pi \le b \int\_{\mathbb{R}^d} -\mathfrak{u} \mathcal{L}\mathfrak{u} d\pi. \tag{A176}$$

Then the logarithmic Sobolev constant is larger than $2\left(b + \frac{3}{m}\right)^{-1}$. Thus, we only need to focus on the restricted function class to estimate the constant $b$. We slightly modify Lemma 3.2 of [49], which estimates the constant $b$ in Equation (A176), to apply it in our setting. In Lemma 3.2 of [49], it was proved that if $u$ on $\mathbb{R}^d$ satisfies the conditions in Equation (A175), then for any $t \in (0, 1)$, we have

$$\int\_{\mathbb{R}^d} -u\mathcal{L}u d\pi - t\pi e^2 \int\_{\mathbb{R}^d} u^2 \ln u^2 d\pi \ge (1-t)m + t\beta \int\_{\mathbb{R}^d} (-\frac{1}{2}\mathcal{L}\mathcal{U}(\mathbf{x}) - \pi e^2 \mathcal{U}(\mathbf{x})) u^2 d\pi,\tag{A177}$$

where we assume that *π* ∝ *e*−*βU*(*x*) satisfies the Poincaré inequality with constant *m*. If there exists a constant *C* such that

$$-C \ge \beta \int_{\mathbb{R}^d} \left(-\frac{1}{2}\mathcal{L}U(x) - \pi e^2 U(x)\right) u^2 d\pi > -\infty, \tag{A178}$$

then by setting *t* = *m*/(*m* + |*C*|), we can show that

$$\int\_{\mathbb{R}^d} -u\mathcal{L}u d\pi - m/(m+|\mathbb{C}|) \,\pi e^2 \int\_{\mathbb{R}^d} u^2 \ln u^2 d\pi > 0. \tag{A179}$$

Thus, the constant $b$ in Equation (A176) is $b = t = m/(m+|C|)$, and the logarithmic Sobolev constant is $2\left(m/(m+|C|) + \frac{3}{m}\right)^{-1}$.

We now analyze the constant $C$. The first term of the integral in Equation (A178) is lower-bounded by

$$-\mathbb{E}\_{\pi}[\mathcal{L}\mathcal{U}(\mathbf{x})u^{2}] \ge -|\mathbb{E}\_{\pi}[\mathcal{U}(\mathbf{x})\mathcal{L}\mathcal{U}(\mathbf{x})]|^{1/2}|\mathbb{E}\_{\pi}[u^{2}\mathcal{L}u^{2}]|^{1/2} \ge -2\mathbb{E}\_{\pi}[||\nabla\mathcal{U}(\mathbf{x})||^{2}]^{1/2},\tag{A180}$$

where we used the properties of $\mathcal{L}$; see [19] for details. As for the second term, it is lower-bounded by

$$\begin{split}-|\mathbb{E}\_{\pi}[\mathcal{U}(\mathbf{x})u^{2}]| &\geq -\sqrt{|\mathbb{E}\_{\pi}[\mathcal{U}^{2}(\mathbf{x})u^{2}]|} \geq -\sqrt{\frac{1}{m}|\mathbb{E}\_{\pi}[(\mathcal{U}(\mathbf{x})|u|)\mathcal{L}(\mathcal{U}(\mathbf{x})|u|)]|} \\ &\geq -\sqrt{\frac{8}{m}\mathbb{E}\_{\pi}[||\nabla\mathcal{U}(\mathbf{x})||^{2}]^{1/2}}.\end{split} \tag{A181}$$

Thus, by setting

$$-\mathbb{C} := \mathbb{E}\_{\pi} \left[ \|\nabla \mathcal{U}(\mathbf{x})\|\right]^{1/2} + \sqrt{\frac{8}{m\_0} \mathbb{E}\_{\pi} \left[ \|\nabla \mathcal{U}(\mathbf{x})\|^2 \right]^{1/2}},\tag{A182}$$

we can estimate the logarithmic Sobolev constant as $2\left(m/(m+|C|) + \frac{3}{m}\right)^{-1}$. In our setting, this is modified to

$$\lambda(\alpha, N) = \left( \frac{1}{(1 + \beta m(\alpha, N)^{-1} |C(m_0)|)\, 2\pi e^2} + \frac{3}{2 m(\alpha, N)} \right)^{-1}, \tag{A183}$$

where

$$-\mathbb{C}(m\_0) := \mathbb{E}\_{\pi^{\otimes N}}[\|\nabla \mathcal{U}^{\otimes N}(\mathbf{x})\|]^{1/2} + \sqrt{\frac{8}{m\_0} \mathbb{E}\_{\pi^{\otimes N}}[\|\nabla \mathcal{U}^{\otimes N}(\mathbf{x})\|^2]^{1/2}}.\tag{A184}$$

Finally, $\lambda(\alpha, N)$ increases as $m(\alpha, N)$ increases. Thus, since $m(\alpha, N) \ge m(\alpha = 0, N)$, we obtain $\lambda(\alpha, N) \ge \lambda(\alpha = 0, N)$.
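Equation (A183) is a closed-form map from the Poincaré-type constant $m(\alpha, N)$ to the logarithmic Sobolev constant; with arbitrary placeholder values it can be evaluated directly (the numbers below are purely illustrative):

```python
import numpy as np

def log_sobolev_const(m_alpha_N, beta, C_m0):
    """Equation (A183); m_alpha_N, beta, and |C(m0)| are placeholder scalars."""
    term1 = 1.0 / ((1.0 + beta * abs(C_m0) / m_alpha_N) * 2.0 * np.pi * np.e**2)
    term2 = 3.0 / (2.0 * m_alpha_N)
    return 1.0 / (term1 + term2)

lam_small = log_sobolev_const(m_alpha_N=1.0, beta=1.0, C_m0=2.0)
lam_large = log_sobolev_const(m_alpha_N=2.0, beta=1.0, C_m0=2.0)
# for these placeholder values, lambda increases with m(alpha, N), as stated above
assert lam_large > lam_small
```

The $3/(2m)$ term dominates the sum, so increasing $m(\alpha, N)$ shrinks the denominator and enlarges $\lambda(\alpha, N)$ for these values.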

## *Appendix M.2. Computational Complexity*

To derive the computational complexity, for simplicity, we assume that $\delta \le h$ and also set $\alpha^2 \le h$. This means that the variance of the stochastic gradient is small enough and that we use a small $\alpha$. Then the bias is

$$\begin{split} \left| \mathbb{E} \frac{1}{N} \sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f d\pi \right| &\le L_f \sqrt{\frac{2}{\lambda(\alpha,N)}} \sqrt{e^{-\lambda(\alpha,N)kh}\, \mathrm{KL}(\mu_0|\pi) + \frac{\mathcal{C}_3(\alpha)}{\lambda(\alpha,N)}} \\ &\le L_f \sqrt{\frac{2}{\lambda(\alpha,N)}} \left( \sqrt{e^{-\lambda(\alpha,N)kh}\, \mathrm{KL}(\mu_0|\pi)} + \sqrt{\frac{\mathcal{C}_3(\alpha)}{\lambda(\alpha,N)}} \right), \end{split} \tag{A185}$$

where

$$\mathcal{C}_3(\alpha) := h \left( \frac{2 d M}{\beta}(m + 4M) + 8 \mathcal{C}_0'((1 + h^{1/2})^2 + 1) \right), \tag{A186}$$

$$\mathcal{C}_0' := 2\left( M^2 \left( \kappa_0 + 2\left(1 \vee \frac{1}{m}\right)\left(b + 2(1 + h^{1/2})^2 B^2 + \frac{d}{\beta}\right)\right) + B^2 \right). \tag{A187}$$

Then we define

$$\mathcal{C}\_3' := 2\frac{dM}{\beta}(m+4M) + 8\mathcal{C}\_0'((1+h^{1/2})^2+1),\tag{A188}$$

and use the step size satisfying $h = \frac{\lambda(\alpha,N)\,\xi}{2\sqrt{2}\,\mathcal{C}_3' L_f}$. Then, when we use

$$k \ge \frac{2}{\lambda(\alpha, N) h} \ln \frac{L_f}{\xi} \sqrt{\frac{\mathrm{KL}(\mu_0|\pi)}{2 \lambda(\alpha, N)}}, \tag{A189}$$

we have

$$\left| \mathbb{E} \frac{1}{N} \sum_{n=1}^{N} f(X_k^{(n)}) - \int_{\mathbb{R}^d} f d\pi \right| \le \frac{\xi}{2} + \frac{\xi}{2} \le \xi. \tag{A190}$$

### **References**


## *Article* **Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation**

**Théo Galy-Fajou 1,\*, Valerio Perrone <sup>2</sup> and Manfred Opper 1,3**


**Abstract:** Variational inference is a powerful framework, used to approximate intractable posteriors through variational distributions. The de facto standard is to rely on Gaussian variational families, which come with numerous advantages: they are easy to sample from, simple to parametrize, and many expectations are known in closed-form or readily computed by quadrature. In this paper, we view the Gaussian variational approximation problem through the lens of gradient flows. We introduce a flexible and efficient algorithm based on a linear flow leading to a particle-based approximation. We prove that, with a sufficient number of particles, our algorithm converges linearly to the exact solution for Gaussian targets, and a low-rank approximation otherwise. In addition to the theoretical analysis, we show, on a set of synthetic and real-world high-dimensional problems, that our algorithm outperforms existing methods with Gaussian targets while performing on a par with non-Gaussian targets.

**Keywords:** variational inference; Gaussian; particle flow; variable flow

## **1. Introduction**

Representing uncertainty is a ubiquitous problem in machine learning. Reliable uncertainties are key for decision making, especially in contexts where the trade-off between exploitation and exploration plays a central role, such as Bayesian optimization [1], active learning [2], and reinforcement learning [3]. While Bayesian inference is a principled tool to provide uncertainty estimation, computing posterior distributions is intractable for many problems of interest. Most sampling methods struggle to scale up to large datasets [4], while the diagnosis of convergence is not always straightforward [5]. On the other hand, *Variational Inference* **(VI)** methods can rely on well-understood optimization techniques and scale well to large datasets, at the cost of an approximation quality depending heavily on the assumptions made. The Gaussian family is by far the most popular variational approximation used in VI [6,7]. This is for several reasons. First, Gaussian variational families are easy to sample from, reparametrize, and marginalize. Second, they are easily amenable to diagonal covariance approximations, making them scalable to high dimensions. Third, most expectations are either easily computable by quadrature or Monte Carlo integration, or known in closed-form.

A large body of work covers different approaches to optimizing the *Variational Gaussian Approximation* **(VGA)**, with the speed of convergence and the scalability in dimensions as the main concerns. From the perspective of convergence speed, the major bottleneck when computing gradients with stochastic estimators is the estimator variance [8]. *Particle-based methods* with deterministic paths do not have this issue, and have been proven to be highly successful in many applications [9–11]. However, can we use a particle-based

**Citation:** Galy-Fajou, T.; Perrone, V.; Opper, M. Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation. *Entropy* **2021**, *23*, 990. https://doi.org/ 10.3390/e23080990

Academic Editor: Pierre Alquier

Received: 22 June 2021 Accepted: 21 July 2021 Published: 30 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

<sup>1</sup> Artificial Intelligence Group, Technische Universität Berlin, 10623 Berlin, Germany; manfred.opper@tu-berlin.de

algorithm to compute a VGA? If so, what are its properties and is it competitive with other VGA methods?

In this paper, we attempt to answer these questions by introducing the *Gaussian Particle Flow* **(GPF)**, a framework to approximate a Gaussian variational distribution with particles. GPF is derived from a continuous-time flow, where the necessary expectations over the evolving densities are approximated by particles. The complexity of the method grows quadratically with the number of particles but linearly with the dimension, remaining compatible with other approximations such as structured mean-field approximations. Using the same dynamics, we also derive a stochastic version of the algorithm, *Gaussian Flow* **(GF)**. To show convergence, we prove the decrease in an empirical version of the free energy that is valid for a finite number of particles. For the special case of *D*–dimensional Gaussian target densities, we show that *D* + 1 particles are enough to obtain convergence to the true distribution. We also find, for this case, that convergence is exponentially fast. Finally, we compare our approach with other VGA algorithms, both in fully controlled synthetic settings and on a set of real-world problems.

#### **2. Related Work**

The goal of Bayesian inference is to carry out computations with the posterior distribution of a latent variable $x \in \mathbb{R}^D$ given some observations $y$. By Bayes' theorem, the posterior distribution is $p(x|y) = \frac{p(y|x)p(x)}{p(y)}$, where $p(y|x)$ and $p(x)$ are, respectively, the likelihood and the prior distribution. Even if the likelihood and the prior are known analytically, marginalizing out high-dimensional variables in the product $p(y|x)p(x)$ in order to compute quantities such as $p(y)$ is typically intractable. *Variational Inference* **(VI)** aims to simplify this problem by turning it into an optimization one. The intractable posterior is approximated by the closest distribution within a tractable family, with closeness measured by the *Kullback–Leibler* **(KL)** divergence, defined by

$$\mathrm{KL}\left[q(x)\,\middle\|\,p(x)\right] = \mathbb{E}_q[\log q(x) - \log p(x)],$$

where $\mathbb{E}_q[f(x)] = \int f(x) q(x) dx$ denotes the expectation of $f$ over $q$. Denoting by $\mathcal{Q}$ a family of distributions, we look for

$$\underset{q \in \mathcal{Q}}{\text{arg min }} \text{KL}\left[q(\mathbf{x}) || p(\mathbf{x}|y)\right].$$

Since *p*(*y*) is not computable in an efficient way, we equivalently minimize the upper bound F:

$$\mathrm{KL}[q(x)\,\|\,p(x|y)] \le \mathcal{F}[q] = -\mathbb{E}_q[\log p(y|x)p(x)] - \mathbb{H}_q, \tag{1}$$

where $\mathbb{H}_q$ is the entropy of $q$, i.e., $-\mathbb{E}_q[\log q(x)]$. Here, $\mathcal{F}$ is known as the variational free energy, and $-\mathcal{F}$ is known as the Evidence Lower BOund (ELBO). A diverse set of approaches to performing VI with Gaussian families $\mathcal{Q}$ has been developed in the literature, which we review in the following.

#### *2.1. The Variational Gaussian Approximation*

The VGA is the restriction of $\mathcal{Q}$ to the family of multivariate Gaussian distributions $q(x) = \mathcal{N}(m, C)$, where $m \in \mathbb{R}^D$ is the mean and $C \in \{A \in \mathbb{R}^{D \times D} \mid x^\top A x \ge 0, \forall x \in \mathbb{R}^D\}$ is the covariance matrix, for which the free energy is found to be

$$\mathcal{F}[q] = -\frac{1}{2} \log |C| + \mathbb{E}_q[\phi(x)], \tag{2}$$

where $\phi(x) = -\log(p(y|x)p(x))$. A standard descent algorithm based on gradients of Equation (2) with respect to the variational parameters $m$, $C$ gives rise to some issues. First, naively computing the gradient of the expectation with respect to the covariance matrix $C$ involves unwanted second derivatives of $\phi(x)$ [12], which may not be available or may be computationally too expensive in a *black-box* setting. Second, the gradient of the entropy term $\mathbb{H}_q$ entails inverting a non-sparse matrix, which we would like to avoid in higher-dimensional cases. Finally, the positive-definiteness of the covariance matrix leads to non-trivial constraints on parameter updates, which can slow down convergence or, if ignored, cause instabilities in the algorithm.

To solve these issues, a variety of approaches have been proposed in the literature. If we focus on factorizable models, we can make a simplification: for problems with likelihoods that can be rewritten as $p(y|x) = \prod_{d=1}^{D} p(y|x_d)$, the number of independent variational parameters is reduced to $2D$ [12,13]. In this special case, the Gaussian expectations in the free energy (2) split into a sum of one-dimensional integrals, which can be computed efficiently using numerical quadrature methods. To extend to the general case, gradients of the free energy are estimated by a stochastic sampling approach, which also forms the starting point of our method. This relies on the so-called *reparametrization trick*, where the expectation over the parameter-dependent variational density $q_\theta$ is replaced by an expectation over a fixed density $q^0$ instead. This facilitates the gradient computation because unwanted derivatives of the type $\nabla_\theta q_\theta(x)$ are avoided. For the Gaussian case, the reparametrization trick is a linear transformation expressing an arbitrary $D$-dimensional Gaussian random variable $x \sim q_\theta(x)$ in terms of a $D$-dimensional Gaussian random variable $x^0 \sim q^0 = \mathcal{N}(m^0, C^0)$:

$$\mathbf{x} = \Gamma(\mathbf{x}^0 - m^0) + m,\tag{3}$$

where $\Gamma \in \mathbb{R}^{D\times D}$ and $m \in \mathbb{R}^D$ are the variational parameters. We assume that the covariance $C^0$ is not degenerate and, for simplicity, we set it as the identity. For instance, the gradient with respect to the mean $m$ of the expectation of a function $f$ over $q$ becomes $\nabla_m \mathbb{E}_q[f(x)] = \mathbb{E}_{q^0}\left[\nabla_m f(\Gamma(x^0 - m^0) + m)\right]$. This can be proved simply by using the reparametrization (3) inside the integral and passing the gradient inside; for more details, see [14].
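The reparametrized gradient above can be checked numerically. The sketch below uses an illustrative quadratic $f(x) = x^\top x$ (our choice, so the exact gradient $2m$ is known in closed form) and a fixed base density $q^0 = \mathcal{N}(0, I)$, i.e., $m^0 = 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 400_000

# Illustrative test function (our choice): f(x) = x^T x, for which
# grad_m E_q[f(x)] = 2m is available in closed form.
m = np.array([0.5, -1.0, 2.0])
Gamma = np.tril(rng.standard_normal((D, D))) + 2.0 * np.eye(D)

# Reparametrization trick: x = Gamma x0 + m with x0 ~ q0 = N(0, I), m0 = 0.
x0 = rng.standard_normal((N, D))
x = x0 @ Gamma.T + m

# grad_m E_q[f(x)] = E_{q0}[ grad_m f(Gamma x0 + m) ]; for f(x) = x^T x
# the inner gradient with respect to m is simply 2x at the transformed samples.
grad_mc = np.mean(2.0 * x, axis=0)
grad_exact = 2.0 * m
print(grad_mc, grad_exact)
```

No derivative of the density itself is needed; only the transformed samples enter the estimator.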

Given this representation, the free energy is easily obtained as a function of the variational parameters:

$$\mathcal{F}(q) = -\log|\Gamma| + \mathbb{E}_{q^0}\left[\phi\left(\Gamma(x^0 - m^0) + m\right)\right]. \tag{4}$$

Other representations are possible. Challis and Barber [13] and Ong et al. [15] use a different reparametrization with a factorized structure of the covariance $C = \Gamma\Gamma^\top + \text{diag}(d)$, where $\Gamma \in \mathbb{R}^{D\times P}$ and $d \in \mathbb{R}^D$, with $P \le D$ the rank of $\Gamma\Gamma^\top$. Other representations assume special structures of the precision matrix $\Lambda = C^{-1}$, which allow one to enforce special properties, such as sparsity in [16,17].

In general, these methods tend to scale poorly with the number of dimensions, as one needs to optimize *D*(*D* + 3)/2 parameters. The (structured) *Mean-Field* **(MF)** [18,19] approach imposes independence between variables in the variational distribution. The number of variational parameters is then 2*D*, but covariance information between dimensions is lost.

## *2.2. Natural Gradients*

Besides the issue of expectations, more efficient optimization directions, beyond ordinary gradient descent, have been considered. These can help to deal with constraints such as those on the covariance matrix. Natural gradients [20] are a special case of Riemannian gradients and utilize the specific Riemannian manifold structure of the variational parameters. They can often deal with parameter constraints (such as the positive definiteness of the covariance), accelerate inference, and improve the convergence of algorithms. The application of such advanced gradient methods typically requires an estimate of the inverse Fisher information matrix as a preconditioner of ordinary gradients. Khan and Nielsen [21] and Lin et al. [22] propose a solution that requires extra second derivatives of the log-posteriors. Salimbeni et al. [23] developed an automatic process to compute these without the second derivatives, but with instability issues. Lin et al. [17] solved these issues by using geodesics on the manifold of parameters, at the price of having to compute inverse matrices as well as Hessians.

#### *2.3. Particle-Based VI*

Stochastic gradient descent methods compute expectations (and gradients) at each time step with new independent Monte Carlo samples drawn from the current approximation of the variational density. Particle-based methods for variational inference *draw samples only once* at the beginning of the algorithm instead. They iteratively construct transformations of an initial random variable (having a simple tractable density) where the transformed density leads to the decrease and finally to the minimum of the variational free energy. The iterative approach induces a deterministic temporal flow of random variables which depends on the current density of the variable itself. Using an approximation by the empirical density (which is represented by the positions of a set of 'particles') one obtains a flow of interacting particles which converges asymptotically to an empirical approximation of the desired optimal variational density.

The most popular approach is *Stein Variational Gradient Descent* **(SVGD)** [24], which computes a nonparametric transformation based on the kernelized Stein discrepancy [9]. SVGD has the advantage of not being restricted to a parametric form of the variational distribution. However, using standard distance-based kernels like the squared exponential kernel ($k(x, y) = \exp(-\|x - y\|_2^2/2)$) can lead to underestimated covariances and poor performance in high dimensions [11,25]. Hence, it is interesting to develop particle approaches that approximate the VGA. We provide a more thorough comparison between our method and SVGD in Section 3.6.

## *2.4. GVA in Bayesian Neural Networks*

There has been increased interest in making Neural Networks Bayesian (*Bayesian Neural Networks*, **BNN**) by adding priors to their parameters. The true form of the posterior is unknown, but the VGA has been used due to its ease of use and its scalability with the number of dimensions (typically $D \gtrsim 10^5$). Most of the aforementioned methods apply to BNN, but some techniques have been specifically tailored with BNN in mind. [26] use the low-rank structure of [13] but exploit the *Local Reparametrization Trick*, where each datapoint $y_i$ gets a different sample from $q$ in order to reduce the variance of the stochastic gradient estimator. In *Stochastic Weight Averaging-Gaussian* **(SWAG)** [27], a set of particles obtained via stochastic gradient descent represents a low-rank Gaussian distribution, approximating the true posterior with the prior implicitly defined by the network's regularization. While easy to implement, SWAG does not allow one to incorporate an explicit prior, and the resulting distribution does not derive from a principled Bayesian approach.

## *2.5. Related Approaches*

The closest approach to our proposed method is the *Ensemble Kalman Filter* **(EKF)** [28]. It assumes that the posterior is computed sequentially, where, at each time step, only single observations (or small batches), represented by their likelihoods, become available. An ensemble of particles, representing a Gaussian distribution, is iteratively updated with every new batch of observations. EKF allows us to work on high-dimensional problems with a limited number of particles but is restricted to factorizable likelihoods for which a sequential representation is possible. While EKF maintains a representation of a Gaussian posterior, it is not clear how this relates to the goal of minimizing the free energy or the KL divergence.

#### **3. Gaussian (Particle) Flow**

We introduce *Gaussian Particle Flow* **(GPF)** and *Gaussian Flow* **(GF)**, two computationally tractable approaches to obtain a *Variational Gaussian Approximation* **(VGA)**. In the following, we derive deterministic linear dynamics which decrease the variational free energy. We additionally give some variants with a *Mean-Field* **(MF)** approach and prove theoretical convergence guarantees.

In the following, $\frac{d(\cdot)}{dt}$ denotes the total derivative with respect to time, $\frac{\partial(\cdot)}{\partial t}$ the partial derivative with respect to time, and $\nabla_x(\cdot)$ the gradient with respect to a vector $x$.

#### *3.1. Gaussian Variable Flows*

We next discuss an alternative approach to generate the desired transformation of random variables, leading from a simple (prior) Gaussian density to a more complex Gaussian which minimizes the variational free energy. It is based on the idea of *variable flows*, i.e., recursive deterministic transformations of the random variables defined by a mapping $x^{n+1} = x^n + \epsilon f^n(x^n)$, where $f^n: \mathbb{R}^D \to \mathbb{R}^D$. Well-known examples of flows are *Normalizing Flows* [29], where the $f^n$ are bijections, or *Neural ODEs* [30], where $f^n = f$ is defined by a neural network and $x^0$ is the input. For simplicity, we will consider small steps $\epsilon \to 0$ and work with flows in the continuous-time limit ($t = n\epsilon$), which follow a system of *Ordinary Differential Equations* **(ODE)**. For the Gaussian case, in the spirit of the reparametrization trick (3), we choose a corresponding linear map $f$ and write

$$\frac{d\mathbf{x}^t}{dt} = f^t(\mathbf{x}^t) = A^t(\mathbf{x}^t - m^t) + b^t,\tag{5}$$

where $A^t$ is a matrix and $m^t \doteq \mathbb{E}_{q^t}[x]$ (which is no longer interpreted as an independent variational parameter). When the initial random variable $x^0$ is Gaussian distributed, the vectors $x^t$ are also Gaussian for any $t$. To construct a flow that decreases the free energy over time, we can either compute the time derivative of the specific free energy (2) induced by the ODE (5), or simply derive the general result valid for smooth maps $f$ (see, e.g., [24]). To be self-contained, we briefly repeat the main steps: we first compute the change of the free energy in terms of the time derivative of $q^t$:

$$\begin{split} \frac{d\mathcal{F}[q^{t}]}{dt} &= \frac{d}{dt} \int q^{t}(x) \left(\log q^{t}(x) + \phi(x)\right) dx \\ &= \int \frac{\partial q^{t}(x)}{\partial t} \left(\log q^{t}(x) + \phi(x)\right) dx + \int q^{t}(x) \left( \frac{\partial q^{t}(x)}{\partial t} \frac{1}{q^{t}(x)} + \frac{\partial \phi(x)}{\partial t} \right) dx \\ &= \int \frac{\partial q^{t}(x)}{\partial t} \left(\log q^{t}(x) + \phi(x)\right) dx \end{split}$$

where we have used the facts that $\int \frac{\partial q^t(x)}{\partial t}\, dx = \frac{d}{dt}\int q^t(x)\, dx = 0$ and $\frac{\partial \phi(x)}{\partial t} = 0$. We next use the *continuity equation* for the density

$$\frac{\partial q^t(x)}{\partial t} = -\nabla_x \cdot \left(q^t(x) f^t(x)\right)$$

related to the deterministic flow to obtain

$$\begin{split} \frac{d\mathcal{F}[q^{t}]}{dt} &= -\int \nabla_x \cdot \left( q^{t}(x) f^{t}(x) \right) \left( \log q^{t}(x) + \phi(x) \right) dx \\ &= \int q^{t}(x) f^{t}(x) \cdot \nabla_x \left( \log q^{t}(x) + \phi(x) \right) dx \\ &= \int \left( f^{t}(x) \cdot \nabla_x q^{t}(x) + q^{t}(x) f^{t}(x) \cdot \nabla_x \phi(x) \right) dx \\ &= \int \left( -q^{t}(x)\, \nabla_x \cdot f^{t}(x) + q^{t}(x) f^{t}(x) \cdot \nabla_x \phi(x) \right) dx \\ &= -\mathbb{E}_{q^{t}}\left[ \nabla_x \cdot f^{t}(x) - f^{t}(x) \cdot \nabla_x \phi(x) \right] \end{split}$$

where we have applied Green's identity twice and used the fact that $\lim_{\|x\|\to\infty} q^t(x) = 0$. Specializing to the linear flow (5), we obtain

$$\frac{d\mathcal{F}[q^t]}{dt} = -\text{tr}\left[A^t(A^t_\star)^\top\right] - (b^t)^\top b^t_\star, \tag{6}$$

where

$$\begin{aligned} A^t_\star &\doteq I - \mathbb{E}_{q^t}\left[\nabla_x \phi(x)(x - m^t)^\top\right] \\ b^t_\star &\doteq -\mathbb{E}_{q^t}[\nabla_x \phi(x)] \end{aligned} \tag{7}$$

Equation (6) represents the change in the free energy F for an infinitesimal change in the variables *x* given by the flow (5). Obviously, the simplest choices

$$A^t \equiv A^t\_\star \qquad b^t \equiv b^t\_\star \tag{8}$$

lead to a decrease in the free energy, $\frac{d\mathcal{F}[q^t]}{dt} \le 0$. More detailed derivations are given in Appendix A. Additionally, *equality* holds only when

$$\begin{aligned} I - \mathbb{E}_{q}\left[\nabla_x \phi(x)(x - m)^\top\right] &= 0 \\ \mathbb{E}_{q}[\nabla_x \phi(x)] &= 0 \end{aligned} \tag{9}$$

Using Stein's lemma [31], we can show that these fixed-point solutions are equal to the conditions for the optimal variational Gaussian distribution given in [12]. In Appendix C, we show that our parameter updates can be interpreted as a Riemannian gradient descent method for the free energy (4). This is based on the metric introduced by ([20], Theorem 7.6) as an efficient technique for learning the mixing matrix in models of blind source separation. This gradient should not be confused with the so-called *natural gradient* obtained by pre-multiplying with the inverse Fisher information matrix.

Of course, there are other choices for *A<sup>t</sup>* and *b<sup>t</sup>* , which lead to a decrease in the free energy and the same fixed-point equations. In Section 3.6, we discuss how SVGD, with a linear kernel, can lead to the same fixed points but with different dynamics.

#### *3.2. From Variable Flows to Parameter Flows*

Before we introduce the particle algorithm, we show that the results for the variable flow can also be converted into a temporal change of the parameters $\Gamma^t$, $m^t$, as defined in Equation (3). From this, a corresponding *Gaussian Flow* **(GF)** algorithm can easily be derived. By differentiating the parametrisation $x^t = \Gamma^t(x^0 - m^0) + m^t$ (with $m^t$ now considered as a free variational parameter) with respect to time $t$ and using (5), we obtain

$$\frac{d\mathbf{x}^t}{dt} = \frac{d\Gamma^t}{dt}(\mathbf{x}^0 - m^0) + \frac{dm^t}{dt} = A^t(\mathbf{x}^t - m^t) + b^t \tag{10}$$

By inserting $x^t = \Gamma^t(x^0 - m^0) + m^t$ into the right-hand side of (10), and using the optimal parameters from (7), we obtain

$$\begin{aligned} \frac{d\Gamma^t}{dt} &= \Gamma^t - \mathbb{E}_{q^0}\left[\nabla_x \phi(x^t)(x^0 - m^0)^\top\right](\Gamma^t)^\top \Gamma^t \\ \frac{dm^t}{dt} &= -\mathbb{E}_{q^0}\left[\nabla_x \phi(x^t)\right] \end{aligned} \tag{11}$$

Note that the expectations are over the probability distribution of the initial random variable $x^0$. Discretizing Equations (11) in time, and estimating the expectations by drawing independent samples from the fixed Gaussian $q^0$ at each time step, we obtain our GF algorithm to minimize the variational free energy in the space of Gaussian densities. We summarize the steps of GF in Algorithm 1. Remarkably, this scheme differs from previous VGA algorithms with Riemannian gradients based on the Fisher information metric (see, e.g., [17,32]) because no *matrix inversions* or *second-order derivatives* of the function $\phi$ are required.
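A minimal numerical sketch of the discretized parameter flow on an illustrative two-dimensional Gaussian target (the precision matrix, mean, step size, and sample counts below are all our assumptions) looks as follows; the estimate of $A_\star$ from (7) drives $\Gamma$ while $m$ follows the averaged gradient, and the run should recover $m = \mu$ and $\Gamma\Gamma^\top = \Lambda^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 2

# Illustrative Gaussian target (our choice): phi(x) = -log p(x) with precision
# Lam and mean mu, so grad_x phi(x) = Lam (x - mu).
Lam = np.array([[2.0, 0.5], [0.5, 1.0]])
mu = np.array([1.0, -1.0])
grad_phi = lambda X: (X - mu) @ Lam.T

# Euler discretization of the parameter flow: at each step, fresh samples are
# drawn from the fixed base density q0 = N(0, I) (step size eta assumed).
m, Gamma, eta = np.zeros(D), np.eye(D), 0.05
for _ in range(2000):
    x0 = rng.standard_normal((500, D))
    x = x0 @ Gamma.T + m                      # reparametrized samples of q^t
    g = grad_phi(x)
    A = np.eye(D) - g.T @ (x - m) / len(x)    # estimate of A_star from (7)
    Gamma = Gamma + eta * A @ Gamma           # dGamma/dt = A Gamma
    m = m - eta * g.mean(axis=0)              # dm/dt = -E[grad phi]
C = Gamma @ Gamma.T
print(m, C, np.linalg.inv(Lam))
```

Note that, as in the text, no matrix inversion or second derivative of $\phi$ appears anywhere in the loop.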

GF also allows for the computation of a low-rank VGA by enforcing $\Gamma \in \mathbb{R}^{D\times K}$ and $x^0 \in \mathbb{R}^K$. This algorithm scales linearly in the number of dimensions and quadratically in the rank $K$ of the covariance.

It is interesting to note that the reverse construction of a variable flow from a parameter flow is, in general, not possible. This would require the ability to eliminate all variational parameters and the initial variables $x^0$ in the resulting differential equation for $x^t$, and to replace them with functions of $x^t$ alone. For instance, if we eliminate the initial variables $x^0$ in terms of $(\Gamma^t)^{-1}$ and $x^t$ in the algorithm of [14], the resulting expression still depends on $\Gamma^t$.

#### *3.3. Particle Dynamics*

The main idea of the particle approach is to approximate the Gaussian density *q<sup>t</sup>* in (7) by the empirical distribution

$$\hat{q}^{t} \doteq \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_{i}^{t}) \tag{12}$$

computed from $N$ samples $x^t_i$, $i = 1, \dots, N$. These are initially sampled from the density $q^0$ at time $t = 0$ and are then propagated using the discretized dynamics of the ODE (5):

$$\frac{dx_i^t}{dt} = -\eta_1^t\, \mathbb{E}_{\hat{q}^t}[\nabla_x \phi(x)] + \eta_2^t\, \hat{A}^t(x_i^t - \hat{m}^t) \tag{13}$$

where

$$\hat{A}^t = I - \frac{1}{N} \sum_{i=1}^N \nabla_x \phi(x_i^t)(x_i^t - \hat{m}^t)^\top$$

$$\hat{b}^t = \frac{1}{N} \sum_{i=1}^N \nabla_x \phi(x_i^t), \qquad \hat{m}^t = \frac{1}{N} \sum_{i=1}^N x_i^t$$

where $\eta^t_1$ and $\eta^t_2$ are learning rates (we further comment on the use of different optimization schemes in Section 4.4). Note that although $\mathbb{E}_{\hat{q}^t}\left[\nabla_x\phi(x)(x - \hat{m}^t)^\top\right]$ is a $D \times D$ matrix, changing the order of the matrix multiplications leads to a computational complexity of $\mathcal{O}(N^2 D)$ with a storage complexity of $\mathcal{O}(N(N + D))$, since neither the empirical covariance matrix nor $\hat{A}^t$ needs to be explicitly computed.
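The reordering argument can be checked directly. In the sketch below (the sizes and the standard-normal target giving $\nabla_x\phi(x) = x$ are our choices), the drift $\hat{A}^t(x^t_i - \hat{m}^t)$ is computed once by forming $\hat{A}^t$ explicitly and once in the $N \times N$ Gram space:

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 500, 20                      # illustrative sizes: dimension >> particles
X = rng.standard_normal((N, D))     # particles x_i^t, one per row
G = X.copy()                        # grad_x phi(x_i) = x_i for a standard-normal target (our choice)

m_hat = X.mean(axis=0)
Xc = X - m_hat                      # centered particles x_i - m_hat

# Naive route: form the D x D matrix A_hat explicitly (O(N D^2) time, O(D^2) memory).
A_hat = np.eye(D) - G.T @ Xc / N
drift_naive = Xc @ A_hat.T          # row i holds A_hat (x_i - m_hat)

# Reordered route: work in the N x N Gram space, O(N^2 D) time and
# O(N(N + D)) memory -- neither A_hat nor the covariance is ever formed.
drift_fast = Xc - (Xc @ Xc.T) @ G / N

print(np.abs(drift_naive - drift_fast).max())
```

Both routes produce identical drifts; only the second remains affordable when $D$ is large and $N$ is small.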

#### Relaxation of Empirical Free Energy and Convergence

We have shown that the continuous-time dynamics (10) of the random variables lead to a decay of the free energy $\mathcal{F}(q^t)$ with time $t$. Assuming that the free energy is bounded from below, one might conjecture that this property would imply the convergence of the particle algorithm to a fixed point when the learning rates are sufficiently small, such that the discrete-time dynamics are well approximated by the continuous limit. Unfortunately, the finite number $N$ of particles poses an extra problem. The definition of the free energy $\mathcal{F}(q)$ by the KL divergence (1) for continuous random variables assumes that both $q(\cdot)$ and $p(\cdot|y)$ are densities with respect to the Lebesgue measure. Hence, $\mathcal{F}(\hat{q})$ is not defined if we take $q \equiv \hat{q}$, the empirical distribution (12) of the finite particle approximation. Nevertheless, we define a finite-$N$ *approximation* to the Gaussian free energy, which is then also found to decay under the finite-$N$ dynamics. Let us first assume that $N > D$ and define

$$\tilde{\mathcal{F}}(\hat{q}^t) \doteq -\frac{1}{2} \log|\hat{C}^t| + \mathbb{E}_{\hat{q}^t}[\phi(x)] \tag{14}$$

with the empirical covariance matrix

$$\hat{C}^t = \frac{1}{N} \sum_{i=1}^{N} \left( x_i^t - \hat{m}^t \right) \left( x_i^t - \hat{m}^t \right)^\top \tag{15}$$

The definition (14) is chosen in such a way that, in the large-$N$ limit, when the empirical distribution $\hat{q}^t$ converges to a Gaussian distribution $q^t$, the approximation (14) also converges to $\mathcal{F}(q^t)$. It can be shown (see Appendix B) that $\frac{d\tilde{\mathcal{F}}(\hat{q}^t)}{dt} \le 0$, with equality only at the fixed points of the dynamics.

In applications of our particle method to high-dimensional problems, the limitations of computational power may force us to restrict the particle number to be smaller than the dimensionality $D$. For $N < D + 1$, the empirical covariance $\hat{C}^t$ will be singular and typically contain only $N - 1$ non-zero eigenvalues, which leads to $-\log|\hat{C}^t| = \infty$ and makes Equation (14) meaningless. We resolve this issue through a regularisation of the log-determinant term in (14), replacing all zero eigenvalues of $\hat{C}^t$ by the value 1, i.e., $\lambda_i = 0 \to \tilde{\lambda}_i = 1$. We show in Appendix B that the free energy still decays, provided that the dynamics of the particles stay the same. This regularisation step can be formally stated as a replacement of the empirical covariance (15) in (14) by

$$
\hat{C}^t \to \hat{C}^t + \sum\_{i:\lambda\_i^t = 0} e\_i^t (e\_i^t)^\top
$$

where $e^t_i$ is the $i$th eigenvector of $\hat{C}^t$.
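The regularisation amounts to summing the logs of the non-zero eigenvalues only. A small sketch (the sizes are our choice):

```python
import numpy as np

rng = np.random.default_rng(4)

def reg_logdet(C, tol=1e-10):
    """Regularized log-determinant: zero eigenvalues of C are replaced by 1,
    so only the non-zero eigenvalues contribute to the sum of logs."""
    lam = np.linalg.eigvalsh(C)
    return float(np.sum(np.log(lam[lam > tol])))

# With N < D + 1 particles the empirical covariance is singular:
D, N = 5, 3
X = rng.standard_normal((N, D))
C = np.cov(X.T, bias=True)          # rank at most N - 1 = 2
lam = np.linalg.eigvalsh(C)
reg = reg_logdet(C)
print(lam, reg)                     # -log|C| would be infinite; reg is finite
```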

## *3.4. Algorithm and Properties*

The algorithm we propose is to sample $N$ particles $\{x^0_1, \dots, x^0_N\}$, where $x^0_i \in \mathbb{R}^D$, from $q^0$ (which can, for example, be centered around the MAP), and to iteratively optimize their positions using Equation (13). Once convergence is reached, i.e., $\frac{d\mathcal{F}}{dt} = 0$, we can easily make predictions using the converged empirical distribution $\hat{q}(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - x_i)$, where $\delta$ is the Dirac delta function, or, alternatively, the Gaussian density it represents, i.e., $q(x) = \mathcal{N}(m, C)$, where $m = \frac{1}{N}\sum_{i=1}^N x_i$ and $C = \frac{1}{N}\sum_{i=1}^N (x_i - m)(x_i - m)^\top$. To draw samples from $\hat{q}$, no inversions of the empirical covariance $C$ are needed, as we can obtain new samples by computing:

$$x = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \xi_i (x_i - m) + m, \tag{16}$$

where the $\xi_i$ are i.i.d. standard normal variables: $\xi_i \sim \mathcal{N}(0, 1)$. This can be shown by defining the deviation matrix $D$, whose columns are $D_i = \frac{x_i - m}{\sqrt{N}}$. We naturally have $DD^\top = C$, which makes $D$ a (non-triangular) square root of $C$.
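A sketch of this sampling scheme (the synthetic "converged" particles below are our stand-in for the output of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(5)
D, N = 4, 200

# Synthetic converged particles (our choice): rows x_i of X represent q_hat.
X = rng.standard_normal((N, D)) @ np.diag([3.0, 2.0, 1.0, 0.5]) + 1.0
m = X.mean(axis=0)
Dev = (X - m).T / np.sqrt(N)        # deviation matrix, columns (x_i - m)/sqrt(N)
C = Dev @ Dev.T                     # = empirical covariance; no inversion needed

# Equation (16): x = m + sum_i xi_i (x_i - m)/sqrt(N) with scalar xi_i ~ N(0, 1).
S = 100_000
Xi = rng.standard_normal((S, N))    # i.i.d. standard normal coefficients
samples = Xi @ Dev.T + m            # each row is one draw from N(m, C)

emp_C = np.cov(samples.T, bias=True)
print(np.abs(emp_C - C).max())
```

By construction, each draw has mean $m$ and covariance $DD^\top = C$ exactly, so neither a Cholesky factorization nor an inversion of $C$ is ever required.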

All the inference steps are summarized in Algorithm 2 and an illustration in two dimensions is provided in Figure 1.

We summarize the principal points of our approach:


**Figure 1.** Illustration of the Gaussian Particle Flow algorithm, with *q*0(*x*) and *p*(*x*) representing the initial and target distribution respectively. Particles are iteratively moved according to the gradient flow starting from *q*0(*x*), approximating a new Gaussian distribution *q<sup>t</sup>* (*x*) at each iteration *t*.


**Algorithm 2:** Gaussian Particle Flow (GPF)


3.4.1. Relaxation of Empirical Free Energy

The definition of the free energy $\mathcal{F}(q)$ from the KL divergence (1) for continuous random variables assumes that both $q(\cdot)$ and $p(\cdot|y)$ are densities with respect to the Lebesgue measure. Hence, it is not *a priori* clear that a specific *approximation* $\mathcal{F}(\hat{q}^t)$, based on an empirical distribution $\hat{q}^t(x) \doteq \frac{1}{N}\sum_{i=1}^N \delta(x - x^t_i)$ with a *finite number of particles* $N$, will decrease under the particle flow. Thus, we may not be able to guarantee convergence to a fixed point for finite $N$. Luckily, as we show in Appendix D, we find that:

$$\frac{d\mathcal{F}(\hat{q}^t)}{dt} = \frac{d\left(\mathbb{E}_{\hat{q}^t}[\phi(x)] - \frac{1}{2}\log|\hat{C}^t|\right)}{dt} \le 0. \tag{17}$$

For $N < D + 1$, the empirical covariance $\hat{C}^t$ will typically contain $N - 1$ non-zero eigenvalues and lead to $-\log|\hat{C}^t| = \infty$, making Equation (17) meaningless. We resolve this issue by introducing a *regularized free energy* $\tilde{\mathcal{F}}$, where $\log|\hat{C}^t|$ is replaced by $\sum_{i:\lambda_i > 0} \log\lambda_i$, where $\{\lambda_i\}_{i=1}^D$ are the eigenvalues of $\hat{C}^t$. We show in Appendix D that, given the dynamics from Equation (5), $\tilde{\mathcal{F}}$ is also guaranteed not to increase over time. It can, therefore, be used as a regularized proxy for the true $\mathcal{F}$, for instance to optimize hyper-parameters or to monitor convergence. Note that similar results exist for SVGD [33] and were proven to be highly non-trivial.

#### 3.4.2. Dynamics and Fixed Points for Gaussian Targets

We illustrate our method by some exact theoretical results for the dynamics and the fixed points of our algorithm when *the target is a multivariate Gaussian density*. While such targets may seem like a trivial application, our analysis could still provide some insight into the performance for more complicated densities.

**Theorem 1.** *If the target density p*(*x*) *is a D-dimensional multivariate Gaussian, only D* + 1 *particles are needed for Algorithm 2 to converge to the exact target parameters.*

**Proof.** The proof is given in Appendix E.

**Theorem 2.** *For a target $p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$, i.e., with precision matrix $\Lambda$, where $x \in \mathbb{R}^D$, and $N \ge D + 1$ particles, the continuous-time limit of Algorithm 2 will converge exponentially fast in both the mean and the trace of the precision matrix:*

$$m^t - \mu = e^{-\Lambda t} (m^0 - \mu),$$

$$\text{tr}(\left(C^t\right)^{-1} - \Lambda) = e^{-2t} \text{tr}(\left(C^0\right)^{-1} - \Lambda),$$

*where <sup>m</sup><sup>t</sup> and <sup>C</sup><sup>t</sup> are the empirical mean and covariance matrix at time <sup>t</sup> and* exp(−Λ*t*) *is the matrix exponential.*

**Proof.** The proof is given in Appendix F.
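Theorem 1 can be verified numerically. The sketch below runs Euler steps of the decreasing particle flow $dx_i/dt = \hat{A}^t(x^t_i - \hat{m}^t) - \hat{b}^t$, following (5) and (8), on an illustrative two-dimensional Gaussian target with exactly $D + 1 = 3$ particles (the target parameters and step size are our choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# 2-D Gaussian target N(mu, Sigma); Theorem 1 says D + 1 = 3 particles suffice.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)
grad_phi = lambda X: (X - mu) @ Lam.T      # gradient of phi(x) = -log p(x)

X = rng.standard_normal((3, 2))            # N = D + 1 particles
eta = 0.05                                 # step size (our choice)
for _ in range(4000):                      # Euler steps of the particle flow
    m_hat = X.mean(axis=0)
    Xc = X - m_hat
    G = grad_phi(X)
    A_hat = np.eye(2) - G.T @ Xc / len(X)  # empirical A_star
    X = X + eta * (-G.mean(axis=0) + Xc @ A_hat.T)

m_hat = X.mean(axis=0)
C_hat = np.cov(X.T, bias=True)
print(m_hat, C_hat)
```

At the fixed point the empirical mean and covariance match $\mu$ and $\Sigma$ exactly, since $\hat{b}^t = 0$ and $\hat{A}^t = 0$ there.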

Our result shows that the convergence of the mean $m^t$ depends directly on $\Lambda$. However, we can also precondition the gradient on $m$ by $C^t$, i.e., use the natural gradient approximation in the Fisher sense, and eventually remove the dependency on $\Lambda$ when $(C^t)^{-1} \approx \Lambda$.

The exponential relaxation of fluctuations also manifests itself in the decay of the free energy towards its minimum. For the Gaussian target, the free energy separates exactly into two terms corresponding to the mean and the fluctuations. We can write $\mathcal{F}(m^t, C^t) = \frac{1}{2}(m^t - \mu)^\top\Lambda(m^t - \mu) + \frac{D}{2} + \mathcal{F}_{fl}(C^t)$, where the nontrivial fluctuation part (with its minimum subtracted) is given by

$$\mathcal{F}_{fl}(C^t) = -\frac{1}{2}\log|\Lambda C^t| + \frac{1}{2}\text{tr}(\Lambda C^t - I).$$

We can show that

$$-\lim_{t \to \infty} \frac{d\ln \mathcal{F}_{fl}(C^t)}{dt} \ge 4,$$

indicating an asymptotic decrease in $\mathcal{F}_{fl}(C^t)$ faster than $e^{-4t}$, independent of the target. We can also prove the finite-time bound

$$\mathcal{F}_{fl}(C^{t}) \leq \mathcal{F}_{fl}(C^{0})\, \exp\left[-\frac{2t}{\text{tr}(\Lambda^{-1})\left(\text{tr}(\Lambda) + \left|\text{tr}\left((C^{0})^{-1} - \Lambda\right)\right|\right)}\right].$$

**The degenerate case $N < D + 1$**

Additionally, we can show the following result for the fixed points:

**Theorem 3.** *Given a $D$-dimensional multivariate Gaussian target density $p(x) = \mathcal{N}(x|\mu, \Sigma)$, using Algorithm 2 with $N < D + 1$ particles, the empirical mean converges to the exact mean $\mu$. The $N - 1$ non-zero eigenvalues of $C^t$ converge to a subset of the spectrum of the target covariance $\Sigma$. Furthermore, the global minimum of the regularised version $\tilde{\mathcal{F}}$ of the free energy (17) corresponds to the largest eigenvalues of $\Sigma$.*

#### **Proof.** The proof is given in Appendix G.

This result suggests that $C^t$ might typically converge to an optimal low-rank approximation of $\Sigma$. We provide empirical confirmation of this conjecture in Section 4.2. It therefore makes sense to apply our algorithm to high-dimensional problems even when the number of particles is not large: if the target density has significant support close to a low-dimensional submanifold, we might still obtain a reasonable approximation.

#### *3.5. Structured Mean-Field*

For high-dimensional problems, it may be useful to restrict the variational Gaussian approximation to the posterior to a specific structure via a structured mean-field approximation. In this way, spurious dependencies between variables that are caused by finite-sample effects can be explicitly removed from the algorithms. This is most easily incorporated in our approach by splitting a given collection of latent variables $x$ into $M$ disjoint subsets $x(i)$. We reorder the vector indices in such a way that the first components correspond to $x(1)$, then $x(2)$, and so on. Hence, we obtain $x = \{x(1), x(2), \dots, x(M)\}$. A structured mean-field approach is enforced by imposing a block matrix structure on the update matrix $A_{MF} = A(1) \oplus \dots \oplus A(M)$, where $\oplus$ is the direct sum operator. It is easy to see that this construction corresponds to a related block structure of the $\Gamma$ matrix in Equation (3), which means that the subsets of the random vectors are modeled as independent. Hence, as the number of particles grows to infinity, one recovers from our approach the fixed-point equations of the optimal MF-structured Gaussian variational approximation. Note that using a structured MF does not change the complexity of the algorithm but requires fewer particles to obtain a full-rank solution.
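The direct-sum structure is easy to impose in practice by zeroing the cross-block entries of the update matrix; a small sketch with an assumed split into blocks of sizes 2 and 3:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative structured mean-field split (our choice): x in R^5 divided into
# M = 2 blocks of sizes 2 and 3, x = {x(1), x(2)} after reordering.
sizes = [2, 3]
D = sum(sizes)
A = rng.standard_normal((D, D))          # a dense update matrix A^t

# Direct sum A_MF = A(1) (+) A(2): zero out all cross-block entries.
mask = np.zeros((D, D))
start = 0
for s in sizes:
    mask[start:start + s, start:start + s] = 1.0
    start += s
A_MF = A * mask
print(A_MF)
```

The same masking applied to $\Gamma$ keeps the blocks of the random vector independent throughout the flow.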

## *3.6. Comparison with SVGD*

Given the similarities with the SVGD methods [24], one may ask how our approach differs. The model proposed by [10], using a *linear kernel* $k(x, x') = x^\top x' + 1$, has similar properties to our approach. The variable update becomes:

$$\begin{aligned} \frac{d\mathbf{x}}{dt} &= \frac{1}{N}\sum_{i=1}^{N}\bigl(-k(\mathbf{x}_i,\mathbf{x})\,\nabla\varphi(\mathbf{x}_i) + \nabla_{\mathbf{x}_i} k(\mathbf{x}_i,\mathbf{x})\bigr) \\ &= \mathbb{E}_{\hat{q}}\Bigl[I - \nabla\varphi(\mathbf{x})\mathbf{x}^\top\Bigr]\mathbf{x} - \mathbb{E}_{\hat{q}}[\nabla\varphi(\mathbf{x})] \end{aligned}$$

The fixed points are

$$\begin{aligned} 0 &= \mathbb{E}_{\hat{q}}[\nabla\varphi(\mathbf{x})] \\ I &= \mathbb{E}_{\hat{q}}\Bigl[\nabla\varphi(\mathbf{x})\mathbf{x}^\top\Bigr] = \mathbb{E}_{\hat{q}}\Bigl[\nabla\varphi(\mathbf{x})(\mathbf{x}-m)^\top\Bigr] \end{aligned}$$

where the last equality holds since E<sub>*q̂*</sub>[∇*ϕ*(*x*)] = 0. These are the same fixed points as those of our algorithm (9). Similarly to Theorem 1, *D* + 1 particles will converge to the exact *D*-dimensional multivariate Gaussian target. However, the generated flows are different. The main difference is that *we normalize our flow via the L*<sup>2</sup> *norm*, whereas [10] relies on the *reproducing kernel Hilbert space (RKHS) norm*, i.e., ‖*ϕ*‖<sup>2</sup><sub>*k*</sub> = *ϕ*<sup>⊤</sup>*K*<sup>−1</sup>*ϕ*, where *ϕ<sub>i</sub>* = *ϕ*(*x<sub>i</sub>*) and *K<sub>ij</sub>* = *k*(*x<sub>i</sub>*, *x<sub>j</sub>*). For a full introduction to RKHSs, we recommend [34]. Remarkably, centering the particles on the mean, namely, using the modified linear kernel *k*(*x*, *x*′) = (*x* − *m*)<sup>⊤</sup>(*x*′ − *m*) + 1, leads to the same dynamics. Additionally, when using SVGD, there is no direct way of computing the current KL divergence between the variational distribution and the target, unless some values are accumulated [35]. There is also no clear theory explaining what happens when the number of particles is smaller than the number of dimensions, for either distance-based kernels or the linear kernel.
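To make the comparison concrete, one SVGD step with the linear kernel can be sketched as below (our own minimal Python sketch, not the paper's Julia code; `grad_phi` stands for ∇ϕ = −∇ log p, and the second line of the update above is the moment form checked at the end):

```python
import numpy as np

def linear_svgd_step(X, grad_phi, eta=0.01):
    """One SVGD step with the linear kernel k(x, x') = x^T x' + 1.
    X: (N, D) particles; grad_phi maps an (N, D) batch to its (N, D) gradients."""
    N = X.shape[0]
    G = grad_phi(X)                 # gradients of phi at the particles
    K = X @ X.T + 1.0               # kernel Gram matrix
    # attraction: -(1/N) sum_i k(x_i, x_j) grad phi(x_i);
    # repulsion:   (1/N) sum_i grad_{x_i} k(x_i, x_j) = x_j
    return X + eta * ((-K @ G) / N + X)

# Sanity check: the kernel form equals the moment form
# (I - E[grad phi(x) x^T]) x - E[grad phi(x)].
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
grad = lambda X: X                  # standard Gaussian target: grad phi(x) = x
G = grad(X)
M = G.T @ X / len(X)                # E[grad phi(x) x^T]
moment_form = X @ (np.eye(3) - M).T - G.mean(axis=0)
kernel_form = (-(X @ X.T + 1.0) @ G) / len(X) + X
```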

#### **4. Experiments**

We now evaluate the efficiency of GPF and GF. First, given a Gaussian target, we compare the convergence of our approach with popular VGA methods, all described in Section 2. Second, we evaluate the effect of varying the number of particles for both Gaussian and non-Gaussian targets, especially with a low-rank covariance. Then, we evaluate the efficiency of our algorithm on a range of real-world binary classification problems through a Bayesian logistic regression model and a series of BNNs on the MNIST dataset.

All the Julia [36] code and data used to reproduce the experiments are available at the Github repository: https://github.com/theogf/ParticleFlow\_Exp (accessed on 27 July 2021).

#### *4.1. Multivariate Gaussian Targets*

We consider a 20-dimensional multivariate Gaussian target distribution. The mean is sampled from a standard Gaussian, *μ* ∼ N(0, *I<sub>D</sub>*), and the covariance is a dense matrix defined as Σ = *U*Λ*U*<sup>⊤</sup>, where *U* is a unitary matrix and Λ is a diagonal matrix constructed as log<sub>10</sub>(Λ<sub>*ii*</sub>) = log<sub>10</sub>(*κ*)(*i* − 1)/(*D* − 1) − 1, where *κ* is the condition number, i.e., *κ* = Λ<sub>max</sub>/Λ<sub>min</sub>. This means that, for *κ* = 1, we obtain Σ = 0.1*I*, and for *κ* = 100, we obtain eigenvalues ranging uniformly from 0.1 to 10 in log-space.
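The construction of this target can be sketched as follows (our illustrative NumPy sketch; the paper's experiments are in Julia):

```python
import numpy as np

def gaussian_target(D, kappa, rng):
    """Dense covariance Sigma = U Lambda U^T with prescribed condition number:
    log10(Lambda_ii) = log10(kappa) * (i - 1) / (D - 1) - 1."""
    mu = rng.normal(size=D)                        # mean sampled from N(0, I_D)
    U, _ = np.linalg.qr(rng.normal(size=(D, D)))   # random orthogonal (unitary) U
    lam = 10.0 ** (np.log10(kappa) * np.arange(D) / (D - 1) - 1.0)
    return mu, U @ np.diag(lam) @ U.T

rng = np.random.default_rng(0)
mu, Sigma = gaussian_target(20, 100.0, rng)
```

For `kappa = 100`, the eigenvalues of `Sigma` run from 0.1 to 10, matching the description above.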

We compare GPF and GF to the state-of-the-art methods for VGA described in Section 2, namely *Doubly Stochastic VI* **(DSVI)** [14], *Factor Covariance Structure* **(FCS)** [15] with rank *p* = *D*, *iBayes Learning Rule* **(IBLR)** [17] with a full-rank covariance and their Hessian approach, and Stein Variational Gradient Descent with both a linear kernel (**Linear SVGD**) [10] and a squared-exponential kernel (**Sq. Exp. SVGD**) [24]. For all methods, we set the number of particles (or, alternatively, the number of samples used by the estimator) to *D* + 1, and use standard gradient descent (*x*<sub>*t*+1</sub> = *x<sub>t</sub>* + *ηϕ<sub>t</sub>*(*x<sub>t</sub>*)) with a learning rate of *η* = 0.01 for all particle methods. We use RMSProp [37] with a learning rate of 0.01 for all stochastic methods. We run each experiment 10 times with 30,000 iterations, and plot the average error on the mean and the covariance with one standard deviation. For GPF, we additionally evaluate the method with and without natural gradients for the mean (i.e., pre-multiplying the averaged gradient by *C<sup>t</sup>*), indicated with a dashed and a solid line, respectively. Figure 2 reports the *L*<sup>2</sup> norm of the difference between the mean and covariance and those of the true posterior over time for the target condition numbers *κ* ∈ {1, 10, 100}.

**Figure 2.** *L*<sup>2</sup> norm of the difference between the target mean *μ* (left side) and target covariance Σ (right side) with the inferred variational parameters *m<sup>t</sup>* and *C<sup>t</sup>* against time for 20-dimensional Gaussian targets with condition number *κ*. We use *D* + 1 particles/samples and show the mean over 10 runs as well as the 68% credible interval. Methods with dashed curves use natural gradients on the mean. Note that DSVI, GF and FCS are overlapping and are, at this scale, indistinguishable from one another.

As Theorem 1 predicts, GPF converges exactly to the true distribution, regardless of the target. GF and the other methods based on stochastic estimators cannot reach the same precision, as their accuracy is penalized by the gradient noise. IBLR approximates the covariance perfectly despite the stochasticity of its estimator; however, IBLR needs to compute the true Hessian at each step. When using a Hessian approximation instead, IBLR performs just like DSVI; the true benefit of IBLR appears when second-order information is computed, which is naturally intractable in high dimensions. SVGD with a linear kernel achieves good performance but is highly unstable: most of the runs (ignored here) diverge. This is due to the dot product *x*<sup>⊤</sup>*x*′, which can become extremely large, especially for non-centered data. For this reason, we do not consider this method in the later experiments. SVGD with a sq. exp. kernel obtains a good estimate of the mean but fails to approximate the covariance.

Perhaps surprisingly, GF does not perform much better than DSVI or FCS. This is potentially because *the benefit of Riemannian gradients is canceled by the gradient noise* [38], providing a strong argument for particle-based methods over stochastic estimators.

Remarkably, we also confirm Theorem 2: the convergence speed of *C<sup>t</sup>* is independent of the target Σ, while the convergence speed of *m<sup>t</sup>* retains this dependency unless the natural gradient is used (see the dashed curves). The case *κ* = 1 highlights that *natural gradients do not necessarily improve convergence speed*.

#### *4.2. Low-Rank Approximation for Full Gaussian Targets*

We explore the effect of the number of particles for both Gaussian and non-Gaussian targets. We use the same Gaussian target as in the previous experiment, in 50 dimensions, with a full-rank covariance determined by its condition number *κ* = *λ*<sub>max</sub>/*λ*<sub>min</sub>. The covariance eigenvalues *λ<sub>i</sub>* range uniformly in log-space from 0.1 to 0.1*κ*. For a given target multivariate Gaussian, we vary the number of particles from 2 to *D* + 1 and look at the absolute trace error |tr(*C* − Σ)|. The results for *D* = 50, as well as the corresponding predictions (in dashed black) from Theorem 3, are shown in Figure 3.

The empirical results perfectly match the theoretical predictions, confirming that, for Gaussian targets, the particles determine a low-rank approximation whose spectrum is equal to the largest eigenvalues from the target.

**Figure 3.** Trace error for a Gaussian target with *D* = 50 and condition numbers *κ* for a varying number of particles with GPF. Predictions from Theorem 3 are shown in dashed-black.

#### *4.3. High-Dimensional Low-Rank Gaussian Targets*

We consider a typical low-rank target case where the dimensionality is high but the effective rank of the covariance is unknown. The target is given by *p*(*x*) = N(*μ*, Σ), where *μ* ∼ N(0, *I<sub>D</sub>*), the covariance is defined by Σ = *U*Λ*U*<sup>⊤</sup>, *U* is a *D* × *D* unitary matrix, and Λ is a diagonal matrix defined by

$$
\Lambda\_{ii} = \begin{cases}
\mathcal{N}(2,1), & \text{if } i \le K \\
10^{-8}, & \text{otherwise}
\end{cases}
$$

where *K* is the effective rank of the target. We pick *D* = 500 and vary *K* ∈ {10, 20, 30} to simulate a realistic problem where the correct *K* is not known. We test all methods allowing for a low-rank structure, namely, GPF, GF, FCS and SVGD (Linear and Sq. Exp.). We fix the rank (or the number of particles) to 20; we therefore obtain three cases where the rank is exact, under-estimated, and over-estimated. We use RMSProp [37] for the stochastic methods and a diagonal version of it (see Section 4.4) for the particle methods. The error on the mean and the covariance is shown in Figure 4. Note that the difference in the initial error on the covariance is due to the difficulty of starting with the same covariance for particle and stochastic methods.
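This low-rank target can be sketched as follows (our own NumPy sketch; taking absolute values of the N(2, 1) draws to keep the leading eigenvalues positive is an assumption on our part):

```python
import numpy as np

def low_rank_target(D, K, rng):
    """Covariance with effective rank K: the first K eigenvalues are drawn from
    N(2, 1) (absolute values taken, our assumption), the rest are set to 1e-8."""
    U, _ = np.linalg.qr(rng.normal(size=(D, D)))   # random D x D orthogonal matrix
    lam = np.full(D, 1e-8)
    lam[:K] = np.abs(rng.normal(2.0, 1.0, size=K))
    return U @ np.diag(lam) @ U.T

rng = np.random.default_rng(1)
Sigma = low_rank_target(50, 10, rng)
effective_rank = int(np.sum(np.linalg.eigvalsh(Sigma) > 1e-6))
```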

**Figure 4.** Convergence plot of low-rank methods for a 500-dimensional multivariate Gaussian target with effective rank *K* ∈ {10, 20, 30}. The rank of each method is fixed as 20. The difference in the starting point for the covariance is due to the initialization difference between each method. We show the mean over 10 runs for each method with shadowed areas representing the 68% credible interval.

We observe once again that SVGD with a linear kernel fails to converge due to the large gradients. All methods perform equally in the estimation of the mean, unaffected by the rank of the target. As expected, the approximation quality of the covariance degrades as the rank gets larger, but all algorithms still converge to good approximations. SVGD with a sq. exp. kernel performs much worse than the rest of the methods. This is a known phenomenon: in high dimensions, the covariance estimated by SVGD is either over- or underestimated.

#### *4.4. Non-Gaussian Target*

We now investigate the behavior of our algorithm with non-Gaussian target distributions. We built a two-dimensional banana distribution, *p*(*x*) ∝ exp(−0.5(0.01*x*<sub>1</sub><sup>2</sup> + 0.1(*x*<sub>2</sub> + 0.1*x*<sub>1</sub><sup>2</sup> − 10)<sup>2</sup>)), varied the number of particles used for GPF in {3, 5, 10, 20, 50}, and compared the result with a standard full-rank VGA approach. We also show the impact of replacing a fixed *η* with the Adam [39] optimizer for 50 particles. The results are shown in Figure 5. As expected, increasing the number of particles makes the distribution obtained via GPF increasingly closer to the optimal standard VGA, even in this non-Gaussian setting. However, using a momentum-based optimizer such as Adam breaks the linearity assumption of the original flow (5) and leads to a twisted representation of the particles. (We observed the same behavior with other momentum-based optimizers.) A simple modification of the most common optimizers maintains the linearity while still adapting the learning rate to the shape of the problem. Most optimizers accumulate momentum or gradients element-wise and end up modifying the updates as *x*<sub>*t*+1</sub> = *x<sub>t</sub>* + *P<sub>t</sub>* ⊙ *ϕ<sub>t</sub>*(*x<sub>t</sub>*), where *P<sub>t</sub>* is the preconditioner obtained via the optimizer and ⊙ is the Hadamard product. By instead taking the average over each dimension, we obtain the updates *x*<sub>*t*+1</sub> = *x<sub>t</sub>* + *P<sub>t</sub>ϕ<sub>t</sub>*(*x<sub>t</sub>*), where *P<sub>t</sub>* is a *D* × *D* diagonal matrix. The details of the dimension-wise preconditioners for Adam, AdaGrad and AdaDelta are given in Appendix H.
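One plausible reading of this dimension-wise construction, sketched here for an RMSProp-style preconditioner (this is our own hedged sketch, not the exact Adam/AdaGrad/AdaDelta versions from Appendix H):

```python
import numpy as np

def dimensionwise_rmsprop_step(X, phi, state, eta=0.01, rho=0.9, eps=1e-8):
    """x_{t+1} = x_t + P_t phi_t(x_t) with a single diagonal D x D preconditioner
    P_t shared by all particles: squared velocities are averaged over the
    particles before the running accumulation, so the update stays linear in
    the particles. X, phi: (N, D) arrays of particles and their velocities."""
    v = state.get("v", np.zeros(X.shape[1]))
    v = rho * v + (1.0 - rho) * np.mean(phi ** 2, axis=0)  # one accumulator per dimension
    state["v"] = v
    P = eta / (np.sqrt(v) + eps)                           # diagonal of P_t
    return X + phi * P, state

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
vel = rng.normal(size=(5, 3))
X1, state = dimensionwise_rmsprop_step(X, vel, {})
# Shifting all particles by a constant shifts the update by the same constant:
# the preconditioned flow is still linear in the particles.
X1_shifted, _ = dimensionwise_rmsprop_step(X + 1.0, vel, {})
```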

**Figure 5.** Two-dimensional Banana distribution. Comparison of GPF using an increasing number of particles and a different optimizer (ADAM) with the standard VGA (rightmost plot).

#### *4.5. Bayesian Logistic Regression*

Finally, we considered a range of real-world binary classification problems modeled with Bayesian logistic regression. Given some data {(*x<sub>i</sub>*, *y<sub>i</sub>*)}<sup>*N*</sup><sub>*i*=1</sub>, where *x<sub>i</sub>* ∈ R<sup>*D*</sup> and *y<sub>i</sub>* ∈ {−1, 1}, we defined the model *y<sub>i</sub>* ∼ Bernoulli(*σ*(*w*<sup>⊤</sup>*x<sub>i</sub>*)) with weights *w* ∈ R<sup>*D*</sup>, where *σ* is the logistic function, and placed the prior *w* ∼ N(0, 10*I<sub>D</sub>*). We benchmarked the competing approaches over four datasets from the UCI repository [40]: spam (*N* = 4601, *D* = 104), krkp (*N* = 3196, *D* = 37), ionosphere (*N* = 351, *D* = 111) and mushroom (*N* = 8124, *D* = 95). We ran all algorithms discussed in Section 4.1, both with and without a mean-field approximation; SVGD was omitted since it is too unstable. All algorithms were run with a fixed learning rate *η* = 10<sup>−4</sup>, and we used mini-batches of size 100. We show alternative training settings in Appendix I. Note that, under mean-field, FCS simplifies to DSVI. Additionally, we did not consider full-rank IBLR, as it is too expensive, and we used their reparametrized gradient version for the Hessian. Figure 6 shows the average negative log-likelihood on 10-fold cross-validation with one standard deviation for each dataset. While, as expected, the advantages shown for Gaussian targets do not transfer to non-Gaussian targets, GPF and GF are consistently on par with their competitors. On the other hand, IBLR tends to be outperformed. It is also interesting to note that the mean-field assumption does not seem to have a negative impact on these problems: performance remains the same as with a full-rank matrix.
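For concreteness, the negative log-posterior and its gradient for this model can be sketched as follows (our own illustrative Python sketch, not the paper's Julia code; `prior_var = 10` matches the prior above):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def phi(w, X, y, prior_var=10.0):
    """phi(w) = -log p(y | X, w) - log p(w) + const, labels y in {-1, +1}."""
    return np.sum(np.log1p(np.exp(-y * (X @ w)))) + w @ w / (2.0 * prior_var)

def grad_phi(w, X, y, prior_var=10.0):
    """Gradient of phi: -sum_i y_i sigma(-y_i w^T x_i) x_i + w / prior_var."""
    s = sigmoid(-y * (X @ w))
    return -X.T @ (y * s) + w / prior_var
```

This unnormalized gradient is all the particle updates need; the posterior itself is never normalized.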

**Figure 6.** Average negative log-likelihood on a test set over 10 runs against training time for a Bayesian logistic regression model applied to different datasets. Top plots use a mean-field approximation, while bottom plots use a low-rank structure for the covariance with rank *L* = 100.

#### *4.6. Bayesian Neural Network*

We ran our algorithm on a standard network with two hidden layers, each with *L* = 200 neurons and tanh activation functions (we additionally tried ReLU [41], but some baselines failed to converge). We trained on the MNIST dataset [42] (*N* = 60,000, *D* = 784) and used an isotropic prior on the weights, *p*(*w*) = N(0, *αI<sub>D</sub>*) with *α* = 1.0. We additionally compared with *Stochastic Weight Averaging-Gaussian* **(SWAG)** [27], with an SGD learning rate of 10<sup>−6</sup> (selected empirically), and *Efficient Low-Rank Gaussian Variational Inference* **(ELRGVI)** [26]. We varied the assumptions on the covariance matrix to be diagonal (**Mean-Field**) or to have rank *L* ∈ {5, 10}. Additionally, we show, for GPF, the effect of a structured mean-field assumption imposing independence of the weights between layers (**GPF (Layers)**).

We trained each algorithm for 5000 iterations with a batch size of 128 (∼10 epochs) and report the final average negative log-likelihood, accuracy and expected calibration error [43] on the test set (*N* = 10,000) in Table 1. The predictive distribution is given by

$$p(y=k|\mathbf{x}^\*, \mathcal{D}) = \int p(y=k|\mathbf{x}^\*, w)p(w|\mathcal{D})dw \approx \int p(y=k|\mathbf{x}^\*, w)q(w)dw,$$

where D is the training data, and *x*<sup>∗</sup> is a test sample. We computed the accuracy and the average negative test log-likelihood as:

$$\begin{aligned} \text{Acc} &= \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}_{y_i}\Bigl(\arg\max_k\, p(y=k\,|\,\mathbf{x}_i^*, \mathcal{D})\Bigr), \\ \text{NLL} &= -\frac{1}{N}\sum_{i=1}^{N} \log p(y=y_i\,|\,\mathbf{x}_i^*, \mathcal{D}), \end{aligned}$$

where **1**<sub>*y*</sub>(*x*) is the indicator function (equal to 1 if *y* = *x* and 0 otherwise). For the definition of the expected calibration error, we refer the reader to [43]. Additional convergence and uncertainty calibration plots can be found in Appendix I.
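These two metrics can be computed from the Monte Carlo-averaged predictive probabilities as follows (a small sketch of our own; `probs` holds *p*(*y* = *k* | *x<sub>i</sub>*<sup>∗</sup>, D) for each test point):

```python
import numpy as np

def accuracy_and_nll(probs, y):
    """probs: (N, K) predictive probabilities p(y = k | x_i^*, D); y: (N,) labels."""
    acc = np.mean(np.argmax(probs, axis=1) == y)
    nll = -np.mean(np.log(probs[np.arange(len(y)), y]))
    return acc, nll

# Perfectly confident correct predictions give Acc = 1 and NLL = 0.
probs = np.array([[1.0, 0.0], [0.0, 1.0]])
acc, nll = accuracy_and_nll(probs, np.array([0, 1]))
```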

**Table 1.** Negative Log-Likelihood (NLL), Accuracy (Acc), and Expected Calibration Error (ECE) for *Bayesian Neural Networks* **(BNN)** on the MNIST dataset. We varied the rank of the variational covariance from mean-field (all variables are independent) to a low-rank structure with *L* ∈ {5, 10}. Bold numbers indicate the best performance, and italic bold numbers indicate the best performance when restricted to VGA methods. Convergence and calibration plots can be found in Appendix I.


Overall, the SVGD method performed best in terms of both accuracy and negative log-likelihood. However, SVGD is not in the same category as the others, since it is not a VGA. Among VGAs, we observed that a low-rank approximation improves upon mean-field methods. In particular, assuming independence between layers provides a large advantage to GPF. GPF and GF generally perform as well as or better than all the other VGA methods. Note that, although not reported here, all methods needed approximately the same time for the 5000 iterations, except for SWAG, which only needed the MAP and a few thousand iterations of SGD afterward, making it generally faster but also less controlled (a grid search was needed to find an appropriate learning rate for SGD).

#### **5. Discussion**

We introduced GPF, a general-purpose and theoretically grounded particle-based approach to inference with variational Gaussians, as well as GF, its parameter-space version. We showed the convergence of the particle algorithm based on an empirical approximation of the free energy. We also showed that we can approximate high-dimensional targets by allowing for low-rank approximations with a small number of particles. The results for Gaussian targets suggest that the posterior covariance approximation may converge asymptotically fast, with little dependence on the target. This work is a first step in analyzing convergence speed and guarantees for inference with variational Gaussians, and future work could extend the guarantees to non-Gaussian problems. One could also take advantage of existing particle-based VI methods to accelerate inference further or reach better optima [44,45].

**Author Contributions:** Conceptualization, T.G.-F. and M.O.; methodology, T.G.-F., V.P. and M.O.; software, T.G.-F.; validation, T.G.-F.; formal analysis, T.G.-F.; investigation, T.G.-F.; resources, T.G.-F. and V.P.; data curation, T.G.-F.; writing—original draft preparation, T.G.-F., V.P. and M.O.; writing—review and editing, T.G.-F., V.P. and M.O.; visualization, T.G.-F.; supervision, M.O.; project administration, T.G.-F.; funding acquisition, M.O. All authors have read and agreed to the published version of the manuscript.

**Funding:** We acknowledge the support of the German Research Foundation and the Open Access Publication Fund of TU Berlin.

**Data Availability Statement:** Datasets can be found on the UCI dataset website [40], and the MNIST dataset can be found on Yann LeCun's website [42].

**Acknowledgments:** We thank Fela Winkelmolen for his initial help on computations, Jannik Thümmel for his work on the linear SVGD and the reviewers for their insightful comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Derivation of the Optimal Parameters**

In Section 3, we considered the optimization problem:

$$\min_{A^t, b^t \in \mathcal{B}} \frac{d\mathcal{F}[q^t]}{dt} \quad \text{where } \mathcal{B} = \Bigl\{A^t, b^t : \|A^t\|_F^2 = 1,\ \|b^t\|^2 = 1\Bigr\},$$

where we have introduced the Frobenius norm ‖*A*‖<sup>2</sup><sub>*F*</sub> = tr(*AA*<sup>⊤</sup>) for *A<sup>t</sup>* and the *L*<sup>2</sup> norm for *b<sup>t</sup>*, and

$$\frac{d\mathcal{F}[q^t]}{dt} = -\text{tr}\left[A^t(A^t_\star)^\top\right] - (b^t)^\top b^t_\star. \tag{A1}$$

To solve this problem, we used the Lagrange multiplier method. We write the Lagrangian as:

$$\mathcal{L}(A^t, b^t) = \frac{d\mathcal{F}[q^t]}{dt} - \lambda_A\, g(A^t) - \lambda_b\, h(b^t),$$

where *g*(*A*) = tr(*AA*<sup>⊤</sup>) − 1 and *h*(*b*) = ‖*b*‖<sup>2</sup><sub>2</sub> − 1. For simplicity, we can split the problem as:

$$\begin{aligned} \mathcal{L}(A^t) &= -\text{tr}\left[A^t (A^t_\star)^\top\right] - \lambda_A\, g(A^t) \\ \mathcal{L}(b^t) &= -(b^t)^\top b^t_\star - \lambda_b\, h(b^t) \end{aligned}$$

For *A<sup>t</sup>*, the first-order conditions are:

$$\begin{aligned} \nabla_{A^t} \text{tr}\left[A^t (A^t_\star)^\top\right] &= \lambda_A \nabla_{A^t}\, g(A^t) \\ g(A^t) &= 0 \end{aligned}$$

Computing the gradients is straightforward:

$$\begin{aligned} A^t_\star &= 2\lambda_A A^t \\ \Rightarrow A^t &= \frac{A^t_\star}{2\lambda_A} \\ \Rightarrow \frac{1}{4\lambda_A^2}\,\text{tr}\bigl(A^t_\star (A^t_\star)^\top\bigr) &= 1 \\ \Rightarrow \lambda_A &= \sqrt{\frac{\text{tr}\bigl(A^t_\star (A^t_\star)^\top\bigr)}{4}}, \end{aligned}$$

which gives us the result *A<sup>t</sup>* = *A<sup>t</sup><sub>⋆</sub>*/‖*A<sup>t</sup><sub>⋆</sub>*‖<sub>*F*</sub>. Similarly, for *b<sup>t</sup>*:

$$\begin{aligned} \nabla_{b^t}\, (b^t)^\top b^t_\star &= \lambda_b \nabla_{b^t}\, h(b^t), \\ h(b^t) &= 0. \end{aligned}$$

Replacing the gradients gives:

$$\begin{aligned} b^t_\star &= 2\lambda_b b^t \\ \Rightarrow b^t &= \frac{b^t_\star}{2\lambda_b} \\ \Rightarrow \frac{1}{4\lambda_b^2}\,\|b^t_\star\|_2^2 &= 1 \\ \Rightarrow \lambda_b &= \frac{\|b^t_\star\|_2}{2}, \end{aligned}$$

which gives us the result *b<sup>t</sup>* = *b<sup>t</sup><sub>⋆</sub>*/‖*b<sup>t</sup><sub>⋆</sub>*‖<sub>2</sub>.
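These closed forms can be sanity-checked numerically (our own verification sketch): by Cauchy–Schwarz, among matrices with unit Frobenius norm, *A<sup>t</sup><sub>⋆</sub>*/‖*A<sup>t</sup><sub>⋆</sub>*‖<sub>*F*</sub> maximizes tr(*A*(*A<sup>t</sup><sub>⋆</sub>*)<sup>⊤</sup>):

```python
import numpy as np

rng = np.random.default_rng(0)
A_star = rng.normal(size=(4, 4))
A_opt = A_star / np.linalg.norm(A_star, "fro")   # A^t = A^t_* / ||A^t_*||_F
best = np.trace(A_opt @ A_star.T)                # equals ||A^t_*||_F

# Any other matrix with unit Frobenius norm yields a smaller inner product.
for _ in range(100):
    A = rng.normal(size=(4, 4))
    A /= np.linalg.norm(A, "fro")
    assert np.trace(A @ A_star.T) <= best + 1e-12
```

The same argument applies verbatim to *b<sup>t</sup>* with the Euclidean norm in place of the Frobenius norm.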

#### **Appendix B. Relaxation of the Empirical Free Energy**

We prove the decrease in the empirical free energy (17) under the particle flow when the covariance *C* is nonsingular. We define the empirical distribution *q̂*(*x*) = (1/*N*) ∑<sup>*N*</sup><sub>*i*=1</sub> *δ*(*x* − *x<sub>i</sub>*) with a finite number *N* of particles. The empirical free energy is defined as

$$\mathcal{F}[\hat{q}] = \mathbb{E}_{\hat{q}}[\varphi(\mathbf{x})] - \frac{1}{2}\log|C|.$$

We are interested in the temporal change of the free energy, when particles move under a general linear dynamics

$$\frac{dx\_i}{dt} = b + A(x\_i - m).$$

The induced dynamics for F are:

$$\frac{d\mathcal{F}}{dt} = \mathbb{E}_{\hat{q}}\left[\nabla_{\mathbf{x}}\varphi(\mathbf{x})^\top \frac{d\mathbf{x}}{dt}\right] - \frac{1}{2}\,\text{tr}\Bigl(C^{-1}\frac{dC}{dt}\Bigr).$$

For notational simplicity, we introduce *g*(*x*) = ∇<sub>*x*</sub>*ϕ*(*x*) and *ẋ* = *dx*/*dt* (similarly, *ṁ* = *dm*/*dt*).

$$\begin{aligned} \frac{dC}{dt} &= \frac{d}{dt}\,\mathbb{E}_{\hat{q}}\left[(\mathbf{x}-m)(\mathbf{x}-m)^\top\right] \\ &= \mathbb{E}_{\hat{q}}\left[\dot{\mathbf{x}}(\mathbf{x}-m)^\top\right] + \mathbb{E}_{\hat{q}}\left[(\mathbf{x}-m)\dot{\mathbf{x}}^\top\right], \end{aligned}$$

where the terms involving $\dot{m}$ vanish because $\mathbb{E}_{\hat{q}}[\mathbf{x}-m]=0$. Hence

$$\begin{aligned} \frac{d\mathcal{F}}{dt} &= \mathbb{E}_{\hat{q}}\left[g(\mathbf{x})^\top\dot{\mathbf{x}}\right] - \frac{1}{2}\,\mathbb{E}_{\hat{q}}\left[\text{tr}\bigl(C^{-1}\dot{\mathbf{x}}(\mathbf{x}-m)^\top\bigr) + \text{tr}\bigl(C^{-1}(\mathbf{x}-m)\dot{\mathbf{x}}^\top\bigr)\right] \\ &= \mathbb{E}_{\hat{q}}\left[\dot{\mathbf{x}}^\top\bigl(g(\mathbf{x}) - C^{-1}(\mathbf{x}-m)\bigr)\right], \end{aligned} \tag{A2}$$

where we used the cyclic property of the trace.

Plugging the dynamics into Equation (A2), we obtain:

$$\begin{aligned} \frac{d\mathcal{F}}{dt} &= b^\top\,\mathbb{E}_{\hat{q}}[g(\mathbf{x})] + \mathbb{E}_{\hat{q}}\left[(\mathbf{x}-m)^\top A^\top g(\mathbf{x})\right] \\ &\quad - \mathbb{E}_{\hat{q}}\left[(\mathbf{x}-m)^\top A^\top C^{-1}(\mathbf{x}-m)\right], \end{aligned} \tag{A3}$$

where we used the fact that *b*<sup>⊤</sup>*C*<sup>−1</sup>E<sub>*q̂*</sub>[*x* − *m*] = 0.

We next look for conditions on *b* and *A* under which *d*F/*dt* < 0, i.e., the dynamics lead to a decrease in the free energy. We pick *b* = −*β*<sub>1</sub>E<sub>*q̂*</sub>[*g*(*x*)], where *β*<sub>1</sub> > 0, and we obtain, for the first term in (A3):

$$-\beta\_1 \| \mathbb{E}\_q[\mathbf{g}(\mathbf{x})] \|^2 \le 0.$$

For *A*, let us first define *ψ* = E<sub>*q̂*</sub>[*g*(*x*)(*x* − *m*)<sup>⊤</sup>] and rewrite the second and last terms of Equation (A3) as:

$$\begin{aligned} \mathbb{E}_{\hat{q}}\left[(\mathbf{x}-m)^\top A^\top g(\mathbf{x})\right] &= \text{tr}\Bigl(\mathbb{E}_{\hat{q}}\left[A^\top g(\mathbf{x})(\mathbf{x}-m)^\top\right]\Bigr) = \text{tr}\bigl(A^\top\psi\bigr) \\ \mathbb{E}_{\hat{q}}\left[(\mathbf{x}-m)^\top A^\top C^{-1}(\mathbf{x}-m)\right] &= \text{tr}\bigl(A^\top C^{-1}C\bigr) = \text{tr}(A) \end{aligned}$$

Combining both, we get tr(*A*<sup>⊤</sup>(*ψ* − *I*)). Similarly to the previous step, we pick *A* = −*β*<sub>2</sub>(*ψ* − *I*), where *β*<sub>2</sub> ≥ 0, which leads to another negative term:

$$-\beta\_2 \text{tr}((\psi - I)^\top (\psi - I)) \le 0,$$

where we use the fact that *X*<sup>⊤</sup>*X* is a positive semi-definite matrix for any real-valued *X*.

Note that different forms of *A* could be used (e.g., *β*<sub>2</sub> replaced by a positive definite matrix), as long as the trace of the product stays positive. Inserting *b* and *A*, the free energy dynamics become

$$\frac{d\mathcal{F}}{dt} = -\beta_1\|\mathbb{E}_{\hat{q}}[g(\mathbf{x})]\|^2 - \beta_2\,\text{tr}\bigl((\psi-I)^\top(\psi-I)\bigr).$$

The variable dynamics are given by

$$\begin{aligned} \frac{d\mathbf{x}}{dt} &= -\beta_1\,\mathbb{E}_{\hat{q}}[g(\mathbf{x})] - \beta_2(\psi - I)(\mathbf{x}-m) \\ &= -\beta_1\,\mathbb{E}_{\hat{q}}[g(\mathbf{x})] - \beta_2\Bigl(\mathbb{E}_{\hat{q}}\bigl[g(\mathbf{x})(\mathbf{x}-m)^\top\bigr] - I\Bigr)(\mathbf{x}-m), \end{aligned}$$

which is equivalent to Equation (5), for *β*<sup>1</sup> = *β*<sup>2</sup> = 1. Our result shows that the empirical approximation of the free energy decreases under the particle flow.
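To illustrate, a forward-Euler discretization of this particle flow on a small Gaussian target can be sketched as follows (our own Python sketch with *β*<sub>1</sub> = *β*<sub>2</sub> = 1 and an illustrative step size *η*; the paper's experiments use Julia). With *D* + 1 particles it recovers the exact target, as Theorem 1 states:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 2, 3                          # D + 1 particles
mu = np.array([1.0, -1.0])
Sigma = np.diag([1.0, 2.0])
Sigma_inv = np.linalg.inv(Sigma)

X = rng.normal(size=(N, D))          # particles, one per row
eta = 0.05
for _ in range(5000):
    m = X.mean(axis=0)
    G = (X - mu) @ Sigma_inv.T       # g(x) = Sigma^{-1}(x - mu) for a Gaussian target
    psi = G.T @ (X - m) / N          # psi = E[g(x)(x - m)^T]
    dX = -G.mean(axis=0) - (X - m) @ (psi - np.eye(D)).T
    X = X + eta * dX

m = X.mean(axis=0)
C = (X - m).T @ (X - m) / N          # m converges to mu, C to Sigma
```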

#### **Appendix C. Riemannian Gradient for Matrix Parameter Γ**

The parameter flow for the matrix Γ in (11) is given by

$$\frac{d\Gamma^t}{dt} = \Gamma^t - \mathbb{E}\_{q^0} \left[ \nabla\_{\mathbf{x}} \varphi(\mathbf{x}^t) (\mathbf{x}^0 - m^0)^\top \right] \Gamma^t (\Gamma^t)^\top.$$

This is easily rewritten in terms of the parameter gradient as *d*Γ<sup>*t*</sup>/*dt* = −(∂F/∂Γ) Γ<sup>*t*</sup>(Γ<sup>*t*</sup>)<sup>⊤</sup>.

Similar to natural gradients, which are defined by the metric induced by the Fisher information matrix, we can rewrite the parameter change in terms of a different *Riemannian* gradient. This gradient is the direction of change *d*Γ = Γ(*t* + *dt*) − Γ(*t*) that yields the steepest descent of the free energy over a small time interval *dt*. As an extra condition, one keeps the length of *d*Γ fixed, measured by a 'natural' metric with specific invariance properties. This length is defined by an inner product ⟨*d*Γ, *d*Γ⟩<sub>Γ</sub> in the tangent space of small deviations *d*Γ from the matrix Γ. Hence, *d*Γ is found by minimising F(Γ(*t*) + *d*Γ, *m*) (for small *d*Γ) under the condition that ⟨*d*Γ, *d*Γ⟩<sub>Γ(*t*)</sub> is fixed. Following [20] (Theorem 6), a natural metric in the space of symmetric nonsingular matrices can be defined as

$$\langle d\Gamma, d\Gamma \rangle\_{\Gamma} \doteq \text{tr}\left( (d\Gamma \,\, \Gamma^{-1})^\top d\Gamma \,\, \Gamma^{-1} \right).$$

This metric is invariant under multiplications of Γ and *d*Γ by a matrix *Y*, i.e., ⟨*d*Γ, *d*Γ⟩<sub>Γ</sub> = ⟨*d*Γ*Y*, *d*Γ*Y*⟩<sub>Γ*Y*</sub>, and reduces to the Euclidean metric at the unit matrix Γ = *I*.
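The invariance is immediate since (*d*Γ*Y*)(Γ*Y*)<sup>−1</sup> = *d*Γ Γ<sup>−1</sup>; a quick numerical check of both properties (our own sketch):

```python
import numpy as np

def metric(dG, G):
    """<dGamma, dGamma>_Gamma = tr((dGamma Gamma^{-1})^T dGamma Gamma^{-1})."""
    M = dG @ np.linalg.inv(G)
    return np.trace(M.T @ M)

rng = np.random.default_rng(0)
G = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)   # a well-conditioned Gamma
dG = 0.01 * rng.normal(size=(3, 3))
Y = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)   # an arbitrary invertible Y

lhs = metric(dG, G)                  # <dG, dG>_G
rhs = metric(dG @ Y, G @ Y)          # <dG Y, dG Y>_{G Y}: same value
```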

The direction of the natural gradient is obtained by expanding the free energy for small *d*Γ and introducing a Lagrange–multiplier *λ* for the constraint. One ends up with the quadratic form

$$\text{tr}\left(\Bigl(\frac{\partial \mathcal{F}}{\partial \Gamma}\Bigr)^{\top} d\Gamma\right) + \lambda\,\text{tr}\left(\left(d\Gamma\,\Gamma^{-1}\right)^{\top} d\Gamma\,\Gamma^{-1}\right),$$

to be minimised by *d*Γ. By taking the derivative with respect to *d*Γ, one finds that the direction of *d*Γ agrees with the right equation of the flow (11).

#### **Appendix D. Regularised Free Energy for** *N* ≤ *D*

The problem with defining an empirical approximation for *N* ≤ *D* particles is that the empirical covariance becomes singular: it typically has *N* − 1 nonzero eigenvalues, and thus |*C*| = 0. Note that the extra zero eigenvalue follows from the fact that the empirical sum of fluctuations must be zero, which provides an additional linear constraint.

We can regularise the log-determinant term by replacing the zero eigenvalues of *C*: *λ<sub>i</sub>* = 0 → *λ̃<sub>i</sub>* = 1. The log-determinant of the new covariance *C̃* becomes

$$\log|\widetilde{C}| = \sum_{i:\lambda_i>0} \log \lambda_i,$$

since log 1 = 0. The dynamics of the particles stays the same. To rewrite this formally in terms of matrices, we define

$$
\bar{\mathsf{C}} = \mathsf{C} + \mathsf{C}\_{\perp}
$$

where

$$\mathcal{C}\_{\perp} = \sum\_{i:\lambda\_i = 0} e\_i e\_i^\top$$

and *e<sub>i</sub>* is the *i*th eigenvector of *C*. This replaces all zero eigenvalues by 1. *C*<sub>⊥</sub> is a projector: *C*<sup>2</sup><sub>⊥</sub> = *C*<sub>⊥</sub> and *C*<sub>⊥</sub>(*I* − *C*<sub>⊥</sub>) = 0. We also have tr(*C*<sub>⊥</sub>) = *D* − (*N* − 1). In the following, it is useful to introduce the *D* × *N* matrix of fluctuations *Z*, such that *C* = *ZZ*<sup>⊤</sup>/*N*. The column vectors of *Z* span the subspace of eigenvectors *e<sub>i</sub>* with *λ<sub>i</sub>* > 0. Hence, it follows that *C*<sub>⊥</sub>*Z* = 0.
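The construction of *C̃* and the regularised log-determinant can be checked numerically (a small sketch under the stated conventions, with our own variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 6, 4
Z = rng.normal(size=(D, N))
Z -= Z.mean(axis=1, keepdims=True)     # fluctuations sum to zero -> rank N - 1

C = Z @ Z.T / N                        # singular empirical covariance
lam, E = np.linalg.eigh(C)
zero = lam < 1e-10                     # the D - (N - 1) zero eigenvalues
C_perp = E[:, zero] @ E[:, zero].T     # projector onto the null space of C
C_tilde = C + C_perp                   # replaces the zero eigenvalues by 1

log_det = np.sum(np.log(lam[~zero]))   # log|C~| = sum over positive eigenvalues
```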

We want to show that the regularised free energy *F̃* decreases under the particle dynamics for *N* ≤ *D*. Since the part of the time derivative of *F̃* that depends on *dm*/*dt* is unchanged, we only discuss the fluctuation part in the following.

It is useful to introduce the matrix:

$$\widetilde{A} \doteq I - C_{\perp} - gZ^{\top}/N = A - C_{\perp},$$

where *g* is the *D* × *N* matrix whose columns are the gradients ∇<sub>*x*</sub>*ϕ*(*x<sub>i</sub>*). We then have

$$\begin{split} \mathbb{E}_{\hat{q}} \left[ g(x)^{\top} \frac{dx}{dt} \right] &= \mathrm{tr}(A) - \mathrm{tr}(A^{\top} A) \\ &= \mathrm{tr}(\widetilde{A} + C_{\perp}) - \mathrm{tr}\big((\widetilde{A} + C_{\perp})^{\top}(\widetilde{A} + C_{\perp})\big) \\ &= \mathrm{tr}(\widetilde{A}) - \mathrm{tr}(\widetilde{A}^{\top} \widetilde{A}). \end{split}$$

To obtain this result, we need

$$\begin{aligned} \mathrm{tr}(C_{\perp}\widetilde{A}) &= \mathrm{tr}(C_{\perp}\widetilde{A}^{\top}) \\ &= \mathrm{tr}\big(C_{\perp}(I - C_{\perp}) - C_{\perp}Z g^{\top}/N\big) = 0. \end{aligned}$$

We need to work out

$$\begin{aligned} -\frac{1}{2}\frac{d\ln|\tilde{\mathcal{C}}|}{dt} &= -\frac{1}{2}\text{tr}\left(\frac{d\tilde{\mathcal{C}}}{dt}\tilde{\mathcal{C}}^{-1}\right) \\ &= -\frac{1}{2}\text{tr}\left(\frac{d\mathcal{C}}{dt}\tilde{\mathcal{C}}^{-1}\right) \end{aligned}$$

where we have used the fact that the eigenvalues $\lambda_i = 1$ of $\widetilde{C}$ have a zero time derivative and can be omitted. We use the linear dynamics $dZ/dt = AZ$ to obtain:

$$\begin{split} \frac{dC}{dt} &= CA^{\top} + AC \\ &= (\widetilde{C} - C_{\perp})(\widetilde{A}^{\top} + C_{\perp}) + (\widetilde{A} + C_{\perp})(\widetilde{C} - C_{\perp}) \\ &= \widetilde{C}\widetilde{A}^{\top} + \widetilde{A}\widetilde{C} + C_{\perp}\widetilde{C} + \widetilde{C}C_{\perp} - \widetilde{A}C_{\perp} - C_{\perp}\widetilde{A}^{\top} - 2C_{\perp} \\ &= \widetilde{C}\widetilde{A}^{\top} + \widetilde{A}\widetilde{C}, \end{split}$$

where we have used $C_{\perp}^2 = C_{\perp}$ and $C_{\perp}\widetilde{A} = 0$. Hence

$$-\frac{1}{2}\text{tr}\left(\frac{d\tilde{\mathcal{C}}}{dt}\tilde{\mathcal{C}}^{-1}\right) = -\text{tr}(\tilde{\mathcal{A}}).$$

Finally, the temporal change in the free energy due to the fluctuations is given by

$$\frac{d\tilde{\mathcal{F}}}{dt} = -\text{tr}(\tilde{A}^\top \tilde{A}) \le 0.$$

Note that this proof is valid not only for $N \le D$ but also for $N > D$, where the computations simplify since $C_{\perp} = 0$. A more detailed proof for $N > D$ is given in Appendix B.

*Efficient Computation of* $\log|\widetilde{C}|$

A practical way to compute $\log|\widetilde{C}|$ without performing an eigenvector expansion is to define the $N \times N$ matrix

$$R \doteq Z^{\top} Z/N + J_{N,N}/N,$$

where $J_{N,N}$ is the $N \times N$ *all-ones* matrix. $Z^{\top}Z/N$ shares the $N-1$ nonzero eigenvalues with $C$ and has an additional eigenvalue 0 corresponding to the constant eigenvector $(e_N)_i = 1/\sqrt{N}$. Adding $J_{N,N}/N$ preserves all nonzero eigenvalues while replacing the zero one by 1. This leads to the following result:

$$-\frac{1}{2}\log|\mathcal{R}| = -\frac{1}{2}\sum\_{i=1}^{N-1}\log\lambda\_i.$$
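The identity above can be checked numerically. The following is a small NumPy sketch (not part of the paper), with arbitrary illustrative dimensions, comparing the sum of the logs of the nonzero eigenvalues of $C$ against $\log|R|$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 10, 4                      # more dimensions than particles (N <= D)
X = rng.normal(size=(D, N))       # particle positions
Z = X - X.mean(axis=1, keepdims=True)   # D x N matrix of fluctuations
C = Z @ Z.T / N                   # rank N-1 empirical covariance

# Direct route: sum of logs of the N-1 nonzero eigenvalues of C
lam = np.linalg.eigvalsh(C)
direct = np.sum(np.log(lam[lam > 1e-10]))

# Efficient route: the N x N matrix R = Z^T Z / N + J_{N,N} / N
R = Z.T @ Z / N + np.ones((N, N)) / N
efficient = np.linalg.slogdet(R)[1]

assert np.allclose(direct, efficient)
```

The efficient route only needs an $N \times N$ determinant, avoiding the $D \times D$ eigendecomposition entirely.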

## **Appendix E. Proof of Theorem 1: Fixed Points for a Gaussian Model (***N > d***)**

**Theorem A1** (1)**.** *If the target density p*(*x*) *is a D-dimensional multivariate Gaussian, only D* + 1 *particles are needed for Algorithm 2 to converge to the exact target parameters.*

The general fixed-point condition for the dynamics (13) of the position $x_i$ of particle $i$ is given by:

$$(I - \mathbb{E}_{\hat{q}}[g(x)(x - m)^{\top}])(x_i - m) - \mathbb{E}_{\hat{q}}[g(x)] = 0$$

for *i* = 1, . . . , *N*. By taking the expectation over all particles, we obtain:

$$\mathbb{E}_{\hat{q}}[g(x)] = 0,\tag{A4}$$

where $\hat{q}$ is the empirical distribution of the particles at the fixed point. Note that this result is independent of $N$, i.e., it is also valid for $N = 1$.

For a $D$-dimensional Gaussian target $p(x) = \mathcal{N}(\mu, \Sigma)$, we will show that the empirical mean and covariance given by the particle algorithm converge to the true mean and covariance matrix of the Gaussian when we use $N \ge D + 1$ particles. In this setting, we have $\phi(x) = \frac{1}{2} x^{\top}\Sigma^{-1}x - x^{\top}\Sigma^{-1}\mu$. For simplicity, we use the precision matrix $\Lambda = \Sigma^{-1}$ and get

$$\phi(x) = \frac{1}{2} x^{\top} \Lambda x - x^{\top} \Lambda \mu.$$

The gradient *g*(*x*) becomes:

$$g(x) = \Lambda(x - \mu).$$

At the fixed points, we have that $dm/dt$ and $d\Gamma/dt$ are equal to 0. For the mean $m$:

$$\begin{aligned} \frac{dm}{dt} = \mathbb{E}_{\hat{q}}[g(x)] &= 0 \\ \Lambda\, \mathbb{E}_{\hat{q}}[x - \mu] &= 0 \\ \Lambda m &= \Lambda \mu \\ m &= \mu. \end{aligned}$$

For the matrix Γ, we have

$$\begin{aligned} \frac{d\Gamma}{dt} = -A\Gamma &= 0\\ \left(I - \mathbb{E}_{\hat{q}}\Big[ g(x)(x - m)^{\top} \Big]\right)\Gamma &= 0\\ \mathbb{E}_{\hat{q}}\Big[ \Lambda(x - \mu)(x - m)^{\top} \Big]\Gamma &= \Gamma\\ \Lambda\, \mathbb{E}_{\hat{q}}\Big[ (x - m)(x - m)^{\top} \Big]\Gamma &= \Gamma\\ \Lambda C \Gamma &= \Gamma\\ \Lambda C^2 &= C, \end{aligned}$$

where we used the result $m = \mu$ for the mean and, in the last step, right-multiplied by $\Gamma^{\top}$, using $C = \Gamma\Gamma^{\top}$. We can simplify $\Lambda C^2 = C$ to $C = \Lambda^{-1} = \Sigma$ only if $C$ is nonsingular. This holds only if its rank equals $D$, which requires $D + 1$ particles.

## **Appendix F. Proof of Theorem 2: Rates of Convergence for Gaussian Targets**

**Theorem A2** (2)**.** *For a target $p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$, where $x \in \mathbb{R}^D$, and $N \ge D + 1$ particles, the continuous-time limit of Algorithm 2 converges exponentially fast in both the mean and the trace of the precision matrix:*

$$m^t - \mu = e^{-\Lambda t} (m^0 - \mu),$$

$$\mathrm{tr}\big(\left(C^t\right)^{-1} - \Lambda\big) = e^{-2t}\, \mathrm{tr}\big(\left(C^0\right)^{-1} - \Lambda\big),$$

*where $m^t$ and $C^t$ are the empirical mean and covariance matrix at time $t$, and $\exp(-\Lambda t)$ is the matrix exponential.*

In the following, we assume the target $p(x) = \mathcal{N}(\mu, \Sigma)$. We use the notation $\Lambda \doteq \Sigma^{-1}$ and $\delta C^t = C^t - \Sigma$.

*Appendix F.1. Convergence of the Mean*

Given our target $p(x)$, similarly to Appendix E we have $g(x) = \Lambda(x - \mu)$, where $\eta_1 = \Sigma^{-1}\mu$ and $\eta_2 = -\frac{1}{2}\Sigma^{-1}$. This transforms the first of Equations (11) into

$$\begin{aligned} \frac{dm}{dt} &= -\Lambda(\mathbb{E}_{\hat{q}}[x] - \mu) \\ &= -\Lambda(m - \mu). \end{aligned}$$

If we now consider the error on $m$, $\delta m = m - \mu$, we obtain:

$$\begin{split} \frac{d\delta m}{dt} &= \frac{dm}{dt} = -\Lambda (m - \mu) \\ &= -\Lambda \delta m. \end{split}$$

Therefore, the mean converges exponentially fast to the true mean. The asymptotic rate is governed by the largest eigenvalue of Λ, i.e., the inverse of the smallest eigenvalue of Σ, *λ*min.
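The closed-form decay $m^t - \mu = e^{-\Lambda t}(m^0 - \mu)$ can be illustrated numerically. The following NumPy sketch (not from the paper) integrates $dm/dt = -\Lambda(m - \mu)$ with Euler steps for a hypothetical diagonal precision, so the matrix exponential reduces to elementwise exponentials:

```python
import numpy as np

# Hypothetical 2-D Gaussian target; a diagonal precision makes
# exp(-Lambda t) = diag(exp(-lambda_i t)).
Lam = np.diag([1.0, 4.0])        # precision matrix Lambda
mu = np.array([0.5, -1.0])       # target mean
m = np.array([2.0, 2.0])         # initial mean m^0
m0 = m.copy()

t, dt = 1.0, 1e-4
for _ in range(int(t / dt)):     # Euler steps of dm/dt = -Lambda (m - mu)
    m = m - dt * Lam @ (m - mu)

closed_form = mu + np.exp(-np.diag(Lam) * t) * (m0 - mu)
assert np.allclose(m, closed_form, atol=1e-3)
```

Note how the component with the larger precision eigenvalue (4.0) decays much faster, matching the claim that the asymptotic rate is governed by $1/\lambda_{\min}(\Sigma)$.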

#### *Appendix F.2. Convergence of the Covariance Matrix*

Let $z = x - m$. From Equation (5), we have

$$\frac{dz}{dt} = -Az$$

where $A = \mathbb{E}_{\hat{q}}\left[g(x)z^{\top}\right] - I$. This expectation can be further simplified as

$$\mathbb{E}_{\hat{q}}\left[ \Lambda(x - \mu) z^{\top} \right] = \Lambda C, \tag{A5}$$

where *q* ∼ N (*m*, *C*). Hence, we have the exact result

$$\frac{d\mathbb{C}}{dt} = (I - \Lambda\mathbb{C})\mathbb{C} + \mathbb{C}(I - \mathbb{C}\Lambda). \tag{A6}$$

We know that the optimal target is *C* = Σ. Therefore, we define the error *δC* = *C* − Σ. Linearizing Equation (A6) gives us

$$\begin{split} \frac{d\delta\mathbb{C}}{dt} &= \frac{d\mathbb{C}}{dt} = (I - \Lambda(\delta\mathbb{C} + \Sigma))(\delta\mathbb{C} + \Sigma) \\ &\quad + (\delta\mathbb{C} + \Sigma)(I - (\delta\mathbb{C} + \Sigma)\Lambda) \\ &= -\Lambda\delta\mathbb{C}(\delta\mathbb{C} + \Sigma) - (\delta\mathbb{C} + \Sigma)\delta\mathbb{C}\Lambda \\ &\approx -\Lambda\delta\mathbb{C}\Sigma - \Sigma\delta\mathbb{C}\Lambda \end{split}$$

We were not yet able to find a general solution of this equation, but we can obtain a simple result for the trace $y^t \doteq \mathrm{tr}(\delta C)$ at time $t$:

$$\frac{dy^t}{dt} \simeq -2y^t.$$

We therefore have asymptotic exponential decay, $y^t \propto e^{-2t}\, y^0$, which is independent of the parameters of the Gaussian model.

We can also obtain an equivalent non-asymptotic estimate of a specific error measure for the precision matrix. Using Equation (A6), we have the following dynamics for the precision $C^{-1}$:

$$\begin{split} \frac{dC^{-1}}{dt} &= -C^{-1} \frac{dC}{dt} C^{-1} \\ &= -C^{-1} (I - \Lambda C) - (I - C\Lambda) C^{-1}. \end{split}$$

Taking the trace

$$\begin{aligned} \frac{d\,\mathrm{tr}(C^{-1})}{dt} &= -2\,\mathrm{tr}(C^{-1}) + 2\,\mathrm{tr}(\Lambda) \\ \frac{d\,\mathrm{tr}(C^{-1} - \Lambda)}{dt} &= -2\,\mathrm{tr}(C^{-1} - \Lambda). \end{aligned}$$

Hence we get the following exact result:

$$\text{tr}((\mathbb{C}^t)^{-1} - \Lambda) = e^{-2t} \text{tr}((\mathbb{C}^0)^{-1} - \Lambda)$$

which is again independent of the parameters of the Gaussian model.
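This exact trace result can be checked by integrating the covariance flow (A6) directly. The following NumPy sketch (not from the paper, with an arbitrary illustrative precision matrix) uses simple Euler steps:

```python
import numpy as np

# Hypothetical target precision and initial covariance.
Lam = np.array([[2.0, 0.5], [0.5, 1.0]])   # target precision Lambda
C = np.eye(2)                              # initial covariance C^0
I = np.eye(2)

err0 = np.trace(np.linalg.inv(C) - Lam)    # tr((C^0)^{-1} - Lambda)
t, dt = 1.0, 1e-5
for _ in range(int(t / dt)):               # Euler on dC/dt = (I - Lam C) C + C (I - C Lam)
    C = C + dt * ((I - Lam @ C) @ C + C @ (I - C @ Lam))

err_t = np.trace(np.linalg.inv(C) - Lam)
# The trace error decays exactly as e^{-2t}, independently of Lambda.
assert np.isclose(err_t, np.exp(-2.0) * err0, rtol=1e-2)
```

The decay rate 2 appears regardless of how $\Lambda$ is chosen, in line with the statement above.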

Additionally, this tells us that if the covariance $C$ is nonsingular at time $t = 0$, it remains nonsingular for all $t$ (otherwise $\mathrm{tr}(C^{-1})$ would be infinite). Hence, if we start with $N > D$ particles with a proper empirical covariance, they cannot collapse to make $C$ singular.

#### *Appendix F.3. Convergence of the Trace of the Covariance*

The asymptotic result on traces obtained previously can be turned into an exact inequality. We have

$$\frac{d\delta C}{dt} = -\Lambda\,\delta C\,\Sigma - \Sigma\,\delta C\,\Lambda - \Lambda(\delta C)^2 - (\delta C)^2\Lambda.$$

Taking the trace, we get

$$\frac{d\text{tr}(\delta \mathcal{C})}{dt} = -2\text{tr}(\delta \mathcal{C}) - 2\text{tr}(\delta \mathcal{C} \Lambda \delta \mathcal{C})$$

Since $\delta C\,\Lambda\,\delta C$ is positive semidefinite, we have $-2\,\mathrm{tr}(\delta C\,\Lambda\,\delta C) \le 0$ and thus

$$\frac{d\text{tr}(\delta \mathcal{C})}{dt} \le -2\text{tr}(\delta \mathcal{C})$$

leading to:

$$\text{tr}(\delta \mathbf{C}^t) \le \text{tr}(\delta \mathbf{C}^0) e^{-2t}$$

by Grönwall's lemma [46]:

**Lemma A1** (Grönwall)**.** *For an interval I*<sup>0</sup> = [0, ∞) *and a given function f differentiable everywhere in I*<sup>0</sup> *and satisfying:*

$$f'(t) \le \beta(t)f(t), \quad t \in I\_0$$

*then f is bounded by the solution of the corresponding differential equation $g'(t) = \beta(t)g(t)$:*

$$f(t) \le f(0) \exp\left(\int_0^t \beta(s)\, ds\right), \quad t \in I_0.$$

The bound is nontrivial only if $\mathrm{tr}(\delta C) \ge 0$. This would be a natural assumption for a Bayesian model if $C^0$ is the prior covariance and the eigenvalues of $C^t$ at $t = \infty$ (corresponding to the posterior) are reduced by the data.

## *Appendix F.4. Decay of Fluctuation Part of the Free Energy*

Still focusing on the Gaussian model, we can further derive a bound on the free energy. It is easy to see that, in the Gaussian case, the free energy in Equation (4) separates into a sum of two terms: the first depends only on the mean $m^t$, and the second only on the fluctuations (i.e., $C^t$).

We will consider the second, nontrivial part only. We assume that the covariance matrix is nonsingular (corresponding to *N* > *D*). The fluctuation part of the free energy (minus its minimum) is given by

$$\mathcal{F}\_{fl} = -\frac{1}{2}\ln|I - B| - \frac{1}{2}\text{tr}(B).$$

where we have introduced the matrix $B \doteq I - \Lambda C$. One can show that its eigenvalues are real and upper bounded by 1. First, we can show from the equations of motion that

$$\frac{d\mathcal{F}\_{fl}}{dt} = -\text{tr}(BB^\top) \tag{A7}$$

Second, applying the elementary bound $-\ln(1-u) \le \frac{u}{1-u}$, valid for $u < 1$, to the eigenvalues of $B$ yields

$$\begin{aligned} \mathcal{F}\_{fl} &\leq \frac{1}{2} \text{tr}(B(I-B)^{-1} - B) \\ &= \frac{1}{2} \text{tr}(B(I-B)^{-1} - B(I-B)(I-B)^{-1}) \\ &= \frac{1}{2} \text{tr}(B^2(I-B)^{-1}) \\ &= \frac{1}{2} \text{tr}(B^2 \mathbb{C}^{-1} \Lambda^{-1}) \leq \frac{1}{2} \text{tr}(B^\top \Lambda^{-1} B \mathbb{C}^{-1}) \end{aligned}$$

The last two equalities used the definition $B = I - \Lambda C$. Since $B^{\top}\Lambda^{-1}B$ and $C^{-1}$ are both positive definite, we can bound the last term by (see [47], Theorem 6.5)

$$\mathcal{F}_{fl} \le \frac{1}{2}\, \mathrm{tr}(B^{\top} \Lambda^{-1} B)\, \mathrm{tr}(C^{-1}) \le \frac{1}{2}\, \mathrm{tr}(B B^{\top})\, \mathrm{tr}(\Lambda^{-1})\, \mathrm{tr}(C^{-1}),$$

where, in the last line, we have bounded the trace of a product of p.d. matrices a second time.

Combining with Equation (A7) we show that

$$\frac{d\mathcal{F}\_{fl}}{dt} \le -\frac{2\mathcal{F}\_{fl}}{\text{tr}(\Lambda^{-1})\text{tr}(\mathbf{C}^{-1})}$$

We can plug in our result from Theorem 2:

$$\begin{split} \text{tr}(\mathbf{C}^{-1}) &= \text{tr}(\boldsymbol{\Lambda}) + \text{tr}(\mathbf{C}^{-1} - \boldsymbol{\Lambda}) \\ &= \text{tr}(\boldsymbol{\Lambda}) + e^{-2t} \text{tr}((\mathbf{C}^{0})^{-1} - \boldsymbol{\Lambda}) \\ &\leq \text{tr}(\boldsymbol{\Lambda}) + e^{-2t} |\text{tr}((\mathbf{C}^{0})^{-1} - \boldsymbol{\Lambda})| \\ &\leq \text{tr}(\boldsymbol{\Lambda}) + |\text{tr}((\mathbf{C}^{0})^{-1} - \boldsymbol{\Lambda})| \end{split}$$

We can plug this in and use Grönwall's Lemma A1 to get an exponential bound

$$\mathcal{F}\_{fl}(\mathbb{C}^{t}) \le \mathcal{F}\_{fl}(\mathbb{C}^{0})e^{-\left[\frac{2t}{\text{tr}(\Lambda^{-1})(\text{tr}(\Lambda)+|\text{tr}((\mathbb{C}^{0})^{-1}-\Lambda)|)}\right]}.$$

*Appendix F.5. Asymptotic Decay of the Free Energy*

For large times $t$, we can do better. Let us analyse the asymptotic decay constant $\lambda_{free}$, defined via $\mathcal{F}_{fl} \sim e^{-\lambda_{free} t}$ by

$$\begin{aligned} \lambda_{free} &\doteq -\lim_{t \to \infty} \frac{d\ln(\mathcal{F}_{fl})}{dt} = -\lim \frac{\frac{d\mathcal{F}_{fl}}{dt}}{\mathcal{F}_{fl}} \\ &= \lim \frac{\mathrm{tr}(BB^{\top})}{-\frac{1}{2}\ln|I - B| - \frac{1}{2}\mathrm{tr}(B)} \\ &\ge \lim \frac{\mathrm{tr}(B^2)}{-\frac{1}{2}\ln|I - B| - \frac{1}{2}\mathrm{tr}(B)}. \end{aligned}$$

In the last inequality, we used $\mathrm{tr}(BB^{\top}) \ge \mathrm{tr}(B^2)$. Everything is expressed through traces of functions of $B$, and thus through its eigenvalues. Since $B \to 0$ as $t \to \infty$ (and so do its eigenvalues $u$), we can use the Taylor expansion $\ln(1-u) + u = -u^2/2 + O(u^3)$ to show that

$$\lambda_{free} \ge 4,$$

which is independent of Λ.
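The expansion behind this bound is easy to probe numerically. The sketch below (not from the paper, with hypothetical small eigenvalues for $B$) checks that $-\frac{1}{2}\ln|I-B| - \frac{1}{2}\mathrm{tr}(B) \approx \frac{1}{4}\mathrm{tr}(B^2)$ for small $B$, so the ratio approaches 4:

```python
import numpy as np

# For small B, -0.5 ln|I - B| - 0.5 tr(B) ~ tr(B^2)/4,
# so the ratio below should be close to 4.
B = 1e-3 * np.diag([1.0, -0.5, 0.3])   # small diagonal B (hypothetical values)
denom = -0.5 * np.log(np.linalg.det(np.eye(3) - B)) - 0.5 * np.trace(B)
ratio = np.trace(B @ B) / denom
assert np.isclose(ratio, 4.0, rtol=1e-2)
```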

## **Appendix G. Proof of Theorem 3: Fixed-Points for Gaussian Model (***N ≤ D***)**

**Theorem A3** (3)**.** *Given a D-dimensional multivariate Gaussian target density $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$, using Algorithm 2 with $N < D + 1$ particles, the empirical mean converges to the exact mean $\mu$. The $N - 1$ nonzero eigenvalues of $C^t$ converge to a subset of the spectrum of the target covariance $\Sigma$. Furthermore, the global minimum of the regularised version $\widetilde{\mathcal{F}}$ of the free energy (17) corresponds to the largest eigenvalues of $\Sigma$.*

Applying Equation (A4) to our fixed point equation, we obtain

$$(I - \mathbb{E}_{\hat{q}}\left[g(x)(x - m)^{\top}\right])(x_i - m) = 0, \ \forall i = 1, \ldots, N.$$

Hence, the centered positions of the particles, $S = \{x_i - m\}_{i=1}^N$, are all eigenvectors of the matrix $\mathbb{E}_{\hat{q}}\left[g(x)(x - m)^{\top}\right]$ with eigenvalue 1. $S$ spans an $(N-1)$-dimensional space (we have $\sum_{i=1}^N (x_i - m) = 0$).

If we specialise to a Gaussian target $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$ (with $\Lambda = \Sigma^{-1}$), we have $g(x) = \Lambda(x - \mu)$ and can reuse the result from Equation (A5):

$$\begin{split} \mathbb{E}_{\hat{q}}\Big[ g(x)(x - m)^{\top} \Big] &= \Lambda\, \mathbb{E}_{\hat{q}}\Big[ (x - m)(x - m)^{\top} \Big] \\ &= \Lambda C. \end{split}$$

Using the equality above, we get:

$$\begin{aligned} \Lambda \mathcal{C}(\mathbf{x}\_i - m) &= (\mathbf{x}\_i - m) \\ \mathcal{C}(\mathbf{x}\_i - m) &= \Sigma(\mathbf{x}\_i - m), \; \forall i = 1, \dots, N \end{aligned}$$

which shows that the obtained low-rank covariance *C* and the target covariance Σ have *N* − 1 eigenvectors and eigenvalues in common.

However, are these the largest ones? We look at the modified free energy (17) (ignoring the contribution of the mean):

$$\min \tilde{\mathcal{F}} = \min \left\{ -\frac{1}{2} \sum\_{i:\lambda\_i > 0} \ln \lambda\_i + \text{tr}(\Lambda \mathbf{C}) \right\},$$

where $\lambda_i$ are the eigenvalues of the empirical covariance $C$. We first note that $\mathrm{tr}(\Lambda C) = N - 1$, independent of which eigenvalues are obtained at the fixed point. This is easily seen by the following argument: if we use the index set $\mathcal{I}$ for the common eigenvectors $e_i$ and eigenvalues $\lambda_i$, $i \in \mathcal{I}$, we can write

$$\begin{aligned} \mathcal{C} &= \sum\_{i \in \mathcal{I}} e\_i \lambda\_i e\_i^\top \\ \Sigma &= \sum\_i e\_i \lambda\_i e\_i^\top \end{aligned} $$

From this we obtain

$$\mathrm{tr}(\Lambda C) = \mathrm{tr}\left(\sum_{i \in \mathcal{I}} e_i \lambda_i^{-1} \lambda_i e_i^{\top}\right) = N - 1.$$

From this result we obtain

$$\min \widetilde{\mathcal{F}} = -\max \left\{\frac{1}{2} \sum_{i:\lambda_i > 0} \ln \lambda_i\right\} + (N - 1).$$

The term $N - 1$ is a constant, but the first term makes a difference: the **absolute minimum** of $\widetilde{\mathcal{F}}$ is achieved when the $\lambda_i$ are the $N - 1$ **largest** eigenvalues of $\Sigma$. Our simulations empirically show that the algorithm usually converges to this absolute minimum.
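Since $\mathrm{tr}(\Lambda C) = N - 1$ is fixed at any fixed point, which subset of eigenvalues minimises $\widetilde{\mathcal{F}}$ can be checked by brute force. The following sketch (not from the paper; the spectrum is hypothetical) enumerates all $(N-1)$-subsets:

```python
import numpy as np
from itertools import combinations

# Hypothetical spectrum of a 5-dimensional target covariance Sigma.
sigma_eigs = np.array([3.0, 1.5, 0.8, 0.4, 0.1])
N = 3                                      # number of particles => N-1 = 2 retained eigenvalues

# At a fixed point tr(Lambda C) = N - 1 regardless of the subset, so the
# regularised free energy differs only through -0.5 * sum(log eigenvalues).
def free_energy(subset):
    return -0.5 * np.sum(np.log(sigma_eigs[list(subset)])) + (N - 1)

best = min(combinations(range(5), N - 1), key=free_energy)
assert set(best) == {0, 1}                 # indices of the two largest eigenvalues
```

The minimiser is indeed the subset of the largest eigenvalues, matching the claim above.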

#### **Appendix H. Dimension-Wise Optimizers**

Here, we list some of the most popular optimizers and their dimension-wise versions. In all algorithms, we consider $\phi$ the matrix created by concatenating the flows of all particles: $\phi = [\phi_1, \ldots, \phi_N]$, where $\phi_n = \phi(x_n)$. We additionally use the notation $\phi_{n,i}$ for the $i$-th dimension of the flow of the $n$-th particle. The main differences between the original algorithms and their modified versions are highlighted in red.

*Appendix H.1. ADAM*

The ADAM algorithm is given by:


## **Algorithm A2:** Dimension-wise ADAM

**Input:** $\phi^t$, $m^{t-1}$, $v^{t-1}$, $\beta_1$, $\beta_2$, $\eta$. **Output:** $\Delta$.

$m^t_{n,d} = \beta_1 m^{t-1}_{n,d} + (1 - \beta_1)\,\phi^t_{n,d}$;

$v^t_d = \beta_2 v^{t-1}_d + (1 - \beta_2)\,\frac{1}{N}\sum_{n=1}^N \left(\phi^t_{n,d}\right)^2$;

$\Delta_{n,d} = \eta\, \dfrac{m^t_{n,d}/(1 - \beta_1^t)}{\sqrt{v^t_d/(1 - \beta_2^t)} + \epsilon}$;

*Appendix H.2. AdaGrad*

The AdaGrad algorithm is given by:



**Input:** $\phi^t$, $v^{t-1}$, $\eta$. **Output:** $\Delta$.

$v^t_d = v^{t-1}_d + \frac{1}{N}\sum_{n=1}^N \left(\phi^t_{n,d}\right)^2$;

$\Delta_{n,d} = \eta\, \dfrac{\phi^t_{n,d}}{\sqrt{v^t_d} + \epsilon}$;

*Appendix H.3. RMSProp*

The RMSProp algorithm is given by:


**Algorithm A6:** Dimension-wise RMSProp

**Input:** $\phi^t$, $v^{t-1}$, $\rho$, $\eta$. **Output:** $\Delta$.

$v^t_d = \rho v^{t-1}_d + (1 - \rho)\,\frac{1}{N}\sum_{n=1}^N \left(\phi^t_{n,d}\right)^2$;

$\Delta_{n,d} = \eta\, \dfrac{\phi^t_{n,d}}{\sqrt{v^t_d} + \epsilon}$;
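The dimension-wise variants differ from the standard optimizers in that the second-moment accumulator is shared across particles (averaged over $n$) per dimension $d$. A minimal NumPy sketch of the dimension-wise RMSProp update, assuming $\phi$ is stored as an $N \times D$ array (all names are illustrative, not from the paper's code):

```python
import numpy as np

def dimensionwise_rmsprop_step(phi, v, rho=0.9, eta=1e-3, eps=1e-8):
    """One dimension-wise RMSProp step.

    phi : (N, D) array, flow phi(x_n) for each of the N particles
    v   : (D,) running second moment, shared across particles per dimension
    Returns the update Delta (N, D) and the updated v.
    """
    v = rho * v + (1 - rho) * np.mean(phi ** 2, axis=0)  # average over particles
    delta = eta * phi / (np.sqrt(v) + eps)
    return delta, v

# Usage with random placeholder flows:
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))          # N = 4 particles, D = 3 dimensions
v = np.zeros(3)
delta, v = dimensionwise_rmsprop_step(phi, v)
assert delta.shape == (4, 3) and v.shape == (3,)
```

The per-dimension averaging keeps all particles in a given coordinate rescaled by the same factor, which preserves their relative geometry within each dimension.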

#### **Appendix I. Additional Figures**

*Appendix I.1. Bayesian Logistic Regression*

Similarly to the previous section, we also show results with the RMSProp optimizer with a learning rate of $1 \times 10^{-4}$.

**Figure A1.** Similarly to Figure 6, we show the average negative log-likelihood on a test-set over 10 runs against training time on different datasets for a Bayesian logistic regression problem. The dashed curve represents the low-rank approximation with RMSProp for methods based on stochastic estimators.

*Appendix I.2. Bayesian Neural Network*

**Figure A2.** Convergence of the classification error and average negative log-likelihood as a function of time.

**Figure A3.** Accuracy vs. confidence. Every test sample is clustered as a function of its highest predictive probability. The accuracy of each cluster is then computed. A perfectly calibrated estimator would return the identity.

## **References**


## *Article* **ABCDP: Approximate Bayesian Computation with Differential Privacy**

**Mijung Park 1,\*, Margarita Vinaroz 2,3 and Wittawat Jitkrittum <sup>4</sup>**

	- <sup>4</sup> Google Research, 80636 Munich, Germany; wittawat@google.com
	- **\*** Correspondence: mijungp@cs.ubc.ca

**Abstract:** We developed a novel approximate Bayesian computation (ABC) framework, *ABCDP*, which produces differentially private (DP) and approximate posterior samples. Our framework takes advantage of the sparse vector technique (SVT), widely studied in the differential privacy literature. The SVT incurs a privacy cost only when a condition (whether a quantity of interest is above/below a threshold) is met. If the condition is met only sparsely during the repeated queries, the SVT can drastically reduce the cumulative privacy loss, unlike the usual case where every query incurs a privacy loss. In ABC, the quantity of interest is the distance between observed and simulated data, and only when the distance is below a threshold can we take the corresponding prior sample as a posterior sample. Hence, applying the SVT to ABC is an organic way to transform an ABC algorithm into a privacy-preserving variant with minimal modification, while yielding posterior samples with a high privacy level. We theoretically analyzed the interplay between the noise added for privacy and the accuracy of the posterior samples. We apply ABCDP to several data simulators and show the efficacy of the proposed framework.

**Keywords:** approximate Bayesian computation (ABC); differential privacy (DP); sparse vector technique (SVT)

## **1. Introduction**

Approximate Bayesian computation (ABC) aims to identify the posterior distribution over simulator parameters. The posterior distribution is of interest as it provides the mechanistic understanding of the stochastic procedure that directly generates data in many areas such as climate and weather, ecology, cosmology, and bioinformatics [1–4].

Under these complex models, directly evaluating the likelihood of data is often intractable given the parameters. ABC resorts to an approximation of the likelihood function using simulated data that are *similar* to the actual observations.

In the simplest form of ABC, called *rejection ABC* [5], we proceed by sampling multiple model parameters from a prior distribution $\pi$: $\theta_1, \theta_2, \ldots \sim \pi$. For each $\theta_t$, a pseudo dataset $Y_t$ is generated from a simulator (the forward sampler associated with the intractable likelihood $\mathcal{P}(y|\theta)$). The parameter $\theta_t$ for which the generated $Y_t$ is similar to the observed $Y^*$, as decided by $\rho(Y_t, Y^*) < \epsilon_{abc}$, is accepted. Here, $\rho$ is a notion of distance, for instance, the L2 distance between $Y_t$ and $Y^*$ in terms of a pre-chosen summary statistic. Whether the distance is small or large is determined by the *similarity threshold* $\epsilon_{abc}$. The result is samples $\{\theta_t\}_{t=1}^M$ from a distribution $\tilde{\mathcal{P}}(\theta|Y^*) \propto \pi(\theta)\tilde{\mathcal{P}}(Y^*|\theta)$, where $\tilde{\mathcal{P}}(Y^*|\theta) = \int_{B(Y^*)} \mathcal{P}(Y|\theta)\,dY$ and $B(Y^*) = \{Y : \rho(Y, Y^*) < \epsilon_{abc}\}$. As the likelihood computation is approximate, so is the posterior distribution. Hence, this framework is called *approximate* Bayesian computation, as we do not compute the likelihood of the data explicitly.

Most ABC algorithms evaluate the data similarity in terms of summary statistics computed by an aggregation of individual datapoints [6–11]. However, this seemingly

**Citation:** Park, M.; Vinaroz, M.; Jitkrittum, W. ABCDP: Approximate Bayesian Computation with Differential Privacy. *Entropy* **2021**, *23*, 961. https://doi.org/10.3390/ e23080961

Academic Editor: Pierre Alquier

Received: 19 May 2021 Accepted: 20 July 2021 Published: 27 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

innocuous step of similarity checking could pose a privacy threat, as aggregated statistics can still reveal an individual's participation in the dataset when combined with other publicly available datasets (see [12,13]). In addition, in some studies, the actual observations are privacy-sensitive in nature, e.g., genotype data for estimating tuberculosis transmission parameters [14]. Hence, it is necessary to privatize the similarity-checking step in ABC algorithms.

In this light, we introduce an ABC framework that obeys the notion of *differential privacy*. The differential privacy definition provides a way to quantify the amount of information that the distance computed on the privacy-sensitive data contains, whether or not a single individual's data are included (or modified) in the data [15]. Differential privacy also provides rigorous privacy guarantees in the presence of *arbitrary side information* such as similar public data.

A common form of applying DP to an algorithm is adding noise to the algorithm's outputs, called *output perturbation* [16]. In the case of ABC, we found that *adding noise to the distance* computed on the real observations and pseudo-data suffices for the privacy guarantee of the resulting posterior samples. However, if we simply add noise to the distance in every ABC inference step, this DP-ABC inference faces an additional challenge due to the *repeated* use of the real observations: the *composition* property of differential privacy states that the privacy level degrades over repeated uses of the data. To overcome this challenge, we adopt the *sparse vector technique* (SVT) [17] and apply it to the rejection-ABC paradigm. The SVT outputs *noisy* answers of whether or not each query in a stream is above a certain threshold, where a privacy cost is incurred only for the at most *c* "above threshold" answers it outputs. This is a significant saving in privacy cost, as arbitrarily many "below threshold" answers are free of privacy cost.
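To make the SVT idea concrete, here is a minimal sketch of the generic above-threshold mechanism, with the accept condition inverted so that "distance below the noisy threshold" plays the role of the privacy-costing event. This is an illustrative sketch only: the noise scales assume sensitivity-1 queries in the textbook formulation, and it is not the exact ABCDP mechanism described in this paper.

```python
import numpy as np

def sparse_vector(distances, threshold, eps, c, rng):
    """Simplified SVT sketch: answers a stream of distance queries against a
    noisy threshold, halting after c acceptances. Noise scales follow the
    textbook sensitivity-1 setting; all names here are illustrative."""
    answers = []
    rho = threshold + rng.laplace(scale=2 * c / eps)   # noisy threshold
    accepted = 0
    for d in distances:
        nu = rng.laplace(scale=4 * c / eps)            # fresh per-query noise
        if d + nu <= rho:                              # "below threshold" = accept
            answers.append(True)
            accepted += 1
            if accepted >= c:                          # privacy budget exhausted
                break
        else:
            answers.append(False)                      # rejections are "free"
    return answers

rng = np.random.default_rng(0)
ans = sparse_vector([0.1, 5.0, 0.2, 0.05], threshold=1.0, eps=1.0, c=2, rng=rng)
assert sum(ans) <= 2
```

The key structural point survives the simplification: the loop may emit arbitrarily many rejections, but it terminates after at most $c$ acceptances, which is what bounds the cumulative privacy loss.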

We name our framework, which combines ABC with SVT, *ABCDP* (approximate Bayesian computation with differential privacy). Under ABCDP, we theoretically analyze the effect of the noise added to the distance on the resulting posterior samples and the subsequent posterior integrals. Putting these together, we summarize our main contributions:


Unlike other existing ABC frameworks that typically rely on a pre-specified set of summary statistics, we use a kernel-based distance metric called *maximum mean discrepancy* following K2-ABC [18] to eliminate the necessity of pre-selecting a summary statistic. Using a kernel for measuring similarity between two empirical distributions was also proposed in K-ABC [19]. K-ABC formulates ABC as a problem of estimating a conditional mean embedding operator mapping (induced by a kernel) from summary statistics to corresponding parameters. However, unlike our algorithm, K-ABC still relies on a particular choice of summary statistics. In addition, K-ABC is a soft-thresholding ABC algorithm, while ours is a rejection-ABC algorithm.

To avoid the necessity of pre-selecting summary statistics, one could resort to methods that automatically or semi-automatically learn the best summary statistics given in a dataset, and use the learned summary statistics in our ABCDP framework. An example is semi-auto ABC [6], where the authors suggest using the posterior mean of the parameters as a summary statistic. Another example is the indirect-score ABC [20], where the authors suggest using an auxiliary model which determines a score vector as a summary statistic. However, the posterior mean of the parameters in semi-auto ABC as well as the parameters of the auxiliary model in indirect-score ABC need to be estimated. The estimation step can incur a further privacy loss if the real data need to be used for estimating them. Our ABCDP framework does not involve such an estimation step and is more economical in terms of privacy budget to be spent than semi-auto ABC and indirect-score ABC.

#### **2. Background**

We start by describing relevant background information.

#### *2.1. Approximate Bayesian Computation*

Given a set *Y*∗ containing observations, **rejection ABC** [5] yields samples from an approximate posterior distribution by repeating the following three steps:

$$
\theta \sim \pi(\theta),
\tag{1}
$$

$$Y = \{y\_1, y\_2, \dots\} \sim \mathcal{P}(y|\theta),\tag{2}$$

$$\mathcal{P}_{\epsilon_{\text{abc}}}(\theta|Y^*) \propto \mathcal{P}_{\epsilon_{\text{abc}}}(Y^*|\theta)\pi(\theta),\tag{3}$$

where the pseudo dataset *Y* is compared with the observations *Y*∗ via:

$$\mathcal{P}_{\epsilon_{\text{abc}}}(Y^*|\theta) = \int_{B_{\epsilon_{\text{abc}}}(Y^*)} \mathcal{P}(Y|\theta)\, dY,$$

$$B_{\epsilon_{\text{abc}}}(Y^*) = \{Y \mid \rho(Y, Y^*) \le \epsilon_{\text{abc}}\}, \tag{4}$$

where $\rho$ is a divergence measure between two datasets. Any distance metric can be used for $\rho$. For instance, one can use the L2 distance between two datasets in terms of a pre-chosen set of summary statistics, i.e., $\rho(Y, Y^*) = D(S(Y), S(Y^*))$, with an L2 distance measure $D$ on the statistics computed by $S$.
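The three rejection-ABC steps above can be sketched as a short loop. This is a generic illustration (all argument names, the toy simulator, and the mean-statistic distance are our own choices, not the paper's):

```python
import numpy as np

def rejection_abc(y_obs, prior_sample, simulate, rho, eps_abc, n_draws, rng):
    """Plain rejection ABC: draw theta from the prior, simulate a pseudo
    dataset, and keep theta whenever the distance is at most eps_abc."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)                      # step (1)
        y_sim = simulate(theta, rng)                   # step (2)
        if rho(y_sim, y_obs) <= eps_abc:               # steps (3)-(4)
            accepted.append(theta)
    return accepted

# Toy usage: infer the mean of a Gaussian with known unit variance.
rng = np.random.default_rng(0)
y_obs = rng.normal(loc=1.0, size=200)
post = rejection_abc(
    y_obs,
    prior_sample=lambda r: r.normal(0.0, 3.0),         # broad Gaussian prior
    simulate=lambda th, r: r.normal(th, 1.0, size=200),
    rho=lambda a, b: abs(a.mean() - b.mean()),         # distance on the mean statistic
    eps_abc=0.1, n_draws=2000, rng=rng,
)
assert len(post) > 0 and abs(np.mean(post) - 1.0) < 0.5
```

The accepted parameters concentrate around the true mean as $\epsilon_{abc}$ shrinks, at the cost of a lower acceptance rate.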

A more statistically sound choice for *ρ* would be *maximum mean discrepancy* (MMD, [21]) as used in [18]. Unlike a pre-chosen finite dimensional summary statistic typically used in ABC, MMD compares two distributions in terms of all the possible moments of the random variables described by the two distributions. Hence, ABC frameworks using the MMD metric such as [18] can avoid the problem of non-sufficiency of a chosen summary statistic that may occur in many ABC methods. For this reason, in this paper, we demonstrate our algorithm using the MMD metric. However, other metrics can be used as we illustrated in our experiments.

#### Maximum Mean Discrepancy

Assume that the data $Y \subset \mathcal{X}$ and let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite kernel. The MMD between two distributions *P*, *Q* is defined as

$$\text{MMD}^2(P, Q) = \mathbb{E}\_{\mathbf{x}, \mathbf{x}' \sim P} k(\mathbf{x}, \mathbf{x}') + \mathbb{E}\_{\mathbf{y}, \mathbf{y}' \sim Q} k(\mathbf{y}, \mathbf{y}') - 2\mathbb{E}\_{\mathbf{x} \sim P} \mathbb{E}\_{\mathbf{y} \sim Q} k(\mathbf{x}, \mathbf{y}).\tag{5}$$

Following the convention in the kernel literature, we refer to MMD<sup>2</sup> simply as the MMD.

The Moore–Aronszajn theorem states that there is a unique Hilbert space $\mathcal{H}$ on which $k$ defines an inner product. As a result, there exists a feature map $\phi : \mathcal{X} \to \mathcal{H}$ such that $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the inner product on $\mathcal{H}$. The MMD in (5) can be written as

$$\text{MMD}^2(P, \mathbb{Q}) = \left|| \mathbb{E}\_{\mathbf{x} \sim P}[\phi(\mathbf{x})] - \mathbb{E}\_{\mathbf{y} \sim \mathbb{Q}}[\phi(\mathbf{y})] \right||\_{\mathcal{H}'}^2$$

where $\mathbb{E}_{\mathbf{x} \sim P}[\phi(\mathbf{x})] \in \mathcal{H}$ is known as the (kernel) mean embedding of $P$, which exists if $\mathbb{E}_{\mathbf{x} \sim P}\sqrt{k(\mathbf{x}, \mathbf{x})} < \infty$ [22]. The MMD can thus be interpreted as the distance between the mean embeddings of the two distributions. If $k$ is a characteristic kernel [23], then $P \mapsto \mathbb{E}_{\mathbf{x} \sim P}[\phi(\mathbf{x})]$ is injective, implying that $\text{MMD}(P, Q) = 0$ if and only if $P = Q$. When $P$, $Q$ are observed through samples $X_m = \{x_i\}_{i=1}^m \sim P$ and $Y_n = \{y_i\}_{i=1}^n \sim Q$, the MMD can be estimated by empirical averages [21] (Equation (3)):

$$\widehat{\text{MMD}}^2(X_m, Y_n) = \frac{1}{m^2}\sum_{i,j=1}^{m} k(x_i, x_j) + \frac{1}{n^2}\sum_{i,j=1}^{n} k(y_i, y_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j).$$

When applied in the ABC setting, one input to the MMD is the observed dataset $Y^*$ and the other is a pseudo dataset $Y_t \sim p(\cdot|\theta_t)$ generated by the simulator given $\theta_t \sim \pi(\theta)$.
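The empirical estimator above can be implemented directly; the sketch below uses one-dimensional data and a Gaussian kernel for concreteness:

```python
import numpy as np

def gaussian_kernel(X, Y, l=1.0):
    """k(x, y) = exp(-(x - y)^2 / (2 l^2)), computed pairwise for 1-D data."""
    d2 = (X[:, None] - Y[None, :]) ** 2
    return np.exp(-d2 / (2 * l ** 2))

def mmd2(X, Y, l=1.0):
    """Biased empirical estimate of MMD^2 via the three kernel averages."""
    m, n = len(X), len(Y)
    return (gaussian_kernel(X, X, l).sum() / m**2
            + gaussian_kernel(Y, Y, l).sum() / n**2
            - 2 * gaussian_kernel(X, Y, l).sum() / (m * n))

rng = np.random.default_rng(1)
P = rng.normal(0, 1, 400)
Q_same = rng.normal(0, 1, 400)   # same distribution as P
Q_diff = rng.normal(2, 1, 400)   # shifted distribution
assert mmd2(P, Q_same) < mmd2(P, Q_diff)
```

Samples from the same distribution give an MMD estimate near zero, whereas samples from a shifted distribution give a clearly larger value, matching the characteristic-kernel property stated above.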

## *2.2. Differential Privacy*

An output of an algorithm that takes sensitive data D as input will naturally contain some information about D. The goal of differential privacy is to augment such an algorithm so that useful information about the population is retained, while sensitive information, such as an individual's participation in the dataset, cannot be learned [17]. A common way to achieve these two seemingly paradoxical goals is to deliberately inject a controlled level of random noise into the to-be-released quantity. The modified procedure, known as a DP mechanism, gives a stochastic output due to the injected noise. In the DP framework, a higher level of noise provides a stronger privacy guarantee at the expense of less accurate population-level information derived from the released quantity; less noise added to the output reveals more about an individual's presence in the dataset.

More formally, given a mechanism M (a *mechanism* takes a dataset as input and produces stochastic outputs) and neighboring datasets D, D′ differing by a single entry (either by replacing one individual's datapoint with another, or by adding/removing a datapoint to/from D), the *privacy loss* of an outcome *o* is defined by

$$L^{(o)} = \log \frac{\mathrm{P}(\mathcal{M}(\mathcal{D}) = o)}{\mathrm{P}(\mathcal{M}(\mathcal{D}') = o)}.\tag{6}$$

The mechanism M is called *ε*-DP if and only if $|L^{(o)}| \le \epsilon$ for all possible outcomes *o* and all possible neighboring datasets D, D′. The definition states that a single individual's participation in the data does not change the output probabilities by much; this limits the amount of information that the algorithm reveals about any one individual. A weaker, *approximate* version of this notion is (*ε*, *δ*)-DP: M is (*ε*, *δ*)-DP if $|L^{(o)}| \le \epsilon$ with probability at least 1 − *δ*, where *δ* is often called the failure probability, quantifying how often the DP guarantee of the mechanism fails.

Output perturbation is a commonly used DP mechanism for making the outputs of an algorithm differentially private. Suppose a deterministic function $h : \mathcal{D} \to \mathbb{R}^p$ computed on sensitive data D outputs a *p*-dimensional vector. To make *h* private, we can add noise to its output, where the level of noise is calibrated to the *global sensitivity* [24], Δ*h*, defined as the maximum difference $\|h(\mathcal{D}) - h(\mathcal{D}')\|$ (in some norm) over neighboring datasets D and D′ (i.e., differing by one data sample).
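A minimal sketch of output perturbation with Laplace noise follows; the mean query over data in [0, 1] is a hypothetical example whose global sensitivity is 1/*N* under the replace-one-datapoint neighboring relation:

```python
import numpy as np

def laplace_mechanism(h_value, sensitivity, epsilon, rng):
    """Release h(D) + Laplace noise with scale = sensitivity / epsilon.
    Calibrating the scale to the global sensitivity yields epsilon-DP."""
    scale = sensitivity / epsilon
    return h_value + rng.laplace(0.0, scale, size=np.shape(h_value))

# Hypothetical example: privatize the mean of N values in [0, 1]; swapping
# one datapoint changes the mean by at most 1/N, so the sensitivity is 1/N.
rng = np.random.default_rng(0)
data = rng.uniform(0, 1, size=1000)
private_mean = laplace_mechanism(np.mean(data), sensitivity=1 / len(data),
                                 epsilon=1.0, rng=rng)
```

With *N* = 1000 and *ε* = 1, the noise scale is 0.001, so the released mean is close to the true mean while still protecting any single datapoint.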

There are two important properties of differential privacy. First, the *post-processing invariance* property [24] tells us that the composition of any data-independent mapping with an (*ε*, *δ*)-DP algorithm is also (*ε*, *δ*)-DP. Second, the *composability* theorem [24] states that the strength of the privacy guarantee degrades with repeated use of DP algorithms. Formally, given an *ε*₁-DP mechanism M₁ and an *ε*₂-DP mechanism M₂, the mechanism M(D) := (M₁(D), M₂(D)) is (*ε*₁ + *ε*₂)-DP. This composition is often called *linear* composition, under which the total privacy loss increases linearly with the number of repeated uses of DP algorithms. *Strong* composition [17] (Theorem 3.20) improves on linear composition, although the resulting DP guarantee becomes weaker (i.e., approximate (*ε*, *δ*)-DP). Recently, more refined methods further improve the privacy loss (e.g., [25]).

## *2.3. AboveThreshold and Sparse Vector Technique*

Among the DP mechanisms, we will utilize *AboveThreshold* and *sparse vector technique* (SVT) [17] to make the rejection ABC algorithm differentially private. AboveThreshold outputs 1 when a query value exceeds a pre-defined threshold, or 0 otherwise. This resembles rejection ABC where the output is 1 when the distance is less than a chosen threshold. To ensure the output is differentially private, AboveThreshold adds noise to both the threshold and the query value. We take the same route as AboveThreshold to make our ABCDP outputs differentially private. Sparse vector technique (SVT) consists of *c* calls to AboveThreshold, where *c* in our case determines how many posterior samples ABCDP releases.
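A sketch of AboveThreshold in its standard form, which reports the first query *exceeding* a noisy threshold (rejection ABC flips the inequality, accepting when a noisy distance falls *below* the threshold); the 2Δ/ε and 4Δ/ε noise scales follow the usual calibration, and the query values here are illustrative:

```python
import numpy as np

def above_threshold(queries, threshold, sensitivity, epsilon, rng):
    """AboveThreshold: privately return the index of the first query whose
    noisy value exceeds a noisy threshold; both sides are perturbed."""
    noisy_T = threshold + rng.laplace(0, 2 * sensitivity / epsilon)
    for i, q in enumerate(queries):
        if q + rng.laplace(0, 4 * sensitivity / epsilon) >= noisy_T:
            return i  # output 1 at this index and halt
    return None       # no query exceeded the threshold

rng = np.random.default_rng(0)
idx = above_threshold([0.1, 0.2, 5.0, 0.3], threshold=1.0,
                      sensitivity=0.01, epsilon=1.0, rng=rng)
```

With low-sensitivity queries, the noisy comparison almost always agrees with the exact one, so the mechanism stops at the third query here (index 2). SVT simply repeats this procedure *c* times.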

Before presenting our ABCDP framework, we first describe the privacy setup we consider in this paper.

#### **3. Problem Formulation**

We assume a *data owner* who owns sensitive data *Y*∗ and is willing to contribute to the posterior inference.

We also assume a *modeler* who aims to learn the posterior distribution of the parameters of a simulator. Our ABCDP algorithm proceeds with the two steps:


Note that *T* is the maximum number of parameter–pseudo-data pairs that are publicly available. We run our algorithm for *T* steps, although it can terminate as soon as it has output *c* accepted posterior samples; hence, generally, *c* ≪ *T*. The details are introduced next.

#### **4. ABCDP**

Recall that the only place where the real data *Y*∗ appear in the ABC algorithm is when we judge whether the simulated data are similar to the real data, i.e., as in (4). Our method hence adds noise to this step. In order to take advantage of the privacy analysis of SVT, we also add noise to the ABC threshold and to the ABC distance. Consequently, we introduce two perturbation steps.

Before we introduce them, we describe the global sensitivity of the distance, as this quantity tunes the amount of noise added in the two perturbation steps. For $\rho(Y^*, Y) = \widehat{\text{MMD}}(Y^*, Y)$ with a bounded kernel, the sensitivity of the distance is $\Delta_\rho = O(1/N)$, as shown in Lemma 1.

**Lemma 1** ($\Delta_\rho = O(1/N)$ for MMD)**.** *Assume that $Y^*$ and each pseudo dataset $Y_t$ are of the same cardinality N. Set* $\rho(Y^*, Y) = \widehat{\text{MMD}}(Y^*, Y)$ *with a kernel k bounded by* $B_k > 0$*, i.e.,* $\sup_{x, y \in \mathcal{X}} k(x, y) \le B_k < \infty$*. Then:*

$$\sup_{(Y^*, Y^{*\prime}), Y} |\rho(Y^*, Y) - \rho(Y^{*\prime}, Y)| \le \Delta_\rho := \frac{2}{N}\sqrt{B_k},$$

*and* $\sup_{Y^*, Y} \rho(Y^*, Y) \le 2\sqrt{B_k}$.

A proof is given in Appendix B. For $\rho = \widehat{\text{MMD}}$ with a Gaussian kernel, $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2l^2}\right)$, where $l > 0$ is the bandwidth of the kernel, $B_k = 1$ for any $l > 0$.
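The Lemma 1 bound can be checked numerically: replacing one datapoint of *Y*∗ changes the (biased) MMD estimate by at most 2√*B<sub>k</sub>*/*N*. The sketch below uses a Gaussian kernel (*B<sub>k</sub>* = 1) on one-dimensional data:

```python
import numpy as np

def gaussian_kernel(X, Y, l=1.0):
    return np.exp(-(X[:, None] - Y[None, :]) ** 2 / (2 * l ** 2))

def mmd(X, Y, l=1.0):
    """Biased estimator MMD-hat = norm of the mean-embedding difference."""
    m, n = len(X), len(Y)
    val = (gaussian_kernel(X, X, l).sum() / m**2
           + gaussian_kernel(Y, Y, l).sum() / n**2
           - 2 * gaussian_kernel(X, Y, l).sum() / (m * n))
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(0)
N = 200
Y_star = rng.normal(0, 1, N)
Y = rng.normal(0.5, 1, N)
Y_star_prime = Y_star.copy()
Y_star_prime[0] = 100.0   # replace one datapoint arbitrarily
change = abs(mmd(Y_star, Y) - mmd(Y_star_prime, Y))
assert change <= 2 / N + 1e-12   # Lemma 1 with B_k = 1
```

Even an extreme replacement moves the distance by no more than 2/*N*, which is what allows the Laplace noise scale to shrink as *N* grows.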

Now, we introduce the two perturbation steps used in our algorithm summarized in Algorithm 1.

#### **Algorithm 1** Proposed *c*-sample ABCDP

**Require:** Observations $Y^*$, number of accepted posterior samples $c$, privacy tolerance $\epsilon_{total}$, ABC threshold $\epsilon_{abc}$, distance $\rho$, parameter–pseudo-data pairs $\{(\theta_t, Y_t)\}_{t=1}^T$, and option RESAMPLE.

**Ensure:** $\epsilon_{total}$-DP indicators $\{\tilde{\tau}_t\}_{t=1}^T$ for the corresponding samples $\{\theta_t\}_{t=1}^T$

1: Calculate the noise scale $b$ by Theorem 1.
2: Privatize the ABC threshold: $\hat{\epsilon}_{abc} = \epsilon_{abc} + m_t$ via (7)
3: Set count = 0
4: **for** $t = 1, \ldots, T$ **do**
5: &nbsp;&nbsp;Privatize the distance: $\hat{\rho}_t = \rho(Y^*, Y_t) + \nu_t$ via (8)
6: &nbsp;&nbsp;**if** $\hat{\rho}_t \le \hat{\epsilon}_{abc}$ **then**
7: &nbsp;&nbsp;&nbsp;&nbsp;Output $\tilde{\tau}_t = 1$
8: &nbsp;&nbsp;&nbsp;&nbsp;count = count + 1
9: &nbsp;&nbsp;&nbsp;&nbsp;**if** RESAMPLE **then**
10: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\hat{\epsilon}_{abc} = \epsilon_{abc} + m_t$ via (7)
11: &nbsp;&nbsp;&nbsp;&nbsp;**end if**
12: &nbsp;&nbsp;**else**
13: &nbsp;&nbsp;&nbsp;&nbsp;Output $\tilde{\tau}_t = 0$
14: &nbsp;&nbsp;**end if**
15: &nbsp;&nbsp;**if** count ≥ $c$ **then**
16: &nbsp;&nbsp;&nbsp;&nbsp;Break the loop
17: &nbsp;&nbsp;**end if**
18: **end for**

*Step 1: Noise for privatizing the ABC threshold.*

$$
\hat{\epsilon}_{abc} = \epsilon_{abc} + m_t \tag{7}
$$

where *mt* ∼ Lap(*b*), i.e., drawn from the zero-mean Laplace distribution with a scale parameter *b*.

*Step 2: Noise for privatizing the distance.*

$$
\hat{\rho}_t = \rho(Y^*, Y_t) + \nu_t \tag{8}
$$

where *ν<sup>t</sup>* ∼ Lap(2*b*).

Due to these perturbations, Algorithm 1 runs with the privatized threshold and distance. We can choose to perturb the threshold only once, or every time we output a 1, by setting RESAMPLE to False or True, respectively. After outputting *c* 1's, the algorithm terminates. How do we calculate the resulting privacy loss under the different options?
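Putting the two perturbation steps together, Algorithm 1 can be sketched as follows; the distances *ρ*(*Y*∗, *Y<sub>t</sub>*) are assumed precomputed, and the noise scale *b* is taken as given from Theorem 1:

```python
import numpy as np

def abcdp(rho_values, eps_abc, b, c, resample, rng):
    """Sketch of Algorithm 1: release DP accept/reject indicators.
    rho_values[t] = rho(Y*, Y_t); b is the Laplace scale from Theorem 1."""
    eps_hat = eps_abc + rng.laplace(0, b)        # Step 1: privatize threshold
    taus, count = [], 0
    for rho_t in rho_values:
        rho_hat = rho_t + rng.laplace(0, 2 * b)  # Step 2: privatize distance
        if rho_hat <= eps_hat:
            taus.append(1)
            count += 1
            if resample:                         # redraw threshold noise
                eps_hat = eps_abc + rng.laplace(0, b)
        else:
            taus.append(0)
        if count >= c:                           # stop after c acceptances
            break
    return taus

rng = np.random.default_rng(0)
taus = abcdp(rho_values=np.linspace(0, 1, 50), eps_abc=0.3, b=0.01,
             c=3, resample=True, rng=rng)
assert sum(taus) <= 3
```

The RESAMPLE flag toggles whether the threshold noise is redrawn after each acceptance, which corresponds to the two cases analyzed in Theorem 1.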

We formally state the relationship between the noise scale *b* and the final privacy loss $\epsilon_{total}$ for the Laplace noise in Theorem 1.

**Theorem 1** (Algorithm 1 is $\epsilon_{total}$-DP)**.** *For any neighboring datasets $Y^*, Y^{*\prime}$ of size N and any dataset Y, assume that ρ is such that* $0 < \sup_{(Y^*, Y^{*\prime}), Y} |\rho(Y^*, Y) - \rho(Y^{*\prime}, Y)| \le \Delta_\rho < \infty$*. Then, Algorithm 1 is* $\epsilon_{total}$*-DP, where:*

$$\epsilon_{total} = \begin{cases} \frac{(c+1)\Delta_\rho}{b} & \text{if RESAMPLE is False,} \\ \frac{2c\Delta_\rho}{b} & \text{if RESAMPLE is True.} \end{cases} \tag{9}$$

A proof is given in Appendix A. The proof uses linear composition, i.e., the privacy level degrades linearly with *c*. Using strong composition or more advanced compositions could reduce the resulting privacy loss, although these compositions turn pure DP into the weaker, approximate DP; in this paper, we focus on pure DP. For the case of RESAMPLE = True, the proof directly follows the proof of the standard SVT algorithm with linear composition [17], except that we utilize the quantity representing the minimum noisy value of any query evaluated on *Y*∗, as opposed to the maximum utilized in SVT. For the case of RESAMPLE = False, the proof follows the proof of Algorithm 1 in [26].

Note that the DP analysis in Theorem 1 holds for other distance metrics and is not limited to MMD, as long as the chosen metric has a bounded sensitivity $\Delta_\rho$. When there is no bounded sensitivity, one can impose a clipping bound *C* on the distance by using $\min[\rho(Y_t, Y^*), C]$ in place of $\rho(Y_t, Y^*)$, so that the resulting distance between any pseudo dataset $Y_t$ and any $Y^*$ with one datapoint modified cannot exceed the clipping bound. We use this trick in our experiments when the sensitivity is not bounded.

#### *4.1. Effect of Noise Added to ABC*

Here, we analyze the effect of the noise added to ABC. In particular, we are interested in the probability that the output of ABCDP differs from that of ABC, $\mathrm{P}[\tilde{\tau}_t \neq \tau_t \,|\, \tau_t]$, at any given time *t*. To compute this probability, we first derive the probability density function (PDF) of the random variable $m_t - \nu_t$ in the following lemma.

**Lemma 2.** *Recall* $m_t \sim Lap(b)$*,* $\nu_t \sim Lap(2b)$*. Their difference is another random variable* $Z = m_t - \nu_t$*, whose PDF is given by*

$$f\_Z(z) = \frac{1}{6b} \left[ 2 \exp\left(-\frac{|z|}{2b}\right) - \exp\left(-\frac{|z|}{b}\right) \right].\tag{10}$$

*Furthermore, for* $a \ge 0$, $G_b(a) := \int_a^\infty f_Z(z)\,\mathrm{d}z = \frac{1}{6}\left[4\exp\left(-\frac{a}{2b}\right) - \exp\left(-\frac{a}{b}\right)\right]$*, and the CDF of Z is given by* $F_Z(a) = H[a] + (1 - 2H[a])\,G_b(|a|)$*, where* $H[a]$ *is the Heaviside step function.*

See Appendix C for the proof. Using this PDF, we now provide the following proposition:

**Proposition 1.** *Denote the output of Algorithm 1 at time t by* $\tilde{\tau}_t \in \{0, 1\}$ *and the output of ABC by* $\tau_t \in \{0, 1\}$*. The flip probability, i.e., the probability that the outputs of ABCDP and ABC differ given the output of ABC, is* $P[\tilde{\tau}_t \neq \tau_t \,|\, \tau_t] = G_b(|\rho_t - \epsilon_{abc}|)$*, where* $G_b(a)$ *is defined in Lemma 2 and* $\rho_t := \rho(Y^*, Y_t)$*.*

See Appendix D for proof.
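The closed form of *G<sub>b</sub>* from Lemma 2, which gives the flip probability in Proposition 1, can be checked against a Monte Carlo simulation of *Z* = *m<sub>t</sub>* − *ν<sub>t</sub>*:

```python
import numpy as np

def G_b(a, b):
    """Tail mass P(Z >= a) for a >= 0, with Z = m_t - nu_t (Lemma 2)."""
    return (4 * np.exp(-a / (2 * b)) - np.exp(-a / b)) / 6

rng = np.random.default_rng(0)
b = 0.5
# Z = Lap(b) - Lap(2b), simulated directly
Z = rng.laplace(0, b, 200_000) - rng.laplace(0, 2 * b, 200_000)
a = 1.0
empirical = np.mean(Z > a)
assert abs(empirical - G_b(a, b)) < 0.01
```

The empirical tail frequency matches the analytic expression, confirming that a rejection step with margin |*ρ<sub>t</sub>* − *ε<sub>abc</sub>*| = *a* flips with probability *G<sub>b</sub>*(*a*).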

To provide intuition for Proposition 1, we visualize the flip probability in Figure 1. This flip probability provides a guideline for choosing the accepted sample size *c* given the dataset size *N* and the desired privacy level $\epsilon_{total}$. For instance, if a given dataset is extremely small, e.g., containing datapoints on the order of 10, *c* has to be chosen such that the flip probability of each posterior sample remains low for a given privacy guarantee ($\epsilon_{total}$). If a larger number of posterior samples is needed, then one has to relax the desired privacy level for the posterior samples of ABCDP to remain similar to those of ABC. Otherwise, with a small $\epsilon_{total}$ and a large *c*, the accepted posterior samples will be poor. On the other hand, if the dataset is bigger, then a larger *c* can be used for a reasonable level of privacy.

**Figure 1.** Visualization of the flip probability derived in Proposition 1, the probability that the outputs of ABCDP and ABC differ given an output of ABC, for different dataset sizes *N* and accepted posterior sample sizes *c*. We simulated *ρ* ∼ Uniform[0, 1] (drawing 100 values for *ρ*) and used $\epsilon_{abc} = 0.2$: (**A**) This column shows the flip probability in the regime of extremely small datasets, *N* = 10. The top plot shows the probability at *c* = 10, the middle plot at *c* = 100, and the bottom plot at *c* = 1000. In this regime, even $\epsilon_{total} = 100$ cannot reduce the flip probability to zero when *c* = 10, and the flip probability remains high when we accept more samples, i.e., *c* = 1000; (**B**) the flip probability at *N* = 100; (**C**) the flip probability at *N* = 1000. As we increase the dataset size *N* (moving from the left to the right columns), the flip probability approaches zero at a smaller privacy loss $\epsilon_{total}$.

#### *4.2. Convergence of Posterior Expectation of Rejection-ABCDP to Rejection-ABC.*

The flip probability studied in Section 4.1 only accounts for the effect of noise added to a single output of ABCDP. Building further on this result, we analyzed the discrepancy between the posterior expectations derived from ABCDP and from the rejection ABC. This analysis requires quantifying the effect of noise added to the whole sequence of outputs of ABCDP. The result is presented in Theorem 2.

**Theorem 2.** *Given $Y^*$ of size N and $\{(\theta_t, Y_t)\}_{t=1}^T$ as input, let $\tilde{\tau}_t \in \{0, 1\}$ be the output from Algorithm 1, where $\tilde{\tau}_t = 1$ indicates that $(\theta_t, Y_t)$ is accepted, for $t = 1, \ldots, T$. Similarly, let $\tau_t$ denote the output from the traditional rejection ABC algorithm, for $t = 1, \ldots, T$. Let f be an arbitrary vector-valued function of θ. Assume that the numbers of accepted samples from Algorithm 1 and from the traditional rejection ABC algorithm are $c := \sum_{t=1}^T \tilde{\tau}_t \ge 1$ and $c' := \sum_{t=1}^T \tau_t \ge 1$, respectively. Let $b = \frac{4c\sqrt{B_k}}{\epsilon_{total} N}$ if RESAMPLE = True, and $b = \frac{2(c+1)\sqrt{B_k}}{\epsilon_{total} N}$ if RESAMPLE = False (see Theorem 1). Define $K_T := \max_{t=1,\ldots,T} \|f(\theta_t)\|_2$. Then, the following statements hold for both RESAMPLE options:*

*1.* $\mathbb{E}_{\tilde{\tau}_1, \ldots, \tilde{\tau}_T} \left\| \frac{1}{c} \sum_{t=1}^{T} f(\theta_t)\,\tilde{\tau}_t - \frac{1}{c'} \sum_{t=1}^{T} f(\theta_t)\,\tau_t \right\|_2 \le \frac{2K_T}{c'} \sum_{t=1}^{T} G_b(|\rho_t - \epsilon_{abc}|)$, *where the decreasing function* $G_b(x) \in (0, \frac{1}{2}]$ *for any* $x \ge 0$ *is defined in Lemma 2;*

*2.* $\mathbb{E}_{\tilde{\tau}_1, \ldots, \tilde{\tau}_T} \left\| \frac{1}{c} \sum_{t=1}^{T} f(\theta_t)\,\tilde{\tau}_t - \frac{1}{c'} \sum_{t=1}^{T} f(\theta_t)\,\tau_t \right\|_2 \to 0$ *as* $N \to \infty$;

*3. For any* $a > 0$*:*

$$P\left(\left\|\frac{1}{c}\sum_{t=1}^{T}f(\theta_t)\,\tilde{\tau}_t - \frac{1}{c'}\sum_{t=1}^{T}f(\theta_t)\,\tau_t\right\|_2 \le a\right) \ge 1 - \frac{4K_T}{3ac'}\sum_{t=1}^{T}\exp\left(-\frac{|\rho_t - \epsilon_{abc}|}{2b}\right),$$

*where the probability is taken with respect to* $\tilde{\tau}_1, \ldots, \tilde{\tau}_T$*.*

Theorem 2 contains three statements. The first states that the expected error between the two posterior expectations of an arbitrary function *f* is bounded by a constant factor of the sum of the flip probability in each rejection/acceptance step. As we have seen in Section 4.1, the flip probability is determined by the scale parameter *b* of the Laplace distribution. Since *b* = *O*(1/*N*) (see Theorem 1 and Lemma 1), it follows that the expected error decays as *N* increases, giving the second statement.

The third statement gives a probabilistic bound on the error. The bound guarantees that the error decays exponentially in *N* (since *b* = *O*(1/*N*)). Our proof relies on establishing an upper bound on the error as a function of the total number of flips, $\sum_{t=1}^T |\tilde{\tau}_t - \tau_t|$, which is a random variable. Bounding the error of interest then amounts to characterizing the tail behavior of this quantity. Observe that in Theorem 2, we consider ABCDP and rejection ABC with the same computational budget, i.e., the same total number of iterations *T*. However, the number of accepted samples may differ in each case (*c* for ABCDP and *c*′ for rejection ABC). The fact that *c* itself is a random quantity due to the injected noise presents its own technical challenge in the proof. Our proof can be found in Appendix E.

#### **5. Related Work**

Combining DP with ABC is relatively novel. The only related work is [27], which states that a rejection ABC algorithm produces posterior samples from the exact posterior distribution given perturbed data, when the kernel and bandwidth of rejection ABC are chosen in line with the data perturbation mechanism. The focus of [27] is to identify the condition when the posterior becomes exact in terms of the kernel and bandwidth of the kernel through the lens of data perturbation. On the other hand, we use the sparse vector technique to reduce the total privacy loss. The resulting theoretical studies including the flip probability and the error bound on the posterior expectation are new.

#### **6. Experiments**

## *6.1. Toy Examples*

We start by investigating the interplay between *abc* and *total*, in a synthetic dataset where the ground truth parameters are known. Following [18], we also consider a symmetric Dirichlet prior *π* and a likelihood *p*(*y*|*θ*) given by a mixture of uniform distributions as

$$
\pi(\theta) = \text{Dirichlet}(\theta; \mathbf{1}),
$$

$$
p(y|\theta) = \sum_{i=1}^5 \theta_i\, \text{Uniform}(y; [i-1, i]). \tag{11}
$$

The vector of mixing proportions constitutes our model parameter *θ*, with ground truth $\theta^* = [0.25, 0.04, 0.33, 0.04, 0.34]$ (see Figure 2). The goal is to estimate $\mathbb{E}[\theta|Y^*]$, where $Y^*$ is generated with $\theta^*$.
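The toy model in (11) can be sampled as follows, drawing from the uniform mixture by first picking a component *i* with probability *θ<sub>i</sub>* and then sampling Uniform[*i* − 1, *i*]; a minimal sketch:

```python
import numpy as np

def sample_prior(rng):
    """theta ~ Dirichlet(1, ..., 1) over 5 mixture components."""
    return rng.dirichlet(np.ones(5))

def simulate(theta, n, rng):
    """y | theta: pick component i with probability theta_i,
    then draw y ~ Uniform[i - 1, i]."""
    comps = rng.choice(5, size=n, p=theta)
    return comps + rng.uniform(0, 1, size=n)

rng = np.random.default_rng(0)
theta_star = np.array([0.25, 0.04, 0.33, 0.04, 0.34])
Y_star = simulate(theta_star, 5000, rng)
assert Y_star.min() >= 0 and Y_star.max() <= 5
```

Rejection ABC then repeatedly draws *θ* from the prior, simulates a pseudo dataset of the same size, and compares it to *Y*∗ via the chosen *ρ*.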

We first generated 5000 samples for *Y*∗ drawn from (11) with true parameters *θ*∗. Then, we tested our two ABCDP frameworks with varying *abc* and *total*. In these experiments, we set *<sup>ρ</sup>* <sup>=</sup> MMD with a Gaussian kernel. We set the bandwidth of the Gaussian kernel using the median heuristic computed on the simulated data (i.e., we did not use the real data for this, hence there is no privacy violation in this regard).

We drew 5000 pseudo-samples for *Yt* at each time. We tested various settings, as shown in Figure 3, where we vary the number of posterior samples, *c* = {10, 100, 1000}, *abc* = {0.05, 0.1, 0.2, 0.5} and *total* = {0.5, 1.0, 10, ∞}. We showed the result of ABCDP for both RESAMPLE options in Figure 3.

**Figure 2.** (**b**) Observations, where the x axis indicates the range of the values of the observations.

**Figure 3.** ABCDP on synthetic data. Mean-squared error (between true parameters and posterior mean) as a function of the similarity threshold $\epsilon_{abc}$, given each privacy level. We ran ABCDP with the following options: *RESAMPLE = True* (denoted by R, solid line) or *RESAMPLE = False* (without R, dotted line) for 60 independent runs. (**Top Left**) When $c_{stop} = 10$, ABCDP and non-private ABC (black trace) achieved the highest accuracy (lowest MSE) at the smallest $\epsilon_{abc}$. Notice that ABCDP with *RESAMPLE = False* (dotted) outperformed ABCDP with *RESAMPLE = True* (solid) for the same privacy tolerance ($\epsilon_{total}$) at small values of $\epsilon_{abc}$. (**Top Right**) MSE for $c_{stop} = 100$ at different values of $\epsilon_{abc}$; (**Bottom Left**) MSE for $c_{stop} = 1000$ at different values of $\epsilon_{abc}$. We can observe that when $\epsilon_{abc}$ is large, non-private ABC (black) marginally outperforms ABCDP (gray) due to the excessive noise added in ABCDP.

#### *6.2. Coronavirus Outbreak Data*

In this experiment, we modelled the coronavirus outbreak in the Netherlands using a polynomial model with four parameters $a_0, a_1, a_2, a_3$, which we aimed to infer, where:

$$y(t) = a\_3 + a\_2t + a\_1t^2 + a\_0t^3. \tag{12}$$

The observed data (https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide, accessed on 10 October 2020) are the number of cases of the coronavirus outbreak from 27 February to 17 March 2020, which amounts to 18 datapoints (*N* = 18). This experiment raises privacy concerns, as each datapoint is a count of the individuals who are COVID-positive at each time. The goal is to identify the approximate posterior distribution $\tilde{P}(a_0, a_1, a_2, a_3|y^*)$ over these parameters, given the set of observations.

Recalling from Figure 1 that a small dataset worsens the privacy–accuracy trade-off, the inference is restricted to a small number of posterior samples (we chose *c* = 5), since the number of datapoints in this dataset is extremely limited. We used the same prior distribution $a_i \sim \mathcal{N}(0, 1)$ for all *i* = 0, 1, 2, 3. We drew 50,000 samples from the Gaussian prior and ran our ABCDP algorithm with $\epsilon_{total} = \{13, 22, 44\}$ and $\epsilon_{abc} = 0.1$, as shown in Figure 4.
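A minimal sketch of the simulator in (12) with the stated N(0, 1) priors; the function and variable names are illustrative, not from the authors' code:

```python
import numpy as np

def simulate_counts(a, t):
    """Cubic trend y(t) = a3 + a2*t + a1*t^2 + a0*t^3 from Equation (12);
    a = (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = a
    return a3 + a2 * t + a1 * t**2 + a0 * t**3

rng = np.random.default_rng(0)
t = np.arange(18)                    # N = 18 daily observations
a_prior = rng.normal(0, 1, size=4)   # one draw a_i ~ N(0, 1) from the prior
y_sim = simulate_counts(a_prior, t)
assert y_sim.shape == (18,)
```

Each prior draw yields one candidate trajectory, which ABCDP compares to the observed counts through the noisy accept/reject step of Algorithm 1.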

**Figure 4.** COVID-19 outbreak data (*N* = 18) and simulated data under different privacy guarantees. Red dots show the observed data, and gray dots show simulated data drawn from the 5 posterior samples accepted in each case. The blue crosses are simulated data given the posterior mean in each case: (**Top left**) simulated data from non-private ABC; (**Top right**) simulated data from ABCDP with $\epsilon_{total} = 44$ are relatively well aligned, considering the extremely small size of the data. Note that we use a different scale for the left and right plots for better visibility; if we used the same y scale in both plots, the simulated and observed points would not be distinguishable in the left plot. (**Bottom left**) The simulated data given 5 posterior samples exhibit a large variance when $\epsilon_{total} = 22$; (**Bottom right**) when $\epsilon_{total} = 13$, the simulated data exhibit an excessively large variance.

## *6.3. Modeling Tuberculosis (TB) Outbreak Using Stochastic Birth–Death Models*

In this experiment, we used stochastic birth–death models to model a tuberculosis (TB) outbreak. There are four parameters that we aim to infer, which go into the communicable disease outbreak simulator as inputs: the burden rate *β*, the transmission rate $t_1$, and the reproductive numbers $R_1$ and $R_2$. The goal is to identify the approximate posterior distribution $\tilde{p}(R_1, t_1, R_2, \beta|y^*)$ over these parameters given a set of observations. Please refer to Section 3 in [28] for a description of the birth–death process of the model. We used the same prior distributions for the four parameters as in [28]: $\beta \sim \mathcal{N}(200, 30)$, $R_1 \sim \text{Unif}(1.01, 20)$, $R_2 \sim \text{Unif}(0.01, (1 - 0.05R_1)/0.95)$, $t_1 \sim \text{Unif}(0.01, 30)$.

To illustrate the privacy–accuracy trade-off, we first generated two sets of observations $y^*$ (*n* = 100 and *n* = 1000) from some *true* model parameters (shown as black bars in Figure 5). We then tested our ABCDP algorithm with a privacy level $\epsilon_{total} = 1$. We used the summary statistics described in Table 1 of [28] and a weighted L2 distance as *ρ*, as done in [28], together with $\epsilon_{abc} = 150$. Since the sensitivity is not bounded in this case, we imposed artificial boundedness by clipping the distance at *C* = 200, i.e., using $\min[\rho(Y_t, Y^*), C]$.

As an error metric, we computed the mean absolute distance between each posterior mean and the true parameter values. The top row in Figure 5 shows that the mean of the prior (red) is far from the true values (black) that we chose. As we increase the data size from *n* = 100 (middle) to *n* = 1000 (bottom), the distance between the true values and the estimates shrinks, as reflected in the error decreasing from 4.71 to 2.20 for RESAMPLE = True and from 4.51 to 2.10 for RESAMPLE = False.

**Figure 5.** Posterior samples for modeling tuberculosis (TB) outbreak. In all ABCDP methods, we set *total* = 1. True values in black. Mean of samples in red: (R) indicates ABCDP with Resampling = True. (**1st row**): Histogram of 50 samples drawn from the prior (we used the same prior as [28]); (**2nd row**): 10 posterior samples from ABCDP with (R) given *n* = 100 observations; (**3rd row**): 10 posterior samples from ABCDP without (R) given *n* = 100 observations; (**4th row**): 10 posterior samples from ABCDP with (R) given *n* = 1000 observations; and (**5th row**): 10 posterior samples from ABCDP without (R) given *n* = 1000 observations. The distance between the black bar (true) and red bar (estimate) reduces as the size of data increases from 100 to 1000. ABCDP with Resampling=False performs better regardless of the data size.

#### **7. Summary and Discussion**

We presented the ABCDP algorithm by combining DP with ABC. Our method outputs differentially private binary indicators, yielding differentially private posterior samples. To analyze the proposed algorithm, we derived the probability of flip from the rejection ABC's indicator to the ABCDP's indicator, as well as the average error bound of the posterior expectation.

We showed experimental results that output a relatively small number of posterior samples. This is due to the fact that the cumulative privacy loss increases linearly with the number of posterior samples (i.e., *c*) that our algorithm outputs. For a large-sized dataset (i.e., *N* is large), one can still increase the number of posterior samples while providing a reasonable level of privacy guarantee. However, for a small-sized dataset (i.e., *N* is small), a more refined privacy composition (e.g., [29]) would be necessary to keep the cumulative privacy loss relatively small, at the expense of providing an *approximate* DP guarantee rather than the pure DP guarantee that ABCDP provides.

When we presented our work to the ABC community, we often received the question of whether ABCDP could be applied to other types of ABC algorithms, such as the sequential Monte Carlo algorithm, which outputs the significance (weight) of each proposal sample rather than an accept/reject decision as in rejection ABC. Directly applying the current form of ABCDP to these algorithms is not possible; however, applying the Gaussian mechanism to the significance of each proposal sample can guarantee differential privacy for the output of the sequential Monte Carlo algorithm. The cumulative privacy loss would then be relatively large, as it becomes a function of the total number of proposal samples, whether or not they are retained as posterior samples.

A natural by-product of ABCDP is differentially private synthetic data, as the simulator is a public tool that anybody can run and hence differentially private posterior samples suffice for differentially private synthetic data without any further privacy cost. Applying ABCDP to generate complex datasets is an intriguing future direction.

**Author Contributions:** M.P. and W.J. contributed to conceptualization and methodology development. M.P. and M.V. contributed to software development. All three authors contributed to writing the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** Gips Schüle Foundation and the Institutional Strategy of the University of Tübingen (ZUK63).

**Data Availability Statement:** A publicly available dataset was analyzed in this study. These data can be found here: https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide.

**Conflicts of Interest:** The authors declare no conflict of interest.

**Note:** Our code is available at https://github.com/ParkLabML/ABCDP.

## **Appendix A. Proof of Theorem 1**

**Proof of Theorem 1.** *Case I: RESAMPLE* = *True.* We prove the case of *c* = 1 first. The case of *c* > 1 is a *c*-composition of the case of *c* = 1, where the privacy loss linearly increases with *c*.

Given any neighboring datasets $Y^*$ and $Y^{*\prime}$ of size *N* and any dataset *Y*, assume that *ρ* is such that $0 < \sup_{(Y^*, Y^{*\prime}), Y} |\rho(Y^*, Y) - \rho(Y^{*\prime}, Y)| \le \Delta_\rho < \infty$ and that *ρ* is bounded by $B_\rho$.

Let *A* denote the random variable representing the output of Algorithm 1 given $(\{(\theta_t, Y_t)\}_{t=1}^T, Y^*, \rho, \epsilon_{abc}, \epsilon)$, and $A'$ the random variable representing the output given $(\{(\theta_t, Y_t)\}_{t=1}^T, Y^{*\prime}, \rho, \epsilon_{abc}, \epsilon)$. The output of the algorithm is some realization of these variables, $\tau \in \{0, 1\}^k$, where $0 < k \le T$, $\tau_t = 0$ for all $t < k$, and $\tau_k = 1$. For the rest of the proof, we fix arbitrary values of $\nu_1, \ldots, \nu_{k-1}$ and take probabilities over the randomness of $\nu_k$ and $\hat{\epsilon}_{abc}$. We define the deterministic quantity ($\nu_1, \ldots, \nu_{k-1}$ are fixed):

$$g(Y^*) = \min_{t < k} \big( \rho(Y_t, Y^*) + \nu_t \big)$$

that represents the minimum noised value of the distance evaluated on any dataset *Y*∗.

Let $\mathbb{P}[\hat{\varepsilon}_{\rm abc} = a]$ be the pdf of $\hat{\varepsilon}_{\rm abc}$ evaluated at $a$, let $\mathbb{P}[\nu_k = v]$ be the pdf of $\nu_k$ evaluated at $v$, and let $\mathbb{1}[x]$ denote the indicator function of event $x$. We have:

$$\mathbb{P}_{\hat{\varepsilon}_{\rm abc}, \nu_k}[A = \tau] = \mathbb{P}\big[\hat{\varepsilon}_{\rm abc} < g(Y^*) \ \text{and} \ \rho(Y_k, Y^*) + \nu_k \leq \hat{\varepsilon}_{\rm abc}\big] = \mathbb{P}\big[\hat{\varepsilon}_{\rm abc} \in [\rho(Y_k, Y^*) + \nu_k,\, g(Y^*))\big]$$

$$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbb{P}[\nu_k = v]\, \mathbb{P}[\hat{\varepsilon}_{\rm abc} = a]\, \mathbb{1}\big[a \in [\rho(Y_k, Y^*) + v,\, g(Y^*))\big]\, dv\, da.$$

Now, we define the following variables:

$$\hat{a} = a + g(Y^*) - g(Y^{*\prime}),$$

$$\hat{v} = v + g(Y^*) - g(Y^{*\prime}) + \rho(Y_k, Y^{*\prime}) - \rho(Y_k, Y^*).$$

We know that for each pair $(Y^*, Y^{*\prime})$, $\rho$ is $\Delta_\rho$-sensitive, and hence the quantity $g(Y^*)$ is $\Delta_\rho$-sensitive as well. In this way, we obtain $|\hat{a} - a| \leq \Delta_\rho$ and $|\hat{v} - v| \leq 2\Delta_\rho$. Applying this change of variables, we have:

$$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbb{P}[\nu_k = \hat{v}]\, \mathbb{P}[\hat{\varepsilon}_{\rm abc} = \hat{a}]\, \mathbb{1}\big[a + g(Y^*) - g(Y^{*\prime}) \in [v + g(Y^*) - g(Y^{*\prime}) + \rho(Y_k, Y^{*\prime}),\, g(Y^*))\big]\, dv\, da$$

$$= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbb{P}[\nu_k = \hat{v}]\, \mathbb{P}[\hat{\varepsilon}_{\rm abc} = \hat{a}]\, \mathbb{1}\big[a \in [v + \rho(Y_k, Y^{*\prime}),\, g(Y^{*\prime}))\big]\, dv\, da$$

$$\leq \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left(\frac{\epsilon}{2}\right) \mathbb{P}[\nu_k = v]\, \exp\left(\frac{\epsilon}{2}\right) \mathbb{P}[\hat{\varepsilon}_{\rm abc} = a]\, \mathbb{1}\big[a \in [v + \rho(Y_k, Y^{*\prime}),\, g(Y^{*\prime}))\big]\, dv\, da$$

$$\leq \exp(\epsilon) \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbb{P}[\nu_k = v]\, \mathbb{P}[\hat{\varepsilon}_{\rm abc} = a]\, \mathbb{1}\big[a \in [v + \rho(Y_k, Y^{*\prime}),\, g(Y^{*\prime}))\big]\, dv\, da = \exp(\epsilon)\, \mathbb{P}_{\hat{\varepsilon}_{\rm abc}, \nu_k}[A' = \tau],$$

where the inequality comes from the bounds established above (i.e., $|\hat{a} - a| \leq \Delta_\rho$ and $|\hat{v} - v| \leq 2\Delta_\rho$) and the form of the density of the Laplace distribution.
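The two factors of $\exp(\epsilon/2)$ follow from the shift property of the Laplace density: translating its argument by $\delta$ changes the density by at most a factor $\exp(|\delta|/s)$, where $s$ is the scale. A quick numerical check (the noise scale below is chosen for illustration as $4\Delta_\rho/\epsilon$, which turns the shift bound $|\hat{v} - v| \leq 2\Delta_\rho$ into a factor of $\exp(\epsilon/2)$; the paper's Algorithm 1 fixes the actual calibration):

```python
import math

def laplace_pdf(x, scale):
    """Density of Lap(scale) at x."""
    return math.exp(-abs(x) / scale) / (2.0 * scale)

# Illustrative calibration: with |v_hat - v| <= 2*Delta_rho and
# nu_k ~ Lap(4*Delta_rho / eps), the density ratio is at most exp(eps/2).
delta_rho, eps = 0.3, 1.0
scale = 4.0 * delta_rho / eps

for v in [-1.0, 0.0, 0.5, 2.0]:
    for shift in (-2.0 * delta_rho, 2.0 * delta_rho):
        ratio = laplace_pdf(v + shift, scale) / laplace_pdf(v, scale)
        assert ratio <= math.exp(eps / 2.0) + 1e-12
```

The ratio equals $\exp((|v| - |v + \delta|)/s) \leq \exp(|\delta|/s)$, which is exactly the bound used twice above.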

*Case II: RESAMPLE* = *False.* In this case, the proof follows that of Algorithm 1 in [26], with the exception that positive events for [26] become negative events for us and vice versa, since we accept when the value falls below a threshold, whereas [26] accepts when it falls above a threshold.

## **Appendix B. Proof of Lemma 1**

**Proof of Lemma 1.** We will establish $\Delta_\rho$ when $\rho$ is the MMD. Recall that $(Y^*, Y^{*\prime})$ is a pair of neighboring datasets and $Y$ is an arbitrary dataset. Without loss of generality, assume that $Y^* = \{x_1, \ldots, x_N\}$ and $Y^{*\prime} = \{x'_1, \ldots, x'_N\}$ with $x_i = x'_i$ for all $i = 1, \ldots, N-1$, and $Y = \{y_1, \ldots, y_m\}$. We start with:

$$\begin{split} &\sup_{(Y^*, Y^{*\prime}), Y} \left| \rho(Y^*, Y) - \rho(Y^{*\prime}, Y) \right| \\ &= \sup_{(Y^*, Y^{*\prime}), Y} \left| \widehat{\mathrm{MMD}}(Y^*, Y) - \widehat{\mathrm{MMD}}(Y^{*\prime}, Y) \right| \\ &= \sup_{(Y^*, Y^{*\prime}), Y} \left| \left\| \frac{1}{N}\sum_{i=1}^{N}\phi(x_i) - \frac{1}{m}\sum_{j=1}^{m}\phi(y_j) \right\|_{\mathcal{H}} - \left\| \frac{1}{N}\sum_{i=1}^{N}\phi(x'_i) - \frac{1}{m}\sum_{j=1}^{m}\phi(y_j) \right\|_{\mathcal{H}} \right| \\ &\stackrel{(a)}{\leq} \sup_{(Y^*, Y^{*\prime})} \left\| \frac{1}{N}\sum_{i=1}^{N}\phi(x_i) - \frac{1}{N}\sum_{i=1}^{N}\phi(x'_i) \right\|_{\mathcal{H}} \\ &= \sup_{(x_N, x'_N)} \left\| \frac{1}{N}\phi(x_N) - \frac{1}{N}\phi(x'_N) \right\|_{\mathcal{H}} \\ &= \sup_{(x_N, x'_N)} \frac{1}{N}\sqrt{k(x_N, x_N) + k(x'_N, x'_N) - 2k(x_N, x'_N)} \\ &\leq \frac{2}{N}\sqrt{B_k}, \end{split}$$

where at (*a*), we use the reverse triangle inequality. Furthermore:

$$\begin{split} &\sup_{Y^*, Y} \rho(Y^*, Y) \\ &\leq \sup_{Y^*, Y} \sqrt{ \left\| \frac{1}{N}\sum_{i=1}^{N}\phi(x_i) - \frac{1}{m}\sum_{j=1}^{m}\phi(y_j) \right\|_{\mathcal{H}}^2 } \\ &= \sup_{Y^*, Y} \sqrt{ \frac{1}{N^2}\sum_{i,j=1}^{N} k(x_i, x_j) + \frac{1}{m^2}\sum_{i,j=1}^{m} k(y_i, y_j) - \frac{2}{Nm}\sum_{i=1}^{N}\sum_{j=1}^{m} k(x_i, y_j) } \\ &\leq \sqrt{B_k + B_k + 2B_k} = 2\sqrt{B_k}. \end{split}$$
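Both bounds of Lemma 1 can be spot-checked by Monte Carlo with a bounded kernel. The sketch below uses a Gaussian RBF kernel (for which $B_k = 1$); the dataset sizes and dimensions are illustrative:

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix; bounded, with B_k = 1."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd(X, Y):
    """Biased empirical MMD between samples X and Y."""
    val = (gaussian_kernel(X, X).mean()
           + gaussian_kernel(Y, Y).mean()
           - 2.0 * gaussian_kernel(X, Y).mean())
    return float(np.sqrt(max(val, 0.0)))

rng = np.random.default_rng(0)
N, m, B_k = 50, 40, 1.0
Ystar = rng.normal(size=(N, 2))
Yprime = Ystar.copy()
Yprime[-1] = rng.normal(size=2) + 5.0    # neighboring dataset: last record replaced
Y = rng.normal(size=(m, 2))

# Lemma 1: sensitivity at most (2/N) sqrt(B_k); the MMD itself at most 2 sqrt(B_k).
assert abs(mmd(Ystar, Y) - mmd(Yprime, Y)) <= 2.0 / N * np.sqrt(B_k) + 1e-9
assert mmd(Ystar, Y) <= 2.0 * np.sqrt(B_k)
```

Since the supremum bounds hold for every pair of neighboring datasets, any single draw must satisfy them; the check above exercises one such draw.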

## **Appendix C. Proof of Lemma 2**

**Proof of Lemma 2.** The PDF is computed from the convolution of two PDFs:

$$f_{m_t - \nu_t}(z) = \int_{-\infty}^{\infty} f_{m_t}(x) f_{\nu_t}(x - z)\, dx, \tag{A2}$$

where $f_{m_t}(x) = \frac{1}{2b}\exp\left(-\frac{|x|}{b}\right)$ and $f_{\nu_t}(y) = \frac{1}{4b}\exp\left(-\frac{|y|}{2b}\right)$:

$$f_{m_t - \nu_t}(z) = \frac{1}{8b^2} \int_{-\infty}^{\infty} \exp\left(-\frac{|x|}{b} - \frac{|x - z|}{2b}\right) dx \tag{A3}$$

*For case z* ≥ 0*:*

$$\begin{split} f_{m_t - \nu_t}(z) &= \frac{1}{8b^2}\int_{-\infty}^{0} \exp\left(\frac{x}{b} + \frac{x-z}{2b}\right) dx + \frac{1}{8b^2}\int_{0}^{z} \exp\left(-\frac{x}{b} + \frac{x-z}{2b}\right) dx \\ &\quad + \frac{1}{8b^2}\int_{z}^{\infty} \exp\left(-\frac{x}{b} - \frac{x-z}{2b}\right) dx, \end{split} \tag{A4}$$

$$= \frac{1}{8b^2}\int_{-\infty}^{0} \exp\left(\frac{3x-z}{2b}\right) dx + \frac{1}{8b^2}\int_{0}^{z} \exp\left(\frac{-x-z}{2b}\right) dx + \frac{1}{8b^2}\int_{z}^{\infty} \exp\left(\frac{-3x+z}{2b}\right) dx \tag{A5}$$

$$= \frac{\exp\left(\frac{-z}{2b}\right)}{8b^2}\int_{-\infty}^{0} \exp\left(\frac{3x}{2b}\right) dx + \frac{\exp\left(\frac{-z}{2b}\right)}{8b^2}\int_{0}^{z} \exp\left(\frac{-x}{2b}\right) dx + \frac{\exp\left(\frac{z}{2b}\right)}{8b^2}\int_{z}^{\infty} \exp\left(\frac{-3x}{2b}\right) dx \tag{A6}$$

$$= \frac{\exp\left(\frac{-z}{2b}\right)}{8b^2}\,\frac{2b}{3} + \frac{\exp\left(\frac{-z}{2b}\right)}{8b^2}\, 2b\left(1 - \exp\left(\frac{-z}{2b}\right)\right) + \frac{\exp\left(\frac{z}{2b}\right)}{8b^2}\,\frac{2b}{3}\exp\left(\frac{-3z}{2b}\right), \tag{A7}$$

$$= \frac{1}{12b}\left[\exp\left(\frac{-z}{2b}\right) + 3\exp\left(\frac{-z}{2b}\right)\left(1 - \exp\left(\frac{-z}{2b}\right)\right) + \exp\left(\frac{-z}{b}\right)\right], \tag{A8}$$

$$= \frac{1}{12b}\left[4\exp\left(\frac{-z}{2b}\right) - 2\exp\left(\frac{-z}{b}\right)\right], \tag{A9}$$

$$= \frac{1}{6b}\left[2\exp\left(\frac{-z}{2b}\right) - \exp\left(\frac{-z}{b}\right)\right]. \tag{A10}$$

*For case z* < 0*:*

$$\begin{split} f_{m_t - \nu_t}(z) &= \frac{1}{8b^2}\int_{-\infty}^{z} \exp\left(\frac{x}{b} + \frac{x-z}{2b}\right) dx + \frac{1}{8b^2}\int_{z}^{0} \exp\left(\frac{x}{b} - \frac{x-z}{2b}\right) dx \\ &\quad + \frac{1}{8b^2}\int_{0}^{\infty} \exp\left(-\frac{x}{b} - \frac{x-z}{2b}\right) dx, \end{split} \tag{A11}$$

$$= \frac{1}{8b^2}\int_{-\infty}^{z} \exp\left(\frac{3x-z}{2b}\right) dx + \frac{1}{8b^2}\int_{z}^{0} \exp\left(\frac{x+z}{2b}\right) dx + \frac{1}{8b^2}\int_{0}^{\infty} \exp\left(\frac{-3x+z}{2b}\right) dx \tag{A12}$$

$$= \frac{\exp\left(\frac{-z}{2b}\right)}{8b^2}\int_{-\infty}^{z} \exp\left(\frac{3x}{2b}\right) dx + \frac{\exp\left(\frac{z}{2b}\right)}{8b^2}\int_{z}^{0} \exp\left(\frac{x}{2b}\right) dx + \frac{\exp\left(\frac{z}{2b}\right)}{8b^2}\int_{0}^{\infty} \exp\left(\frac{-3x}{2b}\right) dx \tag{A13}$$

$$= \frac{\exp\left(\frac{-z}{2b}\right)}{8b^2}\,\frac{2b}{3}\exp\left(\frac{3z}{2b}\right) + \frac{\exp\left(\frac{z}{2b}\right)}{8b^2}\, 2b\left(1 - \exp\left(\frac{z}{2b}\right)\right) + \frac{\exp\left(\frac{z}{2b}\right)}{8b^2}\,\frac{2b}{3}, \tag{A14}$$

$$= \frac{1}{12b}\left[\exp\left(\frac{z}{b}\right) - 3\exp\left(\frac{z}{2b}\right)\left(\exp\left(\frac{z}{2b}\right) - 1\right) + \exp\left(\frac{z}{2b}\right)\right], \tag{A15}$$

$$= \frac{1}{12b}\left[-2\exp\left(\frac{z}{b}\right) + 4\exp\left(\frac{z}{2b}\right)\right], \tag{A16}$$

$$= \frac{1}{6b}\left[2\exp\left(\frac{z}{2b}\right) - \exp\left(\frac{z}{b}\right)\right]. \tag{A17}$$

With the obtained PDF $f_Z(z) = \frac{1}{6b}\left[2\exp\left(-\frac{|z|}{2b}\right) - \exp\left(-\frac{|z|}{b}\right)\right]$ for $Z := m_t - \nu_t$, it is straightforward to compute $G_b(a) := \int_a^{\infty} f_Z(z)\, dz = \frac{1}{6}\left[4\exp\left(-\frac{a}{2b}\right) - \exp\left(-\frac{a}{b}\right)\right]$ for $a \geq 0$. In other words, $G_b(a) = 1 - F_Z(a)$ for $a \geq 0$, where $F_Z$ denotes the CDF of $Z$.

To show that the CDF of $Z$ is $F_Z(a) = H[a] + (1 - 2H[a])G_b(|a|)$, where $H[a]$ is the Heaviside step function, we note that the density $f_Z(z)$ is an even function, i.e., $f_Z(z) = f_Z(-z)$ for any $z$. It follows that if $a < 0$, then $F_Z(a) = \mathbb{P}(Z \leq a) = \mathbb{P}(Z \geq -a) = G_b(-a)$. This means that:

$$1 - F\_Z(a) = \begin{cases} G\_b(a) & \text{if } a \ge 0, \\ 1 - G\_b(-a) & \text{if } a < 0, \end{cases}$$

or equivalently:

$$F\_Z(a) = \begin{cases} 1 - G\_b(a) & \text{if } a \ge 0, \\ G\_b(-a) & \text{if } a < 0. \end{cases}$$

More concisely:

$$\begin{aligned} F\_{\mathbb{Z}}(a) &= (1 - G\_b(|a|)) \mathbb{I}[a \ge 0] + G\_b(|a|) \mathbb{I}[a < 0], \\ &= \mathbb{I}[a \ge 0] + (\mathbb{I}[a < 0] - \mathbb{I}[a \ge 0]) G\_b(|a|) \\ &\overset{(a)}{=} H[a] + (1 - 2H[a]) G\_b(|a|), \end{aligned}$$

where at (a), we use $\mathbb{I}[a < 0] - \mathbb{I}[a \geq 0] = 1 - 2H[a]$.
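The density and CDF derived in Lemma 2 can be verified against simulated draws of $Z = m_t - \nu_t$; the noise scale $b$ below is arbitrary and chosen only for the check:

```python
import numpy as np

def G_b(a, b):
    """Upper-tail integral of f_Z over [a, inf) for a >= 0 (Lemma 2)."""
    return (4.0 * np.exp(-a / (2.0 * b)) - np.exp(-a / b)) / 6.0

def F_Z(a, b):
    """CDF of Z = m_t - nu_t in the Heaviside form of Lemma 2."""
    H = 1.0 if a >= 0 else 0.0
    return H + (1.0 - 2.0 * H) * G_b(abs(a), b)

rng = np.random.default_rng(1)
b = 0.7                                   # illustrative noise scale
n = 200_000
Z = rng.laplace(scale=b, size=n) - rng.laplace(scale=2.0 * b, size=n)

# Empirical CDF of Z should match the closed form at every test point.
for a in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs((Z <= a).mean() - F_Z(a, b)) < 0.01
```

At $a = 0$ the closed form gives $1 - G_b(0) = 1 - \tfrac{3}{6} = \tfrac{1}{2}$, consistent with the symmetry of $Z$.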

## **Appendix D. Proof of Proposition 1**

**Proof of Proposition 1.** Using the pdf $f_Z$ obtained in Lemma 2, we can compute the following probabilities:

$$\mathbb{P}[\tilde{\tau}_t = 1 \,|\, \tau_t = 0] \tag{A18}$$

$$= \mathbb{P}[0 \leq \rho_t - \varepsilon_{\rm abc} \leq Z], \tag{A19}$$

$$= \int_{\rho_t - \varepsilon_{\rm abc}}^{\infty} f(z)\, dz, \quad \text{where } \rho_t - \varepsilon_{\rm abc} \geq 0, \tag{A20}$$

$$= \int_{\rho_t - \varepsilon_{\rm abc}}^{\infty} \frac{1}{6b}\left[2\exp\left(-\frac{|z|}{2b}\right) - \exp\left(-\frac{|z|}{b}\right)\right] dz, \quad \text{by definition of } f(z), \tag{A21}$$

$$= \int_{\rho_t - \varepsilon_{\rm abc}}^{\infty} \frac{1}{6b}\left[2\exp\left(-\frac{z}{2b}\right) - \exp\left(-\frac{z}{b}\right)\right] dz, \quad \text{because } \rho_t - \varepsilon_{\rm abc} \geq 0, \tag{A22}$$

$$= \frac{1}{6b}\left[4b\exp\left(-\frac{\rho_t - \varepsilon_{\rm abc}}{2b}\right) - b\exp\left(-\frac{\rho_t - \varepsilon_{\rm abc}}{b}\right)\right], \tag{A23}$$

$$= \frac{1}{6}\left[4\exp\left(-\frac{\rho_t - \varepsilon_{\rm abc}}{2b}\right) - \exp\left(-\frac{\rho_t - \varepsilon_{\rm abc}}{b}\right)\right], \quad \text{where } \rho_t - \varepsilon_{\rm abc} \geq 0, \tag{A24}$$

and:

$$\begin{split} &\mathbb{P}[\tilde{\tau}_t = 0 \,|\, \tau_t = 1] \\ &= \mathbb{P}[Z \leq \rho_t - \varepsilon_{\rm abc} \leq 0], \end{split} \tag{A25}$$

$$= \int_{-\infty}^{\rho_t - \varepsilon_{\rm abc}} f(z)\, dz, \quad \text{where } \rho_t - \varepsilon_{\rm abc} \leq 0, \tag{A26}$$

$$= \int_{-\infty}^{\rho_t - \varepsilon_{\rm abc}} \frac{1}{6b}\left[2\exp\left(\frac{z}{2b}\right) - \exp\left(\frac{z}{b}\right)\right] dz, \tag{A27}$$

$$= \frac{1}{6b}\left[4b\exp\left(\frac{\rho_t - \varepsilon_{\rm abc}}{2b}\right) - b\exp\left(\frac{\rho_t - \varepsilon_{\rm abc}}{b}\right)\right], \tag{A28}$$

$$= \frac{1}{6}\left[4\exp\left(\frac{\rho_t - \varepsilon_{\rm abc}}{2b}\right) - \exp\left(\frac{\rho_t - \varepsilon_{\rm abc}}{b}\right)\right], \quad \text{where } \rho_t - \varepsilon_{\rm abc} \leq 0. \tag{A29}$$

So:

$$\begin{split} \mathbb{P}[\tilde{\tau}_t \neq \tau_t \,|\, \tau_t] &= \begin{cases} \mathbb{P}[\tilde{\tau}_t = 1 \,|\, \tau_t = 0], & \text{if } \rho_t \geq \varepsilon_{\rm abc} \\ \mathbb{P}[\tilde{\tau}_t = 0 \,|\, \tau_t = 1], & \text{otherwise} \end{cases} \\ &= \begin{cases} \frac{1}{6}\left[4\exp\left(-\frac{\rho_t - \varepsilon_{\rm abc}}{2b}\right) - \exp\left(-\frac{\rho_t - \varepsilon_{\rm abc}}{b}\right)\right], & \text{if } \rho_t \geq \varepsilon_{\rm abc} \\ \frac{1}{6}\left[4\exp\left(\frac{\rho_t - \varepsilon_{\rm abc}}{2b}\right) - \exp\left(\frac{\rho_t - \varepsilon_{\rm abc}}{b}\right)\right], & \text{otherwise}. \end{cases} \end{split} \tag{A30}$$

The two cases can be combined with the use of an absolute value to give the result.
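Proposition 1 can be sanity-checked by simulating the noisy accept/reject decision and comparing the empirical flip rate with the closed form; all parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
b, eps_abc = 0.5, 1.0                      # illustrative noise scale and ABC threshold
n = 200_000

for rho_t in [0.3, 1.0, 1.8]:              # distances below, at, and above the threshold
    m = rng.laplace(scale=b, size=n)        # m_t ~ Lap(b), added to the threshold
    nu = rng.laplace(scale=2.0 * b, size=n) # nu_t ~ Lap(2b), added to the distance
    tau = 1.0 if rho_t <= eps_abc else 0.0  # noise-free decision tau_t
    tau_noisy = (rho_t + nu <= eps_abc + m) # noisy decision tau~_t
    flip_rate = (tau_noisy != tau).mean()
    d = abs(rho_t - eps_abc)
    closed_form = (4.0 * np.exp(-d / (2.0 * b)) - np.exp(-d / b)) / 6.0
    assert abs(flip_rate - closed_form) < 0.01
```

At the boundary $\rho_t = \varepsilon_{\rm abc}$ the closed form gives $\tfrac{3}{6} = \tfrac{1}{2}$, matching the symmetry of $Z = m_t - \nu_t$.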

## **Appendix E. Proof of Theorem 2**

**Proof of Theorem 2.** Let $H(x)$ be the Heaviside step function. Recall from our algorithm that each accepted sample $(\theta_t, Y_t)$ is associated with two independent noise realizations: $m_t \sim \mathrm{Lap}(b)$ (i.e., $\hat{\varepsilon}_{\rm abc} = \varepsilon_{\rm abc} + m_t$) and $\nu_t \sim \mathrm{Lap}(2b)$ (added to $\rho(Y^*, Y_t)$). With this notation, we have $\tilde{\tau}_t = H[\varepsilon_{\rm abc} - \rho(Y_t, Y^*) + m_t - \nu_t]$ for $t = 1, \ldots, T$. Similarly, $\tau_t := H[\varepsilon_{\rm abc} - \rho(Y_t, Y^*)]$. For brevity, we define $\rho_t := \rho(Y_t, Y^*)$. It follows that $\tilde{\tau}_t \sim \mathrm{Bernoulli}(p_t)$, where $p_t := \mathbb{P}(m_t - \nu_t > \rho_t - \varepsilon_{\rm abc}) = \mathbb{P}(\tilde{\tau}_t = 1)$.

**Proof of the first claim:** we start by establishing an upper bound for:

$$\begin{split} \left\| \frac{1}{c}\sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t - \frac{1}{c'}\sum_{t=1}^{T} f(\theta_t)\tau_t \right\|_2 &= \left\| \frac{1}{c}\sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t - \frac{1}{c'}\sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t + \frac{1}{c'}\sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t - \frac{1}{c'}\sum_{t=1}^{T} f(\theta_t)\tau_t \right\|_2 \\ &\leq \left| \frac{1}{c} - \frac{1}{c'} \right| \left\| \sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t \right\|_2 + \frac{1}{c'}\left\| \sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t - \sum_{t=1}^{T} f(\theta_t)\tau_t \right\|_2 \\ &= \frac{1}{c'}\left| c' - c \right| \frac{1}{c}\left\| \sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t \right\|_2 + \frac{1}{c'}\left\| \sum_{t=1}^{T} f(\theta_t)(\tilde{\tau}_t - \tau_t) \right\|_2 \\ &\leq \frac{1}{c'}\left| \sum_{t=1}^{T}\tau_t - \sum_{t=1}^{T}\tilde{\tau}_t \right| \frac{1}{c}\left\| \sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t \right\|_2 + \frac{1}{c'}\sum_{t=1}^{T} \| f(\theta_t) \|_2 |\tilde{\tau}_t - \tau_t| \\ &\leq \frac{1}{c'}\sum_{t=1}^{T} |\tilde{\tau}_t - \tau_t| \, \frac{1}{c}\left\| \sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t \right\|_2 + \frac{K_T}{c'}\sum_{t=1}^{T} |\tilde{\tau}_t - \tau_t|, \end{split} \tag{A31}$$

where $K_T := \max_{t=1,\ldots,T} \| f(\theta_t) \|_2$. Consider $\frac{1}{c}\left\| \sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t \right\|_2$. We can show that it is bounded by $K_T$ as follows:

$$\frac{1}{c}\left\| \sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t \right\|_2 \leq \frac{1}{c}\sum_{t=1}^{T} \| f(\theta_t) \|_2\, \tilde{\tau}_t \leq \frac{K_T}{c}\sum_{t=1}^{T} \tilde{\tau}_t = K_T.$$

Combining this bound with (A31), we have:

$$\left\| \frac{1}{c}\sum_{t=1}^{T} f(\theta_t)\,\tilde{\tau}_t - \frac{1}{c'}\sum_{t=1}^{T} f(\theta_t)\,\tau_t \right\|_2 \leq \frac{2K_T}{c'}\sum_{t=1}^{T} \left| \tilde{\tau}_t - \tau_t \right|. \tag{A32}$$

We will need to characterize the distribution of |*τ*˜*<sup>t</sup>* − *τt*|. Let *Zt* := *mt* − *νt*. By Lemma 2, we have:

$$\begin{split} p_t = \mathbb{P}(\tilde{\tau}_t = 1) &= \mathbb{P}(Z_t > \rho_t - \varepsilon_{\rm abc}) = 1 - F_Z(\rho_t - \varepsilon_{\rm abc}) \\ &= 1 - H[\rho_t - \varepsilon_{\rm abc}] + (2H[\rho_t - \varepsilon_{\rm abc}] - 1)G_b(|\rho_t - \varepsilon_{\rm abc}|) \\ &= \pi_t + (1 - 2\pi_t)G_b(|\rho_t - \varepsilon_{\rm abc}|), \end{split}$$

where $\pi_t := 1 - H[\rho_t - \varepsilon_{\rm abc}] = \tau_t$, and the decreasing function $G_b(x) \in (0, \frac{1}{2}]$ for any $x \geq 0$ is defined in Lemma 2. We observe that $|\tilde{\tau}_t - \tau_t| \sim \mathrm{Bernoulli}(q_t)$, where $q_t := \mathbb{P}(\tilde{\tau}_t \neq \tau_t) = (1 - p_t)\tau_t + p_t(1 - \tau_t)$. We can rewrite $q_t$ as

$$\begin{split} q\_t &= \pi\_t + p\_t (1 - 2\pi\_t) \\ &= \pi\_t + [\pi\_t + (1 - 2\pi\_t)G\_b(|\rho\_t - \varepsilon\_{\text{abc}}|)](1 - 2\pi\_t) \\ &= G\_b(|\rho\_t - \varepsilon\_{\text{abc}}|). \end{split}$$

To prove the first claim, we take the expectation on both sides of (A32):

$$\begin{split} \mathbb{E}_{\tilde{\tau}_1, \ldots, \tilde{\tau}_T} \left\| \frac{1}{c}\sum_{t=1}^{T} f(\theta_t)\,\tilde{\tau}_t - \frac{1}{c'}\sum_{t=1}^{T} f(\theta_t)\,\tau_t \right\|_2 &\leq \frac{2K_T}{c'}\, \mathbb{E}_{\tilde{\tau}_t}\left[ \sum_{t=1}^{T} |\tilde{\tau}_t - \tau_t| \right] \\ &= \frac{2K_T}{c'}\, \mu_T, \end{split}$$

where $\mu_T = \mathbb{E}_{\tilde{\tau}_t} \sum_{t=1}^{T} |\tilde{\tau}_t - \tau_t| = \sum_{t=1}^{T} G_b(|\rho_t - \varepsilon_{\rm abc}|)$, and we use the fact that $\mathbb{E}_{\tilde{\tau}_t} |\tilde{\tau}_t - \tau_t| = q_t$. Note that these are $T$ independent, marginal expectations, i.e., they do not depend on the condition that noise is added to the ABC threshold.

**Proof of the second claim:** observe that $G_b(|\rho_t - \varepsilon_{\rm abc}|) \to 0$ as $b \to 0$. The claim follows by noting that $b = O(1/N)$.

**Proof of the third claim:** based on (A32), characterizing the tail bound of $\left\| \frac{1}{c}\sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t - \frac{1}{c'}\sum_{t=1}^{T} f(\theta_t)\tau_t \right\|_2$ amounts to establishing a tail bound on $S_T := \sum_{t=1}^{T} |\tilde{\tau}_t - \tau_t|$. By Markov's inequality:

$$\begin{split} \mathbb{P}(S\_T \le s) &\ge 1 - \mathbb{E}[S\_T]/s \\ &= 1 - \frac{1}{s} \sum\_{t=1}^T G\_b(|\rho\_t - \epsilon\_{abc}|) \\ &= 1 - \frac{1}{s} \sum\_{t=1}^T \frac{1}{6} \left[ 4 \exp\left( -\frac{|\rho\_t - \epsilon\_{abc}|}{2b} \right) - \exp\left( -\frac{|\rho\_t - \epsilon\_{abc}|}{b} \right) \right] \\ &\ge 1 - \frac{2}{3s} \sum\_{t=1}^T \exp\left( -\frac{|\rho\_t - \epsilon\_{abc}|}{2b} \right). \end{split}$$
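The Markov lower bound on $\mathbb{P}(S_T \leq s)$ can be checked by simulating $S_T$ as a sum of independent $\mathrm{Bernoulli}(q_t)$ variables with $q_t = G_b(|\rho_t - \varepsilon_{\rm abc}|)$; the distances $\rho_t$, the noise scale, and the threshold below are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
b, eps_abc, T = 0.5, 1.0, 50
rho = rng.uniform(0.0, 2.0, size=T)       # illustrative distances rho_t
d = np.abs(rho - eps_abc)
q = (4.0 * np.exp(-d / (2.0 * b)) - np.exp(-d / b)) / 6.0  # q_t = G_b(|rho_t - eps_abc|)

# Draw S_T = sum_t |tau~_t - tau_t|, a sum of independent Bernoulli(q_t).
n = 20_000
S_T = (rng.random((n, T)) < q).sum(axis=1)

s = 1.5 * q.sum()                         # a threshold somewhat above E[S_T]
lower = 1.0 - (2.0 / (3.0 * s)) * np.exp(-d / (2.0 * b)).sum()
assert (S_T <= s).mean() >= lower
```

The Markov bound is loose here (it also replaces $\mathbb{E}[S_T]$ by the larger $\tfrac{2}{3}\sum_t \exp(-|\rho_t - \varepsilon_{\rm abc}|/2b)$), so the empirical probability comfortably exceeds it.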

Applying this bound to (A32) gives:

$$\mathbb{P}\left(\frac{2K\_T}{c'}S\_T \le \frac{2K\_T}{c'}s\right) = \mathbb{P}(S\_T \le s).$$

With the reparametrization $a := \frac{2K_T}{c'} s$, so that $s = \frac{ac'}{2K_T}$, we have:

$$\mathbb{P}\left(\frac{2K\_T}{c'}S\_T \le a\right) \ge 1 - \frac{4K\_T}{3ac'} \sum\_{t=1}^T \exp\left(-\frac{|\rho\_t - \epsilon\_{abc}|}{2b}\right),$$


$$\text{Since } \left\| \frac{1}{c}\sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t - \frac{1}{c'}\sum_{t=1}^{T} f(\theta_t)\tau_t \right\|_2 \leq \frac{2K_T}{c'} S_T \text{, we have:}$$

$$\mathbb{P}\left( \left\| \frac{1}{c}\sum_{t=1}^{T} f(\theta_t)\tilde{\tau}_t - \frac{1}{c'}\sum_{t=1}^{T} f(\theta_t)\tau_t \right\|_2 \leq a \right) \geq \mathbb{P}\left( \frac{2K_T}{c'} S_T \leq a \right),$$

which gives the result in the third claim.

## **References**


## *Article* **Variational Message Passing and Local Constraint Manipulation in Factor Graphs**

**İsmail Şenöz <sup>1,\*</sup>, Thijs van de Laar <sup>1</sup>, Dmitry Bagaev <sup>1</sup> and Bert de Vries <sup>1,2</sup>**


**Abstract:** Accurate evaluation of Bayesian model evidence for a given data set is a fundamental problem in model development. Since evidence evaluations are usually intractable, in practice variational free energy (VFE) minimization provides an attractive alternative, as the VFE is an upper bound on negative model log-evidence (NLE). In order to improve tractability of the VFE, it is common to manipulate the constraints in the search space for the posterior distribution of the latent variables. Unfortunately, constraint manipulation may also lead to a less accurate estimate of the NLE. Thus, constraint manipulation implies an engineering trade-off between tractability and accuracy of model evidence estimation. In this paper, we develop a unifying account of constraint manipulation for variational inference in models that can be represented by a (Forney-style) factor graph, for which we identify the Bethe Free Energy as an approximation to the VFE. We derive well-known message passing algorithms from first principles, as the result of minimizing the constrained Bethe Free Energy (BFE). The proposed method supports evaluation of the BFE in factor graphs for model scoring and development of new message passing-based inference algorithms that potentially improve evidence estimation accuracy.

**Keywords:** Bayesian inference; Bethe free energy; factor graphs; message passing; variational free energy; variational inference; variational message passing

## **1. Introduction**

Building models from data is at the core of both science and engineering applications. The search for good models requires a performance measure that scores how well a particular model *m* captures the hidden patterns in a data set *D*. In a Bayesian framework, that measure is the *Bayesian evidence p*(*D*|*m*), i.e., the probability that model *m* would generate *D* if we were to draw data from *m*. The art of modeling is then the iterative process of proposing new model specifications, evaluating the evidence for each model and retaining the model with the most evidence [1].

Unfortunately, Bayesian evidence is intractable for most interesting models. A popular solution to evidence evaluation is provided by *variational* inference, which describes the process of Bayesian evidence evaluation as a (free energy) minimization process, since the variational free energy (VFE) is a tractable upper bound on Bayesian (negative log-)evidence [2]. In practice, the model development process then consists of proposing various candidate models, minimizing VFE for each model and selecting the model with the lowest minimized VFE.

The difference between VFE and negative log-evidence (NLE) is equal to the Kullback–Leibler divergence (KLD) [3] from the (perfect) Bayesian posterior distribution to the variational distribution for the latent variables in the model. The KLD can be interpreted as the cost of conducting variational rather than Bayesian inference. Perfect (Bayesian) inference would lead to zero inference costs (KLD = 0), and the KLD increases as the variational posterior diverges further from the Bayesian posterior. As a result, model

**Citation:** Senöz, ¸ ˙ I.; van de Laar, T.; Bagaev, D.; de Vries, B. Variational Message Passing and Local Constraint Manipulation in Factor Graphs. *Entropy* **2021**, *23*, 807. https://doi.org/10.3390/e23070807

Academic Editor: Pierre Alquier

Received: 19 May 2021 Accepted: 22 June 2021 Published: 24 June 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

development in a variational inference context is a balancing act, where we search for models that have both large amounts of evidence for the data and small inference costs (small KLD). In other words, in a variational inference context, the researcher has two knobs to tune models. The first knob alters the model specification, which affects model evidence. The second knob relates to constraining the search space for the variational posterior, which may affect the inference costs.

In this paper, we are concerned with developing algorithms for tuning the second knob. How do we constrain the range of variational posteriors so as to make variational inferences both tractable and accurate (resulting in low KLD)? We present our framework in the context of a (Forney-style) factor graph representation of the model [4,5]. In that context, variational inference can be understood as an automatable and efficient message passing-based inference procedure [6–8].

Traditional constraints include mean-field [6] and Bethe approximations [9,10]. However, more recently it has become clear how alternative local constraints, such as posterior factorization [11], expectation and chance constraints [12,13], and local Laplace approximation [14], may impact both tractability and inference accuracy, and thereby potentially lead to lower VFE. The main contribution of the current work lies in unifying the various ideas on local posterior constraints into a principled method for deriving variational message passing-based inference algorithms. The proposed method derives existing message passing algorithms, but also supports the development of new message passing variants.

Section 2 reviews Forney-style Factor Graphs (FFGs) and variational inference by minimizing the Bethe Free Energy (BFE). This review is continued in Section 3, where we discuss BFE optimization from a Lagrangian optimization viewpoint. In Appendix A, we include an example to illustrate that Bayes' rule can be derived from Lagrangian optimization with data constraints. Our main contribution lies in Section 4, which provides a rigorous treatment of the effects of imposing local constraints on the BFE and the resulting message update rules. We build upon several previous works that describe how manipulation of (local) constraints and variational objectives can be employed to improve variational approximations in the context of message passing. For example, ref. [12] shows how inference algorithms can be unified in terms of hybrid message passing by Lagrangian constraint manipulation. We extend this view by bringing form (Section 4.2) and factorization constraints (Section 4.1) into a constrained optimization framework. In [15], a high-level recipe for generating message passing algorithms from divergence measures is described. We apply their general recipe in the current work, where we adhere to the view on local stationary points for region-based approximations on general graphs [16]. In Appendix B, we show that locally stationary solutions are also globally stationary. In Section 5, we develop an algorithm for VFE evaluation in an FFG. In previous work, ref. [17] describes a factor softening approach to evaluate the VFE for models with deterministic factors. We extend this work in Section 5, and show how to avoid factor softening for both free energy evaluation and inference of posteriors. We show an example of how to compute VFE for a deterministic node in Appendix C. A more detailed comparison to related work is given in Section 7.

In the literature, proofs and descriptions of message passing-based inference algorithms are scattered across multiple papers and varying graphical representations, including Bayesian networks [6,18], Markov random fields [16], bi-partite (Tanner) factor graphs [12,17,19] and Forney-style factor graphs (FFGs) [5,11]. In Appendix D, we provide first-principle proofs for a large collection of familiar message passing algorithms in the context of Forney-style factor graphs, which is the preferred framework in the information and communication theory communities [4,20].

## **2. Factor Graphs and the Bethe Free Energy**

*2.1. Terminated Forney-Style Factor Graphs*

A Forney-style factor graph (FFG) is an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with nodes $\mathcal{V}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. We denote the neighboring edges of a node $a \in \mathcal{V}$ by $\mathcal{E}(a)$. Vice versa, for an edge $i \in \mathcal{E}$, the notation $\mathcal{V}(i)$ collects all neighboring nodes. As a notational convention, we index nodes by $a, b, c$ and edges by $i, j, k$, unless stated otherwise. We will mainly use $a$ and $i$ as summation indices and use the other indices to refer to a node or edge of interest.

In this paper, we will frequently refer to the notion of a subgraph. We define an edge-induced subgraph by $\mathcal{G}(i) = (\mathcal{V}(i), i)$, and a node-induced subgraph by $\mathcal{G}(a) = (a, \mathcal{E}(a))$. Furthermore, we denote a local subgraph by $\mathcal{G}(a, i) = (\mathcal{V}(i), \mathcal{E}(a))$, which collects all local nodes and edges around $i$ and $a$, respectively.

An FFG can be used to represent a factorized function,

$$f(\mathbf{s}) = \prod_{a \in \mathcal{V}} f_a(\mathbf{s}_a), \tag{1}$$

where $\mathbf{s}_a$ collects the argument variables of factor $f_a$. We assume that all factors are positive. In an FFG, a node $a \in \mathcal{V}$ corresponds to a factor $f_a$, and the neighboring edges $\mathcal{E}(a)$ correspond to the variables $\mathbf{s}_a$ that are the arguments of $f_a$.

As an example, consider the factorization (2), the corresponding FFG of which is shown in Figure 1.

$$f(\mathbf{s}_1, \ldots, \mathbf{s}_5) = f_a(\mathbf{s}_1)\, f_b(\mathbf{s}_1, \mathbf{s}_2, \mathbf{s}_3)\, f_c(\mathbf{s}_2)\, f_d(\mathbf{s}_3, \mathbf{s}_4, \mathbf{s}_5)\, f_e(\mathbf{s}_5). \tag{2}$$

**Figure 1.** Example Forney-style factor graph for the model of (2).

The FFG of Figure 1 consists of five nodes $\mathcal{V} = \{a, \ldots, e\}$, as annotated by their corresponding factor functions, and five edges $\mathcal{E} = \{(a, b), \ldots, (d, e)\}$, as annotated by their corresponding variables. An edge that connects to only one node (e.g., the edge for $s_4$) is called a half-edge. In this example, the neighborhood $\mathcal{E}(b) = \{(a, b), (b, c), (b, d)\}$ and $\mathcal{V}((b, c)) = \{b, c\}$.
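The neighborhood notation can be made concrete in a few lines of code; the following is a toy encoding of this example graph, not the data structure of any particular FFG toolbox:

```python
# Toy encoding of the FFG in Figure 1: nodes are factor labels and
# edges are the pairs of nodes they connect (the half-edge for s4 is
# omitted here; terminating the graph would attach a unity-factor node).
nodes = {"a", "b", "c", "d", "e"}
edges = {("a", "b"): "s1", ("b", "c"): "s2", ("b", "d"): "s3",
         ("d", "e"): "s5"}

def E(node):
    """Neighboring edges E(a) of a node."""
    return {e for e in edges if node in e}

def V(edge):
    """Neighboring nodes V(i) of an edge."""
    return set(edge)

assert E("b") == {("a", "b"), ("b", "c"), ("b", "d")}
assert V(("b", "c")) == {"b", "c"}
```

The two assertions reproduce the neighborhoods stated in the text.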

In the FFG representation, a node can be connected to an arbitrary number of edges, while an edge can only be connected to at most two nodes. Therefore, FFGs often contain "equality nodes" that constrain connected edges to carry identical beliefs, with the implication that these beliefs can be made available to more than two factors. An equality node has the factor function

$$f\_a(\mathbf{s}\_i, \mathbf{s}\_j, \mathbf{s}\_k) = \delta(\mathbf{s}\_j - \mathbf{s}\_i)\, \delta(\mathbf{s}\_j - \mathbf{s}\_k)\,, \tag{3}$$

for which the node-induced subgraph $\mathcal{G}(a)$ is drawn in Figure 2.
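For discrete variables, the effect of the equality node is easy to verify numerically: integrating the two Dirac deltas in (3) against the incoming messages collapses both sums and leaves the elementwise product of those messages on the outgoing edge. A minimal sketch (our own toy numbers, not from the paper):

```python
import numpy as np

# Incoming sum-product messages over edges s_i and s_k
# (hypothetical beliefs over a binary variable, for illustration only).
mu_i = np.array([0.2, 0.8])
mu_k = np.array([0.5, 0.5])

# Message out of the equality node (3) toward edge s_j:
# summing delta(s_j - s_i) delta(s_j - s_k) against mu_i and mu_k
# leaves the elementwise product of the incoming messages.
mu_j = mu_i * mu_k
```

This is why an equality node makes a single belief available to more than two factors: every outgoing message is the product of all other incoming messages.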

If every edge in the FFG has exactly two connected nodes (including equality nodes), then we designate the graph as a terminated FFG (TFFG). Since multiplication of a function $f(\mathbf{s})$ by 1 does not alter the function, any FFG can be terminated by connecting each half-edge $i$ to a node $a$ that represents the unity factor $f\_a(s\_i) = 1$.

**Figure 2.** Visualization of the node-induced subgraph for an equality node. If the node function $f\_a$ is known, a symbol representing the node function is often substituted within the node ("$=$" in this case).

In Section 4.2 we discuss form constraints on posterior distributions. If such a constraint takes on a Dirac-delta functional form, then we visualize the constraint on the FFG by a small circle in the middle of the edge. For example, the small shaded circle in Figure 11 indicates that the variable has been observed. In Section 4.3.2 we consider form constraints in the context of optimization, in which case the circle annotation will be left open (see, e.g., Figure 14).

#### *2.2. Variational Free Energy*

Given a model *f*p*s*q and a (normalized) probability distribution *q*p*s*q, we can define a Variational Free Energy (VFE) functional as

$$F[q, f] \triangleq \int q(\mathbf{s}) \log \frac{q(\mathbf{s})}{f(\mathbf{s})} \, \mathrm{d}\mathbf{s}\,. \tag{4}$$

Variational inference is concerned with finding solutions to the minimization problem

$$q^\*(\mathbf{s}) = \arg\min\_{q \in \mathcal{Q}} F[q, f], \tag{5}$$

where $\mathcal{Q}$ imposes some constraints on $q$.

If $q$ is unconstrained, then the optimal solution is obtained for $q^\*(\mathbf{s}) = p(\mathbf{s})$, with $p(\mathbf{s}) = \frac{1}{Z} f(\mathbf{s})$ being the exact posterior, and $Z = \int f(\mathbf{s})\, \mathrm{d}\mathbf{s}$ a normalizing constant that is commonly referred to as the evidence. The minimum value of the free energy then follows as the negative log-evidence (NLE),

$$F[q^\*, f] = -\log Z\,,$$

which is also known as the surprisal. The NLE can be interpreted as a measure of model performance, where low NLE is preferred.

As an unconstrained search space for $q$ grows exponentially with the number of variables, the optimization of (5) quickly becomes intractable beyond the most basic models. Therefore, constraints and approximations to the variational free energy (4) are often utilized. As a result, the *constrained* variational free energy with $q^\* \in \mathcal{Q}$ bounds the NLE by

$$F[q^\*, f] = -\log Z + \int q^\*(\mathbf{s}) \log \frac{q^\*(\mathbf{s})}{p(\mathbf{s})} \, \mathrm{d}\mathbf{s}\,, \tag{6}$$

where the latter term expresses the divergence of the optimal variational belief $q^\*$ from the (intractable) exact posterior.

In practice, the functional form of $q(\mathbf{s}) = q(\mathbf{s}; \theta)$ is often parameterized, such that gradients of $F$ can be derived w.r.t. the parameters $\theta$. This effectively converts the variational optimization of $F[q, f]$ into a parametric optimization of $F(\theta)$ as a function of $\theta$. This problem can then be solved by a (stochastic) gradient descent procedure [21,22].

In the context of variational calculus, while form constraints may lead to interesting properties (see Section 4.2), they are generally not required. Interestingly, in a variational optimization context, the functional form of *q* is often not an *assumption*, but rather a *result* of optimization (see Section 4.3.1). An example of variational inference is provided in Appendix A.
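For a discrete model, the identities above can be checked directly. The sketch below (a toy unnormalized $f$ of our own choosing, not from the paper) evaluates (4) numerically and confirms that the unconstrained minimizer $q^\* = p$ attains $F = -\log Z$, while any other normalized belief exceeds the NLE by exactly the divergence term in (6):

```python
import numpy as np

# Toy unnormalized model f(s) over a discrete variable with 3 states
# (hypothetical values; any positive vector works).
f = np.array([2.0, 1.0, 0.5])
Z = f.sum()          # evidence Z = integral (here: sum) of f
p = f / Z            # exact posterior p(s) = f(s) / Z

def vfe(q, f):
    """Variational free energy (4): F[q, f] = sum_s q(s) log(q(s) / f(s))."""
    return float(np.sum(q * np.log(q / f)))

nle = -np.log(Z)                    # negative log-evidence (surprisal)
best = vfe(p, f)                    # unconstrained optimum q* = p
other = vfe(np.full(3, 1 / 3), f)   # some other normalized belief
```

Here `best` equals `nle` up to floating point, while `other` is strictly larger, illustrating that the constrained free energy upper-bounds the NLE.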

#### *2.3. Bethe Free Energy*

The Bethe approximation enjoys a unique place in the landscape of $\mathcal{Q}$, because the Bethe free energy (BFE) defines the fundamental objective of the celebrated belief propagation (BP) algorithm [17,23]. The origin of the Bethe approximation is rooted in tree-like approximations to subgraphs (possibly containing cycles) by enforcing local consistency conditions on the beliefs associated with edges and nodes [24].

Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ for a factorized function $f(\mathbf{s}) = \prod\_{a \in \mathcal{V}} f\_a(\mathbf{s}\_a)$ (1), the Bethe free energy (BFE) is defined as [25]:

$$F[q, f] \triangleq \sum\_{a \in \mathcal{V}} \underbrace{\int q\_a(\mathbf{s}\_a) \log \frac{q\_a(\mathbf{s}\_a)}{f\_a(\mathbf{s}\_a)} \, \mathrm{d}\mathbf{s}\_a}\_{F[q\_a, f\_a]} + \sum\_{i \in \mathcal{E}} \underbrace{\int q\_i(\mathbf{s}\_i) \log \frac{1}{q\_i(\mathbf{s}\_i)} \, \mathrm{d}\mathbf{s}\_i}\_{H[q\_i]} \tag{7}$$

such that the factorized beliefs

$$q(\mathbf{s}) = \prod\_{a \in \mathcal{V}} q\_a(\mathbf{s}\_a) \prod\_{i \in \mathcal{E}} q\_i(s\_i)^{-1} \tag{8}$$

satisfy the following constraints:

$$\int q\_a(\mathbf{s}\_a) \, \mathrm{d}\mathbf{s}\_a = 1, \quad \text{for all } a \in \mathcal{V} \tag{9a}$$

$$\int q\_a(\mathbf{s}\_a) \, \mathrm{d}\mathbf{s}\_{a\backslash i} = q\_i(s\_i) \,, \quad \text{for all } a \in \mathcal{V} \text{ and all } i \in \mathcal{E}(a) \, . \tag{9b}$$

Together, the normalization constraint (9a) and marginalization constraint (9b) imply that the edge marginals are also normalized:

$$\int q\_i(s\_i) \, \mathrm{d}s\_i = 1 \,, \quad \text{for all } i \in \mathcal{E} \,. \tag{10}$$

The Bethe free energy (7) includes a local free energy term $F[q\_a, f\_a]$ for each node $a \in \mathcal{V}$ and an entropy term $H[q\_i]$ for each edge $i \in \mathcal{E}$. Note that the local free energy also depends on the node function $f\_a$, as specified in the factorization of $f$ (1), whereas the entropy only depends on the local belief $q\_i$.

The Bethe factorization (8) and constraints are summarized by the local polytope [26]

$$\mathcal{L}(\mathcal{G}) = \{q\_a \text{ for all } a \in \mathcal{V} \text{ s.t. (9a), and } q\_i \text{ for all } i \in \mathcal{E}(a) \text{ s.t. (9b)}\}, \tag{11}$$

which defines the constrained search space for the factorized variational distribution (8).

#### *2.4. Problem Statement*

In this paper, the problem is to find the beliefs in the local polytope that minimize the Bethe free energy

$$q^\*(\mathbf{s}) = \arg\min\_{q \in \mathcal{L}(\mathcal{G})} F[q, f]\,, \tag{12}$$

where $q$ is defined by (8), and where $q \in \mathcal{L}(\mathcal{G})$ offers a shorthand notation for optimizing over the individual beliefs in the local polytope. In the following sections, we will follow the Lagrangian optimization approach to derive various message passing-based inference algorithms.

#### *2.5. Sketch of Solution Approach*

The problem statement of Section 2.4 defines a global minimization over the beliefs in the Bethe factorization. Instead of solving the global optimization problem directly, we employ the factorization of the variational posterior and the local polytope to subdivide the global problem statement into multiple *interdependent* local objectives.

From the BFE objective (12) and local polytope of (11), we can construct the Lagrangian

$$\begin{split} L[q, f] = \sum\_{a \in \mathcal{V}} F[q\_a, f\_a] + \sum\_{a \in \mathcal{V}} \psi\_a \left[ \int q\_a(\mathbf{s}\_a) \, \mathrm{d}\mathbf{s}\_a - 1 \right] + \sum\_{a \in \mathcal{V}} \sum\_{i \in \mathcal{E}(a)} \int \lambda\_{ia}(s\_i) \left[ q\_i(s\_i) - \int q\_a(\mathbf{s}\_a) \, \mathrm{d}\mathbf{s}\_{a \backslash i} \right] \mathrm{d}s\_i \\ + \sum\_{i \in \mathcal{E}} H[q\_i] + \sum\_{i \in \mathcal{E}} \psi\_i \left[ \int q\_i(s\_i) \, \mathrm{d}s\_i - 1 \right], \end{split} \tag{13}$$

where the Lagrange multipliers $\psi\_a$, $\psi\_i$, and $\lambda\_{ia}$ enforce the normalization and marginalization constraints of (9). It can be seen that this Lagrangian contains local beliefs $q\_a$ and $q\_i$, which are coupled through the $\lambda\_{ia}$ Lagrange multipliers. The Lagrange multipliers $\lambda\_{ia}$ are doubly indexed, because there is a multiplier associated with each marginalization constraint. The Lagrangian method then converts the constrained optimization problem of $F[q, f]$ to an unconstrained optimization problem of $L[q, f]$. The total variation of the Lagrangian (13) can then be approached from the perspective of variations of the individual (coupled) local beliefs.

More specifically, given a locally connected pair $b \in \mathcal{V}$, $j \in \mathcal{E}(b)$, we can rewrite the optimization of (12) in terms of the local beliefs $q\_b$, $q\_j$, and the constraints in the local polytope

$$\mathcal{L}(\mathcal{G}(b,j)) = \left\{ q\_b \text{ s.t. (9a), and } q\_j \text{ s.t. (9b)} \right\},\tag{14}$$

that pertains to these beliefs. The problem then becomes finding local stationary solutions

$$\left\{q\_b^\*, q\_j^\*\right\} = \arg\min\_{\mathcal{L}(\mathcal{G}(b,j))} F[q, f]\,. \tag{15}$$

Using (13), the optimization of (15) can then be written in the Lagrangian form

$$q\_b^\* = \arg\min\_{q\_b} L\_b[q\_b, f\_b], \tag{16a}$$

$$q\_j^\* = \arg\min\_{q\_j} L\_j[q\_j]\,, \tag{16b}$$

where the Lagrangians $L\_b$ and $L\_j$ include the local polytope of (14) to rewrite (13) as an explicit functional of the beliefs $q\_b$ and $q\_j$ (see, e.g., Lemmas 1 and 2). The combined stationary solutions to the local objectives then also comprise a stationary solution to the global objective (Appendix B).

The current paper shows how to identify stationary solutions to local objectives of the form (15), with the use of variational calculus, under varying constraints as imposed by the local polytope (14). Interestingly, the resulting fixed-point equations can be interpreted as message passing updates on the underlying TFFG representation of the model. In the following Sections 3 and 4, we derive the local stationary solutions under a selection of constraints and show how these relate to known message passing update rules (Table 1). It then becomes possible to derive novel message updates and algorithms by simply altering the local polytope.

**Table 1.** Relation between local constraints and derived message updates. The rows refer to different constraints that relate to factor–variable combinations, factors, and variables, respectively. Note that each message passing algorithm combines a set of constraints. Abbreviations: Sum-Product (SP), Structured Variational Message Passing (SVMP), Mean-Field Variational Message Passing (MFVMP), Data Constraint (DC), Laplace Propagation (LP), Mean-Field Variational Laplace (MFVLP), Expectation Maximization (EM), and Expectation Propagation (EP).


#### **3. Bethe Lagrangian Optimization by Message Passing**

*3.1. Stationary Points of the Bethe Lagrangian*

We wish to minimize the Bethe free energy under variations of the variational density. As the Bethe free energy factorizes over factors and variables (7), we first consider variations on separate node- and edge-induced subgraphs.

**Lemma 1.** *Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the node-induced subgraph $\mathcal{G}(b)$ (Figure 3). The stationary points of the Lagrangian* (16a) *as a functional of $q\_b$,*

$$L\_b[q\_b, f\_b] = F[q\_b, f\_b] + \psi\_b \left[ \int q\_b(\mathbf{s}\_b) \, \mathrm{d}\mathbf{s}\_b - 1 \right] + \sum\_{i \in \mathcal{E}(b)} \int \lambda\_{ib}(s\_i) \left[ q\_i(s\_i) - \int q\_b(\mathbf{s}\_b) \, \mathrm{d}\mathbf{s}\_{b \backslash i} \right] \mathrm{d}s\_i + C\_b, \tag{17}$$

*where $C\_b$ collects all terms that are independent of $q\_b$, are of the form*

$$q\_b(\mathbf{s}\_b) = \frac{f\_b(\mathbf{s}\_b) \prod\_{i \in \mathcal{E}(b)} \mu\_{ib}(s\_i)}{\int f\_b(\mathbf{s}\_b) \prod\_{i \in \mathcal{E}(b)} \mu\_{ib}(s\_i)\, \mathrm{d}\mathbf{s}\_b}\,. \tag{18}$$

**Proof.** See Appendix D.1.

The $\mu\_{ib}(s\_i)$ are any set of positive functions that make (18) satisfy (9b), and will be identified in Theorem 1.

**Figure 3.** The subgraph around node $b$ with indicated messages. Ellipses indicate an arbitrary (possibly zero) number of edges.

**Lemma 2.** *Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider an edge-induced subgraph $\mathcal{G}(j)$ (Figure 4). The stationary points of the Lagrangian* (16b) *as a functional of $q\_j$,*

$$L\_j[q\_j] = H[q\_j] + \psi\_j \left[ \int q\_j(s\_j) \, \mathrm{d}s\_j - 1 \right] + \sum\_{a \in \mathcal{V}(j)} \int \lambda\_{ja}(s\_j) \left[ q\_j(s\_j) - \int q\_a(\mathbf{s}\_a) \, \mathrm{d}\mathbf{s}\_{a \backslash j} \right] \mathrm{d}s\_j + C\_j, \tag{19}$$

*where $C\_j$ collects all terms that are independent of $q\_j$, are of the form*

$$q\_j(s\_j) = \frac{\mu\_{jb}(s\_j)\,\mu\_{jc}(s\_j)}{\int \mu\_{jb}(s\_j)\,\mu\_{jc}(s\_j)\, \mathrm{d}s\_j}\,. \tag{20}$$

**Proof.** See Appendix D.2.


**Figure 4.** An edge-induced subgraph $\mathcal{G}(j)$ with indicated messages.

*3.2. Minimizing the Bethe Free Energy by Belief Propagation*

We now combine Lemmas 1 and 2 to derive the sum-product message update.

**Theorem 1** (Sum-Product Message Update)**.** *Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the induced subgraph $\mathcal{G}(b, j)$ (Figure 5). Given the local polytope $\mathcal{L}(\mathcal{G}(b, j))$ of* (14)*, the local stationary solutions to* (15) *are given by*

$$q\_b^\*(\mathbf{s}\_b) = \frac{f\_b(\mathbf{s}\_b) \prod\_{i \in \mathcal{E}(b)} \mu\_{ib}^\*(s\_i)}{\int f\_b(\mathbf{s}\_b) \prod\_{i \in \mathcal{E}(b)} \mu\_{ib}^\*(s\_i)\, \mathrm{d}\mathbf{s}\_b} \tag{21a}$$

$$q\_j^\*(s\_j) = \frac{\mu\_{jb}^\*(s\_j)\,\mu\_{jc}^\*(s\_j)}{\int \mu\_{jb}^\*(s\_j)\,\mu\_{jc}^\*(s\_j)\,\mathrm{d}s\_j}, \tag{21b}$$

*with messages $\mu\_{jc}^\*(s\_j)$ corresponding to the fixed points of*

$$\mu\_{jc}^{(k+1)}(s\_j) = \int f\_b(\mathbf{s}\_b) \prod\_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \mu\_{ib}^{(k)}(s\_i)\, \mathrm{d}\mathbf{s}\_{b \backslash j}\,, \tag{22}$$

*with k representing an iteration index.*

**Proof.** See Appendix D.3.

**Figure 5.** Visualization of a subgraph with indicated sum-product messages.

The sum-product algorithm has proven to be useful in many engineering applications and disciplines. For example, it is widely used for decoding in communication systems [4,20,27]. Furthermore, for a linear Gaussian state space model, Kalman filtering and smoothing can be expressed in terms of sum-product message passing for state inference on a factor graph [28,29]. This equivalence has inspired applications ranging from localization [30] to estimation [31].

The sum-product algorithm with updates (22) obtains the exact Bayesian posterior when the underlying graph is a tree [24,25,32]. Application of the sum-product algorithm to cyclic graphs is not guaranteed to converge and might lead to oscillations in the BFE over iterations. Theorems 3.1 and 3.2 in [33] show that the BFE of a graph with a single cycle is convex, which implies that the sum-product algorithm will converge in this case. Moreover, ref. [19] shows that, for cyclic graphs, a double-loop message passing algorithm can be obtained whose stable fixed points correspond to local minima of the BFE.
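On a small discrete model, Theorem 1 can be verified end to end: one forward/backward sweep of (22) on a chain yields beliefs that match the exact marginals, and plugging those beliefs into (7) recovers $-\log Z$, as expected on a tree. The sketch below uses toy factor tables of our own choosing, not an example from the paper:

```python
import numpy as np

# Toy terminated chain FFG: f(s1, s2) = fa(s1) fb(s1, s2) fc(s2),
# with binary s1, s2 (hypothetical factor tables).
fa = np.array([1.0, 2.0])
fb = np.array([[3.0, 1.0],
               [1.0, 2.0]])
fc = np.array([2.0, 1.0])

# Sum-product messages (22); on a tree a single sweep reaches the fixed point.
mu_b1 = fb @ fc        # node b -> edge s1 (marginalize s2 out of fb * fc)
mu_b2 = fb.T @ fa      # node b -> edge s2 (marginalize s1 out of fa * fb)

# Edge beliefs (21b): normalized product of the two opposing messages.
q1 = fa * mu_b1; q1 /= q1.sum()
q2 = fc * mu_b2; q2 /= q2.sum()

# Brute-force reference: the exact posterior marginals of f.
joint = fa[:, None] * fb * fc[None, :]
Z = joint.sum()

# Bethe free energy (7) with these beliefs; node a and c beliefs are q1, q2.
qb = joint / Z                              # node belief for fb
H = lambda q: -np.sum(q * np.log(q))        # discrete entropy
bfe = (np.sum(q1 * np.log(q1 / fa)) + np.sum(qb * np.log(qb / fb))
       + np.sum(q2 * np.log(q2 / fc)) + H(q1) + H(q2))
```

Here `q1` and `q2` coincide with the exact marginals `joint.sum(axis=1) / Z` and `joint.sum(axis=0) / Z`, and `bfe` equals `-np.log(Z)`, the negative log-evidence.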

**Example 1** (A Linear Dynamical System)**.** *Consider a linear Gaussian state space model specified by the following factors:*

$$g\_0(\mathbf{x}\_0) = \mathcal{N}(\mathbf{x}\_0 | \mathbf{m}\_{\mathbf{x}\_0}, \mathbf{V}\_{\mathbf{x}\_0}) \tag{23a}$$

$$g\_t(\mathbf{x}\_{t-1}, \mathbf{z}\_t, \mathbf{A}\_t) = \delta(\mathbf{z}\_t - \mathbf{A}\_t \mathbf{x}\_{t-1}) \tag{23b}$$

$$h\_t(\mathbf{x}\_t', \mathbf{z}\_t, \mathbf{Q}\_t) = \mathcal{N}(\mathbf{x}\_t' | \mathbf{z}\_t, \mathbf{Q}\_t^{-1}) \tag{23c}$$

$$n\_t(\mathbf{x}\_t, \mathbf{x}\_t', \mathbf{x}\_t'') = \delta(\mathbf{x}\_t - \mathbf{x}\_t')\,\delta(\mathbf{x}\_t - \mathbf{x}\_t'') \tag{23d}$$

$$m\_t(\mathbf{o}\_t, \mathbf{x}\_t'', \mathbf{B}\_t) = \delta(\mathbf{o}\_t - \mathbf{B}\_t \mathbf{x}\_t'') \tag{23e}$$

$$r\_t(\mathbf{y}\_t, \mathbf{o}\_t, \mathbf{R}\_t) = \mathcal{N}(\mathbf{y}\_t | \mathbf{o}\_t, \mathbf{R}\_t^{-1})\,. \tag{23f}$$

*The FFG corresponding to one time segment of the state space model is given in Figure 6. We assumed that we know the following matrices that are used to generate the data:*

$$\hat{A}\_{t} = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}, \quad \hat{Q}\_{t}^{-1} = \begin{bmatrix} 3 & 0.1 \\ 0.1 & 2 \end{bmatrix}, \quad \hat{B}\_{t} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \hat{R}\_{t}^{-1} = \begin{bmatrix} 10 & 2 \\ 2 & 20 \end{bmatrix} \tag{24}$$

*with $\theta = \pi/8$. Given a collection of observations $\hat{\mathbf{y}} = \{\hat{y}\_1, \dots, \hat{y}\_T\}$, we constrain the latent states $\mathbf{x} = \{x\_0, \dots, x\_T\}$ by local marginalization and normalization constraints (for brevity, we omit writing the normalization constraints explicitly) in accordance with Theorem 1, i.e.,*

$$\int q(\mathbf{x}\_{t-1}, \mathbf{z}\_t, \mathbf{A}\_t)\, \mathrm{d}\mathbf{x}\_{t-1}\, \mathrm{d}\mathbf{z}\_t = q(\mathbf{A}\_t), \quad \int q(\mathbf{x}\_{t-1}, \mathbf{z}\_t, \mathbf{A}\_t)\, \mathrm{d}\mathbf{A}\_t = q(\mathbf{z}\_t | \mathbf{x}\_{t-1})\, q(\mathbf{x}\_{t-1}) \tag{25a}$$

$$\int q(\mathbf{x}\_t', \mathbf{z}\_t, \mathbf{Q}\_t)\, \mathrm{d}\mathbf{x}\_t'\, \mathrm{d}\mathbf{z}\_t = q(\mathbf{Q}\_t), \quad \int q(\mathbf{x}\_t', \mathbf{z}\_t, \mathbf{Q}\_t)\, \mathrm{d}\mathbf{z}\_t\, \mathrm{d}\mathbf{Q}\_t = q(\mathbf{x}\_t'), \quad \int q(\mathbf{x}\_t', \mathbf{z}\_t, \mathbf{Q}\_t)\, \mathrm{d}\mathbf{x}\_t'\, \mathrm{d}\mathbf{Q}\_t = q(\mathbf{z}\_t) \tag{25b}$$

$$q(\mathbf{x}\_t, \mathbf{x}\_t', \mathbf{x}\_t'') = q(\mathbf{x}\_t)\,\delta(\mathbf{x}\_t - \mathbf{x}\_t')\,\delta(\mathbf{x}\_t - \mathbf{x}\_t'') \tag{25c}$$

$$\int q(\mathbf{o}\_t, \mathbf{x}\_t'', \mathbf{B}\_t)\, \mathrm{d}\mathbf{o}\_t\, \mathrm{d}\mathbf{x}\_t'' = q(\mathbf{B}\_t), \quad \int q(\mathbf{o}\_t, \mathbf{x}\_t'', \mathbf{B}\_t)\, \mathrm{d}\mathbf{B}\_t = q(\mathbf{o}\_t | \mathbf{x}\_t'')\, q(\mathbf{x}\_t'') \tag{25d}$$

$$\int q(\mathbf{o}\_t, \mathbf{y}\_t, \mathbf{R}\_t)\, \mathrm{d}\mathbf{o}\_t\, \mathrm{d}\mathbf{y}\_t = q(\mathbf{R}\_t), \quad \int q(\mathbf{o}\_t, \mathbf{y}\_t, \mathbf{R}\_t)\, \mathrm{d}\mathbf{R}\_t\, \mathrm{d}\mathbf{o}\_t = q(\mathbf{y}\_t), \quad \int q(\mathbf{o}\_t, \mathbf{y}\_t, \mathbf{R}\_t)\, \mathrm{d}\mathbf{R}\_t\, \mathrm{d}\mathbf{y}\_t = q(\mathbf{o}\_t) \tag{25e}$$

*Moreover, we use data constraints in accordance with Theorem 3 (explained in Section 4.2.1) for the observations, state transition matrices and precision matrices, i.e.,*

$$q(y\_t) = \delta(y\_t - \hat{y}\_t), \; q(A\_t) = \delta(A\_t - \hat{A}\_t), \; q(B\_t) = \delta(B\_t - \hat{B}\_t), \; q(Q\_t) = \delta(Q\_t - \hat{Q}\_t), \; q(R\_t) = \delta(R\_t - \hat{R}\_t).$$

*Computation of the sum-product messages by* (22) *is analytically tractable, and detailed algebraic manipulations can be found in [31]. If the backward messages are not passed, then the resulting sum-product message passing algorithm is equivalent to Kalman filtering; if both forward and backward messages are propagated, then the Rauch–Tung–Striebel smoother is obtained [34] (Ch. 8).*

*We generated $T = 100$ observations $\hat{\mathbf{y}}$ using the matrices specified in* (24) *and the initial condition $\hat{x}\_0 = [5, -5]^\top$. Due to* (23a)*, we have $\mu\_{x\_0 g\_1} = \mathcal{N}(m\_{x\_0}, V\_{x\_0})$. We chose $V\_{x\_0} = 100 \cdot I$ and $m\_{x\_0} = \hat{x}\_0$. Under these constraints, the results of sum-product message passing and Bethe free energy evaluation are given in Figure 6. As the underlying graph is a tree, the sum-product message passing results are exact, and the evaluated BFE corresponds to the negative log-evidence. In the follow-up Example 2, we will modify the constraints and give a comparative free energy plot for the examples in Figures 10 and 16.*

**Figure 6.** (**Left**) One time segment of the FFG corresponding to the linear Gaussian state space model specified in Example 1, with the sum-product messages computed according to (22). The three small dots at both sides of the graph indicate identical continuation of the graph over time. (**Right**) The small dots indicate the noisy observations that are synthetically generated by the linear state space model of (23) using parameter matrices as specified in (24). The posterior distributions for the hidden states are inferred by sum-product message passing and are drawn with shaded regions, indicating plus and minus the variance. The Bethe free energy evaluates to $F[q, f] = 580.698$.
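The filtering pass of Example 1 can be sketched without factor-graph machinery: a forward-only sum-product sweep through (23) reduces to the familiar Kalman recursions. The code below is our own minimal sketch under the matrices of (24), not the implementation behind Figure 6; it regenerates synthetic data, so numbers will differ from the reported BFE:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.pi / 8
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Q = np.linalg.inv(np.array([[3.0, 0.1], [0.1, 2.0]]))    # state noise cov
R = np.linalg.inv(np.array([[10.0, 2.0], [2.0, 20.0]]))  # obs noise cov
B = np.eye(2)

# Generate T = 100 observations from the model (23), starting at [5, -5].
T, x = 100, np.array([5.0, -5.0])
ys = []
for _ in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
    ys.append(B @ x + rng.multivariate_normal(np.zeros(2), R))

# Forward-only sum-product sweep = Kalman filter (no backward messages).
m, V = np.array([5.0, -5.0]), 100.0 * np.eye(2)  # prior message from g_0
for y in ys:
    m_pred, V_pred = A @ m, A @ V @ A.T + Q      # message through g_t, h_t
    S = B @ V_pred @ B.T + R                     # innovation covariance
    K = V_pred @ B.T @ np.linalg.inv(S)          # Kalman gain
    m = m_pred + K @ (y - B @ m_pred)            # filtered mean of q(x_t)
    V = (np.eye(2) - K @ B) @ V_pred             # filtered covariance
```

Propagating backward messages as well and combining them per (21b) would turn this filter into the Rauch–Tung–Striebel smoother mentioned above.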

#### **4. Message Passing Variations through Constraint Manipulation**

For generic node functions with arbitrary connectivity, there is no guarantee that the sum-product updates can be solved analytically. When analytic solutions are not possible, there are two ways to proceed. One way is to try to solve the sum-product update equations numerically, e.g., by Monte Carlo methods. Alternatively, we can add additional constraints to the BFE that lead to simpler update equations at the cost of inference accuracy. In the remainder of the paper, we explore a variety of constraints that have proven to yield useful inference solutions.

#### *4.1. Factorization Constraints*

Additional factorizations of the variational density $q\_a(\mathbf{s}\_a)$ are often assumed to ease computation. In particular, we assumed a *structured mean-field factorization* such that

$$q\_b(\mathbf{s}\_b) \doteq \prod\_{n \in l(b)} q\_b^n(\mathbf{s}\_b^n)\,, \tag{26}$$

where $n$ indicates a local cluster as a set of edges. To define a local cluster rigorously, let us first denote by $\mathcal{P}(a)$ the power set of an edge set $\mathcal{E}(a)$, where the power set is the set of all subsets of $\mathcal{E}(a)$. Then, a mean-field factorization $l(a) \subseteq \mathcal{P}(a)$ can be chosen such that all elements in $\mathcal{E}(a)$ are included in $l(a)$ exactly once. Therefore, $l(a)$ is defined as a set of one or multiple sets of edges. For example, if $\mathcal{E}(a) = \{i, j, k\}$, then $l(a) = \{\{i\}, \{j, k\}\}$ is allowed, as is $l(a) = \{\{i, j, k\}\}$ itself, but $l(a) = \{\{i, j\}, \{j, k\}\}$ is not allowed, since the element $j$ occurs twice. More formally, in (26), the intersection of the super- and subscript collects the required variables; see Figure 7 for an example. The special case of a fully factorized $l(b)$ over all edges $i \in \mathcal{E}(b)$ is known as the *naive mean-field factorization* [11,24].
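The "exactly once" condition says that $l(a)$ must partition the edge set $\mathcal{E}(a)$, i.e., form an exact cover. A small helper of our own (not from the paper) makes the rule concrete for the three candidates above:

```python
def is_exact_cover(edges, clusters):
    """True if every edge appears in exactly one cluster of l(a)."""
    flat = [i for cluster in clusters for i in cluster]
    return len(flat) == len(set(flat)) and set(flat) == set(edges)

E_a = ["i", "j", "k"]
ok_structured = is_exact_cover(E_a, [{"i"}, {"j", "k"}])   # allowed
ok_joint = is_exact_cover(E_a, [{"i", "j", "k"}])          # allowed
bad = is_exact_cover(E_a, [{"i", "j"}, {"j", "k"}])        # j occurs twice
```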

We will analyze the effect of a structured mean-field factorization (26) on the Bethe free energy (7) for a specific factor node $b \in \mathcal{V}$. Substituting (26) into the local free energy for factor $b$ yields

$$F[q\_b, f\_b] = F[\{q\_b^n\}, f\_b] = \sum\_{n \in l(b)} \int q\_b^n(\mathbf{s}\_b^n) \log q\_b^n(\mathbf{s}\_b^n) \, \mathrm{d}\mathbf{s}\_b^n - \int \left\{ \prod\_{n \in l(b)} q\_b^n(\mathbf{s}\_b^n) \right\} \log f\_b(\mathbf{s}\_b) \, \mathrm{d}\mathbf{s}\_b\,. \tag{27}$$

We are then interested in

$$q\_b^{m,\*} = \arg\min\_{q\_b^m} L\_b^m[q\_b^m, f\_b]\,, \tag{28}$$

where the Lagrangian $L\_b^m$ (Lemma 3) enforces the normalization and marginalization constraints

$$\int q\_b^m(\mathbf{s}\_b^m) \, \mathbf{d} \mathbf{s}\_b^m = 1 \,, \tag{29a}$$

$$\int q\_b^m(\mathbf{s}\_b^m) \, \mathrm{d}\mathbf{s}\_{b \backslash i}^m = q\_i(s\_i)\,, \text{ for all } i \in m\,, m \in l(b)\,. \tag{29b}$$

**Figure 7.** A node-induced subgraph $\mathcal{G}(b)$ with shaded sections that enclose the edges of an exemplary structured mean-field factorization $l(b) = \{m, n, r\}$. Note that, in this example, the cluster $n$ only encompasses the single edge $j$, such that $q\_b^n(\mathbf{s}\_b^n) = q\_j(s\_j)$. In general, the assignment and number of edges in a cluster can be arbitrary.

**Lemma 3.** *Given a terminated FFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider a node-induced subgraph $\mathcal{G}(b)$ with a structured mean-field factorization $l(b)$ (e.g., Figure 7). Then, local stationary solutions to the Lagrangian*

$$\begin{split} L\_b^m[q\_b^m] = \int q\_b^m(\mathbf{s}\_b^m) \log q\_b^m(\mathbf{s}\_b^m) \, \mathrm{d}\mathbf{s}\_b^m - \int \left\{ \prod\_{n \in l(b)} q\_b^n(\mathbf{s}\_b^n) \right\} \log f\_b(\mathbf{s}\_b) \, \mathrm{d}\mathbf{s}\_b \\ + \psi\_b^m \left[ \int q\_b^m(\mathbf{s}\_b^m) \, \mathrm{d}\mathbf{s}\_b^m - 1 \right] + \sum\_{i \in m} \int \lambda\_{ib}(s\_i) \left[ q\_i(s\_i) - \int q\_b^m(\mathbf{s}\_b^m) \, \mathrm{d}\mathbf{s}\_{b \backslash i}^m \right] \mathrm{d}s\_i + C\_b^m, \end{split} \tag{30}$$

*where $C\_b^m$ collects all terms independent of $q\_b^m$, are of the form*

$$q\_b^m(\mathbf{s}\_b^m) = \frac{\tilde{f}\_b^m(\mathbf{s}\_b^m) \prod\_{i \in m} \mu\_{ib}(s\_i)}{\int \tilde{f}\_b^m(\mathbf{s}\_b^m) \prod\_{i \in m} \mu\_{ib}(s\_i)\, \mathrm{d}\mathbf{s}\_b^m}\,, \tag{31}$$

*where*

$$\tilde{f}\_b^m(\mathbf{s}\_b^m) = \exp\left(\int \left\{ \prod\_{\substack{n \in l(b) \\ n \neq m}} q\_b^n(\mathbf{s}\_b^n) \right\} \log f\_b(\mathbf{s}\_b) \, \mathrm{d}\mathbf{s}\_b^{\backslash m} \right). \tag{32}$$

**Proof.** See Appendix D.4.

#### 4.1.1. Structured Variational Message Passing

We now combine Lemmas 2 and 3 to derive the structured variational message passing algorithm.

**Theorem 2** (Structured Variational Message Passing)**.** *Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the induced subgraph $\mathcal{G}(b, j)$ with a structured mean-field factorization $l(b) \subseteq \mathcal{P}(b)$, with local clusters $n \in l(b)$. Let $m \in l(b)$ be the cluster with $j \in m$ (see, e.g., Figure 8). Given the local polytope*

$$\mathcal{L}(\mathcal{G}(b,j)) = \left\{ q\_b^n \text{ for all } n \in l(b) \text{ s.t. (29a), and } q\_j \text{ s.t. (29b)} \right\}, \tag{33}$$

*then local stationary solutions to*

$$\{q\_b^{m,\*}, q\_j^\*\} = \arg\min\_{\mathcal{L}(\mathcal{G}(b,j))} F[q, f] \tag{34}$$

*are given by*

$$q\_b^{m,\*}(\mathbf{s}\_b^m) = \frac{\tilde{f}\_b^{m,\*}(\mathbf{s}\_b^m) \prod\_{i \in m} \mu\_{ib}^\*(s\_i)}{\int \tilde{f}\_b^{m,\*}(\mathbf{s}\_b^m) \prod\_{i \in m} \mu\_{ib}^\*(s\_i)\, \mathrm{d}\mathbf{s}\_b^m} \tag{35a}$$

$$q\_j^\*(s\_j) = \frac{\mu\_{jb}^\*(s\_j)\,\mu\_{jc}^\*(s\_j)}{\int \mu\_{jb}^\*(s\_j)\,\mu\_{jc}^\*(s\_j)\,\mathrm{d}s\_j}, \tag{35b}$$

*with messages $\mu\_{jc}^\*(s\_j)$ corresponding to the fixed points of*

$$\mu\_{jc}^{(k+1)}(s\_j) = \int \tilde{f}\_b^{m,(k)}(\mathbf{s}\_b^m) \prod\_{\substack{i \in m \\ i \neq j}} \mu\_{ib}^{(k)}(s\_i)\, \mathrm{d}\mathbf{s}\_{b \backslash j}^m\,, \tag{36}$$

*with iteration index k, and where*

$$\tilde{f}\_b^{m,(k)}(\mathbf{s}\_b^m) = \exp\left(\int \left\{ \prod\_{\substack{n \in l(b) \\ n \neq m}} q\_b^{n,(k)}(\mathbf{s}\_b^n) \right\} \log f\_b(\mathbf{s}\_b) \, \mathrm{d}\mathbf{s}\_b^{\backslash m} \right). \tag{37}$$

**Proof.** See Appendix D.5.

**Figure 8.** An example subgraph corresponding to $\mathcal{G}(b, j)$. Dashed ellipses enclose the edges of an exemplary exact cover $l(b) = \{m, n, r\}$. In general, the assignment and number of edges in a cluster can be arbitrary.

The structured mean-field factorization applies the marginalization constraint only to the local cluster beliefs, as opposed to the joint node belief. As a result, computation of the local cluster beliefs might become tractable [24] (Ch. 5). The practical appeal of Variational Message Passing (VMP)-based inference becomes evident when the underlying model is composed of conjugate factor pairs from the exponential family. When the underlying factors are conjugate exponential family distributions, the message passing updates (36) amount to adding natural parameters [35] of the underlying exponential family distributions. Structured variational message passing is popular in acoustic signal modelling, e.g., [36], as it allows one to keep track of correlations over time. In [37], a stochastic variant of structured variational inference is utilized for Latent Dirichlet Allocation. Structured approximations are also used to improve inference in auto-encoders. In [38], inference involving non-parametric Beta–Bernoulli process priors is improved by developing a structured approximation to variational auto-encoders. When the data being modelled are time series, structured approximations reflect the transition structure over time. In [39], an efficient structured black-box variational inference algorithm for fitting Gaussian variational models to latent time series is proposed.
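The "adding natural parameters" view can be made concrete with a scalar stand-in for the Wishart case of the next example: a Normal likelihood with known mean and a conjugate Gamma prior on the precision. The VMP message from the Normal node toward the precision is Gamma-shaped, so the posterior update just sums natural parameters. This is a hypothetical toy of our own, not the paper's Wishart computation:

```python
import numpy as np

rng = np.random.default_rng(1)
m_true, tau_true = 0.0, 4.0
x = rng.normal(m_true, 1 / np.sqrt(tau_true), size=500)

# Gamma(a0, b0) prior on the precision tau; natural parameters (a - 1, -b).
a0, b0 = 1.0, 1.0

# The message from the Normal node toward tau is Gamma-shaped with
# natural parameters (N/2, -sum((x - m)^2)/2); the posterior update
# simply adds the natural parameters of the prior and the message.
N = x.size
a_post = a0 + N / 2
b_post = b0 + np.sum((x - m_true) ** 2) / 2

tau_mean = a_post / b_post   # posterior mean of the precision
```

Because the pair is conjugate, the fixed point of (36) is reached in a single update; `tau_mean` recovers the generating precision up to sampling noise.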

**Example 2.** *Consider the linear Gaussian state space model of Example 1. Let us assume that the precision matrix for latent-state transitions $Q\_t$ is not known and cannot be constrained by data. Then, we can augment the state space model by including a prior for $Q\_t$ and try to infer a posterior over $Q\_t$ from the observations. Since $Q\_t$ is the precision of a normal factor, we chose a conjugate Wishart prior and assumed that $Q\_t$ is time-invariant by adding the following factors*

$$w\_0(Q\_0, V, \nu) = \mathcal{W}(Q\_0|V, \nu) \tag{38a}$$

$$w\_t(Q\_{t-1}, Q\_t, Q\_{t+1}) = \delta(Q\_{t-1} - Q\_t)\delta(Q\_t - Q\_{t+1}), \text{ for every } t = 1, \dots, T. \tag{38b}$$

*It is certainly possible to assume a time-varying structure for Qt; however, our purpose is to illustrate a change in constraints rather than analyzing time-varying properties. This is why we assume time-invariance.*

*In this setting, the sum-product equations around the factor $h\_t$ are not analytically tractable. Therefore, we changed the constraints associated with $h\_t$* (25b) *to those given in Theorem 2 as follows*

$$\int q(\mathbf{x}\_t', \mathbf{z}\_t, \mathbf{Q}\_t)\, \mathrm{d}\mathbf{x}\_t'\, \mathrm{d}\mathbf{z}\_t = q(\mathbf{Q}\_t), \quad \int q(\mathbf{x}\_t', \mathbf{z}\_t, \mathbf{Q}\_t)\, \mathrm{d}\mathbf{Q}\_t = q(\mathbf{x}\_t', \mathbf{z}\_t) \tag{39a}$$

$$\int q(\mathbf{Q}\_t)\, \mathrm{d}\mathbf{Q}\_t = 1, \quad \int q(\mathbf{x}\_t', \mathbf{z}\_t)\, \mathrm{d}\mathbf{x}\_t'\, \mathrm{d}\mathbf{z}\_t = 1\,. \tag{39b}$$

*We removed the data constraint on $q(Q\_t)$ and instead included data constraints on the hyperparameters*

$$q(V) = \delta(V - \hat{V}), \quad q(\nu) = \delta(\nu - \hat{\nu})\,. \tag{40}$$

*With the new set of constraints ((39a) and (39b)), we obtained a hybrid of the sum-product and structured VMP algorithms, where the structured messages around the factor $h\_t$ are computed by* (36) *and the rest of the messages are computed by the sum-product rule* (22)*. One time segment of the modified FFG, along with the messages, is given in Figure 9. We used the same observations $\hat{\mathbf{y}}$ that were generated in Example 1 and the same initialization for the hidden states. For the hyperparameters of the Wishart prior, we chose $\hat{V} = 0.1 \cdot I$ and $\hat{\nu} = 2$. Under these constraints, the results of structured variational message passing, along with the Bethe free energy evaluation, are given in Figure 9.*

#### 4.1.2. Naive Variational Message Passing

As a corollary of Theorem 2, we can consider the special case of a naive mean-field factorization, which is defined for node *b* as

$$q\_b(\mathbf{s}\_b) = \prod\_{i \in \mathcal{E}(b)} q\_i(\mathbf{s}\_i) \,. \tag{41}$$

The naive mean-field constraint (41) transforms the local free energy into

$$\begin{split} F[q_b, f_b] &= F[\{q_i\}, f_b] \\ &= \sum_{i \in \mathcal{E}(b)} \int q_i(s_i) \log q_i(s_i)\,\mathrm{d}s_i - \int \Bigg\{ \prod_{i \in \mathcal{E}(b)} q_i(s_i) \Bigg\} \log f_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_b\,. \end{split} \tag{42}$$

**Figure 9.** (**Left**) One time segment of the FFG corresponding to the linear Gaussian state space model specified in Example 2 with the messages computed according to (36). (**Right**) The small dots indicate the noisy observations that are synthetically generated by the linear state space model of (23) using matrices specified in (24). The posterior distribution of the hidden states inferred by structured variational message passing is depicted with shaded regions representing plus and minus one variance. The minimum of the evaluated Bethe free energy over all iterations is $F[q, f] = 586.178$ (compared to $F[q, f] = 580.698$ in Example 1). The posterior distribution for the precision matrix is given by $Q \sim \mathcal{W}\!\left(\begin{bmatrix} 0.00266 & 0.000334 \\ 0.00034 & 0.00670 \end{bmatrix}, 102.0\right)$.

**Corollary 1.** *Naive Variational Message Passing: Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the induced subgraph $\mathcal{G}(b, j)$ with a naive mean-field factorization $l(b) = \{\{i\} \text{ such that } i \in \mathcal{E}(b)\}$. Let $m \in l(b)$ be the cluster for which $\{j\} = m$. Given the local polytope of* (33)*, the local stationary solutions to* (34) *are given by*

$$q_b^{m,*}(\mathbf{s}_b^m) = q_j^*(s_j) = \frac{\mu_{jb}^*(s_j)\,\mu_{jc}^*(s_j)}{\int \mu_{jb}^*(s_j)\,\mu_{jc}^*(s_j)\,\mathrm{d}s_j}\,,$$

*where the messages $\mu_{jc}^*(s_j)$ are the fixed points of the following iterations*

$$\mu_{jc}^{(k+1)}(s_j) = \exp\left(\int \Bigg\{ \prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} q_i^{(k)}(s_i) \Bigg\} \log f_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_{b \backslash j}\right), \tag{43}$$

*where k is an iteration index.*

**Proof.** See Appendix D.6.

The naive mean-field factorization limits the search space of beliefs by imposing strict constraints on the variational posterior. As a result, the variational posterior loses flexibility. To improve inference performance for sparse Bayesian learning, the authors of [40] propose a hybrid mechanism that augments naive mean-field VMP with sum-product updates. This hybrid scheme reduces the complexity of the sum-product algorithm while improving the accuracy of the naive VMP approach. In [41], naive VMP is applied to semi-parametric regression and allows regression models to scale to large data sets.
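To make the fixed-point iteration (43) concrete, the following minimal sketch (our own toy setup, not from the paper; model, priors, and variable names are assumptions) runs naive mean-field VMP for data $y_i \sim \mathcal{N}(\mu, \tau^{-1})$ with priors $\mu \sim \mathcal{N}(m_0, v_0)$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$. Each update computes a message of the form $\exp(\mathbb{E}_{q}[\log f])$, which stays within the conjugate families:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(5.0, 2.0, size=200)     # synthetic data with unknown mean and precision
n = y.size

# Priors: mu ~ N(m0, v0), tau ~ Gamma(a0, b0)
m0, v0, a0, b0 = 0.0, 100.0, 1e-3, 1e-3

# Initialize the mean-field factors q(mu) = N(m, v) and q(tau) = Gamma(a, b)
m, v, a, b = 0.0, 1.0, 1.0, 1.0
for _ in range(50):
    # Update for q(tau): exp(E_{q(mu)}[log joint]) is again a Gamma density
    a = a0 + n / 2.0
    b = b0 + 0.5 * (np.sum((y - m) ** 2) + n * v)
    # Update for q(mu): exp(E_{q(tau)}[log joint]) is again a Gaussian density
    E_tau = a / b
    prec = 1.0 / v0 + n * E_tau
    m = (m0 / v0 + E_tau * np.sum(y)) / prec
    v = 1.0 / prec
```

With weak priors, the converged $q(\mu)$ concentrates near the sample mean and $\mathbb{E}_q[\tau] = a/b$ near the inverse sample variance, matching the standard coordinate-ascent behavior of naive VMP.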

**Example 3.** *As a follow-up on Example 2, we relaxed the constraints (39a) and (39b) to the following constraints presented in Corollary 1 as*

$$\int q(\mathbf{x}_t', z_t, \mathbf{Q}_t)\,\mathrm{d}\mathbf{x}_t'\,\mathrm{d}z_t = q(\mathbf{Q}_t)\,, \qquad \int q(\mathbf{x}_t', z_t, \mathbf{Q}_t)\,\mathrm{d}\mathbf{Q}_t = q(\mathbf{x}_t', z_t) = q(\mathbf{x}_t')\,q(z_t) \tag{44a}$$

$$\int q(\mathbf{Q}_t)\,\mathrm{d}\mathbf{Q}_t = 1\,, \qquad \int q(\mathbf{x}_t')\,\mathrm{d}\mathbf{x}_t' = 1\,, \qquad \int q(z_t)\,\mathrm{d}z_t = 1\,. \tag{44b}$$

*The FFG remains the same, and we used identical data constraints as in Example 2. Together with constraint* (44)*, we obtained a hybrid of the naive variational message passing and sum-product algorithms, where the messages around the factor $h_t$ are computed by* (43) *and the rest of the messages by the sum-product rule* (22)*. Using the same data as in Example 1, the results for naive VMP are given in Figure 10 along with the evaluated Bethe free energy.*

**Figure 10.** (**Left**) The small dots indicate the noisy observations that were synthetically generated by the linear state space model of (23) using matrices specified in (24). The posterior distribution of the hidden states inferred by naive variational message passing is depicted with shaded regions representing plus and minus one variance. The minimum of the evaluated Bethe free energy over all iterations is $F[q, f] = 617.468$, which is higher than for the less-constrained Example 2 (with $F[q, f] = 586.178$) and Example 1 (with $F[q, f] = 580.698$). The posterior for the precision matrix is given by $Q \sim \mathcal{W}\!\left(\begin{bmatrix} 0.00141 & -6.00549 \times 10^{-5} \\ -6.00549 \times 10^{-5} & 0.00187 \end{bmatrix}, 102.0\right)$. (**Right**) A comparison of the Bethe free energies for the sum-product, structured, and naive variational message passing algorithms for the data generated in Example 1.

### *4.2. Form Constraints*

Form constraints limit the functional form of the variational factors $q_a(\mathbf{s}_a)$ and $q_i(s_i)$. One of the most widely used form constraints, the data constraint, is also illustrated in Appendix A.

#### 4.2.1. Data Constraints

A data constraint can be viewed as a special case of (9b), where the belief *qj* is constrained to be a Dirac-delta function [42], such that

$$\int q_a(\mathbf{s}_a)\,\mathrm{d}\mathbf{s}_{a \backslash j} = q_j(s_j) = \delta(s_j - \hat{s}_j)\,, \tag{45}$$

where $\hat{s}_j$ is a known value, e.g., an observation.

**Lemma 4.** *Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the node-induced subgraph $\mathcal{G}(b)$ (Figure 3). Then local stationary solutions to the Lagrangian*

$$\begin{split} L_b[q_b, f_b] = F[q_b, f_b] &+ \psi_b\left[\int q_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_b - 1\right] + \sum_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \int \lambda_{ib}(s_i)\left[q_i(s_i) - \int q_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_{b \backslash i}\right]\mathrm{d}s_i \\ &+ \int \lambda_{jb}(s_j)\left[\delta(s_j - \hat{s}_j) - \int q_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_{b \backslash j}\right]\mathrm{d}s_j + C_b\,, \end{split} \tag{46}$$

*where $C_b$ collects all terms that are independent of $q_b$, are of the form*

$$q_b(\mathbf{s}_b) = \frac{f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}(s_i)}{\int f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}(s_i)\,\mathrm{d}\mathbf{s}_b}\,. \tag{47}$$

**Proof.** See Appendix D.7.

**Theorem 3.** *Data-Constrained Sum-Product: Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the induced subgraph $\mathcal{G}(b, j)$ (Figure 11). Given the local polytope*

$$\mathcal{L}(\mathcal{G}(b,j)) = \{q_b \text{ s.t. } (45)\}\,, \tag{48}$$

*the local stationary solutions to*

$$q_b^* = \arg\min_{\mathcal{L}(\mathcal{G}(b,j))} F[q, f]\,,$$

*are of the form*

$$q_b^*(\mathbf{s}_b) = \frac{f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}^*(s_i)}{\int f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}^*(s_i)\,\mathrm{d}\mathbf{s}_b} \tag{49}$$

*with message*

$$\mu_{jb}^*(s_j) = \delta(s_j - \hat{s}_j)\,. \tag{50}$$

**Proof.** See Appendix D.8.

**Figure 11.** Visualization of a subgraph $\mathcal{G}(b, j)$ with indicated messages, where the dark circled delta indicates a data constraint—i.e., the variable $s_j$ is constrained to have a distribution of the form $\delta(s_j - \hat{s}_j)$.

Note that the resulting message $\mu_{jb}^*(s_j)$ to node $b$ does not depend on messages from node $c$, as would be the case for a sum-product update. By the symmetry of Theorem 3 for the subgraph $\mathcal{G}(c, j)$, (A32) identifies

$$\mu_{cj}(s_j) = \int f_c(\mathbf{s}_c) \prod_{\substack{i \in \mathcal{E}(c) \\ i \neq j}} \mu_{ic}(s_i)\,\mathrm{d}\mathbf{s}_{c \backslash j} \neq \delta(s_j - \hat{s}_j)\,.$$

This implies that messages incoming to a data constraint (such as *μcj*) are not further propagated through the data constraint. The data constraint thus effectively introduces a conditional independence between the variables of neighboring factors (conditioned on the shared constrained variable). Interestingly, this is similar to the notion of an intervention [43], where a decision variable is externally forced to a realization.

Data constraints allow information from data sets to be absorbed into the model. Essentially, (variational) Bayesian machine learning is an application of inference in a graph with data constraints. In our framework, data are a constraint, and machine learning via Bayes' rule follows naturally from the minimization of the Bethe free energy (see also Appendix A).
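As a quick numerical illustration (a discrete toy example of our own, not from the paper), clamping an observed variable with a delta message in a sum-product update reproduces exact conditioning on the observation:

```python
import numpy as np

# A toy discrete node function f(x, y) over x in {0,1} and y in {0,1,2}
f = np.array([[0.10, 0.20, 0.30],
              [0.05, 0.15, 0.20]])

# Data constraint on y: the belief over y is a point mass at the observed value
y_hat = 2
mu_delta = np.zeros(3)
mu_delta[y_hat] = 1.0            # discrete analogue of the delta message (50)

# Sum-product update for x with the delta message substituted for the y-message
q_x = f @ mu_delta
q_x /= q_x.sum()

# Clamping is equivalent to directly conditioning the node function on y = y_hat
p_cond = f[:, y_hat] / f[:, y_hat].sum()
```

Note that `q_x` is identical to `p_cond` regardless of any message arriving from the other side of the data constraint, which mirrors the conditional-independence remark above.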

#### 4.2.2. Laplace Propagation

A second type of form constraint we consider is the Laplace constraint; see also [14]. Consider a second-order Taylor approximation of the local log-node function

$$\mathcal{L}_a(\mathbf{s}_a) = \log f_a(\mathbf{s}_a)\,, \tag{51}$$

around an approximation point $\hat{\mathbf{s}}_a$, as

$$\tilde{\mathcal{L}}_a(\mathbf{s}_a; \hat{\mathbf{s}}_a) = \mathcal{L}_a(\hat{\mathbf{s}}_a) + \nabla^\top \mathcal{L}_a(\hat{\mathbf{s}}_a)(\mathbf{s}_a - \hat{\mathbf{s}}_a) + \frac{1}{2}(\mathbf{s}_a - \hat{\mathbf{s}}_a)^\top \nabla^2 \mathcal{L}_a(\hat{\mathbf{s}}_a)(\mathbf{s}_a - \hat{\mathbf{s}}_a)\,. \tag{52}$$

From this approximation, we define the Laplace-approximated node function as

$$\tilde{f}_a(\mathbf{s}_a; \hat{\mathbf{s}}_a) \triangleq \exp\left(\tilde{\mathcal{L}}_a(\mathbf{s}_a; \hat{\mathbf{s}}_a)\right), \tag{53}$$

which is substituted in the local free energy to obtain the Laplace-encoded local free energy as

$$F[q_a, \tilde{f}_a; \hat{\mathbf{s}}_a] = \int q_a(\mathbf{s}_a) \log \frac{q_a(\mathbf{s}_a)}{\tilde{f}_a(\mathbf{s}_a; \hat{\mathbf{s}}_a)}\,\mathrm{d}\mathbf{s}_a\,. \tag{54}$$

It follows that the Laplace-encoded optimization of the local free energy becomes

$$q_a^* = \arg\min_{q_a} L_a[q_a, \tilde{f}_a; \hat{\mathbf{s}}_a]\,, \tag{55}$$

where the Lagrangian *La* imposes the marginalization and normalization constraints of (9) on (54).

**Lemma 5.** *Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the node-induced subgraph $\mathcal{G}(b)$ (Figure 12). The stationary points of the Laplace-approximated Lagrangian* (55) *as a functional of $q_b$,*

$$\begin{split} L_b[q_b, \tilde{f}_b; \hat{\mathbf{s}}_b] = F[q_b, \tilde{f}_b; \hat{\mathbf{s}}_b] &+ \psi_b\left[\int q_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_b - 1\right] \\ &+ \sum_{i \in \mathcal{E}(b)} \int \lambda_{ib}(s_i)\left[q_i(s_i) - \int q_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_{b \backslash i}\right]\mathrm{d}s_i + C_b\,, \end{split} \tag{56}$$

*where $C_b$ collects all terms that are independent of $q_b$, are of the form*

$$q_b(\mathbf{s}_b) = \frac{\tilde{f}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}(s_i)}{\int \tilde{f}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}(s_i)\,\mathrm{d}\mathbf{s}_b}\,. \tag{57}$$

**Proof.** See Appendix D.9.

**Figure 12.** The subgraph around a Laplace-approximated node *b* with indicated messages.

We can now formulate Laplace propagation as an iterative procedure, where the approximation point $\hat{\mathbf{s}}_b$ is chosen as the mode of the belief $q_b(\mathbf{s}_b)$.

**Theorem 4.** *Laplace Propagation: Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the induced subgraph $\mathcal{G}(b, j)$ (Figure 13) with the Laplace-encoded factor $\tilde{f}_b$ as per* (53)*. We write the model* (1) *with the Laplace-encoded factor $\tilde{f}_b$ substituted for $f_b$ as $\tilde{f}$. Given the local polytope $\mathcal{L}(\mathcal{G}(b, j))$ of* (14)*, the local stationary solutions to*

$$\{q_b^*, q_j^*\} = \arg\min_{\mathcal{L}(\mathcal{G}(b,j))} F[q, \tilde{f}; \hat{\mathbf{s}}_b] \tag{58}$$

*are given by*

$$\begin{aligned} q_b^*(\mathbf{s}_b) &= \frac{\tilde{f}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b^*) \prod_{i \in \mathcal{E}(b)} \mu_{ib}^*(s_i)}{\int \tilde{f}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b^*) \prod_{i \in \mathcal{E}(b)} \mu_{ib}^*(s_i)\,\mathrm{d}\mathbf{s}_b}\,, \\ q_j^*(s_j) &= \frac{\mu_{jb}^*(s_j)\,\mu_{jc}^*(s_j)}{\int \mu_{jb}^*(s_j)\,\mu_{jc}^*(s_j)\,\mathrm{d}s_j} \end{aligned}$$

*with $\hat{\mathbf{s}}_b^*$ and the messages $\mu_{jc}^*(s_j)$ the fixed points of*

$$\begin{aligned} \hat{\mathbf{s}}_b^{(k)} &= \arg\max_{\mathbf{s}_b} \log q_b^{(k)}(\mathbf{s}_b) \\ q_b^{(k)}(\mathbf{s}_b) &= \frac{\tilde{f}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b^{(k)}) \prod_{i \in \mathcal{E}(b)} \mu_{ib}^{(k)}(s_i)}{\int \tilde{f}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b^{(k)}) \prod_{i \in \mathcal{E}(b)} \mu_{ib}^{(k)}(s_i)\,\mathrm{d}\mathbf{s}_b} \\ \mu_{jc}^{(k+1)}(s_j) &= \int \tilde{f}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b^{(k)}) \prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \mu_{ib}^{(k)}(s_i)\,\mathrm{d}\mathbf{s}_{b \backslash j}\,. \end{aligned}$$

**Proof.** See Appendix D.10.

**Figure 13.** Visualization of a subgraph with indicated Laplace propagation messages. The node function $f_b$ is replaced by $\tilde{f}_b$ according to (53).

Laplace propagation was introduced in [14] as an algorithm that propagates mean and variance information when exact updates are expensive to compute. Laplace propagation has found applications in the context of Gaussian processes and support vector machines [14]. In the jointly normal case, Laplace propagation coincides with sum-product and expectation propagation [14,18].
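The mode-and-curvature step behind (52)–(53) can be sketched numerically. The example below (our own toy setup, not from the paper) Laplace-approximates the belief formed by a logistic node function $\sigma(s)$ and an incoming standard Gaussian message, using Newton iterations to locate the expansion point:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Log of node function times incoming message: log sigmoid(s) + log N(s; 0, 1) + const
def grad(s):
    return (1.0 - sigmoid(s)) - s          # first derivative of the log-belief

def hess(s):
    p = sigmoid(s)
    return -p * (1.0 - p) - 1.0            # second derivative (always negative here)

# Newton iterations to find the expansion point (the mode of the belief)
s = 0.0
for _ in range(20):
    s = s - grad(s) / hess(s)

mode = s
var = -1.0 / hess(s)   # negative inverse curvature gives the Gaussian variance
```

The resulting Gaussian $\mathcal{N}(\text{mode}, \text{var})$ plays the role of the Laplace-encoded belief from which outgoing messages are computed; in this 1-D example the mode lands near 0.40 with variance near 0.81.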

#### 4.2.3. Expectation Propagation

Expectation propagation can be derived in terms of constraint manipulation by relaxing the marginalization constraints to expectation constraints. Expectation constraints are of the form

$$\int q_a(\mathbf{s}_a)\,T_i(s_i)\,\mathrm{d}\mathbf{s}_a = \int q_i(s_i)\,T_i(s_i)\,\mathrm{d}s_i\,, \tag{59}$$

for a given function (statistic) $T_i(s_i)$. Technically, the statistic $T_i(s_i)$ can be chosen arbitrarily. Nevertheless, it is often chosen as the sufficient statistic of an exponential family distribution. An exponential family distribution is defined by

$$q\_i(\mathbf{s}\_i) = h(\mathbf{s}\_i) \exp\left(\eta\_i^\top T\_i(\mathbf{s}\_i) - \log Z(\eta\_i)\right),\tag{60}$$

where $\eta_i$ is the natural parameter, $Z(\eta_i)$ is the partition function, $T_i(s_i)$ is the sufficient statistic, and $h(s_i)$ is a base measure [24]. The statistic $T_i(s_i)$ is called sufficient because, given observed values of the random variable $s_i$, the parameter $\eta_i$ can be estimated using only the statistic $T_i(s_i)$; that is, the estimator of $\eta_i$ depends on the data only through this statistic.
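For a concrete instance of (60): a Gaussian $\mathcal{N}(m, v)$ has sufficient statistics $T(s) = (s, s^2)$ and natural parameters $\eta = (m/v, -1/(2v))$, so multiplying two exponential family messages amounts to adding their natural parameters, as exploited in (64). A small sketch of our own (the helper names are illustrative, not from the paper):

```python
import numpy as np

def to_natural(m, v):
    """Gaussian (mean, variance) -> natural parameters for T(s) = (s, s^2)."""
    return np.array([m / v, -0.5 / v])

def from_natural(eta):
    """Natural parameters -> Gaussian (mean, variance)."""
    v = -0.5 / eta[1]
    return eta[0] * v, v

# Multiplying exponential family messages = adding their natural parameters
eta_b = to_natural(1.0, 2.0)     # message N(1, 2)
eta_c = to_natural(3.0, 4.0)     # message N(3, 4)
m, v = from_natural(eta_b + eta_c)

# Cross-check against the classical precision-weighted combination
prec = 1.0 / 2.0 + 1.0 / 4.0
m_ref = (1.0 / 2.0 * 1.0 + 1.0 / 4.0 * 3.0) / prec
v_ref = 1.0 / prec
```

The combined belief agrees with the precision-weighted product of Gaussians, which is the mechanism that keeps EP updates closed under the chosen exponential family.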

The idea behind expectation propagation [18] is to relax the marginalization constraints with moment-matching constraints by choosing sufficient statistics from exponential family distributions [12]. Relaxation allows approximating the marginals of the sum-product algorithm with exponential family distributions. By keeping the marginals within the exponential family, the complexity of the resulting computations is reduced.

**Lemma 6.** *Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the node-induced subgraph $\mathcal{G}(b)$ (Figure 3). The stationary points of the Lagrangian*

$$\begin{split} L_b[q_b, f_b] = F[q_b, f_b] &+ \psi_b\left[\int q_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_b - 1\right] + \sum_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \int \lambda_{ib}(s_i)\left[q_i(s_i) - \int q_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_{b \backslash i}\right]\mathrm{d}s_i \\ &+ \eta_{jb}^\top\left[\int q_j(s_j)\,T_j(s_j)\,\mathrm{d}s_j - \int q_b(\mathbf{s}_b)\,T_j(s_j)\,\mathrm{d}\mathbf{s}_b\right] + C_b\,, \end{split} \tag{61}$$

*with sufficient statistics $T_j$, and where $C_b$ collects all terms that are independent of $q_b$, are of the form*

$$q_b(\mathbf{s}_b) = \frac{f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}(s_i)}{\int f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}(s_i)\,\mathrm{d}\mathbf{s}_b}\,, \tag{62}$$

*with incoming exponential family message*

$$\mu_{jb}(s_j) = \exp\left(\eta_{jb}^\top T_j(s_j)\right). \tag{63}$$

**Proof.** See Appendix D.11.

**Lemma 7.** *Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider an edge-induced subgraph $\mathcal{G}(j)$ (Figure 4). The stationary solutions of the Lagrangian*

$$L_j[q_j] = H[q_j] + \psi_j\left[\int q_j(s_j)\,\mathrm{d}s_j - 1\right] + \sum_{a \in \mathcal{V}(j)} \eta_{ja}^\top\left[\int q_j(s_j)\,T_j(s_j)\,\mathrm{d}s_j - \int q_a(\mathbf{s}_a)\,T_j(s_j)\,\mathrm{d}\mathbf{s}_a\right] + C_j\,,$$

*with sufficient statistics $T_j(s_j)$, and where $C_j$ collects all terms that are independent of $q_j$, are of the form*

$$q_j(s_j) = \frac{\exp\left([\eta_{jb} + \eta_{jc}]^\top T_j(s_j)\right)}{\int \exp\left([\eta_{jb} + \eta_{jc}]^\top T_j(s_j)\right)\mathrm{d}s_j}\,. \tag{64}$$

**Proof.** See Appendix D.12.

**Theorem 5.** *Expectation Propagation: Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the induced subgraph $\mathcal{G}(b, j)$ (Figure 5). Given the local polytope*

$$\mathcal{L}(\mathcal{G}(b,j)) = \left\{q_b \text{ s.t. } (9\text{a}), \text{ and } q_j \text{ s.t. } (59) \text{ and } (10)\right\}, \tag{65}$$

*and $\mu_{jb}(s_j) = \exp\left(\eta_{jb}^\top T_j(s_j)\right)$ an exponential family message (from Lemma 6). Then, the local stationary solutions to* (15) *are given by*

$$q_b^*(\mathbf{s}_b) = \frac{f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}^*(s_i)}{\int f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}^*(s_i)\,\mathrm{d}\mathbf{s}_b} \tag{66a}$$

$$q_j^*(s_j) = \frac{\exp\left([\eta_{jb}^* + \eta_{jc}^*]^\top T_j(s_j)\right)}{\int \exp\left([\eta_{jb}^* + \eta_{jc}^*]^\top T_j(s_j)\right)\mathrm{d}s_j}\,, \tag{66b}$$

*with $\eta_{jb}^*$, $\eta_{jc}^*$ and $\mu_{jc}^*(s_j)$ being the fixed points of the iterations*

$$\begin{aligned} \tilde{\mu}_{jc}^{(k)}(s_j) &= \int f_b(\mathbf{s}_b) \prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \mu_{ib}^{(k)}(s_i)\,\mathrm{d}\mathbf{s}_{b \backslash j} \\ \tilde{q}_j^{(k)}(s_j) &= \frac{\mu_{jb}^{(k)}(s_j)\,\tilde{\mu}_{jc}^{(k)}(s_j)}{\int \mu_{jb}^{(k)}(s_j)\,\tilde{\mu}_{jc}^{(k)}(s_j)\,\mathrm{d}s_j}\,. \end{aligned}$$

*By moment-matching on $\tilde{q}_j^{(k)}(s_j)$, we obtain the natural parameter $\tilde{\eta}_j^{(k)}$. The message update then follows from*

$$\begin{aligned} \eta_{jc}^{(k)} &= \tilde{\eta}_j^{(k)} - \eta_{jb}^{(k)} \\ \mu_{jc}^{(k+1)}(s_j) &= \exp\left(T_j(s_j)^\top \eta_{jc}^{(k)}\right). \end{aligned}$$

**Proof.** See Appendix D.13.

Moment-matching can be performed by solving [24] (Proposition 3.1)

$$\nabla_{\eta_j} \log Z_j(\eta_j) = \int \tilde{q}_j(s_j)\,T_j(s_j)\,\mathrm{d}s_j$$

for *ηj*, where

$$Z_j(\eta_j) = \int \exp\left(\eta_j^\top T_j(s_j)\right)\mathrm{d}s_j\,.$$

In practice, for a Gaussian approximation, the natural parameters can be obtained by converting the matched mean and variance of $\tilde{q}_j(s_j)$ to the canonical form [18]. Computing the moments of $\tilde{q}_j(s_j)$ is often challenging due to the lack of a closed-form solution for the normalization constant. In order to address the computation of moments in EP, Ref. [44] proposes to evaluate challenging moments by quadrature methods. For multivariate random variables, moment-matching by spherical radial cubature is advantageous, as it reduces the computational complexity [45]. Another popular way of evaluating the moments is through importance sampling [46] (Ch. 7) and [47].
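To illustrate the quadrature approach to moment-matching, the sketch below (our own toy setup; the logistic factor and all variable names are assumptions, not from the paper) matches a Gaussian to the tilted distribution $\tilde{q}(s) \propto \mathcal{N}(s; 0, 1)\,\sigma(s)$ with Gauss–Hermite quadrature, then forms the outgoing message by subtracting natural parameters as in Theorem 5:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Cavity (incoming exponential family message) as a Gaussian N(m, v)
m, v = 0.0, 1.0
eta_cav = np.array([m / v, -0.5 / v])      # natural parameters for T(s) = (s, s^2)

# Moments of the tilted distribution q~(s) ∝ N(s; m, v) * sigmoid(s),
# via Gauss-Hermite quadrature: E[g] ≈ (1/√π) Σ w_i g(m + √(2v) x_i)
x, w = np.polynomial.hermite.hermgauss(60)
s = m + np.sqrt(2.0 * v) * x
g = sigmoid(s)
Z = np.sum(w * g) / np.sqrt(np.pi)
E1 = np.sum(w * s * g) / np.sqrt(np.pi) / Z          # matched mean
E2 = np.sum(w * s**2 * g) / np.sqrt(np.pi) / Z       # matched second moment

# Moment-matched Gaussian and its natural parameters
v_new = E2 - E1**2
eta_match = np.array([E1 / v_new, -0.5 / v_new])

# Outgoing message: divide the matched marginal by the cavity (subtract naturals)
eta_msg = eta_match - eta_cav
```

With this symmetric setup, $Z \approx 0.5$ and the matched mean is pulled to positive values by the logistic factor, while the matched variance shrinks below the cavity variance.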

Expectation propagation has been utilized in various applications ranging from time series estimation with Gaussian processes [48] to Bayesian learning with stochastic natural gradients [49]. When the likelihood functions for Gaussian process classification are not Gaussian, EP is often utilized [50] (Chapter 3). In [51], a message passing-based expectation propagation algorithm is developed for models that involve both continuous and discrete random variables. Perhaps the most practical applications of EP are in the context of probabilistic programming [52], where it is heavily used in real-world applications.

### *4.3. Hybrid Constraints*

In this section, we consider hybrid methods that combine factorization and form constraints, and formalize some well-known algorithms in terms of message passing.

#### 4.3.1. Mean-Field Variational Laplace

Mean-field variational Laplace applies the mean-field factorization to the Laplace-approximated factor function. The appeal of this method is that all messages outbound from the Laplace-approximated factor can be represented by Gaussians.

**Theorem 6.** *Mean-field variational Laplace: Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the induced subgraph $\mathcal{G}(b, j)$ (Figure 13) with the Laplace-encoded factor $\tilde{f}_b$ as per* (53)*. We write the model* (1) *with the Laplace-encoded factor $\tilde{f}_b$ substituted for $f_b$ as $\tilde{f}$. Furthermore, assume a naive mean-field factorization $l(b) = \{\{i\} \text{ for all } i \in \mathcal{E}(b)\}$. Let $m \in l(b)$ be the cluster for which $\{j\} = m$. Given the local polytope of* (33)*, the local stationary solutions to*

$$\{q_b^{m,*}, q_j^*\} = \arg\min_{\mathcal{L}(\mathcal{G}(b,j))} F[q, \tilde{f}; \hat{\mathbf{s}}_b]\,, \tag{67}$$

*are given by*

$$q_b^{m,*}(\mathbf{s}_b^m) = q_j^*(s_j) = \frac{\mu_{jb}^*(s_j)\,\mu_{jc}^*(s_j)}{\int \mu_{jb}^*(s_j)\,\mu_{jc}^*(s_j)\,\mathrm{d}s_j}\,,$$

*where $\mu_{jc}^*$ represents the fixed points of the following iterations*

$$\mu_{jc}^{(k+1)}(s_j) = \exp\left(\int \Bigg(\prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} q_i^{(k)}(s_i)\Bigg) \log \tilde{f}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b^{(k)})\,\mathrm{d}\mathbf{s}_{b \backslash j}\right), \tag{68}$$

*with*

$$\hat{\mathbf{s}}_b^{(k)} = \arg\max_{\mathbf{s}_b} \log q_b^{(k)}(\mathbf{s}_b)\,.$$

**Proof.** See Appendix D.14.

Conveniently, under these constraints, every outbound message from node *b* will be proportional to a Gaussian. Substituting the Laplace-approximated factor function, we obtain:

$$\log \mu_{jc}^{(k)}(s_j) = \int \Bigg(\prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} q_i^{(k)}(s_i)\Bigg) \tilde{\mathcal{L}}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b^{(k)})\,\mathrm{d}\mathbf{s}_{b \backslash j} + C\,.$$

Resolving this expectation yields a quadratic form in $s_j$, which after completing the square leads to a message $\mu_{jc}(s_j)$ that is proportional to a Gaussian. This argument holds for any edge adjacent to $b$, and therefore for all outbound messages from node $b$. Moreover, if the incoming messages are represented by Gaussians as well (e.g., because they are also computed under the mean-field variational Laplace constraint), then all beliefs on the edges adjacent to $b$ will also be Gaussian. This significantly simplifies the computation of the expectations, which illustrates the computational appeal of mean-field variational Laplace.
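A numerical sanity check of this completing-the-square argument (our own construction; the matrix $A$, expansion point, and belief values are illustrative assumptions): for a quadratic $\tilde{\mathcal{L}}_b$ with negative Hessian $A$ around $\hat{\mathbf{s}}$, the expectation over $q(s_2)$ leaves a quadratic in $s_1$, i.e., a Gaussian message with precision $A_{11}$ and mean $\hat{s}_1 - (A_{12}/A_{11})(m_2 - \hat{s}_2)$:

```python
import numpy as np

# Quadratic (Laplace-approximated) log-node function around expansion point shat
A = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed negative Hessian at shat
shat = np.array([1.0, -1.0])
m2, v2 = 0.5, 0.3                        # current belief q(s2) = N(m2, v2)

def L_tilde(s1, s2):
    d1, d2 = s1 - shat[0], s2 - shat[1]
    return -0.5 * (A[0, 0] * d1**2 + 2 * A[0, 1] * d1 * d2 + A[1, 1] * d2**2)

# log mu(s1) = E_{q(s2)}[L_tilde(s1, s2)], via Gauss-Hermite quadrature
x, w = np.polynomial.hermite.hermgauss(40)
s2_nodes = m2 + np.sqrt(2.0 * v2) * x
s1 = np.linspace(-5.0, 6.0, 4001)
log_mu = np.array([np.sum(w * L_tilde(t, s2_nodes)) / np.sqrt(np.pi) for t in s1])

# Normalize on the grid and compute the message's mean and variance
ds = s1[1] - s1[0]
mu = np.exp(log_mu - log_mu.max())
mu /= mu.sum() * ds
mean = np.sum(s1 * mu) * ds
var = np.sum((s1 - mean) ** 2 * mu) * ds

# Closed form from completing the square
mean_cf = shat[0] - (A[0, 1] / A[0, 0]) * (m2 - shat[1])
var_cf = 1.0 / A[0, 0]
```

The grid-based moments agree with the closed-form Gaussian, confirming that the expected quadratic log-factor produces a Gaussian message whose precision is just the corresponding diagonal entry of $A$.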

Mean-field variational Laplace is widely used in dynamic causal modeling [53] and more generally in cognitive neuroscience, partly because the resulting computations are deemed neurologically plausible [54–56].

#### 4.3.2. Expectation Maximization

Expectation Maximization (EM) can be viewed as a hybrid algorithm that combines a structured variational factorization with a Dirac-delta constraint, where the constrained value itself is optimized. Given a structured mean-field factorization $l(a) \subseteq \mathcal{P}(a)$ with a single-edge cluster $m = \{j\}$, expectation maximization considers local factorizations of the form

$$q_a(\mathbf{s}_a) = \delta(s_j - \theta_j) \prod_{\substack{n \in l(a) \\ n \neq m}} q_a^n(\mathbf{s}_a^n)\,, \tag{69}$$

where the belief for $s_j$ is constrained by a Dirac-delta distribution, similar to Section 4.2.1. In (69), however, the variable $s_j$ represents a random variable with (unknown) value $\theta_j \in \mathbb{R}^d$, where $d$ is the dimension of the random variable $s_j$. We explicitly use the notation $\theta_j$ (as opposed to $\hat{s}_j$ for the data constraint in Section 4.2.1) to clarify that this value is a parameter for the constrained belief over $s_j$ that will be optimized—that is, $\theta_j$ does not represent a model parameter in itself. To make this distinction even more explicit, in the context of optimization, we will refer to Dirac-delta constraints as point-mass constraints.

The factor-local free energy $F[q_a, f_a; \theta_j]$ then becomes a function of the parameter $\theta_j$.

**Theorem 7.** *Expectation maximization: Given a TFFG $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, consider the induced subgraph $\mathcal{G}(b, j)$ (Figure 14) with a structured mean-field factorization $l(b) \subseteq \mathcal{P}(b)$, with local clusters $n \in l(b)$. Let $m \in l(b)$ be the cluster for which $\{j\} = m$. Given the local polytope*

$$\mathcal{L}(\mathcal{G}(b,j)) = \left\{ q\_b^n \text{ for all } n \in l(b) \text{ s.t. (29a)} \right\},\tag{70}$$

*the local stationary solutions to*

$$\theta_j^* = \arg\min_{\mathcal{L}(\mathcal{G}(b,j))} F[q, f; \theta_j]\,,$$

*are given by the fixed points of*

$$\mu_{bj}^{(k+1)}(s_j) = \exp\left(\int \Bigg\{ \prod_{\substack{n \in l(b) \\ n \neq m}} q_b^{n,(k)}(\mathbf{s}_b^n) \Bigg\} \log f_b(\mathbf{s}_b)\,\mathrm{d}\mathbf{s}_{b \backslash j}\right) \tag{71a}$$

$$\theta\_j^{(k+1)} = \arg\max\_{s\_j} \left( \log \mu\_{bj}^{(k+1)}(s\_j) + \log \mu\_{cj}^{(k+1)}(s\_j) \right). \tag{71b}$$

**Proof.** See Appendix D.15.

**Figure 14.** Visualization of a subgraph $\mathcal{G}(b, j)$ with indicated messages. The open circle indicates a point-mass constraint of the form $\delta(s_j - \theta_j)$, where the value $\theta_j$ is optimized.

Expectation maximization was formulated in [57] as an iterative method that optimizes log-expectations of likelihood functions, where each EM iteration is guaranteed to increase the expected log-likelihood. Moreover, under some differentiability conditions, the EM algorithm is guaranteed to converge [57] (Theorem 3). A detailed overview of EM for exponential families is available in [24] (Ch. 6). A formulation of EM in terms of message passing is given by [58], where message passing for EM is applied in a filtering and system identification context. In [58], derivations are based on [57] (Theorem 1), whereas our derivations directly follow from variational principles.
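The alternation (71a)–(71b) can be sketched on a toy model (our own example, not from the paper): $x \sim \mathcal{N}(\theta, 1)$ with observations $y_i \mid x \sim \mathcal{N}(x, 1)$, where the E-step forms the belief $q(x)$ from the prior and likelihood messages, and the M-step moves the point mass $\theta$ to the maximizer of the expected log prior:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(3.0, 1.0, size=50)    # synthetic observations; x fixed to 3 for generation
n = y.size

theta = 0.0                          # location of the point mass over the prior mean
for _ in range(20):
    # E-step: q(x) = N(m, v) combines the prior message N(theta, 1)
    # with the n unit-precision likelihood messages
    v = 1.0 / (1.0 + n)
    m = (theta + y.sum()) * v
    # M-step: theta = argmax_theta E_q[log N(x; theta, 1)] = E_q[x]
    theta = m
```

Each iteration contracts $\theta$ toward the fixed point $\theta^* = \bar{y}$ (the sample mean) by a factor $1/(n+1)$, so convergence is essentially immediate; the free energy is non-increasing along the way, in line with the monotonicity guarantee of [57].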

**Example 4.** *Now suppose we do not know the angle $\theta$ for the state transition matrix $A_t$ in Example 2 and would like to estimate its value. Moreover, suppose that we are also interested in estimating the hyperparameters for the prior, $m_{x_0}$ and $V_{x_0}$, as well as the precision matrix for the state transitions, $\mathbf{Q}_t$. For this purpose, we changed the constraints of* (25a) *into EM constraints in accordance with Theorem 7:*

$$q(\mathbf{x}_{t-1}, z_t, A_t(\theta)) = \delta(A_t(\theta) - A_t(\hat{\theta}))\, q(z_t \mid \mathbf{x}_{t-1}, A_t(\theta))\, q(\mathbf{x}_{t-1}) \tag{72a}$$

$$q(\mathbf{x}_0, m_{x_0}, V_{x_0}) = q(\mathbf{x}_0)\,\delta(m_{x_0} - \hat{m}_{x_0})\,\delta(V_{x_0} - \hat{V}_{x_0})\,, \tag{72b}$$

*where we optimize $\hat{\theta}$, $\hat{V}_{x_0}$ and $\hat{m}_{x_0}$ with EM ($\hat{V}_{x_0}$ is further constrained to be positive definite during the optimization procedure). With the addition of the new EM constraints, the resulting FFG is given in Figure 15. The hybrid message passing algorithm consists of structured variational messages around the factor $h_t$, sum-product messages around $w_t$, $n_t$, $m_t$ and $r_t$, and EM messages around $g_0$ and $g_t$. We used identical observations as in the previous examples. The results for the hybrid SVMP-EM-SP algorithm are given in Figure 16 along with the evaluated Bethe free energy over all iterations.*

**Figure 15.** The FFG of the linear Gaussian state space model augmented with the EM constraints in Example 4.

**Figure 16.** (**Left**) The small dots indicate the noisy observations that are synthetically generated by the linear state space model of (23) using matrices specified in (24). The posterior distribution of the hidden states inferred by structured variational message passing is depicted with shaded regions representing plus and minus one variance. The minimum of the evaluated Bethe free energy over iterations is $F[q, f] = 583.683$. Moreover, the posterior distribution for the precision matrix is given by $Q \sim \mathcal{W}\!\left(\begin{bmatrix} 0.00286 & 0.00038 \\ 0.00038 & 0.00691 \end{bmatrix}, 102.0\right)$. The EM estimates are $\hat{\theta} = \pi/7.821$, $\hat{m}_{x_0} = [7.23, -7.016]$ and $\hat{V}_{x_0} = \begin{bmatrix} 11.028 & -1.926 \\ -1.926 & 10.918 \end{bmatrix}$. (**Right**) Free energy plots of the four algorithms discussed in Examples 1–4 on the same data set.

#### *4.4. Overview of Message Passing Algorithms*

In Sections 4.1–4.3, following a high-level recipe pioneered by [15], we presented first-principles derivations of some of the popular message passing-based inference algorithms by manipulating the local constraints of the Bethe free energy. The results are summarized in Table 1.

Crucially, the method of constrained BFE minimization goes beyond the reviewed algorithms. Through creating a new set of local constraints and following similar derivations based on variational calculus, one can obtain new message passing-based inference algorithms that better match the specifics of the generative model or application.

#### **5. Scoring Models by Minimized Variational Free Energy**

As discussed in Section 2.2, the variational free energy is an important measure of model performance. In Sections 5.1 and 5.2, we discuss some problems that occur when evaluating the BFE on a TFFG. In Section 5.3, we propose an algorithm that evaluates the constrained BFE as a summation of local contributions on the TFFG.

## *5.1. Evaluation of the Entropy of Dirac-Delta Constrained Beliefs*

For continuous variables, data and point-mass constraints, as discussed in Sections 4.2.1 and 4.3.2 and Appendix A, collapse the information density to infinity, which leads to singularities in entropy evaluation [59]. More specifically, for a continuous variable $s_j$, the entropies for beliefs of the form $q_j(s_j) = \delta(s_j - \hat{s}_j)$ and $q_a(\mathbf{s}_a) = q_{a|j}(\mathbf{s}_{a\backslash j}|s_j)\,\delta(s_j - \hat{s}_j)$ both evaluate to $-\infty$.

In variational inference, it is common to define the VFE only with respect to the latent (unobserved) variables [2] (Section 10.1). In contrast, in this paper, we explicitly define the BFE in terms of an iteration over all nodes and edges (7), which also includes non-latent beliefs in the BFE definition. Therefore, we define

$$\begin{aligned} q_{j}(s_{j}) &= \delta(s_{j} - \hat{s}_{j}) \Rightarrow H[q_{j}] \triangleq 0, \\ q_{a}(\mathbf{s}_{a}) &= q_{a|j}(\mathbf{s}_{a\backslash j}|s_{j})\, \delta(s_{j} - \hat{s}_{j}) \Rightarrow H[q_{a}] \triangleq H[q_{a\backslash j}] \;, \end{aligned}$$

where $q_{a|j}(\mathbf{s}_{a\backslash j}|s_j)$ indicates the conditional belief and $q_{a\backslash j}(\mathbf{s}_{a\backslash j})$ is the joint belief over the remaining variables. These definitions effectively remove the entropies of observed variables from the BFE evaluation. Note that although $q_{a\backslash j}(\mathbf{s}_{a\backslash j})$ is technically not a part of our belief set (7), it can be obtained by marginalization of $q_a(\mathbf{s}_a)$ (9b).
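These conventions translate directly into implementation. The following sketch (our own illustration; the belief representation and function names are not from the paper) assigns a zero entropy contribution to point-mass beliefs and the usual differential entropy to Gaussian beliefs, so that observed variables drop out of the BFE entropy sum:

```python
import numpy as np

def gaussian_entropy(variance):
    """Differential entropy of a univariate Gaussian N(m, v)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * variance)

def belief_entropy(belief):
    """Entropy contribution of a single belief in the BFE sum.

    Point-mass (Dirac-delta) beliefs of observed variables contribute 0
    by definition, instead of the -infinity a literal evaluation gives."""
    if belief["type"] == "delta":
        return 0.0  # H[q_j] := 0 for q_j(s_j) = delta(s_j - s_hat_j)
    return gaussian_entropy(belief["v"])

beliefs = [{"type": "delta", "value": 1.3},           # observed variable
           {"type": "gaussian", "m": 0.0, "v": 2.0}]  # latent variable
total_entropy = sum(belief_entropy(b) for b in beliefs)
```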

#### *5.2. Evaluation of Node-Local Free Energy for Deterministic Nodes*

Another difficulty arises with the evaluation of the node-local free energy $F[q_a]$ for factors of the form

$$f\_a(\mathbf{s}\_a) = \delta(h\_a(\mathbf{s}\_a))\,. \tag{73}$$

This type of node function reflects deterministic operations, e.g., $h(x, y, z) = z - x - y$ corresponds to the summation $z = x + y$. In this case, directly evaluating $F[q_a]$ again leads to singularities.

There are (at least) two strategies available in the literature that resolve this issue. The first strategy "softens" the Dirac-delta by re-defining:

$$f\_a(\mathbf{s}\_a) \triangleq \frac{1}{\sqrt{2\pi\epsilon}} \exp\left(-\frac{1}{2\epsilon}h\_a(\mathbf{s}\_a)^2\right),$$

with $0 < \epsilon \ll 1$ [17]. A drawback of this approach is that it may alter the model definition in a numerically unstable way, leading to a different inference solution and variational free energy than originally intended.
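The instability is easy to exhibit numerically. In the sketch below (our own illustration), the log of the softened factor grows without bound on the constraint surface as $\epsilon$ shrinks, and diverges to $-\infty$ off the surface, so the evaluated free energy depends strongly on the chosen $\epsilon$:

```python
import numpy as np

def softened_log_factor(h_value, eps):
    """log f_a(s_a) for the softened factor N(h_a(s_a) | 0, eps)."""
    return -0.5 * np.log(2.0 * np.pi * eps) - 0.5 * h_value**2 / eps

# On the constraint surface (h = 0) the log-factor increases as eps shrinks;
# off the surface (h != 0) it diverges to -infinity.
on_surface = [softened_log_factor(0.0, e) for e in (1e-1, 1e-3, 1e-5)]
off_surface = [softened_log_factor(0.1, e) for e in (1e-1, 1e-3, 1e-5)]
```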

The second strategy combines the deterministic factor $f_a$ with a neighboring stochastic factor $f_b$ into a new *composite* factor $f_c$, by marginalizing over a shared variable $s_j$, leading to [60]

$$f_c(\mathbf{s}_c) \triangleq \int \delta(h_a(\mathbf{s}_a)) \, f_b(\mathbf{s}_b) \, \mathrm{d}s_j \,,$$

where $\mathbf{s}_c = (\mathbf{s}_a \cup \mathbf{s}_b) \backslash s_j$. This procedure has drawbacks for models that involve many deterministic factors: the convenient model modularity and the resulting distributed compatibility are lost when large groups of factors are compacted into model-specific composite factors. We propose here a third strategy.

**Theorem 8.** *Let $f_a(\mathbf{s}_a) = \delta(h_a(\mathbf{s}_a))$, with $h_a(\mathbf{s}_a) = s_j - g_a(\mathbf{s}_{a\backslash j})$, and node-local belief $q_a(\mathbf{s}_a) = q_{j|a}(s_j|\mathbf{s}_{a\backslash j})\, q_{a\backslash j}(\mathbf{s}_{a\backslash j})$. Then, the node-local free energy evaluates to*

$$F[q_a, f_a] = \begin{cases} -H[q_{a\backslash j}] & \text{if } q_{j|a}(s_j|\mathbf{s}_{a\backslash j}) = \delta(s_j - g_a(\mathbf{s}_{a\backslash j})) \\ \infty & \text{otherwise.} \end{cases}$$

**Proof.** See Appendix D.16.

An example that evaluates the node-local free energy for a non-trivial deterministic node can be found in Appendix C.

The equality node is a special case of a deterministic node, with a node function of the form (3). The argument of Theorem 8 does not directly apply to this node: as the equality node function comprises two Dirac-delta functions, it cannot be written in the form of Theorem 8. However, we can still reduce the node-local free energy contribution.

**Theorem 9.** *Let $f_a(\mathbf{s}_a) = \delta(s_j - s_i)\, \delta(s_j - s_k)$, with node-local belief $q_a(\mathbf{s}_a) = q_{ik|j}(s_i, s_k|s_j)\, q_j(s_j)$. Then, the node-local free energy evaluates to*

$$F[q_a, f_a] = \begin{cases} -H[q_j] & \text{if } q_{ik|j}(s_i, s_k|s_j) = \delta(s_j - s_i)\,\delta(s_j - s_k) \\ \infty & \text{otherwise.} \end{cases}$$

**Proof.** See Appendix D.17.

#### *5.3. Evaluating the Variational Free Energy*

We propose here an algorithm that evaluates the BFE on a TFFG representation of a factorized model. The algorithm is based on the following results:


The decomposition of (7) shows that the BFE can be computed by an iteration over the nodes and edges of the graph. As some contributions to the BFE might cancel each other, the algorithm first tracks counting numbers $u_a$ for the average energies

$$U_a[q_a] = -\int q_a(\mathbf{s}_a) \log f_a(\mathbf{s}_a) \, \mathrm{d}\mathbf{s}_a \,,$$

and counting numbers $h_k$ for the (joint) entropies

$$H[q_k] = -\int q_k(\mathbf{s}_k) \log q_k(\mathbf{s}_k) \, \mathrm{d}\mathbf{s}_k \,,$$

which are ultimately combined and evaluated. We use an index $k$ to indicate that the entropy computation may include not only the edges but a generic set of variables. We will define the set that $k$ belongs to in Algorithm 1.
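As a sanity check of this local decomposition, consider a toy discrete chain $f_a(x_1)\,f_b(x_1,x_2)\,f_c(x_2)$ (our own example, not from the paper). On a tree, the BFE evaluated at the exact marginals equals $-\log Z$, and it is assembled purely from local average energies and counted entropies (here every edge has degree 2, so each edge entropy gets counting number $d_j - 1 = 1$):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Toy chain factors over binary variables (arbitrary positive values)
f_a = np.array([1.0, 2.0])
f_b = np.array([[2.0, 1.0], [1.0, 3.0]])
f_c = np.array([1.5, 0.5])

joint = f_a[:, None] * f_b * f_c[None, :]
Z = joint.sum()
p = joint / Z                              # exact joint belief
p1, p2 = p.sum(axis=1), p.sum(axis=0)      # exact edge marginals

# Local average energies U_a[q_a] = -E_q[log f_a]
U = (-(p1 * np.log(f_a)).sum()
     - (p * np.log(f_b)).sum()
     - (p2 * np.log(f_c)).sum())
# Node entropies enter with counting number +1, edge entropies with -(d_j - 1)
H_nodes = entropy(p1) + entropy(p) + entropy(p2)
H_edges = entropy(p1) + entropy(p2)        # both edges have degree 2

bethe_free_energy = U - H_nodes + H_edges  # equals -log Z on a tree
```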


## **6. Implementation of Algorithms and Simulations**

We have developed a probabilistic programming toolbox *ForneyLab.jl* in the Julia language [61,62]. The majority of the algorithms reviewed in Table 1 have been implemented in ForneyLab, along with a variety of demos (https://github.com/biaslab/ForneyLab.jl/tree/master/demo, accessed on 23 June 2021). ForneyLab is extendable and supports postulating new local constraints of the BFE for the creation of custom message passing-based inference algorithms.

In order to limit the length of this paper, we refer the reader to the demonstration folder of ForneyLab and to several of our previous papers with code. For instance, our previous work in [63] implemented a mean-field variational Laplace propagation for the hierarchical Gaussian filter (HGF) [64]. In the follow-up work [65], inference results were improved by changing to a structured factorization and moment-matching local constraints. In that case, the modification of local constraints created a hybrid EP-VMP algorithm that better suited the model. Moreover, in [13], we formulated the idea of *chance constraints* in the form of violation probabilities, leading to a new message passing algorithm that supports goal-directed behavior within the context of active inference. A similar line of reasoning led to improved inference procedures for auto-regressive models [66].

## **7. Related Work**

Our work is inspired by the seminal work [17], which discusses the equivalence between the fixed points of the belief propagation algorithm [32] and the stationary points of the Bethe free energy. This equivalence is established through a Lagrangian formalism, which allows for the derivation of Generalized Belief Propagation (GBP) algorithms by introducing region-based graphs and the region-based (Kikuchi) free energy [16].

Region graph-based methods allow for overlapping clusters (Section 4.1) and thus offer a more generic message passing approach. The selection of appropriate regions (clusters), however, proves to be difficult, and the resulting algorithms may grow prohibitively complex. In this context, Ref. [67] addresses how to manipulate regions and manage the complexity of GBP algorithms. Furthermore, Ref. [68] establishes a connection between GBP and expectation propagation (EP) by introducing structured region graphs.

The inspirational work of [15] derives message passing algorithms by minimization of *α*-divergences. The stationary points of *α*-divergences are obtained by a fixed point projection scheme. This projection scheme is reminiscent of the minimization scheme of the expectation propagation (EP) algorithm [18]. Compared to [15], our work focuses on a single divergence objective (namely, the VFE). The work of [12] derives the EP algorithm by manipulating the marginalization and factorization constraints of the Bethe free energy objective (see also Section 4.2.3). The EP algorithm is, however, not guaranteed to converge to a minimum of the associated divergence metric.

To address the convergence properties of the algorithms that are obtained by region graph methods, the outstanding work of [33] derives conditions on the region counting numbers that guarantee the convexity of the underlying objective. In general, however, the constrained Bethe free energy is not guaranteed to be convex and therefore the derived message passing updates are not guaranteed to converge.

#### **8. Discussion**

The key message in this paper is that a (variational) Bayesian model designer may tune the tractability-accuracy trade-off for evidence and posterior evaluation through constraint manipulation. It is interesting to note that the technique to derive message passing algorithms is always the same. We followed the recipe pioneered in [15] to derive a large variety of message passing algorithms solely through minimizing constrained Bethe free energy. This minimization leads to local fixed-point equations, which we can interpret as message passing updates on a (terminated) FFG. The presented lemmas showed how the constraints affect the Lagrangians locally. The presented theorems determined the stationary solutions of the Lagrangians and obtained the message passing equations. Thus, if a designer proposes a new set of constraints, then the first place to start is to analyze the effect on the Lagrangian. Once the effect of the constraint on the Lagrangian is known, then variational optimization may result in stationary solutions that can be obtained by a fixed-point iteration scheme.

In this paper, we selected the Forney-style factor graph framework to illustrate our ideas. FFGs are mathematically comparable to the more common bi-partite factor graphs that associate round nodes with variables and square nodes with factors [20]. Bi-partite factor graphs require two distinct types of message updates (one leaving variable nodes and one leaving factor nodes), while message passing on a (T)FFG requires only a single type of message update [69]. The (T)FFG paradigm thus substantially simplifies the derivations and resulting message passing update equations.

The message passing update rules in this paper are presented without guarantees on convergence of the (local) minimization process. In practice, however, algorithm convergence can be easily checked by evaluating the BFE (Algorithm 1) after each belief update.

In future work, we plan on extending the treatment of constraints to formulate sampling-based algorithms such as importance sampling and Hamiltonian Monte Carlo in a message passing framework. While introducing SVMP, we have limited the discussion to local clusters that are not overlapping. We plan to extend variational algorithms to include local clusters that are overlapping without altering the underlying free-energy objective or the graph structure [16,67].

#### **9. Conclusions**

In this paper, we formulated a message-passing approach to probabilistic inference by identifying local stationary solutions of a constrained Bethe free energy objective (Sections 3 and 4). The proposed framework constructs a graph for the generative model and specifies local constraints for variational optimization in a local polytope. The constraints are then imposed on the variational objective by a Lagrangian construct. Unconstrained optimization of the Lagrangian then leads to local expressions of stationary points, which can be obtained by iterative execution of the resulting fixed point equations, which we identify with message passing updates.

Furthermore, we presented an approach to evaluate the BFE on a (terminated) Forneystyle factor graph (Section 5). This procedure allows an algorithm designer to readily assess the performance of algorithms and models.

We have included detailed derivations of message passing updates (Appendix D) and hope that the presented formulation inspires the discovery of novel and customized message passing algorithms.

**Author Contributions:** Conceptualization: İ.Ş., T.v.d.L. and B.d.V.; methodology: İ.Ş. and T.v.d.L.; formal analysis: İ.Ş. and T.v.d.L.; investigation: İ.Ş., T.v.d.L. and B.d.V.; software: İ.Ş. and D.B.; validation: İ.Ş.; resources: İ.Ş., T.v.d.L. and B.d.V.; writing—original draft preparation: İ.Ş. and T.v.d.L.; writing—review and editing: T.v.d.L., D.B. and B.d.V.; visualizations: İ.Ş., T.v.d.L. and B.d.V.; supervision: T.v.d.L. and B.d.V.; project administration: B.d.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partly financed by GN Hearing A/S.

**Acknowledgments:** The authors would like to thank the BIASlab team members for many very interesting discussions.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:



#### **Appendix A. Free Energy Minimization by Variational Inference**

In this section, we present a pedagogical example of inductive inference. After establishing an intuition, we apply the same principles in a more general context in the subsequent sections. We follow Caticha [42,70], who showed that a constrained free energy functional can be interpreted as a principled objective for inductive reasoning; see also [71,72]. The calculus of variations offers a principled method for optimizing this free energy functional.

In this section, we assume an example model

$$f(y, \theta) = f\_y(y, \theta) \, f\_\theta(\theta) \, , \tag{A1}$$

with observed variables *y* and a single parameter *θ*. We define the (variational) free energy (VFE) as

$$F[q, f] = \iint q(y,\theta) \log \frac{q(y,\theta)}{f(y,\theta)} \,\mathrm{d}y \,\mathrm{d}\theta \,. \tag{A2}$$

The goal is to find a posterior

$$q^* = \arg\min_{q \in \mathcal{Q}} F[q, f] \tag{A3}$$

that minimizes the free energy subject to some pre-specified constraints. These constraints may include form or factorization constraints on *q* (to be discussed later) or relate to observations of a signal *y*.

As an example, assume that we obtained some measurements $y = \hat{y}$ and wish to obtain a posterior marginal belief $q^*(\theta)$ over the parameter. We can then incorporate the data in the form of a data constraint

$$\int q(\mathbf{y},\theta) \, \mathrm{d}\theta = \delta(\mathbf{y} - \hat{\mathbf{y}}) \, \mathrm{,}\tag{A4}$$

where *δ* defines a Dirac-delta. The *constrained* free energy can be rewritten by including Lagrange multipliers as

$$L[q, f] = F[q, f] + \gamma \left( \int \int q(y, \theta) \, \mathrm{d}y \, \mathrm{d}\theta - 1 \right) + \int \lambda(y) \left( \int q(y, \theta) \, \mathrm{d}\theta - \delta(y - \hat{y}) \right) \, \mathrm{d}y,\tag{A5}$$

where the first term specifies the (to be minimized) free energy objective, the second term a normalization constraint, and the third term the data constraint. Optimization of (A5) can be performed using variational calculus.

Variational calculus considers the impact of a variation in *q*p*y*, *θ*q on the Lagrangian *L*r*q*, *f*s. We define the variation as

$$
\delta q(y,\theta) \triangleq \epsilon \phi(y,\theta) \,,
$$

where $\epsilon \to 0$, and $\phi$ is a continuous and differentiable "test" function. The fundamental theorem of variational calculus states that the stationary solutions $q^*$ are obtained by setting $\delta L/\delta q = 0$, where the functional derivative $\delta L/\delta q$ is implicitly defined by [2] (Appendix D):

$$\left. \frac{\mathrm{d}L[q + \epsilon \phi, f]}{\mathrm{d}\epsilon} \right|\_{\epsilon=0} = \iint \frac{\delta L}{\delta q}(y, \theta) \, \phi(y, \theta) \, \mathrm{d}y \, \mathrm{d}\theta \,. \tag{A6}$$

Equation (A6) provides a way to derive the functional derivative through ordinary differentiation. For example, we take the Lagrangian defined by (A5) and work out the left hand side of (A6):

$$\left.\frac{\mathrm{d}L[q+\epsilon\phi, f]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} = \left.\frac{\mathrm{d}F[q+\epsilon\phi, f]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} + \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\,\gamma \iint (q+\epsilon\phi) \,\mathrm{d}y \,\mathrm{d}\theta\right|_{\epsilon=0} + \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\int \lambda(y) \int (q+\epsilon\phi) \,\mathrm{d}\theta \,\mathrm{d}y\right|_{\epsilon=0} \tag{A7a}$$

$$= \iint \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\left((q+\epsilon\phi)\log\frac{q+\epsilon\phi}{f}\right)\right|_{\epsilon=0} \mathrm{d}y \,\mathrm{d}\theta + \gamma \iint \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}(q+\epsilon\phi)\right|_{\epsilon=0} \mathrm{d}y \,\mathrm{d}\theta + \int \lambda(y) \int \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}(q+\epsilon\phi)\right|_{\epsilon=0} \mathrm{d}\theta \,\mathrm{d}y \tag{A7b}$$

$$= \iint \left[\underbrace{\log\frac{q(y,\theta)}{f(y,\theta)} + 1 + \gamma + \lambda(y)}_{\delta L[q,f]/\delta q}\right] \phi(y,\theta) \,\mathrm{d}y \,\mathrm{d}\theta \,. \tag{A7c}$$

Note that, since (A7c) has the same form as (A6), the functional derivative is easy to identify. This procedure is one of many ways to obtain functional derivatives [73].
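The identification of the functional derivative can be verified with a finite-difference check on a discretized version of the problem (our own sketch; only the free-energy term of the Lagrangian is tested, and all grid values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
h2 = (1.0 / n) ** 2                      # grid cell area for the double integral
f = rng.uniform(0.5, 2.0, size=(n, n))   # model values f(y, theta) on a grid
q = rng.uniform(0.5, 2.0, size=(n, n))   # variational values q(y, theta)
phi = rng.normal(size=(n, n))            # test function phi(y, theta)

def F(q):
    """Discretized F[q, f] = integral of q log(q / f)."""
    return np.sum(q * np.log(q / f)) * h2

# Central finite difference of dF[q + eps * phi]/d(eps) at eps = 0 ...
eps = 1e-6
numeric = (F(q + eps * phi) - F(q - eps * phi)) / (2 * eps)
# ... matches the inner product of phi with delta F / delta q = log(q/f) + 1
analytic = np.sum((np.log(q / f) + 1.0) * phi) * h2
```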

Setting $\delta L[q, f]/\delta q = 0$, we find the stationary solution as

$$q^\*(y,\theta) = \exp(-1 - \gamma - \lambda(y)) \, f(y,\theta) \tag{A8a}$$

$$q^*(y,\theta) = \frac{1}{Z} \exp(-\lambda(y)) f(y,\theta) \,, \tag{A8b}$$

with $Z = \iint \exp(-\lambda(y)) f(y,\theta) \,\mathrm{d}y \,\mathrm{d}\theta = \exp(\gamma + 1)$. In order to determine the Lagrange multipliers $\gamma$ and $\lambda(y)$, we must substitute the stationary solution (A8b) back into the constraints. The normalization constraint evaluates to

$$\frac{1}{Z} \iint \exp(-\lambda(y)) f(y, \theta) \,\mathrm{d}y \,\mathrm{d}\theta = 1. \tag{A9}$$

We find that (A9) is always satisfied, since $Z = \iint \exp(-\lambda(y)) f(y,\theta) \,\mathrm{d}y \,\mathrm{d}\theta$ by definition. Note, however, that the computation of the normalization constant still depends on the undetermined Lagrange multiplier $\lambda(y)$.

The data constraint evaluates to

$$\int q^\*(y,\theta) \,\mathrm{d}\theta = \frac{1}{Z} \exp(-\lambda(y)) \int f(y,\theta) \,\mathrm{d}\theta = \delta(y-\hat{y}) \tag{A10}$$

which can be rewritten as

$$\frac{\exp(-\lambda(y))}{Z} = \frac{\delta(y - \hat{y})}{\int f(y, \theta) \, \mathrm{d}\theta} \,. \tag{A11}$$

Equation (A11) shows that $\lambda(y)$ can satisfy this constraint only if it is proportional to $\delta(y - \hat{y})$. Indeed, substitution of (A11) into (A8b) gives

$$q^*(y,\theta) = \frac{f(y,\theta)}{\int f(y,\theta) \,\mathrm{d}\theta}\, \delta(y-\hat{y}) \,,$$

and the posterior for the parameters evaluates to

$$\begin{split} q^*(\theta) &= \int q^*(y,\theta) \,\mathrm{d}y \\ &= \int \frac{f(y,\theta)}{\int f(y,\theta) \,\mathrm{d}\theta}\, \delta(y-\hat{y}) \,\mathrm{d}y \\ &= \frac{f(\hat{y},\theta)}{\int f(\hat{y},\theta) \,\mathrm{d}\theta} \\ &= \frac{f_y(\hat{y},\theta)\, f_\theta(\theta)}{\int f_y(\hat{y},\theta)\, f_\theta(\theta) \,\mathrm{d}\theta} \,, \end{split}$$

which we recognize as Bayes' rule.

Note that Bayes' rule was derived here as a special case of constrained variational free energy minimization in the presence of data constraints. This derivation of Bayes' rule may seem unnecessarily tedious, but the value of this approach to inductive inference is that the same principle applies when constraints other than data constraints are imposed on $q$.
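A discrete analogue makes the argument concrete (our own sketch; the model table is made up). With the data constraint fixing $y = \hat{y}$, the constrained free energy reduces to $\sum_\theta q(\theta) \log \big(q(\theta)/f(\hat{y}, \theta)\big)$, which is minimized by the Bayes posterior and there attains the value $-\log(\text{evidence})$:

```python
import numpy as np

# Made-up discrete model f(y, theta): rows index y, columns index theta
f = np.array([[0.10, 0.30],
              [0.25, 0.35]])
y_hat = 1  # observed value of y

def constrained_vfe(q_theta):
    """F[q, f] after imposing the data constraint q(y) = 1{y = y_hat}."""
    return np.sum(q_theta * np.log(q_theta / f[y_hat]))

posterior = f[y_hat] / f[y_hat].sum()   # Bayes' rule
evidence = f[y_hat].sum()

F_star = constrained_vfe(posterior)             # equals -log(evidence)
F_other = constrained_vfe(np.array([0.5, 0.5])) # any other q does worse
```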

#### **Appendix B. Lagrangian Optimization and the Dual Problem**

With the addition of Lagrange multipliers to the Bethe functional, the resulting Lagrangian depends both on the variational distribution $q(\mathbf{s})$ and the Lagrange multipliers $\Psi(\mathbf{s})$. Formally, the introduction of the Lagrange multipliers allows us to rewrite the constrained optimization over the local polytope as an unconstrained optimization. We follow [33] and write

$$\min_{q \in \mathcal{L}(\mathcal{G})} F[q] = \min_{q} \max_{\Psi} L[q, \Psi] \,.$$

Weak duality [74] (Chapter 5) then states that

$$\min_{q} \max_{\Psi} L[q, \Psi] \geq \max_{\Psi} \min_{q} L[q, \Psi] \,.$$

The minimization with respect to *q* then yields a solution that depends on the Lagrange multipliers, as

$$q^*(\mathbf{s}; \Psi) = \arg\min_q L[q, \Psi] \,.$$

For any given $q$, the Lagrangian is concave in $\Psi$. Therefore, substituting $q^*$ into the Lagrangian, the maximization of $L[q^*, \Psi]$ yields the unique solution

$$\Psi^*(\mathbf{s}) = \arg\max_{\Psi} L[q^*, \Psi] \,.$$

Stationary solutions are then given by

$$q^*(\mathbf{s}; \Psi^*) = \arg\min_{q \in \mathcal{L}(\mathcal{G})} F[q] \,.$$

In the current paper, we consider factorized *q*'s (e.g., (8)), and consider variations with respect to the individual factors. We then need to show that the combined stationary points of the individual factors also constitute a stationary point of the total objective.

Consider a Lagrangian having multiple arguments, i.e.,

$$L[\mathfrak{q}] = L[q_1, \dots, q_n, \dots, q_N] \tag{A12}$$

$$\mathfrak{q} \triangleq [q\_1, \dots, q\_N]^\top \,. \tag{A13}$$

We want to determine the first total variation of the Lagrangian given by

$$
\delta L = L[\mathfrak{q} + \epsilon \boldsymbol{\phi}] - L[\mathfrak{q}] \,, \tag{A14}
$$

$$\boldsymbol{\phi}(\mathbf{s}) \triangleq \left[ \phi_1(\mathbf{s}), \dots, \phi_N(\mathbf{s}) \right]^\top. \tag{A15}$$

By a Taylor series expansion in $\epsilon$, we obtain [73] (A.14) and [75] (Equation (23.2))

$$L[\mathfrak{q} + \epsilon\boldsymbol{\phi}] - L[\mathfrak{q}] = \sum_{k=1}^{K} \frac{1}{k!} \left.\frac{\mathrm{d}^k}{\mathrm{d}\epsilon^k} L[\mathfrak{q} + \epsilon\boldsymbol{\phi}]\right|_{\epsilon=0} \epsilon^k + \mathcal{O}(\epsilon^{K+1}) \,. \tag{A16}$$

Omitting all terms higher than the first order, we obtain the first variation as

$$
\delta L = \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\big(L[\mathfrak{q} + \epsilon\boldsymbol{\phi}]\big)\right|_{\epsilon=0} \epsilon \,. \tag{A17}
$$

Rearranging the terms and letting $\epsilon$ vanish, we obtain the following expression

$$\lim_{\epsilon \to 0} \frac{\delta L}{\epsilon} = \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\big(L[\mathfrak{q} + \epsilon\boldsymbol{\phi}]\big)\right|_{\epsilon=0} \,. \tag{A18}$$

Let us assume that the Fréchet derivative exists [73], such that we can obtain the following integral representation (it should be noted that this integral expression is not always available for a generic Lagrangian, which is why we need to assume that the Fréchet derivative exists):

$$\left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\big(L[\mathfrak{q} + \epsilon\boldsymbol{\phi}]\big)\right|_{\epsilon=0} = \int \boldsymbol{\phi}(\mathbf{s})^\top \frac{\delta L}{\delta\mathfrak{q}} \,\mathrm{d}\mathbf{s} \,, \tag{A19}$$

where $\frac{\delta L}{\delta \mathfrak{q}}$ is the variational derivative

$$\frac{\delta L}{\delta \mathfrak{q}} = \left[\frac{\delta L}{\delta q_1}, \dots, \frac{\delta L}{\delta q_N}\right]^\top \tag{A20}$$

$$
\delta q_n = \epsilon \phi_n(\mathbf{s}) \,. \tag{A21}
$$

This means that (A19) can be written as [75] (Equation (22.5)) (here, we use a more generic Lagrangian and our notation differs from [75]; however, the expression is again motivated by a Taylor series expansion in $\epsilon$):

$$\lim_{\epsilon \to 0} \frac{\delta L}{\epsilon} = \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\big(L[\mathfrak{q} + \epsilon\boldsymbol{\phi}]\big)\right|_{\epsilon=0} = \sum_{n} \int \phi_n(\mathbf{s}) \frac{\delta L}{\delta q_n} \,\mathrm{d}\mathbf{s} \,. \tag{A22}$$

The fundamental theorem of variational calculus states that, in order for a point to be stationary, the first variation needs to vanish. For the first variation to vanish, it is sufficient that the variational derivatives vanish:

$$\frac{\delta L}{\delta q\_n} = 0 \text{ for every } n = 1, \dots, N. \tag{A23}$$

The vanishing of the individual variational derivatives thus means that the combined local stationary points also correspond to a stationary point of the total objective.

#### **Appendix C. Local Free Energy Example for a Deterministic Node**

Theorem 8 tells us how to evaluate the node-local free energy for a deterministic node. As an example, consider the node function $f_a(y, x) = \delta(y - \mathrm{sgn}(x))$, with $y \in \{-1, 1\}$ and $x \in \mathbb{R}$, as depicted in Figure A1.


**Figure A1.** Messages around a "sign" node.

Interestingly, there is information loss in this node because the "sign" mapping is not bijective. Given an incoming Bernoulli-distributed message $\mu_{ya}(y) = \mathcal{B}er(y|p)$, the backward outgoing message is derived as

$$\begin{aligned} \mu_{ax}(x) &= \int \mu_{ya}(y) \, \delta(y - \mathrm{sgn}(x)) \,\mathrm{d}y \\ &= \begin{cases} p & \text{if } x \geq 0 \\ 1 - p & \text{if } x < 0 \,. \end{cases} \end{aligned}$$

Given a Gaussian-distributed incoming message $\mu_{xa}(x) = \mathcal{N}(x|m, \vartheta)$, the resulting belief then becomes

$$\begin{split} q_x(x) &= \frac{\mu_{xa}(x)\, \mu_{ax}(x)}{\int \mu_{xa}(x)\, \mu_{ax}(x) \,\mathrm{d}x} \\ &= \begin{cases} \frac{p}{p + \Phi - 2p\Phi}\, \mathcal{N}(x|m,\vartheta) & \text{if } x \geq 0 \\ \frac{1-p}{p + \Phi - 2p\Phi}\, \mathcal{N}(x|m,\vartheta) & \text{if } x < 0 \,, \end{cases} \end{split}$$

with $\Phi = \int_{-\infty}^{0} \mathcal{N}(x|m,\vartheta) \,\mathrm{d}x$. We define a truncated Gaussian distribution as

$$\mathcal{T}(x|m,\vartheta,a,b) = \begin{cases} \frac{1}{\Phi(a,b;m,\vartheta)} \mathcal{N}(x|m,\vartheta) & \text{if } a \leq x \leq b, \\ 0 & \text{otherwise}, \end{cases}$$

with $\Phi(a,b;m,\vartheta) = \int_a^b \mathcal{N}(x|m,\vartheta) \,\mathrm{d}x$. This leads to

$$q_x(x) = \underbrace{\frac{(1-p)\,\Phi}{p + \Phi - 2p\Phi}}_{K} \mathcal{T}(x|m,\vartheta,-\infty,0) + \underbrace{\frac{p\,(1-\Phi)}{p + \Phi - 2p\Phi}}_{1-K} \mathcal{T}(x|m,\vartheta,0,\infty) \,,$$

as a truncated Gaussian mixture.
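The mixture can be checked numerically (our own verification script; parameter values are arbitrary, and we read $\mathcal{B}er(y|p)$ as $p = P(y = 1)$). The mass of $q_x$ on the negative half-line should then equal $(1-p)\Phi / (p + \Phi - 2p\Phi)$:

```python
import numpy as np
from math import erf, sqrt

m, v, p = 0.7, 1.5, 0.3                        # incoming N(x|m, v) and Ber(y|p)

xs = np.linspace(-12.0, 12.0, 400_001)          # grid covering the mass of N
dx = xs[1] - xs[0]
normal = np.exp(-0.5 * (xs - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)
mu_ax = np.where(xs >= 0, p, 1.0 - p)           # backward message through sgn
qx_unnorm = normal * mu_ax                      # unnormalized belief q_x

Phi = 0.5 * (1.0 + erf((0.0 - m) / sqrt(2.0 * v)))  # mass of N(x|m,v) below 0
mass_neg_analytic = (1.0 - p) * Phi / (p + Phi - 2.0 * p * Phi)

total = qx_unnorm.sum() * dx
mass_neg_numeric = qx_unnorm[xs < 0].sum() * dx / total
```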

The node-local free energy then evaluates to

$$\begin{split} F[q_a, f_a] &= -H[q_x] \\ &= \int_{-\infty}^{0} q_x(x) \log q_x(x) \,\mathrm{d}x + \int_{0}^{\infty} q_x(x) \log q_x(x) \,\mathrm{d}x \\ &= -K H[\mathcal{T}(m,\vartheta,-\infty,0)] + K \log K - (1-K) H[\mathcal{T}(m,\vartheta,0,\infty)] + (1-K)\log(1-K) \\ &= -K H[\mathcal{T}(m,\vartheta,-\infty,0)] - (1-K) H[\mathcal{T}(m,\vartheta,0,\infty)] - H[\mathcal{B}er(K)] \,, \end{split}$$

as a weighted sum of entropies, which can be computed analytically.

#### **Appendix D. Proofs**

*Appendix D.1. Proof of Lemma 1*

**Proof.** We apply the variation $\epsilon\phi_b$ to $q_b$ and, as discussed in Appendix A, we can identify the functional derivative $\delta L_b/\delta q_b$ through ordinary differentiation as

$$\left.\frac{\mathrm{d}L_b[q_b + \epsilon\phi_b, f_b]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} = \int \left(\log\frac{q_b(\mathbf{s}_b)}{f_b(\mathbf{s}_b)} + 1 + \psi_b - \sum_{i \in \mathcal{E}(b)} \lambda_{ib}(s_i)\right) \phi_b(\mathbf{s}_b) \,\mathrm{d}\mathbf{s}_b \,.$$

Setting the functional derivative to zero and identifying

$$
\mu_{ib}(s_i) = \exp(\lambda_{ib}(s_i)) \tag{A24}
$$

$$\psi_b = \log \int f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}(s_i) \,\mathrm{d}\mathbf{s}_b - 1 \tag{A25}$$

yields the stationary solutions (18) in terms of Lagrange multipliers that are to be determined.

#### *Appendix D.2. Proof of Lemma 2*

**Proof.** We follow the same procedure as in Appendix D.1, where we apply a variation $\epsilon\phi_j$ to $q_j$ (instead of $q_b$), and identify the functional derivative $\delta L_j/\delta q_j$ through

$$\left.\frac{\mathrm{d}L_j[q_j + \epsilon\phi_j]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} = \int \left(-\log q_j(s_j) - 1 + \psi_j + \sum_{a \in \mathcal{V}(j)} \lambda_{ja}(s_j)\right) \phi_j(s_j) \,\mathrm{d}s_j \,.$$

As the TFFG is terminated, every edge has degree two, so the set *V*(*j*) contains exactly two factors, which we denote by *fb* and *fc*. Then, setting the functional derivative to zero and identifying

$$\mu_{ja}(s_j) = \exp(\lambda_{ja}(s_j)) \tag{A26}$$

$$\psi_j = -\log \int \mu_{jb}(s_j) \mu_{jc}(s_j) \, \mathrm{d}s_j + 1 \tag{A27}$$

yields the stationary solution of (20) in terms of the Lagrange multipliers.

#### *Appendix D.3. Proof of Theorem 1*

**Proof.** The local polytope of (14) constructs the Lagrangians of (17) and (19). Substituting the stationary solutions from Lemmas 1 and 2 in the marginalization constraint,

$$q_j(s_j) = \int q_b(\mathbf{s}_b) \, \mathrm{d}\mathbf{s}_{b \backslash j}\,,$$

we obtain the following relation

$$\frac{\mu_{jb}(s_j) \mu_{jc}(s_j)}{Z_j} = \frac{1}{Z_b} \int f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j}\,,$$

where we defined the following normalization constants to ensure that the computed marginals are normalized:

$$\begin{aligned} Z_j &= \int \mu_{jb}(s_j) \mu_{jc}(s_j) \, \mathrm{d}s_j \\ Z_b &= \int f_b(\mathbf{s}_b) \prod_{i \in \mathcal{E}(b)} \mu_{ib}(s_i) \, \mathrm{d}\mathbf{s}_b \,. \end{aligned}$$

Extracting *μjb* from the integral gives

$$\frac{\mu_{jb}(s_j) \mu_{jc}(s_j)}{Z_j} = \frac{\mu_{jb}(s_j)}{Z_b} \int f_b(\mathbf{s}_b) \prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \mu_{ib}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j}\,,$$

and cancelling *μjb* on both sides then yields the condition on the functional form of the message *μjc*:

$$\mu_{jc}(s_j) = \frac{Z_j}{Z_b} \int f_b(\mathbf{s}_b) \prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \mu_{ib}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j}\,. \tag{A28}$$

We now need to show that the fixed points of (22) satisfy (A28). Let us assume that the fixed points exist, such that $\mu_{jc}^{(k)} = \mu_{jc}^{(k+1)}$ for some *k*. Then, we want to show that at the fixed points the following equality holds:

$$\mu_{jc}^{(k)}(s_j) = \frac{Z_j^{(k)}}{Z_b^{(k)}} \int f_b(\mathbf{s}_b) \prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \mu_{ib}^{(k)}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j}\,.$$

Substituting (22), we need to show that

$$
\mu_{jc}^{(k)}(s_j) = \frac{Z_j^{(k)}}{Z_b^{(k)}} \, \mu_{jc}^{(k+1)}(s_j)\,.
$$

Since $\mu_{jc}^{(k)} = \mu_{jc}^{(k+1)}$, we can rearrange

$$
\mu_{jc}^{(k)}(s_j) \left( 1 - \frac{Z_j^{(k)}}{Z_b^{(k)}} \right) = 0\,.
$$

From *Zb*, we obtain

$$\begin{split} Z\_{b}^{(k)} &= \int \mu\_{jb}^{(k)}(s\_{j}) \int f\_{b}(s\_{b}) \prod\_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \mu\_{ib}^{(k)}(s\_{i}) \, \mathrm{d}s\_{b \backslash j} \, \mathrm{d}s\_{j} \\ &= \int \mu\_{jb}^{(k)}(s\_{j}) \mu\_{jc}^{(k+1)}(s\_{j}) \mathrm{d}s\_{j} \\ &= \int \mu\_{jb}^{(k)}(s\_{j}) \mu\_{jc}^{(k)}(s\_{j}) \mathrm{d}s\_{j} \\ &= Z\_{j}^{(k)} \end{split}$$

which implies that the fixed points satisfy the desired condition. This proves that the stationary solutions to the BFE within the local polytope can be obtained as fixed points of the sum-product update equations.
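The fixed-point argument can be illustrated on a minimal discrete example. The sketch below builds a hypothetical terminated graph with two edges (*i* between a terminal factor *fd* and *fb*; *j* between *fb* and *fc*), runs one sum-product pass (sufficient on a tree), and checks that $Z_b = Z_j$, so the resulting edge marginal is normalized. All factor values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal terminated FFG (hypothetical factors), 4 discrete states per edge.
f_d = rng.random(4)          # terminal factor on s_i
f_b = rng.random((4, 4))     # pairwise factor f_b(s_i, s_j)
f_c = rng.random(4)          # terminal factor on s_j

# Sum-product messages; on a tree they converge in a single pass.
mu_ib = f_d                  # message from f_d towards f_b over edge i
mu_jb = f_c                  # message from f_c towards f_b over edge j
mu_jc = f_b.T @ mu_ib        # message from f_b towards f_c over edge j

# Normalization constants from the proof.
Z_b = np.einsum('ij,i,j->', f_b, mu_ib, mu_jb)   # node-local constant
Z_j = np.sum(mu_jb * mu_jc)                      # edge-local constant

# At a fixed point Z_b = Z_j, so the edge marginal is consistent.
assert np.isclose(Z_b, Z_j)
q_j = mu_jb * mu_jc / Z_j
assert np.isclose(q_j.sum(), 1.0)
```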

#### *Appendix D.4. Proof of Lemma 3*

**Proof.** Substituting the definition of (32), we can re-write the second term of Lagrangian (30) as

$$\begin{split} \int \left\{ \prod\_{n \in l(b)} q\_b^m(\mathbf{s}\_b^m) \right\} \log f\_b(\mathbf{s}\_b) \, \mathrm{d}\mathbf{s}\_b &= \int q\_b^m(\mathbf{s}\_b^m) \left( \int \left\{ \prod\_{\substack{n \in l(b) \\ n \neq m}} q\_b^n(\mathbf{s}\_b^n) \right\} \log f\_b(\mathbf{s}\_b) \, \mathrm{d}\mathbf{s}\_b^{(m)} \right) \, \mathrm{d}\mathbf{s}\_b^m \\ &= \int q\_b^m(\mathbf{s}\_b^m) \log \tilde{f}\_b^m(\mathbf{s}\_b^m) \, \mathrm{d}\mathbf{s}\_b^m . \end{split}$$

We apply the variation $\phi_b^m$ to $q_b^m$ and identify the functional derivative $\delta L_b^m / \delta q_b^m$ as

$$\left. \frac{\mathrm{d}L_b^m[q_b^m + \epsilon \phi_b^m]}{\mathrm{d}\epsilon} \right|_{\epsilon=0} = \int \left( \overbrace{\log \frac{q_b^m(\mathbf{s}_b^m)}{\tilde{f}_b^m(\mathbf{s}_b^m)} + 1 + \psi_b^m - \sum_{i \in m} \lambda_{ib}(s_i)}^{\delta L_b^m / \delta q_b^m} \right) \phi_b^m(\mathbf{s}_b^m) \, \mathrm{d}\mathbf{s}_b^m \,,$$

whose functional form we recognize from Appendix D.1. Setting the functional derivative to zero and again identifying $\mu_{ib}(s_i) = \exp(\lambda_{ib}(s_i))$ yields the stationary solutions of (31).

#### *Appendix D.5. Proof of Theorem 2*

**Proof.** The local polytope of (33) constructs the Lagrangians *L<sup>m</sup> <sup>b</sup>* and *Lj* as (30) and (19), respectively. We substitute the stationary solutions of Lemmas 2 and 3 in the local marginalization constraint (29b), which yields

$$q_j(s_j) = \int q_b^m(\mathbf{s}_b^m) \, \mathrm{d}\mathbf{s}_{b \backslash j}^m \,.$$

Following the structure of the proof in Appendix D.3, we obtain the following condition for the stationary solutions in terms of messages:

$$\frac{\mu_{jb}(s_j) \mu_{jc}(s_j)}{Z_j} = \frac{\mu_{jb}(s_j)}{Z_b^m} \int \tilde{f}_b^m(\mathbf{s}_b^m) \prod_{\substack{i \in m \\ i \neq j}} \mu_{ib}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j}^m$$

$$\frac{\mu_{jc}(s_j)}{Z_j} = \frac{1}{Z_b^m} \int \tilde{f}_b^m(\mathbf{s}_b^m) \prod_{\substack{i \in m \\ i \neq j}} \mu_{ib}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j}^m \,. \tag{A29}$$

Now we want to show that the fixed points of the message updates (36) satisfy (A29). Let us assume that the fixed points exist for some *k* such that $\mu_{jc}^{(k+1)} = \mu_{jc}^{(k)}$. Then, we will show that the fixed points satisfy

$$\frac{\mu_{jc}^{(k)}(s_j)}{Z_j^{(k)}} = \frac{1}{Z_b^{m,(k)}} \int \tilde{f}_b^{m,(k)}(\mathbf{s}_b^m) \prod_{\substack{i \in m \\ i \neq j}} \mu_{ib}^{(k)}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j}^m \,. \tag{A30}$$

Similar to Appendix D.3, it will suffice to show that $Z_b^{m,(k)} = Z_j^{(k)}$ at the fixed points. Rearranging the order of integration in the normalization constant $Z_b^{m,(k)}$, we obtain

$$\begin{split} Z_b^{m,(k)} &= \int \mu_{jb}^{(k)}(s_j) \int \tilde{f}_b^{m,(k)}(\mathbf{s}_b^m) \prod_{\substack{i \in m \\ i \neq j}} \mu_{ib}^{(k)}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j}^m \, \mathrm{d}s_j \\ &= \int \mu_{jb}^{(k)}(s_j) \mu_{jc}^{(k+1)}(s_j) \, \mathrm{d}s_j \\ &= \int \mu_{jb}^{(k)}(s_j) \mu_{jc}^{(k)}(s_j) \, \mathrm{d}s_j \\ &= Z_j^{(k)} \,. \end{split}$$

By the same line of reasoning as in Appendix D.3, this shows that the fixed points of the message updates (36) lead to stationary distributions of the Bethe free energy with structured factorization constraints.

#### *Appendix D.6. Proof of Corollary 1*

**Proof.** For a fully factorized local variational distribution (41), the augmented node function $\tilde{f}_b^m(\mathbf{s}_b^m)$ of (32) reduces to

$$\tilde{f}_j(s_j) = \exp\left( \int \left\{ \prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} q_i(s_i) \right\} \log f_b(\mathbf{s}_b) \, \mathrm{d}\mathbf{s}_{b \backslash j} \right). \tag{A31}$$

The message of (36) then reduces to

$$
\mu_{jc}(s_j) = \tilde{f}_j(s_j)\,,
$$

which, after substitution, recovers (43).

#### *Appendix D.7. Proof of Lemma 4*

**Proof.** When we apply the variation *φ<sup>b</sup>* to *qb* and identify the functional derivative *δLb*/*δqb*, we recover the result from Appendix D.1, which leads to a solution of the form (47).

#### *Appendix D.8. Proof of Theorem 3*

**Proof.** We construct the Lagrangian of (46), which by Lemma 4 leads to a solution of the form (47). Substituting this solution in the constraint of (45) leads to

$$\overbrace{\left[ \int f_b(\mathbf{s}_b) \prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \mu_{ib}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j} \right]}^{\mu_{jc}(s_j)} \mu_{jb}(s_j) = \delta(s_j - \hat{s}_j) \,. \tag{A32}$$

This equation is then satisfied by (50), which proves the theorem.

#### *Appendix D.9. Proof of Lemma 5*

**Proof.** The proof follows directly from Appendix D.1, with $\tilde{f}_b(\mathbf{s}_b; \hat{\mathbf{s}}_b)$ substituted for $f_b(\mathbf{s}_b)$.

#### *Appendix D.10. Proof of Theorem 4*

**Proof.** Given the result of Lemma 5, the proof follows Appendix D.3, where Laplace propagation chooses the expansion point to be the fixed point $\hat{\mathbf{s}}_b = \arg\max \log q_b(\mathbf{s}_b)$.

For all second-order fixed points of the Laplace iterations, it holds that $\hat{\mathbf{s}}_b$ is a fixed point if and only if it is a local optimum of $q_b$. The proof is then concluded by Lemma 1 in [76].

#### *Appendix D.11. Proof of Lemma 6*

**Proof.** We note that the Lagrange multiplier $\eta_{jb}$ does not depend on $s_j$ because the expectation removes all the functional dependencies on $s_j$. Furthermore, the expectations of $T_j(s_j)$ have the same dimension as the function $T_j(s_j)$. This means that the dimension of $\eta_{jb}$ needs to be compatible with that of $T_j(s_j)$ so that we can write the constraint as an inner product.

We apply the variation *φ<sup>b</sup>* to *qb* and identify the functional derivative *δLb*{*δqb*, as

$$\left. \frac{\mathrm{d}L_b[q_b + \epsilon \phi_b, f_b]}{\mathrm{d}\epsilon} \right|_{\epsilon=0} = \int \left( \overbrace{\log \frac{q_b(\mathbf{s}_b)}{f_b(\mathbf{s}_b)} + 1 + \psi_b - \sum_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \lambda_{ib}(s_i) - \eta_{jb}^\top T_j(s_j)}^{\delta L_b / \delta q_b} \right) \phi_b(\mathbf{s}_b) \, \mathrm{d}\mathbf{s}_b \,.$$

Setting the functional derivative to zero and identifying $\mu_{ib}(s_i) = \exp(\lambda_{ib}(s_i))$ for $i \neq j$ and $\mu_{jb}(s_j) = \exp\left( \eta_{jb}^\top T_j(s_j) \right)$ yields the functional form of the stationary solution as (62).

#### *Appendix D.12. Proof of Lemma 7*

**Proof.** We follow a similar procedure as in Appendix D.11 and apply the variation *φ<sup>j</sup>* to *qj*, which identifies the functional derivative *δLj*/*δqj*, as

$$\left. \frac{\mathrm{d}L_j[q_j + \epsilon \phi_j]}{\mathrm{d}\epsilon} \right|_{\epsilon=0} = \int \left( \overbrace{-\log q_j(s_j) - 1 + \psi_j + \sum_{a \in \mathcal{V}(j)} \eta_{ja}^\top T_j(s_j)}^{\delta L_j / \delta q_j} \right) \phi_j(s_j) \, \mathrm{d}s_j \,.$$

Setting the functional derivative to zero and following the same procedure as in Appendix D.2 yields (64).

#### *Appendix D.13. Proof of Theorem 5*

**Proof.** By substituting the stationary solutions given by Lemmas 6 and 7 into the moment-matching constraint (59), we obtain the following condition:

$$\int T_j(s_j) \, q_j(s_j) \, \mathrm{d}s_j = \int T_j(s_j) \, q_b(\mathbf{s}_b) \, \mathrm{d}\mathbf{s}_b$$

$$\frac{1}{Z_j(\eta_j)} \int T_j(s_j) \exp\left( [\eta_{jb} + \eta_{jc}]^\top T_j(s_j) \right) \mathrm{d}s_j = \frac{1}{\tilde{Z}_j} \int T_j(s_j) \overbrace{\exp\left( \eta_{jb}^\top T_j(s_j) \right)}^{\mu_{jb}(s_j)} \overbrace{\left[ \int f_b(\mathbf{s}_b) \prod_{\substack{i \in \mathcal{E}(b) \\ i \neq j}} \mu_{ib}(s_i) \, \mathrm{d}\mathbf{s}_{b \backslash j} \right]}^{\tilde{\mu}_{jc}(s_j)} \mathrm{d}s_j$$

$$= \int T_j(s_j) \, \tilde{q}_j(s_j) \, \mathrm{d}s_j \,,$$

where we recognize the sum-product message $\tilde{\mu}_{jc}(s_j)$, which we multiply by the incoming exponential family message $\mu_{jb}(s_j)$ and normalize to obtain $\tilde{q}_j(s_j)$. By defining $\eta_j = \eta_{jb} + \eta_{jc}$, the normalization constants are given by

$$\begin{aligned} Z_j(\eta_j) &= \int \exp\left( \eta_j^\top T_j(s_j) \right) \mathrm{d}s_j \\ \tilde{Z}_j &= \int \exp\left( \eta_{jb}^\top T_j(s_j) \right) \tilde{\mu}_{jc}(s_j) \, \mathrm{d}s_j \,. \end{aligned}$$

Computing the moments allows us to determine the exponential family parameter by solving the following equation [24] (Proposition 3.1)

$$\nabla_{\eta_j} \log Z_j(\eta_j) = \int \tilde{q}_j(s_j) \, T_j(s_j) \, \mathrm{d}s_j \,.$$

Suppose we obtain a solution to this equation, denoted by $\tilde{\eta}_j$. This allows us to approximate the sum-product message $\tilde{\mu}_{jc}(s_j)$ by an exponential family message whose parameter is given by

$$
\eta_{jc} = \tilde{\eta}_j - \eta_{jb} \,.
$$

Now let us assume that the fixed points of the sum-product iterations $\tilde{\mu}_{jc}^{(k)}(s_j) = \tilde{\mu}_{jc}^{(k+1)}(s_j)$ and of the incoming exponential family messages $\mu_{jb}^{(k)}(s_j) = \mu_{jb}^{(k+1)}(s_j)$ exist for some *k*. Then, we need to show that the existence of these fixed points implies the existence of the fixed points $\mu_{jc}^{(k+1)} = \mu_{jc}^{(k)}$.

By moment-matching, we have

$$\begin{aligned} \eta\_{j\mathfrak{c}}^{(k+1)} &= \widetilde{\eta}\_{j}^{(k+1)} - \eta\_{j\mathfrak{b}}^{(k+1)} \\ &= \widetilde{\eta}\_{j}^{(k)} - \eta\_{j\mathfrak{b}}^{(k)} \\ &= \eta\_{j\mathfrak{c}}^{(k)} \end{aligned}$$

which proves the existence of the fixed point of $\mu_{jc}$ if $\tilde{\mu}_{jc}$ and $\mu_{jb}$ have fixed points.
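The moment-matching step can be sketched in one dimension for a Gaussian family with sufficient statistics $T_j(s_j) = (s_j, s_j^2)$. The code below is an illustrative sketch, not the paper's implementation: the probit-style message standing in for $\tilde{\mu}_{jc}$ and all numeric values are assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Incoming Gaussian message mu_jb with natural parameters
# eta_jb = (m/v, -1/(2v)); a standard normal here (illustrative).
eta_jb = np.array([0.0, -0.5])

def mu_jb(s):
    return np.exp(eta_jb[0] * s + eta_jb[1] * s**2)

# A non-exponential-family sum-product message playing the role of
# mu~_jc (hypothetical probit-style factor).
def mu_jc_tilde(s):
    return norm.cdf(2.0 * s + 0.5)

# Moments of the tilted belief q~_j ∝ mu_jb * mu~_jc via quadrature.
Z = quad(lambda s: mu_jb(s) * mu_jc_tilde(s), -10, 10)[0]
m1 = quad(lambda s: s * mu_jb(s) * mu_jc_tilde(s), -10, 10)[0] / Z
m2 = quad(lambda s: s**2 * mu_jb(s) * mu_jc_tilde(s), -10, 10)[0] / Z
v = m2 - m1**2

# Gaussian natural parameters matching these moments, and the outgoing
# message parameter eta_jc = eta~_j - eta_jb.
eta_j = np.array([m1 / v, -1.0 / (2.0 * v)])
eta_jc = eta_j - eta_jb

# The matched exponential-family belief reproduces the tilted mean.
Zg = quad(lambda s: np.exp(eta_j[0]*s + eta_j[1]*s**2), -10, 10)[0]
m1_g = quad(lambda s: s * np.exp(eta_j[0]*s + eta_j[1]*s**2), -10, 10)[0] / Zg
assert abs(m1_g - m1) < 1e-6
```

This is the expectation-propagation-style division step: the matched belief's parameter minus the cavity parameter gives the approximate outgoing message.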

#### *Appendix D.14. Proof of Theorem 6*

**Proof.** The proof follows directly from substituting the Laplace-approximated factor function (53) in the naive mean-field result of Corollary 1.

#### *Appendix D.15. Proof of Theorem 7*

**Proof.** In order to obtain the optimal parameter value $\theta_j^\star$, we view the free energy as a function of *θj*. As there are two node-local free energies that depend upon *θj*, this leads to

$$\begin{split} \theta_j^\star &= \arg\min_{\theta_j} \left( F[q_b, f_b, \theta_j] + F[q_c, f_c, \theta_j] \right) \\ &= \arg\max_{\theta_j} \left( \int \left\{ \delta(s_j - \theta_j) \prod_{\substack{n \in l(b) \\ n \neq m}} q_b^n(\mathbf{s}_b^n) \right\} \log f_b(\mathbf{s}_b) \, \mathrm{d}\mathbf{s}_b + \int \left\{ \delta(s_j - \theta_j) \prod_{\substack{n \in l(c) \\ n \neq m}} q_c^n(\mathbf{s}_c^n) \right\} \log f_c(\mathbf{s}_c) \, \mathrm{d}\mathbf{s}_c \right) \\ &= \arg\max_{s_j} \left( \log \mu_{bj}(s_j) + \log \mu_{cj}(s_j) \right), \end{split}$$

where in the last step we replaced *θ<sup>j</sup>* with *sj* for convenience. Here, we recognize *μbj* and *μcj* as the structured variational updates of Theorem 2. Identification of the fixed points can then be obtained by [57] (Corollary 2). For a rigorous discussion on convergence of the EM algorithm, we refer to [77] (Corollary 32), [24] (Chapter 6) and [57] (Section 3).

#### *Appendix D.16. Proof of Theorem 8*

**Proof.** Substituting for *qa*p*sa*q, the node-local free energy becomes

$$\begin{split} F[q_a, f_a] &= \int q_a(\mathbf{s}_a) \log \frac{q_a(\mathbf{s}_a)}{f_a(\mathbf{s}_a)} \, \mathrm{d}\mathbf{s}_a \\ &= \int q_a(\mathbf{s}_a) \log \frac{q_{j|a}(s_j|\mathbf{s}_{a \backslash j})}{f_a(\mathbf{s}_a)} \, \mathrm{d}\mathbf{s}_a + \int q_a(\mathbf{s}_a) \log q_{a \backslash j}(\mathbf{s}_{a \backslash j}) \, \mathrm{d}\mathbf{s}_a \\ &= \int q_{a \backslash j}(\mathbf{s}_{a \backslash j}) q_{j|a}(s_j|\mathbf{s}_{a \backslash j}) \log \frac{q_{j|a}(s_j|\mathbf{s}_{a \backslash j})}{f_a(\mathbf{s}_a)} \, \mathrm{d}\mathbf{s}_a + \int q_{a \backslash j}(\mathbf{s}_{a \backslash j}) \log q_{a \backslash j}(\mathbf{s}_{a \backslash j}) \, \mathrm{d}\mathbf{s}_{a \backslash j} \\ &= \int q_{a \backslash j}(\mathbf{s}_{a \backslash j}) \left[ \int q_{j|a}(s_j|\mathbf{s}_{a \backslash j}) \log \frac{q_{j|a}(s_j|\mathbf{s}_{a \backslash j})}{f_a(\mathbf{s}_a)} \, \mathrm{d}s_j \right] \mathrm{d}\mathbf{s}_{a \backslash j} + \int q_{a \backslash j}(\mathbf{s}_{a \backslash j}) \log q_{a \backslash j}(\mathbf{s}_{a \backslash j}) \, \mathrm{d}\mathbf{s}_{a \backslash j} \\ &= \mathbb{E}_{q_{a \backslash j}}\Big[ D\Big[ q_{j|a} \,\|\, f_a \Big] \Big] - H[q_{a \backslash j}] \,, \end{split}$$

where the first term expresses an expected Kullback–Leibler divergence, and the second term is a negative entropy. The only possibility for the local free energy to remain finite is when $q_{j|a}(s_j|\mathbf{s}_{a \backslash j}) = f_a(\mathbf{s}_a) = \delta(s_j - g_a(\mathbf{s}_{a \backslash j}))$. We then have:

$$F[q\_{a'}f\_{a}] = \begin{cases} -H[q\_{a'\dot{j}}] & \text{if } q\_{\dot{j}|a}(s\_{\dot{j}}|\mathbf{s}\_{a'\dot{j}}) = \delta(\mathbf{s}\_{\dot{j}} - g\_{a}(\mathbf{s}\_{a'\dot{j}})),\\ \mathcal{C} & \text{otherwise.} \end{cases}$$

#### *Appendix D.17. Proof of Theorem 9*

**Proof.** The proof is similar to Appendix D.16. Substituting for *qa*p*sa*q, the node-local free energy becomes

$$\begin{split} F[q_a, f_a] &= \int q_a(\mathbf{s}_a) \log \frac{q_a(\mathbf{s}_a)}{f_a(\mathbf{s}_a)} \, \mathrm{d}\mathbf{s}_a \\ &= \int q_a(s_i, s_j, s_k) \log \frac{q_{ik|j}(s_i, s_k|s_j)}{f_a(s_i, s_j, s_k)} \, \mathrm{d}s_i \, \mathrm{d}s_j \, \mathrm{d}s_k + \int q_j(s_j) \log q_j(s_j) \, \mathrm{d}s_j \\ &= \mathbb{E}_{q_j}\Big[ D\Big[ q_{ik|j} \,\|\, f_a \Big] \Big] - H[q_j] \,. \end{split}$$

In contrast to Appendix D.16, here we have a joint belief within the divergence with a single conditioning variable. Conditioning on *sj* (or by symmetry *si* or *sk*) determines the realization of the other variables. Therefore, we have:

$$F[q_a, f_a] = \begin{cases} -H[q_j] & \text{if } q_{ik|j}(s_i, s_k|s_j) = \delta(s_j - s_i) \, \delta(s_j - s_k), \\ \infty & \text{otherwise.} \end{cases}$$

#### **References**


## *Article* **Understanding the Variability in Graph Data Sets through Statistical Modeling on the Stiefel Manifold**

**Clément Mantoux 1,2,3,\*, Baptiste Couvy-Duchesne 1,2, Federica Cacciamani 1,2, Stéphane Epelbaum 1,2,4, Stanley Durrleman 1,2 and Stéphanie Allassonnière 5,6**


**Abstract:** Network analysis provides a rich framework to model complex phenomena, such as human brain connectivity. It has proven efficient to understand their natural properties and design predictive models. In this paper, we study the variability within groups of networks, i.e., the structure of connection similarities and differences across a set of networks. We propose a statistical framework to model these variations based on manifold-valued latent factors. Each network adjacency matrix is decomposed as a weighted sum of matrix patterns with rank one. Each pattern is described as a random perturbation of a dictionary element. As a hierarchical statistical model, it enables the analysis of heterogeneous populations of adjacency matrices using mixtures. Our framework can also be used to infer the weight of missing edges. We estimate the parameters of the model using an Expectation-Maximization-based algorithm. Experimenting on synthetic data, we show that the algorithm is able to accurately estimate the latent structure in both low and high dimensions. We apply our model on a large data set of functional brain connectivity matrices from the UK Biobank. Our results suggest that the proposed model accurately describes the complex variability in the data set with a small number of degrees of freedom.

**Keywords:** network modeling; network variability; Stiefel manifold; MCMC-SAEM; data imputation

## **1. Introduction**

Network science is at the core of an ever-growing range of applications. Network analysis [1] aims at studying the natural properties of complex systems of interacting components or individuals through their connections. It provides a large number of tools to detect communities [2], predict unknown connections [3] and covariates [4], measure population characteristics [5,6] or build unsupervised low-dimensional representations [7]. The need to understand and model networks arises in multiple fields, such as social network analysis [8], recommender systems [9], gene interaction networks [10], neuroscience [11] or chemistry [12]. Network analysis allows accounting for very diverse phenomena in similar mathematical frameworks, which lend themselves to theoretical and statistical analysis [13]. In this paper, we are interested in groups of undirected networks that are defined on the same set of nodes. This situation describes the longitudinal evolution of a given network throughout time or the case where the nodes define a standard structure

**Citation:** Mantoux, C.; Couvy-Duchesne, B.; Cacciamani, F.; Epelbaum, S.; Durrleman, S.; Allassonnière, S. Understanding the Variability in Graph Data Sets through Statistical Modeling on the Stiefel Manifold. *Entropy* **2021**, *23*, 490. https://doi.org/10.3390/ e23040490

Academic Editor: Pierre Alquier

Received: 13 March 2021 Accepted: 14 April 2021 Published: 20 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

identical across the networks. The former is of interest in computational social science, which studies the evolution of interactions within a fixed population [14]. The latter arises naturally in neuroscience, where the connections between well-defined brain regions are studied on large groups of subjects. The analysis of brain networks is the main application of the present study. It has proven an efficient tool to discover new aspects of the anatomy and function of the human brain [15] and remains a very active research topic [16].

In this study, we are interested in the variability of undirected graph data sets, i.e., how graphs defined on a common set of nodes vary from one network to another. Accounting for this variability is a crucial issue in neuroscience: predicting neurodegenerative diseases or understanding the complex mechanisms of aging requires robust, coherent statistical frameworks that model the diversity among a population. Working on such graphs sharing the same nodes allows comparing them to one another through their adjacency matrices.

The comparison and statistical modeling of such matrices are difficult problems. If all the graphs have *n* nodes, a Gaussian model on the *n* × *n* adjacency matrices has a covariance matrix with *n*<sup>4</sup> coefficients, which is hard to interpret and difficult to estimate from a reasonable number of observations. Considering adjacency matrices as large vectors allows using classical statistical methods, such as Principal Component Analysis (PCA), but does not take advantage of the strong structures underlying the interactions between the nodes. Tailored kernel methods can be employed to evaluate distances between networks, but many theoretically interesting graph kernels require solving NP-hard problems [17]. In the field of brain network analysis, graphs are often modeled and summarized by features like the average shortest path length, which only partially characterize their structure [6]. Recent methods relying on graph neural networks often consider the nodes of the network to be permutation invariant, whereas nodes in brain networks play a specific role likely to remain stable across subjects [15,18].
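The parameter count behind this argument is quickly made explicit (the sizes below are illustrative):

```python
# A full Gaussian model on vectorized n x n adjacency matrices lives in an
# n^2-dimensional space, hence needs an n^2 x n^2 covariance matrix.
n, p = 50, 10                  # illustrative network size and encoding dimension
cov_params = (n * n) ** 2      # n^4 covariance coefficients
assert cov_params == n ** 4    # 6,250,000 coefficients for n = 50

# By contrast, p rank-one patterns with scalar weights need only O(np) numbers.
low_rank_params = n * p + p
assert low_rank_params < cov_params
```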

In this paper, we propose a generative statistical model to express the variability in undirected graph data sets. We decompose the network adjacency matrices as a weighted sum of orthonormal matrix patterns with rank one. The patterns and their weights vary around their mean values. Using rank-one patterns allows understanding each decomposition term, while using only a small number of parameters. This is comparable to PCA, where each observation is decomposed onto orthogonal elements, which in this case would be matrices. The orthogonal patterns are seen as elements of the Stiefel manifold of rectangular matrices *X* such that *X*<sup>⊤</sup>*X* is the identity matrix [19]. This model allows us to use known distributions and perform a statistical estimation of the mean patterns and weights. We use a restricted number of patterns to get a robust model, which captures the main structures and their variations. This low-dimensional parametric representation provides a simple interpretation of the structure and the variability of the distribution. Our model accounts for two sources of variability: the perturbations of the patterns and their weights. In contrast, current approaches in the literature only consider one of them, as with dictionary-based models and graph auto-encoders.
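The decomposition can be sketched with a truncated eigendecomposition: a symmetric matrix is approximated by a weighted sum of *p* rank-one patterns built from orthonormal vectors. This is only an illustration of the idea (the paper's model additionally perturbs the patterns and weights randomly); the matrix below is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical symmetric "adjacency" matrix of a weighted network.
n, p = 20, 4
B = rng.random((n, n))
A = (B + B.T) / 2

# Top-p eigenpairs give orthonormal patterns x_k and weights lambda_k.
eigvals, eigvecs = np.linalg.eigh(A)
idx = np.argsort(np.abs(eigvals))[::-1][:p]
lam, X = eigvals[idx], eigvecs[:, idx]        # X lies on the Stiefel manifold
assert np.allclose(X.T @ X, np.eye(p), atol=1e-10)

# A ≈ sum_k lam_k x_k x_k^T: a weighted sum of rank-one patterns.
A_hat = (X * lam) @ X.T

# The Frobenius error equals the norm of the discarded eigenvalues.
err = np.linalg.norm(A - A_hat)
tail = np.sqrt(np.sum(np.delete(eigvals, idx) ** 2))
assert np.isclose(err, tail)
```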

The proposed framework is expressed as a generative statistical model so that it can easily be generalized to analyze heterogeneous populations. This corresponds to a mixture of several copies of the former model where each cluster has its own center and variance parameters.

In Section 2, we recall relevant literature references for network modeling and statistics on the Stiefel manifold. Section 3 defines our model and further motivates its structure. Section 4 proposes an algorithm based on Expectation-Maximization (EM) to perform Maximum Likelihood Estimation of the model parameters. In Section 5, we present numerical experiments on synthetic and real data. We use our model to predict missing links using the parameters given by the algorithm. We show how our model can be used to perform clustering on network data sets, allowing us to distinguish different modes of variability better than a classical clustering algorithm. Applying our method to the UK Biobank collection of brain functional connectivity networks, we demonstrate that our model is able to capture a complex variability with a limited number of parameters. Note

that the tools we present here could also be used on any type of network, such as the ones we mentioned above or gene interaction networks.

#### **2. Background**

#### *2.1. Statistical Modeling for Graph Data Sets*

The analysis of graph data sets is a wide area of research that overlaps with many application domains. In this section, we review the principal trends of this field that are used in statistics and machine learning.

The first category of statistical models characterizes graphs in a data set (with possibly varying numbers of nodes) by a set of features that can be compared across networks, rather than matching the nodes of one graph to those of another. These features can be, for example, the average shortest path length, the clustering coefficient, or the occurrence number of certain patterns. Two examples of such models are Exponential Random Graph Models [20] and graph kernel methods [17]. Other models are defined by a simple, interpretable generative procedure that allows testing hypotheses on complex networks. The Erdős–Rényi model [21] assumes that any two nodes are connected with equal probability. The Stochastic Block Model (SBM, [22]) extends this model and introduces communities organized in distinct clusters with simple interactions. In the limit of a large number of nodes, the same idea gives rise to the graphon model, which has also recently been used to model graph data sets [23]. Finally, recent machine learning models leverage the power of graph neural networks [24] to perform classification or regression tasks. They are used, for instance, in brain network analysis to predict whether a patient is affected by Alzheimer's disease or how the disease will evolve [25,26].

In this paper, we consider undirected graphs on a fixed given set of *n* nodes connected by weighted or binary edges. This situation arises when studying the evolution of a given network across time [27] or when considering several subjects whose networks have the same structure, for instance, brain networks and protein or gene interaction networks. This constraint allows building models based on the ideas of mean and covariance of adjacency matrices, otherwise ill-defined when the nodes change across networks. In particular, little work has been done in the literature so far on the analysis of the variability of graphs in a data set sharing a common set of nodes. Dictionary-based graph analysis models [28] and graph auto-encoders [25,29] are interesting frameworks in that regard. They allow concisely representing a network in a form that compresses the *O*(*n*<sup>2</sup>) adjacency matrix representation into a smaller space of dimension *O*(*p*) or *O*(*np*) (where *p* is the encoding dimension that characterizes the model). However, they each focus on one aspect of the variability of graph data sets, either the variations of patterns for graph auto-encoders or the variations of pattern weights for dictionary-based models. The model proposed in Section 3 builds on these ideas and accounts for both sources of variability in two latent variables that are combined to obtain the adjacency matrices. These variables are the dominant eigenvalues and the related eigenvectors.

These eigenvectors are regrouped in matrices with orthonormal columns, which makes them points on the Stiefel manifold introduced in the next section. Statistical modeling of these matrices requires taking their geometry into account with manifold-valued distributions.

#### *2.2. Models and Algorithms on the Stiefel Manifold*

#### 2.2.1. Compact Stiefel Manifolds of Orthonormal Frames

In this paper, we will be considering latent variables belonging to the compact Stiefel manifold V*n*,*p*, defined as the set of *n*-dimensional orthonormal *p*-frames (with *p* ≤ *n*): V*n*,*p* = {*X* ∈ ℝ<sup>*n*×*p*</sup> | *X*<sup>⊤</sup>*X* = *I<sub>p</sub>*}. Since an element of V*n*,*p* can be obtained by taking the first *p* columns of an orthogonal matrix, the Stiefel manifold can be seen as a quotient manifold of the orthogonal group, and thus inherits a canonical Riemannian manifold structure. A detailed and clear introduction to algorithms for optimization and geodesic path computation on the Stiefel manifold can be found in [30]. More recently,

Zimmermann [31] proposed an algorithm to compute the Riemannian logarithm associated with the canonical metric, solving the inverse problem of the geodesic computation.

#### 2.2.2. Von Mises–Fisher Distributions

Various difficulties arise when dealing with statistical distributions on Riemannian manifolds: for instance, computing the barycenter of a set of points can be a difficult, if not ill-posed, problem. The normalizing constant of a distribution is often impossible to compute analytically from its non-normalized density, so Maximum Likelihood Estimation cannot be performed by standard optimization.

Luckily, tractable distributions on the Stiefel manifolds that circumvent some of these problems have been introduced and studied over the last decades in the field of directional statistics. The most well-studied of them is the von Mises–Fisher (vMF) distribution (also called the matrix Langevin distribution in some papers), first introduced in [32], which is the one we will use in this paper. Given a matrix-valued parameter $F \in \mathbb{R}^{n \times p}$, the von Mises–Fisher distribution on the Stiefel manifold is defined by its density $p_{\mathrm{vMF}}(X) \propto \exp(\mathrm{Tr}(F^\top X))$. Written differently, if we denote by $f_1, \ldots, f_p$ the columns of $F$ and by $x_1, \ldots, x_p$ those of $X$, we have

$$p_{\mathrm{vMF}}(X) \propto \exp\left(\langle f_1, x_1 \rangle + \ldots + \langle f_p, x_p \rangle\right).$$

In this expression, each $x_i$ is drawn toward $f_i/\|f_i\|$ (up to the orthogonality constraint). The norm $\|f_i\|$ can be interpreted as a concentration parameter that determines the strength of the attraction toward $f_i/\|f_i\|$. The von Mises–Fisher distribution can be considered analogous to a Euclidean Gaussian distribution with a diagonal covariance matrix: the density imposes no interaction between the components of $X$, so that the only dependency between the columns is the orthogonality constraint. The equivalent of the Gaussian mode (which coincides with the Gaussian mean) is given by the following lemma:

**Lemma 1.** *The von Mises–Fisher distribution with parameter $F$ reaches its maximum density value at $X = \pi_{\mathcal{V}}(F)$, where $\pi_{\mathcal{V}}$ is an orthogonal projection onto the Stiefel manifold.*

**Proof.** From the definition of the von Mises–Fisher density, we have:

$$\begin{aligned} \operatorname{argmax}\_{X^\top X = I\_p} \operatorname{Tr}(F^\top X) &= \operatorname{argmax}\_{X^\top X = I\_p} -\frac{1}{2} \operatorname{Tr}(F^\top F) + \operatorname{Tr}(F^\top X) - \frac{1}{2} \operatorname{Tr}(X^\top X) \\ &= \operatorname{argmin}\_{X^\top X = I\_p} \frac{1}{2} \left\| F - X \right\|^2 .\end{aligned}$$

with $\|\cdot\|$ the Frobenius norm. Hence, by definition, $\pi_{\mathcal{V}}(F)$ maximizes the von Mises–Fisher density. Note that the projection onto the Stiefel manifold is not uniquely defined, as $\mathcal{V}_{n,p}$ is not convex.

The following lemma allows us to compute such a projection.

**Lemma 2.** *Let $M \in \mathbb{R}^{n \times p}$, and let $M = UDV^\top$ (with $U \in \mathbb{R}^{n \times p}$, $D \in \mathbb{R}^{p \times p}$, $V \in \mathbb{R}^{p \times p}$) be the Singular Value Decomposition of $M$. If $M$ has full rank, then $UV^\top$ is the unique projection of $M$ onto the Stiefel manifold $\mathcal{V}_{n,p}$.*

**Proof.** Let us consider the Lagrangian related to the constrained optimization problem $\pi_{\mathcal{V}}(M) \in \operatorname{argmin}_{X^\top X = I_p} \frac{1}{2}\|M - X\|^2$:

$$\mathcal{L}(X,\Lambda) = \frac{1}{2} \left\| M - X \right\|^2 - \mathrm{Tr}(\Lambda^\top (I_p - X^\top X))\,.$$

Then the Karush–Kuhn–Tucker theorem [33] shows that, if $X^*$ is a local extremum of $X \mapsto \frac{1}{2}\|X - M\|^2$ over $\mathcal{V}_{n,p}$, then there exists $\Lambda^*$ such that $\nabla_X \mathcal{L}(X^*, \Lambda^*) = 0$. This gradient writes:

$$\begin{aligned} \nabla\_X \mathcal{L}(X^\*, \Lambda^\*) &= X^\* - M + X^\*(\Lambda^\* + \Lambda^{\*\top}) \\ &= X^\*(I + \Lambda^\* + \Lambda^{\*\top}) - M = 0. \end{aligned}$$

Since $X^* \in \mathcal{V}_{n,p}$ and $M$ has full rank, the symmetric matrix $\Omega = I + \Lambda^* + \Lambda^{*\top}$ must be invertible, so that $X^* = M\Omega^{-1}$. Hence

$$I\_p = X^{\*\top}X^\* = \Omega^{-1}M^\top M \Omega^{-1} \iff \Omega^2 = M^\top M = VD^2 V^\top.$$

The matrix square roots of $M^\top M$ are exactly given by the $\Omega$'s of the form $VRV^\top$, with $R = \mathrm{Diag}(\pm D_{11}, \ldots, \pm D_{pp})$. We get $X^* = M\Omega^{-1} = UDR^{-1}V^\top$, which gives the following objective function:

$$\|M - X^*\|^2 = \|U(D - DR^{-1})V^\top\|^2 = \|D - DR^{-1}\|^2.$$

As $D$ has a positive diagonal, this function is globally minimized by $R = D$, so that the unique projection is $X^* = UV^\top$.
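Lemma 2 translates directly into a few lines of NumPy. The following sketch (illustrative, not the authors' code) computes $\pi_{\mathcal{V}}$ via the thin SVD:

```python
import numpy as np

def project_stiefel(M):
    """Project a full-rank n x p matrix onto the Stiefel manifold V_{n,p}.

    By Lemma 2, if M = U D V^T is the thin SVD of M, the unique
    projection is U V^T.
    """
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 3))          # a generic matrix is full rank a.s.
X = project_stiefel(M)

# X has orthonormal columns: X^T X = I_p.
assert np.allclose(X.T @ X, np.eye(3))
```

Note that a matrix that is already on $\mathcal{V}_{n,p}$ is its own projection, since all its singular values equal one.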

The simple, interpretable density of the von Mises–Fisher distribution comes with several important advantages. First, it allows using classical Markov Chain Monte Carlo (MCMC) methods to sample efficiently from the distribution (see Figure 1 for examples of distributions over $\mathcal{V}_{3,2}$). Next, the form of the density makes it a member of the exponential family, which is a key requirement to perform latent variable inference with the MCMC-Stochastic Approximation Expectation-Maximization algorithm (MCMC-SAEM, [34]) used in this paper. Finally, reasonably efficient algorithms exist to perform Maximum Likelihood Estimation (MLE) of the parameter $F$. This point will be further developed in Section 4.

**Figure 1.** One thousand samples of three von Mises–Fisher distributions on $\mathcal{V}_{3,2}$. The mode of the distribution is represented by two red arrows along the *x* and *z* axes, and the two vectors in each matrix by two blue points. The concentration parameters are set to $\|f_z\| = 10$ and $\|f_x\| \in \{10, 100, 500\}$ (from **left** to **right**). Samples are drawn with an adaptive Metropolis–Hastings sampler using the transition kernel described in Section 4. A stronger concentration of the *x* vector impacts the spread of the *z* vector.

#### 2.2.3. Application to Network Modeling

Statistical modeling on the Stiefel manifold has proven relevant to analyze networks. By considering the matrix of the *p* eigenvectors associated with the largest eigenvalues of an adjacency matrix as an element of $\mathcal{V}_{n,p}$, Hoff and colleagues [35–38] showed that probabilistic modeling of the eigenvector matrix on the Stiefel manifold provides a robust representation while allowing one to quantify the uncertainty of each edge and estimate the probability of missing links. In these papers, the eigenvectors follow a uniform prior distribution. In the present study, we propose to model the eigenvectors of several networks as samples of a common distribution on $\mathcal{V}_{n,p}$ concentrated around a mode.

## **3. A Latent Variable Model for Graph Data Sets**

#### *3.1. Motivation*

We model graphs in a data set by studying the eigendecomposition of their adjacency matrices. Given such a symmetric weighted adjacency matrix $A \in \mathbb{R}^{n \times n}$, the spectral theorem grants the existence of a decomposition $A = X\Lambda X^\top = \sum_{i=1}^{r} \lambda_i x_i x_i^\top$, where $r$ is the rank of $A$, and $\lambda_1 \geq \ldots \geq \lambda_r$ and $x_1, \ldots, x_r$ are the eigenvalues and the orthonormal eigenvectors of the matrix. This decomposition is unique up to the sign of the eigenvectors, as long as the non-zero eigenvalues have multiplicity one, which always holds in practice. The interest of this decomposition for graph adjacency matrices is threefold.

First, the eigendecomposition of the adjacency matrix reflects the modularity of a network, i.e., the extent to which its nodes can be divided into separate communities. For instance, in the case of the Stochastic Block Model (SBM), each node $i$ is randomly assigned to one cluster $c(i)$ among $p$ possible ones. Nodes in clusters $c$, $c'$ are connected independently with probability $P_{cc'}$. In expectation, the adjacency matrix is equal to the matrix $(P_{c(i)c(j)})_{i,j}$, which has rank at most $p$. In samples of the SBM as well as in real modular networks, the decay of the eigenvalues allows estimating the number of clusters. The eigenvectors related to non-zero eigenvalues are used to perform clustering on the nodes to retrieve their labels.
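The low-rank structure of the expected SBM adjacency matrix can be checked in a few lines of NumPy (an illustrative sketch with arbitrary cluster probabilities, not from the paper):

```python
import numpy as np

# Expected SBM adjacency matrix: A_bar[i, j] = P[c(i), c(j)].
# The connection probabilities P and labels c are illustrative.
P = np.array([[0.8, 0.1],
              [0.1, 0.6]])                 # p = 2 clusters
c = np.array([0, 0, 0, 1, 1, 1, 1])        # node labels c(i)
A_bar = P[c][:, c]                         # the matrix (P_{c(i)c(j)})

# Rank is at most p: only two non-negligible eigenvalues.
eigvals = np.linalg.eigvalsh(A_bar)
assert np.sum(np.abs(eigvals) > 1e-10) <= 2

# The top-p eigenvectors can then be used to cluster the nodes.
```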

Furthermore, this decomposition provides a natural expression of $A$ as a sum of rank-one patterns $x_i x_i^\top$. Modeling vectors as a weighted sum of patterns is at the core of dictionary learning-based and mixed-effects models, which have proven of great interest to the statistics and machine learning research communities. In the specific case of graph data sets, such a model was recently proposed by D'Souza et al. [28] in the context of brain network analysis. The authors learn a set of rank-one patterns without orthogonality constraints and estimate the adjacency matrices as weighted sums of these patterns, in order to use the weights as regression variables. However, they consider the patterns as population-level variables only. This choice prevents taking into account potential individual-level variations.

Finally, the dominant eigenvectors yield strong patterns that are likely to remain stable among the various networks in a data set, up to a certain variability. In other words, given $N$ adjacency matrices $A^{(1)}, \ldots, A^{(N)}$ and their eigendecompositions $(X^{(1)}, \Lambda^{(1)}), \ldots, (X^{(N)}, \Lambda^{(N)})$, the first columns of the $X^{(k)}$'s should remain stable among subjects (up to a column permutation and/or change of sign). On the contrary, smaller eigenvalues should be expected to correspond to eigenvectors with greater variability. The recent work of Chen et al. [39] builds on this remark to analyze the Laplacian matrices of brain networks (the Laplacian is a positive matrix that can be computed from the adjacency matrix). The authors propose to compute the $L^1$ mean of the first $p$ columns of the $X^{(k)}$'s in order to get a robust average $\overline{X}$ representative of the population. As the $X^{(k)}$'s are composed of $p$ orthonormal vectors, their average should have the same property: it ensures that the obtained matrix can be interpreted as a point that best represents the distribution. Its definition thus formulates as an optimization problem over the Stiefel manifold $\mathcal{V}_{n,p}$. The authors show that taking this geometric consideration into account leads to better results than computing a Euclidean mean.

In the next section, we introduce our statistical analysis framework. We model the perturbations of the adjacency matrix eigendecomposition to account for the variability within a network data set.

#### *3.2. Model Description*

We propose to account for the variability in a set of networks by considering the random perturbation of both the patterns (*X* variable) that compose the networks and their weight (*λ* variable). In this study, we consider each pattern *xi* (column of *X*) and each weight *λ<sup>i</sup>* to be independent of one another. This assumption, although a first approximation, leads to a tractable inference problem and interpretable results. Future works could consider interactions between the *xi*'s or the *λi*'s, as well as the dependency between both.

The model decomposition of each adjacency matrix *A*(*k*) in a data set writes

$$A^{(k)} = X^{(k)} \text{Diag}(\lambda^{(k)}) X^{(k) \top} + \varepsilon^{(k)} \tag{1}$$

with *X*(*k*) a pattern matrix, *λ*(*k*) the pattern weight vector and *ε*(*k*) the symmetric residual noise. The *X*(*k*) and *λ*(*k*) are independent unobserved variables that determine the individual-level specificity of network *k*. We model these variables as follows:

$$\begin{cases} X^{(k)} \overset{\text{i.i.d.}}{\sim} \mathrm{vMF}(F) \\ \lambda^{(k)} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma_{\lambda}^{2} I_{p}) \\ \varepsilon^{(k)} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_{\varepsilon}^{2} I_{n(n+1)/2}) . \end{cases} \tag{2}$$

The matrix $F \in \mathbb{R}^{n \times p}$ parametrizes a von Mises–Fisher distribution for the eigenvector matrix $X^{(k)}$, and the eigenvalues $\lambda^{(k)}$ follow a Gaussian distribution with mean $\mu \in \mathbb{R}^p$ and independent components with variance $\sigma_\lambda^2$. We further impose that the columns of $F$ are orthogonal: this constraint ensures that the maximum of the log-density $\langle f_1, x_1\rangle + \ldots + \langle f_p, x_p\rangle$ is reached at $\pi_{\mathcal{V}}(F) = (f_1/\|f_1\|, \ldots, f_p/\|f_p\|)$. In this model, the matrix $\pi_{\mathcal{V}}(F)$ is the mode of the distribution of patterns and plays a role similar to the mean of a Gaussian distribution. The mode of the full distribution of latent variables thus refers to $(\pi_{\mathcal{V}}(F), \mu)$. In the particular case where $F$ has orthogonal columns, the column norms of $F$ correspond to its singular values. In the remainder of the paper we call them the *concentration parameters* of the distribution. The variability of the adjacency matrices is thus fully characterized by $\sigma_\varepsilon$, $\sigma_\lambda$ and the concentration parameters. The pattern weights $\lambda^{(k)}$ are the eigenvalues of the $X^{(k)}\mathrm{Diag}(\lambda^{(k)})X^{(k)\top}$ term, and we thus call them eigenvalues even though they are not the actual spectrum of the real adjacency matrices $A^{(k)}$. Our model is summarized in Figure 2.

**Figure 2.** Graphical model for a data set of adjacency matrices $A^{(1)}, \ldots, A^{(N)}$. The variables $\pi$ and $z^{(k)}$ can be added to get a mixture model.
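As a purely illustrative sketch, the generative process of Equations (1) and (2) can be simulated as follows. The exact vMF draw (which requires the MCMC kernel of Section 4) is replaced here by a crude surrogate that perturbs $F$ and projects back onto the manifold, and all numerical values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, N = 10, 3, 5                                 # sizes are illustrative
sigma_lam, sigma_eps = 0.5, 0.1

def project_stiefel(M):
    """Projection of Lemma 2: M = U D V^T  ->  U V^T."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# F with orthogonal columns; the column norms are the concentrations.
F = np.linalg.qr(rng.normal(size=(n, p)))[0] * np.array([50.0, 30.0, 20.0])
mu = np.array([5.0, 3.0, 2.0])                     # mean pattern weights

adjacency = []
for _ in range(N):
    # Crude surrogate for X ~ vMF(F): perturb F, project onto V_{n,p}.
    X = project_stiefel(F + rng.normal(size=(n, p)))
    lam = mu + sigma_lam * rng.normal(size=p)      # lambda ~ N(mu, sigma^2 I)
    E = sigma_eps * rng.normal(size=(n, n))
    A = X @ np.diag(lam) @ X.T + (E + E.T) / 2     # Equation (1), symmetric noise
    adjacency.append(A)

assert all(np.allclose(A, A.T) for A in adjacency)
```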

Note that this model may be adapted to deal with other types of adjacency matrices. The distribution for $\lambda^{(k)}$ can be effortlessly changed to a log-normal distribution to model data sets of positive matrices like covariance matrices. Binary networks can be modeled by removing the $\varepsilon^{(k)}$ noise and adding a Bernoulli sampling step, considering $X^{(k)}\mathrm{Diag}(\lambda^{(k)})X^{(k)\top}$ as a logit. Adjacency matrices with positive coefficients are handled by adding the softplus function $x \mapsto \log(1 + e^x)$ in Equation (1). These extensions bring a wide range of possible statistical models for adjacency matrices for which the estimation procedure is the same as the one developed below.

Equation (1) theoretically requires each $A^{(k)}$ to be close to a rank-$p$ matrix. While this assumption is reasonable for well-clustered networks like samples of an SBM, some real-life networks exhibit heavy eigenvalue tails and cannot be approximated accurately using low-rank matrices. While our model should not be expected to provide a perfect fit on general network data sets, its main goal is to retrieve the principal modes of variability and their weight in an interpretable way, comparable to probabilistic Principal Component Analysis (PCA) or probabilistic Independent Component Analysis (ICA) [40]. An important difference with these methods is that our model expresses each of the $p$ components using only an $n$-dimensional vector, whereas PCA and ICA require an $n \times n$ matrix per component to model adjacency matrices.

In the case of well-clustered networks, our model can be seen as a refinement of the SBM better suited to data sets of networks. The SBM is designed to handle one single network and mainly addresses the problem of identifying the communities. In the case of network data sets, all subjects share the same node labels and the communities can be more easily identified by averaging the edge weights over the subjects. The main assumption of the SBM, namely that the connections between the nodes are independent of one another, prevents further analysis of individual-level variability. In contrast, our model can account for the impact of a node variation on its connections, as well as for pattern variations affecting the whole network. In the limit where the concentration parameters become very large and the weight variance is small, the patterns become constant and our model becomes equivalent to an SBM for networks organized in distinct clusters.

Another remark can be made on the identifiability of the model: the manifold of matrices of the form $X\mathrm{Diag}(\lambda)X^\top$ with $X \in \mathcal{V}_{n,p}$, $\lambda \in \mathbb{R}^p$ (also known as the non-compact Stiefel manifold) has a tangent space $T$ with dimension $\dim(\mathcal{V}_{n,p}) + p = np - p(p-1)/2$ at $X^{(k)}\mathrm{Diag}(\lambda^{(k)})X^{(k)\top}$. The noise $\varepsilon^{(k)}$ can be decomposed into components in $T$ and its orthogonal complement $T^\perp$ with dimension $n^2 - np + p(p-1)/2$. The component in $T$ thus induces an implicit source of variability on $X$ and $\lambda$, which depends on $\sigma_\varepsilon$. We show in the experiment section that it may lead to underestimating the concentration parameters $(\|f_1\|, \ldots, \|f_p\|)$. While aware of this phenomenon, we consider it an acceptable trade-off regarding the simple formulation of Equation (2).

## *3.3. Mixture Model*

The matrix distribution introduced in the previous section can be integrated into a mixture model to account for heterogeneous populations with a multi-modal distribution and variability. It amounts to considering $K$ clusters with, for each cluster $c$, a probability $\pi_c$ and a parameter $\theta^c = (F^c, \mu^c, \sigma_\varepsilon^c, \sigma_\lambda^c)$. The mixture model writes hierarchically:

$$\begin{cases} z^{(k)} \sim \text{Categorical}(\pi) \\ (X^{(k)} \mid z^{(k)} = c) \sim \text{vMF}(F^c) \\ (\lambda^{(k)} \mid z^{(k)} = c) \sim \mathcal{N}(\mu^c, (\sigma\_\lambda^c)^2 I\_p) \\ (A^{(k)} \mid X^{(k)}, \lambda^{(k)}, z^{(k)} = c) \sim \mathcal{N}(X^{(k)} \text{Diag}(\lambda^{(k)}) X^{(k)\top}, (\sigma\_\varepsilon^c)^2 I\_{n(n+1)/2}) . \end{cases} \tag{3}$$

We show in the next section on parameter estimation that the mixture layer only comes at a small algorithmic cost.

#### **4. A Maximum Likelihood Estimation Algorithm**

We now turn to the problem of estimating the model parameters $\theta = (F, \mu, \sigma_\lambda, \sigma_\varepsilon)$ given a set of observations $(A^{(k)})_{k=1}^N$. Let us denote $\lambda \cdot X = X\mathrm{Diag}(\lambda)X^\top$. The complete likelihood is expressed as:

$$p((A^{(k)}),(X^{(k)}),(\lambda^{(k)});\theta) = \prod_{k=1}^{N} p(A^{(k)} \mid X^{(k)}, \lambda^{(k)}; \theta)\, p(X^{(k)}; \theta)\, p(\lambda^{(k)}; \theta)$$

with

$$\begin{cases} p(A^{(k)} \mid X^{(k)}, \lambda^{(k)}; \theta) = \frac{1}{\sigma_{\varepsilon}^{n^2} (2\pi)^{n^2/2}} \exp\left[ -\frac{1}{2\sigma_{\varepsilon}^{2}} \| A^{(k)} - \lambda^{(k)} \cdot X^{(k)} \|^{2} \right] \\ p(X^{(k)}; \theta) = \frac{1}{\mathcal{C}_{n,p}(F)} \exp\left[ \mathrm{Tr}(F^\top X^{(k)}) \right] \\ p(\lambda^{(k)}; \theta) = \frac{1}{\sigma_{\lambda}^{p} (2\pi)^{p/2}} \exp\left[ -\frac{1}{2\sigma_{\lambda}^{2}} \| \lambda^{(k)} - \mu \|^{2} \right] \end{cases}$$

We compute the maximum of the observed likelihood $p((A^{(k)}); \theta)$ using the MCMC-SAEM algorithm introduced in the next section. The MLE is not unique, as a permutation or a change of sign in the columns of $X$ (together with a permutation of $\lambda$) yields the same model. This invariance can be broken by sorting the eigenvalues $\mu$ in increasing order, as long as they are sufficiently spread. However, in practice, several eigenvalues may be close, and imposing such an order hinders the convergence of the algorithm. We thus choose to leave the optimization problem unchanged and deal with the permutation invariance by adding a supplementary step to the MCMC-SAEM algorithm.

#### *4.1. Maximum Likelihood Estimation for Exponential Models with the MCMC-SAEM Algorithm*

When dealing with latent variable models, the standard tool for MLE is the Expectation-Maximization (EM) algorithm [41]. Given a general parametric model $p(y, z; \theta)$ with $y$ an observed variable and $z$ a latent variable, performing MLE amounts to maximizing $\log p(y; \theta) = \log \int p(y, z; \theta)\,\mathrm{d}z$, which is intractable in practice with classical optimization routines. The EM algorithm allows indirectly maximizing this objective by looping over two alternating steps:

1. *E-step*: Using the current value of the parameter *θt*, compute the expectation

$$Q\_t(\theta) = \mathbb{E}\_{p(z|y;\theta\_t)}[\log p(y,z;\theta)];$$

2. *M-step*: Find *θt*+<sup>1</sup> ∈ argmax*<sup>θ</sup> Qt*(*θ*).

While the EM algorithm proves efficient for simple models like mixtures of Gaussian distributions, it requires adaptation for more complicated models where the expectation in the $Q_t(\theta)$ function is intractable and the distribution $p(z \mid y; \theta_t)$ cannot be sampled from explicitly to approximate the expectation.

The Markov Chain Monte Carlo–Stochastic Approximation EM algorithm (MCMC-SAEM) developed by [34] aims at overcoming these hurdles in the case of models belonging to the curved exponential family. For such models, the log-density writes $\log p(y, z; \theta) = \langle S(y, z), \phi(\theta)\rangle + \psi(\theta)$, where $S(y, z)$ is a sufficient statistic. The $Q_t$ function then simply rewrites $Q_t(\theta) = \langle \mathbb{E}_{p(z \mid y; \theta_t)}[S(y, z)], \phi(\theta)\rangle + \psi(\theta)$. In the MCMC-SAEM algorithm, the expectation of the sufficient statistics is computed across iterations using stochastic approximation. The samples from $p(z \mid y; \theta_t)$ are drawn using an MCMC kernel $q(z \mid z_t; \theta_t)$ with invariant distribution $p(z \mid y; \theta_t)$. The procedure is recalled in Algorithm 1. Under additional assumptions on the model and the Markov kernel, the MCMC-SAEM algorithm converges toward a critical point of the initial objective $\log p(y; \theta)$ [42,43].

In the case of the model proposed in this paper, the MCMC-SAEM is well suited to the problem at hand, as we have to deal with a latent variable model. In a setting with manifold-valued latent variables, the E-step of the SAEM algorithm becomes intractable; using the MCMC-SAEM allows overcoming this hurdle. Following the outline of Algorithm 1, we need to draw samples from $p(X^{(k)}, \lambda^{(k)} \mid A^{(k)}; \theta)$ and perform the maximization step using the stochastic approximation of the sufficient statistics.

#### *4.2. E-Step with Markov Chain Monte Carlo*

#### 4.2.1. Transition Kernel

The target density $p(X^{(k)}, \lambda^{(k)} \mid A^{(k)}; \theta)$ is known up to a normalizing constant, so it is sufficient to use MCMC methods based on the Metropolis–Hastings acceptance rule [44]. The MCMC is structured as a Gibbs sampler alternating simulations of $X^{(k)}$ and $\lambda^{(k)}$ for each individual. Note that the conditional density $p(\lambda^{(k)} \mid X^{(k)}, A^{(k)}; \theta)$ is a Gaussian distribution. However, when experimenting with the MCMC-SAEM, we find that using Metropolis–Hastings-based transitions rather than sampling directly from the true conditional distribution accelerates the Markov chain convergence. This is why we perform a Metropolis–Hastings within Gibbs sampler for both variables [45]. We generate proposals for $\lambda$ with a symmetric Gaussian kernel with adaptive variance in order to reach a target acceptance rate. We also use a Metropolis–Hastings transition for $X$, with the constraint that the variable stays on the Stiefel manifold. Several techniques can be used to generate such proposals. The most natural equivalent of the symmetric random walk consists of a geodesic random walk generated by normally distributed tangent vectors. This method can be employed as the exponential map on the Stiefel manifold has a closed-form expression relying on the matrix exponential [30]. Another option is to use the curves given by the Cayley transform as in [46]: Cayley curves can be considered a fast first-order approximation of the exponential map. Finally, a more direct approach consists of making non-manifold Gaussian transitions and projecting the result back onto the manifold using Lemma 2. In our experiments, these three approaches turn out to give very similar performances, and in practice we use the last method, which is also the fastest.
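A minimal sketch of the last (projection-based) option follows. It is illustrative only: the proposal asymmetry introduced by the projection is ignored, the unnormalized log-posterior combines the Gaussian likelihood of $A$ with the vMF prior, and all step sizes are arbitrary:

```python
import numpy as np

def project_stiefel(M):
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def log_post_X(X, A, lam, F, sigma_eps):
    """Unnormalized log p(X | A, lambda): Gaussian likelihood + vMF prior."""
    resid = A - X @ np.diag(lam) @ X.T
    return -np.sum(resid**2) / (2 * sigma_eps**2) + np.trace(F.T @ X)

def mh_step_X(X, A, lam, F, sigma_eps, step, rng):
    """One projected-Gaussian Metropolis step for the X variable
    (proposal asymmetry ignored; sketch only)."""
    prop = project_stiefel(X + step * rng.normal(size=X.shape))
    log_alpha = (log_post_X(prop, A, lam, F, sigma_eps)
                 - log_post_X(X, A, lam, F, sigma_eps))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return X, False

# Toy run: the chain always stays on the Stiefel manifold.
rng = np.random.default_rng(3)
n, p = 8, 2
F = np.linalg.qr(rng.normal(size=(n, p)))[0] * 20.0
X = project_stiefel(rng.normal(size=(n, p)))
lam = np.array([4.0, 2.0])
A = X @ np.diag(lam) @ X.T + 0.05 * rng.normal(size=(n, n))
A = (A + A.T) / 2
for _ in range(100):
    X, _ = mh_step_X(X, A, lam, F, sigma_eps=0.05, step=0.05, rng=rng)
assert np.allclose(X.T @ X, np.eye(p))
```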

**Remark 1.** *Our numerical implementation offers the possibility to use the Metropolis Adjusted Langevin Algorithm (MALA) instead of Metropolis–Hastings, as the gradient of the log-likelihood can be computed explicitly. While the experiments we have presented rely on the Metropolis– Hastings kernel, which is faster overall, we find that in some cases where the dimensions n and p grow large the MALA kernel allows accelerating the convergence.*

#### 4.2.2. Permutation Invariance Problem

The non-uniqueness of the MLE translates into a practical hurdle for the convergence of the MCMC: if two eigenvalues $\mu_i$, $\mu_j$ are close, we get $(\mu_i, \mu_j) \cdot (x_i, x_j) \approx (\mu_j, \mu_i) \cdot (x_i, x_j)$. As a consequence, the distribution $p(X^{(k)}, \lambda^{(k)} \mid A^{(k)}; \theta)$ is multi-modal in $X^{(k)}$, with a dominant mode close to $\pi_{\mathcal{V}}(F)$ and other modes corresponding to column sign variations and permutations among similar eigenvalues. These modes are numerical artifacts rather than likely locations for the true value of $X^{(k)}$. Exploring them in the MCMC-SAEM hinders the global convergence: they encourage the samples to spread over the Stiefel manifold, which in turn yields a very bad estimation of $F$ by inducing a bias toward the uniform distribution.

We address the permutation invariance problem by adding a column matching step every five SAEM iterations for the first third of the SAEM iterations. This step is a greedy algorithm that aims at finding the column permutation of a sample $X^{(k)}$ that makes it closest to $M = \pi_{\mathcal{V}}(F)$. It proceeds recursively by choosing the columns $m_i$, $x_j$ with the greatest absolute correlation. The steps are summarized in Algorithm 2. The greedy permutation algorithm causes the MCMC samples to stabilize around a single mode, allowing estimation of the $F$ parameter.


**Algorithm 2:** Greedy column matching.

**input** $F \in \mathbb{R}^{n \times p}$, $X \in \mathcal{V}_{n,p}$  
Compute $M = \pi_{\mathcal{V}}(F)$, $D = (\langle m_i, x_j \rangle)_{i,j=1}^{p}$  
Let $I = J = \{1, \ldots, p\}$  
Let $\sigma = (0, \ldots, 0)$ (column order), $\eta = (0, \ldots, 0)$ (column sign)  
**for** $t \in [1, \ldots, p]$ **do**  
&emsp;Find $i_t, j_t \in \operatorname{argmax}_{i \in I, j \in J} |D_{ij}|$  
&emsp;Set $\sigma(j_t) = i_t$, $\eta(i_t) = \operatorname{sign}(D_{i_t j_t})$  
&emsp;Set $I = I \setminus \{i_t\}$, $J = J \setminus \{j_t\}$  
**end**  
**return** $\sigma$, $\eta$
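A minimal NumPy sketch of this greedy matching (illustrative, not the authors' implementation; `greedy_match` assumes `M` is already the projection of `F`):

```python
import numpy as np

def greedy_match(M, X):
    """Greedy column matching: find a signed permutation of the columns
    of X that best aligns them with the columns of M."""
    p = M.shape[1]
    D = M.T @ X                               # D[i, j] = <m_i, x_j>
    sigma = np.zeros(p, dtype=int)            # column order
    eta = np.zeros(p)                         # column signs
    I, J = set(range(p)), set(range(p))
    for _ in range(p):
        # pick the remaining pair with the largest absolute correlation
        i, j = max(((i, j) for i in I for j in J),
                   key=lambda ij: abs(D[ij]))
        sigma[j] = i
        eta[i] = np.sign(D[i, j])
        I.discard(i)
        J.discard(j)
    return sigma, eta

def apply_match(X, sigma, eta):
    """Reorder and re-sign the columns of X according to (sigma, eta)."""
    Y = np.zeros_like(X)
    for j in range(X.shape[1]):
        Y[:, sigma[j]] = eta[sigma[j]] * X[:, j]
    return Y

# Sanity check: a shuffled, sign-flipped copy of M is mapped back onto M.
rng = np.random.default_rng(4)
M = np.linalg.qr(rng.normal(size=(7, 3)))[0]
X = M[:, [2, 0, 1]] * np.array([-1.0, 1.0, -1.0])
sigma, eta = greedy_match(M, X)
assert np.allclose(apply_match(X, sigma, eta), M)
```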

#### *4.3. M-Step with Saddle-Point Approximations*

The maximization step of the MCMC-SAEM algorithm has a closed form expression, except for the parameter *F*. In this section, we recall a method to estimate *F* in a general setting and apply this method to get the optimal model parameters given sufficient statistics.

#### 4.3.1. Maximum Likelihood Estimation of Von Mises–Fisher Distributions

The main obstacle to retrieving the parameter *F* given samples *X*1, ..., *XN* is the normalizing constant of the distribution: though analytically known, it is hard to compute in practice (see Pal et al. [47] for a computation procedure when *n* = 2). Jupp and Mardia [48] proved that the MLE exists and is unique as long as *p* < *n* and *N* ≥ 2, or *p* = *n* and *N* ≥ 3. Khatri and Mardia [32], who first studied the properties of the MLE, showed the following result:

**Theorem 1** ([32])**.** *Let $X_1, \ldots, X_N$ be $N$ samples from a von Mises–Fisher distribution and $\overline{X} = \frac{1}{N}\sum_{k=1}^{N} X_k$. Let $\overline{X} = UDV^\top$ be the Singular Value Decomposition (SVD) of $\overline{X}$. Then the Maximum Likelihood Estimator can be written under the form $\hat{F} = U\mathrm{Diag}(\hat{s})V^\top$, with $\hat{s} \in \mathbb{R}^p_+$.*

Maximizing the log-likelihood of samples *X*1, ..., *XN* is thus equivalent to solving the optimization problem

$$\operatorname{argmax}_{s \in \mathbb{R}^{p}} \operatorname{Tr}\left[ V \operatorname{Diag}(s) U^\top \overline{X} \right] - \log \mathcal{C}_{n,p}(U \operatorname{Diag}(s) V^\top), \tag{4}$$

where C*n*,*p*(*F*) is the normalizing constant of the vMF distribution.
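The orientation part of Theorem 1 is directly computable; only the concentrations $\hat{s}$ require the normalizing constant. The following sketch (illustrative; the toy samples are projections of perturbed points, not exact vMF draws) recovers the mode from simulated data:

```python
import numpy as np

def project_stiefel(M):
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def vmf_mle_orientation(samples):
    """Orientation factors of the vMF MLE (Theorem 1): the sample mean is
    decomposed as Xbar = U D V^T, so the MLE has the form
    F_hat = U Diag(s_hat) V^T, with s_hat still to be found by solving (4)."""
    Xbar = np.mean(samples, axis=0)
    U, D, Vt = np.linalg.svd(Xbar, full_matrices=False)
    return U, D, Vt

# Toy check: samples concentrated around a fixed Stiefel point `base`.
rng = np.random.default_rng(5)
base = np.linalg.qr(rng.normal(size=(6, 2)))[0]
samples = [project_stiefel(base + 0.1 * rng.normal(size=(6, 2)))
           for _ in range(50)]
U, D, Vt = vmf_mle_orientation(samples)
mode = U @ Vt                      # estimated mode pi_V(F_hat)
assert np.allclose(mode, base, atol=0.15)
```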

Several methods were proposed to solve this problem: the authors of [32] provide approximate formulas when the singular values of *F* are all either very large or very small. The authors of [49] propose a method to approximate the normalizing constant, which in turn yields a surrogate objective for the MLE giving satisfactory results. Finally, in [50], a different formula is proposed, which applies when the singular values are small. When experimenting with von Mises–Fisher distributions, we found that the method proposed by [49] gives the most robust results for a wide range of singular values of *F*, even in a high-dimensional setting.

#### 4.3.2. Application to the Proposed Model

Computational details for the likelihood rearrangement are deferred to Appendix A. The model belongs to the curved exponential family, and its sufficient statistics are:

$$S(A, X, \lambda) = \begin{cases} S^1 = \frac{1}{N} \sum\_{k=1}^N X^{(k)} \\ S^2 = \frac{1}{N} \sum\_{k=1}^N \lambda^{(k)} \\ S^3 = \frac{1}{N} \sum\_{k=1}^N \left\| \lambda^{(k)} \right\|^2 \\ S^4 = \frac{1}{N} \sum\_{k=1}^N \left\| A^{(k)} - \lambda^{(k)} \cdot X^{(k)} \right\|^2. \end{cases}$$

These sufficient statistics are updated using the MCMC samples $(X_t^{(k)}, \lambda_t^{(k)})_{k=1}^N$ with the stochastic approximation $\overline{S}_{t+1} = (1 - \alpha_t)\overline{S}_t + \alpha_t S(A, X_t, \lambda_t)$. The optimization problem defined by the M-step of the SAEM algorithm gives the following results:

$$
\hat{\theta}_t = \begin{cases}
\hat{F} = \hat{F}(\overline{S}_t^1) \\
\hat{\mu} = \overline{S}_t^2 \\
\hat{\sigma}_\lambda^2 = \frac{1}{p} \left( \|\hat{\mu}\|^2 - 2\langle \hat{\mu}, \overline{S}_t^2 \rangle + \overline{S}_t^3 \right) \\
\hat{\sigma}_\varepsilon^2 = \frac{1}{n^2} \overline{S}_t^4
\end{cases}
\tag{5}
$$

where $\hat{F}(\overline{S}_t^1)$ denotes the MLE of the von Mises–Fisher distribution. As explained in the section above, the method proposed by Kume et al. [49] allows estimating the normalizing constant of general Fisher–Bingham distributions. The approximation relies on rewriting the constant to make it depend on a density that fits into the framework of saddle-point approximations [51]. We recall the main steps of the computation procedure for this approximation in Appendix A for the specific, simple case of vMF distributions.
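The stochastic approximation update of the sufficient statistics is itself a few lines; a minimal sketch (illustrative, with step sizes $\alpha_t = 1/t$, which makes the running statistic a plain average of the draws):

```python
import numpy as np

def sa_update(S_bar, S_new, alpha):
    """Stochastic approximation of the sufficient statistics:
    S_bar_{t+1} = (1 - alpha_t) * S_bar_t + alpha_t * S(A, X_t, lambda_t)."""
    return {k: (1 - alpha) * S_bar[k] + alpha * S_new[k] for k in S_bar}

# With alpha_t = 1/t, S_bar converges to the mean of the noisy draws.
rng = np.random.default_rng(6)
S_bar = {"S2": np.zeros(3)}
for t in range(1, 2001):
    draw = {"S2": np.array([5.0, 3.0, 2.0]) + 0.5 * rng.normal(size=3)}
    S_bar = sa_update(S_bar, draw, alpha=1.0 / t)
assert np.allclose(S_bar["S2"], [5.0, 3.0, 2.0], atol=0.1)
```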

In the definition of our model, we impose that the columns of $F$ are orthogonal. As recalled in Section 2.2, the MLE for the vMF mode is $M = UV^\top$, where $\overline{X} = UDV^\top$ is the SVD of the empirical arithmetic mean of the samples. Since the column norms correspond to the singular values when the columns are orthogonal, the MLE under this constraint can be sought under the form $M\mathrm{Diag}(s)$. Hence, the following optimization problem is used to estimate $F$:

$$\operatorname{argmax}_{s \in \mathbb{R}^{p}} \operatorname{Tr}[\operatorname{Diag}(s) M^\top \overline{X}] - \log \tilde{\mathcal{C}}_{n,p}(M \operatorname{Diag}(s))\,,\tag{6}$$

with $\tilde{\mathcal{C}}_{n,p}$ the approximation of the normalizing constant. We solve this optimization problem using the open-source optimization library scipy.optimize.

The complete procedure is summarized in Algorithm 3.

#### *4.4. Algorithm for the Mixture Model*

The mixture model adds a cluster label $z^{(k)}$ for each subject and a list $\pi$ of cluster probabilities. The model remains in the curved exponential family, and the MCMC-SAEM algorithm can still be used. The Gibbs sampler now also updates $z^{(k)}$: it consists of sampling from the probabilities $p(z^{(k)} \mid X^{(k)}, \lambda^{(k)}, A^{(k)}; \pi, \theta)$, which can be computed explicitly. The sufficient statistics $S_1, S_2, S_3, S_4$ are defined and stored for each cluster. The statistics of cluster $c$ are updated using only the indices $k$ such that $z^{(k)} = c$. The variable $\pi$ adds a new sufficient statistic: $S_\pi = (\#\{k \mid z^{(k)} = c\}/N)_{c=1}^K$. The corresponding MLE is $\hat{\pi} = S_\pi$.
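The statistic $S_\pi$ and its MLE reduce to normalized label counts; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def pi_mle(z, K):
    """MLE of the cluster probabilities from the labels z^(1),...,z^(N).

    S_pi[c] = #{k : z^(k) = c} / N, and the M-step simply gives pi_hat = S_pi.
    """
    S_pi = np.bincount(z, minlength=K) / len(z)
    return S_pi
```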

In our implementation, we initialize the clusters using the K-Means algorithm. We use the tempering proposed by [52] for the $z$ sampling step in order to encourage points to move between clusters at the beginning of the algorithm. The vMF parameters $F_c$ are aligned every 5 SAEM iterations using Algorithm 2 in order to allow the latent variables to move between the regions of influence of different clusters through small Metropolis–Hastings steps. The resulting algorithm is detailed in Appendix C.

#### *4.5. Numerical Implementation Details*

We initialize the algorithm by taking the first eigenvectors and eigenvalues of each adjacency matrix. Algorithm 2 is used to align the eigenvectors between samples. In order to accelerate convergence, we perform a small number of hybrid MCMC-SAEM steps at the start of the algorithm, where the MCMC step on $X$ is replaced with a gradient ascent step on the log-likelihood. These first steps move the $X^{(k)}$'s to an area of $\mathcal{V}_{n,p}$ with high posterior probability, which accelerates the convergence of the MCMC, as the $X$ variable is the slowest to evolve along the MCMC-SAEM iterations. The Riemannian gradient ascent is detailed in Appendix B.

**Algorithm 3:** Maximum Likelihood Estimation algorithm for $\theta = (F, \mu, \sigma_\varepsilon, \sigma_\lambda)$

```
Initialize θ_0, X_0, λ_0 and S_0
for t = 1 to T do
    if t ≤ T/3 and (t mod 5) = 0 then
        for k = 1 to N do
            Use Algorithm 2 to align X_t^(k) with π_V(F_t). Permute λ_t^(k) accordingly.
        end
    end
    Set X̃_0^(k) = X_t^(k) and λ̃_0^(k) = λ_t^(k)
    for ℓ = 1 to n_MCMC do
        for k = 1 to N do
            Sample X̃_ℓ^(k) from the Metropolis kernel q_X(· | X̃_{ℓ-1}^(k), λ̃_{ℓ-1}^(k); θ_t)
                targeting p(X^(k) | A^(k), λ̃_{ℓ-1}^(k); θ_t)
            Sample λ̃_ℓ^(k) from the Metropolis kernel q_λ(· | X̃_ℓ^(k), λ̃_{ℓ-1}^(k); θ_t)
                targeting p(λ^(k) | A^(k), X̃_ℓ^(k); θ_t)
        end
    end
    Set X_{t+1}^(k) = X̃_{n_MCMC}^(k) and λ_{t+1}^(k) = λ̃_{n_MCMC}^(k)
    Update the sufficient statistics S_{t+1} = (1 - α_t) S_t + α_t S(A, X_{t+1}, λ_{t+1})
    Compute μ_{t+1}, (σ_ε)_{t+1} and (σ_λ)_{t+1} using Equation (5).
    Compute F_{t+1} by solving problem (6).
end
return θ_T, (X_t, λ_t)_{t=1}^T
```

The Metropolis–Hastings transition variance is selected adaptively throughout the iterations using stochastic approximation. At SAEM step $t + 1$, the proportion of accepted Metropolis transitions is computed. The logarithm of the variance is then incremented according to the rule $\log \sigma_{MH}^{t+1} = \log \sigma_{MH}^{t} + \delta_t / (2\, t^{0.6})$, with $\delta_t = \pm 1$ depending on whether the proportion of accepted jumps should be increased or decreased.
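This adaptation rule can be sketched as follows; the target acceptance rate below is our assumption, as the text does not specify the threshold used to decide the sign of the increment:

```python
def update_mh_log_sigma(log_sigma, accept_rate, t, target=0.3):
    # Increase the proposal variance when too many jumps are accepted,
    # decrease it otherwise. The increment 1/(2 t^0.6) vanishes as t grows,
    # so the adaptation stabilizes over the SAEM iterations.
    delta = 1.0 if accept_rate > target else -1.0
    return log_sigma + delta / (2.0 * t ** 0.6)
```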

During the first half of the $T$ iterations, we set $\alpha_t = 1$ in order to minimize the impact of poor initializations. Then $\alpha_t$ decreases as $1/(t - T/2)^{0.6}$, which ensures the theoretical convergence of the algorithm.
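The step-size schedule is a direct transcription of the rule above:

```python
def saem_step_size(t, T):
    # alpha_t = 1 during the first half (burn-in), then 1/(t - T/2)^0.6.
    # An exponent in (1/2, 1] gives sum(alpha_t) = inf and sum(alpha_t^2) < inf,
    # the classical conditions for SAEM convergence.
    if t <= T / 2:
        return 1.0
    return 1.0 / (t - T / 2) ** 0.6
```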

The algorithms as well as all the experiments presented in this paper are implemented with Python 3.8.6. The package Numba [53] is used to accelerate the code. We provide a complete implementation (https://github.com/cmantoux/graph-spectral-variability, accessed on 19 April 2021), which allows reproducing the experiments on synthetic data and running the algorithm on new data sets.

#### **5. Experiments**

#### *5.1. Experiments on Synthetic Data*

#### 5.1.1. Parameter Estimation Performance

First we investigate the ability of the algorithm to retrieve the correct parameters when the data are simulated according to Equations (1) and (2). We test the case (*n* = 3, *p* = 2), referred to as low-dimensional, where *X* can be visualized in three dimensions as well as the case (*n* = 40, *p* = 20), referred to as high-dimensional.

#### Small Dimension

We choose *F* with two orthogonal columns uniformly in $\mathcal{V}_{3,2}$ with column norms (25, 10). Using these low concentration parameters makes the results simple to visualize. We set *μ* = (20, 10) and *σλ* = 2, and generate *N* = 100 matrices $A^{(k)}$ with *σε* = 0.1 and 100 other matrices with the same $X^{(k)}$'s and $\lambda^{(k)}$'s but a much stronger noise standard deviation *σε* = 4. We run the MCMC-SAEM algorithm for 100 iterations with 20 MCMC steps for each maximization step. The results are shown in Figure 3. In both cases, the mode of the vMF distribution $\pi_V(F)$ is well recovered. In the small noise case, the posterior *X* samples closely match the true *X* samples, and the estimated concentration parameters (23.7, 8.0) remain close to the ground truth. In the strong noise case, the posterior samples spread much farther around $\hat{F}$ than the true samples: the estimated concentration is (9.9, 2.8). This result highlights the remark in Section 3.2 on the bias induced by the Gaussian noise on the latent variable spread: the best *X* variable to estimate the matrix $A^{(k)}$ is pushed away from the true $X^{(k)}$ in a random direction because of the noise $\varepsilon^{(k)}$ living outside the manifold.

**Figure 3.** True latent variables *X*(*k*) and their posterior MCMC mean estimation. The red arrows represent the true *πV*(*F*) parameter and its estimate *πV*(*F*ˆ). (**a**) The true mode and samples. (**b**) Mode and samples estimates when *σε* = 0.1. (**c**) Mode and samples estimates when *σε* = 4. The columns are rearranged using Algorithm 2 to ease visualization. The latent variables are accurately estimated when the noise is small. A stronger noise causes the estimated latent variables to spread over the Stiefel manifold.

#### High Dimension

We now consider a synthetic data set of *N* = 200 samples generated from 20 latent patterns in dimension 40, with *σε* = 1 and *σλ* = 2, using concentration parameters and eigenvalues of various sizes and pairing large eigenvalues with high concentrations. We run the MCMC-SAEM algorithm for 100 iterations with 20 MCMC steps per SAEM iteration to obtain convergence. The convergence of the parameters is shown in Figure 4. For both the concentration parameters and the eigenvalues, the algorithm starts by finding the highest values, only identifying lower values progressively afterward. The lowest values are associated with patterns with low weight; hence, their recovery is naturally more difficult. As in the previous sections, the concentration parameters tend to be underestimated, indicating wider spreading around the mode vectors $f_i/\|f_i\|$ than the original latent variables. However, the ordering and orders of magnitude of the concentrations stay coherent, which, in practice, allows interpreting them and comparing them to each other. The estimate $\hat{F}$ matches the true parameter with a relative Root Mean Square Error (rRMSE) of 28%. As can be seen in Figure 5, the estimated normalized columns closely correspond to the original ones except when the concentration parameters get too small to allow for a good estimation, as explained above.

We use this example to illustrate the role of the algorithm hyperparameters on the practical convergence, namely the number of MCMC steps per SAEM iteration and the column matching step. We consider the same data set, but we initialize the MCMC-SAEM algorithm with random latent variables instead of the method described in Section 4: this worst-case initialization highlights the differences between the settings more easily. It is also closer to the case of real data sets: the MCMC and model parameters are slower to converge on real data, as the adjacency matrices are not actual samples of the theoretical model distribution. For different numbers of MCMC steps per SAEM iteration, we run the MCMC-SAEM algorithm for 200 iterations 10 times to average out the dependency on the random initialization, with and without the column matching step. Then we compute the relative RMSE of the parameters *F* and *μ* at the end of the algorithm. The rRMSE averaged over the 10 runs is shown in Figure 6. It can be seen that when the column matching step is used, increasing the number of MCMC steps at a fixed number of SAEM iterations improves the estimation. It allows accelerating the convergence, as MCMC steps are faster than the maximization step (which requires repeated vMF normalizing constant computations). However, when the number of MCMC steps gets too large, the performance improvement stagnates while the execution time increases. We find that, in practice, using between 20 and 40 MCMC steps per SAEM iteration is a good compromise in terms of convergence speed. Figure 6 also illustrates the need for the column matching step proposed in Section 4: when it is not used, the parameters hardly converge to the right values, even with a large number of MCMC steps per SAEM iteration. When the eigenvectors are permuted differently across the samples, the related eigenvalues cannot be estimated accurately, as they mix together when averaged in the maximization step. The absence of permutations also spreads the eigenvectors over the Stiefel manifold, which prevents estimating the von Mises–Fisher parameter. Since Algorithm 2 is very fast to execute, it is not a computational bottleneck. In our experiments, the number of SAEM iterations between successive column permutation steps did not have a significant impact as long as it was not too high: values between 5 and 20 produced similar results.

#### Model Selection

In all the experiments on simulated data presented in this paper, we use the correct number of columns *p*, which we assume to be known. However, when studying real data sets, classical model selection procedures like the Bayesian Information Criterion cannot be applied to our model: they require computing the complete probability of the observations $p(A \mid \theta_m) = \int_{\mathcal{V}_{n,p}} \int_{\mathbb{R}^{p_m}} p(A \mid X, \lambda, \theta_m)\, \mathrm{d}X\, \mathrm{d}\lambda$ for each model $\theta_m$. This probability cannot be computed explicitly, as it requires integrating over the Stiefel manifold, which results in intractable expressions involving the matrix hypergeometric function [49].

**Figure 4.** Convergence of the concentration parameters $(\|f_1\|, \ldots, \|f_p\|)$ (**left**) and the mean eigenvalues *μ* (**right**) over the SAEM iterations. The red lines represent the values of the parameters along the iterations. The black dotted lines represent the true values, which are grouped in batches to ease visualization. The convergence is fastest for the large eigenvalues and concentration parameters. At the start of the algorithm, the biggest changes in the parameters come from the greedy permutation performed every 5 iterations. As explained in the text, the concentration parameters are underestimated. However, they keep the right order of magnitude, which allows interpreting the output of the algorithm in practice.

**Figure 5.** Von Mises–Fisher distribution parameter *F* and its estimate $\hat{F}$. (**Top row**): the two parameters and their difference. (**Bottom row**): mode of the true distribution (given by $\pi_V(F)$), mode of the estimated distribution $\pi_V(\hat{F})$, and their difference. The images show each matrix as an array of coefficients, with pixel color corresponding to coefficient amplitude. Since the matrix columns are orthogonal, the projection just consists of normalizing the columns. The columns are sorted by decreasing concentration parameter. The normalized columns of *F* corresponding to the smallest concentration parameters are estimated with less precision.

**Figure 6.** Relative RMSE of parameters *F* and *μ* after 100 MCMC-SAEM iterations depending on the number of MCMC steps per SAEM iteration. Results are averaged over 10 experiments to reduce the variance. The shaded areas indicate the extremal values across the repeated experiments. When using the greedy permutation, the rRMSE decreases rapidly when the number of MCMC steps increases before stabilizing. On the other hand, without the permutation step, the performance stays poor for any number of MCMC steps per maximization, as the parameters cannot be estimated correctly. In this experiment only, the latent variables are initialized at random to highlight the result.

In practice, several tools can be used to choose the number of latent patterns. First, the marginal likelihood *p*(*A* | *X*, *λ*; *θ*) or the error *A* − *λ* · *X* can be used to evaluate the model expressiveness. As *p* increases, the error will naturally diminish and should be very small for *p* = *n*. As with linear models, the proportion of the variance captured by *λ* · *X* can be computed to evaluate the improvement gained by adding new patterns. The concentration parameters of the von Mises–Fisher distribution also give important information on pattern relevance: if a pattern has a very low concentration parameter, it means that the related eigenvectors are widely spread across the Stiefel manifold. Smaller concentrations are thus related to overfitting, as they do not correspond to actual patterns contributing to the data set variability. The relative importance of concentration parameters can be compared numerically with the vMF concentration obtained on samples from the uniform distribution gathered with Algorithm 2.

**Remark 2.** *In this paper, we approximate the posterior mean of MCMC samples of X*(*k*) *by projecting their arithmetic mean over the Stiefel manifold. We find this procedure a very convenient alternative to computing the Fréchet mean (i.e., the Riemannian center of mass) over the manifold for two reasons. First, computing the Fréchet mean requires an extensive use of the Riemannian logarithm. Although a recent paper [31] allows computing this logarithm, the proposed algorithm heavily relies on matrix logarithm computations and requires points to remain very close to the mean. Similar iterative algorithms to compute the mean based on other retraction and lifting maps than the Riemannian exponential and logarithm were proposed and analyzed in [54], but in our experiments, these alternatives also turn out to require samples close to the mean point, especially in high dimensions. Second, projecting the mean sample onto the Stiefel manifold amounts to computing the mode of a vMF distribution. As shown in Appendix D, the vMF distribution is symmetric around its mode, which makes this mode a summary variable similar to the Gaussian mean.*
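The projection mentioned in the remark is simply the polar factor of the SVD; a minimal sketch of this posterior-mean summary:

```python
import numpy as np

def project_stiefel(M):
    """Project a matrix onto the Stiefel manifold via its SVD.

    pi_V(M) = U V^T, where M = U D V^T: the closest matrix with
    orthonormal columns (the polar-factor projection).
    """
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# Summarize MCMC samples of X by projecting their arithmetic mean.
rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 6, 2))   # hypothetical MCMC samples of X
X_mean = project_stiefel(samples.mean(axis=0))
```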

#### 5.1.2. Missing Links Imputation

Once the parameters $\hat{\theta}$ are estimated from adjacency matrices $A_1, \ldots, A_N$, missing links can be inferred on a new adjacency matrix *A*. Suppose that only a subset Ω of the edge weights is known: the weights of the masked edges $\overline{\Omega}$ can be obtained by considering the posterior distribution $p(A_{\overline{\Omega}} \mid A_\Omega; \theta)$. This distribution is obtained as a marginal of the full posterior $p(A_{\overline{\Omega}}, X, \lambda \mid A_\Omega; \theta)$. Sampling from this distribution yields a posterior mean as well as confidence intervals for the value of missing links. In the case of binary networks, the posterior distribution gives the probability of a link existing for each masked edge. Samples are obtained by Gibbs sampling using the same method as in Section 4. We also compute the Maximum A Posteriori (MAP) estimate by performing gradient ascent on the posterior density of $(A_{\overline{\Omega}}, X, \lambda)$ given $A_\Omega$.

We generate a synthetic data set of *N* = 200 adjacency matrices with *n* = 20 nodes and *p* = 5. The noise level *σε* is chosen such that the average relative difference between the coefficients of $A^{(k)}$ and $\lambda^{(k)} \cdot X^{(k)}$ is 25%. We estimate the model parameters using the MCMC-SAEM algorithm. Then, we generate another 200 samples from the same model. We mask 16% of the edge weights, corresponding to the interactions between the last eight nodes. The posterior estimation is compared with the ground truth for one matrix in Figure 7. Both the MAP and the posterior mean estimate the masked coefficients better than the mean sample $(A_1 + \ldots + A_N)/N$, which is the base reference for missing data imputation. They achieve, respectively, 58% (±28%) and 57% (±24%) rRMSE on average, whereas the mean sample has an 85% (±10% over the data set) relative difference to the samples on average. Finally, we perform the same experiment except that we select the masked edges uniformly at random, masking 40% of the edges. This problem is easier than the former despite the larger number of hidden coefficients because the missing connections are not aligned with each other. The posterior mean and the MAP achieve, respectively, 34% (±9% over the data set) and 35% (±7%) rRMSE, against 75% (±5%) for the mean sample.
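The rRMSE figures on the masked entries can be computed as below; this is a sketch under our assumption that the rRMSE is the RMSE of the error normalized by the root mean square of the true values:

```python
import numpy as np

def masked_rrmse(A_true, A_est, mask):
    # Relative RMSE restricted to the hidden entries (mask == True).
    err = (A_est - A_true)[mask]
    return np.sqrt(np.mean(err ** 2)) / np.sqrt(np.mean(A_true[mask] ** 2))
```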

**Figure 7.** Result for missing link inference using the posterior distribution. (**a**) Ground truth input matrix *A*. (**b**) Posterior mean of the masked coefficients. (**c**) MAP estimator. (**d**) Mean of model samples for comparison. The area of masked edges is highlighted by a black square. Above each matrix is the rRMSE with the ground truth. Both the posterior mean and the MAP give a reasonable estimation for the missing weights, significantly better than the empirical mean of all adjacency matrices, which is the base reference for missing data imputation. The images show each matrix as an array of coefficients, with pixel color corresponding to coefficient amplitude.

Link prediction has been a very active research topic in network analysis for several decades, and numerous methods can be employed to address this problem depending on the setting [3,55,56]. However, the most commonly used approaches are designed to perform inference on a single network or consider the nodes as permutation invariant. In contrast, the new approach we propose allows for population-informed prediction and uncertainty quantification. It could be used in practice to compare specific connection weights of a new subject with their distribution given the other coefficients and the population parameters. This comparison provides a tool to detect anomalies in the subject's connectivity network that fall outside the standard variability.

**Remark 3.** *The error uncertainties reported in this paper refer to the variance of the estimation error across the adjacency matrices.*

#### 5.1.3. Clustering on Synthetic Data

As explained in Section 3.3, our model can be used within a mixture to account for multi-modal distributions of networks. When experimenting with the clustering version of our algorithm on data sets with distinctly separated clusters, we noticed that the algorithm provides results similar to running K-Means and estimating the parameters on each K-Means cluster separately. However, the clusters in complex populations often overlap, and the ideal case where all groups are well separated rarely occurs. In this section, we show two examples of simulated data sets where the variabilities of the clusters make them hard to distinguish using the K-Means algorithm alone.

#### Small Dimension

We test the mixture model estimation in the small-dimensional case (*n* = 3, *p* = 2), where results can be visualized. We simulate three clusters of matrices as in Section 5.1.1 with *N* = 500 samples overall. In order to make the problem difficult, we use the same mean eigenvalues for two clusters. We set the Stiefel modes of these clusters to be very close, differing mainly by their concentration parameters. We run the tempered MCMC-SAEM for 1000 iterations with a decreasing temperature profile $T_t = 1 + 50/t^{0.6}$. Once convergence is achieved, the estimated clusters are mapped to the true clusters. The eigenvalue parameters are estimated accurately with 2% rRMSE. The original and estimated von Mises–Fisher distributions are compared in Figure 8. We can see that each cluster distribution is well recovered. In particular, the overlapping distributions of clusters 1 and 2 are separated, and the higher concentration of cluster 1 is recovered in the estimation. This example also highlights the relevance of the MCMC-SAEM clustering procedure compared with its K-Means initialization: up to a label permutation, 50.4% of the labels proposed by K-Means are correct, whereas the posterior distribution $p(z^{(k)} \mid A^{(k)}; \hat{\theta})$ computed with the final MCMC samples predicts the correct answer for 79.6% of the model samples.
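The tempered z-sampling step can be sketched as follows: the class probabilities are flattened by the temperature $T_t$ before sampling, which encourages label switches early on (function and variable names are ours):

```python
import numpy as np

def sample_tempered_labels(log_probs, t, rng):
    """Tempered Gibbs step for the cluster labels z (sketch).

    log_probs: (N, K) array of log p(z = c | X, lambda, A; pi, theta).
    The temperature T_t = 1 + 50/t^0.6 flattens the class probabilities
    at the start of the algorithm and decays to 1 as t grows.
    """
    T_t = 1.0 + 50.0 / t ** 0.6
    tempered = log_probs / T_t
    tempered -= tempered.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(tempered)
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(p.shape[1], p=row) for row in p])
```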

**Figure 8.** True latent variables *X*(*k*) and their posterior mean estimation for the clustering problem. (**Top row**): the plots (**a**–**c**) represent the true vMF modes (in red), as well as the true *X*(*k*) samples (in green) in their true class. (**Bottom row**): the plots (**d**–**f**) represent the three estimated vMF central modes (in red) and the estimated *X*(*k*) in their estimated class (in blue). The cluster centers are well recovered, as well as the concentration parameters. In particular, the two first clusters, which mainly differ by their concentration parameters, are correctly separated.

#### Larger Dimension

We now test the mixture model on a synthetic data set of 500 samples in dimension (*n* = 20, *p* = 10). We generate four clusters with Stiefel modes close to one another, with equal concentration parameters. The modes mainly differ by their mean eigenvalues *μc*. The eigenvalue standard deviation *σλ* is set to be of the same order of magnitude as the means *μ*, larger than most of its coefficients. The resulting data set is hard to estimate with classical clustering: the K-Means algorithm retrieves 53.6% of correct labels at best. In contrast, running the tempered MCMC-SAEM algorithm for 1000 iterations yields 99.4% of correct labels. The algorithm achieves this result by identifying the template patterns of each cluster despite the large variation in their weights. Once these template patterns are learned, the proportion of correctly classified samples increases and the mean eigenvalues of each cluster converge to a good estimation.

#### Model Selection

Selecting the number of clusters *K* is a known problem addressed for general mixture models [57]. Although it is well understood for simple Gaussian mixture models or for low-dimensional data, other cases remain challenging. For the model proposed in this paper, likelihood-based procedures cannot be applied, as the complete likelihood is an integral over the Stiefel manifold (see Section 5.1.1). As with the selection of the parameter *p*, the concentration parameters and the reconstruction errors can be used to choose the number of clusters. Using a *K* that is too small will result in stretching the latent von Mises–Fisher distributions toward low concentration parameters and large reconstruction errors. The reconstruction error should decrease more slowly once the right number of clusters has been reached.

**Remark 4.** *The link prediction procedure described in Section 5.1.2 could also be applied in the mixture model to infer the coefficients of new networks of which the class is unknown.*

#### *5.2. Experiments on Brain Connectivity Networks*

We test our model on the UK Biobank data repository [58]. The UK Biobank is a large-scale data collection project, gathering brain imaging data on more than 37,000 subjects. In this paper, we are more specifically interested in the resting-state functional Magnetic Resonance Imaging (rs-fMRI) data. The rs-fMRI measures the variations of blood oxygenation levels (BOLD signals) across the whole brain while the subject is in a resting state, i.e., receives no stimulation. The brain is then divided into regions through a spatial ICA that maximizes the signal coherence within each region [59]. Smaller regions give more detail on the brain structure but are less consistent across individuals. Finally, the raw imaging data are processed to obtain a matrix that gathers the temporal correlations between the mean blood oxygenation levels in each region. This matrix thus represents the way brain regions activate and deactivate with one another. It is called the *functional connectivity network* of the brain, as it provides information on the role of the regions rather than their physical connections. In the UK Biobank data used in the present study, the connectivity matrices are defined on a parcellation of the brain into *n* = 21 regions. These connectivity matrices illustrate our purpose well: as shown in Figure 9, the data set exhibits a very large diversity of networks, with various patterns expressing with different weights.

#### 5.2.1. Parameter Estimation

We run our algorithm on *N* = 1000 subjects for 1000 SAEM iterations with 20 MCMC steps per SAEM iteration. Working on a restricted number of samples allows for a fast convergence toward the final values. Indeed, we noticed that, while most of the parameters stabilize relatively fast, the time to convergence of the concentration parameters grows with the number of samples. Apart from these concentration parameters, we obtained very similar results when taking all the UK Biobank subjects. In this section, we consider a decomposition into *p* = 5 patterns. In Appendix E.1, we show the results obtained by taking different values of *p*.

In Figure 10, we show the *p* normalized patterns $f_i f_i^\top / \|f_i\|^2$ obtained once the algorithm has converged. Patterns 3 and 5 have very high concentration parameters and only use a small subset of the nodes. The three other patterns have smaller concentration parameters. However, these concentrations are still high enough for the related columns of *X* to be significantly more concentrated than a uniform distribution: the average Euclidean distance between these three columns of $X^{(k)}$ and the related mode columns is 1.1 (±0.2 over the data set). Comparatively, the average distance between two points drawn uniformly on the Stiefel manifold is 2.4 (±0.2 over 10,000 uniform samples).

Figure 11 displays data set matrices $A^{(k)}$ alongside the respective posterior mean estimates of $\lambda^{(k)} \cdot X^{(k)}$. For comparison purposes, we also compute the approximation given by the projection onto the subspace of the first five PCA components of the full data set, where each component has been vectorized. The $\lambda \cdot X$ matrices capture the main structure, whereas the PCA approximation relying on the same number of base components provides a less accurate reconstitution. Quantitatively, the $\lambda \cdot X$ term has a 47% (±5% over the data set) relative distance to *A*, whereas the PCA approximation has a 92% (±12%) relative distance to *A*. The $\lambda \cdot X$ representation accounts for 60% of the total variance, whereas the corresponding PCA representation only accounts for 35%. This difference highlights the benefits of taking into account the variations of the patterns across individuals. In a classical dictionary-based representation model, the patterns do not vary among individuals. In contrast, accounting for the pattern variability only adds a small number of parameters (one per pattern) and increases the representation power.
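The proportion of total variance captured by a reconstruction can be computed as below; this is a sketch under our assumption that variance explained is one minus the residual sum of squares over the total centered sum of squares:

```python
import numpy as np

def variance_explained(A, A_hat):
    """Proportion of total variance captured by a reconstruction (sketch).

    A, A_hat: (N, n, n) arrays of adjacency matrices and their
    reconstructions (e.g., the lambda^(k) . X^(k) terms or a PCA projection).
    """
    A_flat = A.reshape(len(A), -1)
    R_flat = (A - A_hat).reshape(len(A), -1)
    total = ((A_flat - A_flat.mean(axis=0)) ** 2).sum()
    residual = (R_flat ** 2).sum()
    return 1.0 - residual / total
```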

**Figure 9.** Functional connectivity matrices (21 × 21) of 25 UK Biobank subjects. The connectivity structure changes a lot depending on the subject, with various patterns expressing with different weights. The matrices in the data set have no diagonal coefficients; hence, the diagonals are shown as zero.

**Figure 10.** Normalized rank-one connectivity patterns. Matrix *i* represents $\operatorname{sign}(\mu_i)\, f_i f_i^\top / \|f_i\|^2$. The caption above each pattern gives the related concentration parameter and mean eigenvalue. The diagonal coefficients are set to zero, as they do not correspond to values in the data set. The images show each matrix as an array of coefficients, with pixel color corresponding to coefficient amplitude.

**Figure 11.** (**a**) UK Biobank connectivity matrices for 5 subjects. (**b**) Corresponding posterior mean value of $\lambda \cdot X$ estimated by the MCMC-SAEM. (**c**) Projection of the true connectivity matrices onto the subspace of the first five PCA components. The posterior mean matrix achieves a better rRMSE than PCA by capturing the main patterns of each individual matrix. As in Figure 10, the diagonal coefficients are set to zero.

#### 5.2.2. Pattern Interpretation

Once the patterns are identified, they can be interpreted based on the function of the brain regions involved. All brain regions can be found on a web page of the UK Biobank project (https://www.fmrib.ox.ac.uk/datasets/ukbiobank/group\_means/rfMRI\_ICA\_d25\_good\_nodes.html, accessed on 19 April 2021). The regions analyzed in this section can be visualized on brain cuts in Appendix E.2.

Pattern 3 mainly represents the anti-correlation between regions 1 and 3. Region 1 comprises, among others, the inner part of the orbitofrontal cortex and the precuneus. These regions are parts of the Default Mode Network (DMN) of the brain, which is a large-scale functional brain network known to be active when the subject is at rest or mind-wandering [60]. Region 3 comprises part of the insular cortex and the post-central gyrus, which both play a role in primary sensory functions. The anti-correlation between regions 1 and 3 is a consequence of external sensations activating the sensory areas and decreasing the DMN activity. This anti-correlation is also one of the strongest coefficients in pattern 1.

Pattern 5 mainly features the dependency between nodes 2, 4, 8, 9, and 19, which are all related to visual functions. Node 2 represents the parts of the occipital and temporal lobes forming the ventral and dorsal streams: they are theorized to process the raw sensory vision and hearing to answer the questions "what?" and "where?" [61]. Region 4 features the cuneus, which is a primary visual area in the occipital lobe. Region 8 spans the whole occipital lobe, covering primary visual functions and associative functions like the recognition of color or movement. Region 9 comprises the continuation of the ventral and dorsal streams of region 2 in the parietal and medial temporal areas. Finally, region 19 represents the V1 area, which processes the primary visual information. Pattern 5, which involves these regions, has a very high concentration parameter, which means that this structure remains very stable among the subjects.

Considering that the subject's activity in the MRI scanner mainly consists of looking around and laying still, it is coherent that the most stable patterns (i.e., with highest concentration parameters) during the resting-state fMRI measurement are the activity of the vision system and the anti-correlation between the DMN and sensory areas.

Pattern 4 also shows the interaction between the visual areas 2, 4, 8, and 19. It also includes the strong correlation between nodes 9, 10, 11, 12, and 17. Regions 10, 11, and 12 are involved in motor functions. Region 10 features part of the pre-central gyrus, which is central in the motor control function, and part of the post-central gyrus, which is involved in sensory information processing. Region 11 encompasses the entire pre-central gyrus. Region 12 includes part of the motor and pre-motor cortex in the frontal lobe, the cerebellum, which plays an important role in motor control, and the insular cortex, which also contributes to motor control, for instance of the face and hands [62]. Region 17 comprises the medial face of the superior temporal gyrus and the hippocampus, which are involved in short- and long-term memory and spatial navigation.

Pattern 2 combines, to some extent, the structure contained in patterns 4 and 5. It features, among others, interactions between the visual areas and the correlation between the motor function areas.

**Remark 5.** *The results and interpretation we present in this experiment depend on the state of the subjects (in this case, a resting state) and on the brain parcellation used to define the regions. If we were to analyze another data set of subjects performing a different task, the connectivity patterns X would likely differ from their resting-state counterparts, as two different phenomena naturally require two different base dictionaries. Analyzing the pattern differences would thus provide a way to interpret the structural differences between the two settings. For instance, the role of the occipital lobe in the vision-involved patterns would likely change for tasks related to vision. However, if the brain regions are defined differently in the two experiments, the comparison can only be made in a qualitative way.*

#### 5.2.3. Link Prediction

We evaluate the relevance of our model on fMRI data by testing the missing link imputation method introduced in the previous section. First, we fit the model on *N* = 1000 subjects. Then, we take 1000 other test subjects and mask the edges corresponding to the interactions between the last nine nodes (except the diagonal coefficients, which are unknown and thus considered null). We compute the MAP estimator of the masked coefficients. For comparison purposes, we perform a linear regression to predict the masked coefficients given the visible ones. Finally, we truncate the matrix with masked coefficients to only keep the *p* dominant eigenvalues. This technique is at the core of low-rank matrix completion methods [63], and it relates naturally to the estimation derived from our model, which relies on low-rank variability. The result is shown for one sample in Figure 12. The linear model and the MAP estimator give comparable estimates, both close to the true masked coefficients. Over the 1000 test subjects, these estimators achieve on average 58% (±14% over the samples) rRMSE for the linear model and 65% (±15%) rRMSE for the MAP. Interestingly, our model uses only *np* + *p* + 2 = 112 degrees of freedom, whereas the linear prediction model has dimension 26,640 and was specifically trained for this regression task.
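The low-rank truncation baseline can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names and the masking convention are our own, and we assume the masked entries are simply set to zero before the eigendecomposition, as described above.

```python
import numpy as np

def low_rank_impute(A_masked, mask, p=5):
    """Impute the masked entries of a symmetric matrix by truncating it
    to its p dominant eigenvalues (in magnitude).

    A_masked: symmetric matrix with masked entries set to 0.
    mask:     symmetric boolean matrix, True where the entry is unobserved.
    """
    w, V = np.linalg.eigh(A_masked)
    idx = np.argsort(np.abs(w))[::-1][:p]        # p dominant eigenvalues
    A_low = (V[:, idx] * w[idx]) @ V[:, idx].T   # rank-p reconstruction
    A_imputed = A_masked.copy()
    A_imputed[mask] = A_low[mask]                # fill only the masked entries
    return A_imputed

def rrmse(est, truth):
    """Relative RMSE, as reported in the experiments."""
    return np.linalg.norm(est - truth) / np.linalg.norm(truth)
```

On a low-rank matrix with a small masked block, the truncation recovers the missing entries far better than the trivial zero prediction, whose rRMSE is 100% by definition.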

Our model captures a faithful representation of the fMRI data set and uses far fewer coefficients than other models like PCA and linear regression by accounting for the structure of the interactions between the network nodes. It provides an explanation of the network variability using simple interpretable patterns, which correspond to known specific functions and structures of the brain. The variations of these patterns and their weight allow for a representation rich enough to explain a significant proportion of the variance and impute the value of missing coefficients.

**Figure 12.** From left to right: (**a**) True connectivity matrix *A*. (**b**) MAP estimator for the masked coefficients framed in a black square. (**c**) Linear model prediction for the masked coefficients. (**d**) Rank 5 truncation of the matrix *A* with masked coefficients set to zero. (**e**) Mean of all data set matrices. Above each matrix is the rRMSE with the ground truth.

#### **6. Conclusions**

This paper introduces a new model for the analysis of undirected graph data sets. The adjacency matrices are expressed as a weighted sum of rank-one matrix patterns. The individual-level deviations from the population average translate into variations of the patterns and their weight. Sample graphs are characterized by these variations in a way similar to PCA. The form of the decomposition allows for a simple interpretation: each pattern corresponds to a matrix with rank one and is thus represented by a vector of node coefficients. The variability of this decomposition is captured within a small number of variance and concentration parameters.

We use the MCMC-SAEM algorithm to estimate the model parameters as well as the individual-level variables. The parameter of the von Mises–Fisher distributions is recovered by estimating the vMF normalizing constant, which allows retrieving both the mode and the concentration parameters. Future work could further investigate the role of the approximation error induced by the use of saddle-point approximations, comparing their performance with a recently proposed alternative method [64]. The impact of noise on the underestimation of the vMF distribution concentration also requires further analysis.

Experiments on synthetic data show that the algorithm yields good approximations of the true parameters and covers the posterior distributions of the latent variables. Our model can be used to infer the value of masked or unknown edge weights once the parameters are estimated. In practice, the posterior distribution could be compared to the real connections to detect anomalous connections that step out of the expected individual variability.

The model we introduce is a hierarchical generative statistical model, which easily extends to mixture models. We show that a mixture of decomposition models can be estimated with a similar algorithmic procedure and allows disentangling modes of variability that are indistinguishable to a traditional clustering method.

We demonstrate the relevance of the proposed approach for the modeling of functional brain networks. Using few parameters, it explains the main components of the variability. The induced posterior representation is more accurate than PCA and gives a link prediction performance similar to that of a linear model, which has a comparably simple structure but requires far more coefficients and was trained specifically for that purpose. The estimated connectivity patterns have a simple structure and lead to an interpretable representation of the functional networks. We show that our model identifies specific patterns for the visual information processing system or the motor control. The related concentration parameters allow measuring the variability of the function of the related brain regions across subjects.

This work focuses on cross-sectional network data sets, i.e., populations where each adjacency matrix belongs to a different subject and is independent of the others. Our model could also be used as a base framework for longitudinal network modeling using the tools proposed by Schiratti et al. [65]. This would consist of considering time-dependent latent variables *X* and *λ* for each subject, evolving close to a population-level reference trajectory in the latent space.

Future work could investigate the dependencies between the latent variables of the model. Correlation can be introduced between the patterns by using Fisher–Bingham distributions on the Stiefel manifold [38] and between pattern weights with full Gaussian covariance matrices. Another direction to develop is the quantification of the uncertainty: by adding prior distributions on *F* and *μ*, a Bayesian analysis would naturally provide posterior confidence regions for the model parameters [47]. Finally, our framework could be adapted to model graph Laplacian matrices instead of adjacency matrices. The analysis of the eigenvalues and eigenvectors of the graph Laplacian has proven of great theoretical [66] and practical [67] interest in network analysis. Understanding the variability of the eigendecomposition of graph Laplacians could help to design robust models relying on spectral graph theory.

**Author Contributions:** Conceptualization, C.M. and S.A.; methodology, C.M. and S.A.; software, C.M.; data curation, B.C.-D. and C.M.; validation, C.M. and S.A.; visualization, C.M. and B.C.-D.; result analysis, F.C., S.E., S.D., S.A. and C.M.; writing—original draft preparation, C.M.; writing—review and editing, C.M., S.D. and S.A.; supervision, S.D. and S.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research leading to these results has received funding from the European Research Council (ERC) under grant agreement No. 678304, European Union's Horizon 2020 research and innovation program under grant agreement No. 666992 (Euro-POND) and No. 826421 (TVB-Cloud). It was also funded in part by the program "Investissements d'avenir" ANR-10-IAIHU-06 and the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Informed consent was obtained by the UK Biobank from all subjects involved in the study.

**Data Availability Statement:** The data used in this paper come from the UK Biobank repository. The website is hosted at https://www.ukbiobank.ac.uk/ (accessed on 19 April 2021). The data are accessed upon application and cannot be made available publicly.

**Acknowledgments:** This research has been conducted using the UK Biobank resource under application 53185.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A. SAEM Maximization Step**

*Appendix A.1. Maximum Likelihood Estimates for $\mu$, $\sigma_\lambda^2$, $\sigma_\varepsilon^2$*

Up to a constant normalization term *c*, the complete log-likelihood of the model in the Gaussian case writes:

$$\begin{split} \log p((A^{(k)}), (X^{(k)}), (\lambda^{(k)}); \theta) &= \sum_{k=1}^{N} \log p(A^{(k)}, X^{(k)}, \lambda^{(k)}; \theta) \\ &= \sum_{k=1}^{N} \left[ -\frac{1}{2\sigma_{\varepsilon}^{2}} \left\| A^{(k)} - \lambda^{(k)} \cdot X^{(k)} \right\|^{2} - \frac{1}{2\sigma_{\lambda}^{2}} \left\| \lambda^{(k)} - \mu \right\|^{2} + \operatorname{Tr}(F^{\top} X^{(k)}) \right] \\ &\quad - Nn^{2} \log \sigma_{\varepsilon} - Np \log \sigma_{\lambda} - N \log \mathcal{C}_{n,p}(F) + c \\ &= N \left[ \operatorname{Tr}(F^{\top} S_{1}) + \frac{1}{\sigma_{\lambda}^{2}} \langle S_{2}, \mu \rangle - \frac{S_{3}}{2\sigma_{\lambda}^{2}} - \frac{S_{4}}{2\sigma_{\varepsilon}^{2}} + \Psi(\theta) \right] \end{split} \tag{A1}$$

with $\Psi(\theta) = -\frac{\|\mu\|^{2}}{2\sigma_{\lambda}^{2}} - n^{2} \log \sigma_{\varepsilon} - p \log \sigma_{\lambda} - \log \mathcal{C}_{n,p}(F) + c$ and

$$\begin{cases} S_{1} = \frac{1}{N} \sum_{k=1}^{N} X^{(k)} \\ S_{2} = \frac{1}{N} \sum_{k=1}^{N} \lambda^{(k)} \\ S_{3} = \frac{1}{N} \sum_{k=1}^{N} \left\| \lambda^{(k)} \right\|^{2} \\ S_{4} = \frac{1}{N} \sum_{k=1}^{N} \left\| A^{(k)} - \lambda^{(k)} \cdot X^{(k)} \right\|^{2} \end{cases}$$

The model thus belongs to the curved exponential family, and its sufficient statistics are given by $(S_1, S_2, S_3, S_4)$. Componentwise in $\mu$, $\sigma_\lambda^2$ and $\sigma_\varepsilon^2$, the log-likelihood has a single critical point, which is thus its maximizer and is obtained by setting the gradient to zero. In the case of the binary model, the log-likelihood writes:

$$\begin{aligned} \log p((A^{(k)}), (X^{(k)}), (\lambda^{(k)}); \theta) &= \sum_{k=1}^{N} \sum_{i,j=1}^{n} \left[ A_{ij}^{(k)} \log h\big((\lambda^{(k)} \cdot X^{(k)})_{ij}\big) + (1 - A_{ij}^{(k)}) \log\big(1 - h\big((\lambda^{(k)} \cdot X^{(k)})_{ij}\big)\big) \right] \\ &\quad + \sum_{k=1}^{N} \left[ -\frac{1}{2\sigma_{\lambda}^{2}} \left\| \lambda^{(k)} - \mu \right\|^{2} + \operatorname{Tr}(F^{\top} X^{(k)}) \right] \\ &\quad - Np \log \sigma_{\lambda} - N \log \mathcal{C}_{n,p}(F) + c \end{aligned}$$

with *h*(*x*) = 1/(1 + exp(−*x*)) the sigmoid function, applied component-wise to matrices. The conditional distribution of (*A* | *λ*, *X*) involves no additional parameter and needs no estimation. Hence, the MLE of the model parameters *F*, *μ*, and *σλ* remains unchanged.
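For concreteness, the closed-form M-step implied by Equation (A1) in the Gaussian case can be sketched as follows. This is our own minimal implementation, not the authors' code: setting the gradient to zero gives $\hat{\mu} = S_2$, $\hat{\sigma}_\lambda^2 = (S_3 - \|S_2\|^2)/p$ and $\hat{\sigma}_\varepsilon^2 = S_4/n^2$, and the array layout is an assumption.

```python
import numpy as np

def m_step(A, X, lam):
    """Closed-form maximizers of (mu, sigma_lambda^2, sigma_eps^2) from the
    sufficient statistics S2, S3, S4 of Equation (A1).

    A:   (N, n, n) adjacency matrices
    X:   (N, n, p) pattern matrices with orthonormal columns
    lam: (N, p)    pattern weights
    """
    N, n, p = X.shape
    # lambda . X = sum_i lambda_i x_i x_i^T, computed for each subject k
    recon = np.einsum('kp,knp,kmp->knm', lam, X, X)
    S2 = lam.mean(axis=0)                            # (1/N) sum_k lambda^(k)
    S3 = (lam ** 2).sum(axis=1).mean()               # (1/N) sum_k ||lambda^(k)||^2
    S4 = ((A - recon) ** 2).sum(axis=(1, 2)).mean()  # mean squared residual norm
    mu = S2
    sigma_lam2 = (S3 - mu @ mu) / p
    sigma_eps2 = S4 / n ** 2
    return mu, sigma_lam2, sigma_eps2
```

On data simulated from the Gaussian model, these estimates recover the generating parameters up to Monte Carlo error.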

## *Appendix A.2. Saddle-Point Approximation of $\mathcal{C}_{n,p}(F)$*

We recall the main steps to compute the approximation of $\mathcal{C}_{n,p}(F)$ proposed by Kume et al. [49]. For more details on the justification of the approximation and applications to more general distributions, we refer the reader to the original paper. Our implementation provides a function spa.log\_vmf, which computes this approximation. The approximation $\hat{\mathcal{C}}_{n,p}(F)$ for von Mises–Fisher distributions is written in Equation (16) of [49]:

$$\hat{\mathcal{C}}_{n,p}(F) = \frac{2^p (2\pi)^{np/2 - p(p+1)/4}}{|\hat{K}''|^{1/2} |\hat{C}|^{1/2}} \exp\left\{ \frac{1}{2} \operatorname{vec}(F)^{\top} \hat{C}^{-1} \operatorname{vec}(F) - \sum_{i=1}^p \hat{\theta}_{ii} \right\} \exp(T - p/2) \,. \tag{A2}$$

Using the following definitions:

• The matrix $\hat{K}''$ appearing in Equation (A2) has entries given by:

$$\hat{K}''_{(r_1, s_1), (r_2, s_2)} = \begin{cases} 0 & r_1 \neq r_2 \text{ or } s_1 \neq s_2, \\ n \hat{\phi}_r \hat{\phi}_s + \hat{\phi}_r \hat{\phi}_s (\omega_r^2 \hat{\phi}_r + \omega_s^2 \hat{\phi}_s) & r_1 = r_2 = r < s_1 = s_2 = s, \\ 2n \hat{\phi}_r^2 + 4\omega_r^2 \hat{\phi}_r^3 & r_1 = r_2 = s_1 = s_2 = r \end{cases}$$

• The parameter *T* is defined in Equation (8) of [49] and computed in the supplementary material of that paper in the case of vMF distributions. To a first approximation, *T* can be taken equal to zero.

As in the original paper, we validate our implementation by comparing the result with the Monte Carlo estimate of the normalizing constant produced by uniform sampling on the Stiefel manifold.
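The Monte Carlo check mentioned above can be sketched as follows: with respect to the normalized uniform measure on the Stiefel manifold, $\mathcal{C}_{n,p}(F) = \mathbb{E}_{X \sim \mathrm{Unif}(\mathcal{V}_{n,p})}\left[\exp(\operatorname{Tr}(F^{\top}X))\right]$, and uniform samples are obtained from the QR decomposition of a Gaussian matrix. This is our own sketch with assumed function names, not the implementation from the paper.

```python
import numpy as np

def sample_stiefel_uniform(n, p, rng):
    """Uniform sample on the Stiefel manifold V_{n,p}: QR decomposition of a
    Gaussian matrix, with a sign correction to make the factorization unique."""
    Q, R = np.linalg.qr(rng.normal(size=(n, p)))
    return Q * np.sign(np.diag(R))

def log_vmf_constant_mc(F, n_samples=20_000, seed=0):
    """Monte Carlo estimate of log C_{n,p}(F), the vMF normalizing constant
    with respect to the normalized uniform measure on V_{n,p}."""
    n, p = F.shape
    rng = np.random.default_rng(seed)
    logw = np.array([np.trace(F.T @ sample_stiefel_uniform(n, p, rng))
                     for _ in range(n_samples)])
    m = logw.max()                  # log-sum-exp for numerical stability
    return m + np.log(np.exp(logw - m).mean())
```

For $p = 1$ and $n = 3$, the constant is known in closed form ($\sinh(\kappa)/\kappa$ on the sphere), which gives a simple sanity check of the estimator.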

**Remark A1.** *The* −*p*/2 *factor comes from using B* = −*In*×*p*/2 *(and thus V* = *In*×*p) and compensating with Equation (22) of [49] to handle vMF distributions, which otherwise have B* = 0*. This point is not stated explicitly in the main text of the paper but it is explained in the related MATLAB implementation provided by the authors.*

#### **Appendix B. Gradient Formulas**

The MCMC-SAEM initialization heuristic, as well as the MALA transition kernel, require the gradients of the log-likelihood with respect to the latent variables. In this section, we compute these gradients for the model with Gaussian perturbation and the model with binary coefficients.

#### *Appendix B.1. Model with Gaussian Perturbation*

Consider the log-likelihood for the variables of only one subject (*X*, *λ*, *A*). Using the formula in Equation (A1), we can compute its gradients with respect to *X* and *λ*. For *λ*, it writes:

$$\nabla_{\lambda} \log p(X, \lambda, A; \theta) = -\left(\frac{1}{\sigma_{\varepsilon}^{2}} + \frac{1}{\sigma_{\lambda}^{2}}\right)\lambda + \frac{1}{\sigma_{\varepsilon}^{2}}\left(x_{i}^{\top} A x_{i}\right)_{i=1}^{p} + \frac{1}{\sigma_{\lambda}^{2}}\mu$$

with $(x_i)_{i=1}^{p}$ the columns of *X*. Similarly, the Euclidean gradient for *X* is given by

$$\nabla_X \log p(X, \lambda, A; \theta) = \frac{2}{\sigma_{\varepsilon}^{2}} \left( A X \operatorname{Diag}(\lambda) - X \operatorname{Diag}(\lambda) X^{\top} X \operatorname{Diag}(\lambda) \right) + F$$

Following Edelman et al. [30], the Riemannian gradient on the Stiefel manifold is then given by:

$$\nabla_X^{\mathcal{V}} \log p(X, \lambda, A; \theta) = \nabla_X \log p(X, \lambda, A; \theta) - X \left(\nabla_X \log p(X, \lambda, A; \theta)\right)^{\top} X$$
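The projection $G - XG^{\top}X$ of Edelman et al. can be checked numerically: using $X^{\top}X = I_p$, the result must satisfy the tangency condition $X^{\top}\Xi + \Xi^{\top}X = 0$. A minimal sketch (ours, with assumed function names):

```python
import numpy as np

def stiefel_grad(X, G):
    """Riemannian gradient on the Stiefel manifold obtained from the
    Euclidean gradient G via the projection G - X G^T X (Edelman et al.)."""
    return G - X @ G.T @ X

def is_tangent(X, xi, tol=1e-10):
    """A tangent vector xi at X satisfies X^T xi + xi^T X = 0."""
    S = X.T @ xi
    return np.abs(S + S.T).max() < tol
```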

## *Appendix B.2. Binary Model*

Similarly, the log-likelihood gradients can be derived for the binary model. Let $\tilde{x}_i$ be the *i*-th *row* of *X* and denote by $\odot$ the entrywise product. We have:

$$\begin{split} \nabla_{\lambda} \log p(X, \lambda, A; \theta) &= -\frac{1}{\sigma_{\lambda}^{2}} (\lambda - \mu) + \sum_{i,j=1}^{n} \left[ A_{ij}h(-(\lambda \cdot X)_{ij}) - (1 - A_{ij})h((\lambda \cdot X)_{ij}) \right] (\tilde{x}_{i} \odot \tilde{x}_{j}) \\ \nabla_{X} \log p(X, \lambda, A; \theta) &= F + \sum_{i \neq j} \left[ A_{ij}h(-(\lambda \cdot X)_{ij}) - (1 - A_{ij})h((\lambda \cdot X)_{ij}) \right] H_{ij} \\ &\quad + \sum_{i=1}^{n} \left[ A_{ii}h(-(\lambda \cdot X)_{ii}) - (1 - A_{ii})h((\lambda \cdot X)_{ii}) \right] K_{i} \,. \end{split}$$

In the latter formula, $H_{ij}$ is an $n \times p$ matrix with zeros everywhere except the $i$-th row, equal to $\lambda \odot \tilde{x}_{j}$, and the $j$-th row, equal to $\lambda \odot \tilde{x}_{i}$. $K_{i}$ is the $n \times p$ matrix with zeros everywhere except the $i$-th row, equal to $2\lambda \odot \tilde{x}_{i}$.

## **Appendix C. Algorithm for the Clustering Model**

We summarize in Algorithm A1 the procedure to estimate the MLE of a mixture model.

**Algorithm A1:** Maximum Likelihood Estimation of *θ* = (*F*, *μ*, *σε*, *σλ*, *π*) for the mixture model

```
Initialize θ_0 and S_0. Initialize X_0, λ_0 and z_0 using the K-Means algorithm.
for t = 1 to T do
    if (t mod 5) = 0 then
        Align together the parameters (F_c, μ_c), c = 1..K, of each cluster using Algorithm 2.
    end
    if t ≤ T/3 and (t mod 5) = 0 then
        for k = 1 to N do
            Use Algorithm 2 to align X_t^(k) with π_V(F_{z_t^(k)}). Permute λ_t^(k) accordingly.
        end
    end
    Set X~_0^(k) = X_t^(k), λ~_0^(k) = λ_t^(k), z~_0^(k) = z_t^(k).
    for l = 1 to n_MCMC do
        for k = 1 to N do
            Sample X~_l^(k) from the Metropolis kernel q_X(· | X~_{l-1}^(k), λ~_{l-1}^(k), z~_{l-1}^(k); θ_t)
                targeting p(X^(k) | A^(k), λ~_{l-1}^(k), z~_{l-1}^(k); θ_t).
            Sample λ~_l^(k) from the Metropolis kernel q_λ(· | X~_l^(k), λ~_{l-1}^(k), z~_{l-1}^(k); θ_t)
                targeting p(λ^(k) | A^(k), X~_l^(k), z~_{l-1}^(k); θ_t).
            Sample z~_l^(k) from the distribution p(z^(k) | A^(k), X~_l^(k), λ~_l^(k); θ_t).
        end
    end
    Set X_{t+1}^(k) = X~_{n_MCMC}^(k), λ_{t+1}^(k) = λ~_{n_MCMC}^(k), z_{t+1}^(k) = z~_{n_MCMC}^(k).
    Update the sufficient statistics S_{t+1} = (1 − α_t) S_t + α_t S(A, X_{t+1}, λ_{t+1}).
    Compute π_{t+1} from the proportion of samples z_{t+1}^(k) belonging to each cluster.
    for c = 1 to K do
        Compute μ^c_{t+1}, (σ^c_ε)_{t+1} and (σ^c_λ)_{t+1} with Equation (5), using only the k such that z_{t+1}^(k) = c.
        Compute F^c_{t+1} by solving problem (6), using only the k such that z_{t+1}^(k) = c.
    end
end
return θ_T, (X_t, λ_t, z_t)_{t=1}^T
```

#### **Appendix D. Symmetry of Von Mises–Fisher Distributions**

Let *F* be the parameter of a von Mises–Fisher distribution. Let $\exp_X$ be the Riemannian exponential map at *X*. We have the following result:

**Proposition A1.** *Suppose that the columns of F are orthogonal. Let $M = \pi_V(F)$ be the mode of the vMF distribution and $D \in T_M\mathcal{V}_{n,p}$ a tangent vector at M. Then $p_{\mathrm{vMF}}(\exp_M(D)) = p_{\mathrm{vMF}}(\exp_M(-D))$, i.e., the vMF distribution is symmetric around its mode.*

**Proof.** Since the columns of *F* are orthogonal, we can write $F = M\Lambda$ with $M = \pi_V(F) \in \mathcal{V}_{n,p}$ and $\Lambda = \operatorname{Diag}(\lambda)$. Let $D \in T_M\mathcal{V}_{n,p}$. As proven in [30], the geodesic $X_t$ starting at *M* with $\dot{X}(0) = D$ is then given by

$$X_t = (M, M_\perp) \exp(t K_M(D))\, I_{n,p} \,,$$

where exp is the matrix exponential, $M_\perp \in \mathcal{V}_{n,n-p}$ is such that $M^{\top}M_\perp = 0$, and $K_M(D)$ is skew-symmetric: $K_M(D)^{\top} = -K_M(D)$. Therefore, the von Mises–Fisher log-density along $X_t$ writes as:

$$\begin{aligned} \operatorname{Tr}(F^{\top} X_{t}) &= \operatorname{Tr}(\Lambda M^{\top}(M, M_{\perp}) \exp(t K_{M}(D)) I_{n,p}) \\ &= \operatorname{Tr}(\Lambda I_{p,n} \exp(t K_{M}(D)) I_{n,p}) \\ &= \operatorname{Tr}(I_{n,p}^{\top} \exp(t K_{M}(D))^{\top} I_{p,n}^{\top} \Lambda^{\top}) \\ &= \operatorname{Tr}(I_{p,n} \exp(t K_{M}(D)^{\top}) I_{n,p} \Lambda) \\ &= \operatorname{Tr}(\Lambda I_{p,n} \exp(-t K_{M}(D)) I_{n,p}) \\ &= \operatorname{Tr}(F^{\top} X_{-t}) \end{aligned}$$

Therefore, the von Mises–Fisher density is symmetric around its mode.
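The symmetry can also be verified numerically along the geodesic formula above. The sketch below is ours: an arbitrary skew-symmetric matrix stands in for $K_M(D)$, and the matrix exponential is computed through a complex eigendecomposition to keep the example dependency-free.

```python
import numpy as np

def expm_skew(K):
    """Matrix exponential of a skew-symmetric (hence normal) matrix via
    its complex eigendecomposition."""
    w, V = np.linalg.eig(K)
    return ((V * np.exp(w)) @ np.linalg.inv(V)).real

def geodesic_density_term(F, M, M_perp, K, t):
    """Tr(F^T X_t) along the geodesic X_t = (M, M_perp) exp(t K) I_{n,p}."""
    n, p = F.shape
    basis = np.hstack([M, M_perp])           # n x n orthogonal matrix (M, M_perp)
    X_t = basis @ expm_skew(t * K)[:, :p]    # right-multiplying by I_{n,p} keeps p columns
    return np.trace(F.T @ X_t)
```

With $F = M\Lambda$, the trace computed at $t$ and $-t$ coincides up to numerical error, as the proof predicts.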

#### **Appendix E. Additional Details on the UK Biobank Experiment**

*Appendix E.1. Impact of the Number p of Patterns*

We perform the same experiment as in Section 5.2 with different numbers of patterns, *p* ∈ {2, 5, 10}, always running the MCMC-SAEM for 1000 iterations with 20 MCMC steps per SAEM iteration. We call the related models M2, M5, and M10. The normalized patterns of M2 and M10 are reproduced in Figures A1 and A2. The patterns of M2 correspond to patterns 1 and 2 of M5 and M10. Similarly, the patterns of M5 correspond to patterns 1 to 5 of M10. This result confirms that our model acts in a way comparable to PCA, selecting first the dominant patterns with the largest eigenvalues. Figure A3 compares the posterior means of $\lambda \cdot X$ given by M2, M5 and M10 for 5 subjects. As expected, the approximation $\lambda^{(k)} \cdot X^{(k)}$ refines and gets closer to $A^{(k)}$ as *p* increases. Over the 1000 subjects, these posterior means achieve, respectively, 57% (±7%), 47% (±5%) and 35% (±4%) relative RMSE.

However, this observation does not assess whether higher values of *p* provide additional relevant features to represent the network structure. To evaluate this point, we repeat the experiment of missing link MAP imputation on models M2 and M10. We find that both M2 and M10 yield a worse prediction than M5 on this task: model M2 gets 70% (±16%) rRMSE and M10 gets 76% (±16%) rRMSE, whereas model M5 gets 65% (±15%) rRMSE. While the prediction performance of M2 is expected to be worse than M5's, observing a worse prediction performance for M10 means that the information captured by the additional components does not help infer the network structure. As with PCA, the components with smaller amplitude are less relevant for regression tasks; this idea is at the core of Partial Least Squares regression [68].

**Figure A1.** Normalized connectivity patterns when *p* = 2, computed as in Figure 10.

**Figure A2.** Normalized connectivity patterns when *p* = 10, computed as in Figure 10.

**Figure A3.** (**a**) UK Biobank connectivity matrices for 5 subjects. (**b**) M10 posterior mean value of *λ* · *X*. (**c**) M5 posterior mean value of *λ* · *X*. (**d**) M2 posterior mean value of *λ* · *X*. The rRMSE coherently increases as *p* decreases.

Therefore, the parameter *p* should be chosen with care when using our model for predictive purposes. The experiment presented above can be used to quantify the relevance of the obtained representation, but other methods could be explored. Future work could investigate the question of parameter selection by adapting Bayesian model selection methods to our model, as well as likelihood ratio tests or criteria like the BIC and AIC.
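As a reminder of what such criteria compute, here is a generic sketch (not tied to the model above; the log-likelihood value and parameter count would come from the fitted model, e.g. *np* + *p* + 2 degrees of freedom as in Section 5.2.3):

```python
import numpy as np

def aic(loglik, n_params):
    """Akaike information criterion; lower is better."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower is better. The stronger
    penalty log(n_obs) favors smaller models than the AIC."""
    return -2.0 * loglik + n_params * np.log(n_obs)
```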

### *Appendix E.2. Brain Regions of the UK Biobank fMRI Correlation Networks*

As explained in Section 5.2, the Regions Of Interest (ROIs) that define the correlation networks are detected automatically using a spatial ICA [59]. Each component of the ICA attributes a weight to each brain voxel. The brain regions are visualized by selecting the voxels with weight above a certain threshold. The obtained level sets may be scattered over the brain, which sometimes makes their interpretation difficult. In Figure A4, we show the brain regions mentioned in the interpretation of the patterns identified by our model, namely regions 1, 2, 3, 4, 8, 9, 10, 11, 12, 17, 19. In this figure, as well as online, the ICA weight threshold value is set to 5.

**Figure A4.** Frontal, sagittal, and transverse cuts of the brain for the UK Biobank fMRI brain regions analyzed in this paper. As explained in Section 5.2, region 1 comprises part of the Default Mode Network of the brain, which characterizes its activity at rest. Region 3, which is anti-correlated to region 1, is related to sensory functions. Regions 2, 4, 8, 9, and 19 are involved in the visual functions. Regions 10, 11, 12 correspond to motor control. Region 17 is involved in memory and spatial navigation. The L/R letters distinguish the left and right hemispheres. The black axes on each view give the three-dimensional position of the cut. The color strength corresponds to the truncated ICA weight.

## **References**


## *Article* **"Exact" and Approximate Methods for Bayesian Inference: Stochastic Volatility Case Study**

**Yuliya Shapovalova**

Institute for Computing and Information Sciences, Radboud University Nijmegen, Toernooiveld 212, 6525 EC Nijmegen, The Netherlands; yuliya.shapovalova@ru.nl

**Abstract:** We conduct a case study in which we empirically illustrate the performance of different classes of Bayesian inference methods to estimate stochastic volatility models. In particular, we consider how different particle filtering methods affect the variance of the estimated likelihood. We review and compare particle Markov Chain Monte Carlo (MCMC), Riemann Manifold Hamiltonian Monte Carlo (RMHMC), fixed-form variational Bayes, and integrated nested Laplace approximation to estimate the posterior distribution of the parameters. Additionally, we conduct the review from the point of view of whether these methods are (1) easily adaptable to different model specifications; (2) adaptable to higher dimensions of the model in a straightforward way; (3) feasible in the multivariate case. We show that when using the stochastic volatility model for methods comparison, various data-generating processes have to be considered to make a fair assessment of the methods. Finally, we present a challenging specification of the multivariate stochastic volatility model, which is rarely used to illustrate the methods but constitutes an important practical application.

**Keywords:** Bayesian inference; Markov Chain Monte Carlo; Sequential Monte Carlo; Riemann Manifold Hamiltonian Monte Carlo; integrated nested Laplace approximation; fixed-form variational Bayes; stochastic volatility

## **1. Introduction**

The field of Bayesian statistics and machine learning has advanced quite rapidly in recent years. However, newly developed methods are often slow to be adopted across different fields. In this review, we aim to provide the reader with methodologies for the estimation problem in models with latent variables and intractable likelihoods. We are particularly interested in methods for nonlinear state-space models, and especially for stochastic (latent) volatility models. Multiple studies have reviewed and compared estimation methods for stochastic volatility models [1–3]. We briefly mention some of the methods that have been reviewed; however, most of the methods considered in this paper have not entered those reviews. In this paper, we focus in particular on comparing methods that target the posterior distribution exactly with methods that approximate it. We also conduct the review from the point of view of estimating multivariate models with these methods and discuss the bottleneck in each of them when extending to higher-dimensional stochastic volatility (SV) models. We consider different data-generating processes for simulating data in the empirical studies and conclude that the choice of the data-generating process can heavily affect the performance of a method. Thus, illustrating the performance of a method on just one data-generating process or one real-world data set is not sufficient.

In the financial econometrics literature, GARCH-type models prevail since they are much simpler to estimate. Stochastic (latent) volatility models, however, can be a more natural framework for modeling asset returns. They provide flexible and intuitive tools for applications in financial econometrics as well as in some other disciplines. In particular, multivariate stochastic volatility models offer an attractive framework for detecting and measuring *volatility spillover effects*. Volatility spillovers in this framework can be defined

**Citation:** Shapovalova, Y. 'Exact' and Approximate Methods for Bayesian Inference: Stochastic Volatility Case Study. *Entropy* **2021**, *23*, 466. https://doi.org/10.3390/e23040466

Academic Editor: Pierre Alquier

Received: 28 February 2021 Accepted: 12 April 2021 Published: 15 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

through Granger-causal links in the latent (unobservable) volatility process, which is modeled with a Vector Autoregressive model (VAR(*p*)). Insights about the causal structure can help to identify the relationships (Granger causality and/or contemporaneous correlation) between financial markets. Such information can be insightful and helpful in the decision-making process of portfolio managers and policymakers. These models are, however, rarely considered in practice. Multiple Bayesian inference methods have been proposed for the estimation of this class of models in recent years. In this paper, we identify the bottlenecks in different classes of methods for the estimation of these models in the multivariate case.

One of the main difficulties in estimating nonlinear state-space models in general (and stochastic volatility models in particular) lies in the intractability of the likelihood, which results from the presence of an unobservable process in the model and the nonlinear dependence between this process and the observed data. The likelihood can be estimated with particle filter methods, also known as Sequential Monte Carlo. This is a computationally intensive procedure; however, depending on the problem and the data, it can provide excellent results. The second difficulty is the intractable posterior distribution. A standard starting point for sampling from the posterior distribution is the Metropolis–Hastings algorithm, which is a general method that can be applied straightforwardly to different models. It works well when the number of parameters in the model is small. However, the convergence of the algorithm can be slow in larger models due to the inefficiency of sampling with random-walk proposals. Particle Metropolis–Hastings [4] combines Sequential Monte Carlo for the likelihood estimation with Metropolis–Hastings for the sampling from the posterior, which results in a state-of-the-art method in terms of estimation quality, since it targets the exact posterior. The downside of this method is that it is computationally extremely demanding. Note that, while we consider particle Metropolis–Hastings in this paper, the class of methods from [4] is more general.
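To make the likelihood-estimation step concrete, here is a minimal bootstrap particle filter for a basic univariate SV model, $x_t = \phi x_{t-1} + \sigma \eta_t$, $y_t = e^{x_t/2}\varepsilon_t$ with standard Gaussian noises. This is an illustrative sketch under that assumed specification (function name and defaults are ours), not code from the papers cited:

```python
import numpy as np

def sv_loglik_pf(y, phi, sigma, n_particles=1000, seed=0):
    """Bootstrap particle filter estimate of the log-likelihood of
        x_t = phi * x_{t-1} + sigma * eta_t,   eta_t ~ N(0, 1)
        y_t = exp(x_t / 2) * eps_t,            eps_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    # initialize particles from the stationary law of the AR(1) state
    x = rng.normal(0.0, sigma / np.sqrt(1.0 - phi ** 2), size=n_particles)
    loglik = 0.0
    for obs in y:
        x = phi * x + sigma * rng.normal(size=n_particles)       # propagate
        # log of the observation density N(0, exp(x_t)) evaluated at y_t
        logw = -0.5 * (np.log(2.0 * np.pi) + x + obs ** 2 * np.exp(-x))
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())                           # likelihood increment
        x = rng.choice(x, size=n_particles, p=w / w.sum())       # multinomial resampling
    return loglik
```

The average of the importance weights at each step gives an unbiased estimate of the likelihood increment (its logarithm, accumulated here, is slightly biased downward); the variance of this estimate across filtering schemes is exactly what the case study compares.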

Two main downsides of particle Metropolis-Hastings are the random-walk behavior of the proposals and the computational burden. One possible solution to the first problem is the family of algorithms that use gradient information in the construction of the proposal distribution and thus explore the parameter space more efficiently. A further step in improving these algorithms is defining them on a Riemann manifold instead of Euclidean space, as proposed in [5]. The resulting algorithm, which we consider for the comparison in this paper, is Riemann Manifold Hamiltonian Monte Carlo. For an extensive comparison of the methods that exploit gradient information and Langevin dynamics (the Metropolis-adjusted Langevin algorithm, Hamiltonian Monte Carlo, the Riemann manifold Metropolis-adjusted Langevin algorithm, and Riemann Manifold Hamiltonian Monte Carlo), we refer to [5].

Thus far, we have discussed methods that target the posterior distribution exactly and have a high computational burden, which makes empirical investigation of their performance in high-dimensional cases infeasible. In the last decade, a large number of methods have been published on approximate posterior inference that allow much faster computations but lose in terms of estimation precision. In this paper, we consider two such methods that rely on different types of approximation. Fixed-form variational Bayes, proposed in [6], assumes a hierarchical factorization of the prior and posterior distributions, and the factorized distributions are approximated by an analytically tractable distribution *q*(·) from a certain family. Then, instead of solving an integration problem, one solves the optimization problem of minimizing the Kullback–Leibler divergence between *q*(·) and the target distribution *p*(·). The second approximate method that we consider is the integrated nested Laplace approximation (INLA) [7]. The method relies on a nested version of the classical Laplace approximation. It has become very popular in recent years and has made computations in many models feasible.

In this review paper, we focus our attention on the following methodologies and provide a comparison of some of the methods via a simulation study. We consider how the variance of the estimated likelihood is affected by choosing different particle-filtering algorithms. Unlike previous studies, we consider the variance of the estimated likelihood over the whole parameter space and observe that it is affected by some parameters of the model more than by others. We compare particle Metropolis-Hastings with Riemann Manifold Hamiltonian Monte Carlo as two state-of-the-art sampling methods for this type of problem. We assess how well the INLA method performs in the estimation of the parameters of the stochastic volatility model and, finally, compare fixed-form variational Bayes methods with sampling by RMHMC. All between-method comparisons are performed on multiple simulated data sets with different underlying parameters. We illustrate that, for a fair comparison and performance assessment, an illustration on a single data set is not sufficient.

The paper is organized as follows. In Section 2, we introduce the model and its different specifications. While in the simulation studies we use a univariate model, we also introduce multivariate stochastic volatility models with Granger-causal feedback as the model of interest for high-dimensional inference. In Section 3, we review the methods that can be used for the estimation of this class of models. We introduce the major ideas behind these methods and refer to the original papers for the details of the derivations. In Section 4, we perform an empirical case study on different simulated data sets and compare the methods on two real-world time series. In particular, we focus on the precision loss of parameter estimation when using approximate methods and on how adaptable the methods are to multivariate estimation and to the estimation of various model specifications.

## **2. Model**

## *2.1. Univariate Stochastic Volatility Model*

In this section, we introduce the model of interest that we will use in the simulation studies. Stochastic volatility (SV) models are concerned with modeling asset prices or asset returns, depending on how the model is formulated. Let *Pt* be the price of the asset (or the exchange rate) at time *t*; we consider two applications to real data in Section 3.5, one to an exchange rate and one to log-returns. The log-return *yt* is then

$$y\_t = \log(1 + R\_t) = \log \frac{P\_t}{P\_{t-1}}.\tag{1}$$
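As a minimal illustration (the price series is hypothetical), the log-returns of Equation (1) can be computed as:

```python
import numpy as np

# Hypothetical price series P_0, ..., P_3
P = np.array([100.0, 101.5, 100.8, 102.2])

# y_t = log(P_t / P_{t-1}), Equation (1)
y = np.log(P[1:] / P[:-1])
```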

Stochastic volatility models are built in such a way that they can mimic *stylized facts* about financial markets and log-returns *yt*. Stylized facts are empirically observed statistical properties of asset prices and asset returns; typical examples are volatility clustering, heavy-tailed return distributions, and leverage effects.


One of the earlier works that received much attention in the financial literature and that proposed a mathematical model for the dynamics of financial markets is [10]. Numerous continuous-time stochastic volatility models have been proposed since then; among the first, several variants should be mentioned [11–14]. The model we consider in this chapter can be viewed as a discrete version of the model in [13], derived using the Euler–Maruyama approximation. The stochastic volatility model in continuous time can be written as

$$ds(t) = \sigma(t)dB\_1(t),\tag{2}$$

$$
d \ln \sigma^2(t) = \left( \mu + \beta \ln \sigma^2(t) \right) dt + \sigma\_\eta dB\_2(t),
\tag{3}
$$

where *s*(*t*) is the log asset price, *σ*2(*t*) is the volatility, and *B*1(*t*) and *B*2(*t*) are Brownian motions that satisfy *corr*(*B*1(*t*), *B*2(*t*)) = *ρ*. If *ρ* < 0, a leverage effect is present. Thus, the log asset price follows a diffusion, and its volatility also follows a diffusion [15]. As the data usually arrive in discrete time, a discrete-time approximation of the model is used in practice. The discrete model follows by the Euler–Maruyama approximation

$$y\_t = \sigma\_t \epsilon\_t, \tag{4}$$

$$
\ln \sigma\_{t+1}^2 = \mu + \phi \ln \sigma\_t^2 + \sigma\_\eta \eta\_{t+1}.\tag{5}
$$

where $y_t$ is the logarithmic return, $\epsilon_t = B_1(t+1) - B_1(t)$, $\eta_{t+1} = B_2(t+1) - B_2(t)$, and $\phi = 1 + \beta$. Further, $\epsilon_t \sim N(0,1)$, $\eta_t \sim N(0,1)$, and $corr(\epsilon_t, \eta_{t+1}) = \rho$.

We obtain the commonly used state-space representation of the model by defining $h_t = \ln \sigma_t^2$, i.e., $\sigma_t^2 = \exp(h_t)$:

$$y\_t = \exp(h\_t/2)\epsilon\_t \tag{6}$$

$$h\_{t+1} = \mu + \phi h\_t + \eta\_{t+1} \tag{7}$$

where *yt* are the observed log-returns and the volatility *ht* is latent and drives the dynamics of *yt*. Figure 1 illustrates this structure of the model. Note that the latent volatility process has an autoregressive form. However, unlike in the standard autoregressive model, the latent volatility is not observed and thus has to be estimated together with the model parameters *μ*, *φ*, and *ση*, which are the level, the volatility persistence, and the noise standard deviation of the latent volatility process, respectively. The persistence parameter *φ* reflects one of the stylized facts of financial returns, namely volatility persistence. The intuition is as follows: if *φ* > 0 and exp(*ht*−1/2) is large, then exp(*ht*/2) will tend to be large too. Hence, the model can account for volatility clustering. In this paper, we consider stationary volatility cases with |*φ*| < 1. Finally, one can also incorporate leverage effects by defining a negative correlation between the noise terms $\epsilon_t$ and $\eta_{t+1}$. An intuitive interpretation of the leverage effect goes as follows: bad news tends to decrease the price, which means that financial leverage increases, the firm becomes riskier, and thus expected volatility also increases. Leverage effects in this model have been studied in [16]. The stochastic volatility model can be parametrized in multiple ways; often, the following alternatives are considered [2]. Other ways to parametrize this model are presented in Equations (8) and (9). The left-hand version of the model corresponds to that of [17]. The right-hand version defines the scaling parameter differently; in this case, it is *β*. For identifiability reasons, only *β* or *μ* as in Equation (7) should be included in the model.

$$y\_t = \sqrt{h\_t} \epsilon\_t \tag{8}$$

$$y\_t = \beta \exp(h\_t/2)\epsilon\_t \tag{8}$$

$$
\log h\_t = \mu + \phi \log h\_{t-1} + \eta\_t \tag{9}
$$

$$
h\_t = \phi h\_{t-1} + \eta\_t. \tag{9}
$$

Note that the authors of [17] define the leverage effect as the correlation between $\epsilon_t$ and $\eta_t$, so the correlation between the noise terms is contemporaneous, while [16] models the correlation between $\epsilon_t$ and $\eta_{t+1}$, which corresponds to the correlation of the returns with one-step-ahead volatility. Yu [18] shows that the approach of [16] is preferable. In particular, in the case of [16], the model is a martingale difference sequence, i.e., the past does not help to predict the future of the time series, while in the case of [17], it is not. Hence, in the latter case, the efficient market hypothesis is violated.

In the remainder of this manuscript, we will work with either specification of the model defined in Equations (6) and (7) or in right-hand side of Equations (8) and (9). These models are equivalent, and we interchange the representation either for the convenience of using some of the methods or for comparison with other work. In the literature, both specifications are frequently used, and in some papers (for example, ref. [19]) the transition from one specification to another is conducted by observing that *β* = exp(*μ*/2).

Under the assumption that |*φ*| < 1, the unconditional first and second moments of the latent process *ht* are

$$\mathbb{E}(h\_t) = \frac{\mu}{1 - \phi}, \qquad Var(h\_t) = \frac{\sigma\_\eta^2}{1 - \phi^2}. \tag{10}$$
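To make the state-space formulation concrete, the following sketch simulates the model of Equations (6) and (7) and checks the simulated latent process against the unconditional moments of Equation (10). The parameter values are illustrative only, and we write the latent noise explicitly as $\sigma_\eta \eta_{t+1}$ with $\eta_{t+1} \sim N(0,1)$:

```python
import numpy as np

def simulate_sv(T, mu=-1.0, phi=0.95, sigma_eta=0.3, seed=0):
    """Simulate y_t = exp(h_t/2) eps_t with h_{t+1} = mu + phi h_t + sigma_eta eta_{t+1}."""
    rng = np.random.default_rng(seed)
    h = np.empty(T)
    # initialize h_0 from the stationary distribution given in Equation (10)
    h[0] = rng.normal(mu / (1 - phi), sigma_eta / np.sqrt(1 - phi**2))
    for t in range(T - 1):
        h[t + 1] = mu + phi * h[t] + sigma_eta * rng.normal()
    y = np.exp(h / 2) * rng.normal(size=T)
    return y, h

y, h = simulate_sv(20000)
```

For a long simulated path, the sample mean and variance of $h_t$ should be close to $\mu/(1-\phi)$ and $\sigma_\eta^2/(1-\phi^2)$.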

The challenge of the estimation of the model lies in the intractability of the likelihood and posterior distribution. The likelihood factorizes as

$$L(y|\theta) = \prod\_{t=1}^{T} p(y\_t|y\_{1:t-1}, \theta), \tag{11}$$

where the terms in the product can be computed recursively, and it becomes clear that the likelihood is a high-dimensional integral

$$p(y\_t|y\_{1:t-1}, \theta) = \int p(y\_t|h\_t, \theta) p(h\_t|y\_{1:t-1}, \theta) dh\_t. \tag{12}$$

There is no analytical solution to the integral in Equation (12), and in this paper, we consider methods to estimate it using sequential Monte Carlo methods.

**Figure 1.** Graphical representation of the stochastic volatility model. The observations *yt*, represented by shaded nodes, depend at each time point on the state of the latent volatility process *ht*.

#### *2.2. Multivariate Stochastic Volatility Model*

In this section, we introduce the multivariate stochastic volatility model, which is rarely used in practice due to the challenges of estimation. One of the objectives of this paper is to assess whether modern methods in Bayesian inference are capable of estimating these models in the high-dimensional case. Multivariate or high-dimensional applications of this class of models can provide insightful information to practitioners. We deal with the same set-up as before; however, we now consider multiple time series of logarithmic returns that are interconnected through the latent volatility process

$$y\_t = \Omega\_t \epsilon\_t, \tag{13}$$

where $\epsilon_t \sim N(\mathbf{0}, R)$ and *R* is a correlation matrix with entries *rii* = 1, *i* = 1, ... , *n* on the diagonal. Furthermore, **Ω***<sup>t</sup>* is a diagonal matrix that contains time-varying volatilities driven by an independent stochastic process *ht*,

$$
\Omega\_t = \operatorname{diag}(\exp(\mathbf{h}\_t/2)).
$$

The process *ht* of log-volatilities follows a VAR(*p*) process

$$\mathbf{h}\_{t} = \mu + \sum\_{k=1}^{p} \Phi\_{k} \left( \mathbf{h}\_{t-k} - \mu \right) + \eta\_{t}, \tag{14}$$

where $\Phi_k = \left(\varphi_{ij,k}\right)_{i,j=1,\dots,n}$ are $n \times n$ coefficient matrices. Introducing these matrices allows us to model connectivity in financial time series through the concept of Granger-causality in the latent volatility process. We say that $h_i$ does not Granger-cause $h_j$ if $\varphi_{ij,k} = 0$ for all $k = 1, \dots, p$. The standard stationarity conditions for a vector autoregressive model apply: the roots of $|I - \lambda \Phi| = 0$ should lie outside the unit circle, and the errors $\eta_t$ are independent and identically normally distributed with mean zero and variance-covariance matrix $\Sigma = diag(\sigma_1^2, \dots, \sigma_n^2)$. Equations (13) and (14) are multivariate extensions of the model described in Equations (6) and (7). The representation from the right-hand side of Equations (8) and (9) can be obtained by including a vector of parameters *β* in **Ω***<sup>t</sup>* and removing *μ* from Equation (14). As before, for identifiability, only one vector of scale parameters (either *μ* or *β*) should be included in the model.

The above MSV model can also be viewed as a non-linear state-space model where (14) is the state equation of the latent process *ht* and (13) is the observation equation that depends non-linearly on the latent state. Note that, in this model, the time series are interconnected and the relationship between them can be interpreted through the concept of Granger-causality in latent volatility processes.
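As a small numerical sketch (the coefficient matrix is hypothetical), the stationarity condition for the latent VAR process can be checked through the eigenvalues of the coefficient matrix:

```python
import numpy as np

# Hypothetical 3x3 coefficient matrix Phi_1 for a VAR(1) latent volatility
# process (Equation (14) with p = 1); following the paper's convention, a
# zero entry phi_{ij,k} for all lags k encodes the absence of a
# Granger-causal link between two latent volatility series.
Phi = np.array([[0.90, 0.05, 0.00],
                [0.00, 0.85, 0.10],
                [0.00, 0.00, 0.80]])

# Stationarity: the roots of |I - lambda * Phi| = 0 lie outside the unit
# circle, equivalently all eigenvalues of Phi lie strictly inside it.
stationary = bool(np.all(np.abs(np.linalg.eigvals(Phi)) < 1))
```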

## **3. Methods**

#### *3.1. Bayesian Inference*

In this paper, we review various methods that sample from or approximate the posterior distribution of the model parameters *θ*. Sampling or approximate methods are necessary since we work in a framework in which the posterior distribution and the likelihood are analytically intractable. Bayes' rule allows us to write the posterior distribution in the form

$$p(\theta|y) = \frac{\pi(\theta)g(y|\theta)}{m(y)},\tag{15}$$

where *π*(*θ*) is the prior distribution of the model parameters, *g*(*y*|*θ*) is the likelihood of the data given the parameters, and *m*(*y*) is the marginal density of *y*, which can be viewed as a normalizing constant and which we will ignore in this paper. In the remainder of the paper, we will work with Bayes' rule in proportionality terms:

$$p(\theta|y) \propto \pi(\theta)g(y|\theta). \tag{16}$$

Note that in the stochastic volatility model we have to estimate both the model parameters $\theta = (\mu, \phi, \sigma_\eta^2)$ and the latent vector of volatilities *h*. Thus, we are interested in the following form of Bayes' rule

$$p(\theta, h|y) \propto g(y|\theta, h) f(h|\theta) \pi(\theta). \tag{17}$$

Multiple approaches can be used for the estimation of *p*(*θ*, *h*|*y*). One of the challenges is that neither the posterior *p*(*θ*, *h*|*y*) nor the likelihood *g*(*y*|*θ*, *h*) is tractable. We start our review by considering sequential Monte Carlo methods, also known as particle filtering, for the estimation of the likelihood *g*(*y*|*θ*, *h*). We then discuss the Metropolis-Hastings algorithm for sampling from the posterior and how these two algorithms can be combined into particle Metropolis-Hastings for sampling from the posterior distribution. We continue the review by considering the RMHMC method, in which the parameters and volatilities are sampled within the same framework. Finally, we review two approximate methods, the integrated nested Laplace approximation and fixed-form variational Bayes, which are two different ways of approximating the posterior distribution.

#### *3.2. Sequential Monte Carlo for the Estimation of the Likelihood*

The Sequential Monte Carlo (SMC) method, also known in the literature as *particle filtering*, is considered a state-of-the-art method for the estimation of intractable likelihoods in nonlinear state-space models. The general idea behind this method is to estimate the latent states by drawing multiple samples (particles) and then propagating them in time according to corresponding importance weights. By combining the weights over all time steps, one obtains a marginal likelihood estimate. Standard and well-known schemes are the *Bootstrap particle filter* (BPF) [20], *Sequential Importance Sampling* (SIS), and *Sequential Importance Resampling* (SIR) [21]. Sequential Monte Carlo methods were elegantly

combined with Markov Chain Monte Carlo in [4], and the method was named particle Markov Chain Monte Carlo (PMCMC). This method provides a powerful and coherent approach for Bayesian inference in a wide range of complex models. In the later subsections, we will discuss how sequential Monte Carlo methods are combined with Markov Chain Monte Carlo for fully Bayesian inference in stochastic volatility models. One of the concerns when using and implementing SMC for likelihood estimation is the variance of the estimated likelihood. Standard SMC techniques such as SIS are prone to a high variance of the estimated likelihood once the dimensionality of the problem increases [22]. A number of studies have tried to address this problem. The common choice of proposal for the particles in standard schemes is *f*(*ht*|*ht*−1). Pitt and Shephard [23] propose the *auxiliary particle filter* as a solution; it uses a proposal for the particles, *q*(*ht*|*ht*−1, *yt*), that takes into account the current observation and not only the dynamics of the latent process itself. Scharth and Kohn [24] suggest using efficient importance sampling [25] inside the PMCMC procedure. Guarniero et al. [26] use a twisted representation of the model and a look-ahead type of particle filtering to address the issue of the high variance of the estimated likelihood. Johansen and Doucet [27] compare sequential importance resampling (SIR) with the auxiliary particle filter and find that the APF does not always outperform SIR. Often, the variance of the estimated likelihood is analyzed at the true parameter values, as in [24]. However, when using particle Markov Chain Monte Carlo, it is also of interest whether the same conclusions hold at different points of the parameter space. In particular, we never start running the algorithm at the true parameter values.
This means that if the variance of the estimated likelihood is much larger in some areas of the parameter space, the convergence of the algorithm can be affected. Having insights into how the variance of the estimated likelihood is different in the parameter space can help to make a more efficient choice of the starting point for the algorithm.

We first review the sequential Monte Carlo methods for the estimation of the likelihood. After that, we discuss Metropolis-Hastings algorithm and how SMC and Metropolis-Hastings can be combined for Bayesian inference in general and stochastic volatility models in particular.

#### 3.2.1. Sequential Monte Carlo

Assume that we are in a framework with an observed time series process *yt* and a latent Markovian process *ht*. Since we never observe the latent process, we need to infer it. This objective can be achieved with *Sequential Monte Carlo* (SMC), also known as *particle filtering*. The method operates in a sequential manner on the arriving observations *yt*. The posterior distribution of the latent process can be computed sequentially

$$p(h\_{0:t}|y\_{1:t}) = p(h\_{0:t-1}|y\_{1:t-1}) \frac{g(y\_t|h\_t)f(h\_t|h\_{t-1})}{p(y\_t|y\_{1:t-1})}.\tag{18}$$

The denominator of Equation (18) is not analytically tractable, which can be also seen from Equation (12) earlier. SMC allows us to estimate the posterior distribution *p*(*h*0:*t*|*y*1:*t*) and additionally get the estimate of the likelihood

$$\begin{split} L(y\_{1:T}) &= \int p(y\_{1:T}, h\_{1:T}) dh\_{1:T} = \int g(y\_{1:T} | h\_{1:T}) p(h\_{1:T}) dh\_{1:T} \\ &= \int g(y\_1 | h\_1) p(h\_1) \prod\_{t=2}^{T} g(y\_t | h\_t) f(h\_t | h\_{t-1}) dh\_1 \dots dh\_T. \end{split} \tag{19}$$

The basic procedure of particle filtering in this setting can be summarized by three crucial steps: prediction, updating, and resampling. The outline of a basic particle filter can be summarized in the following way.

• Initialization: given the prior distribution of the initial state, we draw *N* independent random samples $\{h_0^{(i)}\}_{i=1}^N$; these samples are called *particles*.

• Prediction: we sample particles according to the importance density

$$h\_t^{(i)} \sim q(h\_t | h\_{t-1}^{(i)}, y\_t). \tag{20}$$

• Updating: during updating, we assign a weight $w_t^{(i)}$ to every particle

$$w\_t^{(i)} = \frac{p(y\_t|h\_t^{(i)})f(h\_t^{(i)}|h\_{t-1}^{(i)})}{p(y\_t|y\_{1:t-1})q\_t(h\_t^{(i)}|h\_{0:t-1}^{(i)})} \tag{21}$$

and normalize these weights to sum to 1. Every weight can be interpreted as our "confidence" about a particle.

• Resampling: resample the particles if the effective number of particles,

$$N\_{eff} = \frac{1}{\sum\_{i=1}^{N} (\omega\_t^{(i)})^2},\tag{22}$$

is too low. In Equation (22), $\omega_t^{(i)}$ is the normalized weight of particle *i* at time step *t*. The threshold for the resampling step is set depending on whether particle degeneracy is a problem. In general, we perform resampling when *Neff* < *N*/*c*, where *c* is a constant.

The resampling step is performed to find a trade-off between two well-documented problems: particle *degeneracy* and particle *impoverishment* [28]. The former happens when the resampling step is ignored or is not performed frequently enough; in this case, one ends up with a particle set in which most particles have near-zero weights. The latter problem happens when the particle set is resampled too frequently; then, eventually, the set collapses to many copies of a few high-weight particles and hence lacks diversity. The way to balance these two problems is to resample when the effective number of particles falls below a certain threshold.
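The resampling criterion of Equation (22) can be sketched as follows (the threshold constant *c* is a tuning choice):

```python
import numpy as np

def effective_sample_size(w):
    """N_eff = 1 / sum(w_i^2) for normalized weights w (Equation (22))."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()  # normalize the weights to sum to one
    return 1.0 / np.sum(w**2)

def needs_resampling(w, c=2.0):
    """Resample when N_eff < N / c."""
    return effective_sample_size(w) < len(w) / c
```

Uniform weights give *Neff* = *N* (no resampling needed), while a nearly degenerate weight set gives *Neff* close to 1.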

In this paper, we consider two particle filters: the bootstrap and the auxiliary particle filter. A generic particle filter is presented in Algorithm A1 [28]. The bootstrap filter is a variation of a more general approach, sequential importance sampling (resampling). The distinction of the bootstrap filter is the proposal mechanism for the particles. In the bootstrap particle filter, the proposals for the particles are made on the basis of the dynamics of the model, *f*(*ht*|*ht*−1). If $q(h_t|y_t, h_{t-1}) = f(h_t|h_{t-1})$, then the ratio $f(h_t|h_{t-1})/q(h_t|y_t, h_{t-1})$ is equal to 1. In the case of the auxiliary particle filter, we also incorporate the current observation into the proposal mechanism *q*(*ht*|*ht*−1, *yt*). Incorporating the current observation into the proposal for the particles in some cases allows us to reduce the variance of the estimated likelihood. In our case, there is no analytical expression for the proposal density. In the next subsection, we discuss how it can be approximated, as proposed in [23].
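A minimal bootstrap particle filter for the SV model of Equations (6) and (7) can be sketched as follows. For brevity, it resamples at every step rather than using the *Neff* criterion, and the parameter values in the usage line are illustrative only:

```python
import numpy as np

def bpf_loglik(y, mu, phi, sigma_eta, N=500, seed=0):
    """Bootstrap PF: propose from f(h_t|h_{t-1}), weight by g(y_t|h_t) = N(0, exp(h_t))."""
    rng = np.random.default_rng(seed)
    # initialize particles from the stationary distribution of h_t
    h = rng.normal(mu / (1 - phi), sigma_eta / np.sqrt(1 - phi**2), size=N)
    loglik = 0.0
    for t, yt in enumerate(y):
        if t > 0:  # propagate particles through the transition density
            h = mu + phi * h + sigma_eta * rng.normal(size=N)
        # log of the Gaussian observation density g(y_t | h_t)
        logw = -0.5 * (np.log(2 * np.pi) + h + yt**2 * np.exp(-h))
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())  # log p_hat(y_t | y_{1:t-1})
        h = h[rng.choice(N, size=N, p=w / w.sum())]  # multinomial resampling
    return loglik

ll = bpf_loglik(np.array([0.1, -0.2, 0.05, 0.3]), mu=0.0, phi=0.95, sigma_eta=0.3)
```

Because the proposal equals the transition density, the unnormalized weight reduces to the observation density, and averaging the weights at each step yields the running likelihood estimate.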

## 3.2.2. Auxiliary Particle Filter for SV Model

Incorporating knowledge of *yt* into the proposal for the particles, *q*(*ht*|*ht*−1, *yt*), can help to reduce the variance of the estimated likelihood and improve the approximation of the filtering distribution *p*(*ht*|*y*1:*t*). Note, however, that this is not always the case, as shown in [27]. Only for linear Gaussian state-space models does the proposal density from Equation (A2) have an analytical expression. Hence, for stochastic volatility models, this term must be approximated. Pitt and Shephard [23] propose using non-blind proposals for the next generation of particles by first expanding $\log g(y_{t+1}|h_{t+1})$ to second order around $\mu_{t+1}^{k}$ via a Taylor expansion

$$\begin{split} \log g(y\_{t+1}|h\_{t+1}, \mu\_{t+1}^{k}) &\approx \log p(y\_{t+1}|\mu\_{t+1}^{k}) + (h\_{t+1} - \mu\_{t+1}^{k})' \frac{\partial \log p(y\_{t+1}|\mu\_{t+1}^{k})}{\partial h\_{t+1}} \\ &+ \frac{1}{2} (h\_{t+1} - \mu\_{t+1}^{k})' \frac{\partial^2 \log p(y\_{t+1}|\mu\_{t+1}^{k})}{\partial h\_{t+1} \partial h\_{t+1}'} (h\_{t+1} - \mu\_{t+1}^{k}) \end{split} \tag{23}$$

To derive the expression for $\log g(y_{t+1}|h_{t+1}, \mu_{t+1}^{k})$, recall that $y_t \sim N(0, \exp(h_t))$ and hence

$$\begin{split} p(y\_t|h\_t) &= \frac{1}{\sqrt{2\pi \exp(h\_t)}} \exp\left\{-\frac{y\_t^2}{2\exp(h\_t)}\right\} \\ &= \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{y\_t^2}{2\exp(h\_t)} - \frac{h\_t}{2}\right\} \end{split} \tag{24}$$

and further note that $f(h_t|h_{t-1}) = N(\mu + \phi(h_{t-1} - \mu), \sigma_\eta^2)$; thus

$$f(h\_t|h\_{t-1}) = \frac{1}{\sqrt{2\pi\sigma\_\eta^2}} \exp\left\{-\frac{(h\_t - \mu - \phi(h\_{t-1} - \mu))^2}{2\sigma\_\eta^2}\right\}.\tag{25}$$

It follows that the proposal for particles at time *t* + 1 when taking into account the observation of the same period is

$$q(h\_{t+1} \mid h\_t^{(k)}, y\_{t+1}; \mu\_{t+1}^{(k)}) = N \left( \mu\_{t+1}^{(k)} + \frac{\sigma^2}{2} \left( \frac{y\_t^2}{\beta^2} \exp(-\mu\_{t+1}^{(k)}) - 1 \right), \sigma^2 \right). \tag{26}$$
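Sampling from the approximate proposal of Equation (26) can be sketched as follows, with the prior mean $\mu_{t+1}^{(k)} = \mu + \phi(h_t^{(k)} - \mu)$ taken from the transition density of Equation (25); we write `sigma_eta` for the *σ* of Equation (26), take *β* = 1 as in the specification of Equations (6) and (7), and pass the observation being assimilated as `y_next`:

```python
import numpy as np

def apf_proposal(h_t, y_next, mu, phi, sigma_eta, beta=1.0, rng=None):
    """Draw h_{t+1} from the Gaussian proposal of Equation (26)."""
    rng = np.random.default_rng() if rng is None else rng
    mu_next = mu + phi * (np.asarray(h_t) - mu)  # prior mean of h_{t+1}
    # mean shift from the second-order expansion of log g(y | h_{t+1})
    mean = mu_next + 0.5 * sigma_eta**2 * (
        y_next**2 / beta**2 * np.exp(-mu_next) - 1.0)
    return rng.normal(mean, sigma_eta)

draw = apf_proposal(np.zeros(5), 0.2, mu=0.0, phi=0.95, sigma_eta=0.3,
                    rng=np.random.default_rng(42))
```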

3.2.3. Metropolis–Hastings

In this section, we consider the problem of sampling from the posterior distribution and a general algorithm to construct such a sampling scheme. With the Metropolis–Hastings algorithm, we sample from the posterior distribution by proposing a transition *θ* → *θ*<sup>∗</sup> with density *q*(*θ*∗|*θ*), which we accept with probability

$$\alpha(\theta, \theta^\*) = \min \left\{ 1, \frac{\tilde{p}(\theta^\*)}{\tilde{p}(\theta)} \frac{q(\theta|\theta^\*)}{q(\theta^\*|\theta)} \right\},\tag{27}$$

where $\tilde{p}(\cdot)$ is a function proportional to our target distribution. A common choice for the proposal distribution, which we also use when applying PMCMC later in this paper, is a random walk, *q*(*θ*∗|*θ*) = *N*(*θ*∗|*θ*, Σ). The Metropolis–Hastings algorithm is one of the off-the-shelf MCMC methods in the statistical community. It is quite general and can be applied to various problems. The implementation of the Metropolis–Hastings algorithm requires the specification of several quantities. We need to specify a conditional density *q*(*θ*∗|*θ*), the proposal distribution; generally, *q*(*θ*∗|*θ*) should be such that we can easily simulate from it. In many applications, including ours, it is reasonable to take a Gaussian proposal distribution. In this case, it is also symmetric, meaning *q*(*θ*∗|*θ*) = *q*(*θ*|*θ*∗). The Metropolis-Hastings iteration is outlined in Algorithm 1.

In this algorithm, *α*(*θ*, *θ*∗) is the Metropolis–Hastings acceptance probability, where *θ* is the current state of the chain and *θ*∗ is the candidate state of the parameter vector. Generally, in the simulations, it is desirable to have around 25% of proposed candidate values accepted [29]. The idea is that when the proposal steps are too large (we make a proposal that is far from the current state *θ* of the Markov chain), we do not explore local regions sufficiently well, and many of the candidates are very likely to be rejected. When the proposal steps are very small, the acceptance rate will be very high; however, we are then unlikely to leave regions around a local maximum, or convergence will be very slow.

**Algorithm 1** Metropolis-Hastings Algorithm.


The performance of Metropolis–Hastings depends on the choice of the proposal distribution *q*(·). In the simulation studies, we consider random-walk proposals of the form $\theta^*_{i+1} = \theta_i + \epsilon_i$, where *i* is the iteration of the algorithm and $\epsilon_i$ is assumed to be Gaussian. More information on the theoretical properties of this algorithm can be found in [30].
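A minimal random-walk Metropolis-Hastings sketch in the spirit of Algorithm 1 follows, applied to a toy one-dimensional Gaussian target; since the Gaussian proposal is symmetric, *q* cancels in the acceptance ratio of Equation (27):

```python
import numpy as np

def metropolis_hastings(log_post, theta0, step, n_iter, seed=0):
    """Random-walk MH: propose theta* = theta + step * z, accept w.p. alpha."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    chain, accepted = [theta.copy()], 0
    for _ in range(n_iter):
        prop = theta + step * rng.normal(size=theta.shape)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # symmetric proposal
            theta, lp = prop, lp_prop
            accepted += 1
        chain.append(theta.copy())
    return np.array(chain), accepted / n_iter

# toy target: a standard normal log-density (up to a constant)
chain, rate = metropolis_hastings(lambda t: -0.5 * float(t @ t), [3.0], 1.0, 5000)
```

The step size controls the acceptance rate discussed above: larger steps lower the rate, smaller steps raise it at the cost of slower exploration.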

#### 3.2.4. Particle Metropolis-Hastings

Particle Markov Chain Monte Carlo (PMCMC) methods were introduced in [4]. The basic idea is that MCMC methods, and in particular the Metropolis–Hastings algorithm, which is of interest to us, can be combined with Sequential Monte Carlo to make draws from the posterior distributions of the parameters. Algorithm 2 presents the particle Metropolis–Hastings algorithm. The difference from standard Metropolis-Hastings is the quantity $\hat{p}_{\theta^*}(y_{1:T})$, the estimate of the likelihood obtained with a particle filter conditional on the parameter vector *θ*. In this algorithm, $q(\theta^{(i-1)}|\theta^*)$ is the proposal distribution (which cancels out when it is symmetric), and *π*(·) is the prior distribution.

#### **Algorithm 2** Particle Metropolis-Hastings.


**for** *i* = 1, ... , *M* **do**


$$\min\left\{1, \frac{\hat{p}\_{\theta^\*}\left(y\_{1:T}\right)}{\hat{p}\_{\theta^{(i-1)}}\left(y\_{1:T}\right)} \frac{\pi\left(\theta^\*\right)}{\pi\left(\theta^{(i-1)}\right)} \frac{q\left(\theta^{(i-1)}|\theta^\*\right)}{q\left(\theta^\*|\theta^{(i-1)}\right)}\right\}\tag{28}$$

7: Set $\theta^{(i)} = \theta^*$, $h^{(i)}_{1:T} = h^*_{1:T}$, and $\hat{p}_{\theta^{(i)}}(y_{1:T}) = \hat{p}_{\theta^*}(y_{1:T})$;
8: Otherwise set $\theta^{(i)} = \theta^{(i-1)}$, $h^{(i)}_{1:T} = h^{(i-1)}_{1:T}$, and $\hat{p}_{\theta^{(i)}}(y_{1:T}) = \hat{p}_{\theta^{(i-1)}}(y_{1:T})$.
9: **end for**
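The loop of Algorithm 2 can be sketched as follows. To keep the example self-contained, we use a noisy stand-in for the particle-filter log-likelihood estimate $\hat{p}_\theta(y_{1:T})$ and a flat prior; all names and values here are illustrative:

```python
import numpy as np

def particle_mh(y, loglik_hat, log_prior, theta0, step, n_iter, seed=0):
    """Particle MH: a fresh likelihood estimate enters the ratio of Equation (28)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    ll = loglik_hat(theta, y, rng)
    chain = [theta.copy()]
    for _ in range(n_iter):
        prop = theta + step * rng.normal(size=theta.shape)  # random walk
        ll_prop = loglik_hat(prop, y, rng)  # estimate of p_hat_{theta*}(y_{1:T})
        log_ratio = ll_prop - ll + log_prior(prop) - log_prior(theta)
        if np.log(rng.uniform()) < log_ratio:
            theta, ll = prop, ll_prop  # accept: keep the estimate as well
        chain.append(theta.copy())
    return np.array(chain)

# stand-in "estimator": exact i.i.d. N(theta, 1) log-likelihood plus noise,
# mimicking the variance of a particle-filter likelihood estimate
def noisy_ll(theta, y, rng):
    return -0.5 * np.sum((y - theta[0]) ** 2) + 0.1 * rng.normal()

y = np.random.default_rng(1).normal(0.5, 1.0, size=50)
chain = particle_mh(y, noisy_ll, lambda t: 0.0, np.array([0.0]), 0.3, 2000)
```

In a full implementation, `loglik_hat` would be a particle filter such as the bootstrap filter, and the stored estimate is reused until the next accepted move, exactly as in steps 7 and 8 of Algorithm 2.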

#### *3.3. MCMC with Gradient Information*

In this section, we discuss Riemann manifold Langevin and Hamiltonian Monte Carlo methods, which were introduced in [5] and which, in particular, can be applied to stochastic volatility models.

The method originates in the statistical physics literature and provides a tool that allows one to make large transitions with high acceptance probability, something that standard methods such as Metropolis–Hastings fail to achieve. The idea of HMC is based on the relation between differential geometry and statistical theory (MCMC in particular). Girolami and Calderhead [5] propose Metropolis-adjusted Langevin and Hamiltonian Monte Carlo sampling algorithms that are defined on a Riemann manifold. Their methods allow us to overcome the problem of sampling from high-dimensional densities that may exhibit strong correlation. We further provide the general background and a summary of the algorithms, together with the quantities necessary for their implementation in the case of stochastic volatility models. It is not our goal to provide the theoretical foundations of these methods in this article; for deeper theoretical foundations, see [31–33].

In the standard MCMC setting, one uses a probability distribution to make a proposal for the next state of the Markov chain. Hamiltonian Monte Carlo methods exploit the dynamics of a physical system to make proposals for the next state. This can improve the mixing drastically and result in a more efficient algorithm. Since we are interested in multivariate modeling, a more efficient exploration of the posterior distribution is of particular interest: once the dimension of the model grows, it becomes very hard for a standard random walk to make proposals that are accepted frequently enough to yield a well-mixing Markov chain. We first introduce some basic ideas on which the Hamiltonian Monte Carlo method is built; for an extensive introduction, we refer to [33].

#### 3.3.1. Metropolis-Adjusted Langevin Algorithm

Previously, we discussed the Metropolis–Hastings algorithm, whose idea is to make a new proposal *θ*∗ using a random walk. This proposal is then accepted with probability

$$\alpha(\theta, \theta^*) = \min \left\{ 1, \frac{\bar{p}(\theta^*)}{\bar{p}(\theta)} \frac{q(\theta \mid \theta^*)}{q(\theta^* \mid \theta)} \right\}. \tag{29}$$

Although this algorithm enjoys desirable theoretical guarantees, the random walk proposal is not efficient, especially when the number of parameters in the model becomes large. The Metropolis-adjusted Langevin algorithm (MALA), originally proposed in [34], is designed to solve the same problem of sampling from the target distribution. The main advantage of MALA over Metropolis–Hastings is the construction of the proposal for the candidate parameter *θ*∗. The proposal mechanism for MALA originates from the stochastic differential equation of Langevin diffusion and reads

$$
\theta^* = \theta^n + \epsilon^2 \nabla_\theta L(\theta^n) / 2 + \epsilon z^n,\tag{30}
$$

where we define $L(\theta) = \log p(\theta)$, $z \sim N(z \mid 0, I)$, and $\epsilon$ is the integration step size. Convergence for this proposal is not guaranteed unless we apply a Metropolis acceptance step after every integration step. For convenience, let us define

$$
\mu(\theta^n, \epsilon) = \theta^n + \frac{\epsilon^2}{2} \nabla_\theta L(\theta^n);\tag{31}
$$

then the proposal density can be written as $q(\theta^* \mid \theta^n) = N(\theta^* \mid \mu(\theta^n, \epsilon), \epsilon^2 I)$. The standard acceptance probability follows

$$\min\{1,\ p(\theta^*)\,q(\theta^n \mid \theta^*) \,/\, p(\theta^n)\,q(\theta^* \mid \theta^n)\}.\tag{32}$$

The type of proposal in Equation (30) is inefficient for strongly correlated parameters *θ*. To solve this issue, one can use a preconditioning matrix *M*:

$$
\theta^* = \theta^n + \epsilon^2 M \nabla_\theta L(\theta^n)/2 + \epsilon \sqrt{M}\, z^n. \tag{33}
$$

Unfortunately, there is no principled way to choose the matrix *M*; as we will see later, HMC encounters the same problem. In general, MALA iterates between two steps. First, Langevin dynamics, which exploits the gradient of the target, is used to generate a proposal. Second, the proposal is accepted or rejected as in the Metropolis–Hastings algorithm.
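The two MALA steps above can be sketched in a few lines of code. This is a minimal illustration on a generic target, not the implementation of [34]: `log_p` and `grad_log_p` stand for the log-target $L$ and its gradient, and all function and variable names are our own.

```python
import numpy as np

def mala_step(theta, grad_log_p, log_p, eps, rng):
    """One Metropolis-adjusted Langevin step (cf. Equations (30)-(32))."""
    def mean(t):
        # Drift of the Langevin proposal, Equation (31)
        return t + 0.5 * eps**2 * grad_log_p(t)

    # Langevin proposal: theta* = mu(theta, eps) + eps * z
    prop = mean(theta) + eps * rng.standard_normal(theta.shape)

    def log_q(to, frm):
        # log N(to | mu(frm, eps), eps^2 I), up to a constant
        d = to - mean(frm)
        return -0.5 * float(np.sum(d * d)) / eps**2

    # Metropolis correction, Equation (32)
    log_alpha = log_p(prop) + log_q(theta, prop) - log_p(theta) - log_q(prop, theta)
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return theta, False
```

Run in a loop, this produces a Markov chain whose stationary distribution is the target; the only tuning variable is the step size `eps`.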

#### 3.3.2. Hamiltonian Monte Carlo Algorithm

The HMC algorithm [31] also uses gradient information to construct the proposal of the parameters in the MCMC scheme. In particular, it exploits ideas from the simulation of physical systems: HMC performs sampling by exploiting *Hamiltonian dynamics*. A conceptual introduction to this class of methods and its relationship to differential geometry can be found in [33]. In this section, we discuss the general idea behind the algorithm without performing detailed derivations. We focus on the final proposal machinery that can be used in practice and examine which quantities need to be computed manually before implementing the algorithm and which variables need to be calibrated for its successful performance. First, let us consider the general set-up. In Hamiltonian Monte Carlo, we consider the Hamiltonian function

$$H(\theta, p) = -\log p(\theta) + \frac{1}{2}\log\{(2\pi)^D |M|\} + \frac{1}{2}p^T M^{-1} p,\tag{34}$$

which consists of the potential energy $E(\theta) = -L(\theta)$ and the kinetic energy $K(p) = \frac{1}{2}\log\{(2\pi)^D |M|\} + \frac{1}{2}p^T M^{-1} p$; the variables $p$ are called momentum variables. The dynamics of the system then evolve according to Hamilton's equations

$$\frac{d\theta}{d\tau} = \frac{\partial H}{\partial p} = M^{-1} p, \tag{35}$$

$$\frac{dp}{d\tau} = -\frac{\partial H}{\partial \theta} = \nabla_{\theta} L(\theta),\tag{36}$$

where *τ* denotes continuous time in the physical interpretation of the system. Practical implementation requires discretization, and the commonly used scheme for this purpose is the leapfrog discretization:

$$p(\tau + \epsilon/2) = p(\tau) + \epsilon \nabla\_{\theta} L\{\theta(\tau)\}/2,\tag{37}$$

$$
\theta(\tau+\epsilon) = \theta(\tau) + \epsilon \mathcal{M}^{-1} p(\tau+\epsilon/2),
\tag{38}
$$

$$p(\tau + \epsilon) = p(\tau + \epsilon/2) + \epsilon \nabla\_{\theta} L\{\theta(\tau + \epsilon)\}/2. \tag{39}$$

This scheme does not sample from the target distribution exactly; to correct for this, a Metropolis acceptance step is necessary. For a proposal (*θ*, *p*) → (*θ*∗, *p*∗), the acceptance probability in this algorithm is defined as

$$\min\{1, \exp\{-H(\boldsymbol{\theta}^\*, \boldsymbol{p}^\*) + H(\boldsymbol{\theta}, \boldsymbol{p})\}\}.$$

Thus, HMC iterates between updating momentum variables, proposal for the parameter values, additional update to the momentum variables, and then an acceptance/rejection step. The Gibbs sampler provides a good understanding for the system evolution in this algorithm:

$$p^{n+1}|\theta^n \sim p(p^{n+1}|\theta^n) = p(p^{n+1}) = N(p^{n+1}|0, \mathcal{M}),\tag{40}$$

$$
\theta^{n+1}|p^{n+1} \sim p(\theta^{n+1}|p^{n+1})\tag{41}
$$

Similarly to MALA, the choice of the matrix *M* is crucial for good performance of HMC. While the step size and the number of leapfrog steps can be tuned relatively easily by monitoring the acceptance rate, the choice of the matrix *M* is challenging, and there is no principled way to define it. The number of leapfrog steps and the step size are the two variables that need to be calibrated when implementing HMC. Usually, different combinations of these two variables are considered, and the combination leading to the highest acceptance rate is picked.
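The leapfrog integration of Equations (37)–(39) together with the Metropolis correction can be sketched as follows. This is a minimal illustration assuming a diagonal mass matrix $M$; the function and parameter names are our own, not from [31].

```python
import numpy as np

def hmc_step(theta, log_p, grad_log_p, eps, n_leapfrog, M_diag, rng):
    """One HMC step: momentum refresh, leapfrog trajectory
    (Equations (37)-(39)), and Metropolis accept/reject."""
    p = rng.standard_normal(theta.shape) * np.sqrt(M_diag)  # p ~ N(0, M)
    theta_new, p_new = theta.copy(), p.copy()

    p_new = p_new + 0.5 * eps * grad_log_p(theta_new)       # initial half step
    for _ in range(n_leapfrog - 1):
        theta_new = theta_new + eps * p_new / M_diag        # full position step
        p_new = p_new + eps * grad_log_p(theta_new)         # full momentum step
    theta_new = theta_new + eps * p_new / M_diag
    p_new = p_new + 0.5 * eps * grad_log_p(theta_new)       # final half step

    # Hamiltonian H = -log p(theta) + p^T M^{-1} p / 2 (constants cancel)
    def H(t, m):
        return -log_p(t) + 0.5 * float(np.sum(m * m / M_diag))

    if np.log(rng.uniform()) < H(theta, p) - H(theta_new, p_new):
        return theta_new, True
    return theta, False
```

The two tuning variables discussed above appear explicitly as `eps` (step size) and `n_leapfrog` (number of leapfrog steps); `M_diag` plays the role of the mass matrix *M*.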

#### *3.4. Riemann Manifold Hamiltonian Monte Carlo*

A further improvement of HMC and MALA can be achieved by defining the algorithms on a Riemann manifold instead of Euclidean space. Proposals guided by a Riemann metric instead of the Euclidean distance have the potential to explore the parameter space more efficiently, especially when the target density is high-dimensional or exhibits strong correlation [5]. The method was originally proposed in [5], where multiple algorithms were compared: MALA, MMALA, HMC, and RMHMC. For a detailed comparison between these methods, we refer to [5]; in our simulation studies, we focus on comparing RMHMC and particle Metropolis–Hastings for the estimation of parameters in stochastic volatility models.

Girolami and Calderhead [5] define HMC methods on the Riemann manifold, which can be seen as a generalization of HMC. The Hamiltonian on the Riemann manifold is defined as follows:

$$H(\theta, p) = -\log p(\theta) + \frac{1}{2}\log\{(2\pi)^n \mid G(\theta) \mid\} + \frac{1}{2}p^T G(\theta)^{-1} p \tag{42}$$

with exp(−*H*(*θ*, *p*)) = *p*(*θ*, *p*) = *p*(*θ*)*p*(*p* | *θ*) and the marginal target density

$$p(\theta) \propto \int \exp(-H(\theta, p))\,dp = \frac{\exp\{\log p(\theta)\}}{\sqrt{(2\pi)^n \mid G(\theta) \mid}} \int \exp\left\{-\frac{1}{2}p^T G(\theta)^{-1} p\right\} dp = \exp(\log p(\theta)).\tag{43}$$

The general idea behind the updates in RMHMC is similar to that of HMC, and the updates for the momentum variables and parameters of the model are defined in Equations (44)–(46).

$$p(\tau + \frac{\epsilon}{2}) = p(\tau) - \frac{\epsilon}{2} \nabla\_{\theta} H\{\theta(\tau), p(\tau + \frac{\epsilon}{2})\},\tag{44}$$

$$\theta(\tau+\epsilon) = \theta(\tau) + \frac{\epsilon}{2} \left[\nabla_{p}H\left\{\theta(\tau), p(\tau+\tfrac{\epsilon}{2})\right\} + \nabla_{p}H\left\{\theta(\tau+\epsilon), p(\tau+\tfrac{\epsilon}{2})\right\}\right],\tag{45}$$

$$p(\tau + \epsilon) = p(\tau + \frac{\epsilon}{2}) - \frac{\epsilon}{2} \nabla\_{\theta} H\left\{\theta(\tau + \epsilon), p(\tau + \frac{\epsilon}{2})\right\} \tag{46}$$

Therefore, as in the standard HMC algorithm, we iterate between a half-step update of the momentum variables, an update of the position variables, an additional half-step update of the momentum variables, and a Metropolis acceptance/rejection step with probability

$$\min\{1, \exp\{-H(\boldsymbol{\theta}^\*, \boldsymbol{p}^\*) + H(\boldsymbol{\theta}^{\boldsymbol{n}}, \boldsymbol{p}^{\boldsymbol{n}+1})\}\}.$$

Similarly to HMC, RMHMC can be viewed as a Gibbs sampling scheme

$$p^{n+1}|\theta^n \sim p(p^{n+1}|\theta^n) = N\{p^{n+1}|0, G(\theta^n)\},\tag{47}$$

$$
\theta^{n+1}|p^{n+1} \sim p(\theta^{n+1}|p^{n+1}).\tag{48}
$$

Recall that in the case of MALA and HMC, the matrix *M* has to be chosen manually, and there is no principled way to choose it. In RMHMC, the matrix *G*(*θ*) is defined at each step by the underlying geometry; see [5] for more details. Below, we discuss the quantities that need to be computed for the implementation of RMHMC in the case of the stochastic volatility model, in particular *G*(*θ*).

Recall the stochastic volatility model parametrized through *β*:

$$y_t = \beta \exp(h_t/2)\,\epsilon_t, \tag{49}$$

$$h\_{t+1} = \phi h\_t + \eta\_{t+1} \tag{50}$$

where $\epsilon_t \sim N(0, 1)$, $\eta_t \sim N(0, \sigma^2)$, and $h_1 \sim N(0, \sigma^2/(1 - \phi^2))$. The joint likelihood of the model is

$$p(y, h, \beta, \phi, \sigma) = \prod_{t=1}^{T} p(y_t \mid h_t, \beta) \prod_{t=2}^{T} p(h_t \mid h_{t-1}, \phi, \sigma)\, \pi(\beta)\, \pi(\phi)\, \pi(\sigma). \tag{51}$$

The prior distributions are chosen as follows

$$
\pi(\beta) \propto \beta^{-1}, \qquad \sigma^2 \sim \text{Inv-}\chi^2(10, 0.05), \qquad (\phi + 1)/2 \sim \text{Beta}(20, 1.5). \tag{52}
$$

Further, following [5], we write the partial derivatives of *L* = log *p*(*y*, *h* | *β*, *φ*, *σ*):

$$\frac{\partial L}{\partial \beta} = -\frac{T}{\beta} + \sum_{t=1}^{T} \frac{y_t^2}{\beta^3 \exp(h_t)},\tag{53}$$

$$\frac{\partial L}{\partial \phi} = -\frac{\phi}{(1 - \phi^2)} + \frac{\phi h\_1^2}{\sigma^2} + \sum\_{t=2}^T \frac{h\_{t-1}(h\_t - \phi h\_{t-1})}{\sigma^2},\tag{54}$$

$$\frac{\partial L}{\partial \sigma} = -\frac{T}{\sigma} + \frac{h\_1^2 (1 - \phi^2)}{\sigma^3} + \sum\_{t=2}^T \frac{(h\_t - \phi h\_{t-1})^2}{\sigma^3}. \tag{55}$$

To implement the algorithms, we require the expressions for the individual components of the metric tensor for the likelihood. Following [5], the expressions are

$$\mathbb{E}\left\{\frac{\partial L}{\partial \boldsymbol{\beta}}\frac{\partial L}{\partial \boldsymbol{\beta}}\right\} = \frac{2T}{\beta^2}, \qquad \mathbb{E}\left\{\frac{\partial L}{\partial \boldsymbol{\sigma}}\frac{\partial L}{\partial \boldsymbol{\sigma}}\right\} = \frac{2T}{\sigma^2}, \qquad \mathbb{E}\left\{\frac{\partial L}{\partial \boldsymbol{\beta}}\frac{\partial L}{\partial \boldsymbol{\sigma}}\right\} = \mathbb{E}\left\{\frac{\partial L}{\partial \boldsymbol{\beta}}\frac{\partial L}{\partial \boldsymbol{\phi}}\right\} = 0,\tag{56}$$

$$\mathbb{E}\left\{\frac{\partial L}{\partial \sigma}\frac{\partial L}{\partial \phi}\right\} = \frac{2\phi}{\sigma^3(1-\phi^2)}, \qquad \mathbb{E}\left\{\frac{\partial L}{\partial \phi}\frac{\partial L}{\partial \phi}\right\} = \frac{2\phi^2}{(1-\phi^2)^2} + \frac{T-1}{1-\phi^2}.\tag{57}$$

Furthermore, the expressions for the metric tensor for the likelihood and its partial derivatives follow

$$G(\phi, \sigma, \beta) = \begin{bmatrix} \frac{2T}{\beta^2} & 0 & 0\\ 0 & \frac{2T}{\sigma^2} & \frac{2\phi}{\sigma^3 (1 - \phi^2)}\\ 0 & \frac{2\phi}{\sigma^3 (1 - \phi^2)} & \frac{2\phi^2}{(1 - \phi^2)^2} + \frac{T - 1}{1 - \phi^2} \end{bmatrix}, \tag{58}$$
 

$$
\frac{\partial G}{\partial \beta} = \begin{bmatrix} -\frac{4T}{\beta^3} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \tag{59}
$$

$$
\frac{\partial G}{\partial \sigma} = \begin{bmatrix}
0 & 0 & 0 \\
0 & -\frac{4T}{\sigma^3} & -\frac{6\phi}{\sigma^4(1-\phi^2)} \\
0 & -\frac{6\phi}{\sigma^4(1-\phi^2)} & 0
\end{bmatrix},
\tag{60}
$$

$$\frac{\partial G}{\partial \phi} = \begin{bmatrix} 0 & 0 & 0\\ 0 & 0 & \frac{2}{\sigma^3 (1 - \phi^2)} + \frac{4\phi^2}{\sigma^3 (1 - \phi^2)^2} \\ 0 & \frac{2}{\sigma^3 (1 - \phi^2)} + \frac{4\phi^2}{\sigma^3 (1 - \phi^2)^2} & \frac{2\phi (1 + T)}{(1 - \phi^2)^2} + \frac{6\phi^3}{(1 - \phi^2)^3} \end{bmatrix}. \tag{61}$$

The proposal machinery in RMHMC provides advantages for exploring the parameter space efficiently. However, it is not easily adaptable to different model specifications, especially when the model's dimensionality increases, as we discussed in Section 1. In particular, although the matrix G can be computed exactly for the multivariate model specified in Equations (12) and (13), it scales quadratically with the number of parameters. This might be one of the reasons why the method has not been used on the multivariate stochastic volatility models we introduced in Section 1. However, probabilistic programming languages [35,36] and the automatic differentiation capabilities developed in recent years allow for efficient and adaptable implementations of these algorithms in practice.
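For the univariate model, the metric tensor of Equation (58) is cheap to evaluate. A minimal sketch, with our own illustrative function name and the (β, σ, φ) ordering used in Equation (58), is:

```python
import numpy as np

def metric_tensor(beta, sigma, phi, T):
    """Metric tensor G(theta) from Equation (58), ordered (beta, sigma, phi)."""
    g_sp = 2 * phi / (sigma**3 * (1 - phi**2))  # cross term E{dL/dsigma dL/dphi}
    return np.array([
        [2 * T / beta**2, 0.0, 0.0],
        [0.0, 2 * T / sigma**2, g_sp],
        [0.0, g_sp, 2 * phi**2 / (1 - phi**2)**2 + (T - 1) / (1 - phi**2)],
    ])
```

Positive definiteness (required for the momentum distribution $N(0, G(\theta))$) can be checked at any parameter value via a Cholesky factorization of the returned matrix.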

#### *3.5. Integrated Nested Laplace Approximation*

Integrated Nested Laplace Approximation was introduced in [7]. The method is based on a nested version of the classical Laplace approximation and was introduced for latent Gaussian models (LGMs). It has become a popular approach in Bayesian inference due to its good performance in a variety of models in the class of LGMs and its computational advantages over other methods in the Bayesian literature. The computational appeal of this method comes from the possibility of exploiting sparse matrix computations when evaluating certain approximations. INLA has found applications in many fields where high-dimensional problems arise. Stochastic volatility models have been analyzed using INLA in [37,38]. A bivariate stochastic volatility model was considered in [39], where the authors present and solve some issues that arise when using INLA in the multivariate case of the model. One of the conclusions of that study was that INLA loses its computational advantage as the dimensionality of the stochastic volatility model increases. We further discuss the details of the method and its implementation shortcomings in the multivariate case, and we present a simulation study that illustrates the performance of the discussed approach.

The stochastic volatility model can be written in the form of an LGM:

$$y \mid h, \theta_1 \sim \prod_{i \in \mathcal{I}} \pi(y_i \mid h_i, \theta_1), \tag{62}$$

$$h \mid \theta\_2 \sim N(\mu(\theta\_2), Q^{-1}(\theta\_2)). \tag{63}$$

As before, *yt* is the observed data and *ht* is the latent volatility process; we are interested in the posterior distribution of the model parameters *θ* and of the latent process given the data:

$$p(h, \theta \mid y) \propto p(\theta)\, p(h \mid \theta) \prod_{t=1}^{T} p(y_t \mid h_t, \theta). \tag{64}$$

The outline of the INLA approach can be summarized as follows [7,37]. The full conditional of the latent field takes the form


$$p(\mathbf{x} \mid \mathbf{y}, \boldsymbol{\theta}) \propto \exp\left\{-\frac{1}{2} \mathbf{x}^T \mathbf{Q} \mathbf{x} + \sum \mathbf{g}\_t(\mathbf{h}\_t)\right\},\tag{65}$$

where *x* = (*μ*, *h*) and *gt*(*ht*) = log *p*(*yt* | *ht*, *θ*). By matching the mode and the curvature at the mode, we obtain the Gaussian approximation for our model

$$p_G(x \mid y, \theta) = K_1 \exp\left\{-\frac{1}{2}(x - m)^T (Q + \text{diag}(c))(x - m)\right\},\tag{66}$$

where *K*<sup>1</sup> is a normalizing constant, *m* is the modal value of *p*(*x* | *y*, *θ*), the vector *c* contains the second order terms in the Taylor expansion of ∑ *gt*(*ht*) at the modal value m, and Q is the precision matrix that has the form

$$Q = \begin{bmatrix} 1 & -\phi \\ -\phi & 1 + \phi^2 & -\phi \\ & \ddots & \ddots & \ddots \\ & & -\phi & 1 + \phi^2 & -\phi \\ & & & -\phi & 1 \end{bmatrix} . \tag{67}$$

The sparsity of the precision matrix above allows one to exploit efficient sparse matrix computational methods and thus gain computational speed. Note that in the multivariate case, this advantage disappears since the matrix Q is not sparse anymore.
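The tridiagonal structure of Q in Equation (67) is easy to construct and verify numerically. The sketch below builds the precision matrix of a stationary AR(1) latent path (up to the $1/\sigma^2$ scale factor); it is our own illustration, not code from [7].

```python
import numpy as np

def ar1_precision(phi, T):
    """Tridiagonal precision matrix Q of a stationary AR(1) path,
    as in Equation (67), up to the 1/sigma^2 scale factor."""
    Q = np.zeros((T, T))
    idx = np.arange(T)
    Q[idx, idx] = 1.0 + phi**2          # interior diagonal entries
    Q[0, 0] = Q[-1, -1] = 1.0           # boundary corrections
    Q[idx[:-1], idx[1:]] = -phi         # superdiagonal
    Q[idx[1:], idx[:-1]] = -phi         # subdiagonal
    return Q
```

Because Q is banded, its Cholesky factor can be computed in O(T) time, which is exactly the sparse-matrix advantage mentioned above; a dense multivariate precision matrix loses this property.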

When it comes to the estimation of stochastic volatility models, approximation of the marginals *p*(*ht* | *θ*, *y*) is always the most challenging task. The solution that is proposed in [7] is (simplified) Laplace approximation of the form

$$\log \tilde{p}_{SLA}(x_t \mid \theta, y) = \text{const} - \frac{1}{2} x_t^{2} + \gamma_t^{(1)}(\theta)\,x_t + \frac{1}{6}\, x_t^{3}\, \gamma_t^{(3)}(\theta) + \dots,\tag{68}$$

where $\gamma_t^{(1)}$ and $\gamma_t^{(3)}$ are the terms in the Taylor expansion. The final step of the method is to approximate *p*(*xt* | *y*) with the numerical integration scheme

$$\tilde{p}(x_t \mid y) = \sum_{k} \tilde{p}(x_t \mid \theta^k, y)\, \tilde{p}(\theta^k \mid y)\, \Delta_k, \tag{69}$$

for a set of points *θk*, selected by creating a grid that covers the area of high density of *p*˜(*θ* | *y*). For more details on the implementation of the simplified Laplace approximation and the selection of the grid of points *θk*, see [7,37].

#### *3.6. Fixed-Form Variational Bayes*

In this section, we discuss how the posterior distribution can be approximated using the fixed-form variational Bayes method proposed in [6]. The general idea of fixed-form variational inference is to assume a certain factorization of the prior distribution, which naturally leads to a factorized structure of the posterior. The factors of the posterior approximation are then assumed to come from a certain parametric family of distributions (for example, the exponential family), and instead of a sampling task, as in the previous sections, we perform an optimization task: minimizing the divergence between the approximating distribution and the unknown posterior distribution.

As before, assume we observe a process $\{y_t\}_{t=1}^{T}$ that is driven by an unobservable, latent process $\{h_t\}_{t=1}^{T}$. Recall that Bayes' rule gives us the posterior distribution of the parameters of the system:

$$p(h \mid y) \propto g(y \mid h) \pi(h). \tag{70}$$

In the Bayesian framework, we formulate our prior beliefs, which we update once we acquire more data. In general, variational Bayes methods focus on approximating the posterior distribution *p*(*h* | *y*) with some distribution *q*(*h* | *y*). Further, it is common to choose blocks of the parameters and impose independence between these blocks:

$$p(h \mid y) \approx q(h \mid y) = q(h\_1 \mid y)q(h\_2 \mid y). \tag{71}$$

By construction, the blocks of the parameters are independent a posteriori; in the literature, this is referred to as *the mean-field assumption*. To find the optimal approximation, we minimize the Kullback–Leibler (KL) divergence from *q*(*h* | *y*) to *p*(*h* | *y*):

$$\tilde{p}(h \mid y) = \underset{q(h_1 \mid y)\,q(h_2 \mid y)}{\arg\min} \, KL(\,q(h_1 \mid y)\,q(h_2 \mid y) \mid\mid p(h \mid y)\,). \tag{72}$$

Distributional approximation can be viewed as an optimization problem; i.e., an optimal distribution has to be chosen from the space of all possible distributions, with the KL divergence as the loss function [40]. Salimans et al. [6] propose a specific approach to this KL minimization problem, based on the similarity between the optimal solution of the problem and linear regression.
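To make the optimization view concrete, the sketch below fits a Gaussian $q$ to a one-dimensional unnormalized target by stochastic minimization of $KL(q\,\|\,p)$ with reparameterized gradients. This is a generic illustration of the problem in Equation (72), not the regression-based scheme of Salimans et al. [6], and all names in it are our own.

```python
import numpy as np

def fit_gaussian_vb(log_p, n_iters=2000, lr=0.02, n_samples=16, seed=0):
    """Minimize KL(q || p) over q = N(mu, exp(2*log_s)) by stochastic
    gradient descent with the reparameterization x = mu + exp(log_s) * z."""
    rng = np.random.default_rng(seed)
    mu, log_s = 0.0, 0.0
    for _ in range(n_iters):
        z = rng.standard_normal(n_samples)
        x = mu + np.exp(log_s) * z
        # Numerical gradient of log p at the samples (finite differences)
        h = 1e-5
        glp = (log_p(x + h) - log_p(x - h)) / (2 * h)
        # KL(q||p) = -log s - E_q[log p] + const; reparameterized gradients:
        g_mu = -np.mean(glp)
        g_ls = -np.mean(glp * z * np.exp(log_s)) - 1.0  # -1 from the entropy term
        mu -= lr * g_mu
        log_s -= lr * g_ls
    return mu, np.exp(log_s)
```

For a Gaussian target, the fitted `mu` and scale converge to the target's own mean and standard deviation, which illustrates that the KL objective has the exact posterior as its minimizer when the approximating family contains it.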


Consider the stochastic volatility model

$$y\_t = \beta \exp(h\_t/2)\epsilon\_t \tag{73}$$

$$h\_{t+1} = \phi h\_t + \eta\_{t+1} \tag{74}$$

with $h_1 \sim N(0, \sigma^2/(1 - \phi^2))$, $\epsilon_t \sim N(0, 1)$, and $\eta_t \sim N(0, \sigma_\eta^2)$. We specify our a priori beliefs in the following manner:

$$p(\beta) \propto \beta^{-1}, \qquad (\phi + 1)/2 \sim \text{Beta}(20, 1.5), \qquad \sigma^2 \sim IG(5, 0.25). \tag{75}$$

To apply the Variational Bayes method, we need to specify the posterior approximations *q*(*θ*). It is convenient to assume a hierarchical structure of the prior, in which case it factorizes to

$$p(\phi, \sigma^2, \beta, f) = p(\phi)p(\sigma^2)p(f \mid \phi, \sigma^2)p(y \mid f),\tag{76}$$

where *f* = (log(*β*), *h* ). The hierarchical structure of the prior leads to the following factorization of the posterior approximation

$$q_{\xi}(\sigma^{2}, f \mid \phi) = q_{\xi}(\sigma^{2} \mid \phi)\, q_{\xi}(f \mid \phi, \sigma^{2}) = \frac{q_{\xi}(\sigma^{2} \mid \phi)\, p(f \mid \phi, \sigma^{2})\, q_{\xi}(y \mid f)}{q_{\xi}(y \mid \phi, \sigma^{2})}. \tag{77}$$

Thus, the posterior approximations can be chosen as follows

$$q_{\xi}((\phi+1)/2) = \text{Beta}(\xi_1, \xi_2),\tag{78}$$

$$q_{\xi}(\sigma^2 \mid \phi) = IG(\xi_3, \xi_4 + \xi_5\,\phi^2),\tag{79}$$

$$q(\log(\beta), h \mid \phi, \sigma^2) = N(m, V), \tag{80}$$

where

$$V^{-1} = P(\phi, \sigma^2) + \xi_6, \qquad m = V\,\xi_7,$$

where $P(\phi, \sigma^2)$ is the precision matrix of $p(\log(\beta), h \mid \phi, \sigma^2)$.

Once the posterior approximations are initialized, we proceed with the next step and iterate over the parameters. The parameters are updated in blocks that correspond to the factorization of the posterior approximations. First, we update the block for the persistence parameter in the latent process *q<sup>ξ</sup>* (*φ*)

$$
\phi^* = s_1(\xi, z_1^*), \text{ with } s_1(\cdot) \text{ and } z_1^* \text{ such that } \phi^* \sim q_{\xi}(\phi), \tag{81}
$$

$$
\sigma^{2*} = s_2(\xi, z_2^*, \phi^*), \text{ with } s_2(\cdot) \text{ and } z_2^* \text{ such that } \sigma^{2*} \sim q_{\xi}(\sigma^2 \mid \phi^*), \tag{82}
$$

$$\hat{C}_1 = \nabla_{\xi} [s_1(\xi, z_1^*)]\, \nabla_{\phi} [T_1(\phi^*)], \tag{83}$$

$$\hat{g}_1 \approx \nabla_{\xi} [s_1(\xi, z_1^*)]\, \{\nabla_{\phi} [\log p(\phi^*) + \log q_{\xi}(y \mid \phi^*, \sigma^{2*}) - \log q_{\xi}(\sigma^{2*} \mid \phi^*)]\}. \tag{84}$$

Second, we update the block for the variance of the latent process *<sup>q</sup><sup>ξ</sup>* (*σ*<sup>2</sup> <sup>|</sup> *<sup>φ</sup>*)

$$
\hat{C}_2 = \nabla_{\xi} [s_2(\xi, z_2^*, \phi^*)]\, \nabla_{\sigma^2} [T_2(\sigma^{2*})], \tag{85}
$$

$$\hat{g}_2 \approx \nabla_{\xi} [s_2(\xi, z_2^*, \phi^*)]\, \nabla_{\sigma^2} [\log p(\sigma^{2*}) + \log q_{\xi}(y \mid \phi^*, \sigma^{2*})],\tag{86}$$

where *<sup>T</sup>*2(*σ*2∗) are the sufficient statistics of *<sup>q</sup><sup>ξ</sup>* (*σ*<sup>2</sup> <sup>|</sup> *<sup>φ</sup>*). The last update is the update of the likelihood approximation

$$a_{t+1} = (1 - \omega)\,a_t + \omega\, \mathbb{E}_{q_\xi(f \mid \phi^*, \sigma^{2*})} [\nabla_f \log p(y \mid f)],\tag{87}$$

$$z_{t+1} = (1 - \omega)\,z_t + \omega\, \mathbb{E}_{q_\xi(f \mid \phi^*, \sigma^{2*})}[f], \tag{88}$$

$$\xi_{6,t+1} = (1 - \omega)\,\xi_{6,t} - \omega\, \mathbb{E}_{q_\xi(f \mid \phi^*, \sigma^{2*})} [\nabla_f \nabla_f \log p(y \mid f)],\tag{89}$$

$$
\xi_{7,t+1} = a_{t+1} + \xi_{6,t+1}\, z_{t+1}. \tag{90}
$$

For more extensive derivations of the updates, we refer the reader to [6]. Further, one might wonder how the latent process is estimated in this procedure. Salimans et al. [6] propose using the Kalman filter to estimate the filtering distribution. Even though this is a valid approach, also undertaken in the quasi-maximum likelihood method [41], its weakness lies in the linearization of the observation equation, which implies that the distribution of the noise process is no longer Gaussian.

#### **4. Results**

In this section, we present the results of the comparison of the discussed methods. We compare two particle filters (the bootstrap and auxiliary particle filters) on the basis of bias, variance, and the estimated effective number of particles, and we choose the better-performing of the two for use in the particle Metropolis–Hastings algorithm. We compare particle Metropolis–Hastings (PMH), Riemann Manifold Hamiltonian Monte Carlo (RMHMC), integrated nested Laplace approximation (INLA), and fixed-form variational Bayes (VB) on the basis of how well the posterior distributions obtained with these methods capture the ground truth (i.e., the true parameter values). The ability of the methods to recover the ground truth is assessed on five simulated data sets with different underlying parameters. We additionally provide effective sample sizes for the comparison of the sampling methods (PMH and RMHMC). For illustration purposes, we also provide a comparison on two real-world data sets.

#### *4.1. Variance of the Estimated Likelihood*

As mentioned before, the marginal likelihood can be approximated sequentially through particle filtering. The marginal likelihood approximation given the parameters *θ* reads

$$p(y_{1:T} \mid \theta) \approx \prod_{t} \hat{p}(y_t \mid y_{1:t-1}, \theta),\tag{91}$$

where the right-hand side is obtained by running the particle filter presented in Algorithm A1. In practice, usually the log-likelihood

$$\log p\_{\theta}(y\_{1:T}) = \log p\_{\theta}(y\_1) + \sum\_{t=2}^{T} \log p\_{\theta}(y\_t|y\_{1:t-1}) \tag{92}$$

is estimated for numerical stability (as the product of small weights would lead to unstable results). The estimate of the log-likelihood is a by-product of particle filtering, as it is computed from the log-weights assigned to the particles at every time step. In this section, we compare the bootstrap (BPF) and auxiliary (APF) particle filters in terms of bias, variance, and number of effective particles. Both can be used to obtain simulated likelihood estimates, which can in turn be used in the particle Metropolis–Hastings algorithm. We denote by *L*ˆ the estimate of the likelihood obtained with a particle filter. Then, the bias and the variance can be estimated as follows:

$$Bias = 5000^{-1} \sum\_{i=1}^{K} \sum\_{j=1}^{M} (\log \hat{L}^j - \log L(y^i)),\tag{93}$$

$$Variance = 5000^{-1} \sum_{i=1}^{K} \sum_{j=1}^{M} (\log \hat{L}^j - \log L(y^i))^2,\tag{94}$$

where *yi* is the *i*-th time series and log *L* is the "true" log-likelihood value. For the comparison, we use *K* = 50 different time series generated from the stochastic volatility model and *M* = 100 Monte Carlo iterations. We use *N* = 100, *N* = 1000, and *N* = 10,000 particles for this study. As the true value of the likelihood is not available, we substitute it with an estimate obtained with *N* = 1,000,000 particles. First, we analyze the variance of the estimated likelihood at the true parameter values, as discussed in [24]. The authors of [27] showed theoretically that the asymptotic variance is not always smaller for the APF than for the BPF. We run additional simulation studies to examine whether the variance of the estimated likelihood varies over the parameter space. Since we are interested in using the estimated likelihood in a Markov chain Monte Carlo setting, how the variance behaves at different points of the parameter space is relevant: if we start far away from the true value and the variance of the estimated likelihood is larger in that part of the parameter space, this can affect the convergence and calibration of the algorithm. Table 1 shows the variance of the estimated likelihood for the bootstrap and auxiliary particle filters; *N* indicates the number of particles used for the estimation of the likelihood. It is clear that, on average, the APF performs better in terms of the variance of the estimated likelihood. Table 2 reports results for a similar experiment, but at the level of individual time series. We consider different data-generating processes and find that, in particular, a higher variance of the latent volatility process is associated with a higher variance of the estimated likelihood.
Finally, in Figures 2–4, we illustrate that the variance of the estimated likelihood changes depending on the location in the parameter space, and these changes can be specific to a data-generating process. These figures correspond to the experiments with time series 2, 3, and 4 from Table 2. The likelihood was estimated with *N* = 1000 particles. We observe that the variance of the latent process has a strong effect on the landscape of the variance of the estimated likelihood over the parameter space. From Figure 4c,d, we see that the variance of the estimated likelihood obtained with the bootstrap particle filter appears to be more strongly affected by the location in the parameter space than that obtained with the auxiliary particle filter. In Figure 3, we observe that the variance of the estimated likelihood is affected by the scale parameter *β* in the case of the auxiliary particle filter, but not so much in the case of the bootstrap particle filter. Thus, the initialization of PMCMC and the choice of the number of particles should be considered with care for optimal performance of the algorithm, as the variance of the estimated likelihood can differ across the parameter space, and these changes can vary across time series.
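The log-likelihood estimate of Equations (91) and (92) can be obtained with a bootstrap particle filter along the following lines. This is a minimal sketch for the stochastic volatility model of Equations (49) and (50) with multinomial resampling at every step; the function name and defaults are our own.

```python
import numpy as np

def bpf_loglik(y, beta, phi, sigma, N=1000, seed=0):
    """Bootstrap-particle-filter estimate of log p(y_{1:T} | theta) for
    y_t = beta * exp(h_t / 2) * eps_t, h_{t+1} = phi * h_t + eta_t."""
    rng = np.random.default_rng(seed)
    # Stationary initial draw h_1 ~ N(0, sigma^2 / (1 - phi^2))
    h = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), N)
    loglik = 0.0
    for t in range(len(y)):
        if t > 0:  # propagate through the transition density (bootstrap proposal)
            h = phi * h + sigma * rng.standard_normal(N)
        # Log observation weights: log N(y_t | 0, beta^2 * exp(h))
        s2 = beta**2 * np.exp(h)
        logw = -0.5 * (np.log(2 * np.pi * s2) + y[t] ** 2 / s2)
        m = logw.max()                      # log-sum-exp for numerical stability
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())      # log p-hat(y_t | y_{1:t-1})
        h = rng.choice(h, size=N, p=w / w.sum())  # multinomial resampling
    return loglik
```

Repeating the call with different seeds at a fixed *θ* gives a Monte Carlo picture of the variance of the estimated log-likelihood discussed above; evaluating it on a grid of *θ* values reproduces the kind of landscape shown in Figures 2–4.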

**Table 1.** Variance, bias, and number of effective particles *Neff* for the estimated likelihood with bootstrap particle filter (BPF) and auxiliary particle filter (APF) averaged over 50 time series. *Neff* is computed as in Equation (22). Variance and bias are computed as in Equations (93) and (94).


**Table 2.** Variance of the estimated likelihood for 10 different data-generating processes (*TS*). We consider different settings for the number of particles *N*.

**Figure 2.** Variance of the estimated likelihood in different points of the parameter space for *TS* = 2 from Table 2.

**Figure 3.** Variance of the estimated likelihood in different points of the parameter space for *TS* = 3 from Table 2.

**Figure 4.** Variance of the estimated likelihood in different points of the parameter space for *TS* = 4 from Table 2.

#### *4.2. Particle Metropolis–Hastings and Riemann Manifold Hamiltonian Monte Carlo*

In this section, we compare particle Metropolis–Hastings (PMH) and Riemann manifold Hamiltonian Monte Carlo (RMHMC). We evaluate the algorithms based on how well they recover the true parameters of the model *β*, *φ*, and *ση* and on the basis of the effective sample size. We obtained 20,000 samples and discarded the first 1000 as burn-in. Figure 5 and Figures A1–A4 show results for both samplers: trace plots, histograms, and autocorrelation functions are depicted. Table 3 presents the moments and highest posterior density intervals for the parameters of the model. The marginal likelihood in PMH was estimated with the auxiliary particle filter, as discussed in [23]. The Metropolis–Hastings part of the algorithm was calibrated to achieve a 20–40% acceptance rate. RMHMC was implemented as in [5], with the openly available implementation of the method by the authors. Both PMH and RMHMC require careful calibration of the step size, and RMHMC additionally needs calibration of the number of leapfrog steps; thus, in Table A1, we additionally present results for the No-U-Turn Sampler (NUTS). NUTS is an extension of the HMC algorithm that allows one to tune the algorithm automatically. From Figure 5 and Figures A1–A4, we observe that the autocorrelation of the samples, indicated in the third column of the plots, decreases faster for PMH than for RMHMC, in particular for the parameters *φ* and *ση*. The effective sample size is similarly high for the parameter *β* for both samplers, while for *φ* and *ση* it is lower in the case of RMHMC. However, if we compute the ESS per second, as presented in the last column of Table 3, this advantage disappears. This result is not surprising, since PMH is the most computationally intensive procedure we are considering: both the likelihood and the posterior use sequential sampling methods, which makes the computations very demanding. Nevertheless, PMH allows us to recover the underlying parameters more accurately. In particular, in most of the presented examples, the true variance of the latent volatility process lies inside the 95% highest posterior density interval for PMH, while RMHMC tends to overestimate this parameter. As Table A1 indicates, the highest posterior density intervals obtained with NUTS are larger than those obtained with PMH and RMHMC. The numbers of gradient evaluations for RMHMC are 69,718; 70,042; 69,802; 69,801; and 70,041 for Experiments 1–5, respectively.
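The effective sample size used throughout this comparison can be estimated from the chain autocorrelations. A minimal sketch (our own function; it uses the simple rule of truncating the autocorrelation sum at the first non-positive value, one of several common conventions) is:

```python
import numpy as np

def effective_sample_size(x, max_lag=500):
    """ESS = M / (1 + 2 * sum of autocorrelations), truncating the sum
    at the first non-positive autocorrelation (a simple common rule)."""
    x = np.asarray(x, dtype=float)
    M = len(x)
    xc = x - x.mean()
    var = xc @ xc / M
    acf_sum = 0.0
    for k in range(1, min(max_lag, M - 1)):
        rho = (xc[:-k] @ xc[k:]) / (M * var)
        if rho <= 0.0:
            break
        acf_sum += rho
    return M / (1.0 + 2.0 * acf_sum)

# Toy check: an AR(1) chain mixes more slowly as its coefficient
# approaches 1, so the ESS falls far below the nominal 20,000 draws.
rng = np.random.default_rng(1)
chain = np.zeros(20000)
for t in range(1, len(chain)):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()
print(effective_sample_size(chain))
```

Dividing this quantity by wall-clock time gives the ESS-per-second figure reported in the last column of Table 3.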

**Table 3.** Posterior moments for the samples obtained with particle Metropolis–Hastings (PMH) and Riemann Manifold Hamiltonian Monte Carlo (RMHMC) for the parameters *β*, *φ* and *ση* of the stochastic volatility model. Experiments 1, 2, 3, 4, and 5 correspond to *TS* 2, 4, 5, 9, and 10 from Table 2.




(**b**) Trace plots (left), histograms (middle), and ACF plots (right) obtained with Riemann manifold Hamiltonian Monte Carlo.

**Figure 5.** Results of the sampling from the posterior distribution with PMH and RMHMC for *TS* = 2 from Table 2. The first column corresponds to the trace plots, the middle column to histograms obtained with the samples from the posterior distribution, and the last column to the autocorrelation function of the samples.

#### *4.3. Integrated Nested Laplace Approximation*

We provide two simulation studies for the integrated nested Laplace approximation. First, we replicate and extend the simulation study provided in [38] by analyzing data-generating processes with different values of *μ* and *ση*, since the estimation of the variance parameter appears to be a challenge for existing methods. Table 4 replicates results from [38] with the parametrization of the model with the scale parameter *μ*, and Table 5 presents results for the parametrization with the scale parameter *β* and different data-generating processes. Both Monte Carlo studies are conducted with 1000 iterations. Our findings are comparable to those of [38]: the mean of the volatility process and the persistence parameter are estimated quite accurately, while the variance of the latent volatility process estimated with INLA is biased: it is usually overestimated. Second, we provide the posterior moments for INLA, similarly to Table 3 for PMH and RMHMC. These results are presented in Table 6. They also suggest that the variance of the latent volatility process tends to be overestimated with INLA, to a larger degree than with RMHMC, which also overestimates this parameter, as can be seen from Table 3. Moreover, the highest posterior density intervals for the parameter *φ* obtained with INLA are larger than those obtained with PMH and RMHMC.

**Table 4.** Bias and square root of the mean squared error for integrated nested Laplace approximation (INLA) parametrized with scale parameter *μ*.


**Table 5.** Bias and square root of the mean squared error for INLA parametrized with scale parameter *β*.



**Table 6.** Posterior results for estimation of the stochastic volatility (SV) model with INLA. Experiments 1, 2, 3, 4, and 5 correspond to *TS* 2, 4, 5, 9, and 10 from Table 2.

#### *4.4. Fixed-Form Variational Bayes*

In this section, we discuss results for the simulation study with fixed-form variational Bayes. We consider the same time series as in the comparison between PMH and RMHMC. In Table 7, we present the estimated variational parameters, and in Figure 6, a comparison of the posteriors obtained with fixed-form variational Bayes (in blue), RMHMC (histograms of the posterior samples), and INLA (green). It is clear that in some cases the variational Bayes method performs quite well; in particular, the parameter *β* is very well estimated in most of the cases. Only in Figure 6j is the approximate posterior for *β* far from the truth. The variance parameter is underestimated in all cases with VB; this is less severe when the true variance is relatively small. However, when the true variance is relatively large, the discrepancy between the VB estimate and the true value increases, as can be seen from Figure 6o. We observe the opposite picture with INLA: it tends to overestimate the variance of the latent volatility process. Overestimation of the variance of the latent volatility process for stochastic volatility models with INLA has been previously reported in [38], where it is also reported that this effect decreases with larger values of *ση*. The source of this effect has to be investigated further. RMHMC overestimates the variance to a lesser degree than INLA, and as can be seen from Figure 6o, this is also connected to the value of the ground truth for *ση*: with a larger true value of *ση*, RMHMC provides more accurate results.
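To illustrate the mechanics of fixed-form variational Bayes (a toy sketch in our own notation, not the algorithm of the paper), the following code fits a Gaussian *q*(*θ*) = N(*m*, *s*2) to a simple conjugate one-dimensional posterior by stochastic gradient ascent on the ELBO with the reparameterization trick; the learning rate and iteration count are illustrative choices.

```python
import numpy as np

# Toy model: data ~ N(theta, 1) with a N(0, 1) prior on theta, so the
# exact posterior is N(sum(data) / (n + 1), 1 / (n + 1)).
rng = np.random.default_rng(2)
data = rng.normal(1.5, 1.0, size=100)

def grad_log_joint(theta):
    """Gradient of log p(data | theta) + log p(theta) wrt theta."""
    return np.sum(data - theta) - theta

# Fixed-form Gaussian q(theta) = N(m, s^2); optimize (m, log s) by
# one-sample reparameterized gradients of the ELBO.
m, log_s = 0.0, 0.0
lr = 1e-3
for _ in range(5000):
    eps = rng.normal()
    s = np.exp(log_s)
    theta = m + s * eps            # reparameterized draw from q
    g = grad_log_joint(theta)
    m += lr * g                    # d ELBO / dm
    log_s += lr * (g * s * eps + 1.0)  # entropy of q adds +1 in log s

print(m, np.exp(log_s))  # close to the exact posterior mean and sd
```

For this conjugate toy problem the fitted (*m*, *s*) approach the exact posterior mean and standard deviation, which makes the sketch easy to check.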

**Figure 6.** Illustration of fixed-form variational Bayes in comparison to RMHMC and INLA. Subfigures illustrate the posterior distributions estimated with different methods for the different data-generating processes. (**a**–**c**) correspond to Experiment 1 from Tables 2 and 6, (**d**–**f**) correspond to Experiment 2, (**g**–**i**) correspond to Experiment 3, (**j**–**l**) correspond to Experiment 4, and (**m**–**o**) correspond to Experiment 5. Red vertical lines indicate true parameter values.


**Table 7.** Parameters of the posterior distribution obtained with fixed-form Variational Bayes. Exp. 1–5 correspond to the Experiments 1–5 in Tables 3 and 6.

#### *4.5. Comparison of the Methods on the Real Data*

In this section, we present posterior distributions of the parameters estimated with different Bayesian inference methods on two real-world time series. First, we consider the mean-corrected log-returns of the Australian dollar against the US dollar. The data range from January 1994 to December 2003, with a total of 519 weekly observations. The resulting posterior distributions obtained with different inference methods are presented in Figure 7. Second, we consider daily log-returns for the DAX index from 3 January 2000 until 17 May 2001, which in total constitute 1000 observations. We provide descriptive statistics for both time series in Table A2. The resulting posterior distributions for this time series are presented in Figure 8. The discrepancies between the methods are largest in the estimation of the parameter *ση* for both time series, and the results are consistent with the simulation studies in terms of the nature of these discrepancies. As we can see from Figures 7c and 8d, the posterior distribution of *ση* obtained with variational Bayes is concentrated on smaller values in comparison to the other methods, while INLA suggests higher values for *ση* than the other methods. The posterior samples obtained with RMHMC are concentrated on values higher than the ones obtained with PMH. Both sampling methods appear to give results larger than VB but smaller than INLA for the parameter *ση*.

In Table 8, we present results for the effective sample size (ESS) for both empirical applications and both samplers. Similarly to what is found in the simulation studies, the ESS is higher in the case of the PMH algorithm. However, if the computational time were taken into account, this advantage would disappear, similarly to the results in Table 3.


**Table 8.** Effective sample size (ESS) for PMH and RMHMC in real-world time series applications: weekly log-returns for the exchange rate of Australian/US dollars and daily log-returns of the DAX index.

**Figure 7.** Comparison of PMH (pink), VB (blue), INLA (green), and RMHMC (yellow) on the weekly log-returns of the Australian dollar against the US dollar. Subfigures illustrate the posterior distributions for different parameters of the model obtained with different methods. (**a**) Corresponds to the posterior distribution for the parameter *β*. (**b**) Corresponds to the posterior distribution of the parameter *φ*. (**c**) Corresponds to the posterior distribution of the parameter *ση*.

**Figure 8.** Comparison of PMH (pink), VB (blue), INLA (green), and RMHMC (yellow) on the daily log-returns of DAX index. Subfigures illustrate the posterior distributions for different parameters of the model obtained with different methods. (**a**,**b**) correspond to the posterior distribution of the parameter *β*. (**c**) corresponds to the posterior distribution of the parameter *φ*. (**d**) corresponds to the posterior distribution of the parameter *ση*.

#### **5. Discussion**

This paper reviewed multiple methods from Bayesian statistics and machine learning for the estimation of nonlinear state-space models, and stochastic volatility models in particular. We focused on representative inference methods from different classes: methods that can recover the posterior distribution 'exactly' and those that build an approximation. We discussed which methods have the potential to be applied in a multivariate or high-dimensional situation and why. Finally, we found that while stochastic volatility models are commonly used in simulation studies to demonstrate the performance of the methods, usually not enough data-generating processes are considered to make a fair comparison. In particular, the performance of the methods is heavily connected to the variance of the latent volatility process.

State-space models can be powerful tools for modeling latent variables in different scientific fields. However, already for univariate time series, they are challenging to estimate. This paper's main aim was to review and understand the existing classes of estimation methods (targeting the exact posterior or approximating it) and to identify directions one can take for the estimation of *multivariate* nonlinear state-space models. The challenge arises from both statistical and computational perspectives: it is hard to develop methods that both provide sufficiently good results from the estimation point of view and are computationally feasible. We have reviewed a number of methods that allow a trade-off between these two aspects. In particular, we have considered particle Markov chain Monte Carlo, with multiple particle filtering approaches for this method, Riemann manifold Hamiltonian Monte Carlo, integrated nested Laplace approximation, and variational Bayes. All these methods are equipped with the ability to estimate models with intractable likelihoods.

#### *5.1. Sequential Monte Carlo*

We compared the auxiliary particle filter with the bootstrap particle filter in terms of the variance of the estimated likelihood. We found that the auxiliary particle filter outperformed the bootstrap particle filter for most of the data-generating processes, although, as discussed in [27], the auxiliary particle filter does not always have a smaller variance of the estimated likelihood. Additionally, we looked into how the variance of the estimated likelihood changes in the parameter space. We found that, in particular, the variance of the latent process affects the variance of the estimated likelihood. This implies that one has to balance the number of particles used in sequential Monte Carlo against a clever way of finding initial parameter values for sampling from the posterior, especially when considering multivariate models. The advantage of the auxiliary particle filter from the methodological point of view is that it takes into account the current observation *yt* when constructing the proposal for the particles *q*(*ht* | *ht*−1, *yt*). A method that we did not include in our simulation study, but that could possibly solve the problem with the variance of the estimated likelihood, is the iterated auxiliary particle filter (iAPF): for the particle proposal, it uses not only the current observation *yt* but all observations, *q*(*ht* | *ht*−1, *y*1:*T*). A backward sequential procedure with an optimization step is used in this proposal mechanism, which makes the algorithm computationally intensive. The multivariate application of the stochastic volatility model in [26] considers only the diagonal case of the matrix Φ, and the proposed procedure for the particle proposals does not incorporate such dependence. While this method does introduce an additional computational burden on an already computationally intensive method (particle Metropolis–Hastings), it is promising for obtaining state-of-the-art results for the task of parameter estimation.

#### *5.2. Particle Metropolis–Hastings*

Metropolis–Hastings is a general MCMC method that is easy to implement and works well for the univariate model. The estimation results are satisfying when it is properly calibrated and good mixing of the chains is achieved. It works well in low-dimensional problems but is unlikely to be successful in the case of multivariate stochastic volatility models. Considering a non-diagonal matrix Φ in a five-dimensional case, we would have 45 parameters to estimate. The random-walk proposal would be very inefficient even with a reasonable sparseness assumption on Φ. Nevertheless, in the low-dimensional model, we obtain the best estimation results with particle Metropolis–Hastings, where the particle filtering scheme is chosen to be an auxiliary particle filter. Of the methods considered in this paper, particle Markov chain Monte Carlo methods are the easiest to adapt to different specifications of the model and the easiest to implement.
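The inefficiency of random-walk proposals in higher dimensions can be made concrete with a small sketch (a toy Gaussian target in our own code, not the SV posterior): with a fixed step size, the acceptance rate collapses as the dimension grows toward the 45-parameter regime mentioned above.

```python
import numpy as np

def rw_metropolis(logpost, x0, step, n_iter, rng):
    """Random-walk Metropolis with a spherical Gaussian proposal;
    returns the empirical acceptance rate."""
    x = np.array(x0, dtype=float)
    lp = logpost(x)
    accepted = 0
    for _ in range(n_iter):
        prop = x + step * rng.normal(size=x.shape)
        lp_prop = logpost(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject
            x, lp = prop, lp_prop
            accepted += 1
    return accepted / n_iter

# With a fixed step size, acceptance collapses as the dimension grows.
rng = np.random.default_rng(3)
target = lambda x: -0.5 * np.sum(x**2)  # standard Gaussian toy target
accs = [rw_metropolis(target, np.zeros(d), step=1.0, n_iter=2000, rng=rng)
        for d in (1, 5, 45)]
print(accs)
```

Keeping the acceptance rate reasonable in high dimensions forces a much smaller step, which in turn produces the slow random-walk exploration criticized above.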

#### *5.3. Riemann Manifold Hamiltonian Monte Carlo Methods*

Hamiltonian Monte Carlo is a very attractive method for high-dimensional problems, as it allows us to explore the parameter space efficiently. In particular, the gain in efficiency comes from avoiding random-walk behavior in the proposals. The disadvantage is the need for careful calibration, since there is no principled way of choosing the matrix *M*. RMHMC avoids this problem by exploiting the underlying geometry in the proposal mechanism. In our study, we noticed that RMHMC results in good mixing of the Markov chains, and the method is generally easy to calibrate, but the estimation of the parameters is not very good. In particular, it appears that the variance of the latent volatility process is challenging for the method. It is not surprising that the PMH algorithm performs better in terms of parameter estimation, since we use an auxiliary particle filter for the volatility process estimation and thus take the current observation *yt* into account in the particle proposals. RMHMC does not benefit from similar information when estimating the model parameters. Therefore, improved estimation of the volatility process is one possible direction for improving the performance of RMHMC for the parameter estimation of stochastic volatility models.

#### *5.4. Variational Bayes*

As one can see from the illustrative example, in some cases variational Bayes performs quite well; however, there are also situations when it is far from the underlying truth. The challenge with stochastic volatility models remains the same: it is difficult to estimate the latent states. In the approach of [6], this is done via Kalman filtering. Therefore, the drawback of linearizing the model remains and shows in the final results. In this respect, a possible combination of VB and SMC can be of interest. Some advances in this direction have already been made [42].

#### *5.5. Integrated Nested Laplace Approximation*

Integrated nested Laplace approximation is another approach that works well considering how fast the method is, but it clearly overestimates the variance of the latent volatility process. Additionally, the sparse matrix computation that is used in univariate models is not applicable to the multivariate case: the precision matrix in Equation (67) is then not sparse, and thus the method does not benefit from fast sparse matrix computations. An approach that we have not considered in this paper is the expectation propagation (EP) algorithm. In particular, the authors of [43] propose a way to improve the approximate marginals *p*(*xt* | *θ*, *y*) in latent Gaussian fields by using EP. The motivation of the approach builds on the fact that EP can give better approximations than the Laplace approximation in this case. The improvements, however, come at a computational cost. In the univariate case, the extra computational cost does not play a significant role, as the algorithm can be parallelized. However, it is hard to say how big the difference would be in the multivariate model, both in terms of improvement in the estimation and loss in computational speed.

#### **6. Conclusions**

We reviewed multiple Bayesian inference methods, both ones that target the exact posterior distribution and ones that approximate it. By comparing the methods on various data-generating processes, we observed that fixed-form variational Bayes tends to underestimate the variance of the latent volatility process, while INLA and RMHMC, in the cases considered, overestimate this parameter. We obtained a similar disposition of the results on two real-world data sets. We achieved the best performance, in terms of recovering the ground truth and uncertainty quantification, with PMH, in which the particle filtering step was performed with an auxiliary particle filter. This indicates that filtering with look-ahead approaches, which include current (or future) observations in the proposal machinery, can improve the performance of the inference method. It is important to note that different data-generating processes in the simulation studies would indicate different performance results. Thus, we stress that when using stochastic volatility models, more than one data-generating process should be considered for comparing methods; this practice would indicate in which situations a method can fail or perform differently. To estimate the stochastic volatility model in the multivariate case, a combination of different strategies appears to be necessary. In a high-dimensional case, the random-walk proposal would become extremely inefficient; at the same time, approximate methods such as INLA lose their outstanding computational advantage, and the implementation of these methods in the multivariate case is not straightforward.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The author sincerely thanks anonymous referees for constructive and useful comments. Their suggestions helped improve and clarify this manuscript.

**Conflicts of Interest:** The author declares no conflict of interest.

## **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A**

#### *Appendix A.1.*

(**b**) Trace plots (left), histograms (middle), and ACF plots (right) obtained with Riemann manifold Hamiltonian Monte Carlo.

**Figure A1.** Results of the sampling from the posterior distribution with PMH and RMHMC for *TS* = 4 from Table 2. The first column corresponds to the trace plots, the middle column to histograms obtained with the samples from the posterior distribution, and the last column to the autocorrelation function of the samples.

(**b**) Trace plots (left), histograms (middle), and ACF plots (right) obtained with Riemann manifold Hamiltonian Monte Carlo.

**Figure A2.** Results of the sampling from the posterior distribution with PMH and RMHMC for *TS* = 5 from Table 2. The first column corresponds to the trace plots, the middle column to histograms obtained with the samples from the posterior distribution, and the last column to the autocorrelation function of the samples.

(**b**) Trace plots (left), histograms (middle), and ACF plots (right) obtained with Riemann manifold Hamiltonian Monte Carlo.

**Figure A3.** Results of the sampling from the posterior distribution with PMH and RMHMC for *TS* = 9 from Table 2. The first column corresponds to the trace plots, the middle column to histograms obtained with the samples from the posterior distribution, and the last column to the autocorrelation function of the samples.

(**b**) Trace plots, histograms, and ACF plots obtained with Riemann manifold Hamiltonian Monte Carlo.

**Figure A4.** Results of the sampling from the posterior distribution with PMH and RMHMC for *TS* = 10 from Table 2. The first column corresponds to the trace plots, the middle column to histograms obtained with the samples from the posterior distribution, and the last column to the autocorrelation function of the samples.

#### *Appendix A.2.*

Although we do not give details for the NUTS sampler [44] in the main text, we present here results for this sampler using the same experiments as in the main text. The results in Table A1 are obtained with the sampler implemented in RStan [45]. Similarly to RMHMC, the variance of the latent volatility process tends to be overestimated, based on the mean and the mode of the posterior distribution. The highest posterior density intervals obtained with NUTS appear to be quite large, especially for the parameters *φ* and *ση*. The multivariate version of the stochastic volatility model can provide additional challenges, since different parameters might require different step sizes and the sampler can get stuck in regions of the space where a small step size is needed to achieve the target acceptance rate. In the univariate case, NUTS appears to be more efficient than both PMH and RMHMC, as can be seen from the last two columns of Table A1. The applicability of this particular implementation can be limited due to the large 95% highest posterior density intervals, as the uncertainty about the parameters *φ* and *ση* is very large in most cases.

**Table A1.** Posterior results for estimation of the SV model with NUTS. Experiments 1, 2, 3, 4, and 5 correspond to *TS* 2, 4, 5, 9, and 10 from Table 2.


#### *Appendix A.3.*

Algorithm A1 is a generic particle filter. We use the auxiliary version of it proposed in [23].

## **Algorithm A1** Approximation of marginal likelihood with ASIR algorithm.

1: Draw *N* samples $h\_0^{(i)}$ from the prior

$$h\_0^{(i)} \sim \pi(h\_0 \mid \theta), \ i = 1, \ldots, N \tag{A1}$$

and set $\mathcal{W}\_0^{(i)} = 1/N$ for all $i = 1, \ldots, N$.

2: For *t* = 1, . . . , *T*, repeat steps 3–6.

3: Draw *N* particles from the proposal

$$h\_t^{(i)} \sim q(h\_t \mid h\_{t-1}^{(i)}, y\_{1:t}, \theta), \ i = 1, \ldots, N. \tag{A2}$$

4: Compute the following weights

$$w\_t^{(i)} = \frac{g(y\_t \mid h\_t^{(i)}, \theta) f(h\_t^{(i)} \mid h\_{t-1}^{(i)}, \theta)}{q(h\_t^{(i)} \mid h\_{t-1}^{(i)}, y\_{1:t}, \theta)} \tag{A3}$$

and compute the estimate of $p(y\_t \mid y\_{1:t-1}, \theta)$ as

$$
\hat{p}(y\_t \mid y\_{1:t-1}, \boldsymbol{\theta}) = \sum\_i \mathcal{W}\_{t-1}^{(i)} w\_t^{(i)}.\tag{A4}
$$

5: Compute normalized weights as

$$\mathcal{W}\_t^{(i)} \propto \mathcal{W}\_{t-1}^{(i)} w\_t^{(i)}.\tag{A5}$$

6: If the effective number of particles is too low, perform resampling.
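A minimal Python sketch of Algorithm A1 (our own code, with the bootstrap choice *q* = *f*, so the weight (A3) reduces to the observation density, and with adaptive resampling as in step 6; the SV parameterization *yt* = *β* exp(*ht*/2)*εt* is assumed):

```python
import numpy as np

def generic_pf_loglik(y, beta, phi, sigma_eta, N, rng, ess_frac=0.5):
    """Marginal log-likelihood estimate following Algorithm A1 with q = f."""
    W = np.full(N, 1.0 / N)                                       # normalized weights
    h = rng.normal(0.0, sigma_eta / np.sqrt(1 - phi**2), size=N)  # prior draw (A1)
    loglik = 0.0
    for t, yt in enumerate(y):
        if t > 0:
            h = phi * h + rng.normal(0.0, sigma_eta, size=N)      # proposal (A2)
        var = beta**2 * np.exp(h)
        w = np.exp(-0.5 * (np.log(2 * np.pi * var) + yt**2 / var))  # weight (A3)
        loglik += np.log(np.sum(W * w))           # \hat p(y_t | y_{1:t-1}) (A4)
        W = W * w
        W /= W.sum()                              # normalization (A5)
        if 1.0 / np.sum(W**2) < ess_frac * N:     # step 6: resample if ESS is low
            idx = rng.choice(N, size=N, p=W)
            h, W = h[idx], np.full(N, 1.0 / N)
    return loglik

# Toy data simulated from the model itself.
rng = np.random.default_rng(4)
h, y = 0.0, []
for _ in range(100):
    h = 0.95 * h + rng.normal(0.0, 0.2)
    y.append(0.7 * np.exp(h / 2) * rng.normal())
print(generic_pf_loglik(np.array(y), 0.7, 0.95, 0.2, N=500, rng=rng))
```

An auxiliary version in the sense of [23] would replace the prior proposal in (A2) with one that also looks at the current observation, and adjust the weight (A3) accordingly.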

#### *Appendix A.4.*

**Table A2.** Descriptive statistics for the time series from the empirical example in Section 4.5: daily log-returns for the DAX index and weekly log-returns for the Australian/US dollar exchange rate.


#### **References**


## *Article* **PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models**

**Imon Banerjee 1, Vinayak A. Rao 1,\*,† and Harsha Honnappa 2,†**


**Abstract:** Datasets displaying temporal dependencies abound in science and engineering applications, with Markov models representing a simplified and popular view of the temporal dependence structure. In this paper, we consider Bayesian settings that place prior distributions over the parameters of the transition kernel of a Markov model, and seek to characterize the resulting, typically intractable, posterior distributions. We present a Probably Approximately Correct (PAC)-Bayesian analysis of variational Bayes (VB) approximations to tempered Bayesian posterior distributions, bounding the model risk of the VB approximations. Tempered posteriors are known to be robust to model misspecification, and their variational approximations do not suffer the usual problems of overconfident approximations. Our results tie the risk bounds to the mixing and ergodic properties of the Markov data generating model. We illustrate the PAC-Bayes bounds through a number of example Markov models, and also consider the situation where the Markov model is misspecified.

**Keywords:** ergodicity; Markov chain; probably approximately correct; variational Bayes

## **1. Introduction**

This paper presents probably approximately correct (PAC)-Bayesian bounds on variational Bayesian (VB) approximations of fractional or tempered posterior distributions for Markov data generation models. Exact computation of either standard or tempered posterior distributions is a hard problem that has, broadly speaking, spawned two classes of computational methods. The first, Markov chain Monte Carlo (MCMC), constructs ergodic Markov chains to approximately sample from the posterior distribution. MCMC is known to suffer from high variance and complex diagnostics, leading to the development of variational Bayesian (VB) [1] methods as an alternative in recent years. VB methods pose posterior computation as a variational optimization problem, approximating the posterior distribution of interest by the 'closest' element of an appropriately defined class of 'simple' probability measures. Typically, the measure of closeness used by VB methods is the Kullback–Leibler (KL) divergence. Excellent introductions to this so-called *KL-VB* method can be found in [2–4]. More recently, there has also been interest in alternative divergence measures, particularly the *α*-Rényi divergence [5–7], though in this paper, we focus on the KL-VB setting.

Theoretical properties of VB approximations, and in particular asymptotic frequentist consistency, have been studied extensively under the assumption of an independent and identically distributed (i.i.d.) data generation model [4,8,9]. On the other hand, the common setting where data sets display temporal dependencies presents unique challenges. In this paper, we focus on homogeneous Markov chains with parameterized transition kernels, representing a parsimonious class of data generation models with a wide range of applications. We work in the Bayesian framework, focusing on the posterior distribution over the unknown parameters of the transition kernel. Our theory develops PAC bounds

**Citation:** Banerjee, I.; Rao, V. A.; Honnappa, H. PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models. *Entropy* **2021**, *23*, 313. https://doi.org/10.3390/ e23030313

Academic Editor: Pierre Alquier

Received: 8 February 2021 Accepted: 4 March 2021 Published: 6 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

that link the ergodic and mixing properties of the data generating Markov chain to the Bayes risk associated with approximate posterior distributions.

Frequentist consistency of Bayesian methods, in the sense of concentration of the posterior distribution around neighborhoods of the 'true' data generating distribution, has been established in significant generality, in both the i.i.d. [10–12] and the non-i.i.d. data generation setting [13,14]. More recent work [14–16] has studied fractional or tempered posteriors, a class of generalized Bayesian posteriors obtained by combining the likelihood function raised to a fractional power with an appropriate prior distribution using Bayes' theorem. Tempered posteriors are known to be robust against model misspecification: in the Markov setting we consider, the associated stationary distribution as well as mixing properties are sensitive to model parameterization. Further, tempered posteriors are known to be much simpler to analyze theoretically [14,16]. Therefore, following [14–16], we focus on tempered posterior distributions on the transition kernel parameters, and study the rate of concentration of variational approximations to the tempered posterior. Equivalently, as shown in [16] and discussed in Section 1.1, our results also apply to so-called *α*-variational approximations to standard posterior distributions over kernel parameters. The latter are modifications of the standard KL-VB algorithm to address the well-known problem of overconfident posterior approximations.
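Explicitly, for a fractional power $\alpha \in (0,1)$, a prior $\pi$ over $\Theta$, and likelihood $p\_\theta^{(n)}$ (notation as in Section 1.1), the tempered posterior referred to above takes the standard form

$$\pi\_{n,\alpha}(\theta \mid X^n) \;\propto\; \left[p\_\theta^{(n)}(X^n)\right]^{\alpha} \pi(\theta),$$

which recovers the usual Bayesian posterior in the limit $\alpha \to 1$.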

While there have been a number of recent papers studying the consistency of approximate variational posteriors [5,8,15] in the large sample limit, rates of convergence have received less attention. Exceptions include [9,15,17], where an i.i.d. data generation model is assumed. [15] establishes PAC-Bayes bounds on the convergence of a variational tempered posterior with fractional powers in the range [0, 1), while [9] considers the standard variational posterior case (where the fractional power equals 1). [17], on the other hand, establishes PAC-Bayes bounds for risk-sensitive Bayesian decision making problems in the standard variational posterior setting. The setting in [15] allows for model misspecification and the analysis is generally more straightforward than that in [9,17]. Our work extends [15] to the setting of a discrete-time Markov data generation model.

Our first results, in Theorem 1 and Corollary 1 of Section 2, establish PAC-Bayes bounds for sequences with arbitrary temporal dependence. Our results generalize ([15], Theorem 2.4) to the non-i.i.d. data setting in a straightforward manner. Note that Theorem 1 also recovers ([16], Theorem 3.3), which is established under different 'existence of test' conditions. Our objective in this paper is to explicate how the ergodic and mixing properties of the Markov data generating process influence the PAC-Bayes bound. The sufficient conditions of our theorem, bounding the mean and variance of the log-likelihood ratio of the data, allow for developing this understanding without the technicalities of proving the existence of test conditions intruding on the insights.

In Section 3, we study the setting where the data generating model is a stationary *α*-mixing Markov chain. Stationarity means that the Markov chain is initialized with the invariant distribution corresponding to the parameterized transition kernel, implying all subsequent states also follow this marginal distribution. The *α*-mixing condition ensures that the variance of the likelihood ratio of the Markov data does not grow faster than linear in the sample size. Our main results in this setting are applicable when the state space of the Markov chain is either continuous or discrete. The primary requirement on the class of data generating Markov models is for the log-likelihood ratio of the parameterized transition kernel and invariant distribution to satisfy a Lipschitz property. This condition implies a decoupling between the model parameters and the random samples, affording a straightforward verification of the mean and variance bounds. We highlight this main result by demonstrating that it is satisfied by a finite state Markov chain, a birth-death Markov chain on the positive integers, and a one-dimensional Gaussian linear model.
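To make the role of the Markov structure concrete (in notation of our own for this sketch, writing $f\_\theta$ for the parameterized transition density and $q\_\theta$ for the corresponding invariant density), the log-likelihood ratio controlled by the mean and variance bounds decomposes along the chain as

$$\log \frac{p^{(n)}\_{\theta\_0}(X^n)}{p^{(n)}\_{\theta}(X^n)} \;=\; \log \frac{q\_{\theta\_0}(X\_0)}{q\_{\theta}(X\_0)} \;+\; \sum\_{i=1}^{n} \log \frac{f\_{\theta\_0}(X\_i \mid X\_{i-1})}{f\_{\theta}(X\_i \mid X\_{i-1})},$$

and the $\alpha$-mixing condition ensures that the variance of the sum of these dependent terms grows at most linearly in $n$.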

In practice, the assumption that the data generating model is stationary is unlikely to be satisfied. Typically, the initial distribution is arbitrary, with the state distribution of the Markov sequence converging weakly to the stationary distribution. In this setting, we must further assume that the class of data generating Markov chains is geometrically ergodic. We show that this implies the boundedness of the mean and variance of the log-likelihood ratio of the data generating Markov chain. Alternatively, in Theorem 4 we directly impose a drift condition on random variables that bound the log-likelihood ratio. Again, in this more general nonstationary setting, we illustrate the main results by showing that the PAC-Bayes bound is satisfied by a finite state Markov chain, a birth-death Markov chain on the positive integers, and a one-dimensional Gaussian linear model.

In preparation for our main technical results, which begin in Section 2, we first collect the relevant notation and definitions in the next section.

#### *1.1. Notations and Definitions*

We broadly adopt the notation in [15]. Let the sequence of random variables $X^n = (X_0, \dots, X_n) \in \mathbb{R}^{m \times (n+1)}$ represent a dataset of $n+1$ observations drawn from a joint distribution $P_{\theta_0}^{(n)}$, where $\theta_0 \in \Theta \subseteq \mathbb{R}^d$ is the 'true' parameter underlying the data generation process. We assume the state space $\mathcal{S} \subseteq \mathbb{R}^m$ of the random variables $X_i$ is either discrete or continuous, and write $\{x_0, \dots, x_n\}$ for a realization of the dataset. We also adopt the convention that $0 \log(0/0) = 0$.

For each $\theta \in \Theta$, we will write $p_\theta^{(n)}$ for the probability density of $P_\theta^{(n)}$ with respect to some measure $Q^{(n)}$, i.e., $p_\theta^{(n)} := \frac{dP_\theta^{(n)}}{dQ^{(n)}}$, where $Q^{(n)}$ is either the Lebesgue measure or the counting measure. Unless stated otherwise, all probabilities, expectations and variances, which we represent as $P$, $\mathbb{E}[X]$ and $\operatorname{Var}[X]$, are with respect to the true distribution $P_{\theta_0}^{(n)}$.

Let $\pi(\theta)$ be a *prior* distribution with support $\Theta$. The $\alpha^{te}$-*fractional posterior* is defined as

$$\pi_{n,\alpha^{te}|X^n}(d\theta) := \frac{e^{-\alpha^{te} r_n(\theta,\theta_0)(X^n)}\,\pi(d\theta)}{\int e^{-\alpha^{te} r_n(\theta,\theta_0)(X^n)}\,\pi(d\theta)},\tag{1}$$

where, for $\theta_0, \theta \in \Theta$, $r_n(\theta, \theta_0)(\cdot) := \log \frac{p_{\theta_0}^{(n)}(\cdot)}{p_{\theta}^{(n)}(\cdot)}$ is the log-likelihood ratio of the corresponding density functions, and $\alpha^{te} \in (0, \infty)$ is a tempering coefficient. Setting $\alpha^{te} = 1$ recovers the standard Bayesian posterior. Note that we will use superscripts to distinguish different quantities that are referred to just as $\alpha$ in the literature.
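As an illustrative aside (not part of the original analysis), the following Python sketch computes the fractional posterior of Equation (1) on a grid for a hypothetical i.i.d. Bernoulli model; the Markov models of this paper come later. Since the $\theta_0$-dependent factor of $r_n(\theta,\theta_0)$ cancels in the normalization, Equation (1) reduces to a posterior proportional to likelihood${}^{\alpha}$ times prior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: n = 200 i.i.d. Bernoulli(theta0) coin flips.
theta0 = 0.3
x = rng.random(200) < theta0

def log_lik(theta):
    k, n = x.sum(), x.size
    return k * np.log(theta) + (n - k) * np.log(1 - theta)

grid = np.linspace(1e-3, 1 - 1e-3, 2001)   # grid over Theta = (0, 1)
log_prior = np.zeros_like(grid)            # uniform prior on (0, 1)

def fractional_posterior(alpha):
    # Eq. (1): exp(-alpha * r_n) * prior, normalized; the theta0 factor
    # of r_n is constant in theta, so only alpha * log-likelihood matters.
    logw = alpha * log_lik(grid) + log_prior
    w = np.exp(logw - logw.max())          # stabilize before normalizing
    return w / w.sum()                     # probabilities on the grid

def grid_var(p):
    mu = np.dot(grid, p)
    return np.dot((grid - mu) ** 2, p)

post_full = fractional_posterior(1.0)      # standard posterior (alpha = 1)
post_temp = fractional_posterior(0.5)      # tempered: deliberately flatter

# Tempering with alpha < 1 inflates the posterior spread.
assert grid_var(post_temp) > grid_var(post_full)
```

The spread-inflation seen here is exactly the 'regularizing' effect of tempering discussed below.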

The *Kullback–Leibler* (KL) divergence between distributions *P*, *Q* is defined as

$$\mathcal{K}(P,Q) := \int_{\mathcal{X}} \log\left(\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}\right) p(\boldsymbol{x})\, d\boldsymbol{x},$$

where *p*, *q* are the densities corresponding to *P*, *Q* on some sample space X . In particular, the KL divergence between the distributions parameterized by *θ*<sup>0</sup> and *θ* is

$$\mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) := \int \log \left( \frac{p\_{\theta\_0}^{(n)}(\mathbf{x}\_0, \dots, \mathbf{x}\_n)}{p\_{\theta}^{(n)}(\mathbf{x}\_0, \dots, \mathbf{x}\_n)} \right) p\_{\theta\_0}^{(n)}(\mathbf{x}\_0, \dots, \mathbf{x}\_n) d\mathbf{x}\_0 \cdots d\mathbf{x}\_n$$

$$= \int r_n(\theta,\theta_0)(\mathbf{x}_0,\dots,\mathbf{x}_n)\, p_{\theta_0}^{(n)}(\mathbf{x}_0,\dots,\mathbf{x}_n)\, d\mathbf{x}_0\cdots d\mathbf{x}_n. \tag{2}$$

The $\alpha^{re}$-*Rényi divergence* $D_{\alpha^{re}}(P_\theta^{(n)}, P_{\theta_0}^{(n)})$ is defined as

$$D_{\alpha^{re}}(P_{\theta}^{(n)}, P_{\theta_0}^{(n)}) := \frac{1}{\alpha^{re}-1}\log\int \exp\left(-\alpha^{re}\, r_n(\theta,\theta_0)(\mathbf{x}_0,\dots,\mathbf{x}_n)\right) p_{\theta_0}^{(n)}(\mathbf{x}_0,\dots,\mathbf{x}_n)\, d\mathbf{x}_0\cdots d\mathbf{x}_n, \tag{3}$$

where $\alpha^{re} \in (0,1)$. As $\alpha^{re} \to 1$, the $\alpha^{re}$-Rényi divergence recovers the KL divergence.
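For intuition, the following sketch (an illustration, not taken from the paper) checks Equation (3) by Monte Carlo against the standard closed form for two unit-variance Gaussians, $D_{\alpha}(N(\mu_1,1), N(\mu_0,1)) = \alpha(\mu_1-\mu_0)^2/2$, using a single observation.

```python
import numpy as np

rng = np.random.default_rng(1)

mu0, mu1, alpha = 0.0, 1.0, 0.5

def log_pdf(x, mu):
    # Log-density of N(mu, 1).
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2 * np.pi)

# Monte Carlo version of Eq. (3): the expectation is under P_{theta0},
# here the N(mu0, 1) law of one observation.
x = rng.normal(mu0, 1.0, size=400_000)
r = log_pdf(x, mu0) - log_pdf(x, mu1)      # log-likelihood ratio r(theta, theta0)
d_mc = np.log(np.mean(np.exp(-alpha * r))) / (alpha - 1)

d_exact = alpha * (mu1 - mu0) ** 2 / 2     # known Gaussian closed form
assert abs(d_mc - d_exact) < 0.01
```

The agreement confirms the sign conventions in Equation (3): the expectation is under the true parameter, with $-\alpha^{re}$ in the exponent.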

Let $\mathcal{F}$ be some class of distributions with support in $\mathbb{R}^d$ such that any distribution $P$ in $\mathcal{F}$ is absolutely continuous with respect to the tempered posterior: $P \ll \pi_{n,\alpha^{te}|X^n}$.

Many choices of $\mathcal{F}$ exist; for instance (see also [15]), $\mathcal{F}$ can be the set of Gaussian measures, denoted $\mathcal{F}_{id}^{\Phi}$:

$$\mathcal{F}\_{id}^{\Phi} = \{ \Phi(d\theta; \mu, \Sigma) : \mu \in \mathbb{R}^d, \Sigma\_{d \times d} \in \text{P.D.} \},\tag{4}$$

where P.D. denotes the class of positive definite matrices. Alternatively, $\mathcal{F}$ can be the family of *mean-field* or factored distributions, in which the components $\theta_i$ of $\theta$ are independent of each other. Let $\tilde{\pi}_{n,\alpha^{te}|X^n}$ be the variational approximation to the tempered posterior, defined as

$$\tilde{\pi}_{n,\alpha^{te}|X^n} := \underset{\rho \in \mathcal{F}}{\arg\min}\ \mathcal{K}(\rho, \pi_{n,\alpha^{te}|X^n}). \tag{5}$$

It is easy to see that finding $\tilde{\pi}_{n,\alpha^{te}|X^n}$ in Equation (5) is equivalent to the following optimization problem:

$$\tilde{\pi}_{n,\alpha^{te}|X^n} := \underset{\rho \in \mathcal{F}}{\arg\max}\left[-\int r_n(\theta,\theta_0)(\mathbf{x}_0,\dots,\mathbf{x}_n)\,\rho(d\theta) - \left(\alpha^{te}\right)^{-1}\mathcal{K}(\rho,\pi)\right]. \tag{6}$$

Setting $\alpha^{te} = 1$ again recovers the usual variational solution that seeks to approximate the posterior distribution with the closest element of $\mathcal{F}$ (the quantity maximized on the right-hand side above is the evidence lower bound (ELBO)). Other settings of $\alpha^{te}$ constitute $\alpha^{te}$-variational inference [16], which seeks to regularize the 'overconfident' approximate posteriors that standard variational methods tend to produce.
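As a numerical aside, the tempered variational problem of Equation (6) can be checked on a conjugate toy model where the answer is available in closed form: with hypothetical i.i.d. $N(\theta,1)$ data, an $N(0,1)$ prior, and $\mathcal{F}$ the Gaussian family, the exact tempered posterior is $N(m, v)$ with $v = 1/(\alpha n + 1)$ and $m = v\,\alpha \sum_i x_i$, and a grid search over the tempered objective recovers it.

```python
import numpy as np

rng = np.random.default_rng(2)

alpha, n = 0.5, 50
x = rng.normal(1.0, 1.0, n)                  # toy data, true mean 1.0

# Exact tempered posterior: proportional to exp(alpha * loglik) * prior.
v = 1.0 / (alpha * n + 1.0)
m = v * alpha * x.sum()

def objective(mu, sig):
    # Eq. (6) up to a theta0-constant: expected log-likelihood under
    # rho = N(mu, sig^2) minus (1/alpha) * KL(rho || N(0, 1) prior).
    exp_loglik = -0.5 * np.sum((x - mu) ** 2) - 0.5 * n * sig ** 2
    kl = 0.5 * (sig ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sig))
    return exp_loglik - kl / alpha

# Grid search over the Gaussian family recovers (m, sqrt(v)).
mus = np.linspace(m - 0.5, m + 0.5, 201)
sigs = np.linspace(0.05, 1.0, 191)
vals = np.array([[objective(mu, s) for s in sigs] for mu in mus])
i, j = np.unravel_index(vals.argmax(), vals.shape)
assert abs(mus[i] - m) < 0.01 and abs(sigs[j] - np.sqrt(v)) < 0.01
```

Because $\alpha^{te} < 1$ shrinks the effective sample size to $\alpha^{te} n$, the recovered variance $v$ is larger than the untempered posterior variance $1/(n+1)$, which is the regularization effect mentioned above.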

Our results in this paper focus on parametrized Markov chains. We call a Markov chain 'parameterized' if the transition kernel $p_\theta(\cdot|\cdot)$ is parametrized by some $\theta \in \Theta \subseteq \mathbb{R}^d$. Let $q^{(0)}(\cdot)$ be the initial density (defined with respect to the Lebesgue measure over $\mathbb{R}^m$) or initial probability mass function. Then, the joint density is $p_\theta^{(n)}(x_0,\dots,x_n) = q^{(0)}(x_0)\prod_{i=0}^{n-1} p_\theta(x_{i+1}|x_i)$; recall that this joint density corresponds to the walk probability of a time-homogeneous Markov chain. We assume that corresponding to each transition kernel $p_\theta$, $\theta \in \Theta$, there exists an invariant distribution $q_\theta^{(\infty)} \equiv q_\theta$ that satisfies

$$q\_{\theta}(\mathbf{x}) = \int p\_{\theta}(\mathbf{x}|y) q\_{\theta}(dy) \quad \forall \mathbf{x} \in \mathbb{R}^{m}, \theta \in \Theta.$$

We will also use $q_\theta$ to designate the density of the invariant measure (as before, this is with respect to the Lebesgue or counting measure for continuous or discrete state spaces, respectively). A Markov chain is stationary if its initial distribution is the invariant probability distribution, that is, $X_0 \sim q_\theta$.
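A concrete parameterized chain in this sense, used repeatedly below, is the Gauss–Markov recursion $X_{i+1} = \theta X_i + W_{i+1}$ with standard Gaussian noise, whose invariant law is $q_\theta = N(0, 1/(1-\theta^2))$ for $|\theta| < 1$. The following sketch (illustrative, not from the paper) checks the fixed-point equation above by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(3)

theta = 0.8
stat_var = 1.0 / (1.0 - theta ** 2)          # invariant variance of q_theta

# Start from the invariant distribution, so the chain is stationary...
x0 = rng.normal(0.0, np.sqrt(stat_var), size=200_000)
# ...and apply one step of the transition kernel p_theta(.|.).
x1 = theta * x0 + rng.normal(size=x0.size)

# Invariance check: one kernel step preserves the marginal variance,
# matching q_theta(x) = int p_theta(x|y) q_theta(dy).
assert abs(x1.var() - stat_var) < 0.1
```

Started instead from an arbitrary point mass, the marginal variance would only approach `stat_var` geometrically, which is the nonstationary situation treated in Section 4.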

Our results in the ensuing sections will be established under strong mixing conditions [18] on the Markov chain. Specifically, recall the definition of the *α*-mixing coefficients of a Markov chain {*Xn*}:

**Definition 1** (*α*-mixing coefficient)**.** *Let $\mathcal{M}_i^j$ denote the σ-field generated by the Markov chain $\{X_k : i \le k \le j\}$ parameterized by $\theta \in \Theta$. Then, the α-mixing coefficient is defined as*

$$\alpha_k = \sup_{t>0}\ \sup_{(A,B)\in \mathcal{M}_{-\infty}^{t}\times \mathcal{M}_{t+k}^{\infty}} |P_\theta(A\cap B) - P_\theta(A)P_\theta(B)|. \tag{7}$$

Informally speaking, the *α*-mixing coefficients $\{\alpha_k\}$ measure the dependence between any two events $A$ (in the 'history' *σ*-algebra) and $B$ (in the 'future' *σ*-algebra) separated by a time lag $k$. We note that we do not use superscripts to identify these *α* parameters: they are the only ones carrying subscripts, and can be identified through this.

## **2. A Concentration Bound for the $\alpha^{re}$-Rényi Divergence**

The object of analysis in what follows is the probability measure $\tilde{\pi}_{n,\alpha^{te}|X^n}(\theta)$, the variational approximation to the tempered posterior. Our main result establishes a bound on the Bayes risk of this distribution; in particular, given a sequence of loss functions $\ell_n(\theta,\theta_0)$, we bound $\int \ell_n(\theta,\theta_0)\,\tilde{\pi}_{n,\alpha^{te}|X^n}(\theta)\,d\theta$. Following recent work in both the i.i.d. and dependent sequence settings [14–16], we will use $\ell_n(\theta,\theta_0) = D_{\alpha^{re}}(P_\theta^{(n)}, P_{\theta_0}^{(n)})$, the $\alpha^{re}$-Rényi divergence between $P_\theta^{(n)}$ and $P_{\theta_0}^{(n)}$, as our loss function. Unlike loss functions such as the Euclidean distance, the Rényi divergence compares $\theta$ and $\theta_0$ through their effect on observed sequences, so that issues like parameter identifiability do not arise. Our first result generalizes [15] (Theorem 2.1) to a general non-i.i.d. data setting.

**Proposition 1.** *Let $\mathcal{F}$ be a subset of all probability distributions on $\Theta$. For any $\alpha^{re} \in (0,1)$, $\epsilon \in (0,1)$ and $n \ge 1$, the following probabilistic uniform upper bound on the expected $\alpha^{re}$-Rényi divergence holds:*

$$P\left[\sup_{\rho\in\mathcal{F}}\left\{\int D_{\alpha^{re}}(P_{\theta}^{(n)}, P_{\theta_0}^{(n)})\,\rho(d\theta) - \frac{\alpha^{re}}{1-\alpha^{re}}\int r_n(\theta,\theta_0)\,\rho(d\theta) - \frac{\mathcal{K}(\rho,\pi)+\log\left(\frac{1}{\epsilon}\right)}{1-\alpha^{re}}\right\}\le 0\right] \ge 1-\epsilon. \tag{8}$$

The proof of Proposition 1 follows easily from [15], and we include it in Appendix B.1.1 for completeness. Mirroring the comments in [15], when $\rho = \tilde{\pi}_{n,\alpha^{te}}$ this result is precisely [14] (Theorem 3.4). We also note from [14] that for all $\alpha^{re}, \beta \in (0,1]$ the Rényi divergences are equivalent, in the sense that $\frac{\alpha^{re}(1-\beta)}{\beta(1-\alpha^{re})} D_\beta \le D_{\alpha^{re}} \le D_\beta$ whenever $\alpha^{re} \le \beta$. Hence, for the subsequent results, we simplify by assuming that $\alpha^{te} = \alpha^{re}$. This probabilistic bound implies the following PAC-Bayesian concentration bound on the model risk computed with respect to the fractional variational posterior:

**Theorem 1.** *Let $\mathcal{F}$ be a subset of all probability distributions parameterized by $\Theta$, and assume there exist $\epsilon_n > 0$ and $\rho_n \in \mathcal{F}$ such that*

*i.* $\int \mathcal{K}(P_{\theta_0}^{(n)}, P_{\theta}^{(n)})\,\rho_n(d\theta) = \int \mathbb{E}[r_n(\theta,\theta_0)]\,\rho_n(d\theta) \le n\epsilon_n$,

*ii.* $\int \operatorname{Var}(r_n(\theta,\theta_0))\,\rho_n(d\theta) \le n\epsilon_n$, *and*

*iii.* $\mathcal{K}(\rho_n, \pi) \le n\epsilon_n$.

*Then, for any $\alpha^{re} \in (0,1)$ and $(\epsilon, \eta) \in (0,1)\times(0,1)$,*

$$P\left[\int D_{\alpha^{re}}(P_\theta^{(n)}, P_{\theta_0}^{(n)})\,\tilde{\pi}_{n,\alpha^{re}}(d\theta|X^{(n)}) \le \frac{(\alpha^{re}+1)\,n\epsilon_n + \alpha^{re}\sqrt{\frac{n\epsilon_n}{\eta}} - \log(\epsilon)}{1-\alpha^{re}}\right] \ge 1-\epsilon-\eta. \tag{9}$$

The proof of Theorem 1 is a generalization of [15] (Theorem 2.4) to the non-i.i.d. setting, and a special case of [16] (Theorem 3.1), where the problem setting includes latent variables. We include a proof for completeness. As noted in [15], the sufficient conditions follow closely from [13] and we will show that they hold for a variety of Markov chain models.

A direct corollary of Theorem 1 follows by setting $\eta = \frac{1}{n\epsilon_n}$, $\epsilon = e^{-n\epsilon_n}$ and using the fact that $e^{-n\epsilon_n} \le \frac{1}{n\epsilon_n}$. Note that Equation (9) is vacuous if $\eta + \epsilon > 1$. Therefore, without loss of generality, we restrict ourselves to the condition $\frac{2}{n\epsilon_n} < 1$.

**Corollary 1.** *Assume there exist $\epsilon_n > 0$ and $\rho_n \in \mathcal{F}$ satisfying conditions (i)–(iii) of Theorem 1, with $\frac{2}{n\epsilon_n} < 1$. Then, for any $\alpha^{re} \in (0,1)$,*

$$P\left[\frac{1}{n}\int D_{\alpha^{re}}(P_\theta^{(n)}, P_{\theta_0}^{(n)})\,\tilde{\pi}_{n,\alpha^{re}}(d\theta|X^{(n)}) \le \frac{2(\alpha^{re}+1)\,\epsilon_n}{1-\alpha^{re}}\right] \ge 1-\frac{2}{n\epsilon_n}. \tag{10}$$

We observe that Theorem 1 and Corollary 1 place no assumptions on the nature of the statistical dependence between data points. However, verification of the sufficient conditions is quite hard, in general. One of our key contributions is to verify that under reasonable assumptions on the smoothness of the transition kernel, the sufficient conditions of Theorem 1 and Corollary 1 are satisfied by ergodic Markov chains.

Observe that the first two conditions in Corollary 1 ensure that the distribution *ρ<sup>n</sup>* concentrates on parameters *θ* ∈ Θ around the true parameter *θ*0, while the third condition requires that *ρ<sup>n</sup>* not diverge from the prior *π* rapidly as a function of the sample size *n*. In general, verifying the first and third conditions is relatively straightforward. The second condition, on the other hand, is significantly more complicated in the current setting of dependent data, as the variance of *rn*(*θ*, *θ*0) includes correlations between the observations {*X*0, ... , *Xn*}. In the next section, we will make assumptions on the transition kernels (and corresponding invariant densities) that 'decouple' the temporal correlations and the model parameters in the setting of strongly mixing and ergodic Markov chain models, and allow for the verification of the conditions in Corollary 1. Towards this, Propositions 2 and 3 below characterize the expectation and variance of the log-likelihood ratio *rn*(·, ·) in terms of the one-step transition kernels of the Markov chain. First, consider the expectation of *rn*(·, ·) in condition (i).

**Proposition 2.** *Fix $\theta_1, \theta_2 \in \Theta$ and consider the parameterized Markov transition kernels $p_{\theta_1}$ and $p_{\theta_2}$, and initial distributions $q_{\theta_1}^{(0)}$ and $q_{\theta_2}^{(0)}$. Let $p_{\theta_1}^{(n)}$ and $p_{\theta_2}^{(n)}$ be the corresponding joint probability densities; that is,*

$$p_{\theta_j}^{(n)}(\mathbf{x}_0,\dots,\mathbf{x}_n) = q_{\theta_j}^{(0)}(\mathbf{x}_0)\prod_{i=1}^n p_{\theta_j}(\mathbf{x}_i|\mathbf{x}_{i-1}) \tag{11}$$

*for $j \in \{1,2\}$. Then, for any $n \ge 1$, the log-likelihood ratio $r_n(\theta_2,\theta_1)$ satisfies*

$$\mathbb{E}\_{\theta\_1}[r\_n(\theta\_2, \theta\_1)] = \sum\_{i=1}^n \mathbb{E}\_{\theta\_1} \left[ \log \left( \frac{p\_{\theta\_1}(X\_i | X\_{i-1})}{p\_{\theta\_2}(X\_i | X\_{i-1})} \right) \right] + \mathbb{E}\_{\theta\_1}[Z\_0],\tag{12}$$

*where $Z_0 := \log\frac{q_{\theta_1}^{(0)}(X_0)}{q_{\theta_2}^{(0)}(X_0)}$. The expectation in the $i$-th summand of the first term is with respect to the joint density function $p_{\theta_1}(y, x) = p_{\theta_1}(y|x)\, q_{\theta_1}^{(i-1)}(x)$, where the marginal density satisfies*

$$q_{\theta_1}^{(i-1)}(\mathbf{x}) = \begin{cases} \int p_{\theta_1}^{(i-1)}(\mathbf{x}_0,\dots,\mathbf{x}_{i-2},\mathbf{x})\, d\mathbf{x}_0\cdots d\mathbf{x}_{i-2} & \text{for } i > 1, \text{ and}\\ q_{\theta_1}^{(0)}(\mathbf{x}) & \text{for } i = 1. \end{cases}$$

*If the Markov chain is also stationary under θ*1*, then Equation (12) simplifies to*

$$\mathbb{E}\_{\theta\_1}[r\_n(\theta\_2, \theta\_1)] = n \mathbb{E}\_{\theta\_1} \left[ \log \left( \frac{p\_{\theta\_1}(X\_1 | X\_0)}{p\_{\theta\_2}(X\_1 | X\_0)} \right) \right] + \mathbb{E}\_{\theta\_1}[Z\_0]. \tag{13}$$

Notice that $\mathbb{E}_{\theta_1}[r_n(\theta_2,\theta_1)]$ is precisely the KL divergence $\mathcal{K}(P_{\theta_1}^{(n)}, P_{\theta_2}^{(n)})$. Next, the following proposition uses [19] (Lemma 1.3) to upper bound the variance of the log-likelihood ratio.
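The telescoping identity in Equation (13) can be checked numerically. The sketch below (illustrative, not part of the paper's proofs) uses the stationary Gauss–Markov chain $X_{i+1} = \theta X_i + W_{i+1}$, for which the one-step KL term and the invariant-law KL term are both available in closed form.

```python
import numpy as np

rng = np.random.default_rng(4)

t1, t2, n = 0.5, 0.7, 20
s1, s2 = 1 / (1 - t1**2), 1 / (1 - t2**2)   # invariant variances

reps = 200_000
x = rng.normal(0, np.sqrt(s1), reps)        # X_0 ~ invariant law under theta1
# Z_0 = log q_{t1}(X_0) - log q_{t2}(X_0) for the two Gaussian invariant laws.
rn = 0.5 * np.log(s2 / s1) - x**2 / (2 * s1) + x**2 / (2 * s2)
for _ in range(n):
    w = rng.normal(size=reps)
    xn = t1 * x + w                          # one transition under theta1
    rn += -0.5 * w**2 + 0.5 * (xn - t2 * x)**2   # log p_{t1} - log p_{t2}
    x = xn

# Right-hand side of Eq. (13): n * (one-step KL) + E[Z_0].
step_kl = (t1 - t2)**2 * s1 / 2
kl0 = 0.5 * (s1 / s2 - 1 - np.log(s1 / s2))
assert abs(rn.mean() - (n * step_kl + kl0)) < 0.05
```

The linear-in-$n$ growth of the mean, with a constant initial-distribution correction, is exactly what condition (i) of Theorem 1 exploits.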

**Proposition 3.** *Fix $\theta_1, \theta_2 \in \Theta$ and consider parameterized Markov transition kernels $p_{\theta_1}$ and $p_{\theta_2}$, with initial distributions $q_{\theta_1}^{(0)}$ and $q_{\theta_2}^{(0)}$. Let $p_{\theta_1}^{(n)}$ and $p_{\theta_2}^{(n)}$ be the corresponding joint probability densities of the sequence $(x_0,\dots,x_n)$, and $q_{\theta_j}^{(i)}$ the marginal density for $i \in \{1,\dots,n\}$ and $j \in \{1,2\}$. Fix $\delta > 0$ and, for each $i \in \{1,\dots,n\}$, define*

$$C_{\theta_1,\theta_2}^{(i)} := \int \left|\log\left(\frac{p_{\theta_1}(\mathbf{x}_i|\mathbf{x}_{i-1})}{p_{\theta_2}(\mathbf{x}_i|\mathbf{x}_{i-1})}\right)\right|^{2+\delta} p_{\theta_1}(\mathbf{x}_i|\mathbf{x}_{i-1})\, q_{\theta_1}^{(i-1)}(\mathbf{x}_{i-1})\, d\mathbf{x}_i\, d\mathbf{x}_{i-1}.$$

*Similarly, define $Z_0 := \log\frac{q_{\theta_1}^{(0)}(X_0)}{q_{\theta_2}^{(0)}(X_0)}$ and $D_{1,2} := \mathbb{E}_{\theta_1}\left[|Z_0|^{2+\delta}\right]$. Suppose the Markov chain corresponding to $\theta_1$ is α-mixing with coefficients $\{\alpha_k\}$. Then,*

$$\begin{split} \operatorname{Var}(r_n(\theta_1,\theta_2)) &< \sum_{i,j=1}^n \left(\frac{4}{n} + 2n^{\delta/2}\left(C_{\theta_1,\theta_2}^{(i)} + C_{\theta_1,\theta_2}^{(j)} + \sqrt{C_{\theta_1,\theta_2}^{(i)} C_{\theta_1,\theta_2}^{(j)}}\right)\right) \alpha_{|i-j|-1}^{\delta/(2+\delta)} \\ &\quad + \sum_{i=1}^n \left(\frac{4}{n} + 2n^{\delta/2}\left(C_{\theta_1,\theta_2}^{(i)} + D_{1,2} + \sqrt{C_{\theta_1,\theta_2}^{(i)} D_{1,2}}\right)\right) \alpha_{i-1}^{\delta/(2+\delta)} \\ &\quad + \operatorname{Var}(Z_0). \end{split} \tag{14}$$

Note that this result holds for any parameterized Markov chain. In particular, when the Markov chain is stationary, $C_{\theta_1,\theta_2}^{(i)} = C_{\theta_1,\theta_2}^{(1)}$ for all $i$ and all $\theta \in \Theta$, and Equation (14) simplifies to

$$\begin{split} \operatorname{Var}(r_n(\theta_1,\theta_2)) &< n\left(\frac{4}{n} + 6n^{\delta/2} C_{\theta_1,\theta_2}^{(1)}\right)\left(\sum_{k\ge 0}\alpha_k^{\delta/(2+\delta)}\right) \\ &\quad + \left(\frac{4}{n} + 2n^{\delta/2}\left(C_{\theta_1,\theta_2}^{(1)} + D_{1,2} + \sqrt{C_{\theta_1,\theta_2}^{(1)} D_{1,2}}\right)\right)\left(\sum_{k\ge 1}\alpha_k^{\delta/(2+\delta)}\right) \\ &\quad + \operatorname{Var}(Z_0). \end{split} \tag{16}$$

If the sum $\sum_{k\ge 0}\alpha_k^{\delta/(2+\delta)}$ is infinite, the bound is trivially true. For it to be finite, the coefficients $\alpha_k$ must, of course, decay to zero sufficiently quickly. For instance, Theorem A.1.2 shows that if the Markov chain is geometrically ergodic, then the *α*-mixing coefficients are geometrically decreasing. We will use this fact when the Markov chain is non-stationary, as in Section 4. In the next section, however, we first consider the simpler stationary Markov chain setting, where geometric ergodicity conditions are not explicitly imposed. We also note that unless only a finite number of the $\alpha_k$ are nonzero, the sum $\sum_{k\ge 0}\alpha_k^{\delta/(2+\delta)}$ is infinite when $\delta = 0$, and our results will typically require $\delta > 0$.
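The summability requirement is easy to see concretely: if $\alpha_k \le c\rho^k$ for some $c > 0$ and $\rho \in (0,1)$ (as under geometric ergodicity), then $\sum_k \alpha_k^{\delta/(2+\delta)}$ is a convergent geometric series for any $\delta > 0$, while $\delta = 0$ makes every term equal to 1. A small sketch (illustrative constants only):

```python
# Geometric mixing: alpha_k <= c * rho^k, raised to the power delta/(2+delta).
c, rho, delta = 2.0, 0.9, 1.0
p = delta / (2 + delta)                        # exponent, here 1/3

# Partial sum of the series versus its geometric closed form c^p / (1 - rho^p).
partial = sum((c * rho ** k) ** p for k in range(10_000))
closed_form = c ** p / (1 - rho ** p)

assert abs(partial - closed_form) < 1e-6
```

Note how the effective decay rate $\rho^{\delta/(2+\delta)}$ approaches 1 as $\delta \to 0$, which is the quantitative sense in which smaller $\delta$ demands faster mixing.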

## **3. Stationary Markov Data-Generating Models**

Observe that the PAC-Bayesian concentration bound in Corollary 1 specifically requires bounding the mean and variance of the log-likelihood ratio *rn*(*θ*, *θ*0). We ensure this by imposing regularity conditions on the log-ratio of the one-step transition kernels and the corresponding invariant densities. Specifically, we assume the following conditions that decouple the model parameters from the random samples, allowing us to verify the bounds in Corollary 1.

**Assumption 1.** *There exist positive functions $M_k^{(1)}(\cdot,\cdot)$ and $M_k^{(2)}(\cdot)$, $k \in \{1,2,\dots,m\}$, such that for any parameters $\theta_1, \theta_2 \in \Theta$, the log of the ratio of one-step transition kernels and the log of the ratio of the invariant distributions satisfy, respectively,*

$$|\log p_{\theta_1}(\mathbf{x}_1|\mathbf{x}_0) - \log p_{\theta_2}(\mathbf{x}_1|\mathbf{x}_0)| \le \sum_{k=1}^m M_k^{(1)}(\mathbf{x}_1,\mathbf{x}_0)\,|f_k^{(1)}(\theta_2,\theta_1)| \quad \forall\, (\mathbf{x}_0,\mathbf{x}_1), \text{ and} \tag{17}$$

$$|\log q_{\theta_1}(\mathbf{x}) - \log q_{\theta_2}(\mathbf{x})| \le \sum_{k=1}^m M_k^{(2)}(\mathbf{x})\,|f_k^{(2)}(\theta_2,\theta_1)| \quad \forall\, \mathbf{x}. \tag{18}$$

*We further assume that, for some $\delta > 0$, the functions $f_k^{(1)}$, $f_k^{(2)}$ and $M_k^{(1)}$ satisfy the following:*

*i.* $\int |f_k^{(t)}(\theta,\theta_0)|^{2+\delta}\,\rho_n(d\theta) \le \frac{C}{n}$ *for some constant $C > 0$ and all $k \in \{1,\dots,m\}$, $t \in \{1,2\}$; and*

*ii.* $\mathbb{E}\left[M_k^{(1)}(X_1,X_0)^{2+\delta}\right] < \infty$ *for all $k \in \{1,\dots,m\}$.*
The following examples illustrate Equations (17) and (18) for discrete and continuous state Markov chains.

**Example 1.** *Suppose* {*X*0, ... , *Xn*} *is generated by the birth-death chain with parameterized transition probability mass function,*

$$p_{\theta}(j|i) = \begin{cases} \theta & \text{if } j = i+1,\\ 1-\theta & \text{if } j = i-1. \end{cases}$$

*In this example, the parameter θ denotes the probability of birth. We shall see that $m = 3$: $M_1^{(1)}(X_1,X_0) = I_{[X_1=X_0+1]}$, $M_2^{(1)}(X_1,X_0) = I_{[X_1=X_0-1]}$, and $M_3^{(1)}(X_1,X_0) = 1$. We also define $M_1^{(2)}(X_0) = 1$, and set $M_2^{(2)}(X_0)$ and $M_3^{(2)}(X_0)$ both to $X_0 - 1$. Let $f_1^{(1)}(\theta,\theta_0) = \log\frac{\theta_0}{\theta}$, $f_2^{(1)}(\theta,\theta_0) = \log\frac{1-\theta_0}{1-\theta}$, $f_3^{(1)}(\theta,\theta_0) = 0$, $f_1^{(2)}(\theta,\theta_0) = -f_3^{(2)}(\theta,\theta_0) = \log\frac{1-\theta_0}{1-\theta}$, and $f_2^{(2)}(\theta,\theta_0) = \log\frac{\theta_0}{\theta}$. The derivation of these terms, and the fact that they satisfy the conditions of Assumption 1, is provided in the proof of Proposition 6.*
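For the transition-kernel part of this example, the decomposition can be verified mechanically: on an up-move only the $\log\frac{\theta_0}{\theta}$ term is active, on a down-move only the $\log\frac{1-\theta_0}{1-\theta}$ term, and Equation (17) holds with equality. The sketch below (illustrative; it adopts the 'birth probability θ on $j = i+1$' reading of the kernel) checks this for a few states.

```python
import numpy as np

th1, th2 = 0.3, 0.4

def log_p(theta, x1, x0):
    # Birth-death kernel: up-move with probability theta, down with 1 - theta.
    return np.log(theta) if x1 == x0 + 1 else np.log(1 - theta)

for x0 in range(1, 5):
    for x1 in (x0 - 1, x0 + 1):
        lhs = abs(log_p(th1, x1, x0) - log_p(th2, x1, x0))
        m1 = (x1 == x0 + 1)                    # M^(1)_1, indicator of a birth
        m2 = (x1 == x0 - 1)                    # M^(1)_2, indicator of a death
        f1 = abs(np.log(th1 / th2))            # |f^(1)_1|
        f2 = abs(np.log((1 - th1) / (1 - th2)))  # |f^(1)_2|
        rhs = m1 * f1 + m2 * f2                # the f^(1)_3 = 0 term drops out
        assert abs(lhs - rhs) < 1e-12
```

The indicator envelopes $M_k^{(1)}$ depend only on the sample path, and the $f_k^{(1)}$ only on the parameters, which is exactly the 'decoupling' Assumption 1 is designed to provide.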

**Example 2.** *Suppose $\{X_0,\dots,X_n\}$ is generated by the 'simple linear' Gauss–Markov model*

$$X_n = \theta X_{n-1} + W_n,$$

*where $\{W_n\}$ is a sequence of i.i.d. standard Gaussian random variables. Then $m = 2$, with $M_1^{(1)}(X_n,X_{n-1}) = |X_n X_{n-1}|$, $M_2^{(1)}(X_n,X_{n-1}) = X_n^2$, $M_1^{(2)}(x) = \frac{x^2}{2}$ and $M_2^{(2)}(x) = 0$. Corresponding to these, we have $f_1^{(1)}(\theta,\theta_0) = (\theta-\theta_0)$, $f_2^{(1)}(\theta,\theta_0) = (\theta_0^2-\theta^2)$, $f_1^{(2)}(\theta,\theta_0) = (\theta_0^2-\theta^2)$ and $f_2^{(2)}(\theta,\theta_0) = 0$. The derivation of these quantities, and the fact that they satisfy the conditions of Assumption 1 under an appropriate choice of $\rho_n$, is shown in the proof of Proposition 10.*
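For the Gaussian kernel $p_\theta(x_1|x_0) = N(x_1; \theta x_0, 1)$, expanding the squares gives $\log p_{\theta_1} - \log p_{\theta_2} = x_1 x_0(\theta_1-\theta_2) + \frac{x_0^2}{2}(\theta_2^2-\theta_1^2)$, so the bound (17) holds with envelopes $|x_1 x_0|$ and $x_0^2/2$ (one valid choice, matching Example 2 up to constants). A quick random check of this inequality (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)

t1, t2 = 0.3, -0.5
x0 = rng.normal(size=1000)
x1 = rng.normal(size=1000)

# Exact log-kernel difference for the Gaussian transition densities.
lhs = np.abs(-0.5 * (x1 - t1 * x0) ** 2 + 0.5 * (x1 - t2 * x0) ** 2)
# Lipschitz-type envelope of Eq. (17): M_1 |f_1| + M_2 |f_2|.
rhs = np.abs(x1 * x0) * abs(t1 - t2) + 0.5 * x0 ** 2 * abs(t1 ** 2 - t2 ** 2)

assert np.all(lhs <= rhs + 1e-12)           # triangle inequality, exact
```

The envelopes here are unbounded in the state, which is why ordinary Lipschitz smoothness fails for Gaussian models and the generalized form of Assumption 1 is needed.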

Note that assuming the same number $m$ of functions $M_k^{(1)}$ and $M_k^{(2)}$ involves no loss of generality, since any of these functions can be set to 0. Both Equations (17) and (18) can be viewed as generalized Lipschitz-smoothness conditions, recovering the usual Lipschitz smoothness when $m = 1$ and $f_k^{(t)}$ is the Euclidean distance. Our generalized conditions are useful for distributions like the Gaussian, where Lipschitz smoothness does not apply. From Jensen's inequality we have $\int |f_k^{(t)}(\theta,\theta_0)|\,\rho_n(d\theta) \le \left(\int |f_k^{(t)}(\theta,\theta_0)|^{2+\delta}\,\rho_n(d\theta)\right)^{\frac{1}{2+\delta}}$, and Assumption 1(i) above implies that for some constant $C > 0$ and $k \in \{1,2,\dots,m\}$, $t \in \{1,2\}$,

$$\int |f_k^{(t)}(\theta,\theta_0)|\,\rho_n(d\theta) \le \frac{C}{n^{1/(2+\delta)}}. \tag{19}$$

Assumption 1(i) is satisfied in a variety of scenarios, for example, under mild assumptions on the partial derivatives of the functions $f_k^{(t)}$. To illustrate this, we present the following proposition.

**Proposition 4.** *Let $f(\theta,\theta_0)$ be a function on a bounded domain with bounded partial derivatives and $f(\theta_0,\theta_0) = 0$. Let $\{\rho_n(\cdot)\}$ be a sequence of probability densities on $\theta$ such that $\mathbb{E}_{\rho_n}[\theta] = \theta_0$ and $\operatorname{Var}_{\rho_n}[\theta] = \frac{\sigma^2}{n}$ for some $\sigma > 0$. Then, for some $C > 0$,*

$$\int |f(\theta,\theta_0)|^{2+\delta}\,\rho_n(d\theta) < \frac{C}{n}. \tag{20}$$

**Proof.** Define $\partial_\theta f(\theta,\theta_0) := \frac{\partial f(\theta,\theta_0)}{\partial\theta}$, the partial derivative of $f$. By the mean value theorem, $|f(\theta,\theta_0)| = |\theta-\theta_0|\,|\partial_\theta f(\theta_*,\theta_0)|$ for some $\theta_* \in [\min\{\theta,\theta_0\}, \max\{\theta,\theta_0\}]$. Since the partial derivatives are bounded, there exists $L \in \mathbb{R}$ such that $|\partial_\theta f(\theta_*,\theta_0)| < L$, and $\int |f(\theta,\theta_0)|^{2+\delta}\,\rho_n(d\theta) < L^{2+\delta}\int |\theta-\theta_0|^{2+\delta}\,\rho_n(d\theta)$. Choose $G > 0$ such that $|\theta| < G$; then $\left|\frac{\theta-\theta_0}{2G}\right|^{2+\delta} < \left|\frac{\theta-\theta_0}{2G}\right|^2$. Therefore, $\int |\theta-\theta_0|^{2+\delta}\,\rho_n(d\theta) < (2G)^{2+\delta}\operatorname{Var}\left[\frac{\theta}{2G}\right] = (2G)^\delta\,\frac{\sigma^2}{n}$. Choosing $C = L^{2+\delta}(2G)^\delta\sigma^2$ completes the proof.
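The key moment inequality in this proof is easy to probe numerically. The sketch below (illustrative; the truncated-normal choice of $\rho_n$ is ours, not the paper's) checks that $\int |\theta-\theta_0|^{2+\delta}\,\rho_n(d\theta) \le (2G)^\delta \sigma^2/n$ for a family with variance shrinking like $\sigma^2/n$.

```python
import numpy as np

rng = np.random.default_rng(6)

theta0, sigma, G, delta = 0.5, 1.0, 1.0, 0.5

for n in (10, 100, 1000):
    # rho_n: N(theta0, sigma^2/n) truncated to (-G, G) by rejection, so the
    # support is bounded and Var[theta] <= sigma^2 / n.
    th = rng.normal(theta0, sigma / np.sqrt(n), size=200_000)
    th = th[np.abs(th) < G]
    moment = np.mean(np.abs(th - theta0) ** (2 + delta))
    # The (2+delta)-th central moment obeys the proof's (2G)^delta * sigma^2/n bound.
    assert moment <= (2 * G) ** delta * sigma ** 2 / n
```

The decisive step is that $|\theta-\theta_0|/(2G) < 1$ on the bounded domain, so raising it to a higher power only shrinks it; boundedness of $\Theta$ is essential.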

If $\partial_\theta f_k^{(t)}$ is continuous and $\Theta$ is compact, then $\partial_\theta f_k^{(t)}$ is always bounded. Furthermore, observe that if $\mathbb{E}\left[M_k^{(1)}(X_1,X_0)^{2+\delta}\right] < B$ with $B > 1$ (without loss of generality), then by Jensen's inequality, for all $0 < a < 2+\delta$, $\mathbb{E}\left[M_k^{(1)}(X_1,X_0)^a\right] < B^{\frac{a}{2+\delta}} < B$.

We can now state the main theorem of this section.

**Theorem 2.** *Let $\{X_0,\dots,X_n\}$ be generated by a stationary, α-mixing Markov chain parametrized by $\theta_0 \in \Theta$. Suppose that Assumption 1 holds and that the α-mixing coefficients satisfy $\sum_{k\ge 1}\alpha_k^{\delta/(2+\delta)} < +\infty$. Furthermore, assume that $\mathcal{K}(\rho_n,\pi) \le \sqrt{n}\,C$ for some constant $C > 0$. Then, the conditions of Corollary 1 are satisfied with $\epsilon_n = O\left(\max\left(\frac{1}{\sqrt{n}}, \frac{n^{\delta/2}}{n}\right)\right)$.*

Theorem 2 covers a large class of Markov chains, including chains with countable and continuous state spaces. In particular, if the Markov chain is geometrically ergodic, then it follows from Equation (A4) (in the appendix) that $\sum_{k\ge 1}\alpha_k^{\delta/(2+\delta)} < +\infty$. Observe that in order to achieve $O\left(\frac{1}{\sqrt{n}}\right)$ convergence, we need $\delta \le 1$. Key to the proof of Theorem 2 is the fact that the variance of the log-likelihood ratio can be controlled via the application of Assumption 1 and Proposition 3. Note also that as $\delta$ decreases, satisfying the condition $\sum_{k\ge 1}\alpha_k^{\delta/(2+\delta)} < +\infty$ requires the Markov chain to mix faster.

We now illustrate Theorem 2 for a number of Markov chain models. First, consider a birth-death Markov chain on a finite state space.

**Proposition 5.** *Suppose the data-generating process is a birth-death Markov chain, with one-step transition kernel parametrized by the birth probability $\theta_0 \in \Theta$. Let $\mathcal{F}$ be the set of all Beta distributions, and choose the prior to be a Beta distribution. Then, the conditions of Theorem 2 are satisfied with $\epsilon_n = O\left(\frac{1}{\sqrt{n}}\right)$.*

**Proof.** Proposition 5 follows from the more general Proposition 8 by fixing the initial distribution to the invariant distribution under $\theta_0$; the proof is therefore omitted. We simply refer to the proof of Proposition 8, given under a more general setup, in Appendix B.3.

The birth-death chain on a finite state space is, of course, geometrically ergodic, and its *α*-mixing coefficients $\alpha_k$ decay geometrically. Note that the invariant distribution of this Markov chain is uniform over the state space, making this a particularly simple example. A more complicated and more realistic example is a birth-death Markov chain on the nonnegative integers. We note that if the probability of birth θ in a birth-death Markov chain on the positive integers is greater than 0.5, then the Markov chain is transient and, consequently, not ergodic. Hence, our prior should be chosen to have support within (0, 0.5). For that purpose, we define the class of scaled beta distributions.

**Definition 2** (Scaled Beta)**.** *If X is a beta random variable with parameters a and b, then Y is said to follow a scaled beta distribution with the same parameters on the interval $(c, m+c)$ if*

$$Y = mX + c, \quad (m,c) \in \mathbb{R}^2,$$

*and in that case, the pdf of Y is obtained as*

$$f(y) = \begin{cases} \frac{1}{m \text{Beta}(a,b)} \left(\frac{y-c}{m}\right)^{a-1} \left(1 - \frac{y-c}{m}\right)^{b-1} & \text{if } y \in (c, m+c),\\ 0 & \text{otherwise.} \end{cases}$$

Here, $\mathbb{E}[Y] = m\frac{a}{a+b} + c$ and $\operatorname{Var}[Y] = \frac{m^2 ab}{(a+b)^2(a+b+1)}$. For the birth-death chain, we set $m = 0.5$ and $c = 0$, giving support on $(0, \frac{1}{2})$. Setting $m = 2$ and $c = -1$ gives a beta distribution rescaled to have support on $(-1, 1)$.
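These affine-transformation moment formulas are straightforward to confirm by simulation (an illustrative check, with arbitrary shape parameters):

```python
import numpy as np

rng = np.random.default_rng(7)

# Scaled beta of Definition 2: Y = m*X + c with X ~ Beta(a, b); m = 0.5, c = 0
# gives the support (0, 1/2) used for the birth-death chain's birth probability.
a, b, m, c = 3.0, 5.0, 0.5, 0.0

y = m * rng.beta(a, b, size=500_000) + c
mean = m * a / (a + b) + c
var = m ** 2 * a * b / ((a + b) ** 2 * (a + b + 1))

assert abs(y.mean() - mean) < 1e-3
assert abs(y.var() - var) < 1e-3
```

Choosing $a, b$ as functions of $n$ so that the variance shrinks like $1/n$ is what makes this family a suitable choice of $\rho_n$ in the propositions below.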

**Proposition 6.** *Suppose the data-generating process is a positive recurrent birth-death Markov chain on the positive integers, parameterized by the birth probability $\theta_0 \in (0, \frac{1}{2})$. Further, let $\mathcal{F}$ be the set of all Beta distributions rescaled to have support $(0, \frac{1}{2})$, and choose the prior to be a scaled Beta distribution on $(0, 1/2)$ with parameters a and b. Then, the conditions of Theorem 2 are satisfied with $\epsilon_n = O\left(\frac{1}{\sqrt{n}}\right)$.*

**Proof.** The proof of Proposition 6 (the stationary case) follows from the more general Proposition 9 (the nonstationary case) by fixing the initial distribution to the invariant distribution under $\theta_0$. We therefore omit it and refer to the proof of Proposition 9, carried out under a more general setup, in Appendix B.3.

Unlike with the finite state space, the invariant distribution now depends on the parameter $\theta \in \Theta$, and verification of the conditions of the proposition is more involved. In Appendix A.2, we prove that the class of scaled beta distributions satisfies the condition $\mathcal{K}(\rho_n, \pi) \le n\epsilon_n$ when the prior $\pi$ is a beta or a uniform distribution. This fact allows us to prove the above propositions.

Both Proposition 5 and Proposition 6 assume a discrete state space. The next example considers a strictly stationary simple linear model (as defined in Example 2), which has a continuous, unbounded state space.

**Proposition 7.** *Suppose the data-generating model is a stationary simple linear model:*

$$X_n = \theta_0 X_{n-1} + W_n, \tag{21}$$

*where* $\{W_n\}$ *are i.i.d. standard Gaussian random variables and* $|\theta_0| < 1$*. Suppose that* $\mathcal{F}$ *is the class of all beta distributions rescaled to have support* $(-1, 1)$*. Then, the conditions of Theorem 2 are satisfied with* $\epsilon_n = O\!\left(\frac{1}{\sqrt{n}}\right)$*.*

**Proof.** This is a special case of the more general non-stationary simple linear model which is detailed in Proposition 10. Therefore, the proof of the fact that the simple linear model satisfies Assumption 1 when starting from stationarity is deferred to the proof of Proposition 10. The simple linear model with |*θ*0| < 1 has geometrically decreasing (and therefore summable) *α*-mixing coefficients as a consequence of [20] (eq. (15.49)) and Theorem A.1.2. Combining these two facts, it follows that the conditions of Theorem 2 are satisfied.
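The simple linear model of Proposition 7 is easy to simulate. The sketch below (with an arbitrary illustrative choice $\theta_0 = 0.8$, not from the paper) draws $X_0$ from the stationary distribution $N(0, 1/(1-\theta_0^2))$, so the chain is strictly stationary, and checks the stationary variance empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, n = 0.8, 200_000

# The stationary distribution of X_n = theta0 * X_{n-1} + W_n, with W_n ~ N(0,1),
# is N(0, 1/(1 - theta0^2)); drawing X_0 from it makes the chain strictly stationary.
x = np.empty(n)
x[0] = rng.normal(scale=np.sqrt(1.0 / (1.0 - theta0**2)))
w = rng.normal(size=n)
for t in range(1, n):
    x[t] = theta0 * x[t - 1] + w[t]

# The lag-k autocorrelation of this chain decays geometrically as theta0^k,
# consistent with the geometrically decaying mixing coefficients used in the proof.
emp_var = x.var()
print(emp_var, 1.0 / (1.0 - theta0**2))
```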

Observe that Theorem 1 (and Corollary 1) are general and hold for *any* dependent data-generating process. Therefore, there can be Markov chains that satisfy these results but do not satisfy Assumption 1, which entails some loss of generality. However, as our examples demonstrate, common Markov chain models do indeed satisfy the latter assumption.

#### **4. Non-Stationary, Ergodic Markov Data-Generating Models**

We call a time-homogeneous Markov chain *non-stationary* if the initial distribution $q^{(0)}$ is not the invariant distribution. There are two sets of results in this setting: in Theorem 3 and Theorem 4 we explicitly impose the *α*-mixing condition, while in Theorem 5 we impose an *f*-geometric ergodicity condition (Definition A.1.2 in the appendix). As seen in Equation (A4) (in the appendix), if the Markov chain is also geometrically ergodic, then $\sum_k \alpha_k^{\delta/(2+\delta)} < \infty$ for all $\delta > 0$. This condition can be relaxed, albeit at the risk of more complicated calculations that, nonetheless, mirror those in the geometrically ergodic setting. A common thread through these results is that we must impose some integrability or regularity conditions on the functions $M_k^{(1)}$.

First, in Theorem 3 we assume that the $M_k^{(1)}$ functions in Assumption 1 are uniformly bounded and that the *α*-mixing condition is satisfied. This result holds for both discrete and continuous state space settings.

**Theorem 3.** *Let* $\{X_0, \ldots, X_n\}$ *be generated by an α-mixing Markov chain parametrized by* $\theta_0 \in \Theta$ *with transition probabilities satisfying Assumption 1 and with known initial distribution* $q^{(0)}$*. Let* $\{\alpha_k\}$ *be the α-mixing coefficients under* $\theta_0$*, and assume that* $\sum_{k \ge 1} \alpha_k^{\delta/(2+\delta)} < +\infty$*. Suppose that there exists* $B \in \mathbb{R}$ *such that* $\sup_{x,y} |M_k^{(1)}(x, y)| < B$ *for all* $k \in \{1, 2, \ldots, m\}$ *in Assumption 1. Furthermore, assume that there exists* $\rho_n \in \mathcal{F}$ *such that* $\mathcal{K}(\rho_n, \pi) \le \sqrt{n}\,C$ *for some constant* $C > 0$*. If the initial distribution* $q^{(0)}$ *satisfies* $\mathrm{E}_{q^{(0)}}|M_k^{(2)}(X_0)|^2 < +\infty$ *for all* $k \in \{1, 2, \ldots, m\}$*, then the conditions of Corollary 1 are satisfied with* $\epsilon_n = O\!\left(\max\!\left(\frac{1}{\sqrt{n}}, \frac{n^{\delta/2}}{n}\right)\right)$*.*

The following result in Proposition 8 illustrates Theorem 3 in the setting of a finite state birth-death Markov chain.

**Proposition 8.** *Suppose the data-generating process is a finite state birth-death Markov chain, with one-step transition kernel parametrized by the birth probability* $\theta_0$*. Let* $\mathcal{F}$ *be the set of all beta distributions. We choose the prior on* $\theta_0$ *to be a beta distribution. Then, the conditions of Theorem 3 are satisfied with* $\epsilon_n = O\!\left(\frac{1}{\sqrt{n}}\right)$ *for any initial distribution* $q^{(0)}$*.*

Theorem 3 also applies to data generated by Markov chains with countably infinite state spaces, so long as the class of data-generating Markov chains is strongly ergodic and the initial distribution has finite second moments. The following example demonstrates this in the setting of a birth-death Markov chain on the positive integers, where the initial distribution is assumed to have finite second moments.

**Proposition 9.** *Suppose the data-generating process is a birth-death Markov chain on the nonnegative integers, parameterized by the probability of birth* $\theta_0 \in (0, \frac{1}{2})$*. Further, let* $\mathcal{F}$ *be the set of all beta distributions rescaled to the support* $(0, \frac{1}{2})$*. Let* $q^{(0)}$ *be a probability mass function on the nonnegative integers such that* $\sum_{i=1}^{\infty} i^2 q^{(0)}(i) < +\infty$*. We choose the prior to be a scaled beta distribution on* (0, 1/2) *with parameters a and b. Then, the conditions of Theorem 3 are satisfied with* $\epsilon_n = O\!\left(\frac{1}{\sqrt{n}}\right)$*.*

Since continuous functions on a compact domain are bounded, we have the following (easy) corollary (stated without proof).

**Corollary 2.** *Let* $\{X_0, \ldots, X_n\}$ *be generated by an α-mixing Markov chain parametrized by* $\theta_0 \in \Theta$ *on a compact state space, and with initial distribution* $q^{(0)}$*. Suppose the α-mixing coefficients satisfy* $\sum_{k \ge 1} \alpha_k^{\delta/(2+\delta)} < +\infty$*, and that Assumption 1 holds with continuous functions* $M_k^{(1)}(\cdot, \cdot)$*,* $k \in \{1, 2, \ldots, m\}$*. Furthermore, assume that there exists* $\rho_n$ *such that* $\mathcal{K}(\rho_n, \pi) \le \sqrt{n}\,C$ *for some constant C. Then, Theorem 3 is satisfied with* $\epsilon_n = O\!\left(\max\!\left(\frac{1}{\sqrt{n}}, \frac{n^{\delta/2}}{n}\right)\right)$*.*

In general, the $M_k^{(1)}$ functions will not be uniformly bounded (consider the case of the Gauss–Markov simple linear model in Example 2), and stronger conditions must be imposed on the data-generating Markov chain itself. The following assumption imposes a 'drift' condition from [21]. Specifically, [21] (Theorem 2.3) shows that under the conditions of Assumption 2, the moment generating function of an aperiodic Markov chain $\{X_n\}$ can be upper bounded by a function of the moment generating function of $X_0$. Together with the *α*-mixing condition, Assumption 2 implies that this Markov data-generating process satisfies Corollary 1.

**Assumption 2.** *Consider a Markov chain* $\{X_n\}$ *parameterized by* $\theta_0 \in \Theta$*. Let* $\mathcal{M}_{-\infty}^{n}$ *denote the σ-field generated by* $\{X_{-\infty}, \ldots, X_{n-1}, X_n\}$*. Denote the stochastic process* $\{M_n^k\} := \{M_k^{(1)}(X_n, X_{n-1})\}$*; recall that the* $M_k^{(1)}$*, for each* $k = 1, \ldots, m_1$*, are defined in Assumption 1. For each* $k = 1, \ldots, m$*, assume the process* $\{M_n^k\}$ *satisfies the following conditions:*


Under this drift condition, the next theorem shows that Corollary 1 is satisfied.

**Theorem 4.** *Let* $\{X_0, \ldots, X_n\}$ *be generated by an aperiodic α-mixing Markov chain parametrized by* $\theta_0 \in \Theta$ *and initial distribution* $q^{(0)}$*. Suppose that Assumption 1 and Assumption 2 hold, and that the α-mixing coefficients satisfy* $\sum_{k \ge 1} \alpha_k^{\delta/(2+\delta)} < +\infty$*. Furthermore, assume* $\mathcal{K}(\rho_n, \pi) \le \sqrt{n}\,C$ *for some constant* $C > 0$*. If* $\int e^{\lambda M_k^{(1)}(y, x)}\, p_{\theta_0}(y \mid x)\, q_1^{(0)}(x)\, dx < +\infty$ *for all* $k = 1, \ldots, m_1$*, then the conditions of Corollary 1 are satisfied with* $\epsilon_n = O\!\left(\max\!\left(\frac{1}{\sqrt{n}}, \frac{n^{\delta/2}}{n}\right)\right)$*.*

Verifying the conditions in Theorem 4 can be quite challenging. Instead, we suggest a different approach that requires *f*-geometric ergodicity. Unlike the drift condition in Assumption 2, *f*-geometric ergodicity additionally requires the existence of a petite set. As noted before, geometric ergodicity implies *α*-mixing with geometrically decaying mixing coefficients. As with Theorem 4, we assume for simplicity that the Markov chain is aperiodic.

**Theorem 5.** *Let* $\{X_0, \ldots, X_n\}$ *be generated by an aperiodic Markov chain parametrized by* $\theta_0 \in \Theta$ *with known initial distribution* $q^{(0)}$*, and assumed to be V-geometrically ergodic for some* $V : \mathbb{R}^m \to [1, \infty)$*. Suppose that Assumption 1 holds and* $\int M_k^{(1)}(y, x)^{2+\delta}\, p_{\theta_0}(y \mid x)\, dy < V(x)$ *for all* $k$*,* $x$*, and some* $\delta > 0$*. Furthermore, assume that* $\mathcal{K}(\rho_n, \pi) \le \sqrt{n}\,C$ *for some constant* $C > 0$*. If the initial distribution* $q^{(0)}$ *satisfies* $\mathrm{E}_{q^{(0)}}[V(X_0)] < +\infty$*, then the conditions of Corollary 1 are satisfied with* $\epsilon_n = O\!\left(\max\!\left(\frac{1}{\sqrt{n}}, \frac{n^{\delta/2}}{n}\right)\right)$*.*

The following Proposition 10 shows that the simple linear model satisfies Theorem 5 when the parameter $\theta_0$ is suitably restricted.

**Proposition 10.** *Consider the simple linear model satisfying the equation*

$$X_n = \theta_0 X_{n-1} + W_n, \tag{22}$$

*where* $\{W_n\}$ *are i.i.d. standard Gaussian random variables and* $|\theta_0| < 2^{\frac{1}{4+2\delta}} - 1$ *for* $\delta > 0$*. Let* $\mathcal{F}$ *be the space of all scaled beta distributions on* $(-1, 1)$ *and suppose the prior* $\pi$ *is a uniform distribution on* $(-1, 1)$*. Then, the conditions of Theorem 5 are satisfied with* $\epsilon_n = O\!\left(\max\!\left(\frac{1}{\sqrt{n}}, \frac{n^{\delta/2}}{n}\right)\right)$*, if the initial distribution* $q^{(0)}$ *satisfies* $\mathrm{E}_{q^{(0)}}[X_0^{4+2\delta}] < +\infty$*.*

#### **5. Misspecified Models**

We show next how our results can be extended to the misspecified model setting. Assume that the true data-generating distribution is parametrized by $\theta_0 \notin \Theta$. Let $\theta_n^* := \arg\min_{\theta \in \Theta} \mathcal{K}(P_{\theta_0}^{(n)}, P_{\theta}^{(n)})$ represent the closest parametrized distribution to the data-generating distribution. Further, assume our usual conditions:

$$\begin{array}{ll} i. & \int \mathrm{E}[r_n(\theta, \theta_n^*)]\, \rho_n(d\theta) \le n\epsilon_n, \\ ii. & \int \mathrm{Var}(r_n(\theta, \theta_n^*))\, \rho_n(d\theta) \le n\epsilon_n. \end{array}$$

Now, since $r_n(\theta, \theta_0) = r_n(\theta, \theta_n^*) + r_n(\theta_n^*, \theta_0)$, we have

$$\int \mathcal{K}(P_{\theta_0}^{(n)}, P_{\theta}^{(n)})\, \rho_n(d\theta) \le \mathrm{E}[r_n(\theta_n^*, \theta_0)] + n\epsilon_n. \tag{23}$$

Similarly, decomposing the variance it follows that

$$\mathrm{Var}[r_n(\theta, \theta_0)] = \mathrm{Var}[r_n(\theta, \theta_n^*)] + \mathrm{Var}[r_n(\theta_n^*, \theta_0)] + 2\,\mathrm{Cov}[r_n(\theta, \theta_n^*), r_n(\theta_n^*, \theta_0)]. \tag{24}$$

Using the fact that $2ab \le a^2 + b^2$ on the covariance term $2\,\mathrm{Cov}[r_n(\theta, \theta_n^*), r_n(\theta_n^*, \theta_0)] = 2\,\mathrm{E}[(r_n(\theta, \theta_n^*) - \mathrm{E}[r_n(\theta, \theta_n^*)])(r_n(\theta_n^*, \theta_0) - \mathrm{E}[r_n(\theta_n^*, \theta_0)])]$, we have

$$\mathrm{Var}[r_n(\theta, \theta_0)] \le 2\,\mathrm{Var}[r_n(\theta, \theta_n^*)] + 2\,\mathrm{Var}[r_n(\theta_n^*, \theta_0)]. \tag{25}$$

Integrating both sides with respect to $\rho_n(d\theta)$, we get

$$\begin{split} \int \mathrm{Var}[r_n(\theta, \theta_0)]\, \rho_n(d\theta) &\le 2\int \mathrm{Var}[r_n(\theta, \theta_n^*)]\, \rho_n(d\theta) + 2\int \mathrm{Var}[r_n(\theta_n^*, \theta_0)]\, \rho_n(d\theta) \\ &\le 2n\epsilon_n + 2\,\mathrm{Var}[r_n(\theta_n^*, \theta_0)]. \end{split} \tag{26}$$
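The passage from (24) to (26) rests only on the elementary inequality $\mathrm{Var}[A + B] \le 2\,\mathrm{Var}[A] + 2\,\mathrm{Var}[B]$. A quick Monte Carlo illustration, with arbitrary correlated Gaussian stand-ins for the two risk terms (our own toy choice, not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated A and B stand in for r_n(theta, theta*_n) and r_n(theta*_n, theta_0).
cov = np.array([[1.0, 0.6], [0.6, 2.0]])
z = rng.normal(size=(100_000, 2)) @ np.linalg.cholesky(cov).T
A, B = z[:, 0], z[:, 1]

# Var[A+B] = Var[A] + Var[B] + 2 Cov[A,B] <= 2 Var[A] + 2 Var[B],
# since 2 Cov[A,B] <= Var[A] + Var[B] by 2ab <= a^2 + b^2.
lhs = np.var(A + B)
rhs = 2 * np.var(A) + 2 * np.var(B)
print(lhs, rhs)
```

The inequality is exact for sample variances as well, so it holds on any draw, not just in expectation.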

Consequently, we arrive at the following result:

**Theorem 6.** *Let* $\mathcal{F}$ *be a subset of all probability distributions parameterized by* $\Theta$*. Let* $\theta_n^* = \arg\min_{\theta \in \Theta} \mathcal{K}(P_{\theta_0}^{(n)}, P_{\theta}^{(n)})$ *and assume there exist* $\epsilon_n > 0$ *and* $\rho_n \in \mathcal{F}$ *such that*

$$\begin{array}{ll} i. & \int \mathrm{E}[r_n(\theta, \theta_n^*)]\, \rho_n(d\theta) \le n\epsilon_n, \\ ii. & \int \mathrm{Var}(r_n(\theta, \theta_n^*))\, \rho_n(d\theta) \le n\epsilon_n, \\ iii. & \mathcal{K}(\rho_n, \pi) \le n\epsilon_n. \end{array}$$

*Then, for any* $\alpha^{re} \in (0, 1)$ *and* $(\epsilon, \eta) \in (0, 1) \times (0, 1)$*,*

$$P\left[\int D_{\alpha^{re}}(P_{\theta}^{(n)}, P_{\theta_0}^{(n)})\, \tilde{\pi}_{n, \alpha^{re}}(d\theta \mid X^{(n)}) \le \frac{(\alpha^{re} + 1)n\epsilon_n + \mathrm{E}[r_n(\theta_n^*, \theta_0)] + \alpha^{re}\sqrt{\frac{2n\epsilon_n + 2\mathrm{Var}[r_n(\theta_n^*, \theta_0)]}{\eta}} - \log(\epsilon)}{1 - \alpha^{re}}\right] \ge 1 - \epsilon - \eta. \tag{27}$$

The proof of this theorem is straightforward and follows from the proof of Theorem 1 by plugging the upper bound for the KL divergence from Equation (23) and that for the variance from Equation (26) into Equation (A13). A sketch of the proof is presented in the appendix.

#### **6. Conclusions**

Concentration of the KL-VB model risk, in terms of the expected $\alpha^{re}$-Rényi divergence, is well established under the i.i.d. data-generating model assumption. Here, we extended this to the setting of Markov data-generating models, linking the concentration rate to the mixing and ergodic properties of the Markov model. Our results apply to both stationary and non-stationary Markov chains, as well as to the situation with misspecified models. There remain a number of open questions. An immediate one is to extend the current analysis to continuous-time Markov chains and Markov jump processes, possibly using uniformization of the continuous-time model. Another direction is to extend this to the setting of non-homogeneous Markov chains, where analogues of notions such as stationarity are less straightforward. Further, as noted in the introduction, [14] establish PAC-Bayes bounds under slightly weaker 'existence of test functions' conditions, while our results are established under the stronger conditions used by [15] for the i.i.d. setting. Weakening the conditions in our analysis is important, but complicated. A possible path is to build on results from [22], which provide conditions for the existence of exponentially powerful test functions for distinguishing between two Markov chains. It is also known that there exists a likelihood ratio test separating any two ergodic measures [23]. However, leveraging these to establish PAC-Bayes bounds for the KL-VB posterior is a challenging effort that we leave to future papers. Finally, it is of interest to generalize our PAC-Bayes bounds to posterior approximations beyond KL-variational inference, such as $\alpha^{re}$-Rényi posterior approximations [6] and loss-calibrated posterior approximations [24,25].

**Author Contributions:** Formal analysis, I.B.; Investigation, I.B.; Methodology, I.B., V.A.R. and H.H.; Resources, V.A.R. and H.H.; Validation, V.A.R. and H.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** National Science Foundation: IIS-1816499; DMS-1812197.

**Acknowledgments:** Rao and Honnappa acknowledge support from NSF DMS-1812197. In addition, Rao acknowledges NSF IIS-1816499 for supporting this project.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **Appendix A. Technical Desiderata**

*Appendix A.1. Definitions Related to Markov Chains*

As noted before, ergodicity plays a central role in establishing our results. We consolidate various definitions used throughout the paper in this appendix. Recall that we assume the parameterized Markov chain possesses an invariant probability density or mass function $q_\theta$ under parameter $\theta \in \Theta$. Our results in Section 4 also rely on the ergodic properties of the Markov chain, and we assume that the Markov chain is *f*-geometrically ergodic [20] (Chapter 15). First, we recall the definition of the functional norm $\|\cdot\|_f$ in Definition A.1.1.

**Definition A.1.1** (*f*-norm)**.** *The functional norm in the f-metric of a measure ν, or the f-norm of ν, is*

$$\|\nu\|_f = \sup_{g : |g| \le f} \left| \int g\, d\nu \right|, \tag{A1}$$

*where the supremum is over all measurable functions g satisfying* $|g| \le f$*.*

An immediate consequence of this definition is that if $f_1, f_2$ are two functions such that $f_1 < f_2$ (i.e., at all points in the support of the functions), then

$$\|\nu\|_{f_1} \le \|\nu\|_{f_2}. \tag{A2}$$

Having defined the $\|\cdot\|_f$ norm, we can now define *f*-geometric ergodicity. In the following, we assume the Markov chain is positive Harris; see [20] for a definition. This is a mild and fairly standard assumption in Markov chain theory.

**Definition A.1.2** (*f*-geometric ergodicity)**.** *For any function f , a Markov chain* $\{X_n\}$ *parameterized by* $\theta \in \Theta$ *is said to be f-geometrically ergodic if it is positive Harris and there exists a constant* $r_f > 1$*, which depends on f , such that for any* $A \in \mathcal{B}(\mathcal{X})$*,*

$$\sum_{n=1}^{\infty} r_f^n \left\| P_\theta(X_n \in A \mid X_0 = x) - \int_A q_\theta(y)\, dy \right\|_f < \infty. \tag{A3}$$

It is straightforward to see that this is equivalent to

$$\left\| P_\theta(X_n \in A \mid X_0 = x) - \int_A q_\theta(y)\, dy \right\|_f \le C\, r_f^{-n}$$

for an appropriate constant $C$ (which may depend on the state $x$); that is, the Markov chain approaches steady state at a geometrically fast rate. If a Markov chain is *f*-geometrically ergodic for $f \equiv 1$, then it is simply termed *geometrically ergodic*. It is straightforward to see (via Theorem A.1.2 in the Appendix) that a geometrically ergodic Markov chain is also *α*-mixing, with mixing coefficients satisfying

$$\sum_{k \ge 0} \alpha_k^{\upsilon} < \infty \quad \forall\, \upsilon > 0, \tag{A4}$$

showing that, under geometric ergodicity, the *α*-mixing coefficients raised to any positive power $\upsilon$ are finitely summable. We note here that the most standard procedure for establishing *f*-geometric ergodicity of a Markov chain is the verification of a drift condition. The drift condition is a sufficient condition for a Markov chain to be *f*-geometrically ergodic, as long as there exists a set (called a petite set) towards which the Markov chain drifts (see Assumption A.1.1 in the appendix). If a Markov chain is *f*-geometrically ergodic with $f \equiv V$ for some particular function $V$, then we call it *V*-geometrically ergodic.

We defined *V*-geometric ergodicity in the previous sections. In this section, we provide a sufficient condition for a Markov chain to be *V*-geometrically ergodic. First, we recall the definition of resolvent from [20] (Chapter 5).

**Definition A.1.3** (Resolvent)**.** *Let* $n \in \{0, 1, 2, \ldots\}$ *and let* $q_n$ *satisfy* $q_n \ge 0$ *for all n and* $\sum_{n=0}^{\infty} q_n = 1$*. Note that* $q_n$ *can be thought of as the probability mass function of a random variable "q" taking values on the nonnegative integers. Then, the resolvent of a Markov chain with respect to q is given by* $K_q(x, A)$*, where*

$$K_q(x, A) = \sum_{n=0}^{\infty} q_n P(X_n \in A \mid X_0 = x). \tag{A5}$$

The definition of petite sets follows (see, for reference, [20] (Chapter 5)).

**Definition A.1.4** (Petite Sets)**.** *Let* $X_0, \ldots, X_n$ *be samples from a Markov chain taking values in the state space* $\mathcal{X}$*. A set C is called* $\nu_q$*-petite if*

$$K_q(x, B) \ge \nu_q(B)$$

*for all* $x \in C$ *and* $B \in \mathcal{B}(\mathcal{X})$*, for a non-trivial measure* $\nu_q$ *on* $\mathcal{B}(\mathcal{X})$ *and a probability mass function q on* $\{1, 2, 3, \ldots\}$*.*

Now, let $\Delta V(x) := \mathrm{E}[V(X_n) \mid X_{n-1} = x] - V(x)$ for $V : S \to [1, \infty)$.

**Assumption A.1.1** (Drift condition)**.** *([20] (Chapter 5)). Suppose the chain* $\{X_n\}$ *is aperiodic and ψ-irreducible. Suppose there exist a petite set C, constants* $b < \infty$*,* $\beta > 0$*, and a non-trivial function* $V : S \to [1, \infty)$ *satisfying*

$$\Delta V(x) \le -\beta V(x) + b\, I_{x \in C} \quad \forall\, x \in S. \tag{A6}$$

If a Markov chain drifts towards a petite set, then it is *V*-geometrically ergodic. Suppose, for simplicity, that $V(x) = |x|$. Then, the drift condition becomes $\mathrm{E}[|X_n| \mid X_{n-1}] - |X_{n-1}| \le -\beta |X_{n-1}| + b\, I_{X_{n-1} \in C}$. The left-hand side of this inequality represents the expected change in the state of the Markov chain in one time epoch. Thus, the condition in Assumption A.1.1 essentially states that the Markov chain drifts towards a petite set $C$ and, once it reaches that set, moves to any point in the state space with at least some probability independent of $C$.
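For the simple linear model of Example 2, the drift condition can be checked in closed form with the standard Lyapunov choice $V(x) = 1 + x^2$; the particular $\theta$, $\beta$, and petite-set radius below are our own illustrative picks, not taken from the paper.

```python
import numpy as np

theta = 0.8
beta_ = (1.0 - theta**2) / 2.0        # an assumed drift rate beta

def V(x):
    return 1.0 + x**2

# For X_n = theta*X_{n-1} + W_n with W_n ~ N(0,1):
# E[V(X_n) | X_{n-1} = x] = 1 + theta^2 x^2 + 1, so
# Delta V(x) = theta^2 x^2 + 2 - (1 + x^2) = -(1 - theta^2) x^2 + 1.
def dV(x):
    return -(1.0 - theta**2) * x**2 + 1.0

# Take C = {|x| <= r} with r^2 = (1 + beta)/beta; outside C the
# pure-drift inequality Delta V(x) <= -beta V(x) must hold.
r = np.sqrt((1.0 + beta_) / beta_)
xs = np.linspace(-20, 20, 2001)
outside = np.abs(xs) > r
ok = np.all(dV(xs[outside]) <= -beta_ * V(xs[outside]) + 1e-12)
print(ok)
```

Inside the compact set $C$, the excess is absorbed by the constant $b$, which is exactly the role of the indicator term in (A6).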

**Theorem A.1.1** (Geometrically ergodic theorem)**.** *Suppose that* $\{X_n\}$ *satisfies Assumption A.1.1. Then, the set* $S_V = \{x : V(x) < \infty\}$ *is absorbing, i.e.,* $P_\theta(X_1 \in S_V \mid X_0 = x) = 1$ *for all* $x \in S_V$*, and full, i.e.,* $\psi(S_V^c) = 0$*. Furthermore, there exist constants* $r > 1$*,* $R < \infty$ *such that, for any* $A \in \mathcal{B}(S)$*,*

$$\left\| P_\theta(X_n \in A \mid X_0 = x) - \int_A q_\theta(y)\, dy \right\|_V \le R\, r^{-n} V(x). \tag{A7}$$

Any aperiodic and *ψ*-irreducible Markov chain satisfying the drift condition is geometrically ergodic. A consequence of Equation (A2) is that if $\{X_n\}$ is *V*-geometrically ergodic, then for any other function $U$ such that $|U| < V$, it is also *U*-geometrically ergodic. In essence, a geometrically ergodic Markov chain is asymptotically uncorrelated in a precise sense. Recall the *ρ*-mixing coefficients, defined as follows. Let $\mathcal{A}$ be a sigma field and $L^2(\mathcal{A})$ be the set of square-integrable, real-valued, $\mathcal{A}$-measurable functions.

**Definition A.1.5** (*ρ*-mixing coefficient)**.** *Let* $\mathcal{M}_i^j$ *denote the sigma field generated by the random variables* $X_k$*, where* $i \le k \le j$*. Then,*

$$\rho_k = \sup_{t > 0}\; \sup_{(f, g) \in L^2(\mathcal{M}_{-\infty}^t) \times L^2(\mathcal{M}_{t+k}^{\infty})} |\mathrm{Corr}(f, g)|, \tag{A8}$$

*where* Corr *is the correlation function.*

**Theorem A.1.2.** *If Xn is geometrically ergodic, then it is α-mixing. That is, there exists a constant c* > 0 *such that α<sup>k</sup>* = O(*e*−*ck*)*.*

**Proof.** By [26] (Theorem 2), it follows that a geometrically ergodic Markov chain is asymptotically uncorrelated, with *ρ*-mixing coefficients (see Definition A.1.5) that satisfy $\rho_k = O(e^{-ck})$. Furthermore, it is well known [18,26] that $\alpha_k \le \frac{1}{4}\rho_k$, implying $\alpha_k = O(e^{-ck})$.

*Appendix A.2. Bounding the KL-Divergence between Beta Distributions*

The following results will be utilized in the proofs of Propositions 8–10.

**Lemma A.2.1.** *Let* $\theta_0 \in (0, 1)$*. Let* $\rho_n$ *be a sequence of beta distributions with parameters* $a_n = n\theta_0$ *and* $b_n = n(1 - \theta_0)$*. Let* $\pi$ *denote a uniform distribution,* $U(0, 1)$*. Then,* $\mathcal{K}(\rho_n, \pi) < C + \frac{1}{2}\log(n)$ *for some constant* $C > 0$*.*

**Proof.** Without loss of generality, we can assume $a_n > 1$ and $b_n > 1$. The same form of the result can be obtained in all the other cases by appropriate use of the bounds presented in the proof. We write the KL divergence $\mathcal{K}(\rho_n, \pi)$ as $\int \log\left(\frac{\rho_n}{\pi}\right) \rho_n(d\theta)$. Since $\pi$ is uniform, $\pi(\theta) = 1$ whenever $\theta \in (0, 1)$. Hence, the KL divergence can be written as the negative of the entropy of $\rho_n$, namely $\int_0^1 \log(\rho_n(\theta))\, \rho_n(d\theta)$, which can be written as

$$\begin{split} \mathcal{K}(\rho\_{\boldsymbol{n}}, \pi) = (a\_{\boldsymbol{n}} - 1)\psi(a\_{\boldsymbol{n}}) + (b\_{\boldsymbol{n}} - 1)\psi(b\_{\boldsymbol{n}}) - (a\_{\boldsymbol{n}} + b\_{\boldsymbol{n}} - 2)\psi(a\_{\boldsymbol{n}} + b\_{\boldsymbol{n}}) \\ & - \log \operatorname{Beta}(a\_{\boldsymbol{n}}, b\_{\boldsymbol{n}}), \end{split} \tag{A9}$$

where *ψ* is the digamma function. Using Stirling's approximation on Beta(*an*, *bn*) yields,

$$\text{Beta}(a\_{\boldsymbol{n}}, b\_{\boldsymbol{n}}) = \sqrt{2\pi} \frac{a\_{\boldsymbol{n}}^{a\_{\boldsymbol{n}} - 1/2} b\_{\boldsymbol{n}}^{b\_{\boldsymbol{n}} - 1/2}}{(a\_{\boldsymbol{n}} + b\_{\boldsymbol{n}})^{a\_{\boldsymbol{n}} + b\_{\boldsymbol{n}} - 1/2}} (1 + o(1)).$$

Hence, setting $C_1 = -\log\sqrt{2\pi}$, we can write $-\log \mathrm{Beta}(a_n, b_n)$ as

$$\begin{aligned} -\log \operatorname{Beta}(a\_{\boldsymbol{n}}, b\_{\boldsymbol{n}}) &= \mathbb{C}\_1 - (a\_{\boldsymbol{n}} - \frac{1}{2}) \log(a\_{\boldsymbol{n}}) - (b\_{\boldsymbol{n}} - \frac{1}{2}) \log(b\_{\boldsymbol{n}}) \\ &+ (a\_{\boldsymbol{n}} + b\_{\boldsymbol{n}} - \frac{1}{2}) \log(a\_{\boldsymbol{n}} + b\_{\boldsymbol{n}}) + \log(1 + o(1)). \end{aligned}$$

From [27], we have that $\log(x) - \frac{1}{x} < \psi(x) < \log(x) - \frac{1}{2x}$ for all $x > 0$. Since we assumed $a_n > 1$ and $b_n > 1$, the fact that $\psi(x) < \log(x) - \frac{1}{2x}$ implies

$$\begin{aligned} (a_n - 1)\psi(a_n) &< (a_n - 1)\log(a_n) - \frac{a_n - 1}{2a_n} \quad \text{and} \\ (b_n - 1)\psi(b_n) &< (b_n - 1)\log(b_n) - \frac{b_n - 1}{2b_n}. \end{aligned}$$

Finally, using the fact that $\log(x) - \frac{1}{x} < \psi(x)$, we get

$$-(a\_n + b\_n - 2)\psi(a\_n + b\_n) < -(a\_n + b\_n - 2)\log(a\_n + b\_n) + \frac{a\_n + b\_n - 2}{a\_n + b\_n}.$$

Therefore, after much cancellation, the KL-divergence

$$(a\_n - 1)\psi(a\_n) + (b\_n - 1)\psi(b\_n) - (a\_n + b\_n - 2)\psi(a\_n + b\_n) - \log \operatorname{Beta}(a\_n, b\_n)$$

can be upper bounded by

$$-\frac{1}{2}\log(a\_n) - \frac{1}{2}\log(b\_n) + \frac{3}{2}\log(a\_n + b\_n) + \frac{a\_n + b\_n - 2}{a\_n + b\_n} - \frac{a\_n - 1}{2a\_n} - \frac{b\_n - 1}{2b\_n}.$$

Plugging in the values of $a_n$ and $b_n$, we obtain the following upper bound for the KL divergence:

$$\begin{aligned} \mathcal{K}(\rho_n, \pi) &< -\frac{1}{2}\log(n\theta_0) - \frac{1}{2}\log(n(1 - \theta_0)) + \frac{3}{2}\log(n) + \frac{n - 2}{n} - \frac{n\theta_0 - 1}{2n\theta_0} - \frac{n(1 - \theta_0) - 1}{2n(1 - \theta_0)} \\ &= \frac{1}{2}\log(n) - \frac{1}{2}(\log(\theta_0) + \log(1 - \theta_0)) - \frac{2}{n} + \frac{1}{2n\theta_0} + \frac{1}{2n(1 - \theta_0)} \\ &< C + \frac{1}{2}\log(n), \end{aligned}$$

for some large enough positive constant *C*. This completes our proof.
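Lemma A.2.1 is easy to check numerically from the closed form (A9): the gap $\mathcal{K}(\rho_n, \pi) - \frac{1}{2}\log(n)$ stays bounded as $n$ grows. The following sketch is an illustration with an arbitrary $\theta_0$, using SciPy's digamma and log-beta functions:

```python
import numpy as np
from scipy.special import betaln, digamma

def kl_beta_uniform(a, b):
    """K(Beta(a, b), U(0, 1)): the negative entropy of the Beta distribution,
    computed from the closed form in (A9)."""
    return ((a - 1) * digamma(a) + (b - 1) * digamma(b)
            - (a + b - 2) * digamma(a + b) - betaln(a, b))

theta0 = 0.3
gaps = [kl_beta_uniform(n * theta0, n * (1 - theta0)) - 0.5 * np.log(n)
        for n in (10, 100, 1000, 10_000, 100_000)]
print(gaps)   # the gap stays bounded, as the lemma predicts
```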

**Proposition A.2.1.** *Let* $\theta_0 \in (0, 1)$*. Let* $\rho_n$ *be a sequence of beta distributions with parameters* $a_n = n\theta_0$ *and* $b_n = n(1 - \theta_0)$*. Let* $\pi$ *denote a beta distribution with parameters* $(a, b)$*. Then,* $\mathcal{K}(\rho_n, \pi) < C + \frac{1}{2}\log(n)$ *for some constant* $C > 0$*.*

**Proof.** Without loss of generality, we assume $a > 1$ and $b > 1$. As mentioned in the proof of Lemma A.2.1, the other cases follow similarly. We write the KL divergence between $\rho_n$ and $\pi$ as

$$\mathcal{K}(\rho_n, \pi) = \int \log\left(\frac{\rho_n}{\pi}\right) \rho_n(d\theta) = \int \log\left(\frac{\rho_n}{U}\right) \rho_n(d\theta) + \int \log\left(\frac{U}{\pi}\right) \rho_n(d\theta),$$

where $U$ is the uniform distribution on $(0, 1)$. We analyze the second term in the above expression, which can be written as

$$\begin{aligned} \int \log\left(\frac{U}{\pi}\right) \rho_n(d\theta) &= \int \log\left(\frac{1}{\frac{1}{\mathrm{Beta}(a, b)}\theta^{a-1}(1 - \theta)^{b-1}}\right) \rho_n(d\theta) \\ &= C_1 - (a - 1)\int \log(\theta)\, \rho_n(d\theta) - (b - 1)\int \log(1 - \theta)\, \rho_n(d\theta), \end{aligned}$$

where $C_1 = \log(\mathrm{Beta}(a, b))$. Since $\rho_n$ follows a beta distribution with parameters $a_n = n\theta_0$ and $b_n = n(1 - \theta_0)$, we get

$$\int \log\left(\frac{U}{\pi}\right) \rho_n(d\theta) = C_1 - (a - 1)\left[\psi(a_n) - \psi(a_n + b_n)\right] - (b - 1)\left[\psi(b_n) - \psi(a_n + b_n)\right].$$

Since $\log(x) - \frac{1}{x} < \psi(x) < \log(x) - \frac{1}{2x}$, considering the term $[\psi(a_n) - \psi(a_n + b_n)]$, we get

$$\begin{aligned} -\left[\psi(a\_{\boldsymbol{n}}) - \psi(a\_{\boldsymbol{n}} + b\_{\boldsymbol{n}})\right] &= -\left[\psi(n\theta\_0) - \psi(n\theta\_0 + n(1-\theta\_0))\right] \\ &= -\left[\psi(n\theta\_0) - \psi(n)\right]. \end{aligned}$$

Using the lower bound on *ψ*(*nθ*0) and the upper bound on *ψ*(*n*), we get

$$\begin{aligned} -\left[\psi(a\_n) - \psi(a\_n + b\_n)\right] &< -\log(n\theta\_0) + \frac{1}{n\theta\_0} + \log(n) - \frac{1}{2n} \\ &= -\log(\theta\_0) + \frac{2 - \theta\_0}{2n\theta\_0} .\end{aligned}$$

Similarly, we get that,

$$-\left[\psi(b\_n) - \psi(a\_n + b\_n)\right] < -\log(1 - \theta\_0) + \frac{2 - (1 - \theta\_0)}{2n(1 - \theta\_0)}.$$

Therefore it follows that

$$\begin{split} & \max \{ -(a-1) \left[ \psi(a\_n) - \psi(a\_n + b\_n) \right], -(b-1) \left[ \psi(b\_n) - \psi(a\_n + b\_n) \right] \} \\ & \quad < \max \left\{ (a-1) \left[ -\log(\theta\_0) + \frac{2-\theta\_0}{2n\theta\_0} \right], (b-1) \left[ -\log(1-\theta\_0) + \frac{2-(1-\theta\_0)}{2n(1-\theta\_0)} \right] \right\} \\ & \quad < \mathcal{C}, \end{split}$$

for a large positive constant *C*. Using the above bounds, we finally show that,

$$C\_1 - (a - 1) \left[ \psi(a\_n) - \psi(a\_n + b\_n) \right] - (b - 1) \left[ \psi(b\_n) - \psi(a\_n + b\_n) \right] < C\_1 + 2C,$$

which can be upper bounded by *C*′ for some large constant *C*′. Finally, we upper bound $\int \log\left(\frac{\rho\_n}{\mathcal{U}}\right) \rho\_n(d\theta)$ by Lemma A.2.1, thereby completing the proof.
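The digamma sandwich $\log(x) - 1/x < \psi(x) < \log(x) - 1/(2x)$ used above, and the bound it produces, can be sanity-checked numerically. This is a quick sketch, not part of the proof; it assumes `numpy` and `scipy` are available, and the values of *θ*0 and *n* are arbitrary illustrative choices:

```python
# Numerical sanity check of the digamma sandwich
# log(x) - 1/x < psi(x) < log(x) - 1/(2x), and of the resulting bound
# -[psi(n*theta0) - psi(n)] < -log(theta0) + (2 - theta0)/(2*n*theta0).
import numpy as np
from scipy.special import digamma

theta0, n = 0.3, 50
for x in [0.5, 1.0, 5.0, n * theta0, float(n)]:
    assert np.log(x) - 1.0 / x < digamma(x) < np.log(x) - 1.0 / (2.0 * x)

lhs = -(digamma(n * theta0) - digamma(n))                 # -[psi(a_n) - psi(a_n + b_n)]
rhs = -np.log(theta0) + (2.0 - theta0) / (2.0 * n * theta0)
assert lhs < rhs
```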

#### **Appendix B. Proofs of Main Results**

*Appendix B.1. Proofs for a Concentration Bound for the α-Rényi Divergence*

Appendix B.1.1. Proof of Proposition 1

We start by recalling the variational formula of Donsker and Varadhan [28].

**Lemma B.1.1** (Donsker–Varadhan)**.** *For any probability distribution π on* Θ*, and for any measurable function h* : Θ → R*, if* $\int e^h d\pi < \infty$*, then*

$$\log \int e^h d\pi = \sup\_{\rho \in \mathcal{M}^+(\Theta)} \left\{ \int h d\rho - \mathcal{K}(\rho, \pi) \right\}. \tag{A10}$$
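On a finite parameter space the Donsker–Varadhan formula can be verified directly, since the supremum is attained by the Gibbs measure $\rho^\* \propto e^h \pi$. The following sketch (not part of the proof; `numpy` assumed, with an arbitrary random prior and function *h*) checks both the equality at the optimizer and the inequality at a generic *ρ*:

```python
# Numerical check of the Donsker-Varadhan variational formula on a
# 6-point space: log int e^h dpi = sup_rho { int h drho - KL(rho, pi) },
# with the supremum attained at the Gibbs measure rho* proportional to e^h * pi.
import numpy as np

rng = np.random.default_rng(0)
pi = rng.random(6); pi /= pi.sum()        # a prior on 6 points
h = rng.normal(size=6)                    # a measurable function h

lhs = np.log(np.sum(np.exp(h) * pi))      # log int e^h dpi

rho_star = np.exp(h) * pi                 # Gibbs optimizer of the DV bound
rho_star /= rho_star.sum()
kl = np.sum(rho_star * np.log(rho_star / pi))
rhs = np.sum(h * rho_star) - kl
assert np.isclose(lhs, rhs)

# any other rho gives a value no larger than lhs
rho = rng.random(6); rho /= rho.sum()
val = np.sum(h * rho) - np.sum(rho * np.log(rho / pi))
assert val <= lhs + 1e-12
```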

Now, fix *α* ∈ (0, 1) and *θ* ∈ Θ. First, observe that by the definition of the *α*-Rényi divergence we have

$$E\_{\theta\_0}^{(n)}[\exp(-\alpha\, r\_n(\theta,\theta\_0))] = \exp\left[-(1-\alpha)D\_{\alpha}(P\_{\theta}^{(n)},P\_{\theta\_0}^{(n)})\right].$$

Multiplying both sides of the equation by $\exp[(1-\alpha)D\_{\alpha}(P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)})]$ and integrating with respect to (w.r.t.) *π*(*dθ*), it follows that

$$\begin{split} &\int E\_{\theta\_0}^{(n)} \left[ \exp \left( -\alpha\, r\_n(\theta,\theta\_0) + (1 - \alpha) D\_{\alpha} (P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)}) \right) \right] \pi(d\theta) = 1, \text{ or} \\ &E\_{\theta\_0}^{(n)} \left[ \int \exp \left( -\alpha\, r\_n(\theta,\theta\_0) + (1 - \alpha) D\_{\alpha} (P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)}) \right) \pi(d\theta) \right] = 1. \end{split}$$

Define $h(\theta) := -\alpha\, r\_n(\theta, \theta\_0) + (1 - \alpha) D\_{\alpha}(P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)})$. Then, applying Lemma B.1.1 to the integrand on the left-hand side (l.h.s.) above, it follows that

$$\mathbb{E}\_{\theta\_0}^{(n)}\left[\exp\left(\sup\_{\rho\in\mathcal{M}^+(\Theta)} \left[\int h(\theta)\rho(d\theta) - \mathcal{K}(\rho,\pi)\right]\right)\right] = 1.$$

Multiply both sides of this equation by *ε* > 0 to obtain

$$\mathbb{E}\_{\theta\_0}^{(n)}\left[\exp\left(\sup\_{\rho\in\mathcal{M}^+(\Theta)} \left[\int h(\theta)\rho(d\theta) - \mathcal{K}(\rho,\pi) + \log(\epsilon)\right]\right)\right] = \epsilon.$$

Now, by Markov's inequality, we have

$$P\_{\theta\_0}^{(n)} \left[ \sup\_{\rho \in \mathcal{M}^+(\Theta)} \left\{ \int \left(-\alpha\, r\_n(\theta, \theta\_0) + (1 - \alpha) D\_{\alpha}(P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)})\right) \rho(d\theta) - \mathcal{K}(\rho, \pi) + \log(\epsilon) \right\} \ge 0 \right] \le \epsilon. \tag{A11}$$

Thus, it follows via complementation that

$$\begin{aligned} P\_{\theta\_0}^{(n)} \left[ \forall \rho \in \mathcal{M}^+(\Theta): \int D\_{\alpha}(P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)}) \rho(d\theta) \leq \frac{\alpha}{1 - \alpha} \int r\_n(\theta, \theta\_0) \rho(d\theta) + \frac{\mathcal{K}(\rho, \pi) - \log(\epsilon)}{1 - \alpha} \right] \\ \geq 1 - \epsilon, \end{aligned}$$

thereby completing the proof.
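The moment identity that opens this proof, $E\_{\theta\_0}^{(n)}[\exp(-\alpha r\_n(\theta,\theta\_0))] = \exp[-(1-\alpha)D\_{\alpha}(P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)})]$, can be checked numerically on finite distributions. This is an illustrative sketch only (`numpy` assumed, arbitrary random distributions):

```python
# Check E_{P0}[exp(-alpha * r)] = exp(-(1 - alpha) * D_alpha(P, P0)) on
# finite distributions, where r = log(P0/P) is the log-likelihood ratio
# and D_alpha(P, P0) = log(sum P^alpha * P0^(1 - alpha)) / (alpha - 1).
import numpy as np

rng = np.random.default_rng(1)
p0 = rng.random(5); p0 /= p0.sum()   # "true" distribution P0
p = rng.random(5); p /= p.sum()      # alternative distribution P

for alpha in [0.25, 0.5, 0.9]:
    r = np.log(p0 / p)                                   # log-likelihood ratio
    lhs = np.sum(p0 * np.exp(-alpha * r))                # E_{P0}[exp(-alpha r)]
    d_alpha = np.log(np.sum(p**alpha * p0**(1 - alpha))) / (alpha - 1)
    assert np.isclose(lhs, np.exp(-(1 - alpha) * d_alpha))
```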

Appendix B.1.2. Proof of Theorem 1

Recall the definition of the fractional posterior and the VB approximation,

$$\pi\_{n,\alpha|X^{n}}(d\theta) = \frac{e^{-\alpha r\_{n}(\theta,\theta\_{0})(X^{n})}\,\pi(d\theta)}{\int e^{-\alpha r\_{n}(\gamma,\theta\_{0})(X^{n})}\,\pi(d\gamma)}, \qquad \widehat{\pi}\_{n,\alpha|X^{n}} = \underset{\rho\in \mathcal{F}}{\arg\min}\; \mathcal{K}(\rho,\pi\_{n,\alpha|X^{n}}).$$

It follows by definition of the KL divergence that

$$\widehat{\pi}\_{n,\alpha|X^n} = \underset{\rho \in \mathcal{F}}{\arg\min} \left\{\alpha \int r\_n(\theta, \theta\_0) \rho(d\theta) + \mathcal{K}(\rho, \pi) \right\},\tag{A12}$$

where *π* is the prior distribution. Following Proposition 1, it follows that for any *ε* > 0 and any *ρ* ∈ F,

$$\int D\_{\alpha}(P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)})\, \rho(d\theta) \le \frac{\alpha}{1 - \alpha} \int r\_n(\theta, \theta\_0) \rho(d\theta) + \frac{\mathcal{K}(\rho, \pi) - \log(\epsilon)}{1 - \alpha},$$

with probability 1 − *ε*. We fix an *η* ∈ (0, 1). Using Chebyshev's inequality, we have

$$\begin{split} &P\_{\theta\_0}^{(n)}\left[ \frac{\alpha}{1-\alpha} \int r\_n(\theta,\theta\_0)\rho\_n(d\theta) \ge \frac{\alpha}{1-\alpha} \int \mathbb{E}[r\_n(\theta,\theta\_0)]\rho\_n(d\theta) + \frac{\alpha}{1-\alpha}\sqrt{\frac{\mathrm{Var}\left[\int r\_n(\theta,\theta\_0)\rho\_n(d\theta)\right]}{\eta}} + \frac{\mathcal{K}(\rho\_n,\pi)}{1-\alpha} \right] \\ &= P\_{\theta\_0}^{(n)}\left[ \frac{\alpha}{1-\alpha} \int r\_n(\theta,\theta\_0)\rho\_n(d\theta) - \frac{\alpha}{1-\alpha} \int \mathbb{E}[r\_n(\theta,\theta\_0)]\rho\_n(d\theta) - \frac{\mathcal{K}(\rho\_n,\pi)}{1-\alpha} \ge \frac{\alpha}{1-\alpha}\sqrt{\frac{\mathrm{Var}\left[\int r\_n(\theta,\theta\_0)\rho\_n(d\theta)\right]}{\eta}} \right] \\ &\le \frac{\mathrm{Var}\left[ \frac{\alpha}{1-\alpha}\int r\_n(\theta,\theta\_0)\rho\_n(d\theta) - \frac{\alpha}{1-\alpha}\int \mathbb{E}[r\_n(\theta,\theta\_0)]\rho\_n(d\theta) - \frac{\mathcal{K}(\rho\_n,\pi)}{1-\alpha} \right]}{\frac{\alpha^2}{(1-\alpha)^2}\,\frac{\mathrm{Var}\left[\int r\_n(\theta,\theta\_0)\rho\_n(d\theta)\right]}{\eta}}. \end{split}$$

Note that $\frac{\alpha}{1-\alpha}\int \mathbb{E}[r\_n(\theta,\theta\_0)]\rho\_n(d\theta)$ and $\frac{\mathcal{K}(\rho\_n,\pi)}{1-\alpha}$ are constants with respect to the data, implying

$$\begin{split} &\mathrm{Var}\left[\frac{\alpha}{1-\alpha} \int r\_{n}(\theta,\theta\_{0})\rho\_{n}(d\theta) - \frac{\alpha}{1-\alpha} \int \mathbb{E}[r\_{n}(\theta,\theta\_{0})] \rho\_{n}(d\theta) - \frac{\mathcal{K}(\rho\_{n},\pi)}{1-\alpha}\right] \\ &= \frac{\alpha^{2}}{(1-\alpha)^{2}} \mathrm{Var}\left[\int r\_{n}(\theta,\theta\_{0})\rho\_{n}(d\theta)\right]. \end{split}$$

Therefore, we have

$$P\_{\theta\_{0}}^{(n)}\left[\frac{\alpha}{1 - \alpha} \int r\_{n}(\theta, \theta\_{0}) \rho\_{n}(d\theta) \geq \frac{\alpha}{1 - \alpha} \int \mathbb{E}[r\_{n}(\theta, \theta\_{0})] \rho\_{n}(d\theta) + \frac{\alpha}{1 - \alpha} \sqrt{\frac{\mathrm{Var}\left[\int r\_{n}(\theta, \theta\_{0}) \rho\_{n}(d\theta)\right]}{\eta}} + \frac{\mathcal{K}(\rho\_{n}, \pi)}{1 - \alpha}\right] \leq \eta.$$

From Proposition 1, with probability 1 − *ε* the following holds

$$\int D\_{\alpha}(P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)}) \,\widehat{\pi}\_{n, \alpha | X^n}(d\theta) \leq \frac{\alpha}{1-\alpha} \int r\_n(\theta, \theta\_0) \rho\_n(d\theta) + \frac{\mathcal{K}(\rho\_{n}, \pi) - \log(\epsilon)}{1 - \alpha}.$$

Therefore, with probability 1 − *η* − *ε* the following statement holds

$$\begin{split} \int D\_{\alpha}(P\_{\theta}^{(n)}, P\_{\theta\_{0}}^{(n)})\, \widehat{\pi}\_{n, \alpha | X^{n}}(d\theta) &\leq \frac{\alpha}{1 - \alpha} \int \mathcal{K}(P\_{\theta\_{0}}^{(n)}, P\_{\theta}^{(n)}) \rho\_{n}(d\theta) \\ &+ \frac{\alpha}{1 - \alpha} \sqrt{\frac{\mathrm{Var}\left[\int r\_{n}(\theta, \theta\_{0}) \rho\_{n}(d\theta)\right]}{\eta}} \\ &+ \frac{\mathcal{K}(\rho\_{n}, \pi) - \log(\epsilon)}{1 - \alpha}. \end{split} \tag{A13}$$

Next, we observe that

$$\begin{split} \text{Var}\left[\int r\_n(\theta,\theta\_0)\rho\_n(d\theta)\right] &= E\_{\theta\_0}^{(n)}\left[\left|\int r\_n(\theta,\theta\_0)\rho\_n(d\theta) - E\left[\int r\_n(\theta,\theta\_0)\rho\_n(d\theta)\right]\right|^2\right] \\ &\leq \int \text{Var}[r\_n(\theta,\theta\_0)]\rho\_n(d\theta), \end{split}$$

by a straightforward application of Jensen's inequality to the inner integral on the left hand side. Finally, following the hypotheses (i), (ii) and (iii), we have,

$$\begin{split} \int D\_{\alpha}(P\_{\theta}^{(n)}, P\_{\theta\_0}^{(n)})\, \widehat{\pi}\_{n,\alpha|X^n}(d\theta) &\leq \frac{\alpha}{1-\alpha} \left[ \int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta) + \sqrt{\frac{\int \mathrm{Var}[r\_n(\theta,\theta\_0)]\rho\_n(d\theta)}{\eta}} \right] + \frac{\mathcal{K}(\rho\_n,\pi) - \log(\epsilon)}{1-\alpha} \\ &\leq \frac{\alpha\left(n\epsilon\_n + \sqrt{\frac{n\epsilon\_n}{\eta}}\right)}{1-\alpha} + \frac{n\epsilon\_n - \log(\epsilon)}{1-\alpha}, \end{split}$$

thereby concluding the proof.

Appendix B.1.3. Proof of Proposition 2

We define $Y\_i := \log \frac{p\_{\theta\_1}(X\_i|X\_{i-1})}{p\_{\theta\_2}(X\_i|X\_{i-1})}$ for *i* = 1, ... , *n*, and $Z\_0 := \log \frac{q\_1^{(0)}(X\_0)}{q\_2^{(0)}(X\_0)}$. Then, using the Markov property, we can see that the Kullback–Leibler divergence between the joint distributions $P\_{\theta\_1}^{(n)}$ and $P\_{\theta\_2}^{(n)}$ satisfies $\mathcal{K}\big(P\_{\theta\_1}^{(n)}, P\_{\theta\_2}^{(n)}\big) = \sum\_{i=1}^{n} \mathbb{E}\_{\theta\_1}[Y\_i] + \mathbb{E}\_{\theta\_1}[Z\_0]$. If the Markov chain {*Xi*} is stationary under *θ*1, so is {*Yi*}. Hence $Y\_i \stackrel{d}{=} Y\_1$ and the above equation reduces to,

$$\mathcal{K}\left(P\_{\theta\_1}^{(n)}, P\_{\theta\_2}^{(n)}\right) = n \mathbf{E}\_{\theta\_1}[Y\_1] + \mathbf{E}\_{\theta\_1}[Z\_0].\tag{A14}$$
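The decomposition (A14) can be verified exactly on a toy example by enumerating all paths of a two-state stationary Markov chain. This is a sketch, not part of the proof; the kernels below are arbitrary illustrative choices, and `numpy` is assumed:

```python
# Check (A14) on a two-state stationary Markov chain: the KL divergence
# between the joint laws of (X_0, ..., X_n) equals n * E[Y_1] + E[Z_0].
import itertools
import numpy as np

P1 = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition kernel under theta_1
P2 = np.array([[0.5, 0.5], [0.2, 0.8]])   # transition kernel under theta_2

def stationary(P):
    """Invariant distribution of a transition matrix P."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

q1, q2 = stationary(P1), stationary(P2)   # initial (invariant) distributions
n = 4

def path_prob(path, q, P):
    pr = q[path[0]]
    for a, b in zip(path, path[1:]):
        pr *= P[a, b]
    return pr

# KL divergence between the joint laws, computed by brute-force enumeration
kl_direct = sum(
    path_prob(w, q1, P1) * np.log(path_prob(w, q1, P1) / path_prob(w, q2, P2))
    for w in itertools.product([0, 1], repeat=n + 1)
)

# E[Y_1] under stationarity, plus the initial-distribution term E[Z_0]
EY1 = sum(q1[a] * P1[a, b] * np.log(P1[a, b] / P2[a, b])
          for a in range(2) for b in range(2))
EZ0 = sum(q1[a] * np.log(q1[a] / q2[a]) for a in range(2))
assert np.isclose(kl_direct, n * EY1 + EZ0)
```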

Appendix B.1.4. Proof of Proposition 3

First, recall the following result from [19].

**Lemma B.1.2** ([19], Lemma 1.2)**.** *Let X*−∞, ... , *X*1, *X*2, ... *be an α-mixing Markov chain with α-mixing coefficients given by αk. Let* $\mathcal{M}\_a^b$ *be the sigma-field generated by the subsequence* (*Xa*, *Xa*+1, ... , *Xb*)*. Let* $\eta\_t \in \mathcal{M}\_{-\infty}^t$ *and* $\tau\_t \in \mathcal{M}\_{t+k}^{\infty}$ *be adapted random variables such that* |*ηt*| ≤ 1, |*τt*| ≤ 1*. Then,*

$$\sup\_{t} \sup\_{\eta\_{t}, \tau\_{t}} |\mathbb{E}[\eta\_{t}\tau\_{t}] - \mathbb{E}[\eta\_{t}]\mathbb{E}[\tau\_{t}]| \le 4\alpha\_{k}.\tag{A15}$$

This lemma yields an upper bound on the covariance of the random variables *η* and *τ*, as shown next.

**Lemma B.1.3.** *Let* $\eta \in \mathcal{M}\_{-\infty}^t$*,* $\tau \in \mathcal{M}\_{t+k}^{\infty}$ *be such that* $E|\eta|^{2+\delta} \le C\_1$*,* $E|\tau|^{2+\delta} \le C\_2$ *for some δ* > 0*. Then, for a fixed n* < +∞*, we have*

$$|\mathrm{E}\eta\tau - \mathrm{E}\eta\mathrm{E}\tau| \le \left(\frac{4}{n} + 2n^{\delta/2}(C\_1 + C\_2) + 2n^{\delta/2}\sqrt{C\_1 C\_2}\right)\alpha\_k^{\delta/(2+\delta)}.\tag{A16}$$

**Proof.** Let *N* < +∞ be a fixed number. We get from the triangle inequality that

$$\begin{split} |\mathrm{E}\eta\mathrm{\tau} - \mathrm{E}\eta\mathrm{\mathrm{E}\tau}| &\leq |\mathrm{E}\eta\mathrm{\tau}I\_{\left[|\eta|\leq N, |\tau|\leq N\right]} - \mathrm{E}\eta I\_{\left[|\eta|\leq N\right]} \mathrm{E}\tau I\_{\left[|\tau|\leq N\right]}| \\ &\quad + |\mathrm{E}\eta\mathrm{\tau}I\_{\left[|\eta|\geq N, |\tau|\leq N\right]} - \mathrm{E}\eta I\_{\left[|\eta|\geq N\right]} \mathrm{E}\tau I\_{\left[|\tau|\leq N\right]}| \\ &\quad + |\mathrm{E}\eta\mathrm{\tau}I\_{\left[|\eta|\leq N, |\tau|\geq N\right]} - \mathrm{E}\eta I\_{\left[|\eta|\leq N\right]} \mathrm{E}\tau I\_{\left[|\tau|\geq N\right]}| \\ &\quad + |\mathrm{E}\eta\mathrm{\tau}I\_{\left[|\eta|\geq N, |\tau|\geq N\right]} - \mathrm{E}\eta I\_{\left[|\eta|\geq N\right]} \mathrm{E}\tau I\_{\left[|\tau|\geq N\right]}| .\end{split}$$

Multiplying and dividing the first term by *N*<sup>2</sup> and applying Lemma B.1.2, we get $|\mathrm{E}\eta\tau I\_{[|\eta|\le N, |\tau|\le N]} - \mathrm{E}\eta I\_{[|\eta|\le N]}\mathrm{E}\tau I\_{[|\tau|\le N]}| \le 4N^2\alpha\_k$. For the second term, if |*τ*| ≤ *N*, then *τ* ≤ *N* and *τ* ≥ −*N*. Plugging this in the second term we get,

$$\begin{split} \left| \operatorname{E} \eta \tau I\_{\left[|\eta| \geq N, |\tau| \leq N\right]} - \operatorname{E} \eta I\_{\left[|\eta| \geq N\right]} \operatorname{E} \tau I\_{\left[|\tau| \leq N\right]} \right| &\leq \left| N \operatorname{E} \eta I\_{\left[|\eta| \geq N\right]} + N \left[ \operatorname{E} \eta I\_{\left[|\eta| \geq N\right]} \right] \right| \\ &= 2N \left| \operatorname{E} \eta I\_{\left[|\eta| \geq N\right]} \right|. \end{split} \tag{A18}$$

Since |*η*| ≥ *N*, we have $1 \le \frac{|\eta|^{1+\delta}}{N^{1+\delta}}$. Following this,

$$\left| 2N \mathrm{E}\eta\, I\_{[|\eta| \geq N]} \right| \leq 2N \left| \mathrm{E}\left[ \frac{|\eta|^{2+\delta}}{N^{1+\delta}} I\_{[|\eta| \geq N]} \right] \right| \tag{A20}$$

$$\leq 2N \frac{1}{N^{1+\delta}}\, \mathrm{E}|\eta|^{2+\delta} \leq \frac{2C\_1}{N^{\delta}}.\tag{A21}$$

Similarly, for the third term, $|\mathrm{E}\eta\tau I\_{[|\eta|\le N,|\tau|\ge N]} - \mathrm{E}\eta I\_{[|\eta|\le N]}\mathrm{E}\tau I\_{[|\tau|\ge N]}| \le \frac{2C\_2}{N^\delta}$. Finally, for the last term, we get by the Cauchy–Schwarz inequality,

$$\left| \mathrm{E}\eta\tau I\_{[|\eta| \geq N, |\tau| \geq N]} - \mathrm{E}\eta I\_{[|\eta| \geq N]} \mathrm{E}\tau I\_{[|\tau| \geq N]} \right| \leq \sqrt{\mathrm{Var}\left[ \eta I\_{[|\eta| \geq N]} \right] \mathrm{Var}\left[ \tau I\_{[|\tau| \geq N]} \right]} \tag{A22}$$

$$\leq 2\sqrt{\mathrm{Var}\left[\eta I\_{[|\eta|\geq N]}\right]\mathrm{Var}\left[\tau I\_{[|\tau|\geq N]}\right]}\tag{A23}$$

$$\leq 2\sqrt{\mathbb{E}\left[\eta^2 I\_{\left[|\eta|\geq N\right]}\right] \mathbb{E}\left[\mathfrak{r}^2 I\_{\left[|\tau|\geq N\right]}\right]}.\tag{A24}$$

Since |*η*| ≥ *N*, $1 \le \frac{|\eta|^{\delta}}{N^{\delta}}$; similarly, $1 \le \frac{|\tau|^{\delta}}{N^{\delta}}$. Plugging these in the previous equation, we get,


$$\sqrt{\mathbb{E}\left[\eta^2 I\_{\left[|\eta|\geq N\right]}\right]\mathbb{E}\left[\tau^2 I\_{\left[|\tau|\geq N\right]}\right]} \leq \sqrt{\frac{1}{N^{2\delta}}\mathbb{E}\left[|\eta|^{2+\delta} I\_{\left[|\eta|\geq N\right]}\right]\mathbb{E}\left[|\tau|^{2+\delta} I\_{\left[|\tau|\geq N\right]}\right]}\tag{A25}$$

$$\leq \frac{1}{N^{\delta}}\sqrt{C\_1 C\_2}.\tag{A26}$$

Combining the four upper bounds above, we get,

$$|\mathrm{E}\eta\tau - \mathrm{E}\eta\mathrm{E}\tau| \le 4N^2 \alpha\_k + \frac{2}{N^\delta}(C\_1 + C\_2) + \frac{2}{N^\delta}\sqrt{C\_1 C\_2}.\tag{A27}$$

Now, in particular, setting $N = n^{-1/2}\alpha\_k^{-1/(2+\delta)}$, it follows that

$$|\mathrm{E}\eta\tau - \mathrm{E}\eta\mathrm{E}\tau| \leq \frac{4}{n}\alpha\_{k}^{\delta/(2+\delta)} + 2n^{\delta/2}\alpha\_{k}^{\delta/(2+\delta)}(C\_{1}+C\_{2}) + 2n^{\delta/2}\alpha\_{k}^{\delta/(2+\delta)}\sqrt{C\_{1}C\_{2}} \tag{A28}$$

$$= \left(\frac{4}{n} + 2n^{\delta/2}(C\_1 + C\_2) + 2n^{\delta/2}\sqrt{C\_1 C\_2}\right) \alpha\_k^{\delta/(2+\delta)}.\tag{A29}$$
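The substitution of $N = n^{-1/2}\alpha\_k^{-1/(2+\delta)}$ into (A27), and in particular the exponent $\delta/(2+\delta)$ on $\alpha\_k$, can be checked numerically. A quick sketch with arbitrary test values (standard library only):

```python
# Numerical check that substituting N = n^(-1/2) * alpha_k^(-1/(2+delta))
# into the bound (A27) gives exactly the rate in (A29).
import math

def bound_A27(N, alpha_k, delta, C1, C2):
    return (4 * N**2 * alpha_k + 2 / N**delta * (C1 + C2)
            + 2 / N**delta * math.sqrt(C1 * C2))

def bound_A29(n, alpha_k, delta, C1, C2):
    return (4 / n + 2 * n**(delta / 2) * (C1 + C2)
            + 2 * n**(delta / 2) * math.sqrt(C1 * C2)) * alpha_k**(delta / (2 + delta))

for (n, alpha_k, delta, C1, C2) in [(10, 0.3, 1.0, 2.0, 5.0),
                                    (200, 0.05, 0.4, 1.0, 3.0)]:
    N = n**-0.5 * alpha_k**(-1 / (2 + delta))
    assert math.isclose(bound_A27(N, alpha_k, delta, C1, C2),
                        bound_A29(n, alpha_k, delta, C1, C2), rel_tol=1e-9)
```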

**Lemma B.1.4.** *Let* {*Xt*} *be an α-mixing Markov chain with mixing coefficients αk. Further assume that* $\mathrm{E}|X\_t|^{2+\delta} \le C\_1$ *and* $\mathrm{E}|X\_{t+k}|^{2+\delta} \le C\_2$ *for some δ* > 0*. Then, for any t and any n* > 0*,*

$$|\mathrm{Cov}(X\_t, X\_{t+k})| \le \left(\frac{4}{n} + 2n^{\delta/2}(C\_1 + C\_2) + 2n^{\delta/2}\sqrt{C\_1 C\_2}\right) \alpha\_k^{\delta/(2+\delta)}.\tag{A30}$$

**Proof.** Set *η* = *Xt*, *τ* = *Xt*<sup>+</sup>*<sup>k</sup>* in Lemma B.1.3.

We also need to establish the following technical lemma.

**Lemma B.1.5.** *Let* {*Xt*} *be an α-mixing Markov chain with mixing coefficients* {*αt*}*. Then the process* {*Yt*}*, where* $Y\_t := \log \frac{p\_{\theta\_0}(X\_t|X\_{t-1})}{p\_{\theta}(X\_t|X\_{t-1})}$*, is also α-mixing with mixing coefficients* $\{\tilde{\alpha}\_t\}$ *where* $\tilde{\alpha}\_t = \alpha\_{t-1}$*.*

**Proof.** Denote by *Zi* the pair (*Xi*, *Xi*−1). Let $\mathcal{M}\_i^j$ denote the sigma-field generated by *Xk* for *i* ≤ *k* ≤ *j*, and let $\mathcal{G}\_i^j$ denote the sigma-field generated by *Zk* for *i* ≤ *k* ≤ *j*. Let $C \in \mathcal{M}\_{i-1}^j$. Then *C* can be expressed as $C\_{i-1} \times C\_i \times \cdots \times C\_j$ for $C\_{i-1} \in \mathcal{M}\_{i-1}^{i-1}$, $C\_i \in \mathcal{M}\_i^i$, and so on. Now, consider the map $T\_i^j : (C\_{i-1} \times C\_i \times \cdots \times C\_j) \longmapsto (C\_{i-1} \times C\_i \times C\_i \times \cdots \times C\_{j-1} \times C\_{j-1} \times C\_j)$. Note that $T\_i^j(C) \in \mathcal{G}\_i^j$. It is easy to see that $\mathcal{G}\_i^j = T\_i^j(\mathcal{M}\_{i-1}^j) \cup \mathcal{M}\_{i-1}^{\*j}$, where $T\_i^j(\mathcal{M}\_{i-1}^j)$ is obtained by applying the map $T\_i^j$ to each element of $\mathcal{M}\_{i-1}^j$. If we take this latter set as the range and $\mathcal{M}\_{i-1}^j$ as the domain, then, by construction, $T\_i^j$ is a bijection. Furthermore, the two classes are made of disjoint sets, i.e., if $A \in T\_i^j(\mathcal{M}\_{i-1}^j)$ and $A^\* \in \mathcal{M}\_{i-1}^{\*j}$, then $A \cap A^\* = \emptyset$. Furthermore, note that $\mathcal{M}\_{i-1}^{\*j}$ is made of impossible sets, i.e., $P(A^\*) = 0$ for all $A^\* \in \mathcal{M}\_{i-1}^{\*j}$. Now consider the *α*-mixing coefficients for *Zi*. By definition,

$$\begin{split} \alpha\_k^z &= \sup\_i \sup\_{A \in \mathcal{G}\_{-\infty}^i,\, B \in \mathcal{G}\_{i+k}^{\infty}} |P(A \cap B) - P(A)P(B)| \\ &= \sup\_i \sup\_{A \in \mathcal{G}\_{-\infty}^i,\, B \in \mathcal{G}\_{i+k}^{\infty}} |P((A^o \cup A^\*) \cap (B^o \cup B^\*)) - P(A^o \cup A^\*)P(B^o \cup B^\*)|, \end{split}$$

where,

$$\begin{array}{ll} A = A^o \cup A^\*, & B = B^o \cup B^\*, \\ A^o \in T\_{-\infty}^i(\mathcal{M}\_{-\infty}^i), & A^\* \in \mathcal{M}\_{-\infty}^{\*i}, \\ B^o \in T\_{i+k-1}^\infty(\mathcal{M}\_{i+k-1}^\infty), & B^\* \in \mathcal{M}\_{i+k-1}^{\*\infty}. \end{array}$$

Then, the expression for the *α*-mixing coefficient can be reduced into

$$\alpha\_k^z = \sup\_i \sup\_{A^o \in T\_{-\infty}^i(\mathcal{M}\_{-\infty}^i),\, B^o \in T\_{i+k-1}^\infty(\mathcal{M}\_{i+k-1}^\infty)} |P(A^o \cap B^o) - P(A^o)P(B^o)|.$$

Note that, by the bijection property of $T\_i^j$, we can find $A' \in \mathcal{M}\_{-\infty}^i$ and $B' \in \mathcal{M}\_{i+k-1}^{\infty}$ such that

$$\begin{split} \alpha\_k^z &= \sup\_i \sup\_{A' \in \mathcal{M}\_{-\infty}^i,\, B' \in \mathcal{M}\_{i+k-1}^\infty} |P(T\_{-\infty}^i(A') \cap T\_{i+k-1}^\infty(B')) - P(T\_{-\infty}^i(A'))P(T\_{i+k-1}^\infty(B'))| \\ &= \alpha\_{k-1}. \end{split}$$

Now, $\log \frac{p\_{\theta\_0}(X\_n|X\_{n-1})}{p\_{\theta}(X\_n|X\_{n-1})}$ is just a function of the paired Markov chain *Zi*; therefore, it has *α*-mixing coefficients *αk*−1.

We now proceed to the proof of Proposition 3. Let {*Xk*} be a stationary *α*-mixing Markov chain under *θ*<sup>1</sup> with mixing coefficients {*αk*}. Observe that the log-likelihood ratio can be expressed as

$$\begin{aligned} r\_n(\theta\_2, \theta\_1) &= \sum\_{i=1}^n \log \left( \frac{p\_{\theta\_1}(X\_i | X\_{i-1})}{p\_{\theta\_2}(X\_i | X\_{i-1})} \right) + \log \left( \frac{q\_1^{(0)}(X\_0)}{q\_2^{(0)}(X\_0)} \right), \\ &\equiv \sum\_{i=1}^n \mathbf{Y}\_i + \mathbf{Z}\_0. \end{aligned}$$

Therefore, the variance of the log-likelihood ratio is simply

$$\begin{aligned} \mathrm{Var}\_{\theta\_1} \left[ r\_n(\theta\_2, \theta\_1) \right] &= \mathrm{Var}\_{\theta\_1} \left[ \sum\_{i=1}^n Y\_i + Z\_0 \right] \\ &= \sum\_{i,j=1}^n \mathrm{Cov}\_{\theta\_1} (Y\_i, Y\_j) + \sum\_{i=1}^n \mathrm{Cov}\_{\theta\_1} (Y\_i, Z\_0) + \mathrm{Cov}\_{\theta\_1} (Z\_0, Z\_0). \end{aligned}$$

It follows from Lemma B.1.5 that {*Yk*} is a stochastic process with *α*-mixing coefficients *αk*−1. Therefore, using Lemma B.1.4 we have

$$\begin{split} |\mathrm{Cov}\_{\theta\_{1}}(Y\_{i},Y\_{j})| &= |\mathbb{E}\_{\theta\_{1}}[Y\_{i}Y\_{j}] - \mathbb{E}\_{\theta\_{1}}[Y\_{i}]\mathbb{E}\_{\theta\_{1}}[Y\_{j}]| \\ &< \left(\frac{4}{n} + 2n^{\delta/2}\left(\mathbb{E}\_{\theta\_{1}}|Y\_{i}|^{2+\delta} + \mathbb{E}\_{\theta\_{1}}|Y\_{j}|^{2+\delta} + \sqrt{\mathbb{E}\_{\theta\_{1}}|Y\_{i}|^{2+\delta}\,\mathbb{E}\_{\theta\_{1}}|Y\_{j}|^{2+\delta}}\right)\right) \alpha\_{|j-i|-1}^{\delta/(2+\delta)} \\ &= \left(\frac{4}{n} + 2n^{\delta/2}\left(C\_{\theta\_{1},\theta\_{2}}^{(i)} + C\_{\theta\_{1},\theta\_{2}}^{(j)} + \sqrt{C\_{\theta\_{1},\theta\_{2}}^{(i)}\,C\_{\theta\_{1},\theta\_{2}}^{(j)}}\right)\right) \alpha\_{|j-i|-1}^{\delta/(2+\delta)}. \end{split}$$

Similarly, we also have

$$|\mathrm{Cov}\_{\theta\_1}(Y\_i, Z\_0)| < \left(\frac{4}{n} + 2n^{\delta/2}\left(C^{(i)}\_{\theta\_1, \theta\_2} + D\_{1,2} + \sqrt{C^{(i)}\_{\theta\_1, \theta\_2} D\_{1,2}}\right)\right) \alpha\_{i-1}^{\delta/(2+\delta)}.$$

Combining the two upper bounds above, we get the first result:

$$\begin{split} \mathrm{Var}\_{\theta\_{1}} \left[ r\_{n}(\theta\_{2}, \theta\_{1}) \right] &< \sum\_{i,j=1}^{n} \left( \frac{4}{n} + 2n^{\delta/2}\left(C^{(i)}\_{\theta\_{1},\theta\_{2}} + C^{(j)}\_{\theta\_{1},\theta\_{2}} + \sqrt{C^{(i)}\_{\theta\_{1},\theta\_{2}} C^{(j)}\_{\theta\_{1},\theta\_{2}}}\right) \right) \alpha\_{|i-j|-1}^{\delta/(2+\delta)} \\ &+ \sum\_{i=1}^{n} \left( \frac{4}{n^{2}} + 2n^{\delta/2}\left(C^{(i)}\_{\theta\_{1},\theta\_{2}} + D\_{1,2} + \sqrt{C^{(i)}\_{\theta\_{1},\theta\_{2}} D\_{1,2}}\right) \right) \alpha\_{i-1}^{\delta/(2+\delta)} \\ &+ \mathrm{Var}\_{\theta\_{1}}[Z\_{0}]. \end{split}$$

If {*Xi*} is stationary under *θ*1, so is {*Yi*}. Therefore, $\mathbb{E}\_{\theta\_1}|Y\_i|^{2+\delta} = \mathbb{E}\_{\theta\_1}|Y\_1|^{2+\delta} = C^{(1)}\_{\theta\_1,\theta\_2}$ for all *i*, and

$$\begin{split} \sum\_{i,j=1}^{n} \mathrm{Cov}\_{\theta\_1}(Y\_i, Y\_j) &\leq \sum\_{i,j=1}^{n} \left( \frac{4}{n} + 6n^{\delta/2} C\_{\theta\_1, \theta\_2}^{(1)} \right) \alpha\_{|j-i|-1}^{\delta/(2+\delta)} \\ &\leq n \left( \frac{4}{n} + 6n^{\delta/2} C\_{\theta\_1, \theta\_2}^{(1)} \right) \left( \sum\_{h \geq 1} \alpha\_{h-1}^{\delta/(2+\delta)} \right). \end{split} \tag{A31}$$

Again, using Lemma B.1.4 on Cov*θ*<sup>1</sup> (*Yi*, *Z*0), yields

$$\sum\_{i=1}^{n} \mathrm{Cov}\_{\theta\_1}(Y\_i, Z\_0) \le \left(\frac{4}{n} + 2n^{\delta/2} \left(C^{(1)}\_{\theta\_1,\theta\_2} + D\_{1,2} + \sqrt{C^{(1)}\_{\theta\_1,\theta\_2} D\_{1,2}}\right)\right) \left(\sum\_{h \ge 1} \alpha\_h^{\delta/(2+\delta)}\right). \tag{A32}$$

Finally, using Equations (A31) and (A32) we have

$$\begin{split} \mathrm{Var}\_{\theta\_{1}}[r\_{n}(\theta\_{2},\theta\_{1})] &\leq n \left( \frac{4}{n} + 6n^{\delta/2} C^{(1)}\_{\theta\_{1},\theta\_{2}} \right) \left( \sum\_{h\geq 1} \alpha^{\delta/(2+\delta)}\_{h-1} \right) \\ &+ \left( \frac{4}{n} + 2n^{\delta/2} \left(C^{(1)}\_{\theta\_{1},\theta\_{2}} + D\_{1,2} + \sqrt{C^{(1)}\_{\theta\_{1},\theta\_{2}} D\_{1,2}}\right) \right) \left( \sum\_{h\geq 1} \alpha^{\delta/(2+\delta)}\_{h} \right) \\ &+ \mathrm{Var}\_{\theta\_{1}}[Z\_{0}]. \end{split}$$

*Appendix B.2. Proofs for Stationary Markov Data-Generating Models*

Proof of Theorem 2

*Part 1: Verifying condition (i) of Corollary 1.*

We substitute the true parameter *θ*<sup>0</sup> for *θ*<sup>1</sup> and *θ* for *θ*<sup>2</sup>. We also set $q\_1^{(0)}$ to *q*0, the invariant distribution of the Markov chain under *θ*0, and $q\_2^{(0)}$ to *qθ*, the invariant distribution of the Markov chain under *θ*. Applying the fact that these Markov chains are stationary to Proposition 2, we have

$$\begin{split} \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) &= n \mathbb{E} \Big[ \log \left( \frac{p\_{\theta\_0}(X\_1 | X\_0)}{p\_{\theta}(X\_1 | X\_0)} \right) \Big] + \mathbb{E}[Z\_0] \\ &\leq n \sum\_{j=1}^m \mathbb{E} \Big[ M\_j^{(1)}(X\_1, X\_0) \Big] |f\_j^{(1)}(\theta, \theta\_0)| + \sum\_{k=1}^m \mathbb{E} [M\_k^{(2)}(X\_0)] |f\_k^{(2)}(\theta, \theta\_0)|, \end{split} \tag{A33}$$

where the inequality follows from Assumption 1. Therefore, it follows that

$$\begin{split} \int \mathcal{K}(P\_{\theta\_{0}}^{(n)},P\_{\theta}^{(n)})\rho\_{n}(d\theta) &\leq n\sum\_{j=1}^{m}\mathbb{E}\Big[M\_{j}^{(1)}(X\_{1},X\_{0})\Big]\int |f\_{j}^{(1)}(\theta,\theta\_{0})|\rho\_{n}(d\theta) \\ &+\sum\_{k=1}^{m}\mathbb{E}[M\_{k}^{(2)}(X\_{0})]\int |f\_{k}^{(2)}(\theta,\theta\_{0})|\rho\_{n}(d\theta). \end{split}$$

By Assumption 1(i), it follows that

$$\int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta) \le n \sum\_{j=1}^m \mathbb{E}\left[M\_j^{(1)}(X\_1, X\_0)\right] \frac{C}{\sqrt{n}} + \sum\_{k=1}^m \mathbb{E}[M\_k^{(2)}(X\_0)] \frac{C}{\sqrt{n}} \le n \epsilon\_n^{(1)},$$

where $\epsilon\_n^{(1)} = O\left(\frac{1}{\sqrt{n}}\right)$.

*Part 2: Verifying condition (ii) of Corollary 1.* Again, using Proposition 3 along with the fact that the Markov chain is stationary we have

$$\begin{split} \mathrm{Var}[r\_{n}(\theta,\theta\_{0})] &\leq n \left( \frac{4}{n} + 6n^{\delta/2} C^{(1)}\_{\theta\_{0},\theta} \right) \left( \sum\_{k\geq 0} \alpha\_{k}^{\delta/(2+\delta)} \right) \\ &+ \left( \frac{4}{n^{2}} + 2n^{\delta/2} \left(C^{(1)}\_{\theta\_{0},\theta} + D\_{\theta\_{0},\theta} + \sqrt{C^{(1)}\_{\theta\_{0},\theta} D\_{\theta\_{0},\theta}}\right) \right) \left( \sum\_{k\geq 1} \alpha\_{k}^{\delta/(2+\delta)} \right) \\ &+ \mathrm{Var}[Z\_{0}]. \end{split}$$

It then follows that

$$\begin{split} \int \mathrm{Var}[r\_{n}(\theta,\theta\_{0})]\rho\_{n}(d\theta) &\leq n\left(\frac{4}{n} + 6n^{\delta/2}\int C^{(1)}\_{\theta\_{0},\theta}\rho\_{n}(d\theta)\right)\left(\sum\_{k\geq 1} \alpha^{\delta/(2+\delta)}\_{k-1}\right) + \int \mathrm{Var}[Z\_{0}]\rho\_{n}(d\theta) \\ &+ \left(\frac{4}{n^{2}} + 2n^{\delta/2}\left(\int C^{(1)}\_{\theta\_{0},\theta}\rho\_{n}(d\theta) + \int D\_{\theta\_{0},\theta}\rho\_{n}(d\theta) + \int \sqrt{C^{(1)}\_{\theta\_{0},\theta}D\_{\theta\_{0},\theta}}\,\rho\_{n}(d\theta)\right)\right)\left(\sum\_{k\geq 1} \alpha^{\delta/(2+\delta)}\_{k}\right). \end{split}$$

First, consider the term $\int C^{(1)}\_{\theta\_0,\theta}\rho\_n(d\theta)$, and observe that

$$\int C\_{\theta\_0, \theta}^{(1)} \rho\_n(d\theta) = \int \mathbb{E} \left|\log \frac{p\_{\theta\_0}(X\_1|X\_0)}{p\_\theta(X\_1|X\_0)} \right|^{2+\delta} \rho\_n(d\theta).$$

By Assumption 1, we have

$$\int \mathbb{E} \left|\log \frac{p\_{\theta\_0}(X\_1|X\_0)}{p\_\theta(X\_1|X\_0)} \right|^{2+\delta} \rho\_n(d\theta) \le \int \mathbb{E} \left[ \sum\_{j=1}^m M\_j^{(1)}(X\_1, X\_0) |f\_j^{(1)}(\theta, \theta\_0)| \right]^{2+\delta} \rho\_n(d\theta).$$

Since the function $x \mapsto x^{2+\delta}$ is convex, we can apply Jensen's inequality to obtain,

$$\left(\sum\_{j=1}^m M\_j^{(1)}(X\_1, X\_0) |f\_j^{(1)}(\theta, \theta\_0)|\right)^{2+\delta} \le m^{1+\delta} \sum\_{j=1}^m M\_j^{(1)}(X\_1, X\_0)^{2+\delta} |f\_j^{(1)}(\theta, \theta\_0)|^{2+\delta}.$$

Therefore, it follows that

$$\int \mathbb{E} \left|\log \frac{p\_{\theta\_0}(X\_1 | X\_0)}{p\_\theta(X\_1 | X\_0)} \right|^{2 + \delta} \rho\_n(d\theta) \leq m^{1 + \delta} \sum\_{k=1}^m \mathbb{E} [M\_k^{(1)}(X\_1, X\_0)^{2 + \delta}] \int |f\_k^{(1)}(\theta, \theta\_0)|^{2 + \delta} \rho\_n(d\theta).$$

By Assumption 1, $\int |f\_k^{(1)}(\theta, \theta\_0)|^{2+\delta}\rho\_n(d\theta) < \frac{C}{n}$ and $\mathbb{E}[M\_k^{(1)}(X\_1, X\_0)^{2+\delta}] < B$, implying that

$$\int C\_{\theta\_0, \theta}^{(1)} \rho\_n(d\theta) \le m^{1+\delta} \sum\_{k=1}^m B \frac{C}{n} = m^{2+\delta} \frac{BC}{n}.$$
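The power-mean step above, $\left(\sum\_{j=1}^m a\_j\right)^{2+\delta} \le m^{1+\delta} \sum\_{j=1}^m a\_j^{2+\delta}$ for nonnegative terms $a\_j$, can be checked numerically. A quick sketch with random test values (`numpy` assumed; the $a\_j$ stand in for $M\_j^{(1)}|f\_j^{(1)}|$):

```python
# Numerical check of the Jensen (power-mean) inequality used above:
# (sum_j a_j)^(2 + delta) <= m^(1 + delta) * sum_j a_j^(2 + delta)
# for nonnegative a_j, which follows from convexity of x -> x^(2 + delta).
import numpy as np

rng = np.random.default_rng(3)
for _ in range(200):
    m = int(rng.integers(1, 10))
    delta = float(rng.random()) * 2.0
    a = rng.random(m) * 5.0        # nonnegative terms
    lhs = a.sum() ** (2 + delta)
    rhs = m ** (1 + delta) * np.sum(a ** (2 + delta))
    assert lhs <= rhs * (1 + 1e-12)
```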

Since $\sum\_{k\geq 0} \alpha\_k^{\delta/(2+\delta)} < \infty$, it follows that $\left(\frac{4}{n} + 6n^{\delta/2}\int C^{(1)}\_{\theta\_0,\theta}\rho\_n(d\theta)\right)\sum\_{k\geq 1}\alpha\_{k-1}^{\delta/(2+\delta)} = O\left(\frac{n^{\delta/2}}{n}\right)$. Similarly, we can show that $\int D\_{\theta\_0,\theta}\rho\_n(d\theta) = O\left(\frac{1}{n}\right)$ and $\int \mathrm{Var}[Z\_0]\rho\_n(d\theta) = O\left(\frac{1}{n}\right)$. For the final term $\int \sqrt{C^{(1)}\_{\theta\_0,\theta}D\_{\theta\_0,\theta}}\,\rho\_n(d\theta)$, use the Cauchy–Schwarz inequality to obtain the upper bound $\left(\int C^{(1)}\_{\theta\_0,\theta}\rho\_n(d\theta)\int D\_{\theta\_0,\theta}\rho\_n(d\theta)\right)^{1/2}$, which is also of order $O\left(\frac{1}{n}\right)$. Combining all of these together, we have

$$\int \mathrm{Var}[r\_n(\theta, \theta\_0)] \rho\_n(d\theta) \le n \epsilon\_n^{(2)},$$

for some $\epsilon\_n^{(2)} = O\left(\frac{n^{\delta/2}}{n}\right)$.

Since $\mathcal{K}(\rho\_n, \pi) < \sqrt{n}\,C = n\,\frac{C}{\sqrt{n}}$, it follows that $\mathcal{K}(\rho\_n, \pi) < n \varepsilon\_n^{(3)}$, where $\varepsilon\_n^{(3)} = O(1/\sqrt{n})$ as before. Finally, by choosing $\varepsilon\_n = \max(\varepsilon\_n^{(1)}, \varepsilon\_n^{(2)}, \varepsilon\_n^{(3)})$, our theorem is proved.

*Appendix B.3. Proofs for Non-Stationary, Ergodic Markov Data-Generating Models*

Appendix B.3.1. Proof of Theorem 3

*Part 1: Verifying condition (i) of Corollary 1:* As in the proof of Theorem 2, substitute the true parameter $\theta\_0$ for $\theta\_1$ and $\theta$ for $\theta\_2$. We also set $q\_1^{(0)}$ and $q\_2^{(0)}$ to the distribution $q^{(0)}$. Applying Proposition 2 to the corresponding transition kernels and initial distributions, we have

$$\begin{split} \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) &= \sum\_{i=1}^{n} \mathbb{E} \left[ \log \left( \frac{p\_{\theta\_0}(X\_i | X\_{i-1})}{p\_{\theta}(X\_i | X\_{i-1})} \right) \right] + \mathbb{E} \left[ \log \left( \frac{D(X\_0)}{D(X\_0)} \right) \right] \\ &= \sum\_{i=1}^{n} \mathbb{E} \left[ \log \left( \frac{p\_{\theta\_0}(X\_i | X\_{i-1})}{p\_{\theta}(X\_i | X\_{i-1})} \right) \right]. \end{split} \tag{A34}$$

Now, applying Assumption 1, we can bound the previous equation as follows,

$$\begin{split} \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) &\leq \sum\_{i=1}^{n} \mathbb{E} \left[ \sum\_{k=1}^{m} M\_k^{(1)}(X\_i, X\_{i-1}) |f\_k^{(1)}(\theta, \theta\_0)| \right] \\ &= \sum\_{i=1}^{n} \sum\_{k=1}^{m} \mathbb{E} \left[ M\_k^{(1)}(X\_i, X\_{i-1}) \right] |f\_k^{(1)}(\theta, \theta\_0)|. \end{split} \tag{A35}$$

Since the $M\_k^{(1)}$'s are bounded, there exists a constant $Q$ such that

$$\begin{aligned} \int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta) &\leq Q \int \sum\_{i=1}^n \sum\_{k=1}^m |f\_k^{(1)}(\theta, \theta\_0)| \rho\_n(d\theta) \\ &= Qn \sum\_{k=1}^m \int |f\_k^{(1)}(\theta, \theta\_0)| \rho\_n(d\theta). \end{aligned}$$

By Assumption 1, it follows that

$$\int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta) \le Qn \sum\_{k=1}^m \frac{C}{\sqrt{n}} = nmQ \frac{C}{\sqrt{n}} = n \varepsilon\_n^{(1)},$$

for some $\varepsilon\_n^{(1)} = O(\frac{1}{\sqrt{n}})$.

*Part 2: Verifying condition (ii) of Corollary 1:* As in the previous part, $Z\_0 = 0$, implying that $D\_{\theta\_0,\theta} = 0$. Applying Proposition 3 and integrating with respect to $\rho\_n$, we obtain

$$\begin{split} \int \mathrm{Var}[r\_n(\theta, \theta\_0)] \rho\_n(d\theta) &\leq \sum\_{i=1}^n \left( \frac{4}{n} + 2n^{\delta/2} \int \mathcal{C}^{(i)}\_{\theta\_0, \theta} \rho\_n(d\theta) \right) \left( \alpha^{\delta/(2+\delta)}\_{i-1} \right) \\ &\quad + \sum\_{i,j=1}^n \left( \frac{4}{n} + 2n^{\delta/2} \left( \int \mathcal{C}^{(i)}\_{\theta\_0, \theta} \rho\_n(d\theta) + \int \mathcal{C}^{(j)}\_{\theta\_0, \theta} \rho\_n(d\theta) + \int \sqrt{\mathcal{C}^{(i)}\_{\theta\_0, \theta} \mathcal{C}^{(j)}\_{\theta\_0, \theta}}\, \rho\_n(d\theta) \right) \right) \left( \alpha^{\delta/(2+\delta)}\_{|i-j|-1} \right). \end{split} \tag{A36}$$

First, consider the term $\int \mathcal{C}^{(i)}\_{\theta\_0,\theta}\, \rho\_n(d\theta)$. Using Assumption 1, we can upper bound $\mathcal{C}^{(i)}\_{\theta\_0,\theta}$ as

$$\begin{split} \mathcal{C}\_{\theta\_{0},\theta}^{(i)} &\leq \mathbb{E} \left[ \sum\_{k=1}^{m} M\_{k}^{(1)} (X\_{i}, X\_{i-1}) |f\_{k}^{(1)} (\theta, \theta\_{0})| \right]^{2+\delta} \\ &\leq \sum\_{k=1}^{m} m^{1+\delta} \mathbb{E} \left[ \left( M\_{k}^{(1)} (X\_{i}, X\_{i-1}) |f\_{k}^{(1)} (\theta, \theta\_{0})| \right)^{2+\delta} \right] \quad \text{(by Jensen's inequality)} \\ &= \sum\_{k=1}^{m} m^{1+\delta} \mathbb{E} \left[ M\_{k}^{(1)} (X\_{i}, X\_{i-1})^{2+\delta} \right] |f\_{k}^{(1)} (\theta, \theta\_{0})|^{2+\delta}. \end{split}$$

Since the $M\_k^{(1)}$'s are upper bounded by $Q$, it follows from the previous expression that $\mathcal{C}^{(i)}\_{\theta\_0,\theta} \le \sum\_{k=1}^{m} m^{1+\delta} Q^{2+\delta} |f\_k^{(1)}(\theta, \theta\_0)|^{2+\delta}$.

Hence, from Assumption 1, we get,

$$\int \mathcal{C}\_{\theta\_0, \theta}^{(i)} \rho\_n(d\theta) \le \sum\_{k=1}^m m^{1+\delta} Q^{2+\delta} \int |f\_k^{(1)}(\theta, \theta\_0)|^{2+\delta} \rho\_n(d\theta) \le (mQ)^{2+\delta} \frac{C}{n}.$$

Using the upper bound above, we can say that for $L$ large enough, $\int \mathcal{C}^{(i)}\_{\theta\_0,\theta}\, \rho\_n(d\theta) \le \frac{L}{n}$. Next, by the Cauchy–Schwarz inequality, $\int \sqrt{\mathcal{C}^{(i)}\_{\theta\_0,\theta} \mathcal{C}^{(j)}\_{\theta\_0,\theta}}\, \rho\_n(d\theta) \le \left(\int \mathcal{C}^{(i)}\_{\theta\_0,\theta}\, \rho\_n(d\theta) \int \mathcal{C}^{(j)}\_{\theta\_0,\theta}\, \rho\_n(d\theta)\right)^{1/2} \le \frac{L}{n}$. Thus, we have the following upper bound.

$$\begin{split} \int \mathrm{Var}[r\_n(\theta,\theta\_0)] \rho\_n(d\theta) &\leq \sum\_{i=1}^n \left(\frac{4}{n} + 2n^{\delta/2}\frac{L}{n}\right) \left(\alpha\_{i-1}^{\delta/(2+\delta)}\right) \\ &\quad + \sum\_{i,j=1}^n \left(\frac{4}{n} + 2n^{\delta/2}\left(\frac{L}{n} + \frac{L}{n} + \frac{L}{n}\right)\right) \left(\alpha\_{|i-j|-1}^{\delta/(2+\delta)}\right) \\ &= \left(\frac{4}{n} + 2n^{\delta/2}\frac{L}{n}\right) \left(\sum\_{i=1}^n \alpha\_{i-1}^{\delta/(2+\delta)}\right) \\ &\quad + \left(\frac{4}{n} + 6n^{\delta/2}\frac{L}{n}\right) \left(\sum\_{i,j=1}^n \alpha\_{|i-j|-1}^{\delta/(2+\delta)}\right). \end{split}$$

Since $\sum\_{i,j=1}^n \alpha\_{|i-j|-1}^{\delta/(2+\delta)} \le n \sum\_{k\ge 1} \alpha\_{k-1}^{\delta/(2+\delta)}$ with $\sum\_{k\ge 1} \alpha\_{k-1}^{\delta/(2+\delta)} < \infty$, we have that for some $\varepsilon\_n^{(2)} = O(\frac{n^{\delta/2}}{n})$,

$$\int \text{Var}[r\_n(\theta, \theta\_0)] \rho\_n(d\theta) < n \epsilon\_n^{(2)}.$$

Since $\mathcal{K}(\rho\_n, \pi) \le \sqrt{n}\,C$, following the concluding argument in the proof of Theorem 2 completes the proof.
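The counting step used in the proof — that $\sum\_{i,j=1}^n \alpha\_{|i-j|-1}^{\delta/(2+\delta)}$ grows at most linearly in $n$ whenever $\sum\_k \alpha\_k^{\delta/(2+\delta)} < \infty$ — can be illustrated numerically. In the sketch below, the geometric mixing sequence $\alpha\_k = \zeta^k$ and the convention $\alpha\_{-1} = 1$ are assumptions made for illustration, not quantities from the paper.

```python
# Illustrative check (assumed example): a geometrically mixing sequence
# alpha_k = zeta**k makes the double sum grow at most linearly in n.
zeta, expo = 0.5, 1.0 / 3.0          # delta = 1 gives the exponent delta/(2+delta) = 1/3

def alpha(k):
    # convention alpha_{-1} = 1 covers the diagonal terms i = j
    return 1.0 if k < 0 else zeta ** k

def double_sum(n):
    return sum(alpha(abs(i - j) - 1) ** expo
               for i in range(1, n + 1) for j in range(1, n + 1))

tail = sum(alpha(k - 1) ** expo for k in range(1, 200))   # ~ sum_{k>=1} alpha_{k-1}^{1/3}
for n in (10, 50, 100):
    print(n, double_sum(n), n * (1 + 2 * tail))           # the linear envelope dominates
```

Each row of the double sum is bounded by $1 + 2\sum\_{k\ge1}\alpha\_{k-1}^{\delta/(2+\delta)}$, so the whole sum is $O(n)$, which is exactly what the proof uses.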

Appendix B.3.2. Proof of Proposition 8

We verify Assumption 1, and the proof then follows from Theorem 3. For $i \in \{1, 2, \ldots, K-1\}$,

$$p\_{\theta}(j|i) = \begin{cases} \theta & \text{if } j = i - 1, \\ 1 - \theta & \text{if } j = i + 1. \end{cases}$$

If $i = 0$ or $i = K$, then the Markov chain goes back to $1$ or $K-1$, respectively, with probability 1. With the convention $\log \frac{0}{0} = 0$, the log-ratio of the transition probabilities becomes

$$|\log p\_{\theta\_0}(X\_1|X\_0) - \log p\_{\theta}(X\_1|X\_0)| = I\_{[X\_1 = X\_0 + 1]} \left|\log \left(\frac{\theta\_0}{\theta}\right)\right| + I\_{[X\_1 = X\_0 - 1]} \left|\log \left(\frac{1 - \theta\_0}{1 - \theta}\right)\right|.$$

In this case, $m = 2$, with $M\_1^{(1)}(X\_1, X\_0) = I\_{[X\_1 = X\_0+1]}$ and $M\_2^{(1)}(X\_1, X\_0) = I\_{[X\_1 = X\_0-1]}$, both of which are bounded. Let $f\_1^{(1)}(\theta, \theta\_0) := \log \frac{\theta\_0}{\theta}$ and $f\_2^{(1)}(\theta, \theta\_0) := \log \frac{1-\theta\_0}{1-\theta}$.
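For concreteness, the reflecting kernel above can be written out as a transition matrix. This is a minimal sketch; $K = 5$ and $\theta = 0.3$ are arbitrary illustrative values, not quantities from the paper.

```python
# Minimal sketch of the reflecting random walk on {0, 1, ..., K} described above;
# K = 5 and theta = 0.3 are arbitrary illustrative values.
def transition_matrix(K, theta):
    P = [[0.0] * (K + 1) for _ in range(K + 1)]
    for i in range(1, K):            # interior states: down w.p. theta, up w.p. 1 - theta
        P[i][i - 1] = theta
        P[i][i + 1] = 1 - theta
    P[0][1] = 1.0                    # reflection at i = 0
    P[K][K - 1] = 1.0                # reflection at i = K
    return P

P = transition_matrix(5, 0.3)
print([sum(row) for row in P])       # every row is a probability distribution
```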

The stationary distribution is $q\_\theta(i) = \frac{1}{K}$ for all $i \in \{1, 2, \ldots, K\}$. Hence the log-ratio of the invariant distributions becomes

$$
\log q\_{\theta\_0}(x) - \log q\_\theta(x) = 0,\tag{A37}
$$

and we can set $M\_i^{(2)}(\cdot) := 1$ and $f\_i^{(2)}(\cdot, \cdot) := 0$ for $i \in \{1, 2\}$. Thus, to prove the concentration bound for this Markov chain it is enough to take $\delta = 1$ and show that $\int |f\_1^{(1)}(\theta, \theta\_0)|^3 \rho\_n(d\theta) < \frac{C}{n}$ and $\int |f\_2^{(1)}(\theta, \theta\_0)|^3 \rho\_n(d\theta) < \frac{C}{n}$ for some constant $C > 0$.

As given, $\{\rho\_n\}$ is a sequence of Beta probability distributions with parameters $a\_n, b\_n$ satisfying the constraint $\frac{a\_n}{a\_n + b\_n} = \theta\_0$. Specifically, we choose $a\_n = n\theta\_0$ and (therefore) $b\_n = n(1 - \theta\_0)$. Thus, we get the following:

$$\begin{split} \int |f\_1^{(1)}(\theta,\theta\_0)|^3 \rho\_n(d\theta) &= \int \left| \log \left(\frac{\theta\_0}{\theta}\right) \right|^3 \rho\_n(d\theta) \\ &< \int \left| \frac{\theta\_0}{\theta} - 1 \right|^3 \rho\_n(d\theta) \\ &= \frac{1}{\text{Beta}\left(a\_n, b\_n\right)} \int\_0^1 \left| \frac{\theta\_0 - \theta}{\theta} \right|^3 \theta^{a\_n - 1} (1 - \theta)^{b\_n - 1} d\theta. \end{split}$$

Since $\theta\_0, \theta \in (0,1)$, we have $|\theta\_0 - \theta| < 1$, giving $|\theta\_0 - \theta|^3 < 2(\theta\_0 - \theta)^2$. We use that fact to arrive at

$$\begin{split} \int |f\_1^{(1)}(\theta,\theta\_0)|^3 \rho\_n(d\theta) &\leq \frac{2}{\text{Beta}\left(a\_n,b\_n\right)} \int\_0^1 (\theta\_0-\theta)^2 \theta^{a\_n-4} (1-\theta)^{b\_n-1} d\theta \\ &= \frac{2\text{Beta}\left(a\_n-3,b\_n\right)}{\text{Beta}\left(a\_n,b\_n\right)} \frac{(a\_n-3)(b\_n)}{(a\_n+b\_n-3)^2(a\_n+b\_n-2)}. \end{split}$$

From our choice of $a\_n$ and $b\_n$, $\frac{2\,\mathrm{Beta}(a\_n - 3, b\_n)}{\mathrm{Beta}(a\_n, b\_n)} = O(1)$, and plugging the values of $a\_n$ and $b\_n$ into $\frac{(a\_n-3)b\_n}{(a\_n+b\_n-3)^2(a\_n+b\_n-2)}$, we get $\frac{(a\_n-3)b\_n}{(a\_n+b\_n-3)^2(a\_n+b\_n-2)} = \frac{1}{n}\, \frac{(\theta\_0 - \frac{3}{n})(1-\theta\_0)}{(1-\frac{3}{n})^2(1-\frac{2}{n})}$, which is upper bounded by $\frac{C\_1}{n}$ for some constant $C\_1 > 0$. Hence,

$$\int |f\_1^{(1)}(\theta, \theta\_0)|^3 \rho\_n(d\theta) < \frac{C\_1}{n}.$$

Similarly, we can also show that,

$$\int |f\_2^{(1)}(\theta, \theta\_0)|^3 \rho\_n(d\theta) < \frac{C\_2}{n}.$$

Finally, from Proposition A.2.1, we get that $\mathcal{K}(\rho\_n, \pi) < C + \frac{1}{2}\log(n)$ for some large constant $C$. Hence, $\mathcal{K}(\rho\_n, \pi) < C\_3 \sqrt{n}$ for some constant $C\_3 > 0$. Choosing $C = \max(C\_1, C\_2, C\_3)$, we satisfy all the conditions of Assumption 1 and Theorem 3.
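As a numerical sanity check of the key moment bound in this proof — that $\int |\log(\theta\_0/\theta)|^3 \rho\_n(d\theta)$ decays at least like $1/n$ under $\rho\_n = \mathrm{Beta}(n\theta\_0, n(1-\theta\_0))$ — the Monte Carlo sketch below can be run; the value of $\theta\_0$, the sample size, and the seed are arbitrary choices for illustration.

```python
import math
import random

# Monte Carlo estimate of int |log(theta0/theta)|^3 rho_n(dtheta) for
# rho_n = Beta(n*theta0, n*(1 - theta0)); theta0, sample size, seed are arbitrary.
def third_log_moment(n, theta0, samples=100_000, seed=0):
    rng = random.Random(seed)
    a, b = n * theta0, n * (1 - theta0)
    total = 0.0
    for _ in range(samples):
        theta = rng.betavariate(a, b)
        total += abs(math.log(theta0 / theta)) ** 3
    return total / samples

theta0 = 0.3
m100, m1000 = third_log_moment(100, theta0), third_log_moment(1000, theta0)
print(m100, m1000)   # the moment shrinks markedly as n grows
```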

Appendix B.3.3. Proof of Proposition 9

For the purpose of this proof, we choose $\rho\_n$ to be a scaled Beta distribution with parameters $a\_n = n(\theta\_0/2)$ and $b\_n = n(1 - \theta\_0/2)$. Since $\rho\_n$ is a scaled Beta distribution with scaling factors $m = 0.5$ and $c = 0$, the pdf of $\rho\_n$ is given by

$$\rho\_n(\theta) = \frac{2}{\mathrm{Beta}(a\_n, b\_n)} (2\theta)^{a\_n - 1} (1 - 2\theta)^{b\_n - 1}.$$
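The normalization of this scaled density can be checked numerically. In the sketch below, the exponents $a\_n - 1$ and $b\_n - 1$ follow the standard change of variables $\theta = t/2$ with $t \sim \mathrm{Beta}(a\_n, b\_n)$, and the parameter values are arbitrary illustrative choices.

```python
import math

# Midpoint-rule check that the scaled Beta density on (0, 1/2) integrates to 1;
# a_n = 3, b_n = 5 are arbitrary illustrative parameters.
def beta_fn(a, b):
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def rho(theta, a, b):
    return 2.0 / beta_fn(a, b) * (2 * theta) ** (a - 1) * (1 - 2 * theta) ** (b - 1)

a_n, b_n = 3.0, 5.0
N = 20_000
h = 0.5 / N
total = sum(rho((k + 0.5) * h, a_n, b_n) for k in range(N)) * h   # midpoint rule
print(total)
```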

Since this is a scaled distribution, $\mathbb{E}\_{\rho\_n}[\theta] = 2\,\frac{a\_n}{a\_n + b\_n} = \theta\_0$, and there exists a constant $\sigma > 0$ such that $\mathrm{Var}\_{\rho\_n}[\theta] = \frac{\sigma^2}{n}$. Now, we analyse the transition probabilities. For $i \in \{1, 2, \ldots\}$, the Birth–Death process has transition probabilities

$$p\_{\theta}(j|i) = \begin{cases} \theta & \text{if } j = i - 1, \\ 1 - \theta & \text{if } j = i + 1. \end{cases}$$

If $i = 0$, then the Markov chain goes to $1$ with probability 1. Hence, with the convention $\log \frac{0}{0} = 0$, the log-ratio of the transition probabilities becomes

$$|\log p\_{\theta\_0}(X\_1|X\_0) - \log p\_{\theta}(X\_1|X\_0)| = I\_{[X\_1 = X\_0 + 1]} \left|\log \frac{\theta\_0}{\theta}\right| + I\_{[X\_1 = X\_0 - 1]} \left|\log \frac{1 - \theta\_0}{1 - \theta}\right|.$$

In this case, $m = 3$, with $M\_1^{(1)}(X\_1, X\_0) = I\_{[X\_1 = X\_0+1]}$ and $M\_2^{(1)}(X\_1, X\_0) = I\_{[X\_1 = X\_0-1]}$. Define $M\_3^{(1)}(X\_1, X\_0) := 1$. All these random variables are bounded. Define $f\_1^{(1)}(\theta, \theta\_0) := \log \frac{\theta\_0}{\theta}$, $f\_2^{(1)}(\theta, \theta\_0) := \log \frac{1-\theta\_0}{1-\theta}$, and $f\_3^{(1)}(\theta, \theta\_0) := 0$. Similarly as in the proof of Proposition 8,

$$\begin{aligned} \int |f\_1^{(1)}(\theta, \theta\_0)|^3 \rho\_n(d\theta) &< \frac{C\_1}{n}, \text{ and } \\ \int |f\_2^{(1)}(\theta, \theta\_0)|^3 \rho\_n(d\theta) &< \frac{C\_2}{n}. \end{aligned}$$

The stationary distribution is given by $q\_\theta(i) = \left(\frac{\theta}{1-\theta}\right)^{i-1} q\_\theta(1)$ for all $i \in \{1, 2, \ldots\}$, so that $q\_\theta(i) = (1-\theta)\left(\frac{\theta}{1-\theta}\right)^{i-1}$. Hence the log-ratio of the invariant distributions becomes

$$\log q\_{\theta\_0}(i) - \log q\_\theta(i) = \log \left[ \frac{1 - \theta\_0}{1 - \theta} \right] + (i - 1) \log \left[ \frac{\theta\_0}{\theta} \right] - (i - 1) \log \left[ \frac{1 - \theta\_0}{1 - \theta} \right]. \tag{A38}$$

We define $M\_1^{(2)}(X\_0) := 1$, and $M\_2^{(2)}(X\_0) = M\_3^{(2)}(X\_0) := X\_0 - 1$. We can write $\mathbb{E}\_{q^{(0)}}[M\_2^{(2)}(X\_0)]^2 = \sum\_{i=1}^\infty (i-1)^2 q^{(0)}(i) < \sum\_{i=1}^\infty i^2 q^{(0)}(i)$. We have chosen $q^{(0)}$ such that $\sum\_{i=1}^\infty i^2 q^{(0)}(i)$ is bounded. Hence, $\mathbb{E}\_{q^{(0)}}[M\_2^{(2)}(X\_0)]^2 < \infty$. To verify Assumption 1, define $f\_1^{(2)}(\theta, \theta\_0) = -f\_3^{(2)}(\theta, \theta\_0) := \log \frac{1-\theta\_0}{1-\theta}$, and define $f\_2^{(2)}(\theta, \theta\_0) := \log \frac{\theta\_0}{\theta}$. Therefore, following the proof of Proposition 8,

$$\begin{split} \int |f\_1^{(2)}(\theta,\theta\_0)|^3 \rho\_n(d\theta) &= \int |f\_3^{(2)}(\theta,\theta\_0)|^3 \rho\_n(d\theta) = \int |f\_2^{(1)}(\theta,\theta\_0)|^3 \rho\_n(d\theta) < \frac{C\_2}{n}, \text{ and } \\ &\int |f\_2^{(2)}(\theta,\theta\_0)|^3 \rho\_n(d\theta) = \int |f\_1^{(1)}(\theta,\theta\_0)|^3 \rho\_n(d\theta) < \frac{C\_1}{n}. \end{split}$$

Finally, we consider the KL divergence $\mathcal{K}(\rho\_n, \pi)$. Here $\rho\_n$ follows a scaled Beta distribution on $(0, 1/2)$ with parameters $a\_n = n(\theta\_0/2)$ and $b\_n = n(1 - \theta\_0/2)$, while $\pi$ follows a scaled Beta distribution on $(0, 1/2)$ with parameters $a$ and $b$. Thus,

$$\mathcal{K}(\rho\_n, \pi) = \int\_0^{\frac{1}{2}} \log \left( \frac{\rho\_n(\theta)}{\pi(\theta)} \right) \rho\_n(d\theta),$$

Substituting $t = 2\theta$, we get

$$
\mathcal{K}(\rho\_n, \pi) = 2 \int\_0^1 \log \left( \frac{\rho\_n(t)}{\pi(t)} \right) \rho\_n(dt).
$$

As in the proof of Proposition 8, Proposition A.2.1 gives


$$\int\_0^1 \log\left(\frac{\rho\_n(t)}{\pi(t)}\right) \rho\_n(dt) < C\_1 + \frac{1}{2} \log(n).$$

Hence we can say that $\mathcal{K}(\rho\_n, \pi) < 2\left(C\_1 + \frac{1}{2}\log(n)\right)$. Thus, for some constant $C\_3 > 0$,

$$\mathcal{K}(\rho\_n, \pi) < C\_3 \sqrt{n}.$$

Choosing $C = \max(C\_1, C\_2, C\_3)$, we satisfy all of the conditions of Assumption 1, and thus by Theorem 3 the proof is complete.

## Appendix B.3.4. Proof of Theorem 4

*Part 1: Verifying condition (i) of Corollary 1:* As in the proof of Theorem 2, substitute the true parameter $\theta\_0$ for $\theta\_1$ and $\theta$ for $\theta\_2$. We also set our initial distributions $q\_1^{(0)}$ and $q\_2^{(0)}$ to the known initial distribution $q^{(0)}$. A method similar to Equation (A35) yields

$$\mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \le \sum\_{i=1}^{n} \sum\_{k=1}^{m} \mathbb{E}\left[\mathcal{M}\_k^{(1)}(X\_i, X\_{i-1})\right] |f\_k^{(1)}(\theta, \theta\_0)|.$$

Because the $M\_k^{(1)}$'s satisfy Assumption 2, it follows from Theorem 2.3 of [21] that there exists $\lambda > 0$ such that for any $0 < \kappa \le \lambda$, and for some $\zeta \in (0, 1)$ possibly depending on $\lambda$,

$$\mathbb{E}\left[e^{\kappa M\_{k}^{(1)}(X\_{i},X\_{i-1})}\,\middle|\,X\_{1},X\_{0}\right] \leq \zeta^{i-1}e^{\kappa M\_{k}^{(1)}(X\_{1},X\_{0})} + \frac{1-\zeta^{i}}{1-\zeta}\mathcal{D}e^{\kappa a} \qquad \text{for all } i > 1.$$

We rewrite $\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})\,\middle|\,X\_1, X\_0\right]$ as follows:

$$\begin{split} \mathbb{E}\left[M\_{k}^{(1)}(X\_{i},X\_{i-1})\,\middle|\,X\_{1},X\_{0}\right] &= \frac{\mathbb{E}\left[\kappa M\_{k}^{(1)}(X\_{i},X\_{i-1})\,\middle|\,X\_{1},X\_{0}\right]}{\kappa} \\ &\leq \frac{\mathbb{E}\left[e^{\kappa M\_{k}^{(1)}(X\_{i},X\_{i-1})}\,\middle|\,X\_{1},X\_{0}\right]}{\kappa}. \end{split}$$

Therefore, $\sum\_{i=1}^n \mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})\right]$ can be upper bounded as

$$\begin{split} \sum\_{i=1}^{n} \mathbb{E}\left[M\_{k}^{(1)}(X\_{i}, X\_{i-1})\right] &= \sum\_{i=1}^{n} \mathbb{E}\left[\mathbb{E}\left[\kappa M\_{k}^{(1)}(X\_{i}, X\_{i-1})\,\middle|\,X\_{1}, X\_{0}\right]\right] \kappa^{-1} \\ &\leq \sum\_{i=1}^{n} \left[\zeta^{i-1} \mathbb{E}e^{\kappa M\_{k}^{(1)}(X\_{1}, X\_{0})} + \frac{1-\zeta^{i}}{1-\zeta} \mathcal{D}e^{\kappa a}\right] \kappa^{-1}. \end{split}$$

Since $\zeta \in (0, 1)$, $\zeta^i < 1$. Hence, we can write

$$\begin{split} \sum\_{i=1}^{n} \left[ \zeta^{i-1} \mathbb{E} e^{\kappa M\_{k}^{(1)}(X\_{1}, X\_{0})} + \frac{1 - \zeta^{i}}{1 - \zeta} \mathcal{D} e^{\kappa a} \right] \kappa^{-1} &\leq \sum\_{i=1}^{n} \left[ \zeta^{i-1} \mathbb{E} e^{\kappa M\_{k}^{(1)}(X\_{1}, X\_{0})} + \frac{1}{1 - \zeta} \mathcal{D} e^{\kappa a} \right] \kappa^{-1} \\ &= \left[ \frac{1 - \zeta^{n}}{1 - \zeta} \mathbb{E} e^{\kappa M\_{k}^{(1)}(X\_{1}, X\_{0})} + \frac{n}{1 - \zeta} \mathcal{D} e^{\kappa a} \right] \kappa^{-1} \\ &\leq n L, \end{split}$$

for a large enough constant $L$. Therefore, $\int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta)$ can be upper bounded as follows:

$$\begin{aligned} \int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta) &\leq \int \sum\_{k=1}^m n L |f\_k^{(1)}(\theta, \theta\_0)| \rho\_n(d\theta) \\ &= \sum\_{k=1}^m n L \int |f\_k^{(1)}(\theta, \theta\_0)| \rho\_n(d\theta) .\end{aligned}$$

By Assumption 1, $\int |f\_k^{(1)}(\theta, \theta\_0)| \rho\_n(d\theta) < \frac{C}{\sqrt{n}}$, hence

$$\int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta) \le nmL \frac{C}{\sqrt{n}}.$$

Hence, for some $\varepsilon\_n^{(1)} = O(\frac{1}{\sqrt{n}})$, we have obtained that $\int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta) \le n \varepsilon\_n^{(1)}$.

*Part 2: Verifying condition (ii) of Corollary 1:* As in the proof of Theorem 3, we upper bound $\int \mathrm{Var}[r\_n(\theta, \theta\_0)] \rho\_n(d\theta)$ by

$$\begin{split} \int \mathrm{Var}[r\_n(\theta, \theta\_0)] \rho\_n(d\theta) &\le \sum\_{i,j=1}^n \left(\frac{4}{n} + 2n^{\delta/2} \left(\int \mathcal{C}^{(i)}\_{\theta\_0, \theta} \rho\_n(d\theta) + \int \mathcal{C}^{(j)}\_{\theta\_0, \theta} \rho\_n(d\theta) + \int \sqrt{\mathcal{C}^{(i)}\_{\theta\_0, \theta} \mathcal{C}^{(j)}\_{\theta\_0, \theta}}\, \rho\_n(d\theta)\right)\right) \left(\alpha\_{|i-j|-1}^{\delta/(2+\delta)}\right) \\ &\quad + \sum\_{i=1}^n \left(\frac{4}{n} + 2n^{\delta/2} \int \mathcal{C}^{(i)}\_{\theta\_0, \theta} \rho\_n(d\theta)\right) \left(\alpha\_{i-1}^{\delta/(2+\delta)}\right), \end{split} \tag{A39}$$

where $\mathcal{C}^{(i)}\_{\theta\_0, \theta}$ is upper bounded as

$$\mathcal{C}\_{\theta\_0, \theta}^{(i)} \le \sum\_{k=1}^{m} m^{1+\delta}\, \mathbb{E} \left[ M\_k^{(1)} (X\_i, X\_{i-1})^{2+\delta} \right] |f\_k^{(1)}(\theta, \theta\_0)|^{2+\delta}.$$

There exists a constant $C\_\delta$ depending on $\delta$ such that

$$\begin{aligned} M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta} &= \frac{\kappa^{2+\delta}\, M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta}}{\kappa^{2+\delta}} \\ &\leq \frac{e^{\kappa M\_k^{(1)}(X\_i, X\_{i-1})} + C\_\delta}{\kappa^{2+\delta}}. \end{aligned}$$
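The elementary inequality used here — for each $\delta$ there is a finite $C\_\delta$ with $x^{2+\delta} \le e^x + C\_\delta$ for all $x \ge 0$ — can be scanned numerically. Taking $\delta = 1$ and $C\_\delta = 11$ below is an illustrative choice, not a value from the paper.

```python
import math

# Numerical scan of the gap x^(2+delta) - e^x on [0, 20] for delta = 1;
# beyond this range e^x dominates, so the grid captures the worst case near x ~ 3.7.
delta = 1
C_delta = 11.0
worst = max(x ** (2 + delta) - math.exp(x)
            for x in (k * 0.001 for k in range(20001)))
print(worst)   # stays below C_delta
```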

By expressing $\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta}\right] = \mathbb{E}\left[\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta}\,\middle|\,X\_1, X\_0\right]\right]$ and following a method similar to the previous part, we get

$$\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta}\right] \le \frac{\zeta^{i}\, \mathbb{E}e^{\kappa M\_k^{(1)}(X\_1, X\_0)} + \frac{1-\zeta^{i}}{1-\zeta}\mathcal{D}e^{\kappa a} + C\_{\delta}}{\kappa^{2+\delta}}.$$

The fact that 0 < *ζ* < 1 implies that 0 < *ζ<sup>i</sup>* < *ζ*. This gives us the following,

$$\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta}\right] \le \frac{\zeta\, \mathbb{E}e^{\kappa M\_k^{(1)}(X\_1, X\_0)} + \frac{1}{1-\zeta}\mathcal{D}e^{\kappa a} + C\_{\delta}}{\kappa^{2+\delta}}.$$

Since *κ* < *λ*, by the application of Jensen's inequality, we get

$$\begin{split} \mathbb{E}\left[M\_{k}^{(1)}(X\_{i}, X\_{i-1})^{2+\delta}\right] &\leq \frac{\zeta\, \mathbb{E}e^{\lambda M\_{k}^{(1)}(X\_{1}, X\_{0})} + \frac{1}{1-\zeta}\mathcal{D}e^{\kappa a} + C\_{\delta}}{\kappa^{2+\delta}} \\ &= \frac{\zeta \int e^{\lambda M\_{k}^{(1)}(x\_{1}, x\_{0})}\, p\_{\theta\_{0}}(x\_{1}|x\_{0}) D(x\_{0})\, dx\_{1}\, dx\_{0} + \frac{1}{1-\zeta}\mathcal{D}e^{\kappa a} + C\_{\delta}}{\kappa^{2+\delta}}. \end{split}$$

We know that $\int |f\_k^{(1)}(\theta, \theta\_0)|^{2+\delta} \rho\_n(d\theta) < \frac{C}{n}$. Thus, following Assumption 1, we can say that for a large constant $L$, $\int \mathcal{C}^{(i)}\_{\theta\_0, \theta}\, \rho\_n(d\theta) \le \frac{L}{n}$. The rest of the proof follows as in the proof of Theorem 3, and we obtain an $\varepsilon\_n^{(2)} = O(\frac{n^{\delta/2}}{n})$ such that

$$\int \text{Var}[r\_n(\theta, \theta\_0)] \rho\_n(d\theta) < n \varepsilon\_n^{(2)}.$$

Since $\mathcal{K}(\rho\_n, \pi) \le \sqrt{n}\,C$, the same arguments as in the proof of Theorem 2 hold. The theorem is thus proved.

## Appendix B.3.5. Proof of Theorem 5

*Part 1: Verifying condition (i) of Corollary 1:* As in the proof of Theorem 2, substitute the true parameter $\theta\_0$ for $\theta\_1$ and $\theta$ for $\theta\_2$. We also set $q\_1^{(0)}$ and $q\_2^{(0)}$ to the known initial distribution $q^{(0)}$. Similar to the steps leading to Equation (A35), we get

$$\mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \le \sum\_{i=1}^{n} \sum\_{k=1}^{m} \mathbb{E}\left[\mathcal{M}\_k^{(1)}(X\_i, X\_{i-1})\right] |f\_k^{(1)}(\theta, \theta\_0)|.$$

Consider the term $\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})\right]$. With $q\_{\theta\_0}^{(i-1)}$ the marginal distribution of $X\_{i-1}$, we have

$$\begin{split} \mathbb{E}\left[M\_{k}^{(1)}(X\_{i}, X\_{i-1})\right] &= \int M\_{k}^{(1)}(\mathbf{x}\_{i}, \mathbf{x}\_{i-1})\, p\_{\theta\_{0}}(\mathbf{x}\_{i}|\mathbf{x}\_{i-1})\, q\_{\theta\_{0}}^{(i-1)}(\mathbf{x}\_{i-1})\, d\mathbf{x}\_{i}\, d\mathbf{x}\_{i-1} \\ &= \int M\_{k}^{(1)}(\mathbf{x}\_{i}, \mathbf{x}\_{i-1})\, p\_{\theta\_{0}}(\mathbf{x}\_{i}|\mathbf{x}\_{i-1})\, p\_{\theta\_{0}}^{i-1}(\mathbf{x}\_{i-1}|\mathbf{x}\_{0})\, q\_{\theta\_{0}}^{(0)}(\mathbf{x}\_{0})\, d\mathbf{x}\_{0}\, d\mathbf{x}\_{i}\, d\mathbf{x}\_{i-1}. \end{split}$$

Recall that the marginal density satisfies $q\_{\theta\_0}^{(i-1)}(\mathbf{x}\_{i-1}) = \int p\_{\theta\_0}^{i-1}(\mathbf{x}\_{i-1}|\mathbf{x}\_0)\, q\_{\theta\_0}^{(0)}(\mathbf{x}\_0)\, d\mathbf{x}\_0$, where $p\_{\theta\_0}^{i}(\cdot|\mathbf{x}\_0)$ is the $i$-step transition probability. Then

$$\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})\right] = \int \mathbb{E}\left[M\_k^{(1)}(X\_i, \mathbf{x}\_{i-1})|\mathbf{x}\_{i-1}\right] p\_{\theta\_0}^{i-1}(\mathbf{x}\_{i-1}|\mathbf{x}\_0) q\_{\theta\_0}^{(0)}(\mathbf{x}\_0) d\mathbf{x}\_0 d\mathbf{x}\_{i-1}.$$

Since the Markov chain $\{X\_n\}$ satisfies Assumption A.1.1, we know by the application of Theorem A.1.1 that $\{X\_n\}$ is $V$-geometrically ergodic. Hence, there exist $\tau < 1$ and $R < \infty$ such that for all $|f| \le V$,

$$\left| \int f(\mathbf{x}\_{i-1})\, p\_{\theta\_0}^{i-1}(\mathbf{x}\_{i-1}|\mathbf{x}\_0)\, d\mathbf{x}\_{i-1} - \int f(\mathbf{x}\_{i-1})\, q\_{\theta\_0}(\mathbf{x}\_{i-1})\, d\mathbf{x}\_{i-1} \right| < RV(\mathbf{x}\_0)\tau^{i-1},$$

where *qθ*<sup>0</sup> is the stationary distribution, implying that

$$\int f(\mathbf{x}\_{i-1}) p\_{\boldsymbol{\theta}\_0}^{i-1}(\mathbf{x}\_{i-1}|\mathbf{x}\_0) d\mathbf{x}\_{i-1} < \int f(\mathbf{x}\_{i-1}) q\_{\boldsymbol{\theta}\_0}(\mathbf{x}\_{i-1}) d\mathbf{x}\_{i-1} + RV(\mathbf{x}\_0) \tau^{i-1}.$$

By the application of Jensen's inequality, we get $\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})\,\middle|\,X\_{i-1}\right]^{2+\delta} \le \mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta}\,\middle|\,X\_{i-1}\right] < V(X\_{i-1})$. Since $V(\cdot) \ge 1$, it follows from the previous expression that $\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})\,\middle|\,X\_{i-1}\right] < V(X\_{i-1})^{1/(2+\delta)} \le V(X\_{i-1})$. Thus, setting $f(x) = \mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})\,\middle|\,X\_{i-1} = x\right]$, we obtain

$$\begin{split} \mathbb{E}\left[M\_{k}^{(1)}(X\_{i}, X\_{i-1})\right] &< \int \Big[\int \mathbb{E}\left[M\_{k}^{(1)}(X\_{i}, X\_{i-1})\,\middle|\,X\_{i-1} = \mathbf{x}\_{i-1}\right] q\_{\theta\_0}(\mathbf{x}\_{i-1})\, d\mathbf{x}\_{i-1} + RV(\mathbf{x}\_{0})\tau^{i-1}\Big]\, q^{(0)}(\mathbf{x}\_{0})\, d\mathbf{x}\_{0} \\ &= \mathbb{E}\left[M\_{k}^{(1)}(X\_{1}, X\_{0})\right] + \tau^{i-1}\int RV(\mathbf{x}\_{0})\, q^{(0)}(\mathbf{x}\_{0})\, d\mathbf{x}\_{0}. \end{split}$$

Summing from *i* = 1 to *n*, we get

$$\begin{split} \sum\_{i=1}^{n} \mathbb{E} \left[ M\_k^{(1)} (X\_i, X\_{i-1}) \right] &< n \mathbb{E}[M\_k^{(1)} (X\_1, X\_0)] + \sum\_{i=1}^{n} \tau^{i-1} \int R V(\mathbf{x}\_0) q^{(0)} (\mathbf{x}\_0)\, d\mathbf{x}\_0 \\ &= n \mathbb{E}[M\_k^{(1)} (X\_1, X\_0)] + \frac{1 - \tau^{n}}{1 - \tau} \int R V(\mathbf{x}\_0) q^{(0)} (\mathbf{x}\_0)\, d\mathbf{x}\_0. \end{split}$$

This gives us the following bound on $\int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta)$:

$$\begin{split} \int \mathcal{K}(P\_{\theta\_{0}}^{(n)}, P\_{\theta}^{(n)}) \rho\_{n}(d\theta) \leq & \sum\_{k=1}^{m} \Big[ n \mathbb{E} [M\_{k}^{(1)}(X\_{1}, X\_{0})] + \frac{1-\tau^{n}}{1-\tau} \int R V(\mathbf{x}\_{0}) D(\mathbf{x}\_{0}) d\mathbf{x}\_{0} \Big] \\ & \times \int |f\_{k}^{(1)}(\theta, \theta\_{0})| \rho\_{n}(d\theta). \end{split}$$

By Assumption 1, $\int |f\_k^{(1)}(\theta, \theta\_0)| \rho\_n(d\theta) < \frac{C}{\sqrt{n}}$. Hence, we can rewrite the previous expression as

$$\begin{split} \int \mathcal{K}(P\_{\theta\_{0}}^{(n)}, P\_{\theta}^{(n)}) \rho\_{n}(d\theta) &\leq \sum\_{k=1}^{m} \Big[ n \mathbb{E} \left[ M\_{k}^{(1)}(X\_{1}, X\_{0}) \right] + \frac{1-\tau^{n}}{1-\tau} \int R V(\mathbf{x}\_{0}) D(\mathbf{x}\_{0})\, d\mathbf{x}\_{0} \Big] \frac{C}{\sqrt{n}} \\ &= nm \Big[ \mathbb{E} [M\_{k}^{(1)}(X\_{1}, X\_{0})] + \frac{1-\tau^{n}}{n(1-\tau)} \int R V(\mathbf{x}\_{0}) D(\mathbf{x}\_{0})\, d\mathbf{x}\_{0} \Big] \frac{C}{\sqrt{n}}. \end{split}$$

Since $\tau < 1$, $0 < 1 - \tau^n < 1$, and we rewrite the previous equation as

$$\int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta) \le nm \left[ \mathbb{E}[M\_k^{(1)}(X\_1, X\_0)] + \frac{1}{n(1-\tau)} \int R V(\mathbf{x}\_0) D(\mathbf{x}\_0) d\mathbf{x}\_0 \right] \frac{C}{\sqrt{n}}.$$

Hence, there exists an $\varepsilon\_n^{(1)} = O(\frac{1}{\sqrt{n}})$ such that $\int \mathcal{K}(P\_{\theta\_0}^{(n)}, P\_{\theta}^{(n)}) \rho\_n(d\theta) \le n \varepsilon\_n^{(1)}$.

*Part 2: Verifying condition (ii) of Corollary 1:* As in the proof of Theorem 3, we upper bound $\int \mathrm{Var}[r\_n(\theta, \theta\_0)] \rho\_n(d\theta)$ by

$$\begin{split} \int \mathrm{Var}[r\_n(\theta, \theta\_0)] \rho\_n(d\theta) &\le \sum\_{i,j=1}^n \left( \frac{4}{n} + 2n^{\delta/2} \left( \int \mathcal{C}^{(i)}\_{\theta\_0, \theta} \rho\_n(d\theta) + \int \mathcal{C}^{(j)}\_{\theta\_0, \theta} \rho\_n(d\theta) + \int \sqrt{\mathcal{C}^{(i)}\_{\theta\_0, \theta} \mathcal{C}^{(j)}\_{\theta\_0, \theta}}\, \rho\_n(d\theta) \right) \right) \left( \alpha^{\delta/(2+\delta)}\_{|i-j|-1} \right) \\ &\quad + \sum\_{i=1}^{n} \left(\frac{4}{n} + 2n^{\delta/2}\int \mathcal{C}^{(i)}\_{\theta\_0, \theta} \rho\_n(d\theta)\right)\left(\alpha^{\delta/(2+\delta)}\_{i-1}\right), \end{split} \tag{A42}$$

where $\mathcal{C}^{(i)}\_{\theta\_0, \theta}$ is upper bounded as

$$\mathcal{C}\_{\theta\_0, \theta}^{(i)} \le \sum\_{k=1}^m m^{1+\delta}\, \mathbb{E} \left[ M\_k^{(1)} (X\_i, X\_{i-1})^{2+\delta} \right] |f\_k^{(1)}(\theta, \theta\_0)|^{2+\delta}.$$

Since $\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta}\,\middle|\,X\_{i-1}\right] < V(X\_{i-1})$, by a similar application of $V$-geometric ergodicity, we can say that there exists $0 < \tau < 1$ such that

$$\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta}\right] \le \mathbb{E}\left[M\_k^{(1)}(X\_1, X\_0)^{2+\delta}\right] + \tau^{i-1}\int RV(\mathbf{x}\_0) D(\mathbf{x}\_0)\, d\mathbf{x}\_0,$$

which, by the fact that *τi*−<sup>1</sup> < *τ*, gives us,

$$\mathbb{E}\left[M\_k^{(1)}(X\_i, X\_{i-1})^{2+\delta}\right] \le \mathbb{E}\left[M\_k^{(1)}(X\_1, X\_0)^{2+\delta}\right] + \tau \int RV(\mathbf{x}\_0) D(\mathbf{x}\_0)\, d\mathbf{x}\_0.$$

By Assumption 1, we know that $\int |f\_k^{(1)}(\theta, \theta\_0)|^{2+\delta} \rho\_n(d\theta) < \frac{C}{n}$. Hence, for a large constant $L$, $\int \mathcal{C}^{(i)}\_{\theta\_0, \theta}\, \rho\_n(d\theta) \le \frac{L}{n}$. We also see that, since the chain is geometrically ergodic, by the application of Equation (A4), $\sum\_{k\ge 1} \alpha\_k^{\delta/(2+\delta)} < +\infty$. The rest of the proof follows as in the proof of Theorem 3, and we obtain an $\varepsilon\_n^{(2)} = O(\frac{n^{\delta/2}}{n})$ such that

$$\int \text{Var}[r\_n(\theta, \theta\_0)] \rho\_n(d\theta) < n \epsilon\_n^{(2)}.$$

Since $\mathcal{K}(\rho\_n, \pi) \le \sqrt{n}\,C$, the same arguments as in the proof of Theorem 2 hold. The theorem is thus proved.

## Appendix B.3.6. Proof of Proposition 10

For the purpose of the proof, we choose $\rho\_n$ to be a scaled Beta distribution with parameters $a\_n = n\frac{1+\theta\_0}{2}$ and $b\_n = n\frac{1-\theta\_0}{2}$. Since $\rho\_n$ is a scaled Beta distribution with scaling factors $m = 2$ and $c = -1$, the pdf of $\rho\_n$ is given by

$$\rho_n(\theta) = \frac{1}{2\,\mathrm{Beta}(a_n, b_n)}\left(\frac{1+\theta}{2}\right)^{a_n}\left(\frac{1-\theta}{2}\right)^{b_n}.$$

Since this is a scaled distribution, $\mathbb{E}_{\rho_n}[\theta] = 2\frac{a_n}{a_n+b_n} - 1 = \theta_0$, and there exists a constant $\sigma > 0$ such that $\mathrm{Var}_{\rho_n}[\theta] = \frac{\sigma^2}{n}$. We now analyse the log-ratio of the transition probabilities for the Markov chain,

$$\log p\_{\theta\_0}(X\_n|X\_{n-1}) - \log p\_{\theta}(X\_n|X\_{n-1}) = 2X\_n X\_{n-1}(\theta - \theta\_0) + X\_{n-1}^2(\theta\_0^2 - \theta^2).$$

Observe that in this setting, $M_1^{(1)}(X_n, X_{n-1}) = |X_n X_{n-1}|$ and $M_2^{(1)}(X_n, X_{n-1}) = X_{n-1}^2$. Next, using the fact that

$$\mathbb{E}[|X_n|^{2+\delta} \mid X_{n-1}] = \mathbb{E}[|X_n - \theta_0 X_{n-1} + \theta_0 X_{n-1}|^{2+\delta} \mid X_{n-1}],$$

and by an application of triangle inequality, we obtain

$$\begin{split} \mathbb{E}[|X_n|^{2+\delta} \mid X_{n-1}] &\le \mathbb{E}\left[\left(|X_n - \theta_0 X_{n-1}| + |\theta_0 X_{n-1}|\right)^{2+\delta} \,\middle|\, X_{n-1}\right] \\ &= \mathbb{E}\left[\left(2\,\frac{|X_n - \theta_0 X_{n-1}| + |\theta_0 X_{n-1}|}{2}\right)^{2+\delta} \,\middle|\, X_{n-1}\right] \\ &= \mathbb{E}\left[2^{2+\delta}\left(\frac{|X_n - \theta_0 X_{n-1}| + |\theta_0 X_{n-1}|}{2}\right)^{2+\delta} \,\middle|\, X_{n-1}\right]. \end{split}$$

Now by using Jensen's inequality we get,

$$\begin{split} \mathbb{E}[|X_n|^{2+\delta} \mid X_{n-1}] &\le \mathbb{E}\left[2^{2+\delta}\left(\frac{|X_n - \theta_0 X_{n-1}|^{2+\delta} + |\theta_0 X_{n-1}|^{2+\delta}}{2}\right) \,\middle|\, X_{n-1}\right] \\ &= 2^{1+\delta}\,\mathbb{E}\left[|X_n - \theta_0 X_{n-1}|^{2+\delta} \,\middle|\, X_{n-1}\right] + 2^{1+\delta}|\theta_0 X_{n-1}|^{2+\delta}. \end{split}$$

We know that if $Y \sim N(\mu, \sigma^2)$, then $\mathbb{E}|Y - \mu|^p = \sigma^p\,\frac{2^{p/2}\,\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}$. Consequently,

$$\mathbb{E}[|X\_n|^{2+\delta}|X\_{n-1}] \le 2^{1+\delta} \left[ \frac{2^{\frac{2+\delta}{2}} \Gamma(\frac{3+\delta}{2})}{\sqrt{\pi}} \right] + 2^{1+\delta} |\theta\_0 X\_{n-1}|^{2+\delta}.\tag{A43}$$

It follows that,

$$\begin{split} \mathbb{E}[M_1^{(1)}(X_n, X_{n-1})^{2+\delta} \mid X_{n-1}] &\le 2^{1+\delta}\left[\frac{2^{\frac{2+\delta}{2}}\Gamma\left(\frac{3+\delta}{2}\right)}{\sqrt{\pi}}\right]|X_{n-1}|^{2+\delta} + 2^{1+\delta}|\theta_0|^{2+\delta}|X_{n-1}|^{4+2\delta} \\ &\le \left(2^{1+\delta}\left[\frac{2^{\frac{2+\delta}{2}}\Gamma\left(\frac{3+\delta}{2}\right)}{\sqrt{\pi}}\right] + 2^{1+\delta}|\theta_0|^{2+\delta}\right)\left(|X_{n-1}|^{4+2\delta} + 1\right). \end{split}$$

Since $|\theta_0| < 1$, we can say that

$$\mathbb{E}\left[M_1^{(1)}(X_n, X_{n-1})^{2+\delta} \,\middle|\, X_{n-1}\right] \le \left(2^{1+\delta}\left[\frac{2^{\frac{2+\delta}{2}}\Gamma\left(\frac{3+\delta}{2}\right)}{\sqrt{\pi}}\right] + 2^{1+\delta}\right)\left(|X_{n-1}|^{4+2\delta} + 1\right).$$

Define a constant $C_\delta := 2^{1+\delta}\left[\frac{2^{\frac{2+\delta}{2}}\Gamma\left(\frac{3+\delta}{2}\right)}{\sqrt{\pi}}\right] + 2^{1+\delta}$. The above term then becomes

$$\mathbb{E}[M\_1^{(1)}(X\_{n}, X\_{n-1})^{2+\delta}|X\_{n-1}] \le \mathbb{C}\_{\delta}(|X\_{n-1}|^{4+2\delta}+1).$$
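The Gaussian absolute-moment identity used in the bounds above can be sanity-checked numerically for even powers, where the moments are classical ($\mathbb{E}|Y-\mu|^2 = \sigma^2$ and $\mathbb{E}|Y-\mu|^4 = 3\sigma^4$); the snippet below is only an illustration, with our own naming:

```python
import math

def gaussian_abs_moment(p, sigma=1.0):
    """E|Y - mu|^p for Y ~ N(mu, sigma^2), via the closed form above."""
    return sigma**p * 2**(p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

# Classical special cases: second and fourth central absolute moments.
assert abs(gaussian_abs_moment(2, sigma=1.5) - 1.5**2) < 1e-12
assert abs(gaussian_abs_moment(4, sigma=0.7) - 3 * 0.7**4) < 1e-12
```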

Next, we analyse the term $M_2^{(1)}(X_n, X_{n-1})$:

$$\begin{aligned} \mathrm{E}\left[M\_2^{(1)}(X\_{n},X\_{n-1})^{2+\delta}|X\_{n-1}\right] &= \mathrm{E}[X\_{n-1}^{4+2\delta}|X\_{n-1}] \\ &= X\_{n-1}^{4+2\delta} \\ &\leq C\_{\delta}(X\_{n-1}^{4+2\delta}+1). \end{aligned}$$

Then, defining $V(x) := C_\delta(x^{4+2\delta} + 1)$, it follows that

$$\mathbb{E}[V(X\_n)|X\_{n-1}] = \mathbb{E}\left[\mathbb{C}\_\delta(X\_n^{4+2\delta} + 1)|X\_{n-1}\right].$$

Using a technique similar to Equation (A43) we get,

$$\mathbb{E}\left[C_\delta(X_n^{4+2\delta} + 1) \,\middle|\, X_{n-1}\right] \le C_\delta\left(2^{3+2\delta}\left[\frac{2^{\frac{4+2\delta}{2}}\Gamma\left(\frac{5+2\delta}{2}\right)}{\sqrt{\pi}}\right] + 2^{3+2\delta}|\theta_0 X_{n-1}|^{4+2\delta} + 1\right).$$

Define another constant $C'_\delta := C_\delta\left(2^{3+2\delta}\left[\frac{2^{\frac{4+2\delta}{2}}\Gamma\left(\frac{5+2\delta}{2}\right)}{\sqrt{\pi}}\right] - 2^{3+2\delta}|\theta_0|^{4+2\delta} + 1\right)$. Since $\delta > 0$, $\frac{2^{\frac{4+2\delta}{2}}\Gamma\left(\frac{5+2\delta}{2}\right)}{\sqrt{\pi}} > 1$. Furthermore, since $|\theta_0| < 1$, so is $|\theta_0|^{4+2\delta}$. Hence,

$$2^{3+2\delta}\left[\frac{2^{\frac{4+2\delta}{2}}\Gamma\left(\frac{5+2\delta}{2}\right)}{\sqrt{\pi}}\right] - 2^{3+2\delta}|\theta_0|^{4+2\delta} > 0.$$

Hence, we have shown that,

$$\mathbb{E}[V(X\_n)|X\_{n-1}] \le (2^{3+2\delta}|\theta\_0|^{4+2\delta})\mathbb{C}\_{\delta}(X\_{n-1}^{4+2\delta}+1) + \mathbb{C}\_{\delta}'.$$

Since $|\theta_0| < 2^{\frac{1}{4+2\delta}-1}$, we have $2^{3+2\delta}|\theta_0|^{4+2\delta} < 1$, and we can express the above equation as

$$\mathbb{E}[V(X\_n)|X\_{n-1}] \le V(X\_{n-1}) + C'\_{\delta}.$$

Define the set $C(m) := \{x : |x|^{4+2\delta} + 1 \le m\}$. From Proposition 11.4.2 of [20], for a large enough $m$, $C(m)$ forms a petite set. Thus, we have proved that $V(x)$ as defined in this example satisfies Assumption A.1.1, and $\{X_n\}$ is $V$-geometrically ergodic. The $f_j^{(1)}$'s corresponding to Assumption 1 are given by $f_1^{(1)}(\theta, \theta_0) = (\theta - \theta_0)$ and $f_2^{(1)}(\theta, \theta_0) = (\theta_0^2 - \theta^2)$. Therefore, it follows that

$$\begin{aligned} \partial\_{\theta} f\_1^{(1)} &= 1, \\ \partial\_{\theta} f\_2^{(1)} &= -2\theta \text{ and } \\ -2 &< -2\theta < 2. \end{aligned}$$

We have $f_1^{(1)}(\theta_0, \theta_0) = f_2^{(1)}(\theta_0, \theta_0) = 0$, and we just showed that both functions have bounded partial derivatives. We also know that $|\theta| < 1$. Hence, by Proposition 4, the $f_j^{(1)}$'s satisfy the conditions of Assumption 1.

The invariant distribution for the simple linear model Markov chain under parameter $\theta$ is a Gaussian distribution with mean $0$ and variance $\frac{1}{1-\theta^2}$. In other words,

$$q\_{\theta}(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1-\theta^2}{2}x^2}.$$
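As a quick numerical check of this invariant variance (not part of the proof), iterating the one-step variance recursion $v \mapsto \theta^2 v + 1$ of the AR(1) chain with unit-variance Gaussian innovations converges to $\frac{1}{1-\theta^2}$:

```python
# Sanity check: for X_n = theta * X_{n-1} + e_n with e_n ~ N(0, 1),
# Var(X_n) = theta^2 * Var(X_{n-1}) + 1, whose fixed point is 1 / (1 - theta^2).
theta = 0.6
v = 0.0  # start the chain at the deterministic value X_0 = 0
for _ in range(200):
    v = theta**2 * v + 1.0

assert abs(v - 1.0 / (1.0 - theta**2)) < 1e-10  # converges to 1.5625
```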

Analyzing the log likelihood yields,

$$\begin{aligned} \log q_{\theta_0}(x) - \log q_\theta(x) &= -\frac{x^2}{2}(1 - \theta_0^2) + \frac{x^2}{2}(1 - \theta^2) \\ &= \frac{x^2}{2}(\theta_0^2 - \theta^2). \end{aligned}$$

Let $f_1^{(2)}(\theta, \theta_0) = (\theta_0^2 - \theta^2)$, so that $f_1^{(2)}(\theta_0, \theta_0) = 0$. Since $f_1^{(2)}(\theta, \theta_0) = f_2^{(1)}(\theta, \theta_0)$, by following arguments similar to those above, we can conclude that $f_1^{(2)}$ also satisfies the requirements of Assumption 1. Let $M_1^{(2)}(x) = \frac{x^2}{2}$ and define $M_2^{(2)}(x) := 1$. Let $X_0 \sim q_1^{(0)}$. As long as $\int x^{4+2\delta} q_1^{(0)}(x)\,dx < \infty$, we satisfy all the conditions required for Theorem 5. Finally, we need to verify the condition that $\mathcal{K}(\rho_n, \pi) < C\sqrt{n}$ for some constant $C > 0$. The KL-divergence $\int \log\frac{\rho_n(\theta)}{\pi(\theta)}\rho_n(d\theta)$ becomes

$$\begin{split} \mathcal{K}(\rho_n, \pi) = \int_{-1}^{1} \log\left(\frac{1}{2\,\mathrm{Beta}(a_n, b_n)}\left(\frac{1+\theta}{2}\right)^{a_n}\left(\frac{1-\theta}{2}\right)^{b_n}\right) \\ \times \frac{1}{2\,\mathrm{Beta}(a_n, b_n)}\left(\frac{1+\theta}{2}\right)^{a_n}\left(\frac{1-\theta}{2}\right)^{b_n} d\theta. \end{split}$$

Substituting $y = \frac{1+\theta}{2}$, we get

$$\begin{split} \mathcal{K}(\rho_n, \pi) &= \int_0^1 \log\left(\frac{1}{2\,\mathrm{Beta}(a_n, b_n)}\, y^{a_n}(1-y)^{b_n}\right) \frac{1}{\mathrm{Beta}(a_n, b_n)}\, y^{a_n}(1-y)^{b_n}\, dy \\ &= \int_0^1 \log\left(\frac{1}{2}\right) \frac{1}{\mathrm{Beta}(a_n, b_n)}\, y^{a_n}(1-y)^{b_n}\, dy \\ &\quad + \int_0^1 \log\left(\frac{1}{\mathrm{Beta}(a_n, b_n)}\, y^{a_n}(1-y)^{b_n}\right) \frac{1}{\mathrm{Beta}(a_n, b_n)}\, y^{a_n}(1-y)^{b_n}\, dy. \end{split}$$

The first term integrates to $\log(1/2)$. The second term is the KL-divergence between a Beta distribution with parameters $a_n = n\frac{1+\theta_0}{2}$ and $b_n = n\left(1 - \frac{1+\theta_0}{2}\right)$ and the Uniform distribution on $[0, 1]$. By Lemma A.2.1, $\mathcal{K}(\rho_n, \pi)$ is then upper bounded by

$$\mathcal{K}(\rho_n, \pi) < \log(1/2) + C_1 + \frac{1}{2}\log(n) < C\sqrt{n}$$

for some large constant *C*. This completes the proof.

*Appendix B.4. Proofs for Misspecified Models*

## Proof of Theorem 6

As in the proof of Theorem 1, following Equation (A13), we note that,

$$\begin{split} \int \mathcal{D}_{\alpha}(P_\theta^{(n)}, P_{\theta_0}^{(n)})\, \pi_{n,\alpha|X^n}(d\theta) &\le \frac{\alpha}{1-\alpha}\int \mathcal{K}(P_{\theta_0}^{(n)}, P_\theta^{(n)})\, \rho_n(d\theta) \\ &\quad + \frac{\alpha}{1-\alpha}\sqrt{\frac{\mathrm{Var}\left[\int r_n(\theta, \theta_0)\rho_n(d\theta)\right]}{\eta}} + \frac{\mathcal{K}(\rho_n, \pi) - \log(\varepsilon)}{1-\alpha}. \end{split} \tag{A44}$$

Following from Equations (23) and (26), we get that,

$$\int \mathcal{K}(P_{\theta_0}^{(n)}, P_\theta^{(n)})\, \rho_n(d\theta) \le \mathbb{E}[r_n(\theta_0, \theta_n^*)] + n\epsilon_n,$$

and

$$\int \mathrm{Var}[r_n(\theta, \theta_0)]\, \rho_n(d\theta) \le 2n\epsilon_n + 2\,\mathrm{Var}[r_n(\theta_n^*, \theta_0)].$$

Plugging these into Equation (A44), we are done.

## **References**


## *Article* **Approximate Bayesian Computation for Discrete Spaces**

**Ilze A. Auzina † and Jakub M. Tomczak \*,†**

Department of Computer Science, Faculty of Science, Vrije Universiteit Amsterdam, De Boelelaan 1111, 1081 HV Amsterdam, The Netherlands; ilze.amanda.auzina@gmail.com

**\*** Correspondence: jmk.tomczak@gmail.com

† These authors contributed equally to this work.

**Abstract:** Many real-life processes are black-box problems, i.e., the internal workings are inaccessible or a closed-form mathematical expression of the likelihood function cannot be defined. For continuous random variables, likelihood-free inference problems can be solved via Approximate Bayesian Computation (ABC). However, an optimal alternative for discrete random variables is yet to be formulated. Here, we aim to fill this research gap. We propose an adjusted population-based MCMC ABC method by re-defining the standard ABC parameters to discrete ones and by introducing a novel Markov kernel that is inspired by differential evolution. We first assess the proposed Markov kernel on a likelihood-based inference problem, namely discovering the underlying diseases based on a QMR-DT network and, subsequently, the entire method on three likelihood-free inference problems: (i) the QMR-DT network with the unknown likelihood function, (ii) learning a binary neural network, and (iii) neural architecture search. The obtained results indicate the high potential of the proposed framework and the superiority of the new Markov kernel.

**Keywords:** Approximate Bayesian Computation; differential evolution; MCMC; Markov kernels; discrete state space

## **1. Introduction**

In various scientific domains, an accurate simulation model can be designed, yet formulating the corresponding likelihood function remains a challenge. In other words, there is a simulator of a process available that, when provided an input, returns an output, but the inner workings of the process are not analytically available [1–5]. Thus far, the existing tools for solving such problems are typically limited to continuous random variables. Consequently, many discrete problems are reparameterized to continuous ones via, for example, the Gumbel-softmax trick [6] rather than being solved directly. In this paper, we aim at providing a solution to this problem by translating the existing likelihood-free inference methods to discrete space applications.

Commonly, likelihood-free inference problems for continuous data are solved via a group of methods known under the term Approximate Bayesian Computation (ABC) [2,7]. The main idea behind ABC methods is to model the posterior distribution by approximating the likelihood as the fraction of accepted simulated data points from the simulator model, through the use of a distance measure *δ* and a tolerance value *ε*. The first approach, known as the ABC-rejection scheme, has been successfully applied in biology [8,9], and since then, many alternative versions of the algorithm have been introduced, with the three main groups represented by Markov Chain Monte Carlo (MCMC) ABC [10], Sequential Monte Carlo (SMC) ABC [11], and neural network-based ABC [12,13]. In the current paper, we focus on the MCMC-ABC version [14] for discrete data applications, as it can be more readily implemented and the computational costs are lower [15]. Thus, the efficiency of our newly proposed likelihood-free inference method will depend on two parts, namely (i) the design of the proposal distribution for the MCMC algorithm and (ii) the selected hyperparameter values for the ABC algorithm.

**Citation:** Auzina, I.A.; Tomczak, J.M. Approximate Bayesian Computation for Discrete Spaces. *Entropy* **2021**, *23*, 312. https://doi.org/10.3390/ e23030312

Academic Editor: Pierre Alquier

Received: 18 February 2021 Accepted: 2 March 2021 Published: 6 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Our main focus is on optimal proposal distribution design, as there is no "natural" notion of the search direction and scale for discrete data spaces. Hence, the presented solution is inspired by Differential Evolution (DE) [16], which has been shown to be an effective optimization technique for many likelihood-free (or black-box) problems [17,18]. We propose to define a probabilistic DE kernel for discrete random variables that allows us to traverse the search space without specifying any external parameters. We evaluate our approach on four test-beds: (i) we verify our proposal on a benchmark problem of the QMR-DT network presented by [19]; (ii) we modify the first problem and formulate it as a likelihood-free inference problem; (iii) we assess the applicability of our method for high-dimensional data, namely training binary neural networks on MNIST data; (iv) we apply the proposed approach to Neural Architecture Search (NAS) using the benchmark dataset proposed by [20].

The contribution of the present paper is as follows. First, we introduce an alternative version of the MCMC-ABC algorithm, namely a population-based MCMC-ABC method, that is applicable to likelihood-free inference tasks with discrete random variables. Second, we propose a novel Markov kernel for likelihood-based inference methods in a discrete state space. Third, we present the utility of the proposed approach on three binary problems.

#### **2. Likelihood-Free Inference and ABC**

Let $\mathbf{x} \in \mathcal{X}$ be a vector of parameters or decision variables, where $\mathcal{X} = \mathbb{R}^D$ or $\mathcal{X} = \{0, 1\}^D$, and $\mathbf{y} \in \mathbb{R}^M$ is a vector of observable variables. Typically, for a given collection of observations of $\mathbf{y}$, $y_{data} = \{y_n\}_{n=1}^{N}$, we are interested in solving the following optimization problem (we note that the logarithm does not change the optimization problem, but it is typically used in practice):

$$\mathbf{x}^\* = \arg\max \ln p(y\_{data}|\mathbf{x}),\tag{1}$$

where *p*(*ydata*|*x*) is the likelihood function. Sometimes, it is more advantageous to calculate the posterior:

$$
\ln p(\mathbf{x}|y\_{data}) = \ln p(y\_{data}|\mathbf{x}) + \ln p(\mathbf{x}) - \ln p(y\_{data}),
\tag{2}
$$

where *p*(*x*) denotes the prior over *x* and *p*(*ydata*) is the marginal likelihood. The posterior *p*(*x*|*ydata*) could be further used in Bayesian inference.

In many practical applications, the likelihood function is unknown, but it is possible to obtain (approximate) samples from *p*(*y*|*x*) through a simulator. Such a problem is referred to as likelihood-free inference [3] or a black-box optimization problem [1]. If the problem is about finding the posterior distribution over *x* while only a simulator is available, then it is considered an Approximate Bayesian Computation (ABC) problem, meaning that *p*(*ydata*|*x*) is assumed to be represented by the simulator.

#### **3. Population-Based MCMC**

Typically, a likelihood-free inference problem or an ABC problem is solved through sampling. One of the most well-known sampling methods is the Metropolis–Hastings algorithm [21], where the samples are generated from an ergodic Markov chain, and the target density is estimated via Monte Carlo sampling. In order to speed up the computations, it is proposed to run multiple chains in parallel rather than sampling from a single chain. This approach is known as population-based MCMC methods [22]. A population-based MCMC method operates over a joint state space with the following distribution:

$$p(\mathbf{x}\_1, \dots, \mathbf{x}\_\mathcal{C}) = \prod\_{\mathcal{c} \in \mathcal{C}} p\_\mathcal{c}(\mathbf{x}\_\mathcal{c}) \tag{3}$$

where C denotes the population of chains and at least one of *pc*(*xc*) is equivalent to the original distribution we want to sample from (e.g., the posterior distribution *p*(*x*|*ydata*)).

Given a population of chains, a question of interest is what is the best proposal distribution for an efficient sampling convergence. One approach is parallel tempering. It introduces an additional temperature parameter and initializes each chain at a different temperature [23,24]. However, the performance of the algorithm highly depends on an appropriate cooling schedule rather than a smart interaction between the chains. A different approach proposed by [25] relies on a suitable proposal that is able to adapt the shape of the population at a single temperature. We further expand on this idea by formulating population-based proposal distributions that are inspired by evolutionary algorithms.

## *3.1. Continuous Case*

Reference [26] successfully formulated a new proposal called Differential Evolution Markov Chain (DE-MC) that combines the ideas of differential evolution and population-based MCMC. In particular, the DE-1 equation [16] was redefined by adding noise, *ε*, to it:

$$\mathbf{x}\_{new} = \mathbf{x}\_{i} + \gamma(\mathbf{x}\_{j} - \mathbf{x}\_{k}) + \varepsilon,\tag{4}$$

where *<sup>ε</sup>* is sampled from a Gaussian distribution, *<sup>γ</sup>* <sup>∈</sup> <sup>R</sup>+. The created proposal automatically implies the invariance of the underlying distribution, as the reversibility condition is satisfied:

• Reversibility is met, because the suggested proposal could be inverted to obtain *xi*.

Furthermore, the created Markov chain is ergodic, as the following two conditions are met:

• Aperiodicity is met, because the general setup of the chain resembles a random walk, so it has no fixed period.
• Irreducibility is met, because the Gaussian noise *ε* has full support, giving a positive transition probability between any two states.
Hence, the resulting Markov chain has a unique stationary distribution. The results presented by [26] indicate an advantage of DE-MC over conventional MCMC with respect to the speed of calculations, convergence, and applicability to multimodal distributions, therefore positioning DE as an optimal method for choosing an appropriate scale and orientation of the jumping distribution for a population-based MCMC.
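As an illustration, the DE-MC proposal in Equation (4) can be sketched in a few lines; this is only a sketch, and the function and argument names below are ours, not from [26]:

```python
import random

def de_mc_proposal(population, i, gamma=0.8, noise_sd=1e-3):
    """Sketch of the DE-MC proposal of Equation (4):
    x_new = x_i + gamma * (x_j - x_k) + eps,
    where (j, k) indexes two other chains drawn at random and eps is
    small Gaussian noise added component-wise."""
    others = [c for c in range(len(population)) if c != i]
    j, k = random.sample(others, 2)
    return [xi + gamma * (xj - xk) + random.gauss(0.0, noise_sd)
            for xi, xj, xk in zip(population[i], population[j], population[k])]
```

Because the update direction *x<sub>j</sub>* − *x<sub>k</sub>* is taken from the current spread of the population, the proposal adapts its scale and orientation automatically, which is exactly the property highlighted above.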

## *3.2. Discrete Case*

In this paper, we focus on binary variables, because categorical variables could always be transformed to a binary representation. Hence, the most straightforward proposal for binary variables is the independent sampler that utilizes a product of Bernoulli distributions:

$$q(\mathbf{x}) = \prod\_{d} B(\theta\_d), \tag{5}$$

where *B*(*θd*) denotes the Bernoulli distribution with a parameter *θd*. However, the above proposal does not utilize the information available across the population; hence, the performance could be improved by allowing the chains to interact. We investigate exactly this possibility in the following section.
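A minimal sketch of this independent sampler (function and argument names are ours):

```python
import random

def ind_samp(D, theta=0.5):
    """Independent Bernoulli proposal of Equation (5): each of the D bits is
    drawn independently with probability theta, ignoring the other chains."""
    return [1 if random.random() < theta else 0 for _ in range(D)]
```

Note that the proposal depends only on *D* and *θ*: no information from the rest of the population enters the draw, which is precisely the limitation addressed in Section 4.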

## **4. Our Approach**

#### *4.1. Markov Kernels*

We propose to utilize the ideas outlined by [26], but in a discrete space. For this purpose, we need to relate the DE-1 equation to logical operators, as now the vector *x* is represented by a string of bits, <sup>X</sup> <sup>=</sup> {0, 1}*D*, and properly defined noise. Following [19], we propose to use the *xor* operator between two bits *b*<sup>1</sup> and *b*2:

$$b\_1 \otimes b\_2 = \begin{cases} 1, & b\_1 \neq b\_2 \\ 0, & b\_1 = b\_2 \end{cases} \tag{6}$$

instead of the subtraction in (4). Next, we define a difference between two chains *xi* and *xj* as *δ<sup>k</sup>* = *xi* ⊗ *xj* and a set of all possible differences between two chains, Δ = {*δ<sup>k</sup>* : ∀*xi*,*xj*∈C *<sup>δ</sup><sup>k</sup>* = *xi* ⊗ *xj*} (a similar construction could be done for the continuous case). We can construct a distribution over *δ<sup>k</sup>* as a uniform distribution:

$$q(\delta|\mathcal{C}) = \frac{1}{|\Delta|} \sum\_{\delta\_k \in \Delta} \mathbb{I}\left[\delta\_k = \delta\right], \tag{7}$$

where <sup>|</sup>Δ<sup>|</sup> denotes the cardinality of <sup>Δ</sup> and <sup>I</sup>[·] is an indicator function such that <sup>I</sup>[*δ<sup>k</sup>* <sup>=</sup> *δ*] = 1 if *δ<sup>k</sup>* = *δ* and zero otherwise. Now, we can formulate a binary equivalence of the DE-1 equation by adding a difference drawn from *q*(*δ*|C):

$$
\mathfrak{x}\_{\text{new}} = \mathfrak{x}\_{\text{i}} \otimes \delta\_{\text{k}}.\tag{8}
$$

However, the proposal defined in (8) is not a valid ergodic Markov kernel, as is shown in the following Remark.

**Remark 1.** *The proposal defined in (8) fulfills reversibility and aperiodicity, but it does not meet the irreducibility requirement.*

**Proof.** Reversibility is met, as *xi* can be re-obtained by applying the difference to the left side of (8). Aperiodicity is met because the general setup of the Markov chain is kept unchanged (it resembles a random walk). However, the operation in (8) is deterministic; thus, it violates the irreducibility assumption.

The missing property of (8) could be fixed by including the following mutation (*mut*) operation:

$$\mathbf{x}\_{l} = \begin{cases} 1 - \mathbf{x}\_{l} & \text{if } p\_{flip} \ge u \\ \mathbf{x}\_{l} & \text{otherwise} \end{cases} \tag{9}$$

where *pflip* ∈ (0, 1) corresponds to an independent probability of flipping a bit and *u* ∼ *U*(0, 1), where *U*(0, 1) denotes the uniform distribution. Then, the following proposal could be formulated [19], as in Proposition 1.

**Proposition 1.** *The proposal defined as a mixture qmut*+*xor*(*x*|C) = *πqmut*(*x*|C)+(1 − *π*)*qxor* (*x*|C)*, where π* ∈ (0, 1)*, qmut*(*x*|C) *is defined by (9) and qxor*(*x*|C) *is defined by (8), is a proper Markov kernel.*

**Proof.** Reversibility and aperiodicity were shown in Remark 1. The irreducibility is met, because the *mut* proposal assures that there is a positive transition probability across the entire search space.

However, we notice that there are two potential issues with the mixture proposal *mut+xor*. First, it introduces another hyperparameter, *π*, that needs to be determined. Second, an improperly chosen *π* could negatively affect the convergence speed, i.e., a fixed value that makes mutations either too frequent or too scarce would drastically slow the convergence.

In order to overcome these issues, we propose to apply the *mut* operation in (9) directly to *δk*, in a similar manner as the Gaussian noise is added to *γ*(*xi* − *xj*) in the proposal of [26]. As a result, we obtain the following proposal:

$$\mathbf{x}\_{new} = \mathbf{x}\_{i} \otimes mut(\delta\_{k}).\tag{10}$$

Importantly, this proposal fulfills all requirements for an ergodic Markov kernel.

**Proposition 2.** *The proposal defined in (10) is a valid ergodic Markov kernel.*

**Proof.** Reversibility and aperiodicity are met in the same manner as shown in Proposition 1. Adding the mutation operation directly to *δ<sup>k</sup>* allows obtaining all possible states in the discrete space; thus, the irreducibility requirement is met.

We refer to this new Markov kernel for discrete random variables as the discrete differential evolution Markov chain (*dde-mc*).
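Under the definitions above, the *dde-mc* kernel can be sketched as follows; the function names are ours, and drawing a random pair of other chains stands in for sampling *δk* from *q*(*δ*|C) in (7):

```python
import random

def xor(a, b):
    """Bit-wise xor of two bit-strings, Equation (6)."""
    return [x ^ y for x, y in zip(a, b)]

def mut(bits, p_flip=0.01):
    """Mutation operation of Equation (9): flip each bit independently
    with probability p_flip."""
    return [1 - b if random.random() < p_flip else b for b in bits]

def dde_mc_proposal(population, i, p_flip=0.01):
    """Sketch of the dde-mc kernel of Equation (10):
    x_new = x_i xor mut(delta_k), where delta_k = x_j xor x_k is the
    difference between two other chains."""
    j, k = random.sample([c for c in range(len(population)) if c != i], 2)
    delta = xor(population[j], population[k])
    return xor(population[i], mut(delta, p_flip))
```

Mutating *δk* before the *xor*, rather than mixing in a separate *mut* kernel, removes the extra hyperparameter *π* while keeping the kernel irreducible, as stated in Proposition 2.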

## *4.2. Population-MCMC-ABC*

Since we formulated a proposal distribution that utilizes a population of chains, we propose to use a population-based MCMC algorithm for the discrete ABC problems. The core of the MCMC-ABC algorithm is to use a proxy of the likelihood function defined as an *ε*-ball around the observed data, i.e., $\|y - y_{data}\| \le \epsilon$, where $\epsilon > 0$ and $\|\cdot\|$ is a chosen metric. The convergence speed and the acceptance rate highly depend on the value of *ε* [27–29]. In this paper, we consider two approaches to determine the value of *ε*: (i) by setting a fixed value and (ii) by sampling *ε* ∼ *Exp*(*τ*) [30]. See the Appendix A for details.

A single step of the population-MCMC-ABC algorithm is presented in Algorithm 1. Notice that in Line 5, we take advantage of the symmetry of all the proposals. Moreover, in the procedure, we skip an outer loop over all chains for clarity. Without loss of generality, we assume a simulator to be a probabilistic program denoted by *p*˜(*y*|*x*).

**Algorithm 1** Population-MCMC-ABC.
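A single-chain step of the procedure can be sketched as follows; this is a sketch under our own naming (the simulator, distance, proposal, and prior are passed in as functions), with the proposal assumed symmetric as noted above:

```python
import math
import random

def mcmc_abc_step(x, y_data, simulate, distance, propose, log_prior, eps):
    """One chain update of a population-MCMC-ABC scheme. A candidate is drawn
    from a symmetric proposal, pseudo-data are simulated, and the move is
    accepted only if the pseudo-data fall inside the eps-ball around y_data
    and a Metropolis test on the prior ratio succeeds."""
    x_new = propose(x)
    y_sim = simulate(x_new)
    if distance(y_sim, y_data) <= eps:
        # Symmetric proposal: the acceptance ratio reduces to the prior ratio.
        accept_prob = math.exp(min(0.0, log_prior(x_new) - log_prior(x)))
        if random.random() < accept_prob:
            return x_new
    return x
```

Rejections caused by the *ε*-ball are what make the accepted states approximate draws from the ABC posterior, which is why the acceptance rate reported in Table 1 is so sensitive to *ε*.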


#### **5. Experiments**

In order to verify our proposed approach, we use four test-beds:

• a likelihood-based QMR-DT network, as presented by [19];
• a likelihood-free version of the QMR-DT network;
• training binary neural networks on MNIST data;
• Neural Architecture Search (NAS) on the benchmark dataset proposed by [20].
With each test-bed, we increase the complexity of the problem. Hence, the number of iterations chosen varies per experiment. The code of the methods and all experiments is available at the following link: https://github.com/IlzeAmandaA/ABCdiscrete (accessed on 5 March 2021).

#### *5.1. A Likelihood-Based QMR-DT Network*

#### 5.1.1. Implementation Details

The overall setup was designed as described by [19], i.e., we considered a QMR-DT network model. The architecture of the network follows a two-level or bipartite graphical model, where the top level of the graph contains nodes for the diseases and the bottom level contains nodes for the findings [31]. The following density model captures the relations between the diseases (*x*) and findings (*y*):

$$p(y\_i = 1 | \mathbf{x}) = 1 - (1 - q\_{i0}) \prod\_{l} (1 - q\_{il})^{\mathbf{x}\_l} \tag{11}$$

where *yi* is an individual bit of string *y* and *qi*<sup>0</sup> is the corresponding leak probability, i.e., the probability that the finding is caused by means other than the diseases included in the QMR-DT model [31]. *qil* is the association probability between disease *l* and finding *i*, i.e., the probability that the disease *l* alone could cause the finding *i* to have a positive outcome. For a complete inference, the prior *p*(*x*) is specified. We follow the assumption made by [19] that the diseases are independent:

$$p(\mathbf{x}) = \prod\_{l} p\_l^{\mathbf{x}\_l} (1 - p\_l)^{(1 - x\_l)} \tag{12}$$

where *pl* is the prior probability for disease *l*.
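The noisy-OR simulator defined by Equation (11) can be sketched as follows (a sketch; the argument names are ours):

```python
import random

def simulate_findings(x, q_leak, q_assoc):
    """Sample a findings vector y from the noisy-OR model of Equation (11).
    x       : list of disease bits x_l
    q_leak  : q_i0, leak probability per finding
    q_assoc : q_il, association probabilities, one row per finding."""
    y = []
    for i, row in enumerate(q_assoc):
        p_off = 1.0 - q_leak[i]
        for l, x_l in enumerate(x):
            p_off *= (1.0 - row[l]) ** x_l
        p_on = 1.0 - p_off  # p(y_i = 1 | x), Equation (11)
        y.append(1 if random.random() < p_on else 0)
    return y
```

In the likelihood-free experiments of Section 5.2, only such forward samples are available to the inference algorithm, while the probabilities *qi*<sup>0</sup> and *qil* themselves are treated as hidden inside the black box.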

We compare the performance of the *dde-mc* kernel to the *mut* proposal, the *mut+xor* proposal, the *mut+crx* proposal (see [19] for details), and the independent sampler (*ind-samp*) as in (5) with sampling probability *θ<sup>d</sup>* = 0.5. We expect the DE-inspired proposals to outperform *ind-samp*, and *dde-mc* to perform similarly to, if not surpass, *mut+xor*. Out of the possible parameter settings, we investigate the following population sizes *C* = {8, 12, 24, 40, 60}, as well as bit-flipping probabilities *pflip* = {0.1, 0.05, 0.01, 0.005}. All experiments were run for 10,000 iterations, as in earlier work by [19] it was observed that the performance differences after 10,000 steps were negligible, and initial experiments revealed that in the current work, all proposals approximately converged at this mark. Furthermore, the performance was validated over 80 random problem instances, and the resulting mean and its standard error are reported.

In this experiment, we used the error that is defined as the average Hamming distance between the real values of *x* and the most probable values found by the population-MCMC with different proposals. The number of diseases was set to *m* = 20, and the number of findings was *n* = 80.

## 5.1.2. Results and Discussion

DE-inspired proposals, *dde-mc* and *mut+xor*, are superior to kernels stemming from genetic algorithms or random search, i.e., *mut+crx*, *mut*, and *ind-samp* (Figure 1). In particular, *dde-mc* converged the fastest (see the first 4000 evaluations in Figure 1), suggesting that an update via a single operator rather than a mixture is most effective. As expected, *ind-samp* requires many evaluations to obtain a reasonable performance. Even more so, the obtained difference in wall-clock time between *dde-mc* and *ind-samp* was negligible, 148 versus 117 min, respectively, even though the computational complexity of the new method is theoretically higher: given a search space of {0, 1}*D*, the *dde-mc* proposal costs O(D), while the time complexity of *ind-samp* is O(1).

Based on the obtained results, the subsequent experiments were carried out only with *dde-mc*, *mut+xor*, and *ind-samp* as a baseline. *mut+crx* and *mut* were not selected due to their very slow convergence on high-dimensional problems.

**Figure 1.** A comparison of the considered proposals using the population average error. The obtained mean and its corresponding standard error (shaded area) across 80 random problem instances are plotted. The following settings were used: *C* = 24, *pflip* = 0.01, *pcross* = 0.5. The corresponding equations for each proposal are as follows: *mut* as in (9), *ind-samp* as in (5), *dde-mc* as in (10), and *mut+xor*, *mut+crx* as in [19].

### *5.2. A Likelihood-Free QMR-DT Network*

#### 5.2.1. Implementation Details

In this test-bed, the QMR-DT network is redefined as a simulator model, i.e., the likelihood is assumed to be intractable. The Hamming distance is selected as the distance metric, but due to its equivocal nature for high-dimensional data, the dimensionality of the problem is reduced. In particular, the numbers of diseases and observations (i.e., findings) are decreased to 10 and 20, respectively, while the probabilities of the network are sampled from a beta distribution, *Beta*(0.15, 0.15). The resulting network is more deterministic as the underlying density distributions are more peaked; thus, the stochasticity of the simulator is reduced. Multiple tolerance values are investigated to find the optimal settings, *ε* ∈ {0.5, 0.8, 1.0, 1.2, 1.5, 2.0}. The minimal value is chosen to be 0.5 due to variability across the observed data *ydata*. Additionally, we checked sampling *ε* from the exponential distribution. All experiments were cross-evaluated 80 times, and each experiment was initialized with different underlying parameter settings.

#### 5.2.2. Results and Discussion

First, for a fixed value of *ε*, we notice that *dde-mc* converged faster and to a better (local) optimum than *mut+xor*. However, this effect could be explained by the lower dimensionality of the problem compared to the first experiment. Second, utilizing the exponential distribution had a profound positive effect on the convergence rate of both *dde-mc* and *mut+xor* (Figure 2). This confirmed the expectation that an adjustable *ε* achieves a better balance between exploration and exploitation. In particular, *ε* ∼ *Exp*(2) brought the best results, with *dde-mc* converging the fastest, followed by *mut+xor* and *ind-samp*. This is in line with the corresponding acceptance rates for the first 10,000 iterations (Table 1), i.e., the use of a smarter proposal allows increasing the acceptance probability, as the search space is investigated more efficiently.

**Figure 2.** A comparison of the considered proposals using the population error for exponentially adjusted *ε* and fixed *ε* (indicated by \*). The shaded area corresponds to the standard error across 80 random problem instances. The parameter settings are as follows: *C* = 24, *pflip* = 0.01, *ε* = 2.0. The following equations describe the proposal distributions utilized in Algorithm 1: *ind-samp* as in (5), *dde-mc* as in (10), and *mut+xor* as in [19].

**Table 1.** Percentage of acceptance ratio, *α*.


Furthermore, the final error obtained by the likelihood-free inference approach is comparable with the results reported for the likelihood-based approach (Figures 1 and 2). This is a positive outcome as any approximation of the likelihood will always be inferior to an exact solution. In particular, the final error obtained by the *dde-mc* proposal is lower; however, this is accounted for by the reduced dimensionality of the problem. Interestingly, despite approximating the likelihood, the computational time only increased twice, while the best performing chain was already identified after 4000 evaluations (Figure 3).

**Figure 3.** A comparison of the considered proposals using the minimum average error (i.e., the lowest error found by the population) on QMR-DT with adjusted *ε*. The shaded area corresponds to the standard error across 80 random problem instances. The parameter settings are as follows: *C* = 24, *pflip* = 0.01, *ε* = 2.0. The corresponding equations represent the proposal distributions: *ind-samp* in (5), *dde-mc* in (10), and *mut+xor* as in [19].

Lastly, the obtained results were validated by comparing the true approximate posterior distribution to the approximate posterior distribution of the last five generations of the multi-chain ensemble. In Figure 4, the negative logarithm of the posterior distribution is plotted. The main conclusion is that all proposals converge towards the approximate posterior, yet the obtained distributions are more dispersed.

**Figure 4.** Approximate posterior distribution. The approximate posterior distribution, *p*(*x*|*ydata*) ≈ *p*(*ydata*|*x*) ∗ *p*(*x*), was computed using the last population of each chain for all 80 random problem instances. To reconstruct the true posterior, the true underlying parameters were used.

#### *5.3. Binary Neural Networks*

#### 5.3.1. Implementation Details

In the following experiment, we aimed at evaluating our approach on a high-dimensional optimization problem. We trained a Binary Neural Network (BinNN) with a single fully-connected hidden layer on the image dataset of ten handwritten digits (MNIST [32]). We used 20 hidden units, and the images were resized from 28px × 28px to 14px × 14px. Furthermore, the images were converted to polar values of +1 or −1, while the network was created in accordance with [33], where the weights and activations of the network were binary, meaning that they were constrained to +1 or −1 as well. We simplified the problem to a binary classification by only selecting two digits from the dataset. As a result, the total number of weights equaled 3940. We used the *tanh* activation function for the hidden units and the sigmoid activation function for the outputs. Consequently, the distance metric becomes the classification error:

$$||y\_{data} - y|| = 1 - \frac{1}{N} \sum\_{n=1}^{N} \mathbb{I}[y\_n = y\_n(\mathbf{x})],\tag{13}$$

where *N* denotes the number of images, I[·] is an indicator function, *yn* is the true label for the *n*-th image, and *yn*(*x*) is the *n*-th label predicted by the binary neural network with weights *x*.
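The distance metric of Equation (13) is straightforward to compute; a minimal sketch (the function name is ours):

```python
import numpy as np

def classification_error(y_data, y_pred):
    """Distance metric of Equation (13): one minus the fraction of
    correctly classified images."""
    y_data, y_pred = np.asarray(y_data), np.asarray(y_pred)
    return 1.0 - np.mean(y_data == y_pred)
```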

For the Metropolis acceptance rule, we define a Boltzmann distribution over the prior distribution of the weights *x* inspired by the work of [34]:

$$p(\mathbf{x}) = \frac{h(\mathbf{x})}{\sum\_{i} h(\mathbf{x}\_i)},\tag{14}$$

where $h(\mathbf{x}) = \exp\left(-\frac{1}{D}\sum_{i=1}^{D} x_i\right)$ and *D* denotes the dimensionality of *x*. As a result, the prior distribution acts as a regularization term, as it favors parameter settings with fewer active weights. The distribution is independent of the data *y*; thus, the partition function $\sum_i h(\mathbf{x}_i)$ cancels out in the computation of the Metropolis ratio:

$$\alpha = \frac{p(\mathbf{x'})}{p(\mathbf{x})} = \frac{h(\mathbf{x'})}{h(\mathbf{x})}.\tag{15}$$
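Because the partition function cancels in Equation (15), only *h* needs to be evaluated; a minimal sketch under the assumption of ±1-valued weight vectors (the function name is ours):

```python
import numpy as np

def metropolis_ratio(x_new, x_old):
    """Metropolis ratio of Equation (15) under the Boltzmann prior of
    Equation (14): the partition function cancels, leaving h(x')/h(x)
    with h(x) = exp(-(1/D) * sum_i x_i)."""
    h = lambda x: np.exp(-np.mean(x))
    return h(x_new) / h(x_old)
```

A proposal that deactivates weights (moves them toward −1) increases the ratio, reflecting the prior's preference for fewer active weights.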

The original dataset consists of 60,000 training examples and 10,000 test examples. For our experiment, we selected the digits 0 and 1; hence, the dataset size was reduced to 12,665 training and 2115 test examples. Different tolerance values were investigated to obtain the best convergence, ranging from 0.03 to 0.2, and each experiment was run for at least 200,000 iterations. All experiments were cross-evaluated five times. Lastly, we evaluated the performance by computing both the minimum test error obtained by the final population, as well as the test error obtained by using a Bayesian approach, i.e., we computed the true predictive distribution via majority voting by utilizing an ensemble of models. In particular, we selected the five last updated populations, resulting in 5 × 24 × 5 = 600 models per run, and we repeated this with different seeds 10 times.

Because the classification error function in (13) is non-differentiable, the problem can be treated as a black-box objective. However, we want to emphasize that we do not propose our method as an alternative to gradient-based learning methods. In principle, any gradient-based approach will be superior to a derivative-free method, as what a derivative-free method tries to achieve is to implicitly approximate the gradient [1]. Therefore, the purpose of the presented experiment is not to showcase a state-of-the-art classification accuracy, as that has already been achieved with gradient-based approaches for BinNN [33], but rather to showcase the applicability of population-MCMC-ABC to a high-dimensional optimization problem.

## 5.3.2. Results and Discussion

For the high-dimensional data problem, the *mut+xor* proposal converged the fastest towards the optimal solution in the search space (Figure 5). In particular, the minimum error on the training set was already found after 100,000 iterations, and a tolerance threshold of 0.05 had the best trade-off between the Markov chain error and the likelihood approximation bias.

With respect to the error within the entire population (Figure 6), *dde-mc* converged the fastest, although its performance was on par with *ind-samp*. In general, the drop in the convergence rate of the entire population could be explained by the high dimensionality of the problem, i.e., the higher the dimensionality, the more time every chain needs to explore the search space. This observation was confirmed by computing the test error via utilizing all the population members in a majority-voting setting. In particular, the test error based on the ensemble approach was similar across all three proposals, yet the minimum error (i.e., for the single best model) was better for *dde-mc* and *mut+xor* compared to *ind-samp* (Table 2). This result suggests that DE-inspired proposals offer an added advantage: faster convergence towards a locally optimal solution.

**Figure 5.** A comparison of the considered proposals using the minimum training error on MNIST. The mean minimum error across five cross-evaluations is plotted, with the shaded area corresponding to the standard error. Tolerance is set to *ε* ∼ *Exp*(0.05), with the prior and the Metropolis ratio as described in (14) and (15). The following equations describe the proposals: *ind-samp* in (5), *dde-mc* in (10), and *mut+xor* as in [19].

**Figure 6.** A comparison of the considered proposals using the average training error on MNIST. The mean population error across five cross-evaluations is plotted, with the shaded area corresponding to the standard error. Tolerance is set to *ε* ∼ *Exp*(0.05), with the prior and the Metropolis ratio as described in (14) and (15). The following equations describe the proposals: *ind-samp* in (5), *dde-mc* in (10), and *mut+xor* as in [19].

**Table 2.** Test error of BinNN on MNIST.


#### *5.4. Neural Architecture Search*

## 5.4.1. Implementation Details

In the last experiment, we aimed at investigating whether the proposed approach is applicable to efficient neural architecture search. In particular, we made use of the NAS-Bench-101 dataset, the first public architecture dataset for NAS research [20]. The dataset is represented as a table that maps neural architectures to their training and evaluation metrics, and as such, it represents an efficient solution for querying different neural topologies. Each topology is captured by a directed acyclic graph represented by an adjacency matrix. The number of vertices was set to seven, while the maximum number of edges was nine. Apart from these restrictions, we limited the search space by constraining the possible operations for each vertex. Consequently, the simulator was captured by querying the dataset, while the distance metric was simply the validation error. The prior distribution was kept the same as for the previous experiment.

Every experiment was run for at least 120,000 iterations, with five cross-evaluations. To find the optimal performance, the following tolerance threshold values were investigated: *ε* ∈ {0.01, 0.1, 0.2, 0.3}. As we are approaching the problem as an optimization task, the aim is to find a chain with the lowest test error, rather than covering the entire distribution. Therefore, to evaluate the performance, we plot the minimum error obtained throughout the training process, as well as the lowest test error obtained by the final population.

#### 5.4.2. Results and Discussion

*dde-mc* identified the best solution the fastest with *ε* ∼ *Exp*(0.2) (Figure 7). The corresponding test error is reported in Table 3, and it follows the same pattern, namely that *dde-mc* is superior. Interestingly, here, the *mut+xor* proposal performed almost on par with the *ind-samp* proposal for the first 10,000 iterations, and then both methods converged to almost the same result. Our proposed Markov kernel again obtained the best result, and it was also the fastest.

**Figure 7.** A comparison of the considered proposals using the minimum training error on NAS-Bench-101. The mean minimum error with its corresponding standard error (shaded area) across five cross-evaluations is plotted. Tolerance is set to *ε* ∼ *Exp*(0.2). The prior distribution is as described in (14), with the corresponding Metropolis ratio (15). The following equations describe the proposals: *ind-samp* in (5), *dde-mc* in (10), and *mut+xor* as in [19].


**Table 3.** Test error on NAS-Bench-101.

#### **6. Conclusions**

In this paper, we note that there is a gap in the available methods for likelihood-free inference on discrete problems. We propose to utilize ideas known from evolutionary computing similarly to [26], in order to formulate a new Markov kernel, *dde-mc*, for a population-based MCMC-ABC algorithm. The obtained results suggest that the newly designed proposal is a promising and effective solution for intractable problems in a discrete space.

Furthermore, Markov kernels based on differential evolution are also effective for traversing a discrete search space. Nonetheless, great attention has to be paid to the choice of the tolerance threshold for MCMC-ABC methods. In other words, if the tolerance is set too high, the performance of the DE-based proposals drops to that of an independent sampler, i.e., the error of the Markov chain is high. For high-dimensional problems, the proposed kernel seems to be the most promising; however, its population error becomes similar to that of *ind-samp*. This is explained by the fact that in high dimensions, it takes more time for the entire population to converge.

In conclusion, we would like to highlight that the present work offers new research directions:


**Author Contributions:** Conceptualization, J.M.T.; methodology, I.A.A. and J.M.T.; software, I.A.A.; validation, I.A.A.; formal analysis, I.A.A. and J.M.T.; investigation, I.A.A. and J.M.T.; writing original draft preparation, I.A.A. and J.M.T.; writing—review and editing, I.A.A. and J.M.T.; visualization, I.A.A.; supervision, J.M.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The code used in this paper is available at: https://github.com/ IlzeAmandaA/ABCdiscrete (accessed on 5 March 2021).

**Acknowledgments:** We would like to thank [37] for the usage of the Distributed ASCI Supercomputer (DAS).

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

## **Abbreviations**


## **Appendix A. Determination of ε**

The choice of *ε* defines which data points are going to be accepted; as such, it implicitly models the likelihood. Setting the value too high will result in a biased estimate; however, it will improve the Monte Carlo performance, as more samples are utilized per unit of time. Hence, as [4] has already stated: "the goal is to find a good balance between the bias and the Monte Carlo error".

## *Appendix A.1. Fixed ε*

The first group of tolerance selection methods are all based on a fixed value. The possible approaches are summarized as follows:


Nonetheless, setting *ε* to a fixed value hinders convergence, as it is clearly a suboptimal approach due to its static nature. Ideally, we want to promote exploration at the beginning of the algorithm and subsequently move towards exploitation, hence alluding to the second group of tolerance selection methods: adaptive *ε*.

## *Appendix A.2. Adaptive ε*

In general, the research on adaptive tolerance methods for MCMC-ABC is very limited as traditionally, adaptive tolerance is seen as part of SMC-ABC. In the current literature, two adaptive tolerance methods for MCMC-ABC are mentioned:


In order to establish a clear baseline for MCMC-ABC in a discrete space, we decided to implement both fixed and adaptive *ε*. Such an approach allows us to evaluate the effect of an adaptive *ε* in comparison to a fixed *ε* in a discrete space, as well as to compare how well our observations are in line with those drawn in a continuous space.

## **References**


## *Article* **Variationally Inferred Sampling through a Refined Bound**

**Víctor Gallego 1,2,\* and David Ríos Insua 1,3**


**Abstract:** In this work, a framework to boost the efficiency of Bayesian inference in probabilistic models is introduced by embedding a Markov chain sampler within a variational posterior approximation. We call this framework "refined variational approximation". Its strengths are its ease of implementation and the automatic tuning of sampler parameters, leading to a faster mixing time through automatic differentiation. Several strategies to approximate the evidence lower bound (ELBO) computation are also introduced. Its efficient performance is showcased experimentally using state-space models for time-series data, a variational autoencoder for density estimation, and a conditional variational autoencoder as a deep Bayes classifier.

**Keywords:** variational inference; MCMC; stochastic gradients; neural networks

## **1. Introduction**

Bayesian inference and prediction in large, complex models, such as deep neural networks or stochastic processes, remains an elusive problem [1–3]. Variational approximations (e.g., automatic differentiation variational inference (ADVI) [4]) tend to be biased and underestimate uncertainty [5]. On the other hand, depending on the target distribution, Markov chain Monte Carlo (MCMC) [6] methods, such as Hamiltonian Monte Carlo (HMC) [7], tend to be exceedingly slow [8] in large-scale settings with large numbers of data points and/or parameters. For this reason, in recent years, there has been increasing interest in developing more efficient posterior approximations [9–11] and inference techniques that aim to be as general and flexible as possible so that they can be easily used with any probabilistic model [12,13].

It is well known that the performance of a sampling method depends heavily on the parameterization used [14]. This work proposes a framework to automatically tune the parameters of an MCMC sampler with the aim of adapting the shape of the posterior, thus boosting the Bayesian inference efficiency. We deal with the case in which the latent variables or parameters are continuous. Our framework can also be regarded as a principled way to enhance the flexibility of a variational posterior approximation in search of an optimally tuned MCMC sampler; thus the proposed name of our framework is the variationally inferred sampler (VIS).

The idea of preconditioning the posterior distribution to speed up the mixing time of an MCMC sampler has been explored recently in [15,16], where a parameterization was learned before sampling via HMC. Both papers extend seminal work in [17] by learning an efficient and expressive deep, non-linear transformation instead of a polynomial regression. However, they do not account for tuning the parameters of the sampler, as introduced in Section 3, where a fully end-to-end differentiable sampling scheme is proposed.

The work presented in [18] introduced a general framework for constructing more flexible variational distributions, called normalizing flows. These transformations are one of the main techniques used to improve the flexibility of current variational inference (VI) approaches and have recently pervaded the approximate Bayesian inference literature with

**Citation:** Gallego, V.; Ríos Insua, D. Variationally Inferred Sampling through a Refined Bound. *Entropy* **2021**, *23*, 123. https://doi.org/ 10.3390/e23010123

Received: 24 December 2020 Accepted: 13 January 2021 Published: 19 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional clai-ms in published maps and institutio-nal affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

developments such as continuous-time normalizing flows [19] (which extend an initial simple variational posterior with a discretization of Langevin dynamics) or householder flow for mixtures of Gaussian distributions [20]. However, they require a generative adversarial network (GAN) [21] to learn the posterior, which can be unstable in high-dimensional spaces. We overcome this problem with our novel formulation; moreover, our framework is also compatible with different optimizers, rather than only those derived from Langevin dynamics [22]. Other recent proposals create more flexible variational posteriors based on implicit approaches typically requiring a GAN, as presented in [23] and including unbiased implicit variational inference (UIVI) [24] or semi-implicit variational inference (SIVI) [25]. Our variational approximation is also implicit but uses a sampling algorithm to drive the evolution of the density, combined with a Dirac delta approximation to derive an efficient variational approximation, as reported through extensive experiments in Section 5.

Closely related to our framework is the work presented in [26], where a variational autoencoder (VAE) is learned using HMC. We use a similar compound distribution as the variational approximation, yet our approach allows any stochastic gradient MCMC to be embedded, as well as facilitating the tuning of sampler parameters via gradient descent. Our work also relates to the recent idea of sampler amortization [27]. A common problem with these approaches is that they incur an additional error, the amortization gap [28], which we alleviate by evolving a set of particles through a stochastic process in the latent space after learning a good initial distribution, meaning that the initial approximation bias can be significantly reduced. A recent related article was presented in [29], which also defined a compound distribution. However, our focus is on efficient approximation using the reverse KL divergence, which allows sampler parameters to be tuned and achieves superior results. Apart from optimizing this kind of divergence, the main point is that we can compute the gradients of sampler parameters (Section 3.3), whereas in [29] the authors only consider a parameterless sampler: thus, our framework allows for greater flexibility, helping the user to tune sampler hyperparameters. In the Coupled Variational Bayes (CVB) [30] approach, optimization is in the dual space, whereas we optimize the standard evidence lower bound (ELBO). Note that even if the optimization were exact, the solutions would coincide, and it is not yet clear what happens in the truncated optimization case, other than performing empirical experiments on given datasets. We thus feel that there is room for implicit methods that perform optimization in the primal space (besides this, they are easier to implement). Moreover, the previous dual optimization approach requires the use of an additional neural network (see the CVB paper or [31]).
This adds a large number of parameters and requires another architecture decision. With VIS, we do not need to introduce an auxiliary network, since we adopt a "non-parametric" approach by back-propagating through several iterations of SGLD instead. Moreover, the lack of an auxiliary network simplifies the design choices.

Thus, our contributions include a flexible and consistent variational approximation to the posterior, embedding an initial variational approximation within a stochastic process; an analysis of its key properties; the provision of several strategies for ELBO optimization using the previous approximation; and finally, an illustration of its power through relevant complex examples.

#### **2. Background**

Consider a probabilistic model *p*(*x*|*z*) and a prior distribution *p*(*z*), where *x* denotes the observations and $z \in \mathbb{R}^d$ the unobserved latent variables or parameters, depending on the context. Whenever necessary for disambiguation purposes, we shall distinguish between *z* for latent variables and *θ* for parameters. Our interest is in performing inference regarding the unobserved *z* by approximating its posterior distribution

$$p(z|\mathbf{x}) = \frac{p(z)p(\mathbf{x}|z)}{\int p(z)p(\mathbf{x}|z)dz} = \frac{p(\mathbf{x},z)}{p(\mathbf{x})}.$$

The integral assessing the evidence $p(\mathbf{x}) = \int p(z)p(\mathbf{x}|z)dz$ is typically intractable. Thus, several techniques have been proposed to perform approximate posterior inference [3].

#### *2.1. Inference as Optimization*

Variational inference (VI) [4] tackles the problem of approximating the posterior *p*(*z*|*x*) with a tractable parameterized distribution *qφ*(*z*|*x*). The goal is to find the parameters *φ* so that the variational distribution *qφ*(*z*|*x*) (also referred to as variational guide or variational approximation) can be as close as possible to the actual posterior. Closeness is typically measured through the Kullback–Leibler divergence *KL*(*qφ*||*p*), reformulated into the ELBO as follows:

$$\text{ELBO}(q) = \mathbb{E}\_{q\_{\phi}(z|x)} \left[ \log p(x, z) - \log q\_{\phi}(z|x) \right].\tag{1}$$

This is the objective to be optimized, typically through stochastic gradient descent techniques. To enhance flexibility, a standard choice for *qφ*(*z*|*x*) is a Gaussian distribution N (*μφ*(*x*), *σφ*(*x*)), with the mean and covariance matrix defined through a deep, non-linear model conditioned on observation *x*.
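The ELBO in Equation (1) is typically estimated by Monte Carlo with samples from the guide; a minimal sketch for a diagonal Gaussian *q* (the function name and fixed seed are our choices):

```python
import numpy as np

def elbo_estimate(log_joint, mu, sigma, n_samples=1000, rng=None):
    """Monte Carlo estimate of the ELBO in Equation (1) for a diagonal
    Gaussian guide q(z|x) = N(mu, diag(sigma^2)): average over samples
    z ~ q of log p(x, z) - log q(z|x)."""
    rng = np.random.default_rng(0) if rng is None else rng
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))  # reparameterized draws
    log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2), axis=1)
    return float(np.mean(log_joint(z) - log_q))
```

When the guide matches the (normalized) target exactly, the estimate equals log *p*(*x*), i.e., zero for a normalized density.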

#### *2.2. Inference as Sampling*

HMC [7] is an effective sampling method for models whose probability is pointwise computable and differentiable. When scalability is an issue, as proposed by the authors in [32], a formulation of a continuous-time Markov process that converges to the target distribution *p*(*z*|*x*) can be used, which is based on the Euler–Maruyama discretization of Langevin dynamics

$$z\_{t+1} \gets z\_t + \eta\_t \nabla\_z \log p(\mathbf{x}, z\_t) + \mathcal{N}(0, 2\eta\_t I),\tag{2}$$

where $\eta_t$ is the step size at time period *t*, and *I* is the identity matrix. The required gradient $\nabla_z \log p(\mathbf{x}, z_t)$ can be estimated using mini-batches of data. Several extensions of the original Langevin sampler have been proposed to increase its mixing speed, such as in [33–36]. We refer to these extensions as stochastic gradient MCMC samplers (SG-MCMC) [37].
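One step of the discretized Langevin dynamics in Equation (2) can be sketched as follows; the gradient is assumed to be supplied as a callable (e.g., a mini-batch estimate), and the function name is ours:

```python
import numpy as np

def sgld_step(z, grad_log_p, eta, rng=None):
    """One Euler-Maruyama step of Langevin dynamics, Equation (2):
    z <- z + eta * grad_z log p(x, z) + N(0, 2*eta*I)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, np.sqrt(2.0 * eta), size=np.shape(z))  # variance 2*eta
    return z + eta * grad_log_p(z) + noise
```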

#### **3. A Variationally Inferred Sampling Framework**

In standard VI, the variational approximation is analytically tractable and typically chosen as a factorized Gaussian, as mentioned above. Note, however, that other distributions can be adopted as long as they are easily sampled and their log-density and entropy values can be evaluated. In the rest of this paper, we focus on the Gaussian case, as it is the usual choice in the Bayesian deep learning community. Stemming from this variational approximation, we introduce several elements to construct the VIS.

Our first major modification of standard VI proposes the use of a more flexible distribution, approximating the posterior by embedding a sampler through

$$q\_{\phi,\eta}(z|\mathbf{x}) = \int Q\_{\eta,T}(z|z\_0)\, q\_{0,\phi}(z\_0|\mathbf{x})\, dz\_0, \tag{3}$$

where *q*0,*φ*(*z*|*x*) is the initial and tractable density *qφ*(*z*|*x*) (i.e., the starting state for the sampler). We designate this as the refined variational approximation. The conditional distribution *Qη*,*T*(*z*|*z*0) refers to a stochastic process parameterized by *η* and used to evolve the original density *q*0,*φ*(*z*|*x*) for *T* periods, so as to achieve greater flexibility. Specific forms for *Qη*,*T*(*z*|*z*0) are described in Section 3.1. Observe that when *T* = 0, no refinement steps are performed and the refined variational approximation coincides with the original one; on the other hand, as *T* increases, the approximation becomes closer to the exact posterior, assuming that *Qη*,*T* is a valid MCMC sampler in the sense of [37].

We next maximize a refined ELBO objective, replacing in Equation (1) the original *qφ* by *qφ*,*η*:

$$\text{ELBO}(q\_{\phi,\eta}) = \mathbb{E}\_{q\_{\phi,\eta}(z|\mathbf{x})} \left[ \log p(\mathbf{x}, z) - \log q\_{\phi,\eta}(z|\mathbf{x}) \right]. \tag{4}$$

This is done to optimize the divergence *KL*(*qφ*,*η*(*z*|*x*)||*p*(*z*|*x*)). The first term of Equation (4) requires only being able to sample from *qφ*,*η*(*z*|*x*); however, the second term, the entropy $-\mathbb{E}_{q_{\phi,\eta}(z|\mathbf{x})}\left[\log q_{\phi,\eta}(z|\mathbf{x})\right]$, also requires the evaluation of the evolving, implicit density. Section 3.2 describes efficient methods to approximate this evaluation. As a consequence, performing variational inference with the refined variational approximation can be regarded as using the original variational guide while optimizing an alternative, tighter ELBO, as Section 4.2 shows.

The above facilitates a framework for learning the sampler parameters *φ*, *η* using gradient-based optimization, with the help of automatic differentiation [38]. For this, the approach operates in two phases. First, in a refinement phase, the sampler parameters are learned in an optimization loop that maximizes the ELBO with the new posterior. After several iterations, the second phase, focused on inference, starts. We allow the tuned sampler to run for sufficient iterations, as in SG-MCMC samplers. This is expressed algorithmically as follows.

Refinement phase:

Repeat the following until convergence:


Once good sampler parameters *φ*∗, *η*∗ are learned,


Since the sampler can be run for a different number of steps depending on the phase, we use the following notation when necessary: VIS-*X*-*Y* denotes *T* = *X* iterations during the refining phase and *T* = *Y* iterations during the inference phase.

Let us specify now the key elements.

## *3.1. The Sampler Qη,T(z|z0)*

As the latent variables *z* are continuous, we evolve the original density *q*0,*φ*(*z*|*x*) through a stochastic diffusion process [39]. To make it tractable, we discretize the Langevin dynamics using the Euler–Maruyama scheme, arriving at the stochastic gradient Langevin dynamics (SGLD) sampler (2). We then follow the process *Qη*,*T*(*z*|*z*0), which represents *T* iterations of the MCMC sampler.

As an example, for the SGLD sampler, $z_t = z_{t-1} + \eta \nabla \log p(\mathbf{x}, z_{t-1}) + \xi_t$, where *t* runs from 1 to *T*. In this case, the only parameter is the learning rate *η*, and the noise is $\xi_t \sim \mathcal{N}(0, 2\eta I)$. The initial variational distribution *q*0,*φ*(*z*|*x*) is a Gaussian parameterized by a deep neural network (NN). Then, after *T* iterations of the sampler *Q*, parameterized by *η*, we arrive at *qφ*,*η*.

An alternative arises by ignoring the noise *ξ* [22], thus refining the initial variational approximation using only the stochastic gradient descent (SGD). Moreover, we can use Stein variational gradient descent (SVGD) [40] or a stochastic version [36] to apply repulsion between particles and promote more extensive explorations of the latent space.
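The process *Qη*,*T* described above, with the noise-free SGD variant as an option, can be sketched as follows (the function name and loop structure are our assumptions):

```python
import numpy as np

def refine(z0, grad_log_p, eta, T, add_noise=True, rng=None):
    """Sketch of Q_{eta,T}: evolve an initial sample z0 for T steps of SGLD,
    or of plain SGD when add_noise=False (the noise-free variant above)."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = np.asarray(z0, dtype=float)
    for _ in range(T):
        z = z + eta * grad_log_p(z)          # deterministic drift
        if add_noise:                        # Langevin noise, variance 2*eta
            z = z + rng.normal(0.0, np.sqrt(2.0 * eta), size=z.shape)
    return z
```

With `add_noise=False`, the refinement reduces to gradient ascent on the log-joint, so the particles drift toward a posterior mode.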

#### *3.2. Approximating the Entropy Term*

We propose four approaches for the ELBO optimization which take structural advantage of the refined variational approximation.

#### 3.2.1. Particle Approximation (VIS-P)

In this approach, we approximate the posterior *qφ*,*η*(*z*|*x*) by a mixture of Dirac deltas (i.e., we approximate it with a finite set of particles), by sampling $z^{(1)}, \dots, z^{(M)} \sim q_{\phi,\eta}(z|\mathbf{x})$ and setting

$$q\_{\boldsymbol{\Phi}, \boldsymbol{\eta}}(z|\boldsymbol{x}) = \frac{1}{M} \sum\_{m=1}^{M} \delta(z - z^{(m)}).$$

In this approximation, the entropy term in (4) is set to zero. Consequently, the sample converges to the maximum a posteriori (MAP) solution. This may be undesirable when training generative models, as the generated samples usually have little diversity. Thus, in subsequent computations, we add to the refined ELBO the entropy of the initial variational approximation, $-\mathbb{E}_{q_{0,\phi}(z|\mathbf{x})}\left[\log q_{0,\phi}(z|\mathbf{x})\right]$, which serves as a regularizer alleviating the previous problem. When using SGD as the sampler, the resulting ELBO is tighter than that without refinement, as shown in Section 4.2.

#### 3.2.2. MC Approximation (VIS-MC)

Instead of performing the full marginalization in Equation (3), we approximate it with *q*_{φ,η}(*z_T*, ... , *z*_0|*x*) = ∏_{t=1}^{T} *q_η*(*z_t*|*z*_{t−1}) *q*_{0,φ}(*z*_0|*x*); i.e., we consider the joint distribution for the refinement. However, in inference, we only keep the *z_T* values. The entropy of each factor in this approximation is straightforward to compute. For example, for the SGLD case, we have

$$z\_t = z\_{t-1} + \eta \nabla \log p(\mathbf{x}, z\_{t-1}) + \mathcal{N}(0, 2\eta I), \qquad t = 1, \dots, T.$$

This approximation tracks a better estimate of the entropy than VIS-P, as we are not completely discarding it; rather, for each *t*, we marginalize out the corresponding *zt* using one sample.

#### 3.2.3. Gaussian Approximation (VIS-G)

This approach is targeted at settings in which it could be helpful to have a posterior approximation that places density over the whole *z* space. In the specific case of using SGD as the inner kernel, we have

$$\begin{aligned} z\_0 &\sim q\_{0,\Phi}(z\_0|\mathbf{x}) = \mathcal{N}(z\_0|\mu\_{\Phi}(\mathbf{x}), \sigma\_{\Phi}(\mathbf{x})) \\ z\_t &= z\_{t-1} + \eta \nabla \log p(\mathbf{x}, z\_{t-1}), \qquad t = 1, \dots, T. \end{aligned}$$

Treating the gradient terms as deterministic point values, the refined variational approximation can be computed as *q*_{φ,η}(*z*|*x*) = N(*z*|*z_T*, *σ_φ*(*x*)). Observe that there is an implicit dependence on *η* through *z_T*.
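A minimal sketch of VIS-G, assuming a toy Gaussian target and illustrative guide parameters: sample *z*_0 from the initial Gaussian, apply *T* noiseless SGD steps, and place a Gaussian at the refined point.

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_log_p(z):
    return -z                           # toy target: standard Gaussian

# Initial guide parameters (in the paper these come from encoder networks
# mu_phi(x), sigma_phi(x); here they are fixed for illustration).
mu0, sigma0 = np.array([2.0]), np.array([1.5])
eta, T = 0.1, 20

z0 = mu0 + sigma0 * rng.normal(size=1)  # draw from the initial Gaussian guide
z = z0.copy()
for _ in range(T):                      # T *noiseless* SGD steps
    z = z + eta * grad_log_p(z)

# q_{phi,eta}(z|x) = N(z | z_T, sigma_phi(x)): a full-support Gaussian
# centered at the refined point, with the initial guide's scale.
refined_mean, refined_scale = z, sigma0
```

For this linear toy gradient, each step contracts *z* by a factor (1 − *η*), which makes the *η*-dependence of *z_T* explicit.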

#### 3.2.4. Fokker–Planck Approximation (VIS-FP)

Using the Fokker–Planck equation, we derive a deterministic sampler via iterations of the form

$$z\_t = z\_{t-1} + \eta (\nabla \log p(\mathbf{x}, z\_{t-1}) - \nabla \log q\_t(z\_{t-1})), \qquad t = 1, \dots, T.$$

Then, we approximate the density *qφ*,*η*(*z*|*x*) using a mixture of Dirac deltas. A detailed derivation of this approximation is given in Appendix A.

#### *3.3. Back-Propagating through the Sampler*

In standard VI, the variational approximation *q*(*z*|*x*; *φ*) is parameterized by *φ*. The parameters are learned employing SGD, or variants such as Adam [41], using the gradient ∇_φELBO(*q*). We have shown how to embed a sampler inside the variational guide; it is therefore also possible to compute a gradient of the objective with respect to the sampler parameters *η* (see Section 3.1). For instance, we can compute a gradient ∇_ηELBO(*q*) with respect to the learning rate *η* of the SGLD or SGD process to search for an optimal step size at every VI iteration. This is an additional step beyond using the gradient ∇_φELBO(*q*), which is used to learn a good initial sampling distribution.
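Although the paper differentiates through the sampler with automatic differentiation, the idea can be illustrated with a finite-difference stand-in for ∇_ηELBO on a toy quadratic target (all names here are illustrative).

```python
import numpy as np

def log_p(z):
    return -0.5 * z**2                  # toy 1-D log-joint, up to a constant

def grad_log_p(z):
    return -z

def refined_objective(eta, z0):
    # One SGD refinement step; the objective depends on eta only through z_T.
    zT = z0 + eta * grad_log_p(z0)
    return log_p(zT).mean()

z0 = np.array([3.0, -2.0, 4.0])
eta, lr, eps = 0.05, 0.02, 1e-4

# Finite-difference stand-in for autodiff's gradient of the objective wrt eta:
for _ in range(200):
    g = (refined_objective(eta + eps, z0) - refined_objective(eta - eps, z0)) / (2 * eps)
    eta += lr * g                       # gradient *ascent* on the refined objective

# For this quadratic target, a single step with eta = 1 lands exactly on the
# mode, so the learned step size approaches 1.
```

In the actual framework, the same update on *η* is obtained by back-propagating through the sampler iterations rather than by finite differences.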

#### **4. Analysis of VIS**

Below, we highlight key properties of the proposed framework.

#### *4.1. Consistency*

The VIS framework is geared towards SG-MCMC samplers, where we can compute gradients with respect to sampler hyperparameters to reduce mixing time (a common major drawback of MCMC [42]). After back-propagating for a few iterations through the SG-MCMC sampler and learning a good initial distribution, one can resort to the learned sampler in a second phase, so standard consistency results for SG-MCMC apply as *T* → ∞ [43].

#### *4.2. Refinement of ELBO*

Note that, for a refined guide using the VIS-P approximation with *M* = 1 sample, the refined objective function can be written as

$$\mathbb{E}\_{q(z\_0|\mathbf{x})} \left[ \log p(\mathbf{x}, z\_0 + \eta \nabla \log p(\mathbf{x}, z\_0)) - \log q(z\_0|\mathbf{x}) \right],$$

noting that *z* = *z*_0 + *η*∇ log *p*(*x*, *z*_0) when using SGD for *T* = 1 iteration. This is equivalent to the refined ELBO in (4). Since we perturb the latent variables in the steepest-ascent direction, one can easily show that, for a moderate *η*, the previous bound is tighter than E_{*q*(*z*_0|*x*)}[log *p*(*x*, *z*_0) − log *q*(*z*_0|*x*)], the bound for the original variational guide *q*(*z*_0|*x*). This reformulation of the ELBO is also convenient since it provides a clear way of implementing our refined variational inference framework in any probabilistic programming language (PPL) supporting algorithmic differentiation.
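This claim is easy to check numerically; the following sketch, assuming a toy Gaussian log-joint and a misplaced Gaussian guide, compares the log *p* terms of the two bounds (the −log *q* term is common to both).

```python
import numpy as np

rng = np.random.default_rng(3)

def log_p(z):
    return -0.5 * np.sum(z**2, axis=-1)   # toy log p(x, z), up to a constant

def grad_log_p(z):
    return -z

# Draws from a misplaced initial guide q(z_0 | x).
z0 = 2.0 + rng.normal(size=(500, 2))
eta = 0.1

refined = log_p(z0 + eta * grad_log_p(z0)).mean()   # T = 1 SGD step
plain = log_p(z0).mean()
# The -log q(z_0|x) term is identical in both bounds, so comparing the
# log p terms suffices: the refined bound is the larger (tighter) one.
```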

Similarly, for the VIS-FP case, the deterministic flow follows the same trajectories as SGLD: based on standard results on MCMC samplers [44], we have

$$KL(q\_{\boldsymbol{\phi}, \boldsymbol{\eta}}(z|\boldsymbol{x})||p(z|\boldsymbol{x})) \leq KL(q\_{0, \boldsymbol{\phi}}(z|\boldsymbol{x})||p(z|\boldsymbol{x})).$$

A similar reasoning applies to the VIS-MC approximation; however, it does not hold for VIS-G since it assumes that the posterior is Gaussian.

#### *4.3. Taylor Expansion*

This analysis applies only to VIS-P and VIS-FP. As stated in Section 4.2, within the VIS framework, optimizing the ELBO amounts to performing max_*z* log *p*(*x*, *z* + Δ*z*), where Δ*z* is one iteration of the sampler; i.e., Δ*z* = *η*∇ log *p*(*x*, *z*) in the SGD case (VIS-P), or Δ*z* = *η*∇(log *p*(*x*, *z*) − log *q*(*z*)) in the VIS-FP case. For notational clarity, we consider the case *T* = 1, although a similar analysis follows straightforwardly when more refinement steps are performed.

Consider a first-order Taylor expansion of the refined objective

$$
\log p(\mathbf{x}, z + \Delta z) \approx \log p(\mathbf{x}, z) + (\Delta z)^{\mathsf{T}} \nabla \log p(\mathbf{x}, z) \,.
$$

Taking gradients with respect to the latent variables *z*, we arrive at

$$
\nabla\_z \log p(\mathbf{x}, z + \Delta z) \approx \nabla\_z \log p(\mathbf{x}, z) + \eta \, \nabla\_z^2 \log p(\mathbf{x}, z) \, \nabla\_z \log p(\mathbf{x}, z),
$$

where we have not propagated the gradient through the Δ*z* term (i.e., we treated it as a constant for simplicity). The refined gradient can thus be seen as the original gradient plus a second-order correction. Instead of being modulated by a constant learning rate, this correction is adapted by the chosen sampler. The experiments in Section 5.4 show that this benefits the optimization, which typically takes fewer iterations than the original variant to achieve lower losses.

By further taking gradients through the Δ*z* term, we may tune the sampler parameters such as the learning rate as presented in Section 3.3. Consequently, the next subsection describes two differentiation modes.

#### *4.4. Two Automatic Differentiation Modes for Refined ELBO Optimization*

First, recall that the refined ELBO objective can be rewritten (in what we term Full AD) as

$$\mathbb{E}\_q\left[\log p(\mathbf{x}, z + \Delta z) - \log q(z + \Delta z | \mathbf{x})\right].\tag{5}$$

We now define a stop-gradient operator ⊥ (corresponding to detach in PyTorch or stop\_gradient in TensorFlow) that sets the gradient of its operand to zero—i.e., ∇*x*⊥(*x*) = 0—whereas in the forward pass it acts as the identity function—that is, ⊥(*x*) = *x*. With this, a variant of the ELBO objective (which we term Fast AD) is

$$\mathbb{E}\_q[\log p(\mathbf{x}, z + \bot(\Delta z)) - \log q(z + \bot(\Delta z)|\mathbf{x})].\tag{6}$$

Full AD ELBO enables a gradient to be computed with respect to the sampler parameters inside Δ*z* at the cost of a slight increase in computational burden. On the other hand, the Fast AD variant may be useful in numerous scenarios, as illustrated in the experiments.
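The two modes differ only in whether gradients flow through Δ*z*. A minimal PyTorch sketch on a toy quadratic log-joint makes the difference concrete (illustrative code, not the released implementation).

```python
import torch

def log_p(z):
    return -0.5 * z.pow(2).sum()        # toy log p(x, z)

z = torch.tensor([1.5], requires_grad=True)
eta = 0.1

# One refinement step: Delta z = eta * grad log p(x, z), kept in the graph.
grad, = torch.autograd.grad(log_p(z), z, create_graph=True)
delta = eta * grad

full = log_p(z + delta)                 # Full AD: gradient flows through Delta z
fast = log_p(z + delta.detach())        # Fast AD: Delta z treated as a constant

g_full, = torch.autograd.grad(full, z, retain_graph=True)
g_fast, = torch.autograd.grad(fast, z)
# For this quadratic, g_full = -(1 - eta)^2 * z while g_fast = -(1 - eta) * z:
# Full AD picks up the extra (1 - eta) factor from differentiating Delta z.
```

Here `detach()` plays the role of ⊥; extending `delta` to depend on a learnable *η* is what allows Full AD to also tune the sampler's step size.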

#### Complexity

Since we need to back-propagate through *T* iterations of an SG-MCMC scheme, standard results from meta-learning and automatic differentiation [45] give a time complexity of O(*mT*) for our more intensive approach (Full AD), where *m* is the dimension of the hyperparameters (the learning rate of SG-MCMC and the latent dimension). Since in most use cases the hyperparameters lie in a low-dimensional space, the approach is scalable.

#### **5. Experiments**

The following experiments showcase the power of our approach and illustrate the impact of various parameters on its performance, guiding their choice in practice. We also present a comparison with standard VI and other recent variants, showing that the increased computational complexity of computing gradients through sampling steps is worth the gains in flexibility. Moreover, the proposed framework is compatible with other structured inference techniques, such as the sum–product algorithm, and also supports other tasks such as classification.

In the spirit of reproducible research, the code for VIS has been released at https://github.com/vicgalle/vis. The VIS framework is implemented in PyTorch [46], although we have also released a notebook for the first experiment using Jax to highlight the simple implementation of VIS. In any case, we emphasize that the approach facilitates rapid iteration over a large class of models.

#### *5.1. Funnel Density*

We first tested the framework on a synthetic yet complex target distribution. This experiment assessed whether VIS is suitable for modeling complex distributions. The target bi-dimensional density was defined through

$$\begin{aligned} z\_1 &\sim \mathcal{N}(0, 1.35) \\ z\_2 &\sim \mathcal{N}(0, \exp(z\_1)) .\end{aligned}$$

We adopted the usual diagonal Gaussian distribution as the variational approximation. For VIS, we used the VIS-P approximation and refined it for *T* = 1 step using SGLD. Figure 1 (top) shows the trajectories of the lower bound for up to 50 iterations of variational optimization with Adam: our refined version achieved a tighter bound. The bottom figures present contour curves of the learned variational approximations. Observe that the VIS variant was placed closer to the mean of the true distribution and was more dispersed than the original variational approximation, illustrating that the refinement step helps attain more flexible posterior approximations.

**Figure 1. Top**: Evolution of the negative evidence lower bound (ELBO) loss objective over 50 iterations. Darker lines depict means along different seeds (lighter lines). **Bottom left**: Contour curves (blue–turquoise) of the variational approximation with no refinement (*T* = 0) at iteration 30 (loss of 1.011). **Bottom right**: Contour curves (blue–turquoise) of refined variational approximation (*T* = 1) at iteration 30 (loss of 0.667). Green–yellow curves denote target density.

#### *5.2. State-Space Markov Models*

We tested our variational approximation on two state-space models: one for discrete data and another for continuous observations. These experiments also demonstrate that the framework is compatible with standard inference techniques such as the sum–product scheme from the Baum–Welch algorithm or the Kalman filter. In both models, we performed inference on their parameters *θ*. All the experiments in this subsection used the Fast AD version (Section 4.4), as it was not necessary to further tune the sampler parameters to obtain competitive results. Full model implementations can be found in Appendix B.1, based on funsor (https://github.com/pyro-ppl/funsor/), a PPL on top of the PyTorch autodiff framework.

Hidden Markov Model (HMM): The model equations are

$$p(x\_{1:\tau}, z\_{1:\tau}, \theta) = p(\theta) \prod\_{t=1}^{\tau} p(x\_t | z\_t, \theta\_{\text{em}}) p(z\_t | z\_{t-1}, \theta\_{\text{tr}}), \tag{7}$$

where each conditional is a categorical distribution over five classes. The prior is *p*(*θ*) = *p*(*θ*_em)*p*(*θ*_tr), based on two Dirichlet distributions that sample the observation and state transition probabilities, respectively.

Dynamic Linear Model (DLM): The model equations are as in (7), although the conditional distributions are now Gaussian and the parameters *θ* refer to the observation and transition variances.

For each model, we generated a synthetic dataset and used the refined variational approximation with *T* = 0, 1, 2. For the original variational approximation to the parameters *θ*, we used a Dirac delta. Performing VI with this approximation corresponded to MAP estimation using the Baum–Welch algorithm in the HMM case [47] and the Kalman filter in the DLM case [48], as we marginalized out the latent variables *z*1:*τ*. We used the VIS-P variant since it was sufficient to show performance gains in this case.
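The marginal likelihood used here comes from summing out the latent states with the forward (sum–product) pass. A minimal NumPy sketch of that computation on a hypothetical two-state example (not the funsor-based implementation used in the paper):

```python
import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    """Forward algorithm: log p(x_{1:tau}) with z_{1:tau} summed out.

    pi: initial state probs (K,); A: transitions (K, K), A[i, j] = p(z_t=j | z_{t-1}=i);
    B: emissions (K, V), B[i, v] = p(x_t=v | z_t=i).  All names illustrative.
    """
    alpha = pi * B[:, obs[0]]           # unnormalized filtering distribution
    log_norm = 0.0
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]
        c = alpha.sum()                 # rescale to avoid numerical underflow
        log_norm += np.log(c)
        alpha /= c
    return log_norm + np.log(alpha.sum())

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
ll = hmm_log_likelihood([0, 0, 1], pi, A, B)
```

Maximizing this marginal likelihood over the parameters, with the guide on *θ* a Dirac delta, is exactly the MAP estimation referred to above.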

Figure 2 shows the results. The first row reports the experiments for the HMM, the second row those for the DLM. All graphs report the evolution of the log-likelihood during inference; the first column plots it against the number of ELBO iterations, and the second column against clock time as the optimization takes place. They confirm that VIS (*T* > 0) achieved better results than standard VI (*T* = 0) for a comparable amount of time. Note also that there was not as much gain when changing from *T* = 1 to *T* = 2 as there was from *T* = 0 to *T* = 1, suggesting the need to carefully monitor this parameter. Finally, in the top-right graph, the curve for the case *T* = 0 is shorter as it requires less clock time.

**Figure 2.** Results of ELBO optimization for state-space models. **Top-left** (Hidden Markov Model (HMM)): Log-likelihood against the number of ELBO gradient iterations. **Top-right** (HMM): Log-likelihood against clock time. **Bottom-left** (Dynamic Linear Model (DLM)): Log-likelihood against the number of ELBO gradient iterations. **Bottom-right** (DLM): Log-likelihood against clock time.

#### 5.2.1. Prediction with an HMM

With the aim of assessing whether ELBO optimization helps attain better auxiliary scores, results on a prediction task are also reported. We generated a synthetic time series of alternating values of 0 and 1 for *τ* = 105 timesteps. We trained the previous HMM model on the first 100 points and report in Table 1 the accuracy of the predictive distribution *p*(*y_t*) averaged over the final five time-steps. We also report the predictive entropy, which helps assess the confidence of the model in its predictions [49]. To guarantee the same computational budget and a fair comparison, the model without refinement was run for 50 epochs (an epoch being a full pass over the training dataset), whereas the model with refinement was run for 20 epochs. The refined model achieved higher accuracy than its counterpart and, in addition, was more appropriately confident in its predictions.

**Table 1.** Prediction metrics for the HMM.


#### 5.2.2. Prediction with a DLM

We tested the VIS framework on the Mauna Loa monthly CO2 time-series data [50]. We used the first 10 years as a training set and tested over the next 2 years. We used a DLM composed of a local linear trend plus a seasonal block of periodicity 12. Data were standardized to zero mean and unit standard deviation. To guarantee the same computational budget, the model without refinement was run for 10 epochs, whereas the model with refinement was run for 4 epochs. Table 2 reports the mean absolute error (MAE) and predictive entropy. In addition, we computed the interval score [49], a strictly proper scoring rule. As can be seen, for similar clock times, the refined model not only achieved a lower MAE, but its predictive intervals were also narrower than those of the non-refined counterpart.

**Table 2.** Prediction metrics for the DLM.


#### *5.3. Variational Autoencoder*

The third batch of experiments showed that VIS is competitive with respect to other algorithms from the recent literature, including unbiased implicit variational inference (UIVI) [24], semi-implicit variational inference (SIVI) [25], variational contrastive divergence (VCD) [29], and the HMC variant from [26], and can outperform those approaches in similar experimental settings.

To this end, we tested the approach with a variational autoencoder (VAE) model [51]. The VAE defines a conditional distribution *p_θ*(*x*|*z*), generating an observation *x* from a latent variable *z* using parameters *θ*. For this task, our interest was in modeling the 28 × 28 image distributions underlying the MNIST [52] and fashion-MNIST [53] datasets. To perform inference (i.e., to learn the parameters *θ*), the VAE introduces a variational approximation *q_φ*(*z*|*x*). In the standard setting, this distribution is Gaussian; we instead used the refined variational approximation, comparing various values of *T*. We used the VIS-MC approximation (although we achieved similar results with VIS-G) with the Full AD variant given in Section 4.4.

For the experimental setup, we reproduced the setting in [24]. For *p_θ*(*x*|*z*), we used a factorized Bernoulli distribution parameterized by a two-layer feed-forward network with 200 units in each layer and ReLU activations, except for a final sigmoid activation. As the variational approximation *q_φ*(*z*|*x*), we used a Gaussian with mean and (diagonal) covariance matrix parameterized by two distinct neural networks with the same structure as above, except for a sigmoid activation for the mean and a softplus activation for the covariance matrix.

Results are reported in Table 3. To guarantee fair comparison, we trained the VIS-5-10 variant for 10 epochs, whereas all the other variants were trained for 15 epochs (fMNIST) or 20 epochs (MNIST), so that the VAE's performance was comparable to that reported in [24]. Although VIS was trained for fewer epochs, by increasing the number *T* of MCMC iterations, we dramatically improved the test log-likelihood. In terms of computational complexity, the average time per epoch using *T* = 5 was 10.46 s, whereas with no refinement (*T* = 0), the time was 6.10 s (which was the reason behind our decision to train the refined variant for fewer epochs): a moderate increase in computing time may be worth the dramatic increase in log-likelihood while not introducing new parameters into the model, except for the learning rate *η*.

**Table 3.** Test log-likelihood on binarized MNIST and fMNIST. Bold numbers indicate the best results. UIVI: unbiased implicit variational inference; SIVI: semi-implicit variational inference; VAE: variational autoencoder; VCD: variational contrastive divergence; HMC-DLGM: Hamiltonian Monte Carlo for Deep Latent Gaussian Models; VIS: variationally inferred sampler.


Finally, as a visual inspection of the reconstruction quality of the VAE trained with VIS, Figures 3 and 4, respectively, display 10 random samples from each dataset.

**Figure 3.** Top: original images from MNIST. Bottom: reconstructed images using VIS-5-10 at 10 epochs.

**Figure 4.** Top: original images from fMNIST. Bottom: reconstructed images using VIS-5-10 at 10 epochs.

#### *5.4. Variational Autoencoder as a Deep Bayes Classifier*

In the final experiments, we investigated whether VIS can deal with more general probabilistic graphical models and also perform well in other inference tasks such as classification. We explored the flexibility of the proposed scheme in a classification task in a high-dimensional setting with the MNIST dataset. More concretely, we extended the VAE model, conditioning it on a discrete variable *y* ∈ Y = {0, 1, ... , 9}, leading to a conditional VAE (cVAE). The cVAE defines a decoder distribution *p_θ*(*x*|*z*, *y*) on an input space *x* ∈ R^D, given a class label *y* ∈ Y, latent variables *z* ∈ R^d and parameters *θ*. Figure 5 depicts the corresponding probabilistic graphical model. Additional details regarding the model architecture and hyperparameters are given in Appendix B.

**Figure 5.** Probabilistic graphical model for the deep Bayes classifier.

To perform inference, a variational posterior was learned as an encoder *q_φ*(*z*|*x*, *y*), with prior *p*(*z*) = N(0, *I*). Leveraging the conditional structure on *y*, we used the generative model as a classifier via Bayes' rule,

$$p(y|\mathbf{x}) \propto p(y)p(\mathbf{x}|y) \approx p(y) \int p\_{\theta}(\mathbf{x}|z,y)q\_{\phi}(z|\mathbf{x},y)\,dz \approx \frac{1}{M} \sum\_{m=1}^{M} p\_{\theta}(\mathbf{x}|z^{(m)},y)p(y),\tag{8}$$

where we used *M* Monte Carlo samples *z*^{(*m*)} ∼ *q_φ*(*z*|*x*, *y*). In the experiments, we set *M* = 5. Given a test sample *x*, the label *y*ˆ with the highest probability *p*(*y*|*x*) is predicted.
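The Monte Carlo Bayes classifier in Equation (8) can be sketched with stand-in densities; the decoder and encoder below are hypothetical toy functions, not the trained networks, and the prior over classes is taken uniform so it drops out of the argmax.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_decoder(x, z, y):
    # Hypothetical p_theta(x | z, y): Gaussian log-density (up to a constant)
    # whose mean shifts with both the latent z and the class label y.
    return -0.5 * np.sum((x - (z + y)) ** 2, axis=-1)

def sample_encoder(x, M):
    # Crude amortized guide: z concentrates near x / 2 regardless of y.
    return 0.5 * x + 0.1 * rng.normal(size=(M, x.shape[-1]))

def predict(x, classes=(0.0, 1.0, 2.0), M=5):
    # Eq. (8): score each class by (1/M) sum_m p_theta(x | z_m, y) with
    # z_m ~ q_phi(z | x, y); a uniform p(y) is omitted from the argmax.
    scores = []
    for y in classes:
        z = sample_encoder(x, M)
        scores.append(np.mean(np.exp(log_decoder(x, z, y))))
    return classes[int(np.argmax(scores))]

y_hat = predict(np.array([2.0, 2.0]))   # z ~ [1, 1], so class y = 1 fits best
```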

For comparison, we performed several experiments changing *T* in the transition distribution *Q*_{η,T} of the refined variational approximation. The results are given in Table 4, which reports the test accuracy at the end of the refinement phase. Note that we are comparing different values of *T* depending on their use in the refinement or inference phases (in the latter, the model and variational parameters were kept frozen). The model with *T_ref* = 5 was trained for 10 epochs, whereas the other settings were trained for 15 epochs, to give all settings a similar training time. Results were averaged over three runs with different random seeds. In all settings, we used the VIS-MC approximation for the entropy term. From the results, it is clear that using the refined variational approximation (the cases with *T* > 0) is crucial to achieving higher accuracy. Learning a good initial distribution and inner learning rate through the gradients ∇_φELBO(*q*) and ∇_ηELBO(*q*) has a highly positive impact on the accuracy obtained.

On a final note, we have not included the case of only using an SGD or an SGLD sampler (i.e., without learning an initial distribution *q*0,*φ*(*z*|*x*)) since the results were much worse than those in Table 4 for a comparable computational budget. This strongly suggests that, for inference in high-dimensional, continuous latent spaces, learning a good initial distribution through VIS may accelerate mixing time dramatically.

**Table 4.** Results on the digit classification task using a deep Bayes classifier.

| *T_ref* | *T_inf* | **Acc. (Test)** |
|---------|---------|-----------------|
| 0 | 0 | 96.5 ± 0.5% |
| 0 | 10 | 97.7 ± 0.7% |
| 5 | 10 | **99.8** ± **0.2**% |

#### **6. Conclusions**

In this work, we have proposed a flexible and efficient framework to perform large-scale Bayesian inference in probabilistic models. The scheme benefits from useful properties and can be employed to efficiently perform inference with a wide class of models defined through continuous, high-dimensional distributions, such as state-space time series models, variational autoencoders, and variants such as the conditioned VAE for classification tasks.

The framework can be seen as a general approach to tuning MCMC sampler parameters, adapting the initial distributions and learning rate. Key to the success and applicability of the VIS framework are the ELBO approximations based on the introduced refined variational approximation, which are computationally cheap but convenient.

Better estimates of the refined density and its gradient may be a fruitful line of research, such as the spectral estimator used in [54]. Another alternative is to use a deterministic flow (such as SGD or SVGD), keeping track of the change in entropy at each iteration using the change of the variable formula, as in [55]. However, this requires a costly Jacobian computation, making it unfeasible to combine with our approach of back-propagation through the sampler (Section 3.3) for moderately complex problems. We leave this for future exploration. Another interesting and useful line of further research would be to tackle the case in which the latent variables *z* are discrete. This would entail adapting the automatic differentiation techniques to be able to back-propagate the gradients through the sequences of acceptance steps necessary in Metropolis–Hastings samplers.

In order to deal with the implicit variational density, it may be worthwhile to consider optimizing the Fenchel dual of the KL divergence, as in [31]. However, this requires the use of an auxiliary neural network, which may entail a large computational price compared with our simpler particle approximation.

Lastly, probabilistic programming offers powerful tools for Bayesian modeling. A PPL can be viewed as a programming language extended with random sampling and Bayesian conditioning capabilities, complemented with an inference engine that produces answers to inference, prediction and decision-making queries. Examples include WinBUGS [56], Stan [57] or the recent Edward [58] and Pyro [59] languages. We plan to adapt VIS into several PPLs to facilitate the adoption of the framework.

**Author Contributions:** Conceptualization, V.G. and D.R.I.; methodology, V.G.; software, V.G.; investigation, V.G.; writing—original draft preparation, V.G. and D.R.I.; writing—review and editing, V.G. and D.R.I. All authors have read and agreed to the published version of the manuscript.

**Funding:** VG acknowledges support from grant FPU16-05034. DRI is grateful to the MINECO MTM2017-86875-C3-1-R project and the AXA-ICMAT Chair in Adversarial Risk Analysis. Both authors acknowledge support from the Severo Ochoa Excellence Program CEX2019-000904-S. This material was based upon work partially supported by the National Science Foundation under Grant DMS-1638521 to the Statistical and Applied Mathematical Sciences Institute as well as a BBVA Foundation project.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **Appendix A. Fokker–Planck Approximation (VIS-FP)**

The Fokker–Planck equation is a PDE that describes the temporal evolution of the density of a random variable under a (stochastic) gradient flow [39]. For a given SDE

$$dz = \mu(z, t)dt + \sigma(z, t)dB\_t,$$

the corresponding Fokker–Planck equation is

$$\frac{\partial}{\partial t}q\_t(z) = -\frac{\partial}{\partial z}[\mu(z,t)q\_t(z)] + \frac{\partial^2}{\partial z^2}\left[\frac{\sigma^2(z,t)}{2}q\_t(z)\right].$$

We are interested in converting the SGLD dynamics to a deterministic gradient flow.

**Proposition A1.** *The SGLD dynamics, given by the SDE*

$$dz = \nabla \log p(z)dt + \sqrt{2}\,dB\_t,$$

*have an equivalent deterministic flow, written as the ODE*

$$dz = (\nabla \log p(z) - \nabla \log q\_t(z))dt.$$

**Proof.** Let us write the Fokker–Planck equation for the respective flows. For the Langevin SDE, it is

$$\frac{\partial}{\partial t}q\_t(z) = -\frac{\partial}{\partial z}\left[\nabla\log p(z)q\_t(z)\right] + \frac{\partial^2}{\partial z^2}\left[q\_t(z)\right].$$

On the other hand, the Fokker–Planck equation for the deterministic gradient flow is given by

$$\frac{\partial}{\partial t}q\_t(z) = -\frac{\partial}{\partial z}\left[\nabla\log p(z)q\_t(z)\right] + \frac{\partial}{\partial z}\left[\nabla\log q\_t(z)q\_t(z)\right].$$

The result immediately follows since ∂/∂*z* [∇ log *q_t*(*z*) *q_t*(*z*)] = ∂²/∂*z*² [*q_t*(*z*)].

Given that both flows are equivalent, we restrict our attention to the deterministic flow. Its discretization leads to iterations of the form

$$z\_t = z\_{t-1} + \eta \left(\nabla \log p(z\_{t-1}) - \nabla \log q\_{t-1}(z\_{t-1})\right). \tag{A1}$$

In order to tackle the last term, we make the following particle approximation. Using a variational formulation, we have

$$-\nabla \log q(z) = \nabla \left( -\frac{\delta}{\delta q} \mathbb{E}\_q[\log q] \right).$$

Then, we smooth the true density *q* by convolving it with a kernel *K*, typically the RBF kernel *K*(*z*, *z*′) = exp{−*γ*‖*z* − *z*′‖²}, where *γ* is the bandwidth hyperparameter, leading to

$$\begin{aligned} \nabla \left( -\frac{\delta}{\delta q} \mathbb{E}\_q[\log q] \right) & \approx \nabla \left( -\frac{\delta}{\delta q} \mathbb{E}\_q[\log(q \ast K)] \right) \\ &= \nabla \log(q \ast K) - \nabla \left( \frac{q}{(q \ast K)} \ast K \right). \end{aligned}$$

If we consider a mixture of Dirac deltas, *q*(*z*) = (1/*M*) ∑_{m=1}^{M} δ(*z* − *z_m*), then the approximation is given by

$$-\nabla\log q(z\_m) \approx -\frac{\sum\_{n} \nabla\_{z\_{m}} K(z\_{m}, z\_{n})}{\sum\_{n} K(z\_{m}, z\_{n})} - \sum\_{l} \frac{\nabla\_{z\_{m}} K(z\_{m}, z\_{l})}{\sum\_{n} K(z\_{n}, z\_{l})},$$

which can be inserted into Equation (A1). Finally, note that it is possible to back-propagate through this equation; i.e., the gradients of *K* can be explicitly computed.
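The particle update (A1) with this kernelized estimate of −∇ log *q* can be sketched as follows, assuming an RBF kernel with unit bandwidth (illustrative NumPy, not the released code).

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(z, gamma=1.0):
    # K[m, n] = exp(-gamma * ||z_m - z_n||^2) and its gradient wrt z_m.
    diff = z[:, None, :] - z[None, :, :]            # (M, M, d)
    K = np.exp(-gamma * np.sum(diff**2, axis=-1))   # (M, M)
    gradK = -2.0 * gamma * diff * K[:, :, None]     # d/dz_m K(z_m, z_n)
    return K, gradK

def neg_grad_log_q(z, gamma=1.0):
    # Particle estimate of -grad log q(z_m) from the kernelized identity above.
    K, gradK = rbf(z, gamma)
    term1 = -np.sum(gradK, axis=1) / np.sum(K, axis=1, keepdims=True)
    term2 = -np.sum(gradK / np.sum(K, axis=0)[None, :, None], axis=1)
    return term1 + term2

def vis_fp_step(z, grad_log_p, eta=0.05, gamma=1.0):
    # Deterministic flow (A1): z <- z + eta * (grad log p(z) - grad log q(z)).
    return z + eta * (grad_log_p(z) + neg_grad_log_q(z, gamma))

# Refine a tight cluster of particles toward a standard Gaussian target.
z0 = 2.0 + 0.01 * rng.normal(size=(30, 2))
z = z0.copy()
for _ in range(50):
    z = vis_fp_step(z, lambda u: -u, eta=0.05)      # grad log p(z) = -z
```

The attraction term pulls the particles toward the mode while the kernel terms repel them from each other, so the cluster both recenters and spreads out.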

#### **Appendix B. Experiment Details**

*Appendix B.1. State-Space Models*

Appendix B.1.1. Initial Experiments

For the HMM, both the observation and transition probabilities are categorical distributions, taking values in the domain {0, 1, 2, 3, 4}.

The equations of the DLM are

$$\begin{aligned} z\_{t+1} &\sim \mathcal{N}(0.5z\_t + 1.0,\ \sigma\_{tr}) \\ x\_t &\sim \mathcal{N}(3.0z\_t + 0.5,\ \sigma\_{em}), \end{aligned}$$

with *z*_0 = 0.0.

Appendix B.1.2. Prediction Task in a DLM

The DLM model comprises a linear trend component plus a seasonal block with a period of 12. The trend is specified as

$$\begin{aligned} \mathbf{x}\_{t} &= z\_{\text{level},t} + \boldsymbol{\varepsilon}\_{t} & \boldsymbol{\varepsilon}\_{t} &\sim \mathcal{N}(\mathbf{0}, \sigma\_{\text{obs}}) \\ z\_{\text{level},t} &= z\_{\text{level},t-1} + z\_{\text{slope},t-1} + \boldsymbol{\varepsilon}\_{t}^{\prime} & \boldsymbol{\varepsilon}\_{t}^{\prime} &\sim \mathcal{N}(\mathbf{0}, \sigma\_{\text{level}}) \\ z\_{\text{slope},t} &= z\_{\text{slope},t-1} + \boldsymbol{\varepsilon}\_{t}^{\prime\prime} & \boldsymbol{\varepsilon}\_{t}^{\prime\prime} &\sim \mathcal{N}(\mathbf{0}, \sigma\_{\text{slope}}) . \end{aligned}$$

With respect to the seasonal component, we specify it through

$$\begin{aligned} x\_t &= Fz\_t + \upsilon\_t & \upsilon\_t &\sim \mathcal{N}(0, \sigma\_{\text{obs}})\\ z\_t &= Gz\_{t-1} + w\_t & w\_t &\sim \mathcal{N}(0, \sigma\_{\text{series}}) \end{aligned}$$

where *F* is a 12-dimensional vector (1, 0, . . . , 0, 0) and *G* is the 12 × 12 matrix

$$G = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}.$$

Further details are in [60].
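As a quick structural check: *G* as written is a cyclic permutation of the 12 seasonal effects, so applying it 12 times (one full period) returns the state to itself, and *F* reads off the current seasonal effect. A small NumPy sketch:

```python
import numpy as np

# Seasonal transition matrix G: first row picks the last component,
# the remaining rows shift every other component down by one.
G = np.zeros((12, 12))
G[0, -1] = 1.0
G[1:, :-1] = np.eye(11)

# Observation vector F = (1, 0, ..., 0): picks the current seasonal effect.
F = np.zeros(12)
F[0] = 1.0
```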

#### *Appendix B.2. VAE*

## Appendix B.2.1. Model Details

The prior distribution *<sup>p</sup>*(*z*) for the latent variables *<sup>z</sup>* <sup>∈</sup> <sup>R</sup><sup>10</sup> is a standard factorized Gaussian. The decoder distribution *p<sup>θ</sup>* (*x*|*z*) and the encoder distribution (initial variational approximation) *q*0,*φ*(*z*|*x*) are parameterized by two feed-forward neural networks, as detailed in Figure A1.

## Appendix B.2.2. Hyperparameter Settings

The Adam optimizer is used in all experiments, with a learning rate of *λ* = 0.001. We also set *η* = 0.001. We train for 15 epochs (fMNIST) and 20 epochs (MNIST) to achieve performance similar to the VAE in [24]. For the VIS-5-10 setting, we train for only 10 epochs to allow a fair comparison in terms of similar computing times.

## *Appendix B.3. cVAE*

## Appendix B.3.1. Model Details

The prior distribution *<sup>p</sup>*(*z*) for the latent variables *<sup>z</sup>* <sup>∈</sup> <sup>R</sup><sup>10</sup> is a standard factorized Gaussian. The decoder distribution *p<sup>θ</sup>* (*x*|*y*, *z*) and the encoder distribution (initial variational approximation) *q*0,*φ*(*z*|*x*, *y*) are parameterized by two feed-forward neural networks whose details can be found in Figure A2. Equation (8) is approximated with one MC sample from the variational approximation in all experimental settings, as it allowed fast inference times while offering better results.

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        self.z_d = 10
        self.h_d = 200
        self.x_d = 28*28
        self.fc1_mu = nn.Linear(self.x_d, self.h_d)
        self.fc1_cov = nn.Linear(self.x_d, self.h_d)
        self.fc12_mu = nn.Linear(self.h_d, self.h_d)
        self.fc12_cov = nn.Linear(self.h_d, self.h_d)
        self.fc2_mu = nn.Linear(self.h_d, self.z_d)
        self.fc2_cov = nn.Linear(self.h_d, self.z_d)
        self.fc3 = nn.Linear(self.z_d, self.h_d)
        self.fc32 = nn.Linear(self.h_d, self.h_d)
        self.fc4 = nn.Linear(self.h_d, self.x_d)
    def encode(self, x):
        h1_mu = F.relu(self.fc1_mu(x))
        h1_cov = F.relu(self.fc1_cov(x))
        h1_mu = F.relu(self.fc12_mu(h1_mu))
        h1_cov = F.relu(self.fc12_cov(h1_cov))
        # we work in the logvar-domain
        return (self.fc2_mu(h1_mu),
                torch.log(F.softplus(self.fc2_cov(h1_cov))))
    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        h3 = F.relu(self.fc32(h3))
        return torch.sigmoid(self.fc4(h3))
```
**Figure A1.** Model architecture for the VAE.
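Figure A1 lists only the architecture; the training objective (the negative ELBO with the reparameterization trick, cf. [24]) is not shown. Below is a self-contained sketch using a condensed single-hidden-layer variant of the architecture (named `MiniVAE` here to avoid confusion with the listing; the exact loss and hyperparameters used in the experiments may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniVAE(nn.Module):
    # Condensed variant of the Figure A1 architecture (one hidden layer
    # per branch instead of two), purely for illustration.
    def __init__(self, x_d=28 * 28, h_d=200, z_d=10):
        super().__init__()
        self.fc1_mu = nn.Linear(x_d, h_d)
        self.fc1_cov = nn.Linear(x_d, h_d)
        self.fc2_mu = nn.Linear(h_d, z_d)
        self.fc2_cov = nn.Linear(h_d, z_d)
        self.fc3 = nn.Linear(z_d, h_d)
        self.fc4 = nn.Linear(h_d, x_d)

    def encode(self, x):
        mu = self.fc2_mu(F.relu(self.fc1_mu(x)))
        logvar = torch.log(F.softplus(self.fc2_cov(F.relu(self.fc1_cov(x)))))
        return mu, logvar

    def decode(self, z):
        return torch.sigmoid(self.fc4(F.relu(self.fc3(z))))

def neg_elbo(model, x):
    """Negative ELBO for a batch of flattened images with pixels in [0, 1]."""
    mu, logvar = model.encode(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
    recon = F.binary_cross_entropy(model.decode(z), x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for factorized Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

A training step would then call `neg_elbo(model, x)`, backpropagate, and take an Adam step.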

```
class cVAE(nn.Module):
    def __init__(self):
        super(cVAE, self).__init__()
        self.z_d = 10
        self.h_d = 200
        self.x_d = 28*28
        num_classes = 10
        self.fc1_mu = nn.Linear(self.x_d + num_classes, self.h_d)
        self.fc1_cov = nn.Linear(self.x_d + num_classes, self.h_d)
        self.fc12_mu = nn.Linear(self.h_d, self.h_d)
        self.fc12_cov = nn.Linear(self.h_d, self.h_d)
        self.fc2_mu = nn.Linear(self.h_d, self.z_d)
        self.fc2_cov = nn.Linear(self.h_d, self.z_d)
        self.fc3 = nn.Linear(self.z_d + num_classes, self.h_d)
        self.fc32 = nn.Linear(self.h_d, self.h_d)
        self.fc4 = nn.Linear(self.h_d, self.x_d)
    def encode(self, x, y):
        h1_mu = F.relu(self.fc1_mu(torch.cat([x, y], dim=-1)))
        h1_cov = F.relu(self.fc1_cov(torch.cat([x, y], dim=-1)))
        h1_mu = F.relu(self.fc12_mu(h1_mu))
        h1_cov = F.relu(self.fc12_cov(h1_cov))
        # we work in the logvar-domain
        return (self.fc2_mu(h1_mu),
                torch.log(F.softplus(self.fc2_cov(h1_cov))))
    def decode(self, z, y):
        h3 = F.relu(self.fc3(torch.cat([z, y], dim=-1)))
        h3 = F.relu(self.fc32(h3))
        return torch.sigmoid(self.fc4(h3))
```
**Figure A2.** Model architecture for the cVAE.

## Appendix B.3.2. Hyperparameter Settings

The Adam optimizer was used in all experiments, with a learning rate of *λ* = 0.01. We set the initial *η* = 5 × 10<sup>−5</sup>.

#### **References**


## *Article* **Dynamics of Coordinate Ascent Variational Inference: A Case Study in 2D Ising Models**

## **Sean Plummer \*, Debdeep Pati and Anirban Bhattacharya**

Department of Statistics, Texas A&M University, College Station, TX 77843, USA; debdeep@stat.tamu.edu (D.P.); anirbanb@stat.tamu.edu (A.B.)

**\*** Correspondence: snplmmr@stat.tamu.edu

Received: 3 September 2020; Accepted: 3 November 2020; Published: 6 November 2020

**Abstract:** Variational algorithms have gained prominence over the past two decades as a scalable computational environment for Bayesian inference. In this article, we explore tools from the dynamical systems literature to study the convergence of coordinate ascent algorithms for mean field variational inference. Focusing on the Ising model defined on two nodes, we fully characterize the dynamics of the sequential coordinate ascent algorithm and its parallel version. We observe that in the regime where the objective function is convex, both algorithms are stable and converge to the unique fixed point. Our analyses reveal interesting discordances between these two versions of the algorithm in the regime where the objective function is non-convex. In fact, the parallel version exhibits a periodic oscillatory behavior which is absent in the sequential version. Drawing intuition from the Markov chain Monte Carlo literature, we empirically show that a parameter expansion of the Ising model, popularly called the Edward–Sokal coupling, leads to an enlargement of the regime of convergence to the global optima.

**Keywords:** bifurcation; dynamical systems; Edward–Sokal coupling; mean-field; Kullback–Leibler divergence; variational inference

## **1. Introduction**

Variational Bayes (VB) is now a standard tool for approximating computationally intractable posterior densities. Traditionally, this intractability has been circumvented using sampling techniques such as Markov chain Monte Carlo (MCMC). MCMC techniques tend to be computationally expensive for the high-dimensional and complex hierarchical Bayesian models that are prolific in modern applications. VB methods, on the other hand, typically provide answers orders of magnitude faster, as they are based on optimization. Introductions to VB can be found in Chapter 10 of [1] and Chapter 33 of [2]. Excellent recent surveys can be found in [3,4].

The objective of VB is to find the best approximation to the posterior distribution from a more tractable class of distributions on the latent variables that is well-suited to the problem at hand. The best approximation is found by minimizing a divergence between the posterior distribution of interest and a class of distributions that are computationally tractable. The most popular choices for the discrepancy and the approximating class are the Kullback–Leibler (KL) divergence and the class of product distributions, respectively. This combination is popularly known as mean field variational inference, originating from mean field theory in physics [5]. Mean-field inference has percolated through a wide variety of disciplines, including statistical mechanics, electrical engineering, information theory, neuroscience, cognitive sciences [6] and, more recently, deep neural networks [7]. While computing the KL divergence is intractable for a large class of distributions, reframing the minimization problem as maximization of the evidence lower bound (ELBO) leads to efficient algorithms. In particular, for conditionally conjugate exponential-family models, the optimal distribution for mean field variational inference can be computed by iterating closed-form updates. These updates form a coordinate ascent algorithm known as coordinate ascent variational inference (CAVI) [1].

Research into the theoretical properties of variational Bayes has exploded in the last few years. Recent theoretical work focuses on statistical risk bounds for the variational estimates obtained from VB [8–11], asymptotic normality of VB posteriors [12] and extensions to model misspecification [8,13]. While much of the recent theoretical work focuses on statistical optimality guarantees, there has been less work studying the convergence of the CAVI algorithms employed in practice. Convergence of CAVI to the global optimum is only known in special cases that depend heavily on model structure: normal mixture models [14,15]; stochastic block models [16–19]; topic models [20]; and, under special restrictions of the parameter regime, Ising models [21,22]. The convergence properties of the CAVI algorithm still largely constitute an open problem.

The goal of this work is to suggest a general systematic framework for studying convergence properties of CAVI algorithms. By viewing CAVI as a discrete-time dynamical system, we can leverage dynamical systems theory to analyze the convergence behavior of the algorithm and bifurcation theory to study the types of changes that solutions can undergo as the various parameters are varied. For the sake of concreteness, we focus on the 2D Ising model. While dynamical systems theory possesses the tools [23–25] necessary to analyze higher dimensional systems, they were mainly developed for non-sequential systems. The general theory for *n*-dimensional discrete dynamical systems depends on having the evolution function in the form $x\_{n+1} = F(x\_n)$. Deriving this *F* is typically not possible for densely connected higher dimensional sequential systems. The 2D Ising model has the special property that both the sequential and parallel updates in the two-variable case can be written as two separate one-variable dynamical systems, allowing for a simplified analysis. Our contributions to the literature are as follows: we provide a complete classification of the dynamical properties of the traditional sequential-update CAVI algorithm, and of a parallelized version of the algorithm, using dynamical systems and bifurcation theory on the Ising model. Our findings show that the sequential CAVI algorithm and the parallelized version have different convergence properties. Additionally, we numerically investigate the convergence of the CAVI algorithm on the Edward–Sokal coupling, a generalization of the Ising model. Our findings suggest that couplings/parameter expansions may provide a powerful way of controlling the convergence behavior of the CAVI algorithm, beyond the immediate example considered here.

#### **2. Mean-Field Variational Inference and the Coordinate Ascent Algorithm**

In this section, we briefly introduce mean-field variational inference for a target distribution in the form of a Boltzmann distribution with potential function Ψ,

$$p(\mathbf{x}) = \frac{\exp\{\Psi(\mathbf{x})\}}{\mathcal{Z}}, \quad \mathbf{x} \in \mathcal{X},$$

where $\mathcal{Z}$ denotes the intractable normalizing constant. The above representation encapsulates both posterior distributions that arise in Bayesian inference, where Ψ is the log-posterior up to constants, and probabilistic graphical models such as the Ising and Potts models. For instance, $\Psi(\mathbf{x}) = \beta \sum\_{u \sim v} J\_{uv} x\_u x\_v + \beta \sum\_{u} h\_u x\_u$ for the Ising model; see the next section for more details. Many of the complications in inference arise from the intractability of the normalizing constant $\mathcal{Z}$, which is commonly referred to as the free energy in probabilistic graphical models, and the marginal likelihood or evidence in Bayesian statistics. Variational inference aims to mitigate this problem by using optimization to find the best approximation *q*<sup>∗</sup> to the target density *p* from a class $\mathcal{F}$ of variational distributions over the parameter vector **x**,

$$q^\* = \arg\min\_{q \in \mathcal{F}} D(q \mid \mid p) \tag{1}$$

where *D*(*q* || *p*) denotes the Kullback–Leibler (KL) divergence between *q* and *p*. The complexity of this optimization problem is largely determined by the choice of variational family F. The objective function of the above optimization problem is intractable because it also involves the evidence Z. We can work around this issue by rewriting the KL divergence as

$$D(q \mid \mid p) = \mathbb{E}\_q[\log q] - \mathbb{E}\_q[\Psi] + \log \mathcal{Z} \tag{2}$$

where E*<sup>q</sup>* denotes the expectation with respect to *q*(**x**). Rearranging terms,

$$\log \mathcal{Z} = D(q \, || \, p) + \mathbb{E}\_q[\Psi] - \mathbb{E}\_q[\log q] \tag{3}$$

$$\geq \mathbb{E}\_q[\Psi] - \mathbb{E}\_q[\log q] := \text{ELBO}(q). \tag{4}$$

The acronym ELBO stands for evidence lower bound, and the nomenclature is now apparent from the above inequality. Notice from Equation (3) that, since $\log \mathcal{Z}$ does not depend on *q*, maximizing the ELBO is equivalent to minimizing the KL divergence. By maximizing the ELBO we can solve the original variational problem while bypassing the computational intractability of the evidence.
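The identity (3) and the bound (4) are easy to verify numerically on a toy discrete target; the three-state potential Ψ and the variational distribution *q* below are arbitrary illustrative choices:

```python
import numpy as np

psi = np.array([0.3, -1.0, 2.0])       # unnormalized log-target Psi(x)
Z = np.exp(psi).sum()                  # evidence / normalizing constant
p = np.exp(psi) / Z                    # target p(x) = exp(Psi(x)) / Z
q = np.array([0.2, 0.3, 0.5])          # an arbitrary variational distribution

kl = np.sum(q * np.log(q / p))                   # D(q || p)
elbo = np.sum(q * psi) - np.sum(q * np.log(q))   # E_q[Psi] - E_q[log q]

assert abs(np.log(Z) - (kl + elbo)) < 1e-12      # identity (3)
assert elbo <= np.log(Z)                         # bound (4)
```

The identity holds exactly because $\mathbb{E}\_q[\log q] - \mathbb{E}\_q[\log p] + \mathbb{E}\_q[\Psi] - \mathbb{E}\_q[\log q] = \mathbb{E}\_q[\Psi - \log p] = \log \mathcal{Z}$.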

As mentioned above, the choice of variational family controls both the complexity and accuracy of approximation. Using a more flexible family achieves a tighter lower bound but at the cost of having to solve a more complex optimization problem. A popular choice of family that balances both flexibility and computability is the mean-field family. Mean-field variational inference refers to the situation when *q* is restricted to the product family of densities over the parameters,

$$\mathcal{F}\_{\text{MF}} := \left\{ q(\mathbf{x}) = q\_1(x\_1) \otimes \dots \otimes q\_n(x\_n) \text{ for probability measures } q\_j, \; j = 1, \dots, n \right\}. \tag{5}$$

The coordinate ascent variational inference (CAVI) algorithm (refer to Algorithm 1) is a learning algorithm that optimizes the ELBO over the mean-field family $\mathcal{F}\_{\text{MF}}$. At each time step *t* ≥ 1, the CAVI algorithm iteratively updates the current mean-field marginal distribution $q\_j^{(t)}(x\_j)$ by maximizing the ELBO over that marginal while keeping the other marginals $\{q\_{\ell}^{(t)}(x\_{\ell})\}\_{\ell \neq j}$ fixed at their current values. Formally, we update the current distribution $q^{(t)}(\mathbf{x})$ to $q^{(t+1)}(\mathbf{x})$ via the updates,

$$\begin{array}{rcl}q\_1^{(t+1)}(\mathbf{x}\_1) &=& \arg\max\_{q\_1} \text{ELBO}(q\_1 \otimes q\_2^{(t)} \otimes \cdots \otimes q\_n^{(t)})\\q\_2^{(t+1)}(\mathbf{x}\_2) &=& \arg\max\_{q\_2} \text{ELBO}(q\_1^{(t+1)} \otimes q\_2 \otimes q\_3^{(t)} \otimes \cdots \otimes q\_n^{(t)})\\ &\vdots\\q\_n^{(t+1)}(\mathbf{x}\_n) &=& \arg\max\_{q\_n} \text{ELBO}(q\_1^{(t+1)} \otimes \cdots \otimes q\_{n-1}^{(t+1)} \otimes q\_n). \end{array}$$

**Algorithm 1** Coordinate ascent variational inference (CAVI).

**Input:** Model $p(\mathbf{x}) = \exp(\Psi(\mathbf{x}) - \log \mathcal{Z})$  
**Output:** A variational density $q(\mathbf{x}) = \prod\_{j=1}^{n} q\_j(x\_j)$  
**Initialize:** variational densities $q\_j(x\_j)$  
**while** ELBO(*q*) *not converged* **do**  
&nbsp;&nbsp;**for** $j \in \{1, \dots, n\}$ **do**  
&nbsp;&nbsp;&nbsp;&nbsp;$q\_j(x\_j) \propto \exp\left\{\mathbb{E}\_{-j}[\Psi(\mathbf{x})]\right\}$  
&nbsp;&nbsp;**end**  
&nbsp;&nbsp;Compute ELBO($q$) $= \mathbb{E}\_q[\Psi(\mathbf{x})] - \mathbb{E}\_q[\log q(\mathbf{x})]$  
**end**  
**return** $q(\mathbf{x})$

The objective function ELBO(*q*<sup>1</sup> ⊗···⊗ *qn*) is concave in each of the arguments individually (although it is rarely jointly concave), so these individual maximization problems have unique solutions. The optimal update for the *j*th mean field variational component of the model has the closed form,

$$q\_j^{\*}(x\_j) \;\propto\; \exp\left\{ \mathbb{E}\_{-j} \left[ \Psi(\mathbf{x}) \right] \right\},$$

where the expectations $\mathbb{E}\_{-j}$ are taken with respect to the distribution $\prod\_{i \neq j} q\_i(x\_i)$. Furthermore, the updates are monotone, as each step of CAVI increases the objective function:

$$\text{ELBO}(q\_1^{(t+1)} \otimes q\_2^{(t+1)} \otimes \dots \otimes q\_n^{(t+1)}) \ge \text{ELBO}(q\_1^{(t+1)} \otimes \dots \otimes q\_{n-1}^{(t+1)} \otimes q\_n^{(t)}) \ge \dots \ge \text{ELBO}(q\_1^{(t)} \otimes q\_2^{(t)} \otimes \dots \otimes q\_n^{(t)}).$$

For parametric models, the sequential updates of the variational marginal distributions in the CAVI algorithm are performed via sequential updates of the variational parameters of these distributions. The CAVI updates for parametric models therefore induce a discrete-time dynamical system on the parameters. Clearly, convergence of the CAVI algorithm can be framed in terms of this induced discrete-time dynamical system. As discussed before, the ELBO is generally a non-convex function, and hence the CAVI algorithm is only guaranteed to converge to a local optimum of the system. It is also not clear how many local optima (or fixed points) the system has, nor whether the algorithm always settles on a single fixed point, diverges away from the fixed points or cycles between multiple fixed points. These questions translate to questions about the existence and stability of fixed points of the induced dynamical system. We are also interested in how the behavior of the CAVI algorithm could change as we vary the parameters of the model. This translates to questions about the possible bifurcations of the induced dynamical system. In Section 3, we formally introduce the Ising model and its mean-field variational inference.

### **3. CAVI in Ising Model**

We first briefly review the definition of an Ising model. The Ising model was first introduced as a model for magnetization in statistical physics, but has found many applications in other fields; see [26] and references therein. The Ising model is a probability distribution on the hypercube $\{\pm 1\}^n$ given by

$$p(\mathbf{x}) \;\propto\; \exp\left[\beta \sum\_{u \sim v} J\_{uv} x\_u x\_v + \beta \sum\_{u} h\_u x\_u\right],\tag{6}$$

where the interaction matrix *J* is a symmetric real *n* × *n* matrix with zeros on the diagonal, *h* is a real *n*-vector that represents the external magnetic field, and *β* is the inverse temperature parameter. The model is said to be ferromagnetic if *Juv* ≥ 0 for all *u*, *v* and anti-ferromagnetic if *Juv* < 0 for all *u*, *v*. The normalizing constant or the partition function of the Ising model is

$$\mathcal{Z} = \sum\_{\mathbf{x} \in \{\pm 1\}^n} \exp \left[ \beta \sum\_{u \sim v} J\_{uv} x\_u x\_v + \beta \sum\_{u} h\_u x\_u \right].$$

Refer to Chapter 31 of [2] for an excellent review of Ising models.
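For small *n*, the partition function can be computed exactly by brute-force enumeration, which is useful for sanity-checking approximations. A short sketch (the function name and interface are our own):

```python
import numpy as np
from itertools import product

def ising_log_Z(J, h, beta):
    """Exact log partition function of the Ising model by enumerating all
    2^n spin configurations (feasible only for small n).
    J: symmetric (n, n) interaction matrix with zero diagonal; h: (n,) field."""
    n = len(h)
    total = 0.0
    for spins in product([-1, 1], repeat=n):
        x = np.array(spins)
        # 0.5 * x^T J x sums J_uv x_u x_v over unordered pairs u ~ v
        total += np.exp(beta * (0.5 * x @ J @ x + h @ x))
    return np.log(total)
```

For *n* = 2 with $J\_{12} = 1$ and *h* = 0 this reduces to $\log(2e^{\beta} + 2e^{-\beta})$, matching a direct calculation over the four spin configurations.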

#### *Mean Field Variational Inference in Ising Model*

Here we provide a derivation of the CAVI update function for the Ising model, focusing on the two nodes (*n* = 2) case for simplicity and analytic tractability.

Notice $\log p(\mathbf{x}) := \beta \mathcal{H}(\mathbf{x}) = \beta \sum\_{u \sim v} J\_{uv} x\_u x\_v + \beta \sum\_{u} h\_u x\_u$ up to an additive constant. In this case, we have the Ising model on two spins with $\mathbf{x} = (x\_1, x\_2)$, interaction matrix *J* with off-diagonal term $J\_{12}$, and external magnetic field $h = (h\_1, h\_2) = (0, 0)$. From the general framework in Section 2, the CAVI updates are given by,

$$q\_j^{\*}(x\_j) \propto \exp\left\{\mathbb{E}\_{-j}\left[\beta\left(J\_{12} x\_1 x\_2 + h\_1 x\_1 + h\_2 x\_2\right)\right]\right\}.$$

Equivalently, the same updates are obtained by setting the gradient of the ELBO as a function of $(x\_1, x\_2)$ equal to zero. Illustrations of the ELBO together with the optimal update functions for various values of *β* are given in Figures 1 and 2.

**Figure 1.** A contour plot of the ELBO as a function of $x\_1$ and $x\_2$ for *β* = 0.7 (**left**) and *β* = 1.2 (**right**), together with the optimal update functions for $x\_1$ (**orange**) and $x\_2$ (**blue**) given in Equation (8). For *β* = 0.7 the ELBO is a convex function and has exactly one optimum, the global maximum, at (0.5, 0.5). For *β* = 1.2 the ELBO is a nonconvex function and has three optima, at (0.5, 0.5), (0.17071, 0.17071) and (0.82928, 0.82928).

**Figure 2.** A contour plot of the ELBO as a function of $x\_1$ and $x\_2$ for *β* = −0.7 (**left**) and *β* = −1.2 (**right**), together with the optimal update functions for $x\_1$ (**orange**) and $x\_2$ (**blue**) given in Equation (8). For *β* = −0.7 the ELBO is a convex function and has exactly one optimum, the global maximum, at (0.5, 0.5). For *β* = −1.2 the ELBO is a nonconvex function and has three optima, at (0.5, 0.5), (0.17071, 0.82928) and (0.82928, 0.17071).
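The optima quoted in the captions can be reproduced with a rough grid search over the mean-field ELBO. For the two-node model with $J\_{12} = 1$ and *h* = 0, written in terms of $\zeta = q\_1(x\_1 = 1)$ and $\xi = q\_2(x\_2 = 1)$, the ELBO is $\beta(2\zeta - 1)(2\xi - 1)$ plus the binary entropy of each marginal; a sketch (the grid resolution is arbitrary):

```python
import numpy as np

def elbo(zeta, xi, beta):
    # beta * E_q[x1 x2] plus the entropy of each marginal distribution
    ent = lambda p: -p * np.log(p) - (1 - p) * np.log(1 - p)
    return beta * (2 * zeta - 1) * (2 * xi - 1) + ent(zeta) + ent(xi)

grid = np.linspace(0.001, 0.999, 999)
Zg, Xg = np.meshgrid(grid, grid)
for beta in (0.7, 1.2):
    i, j = np.unravel_index(np.argmax(elbo(Zg, Xg, beta)), Zg.shape)
    print(f"beta = {beta}: ELBO maximized near ({grid[j]:.3f}, {grid[i]:.3f})")
```

For *β* = 0.7 the maximizer is (0.5, 0.5); for *β* = 1.2 the search lands on one of the two symmetric optima on the diagonal, near 0.171 or 0.829.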

Since $q\_1^{\*}$ and $q\_2^{\*}$ are two-point distributions, it is sufficient to keep track of the mass assigned to 1. Simplifying,

$$\begin{split} q\_1^{\*}(x\_1) &\propto \exp\left\{ \mathbb{E}\_2 \left[ \log p(x\_1, x\_2) \right] \right\} \\ &= \exp\left\{ \beta \mathcal{H}(x\_1, x\_2 = 1)\, q\_2(x\_2 = 1) + \beta \mathcal{H}(x\_1, x\_2 = -1)\, q\_2(x\_2 = -1) \right\} \\ &= \exp\left\{ (\beta J\_{12} x\_1 + \beta h\_1 x\_1 + \beta h\_2)\, \xi + (-\beta J\_{12} x\_1 + \beta h\_1 x\_1 - \beta h\_2)(1 - \xi) \right\} \\ &= \exp\left\{ (2\xi - 1)(\beta J\_{12} x\_1 + \beta h\_2) + \beta h\_1 x\_1 \right\}, \end{split}$$

where $\xi = q\_2(x\_2 = 1)$. Therefore,

$$\begin{split} q\_1^{\*}(x\_1 = 1) &= \frac{\exp\left\{ (2\xi - 1)(\beta J\_{12} + \beta h\_2) + \beta h\_1 \right\}}{\exp\left\{ (2\xi - 1)(\beta J\_{12} + \beta h\_2) + \beta h\_1 \right\} + \exp\left\{ (2\xi - 1)(-\beta J\_{12} + \beta h\_2) - \beta h\_1 \right\}} \\ &= \frac{1}{1 + \exp\left\{ -2\beta J\_{12}(2\xi - 1) - 2\beta h\_1 \right\}}. \end{split}$$

Similarly, denoting $\zeta = q\_1(x\_1 = 1)$,

$$\begin{split} q\_2^{\*}(x\_2 = 1) &= \frac{\exp\left\{ (2\zeta - 1)(\beta J\_{12} + \beta h\_1) + \beta h\_2 \right\}}{\exp\left\{ (2\zeta - 1)(\beta J\_{12} + \beta h\_1) + \beta h\_2 \right\} + \exp\left\{ (2\zeta - 1)(-\beta J\_{12} + \beta h\_1) - \beta h\_2 \right\}} \\ &= \frac{1}{1 + \exp\left\{ -2\beta J\_{12}(2\zeta - 1) - 2\beta h\_2 \right\}}. \end{split}$$

Let $\zeta\_k$ (resp. $\xi\_k$) denote the *k*th iterate of $q\_1(x\_1 = 1)$ (resp. $q\_2(x\_2 = 1)$) from the CAVI algorithm. To succinctly represent these updates, define the logistic sigmoid function

$$\sigma(u, \beta) = \frac{1}{1 + e^{-\beta u}}, \quad u \in \mathbb{R}, \quad \beta \in \mathbb{R}. \tag{7}$$

With this notation, we have, for any $k \in \mathbb{Z}\_+$,

$$\begin{aligned} \zeta\_{k+1} &= \sigma(J\_{12}(2\xi\_k - 1) + h\_1,\; 2\beta) \\ \xi\_{k+1} &= \sigma(J\_{12}(2\zeta\_{k+1} - 1) + h\_2,\; 2\beta). \end{aligned} \tag{8}$$

Without loss of generality we henceforth set $J\_{12} = 1$. Under this choice the model is in the ferromagnetic regime for *β* > 0 and the anti-ferromagnetic regime for *β* < 0.
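The updates (8) are easy to iterate numerically. The sketch below (with $J\_{12} = 1$; the initial values and iteration count are illustrative) converges to (0.5, 0.5) inside the Dobrushin regime |*β*| < 1 and to one of the nontrivial optima outside it:

```python
import numpy as np

def sigmoid(u, beta):
    # the logistic sigmoid of Equation (7)
    return 1.0 / (1.0 + np.exp(-beta * u))

def cavi_sequential(beta, zeta0=0.9, xi0=0.1, h1=0.0, h2=0.0, iters=200):
    """Sequential CAVI updates (8) for the two-node Ising model, J12 = 1."""
    zeta, xi = zeta0, xi0
    for _ in range(iters):
        zeta = sigmoid((2 * xi - 1) + h1, 2 * beta)    # uses the latest xi
        xi = sigmoid((2 * zeta - 1) + h2, 2 * beta)    # uses the fresh zeta
    return zeta, xi
```

For example, `cavi_sequential(0.7)` returns a point very close to (0.5, 0.5), while `cavi_sequential(1.2)` converges to one of the two symmetric optima near 0.171 or 0.829, depending on the initialization.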

## **4. Why the Ising Model: A Summary of Our Contributions**

There are exactly two cases of the Ising model that have a full analytic solution for the free energy: (i) the one-dimensional line graph, solved by Ernst Ising in his thesis [27], and (ii) the two-dimensional case on the anisotropic square lattice with magnetic field *h* = 0, solved in [28]. Comparison with the mean field solution for the same models highlights the poor approximation quality of the mean field solution in low dimensions. To the best of the authors' knowledge, there are no results in the literature detailing the properties of the mean field solution to the anti-ferromagnetic Ising model. Readers not familiar with the physics may wonder why this is the case. To explain this, there are two cases in the anti-ferromagnetic regime: one is equivalent to the ferromagnetic case, and in the other the mean field approximation is not a good approximation of the system. The first case occurs on a bipartite graph, where a transformation of variables makes the anti-ferromagnetic regime equivalent to the ferromagnetic one [29]. The other case can be seen on the triangle graph. Fixing the spin of one vertex as 1 and another as −1, the third vertex becomes geometrically frustrated: neither choice of spin lowers the energy of the system, and the two configurations are equivalent [30]. In this case the mean field approximation gives a completely incorrect answer and does not merit further investigation from a qualitative point of view. The physics literature is primarily concerned with using the mean field solutions to the Ising model to estimate important physical constants of the systems. These constants are only meaningful when the mean field solution provides a good approximation to the behavior of the system in large dimensions. It is known, however, that under certain conditions the mean field approximation does indeed converge to the true free energy of the system as the dimension increases [21,31].

Our work is focused on providing a rigorous methodology to analyze the dynamics of the CAVI algorithm that can be applied to any model structure. All of the interesting behaviors exhibited by the CAVI algorithm fit into the classical mathematical framework of discrete dynamical systems and bifurcation theory. Specifically, we use the Ising model as a simple and yet rich example to illustrate the potential of dynamical systems theory to analyze CAVI updates for mean field variational inference. The bifurcation of the ferromagnetic Ising model at the boundary of the Dobrushin regime is known [2,26]; however, a rigorous proof in terms of dynamical systems theory is missing in the literature.

There are several features that make the CAVI algorithm on the Ising model a nontrivial example worth investigating. The optimization problem arising from mean field variational inference on the Ising model is, in general, non-convex [21]. However, it is straightforward to obtain sufficient conditions that guarantee the existence of a global optimum. One such condition is that the inverse temperature *β* lies inside the Dobrushin regime, |*β*| < 1 [21]. Inside the Dobrushin regime, the CAVI update equations form a contraction mapping, guaranteeing a unique global optimum [21]. Outside of this regime the behavior of the CAVI algorithm is nontrivial. The CAVI solution to the Ising model with zero external magnetic field exhibits multiple local optima outside of the Dobrushin regime [2].

Our contributions to the literature are as follows. We utilize tools from dynamical systems theory to rigorously classify the full behavior of the Ising model over the full parameter regime in dimension *n* = 2 for both the sequential and parallel versions of the CAVI algorithm. We show that the dynamical behavior of the sequential CAVI is not equivalent to the behavior of the parallel CAVI. Lastly, we derive a variational approximation to the Edward–Sokal parameter expansion of the Potts and random cluster models and numerically study its convergence behavior under the CAVI algorithm. Our numerical results reveal that the parameter expansion leads to an enlargement of the regime of convergence to the global optima. In particular, the Dobrushin regime is strictly contained in the expanded regime. This is compatible with the analogous results in the Markov chain literature. See the introduction of [32] for a well-written summary of Markov chain mixing in the Ising model.

#### *Statistical Significance of Our Results*

Although mean-field variational inference has been routinely used in applications [3] for computational efficiency, it may not yield statistically optimal estimators. A statistically optimal estimator should correctly recover the statistical properties of the true distribution. Ideally, we would like the estimate to recover the true mean and true covariance of the distribution. It is well known that mean-field variational inference produces estimators that underestimate the posterior covariance [14]. More recently, it was shown that the mean-field estimators for certain topic models and stochastic block models may not even be correlated with the true distribution [17,20]. For these reasons, it is important to see if the mean field estimators can at least recover the true mean for various $\beta \in \mathbb{R}$.

Mean field inference approximates the joint probability mass function in (6) for *n* = 2 by a product of two distributions on {−1, 1}, in the sense of Kullback–Leibler divergence. As discussed in Section 3, minimizing this divergence is equivalent to maximizing an objective function, called the evidence lower bound (ELBO). Our objective is to better understand the relation between the CAVI estimate and the global maximum of the ELBO for the model (6) when *n* = 2 and *h* = 0. Ideally, we want the global maximum of the ELBO to be a statistically reliable estimate. To understand this, let us denote the distribution of 2 × Bernoulli(*p*) − 1 by $\langle 1, -1; p \rangle$. As the marginal distributions of (6) are both equal to $\langle 1, -1; 0.5 \rangle$, we want the ELBO to be maximized at this value. From an algorithmic perspective, we would like to ensure that the CAVI iterates converge to this global maximum. The synergy of these two phenomena leads to a successful variational inference method. We show in this article that both these conditions can be violated in a certain regime of the parameter space in the context of the Ising model on two nodes. Inside the Dobrushin regime (−1 < *β* < 1), the global optimum of the ELBO obtained from mean field inference occurs at $(\langle 1, -1; 0.5 \rangle, \langle 1, -1; 0.5 \rangle)$, which is qualitatively the optimal solution. In this regime, the CAVI system converges to this global optimum irrespective of where the system is initialized. Thus, in the Dobrushin regime, mean field inference yields the statistically optimal estimate. Additionally, the CAVI algorithm is stable and convergent at this value. Unfortunately, this property deteriorates outside of the Dobrushin regime, where the global maxima occur at two symmetric points which are different from $(\langle 1, -1; 0.5 \rangle, \langle 1, -1; 0.5 \rangle)$. These two symmetric points are equivalent under label switching. For example, when *β* = 1.2 one of the optima is $(\langle 1, -1; 0.17071 \rangle, \langle 1, -1; 0.17071 \rangle)$ and the other is $(\langle 1, -1; 0.82928 \rangle, \langle 1, -1; 0.82928 \rangle)$. Notice this second optimum is equivalent to the sign-swapped version $(\langle -1, 1; 0.17071 \rangle, \langle -1, 1; 0.17071 \rangle)$.

The original optimum $(\langle 1, -1; 0.5 \rangle, \langle 1, -1; 0.5 \rangle)$ is actually a local minimum of the ELBO outside the Dobrushin regime. We illustrate in our theory that the CAVI system returns one of the two global maxima of the objective function depending on the initialization of the algorithm. Although it is widely known that the statistical quality of mean field inference is poor outside the regime, we show in addition that the algorithm itself exhibits erratic behavior and may not converge to the global maximizer of the ELBO for all initializations. Interestingly, outside the Dobrushin regime, the statistically optimal solution $(\langle 1, -1; 0.5 \rangle, \langle 1, -1; 0.5 \rangle)$ is a repelling fixed point of the CAVI system. This means that as the system is iterated, the current value of the system is pushed away from $(\langle 1, -1; 0.5 \rangle, \langle 1, -1; 0.5 \rangle)$ toward a global maximum.

A common technique to further improve computational time is the use of block updates in the CAVI algorithm, meaning groups of parameters are updated simultaneously. We refer to this as the parallelized CAVI algorithm. This has been shown to work well in certain models [17], but has not been investigated in a general setting. However, it turns out that block updating in the Ising model can lead to new problematic behaviors. Outside the Dobrushin regime, block updates can exhibit non-convergence in the form of cycling. As the system updates, it eventually switches back and forth between two points that yield the same value in the objective function.

Parameter expansion (coupling) is another method of improving the convergence properties of algorithms. In the Markov chain theory for Ising models, it is well known that mixing and convergence times are typically improved by using the Edward–Sokal coupling, a parameter expansion of the ferromagnetic Ising model [33]. Our preliminary investigation reveals that the convergence properties of the CAVI algorithm exhibit a similar phenomenon.

#### **5. Main Results**

In this section, we analyze the behavior of the dynamical systems that one can form using the CAVI update equations and show that the behaviors of the systems differ. Our results rely heavily on well-known techniques from dynamical systems. For readers unfamiliar with some of the technical terminology below, we have included a primer on the basics of dynamical systems in Appendix A.

Recall the system of sequential updates, which are the updates used in CAVI:

$$\zeta\_{k+1} = \sigma(2\xi\_k - 1, 2\beta), \quad \xi\_{k+1} = \sigma(2\zeta\_{k+1} - 1, 2\beta), \tag{9}$$

and the parallel updates:

$$\zeta\_{k+1} = \sigma(2\xi\_k - 1, 2\beta), \quad \xi\_{k+1} = \sigma(2\zeta\_k - 1, 2\beta). \tag{10}$$

We will show that these two systems are not topologically conjugate. We first state and prove some results on the dynamics of the sigmoid function (7). These results will be used as building blocks to study the dynamics of (9) and (10). Phase-change behavior of dynamical systems using the sigmoid and ReLU activation functions is known in the literature in the context of the generalization performance of deep neural networks [34,35]. Despite these connections with [34,35], in this section we present a complete proof of the bifurcation analysis of non-linear dynamical systems involving the sigmoid activation function. Our results in Section 5.1 provide a more complete picture of the behavior of the dynamics in all regimes and can be readily exploited to analyze the dynamics of (9) and (10).

#### *5.1. Sigmoid Function Dynamics*

In this section we provide a full classification for the dynamics of the following sigmoid function and its second iterate,

$$
\sigma(2x - 1, 2\beta),
\tag{11}
$$

$$
\sigma(2\sigma(2x - 1, 2\beta) - 1, 2\beta). \tag{12}
$$

To the best of our knowledge, a formal proof of the full classification of the dynamics of the sigmoid function (or its second iterate) for all *β* ∈ R has not appeared in the literature. Additionally, this classification provides an introductory example to demonstrate the concepts and techniques of dynamical systems. We begin by using numerical techniques to determine the number of fixed points in the system and its possible periodic behavior. We then proceed by providing a formal proof of the full dynamical properties of (11) in Theorem 1 and the full dynamical properties of (12) in Theorem 2.

Using numerical techniques, we solve for the number of fixed points of the system. The number of fixed points of the function (11) depends on the magnitude of the parameter. For *β* > 0, there is no periodic behavior, so there are no additional fixed points of (12) that are not fixed points of (11). For −1 ≤ *β* ≤ 1, there is a single fixed point at *x*<sup>∗</sup> = 1/2, and for *β* > 1, there are three fixed points *c*0(*β*), 1/2, *c*1(*β*) in the interval [0, 1]. These fixed points satisfy 0 ≤ *c*0(*β*) < 1/2 < *c*1(*β*) ≤ 1, with *c*0(*β*) → 0 and *c*1(*β*) → 1 as *β* → ∞. For *β* < 0, we see periodic behavior in the system; there are fixed points of (12) that are not fixed points of (11). For *β* < −1, the function (11) has one fixed point at *x*<sup>∗</sup> = 1/2 and a periodic cycle *C* = {*c*0(*β*), *c*1(*β*)}. Both *c*0(*β*) and *c*1(*β*) are fixed points of (12), and these points are the same fixed points from the *β* > 1 regime, as (12) is an even function with respect to *β*.
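The fixed-point counts described above are easy to reproduce numerically. The following sketch is our illustration, not part of the paper's analysis; the function names and the grid-plus-bisection approach are our own choices. It locates the fixed points of the map (11) on [0, 1] for a sub-critical and a super-critical value of *β*:

```python
import math

def sig(x, beta):
    # sigma(2x - 1, 2*beta) = 1 / (1 + exp(-2*beta*(2x - 1))), the map (11)
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def fixed_points(beta, n_grid=2000, tol=1e-9):
    # Locate fixed points of x -> sig(x, beta) on [0, 1] by scanning for
    # sign changes of g(x) = sig(x, beta) - x, then refining by bisection.
    g = lambda x: sig(x, beta) - x
    xs = [i / n_grid for i in range(n_grid + 1)]
    roots = []
    for a, b in zip(xs, xs[1:]):
        if g(a) == 0.0:
            roots.append(a)
        elif g(a) * g(b) < 0.0:
            lo, hi = a, b
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if g(lo) * g(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
    return roots

print(fixed_points(0.7))  # a single fixed point at 1/2
print(fixed_points(1.2))  # three fixed points: c0(1.2), 1/2, c1(1.2)
```

For *β* = 0.7 this returns the single fixed point 1/2, while for *β* = 1.2 it returns three fixed points, with *c*0(1.2) ≈ 0.1707 and *c*1(1.2) = 1 − *c*0(1.2) by symmetry.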

Table 1 denotes the values of the derivatives at the fixed point 1/2 for *β* = ±1.

**Table 1.** Partial derivatives of (11) and (12) at the fixed point *x*<sup>∗</sup> = 1/2 for parameter values *β* = ±1. The derivatives of the function (11) are denoted using *σ* and the derivatives of (12) are denoted using *σ*2.


We now have enough information to provide a complete classification of the dynamics of the sigmoid function.

**Theorem 1** (Dynamics of sigmoid function)**.** *Consider the discrete dynamical system generated by (11)*

$$x \mapsto \sigma(2x - 1, 2\beta) = \frac{1}{1 + e^{-2\beta(2x - 1)}}.$$

*The full dynamics of the system (11) are as follows*


*4. For* |*β*| = 1*, the system has one non-hyperbolic fixed point at x*<sup>∗</sup> = 1/2 *which is asymptotically stable and attracting.*

*The system undergoes a PD bifurcation at β* = −1 *and a pitchfork bifurcation at β* = 1*.*

**Proof.** We will break the proof up into three parts. The first part of the proof is a linear stability analysis of the system, the second part is a stability analysis of the periodic points in the system, and the third part is an analysis of the bifurcations of the system. We begin with a linear stability analysis of the system at each fixed point. For *β* ≤ 1 the system has one fixed point *x*<sup>∗</sup> = 1/2 and for *β* > 1 the system has three fixed points *c*0, 1/2, *c*1. The derivative of *σ*(2*x* − 1, 2*β*) is *σx*(2*x* − 1, 2*β*) = 4*βσ*(2*x* − 1, 2*β*)(1 − *σ*(2*x* − 1, 2*β*)).

**Fixed point** *x*<sup>∗</sup> = 1/2 : The Jacobian of the system at the fixed point *x*<sup>∗</sup> = 1/2 is *σx*(2*x*<sup>∗</sup> − 1, 2*β*) = *β*. For *β* ≠ ±1, the fixed point *x*<sup>∗</sup> = 1/2 is hyperbolic and for *β* = ±1 the fixed point is non-hyperbolic. We classify the stability of the hyperbolic fixed point *x*<sup>∗</sup> = 1/2 using Theorem A2. For |*β*| < 1 the fixed point *x*<sup>∗</sup> = 1/2 is globally attracting as |*σx*(2*x*<sup>∗</sup> − 1, 2*β*)| < 1 and for |*β*| > 1 the fixed point *x*<sup>∗</sup> = 1/2 is globally repelling as |*σx*(2*x*<sup>∗</sup> − 1, 2*β*)| > 1. For *β* = ±1 we invoke Theorem A3 to check the stability of the fixed point. At *β* = −1 we have *σx*(2*x*<sup>∗</sup> − 1, 2*β*) = −1 and we need to check the Schwarzian derivative. The fixed point *x*<sup>∗</sup> = 1/2 is asymptotically stable for *β* = −1 by Theorem A3, as S*σ*(2*x* − 1, 2*β*) |*x*=*x*∗ = −8. For *β* = 1 we have *σx*(2*x*<sup>∗</sup> − 1, 2*β*) = 1 and we need to check the second and third derivatives at the fixed point. The fixed point *x*<sup>∗</sup> = 1/2 is asymptotically stable when *β* = 1 by Theorem A3 as *σxx*(2*x*<sup>∗</sup> − 1, 2*β*) = 0 and *σxxx*(2*x*<sup>∗</sup> − 1, 2*β*) = −8.

**Fixed points** *c*0, *c*<sup>1</sup> : These fixed points have the same behavior, so we have grouped them together in the analysis. When *β* > 1 there are two additional fixed points *c*0, *c*<sup>1</sup> of the system; both are attracting fixed points by Theorem A2 as |*σx*(2*ci* − 1, 2*β*)| < 1 for each *i* = 0, 1 and all *β* > 1. The stable sets are *W<sup>s</sup>*(*c*0) = [0, 1/2) and *W<sup>s</sup>*(*c*1) = (1/2, 1].

**Periodic points**: For *β* < −1 we see the two-cycle C = {*c*0, *c*1} with 0 < *c*0 < 1/2 < *c*1 < 1. Notice *σ*(2*c*<sup>0</sup> − 1, 2*β*) = *c*<sup>1</sup> and *σ*(2*c*<sup>1</sup> − 1, 2*β*) = *c*0. This two-cycle is stable since *c*<sup>0</sup> and *c*<sup>1</sup> are both stable fixed points of (12). The stable set is *W<sup>s</sup>*(C) = [0, 1/2) ∪ (1/2, 1].

At (*x*∗, *β*∗) = (1/2, 1) the system undergoes a pitchfork bifurcation as it satisfies the conditions in Theorem A5:

$$\begin{aligned} \sigma(2x\_{\ast} - 1, 2\beta\_{\ast}) &= 1/2 & \sigma\_{x}(2x\_{\ast} - 1, 2\beta\_{\ast}) &= 1 & \sigma\_{xx}(2x\_{\ast} - 1, 2\beta\_{\ast}) &= 0, \\ \sigma\_{\beta}(2x\_{\ast} - 1, 2\beta\_{\ast}) &= 0 & \sigma\_{x\beta}(2x\_{\ast} - 1, 2\beta\_{\ast}) &\neq 0 & \sigma\_{xxx}(2x\_{\ast} - 1, 2\beta\_{\ast}) &\neq 0. \end{aligned}$$

Similarly, at (*x*∗, *β*∗) = (1/2, −1) the system undergoes a period doubling bifurcation as it satisfies the conditions in Theorem A4:

$$\begin{aligned} \sigma(2x\_{\ast} - 1, 2\beta\_{\ast}) &= 1/2 & \sigma\_{x}(2x\_{\ast} - 1, 2\beta\_{\ast}) &= -1 & \sigma\_{xx}(2x\_{\ast} - 1, 2\beta\_{\ast}) &= 0, \\ \sigma\_{\beta}(2x\_{\ast} - 1, 2\beta\_{\ast}) &= 0 & \sigma\_{x\beta}(2x\_{\ast} - 1, 2\beta\_{\ast}) &\neq 0 & \sigma\_{xxx}(2x\_{\ast} - 1, 2\beta\_{\ast}) &\neq 0. \end{aligned}$$
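The key derivative computation in the proof, namely that the Jacobian *σx*(2*x*<sup>∗</sup> − 1, 2*β*) at the fixed point *x*<sup>∗</sup> = 1/2 equals *β*, can be sanity-checked with finite differences. This is a minimal sketch of our own, not part of the original analysis:

```python
import math

def sig(x, beta):
    # sigma(2x - 1, 2*beta), the map (11)
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def num_deriv(f, x, h=1e-6):
    # central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2.0 * h)

# The Jacobian at the fixed point x* = 1/2 equals beta, so the fixed point
# is attracting for |beta| < 1 and repelling for |beta| > 1.
for beta in (-1.2, -0.7, 0.7, 1.2):
    slope = num_deriv(lambda x, b=beta: sig(x, b), 0.5)
    print(f"beta = {beta:5.1f}  derivative at 1/2 ~ {slope:.6f}")
```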

We can fully classify the dynamics of (12) using the above theorem. We omit the proof as it is similar to the proof of Theorem 1.

**Theorem 2.** *The full dynamics of the system (12) are as follows*

*1. For* |*β*| < 1*, the system has one globally asymptotically stable fixed point at x*<sup>∗</sup> = 1/2*.*

*2. For* |*β*| > 1*, the system has one repelling fixed point at x*<sup>∗</sup> = 1/2 *and two asymptotically stable fixed points c*0(*β*) *and c*1(*β*) *with stable sets W<sup>s</sup>*(*c*0) = [0, 1/2) *and W<sup>s</sup>*(*c*1) = (1/2, 1]*.*

*3. For* |*β*| = 1*, the system has one non-hyperbolic fixed point at x*<sup>∗</sup> = 1/2 *which is asymptotically stable.*


*The system undergoes a pitchfork bifurcation at β* = ±1*. There are no p-periodic points for p* ≥ 2*.*

## *5.2. Sequential Dynamics*

To fully understand the dynamics of the equations defining the updates to *q*∗1 and *q*∗2 it suffices to track the evolution of the points *q*∗1(1) = *ζ* and *q*∗2(1) = *ξ*. The CAVI algorithm updates the terms sequentially, using the new values of the variables to calculate the others. We initialize the CAVI algorithm at the points *ζ*0, *ξ*0. The CAVI algorithm is then a dynamical system formed by sequential iterations of *σ*(2*x* − 1, 2*β*) starting from *ζ*0, *ξ*0. We can decouple the CAVI updates for *ξ<sup>k</sup>* and *ζ<sup>k</sup>* by looking at the second iterations. This decoupling is visualized in the diagram (14) below. The system formed by the sequential updates is equivalent to the following decoupled system

$$\begin{aligned} \zeta\_1 &= \sigma(2\xi\_0 - 1, 2\beta), \\ \zeta\_{k+1} &= \sigma(2\sigma(2\zeta\_k - 1, 2\beta) - 1, 2\beta), \quad k \ge 1, \\ \xi\_{k+1} &= \sigma(2\sigma(2\xi\_k - 1, 2\beta) - 1, 2\beta), \quad k \ge 0. \end{aligned} \tag{13}$$

We propose to investigate the dynamics of the sequential system (9) by studying the dynamics of individual subsequences *ζk*+<sup>1</sup> and *ξk*+<sup>1</sup> of the decoupled system (13). The dynamical properties of the individual subsequences follow from a combination of Theorem 1, Theorem 2 and other methods from Appendix A.

$$\begin{array}{ccccccc} \xi\_0 & \xrightarrow{\;\sigma\;} & \zeta\_1 & \xrightarrow{\;\sigma^2\;} & \zeta\_2 & \xrightarrow{\;\sigma^2\;} & \zeta\_3 \cdots \\ \xi\_0 & \xrightarrow{\;\sigma^2\;} & \xi\_1 & \xrightarrow{\;\sigma^2\;} & \xi\_2 & \xrightarrow{\;\sigma^2\;} & \xi\_3 \cdots \end{array} \tag{14}$$

Illustrations of the evolution of the dynamics of the sequential updates for various initializations and values of *β* are in Figures 3–6.
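The decoupling can also be observed by simply running the sequential updates. The sketch below is our illustration (the function and variable names are ours); it iterates system (9) for *β* = 1.2 and shows that the limit depends only on *ξ*0:

```python
import math

def sig(x, beta):
    # sigma(2x - 1, 2*beta), the map (11)
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def cavi_sequential(zeta0, xi0, beta, n_iter=50):
    # Sequential CAVI updates of system (9): zeta is refreshed first,
    # then xi is refreshed using the *new* value of zeta.
    zeta, xi = zeta0, xi0
    for _ in range(n_iter):
        zeta = sig(xi, beta)   # zeta_{k+1} = sigma(2 xi_k - 1, 2 beta)
        xi = sig(zeta, beta)   # xi_{k+1}   = sigma(2 zeta_{k+1} - 1, 2 beta)
    return zeta, xi

# Outside the Dobrushin regime (beta = 1.2) the limit depends only on xi0:
for z0, x0 in [(0.3, 0.3), (0.7, 0.3), (0.3, 0.7), (0.7, 0.7)]:
    zf, xf = cavi_sequential(z0, x0, 1.2)
    print((z0, x0), "->", (round(zf, 5), round(xf, 5)))
```

Initializations with *ξ*0 < 1/2 all reach the same fixed point near (*c*0, *c*0), regardless of *ζ*0, and symmetrically for *ξ*0 > 1/2.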

**Theorem 3** (CAVI dynamics)**.** *The dynamics of the CAVI system (9) are given by*

*1. For* |*β*| ≤ 1*, the system has one globally asymptotically stable fixed point* (1/2, 1/2)*.*

*2. For β* > 1*, the system has two locally asymptotically stable fixed points* (*c*0, *c*0) *and* (*c*1, *c*1) *with stable sets W<sup>s</sup>*((*c*0, *c*0)) = [0, 1] × [0, 1/2) *and W<sup>s</sup>*((*c*1, *c*1)) = [0, 1] × (1/2, 1]*, and one repelling fixed point* (1/2, 1/2)*.*

*3. For β* < −1*, the system has two locally asymptotically stable fixed points* (*c*1, *c*0) *and* (*c*0, *c*1) *with stable sets W<sup>s</sup>*((*c*1, *c*0)) = [0, 1] × [0, 1/2) *and W<sup>s</sup>*((*c*0, *c*1)) = [0, 1] × (1/2, 1]*, and one repelling fixed point* (1/2, 1/2)*.*


*where* 0 ≤ *c*<sup>0</sup> < 1/2 < *c*<sup>1</sup> ≤ 1 *are the fixed points of (11) in* [0, 1]*. The system undergoes a super-critical pitchfork bifurcation at β* = −1 *and again at β* = 1*. Furthermore the system has no p-periodic points for p* ≥ 2*.*

**Proof.** We construct the dynamics of the system (9) by tracing the behavior of the dynamics in the equivalent system (13). The dynamics of each of these subsequences are governed by the functions (11) and (12) and depend on the initialization *ξ*0. The behavior of each subsequence *ξk*<sup>+</sup>1, for *k* ≥ 0, is governed by Theorem 2. Similarly, the behavior of the subsequence *ζk*<sup>+</sup>1, for *k* ≥ 1, is governed by Theorem 2, with the additional point *ζ*<sup>1</sup> = *σ*(2*ξ*<sup>0</sup> − 1, 2*β*) dependent on Theorem 1. For |*β*| < 1, (11) has a globally stable fixed point at *x*<sup>∗</sup> = 1/2 and thus for all *ξ*0, *ζ*<sup>1</sup> = *σ*(2*ξ*<sup>0</sup> − 1, 2*β*) ∈ *W<sup>s</sup>*(1/2). It now follows from Theorem 2 that the only fixed point of the sequential system is (1/2, 1/2), which must be globally stable. For *β* = ±1, the fixed point *x*<sup>∗</sup> = 1/2 is asymptotically stable by Theorem A3. The system undergoes a super-critical pitchfork bifurcation at *β* = −1 and again at *β* = 1 as a consequence of its relation to (12). For *β* > 1, (11) bifurcates. We have the unstable fixed point *x*<sup>∗</sup> = 1/2, and the two locally stable fixed points, *c*<sup>0</sup> with stable set

*W<sup>s</sup>*(*c*0) = [0, 1/2), and *c*<sup>1</sup> with stable set *W<sup>s</sup>*(*c*1) = (1/2, 1]. For *ξ*<sup>0</sup> ∈ *W<sup>s</sup>*(*c*0) we have *ζ*<sup>1</sup> ∈ *W<sup>s</sup>*(*c*0) and *ξ*<sup>1</sup> ∈ *W<sup>s</sup>*(*c*0). It now follows from Theorem 2 that the system converges to (*c*0, *c*0) and that *W<sup>s</sup>*((*c*0, *c*0)) = [0, 1] × [0, 1/2). A similar argument shows the system converges to (*c*1, *c*1) for *ξ*<sup>0</sup> ∈ *W<sup>s</sup>*(*c*1) and *W<sup>s</sup>*((*c*1, *c*1)) = [0, 1] × (1/2, 1]. Lastly, (1/2, 1/2) is a repelling fixed point of the system since *x*<sup>∗</sup> = 1/2 is a repelling fixed point for both (11) and (12). For *β* < −1, (11) bifurcates. We have the unstable fixed point *x*<sup>∗</sup> = 1/2, and the stable two-cycle C = {*c*0, *c*1} with stable set *W<sup>s</sup>*(C) = [0, 1/2) ∪ (1/2, 1]. For any *ξ*<sup>0</sup> < 1/2 we have *ζ*<sup>1</sup> > 1/2 and *ξ*<sup>1</sup> < 1/2. It now follows from Theorem 2 that the system converges to (*c*1, *c*0) and that *W<sup>s</sup>*((*c*1, *c*0)) = [0, 1] × [0, 1/2). A similar argument shows the system converges to (*c*0, *c*1) for *ξ*<sup>0</sup> > 1/2 and *W<sup>s</sup>*((*c*0, *c*1)) = [0, 1] × (1/2, 1]. Lastly, (1/2, 1/2) is a repelling fixed point of the system since *x*<sup>∗</sup> = 1/2 is a repelling fixed point for both (11) and (12). The dynamics of (13) have no *p*-periodic points or cycles for *p* ≥ 2 as a consequence of the construction from (12).

**Figure 3.** A plot of the first 20 iterations of the CAVI algorithm at various initializations for *β* = −1.2. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.3; we see that *ζ<sup>k</sup>* converges to the local fixed point *c*1(1.2) = 0.82928 and *ξ<sup>k</sup>* converges to the local fixed point *c*0(1.2) = 0.17071. The upper right is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.7; we see that *ζ<sup>k</sup>* converges to the local fixed point *c*0(1.2) = 0.17071 and *ξ<sup>k</sup>* converges to the local fixed point *c*1(1.2) = 0.82928. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.3; we see that *ζ<sup>k</sup>* converges to the local fixed point *c*1(1.2) = 0.82928 and *ξ<sup>k</sup>* converges to the local fixed point *c*0(1.2) = 0.17071. The lower right plot is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.7; we see that *ζ<sup>k</sup>* converges to the local fixed point *c*0(1.2) = 0.17071 and *ξ<sup>k</sup>* converges to the local fixed point *c*1(1.2) = 0.82928.

**Figure 4.** A plot of the first 20 iterations of the CAVI algorithm at various initializations for *β* = −0.7. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.3; we see that both of these converge to the global fixed point 1/2. The upper right is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.7; we see that this initialization converges to the global fixed point 1/2. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.3; we see that this initialization converges to the global fixed point 1/2. The lower right plot is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.7; we see that both of these converge to the global fixed point 1/2.

**Figure 5.** A plot of the first 20 iterations of the CAVI algorithm at various initializations for *β* = 0.7. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.3; we see that both of these converge to the global fixed point 1/2. The upper right is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.7; we see that this initialization converges to the global fixed point 1/2. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.3; we see that this initialization converges to the global fixed point 1/2. The lower right plot is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.7; we see that both of these converge to the global fixed point 1/2.

**Figure 6.** A plot of the first 20 iterations of the CAVI algorithm at various initializations for *β* = 1.2. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.3; we see that both of these converge to the local fixed point *c*0(1.2) = 0.17071. The upper right is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.7; we see that this initialization converges to the local fixed point *c*1(1.2) = 0.82928. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.3; we see that this initialization converges to the local fixed point *c*0(1.2) = 0.17071. The lower right plot is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.7; we see that both of these converge to the local fixed point *c*1(1.2) = 0.82928.

## *5.3. Parallel Updates*

The system of parallel updates is defined by the one-step map *F* : R<sup>2</sup> → R<sup>2</sup>

$$\begin{pmatrix} \zeta \\ \xi \end{pmatrix} \mapsto F(\zeta, \xi) = \begin{pmatrix} \sigma(2\xi - 1, 2\beta) \\ \sigma(2\zeta - 1, 2\beta) \end{pmatrix}. \tag{15}$$

The dynamics of the parallel system are similar to the system studied in [36]. As we shall show below, the parallel system exhibits periodic behavior that the sequential system does not and it follows as a corollary that the systems are not locally topologically conjugate.

The parallelized CAVI algorithm is a dynamical system formed by iterations of *F* defined in (15). We shall decouple the parallelized CAVI updates for sequences *ξ<sup>k</sup>* and *ζ<sup>k</sup>* by looking at iterations of (12) acting on the sequences individually. This decoupling is visualized in diagram form

$$\begin{array}{ccccccccc} \zeta\_0 & & \zeta\_1 & & \zeta\_2 & & \zeta\_3 & \cdots \\ & \times & & \times & & \times & & \\ \xi\_0 & & \xi\_1 & & \xi\_2 & & \xi\_3 & \cdots \end{array} \tag{16}$$

where each cross is an application of *F*. The system formed by the parallel updates is equivalent to the following decoupled systems of even subsequences and odd subsequences. The even subsequences are

$$\zeta\_{2k} = \sigma(2\sigma(2\zeta\_{2(k-1)} - 1, 2\beta) - 1, 2\beta), \quad k \ge 1, \tag{17}$$

$$\xi\_{2k} = \sigma(2\sigma(2\xi\_{2(k-1)} - 1, 2\beta) - 1, 2\beta), \quad k \ge 1. \tag{18}$$

The odd subsequences are

$$\zeta\_{2k+1} = \begin{cases} \sigma(2\xi\_0 - 1, 2\beta), & k = 0 \\ \sigma(2\sigma(2\zeta\_{2k-1} - 1, 2\beta) - 1, 2\beta), & k \ge 1, \end{cases} \tag{19}$$

$$\xi\_{2k+1} = \begin{cases} \sigma(2\zeta\_0 - 1, 2\beta), & k = 0 \\ \sigma(2\sigma(2\xi\_{2k-1} - 1, 2\beta) - 1, 2\beta), & k \ge 1. \end{cases} \tag{20}$$

Following a similar approach to the one used to study the sequential dynamics, we investigate the dynamics of the parallel system (15) by studying the dynamics of four individual subsequences (17)–(20) of the decoupled system given by diagram (16). The dynamical properties of the individual subsequences follow from a combination of Theorem 1, Theorem 2 and other methods from Appendix A. Illustrations of the evolution of the dynamics of the parallel updates for various initializations and values of *β* are in Figures 7–12.
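The periodic behavior of the parallel updates is easy to observe numerically. The sketch below is our illustration (names are ours); it iterates the one-step map (15) from a mixed initialization with *β* = 1.2 and shows that consecutive iterates alternate between the two points of a 2-cycle rather than converging:

```python
import math

def sig(x, beta):
    # sigma(2x - 1, 2*beta), the map (11)
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def cavi_parallel(zeta0, xi0, beta, n_iter):
    # Parallel (block) CAVI updates of system (10): both coordinates are
    # refreshed simultaneously from the previous iterate.
    zeta, xi = zeta0, xi0
    for _ in range(n_iter):
        zeta, xi = sig(xi, beta), sig(zeta, beta)
    return zeta, xi

# With beta = 1.2 and a mixed initialization, consecutive iterates alternate
# between the two points of the 2-cycle {(c1, c0), (c0, c1)}.
even = cavi_parallel(0.7, 0.3, 1.2, 200)
odd = cavi_parallel(0.7, 0.3, 1.2, 201)
print(even)
print(odd)
```

The even and odd iterates settle on two distinct points that map to each other under *F*, which is exactly the cycling behavior described for the block updates.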

**Figure 7.** A plot of the first 20 iterations of the parallel update CAVI algorithm at various initializations for *β* = −1.2. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.3; we see that this initialization converges to the two cycle C<sup>0</sup> = {(*c*0, *c*0),(*c*1, *c*1)}. The upper right is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.7; we see that *ζ<sup>k</sup>* converges to *c*0(1.2) ≈ 0.17071 and *ξ<sup>k</sup>* converges to *c*1(1.2) ≈ 0.82928. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.7; we see that this initialization converges to the two cycle C<sup>0</sup>. The lower right is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.3; we see that *ζ<sup>k</sup>* converges to *c*1(1.2) ≈ 0.82928 and *ξ<sup>k</sup>* converges to *c*0(1.2) ≈ 0.17071.

**Figure 8.** A plot of the first 20 iterations of the parallel update CAVI algorithm at various initializations for *β* = −0.7. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.3; we see that both of these converge to the global fixed point 1/2. The upper right is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.7; we see that this initialization converges to the global fixed point 1/2. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.3; we see that this initialization converges to the global fixed point 1/2. The lower right plot is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.7; we see that both of these converge to the global fixed point 1/2.

**Figure 9.** A plot of the first 20 iterations of the parallel update CAVI algorithm at various initializations for *β* = 0.7. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.3; we see that both of these converge to the global fixed point 1/2. The upper right is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.7; we see that this initialization converges to the global fixed point 1/2. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.3; we see that this initialization converges to the global fixed point 1/2. The lower right plot is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.7; we see that both of these converge to the global fixed point 1/2.

**Figure 10.** A plot of the first 20 iterations of the parallel update CAVI algorithm at various initializations for *β* = 1.2. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.3; we see that both of these converge to *c*0(1.2) ≈ 0.17071. The upper right is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.7; we see that this initialization converges to the two cycle C<sup>1</sup> = {(*c*1, *c*0),(*c*0, *c*1)}. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.3; we see that this initialization converges to the two cycle C<sup>1</sup> = {(*c*1, *c*0),(*c*0, *c*1)}. The lower right plot is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.7; we see that both of these converge to *c*1(1.2) ≈ 0.82928.

**Figure 11.** A plot of the first 20 iterations of the parallel update CAVI algorithm at various initializations for *β* = −1.2. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.5; we see that this converges to the two-cycle C<sup>2</sup> = {(*c*0, 1/2),(1/2, *c*1)}. The upper right is an initialization of *ζ*<sup>0</sup> = 0.5 and *ξ*<sup>0</sup> = 0.3; we see that this initialization converges to the two cycle C<sup>3</sup> = {(*c*1, 1/2),(1/2, *c*0)}. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.5; we see that this initialization converges to the two cycle C3. The lower right plot is an initialization of *ζ*<sup>0</sup> = 0.5 and *ξ*<sup>0</sup> = 0.7; we see that this converges to the two-cycle C2.

**Figure 12.** A plot of the first 20 iterations of the parallel update CAVI algorithm at various initializations for *β* = 1.2. In each of the plots the *ζ* updates are black and the *ξ* updates are red. The upper left plot is an initialization of *ζ*<sup>0</sup> = 0.3 and *ξ*<sup>0</sup> = 0.5; we see that this converges to the two-cycle C<sup>4</sup> = {(*c*0, 1/2),(1/2, *c*0)}. The upper right is an initialization of *ζ*<sup>0</sup> = 0.5 and *ξ*<sup>0</sup> = 0.3; we see that this initialization converges to the two cycle C4. The lower left is an initialization of *ζ*<sup>0</sup> = 0.7 and *ξ*<sup>0</sup> = 0.5; we see that this initialization converges to the two cycle C<sup>5</sup> = {(*c*1, 1/2),(1/2, *c*1)}. The lower right plot is an initialization of *ζ*<sup>0</sup> = 0.5 and *ξ*<sup>0</sup> = 0.7; we see that this converges to the two-cycle C5.

We now present the main result for the parallel dynamics.

**Theorem 4** (Parallel Dynamics)**.** *The dynamics of the parallel system (10) are as follows*

*1. For β* < −1*, the system has two locally asymptotically stable fixed points* (*c*1, *c*0) *and* (*c*0, *c*1)*, and one unstable fixed point* (1/2, 1/2)*, where c*<sup>0</sup> *and c*<sup>1</sup> *are the fixed points of (11). Furthermore, the system exhibits periodic behavior in the form of 2-cycles: the asymptotically stable 2-cycle* C<sup>1</sup> = {(*c*0, *c*0),(*c*1, *c*1)} *and the asymptotically unstable 2-cycles*

$$\mathcal{C}\_2 = \{ (1/2, c\_1), (c\_0, 1/2) \} \text{ and } \mathcal{C}\_3 = \{ (1/2, c\_0), (c\_1, 1/2) \}.$$

*The stable sets are*

$$\begin{aligned} W^s(c\_0, c\_1) &= [0, 1/2) \times (1/2, 1] \\ W^s(c\_1, c\_0) &= (1/2, 1] \times [0, 1/2) \\ W^s(\mathcal{C}\_1) &= ([0, 1/2) \times [0, 1/2)) \cup ((1/2, 1] \times (1/2, 1]) \\ W^s(\mathcal{C}\_2) &= ([0, 1/2) \times \{1/2\}) \cup (\{1/2\} \times (1/2, 1]) \\ W^s(\mathcal{C}\_3) &= (\{1/2\} \times [0, 1/2)) \cup ((1/2, 1] \times \{1/2\}) \end{aligned}$$


*2. For β* > 1*, the system has two locally asymptotically stable fixed points* (*c*0, *c*0) *and* (*c*1, *c*1)*, and one unstable fixed point* (1/2, 1/2)*. Furthermore, the system exhibits an asymptotically stable 2-cycle* C<sup>3</sup> = {(*c*1, *c*0),(*c*0, *c*1)} *and asymptotically unstable 2-cycles* C<sup>4</sup> = {(*c*0, 1/2),(1/2, *c*0)} *and* C<sup>5</sup> = {(*c*1, 1/2),(1/2, *c*1)}*. The stable sets are*

$$\begin{aligned} W^s(c\_0, c\_0) &= [0, 1/2) \times [0, 1/2) \\ W^s(c\_1, c\_1) &= (1/2, 1] \times (1/2, 1] \\ W^s(\mathcal{C}\_3) &= ([0, 1/2) \times (1/2, 1]) \cup ((1/2, 1] \times [0, 1/2)) \\ W^s(\mathcal{C}\_4) &= ([0, 1/2) \times \{1/2\}) \cup (\{1/2\} \times [0, 1/2)) \\ W^s(\mathcal{C}\_5) &= (\{1/2\} \times (1/2, 1]) \cup ((1/2, 1] \times \{1/2\}) \end{aligned}$$

*3. For* |*β*| ≤ 1*, the system has one globally asymptotically stable fixed point* (1/2, 1/2)*.*

*The system has no p-periodic points for p* > 2*. The system undergoes a PD bifurcation at β* = −1 *and a pitchfork bifurcation at β* = 1*.*

**Proof.** The dynamics of the system defined by *F* in (15) are equivalent to the dynamics of the system generated by the subsequences (17)–(20). The dynamics of each of these subsequences are governed by the functions (11) and (12). By Theorem 1, we have the behavior of each of the subsequences (17)–(20). For |*β*| < 1, (11) has a globally stable fixed point at *x*<sup>∗</sup> = 1/2 and thus the only fixed point of the parallel system is (1/2, 1/2), which must be globally stable. For *β* = ±1, the fixed point *x*<sup>∗</sup> = 1/2 is asymptotically stable by Theorem A3.

For *β* > 1, (11) bifurcates. We have the unstable fixed point *x*<sup>∗</sup> = 1/2, and the two locally stable fixed points, *c*<sup>0</sup> with stable set *W<sup>s</sup>*(*c*0) = [0, 1/2), and *c*<sup>1</sup> with stable set *W<sup>s</sup>*(*c*1) = (1/2, 1]. Returning to the system generated by *F*, if we consider the initialization (*ζ*0, *ξ*0) = (*c*0, *c*0) then by the sequence construction of *ζn*, given in (17) and (19), we see that *ζ<sup>n</sup>* = *c*<sup>0</sup> for *n* ≥ 1, as *c*<sup>0</sup> is a fixed point of (11) for *β* > 1. Similarly, using the sequence construction of *ξn*, given in (18) and (20), we see that *ξ<sup>n</sup>* = *c*<sup>0</sup> for *n* ≥ 1. Therefore, (*c*0, *c*0) is a fixed point. An analogous argument shows that (*c*1, *c*1) is also a fixed point. The parallel system has the stable fixed points (*c*0, *c*0) with stable set *W<sup>s</sup>*(*c*0, *c*0) = *W<sup>s</sup>*(*c*0) × *W<sup>s</sup>*(*c*0) and (*c*1, *c*1) with stable set *W<sup>s</sup>*(*c*1, *c*1) = *W<sup>s</sup>*(*c*1) × *W<sup>s</sup>*(*c*1). After the bifurcation at *β* = 1 the parallel system also contains 2-cycles. Using the sequence construction we see that C<sup>3</sup> = {(*c*1, *c*0),(*c*0, *c*1)} is an asymptotically stable 2-cycle in the parallel system, with stable set *W<sup>s</sup>*(C3) = ((1/2, 1] × [0, 1/2)) ∪ ([0, 1/2) × (1/2, 1]). Additionally, we have two asymptotically unstable 2-cycles C<sup>4</sup> = {(*c*0, 1/2),(1/2, *c*0)} and C<sup>5</sup> = {(*c*1, 1/2),(1/2, *c*1)}. Perturbing the 1/2 coordinate in an unstable cycle pushes it into the basin of attraction of one of the fixed points or of the asymptotically stable 2-cycle. The stable sets are *W<sup>s</sup>*(C4) = ([0, 1/2) × {1/2}) ∪ ({1/2} × [0, 1/2)) and *W<sup>s</sup>*(C5) = ({1/2} × (1/2, 1]) ∪ ((1/2, 1] × {1/2}). The dynamics of *F* have no *p*-periodic points or cycles for *p* > 2 as a consequence of the construction from (12).

For *β* < −1, (11) bifurcates. We have the unstable fixed point *x*<sup>∗</sup> = 1/2, and the stable two-cycle C = {*c*0, *c*1} with stable set *W<sup>s</sup>*(C) = [0, 1/2) ∪ (1/2, 1]. Returning to the system generated by *F*, if we consider the initialization (*ζ*0, *ξ*0) = (*c*0, *c*1) then by the sequence construction of *ζn*, given in (17) and (19), we see that *ζ<sup>n</sup>* = *c*<sup>0</sup> for *n* ≥ 1, as C is a 2-cycle of (11) for *β* < −1. Similarly, using the sequence construction of *ξn*, given in (18) and (20), we see that *ξ<sup>n</sup>* = *c*<sup>1</sup> for *n* ≥ 1. Therefore, (*c*0, *c*1) is a fixed point. An analogous argument shows that (*c*1, *c*0) is also a fixed point. The parallel system has the stable fixed points (*c*0, *c*1) with stable set *W<sup>s</sup>*(*c*0, *c*1) = *W<sup>s</sup>*(*c*0) × *W<sup>s</sup>*(*c*1) and (*c*1, *c*0) with stable set *W<sup>s</sup>*(*c*1, *c*0) = *W<sup>s</sup>*(*c*1) × *W<sup>s</sup>*(*c*0), where *W<sup>s</sup>*(*c*0) = [0, 1/2) and *W<sup>s</sup>*(*c*1) = (1/2, 1]. After the bifurcation at *β* = −1 the parallel system also contains 2-cycles. Using the sequence construction we see that C<sup>1</sup> = {(*c*0, *c*0),(*c*1, *c*1)} is an asymptotically stable 2-cycle in the parallel system, with stable set *W<sup>s</sup>*(C1) = (*W<sup>s</sup>*(*c*0) × *W<sup>s</sup>*(*c*0)) ∪ (*W<sup>s</sup>*(*c*1) × *W<sup>s</sup>*(*c*1)). Additionally, we have two asymptotically unstable 2-cycles C<sup>2</sup> = {(*c*0, 1/2),(1/2, *c*1)} and C<sup>3</sup> = {(*c*1, 1/2),(1/2, *c*0)}. Perturbing the 1/2 coordinate in an unstable cycle pushes it into the basin of attraction of one of the fixed points or of the asymptotically stable 2-cycle. The stable sets are *W<sup>s</sup>*(C2) = ([0, 1/2) × {1/2}) ∪ ({1/2} × (1/2, 1]) and *W<sup>s</sup>*(C3) = ({1/2} × [0, 1/2)) ∪ ((1/2, 1] × {1/2}). The dynamics of *F* have no *p*-periodic points or cycles for *p* > 2 as a consequence of the construction from (12).

This completes the characterization of the dynamics of *F* for *β* ∈ R.

## *5.4. A Comparison of the Dynamics*

We end the section with a comparison of the dynamical properties of the sequential system in Theorem 3 and the parallel system in Theorem 4. The main difference between the two is the presence of 2-cycles in the parallel system when |*β*| > 1. This behavior stems from the difference between the sequential and parallel implementations of the CAVI. Looking closely at the update diagrams for the two systems reveals the key difference that produces these 2-cycles. The decoupled sequential system is

$$\xi\_0 \to \zeta\_1 \to \xi\_1 \to \zeta\_2 \to \xi\_2 \to \zeta\_3 \to \cdots$$

and the decoupled parallel system is

$$\zeta\_0 \to \xi\_1 \to \zeta\_2 \to \xi\_3 \to \cdots \qquad \xi\_0 \to \zeta\_1 \to \xi\_2 \to \zeta\_3 \to \cdots$$

The major difference between these diagrams is how the individual update sequences begin. Notice that *ζ*0 plays no role in updating the sequential system, as both the *ζk* update sequence and the *ξk* update sequence depend only on the choice of *ξ*0. Even after rewriting the sequential updates in terms of individual sequences, the system is not truly decoupled, as both sequences depend on a common starting point. This precisely explains the behavior that we see in the system relative to the sigmoid function dynamics in Theorems 1 and 2. Compare this to the parallel system. Here *ζ*0 is involved in updating both the odd *ξ*2*k*+1 subsequence and the even *ζ*2*k* subsequence. Furthermore, *ξ*0 remains involved by controlling the updates for the even *ξ*2*k* subsequence and the odd *ζ*2*k*+1 subsequence. This additional flexibility allows the parallel system to develop periodic behavior outside of the Dobrushin regime (−1 ≤ *β* ≤ 1).

As an example, we compare the sequential algorithm to the parallel algorithm for *β* = 1.2. We begin with the sequential algorithm. For *β* = 1.2, consider initializing the sequential system at (*ζ*0, *ξ*0) = (0.7, 0.3). The sequential system updates are fully determined by *ξ*0, so for *ξ*0 = 0.3 it follows from Theorem 1 that an application of the function (11) will place *ζ*1 ∈ *Ws*(*c*0). At this point, the system can be evolved by applying (12) to the independent sequences for *ζ* and *ξ* as given in (13). The dynamics of the system are now controlled by the function (12). From this initialization the system will converge to the fixed point (*c*0, *c*0) = (0.17071, 0.17071), as shown in Figure 6.

Contrast this with the behavior of the parallel system, in which the updates are determined by both *ξ*0 and *ζ*0. For *β* = 1.2, consider initializing the parallel system at (*ζ*0, *ξ*0) = (0.7, 0.3). It follows from Theorem 1 that an application of the function (11) will place *ζ*1 ∈ *Ws*(*c*0) and *ξ*1 ∈ *Ws*(*c*1). Successive updates will cause the sequences *ζk* and *ξk* to flip back and forth between the domains *Ws*(*c*0) and *Ws*(*c*1), until the system settles into the 2-cycle C1 = {(*c*0, *c*1), (*c*1, *c*0)} = {(0.17071, 0.82928), (0.82928, 0.17071)}, as seen in Figure 10.
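The contrast between the two schemes can be reproduced numerically. The sketch below is ours, not the authors' code: it assumes the scalar update map of (11) takes the sigmoid form *x* ↦ 1/(1 + *e*<sup>−2*β*(2*x*−1)</sup>), which is consistent with the fixed-point values 0.17071 and 0.82928 quoted above, and all function names are illustrative.

```python
import math

def g(x, beta):
    # Assumed scalar CAVI update map of (11): a sigmoid whose fixed points
    # at beta = 1.2 match the values c0 = 0.17071 and c1 = 0.82928 above.
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def sequential_cavi(zeta, xi, beta, n_iter=100):
    # Sequential scheme: zeta uses the latest xi, then xi uses the new zeta.
    for _ in range(n_iter):
        zeta = g(xi, beta)
        xi = g(zeta, beta)
    return zeta, xi

def parallel_cavi(zeta, xi, beta, n_iter=100):
    # Parallel scheme: both coordinates are updated from the previous iterate.
    for _ in range(n_iter):
        zeta, xi = g(xi, beta), g(zeta, beta)
    return zeta, xi

beta = 1.2
print(sequential_cavi(0.7, 0.3, beta))  # settles near (c0, c0)
print(parallel_cavi(0.7, 0.3, beta))    # the 2-cycle, sampled at an even step: near (c1, c0)
```

Under this assumed map, the parallel run from (0.7, 0.3) never leaves the 2-cycle: the even and odd subsequences are driven by *ζ*0 and *ξ*0 separately, exactly as the update diagrams indicate.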

This simple example highlights the danger of naively parallelizing the CAVI algorithm. The convergence properties of a parallel version of the CAVI algorithm depend heavily on the model's CAVI update equations. In the case of the Ising model, we have demonstrated that for certain parameter regimes the parallel implementation of the algorithm can fail to converge, due to the dependence of the algorithm on both *ζ*0 and *ξ*0.

#### **6. Edward–Sokal Coupling**

One method of improving convergence in Markov chains is the use of probabilistic couplings. The Edward–Sokal (ES) coupling is a coupling of two statistical physics models, the random cluster model and the Potts model (a generalization of the Ising model) [37]. Running a Markov chain on the ES coupling leads to improved mixing properties compared to the equivalent Potts and random cluster models [33]. Motivated by these findings in the Markov chain literature, we ask a similar question: can the convergence properties of mean-field VI be improved by using the ES coupling in place of the Ising model? In this section we investigate this idea numerically. We first introduce the Edward–Sokal coupling following [37]. We then introduce a variational family for the Edward–Sokal coupling and derive the variational updates for this model. Our findings suggest that the variational updates converge to a unique solution over a larger parameter range than the Dobrushin regime of the corresponding Ising measure.

## *6.1. Random Cluster Model*

Let *G* = (*V*, *E*) be a finite graph, and let *e* = (*x*, *y*) ∈ *E* denote an edge in *G* with endpoints *x*, *y* ∈ *V*. Let Σ = {1, 2, . . . , *q*}<sup>*V*</sup>, let Ω = {0, 1}<sup>*E*</sup>, and let F denote the power set of Ω. The random cluster model is a two-parameter probability measure on (Ω, F), with an edge weight parameter *p* ∈ [0, 1] and a cluster weight parameter *q* ∈ {2, 3, . . .}, given by

$$\phi\_{p,q}(\omega) \propto \left\{ \prod\_{e \in E} p^{\omega(e)} (1 - p)^{(1 - \omega(e))} \right\} q^{\kappa(\omega)},$$

where *κ*(*ω*) denotes the number of connected components in the subgraph corresponding to *ω*. The partition function for the random cluster model is

$$\mathcal{Z}\_{\mathcal{R}} = \sum\_{\omega \in \Omega} \left\{ \prod\_{e \in E} p^{\omega(e)} (1 - p)^{(1 - \omega(e))} \right\} q^{\kappa(\omega)}.$$

For *q* = 2, the random cluster model reduces to the Ising model on *G*. The Edward–Sokal coupling is a probability measure *μ* on Σ × Ω given by

$$\mu(\sigma, \omega) \propto \prod\_{e \in E} \left[ (1 - p) \delta\_{\omega(e), 0} + p \, \delta\_{\omega(e), 1} \delta\_{e}(\sigma) \right],\tag{21}$$

where *δa*,*<sup>b</sup>* = 1(*a* = *b*), and *δe*(*σ*) = 1(*σ<sup>x</sup>* = *σy*), for *e* = (*x*, *y*) ∈ *E*.

It is well known that in the special case *p* = 1 − *e*<sup>−*β*</sup> and *q* = 2, the Σ-marginal of the ES coupling is the Ising model and the Ω-marginal is the random cluster model [37]. We are interested in better understanding how the convergence of the CAVI algorithm on the ES coupling compares to the convergence of the CAVI algorithm on the Ising model.
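On the two-node graph with a single edge, the Σ-marginal identity can be checked by direct enumeration. The sketch below (our own; the function names are illustrative) compares the normalized Σ-marginal of (21), with *p* = 1 − *e*<sup>−*β*</sup> and *q* = 2, against the two-node Ising measure that puts weight *e*<sup>*β*</sup> on agreeing spins:

```python
import math
from itertools import product

def es_sigma_marginal(beta):
    # Sigma-marginal of the ES coupling (21) on the two-node graph with the
    # single edge e = (1, 2), with p = 1 - exp(-beta) and q = 2 spin values.
    p = 1.0 - math.exp(-beta)
    weights = {}
    for sigma in product([1, 2], repeat=2):
        total = 0.0
        for omega in (0, 1):
            if omega == 0:
                total += 1.0 - p          # (1 - p) * delta_{omega(e), 0}
            elif sigma[0] == sigma[1]:
                total += p                # p * delta_{omega(e), 1} * delta_e(sigma)
        weights[sigma] = total
    Z = sum(weights.values())
    return {s: w / Z for s, w in weights.items()}

def ising(beta):
    # Two-node Ising (q = 2 Potts) measure: weight exp(beta) when the spins agree.
    weights = {s: math.exp(beta * (s[0] == s[1])) for s in product([1, 2], repeat=2)}
    Z = sum(weights.values())
    return {s: w / Z for s, w in weights.items()}

marg, ref = es_sigma_marginal(5.0), ising(5.0)
print(max(abs(marg[s] - ref[s]) for s in marg))  # the two distributions agree up to floating point
```

Summing (21) over *ω*(*e*) ∈ {0, 1} gives weight (1 − *p*) + *p δe*(*σ*), which with *p* = 1 − *e*<sup>−*β*</sup> is proportional to *e*<sup>*βδe*(*σ*)</sup>, so the agreement is exact.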

#### *6.2. VI Objective Function*

To calculate the VI updates for each variable, we make use of the alternative characterization of the ES coupling

$$
\mu(\sigma, \omega) \propto \psi(\sigma) \phi\_{p,1}(\omega) 1\_F(\sigma, \omega),
$$

where *ψ* is the uniform measure on Σ and *φp*,1(*ω*) is a product measure on Ω,

$$\phi\_{p,1}(\omega) = \prod\_{e \in E} p^{\omega(e)} (1 - p)^{(1 - \omega(e))} \tag{22}$$

and

$$F = \{ (\sigma, \omega) : \omega(e) = 1 \implies \delta\_{e}(\sigma) = 1 \text{ for all } e \in E \}. \tag{23}$$

The variational family that we will be optimizing over is

$$q(\sigma,\omega) = q\_1(\sigma\_1)q\_2(\sigma\_2)q\_0(\omega)1\_{F}(\sigma,\omega). \tag{24}$$

We have added the indicator on the set *F* to eliminate the configurations (*σ*, *ω*) that are not well defined in the variational objective. We will use the convention that 0 log(0) = 0.

## *6.3. VI Updates*

The ELBO that corresponds to the variational family (24) is

$$\begin{aligned} \text{ELBO}(\mathbf{x}\_{1}, \mathbf{x}\_{2}, y, p) &= \quad \mathbf{x}\_{1} \mathbf{x}\_{2} y \log(\mathbf{x}\_{1} \mathbf{x}\_{2} y) - \mathbf{x}\_{1} \mathbf{x}\_{2} y \log(1 - p) \\ &+ \quad (1 - \mathbf{x}\_{1}) \mathbf{x}\_{2} y \log((1 - \mathbf{x}\_{1}) \mathbf{x}\_{2} y) - (1 - \mathbf{x}\_{1}) \mathbf{x}\_{2} y \log(1 - p) \\ &+ \quad \mathbf{x}\_{1} (1 - \mathbf{x}\_{2}) y \log(\mathbf{x}\_{1} (1 - \mathbf{x}\_{2}) y) - \mathbf{x}\_{1} (1 - \mathbf{x}\_{2}) y \log(1 - p) \\ &+ \quad (1 - \mathbf{x}\_{1}) (1 - \mathbf{x}\_{2}) y \log((1 - \mathbf{x}\_{1})(1 - \mathbf{x}\_{2}) y) - (1 - \mathbf{x}\_{1})(1 - \mathbf{x}\_{2}) y \log(1 - p) \\ &+ \quad \mathbf{x}\_{1} \mathbf{x}\_{2} (1 - y) \log(\mathbf{x}\_{1} \mathbf{x}\_{2} (1 - y)) - \mathbf{x}\_{1} \mathbf{x}\_{2} (1 - y) \log(p) \\ &+ \quad (1 - \mathbf{x}\_{1}) (1 - \mathbf{x}\_{2}) (1 - y) \log((1 - \mathbf{x}\_{1})(1 - \mathbf{x}\_{2})(1 - y)) - (1 - \mathbf{x}\_{1})(1 - \mathbf{x}\_{2})(1 - y) \log(p). \end{aligned}$$

Taking the derivative with respect to *x*<sup>1</sup> and simplifying gives us

$$\begin{split} \text{ELBO}\_{1}(\mathbf{x}\_{1}, \mathbf{x}\_{2}, y, p) &= \quad y \log\left(\frac{\mathbf{x}\_{1}}{1 - \mathbf{x}\_{1}}\right) + (1 - y) \log\left(\frac{1}{1 - \mathbf{x}\_{1}}\right) \\ &+ \quad \mathbf{x}\_{2}(1 - y) \log(\mathbf{x}\_{1}(1 - \mathbf{x}\_{1})) + \mathbf{x}\_{2}(1 - y) \log\left(\frac{\mathbf{x}\_{2}(1 - \mathbf{x}\_{2})(1 - y)^{2}}{p^{2}}\right) \\ &+ \quad (1 - y)\log\left(\frac{p}{(1 - \mathbf{x}\_{2})(1 - y)}\right) + (2\mathbf{x}\_{2} - 1)(1 - y). \end{split}$$

Taking the derivative with respect to *x*<sup>2</sup> and simplifying gives us

$$\begin{split} \text{ELBO}\_2(\mathbf{x}\_1, \mathbf{x}\_2, y, p) &= \quad y \log \left( \frac{\mathbf{x}\_2}{1 - \mathbf{x}\_2} \right) + (1 - y) \log \left( \frac{1}{1 - \mathbf{x}\_2} \right) \\ &+ \quad \mathbf{x}\_1(1 - y) \log(\mathbf{x}\_2(1 - \mathbf{x}\_2)) + \mathbf{x}\_1(1 - y) \log \left( \frac{\mathbf{x}\_1(1 - \mathbf{x}\_1)(1 - y)^2}{p^2} \right) \\ &+ \quad (1 - y)\log \left( \frac{p}{(1 - \mathbf{x}\_1)(1 - y)} \right) + (2\mathbf{x}\_1 - 1)(1 - y). \end{split}$$

Taking the derivative with respect to *y* and simplifying gives us

$$\begin{split} \text{ELBO}\_{y}(\mathbf{x}\_{1},\mathbf{x}\_{2},y,p) &= \quad \mathbf{x}\_{1}\mathbf{x}\_{2}\log\left(\frac{y}{1-y}\right) + \mathbf{x}\_{1}\mathbf{x}\_{2}\log\left(\frac{p}{1-p}\right) \\ &+ \quad (1-\mathbf{x}\_{1})(1-\mathbf{x}\_{2})\log\left(\frac{y}{1-y}\right) + (1-\mathbf{x}\_{1})(1-\mathbf{x}\_{2})\log\left(\frac{p}{1-p}\right) \\ &+ \quad (1-\mathbf{x}\_{1})\mathbf{x}\_{2}\log\left(\frac{(1-\mathbf{x}\_{1})\mathbf{x}\_{2}y}{1-p}\right) + \mathbf{x}\_{1}(1-\mathbf{x}\_{2})\log\left(\frac{\mathbf{x}\_{1}(1-\mathbf{x}\_{2})y}{1-p}\right) \\ &+ \quad (1-\mathbf{x}\_{1})\mathbf{x}\_{2} + \mathbf{x}\_{1}(1-\mathbf{x}\_{2}). \end{split}$$

The absence of closed-form updates for any of the variables limits our ability to study the convergence of the system with classical dynamical systems techniques. Instead, we examine the long-run behavior of the system by plotting 100 iterations of the CAVI updates, which are generated from the following system:

$$\begin{array}{lll} x\_1(t+1) &=& \operatorname\*{argmin}\_{z \in (0,1)} |\operatorname{ELBO}\_1(z, x\_2(t), y(t), p)|,\\ x\_2(t+1) &=& \operatorname\*{argmin}\_{z \in (0,1)} |\operatorname{ELBO}\_2(x\_1(t+1), z, y(t), p)|,\\ y(t+1) &=& \operatorname\*{argmin}\_{z \in (0,1)} |\operatorname{ELBO}\_y(x\_1(t+1), x\_2(t+1), z, p)|.\end{array}$$

We generate the argmin over the free variable *z* by a line search with step size Δ = 10<sup>−6</sup>. Running these simulations, we find that the iterates *x*1(*t*), *x*2(*t*), *y*(*t*) converge to a global solution within about *T* = 20 time steps from any initialization *x*1(0), *x*2(0), *y*(0) ∈ (0, 1) and any *β* > 0. It is evident that, using the ES coupling, we obtain global convergence of the algorithm outside of the Dobrushin regime of the corresponding paramagnetic Ising model. The figures depicting the simulation results on the convergence of the variational inference algorithm in the Edward–Sokal coupling can be found below in Figures 13–16.
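The scheme above can be sketched directly from the derivative expressions ELBO1, ELBO2, and ELBOy as written. The code below is a minimal sketch, not the authors' implementation: it uses a much coarser line-search step (10<sup>−3</sup> rather than 10<sup>−6</sup>) for speed, and all function names are our own.

```python
import math

def elbo1(x1, x2, y, p):
    # d(ELBO)/d(x1), transcribed from the expression above.
    return (y * math.log(x1 / (1 - x1))
            - (1 - y) * math.log(1 - x1)
            + x2 * (1 - y) * math.log(x1 * (1 - x1))
            + x2 * (1 - y) * math.log(x2 * (1 - x2) * (1 - y) ** 2 / p ** 2)
            + (1 - y) * math.log(p / ((1 - x2) * (1 - y)))
            + (2 * x2 - 1) * (1 - y))

def elbo2(x1, x2, y, p):
    # d(ELBO)/d(x2) is d(ELBO)/d(x1) with the roles of x1 and x2 swapped.
    return elbo1(x2, x1, y, p)

def elboy(x1, x2, y, p):
    # d(ELBO)/dy, transcribed from the expression above (terms grouped).
    return ((x1 * x2 + (1 - x1) * (1 - x2))
            * (math.log(y / (1 - y)) + math.log(p / (1 - p)))
            + (1 - x1) * x2 * math.log((1 - x1) * x2 * y / (1 - p))
            + x1 * (1 - x2) * math.log(x1 * (1 - x2) * y / (1 - p))
            + (1 - x1) * x2 + x1 * (1 - x2))

def line_search(f, step=1e-3):
    # argmin of |f(z)| over a grid of (0, 1); coarser than the paper's 1e-6 step.
    grid = [step * k for k in range(1, int(round(1 / step)))]
    return min(grid, key=lambda z: abs(f(z)))

def es_cavi(x1, x2, y, p, n_iter=30):
    # Coordinate updates in the order x1, x2, y, as in the display above.
    for _ in range(n_iter):
        x1 = line_search(lambda z: elbo1(z, x2, y, p))
        x2 = line_search(lambda z: elbo2(x1, z, y, p))
        y = line_search(lambda z: elboy(x1, x2, z, p))
    return x1, x2, y

p = 1 - math.exp(-5)  # i.e., beta = 5, far outside the Dobrushin regime
print(es_cavi(0.3, 0.7, 0.5, p))
```

With the coarse grid the iterates stabilize to within the grid resolution; the fine step Δ = 10<sup>−6</sup> only sharpens the location of the root of each derivative.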

**Figure 13.** A plot of the 20 iterations of the ES updates for *p* = 1 − *e*<sup>−5</sup> from a uniformly random initialization. Each of the lines represents a different parameter. The solid line is *x*1, the dashed line is *x*2, and the dotted line is *y*. We see convergence to a unique fixed point for each of the variables.

**Figure 14.** A plot of the ELBO of the ES coupling for *p* = 1 − *e*<sup>−5</sup>. The red line denotes the global minimum ELBO value.

**Figure 15.** A plot of the 20 iterations of the ES updates for *p* = 1 − *e*<sup>−0.1</sup> from a uniformly random initialization. Each of the lines represents a different parameter. The solid line is *x*1, the dashed line is *x*2, and the dotted line is *y*. We see convergence to a unique fixed point.

**Figure 16.** A plot of the ELBO of the ES coupling for *p* = 1 − *e*<sup>−0.1</sup>. The red line denotes the global minimum ELBO value.

#### **7. Conclusions**

This paper demonstrates the use of classical dynamical systems and bifurcation theory to study the convergence properties of the CAVI algorithm for the Ising model on two nodes. In this simple setting we are able to provide the complete dynamical behavior of the CAVI for the Ising model on two nodes. Interestingly, we find that the sequential CAVI algorithm and the parallelized CAVI algorithm are not topologically conjugate, owing to the presence of periodic behavior in the parallelized CAVI. This behavior originates from the added flexibility of the initialization in the parallelized CAVI when compared to the sequential CAVI. The erratic behavior we see in the Ising model for |*β*| > 1 is due to a combination of the existence of multiple fixed points of the system's update function and the instability of these fixed points. In this parameter regime, the fixed point that produces the optimal solution (0.5, 0.5) is a repelling fixed point. Unless we initialize the algorithm exactly at (0.5, 0.5), the CAVI system cannot converge to this point. The other two suboptimal fixed points are both asymptotically stable. This suggests that the main problem the CAVI algorithm experiences centers on the existence of multiple fixed points. Recent work on stochastic block models (SBM) and topic models (TM) shows that mean-field VI leads to suboptimal estimators [17–20]. It is not clear whether this property comes from the construction of mean-field variational inference from product distributions or whether it is a consequence of structure among the latent variables. A minor difference of the stochastic block model (SBM) or topic model (TM) from the Ising model is that the former contain parameters (e.g., the cluster labels) that are identifiable only up to permutations. That being said, in the SBM or TM, if the cluster means are not well separated, then it is not possible to identify the labels even up to permutations. This is somewhat related to having multiple fixed points of the objective function, and we conjecture that behavior similar to what we have found in the Ising model will be exhibited in the SBM or TM outside the Dobrushin regime. Interestingly, a close look at the BCAVI updates in [17,18] reveals a similar sigmoid update function 1/(1 + *e*<sup>−*x*</sup>). Applying the tools and techniques from dynamical systems theory to study the CAVI algorithm in the SBM, TM, and other models will provide a better understanding of the issues that come with using mean-field variational inference and is important to developing better variational inference techniques.

Most of the research into the theoretical properties of variational inference has focused on the mean-field family due to its computational simplicity. This computational simplicity comes at the cost of limited expressive power. Can we make do with this limited expressive power in practical applications? More specifically, is there an equivalent parameter regime to the Dobrushin regime (−1 ≤ *β* ≤ 1) for other similar models, like the SBM and TM, inside which the CAVI produces statistically optimal estimators? The answer to this question would provide researchers with stable parameter regimes for the model. The non-existence of such a region would indicate the need for more expressive variational methods for the model beyond mean-field methods. Recent work [19,20] suggests that adding some structure to these algorithms may fix the problems that arise from mean-field VI. How much structure is needed to recover statistically optimal estimators? Could adding a simple structure of pairwise dependence to the mean-field VI in the Ising model, similarly to [19], be enough to recover the optimal estimator outside of the Dobrushin regime? Is the amount of additional structure that is needed somehow related to the latent structure of the models? Tools from dynamical systems theory can be used to study these questions.

Using dynamical systems to study the convergence properties of the CAVI algorithm is not without its challenges. While dynamical systems theory can provide the answers to many of the above questions, applying these tools to higher-dimensional sequential systems is a challenging problem. As mentioned previously, the general theory for *n*-dimensional discrete dynamical systems depends on writing the evolution function in the form *x*<sub>*n*+1</sub> = *F*(*x*<sub>*n*</sub>). Deriving this *F* is typically not possible for densely connected, higher-dimensional sequential systems like the *n*-dimensional Ising model CAVI. This is not the only challenging aspect of the problem. These systems typically possess multiple fixed points, which can only be found numerically. Multiple fixed points lead to more complicated partitions of the space into domains of attraction. Furthermore, higher-dimensional systems can possess bifurcations of multiple codimensions, which are significantly more difficult to study. Bifurcations of codimension 3 are so exotic that they are not well studied [23,24]. Software to handle such calculations has only recently been developed [24]. In practical terms, this means that the convergence properties can only be studied numerically for models with a small number of parameters. Furthermore, most of the numerical techniques work under the assumption of differentiability of the evolution operator and will fail to be applicable to many systems of practical interest in statistics, such as the Edward–Sokal CAVI. Applying tools from dynamical systems to the study of variational inference algorithms will require developing new theory for high-dimensional, well-connected sequential dynamical systems.

**Author Contributions:** Conceptualization, S.P., D.P. and A.B.; formal analysis, S.P.; supervision, D.P. and A.B.; writing—original draft, S.P., D.P. and A.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** Pati and Bhattacharya acknowledge support from NSF DMS (1854731, 1916371) and NSF CCF 1934904 (HDR-TRIPODS). In addition, Bhattacharya acknowledges the NSF CAREER 1653404 award for supporting this project.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. An Overview of One Dimensional Dynamical Systems**

The main focus of discrete dynamical systems is the asymptotic behavior of iterated systems (8). Bifurcation theory studies how the dynamical behavior of a system changes as the parameter *J*<sup>12</sup> changes. We study the behavior of convergence of the CAVI algorithm by studying the autonomous discrete time dynamical system formed by the update Equation (8). This allows us to utilize tools from dynamical systems theory to study the behavior of the algorithm with respect to its parameters. In this section we provide a brief overview of the necessary dynamical systems and bifurcation theory in dimension 1 used in Section 5.

## *Appendix A.1. Notation*

Our focus will be on parametric dynamical systems defined by a function *f* : R<sup>*n*</sup> × R<sup>*p*</sup> → R<sup>*n*</sup>. We call elements **x** ∈ R<sup>*n*</sup> of the state space (phase space) states and elements *α* ∈ R<sup>*p*</sup> parameters. We denote real numbers by *x* ∈ R and real vectors by **x** = (*x*1, . . . , *xn*) ∈ R<sup>*n*</sup> in bold. We denote the inverse of an invertible function *f* by *f*<sup>−1</sup>. The *k*-fold composition of a function *f* with itself at a point (**x**, *α*) will be denoted by *f*<sup>*k*</sup>(**x**, *α*), and the *k*-fold composition of the inverse function *f*<sup>−1</sup> will be denoted *f*<sup>−*k*</sup>. The identity function will be denoted id, and we use the convention *f*<sup>0</sup> = id. We denote the tensors of derivatives of *f* by *f***x**(**x**, *α*) = (*∂fi*/*∂xj*), *f***xx**(**x**, *α*) = (*∂*<sup>2</sup>*fi*/*∂xj∂xk*), *f***xxx**(**x**, *α*) = (*∂*<sup>3</sup>*fi*/*∂xj∂xk∂xl*), and *fα*(**x**, *α*) = (*∂fi*/*∂αj*).

#### *Appendix A.2. Dynamical Systems*

Dynamical systems is a classical approach to studying the convergence properties of non-linear iterative systems. These systems can be continuous in time, for example a differential equation, or discrete in time, for example iterations of a function from an initial point. A dynamical system is called autonomous if the function governing the system is independent of time and non-autonomous otherwise. The coordinate ascent variational inference for the Ising model is a discrete-time autonomous dynamical system. Before giving a complete proof of the dynamical properties of the CAVI algorithm for the Ising model in dimension 2, we first give a basic introduction to the theory of discrete time dynamical systems and bifurcations following [23–25,38].

Formally, a dynamical system is a triple {*T*, *X*, *φ*<sup>*t*</sup>}, where *T* is a time set, *X* is the state space, and *φ*<sup>*t*</sup> : *X* → *X* is a family of evolution operators parameterized by *t* ∈ *T* satisfying *φ*<sup>0</sup> = id and *φ*<sup>*s*+*t*</sup> = *φ*<sup>*t*</sup> ◦ *φ*<sup>*s*</sup> for all *s*, *t* ∈ *T*. For a discrete-time system the evolution operator is fully specified by the one-step map *φ*<sup>1</sup> = *f*, since the composition rule then defines *φ*<sup>*k*</sup> = *f*<sup>*k*</sup> for *k* ∈ Z. We restrict the further discussion to discrete-time dynamical systems defined by the one-step map

$$\mathbf{x} \mapsto f(\mathbf{x}, \boldsymbol{\mathfrak{a}}), \quad \mathbf{x} \in \mathbb{R}^n, \boldsymbol{\mathfrak{a}} \in \mathbb{R}^p,\tag{A1}$$

where *f* is a diffeomorphism, a smooth function with smooth inverse, of the state space R*<sup>n</sup>* and *α* are the parameters of the system.

The basic geometric objects of a dynamical system are orbits in the state space and the phase portrait, defined as follows. The orbit starting at a point **x** is the ordered subset of the state space R<sup>*n*</sup> denoted orb(**x**) = {*f*<sup>*k*</sup>(**x**) : *k* ∈ Z}. The phase portrait is the partition of the state space induced by the orbits. There are two special types of orbits, fixed points and cycles, defined below.

A fixed point **x**∗ of the system is a point that remains fixed under the evolution of the system, one that satisfies **x**∗ = *f*(**x**∗). We can classify fixed points of the system by studying the local behavior of the system near the fixed point. To do this we consider small perturbations of the system near the fixed point. A fixed point **x**∗ is said to be locally stable if points that are near the fixed point do not move too far away from it as the system evolves; formally, if for any *ε* > 0 there exists *δ* > 0 such that for all **x** with |**x** − **x**∗| < *δ* we have |*f*<sup>*k*</sup>(**x**) − **x**∗| < *ε* for all *k* > 0. A fixed point is called semi-stable from the right if for any *ε* > 0 there exists *δ* > 0 such that for all **x** with 0 < **x** − **x**∗ < *δ* we have |*f*<sup>*k*</sup>(**x**) − **x**∗| < *ε* for all *k* > 0 (semi-stable from the left is defined analogously). It is said to be locally unstable otherwise. A fixed point **x**∗ is locally attracting if all points in a small neighborhood converge to the fixed point as we let the system evolve; formally, if there exists an *η* > 0 such that |**x** − **x**∗| < *η* implies *f*<sup>*n*</sup>(**x**) → **x**∗ as *n* → ∞. A fixed point **x**∗ is locally asymptotically stable if it is both locally stable and attracting. A fixed point **x**∗ is locally semi-asymptotically stable from the right if it is both locally semi-stable from the right and lim<sub>*n*</sub> *f*<sup>*n*</sup>(**x**) = **x**∗ for 0 < **x** − **x**∗ < *η* for some *η*. It is globally asymptotically stable if the point is attracting for all **x** in the state space.

A cycle is a periodic orbit of distinct points *C* = {**x**0, **x**1, . . . , **x***K*−1}, where **x**0 = *f*(**x***K*−1) for some *K* > 0. The minimal *K* generating the cycle is called the period of the cycle. A subset *S* ⊂ R<sup>*n*</sup> is called invariant if *f*<sup>*k*</sup>(*S*) ⊂ *S* for *k* ∈ Z. An invariant set *S* is called asymptotically stable if there exists a neighborhood *U* of *S* such that any point in *U* is eventually inside the set *S*. The stable set of *S* ⊂ R<sup>*n*</sup> is *Ws*(*S*) = {**x** ∈ R<sup>*n*</sup> : lim<sub>*k*→∞</sub> *f*<sup>*k*</sup>(**x**) ∈ *S*}. If *f* is invertible, we define the unstable set of *S* ⊂ R<sup>*n*</sup> as *Wu*(*S*) = {**x** ∈ R<sup>*n*</sup> : lim<sub>*k*→∞</sub> *f*<sup>−*k*</sup>(**x**) ∈ *S*}. The unstable set of *S* for the forward system *f*<sup>*k*</sup>, *k* > 0, is the stable set of *S* for the backward system *f*<sup>−*k*</sup>, *k* > 0. It is thus possible to study the behavior of points that diverge by studying points that converge under the inverse map. We can also classify the stability of *K*-cycles: we classify the stability of a cycle by treating its points as fixed points of the map *f*<sup>*K*</sup>.
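For a one-dimensional map *g*, the chain rule gives (*g*<sup>2</sup>)′(*c*0) = *g*′(*c*1)*g*′(*c*0), so a 2-cycle {*c*0, *c*1} is asymptotically stable when |*g*′(*c*0)*g*′(*c*1)| < 1. A minimal sketch, assuming the sigmoid update map *x* ↦ 1/(1 + *e*<sup>−2*β*(2*x*−1)</sup>) discussed in Section 5 (an assumption of ours; the function names are illustrative):

```python
import math

def g(x, beta):
    # Assumed scalar update map (sigmoid form from Section 5).
    return 1.0 / (1.0 + math.exp(-2.0 * beta * (2.0 * x - 1.0)))

def dg(x, beta):
    # g'(x) = 4 * beta * g(x) * (1 - g(x)), the logistic-sigmoid derivative.
    gx = g(x, beta)
    return 4.0 * beta * gx * (1.0 - gx)

beta = -1.2
# Locate the 2-cycle by iterating g^2 from a point of [0, 1/2).
x = 0.3
for _ in range(200):
    x = g(g(x, beta), beta)
c0, c1 = x, g(x, beta)

# Cycle multiplier: derivative of g^2 at a cycle point, by the chain rule.
multiplier = dg(c0, beta) * dg(c1, beta)
print(c0, c1, multiplier)  # |multiplier| < 1, so the 2-cycle is asymptotically stable
```

Under this assumed map at *β* = −1.2 the multiplier is well inside the unit interval, which is why iterating *g*<sup>2</sup> above converges to the cycle point in the first place.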

Consider a discrete-time dynamical system defined by a diffeomorphism *f* : R × R → R. Let *x*∗ be a fixed point of *f*(*x*, *α*) and consider a nearby point *x* with |*x* − *x*∗| small. Taking a Taylor expansion of the system about the fixed point gives us

$$f(x, \alpha) - x\_\* = f\_x(x\_\*, \alpha)(x - x\_\*) + \frac{1}{2} f\_{xx}(x\_\*, \alpha)(x - x\_\*)^2 + O(|x - x\_\*|^3).$$

If |*x* − *x*∗| is small enough and the Jacobian does not have modulus one, then the contribution of the *O*(|*x* − *x*∗|<sup>2</sup>) terms will be negligible, in which case the behavior of the system is governed by the behavior of the linearization of the system *fx*(*x*∗, *α*). We now introduce the idea of a hyperbolic fixed point. Assume that the Jacobian *A* := *fx*(*x*∗, *α*) of the system (A1) at a fixed point *x*∗ is non-singular. The fixed point *x*∗ is called hyperbolic if |*fx*(*x*∗, *α*)| ≠ 1 and non-hyperbolic if |*fx*(*x*∗, *α*)| = 1. The notion of hyperbolic and non-hyperbolic fixed points generalizes to higher dimensions, where it involves the eigenvalues of the Jacobian; see [23,25,38] for more details.

Near a hyperbolic fixed point, a non-linear dynamical system behaves like its first-order Taylor approximation (also known as the linearization of the system). To make this argument rigorous we need to discuss what it means for two dynamical systems to be equivalent. Two systems are topologically equivalent if we can map orbits of one system to orbits of the other in a continuous way that preserves the order of time. The dynamical system (A1) is called topologically equivalent to the system

$$\mathbf{y} \mapsto \mathbf{g}(\mathbf{y}, \boldsymbol{\beta}), \quad \mathbf{y} \in \mathbb{R}^n, \quad \boldsymbol{\beta} \in \mathbb{R}^p,\tag{A2}$$

if there exists a homeomorphism of the parameter space *hp* : R<sup>*p*</sup> → R<sup>*p*</sup>, *β* = *hp*(*α*), and a parameter-dependent state space homeomorphism *h* : R<sup>*n*</sup> × R<sup>*p*</sup> → R<sup>*n*</sup>, continuous in the first argument, with **y** = *h*(**x**, *α*), mapping orbits of the system (A1) at parameter value *α* onto orbits of the system (A2) at parameter *β* = *hp*(*α*), preserving the direction of time. If *h* is a diffeomorphism, then the systems are called smoothly equivalent.

Let (A1) and (A2) be two topologically equivalent invertible dynamical systems. Consider the orbit of the system under the mapping *f*(**x**, *α*), orb(**x**; *f* , *α*) and the orbit of the system *g*(**y**, *β*), orb(**y**; *g*, *β*). Topological equivalence means that the homeomorphism (*h*(**x**, *α*), *hp*(*α*)) maps orb(**x**; *f* , *α*) to orb(**y**; *g*, *β*) preserving the order of time. This gives us the following commutative diagram

$$\begin{array}{ccccccc} \cdots \xrightarrow{f} & f^{-1}(\mathbf{x}, \boldsymbol{\alpha}) & \xrightarrow{f} & \mathbf{x} & \xrightarrow{f} & f(\mathbf{x}, \boldsymbol{\alpha}) & \xrightarrow{f} \cdots \\ & \downarrow h & & \downarrow h & & \downarrow h & \\ \cdots \xrightarrow{g} & g^{-1}(\mathbf{y}, \boldsymbol{\beta}) & \xrightarrow{g} & \mathbf{y} & \xrightarrow{g} & g(\mathbf{y}, \boldsymbol{\beta}) & \xrightarrow{g} \cdots \end{array}$$

The orbits being topologically equivalent means that mapping the orbit of **x** under *f* through *h* produces the same orbit as mapping **x** to **y** = *h*(**x**, *α*), computing the orbit of **y** under *g*(·, *β*), and mapping back by *h*<sup>−1</sup>; that is, *f*(**x**, *α*) = *h*<sup>−1</sup> ◦ *g* ◦ *h*(**x**, *α*). We shall primarily be interested in the behavior of the system in a small neighborhood of an equilibrium point. A system (A1) is called locally topologically equivalent near an equilibrium **x**∗ to a system (A2) near an equilibrium **y**∗ if there exists a homeomorphism *h* : R<sup>*n*</sup> → R<sup>*n*</sup>, defined in a small neighborhood *U* of **x**∗, with **y**∗ = *h*(**x**∗), that maps orbits of (A1) in *U* onto orbits of (A2) in *V* = *h*(*U*), preserving the direction of time.
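A classical example of such a conjugacy, unrelated to the present model: the tent map *T* and the logistic map *L*(*y*) = 4*y*(1 − *y*) are conjugate on [0, 1] via the homeomorphism *h*(*x*) = sin<sup>2</sup>(*πx*/2), so that *h* ∘ *T* = *L* ∘ *h*. The sketch below verifies the conjugacy relation numerically:

```python
import math

def tent(x):
    # Tent map: T(x) = 2x for x <= 1/2, and 2 - 2x otherwise.
    return 2 * x if x <= 0.5 else 2 - 2 * x

def logistic(y):
    # Logistic map at parameter value 4: L(y) = 4y(1 - y).
    return 4 * y * (1 - y)

def h(x):
    # Conjugating homeomorphism of [0, 1].
    return math.sin(math.pi * x / 2) ** 2

# Conjugacy: h(T(x)) = L(h(x)) for all x, so h maps orbits of T onto orbits of L.
for k in range(1, 100):
    x = k / 100
    assert abs(h(tent(x)) - logistic(h(x))) < 1e-9
print("h maps orbits of the tent map onto orbits of the logistic map")
```

The algebra behind the check: *L*(*h*(*x*)) = 4 sin<sup>2</sup>(*πx*/2) cos<sup>2</sup>(*πx*/2) = sin<sup>2</sup>(*πx*), which equals sin<sup>2</sup>(*πT*(*x*)/2) on both branches of *T*.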

We now have enough terminology to introduce the following theorem, which shows that the dynamics of a smooth system in the neighborhood of a hyperbolic fixed point are topologically equivalent to the dynamics of its linearization.

**Theorem A1** (Grobman–Hartman)**.** *Consider a smooth map*

$$\mathbf{x} \mapsto A\mathbf{x} + F(\mathbf{x}), \quad \mathbf{x} \in \mathbb{R}^n,\tag{A3}$$

*where A is an n* × *n matrix and F*(**x**) = *O*(‖**x**‖<sup>2</sup>)*. If* **x**∗ = 0 *is a hyperbolic fixed point of (A3), then (A3) is topologically equivalent near this point to its linearization*

$$\mathbf{x} \mapsto A\mathbf{x}, \quad \mathbf{x} \in \mathbb{R}^n.$$

Note that Theorem A1 holds for a general *n*-dimensional system. Together with the stability theory of linear maps, it provides sufficient conditions for determining the stability of a hyperbolic fixed point of a general discrete-time system:

**Theorem A2.** *Consider a discrete time dynamical system (A1) where f is a smooth map. Suppose that for a fixed point x*∗ *the eigenvalues of the Jacobian fx*(*x*∗, *α*) *all satisfy* |*λ*| < 1*; then the fixed point is stable. Alternatively, suppose that for a fixed point x*∗ *the eigenvalues of the Jacobian fx*(*x*∗, *α*) *all satisfy* |*λ*| > 1*; then the fixed point is unstable.*
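The eigenvalue condition of Theorem A2 is easy to check numerically. The sketch below uses a made-up two-dimensional smooth map (purely illustrative, not from the text), estimates the Jacobian at its fixed point by central finite differences, and confirms that every eigenvalue has modulus below one.

```python
import numpy as np

# Illustrative 2-D smooth map with a fixed point at the origin
# (an invented example; the linear part is diag(0.5, 0.3)).
def f(v):
    x, y = v
    return np.array([0.5 * x + 0.1 * y**2, 0.3 * y + x**2])

def jacobian(f, v, eps=1e-6):
    """Central finite-difference Jacobian of f at v."""
    n = len(v)
    J = np.zeros((n, n))
    for j in range(n):
        dv = np.zeros(n)
        dv[j] = eps
        J[:, j] = (f(v + dv) - f(v - dv)) / (2 * eps)
    return J

x_star = np.array([0.0, 0.0])          # fixed point: f(x*) = x*
eigvals = np.linalg.eigvals(jacobian(f, x_star))
print(np.abs(eigvals))                  # moduli 0.5 and 0.3 up to FD error
# Theorem A2: all |lambda| < 1  =>  x* is asymptotically stable.
assert np.all(np.abs(eigvals) < 1)
```

The quadratic terms vanish in the Jacobian at the origin, so the computed moduli match the linear part and the fixed point is stable.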

The linearization of the system near a non-hyperbolic fixed point is not sufficient to determine the stability of the fixed point, and we need to investigate higher-order terms. The following theorem provides sufficient conditions for checking the stability of a smooth one-dimensional system at a non-hyperbolic fixed point:

**Theorem A3.** *Let f* : R × R → R*. Suppose that f*(·, *α*) ∈ *C*<sup>3</sup>(R; R) *and x*∗ *is a non-hyperbolic fixed point of f , x*∗ = *f*(*x*∗, *α*)*. We have the following cases:*

*Case 1: If fx*(*x*∗, *α*) = 1*, then*

*1. If fxx*(*x*∗, *α*) ≠ 0*, then x*∗ *is unstable;*

*2. If fxx*(*x*∗, *α*) = 0 *and fxxx*(*x*∗, *α*) > 0*, then x*∗ *is unstable;*

*3. If fxx*(*x*∗, *α*) = 0 *and fxxx*(*x*∗, *α*) < 0*, then x*∗ *is asymptotically stable.*

*Case 2: If fx*(*x*∗, *α*) = −1*, then*

*1. If* S *f*(*x*∗, *α*) < 0*, then x*<sup>∗</sup> *is asymptotically stable;*

*2. If* S *f*(*x*∗, *α*) > 0*, then x*∗ *is unstable,*

*where* S *f*(*x*, *α*) *is the Schwarzian derivative of f* :

$$\mathcal{S}f(x, \alpha) = \frac{f_{xxx}(x, \alpha)}{f_x(x, \alpha)} - \frac{3}{2} \left[ \frac{f_{xx}(x, \alpha)}{f_x(x, \alpha)} \right]^2.$$

The Schwarzian derivative captures the higher-order terms that determine stability when the linear part of the map is a pure oscillation (*λ* = −1).
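As a quick sanity check of Theorem A3, the following sketch evaluates the Schwarzian derivative of the logistic map *f*(*x*, *μ*) = *μx*(1 − *x*) at its non-hyperbolic fixed point (*x*∗, *μ*∗) = (2/3, 3), using the analytic derivatives of *f*; S *f* < 0 confirms asymptotic stability.

```python
import math

# Schwarzian derivative (Theorem A3) for the logistic map
# f(x, mu) = mu*x*(1 - x) at the non-hyperbolic point (x*, mu*) = (2/3, 3).
def schwarzian(fx, fxx, fxxx):
    return fxxx / fx - 1.5 * (fxx / fx) ** 2

mu, x = 3.0, 2.0 / 3.0
fx = mu * (1 - 2 * x)   # = -1: multiplier lambda = -1 (Case 2)
fxx = -2.0 * mu         # = -6
fxxx = 0.0              # f is quadratic in x, so the third derivative vanishes

S = schwarzian(fx, fxx, fxxx)
print(S)                # approximately -54: S f < 0 => asymptotically stable
assert math.isclose(fx, -1.0) and S < 0
```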

#### *Appendix A.3. Codimension 1 Bifurcations*

Until now we have kept the parameter of the system fixed. The study of the change in behavior of a dynamical system as the parameters are varied is called bifurcation theory. A bifurcation occurs when the dynamics of the system at a parameter value *α*<sup>1</sup> differ from the dynamics of the system at a different parameter value *α*<sup>2</sup>. Changing the parameter in a system may cause a stable fixed point to become unstable, the fixed point may split into multiple fixed points, or a new orbit may form. Each of these is an example of a bifurcation, although these are not the only things that can happen. The point at which a bifurcation occurs is called a bifurcation point. More formally, the parameter *α*∗ is called a bifurcation point if arbitrarily close to it there is an *α* such that **x** → *f*(**x**, *α*), **x** ∈ R*<sup>n</sup>*, is not topologically equivalent to **x** → *f*(**x**, *α*∗), **x** ∈ R*<sup>n</sup>*, in some domain *U* ⊂ R*<sup>n</sup>*.

A necessary, but not sufficient, condition for a bifurcation of a fixed point to occur is that the fixed point be non-hyperbolic. Theorem A1, together with the implicit function theorem, shows that in a sufficiently small neighborhood of a hyperbolic fixed point (**x**∗, *α*∗), for each *α* there is a unique fixed point with the same stability properties as **x**∗. So hyperbolic fixed points do not undergo local bifurcations. In the context of discrete systems, a local bifurcation can occur only at a fixed point (**x**∗, *α*∗) at which the Jacobian of the system has an eigenvalue with modulus one.

Perhaps surprisingly, there are only three types of generic bifurcations that can happen in a discrete system with one parameter: the limit point (LP), period doubling (PD) and Neimark–Sacker (NS) bifurcations. The reason for this is fairly simple. It turns out that for each of these bifurcations there is a generic system, called the topological normal form, which undergoes the bifurcation at the origin of the (**x**, *α*)-plane. For any other system that undergoes the same bifurcation and satisfies certain non-degeneracy conditions, there is a local change of coordinates that transforms the system into the topological normal form.

In general, the types of bifurcations that can occur are connected to the number of parameters in the system. The minimal number of parameters that must be varied in order for a particular bifurcation to occur in *f*(**x**, *α*) is called the codimension of the bifurcation. A bifurcation is called local if it can be detected in any small neighborhood of the fixed point; otherwise it is called global. Global bifurcations are much harder to analyze, and since we do not attempt to investigate them in this paper, we will not expand upon them further. More detailed results on bifurcations of codimension 1 and 2 can be found in [23,24].

We will now formally state the sufficient conditions for a system to undergo a period doubling or a pitchfork bifurcation. The period doubling bifurcation occurs when a system with a non-hyperbolic fixed point with multiplier *λ*<sup>1</sup> = −1 satisfies certain non-degeneracy conditions. There are two types of PD bifurcations. In the super-critical case, a stable 2-cycle is generated when a fixed point becomes unstable. In the sub-critical case, a stable fixed point turns unstable when it coalesces with an unstable 2-cycle. (This is true for a general *k*-cycle: in the super-critical case, a stable 2*k*-cycle is generated when a *k*-cycle becomes unstable; in the sub-critical case, a stable *k*-cycle turns unstable when it coalesces with an unstable 2*k*-cycle.) The conditions for a PD bifurcation to occur are given as follows.

**Theorem A4** (Period Doubling Bifurcation)**.** *Suppose that a one-dimensional system*

$$x \mapsto f(x, \alpha), \quad x, \alpha \in \mathbb{R},$$

*with smooth f, has at α* = 0 *the fixed point x*∗ = 0*, and let λ* = *fx*(0, 0) = −1*. Assume that the following non-degeneracy conditions are satisfied:*

*1.* (1/2)(*fxx*(0, 0))<sup>2</sup> + (1/3) *fxxx*(0, 0) ≠ 0*;*

*2. fxα*(0, 0) ≠ 0*.*

*Then there are smooth invertible coordinate and parameter changes transforming the system into*

$$\eta \mapsto -(1+\beta)\eta \pm \eta^3 + O(\eta^4). \tag{A4}$$

A classical example of a period doubling bifurcation can be seen in the logistic map *f*(*x*, *μ*) = *μx*(1 − *x*), for *x* ∈ [0, 1]. The bifurcation occurs at the point (*x*∗, *μ*∗) = (2/3, 3). The logistic map has two fixed points. One fixed point is at *x* = 0 and the other is at *x* = (*μ* − 1)/*μ*. We will ignore the fixed point at *x* = 0 since it is repelling for *μ* > 1. We look at the behavior of the system in a small neighborhood of *μ*∗ = 3. For *μ* = 2.9, the fixed point *x*∗ = (*μ* − 1)/*μ* is a hyperbolic attracting fixed point since | *fx*(*x*∗, 2.9)| = |2 − *μ*| < 1. For *μ* = 3, the fixed point *x*∗ = (*μ* − 1)/*μ* is a non-hyperbolic fixed point since *fx*(*x*∗, 3) = 2 − *μ* = −1. Checking the Schwarzian derivative shows that the fixed point is asymptotically stable. For *μ* = 3.1, *x*∗ = (*μ* − 1)/*μ* becomes a repelling fixed point. The points in (0, *x*∗) ∪ (*x*∗, 1) converge to the attracting 2-cycle *C* = {0.558014, 0.7645665}. A super-critical period doubling bifurcation has occurred in the system formed by the logistic map. As the parameter *μ* increases we see a stable fixed point lose stability and a stable 2-cycle form.

**Figure A1.** The above plots are cobweb diagrams for the logistic map *f*(*x*, *μ*) = *μx*(1 − *x*), for *x* ∈ [0, 1], with parameters *μ* = 2.9, *μ* = 3 and *μ* = 3.1, respectively. For *μ* = 2.9 the system has one stable fixed point *x*∗ = (*μ* − 1)/*μ*. For *μ* = 3, the system has one non-hyperbolic fixed point *x*∗ = (*μ* − 1)/*μ* which is asymptotically stable and attracting; the plot was not iterated long enough to show convergence. For *μ* = 3.1, the system has a hyperbolic repelling fixed point *x*∗ = (*μ* − 1)/*μ* and an asymptotically stable attracting 2-cycle *C* = {0.558014, 0.7645665}.
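The 2-cycle *C* quoted above is easy to recover numerically. A minimal sketch, iterating past the transient from an arbitrarily chosen initial point *x* = 0.5:

```python
# Iterate the logistic map at mu = 3.1: the orbit settles onto the
# attracting 2-cycle C = {0.558014, 0.7645665}.
def logistic(x, mu):
    return mu * x * (1 - x)

mu, x = 3.1, 0.5
for _ in range(10_000):        # discard the transient
    x = logistic(x, mu)

cycle = sorted((x, logistic(x, mu)))
print(cycle)                    # close to [0.558014, 0.764567]
assert abs(cycle[0] - 0.558014) < 1e-5
assert abs(cycle[1] - 0.7645665) < 1e-5
```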

The second iterate of a map that undergoes a PD bifurcation undergoes a bifurcation known as the pitchfork bifurcation. A system undergoes a super-critical pitchfork bifurcation when a stable fixed point becomes unstable and two stable fixed points appear in the system. A system undergoes a sub-critical pitchfork bifurcation when two stable fixed points coalesce with an unstable fixed point; the unstable fixed point becomes stable as the parameter crosses the bifurcation point. Below we present extra details pertaining to the period doubling bifurcation and its relation to the pitchfork bifurcation.

Consider the one-dimensional system

$$x \mapsto -(1+\alpha)x + x^3 = f(x, \alpha).$$

The map *f*(*x*, *α*) is invertible in a small neighborhood of (0, 0). The system has a fixed point at *x*∗ = 0 for all *α*, with eigenvalue −(1 + *α*). For small *α* < 0 the fixed point is hyperbolic stable and for *α* > 0 it is hyperbolic unstable. For *α* = 0 the fixed point is non-hyperbolic, but is asymptotically stable. Consider the second iterate of *f*(*x*, *α*)

$$\begin{aligned} f^2(x, \alpha) &= -(1+\alpha)f(x, \alpha) + \left(f(x, \alpha)\right)^3 \\ &= (1+\alpha)^2 x - \left[(1+\alpha)(2+2\alpha+\alpha^2)\right] x^3 + O(x^5). \end{aligned}$$

The second iterate has a trivial fixed point at *x*∗ = 0 and for *α* > 0 it has two non-trivial stable fixed points *x*<sup>1</sup> = √*α* + *O*(*α*) and *x*<sup>2</sup> = −(√*α* + *O*(*α*)) that form a 2-cycle

$$x_2 = f(x_1, \alpha), \quad x_1 = f(x_2, \alpha).$$
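For this truncated normal form the 2-cycle can be written down exactly: setting *f*(*x*, *α*) = −*x* gives *x*<sup>3</sup> = *αx*, so the cycle is {√*α*, −√*α*}. A short sketch verifying the cycle relations for *α* = 0.1 (an arbitrary small parameter value):

```python
import math

# Truncated normal form x -> -(1 + a)x + x^3; its 2-cycle is exactly
# {+sqrt(a), -sqrt(a)} since f(x) = -x reduces to x^3 = a*x.
def f(x, a):
    return -(1 + a) * x + x**3

a = 0.1
x1, x2 = math.sqrt(a), -math.sqrt(a)
assert abs(f(x1, a) - x2) < 1e-12    # x2 = f(x1, a)
assert abs(f(x2, a) - x1) < 1e-12    # x1 = f(x2, a)

# Stability along the cycle: (f^2)'(x1) = f'(x1) * f'(x2) = (1 - 2a)^2 < 1.
df = lambda x: -(1 + a) + 3 * x**2
print(df(x1) * df(x2))               # about 0.64: the 2-cycle is stable
```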

The conditions for a generic pitchfork bifurcation can be found in [25].

**Theorem A5** (Pitchfork Bifurcation)**.** *A system*

$$x \mapsto f(x, \alpha), \quad x, \alpha \in \mathbb{R},$$

*having a non-hyperbolic fixed point at x*∗ = 0*, α*∗ = 0 *with fx*(0, 0) = 1*, undergoes a pitchfork bifurcation at* (*x*∗, *α*∗) = (0, 0) *if*

$$f_{\alpha}(0, 0) = 0, \quad f_{xx}(0, 0) = 0, \quad f_{xxx}(0, 0) \neq 0, \quad f_{x\alpha}(0, 0) \neq 0.$$

*A pitchfork bifurcation is super-critical if* −*fxxx*(*x*∗, *α*∗)/ *fxα*(*x*∗, *α*∗) > 0 *and sub-critical if* −*fxxx*(*x*∗, *α*∗)/ *fxα*(*x*∗, *α*∗) < 0*.*

An example of a pitchfork bifurcation can be seen in the second iterate of the logistic map *f* <sup>2</sup>(*x*, *μ*) = *μ*<sup>2</sup>*x*(1 − *x*)(1 − *μx*(1 − *x*)), for *x* ∈ [0, 1]. The bifurcation occurs at the point (*x*∗, *μ*∗) = (2/3, 3). For *μ* ≤ 3, the second iterate of the logistic map has the same fixed points as the first iterate. One fixed point is at *x* = 0 and the other is at *x* = (*μ* − 1)/*μ*. We will ignore the fixed point at *x* = 0 since it is repelling for *μ* > 1. We look at the behavior of the system in a small neighborhood of *μ*∗ = 3. For *μ* = 2.9, the fixed point *x*∗ = (*μ* − 1)/*μ* is a hyperbolic attracting fixed point since | *f* <sup>2</sup>*x* (*x*∗, 2.9)| < 1. For *μ* = 3, the fixed point *x*∗ = (*μ* − 1)/*μ* is non-hyperbolic since *f* <sup>2</sup>*x* (*x*∗, 3) = (2 − *μ*)<sup>2</sup> = 1. Checking the higher-order derivatives shows that the fixed point is asymptotically stable. For *μ* = 3.1, *x*∗ = (*μ* − 1)/*μ* becomes a repelling fixed point. Using numerical methods we find two additional fixed points, *x*<sup>1</sup> = 0.558014 and *x*<sup>2</sup> = 0.7645665, both of which are attracting. A super-critical pitchfork bifurcation has occurred in the system formed by the second iterate of the logistic map. As the parameter *μ* increases we see a stable fixed point become unstable while two new stable fixed points appear.

**Figure A2.** The above plots are cobweb diagrams for the second iterate of the logistic map *f*(*x*, *μ*) = *μx*(1 − *x*), for *x* ∈ [0, 1], with parameters *μ* = 2.9 and *μ* = 3.1, respectively. For *μ* = 2.9 the system has one stable fixed point *x*∗ = (*μ* − 1)/*μ*. For *μ* = 3.1, the system has a hyperbolic repelling fixed point *x*∗ = (*μ* − 1)/*μ* and two asymptotically stable attracting fixed points *x*<sup>1</sup> = 0.558014 and *x*<sup>2</sup> = 0.7645665.
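The attractiveness of *x*<sup>1</sup> and *x*<sup>2</sup> can be verified with the chain rule: since they form a 2-cycle of *f*, the derivative of the second iterate at either point is the product *fx*(*x*<sup>1</sup>, *μ*) *fx*(*x*<sup>2</sup>, *μ*). A sketch using the values quoted above:

```python
# Stability of the two new fixed points of f^2 at mu = 3.1.
def logistic(x, mu):
    return mu * x * (1 - x)

def dlogistic(x, mu):
    return mu * (1 - 2 * x)

mu = 3.1
x1, x2 = 0.558014, 0.7645665     # fixed points of f^2 quoted in the text
# Both are fixed points of the second iterate, up to rounding:
assert abs(logistic(logistic(x1, mu), mu) - x1) < 1e-6
assert abs(logistic(logistic(x2, mu), mu) - x2) < 1e-6
# Chain rule along the 2-cycle: (f^2)'(x1) = f'(x1) * f'(x2)
mult = dlogistic(x1, mu) * dlogistic(x2, mu)
print(mult)                       # about 0.59; modulus < 1 => both attracting
assert abs(mult) < 1
```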

#### **References**



© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
