Article

GAD-PVI: A General Accelerated Dynamic-Weight Particle-Based Variational Inference Framework

College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
*
Author to whom correspondence should be addressed.
Entropy 2024, 26(8), 679; https://doi.org/10.3390/e26080679
Submission received: 7 June 2024 / Revised: 5 August 2024 / Accepted: 8 August 2024 / Published: 11 August 2024
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Particle-based Variational Inference (ParVI) methods have been widely adopted in deep Bayesian inference tasks such as Bayesian neural networks or Gaussian Processes, owing to their efficiency in generating high-quality samples given the score of the target distribution. Typically, ParVI methods evolve a weighted-particle system by approximating the first-order Wasserstein gradient flow to reduce the dissimilarity between the particle system’s empirical distribution and the target distribution. Recent advancements in ParVI have explored sophisticated gradient flows to obtain refined particle systems with either accelerated position updates or dynamic weight adjustments. In this paper, we introduce the semi-Hamiltonian gradient flow on a novel Information–Fisher–Rao space, known as the SHIFR flow, and propose the first ParVI framework that possesses both accelerated position update and dynamical weight adjustment simultaneously, named the General Accelerated Dynamic-Weight Particle-based Variational Inference (GAD-PVI) framework. GAD-PVI is compatible with different dissimilarities between the empirical distribution and the target distribution, as well as different approximation approaches to gradient flow. Moreover, when the appropriate dissimilarity is selected, GAD-PVI is also suitable for obtaining high-quality samples even when analytical scores cannot be obtained. Experiments conducted under both the score-based tasks and sample-based tasks demonstrate the faster convergence and reduced approximation error of GAD-PVI methods over the state-of-the-art.

1. Introduction

Bayesian inference is an active area in modern machine learning that provides powerful tools for modeling unknown distributions and reasoning under uncertainty. Its applications range from natural language processing [1,2,3] and image processing [4,5,6,7] to knowledge representation [8,9,10,11]. The core of Bayesian inference is to estimate the target posterior distribution given the data.
Markov Chain Monte Carlo (MCMC) methods have been extensively employed in Bayesian inference, serving as a cornerstone for sampling from complex probability distributions. These methods rely on constructing a Markov chain that has the desired distribution as its equilibrium distribution. Through iterative sampling, MCMC methods facilitate the exploration of the sample space, providing a robust framework for estimating the posterior distributions critical to Bayesian approaches [12,13,14]. In the MCMC literature, acceleration methods such as Hamiltonian Monte Carlo (HMC) and underdamped Langevin dynamics, which optimize sampling efficiency and convergence speed, have been widely studied [15,16]. Furthermore, Sequential Monte Carlo (SMC) methods use dynamic weight techniques combined with resampling to tackle particle degeneracy and have been integrated with HMC to boost convergence [17,18,19].
Recently, Particle-based Variational Inference (ParVI) methods have gained significant attention in the Bayesian inference literature owing to their effectiveness in providing approximations of the target posterior distribution π [20,21,22,23,24,25]. The essence of ParVI lies in deterministically evolving a system of finitely many particles and approximating the target distribution with this finite particle set. Compared to traditional MCMC methods, ParVI methods introduce repulsive forces among particles. This fundamental addition prevents the particles from collapsing or degenerating, ensuring more robust coverage of the distribution. This feature is particularly beneficial in high-dimensional spaces, where MCMC methods may encounter difficulties due to particle degeneration. Typically, the particle update rule is designed by simulating the probability-space gradient flow of a certain dissimilarity functional F(μ) := D(μ|π) that vanishes at μ = π. From the seminal work on Stein Variational Gradient Descent (SVGD) [20] to the subsequent BLOB method [26], the Gradient Flow with Smoothed Density (GFSD) method [27], and the Kernel Stein Discrepancy Descent (KSDD) method [28], various effective ParVI methods have been proposed that adopt different dissimilarities or empirical approaches to simulate the probability gradient flow (see Table 1).
However, these classical ParVIs only focus on simulating the first-order gradient flow in the Wasserstein space. To improve the efficiency of ParVIs, recent works explore different aspects of the underlying geometry structures in the probability space and design two types of refined particle systems with either accelerated position update or dynamic weight adjustment.
  • Accelerated position update. By considering the second-order information of the Wasserstein probability space, different accelerated position update strategies have been proposed [27,32]: Liu et al. [27] follow the accelerated gradient descent methods in the Wasserstein probability space [34,35] and derive the Wasserstein Nesterov's (WNES) and Wasserstein Accelerated Gradient (WAG) methods, which update the particles' positions with an extra momentum. Inspired by the accelerated flow on the ℝ^d space [36], the Accelerated Flow (ACCEL) method [32] directly discretizes the Hamiltonian gradient flow in the Wasserstein space and updates the positions with a damped velocity field, effectively decreasing the Hamiltonian potential of the particle system. Later, Wang and Li [33] considered the accelerated gradient flow for general information probability spaces [37] and derived novel accelerated position update strategies according to the Kalman–Wasserstein/Stein Hamiltonian flow. Following analyses similar to those in [36], they theoretically show that, under mild conditions, the Hamiltonian flow usually converges to the equilibrium faster than the original first-order counterpart, aligning with the Nesterov acceleration framework [38]. Numerous experimental studies demonstrate that these accelerated position-update strategies usually drift the particle system toward the target distribution more efficiently [27,32,33,39].
  • Dynamic weight adjustment. Dynamic weight techniques, developed within Markov Chain Monte Carlo (MCMC) frameworks, have shown significant promise in improving sampling efficiency by adapting the weight of samples throughout the computation process. Building on these foundations, ref. [23] introduces the novel application of dynamic weights within the Wasserstein–Fisher–Rao (WFR) space to develop Dynamic-weight Particle-based Variational Inference (DPVI) methods. Specifically, they derive effective dynamical weight adjustment approaches by mimicking the reaction variational step in a JKO splitting scheme of first-order WFR gradient flow [40,41]. The seminal papers [42,43] provide foundational insights into the WFR geometry, particularly discussing an un-normalized dynamic weight variant based on a novel metric interpolating between the quadratic Wasserstein and the Fisher–Rao metrics, which is critical for developing these dynamic weight adjustment schemes. Compared with the commonly used fixed weight strategy, these dynamical weight adjustment schemes usually lead to less approximation error, especially when the number of particles is limited [23].
Given the contributions of both the acceleration technique and the dynamic-weight technique in enhancing the efficiency of ParVI, researchers have increasingly focused on developing a ParVI algorithm that can effectively integrate these features. One natural idea is to consider a ParVI algorithm that incorporates the second-order gradient flow in the WFR space, serving as a direct combination of an accelerated position update method (e.g., ACCEL, WNES, AIG) with a dynamic weight adjustment method (e.g., DPVI). However, we demonstrate that discretizing this flow generally does not yield a practical algorithm. Through our investigation, we discover that the primary obstacle lies in the intractable kinetic energy on the Fisher–Rao structure. For a detailed discussion on the IFR Hamiltonian flow, please refer to Appendix A.2.

1.1. Contribution

In this paper, we propose the first ParVI method that possesses both accelerated position updates and dynamic weight adjustment simultaneously. Specifically, we first construct a novel Information–Fisher–Rao (IFR) probability space, whereby the original information space is augmented with a Fisher–Rao structure. Infinitesimally, this Fisher–Rao structure is orthogonal to the information structure; orthogonality between metrics here means orthogonality between the corresponding tangent spaces, and details can be found in [40]. Then, we present a novel Semi-Hamiltonian IFR (SHIFR) flow in this space, which simplifies the influence of the kinetic energy on the Fisher–Rao structure in the Hamiltonian IFR flow. By discretizing the SHIFR flow, a practical General Accelerated Dynamic-weight Particle-based Variational Inference (GAD-PVI) framework is proposed. The main contributions of our paper are as follows:
  • We investigate the convergence property of the SHIFR flow and show that the target distribution π is the stationary distribution of the proposed semi-Hamiltonian flow for proper dissimilarity functional D ( · | π ) . Moreover, our theoretical result also shows that the augmented Fisher–Rao structure yields an additional decrease in the local functional dissipation, compared to the Hamiltonian flow in the vanilla information space.
  • We derive an effective finite-particle approximation to the SHIFR flow, which directly evolves the position, weight, and velocity of the particles via a set of ordinary differential equations. The finite particle system is compatible with different dissimilarity and associated smoothing approaches. We prove that the mean-field limit of the proposed particle system converges to the exact SHIFR flow under mild conditions.
  • By adopting explicit Euler discretization for the finite-particle system, we create the General Accelerated Dynamic-weight Particle-based Variational Inference (GAD-PVI) framework, which updates positions in an acceleration manner and dynamically adjusts weights. We derive various GAD-PVI instances by using three different dissimilarities and associated smoothing approaches (KL-BLOB, KL-GFSD, and KSD-KSDD) on the Wasserstein/Kalman–Wasserstein/Stein IFR space, respectively.
  • Furthermore, we showcase the versatility of our GAD-PVI by extending its applicability to scenarios where the analytic score is unavailable. We illustrate that the GAD-PVI algorithm can be utilized to develop methods for generating new samples from an unknown target distribution, given only a set of i.i.d. samples. This is achieved by employing suitable dissimilarities and their associated approximation approaches, such as Maximum-Mean-Discrepancy–Maximum-Mean-Discrepancy Flow (MMD-MMDF) and Sinkhorn-Divergence–Sinkhorn-Divergence Flow (SD-SDF), in the GAD-PVI framework.
We evaluate our algorithms on both score-based variational inference tasks and sampling tasks where only i.i.d. samples from the target are accessible. The empirical results demonstrate the superiority of our GAD-PVI methods.

1.2. Notation

Given a probability measure μ on ℝ^d, we write μ ∈ P_2(ℝ^d) if its second moment is finite. For a given functional F(·): P_2(ℝ^d) → ℝ, δF(μ̃)/δμ(·): ℝ^d → ℝ denotes its first variation at μ = μ̃. We use C(ℝ^n) to denote the set of continuous functions mapping ℝ^n to ℝ. We denote x^i ∈ ℝ^d as the i-th particle, for i ∈ {1, …, M}, and K(·,·): ℝ^d × ℝ^d → ℝ as a positive semi-definite kernel function. We also denote the Dirac delta distribution with point mass located at x^i as δ_{x^i}, and use f ∗ g: ℝ^d → ℝ to denote the convolution of f: ℝ^d → ℝ and g: ℝ^d → ℝ. Furthermore, we use ∇ and ∇·( ) to denote the gradient and the divergence operator, respectively. We denote a general information probability space as (P(ℝ^n), G(μ)), where G(μ)[·] denotes the one-to-one information metric tensor mapping elements of the tangent space T_μP(ℝ^n) ⊂ C(ℝ^n) to the cotangent space T_μ^*P(ℝ^n) ⊂ C(ℝ^n). The inverse map of G(μ)[·] is denoted as G^{-1}(μ)[·]: T_μ^*P(ℝ^n) → T_μP(ℝ^n).

2. Related Works

The core of Bayesian inference is to estimate the posterior distribution given the data. By reformulating the inference problem into an optimization problem, variational inference (VI) seeks an approximation within a certain family of distributions that minimizes the Kullback–Leibler (KL) divergence to the posterior. However, the construction of approximating distributions can be restrictive, which may lead to poor approximation [44].
Recently proposed Particle-based Variational Inference methods (ParVIs) use a set of samples, or particles, to represent the approximating distribution and deterministically update particles by minimizing the KL divergence to the target. ParVIs are more non-parametrically flexible than VIs. Stein Variational Gradient Descent (SVGD) [20] is the first and most representative of the ParVI-type algorithms. It updates the set of particles by incorporating a proper vector field that minimizes the kernelized Stein discrepancy with respect to the target. SVGD was later understood as simulating the Wasserstein gradient flow of the KL-divergence on a certain kernel-dependent probability space P H ( R n ) [45]. The unique benefits of SVGD make it popular in various applications, including Generative Models [46,47], reinforcement learning [48,49], and recommendation systems [50].
Inspired by this gradient flow understanding of SVGD, more ParVIs have been developed by simulating the gradient flow on the Wasserstein space P_2(ℝ^n). ParVIs like the BLOB method [29], the Gradient Flow with Smoothed Density (GFSD) method [27], and the Kernel Stein Discrepancy Descent (KSDD) method [28] utilize different kernel tricks to approximate the vector-field form of the gradient flow and update the finite set of particles.
On accelerating ParVI, the Wasserstein Nesterov's (WNES) method and the Wasserstein Accelerated Gradient (WAG) method [27] initially leveraged the geometry of the underlying space to devise a Riemannian acceleration mechanism for ParVI with auxiliary points. More recently, the Accelerated Flow (ACCEL) method [32] has utilized the Hamiltonian dynamics on the original probability space to develop an accelerated ParVI method by incorporating a set of momentum variables. Furthermore, the Accelerated Information Gradient (AIG) flow method [33] extends this technique to general metric probability spaces beyond the Wasserstein metric. We follow the idea of leveraging the Hamiltonian flow on the probability space to develop second-order accelerated ParVIs, as this approach provides theoretical guarantees.
Regarding dynamic-weight ParVIs, the DPVI framework [23] provides the first dynamic-weight ParVI methods, which maintain a set of weighted particles and hence have better approximation ability. DPVIs are derived by leveraging the augmented Wasserstein–Fisher–Rao space rather than the vanilla Wasserstein space. The idea behind DPVI originates from MCMC–Birth–Death (MCMC-BD) [51], an MCMC-type sampling algorithm that first introduced the Wasserstein–Fisher–Rao flow into the sampling literature. However, it is important to note that while MCMC-BD can transfer particles, it is unable to obtain weighted particles.
Recently, several studies have begun utilizing the mechanism of probability gradient flow to generate new samples from an unknown target distribution when only given a set of i.i.d. samples. One pioneering study introduced the Maximum Mean Discrepancy Flow (MMDF) method [30], which considers a particle flow to minimize the Maximum Mean Discrepancy (MMD) between the model particles and the accessible set of i.i.d. samples. The Sinkhorn Divergence Flow (SDF) method [52] instead considers the Sinkhorn Divergence (SD), which operates by finding the push-forward mapping in a Reproducing Kernel Hilbert Space that allows the fastest descent of SD and consequently solves the SD minimization problem iteratively. Note that this sampling setting is closely related to the emerging field of Generative Models (GMs), which includes well-known models such as Generative Adversarial Networks (GANs) [53], variational autoencoders (VAEs) [54], and Diffusion Models (DMs) [55,56]. However, it is important to clarify that the objective of this paper is not to achieve state-of-the-art performance in Generative Models but rather to demonstrate the extensibility of our technique within this particular setting. Indeed, there have been studies attempting to obtain state-of-the-art Generative Models based on Wasserstein Gradient Flow, such as the Neural Sinkhorn Gradient Flow (NSGF) method [31] and the S-JKO (JSD) method [57]. However, these methods typically require complex network design and additional components. Addressing this challenge remains an open question and is not the primary focus of this paper.

3. Preliminaries

When dealing with Bayesian inference tasks, variational inference methods approximate the target posterior π with an easy-to-sample distribution μ and recast the inference task as an optimization problem over P 2 ( R d ) [58]:
\min_{\mu \in P_2(\mathbb{R}^n)} F(\mu) := D(\mu \,|\, \pi) .
To solve this optimization problem, Particle-based Variational Inference (ParVI) methods generally simulate the gradient flow of F ( μ ) in a certain probability space with a finite particle system, which transports the initial empirical distribution toward the target distribution π iteratively. Given an information metric tensor G ( μ ) [ · ] , the gradient flow in the information probability space ( P ( R n ) , G ( μ ) ) takes the following form [59]:
\partial_t \mu_t = - G(\mu_t)^{-1} \frac{\delta F(\mu_t)}{\delta \mu} .   (2)

3.1. Wasserstein Gradient Flow and Classical ParVIs

Since the seminal work on Stein Variational Gradient Descent (SVGD) [20], many ParVI methods have focused on flows in the Wasserstein space, where the inverse of the Wasserstein metric tensor is defined as
G_W(\mu)^{-1} \varphi = - \nabla \cdot (\mu \nabla \varphi), \qquad \mu \in P(\mathbb{R}^n), \; \varphi \in T_\mu^* P(\mathbb{R}^n),
and the Wasserstein gradient flow is defined as
\partial_t \mu_t = \nabla \cdot \left( \mu_t \nabla \frac{\delta F(\mu_t)}{\delta \mu} \right) .   (4)
Based on the probability flow (4) on the density, existing ParVIs maintain a set of particles x t i and directly modify the particle position according to the following ordinary differential equation:
\mathrm{d}x_t^i = - \nabla \frac{\delta F(\tilde{\mu}_t)}{\delta \mu}(x_t^i) \, \mathrm{d}t ,
where μ̃_t = Σ_{i=1}^M w_t^i δ_{x_t^i} denotes the empirical distribution. Since the first variation δF(μ̃_t)/δμ of F might not be well defined for the discrete empirical distribution, various ParVI methods have been proposed by choosing different dissimilarities F and associated particle approaches, depending on the accessible information about the target distribution π. When the score of the target distribution, ∇log π(·), is accessible, some ParVIs adopt the KL divergence or the Kernel Stein Discrepancy (KSD) as F, e.g., KL-BLOB [26], KL-GFSD [27], and KSD-KSDD [28]. When only samples of π are provided, integral-based dissimilarities like the Maximum Mean Discrepancy (MMD) and the Sinkhorn Divergence, which are naturally compatible with sample approximation, are adopted to develop ParVIs, such as the MMDF method [30] and the SDF method [52].

3.2. Hamiltonian Gradient Flows and Accelerated ParVIs

The following Hamiltonian gradient flow in the general information probability space has recently been utilized to derive more efficient ParVI methods
\partial_t \mu_t = \frac{\delta}{\delta \Phi} H(\mu_t, \Phi_t), \qquad \partial_t \Phi_t = -\gamma_t \Phi_t - \frac{\delta}{\delta \mu} H(\mu_t, \Phi_t),   (6)
where Φ_t: ℝ^n → ℝ represents the Hamiltonian momentum variable, while H(μ_t, Φ_t) = (1/2)∫ Φ_t G(μ_t)^{-1} Φ_t dx + F(μ_t) signifies the Hamiltonian potential. It is pertinent to note that Φ_t can be interpreted as the momentum of μ_t. The Hamiltonian flow (6) can be regarded as the second-order accelerated version of the information gradient flow (2) and usually converges faster to the equilibrium of the target distribution under mild conditions [32,33,39]. By reformulating the partial differential equations in terms of (μ_t, Φ_t) into Lagrangian formulations with respect to sample positions X_t and velocities V_t associated with μ_t and Φ_t, and by further finite approximation, we obtain a simple augmented particle system (x_t^i, v_t^i), which evolves the position x_t^i and velocity v_t^i of the particles simultaneously. As the position update rule of x_t^i also uses the extra velocity information, the induced system is said to have an accelerated position update. By discretizing the continuous particle system, several accelerated ParVI methods have been proposed, which converge faster to the target distribution in numerous real-world Bayesian inference tasks [32,33].

3.3. Wasserstein–Fisher–Rao Flow and Dynamic-Weight ParVIs

Recently, the Wasserstein–Fisher–Rao (WFR) Flow has been used to derive effective dynamic weight adjustment approaches to mitigate the fixed-weight restriction of ParVIs [23]. The inverse of the WFR metric tensor is
G_{WFR}(\mu)^{-1} \Phi = - \nabla \cdot (\mu \nabla \Phi) + \left( \Phi - \int \Phi \, \mathrm{d}\mu \right) \mu ,
where Φ ∈ T_μ^*P(ℝ^n), and the WFR gradient flow is written as
\partial_t \mu_t = \underbrace{\nabla \cdot \left( \mu_t \nabla \frac{\delta F(\mu_t)}{\delta \mu} \right)}_{\text{Wasserstein transport}} - \underbrace{\left( \frac{\delta F(\mu_t)}{\delta \mu} - \int \frac{\delta F(\mu_t)}{\delta \mu} \, \mathrm{d}\mu_t \right) \mu_t}_{\text{Fisher–Rao variational distortion}} .
Since the WFR space can be regarded as the orthogonal sum of the Wasserstein space and the Fisher–Rao space, ref. [23] mimics a JKO splitting scheme for the WFR flow, which handles the position and the weight with the Wasserstein transport and the Fisher–Rao variational distortion, respectively. Given a set of particles with positions x_t^i and weights w_t^i, the Fisher–Rao distortion can be approximated by the following ordinary differential equation:
\frac{\mathrm{d}}{\mathrm{d}t} w_t^i = - \left( \frac{\delta F(\tilde{\mu}_t)}{\delta \mu}(x_t^i) - \sum_{i=1}^M w_t^i \frac{\delta F(\tilde{\mu}_t)}{\delta \mu}(x_t^i) \right) w_t^i .   (9)
According to (9), ref. [23] derives two dynamical weight-adjustment schemes and proposes the Dynamic-Weight Particle-based Variational Inference (DPVI) framework, which is compatible with several dissimilarity functionals and associated particle approaches. The dynamic weight technique employed here differs from the one utilized in the dynamic-weight Sequential Monte Carlo (SMC) literature [19]. SMC-type algorithms achieve dynamic weight sampling through importance sampling, which is predicated on maintaining an estimated posterior: the weight of each incoming particle is individually calculated upon its arrival, and the dynamic weight technique is combined with sequential importance resampling to circumvent degeneracy. In contrast, the Fisher–Rao dynamic weight technique iteratively adjusts the weights of the existing particles in an interacting way to augment the final approximation accuracy.

3.4. Dissimilarity Functionals

ParVIs typically select the probability gradient flow functional to be a divergence with respect to the target distribution, denoted as F ( μ ) : = D ( μ | π ) . In this context, we will now introduce four frequently employed dissimilarities, along with their corresponding first variation forms. The first two dissimilarities are typically employed in score-based tasks due to their logarithmic form. On the other hand, the last two dissimilarities are commonly utilized in sample-based tasks, where only a set of i.i.d. samples is available, as they have an integral form that can be estimated using the Monte Carlo method. In this subsection, K denotes a positive semi-definite kernel function.

3.4.1. Kullback–Leibler Divergence

The Kullback–Leibler (KL) divergence is the divergence most often used in the ParVI field [20,27,33]. When the KL divergence is selected, the functional is written as
F_{KL}(\mu) := \mathrm{KL}(\mu \,|\, \pi) = \int \mu(x) \log \frac{\mu(x)}{\pi(x)} \, \mathrm{d}x .
The first variation of this function has the form
\frac{\delta F_{KL}(\mu)}{\delta \mu}(\cdot) = \frac{\delta \mathrm{KL}(\mu|\pi)}{\delta \mu}(\cdot) = \log \mu(\cdot) - \log \pi(\cdot) + C .   (11)
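For completeness, a one-step sketch of the calculation behind (11): perturbing μ along a direction σ with ∫σ dx = 0 gives

\frac{\mathrm{d}}{\mathrm{d}\epsilon} F_{KL}(\mu + \epsilon \sigma) \Big|_{\epsilon = 0} = \int \sigma(x) \left( \log \frac{\mu(x)}{\pi(x)} + 1 \right) \mathrm{d}x ,

so the integrand, read off up to the additive constant C, is exactly the first variation in (11).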

3.4.2. Kernel Stein Discrepancy

The Kernel Stein Discrepancy (KSD) has recently been adopted as the dissimilarity functional in the ParVI method KSDD [28], which is written as follows:
\mathrm{KSD}^2(\mu \,|\, \pi) := \iint k_\pi(x, y) \, \mathrm{d}\mu(x) \, \mathrm{d}\mu(y) ,   (12)
where k_π is the Stein kernel, defined through k_π(x, y) = ∇log π(x)^⊤ ∇log π(y) K(x, y) + ∇log π(x)^⊤ ∇_y K(x, y) + ∇_x K(x, y)^⊤ ∇log π(y) + ∇_x · ∇_y K(x, y). We follow the KSDD method and consider F_KSD(μ) := (1/2) KSD²(μ|π); the first variation of this functional is then written as
\frac{\delta F_{KSD}(\mu)}{\delta \mu}(\cdot) = \int k_\pi(x, \cdot) \, \mathrm{d}\mu(x) .

3.4.3. Maximum Mean Discrepancy

The Maximum Mean Discrepancy is a widely used integral probability metric, which has the following form:
\mathrm{MMD}^2(\mu \,|\, \pi) := \iint K(x, x') \, \mathrm{d}\mu(x) \, \mathrm{d}\mu(x') + \iint K(y, y') \, \mathrm{d}\pi(y) \, \mathrm{d}\pi(y') - 2 \iint K(x, y) \, \mathrm{d}\mu(x) \, \mathrm{d}\pi(y) .   (14)
Consider F_MMD(μ) := (1/2) MMD²(μ|π); then, the first variation of this functional is written as
\frac{\delta F_{\mathrm{MMD}}(\mu)}{\delta \mu}(\cdot) = \int K(x, \cdot) \, \mathrm{d}\mu(x) - \int K(y, \cdot) \, \mathrm{d}\pi(y) .

3.4.4. Sinkhorn Divergence

Sinkhorn divergence (SD) is derived as a computationally efficient counterpart to the famous Wasserstein distance by utilizing the regularization technique. The entropy-regularized Wasserstein distance is defined as
W_{p,\varepsilon}(\mu, \pi) = \inf_{\gamma \in \Gamma(\mu, \pi)} \left( \int_{\mathbb{R}^n \times \mathbb{R}^n} \| x - y \|^2 \, \mathrm{d}\gamma(x, y) \right)^{1/p} + \varepsilon \, \mathrm{KL}(\gamma \,|\, \mu \otimes \pi) ,   (16)
where ε > 0 is a regularization coefficient and μ⊗π denotes the product measure, i.e., μ⊗π(x, y) = μ(x)π(y); we fix p = 2 and abbreviate W_{2,ε} := W_ε. According to the Fenchel–Rockafellar theorem, the entropy-regularized Wasserstein problem W_ε (16) has an equivalent dual formulation, given as follows [60]:
W_\varepsilon(\mu, \pi) = \max_{f, g \in C(\mathbb{R}^n)} \; \langle \mu, f \rangle + \langle \pi, g \rangle - \varepsilon \left\langle \mu \otimes \pi, \exp\!\left( \tfrac{1}{\varepsilon}(f \oplus g - C) \right) - 1 \right\rangle ,   (17)
where C is the cost function in (16) and f ⊕ g denotes the tensor sum (x, y) ↦ f(x) + g(y). The maximizers f_{μ,π} and g_{μ,π} of (17) are called the W_ε-potentials of W_ε(μ, π). Note that, although computationally more efficient than the W_p distance, the W_ε distance is not a true metric, as there exists μ ∈ P_2(ℝ^n) such that W_ε(μ, μ) ≠ 0 when ε > 0, which restricts the applicability of W_ε. As a result, the following Sinkhorn divergence S_ε(μ, π): P_2(ℝ^n) × P_2(ℝ^n) → ℝ is proposed [60]:
S_\varepsilon(\mu, \pi) = W_\varepsilon(\mu, \pi) - \tfrac{1}{2} \left( W_\varepsilon(\mu, \mu) + W_\varepsilon(\pi, \pi) \right) .
Consider F S D ( · ) = S ε ( · , π ) . Let ( f μ , π , g μ , π ) be the W ε -potentials of W ε ( μ , π ) and let ( f μ , μ , g μ , μ ) be the W ε -potentials of W ε ( μ , μ ) . The first variation of the Sinkhorn functional F S D is
\frac{\delta F_{SD}}{\delta \mu} = f_{\mu, \pi} - f_{\mu, \mu} .

4. Methodology

In this section, we present our General Accelerated Dynamic-weight Particle-based Variational Inference (GAD-PVI) framework, detailed in Algorithm 1. We first introduce a novel augmented Information–Fisher–Rao space, and the Semi-Hamiltonian-Information–Fisher–Rao (SHIFR) flow in the space. The theoretical analysis of SHIFR shows that it usually possesses an additional decrease in the local functional dissipation compared to the Hamiltonian flow in the original information space. Then, effective finite-particle systems, which directly evolve the position, weight, and velocity of the particles via a set of ordinary differential equations, are constructed based on SHIFR flows in several IFR spaces with different underlying information metric tensors. We demonstrate that the mean-field limit of the constructed particle system exactly converges to the SHIFR flow in the corresponding probability space. Next, we develop the GAD-PVI framework by discretizing these continuous-time finite-particle formulations, which enables simultaneous accelerated updates of particles’ positions and dynamic adjustment of particles’ weights. We present nine effective GAD-PVI algorithms that use different underlying information metric tensors, dissimilarity functionals, and the associated finite-particle empirical approximation.
Algorithm 1 General Accelerated Dynamic-weight Particle-based Variational Inference (GAD-PVI) framework
Input: Initial distribution μ̃_0 = Σ_{i=1}^M w_0^i δ_{x_0^i}, position-adjusting step size η_pos, weight-adjusting step size η_wei, velocity-field-adjusting step size η_vel, velocity damping parameter γ.
1: Choose a suitable functional F and its empirical approximation U_μ̃ according to the sampling setting.
2: for k = 0, 1, …, T − 1 do
3:    for i = 1, 2, …, M do
4:       Update position x_{k+1}^i according to (26).
5:    end for
6:    for i = 1, 2, …, M do
7:       Adjust velocity v_{k+1}^i according to (27).
8:    end for
9:    if the CA strategy is adopted then
10:      for i = 1, 2, …, M do
11:         Adjust weight w_{k+1}^i according to (28).
12:      end for
13:   else if the DK strategy is adopted then
14:      for i = 1, 2, …, M do
15:         Calculate the duplicate/kill rate R_{k+1}^i = −λη ( U_{μ̃_k}(x_{k+1}^i) − (1/M) Σ_{i=1}^M U_{μ̃_k}(x_{k+1}^i) ).
16:      end for
17:      for i = 1, 2, …, M do
18:         if R_{k+1}^i > 0 then
19:            Duplicate the particle x_{k+1}^i with probability 1 − exp(−R_{k+1}^i) and kill one particle chosen uniformly from the rest.
20:         else
21:            Kill the particle x_{k+1}^i with probability 1 − exp(R_{k+1}^i) and duplicate one particle chosen uniformly from the rest.
22:         end if
23:      end for
24:   end if
25: end for
Output: μ̃_T = Σ_{i=1}^M w_T^i δ_{x_T^i}.

4.1. Information–Fisher–Rao Space and Semi-Hamiltonian-Information–Fisher–Rao Flow

To define the augmented Information–Fisher–Rao probability space, we introduce the Information–Fisher–Rao metric tensor G I F R ( μ ) , whose inverse is defined as follows.
G_{IFR}(\mu)^{-1} \Phi = G_I(\mu)^{-1} \Phi + \left( \Phi - \int \Phi \, \mathrm{d}\mu \right) \mu ,
where Φ ∈ T_μ^*P(ℝ^n) and G_I(μ) denotes a certain underlying information metric tensor. Note that G_IFR(μ) is formed by the inf-convolution of G_I(μ) and the Fisher–Rao metric tensor.
Based on G I F R ( μ ) , we introduce the following novel semi-Hamiltonian flow of F on the Information–Fisher–Rao space ( P ( R n ) , G I F R ( μ ) )
\partial_t \mu_t = \frac{\delta}{\delta \Phi} H_{IFR}(\mu_t, \Phi_t), \qquad \partial_t \Phi_t = -\gamma_t \Phi_t - \frac{1}{2} \frac{\delta}{\delta \mu} \int \Phi_t \, G_I(\mu_t)^{-1} \Phi_t \, \mathrm{d}x - \frac{\delta F(\mu_t)}{\delta \mu} ,   (21)
where Φ t denotes the Hamiltonian velocity and
H_{IFR}(\mu_t, \Phi_t) = \underbrace{\frac{1}{2} \int \Phi_t \, G_I(\mu_t)^{-1} \Phi_t \, \mathrm{d}x}_{\text{Information kinetic energy}} + \underbrace{\frac{1}{2} \int \Phi_t \left( \Phi_t - \int \Phi_t \, \mathrm{d}\mu_t \right) \mathrm{d}\mu_t}_{\text{Fisher–Rao kinetic energy}} + \underbrace{F(\mu_t)}_{\text{potential energy}} ,
denotes the Hamiltonian potential in the IFR space. Compared to the full Hamiltonian flow of F in the IFR space, the SHIFR flow (21) ignores the influence of the Fisher–Rao kinetic energy on the Hamiltonian field Φ t . Intuitively, at the gradient flow level, the Fisher–Rao metric modifies the mass in the vertical dimension, while the Wasserstein metric redistributes mass in the horizontal dimension. In the finite particle approximation system, we manipulate the weights and positions of particles to emulate the underlying infinite-dimensional mass of the distribution μ . The Fisher–Rao metric exclusively adjusts the weights of particles, serving as an analogy for altering the mass in the vertical dimension in the infinite case. Conversely, the Wasserstein metric changes the position of particles, acting as an analogy for adjusting the mass in the horizontal dimension. These two techniques interact with each other, collectively facilitating faster convergence.
Later, we will show that SHIFR can be directly transformed into a particle system consisting of ODEs on the positions, velocities, and weights of particles for proper underlying information metric tensors, while it is generally infeasible to obtain such a direct particle system from the corresponding full Hamiltonian flow because the Fisher–Rao kinetic energy is difficult to handle. Given that the Fisher–Rao kinetic energy term diminishes when approaching the flow's equilibrium, it is acceptable for the SHIFR flow to disregard this complex term while still maintaining the target distribution π as its stationary distribution. The following proposition establishes that the stationary point of the SHIFR flow (21) is still the target distribution.
Proposition 1.
The target distribution with zero velocity, (μ = π, Φ = 0) (where 0 denotes the function on ℝ^n that is identically zero), is the stationary distribution of the SHIFR flow (21) for any dissimilarity functional D(·|π) satisfying D(π|π) = 0 and any information metric tensor G_I(μ)[·].
Moreover, this semi-Hamiltonian flow would converge faster than the Hamiltonian flow in the original information space on account of the extra functional dissipation property. Here, we establish the extra decrease property in terms of functional dissipation of the SHIFR gradient flow (21) in the following proposition.
Proposition 2.
For arbitrary μ̄ ∈ P(ℝ^n) and Φ̄ ∈ C(ℝ^n), the local dissipation of the functional, dF(μ_t)/dt, following the SHIFR gradient flow (21) starting from (μ̄, Φ̄) has an additional functional dissipation term compared to the one following the Hamiltonian flow in the non-augmented space (6). Take the Wasserstein case as an example: denote the probability path starting from (μ̄, Φ̄) following the W-SHIFR flow as (μ_t^SHIFR, Φ_t^SHIFR), and the one following the Hamiltonian flow in the vanilla space as (μ_t^H, Φ_t^H). We have
\frac{\mathrm{d} F(\mu_t^{SHIFR})}{\mathrm{d}t} \Big|_{t=0} \; \leq \; \frac{\mathrm{d} F(\mu_t^{H})}{\mathrm{d}t} \Big|_{t=0} .
We acknowledge that these theoretical analyses are currently limited to the variational inference case, where the functional is set to a dissimilarity with respect to a target distribution π , denoted as F ( μ ) : = D ( μ | π ) . Future work will aim to explore the semi-Hamiltonian system in a broader statistical physics context.
With different underlying information metric tensors G_I(μ) in H_IFR(μ_t, Φ_t), we can obtain different SHIFR flows. Suitable choices of G_I(μ) include the Wasserstein metric tensor, the Kalman–Wasserstein metric tensor (KW-metric), and the Stein metric tensor (S-metric). For instance, the SHIFR flow with the Wasserstein metric (Wasserstein–SHIFR flow) is written as
\partial_t \mu_t = - \nabla \cdot \left( \mu_t \nabla \Phi_t \right) - \left( \frac{\delta F(\mu_t)}{\delta \mu} - \int \frac{\delta F(\mu_t)}{\delta \mu} \, \mathrm{d}\mu_t \right) \mu_t , \qquad \partial_t \Phi_t = -\gamma_t \Phi_t - \frac{\| \nabla \Phi_t \|^2}{2} - \frac{\delta F(\mu_t)}{\delta \mu} .   (24)
Note that in the subsequent section, we focus on the Wasserstein–SHIFR flow and defer the detailed formulations with respect to KW-SHIFR and S-SHIFR to Appendix B.1 and Appendix B.2 due to limited space.

4.2. Finite-Particles Formulations to SHIFR Flows

Now, we derive the finite-particle approximation to the SHIFR flow, which directly evolves the position x t i , weight w t i , and velocity v t i of the particles. Specifically, we construct the following ordinary differential equation system to simulate the Wasserstein–SHIFR flow (24):
\mathrm{d}x_t^i = v_t^i \, \mathrm{d}t, \qquad \mathrm{d}v_t^i = \left( -\gamma \, v_t^i - \nabla \frac{\delta F(\tilde{\mu}_t)}{\delta \mu}(x_t^i) \right) \mathrm{d}t, \qquad \mathrm{d}w_t^i = - \left( \frac{\delta F(\tilde{\mu}_t)}{\delta \mu}(x_t^i) - \sum_{i=1}^M w_t^i \frac{\delta F(\tilde{\mu}_t)}{\delta \mu}(x_t^i) \right) w_t^i \, \mathrm{d}t, \qquad \tilde{\mu}_t = \sum_{i=1}^M w_t^i \, \delta_{x_t^i} .   (25)
While the dynamic weight adjustment component of the proposed method (25) is quite similar to the ones in [23], as both are derived based on the Fisher–Rao structure of the underlying gradient flow, the proposed method can further achieve accelerated position updates. The following proposition demonstrates that the mean-field limit of the particle system (25) corresponds precisely to the Wasserstein–SHIFR flow in (24).
Proposition 3.
Suppose the empirical distribution μ̃_0^M of M weighted particles weakly converges to a distribution μ_0 as M → ∞. Then, the path of (25) starting from μ̃_0^M with initial velocities 0 weakly converges to a solution of the Wasserstein–SHIFR gradient flow (24) starting from μ_t|_{t=0} = μ_0 and Φ_t|_{t=0} = 0 as M → ∞.
Here, Proposition 3 serves as a bridge, substantiating that our proposed particle methods (25) possess appealing theoretical properties, as they are rooted in a mean-field limit of the Wasserstein–SHIFR flow (24) with superior theoretical attributes.

4.3. GAD-PVI Framework

Generally, it is impossible to obtain an analytic solution of the continuous finite-particle formulations (25); thus, a numerical integration method is required to derive an approximate solution. Note that any numerical solver, such as the implicit Euler method [61] and the higher-order Runge–Kutta method [62], can be used. Here, we follow the tradition of ParVIs to adopt the first-order explicit Euler discretization [63], since it is efficient and easy to implement [23], and propose our GAD-PVI framework, as listed in Algorithm 1. Like other ParVI methods, our GAD-PVI algorithm also sustains a set of particles and alters their attributes. However, the distinctive feature of GAD-PVI methods is their unique capacity to concurrently modify three different attributes, namely position, weight, and velocity. This pioneering approach sets GAD-PVI methods apart from other ParVI variants.

4.3.1. Updating Rules

Suppose the functional F and the empirical approximation U_μ̃ ≈ δF(μ̃)/δμ of its first variation have been chosen. We adopt a Jacobi-type strategy to update the position x_k^i, velocity v_k^i, and weight w_k^i; i.e., the calculations in the (k+1)-th iteration are based entirely on the variables obtained in the k-th iteration. Therefore, starting from M weighted particles located at {x_0^i}_{i=1}^M with weights {w_0^i}_{i=1}^M and velocities {v_0^i = 0}_{i=1}^M, GAD-PVI with respect to the Wasserstein–SHIFR flow first updates the positions of the particles according to the following rule:
x_{k+1}^i = x_k^i + \eta_{pos} \, v_k^i .   (26)
Then, it adjusts the velocity field as
v_{k+1}^i = (1 - \gamma \eta_{vel}) \, v_k^i - \eta_{vel} \, \nabla U_{\tilde{\mu}_k}(x_k^i) ,   (27)
and particles’ weights as follows:
w_{k+1}^i = w_k^i - \eta_{wei} \left( U_{\tilde{\mu}_k}(x_k^i) - \sum_{j=1}^M w_k^j \, U_{\tilde{\mu}_k}(x_k^j) \right) w_k^i .   (28)
Here, μ̃_k = Σ_{i=1}^M w_k^i δ_{x_k^i} denotes the empirical distribution, and η_pos, η_vel, and η_wei are the discretization step sizes. It can be verified that the total mass of μ̃_k is conserved and that μ̃_k remains a valid probability distribution throughout the whole GAD-PVI procedure, i.e., Σ_i w_k^i = 1 for all k. The detailed updating rules of GAD-PVI with respect to the KW-SHIFR and S-SHIFR flows can be found in Appendix B.3.
Notice that, in comparison to traditional ParVIs, incorporating the position acceleration scheme and the dynamic-weight scheme results in minimal additional computational cost, since the number of operations that dominate the time complexity, namely the evaluations of U_μ̃ and ∇U_μ̃, remains unchanged. A concrete illustration of one iteration is sketched below.
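The following is a minimal NumPy sketch of one WGAD-CA iteration implementing (26)-(28). This is not the authors' released code; the callables U and grad_U stand for one of the empirical approximations described in Section 4.3.2, and their signatures here are assumptions of the sketch.

import numpy as np

def gad_pvi_step(x, v, w, U, grad_U, eta_pos, eta_vel, eta_wei, gamma):
    """One explicit-Euler WGAD-CA iteration, cf. Eqs. (26)-(28).

    x: (M, d) positions, v: (M, d) velocities, w: (M,) weights summing to one.
    U(x, w, q) -> (Q,) and grad_U(x, w, q) -> (Q, d) approximate the first
    variation and its gradient at the query points q (assumed interfaces)."""
    u_vals = U(x, w, x)            # U_{mu_k}(x_k^i) at the current particles
    g_vals = grad_U(x, w, x)       # grad U_{mu_k}(x_k^i)
    x_new = x + eta_pos * v                                   # position update (26)
    v_new = (1.0 - gamma * eta_vel) * v - eta_vel * g_vals    # velocity update (27)
    # CA weight adjustment (28); the trailing factor w keeps sum(w_new) == 1 exactly
    w_new = w - eta_wei * (u_vals - np.dot(w, u_vals)) * w
    return x_new, v_new, w_new

All three updates read only the k-th iterates, matching the Jacobi-type convention above, and the weight update conserves the total mass exactly.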

4.3.2. Dissimilarities and Approximation Approaches

Our GAD-PVI framework is compatible with different dissimilarities ( F ) and their associated approximation approaches of the first variation. By selecting appropriate dissimilarities and approximation approaches, our GAD-PVI framework can be employed effectively for both score-based tasks and sample-based tasks.

Score-Based Scenario

When the score function value of the target distribution is accessible, the commonly used underlying dissimilarities in ParVIs are KL-divergence [20,27,33] and KSD [64].
For the KL divergence, the first variation (11) includes the term log μ(x), which is ill-defined for the discrete empirical distribution μ̃_k. Consequently, approximation approaches are necessary to resolve this issue. Commonly employed approximation approaches include BLOB [29] and GFSD [27].
The BLOB approximation approach reformulates the intractable term log μ as δ/δμ E_μ[log μ] and smooths the density with a kernel function K, resulting in the approximation
\log \tilde{\mu} \approx \frac{\delta}{\delta \tilde{\mu}} \mathbb{E}_{\tilde{\mu}} \log (\tilde{\mu} * K) := \log \sum_{i=1}^M w^i K(\cdot, x^i) + \sum_{i=1}^M \frac{w^i K(\cdot, x^i)}{\sum_{j=1}^M w^j K(x^i, x^j)}
for a discrete density μ̃ = Σ_{i=1}^M w^i δ_{x^i}. This leads to the following approximation result:
U_{\tilde{\mu}_k}(x) = -\log \pi(x) + \log \sum_{i=1}^M w_k^i K(x, x_k^i) + \sum_{i=1}^M \frac{w_k^i K(x, x_k^i)}{\sum_{j=1}^M w_k^j K(x_k^i, x_k^j)} .
The GFSD approximation approach directly approximates μ by smoothing the empirical distribution μ̃ with a kernel function K, i.e., μ̂ = μ̃ ∗ K = Σ_{i=1}^M w^i K(·, x^i), which leads to the following approximations:
U_{\tilde{\mu}_k}(x) = -\log \pi(x) + \log \sum_{i=1}^M w_k^i K(x, x_k^i) ,
\nabla U_{\tilde{\mu}_k}(x) = -\nabla \log \pi(x) + \frac{\sum_{i=1}^M w_k^i \nabla_x K(x, x_k^i)}{\sum_{i=1}^M w_k^i K(x, x_k^i)} .
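As an example, the following is a minimal NumPy sketch of the KL-GFSD estimates above with an RBF kernel; the bandwidth h and the callables log_pi and score (the unnormalized log-density of π and its gradient) are assumptions of this sketch.

import numpy as np

def kl_gfsd(q, x, w, log_pi, score, h=1.0):
    """KL-GFSD estimates of U(q) and grad U(q) for weighted particles (x, w).

    log_pi(q) -> (Q,) unnormalised log-density of pi (its constant cancels in the
    weight update); score(q) -> (Q, d) its gradient. Both callables are assumed."""
    diff = q[:, None, :] - x[None, :, :]                      # (Q, M, d)
    K = np.exp(-(diff ** 2).sum(-1) / (2 * h * h))            # RBF kernel values, (Q, M)
    dens = np.maximum(K @ w, 1e-300)                          # smoothed density sum_i w_i K(q, x_i)
    U = -log_pi(q) + np.log(dens)
    grad_dens = np.einsum('qmd,qm,m->qd', -diff / (h * h), K, w)   # gradient of the smoothed density
    grad_U = -score(q) + grad_dens / dens[:, None]
    return U, grad_U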
The KSDD directly approximates the first variation and its gradient of KSD (12) by employing Monte Carlo sampling with the empirical distribution μ ˜ . KSDD constructs the following finite-particle approximations:
U_{\tilde{\mu}_k}(x) = \sum_{i=1}^M w_k^i \, k_\pi(x_k^i, x) ,
\nabla U_{\tilde{\mu}_k}(x) = \sum_{i=1}^M w_k^i \, \nabla_x k_\pi(x_k^i, x) .

Sample-Based Scenario

When only samples {y_j}_{j=1}^N drawn i.i.d. from the target distribution π are accessible, we take (1/N) Σ_{j=1}^N δ_{y_j} as a surrogate for π; the commonly used underlying dissimilarities in this case are MMD [30] and SD [52].
MMDF directly approximates the first variation of MMD (14) and its gradient by Monte Carlo sampling with both the empirical distribution μ̃ and the samples from the target distribution. Let μ̃_k = Σ_{i=1}^M w_k^i δ_{x_k^i} and π = Σ_{j=1}^N a_j δ_{y_j}; MMDF constructs the following finite-particle approximations:
U_{\tilde{\mu}_k}(x) = \sum_{i=1}^M w_k^i K(x_k^i, x) - \sum_{j=1}^N a_j K(y_j, x) ,
\nabla U_{\tilde{\mu}_k}(x) = \sum_{i=1}^M w_k^i \nabla_x K(x_k^i, x) - \sum_{j=1}^N a_j \nabla_x K(y_j, x) .
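A minimal self-contained sketch of these MMDF estimates with an RBF kernel of bandwidth h (the kernel choice and bandwidth are assumptions of the sketch); choosing a_j = 1/N recovers the surrogate target described earlier.

import numpy as np

def mmdf(q, x, w, y, a, h=1.0):
    """MMDF estimates of U(q) and grad U(q) for weighted particles (x, w)
    and weighted target samples (y, a), using an RBF kernel."""
    dx = q[:, None, :] - x[None, :, :]                    # (Q, M, d)
    dy = q[:, None, :] - y[None, :, :]                    # (Q, N, d)
    Kx = np.exp(-(dx ** 2).sum(-1) / (2 * h * h))         # K(x_i, q)
    Ky = np.exp(-(dy ** 2).sum(-1) / (2 * h * h))         # K(y_j, q)
    U = Kx @ w - Ky @ a
    grad_U = np.einsum('qmd,qm,m->qd', -dx / (h * h), Kx, w) \
           - np.einsum('qnd,qn,n->qd', -dy / (h * h), Ky, a)
    return U, grad_U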
SDF directly leverages samples from μ and π to obtain the approximated Sinkhorn potentials f μ ˜ k , π and f μ ˜ k , μ ˜ k . Consequently, it is able to approximate the first variation of SD as follows.
U μ ˜ k ( x ) = f μ ˜ k , π f μ ˜ k , μ ˜ k ,
U μ ˜ k ( x ) = f μ ˜ k , π f μ ˜ k , μ ˜ k .
The details of utilizing samples to obtain the approximated Sinkhorn potential can be found in [31,52].

4.3.3. An Alternative Weight Adjusting Approach

In addition to the Continuous Adjusting (CA) strategy, the Duplicate/Kill (DK) strategy, a probabilistic discretization of the Fisher–Rao part of (24), can also be adopted in GAD-PVI. This strategy duplicates/kills particle x_{k+1}^i according to an exponential clock with an instantaneous rate:
R_{k+1}^i = -\eta_{wei} \left( \frac{\delta F(\tilde{\mu}_k)}{\delta \mu}(x_k^i) - \sum_{j=1}^M w_k^j \frac{\delta F(\tilde{\mu}_k)}{\delta \mu}(x_k^j) \right) .
Specifically, if R_{k+1}^i > 0, we duplicate the particle x_{k+1}^i with probability 1 − exp(−R_{k+1}^i) and kill another particle chosen uniformly at random to conserve the total mass; if R_{k+1}^i < 0, we kill the particle x_{k+1}^i with probability 1 − exp(R_{k+1}^i) and duplicate another particle chosen uniformly at random. By replacing the CA strategy (28) in the GAD-PVI framework with this scheme, we obtain the DK variants of the GAD-PVI methods.
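The following is a minimal sketch of one DK pass under these rules, for the equal-weight particle system maintained by the DK variants; the random-number interface and the convention of copying velocities together with positions are assumptions of this sketch.

import numpy as np

def dk_adjust(x, v, U_vals, eta_wei, rng=None):
    """One Duplicate/Kill pass for equal-weight particles.

    U_vals: (M,) values of the approximate first variation at the current particles.
    Positions and velocities are copied together when a particle is duplicated."""
    rng = rng or np.random.default_rng()
    M = x.shape[0]
    R = -eta_wei * (U_vals - U_vals.mean())                 # rate with uniform weights w_j = 1/M
    for i in range(M):
        others = [k for k in range(M) if k != i]
        if R[i] > 0 and rng.random() < 1.0 - np.exp(-R[i]):
            j = rng.choice(others)
            x[j], v[j] = x[i].copy(), v[i].copy()           # duplicate particle i over particle j
        elif R[i] < 0 and rng.random() < 1.0 - np.exp(R[i]):
            j = rng.choice(others)
            x[i], v[i] = x[j].copy(), v[j].copy()           # kill particle i, replace it by particle j
    return x, v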

4.3.4. GAD-PVI Instances

With different underlying information metric tensors (W-metric, KW-metric and S-metric), weight adjustment approaches (CA and DK), and dissimilarities/associated approximation approaches (KL-BLOB, KL-GFSD, KSD-KSDD, MMD-MMDF, SD-SDF), we can derive various instances of GAD-PVI, named as WGAD/KWGAD/SGAD-CA/DK-BLOB/GFSD/KSDD/MMDF/SDF. While the aforementioned updating rules, approximation approaches, and weight adjustment methods have been previously proposed, we conduct a comprehensive investigation of these components within a single framework and view them from a modular standpoint. Moreover, GAD-PVI designs a general inference framework for both the score-function scenario and the sample-based scenario, whereas previous works have only focused on one of these.

5. Experiments

In this section, we conduct empirical studies with our GAD-PVI algorithms. Our empirical studies include score-based tasks where the score function value of the target distribution is accessible, as well as sample-based tasks where only i.i.d. samples from the target distribution are available. Here, we focus on the instances of GAD-PVI with respect to the W-SHIFR flows. The experimental results for methods with respect to the KW-SHIFR and S-SHIFR flows are provided in Appendix A.3. Given that the Full Hamiltonian Flow on the IFR Space does not facilitate the development of practical algorithms, it is not feasible to incorporate the direct combination of an accelerated position update method (such as ACCEL, WNES, AIG) with a dynamic weight adjustment method (such as DPVI) into our experiments.
Our proposed algorithm simultaneously incorporates both accelerated position updates and dynamic weight adjustments. Therefore, comparing it to these baseline methods, which only possess one of these characteristics, can function as an ablation study for the components within our method.
Compared to existing dynamic-weight ParVIs, GAD-PVI benefits from the accelerated position update derived from the Hamiltonian acceleration mechanism of the SHIFR flow, enabling it to converge more quickly to the target distribution. When compared to ParVIs with only accelerated position updates, GAD-PVI benefits from the dynamic weight adjustment resulting from the Fisher–Rao component of the SHIFR flow, achieving superior approximation accuracy to the target distribution.

5.1. Score-Based Experiments

In the setting where we can access the analytical score function value of the target distribution, we choose KL-BLOB and KL-GFSD as the dissimilarity and empirical approximation of our GAD-PVI methods. Note that we do not include GAD-PVI methods with the KSDD empirical approaches, as they are more computationally expensive and have been widely reported to be less stable [23,64]. We include four classes of methods as our baseline: classical ParVI algorithms (SVGD, GFSD, and BLOB), the Nesterov accelerated ParVI algorithms (WNES-BLOB/GFSD), the Hamiltonian accelerated ParVI algorithms (WAIG-BLOB/GFSD), and the Dynamic-weight ParVI algorithms (DPVI-CA/DK-BLOB/GFSD).
In this score function setting, we consider four tasks, comprising two simulations: a 10-D Single-mode Gaussian model (SG) and a Gaussian mixture model (GMM), as well as two real-world applications: Gaussian Process (GP) regression and Bayesian neural network (BNN). For all the algorithms, the particles’ weights are initialized to be equal. In the first three experiments, we tune the parameters to achieve the best W 2 distance. In the BNN task, we split 1 / 5 of the training set as our validation set to tune the parameters. Note that the position step size is tuned via a grid search for the fixed-weight ParVI algorithms and then used in the corresponding dynamic-weight algorithms. The acceleration parameters and weight adjustment parameters are tuned via grid search for each specific algorithm. We repeat all the experiments 10 times and report the average results. Due to limited space, only parts of the results are reported in this section. We refer readers to Appendix C for the results on SG and additional results for GMM, GP, and BNN.

5.1.1. Gaussian Mixture Model

We consider approximating a 10-D Gaussian mixture model with two components, weighted by 1/3 and 2/3, respectively. We run all algorithms with particle numbers M ∈ {32, 64, 128, 256, 512}. In Figure 1, we report the 2-Wasserstein (W_2) distance between the empirical distribution generated by each algorithm and the target distribution with respect to the iterations of the different ParVI methods. We generate 5000 samples from the target distribution π as a reference and evaluate the W_2 distance using the POT library (http://jmlr.org/papers/v22/20-451.html, accessed on 16 August 2023).
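For reference, the W_2 evaluation can be performed along the following lines with the POT library (a sketch; the array names for the weighted particles and the reference samples are assumptions).

import numpy as np
import ot  # POT: Python Optimal Transport

def w2_distance(x, w, ref):
    """2-Wasserstein distance between weighted particles (x, w) and reference samples ref."""
    b = np.full(ref.shape[0], 1.0 / ref.shape[0])   # uniform weights on the reference samples
    M = ot.dist(x, ref)                             # squared Euclidean cost matrix (POT default)
    return np.sqrt(ot.emd2(w, b, M))                # exact OT with squared costs, then take the root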
The results demonstrate that our GAD-PVI algorithms consistently outperform their counterpart with only one (or none) of the accelerated position update strategy and dynamic weight adjustment approach. Furthermore, the CA weight-adjustment approach usually results in a lower W 2 compared to the DK scheme, and WGAD-CA-BLOB/GFSD usually has the fastest convergence and the lowest final W 2 distance to the target.

5.1.2. Gaussian Process Regression

The Gaussian Process (GP) model is widely adopted for uncertainty quantification in regression problems [65]. We follow the experimental setting in [66] and use the LIDAR dataset (denoted as D = {(x_i, y_i)}_{i=1}^N), which consists of 221 observations of scalar variables x_i and y_i. We denote x = [x_1, x_2, …, x_N]^T and y = [y_1, y_2, …, y_N]^T, and the target log-posterior with respect to the model parameter ϕ = (ϕ_1, ϕ_2) is defined as follows:
\log p(\phi \,|\, \mathcal{D}) = -\frac{y^{T} K_y^{-1} y}{2} - \frac{\log \det (K_y)}{2} - \log (1 + x^{T} x) .
Here, K_y = K + 0.04 I is the covariance matrix with K_{i,j} = exp(ϕ_1) exp(−exp(ϕ_2)(x_i − x_j)²), and I represents the identity matrix. In this task, we set the particle number to M = 128 for all the algorithms.
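A minimal sketch of evaluating this log-posterior through a Cholesky factorization of K_y; the variable names are assumptions of the sketch, and the last term is kept exactly as in the displayed formula.

import numpy as np

def gp_log_posterior(phi, x, y):
    """Evaluate the LIDAR GP log-posterior of the displayed formula (sketch).

    phi = (phi1, phi2); x, y are the N-dimensional observation vectors."""
    K = np.exp(phi[0]) * np.exp(-np.exp(phi[1]) * (x[:, None] - x[None, :]) ** 2)
    K_y = K + 0.04 * np.eye(len(x))
    L = np.linalg.cholesky(K_y)                               # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # K_y^{-1} y
    quad = y @ alpha                                          # y^T K_y^{-1} y
    logdet = 2.0 * np.log(np.diag(L)).sum()                   # log det K_y
    return -0.5 * quad - 0.5 * logdet - np.log(1.0 + x @ x)   # last term as in the formula above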
We report the W 2 distance between the empirical distribution after 10,000 iterations and the target distribution in Table 2. The target distribution is approximated by 10,000 reference particles generated by the HMC method after it achieves its equilibrium [67]. It can be observed that both the accelerated position update and the dynamic weight adjustment result in a decreased W 2 , and GAD-PVI algorithms consistently achieve the lowest W 2 to the target. Furthermore, the results also show that the CA variants usually outperform their DK counterpart, as CA is able to adjust the weight continuously on [ 0 , 1 ] while DK sets the weight to either 0 or 1 / M .
In Figure 2, we plot the contour lines of the log posterior and the particles generated by four representative algorithms, namely BLOB, WAIG-BLOB, DPVI-CA-BLOB, and WGAD-CA-BLOB, at different iterations (0, 100, 500, 2000, 10,000). The results indicate that the particles in WAIG-BLOB and WGAD-CA-BLOB exhibit a faster convergence to the high probability area of the target due to their accelerated position updating strategy, and the DPVI-CA and WGAD-CA algorithms finally offer broader final coverage, as the CA dynamic weight adjustment strategy enables the particles to represent the region with arbitrary local density mass instead of a fixed 1 / M mass.

5.1.3. Bayesian Neural Network

In this experiment, we study a Bayesian regression task with a Bayesian neural network on four datasets from UCI (http://archive.ics.uci.edu/ml/datasets, accessed on 16 August 2023) and LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, accessed on 16 August 2023). Given a training dataset D = {(x_i, y_i)}_{i=1}^N, where x_i denotes the input covariate vector and y_i is the corresponding prediction, this task aims at predicting the output y* for a new input x* from the perspective of Bayesian inference:
p(y^* \,|\, x^*, \mathcal{D}) = \int p(y^* \,|\, x^*, w) \, p(w \,|\, \mathcal{D}) \, \mathrm{d}w ,
where w denotes the model parameter of the neural network and p ( w | D ) is the target posterior given training dataset D . Since the explicit integration is intractable, one can resort to sampling methods to approximate the target posterior p ( w | D ) and transfer the integration problem into calculating the average with a set of samples. We follow the experimental setting from [20,23], which models the output as a Gaussian distribution and uses a Gamma ( 1 , 0.1 ) prior for the inverse covariance. We use a one-hidden-layer neural network with 50 hidden units and maintain 128 particles. For all the datasets, we set the batch size as 128.
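Once a ParVI method has produced weighted particles over the network parameters, the predictive integral above is approximated by a weighted average of per-particle predictions; the following is a minimal sketch, where predict(theta, x) is an assumed stand-in for the network forward pass (parameters are written as theta here to avoid clashing with the particle weights).

import numpy as np

def predictive_mean(particles, weights, predict, x_star):
    """Weighted-particle approximation of E[y* | x*, D].

    particles: list of network parameter sets theta_i, weights: (M,) particle weights,
    predict(theta, x) -> network output; all three are assumed interfaces."""
    outputs = np.array([predict(theta, x_star) for theta in particles])
    return np.tensordot(weights, outputs, axes=1)   # sum_i w_i * f(x*; theta_i)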
We present the Root Mean Squared Error (RMSE) of various ParVI algorithms in Table 3. The results demonstrate that the combination of the accelerated position updating strategy and the dynamically weighted adjustment leads to a lower RMSE. Notably, WGAD-CA type algorithms outperform other methods in the majority of cases.

5.2. Sample-Based Experiments

In the setting where we can only obtain the i.i.d. samples of the target distribution, we choose MMD-MMDF and SD-SDF as the dissimilarity and empirical approximation of our GAD-PVI methods.
We are the first to consider the intricate structure of the underlying gradient flow within the ParVI algorithm under the sample-based setting. Consequently, we include classical ParVI algorithms (MMDF and SDF) as our baseline methods, along with DPVI-type and WAIG-type algorithms as ablation baselines. In this sample-based setting, we consider two tasks: the shape morphing task between different 2-D icons and the Sketching task of high-resolution pictures. For all the algorithms, the particles’ weights are initialized to be equal.
Note that the position steps are tuned via grid search for the baseline ParVI algorithms and then used in the corresponding GAD-PVI algorithms. The acceleration parameters and weight adjustment parameters are tuned via grid search for each specific algorithm. We repeat all the experiments 10 times and report the average results.

5.2.1. Shape Morphing

In this experiment, we study the task of shape morphing between different icons. The source shape and target shape are distributions lying in R 2 . We need to move points sampled uniformly from shape A to shape B. This task is often considered in the Wasserstein Barycenter literature [68,69,70]. Note that, to make the task more complex, we add an unbalanced distortion on the X-axis to the target distribution so that the probability distribution density on the left side of the target distribution is greater than that on the right. We consider a single loop shape morphing between the four icons, i.e., A(CAT), B(SPIRAL), C(HEART), and D(CHICKEN).
In Figure 3, we report the 2-Wasserstein ( W 2 ) distance between the empirical distribution generated by each algorithm and the target icon with respect to iterations of different ParVI methods. We generate 2000 samples from the target distribution π as a reference to evaluate the W 2 distance. The results demonstrate that our GAD-PVI algorithms consistently outperform their baselines.
Table 4 presents the average W 2 distance to the target distribution after 2000 iterations (MMDF-Type) or 100 iterations (SDF-Type). It can be observed that GAD-PVI methods consistently achieve the lowest W 2 distance to the target, attributed to their dynamic weight adjustment.

5.2.2. Picture Sketching

This section presents the results of the picture sketching experiment, a task that can be viewed as approximating a given picture with particles and is therefore called picture sketching. Specifically, this section uses real cheetah images as original data, image pixels as particle points, and gray values of pixels as weights to generate discrete target distribution in R 2 space. In this experiment, all algorithms are initialized to 1000 equal-weight particle points, and all particle points are initially sampled from uniform noise distribution.
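A minimal sketch of how such a discrete target distribution can be constructed, assuming the cheetah image is already loaded as a 2-D NumPy array of gray values in [0, 1].

import numpy as np

def image_to_target(gray):
    """Turn a 2-D array of gray values in [0, 1] into a weighted point cloud in R^2.

    Pixel coordinates become support points y_j and normalised gray values their
    weights a_j, matching the picture-sketching setup."""
    h, w = gray.shape
    ii, jj = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    y = np.stack([jj.ravel() / w, 1.0 - ii.ravel() / h], axis=1)   # map the pixel grid into [0, 1]^2
    a = gray.ravel().astype(float)
    return y, a / a.sum()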
In Figure 4, we report the 2-Wasserstein ( W 2 ) between the empirical distribution generated by each algorithm and the target picture with respect to iterations of different ParVI methods. The results demonstrate that our GAD-PVI algorithms consistently outperform their baselines and the CA weight-adjustment approach usually results in the lowest W 2 distance to the target and fastest convergence.
Figure 5 shows the sketching process from the initial noise to the target picture of different ParVIs. In this visualization, particle points with higher weights are represented as pixels with brighter intensities. The results demonstrate that ParVI with accelerated position updates converges more rapidly towards the target picture. Furthermore, ParVI with dynamic-weight adjustment exhibits superior ability in accurately depicting the target picture. Specifically, the ParVI algorithm with dynamic-weight adjustment effectively captures the chiaroscuro between the cheetah’s two ears, which is not evident in the fixed-weight baseline approach.

6. Conclusions

In this paper, we propose the General Accelerated Dynamic-Weight Particle-based Variational Inference (GAD-PVI) framework, which adopts an accelerated position update scheme and a dynamic weight adjustment approach simultaneously. Our GAD-PVI framework is developed by discretizing the Semi-Hamiltonian Information–Fisher–Rao (SHIFR) flow on the novel Information–Fisher–Rao space. The theoretical analysis demonstrates that the SHIFR flow yields an additional decrease in the local functional dissipation compared to the Hamiltonian flow in the vanilla information space. We propose an effective particle system that evolves the position, weight, and velocity of particles via a set of ODEs for the SHIFR flows with different underlying information metrics. By directly discretizing the proposed particle system, we obtain our GAD-PVI framework. Several effective instances of the GAD-PVI framework have been provided by employing three distinct dissimilarity functionals and associated empirical approaches under the Wasserstein/Kalman–Wasserstein/Stein metric. Empirical studies demonstrate the faster convergence and reduced approximation error of GAD-PVI methods over the state-of-the-art.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/e26080679/s1.

Author Contributions

Conceptualization, F.W., H.Z. (Huminhao Zhu) and C.Z.; Funding acquisition, C.Z. and H.Q.; Methodology, F.W.; Project administration, C.Z., H.Z. (Hanbin Zhao) and H.Q.; Resources, C.Z., H.Z. (Hanbin Zhao) and H.Q.; Software, F.W. and H.Z. (Huminhao Zhu); Supervision, C.Z.; Visualization, F.W. and H.Z. (Huminhao Zhu); Writing—original draft, F.W. and H.Z. (Huminhao Zhu); Writing—review and editing, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (Grant No: 62206248).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author(s).

Acknowledgments

We extend our gratitude to Zhijian Li of Zhejiang University for insightful discussions on variational inference.

Conflicts of Interest

All authors are employees or students of Zhejiang University. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

    The following abbreviations are used in this manuscript:
MCMC: Markov Chain Monte Carlo
ParVI: Particle-based Variational Inference
DPVI: Dynamic-weight Particle-based Variational Inference
IFR: Information–Fisher–Rao
SHIFR: Semi-Hamiltonian Information–Fisher–Rao
WFR: Wasserstein–Fisher–Rao
GAD-PVI: General Accelerated Dynamic-weight Particle-based Variational Inference
SVGD: Stein Variational Gradient Descent
GFSD: Gradient Flow with Smoothed Density
KSDD: Kernel Stein Discrepancy Descent
WAG: Wasserstein Accelerated Gradient
ACCEL: The Accelerated Flow method
WNES: Wasserstein Nesterov's method
AIG: Accelerated Information Gradient Flow
MMD: Maximum Mean Discrepancy
MMDF: Maximum Mean Discrepancy Flow
SD: Sinkhorn Divergence
SDF: Sinkhorn Divergence Flow
NSGF: Neural Sinkhorn Gradient Flow
JSD: J-JKO

Appendix A. Definitions and Proofs

Appendix A.1. Definition of Information Metric in Probability Space

To make the definition of the general information gradient flow in the probability space precise, we provide a brief review of general information metrics in the probability space, following [37,59]:
Definition A1
(Information metric in probability space). Denote the tangent space at $\mu \in \mathcal{P}(\mathbb{R}^n)$ by $T_\mu\mathcal{P}(\mathbb{R}^n) = \{\sigma \in C(\mathbb{R}^n) : \int \sigma\,\mathrm{d}x = 0\}$. The cotangent space at $\mu$ is denoted by $T^*_\mu\mathcal{P}(\mathbb{R}^n)$ and can be treated as the quotient space $C(\mathbb{R}^n)/\mathbb{R}$. An information metric tensor $G(\mu)[\cdot] : T_\mu\mathcal{P}(\mathbb{R}^n) \to T^*_\mu\mathcal{P}(\mathbb{R}^n)$ is an invertible mapping from $T_\mu\mathcal{P}(\mathbb{R}^n)$ to $T^*_\mu\mathcal{P}(\mathbb{R}^n)$. This metric tensor defines the information metric (i.e., the inner product) on the tangent space $T_\mu\mathcal{P}(\mathbb{R}^n)$: for $\sigma_1, \sigma_2 \in T_\mu\mathcal{P}(\mathbb{R}^n)$ and $\Phi_i = G(\mu)[\sigma_i]$, $i = 1, 2$,
\[
g_\mu(\sigma_1, \sigma_2) = \int \sigma_1\, G(\mu)\, \sigma_2 \,\mathrm{d}x = \int \Phi_1\, G(\mu)^{-1} \Phi_2 \,\mathrm{d}x .
\]
We then denote the general information probability space by $(\mathcal{P}(\mathbb{R}^n), G(\mu))$. Once a metric is specified, the probability space $\mathcal{P}(\mathbb{R}^n)$ together with this metric can be viewed as an infinite-dimensional Riemannian manifold, the so-called density manifold [37], on which gradient flows can be defined.

Appendix A.2. Full Hamiltonian Flow on the IFR Space and the Fisher–Rao Kinetic Energy

To develop ParVI methods that possess both accelerated position updates and dynamical weight adjustment simultaneously, a natural choice is to directly simulate the Hamiltonian flow on the augmented IFR space. By substituting the IFR metric (20) into the general Hamiltonian flow (6), we derive the full Hamiltonian flow on the IFR space, which is the direct accelerated probabilistic flow on the IFR space, of the form
\[
\begin{aligned}
&\partial_t \mu_t = G(\mu_t)^{-1}\Phi_t + \Big(\Phi_t - \int \Phi_t \,\mathrm{d}\mu_t\Big)\mu_t, \\
&\partial_t \Phi_t + \gamma_t \Phi_t
+ \underbrace{\frac{1}{2}\frac{\delta}{\delta\mu}\Big(\int \Phi_t\, G(\mu_t)^{-1}\Phi_t \,\mathrm{d}x\Big)}_{\text{Information kinetic energy}}
+ \underbrace{\frac{1}{2}\Phi_t^2 - \Phi_t\!\int \Phi_t \,\mathrm{d}\mu_t}_{\text{Fisher--Rao kinetic energy}}
+ \frac{\delta F(\mu_t)}{\delta\mu} = 0,
\end{aligned}
\]
where $\Phi_t$ is the Hamiltonian (velocity) field; $\frac{\delta F(\mu_t)}{\delta\mu}$ represents the potential energy dissipation; $\frac{1}{2}\frac{\delta}{\delta\mu}\big(\int \Phi_t\, G(\mu_t)^{-1}\Phi_t \,\mathrm{d}x\big)$ represents the kinetic energy dissipation of the information transport; and $\frac{1}{2}\Phi_t^2 - \Phi_t\int\Phi_t\,\mathrm{d}\mu_t$ represents the kinetic energy dissipation of the Fisher–Rao distortion. As far as we know, the particle formulation of the full Hamiltonian flow on the IFR space (A2) is intractable because of the Fisher–Rao kinetic energy term. When deriving particle systems, $\nabla\Phi_t$ can be straightforwardly approximated by the velocities $v_t^i$, but the Hamiltonian field $\Phi_t$ itself is hard to approximate with finitely many points and to update iteratively. In fact, even the particle formulation of the accelerated Fisher–Rao flow has not been derived, owing to this difficulty [33]. Therefore, we ignore the influence of the Fisher–Rao kinetic energy on the Hamiltonian field and derive the SHIFR flow (21). We point out that the Fisher–Rao kinetic energy vanishes as the flow converges to the equilibrium $(\mu = \pi, \Phi = 0)$, which matches the behavior of the kinetic energy in a physical dynamical system. Therefore, neglecting the Fisher–Rao kinetic energy is tenable, and the SHIFR flow still has the target distribution as its stationary distribution, as shown in Proposition 1.

Appendix A.3. Proof of Proposition 3

First, we introduce a technical lemma for the proof of Proposition 3.
Lemma A1.
The following probability flow dynamic formulation and particle system formulation are equivalent:
\[
\begin{cases}
\partial_t \mu_t + \nabla\cdot(\mu_t \nabla\Phi_t) = 0, \\[2pt]
\partial_t \Phi_t + \gamma_t \Phi_t + \frac{1}{2}\|\nabla\Phi_t\|^2 + \frac{\delta F(\mu_t)}{\delta\mu} = 0;
\end{cases}
\]
\[
\begin{cases}
\frac{\mathrm{d}}{\mathrm{d}t} X_t = V_t, \\[2pt]
\frac{\mathrm{d}}{\mathrm{d}t} V_t = -\gamma_t V_t - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t).
\end{cases}
\]
Proof. 
We start with the calculation of the gradient of the kinetic term. For a twice-differentiable $\Phi(x)$, we have
\[
\nabla\Big(\frac{1}{2}\|\nabla\Phi\|^2\Big) = \nabla^2\Phi\,\nabla\Phi = (\nabla\Phi\cdot\nabla)\nabla\Phi .
\]
From (A3), we have
\[
\partial_t \mu_t + \nabla\cdot(\mu_t \nabla\Phi_t) = 0,
\]
which is the continuity equation of $\mu_t$ under the vector field $\nabla\Phi_t$ [71]. Hence, we have the following equation on the particle level (denoting $V_t = \nabla\Phi_t(X_t)$):
\[
\frac{\mathrm{d}}{\mathrm{d}t} X_t = \nabla\Phi_t(X_t) = V_t .
\]
Then, the vector field satisfies
\[
\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}t} V_t
&= \frac{\mathrm{d}}{\mathrm{d}t} \nabla\Phi_t(X_t)
\overset{(1)}{=} \big(\partial_t + \nabla\Phi_t(X_t)\cdot\nabla\big)\nabla\Phi_t(X_t) \\
&\overset{(2)}{=} -\gamma_t\nabla\Phi_t(X_t) - \nabla\Big(\frac{1}{2}\|\nabla\Phi_t(X_t)\|^2\Big) - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t) + (\nabla\Phi_t(X_t)\cdot\nabla)\nabla\Phi_t(X_t) \\
&\overset{(3)}{=} -\gamma_t\nabla\Phi_t(X_t) - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t)
\overset{(4)}{=} -\gamma_t V_t - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t),
\end{aligned}
\]
where equality (1) follows from the material derivative in fluid dynamics [72], equality (2) comes from the PDE for $\Phi_t$ in (A3), equality (3) comes from canceling terms via (A5), and equality (4) comes from the definition of $V_t$.    □
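Lemma A1 identifies the PDE (A3) with the damped second-order particle system (A4). Below is a minimal numerical sketch of that particle system for the linear potential functional $F(\mu) = \int V\,\mathrm{d}\mu$ with $V(x) = \|x\|^2/2$, so that $\delta F/\delta\mu = V$ exactly and its gradient is simply $x$; this choice of $F$ is ours, for illustration only.

```python
# Minimal sketch of the damped particle system in Lemma A1 for the toy functional
# F(mu) = \int V dmu with V(x) = ||x||^2 / 2, so grad(delta F / delta mu)(x) = x.
# This is an illustration, not the experimental code of the paper.
import numpy as np

def simulate(n_particles=256, dim=2, steps=2000, dt=1e-2, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=3.0, scale=1.0, size=(n_particles, dim))  # positions X_t
    v = np.zeros_like(x)                                         # velocities V_t = grad Phi_t(X_t)
    for _ in range(steps):
        grad = x                          # gradient of the first variation for this toy V
        v += dt * (-gamma * v - grad)     # damped velocity update
        x += dt * v                       # position update
    return x

print(np.abs(simulate()).mean())  # particles are driven toward the minimizer x = 0
```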
Now, we are ready to give the proof of Proposition 3.
Proof of Proposition 3.
Let $\Psi : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}$ be a functional on the probability space, and let $\tilde{\mu}_t^M$ be the distribution produced by the continuous-time composite flow (25) at time $t$. With $\mu_t$ denoting the mean-field limit of $\tilde{\mu}_t^M$ as $M \to \infty$, we have
\[
\partial_t \Psi[\mu_t] = (L\Psi)[\mu_t],
\]
where
\[
L\Psi[\mu] = \int \Big\langle \nabla\Phi(x), \nabla_x \frac{\delta\Psi(\mu)}{\delta\mu}(x) \Big\rangle \mu(x)\,\mathrm{d}x
- \int \frac{\delta\Psi(\mu)}{\delta\mu}(x)\Big(\frac{\delta F(\mu)}{\delta\mu}(x) - \mathbb{E}_{\mu}\Big[\frac{\delta F(\mu)}{\delta\mu}(x)\Big]\Big)\mu(x)\,\mathrm{d}x,
\]
in which $\Phi_t$ abides by
\[
\partial_t \Phi_t + \gamma_t\Phi_t + \frac{1}{2}\|\nabla\Phi_t\|^2 + \frac{\delta F(\mu_t)}{\delta\mu} = 0, \qquad \Phi_0 = 0,
\]
and $\frac{\delta\Psi(\mu)}{\delta\mu}(\cdot)$ denotes the first variation of the functional $\Psi$ at $\mu$, satisfying
\[
\int \frac{\delta\Psi(\mu)}{\delta\mu}(x)\,\xi(x)\,\mathrm{d}x = \lim_{\epsilon\to 0}\frac{\Psi(\mu+\epsilon\xi)-\Psi(\mu)}{\epsilon}
\]
for all signed measures $\xi$ with $\int \xi(x)\,\mathrm{d}x = 0$. Let $(L^{Pos}\Psi)[\mu]$ be the first term of (A6) and $(L^{Wei}\Psi)[\mu]$ be the second term of (A6), so that
\[
L\Psi[\mu] = (L^{Pos}\Psi)[\mu] + (L^{Wei}\Psi)[\mu].
\]
For the measure-valued composite flow μ ˜ t M (25), the infinitesimal generator of Ψ with respect to μ ˜ t M is defined as follows:
\[
(L_M\Psi)[\tilde{\mu}^M] := \lim_{t\to 0^+} \frac{\mathbb{E}_{\tilde{\mu}_0^M=\tilde{\mu}^M}\big(\Psi[\tilde{\mu}_t^M]\big) - \Psi(\tilde{\mu}^M)}{t},
\]
where E μ ˜ 0 M = μ ˜ M ( Ψ [ μ ˜ t M ] ) denotes the expectation of the functional Ψ evaluated along the trajectory μ ˜ t M taken conditional on the initialization μ ˜ 0 M = μ ˜ M . As
\[
\begin{aligned}
\mathbb{E}_{\tilde{\mu}_0^M=\tilde{\mu}^M}\big(\Psi[\tilde{\mu}_t^M]\big) - \Psi(\tilde{\mu}^M)
={}& \underbrace{\mathbb{E}_{\tilde{\mu}_0^M=\tilde{\mu}^M}\Big(\Psi\Big[\sum_{i=1}^M w_t^i \delta_{x_t^i}\Big]\Big) - \mathbb{E}_{\tilde{\mu}_0^M=\tilde{\mu}^M}\Big(\Psi\Big[\sum_{i=1}^M w_0^i \delta_{x_t^i}\Big]\Big)}_{\text{weight-adjusting infinitesimal}} \\
&+ \underbrace{\mathbb{E}_{\tilde{\mu}_0^M=\tilde{\mu}^M}\Big(\Psi\Big[\sum_{i=1}^M w_0^i \delta_{x_t^i}\Big]\Big) - \Psi\Big(\sum_{i=1}^M w_0^i \delta_{x_0^i}\Big)}_{\text{position-adjusting infinitesimal}},
\end{aligned}
\]
we follow the same idea as in [23,41,73] and divide $(L_M\Psi)[\tilde{\mu}^M]$ into two parts, $(L_M^{Pos}\Psi)[\tilde{\mu}^M]$ and $(L_M^{Wei}\Psi)[\tilde{\mu}^M]$, which correspond to the position update and the weight adjustment, respectively. According to the definition of the first variation, it can be calculated that
\[
\begin{aligned}
(L_M^{Wei}\Psi)[\tilde{\mu}^M]
&= \lim_{t\to 0^+}\frac{\mathbb{E}_{\tilde{\mu}_0^M=\tilde{\mu}^M}\big(\Psi[\sum_{i=1}^M w_t^i \delta_{x_t^i}]\big) - \mathbb{E}_{\tilde{\mu}_0^M=\tilde{\mu}^M}\big(\Psi[\sum_{i=1}^M w_0^i \delta_{x_t^i}]\big)}{t} \\
&= \int \frac{\delta\Psi(\mu^M)}{\delta\mu}(x)\sum_{i=1}^M \partial_t\big(w_t^i\,\delta_{x_0^i}(x)\big)\,\mathrm{d}x \\
&= -\int \frac{\delta\Psi(\mu^M)}{\delta\mu}(x)\Big(\frac{\delta F(\mu^M)}{\delta\mu}(x) - \mathbb{E}_{\mu^M}\Big[\frac{\delta F(\mu^M)}{\delta\mu}(x)\Big]\Big)\mu^M(x)\,\mathrm{d}x,
\end{aligned}
\]
and
\[
\begin{aligned}
(L_M^{Pos}\Psi)[\tilde{\mu}^M]
&= \lim_{t\to 0^+}\frac{\mathbb{E}_{\tilde{\mu}_0^M=\tilde{\mu}^M}\big(\Psi[\sum_{i=1}^M w_0^i \delta_{x_t^i}]\big) - \Psi\big(\sum_{i=1}^M w_0^i \delta_{x_0^i}\big)}{t} \\
&= \int \frac{\delta\Psi(\mu^M)}{\delta\mu}(x)\sum_{i=1}^M w_0^i\,\partial_t \delta_{x_t^i}(x)\,\mathrm{d}x
= \int \Big\langle V_{\mu^M}(x), \nabla_x \frac{\delta\Psi(\mu^M)}{\delta\mu}(x)\Big\rangle \mu^M(x)\,\mathrm{d}x,
\end{aligned}
\]
where V μ M abides by
\[
\frac{\mathrm{d} V_{\mu^M}}{\mathrm{d}t} = -\gamma_t V_{\mu^M} - \nabla\frac{\delta F(\mu^M)}{\delta\mu}, \qquad V_{\mu_0^M} = 0.
\]
Combining the above equalities, we have
\[
L_M\Psi[\mu^M] = (L_M^{Pos}\Psi)[\mu^M] + (L_M^{Wei}\Psi)[\mu^M].
\]
If we take the limit of $L_M\Psi[\mu^M]$ as $M\to\infty$ along a sequence such that $\mu^M \rightharpoonup \mu$ (i.e., $\mu^M$ weakly converges to $\mu$) a.s. and $\frac{\delta F(\mu^M)}{\delta\mu} \to \frac{\delta F(\mu)}{\delta\mu}$ a.s., we can deduce that $V_{\mu^M} \to \nabla\Phi$ in light of Lemma A1. This allows us to conclude that $(L_M^{Wei}\Psi)[\mu^M] \to (L^{Wei}\Psi)[\mu]$ and $(L_M^{Pos}\Psi)[\mu^M] \to (L^{Pos}\Psi)[\mu]$, and thus $L_M\Psi[\mu^M] \to L\Psi[\mu]$.
Since $\partial_t\Psi(\mu_t^M) = L_M\Psi[\mu^M]$ and $\partial_t\Psi(\mu_t) = L\Psi[\mu_t]$, we have
\[
\lim_{M\to\infty}\Psi(\mu_t^M) = \Psi(\mu_t),
\]
which indicates that μ t M μ t if μ 0 M μ 0 . Since μ t satisfying t Ψ ( μ t ) = L Ψ [ μ t ] solves the partial differential Equation (24), we conclude that the path of (25) starting from μ ˜ 0 M weakly converges to a solution of the partial differential Equation (24) starting from μ 0 as M .    □

Appendix A.4. Proof of Proposition 1

Proof. 
The SHIFR flow under a general information metric is written as
\[
\begin{cases}
\partial_t \mu_t = G_I(\mu_t)^{-1}\Phi_t - \Big(\dfrac{\delta F(\mu_t)}{\delta\mu} - \displaystyle\int \dfrac{\delta F(\mu_t)}{\delta\mu}\,\mathrm{d}\mu_t\Big)\mu_t, \\[8pt]
\partial_t \Phi_t + \gamma_t\Phi_t + \dfrac{1}{2}\dfrac{\delta}{\delta\mu}\Big(\displaystyle\int \Phi_t\, G_I(\mu_t)^{-1}\Phi_t\,\mathrm{d}x\Big) + \dfrac{\delta F(\mu_t)}{\delta\mu} = 0 .
\end{cases}
\]
Because the functional $F(\cdot)$ is specified as some dissimilarity functional $D(\cdot\,|\,\pi)$, at $\mu = \pi$ we have
\[
\frac{\delta F(\mu)}{\delta\mu}\bigg|_{\mu=\pi} = \frac{\delta F(\pi)}{\delta\mu} = \frac{\delta D(\pi\,|\,\pi)}{\delta\mu} = 0 .
\]
Moreover, since $G_I(\mu)^{-1}$ is a linear operator, at $\Phi = 0$ we also have
\[
G_I(\mu)^{-1}\Phi = G_I(\mu)^{-1}0 = 0 .
\]
Substituting $(\mu = \pi, \Phi = 0)$ into (A7), we obtain
\[
\partial_t\Phi_t\big|_{(\mu,\Phi)=(\pi,0)} = -\gamma\,\Phi - \frac{1}{2}\frac{\delta}{\delta\mu}\Big(\int \Phi\, G_I(\mu)^{-1}\Phi\,\mathrm{d}x\Big) - \frac{\delta F(\mu)}{\delta\mu} = 0,
\]
\[
\partial_t\mu_t\big|_{(\mu,\Phi)=(\pi,0)} = G_I(\mu)^{-1}\Phi - \Big(\frac{\delta F(\mu)}{\delta\mu} - \int \frac{\delta F(\mu)}{\delta\mu}\,\mathrm{d}\mu\Big)\mu = 0 .
\]
This completes the proof.    □

Appendix A.5. Proof of Proposition 2

Proof. 
For the W-SHIFR case, according to the result in (A6), the following equality holds for any functional $\Psi : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}$ on the probability space $\mathcal{P}_2(\mathbb{R}^d)$:
\[
\partial_t\Psi[\mu_t^{SHIFR}] = \int \Big\langle \nabla\Phi(x), \nabla_x\frac{\delta\Psi(\mu_t^{SHIFR})}{\delta\mu}(x)\Big\rangle \mu_t^{SHIFR}(x)\,\mathrm{d}x
- \int \frac{\delta\Psi(\mu_t^{SHIFR})}{\delta\mu}(x)\Big(\frac{\delta F(\mu_t^{SHIFR})}{\delta\mu}(x) - \mathbb{E}_{\mu}\Big[\frac{\delta F(\mu_t^{SHIFR})}{\delta\mu}(x)\Big]\Big)\mu_t^{SHIFR}(x)\,\mathrm{d}x,
\]
in which $\mu_t|_{t=0} = \bar{\mu}$ and $\Phi_t$ abides by
\[
\partial_t\Phi_t + \gamma_t\Phi_t + \frac{1}{2}\|\nabla\Phi_t\|^2 + \frac{\delta F(\mu_t)}{\delta\mu} = 0, \qquad \Phi_t|_{t=0} = \bar{\Phi}.
\]
By substituting $\Psi(\mu_t^{SHIFR}) = F(\mu_t^{SHIFR})$ and $U_{\mu_t^{SHIFR}} = \frac{\delta F(\mu_t^{SHIFR})}{\delta\mu}$, and setting $t = 0$ in the above equality, we have
\[
\begin{aligned}
\frac{\mathrm{d}F(\mu_t^{SHIFR})}{\mathrm{d}t}\bigg|_{t=0}
&= \int \Big\langle \nabla_x\frac{\delta F(\bar{\mu})}{\delta\mu}(x), \nabla\bar{\Phi}(x)\Big\rangle \bar{\mu}(x)\,\mathrm{d}x
- \int \frac{\delta F(\bar{\mu})}{\delta\mu}(x)\Big(U_{\bar{\mu}}(x) - \mathbb{E}_{\bar{\mu}}[U_{\bar{\mu}}(x)]\Big)\bar{\mu}(x)\,\mathrm{d}x \\
&= \int \Big\langle \nabla_x\frac{\delta F(\bar{\mu})}{\delta\mu}(x), \nabla\bar{\Phi}(x)\Big\rangle \bar{\mu}(x)\,\mathrm{d}x
- \bigg[\int \Big(\frac{\delta F(\bar{\mu})}{\delta\mu}(x)\Big)^2 \bar{\mu}(x)\,\mathrm{d}x
- \Big(\int \frac{\delta F(\bar{\mu})}{\delta\mu}(x)\,\bar{\mu}(x)\,\mathrm{d}x\Big)^2\bigg].
\end{aligned}
\]
Similarly, we can obtain the following result for the Hamiltonian flow in the non-augmented Wasserstein space case:
\[
\frac{\mathrm{d}F(\mu_t^{H})}{\mathrm{d}t}\bigg|_{t=0} = \int \Big\langle \nabla_x\frac{\delta F(\bar{\mu})}{\delta\mu}(x), \nabla\bar{\Phi}(x)\Big\rangle \bar{\mu}(x)\,\mathrm{d}x .
\]
Since the second term of (A10) is always less than or equal to zero and the first term of (A10) is the same as the first term of (A11), we conclude that the local dissipation of the SHIFR flow has an additional functional dissipation term compared to the Hamiltonian flow in the non-augmented space:
\[
\frac{\mathrm{d}F(\mu_t^{SHIFR})}{\mathrm{d}t}\bigg|_{t=0} \le \frac{\mathrm{d}F(\mu_t^{H})}{\mathrm{d}t}\bigg|_{t=0}.
\]
For general information metrics, readers can follow the same routine to obtain the extra functional dissipation property.    □

Appendix B. Detailed Formulations

Appendix B.1. Kalman–Wasserstein–SHIFR Flow and KWGAD-PVI Algorithms

The Kalman–Wasserstein metric, proposed in the ensemble Kalman sampling literature [74], combines the Kalman-filter idea of estimating the probability distributions of a dynamic system over time with the Wasserstein metric for measuring the difference between these estimated distributions. The inverse of the Kalman–Wasserstein metric tensor is written as
\[
G_{KW}(\mu)^{-1}\Phi = -\nabla\cdot\big(\mu\, C^{\lambda}(\mu)\nabla\Phi\big), \qquad \Phi \in T^*_\mu\mathcal{P}(\mathbb{R}^n).
\]
Substituting this into (2), the gradient flow of the Kalman–Wasserstein metric is written as
\[
\partial_t\mu_t = -G_{KW}(\mu_t)^{-1}\frac{\delta F(\mu_t)}{\delta\mu} = \nabla\cdot\Big(\mu_t\, C^{\lambda}(\mu_t)\nabla\frac{\delta F(\mu_t)}{\delta\mu}\Big),
\]
where $\lambda \ge 0$ is the regularization constant and $C^{\lambda}(\mu)$ is the linear transformation
\[
C^{\lambda}(\mu) = \int (x - m(\mu))(x - m(\mu))^{\top}\,\mu\,\mathrm{d}x + \lambda I, \qquad m(\mu) = \int x\,\mu\,\mathrm{d}x.
\]
Substituting the Kalman–Wasserstein metric into the SHIFR flow (21) gives the Kalman–Wasserstein–SHIFR flow:
\[
\begin{cases}
\partial_t\mu_t = -\nabla\cdot\big(\mu_t\, C^{\lambda}(\mu_t)\nabla\Phi_t\big) - \Big(\dfrac{\delta F(\mu_t)}{\delta\mu} - \displaystyle\int\dfrac{\delta F(\mu_t)}{\delta\mu}\,\mathrm{d}\mu_t\Big)\mu_t, \\[8pt]
\partial_t\Phi_t + \gamma_t\Phi_t + \dfrac{1}{2}\Big((x - m(\mu_t))^{\top}\Big(\displaystyle\int\nabla\Phi_t\nabla\Phi_t^{\top}\,\mathrm{d}\mu_t\Big)(x - m(\mu_t)) + \nabla\Phi_t^{\top} C^{\lambda}(\mu_t)\nabla\Phi_t\Big) + \dfrac{\delta F(\mu_t)}{\delta\mu} = 0.
\end{cases}
\]
We claim that the finite-particle formulation of the Kalman–Wasserstein–SHIFR flow (A15) evolves the positions $x^i$, the weights $w^i$ of the $M$ particles, and the velocities $v^i$ as follows:
\[
\begin{cases}
\mathrm{d}x_t^i = C^{\lambda}(\tilde{\mu}_t)\, v_t^i\,\mathrm{d}t, \\[4pt]
\mathrm{d}v_t^i = \Big(-\gamma v_t^i - \mathbb{E}[v_t v_t^{\top}](x_t^i - \mathbb{E}[x_t]) - \nabla\dfrac{\delta F(\tilde{\mu}_t)}{\delta\mu}(x_t^i)\Big)\mathrm{d}t, \\[6pt]
\mathrm{d}w_t^i = -\Big(\dfrac{\delta F(\tilde{\mu}_t)}{\delta\mu}(x_t^i) - \displaystyle\sum_{j=1}^M w_t^j\,\dfrac{\delta F(\tilde{\mu}_t)}{\delta\mu}(x_t^j)\Big) w_t^i\,\mathrm{d}t, \\[6pt]
\tilde{\mu}_t = \displaystyle\sum_{i=1}^M w_t^i\,\delta_{x_t^i}.
\end{cases}
\]
Here, the expectation is taken over the empirical distribution of particles.
Then, the proposition below shows the mean-field limit of the finite-particles formulation (A16) is exactly the Kalman–Wasserstein–SHIFR flow (A15).
Proposition A1.
Suppose the empirical distribution $\tilde{\mu}_0^M$ of $M$ weighted particles weakly converges to a distribution $\mu_0$ as $M\to\infty$. Then, the path of (A16) starting from $\tilde{\mu}_0^M$ with initial velocities $0$ weakly converges to a solution of the Kalman–Wasserstein–SHIFR flow (A15) starting from $\mu_t|_{t=0} = \mu_0$ and $\Phi_t|_{t=0} = 0$ as $M\to\infty$.
Similar to the proof scheme of Proposition 3, we start by proving a technical lemma first:
Lemma A2.
The following fluid dynamic formulation and particle dynamic formulation are equivalent:
(Suppose that $X_t \sim \mu_t$ and $V_t = \nabla\Phi_t(X_t)$; the expectation is taken over the particles.)
\[
\begin{cases}
\partial_t\mu_t + \nabla\cdot\big(\mu_t\, C^{\lambda}(\mu_t)\nabla\Phi_t\big) = 0, \\[4pt]
\partial_t\Phi_t + \gamma_t\Phi_t + \dfrac{1}{2}\Big((x-m(\mu_t))^{\top}\Big(\displaystyle\int\nabla\Phi_t\nabla\Phi_t^{\top}\,\mathrm{d}\mu_t\Big)(x-m(\mu_t)) + \nabla\Phi_t^{\top}C^{\lambda}(\mu_t)\nabla\Phi_t\Big) + \dfrac{\delta F(\mu_t)}{\delta\mu} = 0.
\end{cases}
\]
\[
\begin{cases}
\dfrac{\mathrm{d}}{\mathrm{d}t}X_t = C^{\lambda}(\mu_t)\,V_t, \\[6pt]
\dfrac{\mathrm{d}}{\mathrm{d}t}V_t = -\gamma_t V_t - \mathbb{E}[V_t V_t^{\top}](X_t - \mathbb{E}[X_t]) - \nabla\Big(\dfrac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t).
\end{cases}
\]
Proof. 
First, we establish two equations. For $i = 1,\dots,n$, we have
\[
\big[(C^{\lambda}(\mu_t)\nabla\Phi_t\cdot\nabla)\nabla\Phi_t(X_t)\big]_i
= \sum_{j=1}^n \big(C^{\lambda}(\mu_t)\nabla\Phi_t\big)_j\,\partial_j\partial_i\Phi_t(X_t)
= \sum_{j=1}^n \partial_i\partial_j\Phi_t(X_t)\,\big(C^{\lambda}(\mu_t)\nabla\Phi_t\big)_j
= \big(\nabla^2\Phi_t\, C^{\lambda}(\mu_t)\nabla\Phi_t\big)_i.
\]
Then, according to the chain rule, we have
\[
\nabla\big(\nabla\Phi_t(x)^{\top} C^{\lambda}(\mu_t)\nabla\Phi_t(x)\big) = 2\,\nabla^2\Phi_t(x)\, C^{\lambda}(\mu_t)\nabla\Phi_t(x).
\]
Since the first equation of (A17) is the continuity equation with velocity field $C^{\lambda}(\mu_t)\nabla\Phi_t$, it is immediate that $\frac{\mathrm{d}}{\mathrm{d}t}X_t = C^{\lambda}(\mu_t)V_t$. Then, we deduce the second equation of (A18):
\[
\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}t}V_t &= \frac{\mathrm{d}}{\mathrm{d}t}\nabla\Phi_t(X_t)
\overset{(1)}{=} \big(\partial_t + C^{\lambda}(\mu_t)\nabla\Phi_t\cdot\nabla\big)\nabla\Phi_t(X_t)
\overset{(2)}{=} \partial_t\nabla\Phi_t + \nabla^2\Phi_t\, C^{\lambda}(\mu_t)\nabla\Phi_t \\
&\overset{(3)}{=} -\gamma_t\nabla\Phi_t - \Big(\int\nabla\Phi_t\nabla\Phi_t^{\top}\,\mathrm{d}\mu_t\Big)(X_t - m(\mu_t)) - \frac{1}{2}\nabla\big(\nabla\Phi_t(x)^{\top}C^{\lambda}(\mu_t)\nabla\Phi_t(x)\big) - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t) + \nabla^2\Phi_t\, C^{\lambda}(\mu_t)\nabla\Phi_t \\
&\overset{(4)}{=} -\gamma_t\nabla\Phi_t - \Big(\int\nabla\Phi_t\nabla\Phi_t^{\top}\,\mathrm{d}\mu_t\Big)(X_t - m(\mu_t)) - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t)
\overset{(5)}{=} -\gamma_t V_t - \mathbb{E}[V_tV_t^{\top}](X_t - \mathbb{E}[X_t]) - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t),
\end{aligned}
\]
where equality (1) follows from the material derivative in fluid dynamics [72], equality (2) comes from (A19), equality (3) comes from the PDE (A17), equality (4) comes from canceling terms via (A20), and equality (5) comes from the definitions of $V_t$ and $X_t$.    □
Proof of Proposition A1.
Substituting Lemma A1 by Lemma A2, the proof scheme of Proposition A1 is the same as the proof scheme of Proposition 3.    □
By discretizing (A16), we derive the KWGAD-PVI algorithms, which update the positions of the particles according to the following rule:
\[
x_{k+1}^i = x_k^i + \eta_{pos}\,\big[C_k^{\lambda} v_k\big]^i,
\]
and adjust the velocities as follows:
\[
v_{k+1}^i = (1 - \gamma\eta_{vel})\,v_k^i - \frac{\eta_{vel}}{M}\Big[\sum_{j=1}^{N} w_k^j\, v_k^j (v_k^j)^{\top}\Big](x_k^i - m_k) - \eta_{vel}\,\nabla U_{\tilde{\mu}_k}(x_k^i).
\]
Here, $C_k^{\lambda}$ and $m_k$ are calculated at each round by
\[
m_k = \frac{1}{N}\sum_{i=1}^{M} w_k^i x_k^i, \qquad
C_k^{\lambda} = \frac{1}{N-1}\sum_{i=1}^{M} w_k^i (x_k^i - m_k)(x_k^i - m_k)^{\top} + \lambda I.
\]
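To make the KWGAD-PVI update concrete, the sketch below implements one iteration following (A21)–(A23); `grad_U` is a placeholder for $\nabla U_{\tilde{\mu}_k}$, and the weighted moment estimators assume that the weights sum to one (absorbing the normalization constants above), which is an assumption of this sketch rather than the exact implementation.

```python
# Sketch of one KWGAD-PVI iteration following (A21)-(A23); grad_U is a placeholder
# returning an (M, d) array of gradients of the smoothed first variation.
import numpy as np

def kwgad_step(x, v, w, grad_U, eta_pos, eta_vel, gamma, lam):
    M, d = x.shape
    m_k = (w[:, None] * x).sum(axis=0) / w.sum()                        # weighted mean m_k
    centered = x - m_k
    C_lam = (w[:, None, None] * np.einsum("id,ie->ide", centered, centered)).sum(0) + lam * np.eye(d)
    vvT = (w[:, None, None] * np.einsum("id,ie->ide", v, v)).sum(0)     # weighted second moment of v
    g = grad_U(x)                                                        # grad U at each particle
    v_new = (1.0 - gamma * eta_vel) * v - eta_vel * centered @ vvT.T - eta_vel * g
    x_new = x + eta_pos * v @ C_lam.T                                    # preconditioned position step
    return x_new, v_new
```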

Appendix B.2. Stein-SHIFR Flow and SGAD-PVI Algorithms

By incorporating the reproducing-kernel-Hilbert-space norm into the probability space, the Stein metric has been proposed for geometric analysis [75]. The gradient flow of the Stein metric is written as
\[
\partial_t\mu_t = -G_S(\mu_t)^{-1}\frac{\delta F(\mu_t)}{\delta\mu}
= \nabla\cdot\Big(\mu_t \int k(\cdot,y)\,\mu_t(y)\,\nabla_y\frac{\delta F(\mu_t)}{\delta\mu}(y)\,\mathrm{d}y\Big).
\]
Substituting the Stein metric into the SHIFR flow (21) gives the Stein-SHIFR flow:
\[
\begin{cases}
\partial_t\mu_t = -\nabla\cdot\Big(\mu_t \displaystyle\int k(\cdot,y)\,\mu_t(y)\,\nabla_y\Phi_t(y)\,\mathrm{d}y\Big) - \Big(\dfrac{\delta F(\mu_t)}{\delta\mu} - \displaystyle\int\dfrac{\delta F(\mu_t)}{\delta\mu}\,\mathrm{d}\mu_t\Big)\mu_t, \\[8pt]
\partial_t\Phi_t + \gamma_t\Phi_t + \displaystyle\int \nabla\Phi_t(\cdot)^{\top}\nabla\Phi_t(y)\,k(\cdot,y)\,\mu_t(y)\,\mathrm{d}y + \dfrac{\delta F(\mu_t)}{\delta\mu} = 0.
\end{cases}
\]
We claim that the finite-particle formulation of the Stein–SHIFR flow (A25) evolves the positions $x^i$, the weights $w^i$ of the $M$ particles, and the velocities $v^i$ as follows:
\[
\begin{cases}
\mathrm{d}x_t^i = \Big[\displaystyle\int k(x_t,y)\,\nabla\Phi_t(y)\,\tilde{\mu}_t(y)\,\mathrm{d}y\Big]^i\,\mathrm{d}t, \\[6pt]
\mathrm{d}v_t^i = \Big(-\gamma v_t^i - \Big[\displaystyle\int v_t^{\top}\nabla\Phi_t(y)\,\nabla_x k(x_t,y)\,\tilde{\mu}_t(y)\,\mathrm{d}y\Big]^i - \nabla\dfrac{\delta F(\tilde{\mu}_t)}{\delta\mu}(x_t^i)\Big)\mathrm{d}t, \\[6pt]
\mathrm{d}w_t^i = -\Big(\dfrac{\delta F(\tilde{\mu}_t)}{\delta\mu}(x_t^i) - \displaystyle\sum_{j=1}^M w_t^j\,\dfrac{\delta F(\tilde{\mu}_t)}{\delta\mu}(x_t^j)\Big)w_t^i\,\mathrm{d}t, \\[6pt]
\tilde{\mu}_t = \displaystyle\sum_{i=1}^M w_t^i\,\delta_{x_t^i}.
\end{cases}
\]
Then, the proposition below shows the mean-field limit of the finite-particles formulation (A26) is exactly the Stein-SHIFR flow (A25).
Proposition A2.
Suppose the empirical distribution $\tilde{\mu}_0^M$ of $M$ weighted particles weakly converges to a distribution $\mu_0$ as $M\to\infty$. Then, the path of (A26) starting from $\tilde{\mu}_0^M$ with initial velocities $0$ weakly converges to a solution of the Stein–SHIFR flow (A25) starting from $\mu_t|_{t=0} = \mu_0$ and $\Phi_t|_{t=0} = 0$ as $M\to\infty$.
Similarly, we first prove a technical lemma:
Lemma A3.
The following fluid dynamic formulation and particle dynamic formulation are equivalent:
(Suppose that $X_t \sim \mu_t$ and $V_t = \nabla\Phi_t(X_t)$.)
\[
\begin{cases}
\partial_t\mu_t + \nabla\cdot\Big(\mu_t \displaystyle\int k(\cdot,y)\,\mu_t(y)\,\nabla_y\Phi_t(y)\,\mathrm{d}y\Big) = 0, \\[8pt]
\partial_t\Phi_t + \gamma_t\Phi_t + \displaystyle\int\nabla\Phi_t(\cdot)^{\top}\nabla\Phi_t(y)\,k(\cdot,y)\,\mu_t(y)\,\mathrm{d}y + \dfrac{\delta F(\mu_t)}{\delta\mu} = 0.
\end{cases}
\]
\[
\begin{cases}
\dfrac{\mathrm{d}}{\mathrm{d}t}X_t = \displaystyle\int k(X_t,y)\,\nabla\Phi_t(y)\,\mu_t(y)\,\mathrm{d}y, \\[8pt]
\dfrac{\mathrm{d}}{\mathrm{d}t}V_t = -\gamma_t V_t - \displaystyle\int V_t^{\top}\nabla\Phi_t(y)\,\nabla_x k(X_t,y)\,\mu_t(y)\,\mathrm{d}y - \nabla\Big(\dfrac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t).
\end{cases}
\]
Proof. 
First, we notice the following equation:
\[
\nabla\Big(\int \nabla\Phi(x)^{\top}\nabla\Phi(y)\,k(x,y)\,\mu_t(y)\,\mathrm{d}y\Big)
= \int \nabla^2\Phi(x)\,\nabla\Phi(y)\,k(x,y)\,\mu(y)\,\mathrm{d}y
+ \int \nabla\Phi(x)^{\top}\nabla\Phi(y)\,\nabla_x k(x,y)\,\mu(y)\,\mathrm{d}y .
\]
Since the first equation of (A27) is the continuity equation with velocity field $\int k(\cdot,y)\,\mu_t(y)\,\nabla_y\Phi_t(y)\,\mathrm{d}y$, it is immediate that $\frac{\mathrm{d}}{\mathrm{d}t}X_t = \int k(X_t,y)\,\nabla\Phi_t(y)\,\mu_t(y)\,\mathrm{d}y$. Then, we deduce the second equation of (A28):
\[
\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}t}V_t &= \frac{\mathrm{d}}{\mathrm{d}t}\nabla\Phi_t(X_t)
\overset{(1)}{=} \partial_t\nabla\Phi_t(X_t) + \nabla^2\Phi_t(X_t)\Big(\int k(X_t,y)\,\nabla\Phi_t(y)\,\mu_t(y)\,\mathrm{d}y\Big) \\
&\overset{(2)}{=} -\gamma_t V_t - \nabla\Big(\int\nabla\Phi(x)^{\top}\nabla\Phi(y)\,k(x,y)\,\mu_t(y)\,\mathrm{d}y\Big) - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t) + \nabla^2\Phi_t(X_t)\Big(\int k(X_t,y)\,\nabla\Phi_t(y)\,\mu_t(y)\,\mathrm{d}y\Big) \\
&\overset{(3)}{=} -\gamma_t V_t - \int \nabla\Phi(x)^{\top}\nabla\Phi(y)\,\nabla_x k(x,y)\,\mu(y)\,\mathrm{d}y - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t) \\
&\overset{(4)}{=} -\gamma_t V_t - \int V_t^{\top}\nabla\Phi_t(y)\,\nabla_x k(X_t,y)\,\mu_t(y)\,\mathrm{d}y - \nabla\Big(\frac{\delta F(\mu_t)}{\delta\mu}\Big)(X_t),
\end{aligned}
\]
where equality (1) follows from the material derivative in fluid dynamics [72], equality (2) comes from the PDE (A27), equality (3) comes from leveraging (A29), and equality (4) comes from the definition of $V_t$.    □
Proof of Proposition A2.
Substituting Lemma A1 by Lemma A3, the proof scheme of Proposition A2 is the same as the proof scheme of Proposition 3.    □
By discretizing (A26), the SGAD-PVI algorithms update the positions of the particles according to the following rule:
\[
x_{k+1}^i = x_k^i + \frac{\eta_{pos}}{M}\sum_{j=1}^M K(x_k^i, x_k^j)\,v_k^j,
\]
and adjust the velocities as follows:
\[
v_{k+1}^i = (1 - \gamma\eta_{vel})\,v_k^i - \frac{\eta_{vel}}{M}\sum_{j=1}^M (v_k^i)^{\top}v_k^j\,\nabla_1 K(x_k^i, x_k^j) - \eta_{vel}\,\nabla U_{\tilde{\mu}_k}(x_k^i),
\]
where $\nabla_1 K$ denotes the gradient of $K$ with respect to its first argument.
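Similarly, one SGAD-PVI iteration following (A30) and (A31) can be sketched as follows, with an RBF kernel and `grad_U` again a placeholder for $\nabla U_{\tilde{\mu}_k}$; this is an illustrative sketch rather than the original implementation.

```python
# Sketch of one SGAD-PVI iteration following (A30)-(A31), with an RBF kernel.
import numpy as np

def rbf_kernel(x, h):
    diff = x[:, None, :] - x[None, :, :]
    K = np.exp(-(diff ** 2).sum(-1) / h)
    gradK = -2.0 / h * diff * K[:, :, None]   # gradient of K(x_i, x_j) w.r.t. its first argument
    return K, gradK

def sgad_step(x, v, grad_U, eta_pos, eta_vel, gamma, h):
    M = x.shape[0]
    K, gradK = rbf_kernel(x, h)
    g = grad_U(x)                                            # (M, d) gradients of U at the particles
    inner = v @ v.T                                          # (v_k^i)^T v_k^j
    x_new = x + eta_pos / M * K @ v                          # kernel-averaged transport
    v_new = (1.0 - gamma * eta_vel) * v \
            - eta_vel / M * (inner[:, :, None] * gradK).sum(axis=1) \
            - eta_vel * g
    return x_new, v_new
```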

Appendix B.3. GAD-PVI Algorithms in Details

By adopting different underlying information metric tensors (W-metric, KW-metric, and S-metric), weight-adjustment approaches (CA and DK), and dissimilarity functionals with their associated smoothing approaches (KL-BLOB, KL-GFSD, KSD-KSDD, MMD-MMDF, SD-SDF), we can derive 30 different instances of GAD-PVI, named WGAD/KWGAD/SGAD-CA/DK-BLOB/GFSD/KSDD/MMDF/SDF. Here, we present our General Accelerated Dynamic-weight Particle-based Variational Inference (GAD-PVI) framework in a more detailed version of Algorithm 1, given as Algorithm A1.
Algorithm A1 General Accelerated Dynamic-weight Particle-based Variational Inference (GAD-PVI) framework in detail
Input: Initial distribution $\tilde{\mu}_0 = \sum_{i=1}^M w_0^i \delta_{x_0^i}$, position-adjusting step size $\eta_{pos}$, weight-adjusting step size $\eta_{wei}$, velocity-adjusting step size $\eta_{vel}$, velocity damping parameter $\gamma$.
1: Choose a suitable functional $F$ and its smoothing strategy $U_{\tilde{\mu}} \approx \frac{\delta F(\tilde{\mu})}{\delta\mu}$ from KL-BLOB/KL-GFSD/KSD-KSDD/MMD-MMDF/SD-SDF.
2: if under the score-function-accessible sampling setting then
3:    $U_{\tilde{\mu}}(x) \approx \dfrac{\delta F(\tilde{\mu})}{\delta\mu}(x) = \begin{cases} -\nabla\log\pi(x) + \dfrac{\sum_{i=1}^M \nabla K(x, x_k^i)}{\sum_{i=1}^M K(x, x_k^i)} + \displaystyle\sum_{i=1}^M \dfrac{\nabla K(x, x_k^i)}{\sum_{l=1}^M K(x_k^i, x_k^l)} & \text{(KL-BLOB)}, \\[8pt] -\nabla\log\pi(x) + \dfrac{\sum_{i=1}^M \nabla K(x, x_k^i)}{\sum_{i=1}^M K(x, x_k^i)} & \text{(KL-GFSD)}, \\[8pt] \dfrac{1}{M}\displaystyle\sum_{i=1}^M k_\pi(x_k^i, x) & \text{(KSD-KSDD)}. \end{cases}$
4: else if under the i.i.d.-samples-accessible sampling setting then
5:    $U_{\tilde{\mu}}(x) \approx \dfrac{\delta F(\tilde{\mu})}{\delta\mu}(x) = \begin{cases} \displaystyle\sum_{i=1}^M w_k^i K(x_k^i, x) - \sum_{j=1}^N a_j K(y_j, x) & \text{(MMD-MMDF)}, \\[8pt] f_{\tilde{\mu}_k, \pi} - f_{\tilde{\mu}_k, \tilde{\mu}_k} & \text{(SD-SDF)}. \end{cases}$
6: end if
7: for $k = 0, 1, \dots, T-1$ do
8:    for $i = 1, 2, \dots, M$ do
9:       Update the position $x_k^i$ according to (26) (WGAD), (A21) (KWGAD), or (A30) (SGAD); update the velocity $v_k^i$ according to (27) (WGAD), (A22) (KWGAD), or (A31) (SGAD).
10:   end for
11:   if the CA strategy is adopted for weight adjustment then
12:      Update the weights $w_k^i$ according to (28).
13:   end if
14:   if the DK strategy is adopted for weight adjustment then
15:      for $i = 1, 2, \dots, M$ do
16:         Calculate the duplicate/kill rate $R_{k+1}^i = -\lambda\eta\Big(U_{\tilde{\mu}_k}(x_{k+1}^i) - \frac{1}{M}\sum_{j=1}^M U_{\tilde{\mu}_k}(x_{k+1}^j)\Big)$.
17:      end for
18:      for $i = 1, 2, \dots, M$ do
19:         if $R_{k+1}^i > 0$ then
20:            Duplicate the particle $x_{k+1}^i$ with probability $1 - \exp(-R_{k+1}^i)$ and kill one particle chosen uniformly from the rest.
21:         else
22:            Kill the particle $x_{k+1}^i$ with probability $1 - \exp(R_{k+1}^i)$ and duplicate one particle chosen uniformly from the rest.
23:         end if
24:      end for
25:   end if
26: end for
27: Output: $\tilde{\mu}_T = \frac{1}{M}\sum_{i=1}^M \delta_{x_T^i}$.
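For concreteness, the sketch below strings the pieces of Algorithm A1 together for a WGAD-CA style instance: W-metric position and velocity updates discretized from the particle system of Lemma A1, plus a multiplicative weight step consistent with the continuous-time weight dynamics. The oracles `U` and `grad_U`, the multiplicative form of the weight step, and all step sizes are assumptions of this sketch; the exact discrete rules (26)–(28) are given in the main text.

```python
# Schematic WGAD-CA style loop (Wasserstein metric, CA-type weights).
# U(x, w) and grad_U(x, w) stand in for the smoothed first variation chosen in step 1
# of Algorithm A1 and its gradient; they are placeholders, not library calls.
import numpy as np

def wgad_ca(x0, U, grad_U, steps, eta_pos, eta_wei, eta_vel, gamma):
    M = x0.shape[0]
    x = x0.copy()
    v = np.zeros_like(x)
    w = np.full(M, 1.0 / M)
    for _ in range(steps):
        g = grad_U(x, w)                                   # grad U at each particle, shape (M, d)
        v = (1.0 - gamma * eta_vel) * v - eta_vel * g      # damped (accelerated) velocity update
        x = x + eta_pos * v                                # position update
        u = U(x, w)                                        # U at each particle, shape (M,)
        w = w * np.exp(-eta_wei * (u - np.dot(w, u)))      # multiplicative stand-in for the CA rule (28)
        w = w / w.sum()
    return x, w
```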

Appendix C. Experiment Details

In this section, we list the details of the experimental setting, parameter tuning, and additional results of our empirical studies.

Appendix C.1. Experiments Settings

Appendix C.1.1. Density of the Gaussian Mixture Model

The density of the Gaussian mixture model is defined as follows:
\[
\pi(x) \propto \frac{2}{3}\exp\Big(-\frac{1}{2}\|x - a\|^2\Big) + \frac{1}{3}\exp\Big(-\frac{1}{2}\|x + a\|^2\Big),
\]
where $a = 1.2\cdot\mathbf{1}$ (the vector with all entries equal to $1.2$).
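For the score-based algorithms, the score $\nabla\log\pi$ of this mixture is available in closed form; the sketch below evaluates it, taking the all-ones vector in two dimensions purely for illustration.

```python
# Score (grad log pi) of the two-component Gaussian mixture above, with a = 1.2 * ones.
# The dimension (2 here) is chosen for illustration only.
import numpy as np

a = 1.2 * np.ones(2)
mix = np.array([2.0 / 3.0, 1.0 / 3.0])

def score(x):
    """Responsibility-weighted sum of the per-component scores -(x - a) and -(x + a)."""
    logp = np.stack([-0.5 * ((x - a) ** 2).sum(-1), -0.5 * ((x + a) ** 2).sum(-1)], axis=-1)
    logw = np.log(mix) + logp
    logw -= logw.max(-1, keepdims=True)              # numerical stability
    resp = np.exp(logw)
    resp /= resp.sum(-1, keepdims=True)              # mixture responsibilities
    grads = np.stack([-(x - a), -(x + a)], axis=-1)  # shape (..., d, 2)
    return (grads * resp[..., None, :]).sum(-1)

print(score(np.zeros(2)))  # at the origin the heavier mode pulls toward +a
```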

Appendix C.1.2. Density of the Gaussian Process Task

We follow the experimental settings in [23,66] and use the LIDAR dataset (denoted $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$), which consists of 221 observations of the scalar variables $x_i$ and $y_i$. Denoting $x = [x_1, x_2, \dots, x_N]^{\top}$ and $y = [y_1, y_2, \dots, y_N]^{\top}$, the target log-posterior with respect to the model parameter $\phi = (\phi_1, \phi_2)$ is defined as follows:
\[
\log p(\phi \mid \mathcal{D}) = -\frac{y^{\top} K_y^{-1} y}{2} - \frac{\log\det(K_y)}{2} - \log\big(1 + \phi^{\top}\phi\big).
\]
Here, $K_y$ is the covariance matrix $K_y = K + 0.04 I$ with $K_{i,j} = \exp(\phi_1)\exp\big(-\exp(\phi_2)(x_i - x_j)^2\big)$, and $I$ denotes the identity matrix.
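This log-posterior can be evaluated stably with a Cholesky factorization; the sketch below follows the expression above, and the exact constants of the prior term are an assumption of this sketch rather than a statement about the original implementation.

```python
# Sketch of evaluating the GP log-posterior on the LIDAR observations; x, y are the
# 221 data points and phi = (phi1, phi2). The prior term follows the expression above
# and its exact scaling is an assumption.
import numpy as np

def log_posterior(phi, x, y):
    phi1, phi2 = phi
    diff2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(phi1) * np.exp(-np.exp(phi2) * diff2)     # kernel matrix K_{ij}
    K_y = K + 0.04 * np.eye(len(x))                      # K_y = K + 0.04 I
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_y^{-1} y via triangular solves
    log_det = 2.0 * np.log(np.diag(L)).sum()             # log det(K_y)
    return -0.5 * y @ alpha - 0.5 * log_det - np.log(1.0 + phi @ phi)
```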

Appendix C.1.3. Training/Validation/Test Dataset in Bayesian Neural Network

For each dataset in the Bayesian neural network task, we split it into 90% training data and 10% test data randomly, which follows the settings from [20,23,76]. Furthermore, we also randomly choose 1 / 5 of the training set as the validation set for parameter tuning.

Appendix C.1.4. Initialization of Particles’ Positions

In the Gaussian mixture model, we initialize particles according to the standard Gaussian distribution. In the Gaussian Process regression task, we initialize particles with mean vector [ 0 , 10 ] T and covariance 0.09 I 2 × 2 for all the algorithms. As for the Bayesian neural network task, we follow the initialization convention in [20,23].

Appendix C.1.5. Bandwidth of Kernel Function in Different Algorithms

For all the experiments, we adopt the RBF kernel as the kernel function $K$: $K(x,y) = \exp(-\|x-y\|_2^2/h)$, where the parameter $h$ is known as the bandwidth [27]. We follow the convention in [23] and set the parameter $h = \frac{1}{M}\sum_{i=1}^M \min_{j\ne i}\|x^i - x^j\|_2^2$ for GFSD-type and BLOB-type algorithms.
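The bandwidth rule can be computed directly from the current particle positions, for instance as in the following short sketch.

```python
# Sketch of the RBF kernel and the bandwidth rule h = (1/M) * sum_i min_{j != i} ||x_i - x_j||^2.
import numpy as np

def bandwidth(x):
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)         # exclude j = i from the minimum
    return sq.min(axis=1).mean()

def rbf(x, y, h):
    return np.exp(-((x - y) ** 2).sum(-1) / h)
```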

Appendix C.1.6. WNES and WAG

Ref. [27] follows the accelerated gradient descent methods in the Wasserstein probability space [34,35] and derives the WNES and WAG methods, which update the particles' positions with an extra momentum term. Although both WNES and WAG variants exist, we only conduct empirical studies with WNES as a baseline, because the authors report that WNES algorithms are usually more robust and efficient than WAG-type algorithms [27].

Appendix C.2. Parameters Tuning

Detailed Settings for ηpos, ηwei, ηvel and γ

Here, the parameter settings for the position-adjusting step size $\eta_{pos}$, the weight-adjusting step size $\eta_{wei}$, the velocity-adjusting step size $\eta_{vel}$, and the velocity damping parameter $\gamma$ of the different algorithms are provided in Table A1, Table A2, Table A3 and Table A4. All the parameters are chosen by grid search. For the position-adjusting step size $\eta_{pos}$, we first find a suitable range using a coarse-grained grid search and then fine-tune it. Note that the position step size is tuned via grid search for the fixed-weight ParVI algorithms and then reused in the corresponding dynamic-weight algorithms, while the acceleration parameters and weight-adjustment parameters are tuned via grid search for each specific algorithm. As a result, the position-adjusting step size of any specific fixed-weight ParVI algorithm, of its corresponding dynamic-weight algorithm, and of the DK variant is the same in these tables. For ease of understanding, we report the ratio of the weight-adjusting step size $\eta_{wei}$ to the position-adjusting step size $\eta_{pos}$. Moreover, inspired by the effectiveness of warmup strategies in hyperparameter tuning, we follow the settings of [23] and construct the weight-adjusting step-size schedule using the hyperbolic tangent $\lambda\tanh\big(2(t/T)^5\big)$, with $t$ the current time step and $T$ the total number of steps. This parameter tuning routine demonstrates the robustness of our GAD-PVI methods with respect to hyperparameters, as GAD-PVI maintains stable performance when using the optimal hyperparameters established for the baselines.
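The warmup schedule can be written as a one-line function; the exponent placement below, $\lambda\tanh(2(t/T)^5)$, is our reading of the schedule and should be treated as an assumption.

```python
# Hyperbolic-tangent warmup for the weight-adjusting step size, as described above.
# The exponent placement 2 * (t / T) ** 5 is our reading of the schedule.
import numpy as np

def weight_step_schedule(t, T, lam):
    return lam * np.tanh(2.0 * (t / T) ** 5)
```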
Table A1. Parameters of different algorithms in SG and GMM ( η p o s , η w e i η p o s , η v e l , and γ ).
Table A1. Parameters of different algorithms in SG and GMM ( η p o s , η w e i η p o s , η v e l , and γ ).
AlgorithmTasks
Single GaussianGaussian Mixture Model
ParVI-BLOB 1.0 × 10 2 , –, –, – 1.0 × 10 2 , –, –, –
WAIG-BLOB 1.0 × 10 2 , –, 1.0 , 0.3 1.0 × 10 2 , –, 1.0 , 0.3
WNES-BLOB 1.0 × 10 2 , –, 1.0 , 0.2 1.0 × 10 2 , –, 1.0 , 0.2
DPVI-CA-BLOB 1.0 × 10 2 , 1.0 , –, – 1.0 × 10 2 , 1.0 , –, –
DPVI-DK-BLOB 1.0 × 10 2 , 1.0 , –, – 1.0 × 10 2 , 1.0 , –, –
WGAD-CA-BLOB 1.0 × 10 2 , 1.0 , 1.0 , 0.3 1.0 × 10 2 , 1.0 , 1.0 , 0.3
WGAD-DK-BLOB 1.0 × 10 2 , 5 × 10 2 , 1.0 , 0.3 1.0 × 10 2 , 5 × 10 2 , 1.0 , 0.3
KWAIG-BLOB 1.0 × 10 2 , –, –, – 1.0 × 10 2 , –, –, –
KWGAD-CA-BLOB 1.0 × 10 2 , 5 × 10 3 , 1.0 , 0.9 1.0 × 10 2 , 5 × 10 3 , 1.0 , 0.9
KWGAD-DK-BLOB 1.0 × 10 2 , 5 × 10 2 , 1.0 , 0.9 1.0 × 10 2 , 5 × 10 2 , 1.0 , 0.9
SAIG-BLOB 5.0 × 10 2 , –, –, – 2.5 × 10 2 , –, –, –
SGAD-CA-BLOB 5.0 × 10 2 , 5 × 10 3 , 1.0 , 0.9 2.5 × 10 2 , 5 × 10 3 , 1.0 , 0.9
SGAD-DK-BLOB 5.0 × 10 2 , 5 × 10 2 , 1.0 , 0.9 2.5 × 10 2 , 5 × 10 2 , 1.0 , 0.9
ParVI-GFSD 1.0 × 10 2 , –, –, – 1.0 × 10 2 , –, –, –
WAIG-GFSD 1.0 × 10 2 , –, 1.0 , 0.3 1.0 × 10 2 , –, 1.0 , 0.3
WNES-GFSD 1.0 × 10 2 , –, 1.0 , 0.2 1.0 × 10 2 , –, 1.0 , 0.2
DPVI-CA-GFSD 1.0 × 10 2 , 1.0 , –, – 1.0 × 10 2 , 0.8 , –, –
DPVI-DK-GFSD 1.0 × 10 2 , 1.0 , –, – 1.0 × 10 2 , 1.0 , –, –
WGAD-CA-GFSD 1.0 × 10 2 , 1.0 , 1.0 , 0.3 1.0 × 10 2 , 0.8 , 1.0 , 0.3
WGAD-DK-GFSD 1.0 × 10 2 , 5 × 10 2 , 1.0 , 0.3 1.0 × 10 2 , 5 × 10 2 , 1.0 , 0.3
KWAIG-GFSD 1.0 × 10 2 , –, –, – 1.0 × 10 2 , –, –, –
KWGAD-CA-GFSD 1.0 × 10 2 , 5 × 10 3 , 1.0 , 0.9 1.0 × 10 2 , 5 × 10 3 , 1.0 , 0.9
KWGAD-DK-GFSD 1.0 × 10 2 , 5 × 10 2 , 1.0 , 0.9 1.0 × 10 2 , 5 × 10 2 , 1.0 , 0.9
SAIG-GFSD 5.0 × 10 2 , –, –, – 2.5 × 10 2 , –, –, –
SGAD-CA-GFSD 5.0 × 10 2 , 5 × 10 3 , 1.0 , 0.9 2.5 × 10 2 , 5 × 10 3 , 1.0 , 0.9
SGAD-DK-GFSD 5.0 × 10 2 , 5 × 10 2 , 1.0 , 0.9 2.5 × 10 2 , 5 × 10 2 , 1.0 , 0.9
Table A2. Parameters of different algorithms in GP ( η p o s , η w e i η p o s , η v e l and γ ).
Table A2. Parameters of different algorithms in GP ( η p o s , η w e i η p o s , η v e l and γ ).
AlgorithmSmoothing Approaches
BLOBGFSD
ParVI 1.0 × 10 2 , –, –, – 1.0 × 10 2 , –, –, –
WAIG 1.0 × 10 2 , –, 1.0 , 0.4 1.0 × 10 2 , –, 1.0 , 0.3
WNES 1.0 × 10 2 , –, 1.0 , 0.4 1.0 × 10 2 , –, 1.0 , 0.4
DPVI-CA 1.0 × 10 2 , 0.1 , –, – 1.0 × 10 2 , 0.3 , –, –
DPVI-DK 1.0 × 10 2 , 0.01 , –, – 1.0 × 10 2 , 0.01 , –, –
WGAD-CA 1.0 × 10 2 , 0.1 , 1.0 , 0.4 1.0 × 10 2 , 0.3 , 1.0 , 0.3
WGAD-DK 1.0 × 10 2 , 0.01 , 1.0 , 0.4 1.0 × 10 2 , 0.01 , 1.0 , 0.3
KWAIG 5.0 × 10 3 , –, 1.0 , 0.8 1.0 × 10 3 , –, 1.0 , 0.7
KWGAD-CA 5.0 × 10 3 , 0.1 , 1.0 , 0.8 1.0 × 10 3 , 0.3 , 1.0 , 0.7
KWGAD-DK 5.0 × 10 3 , 0.01 , 1.0 , 0.8 1.0 × 10 3 , 0.01 , 1.0 , 0.7
SAIG 2.0 × 10 2 , –, 1.0 , 0.7 1.0 × 10 2 , –, 1.0 , 0.6
SGAD-CA 2.0 × 10 2 , 0.1 , 1.0 , 0.7 1.0 × 10 2 , 0.3 , 1.0 , 0.6
SGAD-DK 2.0 × 10 2 , 0.01 , 1.0 , 0.7 1.0 × 10 2 , 0.01 , 1.0 , 0.6
Table A3. Parameters of different algorithms in BNN ( η p o s , η w e i η p o s , η v e l and γ ).
Table A3. Parameters of different algorithms in BNN ( η p o s , η w e i η p o s , η v e l and γ ).
AlgorithmDatasets
Concretekin8nmRedWineSpace
ParVI-BLOB 4.0 × 10 6 , –, –, – 1.0 × 10 6 , –, –, – 3.4 × 10 6 , –, –, – 3.0 × 10 6 , –, –, –
WAIG-BLOB 4.0 × 10 6 , –, 1.0 , 0.2 1.0 × 10 6 , –, 1.0 , 0.3 3.4 × 10 6 , –, 1.0 , 0.5 3.0 × 10 6 , –, 1.0 , 0.5
WNES-BLOB 4.0 × 10 6 , –, 1.0 , 0.3 1.0 × 10 6 , –, 1.0 , 0.2 3.4 × 10 6 , –, 1.0 , 0.2 3.0 × 10 6 , –, 1.0 , 0.2
DPVI-CA-BLOB 4.0 × 10 6 , 1.0 , –, – 1.0 × 10 6 , 0.8 , –, – 3.4 × 10 6 , 0.5 , –, – 3.0 × 10 6 , 1.0 , –, –
DPVI-DK-BLOB 4.0 × 10 6 , 1.0 , –, – 1.0 × 10 6 , 0.8 , –, – 3.4 × 10 6 , 0.5 , –, – 3.0 × 10 6 , 1.0 , –, –
WGAD-CA-BLOB 4.0 × 10 6 , 1.0 , 1.0 , 0.2 1.0 × 10 6 , 0.8 , 1.0 , 0.3 3.4 × 10 6 , 0.5 , 1.0 , 0.5 3.0 × 10 6 , 1.0 , 1.0 , 0.5
WGAD-DK-BLOB 4.0 × 10 6 , 1.0 , 1.0 , 0.2 1.0 × 10 6 , 0.8 , 1.0 , 0.3 3.4 × 10 6 , 0.1 , 1.0 , 0.5 3.0 × 10 6 , 1.0 , 1.0 , 0.5
KWAIG-BLOB 4.0 × 10 6 , –, 1.0 , 0.7 1.0 × 10 6 , –, 1.0 , 0.3 3.4 × 10 6 , –, 1.0 , 0.5 2.0 × 10 6 , –, 1.0 , 0.8
KWGAD-CA-BLOB 4.0 × 10 6 , 1.0 , 1.0 , 0.7 1.0 × 10 6 , 0.8 , 1.0 , 0.3 3.4 × 10 6 , 0.5 , 1.0 , 0.5 2.0 × 10 6 , 1.0 , 1.0 , 0.8
KWGAD-DK-BLOB 4.0 × 10 6 , 1.0 , 1.0 , 0.7 1.0 × 10 6 , 0.8 , 1.0 , 0.3 3.4 × 10 6 , 0.1 , 1.0 , 0.5 2.0 × 10 6 , 1.0 , 1.0 , 0.8
SAIG-BLOB 1.0 × 10 5 , –, 1.0 , 0.6 2.0 × 10 6 , –, 1.0 , 0.3 7.0 × 10 6 , –, 1.0 , 0.5 8.0 × 10 6 , –, 1.0 , 0.8
SGAD-CA-BLOB 1.0 × 10 5 , 1.0 , 1.0 , 0.6 2.0 × 10 6 , 0.8 , 1.0 , 0.3 7.0 × 10 6 , 0.5 , 1.0 , 0.5 8.0 × 10 6 , 1.0 , 1.0 , 0.8
SGAD-DK-BLOB 1.0 × 10 5 , 1.0 , 1.0 , 0.6 2.0 × 10 6 , 0.8 , 1.0 , 0.3 7.0 × 10 6 , 0.1 , 1.0 , 0.5 8.0 × 10 6 , 1.0 , 1.0 , 0.8
ParVI-GFSD 4.0 × 10 6 , –, –, – 1.0 × 10 6 , –, –, – 3.4 × 10 6 , –, –, – 3.0 × 10 6 , –, –, –
WAIG-GFSD 4.0 × 10 6 , –, 1.0 , 0.1 1.0 × 10 6 , –, 1.0 , 0.3 3.4 × 10 6 , –, 1.0 , 0.5 3.0 × 10 6 , –, 1.0 , 0.5
WNES-GFSD 4.0 × 10 6 , –, 1.0 , 0.3 1.0 × 10 6 , –, 1.0 , 0.3 3.4 × 10 6 , –, 1.0 , 0.2 3.0 × 10 6 , –, 1.0 , 0.2
DPVI-CA-GFSD 4.0 × 10 6 , 1.0 , –, – 1.0 × 10 6 , 0.8 , –, – 3.4 × 10 6 , 0.5 , –, – 3.0 × 10 6 , 1.0 , –, –
DPVI-DK-GFSD 4.0 × 10 6 , 1.0 , –, – 1.0 × 10 6 , 0.8 , –, – 3.4 × 10 6 , 0.5 , –, – 3.0 × 10 6 , 1.0 , –, –
WGAD-CA-GFSD 4.0 × 10 6 , 1.0 , 1.0 , 0.1 1.0 × 10 6 , 0.8 , 1.0 , 0.3 3.4 × 10 6 , 0.5 , 1.0 , 0.5 3.0 × 10 6 , 1.0 , 1.0 , 0.5
WGAD-DK-GFSD 4.0 × 10 6 , 1.0 , 1.0 , 0.1 1.0 × 10 6 , 0.8 , 1.0 , 0.3 3.4 × 10 6 , 0.1 , 1.0 , 0.5 3.0 × 10 6 , 1.0 , 1.0 , 0.5
KWAIG-GFSD 4.0 × 10 6 , –, 1.0 , 0.6 1.0 × 10 6 , –, 1.0 , 0.3 3.4 × 10 6 , –, 1.0 , 0.5 3.0 × 10 6 , –, 1.0 , 0.3
KWGAD-CA-GFSD 4.0 × 10 6 , 1.0 , 1.0 , 0.6 1.0 × 10 6 , 0.8 , 1.0 , 0.3 3.4 × 10 6 , 0.5 , 1.0 , 0.5 3.0 × 10 6 , 1.0 , 1.0 , 0.3
KWGAD-DK-GFSD 4.0 × 10 6 , 1.0 , 1.0 , 0.6 1.0 × 10 6 , 0.8 , 1.0 , 0.3 3.4 × 10 6 , 0.1 , 1.0 , 0.5 3.0 × 10 6 , 1.0 , 1.0 , 0.3
SAIG-GFSD 4.0 × 10 6 , –, 1.0 , 0.2 2.0 × 10 6 , –, 1.0 , 0.3 7.0 × 10 6 , –, 1.0 , 0.5 6.0 × 10 6 , –, 1.0 , 0.3
SGAD-CA-GFSD 4.0 × 10 6 , 1.0 , 1.0 , 0.2 2.0 × 10 6 , 0.8 , 1.0 , 0.3 7.0 × 10 6 , 0.5 , 1.0 , 0.5 6.0 × 10 6 , 1.0 , 1.0 , 0.3
SGAD-DK-GFSD 4.0 × 10 6 , 1.0 , 1.0 , 0.2 2.0 × 10 6 , 0.8 , 1.0 , 0.3 7.0 × 10 6 , 0.1 , 1.0 , 0.5 6.0 × 10 6 , 1.0 , 1.0 , 0.3
Table A4. Parameters of different algorithms in morphing and sketching ( η p o s , η w e i η p o s , η v e l and γ ).
Table A4. Parameters of different algorithms in morphing and sketching ( η p o s , η w e i η p o s , η v e l and γ ).
AlgorithmsSampling Tasks
A→BB→CC→DD→ASketching
ParVI-MMDF 3.0 × 10 2 , –, –, – 3.0 × 10 2 , –, –, – 3.0 × 10 2 , –, –, – 3.0 × 10 2 , –, –, – 3.0 × 10 2 , –, –, –
WAIG-MMDF 3.0 × 10 2 , –, 1.0 , 0.8 3.0 × 10 2 , –, 1.0 , 0.8 3.0 × 10 2 , –, 1.0 , 0.8 3.0 × 10 2 , –, 1.0 , 0.8 3.0 × 10 2 , –, 1.0 , 0.8
DPVI-DK-MMDF 3.0 × 10 2 , 5.0 × 10 1 , –, – 3.0 × 10 2 , 5.0 × 10 1 , –, – 3.0 × 10 2 , 5.0 × 10 1 , –, – 3.0 × 10 2 , 5.0 × 10 1 , –, – 3.0 × 10 2 , 5.0 × 10 1 , –, –
DPVI-CA-MMDF 3.0 × 10 2 , 1.0 × 10 1 , –, – 3.0 × 10 2 , 1.0 × 10 1 , –, – 3.0 × 10 2 , 1.0 × 10 1 , –, – 3.0 × 10 2 , 1.0 × 10 1 , –, – 3.0 × 10 2 , 1.0 × 10 1 , –, –
WGAD-DK-MMDF 3.0 × 10 2 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8 3.0 × 10 2 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8 3.0 × 10 2 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8 3.0 × 10 2 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8 3.0 × 10 2 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8
WGAD-CA-MMDF 3.0 × 10 2 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8 3.0 × 10 2 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8 3.0 × 10 2 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8 3.0 × 10 2 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8 3.0 × 10 2 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8
ParVI-SDF 1.0 × 10 1 , –, –, – 1.0 × 10 1 , –, –, – 1.0 × 10 1 , –, –, – 1.0 × 10 1 , –, –, – 1.0 × 10 1 , –, –, –
WAIG-SDF 1.0 × 10 1 , –, 1.0 , 0.8 1.0 × 10 1 , –, 1.0 , 0.8 1.0 × 10 1 , –, 1.0 , 0.8 1.0 × 10 1 , –, 1.0 , 0.8 1.0 × 10 1 , –, 1.0 , 0.8
DPVI-DK-SDF 1.0 × 10 1 , 5.0 × 10 1 , –, – 1.0 × 10 1 , 5.0 × 10 1 , –, – 1.0 × 10 1 , 5.0 × 10 1 , –, – 1.0 × 10 1 , 5.0 × 10 1 , –, – 1.0 × 10 1 , 5.0 × 10 1 , –, –
DPVI-CA-SDF 1.0 × 10 1 , 1.0 × 10 1 , –, – 1.0 × 10 1 , 1.0 × 10 1 , –, – 1.0 × 10 1 , 1.0 × 10 1 , –, – 1.0 × 10 1 , 1.0 × 10 1 , –, – 1.0 × 10 1 , 1.0 × 10 1 , –, –
WGAD-DK-SDF 1.0 × 10 1 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8 1.0 × 10 1 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8 1.0 × 10 1 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8 1.0 × 10 1 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8 1.0 × 10 1 , 5.0 × 10 1 , 1.0 , 1.0 , 0.8
WGAD-CA-SDF 1.0 × 10 1 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8 1.0 × 10 1 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8 1.0 × 10 1 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8 1.0 × 10 1 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8 1.0 × 10 1 , 1.0 × 10 1 , 1.0 , 1.0 , 0.8

Appendix C.3. Additional Experiments Results

Appendix C.3.1. Results for SG

In this section, we give empirical results on approximating a single-mode Gaussian distribution, whose density is defined as
\[
\pi(x) \propto \exp\Big(-\frac{1}{2}x^{\top}\Sigma^{-1}x\Big),
\]
where $\Sigma_{ii} = 1.0$ and $\Sigma_{ij} = 0.8$ for $i \ne j$. To investigate the influence of the particle number $M$ in this task, we run all the algorithms with $M \in \{32, 64, 128, 256, 512\}$. All the particles are initialized from a Gaussian distribution with zero mean and covariance matrix $0.5\, I_{10\times 10}$.
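The score of this target is linear, $\nabla\log\pi(x) = -\Sigma^{-1}x$; for completeness, a small sketch with the stated covariance (in dimension 10, matching the initialization) follows.

```python
# Score function of the correlated Gaussian target used in the SG task:
# Sigma has unit diagonal and 0.8 off-diagonal entries; dimension 10 per the initialization.
import numpy as np

dim = 10
Sigma = 0.8 * np.ones((dim, dim)) + 0.2 * np.eye(dim)
Sigma_inv = np.linalg.inv(Sigma)

def score(x):
    """grad log pi(x) = -Sigma^{-1} x for the unnormalized density above."""
    return -x @ Sigma_inv
```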
In Figure A2 and Figure A3, we plot the $W_2$ distance to the target of the samples generated by each algorithm with respect to iteration and time. We generate 5000 samples from the target distribution $\pi$ as a reference to evaluate $W_2$. As this task is a simple single-Gaussian model, the approximation-error difference between the GAD-PVI algorithms and the fixed-weight ones is not very pronounced. When the particle number increases, the effect of the dynamic weight adjustment scheme becomes smaller, since such a large number of fixed-weight particles is already sufficient for approximating this simple distribution. Meanwhile, the faster convergence brought by the accelerated position update strategies is quite obvious in the figures. Moreover, we can see that the GFSD-type algorithms cannot outperform the BLOB-type algorithms and SVGD; this may be due to the lack of a repulsive mechanism in GFSD, which leads to particle-system collapse in single-mode tasks. This also coincides with the discussion in Appendix B.3.
In Table A5, we further report the final $W_2$ distance between the empirical distribution generated by each algorithm and the target distribution. It can be observed that, in the Wasserstein-metric case, the CA strategy consistently outperforms its fixed-weight counterparts with the same number of particles, whereas the DK variants are weakened owing to the single modality of this task. However, in the KW- or S-metric case, the GAD-DK algorithms outperform the others in the majority of cases, which is due to the limited transport ability of the KW and S metrics: a direct duplicate/kill mechanism greatly enhances the transport speed from low-probability regions to high-probability regions. We also find that the KW- and S-type algorithms achieve poorer approximation results than the WGAD algorithms in terms of the Wasserstein distance to the reference points, which may be because the WGAD algorithms implicitly minimize the Wasserstein distance, whereas this is not the case for the other two types.
Table A5. Averaged test W 2 distances for different ParVI methods in the SG task.
Table A5. Averaged test W 2 distances for different ParVI methods in the SG task.
AlgorithmNumber of Particles
3264128256512
ParVI-SVGD 1.320 1.228 1.162 1.102 1.027
ParVI-BLOB 1.315 1.229 1.164 1.102 1.038
WAIG-BLOB 1.314 1.230 1.164 1.100 1.038
WNes-BLOB 1.315 1.230 1.164 1.101 1.037
DPVI-DK-BLOB 1.313 1.229 1.163 1.100 1.035
DPVI-CA-BLOB 1.309 1.227 1.162 1.102 1.037
WGAD-DK-BLOB(Ours) 1.313 1.229 1.163 1.098 1.034
WGAD-CA-BLOB(Ours) 1.300 1.226 1.161 1.099 1.036
KWAIG-BLOB 1.943 1.922 1.889 1.827 1.777
KWGAD-DK-BLOB(Ours) 1.920 1.884 × 10 2 1.848 1.798 1.757
KWGAD-CA-BLOB(Ours) 1.942 1.901 1.865 1.816 1.764
SAIG-BLOB 1.451 1.412 1.429 1.436 1.479
SGAD-DK-BLOB(Ours) 1.435 1.355 1.341 1.219 1.143
SGAD-CA-BLOB(Ours) 1.444 1.396 1.412 1.405 1.407
ParVI-GFSD 1.453 1.353 1.267 1.198 1.136
WAIG-GFSD 1.449 1.353 1.264 1.196 1.134
WNes-GFSD 1.450 1.353 1.265 1.197 1.135
DPVI-DK-GFSD 1.448 1.347 1.267 1.197 1.135
DPVI-CA-GFSD 1.446 1.349 1.259 1.195 1.133
WGAD-DK-GFSD(Ours) 1.448 1.345 1.264 1.195 1.134
WGAD-CA-GFSD(Ours) 1.398 1.332 1.252 1.191 1.131
KWAIG-GFSD 2.246 2.171 2.134 2.082 × 10 2 2.046
KWGAD-DK-GFSD(Ours) 2.215 2.154 2.092 2.073 × 10 2 2.022
KWGAD-CA-GFSD(Ours) 2.204 2.160 2.110 2.081 × 10 2 2.035
SAIG-GFSD 1.823 1.789 1.760 1.626 × 10 2 1.609
SGAD-DK-GFSD(Ours) 1.881 1.782 1.609 1.437 × 10 2 1.401
SGAD-CA-GFSD(Ours) 1.820 1.783 1.720 1.602 × 10 2 1.592

Appendix C.3.2. Additional Results for GMM

We provide additional results for the Gaussian Mixture Model experiment. In Figure 1 and Figure A4, we plot the $W_2$ distance w.r.t. iteration of each algorithm for all $M \in \{32, 64, 128, 256, 512\}$. From these figures, we can observe that, compared with the baseline ParVI algorithms, our GAD-PVI with either the CA strategy or the DK variant achieves better performance, i.e., lower approximation error and faster convergence. Actually, as discussed in the methodology part, while the weight-adjustment step in GAD-PVI greatly enhances the expressiveness of the particles' empirical distribution, the accelerated position update strategy also brings faster convergence. From Figure 1, we find that the DK variants decrease quite quickly at first in the WGAD case; this is because the duplicate/kill scheme greatly enhances the particle transport ability, moving more particles to high-probability regions at the beginning compared to moving particles step by step. In Figure A4, we observe that the DK variants also show better results compared to other algorithms, which may be because of the slow transport property of the KW- and S-type algorithms.
In Table A6, we further report the final $W_2$ distance between the empirical distribution generated by each algorithm and the target distribution. It can be observed that the GAD-PVI algorithms consistently achieve better approximation results than the existing algorithms. Notably, for this complex multi-mode task, the DK variants show their advantage, as the duplicate/kill operation allows transferring particles from a low-probability region to a distant high-probability area (e.g., among different local modes), especially in the KW/S case. However, the DK variants in the Wasserstein case are not as competitive as the CA algorithms; this difference may arise because transport in the KW/S metric spaces is much more hindered by potential barriers and therefore relies on the DK scheme to move particles among different local modes, whereas the Wasserstein metric handles the multi-modality sufficiently well with the CA strategy. Furthermore, the GAD-PVI algorithms with the CA strategy are usually more stable than their counterparts with DK, which may be ascribed to the fluctuations induced by the discrete weight adjustment ($0$ or $1/M$) in DK.
Table A6. Averaged test W 2 distances for different ParVI methods in GMM tasks.
Table A6. Averaged test W 2 distances for different ParVI methods in GMM tasks.
AlgorithmNumber of Particles
3264128256512
ParVI-SVGD 2.175 2.101 2.088 2.026 2.044
ParVI-BLOB 2.317 2.779 2.292 2.440 2.294
WAIG-BLOB 2.317 2.775 2.039 1.976 1.845
WNes-BLOB 2.317 2.777 2.213 2.329 2.180
DPVI-DK-BLOB 2.065 2.066 1.859 1.735 1.704
DPVI-CA-BLOB 2.039 1.934 1.825 1.727 1.633
WGAD-DK-BLOB(Ours) 2.064 1.933 1.831 1.727 1.650
WGAD-CA-BLOB(Ours) 2.037 1.929 1.824 1.725 1.632
KWAIG-BLOB 4.937 4.674 4.600 4.242 4.221
KWGAD-DK-BLOB(Ours) 2.854 2.622 2.400 2.542 2.248
KWGAD-CA-BLOB(Ours) 4.565 4.206 4.094 3.836 3.767
SAIG-BLOB 5.070 4.632 4.554 4.140 4.032
SGAD-DK-BLOB(Ours) 2.863 2.914 2.760 1.1890 2.247
SGAD-CA-BLOB(Ours) 3.406 3.051 2.659 2.595 2.492
ParVI-GFSD 2.427 2.888 2.331 2.567 2.398
WAIG-GFSD 2.425 2.885 2.328 2.206 2.094
WNes-GFSD 2.426 2.888 2.330 2.494 2.337
DPVI-DK-GFSD 2.151 2.025 1.924 1.837 1.769
DPVI-CA-GFSD 2.134 2.025 1.928 1.838 1.755
WGAD-DK-GFSD(Ours) 2.150 2.017 1.924 1.834 1.755
WGAD-CA-GFSD(Ours) 2.120 2.019 1.923 1.835 1.754
KWAIG-GFSD 5.072 4.780 4.706 4.381 4.359
KWGAD-DK-GFSD(Ours) 3.086 2.682 2.817 2.385 2.393
KWGAD-CA-GFSD(Ours) 4.597 4.389 4.262 3.960 3.929
SAIG-GFSD 4.828 4.577 4.555 4.151 4.075
SGAD-DK-GFSD(Ours) 2.994 3.411 2.881 2.347 2.324
SGAD-CA-GFSD(Ours) 3.937 4.142 3.676 3.617 4.007

Appendix C.3.3. Additional Results for GP

Here, we provide additional results of the KW/S-type methods for the Gaussian Process regression task in Table A7. The results are quite similar to the Wasserstein case, i.e., both the accelerated position update and the dynamic weight adjustment lead to a decreased $W_2$, and the GAD-PVI algorithms consistently achieve the lowest $W_2$ to the target. Note that the difference between the DK-type GAD-PVI methods and their fixed-weight counterparts is not very pronounced, since the single-mode nature of the GP task greatly weakens the advantage of DK, i.e., transferring particles from a low-probability region to a distant high-probability area (e.g., among different local modes).
Table A7. Averaged W 2 distances after 10,000 iterations with different KW/S-type algorithms in the GP task with dataset LIDAR.
Table A7. Averaged W 2 distances after 10,000 iterations with different KW/S-type algorithms in the GP task with dataset LIDAR.
AlgorithmSmoothing Strategy
BLOBGFSD
KWAIG 1.571 × 10 1 ± 2.190 × 10 4 2.146 × 10 1 ± 8.608 × 10 4
GAD-KW-DK 1.566 × 10 1 ± 1.791 × 10 3 2.072 × 10 1 ± 1.772 × 10 3
GAD-KW-CA 1.341 × 10 1 ± 1.494 × 10 4 1.991 × 10 1 ± 4.415 × 10 4
SAIG 1.570 × 10 1 ± 3.791 × 10 4 2.084 × 10 1 ± 7.575 × 10 3
GAD-S-DK 1.552 × 10 1 ± 1.090 × 10 2 2.012 × 10 1 ± 4.960 × 10 3
GAD-S-CA 1.236 × 10 1 ± 2.872 × 10 4 1.691 × 10 1 ± 3.499 × 10 3

Appendix C.3.4. Additional Results for BNN

We provide additional test negative log-likelihood (NLL) results for the Bayesian neural network experiment for all algorithms in Table A8, and the test RMSE results for the KW/S-type algorithms in Table A9. The results demonstrate that the combination of the accelerated position update strategy and the dynamic weight adjustment leads to lower NLL and RMSE under the different specific IFR spaces; the WGAD-PVI algorithms with CA usually achieve the best performance in the Wasserstein case, while the KW/SGAD-PVI algorithms with CA or DK are comparable to each other. Note that the position step size of GAD-PVI is set to the value tuned for its fixed-weight counterpart; if we retuned the position step size for all GAD-PVI algorithms, they would be expected to achieve even better performance than reported.
Table A8. Averaged test N L L distances for different ParVI methods in the BNN task.
Table A8. Averaged test N L L distances for different ParVI methods in the BNN task.
AlgorithmDatasets
Concretekin8nmRedWineSpace
ParVI-SVGD 1.738 1.160 6.943 2.739
ParVI-BLOB 1.849 1.122 6.900 2.742
WAIG-BLOB 1.710 1.093 6.790 2.638
WNES-BLOB 1.734 1.065 6.799 2.679
DPVI-DK-BLOB 1.833 1.122 6.848 2.687
DPVI-CA-BLOB 1.837 1.093 6.856 2.685
WGAD-DK-BLOB(Ours) 1.703 1.065 6.785 2.602
WGAD-CA-BLOB(Ours) 1.697 1.048 6.782 2.595
KWAIG-BLOB 1.794 1.199 9.322 × 10 1 2.751
KWGAD-DK-BLOB(Ours) 1.788 1.169 9.300 × 10 1 2.743
KWGAD-CA-BLOB(Ours) 1.789 1.173 9.304 × 10 1 2.736
SAIG-BLOB 1.832 1.136 9.349 × 10 1 2.711
SGAD-DK-BLOB(Ours) 1.821 1.068 9.292 × 10 1 2.708
SGAD-CA-BLOB(Ours) 1.824 1.119 9.270 × 10 1 2.642
ParVI-GFSD 1.850 1.122 6.899 2.742
WAIG-GFSD 1.724 1.094 6.785 2.638
WNES-GFSD 1.740 1.108 6.781 2.679
DPVI-DK-GFSD 1.836 1.120 6.812 2.687
DPVI-CA-GFSD 1.836 1.093 6.857 2.687
WGAD-DK-GFSD(Ours) 1.722 1.075 6.780 2.593
WGAD-CA-GFSD(Ours) 1.720 1.050 6.774 2.597
KWAIG-GFSD 1.820 1.199 9.337 × 10 1 2.759
KWGAD-DK-GFSD(Ours) 1.813 1.176 9.305 × 10 1 2.740
KWGAD-CA-GFSD(Ours) 1.812 1.169 9.319 × 10 1 2.746
SAIG-GFSD 1.814 1.116 9.444 × 10 1 2.782
SGAD-DK-GFSD(Ours) 1.809 1.062 9.391 × 10 1 2.555
SGAD-CA-GFSD(Ours) 1.800 1.098 9.359 × 10 1 2.745
Table A9. Averaged test R M S E for different ParVI methods in the BNN task.
Table A9. Averaged test R M S E for different ParVI methods in the BNN task.
AlgorithmDatasets
Concretekin8nmRedWineSpace
KWAIG-BLOB 6.217 8.159 × 10 2 6.916 8.829 × 10 2
KWGAD-DK-BLOB(Ours) 6.207 8.069 × 10 2 6.910 8.760 × 10 2
KWGAD-CA-BLOB(Ours) 6.208 8.058 × 10 2 6.896 8.753 × 10 2
SAIG-BLOB 6.323 7.936 × 10 2 6.856 8.899 × 10 2
SGAD-DK-BLOB(Ours) 6.293 7.695 × 10 2 6.828 8.867 × 10 2
SGAD-CA-BLOB(Ours) 6.269 7.872 × 10 2 6.811 8.783 × 10 2
KWAIG-GFSD 6.330 8.158 × 10 2 6.924 8.948 × 10 2
KWGAD-DK-GFSD(Ours) 6.274 8.079 × 10 2 6.854 8.918 × 10 2
KWGAD-CA-GFSD(Ours) 6.251 8.057 × 10 2 6.917 8.927 × 10 2
SAIG-GFSD 6.266 7.864 × 10 2 6.861 8.998 × 10 2
SGAD-DK-GFSD(Ours) 6.257 7.679 × 10 2 6.804 8.638 × 10 2
SGAD-CA-GFSD(Ours) 6.228 7.801 × 10 2 6.793 8.931 × 10 2
Figure A1. The shape morphing of the source shape CAT to the target SPIRAL.
Figure A1. The shape morphing of the source shape CAT to the target SPIRAL.
Entropy 26 00679 g0a1
Figure A2. Averaged test W 2 distance to the target with respect to iterations in the SG task for algorithms (W-Type).
Figure A2. Averaged test W 2 distance to the target with respect to iterations in the SG task for algorithms (W-Type).
Entropy 26 00679 g0a2
Figure A3. Averaged test W 2 distance to the target with respect to iterations in the SG task for algorithms (KW/S-Type).
Figure A3. Averaged test W 2 distance to the target with respect to iterations in the SG task for algorithms (KW/S-Type).
Entropy 26 00679 g0a3
Figure A4. Averaged test W 2 distance to the target with respect to iterations in the GMM task for algorithms (KW/S-Type).
Figure A4. Averaged test W 2 distance to the target with respect to iterations in the GMM task for algorithms (KW/S-Type).
Entropy 26 00679 g0a4

Appendix C.3.5. Results for Morphing

We provide additional morphing results in Figure A1, which shows the morphing process from CAT to SPIRAL.

Appendix C.3.6. Results for GAD-KSDD

The newly derived KSDD method proposed by [64] evolves the particle system by directly minimizing the Kernel Stein Discrepancy (KSD) between the particles and the target distribution. The KSDD method is the first ParVI that introduces a dissimilarity functional whose first variation is well defined at a discrete empirical distribution, thus resulting in no approximation error when the particle number is infinite. Though theoretically appealing, the empirical performance of KSDD is not satisfying, for it is more computationally expensive and has been widely reported to be less stable. Furthermore, KSDD is also reported to leave particles easily trapped at saddle points, to demand convexity of the task, and to be sensitive to its parameters.
As shown in Figure A5, we conduct simple experiments with Wasserstein KSDD-type algorithms on the SG task to illustrate that our GAD-PVI framework is compatible with the KSD-KSDD approach. The bandwidth of KSDD is reported to require careful tuning [23,64]; we follow the conventions in [28,51] and set the parameter $h$ via a grid search. From Figure A5, it can be observed that our GAD-PVI algorithms achieve the best results under different particle-number settings, which illustrates that our framework can cooperate with this smoothing approach. However, due to the limitations of KSDD itself, it is not realistic to fine-tune its parameters and conduct empirical studies of KSDD ParVI algorithms on complex tasks; additionally, even in this simple SG task, the final results of the KSDD-type algorithms are not competitive with the other methods. Therefore, we exclude KSDD-type experiments from GMM, GP, and BNN.
Figure A5. Averaged test W 2 distance to the target with respect to iterations in the SG task for algorithms (WGAD-KSDD-Type).
Figure A5. Averaged test W 2 distance to the target with respect to iterations in the SG task for algorithms (WGAD-KSDD-Type).
Entropy 26 00679 g0a5

References

  1. Akbayrak, S.; Bocharov, I.; de Vries, B. Extended variational message passing for automated approximate Bayesian inference. Entropy 2021, 23, 815. [Google Scholar] [CrossRef] [PubMed]
  2. Sharif-Razavian, N.; Zollmann, A. An overview of nonparametric bayesian models and applications to natural language processing. Science 2008, 71–93. Available online: https://www.cs.cmu.edu/~zollmann/publications/nonparametric.pdf (accessed on 16 August 2023).
  3. Siddhant, A.; Lipton, Z.C. Deep bayesian active learning for natural language processing: Results of a large-scale empirical study. arXiv 2018, arXiv:1808.05697. [Google Scholar]
  4. Luo, L.; Yang, J.; Zhang, B.; Jiang, J.; Huang, H. Nonparametric Bayesian Correlated Group Regression With Applications to Image Classification. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5330–5344. [Google Scholar] [CrossRef] [PubMed]
  5. Du, C.; Du, C.; Huang, L.; He, H. Reconstructing Perceived Images From Human Brain Activities With Bayesian Deep Multiview Learning. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 2310–2323. [Google Scholar] [CrossRef] [PubMed]
  6. Frank, P.; Leike, R.; Enßlin, T.A. Geometric variational inference. Entropy 2021, 23, 853. [Google Scholar] [CrossRef] [PubMed]
  7. Mohammad-Djafari, A. Entropy, information theory, information geometry and Bayesian inference in data, signal and image processing and inverse problems. Entropy 2015, 17, 3989–4027. [Google Scholar] [CrossRef]
  8. Jewson, J.; Smith, J.Q.; Holmes, C. Principles of Bayesian inference using general divergence criteria. Entropy 2018, 20, 442. [Google Scholar] [CrossRef] [PubMed]
  9. Konishi, T.; Kubo, T.; Watanabe, K.; Ikeda, K. Variational Bayesian Inference Algorithms for Infinite Relational Model of Network Data. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 2176–2181. [Google Scholar] [CrossRef]
  10. Chen, Z.; Song, Z.; Ge, Z. Variational Inference Over Graph: Knowledge Representation for Deep Process Data Analytics. IEEE Trans. Knowl. Data Eng. 2023, 36, 730–2744. [Google Scholar] [CrossRef]
  11. Wang, H.; Fan, J.; Chen, Z.; Li, H.; Liu, W.; Liu, T.; Dai, Q.; Wang, Y.; Dong, Z.; Tang, R. Optimal Transport for Treatment Effect Estimation. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LO, USA, 10–16 December 2023. [Google Scholar]
  12. Geyer, C.J. Practical markov chain monte carlo. Stat. Sci. 1992, 7, 473–483. [Google Scholar] [CrossRef]
  13. Carlo, C.M. Markov chain monte carlo and gibbs sampling. Lect. Notes EEB 2004, 581, 3. [Google Scholar]
  14. Neal, R.M. MCMC using Hamiltonian dynamics. arXiv 2012, arXiv:1206.1901. [Google Scholar]
  15. Chen, T.; Fox, E.; Guestrin, C. Stochastic gradient hamiltonian monte carlo. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1683–1691. [Google Scholar]
  16. Betancourt, M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv 2017, arXiv:1701.02434. [Google Scholar]
  17. Doucet, A.; De Freitas, N.; Gordon, N.J. Sequential Monte Carlo Methods in Practice; Springer: Cham, Switzerland, 2001; Volume 1. [Google Scholar]
  18. Del Moral, P.; Doucet, A.; Jasra, A. Sequential monte carlo samplers. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006, 68, 411–436. [Google Scholar] [CrossRef]
  19. Septier, F.; Peters, G.W. Langevin and Hamiltonian based sequential MCMC for efficient Bayesian filtering in high-dimensional spaces. IEEE J. Sel. Top. Signal Process. 2015, 10, 312–327. [Google Scholar] [CrossRef]
  20. Liu, Q.; Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. arXiv 2016, arXiv:1608.04471. [Google Scholar]
  21. Zhu, M.; Liu, C.; Zhu, J. Variance Reduction and Quasi-Newton for Particle-Based Variational Inference. In Proceedings of the ICML, Virtual, 13–18 July 2020; pp. 11576–11587. [Google Scholar]
  22. Shen, Z.; Heinonen, M.; Kaski, S. De-randomizing MCMC dynamics with the diffusion Stein operator. In Proceedings of the Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 17507–17517. [Google Scholar]
  23. Zhang, C.; Li, Z.; Du, X.; Qian, H. DPVI: A Dynamic-Weight Particle-Based Variational Inference Framework. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, Vienna, Austria, 23–29 July 2022; pp. 4900–4906. [Google Scholar] [CrossRef]
  24. Li, L.; Liu, Q.; Korba, A.; Yurochkin, M.; Solomon, J. Sampling with Mollified Interaction Energy Descent. In Proceedings of the The Eleventh International Conference on Learning Representations, Vienna, Austria, 7–11 May 2023. [Google Scholar]
  25. Galy-Fajou, T.; Perrone, V.; Opper, M. Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation. Entropy 2021, 23, 990. [Google Scholar] [CrossRef] [PubMed]
  26. Chen, C.; Zhang, R.; Wang, W.; Li, B.; Chen, L. A unified particle-optimization framework for scalable Bayesian sampling. arXiv 2018, arXiv:1805.11659. [Google Scholar]
  27. Liu, C.; Zhuo, J.; Cheng, P.; Zhang, R.; Zhu, J. Understanding and accelerating particle-based variational inference. In Proceedings of the ICML, Long Beach, CA, USA, 9–15 June 2019; pp. 4082–4092. [Google Scholar]
  28. Korba, A.; Aubin-Frankowski, P.C.; Majewski, S.; Ablin, P. Kernel Stein Discrepancy Descent. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 5719–5730. [Google Scholar]
  29. Craig, K.; Bertozzi, A. A blob method for the aggregation equation. Math. Comput. 2016, 85, 1681–1717. [Google Scholar] [CrossRef]
  30. Arbel, M.; Korba, A.; Salim, A.; Gretton, A. Maximum mean discrepancy gradient flow. arXiv 2019, arXiv:1906.04370. [Google Scholar]
  31. Zhu, H.; Wang, F.; Zhang, C.; Zhao, H.; Qian, H. Neural Sinkhorn Gradient Flow. arXiv 2024, arXiv:2401.14069. [Google Scholar]
  32. Taghvaei, A.; Mehta, P. Accelerated flow for probability distributions. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6076–6085. [Google Scholar]
  33. Wang, Y.; Li, W. Accelerated Information Gradient Flow. J. Sci. Comput. 2022, 90, 11. [Google Scholar] [CrossRef]
  34. Liu, Y.; Shang, F.; Cheng, J.; Cheng, H.; Jiao, L. Accelerated first-order methods for geodesically convex optimization on Riemannian manifolds. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper/2017/hash/6ef80bb237adf4b6f77d0700e1255907-Abstract.html (accessed on 16 August 2023).
  35. Zhang, H.; Sra, S. An estimate sequence for geodesically convex optimization. In Proceedings of the Conference on Learning Theory, Stockholm, Sweden, 6–9 July 2018; pp. 1703–1723. [Google Scholar]
  36. Wibisono, A.; Wilson, A.C.; Jordan, M.I. A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. USA 2016, 113, E7351–E7358. [Google Scholar] [CrossRef]
  37. Lafferty, J.D. The Density Manifold and Configuration Space Quantization. Trans. Am. Math. Soc. 1988, 305, 699–741. [Google Scholar] [CrossRef]
  38. Nesterov, Y. Lectures on Convex Optimization; Springer: Cham, Switzerland, 2018; Volume 137. [Google Scholar]
  39. Carrillo, J.A.; Choi, Y.P.; Tse, O. Convergence to equilibrium in Wasserstein distance for damped Euler equations with interaction forces. Commun. Math. Phys. 2019, 365, 329–361. [Google Scholar] [CrossRef]
  40. Gallouët, T.O.; Monsaingeon, L. A JKO Splitting Scheme for Kantorovich–Fisher–Rao Gradient Flows. SIAM J. Math. Anal. 2017, 49, 1100–1130. [Google Scholar] [CrossRef]
  41. Rotskoff, G.; Jelassi, S.; Bruna, J.; Vanden-Eijnden, E. Global convergence of neuron birth-death dynamics. arXiv 2019, arXiv:1902.01843. [Google Scholar]
  42. Chizat, L.; Peyré, G.; Schmitzer, B.; Vialard, F.X. An interpolating distance between optimal transport and Fisher–Rao metrics. Found. Comput. Math. 2018, 18, 1–44. [Google Scholar] [CrossRef]
  43. Kondratyev, S.; Monsaingeon, L.; Vorotnikov, D. A new optimal transport distance on the space of finite Radon measures. Adv. Differ. Equ. 2016, 21, 1117–1164. [Google Scholar] [CrossRef]
  44. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef]
  45. Liu, Q. Stein variational gradient descent as gradient flow. arXiv 2017, arXiv:1704.07520. [Google Scholar]
  46. Wang, D.; Liu, Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv 2016, arXiv:1611.01722. [Google Scholar]
  47. Pu, Y.; Gan, Z.; Henao, R.; Li, C.; Han, S.; Carin, L. VAE Learning via Stein Variational Gradient Descent. In Proceedings of the NIPS, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  48. Liu, Y.; Ramachandran, P.; Liu, Q.; Peng, J. Stein variational policy gradient. arXiv 2017, arXiv:1704.02399. [Google Scholar]
  49. Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 1352–1361. [Google Scholar]
  50. Liu, W.; Zheng, X.; Su, J.; Zheng, L.; Chen, C.; Hu, M. Contrastive Proxy Kernel Stein Path Alignment for Cross-Domain Cold-Start Recommendation. IEEE Trans. Knowl. Data Eng. 2023, 35, 11216–11230. [Google Scholar] [CrossRef]
  51. Lu, Y.; Lu, J.; Nolen, J. Accelerating Langevin sampling with birth-death. arXiv 2019, arXiv:1905.09863. [Google Scholar]
  52. Shen, Z.; Wang, Z.; Ribeiro, A.; Hassani, H. Sinkhorn barycenter via functional gradient descent. Adv. Neural Inf. Process. Syst. 2020, 33, 986–996. [Google Scholar]
  53. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. NIPS 2014, 27, 139–144. [Google Scholar]
  54. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  55. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  56. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  57. Choi, J.; Choi, J.; Kang, M. Scalable Wasserstein Gradient Flow for Generative Modeling through Unbalanced Optimal Transport. arXiv 2024, arXiv:2402.05443. [Google Scholar]
  58. Ranganath, R.; Gerrish, S.; Blei, D. Black box variational inference. In Proceedings of the Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 814–822. [Google Scholar]
  59. Ambrosio, L.; Gigli, N.; Savaré, G. Gradient Flows: In Metric Spaces and in the Space of Probability Measures; Springer Science & Business Media: New York, NY, USA, 2008. [Google Scholar]
  60. Peyré, G.; Cuturi, M. Computational optimal transport. Cent. Res. Econ. Stat. Work. Pap. 2017. Available online: https://ideas.repec.org/p/crs/wpaper/2017-86.html (accessed on 16 August 2023).
  61. Platen, E.; Bruti-Liberati, N. Numerical Solution of Stochastic Differential Equations with Jumps in Finance; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2010; Volume 64. [Google Scholar]
  62. Butcher, J.C. Implicit Runge–Kutta processes. Math. Comput. 1964, 18, 50–64. [Google Scholar] [CrossRef]
  63. Süli, E.; Mayers, D.F. An Introduction to Numerical Analysis; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  64. Korba, A.; Salim, A.; Arbel, M.; Luise, G.; Gretton, A. A non-asymptotic analysis for Stein variational gradient descent. NeurIPS 2020, 33, 4672–4682. [Google Scholar]
  65. Rasmussen, C.E. Gaussian processes in machine learning. In Proceedings of the Summer School on Machine Learning, Tübingen, Germany, 4–16 August 2003; pp. 63–71. [Google Scholar]
  66. Chen, W.Y.; Mackey, L.; Gorham, J.; Briol, F.X.; Oates, C. Stein points. In Proceedings of the ICML, Stockholm, Sweden, 10–15 July 2018; pp. 844–853. [Google Scholar]
  67. Brooks, S.; Gelman, A.; Jones, G.; Meng, X.L. Handbook of Markov Chain Monte Carlo; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
  68. Cuturi, M.; Doucet, A. Fast computation of Wasserstein barycenters. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 685–693. [Google Scholar]
  69. Solomon, J.; De Goes, F.; Peyré, G.; Cuturi, M.; Butscher, A.; Nguyen, A.; Du, T.; Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph. ToG 2015, 34, 1–11. [Google Scholar] [CrossRef]
  70. Mroueh, Y.; Sercu, T.; Raj, A. Sobolev descent. In Proceedings of the Artificial Intelligence and Statistics, Okinawa, Japan, 16–18 April 2019; pp. 2976–2985. [Google Scholar]
  71. Santambrogio, F. {Euclidean, metric, and Wasserstein} gradient flows: An overview. Bull. Math. Sci. 2017, 7, 87–154. [Google Scholar] [CrossRef]
  72. Von Mises, R.; Geiringer, H.; Ludford, G.S.S. Mathematical Theory of Compressible Fluid Flow; Courier Corporation: North Chelmsford, MA, USA, 2004. [Google Scholar]
  73. Mroueh, Y.; Rigotti, M. Unbalanced Sobolev Descent. NeurIPS 2020, 33, 17034–17043. [Google Scholar]
  74. Garbuno-Inigo, A.; Hoffmann, F.; Li, W.; Stuart, A.M. Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler. SIAM J. Appl. Dyn. Syst. 2020, 19, 412–441. [Google Scholar] [CrossRef]
  75. Nüsken, N.; Renger, D. Stein Variational Gradient Descent: Many-particle and long-time asymptotics. arXiv 2021, arXiv:2102.12956. [Google Scholar] [CrossRef]
  76. Zhang, J.; Zhang, R.; Carin, L.; Chen, C. Stochastic particle-optimization sampling and the non-asymptotic convergence theory. In Proceedings of the Artificial Intelligence and Statistics, Online, 26–28 August 2020; pp. 1877–1887. [Google Scholar]
Figure 1. W₂ distance to the target with respect to iterations in the GMM task.
Figure 2. The contour lines of the log posterior in the Gaussian Process task (all variants with BLOB strategy).
Figure 3. W₂ distance to the target with respect to iterations in the shape morphing task.
Figure 4. W₂ distance to the target with respect to iterations in the sketching task.
Figure 5. The sketching task from random noise to cheetah, M = 1000.
Table 1. Feature-by-feature comparison of different ParVIs.

Methods | Accelerated Position Update | Dynamic Weight Adjustment | Dissimilarity and Empirical Approximation | Underlying Probability Space | Target Distribution Accessibility
SVGD [20] | ✗ | ✗ | KL-RKHS | Wasserstein | Score
BLOB [29] | ✗ | ✗ | KL-BLOB | Wasserstein | Score
KSDD [28] | ✗ | ✗ | KSD-KSDD | Wasserstein | Score
MMDF [30] | ✗ | ✗ | MMD-MMDF | Wasserstein | Samples
SDF [31] | ✗ | ✗ | SD-SDF | Wasserstein | Samples
ACCEL [32] | ✓ | ✗ | KL-GFSD | Wasserstein | Score
WNES, WAG [27] | ✓ | ✗ | General | Wasserstein | Score
AIG [33] | ✓ | ✗ | KL-GFSD | Information (General) | Score
DPVI [23] | ✗ | ✓ | General | WFR | Score
GAD-PVI (Ours) | ✓ | ✓ | General | IFR (General) | Both
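As a point of reference for the fixed-weight, non-accelerated rows of Table 1, the following is a minimal NumPy sketch of the standard SVGD update of Liu and Wang [20], which drives particles with the score of the target through an RBF kernel. The kernel choice, median-heuristic bandwidth, step size, and toy Gaussian target below are illustrative assumptions, not the configuration used in the paper's experiments.

```python
import numpy as np

def rbf_kernel(X, bandwidth=None):
    """RBF kernel matrix and its gradient, with a median-heuristic bandwidth."""
    diffs = X[:, None, :] - X[None, :, :]                  # (N, N, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)                 # (N, N)
    if bandwidth is None:
        bandwidth = np.sqrt(0.5 * np.median(sq_dists) / np.log(X.shape[0] + 1) + 1e-12)
    K = np.exp(-sq_dists / (2 * bandwidth ** 2))
    grad_K = -diffs / bandwidth ** 2 * K[:, :, None]       # grad_K[j, i] = d k(x_j, x_i) / d x_j
    return K, grad_K

def svgd_update(X, score_fn, step_size=1e-2):
    """One SVGD step: x_i += eps/N * sum_j [k(x_j, x_i) score(x_j) + grad_{x_j} k(x_j, x_i)]."""
    N = X.shape[0]
    K, grad_K = rbf_kernel(X)
    scores = score_fn(X)                                   # (N, d), score(x) = grad log p(x)
    phi = (K @ scores + grad_K.sum(axis=0)) / N
    return X + step_size * phi

# Illustrative usage: push poorly initialised particles toward a 2-D standard Gaussian,
# whose score is simply -x.
X = np.random.randn(100, 2) * 3 + 5
for _ in range(500):
    X = svgd_update(X, score_fn=lambda x: -x, step_size=0.1)
```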
Table 2. Averaged W₂ for the GP task with dataset LIDAR.

Algorithm | Empirical Strategy: BLOB | Empirical Strategy: GFSD
ParVI | 1.570 × 10⁻¹ ± 2.210 × 10⁻⁴ | 2.143 × 10⁻¹ ± 7.424 × 10⁻⁴
WAIG | 1.572 × 10⁻¹ ± 2.070 × 10⁻⁴ | 2.142 × 10⁻¹ ± 7.048 × 10⁻⁴
WNES | 1.571 × 10⁻¹ ± 3.011 × 10⁻⁴ | 2.138 × 10⁻¹ ± 7.771 × 10⁻⁴
DPVI-DK | 1.568 × 10⁻¹ ± 1.496 × 10⁻³ | 2.142 × 10⁻¹ ± 2.712 × 10⁻³
DPVI-CA | 1.285 × 10⁻¹ ± 2.960 × 10⁻⁴ | 1.638 × 10⁻¹ ± 4.332 × 10⁻⁴
WGAD-DK | 1.561 × 10⁻¹ ± 1.155 × 10⁻³ | 2.142 × 10⁻¹ ± 1.501 × 10⁻³
WGAD-CA | 1.274 × 10⁻¹ ± 2.964 × 10⁻⁴ | 1.626 × 10⁻¹ ± 4.842 × 10⁻⁴
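Tables 2 and 4 (and Figures 1, 3 and 4) report the 2-Wasserstein distance W₂ between the weighted empirical distribution of the particles and a reference sample of the target. The paper's own evaluation code is not reproduced here; the snippet below is a minimal sketch, assuming the POT library (`ot`) and an exact EMD solve, of how such a W₂ value could be computed for two weighted particle sets. The sample sizes and uniform reference weights are illustrative assumptions.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def w2_distance(X, a, Y, b):
    """2-Wasserstein distance between weighted particle sets (X, a) and (Y, b)."""
    M_cost = ot.dist(X, Y)        # pairwise squared Euclidean costs
    w2_sq = ot.emd2(a, b, M_cost) # exact optimal-transport cost for the squared metric
    return np.sqrt(w2_sq)

# Illustrative usage: particles vs. a larger reference sample of the target.
rng = np.random.default_rng(0)
particles = rng.normal(size=(128, 2))
weights = np.full(128, 1.0 / 128)        # dynamic-weight methods would pass their own weights
reference = rng.normal(size=(2000, 2))
ref_weights = np.full(2000, 1.0 / 2000)
print(w2_distance(particles, weights, reference, ref_weights))
```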
Table 3. Averaged test RMSE in the BNN task.

Algorithm | Concrete | kin8nm | RedWine | Space
ParVI-SVGD | 6.323 | 8.020 × 10⁻² | 6.330 × 10⁻¹ | 9.021 × 10⁻²
ParVI-BLOB | 6.313 | 7.891 × 10⁻² | 6.318 × 10⁻¹ | 8.943 × 10⁻²
WAIG-BLOB | 6.063 | 7.791 × 10⁻² | 6.267 × 10⁻¹ | 8.775 × 10⁻²
WNES-BLOB | 6.112 | 7.690 × 10⁻² | 6.264 × 10⁻¹ | 8.836 × 10⁻²
DPVI-DK-BLOB | 6.285 | 7.889 × 10⁻² | 6.294 × 10⁻¹ | 8.853 × 10⁻²
DPVI-CA-BLOB | 6.292 | 7.789 × 10⁻² | 6.298 × 10⁻¹ | 8.850 × 10⁻²
WGAD-DK-BLOB | 6.058 | 7.688 × 10⁻² | 6.267 × 10⁻¹ | 8.716 × 10⁻²
WGAD-CA-BLOB | 6.047 | 7.629 × 10⁻² | 6.263 × 10⁻¹ | 8.704 × 10⁻²
ParVI-GFSD | 6.314 | 7.891 × 10⁻² | 6.317 × 10⁻¹ | 8.943 × 10⁻²
WAIG-GFSD | 6.105 | 7.794 × 10⁻² | 6.265 × 10⁻¹ | 8.776 × 10⁻²
WNES-GFSD | 6.123 | 7.756 × 10⁻² | 6.263 × 10⁻¹ | 8.836 × 10⁻²
DPVI-DK-GFSD | 6.291 | 7.882 × 10⁻² | 6.277 × 10⁻¹ | 8.851 × 10⁻²
DPVI-CA-GFSD | 6.290 | 7.791 × 10⁻² | 6.298 × 10⁻¹ | 8.852 × 10⁻²
WGAD-DK-GFSD | 6.099 | 7.726 × 10⁻² | 6.265 × 10⁻¹ | 8.708 × 10⁻²
WGAD-CA-GFSD | 6.088 | 7.634 × 10⁻² | 6.260 × 10⁻¹ | 8.710 × 10⁻²
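For completeness, the test RMSE in Table 3 is the usual root-mean-squared prediction error over the test set; the weighted-particle predictive mean on the right-hand side below is a common convention for weighted BNN ensembles and is stated here only as an illustrative assumption about how particles θ_k with weights w_k would be used at test time.

```latex
\mathrm{RMSE}
  = \sqrt{\frac{1}{n_{\mathrm{test}}} \sum_{i=1}^{n_{\mathrm{test}}} \left( y_i - \hat{y}_i \right)^{2}},
\qquad
\hat{y}_i \approx \sum_{k=1}^{M} w_k \, f_{\theta_k}(x_i).
```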
Table 4. Averaged W₂ for the shape morphing task between different shapes (A = CAT, B = SPIRAL, C = HEART, D = CHICKEN) and averaged W₂ for the sketching task of the high-resolution cheetah picture.

Algorithm | A→B | B→C | C→D | D→A | Sketching
ParVI-MMDF | 3.019 × 10⁻² | 1.245 × 10⁻² | 1.396 × 10⁻² | 9.338 × 10⁻³ | 5.846 × 10⁻⁴
WAIG-MMDF | 2.332 × 10⁻² | 1.232 × 10⁻² | 1.205 × 10⁻² | 9.589 × 10⁻³ | 5.869 × 10⁻⁴
DPVI-DK-MMDF | 2.956 × 10⁻² | 1.243 × 10⁻² | 1.385 × 10⁻² | 9.411 × 10⁻³ | 5.754 × 10⁻⁴
DPVI-CA-MMDF | 2.039 × 10⁻² | 1.190 × 10⁻² | 1.316 × 10⁻² | 9.008 × 10⁻³ | 4.040 × 10⁻⁴
WGAD-DK-MMDF | 2.275 × 10⁻² | 1.225 × 10⁻² | 1.190 × 10⁻² | 9.126 × 10⁻³ | 5.721 × 10⁻⁴
WGAD-CA-MMDF | 1.642 × 10⁻² | 1.183 × 10⁻² | 1.101 × 10⁻² | 8.518 × 10⁻³ | 3.344 × 10⁻⁴
ParVI-SDF | 1.715 × 10⁻² | 6.561 × 10⁻³ | 6.111 × 10⁻³ | 5.325 × 10⁻³ | 1.389 × 10⁻⁵
WAIG-SDF | 1.631 × 10⁻² | 6.221 × 10⁻³ | 5.820 × 10⁻³ | 5.148 × 10⁻³ | 1.231 × 10⁻⁵
DPVI-DK-SDF | 1.434 × 10⁻² | 6.488 × 10⁻³ | 6.115 × 10⁻³ | 5.327 × 10⁻³ | 1.286 × 10⁻⁵
DPVI-CA-SDF | 1.485 × 10⁻² | 6.423 × 10⁻³ | 6.021 × 10⁻³ | 5.359 × 10⁻³ | 9.181 × 10⁻⁶
WGAD-DK-SDF | 1.431 × 10⁻² | 6.207 × 10⁻³ | 5.890 × 10⁻³ | 5.190 × 10⁻³ | 1.204 × 10⁻⁵
WGAD-CA-SDF | 1.394 × 10⁻² | 6.078 × 10⁻³ | 5.721 × 10⁻³ | 5.143 × 10⁻³ | 8.744 × 10⁻⁶
