Next Article in Journal
Synergy Makes Direct Perception Inefficient
Previous Article in Journal
Precise Error Performance of BPSK Modulated Coherent Terahertz Wireless LOS Links with Pointing Errors
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Variational Bayesian Approximation (VBA): Implementation and Comparison of Different Optimization Algorithms

by
Seyedeh Azadeh Fallah Mortezanejad
1 and
Ali Mohammad-Djafari
2,3,*
1
School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China
2
International Science Consulting and Training (ISCT), 91440 Bures sur Yvette, France
3
Shanfeng Company, Shaoxing 312352, China
*
Author to whom correspondence should be addressed.
Entropy 2024, 26(8), 707; https://doi.org/10.3390/e26080707 (registering DOI)
Submission received: 18 April 2024 / Revised: 27 July 2024 / Accepted: 14 August 2024 / Published: 20 August 2024
(This article belongs to the Special Issue Maximum Entropy and Bayesian Methods for Image and Spatial Analysis)

Abstract

:
In any Bayesian computations, the first step is to derive the joint distribution of all the unknown variables given the observed data. Then, we have to do the computations. There are four general methods for performing computations: Joint MAP optimization; Posterior expectation computations that require integration methods; Sampling-based methods, such as MCMC, slice sampling, nested sampling, etc., for generating samples and numerically computing expectations; and finally, Variational Bayesian Approximation (VBA). In this last method, which is the focus of this paper, the objective is to search for an approximation for the joint posterior with a simpler one that allows for analytical computations. The main tool in VBA is to use the Kullback–Leibler Divergence (KLD) as a criterion to obtain that approximation. Even if, theoretically, this can be conducted formally, for practical reasons, we consider the case where the joint distribution is in the exponential family, and so is its approximation. In this case, the KLD becomes a function of the usual parameters or the natural parameters of the exponential family, where the problem becomes parametric optimization. Thus, we compare four optimization algorithms: general alternate functional optimization; parametric gradient-based with the normal and natural parameters; and the natural gradient algorithm. We then study their relative performances on three examples to demonstrate the implementation of each algorithm and their efficiency performance.

1. Introduction

The Bayesian inference starts by defining the joint posterior distribution of all the unknowns in the models given the data. Then, to obtain a point estimation, there are two main solutions: Joint Maximum A Posterior (JMAP), and computation of the posterior expectations. JMAP requires optimization to determine the unknown parameters that maximize the joint posterior distribution. Computing posterior expectations requires integration and involves calculating the expected values of unknown parameters. These computations are possible analytically in the simple case of linear and Normal priors or for exponential families with conjugate priors. In all other cases, the computation of multivariate integration is too costly. Then, there are two main tools. First, sampling methods such as Markov Chain Monte Carlo (MCMC), slice sampling, nested sampling, etc., where we generate samples and then compute means, variances, covariances, etc. Second, analytic approximation methods, which first approximate the original posterior with a simpler one to facilitate analytical computations. The Variational Bayesian Approximation (VBA) is the main method of this kind, using the KLD divergence as the criterion to obtain such an approximation. However, even if theoretically, the functional optimization of KLD is possible, it still needs integral computations, which are possible if we have some conjugacy properties. In particular, when the original posterior expression is in the exponential families, the approximation expression is also in the exponential families thanks to the KLD properties. In this paper, we consider this case and accomplish the optimization task in various ways. The first approach involves directly applying alternate optimization to the KLD criterion. Other methods consist of expressing the KLD as a function of parameters and using gradient-based algorithms. As the exponential families can be represented using normal or natural parameters, we obtain different algorithms that we compare.
For a general Bayesian inference, the main computation tool is the exploration of the joint posterior law via the sampling methods, such as the MCMC, or other more recent techniques such as slice sampling [1] or nested sampling [2]. However, these methods become very costly for high dimensional problems in Computer Vision, Machine Learning, and Artificial Intelligence. Blei et al. (2017) [3] mentioned in their review article the alternative of MCMC with the support of variational inference. They found Bayesian inference fast and easy to scale up to large data sets. Variational inference has less complexity than MCMC in many examples, such as large-scale document analysis, computational neuroscience, and Computer Vision.
The VBA is constructed for a variety of models to approximate a posterior distribution with many variables whose aim is to minimize the computational costs while keeping controllable accuracy. The main reduction in computational cost is achieved by searching for a good and tractable approximation for the joint posterior. VBA has acceptable efficiency computational costs compared with sampling methods.
Parisi and Shankar [4] have been a pioneer in VBA since 1988, and after that MacKay and Neal [5,6] were forerunners in Bayesian Neural Network. Šmídl and Quinn [7] provided a theoretical background of VBA and its iterative algorithms applied in signal processing. Sarkka and Nummenmaa [8] used VBA on an autoregressive model with unknown variance, which is an application of VBA in signal processing. Zheng et al. [9] introduced a new version of VBA for image restoration concerning an ill-posed linear inverse problem applying all variation priors. Their method structure was on the memory transposition subspace algorithm for a probability density function space. Many other authors used this approach for different applications [7,10,11,12,13]. Renard et al. [14] worked on a Bayesian inference application for regional frequency analysis in a nonstationary context to handle intricate models getting physical reality and statistical necessities. Li and Shi [15] explained the matter of the model accuracy in the frameworks of wind power. Bayesian approaches have many advantages in the uncertainty and variability estimations, which are crucial in model prediction. They mentioned Bayesian methods being appropriate for various aspects of wind energy conversion systems, such as enhancing the precision and reliability of wind resource estimation and short-term predictions. Bayesian models have applications on enormous data of microarray technology, which are challenging in making inferences from such massive data sets. Yang et al. [16] introduced Bayesian methods as popular techniques for their benefits in microarray analysis.
In many applications, with indirect observations, such as using a thermometer sensor to measure the temperature (hidden variable), Bayesian inference starts with obtaining the expression of the joint distribution of all the unknown variables (hidden variables and parameters of the model) given the observed data. Then, we have to use it to do inference. In general, this expression is not separable in all its variables. So, the computations become hard and costly. For example, obtaining the marginals for each variable and computing the expectations are difficult and expensive. This problem becomes even more crucial in high-dimensional Computer Vision and Machine Learning and becomes an issue in inverse problems. We may then need to propose a surrogate expression with which we can do approximate computations.
The VBA is a technique that approximates the joint distribution p with an easier method; for example, a separable one q by minimizing K L ( q | p ) , which makes the marginal computations much easier. For example, in the case of two variables, p ( x , y ) is approximated by q ( x , y ) = q 1 ( x ) q 2 ( y ) via minimizing K L ( q 1 q 2 | p ) . When q is separable in all the variables of p, the approximation is also called Mean Field Approximation (MFA).
To obtain the approximate marginals q 1 and q 2 , we must minimize K L ( q 1 q 2 | p ) . A first standard and general algorithm is the alternate analytic optimization of K L ( q 1 q 2 | p ) with respect to q 1 , and then, to q 2 . Finding the expression of the functional derivatives of K L ( q 1 q 2 | p ) for q 1 and q 2 and equating them to zero alternatively, we obtain an iterative optimization algorithm. A second general approach is its optimization in the Riemannian Manifold, where we consider the case p is in the exponential family and so are q 1 and q 2 . For this case, K L ( q 1 q 2 | p ) becomes a function of the parameters θ of the exponential family. Then, we can use any other optimization algorithm to obtain those parameters.
In this paper, we review the VBA technique and then compare four different VBA optimization algorithms in three examples: standard alternate analytic optimization, a gradient-based algorithm for normal and natural parameter space [17,18] and a natural gradient algorithm [19,20,21] The paper aims to consider the first algorithm as our principal method and compare it with the three other algorithms. Building upon our initial proceedings paper [22], presented at the 41st International Workshop, on Bayesian inference and maximum entropy methods in 2022, this journal article further investigates more theoretical details, algorithms, and comparison study.
In this article, we consider the case of exponential families, so that p and q are both in exponential families. Then, we write the expression of the K L ( q | p ) and explore four different estimation algorithms for unknown parameters in a model that incorporates prior information. The first iterative algorithm handles directly K L ( · ) , alternatively optimizing it with respect to each marginal q i . This is the standard optimization used often in VBA. The function to be optimized is KLD. First, the gradient of KLD for all unknown parameters needs to be found. Then, we can begin with initial values for the parameters, either estimated from data or chosen deliberately. Then, we repeat the iterative algorithm until it converges to some points. If we denote the unknown normal parameter space with θ , then the iterative optimization algorithm can be written as: θ ˜ ( k + 1 ) = θ ˜ ( k ) γ K L ( θ ˜ ( k ) ) for gradient-based algorithm and the same with the natural parameters Λ ˜ ( k + 1 ) = Λ ˜ ( k ) γ K L ( Λ ˜ ( k ) ) with different values of γ . The natural gradient definition, mentioned by Amari [23], is:
˜ h = F 1 h ,
where F and h are the Fisher information matrix and objective function, respectively. In our case, h is the KLD. To make Fisher’s formula understandable, we change p ( x , y ) to p ( x , y | θ ) to explicitly show the parameters p ( x , y ) , and differentiate for θ . The Fisher information matrix of p ( x , y | θ ) is given by:
F = ln p ( x , y | θ ) ln p ( x , y | θ ) p ( x , y ) = H ln p ( x , y | θ ) p ( x , y ) ,
where H ln p ( x , y | θ ) and · p ( x , y ) are the Hessian matrix of ln p ( x , y | θ ) concerning θ and the expectation respect to p ( x , y ) distribution, respectively. An approximation of the Fisher information, introduced by Schraudolph [24], is called Empirical Fisher:
F ¯ = ln p ( x , y | θ ) ln p ( x , y | θ ) q ( x , y ) = 1 S X Y ( x , y ) S X Y ln p ( y | x , θ ) ln p ( y | x , θ ) .
The fundamental natural gradient iteration algorithm is:
θ ˜ ( k + 1 ) = θ ˜ ( k ) ρ k ˜ h ,
where ρ k is a learning-rate schedule. Martens [25] suggested an optimum update based on a particular second-order local approximation of h, ρ k = 1 .
We consider three examples: Normal-Inverse-Gamma, multivariate Normal, and linear inverse problems to assess the performance and convergence speed of the algorithms through multiple simulations. We propose the following organization of this paper: In Section 2, we present a brief explanation of the basic VBA analytical alternate optimization algorithm. In Section 3, we illustrate our first example related to Normal-Inverse-Gamma distribution analytically and, in practice, explain the outcomes of four algorithms to estimate the unknown parameters. In Section 4, we study a more complex example of a multivariate Normal distribution whose means and variance-covariance matrix are unknown and have a Normal-Inverse-Wishart distribution. This section aims to demonstrate marginal distributions of μ ˜ and Σ ˜ using a set of multivariate Normal observations using these mean and variance. In Section 5, the example is closest to realistic situations and is a linear inverse problem. We simulate the model with different dimensions to see the changes in the performance of the algorithms. In Section 6, we present our work summary in the article and compare the four recursive algorithms in three different examples.

2. Variational Bayesian Approach (VBA)

In this paper, our focus is on MFA as a subset of VBA. In MFA, the posterior distribution p ( x , y ) is approximated by separable q ( x , y ) :
q ( x , y ) = q 1 ( x ) q 2 ( y ) .
K L ( q | p ) [26] is an information measure of discrepancy between two probability functions and one of the most likely used information for divergence and separation measurement and disparity of two density functions [27]. One advantage of K L ( q | p ) is its effectiveness in solving distributionally robust optimization problems [28]. Despite its computational and theoretical benefits, K L ( q | p ) faces certain challenges. These include asymmetries, which complicate optimal model selection [27], and the analytical complexity of K L ( q | p ) when comparing normal mixture models, along with a lack of practical computational algorithms [29]. Let p ( x ) and q ( x ) be two density functions of a continuous random variable x for support set S X . K L ( q | p ) function is introduced as:
K L ( q | p ) = x S X q ( x ) ln q ( x ) p ( x ) x · .
The basic structure of VBA is to minimize K L ( q | p ) and find an estimation for p with the density factorial over all variables, consisting of state, hidden, and all unknown parameters. For instance, x and y represent the state and hidden variables, respectively. The unknown parameters refer to the parameters of the prior distributions of x and y.
For simplicity, we assume a bivariate case of distribution p ( x , y ) , and want to assess it via the alternative optimization, then we have:
K L ( q | p ) = H ( q 1 ) H ( q 2 ) ln p ( x , y ) q 1 q 2 ,
where:
H ( q 1 ) = x S X q 1 ( x ) ln q 1 ( x ) x · and H ( q 2 ) = y S Y q 2 ( y ) ln q 2 ( y ) y · ,
are, respectively, the Shannon entropies of x and of y, and:
ln p ( x , y ) q 1 q 2 = ( x , y ) S X Y q 1 ( x ) q 2 ( y ) ln p ( x , y ) x · y · .
H ( q 1 ) and H ( q 2 ) are fixed term, so the minimization is only on ln p ( x , y ) q 1 q 2 . Now, differentiating the Equation (5) with respect to q 1 , and then with respect to q 2 and equating them to zero, we obtain:
q 1 ( x ) exp ln p ( x , y ) q 2 ( y ) and q 2 ( y ) exp ln p ( x , y ) q 1 ( x ) .
These results can be easily extended to more dimensions [10]. They do not have any closed form, because they depend on the expression of p ( x , y ) and those of q 1 and q 2 . An interesting case is the case of exponential families and conjugate priors, where writing:
p ( x , y ) = p ( x | y ) p ( y ) ,   and p ( y | x ) = p ( x , y ) p ( x ) = p ( x | y ) p ( y ) p ( x ) ,
we can consider p ( y ) as prior, p ( x | y ) as the likelihood, and p ( y | x ) as the posterior distributions. We know that, if p ( x , y ) is in the exponential families, q 1 ( x ) and q 2 ( y ) are also in the exponential families, and q 1 ( x ) is conjugate to p ( y | x ) and q 2 ( y ) is conjugate to p ( x | y ) . To illustrate all these properties, we give details of these expressions for a first simple example of Normal-Inverse-Gamma p ( x , y ) = N ( x | μ , y ) I G ( y | α , β ) with q 1 ( x ) = N ( x | μ , v ) and q 2 = I G ( y | α , β ) . For this simple case, first we give the expression of K L ( q | p ) with q 1 ( x ) = N ( x | μ ˜ , v ˜ ) and q 2 ( y ) = I G ( y | α ˜ , β ˜ ) as a function of the parameters θ = ( μ ˜ , v ˜ , α ˜ , β ˜ ) and then the expressions of the four above-mentioned algorithms and we study their convergence.
For a numerical comparison, we start by generating n = 100 samples from p ( x , y ) = N ( x | 0 , y ) I G ( y | 3 , 1 ) , so we know the true parameters ( μ = 0 , α = 3 , β = 1 ) . Then, for different initializations of θ = ( μ ˜ , v ˜ , α ˜ , β ˜ ) , we run the four algorithms and compare the results. The final resulting margin, which is in this case the Student-t S ( x | μ , α , β ) = N ( x | μ , y ) I G ( y | α , β ) y · . We can then compare the true S ( x | μ , α , β ) with the obtained S ( x | μ ^ , α ^ , β ^ ) as well as with q ( x , y | μ ^ , v ^ , α ^ , β ^ ) = N ( x | μ ^ , v ^ ) I G ( y | α ^ , β ^ ) . As we see in the proceeding of the paper, the initialization is very important. When we have samples of ( x , y ) , we can try to initialize the parameters by their empirical values obtained by simple classical methods such as the method of moments. Another issue is the stopping criteria, which can be based either on the KLD criterion during successive iterations or any distances between the parameters at successive iterations.

3. Normal-Inverse-Gamma Distribution Example

The purpose of this section is to explain in detail the process of performing calculations in the alternative analytic algorithm. For this, we consider a simple case Normal-Inverse-Gamma Distribution for which, we have all the necessary expressions. This distribution has many applications in fields, like finance, econometrics, engineering, and Machine Learning. The objective here is to compare the four different algorithms mentioned earlier. The practical problem considered here is the following:
Suppose the data as z = z 1 , . . . z N and model it by Z = X + E , where E N ( 0 , v ) and so Z | X N ( X , v ) . Thus, we have X N ( μ , Y ) and Y I G ( α , β ) , then:
p ( x , y | z ) = p ( z | x , y ) p ( x , y ) p ( z ) = p ( z | x ) p ( x | y ) p ( y ) p ( z ) .
Putting all the elements, we see that p ( x , y | z ) is a N I G model and not separable in x and y. We like to approximate it with a separable one q ( x , y ) = q 1 ( x ) q 2 ( y ) . However, in this simple case, assume we have a sensor, which delivers a physical quantity X, N times, x = { x 1 , x 2 , , x N } . We want to model these data. In the first step, we model it as N ( x | μ , v ) with fixed μ and v. Then, it is easy to estimate the parameters ( μ , v ) either by Maximum Likelihood or Bayesian strategy. If we assume that the model is Normal with unknown variance and call this variance y and assign an I G prior to it, then we have a model N I G for p ( x , y ) . The N I G priors have been applied to the wavelet context with correlated structures because they are conjugated with Normal priors [30]. We choose Normal-Inverse-Gamma distribution because of this conjugated property and ease of handling.
We showed that the margins are S t and I G . Working directly with S t is difficult. So, we want to approximate it with a Normal q 1 ( x ) . This is equivalent to approximating p ( x , y ) with q 1 ( x ) q 2 ( y ) . Now, we want to find the parameters μ , v, α , and β , which minimize K L ( q 1 q 2 | p ) . This process is called MFVBA. Then, we compare four algorithms to obtain the parameters, which minimize K L ( q 1 q 2 | p ) . K L ( q 1 q 2 | p ) is convex with respect to q 1 if q 2 is fixed and is convex with respect to q 2 if q 1 is fixed. So, we hope that the iterative algorithm converges. However, K L ( q 1 q 2 | p ) may not be convex in the space of parameters. So, we have to study the shape of this criterion concerning the parameters μ ˜ , v ˜ , α ˜ , and β ˜ .
We want to find p ( x ) . For this process, we assume a simple Normal model, but with unknown variance y. So that, the forward model can be written as p ( x , y ) = N ( x | μ , y ) I G ( y | α , β ) . In this simple example, we know that p ( x ) is a Student-t distribution obtained by:
S ( x | μ , α , β ) = N ( x | μ , y ) I G ( y | α , β ) y · .
We approximate three parameters θ = ( μ , α , β ) from the data x and find an approximated distribution q ( x ) for p ( x ) .
The main idea is to find such q 1 ( x ) q 2 ( y ) as an approximation of p ( x , y ) . Here, we show the standard alternative analytic optimization, step by step. For this, we start by choosing the conjugate families q 1 ( x ) = N ( x | μ ˜ , v ˜ ) and q 2 ( y ) = I G ( y | α ˜ , β ˜ ) . Note that x p = μ ˜ and x q 1 = μ ˜ .
In the first step, we have to calculate ln p ( x , y ) mentioned earlier:
ln p ( x , y ) = c 1 2 ln y 1 2 y ( x μ ˜ ) 2 ( α ˜ + 1 2 ) ln y β ˜ y ,
where c is a constant value term independent of x and y. First of all, to use the iterative algorithm given in (6), starting by q 1 = N ( x | μ ˜ , v ˜ ) we have to find q 2 ( y ) , so we start by finding q 2 ( y ) . The integration of ln p ( x , y ) is with respect to q 1 ( x ) :
ln p ( x , y ) q 1 = c 1 2 y ( x μ ˜ ) 2 q 1 ( α ˜ + 1 ) ln y β ˜ y .
Since ( x μ ˜ ) 2 q 1 = v ˜ + ( μ ˜ μ ˜ ) 2 :
q 2 ( y ) exp ( α ˜ + 1 ) ln y ( v ˜ + ( μ ˜ μ ˜ ) 2 2 + β ˜ ) 1 y .
Thus, the function q 2 ( y ) is equivalent to an inverse gamma distribution I G ( α ˜ , v ˜ + ( μ ˜ μ ˜ ) 2 2 + β ˜ ) . To do so, we only use q 1 distribution. We have to take integral of ln p ( x , y ) over q 2 to find q 1 :
ln p ( x , y ) q 2 = c ( α ˜ + 1 ) ln y q 2 ( β ˜ + 1 2 ( x μ ˜ ) 2 ) 1 y q 2 .
Note that the first term does not depend on x and 1 y q 2 = 2 α ˜ 2 β ˜ + v ˜ + ( μ ˜ μ ˜ ) 2 , so:
q 1 ( x ) exp 2 α ˜ 2 β ˜ + v ˜ + ( μ ˜ μ ˜ ) 2 ( β ˜ + 1 2 ( x μ ˜ ) 2 ) exp ( x μ ˜ ) 2 2 2 β ˜ + v ˜ + ( μ ˜ μ ˜ ) 2 2 α ˜ .
We see that q 1 is again a Normal distribution but with updated parameters N ( μ ˜ , 2 β ˜ + v ˜ + ( μ ˜ μ ˜ ) 2 2 α ˜ ) , so v ˜ = 2 β ˜ + v ˜ + ( μ ˜ μ ˜ ) 2 2 α ˜ . Note that, we obtained the conjugal property: If p ( x | y ) = N ( x | μ , y ) and p ( y ) = I G ( y | α , β ) , then p ( y | x ) = I G ( y | α , β ) where μ , α and β are μ = μ , α = α , β = β + 2 β + v + ( μ μ ) 2 2 α . In this case, we also know that p ( x | α , β ) = S t ( x | μ , α , β ) .
In these calculations, we first calculated the distribution of q 2 and then q 1 . If we do the opposite and first obtain q 1 and then q 2 , the parameter v is eliminated in the iterative calculations and there is no recursive relationship between β and v, and it is only necessary to calculate the value of parameter β .
In standard alternate optimization, there is no need for an iterative process for μ ˜ and α ˜ , which are approximated by μ ˜ = μ 0 and α ˜ = α 0 , respectively. The situation for β ˜ and v ˜ is different because there are circular dependencies among them. So, the approximation needs an iterative process, starting from μ ˜ ( 1 ) = μ 0 , v ˜ ( 1 ) = v 0 , α ( 1 ) = α 0 , and β ( 1 ) = β 0 .
1. Standard alternate optimization algorithm:
α ˜ ( k + 1 ) = α ˜ ( k ) , β ˜ ( k + 1 ) = β ˜ ( k ) + v ˜ ( k ) + ( μ ˜ μ ˜ ( k ) ) 2 2 , μ ˜ ( k + 1 ) = μ ˜ ( k ) , v ˜ ( k + 1 ) = 2 β ˜ ( k + 1 ) + v ˜ ( k ) + ( μ ˜ μ ˜ ( k + 1 ) ) 2 2 α ˜ ( k + 1 ) .
This algorithm converges to v ˜ = ( 2 β ˜ + v ˜ + ( μ ˜ μ ˜ ) 2 ) / ( 2 α ˜ ) , which gives v ˜ = ( 2 β ˜ + ( μ ˜ μ ˜ ) 2 ) / ( 2 α ˜ 1 ) and β ˜ = 0 , so v ˜ = 0 that means very strange convergence. That is why this alternate algorithm can not work in this case.
For other algorithms based on normal parameters, such as gradient-based and natural gradient algorithms, it is necessary to find the expression of K L ( q 1 q 2 | p ) as a function of the normal parameters θ ˜ = ( α ˜ , β ˜ , μ ˜ , v ˜ ) :
K L ( θ ˜ ) 1 2 ln v ˜ + 1 2 ( ln β ˜ ψ 0 ( α ˜ ) ) + α ˜ ( v ˜ + ( μ ˜ μ ˜ ) 2 ) 2 β ˜ .
Then, we also need the gradient expression of K L ( θ ˜ ) for θ ˜ :
K L ( θ ˜ ) = ( v ˜ + ( μ ˜ μ ˜ ) 2 β ˜ ψ 1 ( α ˜ ) 2 β ˜ , α ˜ ( v ˜ + ( μ ˜ μ ˜ ) 2 ) + β ˜ 2 β ˜ 2 , α ˜ ( μ ˜ μ ˜ ) β ˜ , α ˜ 2 β ˜ 1 2 v ˜ .
The details are available in Appendix A.
2. The gradient-based algorithm with normal parameters:
α ˜ ( k + 1 ) = α ˜ ( k ) γ v ˜ ( k ) + ( μ ˜ μ ˜ ( k ) ) 2 β ˜ ( k ) ψ 1 ( α ˜ ( k ) ) 2 β ˜ ( k ) , β ˜ ( k + 1 ) = β ˜ ( k ) γ α ˜ ( k + 1 ) ( v ˜ ( k ) + ( μ ˜ μ ˜ ( k ) ) 2 ) + β ˜ ( k ) 2 [ β ˜ ( k ) ] 2 , μ ˜ ( k + 1 ) = μ ˜ ( k ) γ α ˜ ( k + 1 ) ( μ ˜ ( k ) μ ˜ ) β ˜ ( k + 1 ) , v ˜ ( k + 1 ) = v ˜ ( k + 1 ) γ α ˜ ( k + 1 ) 2 β ˜ ( k + 1 ) 1 2 v ˜ ( k ) ,
where γ is a fixed value for the gradient algorithm. We propose two values for γ be equal to 1 and 1 K L ( θ ˜ ) . We need to calculate the Fisher information for the third algorithm based on the normal parameters. Then, the natural gradient based on K L ( θ ˜ ) is calculated as follows:
˜ K L ( θ ˜ ) = 1 2 , 1 2 ( v ˜ + ( μ ˜ μ ˜ ) 2 ) , α ˜ v ˜ ( μ ˜ μ ˜ ) β ˜ , v ˜ ( β ˜ α ˜ v ˜ ) β ˜ .
The corresponding algorithm is in the following:
3. The natural gradient algorithm:
α ˜ ( k + 1 ) = α ˜ ( k ) , β ˜ ( k + 1 ) = β ˜ ( k ) + 1 2 ( v ˜ ( k ) + ( μ ˜ μ ˜ ( k ) ) 2 ) , μ ˜ ( k + 1 ) = μ ˜ ( k ) + α ˜ ( k + 1 ) v ˜ ( k ) ( μ ˜ μ ˜ ( k ) ) β ˜ ( k + 1 ) , v ˜ ( k + 1 ) = v ˜ ( k ) + v ˜ ( k ) ( β ˜ ( k + 1 ) α ˜ ( k + 1 ) v ˜ ( k ) ) β ˜ ( k + 1 ) .
We consider another sub-algorithm for gradient-based optimization with natural parameters, explained in detail in Appendix A. The algorithm for the last two components produces different results.
We generate n = 100 samples from the model p ( x , y ) = N ( x | 1 , y ) I G ( y | 3 , 1 ) for the numerical computations. Thus, we know the exact values of the unknown parameters, just keeping in mind, not used in algorithms. The estimated parameters are in Table 1 using the alternative, gradient-based with γ = 1 , 1 K L ( θ ˜ ) , natural gradient algorithms and gradient considering natural parameters along with their contour and surface plots in Figure 1 and Figure 2, respectively.
All four algorithms attempt to minimize the same criterion. So, the objectives are consistent, but the number of steps in the recursive process may vary. The requirements must meet the minimum K L ( · ) . In this simple example of the Normal-Inverse-Gamma distribution with the model p ( x , y ) = N ( x | 1 , y ) I G ( y | 3 , 1 ) , the convergence step numbers of the alternative, gradient-based with γ = 1 ,   K L ( · ) , gradient with natural parameters, and natural gradient algorithms are 1, 10, 8, 2, and 9 using Maximum Likelihood Estimation (MLE) initializations. One crucial point is that the process should not be repeated excessively because we need to find the local minimum of K L ( · ) . In this model, the VBA and gradient with respect to the natural parameters are the fastest, and the VBA provides the best approximation for the model depicted in Figure 2.
We simulate more models in Table 1 with the same joint distribution of the Normal-Inverse-Gamma along with different parameters. We draw all final results in Figure 2 and see their visual appearances. We start the recursive processes with two primary value groups for the unknown variables containing the MLE and the desired with no evidence. The selection for μ and v is quite simple because we have its data and can plot the histogram and approximately guess the correct values. The situation for α and β is different and a bit problematic to surmise perfectly. If we assume any outliers for α and β , the algorithm finds a minimization for K L ( · ) in the first stage of the iteration or another local minimum far from the truth. The more the iteration loop is executed, the KLD value increases or decreases in any case. Thus, the initializations are crucial to obtain the closest results for the unknown parameters. For instance, in N ( x | 1 , y ) I G ( y | 4 , 6 ) with elementary points μ = 1 , v = 3.5 , α = 3 , and β = 3 in the natural gradient analytic algorithm, the repetitive result is the same as the initial points. So, these initializations are not suitable for the available data.
An algorithm with the lowest KLD value provides a more accurate estimate of the model parameters. As we discussed earlier, for the model N ( x | 0 , y ) I G ( y | 3 , 1 ) , the first, fourth, and fifth algorithms estimate the parameters well, as shown in Figure 2, and Table 1. Remember, we use two values for γ in the gradient-based algorithm with normal parameters, so the total estimated values are five for each unknown parameter. The first three algorithms approximate the parameters of the model N ( x | 1 , y ) I G ( y | 4 , 6 ) well, but the issue lies in fitting the center, resulting in a shifted approximation. All methods perform well in the models N ( x | 1 , y ) I G ( y | 7 , 10 ) , N ( x | 2 , y ) I G ( y | 6 , 10 ) , and N ( x | 2 , y ) I G ( y | 10 , 11 ) . The optimal options for N ( x | 1 , y ) I G ( y | 7 , 10 ) include VBA, gradient with γ = 1 , and gradient with respect to natural parameters. Considering the model N ( x | 2 , y ) I G ( y | 6 , 10 ) , the best approach is to use gradient algorithms. The best option for the final model is the alternative approximation.
To gain an overview of the K L ( · ) process, we plot its trend with respect to the number of iterations in Figure 3. The left column shows the results for the MLE initializations, where the minimization of K L ( · ) occurs more quickly compared to the right column, which is associated with the evidence-free initializations. We do not plot the trend of the natural gradient’s K L ( · ) in the second and last two models. The reason is that the K L ( · ) becomes NaN in the initial iterations, so it is necessary to start the recursive process with some points estimated from the available data. The algorithms approximate the joint density function with a separable one but with different accuracy. In the following section, we tackle a more complex model.

4. Multivariate Normal-Inverse-Wishart Example

In the previous section, we explain how to perform VBA optimization methods to approximate a complicated joint distribution function by a tractable factorial of margins over a simple case study. In this section, a multivariate Normal case p ( x ) = N ( x | μ ˜ , Σ ˜ ) is considered, which is approximated by q ( x ) = i N ( x i | μ ˜ i , ν ˜ i ) for different shapes for the covariance matrix Σ ˜ .
We assume that the basic structure of an available data set is multivariate Normal with the unknown mean vector μ ˜ and variance-covariance matrix Σ ˜ . Their joint prior distribution is a Normal-Inverse-Wishart distribution of N I W ( μ ˜ , Σ ˜ | μ ˜ 0 , κ ˜ , Ψ ˜ , ν ˜ ) defined by N ( μ ˜ | μ ˜ 0 , 1 κ ˜ Σ ˜ ) I W ( Σ ˜ | Ψ ˜ , ν ˜ ) , which is the generalized form of classical N I G . One of its applications is in image segmentation tasks. The posteriors are multivariate Normal for the mean vector and Inverse Wishart for the variance-covariance matrix. Inverse Wishart distribution has many properties on the density function of unknown variances, and so there is a variety of references. For example, Bouriga and Féron [31] worked on covariance matrix estimation using Inverse Wishart distribution. In this regard, they inquired about posterior properties and Bayesian risk. Also, they applied Daniels and Kass prior [32] to the hyper-parameters of Σ .
Before approximating the expression of p ( x , μ ˜ , Σ ˜ ) , let us define some notations that we are using here:
x = x 1 x p , μ ˜ = μ ˜ 1 μ ˜ p ,   and Σ ˜ = Σ ˜ 11 Σ ˜ 1 p Σ ˜ 21 Σ ˜ 2 p Σ ˜ p 1 Σ ˜ p p .
Since the Normal-Inverse-Wishart distribution is a conjugate prior distribution for multivariate Normal, the posterior distribution of μ ˜ and Σ ˜ again belongs to the same family, and their corresponding margins are:
M N κ ˜ μ ˜ 0 + n x ¯ κ ˜ + n , 1 κ ˜ + n Λ ˜ , I W Λ ˜ + Ψ ˜ + i = 1 n ( x i x ¯ ) ( x i x ¯ ) , ν ˜ + n ,
where n and x are the sample size and observations of x , respectively. The proof is shown in Appendix B.
For other algorithms, we define Θ ˜ = ( κ ˜ , μ ˜ 0 , Λ ˜ , ν ˜ , Ψ ˜ ) for calculation of K L ( Θ ˜ ) . After some tedious computation available in Appendix C, the K L ( Θ ˜ ) is equivalence by:
K L ( Θ ˜ ) p 1 2 ln κ ˜ + 1 2 ln Ψ ˜ Λ ˜ 1 + ν ˜ 2 Tr Ψ ˜ 1 Λ ˜ + I p ν ˜ 2 1 2 i = 1 p ψ 0 ( ν ˜ p + i 2 ) ,
and the corresponding gradient-based and natural gradient algorithms are determined similar to Section 3 based on K L ( Θ ˜ ) , also available in Appendix C.
To present the performance of the four algorithms, we work on a data set coming from x N I W ( x | μ , Σ ) whose parameters have the below density structure:
μ M N μ | 2 1 , 1 2 3 1 1 1 , Σ I W Σ | 3 1 1 1 , 6 .
We use only the data of x in the estimation processes. The results of algorithms are in Table 2 and drowned in Figure 4 along with its true contour plot of the model. In this figure, we have two different rows of results. In the first row, we use the MLE estimation of the mean and covariance parameters as our pre-information to start the algorithms and find the best local minimum for the K L ( · ) . In this regard, we put some gusted value for κ and ν . At the bottom of the figure, we use some uninformative initializations. This choice is crucial because the evidence-free primitive starter may end in some local optimization, which is far from the part we need. Thus, using MLE prevents this problem to a great extent. Although the result for the alternative optimization is good using evidence-free elementary points in Figure 4, it does not always work this way and depends on luck. Other gradient-based and natural gradient algorithms do not work well in this example. One reason can be the more dimensions, the worse the results.
In Table 2, we have more models with a variety of dimensions 2, 3, and 5 applying two initialization groups, MLE and uninformative starters. In all cases, the standard alternative algorithm has the closest estimation to the real model in the case of K L ( · ) minimization. The larger the K L ( · ) , the greater the distance between the true and estimated posterior distributions. Therefore, all gradient-based and natural gradient algorithms are not worthy practices to extract the posterior distribution with more dimensions in the data, while our alternative acts better in this situation. In Table 3, we provide a Normal-Inverse-Wishart model with 10 dimensions. In comparison, the alternative method approximates the distribution with so many parameters more precisely than the other methods, which seems easy to use with any large-scale dimension dataset, and the result is suitable enough.

5. Simple Linear Inverse Problem

The third example is the case of linear inverse problems g = H f + ϵ with priors p ( ϵ ) = N ( ϵ | 0 , v ϵ I ) and p ( f ) = N ( f | 0 , diag v ) , where f = [ f 1 , f 2 , , f N ] and v = [ v 1 , v 2 , , v N ] for which we have p ( f , v | g ) p ( g | f , v ϵ ) p ( f | v ) p ( v ) with p ( g | f , v ϵ ) = N ( g | H f , v ϵ I ) , p ( f | v ) = N ( f | 0 , diag v ) and p ( v | α , β ) = j I G ( v j | α , β ) , [33].
In the inverse problem, we have the same multivariate case, except in Section 3 that this time, the sensor does not give directly f , but g , such that g = H f + ϵ where ϵ is again a Normal with known or unknown variance. In the first step, we assume that the variance v ϵ and H are known. H is called the transfer function of the sensor. So, this time we have p ( g , f , v ) = N ( g | f ) N ( f | v ) I G ( v ) and we approximate it by q 1 ( f ˜ | g , v ˜ ) q 2 ( v ˜ ) . The main reason of this choice for q 1 ( f ˜ | g , v ˜ ) is that, when g and v ˜ are known, this becomes a Normal q 1 ( f ˜ | g , v ˜ ) = N ( f ˜ | μ ˜ , Σ ˜ ) where the expressions of μ ˜ and Σ ˜ are given in the paper. Now, we have a multivariate N I G problem. Thus, the joint distribution of g , f , and v is estimated by VBA as the following relation:
p ( f , v | g ) p ( f | g , v ) p ( v ) .
Although we can mathematically compute the margins in this particular example, we desire to approximate them via the iterative alternative algorithm q ( g , f ˜ , v ˜ ) = q 1 ( g | f ˜ ) q 2 ( f ˜ ) q 3 ( v ˜ ) compared them with gradient and natural gradient-based algorithms. The objective function is the estimation of q 2 ( f ˜ ) , but in the recursive process, q 1 ( g | f ˜ ) and q 3 ( v ˜ ) are updated, too. For simplicity, we suppose that the transposition matrix H is an identical matrix I . The final outputs are as follows, with the details and algorithm in Appendix D:
f ˜ M N μ ˜ f ˜ 1 + 2 v ϵ ˜ α ˜ n ( v ˜ f ˜ + μ ˜ f ˜ 2 + 2 β ˜ ) , diag v ϵ ˜ ( v ˜ f ˜ + μ ˜ f ˜ 2 + 2 β ˜ ) n ( v ˜ f ˜ + μ ˜ f ˜ 2 ) + 2 n β ˜ + 2 v ϵ ˜ α ˜ , g N ( μ ˜ f ˜ , v ϵ ˜ I ) , and v ˜ k I G α ˜ k , v ˜ f ˜ k + μ ˜ f ˜ k 2 2 + β ˜ k , k = 1 , , p .
The corresponding K L ( θ ˜ ) for θ ˜ = ( μ ˜ g , v ϵ ˜ , v ˜ f ˜ , α ˜ , β ˜ ) is below and the details are available in Appendix E along with its gradient and other algorithms:
K L ( θ ˜ ) 1 2 j = 1 p ln ( v ˜ f ˜ j ) + n 2 v ϵ ˜ j = 1 p ( μ ˜ g j 2 + v ˜ f ˜ j ) + j = 1 p α ˜ j v ˜ f ˜ j β ˜ j 1 2 j = 1 p ψ 0 ( α ˜ j ) + 1 2 j = 1 p ln β ˜ j .
We choose a model to see the performance of these margins and compare them with gradient-based considering normal and natural parameters and natural gradient algorithms. The selected model is g = H f + ϵ with the following knowledge:
H = I , f M N ( f | 0 , diag v 1 , v 2 ) , v 1 I G ( v 1 | 3 , 2 ) v 2 I G ( v 2 | 4 , 3 ) ϵ M N ( ϵ | 0 , I ) .
In the assessment procedure, we do not apply the above information. The outputs of algorithms are shown in Figure 5, as well as the actual contour plot. The K L ( · ) for the four algorithms, starting from 108.91 , are 4.93 , 108.91 , 19.57 , 108.91 , and 108.91 , respectively. In this example, the best diagnosis is from the alternative algorithm with the minimum of K L ( · ) .
The objective of the inverse problem is to approximate the distribution of f . We use the distribution of g in Figure 5 to show the number of method accuracies. In Figure 5, we can observe improved approximations for the second and last two plots. In these algorithms, the best choice is the initialization, and the distribution has a higher K L ( · ) by reputation. Here, are the conjectures of the standard alternative, gradient-based with γ = 1 , K L ( θ ˜ ) 1 and natural parameters, and natural gradient algorithms, respectively:
μ f ˜ = 0.036 0.118 , Σ ˜ f ˜ = 0.023 0 0 0.022 , μ f ˜ = 0.038 0.129 , Σ ˜ f ˜ = 2.459 0 0 2.178 , μ f ˜ = 0 0 , Σ ˜ f ˜ = 0.891 0 0 0.523 ,
μ f ˜ = 0.04 0.13 , Σ ˜ f ˜ = 2.459 0 0 2.178 , μ f ˜ = 0.04 0.13 , Σ ˜ f ˜ = 2.459 0 0 2.178 .
We simulate additional g = H f + ϵ models with varying dimensions in Table 4. Based on the results, the optimal posterior is obtained from the standard alternative algorithm when we increase the dimensions, as it exhibits lower values of K L ( · ) . In the inverse problem, the primary focus is on observing f , so another way to compare could be the accuracy of v f . When it comes to the variance-covariance matrix of f , the most effective algorithm is parametric natural gradient optimization for approximating the posterior distribution of f because it yields results that closely align with the data of f . We would like to remind you that we do not use f data in the recursive processes; we only have g data. It is applicable to real-world situations, but since this is a simulated example, we have the f data, but we do not utilize it. In the example above, we observe that the alternative method yields a lower K L ( · ) in the two-dimensional dataset, too. As the number of dimensions increases, so does the value of K L ( · ) . However, the accuracy of the variance-covariance of f is not reduced.

6. Conclusions

This paper presents four approximation methods for estimating the density functions of hidden variables, referred to as VBA. We also consider three examples of the Normal-Inverse-Gamma, Normal-Inverse-Wishart, and linear inverse problems. We provide the details of the first model here and include the details for two other examples in the appendices. In all three models, the parameters are unobserved and estimated using recursive algorithms. We attempt to approximate the joint complex distribution by simplifying the margin factorials to resemble independent cases. We compare the performance and accuracy of VBA algorithms. The standard alternative analytic optimization algorithm demonstrates the highest robustness in minimizing KLD, particularly as the number of dimensions increases, and converges fairly quickly. The numerical computation cost is negligible in our examples because we find the explicit form for each parameter, and the algorithms are well-formulated. The main difference in algorithms lies in the accuracy of the results. They estimate the complex joint distribution using separable ones. In the linear inverse problem, the standard alternative algorithm yields lower K L ( · ) values, while the parametric gradient algorithm with γ = K L ( · ) 1 produces variance-covariance matrices that are closest to the real ones. The overall performance of the alternative is most satisfactory, especially in high dimensions.

Author Contributions

Conceptualization, S.A.F.M. and A.M.-D.; Methodology, S.A.F.M. and A.M.-D.; Software, S.A.F.M.; Validation, S.A.F.M. and A.M.-D.; Formal analysis, S.A.F.M. and A.M.-D.; Investigation, S.A.F.M. and A.M.-D.; Resources, S.A.F.M. and A.M.-D.; Data curation, S.A.F.M. and A.M.-D.; Writing—original draft, S.A.F.M.; Writing—review & editing, S.A.F.M. and A.M.-D.; Visualization, S.A.F.M. and A.M.-D.; Supervision, A.M.-D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Author Ali Mohammad-Djafari was employed by the Shanfeng Company. The remaining author declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Computation of KL( θ ˜ ) and its Gradient for Normal-Inverse-Gamma

The mathematical computations of K L ( θ ˜ ) in Section 3 is based on (5) as follows:
K L ( q 1 q 2 | p ) = H ( q 1 ) H ( q 2 ) q 1 ( x ) q 2 ( y ) ln 1 2 π y exp { ( x μ ˜ ) 2 2 y } β ˜ α ˜ Γ ( α ˜ ) y ( α ˜ + 1 ) exp { β ˜ y } d x d y = H ( q 1 ) H ( q 2 ) q 1 ( x ) q 2 ( y ) 1 2 ln ( 2 π ) ( x μ ˜ ) 2 2 y + α ˜ ln β ˜ ln Γ ( α ˜ ) ( α ˜ + 3 2 ) ln y β ˜ y d x d y .
Since, H ( q 1 ) and H ( q 2 ) are Shannon entropy, we have:
H ( q 1 ) = q 1 ( x ) ln 1 2 π v ˜ exp { ( x μ ˜ ) 2 2 v ˜ } d x = 1 2 ln ( 2 π v ˜ ) 1 2 ,
and
H ( q 2 ) = q 2 ( y ) ln β ˜ α ˜ Γ ( α ˜ ) y ( α ˜ + 1 ) exp { β ˜ y } d y = α ˜ ln ( β ˜ Γ ( α ˜ ) ) + ( α ˜ + 1 ) ψ 0 ( α ˜ ) .
Also, we need to know ln p ( x , y ) q 1 q 2 as well:
ln p ( x , y ) q 1 q 2 = q 1 ( x ) q 2 ( y ) 1 2 ln ( 2 π ) ( x μ ˜ ) 2 2 y + α ˜ ln β ˜ ln Γ ( α ˜ ) ( α ˜ + 3 2 ) ln y β ˜ y d x d y = 1 2 ln ( 2 π ) α ˜ ln β ˜ + ln Γ ( α ˜ ) + ( α ˜ + 3 2 ) ( ln β ˜ ψ 0 ( α ˜ ) ) + ( v ˜ + ( μ ˜ μ ˜ ) 2 2 + β ˜ ) α ˜ β ˜ .
Thus, the desire function K L ( θ ˜ ) is:
K L ( θ ˜ ) ln Γ ( α ˜ ) 1 2 ln v ˜ ln ( β ˜ Γ ( α ˜ ) ) + ( α ˜ + 3 2 ) ( ln β ˜ ψ 0 ( α ˜ ) ) α ˜ ln β ˜ + ( α ˜ + 1 ) ψ 0 ( α ˜ ) + α ˜ ( v ˜ + ( μ ˜ μ ˜ ) 2 ) 2 β ˜ 1 2 ln v ˜ + 1 2 ( ln β ˜ ψ 0 ( α ˜ ) ) + α ˜ ( v ˜ + ( μ ˜ μ ˜ ) 2 ) 2 β ˜ ,
where, ψ 0 ( · ) is the polygamma function of order 0, or called digamma function. The gradient expression concerning θ ˜ = ( α ˜ , β ˜ , μ ˜ , v ˜ ) is:
K L ( θ ˜ ) = v ˜ + ( μ ˜ μ ˜ ) 2 β ˜ ψ 1 ( α ˜ ) 2 β ˜ , α ˜ ( v ˜ + ( μ ˜ μ ˜ ) 2 ) + β ˜ 2 β ˜ 2 , α ˜ ( μ ˜ μ ˜ ) β ˜ , α ˜ 2 β ˜ 1 2 v ˜ ,
where, ψ 1 ( · ) is the polygamma function of order 1. We substitute (A6) in the gradient-based formulas, so the algorithm is as follows:
2. The gradient-based algorithm with normal parameters:
α ˜ ( k + 1 ) = α ˜ ( k ) + γ K L α ˜ ( α ˜ ( k ) , β ˜ ( k ) , μ ˜ ( k ) , v ˜ ( k ) ) = α ˜ ( k ) γ v ˜ ( k ) + ( μ ˜ μ ˜ ( k ) ) 2 β ˜ ( k ) ψ 1 ( α ˜ ( k ) ) 2 β ˜ ( k ) , β ˜ ( k + 1 ) = β ˜ ( k ) γ K L β ˜ ( α ˜ ( k + 1 ) , β ˜ ( k ) , μ ˜ ( k ) , v ˜ ( k ) ) = β ˜ ( k ) γ α ˜ ( k + 1 ) ( v ˜ ( k ) + ( μ ˜ μ ˜ ( k ) ) 2 ) + β ˜ ( k ) 2 [ β ˜ ( k ) ] 2 , μ ˜ ( k + 1 ) = μ ˜ ( k ) γ K L μ ˜ ( α ˜ ( k + 1 ) , β ˜ ( k + 1 ) , μ ˜ ( k ) , v ˜ ( k ) ) = μ ˜ ( k ) γ α ˜ ( k + 1 ) ( μ ˜ ( k ) μ ˜ ) β ˜ ( k + 1 ) , v ˜ ( k + 1 ) = v ˜ ( k ) γ K L v ˜ ( α ˜ ( k + 1 ) , β ˜ ( k + 1 ) , μ ˜ ( k + 1 ) , v ˜ ( k ) ) = v ˜ ( k + 1 ) γ α ˜ ( k + 1 ) 2 β ˜ ( k + 1 ) 1 2 v ˜ ( k ) ,
where γ is a fix value. For the natural gradient algorithm, we need to calculate the gradient of ln p ( x , y | θ ˜ ) in (9). Remember, we change the notation p ( x , y ) to p ( x , y | θ ˜ ) for the natural gradient algorithm explained in (2). Since Equation (9) is a function of α ˜ , β ˜ , and μ ˜ , the final algorithm has no information about v ˜ . Therefore, we replace ln p ( x , y | θ ˜ ) with ln q ( x , y | θ ˜ ) :
ln q ( x , y | θ ˜ ) = ln β ˜ α ˜ Γ ( α ˜ ) y ( α ˜ + 1 ) exp { β ˜ y } 1 2 π v ˜ exp { ( x μ ˜ ) 2 2 v ˜ } = α ˜ ln β ˜ ln Γ ( α ˜ ) ( α ˜ + 1 ) ln y β ˜ y 1 2 ln v ˜ 1 2 ln ( 2 π ) ( x μ ˜ ) 2 2 v ˜ .
So, we obtain the following Fisher information matrix by using ln q ( x , y | θ ˜ ) :
F ¯ = ( ln β ˜ ψ 0 ( α ˜ ) ln y ) 2 ( ln β ˜ ψ 0 ( α ˜ ) ln y ) ( α ˜ β ˜ 1 y ) ( ln β ˜ ψ 0 ( α ˜ ) ln y ) ( x μ ˜ ) v ˜ ( ln β ˜ ψ 0 ( α ˜ ) ln y ) ( ( x μ ˜ ) 2 2 v ˜ 2 1 2 v ˜ ) ( ln β ˜ ψ 0 ( α ˜ ) ln y ) ( α ˜ β ˜ 1 y ) ( α ˜ β ˜ 1 y ) 2 ( α ˜ β ˜ 1 y ) ( x μ ˜ ) v ˜ ( α ˜ β ˜ 1 y ) ( ( x μ ˜ ) 2 2 v ˜ 2 1 2 v ˜ ) ( ln β ˜ ψ 0 ( α ˜ ) ln y ) ( x μ ˜ ) v ˜ ( α ˜ β ˜ 1 y ) ( x μ ˜ ) v ˜ ( x μ ˜ ) 2 v ˜ 2 ( x μ ˜ ) v ˜ ( ( x μ ˜ ) 2 2 v ˜ 2 1 2 v ˜ ) ( ln β ˜ ψ 0 ( α ˜ ) ln y ) ( ( x μ ˜ ) 2 2 v ˜ 2 1 2 v ˜ ) ( α ˜ β ˜ 1 y ) ( ( x μ ˜ ) 2 2 v ˜ 2 1 2 v ˜ ) ( x μ ˜ ) v ˜ ( ( x μ ˜ ) 2 2 v ˜ 2 1 2 v ˜ ) ( ( x μ ˜ ) 2 2 v ˜ 2 1 2 v ˜ ) 2 q ( x , y ) .
After integrating over q ( x , y | θ ˜ ) , we have:
F ¯ = ψ 1 ( α ˜ ) 1 β ˜ 0 0 1 β ˜ α ˜ β ˜ 2 0 0 0 0 1 v ˜ 0 0 0 0 1 2 v ˜ 2 .
Then, we have ˜ K L ( θ ˜ ) utilizing (1):
˜ K L ( θ ˜ ) = α ˜ α ˜ ψ 1 ( α ˜ ) 1 β ˜ α ˜ ψ 1 ( α ˜ ) 1 0 0 β ˜ α ˜ ψ 1 ( α ˜ ) 1 β ˜ 2 ψ 1 ( α ˜ ) α ˜ ψ 1 ( α ˜ ) 1 0 0 0 0 v ˜ 0 0 0 0 2 v ˜ 2 v ˜ + ( μ ˜ μ ˜ ) 2 β ˜ ψ 1 ( α ˜ ) 2 β ˜ α ˜ ( v ˜ + ( μ ˜ μ ˜ ) 2 ) + β ˜ 2 β ˜ 2 α ˜ ( μ ˜ μ ˜ ) β ˜ α ˜ 2 β ˜ 1 2 v ˜ = 1 2 1 2 ( v ˜ + ( μ ˜ μ ˜ ) 2 ) α ˜ v ˜ ( μ ˜ μ ˜ ) β ˜ v ˜ ( β ˜ α ˜ v ˜ ) β ˜ .
α ˜ , the first component of ˜ K L ( θ ˜ ) , does not need an iteration and is fixed to α ˜ + 1 2 , but others require iterations. Thus, the natural gradient algorithm based on (3) is as follows:
3. The natural gradient algorithm:
α ˜ ( k + 1 ) = α ˜ ( k ) , β ˜ ( k + 1 ) = β ˜ ( k ) + 1 2 ( v ˜ ( k ) + ( μ ˜ μ ˜ ( k ) ) 2 ) , μ ˜ ( k + 1 ) = μ ˜ ( k ) + α ˜ ( k + 1 ) v ˜ ( k ) ( μ ˜ μ ˜ ( k ) ) β ˜ ( k + 1 ) , v ˜ ( k + 1 ) = v ˜ ( k ) + v ˜ ( k ) ( β ˜ ( k + 1 ) α ˜ ( k + 1 ) v ˜ ( k ) ) β ˜ ( k + 1 ) .
Another gradient algorithm to consider is based on the natural parameters Λ ˜ . We use q ( x , y | Λ ˜ ) instead of q ( x , y ) to clarify the computations, Λ ˜ = ( λ ˜ 1 , λ ˜ 2 , λ ˜ 3 , λ ˜ 4 ) :
q ( x , y | Λ ˜ ) exp λ ˜ 4 x 2 2 + λ ˜ 3 x λ ˜ 1 ln y λ ˜ 2 1 y ,
where λ ˜ 1 = α ˜ + 1 , λ ˜ 2 = β ˜ , λ ˜ 3 = μ ˜ v ˜ , and λ ˜ 4 = 1 v ˜ . The logarithm of q ( x , y | Λ ˜ ) is:
ln q ( x , y | Λ ˜ ) λ ˜ 4 x 2 2 + λ ˜ 3 x λ ˜ 1 ln y λ ˜ 2 1 y .
By substituting Λ ˜ in (A5), we obtain:
K L ( Λ ˜ ) 1 2 ln λ ˜ 4 + 1 2 ( ln λ ˜ 2 ψ 0 ( λ ˜ 1 1 ) ) + ( λ ˜ 1 1 ) 2 λ ˜ 2 ( 1 λ ˜ 4 + ( μ ˜ λ ˜ 3 λ ˜ 4 ) 2 ) .
Then, we have:
K L ( Λ ˜ ) = 1 2 λ ˜ 2 ( 1 λ ˜ 4 + ( μ ˜ λ ˜ 3 λ ˜ 4 ) 2 ) 1 2 ψ 1 ( λ ˜ 1 ) , 1 2 λ ˜ 2 λ ˜ 1 1 2 λ ˜ 2 2 ( 1 λ ˜ 4 + ( μ ˜ λ ˜ 3 λ ˜ 4 ) 2 ) , λ ˜ 1 1 λ ˜ 2 λ ˜ 4 ( μ ˜ λ ˜ 3 λ ˜ 4 ) , 1 2 λ ˜ 4 + λ ˜ 1 1 2 λ ˜ 2 λ ˜ 4 2 ( 2 λ ˜ 3 ( μ ˜ λ ˜ 3 λ ˜ 4 ) 1 ) .
The K L ( Λ ˜ ) with respect to the natural parameters, in Equation (A8), is not equal to the one obtained from Equation (A6) after substituting the corresponding normal parameters as defined earlier. The issue pertains to the chain rule in the differentiation, which involves the relationship between λ ˜ 3 and λ ˜ 4 . Thus, the new K L ( θ ˜ ) is:
K L ( θ ˜ ) = v ˜ + ( μ ˜ μ ˜ ) 2 β ˜ ψ 1 ( α ˜ ) 2 β ˜ , α ˜ ( v ˜ + ( μ ˜ μ ˜ ) 2 ) + β ˜ 2 β ˜ 2 , α ˜ v ˜ ( μ ˜ μ ˜ ) β ˜ , v ˜ 2 α ˜ 2 β ˜ ( v ˜ 2 2 μ ˜ v ˜ ( μ ˜ μ ˜ ) ) .
4. The gradient-based algorithm concerning natural parameters:
α ˜ ( k + 1 ) = α ˜ ( k ) v ˜ ( k ) + ( μ ˜ μ ˜ ( k ) ) 2 β ˜ ( k ) ψ 1 ( α ˜ ( k ) ) 2 β ˜ ( k ) , β ˜ ( k + 1 ) = β ˜ ( k ) α ˜ ( k + 1 ) ( v ˜ ( k ) + ( μ ˜ μ ˜ ( k ) ) 2 ) + β ˜ ( k ) 2 [ β ˜ ( k ) ] 2 , μ ˜ ( k + 1 ) = μ ˜ ( k ) α ˜ ( k + 1 ) v ˜ ( k ) ( μ ˜ ( k ) μ ˜ ) β ˜ ( k + 1 ) , v ˜ ( k + 1 ) = v ˜ ( k + 1 ) v ˜ ( k ) 2 α ˜ ( k + 1 ) 2 β ˜ ( k + 1 ) ( [ v ˜ ( k ) ] 2 2 μ ˜ ( k + 1 ) v ˜ ( k ) ( μ ˜ μ ˜ ( k + 1 ) ) ) .

Appendix B. Conjugate Posterior of Normal-Inverse-Wishart of ( μ ˜ , Σ ˜ )

The Normal-Inverse-Wishart distribution function N I W ( μ ˜ , Σ ˜ | μ ˜ 0 , κ ˜ , Ψ ˜ , ν ˜ ) is:
p ( μ ˜ , Σ ˜ | μ ˜ 0 , κ ˜ , Ψ ˜ , ν ˜ ) = N μ ˜ | μ ˜ 0 , 1 κ ˜ Σ ˜ I W ( Σ ˜ | Ψ ˜ , ν ˜ ) .
The explicit form of joint distribution p ( x , μ ˜ , Σ ˜ ) is:
p ( x , μ ˜ , Σ ˜ ) = p ( x | μ ˜ , Σ ˜ ) p ( μ ˜ , Σ ˜ | μ ˜ 0 , κ ˜ , Ψ ˜ , ν ˜ ) = 1 ( 2 π ) n p 2 Σ ˜ n 2 exp { 1 2 i = 1 n ( x i μ ˜ ) Σ ˜ 1 ( x i μ ˜ ) } κ ˜ ( 2 π ) p 2 Σ ˜ 1 2 exp { κ ˜ 2 ( μ ˜ μ ˜ 0 ) Σ ˜ 1 ( μ ˜ μ ˜ 0 ) } Ψ ˜ ν ˜ 2 2 ν ˜ p 2 Γ p ( ν ˜ 2 ) Σ ˜ ν ˜ + p + 1 2 exp { 1 2 Tr Ψ ˜ Σ ˜ 1 } .
We rewrite the exponential expression of multivariate Normal Distribution as follows:
i = 1 n ( x i μ ˜ ) Σ ˜ 1 ( x i μ ˜ ) = n x ¯ μ ˜ Σ ˜ 1 x ¯ μ ˜ + i = 1 n x i x ¯ Σ ˜ 1 x i x ¯ .
By substituting (A11) into (A10), we obtain:
p ( x , μ ˜ , Σ ˜ ) Σ ˜ ν ˜ + p + n + 2 2 exp n 2 x ¯ μ ˜ Σ ˜ 1 x ¯ μ ˜ κ ˜ 2 ( μ ˜ μ ˜ 0 ) Σ ˜ 1 ( μ ˜ μ ˜ 0 ) 1 2 Tr Ψ ˜ + i = 1 n ( x i x ¯ ) ( x i x ¯ ) Σ ˜ 1 .
We have to make integration over Σ ˜ and put observations x instead of x to obtain the margin of μ ˜ :
ln p ( μ ˜ , Σ ˜ ) q 1 q 3 n 2 x ¯ μ ˜ Σ ˜ 1 x ¯ μ ˜ q 1 q 3 κ ˜ 2 ( μ ˜ μ ˜ 0 ) Σ ˜ 1 ( μ ˜ μ ˜ 0 ) q 1 q 3 κ ˜ + n 2 ( μ ˜ κ ˜ μ ˜ 0 + n x ¯ κ ˜ + n ) Λ ˜ 1 ( μ ˜ κ ˜ μ ˜ 0 + n x ¯ κ ˜ + n ) ,
where, Σ ˜ 1 q 3 = Λ ˜ 1 . Thus, μ ˜ has a Normal distribution N κ ˜ μ ˜ 0 + n x ¯ κ ˜ + n , 1 κ ˜ + n Λ ˜ . Also, we have the same for the margin of Σ ˜ :
ln p ( μ ˜ , Σ ˜ ) q 1 q 2 ν ˜ + p + n + 2 2 ln Σ ˜ n 2 ( x ¯ μ ˜ ) Σ ˜ 1 ( x ¯ μ ˜ ) q 1 q 2 κ ˜ 2 ( μ ˜ μ ˜ 0 ) Σ ˜ 1 ( μ ˜ μ ˜ 0 ) q 1 q 2 1 2 Tr Ψ ˜ + i = 1 n ( x i x ¯ ) ( x i x ¯ ) Σ ˜ 1 q 1 q 2 ν ˜ + p + n + 2 2 ln Σ ˜ κ ˜ + n 2 ( μ ˜ κ ˜ μ ˜ 0 + n x ¯ κ ˜ + n ) Σ ˜ 1 ( μ ˜ κ ˜ μ ˜ 0 + n x ¯ κ ˜ + n ) q 1 q 2 1 2 Tr Ψ ˜ + i = 1 n ( x i x ¯ ) ( x i x ¯ ) Σ ˜ 1 q 1 ν ˜ + p + n + 2 2 ln Σ ˜ 1 2 Tr ( κ ˜ + n ) ( μ ˜ κ ˜ μ ˜ 0 + n x ¯ κ ˜ + n ) ( μ ˜ κ ˜ μ ˜ 0 + n x ¯ κ ˜ + n ) q 1 q 2 + Ψ ˜ + i = 1 n ( x i x ¯ ) ( x i x ¯ ) Σ ˜ 1 ν ˜ + p + n + 2 2 ln Σ ˜ 1 2 Tr Λ ˜ + Ψ ˜ + i = 1 n ( x i x ¯ ) ( x i x ¯ ) Σ ˜ 1 .
So, Σ ˜ has the structure of an Inverse Wishart distribution with the below parameters as following:
Σ ˜ I W Λ ˜ + Ψ ˜ + i = 1 n ( x i x ¯ ) ( x i x ¯ ) , ν ˜ + n .
It is clear that x has again a multivariate Normal distribution whose mean and variance-covariance matrix are in the following expression
x M N κ ˜ μ ˜ 0 + n x ¯ κ ˜ + n , 1 ν ˜ + n p 1 Λ ˜ + Ψ ˜ + i = 1 n ( x i x ¯ ) ( x i x ¯ ) .
In the next step, we want to make an iterative algorithm of the above-mentioned marginal distributions. We need some pre-guesses for mean μ ˜ 0 ( 1 ) = μ 0 , variance Λ ˜ ( 1 ) = Λ 0 , and precision κ ˜ ( 1 ) = κ 0 of μ ˜ and also for scale matrix Ψ ˜ ( 1 ) = Ψ 0 and degree of freedom ν ˜ ( 1 ) = ν 0 + n of Σ ˜ . So, we can consider the standard alternative algorithm for Normal-Inverse-Wishart distribution as follows:
1. Standard alternate optimization algorithm:
κ ˜ ( k + 1 ) = κ ˜ ( k ) + n , μ ˜ 0 ( k + 1 ) = κ ˜ ( k + 1 ) μ ˜ 0 ( k ) + n x ¯ κ ˜ ( k + 1 ) + n , Λ ˜ ( k + 1 ) = 1 κ ˜ ( k + 1 ) + n Λ ˜ ( k ) , ν ˜ ( k + 1 ) = ν ˜ ( k ) + n , Ψ ˜ ( k + 1 ) = Λ ˜ ( k + 1 ) + Ψ ˜ ( k ) + i = 1 n ( x i x ¯ ) ( x i x ¯ ) .

Appendix C. KLD of Normal-Inverse-Wishart

First of all, we define Θ ˜ = ( κ ˜ , μ ˜ 0 , Λ ˜ , ν ˜ , Ψ ˜ ) and the corresponding KLD function:
K L ( Θ ˜ ) = q 1 ( μ ˜ ) q 2 ( Σ ˜ ) ln q 1 ( μ ˜ ) q 2 ( Σ ˜ ) p ( μ ˜ , Σ ˜ ) d μ ˜ d Σ ˜ = q 1 ( μ ˜ ) q 2 ( Σ ˜ ) ln q 2 ( Σ ˜ ) p ( μ ˜ , Σ ˜ ) d μ ˜ d Σ ˜ H ( q 1 ) = q 2 ( Σ ˜ ) ln q 2 ( Σ ˜ ) q 1 ( μ ˜ ) ln p ( μ ˜ , Σ ˜ ) d μ ˜ d Σ ˜ H ( q 1 ) .
We need to compute the internal integral first:
q 1 ( μ ˜ ) ln p ( μ ˜ , Σ ˜ ) d μ ˜ = q 1 ( μ ˜ ) ln κ ˜ Ψ ˜ ν ˜ 2 ( 2 π ) p 2 2 ν ˜ p 2 Γ p ( ν ˜ 2 ) Σ ˜ ν ˜ + p + 2 2 exp { κ ˜ 2 ( μ ˜ μ ˜ 0 ) Σ ˜ 1 ( μ ˜ μ ˜ 0 ) 1 2 Tr Ψ ˜ Σ ˜ 1 } d μ ˜ = ln κ ˜ Ψ ˜ ν ˜ 2 ( 2 π ) p 2 2 ν ˜ p 2 Γ p ( ν ˜ 2 ) Σ ˜ ν ˜ + p + 2 2 exp { 1 2 Tr Ψ ˜ Σ ˜ 1 } κ ˜ 2 q 1 ( μ ˜ ) Tr ( μ ˜ μ ˜ 0 ) Σ ˜ 1 ( μ ˜ μ ˜ 0 ) d μ ˜ = ln κ ˜ Ψ ˜ ν ˜ 2 ( 2 π ) p 2 2 ν ˜ p 2 Γ p ( ν ˜ 2 ) Σ ˜ ν ˜ + p + 2 2 exp { 1 2 Tr ( Λ ˜ + Ψ ˜ ) Σ ˜ 1 } .
By substituting (A18) in side of (A17), we get:
K L ( Θ ˜ ) = q 2 ( Σ ˜ ) ln q 2 ( Σ ˜ ) ln κ ˜ Ψ ˜ ν ˜ 2 ( 2 π ) p 2 2 ν ˜ p 2 Γ p ( ν ˜ 2 ) Σ ˜ ν ˜ + p + 2 2 exp { 1 2 Tr ( Λ ˜ + Ψ ˜ ) Σ ˜ 1 } d Σ ˜ H ( q 1 ) .
Since q 2 ( Σ ˜ ) = I W ( Ψ ˜ , ν ˜ ) , we can rewrite K L ( Θ ˜ ) as follows:
K L ( Θ ˜ ) = ln κ ˜ Ψ ˜ ν ˜ 2 Γ p ( ν ˜ + 1 2 ) π p 2 Γ p ( ν ˜ 2 ) Λ ˜ + Ψ ˜ ν ˜ + 1 2 + q 2 ( Σ ˜ ) ln I W ( Ψ ˜ , ν ˜ ) I W ( Λ ˜ + Ψ ˜ , ν ˜ + 1 ) d Σ ˜ H ( q 1 ) .
The second term is again a KLD function with respect to two Inverse Wishart distributions with different parameters calculated in [34]:
q 2 ( Σ ˜ ) ln I W ( Ψ ˜ , ν ˜ ) I W ( Λ ˜ + Ψ ˜ , ν ˜ + 1 ) d Σ ˜ = ln Γ p ( ν ˜ + 1 2 ) Γ p ( ν ˜ 2 ) + ν ˜ 2 Tr Ψ ˜ 1 Λ ˜ + I p ν ˜ 2 ν ˜ + 1 2 ln Ψ ˜ 1 Λ ˜ + I 1 2 i = 1 p ψ 0 ( ν ˜ p + i 2 ) .
The last term of (A17) is the Shannon entropy of μ ˜ :
H ( q 1 ) = q 1 ( μ ˜ ) ln q 1 ( μ ˜ ) d μ ˜ = q 1 ( μ ˜ ) ln κ ˜ p 2 ( 2 π ) p 2 Λ ˜ 1 2 exp { κ ˜ 2 ( μ ˜ μ ˜ 0 ) Λ ˜ 1 ( μ ˜ μ ˜ 0 ) } d μ ˜ = p 2 ln ( κ ˜ ) p 2 ln ( 2 π ) 1 2 ln Λ ˜ κ ˜ 2 q 1 ( μ ˜ ) ( μ ˜ μ ˜ 0 ) Λ ˜ 1 ( μ ˜ μ ˜ 0 ) d μ ˜ = p 2 ln ( κ ˜ ) p 2 ln ( 2 π ) 1 2 ln Λ ˜ p 2 .
The final expression of K L ( Θ ˜ ) is equivalent to the following after substitution of (A21) and (A22) into (A20):
K L ( Θ ˜ ) = 1 2 ln κ ˜ ν ˜ 2 ln Ψ ˜ + p 2 ln π ln Γ p ( ν ˜ + 1 2 ) Γ p ( ν ˜ 2 ) + ν ˜ + 1 2 ln Λ ˜ + Ψ ˜ + ln Γ p ( ν ˜ + 1 2 ) Γ p ( ν ˜ 2 ) + ν ˜ 2 Tr Ψ ˜ 1 Λ ˜ + I p ν ˜ 2 ν ˜ + 1 2 ln Ψ ˜ 1 Λ ˜ + I 1 2 i = 1 p ψ 0 ( ν ˜ p + i 2 ) + p 2 ln ( κ ˜ ) p 2 ln ( 2 π ) 1 2 ln Λ ˜ p 2 p 1 2 ln κ ˜ ν ˜ + 1 2 ln Ψ ˜ · Ψ ˜ 1 Λ ˜ + I + 1 2 ln Ψ ˜ + ν ˜ + 1 2 ln Λ ˜ + Ψ ˜ + ν ˜ 2 Tr Ψ ˜ 1 Λ ˜ + I p ν ˜ 2 1 2 i = 1 p ψ 0 ( ν ˜ p + i 2 ) 1 2 ln Λ ˜ p 1 2 ln κ ˜ + 1 2 ln Ψ ˜ Λ ˜ 1 + ν ˜ 2 Tr Ψ ˜ 1 Λ ˜ + I p ν ˜ 2 1 2 i = 1 p ψ 0 ( ν ˜ p + i 2 ) .
The gradient of K L ( Θ ˜ ) is:
K L ( Θ ˜ ) = p 1 2 κ ˜ , 0 , 1 2 Λ ˜ 1 + ν ˜ 2 Ψ ˜ 1 , 1 2 Tr Ψ ˜ 1 Λ ˜ + I p 2 1 4 i = 1 p ψ 1 ( ν ˜ p + i 2 ) , 1 2 Ψ ˜ 1 ν ˜ 2 Ψ ˜ 1 Λ ˜ Ψ ˜ 1 ,
where 0 is the zero vector of p dimension. So, the parametric gradient and natural gradient algorithms are:
2. The gradient-based algorithm with normal parameters:
κ ˜ ( k + 1 ) = κ ˜ ( k ) γ 2 p 1 κ ˜ ( k ) , μ ˜ 0 ( k + 1 ) = μ ˜ 0 ( k ) , Λ ˜ ( k + 1 ) = Λ ˜ ( k ) γ 2 [ Λ ˜ ( k ) ] 1 + ν ˜ ( k ) [ Ψ ˜ ( k ) ] 1 , ν ˜ ( k + 1 ) = ν ˜ ( k ) γ 2 Tr [ Ψ ˜ ( k ) ] 1 Λ ˜ ( k + 1 ) + I p 1 2 i = 1 p ψ 1 ( ν ˜ ( k ) p + i 2 ) , Ψ ˜ ( k + 1 ) = Ψ ˜ ( k ) γ 2 [ Ψ ˜ ( k ) ] 1 ν ˜ ( k + 1 ) [ Ψ ˜ ( k ) ] 1 Λ ˜ ( k + 1 ) [ Ψ ˜ ( k ) ] 1 .
where γ is a fixed value for the gradient algorithm, and is proportional to 1 and 1 / K L for this algorithm. To get started with the natural gradient method, first, let us redefine q ( μ ˜ , Σ ˜ ) as q ( μ ˜ , Σ ˜ | Θ ˜ ) and Θ ˜ as Θ ˜ = ( μ ˜ 0 , Λ ˜ , Ψ ˜ ) . We exclude κ ˜ and ν ˜ from Θ ˜ and hold them constant because we require the estimated values of the remaining parameters. Then, we have:
q ( μ ˜ , Σ ˜ | Θ ˜ ) ( 2 π ) p 2 κ ˜ 1 2 Λ ˜ 1 2 exp κ ˜ 2 ( μ ˜ μ ˜ 0 ) Λ ˜ 1 ( μ ˜ μ ˜ 0 ) Ψ ˜ ν ˜ 2 ( 2 π ) ν ˜ p 2 Γ ( ν ˜ 2 ) Σ ˜ ν ˜ + p + 1 2 exp 1 2 Tr Ψ ˜ Σ ˜ 1 .
Then, we need its logarithm function:
ln q ( μ ˜ , Σ ˜ | Θ ˜ ) 1 2 ln Λ ˜ κ ˜ 2 ( μ ˜ μ ˜ 0 ) Λ ˜ 1 ( μ ˜ μ ˜ 0 ) + ν ˜ 2 ln Ψ ˜ ν ˜ + p + 1 2 ln Σ ˜ 1 2 Tr Ψ ˜ Σ ˜ 1 .
The Fisher information is:
F ¯ = κ ˜ Λ ˜ 1 κ ˜ ( μ ˜ μ ˜ 0 ) Λ ˜ 1 Λ ˜ 1 κ ˜ ( μ ˜ μ ˜ 0 ) Λ ˜ 1 Λ ˜ 1 1 2 Λ ˜ 1 Λ ˜ 1 κ ˜ Λ ˜ 1 ( μ ˜ μ ˜ 0 ) ( μ ˜ μ ˜ 0 ) Λ ˜ 1 Λ ˜ 1 0 0 0 0 ν ˜ 2 Ψ ˜ 1 Ψ ˜ 1 q ( μ ˜ , Σ ˜ ) ,
where 0 is the p × p zero matrix. After taking the expectations in the Fisher matrix, we obtain:
F ¯ = κ ˜ Λ ˜ 1 0 0 0 1 2 Λ ˜ 1 Λ ˜ 1 0 0 0 ν ˜ 2 Ψ ˜ 1 Ψ ˜ 1 .
Then, we also have the natural gradient ∇̃KL(Θ̃):
˜ K L ( Θ ˜ ) = 1 κ ˜ Λ ˜ 0 0 0 2 Λ ˜ 2 0 0 0 2 ν ˜ Ψ ˜ 2 0 1 2 Λ ˜ 1 + ν ˜ 2 Ψ ˜ 1 1 2 Ψ ˜ 1 ν ˜ 2 Ψ ˜ 1 Λ ˜ Ψ ˜ 1 = 0 Λ ˜ ν ˜ Λ ˜ 2 Ψ ˜ 1 1 ν ˜ Ψ ˜ + Ψ ˜ Λ ˜ Ψ ˜ 1 .
We replaced the zero vector 0 with the zero matrix 0 in the first element of ∇KL(Θ̃) to make the matrix multiplication feasible. The natural gradient algorithm for the Normal-Inverse-Wishart example is as follows:
3. The natural gradient algorithm:
μ ˜ 0 ( k + 1 ) = μ ˜ 0 ( k ) , Λ ˜ ( k + 1 ) = ν ˜ [ Λ ˜ ( k ) ] 2 [ Ψ ˜ ( k ) ] 1 , Ψ ˜ ( k + 1 ) = ν ˜ + 1 ν ˜ Ψ ˜ ( k ) Ψ ˜ ( k ) Λ ˜ ( k + 1 ) [ Ψ ˜ ( k ) ] 1 .
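Whatever the example, each natural-gradient update has the same structure: rescale the ordinary gradient by the inverse Fisher information. A schematic sketch (not the authors' implementation), assuming the parameters and the gradient have been flattened into NumPy vectors and fisher is the corresponding Fisher matrix, is:

```python
import numpy as np

def natural_gradient_step(theta, grad_kl, fisher, step=1.0):
    # One update theta <- theta - step * F^{-1} grad (schematic sketch only).
    direction = np.linalg.solve(fisher, grad_kl)   # F^{-1} * gradient
    return theta - step * direction
```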
In this algorithm, the iteration is only on Λ̃ and Ψ̃. For the final algorithm, the gradient algorithm in terms of the natural parameters, we recall Equation (A25) and keep only the terms that depend on μ̃ and Σ̃:
ln q(μ̃, Σ̃ | Θ̃) ≃ κ̃ μ̃0ᵀ Λ̃⁻¹ μ̃ − (κ̃/2) μ̃ᵀ Λ̃⁻¹ μ̃ − ((ν̃ + p + 1)/2) ln|Σ̃| − (1/2) Tr(Ψ̃ Σ̃⁻¹).
We define the natural parameters as L = (ł1, ł2, ł3, ł4), where ł1 = κ̃ μ̃0ᵀ Λ̃⁻¹, ł2 = κ̃ Λ̃⁻¹, ł3 = ν̃ + p + 1, and ł4 = Ψ̃. Note that ł1 is a p-vector, ł3 is a scalar, and ł2 and ł4 are p × p matrices. Then, we have a new version of ln q(μ̃, Σ̃ | Θ̃), represented as ln q(μ̃, Σ̃ | L):
ln q(μ̃, Σ̃ | L) ≃ ł1 μ̃ − (1/2) μ̃ᵀ ł2 μ̃ − (ł3/2) ln|Σ̃| − (1/2) Tr(ł4 Σ̃⁻¹).
We fix the value of κ̃ and substitute μ̃0 = ł1 ł2⁻¹, Λ̃⁻¹ = (1/κ̃) ł2 (so Λ̃ = κ̃ ł2⁻¹), ν̃ = ł3 − p − 1, and Ψ̃ = ł4 into Equation (A23) to obtain the new KL(L):
K L ( L ) p n 1 2 ln κ ˜ + 1 2 ln ł 4 ł 2 + ł 3 p 1 2 Tr ł 4 1 ł 2 1 + I p ( ł 3 p 1 ) 2 1 2 i = 1 p ψ 0 ( ł 3 2 p + i 1 2 ) .
Then, we have:
K L ( L ) = 0 , 1 2 ł 2 1 ł 3 p 1 2 ł 2 1 ł 4 1 ł 2 1 , 1 2 Tr ł 4 1 ł 2 1 + I p 2 1 4 i = 1 p ψ 1 ( ł 3 2 p + i 1 2 ) , 1 2 ł 4 1 ł 3 p 1 2 ł 4 1 ł 2 1 ł 4 1 .
To derive the gradient algorithm based on the natural parameters, we substitute the definition of L back into ∇KL(L) to obtain ∇KL(Θ̃):
K L ( Θ ˜ ) = 0 , 1 2 κ ˜ Λ ˜ ν ˜ 2 κ ˜ 2 Λ ˜ Ψ ˜ 1 Λ ˜ , 1 2 Tr 1 κ ˜ Ψ ˜ 1 Λ ˜ + I p 2 1 4 i = 1 p ψ 1 ( ν ˜ p + i 2 ) , 1 2 Ψ ˜ 1 ν ˜ 2 κ ˜ Ψ ˜ 1 Λ ˜ Ψ ˜ 1 .
Comparing this gradient with Equation (A24), we notice that it has one fewer component; the first component is missing because we have fixed κ̃. The other elements of the two gradients differ slightly because ł1 is a matrix multiple of ł2. The last algorithm is:
4. The gradient-based algorithm concerning natural parameters:
μ ˜ 0 ( k + 1 ) = μ ˜ ( k ) , Λ ˜ ( k + 1 ) = 2 κ ˜ 1 2 κ ˜ Λ ˜ ( k ) + ν ˜ ( k ) 2 κ ˜ 2 Λ ˜ ( k ) [ Ψ ˜ ( k ) ] 1 Λ ˜ ( k ) , ν ˜ ( k + 1 ) = ν ˜ ( k ) 1 2 Tr 1 κ ˜ [ Ψ ˜ ( k ) ] 1 Λ ˜ ( k + 1 ) + I + p 2 + 1 4 i = 1 p ψ 1 ( ν ˜ ( k ) p + i 2 ) , Ψ ˜ ( k + 1 ) = Ψ ˜ ( k ) 1 2 [ Ψ ˜ ( k ) ] 1 + ν ˜ ( k + 1 ) 2 κ ˜ [ Ψ ˜ ( k ) ] 1 Λ ˜ ( k + 1 ) [ Ψ ˜ ( k ) ] 1 .
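All of the parametric gradient variants above share the same outer loop. The sketch below is our own illustration (the callables kl and grad_kl are hypothetical placeholders for the expressions derived in this appendix) and shows the two step-size policies used in the experiments, a fixed γ and γ proportional to 1/KL(·):

```python
import numpy as np

def run_gradient_vba(theta0, kl, grad_kl, fixed_gamma=1.0, use_inverse_kl=False, n_iter=100):
    # Generic gradient loop: theta <- theta - gamma * grad KL(theta).
    theta = np.asarray(theta0, dtype=float)
    kl_history = [kl(theta)]
    for _ in range(n_iter):
        gamma = 1.0 / kl_history[-1] if use_inverse_kl else fixed_gamma
        theta = theta - gamma * grad_kl(theta)
        kl_history.append(kl(theta))
    return theta, kl_history
```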

Appendix D. The Standard Alternative Optimization of the Linear Inverse Problem

In this example, the process is essentially the same. First, we rewrite p(g, f̃, ṽ) and ln p(g, f̃, ṽ):
p ( g , f ˜ , v ˜ ) = 1 ( 2 π ) n p 2 v ϵ ˜ I n 2 exp { 1 2 i = 1 n ( g i f ˜ ) ( v ϵ ˜ I ) 1 ( g i f ˜ ) } 1 ( 2 π ) p 2 diag v ˜ 1 2 exp { 1 2 f ˜ diag v ˜ 1 f ˜ } j = 1 p β ˜ j α ˜ j Γ ( α ˜ j ) v ˜ j ( α ˜ j + 1 ) exp { β ˜ j v ˜ j } ,
and
ln p ( g , f ˜ , v ˜ ) n 2 ln v ϵ ˜ I 1 2 i = 1 n ( g i f ˜ ) ( v ϵ ˜ I ) 1 ( g i f ˜ ) 1 2 ln diag v ˜ 1 2 f ˜ diag v ˜ 1 f ˜ + j = 1 p α ˜ j ln β ˜ j j = 1 p ln Γ ( α ˜ j ) j = 1 p ( α ˜ j + 1 ) ln v ˜ j j = 1 p β ˜ j v ˜ j ,
where I is the p × p identity matrix. From the above expression, it is straightforward to obtain all the marginal functions, starting with ṽ:
ln p ( g , f ˜ , v ˜ ) g , f ˜ 1 2 ln diag v ˜ 1 2 f ˜ diag v ˜ 1 f ˜ f ˜ j = 1 p ( α ˜ j + 1 ) ln v ˜ j j = 1 p β ˜ j v ˜ j 1 2 j = 1 p ln v ˜ i 1 2 j = 1 p f ˜ k 2 f ˜ v ˜ i j = 1 p ( α ˜ j + 1 ) ln v ˜ j j = 1 p β ˜ j v ˜ j ( α ˜ k + 3 2 ) ln v ˜ k ( v ˜ f ˜ k + μ ˜ f ˜ k 2 2 + β ˜ k ) 1 v ˜ k .
Thus, ṽ_k ∼ IG(α̃_k, (ṽ_f̃k + μ̃_f̃k²)/2 + β̃_k) for k = 1, …, p. The computation of the density of g is similar:
ln p ( g , f ˜ , v ˜ ) f ˜ , v ˜ 1 2 i = 1 n ( g i f ˜ ) ( v ϵ ˜ I ) 1 ( g i f ˜ ) f ˜ = 1 2 v ϵ ˜ i = 1 n j = 1 p ( g i j f ˜ j ) 2 f ˜ 1 2 v ϵ ˜ i = 1 n j = 1 p ( g i j f ˜ j f ˜ ) 2 = 1 2 i = 1 n ( g i μ ˜ f ˜ ) ( v ϵ ˜ I ) 1 ( g i μ ˜ f ˜ ) .
So, g again has a Normal distribution, N(μ̃_f̃, ṽ_ε I). Since we will need the density of ḡ later, we specify it now: by the properties of the Normal distribution, ḡ also has a Normal density, ḡ ∼ N(μ̃_f̃, (1/n) ṽ_ε I). Now, we repeat the calculation in (A29) for the marginal of f̃, separately for each component f̃_k, k = 1, …, p:
ln p ( g , f ˜ , v ˜ ) g , v ˜ 1 2 i = 1 n ( g i f ˜ ) ( v ϵ ˜ I ) 1 ( g i f ˜ ) g 1 2 f ˜ diag v ˜ 1 f ˜ v ˜ 1 2 n ( g ¯ f ˜ ) ( v ϵ ˜ I ) 1 ( g ¯ f ˜ ) g + f ˜ d i a g v ˜ 1 f ˜ v ˜ = 1 2 n v ϵ ˜ j = 1 p ( g ¯ j f ˜ j ) 2 g + j = 1 p 1 v ˜ j v ˜ f ˜ j 2 1 2 j = 1 p ( n v ϵ ˜ + 1 v ˜ j v ˜ ) f ˜ j g ¯ j g 1 + v ϵ ˜ n 1 v ˜ j v ˜ 2 = 1 2 j = 1 p ( n v ϵ ˜ + 2 α ˜ j v ˜ f ˜ j + μ ˜ f ˜ j 2 + 2 β ˜ j ) f ˜ j μ ˜ g i 1 + 2 v ϵ ˜ α ˜ j n ( v ˜ f ˜ j + μ ˜ f ˜ j 2 + 2 β ˜ j ) 2 .
Thus, f̃_k has a Normal distribution with the following mean and variance:
f̃_k ∼ N( μ̃_gk / (1 + 2 ṽ_ε α̃_k / (n (ṽ_f̃k + μ̃_f̃k² + 2 β̃_k))) , ṽ_ε (ṽ_f̃k + μ̃_f̃k² + 2 β̃_k) / (n (ṽ_f̃k + μ̃_f̃k²) + 2 n β̃_k + 2 ṽ_ε α̃_k) ), k = 1, …, p.
So, f̃ is separable due to the diag structure. We summarize everything in a single multivariate Normal distribution:
f ˜ M N μ ˜ f ˜ 1 + 2 v ϵ ˜ α ˜ n ( v ˜ f ˜ + μ ˜ f ˜ 2 + 2 β ˜ ) , diag v ϵ ˜ ( v ˜ f ˜ + μ ˜ f ˜ 2 + 2 β ˜ ) n ( v ˜ f ˜ + μ ˜ f ˜ 2 ) + 2 n β ˜ + 2 v ϵ ˜ α ˜ .
where all operations on vectors are componentwise, and we find that μ̃_g = μ̃_f̃. In the recursive algorithm, ṽ_ε and α̃ are held fixed; they can be estimated from the data set. The alternate algorithm for the other parameters is as follows:
1. Standard alternate optimization algorithm:
v ϵ ˜ ( k + 1 ) = v ϵ ˜ ( k ) , α ˜ ( k + 1 ) = α ˜ ( k ) , β ˜ ( k + 1 ) = 1 2 ( v ˜ f ˜ ( k ) + [ μ ˜ f ˜ ( k ) ] 2 ) + β ˜ ( k ) , μ ˜ f ˜ ( k + 1 ) = μ ˜ f ˜ ( k ) 1 + 2 v ϵ ˜ ( k + 1 ) α ˜ ( k + 1 ) n ( v ˜ f ˜ ( k ) + [ μ ˜ f ˜ ( k ) ] 2 + 2 β ˜ ( k + 1 ) ) 1 , v ˜ f ˜ ( k + 1 ) = diag v ϵ ˜ ( k + 1 ) ( v ˜ f ˜ ( k ) + [ μ ˜ f ˜ ( k + 1 ) ] 2 + 2 β ˜ ( k + 1 ) ) n ( v ˜ f ˜ ( k ) + [ μ ˜ f ˜ ( k + 1 ) ] 2 ) + 2 n β ˜ ( k + 1 ) + 2 v ϵ ˜ ( k + 1 ) α ˜ ( k + 1 ) .
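For concreteness, one sweep of this recursion can be coded componentwise as follows (a sketch based on our transcription of the updates above; the variable names are ours, and ṽ_ε and α̃ stay fixed):

```python
import numpy as np

def alternate_sweep(mu_f, v_f, beta, alpha, v_eps, n):
    # One sweep of the standard alternate optimization; v_eps and alpha stay fixed.
    beta_new = 0.5 * (v_f + mu_f**2) + beta
    mu_f_new = mu_f / (1.0 + 2.0 * v_eps * alpha / (n * (v_f + mu_f**2 + 2.0 * beta_new)))
    v_f_new = (v_eps * (v_f + mu_f_new**2 + 2.0 * beta_new)
               / (n * (v_f + mu_f_new**2) + 2.0 * n * beta_new + 2.0 * v_eps * alpha))
    return mu_f_new, v_f_new, beta_new
```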

Appendix E. KL( θ ˜ ) of the Linear Inverse Problem

We define θ ˜ = ( μ ˜ g , v ϵ ˜ , v ˜ f ˜ , α ˜ , β ˜ ) and its corresponding K L ( θ ˜ ) :
KL(θ̃) = H(q1) + H(q2) + H(q3) − ⟨ln p(g, f̃, ṽ)⟩_{q1 q2 q3}.
We treat each term separately, starting with H(q1), the negative entropy of g:
H(q1) = ∫ q1(g) ln [ (2π)^{−pn/2} |ṽ_ε I|^{−n/2} exp{ −(1/2) Σ_{i=1}^{n} (g_i − μ̃_g)ᵀ (ṽ_ε I)⁻¹ (g_i − μ̃_g) } ] dg = −pn/2 − (pn/2) ln(2π) − (pn/2) ln ṽ_ε,
and the negative entropy of f ˜ :
H(q2) = ∫ q2(f̃) ln [ (2π)^{−p/2} |diag(ṽ_f̃)|^{−1/2} exp{ −(1/2) f̃ᵀ diag(ṽ_f̃)⁻¹ f̃ } ] df̃ = −p/2 − (p/2) ln(2π) − (1/2) ln|diag(ṽ_f̃)|,
and the negative entropy of v ˜ :
H(q3) = ∫ q3(ṽ) ln [ ∏_{j=1}^{p} (β̃_j^{α̃_j}/Γ(α̃_j)) ṽ_j^{−(α̃_j+1)} exp{−β̃_j/ṽ_j} ] dṽ = −Σ_{j=1}^{p} α̃_j − Σ_{j=1}^{p} ln(β̃_j Γ(α̃_j)) + Σ_{j=1}^{p} (α̃_j + 1) ψ0(α̃_j),
and the last term of K L ( θ ˜ ) according to Equation (A27):
ln p ( g , f ˜ , v ˜ ) q 1 q 2 q 3 q 1 ( g ) q 2 ( f ˜ ) q 3 ( v ˜ ) n 2 ln v ϵ ˜ I + 1 2 i = 1 n ( g i f ˜ ) ( v ϵ ˜ I ) 1 ( g i f ˜ ) + 1 2 ln diag v ˜ + 1 2 f ˜ diag v ˜ 1 f ˜ j = 1 p α ˜ j ln β ˜ j + j = 1 p ln Γ ( α ˜ j ) + j = 1 p ( α ˜ j + 1 ) ln v ˜ j + j = 1 p β ˜ j v ˜ j d g d f ˜ d v ˜ n p 2 ln v ϵ ˜ + 1 2 v ϵ ˜ i = 1 n j = 1 p q 1 ( g ) q 2 ( f ˜ ) ( g i j 2 + f ˜ j 2 ) d g d f ˜ + 1 2 j = 1 p f ˜ j 2 v ˜ j d f ˜ d v ˜ j = 1 p α ˜ j ln β ˜ j + j = 1 p ln Γ ( α ˜ j ) + q 3 ( v ˜ ) j = 1 p ( α ˜ j + 3 2 ) ln v ˜ j + j = 1 p β ˜ j v ˜ j d v ˜ n p 2 ln v ϵ ˜ + n 2 v ϵ ˜ j = 1 p ( v ϵ ˜ + μ ˜ g j 2 + v ˜ f ˜ j ) + j = 1 p α ˜ j v ˜ f ˜ j β ˜ j j = 1 p α ˜ j ln β ˜ j + j = 1 p ln Γ ( α ˜ j ) + j = 1 p ( α ˜ j + 3 2 ) ( ln β ˜ j ψ 0 ( α ˜ j ) ) + j = 1 p α ˜ j .
Finally, we obtain:
K L ( θ ˜ ) p n 2 ln v ϵ ˜ 1 2 ln diag v ˜ f ˜ j = 1 p ln ( β ˜ j Γ ( α ˜ j ) ) + j = 1 p ( α ˜ j + 1 ) ψ 0 ( α ˜ j ) + n p 2 ln v ϵ ˜ + n 2 v ϵ ˜ j = 1 p ( v ϵ ˜ + μ ˜ g j 2 + v ˜ f ˜ j ) + j = 1 p α ˜ j v ˜ f ˜ j β ˜ j j = 1 p α ˜ j ln β ˜ j + j = 1 p ln Γ ( α ˜ j ) + j = 1 p ( α ˜ j + 3 2 ) ( ln β ˜ j ψ 0 ( α ˜ j ) ) . 1 2 j = 1 p ln v ˜ f ˜ j + n 2 v ϵ ˜ j = 1 p ( μ ˜ g j 2 + v ˜ f ˜ j ) + j = 1 p α ˜ j v ˜ f ˜ j β ˜ j 1 2 j = 1 p ψ 0 ( α ˜ j ) + 1 2 j = 1 p ln β ˜ j .
To derive the last two algorithms, we differentiate KL(θ̃) with respect to θ̃:
∇KL(θ̃) = ( (n/ṽ_ε) μ̃_g , −(n/(2ṽ_ε²)) Σ_{j=1}^{p} (μ̃_gj² + ṽ_f̃j) , (n/(2ṽ_ε)) 1 − (1/2) ṽ_f̃⁻¹ + α̃/β̃ , −(1/2) ψ1(α̃) + ṽ_f̃/β̃ , (1/2) β̃⁻¹ − α̃ ṽ_f̃/β̃² ),
where 1 is the all-ones vector and all operations are componentwise. The gradient algorithm is as follows:
2. The gradient-based algorithm with normal parameters:
μ ˜ g ( k + 1 ) = μ ˜ g ( k ) γ n v ϵ ˜ ( k ) μ ˜ g ( k ) , v ϵ ˜ ( k + 1 ) = v ϵ ˜ ( k ) + γ n 2 [ v ϵ ˜ ( k ) ] 2 j = 1 p ( 2 v ˜ f ˜ j ( k ) + [ μ ˜ g j ( k + 1 ) ] 2 ) , v ˜ f ˜ ( k + 1 ) = v ˜ f ˜ ( k ) γ n v ϵ ˜ ( k + 1 ) 1 1 2 [ v ˜ f ˜ ( k ) ] 1 + α ˜ ( k ) β ˜ ( k ) , α ˜ ( k + 1 ) = α ˜ ( k ) γ v ˜ f ˜ ( k + 1 ) β ˜ ( k ) 1 2 ψ 1 ( α ˜ ( k ) ) , β ˜ ( k + 1 ) = β ˜ ( k ) γ 1 2 [ β ˜ ( k ) ] 1 α ˜ ( k + 1 ) v ˜ f ˜ ( k + 1 ) [ β ˜ ( k ) ] 2 .
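The simplified KL(θ̃) above and its gradient are straightforward to code componentwise; the following sketch reflects our reading of the two expressions (the function names are ours, not code from the paper):

```python
import numpy as np
from scipy.special import digamma, polygamma

def kl_theta(mu_g, v_eps, v_f, alpha, beta, n):
    # KL(theta~) for the linear inverse example, up to an additive constant.
    return (-0.5 * np.sum(np.log(v_f))
            + 0.5 * n / v_eps * np.sum(mu_g**2 + v_f)
            + np.sum(alpha * v_f / beta)
            - 0.5 * np.sum(digamma(alpha))
            + 0.5 * np.sum(np.log(beta)))

def grad_kl_theta(mu_g, v_eps, v_f, alpha, beta, n):
    # Gradient with respect to (mu_g, v_eps, v_f, alpha, beta), componentwise.
    d_mu_g = n / v_eps * mu_g
    d_v_eps = -0.5 * n / v_eps**2 * np.sum(mu_g**2 + v_f)
    d_v_f = 0.5 * n / v_eps - 0.5 / v_f + alpha / beta
    d_alpha = -0.5 * polygamma(1, alpha) + v_f / beta
    d_beta = 0.5 / beta - alpha * v_f / beta**2
    return d_mu_g, d_v_eps, d_v_f, d_alpha, d_beta
```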
The next algorithm we consider is the natural gradient algorithm with respect to a redefined θ̃. We fix ṽ_ε and set θ̃ = (μ̃_g, ṽ_f̃, α̃, β̃), because the empirical Fisher information with respect to ṽ_ε is 0. We need the Fisher information of ln q(g, f̃, ṽ | θ̃):
ln q ( g , f ˜ , v ˜ | θ ˜ ) p n 2 ln v ϵ ˜ 1 2 v ϵ ˜ i = 1 n j = 1 p ( g i j μ ˜ g j ) 2 1 2 j = 1 p ln v ˜ f ˜ j 1 2 f ˜ diag v ˜ f ˜ 1 f ˜ + j = 1 p α ˜ j ln β ˜ j j = 1 p ln Γ ( α ˜ j ) j = 1 p ( α ˜ j + 1 ) ln v ˜ j j = 1 p β ˜ j v ˜ j .
Its corresponding Fisher information matrix is:
F ¯ = n v ϵ ˜ 1 0 0 0 0 1 2 v ˜ f ˜ 2 f ˜ f ˜ v ˜ f ˜ 3 0 0 0 0 Ψ 1 ( α ˜ ) 1 β ˜ 0 0 1 β ˜ α ˜ β ˜ 2 q ( g , f ˜ , v ˜ ) = n v ϵ ˜ 1 0 0 0 0 1 2 v ˜ f ˜ 2 0 0 0 0 Ψ 1 ( α ˜ ) 1 β ˜ 0 0 1 β ˜ α ˜ β ˜ 2 .
Then, we can calculate ∇̃KL(θ̃) as follows:
˜ K L ( θ ˜ ) = v ϵ ˜ n 1 0 0 0 0 2 v ˜ f ˜ 2 0 0 0 0 α ˜ α ˜ Ψ 1 ( α ˜ ) 1 β ˜ α ˜ Ψ 1 ( α ˜ ) 1 0 0 β ˜ α ˜ Ψ 1 ( α ˜ ) 1 β ˜ 2 Ψ 1 ( α ˜ ) α ˜ Ψ 1 ( α ˜ ) 1 n v ϵ ˜ μ ˜ g n 2 v ϵ ˜ 1 1 2 v ˜ f ˜ 1 + α ˜ β ˜ 1 2 ψ 1 ( α ˜ ) + v ˜ f ˜ β ˜ 1 2 β ˜ 1 α ˜ v ˜ f ˜ β ˜ 2 = μ ˜ g 2 v ˜ f ˜ 2 ( n 2 v ϵ ˜ 1 + 1 2 v ˜ f ˜ 1 α ˜ β ˜ ) 1 2 β ˜ Ψ 1 ( α ˜ ) 2 ( α ˜ Ψ 1 ( α ˜ ) 1 ) + v ˜ f ˜ .
For the natural gradient algorithm, we set μ̃_g(k+1) = 2μ̃_g and α̃(k+1) = α̃ − 1/2, and iterate only on ṽ_f̃ and β̃.
3. The natural gradient algorithm:
v ˜ f ˜ ( k + 1 ) = v ˜ f ˜ ( k ) 2 [ v ˜ f ˜ ( k ) ] 2 ( n 2 v ϵ ˜ 1 + 1 2 [ v ˜ f ˜ ( k ) ] 1 α ˜ β ˜ ( k ) ) , β ˜ ( k + 1 ) = β ˜ ( k ) 1 + Ψ 1 ( α ˜ ) 2 ( α ˜ Ψ 1 ( α ˜ ) 1 ) + v ˜ f ˜ ( k + 1 ) .
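The only non-diagonal piece of F̄ in this example is the 2 × 2 block coupling each (α̃_j, β̃_j) pair, so applying F̄⁻¹ reduces to inverting that block component by component; a sketch under this reading is:

```python
import numpy as np
from scipy.special import polygamma

def natural_direction_alpha_beta(g_alpha, g_beta, alpha, beta):
    # Apply the inverse of the per-component Fisher block
    # [[psi_1(alpha), -1/beta], [-1/beta, alpha/beta**2]] to (g_alpha, g_beta).
    trig = polygamma(1, alpha)
    det = (alpha * trig - 1.0) / beta**2
    d_alpha = (alpha / beta**2 * g_alpha + g_beta / beta) / det
    d_beta = (g_alpha / beta + trig * g_beta) / det
    return d_alpha, d_beta
```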
For the last algorithm, we rewrite ln q(g, f̃, ṽ | Λ̃) with respect to the natural parameters:
ln q ( g , f ˜ , v ˜ | Λ ˜ ) λ ˜ 1 2 i = 1 n g i λ ˜ 2 2 i = 1 n g i g i 1 2 f ˜ λ ˜ 3 f ˜ j = 1 p λ ˜ 4 j ln v ˜ j j = 1 p λ ˜ 5 j v ˜ j ,
where Λ̃ = (λ̃1, λ̃2, λ̃3, λ̃4, λ̃5) with λ̃1 = μ̃_g/ṽ_ε, λ̃2 = 1/ṽ_ε, λ̃3 = 1/ṽ_f̃, λ̃4 = α̃ + 1, and λ̃5 = β̃. Then, the corresponding KL(Λ̃) is:
K L ( Λ ˜ ) 1 2 j = 1 p ln λ ˜ 3 j + n 2 j = 1 p ( λ ˜ 1 j 2 λ ˜ 2 + λ ˜ 2 λ ˜ 3 j ) + j = 1 p λ ˜ 4 j 1 λ ˜ 3 j λ ˜ 5 j 1 2 j = 1 p ψ 0 ( λ ˜ 4 j 1 ) + 1 2 j = 1 p ln λ ˜ 5 j .
The gradient of K L ( Λ ˜ ) is:
K L ( Λ ˜ ) = n λ ˜ 1 λ ˜ 2 , n 2 j = 1 p ( λ ˜ 1 j 2 λ ˜ 2 2 1 λ ˜ 3 j ) , 1 2 λ ˜ 3 n λ ˜ 2 2 λ ˜ 3 2 λ ˜ 4 1 λ ˜ 3 2 λ ˜ 5 , 1 λ ˜ 3 λ ˜ 5 1 2 ψ 1 ( λ ˜ 4 1 ) , λ ˜ 4 1 λ ˜ 3 λ ˜ 5 2 + 1 2 λ ˜ 5 .
We substitute θ̃ = (μ̃_g, ṽ_ε, ṽ_f̃, α̃, β̃) back for Λ̃ in ∇KL(Λ̃) to obtain ∇KL(θ̃):
K L ( θ ˜ ) = n μ ˜ g , n 2 j = 1 p ( μ ˜ g j 2 v ˜ f ˜ j ) , 1 2 v ˜ f ˜ n v ˜ f ˜ 2 2 v ϵ ˜ α ˜ v ˜ f ˜ 2 β ˜ , 1 2 ψ 1 ( α ˜ ) + v ˜ f ˜ β ˜ , 1 2 β ˜ 1 α ˜ v ˜ f ˜ β ˜ 2 .
For this algorithm, we set μ̃_g(k+1) = (1 − n)μ̃_g and iterate over the remaining parameters.
4. The gradient-based algorithm concerning natural parameters:
v ϵ ˜ ( k + 1 ) = v ϵ ˜ ( k ) + n 2 j = 1 p ( μ ˜ g j 2 v ˜ f ˜ j ( k ) ) , v ˜ f ˜ ( k + 1 ) = 1 2 v ˜ f ˜ ( k ) + n [ v ˜ f ˜ ( k ) ] 2 2 v ϵ ˜ ( k + 1 ) + α ˜ ( k ) [ v ˜ f ˜ ( k ) ] 2 β ˜ ( k ) , α ˜ ( k + 1 ) = α ˜ ( k ) + 1 2 ψ 1 ( α ˜ ( k ) ) v ˜ f ˜ ( k + 1 ) β ˜ ( k ) , β ˜ ( k + 1 ) = β ˜ ( k ) 1 2 [ β ˜ ( k ) ] 1 + α ˜ ( k + 1 ) v ˜ f ˜ ( k + 1 ) [ β ˜ ( k ) ] 2 .
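Finally, the change of variables between θ̃ and the natural parameters Λ̃ defined above is a simple componentwise map, sketched here (function names are ours):

```python
def to_natural(mu_g, v_eps, v_f, alpha, beta):
    # (mu_g, v_eps, v_f, alpha, beta) -> (lambda_1, ..., lambda_5), componentwise.
    return mu_g / v_eps, 1.0 / v_eps, 1.0 / v_f, alpha + 1.0, beta

def from_natural(l1, l2, l3, l4, l5):
    # Inverse map back to the usual parameterization.
    return l1 / l2, 1.0 / l2, 1.0 / l3, l4 - 1.0, l5
```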

References

  1. Neal, R.M. Slice sampling. Ann. Stat. 2003, 31, 705–767. [Google Scholar] [CrossRef]
  2. Ashton, G.; Bernstein, N.; Buchner, J.; Chen, X.; Csányi, G.; Fowlie, A.; Feroz, F.; Griffiths, M.; Handley, W.; Habeck, M.; et al. Nested sampling for physical scientists. Nat. Rev. Methods Prim. 2022, 2, 39. [Google Scholar] [CrossRef]
  3. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef]
  4. Parisi, G.; Shankar, R. Statistical field theory. Phys. Today 1988, 41, 110. [Google Scholar] [CrossRef]
  5. MacKay, D.J. A practical Bayesian framework for backpropagation networks. Neural Comput. 1992, 4, 448–472. [Google Scholar] [CrossRef]
  6. Neal, R. Bayesian Learning for Neural Networks. Ph.D. Thesis, Department of Computer Science, University of Toronto, Toronto, ON, Canada, 1995. [Google Scholar]
  7. Šmídl, V.; Quinn, A. The Variational Bayes Method in Signal Processing; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  8. Sarkka, S.; Nummenmaa, A. Recursive noise adaptive Kalman filtering by variational Bayesian approximations. IEEE Trans. Autom. Control 2009, 54, 596–600. [Google Scholar] [CrossRef]
  9. Zheng, Y.; Fraysse, A.; Rodet, T. Efficient variational Bayesian approximation method based on subspace optimization. IEEE Trans. Image Process. 2014, 24, 681–693. [Google Scholar] [CrossRef] [PubMed]
  10. Fox, C.W.; Roberts, S.J. A tutorial on variational Bayesian inference. Artif. Intell. Rev. 2012, 38, 85–95. [Google Scholar] [CrossRef]
  11. Gharsalli, L.; Duchêne, B.; Mohammad-Djafari, A.; Ayasso, H. Microwave tomography for breast cancer detection within a variational Bayesian approach. In Proceedings of the 21st European Signal Processing Conference (EUSIPCO 2013), Marrakech, Morocco, 9–13 September 2013; pp. 1–5. [Google Scholar]
  12. Mohammad-Djafari, A. Variational Bayesian approximation method for classification and clustering with a mixture of student-t model. In Geometric Science of Information, Proceedings of the Second International Conference, GSI 2015, Palaiseau, France, 28–30 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 723–731. [Google Scholar]
  13. Mohammad-Djafari, A.; Ayasso, H. Variational Bayes and Mean Field Approximations for Markov field unsupervised estimation. In Proceedings of the 2009 IEEE International Workshop on Machine Learning for Signal Processing, Grenoble, France, 1–4 September 2009; pp. 1–6. [Google Scholar] [CrossRef]
  14. Renard, B.; Garreta, V.; Lang, M. An application of Bayesian analysis and Markov chain Monte Carlo methods to the estimation of a regional trend in annual maxima. Water Resour. Res. 2006, 42. [Google Scholar] [CrossRef]
  15. Li, G.; Shi, J. Applications of Bayesian methods in wind energy conversion systems. Renew. Energy 2012, 43, 1–8. [Google Scholar] [CrossRef]
  16. Yang, D.; Zakharkin, S.O.; Page, G.P.; Brand, J.P.; Edwards, J.W.; Bartolucci, A.A.; Allison, D.B. Applications of Bayesian statistical methods in microarray data analysis. Am. J. Pharmacogenomics 2004, 4, 53–62. [Google Scholar] [CrossRef]
  17. Acerbi, L. Variational bayesian monte carlo. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  18. Kuusela, M.; Raiko, T.; Honkela, A.; Karhunen, J. A gradient-based algorithm competitive with variational Bayesian EM for mixture of Gaussians. In Proceedings of the 2009 International Joint Conference on Neural Networks, Atlanta, GA, USA, 14–19 June 2009; pp. 1688–1695. [Google Scholar]
  19. Gharsalli, L.; Duchêne, B.; Mohammad-Djafari, A.; Ayasso, H. A gradient-like variational Bayesian approach: Application to microwave imaging for breast tumor detection. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 1708–1712. [Google Scholar]
  20. Zhang, G.; Sun, S.; Duvenaud, D.; Grosse, R. Noisy natural gradient as variational inference. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: London, UK, 2018; pp. 5852–5861. [Google Scholar]
  21. Lin, W.; Khan, M.E.; Schmidt, M. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: London, UK, 2019; pp. 3992–4002. [Google Scholar]
  22. Fallah Mortezanejad, S.A.; Mohammad-Djafari, A. Variational Bayesian Approximation (VBA): A Comparison between Three Optimization Algorithms. Phys. Sci. Forum 2023, 5, 48. [Google Scholar] [CrossRef]
  23. Amari, S.I. Natural gradient works efficiently in learning. Neural Comput. 1998, 10, 251–276. [Google Scholar] [CrossRef]
  24. Schraudolph, N.N. Fast curvature matrix-vector products for second-order gradient descent. Neural Comput. 2002, 14, 1723–1738. [Google Scholar] [CrossRef] [PubMed]
  25. Martens, J. New insights and perspectives on the natural gradient method. J. Mach. Learn. Res. 2020, 21, 5776–5851. [Google Scholar]
  26. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  27. Seghouane, A.K.; Amari, S.I. The AIC criterion and symmetrizing the Kullback–Leibler divergence. IEEE Trans. Neural Netw. 2007, 18, 97–106. [Google Scholar] [CrossRef]
  28. Hu, Z.; Hong, L.J. Kullback-Leibler divergence constrained distributionally robust optimization. Available Optim. Online 2013, 1, 9. [Google Scholar]
  29. Hershey, J.R.; Olsen, P.A. Approximating the Kullback Leibler divergence between Gaussian mixture models. In Proceedings of the 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Honolulu, HI, USA, 15–20 April 2007; Volume 4, p. 317. [Google Scholar]
  30. De Canditiis, D.; Vidakovic, B. Wavelet Bayesian block shrinkage via mixtures of normal-inverse gamma priors. J. Comput. Graph. Stat. 2004, 13, 383–398. [Google Scholar] [CrossRef]
  31. Bouriga, M.; Féron, O. Estimation of covariance matrices based on hierarchical inverse-Wishart priors. J. Stat. Plan. Inference 2013, 143, 795–808. [Google Scholar] [CrossRef]
  32. Daniels, M.J.; Kass, R.E. Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. J. Am. Stat. Assoc. 1999, 94, 1254–1263. [Google Scholar] [CrossRef]
  33. Ayasso, H.; Mohammad-djafari, A. Joint image restoration and segmentation using Gauss-Markov-Potts prior models and variational Bayesian computation. In Proceedings of the 2009 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 1297–1300. [Google Scholar] [CrossRef]
  34. Gupta, M.; Srivastava, S. Parametric Bayesian estimation of differential entropy and relative entropy. Entropy 2010, 12, 818–843. [Google Scholar] [CrossRef]
Figure 1. (a) Surface plot of the true model N ( x | 0 , y ) I G ( y | 3 , 1 ) ; (b) Surface plot of the alternative method with K L ( · ) = 0.58 ; (c) Surface plot of the gradient method with γ = 1 and K L ( · ) = 0.58 ; (d) Surface plot of the gradient method with γ = K L ( · ) 1 and K L ( · ) = 0.56 ; (e) Surface plot of the gradient method with natural parameters and K L ( · ) = 0.59 ; (f) Surface plot of the natural gradient method with K L ( · ) = 0.56 .
Figure 2. These plots correspond to Table 1. The columns, from left to right, show the standard alternate optimization, the parametric gradient-based method with γ = 1 and with γ = KL(·)⁻¹, the gradient with natural parameters, and the natural gradient, respectively. The rows of the estimated plots, from top to bottom, correspond to the MLE initializations and to the evidence-free initializations. Visually, the standard algorithm gives an appropriate approximation for the models N(x|0, y)IG(y|3, 1) and N(x|2, y)IG(y|10, 11), the gradient-based method with γ = KL(·)⁻¹ performs acceptably for N(x|1, y)IG(y|4, 6) and N(x|2, y)IG(y|6, 10), and all methods give admissible estimates for N(x|1, y)IG(y|7, 10). (a) The true model N(x|0, y)IG(y|3, 1); (b) The estimation plots of (a); (c) The true model N(x|1, y)IG(y|4, 6); (d) The estimation plots of (c); (e) The true model N(x|1, y)IG(y|7, 10); (f) The estimation plots of (e); (g) The true model N(x|2, y)IG(y|6, 10); (h) The estimation plots of (g); (i) The true model N(x|2, y)IG(y|10, 11); (j) The estimation plots of (i).
Figure 3. The vertical and horizontal axes show the KL(·) values and the number of iterations, respectively. The left and right columns are for the MLE and evidence-free initializations, respectively. (a,b) N(x|0, y)IG(y|3, 1); (c,d) N(x|1, y)IG(y|4, 6); (e,f) N(x|1, y)IG(y|7, 10); (g,h) N(x|2, y)IG(y|6, 10); (i,j) N(x|2, y)IG(y|10, 11).
Figure 4. The corresponding estimation values are in Table 2. From left to right, the columns show the contour plots with the MLE and evidence-free initializations, respectively. The first row shows the result of the standard alternate optimization, and the other two rows show the gradient and natural gradient algorithms, respectively. In this example, the standard alternate method gives the best fit. (a) Model (18); (b) Estimations of the model.
Figure 5. (a) True contour plot of model (5), which is almost separable; (b) Estimations of the model based on, from left to right, the standard alternate analytic approximation, the gradient-based algorithm with γ = 1, with γ = KL(θ̃)⁻¹, and with the natural parameters, and the natural gradient algorithm.
Table 1. Simulation results for five different models using the Normal-Inverse-Gamma distribution. We use two groups of initialization points. The first group is the MLE of each parameter; the second group contains arbitrary initializations that we refer to as evidence-free, since they are not based on any information. When we start from the data-derived points, the results are highly accurate; when we rely on the evidence-free points, the results deviate significantly from the true models. We also report the KL(·) values at the initializations and at the final estimates of each method. Each algorithm minimizes KL(·) in a different way; although the criterion is to minimize KL(·), the alternate algorithm also provides acceptable estimates of the true parameters. The corresponding plots are shown in Figure 2 to give an overview of the approximations and facilitate comparison.
ModelsAlgorithm↓MLE InitializationsEvidence-Free Initializations
Parameters KL ( · ) Parameters KL ( · )
μ v α β μ v α β
N ( x | 0 , y ) I G ( y | 3 , 1 ) Initial Points → 0.04 0.59 3.79 1.47 0.62 0 0.5 1015 0.74
Alternative 0.04 0.54 3.79 1.77 0.58 0 1.55 10 15.25 0.53
Gradient γ = 1 0.04 1.05 3.27 3.39 0.58 0.04 1.45 10.09 14.95 0.53
γ = K L ( θ ˜ ) 1 0.04 1.77 4.29 7.64 0.56 0.04 1.50 10.05 15.07 0.53
Natural Parameters0 0.55 3.71 2.73 0.59 0 1.52 10.17 15.47 0.52
Natural Gradient 0.04 0.74 4.29 3.54 0.56 0.04 2.52 10.50 27.67 0.52
N ( x | 1 , y ) I G ( y | 4 , 6 ) Initial Points → 1.09 1.20 5.35 9.62 0.58 1 3.5 33 1.22
Alternative 1.09 2.02 5.35 10.22 0.55 1 3.24 3 8.37 0.60
Gradient γ = 1 1.09 1.68 5.56 9.55 0.55 1.09 1.40 3.26 4.45 0.58
γ = K L ( θ ˜ ) 1 1.09 2.33 6.15 14.96 0.54 1.09 2.55 2.53 6.58 0.68
Natural Parameters 1.36 1.54 5.44 10.44 0.58 1.10 1.91 4.18 8.21 0.56
Natural Gradient 1.09 2.94 5.85 18.66 0.55 1 3.5 33 1.22
N ( x | 1 , y ) I G ( y | 7 , 10 ) Initial Points → 0.92 1.11 10.82 19.99 0.52 1 0.75 58 0.67
Alternative 0.92 1.95 10.82 20.54 0.52 1 1.75 5 8.38 0.55
Gradient γ = 1 0.92 1.77 10.94 19.94 0.52 0.92 1.48 5.24 7.91 0.55
γ = K L ( θ ˜ ) 1 0.92 3.24 11.84 36.34 0.52 0.86 1.69 5.17 8.25 0.55
Natural Parameters 1.19 1.6 10.87 20.25 0.55 0.97 1.68 5.20 9.01 0.55
Natural Gradient 0.92 2.72 11.32 32.21 0.52 0.92 2.60 5.50 15.58 0.55
N ( x | 2 , y ) I G ( y | 6 , 10 ) Initial Points → 2.09 1.21 8.86 18.66 0.59 2285 1.06
Alternative 2.09 2.24 8.86 19.27 0.53 2 0.98 8 7.34 0.54
Gradient γ = 1 2.09 1.91 9.01 18.61 0.53 2.09 2.08 8.94 18.65 0.53
γ = K L ( θ ˜ ) 1 2.09 3.15 10.27 33.24 0.52 22 8.86 18.66 0.53
Natural Parameters 2.09 1.21 8.86 18.66 0.59 2255 0.71
Natural Gradient 2.09 2.88 9.36 28.39 0.53 22 8.5 5 1.06
N ( x | 2 , y ) I G ( y | 10 , 11 ) Initial Points → 1.93 0.96 17.52 26.54 0.56 2 0.5 87 0.60
Alternative 1.93 1.57 17.52 27.02 0.51 2 1.61 8 12.13 0.53
Gradient γ = 1 1.93 1.47 17.56 26.51 0.51 1.93 1.48 17.57 26.51 0.51
γ = K L ( θ ˜ ) 1 1.93 1.33 17.42 29.43 0.53 1.93 1.50 17.55 26.57 0.51
Natural Parameters 1.93 0.96 17.52 26.54 0.56 2.01 0.56 8.20 7.96 0.60
Natural Gradient 1.93 1.80 18.02 33.32 0.51 1.93 1.58 18.02 29.30 0.51
Table 2. Several models and their estimates obtained via the four optimization algorithms with two different initializations. The lower the KL(·), the more accurate the fitted posterior distribution.
ModelsAlgorithm↓MLE InitializationsEvidence-Free Initializations
Parameters KL ( · ) Parameters KL ( · )
μ 0 κ Λ Ψ ν + n p 1 ν μ 0 κ Λ Ψ ν + n p 1 ν ˜
Initial Points → 1.94 1.01 5 158.08 50.41 50.41 49.38 0.01 0 0 0 5 520.59 4 2 5 10 5 5 4 0.10 0.05 0.05 0.04 5 21.91
Alternative 1.94 1.01 5 1.51 0.48 0.48 0.47 3.03 0.96 0.96 0.96 5 5.59 2.04 1.06 5 0.10 0.05 0.05 0.04 1.66 0.47 0.47 0.53 57
γ = 1 1.94 1.01 5 158.08 50.41 50.41 49.38 0.01 0.01 0.01 0.01 5 520.59 4 2 5 10 5 5 4 0.10 0.05 0.05 0.04 5 21.91
N I W μ ˜ , Σ ˜ | 2 1 , 2 , 3 1 1 1 , 6 Gradient γ = K L ( Θ ˜ ) 1 1.94 1.01 4.85 158.04 50.42 50.42 49.32 0.04 0.01 0.01 0.03 3.99 88.82 4 2 1.31 7.82 1.72 1.72 2.79 0.14 0.01 0.01 0.09 3.99 2.64
 Natural Parameters 1.94 1.01 5 158.08 50.41 50.41 49.38 0.01 0 0 0 5 520.59 4 2 5 10 5 5 4 0.10 0.05 0.05 0.04 5 21.91
Natural Gradient 1.94 1.01 5 158.08 50.41 50.41 49.38 0.01 0 0 0 5 520.59 4 2 5 16.67 9.33 9.33 6.67 0.11 0.03 0.03 0.06 5 11.67
Initial Points → 3.35 1.39 2.25 5 857.08 388.16 553.55 388.16 1041.13 310.01 553.55 310.01 1127.20 0.08 0.04 0.05 0.04 0.10 0.03 0.05 0.03 0.11 5 781.55 4 2 3 5 9 6 6 6 10 6 6 6 6 0.09 0.06 0.06 0.06 0.10 0.06 0.06 0.06 0.06 5 23.53
Alternative 3.35 1.39 2.25 5 8.16 3.70 5.27 3.70 9.92 2.95 5.27 2.95 10.74 16.57 7.50 10.70 7.50 20.13 5.99 10.70 5.99 21.79 5 9.05 3.19 1.32 2.15 5 0.09 0.06 0.06 0.06 0.10 0.06 0.06 0.06 0.06 8.18 3.62 5.29 3.62 9.92 2.89 5.29 2.89 10.64 5 15.89
γ = 1 3.35 1.39 2.25 5 857.08 388.16 553.55 388.16 1041.13 310.01 553.55 310.01 1127.20 0.08 0.04 0.05 0.04 0.10 0.03 0.05 0.03 0.11 5 781.55 4 2 3 5 9 6 6 6 10 6 6 6 6 0.09 0.06 0.06 0.06 0.10 0.06 0.06 0.06 0.06 5 23.53
N I W μ ˜ , Σ ˜ | 3 1 2 , 3 , 3 1 2 1 4 1 2 1 5 , 4 Gradient γ = K L ( Θ ˜ ) 1 3.35 1.39 2.25 5 857.08 388.16 553.55 388.16 1041.13 310.01 553.55 310.01 1127.20 0.08 0.04 0.05 0.04 0.10 0.03 0.05 0.03 0.11 4.11 614.41 4 2 3 4.98 8.95 6 6.05 6 9.96 5.94 6.05 5.94 5.88 0.09 0.06 0.06 0.06 0.10 0.06 0.06 0.06 0.07 4.76 16.44
 Natural Parameters 3.35 1.39 2.25 5 857.08 388.16 553.55 388.16 1041.13 310.01 553.55 310.01 1127.2 0.08 0.04 0.05 0.04 0.1 0.03 0.05 0.03 0.11 5 781.55 4 2 3 5 9 6 6 6 10 6 6 6 6 0.09 0.06 0.06 0.06 0.10 0.06 0.06 0.06 0.06 5 23.53
Natural Gradient 3.35 1.39 2.25 5 857.08 388.16 553.55 388.16 1041.13 310.01 553.55 310.01 1127.2 0.08 0.04 0.05 0.04 0.1 0.03 0.05 0.03 0.11 5 781.55 4 2 3 5 4 2 3 10.8 16 10.8 9.6 10.8 9.6 0.09 0.06 0.06 0.06 0.11 0.05 0.06 0.05 0.07 21.75 21.75
Initial Points → 0.18 0.29 0.42 0.23 0.05 5 473.25 152.04 66.00 31.17 58.19 152.04 334.02 34.61 72.16 221.59 66.00 34.61 947.32 90.75 16.37 31.17 72.16 90.75 553.61 68.83 58.19 221.59 16.37 68.83 545.75 0.05 0.015 0.01 0 0.01 0.015 0.032 0 0.01 0.02 0.01 0 0.09 0.01 0 0 0.01 0.01 0.05 0.01 0.01 0.02 0 0.01 0.05 5 1304.77 0.5 1 0.5 1 0.5 5 10 10 10 10 5 10 10 10 10 10 10 10 8 10 10 10 10 10 10 6 5 10 10 6 10 0.10 0.10 0.10 0.10 0.05 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.08 0.10 0.10 0.10 0.10 0.10 0.10 0.06 0.05 0.10 0.10 0.06 0.10 5 21.85
Alternative 0.18 0.29 0.42 0.23 0.05 5 4.51 1.45 0.63 0.30 0.55 1.45 3.18 0.33 0.69 2.11 0.63 0.33 9.02 0.86 0.16 0.30 0.69 0.86 5.27 0.66 0.55 2.11 0.16 0.66 5.20 9.33 2.30 1.30 0.61 1.15 2.30 6.59 0.68 1.42 4.37 1.30 0.68 18.68 1.79 0.32 0.61 1.42 1.79 10.92 1.36 1.15 4.37 0.32 1.36 10.76 5 17.28 0 0 0 0 0 5 10 10 10 10 5 10 10 10 10 10 10 10 8 10 10 10 10 10 10 6 5 10 10 6 10 0.10 0.10 0.10 0.10 0.05 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.18 0.10 0.10 0.10 0.10 0.10 0.10 0.06 0.05 0.10 0.10 0.06 0.10 5 21.85
γ = 1 0.18 0.29 0.42 0.23 0.05 5 473.25 152.04 66.00 31.17 58.19 152.04 334.02 34.61 72.16 221.59 66.00 34.61 947.32 90.75 16.37 31.17 72.16 90.75 553.61 68.83 58.19 221.59 16.37 68.83 545.75 0.05 0.01 0.01 0.00 0.01 0.01 0.03 0.00 0.01 0.02 0.01 0.00 0.09 0.01 0.00 0.00 0.00 0.01 0.05 0.01 0.01 0.02 0.00 0.00 0.05 5 1304.77 0.5 1 0.5 1 0.5 4.6 10.00 9.64 9.91 10.00 4.55 9.64 9.70 10.49 9.88 10.09 9.91 10.49 7.92 10.11 9.57 10.00 9.88 10.11 10.04 5.98 4.55 10.09 9.57 5.98 9.91 0.10 0.11 0.10 0.10 0.04 0.11 0.10 0.09 0.10 0.10 0.10 0.09 0.08 0.10 0.11 0.10 0.10 0.10 0.10 0.06 0.04 0.10 0.11 0.06 0.11 3.66 11.15
N I W μ ˜ , Σ ˜ | 0 0 0 0 0 , 4 , 6 2 1 1 0 2 4 0 1 2 1 0 9 1 0 1 1 1 7 0 0 2 0 0 5 , 7 Gradient γ = K L ( Θ ˜ ) 1 0.18 0.29 0.42 0.23 0.05 5 473.25 152.04 66.00 31.17 58.19 152.04 334.02 34.61 72.16 221.59 66.00 34.61 947.32 90.75 16.37 31.17 72.16 90.75 553.61 68.83 58.19 221.59 16.37 68.83 545.75 0.05 0.01 0.01 0.00 0.00 0.01 0.04 0.00 0.00 0.02 0.01 0.00 0.10 0.00 0.00 0.00 0.00 0.01 0.06 0.01 0.00 0.02 0.00 0.01 0.06 4.28 1018.67 0.5 1 0.5 1 0.5 4.88 10.00 9.90 9.97 10.00 4.87 9.90 9.91 10.14 9.96 10.03 9.97 10.14 7.98 10.03 9.87 10.00 9.96 10.03 10.01 5.99 4.87 10.03 9.87 5.99 9.97 0.10 0.10 0.10 0.10 0.05 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.08 0.10 0.10 0.10 0.10 0.10 0.10 0.06 0.05 0.10 0.10 0.06 0.10 4.71 19.13
  Natural Parameters 0.18 0.29 0.42 0.23 0.05 5 473.25 152.04 66.00 31.17 58.19 152.04 334.02 34.61 72.16 221.59 66.00 34.61 947.32 90.75 16.37 31.17 72.16 90.75 553.61 68.83 58.19 221.59 16.37 68.83 545.75 0.05 0.015 0.01 0 0.01 0.015 0.032 0 0.01 0.02 0.01 0 0.09 0.01 0 0 0.01 0.01 0.05 0.01 0.01 0.02 0 0.01 0.05 5 1304.77 0.5 1 0.5 1 0.5 5 10 10 10 10 5 10 10 10 10 10 10 10 8 10 10 10 10 10 10 6 5 10 10 6 10 0.10 0.10 0.10 0.10 0.05 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.08 0.10 0.10 0.10 0.10 0.10 0.10 0.06 0.05 0.10 0.10 0.06 0.10 5 21.85
Natural Gradient 0.18 0.29 0.42 0.23 0.05 5 473.25 152.04 66.00 31.17 58.19 152.04 334.02 34.61 72.16 221.59 66.00 34.61 947.32 90.75 16.37 31.17 72.16 90.75 553.61 68.83 58.19 221.59 16.37 68.83 545.75 0.05 0.015 0.01 0 0.01 0.015 0.032 0 0.01 0.02 0.01 0 0.09 0.01 0 0 0.01 0.01 0.05 0.01 0.01 0.02 0 0.01 0.05 5 1304.77 0.18 0.29 0.42 0.23 0.05 5 473.25 152.04 66.00 31.17 58.19 152.04 334.02 34.61 72.16 221.59 66.00 34.61 947.32 90.75 16.37 31.17 72.16 90.75 553.61 68.83 58.19 221.59 16.37 68.83 545.75 0.05 0.015 0.01 0 0.01 0.015 0.032 0 0.01 0.02 0.01 0 0.09 0.01 0 0 0.01 0.01 0.05 0.01 0.01 0.02 0 0.01 0.05 5 1304.77
Table 3. An example of running the algorithms on a larger-scale dataset. The standard alternate algorithm clearly performs best here.
ModelsAlgorithm↓MLE Initializations
Parameters KL ( · )
μ 0 κ Λ Ψ ν + n p 1 ν
Initial Points → 1 2.02 1.02 3.05 1.02 4.05 4.98 10.03 1.08 0.03 5 3.03 0.18 2.69 1.07 0.43 0.38 1.49 0.49 0.5 1.33 0.18 4.73 2.56 1.57 3.75 4.62 3.12 3.48 3.49 1.02 2.69 2.56 11.84 0.18 4.91 3.69 2.44 1.37 1.64 0.25 1.07 1.57 0.18 6.67 2.21 4.75 2.65 1.87 4.21 0.16 0.43 3.75 4.91 2.21 7.62 4.31 4.63 2.93 2.3 0.33 0.38 4.62 3.69 4.75 4.31 8.59 4.52 4.69 7.47 1.78 1.49 3.12 2.44 2.65 4.63 4.52 5.31 3.66 4.01 0.21 0.49 3.48 1.37 1.87 2.93 4.69 3.66 7.73 4.21 5.67 0.5 3.49 1.64 4.21 2.3 7.47 4.01 4.21 9.51 1.81 1.33 1.02 0.25 0.16 0.33 1.78 0.21 5.67 1.81 8.67 0.2 0.01 0.18 0.07 0.03 0.02 0.1 0.03 0.03 0.09 0.01 0.31 0.17 0.1 0.25 0.3 0.21 0.23 0.23 0.07 0.18 0.17 0.78 0.01 0.32 0.24 0.16 0.09 0.11 0.02 0.07 0.1 0.01 0.44 0.15 0.31 0.17 0.12 0.28 0.01 0.03 0.25 0.32 0.15 0.5 0.28 0.3 0.19 0.15 0.02 0.02 0.3 0.24 0.31 0.28 0.56 0.3 0.31 0.49 0.12 0.1 0.21 0.16 0.17 0.3 0.3 0.35 0.24 0.26 0.01 0.03 0.23 0.09 0.12 0.19 0.31 0.24 0.51 0.28 0.37 0.03 0.23 0.11 0.28 0.15 0.49 0.26 0.28 0.63 0.12 0.09 0.07 0.02 0.01 0.02 0.12 0.01 0.37 0.12 0.57 10 5231.57
Alternative 1.09 2.44 1.44 4.19 1.59 5.2 4.61 10.81 3.14 0.79 5 19.75 1.15 17.53 6.96 2.79 2.45 9.72 3.21 3.26 8.68 1.15 30.8 16.64 10.21 24.44 30.1 20.3 22.69 22.69 6.63 17.53 16.64 77.06 1.17 31.99 24.04 15.86 8.9 10.66 1.61 6.96 10.21 1.17 43.41 14.38 30.95 17.28 12.16 27.39 1.05 2.79 24.44 31.99 14.38 49.64 28.05 30.12 19.05 15 2.16 2.45 30.1 24.04 30.95 28.05 55.9 29.45 30.51 48.65 11.62 9.72 20.3 15.86 17.28 30.12 29.45 34.55 23.8 26.11 1.35 3.21 22.69 8.9 12.16 19.05 30.51 23.8 50.35 27.43 36.89 3.26 22.69 10.66 27.39 15 48.65 26.11 27.43 61.89 11.82 8.68 6.63 1.61 1.05 2.16 11.62 1.35 36.89 11.82 56.43 40.9 2.37 36.3 14.41 5.77 5.07 20.13 6.66 6.75 17.97 2.37 63.77 34.45 21.13 50.62 62.33 42.03 46.98 46.99 13.72 36.3 34.45 159.58 2.43 66.25 49.77 32.84 18.44 22.08 3.34 14.41 21.13 2.43 89.89 29.77 64.08 35.79 25.18 56.72 2.17 5.77 50.62 66.25 29.77 102.79 58.09 62.37 39.44 31.05 4.46 5.07 62.33 49.77 64.08 58.09 115.75 60.97 63.17 100.74 24.06 20.13 42.03 32.84 35.79 62.37 60.97 71.53 49.29 54.07 2.8 6.66 46.98 18.44 25.18 39.44 63.17 49.29 104.26 56.79 76.38 6.75 46.99 22.08 56.72 31.05 100.74 54.07 56.79 128.16 24.47 17.97 13.72 3.34 2.17 4.46 24.06 2.8 76.38 24.47 116.85 10 31.70
γ = 1 1.09 2.44 1.44 4.19 1.59 5.2 4.61 10.81 3.14 0.79 5 2074.17 120.32 1840.86 730.59 292.65 257.14 1020.81 337.48 342.19 911.26 120.32 3233.78 1746.98 1071.53 2566.57 3160.45 2131.31 2382.43 2382.86 695.73 1840.86 1746.98 8091.77 123.13 3359.43 2523.71 1665.11 934.91 1119.54 169.52 730.59 1071.53 123.13 4558.21 1509.46 3249.33 1814.84 1276.68 2875.88 109.91 292.65 2566.57 3359.43 1509.46 5212.43 2945.44 3162.79 1999.9 1574.61 226.38 257.14 3160.45 2523.71 3249.33 2945.44 5869.57 3091.85 3203.23 5108.31 1219.85 1020.81 2131.31 1665.11 1814.84 3162.79 3091.85 3627.32 2499.28 2741.79 142.06 337.48 2382.43 934.91 1276.68 1999.9 3203.23 2499.28 5286.96 2879.7 3873.26 342.19 2382.86 1119.54 2875.88 1574.61 5108.31 2741.79 2879.7 6498.71 1240.72 911.26 695.73 169.52 109.91 226.38 1219.85 142.06 3873.26 1240.72 5925.01 0.2 0.01 0.18 0.07 0.03 0.02 0.1 0.03 0.03 0.09 0.01 0.31 0.17 0.1 0.25 0.3 0.21 0.23 0.23 0.07 0.18 0.17 0.78 0.01 0.32 0.24 0.16 0.09 0.11 0.02 0.07 0.1 0.01 0.44 0.15 0.31 0.17 0.12 0.28 0.01 0.03 0.25 0.32 0.15 0.5 0.28 0.3 0.19 0.15 0.02 0.02 0.3 0.24 0.31 0.28 0.56 0.3 0.31 0.49 0.12 0.1 0.21 0.16 0.17 0.3 0.3 0.35 0.24 0.26 0.01 0.03 0.23 0.09 0.12 0.19 0.31 0.24 0.51 0.28 0.37 0.03 0.23 0.11 0.28 0.15 0.49 0.26 0.28 0.63 0.12 0.09 0.07 0.02 0.01 0.02 0.12 0.01 0.37 0.12 0.57 10 5231.57
N I W μ ˜ , Σ ˜ | 1 2 1 3 1 4 5 10 1 0 , 10 , 67 11 50 16 4 2 21 21 10 16 11 92 37 20 49 78 37 51 53 32 50 37 177 14 53 76 41 29 61 9 16 20 14 121 10 88 28 59 89 15 4 49 53 10 114 57 53 44 30 26 2 78 76 88 57 160 69 92 145 43 21 37 41 28 53 69 71 60 73 11 21 51 29 59 44 92 60 170 86 120 10 53 61 89 30 145 73 86 196 33 16 32 9 15 26 43 11 120 33 165 , 15 Gradient γ = K L ( Θ ˜ ) 1 1.09 2.44 1.44 4.19 1.59 5.2 4.61 10.81 3.14 0.79 5 2074.17 120.32 1840.86 730.59 292.65 257.14 1020.81 337.48 342.19 911.27 120.32 3233.78 1746.98 1071.53 2566.57 3160.45 2131.31 2382.43 2382.86 695.73 1840.86 1746.98 8091.77 123.13 3359.43 2523.71 1665.11 934.91 1119.54 169.52 730.59 1071.53 123.13 4558.21 1509.46 3249.33 1814.84 1276.68 2875.88 109.91 292.65 2566.57 3359.43 1509.46 5212.43 2945.44 3162.79 1999.9 1574.61 226.38 257.14 3160.45 2523.71 3249.33 2945.44 5869.57 3091.85 3203.23 5108.31 1219.85 1020.81 2131.31 1665.11 1814.84 3162.79 3091.85 3627.32 2499.28 2741.79 142.06 337.48 2382.43 934.91 1276.68 1999.9 3203.23 2499.28 5286.96 2879.7 3873.26 342.19 2382.86 1119.54 2875.88 1574.61 5108.31 2741.79 2879.7 6498.71 1240.72 911.27 695.73 169.52 109.91 226.38 1219.85 142.06 3873.26 1240.72 5925.01 0.2 0.01 0.18 0.07 0.03 0.02 0.1 0.03 0.03 0.09 0.01 0.32 0.17 0.1 0.25 0.31 0.21 0.23 0.23 0.07 0.18 0.17 0.79 0.01 0.33 0.24 0.16 0.09 0.11 0.02 0.07 0.1 0.01 0.44 0.15 0.31 0.18 0.12 0.28 0.01 0.03 0.25 0.33 0.15 0.51 0.29 0.31 0.19 0.15 0.02 0.02 0.31 0.24 0.31 0.29 0.57 0.3 0.31 0.5 0.12 0.1 0.21 0.16 0.18 0.31 0.3 0.35 0.24 0.27 0.01 0.03 0.23 0.09 0.12 0.19 0.31 0.24 0.51 0.28 0.38 0.03 0.23 0.11 0.28 0.15 0.5 0.27 0.28 0.63 0.12 0.09 0.07 0.02 0.01 0.02 0.12 0.01 0.38 0.12 0.58 9.09 4681.41
Natural Parameters 1 2.02 1.02 3.05 1.02 4.05 4.98 10.03 1.08 0.03 5 3.03 0.18 2.69 1.07 0.43 0.38 1.49 0.49 0.5 1.33 0.18 4.73 2.56 1.57 3.75 4.62 3.12 3.48 3.49 1.02 2.69 2.56 11.84 0.18 4.91 3.69 2.44 1.37 1.64 0.25 1.07 1.57 0.18 6.67 2.21 4.75 2.65 1.87 4.21 0.16 0.43 3.75 4.91 2.21 7.62 4.31 4.63 2.93 2.3 0.33 0.38 4.62 3.69 4.75 4.31 8.59 4.52 4.69 7.47 1.78 1.49 3.12 2.44 2.65 4.63 4.52 5.31 3.66 4.01 0.21 0.49 3.48 1.37 1.87 2.93 4.69 3.66 7.73 4.21 5.67 0.5 3.49 1.64 4.21 2.3 7.47 4.01 4.21 9.51 1.81 1.33 1.02 0.25 0.16 0.33 1.78 0.21 5.67 1.81 8.67 0.2 0.01 0.18 0.07 0.03 0.02 0.1 0.03 0.03 0.09 0.01 0.31 0.17 0.1 0.25 0.3 0.21 0.23 0.23 0.07 0.18 0.17 0.78 0.01 0.32 0.24 0.16 0.09 0.11 0.02 0.07 0.1 0.01 0.44 0.15 0.31 0.17 0.12 0.28 0.01 0.03 0.25 0.32 0.15 0.5 0.28 0.3 0.19 0.15 0.02 0.02 0.3 0.24 0.31 0.28 0.56 0.3 0.31 0.49 0.12 0.1 0.21 0.16 0.17 0.3 0.3 0.35 0.24 0.26 0.01 0.03 0.23 0.09 0.12 0.19 0.31 0.24 0.51 0.28 0.37 0.03 0.23 0.11 0.28 0.15 0.49 0.26 0.28 0.63 0.12 0.09 0.07 0.02 0.01 0.02 0.12 0.01 0.37 0.12 0.57 10 5231.57
Natural Gradient 1 2.02 1.02 3.05 1.02 4.05 4.98 10.03 1.08 0.03 5 3.03 0.18 2.69 1.07 0.43 0.38 1.49 0.49 0.5 1.33 0.18 4.73 2.56 1.57 3.75 4.62 3.12 3.48 3.49 1.02 2.69 2.56 11.84 0.18 4.91 3.69 2.44 1.37 1.64 0.25 1.07 1.57 0.18 6.67 2.21 4.75 2.65 1.87 4.21 0.16 0.43 3.75 4.91 2.21 7.62 4.31 4.63 2.93 2.3 0.33 0.38 4.62 3.69 4.75 4.31 8.59 4.52 4.69 7.47 1.78 1.49 3.12 2.44 2.65 4.63 4.52 5.31 3.66 4.01 0.21 0.49 3.48 1.37 1.87 2.93 4.69 3.66 7.73 4.21 5.67 0.5 3.49 1.64 4.21 2.3 7.47 4.01 4.21 9.51 1.81 1.33 1.02 0.25 0.16 0.33 1.78 0.21 5.67 1.81 8.67 0.2 0.01 0.18 0.07 0.03 0.02 0.1 0.03 0.03 0.09 0.01 0.31 0.17 0.1 0.25 0.3 0.21 0.23 0.23 0.07 0.18 0.17 0.78 0.01 0.32 0.24 0.16 0.09 0.11 0.02 0.07 0.1 0.01 0.44 0.15 0.31 0.17 0.12 0.28 0.01 0.03 0.25 0.32 0.15 0.5 0.28 0.3 0.19 0.15 0.02 0.02 0.3 0.24 0.31 0.28 0.56 0.3 0.31 0.49 0.12 0.1 0.21 0.16 0.17 0.3 0.3 0.35 0.24 0.26 0.01 0.03 0.23 0.09 0.12 0.19 0.31 0.24 0.51 0.28 0.37 0.03 0.23 0.11 0.28 0.15 0.49 0.26 0.28 0.63 0.12 0.09 0.07 0.02 0.01 0.02 0.12 0.01 0.37 0.12 0.57 10 5231.57
Table 4. Additional simulation results for the model g = Hf + ε with 4, 6, and 10 dimensions. The initializations are rough guesses computed from the available data g. Although the values of μ_f and v_f are not needed to run the algorithms, we include two columns for them, since the purpose of an inverse problem is to recover f from the data g. v_f is a diagonal matrix, represented here by the vector of its main diagonal entries.
ModelsAlgorithm↓Data-Based Initializations
Parameters KL ( · )
α β v ϵ μ f v f
Initial Points → 3.08 3.13 2.7 3.13 2.6 1.98 5.35 2.32 3.06 212.55
Alternative 3.08 3.13 2.7 3.13 3.92 2.99 8.17 3.5 3.06 0.02 0.07 0.36 0.07 0.03 0.03 0.03 0.03 12.37
γ = 1 3.08 3.13 2.7 3.13 2.6 1.98 5.35 2.32 3.06 0.02 0.07 0.37 0.07 2.6 1.98 5.35 2.32 212.55
α = 7 4 3 5 , β = 6 3 8 4 , v ϵ = , μ f = 0 , v f = 1.23 0.82 3.23 1.36 Gradient γ = K L ( θ ˜ ) 1 3.01 3.07 2.6 3.06 2.89 2.22 5.68 2.59 6.5 0 0.89 0.24 3.71 0.6 46.37
Natural Parameters 3.08 3.13 2.7 3.13 2.6 1.98 5.35 2.32 3.06 0.02 0.07 0.37 0.07 2.6 1.98 5.35 2.32 212.55
Natural Gradient 3.08 3.13 2.7 3.13 2.6 1.98 5.35 2.32 3.06 0.02 0.07 0.37 0.07 2.6 1.98 5.35 2.32 212.55
Initial Points → 3.58 3.64 3.91 3.41 3.65 3.52 1.52 2.89 4.77 7.56 2.86 2.39 3.67 320.4
Alternative 3.58 3.64 3.91 3.41 3.65 3.52 2.3 4.36 7.23 11.43 4.32 3.63 3.67 0.08 0.03 0.23 0.26 0.02 0.13 0.03 0.04 0.04 0.04 0.04 0.04 16.38
γ = 1 3.58 3.64 3.91 3.41 3.65 3.52 1.52 2.89 4.77 7.56 2.86 2.39 3.67 0.09 0.03 0.24 0.26 0.02 0.14 1.52 2.89 4.77 7.56 2.86 2.39 320.4
α = 6 4 3 2 3 5 , β = 4 5 6 8 3 7 , v ϵ = , μ f = 0 , v f = 0.73 1.91 3.61 7.41 1.3 1.41 Gradient γ = K L ( θ ˜ ) 1 3.52 3.56 3.82 3.32 3.57 3.46 1.73 3.21 5.16 7.92 3.18 2.68 7.58 0.01 0 0.01 0.02 0 0.01 0.27 1.72 3.63 6.46 1.69 1.2 108.9
Natural Parameters 3.58 3.64 3.91 3.41 3.65 3.52 1.52 2.89 4.77 7.56 2.86 2.39 3.67 0.09 0.03 0.24 0.26 0.02 0.14 1.52 2.89 4.77 7.56 2.86 2.39 320.4
Natural Gradient 3.58 3.64 3.91 3.41 3.65 3.52 1.52 2.89 4.77 7.56 2.86 2.39 3.67 0.09 0.03 0.24 0.26 0.02 0.14 1.52 2.89 4.77 7.56 2.86 2.39 320.4
Initial Points → 3.89 3.41 3.66 3.76 3.55 3.9 3.88 3.8 3.51 3.95 2.04 3.79 2.27 7.35 12.92 2.9 1.36 1.33 1.35 2.7 3.8 536.28
Alternative 3.89 3.41 3.66 3.76 3.55 3.9 3.88 3.8 3.51 3.95 3.08 5.85 3.44 11.04 19.46 4.38 2.06 2.01 2.13 4.09 3.8 0.08 0.38 0.13 0.04 0.25 0.1 0.07 0 0.26 0.14 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 27.43
γ = 1 3.89 3.41 3.66 3.76 3.55 3.9 3.88 3.8 3.51 3.95 2.04 3.79 2.27 7.35 12.92 2.9 1.36 1.33 1.35 2.7 3.8 0.09 0.39 0.14 0.04 0.25 0.1 0.08 0 0.29 0.15 2.04 3.79 2.27 7.35 12.92 2.9 1.36 1.33 1.35 2.7 536.28
α = 6 2 3 2 1 5 5 4 4 3 , β = 4 5 2 6 3 7 2 1 1 3 , v ϵ = , μ f = 0 , v f = 0.91 3.51 1.2 6.76 12.4 2.2 0.41 0.33 0.36 1.27 Gradient γ = K L ( θ ˜ ) 1 3.81 3.29 3.57 3.62 3.41 3.8 3.84 3.76 3.47 3.85 2.41 4.25 2.65 7.94 13.52 3.38 1.58 1.53 1.55 3.16 10.04 0 0.02 0.01 0 0.01 0 0 0 0.01 0.01 0.51 2.38 0.77 5.99 11.59 1.43 0.05 0.19 0 1.22 136.63
Natural Parameters 3.89 3.41 3.66 3.76 3.55 3.9 3.88 3.8 3.51 3.95 2.04 3.79 2.27 7.35 12.92 2.9 1.36 1.33 1.35 2.7 3.8 0.09 0.39 0.14 0.04 0.25 0.1 0.08 0 0.29 0.15 2.04 3.79 2.27 7.35 12.92 2.9 1.36 1.33 1.35 2.7 536.28
Natural Gradient 3.89 3.41 3.66 3.76 3.55 3.9 3.88 3.8 3.51 3.95 2.04 3.79 2.27 7.35 12.92 2.9 1.36 1.33 1.35 2.7 3.8 0.09 0.39 0.14 0.04 0.25 0.1 0.08 0 0.29 0.15 2.04 3.79 2.27 7.35 12.92 2.9 1.36 1.33 1.35 2.7 536.28