Article

Survey on Probabilistic Models of Low-Rank Matrix Factorizations

1 School of Science, Xi’an University of Architecture and Technology, Xi’an 710055, China
2 School of Architecture, Xi’an University of Architecture and Technology, Xi’an 710055, China
* Author to whom correspondence should be addressed.
Entropy 2017, 19(8), 424; https://doi.org/10.3390/e19080424
Submission received: 12 June 2017 / Revised: 11 August 2017 / Accepted: 16 August 2017 / Published: 19 August 2017
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

Abstract: Low-rank matrix factorizations such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) are a large class of methods for pursuing the low-rank approximation of a given data matrix. The conventional factorization models are based on the assumption that the data matrices are contaminated stochastically by some type of noise. Thus, point estimates of the low-rank components can be obtained by Maximum Likelihood (ML) estimation or Maximum a Posteriori (MAP) estimation. In the past decade, a variety of probabilistic models of low-rank matrix factorizations have emerged. The most significant difference between low-rank matrix factorizations and their corresponding probabilistic models is that the latter treat the low-rank components as random variables. This paper surveys the probabilistic models of low-rank matrix factorizations. Firstly, we review some probability distributions commonly used in probabilistic models of low-rank matrix factorizations and introduce the conjugate priors of some probability distributions to simplify the Bayesian inference. Then we provide two main inference methods for probabilistic low-rank matrix factorizations, i.e., Gibbs sampling and variational Bayesian inference. Next, we roughly classify the important probabilistic models of low-rank matrix factorizations into several categories and review them respectively. The categories are based on different matrix factorization formulations, which mainly include PCA, matrix factorizations, robust PCA, NMF and tensor factorizations. Finally, we discuss the research issues that need to be studied in the future.

1. Introduction

In many practical applications, a commonly used assumption is that the dataset approximately lies in a low-dimensional linear subspace. Low-rank matrix factorizations (or decompositions; the two terms are used interchangeably) are a type of method for recovering the low-rank structure, removing noise and completing missing values. Principal Component Analysis (PCA) [1], a traditional matrix factorization method, copes effectively with the situation in which the dataset is contaminated by Gaussian noise. If the mean vector is set to the zero vector, then PCA reduces to Singular Value Decomposition (SVD) [2]. Classical PCA approximates the low-rank representation via the eigendecomposition of the covariance matrix of the dataset under investigation. When there are outliers or large sparse errors, the original version of PCA does not work well for obtaining the low-rank representations. In this case, many robust variants of PCA have been studied, such as L1-PCA [3], L1-norm maximization PCA [4], L21-norm maximization PCA [5] and so on.
In this paper, low-rank matrix factorizations refer to a general formulation of matrix factorizations and they mainly consist of PCA, matrix factorizations, robust PCA [6], Non-negative Matrix Factorization (NMF) [7] and tensor decompositions [8]. As a matter of fact, matrix factorizations are usually a special case of PCA: they directly factorize a data matrix into the product of two low-rank matrices without considering the mean vector. Matrix completion aims to fill in all missing values according to the approximate low-rank structure and is mathematically described as a nuclear norm minimization problem [9]. For this reason, we regard matrix completion as a special case of matrix factorizations. In addition, robust PCA decomposes the data matrix into the superposition of a low-rank matrix and a sparse component, and recovers both the low-rank and the sparse matrices via principal component pursuit [6].
Low-rank matrix factorizations suppose that the low-rank components are corrupted by certain random noise and that the low-rank matrices are deterministic unknown parameters. Maximum Likelihood (ML) estimation and Maximum a Posteriori (MAP) estimation are the two most popular methodologies for estimating the low-rank matrices. A prominent advantage of these point estimation methods is that they are simple and easy to implement. However, they cannot provide the probability distributions of the low-rank matrices, which are prerequisites for exploring the generative models. Probabilistic models of low-rank matrix factorizations can approximate the true probability distributions of the low-rank matrices and they have attracted a great deal of attention in the past two decades. These models have been widely applied in the fields of signal and image processing, computer vision, machine learning and so on.
Tipping and Bishop [10] originally presented probabilistic PCA in which the latent variables are assumed to be a unit isotropic Gaussian distribution. Subsequently, several other probabilistic models of PCA were proposed successively, such as Bayesian PCA [11], Robust L1 PCA [12], Bayesian robust PCA [13] and so on. As a special case of probabilistic models of PCA, Bayesian matrix factorization [14] placed zero-mean spherical Gaussian priors on two low-rank matrices and it was applied to collaborative filtering. Probabilistic models of matrix factorizations also include probabilistic matrix factorization [15], Bayesian probabilistic matrix factorization [16], robust Bayesian matrix factorization [17], probabilistic robust matrix factorization [18], sparse Bayesian matrix completion [19], Bayesian robust matrix factorization [20] and Bayesian L1-norm low-rank matrix factorizations [21].
For robust PCA, small dense noise is also taken into account. In other words, the data matrix is decomposed into the sum of a low-rank matrix, a sparse noise matrix and a dense Gaussian noise matrix. Hence, the corresponding probabilistic models are robust to outliers and large sparse noise, and they mainly comprise Bayesian robust PCA [22], variational Bayesian robust PCA [23] and sparse Bayesian robust PCA [19]. As another type of low-rank matrix factorization, NMF decomposes a non-negative data matrix into the product of two non-negative low-rank matrices. Compared with PCA, NMF is a technique which learns parts-based, not holistic, representations [7]. Different probabilistic models of NMF were proposed in [24,25,26,27,28]. Recently, probabilistic models of low-rank matrix factorizations have also been extended to the case of tensor decompositions. Tucker decomposition and CANDECOMP/PARAFAC (CP) decomposition are the two most important tensor decompositions. By generalizing the subspace approximation, some new low-rank tensor decompositions have emerged, such as the hierarchical Tucker (HT) format [29,30], the tensor tree structure [31] and the tensor train (TT) format [32]. Among them, the TT format is a special case of the HT format and the tensor tree structure [33]. Probabilistic models of Tucker decomposition were presented in [34,35,36] and those of CP decomposition were developed in [37,38,39,40,41,42,43,44,45,46,47,48].
For probabilistic models of low-rank matrix factorizations, there are three frequently used statistical approaches for inferring the corresponding probability distributions or parameters, i.e., Expectation Maximization (EM) [49,50,51,52,53,54], Gibbs sampling (or a Gibbs sampler) [54,55,56,57,58,59,60,61,62] and variational Bayesian (VB) inference [54,62,63,64,65,66,67,68,69,70,71]. EM is an iterative algorithm with guaranteed local convergence for ML estimation and does not require explicit evaluation of the likelihood function [70]. Although it has many advantages over plain ML, EM tends to be limited in applications because of its strict requirements on the posterior of the hidden variables, and it cannot be used to solve complex Bayesian models [70]. VB inference relaxes some limitations of the EM algorithm and ameliorates its shortcomings. Besides VB inference, Gibbs sampling is another method used to infer the probability distributions of all parameters and hidden variables.
This paper provides a survey on probabilistic models of low-rank matrix factorizations. Firstly, we review the significant probability distributions commonly-used in probabilistic low-rank matrix factorizations and introduce the conjugate priors that are essential to Bayesian statistics. Then we present two most important methods for inferring the probability distributions, that is, Gibbs sampling and VB inference. Next, we roughly divide the available low-rank matrix factorization models into five categories: PCA, matrix factorizations, robust PCA, NMF and tensor decompositions. For each category, we provide a detailed overview of the corresponding probabilistic models.
A central task for probabilistic low-rank matrix factorizations is to predict the missing or incomplete data. For the sake of conciseness, we do not consider missing entries in any model except the sparse Bayesian model of matrix completion. The remainder of this paper is organized as follows. Section 2 introduces the commonly used probability distributions and the conjugate priors. Section 3 presents two frequently used inference methods: Gibbs sampling and VB inference. Probabilistic models of PCA and matrix factorizations are reviewed respectively in Section 4 and Section 5. Section 6 and Section 7 survey probabilistic models of robust PCA and NMF, respectively. Section 8 provides other probabilistic models of low-rank matrix factorizations and probabilistic tensor factorizations. Conclusions and future research directions are given in Section 9.
Notation: Let $\mathbb{R}$ be the set of real numbers and $\mathbb{R}_+$ the set of non-negative real numbers. We denote scalars with italic letters (e.g., $x$), vectors with bold letters (e.g., $\mathbf{x}$), matrices with bold capital letters (e.g., $\mathbf{X}$) and sets with italic capital letters (e.g., $X$). Given a matrix $\mathbf{X}$, its $i$th row, $j$th column and $(i,j)$th element are expressed as $\mathbf{x}_i$, $\mathbf{x}_j$ and $X_{ij}$, respectively. If $\mathbf{X}$ is square, let $\mathrm{Tr}(\mathbf{X})$ and $|\mathbf{X}|$ be the trace and the determinant of $\mathbf{X}$, respectively.

2. Probability Distributions and Conjugate Priors

This section will introduce some probability distributions commonly adopted in probabilistic models of low-rank matrix factorizations and discuss the conjugate priors for algebraic and intuitive convenience in Bayesian statistics.

2.1. Probability Distributions

We consider some significant probability distributions that will be needed in the following sections. Given a univariate random variable x or a random vector x , we denote its probability density/mass function by p ( x ) or p ( x ) . Let E ( x ) be the expectation of x and Cov ( x , x ) the covariance matrix of x .
Several probability distributions, such as the Gamma distribution and the Beta distribution, involve the gamma function defined by
$$\Gamma(x) = \int_0^{+\infty} u^{x-1}\exp(-u)\,du.$$
We summarize the probability distributions used frequently in probabilistic models of low-rank matrix factorizations, as shown in Table 1. In this table, we list the notation for each probability distribution, the probability density/mass function, the expectation and the variance/covariance respectively. For the Wishart distribution W ( Λ | W , v ) , the term B ( W , v ) is given by
$$B(\mathbf{W},v) = |\mathbf{W}|^{-v/2}\left(2^{vd/2}\,\pi^{d(d-1)/4}\prod_{i=1}^{d}\Gamma\!\left(\frac{v+1-i}{2}\right)\right)^{-1}$$
where the positive integer $v \geq d-1$ is named the degrees of freedom. For the generalized inverse Gaussian distribution $\mathrm{GIG}(x|p,a,b)$, $K_p(\cdot)$ is a modified Bessel function of the second kind. For brevity of notation, we stipulate a few simple representations of a random variable or vector. For instance, if $\mathbf{x}$ follows a multivariate Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, we can write this distribution as $\mathbf{x} \sim N(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})$, or $\mathbf{x} \sim N(\boldsymbol{\mu},\boldsymbol{\Sigma})$, or $p(\mathbf{x}) = N(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})$. In addition, some probability distributions can be extended to the multivariate case under identical conditions. The following cites an example: if random variables $x_i \sim \mathrm{St}(\mu_i,\lambda_i,v_i)$ and $x_1,x_2,\ldots,x_N$ are independent, then we have $\mathbf{x} = (x_1,x_2,\ldots,x_N)^T \sim \mathrm{St}(\boldsymbol{\mu},\boldsymbol{\lambda},\mathbf{v})$ with the probability density function $p(\mathbf{x}) = \prod_{i=1}^{N}\mathrm{St}(x_i|\mu_i,\lambda_i,v_i)$, where $\boldsymbol{\mu} = (\mu_1,\mu_2,\ldots,\mu_N)^T$, $\boldsymbol{\lambda} = (\lambda_1,\lambda_2,\ldots,\lambda_N)^T$ and $\mathbf{v} = (v_1,v_2,\ldots,v_N)^T$.
There exist close relationships among probability distributions listed in Table 1, as shown partially in Figure 1. Refer to [54,62,66] for more probability distributions. Moreover, some continuous random variables can be reformulated as Gaussian scale mixtures with other distributions from a hierarchical viewpoint. Now, we give two most frequently used examples as below.
Example 1.
For given parameter $\mu$, if the conditional probability distribution of $x$ is $p(x|\lambda) = N(x|\mu,\lambda^{-1})$ and the prior distribution of $\lambda$ is $p(\lambda) = \mathrm{IGam}(\lambda|1,1)$, then we have $p(x) = \mathrm{Lap}(x|\mu,\sqrt{2}/2)$.
The probability density function of $x$ is derived as:
$$p(x) = \int_0^{+\infty} p(\lambda)\,p(x|\lambda)\,d\lambda = \int_0^{+\infty}\frac{1}{\sqrt{2\pi}}\,\lambda^{-3/2}\exp\!\left(-\frac{1}{\lambda}\right)\exp\!\left(-\frac{1}{2}\lambda(x-\mu)^2\right)d\lambda \overset{\lambda=2t}{=} \int_0^{+\infty}\frac{1}{2\sqrt{\pi}}\,t^{-3/2}\exp\!\left(-\frac{1}{2t}\right)\exp\!\left(-t(x-\mu)^2\right)dt = \mathcal{L}\!\left\{\frac{1}{2\sqrt{\pi}}\,t^{-3/2}\exp\!\left(-\frac{1}{2t}\right)\right\}\!\left((x-\mu)^2\right).$$
Meanwhile, it holds that
$$\mathcal{L}^{-1}\!\left\{\frac{\sqrt{2}}{2}\exp\!\left(-\sqrt{2}\sqrt{(x-\mu)^2}\right)\right\}(t) = \frac{1}{2\sqrt{\pi}}\,t^{-3/2}\exp\!\left(-\frac{1}{2t}\right)$$
where $\mathcal{L}\{\cdot\}$ is the Laplace transform and $\mathcal{L}^{-1}\{\cdot\}$ is the inverse Laplace transform. Hence, we get $x \sim \mathrm{Lap}(\mu,\sqrt{2}/2)$.
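As a simple numerical check of this scale-mixture identity (an illustrative sketch with an arbitrary value of $\mu$, not part of the surveyed models), one can sample the hierarchical model and compare the draws with the Laplace density $\mathrm{Lap}(x|\mu,\sqrt{2}/2)$:
```python
# Numerical check of Example 1: lambda ~ IGam(1, 1), x | lambda ~ N(mu, 1/lambda),
# compared against Lap(mu, sqrt(2)/2). The value of mu is arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, n = 1.5, 200_000

var = rng.gamma(shape=1.0, scale=1.0, size=n)  # 1/lambda ~ Gam(1, 1), i.e. lambda ~ IGam(1, 1)
x = rng.normal(loc=mu, scale=np.sqrt(var))     # x | lambda ~ N(mu, lambda^{-1})

# One-sample Kolmogorov-Smirnov test against the Laplace CDF with scale sqrt(2)/2.
ks = stats.kstest(x, stats.laplace(loc=mu, scale=np.sqrt(2) / 2).cdf)
print(f"KS statistic = {ks.statistic:.4f}")    # close to 0 when the two distributions match
```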
Example 2.
For given parameters $\mu$ and $\tau$, if the conditional probability distribution of $x$ is $p(x|u) = N(x|\mu,(\tau u)^{-1})$ and the prior distribution of $u$ is $p(u|v) = \mathrm{Gam}(u|v/2,v/2)$, then it holds that $p(x|v) = \mathrm{St}(x|\mu,\tau,v)$, where $v$ is a fixed parameter.
The derivation process for $p(x|v)$ is described as follows:
$$p(x|v) = \int_0^{+\infty} p(x,u|v)\,du = \int_0^{+\infty} p(u|v)\,p(x|u)\,du \propto \int_0^{+\infty} u^{(v+1)/2-1}\exp\!\left(-\frac{u\left(v+(x-\mu)^2\tau\right)}{2}\right)du \propto \frac{1}{\left(v+(x-\mu)^2\tau\right)^{(v+1)/2}}.$$
So, p ( x | v ) = St ( x | μ , τ , v ) .

2.2. Conjugate Priors

Let x be a random vector with the parameter vector z and X = { x 1 , x 2 , , x N } a collection of N observed samples. In the presence of latent variables, they are also absorbed into z . For given z , the conditional probability density/mass function of x is denoted by p ( x | z ) . Thus, we can construct the likelihood function:
$$L(\mathbf{z}|X) = p(X|\mathbf{z}) = \prod_{i=1}^{N} p(\mathbf{x}_i|\mathbf{z}).$$
As for variational Bayesian methods, the parameter vector z is usually assumed to be stochastic. Here, the prior distribution of z is expressed as p ( z ) .
To simplify Bayesian analysis, we hope that the posterior distribution p ( z | X ) is in the same functional form as the prior p ( z ) . Under this circumstance, the prior and the posterior are called conjugate distributions and the prior is also called a conjugate prior for the likelihood function L ( z | X ) [54,66]. In the following, we provide three most commonly-used examples of conjugate priors.
Example 3.
Assume that random variable x obeys the Bernoulli distribution with parameter μ . We have the likelihood function for x :
$$L(\mu|X) = \prod_{i=1}^{N}\mathrm{Bern}(x_i|\mu) = \prod_{i=1}^{N}\mu^{x_i}(1-\mu)^{1-x_i} = \mu^{\sum_{i=1}^{N}x_i}(1-\mu)^{N-\sum_{i=1}^{N}x_i}$$
where the observations x i { 0 , 1 } . In consideration of the form of L ( μ | X ) , we stipulate the prior distribution of μ as the Beta distribution with parameters a and b :
$$p(\mu) = \mathrm{Beta}(\mu|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}.$$
At this moment, we get the posterior distribution of μ via the Bayes’ rule:
$$p(\mu|X) = \frac{p(\mu)\,p(X|\mu)}{p(X)} \propto p(\mu)\,p(X|\mu).$$
Because
$$p(\mu)\,p(X|\mu) = p(\mu)\,L(\mu|X) \propto \mu^{a-1}(1-\mu)^{b-1}\,\mu^{\sum_{i=1}^{N}x_i}(1-\mu)^{N-\sum_{i=1}^{N}x_i} = \mu^{a+\sum_{i=1}^{N}x_i-1}(1-\mu)^{b+N-\sum_{i=1}^{N}x_i-1}$$
we have $p(\mu|X) \sim \mathrm{Beta}\!\left(a+\sum_{i=1}^{N}x_i,\; b+N-\sum_{i=1}^{N}x_i\right)$. The conclusion means that the Beta distribution is the conjugate prior for the Bernoulli likelihood.
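A small numerical illustration of this conjugate update, with hypothetical data generated from a Bernoulli distribution, is sketched below:
```python
# Beta-Bernoulli conjugate update from Example 3 on hypothetical data:
# prior Beta(a, b), observations x_1, ..., x_N in {0, 1},
# posterior Beta(a + sum(x), b + N - sum(x)).
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                      # hyperparameters of the Beta prior
x = rng.binomial(1, p=0.3, size=50)  # N = 50 Bernoulli(0.3) observations

a_post = a + x.sum()
b_post = b + len(x) - x.sum()
print(f"posterior: Beta({a_post:.0f}, {b_post:.0f}), "
      f"posterior mean = {a_post / (a_post + b_post):.3f}")
```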
Example 4.
Assume that random variable x obeys a univariate Gaussian distribution N ( μ , λ 1 ) , where λ is also named the precision. The likelihood function of x for given μ is
$$L(\lambda|\mu,X) = \prod_{i=1}^{N}N(x_i|\mu,\lambda^{-1}) = \prod_{i=1}^{N}\sqrt{\frac{\lambda}{2\pi}}\exp\!\left(-\frac{\lambda}{2}(x_i-\mu)^2\right) = \frac{\lambda^{N/2}}{(2\pi)^{N/2}}\exp\!\left(-\frac{\lambda}{2}\sum_{i=1}^{N}(x_i-\mu)^2\right).$$
We further suppose that the prior distribution of λ is the Gamma distribution with parameters a and b. Let p ( λ | μ , X ) be the posterior distribution of λ . Then we have
$$p(\lambda|\mu,X) \propto p(\lambda)\,p(X|\mu,\lambda).$$
Because
$$p(\lambda)\,p(X|\mu,\lambda) = \frac{b^a}{\Gamma(a)}\lambda^{a-1}\exp(-b\lambda)\cdot\frac{\lambda^{N/2}}{(2\pi)^{N/2}}\exp\!\left(-\frac{\lambda}{2}\sum_{i=1}^{N}(x_i-\mu)^2\right) \propto \lambda^{a+N/2-1}\exp\!\left(-\lambda\Big(b+\sum_{i=1}^{N}(x_i-\mu)^2/2\Big)\right)$$
we get $p(\lambda|\mu,X) \sim \mathrm{Gam}\!\left(a+N/2,\; b+\sum_{i=1}^{N}(x_i-\mu)^2/2\right)$. Therefore, the conjugate prior of the precision for a Gaussian likelihood is a Gamma distribution.
Example 5.
Assume that random vector x obeys a d-dimensional Gaussian distribution N ( μ , Λ 1 ) , where Λ , the inverse of covariance matrix, is called the precision matrix. We consider the case that both μ and Λ are unknown. Thus, the likelihood function for x is
$$L(\boldsymbol{\mu},\boldsymbol{\Lambda}|X) = \prod_{i=1}^{N}N(\mathbf{x}_i|\boldsymbol{\mu},\boldsymbol{\Lambda}^{-1}) = \prod_{i=1}^{N}\frac{|\boldsymbol{\Lambda}|^{1/2}}{(2\pi)^{d/2}}\exp\!\left(-\frac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu})^T\boldsymbol{\Lambda}(\mathbf{x}_i-\boldsymbol{\mu})\right) = \frac{|\boldsymbol{\Lambda}|^{N/2}}{(2\pi)^{dN/2}}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})^T\boldsymbol{\Lambda}(\mathbf{x}_i-\boldsymbol{\mu})\right).$$
The prior distribution of ( μ , Λ ) is given by a Gaussian–Wishart distribution N W ( μ , Λ | μ 0 , β , W , v ) with the joint probability density function:
$$p(\boldsymbol{\mu},\boldsymbol{\Lambda}|\boldsymbol{\mu}_0,\beta,\mathbf{W},v) = N(\boldsymbol{\mu}|\boldsymbol{\mu}_0,(\beta\boldsymbol{\Lambda})^{-1})\,\mathrm{W}(\boldsymbol{\Lambda}|\mathbf{W},v)$$
where μ 0 , β , W and v are the fixed hyperparameters. Hence, it holds that
$$p(\boldsymbol{\mu},\boldsymbol{\Lambda}|\boldsymbol{\mu}_0,\beta,\mathbf{W},v)\,p(X|\boldsymbol{\mu},\boldsymbol{\Lambda}) = N(\boldsymbol{\mu}|\boldsymbol{\mu}_0,(\beta\boldsymbol{\Lambda})^{-1})\,\mathrm{W}(\boldsymbol{\Lambda}|\mathbf{W},v)\,L(\boldsymbol{\mu},\boldsymbol{\Lambda}|X) \propto |\boldsymbol{\Lambda}|^{(N+v-d)/2}\exp\!\left(-\frac{1}{2}\Big((\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T(\beta\boldsymbol{\Lambda})(\boldsymbol{\mu}-\boldsymbol{\mu}_0)+\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})^T\boldsymbol{\Lambda}(\mathbf{x}_i-\boldsymbol{\mu})+\mathrm{Tr}(\mathbf{W}^{-1}\boldsymbol{\Lambda})\Big)\right).$$
Denote $\bar{\mathbf{x}} = \sum_{i=1}^{N}\mathbf{x}_i/N$ and $\mathbf{X} = (\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N)$. To obtain the above probability density function, we first derive the following formula:
$$\begin{aligned}
&(\boldsymbol{\mu}-\boldsymbol{\mu}_0)^T(\beta\boldsymbol{\Lambda})(\boldsymbol{\mu}-\boldsymbol{\mu}_0)+\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})^T\boldsymbol{\Lambda}(\mathbf{x}_i-\boldsymbol{\mu})\\
&\quad= \boldsymbol{\mu}^T(\beta\boldsymbol{\Lambda})\boldsymbol{\mu}-2\boldsymbol{\mu}_0^T(\beta\boldsymbol{\Lambda})\boldsymbol{\mu}+\boldsymbol{\mu}_0^T(\beta\boldsymbol{\Lambda})\boldsymbol{\mu}_0+\sum_{i=1}^{N}\left(\boldsymbol{\mu}^T\boldsymbol{\Lambda}\boldsymbol{\mu}-2\mathbf{x}_i^T\boldsymbol{\Lambda}\boldsymbol{\mu}+\mathbf{x}_i^T\boldsymbol{\Lambda}\mathbf{x}_i\right)\\
&\quad= (\beta+N)\boldsymbol{\mu}^T\boldsymbol{\Lambda}\boldsymbol{\mu}-2\big(\boldsymbol{\Lambda}(\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}})\big)^T\boldsymbol{\mu}+\mathrm{Tr}\!\left((\beta\boldsymbol{\mu}_0\boldsymbol{\mu}_0^T+\mathbf{X}\mathbf{X}^T)\boldsymbol{\Lambda}\right)\\
&\quad= \left(\boldsymbol{\mu}-\tfrac{\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}}}{\beta+N}\right)^T\!\big((\beta+N)\boldsymbol{\Lambda}\big)\left(\boldsymbol{\mu}-\tfrac{\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}}}{\beta+N}\right)+\mathrm{Tr}\!\left(\Big(\beta\boldsymbol{\mu}_0\boldsymbol{\mu}_0^T+\mathbf{X}\mathbf{X}^T-\tfrac{(\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}})(\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}})^T}{\beta+N}\Big)\boldsymbol{\Lambda}\right).
\end{aligned}$$
Then we get
$$p(\boldsymbol{\mu},\boldsymbol{\Lambda}|\boldsymbol{\mu}_0,\beta,\mathbf{W},v)\,p(X|\boldsymbol{\mu},\boldsymbol{\Lambda}) \propto |\boldsymbol{\Lambda}|^{1/2}\exp\!\left(-\frac{1}{2}\left(\boldsymbol{\mu}-\tfrac{\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}}}{\beta+N}\right)^T\!\big((\beta+N)\boldsymbol{\Lambda}\big)\left(\boldsymbol{\mu}-\tfrac{\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}}}{\beta+N}\right)\right)\cdot|\boldsymbol{\Lambda}|^{(N+v-d-1)/2}\exp\!\left(-\frac{1}{2}\mathrm{Tr}\!\left(\Big(\mathbf{W}^{-1}+\beta\boldsymbol{\mu}_0\boldsymbol{\mu}_0^T+\mathbf{X}\mathbf{X}^T-\tfrac{(\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}})(\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}})^T}{\beta+N}\Big)\boldsymbol{\Lambda}\right)\right).$$
Because $p(\boldsymbol{\mu},\boldsymbol{\Lambda}|X) \propto p(\boldsymbol{\mu},\boldsymbol{\Lambda}|\boldsymbol{\mu}_0,\beta,\mathbf{W},v)\,p(X|\boldsymbol{\mu},\boldsymbol{\Lambda})$, we have
$$p(\boldsymbol{\mu},\boldsymbol{\Lambda}|X) \sim N\mathrm{W}\!\left(\frac{\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}}}{\beta+N},\;\beta+N,\;\Big(\mathbf{W}^{-1}+\beta\boldsymbol{\mu}_0\boldsymbol{\mu}_0^T+\mathbf{X}\mathbf{X}^T-\frac{(\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}})(\beta\boldsymbol{\mu}_0+N\bar{\mathbf{x}})^T}{\beta+N}\Big)^{-1},\;N+v\right).$$
This example shows that the conjugate prior of $(\boldsymbol{\mu},\boldsymbol{\Lambda})$ for a multivariate Gaussian $N(\mathbf{x}|\boldsymbol{\mu},\boldsymbol{\Lambda}^{-1})$ is a Gaussian–Wishart distribution.
There are some other conclusions on conjugate priors. For instance, given a Gaussian likelihood, the conjugate prior for the mean is another Gaussian distribution, and the conjugate prior for the precision matrix is a Wishart distribution. All probability distributions discussed in Section 2.1 belong to a broad class of distributions, that is, the exponential family. It is shown that the exponential family distributions have conjugate priors [54,66,72].

3. Gibbs Sampling and Variational Bayesian Inference

Due to the existence of the latent variables or unknown parameters, computing the posterior distribution is analytically intractable in general. In this section, we will provide two approximation inference methods: Gibbs sampling and variational Bayesian inference.

3.1. Gibbs Sampling

As a powerful framework for sampling from a probability distribution, Markov chain Monte Carlo (MCMC) methods, also called Markov chain simulation, construct a Markov chain such that its equilibrium distribution is the desired distribution [73,74,75,76,77,78,79,80]. Random walk Monte Carlo methods are a large subclass of MCMC and they include mainly Metropolis–Hastings [54,62,66,81], Gibbs sampling, Slice sampling [54,62,66], Multiple-try Metropolis [82,83].
Gibbs sampling, or the Gibbs sampler, is a simple MCMC algorithm that is especially applicable for generating a sequence of observations that are approximately distributed according to a specified probability distribution when direct sampling is intractable. In addition, the basic version of Gibbs sampling is a special case of the Metropolis–Hastings algorithm [54,62]. In its detailed implementation, Gibbs sampling adopts the strategy of sampling from conditional probability distributions instead of marginalizing the joint probability distribution by integrating over the other variables. In other words, Gibbs sampling alternately generates an instance of each variable from its conditional probability distribution with the other variables held fixed.
Let $\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_M$ be $M$ blocks of random variables and set $\mathbf{z} = (\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_M)$. The joint probability distribution of $\mathbf{z}$ is written as $p(\mathbf{z}) = p(\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_M)$. In this subsection, we only consider the case in which it is difficult to sample directly from the joint probability distribution $p(\mathbf{z})$. To this end, we can sample each component $\mathbf{z}_i$ from its marginal distribution in turn. The marginal distribution of $\mathbf{z}_i$ can theoretically be obtained by the following formulation:
$$p(\mathbf{z}_i) = \int p(\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_M)\,d\mathbf{z}_1\cdots d\mathbf{z}_{i-1}\,d\mathbf{z}_{i+1}\cdots d\mathbf{z}_M.$$
Generally speaking, the integral in the above formula is not tractable. A wiser sampling strategy is to generate $\mathbf{z}_i$ according to the conditional probability distribution $p(\mathbf{z}_i|\mathbf{z}\setminus\mathbf{z}_i)$, where $\mathbf{z}\setminus\mathbf{z}_i = (\mathbf{z}_1,\ldots,\mathbf{z}_{i-1},\mathbf{z}_{i+1},\ldots,\mathbf{z}_M)$. By using Bayes’ rule, the relationship between the conditional probability distribution and the joint probability distribution is given as follows:
$$p(\mathbf{z}_i|\mathbf{z}\setminus\mathbf{z}_i) = \frac{p(\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_M)}{p(\mathbf{z}_1,\ldots,\mathbf{z}_{i-1},\mathbf{z}_{i+1},\ldots,\mathbf{z}_M)} \propto p(\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_M).$$
The sampling procedure is repeated by cycling through all variables. We summarize the outline of Gibbs sampling as below.
1. Initialize $\mathbf{z}_i = \mathbf{z}_i^{(0)}$, $i = 2,\ldots,M$.
2. For $k = 1,2,\ldots,T$
   For $i = 1,2,\ldots,M$
    Generate $\mathbf{z}_i^{(k)}$ from the conditional probability distribution $p(\mathbf{z}_i|\mathbf{z}_1^{(k)},\ldots,\mathbf{z}_{i-1}^{(k)},\mathbf{z}_{i+1}^{(k-1)},\ldots,\mathbf{z}_M^{(k-1)})$.
   End
  End
In the above sampling procedure, $T$ is the maximum number of iterations. Therefore, we have $T$ samples $\mathbf{z}^{(k)} = (\mathbf{z}_1^{(k)},\mathbf{z}_2^{(k)},\ldots,\mathbf{z}_M^{(k)})$, $k = 1,2,\ldots,T$. To alleviate the influence of random initializations, we can discard the samples within the burn-in period. Although Gibbs sampling is commonly efficient in obtaining marginal distributions from conditional distributions, it can be highly inefficient in some cases, such as sampling from a Mexican-hat-like distribution [53].
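The outline above can be illustrated with a minimal sampler for a toy target, a bivariate Gaussian with correlation $\rho$, whose full conditionals are available in closed form (an illustrative sketch, not a model from this survey):
```python
# Gibbs sampler for a bivariate Gaussian with unit variances and correlation rho.
# Each coordinate is drawn from its full conditional while the other is held fixed,
# and samples within the burn-in period are discarded.
import numpy as np

def gibbs_bivariate_normal(rho=0.8, T=5000, burn_in=500, seed=0):
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0                    # step 1: initialization
    cond_std = np.sqrt(1.0 - rho ** 2)   # std of each full conditional
    samples = []
    for k in range(T):                   # step 2: cycle through the variables
        z1 = rng.normal(rho * z2, cond_std)   # z1 | z2 ~ N(rho*z2, 1 - rho^2)
        z2 = rng.normal(rho * z1, cond_std)   # z2 | z1 ~ N(rho*z1, 1 - rho^2)
        samples.append((z1, z2))
    return np.array(samples[burn_in:])

draws = gibbs_bivariate_normal()
print("sample correlation:", np.corrcoef(draws.T)[0, 1])  # close to rho = 0.8
```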
The Metropolis algorithm is an instance of MCMC algorithms; it performs a random walk with an acceptance/rejection rule until convergence is achieved. As a generalization of the Metropolis algorithm, the Metropolis–Hastings algorithm modifies the jumping rules and improves the convergence speed [66]. Broadly speaking, the advanced Gibbs sampling algorithm can be regarded as a special case of the Metropolis–Hastings algorithm. The Gibbs sampler is applicable for models that are conditionally conjugate, while the Metropolis algorithm can also be used for models that are not conditionally conjugate. Hence, we can combine the Gibbs sampler and the Metropolis algorithm to sample from complex distributions [66].

3.2. Variational Bayesian Inference

Another widely used class of methods for approximating the marginal probability distribution $p(\mathbf{z}_i)$ is variational Bayesian (VB) inference [54]. We still use the notation $\mathbf{X} = (\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N)$ to represent a matrix composed of $N$ observed samples and $\mathbf{z} = (\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_M)$ to represent a vector constructed from $M$ blocks of latent variables. In VB methods, both $\mathbf{X}$ and $\mathbf{z}$ are assumed to be stochastic.
Given the prior distribution $p(\mathbf{z})$ and the data likelihood $p(X|\mathbf{z})$, the probability distribution of the data matrix $\mathbf{X}$, also called the evidence, can be calculated as $p(X) = \int p(\mathbf{z})\,p(X|\mathbf{z})\,d\mathbf{z}$. Then we derive the conditional probability distribution of $\mathbf{z}$ via Bayes’ rule:
$$p(\mathbf{z}|X) = \frac{p(\mathbf{z})\,p(X|\mathbf{z})}{\int p(\mathbf{z})\,p(X|\mathbf{z})\,d\mathbf{z}}.$$
However, the integral $\int p(\mathbf{z})\,p(X|\mathbf{z})\,d\mathbf{z}$ is analytically intractable under normal conditions. Now, we harness VB methods to approximate the posterior distribution $p(\mathbf{z}|X)$ as well as $p(X)$.
Let q ( z ) be the trial probability distribution of p ( z | X ) . The log of the probability distribution p ( X ) is decomposed as below:
$$\begin{aligned}
\ln p(X) &= \int q(\mathbf{z})\ln p(X)\,d\mathbf{z} = \int q(\mathbf{z})\ln\frac{p(X,\mathbf{z})}{p(\mathbf{z}|X)}\,d\mathbf{z} = \int q(\mathbf{z})\ln p(X,\mathbf{z})\,d\mathbf{z}-\int q(\mathbf{z})\ln p(\mathbf{z}|X)\,d\mathbf{z}\\
&= \int q(\mathbf{z})\ln p(X,\mathbf{z})\,d\mathbf{z}-\left(\int q(\mathbf{z})\ln\frac{p(\mathbf{z}|X)}{q(\mathbf{z})}\,d\mathbf{z}+\int q(\mathbf{z})\ln q(\mathbf{z})\,d\mathbf{z}\right) = L(q)+KL(q\,\|\,p) \geq L(q)
\end{aligned}$$
where $L(q) = \int q(\mathbf{z})\ln\frac{p(X,\mathbf{z})}{q(\mathbf{z})}\,d\mathbf{z}$ and $KL(q\,\|\,p) = -\int q(\mathbf{z})\ln\frac{p(\mathbf{z}|X)}{q(\mathbf{z})}\,d\mathbf{z}$. The term $KL(q\,\|\,p)$ is called the Kullback–Leibler (KL) divergence [54,84,85,86,87] between $q(\mathbf{z})$ and $p(\mathbf{z}|X)$, and $L(q)$ is a lower bound of $\ln p(X)$, which is attained if and only if $KL(q\,\|\,p) = 0$ (or equivalently $q = p$). This divergence can also be interpreted as the information gain when changing from the prior $q(\mathbf{z})$ to the posterior $p(\mathbf{z}|X)$ [85].
Another perspective for Equation (23) is that,
$$\ln p(X) = \ln\int p(X,\mathbf{z})\,d\mathbf{z} = \ln\int q(\mathbf{z})\frac{p(X,\mathbf{z})}{q(\mathbf{z})}\,d\mathbf{z} \geq \int q(\mathbf{z})\ln\frac{p(X,\mathbf{z})}{q(\mathbf{z})}\,d\mathbf{z} = L(q).$$
The above inequality is obtained by Jensen’s inequality. The negative lower bound $-L(q)$ is called the free energy [88,89].
The KL divergence $KL(q\,\|\,p)$ can be regarded as a metric for evaluating how well the trial distribution $q(\mathbf{z})$ approximates the posterior distribution $p(\mathbf{z}|X)$ [54]. In light of Equation (23), minimizing $KL(q\,\|\,p)$ is equivalent to maximizing $L(q)$. What is more, the lower bound can be further derived as below:
$$L(q) = \int q(\mathbf{z})\ln p(X|\mathbf{z})\,d\mathbf{z}+\int q(\mathbf{z})\ln p(\mathbf{z})\,d\mathbf{z}-\int q(\mathbf{z})\ln q(\mathbf{z})\,d\mathbf{z} = \int q(\mathbf{z})\ln p(X|\mathbf{z})\,d\mathbf{z}+\int q(\mathbf{z})\ln\frac{p(\mathbf{z})}{q(\mathbf{z})}\,d\mathbf{z} = E_{q(\mathbf{z})}[\ln p(X|\mathbf{z})]-KL(q\,\|\,p).$$
The above equation means that the lower bound $L(q)$ can be regarded as the expectation of the log-likelihood $\ln p(X|\mathbf{z})$ with respect to $q(\mathbf{z})$, penalized by the KL divergence between $q(\mathbf{z})$ and the prior $p(\mathbf{z})$.
To reduce the difficulty of approximating the trial distribution q ( z ) , we partition all latent variables into M disjoint block variables { z 1 , z 2 , , z M } and assume that z 1 , z 2 , , z M are mutually independent. Based on the above assumption, we have
$$q(\mathbf{z}) = q(\mathbf{z}_1)\,q(\mathbf{z}_2)\cdots q(\mathbf{z}_M).$$
Afterwards, we utilize the mean field theory to obtain the approximation of q ( z ) . Concretely speaking, we update each q ( z i ) in turn. Let q ( z i ) be unknown and q ( z j ) ( j i ) be fixed. Under this circumstance, we can get the optimal q ( z i ) by solving the following variational maximization problem:
$$q^*(\mathbf{z}_i) = \arg\max_{q(\mathbf{z}_i)} L(q(\mathbf{z})).$$
Plugging Equation (26) into the lower bound of ln p ( X ) , we have
$$\begin{aligned}
L(q) &= \int\prod_{j=1}^{M}q(\mathbf{z}_j)\ln p(X,\mathbf{z})\prod_{j=1}^{M}d\mathbf{z}_j-\int\prod_{j=1}^{M}q(\mathbf{z}_j)\sum_{j=1}^{M}\ln q(\mathbf{z}_j)\prod_{j=1}^{M}d\mathbf{z}_j\\
&= \int q(\mathbf{z}_i)\left(\int\ln p(X,\mathbf{z})\prod_{j\neq i}q(\mathbf{z}_j)\prod_{j\neq i}d\mathbf{z}_j\right)d\mathbf{z}_i-\int q(\mathbf{z}_i)\ln q(\mathbf{z}_i)\,d\mathbf{z}_i+\mathrm{const}\\
&= \int q(\mathbf{z}_i)\ln\tilde{p}(X,\mathbf{z}_i)\,d\mathbf{z}_i-\int q(\mathbf{z}_i)\ln q(\mathbf{z}_i)\,d\mathbf{z}_i+\mathrm{const}\\
&= \int q(\mathbf{z}_i)\ln\frac{\tilde{p}(X,\mathbf{z}_i)}{q(\mathbf{z}_i)}\,d\mathbf{z}_i+\mathrm{const}
\end{aligned}$$
where “const” is a term independent of z i and p ˜ ( X , z i ) is a new probability distribution whose log is defined by
$$\ln\tilde{p}(X,\mathbf{z}_i) = \int\ln p(X,\mathbf{z})\prod_{j\neq i}q(\mathbf{z}_j)\prod_{j\neq i}d\mathbf{z}_j+\mathrm{const} = E_{\mathbf{z}\setminus\mathbf{z}_i}[\ln p(X,\mathbf{z})]+\mathrm{const}.$$
Therefore, the optimal solution of problem (27) is $q^*(\mathbf{z}_i) \propto \tilde{p}(X,\mathbf{z}_i)$, which is obtained by minimizing the KL divergence between $q(\mathbf{z}_i)$ and $\tilde{p}(X,\mathbf{z}_i)$.
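The following sketch implements these mean-field updates for a simple conjugate example assumed here only for illustration, a univariate Gaussian with unknown mean $\mu$ and precision $\tau$ under a Gaussian–Gamma prior, so that both factors $q(\mu)$ and $q(\tau)$ have closed-form coordinate updates:
```python
# Coordinate-ascent mean-field VB for x_i ~ N(mu, 1/tau) with priors
# mu | tau ~ N(mu0, 1/(lam0*tau)) and tau ~ Gam(a0, b0), using q(mu, tau) = q(mu) q(tau).
import numpy as np

def vb_gaussian(x, mu0=0.0, lam0=1.0, a0=1e-3, b0=1e-3, n_iter=50):
    N, xbar = len(x), np.mean(x)
    E_tau = a0 / b0                               # initial guess for E[tau]
    for _ in range(n_iter):
        # q(mu) = N(mu_N, 1/lam_N)
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # q(tau) = Gam(a_N, b_N), using E_q[(x_i - mu)^2] and E_q[(mu - mu0)^2]
        a_N = a0 + (N + 1) / 2.0
        E_sq = np.sum((x - mu_N) ** 2) + N / lam_N
        b_N = b0 + 0.5 * (E_sq + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=500)
mu_N, lam_N, a_N, b_N = vb_gaussian(x)
print(f"E[mu] = {mu_N:.3f}, E[tau] = {a_N / b_N:.4f} (true tau = 0.25)")
```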
An advantage of the VB methods is that they are relatively immune to overfitting. But they also have some shortcomings. For example, the probability distributions derived via VB tend to have less probability mass in the tails than the true posterior, and this systematic bias may hurt some applications. A VB method approximates the full posterior distribution by maximizing the corresponding lower bound on the marginal likelihood, and it can only handle smaller datasets. In contrast, stochastic variational inference optimizes over a subset of the data at each iteration. This stochastic inference algorithm is scalable and outperforms traditional variational inference methods [90,91].

3.3. Comparisons between Gibbs Sampling and Variational Bayesian Inference

Evaluating the posterior distribution p ( z | X ) is a central task in probabilistic models of low-rank matrix factorizations. In many practical applications, it is often infeasible to compute the posterior distribution or the expectations with respect to this distribution. Gibbs sampling and VB inference are two dominant methods for approximating the posterior distribution. The former is a stochastic approximation and the latter is a deterministic technique.
The main shortcoming of the Expectation Maximization (EM) algorithm is that the posterior distribution of the latent variables must be available in advance. Both Gibbs sampling and VB inference ameliorate this shortcoming of the EM algorithm. Gibbs sampling is easy to implement because it adopts a Monte Carlo procedure. Another advantage of Gibbs sampling is that it can generate exact results [54]. But this method is often only suitable for small-scale problems because it requires a large amount of computation. Compared with Gibbs sampling, VB inference does not generate exact results but has a smaller computational complexity. Therefore, their strengths and weaknesses are complementary rather than competitive.

4. Probabilistic Models of Principal Component Analysis

Principal Component Analysis (PCA) is a special type of low-rank matrix factorizations. This section first introduces the classical PCA and then reviews its probabilistic models.

4.1. Principal Component Analysis

The classical PCA converts a set of samples with possibly correlated variables into another set of samples with linearly uncorrelated variables via an orthogonal transformation [1]. Based on this, PCA is an effective technique widely used in performing dimensionality reduction and extracting features.
Let $X = \{\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N\}$ be a collection of $N$ samples with $d$ dimensions and $\mathbf{I}_r$ ($r < d$) an $r$-by-$r$ identity matrix. Given a projection transformation matrix $\mathbf{W}\in\mathbb{R}^{d\times r}$, the PCA model can be expressed as
$$\mathbf{x}_i = \mathbf{W}\mathbf{z}_i+\bar{\mathbf{x}}+\boldsymbol{\varepsilon}_i,\quad i = 1,2,\ldots,N$$
where the mean vector $\bar{\mathbf{x}} = \sum_{i=1}^{N}\mathbf{x}_i/N$, $\mathbf{W}$ satisfies $\mathbf{W}^T\mathbf{W} = \mathbf{I}_r$, $\mathbf{z}_i$ is a representation coefficient vector with $r$ dimensions and $\boldsymbol{\varepsilon}_1,\boldsymbol{\varepsilon}_2,\ldots,\boldsymbol{\varepsilon}_N \sim N(\mathbf{0},\sigma^2\mathbf{I}_d)$ are independent and identically distributed noise vectors. Denote $\mathbf{X} = (\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_N)$, $\mathbf{Z} = (\mathbf{z}_1,\mathbf{z}_2,\ldots,\mathbf{z}_N)$, $\bar{\mathbf{X}} = (\bar{\mathbf{x}},\bar{\mathbf{x}},\ldots,\bar{\mathbf{x}})\in\mathbb{R}^{d\times N}$ and $\mathbf{E} = (\boldsymbol{\varepsilon}_1,\boldsymbol{\varepsilon}_2,\ldots,\boldsymbol{\varepsilon}_N)$. Then PCA can be rewritten as the following matrix factorization formulation:
$$\mathbf{X} = \mathbf{W}\mathbf{Z}+\bar{\mathbf{X}}+\mathbf{E}.$$
According to the maximum likelihood estimation, the optimal W and Z can be obtained by solving the minimization problem:
$$\min_{\mathbf{W},\mathbf{Z}}\|\mathbf{X}-\bar{\mathbf{X}}-\mathbf{W}\mathbf{Z}\|_F^2,\quad\text{s.t.}\ \mathbf{W}^T\mathbf{W} = \mathbf{I}_r$$
where $\|\cdot\|_F$ is the Frobenius norm of a matrix. Although this optimization problem is not convex, we can obtain the optimal transformation matrix $\mathbf{W}^*$ by stacking the $r$ singular vectors corresponding to the $r$ largest singular values of the sample covariance matrix $\mathbf{S} = \sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T/(N-1)$. Let $\mathbf{z}_i^*$ be the optimal low-dimensional representation of $\mathbf{x}_i$. Then it holds that $\mathbf{z}_i^* = \mathbf{W}^{*T}(\mathbf{x}_i-\bar{\mathbf{x}})$.
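A compact sketch of this procedure (illustrative code on synthetic data, not taken from the references) computes $\mathbf{W}^*$ from the sample covariance matrix and the low-dimensional codes $\mathbf{z}_i^* = \mathbf{W}^{*T}(\mathbf{x}_i-\bar{\mathbf{x}})$:
```python
# Classical PCA: W* stacks the r leading eigenvectors of the sample covariance S,
# and the low-dimensional representation is z_i* = W*^T (x_i - x_bar).
import numpy as np

def pca(X, r):
    """X has shape (d, N): one d-dimensional sample per column."""
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar
    S = Xc @ Xc.T / (X.shape[1] - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :r]               # r leading eigenvectors
    Z = W.T @ Xc                              # low-dimensional codes
    return W, Z, x_bar

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3)) @ rng.normal(size=(3, 200)) + 0.05 * rng.normal(size=(10, 200))
W, Z, x_bar = pca(X, r=3)
print("relative reconstruction error:", np.linalg.norm(X - (W @ Z + x_bar)) / np.linalg.norm(X))
```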
The PCA technique only supposes that the dataset is contaminated by isotropic Gaussian noise. The advantage of PCA is that it is very simple and effective in obtaining the point estimates of $\mathbf{W}$ and $\mathbf{Z}$. But we cannot obtain their probability distributions. In fact, the probability distributions of the parameters are more useful and valuable than point estimates for revealing the intrinsic structure of the data and investigating the data generation process. Probabilistic models of PCA are just a class of methods for inferring the probability distributions of these parameters.

4.2. Probabilistic Principal Component Analysis

In Equation (30), the low-dimensional representation z i is an unknown and deterministic parameter vector. In contrast, the original probabilistic PCA [10] regards z i as a stochastic vector. This probability model provides a general form of decomposing a d-dimensional input sample x :
x = W z + μ + ε
where the latent random vector z ~ N ( 0 , I r ) , the noise ε ~ N ( 0 , σ 2 I d ) and μ is the mean vector. In the following, a Maximum Likelihood (ML) method is proposed to obtain the point estimations of W , μ and σ 2 .
It is obvious that p ( x | z ) = N ( x | W z + μ , σ 2 I d ) . Hence we have
$$p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,d\mathbf{z} = N(\mathbf{x}|\boldsymbol{\mu},\sigma^2\mathbf{I}_d+\mathbf{W}\mathbf{W}^T).$$
Given the observed sample set X = { x 1 , x 2 , , x N } , the log-likelihood function is
$$L(\mathbf{W},\boldsymbol{\mu},\sigma^2|X) = \sum_{i=1}^{N}\ln p(\mathbf{x}_i) = \sum_{i=1}^{N}\ln N(\mathbf{x}_i|\boldsymbol{\mu},\sigma^2\mathbf{I}_d+\mathbf{W}\mathbf{W}^T).$$
The optimal W , μ , σ 2 can be obtained by maximizing L ( W , μ , σ 2 | X ) . By letting the partial derivative of L ( W , μ , σ 2 | X ) with respect to μ be the zero vector, we can easily get the maximum likelihood estimation of μ : μ ML = x ¯ . Hence, the optimal W and σ 2 can be achieved by the stationary points of L ( W , μ ML , σ 2 | X ) with respect to W and σ 2 .
The aforementioned method is slightly complex when computing W and σ 2 . For this purpose, an Expectation Maximization (EM) algorithm was also presented to solve probabilistic PCA. If W , μ and σ 2 are given, then the joint probability distribution of z and x can be derived as follows:
p ( z , x | W , μ , σ 2 ) = p ( z | W , μ , σ 2 ) p ( x | z , W , μ , σ 2 ) = N ( z | 0 , I r ) N ( x | W z + μ , σ 2 I d ) .
The posterior distribution of z for given x is
$$p(\mathbf{z}|\mathbf{x},\mathbf{W},\boldsymbol{\mu},\sigma^2) = \frac{p(\mathbf{z},\mathbf{x}|\mathbf{W},\boldsymbol{\mu},\sigma^2)}{p(\mathbf{x}|\mathbf{W},\boldsymbol{\mu},\sigma^2)} \propto N(\mathbf{z}|\mathbf{0},\mathbf{I}_r)\,N(\mathbf{x}|\mathbf{W}\mathbf{z}+\boldsymbol{\mu},\sigma^2\mathbf{I}_d).$$
Because
$$N(\mathbf{z}|\mathbf{0},\mathbf{I}_r)\,N(\mathbf{x}|\mathbf{W}\mathbf{z}+\boldsymbol{\mu},\sigma^2\mathbf{I}_d) \propto \exp\!\left(-\frac{1}{2}\mathbf{z}^T\mathbf{z}\right)\exp\!\left(-\frac{1}{2\sigma^2}(\mathbf{x}-\mathbf{W}\mathbf{z}-\boldsymbol{\mu})^T(\mathbf{x}-\mathbf{W}\mathbf{z}-\boldsymbol{\mu})\right) \propto \exp\!\left(-\frac{1}{2}\left[\mathbf{z}-(\sigma^2\mathbf{I}+\mathbf{W}^T\mathbf{W})^{-1}\mathbf{W}^T(\mathbf{x}-\boldsymbol{\mu})\right]^T\frac{\sigma^2\mathbf{I}+\mathbf{W}^T\mathbf{W}}{\sigma^2}\left[\mathbf{z}-(\sigma^2\mathbf{I}+\mathbf{W}^T\mathbf{W})^{-1}\mathbf{W}^T(\mathbf{x}-\boldsymbol{\mu})\right]\right)$$
we have
$$p(\mathbf{z}|\mathbf{x},\mathbf{W},\boldsymbol{\mu},\sigma^2) \sim N(\langle\mathbf{z}\rangle,\boldsymbol{\Sigma}_z)$$
where $\boldsymbol{\Sigma}_z = \mathrm{Cov}(\mathbf{z},\mathbf{z}) = \sigma^2(\sigma^2\mathbf{I}+\mathbf{W}^T\mathbf{W})^{-1}$ is the covariance matrix of $\mathbf{z}$ and $\langle\mathbf{z}\rangle = \boldsymbol{\Sigma}_z\mathbf{W}^T(\mathbf{x}-\boldsymbol{\mu})/\sigma^2$ is the mean of $\mathbf{z}$. Based on the above analysis, we give the complete-data log-likelihood function:
$$\begin{aligned}
L_C(\mathbf{W},\boldsymbol{\mu},\sigma^2|X) &= \sum_{i=1}^{N}\ln p(\mathbf{x}_i,\mathbf{z}_i|\mathbf{W},\boldsymbol{\mu},\sigma^2) = \sum_{i=1}^{N}\left[\ln N(\mathbf{z}_i|\mathbf{0},\mathbf{I}_r)+\ln N(\mathbf{x}_i|\mathbf{W}\mathbf{z}_i+\boldsymbol{\mu},\sigma^2\mathbf{I})\right]\\
&= -\frac{dN}{2}\ln\sigma^2-\frac{1}{2}\sum_{i=1}^{N}\left(\mathbf{z}_i^T\mathbf{z}_i+(\mathbf{x}_i-\mathbf{W}\mathbf{z}_i-\boldsymbol{\mu})^T(\mathbf{x}_i-\mathbf{W}\mathbf{z}_i-\boldsymbol{\mu})/\sigma^2\right)+\mathrm{const}\\
&= -\frac{dN}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(\mathbf{z}_i^T(\sigma^2\mathbf{I}+\mathbf{W}^T\mathbf{W})\mathbf{z}_i+(\mathbf{x}_i-\boldsymbol{\mu})^T(\mathbf{x}_i-\boldsymbol{\mu})-2(\mathbf{x}_i-\boldsymbol{\mu})^T\mathbf{W}\mathbf{z}_i\right)+\mathrm{const}
\end{aligned}$$
where “const” is a term independent of W , μ and σ 2 .
In the expectation step, we take expectation on L C ( W , μ , σ 2 | X ) with respect to z :
$$E_{\mathbf{z}}[L_C(\mathbf{W},\boldsymbol{\mu},\sigma^2|X)] = -\frac{dN}{2}\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(\mathrm{Tr}\!\left((\sigma^2\mathbf{I}+\mathbf{W}^T\mathbf{W})E[\mathbf{z}_i\mathbf{z}_i^T]\right)+(\mathbf{x}_i-\boldsymbol{\mu})^T(\mathbf{x}_i-\boldsymbol{\mu})-2(\mathbf{x}_i-\boldsymbol{\mu})^T\mathbf{W}E[\mathbf{z}_i]\right)+\mathrm{const}.$$
According to Equation (39), we have E [ z i ] = Σ z W T ( x i μ ) / σ 2 and E [ z i z i T ] = Σ z + E [ z i ] E [ z i ] T .
In the maximization step, we first obtain the maximum likelihood estimation of μ : μ ML = x ¯ by setting the partial derivative of E z [ L C ( W , μ , σ 2 | X ) ] with respect to μ be a zero vector. Similarly, we can also obtain the optimal estimations of W and σ 2 by solving the equation set:
$$\begin{cases}\dfrac{\partial}{\partial\mathbf{W}}E_{\mathbf{z}}[L_C(\mathbf{W},\boldsymbol{\mu}_{\mathrm{ML}},\sigma^2|X)] = \mathbf{0},\\[2mm]\dfrac{\partial}{\partial\sigma^2}E_{\mathbf{z}}[L_C(\mathbf{W},\boldsymbol{\mu}_{\mathrm{ML}},\sigma^2|X)] = 0.\end{cases}$$
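The resulting E- and M-steps can be written compactly; the sketch below follows the updates derived above for probabilistic PCA (an illustration under the stated model, not the original implementation of [10]):
```python
# EM for probabilistic PCA. E-step: M = sigma^2 I + W^T W, E[z_n] = M^{-1} W^T (x_n - x_bar),
# E[z_n z_n^T] = sigma^2 M^{-1} + E[z_n] E[z_n]^T. M-step: closed-form updates of W and sigma^2.
import numpy as np

def ppca_em(X, r, n_iter=100, seed=0):
    """X has shape (d, N). Returns W, sigma2 and the posterior means of z."""
    d, N = X.shape
    rng = np.random.default_rng(seed)
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar
    W, sigma2 = rng.normal(size=(d, r)), 1.0
    for _ in range(n_iter):
        # E-step
        M = sigma2 * np.eye(r) + W.T @ W
        Minv = np.linalg.inv(M)
        Ez = Minv @ W.T @ Xc                      # shape (r, N)
        sumEzz = N * sigma2 * Minv + Ez @ Ez.T    # sum_n E[z_n z_n^T]
        # M-step
        W_new = (Xc @ Ez.T) @ np.linalg.inv(sumEzz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (W_new.T @ Xc))
                  + np.trace(sumEzz @ W_new.T @ W_new)) / (N * d)
        W = W_new
    return W, sigma2, Ez

rng = np.random.default_rng(4)
X = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 300)) + 0.1 * rng.normal(size=(8, 300))
W, sigma2, Ez = ppca_em(X, r=2)
print("estimated noise variance:", sigma2)   # should be close to 0.1^2
```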
Moreover, a mixture model for probabilistic PCA was proposed in [92]. Khan et al. replaced the Gaussian noise with Gaussian process and incorporated the information of preference pairs into the collaborative filtering [93].

4.3. Bayesian Principal Component Analysis

In probabilistic PCA, both $\mathbf{z}$ and $\boldsymbol{\varepsilon}$ obey Gaussian distributions, and both $\mathbf{W}$ and $\boldsymbol{\mu}$ are non-stochastic parameters. Now, we further treat $\mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_r$ and $\boldsymbol{\mu}$ as independent random variables, that is, $\mathbf{w}_i \sim N(\mathbf{0},\alpha_i^{-1}\mathbf{I}_d)$ and $\boldsymbol{\mu} \sim N(\mathbf{0},\beta^{-1}\mathbf{I}_d)$. At this moment, the corresponding probabilistic model of PCA is called Bayesian PCA [11]. Set $\boldsymbol{\alpha} = (\alpha_1,\alpha_2,\ldots,\alpha_r)^T$ and $\tau = \sigma^{-2}$. Then we give the prior distributions of $\boldsymbol{\alpha}$ and $\tau$ as follows:
$$p(\boldsymbol{\alpha}) = \prod_{i=1}^{r}\mathrm{Gam}(\alpha_i|a_0,b_0),\quad p(\tau) = \mathrm{Gam}(\tau|c_0,d_0)$$
where a 0 , b 0 , c 0 , d 0 are four given hyperparameters.
The true joint probability distribution of complete data is given by
$$p(X,\mathbf{W},\boldsymbol{\alpha},\mathbf{Z},\boldsymbol{\mu},\tau) = p(\mathbf{W},\boldsymbol{\alpha},\mathbf{Z},\boldsymbol{\mu},\tau)\,p(X|\mathbf{W},\boldsymbol{\alpha},\mathbf{Z},\boldsymbol{\mu},\tau) = p(\boldsymbol{\alpha})\,p(\mathbf{W}|\boldsymbol{\alpha})\,p(\mathbf{Z})\,p(\boldsymbol{\mu})\,p(\tau)\,p(X|\mathbf{W},\mathbf{Z},\boldsymbol{\mu},\tau).$$
We suppose that the trial distribution of p ( X , W , α , Z , μ , τ ) has the following form:
q ( W , α , Z , μ , τ ) = q ( W ) q ( α ) q ( Z ) q ( μ ) q ( τ ) .
By making use of VB inference, we can obtain the trial probability distributions of W , α , Z , μ and τ respectively. In detailed implementation, the hyperparameters a 0 , b 0 , c 0 , d 0 , β can be set to be small positive numbers to obtain the broad priors. Unlike other approximation methods, the proposed method maximizes a rigorous lower bound.

4.4. Robust L1 Principal Component Analysis

The probabilistic model of robust L1 PCA [12] regards both $\boldsymbol{\mu}$ and $\mathbf{W}$ as deterministic parameters, and $\boldsymbol{\mu}$ is set to the zero vector without loss of generality. We still suppose that $\mathbf{z}$ obeys a spherical multivariate Gaussian distribution, that is, $\mathbf{z} \sim N(\mathbf{0},\mathbf{I}_r)$. To improve the robustness, the noise $\boldsymbol{\varepsilon}_i$ is assumed to follow a multivariate Laplace distribution:
$$p(\boldsymbol{\varepsilon}_i|\sigma) = \left(\frac{1}{2\sigma}\right)^{d/2}\exp\!\left(-\frac{1}{\sigma}\|\boldsymbol{\varepsilon}_i\|_1\right)$$
where $\|\cdot\|_1$ is the L1 norm of a vector. Because the Laplace distribution has heavy tails, the model proposed in [12] is more robust against data outliers.
We use the hierarchical model to deal with the Laplace distribution. The probability density distribution of a univariate Laplace distribution Lap ( 0 , σ ) can be rewritten as
$$p(\varepsilon|\sigma) = \int_0^{+\infty}\sqrt{\frac{\beta}{2\pi\sigma^2}}\exp\!\left(-\frac{\beta}{2\sigma^2}\varepsilon^2\right)\frac{1}{2\beta^2}\exp\!\left(-\frac{1}{2\beta}\right)d\beta.$$
Hence, we can set ε i ~ N ( 0 , ( σ 2 / β i ) I d ) and β i ~ IGam ( 1 , 1 / 2 ) . Let ρ = 1 / σ 2 and give its prior distribution:
p ( ρ ) = Gam ( ρ | a , b )
where a and b are two given hyperparameters.
We take Z , β and ρ as latent variables and W as the hyperparameters. For fixed W , the true joint probability distribution of all latent variables is
p ( Z , β , ρ , X | W ) = p ( ρ ) p ( β ) p ( Z ) p ( X | W , Z ) .
The trial joint distribution of Z , β and ρ is chosen as q ( Z , β , ρ ) = q ( Z ) q ( β ) q ( ρ ) . By applying the VB inference, the probability distributions of Z , β and ρ are approximated respectively. What’s more, W is updated by minimizing the robust reconstruction error of all samples.

4.5. Bayesian Robust Factor Analysis

In the previous probabilistic models of PCA, the noise $\boldsymbol{\varepsilon}$ obeys the same Gaussian distribution for every feature. However, different features may have different noise levels in practical applications. Now, we set $\boldsymbol{\varepsilon} \sim N(\mathbf{0},\mathrm{diag}(\boldsymbol{\tau})^{-1})$, where $\boldsymbol{\tau} = (\tau_1,\tau_2,\ldots,\tau_d)^T$ and $\mathrm{diag}(\boldsymbol{\tau})$ is a diagonal matrix. Probabilistic Factor Analysis (or PCA) [13] further assumes that $W_{ij} \sim N(0,\tau_i^{-1}\alpha_j^{-1})$ and $\mu_i \sim N(0,\tau_i^{-1}\beta^{-1})$. In other words, different $W_{ij}$ or $\mu_i$ have different variances, and the variances of $W_{ij}$ and $\mu_i$ are coupled with the noise precisions $\tau_i$. Given $\mathbf{W},\mathbf{z},\boldsymbol{\mu},\boldsymbol{\tau}$, the conditional probability distribution of $\mathbf{x}$ is written as
$$p(\mathbf{x}|\mathbf{W},\mathbf{z},\boldsymbol{\mu},\boldsymbol{\tau}) = N(\mathbf{x}|\mathbf{W}\mathbf{z}+\boldsymbol{\mu},\mathrm{diag}(\boldsymbol{\tau})^{-1}).$$
Meanwhile, the prior distributions for τ i , α j and β are given as follows:
τ i ~ Gam ( a 0 , b 0 ) , α j ~ Gam ( c 0 , d 0 ) ,   β ~ Gam ( e 0 , f 0 )
where { a 0 , b 0 , c 0 , d 0 , e 0 , f 0 } is the set of hyperparameters.
The robust version of Bayesian factor analysis [13] adopts Student’s t-distributions instead of Gaussian noise, because the heavy tails of the Student’s t-distribution make it more robust to outliers or large sparse noise. Let $\boldsymbol{\varepsilon} \sim \mathrm{St}(\mathbf{0},\boldsymbol{\tau}^{.-1},\mathbf{v})$, where $\boldsymbol{\tau}^{.-1} = (1/\tau_1,1/\tau_2,\ldots,1/\tau_d)^T$ and $\mathbf{v} = (v_1,v_2,\ldots,v_d)^T$. In this case, the probability distribution of $\mathbf{x}$ for given $\mathbf{W},\mathbf{z},\boldsymbol{\mu},\boldsymbol{\tau},\mathbf{v}$ is
$$p(\mathbf{x}|\mathbf{W},\mathbf{z},\boldsymbol{\mu},\boldsymbol{\tau},\mathbf{v}) = \mathrm{St}(\mathbf{x}|\mathbf{W}\mathbf{z}+\boldsymbol{\mu},\boldsymbol{\tau}^{.-1},\mathbf{v}).$$
We can represent hierarchically the above Student’s t-distributions. Firstly, the conditional probability distribution of x k can be expressed as:
$$p(\mathbf{x}_k|\mathbf{W},\mathbf{z}_k,\boldsymbol{\mu},\boldsymbol{\tau},\mathbf{u}_k) = N(\mathbf{x}_k|\mathbf{W}\mathbf{z}_k+\boldsymbol{\mu},\mathrm{diag}(\boldsymbol{\tau}^{.-1}.*\mathbf{u}_k^{.-1}))$$
where $\mathbf{u}_k$ is a $d$-dimensional column vector and “$.*$” denotes the Hadamard (entrywise) product. Then we give the prior of $\mathbf{u}_k$ for a fixed $d$-dimensional hyperparameter vector $\mathbf{v}$ as below:
p ( u k | v ) = Gam ( u k | v / 2 , v / 2 ) .
Another Bayesian robust factor analysis is based on the Laplace distribution. In this case, we assume that $\boldsymbol{\varepsilon} \sim \mathrm{Lap}(\mathbf{0},\boldsymbol{\tau}^{.-1/2})$, where $\boldsymbol{\tau}^{.-1/2} = (\tau_1^{-1/2},\tau_2^{-1/2},\ldots,\tau_d^{-1/2})^T$. Then we have
$$p(\mathbf{x}|\mathbf{W},\mathbf{z},\boldsymbol{\mu},\boldsymbol{\tau}) = \mathrm{Lap}(\mathbf{x}|\mathbf{W}\mathbf{z}+\boldsymbol{\mu},\boldsymbol{\tau}^{.-1/2}).$$
The Laplace distribution generally leads to adverse effects on inferring the probability distributions of other random variables. Here, we still employ the hierarchical method, that is,
p ( x | W , z , μ , τ , u ) = N ( x | W z + μ , diag ( τ . 1 . * u . 1 ) ) , p ( u ) = IGam ( u | 1 , 1 / 2 ) .
Under this circumstance, we have
$$\begin{aligned}
p(\mathbf{x}|\mathbf{W},\mathbf{z},\boldsymbol{\mu},\boldsymbol{\tau}) &= \int p(\mathbf{x},\mathbf{u}|\mathbf{W},\mathbf{z},\boldsymbol{\mu},\boldsymbol{\tau})\,d\mathbf{u} = \int p(\mathbf{x}|\mathbf{W},\mathbf{z},\boldsymbol{\mu},\boldsymbol{\tau},\mathbf{u})\,p(\mathbf{u})\,d\mathbf{u} = \int N(\mathbf{x}|\mathbf{W}\mathbf{z}+\boldsymbol{\mu},\mathrm{diag}(\boldsymbol{\tau}^{.-1}.*\mathbf{u}^{.-1}))\,\mathrm{IGam}(\mathbf{u}|1,1/2)\,d\mathbf{u}\\
&\propto \prod_{i=1}^{d}\int(\tau_i u_i)^{1/2}\exp\!\left(-\frac{1}{2}\tau_i u_i(\mathbf{x}-\mathbf{W}\mathbf{z}-\boldsymbol{\mu})_i^2\right)u_i^{-2}\exp\!\left(-\frac{1}{2u_i}\right)du_i\\
&\propto \prod_{i=1}^{d}\int u_i^{-3/2}\exp\!\left(-\frac{1}{2u_i}\right)\exp\!\left(-\frac{1}{2}\tau_i(\mathbf{x}-\mathbf{W}\mathbf{z}-\boldsymbol{\mu})_i^2\,u_i\right)du_i\\
&= \prod_{i=1}^{d}\mathcal{L}\!\left\{u_i^{-3/2}\exp\!\left(-\frac{1}{2u_i}\right)\right\}\!\left(\frac{1}{2}\tau_i(\mathbf{x}-\mathbf{W}\mathbf{z}-\boldsymbol{\mu})_i^2\right)
\end{aligned}$$
where ( x W z μ ) i is the i-th element of vector x W z μ . Because
$$\mathcal{L}^{-1}\!\left\{\frac{\sqrt{2}}{2}\exp\!\left(-\sqrt{\tau_i(\mathbf{x}-\mathbf{W}\mathbf{z}-\boldsymbol{\mu})_i^2}\right)\right\}(t) = \frac{1}{2\sqrt{\pi}}\,t^{-3/2}\exp\!\left(-\frac{1}{2t}\right)$$
we have $p(\mathbf{x}|\mathbf{W},\mathbf{z},\boldsymbol{\mu},\boldsymbol{\tau}) \propto \prod_{i=1}^{d}\exp\!\left(-\left|\sqrt{\tau_i}\,(\mathbf{x}-\mathbf{W}\mathbf{z}-\boldsymbol{\mu})_i\right|\right)$. Hence, Equation (56) holds.
For the aforementioned two probabilistic models of robust factor analysis, the VB inference was proposed to approximate the posterior distributions [13]. In practice, the probability distribution of the noise should be chosen based on the application. The probabilistic models of PCA are compared in Table 2.

5. Probabilistic Models of Matrix Factorizations

Matrix factorizations are a type of method for approximating the data matrix by the product of two low-rank matrices. They can be regarded as a special case of PCA that does not consider the mean. This section will discuss the existing probabilistic models of matrix factorizations.

5.1. Matrix Factorizations

For given data matrix X , its low-rank factorization model is written as
X = W Z + E
where $\mathbf{W}\in\mathbb{R}^{d\times r}$, $\mathbf{Z}\in\mathbb{R}^{r\times N}$, $\mathbf{E}$ is the noise matrix and $r < d$. We assume that $E_{ij} \sim N(0,\sigma^2)$ are independent and identically distributed. We can get the optimal $\mathbf{W}$ and $\mathbf{Z}$ according to the maximum likelihood estimation. More specifically, we need to solve the following minimization problem:
$$\min_{\mathbf{W},\mathbf{Z}}\|\mathbf{X}-\mathbf{W}\mathbf{Z}\|_F.$$
The closed-form solution to problem (60) can be obtained by the Singular Value Decomposition (SVD).
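As an illustration, the best rank-$r$ approximation in the Frobenius norm is given by the truncated SVD (the Eckart–Young theorem), and one possible factorization takes $\mathbf{W} = \mathbf{U}_r\mathbf{S}_r$ and $\mathbf{Z} = \mathbf{V}_r^T$ (a minimal sketch):
```python
# Rank-r factorization of X via the truncated SVD: W = U_r S_r, Z = V_r^T.
import numpy as np

def low_rank_factorization(X, r):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    W = U[:, :r] * s[:r]          # d x r
    Z = Vt[:r, :]                 # r x N
    return W, Z

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4)) @ rng.normal(size=(4, 50)) + 0.01 * rng.normal(size=(30, 50))
W, Z = low_rank_factorization(X, r=4)
print("relative error:", np.linalg.norm(X - W @ Z) / np.linalg.norm(X))
```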
To enhance the robustness to outliers or large sparse noise, we now assume that E i j ~ Lap ( 0 , σ ) . For the moment, we solve the following optimization problem:
$$\min_{\mathbf{W},\mathbf{Z}}\|\mathbf{X}-\mathbf{W}\mathbf{Z}\|_1$$
where $\|\cdot\|_1$ is the L1-norm of a matrix (i.e., the sum of the absolute values of all its elements). This problem is also called L1-norm PCA and the corresponding optimization methods were proposed in [3,4]. Srebro and Jaakkola considered the weighted low-rank approximation problem and provided an EM algorithm [94].

5.2. Probabilistic Matrix Factorization

We still consider Gaussian noise corruptions, that is, E i j ~ N ( 0 , σ 2 ) . Furthermore, the zero-mean spherical Gaussian priors are respectively imposed on each row of W and each column of Z :
$$p(\mathbf{W}|\sigma_w^2) = \prod_{i=1}^{d}N(\mathbf{w}_i|\mathbf{0},\sigma_w^2\mathbf{I}_r),\quad p(\mathbf{Z}|\sigma_z^2) = \prod_{j=1}^{N}N(\mathbf{z}_j|\mathbf{0},\sigma_z^2\mathbf{I}_r).$$
Probabilistic matrix factorization (PMF) [15] regards both σ w 2 and σ z 2 as two deterministic parameters. The point estimations of W , Z can be obtained by maximizing the posterior distribution with the following form:
$$p(\mathbf{W},\mathbf{Z}|\mathbf{X},\sigma^2,\sigma_w^2,\sigma_z^2) \propto p(\mathbf{W},\mathbf{Z},\mathbf{X}|\sigma^2,\sigma_w^2,\sigma_z^2) \propto p(\mathbf{W}|\sigma_w^2)\,p(\mathbf{Z}|\sigma_z^2)\,p(\mathbf{X}|\mathbf{W},\mathbf{Z},\sigma^2).$$
If the priors are respectively placed on σ w 2 and σ z 2 , we can obtain the generalized model of probabilistic matrix factorization [15]. In this case, the likelihood function is derived as
$$p(\mathbf{W},\mathbf{Z},\sigma^2,\sigma_w^2,\sigma_z^2|\mathbf{X}) \propto p(\mathbf{W},\mathbf{Z},\sigma^2,\sigma_w^2,\sigma_z^2,\mathbf{X}) \propto p(\mathbf{X}|\mathbf{W},\mathbf{Z},\sigma^2)\,p(\mathbf{W}|\sigma_w^2)\,p(\mathbf{Z}|\sigma_z^2)\,p(\sigma_w^2)\,p(\sigma_z^2).$$
By maximizing the above posterior distribution, we can obtain the point estimates of the parameters $\{\mathbf{W},\mathbf{Z}\}$ and the hyperparameters $\{\sigma_w^2,\sigma_z^2\}$. Furthermore, two variants of PMF are also presented, i.e., PMF with adaptive priors and constrained PMF with user-specific feature vectors.
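Maximizing the posterior in Equation (63) on a fully observed matrix amounts to minimizing $\|\mathbf{X}-\mathbf{W}\mathbf{Z}\|_F^2/(2\sigma^2)$ plus quadratic penalties on $\mathbf{W}$ and $\mathbf{Z}$, so each row of $\mathbf{W}$ and each column of $\mathbf{Z}$ has a closed-form ridge update. The sketch below alternates these updates (an illustration only; the original model in [15] is formulated for partially observed rating matrices):
```python
# MAP estimation for PMF on a fully observed matrix by alternating ridge regression.
# lam_w = sigma^2 / sigma_w^2 and lam_z = sigma^2 / sigma_z^2 act as regularizers.
import numpy as np

def pmf_map(X, r, sigma2=0.01, sigma2_w=1.0, sigma2_z=1.0, n_iter=50, seed=0):
    d, N = X.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(d, r))
    Z = 0.1 * rng.normal(size=(r, N))
    lam_w, lam_z = sigma2 / sigma2_w, sigma2 / sigma2_z
    for _ in range(n_iter):
        # Row-wise update of W: w_i = x_i Z^T (Z Z^T + lam_w I)^{-1}
        W = X @ Z.T @ np.linalg.inv(Z @ Z.T + lam_w * np.eye(r))
        # Column-wise update of Z: z_j = (W^T W + lam_z I)^{-1} W^T x_j
        Z = np.linalg.inv(W.T @ W + lam_z * np.eye(r)) @ W.T @ X
    return W, Z

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 60)) + 0.1 * rng.normal(size=(40, 60))
W, Z = pmf_map(X, r=5)
print("relative error:", np.linalg.norm(X - W @ Z) / np.linalg.norm(X))
```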
In [89], a fully-observed variational Bayesian matrix factorization, an extension of PMF, was discussed. Meanwhile, it is shown that this new probabilistic matrix factorization can weaken the decomposability assumption and obtain the exact global analytic solution for rectangular cases.

5.3. Variational Bayesian Approach to Probabilistic Matrix Factorization

In PMF, W i k are independent and identically distributed and so are Z k j . Variational Bayesian PMF [14] assumes the entries from different columns of W or Z T have different variances, that is, W i k ~ N ( 0 , σ k 2 ) , Z k j ~ N ( 0 , ρ k 2 ) , k = 1 , 2 , , r . For given hyperparameters { σ 2 , σ W 2 , ρ Z 2 } , we get the joint probability distribution:
p ( X , W , Z | σ 2 , σ W 2 , ρ Z 2 ) = p ( X | W , Z , σ 2 ) p ( W | σ W 2 ) p ( Z | ρ Z 2 )
where σ W 2 = ( σ 1 2 , σ 2 2 , , σ r 2 ) T and ρ Z 2 = ( ρ 1 2 , ρ 2 2 , , ρ r 2 ) T .
We assume that the trial joint distribution of W and Z is decomposable, that is, q ( W , Z ) = q ( W ) q ( Z ) . Using VB method, we can update alternatively q ( W ) and q ( Z ) . The variances { σ 2 , σ W 2 , ρ Z 2 } can be determined by maximizing the following lower bound:
L ( q ( W , Z ) ) = E q ( W ) , q ( Z ) [ ln p ( X , W , Z | σ 2 , σ W 2 , ρ Z 2 ) ln q ( W ) q ( Z ) ] .
The experimental results in Netflix Prize competition show that the Variational Bayesian approach has superiority over MAP and greedy residual fitting.

5.4. Bayesian Probabilistic Matrix Factorizations Using Markov Chain Monte Carlo

Variational Bayesian approach to PMF only discusses the case that W i k (or Z k j ) are independent and identically distributed and their means are zeros. However, Bayesian PMF [16] assumes that w i (or z j ) are independent and identically distributed and their mean vectors are not zero vectors. Concretely speaking, we stipulate that
p ( W | μ W , Λ W ) = i = 1 d N ( w i | μ W , Λ W 1 ) , p ( Z | μ Z , Λ Z ) = j = 1 N N ( z j | μ Z , Λ Z 1 ) .
Let Θ W = { μ W , Λ W } , Θ Z = { μ Z , Λ Z } . We further suppose the prior distributions of Θ W and Θ Z are Gaussian–Wishart priors:
$$\begin{cases}p(\Theta_W|\Theta_0) = p(\boldsymbol{\mu}_W|\boldsymbol{\Lambda}_W)\,p(\boldsymbol{\Lambda}_W) = N(\boldsymbol{\mu}_W|\boldsymbol{\mu}_0,(\beta_0\boldsymbol{\Lambda}_W)^{-1})\,\mathrm{W}(\boldsymbol{\Lambda}_W|\mathbf{W}_0,v_0)\\ p(\Theta_Z|\Theta_0) = p(\boldsymbol{\mu}_Z|\boldsymbol{\Lambda}_Z)\,p(\boldsymbol{\Lambda}_Z) = N(\boldsymbol{\mu}_Z|\boldsymbol{\mu}_0,(\beta_0\boldsymbol{\Lambda}_Z)^{-1})\,\mathrm{W}(\boldsymbol{\Lambda}_Z|\mathbf{W}_0,v_0)\end{cases}$$
where $\Theta_0 = \{\boldsymbol{\mu}_0,v_0,\mathbf{W}_0\}$ and $\beta_0$ is a hyperparameter.
We can initialize the parameters Θ 0 as follows: μ 0 = 0 , v 0 = d , W 0 = I d . In theory, the predictive probability distribution of X i j * can be obtained by marginalizing over model parameters and hyperparameters:
$$p(X_{ij}^*|\mathbf{X},\Theta_0) = \int p(X_{ij}^*,\mathbf{W},\mathbf{Z},\Theta_W,\Theta_Z|\mathbf{X},\Theta_0)\,d\mathbf{W}\,d\mathbf{Z}\,d\Theta_W\,d\Theta_Z = \int p(X_{ij}^*|\mathbf{w}_i,\mathbf{z}_j)\,p(\mathbf{W},\mathbf{Z}|\mathbf{X},\Theta_W,\Theta_Z)\,p(\Theta_W|\Theta_0)\,p(\Theta_Z|\Theta_0)\,d\mathbf{W}\,d\mathbf{Z}\,d\Theta_W\,d\Theta_Z.$$
However, the above integral is analytically intractable because it is very difficult to determine the posterior distribution. Based on this, Gibbs sampling, one of the simplest Markov chain Monte Carlo methods, was proposed to approximate the predictive distribution $p(X_{ij}^*|\mathbf{X},\Theta_0)$. It is noted that MCMC methods for large-scale problems require special care in designing efficient proposals and may be very slow if successive samples are highly correlated.

5.5. Sparse Bayesian Matrix Completion

We consider the case that some elements of data matrix X are missing and the observed index set is denoted by Ω . Matrix completion assumes that X is approximately low-rank and its goal is to recover all missing elements from observed elements.
For noise matrix E , we assume that E i j ~ N ( 0 , β 1 ) . In sparse Bayesian matrix completion [19], the Gaussian distributions are imposed on two low-rank matrices, that is,
$$p(\mathbf{W}|\boldsymbol{\gamma}) = \prod_{i=1}^{r}N(\mathbf{w}_i|\mathbf{0},\gamma_i^{-1}\mathbf{I}_d),\quad p(\mathbf{Z}|\boldsymbol{\gamma}) = \prod_{i=1}^{r}N(\mathbf{z}_i|\mathbf{0},\gamma_i^{-1}\mathbf{I}_N).$$
Moreover, the priors of $\gamma_i$ are given by $\gamma_i \sim \mathrm{Gam}(a,b)$ and the prior of $\beta$ is assigned the noninformative Jeffreys prior $p(\beta) \propto \beta^{-1}$. It is obvious that
$$p(\mathbf{X}|\mathbf{W},\mathbf{Z},\beta) = \prod_{(i,j)\in\Omega}N(X_{ij}|\mathbf{w}_i\mathbf{z}_j,\beta^{-1}).$$
Then the joint probability distribution is
p ( X , W , Z , γ , β ) = p ( X | W , Z , β ) p ( W | γ ) p ( Z | γ ) p ( γ ) p ( β ) .
Let $q(\mathbf{W},\mathbf{Z},\boldsymbol{\gamma},\beta)$ be the approximating distribution of the posterior $p(\mathbf{W},\mathbf{Z},\boldsymbol{\gamma},\beta|\mathbf{X})$. The approximation can be obtained by VB inference. It is demonstrated that the proposed method achieves a better prediction error in recommendation systems.

5.6. Robust Bayesian Matrix Factorization

In previous probabilistic models of matrix factorizations, there is no relationship among the variances of W i k , Z k j , E i j . Now, we reconsider the probability distributions of W i k , Z k j , E i j . The noise E i j is chosen as the heteroscedastic Gaussian scale mixture distribution:
E i j ~ N ( 0 , ( τ α i β j ) 1 )
and the probability distributions of w i and z j are given by:
p ( w i | α i ) = N ( w i | 0 , ( α i Λ w ) 1 ) , p ( z j | β j ) = N ( z j | 0 , ( β j Λ z ) 1 ) .
We also impose Gamma distribution priors on α i and β j :
p ( α i ) = Gam ( α i | a 0 / 2 , b 0 / 2 ) , p ( β j ) = Gam ( β j | c 0 / 2 , d 0 / 2 )
where { a 0 , b 0 , c 0 , d 0 } is a given set of hyper-parameters. To reduce this problem’s complexity, we restrict Λ W and Λ Z to be diagonal matrices, that is,
Λ W 1 = diag ( σ 1 2 , , σ r 2 ) , Λ Z 1 = diag ( ρ 1 2 , , ρ r 2 ) .
Let θ = { τ , Λ W , Λ Z , a 0 , b 0 , c 0 , d 0 } . For the fixed parameters θ , the joint probability distribution is
p ( X , W , Z , α , β | θ ) = p ( X | W , Z , α , β , θ ) p ( W | α ) p ( Z | β ) p ( α | θ ) p ( β | θ ) .
We consider two types of approximated posteriors of p ( W , Z , α , β | X , θ ) , one is
q ( W , Z , α , β ) = q ( W ) q ( Z ) q ( α ) q ( β )
and another has the form
q ( W , Z , α , β ) = q ( W , α ) q ( Z , β ) .
For the above two cases, we can obtain respectively the trial probability distribution q ( W , Z , α , β ) by VB method [17].
In addition, a structured variational approximation was also proposed in [17]. We assume that the variational posteriors of W and Z obey Gaussian distributions:
q ( W | α ) = i = 1 d N ( w i | w ¯ i , α i S ¯ i ) , q ( Z | β ) = j = 1 N N ( z j | z ¯ j , β j R ¯ j ) .
The free energy function is defined as the negative lower bound:
$$-L(q) = -\int q(\mathbf{W},\mathbf{Z},\boldsymbol{\alpha},\boldsymbol{\beta})\ln\frac{p(\mathbf{X},\mathbf{W},\mathbf{Z},\boldsymbol{\alpha},\boldsymbol{\beta}|\theta)}{q(\mathbf{W},\mathbf{Z},\boldsymbol{\alpha},\boldsymbol{\beta})}\,d\mathbf{W}\,d\mathbf{Z}\,d\boldsymbol{\alpha}\,d\boldsymbol{\beta}.$$
By directly minimizing the free energy function with respect to w ¯ i , S ¯ i , z ¯ j , R ¯ j , we can obtain the optimal w ¯ i , S ¯ i , z ¯ j , R ¯ j . The variational posteriors of scale variables α and β can also be recognized as the generalized inverse Gaussians.
The parameters θ can be estimated by type II maximum likelihood or empirical Bayes. In other words, we update the parameters by minimizing directly the negative lower bound. Robust Bayesian matrix factorization shows that the heavy-tailed distributions are useful to incorporate robustness information to the probabilistic models.

5.7. Probabilistic Robust Matrix Factorization

The model of probabilistic robust matrix factorization [18] considers the sparse noise corruptions. In this model, the Gaussian distributions are also imposed on W i k and Z k j :
p ( W i k | λ W ) = N ( W i k | 0 , λ W 1 ) , p ( Z k j | λ Z ) = N ( Z k j | 0 , λ Z 1 )
and the Laplace noise is placed on E i j , that is, E i j ~ Lap ( 0 , λ ) .
From Bayes’ rule, we have
p ( W , Z | X , λ , λ W , λ Z ) p ( X | W , Z , λ ) p ( W | λ W ) p ( Z | λ Z ) .
We construct the hierarchical model for the Laplace distribution:
p ( X i j | W , Z , T ) = N ( X i j | w i z j , T i j ) , p ( T i j | λ ) = Exp ( T i j | λ / 2 ) .
We regard $\mathbf{T}$ as a latent variable matrix and denote $\theta = \{\mathbf{W},\mathbf{Z}\}$, $\hat{\theta} = \{\hat{\mathbf{W}},\hat{\mathbf{Z}}\}$, where $\hat{\theta}$ is the current estimate of $\theta$. An EM algorithm was proposed to infer $\mathbf{W}$ and $\mathbf{Z}$ [18]. To this end, we construct the so-called Q-function:
Q ( Z | θ ^ ) = E T [ log p ( Z | W ^ , X , T ) | X , θ ^ ] .
The posterior of complete-data is
p ( Z | W ^ , X , T ) = p ( Z , X | W ^ , T ) p ( X | W ^ , T ) p ( Z , X | W ^ , T ) p ( X | Z , W ^ , T ) p ( Z )
Hence, its log is
$$\log p(\mathbf{Z}|\hat{\mathbf{W}},\mathbf{X},\mathbf{T}) = -\frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{N}T_{ij}^{-1}(X_{ij}-\hat{\mathbf{w}}_i\mathbf{z}_j)^2-\frac{\lambda_Z}{2}\sum_{j}\mathbf{z}_j^T\mathbf{z}_j+\mathrm{const}$$
where “const” is a term independent of Z .
To compute the expectations of T, we derive the conditional probability distribution of T as follows:
p ( T | X , W ^ , Z ^ ) = p ( T , X | W ^ , Z ^ ) p ( X | W ^ , Z ^ ) p ( T , X | W ^ , Z ^ ) p ( T ) p ( X | T , W ^ , Z ^ ) .
Hence, T i j 1 follows an inverse Gaussian distribution and its posterior expectation is given by
$$E[T_{ij}^{-1}|\mathbf{X},\hat{\mathbf{W}},\hat{\mathbf{Z}}] = \frac{\sqrt{\lambda}}{|X_{ij}-(\hat{\mathbf{W}}\hat{\mathbf{Z}})_{ij}|}.$$
Thus, we get Q ( Z | θ ^ ) .
To obtain the update of $\mathbf{Z}$, we maximize the function $Q(\mathbf{Z}|\hat{\theta})$ with respect to $\mathbf{z}_j$. By setting $\nabla_{\mathbf{z}_j}Q(\mathbf{Z}|\hat{\theta}) = \mathbf{0}$, we can get the closed-form solution for $\mathbf{z}_j$. The update rule for $\mathbf{W}$ is similar to that of $\mathbf{Z}$. The proposed probabilistic model is robust against outliers and missing data and is equivalent to robust PCA under mild conditions [18].
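Combining the weights $E[T_{ij}^{-1}]$ with the closed-form updates of $\mathbf{z}_j$ and $\mathbf{w}_i$ yields an iteratively reweighted least-squares scheme: entries with large residuals receive small weights, which is the source of robustness. A rough sketch under these assumptions (with $\lambda_W$, $\lambda_Z$ the Gaussian prior precisions and a small constant guarding the weight against division by zero) is given below:
```python
# EM / iteratively reweighted least-squares sketch for matrix factorization with
# Laplace noise: E-step computes weights sqrt(lam)/|residual|, M-step performs
# weighted ridge updates of each column of Z and each row of W.
import numpy as np

def prmf_em(X, r, lam=1.0, lam_w=1.0, lam_z=1.0, n_iter=50, eps=1e-6, seed=0):
    d, N = X.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.normal(size=(d, r))
    Z = 0.1 * rng.normal(size=(r, N))
    I = np.eye(r)
    for _ in range(n_iter):
        # E-step: S_ij = E[T_ij^{-1}] = sqrt(lam) / |X_ij - (WZ)_ij|
        S = np.sqrt(lam) / (np.abs(X - W @ Z) + eps)
        # M-step: weighted ridge update of each column z_j ...
        for j in range(N):
            A = W.T @ (W * S[:, j, None]) + lam_z * I
            Z[:, j] = np.linalg.solve(A, W.T @ (S[:, j] * X[:, j]))
        # ... and of each row w_i
        for i in range(d):
            A = Z @ (Z.T * S[i, :, None]) + lam_w * I
            W[i, :] = np.linalg.solve(A, Z @ (S[i, :] * X[i, :]))
    return W, Z

rng = np.random.default_rng(7)
L = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 40))   # clean low-rank part
X = L.copy()
X[rng.random(X.shape) < 0.05] += 10.0                     # sparse gross outliers
W, Z = prmf_em(X, r=3)
print("error w.r.t. clean matrix:", np.linalg.norm(L - W @ Z) / np.linalg.norm(L))
```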

5.8. Bayesian Robust Matrix Factorization

Another robust probabilistic model of matrix factorizations is Bayesian robust matrix factorization [20]. Gaussian distributions are still imposed on W and Z , that is,
w i ~ N ( μ W , Λ W 1 ) , z j ~ N ( μ Z , Λ Z 1 ) , i = 1 , 2 , , d , j = 1 , 2 , , N .
We further assume both ( μ W , Λ W ) and ( μ Z , Λ Z ) follow Gaussian–Wishart distributions:
Λ W ~ W ( W 0 , v 0 ) , μ W ~ N ( μ 0 , ( β 0 Λ W ) 1 ) , Λ Z ~ W ( W 0 , v 0 ) , μ Z ~ N ( μ 0 , ( β 0 Λ Z ) 1 )
where W 0 , v 0 , μ 0 , β 0 are the hyperparameters.
To enhance the robustness, we suppose the noise is the mixture of a Laplace distribution and a GIG distribution. Concretely speaking, the noise term E i j ~ Lap ( 0 , η i j ) and the prior of η i j is given by GIG ( p , a , b ) , where p , a , b are three hyperparameters. Hence, E i j ~ Lap ( 0 , η i j ) can be represented by two steps: E i j ~ N ( 0 , T i j ) , T i j ~ Exp ( η i j / 2 ) . According to the above probability distributions, we can generate Λ W , μ W , W , μ Z , Λ Z , Z , η , T and X in turn.
Gibbs sampling is proposed to infer the posterior distributions. For this purpose, we need to derive the posterior distributions of all random variables. Due to the fact that the derivation process of { μ Z , Λ Z , Z } is similar to that of { μ W , Λ W , W } , the following only considers approximating the posterior distributions of ( μ W , Λ W , W ) for brevity. Firstly, the posterior distribution of ( μ W , Λ W ) is a Gaussian–Wishart distribution because that
p ( μ W , Λ W | W , W 0 , μ 0 , v 0 , β 0 ) p ( μ W , Λ W , W , W 0 , μ 0 , v 0 , β 0 ) i = 1 d p ( w i . | μ W , Λ W ) p ( Λ W | W 0 , v 0 ) p ( μ W | μ 0 , Λ W , β 0 )
Then, we compute the conditional probability distribution of $\mathbf{w}_i$:
$p(\mathbf{w}_i \mid X, Z, \boldsymbol{\mu}_W, \Lambda_W, T) \propto p(\mathbf{w}_i, X, Z, \boldsymbol{\mu}_W, \Lambda_W, T) \propto p(\mathbf{x}_i \mid \mathbf{w}_i, Z, \mathbf{t}_i)\, p(\mathbf{w}_i \mid \boldsymbol{\mu}_W, \Lambda_W) = \Big[\prod_{j=1}^{N} \mathcal{N}(X_{ij} \mid \mathbf{w}_i \mathbf{z}_j, T_{ij})\Big] \mathcal{N}(\mathbf{w}_i \mid \boldsymbol{\mu}_W, \Lambda_W^{-1}).$
According to the above equation, $p(\mathbf{w}_i \mid X, Z, \boldsymbol{\mu}_W, \Lambda_W, T)$ is also Gaussian. Next, for given $\{X_{ij}, \mathbf{w}_i, \mathbf{z}_j, \eta_{ij}\}$, the probability distribution of $T_{ij}$ is derived as follows:
$p(T_{ij} \mid X_{ij}, \mathbf{w}_i, \mathbf{z}_j, \eta_{ij}) \propto p(T_{ij}, X_{ij}, \mathbf{w}_i, \mathbf{z}_j, \eta_{ij}) \propto p(X_{ij} \mid \mathbf{w}_i, \mathbf{z}_j, T_{ij})\, p(T_{ij} \mid \eta_{ij}).$
So, $T_{ij} \mid X_{ij}, \mathbf{w}_i, \mathbf{z}_j, \eta_{ij} \sim \mathrm{GIG}(1/2, \eta_{ij}, r_{ij}^2)$, where $r_{ij} = X_{ij} - \mathbf{w}_i \mathbf{z}_j$.
Finally, the posterior of $\eta_{ij}$ satisfies
$p(\eta_{ij} \mid T_{ij}, p, a, b) \propto p(\eta_{ij}, T_{ij}, p, a, b) \propto p(T_{ij} \mid \eta_{ij})\, p(\eta_{ij} \mid p, a, b).$
Hence, $\eta_{ij} \mid T_{ij}, p, a, b \sim \mathrm{GIG}(p+1, T_{ij}+a, b)$. Bayesian robust matrix factorization incorporates spatial or temporal proximity in computer vision applications, and batch algorithms are proposed to infer the parameters.
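Both auxiliary updates are draws from GIG distributions, so a single Gibbs step for $(T_{ij}, \eta_{ij})$ can be sketched as below. The mapping from the $\mathrm{GIG}(p, a, b)$ parametrization of Table 1 to scipy's `geninvgauss` parametrization is our own assumption and should be double-checked before use.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(p, a, b, rng):
    """Draw from GIG(p, a, b) with density ~ x^(p-1) exp(-(a*x + b/x)/2).

    scipy's geninvgauss(p, c) has density ~ x^(p-1) exp(-c*(x + 1/x)/2),
    so we set c = sqrt(a*b) and rescale by sqrt(b/a) (assumed mapping).
    """
    return geninvgauss.rvs(p, np.sqrt(a * b), scale=np.sqrt(b / a), random_state=rng)

rng = np.random.default_rng(0)
p0, a0, b0 = 1.0, 1.0, 1.0          # hypothetical hyperparameters of the GIG prior on eta_ij
r_ij = 0.3                          # current residual X_ij - w_i z_j
eta_ij = 2.0                        # current value of eta_ij

# Gibbs step for T_ij | rest ~ GIG(1/2, eta_ij, r_ij^2)
T_ij = sample_gig(0.5, eta_ij, r_ij ** 2, rng)
# Gibbs step for eta_ij | rest ~ GIG(p0 + 1, T_ij + a0, b0)
eta_ij = sample_gig(p0 + 1.0, T_ij + a0, b0, rng)
```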

5.9. Bayesian Model for L1-Norm Low-Rank Matrix Factorizations

For low-rank matrix factorizations, L1-norm minimization is more robust than L2-norm minimization in the presence of outliers or non-Gaussian noise. Based on this, we assume that the noise $E_{ij}$ follows the Laplace distribution $E_{ij} \sim \mathrm{Lap}(0, \lambda/2)$. Since Laplace noise is inconvenient for Bayesian inference, a hierarchical Bayesian model was formulated in [21]. Concretely speaking, the Laplace prior is expressed through a two-level hierarchy:
$E_{ij} \sim \mathcal{N}(0, T_{ij}), \quad T_{ij} \sim \mathrm{Exp}(\lambda).$
The generative models of W and Z are constructed as follows:
$\mathbf{w}_i \sim \mathcal{N}(\mathbf{0}, \tau_W I_r), \quad \mathbf{z}_j \sim \mathcal{N}(\mathbf{0}, \tau_Z I_r), \quad i = 1, 2, \ldots, d, \; j = 1, 2, \ldots, N.$
In addition, Gamma priors are placed on the precision parameters of the above Gaussian distributions:
$\tau_W \sim \mathrm{Gam}(a_0, b_0), \quad \tau_Z \sim \mathrm{Gam}(c_0, d_0).$
The trial posterior distribution for $\{W, Z, \tau_W, \tau_Z, T\}$ is specified as
$q(W, Z, \tau_W, \tau_Z, T) = \prod_{i=1}^{d} q(\mathbf{w}_i) \prod_{j=1}^{N} q(\mathbf{z}_j)\, q(\tau_W)\, q(\tau_Z).$
The joint probability distribution is expressed as
$p(W, Z, \tau_W, \tau_Z, T, X) = p(X \mid W, Z, T)\, p(W \mid \tau_W)\, p(Z \mid \tau_Z)\, p(\tau_W)\, p(\tau_Z)\, p(T).$
VB inference was adopted to approximate the full posterior distribution [21]. Furthermore, varying precision parameters are also considered for different rows of $W$ and different columns of $Z$. All parameters are automatically tuned to adapt to the data, and the proposed method was applied to computer vision problems to validate its efficiency and robustness.
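Under the mean-field factorization above, the update of each $q(\mathbf{w}_i)$ is Gaussian with moments determined by expectations under the remaining factors. The following sketch shows what such an update typically looks like for this kind of model, treating $\tau_W$ as a precision as stated above; it is an illustration rather than the exact update derived in [21].

```python
import numpy as np

def update_q_wi(x_i, EZ, EZZt, ET_inv_i, E_tau_w):
    """Mean-field update for q(w_i) = N(mu_i, Sigma_i) (sketch, not the paper's code).

    x_i      : (N,)      i-th row of X
    EZ       : (r, N)    E[z_j] stacked column-wise
    EZZt     : (N, r, r) E[z_j z_j^T] for each column j
    ET_inv_i : (N,)      E[T_ij^{-1}] for the i-th row
    E_tau_w  : float     E[tau_W], treated as a prior precision
    """
    r = EZ.shape[0]
    # Posterior precision: prior precision plus data terms weighted by E[T_ij^{-1}]
    prec = E_tau_w * np.eye(r) + np.einsum('j,jkl->kl', ET_inv_i, EZZt)
    Sigma_i = np.linalg.inv(prec)
    # Posterior mean: weighted projection of the observed row onto E[z_j]
    mu_i = Sigma_i @ (EZ @ (ET_inv_i * x_i))
    return mu_i, Sigma_i
```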
In Table 3, we list all probabilistic models of matrix factorizations discussed in this section, and compare their probability distributions, priors and solving strategy.

6. Probabilistic Models of Robust PCA

Compared with traditional PCA, robust PCA is more robust to outliers or large sparse noise. Stable robust PCA [95], a stable version of robust PCA, decomposes the data matrix into the sum of a low-rank matrix, a sparse noise matrix and a dense noise matrix. The low-rank matrix obtained by solving a relaxed principal component pursuit is simultaneously stable to small noise and robust to gross sparse errors. This section will review probabilistic models of robust PCA.

6.1. Bayesian Robust PCA

In [22], a stable robust PCA is modeled as:
$X = W(D\Lambda)Z + B\,{.*}\,S + E$
where $D = \mathrm{diag}(d_{11}, d_{22}, \ldots, d_{rr})$, $\Lambda = \mathrm{diag}(\lambda_{11}, \lambda_{22}, \ldots, \lambda_{rr})$, $B \in \{0, 1\}^{d \times N}$ and ${.*}$ denotes the elementwise (Hadamard) product. The three terms $W(D\Lambda)Z$, $B\,{.*}\,S$ and $E$ are the low-rank, sparse noise and dense noise terms, respectively. If we do not consider the sparse noise term $B\,{.*}\,S$, then Equation (101) is equivalent to Equation (59). If all columns of $B\,{.*}\,S$ are the same, then stable robust PCA reduces to the PCA model (31).
The following considers the probability distributions of all matrices on the right-hand side of Equation (101). We assume that $\mathbf{w}_i \sim \mathcal{N}(\mathbf{0}, I_d/d)$, $d_{ii} \sim \mathrm{Bern}(p_i)$, $\lambda_{ii} \sim \mathcal{N}(0, \tau^{-1})$, $\mathbf{z}_j \sim \mathcal{N}(\mathbf{0}, I_r/r)$, $\mathbf{b}_j \sim \prod_{k=1}^{r} \mathrm{Bern}(\pi_k)$, $\mathbf{s}_j \sim \mathcal{N}(\mathbf{0}, v^{-1} I_d)$, $E_{ij} \sim \mathcal{N}(0, \gamma^{-1})$, $i = 1, 2, \ldots, r$, $j = 1, 2, \ldots, N$. The priors of $p_i$, $\tau$, $\pi_k$, $v$ and $\gamma$ are given by $p_i \sim \mathrm{Beta}(\alpha_0, \beta_0)$, $\tau \sim \mathrm{Gam}(a_0, b_0)$, $\pi_k \sim \mathrm{Beta}(\alpha_1, \beta_1)$, $v \sim \mathrm{Gam}(c_0, d_0)$ and $\gamma \sim \mathrm{Gam}(e_0, f_0)$, respectively.
Two methods were proposed in [22], that is, Gibbs sampling and VB inference. For the second method, the joint probability distribution is
$p(X, W, Z, D, \Lambda, B, S, \mathbf{p}, \tau, \boldsymbol{\pi}, v, \gamma) = p(X \mid W, Z, D, \Lambda, B, S, \gamma)\, p(W)\, p(Z)\, p(\tau)\, p(\Lambda \mid \tau)\, p(D \mid \mathbf{p})\, p(\mathbf{p})\, p(\boldsymbol{\pi})\, p(B \mid \boldsymbol{\pi})\, p(v)\, p(S \mid v)\, p(\gamma).$
VB inference is employed to approximate the posterior distribution $p(W, Z, D, \Lambda, B, S, \mathbf{p}, \tau, \boldsymbol{\pi}, v, \gamma \mid X)$. Bayesian robust PCA is robust to different noise levels without changing the model hyperparameters, and it exploits additional structure of the matrix in video applications.
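A forward-sampling sketch of this generative model is given below; the hyperparameter values are arbitrary illustrations, and the support matrix $B$ is sampled elementwise as a simplification of the beta-Bernoulli construction over $\pi_k$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, r = 30, 40, 5
a0, b0, c0, d0, e0, f0 = 1.0, 1.0, 1.0, 1.0, 1.0, 1.0   # illustrative Gamma hyperparameters
alpha0, beta0 = 1.0, 1.0                                 # Beta hyperparameters for p_i

tau = rng.gamma(a0, 1.0 / b0)        # precision of the diagonal entries lambda_ii
v = rng.gamma(c0, 1.0 / d0)          # precision of the sparse component S
gamma = rng.gamma(e0, 1.0 / f0)      # precision of the dense noise E
p = rng.beta(alpha0, beta0, size=r)  # probabilities of the binary diagonal entries d_ii

W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, r))          # w_i ~ N(0, I_d / d)
Z = rng.normal(0.0, 1.0 / np.sqrt(r), size=(r, N))          # z_j ~ N(0, I_r / r)
D = np.diag(rng.binomial(1, p).astype(float))               # d_ii ~ Bern(p_i)
Lam = np.diag(rng.normal(0.0, 1.0 / np.sqrt(tau), size=r))  # lambda_ii ~ N(0, tau^{-1})

# Simplified sparse support; the paper's beta-Bernoulli structure is richer.
B = rng.binomial(1, 0.05, size=(d, N)).astype(float)
S = rng.normal(0.0, 1.0 / np.sqrt(v), size=(d, N))          # s_j ~ N(0, v^{-1} I_d)
E = rng.normal(0.0, 1.0 / np.sqrt(gamma), size=(d, N))      # E_ij ~ N(0, gamma^{-1})

X = W @ (D * Lam) @ Z + B * S + E                           # Equation (101)
```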

6.2. Variational Bayes Approach to Robust PCA

In Equation (101), if both $D$ and $\Lambda$ are set to be identity matrices, then we have another form of stable robust PCA:
$X = WZ + B\,{.*}\,S + E.$
Essentially, Equations (101) and (103) are equivalent. Formally, Equation (103) can also be rewritten as:
$X = WZ + B\,{.*}\,S + (\mathbf{1} - B)\,{.*}\,E.$
We assume that $\mathbf{w}_i \sim \mathcal{N}(\boldsymbol{\mu}_0, \sigma_0^2 I_d)$, $\mathbf{z}_j \sim \mathcal{N}(\mathbf{v}_0, \sigma_0^2 I_N)$ and $B_{ij} \sim \mathrm{Bern}(b_0)$. Let the means of $S_{ij}$ and $E_{ij}$ be zero and their precisions be $\tau_S$ and $\tau_E$, respectively. The priors of $\tau_S$ and $\tau_E$ are given by
$\tau_S \sim \mathrm{Gam}(\alpha_S, \beta_S), \quad \tau_E \sim \mathrm{Gam}(\alpha_E, \beta_E).$
A naïve VB approach was proposed in [23]. The trial distribution is stipulated as
$q(W, Z, B, \tau_S, \tau_E) = q(W)\, q(Z)\, q(B)\, q(\tau_S)\, q(\tau_E).$
The likelihood function is constructed as
$L = p(X \mid W, Z, B, \tau_S, \tau_E) = \prod_{i,j} \mathcal{N}(X_{ij} - \mathbf{w}_i \mathbf{z}_j \mid 0, \tau_S^{-1})^{B_{ij}}\, \mathcal{N}(X_{ij} - \mathbf{w}_i \mathbf{z}_j \mid 0, \tau_E^{-1})^{1 - B_{ij}}.$
The prior distribution is represented by
$\pi(W, Z, B, \tau_S, \tau_E) = p(W \mid \boldsymbol{\mu}_0, \sigma_0^2 I_d)\, p(Z \mid \mathbf{v}_0, \sigma_0^2 I_N)\, p(B)\, p(\tau_S)\, p(\tau_E).$
We first construct the function $G(q) = \mathbb{E}_q[\log L] - D_{\mathrm{KL}}(q \,\|\, \pi)$. To simplify the problem, we let
$q(\mathbf{w}_i) = \mathcal{N}(\boldsymbol{\mu}_{w_i}, \Sigma_{w_i}), \quad q(\mathbf{z}_j) = \mathcal{N}(\boldsymbol{\mu}_{z_j}, \Sigma_{z_j}).$
To find the updates for $W$ and $Z$, we maximize $G$ with respect to $\boldsymbol{\mu}_{w_i}, \Sigma_{w_i}, \boldsymbol{\mu}_{z_j}, \Sigma_{z_j}$, respectively. The main advantage of the proposed approach is that it can incorporate additional prior information and cope with missing data.
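For instance, the update of $q(B)$ takes a simple "responsibility" form: each entry weighs how well the residual is explained under the sparse-noise precision versus the dense-noise precision. The sketch below uses point estimates of $W$, $Z$, $\tau_S$ and $\tau_E$ in place of the full variational expectations, which is a simplification of the actual VB update.

```python
import numpy as np
from scipy.stats import norm

def update_qB(X, W, Z, tau_S, tau_E, b0):
    """Approximate posterior probability that each entry belongs to the sparse component."""
    R = X - W @ Z                                    # residuals X_ij - w_i z_j
    log_s = np.log(b0) + norm.logpdf(R, scale=1.0 / np.sqrt(tau_S))        # B_ij = 1
    log_e = np.log(1.0 - b0) + norm.logpdf(R, scale=1.0 / np.sqrt(tau_E))  # B_ij = 0
    return 1.0 / (1.0 + np.exp(log_e - log_s))       # q(B_ij = 1)
```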

6.3. Sparse Bayesian Robust PCA

In Equation (101), we replace $B\,{.*}\,S$ by $S$ for the sake of simplicity. This gives the model of sparse Bayesian robust PCA [19]. Assume that $\mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, \gamma_k^{-1} I_d)$, $\mathbf{z}_k \sim \mathcal{N}(\mathbf{0}, \gamma_k^{-1} I_N)$, $S_{ij} \sim \mathcal{N}(0, \alpha_{ij}^{-1})$ and $E_{ij} \sim \mathcal{N}(0, \beta^{-1})$, where $k = 1, 2, \ldots, r$, $i = 1, 2, \ldots, d$, $j = 1, 2, \ldots, N$. The priors of $\gamma_k$ are given by $\gamma_k \sim \mathrm{Gam}(a, b)$. Moreover, we assign Jeffreys priors to $\alpha_{ij}$ and $\beta$:
$p(\beta) = \beta^{-1}, \quad p(\alpha_{ij}) = \alpha_{ij}^{-1}, \quad i = 1, 2, \ldots, d, \; j = 1, 2, \ldots, N.$
The joint distribution is expressed as
$p(X, W, Z, S, \gamma, \alpha, \beta) = p(X \mid W, Z, S, \beta)\, p(W \mid \gamma)\, p(Z \mid \gamma)\, p(S \mid \alpha)\, p(\gamma)\, p(\alpha)\, p(\beta).$
VB inference was used to approximate the posterior distributions of all variables [19]. Experimental results on video background/foreground separation show that the proposed method is more effective than MAP estimation and Gibbs sampling.

7. Probabilistic Models of Non-Negative Matrix Factorization

Non-negative matrix factorization (NMF) decomposes a non-negative data matrix into the product of two non-negative low-rank matrices. Mathematically, we can formulate NMF as follows:
X = W Z + E
where $X_{ij}$, $W_{ik}$ and $Z_{kj}$ are non-negative. Multiplicative algorithms [96] are often used to obtain point estimates of both $W$ and $Z$. This section will introduce probabilistic models of NMF.
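For reference, before turning to the probabilistic treatments, the classical multiplicative updates of [96] for the squared-error objective $\|X - WZ\|_F^2$ can be sketched in a few lines:

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-9):
    """Lee-Seung multiplicative updates minimizing ||X - WZ||_F^2 (sketch)."""
    d, N = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((d, r)) + eps          # non-negative initialization
    Z = rng.random((r, N)) + eps
    for _ in range(n_iter):
        Z *= (W.T @ X) / (W.T @ W @ Z + eps)   # update Z, keeping non-negativity
        W *= (X @ Z.T) / (W @ Z @ Z.T + eps)   # update W
    return W, Z
```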

7.1. Probabilistic Non-Negative Matrix Factorization

Equation (112) can be rewritten as $X \approx WZ = \sum_{k=1}^{r} \mathbf{w}_k \mathbf{z}_k$, where $\mathbf{w}_k$ is the $k$-th column of $W$ and $\mathbf{z}_k$ is the $k$-th row of $Z$. Let $\theta_k = \{\mathbf{w}_k, \mathbf{z}_k\}$ and $\theta = \{\theta_1, \theta_2, \ldots, \theta_r\}$. Probabilistic non-negative matrix factorization [24] introduces a generative model:
$X_{ij} = \sum_{k=1}^{r} C_{k,ij}, \quad C_{k,ij} \sim p(C_{k,ij} \mid \theta_k).$
The probability distributions of $C_{k,ij}$ can be assumed to be Gaussian or Poisson because both families are closed under summation. This assumption means that we can easily obtain the probability distribution of $X_{ij}$. Four algorithms were proposed in [24], namely multiplicative, EM, Gibbs sampling and VB algorithms.
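The "closed under summation" property is what makes the marginal of $X_{ij}$ tractable; for example, a sum of independent Poisson variables is again Poisson with the rates adding up. A brief numerical illustration (with arbitrary component rates):

```python
import numpy as np

rng = np.random.default_rng(0)
rates = np.array([0.5, 1.2, 2.3])                   # hypothetical rates of the components C_k,ij
C = rng.poisson(rates, size=(100_000, rates.size))  # C_k,ij ~ Poiss(rate_k)
X = C.sum(axis=1)                                   # X_ij = sum_k C_k,ij
print(X.mean(), X.var(), rates.sum())               # mean and variance both approach sum(rates)
```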

7.2. Bayesian Inference for Nonnegative Matrix Factorization

For arbitrary $k \in \{1, 2, \ldots, r\}$, Bayesian non-negative matrix factorization [25] introduces variables $S_k = \{S_{ikj} \mid i = 1, 2, \ldots, d,\; j = 1, 2, \ldots, N\}$ as latent sources. The hierarchical model of $X_{ij}$ is given by
$S_{ikj} \sim \mathrm{Poiss}(W_{ik} Z_{kj}), \quad X_{ij} = \sum_{k=1}^{r} S_{ikj}.$
In view of the fact that the Gamma distribution is the conjugate prior of the Poisson distribution, the priors of $W_{ik}$ and $Z_{kj}$ are chosen as follows:
$W_{ik} \sim \mathrm{Gam}(a_{ik}^W, b_{ik}^W), \quad Z_{kj} \sim \mathrm{Gam}(a_{kj}^Z, b_{kj}^Z).$
Let $\Theta = \{a_{ik}^W, b_{ik}^W, a_{kj}^Z, b_{kj}^Z\}$ and $S = \{S_1, S_2, \ldots, S_r\}$. For given $\Theta$ and $X$, the posterior distribution is expressed as $p(W, Z, S \mid X, \Theta)$. We assume the trial distribution of $p(W, Z, S \mid X, \Theta)$ factorizes as $q(W, Z, S) = q(W)\, q(Z)\, q(S)$. Both VB inference and Gibbs sampling were proposed to infer the probability distributions of all variables [25].
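Thanks to the Poisson–Gamma conjugacy, a Gibbs sampler for this model alternates between a multinomial allocation of each count $X_{ij}$ across the $r$ sources and Gamma updates of the factors. The sketch below shows a single sweep with shared shape/rate hyperparameters, which is our own simplification of the element-wise priors above.

```python
import numpy as np

def gibbs_sweep(X, W, Z, a_w=1.0, b_w=1.0, a_z=1.0, b_z=1.0, rng=None):
    """One Gibbs sweep for Poisson NMF with Gamma priors (sketch, shared hyperparameters)."""
    rng = np.random.default_rng() if rng is None else rng
    d, N = X.shape
    r = W.shape[1]
    S = np.zeros((d, r, N))
    for i in range(d):
        for j in range(N):
            p = W[i, :] * Z[:, j]
            # Allocate the count X_ij across the r sources: S_i.j ~ Mult(X_ij, p / p.sum())
            S[i, :, j] = rng.multinomial(int(X[i, j]), p / p.sum())
    # Conjugate Gamma updates (numpy uses shape and scale = 1 / rate)
    W = rng.gamma(a_w + S.sum(axis=2), 1.0 / (b_w + Z.sum(axis=1)[None, :]))
    Z = rng.gamma(a_z + S.sum(axis=0), 1.0 / (b_z + W.sum(axis=0)[:, None]))
    return W, Z
```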
The Bayesian nonnegative matrix factorization above does not provide a matrix factorization counterpart of latent Dirichlet allocation. To fill this gap, another Bayesian extension of the nonnegative matrix factorization algorithm was proposed in [27]. Moreover, Paisley et al. also provided a correlated nonnegative matrix factorization based on the correlated topic model [27]. Stochastic variational inference algorithms were presented to solve the two proposed models.

7.3. Bayesian Nonparametric Matrix Factorization

Gamma process nonnegative matrix factorization (GaP-NMF) was developed in [26]. This Bayesian nonparametric matrix factorization considers the case where the number of sources $r$ is unknown. Let the non-negative hidden variable $\theta_k$ denote the overall gain of the $k$-th source, and let $L$ be a large truncation level on the number of sources. We assume that
$W_{ik} \sim \mathrm{Gam}(a, a), \quad Z_{kj} \sim \mathrm{Gam}(b, b), \quad X_{ij} \sim \mathrm{Exp}\Big(\sum_{k=1}^{L} \theta_k W_{ik} Z_{kj}\Big), \quad \theta_k \sim \mathrm{Gam}(\alpha/L, \alpha c).$
The posterior distribution is expressed as $p(W, Z, \theta \mid X, a, b, \alpha, c)$ for given hyperparameters $a, b, \alpha, c$. The trial distribution of $\theta, W, Z$ is assumed to factorize, that is, $q(W, Z, \theta) = q(W)\, q(Z)\, q(\theta)$. Flexible generalized inverse Gaussian distributions are imposed on $W_{ik}$, $Z_{kj}$ and $\theta_k$, respectively:
$q(W_{ik}) = \mathrm{GIG}(\gamma_{ik}^W, \rho_{ik}^W, \tau_{ik}^W), \quad q(Z_{kj}) = \mathrm{GIG}(\gamma_{kj}^Z, \rho_{kj}^Z, \tau_{kj}^Z), \quad q(\theta_k) = \mathrm{GIG}(\gamma_k^{\theta}, \rho_k^{\theta}, \tau_k^{\theta}).$
The lower bound of the marginal likelihood is computed as
$\log p(X \mid a, b, \alpha, c) \geq \mathbb{E}_q[\log p(X \mid W, Z, \theta)] + \mathbb{E}_q[\log p(W \mid a)] - \mathbb{E}_q[\log q(W)] + \mathbb{E}_q[\log p(Z \mid b)] - \mathbb{E}_q[\log q(Z)] + \mathbb{E}_q[\log p(\theta \mid \alpha, c)] - \mathbb{E}_q[\log q(\theta)]$
By maximizing this lower bound, we obtain the approximate posterior $q(W, Z, \theta)$. GaP-NMF has been applied to recorded music, where the number of latent sources is discovered automatically.

7.4. Beta-Gamma Non-Negative Matrix Factorization

For the moment, we do not factorize the data matrix $X$ directly into the product of two low-rank matrices. Different from the previous probabilistic models of NMF, we assume that $X_{ij}$ is generated from a Beta distribution $\mathrm{Beta}(A_{ij}, B_{ij})$. The two matrices $A$ and $B$ are jointly factorized as
$A \approx CH, \quad B \approx DH$
where $C \in \mathbb{R}_{+}^{d \times r}$, $D \in \mathbb{R}_{+}^{d \times r}$ and $H \in \mathbb{R}_{+}^{r \times N}$.
In Beta-Gamma non-negative matrix factorization [28], the generative model is given by
$X_{ij} \sim \mathrm{Beta}\Big(\sum_{k=1}^{r} C_{ik} H_{kj}, \sum_{k=1}^{r} D_{ik} H_{kj}\Big), \quad C_{ik} \sim \mathrm{Gam}(\mu_0, \alpha_0), \quad D_{ik} \sim \mathrm{Gam}(\upsilon_0, \beta_0), \quad H_{kj} \sim \mathrm{Gam}(\rho_0, \varsigma_0)$
A variational inference framework was adopted, and a new lower bound was proposed to approximate the objective function and derive an analytically tractable approximate solution of the posterior distribution. Beta-Gamma non-negative matrix factorization has been applied to source separation, collaborative filtering and cancer epigenomics analysis.
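A forward-sampling sketch of this generative model (with arbitrary illustrative hyperparameters, and using the shape/rate Gamma convention of Table 1) is:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, r = 20, 30, 4
mu0, alpha0, nu0, beta0, rho0, sigma0 = 2.0, 1.0, 2.0, 1.0, 2.0, 1.0  # illustrative values

C = rng.gamma(mu0, 1.0 / alpha0, size=(d, r))   # C_ik ~ Gam(mu0, alpha0)
D = rng.gamma(nu0, 1.0 / beta0, size=(d, r))    # D_ik ~ Gam(nu0, beta0)
H = rng.gamma(rho0, 1.0 / sigma0, size=(r, N))  # H_kj ~ Gam(rho0, sigma0)

A = C @ H                       # first Beta shape parameters
B = D @ H                       # second Beta shape parameters
X = rng.beta(A, B)              # X_ij ~ Beta(sum_k C_ik H_kj, sum_k D_ik H_kj)
print(X.min(), X.max())         # entries lie in (0, 1), i.e., bounded support data
```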

8. Other Probabilistic Models of Low-Rank Matrix/Tensor Factorizations

Besides the probabilistic models discussed in the foregoing sections, there are many other types of probabilistic low-rank matrix factorization models. A successful application of probabilistic low-rank matrix factorization is collaborative filtering in recommendation systems. In collaborative filtering, there are several other Poisson models in which the observations are modeled with a Poisson distribution; these models mainly include [97,98,99,100,101,102,103,104,105]. As a matter of fact, Poisson factorization is rooted in non-negative matrix factorization and exploits the sparse nature of user behavior data, which allows it to scale well [103]. In some probabilistic models for collaborative filtering, the Poisson distribution is replaced by other probability distributions; such variants involve the logistic function [106,107,108], the Heaviside step function [107,109], the Gaussian cumulative distribution function [110] and so on. In addition, side information on the low-dimensional latent representations has been integrated into probabilistic low-rank matrix factorization models [111,112,113], and the case where the data are missing not at random has also been taken into consideration [109,114,115].
Other applications of probabilistic low-rank matrix factorization models are also worth attention. For instance, [116] developed a probabilistic model for low-rank subspace clustering. In [88], a sparse additive matrix factorization was proposed via a Bayesian regularization effect, and the corresponding model was applied to a foreground/background video separation problem.
Recently, probabilistic low-rank matrix factorizations have been extended to tensor decompositions (factorizations). Tucker decomposition and CP decomposition are two popular tensor decomposition approaches. The probabilistic Tucker decomposition models mainly include probabilistic Tucker decomposition [34], exponential family tensor factorization [35] and the InfTucker model [36]. Probabilistic Tucker decomposition is closely related to probabilistic PCA. In [35], an integration method was proposed to model heterogeneously attributed data tensors. InfTucker, a tensor-variate latent nonparametric Bayesian model, performs Tucker decomposition in an infinite feature space.
More probabilistic models of tensor factorizations focus on the CP tensor decomposition model. For example, Ermis and Cemgil investigated variational inference for probabilistic latent tensor factorization [37]. Based on hierarchical Dirichlet processes, a Bayesian probabilistic model for unsupervised tensor factorization was proposed [38]. In [39], a novel probabilistic tensor factorization was proposed by extending probabilistic matrix factorization. A probabilistic latent tensor factorization was proposed in [40] to address the task of link pattern prediction. Based on the Pólya-Gamma augmentation strategy and an online expectation maximization algorithm, [41] proposed a scalable probabilistic tensor factorization framework. As a generalization of Poisson matrix factorization, Poisson tensor factorization was presented in [42]. In [43], a Bayesian tensor factorization model was proposed to infer latent group structures from dynamic pairwise interaction patterns. In [44], a Bayesian non-negative tensor factorization model was presented for count-valued tensor data, and scalable inference algorithms were developed. A scalable Bayesian framework for low-rank CP decomposition was presented in [45]; it can analyze both continuous and binary datasets. A zero-truncated Poisson tensor factorization for binary tensors was proposed in [46]. A Bayesian robust tensor factorization, which extends probabilistic stable robust PCA, was proposed in [47]. In [48], the CP factorization was formulated as a hierarchical probabilistic model.

9. Conclusions and Future Work

In this paper, we have made a survey of probabilistic models of low-rank matrix factorizations and related work. To classify the main probabilistic models, we divide low-rank matrix factorizations into several groups, such as PCA, matrix factorizations, robust PCA and non-negative matrix factorization. For each category, we list representative probabilistic models, describe the probability distributions of all random matrices or latent variables, present the corresponding inference methods and compare their similarities and differences. Besides, we further provide an overview of probabilistic models of low-rank tensor factorizations and discuss other probabilistic matrix factorization models.
Although probabilistic low-rank matrix/tensor factorizations have made considerable progress, we still face challenges in both theory and applications. Future research may concern the following aspects:
  • Scalable algorithms to infer the probability distributions and parameters. Although both Gibbs sampling and variational Bayesian inference have their own advantages, they incur a large computational cost on real large-scale problems. A promising future direction is to design scalable algorithms.
  • Constructing new probabilistic models of low-rank matrix factorizations. It is necessary to develop other probabilistic models according to the actual situation. For example, we can consider different types of sparse noise and different probability distributions (including the prior distributions) of low-rank components or latent variables.
  • Probabilistic models of non-negative tensor factorizations. There is not much research on this type of probabilistic model yet. Compared with probabilistic models of tensor factorizations, probabilistic non-negative tensor factorization models are more complex, and inferring their posterior distributions is more difficult.
  • Probabilistic TT format. In contrast to both CP and Tucker decompositions, the tensor-train (TT) format provides stable representations and is formally free from the curse of dimensionality. Hence, a probabilistic model of the TT format would be an interesting research issue.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (No. 61403298, No. 11401457), China Postdoctoral Science Foundation (No. 2017M613087) and the Scientific Research Program Funded by the Shaanxi Provincial Education Department (No. 16JK1435).

Author Contributions

Jiarong Shi reviewed probabilistic models of low-rank matrix factorizations and wrote the manuscript. Xiuyun Zheng wrote the preliminaries. Wei Yang presented literature search and wrote other probabilistic models of low-rank matrix factorizations. All authors were involved in organizing and refining the manuscript. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jolliffe, I. Principal Component Analysis; John Wiley & Sons, Ltd.: Mississauga, ON, Canada, 2002. [Google Scholar]
  2. Golub, G.H.; Reinsch, C. Singular value decomposition and least squares solutions. Numer. Math. 1970, 14, 403–420. [Google Scholar] [CrossRef]
  3. Ke, Q.; Kanade, T. Robust L1 norm factorizations in the presence of outliers and missing data by alternative convex programming. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR); IEEE Computer Society: Washington, DC, USA, 2005. [Google Scholar]
  4. Kwak, N. Principal component analysis based on L1-norm maximization. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1672–1680. [Google Scholar] [CrossRef] [PubMed]
  5. Nie, F.; Huang, H. Non-greedy L21-norm maximization for principal component analysis. arXiv 2016. [Google Scholar]
  6. Candès, E.J.; Li, X.; Ma, Y.; Wright, J. Robust principal component analysis? J. ACM 2011, 58, 11. [Google Scholar] [CrossRef]
  7. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorizations. Nature 1999, 401, 788–791. [Google Scholar] [PubMed]
  8. Kolda, T.G.; Bader, B.W. Tensor decompositions and applications. SIAM Rev. 2009, 51, 455–500. [Google Scholar] [CrossRef]
  9. Candès, E.J.; Recht, B. Exact matrix completion via convex optimization. Found. Comput. Math. 2009, 9, 717–772. [Google Scholar] [CrossRef]
  10. Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B 1999, 21, 611–622. [Google Scholar] [CrossRef]
  11. Bishop, C.M. Variational principal components. In Proceedings of the Ninth International Conference on Artificial Neural Networks, Edinburgh, UK, 7–10 September 1999. [Google Scholar]
  12. Gao, J. Robust L1 principal component analysis and its Bayesian variational inference. Neural Comput. 2008, 20, 55–572. [Google Scholar] [CrossRef] [PubMed]
  13. Luttinen, J.; Ilin, A.; Karhunen, J. Bayesian robust PCA of incomplete data. Neural Process. Lett. 2012, 36, 189–202. [Google Scholar] [CrossRef]
  14. Lim, Y.J.; Teh, Y.W. Variational Bayesian approach to movie rating prediction. In Proceedings of the KDD Cup and Workshop, San Jose, CA, USA, 12 August 2007. [Google Scholar]
  15. Salakhutdinov, R.; Mnih, A. Probabilistic matrix factorization. In Proceedings of the 20th Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–6 December 2007. [Google Scholar]
  16. Salakhutdinov, R.; Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008. [Google Scholar]
  17. Lakshminarayanan, B.; Bouchard, G.; Archambeau, C. Robust Bayesian matrix factorization. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Lauderdale, FL, USA, 11–13 April 2011. [Google Scholar]
  18. Wang, N.; Yao, T.; Wang, J.; Yeung, D.-Y. A probabilistic approach to robust matrix factorization. In Proceedings of the 12th European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012. [Google Scholar]
  19. Babacan, S.D.; Luessi, M.; Molina, R.; Katsaggelos, K. Sparse Bayesian methods for low-rank matrix estimation. IEEE Trans. Signal Process. 2012, 60, 3964–3977. [Google Scholar] [CrossRef]
  20. Wang, N.; Yeung, D.-Y. Bayesian robust matrix factorization for image and video processing. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013. [Google Scholar]
  21. Zhao, Q.; Meng, D.; Xu, Z.; Zuo, W.; Yan, Y. L1-norm low-rank matrix factorization by variational Bayesian method. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 825–839. [Google Scholar] [CrossRef] [PubMed]
  22. Ding, X.; He, L.; Carin, L. Bayesian robust principal component analysis. IEEE Trans. Image Process. 2011, 20, 3419–3430. [Google Scholar] [CrossRef] [PubMed]
  23. Aicher, C. A variational Bayes approach to robust principal component analysis. In SFI REU 2013 Report; University of Colorado Boulder: Boulder, CO, USA, 2013. [Google Scholar]
  24. Févotte, C.; Cemgil, A.T. Nonnegative matrix factorizations as probabilistic inference in composite models. In Proceedings of the IEEE 17th European Conference on Signal Processing, Glasgow, UK, 24–28 August 2009. [Google Scholar]
  25. Cemgil, A.T. Bayesian inference for nonnegative matrix factorization models. Comput. Intell. Neurosci. 2009, 2009, 785152. [Google Scholar] [CrossRef] [PubMed]
  26. Hoffman, M.; Cook, P.R.; Blei, D.M. Bayesian nonparametric matrix factorization for recorded music. In Proceedings of the International Conference on Machine Learning, Washington, DC, USA, 12–14 December 2010. [Google Scholar]
  27. Paisley, J.; Blei, D.; Jordan, M. Bayesian nonnegative matrix factorization with stochastic variational inference. In Handbook of Mixed Membership Models and Their Applications, Chapman and Hall; CRC: Boca Raton, FL, USA, 2014. [Google Scholar]
  28. Ma, Z.; Teschendorff, A.E.; Leijon, A.; Qiao, Y.; Zhang, H.; Guo, J. Variational Bayesian matrix factorizations for bounded support data. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 876–889. [Google Scholar] [CrossRef] [PubMed]
  29. Hackbusch, W.; Kühn, S. A new scheme for the tensor representation. J. Fourier Anal. Appl. 2009, 15, 706–722. [Google Scholar] [CrossRef]
  30. Grasedyck, L.; Hackbusch, W. An introduction to hierarchical (H-) rank and TT-rank of tensors with examples. Comput. Methods Appl. Math. 2011, 11, 291–304. [Google Scholar] [CrossRef]
  31. Oseledets, I.V.; Tyrtyshnikov, E.E. Breaking the curse of dimensionality, or how to use SVD in many dimensions. Soc. Ind. Appl. Math. 2009, 31, 3744–3759. [Google Scholar] [CrossRef]
  32. Oseledets, I.V. Tensor-train decomposition. SIAM J. Sci. Comput. 2011, 33, 2295–2317. [Google Scholar] [CrossRef]
  33. Holtz, S.; Rohwedder, T.; Schneider, R. The alternating linear scheme for tensor optimization in the tensor train format. SIAM J. Sci. Comput. 2012, 34, A683–A713. [Google Scholar] [CrossRef]
  34. Chu, W.; Ghahramani, Z. Probabilistic models for incomplete multi-dimensional arrays. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater, FL, USA, 16–18 April 2009. [Google Scholar]
  35. Hayashi, K.; Takenouchi, T.; Shibata, T.; Kamiya, Y.; Kato, D.; Kunieda, K.; Yamada, K.; Ikeda, K. Exponential family tensor factorizations for missing-values prediction and anomaly detection. In Proceedings of the IEEE 10th International Conference on Data Mining, Sydney, Australia, 13–17 December 2010. [Google Scholar]
  36. Xu, Z.; Yan, F.; Qi, Y. Bayesian nonparametric models for multiway data analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 475–487. [Google Scholar] [CrossRef] [PubMed]
  37. Ermis, B.; Cemgil, A. A Bayesian tensor factorization model via variational inference for link prediction. arXiv 2014. [Google Scholar]
  38. Porteous, I.; Bart, E.; Welling, M. Multi-HDP: A nonparametric Bayesian model for tensor factorization. In Proceedings of the National Conference on Artificial Intelligence, Chicago, IL, USA, 13–17 July 2008. [Google Scholar]
  39. Xiong, L.; Chen, X.; Huang, T.; Schneider, J.G.; Carbonell, J.G. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of the SIAM Data Mining, Columbus, OH, USA, 29 April–1 May 2010. [Google Scholar]
  40. Gao, S.; Denoyer, L.; Gallinari, P.; Guo, J. Probabilistic latent tensor factorizations model for link pattern prediction in multi-relational networks. J China Univ. Posts Telecommun. 2012, 19, 172–181. [Google Scholar] [CrossRef]
  41. Rai, P.; Wang, Y.; Guo, S.; Chen, G.; Dunson, D.; Carin, L. Scalable Bayesian low-rank decomposition of incomplete multiway tensors. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
  42. Schein, A.; Paisley, J.; Blei, D.M.; Wallach, H. Inferring polyadic events with Poisson tensor factorization. In Proceedings of the NIPS 2014 Workshop on Networks: From Graphs to Rich Data, Montreal, QC, Canada, 13 December 2014. [Google Scholar]
  43. Schein, A.; Paisley, J.; Blei, D.M. Bayesian Poisson tensor factorization for inferring multilateral relations from sparse dyadic event counts. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Sydney, Australia, 10–13 August 2015. [Google Scholar]
  44. Hu, C.; Rai, P.; Chen, C.; Harding, M.; Carin, L. Scalable Bayesian non-negative tensor factorization for massive count data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer International Publishing AG: Cham, Switzerland, 2015. [Google Scholar]
  45. Rai, P.; Hu, C.; Harding, M.; Carin, L. Scalable probabilistic tensor factorization for binary and count data. In Proceedings of the 24th International Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
  46. Hu, C.; Rai, P.; Carin, L. Zero-truncated Poisson tensor factorization for massive binary tensors. arXiv 2015. [Google Scholar]
  47. Zhao, Q.; Zhou, G.; Zhang, L.; Cichocki, A.; Amari, S.I. Bayesian robust tensor factorization for incomplete multiway data. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 736–748. [Google Scholar] [CrossRef] [PubMed]
  48. Zhao, Q.; Zhang, L.; Cichocki, A. Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 37, 1751–1763. [Google Scholar] [CrossRef] [PubMed]
  49. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–38. [Google Scholar]
  50. Jeff Wu, C.F. On the convergence properties of the EM algorithm. Ann. Stat. 1983, 11, 95–103. [Google Scholar]
  51. Xu, L.; Jordan, M.I. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Comput. 1995, 8, 129–151. [Google Scholar] [CrossRef]
  52. Hunter, D.R.; Lange, K. A tutorial on MM algorithms. Am. Stat. 2004, 58, 30–37. [Google Scholar] [CrossRef]
  53. Gelman, A.; Carlin, J.; Stern, H.S.; Dunson, D.; Vehtari, A.; Rubin, D. Bayesian Data Analysis; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
  54. Bishop, C. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
  55. Geman, S.; Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 1984, 6, 721–741. [Google Scholar] [CrossRef] [PubMed]
  56. Casella, G.; George, E.I. Explaining the Gibbs sampler. Am. Stat. 1992, 46, 167–174. [Google Scholar]
  57. Gilks, W.R.; Wild, P. Adaptive rejection sampling for Gibbs sampling. J. R. Stat. Soc. Ser. C Appl. Stat. 1992, 41, 337–348. [Google Scholar] [CrossRef]
  58. Liu, J.S. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J. Am. Stat. Assoc. 1994, 89, 958–966. [Google Scholar] [CrossRef]
  59. Gilks, W.R.; Best, N.G.; Tan, K.K.C. Adaptive rejection metropolis sampling within Gibbs sampling. J. R. Stat. Soc. Ser. C Appl. Stat. 1995, 44, 455–472. [Google Scholar] [CrossRef]
  60. Martino, L.; Míguez, J. A generalization of the adaptive rejection sampling algorithm. Stat. Comput. 2010, 21, 633–647. [Google Scholar] [CrossRef] [Green Version]
  61. Martino, L.; Read, J.; Luengo, D. Independent doubly adaptive rejection metropolis sampling within Gibbs sampling. IEEE Trans. Signal Process. 2015, 63, 3123–3138. [Google Scholar] [CrossRef]
  62. Bernardo, J.M.; Smith, A.F.M. Bayesian Theory; John Wiley & Sons: Hoboken, NJ, USA, 2001. [Google Scholar]
  63. Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.; Saul, L.K. An introduction to variational methods for graphical models. Mach. Learn. 1999, 37, 183–233. [Google Scholar] [CrossRef]
  64. Attias, H. A variational Bayesian framework for graphical models. Adv. Neural Inf. Process. Syst. 2000, 12, 209–215. [Google Scholar]
  65. Beal, M.J. Variational Algorithms for Approximate Bayesian Inference; University of London: London, UK, 2003. [Google Scholar]
  66. Smídl, V.; Quinn, A. The Variational Bayes Method in Signal Processing; Springer: New York, NY, USA, 2005. [Google Scholar]
  67. Blei, D.; Jordan, M. Variational inference for Dirichlet process mixtures. Bayesian Anal. 2006, 1, 121–144. [Google Scholar] [CrossRef]
  68. Beal, M.J.; Ghahramani, Z. The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. In Proceedings of the IEEE International Conference on Acoustics Speech & Signal Processing, Toulouse, France, 14–19 May 2006. [Google Scholar]
  69. Schölkopf, B.; Platt, J.; Hofmann, T. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
  70. Tzikas, D.G.; Likas, A.C.; Galatsanos, N.P. The variational approximation for Bayesian inference. IEEE Signal Process. Mag. 2008, 25, 131–146. [Google Scholar] [CrossRef]
  71. Chen, Z.; Babacan, S.D.; Molina, R.; Katsaggelos, A.K. Variational Bayesian methods for multimedia problems. IEEE Trans. Multimed. 2014, 16, 1000–1017. [Google Scholar] [CrossRef]
  72. Fink, D. A Compendium of Conjugate Priors. Available online: https://www.johndcook.com/CompendiumOfConjugatePriors.pdf (accessed on 17 August 2017).
  73. Pearl, J. Evidential reasoning using stochastic simulation. Artif. Intell. 1987, 32, 245–257. [Google Scholar] [CrossRef]
  74. Tierney, L. Markov chains for exploring posterior distributions. Ann. Stat. 1994, 22, 1701–1762. [Google Scholar] [CrossRef]
  75. Besag, J.; Green, P.J.; Hidgon, D.; Mengersen, K. Bayesian computation and stochastic systems. Stat. Sci. 1995, 10, 3–66. [Google Scholar] [CrossRef]
  76. Gilks, W.R.; Richardson, S.; Spiegelhalter, D.J. Markov Chain Monte Carlo in Practice; Chapman and Hall: Suffolk, UK, 1996. [Google Scholar]
  77. Brooks, S.P. Markov chain Monte Carlo method and its application. J. R. Stat. Soc. Ser. D Stat. 1998, 47, 69–100. [Google Scholar] [CrossRef]
  78. Beichl, I.; Sullivan, F. The Metropolis algorithm. Comput. Sci. Eng. 2000, 2, 65–69. [Google Scholar] [CrossRef]
  79. Liu, J.S. Monte Carlo Strategies in Scientific Computing; Springer: Berlin, Germany, 2001. [Google Scholar]
  80. Andrieu, C.; Freitas, N.D.; Doucet, A.; Jordan, M.I. An Introduction to MCMC for machine learning. Mach. Learn. 2003, 50, 5–43. [Google Scholar] [CrossRef]
  81. Von der Linden, W.; Dose, V.; von Toussaint, U. Bayesian Probability Theory: Applications in the Physical Sciences; Cambridge University Press: Cambridge, UK, 2014. [Google Scholar]
  82. Liu, J.S.; Liang, F.; Wong, W.H. The Multiple-try method and local optimization in Metropolis sampling. J. Am. Stat. Assoc. 2000, 95, 121–134. [Google Scholar] [CrossRef]
  83. Martino, L.; Read, J. On the flexibility of the design of multiple try Metropolis schemes. Comput. Stat. 2013, 28, 2797–2823. [Google Scholar] [CrossRef]
  84. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  85. Kullback, S. Information Theory and Statistics; John Wiley & Sons: Hoboken, NJ, USA, 1959. [Google Scholar]
  86. Johnson, D.H.; Sinanovic, S. Symmetrizing the Kullback–Leibler distance. IEEE Trans. Inf. Theory 2001, 9, 96–99. [Google Scholar]
  87. Erven, T.V.; Harremos, P.H. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2013, 60, 3797–3820. [Google Scholar] [CrossRef]
  88. Nakajima, S.; Sugiyama, M.; Babacan, S.D. Variational Bayesian sparse additive matrix factorization. Mach. Learn. 2013, 92, 319–347. [Google Scholar] [CrossRef]
  89. Nakajima, S.; Sugiyama, M.; Babacan, S.D. Global analytic solution of fully-observed variational Bayesian matrix factorization. J. Mach. Learn. Res. 2013, 14, 1–37. [Google Scholar]
  90. Paisley, J.; Blei, D.; Jordan, M. Variational Bayesian inference with stochastic search. arXiv 2012. [Google Scholar]
  91. Hoffman, M.; Blei, D.; Wang, C.; Paisley, J. Stochastic variational inference. J Mach. Learn. Res. 2013, 14, 1303–1347. [Google Scholar]
  92. Tipping, M.E.; Bishop, C.M. Mixtures of probabilistic principal component analyzers. Neural Comput. 1999, 11, 443–482. [Google Scholar] [CrossRef] [PubMed]
  93. Khan, M.E.; Young, J.K.; Matthias, S. Scalable collaborative Bayesian preference learning. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–24 April 2014. [Google Scholar]
  94. Srebro, N.; Tommi, J. Weighted low-rank approximations. In Proceedings of the International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003. [Google Scholar]
  95. Zhou, Z.; Li, X.; Wright, J.; Candès, E.J.; Ma, Y. Stable principal component pursuit. In Proceedings of the IEEE International Symposium on Information Theory, Austin, TX, USA, 13–18 June 2010. [Google Scholar]
  96. Lee, D.D.; Seung, H.S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2011. [Google Scholar]
  97. Ma, H.; Liu, C.; King, I.; Lyu, M.R. Probabilistic factor models for web site recommendation. In Proceedings of the 34th international ACM SIGIR Conference on Research and Development in Information Retrieval, Beijing, China, 24–28 July 2011. [Google Scholar]
  98. Seeger, M.; Bouchard, G. Fast variational Bayesian inference for non-conjugate matrix factorization models. J. Mach. Learn. Res. Proc. Track 2012, 22, 1012–1018. [Google Scholar]
  99. Hoffman, M. Poisson-uniform nonnegative matrix factorization. In Proceedings of the Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012. [Google Scholar]
  100. Gopalan, P.; Hofman, J.M.; Blei, D.M. Scalable recommendation with Poisson factorization. arXiv 2014. [Google Scholar]
  101. Gopalan, P.; Ruiz, F.; Ranganath, R.; Blei, D. Bayesian nonparametric Poisson factorization for recommendation systems. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014. [Google Scholar]
  102. Gopalan, P.; Charlin, L.; Blei, D. Content-based recommendations with Poisson factorization. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
  103. Gopalan, P.; Ruiz, F.; Ranganath, R.; Blei, D. Bayesian nonparametric Poisson factorization for recommendation systems. J. Mach. Learn. Res. 2014, 33, 275–283. [Google Scholar]
  104. Gopalan, P.; Hofman, J.M.; Blei, D. Scalable Recommendation with hierarchical Poisson factorization. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Amsterdam, The Netherlands, 6–10 July 2015. [Google Scholar]
  105. Gopalan, P.; Hofman, J.; Blei, D. Scalable recommendation with Poisson factorization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, Amsterdam, The Netherlands, 6–10 July 2015. [Google Scholar]
  106. Ma, H.; Yang, H.; Lyu, M.R.; King, I. SoRec: Social recommendation using probabilistic matrix factorization. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA, USA, 26–30 October 2008. [Google Scholar]
  107. Paquet, U.; Koenigstein, N. One-class collaborative filtering with random graphs. In Proceedings of the 22nd International Conference on World Wide Web, ACM, Rio de Janeiro, Brazil, 13–17 May 2013. [Google Scholar]
  108. Liu, Y.; Wu, M.; Miao, C. Neighborhood regularized logistic matrix factorization for drug-target interaction prediction. PLoS Comput. Biol. 2016, 12, e1004760. [Google Scholar] [CrossRef] [PubMed]
  109. Hernández-Lobato, J.M.; Houlsby, N.; Ghahramani, Z. Probabilistic matrix factorization with non-random missing data. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014. [Google Scholar]
  110. Koenigstein, N.; Nice, N.; Paquet, U.; Schleyen, N. The Xbox recommender system. In Proceedings of the Sixth ACM Conference on Recommender Systems, Dublin, Ireland, 9–13 September 2012. [Google Scholar]
  111. Shan, H.; Banerjee, A. Generalized probabilistic matrix factorizations for collaborative filtering. In Proceedings of the Data Mining (ICDM), Sydney, Australia, 13–17 December 2010. [Google Scholar]
  112. Zhou, T.; Shan, H.; Banerjee, A.; Sapiro, G. Kernelized probabilistic matrix factorization: Exploiting graphs and side information. In Proceedings of the 2012 SIAM International Conference on Data Mining, Anaheim, CA, USA, 26–28 April 2012. [Google Scholar]
  113. Gonen, M.; Suleiman, K.; Samuel, K. Kernelized Bayesian matrix factorization. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
  114. Marlin, B.M.; Zemel, R.S. Collaborative prediction and ranking with non-random missing data. In Proceedings of the Third ACM Conference on Recommender Systems, New York, NY, USA, 23–25 October 2009. [Google Scholar]
  115. Bolgár, B.; Péter, A. Bayesian matrix factorization with non-random missing data using informative Gaussian process priors and soft evidences. In Proceedings of the Eighth International Conference on Probabilistic Graphical Models, Lugano, Switzerland, 6–9 September 2016. [Google Scholar]
  116. Babacan, S.D.; Nakajima, S.; Do, M. Probabilistic low-rank subspace clustering. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
Figure 1. Relationships among several probability distributions.
Table 1. Commonly-used probability distributions.
Probability Distribution | Notation | Probability Density/Mass Function | Expectation | Variance/Covariance
Bernoulli distribution Bern ( x | μ ) μ x ( 1 μ ) 1 x , x { 0 , 1 } μ μ ( 1 μ )
Poisson distribution Poiss ( x | λ ) λ x x ! exp ( λ ) , x = 0 , 1 , λ λ
Uniform distribution U ( x | a , b ) 1 b a , x ( a , b ) a + b 2 ( b a ) 2 12
Multivariate Gaussian distribution N ( x | μ , Σ ) 1 ( 2 π ) d / 2 | Σ | 1 / 2 exp ( 1 2 ( x μ ) T Σ 1 ( x μ ) ) , Σ is a d × d symmetric, positive definite matrix μ Σ
Exponential distribution Exp ( x | λ ) λ exp ( λ x ) , x > 0 1 λ 1 λ 2
Laplace distribution Lap ( x | μ , σ ) 1 2 σ exp ( | x μ | σ ) μ σ 2
Gamma distribution Gam ( x | a , b ) 1 Γ ( a ) b a x a 1 exp ( b x ) , x > 0 a b a b 2
Inverse-Gamma distribution IGam ( x | a , b ) 1 Γ ( a ) b a x a 1 exp ( b x ) , x > 0 b a 1 for a > 1 b 2 ( a 1 ) 2 ( a 2 ) for a > 2
Student’s t-distribution St ( x | μ , λ , ν ) Γ ( ( v + 1 ) / 2 ) Γ ( v / 2 ) ( λ π v ) 1 / 2 ( 1 + λ ( x μ ) 2 v ) ( v + 1 ) / 2 μ v λ ( v 2 ) for v > 2
Beta distribution Beta ( x | a , b ) Γ ( a + b ) Γ ( a ) Γ ( b ) x a 1 ( 1 x ) b 1 , x [ 0 , 1 ] a a + b a b ( a + b ) 2 ( a + b + 1 )
Wishart distribution W ( Λ | W , v ) B ( W , v ) | Λ | ( ν d 1 ) / 2 exp ( 1 2 Tr ( W 1 Λ ) ) , Λ is a d × d symmetric, positive definite matrix v W v ( W i j 2 + W i i W j j ) for Λ i j
Inverse Gaussian distribution IG ( x | μ , λ ) λ 2 π x 3 exp ( λ ( x μ ) 2 2 μ 2 x ) ,   x > 0 μ μ 3 λ
Generalized inverse Gaussian distribution GIG ( x | p , a , b ) ( a / b ) p / 2 x p 1 2 K P ( a b ) exp ( 1 2 ( a x + b x ) ) ,   x > 0 b K p + 1 ( c ) a K p ( c ) , c = a b b K p + 2 ( c ) a K p ( c ) b K p + 1 2 ( c ) a K p 2 ( c )
Table 2. Probabilistic models of Principal Component Analysis (PCA) with the factorization x = W z + μ + ε .
Probabilistic Model | Deterministic Variables | Random Variables | Prior Distributions | Solving Strategy
Probabilistic PCA [10] W , μ ε ~ N ( 0 , σ 2 I d ) , z ~ N ( 0 , I r ) . - ML, EM
Bayesian PCA [11]- ε ~ N ( 0 , σ 2 I d ) , z ~ N ( 0 , I r ) , w i ~ N ( 0 , α i 1 I d ) , μ ~ N ( 0 , β 1 I ) . p ( α ) = i = 1 r Gam ( α i | a 0 , b 0 ) , p ( τ ) = Gam ( τ | c 0 , d 0 ) , τ = σ 2 . VB
Robust L1 PCA [12] W , μ ( μ = 0 ) ε i ~ N ( 0 , ( σ 2 / β i ) I d ) , z ~ N ( 0 , I d ) . β i ~ IGam ( 1 , 1 / 2 ) , p ( ρ ) = Gam ( ρ | a , b ) , ρ = 1 / σ 2 . VB
Probabilistic factor analysis [13]- ε ~ N ( 0 , σ 2 I d ) , z ~ N ( 0 , I r ) , w i j ~ N ( 0 , τ i 1 α j 1 ) , μ i ~ N ( 0 , τ i 1 β 1 ) . τ i ~ Gam ( a 0 , b 0 ) , α j ~ Gam ( c 0 , d 0 ) , β ~ Gam ( e 0 , f 0 ) . VB
Bayesian robust PCA I [13] - ε k ~ N ( 0 , diag ( τ . 1 * u k 1 ) ) , z ~ N ( 0 , I r ) , w i j ~ N ( 0 , τ i 1 β 1 ) , μ i ~ N ( 0 , τ i 1 β 1 ) . p ( u k | v ) = Gam ( u k | v / 2 , v / 2 ) , τ i ~ Gam ( a 0 , b 0 ) , α j ~ Gam ( c 0 , d 0 ) , β ~ Gam ( e 0 , f 0 ) . VB
Bayesian robust PCA II [13] - ε ~ N ( 0 , diag ( τ . 1 * u 1 ) ) , z ~ N ( 0 , I r ) , w i j ~ N ( 0 , τ i 1 α j 1 ) , μ i ~ N ( 0 , τ i 1 β 1 ) . p ( u ) = IGam ( u | 1 , 1 / 2 ) , τ i ~ Gam ( a 0 , b 0 ) , α j ~ Gam ( c 0 , d 0 ) , β ~ Gam ( e 0 , f 0 ) . VB
Table 3. Probabilistic models of matrix factorizations with X = W Z + E .
Probabilistic Model | Random Variables | Prior Distributions | Solving Strategy
PMF [15] E i j ~ N ( 0 , σ 2 ) , P ( W | σ W 2 ) = i = 1 r N ( w i | 0 , σ W 2 I d ) , P ( Z | σ Z 2 ) = j = 1 N N ( z j | 0 , σ Z 2 I r ) . -MAP
Variational Bayesian PMF [14] E i j ~ N ( 0 , σ 2 ) , W i k ~ N ( 0 , σ k 2 ) , Z k j ~ N ( 0 , ρ k 2 ) . -VB
Bayesian PMF [16] E i j ~ N ( 0 , β 1 ) , p ( W | μ W , Λ W ) = i = 1 d N ( w i | μ W , Λ W 1 ) , p ( Z | μ Z , Λ Z ) = j = 1 N N ( z j | μ Z , Λ Z 1 ) . p ( Θ W | Θ 0 ) = N ( μ W | μ 0 , ( β 0 Λ W ) 1 ) W ( Λ W | W 0 , v 0 ) , p ( Θ Z | Θ 0 ) = N ( μ Z | μ 0 , ( β 0 Λ Z ) 1 ) W ( Λ Z | W 0 , v 0 ) . Gibbs sampling
Sparse Bayesian matrix completion [19] E i j ~ N ( 0 , β 1 ) , p ( W | γ ) = i = 1 r N ( w i | γ i 1 I d ) , p ( Z | γ ) = i = 1 r N ( z i | γ i 1 I r ) . γ i ~ Gam ( a , b ) , p ( β ) β 1 . VB
Robust Bayesian matrix factorization [17] E i j ~ N ( 0 , ( τ α i β j ) 1 ) , p ( w i | α i ) = N ( w i | 0 , ( α i Λ w ) 1 ) , p ( z j | β j ) = N ( z j | 0 , ( β j Λ z ) 1 ) . p ( α i ) = Gam ( α i | a 0 / 2 , b 0 / 2 ) , p ( β j ) = Gam ( β j | c 0 / 2 , d 0 / 2 ) . VB, type II ML, empirical Bayes
Probabilistic robust matrix factorization [18] E i j ~ N ( X i j | 0 , T i j ) , p ( W i k | λ W ) = N ( W i k | 0 , λ W 1 ) , p ( Z k j | λ Z ) = N ( Z k j | 0 , λ Z 1 ) . T i j ~ Exp ( λ / 2 ) . EM
Bayesian robust matrix factorization [20] E i j ~ N ( 0 , T i j ) , w i ~ N ( μ W , Λ W 1 ) , z j ~ N ( μ Z , Λ Z 1 ) . T i j ~ Exp ( η i j / 2 ) , Λ W ~ W ( W 0 , v 0 ) , μ W ~ N ( μ 0 , ( β 0 Λ W ) 1 ) , Λ Z ~ W ( W 0 , v 0 ) , μ Z ~ N ( μ 0 , ( β 0 Λ Z ) 1 ) . Gibbs sampling
Bayesian model for L1-norm low-rank matrix factorizations [21] E i j ~ N ( 0 , T i j ) , w i ~ N ( 0 , τ W I r ) , z j ~ N ( 0 , τ Z I r ) . T i j ~ Exp ( λ ) , τ W ~ Gam ( a 0 , b 0 ) , τ Z ~ Gam ( c 0 , d 0 ) . VB
