Article

Simultaneous Bayesian Clustering and Model Selection with Mixture of Robust Factor Analyzers

1 School of Mathematics and Statistics, Northwestern Polytechnical University, Xi’an 710129, China
2 College of Statistics, Xi’an University of Finance and Economics, Xi’an 710100, China
* Authors to whom correspondence should be addressed.
Mathematics 2024, 12(7), 1091; https://doi.org/10.3390/math12071091
Submission received: 29 February 2024 / Revised: 29 March 2024 / Accepted: 2 April 2024 / Published: 4 April 2024
(This article belongs to the Special Issue Bayesian Inference, Prediction and Model Selection)

Abstract:
Finite Gaussian mixture models are powerful tools for modeling distributions of random phenomena and are widely used for clustering tasks. However, their interpretability and efficiency are often degraded by the impact of redundancy and noise, especially on high-dimensional datasets. In this work, we propose a generative graphical model for parsimonious modeling of the Gaussian mixtures and robust unsupervised learning. The model assumes that the data are generated independently and identically from a finite mixture of robust factor analyzers, where the features’ salience is adjusted by an active set of latent factors to allow a violation of the local independence assumption. For the model inference, we propose a structured variational Bayes inference framework to realize simultaneous clustering, model selection and outlier processing. Performance of the proposed algorithm is evaluated by conducting experiments on artificial and real-world datasets. Moreover, an application on the high-dimensional machine learning task of handwritten alphabet recognition is introduced.

1. Introduction

Finite Gaussian mixture models are powerful tools for modeling distributions of random phenomena. They are widely used for unsupervised classification tasks and lay the foundation for many deep learning-based clustering algorithms, e.g., [1,2]. However, competitive performance of the Gaussian mixture model cannot be expected on high-dimensional datasets due to the curse of dimensionality [3]. The impact of redundancy and noise can degrade the model’s interpretability and efficiency, which is crucial in many application fields such as molecular biology and clinical medicine [4]. Since the intrinsic dimension of high-dimensional data is usually much lower than that of the original feature space, it is possible to improve the clustering performance via dimension-reduction methods [5].
Feature-selection approaches are designed to retain a subset of features that are informative and discriminative for clustering. Two-stage methods implement feature selection and clustering separately, treating the preselected features as input without regard to the subsequent clustering algorithm [6,7]. However, since choosing the feature subset and clustering are highly dependent problems, it has been suggested that, to circumvent the loss of information, feature selection be incorporated into the clustering algorithm through an integrated objective function [3,4]. Pan and Shen [8] proposed a penalized likelihood approach for unsupervised feature selection in which an $L_1$ penalty shrinks the component means. A similar approach was suggested in [9], where the feature-selection consistency of the penalization was studied further. However, as both approaches depend heavily on the choice of the penalization parameters, cross-validation or criterion-based model selection is required to tune them.
A different stream of research casts feature selection as a parameter-estimation problem [4,10,11,12], in which the random variable “feature saliency” is introduced to quantify the relevance of each feature to the class assignment. This approach is efficient, as it requires neither a combinatorial search through the feature subsets nor parameter tuning: feature selection and clustering are performed simultaneously in a principled and automatic way. Zhang et al. [13] extended this method to the Student’s t mixture model, which is more tolerant of outliers and therefore more robust for clustering and feature selection. Extending the work of Zhang et al., Sun and Zhou [14] gave the model a fully Bayesian treatment and proposed a structured variational Bayesian (VB) approach, which takes into account the estimation uncertainty of all model parameters and can deliver a tighter bound on the marginal likelihood than mean-field approximated VB algorithms. Their model was extended further in [3] to consider class-specific feature saliency for Bayesian feature selection.
Feature-selection approaches commonly assume that the features are conditionally independent given the latent class variable, which is equivalent to adopting a diagonal component covariance matrix in the Gaussian mixture model. While this assumption greatly facilitates computational efficiency, it is easily violated in real-world datasets [15]. For linear regression analysis, Fan et al. [16] conducted a synthetic study indicating that, when the covariates are highly correlated, exact recovery of the active set from the LASSO solution path is difficult. For classification, as mentioned in [17], ignoring the dependence relationships across features may undermine the reliability of the algorithms and lead to misleading conclusions about the features’ salience.
In [18], the local independence assumption for the Gaussian mixture model was relaxed through a block-diagonal specification of the component covariance matrix, in which the features are partitioned into several disconnected groups in each class. However, as the total number of block-diagonal structures grows as the Bell number [19], searching for the optimal model can be difficult, especially in high-dimensional cases. Galimberti and Soffritti [18] proposed a hierarchical agglomerative strategy based on the BIC criterion to perform a non-exhaustive search over the structures, but this method cannot guarantee finding the optimal model. Ruan et al. [20] extended the graphical LASSO to the context of the Gaussian mixture model and proposed a penalized likelihood approach that yields sparse component covariance matrices, but the penalization parameters still need to be selected.
Unlike the block-diagonal specification, the mixture of factor analyzers assumes a factor analysis-based decomposition of the component covariance matrices, in which the local dependence between features is explained by a few latent factors [21,22,23,24]. Typically, the number of factors for each component must either be specified in advance of model fitting or be chosen with a model-selection criterion. However, while presetting the number may lead to over-fitted or over-simplified models, an exhaustive search over the model space is computationally expensive. Shrinkage-prior methods have been proposed to achieve automatic latent dimension reduction. Wang and Lan [25] imposed the automatic relevance determination prior [26] on the factor loading matrices. Murphy et al. [28] used multiplicative gamma process shrinkage priors [27] in infinite factor analysis; they also suggested an adaptive Gibbs sampling algorithm in which factors with negligible loadings are removed gradually during the iterations, thereby improving the computational efficiency.
In this paper, we further develop the Student’s t mixture model of Sun and Zhou [14] for Bayesian clustering and feature selection to tackle cases where features are correlated within a mixture component. While the model in [14] defines the feature saliency under the local independence assumption, we introduce the factor-adjusted feature saliency, in which the salience of each feature is evaluated conditionally on the latent factors. Taken as a whole, the extension produces parsimonious, flexible and robust modeling for the mixture of factor analyzers. Moreover, motivated by the Bayesian model selection method for linear regression in [29], instead of using shrinkage priors we propose an automatic inference scheme for the number of factors by introducing the random variable of factor activity. The problems of feature selection, latent dimension reduction, outlier processing and clustering can then be integrated as inference for a single Bayesian hierarchical latent variable model. We continue the work in [14] by adopting a fully Bayesian treatment in which proper prior distributions are assumed for the model parameters. A structured VB inference framework that improves the evidence lower bound (ELBO) for the proposed model is presented, in which a “drop-out” sampling technique [30] can be applied directly to ease the computation.
The rest of this paper is organized as follows. Section 2 introduces the Student’s t mixture model proposed by Sun and Zhou [14], which provides the basis of our study. Section 3 develops the proposed mixture of robust factor analyzers, which is presented as a hierarchical latent variable model for Bayesian inference. Section 4 establishes the structured VB inference framework for the proposed model. Section 5 evaluates the performance of the developed model and algorithm on synthetic data and presents evaluation results on several real-world datasets. Section 6 concludes the paper, points out its limitations and suggests future research directions.

2. The Student’s t Mixture Model for Feature Selection

We present the proposed hierarchical latent variable model starting from the Student’s t mixture model defined in [14]. Denote the set of i.i.d. observations as $Y = \{\mathbf{y}_n\}_{n=1}^{N}$, where $\mathbf{y}_n = (y_{n1}, y_{n2}, \ldots, y_{nd})^{T} \in \mathbb{R}^{d}$ is the d-dimensional feature vector of the nth individual. The finite mixture model for clustering assumes that the data for each individual are generated from a class-specific distribution with the class label missing, so that each observation marginally follows a finite mixture distribution. Throughout the paper, we denote the number of mixture components, or equivalently the number of classes, as K. The latent class label of the nth individual is denoted as $z_n$, which takes values in $\{1, 2, \ldots, K\}$. Clustering is realized by assigning each individual to the class to which it has the highest posterior probability of belonging.
The Student’s t mixture model in [14] assumes that the features are conditionally independent given the hidden class label and that each follows a Student’s t distribution. Moreover, the relevance or irrelevance of each feature to data separation is taken into account by introducing the Bernoulli latent variables $\boldsymbol{\phi}_n = (\phi_{n1}, \phi_{n2}, \ldots, \phi_{nd})^{T}$, which gives the mixture density of $\mathbf{y}_n$ as
$p(\mathbf{y}_n \mid \boldsymbol{\phi}_n, \Theta_1) = \sum_{k=1}^{K} \pi_k \prod_{l=1}^{d} \mathrm{St}(y_{nl} \mid \mu_{kl}, \sigma_{kl}, v_{kl})^{\phi_{nl}} \, \mathrm{St}(y_{nl} \mid \mu_{0l}, \sigma_{0l}, v_{0l})^{1-\phi_{nl}}. \qquad (1)$
$\mathrm{St}(y \mid \mu, \sigma, v)$ is the density function of the Student’s t distribution with mean $\mu$, precision $\sigma$ and degrees of freedom $v$. For $l = 1, 2, \ldots, d$, $\phi_{nl} \in \{0, 1\}$: if $\phi_{nl} = 1$, the lth feature is relevant to the class assignment; if $\phi_{nl} = 0$, the lth feature is irrelevant and follows a common distribution independent of the class assignment. For $k = 1, 2, \ldots, K$, the parameter $\pi_k$ ($\pi_k > 0$ and $\sum_{k=1}^{K} \pi_k = 1$) is the mixing proportion of class k. Let $\Theta_1 = \{\pi, \mu, \sigma, v\}$ denote the set of unknown parameters in model (1), where $\pi = (\pi_1, \pi_2, \ldots, \pi_K)^{T}$, $\mu = \{\mu_{0l}, \{\mu_{kl}\}_{k=1}^{K}\}_{l=1}^{d}$, $\sigma = \{\sigma_{0l}, \{\sigma_{kl}\}_{k=1}^{K}\}_{l=1}^{d}$ and $v = \{v_{0l}, \{v_{kl}\}_{k=1}^{K}\}_{l=1}^{d}$.
The prior distribution of ϕ n is given by
$p(\boldsymbol{\phi}_n \mid \boldsymbol{\beta}) = \prod_{l=1}^{d} p(\phi_{nl} \mid \beta_l) = \prod_{l=1}^{d} \beta_l^{\phi_{nl}} (1-\beta_l)^{1-\phi_{nl}}, \qquad (2)$
where the $\phi_{nl}$’s are assumed to be mutually independent. The parameter $\beta_l$ of the Bernoulli distribution of $\phi_{nl}$ is called the feature saliency [4] of the lth feature. It measures the importance of the feature for the class assignment and is estimated to realize a “soft” feature selection. Denote $\boldsymbol{\beta} = \{\beta_l\}_{l=1}^{d}$.
The observed-data likelihood function can be obtained by integrating over the latent variables ϕ n in model (1), which gives
$p(\mathbf{y}_n \mid \Theta_2) = \sum_{k=1}^{K} \pi_k \prod_{l=1}^{d} \left[ \beta_l \, \mathrm{St}(y_{nl} \mid \mu_{kl}, \sigma_{kl}, v_{kl}) + (1-\beta_l) \, \mathrm{St}(y_{nl} \mid \mu_{0l}, \sigma_{0l}, v_{0l}) \right], \qquad (3)$
where $\Theta_2 = \{\pi, \mu, \sigma, v, \boldsymbol{\beta}\}$. Statistical inference directly on the observed-data likelihood is difficult. In [14], the VB inference method was adopted, where the complete-data likelihood is given by
$p(\mathbf{y}_n, \mathbf{u}_n, \boldsymbol{\phi}_n, z_n \mid \Theta_2) = \prod_{k=1}^{K} \left[ \pi_k \, p(\mathbf{y}_n \mid \mathbf{u}_n, \boldsymbol{\phi}_n, z_n = k, \mu, \sigma) \, p(\mathbf{u}_n \mid \boldsymbol{\phi}_n, z_n = k, v) \, p(\boldsymbol{\phi}_n \mid \boldsymbol{\beta}) \right]^{\delta_{z_n,k}}. \qquad (4)$
On the right-hand side of (4), $\delta_{z_n,k}$ is the Kronecker delta. The latent variables $\mathbf{u}_n = (u_{n1}, u_{n2}, \ldots, u_{nd})^{T}$ are introduced by noting that the Student’s t distribution can be written as a convolution of a Gaussian and a Gamma distribution [3,14]. It follows that
$p(\mathbf{y}_n \mid \mathbf{u}_n, \boldsymbol{\phi}_n, z_n = k, \mu, \sigma) = \prod_{l=1}^{d} p(y_{nl} \mid u_{nl}, \phi_{nl}, z_n = k, \mu, \sigma) = \prod_{l=1}^{d} \mathcal{N}(y_{nl} \mid \mu_{kl}, \sigma_{kl} u_{nl})^{\phi_{nl}} \, \mathcal{N}(y_{nl} \mid \mu_{0l}, \sigma_{0l} u_{nl})^{1-\phi_{nl}}, \qquad (5)$
and
$p(\mathbf{u}_n \mid \boldsymbol{\phi}_n, z_n = k, v) = \prod_{l=1}^{d} p(u_{nl} \mid \phi_{nl}, z_n = k, v) = \prod_{l=1}^{d} \mathcal{G}\!\left(u_{nl} \,\middle|\, \tfrac{v_{kl}}{2}, \tfrac{v_{kl}}{2}\right)^{\phi_{nl}} \mathcal{G}\!\left(u_{nl} \,\middle|\, \tfrac{v_{0l}}{2}, \tfrac{v_{0l}}{2}\right)^{1-\phi_{nl}}. \qquad (6)$
$\mathcal{N}(y \mid \mu, \sigma)$ represents the Gaussian density function with mean $\mu$ and precision $\sigma$, and $\mathcal{G}(u \mid a, b)$ is the Gamma density function
$\mathcal{G}(u \mid a, b) = \frac{b^{a} u^{a-1}}{\Gamma(a)} \exp(-b u). \qquad (7)$
Note that by integrating over u n , ϕ n and z n in (4), the observed-data likelihood (3) can be recovered.
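As a concrete illustration of Section 2, the following R sketch (ours, not the authors’ released code) evaluates the observed-data log-likelihood (3) under the precision parameterization of the Student’s t density used above; the names dst and loglik_obs and the layout of theta are placeholders.
```r
# Student's t density with mean mu, precision sigma and degrees of freedom v,
# as used in model (1): obtained by rescaling R's standard dt().
dst <- function(y, mu, sigma, v) sqrt(sigma) * dt((y - mu) * sqrt(sigma), df = v)

# Observed-data log-likelihood (3); theta is an assumed container:
# pi (K), beta (d), mu/sigma/v (K x d matrices), mu0/sigma0/v0 (length-d vectors).
loglik_obs <- function(y, theta) {
  K <- length(theta$pi)
  ll <- 0
  for (n in seq_len(nrow(y))) {
    comp <- numeric(K)
    for (k in seq_len(K)) {
      f_rel <- dst(y[n, ], theta$mu[k, ], theta$sigma[k, ], theta$v[k, ])
      f_irr <- dst(y[n, ], theta$mu0,    theta$sigma0,      theta$v0)
      comp[k] <- log(theta$pi[k]) + sum(log(theta$beta * f_rel + (1 - theta$beta) * f_irr))
    }
    ll <- ll + max(comp) + log(sum(exp(comp - max(comp))))  # log-sum-exp over the K classes
  }
  ll
}
```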

3. Towards the Mixture of Robust Factor Analyzers

To tackle cases where features are correlated within a mixture component, we relax the local independence assumption in [14] by specifying a latent factor model for each class. Specifically, for class k, we introduce the latent factors $\mathbf{x}_{nk} = (x_{nk1}, x_{nk2}, \ldots, x_{nkp_k})^{T}$, where the $x_{nkj}$’s are i.i.d. from the distribution $\mathcal{N}(0, 1)$ and $p_k$ is the number of latent factors. Conditioning on $\mathbf{x}_{nk}$, the features are assumed mutually independent within the class, which corresponds to the following modification of model (1):
$p(\mathbf{y}_n \mid \boldsymbol{\phi}_n, \mathbf{x}_n, \Theta_3) = \sum_{k=1}^{K} \pi_k \prod_{l=1}^{d} \mathrm{St}(y_{nl} \mid \mathbf{w}_{kl}^{T}\mathbf{x}_{nk} + \mu_{kl}, \sigma_{kl}, v_{kl})^{\phi_{nl}} \, \mathrm{St}(y_{nl} \mid \mathbf{w}_{kl}^{T}\mathbf{x}_{nk} + \mu_{0l}, \sigma_{0l}, v_{0l})^{1-\phi_{nl}}, \qquad (8)$
where $\mathbf{w}_{kl} \in \mathbb{R}^{p_k}$ are the factor loadings of the lth feature in class k. Denote $\mathbf{x}_n = \{\mathbf{x}_{nk}\}_{k=1}^{K}$ and $\Theta_3 = \{\pi, \mu, \sigma, v, w\}$, where $w = \{\mathbf{w}_{kl};\, l = 1, \ldots, d,\ k = 1, \ldots, K\}$. In model (8), $\phi_{nl}$ indicates the relevance of the lth feature to the class assignment after adjustment by the latent factors. Correspondingly, $\beta_l$, which defines the distribution of $\phi_{nl}$ in (2), represents the factor-adjusted feature saliency.
Typically, the latent dimension $p_k$ of each local factor model needs to be specified. With an overly high dimension, the model may over-fit the data, yielding poor interpretations and harder computation, while with a low, inadequate dimension, the model may not be flexible enough to capture the correlations between features in each class. To enable automatic determination, we treat the problem as another feature-selection task, where the “features” are now the latent factors. Starting from a sufficiently large $p_k$, we introduce in class k the Bernoulli latent variables $\mathbf{r}_{nk} = (r_{nk1}, r_{nk2}, \ldots, r_{nkp_k})^{T}$, where $r_{nkj} \in \{0, 1\}$, with $r_{nkj} = 1$ indicating that the factor $x_{nkj}$ is active and $r_{nkj} = 0$ that it is inactive. Model (8) then becomes
$p(\mathbf{y}_n \mid \boldsymbol{\phi}_n, \mathbf{x}_n, \mathbf{r}_n, \Theta_3) = \sum_{k=1}^{K} \pi_k \prod_{l=1}^{d} \mathrm{St}(y_{nl} \mid \mathbf{w}_{kl}^{T} R_{nk}\mathbf{x}_{nk} + \mu_{kl}, \sigma_{kl}, v_{kl})^{\phi_{nl}} \, \mathrm{St}(y_{nl} \mid \mathbf{w}_{kl}^{T} R_{nk}\mathbf{x}_{nk} + \mu_{0l}, \sigma_{0l}, v_{0l})^{1-\phi_{nl}}, \qquad (9)$
where $R_{nk} = \mathrm{diag}(\mathbf{r}_{nk})$ and we denote $\mathbf{r}_n = \{\mathbf{r}_{nk}\}_{k=1}^{K}$. When the $r_{nkj}$’s all equal zero, the model reduces to the Student’s t mixture of model (1).
The prior distribution of r n k is given by
$p(\mathbf{r}_{nk} \mid \boldsymbol{\rho}_k) = \prod_{j=1}^{p_k} p(r_{nkj} \mid \rho_{kj}) = \prod_{j=1}^{p_k} \rho_{kj}^{r_{nkj}} (1-\rho_{kj})^{1-r_{nkj}}, \qquad (10)$
where we have assumed prior independence between the entries of $\mathbf{r}_{nk}$. Denote $\boldsymbol{\rho}_k = \{\rho_{kj}\}_{j=1}^{p_k}$ and $\boldsymbol{\rho} = \{\boldsymbol{\rho}_k\}_{k=1}^{K}$. In accordance with the concept of feature saliency, we call $\rho_{kj}$ the factor activity. It is the probability that the jth factor in class k is active. The problem of finding the latent dimensions can then be cast as a parameter-estimation problem, i.e., the estimation of $\boldsymbol{\rho}$.
Our modeling of $\mathbf{r}_{nk}$, $k = 1, 2, \ldots, K$, to select the active factors in each class is inspired by the normal-zero model proposed in [29], which introduces indicators to automatically select the important covariates in linear regression. The difference is that we define the indicators as latent variables for each individual, whereas the normal-zero model introduces the indicators as model parameters. In our model, the parameters $\boldsymbol{\rho}$ that define the Bernoulli distributions of the indicators are the key quantities for model selection and are inferred within the Bayesian framework, while in the normal-zero model $\boldsymbol{\rho}$ are treated as hyper-parameters and typically need to be specified.
Note that the conditional probability of (9) defines a mixture of robust factor analyzers. The factor model for class k can be written as
$\mathbf{y}_n = W_k R_{nk}\mathbf{x}_{nk} + \Phi_n \boldsymbol{\mu}_k + (I - \Phi_n)\boldsymbol{\mu}_0 + \boldsymbol{\varepsilon}_n, \qquad (11)$
where $W_k = [\mathbf{w}_{k1}, \mathbf{w}_{k2}, \ldots, \mathbf{w}_{kd}]^{T}$ is the $d \times p_k$ factor loading matrix. Denote $\boldsymbol{\mu}_k = (\mu_{k1}, \mu_{k2}, \ldots, \mu_{kd})^{T}$, $\boldsymbol{\mu}_0 = (\mu_{01}, \mu_{02}, \ldots, \mu_{0d})^{T}$ and $\Phi_n = \mathrm{diag}(\boldsymbol{\phi}_n)$. The latent factors $\mathbf{x}_{nk}$ follow the distribution $\mathcal{N}(\mathbf{0}, I_{p_k})$, where $I_{p_k}$ is the identity matrix of order $p_k$. The distributions of $\boldsymbol{\phi}_n$ and $\mathbf{r}_{nk}$ are defined in (2) and (10), respectively. Finally, $\boldsymbol{\varepsilon}_n = (\varepsilon_{n1}, \varepsilon_{n2}, \ldots, \varepsilon_{nd})^{T}$, where the $\varepsilon_{nl}$’s are mutually independent given $z_n$ and
$p(\varepsilon_{nl} \mid \phi_{nl}, z_n = k, \sigma, v) = \mathrm{St}(\varepsilon_{nl} \mid 0, \sigma_{kl}, v_{kl})^{\phi_{nl}} \, \mathrm{St}(\varepsilon_{nl} \mid 0, \sigma_{0l}, v_{0l})^{1-\phi_{nl}}. \qquad (12)$
By introducing the latent variable u n l distributed according to (6), we have
$p(\varepsilon_{nl} \mid u_{nl}, \phi_{nl}, z_n = k, \sigma) = \mathcal{N}(\varepsilon_{nl} \mid 0, \sigma_{kl} u_{nl})^{\phi_{nl}} \, \mathcal{N}(\varepsilon_{nl} \mid 0, \sigma_{0l} u_{nl})^{1-\phi_{nl}}. \qquad (13)$
The complete-data likelihood $p(\mathbf{y}_n, \mathbf{u}_n, \boldsymbol{\phi}_n, \mathbf{x}_n, \mathbf{r}_n, z_n \mid \Theta)$ of the proposed hierarchical latent variable model, where $\Theta = \{\pi, \mu, \sigma, v, \beta, w, \rho\}$, can be factorized as
$p(\mathbf{y}_n, \mathbf{u}_n, \boldsymbol{\phi}_n, \mathbf{x}_n, \mathbf{r}_n, z_n \mid \Theta) = \prod_{k=1}^{K} \left[ \pi_k \, p(\mathbf{y}_n \mid \mathbf{u}_n, \boldsymbol{\phi}_n, \mathbf{x}_{nk}, \mathbf{r}_{nk}, z_n = k, \mu, \sigma, w) \, p(\mathbf{u}_n \mid \boldsymbol{\phi}_n, z_n = k, v) \, p(\boldsymbol{\phi}_n \mid \boldsymbol{\beta}) \, p(\mathbf{x}_{nk}) \, p(\mathbf{r}_{nk} \mid \boldsymbol{\rho}_k) \right]^{\delta_{z_n,k}}, \qquad (14)$
where
$p(\mathbf{y}_n \mid \mathbf{u}_n, \boldsymbol{\phi}_n, \mathbf{x}_{nk}, \mathbf{r}_{nk}, z_n = k, \mu, \sigma, w) = \prod_{l=1}^{d} p(y_{nl} \mid u_{nl}, \phi_{nl}, \mathbf{x}_{nk}, \mathbf{r}_{nk}, z_n = k, \mu, \sigma, w) = \prod_{l=1}^{d} \mathcal{N}(y_{nl} \mid \mathbf{w}_{kl}^{T} R_{nk}\mathbf{x}_{nk} + \mu_{kl}, \sigma_{kl} u_{nl})^{\phi_{nl}} \, \mathcal{N}(y_{nl} \mid \mathbf{w}_{kl}^{T} R_{nk}\mathbf{x}_{nk} + \mu_{0l}, \sigma_{0l} u_{nl})^{1-\phi_{nl}}, \qquad (15)$
corresponding to a modification of conditional probability (5) for the Student’s t mixture model.
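To make the generative structure of (11)–(15) concrete, the following R sketch (our illustration, not part of the paper) simulates a single observation by drawing, in turn, the class label, the feature-relevance indicators, the factor-activity indicators, the latent factors and the Student’s t noise through its Gaussian–Gamma representation (6); the containers W, rho, mu, sigma, v (per class) and mu0, sigma0, v0 (common) are assumed names.
```r
simulate_one <- function(pi_k, beta, rho, W, mu, mu0, sigma, sigma0, v, v0) {
  K <- length(pi_k); d <- length(mu0)
  k   <- sample.int(K, 1, prob = pi_k)     # latent class label z_n
  phi <- rbinom(d, 1, beta)                # feature relevance indicators phi_n, prior (2)
  p_k <- ncol(W[[k]])
  r   <- rbinom(p_k, 1, rho[[k]])          # factor activity indicators r_nk, prior (10)
  x   <- rnorm(p_k)                        # latent factors x_nk ~ N(0, I)
  df   <- ifelse(phi == 1, v[k, ], v0)     # Student's t noise via the scale mixture (6), (13)
  u    <- rgamma(d, shape = df / 2, rate = df / 2)
  prec <- ifelse(phi == 1, sigma[k, ], sigma0) * u
  eps  <- rnorm(d, mean = 0, sd = 1 / sqrt(prec))
  mean_vec <- phi * mu[k, ] + (1 - phi) * mu0          # Phi_n mu_k + (I - Phi_n) mu_0
  y <- as.vector(W[[k]] %*% (r * x)) + mean_vec + eps  # model (11)
  list(y = y, z = k, phi = phi, r = r, x = x, u = u)
}
```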
In the following, we denote the set of latent variables as $H = \{\mathbf{h}_n\}_{n=1}^{N}$, where $\mathbf{h}_n = \{\mathbf{u}_n, \boldsymbol{\phi}_n, \mathbf{x}_n, \mathbf{r}_n, z_n\}$. The complete-data likelihood for the whole dataset can then be written as
$p(Y, H \mid \Theta) = \prod_{n=1}^{N} p(\mathbf{y}_n, \mathbf{h}_n \mid \Theta). \qquad (16)$
A full Bayesian treatment of the latent variable model requires specification of the prior distributions of the model parameters. We assume that
$p(\Theta) = p(\boldsymbol{\pi})\, p(\mu)\, p(\sigma)\, p(\boldsymbol{\beta})\, p(w)\, p(\boldsymbol{\rho}), \qquad (17)$
and
$p(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}_0),$
$p(\mu) = \prod_{l=1}^{d} p(\mu_{0l}) \prod_{k=1}^{K} p(\mu_{kl}) = \prod_{l=1}^{d} \mathcal{N}(\mu_{0l} \mid s_{0l}, \lambda_0) \prod_{k=1}^{K} \mathcal{N}(\mu_{kl} \mid s_{0l}, \lambda_0),$
$p(\sigma) = \prod_{l=1}^{d} p(\sigma_{0l}) \prod_{k=1}^{K} p(\sigma_{kl}) = \prod_{l=1}^{d} \mathcal{G}\!\left(\sigma_{0l} \,\middle|\, \tfrac{\eta_0}{2}, \tfrac{\xi_0}{2}\right) \prod_{k=1}^{K} \mathcal{G}\!\left(\sigma_{kl} \,\middle|\, \tfrac{\eta_0}{2}, \tfrac{\xi_0}{2}\right),$
$p(\boldsymbol{\beta}) = \prod_{l=1}^{d} \mathrm{Beta}(\beta_l \mid \kappa_1, \kappa_2),$
$p(w) = \prod_{k=1}^{K} \prod_{l=1}^{d} p(\mathbf{w}_{kl}) = \prod_{k=1}^{K} \prod_{l=1}^{d} \mathcal{N}(\mathbf{w}_{kl} \mid \mathbf{0}, m_0 I_{p_k}),$
$p(\boldsymbol{\rho}) = \prod_{k=1}^{K} \prod_{j=1}^{p_k} \mathrm{Beta}(\rho_{kj} \mid \tau_1, \tau_2), \qquad (18)$
where B e t a ( β | a , b ) represents the Beta density function
$\mathrm{Beta}(\beta \mid a, b) = \frac{\beta^{a-1}(1-\beta)^{b-1}}{B(a, b)}, \qquad (19)$
and
$\mathrm{Dir}(\boldsymbol{\pi} \mid \boldsymbol{\alpha}) = \frac{\Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}, \qquad (20)$
is the Dirichlet density. In the above specifications, the conjugate priors are used. The parameters in the priors, including κ 1 , κ 2 , τ 1 , τ 2 , m 0 , λ 0 , η 0 , ξ 0 , s 0 and α 0 where s 0 = { s 0 l } l = 1 d and α 0 = ( α 01 , α 02 , , α 0 K ) T , are considered as hyperparameters. It is noticeable that we do not assume any prior for the degrees of freedom v 0 l ’s and v k l ’s. Since there are no conjugate priors, we follow the practice in [3,14] to seek for the point estimates for them.
The plate diagram of the proposed hierarchical latent variable model is shown in Figure 1. The arrows in the diagram indicate the dependencies. The original model of Sun and Zhou [14] is depicted in blue.

4. Inference on the Model

4.1. Brief Introduction to VB Method

To infer the posterior distribution of the latent variables and the parameters, computation of the evidence $p(Y)$ is required. However, this computation involves integration over the latent variables and the parameters, which is intractable for our model. In this paper, we therefore resort to the VB method for model inference, which is designed to maximize a lower bound of $\log p(Y)$. Assuming posterior independence between the latent variables and the parameters, the evidence lower bound (ELBO) is defined by
$\mathcal{L}(q(H), q(\Theta), Y) = \mathbb{E}_{q(H)q(\Theta)}\!\left[\log\frac{p(Y, H \mid \Theta)\, p(\Theta)}{q(H)\, q(\Theta)}\right] \le \log p(Y), \qquad (21)$
where q ( H ) and q ( Θ ) are auxiliary posteriors for the latent variables and the parameters, respectively. A coordinate ascent search method [31] can be applied to iteratively maximize the ELBO. At the tth iteration, it implements the VB expectation (VB-E) step and the VB maximization (VB-M) step as follows:
$\text{VB-E step:}\quad q^{(t+1)}(H) = \arg\max_{q(H)} \mathcal{L}(q(H), q^{(t)}(\Theta), Y); \qquad \text{VB-M step:}\quad q^{(t+1)}(\Theta) = \arg\max_{q(\Theta)} \mathcal{L}(q^{(t+1)}(H), q(\Theta), Y). \qquad (22)$
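Schematically, the alternation in (22) is the following coordinate-ascent loop; vb_e_step(), vb_m_step() and elbo() are hypothetical placeholders for the updates derived in Sections 4.3 and 4.4 and the bound evaluated in Appendix A.
```r
run_vb <- function(y, q_H, q_Theta, max_iter = 500, tol = 1e-7) {
  bound <- -Inf
  for (t in seq_len(max_iter)) {
    q_H     <- vb_e_step(y, q_H, q_Theta)   # VB-E: maximize L over q(H) with q(Theta) fixed
    q_Theta <- vb_m_step(y, q_H, q_Theta)   # VB-M: maximize L over q(Theta) with q(H) fixed
    bound_new <- elbo(y, q_H, q_Theta)      # ELBO (21); non-decreasing under coordinate ascent
    if (bound_new - bound < tol) break      # monitor convergence via the bound
    bound <- bound_new
  }
  list(q_H = q_H, q_Theta = q_Theta, elbo = bound_new)
}
```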

4.2. Tree-Like Factorization of the Auxiliary Posterior

In this paper, we apply the tree-like factorization proposed in [3,14] to the auxiliary posterior of the latent variables. The resulting structured VB method can be viewed as a partially collapsed VB [32], which can reach a tighter lower bound on $\log p(Y)$ than the mean-field approximated VB method [13].
As the observations are mutually independent, q ( H ) has the form
$q(H) = \prod_{n=1}^{N} q(\mathbf{h}_n). \qquad (23)$
Tree-like factorization assumes that the auxiliary posterior q ( h n ) can be factorized as
$q(\mathbf{h}_n) = \prod_{k=1}^{K} \left[ q(\mathbf{u}_n \mid \boldsymbol{\phi}_n, z_n = k)\, q(\boldsymbol{\phi}_n)\, q(\mathbf{x}_{nk} \mid \mathbf{r}_{nk})\, q(\mathbf{r}_{nk})\, q(z_n = k) \right]^{\delta_{z_n,k}}. \qquad (24)$
As entries of the noise term in the local factor model are assumed to be mutually independent, q ( h n ) can be further factorized as
$q(\mathbf{h}_n) = \prod_{k=1}^{K} \left\{ \prod_{l=1}^{d} \left[ q(u_{nl} \mid \phi_{nl}=1, z_n=k)\, q(\phi_{nl}=1) \right]^{\phi_{nl}} \left[ q(u_{nl} \mid \phi_{nl}=0)\, q(\phi_{nl}=0) \right]^{1-\phi_{nl}} \times q(\mathbf{x}_{nk} \mid \mathbf{r}_{nk})\, q(\mathbf{r}_{nk})\, q(z_n=k) \right\}^{\delta_{z_n,k}}. \qquad (25)$
Different from [3,14], we do not keep the posterior dependence of $\phi_{nl}$ on $z_n$, and when $\phi_{nl} = 0$ the auxiliary posterior of $u_{nl}$ is assumed to be independent of $z_n$, even though closed forms of the posteriors are available when these dependencies are retained. We found that the above specifications lead to more robust inference results. As in [29,33], we assume a full factorization for $q(\mathbf{r}_{nk})$, i.e.,
$q(\mathbf{r}_{nk}) = \prod_{j=1}^{p_k} q(r_{nkj}) = \prod_{j=1}^{p_k} q(r_{nkj}=1)^{r_{nkj}}\, q(r_{nkj}=0)^{1-r_{nkj}}. \qquad (26)$
Additionally, the auxiliary posterior $q(\Theta)$ is assumed to take its fully factorized form
$q(\Theta) = q(\boldsymbol{\pi})\, q(\mu)\, q(\sigma)\, q(\boldsymbol{\beta})\, q(w)\, q(\boldsymbol{\rho}) = q(\boldsymbol{\pi}) \cdot \prod_{l=1}^{d} q(\mu_{0l}) \prod_{k=1}^{K} q(\mu_{kl}) \cdot \prod_{l=1}^{d} q(\sigma_{0l}) \prod_{k=1}^{K} q(\sigma_{kl}) \cdot \prod_{l=1}^{d} q(\beta_l) \cdot \prod_{k=1}^{K} \prod_{l=1}^{d} q(\mathbf{w}_{kl}) \cdot \prod_{k=1}^{K} \prod_{j=1}^{p_k} q(\rho_{kj}). \qquad (27)$
For ease of exposition, in the following we use n, l, j and k to denote the indexes of the individual, the feature, the latent factor and the class, respectively. We omit the iteration indexes $(t)$ and $(t+1)$ and, without loss of generality, describe the updates during one iteration of the algorithm. We use $\langle \cdot \rangle$ to denote the expectation operation with respect to the current auxiliary posteriors.

4.3. Auxiliary Posteriors of the Latent Variables: VB-E Step

The VB-E step updates the auxiliary posterior q ( h n ) of the latent variables following the factorizations of (25) and (26).
(i) q ( u n l | ϕ n l , z n ) : Through some mathematical manipulations (see the Supplementary Materials for the details), we obtain
$q(u_{nl} \mid \phi_{nl}=1, z_n=k) = \mathcal{G}(u_{nl} \mid \hat{a}_{kl}, \hat{b}_{nlk}), \qquad q(u_{nl} \mid \phi_{nl}=0) = \mathcal{G}(u_{nl} \mid \hat{a}_{0l}, \hat{b}_{nl0}), \qquad (28)$
where
$\hat{a}_{kl} = \frac{v_{kl}+1}{2}, \quad \hat{b}_{nlk} = \frac{v_{kl} + \langle\sigma_{kl}\rangle\,\langle(\tilde{y}_{nlk}-\mu_{kl})^2\rangle}{2}, \quad \hat{a}_{0l} = \frac{v_{0l}+1}{2}, \quad \hat{b}_{nl0} = \frac{v_{0l} + \langle\sigma_{0l}\rangle \sum_{k}\langle\delta_{z_n,k}\rangle\,\langle(\tilde{y}_{nlk}-\mu_{0l})^2\rangle}{2}, \qquad (29)$
and $\tilde{y}_{nlk} = y_{nl} - \mathbf{w}_{kl}^{T} R_{nk}\mathbf{x}_{nk}$. Note that
$\langle(\tilde{y}_{nlk}-\mu_{kl})^2\rangle = (y_{nl}-\langle\mu_{kl}\rangle)^2 + \hat{\lambda}_{kl}^{-1} - 2\,(y_{nl}-\langle\mu_{kl}\rangle)\,\langle\mathbf{w}_{kl}\rangle^{T}\langle R_{nk}\mathbf{x}_{nk}\rangle + \mathrm{tr}\!\left[\langle\mathbf{w}_{kl}\mathbf{w}_{kl}^{T}\rangle\left(\langle\mathbf{x}_{nk}\mathbf{x}_{nk}^{T}\rangle \odot \langle\mathbf{r}_{nk}\mathbf{r}_{nk}^{T}\rangle\right)\right],$
$\langle(\tilde{y}_{nlk}-\mu_{0l})^2\rangle = (y_{nl}-\langle\mu_{0l}\rangle)^2 + \hat{\lambda}_{0l}^{-1} - 2\,(y_{nl}-\langle\mu_{0l}\rangle)\,\langle\mathbf{w}_{kl}\rangle^{T}\langle R_{nk}\mathbf{x}_{nk}\rangle + \mathrm{tr}\!\left[\langle\mathbf{w}_{kl}\mathbf{w}_{kl}^{T}\rangle\left(\langle\mathbf{x}_{nk}\mathbf{x}_{nk}^{T}\rangle \odot \langle\mathbf{r}_{nk}\mathbf{r}_{nk}^{T}\rangle\right)\right], \qquad (30)$
where $\hat{\lambda}_{kl}$ is the precision of the posterior $q(\mu_{kl})$ and $\hat{\lambda}_{0l}$ is the precision of $q(\mu_{0l})$. We denote the trace operator as $\mathrm{tr}(\cdot)$ and the Hadamard product of two matrices as $\odot$.
In the sequel, we use $\langle\cdot\rangle_{k1}$ and $\langle\cdot\rangle_{0}$ to distinguish between expectations taken with respect to $q(u_{nl} \mid \phi_{nl}=1, z_n=k)$ and $q(u_{nl} \mid \phi_{nl}=0)$. By the properties of the Gamma distribution, we obtain
$\langle u_{nl}\rangle_{k1} = \frac{\hat{a}_{kl}}{\hat{b}_{nlk}}, \quad \langle\log u_{nl}\rangle_{k1} = \psi(\hat{a}_{kl}) - \log(\hat{b}_{nlk}), \quad \langle u_{nl}\rangle_{0} = \frac{\hat{a}_{0l}}{\hat{b}_{nl0}}, \quad \langle\log u_{nl}\rangle_{0} = \psi(\hat{a}_{0l}) - \log(\hat{b}_{nl0}), \qquad (31)$
where ψ ( · ) is the digamma function.
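The moments in (31) translate directly into code; a minimal R helper (ours) is:
```r
# <u> and <log u> of a Gamma(a_hat, b_hat) auxiliary posterior, as in (31)
gamma_moments <- function(a_hat, b_hat) {
  list(u = a_hat / b_hat, log_u = digamma(a_hat) - log(b_hat))
}
```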
(ii) q ( ϕ n l ) : Define
$\bar{q}(\phi_{nl}=1) = \exp\!\Big\{\sum_{k}\langle\delta_{z_n,k}\rangle\Big[\tfrac{1}{2}\langle\log\sigma_{kl}\rangle + \tfrac{v_{kl}}{2}\log\tfrac{v_{kl}}{2} - \log\Gamma\big(\tfrac{v_{kl}}{2}\big) - \hat{a}_{kl}\log\hat{b}_{nlk} + \log\Gamma(\hat{a}_{kl})\Big] + \langle\log\beta_l\rangle\Big\},$
$\bar{q}(\phi_{nl}=0) = \exp\!\Big\{\tfrac{1}{2}\langle\log\sigma_{0l}\rangle + \tfrac{v_{0l}}{2}\log\tfrac{v_{0l}}{2} - \log\Gamma\big(\tfrac{v_{0l}}{2}\big) - \hat{a}_{0l}\log\hat{b}_{nl0} + \log\Gamma(\hat{a}_{0l}) + \langle\log(1-\beta_l)\rangle\Big\}. \qquad (32)$
Then, q ( ϕ n l ) can be obtained by
$q(\phi_{nl}=1) = \frac{\bar{q}(\phi_{nl}=1)}{\bar{q}(\phi_{nl}=1) + \bar{q}(\phi_{nl}=0)}, \qquad (33)$
and $q(\phi_{nl}=0) = 1 - q(\phi_{nl}=1)$. Denote $\langle\phi_{nl}\rangle = q(\phi_{nl}=1)$ and $\langle 1-\phi_{nl}\rangle = q(\phi_{nl}=0)$.
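In practice, (32) and (33) are best evaluated on the log scale to avoid numerical underflow of the exponentials; the following R sketch (ours, and equally applicable to the normalization of (40) and (41) over the K classes) uses the standard max-subtraction trick, which the text does not spell out.
```r
# Normalize two unnormalized log-probabilities, here log qbar(phi = 1) and log qbar(phi = 0)
normalize_bernoulli <- function(log_q1, log_q0) {
  m  <- max(log_q1, log_q0)                      # subtract the maximum before exponentiating
  q1 <- exp(log_q1 - m) / (exp(log_q1 - m) + exp(log_q0 - m))
  c(q1 = q1, q0 = 1 - q1)
}
```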
(iii) q ( x n k | r n k ) : The posterior q ( x n k | r n k ) is multivariate Gaussian with precision matrix and mean vector as
$\hat{C}_{nk}(\mathbf{r}_{nk}) = I + A_{nk} \odot (\mathbf{r}_{nk}\mathbf{r}_{nk}^{T}), \qquad \hat{\mathbf{f}}_{nk}(\mathbf{r}_{nk}) = \hat{C}_{nk}(\mathbf{r}_{nk})^{-1} R_{nk}\mathbf{t}_{nk}, \qquad (34)$
where
$A_{nk} = \sum_{l}\Big[\langle\phi_{nl}\rangle\langle\sigma_{kl}\rangle\langle u_{nl}\rangle_{k1} + \langle 1-\phi_{nl}\rangle\langle\sigma_{0l}\rangle\langle u_{nl}\rangle_{0}\Big]\langle\mathbf{w}_{kl}\mathbf{w}_{kl}^{T}\rangle,$
$\mathbf{t}_{nk} = \sum_{l}\Big[\langle\phi_{nl}\rangle\langle\sigma_{kl}\rangle\langle u_{nl}\rangle_{k1}\,(y_{nl}-\langle\mu_{kl}\rangle) + \langle 1-\phi_{nl}\rangle\langle\sigma_{0l}\rangle\langle u_{nl}\rangle_{0}\,(y_{nl}-\langle\mu_{0l}\rangle)\Big]\langle\mathbf{w}_{kl}\rangle. \qquad (35)$
(iv) q ( r n k ) : The posterior q ( r n k j ) can be obtained by
$q(r_{nkj}=1) = \frac{\bar{q}(r_{nkj}=1)}{\bar{q}(r_{nkj}=1) + \bar{q}(r_{nkj}=0)}, \qquad (36)$
and q ( r n k j = 0 ) = 1 q ( r n k j = 1 ) , where
$\bar{q}(r_{nkj}=c) = \exp\!\Big\{-\tfrac{1}{2}\big\langle\log\big|I + A_{nk}\odot(\mathbf{r}_{nk}\mathbf{r}_{nk}^{T})\big|\big\rangle + \tfrac{1}{2}\big\langle\mathrm{tr}\big[(I + A_{nk}\odot(\mathbf{r}_{nk}\mathbf{r}_{nk}^{T}))^{-1}(R_{nk}\mathbf{t}_{nk})(R_{nk}\mathbf{t}_{nk})^{T}\big]\big\rangle + c\,\langle\log\rho_{kj}\rangle + (1-c)\,\langle\log(1-\rho_{kj})\rangle\Big\}, \qquad (37)$
with $c \in \{0, 1\}$. The expectations in (37) are taken with $r_{nkj}$ fixed at $c$. Denote $\langle r_{nkj}\rangle = q(r_{nkj}=1)$ and $\langle 1-r_{nkj}\rangle = q(r_{nkj}=0)$.
When posterior independence is assumed between x n k and r n k or x n k is observable as in the regression models of [29,33], q ( r n k j ) can be derived analytically and the expectations in (30) regarding q ( x n k , r n k ) can be obtained in closed form using the results:
$\langle R_{nk}\rangle = \mathrm{diag}(\langle\mathbf{r}_{nk}\rangle), \qquad \langle\mathbf{r}_{nk}\mathbf{r}_{nk}^{T}\rangle = \langle R_{nk}\rangle\big(I - \langle R_{nk}\rangle\big) + \langle\mathbf{r}_{nk}\rangle\langle\mathbf{r}_{nk}\rangle^{T},$
$\langle\mathbf{x}_{nk}\rangle = \big(I + A_{nk}\odot\langle\mathbf{r}_{nk}\mathbf{r}_{nk}^{T}\rangle\big)^{-1}\langle R_{nk}\rangle\,\mathbf{t}_{nk}, \qquad \langle\mathbf{x}_{nk}\mathbf{x}_{nk}^{T}\rangle = \big(I + A_{nk}\odot\langle\mathbf{r}_{nk}\mathbf{r}_{nk}^{T}\rangle\big)^{-1} + \langle\mathbf{x}_{nk}\rangle\langle\mathbf{x}_{nk}\rangle^{T}. \qquad (38)$
However, in high-dimensional cases the computation is hindered, as it involves multiplication and inversion of large-scale matrices.
The sparsity of the indicator vector $\mathbf{r}_{nk}$ motivates us to resort to a “drop-out” sampling scheme [30], in which conditioning $\mathbf{x}_{nk}$ on $\mathbf{r}_{nk}$ does not harm the efficiency of the algorithm. Specifically, at each iteration of the algorithm we keep a random sample $\hat{\mathbf{r}}_{nk}$ from $q(\mathbf{r}_{nk})$ and use it as an imputation of $\mathbf{r}_{nk}$ to update the remaining auxiliary posteriors. During this process, the connections to latent factors with smaller $q(r_{nkj}=1)$ have a higher chance of being dropped out. The computation can thus be simplified; for example, in Equation (30),
$\langle\mathbf{w}_{kl}^{T} R_{nk}\mathbf{x}_{nk}\rangle = \big(\hat{R}_{nk}\langle\mathbf{w}_{kl}\rangle\big)^{T}\big(I + A_{nk}\odot(\hat{\mathbf{r}}_{nk}\hat{\mathbf{r}}_{nk}^{T})\big)^{-1}\hat{R}_{nk}\mathbf{t}_{nk},$
$\langle\mathbf{w}_{kl}\mathbf{w}_{kl}^{T}\rangle\big(\langle\mathbf{x}_{nk}\mathbf{x}_{nk}^{T}\rangle\odot\langle\mathbf{r}_{nk}\mathbf{r}_{nk}^{T}\rangle\big) = \langle\mathbf{w}_{kl}\mathbf{w}_{kl}^{T}\rangle\Big\{(\hat{\mathbf{r}}_{nk}\hat{\mathbf{r}}_{nk}^{T})\odot\Big[\big(I + A_{nk}\odot(\hat{\mathbf{r}}_{nk}\hat{\mathbf{r}}_{nk}^{T})\big)^{-1} + \big(I + A_{nk}\odot(\hat{\mathbf{r}}_{nk}\hat{\mathbf{r}}_{nk}^{T})\big)^{-1}\hat{R}_{nk}\mathbf{t}_{nk}\big(\hat{R}_{nk}\mathbf{t}_{nk}\big)^{T}\big(I + A_{nk}\odot(\hat{\mathbf{r}}_{nk}\hat{\mathbf{r}}_{nk}^{T})\big)^{-1}\Big]\Big\}, \qquad (39)$
where the latent dimensions to be handled are reduced thanks to the sparsity of the random sample $\hat{\mathbf{r}}_{nk}$: in the multiplications and inversions, only the entries of the vectors and matrices corresponding to $\hat{r}_{nkj} \neq 0$ need to be involved.
To obtain a random sample from $q(\mathbf{r}_{nk})$, we update the entries of $\hat{\mathbf{r}}_{nk}$ one by one through a single sweep of Gibbs sampling, where the sampling probability for $\hat{r}_{nkj}$ has the form of (37) but with the expectations replaced by the current imputations of the other entries of $\hat{\mathbf{r}}_{nk}$.
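A minimal R sketch (ours) of this single Gibbs sweep is given below; log_qbar_r(j, c, r_hat) is a hypothetical placeholder for the logarithm of (37) evaluated with $r_{nkj}$ fixed at c and the remaining entries fixed at their currently imputed values.
```r
gibbs_sweep_r <- function(r_hat, log_qbar_r) {
  p_k <- length(r_hat)
  for (j in sample(p_k)) {                 # randomized updating order (see Section 4.5)
    l1 <- log_qbar_r(j, 1, r_hat)
    l0 <- log_qbar_r(j, 0, r_hat)
    prob1 <- 1 / (1 + exp(l0 - l1))        # conditional probability that r_nkj = 1
    r_hat[j] <- rbinom(1, 1, prob1)        # draw the new imputation for this entry
  }
  r_hat                                    # the sparse sample used to "drop out" factors
}
```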
(v) q ( z n ) : To update q ( z n = k ) , we define the quantity
$\bar{q}(z_n=k) = \exp\!\Big\{\sum_{l}\langle\phi_{nl}\rangle\Big[\tfrac{1}{2}\langle\log\sigma_{kl}\rangle + \tfrac{v_{kl}}{2}\log\tfrac{v_{kl}}{2} - \log\Gamma\big(\tfrac{v_{kl}}{2}\big) - \hat{a}_{kl}\log\hat{b}_{nlk} + \log\Gamma(\hat{a}_{kl})\Big] - \tfrac{1}{2}\sum_{l}\langle 1-\phi_{nl}\rangle\langle\sigma_{0l}\rangle\langle u_{nl}\rangle_{0}\langle(\tilde{y}_{nlk}-\mu_{0l})^{2}\rangle - \tfrac{1}{2}\log\big|I + A_{nk}\odot(\hat{\mathbf{r}}_{nk}\hat{\mathbf{r}}_{nk}^{T})\big| - \tfrac{1}{2}\mathrm{tr}\Big[\big(I + A_{nk}\odot(\hat{\mathbf{r}}_{nk}\hat{\mathbf{r}}_{nk}^{T})\big)^{-1} + \big(I + A_{nk}\odot(\hat{\mathbf{r}}_{nk}\hat{\mathbf{r}}_{nk}^{T})\big)^{-1}\hat{R}_{nk}\mathbf{t}_{nk}\big(\hat{R}_{nk}\mathbf{t}_{nk}\big)^{T}\big(I + A_{nk}\odot(\hat{\mathbf{r}}_{nk}\hat{\mathbf{r}}_{nk}^{T})\big)^{-1}\Big] + \sum_{j}\Big[\langle r_{nkj}\rangle\big(\langle\log\rho_{kj}\rangle - \log\langle r_{nkj}\rangle\big) + \langle 1-r_{nkj}\rangle\big(\langle\log(1-\rho_{kj})\rangle - \log\langle 1-r_{nkj}\rangle\big)\Big] + \langle\log\pi_{k}\rangle\Big\}. \qquad (40)$
Then,
$q(z_n=k) = \frac{\bar{q}(z_n=k)}{\sum_{k'}\bar{q}(z_n=k')}. \qquad (41)$

4.4. Auxiliary Posteriors of the Parameters: VB-M Step

The VB-M step updates the posterior q ( Θ ) for the parameters following the factorization in (27). Through mathematical manipulation (see the Supplementary Materials for the details), we have
$q(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \hat{\boldsymbol{\alpha}}), \qquad q(\beta_l) = \mathrm{Beta}(\beta_l \mid \hat{\kappa}_{1l}, \hat{\kappa}_{2l}), \qquad q(\rho_{kj}) = \mathrm{Beta}(\rho_{kj} \mid \hat{\tau}_{1kj}, \hat{\tau}_{2kj}), \qquad (42)$
where
$\hat{\boldsymbol{\alpha}} = (\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_K)^{T}, \quad \hat{\alpha}_k = \alpha_{0k} + \sum_{n}\langle\delta_{z_n,k}\rangle, \quad \hat{\kappa}_{1l} = \kappa_1 + \sum_{n}\langle\phi_{nl}\rangle, \quad \hat{\kappa}_{2l} = \kappa_2 + \sum_{n}\langle 1-\phi_{nl}\rangle,$
$\hat{\tau}_{1kj} = \tau_1 + \sum_{n}\langle\delta_{z_n,k}\rangle\langle r_{nkj}\rangle, \quad \hat{\tau}_{2kj} = \tau_2 + \sum_{n}\langle\delta_{z_n,k}\rangle\langle 1-r_{nkj}\rangle. \qquad (43)$
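The updates in (43) are simple count accumulations over the current posterior expectations; a small R sketch (ours, with illustrative argument shapes) is:
```r
# delta: N x K matrix of <delta_{z_n,k}>; phi: N x d matrix of <phi_nl>;
# r: list of K matrices (N x p_k) holding <r_nkj>
update_counts <- function(alpha0, kappa1, kappa2, tau1, tau2, delta, phi, r) {
  K <- ncol(delta)
  list(
    alpha  = alpha0 + colSums(delta),
    kappa1 = kappa1 + colSums(phi),
    kappa2 = kappa2 + colSums(1 - phi),
    tau1   = lapply(seq_len(K), function(k) tau1 + colSums(delta[, k] * r[[k]])),
    tau2   = lapply(seq_len(K), function(k) tau2 + colSums(delta[, k] * (1 - r[[k]])))
  )
}
```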
The posterior q ( w k l ) is given by
$q(\mathbf{w}_{kl}) = \mathcal{N}(\mathbf{w}_{kl} \mid \hat{\mathbf{m}}_{kl}, \hat{M}_{kl}), \qquad (44)$
where
$\hat{M}_{kl} = m_0 I + \sum_{n}\langle\delta_{z_n,k}\rangle\Big[\langle\sigma_{kl}\rangle\langle\phi_{nl}\rangle\langle u_{nl}\rangle_{k1} + \langle\sigma_{0l}\rangle\langle 1-\phi_{nl}\rangle\langle u_{nl}\rangle_{0}\Big]\big(\langle\mathbf{x}_{nk}\mathbf{x}_{nk}^{T}\rangle\odot\langle\mathbf{r}_{nk}\mathbf{r}_{nk}^{T}\rangle\big),$
$\hat{\mathbf{m}}_{kl} = \hat{M}_{kl}^{-1}\sum_{n}\langle\delta_{z_n,k}\rangle\Big[\langle\sigma_{kl}\rangle\langle\phi_{nl}\rangle\langle u_{nl}\rangle_{k1}\,(y_{nl}-\langle\mu_{kl}\rangle) + \langle\sigma_{0l}\rangle\langle 1-\phi_{nl}\rangle\langle u_{nl}\rangle_{0}\,(y_{nl}-\langle\mu_{0l}\rangle)\Big]\langle R_{nk}\mathbf{x}_{nk}\rangle. \qquad (45)$
The posteriors q ( μ 0 l ) and q ( μ k l ) are given by
$q(\mu_{0l}) = \mathcal{N}(\mu_{0l} \mid \hat{s}_{0l}, \hat{\lambda}_{0l}), \qquad q(\mu_{kl}) = \mathcal{N}(\mu_{kl} \mid \hat{s}_{kl}, \hat{\lambda}_{kl}), \qquad (46)$
where
$\hat{\lambda}_{0l} = \lambda_0 + \langle\sigma_{0l}\rangle\sum_{n}\langle 1-\phi_{nl}\rangle\langle u_{nl}\rangle_{0}, \quad \hat{s}_{0l} = \hat{\lambda}_{0l}^{-1}\Big[\lambda_0 s_{0l} + \langle\sigma_{0l}\rangle\sum_{n,k}\langle\delta_{z_n,k}\rangle\langle 1-\phi_{nl}\rangle\langle u_{nl}\rangle_{0}\langle\tilde{y}_{nlk}\rangle\Big],$
$\hat{\lambda}_{kl} = \lambda_0 + \langle\sigma_{kl}\rangle\sum_{n}\langle\delta_{z_n,k}\rangle\langle\phi_{nl}\rangle\langle u_{nl}\rangle_{k1}, \quad \hat{s}_{kl} = \hat{\lambda}_{kl}^{-1}\Big[\lambda_0 s_{0l} + \langle\sigma_{kl}\rangle\sum_{n}\langle\delta_{z_n,k}\rangle\langle\phi_{nl}\rangle\langle u_{nl}\rangle_{k1}\langle\tilde{y}_{nlk}\rangle\Big]. \qquad (47)$
In addition, the posteriors q ( σ 0 l ) and q ( σ k l ) are updated as
$q(\sigma_{0l}) = \mathcal{G}\!\left(\sigma_{0l} \,\middle|\, \tfrac{\hat{\eta}_{0l}}{2}, \tfrac{\hat{\xi}_{0l}}{2}\right), \qquad q(\sigma_{kl}) = \mathcal{G}\!\left(\sigma_{kl} \,\middle|\, \tfrac{\hat{\eta}_{kl}}{2}, \tfrac{\hat{\xi}_{kl}}{2}\right), \qquad (48)$
where
$\hat{\eta}_{0l} = \eta_0 + \sum_{n}\langle 1-\phi_{nl}\rangle, \quad \hat{\xi}_{0l} = \xi_0 + \sum_{n,k}\langle\delta_{z_n,k}\rangle\langle 1-\phi_{nl}\rangle\langle u_{nl}\rangle_{0}\langle(\tilde{y}_{nlk}-\mu_{0l})^{2}\rangle,$
$\hat{\eta}_{kl} = \eta_0 + \sum_{n}\langle\delta_{z_n,k}\rangle\langle\phi_{nl}\rangle, \quad \hat{\xi}_{kl} = \xi_0 + \sum_{n}\langle\delta_{z_n,k}\rangle\langle\phi_{nl}\rangle\langle u_{nl}\rangle_{k1}\langle(\tilde{y}_{nlk}-\mu_{kl})^{2}\rangle. \qquad (49)$
The degrees of freedom v 0 l can be obtained by solving the nonlinear equation
$\sum_{n}\langle 1-\phi_{nl}\rangle\Big[1 + \log\tfrac{v_{0l}}{2} - \psi\big(\tfrac{v_{0l}}{2}\big) + \langle\log u_{nl}\rangle_{0} - \langle u_{nl}\rangle_{0}\Big] = 0. \qquad (50)$
Similarly, v k l can be obtained by solving
$\sum_{n}\langle\delta_{z_n,k}\rangle\langle\phi_{nl}\rangle\Big[1 + \log\tfrac{v_{kl}}{2} - \psi\big(\tfrac{v_{kl}}{2}\big) + \langle\log u_{nl}\rangle_{k1} - \langle u_{nl}\rangle_{k1}\Big] = 0. \qquad (51)$
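Equations (50) and (51) are one-dimensional root-finding problems in the degrees of freedom and can be solved numerically, for example with R’s uniroot(); the sketch below (ours) assumes the root lies in a bracket such as [0.01, 200], which is typically the case for t-mixture degrees-of-freedom updates.
```r
# w: weights (<1 - phi_nl> for (50), or <delta_{z_n,k}><phi_nl> for (51));
# s: the per-observation terms <log u_nl> - <u_nl> under the matching posterior
solve_df <- function(w, s, lower = 0.01, upper = 200) {
  f <- function(v) sum(w * (1 + log(v / 2) - digamma(v / 2) + s))
  uniroot(f, lower = lower, upper = upper)$root
}
```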

4.5. Algorithm

The developed structured VB algorithm is summarized in Algorithm 1. The optimization process can be monitored via the ELBO (21). The computation of the ELBO is detailed in Appendix A.
Algorithm 1 Proposed Structured VB Algorithm for Robust Clustering and Model Selection
Require: training data $\mathbf{y}_n$, $1 \le n \le N$; the number of clusters K.
Ensure: the response probabilities, the centroids, the saliency of the features, the factor loading matrices, the activity of the factors.
1: while the evidence lower bound $\mathcal{L}$ increases by more than $\epsilon$ and the number of iterations is less than IterMax do
2:   VB-E step
3:   Update $q(u_{nl} \mid \phi_{nl}, z_n)$ according to (28) for $1 \le l \le d$ and $1 \le n \le N$;
4:   Update $q(\phi_{nl})$ according to (33) for $1 \le l \le d$ and $1 \le n \le N$;
5:   Update $q(\mathbf{x}_{nk} \mid \mathbf{r}_{nk})$ according to (34) for $1 \le k \le K$ and $1 \le n \le N$;
6:   Update $q(\mathbf{r}_{nk})$ according to (36) for $1 \le k \le K$ and $1 \le n \le N$: run a single sweep of Gibbs sampling to draw a sample from $q(\mathbf{r}_{nk})$;
7:   Update $q(z_n)$ according to (41) for $1 \le n \le N$;
8:   VB-M step
9:   Update $q(\boldsymbol{\pi})$, $q(\beta_l)$ and $q(\rho_{kj})$ according to (42) for $1 \le l \le d$, $1 \le j \le p_k$ and $1 \le k \le K$;
10:  Update $q(\mathbf{w}_{kl})$ according to (44) for $1 \le l \le d$ and $1 \le k \le K$;
11:  Update $q(\mu_{0l})$ and $q(\mu_{kl})$ according to (46) for $1 \le l \le d$ and $1 \le k \le K$;
12:  Update $q(\sigma_{0l})$ and $q(\sigma_{kl})$ according to (48) for $1 \le l \le d$ and $1 \le k \le K$;
13:  Update $v_{0l}$ and $v_{kl}$ according to (50) and (51) for $1 \le l \le d$ and $1 \le k \le K$;
14: end while
We apply K-means clustering to initialize the VB algorithm and initialize a large p ($p < \min(d, N)$) for the latent dimensions of the K local factor models. At each iteration, we randomize the updating order of the $\hat{r}_{nkj}$’s in the Gibbs sampling step to avoid co-adaptation. To further accelerate the algorithm, we make the number of factors adaptive. The empirical estimator of the factor activity, i.e.,
$\hat{\rho}_{kj} = \frac{\sum_{n}\langle\delta_{z_n,k}\rangle\langle r_{nkj}\rangle}{\sum_{n}\langle\delta_{z_n,k}\rangle}, \qquad (52)$
is computed at the end of each iteration. If ρ ^ k j = 0 , then we remove the jth latent factor from the kth local factor model. The pruning is carried out after a burn-in period of the algorithm.
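A sketch (ours) of this pruning step: once the burn-in period is over, the factors whose empirical activity (52) has dropped to zero are removed from the loading matrices and from the corresponding parts of the posteriors.
```r
prune_factors <- function(rho_hat, W_mean) {
  for (k in seq_along(rho_hat)) {
    keep <- which(rho_hat[[k]] > 0)                     # factors with nonzero empirical activity
    rho_hat[[k]] <- rho_hat[[k]][keep]
    W_mean[[k]]  <- W_mean[[k]][, keep, drop = FALSE]   # drop the pruned columns of <W_k>
    # ...the matching entries of q(x_nk | r_nk) and q(r_nk) are dropped in the same way
  }
  list(rho_hat = rho_hat, W_mean = W_mean)
}
```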

4.6. Interpreting the Model

The expectation of the feature saliency $\beta_l$ can be used to indicate how informative each feature is after adjustment by the latent factors; it is given by
$\langle\beta_l\rangle = \frac{\hat{\kappa}_{1l}}{\hat{\kappa}_{1l} + \hat{\kappa}_{2l}}. \qquad (53)$
In addition, the expectation of the factor activity $\rho_{kj}$ can be used to evaluate the explanatory power of the latent factors in each class; it is obtained as
$\langle\rho_{kj}\rangle = \frac{\hat{\tau}_{1kj}}{\hat{\tau}_{1kj} + \hat{\tau}_{2kj}}. \qquad (54)$
We also consider the reconstruction performance of the proposed algorithm. The centroid of each class is estimated by
$\tilde{\boldsymbol{\mu}}_{k} = B\,\langle\boldsymbol{\mu}_{k}\rangle + (I - B)\,\langle\boldsymbol{\mu}_{0}\rangle, \qquad (55)$
where $B = \mathrm{diag}(\langle\beta_1\rangle, \langle\beta_2\rangle, \ldots, \langle\beta_d\rangle)$, $\langle\boldsymbol{\mu}_k\rangle = (\hat{s}_{k1}, \hat{s}_{k2}, \ldots, \hat{s}_{kd})^{T}$ and $\langle\boldsymbol{\mu}_0\rangle = (\hat{s}_{01}, \hat{s}_{02}, \ldots, \hat{s}_{0d})^{T}$. The reconstruction of the nth individual in class k can then be computed as
$\hat{\mathbf{y}}_{n} = \langle W_{k}\rangle\,\langle R_{nk}\mathbf{x}_{nk}\rangle + \tilde{\boldsymbol{\mu}}_{k}, \qquad (56)$
where $\langle W_k\rangle = [\hat{\mathbf{m}}_{k1}, \hat{\mathbf{m}}_{k2}, \ldots, \hat{\mathbf{m}}_{kd}]^{T}$ and $\langle R_{nk}\mathbf{x}_{nk}\rangle$ is obtained from the VB-E step after the algorithm converges.

5. Experiment Study

5.1. Experiments on Synthetic Data

In this section, we evaluate the developed model and the structured VB algorithm in controlled experiments. We continue the experiments in [14] with the same synthetic data, in which the features were generated independently within each component, and generate an additional dataset in which correlation is imposed between features within the mixture components. The proposed model and algorithm were compared with the semi-Bayesian clustering model and algorithm of [10], called varFnMS, which adopts a finite mixture of Gaussians and applies mean-field VB, and with the full-Bayesian model and algorithm of [14], denoted varFnMS-T, which is based on the mixture of Student’s t distributions and uses the structured VB algorithm.
The synthetic data in [14] contain 800 data points from four well-separated classes. The data are 10-dimensional with two influential features located around the class centers (0, 3), (1, 9), (6, 4) and (7, 10), with identity covariance matrices in each class. The remaining eight “noisy” features were sampled from N(0, 1). We randomly turned 1% of the data points into outliers by adding noise sampled uniformly from $[-10, 10]^{10}$. The features are mutually independent within each class, which is consistent with the assumption underlying varFnMS and varFnMS-T. In the additional dataset, the local independence assumption is violated: we assigned a four-factor model to class 1, a two-factor model to class 2, a one-factor model to class 3 and no factors to class 4. The mean vector of each factor model remained the same as in the “locally independent” data. The factor loading matrices were generated randomly with each entry drawn from N(0, 1), and the noise term in each class was generated from $\mathcal{N}(\mathbf{0}, I_{10})$.
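For reference, the “locally correlated” dataset can be generated along the following lines (an R sketch following the design described above; the random seed and the realized loadings differ from the authors’ data).
```r
set.seed(1)
centers <- rbind(c(0, 3), c(1, 9), c(6, 4), c(7, 10))   # class centers of the two influential features
p_class <- c(4, 2, 1, 0)                                # numbers of latent factors per class
n_per_class <- 200; d <- 10
dat <- do.call(rbind, lapply(1:4, function(k) {
  mu <- c(centers[k, ], rep(0, d - 2))                  # the remaining 8 "noisy" features have mean 0
  W  <- if (p_class[k] > 0) matrix(rnorm(d * p_class[k]), d, p_class[k]) else NULL
  x  <- if (p_class[k] > 0) matrix(rnorm(n_per_class * p_class[k]), n_per_class, p_class[k]) else NULL
  common <- if (is.null(W)) 0 else x %*% t(W)           # class-specific factor structure
  sweep(common + matrix(rnorm(n_per_class * d), n_per_class, d), 2, mu, "+")
}))
out_idx <- sample(nrow(dat), ceiling(0.01 * nrow(dat))) # 1% of the points become outliers
dat[out_idx, ] <- dat[out_idx, ] + matrix(runif(length(out_idx) * d, -10, 10), ncol = d)
```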
The proposed algorithm, denoted varFnMS-TFA, together with varFnMS and varFnMS-T, was run twenty times. The number of clusters K was set to four. The K-means clustering algorithm was used to initialize the posterior $q(z_n)$. The feature saliency and the factor activity were both initialized as 0.5. The hyperparameters $\alpha_{0k}$, $\kappa_1$, $\kappa_2$, $\tau_1$, $\tau_2$, $\lambda_0$, $m_0$, $\eta_0$ and $\xi_0$ were set to $10^{-5}$, and $s_0$ was set to the empirical mean of the feature data. We assumed a nine-factor model for each class at the beginning and initialized the posterior means of the latent factors by sampling from $\mathcal{N}(\mathbf{0}, I_{9})$. The algorithm terminates when the difference of the ELBO between two consecutive iterations is less than $10^{-7}$ or the maximum number of iterations (IterMax = 500) is reached. To avoid “label switching” problems, we labeled the clusters obtained in the twenty repeated experiments by matching them with the true classification of the data.
Table 1 presents the ELBO reached and the classification error rate (comparing the clustering with the original grouping of the data) for the three algorithms on the two synthetic datasets. For the dataset where features are mutually independent within each class, the classification accuracy of varFnMS-TFA is slightly higher than that of the other two algorithms, but the difference is not evident. For the dataset generated with correlated features, the proposed algorithm shows significantly higher accuracy than the two algorithms built on the local independence assumption. Moreover, the ELBO reached by varFnMS-TFA is on average the highest on both datasets, and the discrepancy is enlarged when the features are locally correlated. As seen in Figure 2, it successfully captures the correlation across features through the latent factors.
The estimated factor activity in each class by the proposed algorithm (averaged over the twenty repeats) for the two synthetic datasets is presented in Figure 2. Generally, the algorithm recovers the ground truth in both datasets. It can be seen in subplot (a) that there is no significantly active factor across the four classes for the “locally independent” data and the true pattern of factor activity in the “locally correlated” data is recovered as shown in subplot (b).
Figure 3 compares the estimated feature saliency for the two synthetic datasets. As shown in subplot (a), the three algorithms give similarly good estimates of the feature saliency when the features are independent within each class. But when the dependence relationship is imposed, the proposed algorithm behaves very differently from the other two, as shown in subplot (b). While varFnMS and varFnMS-T estimate a feature’s salience that can be confounded by the other features, varFnMS-TFA gives the factor-adjusted feature saliency, in which the confounding effects are resolved by the latent factors.
Table 2 presents the estimated class centroids for the two synthetic datasets (averaged over the twenty repeats). When the features are generated independently within each class, the three algorithms exhibit comparable performance and recover the true centroids approximately. With correlations imposed, however, the estimation accuracy of varFnMS and varFnMS-T is clearly degraded: in classes 1, 2 and 3, they misestimate the means of the first two, salient features, and the variation of the estimates is significantly larger than for varFnMS-TFA. In comparison, the varFnMS-TFA algorithm, which accounts for the local dependence relationships, gives more accurate and stable estimates.

5.2. Experiments on Real Datasets

In this section, we apply the proposed model to the benchmark datasets Iris, Olive, Wine and WDBC. The Iris dataset is obtained from the R package “datasets”. Olive is the Italian olive oil dataset and Wine the Italian wine dataset; both are obtained from the R package “pgmm”. WDBC is the Wisconsin diagnostic breast cancer dataset downloaded from the UCI machine learning repository (https://doi.org/10.24432/C5DW2B; accessed on 26 May 2023). For each dataset, we repeated each algorithm ten times and retained the result with the highest ELBO. We set the initial number of latent factors for each dataset to d − 1, where d is the number of features in the data. Table 3 presents the basic information for the four datasets and the classification errors obtained. There is a significant decrease in the classification error for the Olive data and a slight improvement in the results for Iris and WDBC when using the proposed algorithm. The exception is the Wine data, where the proposed algorithm gives results slightly inferior to varFnMS-T.
Figure 4 and Figure 5 show the factor activity and the feature saliency for the four benchmark datasets. It can be seen from Figure 4 that strong factor activity is detected in the Iris, Olive and WDBC data. As shown in Figure 5, the patterns of estimated feature saliency change noticeably when the proposed algorithm is applied. The combined results indicate that correlation between features could interfere with our decisions about the features’ relevance and the classification of the data.

5.3. Application on Handwritten Object Recognition

In this section, we apply the developed algorithm to the machine learning task of handwritten alphabet recognition. The handwritten alphabet dataset is obtained from the Kaggle webpage (https://www.kaggle.com/datasets/sachinpatel21/az-handwritten-alphabets-in-csv-format/data; accessed on 26 May 2023). It contains more than 370,000 images of the English alphabet (A–Z). The images are gray-scale with a size of 28 × 28 pixels. We focus on separating the handwritten letters A, B and C and randomly reserve 200 images for each letter. As the variability of some pixels in the images of a letter is exactly zero, we may encounter singularity problems during the iterations of the clustering algorithms; therefore, the data were pre-processed as detailed in Appendix B. In the proposed algorithm, the initial number of latent factors was set to fifty for each class. The three algorithms attain the same classification error rate of 0.16. The patterns of feature saliency estimated via varFnMS, varFnMS-T and the proposed algorithm are compared in Figure 6. The pixels are arranged along the x-axis column by column in the 28 × 28-pixel image. The saliences of the image margin, which has almost zero variability, were set to zero in the pre-processing stage. As can be seen from Figure 6, for the potentially discriminant part of the image, the other two algorithms may be ambiguous about the relevance of the features, whereas the evaluation based on the proposed algorithm is clearer, an improvement that can be attributed to extracting the confounding effects through the latent factors.
The centroids estimated via the three algorithms are shown in Figure 7, where the reconstruction of images via the proposed algorithm is also illustrated. The estimated centroid of each class is calculated following (55) as a mixture of the class-specific mean and the background. The calculation is outlined by the red box, where the centroid, the class-specific mean and the background are presented from left to right. It can be seen that all three algorithms characterize the letters well: the estimated centroids sketch the general appearance of each letter. It is noticeable, however, that varFnMS and varFnMS-T make some mistakes in estimating the background; there should be no handwritten stroke at the bottom of the background image, since that stroke is what distinguishes the images of the letter A. The proposed algorithm performs well in reconstructing the images. An example of reconstruction is presented in subplot (c), and additional examples are given in Appendix C. Generally, by adding the influence of the latent factors, the handwriting in the images becomes legible. The factor loadings on the two most active factors for each letter are shown at the right side of subplot (c). As can be seen, the information from the latent factors plays an important role in refining the images.

6. Conclusions

In this paper, we developed a hierarchical latent variable model for robust clustering and model selection. We considered the case where features are correlated within the mixture components of a Student’s t mixture model. The factor-adjusted feature saliency was proposed to evaluate the relevance of features to data separation, and automatic latent dimension reduction was achieved by introducing the factor activity variables. A full Bayesian treatment was adopted and a structured VB inference framework was developed, which enables a tighter bound on the marginal likelihood and improves the inference accuracy. Controlled experiments on synthetic and real-world datasets showed that the proposed model is able to capture the correlation between features and achieves better clustering performance than models relying on the local independence assumption. Application of the developed algorithm to the high-dimensional handwritten alphabet data showed its applicability and usefulness for image recognition and reconstruction.
In the proposed model, the number of clusters (the number of components in the mixture model) is taken as fixed and given before inference. Ongoing work is extending our model to realize automatic selection of the number of clusters. We imposed a Dirichlet prior on the mixing probabilities, which can act as a penalization driving the mixing probabilities of unnecessary components towards extinction. We will also investigate the novel penalization methods proposed in [34], which result in continuous objective functions and can shrink the mixing weights to exactly zero. Another limitation is the assumption that the features are approximately Gaussian distributed within each component; this assumption can be violated when the features only take positive values or follow skewed distributions. Future work may consider extending the model to tackle these scenarios.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math12071091/s1, S1. Deriving the auxiliary posteriors of the latent variables; S2. Deriving the auxiliary posteriors of the parameters.

Author Contributions

Conceptualization, Y.N. and W.X.; methodology, S.F.; software, S.F.; validation, S.F.; formal analysis, S.F.; writing—original draft preparation, S.F.; writing—review and editing, W.X.; visualization, S.F.; supervision, Y.N.; funding acquisition, Y.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China grant number 2020YFA0713603 and the National Natural Science Foundation of China grant number 11971386.

Data Availability Statement

The five datasets used in this study are publicly available. The Iris data can be obtained from the R package “datasets”. (The R software can be downloaded at http://www.r-project.org/; R version is 4.3.1. (accessed on 26 May 2023)) The Olive and Wine data can be obtained from the R package “pgmm”. The Wisconsin diagnostic breast cancer (WDBC) data can be downloaded from the UCI machine learning repository at https://doi.org/10.24432/C5DW2B (accessed on 26 May 2023). The handwritten alphabets data are openly available in the Kaggle webpage: https://www.kaggle.com/datasets/sachinpatel21/az-handwritten-alphabets-in-csv-format/data (accessed on 26 May 2023). The R codes for the developed algorithm are available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The evidence lower bound monitoring the optimization process of the proposed algorithm can be evaluated as follows:
$\mathcal{L} = \sum_{n,k,l}\langle\delta_{z_n,k}\rangle\Big[\langle\phi_{nl}\rangle\langle\log p(y_{nl}\mid\mathbf{x}_{nk},\mathbf{r}_{nk},u_{nl},\phi_{nl}=1,z_n=k)\rangle + \langle 1-\phi_{nl}\rangle\langle\log p(y_{nl}\mid\mathbf{x}_{nk},\mathbf{r}_{nk},u_{nl},\phi_{nl}=0,z_n=k)\rangle + \langle\phi_{nl}\rangle\big(\langle\log p(u_{nl}\mid\phi_{nl}=1,z_n=k)\rangle - \langle\log q(u_{nl}\mid\phi_{nl}=1,z_n=k)\rangle\big)\Big]$
$\quad + \sum_{n,k}\langle\delta_{z_n,k}\rangle\Big[\langle\log p(\mathbf{x}_{nk})\rangle - \langle\log q(\mathbf{x}_{nk}\mid\mathbf{r}_{nk})\rangle + \langle\log p(\mathbf{r}_{nk})\rangle - \langle\log q(\mathbf{r}_{nk})\rangle + \langle\log p(z_n=k)\rangle - \log q(z_n=k)\Big]$
$\quad + \sum_{n,l}\Big[\langle 1-\phi_{nl}\rangle\big(\langle\log p(u_{nl}\mid\phi_{nl}=0)\rangle - \langle\log q(u_{nl}\mid\phi_{nl}=0)\rangle\big) + \langle\log p(\phi_{nl})\rangle - \langle\log q(\phi_{nl})\rangle\Big]$
$\quad + \sum_{l}\Big[\langle\log p(\mu_{0l})\rangle - \langle\log q(\mu_{0l})\rangle + \langle\log p(\sigma_{0l})\rangle - \langle\log q(\sigma_{0l})\rangle + \langle\log p(\beta_l)\rangle - \langle\log q(\beta_l)\rangle\Big]$
$\quad + \sum_{k,l}\Big[\langle\log p(\mu_{kl})\rangle - \langle\log q(\mu_{kl})\rangle + \langle\log p(\sigma_{kl})\rangle - \langle\log q(\sigma_{kl})\rangle + \langle\log p(\mathbf{w}_{kl})\rangle - \langle\log q(\mathbf{w}_{kl})\rangle\Big]$
$\quad + \sum_{k,j}\Big[\langle\log p(\rho_{kj})\rangle - \langle\log q(\rho_{kj})\rangle\Big] + \langle\log p(\boldsymbol{\pi})\rangle - \langle\log q(\boldsymbol{\pi})\rangle. \qquad (\mathrm{A1})$
Table A1 lists the computation for the expectations in the evidence lower bound (A1).
Table A1. Evaluation of the evidence lower bound.
Expectations of the Logarithm of Priors of the Latent Variables
$\langle\log p(y_{nl}\mid\mathbf{x}_{nk},\mathbf{r}_{nk},u_{nl},\phi_{nl}=1,z_n=k)\rangle = \tfrac{1}{2}\langle\log\sigma_{kl}\rangle + \tfrac{1}{2}\langle\log u_{nl}\rangle_{k1} - \tfrac{1}{2}\langle\sigma_{kl}\rangle\langle u_{nl}\rangle_{k1}\langle(\tilde{y}_{nlk}-\mu_{kl})^2\rangle + \mathrm{const.}$
$\langle\log p(y_{nl}\mid\mathbf{x}_{nk},\mathbf{r}_{nk},u_{nl},\phi_{nl}=0,z_n=k)\rangle = \tfrac{1}{2}\langle\log\sigma_{0l}\rangle + \tfrac{1}{2}\langle\log u_{nl}\rangle_{0} - \tfrac{1}{2}\langle\sigma_{0l}\rangle\langle u_{nl}\rangle_{0}\langle(\tilde{y}_{nlk}-\mu_{0l})^2\rangle + \mathrm{const.}$
$\langle\log p(u_{nl}\mid\phi_{nl}=1,z_n=k)\rangle = \tfrac{v_{kl}}{2}\log\tfrac{v_{kl}}{2} - \log\Gamma\big(\tfrac{v_{kl}}{2}\big) + \big(\tfrac{v_{kl}}{2}-1\big)\langle\log u_{nl}\rangle_{k1} - \tfrac{v_{kl}}{2}\langle u_{nl}\rangle_{k1}$
$\langle\log p(u_{nl}\mid\phi_{nl}=0)\rangle = \tfrac{v_{0l}}{2}\log\tfrac{v_{0l}}{2} - \log\Gamma\big(\tfrac{v_{0l}}{2}\big) + \big(\tfrac{v_{0l}}{2}-1\big)\langle\log u_{nl}\rangle_{0} - \tfrac{v_{0l}}{2}\langle u_{nl}\rangle_{0}$
$\langle\log p(\phi_{nl})\rangle = \langle\phi_{nl}\rangle\langle\log\beta_l\rangle + \langle 1-\phi_{nl}\rangle\langle\log(1-\beta_l)\rangle$
$\langle\log p(\mathbf{x}_{nk})\rangle = -\tfrac{p_k}{2}\log 2\pi - \tfrac{1}{2}\mathrm{tr}\big[\hat{C}_{nk}(\mathbf{r}_{nk})^{-1} + \hat{\mathbf{f}}_{nk}(\mathbf{r}_{nk})\hat{\mathbf{f}}_{nk}(\mathbf{r}_{nk})^{T}\big]$
$\langle\log p(\mathbf{r}_{nk})\rangle = \sum_{j}\big[\langle r_{nkj}\rangle\langle\log\rho_{kj}\rangle + \langle 1-r_{nkj}\rangle\langle\log(1-\rho_{kj})\rangle\big]$
$\langle\log p(z_n=k)\rangle = \langle\log\pi_k\rangle$
$\langle\log p(\boldsymbol{\pi})\rangle = \sum_{k}(\alpha_{0k}-1)\langle\log\pi_k\rangle + \mathrm{const.}$
$\langle\log p(\beta_l)\rangle = (\kappa_1-1)\langle\log\beta_l\rangle + (\kappa_2-1)\langle\log(1-\beta_l)\rangle + \mathrm{const.}$
$\langle\log p(\rho_{kj})\rangle = (\tau_1-1)\langle\log\rho_{kj}\rangle + (\tau_2-1)\langle\log(1-\rho_{kj})\rangle + \mathrm{const.}$
$\langle\log p(\mu_{kl})\rangle = -\tfrac{1}{2}\lambda_0\big[\hat{\lambda}_{kl}^{-1} + (\hat{s}_{kl}-s_{0l})^2\big] + \mathrm{const.}$
$\langle\log p(\mu_{0l})\rangle = -\tfrac{1}{2}\lambda_0\big[\hat{\lambda}_{0l}^{-1} + (\hat{s}_{0l}-s_{0l})^2\big] + \mathrm{const.}$
$\langle\log p(\sigma_{kl})\rangle = \big(\tfrac{\eta_0}{2}-1\big)\langle\log\sigma_{kl}\rangle - \tfrac{\xi_0}{2}\langle\sigma_{kl}\rangle + \mathrm{const.}$
$\langle\log p(\sigma_{0l})\rangle = \big(\tfrac{\eta_0}{2}-1\big)\langle\log\sigma_{0l}\rangle - \tfrac{\xi_0}{2}\langle\sigma_{0l}\rangle + \mathrm{const.}$
$\langle\log p(\mathbf{w}_{kl})\rangle = -\tfrac{p_k}{2}\log 2\pi - \tfrac{1}{2}m_0\,\mathrm{tr}\big[\hat{M}_{kl}^{-1} + \hat{\mathbf{m}}_{kl}\hat{\mathbf{m}}_{kl}^{T}\big]$
Expectations of the Logarithm of the Auxiliary Posteriors
$\langle\log q(u_{nl}\mid\phi_{nl}=1,z_n=k)\rangle = \hat{a}_{kl}\log\hat{b}_{nlk} - \log\Gamma(\hat{a}_{kl}) + (\hat{a}_{kl}-1)\langle\log u_{nl}\rangle_{k1} - \hat{b}_{nlk}\langle u_{nl}\rangle_{k1}$
$\langle\log q(u_{nl}\mid\phi_{nl}=0)\rangle = \hat{a}_{0l}\log\hat{b}_{nl0} - \log\Gamma(\hat{a}_{0l}) + (\hat{a}_{0l}-1)\langle\log u_{nl}\rangle_{0} - \hat{b}_{nl0}\langle u_{nl}\rangle_{0}$
$\langle\log q(\phi_{nl})\rangle = \langle\phi_{nl}\rangle\log q(\phi_{nl}=1) + \langle 1-\phi_{nl}\rangle\log q(\phi_{nl}=0)$
$\langle\log q(\mathbf{x}_{nk}\mid\mathbf{r}_{nk})\rangle = -\tfrac{p_k}{2}\log 2\pi + \tfrac{1}{2}\langle\log|\hat{C}_{nk}(\mathbf{r}_{nk})|\rangle - \tfrac{p_k}{2}$
$\langle\log q(\mathbf{r}_{nk})\rangle = \sum_{j}\big[\langle r_{nkj}\rangle\log q(r_{nkj}=1) + \langle 1-r_{nkj}\rangle\log q(r_{nkj}=0)\big]$
$\langle\log q(\boldsymbol{\pi})\rangle = \log\Gamma\big(\sum_{k}\hat{\alpha}_k\big) - \sum_{k}\log\Gamma(\hat{\alpha}_k) + \sum_{k}(\hat{\alpha}_k-1)\langle\log\pi_k\rangle$
$\langle\log q(\beta_l)\rangle = (\hat{\kappa}_{1l}-1)\langle\log\beta_l\rangle + (\hat{\kappa}_{2l}-1)\langle\log(1-\beta_l)\rangle - \log B(\hat{\kappa}_{1l},\hat{\kappa}_{2l})$
$\langle\log q(\rho_{kj})\rangle = (\hat{\tau}_{1kj}-1)\langle\log\rho_{kj}\rangle + (\hat{\tau}_{2kj}-1)\langle\log(1-\rho_{kj})\rangle - \log B(\hat{\tau}_{1kj},\hat{\tau}_{2kj})$
$\langle\log q(\mu_{kl})\rangle = \tfrac{1}{2}\log\hat{\lambda}_{kl} + \mathrm{const.}$
$\langle\log q(\mu_{0l})\rangle = \tfrac{1}{2}\log\hat{\lambda}_{0l} + \mathrm{const.}$
$\langle\log q(\sigma_{kl})\rangle = \tfrac{\hat{\eta}_{kl}}{2}\log\tfrac{\hat{\xi}_{kl}}{2} - \log\Gamma\big(\tfrac{\hat{\eta}_{kl}}{2}\big) + \big(\tfrac{\hat{\eta}_{kl}}{2}-1\big)\langle\log\sigma_{kl}\rangle - \tfrac{\hat{\xi}_{kl}}{2}\langle\sigma_{kl}\rangle$
$\langle\log q(\sigma_{0l})\rangle = \tfrac{\hat{\eta}_{0l}}{2}\log\tfrac{\hat{\xi}_{0l}}{2} - \log\Gamma\big(\tfrac{\hat{\eta}_{0l}}{2}\big) + \big(\tfrac{\hat{\eta}_{0l}}{2}-1\big)\langle\log\sigma_{0l}\rangle - \tfrac{\hat{\xi}_{0l}}{2}\langle\sigma_{0l}\rangle$
$\langle\log q(\mathbf{w}_{kl})\rangle = -\tfrac{p_k}{2}\log 2\pi + \tfrac{1}{2}\log|\hat{M}_{kl}| - \tfrac{p_k}{2}$

Appendix B

Figure A1 shows the frequency histogram of the standard deviation of the features (pixels) in the handwritten alphabet images. As can be seen, the standard deviations range from 0 to 120, and a large proportion of the features have zero or comparably small variance; most of them lie in the marginal area of the image. To improve computational efficiency, we removed the features with standard deviation smaller than ten at the pre-processing stage, which left 400 features for the clustering task.
In addition, we observed a singularity problem during the iterations of the clustering algorithms, because the variability of some pixels in the images of a letter is exactly zero. This happens when the clustering of the images gets close to their original grouping. To tackle the problem, we put a noise mask on the data, with each element of the mask generated from N(0, 0.1).
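A small R sketch (ours) of this pre-processing, reading N(0, 0.1) as a Gaussian with variance 0.1 (adjust noise_sd if 0.1 denotes the standard deviation):
```r
preprocess_images <- function(X, sd_cutoff = 10, noise_sd = sqrt(0.1)) {
  # X: one image per row, one pixel per column
  keep <- which(apply(X, 2, sd) >= sd_cutoff)   # drop the near-constant margin pixels
  X <- X[, keep, drop = FALSE]
  X + matrix(rnorm(length(X), sd = noise_sd), nrow(X), ncol(X))  # add the noise mask
}
```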
Figure A1. Frequency histogram of the standard deviation for the features in the handwritten alphabet data. The frequency bar corresponding to the standard deviation below 10 is marked in grey.

Appendix C

Figure A2 presents the reconstructed images for the alphabet A, B and C by the proposed algorithm.
Figure A2. Reconstructed images for the handwritten alphabet A, B and C by the proposed algorithm.

References

1. Jiang, Z.; Zheng, Y.; Tan, H.; Tang, B.; Zhou, H. Variational deep embedding: An unsupervised and generative approach to clustering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 1965–1972.
2. Yang, L.; Cheung, N.M.; Li, J.; Fang, J. Deep clustering by Gaussian mixture variational autoencoders with graph embedding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6440–6449.
3. Sun, J.; Zhou, A.; Keates, S.; Liao, S. Simultaneous Bayesian clustering and feature selection through student’s t mixtures model. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 1187–1199.
4. Law, M.H.C.; Figueiredo, M.A.T.; Jain, A.K. Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1154–1166.
5. Bouveyron, C.; Brunet-Saumard, C. Model-based clustering of high-dimensional data: A review. Comput. Stat. Data Anal. 2014, 71, 52–78.
6. Dash, M.; Liu, H. Feature selection for clustering. In Proceedings of the 4th International Conference on the Practical Application of Knowledge Discovery and Data Mining, Crowne Plaza Midland Hotel, Manchester, UK, 11–13 April 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 110–121.
7. Mitra, P.; Murthy, C.; Pal, S. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 301–312.
8. Pan, W.; Shen, X. Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 2007, 8, 1145–1164.
9. Bhattacharya, S.; McNicholas, P.D. A LASSO-penalized BIC for mixture model selection. Adv. Data Anal. Classif. 2014, 8, 45–61.
10. Constantinopoulos, C.; Titsias, M.K.; Likas, A. Bayesian feature and model selection for Gaussian mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1013–1018.
11. Li, Y.; Dong, M.; Hua, J. Simultaneous localized feature selection and model detection for Gaussian mixtures. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 953–960.
12. Hong, X.; Li, H.; Miller, P.; Zhou, J.; Li, L.; Crookes, D.; Lu, Y.; Li, X.; Zhou, H. Component-based feature saliency for clustering. IEEE Trans. Knowl. Data Eng. 2021, 33, 882–896.
13. Zhang, H.; Wu, Q.M.J.; Nguyen, T.M. Variational Bayes and localized feature selection for student’s t-mixture models. Int. J. Pattern Recognit. Artif. Intell. 2013, 27, 1350016.
14. Sun, J.; Zhou, A. Unsupervised robust Bayesian feature selection. In Proceedings of the 2014 International Joint Conference on Neural Networks, Beijing, China, 6–11 July 2014; pp. 558–564.
15. Perthame, E.; Friguet, C.; Causeur, D. Stability of feature selection in classification issues for high-dimensional correlated data. Stat. Comput. 2016, 26, 783–796.
16. Fan, J.; Ke, Y.; Wang, K. Factor-adjusted regularized model selection. J. Econom. 2020, 216, 71–85.
17. Mai, Q.; Zou, H.; Yuan, M. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika 2012, 99, 29–42.
18. Galimberti, G.; Soffritti, G. Using conditional independence for parsimonious model-based Gaussian clustering. Stat. Comput. 2013, 23, 625–638.
19. Devijver, E.; Gallopin, M. Block-diagonal covariance selection for high-dimensional Gaussian graphical models. J. Am. Stat. Assoc. 2018, 113, 306–314.
20. Ruan, L.; Yuan, M.; Zou, H. Regularized parameter estimation in high-dimensional Gaussian mixture models. Neural Comput. 2011, 23, 1605–1622.
21. McLachlan, G.J.; Bean, R.W.; Jones, L.B.T. Extension of the mixture of factor analyzers model to incorporate the multivariate t-distribution. Comput. Stat. Data Anal. 2007, 51, 5327–5338.
22. Archambeau, C.; Delannay, N.; Verleysen, M. Mixtures of robust probabilistic principal component analyzers. Neurocomputing 2008, 71, 1274–1282.
23. McNicholas, P.D.; Murphy, T.B. Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 2010, 26, 2705–2712.
24. Andrews, J.L.; McNicholas, P.D. Extending mixtures of multivariate t-factor analyzers. Stat. Comput. 2011, 21, 361–373.
25. Wang, Z.; Lan, C. Towards a hierarchical Bayesian model of multi-view anomaly detection. In Proceedings of the 29th International Joint Conference on Artificial Intelligence, Yokohama, Japan, 7–15 January 2020; pp. 2420–2426.
26. Mackay, D.J.C. Probable networks and plausible predictions—A review of practical Bayesian methods for supervised neural networks. Netw. Comput. Neural Syst. 1995, 6, 469–505.
27. Bhattacharya, A.; Dunson, D.B. Sparse Bayesian infinite factor models. Biometrika 2011, 98, 291–306.
28. Murphy, K.; Viroli, C.; Gormley, I.C. Infinite mixtures of infinite factor analysers. Bayesian Anal. 2020, 15, 937–963.
29. Ormerod, J.T.; You, C.; Müller, S. A variational Bayes approach to variable selection. Electron. J. Stat. 2017, 11, 3549–3594.
30. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1050–1059.
31. Beal, M.J.; Ghahramani, Z. The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. In Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting; Oxford University Press: Oxford, UK, 2003; pp. 453–463.
  31. Beal, M.J.; Ghahramani, Z. The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. In Bayesian Statistic 7: Proceedings of the Seventh Valencia International Meeting; Oxford University Press: Oxford, UK, 2003; pp. 453–463. [Google Scholar]
  32. Teh, Y.W.; Newman, D.; Welling, M. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 19 Proceedings of the 2006 Conference; MIT Press: Cambridge, MA, USA, 2006; pp. 1353–1360. [Google Scholar]
  33. Zhang, C.X.; Xu, S.; Zhang, J.S. A novel variational Bayesian method for variable selection in logistic regression models. Comput. Stat. Data Anal. 2019, 133, 1–19. [Google Scholar] [CrossRef]
  34. Huang, T.; Peng, H.; Zhang, K. Model selection for Gaussian mixture models. Stat. Sin. 2017, 27, 147–169. [Google Scholar] [CrossRef]
Figure 1. Plate diagram of the proposed hierarchical latent variable model.
Figure 2. Factor activity estimated by the proposed algorithm for the two synthetic datasets where the features are locally (a) independent and (b) correlated, respectively.
Figure 3. Feature saliency estimated via varFnMS (left), varFnMS-T (middle) and varFnMS-TFA (right) for the two synthetic datasets where the features are locally (a) independent and (b) correlated, respectively.
Figure 4. Factor activity estimated via the proposed algorithm for the four benchmark datasets: (a) Iris, (b) Olive, (c) Wine and (d) WDBC.
Figure 5. Feature saliency estimated via varFnMS (left), varFnMS-T (middle) and varFnMS-TFA (right) for the four benchmark datasets: (a) Iris, (b) Olive, (c) Wine and (d) WDBC. The saliency level of 0.5 is marked by the red dotted line.
Figure 6. Feature saliency estimated via varFnMS (left), varFnMS-T (middle) and varFnMS-TFA (right) for the handwritten alphabet data. The saliency level of 0.5 is marked by the red dotted line.
Figure 7. Reconstruction of the handwritten alphabet images via varFnMS (a), varFnMS-T (b) and varFnMS-TFA (c). The estimated centroids are outlined with a red box; within each box, the centroid, the class-specific mean and the background are shown from left to right.
Table 1. Evidence lower bound (ELBO) and classification error obtained via varFnMS, varFnMS-T and varFnMS-TFA for the two synthetic datasets where the features are locally independent and correlated, respectively. Standard deviations are given in parentheses.

Algorithm | ELBO (Independent) | Error (Independent) | ELBO (Correlated) | Error (Correlated)
varFnMS | −13,320.803 (136.822) | 0.076 (0.118) | −16,868.400 (113.338) | 0.351 (0.071)
varFnMS-T | −13,140.178 (763.027) | 0.073 (0.111) | −16,468.433 (135.973) | 0.302 (0.083)
varFnMS-TFA | −9082.876 (3320.315) | 0.071 (0.109) | −8854.751 (2696.577) | 0.051 (0.053)
Table 2. The centroids estimated via varFnMS, varFnMS-T and varFnMS-TFA for the two synthetic datasets where the features are locally independent and correlated, respectively. The first three result columns correspond to the independent dataset and the last three to the correlated dataset; standard deviations are given in parentheses.

Parameter (true value) | varFnMS (Ind.) | varFnMS-T (Ind.) | varFnMS-TFA (Ind.) | varFnMS (Corr.) | varFnMS-T (Corr.) | varFnMS-TFA (Corr.)

Class 1
μ11 = 0 | 0.405 (0.791) | 0.399 (0.910) | 0.392 (0.906) | 0.739 (1.056) | 0.064 (0.569) | 0.019 (0.140)
μ12 = 3 | 3.289 (0.664) | 3.178 (0.471) | 3.223 (0.670) | 3.922 (1.405) | 1.771 (1.270) | 3.093 (0.201)
μ13 = 0 | 0.000 (0.037) | −0.107 (0.497) | −0.116 (0.490) | −0.314 (0.760) | −1.730 (0.952) | −0.001 (0.121)
μ14 = 0 | −0.007 (0.049) | 0.041 (0.178) | −0.027 (0.096) | −0.009 (0.776) | −1.053 (0.814) | −0.025 (0.078)
μ15 = 0 | 0.002 (0.043) | −0.025 (0.098) | 0.000 (0.044) | −0.271 (0.915) | −1.748 (0.869) | −0.020 (0.041)
μ16 = 0 | 0.005 (0.044) | −0.086 (0.393) | −0.031 (0.121) | 0.011 (0.210) | −0.390 (0.428) | 0.000 (0.000)
μ17 = 0 | −0.016 (0.043) | −0.073 (0.292) | −0.017 (0.046) | −0.185 (0.543) | 0.060 (0.537) | −0.014 (0.046)
μ18 = 0 | −0.007 (0.039) | −0.051 (0.194) | −0.032 (0.119) | −0.004 (0.448) | 0.340 (0.670) | −0.052 (0.180)
μ19 = 0 | 0.034 (0.050) | 0.075 (0.228) | 0.015 (0.032) | 0.019 (0.866) | 1.410 (0.979) | −0.012 (0.028)
μ1,10 = 0 | −0.021 (0.040) | −0.168 (0.662) | −0.034 (0.125) | 0.026 (0.422) | 0.262 (0.237) | −0.004 (0.030)

Class 2
μ21 = 1 | 1.332 (0.650) | 1.096 (0.447) | 1.058 (0.282) | 2.000 (0.939) | 1.129 (0.678) | 1.049 (0.151)
μ22 = 9 | 8.798 (0.802) | 8.875 (0.478) | 8.793 (0.862) | 6.712 (1.229) | 6.762 (0.863) | 9.012 (0.101)
μ23 = 0 | 0.100 (0.472) | 0.136 (0.622) | 0.040 (0.209) | −0.024 (0.984) | 0.340 (0.315) | 0.008 (0.066)
μ24 = 0 | 0.002 (0.057) | 0.015 (0.047) | 0.008 (0.035) | 0.071 (0.738) | 0.487 (0.792) | 0.042 (0.091)
μ25 = 0 | −0.207 (0.926) | −0.228 (1.006) | −0.192 (0.865) | −0.028 (0.647) | 0.394 (0.321) | −0.011 (0.028)
μ26 = 0 | −0.049 (0.138) | −0.049 (0.167) | −0.012 (0.037) | −0.107 (0.566) | 0.124 (0.305) | 0.004 (0.019)
μ27 = 0 | −0.034 (0.200) | −0.053 (0.231) | −0.001 (0.027) | −0.293 (0.578) | −0.468 (0.615) | −0.007 (0.031)
μ28 = 0 | −0.083 (0.367) | −0.094 (0.423) | −0.093 (0.417) | −0.052 (0.684) | −0.213 (0.314) | 0.008 (0.067)
μ29 = 0 | 0.036 (0.189) | 0.039 (0.202) | 0.004 (0.033) | −0.340 (0.611) | −0.702 (0.637) | −0.009 (0.027)
μ2,10 = 0 | 0.009 (0.129) | 0.018 (0.193) | 0.040 (0.245) | −0.099 (0.550) | −0.147 (0.160) | 0.002 (0.032)

Class 3
μ31 = 6 | 5.684 (1.136) | 5.691 (0.998) | 5.758 (0.767) | 4.773 (1.507) | 5.313 (1.603) | 6.013 (0.158)
μ32 = 4 | 4.392 (0.843) | 4.514 (1.246) | 4.553 (1.464) | 3.962 (0.782) | 3.644 (0.726) | 4.174 (0.656)
μ33 = 0 | −0.082 (0.504) | 0.056 (0.402) | 0.064 (0.215) | 0.206 (0.502) | 0.127 (0.454) | 0.018 (0.049)
μ34 = 0 | 0.114 (0.570) | −0.015 (0.722) | −0.063 (0.631) | 0.053 (0.466) | −0.204 (0.516) | 0.006 (0.037)
μ35 = 0 | −0.025 (0.289) | −0.015 (0.330) | 0.062 (0.305) | 0.197 (0.517) | 0.131 (0.428) | −0.011 (0.019)
μ36 = 0 | −0.181 (0.557) | −0.032 (0.632) | −0.087 (0.854) | 0.076 (0.390) | 0.064 (0.196) | 0.002 (0.008)
μ37 = 0 | −0.137 (0.646) | −0.176 (0.761) | −0.213 (0.649) | 0.269 (0.495) | 0.703 (0.619) | −0.005 (0.051)
μ38 = 0 | −0.241 (0.744) | −0.332 (0.946) | −0.141 (0.465) | −0.170 (0.599) | 0.216 (0.427) | 0.010 (0.034)
μ39 = 0 | 0.030 (0.270) | −0.030 (0.196) | −0.017 (0.158) | −0.023 (0.711) | 0.468 (0.629) | −0.012 (0.033)
μ3,10 = 0 | −0.309 (0.771) | −0.304 (0.767) | −0.142 (0.416) | −0.017 (0.512) | 0.068 (0.156) | −0.001 (0.030)

Class 4
μ41 = 7 | 6.792 (0.752) | 6.966 (0.130) | 6.960 (0.123) | 5.443 (1.236) | 6.329 (0.899) | 6.709 (1.289)
μ42 = 10 | 9.857 (0.521) | 9.830 (0.689) | 9.831 (0.688) | 9.039 (1.183) | 9.649 (0.623) | 9.741 (0.995)
μ43 = 0 | 0.002 (0.051) | −0.007 (0.060) | −0.005 (0.045) | 0.075 (0.100) | 0.034 (0.079) | −0.006 (0.046)
μ44 = 0 | 0.000 (0.053) | 0.005 (0.036) | 0.000 (0.021) | −0.072 (0.111) | −0.061 (0.129) | 0.380 (1.725)
μ45 = 0 | 0.008 (0.043) | 0.009 (0.041) | 0.002 (0.032) | 0.043 (0.175) | −0.013 (0.096) | −0.015 (0.027)
μ46 = 0 | −0.003 (0.041) | −0.003 (0.037) | −0.006 (0.025) | −0.017 (0.095) | −0.044 (0.084) | 0.002 (0.008)
μ47 = 0 | −0.025 (0.063) | −0.011 (0.058) | 0.002 (0.026) | 0.111 (0.117) | 0.079 (0.098) | −0.097 (0.425)
μ48 = 0 | −0.003 (0.051) | −0.005 (0.045) | −0.004 (0.031) | −0.053 (0.106) | −0.026 (0.077) | 0.081 (0.434)
μ49 = 0 | 0.029 (0.055) | 0.021 (0.051) | 0.016 (0.032) | 0.071 (0.118) | 0.050 (0.106) | −0.011 (0.035)
μ4,10 = 0 | −0.026 (0.048) | −0.027 (0.049) | −0.011 (0.036) | −0.001 (0.059) | 0.015 (0.053) | −0.001 (0.031)
Table 3. Classification error obtained via varFnMS, varFnMS-T and varFnMS-TFA for the four benchmark datasets.

Dataset | N | d | K | varFnMS | varFnMS-T | varFnMS-TFA
Iris | 150 | 4 | 3 | 0.093 | 0.093 | 0.020
Olive | 572 | 8 | 3 | 0.203 | 0.199 | 0.042
Wine | 178 | 27 | 3 | 0.079 | 0.062 | 0.073
WDBC | 569 | 30 | 2 | 0.095 | 0.095 | 0.088