Article

Variational Bayesian Inference in High-Dimensional Linear Mixed Models

Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, Kunming 650091, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(3), 463; https://doi.org/10.3390/math10030463
Submission received: 24 December 2021 / Revised: 26 January 2022 / Accepted: 27 January 2022 / Published: 31 January 2022
(This article belongs to the Special Issue Bayesian Inference and Modeling with Applications)

Abstract

In high-dimensional regression models, the Bayesian lasso with the Gaussian spike and slab priors is widely adopted to select variables and estimate unknown parameters. However, it involves large matrix computations in a standard Gibbs sampler. To solve this issue, the Skinny Gibbs sampler is employed to draw observations required for Bayesian variable selection. However, when the sample size is much smaller than the number of variables, the computation is rather time-consuming. As an alternative to the Skinny Gibbs sampler, we develop a variational Bayesian approach to simultaneously select variables and estimate parameters in high-dimensional linear mixed models under the Gaussian spike and slab priors of population-specific fixed-effects regression coefficients, which are reformulated as a mixture of a normal distribution and an exponential distribution. The coordinate ascent algorithm, which can be implemented efficiently, is proposed to optimize the evidence lower bound. The Bayes factor, which can be computed with the path sampling technique, is presented to compare two competing models in the variational Bayesian framework. Simulation studies are conducted to assess the performance of the proposed variational Bayesian method. An empirical example is analyzed by the proposed methodologies.

1. Introduction

Linear mixed models are widely used to analyze longitudinal and correlated data in many fields, such as psychology, medicine, epidemiology and econometrics, by considering the between-subject and within-subject variations and incorporating random effects to account for heterogeneity among subjects. Although variable selection and parameter estimation in linear mixed models are quite challenging, various methods have been developed to estimate fixed-effects parameters and variance–covariance matrices for unobservable random effects and noises or to select fixed-effects and random-effects components. For example, see [1] for restricted maximum likelihood estimation of parameters, ref [2] for the EM algorithm for parameter estimation, refs [3,4] for Bayesian parameter estimation, ref [5] for Bayesian random effects selection and [6] for a moment-based method for random effects selection. The aforementioned methods mainly focus on low-dimensional linear mixed models, while high-dimensional data have become increasingly common with the rapid development of modern information technologies that facilitate data collection. Thus, these methods do not work well in high-dimensional linear mixed models, and some penalized methods have been developed to simultaneously estimate parameters and select variables in high-dimensional linear mixed models. For example, Bondell, Krishna and Ghosh [7] and Ibrahim et al. [8] proposed penalized likelihood methods for joint selection of fixed and random effects; Schelldorfer, Buhlmann and van De Geer [9] proposed an ℓ1-penalized estimation procedure; Fan and Li [10] investigated the problem of fixed and random effects selection when the cluster sizes are balanced; Li et al. [11] presented a doubly regularized approach to simultaneously select fixed and random effects; Bradic, Claeskens and Gueuning [12] considered testing a single parameter of fixed effects in high-dimensional linear mixed models with fixed cluster sizes, fixed numbers of random effects and sub-Gaussian designs; Li, Cai and Li [13] proposed a penalized quasi-likelihood method for statistical inference on unknown parameters in high-dimensional linear mixed-effects models. However, these regularization methods are computationally complex and unstable, and they do not incorporate prior information on the fixed-effects parameters and variance–covariance matrices, which may lead to unsatisfactory estimation accuracy for parameters or variance–covariance matrices. Bayesian approaches for variable selection and parameter estimation have received much attention over the past years because, by imposing various priors on model parameters, they can largely improve the accuracy and efficiency of parameter estimation, consistently select important variables and provide more information for variable selection than the corresponding penalization methods, which involve highly non-convex optimization problems. For example, see [14] for the reference prior, ref [15] for the normal mixture prior, ref [16] for the spike and slab prior, ref [17] for the horseshoe prior and [18] for the shrinking and diffusing prior. In the high-dimensional setting, the Bayesian lasso, the Bayesian adaptive lasso or the indicator model method, together with the Markov chain Monte Carlo (MCMC) algorithm, are widely used to select important variables. For example, see [19] for the Bayesian lasso, ref [20] for the Bayesian adaptive lasso and [21,22] for the EM approach in the Bayesian framework.
The above-mentioned literature involves the implementation of the standard Gibbs sampler for posterior computation, which is not so scalable for large numbers of fixed-effects components [23]. To address the issue, Narisetty, Shen and He [23] proposed a Skinny Gibbs algorithm by using a sparse matrix to replace the high-dimensional variance–covariance matrix, which avoids large matrix operations. However, implementing the above MCMC algorithm in the presence of high-dimensional data may still be subject to the well-known ill-posed problems, i.e., low efficiency, slow convergence and huge memory being required.
As an alternative to the MCMC, the variational Bayesian method, also called ensemble learning, is widely adopted to approximate intractable integrals involved in Bayesian inference or machine learning due to its good properties, such as high-speed computation. Its basic idea is to transform the high-dimensional integration problem into an optimization problem in making Bayesian inference and then optimize the evidence lower bound (ELB), which is efficiently computed, and finally utilize the ELB to obtain a variational approximation to the posterior distribution in Bayesian analysis. The variational Bayesian approach has been applied to some familiar models, for example, latent variable models [24], mixtures of factor analysis [25], graphical models [26] and partially linear mean shift models with high-dimensional data [27].
Motivated by the aforementioned variational Bayesian studies, we develop a novel variational Bayesian approach to estimate model parameters and select important variables under the Skinny Gibbs sampling framework in a linear mixed model with low-dimensional random effects and high-dimensional fixed effects. We specify the spike and slab priors for the population-specific fixed-effects regression coefficients with completely different shrinkage parameters, which overcomes the problem of selecting a high-dimensional vector of shrinkage parameters. We reformulate the spike and slab priors of the parameters as a mixture of a normal distribution and an exponential distribution, which avoids the high-dimensional integral problem. The coordinate ascent algorithm, which can be implemented efficiently, is proposed to optimize the ELB. The Bayes factor, which can be computed with the path sampling technique, is presented to compare two competing models in the variational Bayesian framework. The merits of the proposed variational Bayesian method are (i) simultaneously estimating parameters and variance–covariance matrices and selecting fixed- and random-effects components at quite a low computational cost, (ii) efficiently analyzing high-dimensional data without requiring non-convex optimization and avoiding the curse of dimensionality, (iii) automatically incorporating the shrinkage parameters and (iv) avoiding large matrix computations.
The rest of the article is organized as follows: Section 2 introduces the linear mixed model setup, including the spike and slab priors. Section 3 describes the Skinny Gibbs sampler algorithm for selecting fixed- and random-effects components and estimating parameters in coefficients and variance–covariance matrices via the Bayesian lasso method. Section 4 develops a variational Bayesian approach to approximate posterior distributions of parameters and random effects and presents the Bayes factor for model comparison. The coordinate ascent algorithm is adopted to optimize the ELB in Section 4. Simulation studies are considered in Section 5. An empirical example is illustrated by the proposed methodologies in Section 6. A brief discussion is given in Section 7. Technical details are presented in Appendix A, Appendix B and Appendix C.

2. Model

Consider a dataset with n subjects. For the ith subject, let y_ij be the observation of the response variable, x_ij be a p × 1 vector of covariates associated with the fixed effects and z_ij be a q × 1 vector of covariates associated with the random effects, which may be a subvector of x_ij, for j = 1, …, n_i, where n_i is the number of repeated observations on the ith subject. Generally, n_i varies across subjects. For simplicity, we suppose that y_ij has been centered at zero to avoid the need for an intercept and that n_1 = ⋯ = n_n = m, i.e., a balanced design. It is assumed that p ≫ n and that only a small number of the covariates in x_ij contribute to the response variable y_ij, i.e., the fixed-effects coefficient vector is sparse, while q is smaller than n.
For the dataset D = { ( y i j , x i j , z i j ) : i = 1 , , n , j = 1 , , m } , we consider the following linear mixed model:
y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, \quad i = 1,\ldots,n, \; j = 1,\ldots,m,
where β = ( β 1 , , β p ) is a  p × 1 vector of population-specific fixed-effects regression coefficients, b i is a  q × 1 vector of subject-specific random effects and ε i j is measurement error. Here, we assume that b 1 , , b n are independent and identically distributed (i.i.d.) as the multivariate normal distribution with mean zero and covariance matrix Q and ε i j ’s are independently distributed as N ( 0 , σ j 2 ) , where N ( · , · ) represents the normal distribution. Here, σ 1 2 , , σ m 2 are not completely different but some of them may be identical.
Under the aforementioned assumptions, a penalized likelihood approach to estimate β is implemented by
\hat{\beta} = \arg\max_{\beta \in \mathbb{R}^p} \left\{ -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(y_{ij} - x_{ij}^{\top}\beta)^2}{\sigma_j^2 + z_{ij}^{\top} Q z_{ij}} - f_{\lambda}(\beta) \right\},
where f_\lambda(\beta) is some appropriate penalty function indexed by the penalty parameter \lambda. In the variable selection literature, it is usually assumed that f_\lambda(\beta) has the form f_\lambda(\beta) = \sum_{k=1}^{p} f_{\lambda_k}(\beta_k), where f_{\lambda_k}(\beta_k) can be taken as the ℓ_0-norm, the ℓ_1-norm, the MCP penalty [28], the SCAD penalty [29] or the Elastic-Net penalty [30]. Recently, it has been widely recognized that \hat{\beta} can be regarded as a posterior mode of \beta under some proper prior. Inspired by this idea, we consider a Bayesian variable selection procedure based on some proper prior on \beta.
Following [31], we consider the following spike and slab prior of  β :
f(\beta \mid \gamma, \lambda_0, \lambda_1) = \prod_{k=1}^{p} \left\{ \gamma_k\, g_1(\beta_k \mid \lambda_1) + (1 - \gamma_k)\, g_0(\beta_k \mid \lambda_0) \right\},
where γ = ( γ 1 , , γ p ) , in which γ k is a binary latent variable and follows a  Bernoulli distribution with the probability ρ k = Pr ( γ k = 1 ) , i.e., γ k = 1 indicates that the kth covariate is active and γ k = 0 implies that the kth covariate is inactive and g 1 ( β k | λ 1 ) is usually referred to as a diffuse slab prior reflecting the effect of an active covariate, while g 0 ( β k | λ 0 ) is called a concentrated spike prior reflecting the negligibly unimportant effect of an inactive covariate for k = 1 , , p . Let f ( γ | ρ ) be the prior distribution of  γ indexed by ρ . It is assumed that f ( γ | ρ ) has the form
f(\gamma \mid \rho) = \prod_{k=1}^{p} \rho_k^{\gamma_k} (1 - \rho_k)^{1 - \gamma_k},
where ρ = ( ρ 1 , , ρ p ) . For simplicity, we assume ρ 1 = = ρ p = ρ , which is the expected proportion of the active covariates. Generally, one can take g 0 ( · ) and g 1 ( · ) as the normal distribution with a small and a large variance, respectively. However, for the spike and slab lasso, we take the following slab and spike priors
g_1(\beta_k \mid \lambda_1) = \frac{\lambda_1}{2} e^{-\lambda_1 |\beta_k|}, \qquad g_0(\beta_k \mid \lambda_0) = \frac{\lambda_0}{2} e^{-\lambda_0 |\beta_k|},
respectively, where \lambda_1 should tend to zero and \lambda_0 should tend to infinity as the sample size becomes sufficiently large, which implies that the inactive covariates will be detected as zeros in that small values of \beta_k relative to 1/\lambda_0 or 1/\lambda_1 are truncated to zero. Following [32], the density g(\beta_k \mid \lambda) = (\lambda/2)\exp(-\lambda|\beta_k|) can be hierarchically written as a mixture of a normal distribution and an exponential distribution, i.e.,
\beta_k \mid \xi_{\ell k}^2, \gamma_k = \ell \sim N(0, \xi_{\ell k}^2), \qquad \xi_{\ell k}^2 \mid \lambda_\ell^2 \sim \mathrm{Exp}(\lambda_\ell^2/2), \qquad \ell = 0, 1.
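As a quick numerical check of this representation (our own illustration, not part of the original paper), drawing \xi_k^2 from the exponential distribution and then \beta_k from N(0, \xi_k^2) reproduces the Laplace density with rate \lambda:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0            # Laplace rate parameter lambda (illustrative value)
n_draws = 100_000

# Hierarchical draw: xi_k^2 ~ Exp(lambda^2/2), then beta_k | xi_k^2 ~ N(0, xi_k^2).
# Exp with rate lambda^2/2 corresponds to scale = 2/lambda^2 in NumPy.
xi2 = rng.exponential(scale=2.0 / lam**2, size=n_draws)
beta = rng.normal(loc=0.0, scale=np.sqrt(xi2))

# Direct draws from the Laplace density (lambda/2) * exp(-lambda * |beta|), i.e. scale 1/lambda.
beta_direct = rng.laplace(loc=0.0, scale=1.0 / lam, size=n_draws)

# Both samples have variance close to 2 / lambda^2 = 0.5.
print(beta.var(), beta_direct.var())
```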
Incorporating the above idea shows that the posterior distributions of binary latent variables can be employed to distinguish active covariates from inactive ones in the considered model.
For covariance matrix Q, the proportion ρ , λ 0 2 , λ 1 2 and σ j 2 , we consider the following priors:
Q \sim \mathrm{IW}(S_0, \nu_0), \quad \rho \sim \mathrm{Beta}(a_\gamma, b_\gamma), \quad \lambda_0^2 \sim \Gamma(c_0, d_0), \quad \lambda_1^2 \sim \Gamma(c_1, d_1), \quad \sigma_j^2 \sim \Gamma(c_2, d_2),
where IW(·,·) denotes the inverted Wishart distribution, Beta(·,·) represents the Beta distribution, Γ(·,·) is the gamma distribution, IG(·,·) is the inverse gamma distribution and S_0, \nu_0, a_\gamma, b_\gamma, c_0, d_0, c_1, d_1, c_2 and d_2 are user-specified hyperparameters. As mentioned above, \lambda_1 should tend to zero and \lambda_0 should tend to infinity as the sample size becomes sufficiently large, which implies that c_1/d_1 should be smaller than c_0/d_0. To this end, we assume c_1 ≪ c_0 and d_0 ≪ d_1.
Based on the above discussion, we can rewrite the considered linear mixed model together with the spike and slab lasso prior as the following hierarchical models:
y_{ij} \mid b_i \sim N(\mu_{ij}, \sigma_j^2), \quad \mu_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i, \quad i = 1,\ldots,n, \; j = 1,\ldots,m,
b_i \sim N_q(0, Q), \quad i = 1,\ldots,n,
\beta_k \mid \xi_{1k}^2, \gamma_k = 1 \sim N(0, \xi_{1k}^2), \quad \xi_{1k}^2 \mid \lambda_1^2 \sim \mathrm{Exp}(\lambda_1^2/2), \quad \lambda_1^2 \sim \Gamma(c_1, d_1),
\beta_k \mid \xi_{0k}^2, \gamma_k = 0 \sim N(0, \xi_{0k}^2), \quad \xi_{0k}^2 \mid \lambda_0^2 \sim \mathrm{Exp}(\lambda_0^2/2), \quad \lambda_0^2 \sim \Gamma(c_0, d_0),
\gamma_k \sim \mathrm{Bernoulli}(\rho), \quad k = 1,\ldots,p,
Q \sim \mathrm{IW}(S_0, \nu_0), \quad \rho \sim \mathrm{Beta}(a_\gamma, b_\gamma), \quad \sigma_j^2 \sim \Gamma(c_2, d_2), \quad j = 1,\ldots,m.
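The hierarchical model can be simulated forward directly; the following minimal sketch (our own illustration, with arbitrary dimensions and hyperparameter values, not taken from the paper) draws one dataset from model (8):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, q = 50, 6, 200, 4              # illustrative dimensions
rho, lam1, lam0 = 0.02, 0.5, 50.0        # illustrative inclusion probability and shrinkage rates

# Spike-and-slab coefficients via the normal/exponential scale mixture.
gamma = rng.binomial(1, rho, size=p)
xi2 = np.where(gamma == 1,
               rng.exponential(2.0 / lam1**2, size=p),    # slab: large scale
               rng.exponential(2.0 / lam0**2, size=p))    # spike: tiny scale
beta = rng.normal(0.0, np.sqrt(xi2))

Q = 0.1 + 0.9 * np.eye(q)                # random-effects covariance (illustrative)
sigma2 = np.linspace(0.8, 1.0, m)        # occasion-specific error variances

X = rng.normal(size=(n, m, p))           # fixed-effects covariates x_ij
Z = rng.normal(size=(n, m, q))           # random-effects covariates z_ij
b = rng.multivariate_normal(np.zeros(q), Q, size=n)

Y = (X @ beta                            # x_ij' beta
     + np.einsum("imq,iq->im", Z, b)     # z_ij' b_i
     + rng.normal(0.0, np.sqrt(sigma2), size=(n, m)))
print(Y.shape, int(gamma.sum()))
```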

3. Skinny Gibbs Sampler for Bayesian Lasso

Let Y = { y i j : i = 1 , , n , j = 1 , , m } , X = { x i j : i = 1 , , n , j = 1 , , m } and Z = { z i j : i = 1 , , n , j = 1 , , m } . From Equation (8), the joint posterior density of parameters β , Q, γ = { γ 1 , , γ p } , σ 2 = ( σ 1 2 , , σ m 2 ) and ϑ = { ρ , λ 0 , λ 1 } given the data D = { Y , X , Z } is given by
f(\beta, Q, \gamma, \sigma^2, \vartheta \mid D) \propto \prod_{i=1}^{n}\prod_{j=1}^{m}\psi\big(y_{ij},\, x_{ij}^{\top}\beta,\, z_{ij}^{\top}Q^{-1}z_{ij} + \sigma_j^2\big)\prod_{j=1}^{m}f(\sigma_j^2) \times \prod_{k=1}^{p}\{\rho\, g_1(\beta_k \mid \lambda_1)\}^{\gamma_k}\{(1-\rho)\, g_0(\beta_k \mid \lambda_0)\}^{1-\gamma_k}\, f_W(Q)\, f_\vartheta(\vartheta),
where ψ ( x , μ , ς 2 ) is the probability density of normal random variable x with mean μ and variance ς 2 , f ( σ j 2 ) denotes the probability density of random variable σ j 2 , f W ( Q ) is the inverted Wishart density function of random matrix Q and f ϑ ( ϑ ) represents the joint prior density function of random variable vector ϑ . It is rather difficult to sample observations from the joint posterior density given in Equation (9) in the presence of high-dimensional fixed effects because of some non-standard distributions and large matrix computations involved. In what follows, the Gibbs sampler is utilized to sample observations required for Bayesian inference.
To avoid expensive computation in running the Gibbs sampler, similarly to [23], at each Gibbs iteration, we divide parameter vector β into two subvectors corresponding to those active (i.e., γ k = 1 ) and inactive (i.e., γ k = 0 ) covariates, respectively. To wit, we define β = ( β A , β I ) , where β A and β I are the subvectors of  β associated with γ k = 1 and γ k = 0 , respectively. Suppose that the cardinality of the set A is r. Without loss of generality, it is assumed that the first r components of  β correspond to β A and the last p r components of  β correspond to β I . Similarly, we decompose x i j as x i j = ( x i j A , x i j I ) . Under the above assumptions, the Gibbs sampler is implemented as follows. Observations required at each Gibbs iteration are iteratively drawn from the following conditional distributions: f A ( β A | D , b , σ 2 ) , f I ( β I | D ) , f ( b i | D , β , σ 2 , Q ) , f ( ξ 0 k 2 | β k , γ k ) , f ( ξ 1 k 2 | β k , γ k ) , f γ ( γ k | D , b , ξ 1 , ξ 0 ) , f ( Q | b ) , f ( σ j 2 | D , b ) , f ( ρ | γ ) , f ( λ 0 2 | ξ 0 ) and f ( λ 1 2 | ξ 1 ) , which are given in Appendix A, where b = { b 1 , , b n } , ξ 0 = { ξ 01 2 , , ξ 0 p 2 } and ξ 1 = { ξ 11 2 , , ξ 1 p 2 } .
Although the Skinny Gibbs sampler introduced above can be easily conducted, it is rather time-consuming for a sufficiently large p. To address the issue, we investigate a fast yet efficient approach as follows, i.e., the variational Bayesian method.

4. Variational Bayesian Inference

4.1. Variational Bayes

It follows from the principle of variational inference that it is necessary to first construct a variational set F of densities for random variables Ξ having the same support as the posterior density f ( Ξ | D ) , where Ξ = { β , b , ξ 0 , ξ 1 , Q , γ , σ 2 , ϑ } . It is assumed that q ( Ξ ) F is any variational density for approximating f ( Ξ | D ) . The variational Bayes aims to find the best approximation to f ( Ξ | D ) in terms of the Kullback–Leibler divergence between q ( Ξ ) and f ( Ξ | D ) , which is a solution to the optimization problem:
q(\Xi) = \arg\min_{q(\Xi) \in \mathcal{F}} \mathrm{KL}\big(q(\Xi)\,\|\, f(\Xi \mid D)\big),
where
\mathrm{KL}\big(q(\Xi)\,\|\,f(\Xi \mid D)\big) = \int \log\frac{q(\Xi)}{f(\Xi \mid D)}\, q(\Xi)\, d\Xi = \int \log\frac{q(\Xi)\, f(Y \mid X, Z)}{f(\Xi, Y \mid X, Z)}\, q(\Xi)\, d\Xi
= E_{q(\Xi)}\{\log q(\Xi)\} - E_{q(\Xi)}\{\log f(\Xi, Y \mid X, Z)\} + \log f(Y \mid X, Z) \geq 0,
in which E_{q(\Xi)}(\cdot) is the expectation taken with respect to q(\Xi). Here, \mathrm{KL}(q(\Xi)\,\|\,f(\Xi \mid D)) equals zero if and only if q(\Xi) = f(\Xi \mid D). Due to the intractable high-dimensional integral involved, it is quite troublesome to solve the above optimization problem directly.
However, it follows from L\{q(\Xi)\} = E_{q(\Xi)}\{\log f(\Xi, Y \mid X, Z)\} - E_{q(\Xi)}\{\log q(\Xi)\} that
\log f(Y \mid X, Z) = \mathrm{KL}\big(q(\Xi)\,\|\,f(\Xi \mid D)\big) + L\{q(\Xi)\} \geq L\{q(\Xi)\}.
Thus, L\{q(\Xi)\} can be regarded as a lower bound of \log f(Y \mid X, Z) and is usually referred to as the evidence lower bound (ELB). Minimizing \mathrm{KL}(q(\Xi)\,\|\,f(\Xi \mid D)) is then equivalent to maximizing L\{q(\Xi)\} because \log f(Y \mid X, Z) does not depend on q(\Xi). That is,
q(\Xi) = \arg\min_{q(\Xi) \in \mathcal{F}} \mathrm{KL}\big(q(\Xi)\,\|\,f(\Xi \mid D)\big) = \arg\max_{q(\Xi) \in \mathcal{F}} L\{q(\Xi)\}.
The problem of finding the best approximation to f(\Xi \mid D) is thus transformed into an optimization problem of maximizing L\{q(\Xi)\} over the variational family \mathcal{F}. The complexity of this optimization problem depends on that of the variational set \mathcal{F}, so it is desirable to carry out the optimization over a relatively simple variational set \mathcal{F}.
Following the widely used methods for constructing a relatively simple variational set, we take F as the mean-field variational family in which components of  Ξ are mutually independent and each has a distinct factor in the variational density. Thus, we can assume that the variational density q ( Ξ ) has the form
q(\Xi) = q(\beta)\, q(b)\, q(\sigma^2)\, q(\gamma)\, q(Q)\, q(\vartheta) \prod_{k=1}^{p} \{q(\xi_{0k}^2)\, q(\xi_{1k}^2)\} \equiv \prod_{s=1}^{S} q_s(\zeta_s),
where q s ( ζ s ) s are unspecified but the above assumed factorization across components is pre-specified. Similarly to considerable variational literature, the optimal solutions of  q s ( ζ s ) s can be obtained by maximizing L { q ( ζ 1 , , ζ S ) } via the coordinate ascent method, where Ξ = { ζ 1 , , ζ S } .
Following the idea of the coordinate ascent method given in [33,34,35], when fixing other variational factors q j ( ζ j ) for  j s , i.e., ζ s = { ζ j : j s , j = 1 , , S } , the optimal variational density q s ( ζ s ) maximizing L { q ( Ξ ) } with respect to q s ( ζ s ) has the form
q_s(\zeta_s) \propto \exp\big[E_{-s}\{\log f(\zeta_s \mid \zeta_{-s}, D)\}\big] \propto \exp\big[E_{-s}\{\log f(Y, \Xi \mid X, Z)\}\big],
where f(\zeta_s \mid \zeta_{-s}, D) is the conditional density of \zeta_s given (\zeta_{-s}, D), \zeta_{-s} = \{\zeta_j : j \neq s, j = 1,\ldots,S\}, and E_{-s}(\cdot) represents the expectation evaluated with respect to q_{-s}(\zeta_{-s}) = \prod_{j \neq s} q_j(\zeta_j). Equation (15) implies that E_{-s}(\cdot) does not involve the sth variational factor q_s(\zeta_s); however, the optimal variational density q_s(\zeta_s) cannot be obtained directly because the q_j(\zeta_j)'s on the right-hand side are not yet the optimal ones. To address this issue, the coordinate updating algorithm is employed to iteratively update q_s(\zeta_s) via Equation (15). After the coordinate updating algorithm converges, we take the mean or mode of the optimal variational density q_s(\zeta_s) as the variational Bayesian estimate of \zeta_s and regard a covariate as active if its corresponding variational Bayesian estimate deviates from zero.
It is easily shown from Equation (15) that the optimal density q β ( β ) has the form
q_{\beta_A}(\beta_A) \sim N_r(\mu_A, \Sigma_A), \qquad q_{\beta_I}(\beta_I) \sim N_{p-r}(0, \Sigma_I),
respectively, where Σ A 1 = i = 1 n j = 1 m x i j A x i j A E σ j 2 ( σ j 2 ) + diag ( ξ A ) with ξ A = { E ξ 1 k ( ξ 1 k 2 ) , k A } , μ A = Σ A [ i = 1 n j = 1 m x i j A { y i j z i j E b i ( b i ) } E σ j 2 ( σ j 2 ) ] and Σ I 1 = diag ( i = 1 n j = 1 m x i j I x i j I ) + diag ( ξ I 0 ) = n m I p r + diag ( ξ I 0 ) with ξ I 0 = { E ξ 0 k ( ξ 0 k 2 ) , k I } , in which E σ j 2 ( · ) , E ξ 1 k ( · ) , E ξ 0 k ( · ) and E b i ( · ) are the expectations taken with respect to q σ j 2 ( σ j 2 ) , q ξ 1 k ( ξ 1 k 2 ) , q ξ 0 k ( ξ 0 k 2 ) and q b i ( b i ) , respectively. Then, the estimated posterior means and variance matrices of  β A and β I for the variational densities q β A ( β A ) and q β I ( β I ) are E A ( β A ) = μ A , var A ( β A ) = Σ A , E I ( β I ) = 0 and var I ( β I ) = Σ I , respectively. Moreover, the mode estimator β A q of  β A for the variational density q β A ( β A ) is β A q = μ A , while the mode estimator β I q of  β I for the variational density q β I ( β I ) is β I q = 0 .
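As a computational illustration (ours, not from the paper), the update of q_{\beta_A}(\beta_A) amounts to one ridge-type linear solve. The sketch below assumes the responses have already had the random-effects part z_{ij}^{\top}E_{b_i}(b_i) subtracted and that the required expectations from the other variational factors are supplied as plain NumPy arrays (the argument names are ours):

```python
import numpy as np

def update_beta_A(X_A, Y_res, w_sig, w_xi_A):
    """One coordinate-ascent update of q(beta_A) = N(mu_A, Sigma_A).

    X_A     : (n, m, r) covariates of the currently active set A
    Y_res   : (n, m)    responses with the random-effects part z_ij' E[b_i] subtracted
    w_sig   : (m,)      expectations from q(sigma_j^2) entering the precision
    w_xi_A  : (r,)      expectations from q(xi_{1k}^2) for k in A
    """
    # Sigma_A^{-1} = sum_{i,j} x_ijA x_ijA' * w_sig[j] + diag(w_xi_A)
    prec = np.einsum("imk,iml,m->kl", X_A, X_A, w_sig) + np.diag(w_xi_A)
    Sigma_A = np.linalg.inv(prec)
    # mu_A = Sigma_A * sum_{i,j} x_ijA * Y_res[i,j] * w_sig[j]
    mu_A = Sigma_A @ np.einsum("imk,im,m->k", X_A, Y_res, w_sig)
    return mu_A, Sigma_A
```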
The optimal density q b i ( b i ) is the multivariate normal distribution
q_{b_i}(b_i) \sim N_q(\mu_b, \Sigma_b),
where Σ b 1 = E Q ( Q ) + j = 1 m z i j z i j E σ j 2 ( σ j 2 ) and μ b = Σ b [ j = 1 m z i j { y i j x i j A E A ( β A ) } E σ j 2 ( σ j 2 ) ] . Then, the estimated posterior mean and variance matrix of  b i for variational densities q b i ( b i ) are E b i ( b i ) = μ b and var b i ( b i ) = Σ b , respectively. Moreover, the mode estimator b i q of  b i for variational density q b i ( b i ) is b i q = μ b . The optimal densities q ξ 0 k ( ξ 0 k 2 ) and q ξ 1 k ( ξ 1 k 2 ) are given by
q_{\xi_{0k}}(\xi_{0k}^2) \sim \mathrm{IvG}(a_{0\xi k}, b_{0\xi k}) \ \text{for } k \in I, \qquad q_{\xi_{1k}}(\xi_{1k}^2) \sim \mathrm{IvG}(a_{1\xi k}, b_{1\xi k}) \ \text{for } k \in A,
respectively, where a 0 ξ k = E λ 0 ( λ 0 2 ) / var β k ( β k ) , a 1 ξ k = E λ 1 ( λ 1 2 ) / [ { E β k ( β k ) } 2 + var β k ( β k ) ] , b 0 ξ k = E λ 0 ( λ 0 2 ) , b 1 ξ k = E λ 1 ( λ 1 2 ) and E λ 0 ( · ) and E λ 1 ( · ) are the expectations taken with respect to q λ 0 ( λ 0 2 ) and q λ 1 ( λ 1 2 ) , respectively. In this case, we have E ξ 0 k ( ξ 0 k 2 ) = a 0 ξ k , E ξ 1 k ( ξ 1 k 2 ) = a 1 ξ k , var ξ 0 k ( ξ 1 k 2 ) = ( a 0 ξ k ) 3 / b 0 ξ k and var ξ 1 k ( ξ 1 k 2 ) = ( a 1 ξ k ) 3 / b 1 ξ k . Moreover, the mode estimators ξ 0 k 2 q and ξ 1 k 2 q of  ξ 0 k 2 and ξ 1 k 2 for variational densities q ξ 0 k ( ξ 0 k 2 ) and q ξ 1 k ( ξ 1 k 2 ) are ξ 0 k 2 q = a 0 ξ k 1 + ( 1.5 a 0 ξ k / b 0 ξ k ) 2 1.5 ( a 0 ξ k ) 2 / b 0 ξ k for  k I and ξ 1 k 2 q = a 1 ξ k 1 + ( 1.5 a 1 ξ k / b 1 ξ k ) 2 1.5 ( a 1 ξ k ) 2 / b 1 ξ k for  k A , respectively.
To derive the optimal density q γ k ( γ k ) , we denote
\log(\varrho_k) = E_\rho(\log\rho) - E_\rho\{\log(1-\rho)\} + \frac{1}{2}\Big[E_{\xi_{1k}}\{\log(\xi_{1k}^2)\} - E_{\xi_{0k}}\{\log(\xi_{0k}^2)\}\Big] + E_{\beta_k}(\beta_k)\sum_{i=1}^{n}\sum_{j=1}^{m}\big\{y_{ij} - x_{ij,C_k}^{\top}E_{\beta}(\beta_{C_k}) - z_{ij}^{\top}E_{b_i}(b_i)\big\}\, x_{ijk}\, E_{\sigma_j^2}(\sigma_j^{2}) - \frac{1}{2}\big[\mathrm{var}_{\beta_k}(\beta_k) + \{E_{\beta_k}(\beta_k)\}^2\big]\Big[\sum_{i=1}^{n}\sum_{j=1}^{m}x_{ijk}^{2}\, E_{\sigma_j^2}(\sigma_j^{2}) - E_{\xi_{0k}}(\xi_{0k}^{2}) + E_{\xi_{1k}}(\xi_{1k}^{2})\Big],
where C_k = \{\ell : \gamma_\ell = 1, \ell \neq k, \ell \in A\} = A \setminus \{k\}, which is the index set with the kth index deleted from the set A. Thus, the optimal variational density of the latent variable \gamma_k is the Bernoulli distribution with probability \varsigma_k = \varrho_k/(\varrho_k + 1), i.e., q_{\gamma_k}(\gamma_k) \sim \mathrm{Bernoulli}(\varsigma_k) for k = 1,\ldots,p. In this case, the estimated posterior mean and variance of \gamma_k for the variational density q_{\gamma_k}(\gamma_k) are E_{\gamma_k}(\gamma_k) = \varsigma_k and \mathrm{var}_{\gamma_k}(\gamma_k) = \varsigma_k(1 - \varsigma_k), respectively. Thus, the mode estimator \gamma_k^q of \gamma_k for the variational density q_{\gamma_k}(\gamma_k) is \gamma_k^q = \varsigma_k for k = 1,\ldots,p.
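In practice, \varrho_k is computed on the log scale via the expression above, and the inclusion probability \varsigma_k = \varrho_k/(\varrho_k + 1) is then a logistic transform of \log(\varrho_k). A minimal sketch (ours, not from the paper; it assumes \log\varrho_k has already been evaluated):

```python
import numpy as np
from scipy.special import expit   # numerically stable logistic function

def inclusion_prob(log_varrho):
    """Map log(varrho_k) to the inclusion probability
    varsigma_k = varrho_k / (varrho_k + 1) = 1 / (1 + exp(-log varrho_k))."""
    return expit(np.asarray(log_varrho))

print(inclusion_prob([-20.0, 0.0, 20.0]))   # approximately [0, 0.5, 1]
```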
The optimal density q Q ( Q ) has the form
q_Q(Q) \sim \mathrm{IW}_q(S_0^{*}, \nu_0^{*}),
where S_0^{*} = S_0 + n\mu_b\mu_b^{\top} + n\Sigma_b with \mu_b and \Sigma_b defined in Equation (17), and \nu_0^{*} = \nu_0 + n. Then, we have E_Q(Q) = S_0^{*}/(\nu_0^{*} - q - 1). Moreover, the mode estimator Q^q of Q is given by Q^q = S_0^{*}/(\nu_0^{*} + q + 1).
The optimal density q σ j 2 ( σ j 2 ) ( j = 1 , , m ) has the form
q_{\sigma_j^2}(\sigma_j^2) \sim \Gamma\!\left(\tfrac{n}{2},\, b_\sigma\right),
where b_\sigma = 0.5\sum_{i=1}^{n} h_{ij}, h_{ij} = (y_{ij} - \mu_{ij})^2 + x_{ijA}^{\top}\Sigma_A x_{ijA} + x_{ijI}^{\top}\Sigma_I x_{ijI} + z_{ij}^{\top}\Sigma_b z_{ij} and \mu_{ij} = x_{ijA}^{\top}\mu_A + z_{ij}^{\top}\mu_b. Thus, we have E_{\sigma_j}(\sigma_j^2) = n/\sum_{i=1}^{n} h_{ij} and \mathrm{var}_{\sigma_j}(\sigma_j^2) = 2n/(\sum_{i=1}^{n} h_{ij})^2. In this case, the mode estimator \sigma_j^{2q} of \sigma_j^2 for the variational density q_{\sigma_j^2}(\sigma_j^2) is \sigma_j^{2q} = (n-2)/\sum_{i=1}^{n} h_{ij} for j = 1,\ldots,m.
The optimal density q ρ ( ρ ) can be expressed as
q_\rho(\rho) \sim \mathrm{Beta}(c_\rho, d_\rho),
where c_\rho = a_\gamma + \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k) and d_\rho = b_\gamma + p - \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k). Thus, we have E_\rho(\rho) = c_\rho/(c_\rho + d_\rho) and \mathrm{var}_\rho(\rho) = c_\rho d_\rho/\{(c_\rho + d_\rho)^2(c_\rho + d_\rho + 1)\}. In this case, the mode estimator of \rho is given as \rho^q = c_\rho/(c_\rho + d_\rho).
The optimal densities q λ 0 ( λ 0 2 ) and q λ 1 ( λ 1 2 ) are
q_{\lambda_0}(\lambda_0^2) \sim \Gamma(a_{0\lambda}, b_{0\lambda}), \qquad q_{\lambda_1}(\lambda_1^2) \sim \Gamma(a_{1\lambda}, b_{1\lambda}),
respectively, where a_{0\lambda} = c_0 + p - \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k), b_{0\lambda} = d_0 + \sum_{k=1}^{p}\{1 - E_{\gamma_k}(\gamma_k)\}E_{\xi_{0k}}(\xi_{0k}^2)/2, a_{1\lambda} = c_1 + \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k) and b_{1\lambda} = d_1 + \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k)E_{\xi_{1k}}(\xi_{1k}^2)/2. In this case, we obtain E_{\lambda_0}(\lambda_0^2) = a_{0\lambda}/b_{0\lambda}, \mathrm{var}_{\lambda_0}(\lambda_0^2) = a_{0\lambda}/(b_{0\lambda})^2, E_{\lambda_1}(\lambda_1^2) = a_{1\lambda}/b_{1\lambda} and \mathrm{var}_{\lambda_1}(\lambda_1^2) = a_{1\lambda}/(b_{1\lambda})^2. The mode estimators \lambda_0^{2q} and \lambda_1^{2q} of \lambda_0^2 and \lambda_1^2 for the variational densities q_{\lambda_0}(\lambda_0^2) and q_{\lambda_1}(\lambda_1^2) are \lambda_0^{2q} = (a_{0\lambda} - 1)/b_{0\lambda} and \lambda_1^{2q} = (a_{1\lambda} - 1)/b_{1\lambda}, respectively.

4.2. Optimizing L { q ( Ξ ) } via Coordinate Ascent Algorithm

The elaborated steps for optimizing L { q ( Ξ ) } via the coordinate ascent algorithm are given below:
  • Step (a) Given the initial values of variational densities q β ( β ) , q b i ( b i ) , q ξ 0 k ( ξ 0 k 2 ) , q ξ 1 k ( ξ 1 k 2 ) , q γ k ( γ k ) , q Q ( Q ) , q σ j 2 ( σ j 2 ) , q ρ ( ρ ) , q λ 0 ( λ 0 2 ) and q λ 1 ( λ 1 2 ) , compute the lower bound L { q ( Ξ ) } (denoted as L ( 0 ) { q ( Ξ ) } ) and set κ = 1 .
  • Step (b) Compute variational density q β ( β ) and update E β ( β ) .
  • Step (c) Compute variational density q b i ( b i ) and update E b i ( b i ) .
  • Step (d) Compute variational density q ξ 0 k ( ξ 0 k 2 ) and update E ξ 0 k ( ξ 0 k 2 ) .
  • Step (e) Compute variational density q ξ 1 k ( ξ 1 k 2 ) and update E ξ 1 k ( ξ 1 k 2 ) .
  • Step (f) For  k = 1 , , p , compute variational densities q γ k ( γ k ) and update E γ k ( γ k ) .
  • Step (g) Compute variational density q Q ( Q ) and update E Q ( Q ) .
  • Step (h) Compute variational densities q σ j 2 ( σ j 2 ) and update E σ j ( σ j 2 ) .
  • Step (i) Compute variational density q ρ ( ρ ) and update E ρ ( ρ ) .
  • Step (j) Compute variational density q λ 0 ( λ 0 2 ) and update E λ 0 ( λ 0 2 ) .
  • Step (k) Compute variational density q λ 1 ( λ 1 2 ) and update E λ 1 ( λ 1 2 ) .
  • Step (l) Based on variational densities from Steps (b)–(k), compute the ELB L { q ( Ξ ) } (denoted as L ( κ ) { q ( Ξ ) } ) and the relative change
    \mathrm{RC} = \frac{\big| L^{(\kappa)}\{q(\Xi)\} - L^{(\kappa-1)}\{q(\Xi)\} \big|}{L^{(\kappa-1)}\{q(\Xi)\}}.
  • Step (m) Given sufficiently small ϵ , if RC < ϵ , the algorithm is stopped. Otherwise, repeat Steps (b)–(l).
The coordinate ascent algorithm presented above for computing variational Bayesian estimates of parameters is summarized as Algorithm 1; it converges to the solution of the optimization problem (13) because it satisfies the well-known KKT condition for the considered model.
Algorithm 1: Variational Bayesian estimation
[Algorithm 1 appears as a figure in the published article; it summarizes Steps (a)–(m) of the coordinate ascent procedure described above.]
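A schematic of the loop summarized in Algorithm 1, written as a minimal Python sketch (ours, not from the paper); the per-factor update functions and the ELB evaluation of Appendix B are assumed to be supplied by the user, so only the control flow of Steps (b)–(m) and the relative-change stopping rule are shown:

```python
def cavi(state, updates, elbo, tol=1e-6, max_iter=500):
    """Generic coordinate-ascent loop for maximizing the ELB L{q(Xi)}.

    state   : container holding all current variational parameters/expectations
    updates : list of functions, one per variational factor, applied in the order
              of Steps (b)-(k); each takes the state and returns the updated state
    elbo    : function mapping the state to the current value of L{q(Xi)}
    """
    L_old = elbo(state)                       # Step (a): initial lower bound
    for _ in range(max_iter):
        for update in updates:                # Steps (b)-(k): update each factor in turn
            state = update(state)
        L_new = elbo(state)                   # Step (l): recompute the lower bound
        rc = abs(L_new - L_old) / abs(L_old)  # relative change RC
        if rc < tol:                          # Step (m): stop once RC < epsilon
            break
        L_old = L_new
    return state
```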

4.3. Model Comparison

The Bayes factor is a vital statistic for model comparison within the Bayesian framework and is widely employed to choose a better model among the considered competing models due to its merits for model selection: (i) it is a consistent selector; (ii) it acts as an Occam's razor, preferring the simpler model when fits are similar; (iii) it does not require the models to be nested. For instance, see [36] for structural equation models and [37] for non-ignorable missing data. Denote f(Y \mid X, Z, \Xi_h, H_h) as the probability density of the data \{Y, X, Z\} under the model H_h, where \Xi_h is the parameter vector in the model H_h. Define f(\Xi_h \mid H_h) as the prior of \Xi_h for h = 0, 1. The Bayes factor for comparing two competing models H_0 and H_1 can be written as
B_{10} = \frac{\int f(Y \mid X, Z, \Xi_1, H_1)\, f(\Xi_1 \mid H_1)\, d\Xi_1}{\int f(Y \mid X, Z, \Xi_0, H_0)\, f(\Xi_0 \mid H_0)\, d\Xi_0} = \frac{f(Y \mid X, Z, H_1)}{f(Y \mid X, Z, H_0)},
where f(Y \mid X, Z, H_h) is the marginal likelihood for the model H_h, h = 0, 1. However, computing the Bayes factor B_{10} is a non-trivial task for our considered high-dimensional linear mixed model because of the intractable integral involved. Considerable methods have been developed to compute the marginal likelihood f(Y \mid X, Z, H_h) or the Bayes factor, for example, Laplace's method [38], annealed importance sampling [39], bridge sampling [40], path sampling (also called thermodynamic integration) [41], nested sampling [42], power posteriors [43] and a hybrid method combining simulation and asymptotic approximations [44]. For a comprehensive review, refer to [45]. Here, a path sampling or thermodynamic integration method is adopted to compute B_{10} via a link model H_{\zeta,01} = (1 - \zeta)H_0 + \zeta H_1, where \zeta is a continuous parameter taking values in the interval [0, 1]. Thus, we have H_{\zeta,01} = H_0 when \zeta = 0 and H_{\zeta,01} = H_1 when \zeta = 1. Similarly to [41], we define the following class of probability densities:
Q(\zeta) = f(Y \mid X, Z, \zeta) = \int f(Y, \zeta \mid X, Z, \Xi)\, f(\Xi)\, d\Xi,
where f ( Y , ζ | X , Z , Ξ ) is the density of Y given X and Z under H ζ and f ( Ξ ) is the prior of Ξ . Under the above definition, it is easily known that Q ( 0 ) = f ( Y | X , Z , H 0 ) and Q ( 1 ) = f ( Y | X , Z , H 1 ) . Following the argument of [41], we obtain
\log B_{10} = \log\frac{Q(1)}{Q(0)} = \int_{0}^{1} E\{U(Y, \zeta, \Xi \mid X, Z)\}\, d\zeta,
where E ( · ) represents the expectation taken with respect to the conditional density f ( Ξ , ζ | Y , X , Z ) and U ( Y , ζ , Ξ | X , Z ) = d log f ( Y , ζ , Ξ | X , Z ) / d ζ . Thus, applying the thermodynamic integration [41] or powered posteriors method [43] to Equation (26), log B 10 can be estimated by
\widehat{\log B_{10}} = \frac{1}{2}\sum_{\ell=0}^{L}\big(\zeta^{(\ell+1)} - \zeta^{(\ell)}\big)\big(\bar{U}^{(\ell+1)} + \bar{U}^{(\ell)}\big),
where 0 = \zeta^{(0)} < \zeta^{(1)} < \cdots < \zeta^{(L+1)} = 1 and \bar{U}^{(\ell)} = J^{-1}\sum_{\tau=1}^{J} U(Y, \zeta^{(\ell)}, \Xi^{(\tau)} \mid X, Z), in which \{\Xi^{(\tau)} : \tau = 1,\ldots,J\} are observations sampled from the variational density q(\Xi \mid \zeta^{(\ell)}) for \ell = 1,\ldots,L. Following [46], H_1 is selected when \widehat{\log B_{10}} > 1; otherwise, H_0 is selected.
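A minimal sketch (ours, not from the paper) of the trapezoidal estimate in Equation (27), assuming the Monte Carlo averages \bar{U}^{(\ell)} have already been computed on a grid of \zeta values:

```python
import numpy as np

def log_bayes_factor(zeta, U_bar):
    """Trapezoidal path-sampling estimate of log B_10.

    zeta  : increasing grid 0 = zeta^(0) < ... < zeta^(L+1) = 1
    U_bar : Monte Carlo averages of U(Y, zeta, Xi | X, Z) at each grid point
    """
    zeta, U_bar = np.asarray(zeta), np.asarray(U_bar)
    return 0.5 * np.sum((zeta[1:] - zeta[:-1]) * (U_bar[1:] + U_bar[:-1]))

grid = np.linspace(0.0, 1.0, 12)                      # 12 grid points, i.e. L = 10
print(log_bayes_factor(grid, np.ones_like(grid)))     # integrating a constant 1 gives 1.0
```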

5. Simulation Studies

Several simulation studies are implemented to assess the performance of the introduced variational Bayesian methodologies. For comparison, we also take the Bayesian lasso method into consideration. In this simulation study, the response variables y_{ij}'s are independently sampled from the normal distribution y_{ij} \sim N(x_{ij}^{\top}\beta + z_{ij}^{\top}b_i, \sigma_j^2), where x_{ij}, z_{ij} and b_i are independently drawn from the multivariate normal distributions N_p(0, \Sigma_x), N_q(0, I) and N_q(0, Q), respectively, for i = 1,\ldots,n, j = 1,\ldots,m. The true value of \beta is taken to be (0.5, 0.8, 2, 0.8, 0.5, 0.0, \ldots, 0.0), which implies that there are five active variables and p - 5 inactive variables. As an illustration, we set m = 6, q = 4, n = 100, 200 and 300, and p = 500, 1000 and 2000, so that n ≪ p. The true values of the \sigma_j^2's are set to be \sigma_1^2 = \sigma_2^2 = 0.8, \sigma_3^2 = \sigma_4^2 = 0.9 and \sigma_5^2 = \sigma_6^2 = 1.0. The true value of Q is taken with diagonal elements being 1.0 and the remaining components being 0.1.
We consider the following two types of covariance structures for Σ x = ( σ x j k ) p × p .
  • Type I. Components of the covariate vector x_{ij} are independent of each other, i.e., \sigma_{xjk} = 0.0 when j \neq k and \sigma_{xjj} = 1.0, for 1 \leq j, k \leq p.
  • Type II. Components of x_{ij} have an autoregressive correlation structure, i.e., \sigma_{xjk} = 0.5^{|j-k|} when j \neq k and \sigma_{xjj} = 1.0, for 1 \leq j, k \leq p (a construction sketch is given after this list).
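The two covariance structures can be generated directly; the following sketch (illustrative, not from the paper) constructs \Sigma_x for either type and draws x_{ij} \sim N_p(0, \Sigma_x):

```python
import numpy as np

def sigma_x(p, structure="I"):
    """Covariance of x_ij: identity (Type I) or AR(1) with parameter 0.5 (Type II)."""
    if structure == "I":
        return np.eye(p)
    idx = np.arange(p)
    return 0.5 ** np.abs(idx[:, None] - idx[None, :])   # sigma_{x,jk} = 0.5^{|j-k|}

rng = np.random.default_rng(2)
n, m, p = 100, 6, 500
X = rng.multivariate_normal(np.zeros(p), sigma_x(p, "II"), size=(n, m))   # x_ij ~ N_p(0, Sigma_x)
print(X.shape)   # (100, 6, 500)
```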
In implementing the variational Bayesian approach presented above together with the spike and slab priors, we take the hyperparameters \nu_0 = 1 and S_0 = 0.02 I_{q \times q}, leading to a flat prior for Q, and set a_\gamma = b_\gamma = 0.5. For the spike and slab priors of the \beta_k's, to achieve appropriate shrinkage and model selection consistency, we take c_0 = 500 and c_1 = 0.3, indicating c_1 ≪ c_0, and d_0 = 5 and d_1 = 30, implying d_0 ≪ d_1, which guarantees the sparsity of the model. In this simulation, 100 replications are conducted to select active variables and estimate model parameters. To assess the accuracy of parameter estimation via the proposed variational Bayesian method, we calculate the average value of the RMSes for the unknown parameters, where "RMS" indicates the root mean square between the Bayesian estimates based on 100 replications and the true values of the unknown parameters. To assess the performance of the variable selection procedure, we compute TP and FP, where TP represents the average number of active covariates correctly identified as active and FP denotes the average number of inactive covariates incorrectly detected as active. Generally, the closer TP is to the true number of active covariates, or the smaller FP is, the better the variable selection method behaves. Results are reported in Table 1. Examination of Table 1 shows that the proposed variational Bayesian method behaves better than the Bayesian lasso method, regardless of the values of p and n and the covariance structures, in that TP values for the former are closer to the true number of active covariates and FP values for the former are closer to zero than those for the latter. For parameter estimation, the proposed variational Bayesian method behaves better than the Bayesian lasso method in that the average values of the RMSes for the former are smaller than those for the latter, regardless of the values of p and n and the covariance structures. To investigate the sensitivity to the selection of the hyperparameters a_\gamma and b_\gamma, we take a_\gamma = 0.1 and b_\gamma = 0.9 and calculate the corresponding results for the Type I structure of \Sigma_x; the results are also given in Table 1. These empirical results indicate that the proposed variational Bayesian method is not sensitive to the hyperparameters in that the same pattern is observed regardless of the values of a_\gamma and b_\gamma.
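For concreteness, TP and FP for a single replication can be computed as below (a sketch under our own naming; averaging over the 100 replications gives the values reported in Table 1):

```python
import numpy as np

def tp_fp(gamma_true, gamma_hat):
    """TP: active covariates correctly identified as active;
       FP: inactive covariates incorrectly detected as active."""
    t = np.asarray(gamma_true, dtype=bool)
    e = np.asarray(gamma_hat, dtype=bool)
    return int(np.sum(t & e)), int(np.sum(~t & e))

truth = np.array([1, 1, 1, 1, 1, 0, 0, 0], dtype=bool)   # five truly active covariates
est   = np.array([1, 1, 1, 1, 0, 1, 0, 0], dtype=bool)   # one missed, one false positive
print(tp_fp(truth, est))   # (4, 1)
```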
As an illustration for model comparison via the proposed Bayes factor, we consider the second simulation study. In the simulation study, the data { ( x i j , z i j , y i j ) : i = 1 , , n , j = 1 , , m } are generated as those in the first simulation study with covariance structure of Σ x taken to be Type I. To this end, we consider the following competing models:
H_0: \; y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma_j^2),
H_1: \; y_{ij} = z_{ij}^{\top} b_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma_j^2),
H_2: \; y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma_0^2),
where H_0 represents the true linear mixed model, while H_1 and H_2 are two competing linear mixed models: H_1 contains only random effects without fixed effects, and H_2 misspecifies the distribution of the measurement error. We define a path t \in [0, 1] to link any two of the three models presented above. For example, H_0 and H_1 can be linked by H_{t,01}: y_{ij} = (1 - t) x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, which indicates that H_{t,01} is just H_0 for t = 0 and becomes H_1 for t = 1, and H_0 and H_2 are linked by H_{t,02}: y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij} with \varepsilon_{ij} independently distributed as N(0, t^2\sigma_0^2 + (1-t)^2\sigma_j^2), which implies that H_{t,02} reduces to H_0 with t = 0 and becomes H_2 with t = 1.
To calculate the estimated log Bayes factors (i.e., \widehat{\log B_{10}} and \widehat{\log B_{20}}) via the path sampling procedure proposed above, we take \zeta^{(\ell)} = \ell/L for \ell = 0, 1, \ldots, L, with L = 10, J = 1000 and \sigma_0^2 = 0.5, and the same priors as those given in the first simulation study. Results are given in Table 2, which indicates that H_0 is strongly selected, as expected, regardless of n and p.

6. An Empirical Example

As an illustration of the variational Bayesian method developed above, we consider the ADNI-2 data [47]; the ADNI study was launched in 2003 and comprises the ADNI-1, ADNI-GO and ADNI-2 phases. This study aims to predict the mini-mental state examination (MMSE) score, which is an important index for detecting Alzheimer's disease (AD) stages in that different MMSE scores indicate different progression of an AD patient. AD is the most common type of dementia in elderly people and the sixth leading cause of death in the United States, and it results in the loss of memory and the impairment of cognitive and language skills. More importantly, there is no effective treatment to slow the progression of the disease [48]. The number of AD patients has grown rapidly with the aging of the population, bringing a socioeconomic burden to both families and society [49]. Details on the ADNI database can be found at http://adni.loni.usc.edu (accessed on 20 May 2021).
The ADNI-2 data were analyzed by [48] using the factor analysis model to impute missing values. As an illustration, we utilize 340 complete magnetic resonance imaging (MRI) features with 62 samples and 3 medical visits (6-month, 12-month and 24-month), take five features among 340 features as covariates associated with random effects and set the MMSE score as the response variable. That is, n = 62 , p = 340 , q = 5 and m = 3 . In this case, covariates are high-dimensional compared with the sample size. Here, we assume that only a small fraction of covariates contribute to the response variable.
The variational Bayesian method introduced above, together with the linear mixed model and the same priors as those in the first simulation study, is utilized to fit the above-mentioned MRI data. Here, the hyperparameters are taken as \nu_0 = 1, S_0 = 0.02 I_{q \times q}, a_\gamma = b_\gamma = 0.5, c_0 = 10, d_0 = 1, c_1 = 1 and d_1 = 10 to ensure the sparsity of the model. The proposed variational Bayesian method selects three features as active variables: thickness average of the right fusiform (denoted as "x_1"), thickness standard deviation of the right posterior cingulate (denoted as "x_2") and thickness standard deviation of the left postcentral (denoted as "x_3"). Their corresponding parameter estimates are 1.9, 0.25 and 0.4, respectively, which show that the three active variables have positive effects on MMSE, consistent with the findings in [48]. Bayesian estimates of the random effects b_i are −0.003, −0.0021, −0.0013, −0.0058 and −0.0054, respectively, which imply that the selected five covariates associated with random effects have negative effects on MMSE. Table 3 also presents the RMSE and MAP values for the model with all 340 covariates (denoted as the "Complete" model) and the model with the selected three active covariates (denoted as the "Selected" model), where RMSE and MAP are evaluated by RMSE = \{n^{-1}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2\}^{1/2} and MAP = n^{-1}\sum_{i=1}^{n}|\hat{y}_i - y_i| and \hat{y}_i is the fitted value of the response y_i. Examination of Table 3 shows that the selected model has smaller RMSE and MAP values than the complete model, i.e., the selected model fits the ADNI-2 data better than the complete model. For the selected model, we also compute the Bayes factors for the three competing models H_0, H_1 and H_2 given in the second simulation study, which are \widehat{\log B_{10}} = −558 and \widehat{\log B_{20}} = −46.93, leading to the conclusion that H_0 is strongly selected.
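The two fit measures reported in Table 3 can be computed as follows (a sketch with our own function name; \hat{y}_i denotes the fitted value of y_i):

```python
import numpy as np

def rmse_map(y_hat, y):
    """Root mean squared error and mean absolute prediction error (as in Table 3)."""
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    return np.sqrt(np.mean((y_hat - y) ** 2)), np.mean(np.abs(y_hat - y))

print(rmse_map([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))   # (approx. 0.6455, 0.5)
```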

7. Discussion

This paper investigates simultaneously estimating model parameters and selecting variables in linear mixed models with high-dimensional fixed effects and low-dimensional random effects in the Bayesian framework. A novel variational Bayesian approach is developed to address the time-consuming nature of the traditional Bayesian lasso method, which stems from the ill-posed problems and large matrix computations involved in the presence of high-dimensional data. The Gaussian spike and slab priors of the population-specific fixed-effects regression coefficients are specified to identify important fixed effects by allowing the tuning parameters to tend to zero. For the sake of sampling observations, the Gaussian spike and slab priors are reformulated as a mixture of a normal distribution and an exponential distribution. In the variational Bayesian framework, the problem of best approximating the posterior density is transformed into an optimization problem, i.e., maximizing the evidence lower bound. For ease of computation, the coordinate ascent algorithm, which can be implemented efficiently, is employed to optimize the evidence lower bound. For model comparison, the Bayes factor is computed by the path sampling method. Simulation studies are conducted to investigate the performance of the proposed variational Bayesian method, and a real example is illustrated by the proposed methodologies. Empirical results show that the proposed variational Bayesian method behaves better than the traditional Bayesian lasso method in terms of the accuracy of parameter estimation, the consistency of variable selection, and computational flexibility and complexity.
The proposed variational Bayesian method has the following advantages:
  • Overcoming the problem of selecting a high-dimensional vector of shrinkage parameters required for the Bayesian lasso method;
  • Simultaneously estimating model parameters and variance–covariance matrices and selecting fixed-effects and random-effects components with a relatively low computational cost;
  • Avoiding large matrix computations and the curse of dimensionality problem;
  • Providing a flexible and efficient approach to compute the Bayes factor for model comparison.
The proposed variational Bayesian method can be extended to more complicated models, such as generalized linear mixed models with mixed discrete and missing data. However, their extensions have huge challenges, including the closed-form derivation of the optimal variational density, the specification of the priors, the learning of the data-driven hyperparameters and the computational complexity. In addition, this paper does not consider the selection of high-dimensional random effects, which is a rather challenging topic. In addition, to speed up the convergence of the chain, we might consider some important and relevant Gibbs sampling schemes, for example, the herded Gibbs sampling, which is a deterministic variant of the Gibbs sampling scheme and generates observations by matching the full-conditionals rather than by taking the full-conditionals at random [50], the recycling Gibbs sampler, which generates auxiliary observations whose information is eventually discarded and which can be recycled within the Gibbs algorithm for improving efficiency with no extra cost [51], and the blocking and parameterization method [52].
In addition, we did not consider the BIC criterion for model comparison in that BIC is only an approximation to the log marginal likelihood of the data under each hypothesis, on which the Bayes factor is based. Moreover, due to the random effects involved in the considered models, BIC behaves unstably.

Author Contributions

Conceptualization, N.T.; methodology, N.T.; software, J.Y.; validation, N.T. and J.Y.; formal analysis, N.T. and J.Y.; investigation, J.Y.; resources, N.T. and J.Y.; data curation, J.Y.; writing—original draft preparation, J.Y.; writing—review and editing, N.T.; visualization, J.Y.; supervision, N.T.; project administration, N.T.; funding acquisition, N.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Projects of the National Natural Science Foundation of China (grant number 11731011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The ADNI database is available on the website http://adni.loni.usc.edu (accessed on 20 May 2021).

Acknowledgments

The authors are grateful for the associate editor and the three referees for their constructive comments, which largely improved an earlier manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MCMC    Markov chain Monte Carlo algorithm
EM      Expectation Maximization algorithm
ELB     evidence lower bound
TP      average number of active covariates correctly identified as active
FP      average number of inactive covariates incorrectly detected as active
RMS     root mean square between the Bayesian estimates based on 100 replications and the true value of the unknown parameter
VB      variational Bayesian (proposed) method
LASSO   Bayesian lasso method
AD      Alzheimer's Disease
ADNI    Alzheimer's Disease Neuroimaging Initiative
MRI     magnetic resonance imaging
MMSE    mini-mental state examination

Appendix A. Conditional Distributions Required in Implementing the Gibbs Sampler

By the definitions and priors of β A and β I , it is easily shown from Equation (9) that the conditional distributions f A ( β A | D , b , σ ) and f I ( β I | D ) have the forms
\beta_A \mid D, b, \sigma \sim N_r(\mu_{A0}, \Sigma_{A0}), \qquad \beta_I \mid D \sim N_{p-r}(0, \Sigma_{I0}),
respectively, where Σ A 0 1 = i = 1 n j = 1 m x i j A x i j A / σ j 2 + diag ( ξ A 0 ) with ξ A 0 = { ξ 1 k 2 , k A } , μ A 0 = Σ A 0 { i = 1 n j = 1 m x i j A ( y i j z i j b i ) / σ j 2 } and Σ I 0 1 = diag ( i = 1 n j = 1 m x i j I x i j I ) + diag ( ξ C I 0 ) = n m I p r + diag ( ξ C I 0 ) with ξ C I 0 = { ξ 0 k 2 , k I } .
The conditional distribution f ( b i | D , β , σ , Q ) has the form
b_i \mid D, \beta, \sigma, Q \sim N_q(\mu_{bC}, \Sigma_{bC}),
where Σ b C 1 = Q + j = 1 m z i j z i j / σ j 2 and μ b C = Σ b C { j = 1 m z i j ( y i j x i j β ) / σ j 2 } .
The conditional distributions f ( ξ 0 k 2 | β k , γ k ) and f ( ξ 1 k 2 | β k , γ k ) are given by
f ( ξ 0 k 2 | β k , γ k ) ( ξ 0 k 2 ) ( 1 γ k ) / 2 exp ( 1 γ k ) β k 2 / ( 2 ξ 0 k 2 ) λ 0 2 ( 1 γ k ) ξ 0 k 2 / 2 , f ( ξ 1 k 2 | β k , γ k ) ( ξ 1 k 2 ) γ k / 2 exp γ k β k 2 / ( 2 ξ 1 k 2 ) λ 1 2 γ k ξ 1 k 2 / 2 ,
respectively, which lead to
\xi_{0k}^{2} \mid \beta_k = 0, \gamma_k = 0 \sim \Gamma(1/2,\, \lambda_0^2/2), \qquad \xi_{1k}^{2} \mid \beta_k, \gamma_k = 1 \sim \mathrm{IvG}(\lambda_1^2/\beta_k^2,\, \lambda_1^2),
where IvG ( a , b ) represents the inverse Gaussian distribution with parameters a and b.
The ratio of Pr ( γ k = 1 | D , b , σ ) to Pr ( γ k = 0 | D , b , σ ) is proportional to
ρ ψ ( β k , 0 , ξ 1 k 2 ) ( 1 ρ ) ψ ( β k , 0 , ξ 0 k 2 ) exp β k i = 1 n j = 1 m ( y i j x i , C k β C k z i j b i ) x i j k σ j 2 + β k 2 2 i = 1 n j = 1 m x i j k 2 ( 1 σ j 2 ) ,
which is denoted as ϱ k , where C k = { : γ = 1 , k A } . Thus, latent variable γ k is sampled from the Bernoulli distribution with the probability ς k = ϱ k / ( ϱ k + 1 ) , i.e., γ k | D , b , σ Bernoulli ( ς k ) for k = 1 , , p .
The conditional distribution f ( Q | b ) is shown as
Q \mid b \sim \mathrm{IW}_q\Big(S_0 + \sum_{i=1}^{n} b_i b_i^{\top},\; \nu_0 + n\Big).
The conditional distribution f ( σ j 2 | D , b ) ( j = 1 , , m ) has the form
f ( σ j 2 | D , b ) ( σ j 2 ) n / 2 + c 2 1 exp 1 2 σ j 2 i = 1 n ( y i j μ i j ) 2 d 2 σ j 2 ,
which indicates
\sigma_j^{2} \mid D, b \sim \Gamma\Big(\tfrac{n}{2} + c_2,\; d_2 + \tfrac{1}{2}\sum_{i=1}^{n}(y_{ij} - \mu_{ij})^2\Big).
The conditional distribution f ( ρ | γ ) is given as
\rho \mid \gamma \sim \mathrm{Beta}\Big(a_\gamma + \sum_{k=1}^{p}\gamma_k,\; b_\gamma + p - \sum_{k=1}^{p}\gamma_k\Big).
The conditional distributions f ( λ 0 2 | ξ 0 ) and f ( λ 1 2 | ξ 1 ) are shown as
\lambda_0^2 \mid \xi_0 \sim \Gamma\Big(c_0 + p - \sum_{k=1}^{p}\gamma_k,\; d_0 + \tfrac{1}{2}\sum_{k=1}^{p}(1-\gamma_k)\,\xi_{0k}^2\Big), \qquad \lambda_1^2 \mid \xi_1 \sim \Gamma\Big(c_1 + \sum_{k=1}^{p}\gamma_k,\; d_1 + \tfrac{1}{2}\sum_{k=1}^{p}\gamma_k\,\xi_{1k}^2\Big),
respectively.

Appendix B. Calculating the Evidence Lower Bound (ELB)

Denote q ( Ξ ) to be the optimal variational density approximating the posterior density f ( Ξ | D ) and f ( Ξ ) to be the prior density of Ξ = { β , b , ξ 0 , ξ 1 , Q , γ , σ 2 , ϑ } . Define E q ( Ξ ) ( · ) as the expectation taken with respect to q ( Ξ ) . Thus, it follows from Equation (12) that ELOB has the form
L { q ( Ξ ) } = E q ( Ξ ) log f ( Ξ , Y | X , Z ) E q ( Ξ ) log q ( Ξ ) = E q ( Ξ ) log f ( Y | Ξ , X , Z ) + log f ( Ξ ) E q ( Ξ ) log q ( Ξ ) ,
where
log f ( Y | Ξ , X , Z ) n 2 j = 1 m log σ j 2 i = 1 n j = 1 m ( y i j x i j β z i j b i ) 2 2 σ j 2 ,
log f ( Ξ ) 1 2 k = 1 r r log ξ 1 k 2 β k 2 ξ 1 k 2 + 1 2 k = 1 p r ( p r ) log ξ 0 k 2 β k 2 ξ 0 k 2 1 2 trace S 0 + i = 1 n b i b i Q 1 + λ 1 2 + λ 0 2 2 + ( c 1 1 ) log λ 1 2 d 1 λ 1 2 + ( c 0 1 ) log λ 0 2 d 0 λ 0 2 n + ν 0 + q + 1 2 log | Q | + ( a γ 1 ) log ρ + ( b γ 1 ) log ( 1 ρ ) j = 1 m d 2 σ j 2 + ( c 2 1 ) j = 1 m log ( σ j 2 ) + k = 1 p γ k log ρ + ( 1 γ k ) log ( 1 ρ ) .
It follows from the definition of q ( Ξ ) that
E q ( Ξ ) { log q ( Ξ ) } = E β { log q ( β ) } + E b { log q ( b ) } + E ξ 1 { log q ( ξ 1 2 ) } + E ξ 0 { log q ( ξ 0 2 ) } + E γ { log q ( γ ) } + E Q { log q ( Q ) } + E σ { log q ( σ 2 ) } + E ρ { log q ( ρ ) } + E λ 0 { log q ( λ 0 2 ) } + E λ 1 { log q ( λ 1 2 ) } ,
where E β { log q ( β ) } r 2 log | Σ A | p r 2 log | Σ I | , E b { log q ( b ) } n 2 log | Σ b | , E ξ 1 { log q ( ξ 1 2 ) } 1 2 k = 1 p [ 3 { log a 1 ξ a 1 ξ k / ( 2 b 1 ξ k ) } + 2 b 1 ξ k / a 1 ξ k + 1 ] , E ξ 0 { log q ( ξ 0 2 ) } 1 2 k = 1 p [ 3 { log a 0 ξ a 0 ξ k / ( 2 b 0 ξ k ) } + 2 b 0 ξ k / a 0 ξ k + 1 ] , E γ { log q ( γ ) } k = 1 p { ς k log ς k + ( 1 ς k ) log ( 1 ς k ) } , E Q { log q ( Q ) } ν 0 2 log S 0 + ν 0 q 1 2 ν 0 S 0 1 2 trace ( ν 0 I q × q ) , E σ { log q ( σ ) } n d 2 j = 1 m ( i = 1 n h i j ) 1 + ( c 2 1 ) j = 1 m ( log n log i = 1 n h i j 1 / n ) , E ρ { log q ( ρ ) } ( c ρ 1 ) { log ( c ρ ) log ( c ρ + d ρ ) } d ρ ( c ρ 1 ) / { 2 c ρ ( c ρ + d ρ + 1 ) } + ( d ρ 1 ) { log ( d ρ ) log ( c ρ + d ρ ) c ρ ( d ρ 1 ) / { 2 d ρ ( c ρ + d ρ + 1 ) } , E λ 0 { log q ( λ 0 2 ) } ( a 0 λ 1 ) { Γ ˙ ( a 0 λ ) / Γ ( a 0 λ ) log b 0 λ } a 0 λ and
E λ 1 { log q ( λ 1 2 ) } ( a 1 λ 1 ) { Γ ˙ ( a 1 λ ) / Γ ( a 1 λ ) log b 1 λ ] a 1 λ .
Note that for a random variable \xi with mean E(\xi) = \mu and variance D(\xi) = \sigma^2, it follows from a Taylor expansion that the mean of the function y = f(\xi) is E(y) \approx f(\mu) + \tfrac{1}{2}\ddot{f}(\mu)D(\xi), where \ddot{f}(\cdot) denotes the second derivative of the function f(\xi).
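As a worked instance of this expansion (ours, for illustration), taking f(\xi) = \log\xi gives

% Second-order Taylor (delta-method) approximation for f(xi) = log(xi)
% with E(xi) = mu and D(xi) = sigma^2:
E\{f(\xi)\} \approx f(\mu) + \tfrac{1}{2}\,\ddot{f}(\mu)\,D(\xi), \qquad \ddot{f}(\xi) = -\xi^{-2} \;\Longrightarrow\; E(\log\xi) \approx \log\mu - \frac{\sigma^2}{2\mu^2}.

Applying this expansion term by term, we then have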
E q ( Ξ ) log f ( Y | Ξ , X , Z ) n 2 j = 1 m 1 n log n i = 1 n h i j i = 1 n j = 1 m n i = 1 n h i j [ y i j 2 2 y i j { x i j E β ( β ) + z i j E b i ( b i ) } + x i j { var β ( β ) + E β ( β ) E β ( β ) } x i j + z i j { var b i ( b i ) + E b i ( b i ) E b i ( b i ) } z i j + 2 x i j E β ( β ) E b i ( b i ) z i j ] .
Note that for a random variable \xi \sim \Gamma(\alpha, \beta), we have E\{\log(\xi)\} = \dot{\Gamma}(\alpha)/\Gamma(\alpha) - \log(\beta), where \dot{\Gamma}(\cdot) denotes the first derivative of the gamma function. Thus, we have
E q ( Ξ ) { log f ( Ξ ) } 1 2 k = 1 r r log a 1 ξ k a 1 ξ k 2 b 1 ξ k { var β k ( β k ) + ( E β k ( β k ) ) 2 } E ξ 1 k ( ξ 1 k 2 ) + 1 2 k = 1 p r ( p r ) log a 0 ξ k a 0 ξ k 2 b 0 ξ k { var β k ( β k ) + ( E β k ( β k ) ) 2 } E ξ 0 k ( ξ 0 k 2 ) 1 2 i = 1 n E b i ( b i ) E Q ( Q ) E b i ( b i ) + E λ 1 ( λ 1 2 ) + E λ 0 ( λ 0 2 ) 2 + ( c 1 1 ) Γ ˙ ( a 1 λ ) Γ ( a 1 λ ) log ( b 1 λ ) d 1 E λ 1 ( λ 1 2 ) + ( c 0 1 ) Γ ˙ ( a 0 λ ) Γ ( a 0 λ ) log b 0 λ d 0 E λ 0 ( λ 0 2 ) + n + ν 0 q 1 2 log | S 0 ν 0 | var Q | Q | 2 | S 0 ν 0 | 2 1 2 trace { S 0 1 E Q ( Q ) } + ( a γ 1 ) log c ρ c ρ + d ρ d ρ 2 c ρ ( c ρ + d ρ + 1 ) + ( b γ 1 ) log d ρ c ρ + d ρ c ρ 2 d ρ ( c ρ + d ρ + 1 ) n d 2 j = 1 m ( i = 1 n h i j ) 1 + ( c 2 1 ) j = 1 m ( log n log i = 1 n h i j 1 / n ) + k = 1 p E γ k ( γ k ) log c ρ c ρ + d ρ d ρ 2 c ρ ( c ρ + d ρ + 1 ) + ( 1 E γ k ( γ k ) ) log d ρ c ρ + d ρ c ρ 2 d ρ ( c ρ + d ρ + 1 ) ,
where | Q | represents the determinant of matrix Q, var Q ( Q i j ) = ν 0 ( σ i j 2 + σ i i σ j j ) and σ i j is the ( i , j ) -th component of S 0 .

Appendix C. Calculating the Estimated Bayes Factor in the Second Simulation

For the model H t 01 : y i j = x i j β + ( 1 t ) z i j b i + ε i j for i = 1 , , n and j = 1 , , m , where t [ 0 , 1 ] , its first-order derivative of log joint density function has the form
U ( Y , t , Ξ | X , Z ) = i = 1 n j = 1 m { ( y i j x i j β ( 1 t ) z i j b i ) z i j b i } / σ j 2 .
In this case, U ( Y , 0 , Ξ | X , Z ) = i = 1 n j = 1 m ( y i j x i j β z i j b i ) z i j b i / σ j 2 and U ( Y , 1 , Ξ | X , Z ) = i = 1 n j = 1 m ( y i j x i j β ) z i j b i / σ j 2 .
For H t 02 : y i j = ( 1 t ) x i j β + z i j b i + ε i j for i = 1 , , n and j = 1 , , m , where t [ 0 , 1 ] , its first-order derivative of log joint density function has the form
U ( Y , t , Ξ | X , Z ) = i = 1 n j = 1 m { y i j ( 1 t ) x i j β z i j b i } x i j β / σ j 2 .
In this case, U ( Y , 0 , Ξ | X , Z ) = i = 1 n j = 1 m ( y i j x i j β z i j b i ) x i j β / σ j 2 and U ( Y , 1 , Ξ | X , Z ) = i = 1 n j = 1 m ( y i j z i j b i ) x i j β / σ j 2 .
For H t 03 : y i j = x i j β + z i j b i + ε i j with ε i j i . i . d N ( 0 , t 2 σ 0 2 + ( 1 t ) 2 σ j 2 ) for i = 1 , , n and j = 1 , , m , where t [ 0 , 1 ] , its first-order derivative of log joint density function has the form
U ( Y , t , Ξ | X , Z ) = i = 1 n j = 1 m { t σ 0 2 ( 1 t ) σ j 2 } { t 2 σ 0 2 + ( 1 t ) 2 σ j 2 } 2 ( y i j μ i j ) 2 { ( 1 t ) σ j 2 t σ 0 2 } { t 2 σ 0 2 + ( 1 t ) 2 σ j 2 } 2 .
In this case, U ( Y , 0 , Ξ | X , Z ) = i = 1 n j = 1 m { σ j 4 + ( y i j μ i j ) 2 } / σ j 2 and U ( Y , 1 , Ξ | X , Z ) = i = 1 n j = 1 m { σ 0 4 + ( y i j μ i j ) 2 } / σ 0 2 .

References

1. Lindstrom, M.J.; Bates, D.M. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated measures data. J. Am. Stat. Assoc. 1988, 83, 1014–1022.
2. Laird, N.; Lange, N.; Stram, D. Maximum likelihood computations with repeated measures: Applications of the EM algorithm. J. Am. Stat. Assoc. 1987, 82, 97–105.
3. Zeger, S.L.; Karim, M.R. Generalized linear models with random effects: A Gibbs sampling approach. J. Am. Stat. Assoc. 1991, 86, 79–86.
4. Gilks, W.R.; Wang, C.C.; Yvonnet, B.; Coursaget, P. Random-effects models for longitudinal data using Gibbs sampling. Biometrics 1993, 49, 441–453.
5. Chen, Z.; Dunson, D.B. Random effects selection in linear mixed models. Biometrics 2003, 59, 762–769.
6. Ahn, M.; Zhang, H.H.; Lu, W. Moment-based method for random effects selection in linear mixed models. Stat. Sin. 2012, 22, 1539–1562.
7. Bondell, H.D.; Krishna, A.; Ghosh, S.K. Joint variable selection of fixed and random effects in linear mixed-effects models. Biometrics 2010, 66, 1069–1077.
8. Ibrahim, J.G.; Zhu, H.; Garcia, R.I.; Guo, R. Fixed and random effects selection in mixed effects models. Biometrics 2011, 67, 495–503.
9. Schelldorfer, J.; Buhlmann, P.; Van De Geer, S. Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand. J. Stat. 2011, 38, 197–214.
10. Fan, Y.; Li, R. Variable selection in linear mixed effects models. Ann. Stat. 2012, 40, 2043–2068.
11. Li, Y.; Wang, S.J.; Song, P.X.K.; Wang, N.; Zhou, L.; Zhu, J. Doubly regularized estimation and selection in linear mixed-effects models for high-dimensional longitudinal data. Stat. Interface 2018, 11, 721–737.
12. Bradic, J.; Claeskens, G.; Gueuning, T. Fixed effects testing in high-dimensional linear mixed models. J. Am. Stat. Assoc. 2020, 115, 1835–1850.
13. Li, S.; Cai, T.T.; Li, H. Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach. J. Am. Stat. Assoc. 2021, 1–12.
14. Berger, J.; Bernardo, J.M. Reference priors in a variance components problem. In Bayesian Analysis in Statistics and Econometrics; Lecture Notes in Statistics; Goel, P., Ed.; Springer: New York, NY, USA, 1992; Volume 75, pp. 177–194.
15. George, E.I.; McCulloch, R.E. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 1993, 88, 881–889.
16. Ishwaran, H.; Rao, J.S. Spike and slab gene selection for multigroup microarray data. J. Am. Stat. Assoc. 2005, 100, 764–780.
17. Polson, N.G.; Scott, J.G. Local shrinkage rules, Lévy processes and regularized regression. J. R. Stat. Soc. 2012, 74, 287–311.
18. Narisetty, N.N.; He, X. Bayesian variable selection with shrinking and diffusing priors. Ann. Stat. 2014, 42, 789–817.
19. Park, T.; Casella, G. The Bayesian Lasso. J. Am. Stat. Assoc. 2008, 103, 681–686.
20. Griffin, J.E.; Brown, P.J. Bayesian adaptive lassos with non-convex penalization. Aust. N. Z. J. Stat. 2011, 53, 423–442.
21. Rockova, V.; George, E.I. EMVS: The EM approach to Bayesian variable selection. J. Am. Stat. Assoc. 2014, 109, 828–846.
22. Latouche, P.; Mattei, P.A.; Bouveyron, C.; Chiquet, J. Combining a relaxed EM algorithm with Occam’s razor for Bayesian variable selection in high-dimensional regression. J. Multivar. Anal. 2016, 146, 177–190.
23. Narisetty, N.N.; Shen, J.; He, X. Skinny Gibbs: A consistent and scalable Gibbs sampler for model selection. J. Am. Stat. Assoc. 2019, 114, 1205–1217.
24. Wipf, D.P.; Rao, B.D.; Nagarajan, S. Latent variable Bayesian models for promoting sparsity. IEEE Trans. Inf. Theory 2011, 57, 6236–6255.
25. Ghahramani, Z.; Beal, M.J. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; Volume 12, pp. 449–455.
26. Attias, H. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; Volume 12, pp. 209–215.
27. Wu, Y.; Tang, N.S. Variational Bayesian partially linear mean shift models for high-dimensional Alzheimer’s disease neuroimaging data. Stat. Med. 2022, in press.
28. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
29. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
30. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
31. Rockova, V.; George, E.I. The Spike-and-Slab Lasso. J. Am. Stat. Assoc. 2018, 113, 431–444.
32. Leng, C.; Tran, M.N.; Nott, D. Bayesian adaptive Lasso. Ann. Inst. Stat. Math. 2014, 66, 221–244.
33. Beal, M.J. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Thesis, University of London, London, UK, 2003.
34. Bishop, C. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
35. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
36. Lee, S.Y.; Song, X.Y. Model comparison of nonlinear structural equation models with fixed covariates. Psychometrika 2003, 68, 27–47.
37. Lee, S.Y.; Tang, N.S. Bayesian analysis of nonlinear structural equation models with nonignorable missing data. Psychometrika 2005, 71, 541–564.
38. Tierney, L.; Kadane, J.B. Accurate approximations for posterior moments and marginal densities. J. Am. Stat. Assoc. 1986, 81, 82–86.
39. Neal, R.M. Annealed importance sampling. Stat. Comput. 2001, 11, 125–139.
40. Meng, X.L.; Wong, W. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Stat. Sin. 1996, 6, 831–860.
41. Gelman, A.; Meng, X.L. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Stat. Sci. 1998, 13, 163–185.
42. Skilling, J. Nested sampling for general Bayesian computation. Bayesian Anal. 2006, 1, 833–859.
43. Friel, N.; Pettitt, A.N. Marginal likelihood estimation via power posterior. J. R. Stat. Soc. 2008, 70, 589–607.
44. DiCiccio, T.; Kass, R.; Raftery, A.; Wasserman, L. Computing Bayes factor by combining simulation and asymptotic approximations. J. Am. Stat. Assoc. 1997, 92, 903–915.
45. Llorente, F.; Martino, L.; Delgado, D.; Lopez-Santiago, J. Marginal likelihood computation for model selection and hypothesis testing: An extensive review. arXiv 2022, arXiv:2005.08334.
46. Kass, R.E.; Raftery, A.E. Bayes factors. J. Am. Stat. Assoc. 1995, 90, 773–795.
47. Jack, C.; Bernstein, M.; Fox, N.; Thompson, P.; Alexander, G.; Harvey, D.; Borowski, B.; Britson, P.; Whitwell, J.; Ward, C. The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. J. Magn. Reson. Imaging 2008, 27, 685–691.
48. Zhang, Y.Q.; Tang, N.S.; Qu, A. Imputed factor regression for high-dimensional block-wise missing data. Stat. Sin. 2020, 30, 631–651.
49. Brookmeyer, R.; Johnson, E.; Ziegler-Graham, K.; Arrighi, H. Forecasting the global burden of Alzheimer’s disease. Alzheimers Dement. 2007, 3, 186–191.
50. Chen, Y.; Bornn, L.; De Freitas, N.; Eskelin, M.; Fang, J.; Welling, M. Herded Gibbs sampling. J. Mach. Learn. Res. 2016, 17, 263–291.
51. Martino, L.; Elvira, V.; Camps-Valls, G. The recycling Gibbs sampler for efficient learning. Digit. Signal Process. 2018, 74, 1–13.
52. Roberts, G.O.; Sahu, S.K. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. J. R. Stat. Soc. 1997, 59, 291–317.
Table 1. Performance of variable selection and parameter estimation in the first simulation study.

(a_γ, b_γ)   Σ_x   n    Method |      p = 500       |      p = 1000      |      p = 2000
                               |  TP    FP    RMS   |  TP    FP    RMS   |  TP    FP    RMS
(0.5, 0.5)   I     100  VB     | 3.91  0.00  0.11   | 3.79  0.00  0.08   | 3.84  0.00  0.06
                        LASSO  | 4.44  0.87  1.90   | 3.54  1.03  1.66   | 1.39  0.00  1.91
                   200  VB     | 4.71  0.00  0.11   | 4.68  0.00  0.08   | 4.65  0.00  0.06
                        LASSO  | 4.95  0.24  2.24   | 2.78  1.91  1.36   | 3.34  0.00  1.64
                   300  VB     | 4.89  0.00  0.11   | 4.81  0.00  0.08   | 4.91  0.00  0.06
                        LASSO  | 4.99  0.01  2.12   | 4.91  0.00  1.41   | 4.23  0.00  1.45
             II    100  VB     | 3.79  0.00  0.11   | 3.84  0.00  0.08   | 3.76  0.00  0.06
                        LASSO  | 3.48  0.10  2.19   | 3.01  0.00  1.87   | 3.00  0.00  2.01
                   200  VB     | 3.97  0.00  0.11   | 3.96  0.00  0.08   | 3.98  0.00  0.06
                        LASSO  | 3.59  0.02  2.44   | 3.12  0.00  1.78   | 3.00  0.00  1.84
                   300  VB     | 3.98  0.00  0.11   | 3.96  0.00  0.08   | 3.98  0.00  0.06
                        LASSO  | 3.63  0.03  2.31   | 3.20  0.00  1.79   | 3.01  0.00  1.75
(0.1, 0.9)   I     100  VB     | 3.88  0.00  0.11   | 3.79  0.00  0.08   | 3.84  0.00  0.06
                        LASSO  | 4.44  0.87  1.90   | 3.54  1.03  1.66   | 1.39  0.00  1.91
                   200  VB     | 4.71  0.00  0.11   | 4.66  0.00  0.08   | 4.64  0.00  0.06
                        LASSO  | 4.95  0.24  2.24   | 2.78  1.91  1.36   | 3.34  0.00  1.64
                   300  VB     | 4.89  0.00  0.11   | 4.81  0.00  0.08   | 4.91  0.00  0.06
                        LASSO  | 4.99  0.01  2.12   | 4.91  0.00  1.41   | 4.23  0.00  1.45
Note: VB represents the variational Bayesian method and LASSO denotes the Bayesian lasso method.
Table 2. Estimated log Bayes factor in the second simulation study.

                         n    | p = 500 | p = 1000 | p = 2000
log B_10 (estimated)     100  |  −194   |  −102    |  −86
                         200  |  −372   |  −272    |  −294
                         300  |  −506   |  −544    |  −588
log B_20 (estimated,     100  |  −0.95  |  −4.03   |  −1.41
  ×10^7)                 200  |  −1.54  |  −3.68   |  −2.54
                         300  |  −3.13  |  −3.58   |  −2.26
Table 3. Performance of variational Bayesian method for the complete and selected models in the ADNI-2 data.

Model      n    p     RMSE    MAP
Complete   62   340   49.17   49.15
Selected   62   3     1.05    0.82