Article

Variational Bayesian Inference in High-Dimensional Linear Mixed Models

Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, Kunming 650091, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(3), 463; https://doi.org/10.3390/math10030463
Submission received: 24 December 2021 / Revised: 26 January 2022 / Accepted: 27 January 2022 / Published: 31 January 2022
(This article belongs to the Special Issue Bayesian Inference and Modeling with Applications)

Abstract

In high-dimensional regression models, the Bayesian lasso with the Gaussian spike and slab priors is widely adopted to select variables and estimate unknown parameters. However, it involves large matrix computations in a standard Gibbs sampler. To solve this issue, the Skinny Gibbs sampler is employed to draw observations required for Bayesian variable selection. However, when the sample size is much smaller than the number of variables, the computation is rather time-consuming. As an alternative to the Skinny Gibbs sampler, we develop a variational Bayesian approach to simultaneously select variables and estimate parameters in high-dimensional linear mixed models under the Gaussian spike and slab priors of population-specific fixed-effects regression coefficients, which are reformulated as a mixture of a normal distribution and an exponential distribution. The coordinate ascent algorithm, which can be implemented efficiently, is proposed to optimize the evidence lower bound. The Bayes factor, which can be computed with the path sampling technique, is presented to compare two competing models in the variational Bayesian framework. Simulation studies are conducted to assess the performance of the proposed variational Bayesian method. An empirical example is analyzed by the proposed methodologies.

1. Introduction

Linear mixed models are widely used to analyze longitudinal and correlated data in many fields, such as psychology, medicine, epidemiology and econometrics, by considering the between-subject and within-subject variations and incorporating random effects to account for heterogeneity among subjects. Although variable selection and parameter estimation in linear mixed models are quite challenging, various methods have been developed to estimate fixed-effects parameters and variance–covariance matrices for unobservable random effects and noises or to select fixed-effects and random-effects components. For example, see [1] for restricted maximum likelihood estimation of parameters, ref [2] for the EM algorithm for parameter estimation, refs [3,4] for Bayesian parameter estimation, ref [5] for Bayesian random effects selection and [6] for a moment-based method for random effects selection. The aforementioned methods mainly focus on low-dimensional linear mixed models, while high-dimensional data have become increasingly common with the rapid development of modern information technologies that facilitate data collection. Thus, these methods do not work well in high-dimensional linear mixed models, and some penalized methods have been developed to simultaneously estimate parameters and select variables in high-dimensional linear mixed models. For example, Bondell, Krishna and Ghosh [7] and Ibrahim et al. [8] proposed penalized likelihood methods for joint selection of fixed and random effects; Schelldorfer, Buhlmann and van De Geer [9] proposed an ℓ1-penalized estimation procedure; Fan and Li [10] investigated the problem of fixed and random effects selection when the cluster sizes are balanced; Li et al. [11] presented a doubly regularized approach to simultaneously select fixed and random effects; Bradic, Claeskens and Gueuning [12] considered testing a single parameter of fixed effects in high-dimensional linear mixed models with fixed cluster sizes, fixed numbers of random effects and sub-Gaussian designs; Li, Cai and Li [13] proposed a penalized quasi-likelihood method for statistical inference on unknown parameters in high-dimensional linear mixed-effects models. However, these regularization methods are computationally complex and unstable, and they do not incorporate prior information on the fixed-effects parameters and variance–covariance matrices, which may lead to unsatisfactory estimation accuracy for parameters or variance–covariance matrices. Bayesian approaches for variable selection and parameter estimation have received much attention over the past years because, by imposing various priors on model parameters, they can largely improve the accuracy and efficiency of parameter estimation, consistently select important variables and provide more information for variable selection than the corresponding penalization methods, which involve highly non-convex optimization problems. For example, see [14] for the reference prior, ref [15] for the normal mixture prior, ref [16] for the spike and slab prior, ref [17] for the horseshoe prior and [18] for the shrinking and diffusing prior. In the high-dimensional setting, the Bayesian lasso, the Bayesian adaptive lasso or the indicator model method, together with the Markov chain Monte Carlo (MCMC) algorithm, are widely used to select important variables. For example, see [19] for the Bayesian lasso, ref [20] for the Bayesian adaptive lasso and [21,22] for the EM approach in the Bayesian framework.
The above-mentioned literature involves the implementation of the standard Gibbs sampler for posterior computation, which is not so scalable for large numbers of fixed-effects components [23]. To address the issue, Narisetty, Shen and He [23] proposed a Skinny Gibbs algorithm by using a sparse matrix to replace the high-dimensional variance–covariance matrix, which avoids large matrix operations. However, implementing the above MCMC algorithm in the presence of high-dimensional data may still be subject to the well-known ill-posed problems, i.e., low efficiency, slow convergence and huge memory being required.
As an alternative to the MCMC, the variational Bayesian method, also called ensemble learning, is widely adopted to approximate intractable integrals involved in Bayesian inference or machine learning due to its good properties, such as high-speed computation. Its basic idea is to transform the high-dimensional integration problem into an optimization problem in making Bayesian inference and then optimize the evidence lower bound (ELB), which is efficiently computed, and finally utilize the ELB to obtain a variational approximation to the posterior distribution in Bayesian analysis. The variational Bayesian approach has been applied to some familiar models, for example, latent variable models [24], mixtures of factor analysis [25], graphical models [26] and partially linear mean shift models with high-dimensional data [27].
Motivated by the aforementioned variational Bayesian studies, we develop a novel variational Bayesian approach to estimate model parameters and select important variables under the Skinny Gibbs sampling framework in a linear mixed model with low-dimensional random effects and high-dimensional fixed effects. We specify the spike and slab priors for the population-specific fixed-effects regression coefficients with completely different shrinkage parameters, which overcomes the problem of selecting a high-dimensional vector of shrinkage parameters. We reformulate the spike and slab priors of the parameters as a mixture of a normal distribution and an exponential distribution, which avoids the high-dimensional integral problem. The coordinate ascent algorithm, which can be implemented efficiently, is proposed to optimize the ELB. The Bayes factor, which can be computed with the path sampling technique, is presented to compare two competing models in the variational Bayesian framework. The merits of the proposed variational Bayesian method are (i) simultaneously estimating parameters and variance–covariance matrices and selecting fixed- and random-effects components at quite a low computational cost, (ii) efficiently analyzing high-dimensional data without requiring non-convex optimization and avoiding the curse of dimensionality, (iii) automatically incorporating the shrinkage parameters and (iv) avoiding large matrix computations.
The rest of the article is organized as follows: Section 2 introduces the linear mixed model setup, including the spike and slab priors. Section 3 describes the Skinny Gibbs sampler algorithm for selecting fixed- and random-effects components and estimating parameters in coefficients and variance–covariance matrices via the Bayesian lasso method. Section 4 develops a variational Bayesian approach to approximate posterior distributions of parameters and random effects and presents the Bayes factor for model comparison. The coordinate ascent algorithm is adopted to optimize the ELB in Section 4. Simulation studies are considered in Section 5. An empirical example is illustrated by the proposed methodologies in Section 6. A brief discussion is given in Section 7. Technical details are presented in Appendix A, Appendix B and Appendix C.

2. Model

Consider a dataset with n subjects. For the ith subject, let y_ij be the observation of the response variable, x_ij be a p × 1 vector of covariates associated with the fixed effects and z_ij be a q × 1 vector of covariates associated with the random effects, which may be a subvector of x_ij, for j = 1, …, n_i, where n_i is the number of repeated observations on the ith subject. Generally, n_i varies across subjects. For simplicity, we suppose that y_ij has been centered at zero to avoid the need for an intercept and that n_1 = ⋯ = n_n = m, i.e., a balanced design. It is assumed that p ≫ n and that only a small number of the covariates in x_ij contribute to the response variable y_ij, i.e., the fixed-effects coefficient vector is sparse, while q is smaller than n.
For the dataset D = { ( y i j , x i j , z i j ) : i = 1 , , n , j = 1 , , m } , we consider the following linear mixed model:
y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, \quad i = 1,\ldots,n, \; j = 1,\ldots,m,
where β = ( β 1 , , β p ) is a  p × 1 vector of population-specific fixed-effects regression coefficients, b i is a  q × 1 vector of subject-specific random effects and ε i j is measurement error. Here, we assume that b 1 , , b n are independent and identically distributed (i.i.d.) as the multivariate normal distribution with mean zero and covariance matrix Q and ε i j ’s are independently distributed as N ( 0 , σ j 2 ) , where N ( · , · ) represents the normal distribution. Here, σ 1 2 , , σ m 2 are not completely different but some of them may be identical.
Under the aforementioned assumptions, a penalized likelihood approach to estimate β is implemented by
\hat{\beta} = \arg\max_{\beta \in \mathbb{R}^p} \left\{ -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(y_{ij} - x_{ij}^{\top}\beta)^2}{\sigma_j^2 + z_{ij}^{\top} Q z_{ij}} - f_{\lambda}(\beta) \right\},
where f_\lambda(\beta) is some appropriate penalty function indexed by the penalty parameter \lambda. In the variable selection literature, it is usually assumed that f_\lambda(\beta) has the form f_\lambda(\beta) = \sum_{k=1}^{p} f_{\lambda_k}(\beta_k), where f_{\lambda_k}(\beta_k) can be taken as the ℓ_0-norm, the ℓ_1-norm, the MCP penalty [28], the SCAD penalty [29] or the Elastic-Net penalty [30]. Recently, it has been widely recognized that \hat{\beta} can be regarded as a posterior mode of \beta under some proper prior. Inspired by this idea, we consider a Bayesian variable selection procedure based on some proper prior on \beta.
Following [31], we consider the following spike and slab prior of  β :
f(\beta \mid \gamma, \lambda_0, \lambda_1) = \prod_{k=1}^{p} \left\{ \gamma_k\, g_1(\beta_k \mid \lambda_1) + (1 - \gamma_k)\, g_0(\beta_k \mid \lambda_0) \right\},
where γ = ( γ 1 , , γ p ) , in which γ k is a binary latent variable and follows a  Bernoulli distribution with the probability ρ k = Pr ( γ k = 1 ) , i.e., γ k = 1 indicates that the kth covariate is active and γ k = 0 implies that the kth covariate is inactive and g 1 ( β k | λ 1 ) is usually referred to as a diffuse slab prior reflecting the effect of an active covariate, while g 0 ( β k | λ 0 ) is called a concentrated spike prior reflecting the negligibly unimportant effect of an inactive covariate for k = 1 , , p . Let f ( γ | ρ ) be the prior distribution of  γ indexed by ρ . It is assumed that f ( γ | ρ ) has the form
f(\gamma \mid \rho) = \prod_{k=1}^{p} \rho_k^{\gamma_k} (1 - \rho_k)^{1 - \gamma_k},
where ρ = ( ρ 1 , , ρ p ) . For simplicity, we assume ρ 1 = = ρ p = ρ , which is the expected proportion of the active covariates. Generally, one can take g 0 ( · ) and g 1 ( · ) as the normal distribution with a small and a large variance, respectively. However, for the spike and slab lasso, we take the following slab and spike priors
g_1(\beta_k \mid \lambda_1) = \frac{\lambda_1}{2} e^{-\lambda_1 |\beta_k|}, \qquad g_0(\beta_k \mid \lambda_0) = \frac{\lambda_0}{2} e^{-\lambda_0 |\beta_k|},
respectively, where \lambda_1 should tend to zero and \lambda_0 should tend to infinity as the sample size becomes sufficiently large, which implies that the inactive covariates will be detected as zeros in that small values of \beta_k relative to 1/\lambda_0 or 1/\lambda_1 are truncated to zero. Following [32], the density g(\beta_k \mid \lambda) = (\lambda/2)\exp(-\lambda|\beta_k|) can be hierarchically written as a mixture of a normal distribution and an exponential distribution, i.e.,
\beta_k \mid \xi_{\ell k}^2, \gamma_k = \ell \sim N(0, \xi_{\ell k}^2), \qquad \xi_{\ell k}^2 \mid \lambda_\ell^2 \sim \mathrm{Exp}(\lambda_\ell^2/2), \qquad \ell = 0, 1.
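As a quick numerical check of this representation (our own illustration, not part of the original paper), drawing \xi_k^2 from the exponential distribution and then \beta_k from N(0, \xi_k^2) reproduces the Laplace density with rate \lambda:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0            # Laplace rate parameter lambda (illustrative value)
n_draws = 100_000

# Hierarchical draw: xi_k^2 ~ Exp(lambda^2/2), then beta_k | xi_k^2 ~ N(0, xi_k^2).
# Exp with rate lambda^2/2 corresponds to scale = 2/lambda^2 in NumPy.
xi2 = rng.exponential(scale=2.0 / lam**2, size=n_draws)
beta = rng.normal(loc=0.0, scale=np.sqrt(xi2))

# Direct draws from the Laplace density (lambda/2) * exp(-lambda * |beta|), i.e. scale 1/lambda.
beta_direct = rng.laplace(loc=0.0, scale=1.0 / lam, size=n_draws)

# Both samples have variance close to 2 / lambda^2 = 0.5.
print(beta.var(), beta_direct.var())
```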
Incorporating the above idea shows that the posterior distributions of binary latent variables can be employed to distinguish active covariates from inactive ones in the considered model.
For covariance matrix Q, the proportion ρ , λ 0 2 , λ 1 2 and σ j 2 , we consider the following priors:
Q \sim \mathrm{IW}(S_0, \nu_0), \quad \rho \sim \mathrm{Beta}(a_\gamma, b_\gamma), \quad \lambda_0^2 \sim \Gamma(c_0, d_0), \quad \lambda_1^2 \sim \Gamma(c_1, d_1), \quad \sigma_j^2 \sim \Gamma(c_2, d_2),
where IW(·,·) denotes the inverted Wishart distribution, Beta(·,·) represents the Beta distribution, Γ(·,·) is the gamma distribution, IG(·,·) is the inverse gamma distribution and S_0, \nu_0, a_\gamma, b_\gamma, c_0, d_0, c_1, d_1, c_2 and d_2 are user-specified hyperparameters. As mentioned above, \lambda_1 should tend to zero and \lambda_0 should tend to infinity as the sample size becomes sufficiently large, which implies that c_1/d_1 should be smaller than c_0/d_0. To this end, we assume c_1 ≪ c_0 and d_0 ≪ d_1.
Based on the above discussion, we can rewrite the considered linear mixed model together with the spike and slab lasso prior as the following hierarchical models:
y_{ij} \mid b_i \sim N(\mu_{ij}, \sigma_j^2), \quad \mu_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i, \quad i = 1,\ldots,n, \; j = 1,\ldots,m,
b_i \sim N_q(0, Q), \quad i = 1,\ldots,n,
\beta_k \mid \xi_{1k}^2, \gamma_k = 1 \sim N(0, \xi_{1k}^2), \quad \xi_{1k}^2 \mid \lambda_1^2 \sim \mathrm{Exp}(\lambda_1^2/2), \quad \lambda_1^2 \sim \Gamma(c_1, d_1),
\beta_k \mid \xi_{0k}^2, \gamma_k = 0 \sim N(0, \xi_{0k}^2), \quad \xi_{0k}^2 \mid \lambda_0^2 \sim \mathrm{Exp}(\lambda_0^2/2), \quad \lambda_0^2 \sim \Gamma(c_0, d_0),
\gamma_k \sim \mathrm{Bernoulli}(\rho), \quad k = 1,\ldots,p,
Q \sim \mathrm{IW}(S_0, \nu_0), \quad \rho \sim \mathrm{Beta}(a_\gamma, b_\gamma), \quad \sigma_j^2 \sim \Gamma(c_2, d_2), \quad j = 1,\ldots,m.
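The hierarchical model can be simulated forward directly; the following minimal sketch (our own illustration, with arbitrary dimensions and hyperparameter values, not taken from the paper) draws one dataset from model (8):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p, q = 50, 6, 200, 4              # illustrative dimensions
rho, lam1, lam0 = 0.02, 0.5, 50.0        # illustrative inclusion probability and shrinkage rates

# Spike-and-slab coefficients via the normal/exponential scale mixture.
gamma = rng.binomial(1, rho, size=p)
xi2 = np.where(gamma == 1,
               rng.exponential(2.0 / lam1**2, size=p),    # slab: large scale
               rng.exponential(2.0 / lam0**2, size=p))    # spike: tiny scale
beta = rng.normal(0.0, np.sqrt(xi2))

Q = 0.1 + 0.9 * np.eye(q)                # random-effects covariance (illustrative)
sigma2 = np.linspace(0.8, 1.0, m)        # occasion-specific error variances

X = rng.normal(size=(n, m, p))           # fixed-effects covariates x_ij
Z = rng.normal(size=(n, m, q))           # random-effects covariates z_ij
b = rng.multivariate_normal(np.zeros(q), Q, size=n)

Y = (X @ beta                            # x_ij' beta
     + np.einsum("imq,iq->im", Z, b)     # z_ij' b_i
     + rng.normal(0.0, np.sqrt(sigma2), size=(n, m)))
print(Y.shape, int(gamma.sum()))
```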

3. Skinny Gibbs Sampler for Bayesian Lasso

Let Y = { y i j : i = 1 , , n , j = 1 , , m } , X = { x i j : i = 1 , , n , j = 1 , , m } and Z = { z i j : i = 1 , , n , j = 1 , , m } . From Equation (8), the joint posterior density of parameters β , Q, γ = { γ 1 , , γ p } , σ 2 = ( σ 1 2 , , σ m 2 ) and ϑ = { ρ , λ 0 , λ 1 } given the data D = { Y , X , Z } is given by
f(\beta, Q, \gamma, \sigma^2, \vartheta \mid D) \propto \prod_{i=1}^{n}\prod_{j=1}^{m}\psi\big(y_{ij},\, x_{ij}^{\top}\beta,\, z_{ij}^{\top}Q^{-1}z_{ij} + \sigma_j^2\big)\prod_{j=1}^{m}f(\sigma_j^2) \times \prod_{k=1}^{p}\{\rho\, g_1(\beta_k \mid \lambda_1)\}^{\gamma_k}\{(1-\rho)\, g_0(\beta_k \mid \lambda_0)\}^{1-\gamma_k}\, f_W(Q)\, f_\vartheta(\vartheta),
where ψ ( x , μ , ς 2 ) is the probability density of normal random variable x with mean μ and variance ς 2 , f ( σ j 2 ) denotes the probability density of random variable σ j 2 , f W ( Q ) is the inverted Wishart density function of random matrix Q and f ϑ ( ϑ ) represents the joint prior density function of random variable vector ϑ . It is rather difficult to sample observations from the joint posterior density given in Equation (9) in the presence of high-dimensional fixed effects because of some non-standard distributions and large matrix computations involved. In what follows, the Gibbs sampler is utilized to sample observations required for Bayesian inference.
To avoid expensive computation in running the Gibbs sampler, similarly to [23], at each Gibbs iteration, we divide parameter vector β into two subvectors corresponding to those active (i.e., γ k = 1 ) and inactive (i.e., γ k = 0 ) covariates, respectively. To wit, we define β = ( β A , β I ) , where β A and β I are the subvectors of  β associated with γ k = 1 and γ k = 0 , respectively. Suppose that the cardinality of the set A is r. Without loss of generality, it is assumed that the first r components of  β correspond to β A and the last p r components of  β correspond to β I . Similarly, we decompose x i j as x i j = ( x i j A , x i j I ) . Under the above assumptions, the Gibbs sampler is implemented as follows. Observations required at each Gibbs iteration are iteratively drawn from the following conditional distributions: f A ( β A | D , b , σ 2 ) , f I ( β I | D ) , f ( b i | D , β , σ 2 , Q ) , f ( ξ 0 k 2 | β k , γ k ) , f ( ξ 1 k 2 | β k , γ k ) , f γ ( γ k | D , b , ξ 1 , ξ 0 ) , f ( Q | b ) , f ( σ j 2 | D , b ) , f ( ρ | γ ) , f ( λ 0 2 | ξ 0 ) and f ( λ 1 2 | ξ 1 ) , which are given in Appendix A, where b = { b 1 , , b n } , ξ 0 = { ξ 01 2 , , ξ 0 p 2 } and ξ 1 = { ξ 11 2 , , ξ 1 p 2 } .
Although the Skinny Gibbs sampler introduced above can be easily conducted, it is rather time-consuming for a sufficiently large p. To address the issue, we investigate a fast yet efficient approach as follows, i.e., the variational Bayesian method.

4. Variational Bayesian Inference

4.1. Variational Bayes

It follows from the principle of variational inference that it is necessary to first construct a variational set F of densities for random variables Ξ having the same support as the posterior density f ( Ξ | D ) , where Ξ = { β , b , ξ 0 , ξ 1 , Q , γ , σ 2 , ϑ } . It is assumed that q ( Ξ ) F is any variational density for approximating f ( Ξ | D ) . The variational Bayes aims to find the best approximation to f ( Ξ | D ) in terms of the Kullback–Leibler divergence between q ( Ξ ) and f ( Ξ | D ) , which is a solution to the optimization problem:
q(\Xi) = \arg\min_{q(\Xi) \in \mathcal{F}} \mathrm{KL}\big(q(\Xi)\,\|\, f(\Xi \mid D)\big),
where
\mathrm{KL}\big(q(\Xi)\,\|\,f(\Xi \mid D)\big) = \int \log\frac{q(\Xi)}{f(\Xi \mid D)}\, q(\Xi)\, d\Xi = \int \log\frac{q(\Xi)\, f(Y \mid X, Z)}{f(\Xi, Y \mid X, Z)}\, q(\Xi)\, d\Xi
= E_{q(\Xi)}\{\log q(\Xi)\} - E_{q(\Xi)}\{\log f(\Xi, Y \mid X, Z)\} + \log f(Y \mid X, Z) \geq 0,
in which E_{q(\Xi)}(\cdot) is the expectation taken with respect to q(\Xi). Here, \mathrm{KL}(q(\Xi)\,\|\,f(\Xi \mid D)) equals zero if and only if q(\Xi) = f(\Xi \mid D). Due to the intractable high-dimensional integral involved, it is quite troublesome to solve the above optimization problem directly.
However, it follows from L\{q(\Xi)\} = E_{q(\Xi)}\{\log f(\Xi, Y \mid X, Z)\} - E_{q(\Xi)}\{\log q(\Xi)\} that
\log f(Y \mid X, Z) = \mathrm{KL}\big(q(\Xi)\,\|\,f(\Xi \mid D)\big) + L\{q(\Xi)\} \geq L\{q(\Xi)\}.
Thus, L\{q(\Xi)\} can be regarded as a lower bound of \log f(Y \mid X, Z) and is usually referred to as the evidence lower bound (ELB). Minimizing \mathrm{KL}(q(\Xi)\,\|\,f(\Xi \mid D)) is then equivalent to maximizing L\{q(\Xi)\} because \log f(Y \mid X, Z) does not depend on q(\Xi). That is,
q(\Xi) = \arg\min_{q(\Xi) \in \mathcal{F}} \mathrm{KL}\big(q(\Xi)\,\|\,f(\Xi \mid D)\big) = \arg\max_{q(\Xi) \in \mathcal{F}} L\{q(\Xi)\}.
The problem of finding the best approximation to f(\Xi \mid D) is thus transformed into an optimization problem of maximizing L\{q(\Xi)\} over the variational family \mathcal{F}. The complexity of this optimization problem depends on that of the variational set \mathcal{F}, so it is desirable to carry out the optimization over a relatively simple variational set \mathcal{F}.
Following the widely used methods for constructing a relatively simple variational set, we take F as the mean-field variational family in which components of  Ξ are mutually independent and each has a distinct factor in the variational density. Thus, we can assume that the variational density q ( Ξ ) has the form
q(\Xi) = q(\beta)\, q(b)\, q(\sigma^2)\, q(\gamma)\, q(Q)\, q(\vartheta) \prod_{k=1}^{p} \{q(\xi_{0k}^2)\, q(\xi_{1k}^2)\} \equiv \prod_{s=1}^{S} q_s(\zeta_s),
where q s ( ζ s ) s are unspecified but the above assumed factorization across components is pre-specified. Similarly to considerable variational literature, the optimal solutions of  q s ( ζ s ) s can be obtained by maximizing L { q ( ζ 1 , , ζ S ) } via the coordinate ascent method, where Ξ = { ζ 1 , , ζ S } .
Following the idea of the coordinate ascent method given in [33,34,35], when fixing other variational factors q j ( ζ j ) for  j s , i.e., ζ s = { ζ j : j s , j = 1 , , S } , the optimal variational density q s ( ζ s ) maximizing L { q ( Ξ ) } with respect to q s ( ζ s ) has the form
q_s(\zeta_s) \propto \exp\big[E_{-s}\{\log f(\zeta_s \mid \zeta_{-s}, D)\}\big] \propto \exp\big[E_{-s}\{\log f(Y, \Xi \mid X, Z)\}\big],
where f(\zeta_s \mid \zeta_{-s}, D) is the conditional density of \zeta_s given (\zeta_{-s}, D), \zeta_{-s} = \{\zeta_j : j \neq s, j = 1,\ldots,S\}, and E_{-s}(\cdot) represents the expectation evaluated with respect to q_{-s}(\zeta_{-s}) = \prod_{j \neq s} q_j(\zeta_j). Equation (15) implies that E_{-s}(\cdot) does not involve the sth variational factor q_s(\zeta_s); however, the optimal variational density q_s(\zeta_s) cannot be obtained directly because the q_j(\zeta_j)'s on the right-hand side are not yet the optimal ones. To address this issue, the coordinate updating algorithm is employed to iteratively update q_s(\zeta_s) via Equation (15). After the coordinate updating algorithm converges, we take the mean or mode of the optimal variational density q_s(\zeta_s) as the variational Bayesian estimate of \zeta_s and regard a covariate as active if its corresponding variational Bayesian estimate deviates from zero.
It is easily shown from Equation (15) that the optimal density q β ( β ) has the form
q_{\beta_A}(\beta_A) \sim N_r(\mu_A, \Sigma_A), \qquad q_{\beta_I}(\beta_I) \sim N_{p-r}(0, \Sigma_I),
respectively, where Σ A 1 = i = 1 n j = 1 m x i j A x i j A E σ j 2 ( σ j 2 ) + diag ( ξ A ) with ξ A = { E ξ 1 k ( ξ 1 k 2 ) , k A } , μ A = Σ A [ i = 1 n j = 1 m x i j A { y i j z i j E b i ( b i ) } E σ j 2 ( σ j 2 ) ] and Σ I 1 = diag ( i = 1 n j = 1 m x i j I x i j I ) + diag ( ξ I 0 ) = n m I p r + diag ( ξ I 0 ) with ξ I 0 = { E ξ 0 k ( ξ 0 k 2 ) , k I } , in which E σ j 2 ( · ) , E ξ 1 k ( · ) , E ξ 0 k ( · ) and E b i ( · ) are the expectations taken with respect to q σ j 2 ( σ j 2 ) , q ξ 1 k ( ξ 1 k 2 ) , q ξ 0 k ( ξ 0 k 2 ) and q b i ( b i ) , respectively. Then, the estimated posterior means and variance matrices of  β A and β I for the variational densities q β A ( β A ) and q β I ( β I ) are E A ( β A ) = μ A , var A ( β A ) = Σ A , E I ( β I ) = 0 and var I ( β I ) = Σ I , respectively. Moreover, the mode estimator β A q of  β A for the variational density q β A ( β A ) is β A q = μ A , while the mode estimator β I q of  β I for the variational density q β I ( β I ) is β I q = 0 .
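As a computational illustration (ours, not from the paper), the update of q_{\beta_A}(\beta_A) amounts to one ridge-type linear solve. The sketch below assumes the responses have already had the random-effects part z_{ij}^{\top}E_{b_i}(b_i) subtracted and that the required expectations from the other variational factors are supplied as plain NumPy arrays (the argument names are ours):

```python
import numpy as np

def update_beta_A(X_A, Y_res, w_sig, w_xi_A):
    """One coordinate-ascent update of q(beta_A) = N(mu_A, Sigma_A).

    X_A     : (n, m, r) covariates of the currently active set A
    Y_res   : (n, m)    responses with the random-effects part z_ij' E[b_i] subtracted
    w_sig   : (m,)      expectations from q(sigma_j^2) entering the precision
    w_xi_A  : (r,)      expectations from q(xi_{1k}^2) for k in A
    """
    # Sigma_A^{-1} = sum_{i,j} x_ijA x_ijA' * w_sig[j] + diag(w_xi_A)
    prec = np.einsum("imk,iml,m->kl", X_A, X_A, w_sig) + np.diag(w_xi_A)
    Sigma_A = np.linalg.inv(prec)
    # mu_A = Sigma_A * sum_{i,j} x_ijA * Y_res[i,j] * w_sig[j]
    mu_A = Sigma_A @ np.einsum("imk,im,m->k", X_A, Y_res, w_sig)
    return mu_A, Sigma_A
```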
The optimal density q b i ( b i ) is the multivariate normal distribution
q_{b_i}(b_i) \sim N_q(\mu_b, \Sigma_b),
where Σ b 1 = E Q ( Q ) + j = 1 m z i j z i j E σ j 2 ( σ j 2 ) and μ b = Σ b [ j = 1 m z i j { y i j x i j A E A ( β A ) } E σ j 2 ( σ j 2 ) ] . Then, the estimated posterior mean and variance matrix of  b i for variational densities q b i ( b i ) are E b i ( b i ) = μ b and var b i ( b i ) = Σ b , respectively. Moreover, the mode estimator b i q of  b i for variational density q b i ( b i ) is b i q = μ b . The optimal densities q ξ 0 k ( ξ 0 k 2 ) and q ξ 1 k ( ξ 1 k 2 ) are given by
q_{\xi_{0k}}(\xi_{0k}^2) \sim \mathrm{IvG}(a_{0\xi k}, b_{0\xi k}) \ \text{for } k \in I, \qquad q_{\xi_{1k}}(\xi_{1k}^2) \sim \mathrm{IvG}(a_{1\xi k}, b_{1\xi k}) \ \text{for } k \in A,
respectively, where a 0 ξ k = E λ 0 ( λ 0 2 ) / var β k ( β k ) , a 1 ξ k = E λ 1 ( λ 1 2 ) / [ { E β k ( β k ) } 2 + var β k ( β k ) ] , b 0 ξ k = E λ 0 ( λ 0 2 ) , b 1 ξ k = E λ 1 ( λ 1 2 ) and E λ 0 ( · ) and E λ 1 ( · ) are the expectations taken with respect to q λ 0 ( λ 0 2 ) and q λ 1 ( λ 1 2 ) , respectively. In this case, we have E ξ 0 k ( ξ 0 k 2 ) = a 0 ξ k , E ξ 1 k ( ξ 1 k 2 ) = a 1 ξ k , var ξ 0 k ( ξ 1 k 2 ) = ( a 0 ξ k ) 3 / b 0 ξ k and var ξ 1 k ( ξ 1 k 2 ) = ( a 1 ξ k ) 3 / b 1 ξ k . Moreover, the mode estimators ξ 0 k 2 q and ξ 1 k 2 q of  ξ 0 k 2 and ξ 1 k 2 for variational densities q ξ 0 k ( ξ 0 k 2 ) and q ξ 1 k ( ξ 1 k 2 ) are ξ 0 k 2 q = a 0 ξ k 1 + ( 1.5 a 0 ξ k / b 0 ξ k ) 2 1.5 ( a 0 ξ k ) 2 / b 0 ξ k for  k I and ξ 1 k 2 q = a 1 ξ k 1 + ( 1.5 a 1 ξ k / b 1 ξ k ) 2 1.5 ( a 1 ξ k ) 2 / b 1 ξ k for  k A , respectively.
To derive the optimal density q γ k ( γ k ) , we denote
\log(\varrho_k) = E_\rho(\log\rho) - E_\rho\{\log(1-\rho)\} + \frac{1}{2}\Big[E_{\xi_{1k}}\{\log(\xi_{1k}^2)\} - E_{\xi_{0k}}\{\log(\xi_{0k}^2)\}\Big] + E_{\beta_k}(\beta_k)\sum_{i=1}^{n}\sum_{j=1}^{m}\big\{y_{ij} - x_{ij,C_k}^{\top}E_{\beta}(\beta_{C_k}) - z_{ij}^{\top}E_{b_i}(b_i)\big\}\, x_{ijk}\, E_{\sigma_j^2}(\sigma_j^{2}) - \frac{1}{2}\big[\mathrm{var}_{\beta_k}(\beta_k) + \{E_{\beta_k}(\beta_k)\}^2\big]\Big[\sum_{i=1}^{n}\sum_{j=1}^{m}x_{ijk}^{2}\, E_{\sigma_j^2}(\sigma_j^{2}) - E_{\xi_{0k}}(\xi_{0k}^{2}) + E_{\xi_{1k}}(\xi_{1k}^{2})\Big],
where C_k = \{\ell : \gamma_\ell = 1, \ell \neq k, \ell \in A\} = A \setminus \{k\}, which is the index set with the kth index deleted from the set A. Thus, the optimal variational density of the latent variable \gamma_k is the Bernoulli distribution with probability \varsigma_k = \varrho_k/(\varrho_k + 1), i.e., q_{\gamma_k}(\gamma_k) \sim \mathrm{Bernoulli}(\varsigma_k) for k = 1,\ldots,p. In this case, the estimated posterior mean and variance of \gamma_k for the variational density q_{\gamma_k}(\gamma_k) are E_{\gamma_k}(\gamma_k) = \varsigma_k and \mathrm{var}_{\gamma_k}(\gamma_k) = \varsigma_k(1 - \varsigma_k), respectively. Thus, the mode estimator \gamma_k^q of \gamma_k for the variational density q_{\gamma_k}(\gamma_k) is \gamma_k^q = \varsigma_k for k = 1,\ldots,p.
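In practice, \varrho_k is computed on the log scale via the expression above, and the inclusion probability \varsigma_k = \varrho_k/(\varrho_k + 1) is then a logistic transform of \log(\varrho_k). A minimal sketch (ours, not from the paper; it assumes \log\varrho_k has already been evaluated):

```python
import numpy as np
from scipy.special import expit   # numerically stable logistic function

def inclusion_prob(log_varrho):
    """Map log(varrho_k) to the inclusion probability
    varsigma_k = varrho_k / (varrho_k + 1) = 1 / (1 + exp(-log varrho_k))."""
    return expit(np.asarray(log_varrho))

print(inclusion_prob([-20.0, 0.0, 20.0]))   # approximately [0, 0.5, 1]
```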
The optimal density q Q ( Q ) has the form
q_Q(Q) \sim \mathrm{IW}_q(S_0^{*}, \nu_0^{*}),
where S_0^{*} = S_0 + n\mu_b\mu_b^{\top} + n\Sigma_b with \mu_b and \Sigma_b defined in Equation (17), and \nu_0^{*} = \nu_0 + n. Then, we have E_Q(Q) = S_0^{*}/(\nu_0^{*} - q - 1). Moreover, the mode estimator Q^q of Q is given by Q^q = S_0^{*}/(\nu_0^{*} + q + 1).
The optimal density q σ j 2 ( σ j 2 ) ( j = 1 , , m ) has the form
q_{\sigma_j^2}(\sigma_j^2) \sim \Gamma\!\left(\tfrac{n}{2},\, b_\sigma\right),
where b_\sigma = 0.5\sum_{i=1}^{n} h_{ij}, h_{ij} = (y_{ij} - \mu_{ij})^2 + x_{ijA}^{\top}\Sigma_A x_{ijA} + x_{ijI}^{\top}\Sigma_I x_{ijI} + z_{ij}^{\top}\Sigma_b z_{ij} and \mu_{ij} = x_{ijA}^{\top}\mu_A + z_{ij}^{\top}\mu_b. Thus, we have E_{\sigma_j}(\sigma_j^2) = n/\sum_{i=1}^{n} h_{ij} and \mathrm{var}_{\sigma_j}(\sigma_j^2) = 2n/(\sum_{i=1}^{n} h_{ij})^2. In this case, the mode estimator \sigma_j^{2q} of \sigma_j^2 for the variational density q_{\sigma_j^2}(\sigma_j^2) is \sigma_j^{2q} = (n-2)/\sum_{i=1}^{n} h_{ij} for j = 1,\ldots,m.
The optimal density q ρ ( ρ ) can be expressed as
q_\rho(\rho) \sim \mathrm{Beta}(c_\rho, d_\rho),
where c_\rho = a_\gamma + \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k) and d_\rho = b_\gamma + p - \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k). Thus, we have E_\rho(\rho) = c_\rho/(c_\rho + d_\rho) and \mathrm{var}_\rho(\rho) = c_\rho d_\rho/\{(c_\rho + d_\rho)^2(c_\rho + d_\rho + 1)\}. In this case, the mode estimator of \rho is given as \rho^q = c_\rho/(c_\rho + d_\rho).
The optimal densities q λ 0 ( λ 0 2 ) and q λ 1 ( λ 1 2 ) are
q_{\lambda_0}(\lambda_0^2) \sim \Gamma(a_{0\lambda}, b_{0\lambda}), \qquad q_{\lambda_1}(\lambda_1^2) \sim \Gamma(a_{1\lambda}, b_{1\lambda}),
respectively, where a_{0\lambda} = c_0 + p - \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k), b_{0\lambda} = d_0 + \sum_{k=1}^{p}\{1 - E_{\gamma_k}(\gamma_k)\}E_{\xi_{0k}}(\xi_{0k}^2)/2, a_{1\lambda} = c_1 + \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k) and b_{1\lambda} = d_1 + \sum_{k=1}^{p} E_{\gamma_k}(\gamma_k)E_{\xi_{1k}}(\xi_{1k}^2)/2. In this case, we obtain E_{\lambda_0}(\lambda_0^2) = a_{0\lambda}/b_{0\lambda}, \mathrm{var}_{\lambda_0}(\lambda_0^2) = a_{0\lambda}/(b_{0\lambda})^2, E_{\lambda_1}(\lambda_1^2) = a_{1\lambda}/b_{1\lambda} and \mathrm{var}_{\lambda_1}(\lambda_1^2) = a_{1\lambda}/(b_{1\lambda})^2. The mode estimators \lambda_0^{2q} and \lambda_1^{2q} of \lambda_0^2 and \lambda_1^2 for the variational densities q_{\lambda_0}(\lambda_0^2) and q_{\lambda_1}(\lambda_1^2) are \lambda_0^{2q} = (a_{0\lambda} - 1)/b_{0\lambda} and \lambda_1^{2q} = (a_{1\lambda} - 1)/b_{1\lambda}, respectively.

4.2. Optimizing L { q ( Ξ ) } via Coordinate Ascent Algorithm

The elaborated steps for optimizing L { q ( Ξ ) } via the coordinate ascent algorithm are given below:
  • Step (a) Given the initial values of variational densities q β ( β ) , q b i ( b i ) , q ξ 0 k ( ξ 0 k 2 ) , q ξ 1 k ( ξ 1 k 2 ) , q γ k ( γ k ) , q Q ( Q ) , q σ j 2 ( σ j 2 ) , q ρ ( ρ ) , q λ 0 ( λ 0 2 ) and q λ 1 ( λ 1 2 ) , compute the lower bound L { q ( Ξ ) } (denoted as L ( 0 ) { q ( Ξ ) } ) and set κ = 1 .
  • Step (b) Compute variational density q β ( β ) and update E β ( β ) .
  • Step (c) Compute variational density q b i ( b i ) and update E b i ( b i ) .
  • Step (d) Compute variational density q ξ 0 k ( ξ 0 k 2 ) and update E ξ 0 k ( ξ 0 k 2 ) .
  • Step (e) Compute variational density q ξ 1 k ( ξ 1 k 2 ) and update E ξ 1 k ( ξ 1 k 2 ) .
  • Step (f) For  k = 1 , , p , compute variational densities q γ k ( γ k ) and update E γ k ( γ k ) .
  • Step (g) Compute variational density q Q ( Q ) and update E Q ( Q ) .
  • Step (h) Compute variational densities q σ j 2 ( σ j 2 ) and update E σ j ( σ j 2 ) .
  • Step (i) Compute variational density q ρ ( ρ ) and update E ρ ( ρ ) .
  • Step (j) Compute variational density q λ 0 ( λ 0 2 ) and update E λ 0 ( λ 0 2 ) .
  • Step (k) Compute variational density q λ 1 ( λ 1 2 ) and update E λ 1 ( λ 1 2 ) .
  • Step (l) Based on variational densities from Steps (b)–(k), compute the ELB L { q ( Ξ ) } (denoted as L ( κ ) { q ( Ξ ) } ) and the relative change
    \mathrm{RC} = \frac{\big| L^{(\kappa)}\{q(\Xi)\} - L^{(\kappa-1)}\{q(\Xi)\} \big|}{L^{(\kappa-1)}\{q(\Xi)\}}.
  • Step (m) Given sufficiently small ϵ , if RC < ϵ , the algorithm is stopped. Otherwise, repeat Steps (b)–(l).
The coordinate ascent algorithm presented above for computing variational Bayesian estimates of parameters is summarized as Algorithm 1; it converges to the solution of the optimization problem (13) because it satisfies the well-known KKT condition for the considered model.
Algorithm 1: Variational Bayesian estimation
[Algorithm 1 appears as a figure in the published article; it summarizes Steps (a)–(m) of the coordinate ascent procedure described above.]
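A schematic of the loop summarized in Algorithm 1, written as a minimal Python sketch (ours, not from the paper); the per-factor update functions and the ELB evaluation of Appendix B are assumed to be supplied by the user, so only the control flow of Steps (b)–(m) and the relative-change stopping rule are shown:

```python
def cavi(state, updates, elbo, tol=1e-6, max_iter=500):
    """Generic coordinate-ascent loop for maximizing the ELB L{q(Xi)}.

    state   : container holding all current variational parameters/expectations
    updates : list of functions, one per variational factor, applied in the order
              of Steps (b)-(k); each takes the state and returns the updated state
    elbo    : function mapping the state to the current value of L{q(Xi)}
    """
    L_old = elbo(state)                       # Step (a): initial lower bound
    for _ in range(max_iter):
        for update in updates:                # Steps (b)-(k): update each factor in turn
            state = update(state)
        L_new = elbo(state)                   # Step (l): recompute the lower bound
        rc = abs(L_new - L_old) / abs(L_old)  # relative change RC
        if rc < tol:                          # Step (m): stop once RC < epsilon
            break
        L_old = L_new
    return state
```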

4.3. Model Comparison

The Bayes factor is a vital statistic for model comparison within the Bayesian framework and is widely employed to choose a better model among the considered competing models due to its merits for model selection: (i) it is a consistent selector; (ii) it acts as an Occam's razor, preferring the simpler model when fits are similar; (iii) it does not require the models to be nested. For instance, see [36] for structural equation models and [37] for non-ignorable missing data. Denote f(Y \mid X, Z, \Xi_h, H_h) as the probability density of the data \{Y, X, Z\} under the model H_h, where \Xi_h is the parameter vector in the model H_h. Define f(\Xi_h \mid H_h) as the prior of \Xi_h for h = 0, 1. The Bayes factor for comparing two competing models H_0 and H_1 can be written as
B_{10} = \frac{\int f(Y \mid X, Z, \Xi_1, H_1)\, f(\Xi_1 \mid H_1)\, d\Xi_1}{\int f(Y \mid X, Z, \Xi_0, H_0)\, f(\Xi_0 \mid H_0)\, d\Xi_0} = \frac{f(Y \mid X, Z, H_1)}{f(Y \mid X, Z, H_0)},
where f(Y \mid X, Z, H_h) is the marginal likelihood for the model H_h, h = 0, 1. However, computing the Bayes factor B_{10} is a non-trivial task for our considered high-dimensional linear mixed model because of the intractable integral involved. Considerable methods have been developed to compute the marginal likelihood f(Y \mid X, Z, H_h) or the Bayes factor, for example, Laplace's method [38], annealed importance sampling [39], bridge sampling [40], path sampling (also called thermodynamic integration) [41], nested sampling [42], power posteriors [43] and a hybrid method combining simulation and asymptotic approximations [44]. For a comprehensive review, refer to [45]. Here, a path sampling or thermodynamic integration method is adopted to compute B_{10} via a link model H_{\zeta,01} = (1 - \zeta)H_0 + \zeta H_1, where \zeta is a continuous parameter taking values in the interval [0, 1]. Thus, we have H_{\zeta,01} = H_0 when \zeta = 0 and H_{\zeta,01} = H_1 when \zeta = 1. Similarly to [41], we define the following class of probability densities:
Q(\zeta) = f(Y \mid X, Z, \zeta) = \int f(Y, \zeta \mid X, Z, \Xi)\, f(\Xi)\, d\Xi,
where f ( Y , ζ | X , Z , Ξ ) is the density of Y given X and Z under H ζ and f ( Ξ ) is the prior of Ξ . Under the above definition, it is easily known that Q ( 0 ) = f ( Y | X , Z , H 0 ) and Q ( 1 ) = f ( Y | X , Z , H 1 ) . Following the argument of [41], we obtain
\log B_{10} = \log\frac{Q(1)}{Q(0)} = \int_{0}^{1} E\{U(Y, \zeta, \Xi \mid X, Z)\}\, d\zeta,
where E ( · ) represents the expectation taken with respect to the conditional density f ( Ξ , ζ | Y , X , Z ) and U ( Y , ζ , Ξ | X , Z ) = d log f ( Y , ζ , Ξ | X , Z ) / d ζ . Thus, applying the thermodynamic integration [41] or powered posteriors method [43] to Equation (26), log B 10 can be estimated by
\widehat{\log B_{10}} = \frac{1}{2}\sum_{\ell=0}^{L}\big(\zeta^{(\ell+1)} - \zeta^{(\ell)}\big)\big(\bar{U}^{(\ell+1)} + \bar{U}^{(\ell)}\big),
where 0 = \zeta^{(0)} < \zeta^{(1)} < \cdots < \zeta^{(L+1)} = 1 and \bar{U}^{(\ell)} = J^{-1}\sum_{\tau=1}^{J} U(Y, \zeta^{(\ell)}, \Xi^{(\tau)} \mid X, Z), in which \{\Xi^{(\tau)} : \tau = 1,\ldots,J\} are observations sampled from the variational density q(\Xi \mid \zeta^{(\ell)}) for \ell = 1,\ldots,L. Following [46], H_1 is selected when \widehat{\log B_{10}} > 1; otherwise, H_0 is selected.
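A minimal sketch (ours, not from the paper) of the trapezoidal estimate in Equation (27), assuming the Monte Carlo averages \bar{U}^{(\ell)} have already been computed on a grid of \zeta values:

```python
import numpy as np

def log_bayes_factor(zeta, U_bar):
    """Trapezoidal path-sampling estimate of log B_10.

    zeta  : increasing grid 0 = zeta^(0) < ... < zeta^(L+1) = 1
    U_bar : Monte Carlo averages of U(Y, zeta, Xi | X, Z) at each grid point
    """
    zeta, U_bar = np.asarray(zeta), np.asarray(U_bar)
    return 0.5 * np.sum((zeta[1:] - zeta[:-1]) * (U_bar[1:] + U_bar[:-1]))

grid = np.linspace(0.0, 1.0, 12)                      # 12 grid points, i.e. L = 10
print(log_bayes_factor(grid, np.ones_like(grid)))     # integrating a constant 1 gives 1.0
```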

5. Simulation Studies

Several simulation studies are implemented to assess the performance of the introduced variational Bayesian methodologies. For comparison, we also take the Bayesian lasso method into consideration. In this simulation study, the response variables y_{ij}'s are independently sampled from the normal distribution y_{ij} \sim N(x_{ij}^{\top}\beta + z_{ij}^{\top}b_i, \sigma_j^2), where x_{ij}, z_{ij} and b_i are independently drawn from the multivariate normal distributions N_p(0, \Sigma_x), N_q(0, I) and N_q(0, Q), respectively, for i = 1,\ldots,n, j = 1,\ldots,m. The true value of \beta is taken to be (0.5, 0.8, 2, 0.8, 0.5, 0.0, \ldots, 0.0), which implies that there are five active variables and p - 5 inactive variables. As an illustration, we set m = 6, q = 4, n = 100, 200 and 300, and p = 500, 1000 and 2000, so that n ≪ p. The true values of the \sigma_j^2's are set to be \sigma_1^2 = \sigma_2^2 = 0.8, \sigma_3^2 = \sigma_4^2 = 0.9 and \sigma_5^2 = \sigma_6^2 = 1.0. The true value of Q is taken with diagonal elements being 1.0 and the remaining components being 0.1.
We consider the following two types of covariance structures for Σ x = ( σ x j k ) p × p .
  • Type I. Components of the covariate vector x_{ij} are independent of each other, i.e., \sigma_{xjk} = 0.0 when j \neq k and \sigma_{xjj} = 1.0, for 1 \leq j, k \leq p.
  • Type II. Components of x_{ij} have an autoregressive correlation structure, i.e., \sigma_{xjk} = 0.5^{|j-k|} when j \neq k and \sigma_{xjj} = 1.0, for 1 \leq j, k \leq p (a construction sketch is given after this list).
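The two covariance structures can be generated directly; the following sketch (illustrative, not from the paper) constructs \Sigma_x for either type and draws x_{ij} \sim N_p(0, \Sigma_x):

```python
import numpy as np

def sigma_x(p, structure="I"):
    """Covariance of x_ij: identity (Type I) or AR(1) with parameter 0.5 (Type II)."""
    if structure == "I":
        return np.eye(p)
    idx = np.arange(p)
    return 0.5 ** np.abs(idx[:, None] - idx[None, :])   # sigma_{x,jk} = 0.5^{|j-k|}

rng = np.random.default_rng(2)
n, m, p = 100, 6, 500
X = rng.multivariate_normal(np.zeros(p), sigma_x(p, "II"), size=(n, m))   # x_ij ~ N_p(0, Sigma_x)
print(X.shape)   # (100, 6, 500)
```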
In implementing the variational Bayesian approach presented above together with the spike and slab priors, we take the hyperparameters \nu_0 = 1 and S_0 = 0.02 I_{q \times q}, leading to a flat prior for Q, and set a_\gamma = b_\gamma = 0.5. For the spike and slab priors of the \beta_k's, to achieve appropriate shrinkage and model selection consistency, we take c_0 = 500 and c_1 = 0.3, indicating c_1 ≪ c_0, and d_0 = 5 and d_1 = 30, implying d_0 ≪ d_1, which guarantees the sparsity of the model. In this simulation, 100 replications are conducted to select active variables and estimate model parameters. To assess the accuracy of parameter estimation via the proposed variational Bayesian method, we calculate the average value of the RMSes for the unknown parameters, where "RMS" indicates the root mean square between the Bayesian estimates based on 100 replications and the true values of the unknown parameters. To assess the performance of the variable selection procedure, we compute TP and FP, where TP represents the average number of active covariates correctly identified as active and FP denotes the average number of inactive covariates incorrectly detected as active. Generally, the closer TP is to the true number of active covariates, or the smaller FP is, the better the variable selection method behaves. Results are reported in Table 1. Examination of Table 1 shows that the proposed variational Bayesian method behaves better than the Bayesian lasso method, regardless of the values of p and n and the covariance structures, in that TP values for the former are closer to the true number of active covariates and FP values for the former are closer to zero than those for the latter. For parameter estimation, the proposed variational Bayesian method behaves better than the Bayesian lasso method in that the average values of the RMSes for the former are smaller than those for the latter, regardless of the values of p and n and the covariance structures. To investigate the sensitivity to the selection of the hyperparameters a_\gamma and b_\gamma, we take a_\gamma = 0.1 and b_\gamma = 0.9 and calculate the corresponding results for the Type I structure of \Sigma_x; the results are also given in Table 1. These empirical results indicate that the proposed variational Bayesian method is not sensitive to the hyperparameters in that the same pattern is observed regardless of the values of a_\gamma and b_\gamma.
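For concreteness, TP and FP for a single replication can be computed as below (a sketch under our own naming; averaging over the 100 replications gives the values reported in Table 1):

```python
import numpy as np

def tp_fp(gamma_true, gamma_hat):
    """TP: active covariates correctly identified as active;
       FP: inactive covariates incorrectly detected as active."""
    t = np.asarray(gamma_true, dtype=bool)
    e = np.asarray(gamma_hat, dtype=bool)
    return int(np.sum(t & e)), int(np.sum(~t & e))

truth = np.array([1, 1, 1, 1, 1, 0, 0, 0], dtype=bool)   # five truly active covariates
est   = np.array([1, 1, 1, 1, 0, 1, 0, 0], dtype=bool)   # one missed, one false positive
print(tp_fp(truth, est))   # (4, 1)
```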
As an illustration for model comparison via the proposed Bayes factor, we consider the second simulation study. In the simulation study, the data { ( x i j , z i j , y i j ) : i = 1 , , n , j = 1 , , m } are generated as those in the first simulation study with covariance structure of Σ x taken to be Type I. To this end, we consider the following competing models:
H_0: \; y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma_j^2),
H_1: \; y_{ij} = z_{ij}^{\top} b_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma_j^2),
H_2: \; y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma_0^2),
where H_0 represents the true linear mixed model, while H_1 and H_2 are two competing linear mixed models: H_1 contains only random effects without fixed effects, and H_2 misspecifies the distribution of the measurement error. We define a path t \in [0, 1] to link any two of the three models presented above. For example, H_0 and H_1 can be linked by H_{t,01}: y_{ij} = (1 - t) x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij}, which indicates that H_{t,01} is just H_0 for t = 0 and becomes H_1 for t = 1, and H_0 and H_2 are linked by H_{t,02}: y_{ij} = x_{ij}^{\top}\beta + z_{ij}^{\top} b_i + \varepsilon_{ij} with \varepsilon_{ij} independently distributed as N(0, t^2\sigma_0^2 + (1-t)^2\sigma_j^2), which implies that H_{t,02} reduces to H_0 with t = 0 and becomes H_2 with t = 1.
To calculate the estimated log Bayes factors (i.e., \widehat{\log B_{10}} and \widehat{\log B_{20}}) via the path sampling procedure proposed above, we take \zeta^{(\ell)} = \ell/L for \ell = 0, 1, \ldots, L, with L = 10, J = 1000 and \sigma_0^2 = 0.5, and the same priors as those given in the first simulation study. Results are given in Table 2, which indicates that H_0 is strongly selected, as expected, regardless of n and p.

6. An Empirical Example

As an illustration of the variational Bayesian method developed above, we consider the ADNI-2 data [47]; the ADNI study was launched in 2003 and comprises the ADNI-1, ADNI-GO and ADNI-2 phases. This study aims to predict the mini-mental state examination (MMSE) score, which is an important index for detecting Alzheimer's disease (AD) stages in that different MMSE scores indicate different progression of an AD patient. AD is the most common type of dementia in elderly people and the sixth leading cause of death in the United States, and it results in the loss of memory and the impairment of cognitive and language skills. More importantly, there is no effective treatment to slow the progression of the disease [48]. The number of AD patients has grown rapidly with the aging of the population, bringing a socioeconomic burden to both families and society [49]. Details on the ADNI database can be found at http://adni.loni.usc.edu (accessed on 20 May 2021).
The ADNI-2 data were analyzed by [48] using the factor analysis model to impute missing values. As an illustration, we utilize 340 complete magnetic resonance imaging (MRI) features with 62 samples and 3 medical visits (6-month, 12-month and 24-month), take five features among 340 features as covariates associated with random effects and set the MMSE score as the response variable. That is, n = 62 , p = 340 , q = 5 and m = 3 . In this case, covariates are high-dimensional compared with the sample size. Here, we assume that only a small fraction of covariates contribute to the response variable.
The variational Bayesian method introduced above, together with the linear mixed model and the same priors as those in the first simulation study, is utilized to fit the above-mentioned MRI data. Here, the hyperparameters are taken as \nu_0 = 1, S_0 = 0.02 I_{q \times q}, a_\gamma = b_\gamma = 0.5, c_0 = 10, d_0 = 1, c_1 = 1 and d_1 = 10 to ensure the sparsity of the model. The proposed variational Bayesian method selects three features as active variables: thickness average of the right fusiform (denoted as "x_1"), thickness standard deviation of the right posterior cingulate (denoted as "x_2") and thickness standard deviation of the left postcentral (denoted as "x_3"). Their corresponding parameter estimates are 1.9, 0.25 and 0.4, respectively, which show that the three active variables have positive effects on MMSE, consistent with the findings in [48]. Bayesian estimates of the random effects b_i are −0.003, −0.0021, −0.0013, −0.0058 and −0.0054, respectively, which imply that the selected five covariates associated with random effects have negative effects on MMSE. Table 3 also presents the RMSE and MAP values for the model with all 340 covariates (denoted as the "Complete" model) and the model with the selected three active covariates (denoted as the "Selected" model), where RMSE and MAP are evaluated by RMSE = \{n^{-1}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2\}^{1/2} and MAP = n^{-1}\sum_{i=1}^{n}|\hat{y}_i - y_i| and \hat{y}_i is the fitted value of the response y_i. Examination of Table 3 shows that the selected model has smaller RMSE and MAP values than the complete model, i.e., the selected model fits the ADNI-2 data better than the complete model. For the selected model, we also compute the Bayes factors for the three competing models H_0, H_1 and H_2 given in the second simulation study, which are \widehat{\log B_{10}} = −558 and \widehat{\log B_{20}} = −46.93, leading to the conclusion that H_0 is strongly selected.
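The two fit measures reported in Table 3 can be computed as follows (a sketch with our own function name; \hat{y}_i denotes the fitted value of y_i):

```python
import numpy as np

def rmse_map(y_hat, y):
    """Root mean squared error and mean absolute prediction error (as in Table 3)."""
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    return np.sqrt(np.mean((y_hat - y) ** 2)), np.mean(np.abs(y_hat - y))

print(rmse_map([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))   # (approx. 0.6455, 0.5)
```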

7. Discussion

This paper investigates simultaneously estimating model parameters and selecting variables in linear mixed models with high-dimensional fixed effects and low-dimensional random effects in the Bayesian framework. A novel variational Bayesian approach is developed to address the time-consuming nature of the traditional Bayesian lasso method, which stems from the ill-posed problems and large matrix computations involved in the presence of high-dimensional data. The Gaussian spike and slab priors of the population-specific fixed-effects regression coefficients are specified to identify important fixed effects by allowing the tuning parameters to tend to zero. For the sake of sampling observations, the Gaussian spike and slab priors are reformulated as a mixture of a normal distribution and an exponential distribution. In the variational Bayesian framework, the problem of best approximating the posterior density is transformed into an optimization problem, i.e., maximizing the evidence lower bound. For ease of computation, the coordinate ascent algorithm, which can be implemented efficiently, is employed to optimize the evidence lower bound. For model comparison, the Bayes factor is computed by the path sampling method. Simulation studies are conducted to investigate the performance of the proposed variational Bayesian method, and a real example is illustrated by the proposed methodologies. Empirical results show that the proposed variational Bayesian method behaves better than the traditional Bayesian lasso method in terms of the accuracy of parameter estimation, the consistency of variable selection, and computational flexibility and complexity.
The proposed variational Bayesian method has the following advantages:
  • Overcoming the problem of selecting a high-dimensional vector of shrinkage parameters required for the Bayesian lasso method;
  • Simultaneously estimating model parameters and variance–covariance matrices and selecting fixed-effects and random-effects components with a relatively low computational cost;
  • Avoiding large matrix computations and the curse of dimensionality problem;
  • Providing a flexible and efficient approach to compute the Bayes factor for model comparison.
The proposed variational Bayesian method can be extended to more complicated models, such as generalized linear mixed models with mixed discrete and missing data. However, their extensions have huge challenges, including the closed-form derivation of the optimal variational density, the specification of the priors, the learning of the data-driven hyperparameters and the computational complexity. In addition, this paper does not consider the selection of high-dimensional random effects, which is a rather challenging topic. In addition, to speed up the convergence of the chain, we might consider some important and relevant Gibbs sampling schemes, for example, the herded Gibbs sampling, which is a deterministic variant of the Gibbs sampling scheme and generates observations by matching the full-conditionals rather than by taking the full-conditionals at random [50], the recycling Gibbs sampler, which generates auxiliary observations whose information is eventually discarded and which can be recycled within the Gibbs algorithm for improving efficiency with no extra cost [51], and the blocking and parameterization method [52].
In addition, we did not consider the BIC criterion for model comparison in that BIC is only an approximation to the log marginal likelihood of the data under each hypothesis, on which the Bayes factor is based. Moreover, due to the random effects involved in the considered models, BIC behaves unstably.

Author Contributions

Conceptualization, N.T.; methodology, N.T.; software, J.Y.; validation, N.T. and J.Y.; formal analysis, N.T. and J.Y.; investigation, J.Y.; resources, N.T. and J.Y.; data curation, J.Y.; writing—original draft preparation, J.Y.; writing—review and editing, N.T.; visualization, J.Y.; supervision, N.T.; project administration, N.T.; funding acquisition, N.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Projects of the National Natural Science Foundation of China (grant number 11731011).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The ADNI database is available on the website http://adni.loni.usc.edu (accessed on 20 May 2021).

Acknowledgments

The authors are grateful for the associate editor and the three referees for their constructive comments, which largely improved an earlier manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MCMC    Markov chain Monte Carlo algorithm
EM      Expectation Maximization algorithm
ELB     evidence lower bound
TP      average number of active covariates correctly identified as active
FP      average number of inactive covariates incorrectly detected as active
RMS     root mean square between the Bayesian estimates based on 100 replications and the true value of the unknown parameter
VB      variational Bayesian (proposed) method
LASSO   Bayesian lasso method
AD      Alzheimer's Disease
ADNI    Alzheimer's Disease Neuroimaging Initiative
MRI     magnetic resonance imaging
MMSE    mini-mental state examination

Appendix A. Conditional Distributions Required in Implementing the Gibbs Sampler

By the definitions and priors of β A and β I , it is easily shown from Equation (9) that the conditional distributions f A ( β A | D , b , σ ) and f I ( β I | D ) have the forms
\beta_A \mid D, b, \sigma \sim N_r(\mu_{A0}, \Sigma_{A0}), \qquad \beta_I \mid D \sim N_{p-r}(0, \Sigma_{I0}),
respectively, where Σ A 0 1 = i = 1 n j = 1 m x i j A x i j A / σ j 2 + diag ( ξ A 0 ) with ξ A 0 = { ξ 1 k 2 , k A } , μ A 0 = Σ A 0 { i = 1 n j = 1 m x i j A ( y i j z i j b i ) / σ j 2 } and Σ I 0 1 = diag ( i = 1 n j = 1 m x i j I x i j I ) + diag ( ξ C I 0 ) = n m I p r + diag ( ξ C I 0 ) with ξ C I 0 = { ξ 0 k 2 , k I } .
The conditional distribution f ( b i | D , β , σ , Q ) has the form
b_i \mid D, \beta, \sigma, Q \sim N_q(\mu_{bC}, \Sigma_{bC}),
where Σ b C 1 = Q + j = 1 m z i j z i j / σ j 2 and μ b C = Σ b C { j = 1 m z i j ( y i j x i j β ) / σ j 2 } .
The conditional distributions f ( ξ 0 k 2 | β k , γ k ) and f ( ξ 1 k 2 | β k , γ k ) are given by
f ( ξ 0 k 2 | β k , γ k ) ( ξ 0 k 2 ) ( 1 γ k ) / 2 exp ( 1 γ k ) β k 2 / ( 2 ξ 0 k 2 ) λ 0 2 ( 1 γ k ) ξ 0 k 2 / 2 , f ( ξ 1 k 2 | β k , γ k ) ( ξ 1 k 2 ) γ k / 2 exp γ k β k 2 / ( 2 ξ 1 k 2 ) λ 1 2 γ k ξ 1 k 2 / 2 ,
respectively, which lead to
\xi_{0k}^{2} \mid \beta_k = 0, \gamma_k = 0 \sim \Gamma(1/2,\, \lambda_0^2/2), \qquad \xi_{1k}^{2} \mid \beta_k, \gamma_k = 1 \sim \mathrm{IvG}(\lambda_1^2/\beta_k^2,\, \lambda_1^2),
where IvG ( a , b ) represents the inverse Gaussian distribution with parameters a and b.
The ratio of Pr ( γ k = 1 | D , b , σ ) to Pr ( γ k = 0 | D , b , σ ) is proportional to
ρ ψ ( β k , 0 , ξ 1 k 2 ) ( 1 ρ ) ψ ( β k , 0 , ξ 0 k 2 ) exp β k i = 1 n j = 1 m ( y i j x i , C k β C k z i j b i ) x i j k σ j 2 + β k 2 2 i = 1 n j = 1 m x i j k 2 ( 1 σ j 2 ) ,
which is denoted as ϱ k , where C k = { : γ = 1 , k A } . Thus, latent variable γ k is sampled from the Bernoulli distribution with the probability ς k = ϱ k / ( ϱ k + 1 ) , i.e., γ k | D , b , σ Bernoulli ( ς k ) for k = 1 , , p .
The conditional distribution f ( Q | b ) is shown as
Q \mid b \sim \mathrm{IW}_q\Big(S_0 + \sum_{i=1}^{n} b_i b_i^{\top},\; \nu_0 + n\Big).
The conditional distribution f ( σ j 2 | D , b ) ( j = 1 , , m ) has the form
f ( σ j 2 | D , b ) ( σ j 2 ) n / 2 + c 2 1 exp 1 2 σ j 2 i = 1 n ( y i j μ i j ) 2 d 2 σ j 2 ,
which indicates
\sigma_j^{2} \mid D, b \sim \Gamma\Big(\tfrac{n}{2} + c_2,\; d_2 + \tfrac{1}{2}\sum_{i=1}^{n}(y_{ij} - \mu_{ij})^2\Big).
The conditional distribution f ( ρ | γ ) is given as
\rho \mid \gamma \sim \mathrm{Beta}\Big(a_\gamma + \sum_{k=1}^{p}\gamma_k,\; b_\gamma + p - \sum_{k=1}^{p}\gamma_k\Big).
The conditional distributions f ( λ 0 2 | ξ 0 ) and f ( λ 1 2 | ξ 1 ) are shown as
\lambda_0^2 \mid \xi_0 \sim \Gamma\Big(c_0 + p - \sum_{k=1}^{p}\gamma_k,\; d_0 + \tfrac{1}{2}\sum_{k=1}^{p}(1-\gamma_k)\,\xi_{0k}^2\Big), \qquad \lambda_1^2 \mid \xi_1 \sim \Gamma\Big(c_1 + \sum_{k=1}^{p}\gamma_k,\; d_1 + \tfrac{1}{2}\sum_{k=1}^{p}\gamma_k\,\xi_{1k}^2\Big),
respectively.

Appendix B. Calculating the Evidence Lower Bound (ELB)

Denote q ( Ξ ) to be the optimal variational density approximating the posterior density f ( Ξ | D ) and f ( Ξ ) to be the prior density of Ξ = { β , b , ξ 0 , ξ 1 , Q , γ , σ 2 , ϑ } . Define E q ( Ξ ) ( · ) as the expectation taken with respect to q ( Ξ ) . Thus, it follows from Equation (12) that ELOB has the form
L { q ( Ξ ) } = E q ( Ξ ) log f ( Ξ , Y | X , Z ) E q ( Ξ ) log q ( Ξ ) = E q ( Ξ ) log f ( Y | Ξ , X , Z ) + log f ( Ξ ) E q ( Ξ ) log q ( Ξ ) ,
where
log f ( Y | Ξ , X , Z ) n 2 j = 1 m log σ j 2 i = 1 n j = 1 m ( y i j x i j β z i j b i ) 2 2 σ j 2 ,
log f ( Ξ ) 1 2 k = 1 r r log ξ 1 k 2 β k 2 ξ 1 k 2 + 1 2 k = 1 p r ( p r ) log ξ 0 k 2 β k 2 ξ 0 k 2 1 2 trace S 0 + i = 1 n b i b i Q 1 + λ 1 2 + λ 0 2 2 + ( c 1 1 ) log λ 1 2 d 1 λ 1 2 + ( c 0 1 ) log λ 0 2 d 0 λ 0 2 n + ν 0 + q + 1 2 log | Q | + ( a γ 1 ) log ρ + ( b γ 1 ) log ( 1 ρ ) j = 1 m d 2 σ j 2 + ( c 2 1 ) j = 1 m log ( σ j 2 ) + k = 1 p γ k log ρ + ( 1 γ k ) log ( 1 ρ ) .
It follows from the definition of q ( Ξ ) that
E q ( Ξ ) { log q ( Ξ ) } = E β { log q ( β ) } + E b { log q ( b ) } + E ξ 1 { log q ( ξ 1 2 ) } + E ξ 0 { log q ( ξ 0 2 ) } + E γ { log q ( γ ) } + E Q { log q ( Q ) } + E σ { log q ( σ 2 ) } + E ρ { log q ( ρ ) } + E λ 0 { log q ( λ 0 2 ) } + E λ 1 { log q ( λ 1 2 ) } ,
where E β { log q ( β ) } r 2 log | Σ A | p r 2 log | Σ I | , E b { log q ( b ) } n 2 log | Σ b | , E ξ 1 { log q ( ξ 1 2 ) } 1 2 k = 1 p [ 3 { log a 1 ξ a 1 ξ k / ( 2 b 1 ξ k ) } + 2 b 1 ξ k / a 1 ξ k + 1 ] , E ξ 0 { log q ( ξ 0 2 ) } 1 2 k = 1 p [ 3 { log a 0 ξ a 0 ξ k / ( 2 b 0 ξ k ) } + 2 b 0 ξ k / a 0 ξ k + 1 ] , E γ { log q ( γ ) } k = 1 p { ς k log ς k + ( 1 ς k ) log ( 1 ς k ) } , E Q { log q ( Q ) } ν 0 2 log S 0 + ν 0 q 1 2 ν 0 S 0 1 2 trace ( ν 0 I q × q ) , E σ { log q ( σ ) } n d 2 j = 1 m ( i = 1 n h i j ) 1 + ( c 2 1 ) j = 1 m ( log n log i = 1 n h i j 1 / n ) , E ρ { log q ( ρ ) } ( c ρ 1 ) { log ( c ρ ) log ( c ρ + d ρ ) } d ρ ( c ρ 1 ) / { 2 c ρ ( c ρ + d ρ + 1 ) } + ( d ρ 1 ) { log ( d ρ ) log ( c ρ + d ρ ) c ρ ( d ρ 1 ) / { 2 d ρ ( c ρ + d ρ + 1 ) } , E λ 0 { log q ( λ 0 2 ) } ( a 0 λ 1 ) { Γ ˙ ( a 0 λ ) / Γ ( a 0 λ ) log b 0 λ } a 0 λ and
E λ 1 { log q ( λ 1 2 ) } ( a 1 λ 1 ) { Γ ˙ ( a 1 λ ) / Γ ( a 1 λ ) log b 1 λ ] a 1 λ .
Note that for a random variable \xi with mean E(\xi) = \mu and variance D(\xi) = \sigma^2, it follows from a Taylor expansion that the mean of the function y = f(\xi) is E(y) \approx f(\mu) + \tfrac{1}{2}\ddot{f}(\mu)D(\xi), where \ddot{f}(\cdot) denotes the second derivative of the function f(\xi).
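As a worked instance of this expansion (ours, for illustration), taking f(\xi) = \log\xi gives

% Second-order Taylor (delta-method) approximation for f(xi) = log(xi)
% with E(xi) = mu and D(xi) = sigma^2:
E\{f(\xi)\} \approx f(\mu) + \tfrac{1}{2}\,\ddot{f}(\mu)\,D(\xi), \qquad \ddot{f}(\xi) = -\xi^{-2} \;\Longrightarrow\; E(\log\xi) \approx \log\mu - \frac{\sigma^2}{2\mu^2}.

Applying this expansion term by term, we then have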
E q ( Ξ ) log f ( Y | Ξ , X , Z ) n 2 j = 1 m 1 n log n i = 1 n h i j i = 1 n j = 1 m n i = 1 n h i j [ y i j 2 2 y i j { x i j E β ( β ) + z i j E b i ( b i ) } + x i j { var β ( β ) + E β ( β ) E β ( β ) } x i j + z i j { var b i ( b i ) + E b i ( b i ) E b i ( b i ) } z i j + 2 x i j E β ( β ) E b i ( b i ) z i j ] .
Note that for a random variable \xi \sim \Gamma(\alpha, \beta), we have E\{\log(\xi)\} = \dot{\Gamma}(\alpha)/\Gamma(\alpha) - \log(\beta), where \dot{\Gamma}(\cdot) denotes the first derivative of the gamma function. Thus, we have
E q ( Ξ ) { log f ( Ξ ) } 1 2 k = 1 r r log a 1 ξ k a 1 ξ k 2 b 1 ξ k { var β k ( β k ) + ( E β k ( β k ) ) 2 } E ξ 1 k ( ξ 1 k 2 ) + 1 2 k = 1 p r ( p r ) log a 0 ξ k a 0 ξ k 2 b 0 ξ k { var β k ( β k ) + ( E β k ( β k ) ) 2 } E ξ 0 k ( ξ 0 k 2 ) 1 2 i = 1 n E b i ( b i ) E Q ( Q ) E b i ( b i ) + E λ 1 ( λ 1 2 ) + E λ 0 ( λ 0 2 ) 2 + ( c 1 1 ) Γ ˙ ( a 1 λ ) Γ ( a 1 λ ) log ( b 1 λ ) d 1 E λ 1 ( λ 1 2 ) + ( c 0 1 ) Γ ˙ ( a 0 λ ) Γ ( a 0 λ ) log b 0 λ d 0 E λ 0 ( λ 0 2 ) + n + ν 0 q 1 2 log | S 0 ν 0 | var Q | Q | 2 | S 0 ν 0 | 2 1 2 trace { S 0 1 E Q ( Q ) } + ( a γ 1 ) log c ρ c ρ + d ρ d ρ 2 c ρ ( c ρ + d ρ + 1 ) + ( b γ 1 ) log d ρ c ρ + d ρ c ρ 2 d ρ ( c ρ + d ρ + 1 ) n d 2 j = 1 m ( i = 1 n h i j ) 1 + ( c 2 1 ) j = 1 m ( log n log i = 1 n h i j 1 / n ) + k = 1 p E γ k ( γ k ) log c ρ c ρ + d ρ d ρ 2 c ρ ( c ρ + d ρ + 1 ) + ( 1 E γ k ( γ k ) ) log d ρ c ρ + d ρ c ρ 2 d ρ ( c ρ + d ρ + 1 ) ,
where | Q | represents the determinant of matrix Q, var Q ( Q i j ) = ν 0 ( σ i j 2 + σ i i σ j j ) and σ i j is the ( i , j ) -th component of S 0 .

Appendix C. Calculating the Estimated Bayes Factor in the Second Simulation

For the model H t 01 : y i j = x i j β + ( 1 t ) z i j b i + ε i j for i = 1 , , n and j = 1 , , m , where t [ 0 , 1 ] , its first-order derivative of log joint density function has the form
U ( Y , t , Ξ | X , Z ) = i = 1 n j = 1 m { ( y i j x i j β ( 1 t ) z i j b i ) z i j b i } / σ j 2 .
In this case, U ( Y , 0 , Ξ | X , Z ) = i = 1 n j = 1 m ( y i j x i j β z i j b i ) z i j b i / σ j 2 and U ( Y , 1 , Ξ | X , Z ) = i = 1 n j = 1 m ( y i j x i j β ) z i j b i / σ j 2 .
For H t 02 : y i j = ( 1 t ) x i j β + z i j b i + ε i j for i = 1 , , n and j = 1 , , m , where t [ 0 , 1 ] , its first-order derivative of log joint density function has the form
U ( Y , t , Ξ | X , Z ) = i = 1 n j = 1 m { y i j ( 1 t ) x i j β z i j b i } x i j β / σ j 2 .
In this case, U ( Y , 0 , Ξ | X , Z ) = i = 1 n j = 1 m ( y i j x i j β z i j b i ) x i j β / σ j 2 and U ( Y , 1 , Ξ | X , Z ) = i = 1 n j = 1 m ( y i j z i j b i ) x i j β / σ j 2 .
For H t 03 : y i j = x i j β + z i j b i + ε i j with ε i j i . i . d N ( 0 , t 2 σ 0 2 + ( 1 t ) 2 σ j 2 ) for i = 1 , , n and j = 1 , , m , where t [ 0 , 1 ] , its first-order derivative of log joint density function has the form
U ( Y , t , Ξ | X , Z ) = i = 1 n j = 1 m { t σ 0 2 ( 1 t ) σ j 2 } { t 2 σ 0 2 + ( 1 t ) 2 σ j 2 } 2 ( y i j μ i j ) 2 { ( 1 t ) σ j 2 t σ 0 2 } { t 2 σ 0 2 + ( 1 t ) 2 σ j 2 } 2 .
In this case, U ( Y , 0 , Ξ | X , Z ) = i = 1 n j = 1 m { σ j 4 + ( y i j μ i j ) 2 } / σ j 2 and U ( Y , 1 , Ξ | X , Z ) = i = 1 n j = 1 m { σ 0 4 + ( y i j μ i j ) 2 } / σ 0 2 .

References

1. Lindstrom, M.J.; Bates, D.M. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated measures data. J. Am. Stat. Assoc. 1988, 83, 1014–1022.
2. Laird, N.; Lange, N.; Stram, D. Maximum likelihood computations with repeated measures: Applications of the EM algorithm. J. Am. Stat. Assoc. 1987, 82, 97–105.
3. Zeger, S.L.; Karim, M.R. Generalized linear models with random effects: A Gibbs sampling approach. J. Am. Stat. Assoc. 1991, 86, 79–86.
4. Gilks, W.R.; Wang, C.C.; Yvonnet, B.; Coursaget, P. Random-effects models for longitudinal data using Gibbs sampling. Biometrics 1993, 49, 441–453.
5. Chen, Z.; Dunson, D.B. Random effects selection in linear mixed models. Biometrics 2003, 59, 762–769.
6. Ahn, M.; Zhang, H.H.; Lu, W. Moment-based method for random effects selection in linear mixed models. Stat. Sin. 2012, 22, 1539–1562.
7. Bondell, H.D.; Krishna, A.; Ghosh, S.K. Joint variable selection of fixed and random effects in linear mixed-effects models. Biometrics 2010, 66, 1069–1077.
8. Ibrahim, J.G.; Zhu, H.; Garcia, R.I.; Guo, R. Fixed and random effects selection in mixed effects models. Biometrics 2011, 67, 495–503.
9. Schelldorfer, J.; Buhlmann, P.; Van De Geer, S. Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand. J. Stat. 2011, 38, 197–214.
10. Fan, Y.; Li, R. Variable selection in linear mixed effects models. Ann. Stat. 2012, 40, 2043–2068.
11. Li, Y.; Wang, S.J.; Song, P.X.K.; Wang, N.; Zhou, L.; Zhu, J. Doubly regularized estimation and selection in linear mixed-effects models for high-dimensional longitudinal data. Stat. Interface 2018, 11, 721–737.
12. Bradic, J.; Claeskens, G.; Gueuning, T. Fixed effects testing in high-dimensional linear mixed models. J. Am. Stat. Assoc. 2020, 115, 1835–1850.
13. Li, S.; Cai, T.T.; Li, H. Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach. J. Am. Stat. Assoc. 2021, 1–12.
14. Berger, J.; Bernardo, J.M. Reference priors in a variance components problem. In Bayesian Analysis in Statistics and Econometrics; Lecture Notes in Statistics; Goel, P., Ed.; Springer: New York, NY, USA, 1992; Volume 75, pp. 177–194.
15. George, E.I.; McCulloch, R.E. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 1993, 88, 881–889.
16. Ishwaran, H.; Rao, J.S. Spike and slab gene selection for multigroup microarray data. J. Am. Stat. Assoc. 2005, 100, 764–780.
17. Polson, N.G.; Scott, J.G. Local shrinkage rules, Lévy processes and regularized regression. J. R. Stat. Soc. 2012, 74, 287–311.
18. Narisetty, N.N.; He, X. Bayesian variable selection with shrinking and diffusing priors. Ann. Stat. 2014, 42, 789–817.
19. Park, T.; Casella, G. The Bayesian Lasso. J. Am. Stat. Assoc. 2008, 103, 681–686.
20. Griffin, J.E.; Brown, P.J. Bayesian adaptive lassos with non-convex penalization. Aust. N. Z. J. Stat. 2011, 53, 423–442.
21. Rockova, V.; George, E.I. EMVS: The EM approach to Bayesian variable selection. J. Am. Stat. Assoc. 2014, 109, 828–846.
22. Latouche, P.; Mattei, P.A.; Bouveyron, C.; Chiquet, J. Combining a relaxed EM algorithm with Occam’s razor for Bayesian variable selection in high-dimensional regression. J. Multivar. Anal. 2016, 146, 177–190.
23. Narisetty, N.N.; Shen, J.; He, X. Skinny Gibbs: A consistent and scalable Gibbs sampler for model selection. J. Am. Stat. Assoc. 2019, 114, 1205–1217.
24. Wipf, D.P.; Rao, B.D.; Nagarajan, S. Latent variable Bayesian models for promoting sparsity. IEEE Trans. Inf. Theory 2011, 57, 6236–6255.
25. Ghahramani, Z.; Beal, M.J. Variational inference for Bayesian mixtures of factor analysers. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; Volume 12, pp. 449–455.
26. Attias, H. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; Volume 12, pp. 209–215.
27. Wu, Y.; Tang, N.S. Variational Bayesian partially linear mean shift models for high-dimensional Alzheimer’s disease neuroimaging data. Stat. Med. 2022, in press.
28. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
29. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
30. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
31. Rockova, V.; George, E.I. The Spike-and-Slab Lasso. J. Am. Stat. Assoc. 2018, 113, 431–444.
32. Leng, C.; Tran, M.N.; Nott, D. Bayesian adaptive Lasso. Ann. Inst. Stat. Math. 2014, 66, 221–244.
33. Beal, M.J. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Thesis, University of London, London, UK, 2003.
34. Bishop, C. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
35. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
36. Lee, S.Y.; Song, X.Y. Model comparison of nonlinear structural equation models with fixed covariates. Psychometrika 2003, 68, 27–47.
37. Lee, S.Y.; Tang, N.S. Bayesian analysis of nonlinear structural equation models with nonignorable missing data. Psychometrika 2005, 71, 541–564.
38. Tierney, L.; Kadane, J.B. Accurate approximations for posterior moments and marginal densities. J. Am. Stat. Assoc. 1986, 81, 82–86.
39. Neal, R.M. Annealed importance sampling. Stat. Comput. 2001, 11, 125–139.
40. Meng, X.L.; Wong, W. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Stat. Sin. 1996, 6, 831–860.
41. Gelman, A.; Meng, X.L. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Stat. Sci. 1998, 13, 163–185.
42. Skilling, J. Nested sampling for general Bayesian computation. Bayesian Anal. 2006, 1, 833–859.
43. Friel, N.; Pettitt, A.N. Marginal likelihood estimation via power posterior. J. R. Stat. Soc. 2008, 70, 589–607.
44. DiCiccio, T.; Kass, R.; Raftery, A.; Wasserman, L. Computing Bayes factor by combining simulation and asymptotic approximations. J. Am. Stat. Assoc. 1997, 92, 903–915.
45. Llorente, F.; Martino, L.; Delgado, D.; Lopez-Santiago, J. Marginal likelihood computation for model selection and hypothesis testing: An extensive review. arXiv 2022, arXiv:2005.08334.
46. Kass, R.E.; Raftery, A.E. Bayes factors. J. Am. Stat. Assoc. 1995, 90, 773–795.
47. Jack, C.; Bernstein, M.; Fox, N.; Thompson, P.; Alexander, G.; Harvey, D.; Borowski, B.; Britson, P.; Whitwell, J.; Ward, C. The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. J. Magn. Reson. Imaging 2008, 27, 685–691.
48. Zhang, Y.Q.; Tang, N.S.; Qu, A. Imputed factor regression for high-dimensional block-wise missing data. Stat. Sin. 2020, 30, 631–651.
49. Brookmeyer, R.; Johnson, E.; Ziegler-Graham, K.; Arrighi, H. Forecasting the global burden of Alzheimer’s disease. Alzheimers Dement. 2007, 3, 186–191.
50. Chen, Y.; Bornn, L.; De Freitas, N.; Eskelin, M.; Fang, J.; Welling, M. Herded Gibbs sampling. J. Mach. Learn. Res. 2016, 17, 263–291.
51. Martino, L.; Elvira, V.; Camps-Valls, G. The recycling Gibbs sampler for efficient learning. Digit. Signal Process. 2018, 74, 1–13.
52. Roberts, G.O.; Sahu, S.K. Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. J. R. Stat. Soc. 1997, 59, 291–317.
Table 1. Performance of variable selection and parameter estimation in the first simulation study.

(a_γ, b_γ)   Σ_x   n    Method |      p = 500       |      p = 1000      |      p = 2000
                               |  TP    FP    RMS   |  TP    FP    RMS   |  TP    FP    RMS
(0.5, 0.5)   I     100  VB     | 3.91  0.00  0.11   | 3.79  0.00  0.08   | 3.84  0.00  0.06
                        LASSO  | 4.44  0.87  1.90   | 3.54  1.03  1.66   | 1.39  0.00  1.91
                   200  VB     | 4.71  0.00  0.11   | 4.68  0.00  0.08   | 4.65  0.00  0.06
                        LASSO  | 4.95  0.24  2.24   | 2.78  1.91  1.36   | 3.34  0.00  1.64
                   300  VB     | 4.89  0.00  0.11   | 4.81  0.00  0.08   | 4.91  0.00  0.06
                        LASSO  | 4.99  0.01  2.12   | 4.91  0.00  1.41   | 4.23  0.00  1.45
             II    100  VB     | 3.79  0.00  0.11   | 3.84  0.00  0.08   | 3.76  0.00  0.06
                        LASSO  | 3.48  0.10  2.19   | 3.01  0.00  1.87   | 3.00  0.00  2.01
                   200  VB     | 3.97  0.00  0.11   | 3.96  0.00  0.08   | 3.98  0.00  0.06
                        LASSO  | 3.59  0.02  2.44   | 3.12  0.00  1.78   | 3.00  0.00  1.84
                   300  VB     | 3.98  0.00  0.11   | 3.96  0.00  0.08   | 3.98  0.00  0.06
                        LASSO  | 3.63  0.03  2.31   | 3.20  0.00  1.79   | 3.01  0.00  1.75
(0.1, 0.9)   I     100  VB     | 3.88  0.00  0.11   | 3.79  0.00  0.08   | 3.84  0.00  0.06
                        LASSO  | 4.44  0.87  1.90   | 3.54  1.03  1.66   | 1.39  0.00  1.91
                   200  VB     | 4.71  0.00  0.11   | 4.66  0.00  0.08   | 4.64  0.00  0.06
                        LASSO  | 4.95  0.24  2.24   | 2.78  1.91  1.36   | 3.34  0.00  1.64
                   300  VB     | 4.89  0.00  0.11   | 4.81  0.00  0.08   | 4.91  0.00  0.06
                        LASSO  | 4.99  0.01  2.12   | 4.91  0.00  1.41   | 4.23  0.00  1.45
Note: VB represents the variational Bayesian method and LASSO denotes the Bayesian lasso method.
Table 2. Estimated log Bayes factor in the second simulation study.

                         n    | p = 500 | p = 1000 | p = 2000
log B_10 (estimated)     100  |  −194   |  −102    |  −86
                         200  |  −372   |  −272    |  −294
                         300  |  −506   |  −544    |  −588
log B_20 (estimated,     100  |  −0.95  |  −4.03   |  −1.41
  ×10^7)                 200  |  −1.54  |  −3.68   |  −2.54
                         300  |  −3.13  |  −3.58   |  −2.26
Table 3. Performance of variational Bayesian method for the complete and selected models in the ADNI-2 data.

Model      n    p     RMSE    MAP
Complete   62   340   49.17   49.15
Selected   62   3     1.05    0.82