2.1. Quantile Regression
Consider a dataset of $n$ independent subjects. For the $i$th subject, let $y_i$ be the response, while $x_i = (1, x_{i1}, \ldots, x_{ir})^{\top}$ is an $(r+1) \times 1$ predictor vector. A simple linear regression model is defined as follows:
$$y_i = x_i^{\top}\beta + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$
where $\beta = (\beta_0, \beta_1, \ldots, \beta_r)^{\top}$ is the regression coefficient vector with $\beta_0$ corresponding to the intercept term, and $\varepsilon_i$ represents the error term with unknown distribution. It is usual to assume that the $\tau$th quantile of the random error term is 0; that is, $P(\varepsilon_i \le 0 \mid x_i) = \tau$ for $\tau \in (0, 1)$. According to this assumption, the $\tau$th quantile regression form of model (1) is specified as follows:
$$Q_{y_i}(\tau \mid x_i) = x_i^{\top}\beta(\tau), \qquad (2)$$
where $Q_{y_i}(\tau \mid x_i) = F_{y_i}^{-1}(\tau \mid x_i)$ is the inverse cumulative distribution function of $y_i$, given $x_i$, evaluated at $\tau$. The estimate of the regression coefficient vector $\beta(\tau)$ in Equation (2) is
$$\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_{\tau}(y_i - x_i^{\top}\beta), \qquad (3)$$
where the loss function $\rho_{\tau}(u) = u\{\tau - I(u < 0)\}$ with the indicator function $I(\cdot)$.
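For concreteness, the following minimal sketch (ours, not from the paper; the data-generating step and all variable names are illustrative assumptions) estimates $\hat{\beta}(\tau)$ by numerically minimizing the check loss in Equation (3):

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (u < 0))

def fit_quantile_regression(X, y, tau):
    """Minimize sum_i rho_tau(y_i - x_i' beta); X carries the intercept column."""
    objective = lambda beta: np.sum(check_loss(y - X @ beta, tau))
    beta0 = np.zeros(X.shape[1])
    # The loss is convex but nondifferentiable at 0, so use a derivative-free method.
    return minimize(objective, beta0, method="Nelder-Mead").x

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_t(df=3, size=n)
print(fit_quantile_regression(X, y, tau=0.5))  # roughly [1.0, 2.0]
```

At scale, a dedicated solver (for instance, the linear-programming formulation used by statsmodels' QuantReg) is preferable; the sketch is only meant to make the loss in Equation (3) concrete.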
In light of [5,24], minimizing Equation (3) is equivalent to maximizing the likelihood of $n$ independent individuals with the $i$th one distributed as an asymmetric Laplace distribution (ALD), specified as
$$f(y_i \mid \mu_i, \sigma, \tau) = \frac{\tau(1-\tau)}{\sigma}\exp\left\{-\rho_{\tau}\left(\frac{y_i - \mu_i}{\sigma}\right)\right\}, \qquad (4)$$
where the location parameter $\mu_i = x_i^{\top}\beta$, the scale parameter $\sigma > 0$, and the skewness parameter $\tau$ is between 0 and 1; obviously, the ALD reduces to a symmetric Laplace distribution when $\tau = 0.5$. However, it is computationally infeasible to carry out statistical inference based directly on Equation (4), which involves the nondifferentiable point $y_i = \mu_i$. Following [25], Equation (4) can be rewritten in the following hierarchical fashion:
$$y_i = x_i^{\top}\beta + k_1 e_i + k_2\sqrt{\sigma e_i}\, z_i, \quad z_i \sim N(0, 1), \quad e_i \sim \mathrm{Exp}(\sigma), \qquad (5)$$
where $k_1 = \frac{1 - 2\tau}{\tau(1-\tau)}$, $k_2^2 = \frac{2}{\tau(1-\tau)}$, $z_i$ and $e_i$ are mutually independent, and $\mathrm{Exp}(\sigma)$ denotes the exponential distribution with mean $\sigma$, whose specific density function is $f(e_i) = \sigma^{-1}\exp(-e_i/\sigma)$ for $e_i > 0$. Equation (5) illustrates that an asymmetric Laplace distribution can also be represented as a mixture of exponential and standard normal distributions, which allows us to express a quantile regression model as a normal regression model, in which the response has the following conditional distribution:
$$y_i \mid e_i \sim N\!\left(x_i^{\top}\beta + k_1 e_i,\; k_2^2\,\sigma e_i\right).$$
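As a quick numerical illustration (ours, not the paper's), the following sketch samples from the hierarchy in Equation (5), with $k_1$ and $k_2$ as defined above, and confirms the defining ALD property that the $\tau$th quantile of $y_i$ equals $\mu_i$, i.e., $P(y_i \le \mu_i) = \tau$:

```python
import numpy as np

rng = np.random.default_rng(1)
tau, sigma, mu = 0.3, 2.0, 1.5
k1 = (1 - 2 * tau) / (tau * (1 - tau))
k2 = np.sqrt(2 / (tau * (1 - tau)))

m = 1_000_000
e = rng.exponential(scale=sigma, size=m)   # e_i ~ Exp with mean sigma
z = rng.normal(size=m)                     # z_i ~ N(0, 1)
y = mu + k1 * e + k2 * np.sqrt(sigma * e) * z

print(np.mean(y <= mu))  # approximately tau = 0.3
```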
For the above-defined quantile regression model with a high-dimensional covariate vector ($r$ is large), it is of interest to estimate the parameter vector and to identify the critical covariates. To this end, we consider Bayesian quantile regression based on the spike-and-slab lasso, as follows:
2.2. Bayesian Quantile Regression Based on a Spike-and-Slab Lasso
As early as 2016, Xi et al. [23] applied a spike-and-slab prior to Bayesian quantile regression, but their proposed prior was a mixture of a point mass at zero and a normal distribution with large variance, and the estimate of the posterior density was obtained using a Gibbs sampler. To offer novel theoretical insights into a class of continuous spike-and-slab priors, Rockova (2018) [26] introduced a novel family of spike-and-slab priors, namely a mixture of two density functions weighted by the spike and slab probabilities. In this paper, we adopt a spike-and-slab lasso prior, a mixture of two Laplace distributions with large and small variance, respectively [26], which facilitates the variational Bayesian technique for approximating the posterior density of the parameters and improves the efficiency of the algorithm. In light of ref. [26], given the indicator $\gamma_j = 0$ or 1, the prior of the regression coefficient $\beta_j$ in the Bayesian quantile regression model (5) can be written as
$$\pi(\beta_j \mid \gamma_j) = \gamma_j\,\psi(\beta_j \mid \lambda_1) + (1 - \gamma_j)\,\psi(\beta_j \mid \lambda_0), \quad j = 1, \ldots, r, \qquad (6)$$
where the Laplace density $\psi(\beta \mid \lambda) = \frac{\lambda}{2}\exp(-\lambda|\beta|)$, with precision parameters $\lambda_0$ and $\lambda_1$ satisfying $\lambda_0 \gg \lambda_1$. Given the indicator variable set $\gamma = (\gamma_1, \ldots, \gamma_r)^{\top}$, the $j$th variable is active when $\gamma_j = 1$, and inactive otherwise. Similarly to [27], the Laplace distribution for the regression coefficient $\beta_j$ can be represented as a mixture of a normal distribution and an exponential distribution; specifically, the distribution of $\beta_j$ can be expressed as a hierarchical structure, as follows:
$$\beta_j \mid s_j \sim N(0, s_j), \quad p(s_j \mid \gamma_j, \lambda_0^2, \lambda_1^2) = \gamma_j\,\frac{\lambda_1^2}{2}\exp\!\left(-\frac{\lambda_1^2 s_j}{2}\right) + (1 - \gamma_j)\,\frac{\lambda_0^2}{2}\exp\!\left(-\frac{\lambda_0^2 s_j}{2}\right), \quad \gamma_j \mid \pi_0 \sim \mathrm{Bernoulli}(\pi_0), \qquad (7)$$
where $\mathrm{Bernoulli}(\pi_0)$ denotes the Bernoulli distribution, with $\pi_0$ being the probability that the indicator variable $\gamma_j$ equals one for $j = 1, \ldots, r$; we specify the prior of $\pi_0$ as a Beta distribution $\mathrm{Beta}(a_{\pi}, b_{\pi})$ with hyperparameters $a_{\pi}$ and $b_{\pi}$. The precision parameters $\lambda_0^2$ and $\lambda_1^2$ are regularization parameters used to identify important variables, for which we consider the following conjugate priors:
$$\lambda_0^2 \sim \mathrm{Gamma}(a_0, b_0), \quad \lambda_1^2 \sim \mathrm{Gamma}(a_1, b_1),$$
where $\mathrm{Gamma}(a, b)$ denotes the gamma distribution with shape parameter $a$ and scale parameter $b$. As mentioned above, $\lambda_0$ and $\lambda_1$ should satisfy $\lambda_0 \gg \lambda_1$; to this end, we select the hyperparameters $(a_0, b_0)$ and $(a_1, b_1)$ so that this constraint holds in expectation. The prior of the scale parameter $\sigma$ in (5) is an inverse gamma distribution $\mathrm{IG}(a_{\sigma}, b_{\sigma})$, with the hyperparameters $a_{\sigma}$ and $b_{\sigma}$ taken to be small values in this paper, leading to an almost non-informative prior.
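To make the prior concrete, the following quick simulation check (ours, with hypothetical values for $\pi_0$, $\lambda_0$, and $\lambda_1$) verifies that the hierarchy in Equation (7) marginally recovers the spike-and-slab lasso mixture of two Laplace densities in Equation (6):

```python
import numpy as np

rng = np.random.default_rng(3)
pi0, lam0, lam1 = 0.2, 10.0, 1.0   # spike precision lam0 >> slab precision lam1
m = 500_000

gamma = rng.random(m) < pi0                       # gamma_j ~ Bernoulli(pi0)
rate = np.where(gamma, lam1**2 / 2, lam0**2 / 2)  # exponential rate for s_j
s = rng.exponential(scale=1.0 / rate)             # s_j | gamma_j
beta = rng.normal(0.0, np.sqrt(s))                # beta_j | s_j ~ N(0, s_j)

# The marginal tail P(|beta| > t) should match the two-Laplace mixture tail
# pi0 * exp(-lam1 * t) + (1 - pi0) * exp(-lam0 * t).
t = 1.0
print(np.mean(np.abs(beta) > t),
      pi0 * np.exp(-lam1 * t) + (1 - pi0) * np.exp(-lam0 * t))
```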
Under the Bayesian statistical paradigm, based on the above priors and the likelihood of quantile regression, it is required to induce the posterior distribution $\pi(\theta \mid D) \propto p(\theta, D)$, where $\theta = \{\beta, e, s, \gamma, \pi_0, \lambda_0^2, \lambda_1^2, \sigma\}$, the latent variable sets $e = \{e_1, \ldots, e_n\}$, $s = \{s_1, \ldots, s_r\}$, and $\gamma = \{\gamma_1, \ldots, \gamma_r\}$, and the observed set $D = \{y, X\}$ with the response set $y = \{y_1, \ldots, y_n\}$ and covariate set $X = \{x_1, \ldots, x_n\}$. Based on the hierarchical structure (5) of the quantile regression likelihood and the hierarchical structure (7) of the spike-and-slab prior on the regression coefficient vector $\beta$, we derive the joint density
$$p(\theta, D) = \prod_{i=1}^{n} p(y_i \mid \beta, e_i, \sigma)\,p(e_i \mid \sigma)\,\prod_{j=1}^{r} p(\beta_j \mid s_j)\,p(s_j \mid \gamma_j, \lambda_0^2, \lambda_1^2)\,p(\gamma_j \mid \pi_0)\,\pi(\pi_0)\,\pi(\lambda_0^2)\,\pi(\lambda_1^2)\,\pi(\sigma).$$
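This factorization translates directly into code. The sketch below (our illustration; the hyperparameter defaults are placeholders rather than values from the paper) evaluates $\log p(\theta, D)$ termwise, following the factorization above:

```python
import numpy as np
from scipy import stats

def log_joint(y, X, beta, e, s, gamma, lam0_sq, lam1_sq, pi0, sigma, tau,
              a0=1.0, b0=1.0, a1=1.0, b1=1.0, a_pi=1.0, b_pi=1.0,
              a_sig=0.01, b_sig=0.01):  # placeholder hyperparameters
    k1 = (1 - 2 * tau) / (tau * (1 - tau))
    k2_sq = 2 / (tau * (1 - tau))
    lp = stats.norm.logpdf(y, X @ beta + k1 * e,
                           np.sqrt(k2_sq * sigma * e)).sum()   # y_i | e_i
    lp += stats.expon.logpdf(e, scale=sigma).sum()             # e_i ~ Exp, mean sigma
    lp += stats.norm.logpdf(beta, 0.0, np.sqrt(s)).sum()       # beta_j | s_j
    rate = gamma * lam1_sq / 2 + (1 - gamma) * lam0_sq / 2     # mixture rate in (7)
    lp += stats.expon.logpdf(s, scale=1.0 / rate).sum()        # s_j | gamma_j
    lp += (gamma * np.log(pi0) + (1 - gamma) * np.log(1 - pi0)).sum()
    lp += stats.gamma.logpdf(lam0_sq, a0, scale=b0)            # conjugate priors
    lp += stats.gamma.logpdf(lam1_sq, a1, scale=b1)
    lp += stats.beta.logpdf(pi0, a_pi, b_pi)
    lp += stats.invgamma.logpdf(sigma, a_sig, scale=b_sig)
    return lp
```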
Although sampling from the aforementioned posterior is straightforward, it becomes increasingly time-consuming for higher-dimensional quantile models. To tackle this issue, we develop a faster and more efficient alternative method based on variational Bayes.
2.3. Quantile Regression with a Spike-and-Slab Lasso Penalty Based on Variational Bayes
At present, the most commonly used variational Bayesian methods for approximating posterior distributions rely on mean-field approximation theory [28], which offers the highest efficiency among variational methods, especially for parameters or parameter blocks with conjugate priors. Bayesian quantile regression must take into account that the variance of each observation differs, and each response $y_i$ corresponds to a latent variable $e_i$, which results in the algorithmic efficiency of quantile regression being lower than that of ordinary mean regression. Therefore, in this paper, we use the variational Bayesian algorithm based on the mean-field approximation, which is the most efficient such algorithm, to fit the quantile regression model with the spike-and-slab lasso penalty.
Based on variational theory, we choose densities for the random variables $\theta$ from a variational family $\mathcal{Q}$, which has the same support $\Theta$ as the posterior density $\pi(\theta \mid D)$. We approximate the posterior density $\pi(\theta \mid D)$ by a variational density $q(\theta) \in \mathcal{Q}$. The variational Bayesian method seeks the optimal approximation to $\pi(\theta \mid D)$ by minimizing the Kullback-Leibler (KL) divergence between $q(\theta)$ and $\pi(\theta \mid D)$, an optimization problem that can be expressed as
$$q^{*}(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\{q(\theta)\,\|\,\pi(\theta \mid D)\},$$
where $\mathrm{KL}\{q(\theta)\,\|\,\pi(\theta \mid D)\} = E_q[\log q(\theta)] - E_q[\log \pi(\theta \mid D)]$, which is not less than zero and equal to zero if, and only if, $q(\theta) = \pi(\theta \mid D)$. The posterior density $\pi(\theta \mid D) = p(\theta, D)/p(D)$, with the joint distribution $p(\theta, D)$ of the parameter $\theta$ and data $D$ and the marginal distribution $p(D)$ of the data. Since the KL divergence depends on $\pi(\theta \mid D)$ and hence on $p(D)$, which does not have an analytic expression for our considered model, it is rather difficult to implement the optimization problem presented above directly. It is easy to show that
$$\log p(D) = \mathrm{ELBO}(q) + \mathrm{KL}\{q(\theta)\,\|\,\pi(\theta \mid D)\},$$
in which the evidence lower bound (ELBO) is $\mathrm{ELBO}(q) = E_q[\log p(\theta, D)] - E_q[\log q(\theta)]$, with $E_q$ representing the expectation taken with respect to the variational density $q(\theta)$. Thus, minimizing $\mathrm{KL}\{q(\theta)\,\|\,\pi(\theta \mid D)\}$ is equivalent to maximizing $\mathrm{ELBO}(q)$ because $\log p(D)$ does not depend on $q(\theta)$. That is,
$$q^{*}(\theta) = \arg\max_{q \in \mathcal{Q}} \mathrm{ELBO}(q),$$
which indicates that seeking the optimal approximation to $\pi(\theta \mid D)$ becomes maximizing $\mathrm{ELBO}(q)$ over the variational family $\mathcal{Q}$. The complexity of the approximation problem is heavily related to the variational family $\mathcal{Q}$. Therefore, choosing a comparatively simple variational family $\mathcal{Q}$ over which to optimize the objective function $\mathrm{ELBO}(q)$ with respect to $q(\theta)$ is attractive.
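The decomposition $\log p(D) = \mathrm{ELBO}(q) + \mathrm{KL}$ can be checked numerically on a toy conjugate model where all three quantities are computable. The sketch below (our illustration, unrelated to the quantile model) uses $y_i \sim N(\theta, 1)$ with $\theta \sim N(0, 1)$ and a deliberately misspecified Gaussian $q$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 50
y = rng.normal(loc=1.0, size=n)

# Exact posterior theta | y ~ N(mu_post, v_post) and exact marginal likelihood.
v_post = 1.0 / (n + 1.0)
mu_post = v_post * y.sum()
log_pD = stats.multivariate_normal.logpdf(y, mean=np.zeros(n),
                                          cov=np.eye(n) + np.ones((n, n)))

# A deliberately imperfect variational density q(theta) = N(m, s2).
m, s2 = 0.5, 0.2
theta = rng.normal(m, np.sqrt(s2), size=100_000)
log_joint = (stats.norm.logpdf(theta, 0.0, 1.0)
             + stats.norm.logpdf(y[:, None], theta, 1.0).sum(axis=0))
elbo = np.mean(log_joint - stats.norm.logpdf(theta, m, np.sqrt(s2)))

# KL divergence between two univariate normals, in closed form.
kl = 0.5 * (s2 / v_post + (m - mu_post) ** 2 / v_post - 1.0 + np.log(v_post / s2))
print(elbo + kl, log_pD)  # the two values agree up to Monte Carlo error
```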
Following the commonly used approach for choosing a tractable variational family $\mathcal{Q}$ in variational studies, we consider the frequently used mean-field theory, which assumes that blocks of $\theta$ are mutually independent, each governed by the parameters of its own variational density. Accordingly, the variational density $q(\theta)$ is assumed to factorize across the blocks of $\theta$:
$$q(\theta) = \prod_{m=1}^{M} q_m(\theta_m), \qquad (8)$$
in which the form of each variational density $q_m(\theta_m)$ is unknown, but the above assumed factorization across components is predetermined. Moreover, the best solutions for the $q_m(\theta_m)$'s are achieved by maximizing $\mathrm{ELBO}(q)$ with respect to the variational densities $q_m(\theta_m)$ by the coordinate ascent method, where $\theta = (\theta_1^{\top}, \ldots, \theta_M^{\top})^{\top}$ and each $\theta_m$ can be either a scalar or a vector. This means that when the correlation between several unknown parameters or latent variables cannot be ignored, they should be put in the same block and merged into a single $\theta_m$.
Following the idea of the coordinate ascent method given in ref. [29], when fixing the other variational factors $q_l(\theta_l)$ for $l \neq m$, the optimal density $q_m^{*}(\theta_m)$, which maximizes $\mathrm{ELBO}(q)$ with respect to $q_m(\theta_m)$, is shown to take the form
$$q_m^{*}(\theta_m) \propto \exp\{E_{-m}[\log p(\theta, D)]\}, \qquad (9)$$
where $\log p(\theta, D)$ is the logarithm of the joint density function and $E_{-m}$ is the expectation taken with respect to the density $\prod_{l \neq m} q_l(\theta_l)$ of all blocks except $\theta_m$.
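Structurally, coordinate ascent cycles through the factors, replacing each $q_m$ by the optimal form in Equation (9) while holding the others fixed, until the ELBO stabilizes. The skeleton below is a generic sketch of that loop; the update functions are placeholders standing in for the closed-form expressions that Equation (10) supplies for our model:

```python
def cavi(params, update_fns, elbo_fn, tol=1e-6, max_iter=1000):
    """Generic coordinate-ascent variational inference (CAVI) loop: apply each
    closed-form factor update q_m* ∝ exp(E_{-m}[log p(theta, D)]) in turn,
    stopping when the increase in the ELBO falls below tol."""
    elbo_old = -float("inf")
    for _ in range(max_iter):
        for update in update_fns:      # one update per factor q_m
            params = update(params)
        elbo_new = elbo_fn(params)     # non-decreasing under exact CAVI updates
        if abs(elbo_new - elbo_old) < tol:
            break
        elbo_old = elbo_new
    return params
```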
According to Equations (8) and (9), we can derive the variational posterior for each parameter block as follows (see Appendix A for the details):
$$q^{*}(\beta_j) = N(\mu_{\beta_j}, \sigma_{\beta_j}^2), \quad q^{*}(\gamma_j) = \mathrm{Bernoulli}(\pi_{\gamma_j}), \quad q^{*}(s_j) = \mathrm{GIG}(\nu_{s_j}, a_{s_j}, b_{s_j}), \quad q^{*}(e_i) = \mathrm{GIG}(\nu_{e_i}, a_{e_i}, b_{e_i}),$$
$$q^{*}(\pi_0) = \mathrm{Beta}(a_{\pi}^{*}, b_{\pi}^{*}), \quad q^{*}(\lambda_0^2) = \mathrm{Gamma}(a_0^{*}, b_0^{*}), \quad q^{*}(\lambda_1^2) = \mathrm{Gamma}(a_1^{*}, b_1^{*}), \quad q^{*}(\sigma) = \mathrm{IG}(a_{\sigma}^{*}, b_{\sigma}^{*}), \qquad (10)$$
where $\mathrm{GIG}(\nu, a, b)$ denotes the generalized inverse Gaussian distribution; the explicit expressions for the variational parameters above are given in Appendix A. Throughout, $\beta_{-j}$ denotes the vector $\beta$ with its $j$th component deleted, and expectations of the form $E_q(\cdot)$ are taken with respect to the current variational densities.
In the derivation above, we obtained the variational posterior of each parameter block. Using the idea of coordinate-axis optimization, we can update each variational distribution iteratively until convergence. To this end, we present the variational Bayesian spike-and-slab lasso quantile regression (VBSSLQR) procedure in Algorithm 1:
Algorithm 1 Variational Bayesian spike-and-slab lasso quantile regression (VBSSLQR).
Input: Data $y$, predictors $X$, prior hyperparameters $a_0$, $b_0$, $a_1$, $b_1$, $a_{\pi}$, $b_{\pi}$, $a_{\sigma}$, $b_{\sigma}$, precision $\epsilon$, and quantile $\tau$;
Output: Optimized variational parameters $\mu_{\beta_j}$ and $\sigma_{\beta_j}^2$, for $j = 1, \ldots, r$, and the corresponding Bayesian confidence intervals.
Initialize: all variational parameters and the required expectations;
while the change in the ELBO exceeds $\epsilon$ do
    for $j = 1$ to $r$ do
        Update the variational parameters of $q(\beta_j)$ according to Equation (10).
        Update the variational parameters of $q(s_j)$ according to its variational posterior.
        Update the variational parameters of $q(\gamma_j)$ according to its variational posterior.
        Update the expectations involving $\beta_j$, $s_j$, and $\gamma_j$.
    end for
    Update the variational parameters of $q(\lambda_0^2)$ according to its variational posterior.
    Update the variational parameters of $q(\lambda_1^2)$ according to its variational posterior.
    Update the variational parameters of $q(\pi_0)$ according to its variational posterior.
    for $i = 1$ to $n$ do
        Update the variational parameters of $q(e_i)$ according to its variational posterior.
    end for
    Update the variational parameters of $q(\sigma)$ according to its variational posterior.
    Compute the ELBO.
end while
In Algorithm 1 above, $\Psi(\cdot)$ is the digamma function, and the expectations involving $e_i$ and $s_j$ are taken with respect to the generalized inverse Gaussian distribution. Thus, we assume that $x \sim \mathrm{GIG}(\nu, a, b)$, whose density is proportional to $x^{\nu - 1}\exp\{-(ax + b/x)/2\}$; then:
$$E(x^h) = \left(\frac{b}{a}\right)^{h/2} \frac{K_{\nu + h}(\sqrt{ab})}{K_{\nu}(\sqrt{ab})},$$
where $K_{\nu}(\cdot)$ represents the modified Bessel function of the second kind. Note that there is no analytic expression for the derivative of the modified Bessel function with respect to its order. Therefore, we approximate the corresponding expectation using a second-order Taylor expansion. This paper lists the expectations of some parameter functions with respect to the variational posteriors involved in Algorithm 1; see Appendix B for details.
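For reference, the moment formula above can be evaluated directly with standard special-function routines. The following sketch (ours) computes $E(x)$ and $E(x^{-1})$ for a GIG variate under the parameterization given above, the two expectations that the $e_i$ and $s_j$ updates require:

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind

def gig_moment(h, nu, a, b):
    """E[x^h] for x ~ GIG(nu, a, b) with density ∝ x^(nu-1) exp(-(a*x + b/x)/2)."""
    root = np.sqrt(a * b)
    return (b / a) ** (h / 2) * kv(nu + h, root) / kv(nu, root)

# E(x) and E(1/x) for illustrative parameter values.
print(gig_moment(1, nu=0.5, a=2.0, b=3.0), gig_moment(-1, nu=0.5, a=2.0, b=3.0))
```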
Based on our proposed VBSSLQR algorithm, in the next section we randomly generate high-dimensional data, conduct simulation studies, and compare its performance with that of other methods. Notably, the asymptotic variance of a quantile regression estimator is inversely proportional to the density of the errors at the quantile point. In cases where $n$ is small and we estimate extreme quantiles, the corresponding asymptotic variance will be large, resulting in less precise estimates [23]. Therefore, the regression coefficients are difficult to estimate at extreme quantiles, and reliable estimation there requires the sample size to be increased appropriately.