1. Introduction
With the continuous development of data collection and storage technology, data sets that exhibit high dimensionality and high correlations within blocks of variables raise new research problems in economics, finance, genomics, statistics, machine learning, and other fields, because such data require variable selection among highly correlated variables.
There has been significant research into variable selection, and many variable selection methods have been developed, such as regularized M-estimation, which includes the LASSO [1], SCAD [2], the elastic net [3], and the Dantzig selector [4]. Many references study the theoretical properties and algorithms of regularized M-estimation, including [5,6,7,8,9,10,11,12,13,14].
Most existing variable selection methods assume that the covariates are cross-sectionally weakly correlated, or even independent. However, these assumptions are easily violated in data sets that exhibit high dimensionality and high correlations within blocks of covariates, such as economic and financial data sets. For example, economic studies [15,16,17] show strong correlations within blocks of covariates. To deal with this problem, Fan et al. [18] proposed factor-adjusted variable selection for mean regression.
However, mean regression cannot fit skewed and heavy-tailed data well, and it is not robust against outliers. Koenker and Bassett [19] proposed quantile regression (QR) to model the relationship between the response $y$ and the covariates $\mathbf{x}$. Compared to mean regression, QR has two significant advantages. (i) QR can be used to model the entire conditional distribution of $y$ given $\mathbf{x}$, and thus it provides insightful information about the relationship between $y$ and $\mathbf{x}$. Let $F(y \mid \mathbf{x})$ denote the conditional distribution function of $Y$ given $\mathbf{x}$. For $\tau \in (0, 1)$, the $\tau$th conditional quantile of $Y$ given $\mathbf{x}$ is defined as $Q_{\tau}(Y \mid \mathbf{x}) = \inf\{y : F(y \mid \mathbf{x}) \ge \tau\}$. (ii) QR is robust against outliers and can be used to model responses whose distribution is skewed or heavy-tailed, without requiring a correct error assumption. These two advantages make QR an appealing method for revealing information in the data that is difficult for mean regression to capture. Readers can refer to Koenker [20] and Koenker et al. [21] for a comprehensive overview of the methods, theory, computation, and many extensions of QR.
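To make the check-loss mechanics concrete, the following minimal Python sketch (illustrative only, not code from the paper) verifies numerically that minimizing the average check loss $\rho_{\tau}(u) = u(\tau - \mathbf{1}\{u < 0\})$ over a constant recovers the sample quantile, and that the resulting estimate, unlike the mean, is unaffected by a single large outlier:

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check (pinball) loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def quantile_via_check_loss(x, tau):
    """The minimizer of the average check loss over a constant is attained
    at a data point, so it suffices to search over the observations."""
    losses = [np.mean(check_loss(x - c, tau)) for c in x]
    return x[int(np.argmin(losses))]

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one large outlier
print(quantile_via_check_loss(x, 0.5))       # 3.0: matches the sample median
print(np.mean(x))                            # 22.0: the mean is dragged by the outlier
```

The contrast between the two printed values illustrates the robustness property stated in (ii): the check-loss minimizer depends on the ordering of the observations, not on the magnitude of extreme values.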
Ando and Tsay [22] proposed factor-augmented predictors for quantile regression, but their model did not contain the idiosyncratic components of the covariates, which causes a loss of explanatory-variable information. Following Fan et al. [18], we therefore propose factor-augmented regularized variable selection for quantile regression (Farvsqr) to overcome the problems caused by correlations within the covariates. As usual, we assume that the covariates of the $i$-th observation, $\mathbf{x}_i$, follow an approximate factor model,
$$\mathbf{x}_i = \mathbf{B}\mathbf{f}_i + \mathbf{u}_i,$$
where $\mathbf{f}_i$ is a $K \times 1$ vector of latent factors, $\mathbf{B}$ is a $p \times K$ loading matrix, and $\mathbf{u}_i$ is a $p \times 1$ vector of idiosyncratic components or errors that are independent of $\mathbf{f}_i$.
The factor model has become one of the most popular and powerful tools in multivariate statistics and has deeply impacted biology [23,24,25], economics, and finance [15,16,26]. Chamberlain and Rothschild [27] first proposed using principal component analysis (PCA) to solve for the latent factors and loading matrix of the approximate factor model. Subsequently, much of the literature has explored the factor model using the PCA method [28,29,30,31,32]. In this paper, we use PCA to obtain the estimators of $\mathbf{B}$, $\mathbf{f}_i$, and $\mathbf{u}_i$.
The Farvsqr procedure first estimates model (1) and obtains the independent or weakly correlated estimators of $\mathbf{f}_i$ and $\mathbf{u}_i$; we then replace the highly correlated covariates $\mathbf{x}_i$ with the estimators $\hat{\mathbf{f}}_i$ and $\hat{\mathbf{u}}_i$. The second step is to solve a common regularized loss function. In this paper, we study Farvsqr by giving the specific parameter-solving process and the theoretical properties. Moreover, both simulation and real-data application studies are presented.
The main contribution of our paper is to generalize factor-adjusted regularized variable selection from mean regression to quantile regression so as to accommodate skewed and heavy-tailed data.
Section 2 introduces the smoothed quantile regression and the approximate factor models.
Section 3 introduces the variable selection methodology of Farvsqr.
Section 4 presents the general theoretical results.
Section 5 provides simulation studies, and
Section 6 applies our model to the Quarterly Database for Macroeconomic Research (FRED-QD).
5. Simulation Study
In this section, we assess the performance of the method proposed in this paper through simulation. We compare Farvsqr with LASSO and SCAD on different simulated data.
We generate the response from the sparse linear model $y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}^* + \varepsilon_i$, where $\boldsymbol{\beta}^*$ is the vector of true coefficients and the error $\varepsilon_i$ follows one of three models:
- (i)
;
- (ii)
;
- (iii)
.
The covariates are generated from one of the following two models:
- (i)
Factor model. The covariates follow the approximate factor model (1), where the factors are generated from a stationary autoregressive model, and the loadings and idiosyncratic errors are drawn from the i.i.d. standard normal distribution.
- (ii)
Equal correlation case. We draw $\mathbf{x}_i$ i.i.d. from $N(\mathbf{0}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ has diagonal elements 1 and off-diagonal elements 0.4.
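Both covariate designs are straightforward to simulate. In the sketch below, the equal-correlation design uses the $\boldsymbol{\Sigma}$ stated above (unit variances, constant correlation 0.4); for the factor design, the number of factors and the AR(1) coefficient (`K=3`, `phi=0.5`) are illustrative assumptions, since the paper's exact simulation parameters are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(1)

def equal_corr_covariates(n, p, rho=0.4):
    """Draw x_i i.i.d. from N(0, Sigma), Sigma with 1 on the diagonal and rho off it."""
    Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)
    L = np.linalg.cholesky(Sigma)
    return rng.standard_normal((n, p)) @ L.T

def factor_covariates(n, p, K=3, phi=0.5):
    """Covariates from the approximate factor model with AR(1) factor dynamics.
    phi = 0.5 and K = 3 are illustrative choices, not the paper's values."""
    B = rng.standard_normal((p, K))
    F = np.zeros((n, K))
    for t in range(1, n):
        F[t] = phi * F[t - 1] + rng.standard_normal(K)   # stationary AR(1) factors
    U = rng.standard_normal((n, p))                       # idiosyncratic errors
    return F @ B.T + U

X_eq = equal_corr_covariates(500, 5)
C = np.corrcoef(X_eq, rowvar=False)
off = C[~np.eye(5, dtype=bool)]
print(round(off.mean(), 2))   # close to the target correlation 0.4
```

The empirical off-diagonal correlation check confirms that the Cholesky construction reproduces the intended dependence structure.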
For the factor model, in order to comprehensively evaluate Farvsqr at a given quantile level $\tau$, we compare the influence of different sample sizes and different explanatory-variable dimensions under different error distributions. We use the estimation error $\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^*\|_2$, the average model size, the percentage of true positives (TP), the percentage of true negatives (TN), and the elapsed time to compare Farvsqr and LASSO. The percentages of TP and TN are defined as follows:
$$\mathrm{TP} = \frac{\#\{j : \hat{\beta}_j \neq 0 \ \text{and} \ \beta^*_j \neq 0\}}{\#\{j : \beta^*_j \neq 0\}}, \qquad \mathrm{TN} = \frac{\#\{j : \hat{\beta}_j = 0 \ \text{and} \ \beta^*_j = 0\}}{\#\{j : \beta^*_j = 0\}}.$$
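Concretely, the TP and TN percentages can be computed from the estimated and true coefficient vectors as in the following small Python sketch (the numerical tolerance for "nonzero" is an implementation choice):

```python
import numpy as np

def tp_tn(beta_hat, beta_true, tol=1e-8):
    """TP: fraction of truly nonzero coefficients that are selected.
       TN: fraction of truly zero coefficients that are excluded."""
    sel = np.abs(beta_hat) > tol     # selected variables
    nz = np.abs(beta_true) > tol     # truly nonzero variables
    tp = sel[nz].mean()
    tn = (~sel[~nz]).mean()
    return tp, tn

beta_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
beta_hat = np.array([0.9, 0.0, 0.2, 0.0, 0.0])
# TP = 0.5 (one of two true variables missed), TN = 2/3 (one false positive).
print(tp_tn(beta_hat, beta_true))
```

A TP of 100% with a TN below 100% is exactly the pattern reported below for LASSO: the true variables are found, but redundant correlated variables are also selected.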
We compare the model performance of Farvsqr with LASSO under different error distributions and explanatory variable relationships; for each situation, we simulate 500 replications.
We first compare the models with the explanatory variable's dimensionality $p$ fixed and several sample sizes $n$. For each sample size, we simulate 500 replications and calculate the average estimation error, average model size, TP, TN, and elapsed time. The results are presented in Table 1, Table 2 and Table 3. From the results, we can see that under the three error distributions, for each $\tau$ and $n$, the average estimation error of Farvsqr is smaller than that of LASSO; for example, in one setting under the normal distribution, the average estimation errors of Farvsqr and LASSO are 0.127 and 2.586, respectively. As for the average model size, almost all the values of Farvsqr are smaller than those of LASSO, with one exception. For TP, all scenarios are the same for Farvsqr and LASSO, so both methods select the true non-zero variables. For elapsed time, all the values of Farvsqr are smaller than those of LASSO, so our method is more efficient. From all of the above, Farvsqr outperforms LASSO. For every quantile $\tau$, the estimation error of Farvsqr gradually decreases as the sample size increases, but for LASSO the impact of sample size is not obvious. This may be because, for the factor model, LASSO is not appropriate, so even a larger sample size cannot remedy the defects of the LASSO method.
We then compare the models with the sample size $n$ fixed and the explanatory variable's dimensionality $p$ taking several values up to 600. For each dimensionality, we simulate 500 replications and calculate the average estimation error, average model size, TP, TN, and elapsed time. The results are presented in Table 4, Table 5 and Table 6. From the results, we can see that under the three error distributions, for each $\tau$ and $p$, the average estimation error of Farvsqr is smaller than that of LASSO; for example, in one setting under the normal distribution, the average estimation errors of Farvsqr and LASSO are 0.124 and 2.059, respectively. As for the average model size, all the values of Farvsqr are smaller than those of LASSO. For TP, all scenarios are the same for Farvsqr and LASSO, so both methods select the true non-zero variables. For TN, all the values of Farvsqr are larger than those of LASSO, so LASSO tends to select redundant variables. For elapsed time, all the values of Farvsqr are smaller than those of LASSO, so our method is more efficient. From all of the above, Farvsqr outperforms LASSO. For every quantile $\tau$, the average estimation error increases with the dimension, which is consistent with common sense; however, the increase for Farvsqr is smaller than that for LASSO. For example, under the normal distribution, the values of Farvsqr are 0.124 and 0.158 at the smallest and largest dimensions, respectively, a relative increase of about 27%, while the relative increase for LASSO is much larger; therefore, LASSO is vulnerable to increases in the variable dimension.
We also compare our model with LASSO under different sample sizes and explanatory-variable dimensions for the equal correlation case. Simulating 500 replications, we calculate the average estimation error, average model size, TP, TN, and elapsed time. The results are presented in Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12. From the tables, we can see that essentially all the elapsed times of Farvsqr are shorter than those of LASSO, while its estimation error is slightly larger in most situations. For a fixed explanatory-variable dimensionality, the elapsed time gradually increases with the sample size for both Farvsqr and LASSO, but the relative increase is more significant for LASSO. For example, in one setting the elapsed times of the two methods are 0.687 and 1.099 at the smaller sample size and 1.965 and 3.856 at the larger one, a relative increase of about 186% for Farvsqr versus about 251% for LASSO. So, the efficiency of LASSO is easily affected by the sample size, and it is not appropriate for large-sample data. Thus, Farvsqr pays only a small cost in the equal correlation case.
From all the results above, we can draw the following conclusions:
- (i)
When the covariates are high-dimensional with high correlations within blocks, i.e., when the covariates are generated from the factor model, our method Farvsqr is better than LASSO on all the evaluation indicators, including the average estimation error, average model size, TP, TN, and elapsed time.
- (ii)
For the factor model, the parameter-estimation accuracy of LASSO is easily degraded as the explanatory variable's dimension increases.
- (iii)
For the equal correlation case, Farvsqr pays only a small cost.
- (iv)
In all scenarios, the efficiency of LASSO is easily affected by the sample size.
To further illustrate that our method is better for data that are high-dimensional with high correlations within blocks, we also compare our method with SCAD and reach the same conclusions as for LASSO. Here, we only report the results under the normal distribution. Table 13 and Table 14 correspond, respectively, to the fixed explanatory-variable dimensionality and the fixed sample size. Note that the Farvsqr method first replaces the highly dependent covariates with weakly dependent or uncorrelated ones via the latent factor model and then minimizes (12) with the LASSO or SCAD penalty, whereas LASSO and SCAD directly minimize Formula (5), in which the covariates are highly correlated.
6. Real Data Application
In this section, we use the quarterly U.S. macroeconomic variables in the FRED-QD database [17]. The data set includes 247 variables, and the covariates in FRED-QD are strongly correlated. We choose 88 complete observations from the first quarter of 2000 to the last quarter of 2021. FRED-QD is a quarterly economic database updated by the Federal Reserve Bank of St. Louis, which is publicly available at http://research.stlouisfed.org/econ/mccracken/sel/ (accessed on 28 June 2022). Detailed information about the data can be found on the website. In this paper, we choose the variable GDP as the response and the other 246 variables as the explanatory variables. The density of the response is shown in Figure 1. We compare the proposed Farvsqr with LASSO in variable selection, estimation, and elapsed time. The estimation performance is evaluated by $R^2$, which is defined as
$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2},$$
where $y_i$ is the observed value at time $i$, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the sample mean. We model the data at given quantile levels $\tau$, and we evaluate the models by $R^2$, model size, and elapsed time.
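The goodness-of-fit criterion used here, one minus the ratio of the squared prediction error to the total variation about the sample mean, can be computed as in the following minimal sketch with toy numbers:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST: predictions compared against the sample-mean baseline."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

y = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y, [1.1, 1.9, 3.2, 3.8]))   # close to 1: predictions track y
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))   # 0.0: no better than the sample mean
```

Values near 1 indicate a good fit, 0 means the model is no better than predicting the sample mean, and negative values indicate a fit worse than the mean baseline.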
The results are presented in Table 15. From the results, we can see that the model sizes of Farvsqr are 18, 19, 38, and 38 at the four quantile levels, respectively, whereas the model sizes of LASSO are 241, 176, 207, and 222. LASSO tends to choose many correlated variables. For instance, all LASSO models include Real PCE expenditures: durable goods, Real PCE: services, Real PCE: nondurable goods, Real gross private domestic investment, Real private fixed investment, Real gross private domestic investment: fixed investment: nonresidential: equipment, and Real private fixed investment: nonresidential because of the strong correlations among them. Moreover, all LASSO models also include Number of civilians unemployed for less than 5 weeks, Number of civilians unemployed from 5 to 14 weeks, and Number of civilians unemployed from 15 to 26 weeks because of the strong correlations among them. Many other correlated variables are included by LASSO. The elapsed times of Farvsqr are 7.6209, 8.2036, 8.3589, and 8.3493 at the four quantile levels, respectively, while those of LASSO are 9.8736, 13.8031, 10.6616, and 10.1012; thus, the algorithmic efficiency of LASSO on this real data set is much lower than that of Farvsqr. This may be because LASSO selects too many redundant explanatory variables, which affects not only the estimation accuracy of the model but also the efficiency of the algorithm. In terms of $R^2$, Farvsqr is better than LASSO except at one quantile level. Therefore, Farvsqr is more suitable for this data set; more generally, for data sets with strong correlations between explanatory variables, Farvsqr is the more suitable choice.