1. Introduction
Regression analysis is a statistical technique concerned with the prediction of a response variable from one or more regressors. The standard linear regression model assumes that there is no strong correlation among the predictors, as such correlation leads to the problem of so-called multicollinearity [1]. However, in many fields of study, the regressors can be highly intercorrelated. For instance, when studying the effect of weight and age on the sugar level of a human, both predictors can be highly correlated, as weight increases with age. Similarly, if the age and price of a car are used as regressors in predicting a car's average selling time, then both predictors can be highly intercorrelated. In the presence of highly correlated regressors, ordinary least squares estimation can lead to wrong inferences. For example, the ordinary least squares (OLS) estimators become unstable in the presence of multicollinearity, as they have inflated standard errors. In the case of perfect multicollinearity, the OLS estimators cannot be estimated uniquely [2].
In such situations, a very popular and successful approach in statistical modeling is to use regularized regression techniques, such as ridge regression or the Least Absolute Shrinkage and Selection Operator (LASSO) [3,4]. The main idea behind these techniques is to introduce biased estimators by penalizing the OLS estimators, which decreases the overall standard error of the estimator. By minimizing both the empirical error and the penalty, one can find a model that fits well and is also "simple", avoiding the large variance that can occur when estimating complex models. However, ridge regression cannot generate a parsimonious model because it still retains all predictors in the model [5]. On the other hand, best-subset selection generates a sparse model, but because of its inherent discreteness, it is extremely variable, as discussed in [6]. To cope with these problems, the LASSO [4] is a good compromise. Its popularity stems in part from the regularization induced by the LASSO's $L_1$ penalty, which results in sparse solutions. The LASSO shrinks the estimated coefficient vector toward the origin (in the $L_1$ sense), with the amount of shrinkage governed by the value of k, typically setting some of the coefficients to zero. As a result, the LASSO blends the characteristics of ridge regression and subset selection, making it a useful method for variable selection. The main idea behind the LASSO is to introduce biased estimators in order to decrease the standard error of the estimator. However, bias and variance are complementary: decreasing one increases the other, and vice versa. This trade-off between bias and variance can be controlled by the LASSO parameter k, which is known as the tuning parameter. References [4,7] compared the predictive performance of the LASSO, ridge, and bridge regression and found that none of them uniformly dominated the other two [8]. Although the LASSO has proven effective in various scenarios, it has some limitations. Consider the three scenarios below:
1. In the case of $p > n$, the LASSO selects at most $n$ variables before it saturates, because of the nature of the convex optimization problem. This seems to be a limiting feature for a variable selection method. Moreover, the LASSO is not well defined unless the bound on the $L_1$-norm of the coefficients is smaller than a certain value [9].
2. The LASSO cannot perform group selection. If a group of predictors is highly correlated with one another, the LASSO tends to pick only one of them and shrink the others to zero [10] (illustrated in the sketch after this list).
3. For usual $n > p$ situations, if there are high correlations between the predictors, then it has been empirically observed that the prediction performance of the LASSO is dominated by ridge regression [4].
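As a quick illustration of the second scenario, the following sketch (assuming scikit-learn and numpy; the data are illustrative and not taken from the paper) fits the LASSO and ridge to two nearly identical predictors. The LASSO typically keeps one predictor and drops the other, while ridge splits the weight roughly evenly between them.

```python
# Minimal sketch of scenario 2: with two nearly identical predictors, the
# LASSO keeps one and shrinks the other to exactly zero, while ridge splits
# the weight between them. Illustrative data, not from the paper.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)   # x2 is almost perfectly correlated with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 2.0 * x2 + rng.standard_normal(n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print("LASSO coefficients:", lasso.coef_)   # typically one nonzero, one exactly 0
print("Ridge coefficients:", ridge.coef_)   # weight split roughly evenly, near (2, 2)
```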
To address these limitations, several modifications of the LASSO have been proposed in the literature, namely the adaptive LASSO [10], the fused LASSO [11], the group LASSO [12], the elastic net [9], the degrees of freedom of the LASSO [13], and the square-root LASSO [14]. In addition, different researchers have proposed different methods for the estimation of the LASSO parameter k; see, for example, [15,16,17,18,19,20,21,22] and the references cited therein.
This paper aims to study the performance of the LASSO and adaptive LASSO in handling severe multicollinearity among independent variables in the context of multiple regression analysis, using Monte Carlo simulations and real-life examples. Furthermore, we propose some new estimators of the LASSO parameter k using a quantile-based approach and compare them with existing estimators to assess the performance of the proposals.
The rest of this paper is structured as follows. The general methodology, as well as the proposed methods for the estimation of the LASSO parameter k, is described in Section 2. Section 3 contains information about the simulation settings, while the simulation results are discussed in Section 4. In Section 5, the performance of the proposals, as well as that of the existing LASSO methods, is evaluated using real data. Finally, some concluding remarks are given in Section 6.
2. Methodology
Consider the following linear regression model:
$$y = X\beta + \varepsilon, \qquad (1)$$
where $y$ is an $n \times 1$ vector of the response variable, $X$ is an $n \times p$ matrix (also known as the design matrix) of the observed regressors, $\beta$ is a $p \times 1$ vector of unknown regression parameters, and $\varepsilon$ is an $n \times 1$ vector of random errors. It is assumed that $\varepsilon$ is normally distributed with a zero mean and a covariance matrix $\sigma^2 I_n$, where $I_n$ represents an identity matrix of order $n$. In general, the parameter $\beta$ is estimated using the OLS, which minimizes the following squared differences:
$$\hat{\beta}_{OLS} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2, \qquad (2)$$
where $\lVert \cdot \rVert_2$ denotes the $L_2$ norm. As a result, the OLS estimator $\hat{\beta}_{OLS}$ can be estimated as follows:
$$\hat{\beta}_{OLS} = (X'X)^{-1} X'y,$$
and its covariance can be computed as follows:
$$\operatorname{Cov}(\hat{\beta}_{OLS}) = \sigma^2 (X'X)^{-1}.$$
Note that both the estimator and its covariance depend heavily on the matrix $X'X$. However, it is well known that in the presence of high correlation among the predictors, the matrix $X'X$ is ill-conditioned; consequently, $\hat{\beta}_{OLS}$ is highly unstable and has a large variance [5]. To cope with this issue, the LASSO is an alternative estimation procedure, defined as
$$\hat{\beta}_{L} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + k \sum_{j=1}^{p} |\beta_j|, \qquad (3)$$
where the first term assesses the fit while the second term penalizes the parameter $\beta$. The parameter $k$ is called the LASSO parameter, and it governs the trade-off between the fit and the penalty. Thus, the choice of $k$ is an important task in conducting LASSO regression.
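To make the role of $k$ concrete, here is a minimal sketch assuming scikit-learn, whose `alpha` plays the role of $k$ up to the $1/(2n)$ scaling of the fit term in its objective; the data and true coefficients are illustrative.

```python
# Sketch of the fit-penalty trade-off in Equation (3): as the LASSO parameter
# k (scikit-learn's `alpha`, up to scaling) grows, more coefficients are set
# to exactly zero. Illustrative data, not from the paper.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.standard_normal((n, p))
beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])  # sparse true coefficients
y = X @ beta + rng.standard_normal(n)

for k in [0.01, 0.1, 0.5, 1.0]:
    fit = Lasso(alpha=k).fit(X, y)
    print(f"k={k:4}: nonzero coefficients = {np.sum(fit.coef_ != 0)}")
```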
Different extensions have been introduced for the LASSO. For example, the adaptive LASSO seeks to minimize
$$\hat{\beta}_{AL} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 + k \sum_{j=1}^{p} \hat{w}_j |\beta_j|, \qquad (4)$$
where $k$ is the adaptive LASSO parameter, $\hat{\beta}_{AL}$ represents the estimated coefficients, and $\hat{w} = (\hat{w}_1, \ldots, \hat{w}_p)'$ is called the adaptive weights vector, which is defined as follows:
$$\hat{w}_j = \frac{1}{|\hat{\beta}_j^{(0)}|^{\gamma}},$$
where $\hat{\beta}^{(0)}$ refers to an initial estimate of the coefficients and $\gamma$ is a positive constant for adjustment of the adaptive weights vector [10].
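A common way to compute the adaptive LASSO is to absorb the weights into the design matrix and solve a plain LASSO. The sketch below, assuming scikit-learn and an OLS initial estimate, follows this route; it is one standard computational device, not the paper's own code.

```python
# Sketch of the adaptive LASSO of Equation (4) via column rescaling: absorb the
# weights w_j = 1/|beta_init_j|^gamma into X, solve a plain LASSO, scale back.
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, k=0.1, gamma=1.0):
    # Initial OLS estimate used to build the adaptive weights.
    beta_init, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)  # small offset avoids division by zero
    X_scaled = X / w                # dividing column j by w_j absorbs the weight
    fit = Lasso(alpha=k).fit(X_scaled, y)
    return fit.coef_ / w            # transform back to the original scale
```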
As the LASSO estimate is often a nonlinear and non-differentiable function of the response values, it is difficult to obtain a reliable estimate of its standard error. One approach is the bootstrap: either $k$ can be fixed, or we can optimize over $k$ for each bootstrap sample. Fixing $k$ is analogous to selecting the best subset and then using the least squares standard error for that subset.
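A minimal sketch of the fixed-$k$ bootstrap just described, assuming scikit-learn; the number of bootstrap samples is an illustrative choice.

```python
# Fixed-k bootstrap standard errors for the LASSO coefficients: resample rows
# of (X, y) with replacement, refit with the same k, take coefficient SDs.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_bootstrap_se(X, y, k=0.1, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    boot_coefs = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample observations
        boot_coefs[b] = Lasso(alpha=k).fit(X[idx], y[idx]).coef_
    return boot_coefs.std(axis=0, ddof=1)         # bootstrap standard errors
```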
Although a closed-form solution for the LASSO estimator is not possible because the solution is nonlinear in the response variable, an approximate closed-form estimate may be derived by writing the penalty $\sum_{j=1}^{p} |\beta_j|$ as $\sum_{j=1}^{p} \beta_j^2 / |\beta_j|$. Hence, at the LASSO estimate $\tilde{\beta}$, we may approximate the solution by a ridge regression of the form $\hat{\beta}^{*} = (X'X + kW^{-})^{-1} X'y$, where $W$ is a diagonal matrix with diagonal elements $|\tilde{\beta}_j|$, $W^{-}$ denotes the generalized inverse of $W$, and $k$ is chosen so that the bound on the $L_1$-norm of the coefficients, $\sum_{j=1}^{p} |\hat{\beta}_j^{*}| = t$, is satisfied [4]. Thus, the LASSO estimation problem becomes a ridge estimation.
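This ridge approximation suggests a simple iterative scheme: refit the ridge form with $W$ recomputed from the current estimate until the coefficients stabilize. The sketch below assumes numpy; the tolerance and the floor on $|\beta_j|$ (used in place of an explicit generalized inverse) are illustrative choices, not from [4].

```python
# Iteratively reweighted ridge approximation of the LASSO: repeatedly solve
# beta <- (X'X + k * W^-)^{-1} X'y with W = diag(|beta_j|), which are the
# normal equations of Equation (3) under the quadratic approximation of |b|.
import numpy as np

def lasso_via_ridge(X, y, k=1.0, n_iter=100, tol=1e-8):
    p = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX + k * np.eye(p), Xty)       # plain ridge start
    for _ in range(n_iter):
        # Flooring |beta_j| at tol makes the penalty huge for near-zero
        # coefficients, keeping them pinned at (numerically) zero.
        w_inv = 1.0 / np.maximum(np.abs(beta), tol)
        beta_new = np.linalg.solve(XtX + k * np.diag(w_inv), Xty)
        if np.max(np.abs(beta_new - beta)) < tol:
            break
        beta = beta_new
    return beta
```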
To understand how the estimator is built, suppose that there exists an orthogonal matrix $D$ such that $D'X'XD = \Lambda$ and $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$, where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the eigenvalues of the matrix $X'X$. Then, the modified form of Equation (1) is
$$y = Z\alpha + \varepsilon, \qquad (5)$$
where $Z = XD$ and $\alpha = D'\beta$. Consequently, the generalized LASSO regression estimator can be written as
$$\hat{\alpha}(k) = (\Lambda + kI_p)^{-1} \Lambda \hat{\alpha}, \qquad (6)$$
where $k = \hat{\sigma}^2 / \hat{\alpha}_i^2$ and $\hat{\alpha} = (Z'Z)^{-1} Z'y$ is the OLS estimate of $\alpha$.
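The canonical form is easy to reproduce numerically. The following numpy sketch eigendecomposes $X'X$, forms $Z$ and the OLS estimate of $\alpha$, and applies the shrinkage in Equation (6) for a given $k$; function and variable names are illustrative.

```python
# Canonical-form shrinkage estimator: D'X'XD = Lambda, Z = XD, alpha = D'beta,
# then alpha(k) = (Lambda + k I)^{-1} Lambda alpha_hat, back-transformed to beta.
import numpy as np

def canonical_shrinkage_estimate(X, y, k):
    lam, D = np.linalg.eigh(X.T @ X)       # eigenvalues lam and orthogonal D
    Z = X @ D                              # transformed regressors
    alpha_ols = (Z.T @ y) / lam            # OLS estimate of alpha: (Z'Z)^{-1} Z'y
    alpha_k = lam / (lam + k) * alpha_ols  # (Lambda + kI)^{-1} Lambda alpha_hat
    return D @ alpha_k                     # back-transform to the beta scale
```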
Modified LASSO Estimators
Using Equation (6), some new modified estimators are proposed in this section. In particular, we use different percentiles, namely the 5th percentile ($P_5$), the 25th percentile ($Q_1$), the 50th percentile ($Q_2$), the 75th percentile ($Q_3$), and the 90th percentile ($P_{90}$), of the vector $H = (H_1, H_2, \ldots, H_p)'$ with $H_i = \lambda_{\max}\hat{\sigma}^2 / \hat{\alpha}_i^2$, where $\lambda_{\max}$ is the maximum eigenvalue of the matrix $C = X'X$ and $\hat{\alpha}_i$ is the $i$th element of $\hat{\alpha}$. In particular, for the vector $H$, the $q$th percentile $P_q$ is defined as the value below which $q\%$ of the ordered elements of $H$ fall or, equivalently, the $\lceil nq/100 \rceil$th order statistic of $H$. An approximate value of the percentile is obtained through a linear interpolation of the modes of the order statistics from the uniform distribution on $[0, 1]$, as the R function quantile() does [23]. Mathematically, one can use
$$P_q = H_{(\lfloor h \rfloor)} + (h - \lfloor h \rfloor)\left(H_{(\lceil h \rceil)} - H_{(\lfloor h \rfloor)}\right), \quad h = (n-1)\frac{q}{100} + 1,$$
where $H_{(j)}$ denotes the $j$th order statistic of $H$, $\lfloor \cdot \rfloor$ denotes the largest integer not exceeding the specified value, $\lceil \cdot \rceil$ returns the lowest integer that is greater than or equal to the given number, and $n$ is the size of the vector $H$.
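For reference, the interpolation rule above can be written out directly. The sketch below mirrors R's default quantile() (type 7), which is also numpy's default, and checks the two against each other.

```python
# Type-7 percentile (R's default quantile() and numpy's default interpolation):
# linear interpolation between the two order statistics bracketing h.
import numpy as np

def percentile_type7(v, q):
    """qth percentile (0 <= q <= 100) of vector v via linear interpolation."""
    v = np.sort(np.asarray(v, dtype=float))
    n = v.size
    h = (n - 1) * q / 100.0                    # fractional order-statistic index
    j = int(np.floor(h))
    g = h - j                                  # interpolation weight
    return v[j] if j + 1 >= n else (1 - g) * v[j] + g * v[j + 1]

# Sanity check against numpy's default (also type 7):
v = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
assert np.isclose(percentile_type7(v, 25), np.percentile(v, 25))
```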
The main reason for considering percentiles is their robustness against outliers. In particular, the following are the proposed modified estimators.
1: The first proposal to estimate $k$ is IHS1, which is defined as
$$\hat{k}_{IHS1} = P_5(H),$$
where $H = (H_1, H_2, \ldots, H_p)'$ and $P_5$ denotes the fifth percentile.
2: The second proposal is IHS2, defined as
$$\hat{k}_{IHS2} = Q_1(H),$$
where $Q_1$ denotes the first quartile.
3: To estimate $k$, the next proposal is IHS3, defined as
$$\hat{k}_{IHS3} = Q_2(H),$$
where $Q_2$ denotes the second quartile (the median).
4: The fourth proposal is
$$\hat{k}_{IHS4} = Q_3(H),$$
where $Q_3$ denotes the third quartile.
5: Next, we propose IHS5, which is defined as
$$\hat{k}_{IHS5} = P_{90}(H),$$
where $P_{90}$ denotes the 90th percentile.
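Putting the pieces together, here is a minimal numpy sketch of the five proposals under the definition of $H$ given above; the residual-variance estimator $\hat{\sigma}^2 = \mathrm{RSS}/(n-p)$ is an assumption on our part, and np.percentile reproduces the R-style (type 7) interpolation used in the definitions.

```python
# Compute the five quantile-based proposals IHS1-IHS5 from the canonical form,
# using H_i = lambda_max * sigma2_hat / alpha_hat_i^2 as defined above.
import numpy as np

def ihs_estimators(X, y):
    n, p = X.shape
    lam, D = np.linalg.eigh(X.T @ X)
    Z = X @ D
    alpha_hat = (Z.T @ y) / lam                   # OLS estimate of alpha
    resid = y - Z @ alpha_hat
    sigma2_hat = resid @ resid / (n - p)          # residual variance (assumed RSS/(n-p))
    H = lam.max() * sigma2_hat / alpha_hat**2     # elements H_i of the vector H
    qs = {"IHS1": 5, "IHS2": 25, "IHS3": 50, "IHS4": 75, "IHS5": 90}
    return {name: np.percentile(H, q) for name, q in qs.items()}
```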
It is worth mentioning that many other types of shrinkage estimators could be considered for comparison purposes (e.g., [24]). However, we restricted the comparison of our proposed LASSO estimators to the classical LASSO estimators that are widely used in the literature.
3. Simulation Settings
This section discusses a comprehensive simulation study that involved varying the number of independent variables, the sample size, the correlation coefficient, and the residual variance. For each simulation case, all covariates were centered and standardized to have a mean of zero and a standard deviation of one. The predictors were generated as follows:
$$x_{ij} = (1 - \rho^2)^{1/2} z_{ij} + \rho z_{i(p+1)}, \quad i = 1, 2, \ldots, n, \; j = 1, 2, \ldots, p,$$
where $z_{ij}$ represents random numbers generated from the standard normal distribution and $\rho$ is a high correlation coefficient value, indicating strong correlation among the predictors. We considered three different correlation coefficient values (i.e., $\rho$ = 0.90, 0.95, and 0.99). In addition, to evaluate the effect of the sample size, different sample sizes such as n = 50, 100, and 150 were considered. Furthermore, we considered p = 4, 8, and 16 with variances $\sigma^2$ = 1, 3, 5, 7, and 9 for the error terms to evaluate their effects. Thus, the errors were generated as follows:
1. $\varepsilon_i$ followed the independent normal (0, 1);
2. $\varepsilon_i$ followed the independent normal (0, 3);
3. $\varepsilon_i$ followed the independent normal (0, 5);
4. $\varepsilon_i$ followed the independent normal (0, 7);
5. $\varepsilon_i$ followed the independent normal (0, 9).
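A sketch of this data-generating scheme, assuming numpy; the true coefficient vector is an illustrative choice, as the paper's $\beta$ values are not reproduced here. Note that under this scheme the pairwise correlation between distinct predictors is $\rho^2$.

```python
# Generate one simulated data set: correlated predictors (centered and
# standardized) and normal errors with variance sigma2.
import numpy as np

def generate_data(n, p, rho, sigma2, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, p + 1))
    # Shared component z[:, p] induces correlation rho**2 between predictors.
    X = np.sqrt(1 - rho**2) * z[:, :p] + rho * z[:, [p]]
    X = (X - X.mean(axis=0)) / X.std(axis=0)       # center and standardize
    beta = np.ones(p)                              # illustrative true coefficients
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n) # errors with variance sigma2
    y = X @ beta + eps
    return X, y, beta
```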
To study the performance of the OLS, LASSO, adaptive LASSO, and proposed LASSO estimators, we computed the MSE using the following equation:
$$\mathrm{MSE}(\hat{\beta}) = \frac{1}{N} \sum_{r=1}^{N} (\hat{\beta}_{(r)} - \beta)'(\hat{\beta}_{(r)} - \beta),$$
where $\hat{\beta}_{(r)}$ is the estimate of $\beta$ obtained from a given estimator at the $r$th replication and $N$ is the number of replicates used in the Monte Carlo simulation. To achieve reliable estimates, the simulations were repeated $N$ = 2000 times; a squared error was computed at every replication and then averaged over the replications.
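The MSE criterion can then be computed by averaging the squared estimation error over the replicates. The sketch below reuses generate_data() from the previous sketch; the OLS example setting is illustrative.

```python
# Monte Carlo MSE: average the squared estimation error over n_rep replicates.
# `estimator` is any function mapping (X, y) to a coefficient vector.
import numpy as np

def monte_carlo_mse(estimator, n, p, rho, sigma2, n_rep=2000):
    errors = np.empty(n_rep)
    for r in range(n_rep):
        X, y, beta = generate_data(n, p, rho, sigma2, seed=r)  # fresh data per replicate
        beta_hat = estimator(X, y)
        errors[r] = np.sum((beta_hat - beta) ** 2)             # squared error, replicate r
    return errors.mean()                                       # average over N replicates

# Example: the OLS estimator under severe multicollinearity (illustrative setting).
ols = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
print(monte_carlo_mse(ols, n=50, p=4, rho=0.99, sigma2=1, n_rep=200))
```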
Simulation Table Settings
Different combinations of varying values for $\rho$, $n$, and $\sigma^2$ are considered in Table 1, Table 2 and Table 3, with p = 4. In particular, high values of the correlation coefficient were considered (i.e., $\rho$ = 0.90, 0.95, and 0.99). To assess the effect of the sample size, n = 50, 100, and 150 were considered. Furthermore, the error variances $\sigma^2$ = 1, 3, 5, 7, and 9 were used to evaluate their effect on the performance of the proposed estimators.
In Table 4, Table 5 and Table 6, the values of $\rho$, $n$, and $\sigma^2$ were the same as those used in the first case. However, the number of variables was increased to p = 8 to assess the effect of the number of variables on the simulation studies.
Table 7, Table 8 and Table 9 report the results when considering the number of explanatory variables to be 16 (i.e., p = 16). The choices for the other parameters remained the same as those used in the first two cases.
4. Simulation Results
The simulation results for the proposed estimators, along with some existing estimators (OLS, LASSO, and adaptive LASSO), are given in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9.
Table 1 reports the results for the different estimators used in the study, considering $\rho$ = 0.90, 0.95, and 0.99, n = 50, p = 4, and $\sigma^2$ = 1, 3, 5, 7, and 9. From these results, it is evident that the proposed estimators outperformed the OLS and the existing LASSO estimators. The poor performance of the OLS estimator was evident from the results, as it provided higher MSE values. Furthermore, one can see that when $\rho$ = 0.90 and $\sigma^2$ = 1, 3, 5, or 7, the proposed estimator IHS1 performed relatively well, whereas when $\sigma^2$ = 9, IHS2 outperformed all other estimators. For $\rho$ = 0.90, the smallest obtained MSE value was 0.336, which was obtained by IHS1. In the case of $\rho$ = 0.95, when $\sigma^2$ = 1, 3, 5, or 7, IHS1 outperformed all other estimators, whereas for $\sigma^2$ = 9, the performance of IHS3 was better than that of the other estimators. When the correlation coefficient $\rho$ was increased to 0.99, IHS1 performed relatively well compared with all other estimators, especially for small values of $\sigma^2$. As the value of $\sigma^2$ increased, IHS2 and IHS3 outperformed the other estimators. For a given value of $\rho$, the increases in the MSEs of the proposed estimators were significantly smaller than those of the OLS and the existing LASSO estimators.
When considering $\rho$ = 0.90, 0.95, and 0.99, p = 4, and $\sigma^2$ = 1, 3, 5, 7, and 9, the results are listed in Table 2 with n = 100. These results suggest that as the sample size increased, the proposed estimator IHS1 outperformed all other estimators. With increasing $\sigma^2$, the MSEs increased for the OLS and the existing LASSO estimators; however, they decreased for the proposed LASSO estimators. The smallest MSE values when considering $\rho$ = 0.90, 0.95, and 0.99 were 0.306, 0.239, and 0.151, respectively, all of which were produced by the IHS1 proposal. Note that the existing estimators produced relatively large MSE values, especially for large values of $\sigma^2$. The table also shows the OLS's poor performance.
Table 3 reports the results for the different estimators used in the study when considering the same values for $\rho$, p, and $\sigma^2$ with n = 150. This table shows that the proposed estimator IHS1 uniformly outperformed all other estimators. Note that for all combinations of the different parameters, IHS1, IHS2, and IHS3 were the best three estimators, indicating the good performance of the proposed estimators. The MSEs decreased with the increase in the value of $\sigma^2$ for the proposed estimators; however, they increased for the existing LASSO and OLS estimators. Furthermore, the existing LASSO and OLS estimators produced significantly larger MSEs than the proposed estimators.
The number of regressors in the simulation study was increased from p = 4 to p = 8 to assess the effect of the number of explanatory variables, with the results listed in Table 4, Table 5 and Table 6. The parameter values $\rho$ = 0.90, 0.95, and 0.99 and $\sigma^2$ = 1, 3, 5, 7, and 9 were considered in these tables. However, the sample size varied by table, as Table 4, Table 5, and Table 6 considered n = 50, 100, and 150, respectively. According to these tables, the proposed estimators produced significantly lower MSEs than the existing estimators. As previously noted, the proposed estimator IHS1 outperformed all others by producing extremely low MSEs. In comparison with the earlier outcomes, the proposed estimators IHS4 and IHS5 improved their rankings. This suggests that as p and n increased, the proposed estimators were more accurate and reliable than the existing LASSO and OLS estimators. Furthermore, the poor performance of the OLS estimators is shown in these tables, demonstrating that multicollinearity prevents their use.
To evaluate the effectiveness of the proposals, the number of regressors was increased from 8 to 16, and the results are shown in Table 7, Table 8 and Table 9. In Table 7, $\rho$ = 0.90, 0.95, and 0.99, n = 50, p = 16, and $\sigma^2$ = 1, 3, 5, 7, and 9 are taken into account. The same parameter values are specified in Table 8 with n = 100. According to these tables, the proposed estimators outperformed the existing LASSO and OLS estimators. The first five rankings in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8 and Table 9 were attained by the five proposals, which shows that they provided superior results in a high-dimensional context. IHS1 produced the best results, and as the sample size increased, the corresponding MSEs were reduced.
6. Conclusions
The multicollinearity issue in linear regression occurs when the regressors have a high degree of correlation. In the presence of this issue, the OLS estimators are unstable and do not produce accurate estimates. To resolve this concern, penalized regression approaches such as the LASSO estimator are extensively used. For estimating the LASSO parameter k, this work provided five new estimators. The estimators' performance was affected by the standard deviation of the random error, the correlation ($\rho$) between the explanatory variables, the sample size (n), and the number of variables (p). To evaluate the effectiveness of the estimators using the MSE criterion, we conducted comprehensive Monte Carlo simulation studies and used real data sets. The findings revealed that the OLS estimator performed poorly when there was substantial predictor correlation.
The findings also revealed that in terms of the MSEs, the recommended estimators consistently outperformed the OLS estimator, the conventional LASSO estimator, the adaptive LASSO estimator, and others. The MSE decreased as the sample size increased, even for high values of the correlation coefficient ($\rho$) and the error variance ($\sigma^2$). Furthermore, in both the simulation studies and the real-world data cases, the suggested estimator IHS1 outperformed all other estimators. On the other hand, it is evident that the MSEs of the existing estimators increased as the number of variables (p), the error variance ($\sigma^2$), and the correlation coefficient ($\rho$) between the independent variables increased.
The simulation and real data analyses led to the conclusion that the quantile-based estimates of the LASSO parameter had lower mean squared errors than those of the OLS and other regularization-type estimators. This work can be expanded in the future by assuming a multivariate response variable. More research can be performed to determine how well the suggested estimators perform when the response variable follows another distribution from the exponential family.