2.1. Definition
Recall that the adaptive lasso improves on the lasso by applying a weight vector that adapts the shrinkage force across coefficients instead of shrinking them all equally. The set of predictor variables selected by the adaptive lasso estimator is denoted by
$$\mathcal{M}_{\lambda} = \{1 \le j \le p : \hat{\beta}^{\lambda}_j \neq 0\}.$$
In the low-dimensional case, the relaxed adaptive lasso solution coincides with the adaptive lasso estimator if and only if $\phi = 1$.
We now consider the linear regression model
$$Y = X\beta + \varepsilon,$$
where $\varepsilon$ is an $n \times 1$ vector composed of i.i.d. random variables with mean 0 and variance $\sigma^2$, $X = (X_1, \ldots, X_p)$ is an $n \times p$ matrix with normally distributed entries, where $X_i$ is the $i$th column, and $Y$ is an $n \times 1$ vector of response variables. Now, we define relaxed adaptive lasso estimation. Variable selection and shrinkage are controlled by adding two tuning parameters, $\lambda$ and $\phi$, and one weight vector, $\hat{w}$, to the $\ell_1$ penalty term. According to the setup of Zou [6], suppose that $\hat{\beta}$ is a $\sqrt{n}$-consistent estimator of $\beta$.
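To make this setup concrete, the following minimal sketch simulates data from the model above; the dimensions and the sparse coefficient vector are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 20                        # sample size and number of predictors
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]           # hypothetical sparse true coefficients
X = rng.normal(size=(n, p))           # n x p design matrix with normal entries
eps = rng.normal(size=n)              # i.i.d. errors with mean 0, variance 1
Y = X @ beta + eps                    # n x 1 response vector
```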
Definition 1. Define the relaxed adaptive lasso estimator as
$$\hat{\beta}^{\lambda,\phi} = \arg\min_{\beta}\ \frac{1}{n}\sum_{i=1}^{n}\Big(Y_i - \sum_{j=1}^{p} X_{ij}\,\beta_j\,\mathbf{1}_{\{j \in \mathcal{M}_{\lambda}\}}\Big)^{2} + \phi\,\lambda \sum_{j=1}^{p} \hat{w}_j\,|\beta_j|, \quad (5)$$
where $\mathbf{1}_{\{j \in \mathcal{M}_{\lambda}\}}$ is an indicator function for all $j \in \{1, \ldots, p\}$; $\lambda \in [0,\infty)$ and $\phi \in [0,1]$; given a $\gamma > 0$, define the weight vector $\hat{w} = 1/|\hat{\beta}|^{\gamma}$.

Notably, only predictor variables in the set $\mathcal{M}_{\lambda}$ can be chosen in the relaxed adaptive lasso solution. In the following, we discuss the roles and value ranges of the parameters under the set $\mathcal{M}_{\lambda}$. The parameter $\lambda$ determines the number of variables retained in the model. For $\lambda = 0$ or $\lambda \to 0$, the problem of solving the estimators in Equation (5) is transformed into an ordinary least squares problem with $\mathcal{M}_{\lambda} = \{1, \ldots, p\}$, so that the purpose of variable selection cannot be achieved. As $\lambda$ increases, all coefficients of the variables selected by the adaptive lasso are compressed towards 0, and some finally become exactly 0. However, for a sufficiently large $\lambda$, all estimators are shrunk to 0, with $\mathcal{M}_{\lambda} = \varnothing$, leading to a null model. In addition, the relaxation parameter $\phi$ controls the amount of shrinkage applied to the coefficients in estimation. When $\phi = 1$, the adaptive lasso and relaxed adaptive lasso estimators are the same. When $\phi < 1$, the shrinkage force on the estimators is weaker than that of the adaptive lasso. The optimal tuning parameters $\lambda$ and $\phi$ are chosen by cross-validation. The vector $\hat{w}$ assigns different weights to the coefficients; hence, the relaxed adaptive lasso achieves consistency when the weight vector is correctly chosen.
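As an illustration of Definition 1, the sketch below evaluates the objective in Equation (5) for a candidate coefficient vector; the function name and arguments are ours, and the active set and weights are assumed to come from a previously fitted adaptive lasso:

```python
import numpy as np

def relaxed_adaptive_lasso_objective(beta_cand, X, Y, lam, phi, w, active):
    """Objective of Equation (5): mean squared error plus the relaxed,
    weighted L1 penalty phi * lam * sum_j w_j * |beta_j|. The indicator
    restricts coefficients to the adaptive lasso active set `active`."""
    n = len(Y)
    b = np.where(active, beta_cand, 0.0)   # zero out variables outside the set
    resid = Y - X @ b
    return resid @ resid / n + phi * lam * np.sum(w * np.abs(b))
```

Setting `phi = 1` recovers the adaptive lasso objective on the active set, while `phi = 0` leaves only the unpenalized least squares problem, matching the discussion above.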
2.2. Algorithm
We discuss algorithms for computing the relaxed adaptive lasso estimator in this section. Note that (5) is a convex optimization problem, which means that the global optimal solution can be obtained efficiently; concave penalties such as SCAD, in contrast, suffer from the problem of multiple local minima. In the following, we first discuss a simplified version of the relaxed adaptive lasso algorithm. An improved algorithm is then proposed based on the process of computing the relaxed lasso estimator [11].
The simple algorithm for relaxed adaptive lasso
- Step (1).
For a given $\gamma > 0$, we use $\hat{\beta}_{\text{ols}}$ to construct the weight vector in the adaptive lasso based on the definition from Zou [6]. We can also replace $\hat{\beta}_{\text{ols}}$ with other consistent estimators, e.g., $\hat{\beta}_{\text{ridge}}$.
- Step (2).
Define $X^{**}_j = X_j / \hat{w}_j$ for $j = 1, \ldots, p$, where $\hat{w}_j = 1/|\hat{\beta}_{\text{ols},j}|^{\gamma}$.
- Step (3).
Then, the process of computing the relaxed adaptive lasso solutions is identical to that of solving the relaxed lasso solutions in Meinshausen [11]. The relaxed lasso estimator is defined as
$$\hat{\beta}^{\lambda,\phi} = \arg\min_{\beta}\ \frac{1}{n}\sum_{i=1}^{n}\Big(Y_i - \sum_{j=1}^{p} X_{ij}\,\beta_j\,\mathbf{1}_{\{j \in \mathcal{M}_{\lambda}\}}\Big)^{2} + \phi\,\lambda \sum_{j=1}^{p} |\beta_j|.$$
The Lars algorithm is first used to compute all the adaptive lasso solutions. Select a total of $h$ resulting models attained with the sorted penalty parameters $\lambda_1 > \lambda_2 > \cdots > \lambda_h$. When $\lambda_h = 0$, for example, all variables with nonzero coefficients are selected, which is identical to the OLS solution. On the other hand, $\lambda_1$ completely shrinks the estimators to zero, thus leading to a null model. Therefore, a moderate $\lambda_k$ in the sequence $\lambda_1 > \cdots > \lambda_h$ is chosen such that $1 < k < h$. Then, define the OLS estimator $\hat{\beta}^{\lambda_k,0}$ along the direction of the adaptive lasso solutions, which can be obtained from the last step. If there exists at least one component $j$ such that $\hat{\beta}^{\lambda,1}_j = 0$ for some $\lambda < \lambda_k$, then all the adaptive lasso solutions on the set of variables $\mathcal{M}_{\lambda_k}$ are identical to the set of relaxed lasso estimators for $\phi \in [0,1]$. Otherwise, $\hat{\beta}^{\lambda_k,\phi}$ for $\phi \in (0,1)$ are computed by linear interpolation between $\hat{\beta}^{\lambda_k,1}$ and $\hat{\beta}^{\lambda_k,0}$.
- Step (4).
Output the relaxed adaptive lasso solutions: $\hat{\beta}^{\lambda,\phi}_j = \hat{\beta}^{**\,\lambda,\phi}_j / \hat{w}_j$, $j = 1, \ldots, p$.
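A minimal sketch of Steps (1), (2), and (4) follows, assuming scikit-learn's `lasso_path` plays the role of the Lars solver and the weights come from an OLS fit with a hypothetical $\gamma$; the relaxation over $\phi$ from Step (3) is omitted for brevity:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, lasso_path

def adaptive_lasso_path(X, Y, gamma=1.0):
    """Steps (1)-(2): build weights from an OLS fit and rescale the columns
    so that a plain lasso on X** solves the adaptive lasso problem."""
    beta_ols = LinearRegression(fit_intercept=False).fit(X, Y).coef_
    w = 1.0 / np.abs(beta_ols) ** gamma     # weight vector w_j = 1/|beta_j|^gamma
    X_star = X / w                          # X**_j = X_j / w_j (column-wise)
    lams, coefs, _ = lasso_path(X_star, Y)  # all lasso solutions on the path
    return lams, coefs / w[:, None]         # Step (4): beta_j = beta**_j / w_j
```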
The simple algorithm has the same computational complexity as the Lars-OLS hybrid algorithm. However, due to this high computational cost, the approach is frequently not ideal. We therefore consider an improved algorithm introduced by Hastie et al. [12], which uses the definition of the relaxed adaptive lasso estimator to avoid the high computational complexity.
The improved algorithm for relaxed adaptive lasso
- Step (1).
As before, $\mathcal{M}_{\lambda}$ denotes the active set of the adaptive lasso. Let $\hat{\beta}^{\text{ada}}$ denote the adaptive lasso estimator. The relaxed adaptive lasso solution can be defined as
$$\hat{\beta}^{\lambda,\phi} = \phi\,\hat{\beta}^{\text{ada}} + (1-\phi)\,\hat{\beta}^{\text{OLS}}_{\mathcal{M}_{\lambda}},$$
where $\phi$ is a constant with a value between 0 and 1.
- Step (2).
The submatrix $X_{\mathcal{M}_{\lambda}}$ of active predictors has an invertible Gram matrix; thus, $\hat{\beta}^{\text{OLS}}_{\mathcal{M}_{\lambda}} = (X_{\mathcal{M}_{\lambda}}^{\top} X_{\mathcal{M}_{\lambda}})^{-1} X_{\mathcal{M}_{\lambda}}^{\top} Y$.
- Step (3).
Define $X^{**}$, where $X^{**}_j = X_j/\hat{w}_j$; then, the adaptive lasso solution $\hat{\beta}^{\text{ada}}$ is identical to solving the lasso problem
$$\hat{\beta}^{**} = \arg\min_{\beta}\ \frac{1}{n}\,\|Y - X^{**}\beta\|_2^2 + \lambda\,\|\beta\|_1.$$
By means of the Karush–Kuhn–Tucker (KKT) optimality condition, the lasso solution over its active set can be written as
$$\hat{\beta}^{**}_{\mathcal{M}_{\lambda}} = \big(X^{**\top}_{\mathcal{M}_{\lambda}} X^{**}_{\mathcal{M}_{\lambda}}\big)^{-1}\Big(X^{**\top}_{\mathcal{M}_{\lambda}} Y - \frac{n\lambda}{2}\, s_{\mathcal{M}_{\lambda}}\Big),$$
where $s_{\mathcal{M}_{\lambda}}$ is the sign vector of the active coefficients. From the transformation of the predictor matrix, it follows that the adaptive lasso estimator is $\hat{\beta}^{\text{ada}}_j = \hat{\beta}^{**}_j / \hat{w}_j$.
- Step (4).
Thus, the improved solution of the relaxed adaptive lasso can be written as
$$\hat{\beta}^{\lambda,\phi} = \phi\,\hat{\beta}^{\text{ada}} + (1-\phi)\,\hat{\beta}^{\text{OLS}}_{\mathcal{M}_{\lambda}}.$$
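The following sketch implements this blend, assuming scikit-learn's `Lasso` as the solver and OLS-based weights with a hypothetical $\gamma$; it is an illustration of the improved algorithm rather than a tuned implementation:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def relaxed_adaptive_lasso(X, Y, lam, phi, gamma=1.0):
    """Improved algorithm: beta(lam, phi) = phi * beta_ada + (1 - phi) * beta_ols,
    where beta_ols is the OLS refit on the adaptive lasso active set."""
    beta_init = LinearRegression(fit_intercept=False).fit(X, Y).coef_
    w = 1.0 / np.abs(beta_init) ** gamma
    # Adaptive lasso via the rescaling trick of Step (3)
    beta_star = Lasso(alpha=lam, fit_intercept=False).fit(X / w, Y).coef_
    beta_ada = beta_star / w
    active = beta_ada != 0
    if not active.any():
        return np.zeros(X.shape[1])          # null model for large lam
    # Step (2): OLS on the active submatrix (Gram matrix assumed invertible)
    beta_ols = np.zeros(X.shape[1])
    beta_ols[active] = LinearRegression(fit_intercept=False).fit(X[:, active], Y).coef_
    # Step (4): convex combination controlled by phi in [0, 1]
    return phi * beta_ada + (1 - phi) * beta_ols
```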
The computational complexity of Algorithm 1 in the best case is equivalent to that of the ordinary lasso. Specifically, in Step (3) of the simple algorithm, the relaxed adaptive lasso estimator can be solved in the same way as the relaxed lasso. The improved algorithm is computed from the adaptive lasso and OLS estimators. Given the weight vector, the computational cost of the relaxed adaptive lasso is the same as that of the lasso [21]. Therefore, the computational complexity of Algorithm 2 is equivalent to that of the lasso.
Now, we compare the computational costs of the two algorithms. The relaxed lasso's worst-case computational cost is slightly higher than that of the regular lasso (Meinshausen [11]). For this reason, we compute the relaxed adaptive lasso estimator using the improved algorithm.
Algorithm 1. The simple algorithm for the relaxed adaptive lasso.
Input: a given constant $\gamma > 0$; the weight vector $\hat{w} = 1/|\hat{\beta}_{\text{ols}}|^{\gamma}$; $\lambda \in [0,\infty)$, $\phi \in [0,1]$
Precompute: $X^{**}_j = X_j/\hat{w}_j$, $j = 1, \ldots, p$
Initialization: Let $\lambda_k$ be the optimal parameter corresponding to the modified models $\mathcal{M}_{\lambda_1}, \ldots, \mathcal{M}_{\lambda_h}$. Set $k$ to an initial order number of the sorted sequence $\lambda_1 > \cdots > \lambda_h$. Define $\hat{\beta}^{\lambda_k,1}$ and $\hat{\beta}^{\lambda_k,0}$, where $\hat{\beta}^{\lambda_k,0}$ is the OLS estimator on $\mathcal{M}_{\lambda_k}$
for $k = 1, \ldots, h$ do
  if there exists a component $j$ with $\hat{\beta}^{\lambda,1}_j = 0$ for some $\lambda < \lambda_k$ then
    take the adaptive lasso solutions on $\mathcal{M}_{\lambda_k}$ as the relaxed solutions
  else
    compute $\hat{\beta}^{\lambda_k,\phi}$ by linear interpolation between $\hat{\beta}^{\lambda_k,1}$ and $\hat{\beta}^{\lambda_k,0}$
  Set $k = k + 1$
until $k = h$
Output: $\hat{\beta}^{\lambda,\phi}_j = \hat{\beta}^{**\,\lambda,\phi}_j/\hat{w}_j$, $j = 1, \ldots, p$
Algorithm 2. The improved algorithm for the relaxed adaptive lasso.
Input: adaptive lasso estimator $\hat{\beta}^{\text{ada}}$, OLS estimator $\hat{\beta}^{\text{OLS}}$, weight vector $\hat{w}$
Precompute: $X^{**}_j = X_j/\hat{w}_j$; let $\mathcal{M}_{\lambda}$ be the active set of the adaptive lasso
Initialization: Define $\hat{\beta}^{\lambda,1} = \hat{\beta}^{\text{ada}}$
for each $\lambda$ in the path do
  if $\mathcal{M}_{\lambda} \neq \varnothing$ then
    compute $\hat{\beta}^{\text{OLS}}_{\mathcal{M}_{\lambda}} = (X_{\mathcal{M}_{\lambda}}^{\top} X_{\mathcal{M}_{\lambda}})^{-1} X_{\mathcal{M}_{\lambda}}^{\top} Y$,
    $\hat{\beta}^{\lambda,\phi} = \phi\,\hat{\beta}^{\text{ada}} + (1-\phi)\,\hat{\beta}^{\text{OLS}}_{\mathcal{M}_{\lambda}}$
  else
    Stop iterations
until the sequence of $\lambda$ values is exhausted
Output: the relaxed adaptive lasso solutions $\hat{\beta}^{\lambda,\phi}$, $\lambda \in [0,\infty)$, $\phi \in [0,1]$
2.3. Asymptotic Results
To investigate the asymptotic properties, we make the following two assumptions about the design, as used in the setup of Fu and Knight [18]:
$$\frac{1}{n} X^{\top} X \to C,$$
where $C$ is a positive definite matrix. Furthermore,
$$\frac{1}{n} \max_{1 \le i \le n} x_i^{\top} x_i \to 0.$$
Without loss of generality, the sparse constant vector $\beta$ is defined as the true coefficient vector of the model. We assume that the number of nonzero coefficients selected into the real model is $q$, that is, $\beta = (\beta_1, \ldots, \beta_q, 0, \ldots, 0)^{\top}$, where $\beta_j \neq 0$ only for $j \le q$ and $\beta_j = 0$ for $j > q$. The true model is, hence, $\mathcal{M}_0 = \{1, \ldots, q\}$. The covariance matrix $\Sigma$ can be written in block-wise form, i.e.,
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
where $\Sigma_{11}$ is a $q \times q$ matrix. The random loss $L(\lambda)$ of the adaptive lasso is defined as
$$L(\lambda) = \big(\hat{\beta}^{\lambda} - \beta\big)^{\top} \Sigma\, \big(\hat{\beta}^{\lambda} - \beta\big).$$
The loss $L(\lambda, \phi)$ of the relaxed adaptive lasso is analogously defined as
$$L(\lambda, \phi) = \big(\hat{\beta}^{\lambda,\phi} - \beta\big)^{\top} \Sigma\, \big(\hat{\beta}^{\lambda,\phi} - \beta\big).$$
We find that the relaxed adaptive lasso estimator attains the same fast convergence rate as the relaxed lasso estimator, regardless of the exponential growth rate of the dimension p. The adaptive lasso converges more slowly than both of them but slightly faster than the lasso estimator. To demonstrate these conclusions, we make the following assumptions concerning asymptotic results for low-dimensional sparse solutions.
Assumption 1. The number of predictors increases exponentially with the number of observations n; that is, there exist some $r > 0$ and $0 < s < 1$ such that $p_n = O\big(e^{r n^{s}}\big)$.
We cannot rule out the possibility that the remaining noise variables are correlated with the response. A square matrix is said to be diagonally dominant if, in each row, the magnitude of the diagonal entry is greater than or equal to the sum of the magnitudes of all the other (off-diagonal) entries in that row.
Assumption 2. $\Sigma$ and $\Sigma^{-1}$ are diagonally dominant at some constant $d$, i.e., $|\Sigma_{jj}| \ge d \sum_{k \neq j} |\Sigma_{jk}|$ and $|(\Sigma^{-1})_{jj}| \ge d \sum_{k \neq j} |(\Sigma^{-1})_{jk}|$, for all $j$.
Notably, a symmetric diagonally dominant matrix with positive diagonal entries is positive definite. Based on this premise, the existence of the inverse matrix $\Sigma^{-1}$ is guaranteed.
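As a small worked example of this fact:

```latex
% A symmetric, diagonally dominant matrix with positive diagonal entries
% is positive definite, so its inverse exists:
\[
\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix},
\qquad 2 \ge |1| \text{ in each row},
\]
\[
\text{eigenvalues } 3 \text{ and } 1 > 0
\;\Longrightarrow\; \Sigma \succ 0,
\qquad
\Sigma^{-1} = \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}.
\]
```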
Assumption 3. We limit the penalty parameter λ to a range $\Lambda = [\lambda_{\min}, \lambda_{\max}]$ for which there exists an arbitrarily large constant $m$ such that $|\mathcal{M}_{\lambda}| \le m$ for all $\lambda \in \Lambda$.

Assumption 3 holds true if the number of variables in the selected model is less than the sample size n. Using λ values in the range $\Lambda$, the relaxed lasso, adaptive lasso, and relaxed adaptive lasso can obtain consistent variable selection and a specified number of nonzero coefficients.
Lemma 1. Assume that the predictor variables are independent of each other and that λ, the penalty parameter of the adaptive lasso, is of the appropriate order for $\lambda \in \Lambda$. Under Assumptions 1–3,
$$P\big(\exists\, j > q : \hat{\beta}^{\lambda}_j \neq 0\big) \to 1 \quad \text{as } n \to \infty.$$

As a result of Lemma 1, the probability that at least one noise variable is estimated as nonzero is close to one. We prove Theorem 1 by utilizing the conclusion of Lemma 1 on the order of the penalty parameter.
Lemma 2. Let $\lambda = \lambda_n$ with $\lambda_n \in \Lambda$, n being the number of observations. Then, under Assumptions 1–3, the loss $L(\lambda_n, \phi)$ of the relaxed adaptive lasso is of the order required for the proof of Theorem 3.

We investigate the cost of the specified parameters by examining the order of the relaxed adaptive lasso loss function. Lemma 2 is a technical result that assists us in proving Theorem 3.
Lemma 3. Assume that the predictor variables are independent of each other and that λ, the penalty parameter of the relaxed adaptive lasso, satisfies $\lambda \in \Lambda$ for $\phi \in [0,1]$. Under Assumptions 1–3,
$$P\big(\exists\, j > q : \hat{\beta}^{\lambda,\phi}_j \neq 0\big) \to 0 \quad \text{as } n \to \infty.$$

As a result of Lemma 3, the noise variables can be estimated to be 0. If the penalty parameter λ converges to 0 at a sufficiently slow rate, a noise variable is falsely estimated as nonzero with a probability approaching 0. In addition, Lemma 3 helps to prove Theorem 3 by describing the order of the penalty parameter of the relaxed adaptive lasso.
Theorem 1 addresses the question of whether the adaptive lasso can sustain a fast convergence rate as the number of noise variables increases rapidly, and whether its convergence speed exceeds that of the lasso. The addition of the weight vector enables the adaptive lasso to gain oracle properties while also increasing the algorithm's rate of convergence.
Theorem 1. Assume that the predictor variables are independent of each other. Under Assumptions 1–3, for any $r > 0$ and $0 < s < 1$, the convergence rate of the adaptive lasso depends on the growth-rate parameters r and s of the noise variables.

On the other hand, Theorem 2 establishes that the convergence rate of the relaxed adaptive lasso is equivalent to that of the relaxed lasso. Theorem 2 resolves the question of whether the convergence rate of the relaxed adaptive lasso is consistent with that of the relaxed lasso by establishing that it is not related to the noise variables' growth rate r or the parameter s that determines the growth rate.
Theorem 2. Assume that the predictor variables are independent of each other. Under Assumptions 1–3, for $\phi \in [0,1]$, the relaxed adaptive lasso converges at the same rate as the relaxed lasso, independent of the growth-rate parameters r and s.

The shading in Figure 1 represents the rates at which the various models converge. The rate of the relaxed adaptive lasso is the same as that of the relaxed lasso; this indicates that the convergence rate of the relaxed adaptive lasso is unaffected by the rapid increase in the number of noise variables, and it can still retain a high rate. Although the adaptive lasso's convergence rate is suboptimal, it is faster than the lasso's due to the presence of the weight vector. The addition of an excessive number of noise variables slows the lasso estimator, regardless of how the penalty parameter is chosen [11].
The convergence rate of the relaxed adaptive lasso is as robust as the rate of the relaxed lasso, i.e., it is unaffected by noise variables. Theorem 3 demonstrates that cross-validation selection of the parameters $\lambda$ and $\phi$ can still maintain a rapid rate.
Franklin [22] indicated that K-fold cross-validation includes K partitions, and each partition consists of $n_k$ observations, where $n_k \approx n/K$ for $k = 1, \ldots, K$. When building an estimator on a set of observations different from the kth partition, define the empirical loss on the kth partition as $L_k(\lambda, \phi, \gamma)$ for $k = 1, \ldots, K$. Let $CV(\lambda, \phi, \gamma)$ be the empirical loss function,
$$CV(\lambda, \phi, \gamma) = \frac{1}{K} \sum_{k=1}^{K} L_k(\lambda, \phi, \gamma).$$
The selection of $\lambda$, $\phi$, and $\gamma$ is performed by minimizing the loss function $CV(\lambda, \phi, \gamma)$, that is,
$$(\hat{\lambda}, \hat{\phi}, \hat{\gamma}) = \arg\min_{\lambda,\,\phi,\,\gamma} CV(\lambda, \phi, \gamma).$$
This article uses five-fold cross-validation in the numerical study.
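A minimal sketch of this selection procedure, assuming the `relaxed_adaptive_lasso` function sketched in Section 2.2 and a hypothetical parameter grid:

```python
import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def select_parameters(X, Y, lams, phis, gammas, K=5):
    """Choose (lambda, phi, gamma) by minimizing the K-fold empirical loss CV."""
    kf = KFold(n_splits=K, shuffle=True, random_state=0)
    best, best_cv = None, np.inf
    for lam, phi, gamma in product(lams, phis, gammas):
        losses = []
        for train, test in kf.split(X):
            beta = relaxed_adaptive_lasso(X[train], Y[train], lam, phi, gamma)
            resid = Y[test] - X[test] @ beta
            losses.append(resid @ resid / len(test))   # empirical loss L_k
        cv = np.mean(losses)                           # CV(lambda, phi, gamma)
        if cv < best_cv:
            best, best_cv = (lam, phi, gamma), cv
    return best
```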
Theorem 3. Under Assumptions 1–3, K-fold cross-validation with a fixed K preserves the convergence rate of Theorem 2.

Therefore, when K-fold cross-validation is used to determine the relaxed adaptive lasso's penalty parameters $\lambda$ and $\phi$, the convergence speed maintains a relatively ideal outcome. As a result, when cross-validation is used to select the penalty parameters, the optimal rate and consistent variable selection obtained under oracle selection of the penalty parameters can be nearly achieved.
Theorem 4. If $\hat{\beta} \xrightarrow{p} \beta$, then $\hat{w}_j \xrightarrow{p} |\beta_j|^{-\gamma}$ in the relaxed adaptive lasso estimator; moreover, if $\lambda_n \to 0$, then $\hat{\beta}^{\lambda,\phi}$ is consistent.
Theorem 4 indicates that the relaxed adaptive lasso estimator is consistent under the condition $\hat{\beta} \xrightarrow{p} \beta$. The initial estimator $\hat{\beta}$ does not have to be root-n consistent; the consistency of the relaxed adaptive lasso follows from convergence in probability alone.