Adaptive Bi-Level Variable Selection for Quantile Regression Models with a Diverging Number of Covariates

Xianwen Ding; Zhihuang Yang

doi:10.3390/math12203284

Abstract

The paper develops an innovatively adaptive bi-level variable selection methodology for quantile regression models with a diverging number of covariates. Traditional variable selection techniques in quantile regression, such as the lasso and group lasso techniques, offer solutions predominantly for either individual variable selection or group-level selection, but not for both simultaneously. To address this limitation, we introduce an adaptive group bridge approach for quantile regression, to simultaneously select variables at both the group and within-group variable levels. The proposed method offers several notable advantages. Firstly, it adeptly handles the heterogeneous and/or skewed data inherent to quantile regression. Secondly, it is capable of handling quantile regression models where the number of parameters grows with the sample size. Thirdly, via employing an ingeniously designed penalty function, our method surpasses traditional group bridge estimation techniques in identifying important within-group variables with high precision. Fourthly, it exhibits the oracle group selection property, implying that the relevant variables at both the group and within-group levels can be identified with a probability converging to one. Several numerical studies corroborated our theoretical results.

Keywords:

quantile regression; bi-level variable selection; adaptive bridge estimator; diverging parameters

MSC:

62F12; 62F35

1. Introduction

Over the past three decades, variable selection has emerged as a critical process in diverse scientific disciplines, including biomedical research, environmental science, and financial econometrics. Its importance lies in enhancing model interpretability, as statistical models with fewer important variables are more easily understood than their fully specified counterparts (Hastie, Tibshirani, and Friedman [1]). Usually, extensive data on potential predictors are collected to make sure that insignificant predictive relationships are overlooked. To diminish variability and enhance interpretability, it is necessary to seek a parsimonious model using a smaller subset of collected variables (Ahn and Kim [2]).

Although there is a large body of literature on variable selection, most works focus on the selection of individual variables. In many regression problems, however, important predictors are related and a manifestation of underlying common factors (Yuan and Lin [3]). Categorical factors, for example, are often represented by groups of indicator functions, whereas continuous factors may be modeled using basis functions. Moreover, groups of measurements are frequently employed to detect unobservable latent variables or to assess various aspects of complex entities. For instance, gene expression data might be categorized by biological pathways, and genetic markers grouped by the genes or haplotypes they represent (Goeman and Buhlmann [4]). Methods focused exclusively on individual variable selection can be suboptimal in these contexts, as they may overlook the information provided by the structure of groups, thereby potentially leading to incoherent and inefficient models.

In addition to convex/nonconvex regularizers designed for individual variable selection, such as the lasso (Tibshirani [5]), bridge (Frank and Friedman [6]), smoothly clipped absolute deviation penalty (Fan and Li [7]), and minimax concave penalty (Zhang [8]), several penalty functions have been developed to accommodate selection at the group level. Yuan and Lin [3] introduced the group lasso, where the penalty function is composed of the

L_{2}

norms of predefined groups of variables. This approach promotes sparsity at the group level, while applying ridge-regression-like shrinkage within each group. Meier et al. [9] extended this concept to logistic regression models, and Zhao et al. [10] further extended it to accommodate overlapping and hierarchical group structures. While these aforementioned approaches can effectively perform variable selection at the group level, they do not facilitate individual-level variable selection within groups.

The group bridge penalty (Huang et al. [11]) applies a group penalty to the

L_{1}

norms of groups, thereby enabling bi-level selection by promoting sparse solutions both at the group level and within individual groups. Bi-level selection is crucial for models that require identification of relevant groups of variables, as well as the selection of significant variables within those groups. Several recent works related to the group bridge penalty include [2,12], and references therein. Further advancing this area, Cai et al. [13] proposed an adaptive bi-level variable selection method for analyzing multivariate failure time data using the Cox proportional hazards model, showcasing the versatility and applicability of bi-level selection methods in survival data analysis. Buch et al. [14] demonstrated that bi-level selection methods offer enhanced model interpretability over traditional approaches like LASSO, particularly by effectively addressing the complex interactions present in grouped data, such as those encountered in omics research. More recently, Buch et al. [15] further highlighted the potential of bi-level methods to improve the flexibility and precision of variable selection, enabling more refined control over both group-level and individual-level sparsity. A thorough examination of the theoretical underpinnings and practical implications of bi-level variable selection techniques can be found in [16].

The existing bi-level variable selection approaches exhibit sensitivity to the tails of the unobservable error distribution. Additionally, when heterogeneity is present in response data, sparse estimators derived from least squares methods may yield inefficient results. Quantile regression (QR) emerges as a robust alternative to classical mean regression, offering a comprehensive view of the entire response distribution, while being resistant to heterogeneity, see [17,18,19]. Within the framework of QR, various regularization methods have been developed to identify significant group structures and individual variables in covariate data with an inherent group structure. Notable contributions include the works of Ciuperca ([20,21]), among others. Regarding bi-level variable selection, Ahn and Kim [2] investigated group bridge and adaptive group bridge penalties for competing risk QR with a diverging number of group variables, thereby facilitating bi-level variable selection in this specific context. Similarly, Shi and Wilke [22] employed a flexible-yet-dependent competing risks QR model to elucidate the relationships between early and late retirement transitions and various informative registers. Despite the advantages of the aforementioned approaches, there is a paucity of theoretical and computational aspects for bi-level variable selection in QR models in the presence of a diverging number of parameters. This highlights the necessity for further exploration and development of robust bi-level variable selection methodologies to effectively manage the complexities of QR models in such contexts. In addition, the frequent occurrence of heterogeneous group-structured data in medical and health-related research underscores the need for effective methods to identify both relevant groups and key individual variables within those groups [14,15]. Addressing these challenges was the primary motivation behind this study, as accurate group and within-group selection is crucial for improving the model interpretability and predictive performance for complex health data [16].

In this paper, we introduce an adaptive group bridge methodology for bi-level variable selection in QR models with a diverging number of parameters. The proposed bi-level variable selection approach utilizes an adaptive penalty function to simultaneously identify group structures and individual variables within the groups. The proposed estimation procedure offers several notable advantages: it effectively handles the heterogeneity and skewness often encountered in regression analysis; it is scalable, making it suitable for models with an increasing number of parameters as the sample size grows; it outperforms traditional group bridge methods by efficiently identifying key within-group variables, leveraging the flexibility of the adaptive penalty function; and it achieves the oracle group selection property, ensuring accurate identification of relevant variables at both the group and within-group levels with high probability. Additionally, an iterative optimization algorithm is presented to address the computational challenges posed by the non-differentiable check loss function and nonconvex penalty function, thereby improving both the computational efficiency and practical applicability.

The remainder of this paper is structured as follows: In Section 2, the proposed adaptive group bridge quantile methods are introduced, including a comprehensive description of the computational algorithm and the selection criteria for tuning parameters. The asymptotic properties of the proposed sparse estimation procedure are rigorously developed in Section 3. Section 4 presents simulation studies and an application to real data, providing empirical validation of the proposed methods. A concise discussion is provided in Section 5. All technical derivations and proofs are relegated to the Appendix A.

2. Methods

2.1. Adaptive Bi-Level Variable Selection in QR Models

Consider a sample

{(Y_{i}, x_{i})}_{i = 1}^{n}

of size n from some unknown population, where

Y_{i} \in R

is the response of interest, and

x_{i} = {(X_{i 1}, X_{i 2}, \dots, X_{i p_{n}})}^{T} \in R^{p_{n}}

is the covariate or prediction vector. We focus on a family of linear QR models, which can be expressed as

Y_{i} = x_{i}^{T} β (τ) + e_{i} (τ), i = 1, 2, \dots, n,

(1)

where

β (τ) = {(β_{1} (τ), \dots, β_{p_{n}} (τ))}^{T}

is an unknown coefficient vector, and

e_{i} (τ)

is an independent random error variable with a

τ

-th quantile equal to zero.

Suppose the prediction variables can be divided into J groups. Let

B_{1}, \dots, B_{J}

be subsets of

{1, \dots, p_{n}}

representing known groupings of the design covariates. For any subset

A \subseteq {1, \dots, p_{n}}

, let

β_{A} (τ) = {(β_{j} (τ))}_{j \in A}

denote the

| A |

-dimensional sub-vector of

β (τ)

indexed by A, where

| A |

denotes the cardinality of A. At a prefixed quantile level

τ

, the j-th group is denoted by

β_{A_{j}} (τ) = {(β_{k} (τ))}_{k \in A_{j}}

. Additionally, a quantile level

τ

will be omitted from various expressions, such as

β (τ)

and

e_{i} (τ)

, whenever this is clear from the context.

To gain a comprehensive understanding of the relationship between the response variable and its predictors, it is crucial to not only select important groups of variables but also to identify significant individual members within these groups across different quantile levels. This process, known as bi-level selection, ensures a more detailed and accurate representation of the underlying data structure. To this end, we here introduce an adaptive group bridge quantile estimator

\hat{β}

, which is a minimizer of the adaptive group bridge penalized quantile loss function

G_{n} (β)

, i.e.,

\hat{β} = \underset{β \in R^{p_{n}}}{arg min} {G_{n} (β)} = \underset{β \in R^{p_{n}}}{arg min} \{\sum_{i = 1}^{n} ρ_{τ} (Y_{i} - x_{i}^{T} β) + λ_{n} \sum_{j = 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ}\},

(2)

where

ρ_{τ} (t) = t (τ - I (t < 0))

is the check loss function, with

I (\cdot)

being the indicator function;

λ_{n} > 0

is a tuning parameter that controls the penalty level;

b_{j}

are constants for adjusting different dimensions of

β_{B_{j}}

;

{\tilde{β}}_{k}

is a consistent estimator of its true counterpart

β_{k}^{0}

; v is a non-negative constant representing the penalty level of individuals within group

B_{j}

; and

0 < γ \leq 1

.

In general, groups

B_{j}

are allowed to overlap, and their union may be a proper subset of the entire set of variables, ensuring that variables not included in

\cup_{j = 1}^{J} B_{j}

are not subject to penalization. When

v = 0

, the penalty term corresponds to the group bridge penalty, whereas for

v > 0

, it becomes the adaptive group bridge penalty. The adaptive group bridge estimator reduces to individual variable selection when

| B_{j} | = 1

for all j, and it is equivalent to the adaptive hierarchical lasso penalty developed by [23] when

γ = 1 / 2

and

b_{j} = 1

for all j.

The objective function

G_{n} (β)

demonstrates considerable flexibility in the variable selection regime. Specifically, when

γ = 1

, it simplifies to the adaptive group lasso selection method for QR models in [20,24]. Furthermore, when

γ = 1

and

| B_{j} | = 1

for

j = 1, \dots, J

, it reduces to the adaptive lasso objective function. When

v = 0

, it aligns with the group bridge variable selection method in QR models, analogous to that of mean regression models studied by [11]. Additionally, when

v > 0

and

0 < γ < 1

, utilizing a conditional hazard loss function, the estimation procedure transitions into an adaptive bi-level variable selection methodology for multivariate failure time models (Cai et al. [13]).

In what follows, we will show that the designed objective function (2) can be employed for bi-level variable selection in QR models. To see this, for

0 < γ < 1

, define

\begin{matrix} {\tilde{G}}_{n} (β, θ) = \sum_{i = 1}^{n} ρ_{τ} (Y_{i} - x_{i}^{T} β) + \sum_{j = 1}^{J} θ_{j}^{1 - 1 / γ} b_{j}^{1 / γ} \sum_{k \in B_{j}} \frac{| β_{k} |}{| {\tilde{β}}_{k} |^{v}} + ρ_{n} \sum_{j = 1}^{J} θ_{j}, \end{matrix}

(3)

where

ρ_{n}

is a penalty parameter, and

θ = {(θ_{1}, \dots, θ_{J})}^{T}

with

θ_{j}

given by

θ_{j} = b_{j} {(\frac{1 - γ}{ρ_{n} γ})}^{γ} {(\sum_{k \in B_{j}} \frac{| β_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} .

Proposition 1.

Assume that

0 < γ < 1

. If

λ_{n} = ρ_{n}^{1 - γ} γ^{- γ} {(1 - γ)}^{1 - γ}

, then

\hat{β}

minimizes

G_{n} (β)

if and only if

{\hat{β}, \hat{θ}}

minimizes

{\tilde{G}}_{n} (β, θ)

, where

θ_{j} > 0

and

{\hat{θ}}_{j} > 0

for

j = 1, \dots, J

.

This proposition is analogous to the characterization of the component selection and smoothing method of [25]. Examining the form of

{\tilde{G}}_{n} (β, θ)

defined in (3), we observe that minimizing

{\tilde{G}}_{n} (β, θ)

with respect to

(β, θ)

yields sparse solutions at both the group and individual variable levels. Specifically, the penalty is an adaptively weighted

L_{1}

penalty, resulting in sparsity in

β

. Moreover, for

0 < γ < 1

, small

θ_{j}

values force

β_{B_{j}} = 0

, leading to group selection. Following a similar approach to that outlined in [11], the validity of Proposition 1 can be rigorously verified. The detailed derivations are omitted here for brevity.

2.2. Algorithm

Since the check loss function is non-differentiable and the adaptive group bridge penalty is not a convex function for

0 < γ < 1

, minimizing the objective function

G_{n} (β)

in (2) with respect to

β

poses a significant challenge. To address this difficulty, we propose an iterative optimization algorithm, as follows:

Step 1. At a given quantile level

τ

, obtain an initial estimator

\tilde{β}

of

β

in model (1), i.e.,

\tilde{β} = \underset{β \in R^{p_{n}}}{arg min} \{\frac{1}{n} \sum_{i = 1}^{n} ρ_{τ} (Y_{i} - x_{i}^{T} β)\} .

(4)

Step 2. Compute

θ_{j}^{(l)} = b_{j} {(\frac{1 - γ}{ρ_{n} γ})}^{γ} {(\sum_{k \in B_{j}} \frac{| β_{k}^{(l)} |}{| {\tilde{β}}_{k} |^{v}})}^{γ}, j = 1, \dots, J .

Step 3. Compute

{\hat{β}}^{(l + 1)} = \underset{β \in R^{p_{n}}}{arg min} \{\sum_{i = 1}^{n} ρ_{τ} (Y_{i} - x_{i}^{T} β) + \sum_{j = 1}^{J} {(θ_{j}^{(l)})}^{1 - 1 / γ} b_{j}^{1 / γ} \sum_{k \in B_{j}} \frac{| β_{k} |}{| {\tilde{β}}_{k} |^{v}}\} .

Step 4. Repeat Steps 2 and 3 until

∥ {\hat{β}}^{(l + 1)} - {\hat{β}}^{(l)} ∥ \leq η

. For instance,

η = 10^{- 4}

can be chosen as the convergence rule.

The proposed algorithm is guaranteed to converge because it monotonically decreases the non-negative objective function (2) at each iteration. The minimization in Step 3 can be efficiently implemented by directly applying the adaptive LASSO penalized QR method of [26]. Generally, this algorithm converges to a local minimizer, depending on the initial value

\tilde{β}

, due to the non-convex nature of the adaptive group bridge penalty. The flexibility and robustness of the iterative approach make it well-suited for the bi-level variable selection problems in QR settings.

3. Theoretical Properties

The proposed adaptive bi-level variable selection method for QR relies on adaptive weights

ω_{k} = 1 / {| {\tilde{β}}_{k} |}^{v}

, where

k \in B_{j}

and

j = 1, \dots, J

. This implies that an important variable in the j-th group will receive a smaller penalty, whereas less significant variables will incur larger penalties. Consequently, the initial estimator

\tilde{β}

in (4) must be a consistent estimator of the true parameter

β^{0}

in the QR model (1) with a diverging number of parameters. To ensure this consistency, we need the following technical conditions:

(A1): Random error terms ${e_{i}}_{i = 1}^{n}$ are independently and identifiably distributed with the $τ$ -th quantile equal to zero and possess a continuous, positive density $f (\cdot)$ in a neighborhood of zero. The distribution functions $F (\cdot)$ are absolutely continuous with $F (0) = τ$ .
(A2): Let $ϕ_{max} (M)$ and $ϕ_{min} (M)$ denote the largest and smallest eigenvalues of a positive definite matrix M, respectively. There are two positive constants $C_{1}$ and $C_{2}$ , such that

$C_{1} \leq ϕ_{min} (n^{- 1} \sum_{i = 1}^{n} x_{i} x_{i}^{T}) \leq ϕ_{max} (n^{- 1} \sum_{i = 1}^{n} x_{i} x_{i}^{T}) \leq C_{2} .$
(A3): ${(p_{n} / n)}^{1 / 2} {max}_{1 \leq i \leq n} ∥ x_{i} ∥ \to 0$ as $n \to \infty$ .
(A4): The dimension $p_{n}$ satisfies $p_{n} = O (n^{c})$ , where $0 < c < 1 / 2$ .

Condition (A1) is standard in QR models (e.g., [26,27]). Condition (A2) ensures that the design matrix of the true model at the sample level is well-behaved. Condition (A3) is required for high-dimensional QR models (see [21,28]). Condition (A4) permits the number of parameters

p_{n}

to diverge as the sample size n increases.

Theorem 1.

Suppose that Conditions (A1)–(A4) hold. Then, we have

∥ \tilde{β} - β^{0} ∥ = O_{p} (\sqrt{\frac{p_{n}}{n}})

.

Theorem 1 indicates that the initial estimator

\tilde{β}

is a consistent estimator of its true counterpart in the QR model (1). This consistency is crucial for the effectiveness of the adaptive bi-level variable selection method, ensuring reliable and accurate estimation in the presence of a diverging number of parameters.

Next, we present the asymptotic properties of

\hat{β}

. In particular, two scenarios are considered: (i)

0 < γ < 1

and

v = 0

, i.e., the group bridge penalty; (ii)

0 < γ < 1

and

v > 0

, i.e., the adaptive group bridge penalty. Specifically, we show that the group bridge estimators correctly select groups of nonzero coefficients with probability converging to one, while the adaptive group bridge estimator correctly identifies nonzero variables at both the group and within-group levels with probability approaching one. Moreover, the asymptotic distributions of nonzero components of these two penalized estimators are derived under different conditions.

Without loss of generality, we define

E_{1}

and

E_{2}

, such that

E_{1} = \cup_{j = 1}^{J_{1}} B_{j}

and

E_{2} = \cup_{j = J_{1} + 1}^{J} B_{j}

, where

β_{B_{j}, 0 τ} \neq 0

for

1 \leq j \leq J_{1}

and

β_{B_{j}, 0 τ} = 0

for

J_{1} + 1 \leq j \leq J

. Write

β_{E_{1}, 0 τ}

and

β_{E_{2}, 0 τ}

as the true values of

β_{0 τ}

with indices belonging to

E_{1}

and

E_{2}

, respectively. Additionally, to distinguish the individual memberships between nonzero

β_{k}^{0}

’s and zero

β_{k}^{0}

’s

(k = 1, 2, \dots, p_{n})

, we define

E_{3}

and

E_{4}

such that

β_{k}^{0} \neq 0

if

k \in E_{3}

and

β_{k}^{0} = 0

if

k \in E_{4}

.

We first study the oracle property of the group bridge estimator at the given quantile level

τ

. For any vector

ν = {(ν_{1}, \dots, ν_{m})}^{T}

, denote its

L_{1}

norm by

{∥ ν ∥}_{1} = \sum_{i = 1}^{m} | ν_{i} |

. Let

Σ_{n} = n^{- 1} \sum_{i = 1}^{n} x_{i} x_{i}^{T}

. We assume that

(A5): $C_{n}^{*} = {max}_{k} \sum_{j = 1}^{J} I (k \in B_{j})$ is bounded and

$\frac{λ_{n}^{2}}{n} \sum_{j = 1}^{J_{1}} b_{j}^{2} ∥ β_{B_{j}}^{0} ∥_{1}^{2 - 2 γ} | B_{j} | \leq p_{n} M_{n}, M_{n} = O (1),$

where the constants $b_{j}$ satisfy ${min}_{1 \leq j \leq J} b_{j} \geq 1$ and $λ_{n} / (n^{γ (1 - c) / 2 + c}) \to \infty$ as $n \to \infty$ .
(A6): For fixed unknowns ${E_{1}, β_{E_{1}}^{0}, J_{1}}$ ,

$λ_{n} n^{- 1 / 2} \to λ_{0}, \sum_{j = 1}^{J_{1}} b_{j}^{2} = O (1) .$

Conditions (A5) and (A6) control the tuning parameter

λ_{n}

, the number of variables within each group, and the magnitude of the true parameters in nonzero groups. Condition (A5) is a simplified version of Assumption 3 in [11] based on Conditions (A2) and (A4). Together with Condition (A2) and

\sum_{j = 1}^{J_{1}} b_{j}^{2} = O (1)

in Condition (A6), we have

ϕ_{max} (Σ_{τ}) + 1 / ϕ_{min} (Σ_{n}) + \sum_{j = 1}^{J} b_{j}^{2} = O (1)

. If we further assume

p_{n}^{4} = o (n)

, Condition (A5) still holds with

λ_{n} / (n^{γ / 2} p_{n}^{1 - γ / 2}) \to \infty

.

The following theorem provides the large sample theory of the group bridge estimator at the given quantile level

τ

.

Theorem 2.

Assume

v = 0

in (2). Suppose that Conditions (A1)–(A6) hold. Then, we have

(i): Consistency: $∥ \hat{β} - β_{0} ∥ = O_{p} (\sqrt{p_{n} / n})$ .
(ii): Group variable selection consistency: $Pr {{\hat{β}}_{E_{2}} = 0} \to 1$ .
(iii): Asymptotic distribution: for fixed unknowns ${E_{1}, β_{E_{1}}^{0}, J_{1}}$ ,

$\sqrt{n} ({\hat{β}}_{E_{1}} - β_{E_{1}}^{0}) \to_{d} argmin {V_{1} (u) : u \in R^{| E_{1} |}},$

where

$\begin{matrix} V_{1} (u) & = & u^{T} W + \frac{1}{2} f (0) u^{T} Υ_{11} u \\ + γ λ_{0} \sum_{j = 1}^{J_{1}} b_{j} ∥ β_{B_{j}}^{0} ∥_{1}^{γ - 1} \sum_{k \in B_{j} \cap E_{1}} {u_{k} sgn (β_{k}^{0}) I (β_{k}^{0} \neq 0) + | u_{k} | I (β_{k}^{0} = 0)}, \end{matrix}$

with W following $N (0, τ (1 - τ) Υ_{11})$ , $Υ_{11}$ the leading $| E_{1} | \times | E_{1} |$ submatrix of Υ with $β_{E_{2}}^{0} = 0$ , and Υ satisfying $n^{- 1} \sum_{i = 1}^{n} x_{i} x_{i}^{T} \to Υ$ as $n \to \infty$ . In particular, if $λ_{0} = 0$ , then

$\sqrt{n} ({\hat{β}}_{E_{1}} - β_{E_{1}}^{0}) \to_{d} f^{- 1} (0) Υ_{11}^{- 1} W \sim N (0, τ (1 - τ) f^{- 2} (0) Υ_{11}^{- 1}),$

where $\to_{d}$ denotes convergence in distribution.

Theorem 2 demonstrates the asymptotic oracle property in group selection. Moreover, the estimator of coefficients in non-zero groups is

\sqrt{n / p_{n}}

-consistent and, in general, converges to the argmin of the Gaussian process

V_{1}

.

From Theorem 2, we see that the group bridge method can consistently select nonzero group variables but may not effectively remove all unimportant variables within these groups. However, this issue can be addressed by using

v > 0

in (2), i.e., the adaptive group bridge penalty, which can consistently eliminate zero individual variables within nonzero groups by setting the corresponding weights to be large. To establish the oracle property of the adaptive group bridge, we need the following conditions:

(A7): For some $v_{1}$ and $v_{2}$ such that $0 < v_{1} < 1$ , $0 < v_{2}$ , $v_{2} / (1 - v_{1}) < v$ , ${min}_{j \in E_{3}} | β_{j}^{0} | = O_{p} {{(p_{n} / n)}^{v_{1} / 2}}$ , ${max}_{j} | B_{j} \cap E_{3} | = O {{(n / p_{n})}^{v_{2} / 2}}$ , and

$\sum_{j = 1}^{J_{1}} b_{j} \{{(\sum_{k \in B_{j} \cap E_{3}} {| β_{j}^{0} |}^{1 - v})}^{γ - 1} \sum_{k \in B_{j} \cap E_{3}} {| β_{j}^{0} |}^{- v}\} = O_{p} (\sqrt{p_{n}}) .$
(A8): $λ_{n} n^{- 1 / 2} \to 0$ , $\sqrt{n / p_{n}} {\tilde{β}}_{j} = O_{p} (1)$ , $min {λ_{n} n^{(v - 1) / 2 - c (1 + v) / 2}, λ_{n} n^{γ (1 - c) (v - 1) / 2 - c}} \to \infty$ .

Conditions (A7) and (A8) allow

p_{n}

to diverge as

n \to \infty

. Condition (A7) controls the number of nonzero parameters and also represents the minimal signal strength condition, requiring that the smallest magnitude of nonzero parameters diminishes to zero at a rate slower than

n^{- 1 / 2}

. Condition (A8) restricts

λ_{n}

and v as

n \to \infty

to prove the oracle property. Specifically, Condition (A8) implies that v and

γ

satisfy

v > min {1 + (2 c - 1) / (γ (1 - c)), c / (1 - c)}

and

(1 - 2 c) / (1 - c) < γ < 1

, given

c \in (0, 1 / 2)

. If

p_{n}^{4} = o (n)

, the third part of Condition (A8) becomes

min {λ_{n} n^{(v - 1) / 2} p_{n}^{- (1 + v) / 2}, λ_{n} n^{γ (v - 1) / 2} p_{n}^{- 1 + γ (1 - v) / 2}} \to \infty

.

The following theorem establishes the oracle property of the adaptive group bridge quantile estimator at a given quantile level

τ

.

Theorem 3.

Assume

v > 0

in (2). Suppose that Conditions (A1)–(A4), (A7), and (A8) hold. Then, we have

(i): Consistency: $∥ \hat{β} - β^{0} ∥ = O_{p} (\sqrt{p_{n} / n})$ .
(ii): Bi-level variable selection consistency: $Pr {{\hat{β}}_{E_{4}} = 0} = 1$ .
(iii): Asymptotic distribution: for fixed unknowns ${E_{3}, β_{E_{3}}^{0}}$ ,

\sqrt{n} ({\hat{β}}_{E_{3}} - β_{E_{3}}^{0}) \to_{d} N (0, τ (1 - τ) f^{- 2} (0) Υ_{11}^{* - 1}),

where

Υ_{11}^{*}

is the leading

| E_{3} | \times | E_{3} |

submatrix of Υ with

β_{E_{4}}^{0} = 0

.

Theorem 3 demonstrates the oracle property of the adaptive group bridge estimator. The first part of Theorem 3 presents the convergence rate of this estimator. The second part shows that the adaptive group bridge consistently identifies not only important groups but also significant within-group variables. Moreover, the third part indicates that when the number of nonzero variables is fixed, the asymptotic distributions of the adaptive group bridge estimators are asymptotically equivalent to those obtained from the truly underlying variables at both the group and within-group levels.

The tuning parameter

λ

defined in (2) controls the trade-off between the goodness of fit and the model complexity. It is of great importance to select an optimal

λ_{o p t}

to achieve bi-level variable selection consistency in QR models. To this end, we here employ the following BIC-type criterion proposed by [29]:

BIC (λ_{n}, τ) = log (n^{- 1} \sum_{i = 1}^{n} ρ_{τ} (Y_{i} - x_{i}^{T} β_{τ})) + max {1, log (p_{n})} \frac{log (n) {\hat{d}}_{n} (λ_{n})}{2 n},

where

{\hat{d}}_{n} (λ_{n})

is the number of nonzero estimates given

λ_{n}

. Given a range of the tuning parameter values, the optimal tuning parameter

λ_{o p t}

is selected as the minimizer of

BIC (λ_{n}, τ)

.

4. Numerical Studies

4.1. Simulation Studies

We employed simulation studies to evaluate the finite-sample performance of the adaptive bi-level variable selection in QR models. Two scenarios were considered in our simulations. In Experiment 1, the group sizes were uniform, each comprising the same number of covariates. Conversely, in Experiment 2, the group sizes varied. Notably, both experiments included scenarios where some groups contained a mix of zero and nonzero coefficients. This setup allowed for the examination of the effectiveness of the proposed variable selection methods under different structural complexities and varying degrees of sparsity within the groups. The sample size was

n = 200

in each example.

Experiment 1. In this experiment, there were eight groups, each consisting of five covariates. The covariate vector was

x = {(x_{1}^{T}, \dots, x_{8}^{T})}^{T}

, where

x_{j} = {(X_{5 (j - 1) + 1}, \dots, X_{5 (j - 1) + 5})}^{T}

for

1 \leq j \leq 8

. To generate

x

, we first generated 40 independent random variables

R_{1}, \dots, R_{40}

from the standard normal distribution. Then,

Z_{j}

(

j = 1, \dots, 8

) were simulated from a multivariate normal distribution with mean zero and

cov (Z_{j_{1}}, Z_{j_{2}}) = 0 . 5^{| j_{1} - j_{2} |}

. Thus, the covariates

X_{1}, \dots, X_{40}

were simulated as

X_{5 (j - 1) + k} = \frac{Z_{j} + R_{5 (j - 1) + k}}{\sqrt{2}}, 1 \leq j \leq 8, 1 \leq k \leq 5 .

Finally, the response variable Y was simulated as

\begin{matrix} Y = \sum_{i = 1}^{8} x_{i}^{T} β_{i} + σ (x) (e - Q_{e} (τ)), \end{matrix}

(5)

where

Q_{e} (τ)

is the conditional

τ

-quantile of e, and e is generated from

N (0, 1)

. We considered two cases for

σ (x)

: (i)

σ (x) = 1

; and (ii)

σ (x) = 0.5 + | x_{1} |

, which was used to investigate the effect of heteroscedasticity. Additionally, two different types of coefficient vectors were considered:

(a): $β_{1} = {(0.5, 1, 1.5, 2, 2.5)}^{T}$ , $β_{2} = {(2, 2, 2, 2, 2)}^{T}$ , $β_{3} = {(- 1.5, 1, 0.5, 0.5, 0.5)}^{T}$ , $β_{4} = \dots = β_{8} = 0$ .
(b): $β_{1} = {(0.5, 1, 1.5, 2, 2.5)}^{T}$ , $β_{2} = {(2, 2, 2, 2, 2)}^{T}$ , $β_{3} = {(- 1.5, 1, 0, 0, 0)}^{T}$ , $β_{4} = {(- 1.5, 1, 1.5, 0, 0)}^{T}$ , $β_{5} = \dots = β_{8} = 0$ .

Under the above settings, scenario (a) assumed that all coefficients within each group were either all nonzero or all zero. This scenario was designed to evaluate the finite sample performance of the proposed bi-level variable selection method at the group level, in comparison to several well-known variable selection methods for QR models, which operate at either the individual or group level. In scenario (b), however, some coefficients within a nonzero group, such as groups 3 and 4, were equal to zero. This setting was specifically designed to assess the performance of bi-level variable selection, as traditional methods that focus solely on individual or group-level selection may produce suboptimal results in such cases.

Experiment 2. In this experiment, the group sizes varied across groups. Specifically, there were four groups with size five and four groups with size three. The covariate vector was

x = {(x_{1}^{T}, \dots, x_{8}^{T})}^{T}

, where

x_{j} = {(X_{4 (j - 1) + 1}, \dots, X_{4 (j - 1) + 5})}^{T}

for

1 \leq j \leq 4

, and

x_{j} = {(X_{3 (j - 1) + 21}, \dots, X_{3 (j - 1) + 23})}^{T}

for

5 \leq j \leq 8

. To generate

X_{1}, \dots, X_{32}

, we first generated 32 random variables

R_{1}, \dots, R_{32}

from

N (0, 1)

. Then,

Z_{j}

(

j = 1, \dots, 8

) were generated from a multivariate normal distribution with mean zero and

cov (Z_{j_{1}}, Z_{j_{2}}) = 0 . 5^{| j_{1} - j_{2} |}

. The covariates

X_{1}, \dots, X_{40}

were then simulated as

\begin{matrix} X_{5 (j - 1) + k} = \frac{Z_{j} + R_{5 (j - 1) + k}}{\sqrt{2}}, 1 \leq j \leq 4, 1 \leq k \leq 5, \\ X_{3 (j - 5) + 20 + k} = \frac{Z_{j} + R_{3 (j - 5) + 20 + k}}{\sqrt{2}}, 5 \leq j \leq 8, 1 \leq k \leq 3 . \end{matrix}

Finally, the response variable y was simulated as in (5). Additionally, we here considered two different group structures of coefficients in model (5), as follows:

(a): $β_{1} = {(0.5, 1, 1.5, 2, 2.5)}^{T}$ , $β_{2} = {(2, 0, 0, 2, 2)}^{T}$ , $β_{3} = β_{4} = 0$ , $β_{5} = (- 1, - 2, - 3)$ , $β_{6} = β_{7} = β_{8} = 0$ .
(b): $β_{1} = {(0.5, 1, 1.5, 2, 2.5)}^{T}$ , $β_{2} = {(2, 2, 2, 2, 2)}^{T}$ , $β_{3} = {(- 1.5, 2, 0, 0, 0)}^{T}$ , $β_{4} = 0$ , $β_{5} = (2, - 2, 1)$ , $β_{6} = (0, - 3, 1.5)$ , $β_{7} = (2, 0, 0)$ , $β_{8} = 0$ .

These settings reflected the different structural characteristics in the coefficient vectors. In scenario (a), some groups had all zero coefficients, while others had nonzero coefficients, with the presence of zero coefficients within nonzero groups in scenario (b).

The proposed estimation procedures were applied to the simulated model (5) at the quantile levels

τ = 0.25, 0.50

, and

0.75

, respectively. For each of the six combinations of quantile levels and parameter settings, 1000 datasets

{(Y_{i}, x_{i}) : i = 1, \dots, 200}

were independently generated, following the data generation processes outlined in Experiments 1 and 2. For each dataset, the adaptive group bridge quantile estimator with

v = 0

and

γ = 1 / 2

, denoted by GB, and the adaptive group bridge quantile estimator with

v = 1

and

γ = 1 / 2

, denoted by AGB, were computed. In addition, we evaluated the mean square error (MSE), which was calculated by

MSE = ∥ {\hat{β}}^{i} - β^{0} ∥^{2}

, where

{\hat{β}}^{i}

is the estimator of

β^{0}

evaluated on the ith dataset for a given

τ

.

The performance of the proposed AGB methods was compared with two existing variable selection techniques. The first technique was the smoothly clipped absolute deviation (SCAD) method, as developed by [27]. This method integrates the SCAD penalty into the QR loss function to achieve effective individual variable selection. The second technique was the adaptive group lasso quantile estimator (AGL), as described in [20]. The AGL method employs an adaptive lasso penalty at the group level within the QR framework to improve the identification of significant covariate groups.

The results for 1000 repetitions in each of the six cases are reported in Table 1 and Table 2, which correspond to the scenarios of homoscedasticity (i.e.,

σ (x) = 1

) and heteroscedasticity (i.e.,

σ (x) = 0.5 + | X_{1} |

), respectively. In the tables, the notations “NG” and “NV” denote the average number of groups and individual variables selected by each variable selection method, respectively. The notations “%CG” and “%CI” represent the proportions that the corresponding variable selection method correctly identified as nonzero group variables and nonzero individual variables for the underlying model, respectively. “MSER” denotes the ratio of the median MSE of each variable selection method to that of the oracle estimator. The oracle values for these measures are also listed in Table 1 and Table 2. Clearly, the closer a method’s result is to the oracle value, the better its performance.

Table 1. Simulation results for Experiment 1.

Table 2. Simulation results for Experiment 2.

From Table 1 and Table 2, several key insights can be derived: (1) In scenario (a) of Experiment 1 designed for group variable selection, the proposed methods AGB and GB were comparable to AGL if all individual variables within each group were nonzero. Both AGB and GB methods effectively identified the true nonzero and zero groups, with mean group sizes closely approximating the actual number of true groups, indicating that the proposed bi-level variable selection procedures performed robustly at the group level; (2) When nonzero individual variables were present within a nonzero group, as observed in scenario (b) of Experiment 1 and in both scenarios (a) and (b) of Experiment 2, the AGB method outperformed AGL in terms of correctly identifying the magnitudes of nonzero variables. Additionally, the AGB method surpassed GB in accurately identifying nonzero individual variables, demonstrating the effectiveness of the proposed methods in individual-level variable selection; (4) The SCAD method exhibited poor performance across all considered settings, because SCAD is primarily designed for individual variable selection and does not extract group structure information; (5) When sparsity existed at both group and with-in group levels, the AGB method was superior to the other competitors listed in Table 1 and Table 2 in terms of estimation and variable selection performance in most cases, even when heteroscedasticity was present in the response data; (6) The proposed methods demonstrated resistance to variations in the number of individuals within a group, as evidenced by the results from the two experiments, underscoring the versatility and reliability of the bi-level variable selection approach. Overall, the findings reveal that the adaptive group bridge method for QR was capable of achieving both group and within-group variable selection, even in the presence of heterogeneity and variation in the number of individuals within groups, and competed effectively with a number of existing bi-level variable selection methods.

4.2. An Example

In this section, the Birthwt dataset, collected at Baystate Medical Center, Springfield, Massachusetts, in 1986, was utilized to illustrate the effectiveness of the proposed methods. The Birthwt dataset consists of 189 observations, with 16 predictors and an outcome variable, birth weight, which is available both as a continuous measure and as a binary indicator for low birth weight. In this analysis, the birth weight in kilograms was taken as the response variable Y, while the other 16 variables served as covariates. These covariates

x

were divided into eight groups, as outlined below:

age1 ( $X_{1}$ ), age2 ( $X_{2}$ ), age3 ( $X_{3}$ ): Orthogonal polynomials of the first, second, and third degree, representing the mother’s age in years.
lwt1 ( $X_{4}$ ), lwt2 ( $X_{5}$ ), lwt3 ( $X_{6}$ ): Orthogonal polynomials of the first, second, and third degree, representing the mother’s weight in pounds at the last menstrual period.
white ( $X_{7}$ ), black ( $X_{8}$ ): Indicator variables for the mother’s race; “other” serves as the reference group.
smoke ( $X_{9}$ ): Smoking status during pregnancy.
ptl1 ( $X_{10}$ ), ptl2 ( $X_{11}$ ): Indicator variables for one or for two or more previous premature labors, respectively. No previous premature labor serves as the reference category.
ht ( $X_{12}$ ): History of hypertension.
ui ( $X_{13}$ ): Presence of uterine irritability.
ftv1 ( $X_{14}$ ), ftv2 ( $X_{15}$ ), ftv3m ( $X_{16}$ ): Indicator variables for one, for two, or for three or more physician visits during the first trimester, respectively. No visits serves as the reference category.

The primary objective of this study was to investigate whether Y was related to the covariates

(X_{1}, \dots, X_{16})

. To achieve this, a QR model was employed to fit the dataset. Specifically, the

τ

-th conditional quantile of

Y_{i}

was assumed to be

Q_{Y} (τ | x_{i}) = x_{i}^{T} β_{τ}, i = 1, \dots, 189,

where

Q_{Y} (τ | x_{i})

is the

τ

-th quantile of Y given

x_{i}

. Under this assumption, the group structures were defined as follows:

B_{1} = {1, 2, 3}

,

B_{2} = {4, 5, 6}

,

B_{3} = {7, 8}

,

B_{4} = {9}

,

B_{5} = {10, 11}

,

B_{6} = {12}

,

B_{7} = {13}

, and

B_{8} = {14, 15, 16}

. The group settings for

B_{1}

and

B_{2}

were designed to examine whether age and mother’s weight have linear or nonlinear effects on birth weight, respectively. The same rationale applies to the group settings for

B_{3}

,

B_{5}

and

B_{8}

. Three different quantile levels,

τ = 0.25

,

0.50

, and

0.75

, were considered. Similarly to in the simulation studies, the adaptive group lasso (AGL) estimator and the group bridge estimator proposed by [11] (denoted as GB-LS) were computed for comparison.

For each specified quantile level

τ

, the GB and AGB estimators of the coefficient vector

β_{τ}

were computed. To assess model performance, the mean absolute prediction error (APE) was defined by

APE = \frac{1}{189} \sum_{i = 1}^{189} |Y_{i} - {\hat{Y}}_{i}|,

where

{\hat{Y}}_{i} = x_{i}^{T} {\hat{β}}_{τ}

represents the predicted value based on the estimated coefficients

{\hat{β}}_{τ}

.

Table 3 presents the point estimates of the parameters

β_{τ}

and the APE values for the four different methods. An analysis of Table 3 reveals several key findings: (i) While the GB-LS method focused on variable selection based on their effects on the mean, the other methods selected different variables across various quantiles. For example, at the lower quantile (

τ = 0.25

), the AGB and GB methods selected ht and ui as significant variables, whereas these variables became less significant at higher quantiles. This suggests that ht and ui had a greater impact on lower birth weights, but their influence diminished for higher birth weights; (ii) The AGL method exclusively performed selection at the group level, while the GB and AGB methods were capable of identifying variables at both the group and individual levels. For instance, at

τ = 0.25

, AGL treated the group of physician visits as insignificant, whereas AGB selected only ftv1 as insignificant, demonstrating a more precise selection at the individual level; (iii) When comparing APE values, the AGB and GB methods often yielded similar results. However, AGB typically produced sparser models. For instance, at

τ = 0.75

, AGB selected fewer variables than GB, while maintaining comparable APE values, indicating the efficiency of the AGB method in balancing model sparsity and predictive accuracy; (iv) The AGL method occasionally failed to identify relevant groups at certain quantile levels. For example, at

τ = 0.75

, AGL did not select the physician visit group (ftv1, ftv2, ftv3), whereas AGB identified ftv1 and ftv3 as significant. This highlighted AGB’s ability to adapt across different quantiles and capture the varying effects of covariates throughout the birth weight distribution; (v) age group had a significant impact on birth weight across all three quantile levels considered. Specifically, we observed no linear relationship between age and birth weight, as the coefficient of

age 1

was zero. However,

age 2

and

age 3

emerged as important predictors, indicating a significant nonlinear effect on birth weight. A similar pattern was observed within groups

B_{2}

,

B_{5}

, and

B_{8}

, suggesting complex, non-linear dynamics in their influence on birth weight. Overall, these findings underscored the effectiveness and flexibility of the proposed AGB method for handling the bi-level variable selection problem in QR models, particularly when considering the effects of differing quantile levels.

Table 3. Results of the real data analysis.

5. Conclusions and Discussion

In this paper, we introduced an adaptive bi-level variable selection method for QR models with a diverging number of covariates. The method employs an adaptive penalty that enables simultaneous selection at both the group and individual levels, addressing challenges related to sparsity, heterogeneity, and skewness in data. Through rigorous theoretical analysis, we established the asymptotic properties of the proposed estimators, confirming their consistency and efficiency. The simulation studies demonstrated the superior performance of the proposed method, particularly in scenarios where both group-level and within-group sparsity existed. The adaptive bi-level method consistently outperformed traditional variable selection techniques in terms of selecting the correct groups and identifying the most relevant individual variables within those groups. Additionally, the real data application from the Birthwt dataset further validated the method’s practical utility. It effectively identified key covariates influencing birth weight at different quantiles, offering improved interpretability and predictive accuracy across various quantile levels. Overall, the findings suggest that the adaptive bi-level method is a robust and flexible approach to variable selection in complex, high-dimensional QR models.

Throughout this study, the dimension of parameters in QR models was assumed to grow with the sample size. However, extending the proposed procedure to high-dimensional settings, where the number of covariates exceeds the sample size, is of significant interest. In such cases, further investigation into the bi-level variable selection procedure, in terms of both theory and optimization, would be necessary for high-dimensional QR models with grouped variables. In addition, the convolution-type smoothing techniques of [30,31] could be employed to achieve bi-level variable selection in high-dimensional QR models under high dimensionality. These interesting extensions are beyond the scope of the present paper, and are left for future research.

Author Contributions

Conceptualization, X.D. and Z.Y.; methodology, X.D. and Z.Y.; validation, X.D.; formal analysis, Z.Y.; investigation, X.D.; writing— original draft, Z.Y.; writing—review and editing, X.D. and Z.Y.; supervision, Z.Y.; project administration, Z.Y.; funding acquisition, X.D. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Yunnan Fundamental Research Projects (Grant Number 202401AS070152), the National Natural Science Foundation of China (Grant Number 12001244), and the Major Basic Research Project of the Natural Science Foundation of the Jiangsu Higher Education Institutions (Grant Number 19KJB110007).

Data Availability Statement

The real data that are used to illustrate the proposed methods are available at https://search.r-project.org/CRAN/refmans/fic/html/birthwt.html (accessed on 16 October 2024).

Acknowledgments

The authors wish to thank the Editor-in-Chief, the Associate Editor and three reviewers for their many helpful and insightful comments and suggestions that greatly improved the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Proof of Theorem 1.

Let

Q_{n} (β) = \sum_{i = 1}^{n} ρ_{τ} (Y_{i} - x_{i}^{T} β)

. It is required to show that for all

ε > 0

, there exists a constant

C_{ε}

sufficiently large such that, for sufficiently large n,

\Pr [inf_{∥ u ∥ = 1} Q_{n} (β_{0, τ} + C_{ε} \sqrt{\frac{p_{n}}{n}} u) > Q_{n} (β_{0, τ})] \geq 1 - ε .

(A1)

To achieve this, for some constant

C > 0

, consider the expectation of the difference:

E [Q_{n} (β^{0} + C \sqrt{\frac{p_{n}}{n}} u) - Q_{n} (β^{0})] = \sum_{i = 1}^{n} E [ρ_{τ} (e_{i} - C \sqrt{\frac{p_{n}}{n}} x_{i}^{T} u) - ρ_{τ} (e_{i})] .

(A2)

This expectation can be rewritten as

\begin{matrix} \sum_{i = 1}^{n} E [\int_{0}^{C \sqrt{\frac{p_{n}}{n}} x_{i}^{T} u} I (0 < e_{i} < t) d t] = \sum_{i = 1}^{n} E [\int_{0}^{C \sqrt{\frac{p_{n}}{n}} x_{i}^{T} u} [F (t) - F (0)] d t], \end{matrix}

where

F (t)

denotes the cumulative distribution function of the errors

e_{i}

.

Given condition (A4),

p_{n} / n \to 0

. Additionally, under condition (A3),

{(p_{n} / n)}^{1 / 2} x_{i}^{T} u \to 0

as

∥ u ∥ = 1

. Using the mean value theorem and the fact that the density f has a bounded first derivative in the neighborhood of 0, it follows that

E [Q_{n} (β^{0} + C \sqrt{\frac{p_{n}}{n}} u) - Q_{n} (β^{0})] = C^{2} \frac{p_{n} f (0)}{2 n} \sum_{i = 1}^{n} {(x_{i}^{T} u)}^{2} + o (\frac{p_{n}}{n} \sum_{i = 1}^{n} u^{T} x_{i} x_{i}^{T} u) .

From condition (A2), it holds that

\frac{1}{n} E [Q_{n} (β^{0} + C \sqrt{\frac{p_{n}}{n}} u) - Q_{n} (β^{0})] = C^{2} \frac{p_{n} f (0)}{2 n} \sum_{i = 1}^{n} {(x_{i}^{T} u)}^{2} (1 + o (1)) .

Next, define the random variable

R_{i} = ρ_{τ} (e_{i} - C \sqrt{\frac{p_{n}}{n}} x_{i}^{T} u) - ρ_{τ} (e_{i}) + C \sqrt{\frac{p_{n}}{n}} D_{i} x_{i}^{T} u

, and the random vector

W_{i} = C \sqrt{\frac{p_{n}}{n}} D_{i} x_{i}^{T}

, where

D_{i} = (1 - τ) I (e_{i} < 0) - τ I (e_{i} \geq 0)

. Then, the process

Q_{n}

can be expressed as

\begin{matrix} Q_{n} (β^{0} + C \sqrt{\frac{p_{n}}{n}} u) - Q_{n} (β^{0}) = & E [Q_{n} (β^{0} + C \sqrt{\frac{p_{n}}{n}} u) - Q_{n} (β^{0})] \\ + \sum_{i = 1}^{n} W_{i} u + \sum_{i = 1}^{n} (R_{i} - E (R_{i})) . \end{matrix}

(A3)

Given that

| R_{i} | < C \sqrt{\frac{p_{n}}{n}} | x_{i}^{T} u | I (e_{i} < C \sqrt{\frac{p_{n}}{n}} | x_{i}^{T} u |)

, together with condition (A3), it follows that

E [I (e_{i} < C \sqrt{\frac{p_{n}}{n}} | x_{i}^{T} u |)] \leq C \sqrt{\frac{p_{n}}{n}} ∥ x_{i} ∥ \leq C \sqrt{\frac{p_{n}}{n}} max_{1 \leq i \leq n} ∥ x_{i} ∥,

which implies

E {[\sum_{i = 1}^{n} (R_{i} - E (R_{i}))]}^{2} \leq \sum_{i = 1}^{n} E [{(R_{i} - E (R_{i}))}^{2}] \leq \sum_{i = 1}^{n} E (R_{i}^{2}) .

Using conditions (A1)–(A3), it follows that

E [\sum_{i = 1}^{n} {(R_{i} - E (R_{i}))}^{2}] \leq C^{2} \frac{p_{n}}{n} {| x_{i}^{T} u |}^{2} E [I (e_{i} < C \sqrt{\frac{p_{n}}{n}} | x_{i}^{T} u |)] \leq o_{p} (\frac{p_{n}}{n} \sum_{i = 1}^{n} u^{T} x_{i} x_{i}^{T} u) = o (p_{n}) .

Defining the random variable

U_{n} = p_{n}^{- 1 / 2} \sum_{i = 1}^{n} (R_{i} - E (R_{i}))

, it follows that

E (U_{n}^{2}) = o (1)

. This, along with

E (U_{n}) = 0

, implies by the Bienaymé–Tchebychev inequality that

U_{n} \to 0

as

n \to \infty

. Consequently,

\sum_{i = 1}^{n} (R_{i} - E (R_{i})) = o_{p} (p_{n}^{1 / 2})

. Therefore, Equation (A3) can be expressed as

Q_{n} (β^{0} + C \sqrt{\frac{p_{n}}{n}} u) - Q_{n} (β^{0}) = E [Q_{n} (β^{0} + C \sqrt{\frac{p_{n}}{n}} u) - Q_{n} (β^{0})] + \sum_{i = 1}^{n} W_{i} u + o_{p} (p_{n}^{1 / 2}) .

This implies that

\begin{matrix} Q_{n} (β^{0} + C \sqrt{\frac{p_{n}}{n}} u) - Q_{n} (β^{0}) & = E [\frac{f (0)}{2} C^{2} p_{n} (\frac{1}{n} \sum_{i = 1}^{n} u^{T} x_{i} x_{i}^{T} u)] (1 + o (1)) \\ + C p_{n}^{1 / 2} (n^{- 1 / 2} \sum_{i = 1}^{n} D_{i} x_{i}^{T}) u + o_{p} (p_{n}^{1 / 2}) . \end{matrix}

Using the central limit theorem,

n^{- 1 / 2} \sum_{i = 1}^{n} D_{i} x_{i}^{T} u

converges in distribution to a centered Gaussian distribution, since

E [D_{i}] = 0

and

Var (D_{i} x_{i}^{T} u) = τ (1 - τ) u^{T} x_{i} x_{i}^{T} u

with

∥ u ∥ = 1

.

Taking into account conditions (A2) and (A4), for a sufficiently large constant C, it follows that

Q_{n} (β^{0} + C \sqrt{\frac{p_{n}}{n}} u) - Q_{n} (β^{0}) = f (0) C^{2} p_{n} (\frac{1}{n} \sum_{i = 1}^{n} u^{T} x_{i} x_{i}^{T} u) (1 + o (1)) > 0

for sufficiently large n. Therefore, inequality (A1) is satisfied, considering conditions (A1) and (A2). □

Proof of Theorem 2.

We begin by establishing part (i) of Theorem 2. Let

α_{n} = \sqrt{p_{n} / n}

and define

B (C) = {β : β = β^{0} + C α_{n} u, ∥ u ∥ = 1}

. It suffices to show that, for any

ε > 0

, there exists a sufficiently large C such that

\Pr \{inf_{β \in B (C)} G_{n} (β) > G_{n} (β^{0})\} \geq 1 - ε .

Considering that

\frac{1}{n} G_{n} (β) - \frac{1}{n} G_{n} (β^{0}) = \frac{1}{n} Q_{n} (β) - \frac{1}{n} Q_{n} (β^{0}) + D_{n} (C),

where

D_{n} (C) = \frac{λ_{n}}{n} (\sum_{j = 1}^{J} b_{j} ∥ β_{B_{j}} ∥_{1}^{γ} - \sum_{j = 1}^{J} b_{j} {∥ β_{B_{j}}^{0} ∥}_{1}^{γ}) .

We apply Knight’s identity (Knight and Fu [32]) for any two scalars w and v, yielding

ρ_{τ} (w - v) - ρ_{τ} (w) = - v (τ - I (w \leq 0)) + \int_{0}^{v} [I (w \leq t) - I (w \leq 0)] d t .

(A4)

The difference between

Q_{n} (β)

and

Q_{n} (β^{0})

can thus be expressed as

\begin{matrix} Q_{n} (β) - Q_{n} (β^{0}) & = - C \sum_{i = 1}^{n} α_{n} x_{i}^{T} u (τ - I (e_{i} \leq 0)) \\ + \sum_{i = 1}^{n} \int_{0}^{C α_{n} x_{i}^{T} u} [I (e_{i} \leq t) - I (e_{i} \leq 0)] d t \\ : = I_{1} + I_{2} . \end{matrix}

Given that

I_{1} \leq C α_{n} ∥ u ∥ \sum_{i = 1}^{n} (τ - I (e_{i} \leq 0)) x_{i}

and

|τ - I (e_{i} \leq 0)| \leq 1

, we obtain

\begin{matrix} E {∥\sum_{i = 1}^{n} (τ - I (e_{i} \leq 0)) x_{i}∥}^{2} & = E [\sum_{j = 1}^{p_{n}} \sum_{i = 1}^{n} \sum_{l = 1}^{n} (τ - I (e_{i} \leq 0)) (τ - I (e_{j} \leq 0)) x_{i j} x_{l j}] \\ = \sum_{j = 1}^{p_{n}} \sum_{i = 1}^{n} E [{(τ - I (e_{i} \leq 0))}^{2} x_{i j}^{2}] = O_{p} (n p_{n}) . \end{matrix}

Thus,

| I_{1} |

is of order

C α_{n}^{2} n

. For

I_{2}

, using the proof of Theorem 1, it holds that

E (I_{2}) = \frac{f (0)}{2} C^{2} α_{n}^{2} n [\frac{1}{n} \sum_{i = 1}^{n} {(x_{i}^{T} u)}^{2}] + o (1) α_{n}^{2} n [\frac{1}{n} \sum_{i = 1}^{n} u^{T} (x_{i} x_{i}^{T}) u] .

This implies that

I_{2}

is of order

C^{2} α_{n}^{2} n

. Therefore, by choosing C sufficiently large,

I_{2}

dominates

I_{1}

uniformly for

∥ u ∥ = 1

.

For the lower bound of

D_{n} (C)

, consider the case where

{∥ β_{B_{j}}^{0} ∥}_{1} \geq {∥ β_{B_{j}} ∥}_{1}

. Since

b^{γ} - a^{γ} \leq 2 (b - a) b^{γ - 1}

for

0 \leq a \leq b

, it follows that

\begin{matrix} \sum_{j = 1}^{J} b_{j} ∥ β_{B_{j}}^{0} ∥_{1}^{γ} - \sum_{j = 1}^{J} b_{j} {∥ β_{B_{j}} ∥}_{1}^{γ} & \leq 2 \sum_{j = 1}^{J} b_{j} {∥ β_{B_{j}}^{0} ∥}_{1}^{γ - 1} {(| B_{j} | ∥ β_{B_{j}} - β_{B_{j}}^{0} ∥^{2})}^{1 / 2} \\ \leq 2 η_{n} {(\sum_{j = 1}^{J} {∥ β_{B_{j}} - β_{B_{j}}^{0} ∥}^{2})}^{1 / 2} \\ \leq 2 η_{n} \sqrt{C_{n}^{*}} {∥ \hat{β} - β^{0} ∥}_{2}, \end{matrix}

where

η_{n}^{2} = \sum_{j = 1}^{J_{1}} b_{j}^{2} ∥ β_{B_{j}}^{0} ∥_{1}^{2 γ - 2} | B_{j} |

and

C_{n}^{*}

is defined in condition (A5). Therefore, it holds that

\begin{matrix} \frac{1}{n} G_{n} (β) - \frac{1}{n} G_{n} (β^{0}) & \geq \frac{f (0)}{2} C^{2} α_{n}^{2} u^{T} [\frac{1}{n} \sum_{i = 1}^{n} x_{i} x_{i}^{T}] u + O_{p} (C α_{n}^{2}) \\ - 2 \frac{λ_{n}}{n} η_{n} \sqrt{C_{n}^{*}} C α_{n} . \end{matrix}

(A5)

Since

- 2 (λ_{n} / n) η_{n} \sqrt{C_{n}^{*}} C α_{n}

is of order

C α_{n}^{2}

, the first term on the right-hand side of (A5) dominates the third term uniformly for

∥ u ∥ = 1

when C is sufficiently large. This completes the proof of part (i) of Theorem 2.

Next, we establish the group selection consistency. Using Theorem 2-(i),

\hat{β}

belongs, with a probability converging to one, to the set

V_{p_{n}} (β_{0}) = {β : ∥ β - β^{0} ∥ \leq B \sqrt{p_{n} / n}}

for sufficiently large

B > 0

. For any

β = {(β_{E_{1}}^{T}, β_{E_{1}^{c}}^{T})}^{T} \in V_{p_{n}} (β_{0})

with

∥ β - β^{0} ∥ = O_{p} (\sqrt{p_{n} / n})

and for all constants

C \in (0, B)

, we show that

G_{n} ({(β_{E_{1}}^{T}, 0^{T})}^{T}) = min_{∥ β_{E_{1}^{c}} ∥ \leq C \sqrt{p_{n} / n}} G_{n} ({(β_{E_{1}}^{T}, β_{E_{1}^{c}}^{T})}^{T}),

(A6)

with a probability tending to one as

n \to \infty

.

Consider the parameter set

W_{n} = {β \in V_{p_{n}} (β^{0}) : ∥ β_{E_{1}^{c}} ∥ > 0}

. We demonstrate that

\Pr {\hat{β} \in W_{n}} \to 0

as

n \to \infty

. Let

\hat{β} = {({\hat{β}}_{E_{1}}^{T}, {\hat{β}}_{E_{1}^{c}}^{T})}^{T} \in W_{n}

and

{\hat{β}}^{(1)} = {({\hat{β}}_{E_{1}}^{(1) T}, {\hat{β}}_{E_{1}^{c}}^{(1) T})}^{T} \in V_{p_{n}} (β^{0})

such that

{\hat{β}}_{E_{1}} = {\hat{β}}_{E_{1}}^{(1)}

and

{\hat{β}}_{E_{1}^{c}}^{(1)} = 0

.

Using the definition of

\hat{β}

, it holds that

Q_{n} (\hat{β}) + λ_{n} \sum_{j = 1}^{J} b_{j} ∥ {\hat{β}}_{B_{j}} ∥_{1}^{γ} \leq Q_{n} ({\hat{β}}^{(1)}) + λ_{n} \sum_{j = 1}^{J} b_{j} {∥ {\hat{β}}_{B_{j}}^{(1)} ∥}_{1}^{γ} .

This leads to

\begin{matrix} \frac{λ_{n}}{n} \sum_{j = J_{1} + 1}^{J} b_{j} {∥ {\hat{β}}_{B_{j}} ∥}_{1}^{γ} & \leq - \frac{1}{n} {(\sum_{i = 1}^{n} x_{i} (τ - I (Y_{i} \leq x_{i}^{T} {\hat{β}}^{(1)})))}^{T} (\hat{β} - {\hat{β}}^{(1)}) \\ + \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{x_{i}^{T} (\hat{β} - {\hat{β}}^{(1)})} [I (Y_{i} - x_{i}^{T} {\hat{β}}^{(1)} \leq t) - I (Y_{i} - x_{i}^{T} {\hat{β}}^{(1)} \leq 0)] d t \\ : = T_{1} + T_{2} . \end{matrix}

(A7)

For

T_{1}

, since the density f is bounded in a neighborhood of 0, it holds that

\begin{matrix} E (T_{1}) & = {(\hat{β} - {\hat{β}}^{(1)})}^{T} \frac{1}{n} \sum_{i = 1}^{n} x_{i} [F (x_{i}^{T} ({\hat{β}}^{(1)} - β^{0})) - F (0)] \\ = {(\hat{β} - {\hat{β}}^{(1)})}^{T} \frac{f (0)}{n} \sum_{i = 1}^{n} x_{i} x_{i}^{T} (β^{0} - {\hat{β}}^{(1)}) (1 + o (1)) . \end{matrix}

Given condition (A2), it follows that

| E (T_{1}) | \leq f (0) ∥ \hat{β} - {\hat{β}}^{(1)} ∥ ∥ \frac{1}{n} \sum_{i = 1}^{n} x_{i} x_{i}^{T} ∥ ∥ β_{0} - {\hat{β}}^{(1)} ∥ (1 + o (1)) = C {∥ \hat{β} - {\hat{β}}^{(1)} ∥}^{2}

. By analogous calculations, using the independence of

{(e_{i})}_{1 \leq i \leq n}

, we have

E (T_{1}^{2}) = C n^{- 1} {∥ \hat{β} - {\hat{β}}^{(1)} ∥}^{3} \to 0

as

n \to \infty

. Since

Var (T_{1}) \leq E (T_{1}^{2})

, applying the Bienaym–Tchebychev inequality yields

T_{1} = C {∥ \hat{β} - {\hat{β}}^{(1)} ∥}^{2} (1 + o (1)) .

For

T_{2}

, we rewrite

T_{2}

as

T_{2} = \frac{1}{n} \sum_{i = 1}^{n} \int_{0}^{x_{i}^{T} (\hat{β} - {\hat{β}}^{(1)})} [F (t - x_{i}^{T} (β^{0} - {\hat{β}}^{(1)})) - F (- x_{i}^{T} (β^{0} - {\hat{β}}^{(1)}))] d t .

Using conditions (A1)–(A3) and the boundedness of

f (x_{i}^{T} ({\hat{β}}^{(1)} - β^{0}))

, we have

E (T_{2}) = C {(\hat{β} - {\hat{β}}^{(1)})}^{T} (\frac{1}{n} \sum_{i = 1}^{n} x_{i} x_{i}^{T}) (\hat{β} - {\hat{β}}^{(1)}) + o {∥ \hat{β} - {\hat{β}}^{(1)} ∥^{2}} = C ∥ \hat{β} - {\hat{β}}^{(1)} ∥ .

Similarly, we can show that

E (T_{2}^{2}) = C n^{- 1} {∥ \hat{β} - {\hat{β}}^{(1)} ∥}^{3} \to 0

. Using the Bienaym–Tchebychev inequality, it follows that

T_{2} = C {∥ \hat{β} - {\hat{β}}^{(1)} ∥}^{2} (1 + o (1)) .

Consequently, we obtain

T_{1} + T_{2} = C ∥ \hat{β} - {\hat{β}}^{(1)} ∥ (1 + o (1)) .

Since a uniform constant

C > 0

exists such that (A7) holds for sufficiently large n, we have

\frac{λ_{n}}{n} \sum_{j = J_{1} + 1}^{J} b_{j} ∥ {\hat{β}}_{B_{j}} ∥_{1}^{γ} \leq C ∥ \hat{β} - {\hat{β}}^{(1)} ∥^{2} = C ∥ {\hat{β}}_{E_{2}} ∥^{2} = C {∥ \hat{β} - β^{0} ∥}^{2} .

Thus, it follows that

λ_{n} \sum_{j = J_{1} + 1}^{J} b_{j} ∥ {\hat{β}}_{B_{j}} ∥_{1}^{γ} \leq C n {∥ \hat{β} - β^{0} ∥}^{2} = O_{p} (p_{n}) .

For the lower bound of

\sum_{j = J_{1} + 1}^{J} b_{j} {∥ {\hat{β}}_{B_{j}} ∥}_{1}^{γ}

, by (A5), we have

\sum_{j = J_{1} + 1}^{J} b_{j} ∥ {\hat{β}}_{B_{j}} ∥_{1}^{γ} \geq {(\sum_{j = J_{1} + 1}^{J} {∥ {\hat{β}}_{B_{j}} ∥}_{1})}^{γ} \geq ∥ {\hat{β}}_{E_{2}} ∥_{1}^{γ} \geq {∥ {\hat{β}}_{E_{2}} ∥}^{γ} .

If

∥ {\hat{β}}_{E_{2}} ∥^{γ} > 0

, then

λ_{n} \leq C n {∥ {\hat{β}}_{E_{2}} ∥}^{2 - γ} \leq p_{n}^{1 - γ / 2} n^{γ / 2} O (1) = n^{c + (1 - c) γ / 2} O (1) .

Hence,

λ_{n} / (n^{c + (1 - c) γ / 2}) \leq O (1)

. Since

λ_{n} / (n^{c + (1 - c) γ / 2}) \to \infty

, from condition (A5), we conclude that

\Pr \{∥ {\hat{β}}_{E_{2}} ∥^{γ} > 0\} \leq \Pr \{\frac{λ_{n}}{n^{c + (1 - c) γ / 2}} \leq O (1)\} \to 0 .

Finally, we establish the asymptotic distribution of

\hat{β}

. Since

{E_{1}, β_{E_{1}}^{0}, J_{1}}

are fixed,

{min}_{j \leq J_{1}} {∥ β_{B_{j}}^{0} ∥}_{1}^{1 - γ} = O (1)

, so that condition (A6) implies condition (A5), and

\frac{λ_{n}^{2}}{n} \sum_{j = 1}^{J_{1}} b_{j}^{2} ∥ β_{B_{j}}^{0} ∥_{1}^{2 - 2 γ} | B_{j} | | B_{j} \cap E_{1} | = O (1) .

Therefore, the proof of Theorem 2-(i) applies with the reduced

X_{1}

and the reduced number

p_{1} = | E_{1} |

of coefficients

{β_{k} : k \in E_{1}}

. Thus,

∥ {\hat{β}}_{E_{1}} - β_{E_{1}}^{0} ∥ = O_{p} (\frac{1}{n}), ∥ \hat{β} - β^{0} ∥ = O_{p} (\frac{1}{n}) .

Let

h_{n} = n^{- 1 / 2}

and

V_{1 n} (a) = G_{n} (β^{0} + h_{n} {(a^{T}, 0^{T})}^{T}) - G_{n} (β^{0})

, where

0

is a zero vector of dimension

E_{2}

, and

a = {(a_{1}, \dots, a_{p_{1}})}^{T}

is a

p_{1}

-dimensional constant vector. Using part (i) of Theorem 2, with large probability approaching one,

\hat{β} - β^{0} = h_{n} {({\hat{a}}^{T}, 0^{T})}^{T}

.

On the other hand,

V_{1 n} (a)

can be rewritten as

\begin{matrix} V_{1 n} & = - h_{n} {({\hat{a}}^{T}, 0^{T})}^{T} \sum_{i = 1}^{n} x_{i}^{T} (τ - I (e_{i} \leq 0)) \\ + \sum_{i = 1}^{n} \int_{0}^{h_{n} x_{i}^{T} {({\hat{a}}^{T}, 0^{T})}^{T}} [I (e_{i} \leq t) - I (e_{i} \leq 0)] d t \\ + λ_{n} \sum_{j = 1}^{J_{1}} b_{j} \{{(\sum_{k \in B_{j} \cap E_{1}} | β_{k}^{0} + h_{n} a_{k} |)}^{γ} - {∥ β^{0} ∥}_{1}^{γ}\} \\ = T_{1 n} (a) + T_{2 n} (a) . \end{matrix}

Following the arguments of [26],

T_{1 n} (a) \overset{d}{\to} a^{T} W + \frac{1}{2} f (0) a^{T} Υ_{11} a

, where

\overset{d}{\to}

denotes convergence in distribution. According to [11], we have

T_{2 n} (a) \to γ λ_{0} \sum_{j = 1}^{n} b_{j} {∥ β_{B_{j}}^{0} ∥}_{1}^{γ - 1} \sum_{k \in B_{j} \cap E_{1}} \{u_{k} sgn (β_{k}^{0}) I (β_{k}^{0} \neq 0) + | u_{k} | I (β_{k}^{0} = 0)\} .

Therefore,

V_{1 n} \overset{d}{\to} V_{1} (a)

. Since

a_{n} = O (1)

, using the argmin continuous mapping theorem (Kim and Pollard [33]),

\sqrt{n} ({\hat{β}}_{E_{1}} - β_{E_{1}}^{0}) = {\hat{a}}_{n} \overset{d}{\to} argmin {V_{1} (a) : a \in R^{| E_{1} |}}

, which completes the proof. □

Proof of Theorem 3.

We first establish the consistency of

\hat{β}

. Let

α_{n} = \sqrt{p_{n} / n}

and

∥ u ∥ = 1

. It is sufficient to show that for every

ε > 0

, there exists

B (C)

such that

\Pr \{inf_{β \in B (C)} G_{n} (β^{0} + C α_{n} u) > G_{n} (β^{0})\} \geq 1 - ε .

Since

\frac{1}{n} G_{n} (β^{0} + C α_{n} u) - \frac{1}{n} G_{n} (β^{0}) = \frac{1}{n} Q_{n} (β^{0} + C α_{n} u) - \frac{1}{n} Q_{n} (β^{0}) + D_{n},

where

D_{n} = \frac{λ_{n}}{n} (\sum_{j = 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} - \sum_{j = 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}})}^{γ}) .

We can utilize Knight’s identity (Knight and Fu [32]) to rewrite the difference between

Q_{n} (β^{0} + C α_{n} u)

and

Q_{n} (β^{0})

as

\begin{matrix} Q_{n} (β^{0} + C α_{n} u) - Q_{n} (β^{0}) & = - C \sum_{i = 1}^{n} α_{n} x_{i}^{T} u (τ - I (e_{i} \leq 0)) + \sum_{i = 1}^{n} \int_{0}^{C α_{n} x_{i}^{T} u} (I (e_{i} \leq t) - I (e_{i} \leq 0)) d t \\ : = I_{1} + I_{2} . \end{matrix}

Since

I_{1} \leq C α_{n} ∥ u ∥ \sum_{i = 1}^{n} (τ - I (e_{i} \leq 0)) x_{i}

and

| τ - I (e_{i} \leq 0) | \leq 1

, it follows that

\begin{matrix} E {∥\sum_{i = 1}^{n} (τ - I (e_{i} \leq 0)) x_{i}∥}^{2} & = E [\sum_{j = 1}^{p_{n}} \sum_{i = 1}^{n} \sum_{l = 1}^{n} (τ - I (e_{i} \leq 0)) (τ - I (e_{j} \leq 0)) X_{i j} X_{l j}] \\ = \sum_{j = 1}^{p_{n}} \sum_{i = 1}^{n} E [{(τ - I (e_{i} \leq 0))}^{2} X_{i j}^{2}] = O_{p} (n p_{n}) . \end{matrix}

Thus, we have

| I_{1} | \leq O_{p} (C α_{n} \sqrt{p_{n} n}) ∥ u ∥ = O_{p} (C α_{n}^{2} n) ∥ u ∥

. For

I_{2}

, from the proof of Theorem 1, it holds that

E (I_{2}) = \frac{f (0)}{2} C^{2} α_{n}^{2} n [\frac{1}{n} \sum_{i = 1}^{n} {(x_{i}^{T} u)}^{2}] + o (1) α_{n}^{2} n [\frac{1}{n} \sum_{i = 1}^{n} u^{T} (x_{i} x_{i}^{T}) u] .

It follows that

I_{2}

is of order

C^{2} α_{n}^{2} n

. By choosing a sufficiently large C,

I_{2}

dominates

I_{1}

uniformly in

∥ u ∥ = 1

.

Next,

D_{n}

can be re-expressed as

\begin{matrix} D_{n} & = \frac{λ_{n}}{n} \sum_{j = 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} - \frac{λ_{n}}{n} \sum_{j = 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ = \frac{λ_{n}}{n} \sum_{j = 1}^{J_{1}} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} - \frac{λ_{n}}{n} \sum_{j = 1}^{J_{1}} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ + \frac{λ_{n}}{n} \sum_{j = J_{1} + 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ : = H_{1} + H_{2} . \end{matrix}

Consider

H_{1}

first. We analyze two cases: Case

(1)

when

β_{k}^{0} \neq 0

for all

k \in B_{j}

and

j \in {1, \dots, J_{1}}

, and Case

(2)

where at least one

β_{k}^{0} = 0

for some

k \in B_{j}

and some

j \in {1, \dots, J_{1}}

.

In Case

(1)

, assume that

β_{k}^{0} \neq 0

for all

k \in B_{j}

and all

j \in {1, \dots, J_{1}}

. Noting that

b^{γ} - a^{γ} \leq 2 (b - a) b^{γ - 1}

for

0 \leq a \leq b

, and using condition (A7), it follows that

| β_{k}^{0} | = O {{(p_{n} / n)}^{v_{1} / 2}}

with

0 < v_{1} < 1

for

k \in E_{1}

. Thus, for sufficiently large n such that

C α_{n} \leq | β_{k}^{0} |

for all

k \in B_{j}

,

\begin{matrix} \begin{matrix} H_{1} & = \frac{λ_{n}}{n} \sum_{j = 1}^{J_{1}} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} - \frac{λ_{n}}{n} \sum_{j = 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ \leq 2 α_{n}^{2} λ_{n} {(p_{n} n)}^{- 1 / 2} \sum_{j = 1}^{J_{1}} b_{j} \{{(\sum_{k \in B_{j}} \frac{2 | β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}})}^{γ - 1} \sum_{k \in B_{j}} \frac{C | u_{k} |}{| {\tilde{β}}_{k} |^{v}}\} . \end{matrix} \end{matrix}

(A8)

Since

| {\tilde{β}}_{k} |^{v} \to_{p} {| β_{k}^{0} |}^{v}

for

k \in E_{3}

, and given conditions (A7) and (A8), this term is dominated by

I_{2} > 0

.

In Case

(2)

, where at least one

β_{k}^{0} = 0

for some

k \in B_{j}

and some

j \in {1, \dots, J_{1}}

, consider the term of

H_{2}

,

\begin{matrix} \begin{matrix} H_{2} & = \frac{λ_{n}}{n} \sum_{j = J_{1} + 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ = \frac{λ_{n}}{n} {(\frac{n}{p_{n}})}^{v γ / 2} \sum_{j = J_{1} + 1}^{J} b_{j} {(\sum_{k \in B_{j} \cap E_{3}} \frac{C α_{n} | u_{k} |}{(\sqrt{n / p_{n}} | {\tilde{β}}_{k} {|)}^{v}})}^{γ} \\ = α_{n}^{2} λ_{n} n^{γ (v - 1) / 2} p_{n}^{- 1 + γ (1 - v) / 2} \sum_{j = J_{1} + 1}^{J} b_{j} {(\sum_{k \in B_{j} \cap E_{3}} \frac{C | u_{k} |}{(\sqrt{n / p_{n}} | {\tilde{β}}_{k} {|)}^{v}})}^{γ} \\ = O_{p} (1) α_{n}^{2} λ_{n} n^{γ (1 - c) (v - 1) / 2 - c} \sum_{j = J_{1} + 1}^{J} b_{j} {(\sum_{k \in B_{j} \cap E_{3}} \frac{C | u_{k} |}{(\sqrt{n / p_{n}} | {\tilde{β}}_{k} {|)}^{v}})}^{γ}, \end{matrix} \end{matrix}

(A9)

where the last equality holds due to

p_{n} = O (n^{c})

with

0 < c < 1

. Since

λ_{n} n^{γ (1 - c) (v - 1) / 2 - c} \to \infty

by condition (A8),

H_{2} > 0

dominates

I_{1}

and

I_{2}

.

Next, consider

\begin{matrix} \sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}} - \sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}} & = \sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} + C α_{n} u_{k} | - | β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}} + \sum_{k \in B_{j} \cap E_{4}} \frac{C α_{n} | u_{k} |}{| {\tilde{β}}_{k} |^{v}} \\ : = H_{11} + H_{22} . \end{matrix}

We conclude that

\begin{matrix} | H_{11} | \leq \sum_{k \in B_{j} \cap E_{3}} \frac{C α_{n} | u_{k} |}{| {\tilde{β}}_{k, τ} |^{v}}, \\ | H_{22} | \leq \sum_{k \in B_{j} \cap E_{4}} \frac{C α_{n}^{1 - v} | u_{k} |}{| \sqrt{n / p_{n}} {\tilde{β}}_{k, τ} |^{v}} \end{matrix}

(A10)

Since

{\tilde{β}}_{k}

converges in probability to the non-zero

β_{k, 0 τ}

for

k \in E_{3}

, and

{max}_{j} | B_{j} \cap E_{3} | = O_{p} {{(n / p_{n})}^{v_{2} / 2}}

,

H_{11} \leq O (α_{n}^{1 - v v_{1} - v_{2}}) = o (α_{n}^{1 - v})

. Meanwhile,

\sqrt{n / p_{n}} {\tilde{β}}_{k} = O_{p} (1)

for

k \in E_{3}

, so

H_{22}

is at least

O_{p} (α_{n}^{1 - v})

. Therefore,

H_{22} > 0

dominates

H_{11}

as

n \to \infty

. Hence, for sufficiently large n,

\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}} > \sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}} .

By

γ b^{γ - 1} (b - a) \leq b^{γ} - a^{γ}

for

0 \leq a \leq b

, for sufficiently large n we obtain

\begin{matrix} \frac{λ_{n}}{n} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} - \frac{λ_{n}}{n} {(\sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ \geq γ \frac{λ_{n}}{n} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ - 1} (\sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} + C α_{n} u_{k} | - | β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}} + \sum_{k \in B_{j} \cap E_{4}} \frac{C α_{n} | u_{k} |}{| {\tilde{β}}_{k} |^{v}}) \\ : = H_{33} + H_{44} . \end{matrix}

Using similar arguments to those in (A10),

H_{44}

dominates

H_{33}

. Consider

H_{44}

,

\begin{matrix} \begin{matrix} H_{44} & = γ \frac{λ_{n}}{n} {(\frac{n}{p_{n}})}^{v / 2} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ - 1} \sum_{k \in B_{j} \cap E_{4}} \frac{C α_{n} | u_{k} |}{| \sqrt{n / p_{n}} {\tilde{β}}_{k} |^{v}} \\ = O_{p} (1) γ α_{n}^{2} λ_{n} n^{(v - 1) / 2 - c (v + 1) / 2} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ - 1} \sum_{k \in B_{j} \cap E_{4}} \frac{C | u_{k} |}{| \sqrt{n / p_{n}} {\tilde{β}}_{k} |^{v}}, \end{matrix} \end{matrix}

(A11)

where

\sqrt{n / p_{n}} {\tilde{β}}_{k} = O (1)

, and the last equality holds due to

p_{n} = O (n^{c})

with

0 < c < 1

. Since

λ_{n} n^{(v - 1) / 2 - c (v + 1) / 2} \to \infty

,

H_{44} > 0

dominates

I_{1}

and

I_{2}

. Thus, by (A9) and (A11), if at least one

β_{k}^{0} = 0

,

H_{1} > 0

dominates

I_{1}

and

I_{2}

for sufficiently large n.

Combining (A9) and (A11) with (A8), for sufficiently large n, it holds that

\frac{1}{n} G_{n} (β^{0} + C α_{n} u) - \frac{1}{n} G_{n} (β^{0}) > 0,

which completes the proof of Theorem 3(i).

Next, we demonstrate variable selection consistency. Let

β = β^{0} + C α_{n} u

, where C is a constant and

∥ u ∥ = 1

. Define

\begin{matrix} \frac{1}{n} G_{n} ({β_{E_{3}}^{T}, 0^{T}}^{T}) - \frac{1}{n} G_{n} ({β_{E_{3}}^{T}, β_{E_{4}}^{T}}^{T}) & = \frac{1}{n} (G_{n} ({β_{E_{3}}^{T}, 0^{T}}^{T}) - G_{n} ({β_{E_{3}}^{0 T}, 0^{T}}^{T})) \\ - \frac{1}{n} (G_{n} ({β_{E_{3}}^{T}, β_{E_{4}}^{T}}^{T}) - \frac{1}{n} G_{n} ({β_{E_{3}}^{0 T}, 0^{T}}^{T})) \\ + \{\frac{λ_{n}}{n} \sum_{j = 1}^{J_{1}} b_{j} {(\sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ - \frac{λ_{n}}{n} \sum_{j = 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ}\} \\ : = K_{1} - K_{2} + K_{3} . \end{matrix}

It suffices to show that for sufficiently large n,

\frac{1}{n} G_{n} ({β_{E_{3}}^{T}, 0^{T}}^{T}) - \frac{1}{n} G_{n} ({β_{E_{3}}^{T}, β_{E_{4}}^{T}}^{T}) < 0 .

Following the argument of the consistency proof for the adaptive group bridge estimator, we can verify that

K_{1}

and

K_{2}

are of order

C^{2} α_{n}^{2}

.

Consider

K_{3}

:

\begin{matrix} K_{3} & = \frac{λ_{n}}{n} \sum_{j = 1}^{J_{1}} b_{j} {(\sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} - \frac{λ_{n}}{n} \sum_{j = 1}^{J_{1}} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ - \frac{λ_{n}}{n} \sum_{j = J_{1} + 1}^{J} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ : = K_{31} - K_{32} - K_{33} . \end{matrix}

Since

γ b^{γ - 1} (b - a) \leq b^{γ} - a^{γ}

for

0 \leq a \leq b

, for sufficiently large n, similarly to (A11), we have

\begin{matrix} K_{32} - K_{31} & = \frac{λ_{n}}{n} \sum_{j = 1}^{J_{1}} b_{j} {(\sum_{k \in B_{j}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} - \frac{λ_{n}}{n} \sum_{j = 1}^{J_{1}} b_{j} {(\sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} \\ \geq γ α_{n}^{2} λ_{n} n^{(v - 1) / 2} p_{n}^{- (v + 1) / 2} {(\sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} + C α_{n} u_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ - 1} \sum_{k \in B_{j} \cap E_{4}} \frac{C | u_{k} |}{| \sqrt{n / p_{n}} {\tilde{β}}_{k} |^{v}} . \end{matrix}

Since

\sqrt{n / p_{n}} {\tilde{β}}_{k} = O (1)

for

j \in E_{3}

, and

λ_{n} n^{(v - 1) / 2 - c (v + 1) / 2} \to \infty

, using condition (A8),

K_{32} - K_{31} < 0

dominates

K_{1}

and

K_{2}

. Similarly to (A9),

- K_{33} < 0

also dominates

K_{1}

and

K_{2}

.

Therefore, for sufficiently large n,

\frac{1}{n} G_{n} ({β_{E_{3}}^{T}, 0^{T}}^{T}) - \frac{1}{n} G_{n} ({β_{E_{3}}^{T}, β_{E_{4}}^{T}}^{T}) < 0,

which proves the individual variable selection consistency.

Finally, we show the asymptotics of

{\hat{β}}_{E_{3}}

, where

β_{E_{3}} = {β_{k}^{0} + C α_{n} u_{k}; k \in E_{3}}^{T}

. Note that the proof of Theorem 3(i) still works with the reduced

X_{1}

and reduced number

p_{3} = | E_{3} |

of coefficients

{β_{k} : k \in E_{3}}

. Thus,

∥ {\hat{β}}_{E_{3},} - β_{E_{3}}^{0} ∥ = O_{p} (\frac{1}{n}), ∥ \hat{β} - β^{0} ∥ = O_{p} (\frac{1}{n}) .

Let

h_{n} = n^{- 1 / 2}

and

V_{1 n} (a) = G_{n} (β^{0} + h_{n} {(a^{T}, 0^{T})}^{T}) - G_{n} (β^{0})

, where

0

is a zero vector of dimension

E_{4}

and

a = {(a_{1}, \dots, a_{p_{3}})}^{T}

is a

p_{3}

-dimensional constant vector. From the consistency of

\hat{β}

, with large probability approaching one,

\hat{β} - β^{0} = h_{n} {({\hat{a}}^{T}, 0^{T})}^{T}

, with

{\hat{a}}^{T} = arg min {V_{1 n} (a) : a \in R^{p_{3}}}

.

Using (A4),

V_{1 n} (a)

can be rewritten as

\begin{matrix} V_{1 n} & = - h_{n} {({\hat{a}}^{T}, 0^{T})}^{T} \sum_{i = 1}^{n} x_{i}^{T} (τ - I (e_{i} \leq 0)) + \sum_{i = 1}^{n} \int_{0}^{h_{n} x_{i}^{T} {({\hat{a}}^{T}, 0^{T})}^{T}} (I (e_{i} \leq t) - I (e_{i} \leq 0)) d t \\ + λ_{n} \sum_{j = 1}^{J_{1}} b_{j} \{{(\sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} + h_{n} a_{k} |}{| {\tilde{β}}_{k} |^{v}})}^{γ} - {(\sum_{k \in B_{j} \cap E_{3}} \frac{| β_{k}^{0} |}{| {\tilde{β}}_{k} |^{v}})}^{γ}\} \\ : = T_{1 n} (a) + T_{2 n} (a) . \end{matrix}

Following the arguments of [26],

T_{1 n} (a) \to_{d} V_{1} (a) = a^{T} W + \frac{1}{2} f (0) a^{T} Υ_{11} a

, where

\to_{d}

denotes convergence in distribution and

W \sim N (0, τ (1 - τ) Υ_{11})

. Similarly to [11], it follows from condition (A8) that

T_{2 n} (a) \to 0

. Therefore,

V_{1 n} \to_{d} V_{1} (a)

. Using the epi-convergence results of [34],

\sqrt{n} ({\hat{β}}_{E_{3}} - β_{E_{3}, 0 τ}) = {\hat{a}}_{n} \to_{d} arg min {V_{1} (a) : a \in R^{| E_{3} |}}

. This completes the proof. □

References

Hastie, T.; Tibshirani, R.; Friedman, J.H.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009. [Google Scholar]
Ahn, K.W.; Kim, S. Variable selection with group structure in competing risks quantile regression. Stat. Med. 2018, 37, 1577–1586. [Google Scholar] [CrossRef] [PubMed]
Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 2006, 68, 49–67. [Google Scholar] [CrossRef]
Goeman, J.; Bühlmann, P. Analyzing gene expression data in terms of gene sets: Methodological issues. Bioinformatics 2007, 23, 980–987. [Google Scholar] [CrossRef] [PubMed]
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
Frank, I.E.; Friedman, J.H. A statistical view of some chemometrics regression tools. Technometrics 1993, 35, 109–135. [Google Scholar] [CrossRef]
Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
Zhang, C. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef]
Meier, L.; Van De Geer, S.; Bühlmann, P. The group lasso for logistic regression. J. R. Stat. Soc. Ser. B 2008, 70, 53–71. [Google Scholar] [CrossRef]
Zhao, W.; Zhang, R.; Liu, J. Sparse group variable selection based on quantile hierarchical lasso. J. Appl. Stat. 2014, 41, 1658–1677. [Google Scholar] [CrossRef]
Huang, J.; Ma, S.; Xie, H. A group bridge approach for variable selection. Biometrika 2009, 96, 339–355. [Google Scholar] [CrossRef]
Huang, J.; Li, L.; Liu, Y.; Zhao, X. Group selection in the cox model with adiverging number of covariates. Stat. Sin. 2014, 24, 1787–1810. [Google Scholar]
Cai, K.; Shen, H.; Lu, X. Adaptive bi-level variable selection for multivariate failure time model with a diverging number of covariates. Test 2022, 31, 968–993. [Google Scholar] [CrossRef]
Buch, G.; Schulz, A.; Schmidtmann, I.; Strauch, K.; Wild, P.S. Interpretability of bi-level variable selection methods. Biom. J. 2024, 66, 2300063. [Google Scholar] [CrossRef] [PubMed]
Buch, G.; Schulz, A.; Schmidtmann, I.; Strauch, K.; Wild, P.S. Sparse Group Penalties for bi-level variable selection. Biom. J. 2024, 66, 2200334. [Google Scholar] [CrossRef]
Buch, G.; Schulz, A.; Schmidtmann, I.; Strauch, K.; Wild, P.S. A systematic review and evaluation of statistical methods for group variable selection. Stat. Med. 2023, 42, 331–352. [Google Scholar] [CrossRef]
Dai, D.; Tang, A.; Ye, J. High-Dimensional Variable Selection for Quantile Regression Based on Variational Bayesian Method. Mathematics 2023, 11, 2232. [Google Scholar] [CrossRef]
Koenker, R. Quantile Regression; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
Li, Y.; Zhu, J. L1-norm quantile regression. J. Comput. Graph. Stat. 2008, 17, 163–185. [Google Scholar] [CrossRef]
Ciuperca, G. Adaptive group LASSO selection in quantile models. Stat. Pap. 2019, 60, 173–197. [Google Scholar] [CrossRef]
Ciuperca, G. Adaptive elastic-net selection in a quantile model with diverging number of variable groups. Statistics 2020, 54, 1147–1170. [Google Scholar] [CrossRef]
Shi, S.; Wilke, R.A. Variable selection with group structure: Exiting employment at retirement age—A competing risks quantile regression analysis. Empir. Econ. 2022, 62, 119–155. [Google Scholar] [CrossRef]
Zhou, N.; Zhu, J. Group variable selection via a hierarchical lasso and its oracle property. Stat. Interface 2010, 3, 557–574. [Google Scholar] [CrossRef]
Ouhourane, M.; Yang, Y.; Benedet, A.L. Group penalized quantile regression. Stat. Methods Appt. 2022, 31, 1–35. [Google Scholar] [CrossRef]
Li, Y.; Zhang, H. Component selection and smoothing in multivariate nonparametric regression. Ann. Stat. 2006, 34, 2272–2297. [Google Scholar]
Wu, Y.; Liu, Y. Variable selection in quantile regression. Stat. Sin. 2009, 19, 801–817. [Google Scholar]
Zhong, W.; Zhu, L.; Li, R. Regularized quantile regression and robust feature screening for single index models. Stat. Sin. 2016, 26, 69–95. [Google Scholar] [CrossRef]
Zou, H.; Zhang, H. On the adaptive elastic-net with a diverging number of parameters. Ann. Stat. 2009, 37, 1733–1751. [Google Scholar] [CrossRef]
Lee, E.R.; Noh, H.; Park, B.U. Model selection via Bayesian information criterion for quantile regression models. J. Am. Stat. Assoc. 2014, 109, 216–229. [Google Scholar] [CrossRef]
Fernandes, M.; Guerre, E.; Horta, E. Smoothing quantile regressions. J. Bus. Econ. Stat. 2021, 39, 338–357. [Google Scholar] [CrossRef]
He, X.; Pan, X.; Tan, K.M. Smoothed quantile regression with large-scale inference. J. Econom. 2023, 232, 367–388. [Google Scholar] [CrossRef]
Fu, W.; Knight, K. Asymptotics for lasso-type estimators. Ann. Stat. 2000, 28, 1356–1378. [Google Scholar] [CrossRef]
Kim, J.; Pollard, D. Cube root asymptotics. Ann. Stat. 1990, 18, 191–219. [Google Scholar] [CrossRef]
Geyer, C.J. On the asymptotics of constrained M-estimation. Ann. Stat. 1994, 22, 1993–2010. [Google Scholar] [CrossRef]

Table 1. Simulation results for Experiment 1.

		Homoscedasticity					Heteroscedasticity
$τ$	Method	% CG	% CI	NG	NV	MSER	% CG	% CI	NG	NV	MSER
		scenario (a)
	Oracle	1	1	3	15	1	1	1	3	15	1
0.25	SCAD	0.859	0.259	3.154	14.160	2.936	0.959	0.010	3.000	11.880	5.580
	AGL	0.999	0.979	3.005	15.000	2.471	0.989	0.966	3.011	14.990	1.800
	GB	0.991	0.985	3.009	15.000	1.058	0.992	0.925	3.008	14.94	1.108
	AGB	0.973	0.923	3.028	14.980	1.069	0.992	0.916	3.008	14.920	1.023
0.5	SCAD	0.800	0.326	3.241	14.38	2.367	0.848	0.103	3.160	13.380	3.391
	AGL	0.998	0.996	3.002	15.000	1.546	0.985	0.971	3.015	15.000	1.1591
	GB	0.994	0.990	3.006	15.000	1.066	0.999	0.956	3.001	14.950	1.063
	AGB	0.986	0.960	3.014	14.990	1.084	0.999	0.953	3.001	14.950	1.007
0.75	SCAD	0.842	0.191	2.974	13.000	3.289	0.687	0.006	3.003	11.120	4.317
	AGL	1.000	0.937	3.000	14.930	3.646	0.847	0.591	3.159	14.780	2.223
	GB	0.990	0.958	2.990	14.900	1.102	0.950	0.626	3.006	14.480	1.063
	AGB	0.982	0.935	3.018	14.970	1.088	0.945	0.555	3.013	14.390	1.107
		scenario (b)
	Oracle	1	1	4	15	1	1	1	4	15	1
0.25	SCAD	0.975	0.734	4.026	14.790	1.142	0.979	0.402	4.021	14.440	1.807
	AGL	0.875	0.058	4.130	18.060	1.473	0.890	0.084	4.129	17.55	1.476
	GB	0.999	0.452	4.001	15.750	1.770	1.000	0.461	4.000	15.720	1.818
	AGB	0.995	0.898	4.005	14.970	1.155	0.991	0.756	4.009	15.070	1.180
0.5	SCAD	0.768	0.687	4.263	15.230	1.109	0.973	0.363	4.027	14.410	2.180
	AGL	1.000	0.220	4.000	16.280	2.229	0.953	0.067	4.048	17.630	1.608
	GB	1.000	0.536	4.000	15.600	1.773	1.000	0.545	4.000	15.540	1.992
	AGB	0.992	0.936	4.008	14.990	1.107	0.999	0.834	4.001	15.02	1.151
0.75	SCAD	0.980	0.691	4.023	14.740	1.212	0.968	0.385	4.034	14.440	1.573
	AGL	0.979	0.054	4.021	17.340	1.635	0.996	0.191	4.004	16.440	2.259
	GB	0.999	0.512	4.001	15.640	1.664	0.999	0.429	4.001	15.700	1.844
	AGB	0.992	0.894	4.008	14.960	1.141	0.994	0.701	4.006	15.010	1.243

Table 2. Simulation results for Experiment 2.

		Homoscedasticity					Heteroscedasticity
$τ$	Method	% CG	% CI	NG	NV	MSER	% CG	% CI	NG	NV	MSER
		scenario (a)
	Oracle	1	1	3	11	1	1	1	3	11	1
0.25	SCAD	0.989	0.760	3.012	10.780	1.127	0.995	0.343	3.006	10.350	1.848
	AGL	1.000	0.284	3.000	11.960	1.243	1.000	0.266	3.000	11.940	1.297
	GB	0.987	0.590	3.013	11.480	1.156	0.998	0.512	3.002	11.430	1.233
	AGB	0.982	0.917	3.019	11.020	1.084	0.945	0.716	3.057	11.050	1.151
0.5	SCAD	0.983	0.740	3.018	10.780	1.131	0.998	0.354	3.002	10.360	2.162
	AGL	1.000	0.192	3.000	12.110	1.157	1.000	0.233	3.000	12.000	1.272
	GB	0.996	0.599	3.004	11.460	1.182	0.998	0.548	3.002	11.440	1.211
	AGB	0.985	0.936	3.015	11.040	1.082	0.987	0.810	3.013	10.920	1.178
0.75	SCAD	1.000	0.324	3.000	10.320	2.482	0.997	0.352	3.003	10.360	1.882
	AGL	1.000	0.350	3.000	11.840	1.334	1.000	0.311	3.000	11.860	1.375
	GB	1.000	0.691	3.000	11.340	1.207	0.996	0.531	3.004	11.420	1.185
	AGB	0.995	0.923	3.005	11.020	1.068	0.962	0.742	3.039	11.010	1.181
		scenario (b)
	Oracle	1	1	6	18	1	1	1	6	18	1
0.25	SCAD	0.979	0.734	6.021	17.780	1.119	0.988	0.421	6.012	17.450	1.417
	AGL	0.903	0.025	6.107	21.91	1.442	0.915	0.027	6.091	21.840	1.486
	GB	0.997	0.260	6.003	19.220	1.670	0.990	0.293	6.010	19.070	1.720
	AGB	0.994	0.865	6.006	17.960	1.124	0.995	0.678	6.005	17.770	1.253
0.5	SCAD	0.976	0.734	6.024	17.810	1.168	0.996	0.410	6.004	17.420	1.604
	AGL	0.975	0.049	6.050	20.660	2.032	0.989	0.033	6.013	21.540	1.549
	GB	0.996	0.333	6.004	19.040	1.727	0.996	0.353	6.004	18.910	1.767
	AGB	0.994	0.923	6.006	17.990	1.108	0.994	0.723	6.006	18.020	1.165
0.75	SCAD	0.978	0.776	6.022	17.830	1.072	0.990	0.415	6.010	17.440	1.448
	AGL	0.907	0.021	6.102	21.930	1.412	0.918	0.021	6.085	21.730	1.470
	GB	0.992	0.237	6.009	19.350	1.721	0.998	0.269	6.002	19.130	1.747
	AGB	0.991	0.858	6.009	18.060	1.123	0.992	0.680	6.008	17.900	1.251

Table 3. Results of the real data analysis.

		$τ = 0.25$			$τ = 0.5$			$τ = 0.75$
Variable	GB-LS	AGL	GB	AGB	AGL	GB	AGB	AGL	GB	AGB
Intercept	2.999	2.599	2.601	2.615	3.164	3.145	3.136	3.467	3.370	3.423
age1	0.000	−0.046	0.000	0.000	0.010	0.000	0.000	0.000	0.000	0.000
age2	0.893	2.329	2.144	2.105	1.403	1.503	1.509	0.000	1.103	1.293
age3	0.282	0.653	0.616	0.668	1.004	0.890	0.876	0.000	0.537	0.377
lwt1	1.019	1.530	1.233	1.303	1.656	1.643	1.680	1.880	1.984	2.333
lwt2	0.000	0.150	0.170	0.160	−0.139	0.000	0.000	−1.193	−0.907	−0.424
lwt3	0.734	1.800	2.086	2.071	1.379	1.140	1.354	1.446	1.639	1.200
white	0.265	0.398	0.492	0.496	0.278	0.273	0.277	0.332	0.361	0.284
black	−0.040	0.099	0.1515	0.156	−0.237	−0.269	−0.250	−0.291	−0.230	−0.315
smoke	−0.221	−0.293	−0.379	−0.391	−0.387	−0.354	0.361	−0.305	−0.250	−0.205
ptl1	−0.172	−0.168	−0.245	−0.221	−0.344	−0.362	−0.346	0.000	−0.214	−0.259
ptl2	0.000	0.106	0.000	0.000	0.289	0.259	0.246	0.000	0.000	0.000
ht	−0.287	−0.852	−0.861	−0.856	−0.323	−0.323	−0.325	0.000	0.000	0.000
ui	−0.384	−0.489	−0.534	0.553	−0.477	−0.452	−0.437	0.000	0.000	0.000
ftv1	0.000	0.000	0.022	0.000	−0.048	0.000	0.000	0.000	0.132	0.162
ftv2	0.000	0.000	0.050	0.196	0.064	0.038	0.000	0.000	0.105	0.000
ftv3	0.000	0.000	−0.102	−0.113	−0.372	−0.342	−0.340	0.000	−0.614	−0.625
APE	0.512	0.582	0.570	0.569	0.471	0.472	0.472	0.661	0.603	0.608

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.