Article

Selection Consistency of Lasso-Based Procedures for Misspecified High-Dimensional Binary Model and Random Regressors

by Mariusz Kubkowski 1,2,† and Jan Mielniczuk 1,2,*,†
1 Institute of Computer Science, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warsaw, Poland
2 Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2020, 22(2), 153; https://doi.org/10.3390/e22020153
Submission received: 13 November 2019 / Revised: 22 January 2020 / Accepted: 24 January 2020 / Published: 28 January 2020

Abstract:
We consider selection of random predictors for a high-dimensional regression problem with a binary response for a general loss function. An important special case is when the binary model is semi-parametric and the response function is misspecified under a parametric model fit. When the true response coincides with a postulated parametric response for a certain value of the parameter, we obtain a common framework for parametric inference. Both cases of correct specification and misspecification are covered in this contribution. Variable selection for such a scenario aims at recovering the support of the minimizer of the associated risk with large probability. We propose a two-step Screening-Selection (SS) procedure which consists of screening and ordering predictors by the Lasso method and then selecting the subset of predictors which minimizes the Generalized Information Criterion over the corresponding nested family of models. We prove consistency of the proposed selection method under conditions that allow for a much larger number of predictors than the number of observations. For the semi-parametric case, when the distribution of random predictors satisfies the linear regressions condition, the true and the estimated parameters are collinear and their common support can be consistently identified. This partly explains the robustness of selection procedures to response function misspecification.

1. Introduction

Consider a random variable $(X,Y)\in\mathbb{R}^p\times\{0,1\}$ and a corresponding response function defined as the a posteriori probability $q(x)=P(Y=1\,|\,X=x)$. Estimation of the a posteriori probability is of paramount importance in machine learning and statistics since many frequently applied methods, e.g., logistic or tree-based classifiers, rely on it. One of the main estimation methods of q is a parametric approach for which the response function is assumed to have the parametric form
$$q(x)=q_0(\beta^T x)$$
for some fixed $\beta$ and known $q_0(x)$. If Equation (1) holds, that is, the underlying structure is correctly specified, then it is known that
$$\beta=\operatorname{argmin}_{b\in\mathbb{R}^p}\big\{-E_{X,Y}\big(Y\log q_0(b^TX)+(1-Y)\log(1-q_0(b^TX))\big)\big\},$$
or, equivalently (cf., e.g., [1])
$$\beta=\operatorname{argmin}_{b}\,E_X\,KL\big(q(X),q_0(X^Tb)\big),$$
where $E_Xf(X)$ is the expected value of a random variable $f(X)$ and $KL(q(X),q_0(X^Tb))$ is the Kullback–Leibler distance between the binary distributions with success probabilities $q(X)$ and $q_0(X^Tb)$:
$$KL\big(q(X),q_0(X^Tb)\big)=q(X)\log\frac{q(X)}{q_0(X^Tb)}+(1-q(X))\log\frac{1-q(X)}{1-q_0(X^Tb)}.$$
The equalities in Equations (2) and (3) form the theoretical underpinning of (conditional) maximum likelihood (ML) method as the expression under the expected value in Equation (2) is the conditional log-likelihood of Y given X in the parametric model. Moreover, it is a crucial property needed to show that ML estimates of β under appropriate conditions approximate β .
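The relation between Equations (2) and (3) can also be checked numerically: the conditional expected negative log-likelihood of a candidate response $q_0(b^TX)$ and the Kullback–Leibler distance differ only by the entropy of $q(X)$, a term free of b. The following small R sketch (illustrative code added here, with hypothetical helper names, not taken from the paper) verifies this identity on a grid:

```r
kl_binary <- function(q, q0) {
  # Kullback-Leibler distance between Bernoulli(q) and Bernoulli(q0)
  q * log(q / q0) + (1 - q) * log((1 - q) / (1 - q0))
}
neg_loglik <- function(q, q0) {
  # conditional expected minus log-likelihood given X, when P(Y=1|X) = q
  -(q * log(q0) + (1 - q) * log(1 - q0))
}
entropy <- function(q) -(q * log(q) + (1 - q) * log(1 - q))

q  <- 0.7
q0 <- seq(0.05, 0.95, by = 0.05)
# the difference is the entropy of q, which does not depend on the candidate q0:
max(abs(neg_loglik(q, q0) - (kl_binary(q, q0) + entropy(q))))  # ~ 1e-16
```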
However, more frequently than not, the model in Equation (1) does not hold, i.e., the response q is misspecified and ML estimators do not approximate $\beta$, but rather the quantity defined by the right-hand side of Equation (3), namely
$$\beta^*=\operatorname{argmin}_{b}\,E_X\,KL\big(q(X),q_0(X^Tb)\big).$$
Thus, a parametric fit using the conditional ML method, which is the most popular approach to modeling binary response, also has a very intuitive geometric and information-theoretic flavor. Indeed, fitting a parametric model, we try to approximate $\beta^*$, which yields the averaged KL projection of the unknown q on the set of parametric models $\{q_0(b^Tx)\}_{b\in\mathbb{R}^p}$. A typical situation is a semi-parametric framework in which the true response function satisfies
$$q(x)=\tilde q(\beta^Tx)$$
for some unknown $\tilde q(x)$ and the model in Equation (1) is fitted with $\tilde q\neq q_0$. An important problem is then how $\beta^*$ in Equation (4) relates to $\beta$ in Equation (5). In particular, a frequently asked question is what can be said about the support of $\beta=(\beta_1,\dots,\beta_p)^T$, i.e., the set $\{i:\beta_i\neq 0\}$, which consists of indices of predictors which truly influence Y. More specifically, the interplay between the support of $\beta$ and the analogously defined support of $\beta^*$ is of importance, as the latter is consistently estimated and the support of the ML estimator is frequently considered as an approximation of the set of true predictors. Variable selection, or equivalently the support recovery of $\beta$ in the high-dimensional setting, is one of the most intensively studied subjects in contemporary statistics and machine learning. This is related to many applications in bioinformatics, biology, image processing, spatiotemporal analysis, and other research areas (see [2,3,4]). It is usually studied under a correct model specification, i.e., under the assumption that data are generated following a given parametric model (e.g., logistic or, in the case of quantitative Y, linear model).
Consider the following example: let $\tilde q(x)=q_L(x^3)$, where $q_L(x)=e^x/(1+e^x)$ is the logistic function. Define the regression model by $P(Y=1\,|\,X)=\tilde q(\beta^TX)=q_L((X_1+X_2)^3)$, where $X=(X_1,\dots,X_p)$ is an $N(0,I_{p\times p})$-distributed vector of predictors, $p>2$ and $\beta=(1,1,0,\dots,0)\in\mathbb{R}^p$. Then, the considered model will obviously be misspecified when the family of logistic models is fitted. However, it turns out in this case that, as X is elliptically contoured, $\beta^*=\eta\beta=\eta(1,1,0,\dots,0)$ with $\eta\neq 0$ (see [5]), and thus the supports of $\beta$ and $\beta^*$ coincide. Thus, in this case, despite misspecification, the variable selection problem, i.e., finding out that $X_1$ and $X_2$ are the only active predictors, can be solved using the methods described below.
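A short simulation in R illustrates this example (an illustrative sketch added here, not the authors' code): fitting a misspecified logistic model to data generated from $q_L((X_1+X_2)^3)$ with normal predictors yields coefficients that are approximately proportional to $\beta=(1,1,0,\dots,0)$.

```r
set.seed(1)
n <- 50000; p <- 5
X <- matrix(rnorm(n * p), n, p)                 # N(0, I) predictors
Y <- rbinom(n, 1, plogis((X[, 1] + X[, 2])^3))  # true model q_L((X1 + X2)^3)
# Misspecified logistic fit; glm may warn about fitted probabilities near 0 or 1,
# which is expected for this steep response function.
fit <- glm(Y ~ X, family = binomial())
round(coef(fit)[-1] / coef(fit)[2], 2)          # approximately (1, 1, 0, 0, 0)
```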
For recent contributions to the study of Kullback–Leibler projections on the logistic model (which coincide with Equation (4) for the logistic loss, see below) and references, we refer to the works of Kubkowski and Mielniczuk [6], Kubkowski and Mielniczuk [7] and Kubkowski [8]. We also refer to the work of Lu et al. [9], where the asymptotic distribution of adaptive Lasso is studied under misspecification in the case of a fixed number of deterministic predictors. Questions of robustness analysis revolve around the interplay between $\beta$ and $\beta^*$, in particular under what conditions the directions of $\beta$ and $\beta^*$ coincide (cf. the important contributions by Brillinger [10] and Ruud [11]).
In the present paper, we discuss this problem in a more general non-parametric setting. Namely, the minus conditional log-likelihood $-(y\log q_0(b^Tx)+(1-y)\log(1-q_0(b^Tx)))$ is replaced by a general loss function of the form
$$l(b,x,y)=\rho(b^Tx,y),$$
where $\rho:\mathbb{R}\times\{0,1\}\to\mathbb{R}$ is some function, $b,x\in\mathbb{R}^p$, $y\in\{0,1\}$, and
$$R(b)=E_{X,Y}\,l(b,X,Y)$$
is the associated risk function for $b\in\mathbb{R}^p$. Our aim is to determine the support of $\beta^*$, where
$$\beta^*=\operatorname{argmin}_{b\in\mathbb{R}^{p_n}}R(b).$$
Predictors corresponding to non-zero coordinates of $\beta^*$ are called active predictors and the vector $\beta^*$ is called the pseudo-true vector.
The most popular loss functions are related to the minus log-likelihood of specific parametric models, such as the logistic loss
$$l_{logist}(b,x,y)=-yb^Tx+\log(1+\exp(b^Tx))$$
related to $q_0(b^Tx)=\exp(b^Tx)/(1+\exp(b^Tx))$, the probit loss
$$l_{probit}(b,x,y)=-y\log\Phi(b^Tx)-(1-y)\log(1-\Phi(b^Tx))$$
related to $q_0(b^Tx)=\Phi(b^Tx)$, or the quadratic loss $l_{lin}(b,x,y)=(y-b^Tx)^2/2$ related to linear regression and a quantitative response. Other losses which do not correspond to any parametric model, such as the Huber loss (see [12]), are constructed with a specific aim to induce certain desired properties of the corresponding estimators, such as robustness to outliers. We show in the following that the variable selection problem can be studied for a general loss function imposing certain analytic properties such as its convexity and the Lipschitz property.
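For concreteness, the losses above can be written directly as functions $\rho(s,y)$ of $s=b^Tx$; a brief R sketch (illustrative, with an arbitrary choice of the Huber parameter) is:

```r
rho_logist <- function(s, y) -y * s + log(1 + exp(s))
rho_probit <- function(s, y) -(y * log(pnorm(s)) + (1 - y) * log(1 - pnorm(s)))
rho_lin    <- function(s, y) (y - s)^2 / 2
rho_huber  <- function(s, y, delta = 0.1) {     # delta is an illustrative choice
  r <- y - s
  ifelse(abs(r) <= delta, r^2 / 2, delta * (abs(r) - delta / 2))
}
```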
For a fixed number p of predictors smaller than the sample size n, the statistical consequences of misspecification of a semi-parametric regression model were intensively studied by H. White and his collaborators in the 1980s. The concept of a projection on the fitted parametric model is central to these investigations, which show how the distribution of the maximum likelihood estimator of $\beta^*$ centered by $\beta^*$ changes under misspecification (cf. e.g., [13,14]). However, for the case when $p>n$, the maximum likelihood estimator, which is a natural tool for the fixed $p\le n$ case, is ill-defined and a natural question arises: What can be estimated and by what methods?
The aim of the present paper is to study the above problem in high-dimensional setting. To this end, we introduce two-stage approach in which the first stage is based on Lasso estimation (cf., e.g., [2])
$$\hat\beta_L=\operatorname{argmin}_{b\in\mathbb{R}^{p_n}}\Big\{R_n(b)+\lambda_L\sum_{i=1}^{p_n}|b_i|\Big\},$$
where $b=(b_1,\dots,b_{p_n})^T$ and the empirical risk $R_n(b)$ corresponding to $R(b)$ is
$$R_n(b)=n^{-1}\sum_{i=1}^n\rho(b^TX_i,Y_i).$$
Parameter $\lambda_L>0$ is the Lasso penalty, which penalizes large $l_1$-norms of potential candidates for a solution. Note that the criterion function in Equation (8) for $\rho(s,y)=\log(1+\exp(-s(2y-1)))$ can be viewed as the penalized empirical risk for the logistic loss. The Lasso estimator is thoroughly studied in the case of the linear model when the considered loss is the square loss (see, e.g., [2,4] for references and an overview of the subject) and some of the papers treat the case when such a model is fitted to Y, which is not necessarily linearly dependent on regressors (cf. [15]). In this case, the regression model is misspecified with respect to the linear fit. However, similar results are scarce for other scenarios, such as the logistic fit under misspecification in particular. One of the notable exceptions is Negahban et al. [16], who studied the behavior of the Lasso estimate for a general loss function and possibly misspecified models.
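For the logistic loss, the estimator in Equation (8) can be computed, e.g., with the glmnet package used later in Section 6; a minimal sketch (illustrative data and an arbitrary value of $\lambda_L$) is given below. With alpha = 1, glmnet minimizes the average logistic loss plus the $l_1$ penalty, which matches Equation (8).

```r
library(glmnet)
set.seed(1)
n <- 200; p <- 500
X <- matrix(rnorm(n * p), n, p)
Y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))
fit <- glmnet(X, Y, family = "binomial", alpha = 1, lambda = 0.05)
beta_L <- as.numeric(coef(fit))[-1]   # Lasso coefficients, intercept dropped
which(beta_L != 0)                    # screened (non-zero) predictors
```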
The output of the first stage is Lasso estimate β ^ L . The second stage consists in ordering of predictors according to the absolute values of corresponding non-zero coordinates of Lasso estimator and then minimization of Generalized Information Criterion (GIC) on the resulting nested family. This is a variant of SOS (Screening-Ordering-Selection) procedure introduced in [17]. Let s ^ * be the model chosen by GIC procedure.
Our main contributions are as follows:
  • We prove that under misspecification, when the sample size grows, the support $\hat s^*$ coincides with the support of $\beta^*$ with probability tending to 1. In the general framework allowing for misspecification this means that the selection rule $\hat s^*$ is consistent, i.e., $P(\hat s^*=s^*)\to 1$ as $n\to\infty$. In particular, when the model in Equation (1) is correctly specified, this means that we recover the support of the true vector $\beta$ with probability tending to 1.
  • We also prove approximation result for Lasso estimator when predictors are random and ρ is a convex Lipschitz function (cf. Theorem 1).
  • A useful corollary of the last result derived in the paper is determination of sufficient conditions under which active predictors can be separated from spurious ones based on the absolute values of corresponding coordinates of Lasso estimator. This makes construction of nested family containing s * with a large probability possible.
  • Significant insight has been gained for fitting of a parametric model when predictors are elliptically contoured (e.g., multivariate normal). Namely, it is known that in such a situation $\beta^*=\eta\beta$, i.e., these two vectors are collinear [5]. Thus, in the case when $\eta\neq 0$ we have that the support $s^*$ of $\beta^*$ coincides with the support s of $\beta$, and the selection consistency of the two-step procedure proved in the paper entails direction and support recovery of $\beta$. This may be considered as a partial justification of a frequent observation that classification methods are robust to misspecification of the model for which they are derived (see, e.g., [5,18]).
We now discuss how our results relate to previous results. Most of the variable selection methods in high-dimensional case are studied for deterministic regressors; here, our results concern random regressors with subgaussian distributions. Note that random regressors scenario is much more realistic for experimental data than deterministic one. The stated results to the best of our knowledge are not available for random predictors even when the model is correctly specified. As to novelty of SS procedure, for its second stage we assume that the number of active predictors is bounded by a deterministic sequence k n tending to infinity and we minimize GIC on family M of models with sizes satisfying also this condition. Such exhaustive search has been proposed in [19] for linear models and extended to GLMs in [20] (cf. [21]). In these papers, GIC has been optimized on all possible subsets of regressors with cardinality not exceeding certain constant k n . Such method is feasible for practical purposes only when p n is small. Here, we consider a similar set-up but with important differences: M is a data-dependent small nested family of models and optimization of GIC is considered in the case when the original model is misspecified. The regressors are supposed random and assumptions are carefully tailored to this case. We also stress the fact that the presented results also cover the case when the regression model is correctly specified and Equation (5) is satisfied.
In numerical experiments, we study the performance of grid version of logistic and linear SOS and compare it to its several Lasso-based competitors.
The paper is organized as follows. Section 2 contains auxiliaries, including new useful probability inequalities for empirical risk in the case of subgaussian random variables (Lemma 2). In Section 3, we prove a bound on approximation error for Lasso when the loss function is convex and Lipschitz and regressors are random (Theorem 1). This yields separation property of Lasso. In Theorems 2 and 3 of Section 4, we prove GIC consistency on nested family, which in particular can be built according to the order in which the Lasso coordinates are included in the fitted model. In Section 5.1, we discuss consequences of the proved results for semi-parametric binary model when distribution of predictors satisfies linear regressions condition. In Section 6, we numerically compare the performance of two-stage selection method for two closely related models, one of which is a logistic model and the second one is misspecified.

2. Definitions and Auxiliary Results

In the following, we allow the random vector $(X,Y)$, $q(x)$, and p to depend on the sample size n, i.e., $(X,Y)=(X^{(n)},Y^{(n)})\in\mathbb{R}^{p_n}\times\{0,1\}$ and $q_n(x)=P(Y^{(n)}=1\,|\,X^{(n)}=x)$. We assume that n copies $X_1^{(n)},\dots,X_n^{(n)}$ of a random vector $X^{(n)}$ in $\mathbb{R}^{p_n}$ are observed together with the corresponding binary responses $Y_1^{(n)},\dots,Y_n^{(n)}$. Moreover, we assume that the observations $(X_i^{(n)},Y_i^{(n)})$, $i=1,\dots,n$ are independent and identically distributed (iid). If this condition is satisfied for each n, but not necessarily across different n and m, i.e., the distribution of $(X_i^{(n)},Y_i^{(n)})$ may be different from that of $(X_j^{(m)},Y_j^{(m)})$ or they may be dependent for $m\neq n$, then such a framework is called a triangular scenario. A frequently considered scenario is the sequential one. In this case, when the sample size n increases, we observe values of new predictors in addition to the ones observed earlier. This is a special case of the above scheme as then $X_i^{(n+1)}=(X_i^{(n)T},X_{i,p_n+1},\dots,X_{i,p_{n+1}})^T$. In the following, we skip the upper index n if no ambiguity arises. Moreover, we write $q(x)=q_n(x)$. We impose a condition on the distributions of random predictors: we assume that the coordinates $X_{ij}$ of $X_i$ are subgaussian $Subg(\sigma_{jn}^2)$ with subgaussianity parameter $\sigma_{jn}^2$, i.e., it holds that (see [22])
$$E\exp(tX_{ij})\le\exp(t^2\sigma_{jn}^2/2)$$
for all $t\in\mathbb{R}$. This condition basically says that the tails of $X_{ij}$ do not decrease more slowly than the tails of the normal distribution $N(0,\sigma_{jn}^2)$. For future reference, let
$$s_n^2=\max_{j=1,\dots,p_n}\sigma_{jn}^2$$
and assume in the following that
$$\gamma^2:=\limsup_{n}s_n^2<\infty.$$
We assume moreover that $X_{i1},\dots,X_{ip_n}$ are linearly independent in the sense that no non-trivial linear combination of them is constant almost everywhere. We consider a general form of the response function $q(x)=P(Y=1\,|\,X=x)$ and assume that for the given loss function $\beta^*$, as defined in Equation (7), exists and is unique. For $s\subseteq\{1,\dots,p_n\}$, let $\beta^*(s)$ be defined as in Equation (7) when the minimum is taken over b with support in s. We let
$$s^*=\operatorname{supp}\big(\beta^*(\{1,\dots,p_n\})\big)=\{i\le p_n:\beta_i^*\neq 0\}$$
denote the support of $\beta^*(\{1,\dots,p_n\})$ with $\beta^*(\{1,\dots,p_n\})=(\beta_1^*,\dots,\beta_{p_n}^*)^T$.
Let $v_\pi=(v_{j_1},\dots,v_{j_k})^T\in\mathbb{R}^{|\pi|}$ for $v\in\mathbb{R}^{p_n}$ and $\pi=\{j_1,\dots,j_k\}\subseteq\{1,\dots,p_n\}$. Let $\beta^*_{s^*}\in\mathbb{R}^{|s^*|}$ be $\beta^*=\beta^*(\{1,\dots,p_n\})$ restricted to its support $s^*$. Note that if $s^*\subseteq s$, then provided the projections are unique (see Section 2) we have
$$\beta^*_{s^*}=\beta^*(s^*)=\beta^*(s)_{s^*}.$$
Note that this implies that for every superset s of $s^*$ the projection $\beta^*(s)$ on the model pertaining to s is obtained by appending the projection $\beta^*(s^*)$ with the appropriate number of zeros. Moreover, let
$$\beta^*_{min}=\min_{i\in s^*}|\beta_i^*|.$$
We remark that $\beta^*$, $s^*$ and $\beta^*_{min}$ may depend on n. We stress that $\beta^*_{min}$ is an important quantity in the development here, as it turns out that it must not decrease too quickly in order to obtain the approximation results for $\hat\beta_L$ (see Theorem 1). Note that, when the parametric model is correctly specified, i.e., $q(x)=q_0(\beta^Tx)$ for some $\beta$ with l being the associated log-likelihood loss, if s is the support of $\beta$, we have $s=s^*$.
First, we discuss quantities and assumptions needed for the first step of SS procedure.
We consider cones of the form:
$$C_\varepsilon=\{\Delta\in\mathbb{R}^{p_n}:\ \|\Delta_{s^{*c}}\|_1\le(3+\varepsilon)\|\Delta_{s^*}\|_1\},$$
where $\varepsilon>0$, $s^{*c}=\{1,\dots,p_n\}\setminus s^*$ and $\Delta_{s^*}=(\Delta_{s_1^*},\dots,\Delta_{s_{|s^*|}^*})$ for $s^*=\{s_1^*,\dots,s_{|s^*|}^*\}$. Cones $C_\varepsilon$ are of special importance because we prove that $\hat\beta_L-\beta^*\in C_\varepsilon$ (see Lemma 3). In addition, we note that since the $l_1$-norm is decomposable in the sense that $\|v_A\|_1+\|v_{A^c}\|_1=\|v\|_1$, the definition of the cone above can be stated as
$$C_\varepsilon=\{\Delta\in\mathbb{R}^{p_n}:\ \|\Delta\|_1\le(4+\varepsilon)\|\Delta_{s^*}\|_1\}.$$
Thus, $C_\varepsilon$ consists of vectors which do not put too much mass on the complement of $s^*$. Let $H\in\mathbb{R}^{p_n\times p_n}$ be a fixed non-negative definite matrix. For the cone $C_\varepsilon$, we define a quantity $\kappa_H(\varepsilon)$ which can be regarded as a restricted minimal eigenvalue of a matrix in the high-dimensional set-up:
$$\kappa_H(\varepsilon)=\inf_{\Delta\in C_\varepsilon\setminus\{0\}}\frac{\Delta^TH\Delta}{\Delta^T\Delta}.$$
In the considered context, H is usually taken as the hessian $D^2R(\beta^*)$ and, e.g., for the quadratic loss, it equals $E\,XX^T$. When H is non-negative definite but not strictly positive definite, its smallest eigenvalue is $\lambda_1=0$ and thus $\inf_{\Delta\in\mathbb{R}^{p_n}\setminus\{0\}}\Delta^TH\Delta/\Delta^T\Delta=\lambda_1=0$. That is why we have to restrict the minimization in Equation (12) in order to have $\kappa_H(\varepsilon)>0$ in the high-dimensional case. As we prove that $\Delta_0=\hat\beta_L-\beta^*\in C_\varepsilon$ and use $0<\kappa_H(\varepsilon)\le\Delta_0^TH\Delta_0/\Delta_0^T\Delta_0$, it is useful to restrict the minimization in Equation (12) to $C_\varepsilon\setminus\{0\}$. Let R and $R_n$ be the risk and the empirical risk defined above. Moreover, we introduce the following notation:
$$W(b)=R(b)-R(\beta^*),$$
$$W_n(b)=R_n(b)-R_n(\beta^*),$$
$$B_p(r)=\{\Delta\in\mathbb{R}^{p_n}:\ \|\Delta\|_p\le r\},\quad p=1,2,$$
$$S(r)=\sup_{b\in\mathbb{R}^{p_n}:\,b-\beta^*\in B_1(r)}|W(b)-W_n(b)|.$$
Note that $ER_n(b)=R(b)$. Thus, S(r) corresponds to the oscillation of the centred empirical risk over the ball $B_1(r)$. We need the following Margin Condition (MC) in Lemma 3 and Theorem 1:
(MC)
There exist $\vartheta,\varepsilon,\delta>0$ and a non-negative definite matrix $H\in\mathbb{R}^{p_n\times p_n}$ such that for all b with $b-\beta^*\in C_\varepsilon\cap B_1(\delta)$ we have
$$R(b)-R(\beta^*)\ge\frac{\vartheta}{2}(b-\beta^*)^TH(b-\beta^*).$$
The above condition can be viewed as a weaker version of strong convexity of the function R (for which the right-hand side would be replaced by $\vartheta\|b-\beta^*\|^2$) in a restricted neighbourhood of $\beta^*$ (namely, in the intersection of the ball $B_1(\delta)$ and the cone $C_\varepsilon$). We stress the fact that H is not required to be positive definite, as in Section 3 we use Condition (MC) together with conditions stronger than $\kappa_H(\varepsilon)>0$ which imply that the right-hand side of the inequality in (MC) is positive. We also do not require here twice differentiability of R. We note in particular that Condition (MC) is satisfied in the case of the logistic loss, X being a bounded random variable and $H=D^2R(\beta^*)$ (see [23,24,25]). It is also easily seen that (MC) is satisfied for the quadratic loss, X such that $E\|X\|_2^2<\infty$ and $H=D^2R(\beta^*)$. A similar condition to (MC) (called Restricted Strict Convexity) was considered in [16] for the empirical risk $R_n$:
$$R_n(\beta^*+\Delta)-R_n(\beta^*)\ge DR_n(\beta^*)^T\Delta+\kappa_L\|\Delta\|^2-\tau^2(\beta^*)$$
for all $\Delta\in C(3,s^*)$, some $\kappa_L>0$, and a tolerance function $\tau$. Note, however, that (MC) is a deterministic condition, whereas Restricted Strict Convexity has to be satisfied for the random empirical risk function.
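As a worked special case (added here for illustration and using only the definitions above), for the quadratic loss Condition (MC) holds with equality on the whole space: since $R(b)=E(Y-b^TX)^2/2$ and $DR(\beta^*)=-EX(Y-X^T\beta^*)=0$, a direct expansion gives
$$R(b)-R(\beta^*)=\frac12(b-\beta^*)^T\,E\,XX^T\,(b-\beta^*),$$
so (MC) is satisfied with $\vartheta=1$ and $H=D^2R(\beta^*)=E\,XX^T$.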
Another important assumption, used in Theorem 1 and Lemma 2, is the Lipschitz property of ρ :
(LL)
There exists $L>0$ such that for all $b_1,b_2\in\mathbb{R}$ and $y\in\{0,1\}$: $|\rho(b_1,y)-\rho(b_2,y)|\le L|b_1-b_2|$.
Now, we discuss preliminaries needed for the development of the second step of the SS procedure. Let |w| stand for the cardinality of w. For the second step of the procedure we consider an arbitrary family $M\subseteq 2^{\{1,\dots,p_n\}}$ of models (which are identified with subsets of $\{1,\dots,p_n\}$ and may be data-dependent) such that $s^*\in M$ and $|w|\le k_n$ a.e. for all $w\in M$, where $k_n\in\mathbb{N}_+$ is some deterministic sequence. We define the Generalized Information Criterion (GIC) as:
$$GIC(w)=nR_n(\hat\beta(w))+a_n|w|,$$
where
$$\hat\beta(w)=\operatorname{argmin}_{b\in\mathbb{R}^{p_n}:\,b_{w^c}=0_{|w^c|}}R_n(b)$$
is the ML estimator for the model w, as the minimization above is taken over all vectors b with support in w. Parameter $a_n>0$ is a penalty factor depending on the sample size n which weighs the complexity of the model described by the number of its variables |w|. Typical examples of $a_n$ include:
  • AIC (Akaike Information Criterion): a n = 2 ;
  • BIC (Bayesian Information Criterion): a n = log n ; and
  • EBIC(d) (Extended BIC): a n = log n + 2 d log p n , where d > 0 .
AIC, BIC and EBIC were introduced by Akaike [26], Schwarz [27], and Chen and Chen [19], respectively. Note that for $n\ge 8$ the BIC penalty is larger than the AIC penalty and, in its turn, the EBIC penalty is larger than the BIC penalty.
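For losses that are negative log-likelihoods (logistic or probit), $nR_n(\hat\beta(w))$ equals minus the maximized log-likelihood of the fitted model, so the GIC defined above can be computed directly from a glm fit. A small R sketch (illustrative helper names, not the authors' code):

```r
gic_value <- function(fit, w_size, a_n) {
  # n * R_n(beta_hat(w)) = -logLik(fit) for log-likelihood-based losses
  -as.numeric(logLik(fit)) + a_n * w_size
}
# the penalties listed above, for sample size n and p_n candidate predictors:
a_aic  <- function(n, p_n) 2
a_bic  <- function(n, p_n) log(n)
a_ebic <- function(n, p_n, d = 1) log(n) + 2 * d * log(p_n)
# usage sketch: gic_value(glm(Y ~ X[, w], family = binomial()), length(w), a_bic(n, p))
```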
We study properties of S k ( r ) for k = 1 , 2 , where:
$$S_k(r)=\sup_{b\in D_k:\,b-\beta^*\in B_2(r)}|W_n(b)-W(b)|$$
is the maximal absolute deviation of the centred empirical risk over the corresponding set, and the sets $D_k$ for $k=1,2$ are defined as follows:
$$D_1=\{b\in\mathbb{R}^{p_n}:\ \exists\,w\in M:\ |w|\le k_n,\ s^*\cup\operatorname{supp}b\subseteq w\},$$
$$D_2=\{b\in\mathbb{R}^{p_n}:\ \operatorname{supp}b\subseteq s^*\}.$$
The idea here is simply to consider sets $D_i$ consisting of vectors having no more than $k_n$ non-zero coordinates. However, for $|s^*|\le k_n$, we need that for $b\in D_i$ we have $|\operatorname{supp}(b-\beta^*)|\le k_n$, which we exploit in Lemma 2. This entails the additional condition in the definition of $D_1$. Moreover, in Section 4, we consider the following condition $C_\epsilon(w)$ for $\epsilon>0$, $w\subseteq\{1,\dots,p_n\}$ and some $\theta>0$:
  • $C_\epsilon(w)$: $R(b)-R(\beta^*)\ge\theta\|b-\beta^*\|_2^2$ for all $b\in\mathbb{R}^{p_n}$ such that $\operatorname{supp}b\subseteq w$ and $b-\beta^*\in B_2(\epsilon)$.
We observe also that, although Conditions (MC) and $C_\epsilon(w)$ are similar, they are not equivalent, as they hold for $v=b-\beta^*$ belonging to different sets: $B_1(r)\cap C_\varepsilon$ and $B_2(\epsilon)\cap\{\Delta\in\mathbb{R}^{p_n}:\operatorname{supp}\Delta\subseteq w\}$, respectively. If the minimal eigenvalue $\lambda_{min}$ of the matrix H in Condition (MC) is positive and Condition (MC) holds for $b-\beta^*\in B_1(r)$ (instead of for $b-\beta^*\in C_\varepsilon\cap B_1(r)$), then we have for $b-\beta^*\in B_2(r/\sqrt{p_n})\subseteq B_1(r)$:
$$R(b)-R(\beta^*)\ge\frac{\vartheta}{2}(b-\beta^*)^TH(b-\beta^*)\ge\frac{\vartheta\lambda_{min}}{2}\|b-\beta^*\|_2^2.$$
Furthermore, if $\lambda_{max}$ is the maximal eigenvalue of H and Condition $C_\epsilon(w)$ holds for all $v=b-\beta^*\in B_2(r)$ without restriction on $\operatorname{supp}b$, then we have for $b-\beta^*\in B_1(r)\cap B_2(r)$:
$$R(b)-R(\beta^*)\ge\theta\|b-\beta^*\|_2^2\ge\frac{\theta}{\lambda_{max}}(b-\beta^*)^TH(b-\beta^*).$$
Thus, Condition (MC) holds in this case. A similar condition to Condition C ϵ ( w ) for empirical risk R n was considered by Kim and Jeon [28] (formula (2.1)) in the context of GIC minimization. It turns out that Condition C ϵ ( w ) together with ρ ( · , y ) being convex for all y and satisfying Lipschitz Condition (LL) are sufficient to establish bounds which ensure GIC consistency for k n ln p n = o ( n ) and k n ln p n = o ( a n ) (see Corollaries 2 and 3). First, we state the following basic inequality. W ( v ) and S ( r ) are defined above the definition of Margin Condition.
Lemma 1.
(Basic inequality). Let $\rho(\cdot,y)$ be a convex function for all y and for some $r>0$ let
$$u=\frac{r}{r+\|\hat\beta_L-\beta^*\|_1},\qquad v=u\hat\beta_L+(1-u)\beta^*.$$
Then
$$W(v)+\lambda\|v-\beta^*\|_1\le S(r)+2\lambda\|v_{s^*}-\beta^*_{s^*}\|_1.$$
The proof of the lemma is moved to the Appendix A. It follows from the lemma that, since in view of the decomposability of the $l_1$-distance we have $\|v-\beta^*\|_1=\|(v-\beta^*)_{s^*}\|_1+\|(v-\beta^*)_{s^{*c}}\|_1$, when S(r) is small, $\|(v-\beta^*)_{s^{*c}}\|_1$ is not large in comparison with $\|(v-\beta^*)_{s^*}\|_1$.
Quantities $S_k(r)$ are defined in Equation (18). Recall that S(r) is an oscillation taken over the ball $B_1(r)$, whereas $S_i$, $i=1,2$, are oscillations taken over the ball $B_2(r)$ with a restriction on the number of non-zero coordinates.
Lemma 2.
Let $\rho(\cdot,y)$ be a convex function for all y satisfying the Lipschitz Condition (LL). Assume that the $X_{ij}$ for $j\ge 1$ are subgaussian $Subg(\sigma_{jn}^2)$, where $\sigma_{jn}\le s_n$. Then, for $r,t>0$:
1. $P(S(r)>t)\le\dfrac{8Lrs_n\sqrt{\log(p_n^2)}}{t\sqrt{n}}$,
2. $P(S_1(r)\ge t)\le\dfrac{8Lrs_n\sqrt{k_n\ln(p_n^2)}}{t\sqrt{n}}$,
3. $P(S_2(r)\ge t)\le\dfrac{4Lrs_n\sqrt{|s^*|}}{t\sqrt{n}}$.
The proof of the Lemma above, which relies on Chebyshev inequality, symmetrization inequality (see Lemma 2.3.1 of [29]), and Talagrand–Ledoux inequality ([30], Theorem 4.12), is moved to the Appendix A. In the case when β * does not depend on n and thus its support does not change, Part 3 implies in particular that S 2 ( r ) is of the order n 1 / 2 in probability.

3. Properties of Lasso for a General Loss Function and Random Predictors

The main result in this section is Theorem 1. The idea of the proof is based on the fact that, if S(r) defined in Equation (16) is sufficiently small (the condition $S(r)\le\bar C\lambda r$ is satisfied), then $\hat\beta_L$ lies in the ball $\{\Delta\in\mathbb{R}^{p_n}:\|\Delta-\beta^*\|_1\le r\}$ (see Lemma 3). Using the tail inequality for S(r) proved in Lemma 2, we obtain Theorem 1. Note that $\kappa_H(\varepsilon)$ has to be bounded away from 0 (condition $2|s^*|\lambda\le\kappa_H(\varepsilon)\vartheta\tilde Cr$). Convexity of $\rho(\cdot,y)$ below is understood as convexity for both $y=0,1$.
Lemma 3.
Let $\rho(\cdot,y)$ be a convex function and assume that $\lambda>0$. Moreover, assume the margin Condition (MC) with constants $\vartheta,\varepsilon,\delta>0$ and some non-negative definite matrix $H\in\mathbb{R}^{p_n\times p_n}$. If for some $r\in(0,\delta]$ we have $S(r)\le\bar C\lambda r$ and $2|s^*|\lambda\le\kappa_H(\varepsilon)\vartheta\tilde Cr$, where $\bar C=\varepsilon/(8+2\varepsilon)$ and $\tilde C=2/(4+\varepsilon)$, then
$$\|\hat\beta_L-\beta^*\|_1\le r.$$
The proof of the lemma is moved to the Appendix A.
The first main result provides an exponential inequality for $P(\|\hat\beta_L-\beta^*\|_1\le\beta^*_{min}/2)$. The threshold $\beta^*_{min}/2$ is crucial there as it ensures separation: $\max_{i\in s^{*c}}|\hat\beta_{L,i}|\le\min_{i\in s^*}|\hat\beta_{L,i}|$ (see the proof of Corollary 1).
Theorem 1.
Let $\rho(\cdot,y)$ be a convex function for all y satisfying the Lipschitz Condition (LL). Assume that $X_{ij}\sim Subg(\sigma_{jn}^2)$, $\beta^*$ exists and is unique, the margin Condition (MC) is satisfied for $\varepsilon,\delta,\vartheta>0$ and a non-negative definite matrix $H\in\mathbb{R}^{p_n\times p_n}$, and let
$$2|s^*|\lambda\le\vartheta\kappa_H(\varepsilon)\tilde C\min\Big\{\frac{\beta^*_{min}}{2},\delta\Big\},$$
where $\tilde C=2/(4+\varepsilon)$. Then,
$$P\Big(\|\hat\beta_L-\beta^*\|_1\le\frac{\beta^*_{min}}{2}\Big)\ge 1-2p_ne^{-\frac{n\varepsilon^2\lambda^2}{A}},$$
where $A=128L^2(4+\varepsilon)^2s_n^2$.
Proof. 
Let
$$m=\min\Big\{\frac{\beta^*_{min}}{2},\delta\Big\}.$$
Lemmas 2 and 3 imply that:
$$P\Big(\|\hat\beta_L-\beta^*\|_1>\frac{\beta^*_{min}}{2}\Big)\le P\big(\|\hat\beta_L-\beta^*\|_1>m\big)\le P\big(S(m)>\bar C\lambda m\big)\le 2p_ne^{-\frac{n\varepsilon^2\lambda^2}{128L^2(4+\varepsilon)^2s_n^2}}.$$
 □
Corollary 1.
(Separation property) If the assumptions of Theorem 1 are satisfied,
$$\lambda=\frac{8Ls_n(4+\varepsilon)\phi}{\varepsilon}\sqrt{\frac{2\log(2p_n)}{n}}$$
for some $\phi>1$, $\kappa_H(\varepsilon)>d$ for some $d,\varepsilon>0$ for large n, and $|s^*|\lambda=o(\min\{\beta^*_{min},1\})$, then
$$P\Big(\|\hat\beta_L-\beta^*\|_1\le\frac{\beta^*_{min}}{2}\Big)\to 1.$$
Moreover,
$$P\Big(\max_{i\in s^{*c}}|\hat\beta_{L,i}|\le\min_{i\in s^*}|\hat\beta_{L,i}|\Big)\to 1.$$
Proof. 
The first part of the corollary follows directly from Theorem 1 and the observation that:
$$P\Big(\|\hat\beta_L-\beta^*\|_1>\frac{\beta^*_{min}}{2}\Big)\le e^{\log(2p_n)-\frac{n\varepsilon^2\lambda^2}{128L^2(4+\varepsilon)^2s_n^2}}=e^{\log(2p_n)(1-\phi^2)}\to 0.$$
Now, we prove that the condition $\|\hat\beta_L-\beta^*\|_1\le\beta^*_{min}/2$ implies the separation property
$$\max_{i\in s^{*c}}|\hat\beta_{L,i}|\le\min_{i\in s^*}|\hat\beta_{L,i}|.$$
Indeed, observe that for all $j\in\{1,\dots,p_n\}$ we have:
$$\frac{\beta^*_{min}}{2}\ge\|\hat\beta_L-\beta^*\|_1\ge|\hat\beta_{L,j}-\beta_j^*|.$$
If $j\in s^*$, then using the triangle inequality yields:
$$|\hat\beta_{L,j}-\beta_j^*|\ge|\beta_j^*|-|\hat\beta_{L,j}|\ge\beta^*_{min}-|\hat\beta_{L,j}|.$$
Hence, from the above inequality and Equation (22), we obtain for $j\in s^*$: $|\hat\beta_{L,j}|\ge\beta^*_{min}/2$. If $j\in s^{*c}$, then $\beta_j^*=0$ and Equation (22) takes the form: $|\hat\beta_{L,j}|\le\beta^*_{min}/2$. This ends the proof. □
We note that the separation property in Equation (21) means that, when λ is chosen in an appropriate manner, recovery of $s^*$ is feasible with large probability if all predictors corresponding to absolute values of Lasso coefficients exceeding a certain threshold are chosen. The threshold unfortunately depends on unknown parameters of the model. However, the separation property allows one to restrict attention to a nested family of models and thus to decrease significantly the computational complexity of the problem. This is dealt with in the next section. Note moreover that if γ in Equation (10) is finite, then λ defined in the Corollary is of order $(\log p_n/n)^{1/2}$, which is the optimal order of the Lasso penalty in the case of deterministic regressors (see, e.g., [2]).

4. GIC Consistency for a General Loss Function and Random Predictors

Theorems 2 and 3 state probability inequalities related to the behavior of GIC on supersets and on subsets of $s^*$, respectively. In a nutshell, we show for supersets and subsets separately that the probability that the minimum of GIC is not attained at $s^*$ is exponentially small. Corollaries 2 and 3 present asymptotic conditions for GIC consistency in the aforementioned situations. Corollary 4 gathers the conclusions of Theorem 1 and Corollaries 1–3 to show consistency of the SS procedure (see [17] for consistency of the SOS procedure for a linear model with deterministic predictors) in the case of subgaussian variables. Note that in the Theorem below we want to consider minimization of GIC in Equation (23) over all supersets of $s^*$, as in our applications M is data-dependent. As the number of such possible subsets is at least $\binom{p_n-|s^*|}{k_n-|s^*|}$, the proof has to be more involved than a reasoning based on the Bonferroni inequality.
Theorem 2.
Assume that $\rho(\cdot,y)$ is a convex, Lipschitz function with constant $L>0$, $X_{ij}\sim Subg(\sigma_{jn}^2)$, and condition $C_\epsilon(w)$ holds for some $\epsilon,\theta>0$ and for every $w\subseteq\{1,\dots,p_n\}$ such that $|w|\le k_n$. Then, for any $r<\epsilon$, we have:
$$P\Big(\min_{w\in M:\,s^*\subsetneq w}GIC(w)\le GIC(s^*)\Big)\le 2p_ne^{-\frac{a_n^2}{B}}+2p_ne^{-\frac{nD}{k_n}},$$
where $B=32nL^2r^2k_ns_n^2$ and $D=\theta^2r^2/(512L^2s_n^2)$.
Proof. 
If $s^*\subsetneq w\in M$ and $\hat\beta(w)-\beta^*\in B_2(r)$, then in view of the inequalities $R_n(\hat\beta(s^*))\le R_n(\beta^*)$ and $R(\beta^*)\le R(b)$ we have:
$$R_n(\hat\beta(s^*))-R_n(\hat\beta(w))\le\sup_{b\in D_1:\,b-\beta^*\in B_2(r)}\big(R_n(\beta^*)-R_n(b)\big)\le\sup_{b\in D_1:\,b-\beta^*\in B_2(r)}\big((R_n(\beta^*)-R(\beta^*))-(R_n(b)-R(b))\big)\le\sup_{b\in D_1:\,b-\beta^*\in B_2(r)}\big|R_n(b)-R(b)-(R_n(\beta^*)-R(\beta^*))\big|=S_1(r).$$
Note that $a_n(|w|-|s^*|)\ge a_n$. Hence, if for some $w\supsetneq s^*$ we have $GIC(w)\le GIC(s^*)$, then we obtain $nR_n(\hat\beta(s^*))-nR_n(\hat\beta(w))\ge a_n(|w|-|s^*|)$ and from the above inequality we have $S_1(r)\ge a_n/n$. Furthermore, if $\hat\beta(w)-\beta^*\in B_2(r)^c$ and $r<\epsilon$, then consider:
$$v=u\hat\beta(w)+(1-u)\beta^*,$$
where $u=r/(r+\|\hat\beta(w)-\beta^*\|_2)$. Then
$$\|v-\beta^*\|_2=u\|\hat\beta(w)-\beta^*\|_2=\frac{r\cdot\|\hat\beta(w)-\beta^*\|_2}{r+\|\hat\beta(w)-\beta^*\|_2}\ge\frac{r}{2},$$
as the function $x/(x+r)$ is increasing with respect to x for $x>0$. Moreover, we have $\|v-\beta^*\|_2\le r<\epsilon$. Hence, in view of the $C_\epsilon(w)$ condition, we get:
$$R(v)-R(\beta^*)\ge\theta\|v-\beta^*\|_2^2\ge\frac{\theta r^2}{4}.$$
From the convexity of $R_n$, we have:
$$R_n(v)\le u\big(R_n(\hat\beta(w))-R_n(\beta^*)\big)+R_n(\beta^*)\le R_n(\beta^*).$$
Let $\operatorname{supp}v$ denote the support of the vector v. We observe that $\operatorname{supp}v\subseteq\operatorname{supp}\hat\beta(w)\cup\operatorname{supp}\beta^*\subseteq w$, hence $v\in D_1$. Finally, we have:
$$S_1(r)\ge R_n(\beta^*)-R(\beta^*)-(R_n(v)-R(v))\ge R(v)-R(\beta^*)\ge\frac{\theta r^2}{4}.$$
Hence, we obtain the following sequence of inequalities:
$$P\Big(\min_{w\in M:\,s^*\subsetneq w}GIC(w)\le GIC(s^*)\Big)\le P\Big(S_1(r)\ge\frac{a_n}{n},\ \forall w\in M:\hat\beta(w)-\beta^*\in B_2(r)\Big)+P\big(\exists w\in M:\ s^*\subsetneq w,\ \hat\beta(w)-\beta^*\in B_2(r)^c\big)\le P\Big(S_1(r)\ge\frac{a_n}{n}\Big)+P\Big(S_1(r)\ge\frac{\theta r^2}{4}\Big)\le 2p_ne^{-\frac{a_n^2}{32nL^2r^2k_ns_n^2}}+2p_ne^{-\frac{n\theta^2r^2}{512L^2k_ns_n^2}}.$$
 □
Corollary 2.
Assume that the conditions of Theorem 2 hold for some $\epsilon,\theta>0$ and for every $w\subseteq\{1,\dots,p_n\}$ such that $|w|\le k_n$, that $k_n\ln(p_n^2)=o(n)$, and that $\liminf_n\frac{D_na_n}{k_n\log(2p_n)}>1$, where $D_n^{-1}=128L^2s_n^2\phi/\theta$ for some $\phi>1$. Then, we have
P ( min w M : s * w G I C ( w ) G I C ( s * ) ) 0 .
Proof. 
We choose the radius r of the ball $B_2(r)$ in a special way. Namely, we take:
$$r_n^2=\frac{512\phi^2L^2s_n^2\log(2p_n)k_n}{n\theta^2}$$
for some $\phi>1$. In view of the assumptions, $r_n\to 0$. Consider $n_0$ such that $r_n<\epsilon$ for all $n\ge n_0$. Hence, the second term of the upper bound in Equation (23) for $r=r_n$ is equal to:
$$2p_ne^{-\frac{n\theta^2r_n^2}{512L^2k_ns_n^2}}=e^{\log(2p_n)(1-\phi^2)}\to 0.$$
Similarly, the first term of the upper bound in Equation (23) is equal to:
$$2p_ne^{-\frac{a_n^2}{32nL^2r_n^2k_ns_n^2}}=e^{\log(2p_n)\Big(1-\frac{a_n^2\theta^2}{128^2L^4k_n^2s_n^4\phi^2\log^2(2p_n)}\Big)}=e^{\log(2p_n)\Big(1-\frac{D_n^2a_n^2}{k_n^2\log^2(2p_n)}\Big)}\to 0.$$
These two convergences end the proof. □
The most restrictive condition of Corollary 2 is $\liminf_n\frac{D_na_n}{k_n\log(2p_n)}>1$, which is slightly weaker than $k_n\ln(p_n^2)=o(a_n)$. The following remark, proved in the Appendix A, gives sufficient conditions for consistency with BIC and EBIC penalties, which do not satisfy the condition $k_n\log(p_n)=o(a_n)$.
Remark 1.
If in Corollary 2 we assume $D_n\ge A$ for some $A>0$, then the condition $\liminf_n\frac{D_na_n}{k_n\log(2p_n)}>1$ holds when:
(1) 
$a_n=\log n$ and $p_n<\frac{1}{2}n^{\frac{A}{k_n(1+u)}}$ for some $u>0$.
(2) 
$a_n=\log n+2\gamma\log p_n$, $k_n\le C$ and $2A\gamma-(1+u)C\ge 0$, where $C,u>0$.
(3) 
$a_n=\log n+2\gamma\log p_n$, $k_n\le C$, $2A\gamma-(1+u)C<0$, $p_n<Bn^\delta$, where $\delta=\frac{A}{(1+u)C-2A\gamma}$ and $B=2^{-(1+u)C}$.
Theorem 3 is an analog of Theorem 2 for subsets of s * .
Theorem 3.
Assume that $\rho(\cdot,y)$ is a convex, Lipschitz function with constant $L>0$, $X_{ij}\sim Subg(\sigma_{jn}^2)$, condition $C_\epsilon(s^*)$ holds for some $\epsilon,\theta>0$, and $8a_n|s^*|\le\theta n\min\{\epsilon^2,\beta^{*2}_{min}\}$. Then, we have:
$$P\Big(\min_{w\in M:\,w\subsetneq s^*}GIC(w)\le GIC(s^*)\Big)\le 2e^{-n\min\{\epsilon,\beta^*_{min}\}^2E},$$
where $E=\theta^2/(2^{12}L^2s_n^2|s^*|)$.
Proof. 
Suppose that for some $w\subsetneq s^*$ we have $GIC(w)\le GIC(s^*)$. This is equivalent to:
$$nR_n(\hat\beta(s^*))-nR_n(\hat\beta(w))\ge a_n(|w|-|s^*|).$$
In view of the inequalities $R_n(\hat\beta(s^*))\le R_n(\beta^*)$ and $a_n(|w|-|s^*|)\ge-a_n|s^*|$, we obtain:
$$nR_n(\beta^*)-nR_n(\hat\beta(w))\ge-a_n|s^*|.$$
Let $v=u\hat\beta(w)+(1-u)\beta^*$ for some $u\in[0,1]$ to be specified later. From the convexity of ρ, we obtain:
$$nR_n(\beta^*)-nR_n(v)\ge nu\big(R_n(\beta^*)-R_n(\hat\beta(w))\big)\ge-ua_n|s^*|\ge-a_n|s^*|.$$
We consider two cases separately:
(1) $\beta^*_{min}>\epsilon$.
First, observe that
$$8a_n|s^*|\le\theta\epsilon^2n,$$
which follows from our assumption. Let $u=\epsilon/(\epsilon+\|\hat\beta(w)-\beta^*\|_2)$ and
$$v=u\hat\beta(w)+(1-u)\beta^*.$$
Note that $\|\hat\beta(w)-\beta^*\|_2\ge\|\beta^*_{s^*\setminus w}\|_2\ge\beta^*_{min}$. Then, as the function $d(x)=x/(x+c)$ is increasing and bounded from above by 1 for $x,c>0$, we obtain:
$$\epsilon\ge\|v-\beta^*\|_2=\frac{\epsilon\|\hat\beta(w)-\beta^*\|_2}{\epsilon+\|\hat\beta(w)-\beta^*\|_2}\ge\frac{\epsilon\beta^*_{min}}{\epsilon+\beta^*_{min}}>\frac{\epsilon^2}{2\epsilon}=\frac{\epsilon}{2}.$$
Hence, in view of the $C_\epsilon(s^*)$ condition, we have:
$$R(v)-R(\beta^*)>\frac{\theta\epsilon^2}{4}.$$
Using Equations (24)–(26) and the above inequality yields:
$$S_2(\epsilon)\ge R_n(\beta^*)-R(\beta^*)-(R_n(v)-R(v))>\frac{\theta\epsilon^2}{4}-\frac{a_n}{n}|s^*|\ge\frac{\theta\epsilon^2}{8}.$$
Thus, in view of Lemma 2, we obtain:
$$P\Big(\min_{w\in M:\,w\subsetneq s^*}GIC(w)\le GIC(s^*)\Big)\le P\Big(S_2(\epsilon)>\frac{\theta\epsilon^2}{8}\Big)\le 2e^{-\frac{n\theta^2\epsilon^2}{4096L^2s_n^2|s^*|}}.$$
(2) $\beta^*_{min}\le\epsilon$.
In this case, we take $u=\beta^*_{min}/(\beta^*_{min}+\|\hat\beta(w)-\beta^*\|_2)$ and define v as in Equation (26). Analogously, as in Equation (27), we have:
$$\frac{\beta^*_{min}}{2}\le\|v-\beta^*\|_2\le\beta^*_{min}.$$
Hence, in view of the $C_\epsilon(s^*)$ condition, we have:
$$R(v)-R(\beta^*)\ge\frac{\theta\beta^{*2}_{min}}{4}.$$
Using Equation (24) and the above inequality yields:
$$S_2(\beta^*_{min})\ge R_n(\beta^*)-R(\beta^*)-(R_n(v)-R(v))\ge\frac{\theta\beta^{*2}_{min}}{4}-\frac{a_n}{n}|s^*|\ge\frac{\theta}{8}\beta^{*2}_{min}.$$
Thus, in view of Lemma 2, we obtain:
$$P\Big(\min_{w\in M:\,w\subsetneq s^*}GIC(w)\le GIC(s^*)\Big)\le P\Big(S_2(\beta^*_{min})\ge\frac{\theta}{8}\beta^{*2}_{min}\Big)\le 2e^{-\frac{n\theta^2\beta^{*2}_{min}}{2^{12}L^2s_n^2|s^*|}}.$$
By combining Equations (28) and (29), the theorem follows. □
Corollary 3.
Assume that the loss $\rho(\cdot,y)$ is a convex, Lipschitz function with constant $L>0$, $X_{ij}\sim Subg(\sigma_{jn}^2)$, condition $C_\epsilon(s^*)$ holds for some $\epsilon,\theta>0$ and $a_n|s^*|=o(n\min\{1,\beta^*_{min}\}^2)$. Then
$$P\Big(\min_{w\in M:\,w\subsetneq s^*}GIC(w)\le GIC(s^*)\Big)\to 0.$$
Proof. 
First, observe that, as $a_n\to\infty$,
$$a_n|s^*|=o(n\min\{1,\beta^*_{min}\}^2)$$
implies
$$|s^*|=o(n\min\{1,\beta^*_{min}\}^2),$$
and thus in view of Theorem 3 we have
$$P\Big(\min_{w\in M:\,w\subsetneq s^*}GIC(w)\le GIC(s^*)\Big)\to 0.$$

5. Selection Consistency of SS Procedure

In this section, we combine the results of the two previous sections to establish consistency of a two-step SS procedure. It consists in construction of a nested family of models M using magnitude of Lasso coefficients and then finding the minimizer of GIC over this family. As M is data dependent to establish consistency of the procedure we use Corollaries 2 and 3 in which the minimizer of GIC is considered over all subsets and supersets of s * .
SS (Screening and Selection) procedure is defined as follows:
  • Choose some λ > 0 .
  • Find $\hat\beta_L=\operatorname{argmin}_{b\in\mathbb{R}^{p_n}}\{R_n(b)+\lambda\|b\|_1\}$.
  • Find $\hat s_L=\operatorname{supp}\hat\beta_L=\{j_1,\dots,j_k\}$ such that $|\hat\beta_{L,j_1}|\ge\dots\ge|\hat\beta_{L,j_k}|>0$ and $j_1,\dots,j_k\in\{1,\dots,p_n\}$.
  • Define $M_{SS}=\{\emptyset,\{j_1\},\{j_1,j_2\},\dots,\{j_1,j_2,\dots,j_k\}\}$.
  • Find $\hat s^*=\operatorname{argmin}_{w\in M_{SS}}GIC(w)$ (a minimal code sketch of these steps is given below).
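A minimal R sketch of the SS procedure for the logistic loss follows (an illustration under the assumptions stated in the text, using glmnet for the screening step and glm for the GIC step; all names are ours and this is not the authors' implementation):

```r
library(glmnet)

ss_procedure <- function(X, Y, lambda, a_n = log(nrow(X))) {
  # Steps 1-2: Lasso estimator for the chosen lambda (logistic loss)
  beta_L <- as.numeric(coef(glmnet(X, Y, family = "binomial", lambda = lambda)))[-1]
  # Step 3: order the non-zero coordinates by decreasing absolute value
  nz  <- which(beta_L != 0)
  ord <- nz[order(abs(beta_L[nz]), decreasing = TRUE)]
  # Step 4: nested family M_SS = {emptyset, {j1}, {j1,j2}, ...}
  M_SS <- c(list(integer(0)), lapply(seq_along(ord), function(k) ord[1:k]))
  # Step 5: minimize GIC over the nested family
  gic <- sapply(M_SS, function(w) {
    fit <- if (length(w) == 0) glm(Y ~ 1, family = binomial())
           else glm(Y ~ X[, w, drop = FALSE], family = binomial())
    -as.numeric(logLik(fit)) + a_n * length(w)
  })
  M_SS[[which.min(gic)]]   # selected model s_hat*
}
```

Here the BIC-type penalty $a_n=\log n$ is used as a default; the other penalties from Section 2 can be passed through the a_n argument.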
The SS procedure is a modification of SOS procedure in [17] designed for linear models. Since ordering step considered in [17] is omitted in the proposed modification, we abbreviate the name to SS.
Corollary 4 and Remark 2 describe situations in which the SS procedure is selection consistent. In them, we use the assumptions imposed in Section 2 and Section 3 together with the assumption that the support $s^*$ contains no more than $k_n$ elements, where $k_n$ is some deterministic sequence of integers. Let $M_{SS}$ be the nested family constructed in Step 4 of the SS procedure.
Corollary 4.
Assume that $\rho(\cdot,y)$ is a convex, Lipschitz function with constant $L>0$, $X_{ij}\sim Subg(\sigma_{jn}^2)$ and $\beta^*$ exists and is unique. If $k_n\in\mathbb{N}_+$ is some sequence, the margin Condition (MC) is satisfied for some $\vartheta,\delta,\varepsilon>0$, condition $C_\epsilon(w)$ holds for some $\epsilon,\theta>0$ and for every $w\subseteq\{1,\dots,p_n\}$ such that $|w|\le k_n$, and the following conditions are fulfilled:
  • $|s^*|\le k_n$,
  • $P(\forall w\in M_{SS}:|w|\le k_n)\to 1$,
  • $\liminf_n\kappa_H(\varepsilon)>0$ for some $\varepsilon>0$, where H is a non-negative definite matrix and $\kappa_H(\varepsilon)$ is defined in Equation (12),
  • $\log(p_n)=o(n\lambda^2)$,
  • $k_n\lambda=o(\min\{\beta^*_{min},1\})$,
  • $k_n\log p_n=o(n)$,
  • $k_n\log p_n=o(a_n)$,
  • $a_nk_n=o(n\min\{\beta^*_{min},1\}^2)$,
then for the SS procedure we have
$$P(\hat s^*=s^*)\to 1.$$
Proof. 
In view of Corollary 1, it follows from the separation property in Equation (22) that $P(s^*\in M_{SS})\to 1$. Let:
$$A_1=\Big\{\min_{w\in M_{SS}:\,w\supsetneq s^*,\,|w|\le k_n}GIC(w)\le GIC(s^*)\Big\},\quad A_2=\Big\{\min_{w\in M_{SS}:\,w\supsetneq s^*,\,|w|>k_n}GIC(w)\le GIC(s^*)\Big\},\quad B=\{\forall w\in M_{SS}:|w|\le k_n\}.$$
Then, using the fact that $A_2\cap B=\emptyset$, the union inequality and Corollary 2, we have:
$$P\Big(\min_{w\in M_{SS}:\,w\supsetneq s^*}GIC(w)\le GIC(s^*)\Big)=P(A_1\cup A_2)=P(A_1\cup(A_2\cap B^c))\le P(A_1)+P(B^c)\to 0.$$
In an analogous way, using $|s^*|\le k_n$ and Corollary 3 yields:
$$P\Big(\min_{w\in M_{SS}:\,w\subsetneq s^*}GIC(w)\le GIC(s^*)\Big)\to 0.$$
Now, observe that in view of the definition of $\hat s^*$ and the union inequality:
$$P(\hat s^*=s^*)=P\Big(\min_{w\in M_{SS}:\,w\neq s^*}GIC(w)>GIC(s^*)\Big)\ge 1-P\Big(\min_{w\in M_{SS}:\,w\supsetneq s^*}GIC(w)\le GIC(s^*)\Big)-P\Big(\min_{w\in M_{SS}:\,w\subsetneq s^*}GIC(w)\le GIC(s^*)\Big).$$
Thus, $P(\hat s^*=s^*)\to 1$ in view of the above inequality and Equations (30) and (31). □

5.1. Case of Misspecified Semi-Parametric Model

Consider now the important case of the misspecified semi-parametric model defined in Equation (5), for which the function $\tilde q$ is unknown and may be arbitrary. An interesting question is whether information about $\beta$ can be recovered when misspecification occurs. The answer is positive under some additional assumptions on the distribution of random predictors. Assume additionally that X satisfies
$$E(X\,|\,\beta^TX)=u_0+u\beta^TX,$$
where $\beta$ is the true parameter. Thus, the regressions of X given $\beta^TX$ have to be linear. We stress that the conditioning on $\beta^TX$ involves only the true $\beta$ in Equation (5). Then, it is known (cf. [5,10,11]) that $\beta^*=\eta\beta$ and $\eta\neq 0$ if $\operatorname{Cov}(Y,X)\neq 0$. Note that, because $\beta$ and $\beta^*$ are collinear and $\eta\neq 0$, it follows that $s=s^*$. This is important in practical applications as it shows that the position of the optimal separating direction given by $\beta$ can be consistently recovered. It is also worth mentioning that if Equation (32) is satisfied, the direction of $\beta$ coincides with the direction of the first canonical vector. We refer to the work of Kubkowski and Mielniczuk [7] for the proof and to the work of Kubkowski and Mielniczuk [6] for a discussion and up-to-date references on this problem. The linear regressions condition in Equation (32) is satisfied, e.g., by elliptically contoured distributions, in particular by the multivariate normal distribution. We note that it is proved in [18] that Equation (32) approximately holds for the majority of $\beta$. When Equation (32) holds exactly, the proportionality constant $\eta$ can be calculated numerically for known $\tilde q$ and $\beta$. We can thus state the following result provided Equation (32) is satisfied.
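For instance (a standard fact recalled here for completeness, not part of the original text), if $X\sim N(0,\Sigma)$, then $(X,\beta^TX)$ is jointly normal and
$$E(X\,|\,\beta^TX)=\frac{\Sigma\beta}{\beta^T\Sigma\beta}\,\beta^TX,$$
so Equation (32) holds with $u_0=0$ and $u=\Sigma\beta/(\beta^T\Sigma\beta)$.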
Corollary 5.
Assume that Equation (32) and the assumptions of Corollary 4 are satisfied and, moreover, that $\operatorname{Cov}(Y,X)\neq 0$. Then, $P(\hat s^*=s)\to 1$.
Remark 2.
If $p_n=O(e^{cn^\gamma})$ for some $c>0$, $\gamma\in(0,1/2)$, $\xi\in(0,0.5-\gamma)$, $u\in(0,0.5-\gamma-\xi)$, $k_n=O(n^\xi)$, $\lambda=C_n\sqrt{\log(p_n)/n}$, $C_n=O(n^u)$, $C_n\to+\infty$, $n^{-\gamma/2}=O(\beta^*_{min})$ and $a_n=dn^{1/2-u}$, then the assumptions imposed on the asymptotic behavior of the parameters in Corollary 4 are satisfied.
Note that $p_n$ is allowed to grow exponentially: $\log p_n=O(n^\gamma)$; however, $\beta^*_{min}$ may not decrease to 0 too quickly with regard to the growth of $p_n$: $n^{-\gamma/2}=O(\beta^*_{min})$.
Remark 3.
We note that, to apply Corollary 4 to the two-step procedure based on Lasso, it is required that $|s^*|\le k_n$ and that the support of the Lasso estimator contains, with probability tending to 1, no more than $k_n$ elements. Some results bounding $|\operatorname{supp}\hat\beta_L|$ are available for deterministic X (see [31]) and for random X (see [32]), but they are too weak to be useful for EBIC penalties. The other possibility to prove consistency of the two-step procedure is to modify its first step by using thresholded Lasso (see [33]) corresponding to the $\tilde k_n$ largest Lasso coefficients, where $\tilde k_n\in\mathbb{N}$ is such that $\tilde k_n=o(k_n)$. This is a subject of ongoing research.

6. Numerical Experiments

6.1. Selection Procedures

We note that the original procedure is defined for a single λ only. In the simulations discussed below, we implemented modifications of the SS procedure introduced in Section 5. In practice, it is generally more convenient to consider in the first step some sequence of penalty parameters $\lambda_1>\dots>\lambda_m>0$ instead of only one λ, in order to avoid choosing the “best” λ. For the fixed sequence $\lambda_1,\dots,\lambda_m$, we construct corresponding families $M_1,\dots,M_m$ analogously to M in Step 4 of the SS procedure. Thus, we arrive at the following SSnet procedure, which is a modification of the SOSnet procedure in [17]. Below, $\tilde b$ is the vector b with the first coordinate, corresponding to the intercept, omitted, $b=(b_0,\tilde b^T)^T$:
  • Choose some $\lambda_1>\dots>\lambda_m>0$.
  • Find $\hat\beta_L^{(i)}=\operatorname{argmin}_{b\in\mathbb{R}^{p_n+1}}R_n(b)+\lambda_i\|\tilde b\|_1$ for $i=1,\dots,m$.
  • Find $\hat s_L^{(i)}=\operatorname{supp}\hat{\tilde\beta}_L^{(i)}=\{j_1^{(i)},\dots,j_{k_i}^{(i)}\}$, where $j_1^{(i)},\dots,j_{k_i}^{(i)}$ are such that $|\hat\beta_{L,j_1^{(i)}}^{(i)}|\ge\dots\ge|\hat\beta_{L,j_{k_i}^{(i)}}^{(i)}|>0$ for $i=1,\dots,m$.
  • Define $M_i=\{\{j_1^{(i)}\},\{j_1^{(i)},j_2^{(i)}\},\dots,\{j_1^{(i)},j_2^{(i)},\dots,j_{k_i}^{(i)}\}\}$ for $i=1,\dots,m$.
  • Define $M=\{\emptyset\}\cup\bigcup_{i=1}^mM_i$.
  • Find $\hat s^*=\operatorname{argmin}_{w\in M}GIC(w)$ (see the sketch after this list), where
    $GIC(w)=\min_{b\in\mathbb{R}^{p_n+1}:\,\operatorname{supp}\tilde b\subseteq w}nR_n(b)+a_n(|w|+1).$
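A sketch of the construction of the family M over a grid of penalty parameters (complementing the SS sketch in Section 5; illustrative code in which glmnet computes the whole Lasso path at once) could look as follows; GIC is then minimized over M exactly as before:

```r
library(glmnet)

ssnet_family <- function(X, Y, nlambda = 20) {
  path <- glmnet(X, Y, family = "binomial", nlambda = nlambda)
  M <- list(integer(0))                      # start from the empty model
  for (lam in path$lambda) {
    b  <- as.numeric(coef(path, s = lam))[-1]
    nz <- which(b != 0)
    if (length(nz) == 0 || length(nz) > nrow(X)) next   # keep only |s_L| <= n
    ord <- nz[order(abs(b[nz]), decreasing = TRUE)]
    M <- c(M, lapply(seq_along(ord), function(k) ord[1:k]))
  }
  unique(M)                                  # the family M of candidate models
}
```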
Instead of constructing families M i for each λ i in SSnet procedure, λ can be chosen by cross-validation using 1SE rule (see [34]) and then SS procedure is applied for such λ . We call this procedure SSCV. The last procedure considered was introduced by Fan and Tang [35] and is Lasso procedure with penalty parameter λ ^ chosen in a data-dependent way analogously to SSCV. Namely, it is the minimizer of GIC criterion with a n = log ( log n ) · log p n for which ML estimator has been replaced by Lasso estimator with penalty λ . Once β ^ L ( λ ^ L ) is calculated, then s ^ * is defined as its support. The procedure is called LFT in the sequel.
We list below versions of the above procedures along with R packages that were used to choose sequence λ 1 , , λ m and computation of Lasso estimator. The following packages were chosen based on selection performance after initial tests for each loss and procedure:
  • SSnet with logistic or quadratic loss: ncvreg;
  • SSCV or LFT with logistic or quadratic loss: glmnet; and
  • SSnet, SSCV or LFT with Huber loss (cf. [12]): hqreg.
The following functions were used to optimize R n in GIC minimization step for each loss:
  • logistic loss: glm.fit (package stats);
  • quadratic loss: .lm.fit (package stats); and
  • Huber loss: rlm (package rlm).
Before applying the investigated procedures, each column of the matrix $X=(X_1,\dots,X_n)^T$ was standardized, as the Lasso estimator $\hat\beta_L$ depends on the scaling of predictors. We set the length of the $\lambda_i$ sequence to $m=20$. Moreover, in all procedures we considered only $\lambda_i$ for which $|\hat s_L^{(i)}|\le n$ because, when $|\hat s_L^{(i)}|>n$, Lasso and ML solutions are not unique (see [32,36]). For the Huber loss, we set the parameter $\delta=1/10$ (see [12]). The number of folds in SSCV was set to $K=10$.
Each simulation run consisted of L repetitions, during which samples $X_k=(X_1^{(k)},\dots,X_n^{(k)})^T$ and $Y_k=(Y_1^{(k)},\dots,Y_n^{(k)})^T$ were generated for $k=1,\dots,L$. For the kth sample $(X_k,Y_k)$, the estimator $\hat s_k^*$ of the set of active predictors was obtained by a given procedure, and
$$\hat\beta(\hat s_k^*)=(\hat\beta_0(\hat s_k^*),\hat{\tilde\beta}(\hat s_k^*)^T)^T=\operatorname{argmin}_{b\in\mathbb{R}^{p_n+1}:\,\operatorname{supp}\tilde b\subseteq\hat s_k^*}\frac{1}{n}\sum_{i=1}^n\rho(b^TX_i^{(k)},Y_i^{(k)})$$
is the corresponding ML estimator for the kth sample. We denote by $M^{(k)}$ the family M obtained by a given procedure for the kth sample.
In our numerical experiments we have computed the following measures of selection performance which gauge co-direction of true parameter β and β ^ and the interplay between s * and s ^ * :
  • $ANGLE=\frac{1}{L}\sum_{k=1}^L\arccos|\cos(\tilde\beta,\hat{\tilde\beta}(\hat s_k^*))|$, where
    $\cos(\tilde\beta,\hat{\tilde\beta}(\hat s_k^*))=\frac{\sum_{j=1}^{p_n}\beta_j\hat\beta_j(\hat s_k^*)}{\|\tilde\beta\|_2\,\|\hat{\tilde\beta}(\hat s_k^*)\|_2}$
    and we let $\cos(\tilde\beta,\hat{\tilde\beta}(\hat s_k^*))=0$ if $\|\tilde\beta\|_2\,\|\hat{\tilde\beta}(\hat s_k^*)\|_2=0$,
  • $P_{inc}=\frac{1}{L}\sum_{k=1}^LI(s^*\in M^{(k)})$,
  • $P_{equal}=\frac{1}{L}\sum_{k=1}^LI(\hat s_k^*=s^*)$,
  • $P_{supset}=\frac{1}{L}\sum_{k=1}^LI(\hat s_k^*\supsetneq s^*)$.
Thus, ANGLE is the angle between the true parameter (with the intercept omitted) and its post-model-selection estimator, averaged over simulations, $P_{inc}$ is the fraction of simulations for which the family $M^{(k)}$ contains the true model $s^*$, and $P_{equal}$ and $P_{supset}$ are the fractions of simulations in which a given procedure chooses the true model or its proper superset, respectively.
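The measures above can be computed per simulation run as in the following R sketch (illustrative helpers, not the authors' code; beta_true and beta_hat are coefficient vectors without the intercept, s_hat and s_star are index sets):

```r
angle_measure <- function(beta_true, beta_hat) {
  den <- sqrt(sum(beta_true^2)) * sqrt(sum(beta_hat^2))
  co  <- if (den == 0) 0 else sum(beta_true * beta_hat) / den
  acos(abs(co))
}
is_equal  <- function(s_hat, s_star) setequal(s_hat, s_star)
is_supset <- function(s_hat, s_star) all(s_star %in% s_hat) && !setequal(s_hat, s_star)
# Averaging angle_measure, is_equal and is_supset (and the indicator of s_star
# belonging to the constructed family M) over the L runs gives ANGLE, P_equal,
# P_supset and P_inc, respectively.
```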

6.2. Regression Models Considered

To investigate behavior of two-step procedure under misspecification we considered two similar models with different sets of predictors. As sets of predictors differ, this results in correct specification of the first model (Model M1) and misspecification of the second (Model M2).
Namely, in Model M1, we generated n observations ( X i , Y i ) R p + 1 × { 0 , 1 } for i = 1 , , n such that:
$$X_{i0}=1,\quad X_{i1}=Z_{i1},\quad X_{i2}=Z_{i2},\quad X_{ij}=Z_{i,j-7}\ \text{for}\ j=10,\dots,p,$$
$$X_{i3}=X_{i1}^2,\quad X_{i4}=X_{i2}^2,\quad X_{i5}=X_{i1}X_{i2},\quad X_{i6}=X_{i1}^2X_{i2},\quad X_{i7}=X_{i1}X_{i2}^2,\quad X_{i8}=X_{i1}^3,\quad X_{i9}=X_{i2}^3,$$
where $Z_i=(Z_{i1},\dots,Z_{ip})^T\sim N_p(0_p,\Sigma)$, $\Sigma=[\rho^{|i-j|}]_{i,j=1,\dots,p}$ and $\rho\in(-1,1)$. We consider the response function $q(x)=q_L(x^3)$ for $x\in\mathbb{R}$, $s=\{1,2\}$ and $\beta_s=(1,1)^T$. Thus,
$$P(Y_i=1\,|\,X_i=x_i)=q(\beta_s^Tx_{i,s})=q(x_{i1}+x_{i2})=q_L((x_{i1}+x_{i2})^3)=q_L(x_{i1}^3+x_{i2}^3+3x_{i1}^2x_{i2}+3x_{i1}x_{i2}^2)=q_L(3x_{i6}+3x_{i7}+x_{i8}+x_{i9}).$$
We observe that the last equality implies that the above binary model is correctly specified with respect to the family of fitted logistic models and that $X_6$, $X_7$, $X_8$ and $X_9$ are the four active predictors, whereas the remaining ones play no role in the prediction of Y. Hence, $s^*=\{6,7,8,9\}$ and $\beta^*_{s^*}=(3,3,1,1)^T$ are, respectively, the set of indices of active predictors and the non-zero coefficients of the projection onto the family of logistic models.
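One dataset from Model M1 can be generated as in the R sketch below (illustrative code; mvrnorm is from the MASS package, and Model M2 of the next paragraph uses the raw normal vector Z itself as predictors, with the same response). The correlation value is one arbitrary point of the grid used in the experiments.

```r
library(MASS)
set.seed(1)
n <- 500; p <- 150; rho <- 0.3
Sigma <- rho^abs(outer(1:p, 1:p, "-"))           # AR(1)-type covariance
Z <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
X <- cbind(Z[, 1], Z[, 2],                        # X1, X2
           Z[, 1]^2, Z[, 2]^2, Z[, 1] * Z[, 2],   # X3, X4, X5
           Z[, 1]^2 * Z[, 2], Z[, 1] * Z[, 2]^2,  # X6, X7
           Z[, 1]^3, Z[, 2]^3,                    # X8, X9
           Z[, 3:(p - 7)])                        # X10, ..., Xp = Z3, ..., Z_{p-7}
Y <- rbinom(n, 1, plogis((Z[, 1] + Z[, 2])^3))    # q(x) = q_L(x^3), beta_s = (1, 1)
```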
We considered the following parameters in the numerical experiments: $n=500$, $p=150$, $\rho\in\{-0.9+0.15\cdot k:k=0,1,\dots,12\}$, and $L=500$ (the number of generated datasets for each combination of parameters). We investigated the procedures SSnet, SSCV, and LFT using logistic, quadratic, and Huber (cf. [12]) loss functions. For the procedures SSnet and SSCV, we used GIC penalties with:
  • a n = log n (BIC); and
  • a n = log n + 2 log p n (EBIC1).
In Model M2, we generated n observations $(X_i,Y_i)\in\mathbb{R}^{p+1}\times\{0,1\}$ for $i=1,\dots,n$ such that $X_i=(X_{i0},X_{i1},\dots,X_{ip})^T$ with $X_{i0}=1$ and $(X_{i1},\dots,X_{ip})^T\sim N_p(0_p,\Sigma)$, $\Sigma=[\rho^{|i-j|}]_{i,j=1,\dots,p}$ and $\rho\in(-1,1)$. The response function is $q(x)=q_L(x^3)$ for $x\in\mathbb{R}$, $s=\{1,2\}$ and $\beta_s=(1,1)^T$. This means that:
$$P(Y_i=1\,|\,X_i=x_i)=q(\beta_s^Tx_{i,s})=q(x_{i1}+x_{i2})=q_L((x_{i1}+x_{i2})^3).$$
This model, in comparison to Model M1, does not contain monomials of $X_{i1}$ and $X_{i2}$ of degree higher than 1 in its set of predictors. We observe that this binary model is misspecified with respect to the fitted family of logistic models, because $q(x_{i1}+x_{i2})\neq q_L(\beta^Tx_i)$ for any $\beta\in\mathbb{R}^{p+1}$. However, in this case, the linear regressions condition in Equation (32) is satisfied for X, as it follows the normal distribution (see [5,7]). Hence, in view of Proposition 3.8 in [6], we have $s^*_{log}=\{1,2\}$ and $\beta^*_{log,s^*_{log}}=\eta(1,1)^T$ for some $\eta>0$. Parameters n, p, ρ as well as L were chosen as for Model M1.

6.3. Results for Models M1 and M2

We first discuss the behavior of $P_{inc}$, $P_{equal}$ and $P_{supset}$ for the considered procedures. We observe that the values of $P_{inc}$ for SSCV and SSnet are close to 1 for low correlations in Model M2 for every tested loss (see Figure 1). In Model M1, $P_{inc}$ attains the largest values for the SSnet procedure and the logistic loss for low correlations, which is because in most cases the corresponding family M is the largest among the families created by the considered procedures. $P_{inc}$ is close to 0 in Model M1 for the quadratic and Huber losses, which results in low values of the remaining indices. This may be due to strong dependences between predictors in Model M1; note that we have, e.g., $\operatorname{Cor}(X_{i1},X_{i8})=3/\sqrt{15}\approx 0.77$. It is seen that in Model M1 the inclusion probability $P_{inc}$ is much lower than in Model M2 (except for negative correlations). It is also seen that $P_{inc}$ for SSCV is larger than for LFT, and that LFT fails with respect to $P_{inc}$ in M1.
In Model M1, the largest values of $P_{equal}$ are attained for SSnet with the BIC penalty; the second best is SSCV with the EBIC1 penalty (see Figure 2). In Model M2, $P_{equal}$ is close to 1 for SSnet and SSCV with the EBIC1 penalty and is much larger than $P_{equal}$ for the corresponding versions using the BIC penalty. We also note that the choice of loss is relevant only for larger correlations. These results confirm the theoretical result of Theorem 2.1 in [5], which shows that collinearity holds for a broad class of loss functions. We observe also that, although in Model M2 the remaining procedures do not select $s^*$ with high probability, they select its superset, which is indicated by the values of $P_{supset}$ (see Figure 3). This analysis is confirmed by an analysis of the ANGLE measure (see Figure 4), which attains values close to 0 when $P_{supset}$ is close to 1. Low values of the ANGLE measure mean that the estimated vector $\hat{\tilde\beta}(\hat s_k^*)$ is approximately proportional to $\tilde\beta$, which is the case for Model M2, where normal predictors satisfy the linear regressions condition. Note that the angles between $\hat{\tilde\beta}(\hat s_k^*)$ and $\tilde\beta^*$ in Model M1 differ significantly from zero even though Model M1 is well specified. In addition, for the best performing procedures in both models and any loss considered, $P_{equal}$ is much larger in Model M2 than in Model M1, even though the latter is correctly specified. This shows that choosing a simple misspecified model which retains crucial characteristics of the well specified large model instead of the latter might be beneficial.
In Model M1, procedures with BIC penalty perform better than those with EBIC1 penalty; however, the gain for P e q u a l is much smaller than the gain when using EBIC1 in Model M2. LFT procedure performs poorly in Model M1 and reasonably well in Model M2. The overall winner in both models is SSnet. SSCV performs only slightly worse than SSnet in Model M2 but performs significantly worse in Model M1.
Analysis of the computing times of the first and second stages of each procedure shows that the SSnet procedure creates large families M, and GIC minimization becomes computationally intensive. We also observe that the first stage of SSCV is more time-consuming than that of SSnet, which is caused by the multiple fitting of Lasso in cross-validation. However, SSCV is much faster than SSnet in the second stage.
We conclude that in the considered experiments SSnet with EBIC1 penalty works the best in most cases; however, even for the winning procedure, strong dependence of predictors results in deterioration of its performance. It is also clear from our experiments that a choice of GIC penalty is crucial for its performance. Modification of SS procedure which would perform satisfactorily for large correlations is still an open problem.

7. Discussion

In this paper, we study the problem of selecting the set of active variables in a binary regression model when the number of all predictors p is much larger than the number of observations n and the active predictors are sparse among all predictors, i.e., their number is significantly smaller than p. We consider a general binary model and a fit based on minimization of the empirical risk corresponding to a general loss function. This scenario encompasses the case, common in practice, when the underlying semi-parametric model is misspecified, i.e., the assumed response function differs from the true one. For random predictors, we show that in such a case the two-step procedure based on Lasso consistently estimates the support of the pseudo-true vector β*. Under the linear regressions condition and a semi-parametric model, this implies consistent recovery of a subset of active predictors. This partly explains why selection procedures perform satisfactorily even when the fitted model is wrong. We show that, by using the two-step procedure, we can successfully reduce the dimension of the model chosen by Lasso. Moreover, for the two-step procedure with random predictors, we do not require the restrictive conditions on the experimental matrix, such as the irrepresentable condition, that are needed for Lasso support consistency with deterministic predictors. Our experiments show satisfactory behavior of the proposed SSnet procedure with the EBIC1 penalty.
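To fix ideas, a minimal sketch of this two-step logic is given below (Python). It is an illustration under several assumptions rather than the implementation used in the experiments: the Lasso step is carried out here by L1-penalized logistic regression from scikit-learn, the screened predictors are ordered by the magnitude of their Lasso coefficients (one possible ordering), the refit is approximated by a very weakly penalized fit, the GIC is taken as twice the refitted negative log-likelihood plus a penalty times the model size, and the function name and the mapping between lam and the parameter C are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def ss_select(X, y, lam, penalty):
    """Schematic two-step Screening-Selection (SS) idea:
    (1) screen and order predictors with L1-penalized logistic regression (Lasso),
    (2) minimize a GIC over the resulting nested family of refitted models."""
    n, p = X.shape
    # Step 1: Lasso screening; C is the inverse regularization strength.
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (n * lam))
    lasso.fit(X, y)
    coef = lasso.coef_.ravel()
    screened = np.flatnonzero(coef)
    order = screened[np.argsort(-np.abs(coef[screened]))]  # largest |coefficient| first

    # Step 2: GIC minimization over the nested family {order[:k], k = 1, 2, ...}.
    best_subset, best_gic = np.array([], dtype=int), np.inf
    for k in range(1, len(order) + 1):
        subset = np.sort(order[:k])
        refit = LogisticRegression(C=1e6, solver="lbfgs", max_iter=1000)  # ~ unpenalized refit
        refit.fit(X[:, subset], y)
        nll = n * log_loss(y, refit.predict_proba(X[:, subset])[:, 1])    # n * empirical risk
        gic = 2.0 * nll + penalty * k   # assumed GIC form: 2 n R_n + a_n * |s|
        if gic < best_gic:
            best_subset, best_gic = subset, gic
    return best_subset

# Assumed penalty forms: BIC-type a_n = log(n); EBIC-type a_n = log(n) + 2*gamma*log(p).
```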
Future research directions include studying the performance of the SS procedure without the subgaussianity assumption and, of practical importance, an automatic choice of the penalty in the GIC criterion. Moreover, finding a modification of the SS procedure that performs satisfactorily for large correlations remains an open challenge. It would also be of interest to find conditions weaker than Equation (32) under which β and β* remain collinear (see [18] for a different angle on this problem).

Author Contributions

Both authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

The research of the second author was partially supported by the Polish National Science Center, grant 2015/17/B/ST6/01878.

Acknowledgments

The comments by the two referees, which helped to improve the presentation of the original version of the manuscript, are gratefully acknowledged.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Lemma 1.
Proof. 
Observe first that the function $R_n$ is convex, as $\rho$ is convex. Moreover, from the definition of $\hat{\beta}_L$, we obtain the inequality:
$$ W_n(\hat{\beta}_L) = R_n(\hat{\beta}_L) - R_n(\beta^*) \le \lambda \left( \|\beta^*\|_1 - \|\hat{\beta}_L\|_1 \right). $$
Note that $v - \beta^* \in B_1(r)$, as we have:
$$ \|v - \beta^*\|_1 = \frac{\|\hat{\beta}_L - \beta^*\|_1}{r + \|\hat{\beta}_L - \beta^*\|_1} \cdot r \le r. $$
By the definition of $W_n$, the convexity of $R_n$, Equation (A2) and the definition of $S$, we have:
$$ W(v) = W(v) - W_n(v) + R_n(v) - R_n(\beta^*) \le W(v) - W_n(v) + u \left( R_n(\hat{\beta}_L) - R_n(\beta^*) \right) \le S(r) + u\, W_n(\hat{\beta}_L). $$
From the convexity of the $\ell_1$ norm, Equations (A1) and (A3), the equality $\|\beta^*\|_1 = \|\beta^*_{s^*}\|_1$, and the triangle inequality, it follows that:
$$ W(v) + \lambda \|v\|_1 \le W(v) + \lambda u \|\hat{\beta}_L\|_1 + \lambda (1 - u) \|\beta^*\|_1 \le S(r) + u\, W_n(\hat{\beta}_L) + u \lambda \left( \|\hat{\beta}_L\|_1 - \|\beta^*\|_1 \right) + \lambda \|\beta^*\|_1 \le S(r) + \lambda \|\beta^*\|_1 \le S(r) + \lambda \|\beta^*_{s^*} - v_{s^*}\|_1 + \lambda \|v_{s^*}\|_1. $$
Hence,
$$ W(v) + \lambda \|v - \beta^*\|_1 = \left( W(v) + \lambda \|v\|_1 \right) + \lambda \left( \|v - \beta^*\|_1 - \|v\|_1 \right) \le S(r) + \lambda \|\beta^*_{s^*} - v_{s^*}\|_1 + \lambda \|v_{s^*}\|_1 + \lambda \left( \|v - \beta^*\|_1 - \|v\|_1 \right) = S(r) + 2 \lambda \|\beta^*_{s^*} - v_{s^*}\|_1. $$
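The last equality above can be spelled out using the fact that $\beta^*$ vanishes outside $s^*$ (so that $\|\beta^*\|_1 = \|\beta^*_{s^*}\|_1$):
$$ \|v - \beta^*\|_1 - \|v\|_1 = \left( \|v_{s^*} - \beta^*_{s^*}\|_1 + \|v_{(s^*)^c}\|_1 \right) - \left( \|v_{s^*}\|_1 + \|v_{(s^*)^c}\|_1 \right) = \|v_{s^*} - \beta^*_{s^*}\|_1 - \|v_{s^*}\|_1, $$
and adding $\lambda \|\beta^*_{s^*} - v_{s^*}\|_1 + \lambda \|v_{s^*}\|_1$ to this difference gives $2 \lambda \|\beta^*_{s^*} - v_{s^*}\|_1$.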
 □
We now prove Lemma A1, which is needed in the proof of Lemma 2 below.
Lemma A1.
Assume that $S \in \mathrm{Subg}(\sigma^2)$ and $T$ is a random variable such that $|T| \le M$, where $M$ is a positive constant, and that $S$ and $T$ are independent. Then, $S T \in \mathrm{Subg}(M^2 \sigma^2)$.
Proof. 
Observe that:
$$ \mathrm{E}\, e^{t S T} = \mathrm{E}\left( \mathrm{E}\left( e^{t S T} \mid T \right) \right) \le \mathrm{E}\, e^{\frac{t^2 T^2 \sigma^2}{2}} \le e^{\frac{t^2 M^2 \sigma^2}{2}}. $$
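The first inequality can be spelled out: conditionally on $T = \tau$, the exponent contains the fixed number $t \tau$, so the independence of $S$ and $T$ and $S \in \mathrm{Subg}(\sigma^2)$ give
$$ \mathrm{E}\left( e^{t S T} \mid T = \tau \right) = \mathrm{E}\, e^{(t \tau) S} \le e^{\frac{(t \tau)^2 \sigma^2}{2}}, $$
and the second inequality follows from $|T| \le M$, whence $T^2 \le M^2$.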
 □
Proof of Lemma 2.
Proof. 
From the Chebyshev inequality (first inequality below), the symmetrization inequality (see Lemma 2.3.1 of [29]) and the Talagrand–Ledoux inequality ([30], Theorem 4.12), we have, for $t > 0$ and $(\varepsilon_i)_{i=1,\dots,n}$ being Rademacher variables independent of $(X_i)_{i=1,\dots,n}$:
$$ \mathrm{P}(S(r) > t) \le \frac{\mathrm{E}\, S(r)}{t} \le \frac{2}{t n}\, \mathrm{E} \sup_{b \in \mathbb{R}^{p_n}:\ b - \beta^* \in B_1(r)} \left| \sum_{i=1}^{n} \varepsilon_i \left( \rho(X_i^T b, Y_i) - \rho(X_i^T \beta^*, Y_i) \right) \right| \le \frac{4 L}{t n}\, \mathrm{E} \sup_{b \in \mathbb{R}^{p_n}:\ b - \beta^* \in B_1(r)} \left| \sum_{i=1}^{n} \varepsilon_i X_i^T (b - \beta^*) \right|. $$
We observe that $\varepsilon_i X_{ij} \in \mathrm{Subg}(\sigma_{jn}^2)$ in view of Lemma A1. Hence, using independence, we obtain $\sum_{i=1}^{n} \varepsilon_i X_{ij} \in \mathrm{Subg}(n \sigma_{jn}^2)$ and thus $\sum_{i=1}^{n} \varepsilon_i X_{ij} \in \mathrm{Subg}(n s_n^2)$. Applying the Hölder inequality and the following inequality (see Lemma 2.2 of [37]):
$$ \mathrm{E} \max_{j \in \{1,\dots,p_n\}} \left| \sum_{i=1}^{n} \varepsilon_i X_{ij} \right| \le \sqrt{2 n s_n^2 \ln(2 p_n)} \le 2 s_n \sqrt{n \ln(p_n \vee 2)}, $$
we have:
$$ \frac{4 L}{t n}\, \mathrm{E} \sup_{b \in \mathbb{R}^{p_n}:\ b - \beta^* \in B_1(r)} \left| \sum_{i=1}^{n} \varepsilon_i X_i^T (b - \beta^*) \right| \le \frac{4 L r}{t}\, \mathrm{E} \max_{j \in \{1,\dots,p_n\}} \left| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i X_{ij} \right| \le \frac{8 L r\, s_n \sqrt{\ln(p_n \vee 2)}}{t \sqrt{n}}. $$
From this, Part 1 follows. In the proofs of Parts 2 and 3, the first inequalities are the same as in Equation (A5), with the supremums taken over the corresponding sets. Using the Cauchy–Schwarz inequality, the inequality $\|v\|_2 \le \sqrt{|v|}\, \|v\|_{\infty}$, the inequality $\|v_{\pi}\|_{\infty} \le \|v\|_{\infty}$ for $\pi \subseteq \{1,\dots,p_n\}$, and Equation (A6) yields:
$$ \mathrm{P}(S_1(r) \ge t) \le \frac{4 L}{n t}\, \mathrm{E} \sup_{b \in D_1:\ b - \beta^* \in B_2(r)} \left| \sum_{i=1}^{n} \varepsilon_i X_i^T (b - \beta^*) \right| \le \frac{4 L r}{n t}\, \mathrm{E} \max_{\pi \subseteq \{1,\dots,p_n\},\ |\pi| \le k_n} \left\| \sum_{i=1}^{n} \varepsilon_i X_{i,\pi} \right\|_2 \le \frac{4 L r}{n t}\, \mathrm{E} \max_{\pi \subseteq \{1,\dots,p_n\},\ |\pi| \le k_n} \sqrt{|\pi|} \left\| \sum_{i=1}^{n} \varepsilon_i X_{i,\pi} \right\|_{\infty} \le \frac{4 L r \sqrt{k_n}}{n t}\, \mathrm{E} \left\| \sum_{i=1}^{n} \varepsilon_i X_i \right\|_{\infty} \le \frac{8 L r}{t} \sqrt{\frac{k_n}{n}}\, s_n \sqrt{\ln(p_n \vee 2)}. $$
Similarly, for $S_2(r)$, using the Cauchy–Schwarz inequality, the inequality $\|v_{\pi}\|_2 \le \|v_{s^*}\|_2$, which is valid for $\pi \subseteq s^*$, the definition of the $\ell_2$ norm, and the inequality $\mathrm{E}|Z| \le \sqrt{\mathrm{E} Z^2} \le \sigma$ for $Z \in \mathrm{Subg}(\sigma^2)$, we obtain:
$$ \mathrm{P}(S_2(r) \ge t) \le \frac{4 L}{n t}\, \mathrm{E} \sup_{b \in D_2:\ b - \beta^* \in B_2(r)} \left| \sum_{i=1}^{n} \varepsilon_i X_i^T (b - \beta^*) \right| \le \frac{4 L r}{n t}\, \mathrm{E} \max_{\pi \subseteq s^*} \left\| \sum_{i=1}^{n} \varepsilon_i X_{i,\pi} \right\|_2 \le \frac{4 L r}{n t}\, \mathrm{E} \left\| \sum_{i=1}^{n} \varepsilon_i X_{i,s^*} \right\|_2 \le \frac{4 L r}{n t} \sqrt{ \mathrm{E} \left\| \sum_{i=1}^{n} \varepsilon_i X_{i,s^*} \right\|_2^2 } = \frac{4 L r}{n t} \sqrt{ \sum_{j \in s^*} \mathrm{E} \left( \sum_{i=1}^{n} \varepsilon_i X_{ij} \right)^2 } \le \frac{4 L r}{\sqrt{n}\, t} \sqrt{|s^*|}\, s_n. $$
 □
Proof of Lemma 3.
Proof. 
Let $u$ and $v$ be defined as in Lemma 1. Observe that $\|v - \beta^*\|_1 \le r/2$ is equivalent to $\|\hat{\beta}_L - \beta^*\|_1 \le r$, as the function $f(x) = r x/(x + r)$ is increasing, $f(r) = r/2$ and $f(\|\hat{\beta}_L - \beta^*\|_1) = \|v - \beta^*\|_1$. Let $C = 1/(4 + \varepsilon)$. We consider two cases:
(i) $\|v_{s^*} - \beta^*_{s^*}\|_1 \le C r$.
In this case, from the basic inequality (Lemma 1), we have:
$$ \|v - \beta^*\|_1 \le \lambda^{-1} \left( W(v) + \lambda \|v - \beta^*\|_1 \right) \le \lambda^{-1} S(r) + 2 \|v_{s^*} - \beta^*_{s^*}\|_1 \le \bar{C} r + 2 C r = \frac{r}{2}. $$
(ii) $\|v_{s^*} - \beta^*_{s^*}\|_1 > C r$.
Note that $\|v_{(s^*)^c}\|_1 < (1 - C) r$, since otherwise we would have $\|v - \beta^*\|_1 > r$, which contradicts Equation (A2) in the proof of Lemma 1. Now, we observe that $v - \beta^* \in C_{\varepsilon}$, as we have, from the definition of $C$ and the assumption of this case:
$$ \|v_{(s^*)^c}\|_1 < (1 - C) r = (3 + \varepsilon) C r < (3 + \varepsilon) \|v_{s^*} - \beta^*_{s^*}\|_1. $$
By the inequality between the $\ell_1$ and $\ell_2$ norms, the definition of $\kappa_H(\varepsilon)$, the inequality $c a^2/4 + b^2/c \ge a b$, and the margin Condition (MC) (which holds because $v - \beta^* \in B_1(r) \subseteq B_1(\delta)$ in view of Equation (A2)), we conclude that:
$$ \|v_{s^*} - \beta^*_{s^*}\|_1 \le \sqrt{|s^*|}\, \|v_{s^*} - \beta^*_{s^*}\|_2 \le \sqrt{|s^*|}\, \|v - \beta^*\|_2 \le \sqrt{ \frac{|s^*|\, (v - \beta^*)^T H (v - \beta^*)}{\kappa_H(\varepsilon)} } \le \frac{\vartheta (v - \beta^*)^T H (v - \beta^*)}{4 \lambda} + \frac{|s^*| \lambda}{\vartheta \kappa_H(\varepsilon)} \le \frac{W(v)}{2 \lambda} + \frac{|s^*| \lambda}{\vartheta \kappa_H(\varepsilon)}. $$
Hence, from the basic inequality (Lemma 1) and the inequality above, it follows that:
$$ W(v) + \lambda \|v - \beta^*\|_1 \le S(r) + 2 \lambda \|v_{s^*} - \beta^*_{s^*}\|_1 \le S(r) + W(v) + \frac{2 |s^*| \lambda^2}{\vartheta \kappa_H(\varepsilon)}. $$
Subtracting $W(v)$ from both sides of the above inequality and using the assumption on $S$, the bound on $|s^*|$, and the definition of $\tilde{C}$ yields:
$$ \|v - \beta^*\|_1 \le \frac{S(r)}{\lambda} + \frac{2 |s^*| \lambda}{\vartheta \kappa_H(\varepsilon)} \le \bar{C} r + \frac{2 |s^*| \lambda}{\vartheta \kappa_H(\varepsilon)} \le (\bar{C} + \tilde{C}) r = \frac{r}{2}. $$
 □
Proof of Remark 1.
Proof. 
The condition $\liminf_{n} \frac{D_n a_n}{k_n \log(2 p_n)} > 1$ is equivalent to the existence of some $u > 0$ such that for almost all $n$ we have:
$$ D_n a_n - (1 + u) k_n \log(2 p_n) > 0. $$
(1) We observe that, if
$$ A a_n - (1 + u) k_n \log(2 p_n) > 0, $$
then the above condition is satisfied. For BIC, we have:
$$ A \log n > (1 + u) k_n \log(2 p_n) > 0, $$
which is equivalent to the condition (1) of the Remark.
(2) We observe that using the inequalities $k_n \le C$, $2 A \gamma - (1 + u) C \ge 0$ and $p_n \ge 1$ yields, for $n > 2^{(1+u)C/A}$:
$$ A (\log n + 2 \gamma \log p_n) - (1 + u) k_n \log(2 p_n) \ge A (\log n + 2 \gamma \log p_n) - (1 + u) C \log(2 p_n) = \left( 2 A \gamma - (1 + u) C \right) \log p_n + A \log n - (1 + u) C \log 2 \ge A \log n - (1 + u) C \log 2 > 0. $$
(3) In this case, we check similarly as in (2) that
$$ A (\log n + 2 \gamma \log p_n) - (1 + u) k_n \log(2 p_n) \ge A (\log n + 2 \gamma \log p_n) - (1 + u) C \log(2 p_n) = \left( 2 A \gamma - (1 + u) C \right) \log p_n + A \log n - (1 + u) C \log 2 > 0. $$
 □

References

1. Cover, T.; Thomas, J. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 2006.
2. Bühlmann, P.; van de Geer, S. Statistics for High-dimensional Data; Springer: New York, NY, USA, 2011.
3. van de Geer, S. Estimation and Testing Under Sparsity; Lecture Notes in Mathematics; Springer: New York, NY, USA, 2009.
4. Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity; Springer: New York, NY, USA, 2015.
5. Li, K.; Duan, N. Regression analysis under link violation. Ann. Stat. 1989, 17, 1009–1052.
6. Kubkowski, M.; Mielniczuk, J. Active set of predictors for misspecified logistic regression. Statistics 2017, 51, 1023–1045.
7. Kubkowski, M.; Mielniczuk, J. Projections of a general binary model on logistic regression. Linear Algebra Appl. 2018, 536, 152–173.
8. Kubkowski, M. Misspecification of Binary Regression Model: Properties and Inferential Procedures. Ph.D. Thesis, Warsaw University of Technology, Warsaw, Poland, 2019.
9. Lu, W.; Goldberg, Y.; Fine, J. On the robustness of the adaptive lasso to model misspecification. Biometrika 2012, 99, 717–731.
10. Brillinger, D. A generalized linear model with 'Gaussian' regressor variables. In A Festschrift for Erich Lehmann; Wadsworth International Group: Belmont, CA, USA, 1982; pp. 97–113.
11. Ruud, P. Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models. Econometrica 1983, 51, 225–228.
12. Yi, C.; Huang, J. Semismooth Newton coordinate descent algorithm for elastic-net penalized Huber loss regression and quantile regression. J. Comput. Graph. Stat. 2017, 26, 547–557.
13. White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25.
14. Vuong, Q. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 1989, 57, 307–333.
15. Bickel, P.; Ritov, Y.; Tsybakov, A. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732.
16. Negahban, S.N.; Ravikumar, P.; Wainwright, M.J.; Yu, B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci. 2012, 27, 538–557.
17. Pokarowski, P.; Mielniczuk, J. Combined ℓ1 and greedy ℓ0 penalized least squares for linear model selection. J. Mach. Learn. Res. 2015, 16, 961–992.
18. Hall, P.; Li, K.C. On almost linearity of low dimensional projections from high dimensional data. Ann. Stat. 1993, 21, 867–889.
19. Chen, J.; Chen, Z. Extended Bayesian information criterion for model selection with large model spaces. Biometrika 2008, 95, 759–771.
20. Chen, J.; Chen, Z. Extended BIC for small-n-large-p sparse GLM. Stat. Sin. 2012, 22, 555–574.
21. Mielniczuk, J.; Szymanowski, H. Selection consistency of Generalized Information Criterion for sparse logistic model. In Stochastic Models, Statistics and Their Applications; Steland, A., Rafajłowicz, E., Szajowski, K., Eds.; Springer: Cham, Switzerland, 2015; Volume 122, pp. 111–118.
22. Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing, Theory and Applications; Cambridge University Press: Cambridge, UK, 2012; pp. 210–268.
23. Fan, J.; Xue, L.; Zou, H. Supplement to "Strong Oracle Optimality of Folded Concave Penalized Estimation". 2014. Available online: NIHMS649192-supplement-suppl.pdf (accessed on 25 January 2020).
24. Fan, J.; Xue, L.; Zou, H. Strong oracle optimality of folded concave penalized estimation. Ann. Stat. 2014, 43, 819–849.
25. Bach, F. Self-concordant analysis for logistic regression. Electron. J. Stat. 2010, 4, 384–414.
26. Akaike, H. Statistical predictor identification. Ann. Inst. Stat. Math. 1970, 22, 203–217.
27. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
28. Kim, Y.; Jeon, J. Consistent model selection criteria for quadratically supported risks. Ann. Stat. 2016, 44, 2467–2496.
29. van der Vaart, A.W.; Wellner, J.A. Weak Convergence and Empirical Processes with Applications to Statistics; Springer: New York, NY, USA, 1996.
30. Ledoux, M.; Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes; Springer: New York, NY, USA, 1991.
31. Huang, J.; Ma, S.; Zhang, C. Adaptive Lasso for sparse high-dimensional regression models. Stat. Sin. 2008, 18, 1603–1618.
32. Tibshirani, R. The lasso problem and uniqueness. Electron. J. Stat. 2013, 7, 1456–1490.
33. Zhou, S. Thresholded Lasso for high dimensional variable selection and statistical estimation. arXiv 2010, arXiv:1002.1583.
34. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22.
35. Fan, Y.; Tang, C. Tuning parameter selection in high dimensional penalized likelihood. J. R. Stat. Soc. Ser. B 2013, 75, 531–552.
36. Rosset, S.; Zhu, J.; Hastie, T. Boosting as a regularized path to a maximum margin classifier. J. Mach. Learn. Res. 2004, 5, 941–973.
37. Devroye, L.; Lugosi, G. Combinatorial Methods in Density Estimation; Springer Science & Business Media: New York, NY, USA, 2012.
Figure 1. $P_{inc}$ for Models M1 and M2.
Figure 2. $P_{equal}$ for Models M1 and M2.
Figure 3. $P_{supset}$ for Models M1 and M2.
Figure 4. $ANGLE$ for Models M1 and M2.
