**Sparse Multicategory Generalized Distance Weighted Discrimination in Ultra-High Dimensions**

**Tong Su <sup>1</sup>, Yafei Wang <sup>2</sup>, Yi Liu <sup>2</sup>, William G. Branton <sup>3</sup>, Eugene Asahchop <sup>3</sup>, Christopher Power <sup>3</sup>, Bei Jiang <sup>2</sup>, Linglong Kong <sup>2,\*</sup> and Niansheng Tang <sup>1,\*</sup>**


Received: 30 September 2020; Accepted: 2 November 2020; Published: 5 November 2020

**Abstract:** Distance weighted discrimination (DWD) is an appealing classification method that is capable of overcoming data piling problems in high-dimensional settings. Variable selection in multicategory classification poses great challenges, especially when various sparsity structures are assumed in these high-dimensional settings. In this paper, we propose a multicategory generalized DWD (MgDWD) method that maintains intrinsic variable group structures during selection using a sparse group lasso penalty. Theoretically, we derive minimizer uniqueness for the penalized MgDWD loss function and consistency properties for the proposed classifier. We further develop an efficient algorithm based on the proximal operator to solve the optimization problem. The performance of MgDWD is evaluated using finite sample simulations and miRNA data from an HIV study.

**Keywords:** high dimension; multicategory classification; DWD; sparse group lasso; *L*2-consistency; proximal algorithm

#### **1. Introduction**

Classification problems appear in diverse practical applications, such as spam e-mail classification, disease diagnosis and drug discovery, among many others (e.g., [1–3]). In these classification problems, the goal is to predict class labels based on a given set of variables. Recent research has focused extensively on linear classification: see [4,5] for comprehensive introductions. Among many linear classification methods, support vector machines (SVMs) (see [6,7]) and distance-weighted discrimination (DWD) (see [8–10]) are two commonly used large-margin based classification methods.

Owing to the recent advent of new technologies for data acquisition and storage, classification with high dimensional features, i.e., a large number of variables, has become a ubiquitous problem in both theoretical and applied scientific studies. Typically, only a small number of instances are available, a setting we refer to as high-dimensional, low-sample size (HDLSS), as in [11]. In the HDLSS setting, a so-called "data-piling" phenomenon is observed in [8] for SVMs, occurring when projections of many training instances onto a vector normal to the separating hyperplane are nearly identical, suggesting severe overfitting. DWD was originally proposed to overcome data-piling in the HDLSS setting. In binary classification problems, linear SVMs seek a hyperplane maximizing the smallest margin for all data points, while DWD seeks a hyperplane minimizing the sum of inverse margins over all data points. Reference [8] suggests replacing the inverse margins by the *q*-th power of the inverse margins in a generalized DWD method; see [12] for a detailed description.

Formally, for a training data set $\{(y\_i, X\_i)\}\_{i=1}^{N}$ of $N$ observations, where $X\_i \in \mathbb{R}^p$ and $y\_i \in \{-1, 1\}$, binary generalized linear DWD seeks a proper separating hyperplane $\{X : a + X^{\top}\mathbf{b} = 0\}$ through the optimization problem

$$\begin{aligned} \underset{a,\mathbf{b}}{\operatorname{arg\,min}} \; & \sum\_{i=1}^{N} \frac{1}{d\_i^q} \\ \text{s.t. } \; & d\_i = y\_i \left( a + \mathbf{X}\_i^{\top} \mathbf{b} \right) + \eta\_i \ge 0, \; \forall i, \\ & \eta\_i \ge 0, \; \forall i, \quad \sum\_i \eta\_i \le c, \\ & \|\mathbf{b}\|\_2^2 = 1, \end{aligned} \tag{1}$$

where *a* and *b* are the intercept and slope parameters, respectively. The slack variable *η<sup>i</sup>* is introduced to ensure that the corresponding margin *d<sup>i</sup>* is non-negative and the constant *c* > 0 is a tuning parameter to control the overlap between classes. Problem (1) can also be written in a loss-plus-penalty form (e.g., [12]) as

$$\left(\hat{a},\hat{\mathbf{b}}\right) = \underset{a,\mathbf{b}}{\operatorname{argmin}} \left[\frac{1}{N} \sum\_{i=1}^{N} \phi\_{q} \left\{ y\_{i} \left(a + \mathbf{X}\_{i}^{\top} \mathbf{b}\right) \right\} + \lambda \|\mathbf{b}\|\_{2}^{2} \right], \tag{2}$$

where

$$\phi\_q(u) = \begin{cases} 1 - u, & \text{if } u \le Q, \\ \varphi\_q(u), & \text{if } u > Q, \end{cases} \tag{3}$$

with $Q = q/(q+1)$, $q > 0$, and $\varphi\_q(u) = (1 - Q)(Qu^{-1})^{q}$. When *q* = 1, (1) becomes the standard DWD problem in [8], while problem (2) appears in [9,13].
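
To make the loss concrete, the following is a minimal NumPy sketch of the loss `phi_q` and its first derivative; the derivative expression follows the expansion of the loss used in the proof of Lemma A4 (Appendix A), and all function names are illustrative.

```python
import numpy as np

def phi_q(u, q=1.0):
    """Generalized DWD loss of Eq. (3): linear for u <= Q, power decay for u > Q."""
    Q = q / (q + 1.0)
    u = np.asarray(u, dtype=float)
    return np.where(u <= Q, 1.0 - u, (1.0 - Q) * (Q / np.maximum(u, Q)) ** q)

def dphi_q(u, q=1.0):
    """First derivative of the loss: -1 on the linear piece, -(Q/u)^(q+1) beyond Q,
    matching the expansion of the loss used in the proof of Lemma A4."""
    Q = q / (q + 1.0)
    u = np.asarray(u, dtype=float)
    return np.where(u <= Q, -1.0, -(Q / np.maximum(u, Q)) ** (q + 1.0))
```

At *q* = 1, `Q` equals 1/2 and `phi_q` reduces to the standard DWD loss of [8].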

The binary classification problem (1) is well studied. However, in many applications such as image classification [1], cancer diagnosis [2] and speech recognition [3], to name a few, problems with more than two categories are commonplace. To solve these multicategory problems with the DWD classifier, approaches based on either formulation (1) or (2) are common. One common strategy is to extend problem (1) to multiple classes by solving a series of binary problems in a one-versus-one (OVO) or one-versus-rest (OVR) method (e.g., [14]). Instead of reducing the multicategory problem to a binary one, another strategy based on problem (1) considers all classes at once. As shown in [14], this approach generally works better than the OVO and OVR methods. Based on an extension of problem (2), [15] proposes multicategory DWD, written in a loss-plus-penalty form as

$$\begin{aligned} \min\_{a\_k, \mathbf{b}\_k} \; &\frac{1}{N} \sum\_{i=1}^N \phi\_q \left( a\_{y\_i} + \mathbf{X}\_i^\top \mathbf{b}\_{y\_i} \right) + \lambda \sum\_{k=1}^K \|\mathbf{b}\_k\|\_2^2 \\ \text{s.t.} \; &\sum\_{k=1}^K a\_k = 0; \quad \sum\_{k=1}^K b\_{jk} = 0, \quad \forall j = 1, \dots, p, \end{aligned} \tag{4}$$

with $y\_i, k \in \{1, \ldots, K\}$, where $a\_k$ and $\mathbf{b}\_k = (b\_{1k}, \ldots, b\_{pk})^{\top}$ are the intercept and slope parameters for each category *k*, respectively. Although these methods can be applied to multicategory classification in the HDLSS setting, both problems (2) and (4) use the $L\_2$ penalty and do not perform feature selection. As discussed in [16], for high-dimensional classification, taking all features into consideration does not work well for two reasons. First, based on prior knowledge, only a small number of variables are relevant to the classification problem: a good classifier in high dimensions should have the ability to sparsely select important variables and discard redundant ones. Second, classifiers using all available variables in high-dimensional settings may have poor classification performance.

Much of the SVM literature has considered variable selection in high-dimensional classification problems to improve performance (e.g., [17–19]). Among the DWD literature, to the best of our knowledge, only [16] considered variable selection and classification simultaneously. Wang and Zou [16] considered an $L\_1$ rather than an $L\_2$ penalty in problem (2) to improve interpretability through sparsity in binary classification. Moreover, [16] made selections based on the strengths of input variables within individual classes but ignored the strengths of input variable groupings, thereby selecting more factors than necessary for each class. To overcome this weakness, in this paper we develop a multicategory generalized DWD method that is capable of performing variable selection and classification simultaneously. Our approach incorporates sparsity and group structure information via the sparse group lasso penalty (see [20–24]).

Although DWD is well studied, it is less popular than the SVM for binary classification, arguably for computational and theoretical reasons. For an up-to-date list of works on DWD, mostly focused on the *q* = 1 case, see [14,15]. Theoretical asymptotic properties of large-margin classifiers in high dimensional settings were studied in [25], and [26] derived an expression for asymptotic generalization error. In terms of computation, [8] solved the standard DWD problem in (1) as a second-order cone programming (SOCP) problem using a primal-dual interior-point method that is computationally expensive when *N* or *p* is large. To overcome computational bottlenecks, [12] proposed an approach based on a novel formulation of the primal DWD model in (1); this method, however, does not scale to large data sets and requires further work. Lam et al. [27] designed a new algorithm for large DWD problems with *q* ≥ 2 and *K* = 2 based on convergent multi-block ADMM-type methods (see [28]). Wang and Zou [16] solved the lasso-penalized binary DWD problem by combining majorization–minimization and coordinate descent methods, since the lasso penalty does not directly permit an SOCP solution. In fact, solution identifiability in the generalized DWD problem with *q* > 1 requires more constraints and remains an open research problem (see [8]). To the best of our knowledge, no work focusing on computational aspects of lasso-penalized multicategory generalized DWD (MgDWD) exists. The same holds for sparse group lasso-penalized MgDWD.

The theoretical and computational contributions of this paper are as follows. First, we establish the uniqueness of the minimizer in the population form of the MgDWD problem. Second, we prove a non-asymptotic *L*<sup>2</sup> estimation error bound for the sparse group lasso-regularized MgDWD loss function in the ultra-high dimensional setting under mild regularity conditions. Third, we develop a fast, efficient algorithm able to solve the sparse group lasso-penalized MgDWD problem using proximal methods.

The rest of this paper is organized as follows. In Section 2.1, we introduce the MgDWD problem with the sparse group lasso penalty. In Sections 2.2 and 2.3, we establish theoretical properties of the population classifier and the regularized empirical loss. We propose a computational algorithm in Section 2.4. Section 3 illustrates the finite sample performance of our method through simulation studies and a real data analysis. Proofs of the major theorems are given in Appendix A.

#### **2. Methodology**

#### *2.1. Model Setup*

We begin with some basic setup and notation. Consider the multicategory classification problem for a random sample $\{(y\_i, X\_i)\}\_{i=1}^{N}$ of *N* independent and identically distributed (i.i.d.) observations from some distribution $\mathbb{P}(y, X)$. Here, *y* is the categorical response taking values in $\mathcal{Y} = \{1, \ldots, K\}$, and $X = (x\_1, \ldots, x\_p)^{\top} \in \mathcal{X} \subset \mathbb{R}^p$ is the covariate vector. We wish to obtain a proper separating hyperplane $\{X \in \mathcal{X} \mid a\_k + X^{\top}\mathbf{b}\_k = 0\}$ for each category $k \in \mathcal{Y}$, where $a\_k$ and $\mathbf{b}\_k = (b\_{1k}, \ldots, b\_{pk})^{\top}$ are intercept and slope parameters, respectively.

In this paper, we consider MgDWD with sparse group lasso regularization. That is, we estimate a classification boundary by solving the constrained optimization problem

$$\begin{aligned} \min\_{a\_k, \mathbf{b}\_k} \; &\frac{1}{N} \sum\_{i=1}^N \phi\_q \left( a\_{y\_i} + \mathbf{X}\_i^\top \mathbf{b}\_{y\_i} \right) + \lambda\_1 \sum\_{k=1}^K \sum\_{j=1}^p |b\_{jk}| + \lambda\_2 \sum\_{j=1}^p \sqrt{\sum\_{k=1}^K b\_{jk}^2} \\ \text{s.t.} \; &\sum\_{k=1}^K a\_k = 0; \quad \sum\_{k=1}^K b\_{jk} = 0, \quad \forall j = 1, \dots, p, \end{aligned} \tag{5}$$

where *φ<sup>q</sup>* is as defined in (3).

To approach this problem, we apply the concept of a "margin vector" to extend the definition of a (binary) margin to the multicategory case. Denote the margin vector of an observation $X\_i$ as $F\_i = (f\_{i1}, \ldots, f\_{iK})^{\top}$, with $f\_{ik} = a\_k + X\_i^{\top}\mathbf{b}\_k$ satisfying $\sum\_{k=1}^{K} f\_{ik} = 0$. Let $E\_i = (e\_{i1}, \ldots, e\_{iK})^{\top}$ be the class indicator vector with $e\_{ik} = \mathbb{1}\{y\_i = k\}$. The multicategory margin of the data point $(y\_i, X\_i)$ is then given as $f\_{iy\_i} = a\_{y\_i} + X\_i^{\top}\mathbf{b}\_{y\_i} = E\_i^{\top}F\_i$. Therefore, the MgDWD loss can be rewritten as

$$
\phi\_q(a\_{y\_i} + \mathbf{X}\_i^\top \mathbf{b}\_{y\_i}) = \phi\_q(\mathbf{E}\_i^\top \mathbf{F}\_i) = \mathbf{E}\_i^\top \phi\_q(\mathbf{F}\_i) = \sum\_{k=1}^K \mathbb{1}\{y\_i = k\} \phi\_q(a\_k + \mathbf{X}\_i^\top \mathbf{b}\_k). \tag{6}
$$

Based on (6), Lemma 1 describes the Fisher consistency of the MgDWD loss.

**Lemma 1.** *Given $\mathbf{X} = u$, the minimizer of the conditional expectation of* (6) *is $\tilde{F}(u) = \left(\tilde{f}\_1(u), \ldots, \tilde{f}\_K(u)\right)^{\top}$, satisfying*

$$\underset{k \in \mathcal{Y}}{\operatorname{argmax}} \,\tilde{f}\_k(u) = \underset{k \in \mathcal{Y}}{\operatorname{argmax}} \,\Pr\{y = k \mid \mathbf{X} = u\},$$

*where*

$$\tilde{f}\_k(u) = \begin{cases} Q \sqrt[q]{\dfrac{\Pr\{y=k \mid \mathbf{X}=u\}}{\Pr\{y=k\_\* \mid \mathbf{X}=u\}}}, & k \neq k\_\*,\\[2ex] -Q \displaystyle\sum\_{l \neq k\_\*} \sqrt[q]{\dfrac{\Pr\{y=l \mid \mathbf{X}=u\}}{\Pr\{y=k\_\* \mid \mathbf{X}=u\}}}, & k = k\_\*, \end{cases}$$

*and $k\_\* = \operatorname{argmin}\_{k \in \mathcal{Y}} \Pr\{y = k \mid \mathbf{X} = u\}$.*

Consequently, $\tilde{f}\_k(u)$ can be treated as an effective proxy for $\Pr\{y = k \mid \mathbf{X} = u\}$ and, for any new observation $X\_\*$, a reasonable prediction of its label $y\_\*$ is

$$
\hat{y}\_\* = \underset{k \in \mathcal{Y}}{\text{argmax}} \{ a\_k + \mathbf{X}\_\*^{\top} b\_k \}.
$$
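
A one-line version of this prediction rule, assuming fitted intercepts `alpha` (length *K*) and a fitted $p \times K$ slope matrix `B` are available (names illustrative):

```python
import numpy as np

def predict_labels(alpha, B, X_new):
    """Predict labels via argmax_k {a_k + x^T b_k}; labels are returned as 1, ..., K."""
    scores = alpha[np.newaxis, :] + X_new @ B   # margin matrix for the new observations
    return np.argmax(scores, axis=1) + 1
```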

Speaking to the sparse group lasso (SGL) regularization in (5), the $L\_1$ penalty encourages an element-wise sparse estimator that selects important variables for each category, indicated by $\hat{b}\_{jk} \neq 0$. Assuming that parameters in different categories share the same information, we use an $L\_2$ penalty to encourage a group-wise sparsity structure that removes covariates that are irrelevant across all categories, that is, where $\hat{\beta}\_j = (\hat{b}\_{j1}, \ldots, \hat{b}\_{jK})^{\top} = \mathbf{0}\_K$. Specifically, let $\mathbf{x}\_j = (x\_{1j}, \ldots, x\_{Nj})^{\top}$ and $\mathbf{B} = (b\_{jk}) \in \mathbb{R}^{p \times K}$, where the *k*-th column $\mathbf{b}\_k$ is the slope vector for the category label *k* and the *j*-th row $\beta\_j^{\top}$ is the group coefficient vector for the variable $\mathbf{x}\_j$. If $\mathbf{x}\_j$ is noise in the classification problem or is not relevant to category label *k*, then the entry $b\_{jk}$ of **B** should be shrunk to exactly zero. The SGL penalty of (5) can be written as a convex combination of the lasso and group lasso penalties in terms of $\beta\_j$ as

$$\lambda\_1 \sum\_{k=1}^K \sum\_{j=1}^p |b\_{jk}| + \lambda\_2 \sum\_{j=1}^p \sqrt{\sum\_{k=1}^K b\_{jk}^2} = \lambda \sum\_{j=1}^p \left\{ \tau \|\beta\_j\|\_1 + (1 - \tau) \|\beta\_j\|\_2 \right\},\tag{7}$$

where *<sup>λ</sup>* <sup>&</sup>gt; 0 is the scale of the penalty and *<sup>τ</sup>* <sup>∈</sup> [0, 1] tunes the propensity between the element-wise and group-wise sparsity structure.

#### *2.2. Population MgDWD*

In this subsection, some basic results pertaining to unpenalized population MgDWD are given. These results are necessary for further theoretical analysis.

Denote the marginal probability mass of *y* as $\Pr(y = k) = \pi\_k$ with $\pi\_k > 0$ and $\sum\_{k=1}^{K} \pi\_k = 1$, and the conditional probability density function of $X$ given $y = k$ by $g(X \mid y = k) = g\_k(X)$. Let $\boldsymbol{\Theta} = (\theta\_1, \ldots, \theta\_K)$ be the collection of coefficient vectors $\theta\_k = (a\_k, \mathbf{b}\_k^{\top})^{\top}$ for all labels and $Z = (1, X^{\top})^{\top}$. The population version of the MgDWD problem in (6) is

$$\mathcal{L}(\mathfrak{d}) = \mathbb{E}\left\{ \mathbb{I}(\mathcal{Y})^{\top} \boldsymbol{\phi}\_{\boldsymbol{q}}(\boldsymbol{\Theta}^{\top} \mathbf{Z}) \right\} = \sum\_{k=1}^{K} \pi\_{k} \int\_{\mathcal{X}} \boldsymbol{\phi}\_{\boldsymbol{q}}(\mathbf{Z}^{\top} \boldsymbol{\theta}\_{k}) \boldsymbol{g}\_{k}(\mathbf{x}) d\mathbf{x},\tag{8}$$

where $\vartheta = \operatorname{vec}\{\boldsymbol{\Theta}\}$ is the vectorization of the matrix $\boldsymbol{\Theta}$ and $\mathbb{I}(\mathcal{Y}) = (\mathbb{1}\{y = 1\}, \ldots, \mathbb{1}\{y = K\})^{\top}$ is a random vector. Denote the true parameter value $\vartheta^\*$ as a minimizer of the population MgDWD problem, namely,

$$\mathfrak{d}^\* \in \operatorname\*{argmin}\_{\mathfrak{d} \in \mathcal{C}} \mathcal{L}(\mathfrak{d}),$$

where $\mathcal{C} = \left\{\vartheta \in \mathbb{R}^{K(p+1)} \mid \mathbf{C}\vartheta = \mathbf{0}\_{p+1}\right\}$ is the set of sum-constrained $\vartheta$ with $\mathbf{C} = \mathbf{1}\_K^{\top} \otimes \mathbf{I}\_{p+1}$, where $\otimes$ denotes the Kronecker product.

To facilitate our theoretical analysis, we first define the gradient vector and Hessian matrix of the population MgDWD loss function. We then introduce some regularity conditions necessary to derive theoretical properties of this problem. Let diag{*v*} be a diagonal matrix constructed from the vector *v*, and let ◦ and ⊕ be the Hadamard product and the direct matrix sum, respectively. Denote the gradient vector of the population MgDWD loss function (8) as

$$\mathcal{S}(\mathfrak{d}) = \mathbb{E}\left(\{\mathbb{I}(\mathcal{Y}) \circ \phi\_q'(\boldsymbol{\Theta}^{\top} \mathbf{Z})\} \otimes \mathbf{Z}\right) = \operatorname{vec}(\mathbf{S}\_1, \dots, \mathbf{S}\_K),$$

with

$$\mathbf{S}\_k = \mathbb{E}\{\mathbb{1}\{y=k\} \phi\_q^{\prime}(\mathbf{Z}^{\top} \theta\_k)\mathbf{Z}\} = \pi\_k \int\_{\mathcal{X}} \phi\_q^{\prime}(\mathbf{Z}^{\top} \theta\_k) \mathbf{Z}\, g\_k(\mathbf{X}) d\mathbf{X},$$

and its Hessian matrix as

$$\mathcal{H}(\mathfrak{d}) = \mathbb{E}\left\{ \operatorname{diag}\left\{ \mathbb{I}(\mathcal{Y}, \mathcal{X}) \circ \phi\_q''(\boldsymbol{\Theta}^{\top} \mathbf{Z}) \right\} \otimes (\mathbf{Z}\mathbf{Z}^{\top}) \right\} = \bigoplus\_{k=1}^K \mathbf{H}\_{k},$$

where $\phi\_q''$ denotes the second derivative of the function $\phi\_q$; $\mathbb{I}(\mathcal{Y}, \mathcal{X}) = \mathbb{I}(\mathcal{Y}) \circ \mathbb{I}(\mathcal{X})$ is a random vector with $\mathbb{I}(\mathcal{X}) = (\mathbb{1}\{\mathbf{X} \in \mathcal{X}\_1\}, \ldots, \mathbb{1}\{\mathbf{X} \in \mathcal{X}\_K\})^{\top}$ and $\mathcal{X}\_k = \left\{\mathbf{X} \in \mathcal{X} \mid \mathbf{Z}^{\top}\theta\_k > Q\right\}$; and

$$\mathbf{H}\_{k} = \mathbb{E}\{\mathbb{1}\{y=k, \mathbf{X}\in\mathcal{X}\_{k}\} \phi\_{q}^{\prime\prime}(\mathbf{Z}^{\top}\theta\_{k})\mathbf{Z}\mathbf{Z}^{\top}\} = \pi\_{k}\int\_{\mathcal{X}\_{k}} \phi\_{q}^{\prime\prime}(\mathbf{Z}^{\top}\theta\_{k})\mathbf{Z}\mathbf{Z}^{\top} g\_{k}(\mathbf{X}) d\mathbf{X}.$$

The block-diagonal structure of $\mathcal{H}(\vartheta)$ implies a parallel relationship across the categories. The relationship between the $\theta\_k$ is reflected by the sum-to-zero constraint in the definition of $\mathcal{C}$.

We assume the following regularity conditions.

(C1) The densities of $\mathbf{X}$ given $y = k \in \mathcal{Y}$, i.e., the $g\_k(\mathbf{X})$, are continuous and have finite second moments.

(C2) $0 < \Pr\{\mathbf{X} \in \mathcal{X}\_k^\* \mid y = k\} < 1$ for all $k \in \mathcal{Y}$, where $\mathcal{X}\_k^\* = \left\{\mathbf{X} \in \mathcal{X} \mid \mathbf{Z}^{\top}\theta\_k^\* > Q\right\}$.

(C3) $\operatorname{Var}\left(\mathbf{X} \mid \mathbf{X} \in \mathcal{X}\_k^\*, y = k\right) \succ \mathbf{O}\_p$ for all $k \in \mathcal{Y}$.

**Remark 1.** *Condition (C1) ensures that $\mathcal{L}$, $\mathcal{S}$, and $\mathcal{H}$ are well defined and continuous in $\vartheta$. For the theoretically optimal hyperplane $\{\mathbf{X} \in \mathcal{X} \mid \mathbf{Z}^{\top}\theta\_k^\* = 0\}$, the case with $\theta\_k^\* = \mathbf{0}\_{p+1}$ leaves $\mathbf{X}$ useless for classification. On the other hand, when $a\_k^\* \neq 0$ and $\mathbf{b}\_k^\* = \mathbf{0}\_p$, the hyperplane is the empty set and is similarly meaningless. Condition (C2) is proposed to avoid the case where $\mathbf{b}\_k^\* = \mathbf{0}\_p$, so that $\vartheta^\*$ always contains information relevant to the classification problem. For bounded random variables, condition (C2) should be assumed with caution. Condition (C3) implies the positive definiteness of $\mathcal{H}(\vartheta^\*)$.*

By convexity and the second-order Lagrange condition, the following theorem shows that the local minimizer of the population MgDWD problem exists and is unique.

**Theorem 1.** *Under the regularity conditions (C1)–(C3), the true parameter $\vartheta^\* \in \mathcal{C}$ is the unique minimizer of $\mathcal{L}(\vartheta)$ with $\mathbf{b}\_k^\* \neq \mathbf{0}\_p$, and*

$$\mathcal{L}(\mathfrak{d}^\*) = \sum\_{k=1}^K A(k,q)\,\pi\_{k},$$

*with* 0 ≤ *u*(*k*, *q*) ≤ *A*(*k*, *q*) ≤ *v*(*k*, *q*) ≤ 1*, where*

$$A(k,q) = 1 - \mathbb{E}\left\{\mathbf{1}\{\mathbf{X} \in \mathcal{X}\_k^\*\} \left\{1 - \left(\frac{\mathbf{Q}}{\mathbf{Z}^\top \mathbf{\theta}\_k^\*}\right)^q\right\} \,\Big|\,y = k\right\},$$

$$u(k,q) = \Pr\{\mathbf{X} \notin \mathcal{X}\_k^\* \,|y = k\} + \mathbf{Q}^{2q}\Pr\{\mathbf{Q} < \mathbf{Z}^\top \mathbf{\theta}\_k^\* \le \mathbf{Q}^{-1} \,|y = k\},$$

$$v(k,q) = \Pr\{\mathbf{Z}^\top \mathbf{\theta}\_k^\* \le 1 \,|\,y = k\} + \inf\_{\epsilon > 0} \left(\frac{\mathbf{Q}}{1 + \epsilon}\right)^q \Pr\{\mathbf{Z}^\top \mathbf{\theta}\_k^\* > 1 + \epsilon \,|\,y = k\}.$$

The bounds in Theorem 1 show how *q* affects the loss function L(*ϑ* ∗ ). The upper bound *v*(*k*, *q*) is a decreasing function of *q* with

$$\lim\_{q \to 0} v(k, q) = 1 \text{ and } \lim\_{q \to \infty} v(k, q) = \Pr\{\mathbf{Z}^\top \boldsymbol{\theta}\_k^\* \le 1 \mid y = k\}.$$

In the lower bound $u(k, q)$, the first term $\Pr\{\mathbf{X} \notin \mathcal{X}\_k^\* \mid y = k\}$ is an increasing function of *q* and the last term $Q^{2q}\Pr\{Q < \mathbf{Z}^{\top}\theta\_k^\* \le Q^{-1} \mid y = k\}$ is a decreasing function of *q*, with

$$\lim\_{q \to 0} u(k, q) = 1 \text{ and } \lim\_{q \to \infty} u(k, q) = \Pr\{\mathbf{Z}^\top \theta\_k^\* \le 1 \mid y = k\}.$$

Consequently, for the given population $\mathbb{P}(y, X)$, a larger *q* encourages the population MgDWD estimator to focus more on the regions $\{\mathbf{X} \notin \mathcal{X}\_k^\*, y = k\}$ that correspond to misclassifications. As a result, the estimator's performance will be similar to that of the hinge loss as $q \to \infty$. Setting *q* too small will lead to an ineffective classifier due to the unreasonable penalty placed on the well-classified region $\{\mathbf{X} \in \mathcal{X}\_k^\*, y = k\}$. This variation in the lower bound with respect to *q* provides a necessary condition for the existence of an optimal *q*.

**Remark 2.** *The explicit relationship between q and ϑ* ∗ *is complicated. While it may be more desirable to prove that a greater value of q results in a smaller value of the loss function* L(*ϑ*)*, there is no explicit formula for the optimal value ϑ* ∗ *in terms of q.*

#### *2.3. Estimator Consistency*

Under the unpenalized framework presented in the previous subsection, all covariates will contribute to the classification task for each category: this scenario may lead to a classifier that overfits to the training data set. In this subsection, we study the consistency of the estimator for (5) in ultra-high dimensional settings.

To achieve structural sparsity in the estimator, the regularization parameter *λ* in (7) must be large enough to dominate the gradient of the empirical MgDWD loss evaluated at the theoretical minimizer *ϑ* <sup>∗</sup> <sup>=</sup> vec{**Θ**∗} with high probability. On the other hand, *<sup>λ</sup>* should also be as small as possible to reduce the bias incurred by the SGL regularization term

$$P(\mathbf{B}) = \sum\_{j=1}^{p} \left\{\tau \|\beta\_j\|\_1 + (1 - \tau) \|\beta\_j\|\_2\right\}.$$

Lemma 2 provides a suitable choice of *λ* under the following assumption.

(A1) The predictors $X = (x\_1, \ldots, x\_p)^{\top} \in \mathbb{R}^p$ are independent sub-Gaussian random vectors satisfying $\mathbb{E}X = \mathbf{0}\_p$ and $\operatorname{Var}(X) = \boldsymbol{\Sigma}$; that is, there exists a constant $\kappa > 0$ such that, for any $\gamma \in \mathbb{R}^p$, $\mathbb{E}\exp(\gamma^{\top}\boldsymbol{\Sigma}^{-1/2}X) \le \exp(\|\gamma\|\_2^2 \kappa^2/2)$. From here on, we define $\varsigma\_1^2$ as the largest eigenvalue of $\boldsymbol{\Sigma}$.

**Lemma 2.** *Denote $S(\vartheta^\*) = (\mathbf{I}\_K \otimes \mathbf{Z}^{\top})\operatorname{diag}(\operatorname{vec}\{\mathbf{E}\})\operatorname{vec}\{\phi\_q'(\mathbf{Z}\boldsymbol{\Theta}^\*)\}$, where $\mathbf{E} = (E\_1, \ldots, E\_N)^{\top}$, $\mathbf{Z} = (Z\_1, \ldots, Z\_N)^{\top}$ with $Z\_i = (1, X\_i^{\top})^{\top}$, and $\mathbf{I}\_K$ is the identity matrix of size K. Under condition (A1),*

$$\tilde{P}\{\mathbf{P}S(\mathfrak{d}^\*)\} \le \tau \Lambda\_1 + (1 - \tau)\Lambda\_2$$

*with probability at least $1 - 2(Kp)^{1-c\_1^2} - p^{1-c\_2^2}$, where*

$$\begin{aligned} \mathbf{P} &= (\mathbf{I}\_K - K^{-1} \mathbf{1}\_K \mathbf{1}\_K^\top) \otimes \mathbf{I}\_{p+1}, \\ \Lambda\_1 &= \max\{\varsigma\_1 \kappa, 1\} \left(1 - \frac{1}{K}\right) \sqrt{\frac{2\log(pK)}{N}}, \\ \Lambda\_2 &= \max\{2\sqrt{2}\varsigma\_1 \kappa, 1\} \left\{c\_2 \sqrt{\left(1 - \frac{1}{K}\right) \frac{2\log(p)}{N}} + \sqrt{\frac{K-1}{N}}\right\}. \end{aligned}$$

*for constants c*1, *c*<sup>2</sup> > 1*.*

It is difficult to obtain a closed form for the conjugate of the SGL penalty, say, $\bar{P}(v) = \sup\_{u \in \mathcal{C}\setminus\{0\}} \langle u, v\rangle / P(u)$. Instead, we use a regularized upper bound $\tilde{P}(v) \ge \bar{P}(v)$. Based on Lemma 2, we propose a theoretical tuning parameter value

$$
\lambda = c\_0 \sqrt{\frac{\log(pK)}{N}} \,\,\,\tag{9}
$$

where $c\_0$ is some given constant satisfying $\lambda > \tau\Lambda\_1 + (1 - \tau)\Lambda\_2$.

Before we can derive an error bound for the estimator in (5), we impose two additional assumptions.

(A2) For the true parameter value $\vartheta^\*$, there is an $(s\_e, s\_g)$-sparse structure in the coefficients $\mathbf{B}^\*$ with element-wise and group-wise support sets

$$\mathcal{E} = \left\{ (j,k) \in \{1, \dots, p\} \times \{1, \dots, K\} \mid b^\*\_{jk} \neq 0\right\} \text{ and } \mathcal{G} = \left\{ j \in \{1, \dots, p\} \mid \beta^\*\_j \neq \mathbf{0}\_K\right\},$$

with cardinality $|\mathcal{E}| = s\_e$ and $|\mathcal{G}| = s\_g$, respectively.

(A3) There exist some positive constants $\varsigma\_2$ and $\varsigma\_3$ such that

$$\varsigma\_2^2 = \max\_{\gamma \in \mathcal{V}} \frac{\|\operatorname{diag}\{\operatorname{vec}(\mathbf{E}^{\top})\}(\mathbf{Z} \otimes \mathbf{I}\_K)\gamma\|\_2^2}{N\|\gamma\|\_2^2} \text{ and } \varsigma\_3^2 = \min\_{\gamma \in \mathcal{W}} \frac{\gamma^{\top} \mathcal{H}(\mathfrak{d}^\*)\gamma}{\gamma^{\top} \gamma},$$

with $\mathcal{V} = \left\{v \in \mathbb{R}^{K(p+1)} \mid 0 < \|v\|\_0 \le s\_e + K\right\}$ and

$$\mathcal{W} = \left\{ \delta \in \mathbb{R}^{K(p+1)} \; \middle| \; \frac{\tau}{1-\tau} \|\delta\_{\mathcal{E}\_+}\|\_1 + \sum\_{j \in \mathcal{G}\_+} \|\delta\_j\|\_2 \ge C\_0 \left( \frac{\tau}{1-\tau} \|\delta\_{\mathcal{E}^c}\|\_1 + \sum\_{j \notin \mathcal{G}} \|\delta\_j\|\_2 \right) \right\},$$

where $C\_0 = (c\_0-1)/(c\_0+1)$, $\mathcal{E}^c$ is the complement of $\mathcal{E}$, $\mathcal{E}\_+ = \mathcal{E} \cup \{l = 1 + (k-1)(p+1) \mid k = 1, \ldots, K\}$, and $\mathcal{G}\_+ = \mathcal{G} \cup \{0\}$.

Under the choice of *λ* given in (9), we show the *L*2-consistency of the estimator in (5).

**Theorem 2.** *Suppose that conditions (A1)–(A3) hold. Then, with $\lambda = c\_0\sqrt{\log(pK)/N}$ in (5), we have that*

$$\left\|\widehat{\mathfrak{d}} - \mathfrak{d}^\*\right\|\_2 \le \left\{ C\_1 \sqrt{s\_e + K} + C\_2 \sqrt{s\_g + 1} \right\} \sqrt{\frac{\log(pK)}{N}}$$

*with probability at least $1 - 2(Kp)^{2(s\_e+K)(1-c\_3^2)}$, where $C\_1 = 2\varsigma\_3^{-2}\{c\_0\tau + (\sqrt{2} + 2c\_3)\varsigma\_2\}$ and $C\_2 = 2\varsigma\_3^{-2} c\_0(1 - \tau)$.*

**Remark 3.** *The sub-Gaussian distribution assumption (A1) is common in high-dimensional scenarios. This assumption characterizes the tail behavior of a collection of random variables including Gaussian, Bernoulli, and bounded variables as special cases. Assumption (A2) describes structural sparsity at two levels: the element-wise size $s\_e < pK$ is the size of the underlying generative model, and the group-wise size $s\_g < p$ is the size of the signal covariate set. Both $s\_e$ and $s\_g$ are allowed to depend on the sample size N. As a result, the dimension p is allowed to increase with the sample size N. Assumption (A3) guarantees that the relevant eigenvalues are positive in this sparse scenario.*

**Remark 4.** *In practice, the tuning parameters λ and τ in* (7) *are commonly chosen by M-fold cross validation. That is, we choose the pair* (*τ*, *λ*) *that maximizes the prediction accuracy over the held-out sub-data sets $\mathcal{D}\_m$, specifically,*

$$CV(\tau,\lambda) = \sum\_{m=1}^{M} \sum\_{i \in \mathcal{D}\_m} \mathbb{1}\{y\_{i} = \hat{y}\_{i}(\tau,\lambda)\},$$

*where $\hat{y}\_i(\tau, \lambda) = \operatorname{argmax}\_{k \in \mathcal{Y}} Z\_i^{\top} \hat{\theta}\_k(\tau, \lambda)$.*
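
A minimal sketch of this grid search is given below; `fit_fn` and `predict_fn` are placeholders for the solver of Section 2.4 and the prediction rule of Section 2.1, and the fold assignment is purely illustrative.

```python
import numpy as np

def cv_select(X, y, taus, lambdas, fit_fn, predict_fn, M=5):
    """Pick (tau, lambda) maximizing M-fold cross-validated accuracy, as in Remark 4.

    fit_fn(X, y, lam, tau) -> (alpha_hat, B_hat) and predict_fn(alpha, B, X) -> labels
    are assumed to be supplied by the user.
    """
    N = X.shape[0]
    folds = np.arange(N) % M                       # simple deterministic fold assignment
    best_pair, best_correct = None, -1
    for tau in taus:
        for lam in lambdas:
            correct = 0
            for m in range(M):
                train, test = folds != m, folds == m
                alpha_hat, B_hat = fit_fn(X[train], y[train], lam, tau)
                correct += int(np.sum(predict_fn(alpha_hat, B_hat, X[test]) == y[test]))
            if correct > best_correct:
                best_pair, best_correct = (tau, lam), correct
    return best_pair
```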

#### *2.4. Computational Algorithm*

In this section, we propose an efficient algorithm to solve problem (5). Our approach uses the proximal algorithm (see [29]) for solving high dimensional regularization problems. In two main steps, this approach obtains a solution to the constrained optimization problem by applying the proximal operator to the solution to the unconstrained problem.

Since regularization is not needed for the intercept terms $\alpha = (a\_1, \ldots, a\_K)^{\top}$, they can be separated from the coefficients in **B**. The empirical MgDWD loss corresponding to (8) is given by

$$L(\mathfrak{d}) = \frac{1}{N} \sum\_{i=1}^{N} E\_i^{\top} \phi\_{q}(\mathbf{F}\_i) = \frac{1}{N} \operatorname{tr} \left\{ \mathbf{E}\, \phi\_{q}(\mathbf{F}^{\top}) \right\} = \frac{1}{N} \operatorname{vec}\{\mathbf{E}^{\top}\}^{\top} \operatorname{vec}\{\phi\_{q}(\mathbf{F}^{\top})\},$$

where $\mathbf{F} = (f\_{ik})\_{N \times K} = \mathbf{Z}\boldsymbol{\Theta} = \mathbf{1}\_N \alpha^{\top} + \mathbf{X}\mathbf{B}$. Various properties of the loss function $L(\vartheta)$ follow below.
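
For concreteness, the matrix form of the empirical loss can be evaluated directly (a sketch, with `E` the $N \times K$ class-indicator matrix; names are illustrative):

```python
import numpy as np

def empirical_loss(alpha, B, X, E, q=1.0):
    """Empirical MgDWD loss (1/N) sum_i sum_k e_ik * phi_q(f_ik), with F = 1 alpha^T + X B."""
    N = X.shape[0]
    Q = q / (q + 1.0)
    F = alpha[np.newaxis, :] + X @ B                       # N x K margin matrix
    loss = np.where(F <= Q, 1.0 - F, (1.0 - Q) * (Q / np.maximum(F, Q)) ** q)
    return float(np.sum(E * loss)) / N
```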

**Lemma 3.** *The loss function $L(\vartheta)$ has Lipschitz continuous partial derivatives. In particular, for $S(\alpha) = \frac{\partial L(\vartheta)}{\partial \alpha} = \frac{1}{N}\left\{\mathbf{E} \circ \phi\_q'(\mathbf{F})\right\}^{\top} \mathbf{1}\_N$ and any $u, v \in \mathbb{R}^K$, we have that*

$$\left\|S(u) - S(v)\right\|\_{2} \leq \sqrt{\frac{n\_{\max}}{N}} \frac{(q+1)^2}{q} \left\|u - v\right\|\_{2},$$

*where $n\_{\max}$ is the largest group sample size. For $S(\mathbf{B}) = \frac{\partial L(\vartheta)}{\partial \mathbf{B}} = \frac{1}{N}\left\{\mathbf{E} \circ \phi\_q'(\mathbf{F})\right\}^{\top} \mathbf{X}$ and any $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{p \times K}$, we have that*

$$\left\|\operatorname{vec}\{S(\mathbf{U}) - S(\mathbf{V})\}\right\|\_{2} \leq \frac{\max\_{k} \left\|\operatorname{diag}(\mathbf{e}\_{k})\mathbf{X}\right\|\_{2}^{2}}{N} \frac{(q+1)^{2}}{q} \left\|\operatorname{vec}\{\mathbf{U} - \mathbf{V}\}\right\|\_{2},$$

*where e<sup>k</sup> is the k-th column of* **E** *and indicates the observations belonging to the k-th group.*

Hence, following the majorization–minimization scheme, we can majorize the empirical MgDWD loss *L*(*ϑ*) by a quadratic function, that is,

$$\begin{split} L(\mathfrak{d}) &\leq L(\mathfrak{d}\_{\*}) + S(\alpha\_{\*})^{\top}(\alpha - \alpha\_{\*}) + \frac{L\_{\alpha}}{2} \|\alpha - \alpha\_{\*}\|\_{2}^{2} \\ &\quad + \operatorname{vec}\{S(\mathbf{B}\_{\*})\}^{\top} \operatorname{vec}\{\mathbf{B} - \mathbf{B}\_{\*}\} + \frac{L\_{\mathbf{B}}}{2} \|\operatorname{vec}\{\mathbf{B} - \mathbf{B}\_{\*}\}\|\_{2}^{2} \end{split}$$

for some $\vartheta\_\* = \operatorname{vec}\{(\alpha\_\*, \mathbf{B}\_\*^{\top})^{\top}\}$, where $L\_\alpha$ and $L\_{\mathbf{B}}$ denote the Lipschitz constants in Lemma 3. Instead of minimizing $L(\vartheta)$ directly, we apply gradient descent to minimize its surrogate upper bound function. The gradient descent updates are given by

$$\alpha\_{\*} = \alpha - \frac{q(q+1)^{-2}}{\sqrt{n\_{\max}N}} \left\{ \mathbf{E} \circ \phi\_{q}^{\prime}(\mathbf{F}) \right\}^{\top} \mathbf{1}\_{N}, \tag{10}$$

$$\mathbf{B}\_{\*} = \mathbf{B} - \frac{q(q+1)^{-2}}{\max\_{k} \| \text{diag}(\mathbf{e}\_{k}) \mathbf{X} \|\_{2}^{2}} \left\{ \mathbf{E} \circ \phi\_{q}^{\prime}(\mathbf{F}) \right\}^{\top} \mathbf{X}. \tag{11}$$
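
A NumPy sketch of the two gradient steps (10) and (11) is given below; the step sizes are the reciprocals of the Lipschitz constants from Lemma 3, the derivative of the loss follows the expansion used in the proof of Lemma A4, and all names are illustrative.

```python
import numpy as np

def gradient_step(alpha, B, X, E, q=1.0):
    """One unconstrained gradient step on (alpha, B), following Eqs. (10) and (11).

    E is the N x K class-indicator matrix; L_alpha and L_B are the Lipschitz constants
    of Lemma 3, so the steps below equal alpha - S(alpha)/L_alpha and B - S(B)/L_B.
    """
    N = X.shape[0]
    K = E.shape[1]
    Q = q / (q + 1.0)
    F = alpha[np.newaxis, :] + X @ B                               # N x K margin matrix
    dphi = np.where(F <= Q, -1.0, -(Q / np.maximum(F, Q)) ** (q + 1.0))
    G = E * dphi                                                   # E o phi_q'(F)
    n_max = E.sum(axis=0).max()                                    # largest group size
    L_alpha = np.sqrt(n_max / N) * (q + 1.0) ** 2 / q
    L_B = max(np.linalg.norm(X[E[:, k] > 0], 2) for k in range(K)) ** 2 / N * (q + 1.0) ** 2 / q
    alpha_star = alpha - G.sum(axis=0) / (N * L_alpha)             # Eq. (10)
    B_star = B - (X.T @ G) / (N * L_B)                             # Eq. (11)
    return alpha_star, B_star
```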

Next, we address the problem's constraints and regularization simultaneously by applying the proximal operator. For *α*∗, it is clear that

$$\alpha\_{\text{new}} = \underset{\alpha^{\top} \mathbf{1}\_K = 0}{\operatorname{argmin}} \left\| \alpha - \alpha\_\* \right\|\_2^2 = \mathbf{P}\_K \alpha\_{\*}, \tag{12}$$

where $\mathbf{P}\_K = \mathbf{I}\_K - K^{-1}\mathbf{1}\_K\mathbf{1}\_K^{\top}$. For $\mathbf{B}\_\* = (\beta\_{1\*}, \ldots, \beta\_{p\*})^{\top}$, the minimization problem can be expressed as

$$\begin{split} \mathbf{B}\_{\text{new}} &= \underset{\mathbf{B}\mathbf{1}\_{K} = \mathbf{0}\_{p}}{\operatorname{argmin}} \frac{1}{2} \|\operatorname{vec}\{\mathbf{B} - \mathbf{B}\_{\*}\}\|\_{2}^{2} + \frac{\lambda\_{1}}{L\_{\mathbf{B}}} \|\operatorname{vec}\{\mathbf{B}\}\|\_{1} + \frac{\lambda\_{2}}{L\_{\mathbf{B}}} \|\operatorname{vec}\{\mathbf{B}\}\|\_{1,2} \\ &= \underset{\mathbf{B}\mathbf{1}\_{K} = \mathbf{0}\_{p}}{\operatorname{argmin}} \sum\_{j=1}^{p} \frac{1}{2} \|\beta\_{j} - \beta\_{j\*}\|\_{2}^{2} + \frac{\lambda\_{1}}{L\_{\mathbf{B}}} \|\beta\_{j}\|\_{1} + \frac{\lambda\_{2}}{L\_{\mathbf{B}}} \|\beta\_{j}\|\_{2}, \end{split} \tag{13}$$

This implies that the minimization can be carried out for the *p* groups in parallel. The following theorem provides the solution to (13).

**Theorem 3.** *Let $\rho\_1, \rho\_2 \ge 0$ and $\beta\_\* \in \mathbb{R}^K$. Then the constrained regularization problem*

$$\begin{aligned} \min\_{\beta \in \mathbb{R}^{K}} \; & \frac{1}{2} \|\beta - \beta\_{\*}\|\_{2}^{2} + \rho\_{1} \|\beta\|\_{1} + \rho\_{2} \|\beta\|\_{2} \\ \text{s.t. } \; & \beta^{\top} \mathbf{1}\_{K} = 0 \end{aligned}$$

*has a solution of the form*

$$\beta^\* = \left\{ 1 - \frac{\rho\_2}{\|\mathbf{P}\_K(\beta\_\* - \rho\_1 u)\|\_2} \right\}\_+ \mathbf{P}\_K(\beta\_\* - \rho\_1 u) \tag{14}$$

*for some $u \in \partial\|\beta\|\_1$.*

In the special case with $\rho\_2 = 0$, the constrained regularization problem in Theorem 3 reduces to the constrained lasso problem with solution $\tilde{\beta}\_\* = \mathbf{P}\_K(\beta\_\* - \rho\_1 u)$. Combined with (14), the proximal operator $\mathcal{U}$, given by

$$\beta^\* = \mathcal{U}(\tilde{\beta}\_\*, \rho\_2) = \left\{ 1 - \frac{\rho\_2}{\|\tilde{\beta}\_\*\|\_2} \right\}\_+ \tilde{\beta}\_\*,\tag{15}$$

can be introduced to realize the group sparsity of $\tilde{\beta}\_\*$.
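
The operator $\mathcal{U}$ in (15) is a plain group soft-threshold; a minimal sketch (names illustrative):

```python
import numpy as np

def group_soft_threshold(beta_tilde, rho2):
    """Proximal operator U of Eq. (15): shrinks the whole group toward zero and sets it
    exactly to zero when its L2 norm is at most rho2."""
    norm = np.linalg.norm(beta_tilde)
    if norm <= rho2:
        return np.zeros_like(beta_tilde)
    return (1.0 - rho2 / norm) * beta_tilde
```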

For the standard lasso problem, the subgradient *u* has a closed form given by $\tilde{\beta}\_\* = \beta\_\* - \rho\_1 u = \mathcal{S}(\beta\_\*, \rho\_1)$, with $\mathcal{S}(u, v) = \operatorname{sign}(u)(|u| - v)\_+$. However, under the constraint on $\tilde{\beta}\_\*$, the naive solution $\mathbf{P}\_K\mathcal{S}(\beta\_\*, \rho\_1)$ is misleading in that it satisfies the constraint but does not achieve shrinkage, let alone loss function minimization. The term $\mathbf{P}\_K u$ is suggestive of the intersection between the subdifferential set $\partial\|\beta\|\_1$ and the constraint set $\{\beta \in \mathbb{R}^K \mid \beta^{\top}\mathbf{1}\_K = 0\}$; in this sense, $\tilde{\beta}\_\*$ might not have a closed form. Here, we consider using coordinate descent to solve the constrained lasso problem. For some fixed coordinate *m*, since $\beta^{\top}\mathbf{1}\_K = 0$, we have that $b\_m = -\sum\_{l \neq m} b\_l$. Rewriting the objective function of the lasso-constrained problem in a coordinate-wise form, we obtain

$$\begin{split} \sum\_{l=1}^{K} \frac{1}{2} (b\_l - b\_{l\*})^2 + \rho\_1 |b\_l| &= \left( b\_k - \frac{(b\_{k\*} - b\_{m\*})}{2} + \frac{1}{2} \sum\_{l \neq k, m}^{K} b\_l \right)^2 + \rho\_1 \left\{ |b\_k| + \left| b\_k + \sum\_{l \neq k, m}^{K} b\_l \right| \right\} \\ &+ \frac{1}{4} \left( b\_{k\*} + b\_{m\*} + \sum\_{l \neq k, m}^{K} b\_l \right)^2 + \sum\_{l \neq k, m}^{K} \frac{1}{2} (b\_l - b\_{l\*})^2 + \rho\_1 |b\_l|. \end{split} \tag{16}$$

Next, Theorem 4 provides the solution to the optimization problem (16).

**Theorem 4.** *Suppose that $t, s \in \mathbb{R}$ and $\varrho \ge 0$. Then the regularization problem*

$$\min\_{b \in \mathbb{R}} \frac{1}{2}(b-t)^2 + \varrho\{|b| + |b+s|\}$$

*has solution*

$$\begin{aligned} b^\* &= \begin{cases} t, & |t| < C(s, t), \\ \operatorname{sign}(t)\,C(s, t), & C(s, t) \le |t| \le C(s, t) + 2\varrho, \\ \operatorname{sign}(t)(|t| - 2\varrho), & |t| > C(s, t) + 2\varrho, \end{cases} \\ &= t - \mathcal{S}\{t, C(s, t)\} + \mathcal{S}\{\mathcal{S}(t, C(s, t)), 2\varrho\}, \end{aligned}$$


*where $C(s, t) = \frac{1 - \operatorname{sign}(s)\operatorname{sign}(t)}{2}|s|$.*

By Theorem 4, given some $m \in \{1, \ldots, K\}$, the coordinate-wise minimizer for any $k \neq m$ can be expressed via the proximal operator

$$b\_k = \mathcal{T}(t, s, \rho\_1) = t - \mathcal{S}\left(t, \mathcal{C}(s, t)\right) + \mathcal{S}\left\{\mathcal{S}\left(t, \mathcal{C}(s, t)\right), \rho\_1\right\},\tag{17}$$

with $s = \sum\_{l \neq k,m} b\_l$ and $t = (b\_{k\*} - b\_{m\*} - s)/2$. If we fix *m* during the iterations, then the shrinkage of $b\_m$ is only indirectly reflected in the other $b\_k$. We therefore let *m* change with *k* in the coordinate-wise minimization process to ensure that every coordinate can be shrunk equally. We summarize our proposed algorithm in Algorithm 1.
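
For concreteness, the scalar operators entering (17) can be sketched as follows (function names are illustrative):

```python
import numpy as np

def soft(u, v):
    """Soft-thresholding operator S(u, v) = sign(u)(|u| - v)_+ ."""
    return np.sign(u) * np.maximum(np.abs(u) - v, 0.0)

def C(s, t):
    """C(s, t) = (1 - sign(s) sign(t)) |s| / 2, as in Theorem 4."""
    return 0.5 * (1.0 - np.sign(s) * np.sign(t)) * np.abs(s)

def T(t, s, rho1):
    """Coordinate-wise proximal operator of Eq. (17)."""
    inner = soft(t, C(s, t))
    return t - inner + soft(inner, rho1)
```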

**Algorithm 1** Proximal gradient descent algorithm for SGL-MgDWD.

**Input:** $\lambda\_1$, $\lambda\_2$.
**Initialization:** $\alpha^{(0)} = \mathbf{0}\_K$, $\mathbf{B}^{(0)} = \mathbf{O}\_{p\times K}$, $l = 0$.
1: **repeat**
2: Update $\alpha$ according to (10) and (12): $\alpha^{(l+1)} = \mathbf{P}\_K\{\alpha^{(l)} - L\_\alpha^{-1} S(\alpha^{(l)})\}$.
3: Update $\tilde{\mathbf{B}}$ according to (11): $\tilde{\mathbf{B}} = \mathbf{B}^{(l)} - L\_{\mathbf{B}}^{-1} S(\mathbf{B}^{(l)})$.
4: Set $\mathbf{B}^{(l+1)} \leftarrow \tilde{\mathbf{B}}$.
5: **repeat**
6: **for** $m = 1$ to $K$ **do**
7: **for** $k \in \{1, \ldots, K\} \setminus \{m\}$ **do**
8: Update $(t, s)$:

$$s = \sum\_{r=1}^{K} b\_r^{(l+1)} - b\_k^{(l+1)} - b\_m^{(l+1)}, \quad t = \left(\tilde{b}\_k - \tilde{b}\_m - s\right)/2.$$

9: Update $b\_k^{(l+1)}$ according to (17) and $b\_m^{(l+1)}$ by the constraint:

$$\boldsymbol{b}\_{k}^{(l+1)} = \mathcal{T}(\mathbf{t}, \mathbf{s}, L\_{\mathbf{B}}^{-1} \boldsymbol{\lambda}\_{1}), \quad \mathbf{b}\_{m}^{(l+1)} = -\mathbf{s} - \mathbf{b}\_{k}^{(l+1)}.$$

10: **end for**

11: **end for**


12: **until** some condition is met.
13: Update $\mathbf{B}^{(l+1)}$ group-wise according to (15):

$$\mathbf{B}^{(l+1)} = \mathcal{U}(\mathbf{B}^{(l+1)}, L\_\mathbf{B}^{-1} \lambda\_2).$$

14: Set $l \leftarrow l + 1$.
15: **until** some condition is met.
**Output:** $\alpha^{(l)}$ and $\mathbf{B}^{(l)}$.
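
Putting the pieces together, the following is a compact NumPy sketch of Algorithm 1. It is an illustration under the stated updates rather than the authors' implementation: the function and variable names are ours, the inner coordinate-descent pass uses the definitions of *s* and *t* given after Theorem 4, the derivative of the loss follows the expansion in the proof of Lemma A4, and fixed iteration counts stand in for the unspecified stopping conditions.

```python
import numpy as np

def _soft(u, v):
    # Soft-threshold S(u, v) = sign(u)(|u| - v)_+
    return np.sign(u) * np.maximum(np.abs(u) - v, 0.0)

def _T(t, s, rho1):
    # Coordinate-wise proximal operator of Eq. (17)
    inner = _soft(t, 0.5 * (1.0 - np.sign(s) * np.sign(t)) * np.abs(s))
    return t - inner + _soft(inner, rho1)

def _prox_row(beta_target, rho1, rho2, n_inner=20):
    # Row-wise prox of Eq. (13): sum-to-zero lasso by coordinate descent, then Eq. (15)
    K = beta_target.size
    beta = beta_target - beta_target.mean()            # start from the projected target
    for _ in range(n_inner):
        for m in range(K):
            for k in range(K):
                if k == m:
                    continue
                s = beta.sum() - beta[k] - beta[m]
                t = (beta_target[k] - beta_target[m] - s) / 2.0   # as defined after Theorem 4
                beta[k] = _T(t, s, rho1)
                beta[m] = -s - beta[k]                 # enforce the sum-to-zero constraint
    norm = np.linalg.norm(beta)
    return np.zeros(K) if norm <= rho2 else (1.0 - rho2 / norm) * beta

def fit_sgl_mgdwd(X, y, lam, tau, q=1.0, n_outer=200, n_inner=20):
    """Proximal gradient sketch of Algorithm 1 for SGL-MgDWD; labels y take values 1..K."""
    N, p = X.shape
    K = int(np.max(y))
    E = np.eye(K)[np.asarray(y) - 1]                   # N x K class-indicator matrix
    alpha, B = np.zeros(K), np.zeros((p, K))
    lam1, lam2 = lam * tau, lam * (1.0 - tau)          # lambda_1 and lambda_2 as in Eq. (7)
    Q = q / (q + 1.0)
    n_max = E.sum(axis=0).max()
    L_alpha = np.sqrt(n_max / N) * (q + 1.0) ** 2 / q
    L_B = max(np.linalg.norm(X[E[:, k] > 0], 2) for k in range(K)) ** 2 / N * (q + 1.0) ** 2 / q
    for _ in range(n_outer):
        F = alpha + X @ B                                          # N x K margin matrix
        G = E * np.where(F <= Q, -1.0, -(Q / np.maximum(F, Q)) ** (q + 1.0))
        alpha = alpha - G.sum(axis=0) / (N * L_alpha)              # gradient step, Eq. (10)
        alpha = alpha - alpha.mean()                               # projection P_K, Eq. (12)
        B_target = B - (X.T @ G) / (N * L_B)                       # gradient step, Eq. (11)
        B = np.vstack([_prox_row(B_target[j], lam1 / L_B, lam2 / L_B, n_inner)
                       for j in range(p)])
    return alpha, B
```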

#### **3. Numerical Analysis**

In this section, we use both simulated and real data sets to evaluate the finite sample properties of our proposed method. We compare the performance of SGL-MgDWD with that of $L\_1$-regularized multinomial logistic regression ($L\_1$-logistic).

#### *3.1. Simulation Studies*

The data are generated from the following model. Consider the *K*-category classification problem where $\pi\_k = K^{-1}$ and $g\_k(X)$ is the density function of a normal distribution with mean vector $\mu\_k = (\mu\_{1k}, \mu\_{2k}, \mathbf{0}\_{p-2}^{\top})^{\top}$ and covariance matrix $\mathbf{I}\_p$, where $(\mu\_{1k}, \mu\_{2k}) = (2\cos(\pi r\_k), 2\sin(\pi r\_k))$ with $r\_k = 2(k-1)/K$, for $k = 1, \ldots, K$. In this model, only the first two variables contribute to the classification and their corresponding parameter vectors $\beta\_1$ and $\beta\_2$ form two groups of coefficients. The true model has the sparsity structure $(s\_e, s\_g) = (2K, 2)$ for a total of $K(p+1)$ coefficients. We set the sample size for each category to $n\_k = 50$, 100, 200, and 400, the number of classes to $K = 5$ and 11, and the dimensionality to $p = 100$ and 1000.
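
For reference, this simulation design can be generated as follows (a sketch; the seed and default sizes are illustrative):

```python
import numpy as np

def simulate(K=5, p=100, n_k=50, seed=0):
    """Simulation model of Section 3.1: class means of radius 2 on a circle in the first
    two coordinates, all remaining coordinates pure noise, identity covariance."""
    rng = np.random.default_rng(seed)
    X_list, y_list = [], []
    for k in range(1, K + 1):
        r_k = 2.0 * (k - 1) / K
        mu = np.zeros(p)
        mu[0], mu[1] = 2.0 * np.cos(np.pi * r_k), 2.0 * np.sin(np.pi * r_k)
        X_list.append(rng.normal(size=(n_k, p)) + mu)   # N(mu_k, I_p) draws for class k
        y_list.append(np.full(n_k, k))
    return np.vstack(X_list), np.concatenate(y_list)
```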

In what follows, we compare the proposed SGL-MgDWD method with the OVR method based on SGL-MgDWD with *K* = 2 (OVR-SGL-gDWD). For SGL-MgDWD, logistic regression, and OVR, the tuning parameter *λ* is optimized over a discrete set by minimizing the prediction error using 5-fold cross validation. In each simulation, we conduct 100 runs and use a testing set of equal size to evaluate each method's test accuracy, variable selection (correctly selected signal variables and incorrectly selected noise variables), and computation time. Simulation results are summarized in Tables 1 and 2.

As shown in Tables 1 and 2, the proposed SGL-MgDWD method performs better than the $L\_1$-logistic and OVR methods. Specifically, in each scenario, predictions from the SGL-MgDWD method had higher accuracy relative to the other two methods. Similarly, the SGL-MgDWD method correctly selected the signal components of the model with fewer incorrectly selected noise components, again relative to the $L\_1$-logistic and OVR methods. These simulation results also demonstrate that test accuracy increases with the sample size $n\_k$ and decreases with the dimension *p* at fixed $n\_k$, consistent with the derived theoretical properties. All computations were performed using TensorFlow 2.3 on an AMD Threadripper 2950X CPU at 4.1 GHz.

#### *3.2. HIV Data Analysis*

Symptomatic distal sensory polyneuropathy (sDSP) is a common debilitating condition among people with HIV. This condition leads to neuropathic pain and is associated with substantial comorbidities and increased health care costs. Plasma miRNA profiles show differences between HIV patients with and without sDSP, and several miRNA biomarkers are reported to be associated with the presence of sDSP in HIV patients (see [30]). The corresponding binary classification problem was analyzed in [30] using random forest classifiers. However, the HIV data set can be further classified into four classes. The HIV data set has 1715 miRNA measures for 40 patients and is partitioned into four groups (*K* = 4) with $n\_k = 10$ patients in each category: non-HIV, HIV with no brain damage (HIVNBD), HIV with brain damage but stable (HIVBDS), and HIV with brain damage and unstable (HIVBDU). In the following analysis, we apply our proposed method to this classification problem. The primary aim is to identify critical miRNA biomarkers for each of the four groups. Beyond achieving a finer classification, this analysis is helpful in assessing related pathogenic effects for each patient group.

Given the small sample size of *N* = 40, we chose the tuning parameter *λ* by maximizing leave-one-out cross validation accuracy. We fixed (*q*, *τ*) = (1, 0.1). Table 3 shows the signs of the coefficient estimates obtained from the SGL-MgDWD method using the selected *λ*. We conclude that there are 22 critical miRNA biomarkers important to the classification problem. In particular, the biomarkers miR-25-star, miR-3171, miR-3924 and miR-4307 are not relevant to the non-HIV group; miR-4641, miR-4655-3p and miR-660 are not relevant to the HIVNBD group; miR-217 and miR-4683 are not relevant to the HIVBDS group; and miR-217 and miR-4307 are not relevant to the HIVBDU group.

**Table 1.** Simulation results for the SGL-MgDWD, *L*<sup>1</sup> -logistic, and OVR methods with *K* = 5. Time is measured relative to a baseline logistic regression model with *K* = 5, *p* = 100, and *N* = 50. Numbers in parentheses denote standard deviations.



**Table 2.** Simulation results for the SGL-MgDWD, *L*<sup>1</sup> -logistic, and OVR methods with *K* = 11. Time is measured relative to a baseline logistic regression model with *K* = 5, *p* = 100, and *N* = 50. Numbers in parentheses denote standard deviations.

**Table 3.** Signs of the coefficient estimates obtained from the SGL-MgDWD method with (*q*, *τ*) = (1, 0.1) for the HIV data set. The symbols "+" and "−" denote positive and negative coefficient estimates, respectively, while "0" denotes a zero coefficient (i.e., an irrelevant variable).




**Author Contributions:** Conceptualization, L.K. and N.T.; Methodology, T.S., L.K. and N.T.; Formal Analysis, Y.L.; Data Curation, W.G.B., E.A. and C.P.; Writing—Review & Editing, Y.W., B.J. and L.K.; Supervision, B.J., L.K. and N.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** A Canadian Institutes of Health Research Team Grant and Canadian HIV-Ageing Multidisciplinary Programmatic Strategy (CHAMPS) in NeuroHIV (Christopher Power) supported these studies. Bei Jiang and Linglong Kong were supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Christopher Power and Linglong Kong were supported by Canada Research Chairs in Neurological Infection and Immunity and Statistical Learning, respectively. Niansheng Tang was supported by grants from the National Natural Science Foundation of China (grant number: 11671349) and the Key Projects of the National Natural Science Foundation of China (grant number: 11731011).

**Acknowledgments:** The authors are thankful for the invitation of the two guest editors, Farouk Nathoo and Ejaz Ahmed. This work has also benefited from two anonymous reviewers' constructive comments and valuable feedback. The authors also thank Matthew Pietrosanu for his great help with editing.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Proofs**

#### *Appendix A.1. Proof of Lemma 1*

**Proof.** For simplicity, we write $p\_j = \Pr(y = j \mid \mathbf{X})$ and $f\_k = f\_k(\mathbf{X})$. Using the Lagrange multiplier method, we define

$$L(\mathbf{F}) = \mathbb{E}\left\{\sum\_{k=1}^{K} \mathbb{1}\{y=k\} \phi\_{q}(f\_k) \,\Big|\,\mathbf{X} = u\right\} + \mu \mathbf{1}\_{K}^{\top} \mathbf{F} = \sum\_{k=1}^{K} \left\{p\_{k} \phi\_{q}(f\_{k}) + \mu f\_{k}\right\}.$$

Then for each *k*,

$$\frac{\partial L(\mathbf{F})}{\partial f\_k} = \phi\_q'(f\_k)p\_k + \mu = 0 \tag{A1}$$

with

$$\phi\_q'(f\_k) = \begin{cases} -1, & f\_k \le Q, \\ -(Qf\_k^{-1})^q, & f\_k > Q. \end{cases}$$

Without loss of generality, assume that $p\_1 > p\_2 \ge p\_3 \ge \cdots \ge p\_{K-1} > p\_K$. Note that $-1 \le \phi\_q' < 0$, and so $p\_k \ge -\phi\_q'(f\_k)p\_k = \mu > 0$, with $\mu = p\_k$ if and only if $f\_k \le Q$.

If $\mu < p\_K < p\_k$, then $p\_K \neq \mu$, so $f\_K > Q$, which implies that $f\_k \ge f\_K > Q$ for all $1 \le k \le K$. Hence, substituting $\phi\_q'(f\_k) = -(Qf\_k^{-1})^q$ into (A1) yields

$$f\_k = \mathbb{Q}\sqrt[q]{p\_k\mu^{-1}} > \mathbb{Q} > 0.$$

However, then $\sum\_{k=1}^{K} f\_k > 0$, contradicting the sum-to-zero constraint. Therefore, $\mu = p\_K < p\_k$ for $k < K$ and the result follows.

*Appendix A.2. Proof of Theorem 1*

**Lemma A1.** *Under (C1), $\mathcal{L}(\vartheta)$ exists and is convex in $\vartheta$.*

**Proof.** The existence of L(*ϑ*) will be satisfied if

$$\mathbb{E}\_{\mathbf{X}|y}\{|\phi\_{\boldsymbol{q}}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k})| \mid \boldsymbol{y} = k\} = \int\_{\mathcal{X}} |\phi\_{\boldsymbol{q}}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k})| \mathcal{g}\_{k}(\mathbf{X}) \mathbf{dX} < \infty.$$

We divide $\mathcal{X}$ into two disjoint subsets. Defining $\mathcal{X}\_k = \{\mathbf{X} \in \mathcal{X} \mid \mathbf{Z}^{\top}\theta\_k > Q\}$, it is clear that

$$\int\_{\mathcal{X}\_k} |\phi\_q(\mathbf{Z}^{\top} \theta\_k)| g\_k(\mathbf{X}) d\mathbf{X} \le (q+1)^{-1} \int\_{\mathcal{X}\_k} g\_k(\mathbf{X}) d\mathbf{X} < \infty.$$

Note that $0 < \phi\_q(u) < (1 + q)^{-1} < 1$ when $u > Q$. On the other hand, for $\mathcal{X}\_k^c = \{\mathbf{X} \in \mathcal{X} \mid \mathbf{Z}^{\top}\theta\_k \le Q\}$,

$$\int\_{\mathcal{X}\_k^{c}} |\phi\_q(\mathbf{Z}^{\top} \theta\_k)| g\_k(\mathbf{X}) d\mathbf{X} \leq |1 - a\_k| + \sum\_{j=1}^p |b\_{jk}| \int\_{\mathcal{X}} |x\_j| g\_k(\mathbf{X}) d\mathbf{X} < \infty$$

if $\mathbb{E}\_{\mathbf{X}|y}\{|x\_j| \mid y = k\} < \infty$ for all $k \in \mathcal{Y}$. This completes the proof of the existence of $\mathcal{L}(\vartheta)$. Recall that

$$\mathcal{L}(\mathfrak{d}) = \sum\_{k=1}^{K} \pi\_k \int\_{\mathcal{X}} \phi\_q(\mathbf{Z}^{\top} \theta\_k) g\_k(\mathbf{X}) d\mathbf{X},$$

where *φq*(*u*) is a convex function of *u*, so its composition with the affine mapping *u* = *Z* ⊤*θ<sup>k</sup>* is still convex in *θ<sup>k</sup>* . Clearly, *g<sup>k</sup>* (*X*), *π<sup>k</sup>* > 0, so the non-negatively-weighted integral and sum both preserve convexity.

**Lemma A2.** *A minimizer of $\mathcal{L}(\vartheta)$ over $\mathcal{C} = \left\{\vartheta \in \mathbb{R}^{K(p+1)} \mid \mathbf{C}\vartheta = \mathbf{0}\_{p+1}\right\}$ exists, where $\mathbf{C} = \mathbf{1}\_K^{\top} \otimes \mathbf{I}\_{p+1}$.*

**Proof.** By Jensen's inequality, for any $\vartheta \in \mathcal{C}$, we have that

$$\mathcal{L}(\boldsymbol{\theta}) \ge \phi\_q \left( \sum\_{k=1}^K \pi\_k \mathbb{E} \{ \mathbf{Z}^\top \boldsymbol{\theta}\_k | \boldsymbol{y} = k \} \right).$$

Let $\mu = \operatorname{vec}\left((\pi\_k \mathbb{E}\{z\_j \mid y = k\})\_{jk}\right)$, where $\|\mu\|\_2 \ge \left(\sum\_{k=1}^{K} \pi\_k^2\right)^{1/2} \ge K^{-1/2} > 0$. For some $C > 0$, we have that

$$\begin{split} \mathcal{L}(\boldsymbol{\theta}) \ge & \boldsymbol{\phi}\_{q}(\boldsymbol{\mu}^{\top}\boldsymbol{\theta}) = \mathbbm{1}\{\boldsymbol{\mu}^{\top}\boldsymbol{\theta} < Q\}(1 - \boldsymbol{\mu}^{\top}\boldsymbol{\theta}) + \mathbbm{1}\{\boldsymbol{\mu}^{\top}\boldsymbol{\theta} \ge Q\}\boldsymbol{\phi}\_{q}(\boldsymbol{\mu}^{\top}\boldsymbol{\theta}) \\ \ge & \mathbbm{1}\{\boldsymbol{\mu}^{\top}\boldsymbol{\theta} < Q\}[1 - |\boldsymbol{\mu}^{\top}\boldsymbol{\theta}|] \\ = & \mathbbm{1}\{\boldsymbol{\mu}^{\top}\boldsymbol{\theta} < -(\mathbb{C} + 1)\}(|\boldsymbol{\mu}^{\top}\boldsymbol{\theta}| - 1) + \mathbbm{1}\{-(\mathbb{C} + 1) < \boldsymbol{\mu}^{\top}\boldsymbol{\theta} < -1\}(|\boldsymbol{\mu}^{\top}\boldsymbol{\theta}| - 1) \\ & + \mathbbm{1}\{-1 < \boldsymbol{\mu}^{\top}\boldsymbol{\theta} < Q\}(1 - |\boldsymbol{\mu}^{\top}\boldsymbol{\theta}|) \\ > & \mathbbm{1}\{||\boldsymbol{\mu}||\_{2}||\boldsymbol{\theta}||\_{2} > \mathbb{C} + 1\}\mathbb{C} \\ = & \mathbbm{1}\{||\boldsymbol{\theta}\||\_{2} > \frac{\mathbb{C} + 1}{||\boldsymbol{\mu}||\_{2}}\}\mathbb{C}. \end{split}$$

Note that $1 - \mu^{\top}\vartheta > 1 - Q > 0$ when $\mu^{\top}\vartheta < Q$. By the Cauchy–Schwarz inequality, $-\mu^{\top}\vartheta = |\mu^{\top}\vartheta| \le \|\mu\|\_2\|\vartheta\|\_2$.

Hence, if $\|\vartheta\|\_2 > (C+1)/\|\mu\|\_2 > 0$, then $\mathcal{L}(\vartheta) > C > 0$. The contrapositive of this result implies the existence of a minimizer of the unconstrained problem. That is, the closed set $\left\{\vartheta \in \mathcal{C} \mid \mathcal{L}(\vartheta) \le C\right\}$ is bounded for some large enough *C*. This guarantees the existence of a solution, as desired.

**Lemma A3.** *Under (C1),* S(*ϑ*) *exists and*

$$\frac{\partial \mathcal{L}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \mathcal{S}(\boldsymbol{\theta}).$$

**Proof.** The existence of S(*ϑ*) will follow if

$$\int\_{\mathcal{X}} |\phi\_q'(\mathbf{Z}^\top \theta\_k) z\_j| \pi\_k \mathbf{g}\_k(\mathbf{X}) \mathbf{dX} \le \pi\_k \int\_{\mathcal{X}} |z\_j| \mathbf{g}\_k(\mathbf{X}) \mathbf{dX} < \infty$$

for $j = 1, \ldots, p + 1$. Note that $|\phi\_q'(u)| \le 1$ when $u > Q$.

For every $\theta\_{jk} \in \mathbb{R}$, $\phi\_q(\mathbf{Z}^{\top}\theta\_k)$ is a Lebesgue integrable function of $\mathbf{X}$. For any $u \in \mathbb{R}$, $\phi\_q'(u)$ exists and $|\phi\_q'(u)| \le 1$. Hence, by the Leibniz integral rule, we have that

$$\begin{split} \frac{\partial}{\partial \theta\_{jk}} \int\_{\mathcal{X}} \phi\_{q}(\mathbf{Z}^{\top} \theta\_{k}) \pi\_{k} g\_{k}(\mathbf{X}) d\mathbf{X} &= \int\_{\mathcal{X}} \frac{\partial \phi\_{q}(\mathbf{Z}^{\top} \theta\_{k})}{\partial \theta\_{jk}} \pi\_{k} g\_{k}(\mathbf{X}) d\mathbf{X} \\ &= \int\_{\mathcal{X}} \phi\_{q}^{\prime}(\mathbf{Z}^{\top} \theta\_{k}) z\_{j} \pi\_{k} g\_{k}(\mathbf{X}) d\mathbf{X} \end{split}$$

and, for any $l \neq k$,

$$\frac{\partial}{\partial \theta\_{jl}} \int\_{\mathcal{X}} \phi\_q(\mathbf{Z}^{\top} \theta\_k) \pi\_k g\_k(\mathbf{X}) d\mathbf{X} = 0,$$

which is sufficient to show that

$$\frac{\partial \mathcal{L}(\mathfrak{d})}{\partial \mathfrak{d}} = \mathcal{S}(\mathfrak{d}).$$

**Lemma A4.** *Suppose (C1) is satisfied. Then (C2) implies that $\mathbf{b}\_k^\* \neq \mathbf{0}\_p$.*

**Proof.** We can rewrite $\phi\_q(u)$ as

$$\begin{split} \phi\_{q}(u) &= \mathbb{1}\{u \le Q\}(1-u) + \mathbb{1}\{u > Q\}(1-Q)\left(\frac{Q}{u}\right)^{q} \\ &= \left\{-\mathbb{1}\{u \le Q\} - \mathbb{1}\{u > Q\}\left(\frac{Q}{u}\right)^{q+1}\right\}u + \mathbb{1}\{u \le Q\} + \mathbb{1}\{u > Q\}\left(\frac{Q}{u}\right)^{q} \\ &= \phi\_{q}'(u)u + \mathbb{1}\{u \le Q\} + \mathbb{1}\{u > Q\}\left(\frac{Q}{u}\right)^{q}. \end{split}$$

Then for any $\gamma \in \mathbb{R}^{p+1}$ and its corresponding $\mathcal{X}\_k = \{\mathbf{X} \in \mathcal{X} \mid \mathbf{Z}^\top\gamma > Q\}$, we have that

$$\begin{split} &\mathbb{E}\{\mathbb{1}\{y=k\}\phi\_{q}(\mathbf{Z}^{\top}\boldsymbol{\gamma})\} \\ &= \mathbb{E}\{\mathbb{1}\{y=k\}\phi\_{q}^{\prime}(\mathbf{Z}^{\top}\boldsymbol{\gamma})\mathbf{Z}^{\top}\boldsymbol{\gamma}\} + \mathbb{E}\{\mathbb{1}\{y=k,\ \mathbf{Z}^{\top}\boldsymbol{\gamma}\leq Q\}\} + \mathbb{E}\Big\{\mathbb{1}\{y=k,\ \mathbf{Z}^{\top}\boldsymbol{\gamma} > Q\}\Big(\frac{Q}{\mathbf{Z}^{\top}\boldsymbol{\gamma}}\Big)^{q}\Big\} \\ &= \mathbf{S}\_{k}^{\top}(\boldsymbol{\gamma})\boldsymbol{\gamma} + \Pr\{y=k,\ \mathbf{X}\notin\mathcal{X}\_{k}\} + \mathbb{E}\Big\{\mathbb{1}\{y=k,\ \mathbf{X}\in\mathcal{X}\_{k}\}\Big(\frac{Q}{\mathbf{Z}^{\top}\boldsymbol{\gamma}}\Big)^{q}\Big\} \\ &= \mathbf{S}\_{k}^{\top}(\boldsymbol{\gamma})\boldsymbol{\gamma} + \pi\_{k}\Big(1 - \mathbb{E}\Big\{\mathbb{1}\{\mathbf{X}\in\mathcal{X}\_{k}\}\Big\{1 - \Big(\frac{Q}{\mathbf{Z}^{\top}\boldsymbol{\gamma}}\Big)^{q}\Big\}\ \Big|\ y = k\Big\}\Big). \end{split}$$

Let $\vartheta^\* \in \mathcal{C}$ be a local minimizer. It follows that $\mathbf{P}\mathcal{S}(\vartheta^\*) = \mathbf{0}$ and $\sum\_{k=1}^K \mathbf{S}\_k^\top(\boldsymbol{\theta}\_k^\*)\boldsymbol{\theta}\_k^\* = \mathcal{S}^\top(\vartheta^\*)\vartheta^\* = 0$, since $\vartheta^\* = \mathbf{P}\vartheta^\*$ and $\mathbf{P} = (\mathbf{I}\_K - K^{-1}\mathbf{1}\_K\mathbf{1}\_K^\top) \otimes \mathbf{I}\_{p+1}$. Therefore,

$$\begin{split} \mathcal{L}(\vartheta^\*) &= \mathbb{E}\{\mathbb{1}\{y=k\}\phi\_q(\mathbf{Z}^\top\boldsymbol{\theta}\_k^\*)\} \\ &= \sum\_{k=1}^K \pi\_k\Big(1 - \mathbb{E}\Big\{\mathbb{1}\{\mathbf{X}\in\mathcal{X}\_k^\*\}\Big\{1-\Big(\frac{Q}{\mathbf{Z}^\top\boldsymbol{\theta}\_k^\*}\Big)^q\Big\}\ \Big|\ y=k\Big\}\Big) \\ &= \sum\_{k=1}^K \pi\_k\Big(1 - \Pr\{\mathbf{X}\in\mathcal{X}\_k^\*\mid y=k\}\,\mathbb{E}\Big\{1-\Big(\frac{Q}{\mathbf{Z}^\top\boldsymbol{\theta}\_k^\*}\Big)^q\ \Big|\ y=k,\ \mathbf{X}\in\mathcal{X}\_k^\*\Big\}\Big). \end{split} \tag{A2}$$

For any $\gamma \in \mathbb{R}^{p+1}$ and its corresponding $\mathcal{X}\_k = \{\mathbf{X} \in \mathcal{X} \mid \mathbf{Z}^\top\gamma > Q\}$, we always have that

$$0 < \mathbb{E}\left\{ \left( \frac{Q}{\mathbf{Z}^{\top}\boldsymbol{\gamma}} \right)^{q} \, \Big|\, y = k,\ \mathbf{X} \in \mathcal{X}\_{k} \right\} < 1.$$

If $\gamma = \mathbf{0}\_{p+1}$, then $\mathcal{X}\_k = \emptyset$, so that $\Pr\{y=k,\ \mathbf{X}\notin\mathcal{X}\_k\} = \pi\_k$ and $\Pr\{y=k,\ \mathbf{X}\in\mathcal{X}\_k\} = 0$. If $\gamma\_1 \le Q$ and $\gamma\_{/1} = \mathbf{0}\_p$, then $\mathcal{X}\_k = \emptyset$, giving the same conclusions as in the previous case. If $\gamma\_1 > Q$ and $\gamma\_{/1} = \mathbf{0}\_p$, then $\mathcal{X}\_k = \mathcal{X}$, so that $\Pr\{y=k,\ \mathbf{X}\notin\mathcal{X}\_k\} = 0$ and $\Pr\{y=k,\ \mathbf{X}\in\mathcal{X}\_k\} = \pi\_k$. Consequently, when $0 < \Pr\{\mathbf{X}\in\mathcal{X}\_k \mid y=k\} < 1$, neither $\mathcal{X}\_k$ nor its complement in $\mathcal{X}$ is empty, and $\boldsymbol{b}\_k \ne \mathbf{0}$ follows.

Note that $\Pr\{\mathbf{X}\notin\mathcal{X}\_k \mid y=k\} > 0$ implies that $\Pr\{0 < \mathbf{Z}^\top\gamma \le Q \mid y=k\} > 0$ or $\Pr\{\mathbf{Z}^\top\gamma \le 0 \mid y=k\} > 0$, so special attention should be paid to bounded random variables.

**Lemma A5.** *Under (C1),* H(*ϑ*) *exists and*

$$
\frac{
\partial^2
\mathcal{L}(\mathfrak{d})
}{
\partial\mathfrak{d}
\partial\mathfrak{d}^\top
} = \mathcal{H}(\mathfrak{d}).
$$

*Furthermore, $\mathcal{H}(\vartheta^\*) \succ \mathbf{O}\_{K(p+1)}$ when (C2) and (C3) hold.*

**Proof.** The existence of $\mathcal{H}(\vartheta)$ follows if all of its entries are absolutely integrable, that is, for any $j, l = 1, \dots, p + 1$,

$$\int\_{\mathcal{X}} |\mathbb{1}\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k} > Q\}\phi\_{q}^{\prime\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k})z\_{j}z\_{l}|\,\pi\_{k}g\_{k}(\mathbf{X})\mathbf{dX} \le (q + q^{-1} + 2) \int\_{\mathcal{X}\_{k}} |z\_{j}z\_{l}|\,g\_{k}(\mathbf{X})\mathbf{dX} < \infty.$$

Equivalently, the result follows if $\mathbb{E}\_{\mathbf{X}|y}\{|z\_j z\_l| \mid y=k\} < \infty$ for all $k \in \mathcal{Y}$. Note that $0 < \phi\_q''(u) \le q + q^{-1} + 2$ when $u > Q$.
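
For concreteness, this bound on $\phi\_q''$ can be made explicit. The following is a quick sketch, assuming (as is standard for generalized DWD and as used implicitly later in this appendix) that $Q = q/(q+1)$. Differentiating the expression for $\phi\_q$ given above twice on $u > Q$ gives

$$\phi\_q''(u) = (1-Q)\,q(q+1)\,\frac{Q^{q}}{u^{q+2}}, \qquad u > Q,$$

which is positive, decreasing in $u$, and largest as $u \downarrow Q$, so that

$$\sup\_{u>Q}\phi\_q''(u) = (1-Q)\,\frac{q(q+1)}{Q^{2}} = \frac{(q+1)^{2}}{q} = q + q^{-1} + 2.$$

This also matches the Lipschitz constant $L\_q = (q+1)^2/q$ of $\phi\_q'$ used in Appendix A.5.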

Let $\eta$ be a test function belonging to the Schwartz space $\mathcal{D}$. Then $\eta' \in \mathcal{D}$, with support denoted by $\mathrm{supp}(\eta')$.

Clearly, $\phi\_q'(u)$ is not differentiable at $Q$ but is Lipschitz continuous. Therefore, the measurable function $\mathbf{S}\_k(\boldsymbol{\theta}\_k)$ is a locally integrable function of $\boldsymbol{\theta}\_k$, and the (regular) generalized functions $\mathbf{S}\_k(\boldsymbol{\theta}\_k)$ belong to the dual space of $\mathcal{D}$.

For the distributional derivative of $\mathbf{S}\_k(\boldsymbol{\theta}\_k)$ with respect to $\theta\_{jk}$, we have that

$$\begin{split} \left| \left< \frac{\partial \mathbf{S}\_k(\boldsymbol{\theta}\_k)}{\partial \boldsymbol{\theta}\_{jk}}, \eta(\boldsymbol{\theta}\_{jk}) \right> \right| &= \left| - \left< \mathbf{S}\_k(\boldsymbol{\theta}\_k), \frac{\mathbf{d}\eta(\boldsymbol{\theta}\_{jk})}{\mathbf{d}\boldsymbol{\theta}\_{jk}} \right> \right| \\ &\leq \int\_{\mathbb{R}} \left| \mathbf{S}\_k(\boldsymbol{\theta}\_k) \eta'(\boldsymbol{\theta}\_{jk}) \right| \, \mathrm{d}\boldsymbol{\theta}\_{jk} \\ &\leq \max\_{\boldsymbol{\theta}\_{jk} \in \mathrm{supp}(\eta')} |\eta'(\boldsymbol{\theta}\_{jk})| \int\_{\mathrm{supp}(\eta')} |\mathbf{S}\_k(\boldsymbol{\theta}\_k)| \, \mathrm{d}\boldsymbol{\theta}\_{jk} \\ &< \infty \end{split}$$

implying that the function $f(\theta\_{jk}, \mathbf{X}) = \phi\_q'(\mathbf{Z}^\top\boldsymbol{\theta}\_k)\mathbf{Z}\pi\_k g\_k(\mathbf{X})\eta'(\theta\_{jk})$ is integrable on $\mathbb{R} \times \mathcal{X}$. Therefore, by Fubini's theorem,

$$\begin{split} \left\langle \frac{\partial \mathbf{S}\_{k}(\boldsymbol{\theta}\_{k})}{\partial \theta\_{jk}}, \eta(\theta\_{jk}) \right\rangle &= - \left\langle \mathbf{S}\_{k}(\boldsymbol{\theta}\_{k}), \frac{\mathbf{d}\eta(\theta\_{jk})}{\mathbf{d}\theta\_{jk}} \right\rangle \\ &= \int\_{\mathcal{X}} - \left\langle \phi\_{q}^{\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}) \mathbf{Z}\pi\_{k} g\_{k}(\mathbf{X}), \frac{\mathbf{d}\eta(\theta\_{jk})}{\mathbf{d}\theta\_{jk}} \right\rangle \mathbf{dX} \\ &= \int\_{\mathcal{X}} \left\langle \frac{\partial \phi\_{q}^{\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k})}{\partial \theta\_{jk}} \mathbf{Z}\pi\_{k} g\_{k}(\mathbf{X}), \eta(\theta\_{jk}) \right\rangle \mathbf{dX} \\ &= \left\langle \mathbb{E}\left\{\frac{\partial \phi\_{q}^{\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k})}{\partial \theta\_{jk}} \mathbf{Z}\mathbb{1}\{y = k\} \right\}, \eta(\theta\_{jk}) \right\rangle, \end{split}$$

which implies that

$$\frac{\partial \mathbf{S}\_k(\boldsymbol{\theta}\_k)}{\partial \theta\_{jk}} = \mathbb{E}\left\{ \frac{\partial \phi'\_q(\mathbf{Z}^\top \boldsymbol{\theta}\_k)}{\partial \theta\_{jk}} \mathbf{Z} \mathbb{1} \{y = k\} \right\}.$$

Recall that $\phi\_q'$ can be written as

$$\phi\_q'(u) = \phi\_q'(u)\mathbb{1}\{u > Q\} + (-1)\mathbb{1}\{u \le Q\} = (\phi\_q'(u) + 1)\mathbb{1}\{u > Q\} - 1,$$

which contains a Schwartz product between the differentiable function $\phi\_q'(u)$ and the generalized function $\mathbb{1}\{u > Q\}$. Note that

$$\begin{aligned} \mathbb{1}\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k} > Q\} &= \mathbb{1}\{z\_{j} > 0, \ \theta\_{jk} > c\_{jk}\} + \mathbb{1}\{z\_{j} \le 0, \ \theta\_{jk} \le c\_{jk}\} \\ &= (2\mathbb{1}\{z\_{j} > 0\} - 1)\mathbb{1}\{\theta\_{jk} > c\_{jk}\} + (1 - \mathbb{1}\{z\_{j} > 0\}) \\ &= \text{sign}(z\_{j})\mathbb{1}\{\theta\_{jk} > c\_{jk}\} + \mathbb{1}\{z\_{j} \le 0\}, \end{aligned}$$

where $c\_{jk} = (Q - \sum\_{l\ne j} z\_l\theta\_{lk})/z\_j$ and

$$\begin{aligned} \frac{\partial \mathbb{1}\{\mathbf{Z}^\top \boldsymbol{\theta}\_k > Q\}}{\partial \theta\_{jk}} &= \text{sign}(z\_j)\,\delta(\theta\_{jk} - c\_{jk}) + 0 \\ &= \text{sign}(z\_j)\,|z\_j|\,\delta(\mathbf{Z}^\top \boldsymbol{\theta}\_k - Q) \\ &= z\_j\,\delta(\mathbf{Z}^\top \boldsymbol{\theta}\_k - Q), \end{aligned}$$

where $\delta(x)$ is the Dirac delta function, i.e., the distributional derivative of $\mathbb{1}\{x > 0\}$. Recall that $\delta(cx) = \delta(x)/|c|$ and $f(x)\delta(x - c) = f(c)\delta(x - c)$ for a constant $c$ and function $f$.

Thus, by the product rule for the distributional derivative of the Schwartz product,

$$\begin{split} \frac{\partial \phi\_{q}^{\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k})}{\partial \theta\_{jk}} &= \frac{\partial (\phi\_{q}^{\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}) + 1)}{\partial \theta\_{jk}} \mathbb{1}\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k} > Q\} + (\phi\_{q}^{\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}) + 1) \frac{\partial \mathbb{1}\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k} > Q\}}{\partial \theta\_{jk}} \\ &= \phi\_{q}^{\prime\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}) z\_{j} \mathbb{1}\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k} > Q\} + (\phi\_{q}^{\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}) + 1) z\_{j} \delta(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k} - Q) \\ &= \phi\_{q}^{\prime\prime}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}) z\_{j} \mathbb{1}\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k} > Q\}. \end{split}$$

Substituting the above expression, we obtain

$$\frac{\partial \mathbf{S}\_k(\boldsymbol{\theta}\_k)}{\partial \theta\_{jk}} = \mathbb{E}\left\{\phi\_q^{\prime\prime}(\mathbf{Z}^\top \boldsymbol{\theta}\_k) \mathbf{Z} z\_j \mathbb{1}\{\mathbf{Z}^\top \boldsymbol{\theta}\_k > Q\} \mathbb{1}\{y = k\}\right\}.$$

Similarly, for $l \ne k$, we have the distributional derivative

$$\frac{\partial \mathbf{S}\_k(\boldsymbol{\theta}\_k)}{\partial \boldsymbol{\theta}\_{l}} = \mathbf{0}.$$

Recall that the distributional derivative does not depend on the order of differentiation and agrees with the classical derivative whenever the latter exists. To summarize, we have that

$$H\_k(\boldsymbol{\theta}\_k) = \frac{\partial^2 \mathcal{L}(\boldsymbol{\vartheta})}{\partial \boldsymbol{\theta}\_k \partial \boldsymbol{\theta}\_k^\top} = \frac{\partial \mathbf{S}\_k(\boldsymbol{\theta}\_k)}{\partial \boldsymbol{\theta}\_k^\top}, \quad \mathcal{H}(\boldsymbol{\vartheta}) = \bigoplus\_{k=1}^K \mathbf{H}\_k(\boldsymbol{\theta}\_k).$$

The $\mathbf{H}\_k(\boldsymbol{\theta}\_k)$ are symmetric matrices, so $\mathcal{H}(\vartheta)$ is also symmetric.

In the sense of generalized functions, differentiation is a continuous operation with respect to convergence in $\mathcal{D}'$. Therefore, $\phi\_0' = \lim\_{q\to 0}\phi\_q' = -\mathbb{1}\{u \le 0\}$ and $\phi\_0'' = \lim\_{q\to 0}\phi\_q'' = \delta(u)$, while $\phi\_\infty' = \lim\_{q\to\infty}\phi\_q' = -\mathbb{1}\{u \le 1\}$ and $\phi\_\infty'' = \lim\_{q\to\infty}\phi\_q'' = \delta(u - 1)$, which coincides with the corresponding results for the hinge loss.

Next, $\mathcal{H}(\vartheta) \succ \mathbf{O}\_{K(p+1)}$ if and only if $\mathbf{H}\_1(\boldsymbol{\theta}\_1)$ and its Schur complement $\bigoplus\_{k=2}^K \mathbf{H}\_k(\boldsymbol{\theta}\_k)$ are both symmetric and positive definite. We can deduce that $\mathcal{H}(\vartheta) \succ \mathbf{O}\_{K(p+1)}$ if and only if $\mathbf{H}\_k(\boldsymbol{\theta}\_k) \succ \mathbf{O}\_{p+1}$ for all $k$.

Note that there exists $c > 0$ such that $\phi\_q''(\mathbf{Z}^\top\boldsymbol{\theta}\_k) \ge c$ on $\mathcal{X}\_k$. Then for any $\gamma \in \mathbb{R}^{p+1}$,

$$\begin{split} \gamma^\top \mathbf{H}\_k(\boldsymbol{\theta}\_k)\gamma &= \pi\_k \int\_{\mathcal{X}\_k} \phi\_q''(\mathbf{Z}^\top \boldsymbol{\theta}\_k) (\mathbf{Z}^\top \boldsymbol{\gamma})^2 g\_k(\mathbf{X}) \mathbf{dX} \\ &\geq c \Pr\{\mathbf{X} \in \mathcal{X}\_k, y = k\}\, \mathbb{E}\{ (\mathbf{Z}^\top \boldsymbol{\gamma})^2 \mid \mathbf{X} \in \mathcal{X}\_k, y = k \} \\ &\geq c \Pr\{\mathbf{X} \in \mathcal{X}\_k, y = k\} \left( \gamma\_0^2 + \gamma\_1^\top \mathrm{Var}\{\mathbf{X} \mid \mathbf{X} \in \mathcal{X}\_k, y = k\} \gamma\_1 \right), \end{split}$$

which implies that $\gamma^\top\mathbf{H}\_k(\boldsymbol{\theta}\_k)\gamma = 0$ if and only if $\gamma = \mathbf{0}\_{p+1}$, provided $\mathrm{Var}\{\mathbf{X}\mid \mathbf{X}\in\mathcal{X}\_k,\ y=k\}$ is non-singular. Assuming that $\mathrm{Var}\{\mathbf{X}\mid y=k\} \succ \mathbf{O}$ implies that $\mathrm{Var}\{\mathbf{X}\mid \mathbf{X}\in\mathcal{X}\_k,\ y=k\} \succ \mathbf{O}$.

**Proof of Theorem 1.** By Lemma A2, a minimizer $\vartheta^\* \in \mathcal{C}$ exists, with $\boldsymbol{b}\_k^\* \ne \mathbf{0}\_p$ (by Lemma A4) and $\mathcal{H}(\vartheta^\*) \succ \mathbf{O}\_{K(p+1)}$ (by Lemma A5). By the second-order Lagrange condition and the convexity of $\mathcal{L}(\vartheta)$ (by Lemma A1), the minimizer of the population MgDWD loss is unique.

Recall from (A2) that

$$\begin{split} \mathcal{L}(\vartheta^{\*}) &= \mathbb{E}\{\mathbb{1}\{y = k\} \phi\_{q}(\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*})\} \\ &= \sum\_{k=1}^{K} \pi\_{k} \Big(1 - \mathbb{E}\Big\{\mathbb{1}\{\mathbf{X} \in \mathcal{X}\_{k}^{\*}\} \Big\{1 - \Big(\frac{Q}{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*}}\Big)^{q}\Big\} \,\Big|\, y = k\Big\}\Big) \\ &= \sum\_{k=1}^{K} A(k,q)\,\pi\_{k}. \end{split}$$

It follows that

$$\begin{split} 0 \leq{}& \mathbb{E}\left\{ \mathbb{1}\{\mathbf{X} \in \mathcal{X}\_{k}^{\*}\} \left\{ 1 - \left( \frac{Q}{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*}} \right)^{q} \right\} \,\middle|\, y = k \right\} \\ <{}& \mathbb{E}\left\{ \mathbb{1}\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*} > 1 + q^{-1}\} + \mathbb{1}\{Q < \mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*} \leq 1 + q^{-1}\} \left\{ 1 - \left( \frac{Q}{1 + q^{-1}} \right)^{q} \right\} \,\middle|\, y = k \right\} \\ ={}& \Pr\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*} > Q \mid y = k\} - \Pr\{Q < \mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*} \leq Q^{-1} \mid y = k\}\, Q^{2q} \\ \leq{}& 1 \end{split}$$

and

$$\begin{split} 1 \geq{}& \mathbb{E}\left\{\mathbb{1}\{\mathbf{X}\in\mathcal{X}\_{k}^{\*}\}\left\{1-\left(\frac{Q}{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*}}\right)^{q}\right\}\,\middle|\,y=k\right\} \\ >{}& \mathbb{E}\left\{\mathbb{1}\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*}>1+\varepsilon\}\left\{1-\left(\frac{Q}{1+\varepsilon}\right)^{q}\right\}\,\middle|\,y=k\right\} \\ \geq{}& \sup\_{\varepsilon>0}\left\{1-\left(\frac{Q}{1+\varepsilon}\right)^{q}\right\}\Pr\{\mathbf{Z}^{\top}\boldsymbol{\theta}\_{k}^{\*}>1+\varepsilon\mid y=k\} \\ \geq{}& 0. \end{split}$$

Consequently, $0 \le u(k, q) \le A(k, q) \le v(k, q) \le 1$. Note that $\lim\_{q\to\infty}(1+\epsilon)^{-q}Q^q = \mathrm{e}^{-1}$ when $\epsilon = 0$ and $\lim\_{q\to\infty}(1+\epsilon)^{-q}Q^q = 0$ when $\epsilon > 0$. The difference between these two results is attributable to the pointwise nature of the convergence.
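
As a quick check of these limits (again assuming the generalized DWD constant $Q = q/(q+1)$),

$$\lim\_{q\to\infty} Q^{q} = \lim\_{q\to\infty}\Big(1+\frac{1}{q}\Big)^{-q} = \mathrm{e}^{-1},$$

while for any fixed $\epsilon > 0$ the factor $(1+\epsilon)^{-q}$ decays geometrically, so $(1+\epsilon)^{-q}Q^{q} \to 0$.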

Let $f\_m = 1 - A(k, m) \in \mathcal{D}'$ with $m = 1, 2, \dots$ and $\eta \in \mathcal{D}$. By Fubini's theorem and the dominated convergence theorem,

$$\begin{split} \lim\_{m \to \infty} \langle f\_{m}, \eta \rangle &= \lim\_{m \to \infty} \left\langle \mathbb{E} \left\{ \mathbb{1} \left\{ \mathbf{X} \in \mathcal{R}\_{k}^{\ast \*} \right\} \left( \frac{\mathbf{Q}}{\mathbf{Z}^{\top} \mathbf{\theta}\_{k}^{\*}} \right)^{q} \Big| y = k \right\}, \eta(\gamma) \right\rangle \\ &= \lim\_{m \to \infty} \mathbb{E} \left\{ \left\langle \mathbb{1} \left\{ \mathbf{X} \in \mathcal{R}\_{k}^{\ast \*} \right\} \left( \frac{\mathbf{Q}}{\mathbf{Z}^{\top} \mathbf{\theta}\_{k}^{\*}} \right)^{q}, \eta(\mathbf{\theta}\_{k}^{\*}) \right\rangle \Big| y = k \right\} \\ &= \mathbb{E} \left\{ \lim\_{m \to \infty} \left\langle \mathbb{1} \left\{ \mathbf{X} \in \mathcal{R}\_{k}^{\ast \*} \right\} \left( \frac{\mathbf{Q}}{\mathbf{Z}^{\top} \mathbf{\theta}\_{k}^{\*}} \right)^{q}, \eta(\mathbf{\theta}\_{k}^{\*}) \right\rangle \Big| y = k \right\} \\ &= 0 = \left\langle 0, \eta\left(\mathbf{\theta}\_{k}^{\*} \right) \right\rangle. \end{split}$$

Similarly,

$$\begin{split} \lim\_{m \to 0} \left< f\_{m}, \eta \right> &= \mathbb{E} \left\{ \lim\_{m \to 0} \left< \mathbb{1} \{ \mathbf{X} \in \mathcal{A}\_{k}^{\*} \} \left( \frac{\mathbf{Q}}{\mathbf{Z}^{\top} \mathbf{\theta}\_{k}^{\*}} \right)^{q}, \eta(\gamma) \right> \middle| y = k \right\}, \\ &= \mathbb{E} \left\{ \left< \mathbb{1} \{ \mathbf{Z}^{\top} \mathbf{\theta}\_{k}^{\*} > 0 \}, \eta(\gamma) \right> \middle| y = k \right\} \\ &= \left< \mathbb{E} \left\{ \mathbb{1} \{ \mathbf{Z}^{\top} \mathbf{\theta}\_{k}^{\*} > 0 \} \middle| y = k \right\}, \eta(\mathbf{\theta}\_{k}^{\*}) \right> \\ &= \left< \text{Pr} \{ \mathbf{Z}^{\top} \mathbf{\theta}\_{k}^{\*} > 0 \mid y = k \}, \eta(\mathbf{\theta}\_{k}^{\*}) \right>. \end{split}$$

Hence,

$$A(k,\infty) = \lim\_{q \to \infty} A(k,q) = \Pr\{X \notin \mathcal{X}\_k^\* \mid y = k\}, \text{ and } A(k,0) = \lim\_{q \to 0} A(k,q) = 1.$$

As a result, $A(k, \infty)$ coincides with the population hinge/SVM loss, and $A(k, 0)$ is independent of $\boldsymbol{\theta}\_k^\*$.

*Appendix A.3. Proof of Lemma 2*

**Proof.** By the definition of $\tilde{P}$,

$$\tilde{P}\{\mathbf{PS}(\mathfrak{d}^\*)\} = \tau \|\mathbf{PS}(\mathfrak{d}^\*)\|\_{\infty} + (1 - \tau) \max\_{j} \left\{ \|\mathbf{P}\_{K}\mathbf{S}(\mathfrak{a}^\*)\|\_{2},\ \|\mathbf{P}\_{K}\mathbf{S}(\mathfrak{f}\_j^\*)\|\_{2} \right\},$$

where

$$\mathbf{P}\_{K}\mathbf{S}(\mathfrak{a}^{\*}) = N^{-1}\mathbf{P}\_{K}\{\mathbf{E}\circ\phi\_{q}^{\prime}\{\mathbf{F}(\mathfrak{d}^{\*})\}\}^{\top}\mathbf{1}\_{N} = \frac{1}{N}\sum\_{i=1}^{N}\mathbf{P}\_{K}\,\text{diag}\{\mathbf{E}\_{i}\}\phi\_{q}^{\prime}(\mathbf{F}\_{i}^{\*}),$$

$$\mathbf{P}\_{K}\mathbf{S}(\mathfrak{f}\_{j}^{\*}) = N^{-1}\mathbf{P}\_{K}\{\mathbf{E}\circ\phi\_{q}^{\prime}\{\mathbf{F}(\mathfrak{d}^{\*})\}\}^{\top}\mathbf{x}\_{j} = \frac{1}{N}\sum\_{i=1}^{N}x\_{ij}\,\mathbf{P}\_{K}\,\text{diag}\{\mathbf{E}\_{i}\}\phi\_{q}^{\prime}(\mathbf{F}\_{i}^{\*}),$$

$\mathbf{P}\_K = (\mathbf{p}\_1, \dots, \mathbf{p}\_K)$ with $\mathbf{p}\_k = (p\_{lk})$, $p\_{lk} = \mathbb{1}\{l = k\} - K^{-1}$, and

$$\mathbb{E}\{\mathbf{P}\_{K}\mathbf{S}(\mathfrak{a}^\*)\} = \mathbf{P}\_{K}\mathcal{S}(\mathfrak{a}^\*) = \mathbf{0}\_{K}, \quad \mathbb{E}\{\mathbf{P}\_{K}\mathbf{S}(\mathfrak{f}\_{j}^\*)\} = \mathbf{P}\_{K}\mathcal{S}(\mathfrak{f}\_{j}^\*) = \mathbf{0}\_{K}.$$

Denoting

$$d\_{ik} = \mathbf{p}\_k^\top \text{diag}\{\mathbf{E}\_i\}\phi\_q'(\mathbf{F}\_i^\*) = \sum\_{l=1}^K \left( \mathbb{1}\{ y\_i = k \} - \frac{1}{K} \right) e\_{il}\,\phi\_q'(f\_{il}^\*),$$

we have that $|d\_{ik}| \le 1 - K^{-1}$. Note that the $d\_{ik}$, $i = 1, \dots, N$, are i.i.d. random variables with

$$\frac{1}{N} \sum\_{i=1}^{N} \mathbb{E}(d\_{ik}) = \mathfrak{p}\_k^\top \mathcal{S}(\mathfrak{a}^\*) = 0 \text{ and } \frac{1}{N} \sum\_{i=1}^{N} \mathbb{E}(d\_{ik} \mathfrak{x}\_{ij}) = \mathfrak{p}\_k^\top \mathcal{S}(\mathfrak{f}\_j^\*) = 0.$$

By Hoeffding's inequality, we have that

$$\Pr\left\{|p\_k^\top \mathbf{S}(\mathfrak{a}^\*)| > c\_1 \left(1 - \frac{1}{K}\right) \sqrt{\frac{2\log(pK)}{N}}\right\} \le 2(pK)^{-c\_1^2},\tag{A3}$$

where $c\_1 > 1$.
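
For completeness, the calculation behind (A3) is a direct application of Hoeffding's inequality to the i.i.d. variables $d\_{ik} \in [-(1-K^{-1}),\ 1-K^{-1}]$ (a short sketch):

$$\Pr\left\{\Big|\frac{1}{N}\sum\_{i=1}^N d\_{ik}\Big| > t\right\} \le 2\exp\left\{-\frac{N t^2}{2(1-K^{-1})^2}\right\},$$

and substituting $t = c\_1(1-K^{-1})\sqrt{2\log(pK)/N}$ makes the exponent equal to $-c\_1^2\log(pK)$, which yields the stated bound $2(pK)^{-c\_1^2}$.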

Regarding the $d\_{ik}x\_{ij}$, we have that

$$\mathbb{E}\exp\{d\_{ik}x\_{ij}\} \le \mathbb{E}\exp\{(1-K^{-1})|x\_{ij}|\} \le \exp\{4(1-K^{-1})^2\varsigma\_1^2\kappa^2\},$$

which implies that the $d\_{ik}x\_{ij}$ are $N$ independent sub-Gaussian random variables with variance proxy $(1 - K^{-1})^2\varsigma\_1^2\kappa^2$. Taking $c\_1 > 1$, we have that

$$\Pr\left\{|p\_k^\top \mathcal{S}(\mathfrak{f}\_j^\*)| > c\_1 \varsigma\_1 \kappa \left(1 - \frac{1}{K}\right) \sqrt{\frac{2\log(pK)}{N}}\right\} \le 2(pK)^{-c\_1^2}.\tag{A4}$$

Then by (A3) and (A4),

$$\Pr\left\{\max\_{j}\left\{|\mathbf{p}\_{k}^{\top}\mathbf{S}(\mathfrak{a}^{\*})|,\ |\mathbf{p}\_{k}^{\top}\mathbf{S}(\mathfrak{f}\_{j}^{\*})|\right\}>\Lambda\_{1}\right\}\leq 2(pK)^{-c\_{1}^{2}}\tag{A5}$$

with

$$\Lambda\_1 = \max\{\varsigma\_1 \kappa, 1\}\, c\_1 \left(1 - \frac{1}{K}\right) \sqrt{\frac{2\log(pK)}{N}}.$$

Taking a union bound over the entries of $\mathbf{P}\mathbf{S}(\vartheta^\*)$ yields that

$$\begin{split} \Pr\{\|\mathbf{PS}(\boldsymbol{\theta}^{\*})\|\_{\infty} \geq \Lambda\_{1} \} &= \Pr\left\{ \max\_{j,k} \left\{ \left| \frac{1}{N} \sum\_{i=1}^{N} p\_{k}^{\top} \mathbf{S}(\boldsymbol{\mathfrak{a}}^{\*}) \right| , \left| \frac{1}{N} \sum\_{i=1}^{N} p\_{k}^{\top} \mathbf{S}(\boldsymbol{\mathfrak{f}}\_{i}^{\*}) \right| \right\} \geq \Lambda\_{1} \right\} \\ &\leq 2\mathbf{K}(p+1)(\mathbf{K}p)^{-c\_{1}^{2}} .\end{split}$$

On one hand,

$$\|\mathbf{P}\,\text{diag}\{\mathbf{E}\_{i}\}\phi\_q'(\mathbf{F}\_{i}^\*)\|\_2^2 = \|(\mathbf{E}\_{i} - K^{-1}) \circ \phi\_q'(\mathbf{F}\_{i}^\*)\|\_2^2 \le \sum\_{l=1}^{K} (e\_{il} - K^{-1})^2 \cdot 1 = 1 - K^{-1},$$

so for any $\gamma \in \mathbb{R}^{K}$,

$$|\gamma^\top \mathbf{P} \text{diag}\{\mathbf{E}\_i\} \phi\_q'(\mathbf{F}\_i^\*)| \le ||\gamma||\_2 \sqrt{1 - \frac{1}{K}}$$

and $\mathbb{E}\{\gamma^\top\mathbf{P}\,\text{diag}\{\mathbf{E}\_i\}\phi\_q'(\mathbf{F}\_i^\*)\} = 0$. Applying Hoeffding's lemma,

$$\mathbb{E}\exp\{\gamma^\top \mathbf{P}\_{K} \mathbf{S}(\mathfrak{a}^\*)\} = \prod\_{i=1}^N \mathbb{E}\exp\left\{\frac{1}{N} \gamma^\top \mathbf{P}\_{K}\, \text{diag}\{\mathbf{E}\_i\} \phi\_q^\prime(\mathbf{F}\_i^\*)\right\} \leq \exp\left\{\frac{\|\gamma\|\_2^2}{2N} \left(1 - \frac{1}{K}\right)\right\}.$$
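
As a side check of the norm bound used above (a short sketch, using that $\mathbf{E}\_i$ is a one-hot class-indicator vector and $|\phi\_q'| \le 1$),

$$\sum\_{l=1}^{K}\big(e\_{il} - K^{-1}\big)^2 = \Big(1-\frac{1}{K}\Big)^2 + (K-1)\,\frac{1}{K^2} = 1 - \frac{1}{K}.$$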

Applying a square root to Theorem 2.1 of [31], with $c\_2 > 1$, we have that

$$\Pr\left\{ \|\mathbf{PS}(\mathfrak{a}^\*)\|\_2 \ge \sqrt{\frac{K-1}{N}} + c\_2 \sqrt{\left(1 - \frac{1}{K}\right) \frac{2\log(p)}{N}} \right\} \le p^{-c\_2^2}.\tag{A6}$$

On the other hand, since the $x\_{ij}$ are $N$ independent sub-Gaussian random variables with variance proxy $\varsigma\_1^2\kappa^2$,

$$\begin{split} \mathbb{E}\exp\{\gamma^{\top}\mathbf{P}\_K\mathbf{S}(\boldsymbol{\beta}^{\*}\_{j})\} &= \prod\_{i=1}^{N} \mathbb{E}\exp\left\{\frac{x\_{ij}}{N}\, \gamma^{\top}\mathbf{P}\_K\,\text{diag}\{\mathbf{E}\_{i}\}\phi^{\prime}\_{q}(\mathbf{F}^{\*}\_{i})\right\} \\ &\leq \prod\_{i=1}^{N} \mathbb{E}\exp\left\{\sqrt{1-\frac{1}{K}}\,\frac{\|\gamma\|\_{2}}{N}|x\_{ij}|\right\} \\ &\leq \exp\left\{\frac{\|\gamma\|\_{2}^{2}}{2}\left(1-\frac{1}{K}\right)\frac{8\varsigma\_{1}^{2}\kappa^{2}}{N}\right\} \end{split}$$

and $\mathbb{E}\{\mathbf{P}\_K\mathbf{S}(\boldsymbol{\beta}\_j^\*)\} = \mathbf{0}\_K$. Similarly, we have that

$$\Pr\left\{ \|\mathbf{PS}(\mathfrak{f}\_{j}^{\*})\|\_{2} \geq 2\sqrt{2}\varsigma\_{1}\kappa \left\{ \sqrt{\frac{K-1}{N}} + c\_{2}\sqrt{\left(1-\frac{1}{K}\right)\frac{2\log(p)}{N}} \right\} \right\} \leq p^{-c\_{2}^{2}} \tag{A7}$$

for a constant $c\_2 > 1$.

Therefore, by (A6) and (A7),

$$\Pr\left\{\max\_{j}\left\{||\mathbf{PS}(\mathfrak{a}^\*)||\_{2}, ||\mathbf{PS}(\mathfrak{f}\_j^\*)||\_{2}\right\} \geq \Lambda\_2\right\} \leq p^{-c\_2^2}$$

with

$$\Lambda\_2 = \max\{2\sqrt{2}\zeta\_1\kappa, 1\} \left\{ \sqrt{\frac{K-1}{N}} + c\_2 \sqrt{\left(1 - \frac{1}{K}\right) \frac{2\log(p)}{N}} \right\}.$$

Applying the union bound to (A5), it follows that

$$\Pr\left\{\tilde{P}\{\mathbf{PS}(\mathfrak{d}^\*)\} \ge \tau \Lambda\_1 + (1 - \tau)\Lambda\_2\right\} \le 2K(p + 1)(pK)^{1 - c\_1^2} + p^{1 - c\_2^2},$$

and the desired result follows.

*Appendix A.4. Proof of Theorem 2*

**Lemma A6.** *Suppose that $\lambda = c\_0\sqrt{\log(pK)/N}$. Then $\hat{\vartheta} - \vartheta^\* \in \mathcal{W}$, where*

$$\mathcal{W} = \left\{ \delta \in \mathbb{R}^{K(p+1)} \;\middle|\; \frac{\tau}{1-\tau} \|\delta\_{\mathcal{E}\_{+}}\|\_{1} + \sum\_{j \in \mathcal{G}\_{+}} \|\delta\_{j}\|\_{2} \geq C\_{0} \left( \frac{\tau}{1-\tau} \|\delta\_{\mathcal{E}^{c}}\|\_{1} + \sum\_{j \notin \mathcal{G}} \|\delta\_{j}\|\_{2} \right) \right\},$$

*$C\_0 = (c\_0-1)/(c\_0+1)$, $\mathcal{E}^c$ denotes the complement of $\mathcal{E}$, $\mathcal{E}\_+ = \mathcal{E} \cup \{l = 1 + (k-1)(p+1) \mid k = 1, \dots, K\}$, and $\mathcal{G}\_+ = \mathcal{G} \cup \{0\}$.*

**Proof.** Since $\hat{\vartheta} = \vartheta^\* + \delta$ is the minimizer, we have that

$$\begin{aligned} L(\vartheta^\*) + \lambda P(\beta^\*) &\geq L(\hat{\vartheta}) + \lambda P(\hat{\beta}),\\ \lambda \left\{ P(\beta^\*) - P(\beta^\* + \tilde{\delta}) \right\} &\geq L(\vartheta^\* + \delta) - L(\vartheta^\*), \end{aligned} \tag{A8}$$

where $\beta^\*$ is the vector $\vartheta^\*$ without the $a\_k$ components, and $\tilde{\delta}$ is defined from $\delta$ analogously. Then

$$\begin{split} P(\beta^{\*}) - P(\beta^{\*} + \tilde{\delta}) &= \tau \big( \|\beta^{\*}\_{\mathcal{E}}\|\_{1} - \|\beta^{\*}\_{\mathcal{E}} + \tilde{\delta}\_{\mathcal{E}}\|\_{1} - \|\tilde{\delta}\_{\mathcal{E}^{c}}\|\_{1} \big) \\ &\quad + (1 - \tau) \big( \sum\_{j \in \mathcal{G}} \|\beta\_{j}^{\*}\|\_{2} - \sum\_{j \in \mathcal{G}} \|\beta\_{j}^{\*} + \delta\_{j}\|\_{2} - \sum\_{j \notin \mathcal{G}} \|\delta\_{j}\|\_{2} \big) \\ &\leq \tau \big( \|\tilde{\delta}\_{\mathcal{E}}\|\_{1} - \|\tilde{\delta}\_{\mathcal{E}^{c}}\|\_{1} \big) + (1 - \tau) \big( \sum\_{j \in \mathcal{G}} \|\delta\_{j}\|\_{2} - \sum\_{j \notin \mathcal{G}} \|\delta\_{j}\|\_{2} \big) \\ &\leq \tau \big( \|\delta\_{\mathcal{E}\_{+}}\|\_{1} - \|\delta\_{\mathcal{E}^{c}}\|\_{1} \big) + (1 - \tau) \big( \sum\_{j \in \mathcal{G}\_{+}} \|\delta\_{j}\|\_{2} - \sum\_{j \notin \mathcal{G}} \|\delta\_{j}\|\_{2} \big). \end{split}$$

By the convexity of *L*,

$$L(\vartheta^\* + \delta) - L(\vartheta^\*) \ge \langle \mathbf{S}(\vartheta^\*), \delta \rangle \ge -\tilde{P}\{\mathbf{P}\mathbf{S}(\vartheta^\*)\}\, P(\delta) \ge -\frac{\lambda}{c\_0} P(\delta).$$

Note that

$$P(\delta) = \tau \left( \|\delta\_{\mathcal{E}\_+}\|\_1 + \|\delta\_{\mathcal{E}^c}\|\_1 \right) + (1 - \tau) \left( \sum\_{j \in \mathcal{G}\_+} \|\delta\_j\|\_2 + \sum\_{j \notin \mathcal{G}} \|\delta\_j\|\_2 \right).$$

Combining the above results, we have that

$$\begin{split} \lambda \left\{ P(\beta^\*) - P(\beta^\* + \tilde{\delta}) \right\} &\geq L(\vartheta^\* + \delta) - L(\vartheta^\*) \geq -\frac{\lambda}{c\_0} P(\delta), \\ (c\_0+1)\Big\{\tau \|\delta\_{\mathcal{E}\_+}\|\_{1} + (1-\tau) \sum\_{j \in \mathcal{G}\_+} \|\delta\_j\|\_{2}\Big\} &\geq (c\_0-1)\Big\{\tau \|\delta\_{\mathcal{E}^c}\|\_{1} + (1-\tau) \sum\_{j \notin \mathcal{G}} \|\delta\_j\|\_{2}\Big\}, \\ \frac{\tau}{1-\tau} \|\delta\_{\mathcal{E}\_+}\|\_{1} + \sum\_{j \in \mathcal{G}\_+} \|\delta\_j\|\_{2} &\geq C\_0 \Big( \frac{\tau}{1-\tau} \|\delta\_{\mathcal{E}^c}\|\_{1} + \sum\_{j \notin \mathcal{G}} \|\delta\_j\|\_{2} \Big). \end{split}$$

**Lemma A7.** *Assume that conditions (A1)-(A3) are satisfied. Then*

$$\sup\_{\boldsymbol{v}\in\mathcal{V}}\frac{|\Delta L(\boldsymbol{u},\boldsymbol{v}) - \mathbb{E}\{\Delta L(\boldsymbol{u},\boldsymbol{v})\}|}{||\boldsymbol{v}||\_2} > \Lambda\_3.$$

*with probability at most $2(Kp)^{2(s\_e+K)(1-c\_3^2)}$, where*

$$\Lambda\_3 = (1 + \sqrt{2}c\_3)\zeta\_2\sqrt{\frac{2(s\_\varepsilon + K)\log(pK)}{N}}$$

*and $\Delta L(u, v) = L(u + v) - L(u)$ for any $u, v \in \mathbb{R}^{K(p+1)}$ and some constant $c\_3 > 1$.*

**Proof.** Given any $u \in \mathbb{R}^{K(p+1)}$ and $v \in \mathcal{V}$ with $\mathcal{V} = \{v \in \mathbb{R}^{K(p+1)} \mid 0 < \|v\|\_0 \le s\_e + K\}$,

$$\begin{split} \Delta L(u, v) &= \frac{1}{N} \sum\_{i=1}^{N} \mathbf{E}\_{i}^{\top} \left( \boldsymbol{\phi}\_{q} \{ (\mathbf{U} + \mathbf{V})^{\top} \mathbf{Z}\_{i} \} - \boldsymbol{\phi}\_{q} \{ \mathbf{U}^{\top} \mathbf{Z}\_{i} \} \right) \\ &= \frac{1}{N} \sum\_{i=1}^{N} \sum\_{k=1}^{K} e\_{ik} \left( \phi\_{q} \{ \mathbf{Z}\_{i}^{\top} (\boldsymbol{u}\_{k} + \boldsymbol{v}\_{k}) \} - \phi\_{q} \{ \mathbf{Z}\_{i}^{\top} \boldsymbol{u}\_{k} \} \right) \\ &= \frac{1}{N} \sum\_{i=1}^{N} d\_{i}(u, v), \end{split}$$

where $u = \text{vec}\{\mathbf{U}\}$ and $v = \text{vec}\{\mathbf{V}\}$.

The bounded gradient implies the Lipschitz continuity of $\phi\_q$, so that $|\phi\_q(u + v) - \phi\_q(u)| \le |v|$. Since $e\_{ik} \in \{0, 1\}$, we have that

$$\begin{split} |d\_{i}(\boldsymbol{\mu},\boldsymbol{\upsilon})| &\leq \sum\_{k=1}^{K} \left| \boldsymbol{e}\_{ik} \left\{ \boldsymbol{\phi}\_{q} \left\{ \mathbf{Z}\_{i}^{\top} (\boldsymbol{\mu}\_{k} + \boldsymbol{\upsilon}\_{k}) \right\} - \boldsymbol{\phi}\_{q} (\mathbf{Z}\_{i}^{\top} \boldsymbol{\mu}\_{k}) \right\} \right| \\ &\leq \sum\_{k=1}^{K} \left| \boldsymbol{e}\_{ik} \mathbf{Z}\_{i}^{\top} \boldsymbol{\upsilon}\_{k} \right| \leq \mathbf{E}\_{i}^{\top} \operatorname{vec} \{ \mathbf{V}^{\top} \mathbf{Z}\_{i} \} \\ &= \mathbf{v}^{\top} (\mathbf{Z}\_{i} \otimes \mathbf{I}\_{K}) \mathbf{E}\_{i}. \end{split}$$

Note that

$$\sum\_{i=1}^{N} \left( \mathfrak{v}^{\top} (\mathbf{Z}\_i \otimes \mathbf{I}\_K) \mathbf{E}\_i \right)^2 = \| \text{diag} \{ \text{vec} \{ \mathbf{E}^{\top} \} \} (\mathbf{Z} \otimes \mathbf{I}\_K) \mathbf{v} \|\_{2}^{2}.$$

By Hoeffding's inequality, we have that

$$\begin{aligned} &\Pr\left\{ \left| \frac{1}{N} \sum\_{i=1}^{N} d\_{i}(\mathfrak{u}, \mathfrak{v}) - \mathbb{E}\left(\frac{1}{N} \sum\_{i=1}^{N} d\_{i}(\mathfrak{u}, \mathfrak{v})\right) \right| > t \right\} \\ &\leq 2 \exp\left\{ - \frac{2N^{2}t^{2}}{4 \|\text{diag}\{\text{vec}\{\mathbf{E}^{\top}\}\}(\mathbf{Z} \otimes \mathbf{I}\_{K})\mathfrak{v}\|\_{2}^{2}} \right\} \\ &\leq 2 \exp\left\{ - \frac{Nt^{2}}{2\xi\_{2}^{2} \|\mathfrak{v}\|\_{2}^{2}} \right\} .\end{aligned}$$

Thus $\Pr\{R(v) > \Lambda\_3\} \le 2(Kp)^{-(s\_e+K)c\_3^2}$, with

$$R(\mathfrak{v}) = \frac{|\Delta L(\mathfrak{u}, \mathfrak{v}) - \mathbb{E}\{\Delta L(\mathfrak{u}, \mathfrak{v})\}|}{||\mathfrak{v}||\_2} \text{ and } \Lambda\_3 = c\_3 \varsigma\_2 \sqrt{\frac{2(s\_\varepsilon + K)\log(pK)}{N}}.$$

Next, we consider covering $\mathcal{V}$ with $\epsilon$-balls such that, for any $v\_1$ and $v\_2$ in the same ball, $\left\|\frac{v\_1}{\|v\_1\|\_2} - \frac{v\_2}{\|v\_2\|\_2}\right\|\_2 \le \epsilon$, where $\epsilon$ is a small positive number. The number of $\epsilon$-balls required to cover an $m$-dimensional unit ball is bounded by $(2/\epsilon + 1)^m$. Then for the normalized vectors $v/\|v\|\_2$, we require a covering number of at most $(3Kp/\epsilon)^{s\_e+K}$. Let $\mathcal{N}$ denote such an $\epsilon$-net. We have that

$$\Pr\left\{\sup\_{v\in\mathcal{N}} R(v) > \Lambda\_3 \right\} \le \left(\frac{3Kp}{\epsilon}\right)^{s\_e + K} 2(Kp)^{-(s\_e + K)c\_3^2} = 2\left\{\frac{3}{\epsilon}(Kp)^{1-c\_3^2}\right\}^{s\_e + K}.$$

Furthermore, for any $v\_1, v\_2 \in \mathcal{V}$,

$$\begin{split} |R(v\_{1}) - R(v\_{2})| &\leq \frac{2}{N} \left\| \text{diag}\{\text{vec}\{\mathbf{E}^{\top}\}\}(\mathbf{Z} \otimes \mathbf{I}\_{K}) \left( \frac{v\_{1}}{\|v\_{1}\|\_{2}} - \frac{v\_{2}}{\|v\_{2}\|\_{2}} \right) \right\|\_{1} \\ &\leq \frac{2}{\sqrt{N}} \left\| \text{diag}\{\text{vec}\{\mathbf{E}^{\top}\}\}(\mathbf{Z} \otimes \mathbf{I}\_{K}) \left( \frac{v\_{1}}{\|v\_{1}\|\_{2}} - \frac{v\_{2}}{\|v\_{2}\|\_{2}} \right) \right\|\_{2} \\ &\leq 2\varsigma\_{2}\epsilon. \end{split}$$

Therefore $\sup\_{v\in\mathcal{V}} R(v) \le \sup\_{v\in\mathcal{N}} R(v) + 2\varsigma\_2\epsilon$. Taking $\epsilon = \sqrt{(s\_e + K)\log(pK)/(2N)}$, we have that

$$\begin{split} \Pr\left\{\sup\_{v\in\mathcal{V}} R(v) > \Lambda\_{3} \right\} &\leq \Pr\left\{\sup\_{v\in\mathcal{N}} R(v) > (c\_{3}-1)\varsigma\_{2}\sqrt{\frac{2(s\_{e}+K)\log(pK)}{N}} \right\} \\ &\leq 2\left\{\sqrt{\frac{2N}{(s\_{e}+K)\log(pK)}}\,3(Kp)^{1-(c\_{3}-1)^{2}}\right\}^{s\_{e}+K} \\ &\leq 2\left\{(Kp)^{2-(c\_{3}-1)^{2}}\right\}^{s\_{e}+K}. \end{split}$$

Setting $c\_3 = 1 + \sqrt{2}c\_4$ with $c\_4 > 1$, we obtain the desired result that

$$\Pr\left\{\sup\_{\boldsymbol{\nu}\in\mathcal{Y}}\mathcal{R}(\boldsymbol{\nu}) > (1+\sqrt{2}c\_4)\zeta\_2\sqrt{\frac{2(\boldsymbol{s}\_\varepsilon+K)\log(pK)}{N}}\right\} \le 2(Kp)^{2(\boldsymbol{s}\_\varepsilon+K)(1-c\_4^2)}.$$

**Proof of Theorem 2.** Consider a disjoint partition of the coordinates of $\delta = \hat{\vartheta} - \vartheta^\*$, that is, $\delta = \sum\_{m=1}^M v\_m$ with $v\_m \in \mathcal{V}$; each subvector $v\_m$ has at most $s\_e + K$ non-zero coordinates. Denote $v\_0 = \mathbf{0}$ and $u\_m = \vartheta^\* + \sum\_{l=0}^{m-1} v\_l$, so that $u\_1 = \vartheta^\*$ and $u\_M + v\_M = \vartheta^\* + \delta$. We have the decomposition

$$\begin{split} \Delta L(\mathfrak{d}^\*, \mathfrak{d}) &= \mathrm{L}\Big(\mathfrak{d}^\* + \sum\_{m=1}^M \mathfrak{v}\_m\Big) - L(\mathfrak{d}^\*) = \sum\_{m=1}^M \mathrm{L}\Big(\mathfrak{d}^\* + \sum\_{l=0}^m \mathfrak{v}\_l\Big) - L\Big(\mathfrak{d}^\* + \sum\_{l=0}^{m-1} \mathfrak{v}\_l\Big) \\ &= \sum\_{m=1}^M \mathrm{L}(\mathfrak{u}\_m + \mathfrak{v}\_m) - L(\mathfrak{u}\_m) = \sum\_{m=1}^M \Delta L(\mathfrak{u}\_m, \mathfrak{v}\_m). \end{split}$$

By Lemma A7,

$$\sum\_{m=1}^{M} \Delta L(u\_{m}, v\_{m}) \ge \sum\_{m=1}^{M} \mathbb{E}\{\Delta L(u\_{m}, v\_{m})\} - \Lambda\_{3} \|v\_{m}\|\_{2} = \mathbb{E}\{\Delta L(\vartheta^{\*}, \delta)\} - \Lambda\_{3} \|\delta\|\_{2}$$

with high probability. By Lemma A5, $\mathcal{L}$ is twice differentiable, so that

$$\begin{split} \mathbb{E}\{\Delta L(\mathfrak{d}^\*, \mathfrak{d})\} &= \frac{1}{N} \sum\_{i=1}^N \mathbb{E}\left(E\_i^\top \boldsymbol{\phi}\_q \left\{ F\_i(\mathfrak{d}^\* + \mathfrak{d}) \right\} \right) - \mathbb{E}\left(\mathbf{E}\_i^\top \boldsymbol{\phi}\_q \left\{ F\_i(\mathfrak{d}^\*) \right\} \right) \\ &= \mathcal{L}(\mathfrak{d}^\* + \mathfrak{d}) - \mathcal{L}(\mathfrak{d}^\*) \\ &= \mathcal{S}(\mathfrak{d}^\*)^\top \boldsymbol{\delta} + \frac{1}{2} \boldsymbol{\delta}^\top \mathcal{H}(\mathfrak{d}^\*) \boldsymbol{\delta} + o(\|\boldsymbol{\delta}\|\_2^2) \\ &\ge 0 + \frac{\xi\_3^2}{2} \|\boldsymbol{\delta}\|\_2^2 + o(\|\boldsymbol{\delta}\|\_2^2) .\end{split}$$

Consequently, $\Delta L(\vartheta^\*, \delta)$ is bounded below by $\frac{\varsigma\_3^2}{2}\|\delta\|\_2^2 - \Lambda\_3\|\delta\|\_2$ with high probability. Note that

$$\begin{split} P(\beta^{\*}) - P(\beta^{\*} + \tilde{\delta}) &\leq \tau \left( \|\delta\_{\mathcal{E}\_{+}}\|\_{1} - \|\delta\_{\mathcal{E}^{c}}\|\_{1} \right) + (1 - \tau) \left( \sum\_{j \in \mathcal{G}\_{+}} \|\delta\_{j}\|\_{2} - \sum\_{j \notin \mathcal{G}} \|\delta\_{j}\|\_{2} \right) \\ &\leq \tau \|\delta\_{\mathcal{E}\_{+}}\|\_{1} + (1 - \tau) \sum\_{j \in \mathcal{G}\_{+}} \|\delta\_{j}\|\_{2}. \end{split}$$

From (A8),

$$\begin{split} L(\vartheta^\*) + \lambda P(\beta^\*) &\geq L(\hat{\vartheta}) + \lambda P(\hat{\beta}),\\ \lambda \left\{ P(\beta^\*) - P(\beta^\* + \tilde{\delta}) \right\} &\geq L(\vartheta^\* + \delta) - L(\vartheta^\*),\\ \lambda \left( \tau \|\delta\_{\mathcal{E}\_+}\|\_{1} + (1 - \tau) \sum\_{j \in \mathcal{G}\_+} \|\delta\_j\|\_{2} \right) &\geq \frac{\varsigma\_3^2}{2} \|\delta\|\_{2}^{2} - \Lambda\_3 \|\delta\|\_{2}. \end{split}$$

Clearly, $\|\delta\_{\mathcal{E}\_+}\|\_1 \le \sqrt{s\_e + K}\,\|\delta\_{\mathcal{E}\_+}\|\_2 \le \sqrt{s\_e + K}\,\|\delta\|\_2$ and $\sum\_{j\in\mathcal{G}\_+} \|\delta\_j\|\_2 \le \sqrt{s\_g + 1}\,\|\delta\|\_2$. We conclude that

$$\begin{aligned} \frac{\varsigma\_3^2}{2} \|\delta\|\_2^2 &\leq \lambda \Big(\tau \|\delta\_{\mathcal{E}\_+}\|\_1 + (1-\tau) \sum\_{j \in \mathcal{G}\_+} \|\delta\_j\|\_2\Big) + \Lambda\_3 \|\delta\|\_2,\\ \|\delta\|\_2^2 &\leq 2\varsigma\_3^{-2} \Big\{\lambda \Big(\tau\sqrt{s\_e+K} + (1-\tau)\sqrt{s\_g+1}\Big) + \Lambda\_3\Big\} \|\delta\|\_{2}, \end{aligned}$$

after which the desired result follows from straightforward algebraic manipulation.
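
Explicitly, dividing the last display by $\|\delta\|\_2$ (when $\delta \ne \mathbf{0}$) gives

$$\|\hat{\vartheta} - \vartheta^\*\|\_2 \le 2\varsigma\_3^{-2}\Big\{\lambda\Big(\tau\sqrt{s\_e+K} + (1-\tau)\sqrt{s\_g+1}\Big) + \Lambda\_3\Big\},$$

which is the rate asserted in Theorem 2, up to the exact constants stated in the main text.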

*Appendix A.5. Proof of Lemma 3*

**Proof.** Since

$$\text{vec}(\mathbf{F}^{\top})^{\top} = \text{vec}\left\{ (\mathbf{1}\_N \boldsymbol{\mathfrak{a}}^{\top} + \mathbf{X} \mathbf{B})^{\top} \right\}^{\top} = \boldsymbol{\mathfrak{a}}^{\top}(\mathbf{1}\_N^{\top} \otimes \mathbf{I}\_K) + \text{vec}(\mathbf{B}^{\top})^{\top}(\mathbf{X}^{\top} \otimes \mathbf{I}\_K),$$

we have that

$$\begin{cases} \frac{\partial \text{vec}(\mathbf{F}^{\top})^{\top}}{\partial \mathbf{a}} = \frac{\partial \mathbf{a}^{\top}(\mathbf{1}\_{N}^{\top} \otimes \mathbf{I}\_{K})}{\partial \mathbf{a}} = \frac{\partial \mathbf{a}^{\top}}{\partial \mathbf{a}} (\mathbf{1}\_{N}^{\top} \otimes \mathbf{I}\_{K}) = \mathbf{I}\_{K} (\mathbf{1}\_{N}^{\top} \otimes \mathbf{I}\_{K}) = \mathbf{1}\_{N}^{\top} \otimes \mathbf{I}\_{K}, \\\\ \frac{\partial \text{vec}(\mathbf{F}^{\top})^{\top}}{\partial \text{vec}(\mathbf{B}^{\top})} = \frac{\partial \text{vec}(\mathbf{B}^{\top})^{\top}(\mathbf{X}^{\top} \otimes \mathbf{I}\_{K})}{\partial \text{vec}(\mathbf{B}^{\top})} = \mathbf{I}\_{pK} (\mathbf{X}^{\top} \otimes \mathbf{I}\_{K}) = \mathbf{X}^{\top} \otimes \mathbf{I}\_{K}. \end{cases}$$

The derivative with respect to *α* is

$$\begin{split} NS(\boldsymbol{\mathfrak{a}}) &= N \frac{\partial L(\boldsymbol{\mathfrak{a}})}{\partial \boldsymbol{\mathfrak{a}}} = \frac{\partial}{\partial \boldsymbol{\mathfrak{a}}} \text{vec} \{ \mathbf{E}^{\top} \}^{\top} \text{vec} \{ \boldsymbol{\phi}\_{q} (\mathbf{F}^{\top}) \} \\ &= \frac{\partial \text{vec} (\mathbf{F}^{\top})^{\top}}{\partial \boldsymbol{\mathfrak{a}}} \frac{\partial \boldsymbol{\phi}\_{q} \{ \text{vec} (\mathbf{F}^{\top})^{\top} \}}{\partial \text{vec} (\mathbf{F}^{\top})} \text{vec} \{ \mathbf{E}^{\top} \} \\ &= (\mathbf{1}\_{N}^{\top} \otimes \mathbf{I}\_{K}) \text{diag} \{ \text{vec} \{ \boldsymbol{\phi}\_{q}^{\prime} (\mathbf{F}^{\top}) \} \} \text{vec} \{ \mathbf{E}^{\top} \} \\ &= \text{vec} \{ \mathbf{I}\_{K} \left\{ \mathbf{E} \circ \boldsymbol{\phi}\_{q}^{\prime} (\mathbf{F}) \right\}^{\top} \mathbf{1}\_{N} \} \\ &= \left\{ \mathbf{E} \circ \boldsymbol{\phi}\_{q}^{\prime} (\mathbf{F}) \right\}^{\top} \mathbf{1}\_{N} . \end{split}$$

Thus,

$$\begin{split} \left\| \mathbf{S}(\mathfrak{a})\big|\_{v}^{u} \right\|\_{2}^{2} &= \left\| \mathbf{S}(u) - \mathbf{S}(v) \right\|\_{2}^{2} = N^{-2} \left\| \left( \mathbf{1}\_{N}^{\top} \otimes \mathbf{I}\_{K} \right) \text{vec} \left\{ \left( \mathbf{E} \circ \phi\_{q}^{\prime} \{\mathbf{F}(\mathfrak{a})\}\big|\_{v}^{u} \right)^{\top} \right\} \right\|\_{2}^{2} \\ &\leq N^{-2} \left\| \mathbf{1}\_{N}^{\top} \otimes \mathbf{I}\_{K} \right\|\_{2}^{2} \left\| \text{vec} \left\{ \mathbf{E} \circ \phi\_{q}^{\prime} \{\mathbf{F}(\mathfrak{a})\}\big|\_{v}^{u} \right\} \right\|\_{2}^{2} \\ &= N^{-1} \sum\_{k=1}^{K} \sum\_{i=1}^{N} e\_{ik}^{2} \left( \phi\_{q}^{\prime} \{f\_{ik}(u\_{k})\} - \phi\_{q}^{\prime} \{f\_{ik}(v\_{k})\} \right)^{2} \\ &\leq N^{-1} \sum\_{k=1}^{K} \Big( \sum\_{i=1}^{N} e\_{ik} \Big) L\_{q}^{2} (u\_{k} - v\_{k})^{2} \\ &\leq N^{-1} n\_{\max} L\_{q}^{2} \|u - v\|\_{2}^{2}, \end{split}$$

where $L\_q = (q+1)^2/q$ is the Lipschitz constant of $\phi\_q'$. We have that $L\_{\alpha} = \sqrt{n\_{\max}/N}\, L\_q$. The derivative with respect to $\text{vec}(\mathbf{B}^\top)$ is

$$\begin{split} N\frac{\partial L(\boldsymbol{\theta})}{\partial \text{vec}(\mathbf{B}^{\top})} &= \frac{\partial}{\partial \text{vec}(\mathbf{B}^{\top})} \text{vec}\{\mathbf{E}^{\top}\}^{\top} \text{vec}\{\boldsymbol{\phi}\_{q}(\mathbf{F}^{\top})\} \\ &= \frac{\partial \text{vec}(\mathbf{F}^{\top})^{\top}}{\partial \text{vec}(\mathbf{B}^{\top})} \frac{\partial \boldsymbol{\phi}\_{q}\{\text{vec}(\mathbf{F}^{\top})^{\top}\}}{\partial \text{vec}(\mathbf{F}^{\top})} \text{vec}\{\mathbf{E}^{\top}\} \\ &= (\mathbf{X}^{\top} \otimes \mathbf{I}\_{K}) \text{diag}\{\text{vec}\{\boldsymbol{\phi}\_{q}^{\prime}(\mathbf{F}^{\top})\}\} \text{vec}\{\mathbf{E}^{\top}\} \\ &= \text{vec}\left(\mathbf{I}\_{K} \{\mathbf{E} \circ \boldsymbol{\phi}\_{q}^{\prime}(\mathbf{F})\}^{\top} \mathbf{X}\right) \\ &= \text{vec}\left(\{\mathbf{E} \circ \boldsymbol{\phi}\_{q}^{\prime}(\mathbf{F})\}^{\top} \mathbf{X}\right). \end{split}$$

Therefore, the derivative with respect to $\mathbf{B}$ is $\mathbf{S}(\mathbf{B}) = N^{-1}\mathbf{X}^\top\{\mathbf{E} \circ \phi\_q'(\mathbf{F})\}$. Note that

$$\begin{aligned} \text{vec}\left(\mathbf{X}^{\top}\left\{\mathbf{E}\circ\phi\_{q}^{\prime}(\mathbf{F})\right\}\right) &= (\mathbf{I}\_{K}\otimes\mathbf{X}^{\top})\text{diag}\{\text{vec}(\mathbf{E})\}\text{vec}\{\phi\_{q}^{\prime}(\mathbf{F})\} \\ &= \left\{\bigoplus\_{k=1}^{K}\mathbf{X}^{\top}\text{diag}(\mathfrak{e}\_{k})\right\}\text{vec}\{\phi\_{q}^{\prime}(\mathbf{F})\} \end{aligned}$$

and

$$\sum\_{i=1}^{N} \left\{ e\_{ik} \mathbf{X}\_i^\top (\boldsymbol{\mu}\_k - \boldsymbol{\nu}\_k) \right\}^2 = \left\| \text{diag}(\mathbf{e}\_k) \mathbf{X} (\boldsymbol{\mu}\_k - \boldsymbol{\nu}\_k) \right\|\_2^2 \le \left\| \text{diag}(\mathbf{e}\_k) \mathbf{X} \right\|\_2^2 \left\| \boldsymbol{\mu}\_k - \boldsymbol{\nu}\_k \right\|\_2^2;$$

thus

$$\begin{split} N^{2}\,\|\text{vec}\{\mathbf{S}(\mathbf{U})-\mathbf{S}(\mathbf{V})\}\|\_{2}^{2} &= \sum\_{k=1}^{K} \left\|\mathbf{X}^{\top}\text{diag}(e\_{k})\,\phi\_{q}'\{f\_{k}(\boldsymbol{b}\_{k})\}\big|\_{\boldsymbol{v}\_{k}}^{\boldsymbol{u}\_{k}}\right\|\_{2}^{2} \\ &\leq \sum\_{k=1}^{K} \|\mathbf{X}^{\top}\text{diag}(e\_{k})\|\_{2}^{2}\, \left\|\text{diag}(e\_{k})\,\phi\_{q}'\{f\_{k}(\boldsymbol{b}\_{k})\}\big|\_{\boldsymbol{v}\_{k}}^{\boldsymbol{u}\_{k}}\right\|\_{2}^{2} \\ &\leq \sum\_{k=1}^{K} \|\text{diag}(e\_{k})\mathbf{X}\|\_{2}^{2} \sum\_{i=1}^{N} e\_{ik} \left(\phi\_{q}'\{f\_{ik}(\boldsymbol{u}\_{k})\}-\phi\_{q}'\{f\_{ik}(\boldsymbol{v}\_{k})\}\right)^{2} \\ &\leq L\_{q}^{2} \sum\_{k=1}^{K} \|\text{diag}(e\_{k})\mathbf{X}\|\_{2}^{2} \sum\_{i=1}^{N} \left\{e\_{ik}\mathbf{X}\_{i}^{\top}(\boldsymbol{u}\_{k}-\boldsymbol{v}\_{k})\right\}^{2} \\ &\leq L\_{q}^{2} \sum\_{k=1}^{K} \|\text{diag}(e\_{k})\mathbf{X}\|\_{2}^{4}\, \|\boldsymbol{u}\_{k}-\boldsymbol{v}\_{k}\|\_{2}^{2} \\ &\leq L\_{q}^{2} \max\_{k} \left\{ \|\text{diag}(e\_{k})\mathbf{X}\|\_{2}^{2} \right\}^{2} \|\text{vec}(\mathbf{U}-\mathbf{V})\|\_{2}^{2}, \end{split}$$

We conclude that $L\_{\mathbf{B}} = L\_q N^{-1}\max\_k\|\text{diag}(e\_k)\mathbf{X}\|\_2^2$.
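
For illustration, the two Lipschitz constants in Lemma 3 are straightforward to compute from data, e.g., when choosing step sizes for the proximal algorithm. The sketch below is illustrative only; the function name and the use of NumPy are our own choices, not part of the paper.

```python
import numpy as np

def mgdwd_lipschitz_constants(X, E, q):
    """Sketch of the Lipschitz constants from Lemma 3.

    X : (N, p) design matrix.
    E : (N, K) one-hot class indicator matrix, e_{ik} = 1{y_i = k};
        every class is assumed to appear at least once.
    q : exponent of the generalized DWD loss.
    """
    N = X.shape[0]
    L_q = (q + 1.0) ** 2 / q               # Lipschitz constant of phi_q'
    n_max = E.sum(axis=0).max()            # largest class size
    L_alpha = np.sqrt(n_max / N) * L_q     # constant for the intercept block

    # L_B = L_q / N * max_k ||diag(e_k) X||_2^2, with ||.||_2 the spectral norm;
    # diag(e_k) X simply keeps the rows of X that belong to class k.
    L_B = L_q / N * max(
        np.linalg.norm(X[E[:, k] == 1], 2) ** 2 for k in range(E.shape[1])
    )
    return L_alpha, L_B
```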

*Appendix A.6. Proof of Theorem 3*

**Lemma A8.** *The indicator function*

$$\delta\_{\mathcal{R}}(\mathbf{x}) = \begin{cases} 0, & \text{if } \mathbf{x} \in \mathcal{R} \\ \infty, & \text{if } \mathbf{x} \notin \mathcal{R} \end{cases}$$

*where $\mathcal{R} = \{\boldsymbol{x} \in \mathbb{R}^p \mid \mathbf{1}\_p^\top\boldsymbol{x} = 0\}$, has subdifferential*

$$\partial \delta\_{\mathcal{R}}(\boldsymbol{x}) = \begin{cases} \{ \boldsymbol{g} \in \mathbb{R}^p \mid \boldsymbol{g} = s \mathbf{1}\_{p},\ s \in \mathbb{R} \}, & \text{if } \boldsymbol{x} \in \mathcal{R}, \\ \emptyset, & \text{if } \boldsymbol{x} \notin \mathcal{R}. \end{cases}$$

**Proof.** Suppose that $\boldsymbol{x} \in \mathcal{R}$. Then $\boldsymbol{g} \in \partial\delta\_{\mathcal{R}}(\boldsymbol{x})$ if and only if both

$$
\delta\_{\mathcal{R}}(\boldsymbol{y}) \ge \delta\_{\mathcal{R}}(\boldsymbol{x}) + \langle \mathbf{g}, \boldsymbol{y} - \boldsymbol{x} \rangle \text{ for all } \boldsymbol{y} \in \mathcal{R} \text{ and }
$$

$$
\boldsymbol{g}^{\top}(\boldsymbol{y} - \boldsymbol{x}) \le 0.
$$

Let $\boldsymbol{z} = \boldsymbol{y} - \boldsymbol{x}$. Then $\boldsymbol{z} \in \mathcal{R}$ since $\mathbf{1}\_p^\top(\boldsymbol{y} - \boldsymbol{x}) = 0$, and thus $\boldsymbol{g}^\top\boldsymbol{z} \le 0$. If $\boldsymbol{g}^\top\boldsymbol{z} = 0$ for all such $\boldsymbol{z}$, then $\boldsymbol{g} \in \{\boldsymbol{g} \in \mathbb{R}^p \mid \boldsymbol{g} = s\mathbf{1}\_p,\ s \in \mathbb{R}\}$. If instead there existed $\boldsymbol{g} \in \partial\delta\_{\mathcal{R}}(\boldsymbol{x})$ satisfying $\boldsymbol{g}^\top\boldsymbol{z} < 0$ for some $\boldsymbol{z} \in \mathcal{R}$, then $-\boldsymbol{z} \in \mathcal{R}$ would force $\boldsymbol{g}^\top\boldsymbol{z} > 0$, a contradiction.

Now, for any $\boldsymbol{x} \notin \mathcal{R}$, we have that $\boldsymbol{g} \in \partial\delta\_{\mathcal{R}}(\boldsymbol{x})$ if and only if both

$$
\delta\_{\mathcal{R}}(\mathfrak{y}) \ge \delta\_{\mathcal{R}}(\mathfrak{x}) + \langle \mathfrak{g}, \mathfrak{y} - \mathfrak{x} \rangle \text{ for all } \mathfrak{y} \in \mathcal{R} \text{ and }$$

$$
\boldsymbol{g}^{\top}(\boldsymbol{x} - \boldsymbol{y}) \ge \infty.$$

For $\boldsymbol{x} \notin \mathcal{R}$ and $\boldsymbol{y} \in \mathcal{R}$, since $\boldsymbol{z} = \boldsymbol{x} - \boldsymbol{y} \in \mathbb{R}^p$ and no $\boldsymbol{g}$ can satisfy $\boldsymbol{g}^\top\boldsymbol{z} \ge \infty$, it must be that $\boldsymbol{g} \in \emptyset$.

**Proof of Theorem 3.** It is sufficient to minimize the objective function

$$G(\beta) = \frac{1}{2} \|\beta - \beta\_\*\|\_2^2 + \rho\_1 \|\beta\|\_1 + \rho\_2 \|\beta\|\_2 + \delta\_{\mathcal{R}}(\beta),$$

where $\mathcal{R} = \{\boldsymbol{x} \in \mathbb{R}^K \mid \mathbf{1}\_K^\top\boldsymbol{x} = 0\}$. Then the subdifferential of $G(\beta)$ is

$$
\partial G(\beta) = \beta - \beta\_\* + \rho\_1 \partial\|\beta\|\_1 + \rho\_2 \partial\|\beta\|\_2 + \partial\delta\_{\mathcal{R}}(\beta).
$$

For an optimal solution $\beta^\* \in \mathcal{R}$, we have that $\mathbf{0}\_K \in \partial G(\beta^\*)$ if and only if there exist $u \in \partial\|\beta^\*\|\_1$, $v \in \partial\|\beta^\*\|\_2$ and $s \in \mathbb{R}$ such that $\beta^\* = \beta\_\* - \rho\_1 u - \rho\_2 v - s\mathbf{1}\_K$. Since $\mathbf{1}\_K^\top\beta^\* = 0$, we have that $s = K^{-1}\mathbf{1}\_K^\top(\beta\_\* - \rho\_1 u - \rho\_2 v)$, so

$$
\beta^\* = \mathbf{P}\_K(\beta\_\* - \rho\_1 u - \rho\_2 v).
$$

If $\beta^\* = \mathbf{0}\_K$, then $|u\_j| < 1$ for $j = 1, \dots, K$, $\|v\|\_2 \le 1$, and

$$\|\mathbf{P}\_K(\beta\_\* - \rho\_1 u)\|\_2 = \rho\_2 \|\mathbf{P}\_K v\|\_2 \le \rho\_2 \|\mathbf{P}\_K\|\_2 \|v\|\_2 = \rho\_2 \|v\|\_2 \le \rho\_2.$$

If $\beta^\* \ne \mathbf{0}\_K$, then $u \in \partial\|\beta^\*\|\_1$, $v = \beta^\*/\|\beta^\*\|\_2$, and

$$
\beta^\* = \mathbf{P}\_K \left( \beta\_\* - \rho\_1 u - \rho\_2 \frac{\beta^\*}{\|\beta^\*\|\_2} \right),
$$

$$
\left( 1 + \frac{\rho\_2}{\|\beta^\*\|\_2} \right) \beta^\* = \mathbf{P}\_K (\beta\_\* - \rho\_1 u).
$$

Note that $\beta^\* = \mathbf{P}\_K\beta^\* \in \mathcal{R}$. Taking the norm of both sides, we see that

$$\left(1 + \frac{\rho\_2}{\|\beta^\*\|\_2}\right) \|\beta^\*\|\_2 = \|\mathbf{P}\_K(\beta\_\* - \rho\_1 u)\|\_2,$$

$$\|\beta^\*\|\_2 = \|\mathbf{P}\_K(\beta\_\* - \rho\_1 u)\|\_2 - \rho\_2 > 0.$$

Substituting this result back into the $\beta^\* \ne \mathbf{0}\_K$ case, we have that

$$\beta^\* = \left\{ 1 - \frac{\rho\_2}{\|\mathbf{P}\_K(\beta\_\* - \rho\_1 u)\|\_2} \right\} \mathbf{P}\_K(\beta\_\* - \rho\_1 u).$$

Combining the above two cases gives the desired result.

#### *Appendix A.7. Proof of Theorem 4*

**Proof.** Denote the objective function by

$$G(b) = \frac{1}{2}(b-t)^2 + \varrho\{|b| + |b+s|\}.$$

When *s* = 0, we obtain a lasso problem with

$$b^\* = \underset{b \in \mathbb{R}}{\text{argmin}}\ \frac{1}{2}(b-t)^2 + 2\varrho|b| = \mathcal{S}(t, 2\varrho).$$
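
Here $\mathcal{S}(\cdot,\cdot)$ denotes the soft-thresholding operator (presumably defined in the main text; restated here as a reading aid):

$$\mathcal{S}(t, \lambda) = \text{sign}(t)\,\max(|t| - \lambda,\ 0),$$

so that, in particular, $\mathcal{S}(z, 0) = z$, as used at the end of this proof.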

When $s \ne 0$, the subdifferential of $G(b)$ is

$$
\partial G(b) = b - t + \varrho\{\partial|b| + \partial|b + s|\}.
$$

We see that $0 \in \partial G(b^\*)$ if and only if there exist $u \in \partial|b^\*|$ and $v \in \partial|b^\* + s|$ with

$$b^\* = t - \varrho(u+v).$$

If $b^\* = 0$, then $|u| < 1$ and $v = \text{sign}(s)$, hence

$$b^\* = 0 \text{ if } |t - \varrho \text{sign}(s)| \le \varrho.$$

If $s > 0$, then $\text{sign}(s) = 1$ and $0 \le t \le 2\varrho$. If $s < 0$, then $\text{sign}(s) = -1$ and $-2\varrho \le t \le 0$. Note that if $t \ne 0$, then $\text{sign}(s) = \text{sign}(t)$, i.e., $\text{sign}(s)\text{sign}(t) = 1$.

When $b^\* = -s$, then $u = -\text{sign}(s)$ and $|v| < 1$, hence

$$b^\* = -s \text{ if } |t + s + \varrho \text{sign}(s)| \le \varrho.$$

If $s > 0$, then $\text{sign}(s) = 1$ and $-(s + 2\varrho) \le t \le -s < 0$. If $s < 0$, then $\text{sign}(s) = -1$ and $0 < -s \le t \le -(s - 2\varrho)$. Note that $\text{sign}(s) = -\text{sign}(t)$ is equivalent to $\text{sign}(s)\text{sign}(t) = -1$.

Let $C(s, t) = \frac{1 - \text{sign}(s)\text{sign}(t)}{2}\,|s| \ge 0$. We can summarize the two cases above as

$$b^\* = \text{sign}(t)\,C(s, t) \quad \text{if } 0 \le C(s, t) \le |t| \le C(s, t) + 2\varrho. \tag{A9}$$

If $b^\* \ne 0, -s$, then $u = \text{sign}(b^\*)$ and $v = \text{sign}(b^\* + s)$, thus

$$\begin{aligned} b^\* &= t - \varrho \{ \text{sign}(b^\*) + \text{sign}(b^\* + s) \} \\ b^\* + s &= t + s - \varrho \{ \text{sign}(b^\*) + \text{sign}(b^\* + s) \} .\end{aligned}$$

If $\text{sign}(b^\*) = -\text{sign}(b^\* + s) = 1$, then $b^\*(b^\* + s) < 0$ and hence $0 < t < -s$; thus $b^\* = t > 0$ if $0 < t < -s$. If $\text{sign}(b^\*) = -\text{sign}(b^\* + s) = -1$, then $b^\*(b^\* + s) < 0$ and hence $-s < t < 0$; thus $b^\* = t < 0$ if $-s < t < 0$. Rewriting the two cases above, we have that

$$b^\* = t \quad \text{if } 0 < |t| < C(s, t). \tag{A10}$$

If $\text{sign}(b^\*) = \text{sign}(b^\* + s) = 1$, then

$$\begin{aligned} \min\{b^\*, b^\* + s\} &> 0 \\ t - 2\varrho + \frac{s - |s|}{2} &> 0 \\ \text{sign}(t)|t| &> \text{sign}(t) \left(\frac{|s|}{2} + 2\varrho\right) - \frac{s}{2} > 0. \end{aligned}$$

Note that $t > 0$ and $\text{sign}(b^\*) = \text{sign}(t)$. If $\text{sign}(b^\*) = \text{sign}(b^\* + s) = -1$, then

$$\begin{aligned} \max\{b^\*, b^\* + s\} &< 0,\\ t + 2\varrho + \frac{s + |s|}{2} &< 0,\\ \text{sign}(t)|t| &< \text{sign}(t)\left(\frac{|s|}{2} + 2\varrho\right) - \frac{s}{2} < 0. \end{aligned}$$

Note that $t < 0$ and $\text{sign}(b^\*) = \text{sign}(t)$. Rewriting the two cases above, we have that

$$b^\* = t - 2\varrho\,\text{sign}(t) \quad \text{if } |t| > 2\varrho + C(s, t). \tag{A11}$$

Summarizing (A9)–(A11),

$$b^\* = \begin{cases} t, & |t| < C(s, t), \\ \text{sign}(t)\,C(s, t), & C(s, t) \le |t| \le C(s, t) + 2\varrho, \\ \text{sign}(t)(|t| - 2\varrho), & |t| > C(s, t) + 2\varrho, \end{cases}$$

with $C(s, t) = \frac{1 - \text{sign}(s)\text{sign}(t)}{2}\,|s| \ge 0$. On the one hand, when $s \ne 0$,

$$b^\* = t - \mathcal{S}\left(t, C(s, t)\right) + \mathcal{S}\{\mathcal{S}\left(t, C(s, t)\right), 2\varrho\}.$$

On the other hand, when $s = 0$, it follows that $b^\* = \mathcal{S}(t, 2\varrho)$, since $\mathcal{S}(z, 0) = z$.
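
The case analysis above translates directly into a scalar update. The following Python sketch implements the compact expression just derived; the function names are illustrative only and are not from the paper.

```python
import numpy as np

def soft_threshold(t, lam):
    """Soft-thresholding S(t, lam) = sign(t) * max(|t| - lam, 0)."""
    return np.sign(t) * max(abs(t) - lam, 0.0)

def prox_two_abs(t, s, rho):
    """Minimizer of G(b) = (b - t)^2 / 2 + rho * (|b| + |b + s|),
    using b* = t - S(t, C(s, t)) + S(S(t, C(s, t)), 2*rho) for s != 0
    and b* = S(t, 2*rho) for s = 0."""
    if s == 0.0:
        return soft_threshold(t, 2.0 * rho)
    c = 0.5 * (1.0 - np.sign(s) * np.sign(t)) * abs(s)   # C(s, t) >= 0
    inner = soft_threshold(t, c)
    return t - inner + soft_threshold(inner, 2.0 * rho)

# Quick checks against the three regimes (rho = 0.5, s = 1):
# |t| < C(s, t):            prox_two_abs(-0.5, 1.0, 0.5) == -0.5
# C <= |t| <= C + 2*rho:    prox_two_abs(-1.5, 1.0, 0.5) == -1.0  (= -s)
# |t| > C + 2*rho:          prox_two_abs( 3.0, 1.0, 0.5) ==  2.0  (= t - 2*rho)
```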

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article* **Ensemble Linear Subspace Analysis of High-Dimensional Data**

**S. Ejaz Ahmed <sup>1</sup> , Saeid Amiri 2,\* and Kjell Doksum <sup>3</sup>**


**Abstract:** Regression models provide prediction frameworks for multivariate mutual information analysis that uses information concepts when choosing covariates (also called features) that are important for analysis and prediction. We consider a high dimensional regression framework where the number of covariates (*p*) exceeds the sample size (*n*). Recent work in high dimensional regression analysis has embraced an ensemble subspace approach that consists of selecting random subsets of covariates with fewer than *p* covariates, doing statistical analysis on each subset, and then merging the results from the subsets. We examine conditions under which penalty methods such as Lasso perform better when used in the ensemble approach by computing mean squared prediction errors for simulations and a real data example. Linear models with both random and fixed designs are considered. We examine two versions of penalty methods: one where the tuning parameter is selected by cross-validation; and one where the final predictor is a trimmed average of individual predictors corresponding to the members of a set of fixed tuning parameters. We find that the ensemble approach improves on penalty methods for several important real data and model scenarios. The improvement occurs when covariates are strongly associated with the response and when the complexity of the model is high. In such cases, the trimmed average version of ensemble Lasso is often the best predictor.

**Keywords:** ensembling; high-dimensional data; Lasso; elastic net; penalty methods; prediction; random subspaces

#### **1. Introduction**

Recent research in statistical science has focused on developing effective and useful techniques for analyzing high-dimensional data where the number of variables substantially exceeds the number of cases or subjects. Examples of such data sets are genome or gene expression arrays, and other biomarkers based on RNA and proteins. The challenge is to find associations between such markers (*X*'s) and phenotype (*Y*).

Regression models provide useful frameworks for multivariate mutual information analysis that uses information concepts when choosing covariates (also called features) that are important for the analysis and prediction. A recent article that includes both the concept of mutual information and the Lasso is [1]. This paper develops properties of methods that use the information in a vector $X$ to reduce prediction error, that is, to reduce entropy. We consider regression experiments, that is, experiments with a response variable $Y \in \mathbb{R}$ and a covariate vector $(X_1, \dots, X_p)^t$. The objective is to use a sample of i.i.d. vectors $(\mathbf{x}_i, y_i)$, $1 \le i \le n$, where $\mathbf{x}_i = (x_{i1}, \dots, x_{ip})^t$ with $x_{ij} \in \mathbb{R}$, to construct a predictor $\widehat{Y}_0$ of a response $Y_0$ corresponding to a covariate vector $\mathbf{x}_0 = (x_{01}, \dots, x_{0p})^t$ that is not part of the sample. Let $\mathbf{X} = (x_{ij})_{n \times p}$ be the design matrix of explanatory variables (covariates) and
$\mathbf{y} = (y_1, \dots, y_n)^t$ be the vector of response variables. Denote $\mathbf{X}[, j]$ as the *j*th column vector of the design matrix. We will use the linear model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},\tag{1}$$

where $\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)^t$ is the vector of regression coefficients and $\boldsymbol{\epsilon} = (\epsilon_1, \dots, \epsilon_n)^t \sim N(\mathbf{0}, \sigma^2 I)$ is the residual error term. In this model, predictors $\widehat{Y}_0$ take the form

$$
\widehat{Y}\_0 = \sum\_{j=1}^p \widehat{\beta}\_j \mathbf{x}\_{0,j},
$$

where $\widehat{\beta}_j$ is an estimator based on the i.i.d. sample $(\mathbf{x}_i, y_i)$, $1 \le i \le n$.

Under $n \ge p$, the ordinary least squares (OLS) estimator of $\boldsymbol{\beta}$ can be used. When $n < p$, a unique OLS estimate does not exist. However, for sparse models where most of the *β*'s are zero, we can use the Lasso [2] criterion, which forces many of the estimated *β*'s to be set to zero. For a given penalty level *λ* ≥ 0, the Lasso estimate of *β* is

$$\widehat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \left\{ \frac{1}{2} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\},$$

where $\|\cdot\|_2$ is the Euclidean norm and $\|\boldsymbol{\beta}\|_1 = \sum_j |\beta_j|$ is the $\ell_1$-norm. The Lasso not only sets a subset of the *β*'s to zero, it also shrinks the OLS estimates of the remaining *β*'s towards zero. It is an effective procedure when one can assume that the number *r* of covariates relevant for the response, in the sense that their *β* coefficient is not zero, satisfies *r* ≤ *n*, that is, for sparse models.

Other effective high-dimensional methods that we consider are the adaptive Lasso, ref. [3], smoothly clipped absolute deviation (SCAD), ref. [4], least angle regression (LARS), ref. [5], and the elastic net, ref. [6]. The properties of the Lasso and its variants are well studied, both with respect to consistency of parameter estimates [7,8] and with respect to prediction error and the variable selection process; refs. [9,10] examined properties of the Lasso in partially linear models. Several variants of the Lasso were introduced by [11] and more recently by [12]. See [13–15] for many of the extensions of the original Lasso.

In this paper, we examine properties of statistical methods based on Ensemble Linear Subspace Analysis (ELSA) for analyzing high-dimensional data. ELSA is based on repeated random selection of subsets of covariates, doing statistical inference on each of the subsets, and then combining the results from the subsets to construct a final inference. One advantage of this ensemble subspace approach is that it makes the analysis of studies with a million or more covariates more manageable. Another advantage is that in many situations the ensemble approach is more efficient, because it takes advantage of the high efficiency of statistical methods for the case where the number of covariates is less than or equal to the sample size.

Classical examples of using sub-models whose results are pooled and aggregated into a final statistical analysis are the bagging method ([16]) and the random forests approach ([17]). Recent studies that use ensemble ideas include [18,19]. These papers focus on feature selection, that is, selecting the covariates that are associated with the response variable. This paper deals with using the selected covariates to construct efficient predictors of the response. We examine conditions under which penalty methods such as Lasso perform better when used in the ensemble approach by computing mean squared prediction errors for simulations and a real data example. Linear models with both random and fixed designs are considered. We examine two versions of penalty methods: one where the tuning parameter is selected by cross-validation; and one where the final predictor is a trimmed average of individual predictors corresponding to the members of a set of fixed tuning parameters. We find that the ensemble approach improves on penalty methods for several important real data and model scenarios. The improvement occurs when covariates are strongly associated with the response and when the complexity of the model (represented by *r*/*p*) is high. In such cases, the trimmed average version of ensemble Lasso is often the best predictor.

The rest of this article is organized as follows. In Sections 2 and 3, we introduce six different approaches to subspace selection. Section 3 describes a new approach for dealing with the tuning parameter *λ*. Instead of using the standard Lasso based on a $\widehat{\lambda}$ obtained by cross-validation, it computes Lasso predictors for a fixed set of tuning parameters and uses the average of these predictors as the final predictor. Section 4 outlines other penalty-based ensemble methods for high-dimensional data. Section 5 introduces the concepts of mean squared prediction error (MSPE) and efficiency (EFF) for fixed and random design experiments as well as for real data. Section 6 gives the efficiency of various penalty methods with respect to CVLasso, including efficiencies of ensemble subspace versions of these penalty methods. The efficiency results show that when the model complexity *r*/*p* is moderately high, the trimmed subspace methods perform best in all but one case. Section 7 compares six ensemble subspace Lasso methods to the standard CVLasso. For models with a mixture of strong and weak signals, the ensemble methods perform best except when the models are very sparse. The final section gives a summary of results.

#### **2. Ensembling via Random Subspaces**

The following three-step protocol provides the ensemble subspace approach:

1. Repeatedly select random subspaces, that is, random subsets of the covariates.
2. Carry out the statistical analysis (here, a penalized regression fit and its prediction) on each subspace.
3. Merge the results from the subspaces into a final inference.

We consider three approaches to choosing the subsets of **X**-variables:

1. Choose subspaces with $p^*$ covariates, where $p^*$ is the number of distinct covariates after randomly selecting *p* covariates with replacement from the collection of all covariates. Here the random variable $p^*$ is known to have expected value approximately 0.63*p*. Let $\mathbf{x}^*$ denote the distinct covariates and $\mathbf{X}^*$ denote the corresponding design matrix. The subspace data are $(\mathbf{X}^*, \mathbf{y})$, where $\mathbf{y} \in \mathbb{R}^n$ and $\mathbf{X}^* = (x^*_{ij})_{n \times p^*}$. By repeating this procedure *B* times independently and using a method such as the Lasso, we get predictors $\{\widehat{Y}_{0,1}, \dots, \widehat{Y}_{0,B}\}$.
2. Choose *n* covariates without replacement from the *p* covariates, repeating *B* times independently and using a method such as the Lasso, thereby obtaining $\{\widehat{Y}_{0,1}, \dots, \widehat{Y}_{0,B}\}$.
3. Same as 2, except choose *n*/2 covariates.
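A minimal R sketch of these three selection rules (the helper name `select_subspace` is ours, not part of the authors' code) is:

```r
# Draw one random subspace of covariates under approach 1, 2 or 3 described above.
# X is the n x p design matrix; the function returns the reduced design matrix.
select_subspace <- function(X, approach = 1) {
  n <- nrow(X); p <- ncol(X)
  idx <- switch(approach,
    unique(sample(seq_len(p), size = p, replace = TRUE)),    # 1: p draws with replacement, keep the p* distinct ones (E(p*) ~ 0.63p)
    sample(seq_len(p), size = n, replace = FALSE),            # 2: n covariates without replacement
    sample(seq_len(p), size = floor(n / 2), replace = FALSE)  # 3: n/2 covariates without replacement
  )
  X[, idx, drop = FALSE]
}
```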

The final prediction of the response based on a covariate vector $\mathbf{x}_0$ is $\widehat{Y}_0(\mathbf{x}_0) = B^{-1} \sum_{b=1}^{B} \widehat{Y}_{0b}(\mathbf{x}_0)$. Note that the terms in the sum that defines $\widehat{Y}_0(\mathbf{x}_0)$ are identically distributed, but not independent. Thus, with $\widehat{Y}_0 = \widehat{Y}_0(\mathbf{x}_0)$ and $\widehat{Y}_{0b} = \widehat{Y}_{0b}(\mathbf{x}_0)$,

$$\operatorname{Var}(\widehat{Y}_0) = \frac{1}{B}\operatorname{Var}(\widehat{Y}_{01}) + \frac{B-1}{B}\operatorname{Cov}(\widehat{Y}_{01}, \widehat{Y}_{02}) = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2,\tag{2}$$

where $\sigma^2$ is the variance of a single subspace predictor $\widehat{Y}_{0b}$ and $\rho$ is the pairwise correlation between two such predictors. By selecting *B* large, we can make the second term negligible. When $\rho$ is sufficiently small, $\rho\sigma^2$ can in many cases be smaller than the variance of the predictor based on all the covariates. When $\widehat{Y}_0$ is prediction unbiased, that is, $E(\widehat{Y}_0 - Y_0) = 0$, then $\operatorname{Var}(\widehat{Y}_0)$ equals the prediction mean squared error (PMSE). When the subspaces have *n* or fewer variables, OLS is prediction unbiased.
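As an illustration with hypothetical values (not taken from the paper), set $\rho = 0.1$, $\sigma^2 = 1$ and $B = 250$: formula (2) gives $\operatorname{Var}(\widehat{Y}_0) = 0.1 + 0.9/250 \approx 0.104$, roughly a tenth of the variance $\sigma^2 = 1$ of a single subspace predictor, and already dominated by the $\rho\sigma^2$ term.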

#### **3. Prediction on Subspaces**

We consider two approaches for dealing with the Lasso tuning parameter: the cross-validated Lasso and the Trimmed Lasso. The same approaches will be applied to the other penalty methods. Let $\mathbf{X}^* = \{x^*_{ij}\}$ be the subspace design matrix. The Lasso estimate based on a linear model on the subspace is

$$\widehat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \left\{ \frac{1}{2} \|\mathbf{y} - \mathbf{X}^*\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\}.$$

The standard procedure is to choose the tuning parameter *λ* using 10-fold cross-validation (CV), which is denoted CVLasso hereafter. Note that since the size of the subspace design $\mathbf{X}^* = \{x^*_{ij}\}$ changes from subspace to subspace, $\widehat{\boldsymbol{\beta}}$ changes as well and has dimension equal to the number of variables in $\mathbf{X}^*$. It is implemented in the library "glmnet" in R. Cross-validation may sometimes lead to unfortunate choices of *λ* because the random split into training and test samples may not yield a *λ* that gives a good predictor. Thus we will also consider a method based on a collection of fixed *λ*'s. This method, which we call the *Trimmed Lasso (TrLasso)*, uses as predictor the trimmed average (10% in each tail) of Lasso predictors computed from a path of 100 *λ*'s. The path is generated using the library glmnet in R with option "nlambda". The largest lambda, $\lambda_{MAX}$, is the smallest value for which all beta coefficients are zero, while $\lambda_{MIN} = \lambda_{MAX} e^{-6}$. The *λ* values are equally spaced on the log scale. We consider six versions of ensemble subspace methods. In the following, "approach *j*" for *j* = 1, 2 and 3 chooses subspace sizes $p^*$, *n*, and *n*/2, respectively.

ETrLasso (j): For *j* = 1, 2 and 3 use approach (j) to choose the number of variables in each subspace. Then apply TrLasso in each subspace.

ECVLasso (j): For *j* = 1, 2 and 3 use approach (j) to choose the number of variables in each subspace. Then apply CVLasso in each subspace.
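As a rough illustration, the TrLasso building block applied within a single subspace can be sketched in R as follows (assuming the glmnet package; the function name and arguments other than glmnet's are ours). ETrLasso(*j*) then simply averages this quantity over the *B* subspaces drawn by approach *j*:

```r
library(glmnet)

# Trimmed Lasso prediction at a new point x0 (a 1 x p* matrix) from one subspace design Xs:
# fit a Lasso path of nlambda values and take the 10% trimmed average of the predictions.
trlasso_predict <- function(Xs, y, x0, nlambda = 100, trim = 0.10) {
  fit   <- glmnet(Xs, y, alpha = 1, nlambda = nlambda,
                  lambda.min.ratio = exp(-6))       # lambda_MIN = lambda_MAX * e^{-6}, as in the text
  preds <- predict(fit, newx = x0)                  # one column per lambda on the path
  apply(preds, 1, mean, trim = trim)                # trimmed average across the lambda path
}
```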

#### **4. Competitors to Lasso**

#### *4.1. Elastic-Net*

For highly correlated predictor variables, the Lasso tends to select a few of them and shrink the rest to zero; see [6,15] for an extensive discussion. For such cases the Elastic Net, denoted ELNET hereafter, is suggested as a compromise between the ridge and Lasso methods. The estimates of the coefficients can be obtained from:

$$\widehat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \left\{ \frac{1}{2} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \left( \frac{1}{2}(1 - \alpha)\|\boldsymbol{\beta}\|_2^2 + \alpha \|\boldsymbol{\beta}\|_1 \right) \right\},\tag{3}$$

where *α* ∈ [0, 1]. Here *α* = 1 leads to the regular Lasso. The penalty parameters, *λ* and *α*, are two nonnegative tuning parameters.

We examine properties of ELNET using *α* = 0.25, 0.5, and 0.75, while *λ* is treated as for the Lasso. Thus we obtain TrELNET(*α*) and CVELNET(*α*). For ELNET the ensemble subspace method is also carried out as for the Lasso but only using the trimmed (10%) option, resulting in three methods for each *α*. We use the notation TrELNET(*j*, *α*) and ELNET(*j*, *α*), *j* = 1, 2, 3, for the trimmed and CV ensemble subspace options for subspaces of size $p^*$, *n*, and *n*/2. The calculations of these ELNETs, including the Lasso where *α* = 1, are done using the library glmnet in R.
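For example, the elastic net fits can be obtained through glmnet's `alpha` argument; a minimal sketch of the cross-validated variant (with `X`, `y` and a new design matrix `x0` assumed to be in scope) is:

```r
library(glmnet)

# CVELNET(alpha): 10-fold cross-validated elastic net predictions for the three alpha values considered.
alphas <- c(0.25, 0.50, 0.75)
yhat0  <- sapply(alphas, function(a) {
  cvfit <- cv.glmnet(X, y, alpha = a, nfolds = 10)
  predict(cvfit, newx = x0, s = "lambda.min")      # prediction at the CV-selected lambda
})
```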

#### *4.2. Adaptive Lasso*

Ref. [3] introduced the adaptive Lasso for linear regression. It uses a weighted penalty of the form $\sum_{j=1}^{p} w_j |\beta_j|$, where $w_j = 1/|\widehat{\beta}_j|$ and $\widehat{\beta}_j$ is a preliminary estimate of $\beta_j$, and

$$\widehat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \left\{ \frac{1}{2} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j| \right\}. \tag{4}$$

The preliminary beta estimate is typically the ridge estimate; we use that in our simulation studies. The adaptive Lasso is also computed as a 10% trimmed average of predictors for a sequence of *λ*'s and as the predictor obtained when *λ* is selected using CV. They are denoted as TrALasso and CVALasso, respectively. We consider these methods for the proposed ensemble subspace procedures and denote them as ETrAlasso(*j*) and ECVAlasso(*j*), *j* = 1, 2, 3.
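A minimal R sketch of the ridge-weighted adaptive Lasso (assuming glmnet; the weights enter through glmnet's `penalty.factor` argument, and all object names are ours) is:

```r
library(glmnet)

# Preliminary ridge estimate (alpha = 0), then per-coefficient weights w_j = 1 / |beta_ridge_j|.
ridge_cv <- cv.glmnet(X, y, alpha = 0)
beta_r   <- as.numeric(coef(ridge_cv, s = "lambda.min"))[-1]   # drop the intercept
w        <- 1 / pmax(abs(beta_r), 1e-8)                        # guard against division by zero

# CVALasso: weighted Lasso with lambda chosen by 10-fold cross-validation.
alasso_cv <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
```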

#### *4.3. Lars*

Least angle regression, also called LARS, was developed in [5]. It uses a model selection algorithm based on forward selection that enables the procedure to select a parsimonious set of predictors for efficient prediction of a response variable from an available large collection of possible covariates. It improves computational efficiency compared to the Lasso. As in Section 3, LARS is considered with trimming and with CV in prediction; these are denoted as TrLARS and CVLARS, respectively. We consider the trimmed and CV versions of these methods for the proposed ensemble subspace procedure and denote them as ETrLARS(*j*) and ECVLARS(*j*), *j* = 1, 2, 3. The calculation of LARS is done using the lars library in R.

#### *4.4. Scad*

Ref. [4] introduced the SCAD penalty for linear regression. It is a symmetric and quadratic spline on the reals whose first order derivative is

$$\text{SCAD}'_{\lambda,a}(x) = \lambda\left\{ I(|x| \le \lambda) + \frac{(a\lambda - |x|)_+}{(a-1)\lambda} I(|x| > \lambda) \right\},\tag{5}$$

where *λ* > 0 and *a* = 3.7 as recommended by [4]. The SCAD penalty is continuously differentiable and can produce sparse solutions and nearly unbiased estimates for sparse models with large beta coefficients. The CV and trimmed versions of SCAD will be labeled as CVSCAD and TrSCAD, while the ensemble subspace methods will be ECVSCAD(*j*) and ETrSCAD(*j*), *j* = 1, 2, 3.
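For reference, the derivative in (5) can be transcribed directly into R (the function name is ours):

```r
# First-order derivative of the SCAD penalty, Equation (5), with a = 3.7 as the default.
scad_deriv <- function(x, lambda, a = 3.7) {
  lambda * (abs(x) <= lambda) +
    pmax(a * lambda - abs(x), 0) / (a - 1) * (abs(x) > lambda)
}
```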

#### **5. Mean Squared Prediction Error (MSPE)**

#### *5.1. (a) Random Covariates, Simulated Data*

To examine prediction error, we generate a training set $\mathcal{D} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$ using the simulation model under consideration, and for each method considered obtain a predictor of the form $\widehat{y}_i = \sum_{j=1}^{p} \widehat{\beta}_j x_{ij}$, $i = 1, \dots, n$. To explore the performance of the proposed methods on data not used in producing the prediction formula, we independently generate a test set $\mathcal{D}_0 = \{(\mathbf{x}_{01}, y_{01}), \dots, (\mathbf{x}_{0n_0}, y_{0n_0})\}$ and compute

$$\text{MSPE} = \frac{1}{n\_0} \sum\_{i=1}^{n\_0} (y\_{0i} - \hat{y}\_{0i})^2 \text{.} $$

where

$$
\widehat{y}_{0i} = \sum_{j=1}^{p} \widehat{\beta}_j x_{0ij}, \quad i = 1, \dots, n_0,
$$

is the predicted value of $y_{0i}$ based on $\mathbf{x}_{0i}$. We use $n_0 = 0.3n$ in the simulation studies. We repeat the process of generating independent collections of training and test sets *M* = 2000 times, thereby obtaining $\text{MSPE}_1, \dots, \text{MSPE}_M$. We measure the efficiency of a predictor $\widehat{Y}$ by comparing it to the standard method, the Lasso with cross-validation:

$$\operatorname{EFF}(\widehat{Y}) = \frac{1}{M} \sum_{b=1}^{M} \frac{\text{MSPE}_b(\text{CVLasso})}{\text{MSPE}_b(\widehat{Y})}, \tag{6}$$

where the sum is over the *M* simulation replications and, as mentioned earlier, for the Lasso the standard procedure is to choose the tuning parameter *λ* using 10-fold cross-validation (CV).
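In code, the two criteria are straightforward; a minimal sketch (function names ours) assuming vectors of replication-wise MSPEs is:

```r
# Mean squared prediction error on a test set.
mspe <- function(y0, yhat0) mean((y0 - yhat0)^2)

# Efficiency criterion (6): average over the M replications of the MSPE ratio relative to CVLasso;
# values above 1 mean the competing predictor beats CVLasso on average.
eff <- function(mspe_cvlasso, mspe_method) mean(mspe_cvlasso / mspe_method)
```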

#### *5.2. (b) Fixed Covariate, Simulated and Real Data*

Let $\mathcal{D} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$, $\mathbf{x} \in \mathbb{R}^p$ and $y \in \mathbb{R}$, denote a real or simulated data set with random *y*'s and fixed **x**'s. Split this set into a test set $\mathcal{D}_0$ with $n_0$ data vectors and a training set $\mathcal{D}_1$ with the remaining $n_1$ data vectors, where $n_0 = 0.3n$ and $n_1 = 0.7n$. For each of the discussed methods, the training set is used to produce a prediction algorithm that is used to predict the *y*'s in the test set. The MSPE is then $\text{MSPE} = \frac{1}{n_0} \sum_{i=1}^{n_0} (y_{0i} - \widehat{y}_{0i})^2$, where $\widehat{y}_{0i}$ is the predicted value of $y_{0i}$ based on the **x**'s in the test set. Next we compute the ratio with respect to the CVLasso MSPE. This procedure is repeated 2000 times and the average is the final $\operatorname{EFF}(\widehat{Y})$. For simulated experiments, an additional *M* = 2000 repetitions are carried out.
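A hedged sketch of this repeated-split evaluation (assuming glmnet for the CVLasso baseline; `fit_predict` is a placeholder of ours standing for any competing predictor, not the paper's code) is:

```r
# Repeated 70/30 splits with fixed covariates: each replication refits CVLasso and a competing
# method on the training part and compares their test-set MSPEs; the average ratio is EFF.
eff_fixed_design <- function(X, y, fit_predict, M = 2000) {
  n <- nrow(X)
  ratios <- replicate(M, {
    test    <- sample(seq_len(n), size = round(0.3 * n))
    cvfit   <- glmnet::cv.glmnet(X[-test, ], y[-test])
    mspe_cv <- mean((y[test] - predict(cvfit, newx = X[test, ], s = "lambda.min"))^2)
    mspe_m  <- mean((y[test] - fit_predict(X[-test, ], y[-test], X[test, ]))^2)
    mspe_cv / mspe_m
  })
  mean(ratios)
}
```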

#### **6. Efficiency Result for Lasso Competitors**

In the following, we compare the accuracy of the methods presented in Sections 3 and 4. The results are presented with *B* = 250 subspaces; we also tried *B* = 500, but since the results were nearly the same, they are not presented here. We examine the relative performance of the methods as a function of the complexity index, which is defined as the ratio *r*/*p* of the number of covariates that are relevant for the response *y* to the total number of covariates.

#### *6.1. Syndrome Gene Data*

Ref. [20] studied expression quantitative trait locus mapping in the laboratory rat to gain a broad perspective of gene regulation in the mammalian eye and to identify genetic variation relevant to human eye disease. The dataset, which is available in the flare library in R, has *n* = 120 observations and *p* = 200 predictors; it includes the expression level of the TRIM32 gene, which is treated as the dependent variable. To compare the accuracy of the proposed methods on this dataset, we randomly select 30% of the data as a test set, consider the rest as a training set, and calculate the relative efficiency $\operatorname{EFF}(\widehat{Y})$ with respect to CVLasso. We repeat the procedure of selecting training and test sets 2000 times, which provides good accuracy. The results are reported in Table 1.

Among the seven Lasso-type competitors to CVLasso, the most efficient in terms of $\operatorname{EFF}(\widehat{Y})$ is the one based on subspaces of size *n*/2 = 60 and on a trimmed average of Lasso predictors computed for a sequence of *λ* tuning parameters. We found that it improves on CVLasso 83% of the time. However, the average of the mean squared prediction error ratios is $\operatorname{EFF}(\widehat{Y}) = 1.11$, so the improvement does not appear to be substantial.

Turning to the other procedures in Table 1, we see that, generally, the best performance is obtained for the trimmed ensemble versions based on subspaces of size *n*/2, except for the adaptive Lasso, which is best for subspace size *n*. Generally, the improvement of the ensemble methods over CVLasso is about 1.1 in terms of $\operatorname{EFF}(\widehat{Y})$. Moreover, the performance of these methods is very close, including the ELNET methods with different *α*. That is, using subspaces and a robust trimmed average of response predictors obtained from the path of glmnet lambdas is more efficient than using the predictor based on the lambda selected by glmnet cross-validation. The improvement achieved by the trimmed ensemble versions of SCAD based on subspaces of size *n*/2 over the basic (CV and trimmed) versions of SCAD is striking.


**Table 1.** Efficiencies with respect to CVLasso for the Syndrome Gene data.

#### *6.2. Simulation Efficiency Results*

We next used a modification of a model set forth by [21]. We set *p* = 1000, and in contrast to the Syndrome Gene inspired model, we now use i.i.d. random **x**'s, as indicated in model (7). The model provides a large range of *β* values corresponding to strong, moderate and weak covariate signals. The correlations between covariates range from 0.28 to 0.94.

$$\begin{aligned}
&\mathbf{X} \sim N(M, \Sigma), \\
&M = (\mu_i)_{i=1,\dots,p}, \quad \mu_i \overset{i.i.d.}{\sim} N(5, 2), \\
&\Sigma = (\sigma_{i,j})_{i,j=1,\dots,p}, \quad \sigma_{i,j} = \sigma_{j,i} \overset{i.i.d.}{\sim} \text{Unif}(0.4, 0.6),\ i \neq j, \\
&\sigma_{i,i} \sim \text{Unif}(0.8, 1.2), \\
&\beta_{j_0+1}, \dots, \beta_{j_0+r} \overset{i.i.d.}{\sim} \text{Unif}(-2, 2),\ j_0 \in \{1, \dots, p-r\}, \\
&\beta_j = 0 \text{ for all other } j, \\
&y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \text{ with } \varepsilon_i \overset{i.i.d.}{\sim} N(0, 0.15),\ i = 1, \dots, n.
\end{aligned} \tag{7}$$

Using this model, we generate $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$, *n* = 180. Tables 2–5 give the mean of the efficiency criteria over *M* = 2000 trials. The numbers in parentheses are standard deviations (SD). We next discuss the results for the case with *r* = 150 relevant variables. Here *k* denotes the number of covariates in the subspaces, and $p^*$ is the number of distinct variables in a bootstrap sample from the set of covariates.
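A hedged R sketch of one draw from model (7) is given below; the paper does not state how positive-definiteness of the random Σ is enforced, so we project to the nearest positive-definite matrix via Matrix::nearPD, and we read the second parameters of N(5, 2) and N(0, 0.15) as variances (both are assumptions of this sketch):

```r
library(MASS)     # mvrnorm
library(Matrix)   # nearPD

simulate_model7 <- function(n = 180, p = 1000, r = 150) {
  mu <- rnorm(p, mean = 5, sd = sqrt(2))                       # mu_i ~ N(5, 2), read as variance 2
  Sigma <- matrix(runif(p * p, 0.4, 0.6), p, p)
  Sigma[lower.tri(Sigma)] <- t(Sigma)[lower.tri(Sigma)]        # symmetrize the off-diagonals
  diag(Sigma) <- runif(p, 0.8, 1.2)
  Sigma <- as.matrix(nearPD(Sigma)$mat)                        # nearest positive-definite projection
  X <- mvrnorm(n, mu = mu, Sigma = Sigma)
  beta <- numeric(p)
  j0 <- sample(seq_len(p - r), 1)
  beta[(j0 + 1):(j0 + r)] <- runif(r, -2, 2)                   # r consecutive nonzero coefficients
  y <- as.numeric(X %*% beta) + rnorm(n, sd = sqrt(0.15))      # eps_i ~ N(0, 0.15), read as variance
  list(X = X, y = y, beta = beta)
}
```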

**Table 2.** Efficiencies of trimmed mean methods with respect to the CVLasso for the model (7) with complexity index *r*/*p* = 0.15.


**Table 3.** Efficiencies of cross validated methods with respect to the CVLasso for the model (7) with complexity index *r*/*p* = 0.15.


#### 6.2.1. Results for *r*/*p* = 0.15

#### (a) Lasso Based Methods

Trimmed Lasso based on all *p* = 1000 covariates performs best, with ensemble trimmed Lasso with $k = p^*$ a close second. Ensemble CVLasso performs poorly for all *k*. The trimming approach dominates the cross-validation approach.

#### (b) ELNET Based Methods

CV and trimmed ELNET based on all *p* = 1000 covariates are close and better than the ensemble methods and CVLasso. The value of *α* in ELNET does not make much difference. Among ensemble methods, the trimmed version with $k = p^*$ and *α* = 0.75 is the best; it is slightly better than CVLasso.

#### (c) LARS Based Methods

The trimmed and CV ensemble subspace methods with *k* = *p* ∗ are best with the trimmed version slightly better. Both are better than CV Lasso.

#### (d) Adaptive Lasso Based Methods

CV ensemble adaptive Lasso based on subspaces with *k* = *p* ∗ is best among all methods.

#### (e) SCAD Based Methods

For this model, SCAD does poorly for all but one version, presumably because it produces poor predictors for *β*'s that are close to zero. The one version that does well is the trimmed ensemble method with *k* = *p* ∗ variables.

#### 6.2.2. Results for *r*/*p* = 0.30

#### (a) Lasso Based Methods

Trimmed ensemble Lasso based on $p^*$ covariates in the subspaces performs best. The trimming approach outperforms the CV approach for each value of *k*.

#### (b) ELNET Based Methods

Trimmed ensemble ELNET based on *p* ∗ covariates performs best. The trimming approach outperforms the CV approach for each *k*. The value of *α* does not make much difference.

#### (c) LARS Based Methods

Trimmed ensemble LARS based on *p* ∗ covariates is best among all LARS methods. Trimmed methods outperform CV methods.

#### (d) Adaptive Lasso Based Methods

CV Adaptive ensemble Lasso based on subspaces with *p* ∗ covariates is best among all methods. Trimmed methods outperform CV methods except when *k* = *p* ∗ .

#### (e) SCAD Based Methods

Trimmed ensemble SCAD with $p^*$ covariates in the subspaces does well. Trimmed ensemble versions outperform the CV versions and the *k* = 1000 version.


**Table 4.** Efficiencies of trimmed methods with respect to the CVLasso for the model (7) with *r*/*p* = 0.3.

**Table 5.** Efficiencies of cross validated methods with respect to the CVLasso for the model (7) with *r*/*p* = 0.3.


#### 6.2.3. Overall Summary

Tables 2–5 show that the ensemble and trimming methods can improve on the CV Lasso. Overall, the CV ensemble adaptive Lasso based on subspaces with $p^*$ covariates performs best. For *r*/*p* = 0.30, that is, 30% complexity, the ensemble subspace approach with $p^*$ covariates does best overall and the trimmed approach is best except for the adaptive Lasso. When *r*/*p* = 0.15, the results are less clear, except that the ensemble subspace approach with $p^*$ covariates yields the overall best result when coupled with the adaptive Lasso. The overall superior performance of ensemble subspace methods based on $p^*$ can in part be explained by formula (2), because the $p^*$ methods produce predictors that are weakly correlated.

#### **7. Comparison of CV and Trimmed Lasso Methods**

#### *7.1. Syndrome Gene Data Inspired Simulation Model*

Simulation based on real data is very important from an application perspective, because the structure of the underlying population is often unknown. In this subsection, we use **x** from [20] as described in Section 6.1. That is, we use non-random covariates to compare the efficiencies of the proposed Lasso-based methods on this dataset as a function of the complexity index *r*/*p*. We randomly selected *r* predictor variables from the *p* = 200 predictors, where *r*/*p* ranges from 0 to 0.5, and used the following model with *r* covariates relevant to the response *Y*.

$$\begin{aligned}
\beta_{j_0+1}, \dots, \beta_{j_0+r} &\overset{i.i.d.}{\sim} \text{Unif}(-2, 2),\ j_0 \in \{1, \dots, 200-r\}, \\
\beta_j &= 0 \text{ for all other } j, \\
y_i &= \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \text{ with } \varepsilon_i \overset{i.i.d.}{\sim} N(0, 0.4).
\end{aligned} \tag{8}$$

The average of the standard deviations of the predictors is 0.28, so we considered $\varepsilon \sim N(0, 0.4)$. We then calculated the discussed efficiencies of the proposed methods using *M* = 2000. The results are reported in Figure 1. It shows that for *r*/*p* less than 0.29 the cross-validated Lasso has the best performance. For *r*/*p* larger than 0.29, the trimmed subspace version with *n* variables in the subspaces is best, with the cross-validated ensemble Lasso with $p^*$ covariates a close second. This CV ensemble Lasso is also second best for *r*/*p* < 0.29. For *r*/*p* < 0.29, the performance of the other subspace methods is poor.
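A minimal sketch of one draw from model (8), with the fixed 120 × 200 design matrix `X` from Section 6.1 and reading N(0, 0.4) as a variance (an assumption of this sketch), is:

```r
# One simulated response vector from model (8): r consecutive coefficients are Unif(-2, 2),
# the rest are zero, and Gaussian noise is added to the linear predictor.
simulate_model8 <- function(X, r, noise_var = 0.4) {
  n <- nrow(X); p <- ncol(X)
  beta <- numeric(p)
  j0 <- sample(seq_len(p - r), 1)
  beta[(j0 + 1):(j0 + r)] <- runif(r, -2, 2)
  y <- as.numeric(X %*% beta) + rnorm(n, sd = sqrt(noise_var))
  list(y = y, beta = beta)
}
```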

**Figure 1.** Efficiencies of the Lasso ensemble subspace methods with respect to the CVLasso for the Syndrome Gene inspired simulation model, with different complexity indices *r*/*p*.

To summarize, in terms of prediction error, for sparse models the cross-validated Lasso based on all covariates performs best, while for models with *r*/*p* larger than 0.29, the trimmed ensemble Lasso based on subspaces of size *n* performs best.

#### *7.2. Simulated Models with Random Covariates*

#### 7.2.1. (a) Strong and Weak Signals. Strong Covariate Correlations

We consider model (7) with values of *r*/*p* ranging from 0 to 0.5. The results in Figure 2 show that the ensemble CV Lasso based on subspaces with $p^*$ covariates improves on the CV Lasso for all values of the complexity index *r*/*p*. The ensemble trimmed Lasso with $p^*$ covariates is best for 0.07 < *r*/*p* < 0.3, while the ensemble trimmed Lasso with *n* covariates in each subspace is best for *r*/*p* > 0.3. The ensemble CV Lassos with *n* and *n*/2 covariates are slightly worse than CV Lasso.

To summarize, the ensemble methods with $p^*$ covariates in the subspaces perform very well when compared to the CV Lasso. The ensemble trimmed Lasso versions are best for values of *r*/*p* larger than 0.2. This shows that when there are many covariates with strong and weak signals, cross-validation may lead to a poor choice of the tuning parameter *λ*.

**Figure 2.** Efficiencies of the Lasso ensemble subspace methods with respect to the CVLasso for the model (7), with different complexity indices *r*/*p*.

#### 7.2.2. (b) Strong and Weak Signals. Weak Covariate Correlations

We consider model (7) with *σij* replaced by

$$
\sigma_{i,j} \sim \text{Unif}(0.0, 0.2). \tag{9}
$$

Figure 3 shows that the dominance of the ensemble trimmed Lasso methods holds for *r*/*p* > 0.09. In other words, when the correlations between the covariates are weak and the complexity of the model is more than 0.09, it is better to use the trimmed average of ensemble predictors based on a sequence of fixed tuning parameters than to use tuning parameters obtained by cross-validation.

**Figure 3.** Efficiencies of the Lasso ensemble subspace methods with respect to the CVLasso for the model (9), with different complexity indices *r*/*p*.

#### 7.2.3. (c) Strong Signals. Weak Covariate Correlations

We consider model (9) with *β* replaced by

$$
\beta \sim \text{Unif}(2, 3). \tag{10}
$$

Figure 4 shows that for very small complexity (*r*/*p* ≤ 0.020), CV Lasso is best, while for *r*/*p* > 0.020, the ensemble trimmed Lasso with $p^*$ covariates in the subspaces improves on CV Lasso and does very well overall. For *r*/*p* > 0.15, the ensemble trimmed Lasso with *n* covariates in the subspaces is best. The trimmed ensemble versions do better than the CV ensemble versions for *r*/*p* > 0.025.

**Figure 4.** Efficiencies of the Lasso ensemble subspace methods with respect to the CVLasso for the model (10), with different complexity indices *r*/*p*.

#### 7.2.4. (d) Weak Signal. Weak and Strong Correlation between Covariates

These two cases had very similar results. Here we give only the case where we use model (9) with

$$
\beta \sim \text{Unif}(-0.2, 0.2). \tag{11}
$$

Figure 5 shows that in this case the ensemble trimmed Lasso methods with $p^*$ and with *n* covariates in the subspaces do poorly. The ensemble CV Lasso methods perform at the same level as CV Lasso, as does the ensemble trimmed mean approach with *k* = *n*/2.

**Figure 5.** Efficiencies of the Lasso ensemble subspace methods with respect to the CVLasso for the model (11), with different complexity indices *r*/*p*.

#### **8. Conclusions**

This article explores the random ensemble subspace approach for high-dimensional data analysis. This technique splits the data into covariate subspaces and generates models and methods on each covariate subspace. Merging and assembling the methods provides a global solution to the high-dimensional data analysis challenge. Let *n* denote the sample size and *p* the number of covariates, with *p* ≫ *n*. We consider three different approaches to selecting subspaces, repeatedly selecting subspaces as follows: (1) *p* covariates with replacement from the *p* covariates, then use the distinct covariates to form subspaces, (2) *n* covariates at random without replacement, and (3) *n*/2 covariates at random without replacement. This approach is applied to a variety of penalty methods and compared to the cross-validation (CV) Lasso using the mean squared prediction error (MSPE). We consider MSPE as a function of model complexity, which is defined as *r*/*p*, where *r* is the number of covariates that are associated with the response, and find that when *r*/*p* is moderate to large, the cross-validation ensemble subspace approach improves on the CVLasso that uses all *p* covariates in one step. We also introduced an alternative to cross-validation that consists of computing predictors for a fixed set of data-based tuning parameters and using these predictors' trimmed mean. This approach works well when the ratio *r*/*p* is above 0.2.

To facilitate communication among researchers, enable possible collaborations between scientists across disciplines, and support open science, the code implementing the end-to-end protocol of this manuscript is written in R and is available on request.

**Author Contributions:** Conceptualization, S.E.A., S.A. and K.D.; Formal analysis, S.E.A., S.A. and K.D.; Funding acquisition, S.E.A.; Investigation, K.D.; Methodology, S.E.A., S.A. and K.D.; Writing—review & editing, S.E.A., S.A. and K.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research is supported by the Natural Sciences and the Engineering Research Council of Canada (NSERC).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We would like to thank the constructive comments by four anonymous referees and an associate editor which improved the quality and the presentation of our results.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


