Article

Parameter Estimation with the Ordered ℓ2 Regularization via an Alternating Direction Method of Multipliers

by Mahammad Humayoo 1,2,* and Xueqi Cheng 1
1 CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2019, 9(20), 4291; https://doi.org/10.3390/app9204291
Submission received: 9 September 2019 / Revised: 7 October 2019 / Accepted: 8 October 2019 / Published: 12 October 2019
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Regularization is a popular technique in machine learning for model estimation and for avoiding overfitting. Prior studies have found that modern ordered regularization can be more effective in handling highly correlated, high-dimensional data than traditional regularization. The reason stems from the fact that the ordered regularization can reject irrelevant variables and yield an accurate estimation of the parameters. How to scale up ordered regularization problems when facing large-scale training data remains an unanswered question. This paper explores the problem of parameter estimation with the ordered ℓ2 regularization via the Alternating Direction Method of Multipliers (ADMM), called ADMM-Oℓ2. The advantages of ADMM-Oℓ2 include (i) scaling the ordered ℓ2 up to large-scale datasets, (ii) predicting parameters correctly by excluding irrelevant variables automatically, and (iii) having a fast convergence rate. Experimental results on both synthetic and real data indicate that ADMM-Oℓ2 performs better than or comparably to several state-of-the-art baselines.

1. Introduction

In the machine learning literature, one of the most important challenges involves estimating parameters accurately and selecting relevant variables from highly correlated, high-dimensional data. Researchers have noticed many highly correlated features in high-dimensional data [1]. Models often overfit or underfit high-dimensional data because they have a large number of variables of which only a few are actually relevant; most of the others are irrelevant or redundant. An underfitting model contributes to estimation bias (i.e., high bias and low variance) because it keeps out relevant variables, whereas an overfitting model raises estimation error (i.e., low bias and high variance) because it includes irrelevant variables in the model.
To illustrate an application of our proposed method, consider a study of gene expression data. Such a dataset is high-dimensional and contains highly correlated genes. Geneticists typically want to determine which variants/genes contribute to changes in biological phenomena (e.g., increases in blood cholesterol level) [2]; therefore, the aim is to explicitly identify all relevant variants. Penalized regularization models such as ℓ1, ℓ2, and so forth have recently become a topic of great interest within the machine learning, statistics [1], and optimization [3] communities as classic approaches to estimating parameters. The ℓ1-based method is not a preferred selection method for groups of variables among which pairwise correlations are significant, because the lasso arbitrarily selects a single variable from the group without any consideration of which one to select [4]. Furthermore, if the selected value of the regularization parameter is too small, the ℓ1-based method selects many irrelevant variables, thus degrading its performance; on the other hand, a large value of the parameter yields a large bias [5]. Another point worth noting is that some ℓ1 regularization methods are adaptive, computationally tractable, or distributed, but no such method combines all three properties. Therefore, the aim of this study is to develop a model for parameter estimation and relevant-variable identification in highly correlated, high-dimensional data based on the ordered ℓ2. This model has all three properties together: adaptive (our method is adaptive in the sense that, owing to its rank-based penalization, the cost of including new relevant variables decreases as more variables are added to the model), tractable (a computationally intractable method is an algorithm that takes an impractically long time to compute a solution; a computationally tractable method is the opposite), and distributed.
Several adaptive and nonadaptive methods have been proposed for parameter estimation and variable selection in large-scale datasets, and different principles are adopted in these procedures to estimate parameters. For example, an adaptive solution, the ordered ℓ1 [5], is a norm and, therefore, convex. In the ordered ℓ1, the regularization parameters are sorted in non-increasing order, so that regression coefficients are penalized according to their rank: coefficients closer to the top receive larger penalties. Pan et al. [6] proposed a partial sorted ℓp norm, which is non-convex and non-smooth. In contrast, the ordered ℓ2 regularization is convex and smooth, just as the standard ℓ2 norm is convex and smooth in Reference [7]. Pan et al. [6] considered values of p in the range 0 < p ≤ 1, which do not cover the ℓ2 norm, among others; they did not provide details of other partially sorted norms when p ≥ 2 and used random projection together with the partial sorted ℓp norm to complete the parameter estimation, whereas we use ADMM with the ordered ℓ2. A nonadaptive solution, the elastic net [8], mixes the ordinary ℓ1 and ℓ2 penalties. It is particularly useful when the number of predictors (p) is much larger than the number of observations (n) or in any situation where the predictor variables are correlated.
Table 1 presents the important properties of the regularizers. As seen in Table 1, the ℓ2 and ordered ℓ2 regularizers are more suitable for highly correlated, high-dimensional grouping data than the ℓ1 and ordered ℓ1 regularizers. The ordered ℓ2 encourages grouping, whereas most ℓ1-based methods promote sparsity; here, grouping signifies a group of strongly correlated variables in high-dimensional data. We use the ordered ℓ2 regularization in our method instead of ℓ2 regularization because the ordered ℓ2 regularization is adaptive. Finally, ADMM behaves in a parallel fashion when solving large-scale convex optimization problems; our model employs ADMM and inherits the distributed properties of native ADMM [9], so our model is also distributed. Bogdan et al. [5] did not provide any details about how they applied ADMM to the ordered ℓ1 regularization.
In this paper, we propose “Parameter Estimation with the Ordered ℓ2 Regularization via ADMM”, called ADMM-Oℓ2, to find the relevant parameters of a model. Just as ℓ2 regularization yields ridge regression, the ordered ℓ2 yields an ordered ridge regression. The main contribution of this paper is not to present a superior method but rather to introduce a quasi-version of the ℓ2 regularization method and to concurrently raise awareness of the existing methods. As part of this research, we introduce a modern ordered ℓ2 regularization method and prove that the square root of the ordered ℓ2 is a norm and, thus, convex; therefore, it is also tractable. In addition, we propose an ordered elastic net that combines the widely used ordered ℓ1 penalty with the modern ordered ℓ2 penalty for ridge regression. To the best of our knowledge, this is one of the first methods to use the ordered ℓ2 regularization with ADMM for parameter estimation and variable selection. Section 3 and Section 4 explain the integration of ADMM with the ordered ℓ2 in detail.
The rest of the paper is arranged as follows. Related work is discussed in Section 2, and the ordered ℓ2 regularization is presented in Section 3. Section 4 describes the application of ADMM to the ordered ℓ2. Section 5 presents the experiments conducted. Finally, Section 6 concludes the paper.

2. Related Work

2.1. ℓ1 and ℓ2 Regularization

Deng et al. [14] presented efficient algorithms for group sparse optimization with mixed ℓ2,1 regularization for the estimation and reconstruction of signals; their technique is rooted in a variable splitting strategy and ADMM. Zou and Hastie [8] suggested the elastic net, a generalization of the lasso formed as a linear combination of the ℓ1 and ℓ2 norms; it contributes to sparsity without permitting a coefficient to become too large. Candes and Tao [15] introduced a new estimator called the Dantzig selector for linear models in which the number of parameters is larger than the number of observations and established optimal ℓ2 rate properties under a sparsity assumption. Chen et al. [16] applied sparse embedding to ridge regression, obtaining solutions x̃ with ‖x̃ − x‖₂ ≤ ε‖x‖ small, where x is optimal, in O(nnz(A) + n³/ε²) time, where nnz(A) is the number of nonzero entries of A. Recently, Bogdan et al. [5] proposed an ordered ℓ1 regularization technique inspired by a statistical viewpoint, in particular by a focus on controlling the false discovery rate (FDR) for variable selection in linear regression. Our proposed method is similar but focuses on parameter estimation based on the ordered ℓ2 regularization and ADMM. Several methods have been proposed based on Reference [5] and similar ideas. For example, Bogdan et al. [2] introduced a new model-fitting strategy called Sorted L-One Penalized Estimation (SLOPE), which regularizes least-squares estimates with rank-dependent penalty coefficients. Zeng and Figueiredo [17] proposed DWSL1 as a generalization of the octagonal shrinkage and clustering algorithm (OSCAR) that aims to promote feature grouping without prior knowledge of the group structure. Pan et al. [6] introduced an image restoration method based on a random projection and a partial sorted ℓp norm, in which an input signal is decomposed into a low-rank component, approximated by random projection, and a sparse component, recovered by the partial sorted ℓp norm. Our method can potentially be used in various other domains such as cyber security [18] and recommendation [19].

2.2. ADMM

Researchers have paid significant attention to ADMM, first because of its capability to handle the terms of an objective function independently and simultaneously, and second because ADMM has proved to be a genuine fit for large-scale distributed data optimization. However, ADMM is not a new algorithm; it was first introduced in References [20,21] in the mid-1970s, with roots as far back as the mid-1950s, and it originated from the augmented Lagrangian method of multipliers [22]. It became more popular after Boyd et al. [9] published their work on ADMM. The classic ADMM algorithm applies to the following “ADMM-ready” form of problem:
$$\underset{x,\,z}{\text{minimize}}\;\; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c \tag{1}$$
The wide range of applications has also inspired the study of the convergence properties of ADMM. Under mild assumptions, ADMM converges for all choices of the step size. Ghadimi et al. [23] provided advice on tuning over-relaxed ADMM for quadratic problems. Deng and Yin [24] established linear convergence results when only a single term is strongly convex, given that the linear operators A and B are full-rank matrices; these convergence results bound the error as measured by an approximation of the primal–dual gap. Goldstein et al. [25] created an accelerated version of ADMM that converges more quickly than traditional ADMM under the assumption that both objective functions are strongly convex. Yan and Yin [26] explained in detail the different kinds of convergence properties of ADMM and the prerequisites for convergence. For further studies on ADMM, see Reference [9].

3. Ridge Regression with the Ordered ℓ2 Regularization

3.1. The Ordered ℓ2 Regularization

The parameter estimation and variable selection method proposed in this paper is computationally manageable and adaptive. The procedure depends on the ordered ℓ2 regularization. Let λ = (λ_1, λ_2, …, λ_p) be a non-increasing sequence of nonnegative scalars that satisfies the following condition:
$$\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_p \ge 0 \tag{2}$$
The ordered ℓ2 regularization of a vector x ∈ ℝ^p, when λ_1 > 0, can be defined as follows:
$$J_\lambda(x) = \lambda_1 x_{(1)}^2 + \lambda_2 x_{(2)}^2 + \cdots + \lambda_p x_{(p)}^2 = \sum_{k=1}^{p} \lambda_{BH}(k)\, x_{(k)}^2 \tag{3}$$
where λ_BH(k) is generated by the BHq method [27], which produces an adaptive, non-increasing sequence of values for λ (Reference [2], Section 1.1). The details of λ_BH(k) are given in Section 4.2; for ease of presentation, we write λ_k in place of λ_BH(k) in the rest of the paper. Here x_(1)^2 ≥ x_(2)^2 ≥ x_(3)^2 ≥ … ≥ x_(p)^2 is the order statistic of the magnitudes of x [28]; the subscript k of x enclosed in parentheses indicates the kth order statistic of a sample. Suppose that x is a sample of size 4 with values x = (−2.1, −0.5, 3.2, 7.2). The order statistics of x would then be x_(1)^2 = 7.2^2, x_(2)^2 = 3.2^2, x_(3)^2 = 2.1^2, x_(4)^2 = 0.5^2. The ordered ℓ2 regularization is thus the largest value of λ times the square of the largest entry of x, plus the second largest value of λ times the square of the second largest entry of x, and so on. Let A ∈ ℝ^{n×p} be a matrix and b ∈ ℝ^n a vector. The ordered ℓ2 regularized loss minimization can then be expressed as follows:
$$\min_{x \in \mathbb{R}^p} \; \frac{1}{2}\|Ax - b\|_2^2 + \frac{1}{2}\left\{\lambda_1 x_{(1)}^2 + \cdots + \lambda_p x_{(p)}^2\right\} \tag{4}$$
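To make the definition concrete, the following short Python sketch evaluates J_λ(x) for a given vector and a non-increasing weight sequence; the function name and the example weights are ours, chosen only for illustration.

```python
import numpy as np

def ordered_l2_penalty(x, lam):
    """Evaluate J_lambda(x) = sum_k lambda_k * x_(k)^2, where x_(1)^2 >= ... >= x_(p)^2
    are the squared entries of x sorted in non-increasing order and lam is the
    non-increasing sequence lambda_1 >= ... >= lambda_p >= 0 (Eq. (2))."""
    x = np.asarray(x, dtype=float)
    lam = np.asarray(lam, dtype=float)
    x_sq_sorted = np.sort(x ** 2)[::-1]   # squared magnitudes, largest first
    return float(lam @ x_sq_sorted)

# Worked example from the text: x = (-2.1, -0.5, 3.2, 7.2); the weights are hypothetical.
x = [-2.1, -0.5, 3.2, 7.2]
lam = [0.4, 0.3, 0.2, 0.1]
print(ordered_l2_penalty(x, lam))  # 0.4*7.2^2 + 0.3*3.2^2 + 0.2*2.1^2 + 0.1*0.5^2 = 24.715
```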
Theorem 1.
The square root of J_λ(x) (Equation (3)) is a norm on ℝ^p, that is, a function ‖·‖ : ℝ^p → ℝ satisfying the following three properties; consequently, Corollaries 1 and 2 below hold.
i
(Positivity) ‖x‖ ≥ 0 for any x ∈ ℝ^p, and ‖x‖ = 0 if and only if x = 0.
ii
(Homogeneity) ‖cx‖ = |c| ‖x‖ for any x ∈ ℝ^p and c ∈ ℝ.
iii
(Triangle inequality) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for any x, y ∈ ℝ^p.
Note: ‖x‖ and ‖x‖_2 are used interchangeably.
Corollary 1.
When all the λ_k take on an equal positive value λ, J_λ(x) reduces to λ‖x‖_2^2, a positive multiple of the square of the usual ℓ2 norm.
Corollary 2.
When λ_1 > 0 and λ_2 = ⋯ = λ_p = 0, the square root of J_λ(x) reduces to a positive multiple of the ℓ∞ norm.
Proofs of the theorem and corollaries are provided in Appendix A. Table 2 shows the notations used in this paper and their meanings.
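As a sanity check on Theorem 1 (not a substitute for the proofs in Appendix A), the following Python snippet numerically spot-checks the three norm properties of the square root of J_λ(x) on random vectors; the weight sequence is generated arbitrarily and is only assumed to be positive and non-increasing.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50
lam = np.sort(rng.uniform(0.1, 1.0, p))[::-1]   # positive, non-increasing weights

def ordered_l2_norm(x, lam):
    """Square root of J_lambda(x): the candidate norm of Theorem 1."""
    return np.sqrt(lam @ np.sort(x ** 2)[::-1])

for _ in range(1000):
    x, y = rng.standard_normal(p), rng.standard_normal(p)
    c = rng.standard_normal()
    assert ordered_l2_norm(x, lam) >= 0.0                        # positivity
    assert np.isclose(ordered_l2_norm(c * x, lam),
                      abs(c) * ordered_l2_norm(x, lam))          # homogeneity
    assert (ordered_l2_norm(x + y, lam)
            <= ordered_l2_norm(x, lam) + ordered_l2_norm(y, lam) + 1e-12)  # triangle inequality
```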

3.2. The Ordered Ridge Regression

We propose an ordered ridge regression in Equation (5); we call it the ordered ridge regression because the objective function uses the ordered ℓ2 regularization instead of the standard ℓ2 regularization. The ordered ridge regression is intended for parameter estimation and variable selection, particularly when the data are strongly correlated and high-dimensional. It can be defined as follows:
$$\min_{x \in \mathbb{R}^p} \; \frac{1}{2}\|Ax - b\|_2^2 + \frac{1}{2} J_\lambda(x) = \min_{x \in \mathbb{R}^p} \; \frac{1}{2}\|Ax - b\|_2^2 + \frac{1}{2}\sum_{k=1}^{p} \lambda_k |x_{(k)}|^2 \tag{5}$$
where x ∈ ℝ^p denotes the unknown regression coefficients, A ∈ ℝ^{n×p} (p ≫ n) is a known matrix, b ∈ ℝ^n is the response vector, and J_λ(x) is the ordered ℓ2 regularization. The optimal parameter choice for the ordered ridge regression is much more stable than that for the regular lasso, and it achieves adaptivity in the following senses.
i
For a decreasing sequence (λ_k), each parameter λ_k marks the entry or removal of some variable from the current model (its coefficient becomes nonzero or zero, respectively); the remaining coefficients stay unchanged in the model. We achieve this by setting threshold values for λ_k (Reference [5], Section 1.4).
ii
We observed that the price of including new variables declines as λ_k decreases, i.e., as more variables are added to the model.

4. Applying ADMM to the Ordered Ridge Regression

In order to apply ADMM to the problem in Equation (5), we first transform it into an equivalent form of the problem in Equation (1) by introducing an auxiliary variable z.
$$\min_{x,\,z \in \mathbb{R}^p} \; \frac{1}{2}\|Ax - b\|_2^2 + \frac{1}{2} J_\lambda(z) \quad \text{s.t.} \quad x - z = 0 \tag{6}$$
We can see that Equation (6) has two blocks of variables (i.e., x and z). Its objective function is separable in the form of Equation (1), since f(x) = ½‖Ax − b‖_2^2 and g(z) = ½ J_λ(z) = ½ Σ_{k=1}^{p} λ_k |z_(k)|^2, with A = I and B = −I. Therefore, ADMM is applicable to Equation (6). The augmented Lagrangian of Equation (6) can be defined as follows:
$$L_\rho(x, z, y) = \frac{1}{2}\|Ax - b\|_2^2 + \frac{1}{2} J_\lambda(z) + y^T(x - z) + \frac{\rho}{2}\|x - z\|_2^2 \tag{7}$$
where y ∈ ℝ^p is the Lagrangian multiplier and ρ > 0 denotes a penalty parameter. Next, we apply ADMM to the augmented Lagrangian in Equation (7) (Reference [9], Section 3.1), which yields the following ADMM iterations:
$$\begin{aligned} x^{k+1} &:= \arg\min_{x \in \mathbb{R}^p} L_\rho(x, z^k, y^k) \\ z^{k+1} &:= \arg\min_{z \in \mathbb{R}^p} L_\rho(x^{k+1}, z, y^k) \\ y^{k+1} &:= y^k + \rho\,(x^{k+1} - z^{k+1}) \end{aligned} \tag{8}$$
Proximal gradient methods are well known for solving convex optimization problems in which the objective function is the sum of a smooth loss function and a non-smooth penalty function [9,29,30]. A well-studied example is ℓ1 regularized least squares [1,5]. It should be noted that the ordered ℓ1 norm is convex but not smooth; therefore, those researchers used a proximal gradient method. In contrast, we employ ADMM, because it can solve convex optimization problems in which the objective is the sum of a smooth loss and a non-smooth penalty as well as problems in which both the loss and the penalty are smooth, and ADMM also supports parallelism. In the ordered ridge regression, both the loss and the penalty are smooth, whereas in the ordered elastic net the loss is smooth and the penalty is non-smooth.

4.1. Scaled Form

We can also express ADMM in scaled form by combining the linear and quadratic terms of the augmented Lagrangian and using a scaled dual variable, which is shorter and more convenient. The scaled form of the ADMM iterations in Equation (8) can be expressed as follows:
$$\begin{aligned} x^{k+1} &:= \arg\min_{x \in \mathbb{R}^p} \left( f(x) + (\rho/2)\,\|x - z^k + u^k\|_2^2 \right) &\text{(9a)}\\ z^{k+1} &:= \arg\min_{z \in \mathbb{R}^p} \left( g(z) + (\rho/2)\,\|x^{k+1} - z + u^k\|_2^2 \right) &\text{(9b)}\\ u^{k+1} &:= u^k + x^{k+1} - z^{k+1} &\text{(9c)} \end{aligned}$$
where u = (1/ρ) y is the scaled dual variable. Next, we minimize the augmented Lagrangian in Equation (7) with respect to x and z successively. Minimizing Equation (7) with respect to x gives the x subproblem of Equation (9a), which can be expanded as follows:
$$\min_{x \in \mathbb{R}^p} \; \frac{1}{2}\|Ax - b\|_2^2 + (\rho/2)\,\|x - z^k + u^k\|_2^2 = \min_{x \in \mathbb{R}^p} \; \frac{1}{2}\left\{ x^T A^T A x - 2 b^T A x \right\} + \frac{\rho}{2}\left\{ \|x\|_2^2 - 2\,(z^k - u^k)^T x \right\} \qquad \text{(10a)}$$
We compute the derivative of Equation (10a) with respect to x and set it equal to zero. Since this is a convex problem, minimizing it amounts to solving the following linear system, Equation (10b):
$$A^T A x + \rho x - A^T b - \rho\,(z^k - u^k) = 0 \;\;\Longrightarrow\;\; x^{k+1} = (A^T A + \rho I)^{-1}\left(A^T b + \rho\,(z^k - u^k)\right) \qquad \text{(10b)}$$
Minimizing Equation (7) with respect to z gives Equation (9b), which results in the following z subproblem:
$$\min_{z \in \mathbb{R}^p} \; \frac{1}{2} J_\lambda(z) + \frac{\rho}{2}\|x^{k+1} + u^k - z\|_2^2 = \min_{z \in \mathbb{R}^p} \; \frac{1}{2}\lambda_k z^T z + \frac{\rho}{2}\left(x^{k+1} + u^k - z\right)^T\left(x^{k+1} + u^k - z\right) \qquad \text{(11a)}$$
We compute the derivative of Equation (11a) with respect to z and set it equal to zero. Since this is a convex problem, minimizing it amounts to solving the following linear system, Equation (11b):
$$\frac{1}{2}\cdot 2\lambda_k z + \frac{\rho}{2}\left\{ 2z - 2\left(x^{k+1} + u^k\right) \right\} = 0 \;\;\Longrightarrow\;\; z^{k+1} = (\lambda_k + \rho I)^{-1}\rho\,\left(x^{k+1} + u^k\right) \qquad \text{(11b)}$$
Finally, the multiplier (i.e., the scaled dual variable u) is updated in the following way:
$$u^{k+1} = u^k + \left(x^{k+1} - z^{k+1}\right) \tag{12}$$
Optimality conditions: Primal and dual feasibility are necessary and sufficient optimality conditions for ADMM applied to Equation (6) [9]. The dual residual (S^{k+1}) and the primal residual (γ^{k+1}) can be defined as follows:
$$S^{k+1} = \rho\,(z^{k} - z^{k+1}) \quad \text{(dual residual at iteration } k+1\text{)}, \qquad \gamma^{k+1} = x^{k+1} - z^{k+1} \quad \text{(primal residual at iteration } k+1\text{)}$$
Stopping criteria: The stopping criterion for the ordered ridge regression is that the primal and dual residuals must be small:
$$\|\gamma^k\|_2 \le \epsilon^{pri}, \;\; \text{where } \epsilon^{pri} = \sqrt{p}\,\epsilon^{abs} + \epsilon^{rel}\max\{\|x^k\|_2, \|z^k\|_2\}; \qquad \|S^k\|_2 \le \epsilon^{dual}, \;\; \text{where } \epsilon^{dual} = \sqrt{n}\,\epsilon^{abs} + \epsilon^{rel}\,\rho\,\|u^k\|_2$$
We set ε^abs = 10^{-4} and ε^rel = 10^{-2}. For further details about this choice, see Reference [9], Section 3.
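The following Python sketch puts the pieces of this section together under our reading of the updates: the closed-form x-update of Equation (10b) with the matrix inverse cached outside the loop, the elementwise z-update of Equation (11b) with the ordered weights matched to coordinates by the rank of the magnitudes, the dual update of Equation (12) (with the over-relaxation of Section 4.2; α = 1 recovers the plain iteration), and the stopping rule above. All function and variable names are ours; this is an illustrative sketch, not the authors' reference implementation.

```python
import numpy as np

def admm_ordered_ridge(A, b, lam, rho=1.0, alpha=1.0,
                       eps_abs=1e-4, eps_rel=1e-2, max_iter=1000):
    """ADMM sketch for the ordered ridge regression (Eq. (6)).

    lam: non-increasing sequence lambda_1 >= ... >= lambda_p (e.g., from the BHq procedure).
    The ordered weights are assigned to coordinates by the rank of |x_hat + u|
    (our reading of the ordered penalty)."""
    n, p = A.shape
    x, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    # Cache the factorization of (A^T A + rho*I): it does not change across iterations.
    inv = np.linalg.inv(A.T @ A + rho * np.eye(p))   # O(p^3), done once (Eq. (10b))
    Atb = A.T @ b

    for _ in range(max_iter):
        x = inv @ (Atb + rho * (z - u))              # x-update, Eq. (10b)
        x_hat = alpha * x + (1 - alpha) * z          # over-relaxation, Eq. (13)
        v = x_hat + u
        order = np.argsort(-np.abs(v))               # assign lambda_k by rank of |v|
        lam_assigned = np.empty(p)
        lam_assigned[order] = lam
        z_old = z
        z = rho * v / (lam_assigned + rho)           # z-update, Eq. (11b)
        u = u + x_hat - z                            # dual update, Eq. (12)

        # Stopping rule: primal and dual residuals against the tolerances above.
        r, s = x - z, rho * (z_old - z)
        eps_pri = np.sqrt(p) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
        eps_dual = np.sqrt(n) * eps_abs + eps_rel * np.linalg.norm(rho * u)
        if np.linalg.norm(r) <= eps_pri and np.linalg.norm(s) <= eps_dual:
            break
    return x
```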

4.2. Over-Relaxed ADMM Algorithm

By comparing Equations (1) and (6), we can write Equation (6) in the over-relaxation form as follows:
$$\alpha\, x^{k+1} + (1 - \alpha)\, z^k \tag{13}$$
(where A = I, B = −I, and c = 0 in our case)
Substituting Equation (13) for x^{k+1} in the z-update of Equation (11b) and in the u-update of Equation (12) yields the relaxed updates. Algorithm 1 presents the ADMM iterations for the ordered ridge regression of Equation (6).
Algorithm 1: Over-relaxed ADMM for the ordered ridge regression
1: Initialize x^0 ∈ ℝ^p, z^0 ∈ ℝ^p, u^0 ∈ ℝ^p, ρ > 0
2: while (‖γ^k‖_2 > ε^pri or ‖S^k‖_2 > ε^dual) do
3:   x^{k+1} ← (A^T A + ρI)^{-1}(A^T b + ρ(z^k − u^k))
4:   λ_k ← SortedLambda({λ_k}); refer to Algorithm 2
5:   z^{k+1} ← (λ_k + ρI)^{-1} ρ (α x^{k+1} + (1 − α) z^k + u^k)
6:   u^{k+1} ← u^k + α(x^{k+1} − z^{k+1}) + (1 − α)(z^k − z^{k+1})
7: end while
We note that Algorithm 1 computes an exact solution for each subproblem, and its convergence is guaranteed by existing ADMM theory [24,25,31]. The most computationally intensive operation is the matrix inversion in line 3 of Algorithm 1. Here, the matrix A is high-dimensional (p ≫ n); forming (A^T A + ρI) takes O(np²), and its inverse (A^T A + ρI)^{-1} takes O(p³). We compute (A^T A + ρI)^{-1} and A^T b outside the loop; what remains inside the loop is the product of the cached inverse with (A^T b + ρ(z^k − u^k)), which is O(p²), while additions and subtractions take O(p). Since (A^T A + ρI)^{-1} is cacheable, the complexity is heuristically O(p³) + k·O(np² + p), with k the number of iterations.
Generating the ordered parameter (λ_k): As mentioned at the beginning, we set out to identify a computationally tractable and adaptive solution, and the regularizing sequence plays a vital role in achieving this goal. Therefore, we generate adaptive values of (λ_k) such that the regression coefficients are penalized according to their respective order. Our regularizing sequence procedure is motivated by the BHq procedure [27], which generates the (λ_k) sequence as follows:
$$\lambda_{BH}(k) = \Phi^{-1}\!\left(1 - \frac{qk}{2p}\right) \tag{14}$$
$$\lambda_k = \lambda_{BH}(k)\,\sqrt{1 + \frac{\sum_{j<k}\lambda_{BH}(j)^2}{\,n - k\,}} \tag{15}$$
where k > 0, Φ^{-1}(α) is the αth quantile of the standard normal distribution, and q ∈ [0, 1] is a parameter. We start with λ_1 = λ_BH(1) as the initial value of the ordered parameter (λ_k).
Algorithm 2 presents the method for generating the sorted (λ_k). The difference between lines 5 and 6 in Algorithm 2 is that line 5 is for low-dimensional data (n ≥ p) and line 6 is for high-dimensional data (p ≫ n). Finally, we use the ordered (λ_k) from Algorithm 2 (i.e., the adaptive values of (λ_k)) in the ordered ridge regression of Equations (6) and (7) instead of an ordinary λ; this makes the ordered ℓ2 adaptive and different from the standard ℓ2. A small sketch of this sequence generation is given after Algorithm 2.
Algorithm 2: SortedLambda({λ_k})
1: Initialize q ∈ [0, 1], k > 0, p, n ∈ ℕ
2: λ_1 ← λ_BH(1); λ_BH(1) is from Equation (14) with k = 1
3: for k ∈ {2, …, K} do
4:   λ_BH(k) ← Φ^{-1}(1 − qk/(2p))
5:   λ_k ← λ_BH(k) √(1 + Σ_{j<k} λ_BH(j)^2 / (p − k − 1));  where n = p
6:   λ_k ← λ_BH(k) √(1 + Σ_{j<k} λ_BH(j)^2 / (2p − k − 1));  where n = 2p
7: end for
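A minimal Python sketch of the sequence generation of Equations (14) and (15) follows; it assumes SciPy for the standard normal quantile Φ^{-1}. The final clipping step that enforces a non-increasing sequence is our own addition (the corrected values can otherwise start to increase), and the function name and default values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def sorted_lambda(p, n, q=0.1, K=None):
    """Generate an adaptive, non-increasing sequence {lambda_k} in the spirit of
    Eqs. (14)-(15): lambda_BH(k) = Phi^{-1}(1 - q*k/(2p)), followed by a cumulative
    correction that depends on the earlier lambda_BH values."""
    K = p if K is None else K
    k = np.arange(1, K + 1)
    lam_bh = norm.ppf(1.0 - q * k / (2.0 * p))           # Eq. (14)
    lam = lam_bh.copy()
    for i in range(1, K):                                 # i is the 0-based index of lambda_{i+1}
        lam[i] = lam_bh[i] * np.sqrt(1.0 + np.sum(lam_bh[:i] ** 2) / (n - i - 1))  # Eq. (15)
    return np.minimum.accumulate(lam)                     # enforce lambda_1 >= lambda_2 >= ...

lam = sorted_lambda(p=5000, n=5000, q=0.1, K=100)
print(lam[:5])
```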

4.3. The Ordered Elastic Net

A standard ℓ2 (or ordered ℓ2) regularization is a commonly used tool for estimating parameters for microarray datasets (strongly correlated groupings). However, a key drawback of ℓ2 regularization is that it cannot automatically select relevant variables, because it shrinks coefficient estimates toward zero but not exactly to zero (Reference [12], Chapter 6.2). On the other hand, a standard ℓ1 (or ordered ℓ1) regularization can automatically determine relevant variables owing to its sparsity property, but it also has a limitation: when different variables are highly correlated, the ℓ1 regularization tends to pick only a few of them and to remove the remaining ones, even important ones that might be better predictors. To overcome the limitations of both ℓ1 and ℓ2 regularization, we propose another method, called the ordered elastic net (the ordered ℓ1,2 regularization, or ADMM-Oℓ1,2), similar to the standard elastic net [8], by combining the ordered ℓ2 regularization with the ordered ℓ1 regularization and the elastic net. By doing so, the ordered ℓ1,2 regularization automatically selects relevant variables in a way similar to the ordered ℓ1 regularization and, in addition, can select groups of strongly correlated variables. The key difference between the ordered elastic net and the standard elastic net is the regularization term: we apply the ordered ℓ1 and ℓ2 regularization in the ordered elastic net instead of the standard ℓ1 and ℓ2 regularization. The ordered elastic net therefore inherits the sparsity, grouping, and adaptive properties of the ordered ℓ1 and ℓ2 regularization. We also employ ADMM to solve the ordered ℓ1,2 regularized loss minimization as follows:
$$\min_{x \in \mathbb{R}^p}\; \frac{1}{2}\|Ax - b\|_2^2 + \alpha\,\lambda_{BH}\|x\|_1 + \frac{1}{2}(1 - \alpha)\,\lambda_{BH}\|x\|_2^2$$
For simplicity, let λ_1 = α λ_BH and λ_2 = (1 − α) λ_BH. The ordered elastic net becomes
$$\min_{x \in \mathbb{R}^p}\; \frac{1}{2}\|Ax - b\|_2^2 + \lambda_1\|x\|_1 + \frac{1}{2}\lambda_2\|x\|_2^2$$
Now, we can transform the above ordered elastic net equation into an equivalent form of Equation (1) by introducing an auxiliary variable z.
$$\min_{x,\,z \in \mathbb{R}^p}\; \underbrace{\frac{1}{2}\|Ax - b\|_2^2}_{f(x)} + \underbrace{\lambda_1\|z\|_1 + \frac{1}{2}\lambda_2\|z\|_2^2}_{g(z)} \quad \text{s.t.}\quad x - z = 0$$
We can minimize Equation (16) with respect to x and z in the same way as we minimized the ordered ℓ2 regularization in Section 4, Section 4.1, and Section 4.2; therefore, we present the final results directly below without the details. The (·)_+ operator denotes max(0, ·).
$$\begin{aligned} x^{k+1} &= (A^T A + \rho I)^{-1}\left(A^T b + \rho\,(z^k - u^k)\right) \\ z^{k+1} &= \left(\frac{\rho\,(x^{k+1} + u^k) - \lambda_1}{\lambda_2 + \rho}\right)_{+} - \left(\frac{-\rho\,(x^{k+1} + u^k) - \lambda_1}{\lambda_2 + \rho}\right)_{+} \\ u^{k+1} &= u^k + \left(x^{k+1} - z^{k+1}\right) \end{aligned}$$
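For completeness, the z-update above is an elementwise soft-thresholding operation. A short Python sketch of it, under our reading of the final updates and with argument names of our own choosing, is:

```python
import numpy as np

def elastic_net_z_update(x_next, u, lam1, lam2, rho):
    """z-update of the ordered elastic net: soft-threshold rho*(x^{k+1} + u^k) at lam1
    and scale by 1/(lam2 + rho). lam1 and lam2 may be scalars or, for the ordered
    variant, vectors of weights matched to coordinates by rank of |x^{k+1} + u^k|
    (our reading)."""
    v = rho * (np.asarray(x_next) + np.asarray(u))
    return (np.maximum(v - lam1, 0.0) - np.maximum(-v - lam1, 0.0)) / (lam2 + rho)
```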

5. Experiments

A series of experiments was conducted on both simulated and real data to examine the performance of the proposed method. In this section, we first discuss how to select a suitable sequence of (λ_k). Second, an experiment on synthetic data illustrates the convergence of the lasso, SortedL1, ADMM-Oℓ2, and ADMM-Oℓ1,2. Finally, the proposed method is applied to a real feature selection dataset. The performance of the ADMM-Oℓ1,2 method is compared with two state-of-the-art methods, the lasso and SortedL1. These two methods are chosen for comparison because they are very similar to ADMM-Oℓ1,2 except that they use the regular lasso and the ordered lasso, respectively, while the ADMM-Oℓ1,2 model employs the ordered ℓ1,2 regularization with ADMM.
Experimental setting: The algorithms were implemented in Scala on Spark™ in both distributed and non-distributed versions. The distributed experiments were carried out on a cluster of virtual machines with four nodes (one master and three slaves); each node has 10 GB of memory, 8 cores, CentOS release 6.2, and amd64:core-4.0-noarch, with Apache Spark™ 1.5.1 deployed on it. We used IntelliJ IDEA 15 Ultimate as the Scala editor, the interactive build tool sbt version 0.13.8, and Scala version 2.10.4. The standalone machine is a Lenovo desktop running Windows 7 Ultimate with an Intel™ Core™ i3 3.20 GHz CPU and 4 GB of memory. We used MATLAB™ version 8.2.0.701 on a single machine to draw all figures. The source code of the lasso, SortedL1, and ADMM-Oℓ1,2 is available at References [32,33,34], respectively.

5.1. Adjusting the Regularizing Sequence (λ_k) for the Ordered Ridge Regression

Figure 1 was drawn using Algorithm 2 with p = 5000. As seen in Figure 1, when the value of the parameter q is larger (q = 0.4), the sequence (λ_k) decreases, while (λ_k) increases for a small value of q = 0.055. However, the goal is to obtain a non-increasing sequence (λ_k) by adjusting the value of q, which promotes convergence. Here, adjusting means tuning the value of the parameter q in the BHq procedure to yield a suitable sequence (λ_k) that improves performance.

5.2. Experimental Results of Synthetic Data

In this section, numerical examples show the convergence of ADMM-Oℓ1,2, ADMM-Oℓ2, and other methods. A small, dense instance of the ordered ℓ2 regularization is examined, in which the feature matrix A has n = 1500 examples and p = 5000 features. Synthetic data are generated as follows: create a matrix A with entries A_{i,j} drawn from N(0, 1) and then normalize the columns of A to have unit ℓ2 norm. The vector x_0 ∈ ℝ^p is generated with each entry sampled from the Gaussian distribution N(0, 0.02). The label b is computed as b = A x_0 + v, where v ∼ N(0, 10^{-3} I) is Gaussian noise. We use a penalty parameter ρ = 1.0, an over-relaxation parameter α = 1.0, and termination tolerances ε^abs = 10^{-4} and ε^rel = 10^{-2}. The variables u_0 ∈ ℝ^p and z_0 ∈ ℝ^p are initialized to zero, and λ ∈ ℝ^p is a non-increasing ordered sequence generated according to Section 5.1 and Algorithm 2. Figure 2a,b shows the convergence of ADMM-Oℓ2 and ADMM-Oℓ1,2, respectively, and Figure 3a,b shows the convergence of the ordered ℓ1 regularization and the lasso, respectively. It can be seen from Figure 2 and Figure 3 that the ordered ℓ2 regularization converges faster than all the other algorithms: the ordered ℓ1, the lasso, the ordered ℓ1,2, and the ordered ℓ2 take fewer than 80, 30, 30, and 10 iterations, respectively, to converge. The dual is not guaranteed to be feasible, so the level of dual infeasibility also needs to be computed; the run for the ordered ℓ1 regularization terminates whenever the infeasibility (ŵ) and the relative primal–dual gap (δ(b)) fall below their tolerances, λ_(1)·TolInfeas (ε^infeas = 10^{-6}) and TolRelGap (ε^gap = 10^{-6}), respectively. The ordered ℓ1 regularization uses the synthetic data setup provided by Reference [5]. The same data are generated for the lasso as for the ordered ℓ2 regularization except for the initial value of λ: for the lasso, we set λ = 0.1 λ_max, where λ_max = ‖A^T b‖_∞. We also use 10-fold cross-validation (CV) with the lasso. For further details about this step, see Reference [9].
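A sketch of the synthetic data generation just described (sizes from the text; the random seed and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 1500, 5000
A = rng.standard_normal((n, p))                 # A_{i,j} ~ N(0, 1)
A /= np.linalg.norm(A, axis=0)                  # normalize columns to unit l2 norm
x0 = np.sqrt(0.02) * rng.standard_normal(p)     # x0 ~ N(0, 0.02) (0.02 read as the variance)
v = np.sqrt(1e-3) * rng.standard_normal(n)      # Gaussian noise v ~ N(0, 1e-3 I)
b = A @ x0 + v                                  # labels b = A x0 + v
```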

5.3. Experimental Results of Real Data

Variable selection becomes difficult when the number of features (p) is greater than the number of instances (n); the proposed method naturally handles this type of problem. ADMM has practical applications in many domains such as computer vision and graphics [35], analysis of biological data [36,37], and smart electric grids [38,39]. A biological leukemia dataset [40] was used to demonstrate the performance of the proposed method. Leukemia is a type of cancer that impairs the body's ability to build healthy blood cells and begins in the bone marrow. There are many types of leukemia, such as acute lymphoblastic leukemia, acute myeloid leukemia, and chronic lymphocytic leukemia; the following two are used in this experiment: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). The leukemia dataset consists of 7129 genes and 72 samples [41]. We randomly split the data into training and test sets. The training set contains 38 samples, of which 27 are type I (ALL) and 11 are type II (AML); the remaining 34 samples are used to test the prediction accuracy, with 20 type I (ALL) and 14 type II (AML). The data are labeled according to the type of leukemia, so before applying the ordered elastic net, the leukemia type is converted into a (−1, 1) response y (ALL = −1, AML = 1). The predicted response ŷ is set to 1 if ŷ > 0 and to −1 otherwise. λ ∈ ℝ^p is a non-increasing ordered sequence generated according to Section 5.1 and Algorithm 2; for the regular lasso, λ is a single scalar value generated using Equation (14). α = 0.1 is used for the leukemia dataset. All other settings are the same as in the experiment with synthetic data.
Table 3 presents the experimental results on the leukemia dataset for different types of regularization. The lowest average mean square error (MSE) is obtained by the ordered ℓ2, followed by the ordered ℓ1,2 and the lasso, while the highest average MSE is observed for the ordered ℓ1. Table 3 also makes clear that the ordered ℓ2 converges fastest among all the regularizations, the second fastest being the ordered ℓ1,2 and the slowest being the ordered ℓ1. The ordered ℓ2 takes on average around 190 iterations and around 0.15 s to converge, whereas the ordered ℓ1,2, the ordered ℓ1, and the lasso take on average around 1381, 10,000, and 10,000 iterations, respectively, and around 1.0, 14.0, and 5.0 s, respectively. It can also be seen from Table 3 that the ordered ℓ2 selects all the variables, but the goal is to select only the relevant variables from a strongly correlated, high-dimensional dataset; this is why the ordered elastic net was proposed, which selects only relevant variables and discards irrelevant ones. As Table 3 shows, the average MSE, time, and iterations of the ordered ℓ1 regularization and the lasso are significantly larger than those of the ordered ℓ1,2 regularization, although the average number of genes selected by the ordered ℓ1,2 regularization is larger than that of the ordered ℓ1 regularization and the lasso: the ordered ℓ1 and the lasso select on average around 84 and 7 variables, respectively, whereas the ordered ℓ1,2 selects on average around 107 variables. The lasso performs poorly on the leukemia dataset because strongly correlated variables are present in it. In general, the ordered elastic net performs better than the ordered ℓ1 and the lasso. Figure 4 shows the ordered elastic net solution paths and the variable selection results.

6. Conclusions

In this paper, we presented a method for optimizing the ordered ℓ2 problem within an ADMM framework, called ADMM-Oℓ2. As an instantiation of ADMM-Oℓ2, ridge regression with the ordered ℓ2 regularization was shown. We also presented a method for variable selection, called ADMM-Oℓ1,2, which employs the ordered ℓ1 and ℓ2. We view the ordered ℓ1,2 as a generalization of the ordered ℓ1, which has been shown to be an important tool for model fitting, feature selection, and parameter estimation.
Experimental results show that the ADMM-Oℓ1,2 method correctly estimates parameters, selects relevant variables, and excludes irrelevant variables for microarray data. Our method is also computationally tractable, adaptive, and distributed. The ordered ℓ2 regularization is convex and can be optimized efficiently with a fast convergence rate. Additionally, we have shown that our algorithm has heuristic complexity O(p³) + k·O(np² + p), where k is the number of iterations. In future work, we plan to apply our method to other regularization models with complex penalties.

Author Contributions

Conceptualization, M.H.; methodology, M.H.; investigation, M.H.; writing—original draft preparation, M.H.; writing—review and editing, M.H. and X.C.; visualization, M.H.; supervision, X.C.

Funding

The work is funded by the National Key R&D Program of China under Grant No. 2016QY02D0405 and 973 Program of China under Grant No. 2014CB340401. Mahammad Humayoo is supported by the CAS-TWAS fellowship.

Acknowledgments

We gratefully acknowledge the useful comments of the anonymous referees. We would also like to thank the editor for his continuous and quick support during the peer review process, as well as all the students and teachers who supported us in this work. We discussed some issues with users in online communities; finally, we want to extend a big thank you to all the users of those online communities who participated in the discussions and provided us with valuable suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof. 
Positivity of Theorem 1
$$J_\lambda(x) = \|\sqrt{\lambda}\odot x\|_2^2 = \sum_{k=1}^{p}\lambda_k x_{(k)}^2 = \sum_{k=1}^{p}\left(\sqrt{\lambda_k}\,x_{(k)}\right)^2$$
(Here √λ ⊙ x denotes the vector with entries √λ_k x_(k), i.e., the weights matched to the entries of x by rank of magnitude.)
Taking the square root on both sides, we have the following:
$$\|\sqrt{\lambda}\odot x\|_2 = \sqrt{\sum_{k=1}^{p}\left(\sqrt{\lambda_k}\,x_{(k)}\right)^2}$$
For x ∈ ℝ^p, Equation (2) holds; thus the weights (λ_1, …, λ_p) are nonnegative, and each x_(k)^2 is nonnegative because it is a square. Therefore,
$$\|\sqrt{\lambda}\odot x\|_2 = \sqrt{\sum_{k=1}^{p}\left(\sqrt{\lambda_k}\,x_{(k)}\right)^2} \ge 0$$
If x = 0 , then we have the following:
$$\|\sqrt{\lambda}\odot x\|_2 = \sqrt{\sum_{k=1}^{p}\left(\sqrt{\lambda_k}\cdot 0\right)^2} = 0$$
Conversely, if ‖√λ ⊙ x‖_2 = 0, then we have the following:
$$\|\sqrt{\lambda}\odot x\|_2 = \sqrt{\sum_{k=1}^{p}\left(\sqrt{\lambda_k}\,x_{(k)}\right)^2} = 0 \;\Longrightarrow\; \lambda_1 x_{(1)}^2 = 0 \;\Longrightarrow\; x_{(1)} = 0 \;\Longrightarrow\; x = 0$$
since λ_1 > 0 and x_(1) is the entry of x with the largest magnitude. Hence ‖√λ ⊙ x‖_2 is zero if and only if x = 0. □
Proof. 
Homogeneity of Theorem 1
The first two steps are the same as in the proof of positivity of Theorem 1. Replacing x with cx (which leaves the ordering of the magnitudes unchanged), we have the following:
$$\|\sqrt{\lambda}\odot (cx)\|_2 = \sqrt{\sum_{k=1}^{p}\left(\sqrt{\lambda_k}\,c\,x_{(k)}\right)^2} = |c|\sqrt{\sum_{k=1}^{p}\left(\sqrt{\lambda_k}\,x_{(k)}\right)^2} = |c|\,\|\sqrt{\lambda}\odot x\|_2$$
That is,
$$\|cx\| = |c|\,\|x\|$$
 □
Proof. 
Triangle inequality of Theorem 1
The first two steps are the same as in the proof of positivity of Theorem 1. Now, with x + y in place of x, we have the following:
$$\|\sqrt{\lambda}\odot (x+y)\|_2^2 = \sum_{k=1}^{p}\left(\sqrt{\lambda_k}\,x_{(k)} + \sqrt{\lambda_k}\,y_{(k)}\right)^2 = \sum_{k=1}^{p}\left(\sqrt{\lambda_k}\,x_{(k)}\right)^2 + \sum_{k=1}^{p}\left(\sqrt{\lambda_k}\,y_{(k)}\right)^2 + 2\sum_{k=1}^{p}\lambda_k\, x_{(k)}\, y_{(k)}$$
From the Cauchy–Schwarz inequality, x · y ≤ ‖x‖ · ‖y‖, we have the following:
$$\|\sqrt{\lambda}\odot x + \sqrt{\lambda}\odot y\|_2^2 \le \|\sqrt{\lambda}\odot x\|_2^2 + \|\sqrt{\lambda}\odot y\|_2^2 + 2\,\|\sqrt{\lambda}\odot x\|_2\,\|\sqrt{\lambda}\odot y\|_2 = \left(\|\sqrt{\lambda}\odot x\|_2 + \|\sqrt{\lambda}\odot y\|_2\right)^2$$
$$\Longrightarrow\;\; \|\sqrt{\lambda}\odot x + \sqrt{\lambda}\odot y\|_2 \le \|\sqrt{\lambda}\odot x\|_2 + \|\sqrt{\lambda}\odot y\|_2$$
With ‖·‖ denoting the square root of J_λ(·), this is exactly
$$\|x + y\| \le \|x\| + \|y\|$$
 □
Proof. 
Corollary 1
From Equation (3):
$$J_\lambda(x) = \lambda_1 x_{(1)}^2 + \lambda_2 x_{(2)}^2 + \cdots + \lambda_p x_{(p)}^2$$
If all the λ_k take on an equal positive value λ, i.e., λ_1 = λ_2 = ⋯ = λ_p = λ, then
$$J_\lambda(x) = \lambda x_{(1)}^2 + \lambda x_{(2)}^2 + \cdots + \lambda x_{(p)}^2 = \lambda\left(x_{(1)}^2 + x_{(2)}^2 + \cdots + x_{(p)}^2\right) = \lambda\sum_{k=1}^{p} x_{(k)}^2 = \lambda\,\|x\|_2^2$$
where λ is a positive scalar. □
Proof. 
Corollary 2
The first two steps are the same as in the proof of positivity of Theorem 1. When λ_2 = ⋯ = λ_p = 0, only the first term survives; the remaining terms are zero whenever p > 1. Thus
$$\|\sqrt{\lambda}\odot x\|_2 = \sqrt{\sum_{k=1}^{1}\left(\sqrt{\lambda_k}\,x_{(k)}\right)^2} = \sqrt{\left(\sqrt{\lambda_1}\,x_{(1)}\right)^2} = \sqrt{\lambda_1}\,\left|x_{(1)}\right|$$
where |x_(1)| = max_k |x_k| = ‖x‖_∞. Then, we have the following:
$$\sqrt{J_\lambda(x)} = \sqrt{\lambda_1}\,\|x\|_\infty$$
 □

References

1. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
2. Bogdan, M.; van den Berg, E.; Sabatti, C.; Su, W.; Candès, E.J. SLOPE—Adaptive variable selection via convex optimization. Ann. Appl. Stat. 2015, 9, 1103.
3. Bach, F.; Jenatton, R.; Mairal, J.; Obozinski, G. Optimization with sparsity-inducing penalties. Found. Trends® Mach. Learn. 2012, 4, 1–106.
4. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
5. Bogdan, M.; van den Berg, E.; Su, W.; Candès, E.J. Statistical Estimation and Testing via the Ordered L1 Norm; Stanford University: Palo Alto, CA, USA, 2013.
6. Pan, H.; Jing, Z.; Li, M. Robust image restoration via random projection and partial sorted ℓp norm. Neurocomputing 2017, 222, 72–80.
7. Azghani, M.; Kosmas, P.; Marvasti, F. Fast microwave medical imaging based on iterative smoothed adaptive thresholding. IEEE Antennas Wirel. Propag. Lett. 2015, 14, 438–441.
8. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320.
9. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 2011, 3, 1–122.
10. Daducci, A.; Van De Ville, D.; Thiran, J.P.; Wiaux, Y. Sparse regularization for fiber ODF reconstruction: From the suboptimality of ℓ2 and ℓ1 priors to ℓ0. Med. Image Anal. 2014, 18, 820–833.
11. Gong, P.; Zhang, C.; Lu, Z.; Huang, J.; Ye, J. A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 37–45.
12. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2013; Volume 112.
13. Wang, L.; Zhu, J.; Zou, H. Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 2008, 24, 412–419.
14. Deng, W.; Yin, W.; Zhang, Y. Group sparse optimization by alternating direction method. In SPIE Optical Engineering + Applications; International Society for Optics and Photonics: Bellingham, WA, USA, 2013; p. 88580R.
15. Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351.
16. Chen, S.; Liu, Y.; Lyu, M.R.; King, I.; Zhang, S. Fast Relative-Error Approximation Algorithm for Ridge Regression; UAI: Amsterdam, The Netherlands, 2015; pp. 201–210.
17. Zeng, X.; Figueiredo, M.A. Decreasing weighted sorted L1 regularization. IEEE Signal Process. Lett. 2014, 21, 1240–1244.
18. Albanese, M.; Erbacher, R.F.; Jajodia, S.; Molinaro, C.; Persia, F.; Picariello, A.; Sperlì, G.; Subrahmanian, V. Recognizing unexplained behavior in network traffic. In Network Science and Cybersecurity; Springer: Berlin/Heidelberg, Germany, 2014; pp. 39–62.
19. Amato, F.; Moscato, V.; Picariello, A.; Sperlí, G. Recommendation in social media networks. In Proceedings of the 2017 IEEE Third International Conference on Multimedia Big Data (BigMM), Laguna Hills, CA, USA, 19–21 April 2017; pp. 213–216.
20. Glowinski, R.; Marroco, A. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. Revue Française d’Automatique Informatique Recherche Opérationnelle Analyse Numérique 1975, 9, 41–76.
21. Gabay, D.; Mercier, B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 1976, 2, 17–40.
22. Hestenes, M.R. Multiplier and gradient methods. J. Optim. Theory Appl. 1969, 4, 303–320.
23. Ghadimi, E.; Teixeira, A.; Shames, I.; Johansson, M. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Trans. Autom. Control 2015, 60, 644–658.
24. Deng, W.; Yin, W. On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 2016, 66, 889–916.
25. Goldstein, T.; O’Donoghue, B.; Setzer, S.; Baraniuk, R. Fast alternating direction optimization methods. SIAM J. Imaging Sci. 2014, 7, 1588–1623.
26. Yan, M.; Yin, W. Self equivalence of the alternating direction method of multipliers. In Splitting Methods in Communication, Imaging, Science, and Engineering; Springer: Berlin/Heidelberg, Germany, 2016; pp. 165–194.
27. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodol.) 1995, 57, 289–300.
28. David, H.A.; Nagaraja, H.N. Order Statistics; John Wiley & Sons, Inc.: New York, NY, USA, 2003.
29. Schmidt, M.; Roux, N.L.; Bach, F.R. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2011; pp. 1458–1466.
30. Parikh, N.; Boyd, S. Proximal algorithms. Found. Trends® Optim. 2014, 1, 127–239.
31. Glowinski, R. Lectures on Numerical Methods for Non-Linear Variational Problems; Springer: Berlin/Heidelberg, Germany, 2008.
32. Boyd, S. Lasso: Solve Lasso Problem via ADMM. 2011. Available online: https://web.stanford.edu/~boyd/papers/admm/lasso/lasso.html (accessed on 12 October 2019).
33. Bogdan, M. Sorted L-One Penalized Estimation. 2015. Available online: https://statweb.stanford.edu/~candes/SortedL1/software.html (accessed on 12 October 2019).
34. Humayoo, M. ADMM Ordered L2. 2019. Available online: https://github.com/ADMMOL2/ADMMOL2 (accessed on 12 October 2019).
35. Liu, J.; Musialski, P.; Wonka, P.; Ye, J. Tensor completion for estimating missing values in visual data. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 208–220.
36. Bien, J.; Taylor, J.; Tibshirani, R. A lasso for hierarchical interactions. Ann. Stat. 2013, 41, 1111.
37. Danaher, P.; Wang, P.; Witten, D.M. The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2014, 76, 373–397.
38. Kraning, M.; Chu, E.; Lavaei, J.; Boyd, S. Dynamic network energy management via proximal message passing. Found. Trends® Optim. 2014, 1, 73–126.
39. Kekatos, V.; Giannakis, G.B. Distributed robust power system state estimation. IEEE Trans. Power Syst. 2012, 28, 1617–1626.
40. Chih-Jen, L. Feature Datasets. 2017. Available online: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ (accessed on 12 October 2019).
41. Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.A.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 1999, 286, 531–537.
Figure 1. Illustration of the sequence {λ_k} for p = 5000: the solid line of λ_BH(k) is given by Equation (14), while the dashed and dotted lines of λ_k are given by Equation (15) for n = p and n = 2p, respectively.
Figure 2. Primal and dual residual versus primal and dual feasibility, respectively: Input synthetic data.
Figure 3. (a) Relative primal–dual gap versus dual infeasibility, respectively, and (b) primal and dual residual versus primal and dual feasibility, respectively: Input synthetic data.
Figure 4. The ordered elastic net coefficients paths: selected genes (the number of nonzero coefficients) are shown on the top of the x-axis and corresponding iterations are shown on the bottom of the x-axis; the optimal ordered elastic net model is given by the fit at an average iteration of 1380.6 with an average selected gene of 106.6 (indicated by a dotted line). Input leukemia data.
Table 1. Properties of the different regularizers: Correlation is a method that shows how strong the relationship between explanatory variables is. In the Correlation column, “Yes” means a strong correlation between variables and “No” means weaker (or no) correlation between variables. Stable means that estimates of the ℓ2-based methods are more stable when the explanatory variables are strongly correlated.

Regularizer | Promoting | Convex | Smooth | Adaptive | Tractable | Correlation | Stable
ℓ0 [10,11] | Sparsity | No | No | No | No | No | No
ℓ1 [12] | Sparsity | Yes | No | No | Yes | No | No
The ordered ℓ1 [5] | Sparsity | Yes | No | Yes | Yes | No | No
ℓ2 [13] | Grouping | Yes | Yes | No | Yes | Yes | Yes
The ordered ℓ2 | Grouping | Yes | Yes | Yes | Yes | Yes | Yes
Partial sorted ℓp [6] | Sparsity | No | No | Yes | No | No | No
Table 2. Notations and explanations.

Notation | Explanation
Matrix | denoted by an uppercase letter
Vector | denoted by a lowercase letter
‖·‖_1 | ℓ1 norm
‖·‖_2 | ℓ2 norm
‖x‖_1 = Σ_{k=1}^n |x_k| | ℓ1 norm
‖x‖_2 = (Σ_{k=1}^n |x_k|^2)^{1/2} | ℓ2 norm
‖x‖_2^2 = Σ_{k=1}^n |x_k|^2 | square of the ℓ2 norm
J_λ(·) | the ordered ℓ2 norm
f | loss convex function
g | regularizer part (ℓ1 or ℓ2, etc.)
∂f(x) | subdifferential of the convex function f at x
∂g(z) | subdifferential of the convex function g at z
L1 | the ℓ1 norm or the lasso
OL1 | ordered ℓ1 norm or ordered lasso
OL2 | ordered ℓ2 norm or ordered ridge regression
Eq. | equation

Note: We often use the ordered ℓ2 norm/regularization, OL2, and ADMM-Oℓ2 interchangeably.
Table 3. Summary of variable selection in the leukemia dataset.

q | Method | Test Error | #Genes | Time | #Iter
0.1 | Lasso | 2.352941 | 6 | 5.459234 s | 10,000
0.1 | The ordered ℓ1 | 2.235294 | 56 | 16.208356 s | 10,000
0.1 | The ordered ℓ2 | 2.117647 | All | 0.176351 s | 216
0.1 | The ordered ℓ1,2 | 2.352941 | 109 | 1.391347 s | 2104
0.2 | Lasso | 2.352941 | 6 | 5.419032 s | 10,000
0.2 | The ordered ℓ1 | 2.352941 | 65 | 14.990179 s | 10,000
0.2 | The ordered ℓ2 | 2.117647 | All | 0.167623 s | 197
0.2 | The ordered ℓ1,2 | 2.352941 | 107 | 1.046763 s | 1597
0.3 | Lasso | 2.352941 | 6 | 5.394828 s | 10,000
0.3 | The ordered ℓ1 | 2.352941 | 85 | 15.477436 s | 10,000
0.3 | The ordered ℓ2 | 2.117647 | All | 0.140347 s | 185
0.3 | The ordered ℓ1,2 | 2.117647 | 108 | 0.820148 s | 1276
0.4 | Lasso | 2.235294 | 7 | 5.470428 s | 10,000
0.4 | The ordered ℓ1 | 2.352941 | 90 | 12.206446 s | 10,000
0.4 | The ordered ℓ2 | 2.117647 | All | 0.135451 s | 178
0.4 | The ordered ℓ1,2 | 2.0 | 107 | 0.685255 s | 1055
0.5 | Lasso | 2.0 | 8 | 5.387423 s | 10,000
0.5 | The ordered ℓ1 | 2.352941 | 126 | 13.226632 s | 10,000
0.5 | The ordered ℓ2 | 2.117647 | All | 0.126034 s | 172
0.5 | The ordered ℓ1,2 | 2.0 | 102 | 0.603964 s | 871
Average | Lasso | 2.2588234 | 6.6 | 5.426189 s | 10,000
Average | The ordered ℓ1 | 2.3294116 | 84.4 | 14.4218098 s | 10,000
Average | The ordered ℓ2 | 2.117647 | All | 0.1491612 s | 189.6
Average | The ordered ℓ1,2 | 2.1647058 | 106.6 | 0.9094954 s | 1380.6
