Article

Robust and Sparse Regression via γ-Divergence

by Takayuki Kawashima 1,* and Hironori Fujisawa 1,2,3
1 Department of Statistical Science, The Graduate University for Advanced Studies, Tokyo 190-8562, Japan
2 The Institute of Statistical Mathematics, Tokyo 190-8562, Japan
3 Department of Mathematical Statistics, Nagoya University Graduate School of Medicine, Nagoya 466-8550, Japan
* Author to whom correspondence should be addressed.
Entropy 2017, 19(11), 608; https://doi.org/10.3390/e19110608
Submission received: 30 September 2017 / Revised: 7 November 2017 / Accepted: 9 November 2017 / Published: 13 November 2017

Abstract
In high-dimensional data, many sparse regression methods have been proposed. However, they may not be robust against outliers. Recently, the use of the density power weight has been studied for robust parameter estimation, and the corresponding divergences have been discussed. One such divergence is the γ-divergence, and the robust estimator using the γ-divergence is known for having a strong robustness. In this paper, we extend the γ-divergence to the regression problem, consider the robust and sparse regression based on the γ-divergence and show that it has a strong robustness under heavy contamination even when outliers are heterogeneous. The loss function is constructed by an empirical estimate of the γ-divergence with sparse regularization, and the parameter estimate is defined as the minimizer of the loss function. To obtain the robust and sparse estimate, we propose an efficient update algorithm, which has a monotone decreasing property of the loss function. In particular, we discuss a linear regression problem with L1 regularization in detail. In numerical experiments and real data analyses, we see that the proposed method outperforms past robust and sparse methods.

1. Introduction

In high-dimensional data, sparse regression methods have been intensively studied. The Lasso [1] is a typical sparse linear regression method with L1 regularization, but it is not robust against outliers. Recently, robust and sparse linear regression methods have been proposed. The robust least angle regression (RLARS) [2] is a robust version of LARS [3], which replaces the sample correlation by a robust estimate of correlation in the update algorithm. The sparse least trimmed squares (sLTS) [4] is a sparse version of the well-known robust linear regression method LTS [5], based on the trimmed loss function with L1 regularization.
Recently, robust parameter estimation using the density power weight has been discussed by Windham [6], Basu et al. [7], Jones et al. [8], Fujisawa and Eguchi [9], Basu et al. [10], Kanamori and Fujisawa [11], and so on. The density power weight gives a small weight to the terms related to outliers, so that the parameter estimation becomes robust against outliers. By virtue of this property, some applications using density power weights have been proposed in signal processing and machine learning [12,13]. Among them, the γ-divergence proposed by Fujisawa and Eguchi [9] is known for having a strong robustness, which implies that the latent bias can be sufficiently small even under heavy contamination. Other robust methods, including those based on the density power divergence, cannot achieve this property, and their estimators can be affected by the outlier ratio. In addition, to obtain the robust estimate, an efficient update algorithm was proposed with a monotone decreasing property of the loss function.
In this paper, we propose robust and sparse regression based on the γ-divergence. First, we extend the γ-divergence to the regression problem. Next, we consider a loss function based on the γ-divergence with sparse regularization and propose an update algorithm to obtain the robust and sparse estimate. Fujisawa and Eguchi [9] used a Pythagorean relation on the γ-divergence, but it is not compatible with sparse regularization. Instead of this relation, we use the majorization-minimization (MM) algorithm [14]. This idea is examined in detail for a linear regression problem with L1 regularization. The MM algorithm was also adopted in Hirose and Fujisawa [15] for robust and sparse Gaussian graphical modeling. A tuning parameter selection procedure is proposed using a robust cross-validation. We also show a strong robustness under heavy contamination even when outliers are heterogeneous. Finally, in numerical experiments and real data analyses, we show that our method is computationally efficient and outperforms other robust and sparse methods. The R package "gamreg", which implements our proposed method, can be downloaded at http://cran.r-project.org/web/packages/gamreg/.

2. Regression Based on γ -Divergence

The γ -divergence was defined for two probability density functions, and its properties were investigated by Fujisawa and Eguchi [9]. In this section, the γ -divergence is extended to the regression problem, in other words, defined for two conditional probability density functions.

2.1. γ -Divergence for Regression

We suppose that g ( x , y ) , g ( y | x ) and g ( x ) are the underlying probability density functions of ( x , y ) , y given x and x, respectively. Let f ( y | x ) be another parametric conditional probability density function of y given x. Let us define the γ -cross-entropy for regression by:
$$
\begin{aligned}
d_\gamma(g(y|x), f(y|x); g(x)) &= -\frac{1}{\gamma}\log\int\left\{\int g(y|x) f(y|x)^{\gamma}\,dy\right\}g(x)\,dx + \frac{1}{1+\gamma}\log\int\left\{\int f(y|x)^{1+\gamma}\,dy\right\}g(x)\,dx \\
&= -\frac{1}{\gamma}\log\iint f(y|x)^{\gamma}\, g(x,y)\,dx\,dy + \frac{1}{1+\gamma}\log\int\left\{\int f(y|x)^{1+\gamma}\,dy\right\}g(x)\,dx \quad \text{for } \gamma>0. \tag{1}
\end{aligned}
$$
The γ -divergence for regression is defined by:
$$
D_\gamma(g(y|x), f(y|x); g(x)) = -d_\gamma(g(y|x), g(y|x); g(x)) + d_\gamma(g(y|x), f(y|x); g(x)). \tag{2}
$$
The γ -divergence for regression was first proposed by Fujisawa and Eguchi [9], and many properties were already shown. However, we adopt the definition (2), which is slightly different from the past one, because (2) satisfies the Pythagorean relation approximately (see Section 4).
Theorem 1.
We can show that:
$$
\begin{aligned}
&(\mathrm{i})\ D_\gamma(g(y|x), f(y|x); g(x)) \ge 0,\\
&(\mathrm{ii})\ D_\gamma(g(y|x), f(y|x); g(x)) = 0 \iff g(y|x) = f(y|x)\ (\mathrm{a.e.}),\\
&(\mathrm{iii})\ \lim_{\gamma\to 0} D_\gamma(g(y|x), f(y|x); g(x)) = \int D_{KL}(g(y|x), f(y|x))\, g(x)\,dx,
\end{aligned}
$$
where $D_{KL}(g(y|x), f(y|x)) = \int g(y|x)\log g(y|x)\,dy - \int g(y|x)\log f(y|x)\,dy$.
The proof is in Appendix A. In what follows, we refer to the regression based on the γ -divergence as the γ -regression.

2.2. Estimation for γ -Regression

Let f ( y | x ; θ ) be the conditional probability density function of y given x with parameter θ . The target parameter can be considered by:
$$
\theta_\gamma^* = \mathop{\mathrm{argmin}}_{\theta}\, D_\gamma(g(y|x), f(y|x;\theta); g(x)) = \mathop{\mathrm{argmin}}_{\theta}\, d_\gamma(g(y|x), f(y|x;\theta); g(x)). \tag{3}
$$
When g ( y | x ) = f ( y | x ; θ * ) , we have θ γ * = θ * .
Let ( x 1 , y 1 ) , , ( x n , y n ) be the observations randomly drawn from the underlying distribution g ( x , y ) . Using the formula (1), the γ -cross-entropy for regression, d γ ( g ( y | x ) , f ( y | x ; θ ) ; g ( x ) ) , can be empirically estimated by:
$$
\bar d_\gamma(f(y|x;\theta)) = -\frac{1}{\gamma}\log\frac{1}{n}\sum_{i=1}^{n} f(y_i|x_i;\theta)^{\gamma} + \frac{1}{1+\gamma}\log\frac{1}{n}\sum_{i=1}^{n}\int f(y|x_i;\theta)^{1+\gamma}\,dy. \tag{4}
$$
By virtue of (3), we define the γ -estimator by:
$$
\hat\theta_\gamma = \mathop{\mathrm{argmin}}_{\theta}\, \bar d_\gamma(f(y|x;\theta)).
$$
In a similar way as in Fujisawa and Eguchi [9], we can show the consistency of θ ^ γ to θ γ * under some conditions.
Here, we briefly show why the γ-estimator is robust. Suppose that $y_1$ is an outlier. The conditional probability density $f(y_1|x_1;\theta)$ can be expected to be sufficiently small. We see from $f(y_1|x_1;\theta) \approx 0$ and (4) that:
$$
\begin{aligned}
\mathop{\mathrm{argmin}}_{\theta}\, \bar d_\gamma(f(y|x;\theta)) &= \mathop{\mathrm{argmin}}_{\theta}\left\{-\frac{1}{\gamma}\log\frac{1}{n}\sum_{i=1}^{n} f(y_i|x_i;\theta)^{\gamma} + \frac{1}{1+\gamma}\log\frac{1}{n}\sum_{i=1}^{n}\int f(y|x_i;\theta)^{1+\gamma}\,dy\right\}\\
&\approx \mathop{\mathrm{argmin}}_{\theta}\left\{-\frac{1}{\gamma}\log\frac{1}{n-1}\sum_{i=2}^{n} f(y_i|x_i;\theta)^{\gamma} + \frac{1}{1+\gamma}\log\frac{1}{n}\sum_{i=1}^{n}\int f(y|x_i;\theta)^{1+\gamma}\,dy\right\}.
\end{aligned}
$$
Therefore, the term $f(y_1|x_1;\theta)$ is naturally ignored in (4). For the KL-divergence, however, $\log f(y_1|x_1;\theta)$ diverges as $f(y_1|x_1;\theta) \to 0$, so a single outlier can dominate the loss. That is why the KL-divergence is not robust. The theoretical robustness properties are presented in Section 4.
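To make the estimator concrete, the following is a minimal sketch (ours, not the authors' gamreg implementation) of the empirical γ-cross-entropy (4) for the Gaussian linear model $f(y|x;\theta)=\phi(y;\beta_0+x^T\beta,\sigma^2)$, using the closed form $\int \phi(y;\mu,\sigma^2)^{1+\gamma}dy = (2\pi\sigma^2)^{-\gamma/2}(1+\gamma)^{-1/2}$; the function name and arguments are illustrative.

```python
import numpy as np

def gamma_cross_entropy(y, X, beta0, beta, sigma2, gamma):
    """Empirical gamma-cross-entropy (4) for the Gaussian linear model (sketch).

    f(y|x; theta) = N(y; beta0 + x^T beta, sigma2), and
    int f(y|x; theta)^(1+gamma) dy = (2*pi*sigma2)^(-gamma/2) / sqrt(1+gamma),
    which does not depend on x.
    """
    resid = y - beta0 - X @ beta
    # log f(y_i | x_i; theta)
    logf = -0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2)
    # first term: -(1/gamma) * log( (1/n) * sum_i f_i^gamma )
    first = -(1.0 / gamma) * np.log(np.mean(np.exp(gamma * logf)))
    # second term: (1/(1+gamma)) * log( (1/n) * sum_i int f(y|x_i)^(1+gamma) dy )
    log_int = -0.5 * gamma * np.log(2 * np.pi * sigma2) - 0.5 * np.log(1 + gamma)
    second = (1.0 / (1 + gamma)) * log_int
    return first + second
```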
Moreover, the empirical estimation of the γ -cross-entropy with a penalty term can be given by:
$$
L_\gamma(\theta;\lambda) = \bar d_\gamma(f(y|x;\theta)) + \lambda P(\theta),
$$
where P ( θ ) is a penalty for parameter θ and λ is a tuning parameter for the penalty term. As an example of the penalty term, we can consider L 1 (Lasso, Tibshirani [1]), elasticnet [16], group Lasso [17], fused Lasso [18], and so on. The sparse γ -estimator can be proposed by:
$$
\hat\theta_S = \mathop{\mathrm{argmin}}_{\theta}\, L_\gamma(\theta;\lambda).
$$
To obtain the minimizer, we propose the iterative algorithm by the majorization-minimization algorithm (MM algorithm) [14].

3. Parameter Estimation Procedure

3.1. MM Algorithm for Sparse γ -Regression

The MM algorithm is constructed as follows. Let h ( η ) be the objective function. Let us prepare the majorization function h M M satisfying:
$$
h_{MM}(\eta^{(m)}\mid\eta^{(m)}) = h(\eta^{(m)}), \qquad h_{MM}(\eta\mid\eta^{(m)}) \ge h(\eta) \quad \text{for all } \eta,
$$
where $\eta^{(m)}$ is the parameter at the $m$-th iterative step for $m = 0, 1, 2, \ldots$. Let us consider the iterative algorithm given by:
$$
\eta^{(m+1)} = \mathop{\mathrm{argmin}}_{\eta}\, h_{MM}(\eta\mid\eta^{(m)}).
$$
Then, we can show that the objective function h ( η ) monotonically decreases at each step, because:
$$
h(\eta^{(m)}) = h_{MM}(\eta^{(m)}\mid\eta^{(m)}) \ge h_{MM}(\eta^{(m+1)}\mid\eta^{(m)}) \ge h(\eta^{(m+1)}).
$$
Note that η ( m + 1 ) does not necessarily have to be the minimizer of h M M ( η | η ( m ) ) . We only need:
$$
h_{MM}(\eta^{(m)}\mid\eta^{(m)}) \ge h_{MM}(\eta^{(m+1)}\mid\eta^{(m)}).
$$
We construct the majorization function for the sparse γ -regression by the following inequality:
$$
\kappa(z^{T}\eta) \le \sum_{i} \frac{z_i \eta_i^{(m)}}{z^{T}\eta^{(m)}}\,\kappa\!\left(\frac{z^{T}\eta^{(m)}}{\eta_i^{(m)}}\,\eta_i\right), \tag{5}
$$
where $\kappa(u)$ is a convex function, $z = (z_1, \ldots, z_n)^T$, $\eta = (\eta_1, \ldots, \eta_n)^T$, $\eta^{(m)} = (\eta_1^{(m)}, \ldots, \eta_n^{(m)})^T$, and $z_i$, $\eta_i$ and $\eta_i^{(m)}$ are positive. The inequality (5) holds from Jensen's inequality. Here, we take $z_i = \frac{1}{n}$, $\eta_i = f(y_i|x_i;\theta)^{\gamma}$, $\eta_i^{(m)} = f(y_i|x_i;\theta^{(m)})^{\gamma}$, and $\kappa(u) = -\log u$ in (5). We can propose the majorization function as follows:
$$
\begin{aligned}
h(\theta) &= L_\gamma(\theta;\lambda)\\
&= -\frac{1}{\gamma}\log\frac{1}{n}\sum_{i=1}^{n} f(y_i|x_i;\theta)^{\gamma} + \frac{1}{1+\gamma}\log\frac{1}{n}\sum_{i=1}^{n}\int f(y|x_i;\theta)^{1+\gamma}\,dy + \lambda P(\theta)\\
&\le -\frac{1}{\gamma}\sum_{i=1}^{n}\alpha_i^{(m)}\log\left\{ f(y_i|x_i;\theta)^{\gamma}\,\frac{\frac{1}{n}\sum_{l=1}^{n} f(y_l|x_l;\theta^{(m)})^{\gamma}}{f(y_i|x_i;\theta^{(m)})^{\gamma}}\right\} + \frac{1}{1+\gamma}\log\frac{1}{n}\sum_{i=1}^{n}\int f(y|x_i;\theta)^{1+\gamma}\,dy + \lambda P(\theta)\\
&= -\sum_{i=1}^{n}\alpha_i^{(m)}\log f(y_i|x_i;\theta) + \frac{1}{1+\gamma}\log\frac{1}{n}\sum_{i=1}^{n}\int f(y|x_i;\theta)^{1+\gamma}\,dy + \lambda P(\theta) + \mathrm{const}\\
&= h_{MM}(\theta\mid\theta^{(m)}) + \mathrm{const}, \tag{6}
\end{aligned}
$$
where $\alpha_i^{(m)} = \dfrac{f(y_i|x_i;\theta^{(m)})^{\gamma}}{\sum_{l=1}^{n} f(y_l|x_l;\theta^{(m)})^{\gamma}}$ and $\mathrm{const}$ is a term that does not depend on the parameter $\theta$.
The first term of the original objective function $h(\theta)$ is the logarithm of a mixture-type sum of densities, which is not easy to optimize, whereas the first term of $h_{MM}(\theta\mid\theta^{(m)})$ is a weighted log-likelihood, which is often easy to optimize.
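The scheme above can be written as a generic loop; the sketch below (illustrative, not from the paper's software) only assumes that a routine minimizing the majorizer is available, and checks the monotone decreasing property at each iteration.

```python
def mm_minimize(h, argmin_majorizer, eta0, max_iter=100, tol=1e-8):
    """Generic MM loop: eta^(m+1) = argmin_eta h_MM(eta | eta^(m)).

    h                : the objective function h(eta)
    argmin_majorizer : a function eta_m -> argmin_eta h_MM(eta | eta_m)
    """
    eta = eta0
    for _ in range(max_iter):
        eta_new = argmin_majorizer(eta)
        # monotone decreasing property: h(eta^(m+1)) <= h(eta^(m))
        assert h(eta_new) <= h(eta) + 1e-12
        if abs(h(eta) - h(eta_new)) < tol:
            return eta_new
        eta = eta_new
    return eta
```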

3.2. Sparse γ -Linear Regression

Let f ( y | x ; θ ) be the conditional density with θ = ( β 0 , β , σ 2 ) , given by:
$$
f(y|x;\theta) = \phi(y;\beta_0 + x^{T}\beta, \sigma^2),
$$
where ϕ ( y ; μ , σ 2 ) is the normal density with mean parameter μ and variance parameter σ 2 . Suppose that P ( θ ) is the L 1 regularization | | β | | 1 . After a simple calculation, we have:
$$
h_{MM}(\theta\mid\theta^{(m)}) = \frac{1}{2(1+\gamma)}\log\sigma^2 + \frac{1}{2}\sum_{i=1}^{n}\alpha_i^{(m)}\frac{(y_i-\beta_0-x_i^{T}\beta)^2}{\sigma^2} + \lambda\|\beta\|_1.
$$
This function is easy to optimize by an update algorithm. For a fixed value of $\sigma^2$, the function $h_{MM}$ is almost the same as the Lasso objective except for the weights, so that it can be updated using the coordinate descent algorithm with a decreasing property of the loss function. For a fixed value of $(\beta_0, \beta^T)^T$, the function $h_{MM}$ is easy to minimize with respect to $\sigma^2$. Consequently, we can obtain the update algorithm in Algorithm 1 with the decreasing property:
$$
h_{MM}(\theta^{(m+1)}\mid\theta^{(m)}) \le h_{MM}(\theta^{(m)}\mid\theta^{(m)}).
$$
Algorithm 1 Sparse γ-linear regression.
  • Require: $\beta_0^{(0)}, \beta^{(0)}, \sigma^{2(0)}$
  • repeat for $m = 0, 1, 2, \ldots$
  •    $\alpha_i^{(m)} \leftarrow \dfrac{\phi(y_i;\beta_0^{(m)}+x_i^{T}\beta^{(m)},\sigma^{2(m)})^{\gamma}}{\sum_{l=1}^{n}\phi(y_l;\beta_0^{(m)}+x_l^{T}\beta^{(m)},\sigma^{2(m)})^{\gamma}}$  $(i = 1, 2, \ldots, n)$.
  •    $\beta_0^{(m+1)} \leftarrow \sum_{i=1}^{n}\alpha_i^{(m)}(y_i - x_i^{T}\beta^{(m)})$.
  •    for $j = 1, \ldots, p$ do
  •      $\beta_j^{(m+1)} \leftarrow \dfrac{S\!\left(\sum_{i=1}^{n}\alpha_i^{(m)}(y_i-\beta_0^{(m+1)}-r_{i,j}^{(m)})\,x_{ij},\ \sigma^{2(m)}\lambda\right)}{\sum_{i=1}^{n}\alpha_i^{(m)}x_{ij}^{2}}$,
  •      where $S(t,\lambda)=\mathrm{sign}(t)\,(|t|-\lambda)_{+}$ and $r_{i,j}^{(m)}=\sum_{k\ne j}x_{ik}\left(1(k<j)\,\beta_k^{(m+1)}+1(k>j)\,\beta_k^{(m)}\right)$.
  •    $\sigma^{2(m+1)} \leftarrow (1+\gamma)\sum_{i=1}^{n}\alpha_i^{(m)}\left(y_i-\beta_0^{(m+1)}-x_i^{T}\beta^{(m+1)}\right)^{2}$.
  • until convergence
  • Ensure: $\hat\beta_0, \hat\beta, \hat\sigma^2$
It should be noted that $h_{MM}$ is convex with respect to the parameters $\beta_0$ and $\beta$ and has a global minimum with respect to the parameter $\sigma^2$, but the original objective function $h$ is not convex with respect to them, so that the initial point of Algorithm 1 is important. This issue is discussed in Section 5.4.
In practice, we also use the active set strategy [19] in the coordinate descent algorithm for updating $\beta^{(m)}$. The active set consists of the non-zero coordinates of $\beta^{(m)}$. Specifically, for a given $\beta^{(m)}$, we only update the non-zero coordinates of $\beta^{(m)}$ until they converge. Then, the parameter estimates outside the active set are updated once. If they all remain zero, the coordinate descent algorithm stops. If some of them do not remain zero, those coordinates are added to the active set, and the coordinate descent algorithm continues.
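To make these updates concrete, here is a minimal Python sketch of Algorithm 1 for a single λ (our illustration, not the gamreg implementation; the active set strategy and convergence checks are omitted, and all names are ours).

```python
import numpy as np

def soft_threshold(t, lam):
    """S(t, lambda) = sign(t) * (|t| - lambda)_+ ."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def sparse_gamma_linear_regression(y, X, gamma, lam, beta0, beta, sigma2,
                                   max_iter=100):
    """Sparse gamma-linear regression (Algorithm 1), simplified sketch."""
    beta = np.asarray(beta, dtype=float).copy()
    n, p = X.shape
    for _ in range(max_iter):
        # weights alpha_i^(m) proportional to phi(y_i; beta0 + x_i^T beta, sigma2)^gamma
        resid = y - beta0 - X @ beta
        logphi = -0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2)
        w = np.exp(gamma * (logphi - logphi.max()))   # stabilised
        alpha = w / w.sum()
        # intercept update (weighted mean of partial residuals)
        beta0 = np.sum(alpha * (y - X @ beta))
        # coordinate descent with soft-thresholding
        for j in range(p):
            r_j = y - beta0 - X @ beta + X[:, j] * beta[j]   # partial residual
            num = soft_threshold(np.sum(alpha * r_j * X[:, j]), sigma2 * lam)
            beta[j] = num / np.sum(alpha * X[:, j] ** 2)
        # scale update
        resid = y - beta0 - X @ beta
        sigma2 = (1 + gamma) * np.sum(alpha * resid**2)
    return beta0, beta, sigma2
```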

3.3. Robust Cross-Validation

In sparse regression, a regularization parameter is often selected via a criterion, and cross-validation is commonly used for this purpose. Ordinary cross-validation is based on the squared error, and it can also be constructed using the KL-cross-entropy with the normal density. However, ordinary cross-validation will fail due to outliers. Therefore, we propose a robust cross-validation based on the γ-cross-entropy. Let $\hat\theta_\gamma$ be the robust estimate based on the γ-cross-entropy. The cross-validation score based on the γ-cross-entropy can be given by:
$$
\mathrm{RoCV}(\lambda) = -\frac{1}{\gamma_0}\log\frac{1}{n}\sum_{i=1}^{n} f(y_i|x_i;\hat\theta_\gamma^{[i]})^{\gamma_0} + \frac{1}{1+\gamma_0}\log\frac{1}{n}\sum_{i=1}^{n}\int f(y|x_i;\hat\theta_\gamma^{[i]})^{1+\gamma_0}\,dy,
$$
where $\hat\theta_\gamma^{[i]}$ is the γ-estimator computed with the $i$-th observation deleted and $\gamma_0$ is an appropriate tuning parameter. We can also adopt $K$-fold cross-validation to reduce the computational cost [20].
Here, we give a small modification of the above. We often focus only on the mean structure for prediction, not on the variance parameter. Therefore, in this paper, $\hat\theta_\gamma^{[i]} = (\hat\beta_\gamma^{[i]}, \hat\sigma^{2[i]}_\gamma)$ is replaced by $(\hat\beta_\gamma^{[i]}, \hat\sigma^2_{fix})$. In the numerical experiments and real data analyses, we used $\sigma^{2(0)}$ as $\sigma^2_{fix}$.
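A minimal K-fold version of RoCV might look as follows; `fit` stands for any routine returning (beta0, beta) with the variance held at the fixed value, e.g., the Algorithm 1 sketch above, and the closed-form Gaussian integral is reused from before (illustrative code, not the package's implementation).

```python
import numpy as np

def rocv(y, X, lam, gamma, gamma0, sigma2_fix, fit, n_folds=5):
    """K-fold robust cross-validation score based on the gamma_0-cross-entropy.

    fit(y_train, X_train, lam, gamma) is assumed to return (beta0, beta);
    the variance is fixed at sigma2_fix as described in Section 3.3.
    """
    n = len(y)
    folds = np.array_split(np.random.permutation(n), n_folds)
    logf = np.empty(n)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        beta0, beta = fit(y[train_idx], X[train_idx], lam, gamma)
        resid = y[test_idx] - beta0 - X[test_idx] @ beta
        logf[test_idx] = (-0.5 * np.log(2 * np.pi * sigma2_fix)
                          - resid**2 / (2 * sigma2_fix))
    first = -(1.0 / gamma0) * np.log(np.mean(np.exp(gamma0 * logf)))
    log_int = -0.5 * gamma0 * np.log(2 * np.pi * sigma2_fix) - 0.5 * np.log(1 + gamma0)
    second = (1.0 / (1 + gamma0)) * log_int
    return first + second
```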

4. Robust Properties

In this section, the robust properties are presented from the two viewpoints of the latent bias and the Pythagorean relation. The latent bias was discussed in Fujisawa and Eguchi [9] and Kanamori and Fujisawa [11] and is described later in this section. Using the results obtained there, the Pythagorean relation is shown in Theorems 2 and 3.
Let f * ( y | x ) = f θ * ( y | x ) = f ( y | x ; θ * ) and δ ( y | x ) be the target conditional probability density function and the contamination conditional probability density function related to outliers, respectively. Let ϵ and ϵ ( x ) denote the outlier ratios, which are independent of and dependent on x, respectively. Under homogeneous and heterogeneous contaminations, we suppose that the underlying conditional probability density function can be expressed as:
$$
g(y|x) = (1-\epsilon)\, f(y|x;\theta^*) + \epsilon\,\delta(y|x), \qquad g(y|x) = (1-\epsilon(x))\, f(y|x;\theta^*) + \epsilon(x)\,\delta(y|x).
$$
Let:
$$
\nu_{f,\gamma}(x) = \left\{\int \delta(y|x)\, f(y|x)^{\gamma}\,dy\right\}^{\frac{1}{\gamma}} \quad (\gamma>0),
$$
and let:
$$
\nu_{f,\gamma} = \left\{\int \nu_{f,\gamma}(x)^{\gamma}\, g(x)\,dx\right\}^{\frac{1}{\gamma}}.
$$
Here, we assume that:
$$
\nu_{f_{\theta^*},\gamma} \approx 0,
$$
which implies that $\nu_{f_{\theta^*},\gamma}(x) \approx 0$ for any $x$ (a.e.) and illustrates that the contamination conditional probability density function $\delta(y|x)$ lies on the tail of the target conditional probability density function $f(y|x;\theta^*)$. For example, if $\delta(y|x)$ is the Dirac function at the outlier $y(x)$ given $x$, then we have $\nu_{f_{\theta^*},\gamma}(x) = f(y(x)|x;\theta^*)$, which should be sufficiently small because $y(x)$ is an outlier. In this section, we show that $\theta_\gamma^* - \theta^*$ is expected to be small even if $\epsilon$ or $\epsilon(x)$ is not small. To make the discussion easier, we prepare the monotone transformation of the γ-cross-entropy for regression given by:
$$
\tilde d_\gamma(g(y|x), f(y|x;\theta); g(x)) = -\exp\left\{-\gamma\, d_\gamma(g(y|x), f(y|x;\theta); g(x))\right\} = -\frac{\displaystyle\int\left\{\int g(y|x)\, f(y|x;\theta)^{\gamma}\,dy\right\}g(x)\,dx}{\displaystyle\left[\int\left\{\int f(y|x;\theta)^{1+\gamma}\,dy\right\}g(x)\,dx\right]^{\frac{\gamma}{1+\gamma}}}.
$$

4.1. Homogeneous Contamination

Here, we provide the following proposition, which was given in Kanamori and Fujisawa [11].
Proposition 1.
$$
\tilde d_\gamma(g(y|x), f(y|x;\theta); g(x)) = (1-\epsilon)\,\tilde d_\gamma(f(y|x;\theta^*), f(y|x;\theta); g(x)) - \frac{\epsilon\,\nu_{f_\theta,\gamma}^{\gamma}}{\left[\int\left\{\int f(y|x;\theta)^{1+\gamma}\,dy\right\}g(x)\,dx\right]^{\frac{\gamma}{1+\gamma}}}.
$$
Recall that $\theta_\gamma^*$ and $\theta^*$ are also the minimizers of $\tilde d_\gamma(g(y|x), f(y|x;\theta); g(x))$ and $\tilde d_\gamma(f(y|x;\theta^*), f(y|x;\theta); g(x))$, respectively. We can expect $\nu_{f_\theta,\gamma} \approx 0$ from the assumption $\nu_{f_{\theta^*},\gamma} \approx 0$ if the tail behavior of $f(y|x;\theta)$ is close to that of $f(y|x;\theta^*)$. We see from Proposition 1 and the condition $\nu_{f_\theta,\gamma} \approx 0$ that:
$$
\begin{aligned}
\theta_\gamma^* &= \mathop{\mathrm{argmin}}_{\theta}\, \tilde d_\gamma(g(y|x), f(y|x;\theta); g(x))\\
&= \mathop{\mathrm{argmin}}_{\theta}\left\{(1-\epsilon)\,\tilde d_\gamma(f(y|x;\theta^*), f(y|x;\theta); g(x)) - \frac{\epsilon\,\nu_{f_\theta,\gamma}^{\gamma}}{\left[\int\left\{\int f(y|x;\theta)^{1+\gamma}\,dy\right\}g(x)\,dx\right]^{\frac{\gamma}{1+\gamma}}}\right\}\\
&\approx \mathop{\mathrm{argmin}}_{\theta}\, (1-\epsilon)\,\tilde d_\gamma(f(y|x;\theta^*), f(y|x;\theta); g(x))\\
&= \theta^*.
\end{aligned}
$$
Therefore, under homogeneous contamination, it can be expected that the latent bias $\theta_\gamma^* - \theta^*$ is small even if $\epsilon$ is not small. Moreover, we can show the following theorem, using Proposition 1.
Theorem 2.
Let $\nu = \max\{\nu_{f_\theta,\gamma}, \nu_{f_{\theta^*},\gamma}\}$. Then, the Pythagorean relation among $g(y|x)$, $f(y|x;\theta^*)$, $f(y|x;\theta)$ approximately holds:
$$
D_\gamma(g(y|x), f(y|x;\theta); g(x)) - D_\gamma(g(y|x), f(y|x;\theta^*); g(x)) = D_\gamma(f(y|x;\theta^*), f(y|x;\theta); g(x)) + O(\nu^{\gamma}).
$$
The proof is in Appendix A. The Pythagorean relation implies that the minimization of the divergence from f ( y | x ; θ ) to the underlying conditional probability density function g ( y | x ) is approximately the same as that to the target conditional probability density function f ( y | x ; θ * ) . Therefore, under homogeneous contamination, we can see why our proposed method works well in terms of the minimization of the γ -divergence.

4.2. Heterogeneous Contamination

Under heterogeneous contamination, we assume that the parametric conditional probability density function f ( y | x ; θ ) is a location-scale family given by:
$$
f(y|x;\theta) = \frac{1}{\sigma}\, s\!\left(\frac{y-q(x;\xi)}{\sigma}\right),
$$
where s ( y ) is a probability density function, σ is a scale parameter and q ( x ; ξ ) is a location function with a regression parameter ξ , e.g., q ( x ; ξ ) = ξ T x . Then, we can obtain:
$$
\int f(y|x;\theta)^{1+\gamma}\,dy = \int \frac{1}{\sigma^{1+\gamma}}\, s\!\left(\frac{y-q(x;\xi)}{\sigma}\right)^{1+\gamma} dy = \sigma^{-\gamma}\int s(z)^{1+\gamma}\,dz,
$$
which does not depend on the explanatory variable $x$. Here, we provide the following proposition, which was given in Kanamori and Fujisawa [11].
Proposition 2.
$$
\tilde d_\gamma(g(y|x), f(y|x;\theta); g(x)) = c\,\tilde d_\gamma(f(y|x;\theta^*), f(y|x;\theta); \tilde g(x)) - \frac{\displaystyle\int \nu_{f_\theta,\gamma}(x)^{\gamma}\,\epsilon(x)\, g(x)\,dx}{\displaystyle\left(\sigma^{-\gamma}\int s(z)^{1+\gamma}\,dz\right)^{\frac{\gamma}{1+\gamma}}},
$$
where $c = \left\{\int (1-\epsilon(x))\, g(x)\,dx\right\}^{\frac{\gamma}{1+\gamma}}$ and $\tilde g(x) = (1-\epsilon(x))\, g(x)$.
The second term on the right-hand side can be approximated by zero from the condition $\nu_{f_\theta,\gamma} \approx 0$ and $\epsilon(x) < 1$ as follows:
$$
\frac{\displaystyle\int \nu_{f_\theta,\gamma}(x)^{\gamma}\,\epsilon(x)\, g(x)\,dx}{\displaystyle\left(\sigma^{-\gamma}\int s(z)^{1+\gamma}\,dz\right)^{\frac{\gamma}{1+\gamma}}} < \frac{\displaystyle\int \nu_{f_\theta,\gamma}(x)^{\gamma}\, g(x)\,dx}{\displaystyle\left(\sigma^{-\gamma}\int s(z)^{1+\gamma}\,dz\right)^{\frac{\gamma}{1+\gamma}}} = \frac{\nu_{f_\theta,\gamma}^{\gamma}}{\left(\sigma^{-\gamma}\int s(z)^{1+\gamma}\,dz\right)^{\frac{\gamma}{1+\gamma}}} \approx 0. \tag{7}
$$
We see from Proposition 2 and (7) that:
$$
\begin{aligned}
\theta_\gamma^* &= \mathop{\mathrm{argmin}}_{\theta}\, \tilde d_\gamma(g(y|x), f(y|x;\theta); g(x))\\
&= \mathop{\mathrm{argmin}}_{\theta}\left\{c\,\tilde d_\gamma(f(y|x;\theta^*), f(y|x;\theta); \tilde g(x)) - \frac{\int \nu_{f_\theta,\gamma}(x)^{\gamma}\,\epsilon(x)\, g(x)\,dx}{\left(\sigma^{-\gamma}\int s(z)^{1+\gamma}\,dz\right)^{\frac{\gamma}{1+\gamma}}}\right\}\\
&\approx \mathop{\mathrm{argmin}}_{\theta}\, c\,\tilde d_\gamma(f(y|x;\theta^*), f(y|x;\theta); \tilde g(x))\\
&= \theta^*.
\end{aligned}
$$
Therefore, under heterogeneous contamination in a location-scale family, it can be expected that the latent bias $\theta_\gamma^* - \theta^*$ is small even if $\epsilon(x)$ is not small. Moreover, we can show the following theorem, using Proposition 2.
Theorem 3.
Let $\nu = \max\{\nu_{f_\theta,\gamma}, \nu_{f_{\theta^*},\gamma}\}$. Then, the following relation among $g(y|x)$, $f(y|x;\theta^*)$, $f(y|x;\theta)$ approximately holds:
$$
D_\gamma(g(y|x), f(y|x;\theta); g(x)) - D_\gamma(g(y|x), f(y|x;\theta^*); g(x)) = D_\gamma(f(y|x;\theta^*), f(y|x;\theta); \tilde g(x)) + O(\nu^{\gamma}).
$$
The proof is in Appendix A. The above is slightly different from a conventional Pythagorean relation, because the base measure changes from g ( x ) to g ˜ ( x ) in part. However, it also implies that the minimization of the divergence from f ( y | x ; θ ) to the underlying conditional probability density function g ( y | x ) is approximately the same as that to the target conditional probability density function f ( y | x ; θ * ) . Therefore, under heterogeneous contamination in a location-scale family, we can see why our proposed method works well in terms of the minimization of the γ -divergence.

4.3. Redescending Property

First, we review the redescending property in M-estimation (see, e.g., [21]), which is often used in robust statistics. Suppose that the estimating equation is given by $\sum_{i=1}^{n}\zeta(z_i;\theta) = 0$. Let $\hat\theta$ be a solution of the estimating equation. The bias caused by an outlier $z_o$ is expressed as $\hat\theta_{n=\infty} - \theta^*$, where $\hat\theta_{n=\infty}$ is the limiting value of $\hat\theta$ and $\theta^*$ is the true parameter. We hope that the bias is small even if the outlier $z_o$ exists. Under some conditions, the bias can be approximated by $\epsilon\,\mathrm{IF}(z_o;\theta^*)$, where $\epsilon$ is a small outlier ratio and $\mathrm{IF}(z;\theta^*)$ is the influence function. The bias is expected to be small when the influence function is small. The influence function can be expressed as $\mathrm{IF}(z;\theta^*) = A\,\zeta(z;\theta^*)$, where $A$ is a matrix independent of $z$, so that the bias is also expected to be small when $\zeta(z_o;\theta^*)$ is small. In particular, the estimating equation is said to have a redescending property if $\zeta(z;\theta^*)$ goes to zero as $\|z\|$ goes to infinity. This property is favorable in robust statistics, because the bias is expected to be sufficiently small even when $z_o$ is very large.
Here, we prove a redescending property on the sparse γ -linear regression, i.e., when f ( y | x ; θ ) = ϕ ( y ; β 0 + x T β , σ 2 ) with θ = ( β 0 , β , σ 2 ) for fixed x. Recall that the estimate of the sparse γ -linear regression is the minimizer of the loss function:
$$
L_\gamma(\theta;\lambda) = -\frac{1}{\gamma}\log\frac{1}{n}\sum_{i=1}^{n}\phi(y_i;\beta_0+x_i^{T}\beta,\sigma^2)^{\gamma} + b_\gamma(\theta;\lambda),
$$
where $b_\gamma(\theta;\lambda) = \frac{1}{1+\gamma}\log\frac{1}{n}\sum_{i=1}^{n}\int\phi(y;\beta_0+x_i^{T}\beta,\sigma^2)^{1+\gamma}\,dy + \lambda\|\beta\|_1$. Then, the estimating equation is given by:
$$
0 = \frac{\partial}{\partial\theta} L_\gamma(\theta;\lambda) = -\frac{\sum_{i=1}^{n}\phi(y_i;\beta_0+x_i^{T}\beta,\sigma^2)^{\gamma}\, s(y_i|x_i;\theta)}{\sum_{i=1}^{n}\phi(y_i;\beta_0+x_i^{T}\beta,\sigma^2)^{\gamma}} + \frac{\partial}{\partial\theta} b_\gamma(\theta;\lambda),
$$
where $s(y|x;\theta) = \dfrac{\partial \log\phi(y;\beta_0+x^{T}\beta,\sigma^2)}{\partial\theta}$. This can be expressed by the M-estimation formula given by:
$$
0 = \sum_{i=1}^{n}\psi(y_i|x_i;\theta),
$$
where $\psi(y|x;\theta) = \phi(y;\beta_0+x^{T}\beta,\sigma^2)^{\gamma}\, s(y|x;\theta) - \phi(y;\beta_0+x^{T}\beta,\sigma^2)^{\gamma}\,\dfrac{\partial}{\partial\theta} b_\gamma(\theta;\lambda)$. We can easily show that as $|y|$ goes to infinity, $\phi(y;\beta_0+x^{T}\beta,\sigma^2)^{\gamma}$ goes to zero and $\phi(y;\beta_0+x^{T}\beta,\sigma^2)^{\gamma}\, s(y|x;\theta)$ also goes to zero. Therefore, the function $\psi(y|x;\theta)$ goes to zero as $|y|$ goes to infinity, so that the estimating equation has a redescending property.
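As a quick illustrative check (ours, not part of the paper), the Gaussian weight $\phi^\gamma$ and the weighted score both vanish numerically as $y$ moves away from the regression surface; here the score component for the mean, $(y-\mu)/\sigma^2$, is used.

```python
import numpy as np

mu, sigma2, gamma = 0.0, 1.0, 0.5
for y in [1.0, 5.0, 10.0, 50.0]:
    phi = np.exp(-(y - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    score_mu = (y - mu) / sigma2          # d/dmu log phi(y; mu, sigma2)
    print(y, phi ** gamma, phi ** gamma * score_mu)
# both phi^gamma and phi^gamma * score tend to 0 as |y - mu| grows
```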

5. Numerical Experiment

In this section, we compare our method (sparse γ -linear regression) with the representative sparse linear regression method, the least absolute shrinkage and selection operator (Lasso) [1], and the robust and sparse regression methods, sparse least trimmed squares (sLTS) [4] and robust least angle regression (RLARS) [2].

5.1. Regression Models for Simulation

We used the simulation model given by:
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + e, \qquad e \sim N(0, 0.5^2).
$$
The sample size and the number of explanatory variables were set to be n = 100 and p = 100 , 200 , respectively. The true coefficients were given by:
$$
\beta_1 = 1,\ \beta_2 = 2,\ \beta_4 = 4,\ \beta_7 = 7,\ \beta_{11} = 11,\qquad \beta_j = 0 \ \text{for } j \in \{0, \ldots, p\}\setminus\{1,2,4,7,11\}.
$$
We arranged a broad range of regression coefficients to observe sparsity for various degrees of regression coefficients. The explanatory variables were generated from a normal distribution N ( 0 , Σ ) with Σ = ( ρ | i j | ) 1 i , j p . We generated 100 random samples.
Outliers were incorporated into simulations. We investigated two outlier ratios ( ϵ = 0.1   and   0.3 ) and two outlier patterns: (a) the outliers were generated around the middle part of the explanatory variable, where the explanatory variables were generated from N ( 0 , 0.5 2 ) and the error terms were generated from N ( 20 , 0.5 2 ) ; (b) the outliers were generated around the edge part of the explanatory variable, where the explanatory variables were generated from N ( 1.5 , 0.5 2 ) and the error terms were generated from N ( 20 , 0.5 2 ) .

5.2. Performance Measure

The root mean squared prediction error (RMSPE) and the mean squared error (MSE) of the coefficients were examined to verify the predictive performance and the accuracy of the estimated regression coefficients:
$$
\mathrm{RMSPE}(\hat\beta) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^* - x_i^{*T}\hat\beta\right)^2}, \qquad \mathrm{MSE}(\hat\beta) = \frac{1}{p+1}\sum_{j=0}^{p}\left(\beta_j^* - \hat\beta_j\right)^2,
$$
where ( x i * , y i * ) ( i = 1 , , n ) is the test sample generated from the simulation model without outliers and β j * ’s are the true coefficients. The true positive rate (TPR) and true negative rate (TNR) were also reported to verify the sparsity:
$$
\mathrm{TPR}(\hat\beta) = \frac{|\{j\in\{1,\ldots,p\} : \hat\beta_j \ne 0 \wedge \beta_j^* \ne 0\}|}{|\{j\in\{1,\ldots,p\} : \beta_j^* \ne 0\}|}, \qquad \mathrm{TNR}(\hat\beta) = \frac{|\{j\in\{1,\ldots,p\} : \hat\beta_j = 0 \wedge \beta_j^* = 0\}|}{|\{j\in\{1,\ldots,p\} : \beta_j^* = 0\}|}.
$$
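These measures can be computed directly; the following sketch (ours) assumes that predictions are formed as beta0_hat + X_test @ beta_hat and that beta_true has length p + 1 with the intercept as its first element.

```python
import numpy as np

def performance(beta_hat, beta0_hat, beta_true, X_test, y_test):
    """RMSPE, MSE, TPR and TNR as defined in Section 5.2 (sketch)."""
    pred = beta0_hat + X_test @ beta_hat
    rmspe = np.sqrt(np.mean((y_test - pred) ** 2))
    beta_full_hat = np.concatenate(([beta0_hat], beta_hat))
    mse = np.mean((beta_true - beta_full_hat) ** 2)        # averages over p + 1 entries
    nz_hat, nz_true = beta_hat != 0, beta_true[1:] != 0     # slopes only
    tpr = np.sum(nz_hat & nz_true) / np.sum(nz_true)
    tnr = np.sum(~nz_hat & ~nz_true) / np.sum(~nz_true)
    return rmspe, mse, tpr, tnr
```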

5.3. Comparative Methods

In this subsection, we explain three comparative methods: Lasso, RLARS and sLTS.
Lasso is performed by the R package “glmnet”. The regularization parameter λ_Lasso is selected by a grid search via cross-validation in “glmnet”. We used the default settings of “glmnet”.
RLARS is performed by the R-package “robustHD”. This is a robust version of LARS [3]. The optimal model is selected via BIC by default.
sLTS is performed by the R package “robustHD”. sLTS has a regularization parameter λ_sLTS and a fraction parameter α, which determines the fraction of squared residuals used in the trimmed sum of squares. The regularization parameter λ_sLTS is selected by a grid search via BIC. The number of grid points is 40 by default. However, we considered that this would be too small under heavy contamination. Therefore, we used 80 grid points under heavy contamination to obtain a good performance. The fraction parameter α is 0.75 by default; with α = 0.75, the assumed outlier ratio is at most 25%. We considered that this would be too small under heavy contamination and too large under low contamination in terms of statistical efficiency. Therefore, we used 0.65, 0.75 and 0.85 as α under low contamination and 0.50, 0.65 and 0.75 under heavy contamination.

5.4. Details of Our Method

5.4.1. Initial Points

In our method, we need an initial point to obtain the estimate, because we use the iterative algorithm proposed in Section 3.2. The estimate of another conventional robust and sparse regression method would give a good initial point. Alternatively, the estimate of the RANSAC (random sample consensus) algorithm would also give a good initial point. In this experiment, we used the estimate of sLTS as an initial point.

5.4.2. How to Choose Tuning Parameters

In our method, we have to choose some tuning parameters. The parameter γ in the γ-divergence was set to 0.1 or 0.5. The parameter γ0 in the robust cross-validation was set to 0.5. In our experience, the result via RoCV is not sensitive to the selection of γ0 when γ0 is large enough, e.g., γ0 = 0.5, 1. The parameter λ of the L1 regularization is often selected via a grid search. We used 50 grid points in the range [0.05λ0, λ0] on the log scale, where λ0 is an estimate of the smallest λ that shrinks all regression coefficients to zero. More specifically, in a similar way as in Lasso, we can derive the λ0 that shrinks the coefficients β to zero in h_MM(θ|θ^(0)) in (6) with respect to β, and we used it. This idea follows the approach used in the R package “glmnet”.
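Under the weighted surrogate $h_{MM}(\theta\mid\theta^{(0)})$, the coefficient $\beta_j$ stays at zero when $|\sum_i \alpha_i^{(0)} x_{ij}(y_i - \beta_0^{(0)})| \le \sigma^{2(0)}\lambda$, which suggests the following grid construction (our reading of the glmnet-style rule, not the package code; all names are illustrative).

```python
import numpy as np

def lambda_grid(y, X, alpha0, beta0_init, sigma2_init, n_grid=50, ratio=0.05):
    """Log-scale grid [0.05 * lambda_0, lambda_0] for the L1 tuning parameter.

    lambda_0 is the smallest lambda for which all slopes of the surrogate
    h_MM(theta | theta^(0)) are shrunk to zero; alpha0 are the initial
    weights, beta0_init and sigma2_init the initial intercept and scale.
    """
    lam0 = np.max(np.abs(X.T @ (alpha0 * (y - beta0_init)))) / sigma2_init
    return np.exp(np.linspace(np.log(ratio * lam0), np.log(lam0), n_grid))
```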

5.5. Result

Table 1 reports the low contamination case with Outlier Pattern (a). For the RMSPE, our method outperformed the other comparative methods (the oracle value of the RMSPE is 0.5). For the TPR and TNR, sLTS showed a similar performance to our method. Lasso presented the worst performance, because it is sensitive to outliers. Table 2 reports the heavy contamination case with Outlier Pattern (a). For the RMSPE, our method outperformed the other comparative methods, except in the case (p, ϵ, ρ) = (100, 0.3, 0.2) for sLTS with α = 0.5. Lasso again performed poorly, and sLTS with α = 0.75 showed the worst performance due to a lack of truncation. For the TPR and TNR, our method showed the best performance. Table 3 reports the low contamination case with Outlier Pattern (b). For the RMSPE, our method outperformed the other comparative methods (the oracle value of the RMSPE is 0.5). For the TPR and TNR, sLTS showed a similar performance to our method. Lasso presented the worst performance, because it is sensitive to outliers. Table 4 reports the heavy contamination case with Outlier Pattern (b). For the RMSPE, our method outperformed the other comparative methods. sLTS with α = 0.5 showed the worst performance. For the TPR and TNR, it seems that our method showed the best performance. Table 5 reports the case with no contamination. RLARS showed the best performance, but our method presented comparable performance. Despite the absence of contamination, Lasso was clearly worse than RLARS and our method. This is probably because the underlying distribution can occasionally generate large values in the simulation, although with small probability.

5.6. Computational Cost

In this subsection, we compare the CPU times of Lasso, RLARS, sLTS and our method. The data were generated from the simulation model in Section 5.1. The sample size and the number of explanatory variables were set to n = 100 and p = 100, 500, 1000, 2000, 5000, respectively. For Lasso, RLARS and sLTS, all parameters were set to their defaults (see Section 5.3). Our method used the estimate of the RANSAC algorithm as an initial point. The number of candidates for the RANSAC algorithm was set to 1000. The parameters γ and γ0 were set to 0.1 and 0.5, respectively. None of the methods used parallel computing. Figure 1 shows the average CPU times over 10 runs in seconds. All results were obtained in R version 3.3.0 on an Intel Core i7-4790K machine. sLTS shows a very high computational cost. RLARS is faster, but does not give a good estimate, as seen in Section 5.5. Our proposed method is fast enough even for p = 5000.

6. Real Data Analyses

In this section, we use two real datasets to compare our method with the comparative methods. For each comparative method, we report the best result over the parameter settings considered (see Section 5.3).

6.1. NCI-60 Cancer Cell Panel

We applied our method and the comparative methods to regress protein expression on gene expression data from the cancer cell panel of the National Cancer Institute. The experimental conditions were set in the same way as in Alfons et al. [4], as follows. The gene expression data were obtained with an Affymetrix HG-U133A chip and normalized with the GCRMA method, resulting in a set of p = 22,283 explanatory variables. The protein expressions based on 162 antibodies were acquired via reverse-phase protein lysate arrays and log2-transformed. One observation had to be removed since all values were missing in the gene expression data, reducing the number of observations to n = 59. The KRT18 antibody was selected as the response variable because it had the largest MAD among the 162 antibodies, i.e., KRT18 may include a large number of outliers. Both the protein expressions and the gene expression data can be downloaded via the web application CellMiner (http://discover.nci.nih.gov/cellminer/). As a measure of prediction performance, the root trimmed mean squared prediction error (RTMSPE) was computed via leave-one-out cross-validation, given by:
$$
\mathrm{RTMSPE} = \sqrt{\frac{1}{h}\sum_{i=1}^{h}\left(e^2\right)_{[i:n]}},
$$
where $e^2 = \left((y_1 - x_1^{T}\hat\beta^{[1]})^2, \ldots, (y_n - x_n^{T}\hat\beta^{[n]})^2\right)$, $(e^2)_{[1:n]} \le \cdots \le (e^2)_{[n:n]}$ are the order statistics of $e^2$ and $h = \lfloor(n+1)\,0.75\rfloor$. The choice of $h$ is important, because it is preferable that the trimmed squares do not include outliers when estimating the prediction performance. We set $h$ in the same way as in Alfons et al. [4], because sLTS detected 13 outliers in Alfons et al. [4]. In this experiment, we used the estimate of the RANSAC algorithm as an initial point instead of sLTS, because sLTS required a high computational cost for such high-dimensional data.
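The trimmed prediction error can be computed directly from the leave-one-out squared residuals; a small sketch (ours, with the floor of (n + 1)·0.75 assumed for h) is given below.

```python
import numpy as np

def rtmspe(squared_errors, fraction=0.75):
    """Root trimmed mean squared prediction error from squared residuals."""
    e2 = np.sort(np.asarray(squared_errors))
    h = int(np.floor((len(e2) + 1) * fraction))
    return np.sqrt(np.mean(e2[:h]))
```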
Table 6 shows that our method outperformed other comparative methods for the RTMSPE with high dimensional data. Our method presented the smallest RTMSPE with the second smallest number of explanatory variables. RLARS presented the smallest number of explanatory variables, but a much larger RTMSPE than our method.

6.2. Protein Homology Dataset

We applied our method and the comparative methods to the protein sequence dataset used for KDD-Cup 2004. The experimental conditions were set in the same way as in Khan et al. [2], as follows. The whole dataset consists of n = 145,751 protein sequences, grouped into 153 blocks, each corresponding to a native protein. Each data point in a particular block is a candidate homologous protein. There were 75 variables in the dataset: the block number (categorical) and 74 measurements of protein features. The first protein feature was used as the response variable. Five blocks with a total of n = 4141 protein sequences were then selected because they contained the highest proportions of homologous proteins (and hence, the highest proportions of potential outliers). The data of each block were split into two almost equal parts to obtain a training sample of size n_tra = 2072 and a test sample of size n_test = 2069. The number of explanatory variables was p = 77, consisting of four block indicators (Variables 1–4) and 73 features. The whole protein, training and test datasets can be downloaded from http://users.ugent.be/~svaelst/software/RLARS.html. As a measure of prediction performance, the root trimmed mean squared prediction error (RTMSPE) was computed for the test sample, given by:
$$
\mathrm{RTMSPE} = \sqrt{\frac{1}{h}\sum_{i=1}^{h}\left(e^2\right)_{[i:n_{test}]}},
$$
where $e^2 = \left((y_1 - x_1^{T}\hat\beta)^2, \ldots, (y_{n_{test}} - x_{n_{test}}^{T}\hat\beta)^2\right)$, $(e^2)_{[1:n_{test}]} \le \cdots \le (e^2)_{[n_{test}:n_{test}]}$ are the order statistics of $e^2$ and $h = \lfloor(n_{test}+1)\,0.99\rfloor$, $\lfloor(n_{test}+1)\,0.95\rfloor$ or $\lfloor(n_{test}+1)\,0.9\rfloor$. In this experiment, we used the estimate of sLTS as an initial point.
Table 7 shows that our method outperformed other comparative methods for the RTMSPE. Our method presented the smallest RTMSPE with the largest number of explanatory variables. It might seem that other methods gave a smaller number of explanatory variables than necessary.

7. Conclusions

We proposed a robust and sparse regression method based on the γ-divergence. We showed desirable robustness properties under both homogeneous and heterogeneous contamination. In particular, we presented the Pythagorean relation for the regression case, which was not shown in Kanamori and Fujisawa [11]. In most robust and sparse regression methods, it is difficult to obtain an efficient estimation algorithm, because the objective function is non-convex and non-differentiable. Nonetheless, we succeeded in proposing an efficient estimation algorithm, which has a monotone decreasing property of the objective function, by using the MM algorithm. The numerical experiments and real data analyses suggested that our method is superior to comparative robust and sparse linear regression methods in terms of both accuracy and computational cost. However, in the numerical experiments, a few TNR values were slightly below the best results. Therefore, if more sparsity of the coefficients is needed, other sparse penalties, e.g., the Smoothly Clipped Absolute Deviation (SCAD) [22] and the Minimax Concave Penalty (MCP) [23], can also be useful.

Acknowledgments

This work was supported by a Grant-in-Aid for Scientific Research of the Japan Society for the Promotion of Science.

Author Contributions

Takayuki Kawashima and Hironori Fujisawa contributed the theoretical analysis; Takayuki Kawashima performed the experiments; Takayuki Kawashima and Hironori Fujisawa wrote the paper. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Theorem 1.
For two non-negative functions r ( x , y ) and u ( x , y ) and probability density function g ( x ) , it follows from Hölder’s inequality that:
$$
\iint r(x,y)\, u(x,y)\, g(x)\,dx\,dy \le \left\{\iint r(x,y)^{\alpha}\, g(x)\,dx\,dy\right\}^{\frac{1}{\alpha}} \left\{\iint u(x,y)^{\beta}\, g(x)\,dx\,dy\right\}^{\frac{1}{\beta}},
$$
where $\alpha$ and $\beta$ are positive constants with $\frac{1}{\alpha}+\frac{1}{\beta}=1$. The equality holds if and only if $r(x,y)^{\alpha} = \tau\, u(x,y)^{\beta}$ for a positive constant $\tau$. Let $r(x,y) = g(y|x)$, $u(x,y) = f(y|x)^{\gamma}$, $\alpha = 1+\gamma$ and $\beta = \frac{1+\gamma}{\gamma}$. Then, it holds that:
$$
\iint g(y|x)\, f(y|x)^{\gamma}\, g(x)\,dx\,dy \le \left\{\iint g(y|x)^{1+\gamma}\, g(x)\,dx\,dy\right\}^{\frac{1}{1+\gamma}} \left\{\iint f(y|x)^{1+\gamma}\, g(x)\,dx\,dy\right\}^{\frac{\gamma}{1+\gamma}}.
$$
The equality holds if and only if $g(y|x)^{1+\gamma} = \tau\left(f(y|x)^{\gamma}\right)^{\frac{1+\gamma}{\gamma}}$, i.e., $g(y|x) = f(y|x)$, because $g(y|x)$ and $f(y|x)$ are conditional probability density functions. Properties (i) and (ii) follow from this inequality, the equality condition and the definition of $D_\gamma(g(y|x), f(y|x); g(x))$.
Let us prove Property (iii). Suppose that $\gamma$ is sufficiently small. Then, it holds that $f^{\gamma} = 1 + \gamma\log f + O(\gamma^2)$. The γ-divergence for regression is expressed by:
$$
\begin{aligned}
D_\gamma(g(y|x), f(y|x); g(x)) &= \frac{1}{\gamma(1+\gamma)}\log\iint g(y|x)\left(1+\gamma\log g(y|x)+O(\gamma^2)\right)dy\, g(x)\,dx\\
&\quad - \frac{1}{\gamma}\log\iint g(y|x)\left(1+\gamma\log f(y|x)+O(\gamma^2)\right)dy\, g(x)\,dx\\
&\quad + \frac{1}{1+\gamma}\log\iint f(y|x)\left(1+\gamma\log f(y|x)+O(\gamma^2)\right)dy\, g(x)\,dx\\
&= \frac{1}{\gamma(1+\gamma)}\log\left(1+\gamma\iint g(y|x)\log g(y|x)\,dy\, g(x)\,dx + O(\gamma^2)\right)\\
&\quad - \frac{1}{\gamma}\log\left(1+\gamma\iint g(y|x)\log f(y|x)\,dy\, g(x)\,dx + O(\gamma^2)\right)\\
&\quad + \frac{1}{1+\gamma}\log\left(1+\gamma\iint f(y|x)\log f(y|x)\,dy\, g(x)\,dx + O(\gamma^2)\right)\\
&= \frac{1}{1+\gamma}\iint g(y|x)\log g(y|x)\,dy\, g(x)\,dx - \iint g(y|x)\log f(y|x)\,dy\, g(x)\,dx + O(\gamma)\\
&= \int D_{KL}(g(y|x), f(y|x))\, g(x)\,dx + O(\gamma).
\end{aligned}
$$
 ☐
Proof of Theorem 2.
We see that:
$$
\begin{aligned}
\iint g(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx &= \iint \left\{(1-\epsilon)\, f(y|x;\theta^*) + \epsilon\,\delta(y|x)\right\} f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx\\
&= (1-\epsilon)\iint f(y|x;\theta^*)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx + \epsilon\iint \delta(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx.
\end{aligned}
$$
It follows from the assumption $\epsilon < \frac{1}{2}$ that:
$$
\left\{\epsilon\iint \delta(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx\right\}^{\frac{1}{\gamma}} < \left\{\frac{1}{2}\iint \delta(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx\right\}^{\frac{1}{\gamma}} < \left\{\iint \delta(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx\right\}^{\frac{1}{\gamma}} = \nu_{f_\theta,\gamma}.
$$
Hence,
$$
\iint g(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx = (1-\epsilon)\iint f(y|x;\theta^*)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx + O\!\left(\nu_{f_\theta,\gamma}^{\gamma}\right).
$$
Therefore, it holds that:
$$
\begin{aligned}
d_\gamma(g(y|x), f(y|x;\theta); g(x)) &= -\frac{1}{\gamma}\log\iint g(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx + \frac{1}{1+\gamma}\log\iint f(y|x;\theta)^{1+\gamma}\,dy\, g(x)\,dx\\
&= -\frac{1}{\gamma}\log\iint f(y|x;\theta^*)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx + \frac{1}{1+\gamma}\log\iint f(y|x;\theta)^{1+\gamma}\,dy\, g(x)\,dx\\
&\quad - \frac{1}{\gamma}\log(1-\epsilon) + O\!\left(\nu_{f_\theta,\gamma}^{\gamma}\right)\\
&= d_\gamma(f(y|x;\theta^*), f(y|x;\theta); g(x)) - \frac{1}{\gamma}\log(1-\epsilon) + O\!\left(\nu_{f_\theta,\gamma}^{\gamma}\right).
\end{aligned}
$$
Then, it follows that:
$$
\begin{aligned}
&D_\gamma(g(y|x), f(y|x;\theta); g(x)) - D_\gamma(g(y|x), f(y|x;\theta^*); g(x)) - D_\gamma(f(y|x;\theta^*), f(y|x;\theta); g(x))\\
&\quad = \left\{-d_\gamma(g(y|x), g(y|x); g(x)) + d_\gamma(g(y|x), f(y|x;\theta); g(x))\right\}\\
&\qquad - \left\{-d_\gamma(g(y|x), g(y|x); g(x)) + d_\gamma(g(y|x), f(y|x;\theta^*); g(x))\right\}\\
&\qquad - \left\{-d_\gamma(f(y|x;\theta^*), f(y|x;\theta^*); g(x)) + d_\gamma(f(y|x;\theta^*), f(y|x;\theta); g(x))\right\}\\
&\quad = \left\{d_\gamma(g(y|x), f(y|x;\theta); g(x)) - d_\gamma(f(y|x;\theta^*), f(y|x;\theta); g(x))\right\}\\
&\qquad - \left\{d_\gamma(g(y|x), f(y|x;\theta^*); g(x)) - d_\gamma(f(y|x;\theta^*), f(y|x;\theta^*); g(x))\right\}\\
&\quad = O\!\left(\nu^{\gamma}\right).
\end{aligned}
$$
 ☐
Proof of Theorem 3.
We see that:
$$
\iint g(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx = \iint f(y|x;\theta^*)\, f(y|x;\theta)^{\gamma}\,dy\, (1-\epsilon(x))\, g(x)\,dx + \iint \delta(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, \epsilon(x)\, g(x)\,dx.
$$
It follows from the assumption $\epsilon(x) < \frac{1}{2}$ that:
$$
\left\{\iint \delta(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, \epsilon(x)\, g(x)\,dx\right\}^{\frac{1}{\gamma}} < \left\{\iint \delta(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, \frac{g(x)}{2}\,dx\right\}^{\frac{1}{\gamma}} < \left\{\iint \delta(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx\right\}^{\frac{1}{\gamma}} = \nu_{f_\theta,\gamma}.
$$
Hence,
$$
\iint g(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx = \iint f(y|x;\theta^*)\, f(y|x;\theta)^{\gamma}\,dy\, (1-\epsilon(x))\, g(x)\,dx + O\!\left(\nu_{f_\theta,\gamma}^{\gamma}\right).
$$
Therefore, it holds that:
$$
\begin{aligned}
d_\gamma(g(y|x), f(y|x;\theta); g(x)) &= -\frac{1}{\gamma}\log\iint g(y|x)\, f(y|x;\theta)^{\gamma}\,dy\, g(x)\,dx + \frac{1}{1+\gamma}\log\iint f(y|x;\theta)^{1+\gamma}\,dy\, g(x)\,dx\\
&= -\frac{1}{\gamma}\log\left\{\iint f(y|x;\theta^*)\, f(y|x;\theta)^{\gamma}\,dy\, (1-\epsilon(x))\, g(x)\,dx + O\!\left(\nu_{f_\theta,\gamma}^{\gamma}\right)\right\}\\
&\quad + \frac{1}{1+\gamma}\log\iint f(y|x;\theta)^{1+\gamma}\,dy\, g(x)\,dx\\
&= d_\gamma(f(y|x;\theta^*), f(y|x;\theta); (1-\epsilon(x))\, g(x)) + O\!\left(\nu_{f_\theta,\gamma}^{\gamma}\right)\\
&\quad - \frac{1}{1+\gamma}\log\iint f(y|x;\theta)^{1+\gamma}\,dy\, (1-\epsilon(x))\, g(x)\,dx + \frac{1}{1+\gamma}\log\iint f(y|x;\theta)^{1+\gamma}\,dy\, g(x)\,dx\\
&= d_\gamma(f(y|x;\theta^*), f(y|x;\theta); (1-\epsilon(x))\, g(x)) + O\!\left(\nu_{f_\theta,\gamma}^{\gamma}\right) - \frac{1}{1+\gamma}\log\int (1-\epsilon(x))\, g(x)\,dx.
\end{aligned}
$$
Then, it follows that:
$$
\begin{aligned}
&D_\gamma(g(y|x), f(y|x;\theta); g(x)) - D_\gamma(g(y|x), f(y|x;\theta^*); g(x)) - D_\gamma(f(y|x;\theta^*), f(y|x;\theta); (1-\epsilon(x))\, g(x))\\
&\quad = \left\{-d_\gamma(g(y|x), g(y|x); g(x)) + d_\gamma(g(y|x), f(y|x;\theta); g(x))\right\}\\
&\qquad - \left\{-d_\gamma(g(y|x), g(y|x); g(x)) + d_\gamma(g(y|x), f(y|x;\theta^*); g(x))\right\}\\
&\qquad - \left\{-d_\gamma(f(y|x;\theta^*), f(y|x;\theta^*); (1-\epsilon(x))\, g(x)) + d_\gamma(f(y|x;\theta^*), f(y|x;\theta); (1-\epsilon(x))\, g(x))\right\}\\
&\quad = \left\{d_\gamma(g(y|x), f(y|x;\theta); g(x)) - d_\gamma(f(y|x;\theta^*), f(y|x;\theta); (1-\epsilon(x))\, g(x))\right\}\\
&\qquad - \left\{d_\gamma(g(y|x), f(y|x;\theta^*); g(x)) - d_\gamma(f(y|x;\theta^*), f(y|x;\theta^*); (1-\epsilon(x))\, g(x))\right\}\\
&\quad = O\!\left(\nu^{\gamma}\right).
\end{aligned}
$$
 ☐

References

  1. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
  2. Khan, J.A.; Van Aelst, S.; Zamar, R.H. Robust linear model selection based on least angle regression. J. Am. Stat. Assoc. 2007, 102, 1289–1299.
  3. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
  4. Alfons, A.; Croux, C.; Gelper, S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat. 2013, 7, 226–248.
  5. Rousseeuw, P.J. Least Median of Squares Regression. J. Am. Stat. Assoc. 1984, 79, 871–880.
  6. Windham, M.P. Robustifying model fitting. J. R. Stat. Soc. Ser. B 1995, 57, 599–609.
  7. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559.
  8. Jones, M.C.; Hjort, N.L.; Harris, I.R.; Basu, A. A Comparison of related density-based minimum divergence estimators. Biometrika 2001, 88, 865–873.
  9. Fujisawa, H.; Eguchi, S. Robust Parameter Estimation with a Small Bias Against Heavy Contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
  10. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; CRC Press: Boca Raton, FL, USA, 2011.
  11. Kanamori, T.; Fujisawa, H. Robust estimation under heavy contamination using unnormalized models. Biometrika 2015, 102, 559–572.
  12. Cichocki, A.; Cruces, S.; Amari, S.I. Generalized Alpha-Beta Divergences and Their Application to Robust Nonnegative Matrix Factorization. Entropy 2011, 13, 134–170.
  13. Samek, W.; Blythe, D.; Müller, K.R.; Kawanabe, M. Robust Spatial Filtering with Beta Divergence. In Advances in Neural Information Processing Systems 26; Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2013; pp. 1007–1015.
  14. Hunter, D.R.; Lange, K. A tutorial on MM algorithms. Am. Stat. 2004, 58, 30–37.
  15. Hirose, K.; Fujisawa, H. Robust sparse Gaussian graphical modeling. J. Multivar. Anal. 2017, 161, 172–190.
  16. Zou, H.; Hastie, T. Regularization and variable selection via the Elastic Net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320.
  17. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 2006, 68, 49–67.
  18. Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K. Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B 2005, 67, 91–108.
  19. Friedman, J.; Hastie, T.; Höfling, H.; Tibshirani, R. Pathwise coordinate optimization. Ann. Appl. Stat. 2007, 1, 302–332.
  20. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction; Springer: New York, NY, USA, 2010.
  21. Maronna, R.A.; Martin, D.R.; Yohai, V.J. Robust Statistics: Theory and Methods; John Wiley and Sons: Hoboken, NJ, USA, 2006.
  22. Fan, J.; Li, R. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
  23. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
Figure 1. CPU times (in seconds).
Table 1. Outlier Pattern (a) with p = 100, 200, ε = 0.1 and ρ = 0.2, 0.5. RMSPE, root mean squared prediction error; RLARS, robust least angle regression; sLTS, sparse least trimmed squares.

p = 100, ε = 0.1; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 3.04 | 9.72 × 10^-2 | 0.936 | 0.909 | 3.1 | 1.05 × 10^-1 | 0.952 | 0.918 |
| RLARS | 0.806 | 6.46 × 10^-3 | 0.936 | 0.949 | 0.718 | 6.7 × 10^-3 | 0.944 | 0.962 |
| sLTS (α = 0.85, 80 grids) | 0.626 | 1.34 × 10^-3 | 1.0 | 0.964 | 0.599 | 1.05 × 10^-3 | 1.0 | 0.966 |
| sLTS (α = 0.75, 80 grids) | 0.651 | 1.71 × 10^-3 | 1.0 | 0.961 | 0.623 | 1.33 × 10^-3 | 1.0 | 0.961 |
| sLTS (α = 0.65, 80 grids) | 0.685 | 2.31 × 10^-3 | 1.0 | 0.957 | 0.668 | 1.76 × 10^-3 | 1.0 | 0.961 |
| sparse γ-linear reg (γ = 0.1) | 0.557 | 6.71 × 10^-4 | 1.0 | 0.966 | 0.561 | 6.99 × 10^-4 | 1.0 | 0.965 |
| sparse γ-linear reg (γ = 0.5) | 0.575 | 8.25 × 10^-4 | 1.0 | 0.961 | 0.573 | 9.05 × 10^-4 | 1.0 | 0.959 |

p = 200, ε = 0.1; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 3.55 | 6.28 × 10^-2 | 0.904 | 0.956 | 3.37 | 6.08 × 10^-2 | 0.928 | 0.961 |
| RLARS | 0.88 | 3.8 × 10^-3 | 0.904 | 0.977 | 0.843 | 4.46 × 10^-3 | 0.9 | 0.986 |
| sLTS (α = 0.85, 80 grids) | 0.631 | 7.48 × 10^-4 | 1.0 | 0.972 | 0.614 | 5.77 × 10^-4 | 1.0 | 0.976 |
| sLTS (α = 0.75, 80 grids) | 0.677 | 1.03 × 10^-3 | 1.0 | 0.966 | 0.632 | 7.08 × 10^-4 | 1.0 | 0.973 |
| sLTS (α = 0.65, 80 grids) | 0.823 | 2.34 × 10^-3 | 0.998 | 0.96 | 0.7 | 1.25 × 10^-3 | 1.0 | 0.967 |
| sparse γ-linear reg (γ = 0.1) | 0.58 | 4.19 × 10^-4 | 1.0 | 0.981 | 0.557 | 3.71 × 10^-4 | 1.0 | 0.977 |
| sparse γ-linear reg (γ = 0.5) | 0.589 | 5.15 × 10^-4 | 1.0 | 0.979 | 0.586 | 5.13 × 10^-4 | 1.0 | 0.977 |
Table 2. Outlier Pattern (a) with p = 100, 200, ε = 0.3 and ρ = 0.2, 0.5.

p = 100, ε = 0.3; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 8.07 | 6.72 × 10^-1 | 0.806 | 0.903 | 8.1 | 3.32 × 10^-1 | 0.8 | 0.952 |
| RLARS | 2.65 | 1.54 × 10^-1 | 0.75 | 0.963 | 2.09 | 1.17 × 10^-1 | 0.812 | 0.966 |
| sLTS (α = 0.75, 80 grids) | 10.4 | 2.08 | 0.886 | 0.709 | 11.7 | 2.36 | 0.854 | 0.67 |
| sLTS (α = 0.65, 80 grids) | 2.12 | 3.66 × 10^-1 | 0.972 | 0.899 | 2.89 | 5.13 × 10^-1 | 0.966 | 0.887 |
| sLTS (α = 0.5, 80 grids) | 1.37 | 1.46 × 10^-1 | 0.984 | 0.896 | 1.53 | 1.97 × 10^-1 | 0.976 | 0.909 |
| sparse γ-linear reg (γ = 0.1) | 1.13 | 9.16 × 10^-2 | 0.964 | 0.97 | 0.961 | 5.38 × 10^-2 | 0.982 | 0.977 |
| sparse γ-linear reg (γ = 0.5) | 1.28 | 1.5 × 10^-1 | 0.986 | 0.952 | 1.00 | 8.48 × 10^-2 | 0.988 | 0.958 |

p = 200, ε = 0.3; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 8.11 | 3.4 × 10^-1 | 0.77 | 0.951 | 8.02 | 6.51 × 10^-1 | 0.81 | 0.91 |
| RLARS | 3.6 | 1.7 × 10^-1 | 0.71 | 0.978 | 2.67 | 1.02 × 10^-1 | 0.76 | 0.984 |
| sLTS (α = 0.75, 80 grids) | 11.5 | 1.16 | 0.738 | 0.809 | 11.9 | 1.17 | 0.78 | 0.811 |
| sLTS (α = 0.65, 80 grids) | 3.34 | 3.01 × 10^-1 | 0.94 | 0.929 | 4.22 | 4.08 × 10^-1 | 0.928 | 0.924 |
| sLTS (α = 0.5, 80 grids) | 4.02 | 3.33 × 10^-1 | 0.892 | 0.903 | 4.94 | 4.44 × 10^-1 | 0.842 | 0.909 |
| sparse γ-linear reg (γ = 0.1) | 2.03 | 1.45 × 10^-1 | 0.964 | 0.924 | 3.2 | 2.86 × 10^-1 | 0.94 | 0.936 |
| sparse γ-linear reg (γ = 0.5) | 1.23 | 7.69 × 10^-2 | 0.988 | 0.942 | 3.13 | 2.98 × 10^-1 | 0.944 | 0.94 |
Table 3. Outlier Pattern (b) with p = 100, 200, ε = 0.1 and ρ = 0.2, 0.5.

p = 100, ε = 0.1; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 2.48 | 5.31 × 10^-2 | 0.982 | 0.518 | 2.84 | 5.91 × 10^-2 | 0.98 | 0.565 |
| RLARS | 0.85 | 6.58 × 10^-3 | 0.93 | 0.827 | 0.829 | 7.97 × 10^-3 | 0.91 | 0.885 |
| sLTS (α = 0.85, 80 grids) | 0.734 | 5.21 × 10^-3 | 0.998 | 0.964 | 0.684 | 3.76 × 10^-3 | 1.0 | 0.961 |
| sLTS (α = 0.75, 80 grids) | 0.66 | 1.78 × 10^-3 | 1.0 | 0.975 | 0.648 | 1.59 × 10^-3 | 1.0 | 0.961 |
| sLTS (α = 0.65, 80 grids) | 0.734 | 2.9 × 10^-3 | 1.0 | 0.96 | 0.66 | 1.74 × 10^-3 | 1.0 | 0.962 |
| sparse γ-linear reg (γ = 0.1) | 0.577 | 8.54 × 10^-4 | 1.0 | 0.894 | 0.545 | 5.44 × 10^-4 | 1.0 | 0.975 |
| sparse γ-linear reg (γ = 0.5) | 0.581 | 7.96 × 10^-4 | 1.0 | 0.971 | 0.546 | 5.95 × 10^-4 | 1.0 | 0.977 |

p = 200, ε = 0.1; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 2.39 | 2.57 × 10^-2 | 0.988 | 0.696 | 2.57 | 2.54 × 10^-2 | 0.944 | 0.706 |
| RLARS | 1.01 | 5.44 × 10^-3 | 0.896 | 0.923 | 0.877 | 4.82 × 10^-3 | 0.898 | 0.94 |
| sLTS (α = 0.85, 80 grids) | 0.708 | 1.91 × 10^-3 | 1.0 | 0.975 | 0.790 | 3.40 × 10^-3 | 0.994 | 0.97 |
| sLTS (α = 0.75, 80 grids) | 0.683 | 1.06 × 10^-4 | 1.0 | 0.975 | 0.635 | 7.40 × 10^-4 | 1.0 | 0.977 |
| sLTS (α = 0.65, 80 grids) | 1.11 | 1.13 × 10^-2 | 0.984 | 0.956 | 0.768 | 2.60 × 10^-3 | 0.998 | 0.968 |
| sparse γ-linear reg (γ = 0.1) | 0.603 | 5.71 × 10^-4 | 1.0 | 0.924 | 0.563 | 3.78 × 10^-3 | 1.0 | 0.979 |
| sparse γ-linear reg (γ = 0.5) | 0.592 | 5.04 × 10^-4 | 1.0 | 0.982 | 0.566 | 4.05 × 10^-3 | 1.0 | 0.981 |
Table 4. Outlier Pattern (b) with p = 100, 200, ε = 0.3 and ρ = 0.2, 0.5.

p = 100, ε = 0.3; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 2.81 | 6.88 × 10^-2 | 0.956 | 0.567 | 3.13 | 7.11 × 10^-2 | 0.97 | 0.584 |
| RLARS | 2.70 | 7.69 × 10^-2 | 0.872 | 0.789 | 2.22 | 6.1 × 10^-2 | 0.852 | 0.855 |
| sLTS (α = 0.75, 80 grids) | 3.99 | 1.57 × 10^-1 | 0.856 | 0.757 | 4.18 | 1.54 × 10^-1 | 0.878 | 0.771 |
| sLTS (α = 0.65, 80 grids) | 3.2 | 1.46 × 10^-1 | 0.888 | 0.854 | 2.69 | 1.08 × 10^-1 | 0.922 | 0.867 |
| sLTS (α = 0.5, 80 grids) | 6.51 | 4.62 × 10^-1 | 0.77 | 0.772 | 7.14 | 5.11 × 10^-1 | 0.844 | 0.778 |
| sparse γ-linear reg (γ = 0.1) | 1.75 | 3.89 × 10^-2 | 0.974 | 0.725 | 1.47 | 2.66 × 10^-2 | 0.976 | 0.865 |
| sparse γ-linear reg (γ = 0.5) | 1.68 | 3.44 × 10^-2 | 0.98 | 0.782 | 1.65 | 3.58 × 10^-2 | 0.974 | 0.863 |

p = 200, ε = 0.3; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 2.71 | 3.32 × 10^-2 | 0.964 | 0.734 | 2.86 | 3.05 × 10^-2 | 0.974 | 0.728 |
| RLARS | 3.03 | 4.59 × 10^-2 | 0.844 | 0.876 | 2.85 | 4.33 × 10^-2 | 0.862 | 0.896 |
| sLTS (α = 0.75, 80 grids) | 3.73 | 7.95 × 10^-2 | 0.864 | 0.872 | 4.20 | 8.17 × 10^-2 | 0.878 | 0.87 |
| sLTS (α = 0.65, 80 grids) | 4.45 | 1.23 × 10^-1 | 0.85 | 0.886 | 3.61 | 8.95 × 10^-2 | 0.904 | 0.908 |
| sLTS (α = 0.5, 80 grids) | 9.05 | 4.24 × 10^-1 | 0.66 | 0.853 | 8.63 | 3.73 × 10^-1 | 0.748 | 0.864 |
| sparse γ-linear reg (γ = 0.1) | 1.78 | 1.62 × 10^-2 | 0.994 | 0.731 | 1.82 | 1.62 × 10^-2 | 0.988 | 0.844 |
| sparse γ-linear reg (γ = 0.5) | 1.79 | 1.69 × 10^-2 | 0.988 | 0.79 | 1.77 | 1.51 × 10^-2 | 0.996 | 0.77 |
Table 5. No contamination case with p = 100, 200, ε = 0 and ρ = 0.2, 0.5.

p = 100, ε = 0; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 0.621 | 1.34 × 10^-3 | 1.0 | 0.987 | 0.621 | 1.12 × 10^-3 | 1.0 | 0.987 |
| RLARS | 0.551 | 7.15 × 10^-4 | 0.996 | 0.969 | 0.543 | 6.74 × 10^-4 | 0.996 | 0.971 |
| sLTS (α = 0.75, 40 grids) | 0.954 | 4.47 × 10^-3 | 1.0 | 0.996 | 0.899 | 4.53 × 10^-3 | 1.0 | 0.993 |
| sparse γ-linear reg (γ = 0.1) | 0.564 | 7.27 × 10^-4 | 1.0 | 0.878 | 0.565 | 6.59 × 10^-4 | 1.0 | 0.908 |
| sparse γ-linear reg (γ = 0.5) | 0.59 | 1.0 × 10^-3 | 1.0 | 0.923 | 0.584 | 8.47 × 10^-4 | 1.0 | 0.94 |

p = 200, ε = 0; the first four columns are for ρ = 0.2 and the last four for ρ = 0.5.

| Methods | RMSPE | MSE | TPR | TNR | RMSPE | MSE | TPR | TNR |
|---|---|---|---|---|---|---|---|---|
| Lasso | 0.635 | 7.18 × 10^-4 | 1.0 | 0.992 | 0.624 | 6.17 × 10^-4 | 1.0 | 0.991 |
| RLARS | 0.55 | 3.63 × 10^-4 | 0.994 | 0.983 | 0.544 | 3.48 × 10^-4 | 0.996 | 0.985 |
| sLTS (α = 0.75, 40 grids) | 1.01 | 3.76 × 10^-3 | 1.0 | 0.996 | 0.909 | 2.47 × 10^-3 | 1.0 | 0.996 |
| sparse γ-linear reg (γ = 0.1) | 0.584 | 4.45 × 10^-4 | 1.0 | 0.935 | 0.573 | 3.99 × 10^-4 | 1.0 | 0.938 |
| sparse γ-linear reg (γ = 0.5) | 0.621 | 6.55 × 10^-4 | 1.0 | 0.967 | 0.602 | 5.58 × 10^-4 | 1.0 | 0.966 |
Table 6. Root trimmed mean squared prediction error (RTMSPE) for protein expressions based on the KRT18 antibody (NCI-60 cancer cell panel data), computed from leave-one-out cross-validation.

| Methods | RTMSPE | Selected Variables ¹ |
|---|---|---|
| Lasso | 1.058 | 52 |
| RLARS | 0.936 | 18 |
| sLTS | 0.721 | 33 |
| Our method (γ = 0.1) | 0.679 | 29 |
| Our method (γ = 0.5) | 0.700 | 30 |

¹ This means the number of non-zero elements.
Table 7. Root trimmed mean squared prediction error in the protein test set, for trimming fractions of 1%, 5% and 10%.

| Methods | RTMSPE (1%) | RTMSPE (5%) | RTMSPE (10%) | Selected Variables ¹ |
|---|---|---|---|---|
| Lasso | 10.697 | 9.66 | 8.729 | 22 |
| RLARS | 10.473 | 9.435 | 8.527 | 27 |
| sLTS | 10.614 | 9.52 | 8.575 | 21 |
| Our method (γ = 0.1) | 10.461 | 9.403 | 8.481 | 44 |
| Our method (γ = 0.5) | 10.463 | 9.369 | 8.419 | 42 |

¹ This means the number of non-zero elements.
