Article

Adaptive L0 Regularization for Sparse Support Vector Regression

Antonis Christou and Andreas Artemiou
School of Mathematics, Cardiff University, Cardiff CF24 4AG, UK
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(13), 2808; https://doi.org/10.3390/math11132808
Submission received: 16 May 2023 / Revised: 12 June 2023 / Accepted: 20 June 2023 / Published: 22 June 2023
(This article belongs to the Special Issue Statistical Methods for High-Dimensional and Massive Datasets)

Abstract

In this work, we proposed a sparse version of the Support Vector Regression (SVR) algorithm that uses regularization to achieve sparsity in function estimation. To achieve this, we used an adaptive L0 penalty that has a ridge structure and, therefore, does not introduce additional computational complexity to the algorithm. In addition, we considered an alternative approach based on a similar proposal in the Support Vector Machine (SVM) literature. Through numerical studies, we demonstrated the effectiveness of our proposals. To the best of our knowledge, this is the first discussion of a sparse version of Support Vector Regression in terms of variable selection rather than support vector selection.

1. Introduction

The Support Vector Machine (SVM), introduced by Cortes and Vapnik (1995) [1], is probably the most popular classification algorithm in the literature. A very popular variant of SVM is Support Vector Regression (SVR), which is used for function estimation, for example, in a regression setting, and can serve as an alternative to Ordinary Least Squares (OLS) estimation. The main advantage of SVR is the use of a piecewise linear loss function, which diminishes the effect of outliers on function estimation (compared to their effect on OLS estimation) and leads to a robust estimate of the regression function.
As SVR was introduced alongside SVM, their developments have been similar; many of the extensions introduced in the classification framework for SVM have also been discussed in the function estimation framework for SVR. Although sparsity in terms of variable regularization has been discussed extensively in the SVM literature, there seems to be a lack of literature on sparse SVR in this sense. There are suggestions of two-step procedures in which variable selection occurs before the application of SVR, so that only a reduced number of variables is used to fit the model (see, for example, Wang et al. (2006) [2]), but we have not come across literature that demonstrates the advantages of using regularization within SVR itself. Here, it is important to note that, in the SVR literature, we came across a different type of sparsity, where researchers refer to the sparsity of the support vectors, i.e., the number of points that affect the construction of the optimal separating hyperplane (see, for example, Ertin and Potter (2005) [3]), rather than the regularization of the variables.
Sparsity, in terms of variable selection, is very common in the statistics literature. It is usually introduced through penalization or regularization and, although there are a number of methods, we briefly discuss the most well-known methodology. L2 (or ridge) regularization is computationally fast but does not really achieve sparsity. There is also L0 regularization, as well as LASSO (Tibshirani (1996) [4]) or L1 regularization, which are computationally more expensive but more effective than L2 in achieving sparsity. Furthermore, there are combinations and refinements of these classic regularization methods, such as the elastic net, which combines L1 and L2 regularization (see Zou and Hastie (2005) [5]), and SCAD (see Fan and Li (2001) [6]). This is just a very small sample of the methodology in the literature. Other algorithms also exist but, in all cases, there is a trade-off between accuracy and computational cost.
In this work, we combined the classic SVR algorithm with a relatively recently introduced penalization procedure known as the adaptive L0 penalty (Frommlet and Nuel (2016) [7]). This penalty has the advantage of achieving similar regularization without the computational complexity of the classic L0 penalty. In the next section, we present an overview of the fundamental concepts used in our methodology, that is, the SVR algorithm and the adaptive L0 penalty. In Section 3, we discuss the mathematical development of the adaptive L0 penalized SVR algorithm and also provide an alternative approach based on a similar penalty introduced in the SVM literature. Section 4 presents the results of the numerical experiments, and we close the paper with a discussion.

2. Fundamental Concepts

In this section, we discuss the general ideas behind SVR and the adaptive L0 penalty, which are the major elements we used in our algorithm.

2.1. SVR

As mentioned above, SVR is a variant of SVM used for function estimation. Before we start the discussion on SVR, we emphasize that we have changed the classic SVR notation in this paper to reflect the fact that the optimal vector estimates the regression coefficients. Therefore, instead of denoting the normal vector with w, which is the usual notation in the literature, we denote it with β. The main idea is that a bilinear function is used, which provides a band in which all the residuals in a regression framework should lie. Mathematically, this means that if y is the output and f(x) is the function we are trying to estimate, then one defines the residuals r(x, y) = y − f(x) and tries to find the function f(x) such that all the absolute residuals are at most ϵ > 0, that is, |r(x, y)| ≤ ϵ. To help visualize this, there is a band of radius ϵ around f(x) within which all the outputs y lie. In the SVR literature, ϵ is also called the margin. As one can change the value of the margin, there are many functions f(x) that satisfy the condition |r(x, y)| ≤ ϵ. At the same time, our purpose in SVR is to find the f(x) with the maximum generalization ability. This is achieved by maximizing ϵ or, equivalently, by minimizing ‖β‖², where β is the coefficient vector in f(x) = β^T x + b_0 and b_0 is the offset. One can view f(x) as the equation of the hyperplane that passes through the middle of the band containing all of the outputs y.
Let us assume we have n pairs of datapoints (x_i, y_i), where i = 1, …, n. In an ideal world, where all the points lie within ϵ distance of f(x), we have the hard-margin optimization, which is:
\min_{\beta, b_0} \ \frac{1}{2}\|\beta\|^2 \quad \text{subject to:} \quad y_i - \beta^T x_i - b_0 \le \epsilon, \quad \beta^T x_i + b_0 - y_i \le \epsilon, \qquad i = 1, \dots, n.
In reality, though, the optimal f ( x ) might be one that has some of its residuals bigger than ϵ or, in other words, one that allows for points to lie beyond the margin. In that case, one needs to solve the soft-margin optimization as follows:
\min_{\beta, b_0, \xi, \xi^*} \ \frac{1}{2}\|\beta\|^2 + \lambda \sum_{i=1}^{n} (\xi_i + \xi_i^*) \quad \text{subject to:} \quad y_i - \beta^T x_i - b_0 \le \epsilon + \xi_i, \quad \beta^T x_i + b_0 - y_i \le \epsilon + \xi_i^*, \quad \xi_i \ge 0, \ \xi_i^* \ge 0, \qquad i = 1, \dots, n \qquad (1)
where λ is the margin trade-off parameter, i.e., a value that controls the trade-off between having a large margin and the total distance by which points are allowed to fall outside the margin, and ξ_i and ξ_i* are slack variables that measure how far outside the margin a point lies or, mathematically, ξ_i = max{0, r(x_i, y_i) − ϵ} and ξ_i* = max{0, −r(x_i, y_i) − ϵ}.
To solve the above soft-margin optimization, one uses the Lagrangian approach, then finds the KKT equations, and finally uses quadratic programming optimization. This procedure is standard in the SVM and SVR literature. First, one needs to find the Lagrangian, which we express below in matrix form to simplify the notation:
L(\beta, b_0, \xi, \xi^*, \alpha, \alpha^*, \eta, \eta^*) = \frac{1}{2}\beta^T\beta + \lambda \mathbf{1}_n^T(\xi + \xi^*) - \alpha^T(\epsilon_n + \xi - y + X\beta + b_n) - (\alpha^*)^T(\epsilon_n + \xi^* + y - X\beta - b_n) - \eta^T\xi - (\eta^*)^T\xi^* \qquad (2)
where ξ = (ξ_1, …, ξ_n)^T and ξ* = (ξ_1*, …, ξ_n*)^T are the vectors of slack variables; α = (α_1, …, α_n)^T, α* = (α_1*, …, α_n*)^T, η = (η_1, …, η_n)^T, and η* = (η_1*, …, η_n*)^T are the Lagrangian multipliers; y = (y_1, …, y_n)^T denotes the response vector; X is the n × p predictor matrix whose i th row is x_i^T, the predictor vector of the i th observation; 1_n ∈ R^n is the n-dimensional vector with all entries equal to one; b_n ∈ R^n is the n-dimensional vector with all entries equal to b_0; and ϵ_n ∈ R^n is the n-dimensional vector with all entries equal to ϵ.
To find the solution to the Lagrangian in (2), one needs to take the partial derivatives with respect to β, b_0, ξ, and ξ* and set them equal to zero. Therefore, we have:
\frac{\partial L}{\partial \beta} = \beta - X^T(\alpha - \alpha^*) = 0 \ \Rightarrow \ \beta = X^T(\alpha - \alpha^*), \qquad \frac{\partial L}{\partial b_0} = \mathbf{1}_n^T(\alpha^* - \alpha) = 0, \qquad \frac{\partial L}{\partial \xi} = \lambda \mathbf{1}_n - \alpha - \eta = 0, \qquad \frac{\partial L}{\partial \xi^*} = \lambda \mathbf{1}_n - \alpha^* - \eta^* = 0 \qquad (3)
Substituting the equations of the four derivatives in (3) into the Lagrangian in (2) gives the following dual optimization problem where one tries to maximize:
Q(\alpha, \alpha^*) = -\frac{1}{2}(\alpha - \alpha^*)^T X X^T (\alpha - \alpha^*) - \epsilon_n^T(\alpha + \alpha^*) + y^T(\alpha - \alpha^*) \qquad (4)
subject to the following conditions:
\mathbf{1}_n^T(\alpha - \alpha^*) = 0, \qquad \mathbf{0}_n \le \alpha \le \lambda \mathbf{1}_n, \qquad \mathbf{0}_n \le \alpha^* \le \lambda \mathbf{1}_n
Note that the optimization problem in (4) finds the values of (α − α*) and, therefore, allows us to estimate β from the first equation in (3), i.e., β = X^T(α − α*). Here, we emphasize that there are a number of variants of SVM that can be extended to the SVR framework, and their solutions are found using a procedure similar to the above.
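To make the above concrete, the following short sketch (ours, not from the paper) fits a linear ϵ-insensitive SVR with an off-the-shelf solver and reads off the estimates of β and b_0; we use scikit-learn's LinearSVR, whose parameter C plays the role of the cost parameter λ above, and the simulated data are purely illustrative.
```python
# Illustrative sketch (not from the paper): fit a linear epsilon-insensitive SVR
# with an off-the-shelf solver and read off beta and b_0. scikit-learn's C plays
# the role of the cost parameter lambda in the soft-margin problem (1).
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
n, p = 100, 5
beta_true = np.array([1.0, 0.5, 0.0, 0.0, 0.0])            # only two active predictors
X = rng.standard_normal((n, p))
y = X @ beta_true + 0.1 * rng.standard_normal(n)

svr = LinearSVR(C=1.0, epsilon=0.1, max_iter=10_000).fit(X, y)
beta_hat, b0_hat = svr.coef_, svr.intercept_               # estimates of beta and b_0
print(np.round(beta_hat, 3), np.round(b0_hat, 3))
```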

2.2. L0 Penalty

In this section, we describe the adaptive L0 penalty, which was introduced by Frommlet and Nuel (2016) [7] for the linear regression model. The main idea of the adaptive L0 penalty is to find a differentiable approximation of the L0 penalty. This gives a penalty that achieves regularization, that is, it reduces the coefficients of the irrelevant variables to zero (a very well-known property of the L0 penalty), without the computational complexity the L0 penalty introduces. It is also called a ridge-type L0 penalty.
First, let us discuss the L0 penalty. The L0 penalty is the most explicit penalty one can use, in the sense that it penalizes the number of nonzero coefficients among the variables used in the function estimation. Therefore, it takes the form ‖β‖_0 = Σ_{i=1}^{p} 1(β_i ≠ 0), where 1(·) denotes the indicator function. Frommlet and Nuel (2016) [7] suggest roughly (we only roughly state the suggestion here and discuss the details in later sections) replacing this penalty with the quadratic form β^T Λ β, where Λ is a p × p diagonal matrix whose (i, i)th element is Λ(i, i) = 1/β_i². One can see that, for each j, when β_j = 0 then β_j Λ(j, j) β_j = 0 and, when β_j ≠ 0, then β_j Λ(j, j) β_j = 1. Putting these together, one can show that β^T Λ β = Σ_{i=1}^{p} 1(β_i ≠ 0) = ‖β‖_0.
The idea here is to use the above development to obtain a sparse β vector after iteratively applying the penalization. Therefore, for the adaptive L0 penalty, we try to minimize:
F_{\lambda, w} = f(\beta) + \lambda \sum_{i=1}^{p} w_i \beta_i^2 \qquad (5)
where λ > 0 is a constant that forces more sparsity as it becomes larger, f(·) is the main objective function we are trying to minimize to obtain the coefficients β (for example, in Ordinary Least Squares regression, this is ‖y − Xβ‖²), and w_i is the i th weight. Frommlet and Nuel (2016) [7] suggest using the weight:
w_i = \left( |\beta_i|^{\gamma} + \delta^{\gamma} \right)^{\frac{q-2}{\gamma}}
where γ > 0 is a constant that controls the quality of the approximation of the function F_{λ,w} in (5) and, to maintain the approximation, is set to γ = 2; δ > 0 is a constant that calibrates the size of a coefficient that is considered significant (and ensures we do not divide by zero when β_i = 0) and is set to δ = 10^{−5}; and q = 0, which corresponds to the L0 penalty. Here, we note that, for example, q = 1 would have given an approximated LASSO penalty. Finally, we note that we need an iterative procedure and, therefore, the penalty term:
\sum_{i=1}^{p} w_i \beta_i^2
is actually written as:
\sum_{i=1}^{p} w_i^{(k-1)} \big( \beta_i^{(k)} \big)^2
where the superscript (k) denotes the k th iteration. Therefore, we use β_i^{(k−1)} to estimate w_i^{(k−1)}, which we use as a weight to estimate β_i^{(k)}, and then we repeat the process until convergence. As a starting point, we set all the weights equal to one, which is equivalent to running the classic SVR algorithm without regularization.
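As a quick numerical check (our sketch, not from the paper), with γ = 2, δ = 10^{−5} and q = 0 the weight reduces to w_i = (β_i² + δ²)^{−1}, so that w_i β_i² is approximately one for a nonzero coefficient and approximately zero for a zero coefficient; the penalty effectively counts nonzero coefficients:
```python
# Sketch (not from the paper) of the adaptive L0 weight with gamma = 2,
# delta = 1e-5 and q = 0, i.e. w_i = (beta_i^2 + delta^2)^(-1).
import numpy as np

def adaptive_l0_weights(beta, gamma=2.0, delta=1e-5, q=0.0):
    """w_i = (|beta_i|^gamma + delta^gamma)^((q - 2) / gamma)."""
    return (np.abs(beta) ** gamma + delta ** gamma) ** ((q - 2.0) / gamma)

beta = np.array([1.5, -0.3, 0.0])
w = adaptive_l0_weights(beta)
print(np.round(w * beta ** 2, 4))   # approx. [1, 1, 0]: the penalty counts nonzeros
```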

3. Adaptive L0 Support Vector Regression

In this section, we present the main method introduced in this paper, in which we obtain sparse function estimation in SVR by applying the adaptive L0 penalty. This changes the Lagrangian presented in Equation (2). We could have added a subscript L0 to the quantities involved to distinguish this optimization problem from the one presented for SVR but, to keep things simple (as the development is relatively clear), we omit it throughout this section.
The main idea comes from combining the adaptive L0 penalty of Frommlet and Nuel (2016) [7] with the SVR optimization. Therefore, the new optimization problem takes the following form, expressed in matrix notation:
\min_{\beta, b_0, \xi, \xi^*} \ \frac{1}{2}\beta^T \Lambda \beta + \lambda \mathbf{1}_n^T(\xi + \xi^*) \quad \text{subject to:} \quad y - X\beta - b_n \le \epsilon_n + \xi, \quad X\beta + b_n - y \le \epsilon_n + \xi^*, \quad \xi \ge \mathbf{0}_n, \ \xi^* \ge \mathbf{0}_n \qquad (6)
Similarly to the SVM and SVR literature, we have the following Lagrangian:
L_0(\beta, b_0, \Lambda, \xi, \xi^*, \alpha, \alpha^*, \eta, \eta^*) = \frac{1}{2}\beta^T \Lambda \beta + \lambda \mathbf{1}_n^T(\xi + \xi^*) - \alpha^T(\epsilon_n + \xi - y + X\beta + b_n) - (\alpha^*)^T(\epsilon_n + \xi^* + y - X\beta - b_n) - \eta^T\xi - (\eta^*)^T\xi^* \qquad (7)
and, as in SVR, we take the derivatives to obtain the KKT equations in this problem:
\frac{\partial L_0}{\partial \beta} = \Lambda\beta - X^T(\alpha - \alpha^*) = 0 \ \Rightarrow \ \beta = \Lambda^{-1} X^T(\alpha - \alpha^*), \qquad \frac{\partial L_0}{\partial b_0} = \mathbf{1}_n^T(\alpha^* - \alpha) = 0, \qquad \frac{\partial L_0}{\partial \xi} = \lambda \mathbf{1}_n - \alpha - \eta = 0, \qquad \frac{\partial L_0}{\partial \xi^*} = \lambda \mathbf{1}_n - \alpha^* - \eta^* = 0 \qquad (8)
Substituting the equations of the four derivatives in (8) into the Lagrangian L 0 in (7), one obtains the following dual optimization problem:
Q(\alpha, \alpha^*) = -\frac{1}{2}(\alpha - \alpha^*)^T X \Lambda^{-1} X^T (\alpha - \alpha^*) - \epsilon_n^T(\alpha + \alpha^*) + y^T(\alpha - \alpha^*)
subject to the following conditions:
\mathbf{1}_n^T(\alpha - \alpha^*) = 0, \qquad \mathbf{0}_n \le \alpha \le \lambda \mathbf{1}_n, \qquad \mathbf{0}_n \le \alpha^* \le \lambda \mathbf{1}_n
The solution of this optimization problem gives the values of the vector (α − α*), which we then substitute into the expression for β obtained from the first derivative in (8).

3.1. Estimation Procedure

In this section, we discuss the estimation procedure. There are a number of issues that need to be addressed. We will discuss them one by one and, at the end of the section, we will give a detailed outline of the algorithm.
The first issue we need to discuss is how to adjust the algorithm so that existing packages can be used. Although our objective function looks similar to the objective function for SVR, there is a very important difference, since the first term includes the weight matrix Λ. R packages that implement SVR, such as e1071 (Meyer and Wien (2015) [8]) and kernlab (Karatzoglou et al. (2023) [9]), solve the classic SVR optimization problem given in Section 2.1 rather than the new optimization problem with the adaptive L0 penalty, which includes Λ and was presented at the beginning of Section 3. By setting β̃ = Λ^{1/2} β and X̃ = X Λ^{−1/2}, one can rewrite the optimization in (6) as follows:
\min_{\tilde{\beta}, b_0, \xi, \xi^*} \ \frac{1}{2}\tilde{\beta}^T \tilde{\beta} + \lambda \mathbf{1}_n^T(\xi + \xi^*) \quad \text{subject to:} \quad y - \tilde{X}\tilde{\beta} - b_n \le \epsilon_n + \xi, \quad \tilde{X}\tilde{\beta} + b_n - y \le \epsilon_n + \xi^*, \quad \xi \ge \mathbf{0}_n, \ \xi^* \ge \mathbf{0}_n
which is exactly the same as the objective function for SVR presented in (1), with β and X replaced by β̃ and X̃, respectively. This modification allows us to use existing software to solve the optimization and obtain β̃ by using X̃ as the predictor matrix, and then to recover β through β = Λ^{−1/2} β̃.
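As an illustration of this reparametrization (our sketch, not the authors' code; R users could do the same with e1071 or kernlab), the helper below, which we call adaptive_l0_svr_step, performs a single weighted fit with a generic linear SVR solver, here scikit-learn's LinearSVR:
```python
# Sketch (not from the paper) of one adaptive L0 SVR step solved with a standard
# linear SVR routine via the reparametrization X_tilde = X Lambda^{-1/2}.
import numpy as np
from sklearn.svm import LinearSVR

def adaptive_l0_svr_step(X, y, w, C=1.0, epsilon=0.1):
    """Solve the weighted problem for a fixed diagonal Lambda = diag(w)."""
    inv_sqrt = 1.0 / np.sqrt(w)                    # diagonal of Lambda^{-1/2}
    X_tilde = X * inv_sqrt                         # X_tilde = X Lambda^{-1/2}
    svr = LinearSVR(C=C, epsilon=epsilon, max_iter=10_000).fit(X_tilde, y)
    beta = inv_sqrt * svr.coef_                    # beta = Lambda^{-1/2} beta_tilde
    return beta, svr.intercept_
```
Calling this step once with all weights equal to one reproduces the classic SVR fit, which matches the initialization discussed next.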
The second point we need to address is the fact that the procedure is iterative. By definition, Λ has entries that depend on the entries of the vector β, which is essentially the vector we are trying to estimate. Therefore, one approach to alleviate this technicality is to set all initial weights to one, which is equivalent to running the standard SVR algorithm without the adaptive L0 penalty. Then, based on this estimate, we construct an initial weight matrix Λ^{(0)}, which we use to estimate β^{(1)}. We also need a stopping rule, which stops the iterations when ‖β^{(t)} − β^{(t−1)}‖ ≤ κ, where κ in our simulations is usually set to a value smaller than 10^{−3}.
This also leads to another modification of the optimization problem. To state the optimization problem accurately, we need to indicate the iterative nature of the procedure by using the superscripted quantities β̃^{(t)} and X̃^{(t)}. Therefore, the optimization takes the following form:
\min \ \frac{1}{2}\big(\tilde{\beta}^{(t)}\big)^T \tilde{\beta}^{(t)} + \lambda \mathbf{1}_n^T\big(\xi^{(t)} + \xi^{*(t)}\big) \quad \text{subject to:} \quad y - \tilde{X}^{(t-1)}\tilde{\beta}^{(t)} - b_n^{(t)} \le \epsilon_n + \xi^{(t)}, \quad \tilde{X}^{(t-1)}\tilde{\beta}^{(t)} + b_n^{(t)} - y \le \epsilon_n + \xi^{*(t)}, \quad \xi^{(t)} \ge \mathbf{0}_n, \ \xi^{*(t)} \ge \mathbf{0}_n \qquad (10)
where we use Λ^{(t−1)} to construct X̃^{(t−1)} as X̃^{(t−1)} = X (Λ^{(t−1)})^{−1/2}. We then plug these into the optimization problem in (10) to obtain the solution β̃^{(t)}. Based on β̃^{(t)}, we obtain β^{(t)}, which we use to compute an updated estimate Λ^{(t)}; this, in turn, is used to construct X̃^{(t)}, which is plugged back into the optimization problem in (10) at the next iteration. This becomes clearer in the following estimation procedure:
  • Step 1: Obtain an initial estimate β^{(0)} of the coefficients of the function using any method you like (for example, the OLS approach or the classic SVR approach).
  • Step 2: At iteration t, where t = 1, …, k, calculate Λ^{(t−1)} and X̃^{(t−1)} and solve the optimization in (10) to obtain β^{(t)}.
  • Step 3: Check whether the distance between β^{(t)} and β^{(t−1)} is less than the cutoff point κ. If yes, stop; otherwise, increase t by one and repeat Step 2.
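Putting Steps 1–3 together, the following compact, self-contained sketch (ours, with illustrative defaults; the arguments C, epsilon, kappa and delta correspond to λ, ϵ, κ and δ in the text, and the solver is again scikit-learn's LinearSVR rather than the R packages mentioned above) implements the full procedure:
```python
# Sketch (not from the paper) of the full adaptive L0 SVR procedure (Steps 1-3):
# iterate standard linear SVR fits on X Lambda^{-1/2} until beta stabilizes.
import numpy as np
from sklearn.svm import LinearSVR

def adaptive_l0_svr(X, y, C=1.0, epsilon=0.1, delta=1e-5, kappa=1e-3, max_iter=50):
    n, p = X.shape
    w = np.ones(p)                                    # Step 1: all weights equal to one
    beta_old = np.full(p, np.inf)
    for _ in range(max_iter):                         # Step 2: iterate
        inv_sqrt = 1.0 / np.sqrt(w)
        X_tilde = X * inv_sqrt                        # X_tilde = X Lambda^{-1/2}
        svr = LinearSVR(C=C, epsilon=epsilon, max_iter=10_000).fit(X_tilde, y)
        beta = inv_sqrt * svr.coef_                   # beta = Lambda^{-1/2} beta_tilde
        if np.linalg.norm(beta - beta_old) <= kappa:  # Step 3: stopping rule
            break
        beta_old = beta
        w = 1.0 / (beta ** 2 + delta ** 2)            # adaptive L0 weights (gamma=2, q=0)
    return beta, svr.intercept_
```
In practice, the cost C (the λ of the text) and ϵ would be tuned as for any SVR fit, for example, by cross-validation.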

3.2. Alternative Approach to Adaptive L0 SVR

In this section, we discuss an alternative approach to adaptive L0 SVR. A similar approach to the one proposed in this manuscript has been presented in the SVM literature, although it has not been discussed in the SVR literature. In Li et al. (2015) [10], the authors proposed what they call a sparse Least Squares SVM using the L0 norm in the primal space. We introduce this approach into the SVR method in a similar way and call it the Alternative Adaptive L0 (AAL0) penalty. The starting point is the optimization problem:
\min_{\beta, b_0, \xi, \xi^*} \ \frac{1}{2}\beta^T\beta + \frac{c}{2}\beta^T \Lambda \beta + \lambda \mathbf{1}_n^T(\xi + \xi^*) \quad \text{subject to:} \quad y - X\beta - b_n \le \epsilon_n + \xi, \quad X\beta + b_n - y \le \epsilon_n + \xi^*, \quad \xi \ge \mathbf{0}_n, \ \xi^* \ge \mathbf{0}_n
where, if one compares it to the method introduced in the previous section, it is clear that there is an extra term (1/2) β^T β and an additional constant c, which controls the weight given to the second term and is an extra parameter that needs to be tuned. If one combines the first two terms in the optimization above, then we obtain a problem similar to (6), which is:
\min_{\beta, b_0, \xi, \xi^*} \ \frac{1}{2}\beta^T \tilde{\Lambda} \beta + \lambda \mathbf{1}_n^T(\xi + \xi^*) \quad \text{subject to:} \quad y - X\beta - b_n \le \epsilon_n + \xi, \quad X\beta + b_n - y \le \epsilon_n + \xi^*, \quad \xi \ge \mathbf{0}_n, \ \xi^* \ge \mathbf{0}_n
where Λ ˜ = I + c Λ and I is the identity matrix. This last optimization is equivalent to (6), where Λ is replaced with Λ ˜ . The development of the dual optimization solution is, therefore, very similar and we omit it here.
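Because only the weight matrix changes, the iterative procedure of Section 3.1 carries over directly; a brief sketch (ours, not from the paper) of the modified diagonal weights is given below, where the function name aal0_weights and the default value of c are illustrative:
```python
# Sketch (not from the paper): the AAL0 variant only changes the diagonal weights,
# Lambda_tilde = I + c * Lambda, so each diagonal entry becomes 1 + c * w_i.
import numpy as np

def aal0_weights(beta, c=1.0, delta=1e-5):
    w = 1.0 / (beta ** 2 + delta ** 2)   # adaptive L0 weights (gamma = 2, q = 0)
    return 1.0 + c * w                   # diagonal of Lambda_tilde = I + c * Lambda

# These weights plug directly into the iterative procedure of Section 3.1
# (replace the weight update there by aal0_weights), with c tuned by the user.
print(np.round(aal0_weights(np.array([1.5, -0.3, 0.0])), 2))
```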

4. Numerical Experiments

In this section, we demonstrate the performance of our method in a variety of numerical experiments. We first discuss a number of simulated models and then several real datasets.

4.1. Simulated Data

First, we ran a few experiments with simulated data to see how well our algorithm performs. We chose settings similar to the ones Frommlet and Nuel (2016) [7] used in their work for the linear model. To assess how well the methods work, we used two metrics: the percentage of truly nonzero coefficients that the algorithm recovers as nonzero (true positives) and the percentage of truly zero coefficients that the algorithm correctly sets to zero (true negatives). Therefore, the closer these percentages are to one, the better the model is at recovering the true solution.
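A small sketch (ours, not from the paper) of how these two rates can be computed, where a coefficient is counted as selected if its estimate exceeds a tolerance (here zero):
```python
# Sketch (not from the paper): true-positive and true-negative rates for
# variable selection, comparing estimated coefficients with the true ones.
import numpy as np

def selection_rates(beta_hat, beta_true, tol=0.0):
    active = beta_true != 0
    selected = np.abs(beta_hat) > tol
    tp = np.mean(selected[active])        # truly nonzero coefficients kept nonzero
    tn = np.mean(~selected[~active])      # truly zero coefficients set to zero
    return tp, tn

print(selection_rates(np.array([0.9, 0.2, 0.0, 0.0]), np.array([1.0, 0.0, 0.5, 0.0])))
```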
We have the following models:
  • Model 1: Y = X_1 + X_2 + ϵ.
  • Model 2: Y = X_2 + X_5 + X_8 + X_{11} + X_{14} + ϵ.
  • Model 3 (p): Y = Σ_{i=1}^{20} β_i X_i + ϵ.
For Model 1, p = 20; for Model 2, p = 24; and, for Model 3, p = 100, 250, 500, 1000, 2500, with β_i = 0.5. We used a Toeplitz structure for the covariance matrix of the predictors with ρ = 0, 0.5, 0.8, n = 100, ϵ ∼ N(0, 1), and we ran 500 simulations. We compared the performance of the new approaches, the Adaptive L0 SVR method (SVR-AL0) and the Alternative Adaptive L0 SVR method (SVR-AAL0) discussed in the previous section, with the performance of the Adaptive L0 OLS approach (OLS-AL0) of Frommlet and Nuel (2016) [7]. For the SVR-AAL0 method, we used the value c = 1 unless otherwise noted.
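For concreteness, the sketch below (ours, not from the paper) generates one dataset under Model 2; we assume the commonly used Toeplitz covariance Σ_{ij} = ρ^{|i−j|}, which is one natural reading of the design described above:
```python
# Sketch (not from the paper) of one simulated dataset for Model 2, assuming the
# Toeplitz covariance Sigma_ij = rho^|i-j| (our reading of the design above).
import numpy as np
from scipy.linalg import toeplitz

def simulate_model2(n=100, p=24, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = toeplitz(rho ** np.arange(p))                  # Toeplitz covariance matrix
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[[1, 4, 7, 10, 13]] = 1.0                          # X2, X5, X8, X11, X14
    y = X @ beta + rng.standard_normal(n)                  # eps ~ N(0, 1)
    return X, y, beta

X, y, beta_true = simulate_model2()
```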
As we can see in Table 1, all three methods produce very good estimates. The three methods have similar performances when identifying the true positives, that is, the coefficients that are different from zero. The results differ when we try to find the true negatives, that is, the coefficients that are actually zero. The SVR-AL0 performs well for Model 1 and Model 2 when there is no correlation between the predictors (ρ = 0), but not as well as the OLS-AL0 when there is correlation. In Model 3, this is reversed and the SVR-AL0 works better than the OLS-AL0 in recovering the true negatives.

4.2. Real Data

Now, we discuss a number of real datasets from the UC Irvine Machine Learning Repository (Dua and Graff, 2019) [11].
To perform the comparison, we ran the SVR method and obtained the regression function estimate produced by the non-regularized procedure. Then, we ran the regularized SVR methods, both SVR-AL0 and SVR-AAL0. In the SVR-AAL0 algorithm, the scalar c takes three different values, namely 0.1, 1, and 10. Where only one result is presented, all three values produced the same answer.
We ran the algorithms on three datasets and, here, we present the results. The datasets are the Concrete Compressive Strength data (Yeh, 1998) [12], the power plant data (Tüfekci, 2014) [13], and the wine quality data (Cortez et al., 2009) [14]. In the Concrete Compressive Strength data (Table 2), we see that the classic SVR algorithm produces a vector of nonzero coefficients, which means all the variables are deemed important, while both sparse algorithms push five of the eight coefficients to zero. Furthermore, to demonstrate the effect of the value of the scalar c in SVR-AAL0 on the results, we show that, if we set c equal to 0.1, two of the five zero coefficients become nonzero (although very close to zero). Similarly, for the wine quality data (Table 3), the Adaptive L0 SVR shows clear evidence that five out of the eleven predictors have coefficients equal to zero (and another two very close to zero), while the Alternative Adaptive L0 finds the same answer as the classic SVR for all three values of c (0.1, 1, and 10; only the value 1 is shown in the results). To see whether increasing c would change anything for this dataset, we tried c = 100 and noticed a difference in the results, which is shown in Table 3. As we gave more weight to the number of nonzero coefficients by increasing c, the number of zero coefficients increased to eight and only the three largest coefficients remain clearly nonzero. Finally, the power plant dataset (Table 4) shows a similar performance between the two sparse algorithms, with one predictor having a zero coefficient and two others having very small coefficients (one can claim they are almost zero). It is important to note, though, that the alternative adaptive L0 SVR algorithm with c = 0.1 gives different results, as two predictors have, in this case, significant coefficients and the other two predictors have clearly zero coefficients.

5. Discussion

In this work, we proposed what is possibly the first application of regularization to SVR to obtain sparse function estimation in statistical procedures, such as regression. We combined the classic SVR algorithm with a newly proposed adaptive L0 penalty. We also proposed an alternative approach to the adaptive L0 regularization of SVR, which led to a slightly different optimization and solution. We demonstrated in our numerical experiments that, although both of the methods we proposed performed very well, the alternative approach depends on the value of the parameter c that needs to be tuned.
The adaptive L0 procedure applied to SVR in this paper has been applied to linear regression by Frommlet and Nuel (2016) [7], and variations of this idea have been implemented in the SVM literature, see, for example, Li et al. (2015) [10] and Huang et al. (2008) [15]. It has also been used by Smallman et al. (2020) [16] for sparse feature extraction in principal component analysis (PCA) for Poisson distributed data, which is useful in dimension reduction for text data. It is a very appealing approach, as it is computationally efficient in obtaining an accurate solution (although it needs to be applied iteratively) and it is also effective in providing sparse solutions. It has the potential to be applied in different settings and we are actively working in this direction. In addition, the methodology depends on a number of parameters that need to be tuned, and an interesting direction we have not explored is whether different settings will require different optimal parameters.

Author Contributions

Conceptualization, A.A.; Methodology, A.C.; Formal analysis, A.C.; Writing—original draft, A.C.; Writing—review and editing, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
  2. Wang, X.X.; Chen, S.; Lowe, D.; Harris, C.J. Sparse Support Vector Regression based on orthogonal forward selection for the generalized kernel method. Neurocomputing 2006, 70, 462–474.
  3. Ertin, E.; Potter, L.C. A method for sparse vector regression. In Intelligent Computing: Theory and Applications III; SPIE: Bellingham, WA, USA, 2005.
  4. Tibshirani, R. Regression Shrinkage and Selection via the LASSO. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
  5. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320.
  6. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
  7. Frommlet, F.; Nuel, G. An adaptive ridge procedure for L0 regularization. PLoS ONE 2016, 11, e0148620.
  8. Meyer, D.; Wien, F.T. Support Vector Machines: The Interface to libsvm in Package e1071. R News 2015, 1, 597.
  9. Karatzoglou, A.; Smola, A.; Hornik, K. kernlab: Kernel-Based Machine Learning Lab. R Package Version 0.9-32. 2023. Available online: https://CRAN.R-project.org/package=kernlab (accessed on 15 May 2023).
  10. Li, Q.; Li, X.; Ba, W. Sparse least squares support vector machine with L0-norm in primal space. In Proceedings of the International Conference on Information and Automation, Lijiang, China, 8–10 August 2015; pp. 2778–2783.
  11. Dua, D.; Graff, C. UCI Machine Learning Repository; University of California, School of Information and Computer Science: Irvine, CA, USA, 2019.
  12. Yeh, I.-C. Modeling of strength of high performance concrete using artificial neural networks. Cem. Concr. Res. 1998, 28, 1797–1808.
  13. Tüfekci, P. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int. J. Electr. Power Energy Syst. 2014, 60, 126–140.
  14. Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 2009, 47, 547–553.
  15. Huang, K.; King, I.; Lyu, M.R. Direct zero-norm optimization for feature selection. In Proceedings of the Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 845–850.
  16. Smallman, L.; Underwood, W.; Artemiou, A. Simple Poisson PCA: An algorithm for (sparse) feature extraction with simultaneous dimension determination. Comput. Stat. 2020, 35, 559–577.
Table 1. Comparing performance of true recovery of nonzero (true positives) and zero (true negatives) coefficients for the three methods.

                           True Positives                     True Negatives
Model     p      ρ      SVR-AL0  SVR-AAL0  OLS-AL0        SVR-AL0  SVR-AAL0  OLS-AL0
I         20     0      1.0000   1.0000    0.9990         0.9953   0.9948    1.0000
                 0.5    1.0000   1.0000    1.0000         0.9650   0.9593    1.0000
                 0.8    1.0000   1.0000    1.0000         0.8528   0.8455    1.0000
II        24     0      1.0000   0.9988    1.0000         0.9465   0.9384    1.0000
                 0.5    0.9996   0.9988    1.0000         0.8469   0.8389    1.0000
                 0.8    1.0000   0.9956    1.0000         0.8503   0.5343    1.0000
III (p)   100    0      0.7536   0.7574    0.7227         0.9140   0.9151    0.7562
                 0.5    0.9349   0.9485    0.7397         0.9764   0.9821    0.7470
                 0.8    0.9876   0.9894    0.7197         0.9874   0.9861    0.7514
          250    0      0.7437   0.8806    0.9347         0.8806   0.8839    0.8332
                 0.5    0.9380   0.9572    0.9971         0.9572   0.9604    0.8877
                 0.8    0.9958   0.9864    0.9993         0.9864   0.9871    0.9161
          500    0      0.7125   0.7138    0.7005         0.8903   0.8927    0.9605
                 0.5    0.9350   0.9473    0.9777         0.9561   0.9495    0.9910
                 0.8    0.9955   0.9963    0.9987         0.9852   0.9838    0.9992
          1000   0      0.5969   0.9135    0.8029         0.9135   0.9166    0.8650
                 0.5    0.9276   0.9599    0.9935         0.9599   0.9530    0.9123
                 0.8    0.9946   0.9843    0.9993         0.9844   0.9843    0.9815
Table 2. Comparing coefficients of the different methods in the Concrete Compressive Strength real dataset.

Method               Cement    Slag      Ash        Water      Superplasticizer  Coarse Aggr.  Fine Aggr.  Age
SVR                  0.75531   0.20537   -0.12401   -0.05053   0.01467           -0.08858      -0.11367    0.59027
SVR-AL0              0.77107   0.20580   0          0          0                 0             0           0.60258
SVR-AAL0 (c = 1)     0.77046   0.20941   0          0          0                 0             0           0.60211
SVR-AAL0 (c = 0.1)   0.77380   0.18856   0.00004    0          0                 0             0.00001     0.60471
Table 3. Comparing coefficients of the different methods in the wine quality dataset.

Response     Method               Fixed Acid  Volatile Acid  Citric Acid  Residual Sugar  Chlorides  Free Sulfur Dioxide  Total Sulfur Dioxide  Density    pH        Sulphates  Alcohol
Red Wine     SVR                  0.00915     -0.01096       0.00351      0.05674         -0.00357   0.23292              0.96777               -0.00009   -0.00110  0.00379    0.07556
             SVR-AL0              0.00010     -0.00024       0            0.05345         0          0.23297              0.96818               0          0         0          0.07414
             SVR-AAL0 (c = 1)     0.00915     -0.01096       0.00351      0.05674         -0.00357   0.23292              0.96777               -0.00009   -0.00110  0.00379    0.07556
             SVR-AAL0 (c = 100)   0           0              0            0               0          -0.38130             0.91549               0          0         0          0.12843
White Wine   SVR                  -0.01372    -0.00485       0.00249      0.08978         -0.00066   -0.17769             -0.97910              -0.00005   0.00137   0.00116    0.03885
             SVR-AL0              0.00384     -0.00003       0            0.08967         0          -0.17770             -0.97927              0          0         0          0.03736
             SVR-AAL0 (c = 1)     -0.01372    -0.00485       0.00249      0.08978         -0.00066   -0.17769             -0.97910              -0.00005   0.00137   0.00116    0.03885
             SVR-AAL0 (c = 100)   0           0              0            0.05853         0          -0.25981             -0.96388              0          0         0          0
Table 4. Comparing coefficients of the different methods in the power plant dataset.

Method               Temperature  Pressure   Humidity  Vacuum
SVR                  -0.81683     -0.36372   0.18267   0.40881
SVR-AL0              -1.0000      -0.00004   0         0.00050
SVR-AAL0 (c = 1)     -1.0000      -0.00004   0         0.00045
SVR-AAL0 (c = 0.1)   -0.89426     0          0         0.44755
