Article

Capped L2,p-Norm Metric Based on Robust Twin Support Vector Machine with Welsch Loss

School of Mathematics and Information Sciences, North Minzu University, Yinchuan 750021, China
*
Author to whom correspondence should be addressed.
Symmetry 2023, 15(5), 1076; https://doi.org/10.3390/sym15051076
Submission received: 24 March 2023 / Revised: 7 April 2023 / Accepted: 9 May 2023 / Published: 12 May 2023

Abstract:
A twin bounded support vector machine (TBSVM) is a phenomenon of symmetry that improves the performance of the traditional support vector machine classification algorithm. In this paper, we propose an improved model based on the TBSVM, called the Welsch loss with capped L2,p-norm distance metric robust twin bounded support vector machine (WCTBSVM). On the one hand, by introducing the capped L2,p-norm metric into the TBSVM, the problem of the non-sparse output of the regularization term is solved; thus, the generalization and robustness of the TBSVM are improved and the principle of structural risk minimization is realized. On the other hand, a bounded, smooth, and non-convex Welsch loss function is introduced to reduce the influence of noise, which further improves the classification performance of the TBSVM. We use a half-quadratic programming algorithm to solve the non-convexity of the model caused by the Welsch loss. Therefore, the WCTBSVM is more robust and effective in dealing with noise than the TBSVM. In addition, to reduce the time complexity and speed up the convergence of the algorithm, we construct a least squares version of the WCTBSVM, named the fast WCTBSVM (FWCTBSVM). Experimental results on both UCI and artificial datasets show that our models achieve better classification performance.

1. Introduction

The support vector machine (SVM) [1,2,3,4,5] is designed for the binary classification problem and, owing to its superior performance in text classification tasks, has become the development basis of many other classifiers in machine learning. The SVM is widely used in data classification and regression analysis problems, such as artificial intelligence, speech recognition, remote image analysis, financial management, etc. For example, scene recognition plays a crucial role in supporting cognitive communications of unmanned aerial vehicles (UAVs), and Zhu et al. [6] proposed an air-to-ground (A2G) scenario identification model based on the SVM to leverage scenario-dependent channel characteristics. The SVM is based on the principle of structural risk minimization, which gives it excellent generalization performance, and its basic idea is to maximize the margin between the two classes of data. Although the SVM is highly applicable to classification problems, it still faces some challenges in handling its learning tasks. To avoid over-fitting, the SVM is extended to the soft-margin SVM (C-SVM) [7] by introducing relaxation variables. The loss function of the C-SVM is the hinge loss, which is very sensitive to noise points in the sample. Moreover, the SVM needs to solve a quadratic programming problem (QPP) of complexity O(n^3) [7] during training. Therefore, many new schemes have been proposed to speed up the training process of the SVM, for example, decomposing a large QPP into multiple small QPPs, or building other variant models of the SVM. Among the more popularly accepted ones are the generalized eigenvalue proximal support vector machine (GEPSVM) [8] and the twin support vector machine (TSVM) [3,9,10,11,12]. The GEPSVM searches for two nonparallel hyperplanes by solving two related generalized eigenvalue problems. The TSVM divides a large QPP into two small QPPs; its principle, based on the GEPSVM, is to search for two nonparallel hyperplanes so that the positive (negative) samples are as close as possible to the positive class (negative class) hyperplane and as far from the other hyperplane as possible. Therefore, the TSVM runs much faster than the SVM in theory.
In the process of studying the TSVM, some scholars have found that the TSVM needs to solve two QPPs, which is not feasible for large-scale classification. In order to reduce the computational cost of the TSVM while maintaining its advantages, the least squares support vector machine (LSSVM) was first proposed by Suykens et al. [2,13,14]. The LSSVM is one of the important achievements in the field of machine learning in recent years; its training process follows the principle of risk minimization, replacing the inequality constraints with equality constraints and solving two sets of linear equations. Based on this, Tomar et al. extended the least squares twin support vector machine (LSTSVM) [7] to multi-class classification and proposed a multi-class least squares twin support vector machine (MLSTSVM). For multi-class problems, the MLSTSVM generates multiple hyperplanes, one plane per class, where the ith class is as close as possible to the ith hyperplane while being as far away from the other hyperplanes as possible. Xie et al. [15] proposed a novel Laplacian Lp-norm least squares twin support vector machine (Lap-LpLSTSVM); experimental results on both synthetic and real-world datasets show that the Lap-LpLSTSVM outperforms other state-of-the-art methods and can also deal with noisy datasets. Ye et al. proposed the information-weighted TWSVM classification method (WLTSVM) [16], which represents the compactness of intra-class samples and the discreteness of inter-class samples by using inter-class graphs. In addition, a new SVM variant was also proposed by Shao et al. [8], the twin bounded support vector machine (TBSVM) [9]. It adds a regularization term containing the L2-norm, realizes the principle of structural risk minimization, and improves the classification performance of the SVM. The TSVM is computed faster than the SVM, but the squared L2-norm distance used in the TSVM amplifies the effect of outliers, thus affecting the construction of the hyperplanes in the presence of noise. To enhance the classification performance of twin support vector machines (TSVMs) on imbalanced datasets, a reduced universum twin support vector machine for class imbalance learning (RUTSVM) has been proposed based on the universum TSVM framework. However, RUTSVM suffers from a key drawback in terms of the matrix inverse computation involved in solving its dual problem and finding classifiers; to address this issue, Moosaei et al. [17] proposed an improved version of RUTSVM called IRUTSVM. On the basis of the TSVM, Wang et al. proposed the capped L1-norm twin support vector machine (CTSVM) [18,19], thus eliminating the effect of some outliers and increasing robustness. From this we see that the L1-norm improves robustness more than the squared L2-norm. However, the capped L1-norm distance measures also show their drawbacks: when the outliers are large, they are neither convex nor smooth, making them difficult to optimize. For this problem, a robust twin support vector machine based on the L2,p-norm (pTSVM) [20,21,22] was recently proposed, which suppresses the effect of outliers better than the L1-norm and the squared L2-norm.
To further suppress the adverse effects of the outliers, we introduce a bounded, smooth, and non-convex Welsch loss [23,24,25]. The Welsch loss is the loss function of the Welsch estimator, one of the robust estimation methods. When the data error is normally distributed, it is comparable to the mean square error loss, but the Welsch loss is more robust when the error is non-normally distributed and caused by outliers. To improve the robustness and generalization performance of the TBSVM, this paper proposes a classifier that adds a Welsch loss term on the basis of the capped L2,p-norm: the capped L2,p-norm metric robust twin bounded support vector machine with Welsch loss (WCTBSVM). The WCTBSVM is more robust than the TBSVM, reducing the effect of outliers and improving classification performance. Beyond this, the fast WCTBSVM (FWCTBSVM) is proposed based on the idea of least squares; changing the inequality constraints into equality constraints accelerates the algorithm.
Combined with the above content, this article carries out the main work and mainly has the following points:
(1) By analyzing the characteristics of both the Welsch loss function and the capped L 2 , p -norm metric distance, a new robust learning algorithm based on a TBSVM is proposed as a WCTBSVM. Without loss of precision, a least squares version of WCTBSVM is constructed, named FWCTBSVM.
(2) The iterative algorithms of WCTBSVM and FWCTBSVM models are given, and their convergence is analyzed according to the characteristics of the models.
(3) Through experiments on UCI datasets and artificial datasets, we found that the WCTBSVM and FWCTBSVM have advantages over several other methods in terms of robustness and feasibility.
(4) To see the advantages of the WCTBSVM and FWCTBSVM more intuitively, we conducted a statistical analysis to further verify that the classification performance of the WCTBSVM and FWCTBSVM is stronger than that of the TSVM, TBSVM, LSTSVM, and CTSVM.
In Section 2, we introduce some related models, such as the TSVM, TBSVM, CTSVM, LSTSVM, and the capped L2,p-norm. In Section 3, we integrate the Welsch function into the model and then solve the model. A large number of data experiments are conducted to test the models in Section 4, and Section 5 is a summary of the full text.

2. Related Work

In this paper, we propose two models based on the following relevant content. We first make a simple introduction to four classification models: the twin support vector machine (TSVM), twin bounded support vector machine (TBSVM), capped L 1 -norm twin support vector machine (CTSVM), and least squares twin support vector machine (LSTSVM), as well as the capped L 2 , p -norm and Welsch loss functions to be introduced in the model of this paper [25].

2.1. TSVM

The TSVM was first proposed by Jayadeva et al. The basic theory is to produce two nonparallel classification planes, and keep each class close to the corresponding one, and far away from the other one. The TSVM differs from the SVM by dividing a convex quadratic programming problem into two convex quadratic programming problems; thus, speeding up the rate of classification identification.
We set up a training dataset ζ_l = {(x_i, y_i) | i = 1, 2, …, l} ∈ (R^n × Y)^l, where x_i ∈ R^n and y_i ∈ Y = {1, −1} denotes the class to which the ith point belongs. Class 1 and class −1 are referred to as positive samples and negative samples, respectively. The number of positive class samples in ζ_l is denoted by m1, and the number of negative class samples is denoted by m2, where l = m1 + m2. Let A ∈ R^{m1×n} represent all positive class samples and B ∈ R^{m2×n} represent all negative class samples. Thus, we determine two non-parallel hyperplanes:
$$ f_1(x) = \omega_1^{T} x + b_1 = 0 $$
and
$$ f_2(x) = \omega_2^{T} x + b_2 = 0 $$
where ω1, ω2 ∈ R^n are the normal vectors of the hyperplanes and b1, b2 ∈ R are the offsets of the hyperplanes.
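To make the data setup concrete, the following minimal NumPy sketch (ours, not the authors' code; names are illustrative) shows how a labeled training set is split into the positive-class matrix A and negative-class matrix B used throughout this section.

```python
import numpy as np

def split_classes(X, y):
    """Split labeled data into the positive-class matrix A (y == +1)
    and the negative-class matrix B (y == -1)."""
    A = X[y == 1]    # shape (m1, n)
    B = X[y == -1]   # shape (m2, n)
    return A, B

# toy usage
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
A, B = split_classes(X, y)
```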
Subsequently, we can obtain a model of the TSVM:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|_2^2 + C_1 e_2^{T}\xi_1 \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2,\ \ \xi_1 \ge 0 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|_2^2 + C_2 e_1^{T}\xi_2 \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1,\ \ \xi_2 \ge 0 $$
where C1 ≥ 0, C2 ≥ 0 are regularization parameters, e1 and e2 are vectors of ones of appropriate dimensions, and ξ1 and ξ2 are the slack vectors.
Introducing the Lagrange multipliers α and β , let H = [ A , e 1 ] and Z = [ B , e 2 ] , and we can get the dual problem:
$$ \min_{\alpha}\ \frac{1}{2}\alpha^{T} Z (H^{T}H)^{-1} Z^{T}\alpha - e_2^{T}\alpha \quad \text{s.t.}\quad 0 \le \alpha \le C_1 e_2 $$
and
$$ \min_{\beta}\ \frac{1}{2}\beta^{T} H (Z^{T}Z)^{-1} H^{T}\beta - e_1^{T}\beta \quad \text{s.t.}\quad 0 \le \beta \le C_2 e_1. $$
Finally, the two nonparallel hyperplanes can be obtained by solving the QPPs in formulas (5) and (6):
$$ [\omega_1, b_1]^{T} = -(H^{T}H)^{-1} Z^{T}\alpha, \qquad [\omega_2, b_2]^{T} = (Z^{T}Z)^{-1} H^{T}\beta. $$

2.2. TBSVM

To enhance the classification performance of the TSVM, a new TSVM model called the TBSVM was proposed by Shao et al. [8]. The TBSVM obtains two non-parallel hyperplanes for classification by solving two smaller QPPs. Thus, we can write the TBSVM as:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|_2^2 + C_1 e_2^{T}\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2,\ \ \xi_1 \ge 0 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|_2^2 + C_2 e_1^{T}\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1,\ \ \xi_2 \ge 0 $$
where ξ1 and ξ2 are the slack vectors, 0 is a zero vector, C1 ≥ 0, C2 ≥ 0, C3 ≥ 0 and C4 ≥ 0 are the regularization parameters, e1 and e2 are vectors of ones, and A and B are the same as in (3). Based on optimization theory and dual theory, let H = [A, e1] and Z = [B, e2]. We can then get the dual problems of (8) and (9) as follows:
$$ \min_{\alpha}\ \frac{1}{2}\alpha^{T} Z (H^{T}H + C_3 I)^{-1} Z^{T}\alpha - e_2^{T}\alpha \quad \text{s.t.}\quad 0 \le \alpha \le C_1 e_2 $$
and
$$ \min_{\beta}\ \frac{1}{2}\beta^{T} H (Z^{T}Z + C_4 I)^{-1} H^{T}\beta - e_1^{T}\beta \quad \text{s.t.}\quad 0 \le \beta \le C_2 e_1 $$
where α ∈ R^{m2} and β ∈ R^{m1} are Lagrange multipliers. Further, we can get the solutions of (10) and (11) as follows:
$$ [\omega_1, b_1]^{T} = -(H^{T}H + C_3 I)^{-1} Z^{T}\alpha, $$
$$ [\omega_2, b_2]^{T} = (Z^{T}Z + C_4 I)^{-1} H^{T}\beta. $$

2.3. CTSVM

Using the L 2 -norm in the TSVM increases the effect of the noises without reaching the structure of the optimized classification hyperplanes. Therefore, the capped L 1 -norm twin support vector machine (CTSVM) is introduced to increase the robustness to the noise, which helps reduce noise during the model training. The capped L 1 -norm is shown in Figure 1.
The CTSVM classifier is obtained by solving the following problems:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(|\omega_1^{T} x_i + b_1|, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\min(\xi_{1,i}, \varepsilon_2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2 $$
and
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(|\omega_2^{T} x_i + b_2|, \varepsilon_3) + C_2\sum_{i=1}^{m_1}\min(\xi_{2,i}, \varepsilon_4) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1 $$
where C1 > 0 and C2 > 0, e1 ∈ R^{m1} and e2 ∈ R^{m2} are vectors of ones, ε1, ε2, ε3, ε4 are the thresholding parameters, and ξ1 and ξ2 are the slack vectors. Here, we use the capped L1-norm to reduce the influence of the outliers: under the capped L1-norm loss function, when a data point is misclassified, the loss is at most ε.
We can reformulate the problems as the following approximate ones:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}(A\omega_1 + e_1 b_1)^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2}\xi_1^{T} Q\,\xi_1 \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}(B\omega_2 + e_2 b_2)^{T} K (B\omega_2 + e_2 b_2) + \frac{C_2}{2\sigma^2}\xi_2^{T} U\,\xi_2 \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1 $$
where F, Q, K and U are four diagonal matrices with diagonal elements
$$ f_i = \begin{cases} \dfrac{1}{|\omega_1^{T} x_i + b_1|}, & |\omega_1^{T} x_i + b_1| \le \varepsilon_1, \\ 0, & \text{otherwise}, \end{cases} \qquad q_i = \begin{cases} \dfrac{1}{\xi_{1,i}}, & \xi_{1,i} \le \varepsilon_2, \\ 0, & \text{otherwise}, \end{cases} $$
$$ k_i = \begin{cases} \dfrac{1}{|\omega_2^{T} x_i + b_2|}, & |\omega_2^{T} x_i + b_2| \le \varepsilon_3, \\ 0, & \text{otherwise}, \end{cases} \qquad u_i = \begin{cases} \dfrac{1}{\xi_{2,i}}, & \xi_{2,i} \le \varepsilon_4, \\ 0, & \text{otherwise}. \end{cases} $$
Based on optimization theory and dual theory, let H = [ A , e 1 ] and Z = [ B , e 2 ] . We can then get the dual problems of (16) and (17)
$$ \min_{\alpha}\ \frac{1}{2}\alpha^{T}\Big(Z (H^{T} F H)^{-1} Z^{T} + \frac{1}{C_1} Q^{-1}\Big)\alpha - e_2^{T}\alpha \quad \text{s.t.}\quad \alpha \ge 0 $$
and
$$ \min_{\beta}\ \frac{1}{2}\beta^{T}\Big(H (Z^{T} K Z)^{-1} H^{T} + \frac{1}{C_2} U^{-1}\Big)\beta - e_1^{T}\beta \quad \text{s.t.}\quad \beta \ge 0 $$
where α ∈ R^{m2} and β ∈ R^{m1} are Lagrange multipliers.

2.4. LSTSVM

The TSVM has great advantages over the SVM in the classification process, but it also magnifies the influence of noise and increases the computational difficulty when processing large amounts of data. However, the least squares twin support vector machine (LSTSVM) changes the inequality constraints into equality constraints, which avoids these problems when the TSVM deals with large amounts of data. Therefore, we have the LSTSVM as
$$ \min_{\omega_1, b_1}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|_2^2 + \frac{C_1}{2}\xi_1^{T}\xi_1 \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 = e_2 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|_2^2 + \frac{C_2}{2}\xi_2^{T}\xi_2 \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 = e_1 $$
where C1 > 0 and C2 > 0 represent regularization parameters, and ξ1 and ξ2 are slack vectors.
Furthermore, the optimization problems (22) and (23) can be rewritten as:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|_2^2 + \frac{C_1}{2}\|e_2 + (B\omega_1 + e_2 b_1)\|_2^2 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|_2^2 + \frac{C_2}{2}\|e_1 - (A\omega_2 + e_1 b_2)\|_2^2. $$
Let H = [A, e1] and Z = [B, e2]. Setting the partial derivatives of (24) with respect to ω1 and b1, and of (25) with respect to ω2 and b2, equal to zero, we can get:
$$ [\omega_1, b_1]^{T} = -\Big(\frac{1}{C_1}H^{T}H + Z^{T}Z\Big)^{-1} Z^{T} e_2 $$
and
$$ [\omega_2, b_2]^{T} = \Big(\frac{1}{C_2}Z^{T}Z + H^{T}H\Big)^{-1} H^{T} e_1. $$
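As an illustration of these closed-form LSTSVM solutions, here is a small NumPy sketch. The function name and the small ridge term added for numerical stability are our own additions rather than part of the original formulation.

```python
import numpy as np

def lstsvm_planes(A, B, C1, C2, reg=1e-8):
    """Closed-form LSTSVM planes; returns (w1, b1), (w2, b2).
    reg is a small ridge term added only for numerical stability."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    H = np.hstack([A, e1])          # H = [A, e1]
    Z = np.hstack([B, e2])          # Z = [B, e2]
    I = np.eye(H.shape[1])
    v1 = -np.linalg.solve(H.T @ H / C1 + Z.T @ Z + reg * I, Z.T @ e2)
    v2 = np.linalg.solve(Z.T @ Z / C2 + H.T @ H + reg * I, H.T @ e1)
    return (v1[:-1], v1[-1]), (v2[:-1], v2[-1])
```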

2.5. Capped L2,p-norm

It is well known that the squared L2-norm distance metric is used in most variant classifiers associated with the TSVM, but it is more sensitive to outliers: although the squared L2-norm is differentiable, the square term amplifies the negative effect of the outliers, thus decreasing the classification performance of the model. However, the L2,p-norm inhibits the negative effect of the outliers better than the L1-norm and the squared L2-norm, and when p = 1 the L2,p-norm becomes the L1-norm. Obviously, by setting an appropriate p, the capped L2,p-norm is more robust than the capped L2-norm, and related algorithms show that the capped L2,p-norm is robust to Gaussian noise.
For any vector a ∈ R^n and p ∈ [0, 2], the L2,p-norm and the capped L2,p-norm are defined as:
$$ f_1(a) = \Big(\sum_{i=1}^{n} a_i^2\Big)^{\frac{p}{2}}, \qquad f_2(a) = \min\Big(\Big(\sum_{i=1}^{n} a_i^2\Big)^{\frac{p}{2}},\ \varepsilon\Big), $$
where ε ≥ 0 is the thresholding parameter.
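A one-line implementation of the capped L2,p value of a vector, following the definition above (a sketch; names are illustrative):

```python
import numpy as np

def capped_l2p(a, p, eps):
    """Capped L2,p value of a vector a: min((sum_i a_i^2)^(p/2), eps)."""
    return min(np.sum(np.asarray(a) ** 2) ** (p / 2.0), eps)

# example: a large outlier vector is clipped at eps
print(capped_l2p([0.3, 0.4], p=1.0, eps=2.0))    # 0.5 (= ||a||_2)
print(capped_l2p([30.0, 40.0], p=1.0, eps=2.0))  # 2.0 (capped)
```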
From Figure 2 we can find that the capped L 2 , p -norm is more robust than the L 1 -norm and L 2 -norm; thus, making the classification performance of the model better.
Because the capped L2,p-norm is so robust, we introduce it into the model proposed in this paper to further improve the generalization and robustness of the TBSVM.

2.6. Welsch Regularization

In this paper, we focus on the Welsch loss function. It is a bounded, smooth, and non-convex loss, which is very robust to noise. It is defined as:
$$ V(\xi) = \frac{\sigma^2}{2}\Big[1 - \exp\Big(-\frac{\xi^2}{2\sigma^2}\Big)\Big] $$
where σ is a penalty parameter. Figure 3 shows the Welsch loss function V(ξ) under different values of σ, which change from 1 to 3 [23,26,27,28,29].
Through Figure 3, we found that the upper bound of the Welsch loss function increases and the convergence speed slows down as σ gradually increases. Thus, the impact of noise on the model during the training process is limited. Consequently, the Welsch loss can further enhance the robustness of the model.
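For reference, a direct implementation of the Welsch loss defined above; the saturation visible in Figure 3 is easy to verify numerically (a sketch, not the authors' code):

```python
import numpy as np

def welsch_loss(xi, sigma):
    """Welsch loss: (sigma^2 / 2) * (1 - exp(-xi^2 / (2 sigma^2))).
    Bounded above by sigma^2 / 2, so a single large residual contributes only a fixed amount."""
    xi = np.asarray(xi, dtype=float)
    return (sigma ** 2 / 2.0) * (1.0 - np.exp(-xi ** 2 / (2.0 * sigma ** 2)))

# the loss saturates: a residual of 100 costs barely more than one of 5
print(welsch_loss(5.0, sigma=1.0), welsch_loss(100.0, sigma=1.0))
```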

3. Main Contributions

To improve the generalization performance of the TBSVM, this paper proposes the WCTBSVM and FWCTBSVM for classification problems.

3.1. WCTBSVM

To suppress the adverse impact caused by the outliers, we incorporate the Welsch loss and the capped L2,p-norm distance metric into the framework of the TBSVM, which is represented as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2 $$
and
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + C_2\sum_{i=1}^{m_1}\Big[1 - \exp\Big(-\frac{\xi_{2,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1 $$
where C1, C2, C3, C4 > 0 are regularization parameters, and e1 ∈ R^{m1} and e2 ∈ R^{m2} are vectors of ones. The model is built on the TBSVM: the first term introduces a capped L2,p-norm that satisfies the principle of structural risk minimization and improves the generalization performance of the model; the second term introduces the Welsch loss function to reduce the influence of noise; the last term introduces the L2-norm to prevent over-fitting of the model.
Let
$$ R(\omega_1, b_1) = \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), \qquad Q(\omega_1, b_1) = C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big]. $$
Minimizing Q(ω1, b1) is then equivalent to maximizing
$$ \bar{Q}(\omega_1, b_1) = C_1\sum_{i=1}^{m_2}\exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big). $$
To facilitate the following derivations, we define a convex function g(v) = −v log(−v) + v, where v < 0. Based on conjugate function theory, we have
$$ \exp\Big(-\frac{\xi_1^2}{2\sigma^2}\Big) = \sup_{v<0}\Big(v\,\frac{\xi_1^2}{2\sigma^2} - g(v)\Big), $$
where the supremum is attained at
$$ v = -\exp\Big(-\frac{\xi_1^2}{2\sigma^2}\Big). $$
Thus, we can get
$$ \max_{\omega_1, b_1, v}\ M(\omega_1, b_1, v) = \sum_{i=1}^{m_2}\Big(v_i\,\frac{\xi_{1,i}^2}{2\sigma^2} - g(v_i)\Big) - R(\omega_1, b_1). $$
Using the half-quadratic optimization of (34), suppose that we have v^s, where the superscript s denotes the sth iteration, so that v can be written as:
$$ \max_{v_i^s<0}\ \sum_{i=1}^{m_2}\Big(v_i^s\,\frac{(\xi_{1,i}^s)^2}{2\sigma^2} - g(v_i^s)\Big), \qquad v_i^s = -\exp\Big(-\frac{(\xi_{1,i}^s)^2}{2\sigma^2}\Big). $$
Further, the optimization problem (30) can be rewritten as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2. $$
In a similar way, the optimization problem (31) can be rewritten as:
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1 $$
where Ω_j = diag(−v_{j,i}^s), j = 1, 2.
Theorem 1.
Let g(θ): R^n → R be a continuous non-convex function and suppose h(θ): R^n → Ξ is a map with range Ξ. We assume that there exists a concave function ḡ(u) defined on Ξ such that g(θ) = ḡ(h(θ)) holds. Under the above assumption, the non-convex function g(θ) can be expressed as:
$$ g(\theta) = \inf_{v \in \mathbb{R}^n}\big[v^{T} h(\theta) - g^{*}(v)\big]. $$
According to concave duality, g*(v) is the concave dual of ḡ(u), given as
$$ g^{*}(v) = \inf_{u}\big[v^{T} u - \bar{g}(u)\big]. $$
In addition, the minimum on the right-hand side of (41) is attained at
$$ v^{*} = \left.\frac{\partial \bar{g}(u)}{\partial u}\right|_{u = h(\theta)}. $$
Based on Theorem 1, we choose a concave function ḡ(θ): R → R such that, for arbitrary θ > 0,
$$ \bar{g}(\theta) = \min(\theta^{\frac{p}{2}}, \varepsilon). $$
Assuming that h(μ) = μ², we can get
$$ \min(\|\omega^{T} x_i + b\|_2^{p}, \varepsilon) = \bar{g}(h(\mu)), $$
where μ = ‖ωᵀx_i + b‖₂. Based on (30), (45) can be rewritten as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\bar{g}(\|\omega_1^{T} x_i + b_1\|_2^2) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2. $$
Similarly, based on (31) and (45), we can get
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\bar{g}(\|\omega_2^{T} x_i + b_2\|_2^2) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1. $$
Let θ1 = h(μ1) = ‖ω1ᵀx_i + b1‖₂². Via Theorem 1, the first term of (30) can be expressed as:
$$ \min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) = \bar{g}(\|\omega_1^{T} x_i + b_1\|_2^2) = \inf_{f_{ii}\ge 0}\big(f_{ii}\,h(\mu_1) - g^{*}(f_{ii})\big) = \inf_{f_{ii}\ge 0}\big(f_{ii}\,\theta_1 - g^{*}(f_{ii})\big). $$
Therefore, the concave dual function of ḡ(θ1) is
$$ g^{*}(f_{ii}) = \inf_{\theta_1}\big[f_{ii}\theta_1 - \bar{g}(\theta_1)\big] = \inf_{\theta_1}\begin{cases} f_{ii}\theta_1 - \theta_1^{\frac{p}{2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1, \\ f_{ii}\theta_1 - \varepsilon_1, & \theta_1^{\frac{p}{2}} \ge \varepsilon_1. \end{cases} $$
By optimizing θ1 in (49), we can get:
$$ g^{*}(f_{ii}) = \begin{cases} f_{ii}\big(\tfrac{2}{p}f_{ii}\big)^{\frac{2}{p-2}} - \big(\tfrac{2}{p}f_{ii}\big)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1, \\ f_{ii}\,\varepsilon_1^{\frac{2}{p}} - \varepsilon_1, & \theta_1^{\frac{p}{2}} \ge \varepsilon_1. \end{cases} $$
Therefore, the objective function (30) can be further written as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), $$
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\inf_{f_{ii}\ge 0} L_i(\omega_1, b_1, f_{ii}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), $$
$$ \min_{\omega_1, b_1, f_{ii}\ge 0}\ \sum_{i=1}^{m_1} L_i(\omega_1, b_1, f_{ii}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2). $$
Here, the first term of the objective (30) is rewritten as:
$$ L_i(\omega_1, b_1, f_{ii}, \varepsilon_1) = \begin{cases} f_{ii}\theta_1 - f_{ii}\big(\tfrac{2}{p}f_{ii}\big)^{\frac{2}{p-2}} + \big(\tfrac{2}{p}f_{ii}\big)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1, \\ f_{ii}\theta_1 - f_{ii}\,\varepsilon_1^{\frac{2}{p}} + \varepsilon_1, & \theta_1^{\frac{p}{2}} \ge \varepsilon_1. \end{cases} $$
The objective function (52) is solved by learning the optimal classifier via an alternating optimization algorithm. We calculate the gradient of the function ḡ(θ) with respect to θ as follows:
$$ \frac{\partial \bar{g}(\theta)}{\partial \theta} = \begin{cases} \frac{p}{2}\theta^{\frac{p}{2}-1}, & 0 < \theta < \varepsilon^{\frac{2}{p}}, \\ 0, & \theta > \varepsilon^{\frac{2}{p}}. \end{cases} $$
If θ1 = h(μ1) = ‖ω1ᵀx_i + b1‖₂², then fixing ω1 and b1 we can get:
$$ f_{ii} = \left.\frac{\partial \bar{g}(\theta_1)}{\partial \theta_1}\right|_{\theta_1 = \|\omega_1^{T} x_i + b_1\|_2^2} = \begin{cases} \frac{p}{2}\|\omega_1^{T} x_i + b_1\|_2^{p-2}, & 0 < \|\omega_1^{T} x_i + b_1\|_2^{p} < \varepsilon_1, \\ 0, & \text{else}. \end{cases} $$
In the same way, we can get
$$ k_{ii} = \left.\frac{\partial \bar{g}(\theta_2)}{\partial \theta_2}\right|_{\theta_2 = \|\omega_2^{T} x_i + b_2\|_2^2} = \begin{cases} \frac{p}{2}\|\omega_2^{T} x_i + b_2\|_2^{p-2}, & 0 < \|\omega_2^{T} x_i + b_2\|_2^{p} < \varepsilon_3, \\ 0, & \text{else}. \end{cases} $$
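The two weight updates above, together with the half-quadratic weights Ω introduced earlier, can be sketched as follows. This is our illustrative reading of the formulas (the Ω weights are stored as the positive values exp(−ξ_i²/(2σ²)) implied by v_i^s = −exp(·)), not the authors' implementation.

```python
import numpy as np

def capped_weights(H, z, p, eps):
    """Diagonal entries f_ii (or k_ii): (p/2)*|w^T x_i + b|^(p-2)
    when 0 < |w^T x_i + b|^p < eps, and 0 otherwise."""
    r = np.abs(H @ z).ravel()              # |w^T x_i + b| for each row of H = [A, e1]
    d = np.zeros_like(r)
    mask = (r > 1e-12) & (r ** p < eps)    # small floor avoids division by zero
    d[mask] = (p / 2.0) * r[mask] ** (p - 2)
    return np.diag(d)

def welsch_weights(xi, sigma):
    """Half-quadratic weights Omega = diag(exp(-xi_i^2 / (2 sigma^2)))."""
    return np.diag(np.exp(-np.ravel(xi) ** 2 / (2.0 * sigma ** 2)))
```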
When the variables f_ii and k_ii are fixed, in order to solve for the unknown quantities ω1, ω2 and b1, b2 of the classification model, the optimization problem (30) can be written as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1} f_{ii}\|\omega_1^{T} x_i + b_1\|_2^2 + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2. $$
In the same way, the optimization problem (31) can be written as:
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2} k_{ii}\|\omega_2^{T} x_i + b_2\|_2^2 + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1. $$
Let F = diag(f_{11}, f_{22}, …, f_{m_1 m_1}) be an m1 × m1 diagonal matrix and K = diag(k_{11}, k_{22}, …, k_{m_2 m_2}) be an m2 × m2 diagonal matrix. Then (30) can be written as:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}(A\omega_1 + e_1 b_1)^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2. $$
Similarly, (31) can be written as:
$$ \min_{\omega_2, b_2}\ \frac{1}{2}(B\omega_2 + e_2 b_2)^{T} K (B\omega_2 + e_2 b_2) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1. $$
The corresponding Lagrange function of the above optimization problem (58) can be written as:
$$ L(\omega_1, b_1, \xi_1, \alpha) = \frac{1}{2}(A\omega_1 + e_1 b_1)^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) - \alpha^{T}\big(-(B\omega_1 + e_2 b_1) + \xi_1 - e_2\big), $$
where α is a Lagrange multiplier. Differentiating the Lagrange function with respect to ω1, b1 and ξ1, we get the following Karush–Kuhn–Tucker conditions:
$$ \begin{aligned} &\frac{\partial L}{\partial \omega_1} = A^{T} F (A\omega_1 + e_1 b_1) + B^{T}\alpha + C_3\omega_1 = 0, \quad (i)\\ &\frac{\partial L}{\partial b_1} = e_1^{T} F (A\omega_1 + e_1 b_1) + e_2^{T}\alpha + C_3 b_1 = 0, \quad (ii)\\ &\frac{\partial L}{\partial \xi_1} = \frac{C_1}{\sigma^2}\Omega_1\xi_1 - \alpha = 0, \quad (iii)\\ &\alpha^{T}\big(-(B\omega_1 + e_2 b_1) + \xi_1 - e_2\big) = 0, \quad (iv)\\ &\alpha \ge 0. \quad (v) \end{aligned} $$
Combining (i) and (ii), we get
$$ \begin{bmatrix} A^{T} \\ e_1^{T} \end{bmatrix} F \begin{bmatrix} A & e_1 \end{bmatrix}\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + \begin{bmatrix} B^{T} \\ e_2^{T} \end{bmatrix}\alpha + C_3\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} = 0. $$
Define
$$ Z_1 = \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix}, \qquad H = \begin{bmatrix} A & e_1 \end{bmatrix}, \qquad E = \begin{bmatrix} B & e_2 \end{bmatrix}. $$
Thus, we can get
$$ H^{T} F H Z_1 + E^{T}\alpha + C_3 Z_1 = 0, \qquad Z_1 = [\omega_1, b_1]^{T} = -(H^{T} F H + C_3 I)^{-1} E^{T}\alpha. $$
At the same time, we can get ξ1 = σ²(C1Ω1)⁻¹α. Therefore, the Lagrange function can be rewritten as
$$ L(\omega_1, b_1, \xi_1, \alpha) = \frac{1}{2}(H Z_1)^{T} F (H Z_1) + \frac{C_1}{2\sigma^2}\Big(\frac{\sigma^2}{C_1}\Omega_1^{-1}\alpha\Big)^{T}\Omega_1\Big(\frac{\sigma^2}{C_1}\Omega_1^{-1}\alpha\Big) - \alpha^{T}\Big(-E Z_1 + \frac{\sigma^2}{C_1}\Omega_1^{-1}\alpha - e_2\Big) + \frac{C_3}{2}Z_1^{T} Z_1 = e_2^{T}\alpha - \frac{1}{2}\alpha^{T}\Big(E (H^{T} F H + C_3 I)^{-1} E^{T} + \frac{\sigma^2}{C_1}\Omega_1^{-1}\Big)\alpha. $$
Therefore, the dual problem of (58) is as follows:
$$ \min_{\alpha}\ \frac{1}{2}\alpha^{T}\Big(E (H^{T} F H + C_3 I)^{-1} E^{T} + \frac{\sigma^2}{C_1}\Omega_1^{-1}\Big)\alpha - e_2^{T}\alpha \quad \text{s.t.}\quad 0 \le \alpha \le C_1 e_2. $$
Similarly, the dual problem of (59) is as follows:
$$ \min_{\beta}\ \frac{1}{2}\beta^{T}\Big(H (E^{T} K E + C_4 I)^{-1} H^{T} + \frac{\sigma^2}{C_2}\Omega_2^{-1}\Big)\beta - e_1^{T}\beta \quad \text{s.t.}\quad 0 \le \beta \le C_2 e_1. $$
According to the process of the above operation, we give the pseudo-code of the process, as shown in Algorithm 1.
Algorithm 1 Training WCTBSVM.
   Input: Training data A R m 1 × n and B R m 2 × n ; Parameters C i , ( i = 1 , 2 , 3 , 4 ) and
       ε i , ( i = 1 , 2 , 3 , 4 ) .
   Output: Z 1 * and Z 2 * ;
   Process:
   1. Initialize F R m 1 × m 1 and Ω 1 R m 2 × m 2 ; K R m 2 × m 2 and Ω 2 R m 1 × m 1 ;
   2. Solve the dual problems (65) and (66), derived from the KKT conditions, for α and β;
   3. Calculate Z 1 and Z 2 by
      Z 1 = −(HᵀFH + C3 I)⁻¹ Eᵀα
     and
      Z 2 = (EᵀKE + C4 I)⁻¹ Hᵀβ;
   4. Update the matrices Ω 1 , Ω 2 , F , and K by (40), (54) and (55).
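A compact Python sketch of this training loop for the first plane is given below (the second plane is symmetric). It is an illustrative reading of Algorithm 1 under the reconstruction above: function and variable names are ours, L-BFGS-B with box bounds stands in for whatever QP solver the authors used, and the sign of Z1 follows the KKT derivation above.

```python
import numpy as np
from scipy.optimize import minimize

def box_qp(G, q, ub):
    """Minimize 0.5*a^T G a - q^T a subject to 0 <= a <= ub.
    L-BFGS-B is used here as a simple stand-in for a proper QP solver."""
    n = len(q)
    res = minimize(lambda a: 0.5 * a @ G @ a - q @ a, np.zeros(n),
                   jac=lambda a: G @ a - q, method="L-BFGS-B",
                   bounds=[(0.0, ub)] * n)
    return res.x

def wctbsvm_plane1(A, B, C1, C3, p, eps1, sigma, n_iter=10):
    """Sketch of the WCTBSVM update for the first plane; illustrative, not the authors' code."""
    m1, m2 = A.shape[0], B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])          # H = [A, e1]
    E = np.hstack([B, np.ones((m2, 1))])          # E = [B, e2]
    e2 = np.ones(m2)
    F = np.eye(m1)                                # capped L2,p weights, initialised to I
    Om1 = np.eye(m2)                              # Welsch (half-quadratic) weights
    for _ in range(n_iter):
        M = np.linalg.inv(H.T @ F @ H + C3 * np.eye(H.shape[1]))
        G = E @ M @ E.T + (sigma ** 2 / C1) * np.linalg.inv(Om1)
        alpha = box_qp(G, e2, C1)                 # dual problem (65)
        z1 = -M @ E.T @ alpha                     # Z1 = -(H^T F H + C3 I)^{-1} E^T alpha
        # update the capped-norm weights f_ii on the positive samples
        r = np.abs(H @ z1)
        f = np.zeros(m1)
        ok = (r > 1e-12) & (r ** p < eps1)
        f[ok] = (p / 2.0) * r[ok] ** (p - 2)
        F = np.diag(f)
        # update the Welsch weights from the slacks xi1 = sigma^2 (C1 Omega1)^{-1} alpha
        xi1 = (sigma ** 2 / C1) * alpha / np.diag(Om1)
        Om1 = np.diag(np.exp(-xi1 ** 2 / (2.0 * sigma ** 2)) + 1e-12)
    return z1[:-1], z1[-1]                        # (w1, b1)
```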

3.2. FWCTBSVM

To reduce the computational complexity caused by the Welsch loss, we replace the inequality constraints in the WCTBSVM with equality constraints:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 = e_2, $$
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + C_2\sum_{i=1}^{m_1}\Big[1 - \exp\Big(-\frac{\xi_{2,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 = e_1. $$
Further, (67) and (68) can be rewritten as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 = e_2, $$
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 = e_1. $$
Substituting the equality constraints into the objective functions, we have
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\big(e_2 + (B\omega_1 + e_2 b_1)\big)^{T}\Omega_1\big(e_2 + (B\omega_1 + e_2 b_1)\big) + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), $$
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + \frac{C_2}{2\sigma^2}\big(e_1 - (A\omega_2 + e_1 b_2)\big)^{T}\Omega_2\big(e_1 - (A\omega_2 + e_1 b_2)\big) + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2). $$
Further, we can obtain:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}(A\omega_1 + e_1 b_1)^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), $$
$$ \min_{\omega_2, b_2}\ \frac{1}{2}(B\omega_2 + e_2 b_2)^{T} K (B\omega_2 + e_2 b_2) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2). $$
Setting the derivatives of (73) with respect to ω1 and b1 equal to zero gives:
$$ A^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2} B^{T}\Omega_1\big(e_2 + B\omega_1 + e_2 b_1\big) + \frac{C_3}{2\sigma^2}\omega_1 = 0, $$
$$ e_1^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2} e_2^{T}\Omega_1\big(e_2 + B\omega_1 + e_2 b_1\big) + \frac{C_3}{2\sigma^2} b_1 = 0. $$
Via (75) and (76), we have
$$ \frac{2\sigma^2}{C_1}\begin{bmatrix} A^{T} F A & A^{T} F e_1 \\ e_1^{T} F A & e_1^{T} F e_1 \end{bmatrix}\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + \begin{bmatrix} B^{T}\Omega_1 B & B^{T}\Omega_1 e_2 \\ e_2^{T}\Omega_1 B & e_2^{T}\Omega_1 e_2 \end{bmatrix}\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + \begin{bmatrix} B^{T}\Omega_1 e_2 \\ e_2^{T}\Omega_1 e_2 \end{bmatrix} + \frac{C_3}{C_1}I\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} = 0. $$
Furthermore, via (77) we can get
$$ \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} = -\left(\frac{2\sigma^2}{C_1}\begin{bmatrix} A^{T} F A & A^{T} F e_1 \\ e_1^{T} F A & e_1^{T} F e_1 \end{bmatrix} + \begin{bmatrix} B^{T}\Omega_1 B & B^{T}\Omega_1 e_2 \\ e_2^{T}\Omega_1 B & e_2^{T}\Omega_1 e_2 \end{bmatrix} + \frac{C_3}{C_1}I\right)^{-1}\begin{bmatrix} B^{T} \\ e_2^{T} \end{bmatrix}\Omega_1 e_2. $$
Let
$$ Z_1 = \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix}, \qquad H = \begin{bmatrix} A & e_1 \end{bmatrix}, \qquad E = \begin{bmatrix} B & e_2 \end{bmatrix}. $$
Thus, we can get
$$ Z_1 = -\Big(\frac{2\sigma^2}{C_1} H^{T} F H + E^{T}\Omega_1 E + \frac{C_3}{C_1}I\Big)^{-1} E^{T}\Omega_1 e_2. $$
Similarly, the formula (74) for ω2 and b2 is also convex. Setting its partial derivatives equal to zero, we have
$$ B^{T} K (B\omega_2 + e_2 b_2) - \frac{C_2}{2\sigma^2} A^{T}\Omega_2\big(e_1 - A\omega_2 - e_1 b_2\big) + \frac{C_4}{2\sigma^2}\omega_2 = 0, $$
$$ e_2^{T} K (B\omega_2 + e_2 b_2) - \frac{C_2}{2\sigma^2} e_1^{T}\Omega_2\big(e_1 - A\omega_2 - e_1 b_2\big) + \frac{C_4}{2\sigma^2} b_2 = 0. $$
Let
$$ Z_2 = \begin{bmatrix} \omega_2 \\ b_2 \end{bmatrix}, \qquad E = \begin{bmatrix} B & e_2 \end{bmatrix}, \qquad H = \begin{bmatrix} A & e_1 \end{bmatrix}. $$
Therefore, we have
$$ Z_2 = \Big(\frac{2\sigma^2}{C_2} E^{T} K E + H^{T}\Omega_2 H + \frac{C_4}{C_2}I\Big)^{-1} H^{T}\Omega_2 e_1. $$
According to the process of the above operation, we give the pseudo-code of the process, as shown in Algorithm 2.
Algorithm 2 Training FWCTBSVM.
   Input: Training data A R m 1 × n and B R m 2 × n ; Parameters C i , ( i = 1 , 2 , 3 , 4 ) and
       ε i , ( i = 1 , 2 , 3 , 4 ) .
  Output: Z 1 * and Z 2 * ;
  Process:
  1. Initialize F R m 1 × m 1 and Ω 1 R m 2 × m 2 ; K R m 2 × m 2 and Ω 2 R m 1 × m 1 ;
  2. Form the augmented matrices H = [A, e1] and E = [B, e2];
  3. Calculate Z 1 and Z 2 by the closed-form solutions
     Z 1 = −(2σ²/C1 · HᵀFH + EᵀΩ1E + (C3/C1) I)⁻¹ EᵀΩ1 e2
    and
     Z 2 = (2σ²/C2 · EᵀKE + HᵀΩ2H + (C4/C2) I)⁻¹ HᵀΩ2 e1;
  4. Update the matrices Ω 1 , Ω 2 , F , and K by (40), (54) and (55).
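Below is an analogous Python sketch of Algorithm 2, using the closed-form updates derived above instead of dual QPs, plus the standard twin-SVM decision rule (assign a point to the class whose hyperplane is nearer), which the paper does not spell out explicitly. All names are ours, and the slack estimates follow the equality constraints above; this is an illustrative reading, not the authors' implementation.

```python
import numpy as np

def capped_diag(M, z, p, eps):
    """Diagonal capped-L2,p weights for the rows of M against plane z."""
    r = np.abs(M @ z)
    d = np.zeros(len(r))
    mask = (r > 1e-12) & (r ** p < eps)
    d[mask] = (p / 2.0) * r[mask] ** (p - 2)
    return np.diag(d)

def fwctbsvm_planes(A, B, C1, C2, C3, C4, p, eps1, eps3, sigma, n_iter=10):
    """FWCTBSVM training sketch following Algorithm 2."""
    m1, m2 = A.shape[0], B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])           # H = [A, e1]
    E = np.hstack([B, np.ones((m2, 1))])           # E = [B, e2]
    e1, e2 = np.ones(m1), np.ones(m2)
    I = np.eye(H.shape[1])
    F, K = np.eye(m1), np.eye(m2)                  # capped-norm weights
    Om1, Om2 = np.eye(m2), np.eye(m1)              # Welsch weights
    for _ in range(n_iter):
        z1 = -np.linalg.solve((2*sigma**2/C1) * (H.T @ F @ H) + E.T @ Om1 @ E + (C3/C1)*I,
                              E.T @ Om1 @ e2)      # closed-form update for Z1 (sign per derivation)
        z2 = np.linalg.solve((2*sigma**2/C2) * (E.T @ K @ E) + H.T @ Om2 @ H + (C4/C2)*I,
                             H.T @ Om2 @ e1)       # closed-form update for Z2
        F, K = capped_diag(H, z1, p, eps1), capped_diag(E, z2, p, eps3)
        xi1 = e2 + E @ z1                          # slacks from the equality constraints
        xi2 = e1 - H @ z2
        Om1 = np.diag(np.exp(-xi1**2 / (2*sigma**2)))
        Om2 = np.diag(np.exp(-xi2**2 / (2*sigma**2)))
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def predict(x, w1, b1, w2, b2):
    """Assign x to the class whose hyperplane is closer (standard twin-SVM rule)."""
    d1 = abs(x @ w1 + b1) / np.linalg.norm(w1)
    d2 = abs(x @ w2 + b2) / np.linalg.norm(w2)
    return 1 if d1 <= d2 else -1
```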

3.3. Convergence Analysis

Lemma 1.
For any x, y ∈ R^n with y ≠ 0, if f(x) = ‖x‖₂ − ‖x‖₂²/(2‖y‖₂), then the inequality f(x) ≤ f(y) always holds.
Lemma 2.
For any non-zero vectors α, β, when 0 < p ≤ 2, the inequality
$$ \|\alpha\|_2^{p} - \frac{p}{2}\|\beta\|_2^{p-2}\|\alpha\|_2^{2} \le \|\beta\|_2^{p} - \frac{p}{2}\|\beta\|_2^{p-2}\|\beta\|_2^{2} $$
always holds.
Lemma 3.
For any u ∈ R, suppose h(u) satisfies:
(1)
h(u) ≥ 0, and h(0) = 0,
(2)
h(u) = h(−u),
(3)
h′(u) ≥ 0 for any u ≥ 0. Then there exists a convex function Ψ(s) such that $h(u) = \inf_{s>0}\big[\frac{1}{2} s u^{2} + \Psi(s)\big]$.
Theorem 2.
Denote the objective function value of (30) by V(ω1, ξ1). The sequence {V(ω1^k, ξ1^k), k = 1, 2, …, κ} generated by Algorithm 1 is convergent.
Proof. 
Let
$$ V(\omega_1, \xi_1) = \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) = \min_{z, \xi_{1,i}}\ \sum_{i=1}^{m_1}\min(\|e_i z\|_2^{p}, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2} z^{T} z, $$
where e_i = (x_i, 1) denotes the ith row of E, z = (ω1, b1)ᵀ, ‖ω1ᵀx_i + b1‖₂^p < ε1 and C1 Σ_{i=1}^{m2} [1 − exp(−ξ_{1,i}²/(2σ²))] < ε2. Therefore,
$$ V(z, \xi_1) = \min_{z, \xi_{1,i}}\ \sum_{i=1}^{m_1}\|e_i z\|_2^{p} + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2} z^{T} z. $$
Let
$$ V_1(z) = \min_{z}\ \sum_{i=1}^{m_1}\|e_i z\|_2^{p}, \qquad V_2(z, \xi_1) = C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2} z^{T} z. $$
Assuming that z^{(k+1)} is the solution at the (k+1)th iteration of the algorithm, then
$$ z^{(k+1)} = \arg\min_{z}\ \frac{1}{2}(E z)^{T} F^{(k)} (E z). $$
Therefore,
$$ \frac{1}{2}(E z^{(k+1)})^{T} F^{(k)} (E z^{(k+1)}) \le \frac{1}{2}(E z^{(k)})^{T} F^{(k)} (E z^{(k)}). $$
Based on Lemma 2, we can get:
$$ \frac{p}{2}\|E z^{k+1}\|_2^{2}\,\|E z^{k}\|_2^{p-2} \le \frac{p}{2}\|E z^{k}\|_2^{2}\,\|E z^{k}\|_2^{p-2} $$
and
$$ \|E z^{k+1}\|_2^{p} - \frac{p}{2}\|E z^{k}\|_2^{p-2}\|E z^{k+1}\|_2^{2} \le \|E z^{k}\|_2^{p} - \frac{p}{2}\|E z^{k}\|_2^{p-2}\|E z^{k}\|_2^{2}. $$
Combining (93) and (94), we have
$$ \|E z^{k+1}\|_2^{p} \le \|E z^{k}\|_2^{p}. $$
Therefore, V1(z) is convergent.
Next, we discuss the convergence of V2(z, ξ1). Let the function h(u) = 1 − exp(−u²), where u = ξ1/(√2 σ). By Lemma 3, there exists a convex function Ψ(s) such that
$$ h(u) = \inf_{s>0}\Big[\frac{1}{2} s u^{2} + \Psi(s)\Big], $$
and there is a minimizer s̄ that makes the equation hold:
$$ \inf_{s>0}\Big[\frac{1}{2} s u^{2} + \Psi(s)\Big] = \frac{1}{2}\bar{s} u^{2} + \Psi(\bar{s}), $$
where s̄ = 2 exp(−u²). Further, we get
$$ 1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big) = \inf_{s_i>0}\Big[\frac{s_i\,\xi_{1,i}^2}{4\sigma^2} + \Psi(s_i)\Big]. $$
Therefore, (88) is equivalent to
$$ V_2(z, \xi_1, s_i) = C_1\sum_{i=1}^{m_2}\Big\{\frac{s_i\,\xi_{1,i}^2}{4\sigma^2} + \Psi(s_i)\Big\} + \frac{C_3}{2} z^{T} z. $$
From (89), (99) and Lemma 3, we can get V2(z, ξ1, s_i) ≥ V2(z, ξ1) ≥ 0; thus, the sequence is bounded below. Suppose that z^k, ξ^k and s^k are obtained after k iterations; then
$$ V_2(z^{k}, \xi^{k}, s^{k}) \ge V_2(z^{k+1}, \xi^{k}, s^{k}), $$
$$ V_2(z^{k+1}, \xi^{k}, s^{k}) \ge V_2(z^{k+1}, \xi^{k+1}, s^{k}), $$
$$ V_2(z^{k+1}, \xi^{k+1}, s^{k}) \ge V_2(z^{k+1}, \xi^{k+1}, s^{k+1}). $$
Combining (100)–(102), we have
$$ V_2(z^{k}, \xi^{k}, s^{k}) \ge V_2(z^{k+1}, \xi^{k+1}, s^{k+1}). $$
Therefore, V2(z, ξ1) is convergent, and thus the whole sequence is convergent. □

4. Numerical Experiments

In this section, we conduct data experiments on the WCTBSVM and FWCTBSVM to test the classification performance of the two models. Meanwhile, both models are compared with the TSVM, TBSVM, CTSVM, and LSTSVM to further verify whether the classification performance of the WCTBSVM and FWCTBSVM is better. After performing the tests on the datasets, we performed a statistical analysis to further study the classification performance of the proposed models [28]. All of the data were normalized. All the experiments were implemented in MATLAB R2016a on an ASUS personal computer equipped with an Intel Core i7 processor (3 GHz) and 8 GB of memory.

4.1. Experimental Setting

How well an algorithm performs depends on the choice of parameters. We use the traditional accuracy index (ACC) to measure the performance of these algorithms, defined as follows:
$$ ACC = \frac{TP + TN}{TP + FN + TN + FP}, $$
where TP and TN denote true positives and true negatives, and FP and FN denote false positives and false negatives, respectively. The higher the value of the ACC is, the better the model is. The thresholding parameters ε_i were set to 10⁻⁵, the parameters c_i were searched from the set {10^i | i = −5, −4, …, 4, 5}, and the kernel parameter σ was searched from the set {10^i | i = −4, −3, …, 3, 4}. We performed a 10-fold cross-validation on each dataset; the models were trained and tested multiple times using different splits of the training and test data, thus overcoming the randomness of individual test results.
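For completeness, the accuracy measure can be computed directly from the confusion counts (a trivial sketch):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """ACC = (TP + TN) / (TP + FN + TN + FP), i.e. the fraction of correct predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    return (tp + tn) / (tp + fn + tn + fp)
```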
We generated 100 artificial data samples and divided them into two categories, one represented by + and the other by ∗. Because outliers can have some impact on classification performance, six outliers were introduced to compare the robustness of the TSVM, TBSVM, CTSVM, LSTSVM, WCTBSVM, and FWCTBSVM; four of the outliers were assigned evenly, two to the positive class and two to the negative class, as shown in Figure 4.
In order to compare the classification performance of the WCTBSVM and FWCTBSVM with that of the related algorithms, nine UCI datasets, listed in Table 1 (Vote, Balance, Cancer, German, WDBC, hepat, spectf, Pima, and Wholesale), were selected for the data experiments. Considering that noise tolerance is one of the criteria used to measure the robustness of an algorithm, we also study the models at different noise levels: if the classification accuracy changes smoothly with increasing noise, the corresponding algorithm has good noise resistance.

4.2. Experimental Results on Artificial Dataset with Gaussian Noise

Through experiments on the nine sets of UCI datasets, we obtain an even better classification performance and robustness of the WCTBSVM and FWCTBSVM. Therefore, in order to further explore the advantages of the WCTBSVM and FWCTBSVM, we will conduct experiments on artificial datasets. All experimental results are presented in Figure 5.
From Figure 5, we can see that the accuracies of the six models on an artificial dataset containing Gaussian noise vary considerably. The accuracies of the six models are, respectively: TSVM, 58.8%; TBSVM, 63.0%; LSTSVM, 61.6%; CTSVM, 77%; WCTBSVM, 81.3%; and FWCTBSVM, 83.6%. From the experimental data, the WCTBSVM and FWCTBSVM still have high accuracy on data with Gaussian noise, which further confirms the robustness of the L2,p-norm distance and the Welsch loss against the negative effect of Gaussian noise.

4.3. Experimental Results on the Employed Datasets without Gaussian Noise

To further test the classification performance of the model, we will test it experimentally on the UCI dataset. We perform a 10-fold cross validation on the UCI dataset. The dataset is randomly divided into ten subsets, nine of which are used as training sets and the remaining one is reserved as a test set. This process is repeated ten times. We use the average of the ten test results as the performance measure. All experimental results presented in Table 2 are based on optimal parameters and the average classification accuracy is denoted by ‘ACC’.
Based on the performance of the six models on the nine UCI datasets in Table 2, in the absence of Gaussian noise their accuracy order from high to low was WCTBSVM, FWCTBSVM, CTSVM, LSTSVM, TBSVM, and TSVM. The accuracies of the FWCTBSVM and WCTBSVM are not much different, but the FWCTBSVM runs even faster. Therefore, with the addition of the capped L2,p-norm and the Welsch loss function, the classification mechanism of the model is even better.

4.4. Experimental Results on the UCI Datasets with Gaussian Noise

In the UCI datasets experiment without adding noise, we obtained that the WCTBSVM and FWCTBSVM classify even better compared to the other four models. To further investigate the robustness of the WCTBSVM and FWCTBSVM, the six models will next be tested in the UCI datasets incorporating noise. The noise added is 10% and 30% Gaussian noise.
According to the experimental results in Table 3 and Table 4, the classification performance of the WCTBSVM and FWCTBSVM is still higher than that of the TSVM, TBSVM, LSTSVM, and CTSVM. It is also found that the classification performance of the models decreases as the noise level increases. In addition, the experimental results of Table 2, Table 3 and Table 4 show that the WCTBSVM and FWCTBSVM are more robust after adding the L2,p-norm and the Welsch loss term.
For a more intuitive observation of the accuracy of the WCTBSVM and FWCTBSVM under different noise levels, we draw line plots from the data of Table 2, Table 3 and Table 4 for the Cancer, Wholesale, German, spectf, hepat and Pima datasets to further show that the WCTBSVM and FWCTBSVM are more stable than the other models; this also shows that the L2,p-norm and the Welsch loss term make the model more stable. In this section, we randomly choose samples and contaminate their features by introducing Gaussian noise, which obeys a normal distribution N(0, τ). More specifically, for the training dataset X, we replace X with X + X̂, where X̂ is a noise matrix drawn from a normal distribution with zero mean and variance τ. We apply experiments with noise factors τ = 0, τ = 1, τ = 2, and τ = 3 on the different datasets. The experimental results are shown in Figure 6.

4.5. Statistical Analysis

After the six models were tested on the UCI datasets and artificial datasets, both the WCTBSVM and FWCTBSVM showed their accuracy. In order to see the classification of the six models on the datasets directly, we used the Friedman test [30,31] for the statistical analysis; this test is a convenient and robust method. We first calculated the average ranking and accuracy of the six algorithms on the nine datasets, and the results are shown in Table 5.
With the data in Table 5, we can calculate the Friedman statistics of the datasets containing 0%, 10%, and 30% Gaussian noise. The calculation formula is as follows:
$$ \chi_F^2 = \frac{12N}{k(k+1)}\Big[\sum_i R_i^2 - \frac{k(k+1)^2}{4}\Big]. $$
In this formula, k expresses the number of algorithms, N represents the number of UCI datasets, and R_i is the average rank of the ith algorithm over the datasets. In this paper, k = 6 and N = 9. According to (105), the Friedman statistics for the datasets with the different Gaussian noise proportions are, respectively, 36.54, 32.53, and 36.46. Based on the χ_F²-distribution with (k − 1) degrees of freedom, we have
$$ F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}. $$
Thus, according to (106), we can obtain the F_F values at the three noise proportions, which are 34.55, 20.87, and 34.15, respectively. Here F_F follows the F-distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom. With α = 0.1, by consulting the F-test threshold table we get F_α = 1.997. Obviously, F_F > F_α, so we reject the null hypothesis. From Table 5, we can see that the average accuracy rankings of the WCTBSVM and FWCTBSVM are lower (better) than those of the other four models, indicating that the classification performance of these two models is more outstanding.
To further compare the classification performance of the six models, we used the Nemenyi test. When the difference between the average ranks of two models exceeds the critical difference, the two models differ significantly; otherwise, the difference is not obvious. By checking the post-hoc test table, we get q_{α=0.1} = 2.326. Then, the critical difference (CD) is calculated by the following formula:
$$ CD = q_{\alpha=0.1}\sqrt{\frac{k(k+1)}{6N}} = 2.326 \times \sqrt{\frac{6 \times 7}{6 \times 9}} = 2.0513. $$
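The Friedman statistic, its F-distributed variant, and the critical difference can be reproduced from the average ranks in Table 5; the following sketch (with q_α fixed to the paper's value 2.326 for k = 6) recovers χ²_F = 36.54, F_F = 34.55, and CD ≈ 2.05 for the 0% noise case.

```python
import numpy as np

def friedman_statistics(avg_ranks, N):
    """Friedman chi-square, its F-distributed variant, and the Nemenyi critical difference."""
    k = len(avg_ranks)
    R = np.asarray(avg_ranks, dtype=float)
    chi2_F = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
    CD = 2.326 * np.sqrt(k * (k + 1) / (6.0 * N))   # q_alpha = 2.326 as used in the paper
    return chi2_F, F_F, CD

# average ranks from Table 5 (0% noise): TSVM, TBSVM, LSTSVM, CTSVM, WCTBSVM, FWCTBSVM
print(friedman_statistics([5.9, 4.8, 4.0, 3.1, 1.3, 1.6], N=9))
```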
Based on the CD values, we visualized the data for post-hoc tests, as shown in Figure 7.
From Figure 7, we can find that the WCTBSVM and FWCTBSVM classification performance is indeed better than the other four models. It is also found that the difference between the two models is less than the CD value, so it can be determined that the performance of the two models is not very different. Thus, the WCTBSVM and FWCTBSVM have better classification performance and robustness.

5. Conclusions and Future Directions

Based on the binary classification problem, a new TBSVM-type model, the WCTBSVM, is proposed in this paper. It introduces a bounded, smooth, non-convex Welsch loss term into the TBSVM model, and the relevant model variables are optimized iteratively using the half-quadratic (HQ) optimization algorithm to handle the Welsch loss term. Meanwhile, the capped L2,p-norm distance is introduced on the basis of the TBSVM, resulting in the WCTBSVM model, which is more generalized and robust than the TBSVM and thus improves the classification performance of the model. In order to reduce the time complexity and space complexity of the WCTBSVM, we derived the FWCTBSVM using least squares, which speeds up the operation of the model while maintaining the performance advantages of the WCTBSVM.
According to the theoretical basis, we conducted accuracy testing experiments on UCI datasets and an artificial dataset and found that the classification performance of the WCTBSVM is indeed better than that of the TSVM, TBSVM, LSTSVM, and CTSVM. To further determine the reliability of the WCTBSVM, we also performed a statistical test analysis, and the results still show that the classification performance of the WCTBSVM is more outstanding. In future work, we may apply the model to semi-supervised learning and other classification settings to further study its performance; how to extend our method to multi-view learning and multi-instance learning is also worthy of further study. Certainly, how to develop fast algorithms for our method is worth studying as well.

Author Contributions

Writing the first draft, H.W.; software, H.W.; running data analysis, H.W.; writing and editing, H.W. and J.M.; supervision, G.Y.; validation, G.Y. and J.M.; project management, G.Y.; writing—review, G.Y.; conceptualization, G.Y. and J.M.; methodology, J.M.; project administration, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Ningxia Provincial (No. 2022AAC03260, No. 2022AAC03235, No. 2021AAC03183), in part by the Key Research and Development Program of Ningxia (Introduction of Talents Project) (No. 2022BSB03046), in part by the Fundamental Research Funds for the Central Universities (No. 2021KYQD23, No. 2022XYZSX03, No. 2020KYQD41), in part by the National Natural Science Foundation of China (No. 11861002, No. 61907012), and in part by the Key Scientific Research Projects of North Minzu University (No. 2021JCYJ107).

Data Availability Statement

All of the benchmark datasets used in our numerical experiments are from the UCI Machine Learning Repository, and are available at http://archive.ics.uci.edu/ml/ (accessed on 5 May 2022).

Conflicts of Interest

The authors declare no conflict of interest. Informed consent was obtained from all individual participants included in the study. This paper does not contain any studies with human participants or animals performed by any of the authors.

References

  1. Liu, Y. A nonfunctional data transformation approach via kurtosis adjustment and its application to SVM classification. J. Phys. Conf. Ser. 2022, 2294, 012024. [Google Scholar] [CrossRef]
  2. Baesens, B.; Viaene, S.; Gestel, T.V.; Suykens, J.A.; Dedene, G.; Moor, B.D.; Vanthienen, J. Least squares support vector machine classifiers: An empirical evaluation. DTEW Res. Rep. 2000, 3, 1–16. [Google Scholar]
  3. Peng, X.; Xu, D. Twin support vector hypersphere (TSVH) classifier for pattern recognition. Neural Comput. Appl. 2014, 24, 1207–1220. [Google Scholar] [CrossRef]
  4. Rahulamathavan, Y.; Phan, R.C.; Veluru, S.; Cumanan, K.; Rajarajan, M. Privacy-Preserving Multi-Class Support Vector Machine for Outsourcing the Data Classification in Cloud. IEEE Trans. Dependable Secur. Comput. 2014, 11, 467–479. [Google Scholar] [CrossRef]
  5. Zitha, P.; Thango, B.A. On the study of induction motor fault identification using support vector machine algorithms. In Proceedings of the 2023 31st Southern African Universities Power Engineering Conference (SAUPEC), Johannesburg, South Africa, 24–26 January 2023; pp. 1–5. [Google Scholar]
  6. Zhu, G.; Liu, Y.; Mao, K.; Zhang, J.; Hua, B.; Li, S. An Improved SVM-Based Air-to-Ground Communication Scenario Identification Method Using Channel Characteristics. Symmetry 2022, 14, 1038. [Google Scholar] [CrossRef]
  7. Kumar, M.A.; Gopal, M. Least squares twin support vector machines for pattern classification. Expert Syst. Appl. 2009, 36, 7535–7543. [Google Scholar] [CrossRef]
  8. Shao, Y.; Zhang, C.; Wang, X.; Deng, N. Improvements on Twin Support Vector Machines. IEEE Trans. Neural Netw. 2011, 22, 962–968. [Google Scholar] [CrossRef]
  9. Ma, J.; Yang, L.; Sun, Q. Capped L1-norm distance metric-based fast robust twin bounded support vector machine. Neurocomputing 2020, 412, 295–311. [Google Scholar] [CrossRef]
  10. Ma, J. Capped L 1-norm distance metric-based fast robust twin extreme learning machine. Appl. Intell. 2020, 50, 3775–3787. [Google Scholar] [CrossRef]
  11. Ma, J.; Yang, L.; Sun, Q. Adaptive robust learning framework for twin support vector machine classification. Knowl. Based Syst. 2021, 211, 106536. [Google Scholar] [CrossRef]
  12. Yu, G.; Ma, J.; Xie, C. Hessian scatter regularized twin support vector machine for semi-supervised classification. Eng. Appl. Artif. Intell. 2023, 119, 105751. [Google Scholar] [CrossRef]
  13. Kumar, D.; Thakur, M. Weighted multicategory nonparallel planes SVM classifiers. Neurocomputing 2016, 211, 106–116. [Google Scholar] [CrossRef]
  14. Ke, T.; Zhang, L.; Ge, X.; Lv, H.; Li, M. Construct a robust least squares support vector machine based on Lp-norm and L∞-norm. Eng. Appl. Artif. Intell. 2021, 99, 104134. [Google Scholar] [CrossRef]
  15. Xie, X.; Sun, F.; Qian, J.; Guo, L.; Zhang, R.; Ye, X.; Wang, Z. Laplacian Lp norm least squares twin support vector machine. Pattern Recognit. 2023, 136, 109192. [Google Scholar] [CrossRef]
  16. Ye, Q.; Zhao, H.; Li, Z.; Yang, X.; Gao, S.; Yin, T.; Ye, N. L1-Norm Distance Minimization-Based Fast Robust Twin Support Vector k-Plane Clustering. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4494–4503. [Google Scholar] [CrossRef]
  17. Moosaei, H.; Ganaie, M.A.; Hladík, M.; Tanveer, M. Inverse free reduced universum twin support vector machine for imbalanced data classification. Neural Netw. Off. J. Int. Neural Netw. Soc. 2022, 157, 125–135. [Google Scholar] [CrossRef]
  18. Wang, C.; Ye, Q.; Luo, P.; Ye, N.; Fu, L. Robust capped L1-norm twin support vector machine. Neural Netw. Off. J. Int. Neural Netw. Soc. 2019, 114, 47–59. [Google Scholar] [CrossRef]
  19. Zheng, X.; Zhang, L.; Yan, L. CTSVM: A robust twin support vector machine with correntropy-induced loss function for binary classification problems. Inf. Sci. 2021, 559, 22–45. [Google Scholar] [CrossRef]
  20. Ma, X.; Ye, Q.; Yan, H. L2P-Norm Distance Twin Support Vector Machine. IEEE Access 2017, 5, 23473–23483. [Google Scholar] [CrossRef]
  21. Ma, X.; Liu, Y.; Ye, Q. P-Order L2-Norm Distance Twin Support Vector Machine. In Proceedings of the 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR), Nanjing, China, 26–29 November 2017; pp. 617–622. [Google Scholar]
  22. Yan, H.; Ye, Q.; Zhang, T.; Yu, D.; Yuan, X.; Xu, Y.; Fu, L. Least squares twin bounded support vector machines based on L1-norm distance metric for classification. Pattern Recognit. 2018, 74, 434–447. [Google Scholar] [CrossRef]
  23. Ke, J.; Gong, C.; Liu, T.; Zhao, L.; Yang, J.; Tao, D. Laplacian Welsch Regularization for Robust Semisupervised Learning. IEEE Trans. Cybern. 2020, 52, 164–177. [Google Scholar] [CrossRef] [PubMed]
  24. Tokgoz, E.; Trafalis, T.B. Mixed convexity & optimization of the SVM QP problem for nonlinear polynomial kernel maps. In Proceedings of the 5th WSEAS international conference on Computers 2011, Corfu Island, Greece, 15–17 July 2011. [Google Scholar]
  25. Xu, Z.; Lai, J.; Zhou, J.; Chen, H.; Huang, H.; Li, Z. Image Deblurring Using a Robust Loss Function. Circuits Syst. Signal Process. 2021, 41, 1704–1734. [Google Scholar] [CrossRef]
  26. Wang, Y.; Yang, L.; Ren, Q. A robust classification framework with mixture correntropy. Inf. Sci. 2019, 491, 306–318. [Google Scholar] [CrossRef]
  27. Yang, L.; Ding, G.; Yuan, C.; Zhang, M. Robust regression framework with asymmetrically analogous to correntropy-induced loss. Knowl. Based Syst. 2020, 191, 105211. [Google Scholar] [CrossRef]
  28. Song, C.; Liu, W.; Wang, Y. Facial expression recognition based on Hessian regularized support vector machine. In Proceedings of the ICIMCS ’13: Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service, Huangshan, China, 17–19 August 2013. [Google Scholar]
  29. Ren, Z.; Yang, L. Correntropy-based robust extreme learning machine for classification. Neurocomputing 2018, 313, 74–84. [Google Scholar] [CrossRef]
  30. Demar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
  31. Ma, J.; Yu, G. Lagrangian Regularized Twin Extreme Learning Machine for Supervised and Semi-Supervised Classification. Symmetry 2022, 14, 1186. [Google Scholar] [CrossRef]
Figure 1. Capped L1-norm loss and L1-norm loss.
Figure 2. Capped L2,p-norm loss and L1- and L2-norm loss.
Figure 3. Welsch loss with different σ.
Figure 4. Distribution of artificial datasets with Gaussian noise.
Figure 5. The classification results on the artificial datasets.
Figure 6. Accuracy of six algorithms via different noise factors.
Figure 7. Visualization of post-hoc tests.
Table 1. Characteristics of UCI Datasets.

Dataset      Samples   Attributes
Vote         432       16
Balance      267       4
Cancer       699       9
German       1000      24
WDBC         569       30
hepat        155       19
spectf       267       44
Pima         768       8
Wholesale    440       7
Table 2. Experimental results on UCI datasets without Gaussian noise (each cell: ACC (%) / training time (s)).

Datasets     TSVM             TBSVM            LSTBSVM          CTSVM             WCTBSVM          FWCTBSVM
Vote         94.53 / 1.085    94.76 / 1.625    94.56 / 0.125    95.00 / 6.204     95.48 / 3.368    95.13 / 0.062
Balance      92.57 / 0.9816   92.57 / 0.924    92.96 / 8.020    92.87 / 12.29     93.21 / 3.421    93.08 / 0.065
Cancer       94.21 / 5.283    94.94 / 2.947    95.48 / 0.116    95.75 / 14.31     96.43 / 1.483    95.94 / 1.127
German       73.80 / 5.709    74.80 / 8.893    74.62 / 1.018    75.50 / 22.68     76.88 / 8.021    76.00 / 0.159
Wholesale    82.79 / 1.244    83.72 / 2.421    85.35 / 0.065    86.51 / 5.656     88.47 / 3.076    90.93 / 0.058
WDBC         93.07 / 0.170    93.71 / 1.831    94.68 / 0.069    95.25 / 7.632     95.96 / 4.880    95.07 / 0.086
hepat        77.33 / 0.789    78.00 / 1.261    80.67 / 0.063    82.00 / 2.399     85.65 / 1.561    84.65 / 1.012
Pima         75.32 / 1.854    75.71 / 3.308    75.97 / 0.105    75.92 / 16.17     76.38 / 1.561    77.56 / 0.093
spectf       80.38 / 0.446    80.77 / 0.573    80.81 / 0.747    81.15 / 2.906     81.54 / 3.632    81.79 / 0.068
Table 3. Experimental results of UCI datasets with 10% Gaussian noise (each cell: ACC (%) / training time (s)).

Datasets     TSVM             TBSVM            LSTBSVM          CTSVM             WCTBSVM          FWCTBSVM
Vote         93.33 / 1.454    94.60 / 2.301    94.02 / 0.255    94.62 / 5.501     95.00 / 5.259    94.48 / 0.067
Balance      90.68 / 1.086    90.99 / 0.824    91.26 / 7.925    91.54 / 14.32     92.51 / 5.247    92.86 / 0.080
Cancer       93.06 / 4.356    93.71 / 3.568    93.48 / 0.211    94.65 / 20.16     95.43 / 5.283    95.04 / 0.096
German       71.82 / 5.194    71.30 / 8.233    72.82 / 2.110    73.48 / 23.15     74.88 / 8.615    74.60 / 0.154
Wholesale    79.77 / 0.869    80.26 / 1.352    82.33 / 0.103    84.94 / 5.052     85.46 / 4.113    85.93 / 0.062
WDBC         92.57 / 4.943    92.95 / 3.655    93.68 / 0.069    93.25 / 7.632     94.96 / 4.880    94.18 / 0.081
hepat        76.03 / 1.241    76.81 / 2.365    77.33 / 3.063    78.06 / 1.617     80.65 / 3.561    80.25 / 0.612
Pima         74.66 / 14.53    74.71 / 8.674    74.85 / 0.308    75.06 / 21.53     76.38 / 5.561    76.06 / 0.163
spectf       78.15 / 2.807    78.77 / 1.561    79.37 / 0.563    79.77 / 7.986     80.44 / 3.632    79.93 / 0.052
Table 4. Experimental results of UCI datasets with 30% Gaussian noise (each cell: ACC (%) / training time (s)).

Datasets     TSVM             TBSVM            LSTBSVM          CTSVM             WCTBSVM          FWCTBSVM
Vote         91.86 / 0.239    92.05 / 1.033    91.36 / 0.072    93.03 / 5.932     93.35 / 2.396    93.28 / 0.062
Balance      89.93 / 0.928    90.15 / 1.824    90.56 / 7.503    90.89 / 15.75     91.72 / 3.457    90.56 / 0.057
Cancer       92.12 / 7.041    92.75 / 3.448    93.33 / 0.103    93.87 / 23.17     94.43 / 5.732    94.04 / 0.082
German       69.21 / 2.194    70.93 / 8.655    70.49 / 0.172    71.37 / 24.52     72.88 / 8.365    72.60 / 0.256
Wholesale    76.06 / 5.413    78.59 / 1.519    79.31 / 0.206    80.35 / 11.052    82.86 / 5.113    82.69 / 0.103
WDBC         91.52 / 4.568    92.11 / 4.320    92.48 / 0.056    92.98 / 10.52     93.88 / 3.624    93.24 / 0.081
hepat        74.16 / 1.241    74.81 / 3.465    75.33 / 4.300    75.86 / 1.715     76.23 / 2.034    77.03 / 1.025
Pima         72.43 / 9.482    72.89 / 7.901    73.57 / 0.112    74.06 / 23.21     75.38 / 4.210    74.28 / 0.395
spectf       77.30 / 3.087    77.77 / 2.157    78.06 / 0.630    78.80 / 8.426     79.68 / 3.632    78.72 / 0.071
Table 5. Average accuracy and ranking of the six algorithm models on the UCI datasets with different noise proportions.

                TSVM    TBSVM   LSTBSVM  CTSVM   WCTBSVM  FWCTBSVM
Avg. ACC 0%     84.89   85.44   86.12    86.66   87.77    87.79
Avg. rank 0%    5.9     4.8     4.0      3.1     1.3      1.6
Avg. ACC 10%    83.34   83.78   84.35    85.04   86.19    85.92
Avg. rank 10%   5.8     4.6     4.1      3.1     1.3      1.8
Avg. ACC 30%    81.62   82.45   82.72    83.47   84.49    84.04
Avg. rank 30%   5.8     4.7     4.3      2.8     1.1      2.1