Article

Capped L2,p-Norm Metric Based on Robust Twin Support Vector Machine with Welsch Loss

School of Mathematics and Information Sciences, North Minzu University, Yinchuan 750021, China
*
Author to whom correspondence should be addressed.
Symmetry 2023, 15(5), 1076; https://doi.org/10.3390/sym15051076
Submission received: 24 March 2023 / Revised: 7 April 2023 / Accepted: 9 May 2023 / Published: 12 May 2023

Abstract:
A twin bounded support vector machine (TBSVM) is a phenomenon of symmetry that improves the performance of the traditional support vector machine classification algorithm. In this paper, we propose an improved model based on the TBSVM, called the Welsch loss with capped L2,p-norm distance metric robust twin bounded support vector machine (WCTBSVM). On the one hand, by introducing the capped L2,p-norm metric into the TBSVM, the problem of the non-sparse output of the regularization term is solved; thus, the generalization and robustness of the TBSVM are improved and the principle of structural risk minimization is realized. On the other hand, a bounded, smooth, and non-convex Welsch loss function is introduced to reduce the influence of noise, which further improves the classification performance of the TBSVM. We use a half-quadratic programming algorithm to solve the non-convexity of the model caused by the Welsch loss. Therefore, the WCTBSVM is more robust and effective in dealing with noise than the TBSVM. In addition, to reduce the time complexity and speed up the convergence of the algorithm, we construct a least squares version of the WCTBSVM, named the fast WCTBSVM (FWCTBSVM). Experimental results on both UCI and artificial datasets show that our models achieve better classification performance.

1. Introduction

The support vector machine (SVM) [1,2,3,4,5] is designed for the binary classification problem and, owing to its superior performance in text classification tasks, has become the development basis of many other classifiers in machine learning. The SVM is widely used in data classification and regression analysis problems, such as artificial intelligence, speech recognition, remote image analysis, financial management, etc. For example, scene recognition plays a crucial role in supporting cognitive communications of unmanned aerial vehicles (UAVs), and Zhu et al. [6] proposed an air-to-ground (A2G) scenario identification model based on the SVM to leverage scenario-dependent channel characteristics. The SVM is based on the principle of structural risk minimization, which gives it excellent generalization performance, and its basic idea is to maximize the margin between the two classes of data. Although the SVM is highly applicable to classification problems, it still faces some challenges in handling its learning tasks. To avoid over-fitting, the SVM is extended to the soft-margin SVM (C-SVM) [7] by introducing relaxation variables. The loss function of the C-SVM is the hinge loss, which is very sensitive to noise points in the sample. Moreover, the SVM needs to solve a quadratic programming problem (QPP) of complexity O(n^3) [7] during training. Therefore, many new schemes have been proposed to speed up the training process of the SVM, for example, decomposing a large QPP into multiple small QPPs, or building other variant models of the SVM. Among the more popularly accepted ones are the generalized eigenvalue proximal support vector machine (GEPSVM) [8] and the twin support vector machine (TSVM) [3,9,10,11,12]. The GEPSVM searches for two nonparallel hyperplanes by solving two related generalized eigenvalue problems. The TSVM divides a large QPP into two small QPPs; its principle, based on the GEPSVM, is to search for two nonparallel hyperplanes so that the positive (negative) samples are as close as possible to the positive class (negative class) hyperplane and as far from the other hyperplane as possible. Therefore, the TSVM runs much faster than the SVM in theory.
In the process of studying the TSVM, some scholars have found that the TSVM needs to solve two QPPs, which is not feasible for large-scale classification. In order to reduce the computational cost of the TSVM while maintaining its advantages, the least squares support vector machine (LSSVM) was first proposed by Suykens et al. [2,13,14]. The LSSVM is one of the important achievements in the field of machine learning in recent years; its training process follows the principle of risk minimization, replacing the inequality constraints with equality constraints and solving two sets of linear equations. Based on this, Tomar et al. extended the least squares twin support vector machine (LSTSVM) [7] to multi-class classification and proposed a multi-class least squares twin support vector machine (MLSTSVM). For multi-class problems, the MLSTSVM generates multiple hyperplanes, one plane per class, where the ith class is as close as possible to the ith hyperplane while being as far away from the other hyperplanes as possible. Xie et al. [15] proposed a novel Laplacian Lp-norm least squares twin support vector machine (Lap-LpLSTSVM); experimental results on both synthetic and real-world datasets show that the Lap-LpLSTSVM outperforms other state-of-the-art methods and can also deal with noisy datasets. Ye et al. proposed the information-weighted TWSVM classification method (WLTSVM) [16], which represents the compactness of intra-class samples and the discreteness of inter-class samples by using inter-class graphs. In addition, a new SVM variant was also proposed by Shao et al. [8], the twin bounded support vector machine (TBSVM) [9]. It adds a regularization term containing the L2-norm, realizes the principle of structural risk minimization, and improves the classification performance of the SVM. The TSVM is computed faster than the SVM, but the squared L2-norm distance used in the TSVM amplifies the effect of outliers, thus affecting the construction of the hyperplanes in the presence of noise. To enhance the classification performance of twin support vector machines (TSVMs) on imbalanced datasets, a reduced universum twin support vector machine for class imbalance learning (RUTSVM) has been proposed based on the universum TSVM framework. However, RUTSVM suffers from a key drawback in terms of the matrix inverse computation involved in solving its dual problem and finding classifiers; to address this issue, Moosaei et al. [17] proposed an improved version of RUTSVM called IRUTSVM. On the basis of the TSVM, Wang et al. proposed the capped L1-norm twin support vector machine (CTSVM) [18,19], thus eliminating the effect of some outliers and increasing robustness. From this we see that the L1-norm improves robustness more than the squared L2-norm. However, the capped L1-norm distance measures also show their drawbacks: when the outliers are large, they are neither convex nor smooth, making them difficult to optimize. For this problem, a robust twin support vector machine based on the L2,p-norm (pTSVM) [20,21,22] was recently proposed, which suppresses the effect of outliers better than the L1-norm and the squared L2-norm.
To further suppress the adverse effects of the outliers, we introduce a bounded, smooth, and non-convex Welsch loss [23,24,25]. The Welsch loss is the loss function of the Welsch estimator, one of the robust estimation methods. When the data error is normally distributed, it is comparable to the mean square error loss, but the Welsch loss is more robust when the error is non-normally distributed and caused by outliers. To improve the robustness and generalization performance of the TBSVM, this paper proposes a classifier that adds a Welsch loss term on the basis of the capped L2,p-norm: the capped L2,p-norm metric robust twin bounded support vector machine with Welsch loss (WCTBSVM). The WCTBSVM is more robust than the TBSVM, reducing the effect of outliers and improving classification performance. Beyond this, the fast WCTBSVM (FWCTBSVM) is proposed based on the idea of least squares; changing the inequality constraints into equality constraints accelerates the algorithm.
Combined with the above content, this article carries out the main work and mainly has the following points:
(1) By analyzing the characteristics of both the Welsch loss function and the capped L 2 , p -norm metric distance, a new robust learning algorithm based on a TBSVM is proposed as a WCTBSVM. Without loss of precision, a least squares version of WCTBSVM is constructed, named FWCTBSVM.
(2) The iterative algorithms of WCTBSVM and FWCTBSVM models are given, and their convergence is analyzed according to the characteristics of the models.
(3) Through experiments on UCI datasets and artificial datasets, we found that the WCTBSVM and FWCTBSVM have advantages over several other methods in terms of robustness and feasibility.
(4) To see the advantages of the WCTBSVM and FWCTBSVM more intuitively, we conducted a statistical analysis to further verify that the classification performance of the WCTBSVM and FWCTBSVM is stronger than that of the TSVM, TBSVM, LSTSVM, and CTSVM.
In Section 2, we introduce some related models, such as the TSVM, TBSVM, CTSVM, LSTSVM, and the capped L2,p-norm. In Section 3, we integrate the Welsch function into the model and then solve the model. A large number of data experiments are conducted to test the models in Section 4, and Section 5 is a summary of the full text.

2. Related Work

In this paper, we propose two models based on the following relevant content. We first make a simple introduction to four classification models: the twin support vector machine (TSVM), twin bounded support vector machine (TBSVM), capped L 1 -norm twin support vector machine (CTSVM), and least squares twin support vector machine (LSTSVM), as well as the capped L 2 , p -norm and Welsch loss functions to be introduced in the model of this paper [25].

2.1. TSVM

The TSVM was first proposed by Jayadeva et al. The basic theory is to produce two nonparallel classification planes, and keep each class close to the corresponding one, and far away from the other one. The TSVM differs from the SVM by dividing a convex quadratic programming problem into two convex quadratic programming problems; thus, speeding up the rate of classification identification.
We set up a training dataset ζ_l = {(x_i, y_i) | i = 1, 2, …, l} ∈ (R^n × Y)^l, where x_i ∈ R^n and y_i ∈ Y = {1, −1} denotes the class to which the ith point belongs. Class 1 and class −1 are referred to as positive samples and negative samples, respectively. The number of positive class samples in ζ_l is denoted by m1, and the number of negative class samples is denoted by m2, where l = m1 + m2. Let A ∈ R^{m1×n} represent all positive class samples and B ∈ R^{m2×n} represent all negative class samples. Thus, we determine two non-parallel hyperplanes:
$$ f_1(x) = \omega_1^{T} x + b_1 = 0 $$
and
$$ f_2(x) = \omega_2^{T} x + b_2 = 0 $$
where ω1, ω2 ∈ R^n are the normal vectors of the hyperplanes and b1, b2 ∈ R are the offsets of the hyperplanes.
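To make the data setup concrete, the following minimal NumPy sketch (ours, not the authors' code; names are illustrative) shows how a labeled training set is split into the positive-class matrix A and negative-class matrix B used throughout this section.

```python
import numpy as np

def split_classes(X, y):
    """Split labeled data into the positive-class matrix A (y == +1)
    and the negative-class matrix B (y == -1)."""
    A = X[y == 1]    # shape (m1, n)
    B = X[y == -1]   # shape (m2, n)
    return A, B

# toy usage
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
A, B = split_classes(X, y)
```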
Subsequently, we can obtain a model of the TSVM:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|_2^2 + C_1 e_2^{T}\xi_1 \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2,\ \ \xi_1 \ge 0 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|_2^2 + C_2 e_1^{T}\xi_2 \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1,\ \ \xi_2 \ge 0 $$
where C1 ≥ 0, C2 ≥ 0 are regularization parameters, e1 and e2 are vectors of ones of appropriate dimensions, and ξ1 and ξ2 are the slack vectors.
Introducing the Lagrange multipliers α and β , let H = [ A , e 1 ] and Z = [ B , e 2 ] , and we can get the dual problem:
$$ \min_{\alpha}\ \frac{1}{2}\alpha^{T} Z (H^{T}H)^{-1} Z^{T}\alpha - e_2^{T}\alpha \quad \text{s.t.}\quad 0 \le \alpha \le C_1 e_2 $$
and
$$ \min_{\beta}\ \frac{1}{2}\beta^{T} H (Z^{T}Z)^{-1} H^{T}\beta - e_1^{T}\beta \quad \text{s.t.}\quad 0 \le \beta \le C_2 e_1. $$
Finally, the two nonparallel hyperplanes can be obtained by solving the QPPs in formulas (5) and (6):
$$ [\omega_1, b_1]^{T} = -(H^{T}H)^{-1} Z^{T}\alpha, \qquad [\omega_2, b_2]^{T} = (Z^{T}Z)^{-1} H^{T}\beta. $$

2.2. TBSVM

To enhance the classification performance of the TSVM, a new TSVM model called the TBSVM was proposed by Shao et al. [8]. The TBSVM obtains two non-parallel hyperplanes for classification by solving two smaller QPPs. Thus, we can write the TBSVM as:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|_2^2 + C_1 e_2^{T}\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2,\ \ \xi_1 \ge 0 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|_2^2 + C_2 e_1^{T}\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1,\ \ \xi_2 \ge 0 $$
where ξ1 and ξ2 are the slack vectors, 0 is a zero vector, C1 ≥ 0, C2 ≥ 0, C3 ≥ 0 and C4 ≥ 0 are the regularization parameters, e1 and e2 are vectors of ones, and A and B are the same as in (3). Based on optimization theory and dual theory, let H = [A, e1] and Z = [B, e2]. We can then get the dual problems of (8) and (9) as follows:
$$ \min_{\alpha}\ \frac{1}{2}\alpha^{T} Z (H^{T}H + C_3 I)^{-1} Z^{T}\alpha - e_2^{T}\alpha \quad \text{s.t.}\quad 0 \le \alpha \le C_1 e_2 $$
and
$$ \min_{\beta}\ \frac{1}{2}\beta^{T} H (Z^{T}Z + C_4 I)^{-1} H^{T}\beta - e_1^{T}\beta \quad \text{s.t.}\quad 0 \le \beta \le C_2 e_1 $$
where α ∈ R^{m2} and β ∈ R^{m1} are Lagrange multipliers. Further, we can get the solutions of (10) and (11) as follows:
$$ [\omega_1, b_1]^{T} = -(H^{T}H + C_3 I)^{-1} Z^{T}\alpha, $$
$$ [\omega_2, b_2]^{T} = (Z^{T}Z + C_4 I)^{-1} H^{T}\beta. $$

2.3. CTSVM

Using the L 2 -norm in the TSVM increases the effect of the noises without reaching the structure of the optimized classification hyperplanes. Therefore, the capped L 1 -norm twin support vector machine (CTSVM) is introduced to increase the robustness to the noise, which helps reduce noise during the model training. The capped L 1 -norm is shown in Figure 1.
The CTSVM classifier is obtained by solving the following problems:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(|\omega_1^{T} x_i + b_1|, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\min(\xi_{1,i}, \varepsilon_2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2 $$
and
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(|\omega_2^{T} x_i + b_2|, \varepsilon_3) + C_2\sum_{i=1}^{m_1}\min(\xi_{2,i}, \varepsilon_4) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1 $$
where C1 > 0 and C2 > 0, e1 ∈ R^{m1} and e2 ∈ R^{m2} are vectors of ones, ε1, ε2, ε3, ε4 are the thresholding parameters, and ξ1 and ξ2 are the slack vectors. Here, we use the capped L1-norm to reduce the influence of the outliers: under the capped L1-norm loss function, when a data point is misclassified, the loss is at most ε.
We can reformulate the problems as the following approximate ones:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}(A\omega_1 + e_1 b_1)^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2}\xi_1^{T} Q\,\xi_1 \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}(B\omega_2 + e_2 b_2)^{T} K (B\omega_2 + e_2 b_2) + \frac{C_2}{2\sigma^2}\xi_2^{T} U\,\xi_2 \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1 $$
where F, Q, K and U are four diagonal matrices with diagonal elements
$$ f_i = \begin{cases} \dfrac{1}{|\omega_1^{T} x_i + b_1|}, & |\omega_1^{T} x_i + b_1| \le \varepsilon_1, \\ 0, & \text{otherwise}, \end{cases} \qquad q_i = \begin{cases} \dfrac{1}{\xi_{1,i}}, & \xi_{1,i} \le \varepsilon_2, \\ 0, & \text{otherwise}, \end{cases} $$
$$ k_i = \begin{cases} \dfrac{1}{|\omega_2^{T} x_i + b_2|}, & |\omega_2^{T} x_i + b_2| \le \varepsilon_3, \\ 0, & \text{otherwise}, \end{cases} \qquad u_i = \begin{cases} \dfrac{1}{\xi_{2,i}}, & \xi_{2,i} \le \varepsilon_4, \\ 0, & \text{otherwise}. \end{cases} $$
Based on optimization theory and dual theory, let H = [ A , e 1 ] and Z = [ B , e 2 ] . We can then get the dual problems of (16) and (17)
$$ \min_{\alpha}\ \frac{1}{2}\alpha^{T}\Big(Z (H^{T} F H)^{-1} Z^{T} + \frac{1}{C_1} Q^{-1}\Big)\alpha - e_2^{T}\alpha \quad \text{s.t.}\quad \alpha \ge 0 $$
and
$$ \min_{\beta}\ \frac{1}{2}\beta^{T}\Big(H (Z^{T} K Z)^{-1} H^{T} + \frac{1}{C_2} U^{-1}\Big)\beta - e_1^{T}\beta \quad \text{s.t.}\quad \beta \ge 0 $$
where α ∈ R^{m2} and β ∈ R^{m1} are Lagrange multipliers.

2.4. LSTSVM

The TSVM has great advantages over the SVM in the classification process, but it also magnifies the influence of noise and increases the computational difficulty when processing large amounts of data. However, the least squares twin support vector machine (LSTSVM) changes the inequality constraints into equality constraints, which avoids these problems when the TSVM deals with large amounts of data. Therefore, we have the LSTSVM as
$$ \min_{\omega_1, b_1}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|_2^2 + \frac{C_1}{2}\xi_1^{T}\xi_1 \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 = e_2 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|_2^2 + \frac{C_2}{2}\xi_2^{T}\xi_2 \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 = e_1 $$
where C1 > 0 and C2 > 0 represent regularization parameters, and ξ1 and ξ2 are slack vectors.
Furthermore, the optimization problems (22) and (23) can be rewritten as:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}\|A\omega_1 + e_1 b_1\|_2^2 + \frac{C_1}{2}\|e_2 + (B\omega_1 + e_2 b_1)\|_2^2 $$
and
$$ \min_{\omega_2, b_2}\ \frac{1}{2}\|B\omega_2 + e_2 b_2\|_2^2 + \frac{C_2}{2}\|e_1 - (A\omega_2 + e_1 b_2)\|_2^2. $$
Let H = [A, e1] and Z = [B, e2]. Setting the partial derivatives of (24) with respect to ω1 and b1, and of (25) with respect to ω2 and b2, equal to zero, we can get:
$$ [\omega_1, b_1]^{T} = -\Big(\frac{1}{C_1}H^{T}H + Z^{T}Z\Big)^{-1} Z^{T} e_2 $$
and
$$ [\omega_2, b_2]^{T} = \Big(\frac{1}{C_2}Z^{T}Z + H^{T}H\Big)^{-1} H^{T} e_1. $$
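As an illustration of these closed-form LSTSVM solutions, here is a small NumPy sketch. The function name and the small ridge term added for numerical stability are our own additions rather than part of the original formulation.

```python
import numpy as np

def lstsvm_planes(A, B, C1, C2, reg=1e-8):
    """Closed-form LSTSVM planes; returns (w1, b1), (w2, b2).
    reg is a small ridge term added only for numerical stability."""
    e1 = np.ones((A.shape[0], 1))
    e2 = np.ones((B.shape[0], 1))
    H = np.hstack([A, e1])          # H = [A, e1]
    Z = np.hstack([B, e2])          # Z = [B, e2]
    I = np.eye(H.shape[1])
    v1 = -np.linalg.solve(H.T @ H / C1 + Z.T @ Z + reg * I, Z.T @ e2)
    v2 = np.linalg.solve(Z.T @ Z / C2 + H.T @ H + reg * I, H.T @ e1)
    return (v1[:-1], v1[-1]), (v2[:-1], v2[-1])
```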

2.5. Capped L2,p-norm

It is well known that the squared L2-norm distance metric is used in most variant classifiers associated with the TSVM, but it is more sensitive to outliers: although the squared L2-norm is differentiable, the square term amplifies the negative effect of the outliers, thus decreasing the classification performance of the model. However, the L2,p-norm inhibits the negative effect of the outliers better than the L1-norm and the squared L2-norm, and when p = 1 the L2,p-norm becomes the L1-norm. Obviously, by setting an appropriate p, the capped L2,p-norm is more robust than the capped L2-norm, and related algorithms show that the capped L2,p-norm is robust to Gaussian noise.
For any vector a ∈ R^n and p ∈ [0, 2], the L2,p-norm and the capped L2,p-norm are defined as:
$$ f_1(a) = \Big(\sum_{i=1}^{n} a_i^2\Big)^{\frac{p}{2}}, \qquad f_2(a) = \min\Big(\Big(\sum_{i=1}^{n} a_i^2\Big)^{\frac{p}{2}},\ \varepsilon\Big), $$
where ε ≥ 0 is the thresholding parameter.
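A one-line implementation of the capped L2,p value of a vector, following the definition above (a sketch; names are illustrative):

```python
import numpy as np

def capped_l2p(a, p, eps):
    """Capped L2,p value of a vector a: min((sum_i a_i^2)^(p/2), eps)."""
    return min(np.sum(np.asarray(a) ** 2) ** (p / 2.0), eps)

# example: a large outlier vector is clipped at eps
print(capped_l2p([0.3, 0.4], p=1.0, eps=2.0))    # 0.5 (= ||a||_2)
print(capped_l2p([30.0, 40.0], p=1.0, eps=2.0))  # 2.0 (capped)
```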
From Figure 2 we can find that the capped L 2 , p -norm is more robust than the L 1 -norm and L 2 -norm; thus, making the classification performance of the model better.
Because the capped L2,p-norm is so robust, we introduce it into the model proposed in this paper to further improve the generalization and robustness of the TBSVM.

2.6. Welsch Regularization

In this paper, we focus on the Welsch loss function. It is a bounded, smooth, and non-convex loss, which is very robust to noise. It is defined as:
$$ V(\xi) = \frac{\sigma^2}{2}\Big[1 - \exp\Big(-\frac{\xi^2}{2\sigma^2}\Big)\Big] $$
where σ is a penalty parameter. Figure 3 shows the Welsch loss function V(ξ) under different values of σ, which change from 1 to 3 [23,26,27,28,29].
Through Figure 3, we found that the upper bound of the Welsch loss function increases and the convergence speed slows down as σ gradually increases. Thus, the impact of noise on the model during the training process is limited. Consequently, the Welsch loss can further enhance the robustness of the model.
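For reference, a direct implementation of the Welsch loss defined above; the saturation visible in Figure 3 is easy to verify numerically (a sketch, not the authors' code):

```python
import numpy as np

def welsch_loss(xi, sigma):
    """Welsch loss: (sigma^2 / 2) * (1 - exp(-xi^2 / (2 sigma^2))).
    Bounded above by sigma^2 / 2, so a single large residual contributes only a fixed amount."""
    xi = np.asarray(xi, dtype=float)
    return (sigma ** 2 / 2.0) * (1.0 - np.exp(-xi ** 2 / (2.0 * sigma ** 2)))

# the loss saturates: a residual of 100 costs barely more than one of 5
print(welsch_loss(5.0, sigma=1.0), welsch_loss(100.0, sigma=1.0))
```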

3. Main Contributions

To improve the generalization performance of the TBSVM, this paper proposes the WCTBSVM and FWCTBSVM for classification problems.

3.1. WCTBSVM

To suppress the adverse impact caused by the outliers, we incorporate the Welsch loss and the capped L2,p-norm distance metric into the framework of the TBSVM, which is represented as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2 $$
and
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + C_2\sum_{i=1}^{m_1}\Big[1 - \exp\Big(-\frac{\xi_{2,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1 $$
where C1, C2, C3, C4 > 0 are regularization parameters, and e1 ∈ R^{m1} and e2 ∈ R^{m2} are vectors of ones. The model is built on the TBSVM: the first term introduces a capped L2,p-norm that satisfies the principle of structural risk minimization and improves the generalization performance of the model; the second term introduces the Welsch loss function to reduce the influence of noise; the last term introduces the L2-norm to prevent over-fitting of the model.
Let
$$ R(\omega_1, b_1) = \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), \qquad Q(\omega_1, b_1) = C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big]. $$
Minimizing Q(ω1, b1) is then equivalent to maximizing
$$ \bar{Q}(\omega_1, b_1) = C_1\sum_{i=1}^{m_2}\exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big). $$
To facilitate the following derivations, we define a convex function g(v) = −v log(−v) + v, where v < 0. Based on conjugate function theory, we have
$$ \exp\Big(-\frac{\xi_1^2}{2\sigma^2}\Big) = \sup_{v<0}\Big(v\,\frac{\xi_1^2}{2\sigma^2} - g(v)\Big), $$
where the supremum is attained at
$$ v = -\exp\Big(-\frac{\xi_1^2}{2\sigma^2}\Big). $$
Thus, we can get
$$ \max_{\omega_1, b_1, v}\ M(\omega_1, b_1, v) = \sum_{i=1}^{m_2}\Big(v_i\,\frac{\xi_{1,i}^2}{2\sigma^2} - g(v_i)\Big) - R(\omega_1, b_1). $$
Using the half-quadratic optimization of (34), suppose that we have v^s, where the superscript s denotes the sth iteration, so that v can be written as:
$$ \max_{v_i^s<0}\ \sum_{i=1}^{m_2}\Big(v_i^s\,\frac{(\xi_{1,i}^s)^2}{2\sigma^2} - g(v_i^s)\Big), \qquad v_i^s = -\exp\Big(-\frac{(\xi_{1,i}^s)^2}{2\sigma^2}\Big). $$
Further, the optimization problem (30) can be rewritten as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2. $$
In a similar way, the optimization problem (31) can be rewritten as:
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1 $$
where Ω_j = diag(−v_{j,i}^s), j = 1, 2.
Theorem 1.
Let g(θ): R^n → R be a continuous non-convex function and suppose h(θ): R^n → Ξ is a map with range Ξ. We assume that there exists a concave function ḡ(u) defined on Ξ such that g(θ) = ḡ(h(θ)) holds. Under the above assumption, the non-convex function g(θ) can be expressed as:
$$ g(\theta) = \inf_{v \in \mathbb{R}^n}\big[v^{T} h(\theta) - g^{*}(v)\big]. $$
According to concave duality, g*(v) is the concave dual of ḡ(u), given as
$$ g^{*}(v) = \inf_{u}\big[v^{T} u - \bar{g}(u)\big]. $$
In addition, the minimum on the right-hand side of (41) is attained at
$$ v^{*} = \left.\frac{\partial \bar{g}(u)}{\partial u}\right|_{u = h(\theta)}. $$
Based on Theorem 1, we choose a concave function ḡ(θ): R → R such that, for arbitrary θ > 0,
$$ \bar{g}(\theta) = \min(\theta^{\frac{p}{2}}, \varepsilon). $$
Assuming that h(μ) = μ², we can get
$$ \min(\|\omega^{T} x_i + b\|_2^{p}, \varepsilon) = \bar{g}(h(\mu)), $$
where μ = ‖ωᵀx_i + b‖₂. Based on (30), (45) can be rewritten as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\bar{g}(\|\omega_1^{T} x_i + b_1\|_2^2) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2. $$
Similarly, based on (31) and (45), we can get
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\bar{g}(\|\omega_2^{T} x_i + b_2\|_2^2) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1. $$
Let θ1 = h(μ1) = ‖ω1ᵀx_i + b1‖₂². Via Theorem 1, the first term of (30) can be expressed as:
$$ \min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) = \bar{g}(\|\omega_1^{T} x_i + b_1\|_2^2) = \inf_{f_{ii}\ge 0}\big(f_{ii}\,h(\mu_1) - g^{*}(f_{ii})\big) = \inf_{f_{ii}\ge 0}\big(f_{ii}\,\theta_1 - g^{*}(f_{ii})\big). $$
Therefore, the concave dual function of ḡ(θ1) is
$$ g^{*}(f_{ii}) = \inf_{\theta_1}\big[f_{ii}\theta_1 - \bar{g}(\theta_1)\big] = \inf_{\theta_1}\begin{cases} f_{ii}\theta_1 - \theta_1^{\frac{p}{2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1, \\ f_{ii}\theta_1 - \varepsilon_1, & \theta_1^{\frac{p}{2}} \ge \varepsilon_1. \end{cases} $$
By optimizing θ1 in (49), we can get:
$$ g^{*}(f_{ii}) = \begin{cases} f_{ii}\big(\tfrac{2}{p}f_{ii}\big)^{\frac{2}{p-2}} - \big(\tfrac{2}{p}f_{ii}\big)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1, \\ f_{ii}\,\varepsilon_1^{\frac{2}{p}} - \varepsilon_1, & \theta_1^{\frac{p}{2}} \ge \varepsilon_1. \end{cases} $$
Therefore, the objective function (30) can be further written as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), $$
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\inf_{f_{ii}\ge 0} L_i(\omega_1, b_1, f_{ii}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), $$
$$ \min_{\omega_1, b_1, f_{ii}\ge 0}\ \sum_{i=1}^{m_1} L_i(\omega_1, b_1, f_{ii}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2). $$
Here, the first term of the objective (30) is rewritten as:
$$ L_i(\omega_1, b_1, f_{ii}, \varepsilon_1) = \begin{cases} f_{ii}\theta_1 - f_{ii}\big(\tfrac{2}{p}f_{ii}\big)^{\frac{2}{p-2}} + \big(\tfrac{2}{p}f_{ii}\big)^{\frac{p}{p-2}}, & \theta_1^{\frac{p}{2}} < \varepsilon_1, \\ f_{ii}\theta_1 - f_{ii}\,\varepsilon_1^{\frac{2}{p}} + \varepsilon_1, & \theta_1^{\frac{p}{2}} \ge \varepsilon_1. \end{cases} $$
The objective function (52) is solved by learning the optimal classifier via an alternating optimization algorithm. We calculate the gradient of the function ḡ(θ) with respect to θ as follows:
$$ \frac{\partial \bar{g}(\theta)}{\partial \theta} = \begin{cases} \frac{p}{2}\theta^{\frac{p}{2}-1}, & 0 < \theta < \varepsilon^{\frac{2}{p}}, \\ 0, & \theta > \varepsilon^{\frac{2}{p}}. \end{cases} $$
If θ1 = h(μ1) = ‖ω1ᵀx_i + b1‖₂², then fixing ω1 and b1 we can get:
$$ f_{ii} = \left.\frac{\partial \bar{g}(\theta_1)}{\partial \theta_1}\right|_{\theta_1 = \|\omega_1^{T} x_i + b_1\|_2^2} = \begin{cases} \frac{p}{2}\|\omega_1^{T} x_i + b_1\|_2^{p-2}, & 0 < \|\omega_1^{T} x_i + b_1\|_2^{p} < \varepsilon_1, \\ 0, & \text{else}. \end{cases} $$
In the same way, we can get
$$ k_{ii} = \left.\frac{\partial \bar{g}(\theta_2)}{\partial \theta_2}\right|_{\theta_2 = \|\omega_2^{T} x_i + b_2\|_2^2} = \begin{cases} \frac{p}{2}\|\omega_2^{T} x_i + b_2\|_2^{p-2}, & 0 < \|\omega_2^{T} x_i + b_2\|_2^{p} < \varepsilon_3, \\ 0, & \text{else}. \end{cases} $$
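The two weight updates above, together with the half-quadratic weights Ω introduced earlier, can be sketched as follows. This is our illustrative reading of the formulas (the Ω weights are stored as the positive values exp(−ξ_i²/(2σ²)) implied by v_i^s = −exp(·)), not the authors' implementation.

```python
import numpy as np

def capped_weights(H, z, p, eps):
    """Diagonal entries f_ii (or k_ii): (p/2)*|w^T x_i + b|^(p-2)
    when 0 < |w^T x_i + b|^p < eps, and 0 otherwise."""
    r = np.abs(H @ z).ravel()              # |w^T x_i + b| for each row of H = [A, e1]
    d = np.zeros_like(r)
    mask = (r > 1e-12) & (r ** p < eps)    # small floor avoids division by zero
    d[mask] = (p / 2.0) * r[mask] ** (p - 2)
    return np.diag(d)

def welsch_weights(xi, sigma):
    """Half-quadratic weights Omega = diag(exp(-xi_i^2 / (2 sigma^2)))."""
    return np.diag(np.exp(-np.ravel(xi) ** 2 / (2.0 * sigma ** 2)))
```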
When the variables f_ii and k_ii are fixed, in order to solve for the unknown quantities ω1, ω2 and b1, b2 of the classification model, the optimization problem (30) can be written as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1} f_{ii}\|\omega_1^{T} x_i + b_1\|_2^2 + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2. $$
In the same way, the optimization problem (31) can be written as:
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2} k_{ii}\|\omega_2^{T} x_i + b_2\|_2^2 + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1. $$
Let F = diag(f_{11}, f_{22}, …, f_{m_1 m_1}) be an m1 × m1 diagonal matrix and K = diag(k_{11}, k_{22}, …, k_{m_2 m_2}) be an m2 × m2 diagonal matrix. Then (30) can be written as:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}(A\omega_1 + e_1 b_1)^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 \ge e_2. $$
Similarly, (31) can be written as:
$$ \min_{\omega_2, b_2}\ \frac{1}{2}(B\omega_2 + e_2 b_2)^{T} K (B\omega_2 + e_2 b_2) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 \ge e_1. $$
The corresponding Lagrange function of the above optimization problem (58) can be written as:
$$ L(\omega_1, b_1, \xi_1, \alpha) = \frac{1}{2}(A\omega_1 + e_1 b_1)^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) - \alpha^{T}\big(-(B\omega_1 + e_2 b_1) + \xi_1 - e_2\big), $$
where α is a Lagrange multiplier. Differentiating the Lagrange function with respect to ω1, b1 and ξ1, we get the following Karush–Kuhn–Tucker conditions:
$$ \begin{aligned} &\frac{\partial L}{\partial \omega_1} = A^{T} F (A\omega_1 + e_1 b_1) + B^{T}\alpha + C_3\omega_1 = 0, \quad (i)\\ &\frac{\partial L}{\partial b_1} = e_1^{T} F (A\omega_1 + e_1 b_1) + e_2^{T}\alpha + C_3 b_1 = 0, \quad (ii)\\ &\frac{\partial L}{\partial \xi_1} = \frac{C_1}{\sigma^2}\Omega_1\xi_1 - \alpha = 0, \quad (iii)\\ &\alpha^{T}\big(-(B\omega_1 + e_2 b_1) + \xi_1 - e_2\big) = 0, \quad (iv)\\ &\alpha \ge 0. \quad (v) \end{aligned} $$
Combining (i) and (ii), we get
$$ \begin{bmatrix} A^{T} \\ e_1^{T} \end{bmatrix} F \begin{bmatrix} A & e_1 \end{bmatrix}\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + \begin{bmatrix} B^{T} \\ e_2^{T} \end{bmatrix}\alpha + C_3\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} = 0. $$
Define
$$ Z_1 = \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix}, \qquad H = \begin{bmatrix} A & e_1 \end{bmatrix}, \qquad E = \begin{bmatrix} B & e_2 \end{bmatrix}. $$
Thus, we can get
$$ H^{T} F H Z_1 + E^{T}\alpha + C_3 Z_1 = 0, \qquad Z_1 = [\omega_1, b_1]^{T} = -(H^{T} F H + C_3 I)^{-1} E^{T}\alpha. $$
At the same time, we can get ξ1 = σ²(C1Ω1)⁻¹α. Therefore, the Lagrange function can be rewritten as
$$ L(\omega_1, b_1, \xi_1, \alpha) = \frac{1}{2}(H Z_1)^{T} F (H Z_1) + \frac{C_1}{2\sigma^2}\Big(\frac{\sigma^2}{C_1}\Omega_1^{-1}\alpha\Big)^{T}\Omega_1\Big(\frac{\sigma^2}{C_1}\Omega_1^{-1}\alpha\Big) - \alpha^{T}\Big(-E Z_1 + \frac{\sigma^2}{C_1}\Omega_1^{-1}\alpha - e_2\Big) + \frac{C_3}{2}Z_1^{T} Z_1 = e_2^{T}\alpha - \frac{1}{2}\alpha^{T}\Big(E (H^{T} F H + C_3 I)^{-1} E^{T} + \frac{\sigma^2}{C_1}\Omega_1^{-1}\Big)\alpha. $$
Therefore, the dual problem of (58) is as follows:
$$ \min_{\alpha}\ \frac{1}{2}\alpha^{T}\Big(E (H^{T} F H + C_3 I)^{-1} E^{T} + \frac{\sigma^2}{C_1}\Omega_1^{-1}\Big)\alpha - e_2^{T}\alpha \quad \text{s.t.}\quad 0 \le \alpha \le C_1 e_2. $$
Similarly, the dual problem of (59) is as follows:
$$ \min_{\beta}\ \frac{1}{2}\beta^{T}\Big(H (E^{T} K E + C_4 I)^{-1} H^{T} + \frac{\sigma^2}{C_2}\Omega_2^{-1}\Big)\beta - e_1^{T}\beta \quad \text{s.t.}\quad 0 \le \beta \le C_2 e_1. $$
According to the process of the above operation, we give the pseudo-code of the process, as shown in Algorithm 1.
Algorithm 1 Training WCTBSVM.
   Input: Training data A R m 1 × n and B R m 2 × n ; Parameters C i , ( i = 1 , 2 , 3 , 4 ) and
       ε i , ( i = 1 , 2 , 3 , 4 ) .
   Output: Z 1 * and Z 2 * ;
   Process:
   1. Initialize F R m 1 × m 1 and Ω 1 R m 2 × m 2 ; K R m 2 × m 2 and Ω 2 R m 1 × m 1 ;
   2. Solve the dual problems (65) and (66), derived from the KKT conditions, for α and β;
   3. Calculate Z 1 and Z 2 by
      Z 1 = −(HᵀFH + C3 I)⁻¹ Eᵀα
     and
      Z 2 = (EᵀKE + C4 I)⁻¹ Hᵀβ;
   4. Update the matrices Ω 1 , Ω 2 , F , and K by (40), (54) and (55).
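A compact Python sketch of this training loop for the first plane is given below (the second plane is symmetric). It is an illustrative reading of Algorithm 1 under the reconstruction above: function and variable names are ours, L-BFGS-B with box bounds stands in for whatever QP solver the authors used, and the sign of Z1 follows the KKT derivation above.

```python
import numpy as np
from scipy.optimize import minimize

def box_qp(G, q, ub):
    """Minimize 0.5*a^T G a - q^T a subject to 0 <= a <= ub.
    L-BFGS-B is used here as a simple stand-in for a proper QP solver."""
    n = len(q)
    res = minimize(lambda a: 0.5 * a @ G @ a - q @ a, np.zeros(n),
                   jac=lambda a: G @ a - q, method="L-BFGS-B",
                   bounds=[(0.0, ub)] * n)
    return res.x

def wctbsvm_plane1(A, B, C1, C3, p, eps1, sigma, n_iter=10):
    """Sketch of the WCTBSVM update for the first plane; illustrative, not the authors' code."""
    m1, m2 = A.shape[0], B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])          # H = [A, e1]
    E = np.hstack([B, np.ones((m2, 1))])          # E = [B, e2]
    e2 = np.ones(m2)
    F = np.eye(m1)                                # capped L2,p weights, initialised to I
    Om1 = np.eye(m2)                              # Welsch (half-quadratic) weights
    for _ in range(n_iter):
        M = np.linalg.inv(H.T @ F @ H + C3 * np.eye(H.shape[1]))
        G = E @ M @ E.T + (sigma ** 2 / C1) * np.linalg.inv(Om1)
        alpha = box_qp(G, e2, C1)                 # dual problem (65)
        z1 = -M @ E.T @ alpha                     # Z1 = -(H^T F H + C3 I)^{-1} E^T alpha
        # update the capped-norm weights f_ii on the positive samples
        r = np.abs(H @ z1)
        f = np.zeros(m1)
        ok = (r > 1e-12) & (r ** p < eps1)
        f[ok] = (p / 2.0) * r[ok] ** (p - 2)
        F = np.diag(f)
        # update the Welsch weights from the slacks xi1 = sigma^2 (C1 Omega1)^{-1} alpha
        xi1 = (sigma ** 2 / C1) * alpha / np.diag(Om1)
        Om1 = np.diag(np.exp(-xi1 ** 2 / (2.0 * sigma ** 2)) + 1e-12)
    return z1[:-1], z1[-1]                        # (w1, b1)
```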

3.2. FWCTBSVM

To reduce the computational complexity caused by the Welsch loss, we replace the inequality constraints in the WCTBSVM with equality constraints:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 = e_2, $$
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + C_2\sum_{i=1}^{m_1}\Big[1 - \exp\Big(-\frac{\xi_{2,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 = e_1. $$
Further, (67) and (68) can be rewritten as:
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) \quad \text{s.t.}\quad -(B\omega_1 + e_2 b_1) + \xi_1 = e_2, $$
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2) \quad \text{s.t.}\quad (A\omega_2 + e_1 b_2) + \xi_2 = e_1. $$
Substituting the equality constraints into the objective functions, we have
$$ \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + \frac{C_1}{2\sigma^2}\big(e_2 + (B\omega_1 + e_2 b_1)\big)^{T}\Omega_1\big(e_2 + (B\omega_1 + e_2 b_1)\big) + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), $$
$$ \min_{\omega_2, b_2}\ \sum_{i=1}^{m_2}\min(\|\omega_2^{T} x_i + b_2\|_2^{p}, \varepsilon_3) + \frac{C_2}{2\sigma^2}\big(e_1 - (A\omega_2 + e_1 b_2)\big)^{T}\Omega_2\big(e_1 - (A\omega_2 + e_1 b_2)\big) + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2). $$
Further, we can obtain:
$$ \min_{\omega_1, b_1}\ \frac{1}{2}(A\omega_1 + e_1 b_1)^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2}\xi_1^{T}\Omega_1\xi_1 + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2), $$
$$ \min_{\omega_2, b_2}\ \frac{1}{2}(B\omega_2 + e_2 b_2)^{T} K (B\omega_2 + e_2 b_2) + \frac{C_2}{2\sigma^2}\xi_2^{T}\Omega_2\xi_2 + \frac{C_4}{2}(\|\omega_2\|_2^2 + b_2^2). $$
Setting the derivatives of (73) with respect to ω1 and b1 equal to zero gives:
$$ A^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2} B^{T}\Omega_1\big(e_2 + B\omega_1 + e_2 b_1\big) + \frac{C_3}{2\sigma^2}\omega_1 = 0, $$
$$ e_1^{T} F (A\omega_1 + e_1 b_1) + \frac{C_1}{2\sigma^2} e_2^{T}\Omega_1\big(e_2 + B\omega_1 + e_2 b_1\big) + \frac{C_3}{2\sigma^2} b_1 = 0. $$
Via (75) and (76), we have
$$ \frac{2\sigma^2}{C_1}\begin{bmatrix} A^{T} F A & A^{T} F e_1 \\ e_1^{T} F A & e_1^{T} F e_1 \end{bmatrix}\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + \begin{bmatrix} B^{T}\Omega_1 B & B^{T}\Omega_1 e_2 \\ e_2^{T}\Omega_1 B & e_2^{T}\Omega_1 e_2 \end{bmatrix}\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} + \begin{bmatrix} B^{T}\Omega_1 e_2 \\ e_2^{T}\Omega_1 e_2 \end{bmatrix} + \frac{C_3}{C_1}I\begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} = 0. $$
Furthermore, via (77) we can get
$$ \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix} = -\left(\frac{2\sigma^2}{C_1}\begin{bmatrix} A^{T} F A & A^{T} F e_1 \\ e_1^{T} F A & e_1^{T} F e_1 \end{bmatrix} + \begin{bmatrix} B^{T}\Omega_1 B & B^{T}\Omega_1 e_2 \\ e_2^{T}\Omega_1 B & e_2^{T}\Omega_1 e_2 \end{bmatrix} + \frac{C_3}{C_1}I\right)^{-1}\begin{bmatrix} B^{T} \\ e_2^{T} \end{bmatrix}\Omega_1 e_2. $$
Let
$$ Z_1 = \begin{bmatrix} \omega_1 \\ b_1 \end{bmatrix}, \qquad H = \begin{bmatrix} A & e_1 \end{bmatrix}, \qquad E = \begin{bmatrix} B & e_2 \end{bmatrix}. $$
Thus, we can get
$$ Z_1 = -\Big(\frac{2\sigma^2}{C_1} H^{T} F H + E^{T}\Omega_1 E + \frac{C_3}{C_1}I\Big)^{-1} E^{T}\Omega_1 e_2. $$
Similarly, the formula (74) for ω2 and b2 is also convex. Setting its partial derivatives equal to zero, we have
$$ B^{T} K (B\omega_2 + e_2 b_2) - \frac{C_2}{2\sigma^2} A^{T}\Omega_2\big(e_1 - A\omega_2 - e_1 b_2\big) + \frac{C_4}{2\sigma^2}\omega_2 = 0, $$
$$ e_2^{T} K (B\omega_2 + e_2 b_2) - \frac{C_2}{2\sigma^2} e_1^{T}\Omega_2\big(e_1 - A\omega_2 - e_1 b_2\big) + \frac{C_4}{2\sigma^2} b_2 = 0. $$
Let
$$ Z_2 = \begin{bmatrix} \omega_2 \\ b_2 \end{bmatrix}, \qquad E = \begin{bmatrix} B & e_2 \end{bmatrix}, \qquad H = \begin{bmatrix} A & e_1 \end{bmatrix}. $$
Therefore, we have
$$ Z_2 = \Big(\frac{2\sigma^2}{C_2} E^{T} K E + H^{T}\Omega_2 H + \frac{C_4}{C_2}I\Big)^{-1} H^{T}\Omega_2 e_1. $$
According to the process of the above operation, we give the pseudo-code of the process, as shown in Algorithm 2.
Algorithm 2 Training FWCTBSVM.
   Input: Training data A R m 1 × n and B R m 2 × n ; Parameters C i , ( i = 1 , 2 , 3 , 4 ) and
       ε i , ( i = 1 , 2 , 3 , 4 ) .
  Output: Z 1 * and Z 2 * ;
  Process:
  1. Initialize F R m 1 × m 1 and Ω 1 R m 2 × m 2 ; K R m 2 × m 2 and Ω 2 R m 1 × m 1 ;
  2. Form the augmented matrices H = [A, e1] and E = [B, e2];
  3. Calculate Z 1 and Z 2 by the closed-form solutions
     Z 1 = −(2σ²/C1 · HᵀFH + EᵀΩ1E + (C3/C1) I)⁻¹ EᵀΩ1 e2
    and
     Z 2 = (2σ²/C2 · EᵀKE + HᵀΩ2H + (C4/C2) I)⁻¹ HᵀΩ2 e1;
  4. Update the matrices Ω 1 , Ω 2 , F , and K by (40), (54) and (55).
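Below is an analogous Python sketch of Algorithm 2, using the closed-form updates derived above instead of dual QPs, plus the standard twin-SVM decision rule (assign a point to the class whose hyperplane is nearer), which the paper does not spell out explicitly. All names are ours, and the slack estimates follow the equality constraints above; this is an illustrative reading, not the authors' implementation.

```python
import numpy as np

def capped_diag(M, z, p, eps):
    """Diagonal capped-L2,p weights for the rows of M against plane z."""
    r = np.abs(M @ z)
    d = np.zeros(len(r))
    mask = (r > 1e-12) & (r ** p < eps)
    d[mask] = (p / 2.0) * r[mask] ** (p - 2)
    return np.diag(d)

def fwctbsvm_planes(A, B, C1, C2, C3, C4, p, eps1, eps3, sigma, n_iter=10):
    """FWCTBSVM training sketch following Algorithm 2."""
    m1, m2 = A.shape[0], B.shape[0]
    H = np.hstack([A, np.ones((m1, 1))])           # H = [A, e1]
    E = np.hstack([B, np.ones((m2, 1))])           # E = [B, e2]
    e1, e2 = np.ones(m1), np.ones(m2)
    I = np.eye(H.shape[1])
    F, K = np.eye(m1), np.eye(m2)                  # capped-norm weights
    Om1, Om2 = np.eye(m2), np.eye(m1)              # Welsch weights
    for _ in range(n_iter):
        z1 = -np.linalg.solve((2*sigma**2/C1) * (H.T @ F @ H) + E.T @ Om1 @ E + (C3/C1)*I,
                              E.T @ Om1 @ e2)      # closed-form update for Z1 (sign per derivation)
        z2 = np.linalg.solve((2*sigma**2/C2) * (E.T @ K @ E) + H.T @ Om2 @ H + (C4/C2)*I,
                             H.T @ Om2 @ e1)       # closed-form update for Z2
        F, K = capped_diag(H, z1, p, eps1), capped_diag(E, z2, p, eps3)
        xi1 = e2 + E @ z1                          # slacks from the equality constraints
        xi2 = e1 - H @ z2
        Om1 = np.diag(np.exp(-xi1**2 / (2*sigma**2)))
        Om2 = np.diag(np.exp(-xi2**2 / (2*sigma**2)))
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])

def predict(x, w1, b1, w2, b2):
    """Assign x to the class whose hyperplane is closer (standard twin-SVM rule)."""
    d1 = abs(x @ w1 + b1) / np.linalg.norm(w1)
    d2 = abs(x @ w2 + b2) / np.linalg.norm(w2)
    return 1 if d1 <= d2 else -1
```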

3.3. Convergence Analysis

Lemma 1.
For any x, y ∈ R^n with y ≠ 0, if f(x) = ‖x‖₂ − ‖x‖₂²/(2‖y‖₂), then the inequality f(x) ≤ f(y) always holds.
Lemma 2.
For any non-zero vectors α, β, when 0 < p ≤ 2, the inequality
$$ \|\alpha\|_2^{p} - \frac{p}{2}\|\beta\|_2^{p-2}\|\alpha\|_2^{2} \le \|\beta\|_2^{p} - \frac{p}{2}\|\beta\|_2^{p-2}\|\beta\|_2^{2} $$
always holds.
Lemma 3.
For any u ∈ R, suppose h(u) satisfies:
(1)
h(u) ≥ 0, and h(0) = 0,
(2)
h(u) = h(−u),
(3)
h′(u) ≥ 0 for any u ≥ 0. Then there exists a convex function Ψ(s) such that $h(u) = \inf_{s>0}\big[\frac{1}{2} s u^{2} + \Psi(s)\big]$.
Theorem 2.
Denote the objective function value of (30) by V(ω1, ξ1). The sequence {V(ω1^k, ξ1^k), k = 1, 2, …, κ} generated by Algorithm 1 is convergent.
Proof. 
Let
$$ V(\omega_1, \xi_1) = \min_{\omega_1, b_1}\ \sum_{i=1}^{m_1}\min(\|\omega_1^{T} x_i + b_1\|_2^{p}, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2}(\|\omega_1\|_2^2 + b_1^2) = \min_{z, \xi_{1,i}}\ \sum_{i=1}^{m_1}\min(\|e_i z\|_2^{p}, \varepsilon_1) + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2} z^{T} z, $$
where e_i = (x_i, 1) denotes the ith row of E, z = (ω1, b1)ᵀ, ‖ω1ᵀx_i + b1‖₂^p < ε1 and C1 Σ_{i=1}^{m2} [1 − exp(−ξ_{1,i}²/(2σ²))] < ε2. Therefore,
$$ V(z, \xi_1) = \min_{z, \xi_{1,i}}\ \sum_{i=1}^{m_1}\|e_i z\|_2^{p} + C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2} z^{T} z. $$
Let
$$ V_1(z) = \min_{z}\ \sum_{i=1}^{m_1}\|e_i z\|_2^{p}, \qquad V_2(z, \xi_1) = C_1\sum_{i=1}^{m_2}\Big[1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big)\Big] + \frac{C_3}{2} z^{T} z. $$
Assuming that z^{(k+1)} is the solution at the (k+1)th iteration of the algorithm, then
$$ z^{(k+1)} = \arg\min_{z}\ \frac{1}{2}(E z)^{T} F^{(k)} (E z). $$
Therefore,
$$ \frac{1}{2}(E z^{(k+1)})^{T} F^{(k)} (E z^{(k+1)}) \le \frac{1}{2}(E z^{(k)})^{T} F^{(k)} (E z^{(k)}). $$
Based on Lemma 2, we can get:
$$ \frac{p}{2}\|E z^{k+1}\|_2^{2}\,\|E z^{k}\|_2^{p-2} \le \frac{p}{2}\|E z^{k}\|_2^{2}\,\|E z^{k}\|_2^{p-2} $$
and
$$ \|E z^{k+1}\|_2^{p} - \frac{p}{2}\|E z^{k}\|_2^{p-2}\|E z^{k+1}\|_2^{2} \le \|E z^{k}\|_2^{p} - \frac{p}{2}\|E z^{k}\|_2^{p-2}\|E z^{k}\|_2^{2}. $$
Combining (93) and (94), we have
$$ \|E z^{k+1}\|_2^{p} \le \|E z^{k}\|_2^{p}. $$
Therefore, V1(z) is convergent.
Next, we discuss the convergence of V2(z, ξ1). Let the function h(u) = 1 − exp(−u²), where u = ξ1/(√2 σ). By Lemma 3, there exists a convex function Ψ(s) such that
$$ h(u) = \inf_{s>0}\Big[\frac{1}{2} s u^{2} + \Psi(s)\Big], $$
and there is a minimizer s̄ that makes the equation hold:
$$ \inf_{s>0}\Big[\frac{1}{2} s u^{2} + \Psi(s)\Big] = \frac{1}{2}\bar{s} u^{2} + \Psi(\bar{s}), $$
where s̄ = 2 exp(−u²). Further, we get
$$ 1 - \exp\Big(-\frac{\xi_{1,i}^2}{2\sigma^2}\Big) = \inf_{s_i>0}\Big[\frac{s_i\,\xi_{1,i}^2}{4\sigma^2} + \Psi(s_i)\Big]. $$
Therefore, (88) is equivalent to
$$ V_2(z, \xi_1, s_i) = C_1\sum_{i=1}^{m_2}\Big\{\frac{s_i\,\xi_{1,i}^2}{4\sigma^2} + \Psi(s_i)\Big\} + \frac{C_3}{2} z^{T} z. $$
From (89), (99) and Lemma 3, we can get V2(z, ξ1, s_i) ≥ V2(z, ξ1) ≥ 0; thus, the sequence is bounded below. Suppose that z^k, ξ^k and s^k are obtained after k iterations; then
$$ V_2(z^{k}, \xi^{k}, s^{k}) \ge V_2(z^{k+1}, \xi^{k}, s^{k}), $$
$$ V_2(z^{k+1}, \xi^{k}, s^{k}) \ge V_2(z^{k+1}, \xi^{k+1}, s^{k}), $$
$$ V_2(z^{k+1}, \xi^{k+1}, s^{k}) \ge V_2(z^{k+1}, \xi^{k+1}, s^{k+1}). $$
Combining (100)–(102), we have
$$ V_2(z^{k}, \xi^{k}, s^{k}) \ge V_2(z^{k+1}, \xi^{k+1}, s^{k+1}). $$
Therefore, V2(z, ξ1) is convergent, and thus the whole sequence is convergent. □

4. Numerical Experiments

In this section, we conduct data experiments on the WCTBSVM and FWCTBSVM to test the classification performance of the two models. Meanwhile, both models are compared with the TSVM, TBSVM, CTSVM, and LSTSVM to further verify whether the classification performance of the WCTBSVM and FWCTBSVM is better. After performing the tests on the datasets, we performed a statistical analysis to further study the classification performance of the proposed models [28]. All of the data were normalized. All the experiments were implemented in MATLAB R2016a on an ASUS personal computer equipped with an Intel Core i7 processor (3 GHz) and 8 GB of memory.

4.1. Experimental Setting

How well an algorithm performs depends on the choice of parameters. We use the traditional accuracy index (ACC) to measure the performance of these algorithms, defined as follows:
$$ ACC = \frac{TP + TN}{TP + FN + TN + FP}, $$
where TP and TN denote true positives and true negatives, and FP and FN denote false positives and false negatives, respectively. The higher the value of the ACC is, the better the model is. The thresholding parameters ε_i were set to 10⁻⁵, the parameters c_i were searched from the set {10^i | i = −5, −4, …, 4, 5}, and the kernel parameter σ was searched from the set {10^i | i = −4, −3, …, 3, 4}. We performed a 10-fold cross-validation on each dataset; the models were trained and tested multiple times using different splits of the training and test data, thus overcoming the randomness of individual test results.
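For completeness, the accuracy measure can be computed directly from the confusion counts (a trivial sketch):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """ACC = (TP + TN) / (TP + FN + TN + FP), i.e. the fraction of correct predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    return (tp + tn) / (tp + fn + tn + fp)
```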
We generated 100 artificial data samples and divided them into two categories, one represented by + and the other by ∗. Because outliers can have some impact on classification performance, six outliers were introduced to compare the robustness of the TSVM, TBSVM, CTSVM, LSTSVM, WCTBSVM, and FWCTBSVM; four of the outliers were assigned evenly, two to the positive class and two to the negative class, as shown in Figure 4.
In order to compare the classification performance of the WCTBSVM and FWCTBSVM with that of the related algorithms, nine UCI datasets, listed in Table 1 (Vote, Balance, Cancer, German, WDBC, hepat, spectf, Pima, and Wholesale), were selected for the data experiments. Considering that noise tolerance is one of the criteria used to measure the robustness of an algorithm, we also study the models at different noise levels: if the classification accuracy changes smoothly with increasing noise, the corresponding algorithm has good noise resistance.

4.2. Experimental Results on Artificial Dataset with Gaussian Noise

Through experiments on the nine sets of UCI datasets, we obtain an even better classification performance and robustness of the WCTBSVM and FWCTBSVM. Therefore, in order to further explore the advantages of the WCTBSVM and FWCTBSVM, we will conduct experiments on artificial datasets. All experimental results are presented in Figure 5.
From Figure 5, we can see that the accuracies of the six models on an artificial dataset containing Gaussian noise vary considerably. The accuracies of the six models are, respectively: TSVM, 58.8%; TBSVM, 63.0%; LSTSVM, 61.6%; CTSVM, 77%; WCTBSVM, 81.3%; and FWCTBSVM, 83.6%. From the experimental data, the WCTBSVM and FWCTBSVM still have high accuracy on data with Gaussian noise, which further confirms the robustness of the L2,p-norm distance and the Welsch loss against the negative effect of Gaussian noise.

4.3. Experimental Results on the Employed Datasets without Gaussian Noise

To further test the classification performance of the model, we will test it experimentally on the UCI dataset. We perform a 10-fold cross validation on the UCI dataset. The dataset is randomly divided into ten subsets, nine of which are used as training sets and the remaining one is reserved as a test set. This process is repeated ten times. We use the average of the ten test results as the performance measure. All experimental results presented in Table 2 are based on optimal parameters and the average classification accuracy is denoted by ‘ACC’.
Based on the performance of the six models on the nine UCI datasets in Table 2, in the absence of Gaussian noise their accuracy order from high to low was WCTBSVM, FWCTBSVM, CTSVM, LSTSVM, TBSVM, and TSVM. The accuracies of the FWCTBSVM and WCTBSVM are not much different, but the FWCTBSVM runs even faster. Therefore, with the addition of the capped L2,p-norm and the Welsch loss function, the classification mechanism of the model is even better.

4.4. Experimental Results on the UCI Datasets with Gaussian Noise

In the UCI datasets experiment without adding noise, we obtained that the WCTBSVM and FWCTBSVM classify even better compared to the other four models. To further investigate the robustness of the WCTBSVM and FWCTBSVM, the six models will next be tested in the UCI datasets incorporating noise. The noise added is 10% and 30% Gaussian noise.
According to the experimental results in Table 3 and Table 4, the classification performance of the WCTBSVM and FWCTBSVM is still higher than that of the TSVM, TBSVM, LSTSVM, and CTSVM. It is also found that the classification performance of the models decreases as the noise level increases. In addition, the experimental results of Table 2, Table 3 and Table 4 show that the WCTBSVM and FWCTBSVM are more robust after adding the L2,p-norm and the Welsch loss term.
For a more intuitive observation of the accuracy of the WCTBSVM and FWCTBSVM under different noise levels, we draw line plots from the data of Table 2, Table 3 and Table 4 for the Cancer, Wholesale, German, spectf, hepat and Pima datasets to further show that the WCTBSVM and FWCTBSVM are more stable than the other models; this also shows that the L2,p-norm and the Welsch loss term make the model more stable. In this section, we randomly choose samples and contaminate their features by introducing Gaussian noise, which obeys a normal distribution N(0, τ). More specifically, for the training dataset X, we replace X with X + X̂, where X̂ is a noise matrix drawn from a normal distribution with zero mean and variance τ. We apply experiments with noise factors τ = 0, τ = 1, τ = 2, and τ = 3 on the different datasets. The experimental results are shown in Figure 6.

4.5. Statistical Analysis

After the six models were tested on the UCI datasets and artificial datasets, both the WCTBSVM and FWCTBSVM showed their accuracy. In order to see the classification of the six models on the datasets directly, we used the Friedman test [30,31] for the statistical analysis; this test is a convenient and robust method. We first calculated the average ranking and accuracy of the six algorithms on the nine datasets, and the results are shown in Table 5.
With the data in Table 5, we can calculate the Friedman statistics of the datasets containing 0%, 10%, and 30% Gaussian noise. The calculation formula is as follows:
$$ \chi_F^2 = \frac{12N}{k(k+1)}\Big[\sum_i R_i^2 - \frac{k(k+1)^2}{4}\Big]. $$
In this formula, k expresses the number of algorithms, N represents the number of UCI datasets, and R_i is the average rank of the ith algorithm over the datasets. In this paper, k = 6 and N = 9. According to (105), the Friedman statistics for the datasets with the different Gaussian noise proportions are, respectively, 36.54, 32.53, and 36.46. Based on the χ_F²-distribution with (k − 1) degrees of freedom, we have
$$ F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}. $$
Thus, according to (106), we can obtain the F_F values at the three noise proportions, which are 34.55, 20.87, and 34.15, respectively. Here F_F follows the F-distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom. With α = 0.1, by consulting the F-test threshold table we get F_α = 1.997. Obviously, F_F > F_α, so we reject the null hypothesis. From Table 5, we can see that the average accuracy rankings of the WCTBSVM and FWCTBSVM are lower (better) than those of the other four models, indicating that the classification performance of these two models is more outstanding.
To further compare the classification performance of the six models, we used the Nemenyi test. When the difference between the average ranks of two models exceeds the critical difference, the two models differ significantly; otherwise, the difference is not obvious. By checking the post-hoc test table, we get q_{α=0.1} = 2.326. Then, the critical difference (CD) is calculated by the following formula:
$$ CD = q_{\alpha=0.1}\sqrt{\frac{k(k+1)}{6N}} = 2.326 \times \sqrt{\frac{6 \times 7}{6 \times 9}} = 2.0513. $$
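The Friedman statistic, its F-distributed variant, and the critical difference can be reproduced from the average ranks in Table 5; the following sketch (with q_α fixed to the paper's value 2.326 for k = 6) recovers χ²_F = 36.54, F_F = 34.55, and CD ≈ 2.05 for the 0% noise case.

```python
import numpy as np

def friedman_statistics(avg_ranks, N):
    """Friedman chi-square, its F-distributed variant, and the Nemenyi critical difference."""
    k = len(avg_ranks)
    R = np.asarray(avg_ranks, dtype=float)
    chi2_F = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
    CD = 2.326 * np.sqrt(k * (k + 1) / (6.0 * N))   # q_alpha = 2.326 as used in the paper
    return chi2_F, F_F, CD

# average ranks from Table 5 (0% noise): TSVM, TBSVM, LSTSVM, CTSVM, WCTBSVM, FWCTBSVM
print(friedman_statistics([5.9, 4.8, 4.0, 3.1, 1.3, 1.6], N=9))
```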
Based on the CD values, we visualized the data for post-hoc tests, as shown in Figure 7.
From Figure 7, we can find that the WCTBSVM and FWCTBSVM classification performance is indeed better than the other four models. It is also found that the difference between the two models is less than the CD value, so it can be determined that the performance of the two models is not very different. Thus, the WCTBSVM and FWCTBSVM have better classification performance and robustness.

5. Conclusions and Future Directions

Based on the binary classification problem, a new TBSVM-type model, the WCTBSVM, is proposed in this paper. It introduces a bounded, smooth, non-convex Welsch loss term into the TBSVM model, and the relevant model variables are optimized iteratively using the half-quadratic (HQ) optimization algorithm to handle the Welsch loss term. Meanwhile, the capped L2,p-norm distance is introduced on the basis of the TBSVM, resulting in the WCTBSVM model, which is more generalized and robust than the TBSVM and thus improves the classification performance of the model. In order to reduce the time complexity and space complexity of the WCTBSVM, we derived the FWCTBSVM using least squares, which speeds up the operation of the model while maintaining the performance advantages of the WCTBSVM.
According to the theoretical basis, we conducted accuracy testing experiments on UCI datasets and an artificial dataset and found that the classification performance of the WCTBSVM is indeed better than that of the TSVM, TBSVM, LSTSVM, and CTSVM. To further determine the reliability of the WCTBSVM, we also performed a statistical test analysis, and the results still show that the classification performance of the WCTBSVM is more outstanding. In future work, we may apply the model to semi-supervised learning and other classification settings to further study its performance; how to extend our method to multi-view learning and multi-instance learning is also worthy of further study. Certainly, how to develop fast algorithms for our method is worth studying as well.

Author Contributions

Writing the first draft, H.W.; software, H.W.; running data analysis, H.W.; writing and editing, H.W. and J.M.; supervision, G.Y.; validation, G.Y. and J.M.; project management, G.Y.; writing—review, G.Y.; conceptualization, G.Y. and J.M.; methodology, J.M.; project administration, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of Ningxia Provincial (No. 2022AAC03260, No. 2022AAC03235, No. 2021AAC03183), in part by the Key Research and Development Program of Ningxia (Introduction of Talents Project) (No. 2022BSB03046), in part by the Fundamental Research Funds for the Central Universities (No. 2021KYQD23, No. 2022XYZSX03, No. 2020KYQD41), in part by the National Natural Science Foundation of China (No. 11861002, No. 61907012), and in part by the Key Scientific Research Projects of North Minzu University (No. 2021JCYJ107).

Data Availability Statement

All of the benchmark datasets used in our numerical experiments are from the UCI Machine Learning Repository, and are available at http://archive.ics.uci.edu/ml/ (accessed on 5 May 2022).

Conflicts of Interest

The authors declare no conflict of interest. Informed consent was obtained from all individual participants included in the study. This paper does not contain any studies with human participants or animals performed by any of the authors.

References

  1. Liu, Y. A nonfunctional data transformation approach via kurtosis adjustment and its application to SVM classification. J. Phys. Conf. Ser. 2022, 2294, 012024. [Google Scholar] [CrossRef]
  2. Baesens, B.; Viaene, S.; Gestel, T.V.; Suykens, J.A.; Dedene, G.; Moor, B.D.; Vanthienen, J. Least squares support vector machine classifiers: An empirical evaluation. DTEW Res. Rep. 2000, 3, 1–16. [Google Scholar]
  3. Peng, X.; Xu, D. Twin support vector hypersphere (TSVH) classifier for pattern recognition. Neural Comput. Appl. 2014, 24, 1207–1220. [Google Scholar] [CrossRef]
  4. Rahulamathavan, Y.; Phan, R.C.; Veluru, S.; Cumanan, K.; Rajarajan, M. Privacy-Preserving Multi-Class Support Vector Machine for Outsourcing the Data Classification in Cloud. IEEE Trans. Dependable Secur. Comput. 2014, 11, 467–479. [Google Scholar] [CrossRef]
  5. Zitha, P.; Thango, B.A. On the study of induction motor fault identification using support vector machine algorithms. In Proceedings of the 2023 31st Southern African Universities Power Engineering Conference (SAUPEC), Johannesburg, South Africa, 24–26 January 2023; pp. 1–5. [Google Scholar]
  6. Zhu, G.; Liu, Y.; Mao, K.; Zhang, J.; Hua, B.; Li, S. An Improved SVM-Based Air-to-Ground Communication Scenario Identification Method Using Channel Characteristics. Symmetry 2022, 14, 1038. [Google Scholar] [CrossRef]
  7. Kumar, M.A.; Gopal, M. Least squares twin support vector machines for pattern classification. Expert Syst. Appl. 2009, 36, 7535–7543. [Google Scholar] [CrossRef]
  8. Shao, Y.; Zhang, C.; Wang, X.; Deng, N. Improvements on Twin Support Vector Machines. IEEE Trans. Neural Netw. 2011, 22, 962–968. [Google Scholar] [CrossRef]
  9. Ma, J.; Yang, L.; Sun, Q. Capped L1-norm distance metric-based fast robust twin bounded support vector machine. Neurocomputing 2020, 412, 295–311. [Google Scholar] [CrossRef]
  10. Ma, J. Capped L 1-norm distance metric-based fast robust twin extreme learning machine. Appl. Intell. 2020, 50, 3775–3787. [Google Scholar] [CrossRef]
  11. Ma, J.; Yang, L.; Sun, Q. Adaptive robust learning framework for twin support vector machine classification. Knowl. Based Syst. 2021, 211, 106536. [Google Scholar] [CrossRef]
  12. Yu, G.; Ma, J.; Xie, C. Hessian scatter regularized twin support vector machine for semi-supervised classification. Eng. Appl. Artif. Intell. 2023, 119, 105751. [Google Scholar] [CrossRef]
  13. Kumar, D.; Thakur, M. Weighted multicategory nonparallel planes SVM classifiers. Neurocomputing 2016, 211, 106–116. [Google Scholar] [CrossRef]
  14. Ke, T.; Zhang, L.; Ge, X.; Lv, H.; Li, M. Construct a robust least squares support vector machine based on Lp-norm and L∞-norm. Eng. Appl. Artif. Intell. 2021, 99, 104134. [Google Scholar] [CrossRef]
  15. Xie, X.; Sun, F.; Qian, J.; Guo, L.; Zhang, R.; Ye, X.; Wang, Z. Laplacian Lp norm least squares twin support vector machine. Pattern Recognit. 2023, 136, 109192. [Google Scholar] [CrossRef]
  16. Ye, Q.; Zhao, H.; Li, Z.; Yang, X.; Gao, S.; Yin, T.; Ye, N. L1-Norm Distance Minimization-Based Fast Robust Twin Support Vector k-Plane Clustering. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4494–4503. [Google Scholar] [CrossRef]
  17. Moosaei, H.; Ganaie, M.A.; Hladík, M.; Tanveer, M. Inverse free reduced universum twin support vector machine for imbalanced data classification. Neural Netw. Off. J. Int. Neural Netw. Soc. 2022, 157, 125–135. [Google Scholar] [CrossRef]
  18. Wang, C.; Ye, Q.; Luo, P.; Ye, N.; Fu, L. Robust capped L1-norm twin support vector machine. Neural Netw. Off. J. Int. Neural Netw. Soc. 2019, 114, 47–59. [Google Scholar] [CrossRef]
  19. Zheng, X.; Zhang, L.; Yan, L. CTSVM: A robust twin support vector machine with correntropy-induced loss function for binary classification problems. Inf. Sci. 2021, 559, 22–45. [Google Scholar] [CrossRef]
  20. Ma, X.; Ye, Q.; Yan, H. L2P-Norm Distance Twin Support Vector Machine. IEEE Access 2017, 5, 23473–23483. [Google Scholar] [CrossRef]
  21. Ma, X.; Liu, Y.; Ye, Q. P-Order L2-Norm Distance Twin Support Vector Machine. In Proceedings of the 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR), Nanjing, China, 26–29 November 2017; pp. 617–622. [Google Scholar]
  22. Yan, H.; Ye, Q.; Zhang, T.; Yu, D.; Yuan, X.; Xu, Y.; Fu, L. Least squares twin bounded support vector machines based on L1-norm distance metric for classification. Pattern Recognit. 2018, 74, 434–447. [Google Scholar] [CrossRef]
  23. Ke, J.; Gong, C.; Liu, T.; Zhao, L.; Yang, J.; Tao, D. Laplacian Welsch Regularization for Robust Semisupervised Learning. IEEE Trans. Cybern. 2020, 52, 164–177. [Google Scholar] [CrossRef] [PubMed]
  24. Tokgoz, E.; Trafalis, T.B. Mixed convexity & optimization of the SVM QP problem for nonlinear polynomial kernel maps. In Proceedings of the 5th WSEAS international conference on Computers 2011, Corfu Island, Greece, 15–17 July 2011. [Google Scholar]
  25. Xu, Z.; Lai, J.; Zhou, J.; Chen, H.; Huang, H.; Li, Z. Image Deblurring Using a Robust Loss Function. Circuits Syst. Signal Process. 2021, 41, 1704–1734. [Google Scholar] [CrossRef]
  26. Wang, Y.; Yang, L.; Ren, Q. A robust classification framework with mixture correntropy. Inf. Sci. 2019, 491, 306–318. [Google Scholar] [CrossRef]
  27. Yang, L.; Ding, G.; Yuan, C.; Zhang, M. Robust regression framework with asymmetrically analogous to correntropy-induced loss. Knowl. Based Syst. 2020, 191, 105211. [Google Scholar] [CrossRef]
  28. Song, C.; Liu, W.; Wang, Y. Facial expression recognition based on Hessian regularized support vector machine. In Proceedings of the ICIMCS ’13: Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service, Huangshan, China, 17–19 August 2013. [Google Scholar]
  29. Ren, Z.; Yang, L. Correntropy-based robust extreme learning machine for classification. Neurocomputing 2018, 313, 74–84. [Google Scholar] [CrossRef]
  30. Demar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
  31. Ma, J.; Yu, G. Lagrangian Regularized Twin Extreme Learning Machine for Supervised and Semi-Supervised Classification. Symmetry 2022, 14, 1186. [Google Scholar] [CrossRef]
Figure 1. Capped L1-norm loss and L1-norm loss.
Figure 2. Capped L2,p-norm loss and L1- and L2-norm loss.
Figure 3. Welsch loss with different σ.
Figure 4. Distribution of artificial datasets with Gaussian noise.
Figure 5. The classification results on the artificial datasets.
Figure 6. Accuracy of six algorithms via different noise factors.
Figure 7. Visualization of post-hoc tests.
Table 1. Characteristics of UCI Datasets.

Dataset      Samples   Attributes
Vote         432       16
Balance      267       4
Cancer       699       9
German       1000      24
WDBC         569       30
hepat        155       19
spectf       267       44
Pima         768       8
Wholesale    440       7
Table 2. Experimental results on UCI datasets without Gaussian noise (each cell: ACC (%) / training time (s)).

Datasets     TSVM             TBSVM            LSTBSVM          CTSVM             WCTBSVM          FWCTBSVM
Vote         94.53 / 1.085    94.76 / 1.625    94.56 / 0.125    95.00 / 6.204     95.48 / 3.368    95.13 / 0.062
Balance      92.57 / 0.9816   92.57 / 0.924    92.96 / 8.020    92.87 / 12.29     93.21 / 3.421    93.08 / 0.065
Cancer       94.21 / 5.283    94.94 / 2.947    95.48 / 0.116    95.75 / 14.31     96.43 / 1.483    95.94 / 1.127
German       73.80 / 5.709    74.80 / 8.893    74.62 / 1.018    75.50 / 22.68     76.88 / 8.021    76.00 / 0.159
Wholesale    82.79 / 1.244    83.72 / 2.421    85.35 / 0.065    86.51 / 5.656     88.47 / 3.076    90.93 / 0.058
WDBC         93.07 / 0.170    93.71 / 1.831    94.68 / 0.069    95.25 / 7.632     95.96 / 4.880    95.07 / 0.086
hepat        77.33 / 0.789    78.00 / 1.261    80.67 / 0.063    82.00 / 2.399     85.65 / 1.561    84.65 / 1.012
Pima         75.32 / 1.854    75.71 / 3.308    75.97 / 0.105    75.92 / 16.17     76.38 / 1.561    77.56 / 0.093
spectf       80.38 / 0.446    80.77 / 0.573    80.81 / 0.747    81.15 / 2.906     81.54 / 3.632    81.79 / 0.068
Table 3. Experimental results of UCI datasets with 10% Gaussian noise (each cell: ACC (%) / training time (s)).

Datasets     TSVM             TBSVM            LSTBSVM          CTSVM             WCTBSVM          FWCTBSVM
Vote         93.33 / 1.454    94.60 / 2.301    94.02 / 0.255    94.62 / 5.501     95.00 / 5.259    94.48 / 0.067
Balance      90.68 / 1.086    90.99 / 0.824    91.26 / 7.925    91.54 / 14.32     92.51 / 5.247    92.86 / 0.080
Cancer       93.06 / 4.356    93.71 / 3.568    93.48 / 0.211    94.65 / 20.16     95.43 / 5.283    95.04 / 0.096
German       71.82 / 5.194    71.30 / 8.233    72.82 / 2.110    73.48 / 23.15     74.88 / 8.615    74.60 / 0.154
Wholesale    79.77 / 0.869    80.26 / 1.352    82.33 / 0.103    84.94 / 5.052     85.46 / 4.113    85.93 / 0.062
WDBC         92.57 / 4.943    92.95 / 3.655    93.68 / 0.069    93.25 / 7.632     94.96 / 4.880    94.18 / 0.081
hepat        76.03 / 1.241    76.81 / 2.365    77.33 / 3.063    78.06 / 1.617     80.65 / 3.561    80.25 / 0.612
Pima         74.66 / 14.53    74.71 / 8.674    74.85 / 0.308    75.06 / 21.53     76.38 / 5.561    76.06 / 0.163
spectf       78.15 / 2.807    78.77 / 1.561    79.37 / 0.563    79.77 / 7.986     80.44 / 3.632    79.93 / 0.052
Table 4. Experimental results of UCI datasets with 30% Gaussian noise (each cell: ACC (%) / training time (s)).

Datasets     TSVM             TBSVM            LSTBSVM          CTSVM             WCTBSVM          FWCTBSVM
Vote         91.86 / 0.239    92.05 / 1.033    91.36 / 0.072    93.03 / 5.932     93.35 / 2.396    93.28 / 0.062
Balance      89.93 / 0.928    90.15 / 1.824    90.56 / 7.503    90.89 / 15.75     91.72 / 3.457    90.56 / 0.057
Cancer       92.12 / 7.041    92.75 / 3.448    93.33 / 0.103    93.87 / 23.17     94.43 / 5.732    94.04 / 0.082
German       69.21 / 2.194    70.93 / 8.655    70.49 / 0.172    71.37 / 24.52     72.88 / 8.365    72.60 / 0.256
Wholesale    76.06 / 5.413    78.59 / 1.519    79.31 / 0.206    80.35 / 11.052    82.86 / 5.113    82.69 / 0.103
WDBC         91.52 / 4.568    92.11 / 4.320    92.48 / 0.056    92.98 / 10.52     93.88 / 3.624    93.24 / 0.081
hepat        74.16 / 1.241    74.81 / 3.465    75.33 / 4.300    75.86 / 1.715     76.23 / 2.034    77.03 / 1.025
Pima         72.43 / 9.482    72.89 / 7.901    73.57 / 0.112    74.06 / 23.21     75.38 / 4.210    74.28 / 0.395
spectf       77.30 / 3.087    77.77 / 2.157    78.06 / 0.630    78.80 / 8.426     79.68 / 3.632    78.72 / 0.071
Table 5. Average accuracy and ranking of the six algorithm models on the UCI datasets with different noise proportions.

                TSVM    TBSVM   LSTBSVM  CTSVM   WCTBSVM  FWCTBSVM
Avg. ACC 0%     84.89   85.44   86.12    86.66   87.77    87.79
Avg. rank 0%    5.9     4.8     4.0      3.1     1.3      1.6
Avg. ACC 10%    83.34   83.78   84.35    85.04   86.19    85.92
Avg. rank 10%   5.8     4.6     4.1      3.1     1.3      1.8
Avg. ACC 30%    81.62   82.45   82.72    83.47   84.49    84.04
Avg. rank 30%   5.8     4.7     4.3      2.8     1.1      2.1