Article

Kernel-Free Quadratic Surface Minimax Probability Machine for a Binary Classification Problem

College of Mathematics and Systems Science, Xinjiang University, Urumqi 830046, China
*
Author to whom correspondence should be addressed.
Symmetry 2021, 13(8), 1378; https://doi.org/10.3390/sym13081378
Submission received: 1 July 2021 / Revised: 24 July 2021 / Accepted: 26 July 2021 / Published: 28 July 2021

Abstract

In this paper, we propose a novel binary classification method called the kernel-free quadratic surface minimax probability machine (QSMPM), which makes use of the kernel-free techniques of the quadratic surface support vector machine (QSSVM) and inherits the advantage of the minimax probability machine (MPM) of having no parameters. Specifically, it attempts to find a quadratic hypersurface that separates two classes of samples with maximum probability. However, the optimization problem derived directly was too difficult to solve. Therefore, a nonlinear transformation was introduced to change the quadratic function involved into a linear function. Through such processing, our optimization problem finally became a second-order cone programming problem, which was solved efficiently by an alternate iteration method. It should be pointed out that our method is both kernel-free and parameter-free, making it easy to use. In addition, the quadratic hypersurface obtained by our method is allowed to be any general form of quadratic hypersurface and has better interpretability than the methods with kernel functions. Finally, in order to demonstrate the geometric interpretation of our QSMPM, experiments on five artificial datasets were carried out, showing, among other things, the ability to obtain a linear separating hyperplane. Furthermore, numerical experiments on benchmark datasets confirmed that the proposed method had better accuracy and less CPU time than corresponding methods.

1. Introduction

Machine learning is an important branch of artificial intelligence, with a wide range of applications in various fields of contemporary science [1]. With the development of machine learning, the classification problem has received extensive attention and study in pattern recognition [2], text classification [3], image processing [4], financial time series prediction [5], skin disease diagnosis [6], intrusion detection systems [7], etc. The classification problem is a vital task in supervised learning: a classification rule is learned from a training set with known labels and then used to assign a new sample to a class.
At present, there are many famous classification methods. Among these existing methods, Lanckriet et al. [8,9] proposed an excellent classifier, called the minimax probability machine (MPM). For a given binary classification problem, the MPM not only deals with it in the linear case, but also in the nonlinear case by the kernel trick. It is worth noting that the MPM does not have any parameters, which is an important advantage. Therefore, it has been widely used in computer vision [10], engineering technology [11,12], agriculture [13], and novelty detection [14]. Moreover, many researchers have proposed a variety of improved versions of the MPM from different perspectives [14,15,16,17,18,19,20,21,22,23,24,25]. The representative works can be briefly reviewed as follows. In [15], Strohmann and Grudic proposed MPM regression (MPMR), which transforms the regression problem into a classification problem and then uses the MPM classifier to obtain a regression function. To further exploit the structural information of the training set, Gu et al. [17] proposed the structural MPM (SMPM) by combining finite mixture models with the MPM. In addition, Yoshiyama et al. [21] proposed the Laplacian MPM (Lap-MPM), which improves the performance of the MPM in semisupervised learning. However, the nonlinear MPM using kernel techniques lacks interpretability and usually depends heavily on the choice of a proper kernel function and the corresponding kernel parameters. Furthermore, choosing the appropriate kernel function and adjusting its parameters may require much computational time and effort. Therefore, it naturally occurs to us that the study of a kernel-free nonlinear MPM is of great significance.
In 2008, Dagher [26] proposed the first kernel-free nonlinear classifier, namely the quadratic surface support vector machine (QSSVM). It is based on the maximum margin idea, and the training points are separated by a quadratic hypersurface without a kernel function, avoiding the time-consuming process of selecting the appropriate kernel function and its corresponding parameters. Furthermore, in order to improve the classification accuracy and robustness, Luo et al. [27] proposed the soft-margin quadratic surface support vector machine (SQSSVM). After that, Bai et al. [28] proposed the quadratic kernel-free least-squares support vector machine for target diseases classification. Following these leading works, some scholars performed further studies; see, e.g., [29,30,31,32,33,34] for the classification problem, [35] for the regression problem, and [36] for the clustering problem. The good performance of these methods demonstrates that the quadratic hypersurface is an effective way to flexibly capture the nonlinear structure of data. Thus, it is very interesting to study a kernel-free nonlinear MPM using the above kernel-free technique.
In this paper, for the binary classification problem, a new kernel-free nonlinear method is proposed, which is called the kernel-free quadratic surface minimax probability machine (QSMPM). It was constructed on the basis of the MPM by using the kernel-free techniques of the QSSVM. Specifically, it tries to seek a quadratic hypersurface that separates two classes of samples with maximum probability. However, the optimization problem derived directly was too difficult to solve. Therefore, a nonlinear transformation was introduced to change the quadratic function involved into a linear function. Through such processing, our optimization problem finally became a second-order cone programming problem, which was solved efficiently by an alternate iteration method. It is important to point out that our QSMPM addresses the following key issues. First, our method directly generates a nonlinear (quadratic) hypersurface without a kernel function, so there is no need to select an appropriate kernel. Second, our method does not need to choose any parameters. Third, the quadratic hypersurface obtained by our method has better interpretability than the ones obtained by methods with kernel functions. Fourth, it is rather flexible because the quadratic hypersurface obtained by our method can be any general form of quadratic hypersurface. In our experiments, the results on five artificial datasets showed that the proposed method can find general forms of the quadratic surface and also has the ability to obtain a linear separating hyperplane. Numerical experiments on 14 benchmark datasets verified that the proposed method was superior to corresponding methods in both accuracy and CPU time. What is more gratifying is that, when the number of samples or the dimension is relatively large, our method can quickly obtain good classification performance. In addition, the results of the Friedman test and Nemenyi post-hoc test indicated that our QSMPM was statistically the best one compared to the other methods.
The rest of this paper is organized as follows. Section 2 briefly reviews the related works, the QSSVM, and the MPM. Section 3 presents our method QSMPM, gives its algorithm, and analyzes the computational complexity of the QSMPM. In Section 4, we show the interpretability of our method. In Section 5, the results of the numerical experiments on the artificial datasets and benchmark datasets are presented, and a further statistical analysis is performed. Finally, Section 6 gives the conclusion and future work of this paper.
Throughout this paper, we use lower case letters to represent scalars, lower case bold letters to represent vectors, and upper case bold letters to represent matrices. $\mathbb{R}$ denotes the set of real numbers. $\mathbb{R}^d$ denotes the space of $d$-dimensional vectors. $\mathbb{R}^{d \times d}$ denotes the space of $d \times d$ matrices. $S^d$ denotes the set of $d \times d$ symmetric matrices. $S_+^d$ denotes the set of $d \times d$ symmetric positive semidefinite matrices. $I_d$ denotes the $d \times d$ identity matrix. $\|x\|_2$ denotes the two-norm of the vector $x$.

2. Related Work

In this section, we briefly introduce the QSSVM and the MPM. For a binary classification problem, the training set is given as:
$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{m_+ + m_-}, y_{m_+ + m_-})\}, \qquad (1)$$
where $x_i \in \mathbb{R}^d$ is the $i$-th sample and $y_i \in \{+1, -1\}$ is the corresponding class label, $i = 1, 2, \ldots, m_+ + m_-$. The numbers of samples in class +1 and class −1 are $m_+$ and $m_-$, respectively. For the training set (1), we want to find a hyperplane or quadratic hypersurface:
$$g(x) = 0, \qquad (2)$$
and then use a decision function:
$$f(x) = \mathrm{sign}(g(x)) \qquad (3)$$
to determine whether a new sample $x \in \mathbb{R}^d$ is assigned to class +1 or class −1.

2.1. Quadratic Surface Support Vector Machine

We first shortly outline the quadratic surface support vector machine (QSSVM) [26]. For the given training set (1), the goal of the QSSVM is to seek a quadratic separating hypersurface:
$$g(x) = \frac{1}{2} x^T A x + b^T x + c = 0, \qquad (4)$$
where $A \in S^d$, $b \in \mathbb{R}^d$, $c \in \mathbb{R}$, which separates the samples into two classes with the largest margin. In order to obtain the quadratic hypersurface (4), the QSSVM establishes the following optimization problem:
$$\min_{A, b, c} \ \sum_{i=1}^{m_+ + m_-} \|A x_i + b\|_2^2 \quad \mathrm{s.t.} \quad y_i \Big( \frac{1}{2} x_i^T A x_i + b^T x_i + c \Big) \geq 1, \ i = 1, \ldots, m_+ + m_-. \qquad (5)$$
The optimization problem (5) is a convex quadratic programming problem.
After obtaining the optimal solution $A^*$, $b^*$, and $c^*$ to the optimization problem (5), for a given new sample $x \in \mathbb{R}^d$, its label is assigned to either class +1 or class −1 by the decision function:
$$f(x) = \mathrm{sgn}\Big( \frac{1}{2} x^T A^* x + b^{*T} x + c^* \Big). \qquad (6)$$
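To make the construction concrete, the following minimal sketch shows how problem (5) could be set up with an off-the-shelf convex solver. It is an illustrative reconstruction, not the authors' implementation; it assumes NumPy and CVXPY are available and that the training data are quadratically separable (problem (5) is hard-margin).

```python
import numpy as np
import cvxpy as cp

def qssvm_fit(X, y):
    """Hard-margin QSSVM, problem (5): X is (n, d), y has entries +1/-1."""
    n, d = X.shape
    A = cp.Variable((d, d), symmetric=True)   # A in S^d
    b = cp.Variable(d)
    c = cp.Variable()
    # objective: sum_i ||A x_i + b||_2^2
    objective = cp.Minimize(sum(cp.sum_squares(A @ X[i] + b) for i in range(n)))
    # constraints: y_i (1/2 x_i^T A x_i + b^T x_i + c) >= 1
    constraints = [y[i] * (0.5 * X[i] @ A @ X[i] + X[i] @ b + c) >= 1
                   for i in range(n)]
    cp.Problem(objective, constraints).solve()
    return A.value, b.value, c.value

def qssvm_predict(X, A, b, c):
    # decision function (6)
    return np.sign(0.5 * np.einsum('ij,jk,ik->i', X, A, X) + X @ b + c)
```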
To allow some samples in the training set (1) to be misclassified, Luo et al. further proposed the soft-margin quadratic surface support vector machine (SQSSVM); please refer to [27].

2.2. Minimax Probability Machine

Now, we briefly review the minimax probability machine (MPM) [8,9]. Let us leave the training set (1) aside for a moment and suppose that these samples have some distribution. Specifically, assume that the samples in class +1 are drawn from a distribution with the mean vector $\mu_+ \in \mathbb{R}^d$ and the covariance matrix $\Sigma_+ \in S_+^d$, without making other specific distributional assumptions. A similar assumption is also given for the samples in class −1, with the mean vector $\mu_- \in \mathbb{R}^d$ and the covariance matrix $\Sigma_- \in S_+^d$. Denote the two distributions as $x_+ \sim (\mu_+, \Sigma_+)$ and $x_- \sim (\mu_-, \Sigma_-)$, respectively. Based on the above assumptions, the MPM attempts to obtain a separating hyperplane:
$$g(x) = w^T x - b = 0, \qquad (7)$$
where $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, which separates the two classes of samples with maximal probability with respect to all distributions having these mean vectors and covariance matrices. This is expressed as:
$$\max_{w, b, \alpha} \ \alpha \quad \mathrm{s.t.} \quad \inf_{x_+ \sim (\mu_+, \Sigma_+)} \Pr\{ w^T x_+ - b \geq 0 \} \geq \alpha, \quad \inf_{x_- \sim (\mu_-, \Sigma_-)} \Pr\{ w^T x_- - b \leq 0 \} \geq \alpha, \qquad (8)$$
where $\alpha \in (0, 1)$ represents the lower bound of the accuracy for future data, namely the worst-case accuracy. The infimum “$\inf$” is taken over all distributions having these mean vectors $\mu_\pm \in \mathbb{R}^d$ and covariance matrices $\Sigma_\pm \in S_+^d$.
The constraint condition of the above optimization problem (8) is the probabilistic constraint, which is difficult to solve. In order to convert the probabilistic constraints to easy, tractable constraints, the following lemma [9] is given:
Lemma 1
([9]). Let $x$ be a $d$-dimensional random vector with mean vector $\mu$ and covariance matrix $\Sigma$, where $\Sigma \in S_+^d$. Given $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, such that $w^T x \leq b$, and $\alpha \in (0, 1)$, the condition:
$$\inf_{x \sim (\mu, \Sigma)} \Pr\{ w^T x - b \leq 0 \} \geq \alpha \qquad (9)$$
holds if and only if:
$$b - w^T \mu \geq \kappa(\alpha) \sqrt{w^T \Sigma w}, \qquad (10)$$
where $\kappa(\alpha) = \sqrt{\dfrac{\alpha}{1 - \alpha}}$.
Using the above Lemma 1, the optimization problem (8) is equivalent to:
$$\max_{w, b, \alpha} \ \alpha \quad \mathrm{s.t.} \quad -b + w^T \mu_+ \geq \kappa(\alpha) \sqrt{w^T \Sigma_+ w}, \quad b - w^T \mu_- \geq \kappa(\alpha) \sqrt{w^T \Sigma_- w}. \qquad (11)$$
Then, through a series of algebraic operations (see Theorem 2 in [9], for the details), the above optimization problem (11) leads to:
$$\min_w \ \| \Sigma_+^{1/2} w \|_2 + \| \Sigma_-^{1/2} w \|_2 \quad \mathrm{s.t.} \quad w^T (\mu_+ - \mu_-) = 1. \qquad (12)$$
When its optimal solution $w^*$ is obtained, for the optimization problem (11), the optimal solution with respect to $b$ is given by:
$$b^* = w^{*T} \mu_- + \frac{\| \Sigma_-^{1/2} w^* \|_2}{\| \Sigma_+^{1/2} w^* \|_2 + \| \Sigma_-^{1/2} w^* \|_2}, \qquad (13)$$
or:
$$b^* = w^{*T} \mu_+ - \frac{\| \Sigma_+^{1/2} w^* \|_2}{\| \Sigma_+^{1/2} w^* \|_2 + \| \Sigma_-^{1/2} w^* \|_2}. \qquad (14)$$
Now, let us return to the training set (1). It is easy to see that the required mean vectors $\mu_\pm \in \mathbb{R}^d$ and covariance matrices $\Sigma_\pm \in S_+^d$ can be estimated from the training set (1) as follows:
$$\hat{\mu}_\pm = \frac{1}{m_\pm} \sum_{i=1}^{m_\pm} x_i \in \mathbb{R}^d, \qquad \hat{\Sigma}_\pm = \frac{1}{m_\pm} \sum_{i=1}^{m_\pm} (x_i - \hat{\mu}_\pm)(x_i - \hat{\mu}_\pm)^T \in S_+^d, \qquad (15)$$
where the sums are taken over the samples of the corresponding class.
Therefore, in practice, the mean vectors $\mu_\pm$ and covariance matrices $\Sigma_\pm$ in (12)–(14) should be replaced by $\hat{\mu}_\pm$ and $\hat{\Sigma}_\pm$, and the optimal solutions of $w$ and $b$ thus obtained are denoted as $\hat{w}^*$ and $\hat{b}^*$. Then, for a given new sample $x \in \mathbb{R}^d$, its label is assigned to either class +1 or class −1 by the decision function:
$$f(x) = \mathrm{sgn}( \hat{w}^{*T} x - \hat{b}^* ). \qquad (16)$$
In addition, for nonlinear cases and more details, please refer to [8,9].
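As an illustration of how the linear MPM can be trained in practice, the sketch below estimates the class statistics by (15), solves the second-order cone problem (12) with a generic convex solver, and recovers $b^*$ by (13). This is only a minimal sketch assuming NumPy, SciPy, and CVXPY; the original work [9] instead solves problems of this form by the alternate iteration recalled in Section 3.2.

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm

def mpm_fit(Xp, Xm):
    """Linear MPM: Xp holds class +1 samples (rows), Xm holds class -1 samples."""
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)           # means, (15)
    Sp = np.cov(Xp, rowvar=False, bias=True)                 # covariances, (15)
    Sm = np.cov(Xm, rowvar=False, bias=True)
    Sp_half, Sm_half = np.real(sqrtm(Sp)), np.real(sqrtm(Sm))
    w = cp.Variable(Xp.shape[1])
    # second-order cone problem (12)
    prob = cp.Problem(cp.Minimize(cp.norm(Sp_half @ w, 2) + cp.norm(Sm_half @ w, 2)),
                      [(mu_p - mu_m) @ w == 1])
    prob.solve()
    w_star = w.value
    beta = np.linalg.norm(Sp_half @ w_star)
    eta = np.linalg.norm(Sm_half @ w_star)
    b_star = w_star @ mu_m + eta / (beta + eta)              # formula (13)
    return w_star, b_star

def mpm_predict(X, w, b):
    return np.sign(X @ w - b)                                # decision rule (16)
```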

3. Kernel-Free Quadratic Surface Minimax Probability Machine

In this section, we first formulate the kernel-free quadratic surface minimax probability machine (QSMPM). Then, its algorithm is given.

3.1. Optimization Problem

For the binary classification problem with the training set (1), we attempt to find a quadratic separating hypersurface:
$$g(x) = \frac{1}{2} x^T A x + b^T x - c = 0, \qquad (17)$$
where $A \in S^d$, $b \in \mathbb{R}^d$, $c \in \mathbb{R}$, which separates the two classes of the samples. Inspired by the MPM, we construct the following optimization problem:
$$\max_{A, b, c, \alpha} \ \alpha \quad \mathrm{s.t.} \quad \inf_{x_+ \sim (\mu_+, \Sigma_+)} \Pr\Big\{ \frac{1}{2} x_+^T A x_+ + b^T x_+ - c \geq 0 \Big\} \geq \alpha, \quad \inf_{x_- \sim (\mu_-, \Sigma_-)} \Pr\Big\{ \frac{1}{2} x_-^T A x_- + b^T x_- - c \leq 0 \Big\} \geq \alpha, \qquad (18)$$
where $\alpha \in (0, 1)$ represents the lower bound of the accuracy for future data, namely the worst-case accuracy. The notation $x_+ \sim (\mu_+, \Sigma_+)$ refers to the class distribution that has the prescribed mean vector $\mu_+ \in \mathbb{R}^d$ and covariance matrix $\Sigma_+ \in S_+^d$, but is otherwise arbitrary, and likewise for $x_-$.
The above optimization problem (18) corresponds to the optimization problem (8), from which the optimization problem (11) was derived. Analogously, the optimization problem (18) should lead to a tractable counterpart. Unfortunately, Lemma 1 has no counterpart for the quadratic functions appearing in the curly braces of (18), so the probabilistic constraints cannot be converted directly. In order to overcome this difficulty, we turn the quadratic functions into a linear function by introducing a nonlinear transformation from $x = ([x]_1, [x]_2, \ldots, [x]_d)^T \in \mathbb{R}^d$ to:
$$z = z(x) = \Big( \frac{1}{2}[x]_1^2, [x]_1 [x]_2, \ldots, [x]_1 [x]_d, \frac{1}{2}[x]_2^2, \ldots, [x]_2 [x]_d, \ldots, \frac{1}{2}[x]_d^2, [x]_1, [x]_2, \ldots, [x]_d \Big)^T \in \mathbb{R}^{\frac{d^2 + 3d}{2}}. \qquad (19)$$
By representing the upper triangular entries of the symmetric matrix:
$$A = A^T = (a_{ij})_{d \times d} \in S^d \qquad (20)$$
as a vector:
$$a = (a_{11}, a_{12}, \ldots, a_{1d}, a_{22}, \ldots, a_{2d}, \ldots, a_{dd})^T \in \mathbb{R}^{\frac{d^2 + d}{2}}, \qquad (21)$$
and defining:
$$w = (a^T, b^T)^T \in \mathbb{R}^{\frac{d^2 + 3d}{2}}, \qquad (22)$$
the quadratic function (17) of $x$ in the $d$-dimensional space becomes a linear function of $z$ in the $\frac{d^2 + 3d}{2}$-dimensional space as follows:
$$g(x) = \frac{1}{2} x^T A x + b^T x - c = w^T z - c. \qquad (23)$$
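The transformation (19) and the identity (23) are straightforward to realize in code. The following NumPy sketch (an illustration, not the authors' code) builds $z(x)$ in the ordering of (19) and recovers $(A, b)$ from a weight vector $w$ ordered as in (20)–(22).

```python
import numpy as np

def quadratic_map(x):
    """Map x in R^d to z(x) in R^{(d^2+3d)/2} following (19)."""
    d = len(x)
    quad = []
    for i in range(d):
        quad.append(0.5 * x[i] * x[i])          # 1/2 [x]_i^2
        for j in range(i + 1, d):
            quad.append(x[i] * x[j])            # [x]_i [x]_j, i < j
    return np.concatenate([np.array(quad), x])  # quadratic part, then linear part

def unpack_A_b(w, d):
    """Recover the symmetric matrix A and the vector b from w = (a^T, b^T)^T."""
    n_a = d * (d + 1) // 2
    a, b = w[:n_a], w[n_a:]
    A = np.zeros((d, d))
    idx = 0
    for i in range(d):
        for j in range(i, d):
            A[i, j] = A[j, i] = a[idx]
            idx += 1
    return A, b

# sanity check of identity (23): 1/2 x^T A x + b^T x = w^T z(x)
rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
w = rng.standard_normal(d * (d + 3) // 2)
A, b = unpack_A_b(w, d)
assert np.isclose(0.5 * x @ A @ x + b @ x, w @ quadratic_map(x))
```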
Following the transformation (19), the training set (1) in the $d$-dimensional space correspondingly becomes:
$$\tilde{T} = \{(z_1, y_1), (z_2, y_2), \ldots, (z_{m_+ + m_-}, y_{m_+ + m_-})\}, \qquad (24)$$
where $z_i = z(x_i)$; in other words, $z_i \in \mathbb{R}^{\frac{d^2 + 3d}{2}}$ is obtained by replacing $x$ in the formula (19) with $x_i = ([x_i]_1, [x_i]_2, \ldots, [x_i]_d)^T \in \mathbb{R}^d$, $i = 1, 2, \ldots, m_+ + m_-$. For the training set (24), it is naturally assumed that the samples of the two classes are sampled from $z_+ \sim (\mu_{z+}, \Sigma_{z+})$ and $z_- \sim (\mu_{z-}, \Sigma_{z-})$, respectively, where the mean vectors $\mu_{z\pm} \in \mathbb{R}^{\frac{d^2 + 3d}{2}}$ and covariance matrices $\Sigma_{z\pm} \in S_+^{\frac{d^2 + 3d}{2}}$ can be estimated as:
$$\hat{\mu}_{z\pm} = \frac{1}{m_\pm} \sum_{i=1}^{m_\pm} z_i \in \mathbb{R}^{\frac{d^2 + 3d}{2}}, \qquad \hat{\Sigma}_{z\pm} = \frac{1}{m_\pm} \sum_{i=1}^{m_\pm} (z_i - \hat{\mu}_{z\pm})(z_i - \hat{\mu}_{z\pm})^T \in S_+^{\frac{d^2 + 3d}{2}}. \qquad (25)$$
Based on the transformation (19), the optimization problem (18) is replaced by:
$$\max_{w, c, \alpha} \ \alpha \quad \mathrm{s.t.} \quad \inf_{z_+ \sim (\hat{\mu}_{z+}, \hat{\Sigma}_{z+})} \Pr\{ w^T z_+ - c \geq 0 \} \geq \alpha, \quad \inf_{z_- \sim (\hat{\mu}_{z-}, \hat{\Sigma}_{z-})} \Pr\{ w^T z_- - c \leq 0 \} \geq \alpha. \qquad (26)$$
Now, Lemma 1 [9] is applicable to the optimization problem (26). Thus, we have:
$$\max_{w, c, \alpha} \ \alpha \quad \mathrm{s.t.} \quad -c + w^T \hat{\mu}_{z+} \geq \kappa(\alpha) \sqrt{w^T \hat{\Sigma}_{z+} w}, \quad c - w^T \hat{\mu}_{z-} \geq \kappa(\alpha) \sqrt{w^T \hat{\Sigma}_{z-} w}, \qquad (27)$$
where $\kappa(\alpha) = \sqrt{\frac{\alpha}{1 - \alpha}}$. Moreover, a series of algebraic operations shows that the above optimization problem (27) is equivalent to the following second-order cone programming problem:
$$\min_w \ \| \hat{\Sigma}_{z+}^{1/2} w \|_2 + \| \hat{\Sigma}_{z-}^{1/2} w \|_2 \quad \mathrm{s.t.} \quad w^T (\hat{\mu}_{z+} - \hat{\mu}_{z-}) = 1. \qquad (28)$$
When its optimal solution $w^*$ is obtained, for the optimization problem (27), the optimal solution with respect to $c$ is given by:
$$c^* = w^{*T} \hat{\mu}_{z-} + \frac{\| \hat{\Sigma}_{z-}^{1/2} w^* \|_2}{\| \hat{\Sigma}_{z+}^{1/2} w^* \|_2 + \| \hat{\Sigma}_{z-}^{1/2} w^* \|_2}, \qquad (29)$$
or:
$$c^* = w^{*T} \hat{\mu}_{z+} - \frac{\| \hat{\Sigma}_{z+}^{1/2} w^* \|_2}{\| \hat{\Sigma}_{z+}^{1/2} w^* \|_2 + \| \hat{\Sigma}_{z-}^{1/2} w^* \|_2}. \qquad (30)$$
In the next subsection, we show how to solve the optimization problem (28).

3.2. Algorithm

Now, we present the solving process of the optimization problem (28), which follows [9]. By constructing an orthogonal matrix $F \in \mathbb{R}^{\frac{d^2+3d}{2} \times \frac{d^2+3d-2}{2}}$ whose columns span the subspace of vectors orthogonal to $\hat{\mu}_{z+} - \hat{\mu}_{z-} \in \mathbb{R}^{\frac{d^2+3d}{2}}$, the unknown variable $w \in \mathbb{R}^{\frac{d^2+3d}{2}}$ is converted into $u \in \mathbb{R}^{\frac{d^2+3d-2}{2}}$. Specifically, letting $w = w_0 + F u$, where $w_0 = \frac{\hat{\mu}_{z+} - \hat{\mu}_{z-}}{\| \hat{\mu}_{z+} - \hat{\mu}_{z-} \|_2^2} \in \mathbb{R}^{\frac{d^2+3d}{2}}$, the optimization problem (28) is transformed into the unconstrained optimization problem:
$$\min_u \ \| \hat{\Sigma}_{z+}^{1/2} (w_0 + F u) \|_2 + \| \hat{\Sigma}_{z-}^{1/2} (w_0 + F u) \|_2. \qquad (31)$$
In order to solve the above optimization problem (31), Lanckriet et al. [9] introduced two extra variables $\beta$ and $\eta$ and considered the following optimization problem:
$$\min_{u, \beta, \eta} \ \beta + \frac{1}{\beta} \| \hat{\Sigma}_{z+}^{1/2} (w_0 + F u) \|_2^2 + \eta + \frac{1}{\eta} \| \hat{\Sigma}_{z-}^{1/2} (w_0 + F u) \|_2^2. \qquad (32)$$
This optimization problem (32) is solved by an alternate iteration. The variables are divided into two sets: one is $\beta$ and $\eta$, and the other is $u$. At the $t$-th iteration, first, fixing $\beta$ and $\eta$ and setting the derivative of the objective of (32) with respect to $u$ to zero gives the following update equation for $u_t$:
$$\Big( \frac{1}{\beta_t} P + \frac{1}{\eta_t} Q \Big) u_t = -\Big( \frac{1}{\beta_t} p + \frac{1}{\eta_t} q \Big), \qquad (33)$$
where $P = F^T \hat{\Sigma}_{z+} F \in \mathbb{R}^{\frac{d^2+3d-2}{2} \times \frac{d^2+3d-2}{2}}$, $Q = F^T \hat{\Sigma}_{z-} F \in \mathbb{R}^{\frac{d^2+3d-2}{2} \times \frac{d^2+3d-2}{2}}$, $p = F^T \hat{\Sigma}_{z+} w_0 \in \mathbb{R}^{\frac{d^2+3d-2}{2}}$, $q = F^T \hat{\Sigma}_{z-} w_0 \in \mathbb{R}^{\frac{d^2+3d-2}{2}}$. To ensure stability, the regularization term $\delta I_{\frac{d^2+3d-2}{2}}$ ($\delta > 0$) is added. Therefore, Equation (33) is replaced by:
$$\Big( \frac{1}{\beta_t} P + \frac{1}{\eta_t} Q + \delta I \Big) u_t = -\Big( \frac{1}{\beta_t} p + \frac{1}{\eta_t} q \Big). \qquad (34)$$
Next, fixing $u$ and setting the derivatives of the objective of (32) with respect to $\beta$ and $\eta$ to zero, respectively, gives the following updates for $\beta_t$ and $\eta_t$:
$$\beta_t = \| \hat{\Sigma}_{z+}^{1/2} (w_0 + F u_t) \|_2, \qquad \eta_t = \| \hat{\Sigma}_{z-}^{1/2} (w_0 + F u_t) \|_2. \qquad (35)$$
When the optimal solution $u^*$ is obtained by iterating the two update Formulas (34) and (35), the optimal solution $w^*$ of the optimization problem (28) is $w^* = w_0 + F u^*$. Then, we summarize the process of finding the optimal solution $A^*$, $b^*$, $c^*$ of the optimization problem (18) in Algorithm 1.
Algorithm 1: Kernel-free quadratic surface minimax probability machine (QSMPM).
Input: Training set (1), $\delta = 1 \times 10^{-6}$, number of maximum iterations $\tau = 100$.
1: Initialize $\beta_1 = 1$, $\eta_1 = 1$, $t = 1$.
2: Obtain $z_i$ by (19), $i = 1, 2, \ldots, m_+ + m_-$;
3: Calculate $\hat{\mu}_{z\pm}$ and $\hat{\Sigma}_{z\pm}$ by (25), and calculate $w_0 = \frac{\hat{\mu}_{z+} - \hat{\mu}_{z-}}{\| \hat{\mu}_{z+} - \hat{\mu}_{z-} \|_2^2}$;
4: Calculate $P = F^T \hat{\Sigma}_{z+} F$, $Q = F^T \hat{\Sigma}_{z-} F$, $p = F^T \hat{\Sigma}_{z+} w_0$, $q = F^T \hat{\Sigma}_{z-} w_0$, where $F$ is an orthogonal matrix whose columns span the subspace of vectors orthogonal to $\hat{\mu}_{z+} - \hat{\mu}_{z-}$;
5: while $t < \tau$ do
6:  Given $\beta_t$ and $\eta_t$, update $u_t$ by (34);
7:  Given $u_t$, update $\beta_t$ and $\eta_t$ by (35);
8:  $t \leftarrow t + 1$.
9: end
10: Assign $w^* = w_0 + F u_t$, then obtain $A^*$, $b^*$ by (20), (21), and (22); further, obtain $c^*$ by (29) or (30).
Output: $A^*$, $b^*$, $c^*$.
After obtaining the optimal solution $A^*$, $b^*$, and $c^*$ to the optimization problem (18), for a given new sample $x \in \mathbb{R}^d$, its label is assigned to either class +1 or class −1 by the decision function:
$$f(x) = \mathrm{sgn}\Big( \frac{1}{2} x^T A^* x + b^{*T} x - c^* \Big). \qquad (36)$$
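For concreteness, the following NumPy sketch puts Algorithm 1 and the decision rule (36) together. It is an illustrative reconstruction under the assumptions stated in the comments (in particular, $F$ is built from an SVD and the covariances are the biased estimates of (25)); it is not the authors' MATLAB implementation.

```python
import numpy as np

def quadratic_map(x):
    # z(x) as in (19): halved squares and cross terms, then linear terms
    d = len(x)
    quad = [0.5 * x[i] * x[i] if i == j else x[i] * x[j]
            for i in range(d) for j in range(i, d)]
    return np.concatenate([np.array(quad), x])

def qsmpm_fit(X, y, delta=1e-6, max_iter=100):
    """Algorithm 1. X is (n, d); y has entries +1/-1. Returns A*, b*, c*."""
    y = np.asarray(y)
    Z = np.array([quadratic_map(x) for x in X])               # step 2
    Zp, Zm = Z[y == 1], Z[y == -1]
    mu_p, mu_m = Zp.mean(axis=0), Zm.mean(axis=0)              # (25)
    Sp = np.cov(Zp, rowvar=False, bias=True)
    Sm = np.cov(Zm, rowvar=False, bias=True)
    diff = mu_p - mu_m
    w0 = diff / (diff @ diff)                                  # step 3
    _, _, Vt = np.linalg.svd(diff[None, :])                    # F spans diff's orthogonal complement
    F = Vt[1:].T
    P, Q = F.T @ Sp @ F, F.T @ Sm @ F                          # step 4
    p, q = F.T @ Sp @ w0, F.T @ Sm @ w0
    beta, eta = 1.0, 1.0                                       # step 1
    u = np.zeros(F.shape[1])
    for _ in range(max_iter):                                  # steps 5-9
        M = P / beta + Q / eta + delta * np.eye(F.shape[1])
        u = np.linalg.solve(M, -(p / beta + q / eta))          # (34)
        v = w0 + F @ u
        beta = np.sqrt(max(v @ Sp @ v, 1e-12))                 # (35)
        eta = np.sqrt(max(v @ Sm @ v, 1e-12))
    w = w0 + F @ u                                             # step 10
    c = w @ mu_m + eta / (beta + eta)                          # (29)
    d = X.shape[1]                                             # unpack A*, b* from w*, (20)-(22)
    a, b = w[:d * (d + 1) // 2], w[d * (d + 1) // 2:]
    A = np.zeros((d, d))
    idx = 0
    for i in range(d):
        for j in range(i, d):
            A[i, j] = A[j, i] = a[idx]
            idx += 1
    return A, b, c

def qsmpm_predict(X, A, b, c):
    # decision function (36)
    return np.sign(0.5 * np.einsum('ij,jk,ik->i', X, A, X) + X @ b - c)
```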
It should be pointed out that our QSMPM is kernel-free, which avoids the time-consuming task of selecting the appropriate kernel function and its corresponding parameters. What is more, it does not require the choice of any parameter, which makes it simpler and more convenient to use. Furthermore, from the geometric point of view, the quadratic hypersurface (17) determined by our method is allowed to be any general form of quadratic hypersurface, including hyperplanes, hyperparaboloids, hyperspheres, hyperellipsoids, hyperhyperboloids, and so on, which is shown clearly by the five artificial examples in Section 5.

3.3. Computational Complexity

Here, we analyze the computational complexity of our QSMPM. Suppose that the number and the dimension of the samples are $N$ and $d$, respectively. Before reformulating the QSMPM as an SOCP problem, all $d$-dimensional samples need to be projected into the $\frac{d^2+3d}{2}$-dimensional space. Therefore, the total computational complexity of the QSMPM is $O\big( (\frac{d^2+3d}{2})^3 + N(\frac{d^2+3d}{2})^2 + Nd^2 \big)$. In addition, we give the computational complexity of the MPM and the SVM: their complexity is $O(d^3 + Nd^2)$ [9] and $O(N^3)$ [19], respectively. Then, by referencing the computational complexity of the SVM, we obtain that the computational complexity of the QSSVM is $O(N^3 + Nd^2)$. According to the above analysis, assuming that $N$ is much larger than $d$, we can see that the computational complexity of the QSMPM is higher than that of the MPM, but lower than that of the SVM and the QSSVM.

4. The Interpretability

In this section, we discuss the interpretability of our method QSMPM. Suppose that we have obtained the optimal solution $A^*$, $b^*$, $c^*$ to the optimization problem (18); then, the quadratic hypersurface (17) has the following component form:
$$g(x) = \frac{1}{2} x^T A^* x + b^{*T} x - c^* = \frac{1}{2} \sum_{i=1}^d \sum_{j=1}^d a_{ij}^* [x]_i [x]_j + \sum_{i=1}^d b_i^* [x]_i - c^* = 0, \qquad (37)$$
where $[x]_i$ is the $i$-th component of the vector $x \in \mathbb{R}^d$, $a_{ij}^*$ is the entry in the $i$-th row and $j$-th column of the matrix $A^* \in S^d$, and $b_i^*$ is the $i$-th component of the vector $b^* \in \mathbb{R}^d$. Each component of $x$ contributes a quadratic polynomial term. Specifically, $b_i^*$ is the linear effect coefficient of the $i$-th component, $a_{ii}^*$ ($i = j$) is the quadratic effect coefficient of the $i$-th component, and $a_{ij}^*$ ($i \neq j$) is the interaction coefficient between the $i$-th and the $j$-th components. Therefore, for the $i$-th component of $x$, the larger $|a_{ii}^*| + \sum_{j \neq i} |a_{ij}^*| + |b_i^*|$ is, the greater the contribution of the $i$-th component. In particular, when $|a_{ii}^*| + \sum_{j \neq i} |a_{ij}^*| + |b_i^*| = 0$, the $i$-th component of $x$ plays no role. Therefore, compared with the methods with kernel functions, the QSMPM has better interpretability.
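Reading the criterion above as a per-feature score, a small sketch (illustrative, not from the paper) that ranks the components of $x$ by their contribution to (37) could look as follows.

```python
import numpy as np

def feature_contributions(A, b):
    """Score s_i = |a_ii| + sum_{j != i} |a_ij| + |b_i| for each component i."""
    d = len(b)
    scores = np.empty(d)
    for i in range(d):
        off_diag = np.abs(np.delete(A[i], i)).sum()   # interaction terms
        scores[i] = abs(A[i, i]) + off_diag + abs(b[i])
    return scores

# components with score 0 do not enter the decision function at all, e.g.:
# ranking = np.argsort(-feature_contributions(A_star, b_star))
```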

5. Numerical Experiments

In this section, we provide some numerical experiments to verify the performance of our QSMPM. We compared it with the hard-margin support vector machine (H−SVM), the soft-margin support vector machine (S−SVM), and the MPM, each with the linear kernel, the quadratic polynomial kernel, and the RBF kernel (denoted H−SVM−L, H−SVM−P, H−SVM−R, S−SVM−L, S−SVM−P, S−SVM−R, MPM−L, MPM−P, and MPM−R, respectively). In addition, we also compared it with the QSSVM and the SQSSVM. In all numerical experiments, the penalty parameter $C$ of the S−SVM and the kernel parameter $\sigma$ of the RBF kernel were selected from $\{2^{-7}, 2^{-6}, \ldots, 2^7\}$ by the 10-fold cross-validation method. All numerical experiments were conducted using MATLAB R2016b on a computer equipped with a 2.50 GHz (i5-4210U) CPU and 4 GB of available memory.
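The experiments themselves were run in MATLAB; purely for illustration, an equivalent parameter-selection scheme for the RBF soft-margin SVM could be written with scikit-learn as below (the mapping between scikit-learn's gamma and the kernel width σ is an assumption of this sketch, not part of the original setup).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# candidate values 2^-7, ..., 2^7 for the penalty C and the kernel parameter
grid = [2.0 ** k for k in range(-7, 8)]
param_grid = {"C": grid, "gamma": grid}   # gamma plays the role of 1/(2*sigma^2)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
# search.fit(X_train, y_train); search.best_params_ then gives the selected pair
```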

5.1. Artificial Datasets

To show the geometric interpretation of the proposed method QSMPM and to compare it with the original methods, the MPM−L, the MPM−P, and the MPM−R, we performed the following numerical experiments on five artificial examples. These five artificial examples were all generated in the 2-dimensional space, and each consists of 300 samples $\{x_i = ([x_i]_1, [x_i]_2)^T\}_{i=1}^{300}$, of which the first 150 are samples in class +1 and the last 150 are samples in class −1. Here, we first explain the symbols in all figures. The red “+” represents the samples in class +1, and the blue “o” represents the samples in class −1. The results in the upper right give the accuracy of each method on the artificial example. The curve in bold black represents the hyperplane or quadratic hypersurface. Now, let us introduce the numerical experiments on each artificial example in turn.
Example 1.
$$[x_i]_2 = \frac{1}{2}[x_i]_1 + 2 + \xi_i, \ i = 1, \ldots, 150; \qquad [x_i]_2 = \frac{1}{2}[x_i]_1 - 3 + \xi_i, \ i = 151, \ldots, 300,$$
where $[x_i]_1 \sim U[-3, 4]$, $\xi_i \sim N(0, 1)$.
Figure 1 illustrates the classification results of the MPM−L, the MPM−P, the MPM−R, and the QSMPM on Example 1, respectively. We can see from Figure 1 that our QSMPM can obtain classification results as good as the other three methods. In addition, the quadratic hypersurface found by our QSMPM is a straight line, i.e., a linear separating hyperplane.
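For reference, Example 1 can be regenerated with a few lines of NumPy; the signs of the offsets and of the sampling interval follow the reconstruction given above and are therefore assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150
x1_pos, x1_neg = rng.uniform(-3, 4, n), rng.uniform(-3, 4, n)   # [x]_1 ~ U[-3, 4]
pos = np.column_stack([x1_pos, 0.5 * x1_pos + 2 + rng.standard_normal(n)])  # class +1
neg = np.column_stack([x1_neg, 0.5 * x1_neg - 3 + rng.standard_normal(n)])  # class -1
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), -np.ones(n)])
```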
Example 2.
$$[x_i]_2 = \frac{1}{2}[x_i]_1^2 + 2 + \xi_i, \ i = 1, \ldots, 150; \qquad [x_i]_2 = \frac{1}{2}[x_i]_1^2 - 3 + \xi_i, \ i = 151, \ldots, 300,$$
where $[x_i]_1 \sim U[-3, 4]$, $\xi_i \sim N(0, 1)$.
Figure 2 presents the classification results on Example 2. We can observe in Figure 2 that the classification result of our QSMPM is superior to the MPM−L and similar to the MPM−P and the MPM−R. Moreover, the quadratic hypersurface obtained by our QSMPM is a parabola.
Example 3.
$$[x_i]_1 = r_i \cos(t_i), \ [x_i]_2 = r_i \sin(t_i), \ i = 1, \ldots, 150; \qquad [x_i]_1 = 1.3 + r_i \cos(t_i), \ [x_i]_2 = 1.3 + r_i \sin(t_i), \ i = 151, \ldots, 300,$$
where $r_i \sim U[0, 1]$, $t_i \sim U[0, 1]$.
Figure 3 reports the classification results of the MPM−L, the MPM−P, the MPM−R, and the QSMPM on Example 3, respectively. Obviously, in Figure 3, we can see that the classification result of our QSMPM is superior to the MPM−L and is the same as the MPM−P and the MPM−R. Furthermore, the quadratic hypersurface of our QSMPM is a circle.
Example 4.
$$[x_i]_1 = 2 + a_i \cos(t_i), \ [x_i]_2 = b_i \sin(t_i), \ i = 1, \ldots, 150; \qquad [x_i]_1 = 3.5 + a_i \cos(t_i), \ [x_i]_2 = 1.5 + b_i \cos(t_i), \ i = 151, \ldots, 300,$$
where $a_i \sim U[0, 1]$, $b_i \sim U[0, 1]$, $t_i \sim U[0, 1]$.
Figure 4 shows the classification results on Example 4. We can observe in Figure 4 that the QSMPM can obtain the same classification performance as the MPM−P and the MPM−R and is better than the MPM−L. Our QSMPM can find an ellipse.
Example 5.
$$[x_i]_2^2 = [x_i]_1^2 - 1 + \xi_i, \ i = 1, \ldots, 150; \qquad [x_i]_1 \sim U[-0.6, 0.6], \ [x_i]_2 \sim U[-6, 6], \ i = 151, \ldots, 300,$$
where $[x_i]_1 \sim U[-4, -1]$ or $U[1, 4]$, $\xi_i \sim N(0, 1)$, $i = 1, \ldots, 150$.
Figure 5 reports the classification results of the MPM−L, the MPM−P, the MPM−R, and the QSMPM on Example 5, respectively. We can observe in Figure 5 that the classification performance of QSMPM is better than the MPM−L and is similar to the MPM−P and the MPM−R. In addition, our QSMPM finds a hyperbola.
In summary, from Figure 1 to Figure 5, we can see that our QSMPM can find any general form of the quadratic hypersurface, such as the line, parabola, circle, ellipse, and hyperbola found in sequence in the above numerical experiments. Moreover, our method can achieve as good classification performance as the MPM−P and the MPM−R. In addition, it can be seen from Figure 1d that our method can obtain the linear separating hyperplane.

5.2. Benchmark Datasets

To verify the classification performance and computational efficiency of our QSMPM, we performed the following numerical experiments on 14 benchmark datasets. Table 1 summarizes the basic information of the 14 benchmark datasets in the UCI Machine Learning Repository.
We divided the datasets in Table 1 into two groups for the numerical experiments. The first group consists of the first seven datasets, and the second group consists of the last seven datasets. All numerical results were obtained through ten runs of 10-fold cross-validation and report the mean and standard deviation of the accuracy, together with the CPU time of one experiment (in parentheses). The best results are highlighted in boldface. First of all, Table 2 shows the classification results on the first group.
It can be seen from Table 2 that, compared with the other methods, our QSMPM obtained better accuracy on the first group of benchmark datasets, achieving the best accuracy on four of them. More specifically, except for Haberman and Bupa, the accuracy of our method was the best compared to the QSSVM and the SQSSVM. The accuracy of our QSMPM was the best compared to the three original kernel versions of the MPM except on Bupa. Furthermore, the accuracy of our method was the best compared to the H−SVM and the S−SVM with the three kernel functions except on Heart and Haberman. In addition, we can observe that the QSMPM had a short CPU time.
Then, the classification results on the second group are reported in Table 3. The symbol “−” indicates that the corresponding method could not obtain classification results, either because the optimal parameters could not be selected within a limited amount of time or because the dimension and size of the dataset were so large that the available memory was insufficient.
From Table 3, we can see that our QSMPM had good classification results on the second group of benchmark datasets. Especially on QSAR and Turkiye, some of the compared methods (the H−SVM−R, the three kernel versions of the S−SVM, the QSSVM, and the SQSSVM) could not obtain the corresponding classification results, but our QSMPM could obtain good classification performance. Here, we mention the reason for this situation: according to the computational complexity of each method, when the sample dimension and the number of samples are relatively large, the SVM and the QSSVM need a larger memory space. In addition, our QSMPM had the fastest running time except for the MPM−L, and it ran quite fast when the number of samples or the dimension was large.

5.3. Statistical Analysis

To further compare the performance of the above 12 methods, the Friedman test and the Nemenyi post-hoc test were performed. The ranks of the 12 methods on all benchmark datasets are shown in Table 4.
First, the Friedman test was used to compare the average ranks of the different methods. The null hypothesis states that all methods have the same performance, that is, their average ranks are the same. Based on the average ranks displayed in Table 4, we can calculate the Friedman statistic $\tau_F$ by the following formula:
$$\tau_{\chi^2} = \frac{12N}{k(k+1)} \Big( \sum_{i=1}^k r_i^2 - \frac{k(k+1)^2}{4} \Big), \qquad \tau_F = \frac{(N-1)\tau_{\chi^2}}{N(k-1) - \tau_{\chi^2}}, \qquad (38)$$
where $N$ and $k$ are the number of datasets and methods, respectively, and $r_i$ is the average rank of the $i$-th method. According to the formula (38), $\tau_F = 4.1825$. For $\alpha = 0.05$, we can obtain $F_\alpha = 1.8526$. Since $\tau_F > F_\alpha$, we rejected the null hypothesis. Then, we proceeded with a post-hoc test (the Nemenyi test) to find out which methods differed significantly. To be more specific, the performance of two methods was considered to be significantly different if the difference of their average ranks was larger than the critical difference (CD). The CD can be calculated by:
$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}. \qquad (39)$$
For $\alpha = 0.05$, we know $q_\alpha = 3.2680$. Thus, we obtained $CD = 4.4535$ by the formula (39).
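As a check, the two statistics can be reproduced from the average ranks in Table 4 with a few lines of NumPy; this is only a verification sketch, not part of the original experiments.

```python
import numpy as np

# average ranks of the 12 methods over the 14 datasets (last row of Table 4)
ranks = np.array([9.3929, 9.6429, 6.0714, 6.3571, 4.9286, 6.1786,
                  6.0357, 5.6786, 6.9286, 7.6071, 6.0357, 3.1429])
N, k = 14, 12
tau_chi2 = 12 * N / (k * (k + 1)) * (np.sum(ranks ** 2) - k * (k + 1) ** 2 / 4)
tau_F = (N - 1) * tau_chi2 / (N * (k - 1) - tau_chi2)      # (38), approx. 4.18
q_alpha = 3.2680                                            # Nemenyi value for k = 12, alpha = 0.05
CD = q_alpha * np.sqrt(k * (k + 1) / (6 * N))               # (39), approx. 4.45
print(round(tau_F, 4), round(CD, 4))
```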
Figure 6 visually displays the results of the Friedman test and Nemenyi post-hoc test, where the average ranks of each method are marked along an axis. The axis is turned so that the lowest (best) ranks are to the right. Groups of methods that are not significantly different are linked by a red line. In Figure 6, we can see that our QSMPM was the best one statistically among the compared methods. Furthermore, there was no significant difference in performance between the QSMPM and the MPM−R.

6. Conclusions

For the binary classification problem, a new classifier, called the kernel-free quadratic surface minimax probability machine (QSMPM), was proposed by using the kernel-free techniques of the QSSVM and the classification idea of the MPM. Specifically, our goal was to find a quadratic hypersurface that separates two classes of samples with maximum probability. However, the optimization problem derived directly was too difficult to solve. Therefore, a nonlinear transformation was introduced to change the quadratic function involved into a linear function. Through such processing, our optimization problem finally became a second-order cone programming problem, which was solved efficiently by an alternate iteration method. Here, we clarify the main contributions of this paper. Unlike the methods realizing nonlinear separation, our method was kernel-free and had better interpretability. Then, our method was easy to use because it did not have any parameters. Furthermore, numerical experiments on five artificial datasets showed that the quadratic hypersurfaces found by our method were rather general, including that it could obtain the linear separating hyperplane. In addition, numerical experiments on benchmark datasets confirmed that the proposed method was superior to some relevant methods in both accuracy and computational time. Especially when the number of samples or dimension was relatively large, our method could also quickly obtain good classification performance. Finally, the results of the statistical analysis showed that our QSMPM was statistically the best one compared with the corresponding methods. Our QSMPM focuses on the standard binary classification problem, which we will extend to the multiclassification problem.
In our future work, there are some issues to be addressed to extend our QSMPM. For example, we need to investigate further how to add appropriate regularization terms to our method. Meanwhile, it will be interesting to allow the worst-case accuracies of the two classes to differ. Furthermore, we will pay attention to how the QSMPM can achieve the dual purpose of feature selection and classification simultaneously. In addition, we can apply our method to practical problems in many fields in the future, especially image recognition in the medical field.

Author Contributions

Conceptualization, X.Y.; Data curation, Y.W.; Formal analysis, X.Y.; Methodology, Y.W. and Z.Y.; Software, Y.W.; Supervision, Z.Y.; Writing—original draft, Y.W.; Writing—review and editing, Z.Y. and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Xinjiang Provincial Natural Science Foundation of China (No. 2020D01C028) and the National Natural Science Foundation of China (No. 12061071).

Data Availability Statement

All of the benchmark datasets used in our numerical experiments are from the UCI Machine Learning Repository, which are available at http://archive.ics.uci.edu/ml/ (accessed on 10 June 2021).

Conflicts of Interest

The authors declare that they have no conflict of interest regarding this work.

References

  1. Langley, P.; Simon, H.A. Applications of machine learning and rule induction. Commun. ACM 1995, 38, 54–64.
  2. Vapnik, V.; Sterin, A. On structural risk minimization or overall risk in a problem of pattern recognition. Autom. Remote Control 1977, 10, 1495–1503.
  3. Forman, G. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 2003, 3, 1289–1305.
  4. Ying, S.H.; Qiao, H. Lie group method: A new approach to image matching with arbitrary orientations. Int. J. Imaging Syst. Technol. 2010, 20, 245–252.
  5. Cao, L.J.; Tay, F. Support vector machine with adaptive parameters in financial time series forecasting. IEEE Trans. Neural Netw. 2003, 14, 1506–1518.
  6. Srinivasu, P.N.; Sivasai, J.G.; Ijaz, M.F.; Bhoi, A.K.; Kim, W.; Kang, J.J. Classification of skin disease using deep learning neural networks with MobileNet V2 and LSTM. Sensors 2021, 21, 2852.
  7. Panigrahi, R.; Borah, S.; Bhoi, A.K.; Ijaz, M.F.; Pramanik, M.; Jhaveri, R.H.; Chowdhary, C.L. Performance assessment of supervised classifiers for designing intrusion detection systems: A comprehensive review and recommendations for future research. Mathematics 2021, 9, 690.
  8. Lanckriet, G.R.G.; Ghaoui, L.E.; Bhattacharyya, C.; Jordan, M.I. Minimax probability machine. Adv. Neural Inf. Process. Syst. 2001, 37, 192–200.
  9. Lanckriet, G.R.G.; Ghaoui, L.E.; Bhattacharyya, C.; Jordan, M.I. A robust minimax approach to classification. J. Mach. Learn. Res. 2003, 3, 555–582.
  10. Johnny, K.C.; Zhong, Y.; Yang, S. A comparative study of minimax probability machine-based approaches for face recognition. Pattern Recognit. Lett. 2007, 28, 1995–2002.
  11. Deng, Z.; Wang, S.; Chung, F.L. A minimax probabilistic approach to feature transformation for multi-class data. Appl. Soft Comput. 2013, 13, 116–127.
  12. Jiang, B.; Guo, Z.; Zhu, Q.; Huang, G. Dynamic minimax probability machine based approach for fault diagnosis using pairwise discriminate analysis. IEEE Trans. Control Syst. Technol. 2017, 27, 806–813.
  13. Yang, L.; Gao, Y.; Sun, Q. A new minimax probabilistic approach and its application in recognition the purity of hybrid seeds. Comput. Model. Eng. Sci. 2015, 104, 493–506.
  14. Kwok, J.T.; Tsang, W.H.; Zurada, J.M. A class of single-class minimax probability machines for novelty detection. IEEE Trans. Neural Netw. 2007, 18, 778–785.
  15. Strohmann, T.; Grudic, G.Z. A formulation for minimax probability machine regression. Adv. Neural Inf. Process. Syst. 2003, 76, 9–776.
  16. Huang, K.; Yang, H.; King, I.; Lyu, M.R.; Chan, L. The minimum error minimax probability machine. J. Mach. Learn. Res. 2004, 5, 1253–1286.
  17. Gu, B.; Sun, X.; Sheng, V.S. Structural minimax probability machine. IEEE Trans. Neural Netw. Learn. Syst. 2017, 1, 1646–1656.
  18. Maldonado, S.; Carrasco, M.; Lopez, J. Regularized minimax probability machine. Knowl. Based Syst. 2019, 177, 127–135.
  19. Yang, L.M.; Wen, Y.K.; Zhang, M.; Wang, X. Twin minimax probability machine for pattern classification. Neural Netw. 2020, 131, 201–214.
  20. Cousins, S.; Shawe-Taylor, J. High-probability minimax probability machines. Mach. Learn. 2017, 106, 863–886.
  21. Yoshiyama, K.; Sakurai, A. Laplacian minimax probability machine. Pattern Recognit. Lett. 2014, 37, 192–200.
  22. He, K.X.; Zhong, M.Y.; Du, W.L. Weighted incremental minimax probability machine-based method for quality prediction in gasoline blending process. Chemometr. Intell. Lab. Syst. 2019, 196, 103909.
  23. Ma, J.; Yang, L.M.; Wen, Y.K.; Sun, Q. Twin minimax probability extreme learning machine for pattern recognition. Knowl. Based Syst. 2020, 187, 104806.
  24. Ma, J.; Shen, J.M. A novel twin minimax probability machine for classification and regression. Knowl. Based Syst. 2020, 196, 105703.
  25. Deng, Z.H.; Chen, J.Y.; Zhang, T.; Cao, L.B.; Wang, S.T. Generalized hidden-mapping minimax probability machine for the training and reliability learning of several classical intelligent models. Inf. Sci. 2018, 436–437, 302–319.
  26. Dagher, I. Quadratic kernel-free non-linear support vector machine. J. Glob. Optim. 2008, 41, 15–30.
  27. Luo, J.; Fang, S.C.; Deng, Z.B.; Guo, X.L. Soft quadratic surface support vector machine for binary classification. Asia Pac. J. Oper. Res. 2016, 33, 1650046.
  28. Bai, Y.Q.; Han, X.; Chen, T.; Yu, H. Quadratic kernel-free least squares support vector machine for target diseases classification. J. Comb. Optim. 2015, 30, 850–870.
  29. Gao, Q.Q.; Bai, Y.R.; Zhan, Y.R. Quadratic kernel-free least square twin support vector machine for binary classification problems. J. Oper. Res. Soc. China 2019, 7, 539–559.
  30. Mousavi, A.; Gao, Z.M.; Han, L.S.; Lim, A. Quadratic surface support vector machine with L1 norm regularization. J. Ind. Manag. Optim. 2019.
  31. Gao, Z.M.; Fang, S.C.; Luo, J.; Medhin, N. A kernel-free double well potential support vector machine with applications. Eur. J. Oper. Res. 2020, 290, 248–262.
  32. Luo, J.; Fang, S.C.; Bai, Y.Q.; Deng, Z.B. Fuzzy quadratic surface support vector machine based on Fisher discriminant analysis. J. Ind. Manag. Optim. 2015, 12, 357–373.
  33. Yan, X.; Bai, Y.Q.; Fang, S.C.; Luo, J. A kernel-free quadratic surface support vector machine for semi-supervised learning. J. Oper. Res. Soc. 2016, 67, 1001–1011.
  34. Tian, Y.; Yong, Z.Y.; Luo, J. A new approach for reject inference in credit scoring using kernel-free fuzzy quadratic surface support vector machines. Appl. Soft Comput. 2018, 73, 96–105.
  35. Zhai, Q.R.; Tian, Y.; Zhou, J.Y. Linear twin quadratic surface support vector regression. Math. Probl. Eng. 2020, 2020, 1–18.
  36. Luo, J.; Tian, Y.; Yan, X. Clustering via fuzzy one-class quadratic surface support vector machine. Soft Comput. 2017, 21, 5859–5865.
Figure 1. Classification results on Example 1.
Figure 2. Classification results on Example 2.
Figure 3. Classification results on Example 3.
Figure 4. Classification results on Example 4.
Figure 5. Classification results on Example 5.
Figure 6. Results of the Friedman test and the Nemenyi post-hoc test.
Table 1. Basic information for the 14 benchmark datasets.

| Datasets | Samples | Attributes | Class |
|---|---|---|---|
| Iris | 150 | 4 | 3 |
| Hepatitis | 155 | 19 | 2 |
| Wine | 178 | 13 | 3 |
| Heart | 270 | 13 | 2 |
| Heart-c | 303 | 14 | 2 |
| Haberman | 306 | 3 | 2 |
| Bupa | 345 | 6 | 2 |
| Pima | 768 | 8 | 2 |
| QSAR | 1055 | 41 | 2 |
| Winequality-red | 1599 | 11 | 6 |
| Wireless | 2000 | 7 | 4 |
| Image | 2310 | 19 | 7 |
| Abalone | 2649 | 8 | 2 |
| Turkiye | 5820 | 32 | 13 |
Table 2. Classification results on the first group.

| Methods | Iris | Hepatitis | Wine | Heart | Heart-c | Haberman | Bupa |
|---|---|---|---|---|---|---|---|
| H−SVM−L | 0.6413 ± 0.0245 (4.1293) | 0.5835 ± 0.0185 (2.1965) | 0.9573 ± 0.0054 (1.8439) | 0.6174 ± 0.0287 (7.7049) | **1.0000 ± 0.0000** (3.9203) | 0.5189 ± 0.0024 (15.1649) | 0.5573 ± 0.0021 (11.2976) |
| H−SVM−P | 0.9447 ± 0.0055 (1.4087) | 0.5150 ± 0.0399 (1.5304) | 0.7571 ± 0.0365 (2.6005) | 0.6985 ± 0.0126 (3.8797) | 0.8678 ± 0.0386 (4.5568) | 0.7351 ± 0.0005 (10.0121) | 0.5798 ± 0.0003 (13.3459) |
| H−SVM−R | 0.9467 ± 0.0000 (0.8250) | 0.5858 ± 0.0280 (0.8734) | 0.9341 ± 0.0116 (1.0813) | 0.7144 ± 0.0183 (2.8344) | **1.0000 ± 0.0000** (2.4438) | 0.7362 ± 0.0022 (4.0609) | 0.5930 ± 0.0192 (3.7438) |
| S−SVM−L | 0.8693 ± 0.0155 (1.4594) | 0.6037 ± 0.0206 (1.3416) | 0.9582 ± 0.0147 (2.6656) | 0.8259 ± 0.0129 (9.2391) | 0.9888 ± 0.0156 (8.2672) | 0.7241 ± 0.0073 (7.8422) | 0.6812 ± 0.0096 (10.7031) |
| S−SVM−P | 0.9653 ± 0.0053 (1.4149) | 0.6037 ± 0.0234 (1.4703) | 0.7274 ± 0.0235 (1.9407) | **0.8296 ± 0.0097** (9.0609) | 0.8605 ± 0.0087 (11.3328) | 0.7228 ± 0.0103 (11.1953) | 0.6788 ± 0.0095 (12.3328) |
| S−SVM−R | 0.9547 ± 0.0061 (0.9406) | 0.5588 ± 0.0183 (0.8656) | 0.8921 ± 0.0131 (1.1641) | 0.7989 ± 0.0058 (2.1344) | 0.7642 ± 0.0101 (2.6000) | 0.7261 ± 0.0080 (2.7719) | 0.6826 ± 0.0088 (3.2813) |
| MPM−L | 0.8280 ± 0.0082 (0.3073) | 0.6010 ± 0.0114 (0.0244) | 0.9731 ± 0.0052 (1.7831) | 0.8133 ± 0.0080 (3.1887) | **1.0000 ± 0.0000** (1.7379) | 0.7177 ± 0.0093 (3.4944) | 0.6220 ± 0.0079 (0.1201) |
| MPM−P | 0.9747 ± 0.0028 (1.9079) | 0.5954 ± 0.0302 (1.7051) | **0.9759 ± 0.0046** (2.9640) | 0.8026 ± 0.0133 (5.6504) | 0.9681 ± 0.0066 (6.6425) | 0.7159 ± 0.0075 (5.9514) | 0.6891 ± 0.0114 (5.5131) |
| MPM−R | 0.9620 ± 0.0063 (4.5256) | 0.5382 ± 0.0412 (9.0683) | 0.7860 ± 0.0121 (4.3758) | 0.6578 ± 0.0105 (10.5238) | 0.6981 ± 0.0101 (5.6452) | 0.6879 ± 0.0114 (11.9107) | 0.7212 ± 0.0088 (8.5723) |
| QSSVM | 0.9533 ± 0.0070 (1.2371) | 0.5626 ± 0.0342 (3.7612) | 0.9608 ± 0.0052 (2.1109) | 0.6989 ± 0.0182 (3.5241) | 0.9980 ± 0.0023 (4.2947) | **0.7416 ± 0.0073** (5.6426) | 0.7203 ± 0.0047 (8.0918) |
| SQSSVM | 0.9527 ± 0.0049 (0.9266) | 0.5650 ± 0.0117 (4.6098) | 0.9622 ± 0.0071 (2.6832) | 0.7970 ± 0.0098 (5.2822) | 0.9954 ± 0.0024 (7.2993) | 0.7296 ± 0.0050 (5.0482) | **0.7220 ± 0.0074** (7.0653) |
| QSMPM | **0.9767 ± 0.0035** (0.3089) | **0.6069 ± 0.0313** (2.0608) | **0.9759 ± 0.0053** (1.3884) | 0.8293 ± 0.0114 (0.8580) | **1.0000 ± 0.0000** (2.5179) | 0.7205 ± 0.0069 (0.2902) | 0.7164 ± 0.0093 (0.2870) |
Table 3. Classification results on the second group.

| Methods | Pima | QSAR | Winequality-Red | Wireless | Image | Abalone | Turkiye |
|---|---|---|---|---|---|---|---|
| H−SVM−L | 0.6799 ± 0.0035 (128.9255) | 0.3675 ± 0.0014 (223.9172) | 0.6156 ± 0.0014 (813.7250) | 0.7280 ± 0.0094 (3956.2000) | 0.6470 ± 0.0011 (947.0836) | 0.7357 ± 0.0040 (2825.5000) | 0.5051 ± 0.0005 (3013.1000) |
| H−SVM−P | 0.5710 ± 0.0155 (75.1160) | 0.8133 ± 0.0081 (451.2469) | 0.4653 ± 0.0000 (475.2321) | 0.8463 ± 0.0234 (4449.2000) | 0.6979 ± 0.0020 (996.2086) | 0.4934 ± 0.0000 (1429.9000) | 0.5033 ± 0.0038 (23897.0000) |
| H−SVM−R | 0.6919 ± 0.0117 (14.9234) | 0.8171 ± 0.0069 (110.5313) | **0.7628 ± 0.0058** (3707.5000) | **0.9877 ± 0.0010** (404.7438) | 0.9698 ± 0.0019 (567.4688) | 0.8067 ± 0.0062 (693.4125) | − |
| S−SVM−L | **0.7669 ± 0.0076** (82.0609) | 0.8340 ± 0.0152 (211.0188) | 0.7291 ± 0.0023 (640.9000) | 0.9139 ± 0.0044 (1793.9000) | 0.7531 ± 0.0304 (2595.7000) | 0.8108 ± 0.0009 (1189.7000) | − |
| S−SVM−P | 0.7585 ± 0.0066 (27.8571) | **0.8677 ± 0.0058** (259.1922) | 0.7492 ± 0.0021 (1123.8000) | 0.9799 ± 0.0017 (3120.7000) | 0.9602 ± 0.0015 (4820.8000) | 0.8298 ± 0.0018 (3120.7000) | − |
| S−SVM−R | 0.6941 ± 0.0078 (16.5313) | 0.8503 ± 0.0031 (103.9438) | 0.7352 ± 0.0024 (294.0688) | 0.9853 ± 0.0054 (717.2984) | **0.9715 ± 0.0020** (506.2984) | 0.8242 ± 0.0008 (984.2250) | − |
| MPM−L | 0.7405 ± 0.0048 (0.1014) | 0.8296 ± 0.0039 (0.8377) | 0.7409 ± 0.0025 (0.2683) | 0.9108 ± 0.0008 (0.1560) | 0.8555 ± 0.0012 (0.2969) | 0.8144 ± 0.0009 (0.1266) | 0.5779 ± 0.0026 (0.3167) |
| MPM−P | 0.7442 ± 0.0035 (32.9178) | 0.8225 ± 0.0061 (68.1350) | 0.7326 ± 0.0020 (122.7891) | 0.9361 ± 0.0013 (187.2266) | 0.8499 ± 0.0011 (257.7266) | 0.8109 ± 0.0009 (413.9861) | 0.5780 ± 0.0014 (2669.2000) |
| MPM−R | 0.7356 ± 0.0054 (37.2874) | 0.8373 ± 0.0037 (83.8022) | 0.7234 ± 0.0020 (139.5734) | 0.9850 ± 0.0006 (248.8594) | 0.9413 ± 0.0018 (402.6281) | 0.8264 ± 0.0014 (423.6125) | 0.5689 ± 0.0016 (2386.7000) |
| QSSVM | 0.7663 ± 0.0049 (70.7730) | − | 0.4722 ± 0.0032 (27.7094) | 0.6315 ± 0.0078 (14.7250) | 0.5714 ± 0.0000 (59.1266) | 0.5125 ± 0.0012 (29.5344) | − |
| SQSSVM | 0.7589 ± 0.0036 (43.0610) | − | 0.7452 ± 0.0036 (36.8234) | 0.9782 ± 0.0012 (45.3036) | 0.5714 ± 0.0000 (63.7109) | 0.8280 ± 0.0015 (50.1641) | − |
| QSMPM | 0.7530 ± 0.0049 (0.5585) | 0.8482 ± 0.0047 (16.6328) | 0.7470 ± 0.0027 (1.3525) | 0.9427 ± 0.0012 (1.0608) | 0.8731 ± 0.0013 (3.6953) | **0.8299 ± 0.0011** (1.1984) | **0.5852 ± 0.0018** (14.6360) |
Table 4. Ranks of accuracy.

| Datasets | H−SVM−L | H−SVM−P | H−SVM−R | S−SVM−L | S−SVM−P | S−SVM−R | MPM−L | MPM−P | MPM−R | QSSVM | SQSSVM | QSMPM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Iris | 12 | 9 | 8 | 10 | 3 | 5 | 11 | 2 | 4 | 6 | 7 | 1 |
| Hepatitis | 7 | 12 | 6 | 2.5 | 2.5 | 10 | 4 | 5 | 11 | 9 | 8 | 1 |
| Wine | 7 | 11 | 8 | 6 | 12 | 9 | 3 | 1.5 | 10 | 5 | 4 | 1.5 |
| Heart | 12 | 10 | 8 | 3 | 1 | 6 | 4 | 5 | 11 | 9 | 7 | 2 |
| Heart-c | 2.5 | 9 | 2.5 | 7 | 10 | 11 | 2.5 | 8 | 12 | 5 | 6 | 2.5 |
| Haberman | 12 | 3 | 2 | 6 | 7 | 5 | 9 | 10 | 11 | 1 | 4 | 8 |
| Bupa Liver | 12 | 11 | 10 | 7 | 8 | 6 | 9 | 5 | 2 | 3 | 1 | 4 |
| Pima | 11 | 12 | 10 | 1 | 4 | 9 | 7 | 6 | 8 | 2 | 3 | 5 |
| QSAR | 10 | 9 | 8 | 5 | 1 | 2 | 6 | 7 | 4 | 11.5 | 11.5 | 3 |
| Winequality-red | 10 | 12 | 1 | 8 | 2 | 6 | 5 | 7 | 9 | 11 | 4 | 3 |
| Wireless | 11 | 10 | 1 | 8 | 4 | 2 | 9 | 7 | 3 | 12 | 5 | 6 |
| Image Segmentation | 10 | 9 | 2 | 8 | 3 | 1 | 6 | 7 | 4 | 11.5 | 11.5 | 5 |
| Abalone | 10 | 12 | 9 | 8 | 2 | 5 | 6 | 7 | 4 | 11 | 3 | 1 |
| Turkiye | 5 | 6 | 9.5 | 9.5 | 9.5 | 9.5 | 3 | 2 | 4 | 9.5 | 9.5 | 1 |
| Average ranks | 9.3929 | 9.6429 | 6.0714 | 6.3571 | 4.9286 | 6.1786 | 6.0357 | 5.6786 | 6.9286 | 7.6071 | 6.0357 | 3.1429 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
