1. Introduction
Support vector machines (SVMs) [1] are a powerful machine learning algorithm developed for classification problems, which works by recognizing patterns via the kernel trick [2]. Because of their high performance and strong generalization ability compared with other classification methods, SVMs are widely used in bioinformatics, text and image recognition, and finance, to name a few. Basically, the method finds a linear boundary (hyperplane) that yields the largest margin between two classes (labels) in the input space [3,4,5,6]. It can be applied not only to linear separation but also to nonlinear separation using kernel functions, which map the input space to a high-dimensional space, called the feature space, where the optimal separating hyperplane is determined. The hyperplane in the feature space, which achieves a better separation of the training data, translates to a nonlinear boundary in the original space [7,8]. The kernel trick associates the kernel function with the mapping function, bringing forth a nonlinear separation in the input space.
Due to the growing speed of data acquisition in various domains and the continuing popularity of SVMs, large-scale SVM problems frequently arise: human detection using histograms of oriented gradients, large-scale image classification, disease classification using mass spectra, and so forth. Even though SVMs show superior classification performance, their computing time and storage requirements increase dramatically with the number of instances, which is a major obstacle [9,10]. As the goal of SVMs is to find the optimal separating hyperplane that maximizes the margin between two classes, they must solve a quadratic programming problem. In practice, the time complexity of the training phase of the SVM method is at least $O(n^2)$, where $n$ is the number of data samples, depending on the kernel function [11]. Several approaches have been proposed to improve the training speed of SVMs, among them sequential minimal optimization (SMO) [12], SVM-light [13], the simple support vector machine (SSVM) [14], and the library of support vector machines (LibSVM) [15]. Basically, they break the problem into a series of small problems that can be solved easily, reducing the required memory size.
Additionally, data reduction or selection methods have been introduced for large-scale SVM problems. Reduced support vector machines (RSVMs) use random sampling which, while quite simple, exploits only a small portion of a large dataset [16]. However, the method needs to be applied several times, and unimportant observations are sampled with equal probability. The method presented by Collobert et al. efficiently parallelizes sub-problems and thus fits very large SVM problems [17]. It uses cascades of SVMs in which the data are split into subsets that are optimized separately with multiple SVMs instead of analyzing the whole dataset. A method based on the selection of candidate vectors (CVS) was presented that uses relative pairwise Euclidean distances in the input space to find candidate vectors in advance [18]. Because only the selected samples are used in the training phase, it trains quickly. However, its classification performance is worse than that of the conventional SVM, and the need for selecting good candidate vectors arises.
Moreover, for large-scale SVM problems, joint approaches that combine SVMs with other machine learning methods have emerged. Many evolutionary algorithms have been proposed to select training data for SVMs [19,20,21,22]. Although they have shown promising results, these methods need to be executed multiple times to determine proper parameters and training data, which is computationally expensive. Decision tree methods have also been commonly proposed to reduce training data because their training time is proportional to $O(mn\log n)$, where $m$ represents the number of discrete input variables and $n$ the number of instances [23], and is therefore faster than that of traditional SVMs. The decision tree method recursively decomposes the input dataset into binary subsets through independent variables when the splitting condition is met. In supervised learning, decision trees, which also underlie random forests, are among the most popular models because they are easy to interpret and computationally inexpensive. Indeed, taking advantage of decision trees, several studies combining SVMs with decision trees have been proposed for large-scale SVM problems. Fu Chang et al. [24] presented a method that uses a binary tree to decompose an input data space into several regions and trains an SVM classifier on each of the decomposed regions. Another method using decision trees and Fisher's linear discriminant was also proposed for large-scale SVM problems, applying Fisher's linear discriminant to detect 'good' data samples near the support vectors [25]. Cervantes et al. [26] also utilized a decision tree to select candidate support vectors, using the support vectors annotated by an SVM trained on a small portion of the training data. These approaches, however, are limited in that they cannot properly handle regions that have nonlinear relationships.
The ultimate aim in dealing with large-scale SVM problems is to reduce the training time and memory consumption of SVMs without compromising performance. Toward this goal, it is worth finding good support vector candidates as a data-reduction method. Thus, in this paper we present a method that finds support vector candidates based on decision trees and works better than previous methods. We determine the decision hyperplane using support vector candidates chosen from the training dataset. In the proposed approach, we introduce a new concept, the relative support distance, to effectively find candidates using decision trees while taking into account nonlinear relationships between local observations and labels. Decision tree learning decomposes the input space and helps find subspaces of the data whose majority class labels are opposite to each other. The relative support distance measures the degree to which an observation is likely to be a support vector, using a virtual hyperplane that bisects the line segment connecting the two class centroids and the nonlinear relationship between the hyperplane and each of the two centroids.
This paper is organized as follows. Section 2 provides an overview of the SVMs and decision trees exploited in our algorithm. In Section 3, we introduce the proposed method of selecting support vector candidates using relative support distance measures. In Section 4, we provide the results of experiments comparing the performance of the proposed method with that of existing methods. Lastly, in Section 5, we conclude the paper with future research directions.
3. Tree-Based Relative Support Distance
In order to cope with large-scale SVM problems, we propose a novel selection method for support vector candidates that combines tree decomposition with the relative support distance. We aim to reduce the training time of SVMs for the numerical computation of $\boldsymbol{\alpha}$ in (2), which produces $\mathbf{w}$ and $b$ in (1), by selecting good support vector candidates in advance that form a small subset of the training data. To illustrate our concept, we start with a simple example in Figure 1, which shows the distribution of the iris data; for details of the data, refer to Fisher [34]. In short, the iris dataset describes iris plants using four continuous features. The dataset contains 3 classes of 50 instances each: Iris Setosa, Iris Versicolor, and Iris Virginica. We decompose the input space into several regions by decision tree learning. After training an SVM model on the whole dataset, we mark the support vectors with filled shapes. Each region has its own majority class label, and the boundaries lie between two majority classes. The support vectors are close to the boundaries. In addition, we notice that they are located relatively far from the center of the data points carrying the majority class label of a region.
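The setting of Figure 1 can be reproduced in outline with standard R tooling. The snippet below is illustrative rather than the paper's code: it binarizes iris as described in Section 4, uses the two petal features as a two-dimensional input space, partitions it with a decision tree, and marks which observations an SVM selects as support vectors.

```r
# Minimal sketch of the Figure 1 setup (illustrative package and feature choices).
library(rpart)   # CART decision trees
library(e1071)   # svm(): an R interface to LIBSVM

data(iris)
iris2 <- data.frame(x1 = iris$Petal.Length, x2 = iris$Petal.Width,
                    y = factor(ifelse(iris$Species == "setosa", "pos", "neg")))

tree <- rpart(y ~ x1 + x2, data = iris2)   # decomposes the input space into leaves
fit  <- svm(y ~ x1 + x2, data = iris2, kernel = "radial")

leaf_id <- tree$where            # terminal-node index of each observation
sv_rows <- fit$index             # row indices of the support vectors
table(leaf = leaf_id, is_sv = seq_len(nrow(iris2)) %in% sv_rows)
```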
In light of this, we describe our algorithm for finding a subset of support vectors that determines the separating hyperplane. We divide the training dataset into several decomposed regions of the input space by decision tree learning. This process ensures that most of the data points in each decomposed region carry that region's majority class label. Next, we detect, for each region, the adjacent regions whose majority class is opposite to its own; we call such a region a distinct adjacent region. Then we calculate a new distance measure, the relative support distance, for the data points in the selected region pairs. The procedure of the algorithm is as follows, and an R sketch of the overall pipeline is given after the list:
Decompose the input space by decision tree learning.
Find the distinct adjacent regions, that is, the adjacent regions whose majority class differs from that of the given region.
Calculate the relative support distances for the data points in the found distinct adjacent regions.
Select the candidates of support vectors according to the relative support distances.
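Concretely, the pipeline can be outlined as follows. This is a minimal structural sketch in R with our own naming, not the paper's code: `find_distinct_adjacent` and `score_by_rsd` are hypothetical stand-ins for Steps 2 and 3 (Sections 3.1 and 3.2), and `select_candidates` is sketched at the end of Section 3.2.

```r
# Overall pipeline: decompose, find distinct adjacent regions, score points,
# select candidates, and train the final SVM on the reduced dataset.
library(rpart)   # decision tree learning (CART)
library(e1071)   # svm(): an R interface to LIBSVM

select_and_train <- function(data, formula, r = 0.5) {
  tree  <- rpart(formula, data = data)              # Step 1: decompose the input space
  pairs <- find_distinct_adjacent(tree, data)       # Step 2: distinct adjacent regions (hypothetical helper)
  rsd   <- score_by_rsd(pairs, data)                # Step 3: relative support distance per point (hypothetical helper)
  cand  <- select_candidates(data, rsd, ratio = r)  # Step 4: top-r fraction plus random half
  svm(formula, data = cand, kernel = "radial")      # train the final SVM on the candidates
}
```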
3.1. Distinct Adjacent Regions
After applying decision tree learning to the training data, we detect adjacent regions. The decision tree partitions the input space into several leaves (also called terminal nodes) by reducing an impurity measure such as entropy. Following the approach of detecting adjacent regions introduced by Chau [25], we give mathematical conditions for being adjacent regions and relate them to the relative support distance. Firstly, we represent each terminal node of a learned decision tree as a hyperrectangle:

$$L_i = \{\mathbf{x} \in \mathbb{R}^m : l_{ik} \le x_k \le u_{ik},\ k = 1, \dots, m\},$$

where $L_i$ is the $i$th leaf in the tree structure and $[l_{ik}, u_{ik}]$ is the boundary range for the $k$th variable of the $i$th leaf, with lower bound $l_{ik}$ and upper bound $u_{ik}$. Recall that $m$ is the number of input variables. We check whether each pair of leaves, $L_i$ and $L_j$, meets the following criteria: there exists a variable $k$ such that

$$u_{ik} = l_{jk} \quad \text{or} \quad u_{jk} = l_{ik}, \quad (6)$$

and, for every other variable $k' \neq k$, the boundary ranges overlap:

$$[l_{ik'}, u_{ik'}] \cap [l_{jk'}, u_{jk'}] \neq \emptyset. \quad (7)$$

That is to say, if two leaves $L_i$ and $L_j$ are adjacent regions, they must share one boundary in a splitting variable, represented by the variable $k$ in Equation (6), and their ranges must overlap in every other variable $k'$, as required by Equation (7).
Among all adjacent regions, we only consider distinct adjacent regions. For example, in Figure 2, a leaf typically has several neighbors, but a pair of leaves that fails the criteria above does not form an adjacent region pair, and an adjacent region pair is not distinct when the two regions have the same majority class. The distinct adjacent regions of the example, namely the adjacent pairs whose majority classes are opposite, are summarized in Table 1. Now, we apply the measure of relative support distance to select support vector candidates in the distinct adjacent regions found for each region.
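For illustration, the criteria (6) and (7) can be checked directly on the bound representation above. In the sketch below, each leaf is encoded as an $m \times 2$ matrix of lower and upper bounds per variable, which is our own device rather than anything prescribed by the paper:

```r
# Check Equations (6)-(7) for two leaves given as m x 2 matrices of
# [lower, upper] bounds per variable (a hypothetical representation).
is_adjacent <- function(leaf_i, leaf_j, tol = 1e-8) {
  m <- nrow(leaf_i)
  for (k in seq_len(m)) {
    # Equation (6): the leaves share a boundary in variable k ...
    shares_face <- abs(leaf_i[k, 2] - leaf_j[k, 1]) < tol ||
                   abs(leaf_j[k, 2] - leaf_i[k, 1]) < tol
    if (!shares_face) next
    # Equation (7): ... and their ranges overlap in every other variable.
    others <- setdiff(seq_len(m), k)
    overlaps <- all(leaf_i[others, 1] <= leaf_j[others, 2] &
                    leaf_j[others, 1] <= leaf_i[others, 2])
    if (overlaps) return(TRUE)
  }
  FALSE
}

# Two unit squares sharing the edge x1 = 1 are adjacent:
A <- rbind(c(0, 1), c(0, 1))   # rows are variables, columns are [lower, upper]
B <- rbind(c(1, 2), c(0, 1))
is_adjacent(A, B)              # TRUE
```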
3.2. Relative Support Distance
Support vectors (SVs) play a substantial role in determining the decision hyperplane, in contrast to non-SV data points. We extract the data points in the training data that are most likely to be support vectors, constructing a set of support vector candidates. Given two distinct adjacent regions $L_i$ and $L_j$ from the previous step, let us assume the majority class label of $L_i$ is $+1$ and that of $L_j$ is $-1$, without loss of generality. First, we calculate the centroid $\mathbf{c}_{\ell}$ for each majority class label as follows: for an index set $I_{\ell}$ and the label $\ell \in \{+1, -1\}$,

$$\mathbf{c}_{\ell} = \frac{1}{|I_{\ell}|} \sum_{i \in I_{\ell}} \mathbf{x}_i,$$

where $I_{\ell}$ is the set of indices of the data points whose labels are $\ell$ in the corresponding region, and $|I_{\ell}|$ is the cardinality of the index set $I_{\ell}$. In other words, $\mathbf{c}_{+1}$ is the majority-class centroid of the data points in $L_i$, of which the labels are $+1$. Next, we create a virtual hyperplane that bisects the line from $\mathbf{c}_{+1}$ to $\mathbf{c}_{-1}$:

$$\mathbf{m} = \frac{\mathbf{c}_{+1} + \mathbf{c}_{-1}}{2},$$

where $\mathbf{m}$ is the middle point of the two majority-class centroids. The virtual hyperplane $H$ is given by $H = \{\mathbf{x} : \mathbf{w}^{\top}(\mathbf{x} - \mathbf{m}) = 0\}$, where $\mathbf{w} = \mathbf{c}_{+1} - \mathbf{c}_{-1}$.
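In code, this construction reduces to two column means, their difference, and their midpoint. The helper below is a minimal sketch, assuming the observations of a distinct adjacent region pair are the rows of a matrix X with labels y in $\{+1, -1\}$:

```r
# Build the virtual hyperplane bisecting the two majority-class centroids.
# X: n x m matrix of points from a distinct adjacent region pair; y: +1/-1 labels.
virtual_hyperplane <- function(X, y) {
  c_pos <- colMeans(X[y == +1, , drop = FALSE])  # centroid of the +1 points
  c_neg <- colMeans(X[y == -1, , drop = FALSE])  # centroid of the -1 points
  w <- c_pos - c_neg                             # normal vector of H
  m <- (c_pos + c_neg) / 2                       # midpoint lying on H
  list(w = w, m = m, c_pos = c_pos, c_neg = c_neg)
}
```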
Lastly, we calculate the distance $d_c$ between each data point in $L_i$ and the centroid $\mathbf{c}_{+1}$ (and likewise between each data point in $L_j$ and $\mathbf{c}_{-1}$), and the distance $d_h$ between each data point and the virtual hyperplane $H$:

$$d_c(\mathbf{x}_i) = \lVert \mathbf{x}_i - \mathbf{c}_{+1} \rVert, \qquad d_h(\mathbf{x}_i) = \frac{\lvert \mathbf{w}^{\top}(\mathbf{x}_i - \mathbf{m}) \rvert}{\lVert \mathbf{w} \rVert},$$

where $\mathbf{x}_i$ is the $i$th data point belonging to $L_i$. Figure 3 shows a conceptual description of $d_c$ and $d_h$ using the virtual hyperplane in a leaf. After calculating $d_c$ and $d_h$, we apply feature scaling to bring all values into the range between 0 and 1. Our observation is that data points lying close to the virtual hyperplane are likely to be support vectors, whereas data points lying close to the centroid are less likely to be support vectors. In light of these observations, we select data points lying near the hyperplane and far away from the centroid. For this purpose, we define the relative support distance $D$ so that it increases with the scaled centroid distance $d_c$ and decreases with the scaled hyperplane distance $d_h$. The larger $D$ becomes, the more likely the associated data point is a support vector.
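The paper's exact combination of the two scaled distances is defined in the corresponding equation; as a working assumption, the sketch below scores the points of the $+1$ region by the difference of the min-max scaled distances, which satisfies the stated monotonicity. It reuses the `virtual_hyperplane` output `hp` from the previous sketch and applies symmetrically, with $\mathbf{c}_{-1}$, to the $-1$ region:

```r
# Relative support distance for the +1-region points in X.
# Assumed form: scaled d_c minus scaled d_h (the paper defines the exact formula).
relative_support_distance <- function(X, hp) {
  d_c <- sqrt(rowSums(sweep(X, 2, hp$c_pos)^2))            # Euclidean distance to centroid c_{+1}
  d_h <- abs(as.vector(X %*% hp$w) - sum(hp$w * hp$m)) /
         sqrt(sum(hp$w^2))                                  # distance to the virtual hyperplane H
  scale01 <- function(v) (v - min(v)) / (max(v) - min(v))   # min-max scaling to [0, 1]
  scale01(d_c) - scale01(d_h)                               # larger value => more likely an SV
}
```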
The relationship between the support vectors and the distances $d_c$ and $d_h$ is illustrated in Figure 4, using the distinct adjacent region pair $L_2$ and $L_4$ from Figure 1. In Figure 4, the observations marked by circles are non-support vectors, while those marked by triangles (in red) are the support vectors obtained after training an SVM on all of the data. The distances $d_c$ and $d_h$ of $L_2$ relative to $L_4$ are shown in Figure 4a, and the corresponding relative support distance measures in Figure 4c. Likewise, those of $L_4$ relative to $L_2$ are shown in Figure 4b,d. We observe that the observations marked by triangles and surrounded by a red ellipsoid in Figure 1 correspond to the support vectors surrounded by a red ellipsoid in region $L_2$ in Figure 4a, and they have larger values of relative support distance, as shown in Figure 4c. Similarly, the observations marked by triangles and surrounded by a green ellipsoid in Figure 4b correspond to the support vectors surrounded by a green ellipsoid in region $L_4$ in Figure 1, and they also have larger values of relative support distance, as shown in Figure 4d. The support vectors are then obtained by collecting the observations with large values of relative support distance, for example by thresholding $D$, from both the pair of $L_2$ relative to $L_4$ and the pair of $L_4$ relative to $L_2$. The results reveal that observations with a mostly larger distance $d_c$ and shorter distance $d_h$ are likely to be support vectors.
For each region, we calculate the pairwise relative support distance with its distinct adjacent regions and select a fraction of the observations, denoted by the parameter $r$, in decreasing order of $D$ as a candidate set of support vectors. That is to say, for each region, we select the top $r$ fraction of the training data based on $D$. The parameter $r$ represents the proportion of selected data points and lies between 0 and 1. For example, when $r$ is set to 1, all data points are included in the training of SVMs; for a smaller $r$, we exclude a $1-r$ fraction of the data points and reduce the training dataset accordingly. Finally, we combine the relative support distance with random sampling: half of the training candidates are selected based on the proposed distance and the other half by random sampling. Though quite informative for selecting possible support vectors, the proposed distance is calculated locally within distinct adjacent regions. Random sampling compensates for this locality by contributing information on the whole data distribution.
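Under these definitions, the selection step might look as follows; the helper name and the 50/50 split implementation are our own reading of the description above:

```r
# Select the candidate set: half from the top of the relative support distance
# ranking, half by random sampling from the remaining observations.
select_candidates <- function(data, rsd, ratio) {
  n_keep <- ceiling(ratio * nrow(data))       # size of the reduced training set
  n_rsd  <- ceiling(n_keep / 2)               # half chosen by the proposed distance
  by_rsd <- order(rsd, decreasing = TRUE)[seq_len(n_rsd)]
  rest   <- setdiff(seq_len(nrow(data)), by_rsd)
  by_rnd <- sample(rest, n_keep - n_rsd)      # half chosen at random for global coverage
  data[c(by_rsd, by_rnd), , drop = FALSE]
}
```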
4. Experimental Results
In the experiments, we compare the proposed method, tree-based relative support distance (denoted by TRSD), with previously suggested methods, specifically the SVM with candidate vector selection, denoted by CVS [18], and the SVM with Fisher's linear discriminant analysis, denoted by FLD [25], as well as the standard SVM, denoted by SVM. For all compared methods, we use LibSVM [15], since it is one of the fastest implementations for training SVMs. The experiments are run on a computer with the following specifications: Core i5 3.4 GHz processor, 16.0 GB RAM, and the Windows 10 Enterprise operating system. The algorithms are implemented in the R programming language. We use datasets that come from the UCI Machine Learning Repository [35] and the LibSVM Data Repository [36], except for the checkerboard dataset [37]: a9a, banana, breast cancer, four-class, German credit, IJCNN-1 [38], iris, mushroom, phishing, Cod-RNA, skin segmentation, waveform, and w8a. The iris and waveform datasets are modified into binary classification problems by assigning one class as positive and the others as negative. Table 2 shows a summary of the datasets used in the experiments, where Size is the number of instances in the dataset and Dim is the number of features.
For testing, we apply three-fold cross-validation, repeated three times with different seeds: each dataset is shuffled and divided into three parts, of which two parts are used as the training dataset and the remaining part as the testing dataset. We use the RBF kernel for training SVMs in all tested methods. For each experiment, cross-validation and grid search are used for tuning the two hyper-parameters: the penalty factor $C$ and the RBF kernel parameter $\gamma$ in Equation (3). The hyper-parameters are searched over a two-dimensional grid of candidate values for $C$ and $\gamma$, with the $\gamma$ grid scaled by the number of features $m$. Table 3 shows the values used for each dataset in the experiments. Moreover, we vary the fraction $r$ of data points selected in each region over an evenly spaced grid of values.
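As an illustration of the tuning procedure only, the sketch below runs a grid search with three-fold cross-validation via e1071, whose svm() wraps LIBSVM, on the binarized iris data iris2 from the earlier sketch; the grids are placeholders, not the values of Table 3:

```r
# Illustrative grid search for (C, gamma) with 3-fold cross-validation.
library(e1071)

m <- 2                                     # number of features in iris2 (x1, x2)
tuned <- tune.svm(y ~ x1 + x2, data = iris2,
                  kernel = "radial",
                  cost   = 2^(-2:6),       # placeholder grid for the penalty factor C
                  gamma  = 2^(-4:2) / m,   # placeholder gamma grid scaled by m
                  tunecontrol = tune.control(cross = 3))
tuned$best.parameters                      # selected (gamma, cost) pair
```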
We compare the performance of TRSD, CVS, FLD, and SVM in terms of classification accuracy and training time (in seconds), summarized in Table 4. We also depict a performance comparison of the proposed TRSD with CVS and FLD on the five largest datasets in Figure 5, where a log-2 scale is used for the y-axis in Figure 5b. In Table 4, Acc is the accuracy on the test data, SD is its standard deviation, and Time is the training time in seconds. Even though the accuracy of the proposed algorithm is slightly degraded in a few cases, it is higher than that of CVS and FLD in most cases. In addition, as $r$ grows, the accuracy of the proposed algorithm improves substantially. For small datasets, there is no significant improvement in computation time compared to the standard SVM, since those datasets are already small enough. However, the training time of TRSD improves considerably on the large-scale datasets.
For statistical analysis, we also performed the Friedman test to determine whether significant differences exist between the compared methods in terms of accuracy. When the null hypothesis of the Friedman test was rejected, we performed Dunn's test. Table 5 summarizes the Dunn's test results at the chosen significance level. In Table 5, the three entries per comparison denote, respectively, that: (1) the performance of TRSD is significantly better than that of the compared method; (2) there is no significant difference between the two; and (3) the performance of TRSD is significantly worse. Each number in Table 5 is a count of datasets. At the smallest tested value of $r$, our proposed method is significantly better than CVS and FLD in 10 and 9 cases among 18 datasets, and these numbers increase to 11 and 10 at larger $r$. On the other hand, our proposed method is significantly worse than CVS in only 2, 3, and 3 cases, and worse than FLD in 0, 1, and 0 cases, at the three tested values of $r$, respectively. Based on these observations, we conclude that our proposed method effectively reduces the training datasets while producing better performance than the existing data reduction approaches.
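For reference, such an analysis can be run in R roughly as follows. The accuracy matrix here is synthetic and the package choice is ours, since the paper does not specify its implementation; note also that applying Dunn's test to the pooled values is a simplification that ignores the dataset blocking used by the Friedman test:

```r
library(dunn.test)   # one of several packages providing Dunn's test

# Synthetic accuracy matrix: rows = 18 datasets, columns = 4 methods.
set.seed(1)
acc <- matrix(runif(18 * 4, 0.7, 1.0), nrow = 18,
              dimnames = list(NULL, c("TRSD", "CVS", "FLD", "SVM")))

ft <- friedman.test(acc)                   # rows are treated as blocks (datasets)
if (ft$p.value < 0.05) {
  dunn.test(x = as.vector(acc),
            g = rep(colnames(acc), each = nrow(acc)),
            method = "bonferroni")         # pairwise post-hoc comparisons
}
```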
Finally, to compare the training times of TRSD, CVS, FLD, and SVM in detail, we divide the training time into two parts, the time for selecting candidate vectors and the time for training the final SVM model, for a fixed $r$, as summarized in Table 6. From Table 6, we notice that the tree-based TRSD and FLD take longer than CVS to select candidate vectors on the w8a dataset. This is because the time complexity of building a decision tree is $O(mn\log n)$, where $m$ is the number of features and $n$ is the size of the training dataset. However, our proposed method takes less time than FLD, since calculating the relative support distance is computationally cheaper than Fisher's linear discriminant, and its total time is the fastest overall. The results in Table 6 show that the proposed method efficiently selects support vector candidates while maintaining good classification performance.