1. Introduction
Areas such as image classification and bioinformatics can greatly profit from the use of artificial intelligence algorithms such as the Support Vector Machine (SVM). However, because the SVM is computationally costly, software implementations often do not provide sufficient performance to meet the time requirements of large amounts of data [1,2]. Hardware platforms such as Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) can be used to increase the performance of SVM implementations, resulting in a higher number of samples evaluated per second (throughput) compared to General-Purpose Processors (GPPs). These platforms have been used for real-time and massive data mining applications, also called Mining of Massive Datasets. Although GPUs have higher performance than GPPs, FPGAs can provide a better alternative, as they can achieve similar computational power with lower energy consumption [1,2,3].
A widely used method for SVM training is Sequential Minimal Optimization (SMO), a technique that simplifies the quadratic optimization problem involved in calculating the weights of the SVM [4]. In previous works, several classes of SVMs were implemented in FPGA, mainly relying on the SMO algorithm for training.
Reference [5] implements the inference step of an SVM in FPGA for large datasets. As a form of validation, the MNIST dataset is used, achieving accelerations of up to 8× compared to GPU and FPGA implementations. The implementation proposed in [6] develops a cascade SVM, based on an earlier system developed by the same authors. With this new implementation, the authors achieved a 25% reduction in the number of logic elements used, a 2× increase in performance and a 20% reduction in peak power in relation to their previous work.
Gradient-based methods can also be used to implement the SVM training step. However, some of these methods depend on the analysis of the complete training set before performing an update of the network weights. To improve SVM scalability regarding the size of the dataset, Stochastic Gradient Descent (SGD) algorithms are used as a simplified procedure for evaluating the gradient of a function [7].
It is still possible to contribute to the literature by exploring the use of the FPGA to implement SVMs trained with the SGD algorithm. Currently, works focus on large-scale SGD implementations, as in [8], which integrates the Spark framework, making use of an SGD implementation in FPGA to accelerate the training step of a linear SVM. For validation, the implementation was used for the classification of 7500 cancer-cell images of size 256 × 256, achieving an increase in throughput of up to 2× compared to an implementation on a cluster of computers.
Other approaches focus on building a scalable SGD accelerator [9]. That work discusses the impact of quantization on statistical and hardware efficiency, and shows how the use of stochastic quantization and fixed-point arithmetic can lead to better convergence than naive quantization. The implementation presented uses an embedded CPU that transmits the training samples to the FPGA.
In [10], the authors developed a hardware implementation of the HogBatch algorithm based on systolic arrays, using an extension of SGD that allows batching. Reference [11] evaluates the impact of low-precision, asynchronous SGD on communication and computation time using both CPU and FPGA implementations. Moreover, they present a new model to describe SGD implementations called Dataset, Model, Gradient, Communications (DMGC).
This paper presents a proposal to implement parallel Support Vector Machines, trained with the stochastic gradient descent technique, in reconfigurable hardware using an FPGA. To reach higher throughput, hardware resource optimization techniques, such as different numerical representations, are explored. Once the implementations have been made, an analysis of the occupied resources and other hardware performance parameters is performed. The experimental results are obtained with a Xilinx Virtex-6 XC6VCX240T-1 FPGA.
The rest of the paper is organized as follows.
Section 2 describes how the SVM has been implemented and the equations that describe it.
Section 3 defines the computational platforms used, how the SVM training was performed and how the results were obtained.
Section 4 analyses and compares the results obtained by the implementations in software and hardware.
Section 5 compares the results obtained with the state of the art regarding throughput.
Section 6 provides an overview of what was discussed and summarizes the contributions of the paper.
2. Project Description
A high-level description of the system can be made based on three main structures: the Gaussian Module (GM), the Aggregation Module (AM) and the Training Unit (TU), as shown in Figure 1. The GM is responsible for mapping the input into a different representation space, easing the classification of non-linearly separable patterns. The AM receives the GM outputs and, based on its weights, maps them to one of the possible labels. Finally, the TU implements the SGD algorithm and adjusts the AM weights. In the following subsections, the modules are detailed. The structures GM, AM and TU process all the information in each $n$-th sample time, $t_s$; in other words, at every $t_s$ seconds there is a new output $y[n]$ from the $N$ inputs $\{x_i[n]\}$, and the $K$ weights $\{w_j[n]\}$ are updated. As this proposal is a fully parallel implementation, the sample time, $t_s$, is also the iteration time and $n$ indicates the current iteration. The SVM throughput, $R_s$, in samples per second (sps) or in iterations per second (ips), can be expressed as
$$R_s = \frac{1}{t_s}. \qquad (1)$$
2.1. Gaussian Module (GM)
In this module, the entries are mapped into another representation space (see Figure 1) to facilitate the classification of non-linearly separable patterns by the SVM. The mapping is done through the use of kernels, allowing the $N$-dimensional input space to be transformed into a $K$-dimensional space. In this work, the Gaussian kernel, also called the Radial-Basis Function (RBF) kernel, was implemented, which can be expressed as
$$g_j[n] = \exp\!\left(-\frac{d_j[n]}{2\sigma^2}\right), \qquad d_j[n] = \sum_{i=1}^{N}\left(x_i[n] - c_{ij}\right)^2, \qquad (2)$$
where $c_{ij}$ is the $j$-th center of the Gaussian function associated with the $i$-th input, $x_i[n]$ the $i$-th input, $\sigma^2$ the variance of each kernel function, $d_j[n]$ the $j$-th squared distance from the input to the centers, $g_j[n]$ the $j$-th output of the GM in the $n$-th iteration and $b$ the bias.
To maximize the throughput of the system and reduce the computation time of Equation (2), most operations are executed in parallel, with the exception of the summation, which is executed using an adder tree. To reduce resource consumption and decrease execution time in the FPGA, the exponential function is implemented through Look-Up Tables (LUTs), where each address of the LUT stores a value of the exponential. This implementation choice implies a delay of one sample in the algorithm execution. However, the impact on throughput is lower than if the exponential function were implemented using specific hardware circuitry.
The $j$-th distance $d_j[n]$, shown in Equation (2), is computed in parallel as in Figure 2. It is important to note that all arithmetic operations are performed in parallel, which increases the throughput of the design.
Having the value of $d_j[n]$, the calculation of Equation (2) is done using a Read-Only Memory (ROM) composed of LUTs of depth 9. The circuit that converts the distance into memory addresses is illustrated in Figure 3, where $d_{\max}$ is the upper edge of the function domain and the normalizer is the circuit responsible for converting the distance metric into a memory address. To discretize the Gaussian, a maximum distance of 4 and a minimum of 0 were considered. Within that range the Gaussian was evaluated at steps of 0.01 and the values were stored in the ROM, which resulted in 401 stored values. The normalizer then divides the output of the mux by the discretization step.
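As an illustration of this LUT-based scheme, the Python sketch below emulates the ROM and the normalizer under the assumptions stated above (distance range [0, 4], step 0.01, 401 entries); the kernel variance and the centers are illustrative values, not taken from the synthesized design, and the exponential follows the form of Equation (2).
```python
import numpy as np

# Emulation of the GM lookup scheme: 401 ROM entries covering distances 0..4 in
# steps of 0.01. SIGMA2 and the centers below are illustrative values only.
SIGMA2 = 0.5
STEP = 0.01
D_MAX = 4.0
ROM = np.exp(-np.arange(0.0, D_MAX + STEP / 2, STEP) / (2.0 * SIGMA2))  # 401 values

def gm_output(x, centers):
    """Approximate the Gaussian kernel of Equation (2) with the ROM.
    x: N-dimensional input; centers: (K, N) array, one row per kernel center."""
    d = np.sum((x - centers) ** 2, axis=1)   # squared distances, computed in parallel on the FPGA
    d = np.minimum(d, D_MAX)                 # saturate at the upper edge of the domain (mux in Figure 3)
    addr = np.round(d / STEP).astype(int)    # normalizer: distance -> ROM address
    return ROM[addr]                         # one LUT read per kernel output g_j[n]

centers = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print(gm_output(np.array([1.0, 1.0]), centers))
```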
2.2. Aggregation Module (AM)
After the $N$ entries are mapped into the new space, the AM is used to classify the mapped entries $g_j[n]$. In this implementation, only binary classification is possible, which consists of classifying the input vector into one of two possible classes [12]. The classes are represented by $1$ and $-1$. The classification is expressed as
$$y[n] = \sum_{j} w_j[n]\, g_j[n], \qquad (3)$$
where $w_j[n]$ is the neural weight associated with the $j$-th input and $y[n]$ the SVM output in the $n$-th iteration. The implementation of this equation in the FPGA is shown in Figure 4: the entries $g_j[n]$ are multiplied in parallel by their corresponding weights $w_j[n]$, and the results are then summed using an adder tree.
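A minimal Python sketch of this data path is given below; the pairwise reduction mirrors how an adder tree combines the partial products level by level, and the values are illustrative.
```python
import numpy as np

def tree_sum(values):
    """Sum by pairwise reduction, the way an adder tree combines partial sums level by level."""
    values = list(values)
    while len(values) > 1:
        if len(values) % 2:                # an odd element is forwarded unchanged to the next level
            values.append(0.0)
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def am_output(g, w):
    products = g * w                       # the K multiplications performed in parallel on the FPGA
    return tree_sum(products)              # the summation of Equation (3)

g = np.array([0.9, 0.1, 0.1, 0.9])         # GM outputs g_j[n] (illustrative)
w = np.array([1.0, -1.0, -1.0, 1.0])       # AM weights w_j[n] (illustrative)
print(am_output(g, w))                     # SVM output y[n]
```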
2.3. Training Unit (TU)
The TU implements the equations expressed as
$$w_j[n+1] = (1 - \eta\lambda)\, w_j[n] + \eta\, d[n]\, g_j[n], \quad \text{if } d[n]\, y[n] < 1, \qquad (4)$$
and
$$w_j[n+1] = (1 - \eta\lambda)\, w_j[n], \quad \text{if } d[n]\, y[n] \geq 1, \qquad (5)$$
where $d[n]$ is the desired output, the samples satisfying $d[n]\,y[n] \geq 1$ are the correctly classified results, $\eta$ is the learning rate and $\lambda$ is the regularization parameter. The variable $\lambda$ specifies how sufficient the training samples are to specify a solution to the problem: when $\lambda \to \infty$, it means that the data is not trustworthy, and the opposite holds otherwise [13].
For Equations (4) and (5), the loss function chosen was the hinge loss and the regularization is given by the $\ell_2$ method [14]. In this work, a variant of the Pegasos algorithm was implemented for training [15]. The designed circuit is presented in Figure 5.
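A minimal Python sketch of this update rule, assuming the Pegasos-style per-sample step written in Equations (4) and (5), is shown below; the learning rate and regularization values are illustrative.
```python
import numpy as np

def tu_update(w, g, d, eta=0.01, lam=0.001):
    """One SGD step on the hinge loss with L2 regularization (Pegasos-style).
    w: current weights, g: GM outputs, d: desired output in {-1, +1}."""
    y = np.dot(w, g)                       # AM output for the current sample
    w = (1.0 - eta * lam) * w              # shrinkage coming from the L2 regularizer
    if d * y < 1.0:                        # margin violated: hinge loss is active, Equation (4)
        w = w + eta * d * g
    return w                               # otherwise only the shrinkage applies, Equation (5)

w = np.zeros(4)
w = tu_update(w, g=np.array([0.9, 0.1, 0.1, 0.9]), d=1.0)
print(w)
```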
3. Methodology
Two datasets were used to validate the hardware design. The first was the XOR gate, with the SVM used to draw the ideal decision surface that divides the two classes correctly. In this design, the value of each $i$-th input, $x_i[n]$, in the $n$-th instant (or iteration) corresponds to an input of the XOR gate, and the desired output $d[n]$ to the output of the gate.
The second dataset is the Iris flower dataset, described in [16], where each input $x_i[n]$ corresponds to the length and width of the sepal and the petal. This dataset was created with the aim of classifying three types of flowers of the Iris family: Iris Setosa, Versicolor and Virginica. This implementation is restricted to classifying only Iris Setosa and Versicolor, because the SVM used is a binary classifier. For this reason, all data related to Iris Virginica was removed from the dataset.
The results of the hardware design were then compared to a software implementation in Python using the scikit-learn package [17]. Python and scikit-learn were chosen since they are widely used tools in Data Science and AI. The software implementation uses the double-precision floating-point format, while the hardware uses single-precision floating-point and fixed-point representations; the fixed-point representation uses 5 bits for the integer part and 20 bits for the fractional part. Although the software implementation does not use the single-precision floating-point format, according to [18] there is at most a 4× speedup between these numerical representations, depending on the instruction executed. With this experiment it was possible to acquire execution time data and the usage of FPGA hardware resources.
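As an illustration, the sketch below emulates this fixed-point format in Python by rounding to the nearest representable value; whether the sign bit is counted among the 5 integer bits is an assumption of the sketch.
```python
import numpy as np

FRAC_BITS = 20                 # fractional bits of the hardware fixed-point format
INT_BITS = 5                   # integer bits (assumed here to include the sign bit)

def to_fixed(x):
    """Quantize values to the 5.20 fixed-point grid and return them as floats."""
    scale = 1 << FRAC_BITS
    max_code = (1 << (INT_BITS + FRAC_BITS - 1)) - 1
    q = np.clip(np.round(np.asarray(x, dtype=float) * scale), -max_code - 1, max_code)
    return q / scale           # rounding error is at most 2**-21

print(to_fixed([0.1, -2.34567, 3.14159]))
```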
Before the results were obtained, the dataset was preprocessed, which consisted of separating the data into training and test sets and obtaining the centers of the kernels. In the case of the XOR gate, the training set is the same as the test set. For the Iris dataset, the samples were split randomly into test and training subsets by the train_test_split method, also present in the scikit-learn package.
To obtain the centers of the kernels, two methods were used. For the XOR dataset, the centers were set using knowledge of the problem to find the ideal values. In the case of the Iris dataset, the centers were calculated by the K-means algorithm, implemented by the KMeans class of the scikit-learn package. Once these data are obtained in the Python environment, they are saved in CSV format to be used by the hardware implementation.
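A minimal sketch of this preprocessing flow with scikit-learn is shown below; the number of centers and the random seeds are illustrative choices, not the values used in the paper.
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

# Keep only Iris Setosa (0) and Versicolor (1) and map the labels to {-1, +1}
X, y = load_iris(return_X_y=True)
mask = y < 2
X, y = X[mask], np.where(y[mask] == 0, -1, 1)

# Random split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Kernel centers obtained with K-means, then exported to CSV for the hardware design
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)
np.savetxt("centers.csv", kmeans.cluster_centers_, delimiter=",")
```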
To execute the implementation in Python, a system with a quad-core i7 processor running at 2.5 GHz, with 8 GB of RAM and Python 3.6, was used. The RBF_kernel class was employed to map the SVM entries using the RBF kernel, and the SGDClassifier class was used to adjust the SVM parameters based on the SGD algorithm. This approach was necessary because scikit-learn does not provide an implementation of a kernel-based SVM trained with the SGD algorithm. The CPU time was measured by taking the average of 10,000 executions of the algorithm, using the function_time function.
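Continuing the preprocessing sketch above, the fragment below reproduces this software pipeline under the assumption that the rbf_kernel function of sklearn.metrics.pairwise plays the role of the RBF mapping described in the paper; gamma, alpha and max_iter are illustrative values.
```python
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import SGDClassifier

# Explicit RBF features: one column per kernel center (the role of the GM)
gamma = 1.0
Phi_train = rbf_kernel(X_train, kmeans.cluster_centers_, gamma=gamma)
Phi_test = rbf_kernel(X_test, kmeans.cluster_centers_, gamma=gamma)

# Hinge loss + L2 penalty makes SGDClassifier a linear SVM trained by SGD (the role of AM + TU)
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, max_iter=1000)
clf.fit(Phi_train, y_train)
print("test accuracy:", clf.score(Phi_test, y_test))
```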
The hardware implementation was done with a Xilinx tool called System Generator [19]. With it, it is possible to create RTL circuits in Matlab Simulink and perform simulations with bit-level precision, making it possible to create highly efficient circuits and to easily interface with the Matlab programming environment. Our design was synthesized for a Xilinx Virtex-6 XC6VCX240T-1 FPGA containing 241,152 logic cells, 768 embedded multipliers and 37,680 slices [20], included on the ML605 evaluation board [21].
To validate the results of the software implementation, a confusion matrix was drawn, in which the horizontal axis represents the classes predicted by the SVM and the vertical axis the correct classes. The confusion matrix was generated from the results of both the training and test sets, making it possible to analyze the learning and generalization capacity of the SVM. Regarding the hardware implementations, the evolution of the instantaneous quadratic error during training was analyzed; to facilitate the visualization of the results, a window of 250 iterations is used.
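With scikit-learn, this validation amounts to the calls below (rows are the correct classes and columns the predicted ones), continuing the variables from the previous sketches.
```python
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_train, clf.predict(Phi_train)))   # learning capacity (training set)
print(confusion_matrix(y_test, clf.predict(Phi_test)))     # generalization capacity (test set)
```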
5. Comparison with the State of the Art
Although it was not possible to synthesize the design for a dataset with more than 4 features, to compare with the state of the art the results from the current datasets were extrapolated using a linear regression. The extrapolation was performed using data from the XOR dataset in both fixed-point and single-precision floating-point, as well as from the Iris dataset using single-precision floating-point. The regressor takes the number of kernels as input and outputs the execution time. In order to compare with a model with 758 inputs, for example, the input of our regressor was set to 758 and the output was compared with the execution time of the referenced model. Although it is not necessarily the case that one kernel is needed for each input, this draws a lower bound on our results; further studies may show that the actual performance differs from the results presented in this work. As presented in
Figure 12, the linear regression of the XOR dataset using floating-point precision had an $R^2$ score of , with the sample time, $t_s$, computed by the corresponding fitted equation. For the implementation on the XOR dataset using fixed-point precision, the $R^2$ value is , as shown in Figure 13, and the sample time of the design is given by its fitted equation. On the Iris dataset, using floating-point precision, an $R^2$ value of was achieved, as depicted in Figure 14, with its corresponding fitting curve.
The comparisons with the state of the art are summarized in Table 10.
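A sketch of the extrapolation procedure is given below: a linear regression is fitted on measured (number of kernels, sample time) pairs and queried at the feature count of the referenced work. The design points used here are placeholders, not the synthesis results of this paper.
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder design points: number of kernels vs. measured sample time (microseconds)
kernels = np.array([[2], [4], [8], [16]])
t_sample = np.array([0.05, 0.09, 0.17, 0.33])

reg = LinearRegression().fit(kernels, t_sample)
print("R^2 on the design points:", reg.score(kernels, t_sample))
print("extrapolated sample time for 758 kernels (us):", reg.predict([[758]])[0])
```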
FPGAs are also being used to accelerate algorithms in association with clusters of computers, as shown in [8]. That work presents a hardware implementation of a linear SVM accelerator in FPGA, used to analyse a dataset of cancer images with 65,536 features per image. The authors' proposal divides the workload between 8 computers, each having an associated FPGA to compute the SVM. The design presented on the FPGA is serial, as it can compute at most 64 features of an image per iteration, thus requiring 1024 iterations to analyse the complete set of features. Every image is considered a sample and requires 8 ms (125 sps) to execute. The proposal presented in this paper would need 1.35 ms (741 sps) using the results from the XOR floating-point implementation, 63.3 μs (15,797 sps) considering the XOR fixed-point and 0.35 ms (2857 sps) with the extrapolation from the Iris floating-point implementation, resulting in speedups of 5.9×, 126× and 22.9×, respectively.
The work of [9] analyzes the impact of stochastic quantization on SGD convergence and on the speedup obtained in reconfigurable hardware using FPGA. The authors studied the impact on the task of classifying images of handwritten digits using the Gisette dataset, which contains 5000 features per image. The model used to analyse the digits was described as a dense linear model; the design with the best throughput uses fixed-point precision and requires 1.56 ms to compute a sample (641 sps). The input data was quantized to 1-bit numbers and 128 features are analysed at a time, requiring about 39 iterations to compute a sample. In comparison, the proposal presented in this paper would need 103 μs to compute a sample, yielding 9708 sps and a speedup of 15.1× if the XOR floating-point implementation is considered; 4.9 μs (204,918 sps) with a speedup of 319.7× for the XOR fixed-point; and 29.5 μs (33,898 sps) with a speedup of 52.9× using the results from the Iris floating-point implementation.
In [10], a hardware implementation on FPGA of the HogBatch algorithm for SGD computation is presented, which employs systolic arrays for improved scalability and performance. The results were obtained with the RCV1-V1-test dataset, which contains 47,236 features, using logistic regression as the model; the cited proposal requires at least 13.9 ms with 32 Processing Elements (PEs) to compute each sample, resulting in the analysis of 72 sps. The proposal presented in this paper would need 971 μs (1029 sps) based on the results from the XOR floating-point implementation, 45.7 μs (21,881 sps) if the XOR fixed-point implementation is considered and 278 μs (3597 sps) with the Iris floating-point, generating speedups of 14.3×, 303.9× and 50×, respectively.
In [11], the use of low-precision data is evaluated regarding throughput and statistical efficiency for SGD applied to linear models on FPGAs and CPUs. The authors present throughput and statistical efficiency results for different numbers of bits on linear regression. From the results shown in that work, the design most similar to this paper uses 32-bit fixed-point hardware. The design consists of a linear-regression SGD capable of processing 3 Giga Numbers Per Second (GNPS) using a model of 48 kB, with a sample of 1500 features executed in 16 μs (62,500 sps). The proposal presented in this paper presents a speedup of 0.52×, needing 30.9 μs per sample (32,362 sps), using the results from the XOR floating-point implementation; 1.5 μs (666,666 sps) with a speedup of 10.7× considering the XOR fixed-point; and 8.9 μs (112,359 sps), with a speedup of 1.8×, with the extrapolation from the Iris floating-point implementation.
6. Conclusions
This work presented a parallel implementation in FPGA of the SVM algorithm using SGD as the training method. The main purpose of this implementation was to achieve a high rate of processed data in order to meet the demands of computationally intensive applications. For this reason, all possible calculations were parallelized. Finally, from the observations made using the synthesis results, it was possible to note that the implementation of this technique in hardware allows significant improvements in performance, resulting in an acceleration of more than 10,000× compared to a software implementation, and up to 319× compared to the state of the art. For this reason, it can be used both in systems that require low response times, such as autonomous cars, and in massive data mining.
In the future, the impact on numerical precision and execution time per sample of some variations in the design will be studied. One possibility is the computation of the exponential function in hardware instead of LUTs. Another contribution we aim to make is the use of variants of SGD, such as those with variable step size and momentum.