Article

MLoF: Machine Learning Accelerators for the Low-Cost FPGA Platforms

1 VeriMake Innovation Lab, Nanjing Renmian Integrated Circuit Co., Ltd., Nanjing 210088, China
2 Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
3 Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514, USA
4 National ASIC System Engineering Technology Research Center, Southeast University, Nanjing 210096, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(1), 89; https://doi.org/10.3390/app12010089
Submission received: 2 November 2021 / Revised: 4 December 2021 / Accepted: 16 December 2021 / Published: 22 December 2021
(This article belongs to the Special Issue Hardware-Aware Deep Learning)

Abstract

In Internet of Things (IoT) scenarios, it is challenging to deploy Machine Learning (ML) algorithms on low-cost Field Programmable Gate Arrays (FPGAs) in a real-time, cost-efficient, and high-performance way. This paper introduces Machine Learning on FPGA (MLoF), a series of ML IP cores implemented on low-cost FPGA platforms, which aims to help more IoT developers achieve well-rounded performance across various tasks. Using Verilog, we deploy and accelerate Artificial Neural Networks (ANNs), Decision Trees (DTs), K-Nearest Neighbors (k-NNs), and Support Vector Machines (SVMs) on 10 different FPGA development boards from seven producers. Additionally, we analyze and evaluate our design with six datasets, and compare the best-performing FPGAs with traditional SoC-based systems including NVIDIA Jetson Nano, Raspberry Pi 3B+, and STM32L476 Nucleo. The results show that Lattice’s ICE40UP5 achieves the best overall performance with low power consumption; on this device, MLoF reduces power by 891% and increases performance by 9 times on average. Moreover, its Cost Power Latency Production (CPLP) outperforms SoC-based systems by 25 times, which demonstrates the significance of MLoF in endpoint deployment of ML algorithms. Furthermore, we make all of the code open-source in order to promote future research.

1. Introduction

Machine Learning (ML) algorithms process Internet of Things (IoT) endpoint data effectively, efficiently, and robustly [1]. As data volumes grow, IoT endpoint ML implementations have become increasingly important. Compared to the traditional cloud-based approaches, they can compute in real time and reduce the communication overhead [2]. Several efforts in industry and academia deploy ML algorithms on System on Chip (SoC) based endpoint devices. For example, TensorFlow Lite [3], X-CUBE-AI [4], and the Cortex Microcontroller Software Interface Standard Neural Network (CMSIS-NN) [5] are three frameworks proposed by Google, STM, and ARM for running pre-trained models in embedded systems. However, these solutions cannot simultaneously balance power consumption, cost efficiency, and performance for IoT endpoint ML implementations.
Field Programmable Gate Array (FPGA) devices have an inherently parallel architecture that makes them suitable for ML applications [6]. Moreover, some FPGAs have costs and leakage power close to those of Microcontroller Units (MCUs). Therefore, FPGAs can be an ideal target platform for IoT endpoint ML implementations. Some research and vendor support already exists for deploying ML algorithms on FPGAs. For instance, the ongoing AITIA project, led by the Technical University of Dresden, KUL, IMEC, VUB, etc. [7], is a preliminary project that investigates the feasibility of ML implementations on FPGAs. In industry, Xilinx and Intel provide a number of machine learning Intellectual Property (IP) cores [8,9,10,11]. Unfortunately, these solutions target high-end FPGAs and demand a high level of expertise, which makes it difficult for many small and medium-sized corporations and individual developers to deploy them on their hardware platforms. Therefore, an ML library that can run on any FPGA platform is needed [12], especially on low-cost FPGAs. Here, low-cost FPGAs are not those with the lowest absolute price among all FPGAs; instead, the term refers to the lowest-priced FPGA series of the most representative manufacturers. Meanwhile, the lack of comprehensive comparisons makes it difficult to demonstrate the benefits of FPGA ML implementations over conventional SoC-based solutions.
Therefore, we introduce Machine Learning on FPGA (MLoF), a series of IP cores dedicated to low-cost FPGAs. The MLoF IP cores are developed in the Verilog Hardware Description Language (HDL) and can be used to implement popular machine learning algorithms on FPGAs, including Support Vector Machines (SVMs), K-Nearest Neighbors (k-NNs), Decision Trees (DTs), and Artificial Neural Networks (ANNs). The performance of low-cost platforms from seven FPGA producers (Anlogic ×2, Gowin ×1, Intel ×2, Lattice ×2, Microsemi ×1, Pango ×1, and Xilinx ×1) is thoroughly evaluated. As far as we know, MLoF is the first work to implement machine learning algorithms on nearly every low-cost FPGA platform. Compared with the typical way of implementing machine learning algorithms on embedded systems, including NVIDIA Jetson Nano, Raspberry Pi 3B+, and STM32L476 Nucleo, the advantage of MLoF is that it balances cost, performance, and power consumption. Moreover, these IP cores are open-source, which helps developers and researchers implement machine learning algorithms on their endpoint devices more efficiently.
The contributions of this paper are as follows:
(1) To the best of the authors’ knowledge, this is the first time that four ML hardware accelerator IP cores, covering SVM, k-NN, DT, and ANN, have been developed in Verilog HDL. The source code of all IP cores is fully disclosed at github.com/verimake-team/MLonFPGA;
(2) The proposed IP cores are deployed and validated on 10 mainstream low-cost and low-power FPGAs from seven producers to show their broad compatibility;
(3) Our designs are comprehensively evaluated on FPGA boards and embedded system platforms. The results show that low-cost FPGAs are ideal platforms for IoT endpoint ML implementations.
The rest of this paper is organized as follows: Section 2 reviews prior work on the hardware implementations of machine learning algorithms. Section 3 introduces details of the proposed MLoF IP series. Section 4 provides a comparison of various FPGA development platforms. Section 5 contains experiments and analyses. Section 6 summarizes the results and outlines future work.

2. Related Work

For decades, the implementation of machine learning algorithms on low-cost embedded systems has been vigorously investigated. Due to the limited computing resources on embedded systems, these approaches tend to lack outstanding performance. FPGAs, however, are an effective solution for machine learning algorithms [13]. Saqib et al. [14] proposed a decision tree hardware architecture based on FPGA. It improves data throughput and resource utilization efficiency by utilizing parallel binary decision trees and a pipeline. With an 8-parallelism, 4-stage pipeline, a 3.5× computing speedup is achieved while only 2952 Look-up Tables (LUTs) are consumed. Nasrin Attaran et al. [15] proposed a binary classification architecture based on SVM and k-NN. It obtains over 10× the computing speed and a 200× improvement in power-delay compared with the ARM A53. Gracieth et al. [16] proposed a 4-stage, low-power SVM pipeline architecture capable of achieving 98% accuracy on over 30 classification tasks. It consumes only 1315 LUTs and operates at a system frequency of 50 MHz. The aforementioned works introduce the deployment of ML algorithms on FPGA platforms, but two deficiencies remain: (1) none of the works integrates the mainstream ML algorithms for FPGA deployment; (2) none of the works compares the final results side by side with other embedded platforms used in various IoT terminals.
Due to the high inherent parallelism in FPGAs, more advanced machine learning algorithms, such as neural networks, are widely studied. Roukhami et al. [17] proposed a Deep Neural Network (DNN) based architecture for classification tasks on low-power FPGA platforms. They thoroughly compared the performance with an STM32 ARM MCU and designed general communication interfaces for accelerators, such as SPI and UART. The entire acceleration process consumes only 25.459 mW of power with a latency of 1.99 s. Chao Wang et al. [18] proposed a Deep Learning Accelerating Unit (DLAU) that accelerates neural networks, e.g., DNN and CNN. Additionally, they developed an AXI-Lite interface for the acceleration unit to enhance its versatility. In general, the DLAU outperforms Cortex-A7 by 67%. Fen Ge et al. [19] developed a resource-constrained CNN accelerator for IoT endpoint SoCs that does not require any DSP resources. With a total resource overhead of 4901 LUTs, the data throughput reaches 6.54 GOPS (Giga Operations Per Second). The above works present neural network-based FPGA deployments, but none of them is designed for low-cost FPGAs, which are often the most prevalent platforms used for IoT endpoints.
Although the previous works succeed in increasing the capability of endpoint computing by implementing only one or two machine learning algorithms on FPGAs, further analyses and comprehensive comparisons across low-cost FPGA platforms, as well as the integration of more commonly used machine learning algorithms are still required.

3. Machine Learning Algorithms Implementation on Low-Cost FPGAs

Normally, the post-processing of IoT data focuses on two tasks: classification and regression [20]. By exploiting the parallelism and low power consumption of FPGAs, MLoF offers a superior solution for these workloads. Drawing on past designs, MLoF is designed to use fewer computing resources, and it includes a variety of common machine learning algorithms, namely ANN, DT, k-NN, and SVM. Details are presented in this section.
As shown in Figure 1, the system consists of the training module and the MLoF module. Because most machine learning models in IoT devices are used for inference and evaluation without requiring extensive training [21], the entire model training process is completed on a PC. First, the IoT dataset is collected from the endpoints and sent to the training module. Then, an ML library (e.g., Scikit-Learn or TensorFlow Lite) [22] and an ML algorithm are chosen as the first level of parameters. Thereafter, a set range of hyperparameters (the same set as in the MLoF module) is used to constrain the PC training process, as an FPGA has limited local resources. Since hyperparameters are key features in ML algorithms [23,24,25], users can set them to different values to find the best set for training according to Table 1. Next, with all the labeled data and hyperparameters, the best model and the best parameters (including the updated hyperparameters) are generated and sent to the MLoF module. After the parameters are received and stored in ROM or external Flash, the algorithm is deployed on an FPGA. Because the IP cores are written in Verilog HDL, the algorithms can be deployed on virtually any FPGA platform.
In this paper, we select six datasets and the four most representative ML algorithms and deploy them on 10 low-cost FPGAs, for a total of 240 combinations. The experiments and evaluations are described in detail in Section 5.
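The hand-off of trained parameters from the PC to on-chip ROM could be sketched as follows. The 16-bit Q8.8 fixed-point format, the function names, and the one-hex-word-per-line layout (the text format that Verilog's `$readmemh` reads) are assumptions made for illustration; the paper does not state the bit width or file format used by MLoF.

```python
def to_fixed(value, frac_bits=8, width=16):
    """Quantize a float to a two's-complement word (Q8.8 by default, an assumption)."""
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (width - 1))
    highest = (1 << (width - 1)) - 1
    scaled = max(lowest, min(highest, scaled))  # saturate instead of wrapping
    return scaled & ((1 << width) - 1)          # encode negatives as two's complement


def to_hex_lines(params, frac_bits=8, width=16):
    """Render parameters as hex words, one per line, for a ROM initialization file."""
    digits = width // 4
    return [f"{to_fixed(p, frac_bits, width):0{digits}x}" for p in params]


if __name__ == "__main__":
    # Hypothetical trained weights, not values from the paper.
    print("\n".join(to_hex_lines([0.5, -0.5, 1.25])))
```

Saturating rather than wrapping keeps an out-of-range weight at the largest representable value, which is usually the less harmful failure mode for inference.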

3.1. Artificial Neural Networks (ANN)

3.1.1. Overall Structure of ANN

The implementation of an ANN with, for example, eight inputs and two hidden layers (eight neurons within each layer) is shown in Figure 2a. This ANN model includes a Memory Unit (Mem), a Finite State Machine (FSM), eight Multiply-Accumulate (MAC) computing units, Multiplexers (MUX), an Activation Function Unit (AF), and a Buffer Unit (BUFFER). The Memory Unit (Mem) stores the weights and biases after training. The FSM manages the computation order and the data stream. The MAC units are built from multipliers, adders, and buffers for the multiply and add operations within each neuron. Here, we implement eight MAC blocks to process the multiplications of the eight neurons in parallel. Initially, the features are entered serially and registered. Then, the MACs sequentially process the first hidden layer from the first input feature to the eighth (Figure 2b). Next, the second hidden layer is sequentially processed from the first hidden neuron to the eighth neuron. Finally, the output is processed by the activation function. As demonstrated in Figure 3, the multiplexers (MUX) allocate the data stream. The AF computes the activation functions, which are discussed in Section 3.1.2. The buffer stores the data computed by each neuron.
The entire procedure of the hardware-based ANN architecture is described below. First, eight features are input serially to the ANN model. Each feature is fed to the eight MAC units simultaneously, and the multiplications for the eight neurons in the first layer are completed. An inner buffer stores the multiplication results. Next, all eight results are passed to a user-specified activation function. The output is stored in the buffer unit as the input of the next hidden layer. Finally, the results are exported from the output layer following the second hidden layer.
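The dataflow described above can be modeled in software as a rough behavioral sketch. The function names and the choice to preload each accumulator with the bias are assumptions; the inner loop over neurons stands in for the eight MAC units that operate in parallel in hardware.

```python
def relu(x):
    return x if x > 0 else 0


def dense_layer(features, weights, biases, activation=relu):
    """One fully connected layer; weights[n][i] connects input i to neuron n."""
    acc = list(biases)  # one accumulator per MAC unit (bias preload is an assumption)
    for i, x in enumerate(features):   # features arrive serially, one per step
        for n in range(len(acc)):      # in hardware, all MAC units run this step at once
            acc[n] += weights[n][i] * x
    return [activation(a) for a in acc]  # activation unit, results go to the buffer


def ann_forward(features, layers):
    """Chain layers; each layer is a (weights, biases) pair."""
    data = features
    for weights, biases in layers:
        data = dense_layer(data, weights, biases)
    return data
```

A two-input example: `dense_layer([1, 2], [[1, 1], [-1, -1]], [0, 0])` accumulates 3 and -3, and ReLU maps them to `[3, 0]`.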

3.1.2. Activation Function

An activation function is required within each neuron. It introduces non-linearity into the ANN, resulting in better abstraction capability. Three typical activation functions are the Rectified Linear Unit (ReLU), the Hyperbolic Tangent Function (Tanh), and the Sigmoid Function [26]. All three are implemented in hardware, as described below.

Rectified Linear Unit (ReLU)

The mathematical representation of the ReLU is described as Equation (1):
$$\mathrm{ReLU}(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases} \tag{1}$$
The hardware implementation, which uses a comparator and a multiplexer [27], is shown in Figure 4.
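As an illustration of the comparator-plus-multiplexer structure, the sketch below applies ReLU to a two's-complement word. The 16-bit width and the use of the sign bit as the comparator output are assumptions, not details taken from the paper.

```python
def relu_fixed(word, width=16):
    """ReLU on an unsigned integer holding a two's-complement value."""
    sign_bit = (word >> (width - 1)) & 1  # comparator: is the value negative?
    return 0 if sign_bit else word        # multiplexer: select 0 or the input
```

For example, `relu_fixed(0xFF00)` (a negative word) yields 0, while `relu_fixed(0x0100)` passes through unchanged.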

Hyperbolic Tangent Function (Tanh)

The mathematical representation of the Tanh function is described as Equation (2):
$$\mathrm{Tanh}(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2}$$
This function cannot be implemented directly in hardware using HDL. Therefore, we fit it piecewise [28]. We divide the interval [0, +∞) into five sub-intervals: [0, 1], (1, 2], (2, 3], (3, 4], and (4, +∞). Table 2 contains the heuristic functions used to fit the Tanh function on each sub-interval. The fit on each sub-interval is shown in Figure 5, with the error kept within an acceptable range. The sub-intervals enable the Tanh function to be implemented using only adders and multipliers.
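The piecewise idea can be sketched with a stand-in fit. The coefficients below come from straight-line interpolation between the sub-interval endpoints and are not the heuristic functions of Table 2; the odd-symmetry handling of negative inputs and the saturation to 1 beyond 4 are likewise assumptions.

```python
import math

BREAKS = [0.0, 1.0, 2.0, 3.0, 4.0]

# One (slope, intercept) pair per sub-interval, from the chord between its endpoints.
SEGMENTS = []
for a, b in zip(BREAKS[:-1], BREAKS[1:]):
    slope = (math.tanh(b) - math.tanh(a)) / (b - a)
    SEGMENTS.append((slope, math.tanh(a) - slope * a))


def tanh_piecewise(x):
    """Approximate tanh with one multiply and one add per evaluation."""
    sign = -1.0 if x < 0 else 1.0
    ax = abs(x)                      # tanh is odd, so fit only the non-negative half
    if ax > 4.0:
        return sign * 1.0            # last sub-interval: saturate
    slope, intercept = SEGMENTS[min(int(ax), 3)]
    return sign * (slope * ax + intercept)
```

With these linear segments the worst-case error is below 0.1, on the first sub-interval; a higher-order fit per sub-interval would tighten it at the cost of more multipliers.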

Sigmoid Function

The mathematical representation of the Sigmoid function is described as Equation (3):
$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{3}$$
Like the Tanh function, the Sigmoid function cannot be implemented directly in hardware using HDL. However, the Sigmoid function can be expressed in terms of the Tanh function [29] through the transformation in Equation (4):
$$\mathrm{Sigmoid}(x) = \frac{\mathrm{Tanh}\left(\frac{x}{2}\right) + 1}{2} \tag{4}$$
This transformation can be implemented easily in hardware using shift operations and an adder on top of the Tanh function.
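A quick numerical check of Equation (4) is sketched below. The function name and the pluggable `tanh` argument are illustrative assumptions; in hardware, the two divisions by 2 would be one-bit right shifts and the `+ 1` an adder.

```python
import math


def sigmoid_from_tanh(x, tanh=math.tanh):
    """Sigmoid built from a tanh block, following Equation (4)."""
    half_x = x / 2.0                    # right shift by one bit in fixed point
    return (tanh(half_x) + 1.0) / 2.0   # adder, then another one-bit shift


if __name__ == "__main__":
    for x in (-2.0, 0.0, 2.0):
        direct = 1.0 / (1.0 + math.exp(-x))
        print(x, sigmoid_from_tanh(x), direct)
```

Passing an approximate Tanh implementation as the `tanh` argument shows how its fitting error carries over to the Sigmoid output, halved by the final shift.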

3.2. Decision Tree (DT)

Figure 6 illustrates the implementation of DT with multiple inputs, a depth of four, and a maximum of eight nodes on each layer. The DT consists of a Memory Unit (ROM), a Finite State Machine (FSM), eight compare units, and a dispatcher. The memory unit stores the node parameters obtained from PC training. The FSM determines which node to evaluate next based on the comparison output. Each compare unit serves as a decision node. The dispatcher distributes the input to each node.
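The traversal performed by the FSM and compare units might look like the following behavioral sketch. The node-table layout and the "go left when feature <= threshold" convention (the one scikit-learn uses) are assumptions; the paper does not describe the ROM format.

```python
def dt_predict(nodes, features, root=0):
    """Walk a decision tree stored as a flat table, as a ROM would hold it.

    Internal node: (feature_index, threshold, left_child, right_child)
    Leaf node:     ("leaf", label)
    """
    idx = root
    while True:
        node = nodes[idx]                 # FSM fetches the current node's parameters
        if node[0] == "leaf":
            return node[1]
        feature_index, threshold, left, right = node
        # Compare unit: the comparison result selects the next node.
        idx = left if features[feature_index] <= threshold else right


if __name__ == "__main__":
    # Hypothetical depth-2 tree, not a model from the paper.
    tree = [
        (0, 5.0, 1, 2),
        ("leaf", "A"),
        (1, 1.5, 3, 4),
        ("leaf", "B"),
        ("leaf", "C"),
    ]
    print(dt_predict(tree, [7.0, 1.0]))
```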

3.3. The k-Nearest Neighbors (k-NN)

3.3.1. Overall Structure of k-NN

The k-NN method is used to classify the samples based on their distances. In this case, we use the squared Euclidean distance as defined in Equation (5). The structure of k-NN is demonstrated in Figure 7 with an example of eight inputs and a k-value of 6. It consists of a Memory Unit (Mem), a Finite State Machine (FSM), a subtractor, a multiplier, a buffer, an adder, and a Sorting Network and Label Finder (SNLF) module.
$$d = \sum_{i=1}^{8} \left(x_i - y_i\right)^2 \tag{5}$$
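Equation (5) maps directly onto the subtractor, multiplier, and adder listed above. A minimal sketch follows; the function name is an assumption, and skipping the square root is valid for ranking because it does not change the order of the distances.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between an input sample and a stored sample."""
    acc = 0
    for xi, yi in zip(x, y):
        diff = xi - yi       # subtractor
        acc += diff * diff   # multiplier, then adder accumulating into the buffer
    return acc
```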

3.3.2. Structure of Sorting Network and Label Finder Module

The Sorting Network and Label Finder (SNLF) is a key module that sorts the distances and then outputs the classification or prediction result. It balances pipelined and parallel execution with a ping-pong operation. As shown in Figure 8, this module consists of three parts: a MUX, comparators, and cache registers. The MUX controls the transmission of d_i in different clock cycles. The comparators compare each new d_i with the stored values and store the d_i values in the cache registers. There are 12 cache registers (Ox and Ex) for storing d_i, as shown in Figure 7b. Specifically, the Ox registers store the six smallest d_i values seen in ‘odd’ clock cycles, and the Ex registers store the six smallest d_i values seen in ‘even’ clock cycles. This ping-pong cache lets the SNLF rank the d_i values across alternating clock cycles. Initially, all registers are set to the maximum value. In the jth clock period (j = 3, 5, …), d_i is compared with the value of each Ox register. If d_i is larger than all of the Ox values, it is dropped in the next period. Otherwise, d_i is inserted into the Ox registers and the largest of the six Ox values is dropped. Similarly, the Ex values are updated in the (j + 1)th period (j + 1 = 4, 6, …). These cycles repeat until the distances to all 600 training samples have been calculated from the input features. Finally, all 12 registers are compared to obtain the six smallest values, and the result is determined by a vote among them.
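The ping-pong selection can be modeled in software as a behavioral sketch. The function names, the use of Python lists in place of registers initialized to the maximum value, and the tie-breaking rule of the vote are assumptions. The final merge is correct because the k smallest distances overall must lie among the k smallest of each bank.

```python
from collections import Counter


def insert_smallest(bank, item, k):
    """Keep only the k smallest (distance, label) pairs in one register bank."""
    bank.append(item)
    bank.sort(key=lambda pair: pair[0])
    del bank[k:]  # drop the largest entry if the bank overflowed


def snlf_classify(distances, labels, k=6):
    odd_bank, even_bank = [], []
    for cycle, item in enumerate(zip(distances, labels)):
        bank = odd_bank if cycle % 2 else even_bank  # alternate banks each cycle
        insert_smallest(bank, item, k)
    # Final stage: compare both banks and keep the k smallest overall.
    merged = sorted(odd_bank + even_bank, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]  # majority vote; ties go to the nearer label
```

For a regression task, the vote would presumably be replaced by an average of the k selected targets, which the paper does not detail.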

3.4. Support Vector Machine (SVM)

The SVM (with eight inputs) is composed of a Memory Unit (Mem), a Finite State Machine (FSM), multipliers, adders, and multiplexers. The pre-trained support vectors are stored in the memory unit. The FSM controls the order of the output data and the running process. The FSM control is mainly used in multi-class classification, where the support vector and bias are updated recursively within the structure shown in Figure 9. The multipliers and adders complete the support vector calculation in Equation (6), and the multiplexer implements the sign function.
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{8} S_i \cdot \mathrm{Feature}_i + B\right) \tag{6}$$
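Equation (6) can be sketched in software as follows. The function names are assumptions, as are the choice to map a score of exactly zero to +1 (what a sign-bit multiplexer would do) and the one-vs-rest scheme for the multi-class case, in which the weights and bias are swapped per class; the paper does not name its multi-class strategy.

```python
def svm_score(weights, features, bias):
    """Weighted sum of Equation (6) before the sign function."""
    acc = bias
    for w, f in zip(weights, features):
        acc += w * f  # multiplier and adder
    return acc


def svm_sign(weights, features, bias):
    """Binary decision: the multiplexer selects +1 or -1 from the sign of the sum."""
    return 1 if svm_score(weights, features, bias) >= 0 else -1


def svm_one_vs_rest(models, features):
    """Multi-class sketch: models is a list of (weights, bias), one per class."""
    scores = [svm_score(w, features, b) for w, b in models]
    return max(range(len(scores)), key=scores.__getitem__)
```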

4. Comparison of Development Platforms

The specifications of 10 different FPGA platforms from seven different producers are thoroughly analyzed. The key features of these FPGA cores are listed in Table 3. It is worth noting that the Intel MAX and Xilinx Artix-7 FPGAs have the richest LUTs and DSPs, which are beneficial for the parallel implementation of multi-input machine learning algorithms. Additionally, on-chip Random Access Memory (RAM) resources are important. Otherwise, a large amount of pre-trained data must be pre-fetched from memory to the cache, and the limited LUT resources cannot serve as buffers to cache data during the calculation. Pango PGL12G, Lattice MachXO2, and Anlogic EG4S20 all have limited RAM capacities. Anlogic EG4S20 has an internal SDRAM module that satisfies the need for additional caching. In addition, static power consumption is a critical metric for endpoint platforms, and Anlogic FPGAs, Intel Cyclone 10LP, and Microchip M2S010 perform well in this regard. Moreover, the two Lattice FPGA platforms consume the least static power, which is quite competitive for endpoint implementations. Finally, both Anlogic EF2M45 and Microchip M2S010 are equipped with an internal Cortex-M3 core, which significantly improves their general performance in terms of driving external devices and communication.
On-board resources, external interfaces, and prices are the three main discriminative FPGA features that the developers pay most attention to. Therefore, in Table 4, we present the features of 10 FPGAs from seven producers.
All seven producers develop their own Electronic Design Automation (EDA) software, and Lattice provides entirely different EDA software for different devices. In addition, the final resource consumption is determined by the synthesis tools. Most of the seven producers use either their own synthesis tools or Synplify [30], but the latter requires a separate license. Table 5 summarizes the relevant information for the seven producers. It is worth noting that Lattice’s iCEcube2 and Lattice Diamond are for ICE40UP5 and MachXO2 development, respectively, and thus the two devices cannot share the same EDA.

5. Experimental Analysis and Result

To evaluate the performance of these IPs, we select six typical IoT endpoint datasets for different parameter combinations and tests. As shown in Table 6, the datasets include binary classifications, multi-classifications, and regressions. The Gutter Oil dataset proposed by VeriMake Innovation Lab aims to detect gutter oils [31], and contains six input oil features, including the pH value, refractive index, peroxide value, conductivity, pH value differences under different temperatures, and conductivity value difference under different temperatures. This dataset can be used for both dichotomous (binary) and polytomous (multi-class) classification. The Smart Grid dataset for conducting research on electrical grid stability is from Karlsruher Institut für Technologie, Germany [32,33]. This is a dichotomous dataset with 13 input features used to determine whether the grid is stable under different loads. The Wine Quality dataset is proposed by the University of Minho [34] for classifying wine quality. This is a polytomous dataset, with 11 input dimensions (e.g., humidity and light), rating wines on a scale of 0 to 10. The Rain dataset by the Bureau of Meteorology, Australia, is based on datasets of different weather stations for recording and forecasting the weather [35]. This is a dataset for regression prediction, using eight input parameters, such as wind, humidity, and light intensity, to predict the probability of rain. Power Consumption is an open-source dataset created by the University of California, Irvine. It tracks the total energy consumption of various devices within families [36].
We use a desktop PC with a 2.59 GHz Core i7 processor to train various models on the six datasets and export the best parameters with the best scores obtained during training. For binary and multiclass classifications, the scores represent the classification accuracies. For regressions, the scores represent R² [37]. Then, these trained parameters are fed to our machine learning IPs and implemented on 10 different FPGA boards using EDAs from seven different candidate producers. Each EDA is configured to operate in the balanced mode with identical constraints. As shown in Table 5, some of the EDAs, such as those of Gowin and Pango, can also be paired with Synplify as a synthesis tool. However, as Synplify requires a separate license, only the producers’ self-developed synthesis tools (GowinSynthesis, ADS) are used in this paper for analyzing FPGA implementations. The analysis of FPGA implementations is not limited to the computing performance, but encompasses all aspects of the hardware. While Power Latency Production (PLP) [38] is a common metric for evaluating the results of FPGA implementations, it does not consider the cost, which is a critical factor in IoT endpoint device development. As a result, we introduce the Cost Power Latency Production (CPLP) as an additional metric for evaluating the results.
In addition, we implement the same machine learning algorithms and parameters on the Nvidia Jetson Nano 2, the Raspberry Pi 3B+, and the STM32L476 Nucleo, respectively [39], allowing for more comprehensive comparisons with the implementations on different FPGAs.
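Reading the metric names literally, PLP is the product of power and latency, and CPLP additionally multiplies by cost, so lower values are better for both. The sketch below assumes that reading; the units (USD, mW, µs) and the example numbers are hypothetical and are not taken from the paper's tables.

```python
def plp(power_mw, latency_us):
    """Power Latency Production: power multiplied by latency."""
    return power_mw * latency_us


def cplp(cost_usd, power_mw, latency_us):
    """Cost Power Latency Production: PLP multiplied by device cost."""
    return cost_usd * plp(power_mw, latency_us)


if __name__ == "__main__":
    # Two made-up devices: B is faster, but A wins once power and cost are included.
    device_a = cplp(cost_usd=5, power_mw=10, latency_us=40)
    device_b = cplp(cost_usd=60, power_mw=2000, latency_us=4)
    print(device_a, device_b)
```

The example shows why a latency-only ranking and a CPLP ranking can disagree, which is the motivation the text gives for adding cost to the metric.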

5.1. ANN

5.1.1. ANN Parameter Analysis

We aim to find the best user-defined ANN parameters for the six datasets, including the number of hidden layers, the number of neurons within each layer, and the activation functions. Different combinations of these parameters are used to train our ANN model on the desktop PC, and the corresponding results are shown in Appendix A, Table A1. There are only minor differences among all the combinations. We apply the parameters with the best software score to our hardware implementation. The hyperparameter values associated with the best scores for these datasets processed using the ANN algorithm are shown in Table 7.

5.1.2. Implementation and Analysis of ANN Hardware

Based on the ANN architectures in Table 7, we use the corresponding EDA (with the Balanced Optimization Mode in Synthesis Settings) to implement ANN on 10 different FPGA boards. The results are summarized in Appendix A, Table A2. In terms of computing performance (latency), Intel MAX10M50DAF outperforms the others in five out of six datasets, while PGL12G outperforms the competition in the Rain task. The latency differences among the 10 FPGAs are all at the millisecond level and are almost negligible. In the comprehensive comparison, Lattice’s ICE40UP5 takes first place in five of the six datasets thanks to its extremely low power consumption and cost-effectiveness. The exception is the Wine Quality task, where the ANN could not be implemented on ICE40UP5 due to resource constraints; the best-performing device on this task is Lattice MachXO2. The FPGA deployment results with the best comprehensive performance under each task are shown in Table 8.

5.2. DT

5.2.1. Analysis of DT Parameters

For PC simulations, 12 different combinations of the maximum depth and the maximum number of leaf nodes are chosen. The results are shown in Appendix A, Table A3. Different DT structures produce nearly identical results. Increasing the maximum depth and the maximum number of leaf nodes does not significantly improve the score. We choose the best results from the various combinations for hardware deployment, and the results with the best scores for the six datasets are shown in Table 9.

5.2.2. Implementation and Analysis of DT Hardware

Based on the DT architectures in Appendix A, Table A4, we use the appropriate EDA (with the Balanced Optimization Mode in Synthesis Settings) to implement DT on 10 different FPGA boards. In terms of computing performance, Intel MAX10M50DAF outperforms the competition in all six datasets. In terms of comprehensive performance, Lattice’s ICE40UP5 again takes first place in most application scenarios thanks to its extremely low power consumption and cost-effectiveness. The FPGA DT deployment results with the best comprehensive performance under each task are shown in Table 10.

5.3. k-NN

5.3.1. Analysis of k-NN Parameters

In our k-NN model, the parameter k is user-defined. We experiment with various k values when training our model on the PC, and the results are shown in Appendix A, Table A5. Increasing k has no significant effect on the score and may even decrease it. We use the k value that gives the best score for hardware deployment. The hyperparameter values associated with the best scores for these datasets processed using the k-NN algorithm are shown in Table 11.

5.3.2. Implementation and Analysis of k-NN

According to the k parameters analyzed in Section 5.3.1, we implement our model on 10 FPGA boards (with the Balanced Optimization Mode in Synthesis Settings). The corresponding results are shown in Appendix A, Table A6. Gowin’s GW2A has the best computing performance in all of the task scenarios. By relying on extremely low power consumption and cost-effectiveness, Lattice’s ICE40UP5 achieves the best comprehensive performance across all datasets.
Additionally, two things are worth noting. First, Anlogic’s EF2M45 and Lattice’s MachXO2 are unable to deploy k-NN in multiple task scenarios due to resource constraints. Second, Pango’s PGL12G is also incapable of deploying k-NN. The reason is that its synthesis tool is unable to correctly recognize the current k-NN design and therefore ignores the critical path. This does not occur when using alternative development tools. The FPGA k-NN deployment results with the best comprehensive performance under each task are shown in Table 12.

5.4. SVM

In the experiment, the linear SVM is chosen for training, and the results are shown in Table 13. Due to the functional similarity between the linear SVM and ANN, their simulation scores are very similar. The SVM deployment results on 10 FPGA boards with the best comprehensive performance under each task are shown in Table 14. The remaining implementation results are provided in Appendix A, Table A7.

5.5. Comparisons with Embedded Platforms

To provide a more accurate assessment of our implementation, we also compare MLoF with three representative embedded platforms, namely Nvidia Jetson Nano, Raspberry Pi 3B+, and STM32L476 Nucleo. The specification of each platform is listed in Table 15. Jetson Nano is powered by a Cortex-A57 core running at 1.43 GHz and a 128-core Nvidia Maxwell-based GPU [40], while Raspberry Pi 3B+ features a Cortex-A53 core running at 1.2 GHz. STM32L476 Nucleo is a typical IoT development platform with a Cortex-M4 core running at 80 MHz [41]. Compared with Table 4, the prices of these three representative embedded development platforms are similar to those of the FPGAs, which makes a comparison on the other metrics meaningful. Given their reasonable cost, FPGAs can be considered competitive substitutes for these typical platforms.
Based on the earlier desktop PC simulation results and the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves shown in Figure 10, we select the model with the highest score in each of the six task scenarios for deployment on the embedded platforms [42]. The corresponding deployment models and architectures for each task are listed in Table 16. PyCUDA is used for GPU parallel acceleration with fixed weight parameters on Jetson Nano. We use the same Python code for the Raspberry Pi implementation.
Table 17 compares the performance of our FPGA and the three embedded platform implementations. Jetson Nano takes the lead in terms of computing performance, while Nucleo consumes the lowest power. Compared with typical IoT endpoint platforms, the power consumption of the FPGA decreased by an average of 891%, and its performance improved by an average of 9 times. Moreover, FPGAs outperform all other platforms in terms of Energy Efficiency (PLP) and Cost Efficiency (CPLP).
To demonstrate the benefits of FPGA implementation in IoT endpoint scenarios, we compare embedded and FPGA platforms using six datasets in terms of performance (latency), power consumption, PLP, and CPLP, as shown in Figure 11 with the ordinate axis in logarithmic scale. Jetson Nano exceeds the others in performance, but the second-best FPGA is not far behind: it is only 38% lower on average, 100% ahead of Raspberry Pi, and 2300% ahead of Nucleo. In terms of power consumption, Nucleo is quite competitive as a low-power MCU with 102 mW on average, 30 mW lower than the FPGA. These two platforms are well ahead of Jetson Nano and Raspberry Pi. Figure 11 clearly shows that FPGAs have significantly lower PLP and CPLP than the other platforms. Low PLP and CPLP are critical for IoT endpoint development and implementation, as response time, power consumption, and cost are all critical factors in IoT endpoint tasks. Furthermore, the FPGA PLP is 17× better than the average for embedded platforms and the FPGA CPLP is 25× better than the average for embedded platforms.

6. Conclusions

In this paper, we introduced Machine Learning on FPGA (MLoF), a series of ML hardware accelerator IP cores for IoT endpoint devices that offers high performance, low cost, and low power consumption.
MLoF performs inference on FPGAs using the optimal parameters obtained from training on a PC. It implements four typical machine learning algorithms (ANN, DT, k-NN, and SVM) in Verilog HDL on 10 FPGA development boards from seven different manufacturers. LUT usage, power, latency, cost, PLP, and CPLP are used to compare and analyze the MLoF deployment results on six typical IoT datasets. We also analyzed the synthesis results of different EDA tools for the same hardware design. Finally, we compared the best FPGA deployment results with typical IoT endpoint platforms (Jetson Nano, Raspberry Pi, STM32L476). The results indicate that the FPGA PLP outperforms the IoT platforms by an average of 17× owing to the superior parallelism of FPGAs. Meanwhile, FPGAs have 25× better CPLP than the IoT platforms. To our knowledge, this is the first paper that conducts hardware deployment, platform comparisons, and deployment result analysis. It is also the first set of open-source IP cores for machine learning algorithms on FPGAs, and it has been verified on low-cost FPGA platforms.
MLoF still has room for improvement: (1) the adaptability of MLoF could be enhanced so that more complex algorithms (k-NN with k > 16) can also be deployed on low-cost FPGAs with few resources, such as MachXO2; (2) more user-configurable parameters could be added, including more ML algorithms, larger data bit widths, and more hyperparameters; (3) usability could be improved by providing a script file or a user interface that helps users generate the desired ML algorithm IP core more easily. These shortcomings indicate the direction of our future work.

Author Contributions

Conceptualization, M.L. and R.C.; investigation, M.L. and R.C.; algorithm proposed, R.C.; hardware architecture, R.C.; data curation, R.C., T.W. and Y.Z.; validation, M.L. and R.C.; resources, M.L. and R.C.; supervision, M.L.; visualization, R.C., T.W. and Y.Z.; writing—original draft, M.L., R.C. and T.W.; writing—review and editing, M.L., R.C., T.W. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovation Funding of Agriculture Science and Technology in Jiangsu Province, under grant CX(21)3121.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code and data are available online at https://github.com/verimake-team/MLonFPGA.

Acknowledgments

The authors would like to thank Jiangsu Academy of Agricultural Sciences for all the support, and would also like to acknowledge the technical support from VeriMake Innovation Lab.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

ANN: Artificial Neural Network
DT: Decision Tree
FSM: Finite State Machine
FPGA: Field Programmable Gate Array
GOP: Giga Operations per Second
HDL: Hardware Description Language
IoT: Internet of Things
IP: Intellectual Property
k-NN: K-Nearest Neighbor
LUT: Look-up Table
MAC: Multiplying Accumulator
MLoF: Machine Learning on FPGA
MCU: Microcontroller Unit
ML: Machine Learning
RAM: Random Access Memory
ReLU: Rectified Linear Unit
SoC: System on Chip
SVM: Support Vector Machine

Appendix A

Table A1. The score for different ANN architectures and activation functions.
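| Activation Function | Dataset | Arch. 1.4 | Arch. 1.8 | Arch. 2.4 | Arch. 2.8 | Arch. 3.4 | Arch. 3.8 | Arch. 4.4 | Arch. 4.8 |
|---|---|---|---|---|---|---|---|---|---|
| ReLU | Gutter Oil | 95.76 | 96.08 | 95.97 | 96.61 | 95.55 | 96.92 | 96.71 | 96.82 |
| ReLU | Smart Grid | 98.17 | 98.25 | 98.41 | 97.92 | 98.3 | 97.83 | 98.16 | 98.41 |
| ReLU | Gutter Oil | 95.28 | 96.55 | 96.46 | 96.28 | 96.28 | 96.97 | 93.1 | 96.19 |
| ReLU | Wine Quality | 70.84 | 71.58 | 63.01 | 71.45 | 70.22 | 71.36 | 70.7 | 72.29 |
| ReLU | Rain | 78.85 | 85.18 | 85.06 | 83.66 | 78.85 | 85.36 | 85.55 | 85.83 |
| ReLU | Power Consumption | 0.9976 | 0.9967 | 0.9894 | 0.997 | 0.9973 | 0.9964 | 0.9973 | 0.9973 |
| Tanh | Gutter Oil | 96.19 | 96.4 | 95.87 | 96.4 | 96.92 | 95.87 | 96.5 | 96.29 |
| Tanh | Smart Grid | 98.25 | 98.52 | 98.68 | 98.3 | 98.73 | 98.41 | 98.32 | 98.16 |
| Tanh | Gutter Oil | 95.73 | 96.46 | 96.37 | 96.82 | 96.37 | 96.91 | 96.53 | 96.55 |
| Tanh | Wine Quality | 71.85 | 70.44 | 71.76 | 72.95 | 72.02 | 72.29 | 72.46 | 72.94 |
| Tanh | Rain | 78.85 | 83.62 | 78.85 | 78.85 | 78.85 | 78.85 | 78.84 | 78.85 |
| Tanh | Power Consumption | 0.9948 | 0.995 | 0.9959 | 0.9965 | 0.9945 | 0.9956 | 0.9904 | 0.9958 |
| Sigmoid | Gutter Oil | 96.29 | 96.4 | 96.19 | 96.4 | 97.14 | 97.14 | 97.03 | 96.92 |
| Sigmoid | Smart Grid | 98.07 | 98.07 | 98.52 | 98.03 | 98.64 | 98.26 | 97.87 | 97.56 |
| Sigmoid | Gutter Oil | 95.55 | 95.5 | 96.27 | 96.73 | 96.1 | 97.09 | 96.83 | 97 |
| Sigmoid | Wine Quality | 72.07 | 72.64 | 72.86 | 72.95 | 72.15 | 72.81 | 73 | 73.01 |
| Sigmoid | Rain | 78.85 | 85.14 | 85.12 | 85.17 | 78.85 | 85.46 | 78.85 | 84.37 |
| Sigmoid | Power Consumption | 0.9953 | 0.9966 | 0.9951 | 0.9979 | 0.9954 | 0.9956 | 0.9958 | 0.9964 |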
Table A2. ANN deployment results on the low-cost FPGAs.
DatasetDeviceLUTsDSPsLatencyPowerPLPCPLP
Gutter OilEF2M4514451511.76492.485789.10149,937.75
EG4S2011652511.40371.824238.32211,492.13
GW2A18331211.45451.475167.12873,243.27
Cyclone10LP10CL17951510.55271.382863.87142,907.27
MAX10M50DAF14002510.17312.533177.18270,060.30
ICE40UP51898811.121401556.1059,131.80
MachXO22717/11.48190.982192.0765,542.85
M2S01013982211.63292.473402.01203,780.46
PGL12G1780711.65585.226820.15354,648
Artix-79161010.755896331.75505,906.83
Smart GridEF2M4513711515.59494.187704.24199,539.69
EG4S2010912511.40381.854353.48217,238.81
GW2A17791212.30419.115155.09871,210.19
Cyclone10LP10CL15301510.62267.112837.24141,578.40
MAX10M50DAF1335219.58311.082978.90253,206.68
ICE40UP51855811.121401556.1059,131.80
MachXO22696/11.481912192.3065,549.71
M2S01012632211.79300.443542.49212,195.03
PGL12G11161011.615866802.87353,749.45
Artix-79191010.905886409.20512,095.08
Gutter OilEF2M4526511512.44542.826752.71174,895.08
EG4S2017662911.13481.855360.60267,494.11
GW2A3235812.32579.517136.701,206,102.74
Cyclone10LP10CL30691510.362722818.46140,641.35
MAX10M50DAF2198259.693173072.05261,124
ICE40UP53887811140154058,520
MachXO25763/12.051912301.1768,804.92
M2S010202222123053658.78219,160.92
PGL12G2546711.975857002.45364,127.40
Artix-7226914115916501519,429.90
Wine QualityEF2M4531391516.37622.8310,195.78264,070.60
EG4S2022312912.31491.846053.59302,074.21
GW2A36961614.27682.899741.401,646,296.15
Cyclone10LP10CL38151511.802793290.81164,211.17
MAX10M50DAF2895339.663223109.88264,339.46
ICE40UP5//////
MachXO26114/12.731922444.5473,091.87
M2S01025602213.99302.214226.41253,161.77
PGL12G2929715.615869146.87475,637.45
Artix-723201412.475887332.36585,855.56
RainEF2M4514261511.84542.856428.45166,496.94
EG4S2013541610.77431.854650.61232,065.65
GW2A3245811.45579.736634.961,121,308.93
Cyclone10LP10CL2263159.392732562.38127,862.66
MAX10M50DAF2292168.123132542.50216,112.42
ICE40UP5355388.731351178.6944,790.03
MachXO24392/9.141891728.0351,668.01
M2S0102196169.123112836.63169,914.20
PGL12G205574.725842754.73143,245.86
Artix-7179489.255955503.75439,749.63
Power ConsumptionEF2M4525481516.61555.519227.56238,993.80
EG4S2016142914.61454.126632.39330,956.43
GW2A3245814.45561.6981171,371,772.43
Cyclone10LP10CL27411512.482933657.23182,495.58
MAX10M50DAF18323311.673444014.14341,201.56
ICE40UP53653811.981401676.5063,707
MachXO25621/12.802092675.4179,994.73
M2S01023232212.213223931.30235,484.75
PGL12G22981015.895869309.78484,108.66
Artix-711101412.125887126.56569,412.14
Table A3. The score for different DT architectures.
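| Number of Depths | Dataset | 8 Leaf Nodes | 16 Leaf Nodes | 32 Leaf Nodes | 64 Leaf Nodes |
|---|---|---|---|---|---|
| 4 | Gutter Oil | 96.09 | 96.09 | / | / |
| 4 | Smart Grid | 97.98 | 97.98 | / | / |
| 4 | Gutter Oil | 96.46 | 96.46 | / | / |
| 4 | Wine Quality | 81.98 | 83.69 | / | / |
| 4 | Rain | 0.8481 | 0.848 | / | / |
| 4 | Power Consumption | 0.98 | 0.992 | / | / |
| 5 | Gutter Oil | 96.09 | 96.18 | 96.18 | / |
| 5 | Smart Grid | 97.98 | 97.98 | 97.98 | / |
| 5 | Gutter Oil | 96.46 | 96.63 | 96.7 | / |
| 5 | Wine Quality | 81.98 | 83.21 | 83.51 | / |
| 5 | Rain | 0.8481 | 0.8451 | 0.8404 | / |
| 5 | Power Consumption | 0.98 | 0.9928 | 0.9963 | / |
| 6 | Gutter Oil | 96.09 | 96.18 | 96.18 | 96.18 |
| 6 | Smart Grid | 97.98 | 97.98 | 97.98 | 97.98 |
| 6 | Gutter Oil | 96.46 | 96.63 | 96.85 | 96.85 |
| 6 | Wine Quality | 81.98 | 82.5 | 83.51 | 83.25 |
| 6 | Rain | 0.8481 | 0.8447 | 0.8366 | 0.839 |
| 6 | Power Consumption | 0.98 | 0.9928 | 0.9966 | 0.9976 |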
Table A4. DT deployment results on the low-cost FPGAs.
DatasetDeviceLUTsDSPsLatencyPowerPLPCPLP
Gutter OilEF2M45681/12.28416.855120.52132,621.57
EG4S20661/11.52325.633751.54187,201.70
GW2A41708.62409.263528.20596,265.12
Cyclone10LP10CL36006.33263166483,033.65
MAX10M50DAF36305.703021722146,370.34
ICE40UP545107.92133.901060.7640,309.02
MachXO2494/8.86184.711635.8348,911.23
M2S01039609.872792752.34164,864.87
PGL12G27109.775485356.15278,519.90
Artix-726409.405875517.80440,872.22
Smart GridEF2M4527207.95387.383081.2179,803.41
EG4S2027207.62281.082142.35106,903.45
GW2A31307.86400.693150.21532,385.33
Cyclone10LP10CL28405.932611546.9577,192.66
MAX10M50DAF28705.383011620.28137,724.06
ICE40UP528505.87122716.1427,213.32
MachXO2285/5.951841095.1732,745.52
M2S01029106.382931868.75111,938.36
PGL12G30706.335433436.65178,705.64
Artix-718808.025884715.76376,789.22
Gutter OilEF2M451412018.90446.368436.73218,511.20
EG4S201463017.63374.876608.87329,782.61
GW2A693010.17437.504450.69752,166.19
Cyclone10LP10CL63608.062642128.10106,192.39
MAX10M50DAF64807.113032153.72183,066.54
ICE40UP571507.921351069.4740,639.86
MachXO2774/8.861871656.0749,516.55
M2S01068509.353162955.86177,056.25
PGL12G679012.825537088.35368,594.41
Artix-7655010.205906018480,838.20
Wine QualityEF2M4526608.26387.383199.7382,873.11
EG4S2026608.39286.442403.78119,948.59
GW2A31107.94401.593187.39538,668.59
Cyclone10LP10CL28405.77261.421507.0975,203.61
MAX10M50DAF28705.54301.811670.82142,019.71
ICE40UP530505.60121677.625,748.8
MachXO2304/5.54179991.6629,650.63
M2S01030106.72291.471958.41117,308.58
PGL12G31607.22443.833204.90166,654.61
Artix-717708.204884001.60319,727.84
RainEF2M4526907.65382.182925.2475,763.62
EG4S2026907.79281.842196.36109,598.15
GW2A31507.94400.683180.23537,458.69
Cyclone10LP10CL28005.77261.291508.6975,283.55
MAX10M50DAF28305.09301.381533.12130,315.21
ICE40UP530305.80121701.8026,668.40
MachXO2301/5.941741033.2130,893.04
M2S01030606.57289.281900.28113,826.79
PGL12G31207.23443.863210166,919.77
Artix-717807.424873613.54288,721.85
Power ConsumptionEF2M451437019.76436.518625.48223,399.86
EG4S201424017.51365.566399.56319,338.21
GW2A679010.24439.194495.09759,670.07
Cyclone10LP10CL65508.52264.802256.10112,579.19
MAX10M50DAF66506.87303.442083.72177,116.41
ICE40UP576509.651281235.4646,947.33
MachXO2783/9.961941932.8257,791.38
M2S010709010.47313.763286.36196,853.21
PGL12G668015.50454.357044.24366,300.60
Artix-771608.354924108.20328,245.18
Table A5. The score for different k-NN architectures.
| Dataset | k = 2 | k = 4 | k = 8 | k = 16 |
|---|---|---|---|---|
| Gutter Oil | 99.36 | 99.09 | 98.73 | 98.46 |
| Smart Grid | 75.78 | 77.3 | 78.38 | 79.39 |
| Gutter Oil | 99.36 | 99.07 | 98.73 | 98.46 |
| Wine Quality | 77.93 | 77.41 | 81.23 | 74.9 |
| Rain | 0.8366 | 0.8466 | 0.8494 | 0.8527 |
| Power Consumption | 0.996 | 0.9954 | 0.9944 | 0.9915 |
Table A6. The k-NN deployment results on the low-cost FPGAs.
DatasetDeviceLUTsDSPsLatencyPowerPLPCPLP
Gutter OilEF2M4512511517.26446.367703.32199,515.87
EG4S209421816.08315.815077.64253,374.31
GW2A99165.18397.252059.34348,029.14
Cyclone10LP10CL984610.63282.493001.74149,786.76
MAX10M50DAF98569.33335.913135.05266,479.08
ICE40UP597967.811341046.5439,768.52
MachXO21062/8.84185.471638.6348,994.96
M2S010990613.32325.574337.50259,816.40
PGL12G//////
Artix-75921811.925977116.24568,587.58
Smart GridEF2M45//////
EG4S2010,7572918.071371.3224,783.831,236,713.13
GW2A5708137.84888.166960.471,176,319.55
Cyclone10LP10CL48101311.89304.673621.31180,703.25
MAX10M50DAF4847139.90369.193653.14310,516.48
ICE40UP5494989.32135.791265.5848,092.09
MachXO2//////
M2S01050781313.41327.234386.45262,748.42
PGL12G//////
Artix-732313912.055947157.70571,900.23
Gutter OilEF2M4513611516.92452.827663.01198,471.83
EG4S2010581816.62331.425507.27274,812.90
GW2A104765.18406.452107.04356,089.22
Cyclone10LP10CL1076610.63282.793005.77149,988.17
MAX10M50DAF108868.83335.662964.55251,986.68
ICE40UP5107767.811351054.3540,065.30
MachXO21241/7.62185.471414.0242,279.30
M2S0101117612.94325.724213.81252,407.44
PGL12G//////
Artix-76731811.575976907.29551,892.47
Wine QualityEF2M45//////
EG4S2044352918.96672.6912,752.20636,334.94
GW2A3415116.46578.873739.51631,977.72
Cyclone10LP10CL30731111.10194.792162.56107,911.67
MAX10M50DAF3067119.273543280.87278,874.12
ICE40UP5384889.131351232.4246,831.77
MachXO24348/7.181861335.6739,936.41
M2S01035681113.35235.653145.95188,442.66
PGL12G//////
Artix-720373312.105116183.10494,029.69
RainEF2M45//////
EG4S2011,76724181491.2126,843.271,339,479.23
GW2A629787.76886.536882.981,163,223.64
Cyclone10LP10CL5397811.89305.543633.79181,325.98
MAX10M50DAF5412810.09362.043651.17310,349.74
ICE40UP5//////
MachXO2//////
M2S0106218812.94471.376097.62365,247.23
PGL12G//////
Artix-737542411.505095853.50467,694.65
Power ConsumptionEF2M4517351517.75485.838625.01223,387.78
EG4S2010712115.30365.835598.36279,358.05
GW2A109575.18321.681667.57281,819.93
Cyclone10LP10CL116079.07288.612617.40130,608.46
MAX10M50DAF1173710.54339.083573.90303,781.77
ICE40UP5115979.361271188.2145,152.06
MachXO21864/8.81195.091719.1251,401.82
M2S0101232711.86212.882524.48151,216.63
PGL12G//////
Artix-77342111.955346381.30509,865.87
Table A7. SVM deployment results on the low-cost FPGAs.
DatasetDeviceLUTsDSPsLatencyPowerPLPCPLP
Gutter OilEF2M45829813.81426.675890.66152,568.13
EG4S20853813.45365.114909.28244,973.20
GW2A67186.51329.782145.52362,592.23
Cyclone10LP10CL740810.59274.362906.30145,024.14
MAX10M50DAF741810.06342.583447.04292,998.40
ICE40UP5765815.561512349.4189,277.54
MachXO22663/15.572173378.91101,029.32
M2S010766812.66315.133988.89238,934.52
PGL12G663811.73589.166910.26359,333.40
Artix-741689.125905380.80429,925.92
Smart GridEF2M45950813.69456.226245.71161,763.80
EG4S201000814.01355.154976.29248,316.96
GW2A89086.75361.032437.66411,964.72
Cyclone10LP10CL963811.33307.443484.22173,862.45
MAX10M50DAF964810.46364.083809.00323,765.42
ICE40UP5955815.90154.982464.1793,638.31
MachXO23752/15.212163284.7198,212.89
M2S010975813.48325.744390.70263,003.13
PGL12G775811.22587.976596.44343,014.64
Artix-748288.785895171.42413,196.46
Gutter OilEF2M45829813.81426.675890.66152,568.13
EG4S20853813.45365.114909.28244,973.20
GW2A67186.51329.782145.52362,592.23
Cyclone10LP10CL740810.59274.362906.30145,024.14
MAX10M50DAF741810.06362.583648.28310,103.80
ICE40UP5765814.90152.992279.2086,609.61
MachXO22667/15.19217.113297.0298,580.82
M2S010766812.07319.073851.48230,703.77
PGL12G663811.73569.166675.68347,135.24
Artix-741609.124874441.44354,871.06
Wine QualityEF2M45918813.52444.326005.85155,551.42
EG4S20994813.72355.154873.65243,195.38
GW2A85886.68354.762369.78400,493.40
Cyclone10LP10CL932810.96276.473030.39151,216.34
MAX10M50DAF933810.52362.873817.76324,509.20
ICE40UP5976815.56155.142413.8991,727.65
MachXO23070/15.91217.113454.24103,281.66
M2S010966812.44320.803990.75239,046.04
PGL12G743812.10569.606891.02358,333.08
Artix-746589.204904508.00360,189.20
RainEF2M45523813.27400.725317.57137,725.00
EG4S20533812.95312.134041.45201,668.17
GW2A54185.91316.571869.98316,027.45
Cyclone10LP10CL50289.83275.692710.86135,271.90
MAX10M50DAF50389.95320.903193.92271,483.00
ICE40UP5527813.51147.761996.7775,877.43
MachXO2864/15.21217.433307.5598,895.60
M2S010570811.38184.512100.12125,797.09
PGL12G540812.50567.527095.14368,947.02
Artix-734488.924904370.80349,226.92
Power ConsumptionEF2M45523813.85397.725509.19142,688.01
EG4S20533812.66315.823997.32199,466.32
GW2A53485.91314.481857.63313,939.04
Cyclone10LP10CL504810.02275.912765.72138,009.52
MAX10M50DAF50789.59321.223081.14261,897.09
ICE40UP5516813.471481994.0075,772.15
MachXO2851/15.21216.433292.3398,440.76
M2S010519812.53193.702427.30145,395.56
PGL12G531810.86467.435078.16264,064.30
Artix-734288.664904243.40339,047.66

References

  1. Li, H.; Ota, K.; Dong, M. Learning IoT in edge: Deep learning for the Internet of Things with edge computing. IEEE Netw. 2018, 32, 96–101.
  2. Sakr, F.; Bellotti, F.; Berta, R.; De Gloria, A. Machine Learning on Mainstream Microcontrollers. Sensors 2020, 20, 2638.
  3. Deploy Machine Learning Models on Mobile and IoT Devices. Available online: https://www.tensorflow.org/lite (accessed on 1 April 2021).
  4. STMicroelectronics X-CUBE-AI—AI Expansion Pack for STM32CubeMX. Available online: http://www.st.com/en/embedded-software/x-cube-ai.html (accessed on 1 April 2021).
  5. Lai, L.; Suda, N.; Chandra, V. CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs. arXiv 2018, arXiv:1801.06601.
  6. DiCecco, R.; Lacey, G.; Vasiljevic, J.; Chow, P.; Taylor, G.; Areibi, S. Caffeinated FPGAs: FPGA Framework for Convolutional Neural Networks. In Proceedings of the IEEE 2016 International Conference on Field-Programmable Technology (FPT), Xi’an, China, 7–9 December 2016; pp. 265–268.
  7. Brandalero, M.; Ali, M.; Le Jeune, L.; Hernandez, H.G.M.; Veleski, M.; da Silva, B.; Lemeire, J.; Van Beeck, K.; Touhafi, A.; Goedemé, T. AITIA: Embedded AI Techniques for Embedded Industrial Applications. In Proceedings of the IEEE 2020 International Conference on Omni-layer Intelligent Systems (COINS), Barcelona, Spain, 31 August–2 September 2020; pp. 1–7.
  8. Kathail, V. Xilinx Vitis Unified Software Platform. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 23–25 February 2020; pp. 173–174.
  9. Aydonat, U.; O’Connell, S.; Capalija, D.; Ling, A.C.; Chiu, G.R. An OpenCL™ Deep Learning Accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–25 February 2017; pp. 55–64.
  10. Intelligent Automation, Inc. DeepIP-FNN. Available online: https://www.xilinx.com/products/intellectual-property/1-15kaxa2.html (accessed on 2 May 2021).
  11. Intel Intel® FPGA Technology Solutions for Artificial Intelligence (AI). Available online: https://www.intel.com/content/www/us/en/artificial-intelligence/programmable/solutions.html (accessed on 2 May 2021).
  12. Shawahna, A.; Sait, S.M.; El-Maleh, A. FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review. IEEE Access 2019, 7, 7823–7859.
  13. Holanda Noronha, D.; Zhao, R.; Goeders, J.; Luk, W.; Wilton, S.J. On-Chip Fpga Debug Instrumentation for Machine Learning Applications. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 24–26 February 2019; pp. 110–115.
  14. Saqib, F.; Dutta, A.; Plusquellic, J.; Ortiz, P.; Pattichis, M.S. Pipelined Decision Tree Classification Accelerator Implementation in FPGA (DT-CAIF). IEEE Trans. Comput. 2013, 64, 280–285.
  15. Attaran, N.; Puranik, A.; Brooks, J.; Mohsenin, T. Embedded Low-Power Processor for Personalized Stress Detection. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 2032–2036.
  16. Batista, G.C.; Oliveira, D.L.; Saotome, O.; Silva, W.L. A Low-Power Asynchronous Hardware Implementation of a Novel SVM Classifier, with an Application in a Speech Recognition System. Microelectron. J. 2020, 105, 104907.
  17. Roukhami, M.; Lazarescu, M.T.; Gregoretti, F.; Lahbib, Y.; Mami, A. Very Low Power Neural Network FPGA Accelerators for Tag-Less Remote Person Identification Using Capacitive Sensors. IEEE Access 2019, 7, 102217–102231.
  18. Wang, C.; Gong, L.; Yu, Q.; Li, X.; Xie, Y.; Zhou, X. DLAU: A Scalable Deep Learning Accelerator Unit on FPGA. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2016, 36, 513–517.
  19. Ge, F.; Wu, N.; Xiao, H.; Zhang, Y.; Zhou, F. Compact Convolutional Neural Network Accelerator for Iot Endpoint Soc. Electronics 2019, 8, 497.
  20. Jindal, M.; Gupta, J.; Bhushan, B. Machine Learning Methods for IoT and Their Future Applications. In Proceedings of the IEEE 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 18–19 October 2019; pp. 430–434.
  21. Qian, B.; Su, J.; Wen, Z.; Jha, D.N.; Li, Y.; Guan, Y.; Puthal, D.; James, P.; Yang, R.; Zomaya, A.Y. Orchestrating the Development Lifecycle of Machine Learning-Based Iot Applications: A Taxonomy and Survey. ACM Comput. Surv. (CSUR) 2020, 53, 1–47.
  22. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  23. Meshram, V.; Patil, K.; Hanchate, D. Applications of Machine Learning in Agriculture Domain: A State-of-Art Survey. Int. J. Adv. Sci. Technol. 2020, 29, 5319–5343.
  24. Gong, Z.; Zhong, P.; Hu, W. Diversity in Machine Learning. IEEE Access 2019, 7, 64323–64350.
  25. Yang, L.; Shami, A. On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice. Neurocomputing 2020, 415, 295–316.
  26. Venieris, S.I.; Bouganis, C.-S. FpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs. In Proceedings of the 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Washington, DC, USA, 1–3 May 2016; pp. 40–47.
  27. Faraji, S.R.; Abillama, P.; Singh, G.; Bazargan, K. Hbucnna: Hybrid Binary-Unary Convolutional Neural Network Accelerator. In Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Seville, Spain, 12–14 October 2020; pp. 1–5.
  28. Akima, H. A New Method of Interpolation and Smooth Curve Fitting Based on Local Procedures. J. ACM (JACM) 1970, 17, 589–602.
  29. Chen, H.; Jiang, L.; Yang, H.; Lu, Z.; Fu, Y.; Li, L.; Yu, Z. An Efficient Hardware Architecture with Adjustable Precision and Extensible Range to Implement Sigmoid and Tanh Functions. Electronics 2020, 9, 1739.
  30. Ramachandran, S. Synthesis of Designs–Synplify Tool. In Digital VLSI Systems Design: A Design Manual for Implementation of Projects on FPGAs and ASICs Using Verilog; Springer: Berlin/Heidelberg, Germany, 2007; pp. 255–292.
  31. Verimake Gutter Oil Dataset. Available online: https://github.com/verimake-team/Gutteroildetector/tree/master/data (accessed on 2 May 2021).
  32. Schäfer, B.; Grabow, C.; Auer, S.; Kurths, J.; Witthaut, D.; Timme, M. Taming Instabilities in Power Grid Networks by Decentralized Control. Eur. Phys. J. Spec. Top. 2016, 225, 569–582.
  33. Arzamasov, V.; Böhm, K.; Jochem, P. Towards Concise Models of Grid Stability. In Proceedings of the 2018 IEEE International Conference on Communications, Control, and Computing Technologies for Smart Grids (SmartGridComm), Aalborg, Denmark, 29–31 October 2018; pp. 1–6.
  34. Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Modeling Wine Preferences by Data Mining from Physicochemical Properties. Decis. Support Syst. 2009, 47, 547–553.
  35. Climate Data Online-Map Search-Bureau of Meteorology. Available online: http://www.bom.gov.au/climate/data/ (accessed on 2 May 2021).
  36. Individual Household Electric Power Consumption Data Set. Available online: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption (accessed on 2 May 2021).
  37. Dellamonica, J.; Lerolle, N.; Sargentini, C.; Beduneau, G.; Di Marco, F.; Mercat, A.; Richard, J.-C.M.; Diehl, J.-L.; Mancebo, J.; Rouby, J.-J. Accuracy and Precision of End-Expiratory Lung-Volume Measurements by Automated Nitrogen Washout/Washin Technique in Patients with Acute Respiratory Distress Syndrome. Crit. Care 2011, 15, 1–8.
  38. Hu, Y.; Zhu, Y.; Chen, H.; Graham, R.; Cheng, C.-K. Communication Latency Aware Low Power NoC Synthesis. In Proceedings of the IEEE 43rd annual Design Automation Conference, San Francisco, CA, USA, 24–28 July 2006; pp. 574–579.
  39. Garofalo, A.; Rusci, M.; Conti, F.; Rossi, D.; Benini, L. PULP-NN: Accelerating Quantized Neural Networks on Parallel Ultra-Low-Power RISC-V Processors. Philos. Trans. R. Soc. A 2020, 378, 20190155.
  40. Slater, W.S.; Tiwari, N.P.; Lovelly, T.M.; Mee, J.K. Total Ionizing Dose Radiation Testing of NVIDIA Jetson Nano GPUs. In Proceedings of the 2020 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 22–24 September 2020; pp. 1–3.
  41. Lang, R.; Lescisin, M.; Mahmoud, Q.H. Selecting a Development Board for Your Capstone or Course Project. IEEE Potentials 2018, 37, 6–14.
  42. Crocioni, G.; Pau, D.; Delorme, J.-M.; Gruosso, G. Li-Ion Batteries Parameter Estimation with Tiny Neural Networks Embedded on Intelligent IoT Microcontrollers. IEEE Access 2020, 8, 122135–122146.
Figure 1. Block diagram of the MLoF system architecture.
Figure 2. ANN algorithm implementation. (a) Block diagram of ANN. (b) Scheduling of the ANN processing.
Figure 3. Block diagram of MAC.
Figure 4. Block diagram of ReLU.
Figure 5. The images of the fitting effect of these functions. (a) Numerical interval [0, 1]. (b) Numerical interval (1, 2]. (c) Numerical interval (2, 3]. (d) Numerical interval (3, 4].
Figure 6. DT algorithm implementation. (a) The sample decision tree. (b) Block diagram of DT. (c) Scheduling of the DT processing.
Figure 7. The k-NN algorithm implementation. (a) Block diagram of k-NN. (b) Scheduling of the k-NN processing.
Figure 8. Block diagram of SNLF.
Figure 9. SVM algorithm implementation. (a) Block diagram of SVM. (b) Scheduling of the SVM processing.
Figure 10. Comparison of the receiver operating characteristic curve and precision-recall curve for the six typical IoT endpoint datasets with different algorithms. (a) Receiver Operating Characteristic curve for Gutter Oil binary classification; (b) Receiver Operating Characteristic curve for Smart Grid binary classification; (c) Receiver Operating Characteristic curve for Gutter Oil multiclass classification; (d) Receiver Operating Characteristic curve for Smart Grid multiclass classification; (e) Precision-Recall curve for Gutter Oil binary classification; (f) Precision-Recall curve for Smart Grid binary classification; (g) Precision-Recall curve for Gutter Oil multiclass classification; (h) Precision-Recall curve for Smart Grid multiclass classification.
Figure 11. Comparison of performance (latency), power consumption, PLP, and CPLP for the six typical IoT endpoint datasets with different algorithms when implemented on FPGA and embedded platforms. (a) Comparison of performance (latency). (b) Comparison of power consumption. (c) Comparison of PLP. (d) Comparison of CPLP.
Table 1. Configurable parameters.
ANN: Data width (max = 13); Number of inputs (max = 16); Number of hidden layers (max = 4); Number of neurons in each hidden layer (max = 8); Number of targets (max = 16); Activation functions
DT: Data width (max = 13); Number of depth (max = 6); Number of leaf nodes (max = 64); Number of targets (max = 16)
k-NN: Data width (max = 13); Number of inputs (max = 16); Number of neighbors (max = 16); Number of targets (max = 16)
SVM: Data width (max = 13); Number of inputs (max = 16); Number of targets (max = 16)
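The upper bounds in Table 1 can be checked before an IP core is configured. The following is a minimal sketch for the ANN column only. The class, the field names, the lower bound of 1, and the lowercase activation names are hypothetical and do not come from the MLoF code base. The example values (6 inputs, three hidden layers of four neurons, Sigmoid) follow the binary Gutter Oil model in Tables 6 and 7, while the data width and number of targets are assumed.

```python
from dataclasses import dataclass
from typing import List

# Upper bounds from the ANN column of Table 1; the key names are hypothetical.
ANN_LIMITS = {
    "data_width": 13,
    "num_inputs": 16,
    "num_hidden_layers": 4,
    "neurons_per_layer": 8,
    "num_targets": 16,
}
# Activation functions evaluated in the paper; the spelling is an assumption.
SUPPORTED_ACTIVATIONS = ("relu", "tanh", "sigmoid")


@dataclass(frozen=True)
class AnnConfig:
    data_width: int
    num_inputs: int
    num_hidden_layers: int
    neurons_per_layer: int
    num_targets: int
    activation: str

    def violations(self) -> List[str]:
        """Return one message per parameter outside the Table 1 range."""
        problems = [
            f"{name}={getattr(self, name)} is outside 1..{limit}"
            for name, limit in ANN_LIMITS.items()
            if not 1 <= getattr(self, name) <= limit
        ]
        if self.activation not in SUPPORTED_ACTIVATIONS:
            problems.append(f"unsupported activation: {self.activation}")
        return problems


ok = AnnConfig(13, 6, 3, 4, 2, "sigmoid")
too_deep = AnnConfig(13, 6, 5, 4, 2, "sigmoid")  # five hidden layers exceed the limit
print(ok.violations())  # []
print(too_deep.violations())
```

Returning a list of violations rather than raising on the first one lets a configuration script report every out-of-range parameter at once.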
Table 2. Fitting function at different intervals.
| Numerical Interval | Function | Absolute Error |
|---|---|---|
| [0, 1] | y = −0.3275x² + 1.0977x − 0.0038 | 0.0038 |
| (1, 2] | y = −0.1690x² + 0.7021x + 0.2324 | 0.0039 |
| (2, 3] | y = −0.0282x² + 0.1703x + 0.7370 | 0.0055 |
| (3, 4] | y = −0.0039x² + 0.0313x + 0.9363 | 0.0101 |
| (4, +∞) | y = 1 | / |
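The piecewise quadratic fit in Table 2 can be evaluated in software to check its behavior. The following is a minimal sketch under three stated assumptions: the fitted target is tanh(x) for x ≥ 0; the quadratic coefficients (and the constant of the first interval) are negative, since the minus signs are not legible in the extracted table but these values make the pieces nearly continuous and saturate towards 1; and negative inputs are handled through the odd symmetry of tanh. The function name is illustrative.

```python
import math

# (upper bound, a, b, c) for y = a*x^2 + b*x + c on each interval of Table 2.
# The negative signs are an assumption, as explained above.
SEGMENTS = (
    (1.0, -0.3275, 1.0977, -0.0038),
    (2.0, -0.1690, 0.7021, 0.2324),
    (3.0, -0.0282, 0.1703, 0.7370),
    (4.0, -0.0039, 0.0313, 0.9363),
)


def tanh_piecewise(x: float) -> float:
    """Approximate tanh(x) with the interval-wise quadratics of Table 2."""
    sign = -1.0 if x < 0 else 1.0  # assumed odd symmetry for negative inputs
    ax = abs(x)
    for upper, a, b, c in SEGMENTS:
        if ax <= upper:
            return sign * ((a * ax + b) * ax + c)  # Horner form: two multiplies
    return sign * 1.0  # saturation beyond x = 4


# Compare against the library tanh on a dense grid over [0, 5].
grid = [i / 1000 for i in range(5001)]
max_err = max(abs(tanh_piecewise(x) - math.tanh(x)) for x in grid)
print(f"max abs error on [0, 5]: {max_err:.4f}")
```

The Horner form mirrors what a hardware datapath would favor, since each segment then needs only two multiplications and two additions.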
Table 3. The resource of 10 different FPGAs.
| Producer | Device | LUTs | DSPs | RAM | Static Current (mA at 12 V, 25 °C) | Distinction |
|---|---|---|---|---|---|---|
| Anlogic | EF2M45 | 4480 | 15 | 700 Kb | 5 | Inside Cortex-M3 unit |
| Anlogic | EG4S20 | 19,600 | 29 | 156.8 Kb | 5 | Inside SDRAM unit |
| Gowin | GW2A | 20,736 | 48 | 828 Kb | 35 | / |
| Intel | Cyclone10LP10CL | 6272 | 30 | 270 Kb | 5 | Low-power design |
| Intel | MAX10M50DAF | 49,760 | 144 | 1638 Kb | 35 | / |
| Lattice | ICE40UP5 | 5280 | 8 | 128 Kb | 0.075 | Ultra low-power design |
| Lattice | MachXO2 | 6864 | / | 92 Kb | 0.08 | Low-power design |
| Microchip | M2S010 | 12,084 | 22 | 512 Kb | 6.9 | Inside Cortex-M3 unit |
| Pango | PGL12G | 12,480 | 20 | 85 Kb | 13 | / |
| Xilinx | Artix-7 | 33,280 | 240 | 1800 Kb | 14 | / |
Table 4. The specification of 10 different FPGA boards.
| Producer | Device | Manufacturer | Board Name | Memory | Interface | Price |
|---|---|---|---|---|---|---|
| Anlogic | EF2M45 | Nanjing Renmian Integrated Circuit | Sparkroad-M | / | / | $25.9 |
| Anlogic | EG4S20 | Nanjing Renmian Integrated Circuit | Sparkroad | SPI Flash, microSD | ADC, VGA, DVP Arduino, Raspberry Pi | $49.9 |
| Gowin | GW2A | MYMINIEYE | Combat | DDR3, SPI Flash, microSD | RJ45, HDMI, MIPI | $169 |
| Intel | Cyclone10LP10CL | QMTECH | Starter Kit | SDRAM, SPI Flash | ADC, MIPI | $49.9 |
| Intel | MAX10M50DAF | Terasic | DE10-Lite | SDRAM | VGA, Accelerometer | $85 |
| Lattice | ICE40UP5 | TinyFPGA | TinyFPGA BX | SPI Flash | / | $38 |
| Lattice | MachXO2 | STEPFPGA | STEP-MXO2 | / | / | $29.9 |
| Microchip | M2S010 | Trenz electronic | SMF2000 | SDRAM, SPI Flash | / | $59.9 |
| Pango | PGL12G | ALINX | PGL12G Development Board | SDRAM, SPI Flash | ADC, HDMI, MIPI | $52 |
| Xilinx | Artix-7 | QMTECH | Artix-7 Development Board | SDRAM, SPI Flash | ADC, MIPI | $79.9 |
Table 5. The specification of EDA software.
| Producer | EDA Software | Synthesis Tool | Availability |
|---|---|---|---|
| Anlogic | Tang Dynasty | TD Integrated Synthesis | Commercial |
| Gowin | GOWIN EDA | GowinSynthesis/Synplify | Commercial |
| Intel | Quartus Prime | Quartus Integrated Synthesis | Free License/Commercial |
| Lattice | iCEcube2 | Synplify Pro | Free License/Commercial |
| Lattice | Lattice Diamond | Lattice Synthesis Engine | Free License/Commercial |
| Microchip | Libero SoC | Synplify Pro ME | Commercial |
| Pango | Pango Design Suite | ADS/Synplify | Commercial |
| Xilinx | Vivado | Xilinx Synthesis Technology | Free License/Commercial |
Table 6. Dataset specifications.

| Dataset | Number of Input Features | Type |
|---|---|---|
| Gutter Oil | 6 | Binary classification |
| Smart Grid | 13 | Binary classification |
| Gutter Oil | 6 | Multiclass classification 6 |
| Wine Quality | 11 | Multiclass classification 11 |
| Rain | 8 | Regression |
| Power Consumption | 7 | Regression |
Table 7. The highest-scoring parameters of ANN obtained in different datasets.

| Type | Dataset | Score | ANN Architectures | Activation Function |
|---|---|---|---|---|
| Binary classification | Gutter Oil | 97.14% | [4,4,4] | Sigmoid |
| Binary classification | Smart Grid | 98.73% | [4,4,4] | Tanh |
| Multiclass classification | Gutter Oil | 97.09% | [8,8,8] | Tanh |
| Multiclass classification | Wine Quality | 73.01% | [8,8,8,8] | Sigmoid |
| Regression | Rain | 0.8583 (R²) | [8,8,8,8] | ReLU |
| Regression | Power Consumption | 0.9979 (R²) | [8,8] | Sigmoid |
Table 8. ANN deployment results on the best CPLP-performing FPGAs.

| Type | Dataset | Device | Score | LUTs | DSPs | Latency/us | Power/mW | PLP | CPLP |
|---|---|---|---|---|---|---|---|---|---|
| Binary classification | Gutter Oil | ICE40UP5 | 97.14% | 1898 | 8 | 11.12 | 140 | 1556.10 | 59,131.80 |
| Binary classification | Smart Grid | ICE40UP5 | 98.73% | 1855 | 8 | 11.12 | 140 | 1556.10 | 59,131.80 |
| Multiclass classification | Gutter Oil | ICE40UP5 | 97.09% | 3887 | 8 | 11 | 140 | 1540 | 58,520 |
| Multiclass classification | Wine Quality | MachXO2 | 73.01% | 6114 | / | 12.73 | 192 | 2444.54 | 73,091.87 |
| Regression | Rain | ICE40UP5 | 0.8583 (R²) | 3553 | 8 | 8.73 | 135 | 1178.69 | 44,790.03 |
| Regression | Power Consumption | ICE40UP5 | 0.9979 (R²) | 3653 | 8 | 11.98 | 140 | 1676.50 | 63,707 |
Table 9. The highest-scoring parameters of DT obtained in different datasets.

| Type | Dataset | Score | Max Depth | Max Leaf Nodes |
|---|---|---|---|---|
| Binary classification | Gutter Oil | 96.18% | 5 | 16 |
| Binary classification | Smart Grid | 97.98% | 4 | 8 |
| Multiclass classification | Gutter Oil | 96.85% | 6 | 32 |
| Multiclass classification | Wine Quality | 83.69% | 4 | 16 |
| Regression | Rain | 0.8481 (R²) | 4 | 8 |
| Regression | Power Consumption | 0.9976 (R²) | 6 | 64 |
Table 10. DT deployment results on the best CPLP-performing FPGAs.

| Type | Dataset | Device | Score | LUTs | DSPs | Latency/us | Power/mW | PLP | CPLP |
|---|---|---|---|---|---|---|---|---|---|
| Binary classification | Gutter Oil | ICE40UP5 | 96.18% | 451 | 0 | 7.92 | 133.90 | 1060.76 | 3882.38 |
| Binary classification | Smart Grid | ICE40UP5 | 97.98% | 285 | 0 | 5.87 | 122 | 716.14 | 2873.01 |
| Multiclass classification | Gutter Oil | ICE40UP5 | 96.85% | 715 | 0 | 7.92 | 135 | 1069.47 | 3876.36 |
| Multiclass classification | Wine Quality | ICE40UP5 | 83.69% | 305 | 0 | 5.60 | 121 | 677 | 2740.86 |
| Regression | Rain | ICE40UP5 | 0.8481 (R²) | 303 | 0 | 5.80 | 121 | 701.80 | 2838.75 |
| Regression | Power Consumption | ICE40UP5 | 0.9976 (R²) | 765 | 0 | 9.65 | 128 | 1235.46 | 4723.10 |
Table 11. The highest-scoring parameters of k-NN obtained in different datasets.

| Type | Dataset | Score | k Values |
|---|---|---|---|
| Binary classification | Gutter Oil | 99.36% | 2 |
| Binary classification | Smart Grid | 79.39% | 16 |
| Multiclass classification | Gutter Oil | 99.36% | 2 |
| Multiclass classification | Wine Quality | 81.23% | 8 |
| Regression | Rain | 0.8527 (R²) | 16 |
| Regression | Power Consumption | 0.9960 (R²) | 2 |
Table 12. The k-NN deployment results on the best CPLP-performing FPGAs.

| Type | Dataset | Device | Score | LUTs | DSPs | Latency/us | Power/mW | PLP | CPLP |
|---|---|---|---|---|---|---|---|---|---|
| Binary classification | Gutter Oil | ICE40UP5 | 99.36% | 979 | 6 | 7.81 | 134 | 1046.54 | 39,768.52 |
| Binary classification | Smart Grid | ICE40UP5 | 79.39% | 4949 | 8 | 9.32 | 135.79 | 1265.58 | 48,092.09 |
| Multiclass classification | Gutter Oil | ICE40UP5 | 99.36% | 1077 | 6 | 7.81 | 135 | 1054.35 | 40,065.30 |
| Multiclass classification | Wine Quality | MachXO2 | 81.23% | 4348 | / | 7.18 | 186 | 1335.67 | 39,936.41 |
| Regression | Rain | Cyclone10LP10CL | 0.8527 (R²) | 5397 | 8 | 11.89 | 305.54 | 3633.79 | 181,325.98 |
| Regression | Power Consumption | ICE40UP5 | 0.9960 (R²) | 1159 | 7 | 9.36 | 12.71 | 118.97 | 4520.69 |
Table 13. The highest scores of SVM obtained in different datasets.

| Type | Dataset | Score |
|---|---|---|
| Binary classification | Gutter Oil | 96.02% |
| Binary classification | Smart Grid | 98.35% |
| Multiclass classification | Gutter Oil | 96.33% |
| Multiclass classification | Wine Quality | 72.68% |
| Regression | Rain | 0.7932 (R²) |
| Regression | Power Consumption | 0.9979 (R²) |
Table 14. SVM deployment results on the best CPLP-performing FPGAs.

| Type | Dataset | Device | Score | LUTs | DSPs | Latency/us | Power/mW | PLP | CPLP |
|---|---|---|---|---|---|---|---|---|---|
| Binary classification | Gutter Oil | ICE40UP5 | 96.02% | 765 | 8 | 15.56 | 151 | 2349.41 | 89,277.54 |
| Binary classification | Smart Grid | ICE40UP5 | 98.35% | 955 | 8 | 15.90 | 154.98 | 2464.17 | 93,638.31 |
| Multiclass classification | Gutter Oil | ICE40UP5 | 96.33% | 765 | 8 | 14.90 | 152.99 | 2279.20 | 86,609.61 |
| Multiclass classification | Wine Quality | ICE40UP5 | 72.68% | 976 | 8 | 15.56 | 155.14 | 2413.89 | 91,727.65 |
| Regression | Rain | ICE40UP5 | 0.7932 (R²) | 527 | 8 | 13.51 | 147.76 | 1996.77 | 75,877.43 |
| Regression | Power Consumption | ICE40UP5 | 0.9979 (R²) | 516 | 8 | 13.47 | 148 | 1994.00 | 75,772.15 |
Table 15. The specification of three representative embedded development platforms.

| Platform | Processor | Clock | Price |
|---|---|---|---|
| Nvidia Jetson Nano | Cortex-A57 | 1.43 GHz | $89.00 |
| Raspberry Pi 3 B+ | Cortex-A53 | 1.2 GHz | $54.99 |
| STM32L476 Nucleo | Cortex-M4 | 80 MHz | $31.99 |
Table 16. The highest-scoring model obtained in different datasets.

| Type | Dataset | Model | Score | Architecture |
|---|---|---|---|---|
| Binary classification | Gutter Oil | k-NN | 99.36% | K = 2 |
| Binary classification | Smart Grid | ANN | 98.73% | [4,4,4], Tanh |
| Multiclass classification | Gutter Oil | k-NN | 99.36% | K = 2 |
| Multiclass classification | Wine Quality | DT | 83.69% | Max depth = 4, Max node = 16 |
| Regression | Rain | ANN | 0.8583 (R²) | [8,8,8], ReLU |
| Regression | Power Consumption | SVM | 0.9979 (R²) | N/A |
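Table 17. Breakdown of platform implemented results.

| Type | Dataset (Model) | Metric | The Best of FPGA | Jetson Nano | Raspberry | Nucleo |
|---|---|---|---|---|---|---|
| Binary classification | Gutter Oil (k-NN) | Accuracy | 99.36% | 99.36% | 99.36% | 99.36% |
| Binary classification | Gutter Oil (k-NN) | Latency | 7.81 us | 5.90 us | 10.51 us | 180.00 us |
| Binary classification | Gutter Oil (k-NN) | Power | 134 mW | 2120 mW | 1480 mW | 102 mW |
| Binary classification | Gutter Oil (k-NN) | PLP | 1046.54 | 12,508 | 15,554.8 | 18,360 |
| Binary classification | Gutter Oil (k-NN) | CPLP | 39,768.52 | 1,113,212 | 855,358.5 | 587,336.4 |
| Binary classification | Smart Grid (ANN) | Accuracy | 98.73% | 98.73% | 98.73% | 98.73% |
| Binary classification | Smart Grid (ANN) | Latency | 11.12 us | 5.97 us | 18.77 us | 300.00 us |
| Binary classification | Smart Grid (ANN) | Power | 140 mW | 2110 mW | 1470 mW | 101 mW |
| Binary classification | Smart Grid (ANN) | PLP | 1556.8 | 12,596.7 | 27,591.9 | 30,300 |
| Binary classification | Smart Grid (ANN) | CPLP | 59,158.4 | 1,121,106 | 1,517,279 | 969,297 |
| Multiclass classification | Gutter Oil (k-NN) | Accuracy | 99.36% | 99.36% | 99.36% | 99.36% |
| Multiclass classification | Gutter Oil (k-NN) | Latency | 7.81 us | 5.44 us | 10.25 us | 180.00 us |
| Multiclass classification | Gutter Oil (k-NN) | Power | 135 mW | 2120 mW | 1470 mW | 102 mW |
| Multiclass classification | Gutter Oil (k-NN) | PLP | 1054.35 | 11,532.8 | 15,067.5 | 18,360 |
| Multiclass classification | Gutter Oil (k-NN) | CPLP | 40,065.3 | 1,026,419 | 828,561.8 | 587,336.4 |
| Multiclass classification | Wine Quality (DT) | Accuracy | 83.69% | 83.69% | 83.69% | 83.69% |
| Multiclass classification | Wine Quality (DT) | Latency | 5.60 us | 1.37 us | 5.82 us | 84.00 us |
| Multiclass classification | Wine Quality (DT) | Power | 121 mW | 2060 mW | 1350 mW | 101 mW |
| Multiclass classification | Wine Quality (DT) | PLP | 677.6 | 2822.2 | 7857 | 8484 |
| Multiclass classification | Wine Quality (DT) | CPLP | 25,748.8 | 251,175.8 | 432,056.4 | 271,403.2 |
| Regression | Rain (ANN) | R² | 0.8583 | 0.8583 | 0.8583 | 0.8583 |
| Regression | Rain (ANN) | Latency | 8.73 us | 7.81 us | 35.77 us | 261.00 us |
| Regression | Rain (ANN) | Power | 135 mW | 2140 mW | 1480 mW | 102 mW |
| Regression | Rain (ANN) | PLP | 1178.55 | 16,713.4 | 52,939.6 | 26,622 |
| Regression | Rain (ANN) | CPLP | 44,784.9 | 1,487,493 | 2,911,149 | 851,637.8 |
| Regression | Power Consumption (SVM) | R² | 0.9979 | 0.9979 | 0.9979 | 0.9979 |
| Regression | Power Consumption (SVM) | Latency | 13.47 us | 7.94 us | 43.73 us | 364.00 us |
| Regression | Power Consumption (SVM) | Power | 148 mW | 2610 mW | 1530 mW | 103 mW |
| Regression | Power Consumption (SVM) | PLP | 1993.56 | 20,723.4 | 66,906.9 | 37,492 |
| Regression | Power Consumption (SVM) | CPLP | 75,755.28 | 1,844,383 | 3,679,210 | 1,199,369 |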
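The PLP and CPLP rows in Table 17 are consistent with simple products: PLP as latency (us) times power (mW), and CPLP as PLP times the board price (USD). The sketch below reproduces the binary-classification Gutter Oil (k-NN) entries under that assumption. The function names `plp` and `cplp` are hypothetical, and the prices are read from Tables 4 and 15.

```python
def plp(latency_us: float, power_mw: float) -> float:
    """Power-latency product (ASSUMED: latency in us times power in mW)."""
    return latency_us * power_mw


def cplp(price_usd: float, latency_us: float, power_mw: float) -> float:
    """Cost-power-latency product (ASSUMED: price in USD times PLP)."""
    return price_usd * plp(latency_us, power_mw)


# Binary-classification Gutter Oil (k-NN) entries: (price, latency, power).
PLATFORMS = {
    "ICE40UP5": (38.00, 7.81, 134.0),
    "Jetson Nano": (89.00, 5.90, 2120.0),
    "Raspberry Pi 3 B+": (54.99, 10.51, 1480.0),
    "STM32L476 Nucleo": (31.99, 180.00, 102.0),
}

for name, (price, latency, power) in PLATFORMS.items():
    print(f"{name}: PLP={plp(latency, power):.2f}, "
          f"CPLP={cplp(price, latency, power):.2f}")

# How many times larger each SoC platform's CPLP is than the FPGA's.
fpga_cplp = cplp(*PLATFORMS["ICE40UP5"])
ratios = {
    name: cplp(*values) / fpga_cplp
    for name, values in PLATFORMS.items()
    if name != "ICE40UP5"
}
```

Because the FPGA has the smallest CPLP in every column of Table 17, each entry in `ratios` reads as the factor by which the FPGA improves on that platform for this one task.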
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Chen, R.; Wu, T.; Zheng, Y.; Ling, M. MLoF: Machine Learning Accelerators for the Low-Cost FPGA Platforms. Appl. Sci. 2022, 12, 89. https://doi.org/10.3390/app12010089

AMA Style

Chen R, Wu T, Zheng Y, Ling M. MLoF: Machine Learning Accelerators for the Low-Cost FPGA Platforms. Applied Sciences. 2022; 12(1):89. https://doi.org/10.3390/app12010089

Chicago/Turabian Style

Chen, Ruiqi, Tianyu Wu, Yuchen Zheng, and Ming Ling. 2022. "MLoF: Machine Learning Accelerators for the Low-Cost FPGA Platforms" Applied Sciences 12, no. 1: 89. https://doi.org/10.3390/app12010089

APA Style

Chen, R., Wu, T., Zheng, Y., & Ling, M. (2022). MLoF: Machine Learning Accelerators for the Low-Cost FPGA Platforms. Applied Sciences, 12(1), 89. https://doi.org/10.3390/app12010089
