Advanced AI Hardware Designs Based on FPGAs

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence Circuits and Systems (AICAS)".

Deadline for manuscript submissions: closed (31 July 2021) | Viewed by 52674

Special Issue Editor


Prof. Dr. Joo-Young Kim
Guest Editor
School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Korea
Interests: VLSI design; computer architecture; data center architectures; FPGA; AI accelerators; domain specific processors; hardware/software co-design; distributed machine learning; processing-in-memory (PIM)

Special Issue Information

Dear Colleagues,

Machine learning (ML) and artificial intelligence (AI) technology have revolutionized how computers run cognitive tasks based on massive amounts of observed data. As more industries adopt the technology, we face a fast-growing demand for new hardware that enables faster and more energy-efficient processing of AI workloads.

In recent years, traditional hardware vendors such as Intel and Nvidia, as well as start-up companies such as Graphcore, Wave Computing, and Habana, have competed to offer the best computing platform for complex ML algorithms. Although the GPU is still the preferred computing platform thanks to its large user base and well-established programming interface, its top spot is not safe forever, owing to its low hardware utilization and poor energy efficiency.

Beyond energy efficiency and ease of programming, how to adapt to fast-changing AI/ML algorithms is another hot topic in AI hardware. The FPGA has a clear benefit on this point, as its processing can be reprogrammed or amended quickly within a relatively low power budget. In this Special Issue, we invite the latest developments in the field of advanced AI hardware design based on FPGAs, showcasing the device's strengths over other types of hardware, such as hardware/software co-design, customization, and scalability.

Topics include but are not limited to:

  • DNN inference/training accelerators on FPGAs;
  • Multi-FPGA approaches for scalable ML acceleration;
  • Distributed deep learning architecture on multiple FPGAs;
  • Hardware/Software co-design for energy-efficient ML on FPGA;
  • Narrow-precision and efficient floating-point representation on FPGA for ML applications;
  • Design automation from the ML algorithm to FPGAs;
  • Soft DNN processor to cover a wide range of ML applications.

Prof. Dr. Joo-Young Kim
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • DNN inference/training accelerators on FPGAs
  • Multi-FPGA approaches for scalable ML acceleration
  • Distributed deep learning architecture on multiple FPGAs
  • Hardware/Software co-design for energy-efficient ML on FPGA
  • Narrow-precision and efficient floating-point representation on FPGA for ML applications
  • Design automation from ML algorithm to FPGAs
  • Soft DNN processor to cover a wide range of ML applications

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (13 papers)

Editorial

2 pages, 168 KiB  
Editorial
Advanced AI Hardware Designs Based on FPGAs
by Joo-Young Kim
Electronics 2021, 10(20), 2551; https://doi.org/10.3390/electronics10202551 - 19 Oct 2021
Cited by 2 | Viewed by 2408
Abstract
Artificial intelligence (AI) and machine learning (ML) technologies enable computers to run cognitive tasks such as recognition, understanding, and reasoning, which were long believed to be processes that only humans are capable of, using a massive amount of data [...]
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)

Research

23 pages, 2800 KiB  
Article
An Efficient FPGA-Based Convolutional Neural Network for Classification: Ad-MobileNet
by Safa Bouguezzi, Hana Ben Fredj, Tarek Belabed, Carlos Valderrama, Hassene Faiedh and Chokri Souani
Electronics 2021, 10(18), 2272; https://doi.org/10.3390/electronics10182272 - 16 Sep 2021
Cited by 29 | Viewed by 6927
Abstract
Convolutional Neural Networks (CNNs) continue to dominate research in the area of hardware acceleration using Field Programmable Gate Arrays (FPGAs), proving their effectiveness in a variety of computer vision applications such as object segmentation, image classification, face detection, and traffic sign recognition, among others. However, there are numerous constraints for deploying CNNs on FPGAs, including limited on-chip memory, CNN size, and configuration parameters. This paper introduces Ad-MobileNet, an advanced CNN model inspired by the baseline MobileNet model. The proposed model uses an Ad-depth engine, an improved version of the depth-wise separable convolution unit. Moreover, we propose an FPGA-based implementation model that supports the Mish, TanhExp, and ReLU activation functions. Experimental results using the CIFAR-10 dataset show that Ad-MobileNet achieves a classification accuracy of 88.76% while requiring few computational hardware resources. Compared to state-of-the-art methods, the proposed method attains a fairly high recognition rate while using fewer computational hardware resources; indeed, it reduces hardware resource usage by more than 41% compared to the baseline model.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
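The Ad-depth engine itself is not reproduced here, but the baseline building block it refines, a depth-wise separable convolution followed by a Mish activation, can be sketched in a few lines of NumPy (the shapes, kernel size, and random weights are illustrative assumptions, not the published architecture):

```python
import numpy as np

def mish(x):
    # Mish activation: x * tanh(softplus(x))
    return x * np.tanh(np.log1p(np.exp(x)))

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    # x: (H, W, C_in); dw_kernels: (3, 3, C_in); pw_kernels: (C_in, C_out)
    H, W, C_in = x.shape
    out = np.zeros((H - 2, W - 2, C_in))
    for c in range(C_in):                       # depthwise: one small kernel per channel
        for i in range(H - 2):
            for j in range(W - 2):
                out[i, j, c] = np.sum(x[i:i + 3, j:j + 3, c] * dw_kernels[:, :, c])
    return mish(out @ pw_kernels)               # pointwise 1x1 channel mixing + activation

x = np.random.randn(8, 8, 4)
y = depthwise_separable_conv(x, np.random.randn(3, 3, 4), np.random.randn(4, 16))
print(y.shape)  # (6, 6, 16)
```

The appeal of this block on an FPGA is that the depthwise stage needs only C_in small kernels instead of C_in x C_out full ones, which is what helps the model fit in limited on-chip memory.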

18 pages, 985 KiB  
Article
Congestion Prediction in FPGA Using Regression Based Learning Methods
by Pingakshya Goswami and Dinesh Bhatia
Electronics 2021, 10(16), 1995; https://doi.org/10.3390/electronics10161995 - 18 Aug 2021
Cited by 8 | Viewed by 3260
Abstract
Design closure in general VLSI and FPGA physical design flows is an important and time-consuming problem. Routing itself can consume as much as 70% of the total design time. Accurate congestion estimation during the early stages of the design flow can help alleviate last-minute routing-related surprises. This paper describes a methodology for a post-placement, machine learning-based routing congestion prediction model for FPGAs. Routing congestion is modeled as a regression problem. We describe the methods for generating training data, feature extraction, training, regression models, validation, and deployment. We tested our prediction model using the ISPD 2016 FPGA benchmarks. Our prediction method reports a very accurate localized congestion value in each channel around a configurable logic block (CLB), in both the vertical and horizontal directions. We demonstrate the effectiveness of our model on completely unseen designs that were not part of the training data set. The generated results show significant improvement in accuracy, measured as mean absolute error, and in prediction time when compared against the latest state-of-the-art works.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
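As a rough illustration of the regression formulation (not the paper's actual feature set, model family, or benchmark data), a post-placement congestion predictor can be fit per tile with ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-tile features after placement: pin density, net
# count, local wirelength estimate (illustrative, not the paper's features).
X = rng.random((500, 3))
y = X @ np.array([0.6, 0.3, 0.1]) + 0.05 * rng.standard_normal(500)

# Fit ordinary least squares with an intercept term.
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, y, rcond=None)

X_new = rng.random((4, 3))                 # tiles of an unseen design
pred = np.c_[X_new, np.ones(4)] @ w        # predicted channel congestion
print(pred)
```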

13 pages, 2171 KiB  
Article
An Approach of Binary Neural Network Energy-Efficient Implementation
by Jiabao Gao, Qingliang Liu and Jinmei Lai
Electronics 2021, 10(15), 1830; https://doi.org/10.3390/electronics10151830 - 30 Jul 2021
Cited by 6 | Viewed by 2792
Abstract
Binarized neural networks (BNNs), which have 1-bit weights and activations, are well suited for FPGA accelerators, as their dominant computations are bitwise arithmetic and the reduction in memory requirements means that all the network parameters can be stored in internal memory. However, the energy efficiency of these accelerators is still restricted by the abundant redundancies in BNNs, which hinders their deployment in smart sensors and tiny devices, scenarios with tight energy-consumption constraints. To overcome this problem, we propose an approach that implements BNN inference with excellent energy efficiency by pruning the massive redundant operations while maintaining the original accuracy of the networks. First, inspired by the observation that the convolution processes of two related kernels contain many repeated computations, we build a formula that clarifies the reuse relationship between their convolutional outputs and removes the unnecessary operations. Furthermore, by generalizing this reuse relationship to one tile of kernels in one neuron, we adopt an inclusion pruning strategy to further skip the superfluous evaluations of neurons whose real output values can be determined early. Finally, we evaluate our system on the Zynq 7000 XC7Z100 FPGA platform. Our design can prune 51 percent of the operations without any accuracy loss. Meanwhile, the energy efficiency of our system is as high as 6.55 × 10⁵ Img/kJ, which is 118× better than the best accelerator based on an NVIDIA Tesla V100 GPU and 3.6× higher than the state-of-the-art FPGA implementations for BNNs.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
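The kernel-reuse observation can be made concrete with a toy example: for weights and activations in {-1, +1}, flipping a single weight changes a dot product by exactly minus twice that position's contribution, so the output of a related kernel follows from an already-computed one at the cost of one update instead of a full convolution. A minimal sketch (the kernel size is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
window = rng.choice([-1, 1], size=(3, 3))   # binarized input patch
k_a = rng.choice([-1, 1], size=(3, 3))      # kernel A
k_b = k_a.copy()
k_b[0, 0] *= -1                             # kernel B differs in one weight

out_a = int(np.sum(window * k_a))           # full binary dot product
# Reuse: flipping one {-1,+1} weight changes the sum by -2x that
# position's product, so B's output needs one update, not nine MACs.
out_b = out_a - 2 * window[0, 0] * k_a[0, 0]
assert out_b == int(np.sum(window * k_b))
print(out_a, out_b)
```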

15 pages, 9011 KiB  
Article
FPGA Accelerator for Gradient Boosting Decision Trees
by Adrián Alcolea and Javier Resano
Electronics 2021, 10(3), 314; https://doi.org/10.3390/electronics10030314 - 29 Jan 2021
Cited by 24 | Viewed by 5560
Abstract
A decision tree is a well-known machine learning technique. Recently, decision trees have gained popularity thanks to the powerful Gradient Boosting ensemble method, which gradually increases accuracy at the cost of executing a large number of trees. In this paper we present an accelerator designed to optimize the execution of these trees while reducing energy consumption. We have implemented it on an FPGA for embedded systems and tested it with a relevant case study: pixel classification of hyperspectral images. In our experiments with different images, our accelerator processes the hyperspectral images at the same speed at which they are generated by the hyperspectral sensors. Compared to a high-performance processor running optimized software, on average our design is twice as fast and consumes 72 times less energy. Compared to an embedded processor, it is 30 times faster and consumes 23 times less energy.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
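As a hedged sketch of what such an accelerator has to execute: each decision tree is a chain of compare-and-branch steps over a flat node array, and gradient boosting simply sums the leaf values of many small trees. The node layout below is an illustrative assumption, not the paper's hardware format:

```python
def eval_tree(nodes, x):
    """nodes: flat list of (feature, threshold, left, right); a leaf
    stores its output value in 'threshold' and uses left == -1."""
    i = 0
    while nodes[i][2] != -1:                 # walk until a leaf is reached
        feat, thr, left, right = nodes[i]
        i = left if x[feat] < thr else right
    return nodes[i][1]

tree = [
    (0, 0.5, 1, 2),       # root: branch on x[0] < 0.5
    (-1, -1.3, -1, -1),   # leaf value -1.3
    (-1, 0.7, -1, -1),    # leaf value 0.7
]

def gbdt_predict(x, forest):
    # Gradient boosting sums the outputs of many small trees
    return sum(eval_tree(t, x) for t in forest)

print(gbdt_predict([0.3], [tree, tree]))  # -2.6
```

A flat array layout like this maps naturally onto a fixed-latency hardware traversal pipeline, since every step is one memory read plus one comparison.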

10 pages, 353 KiB  
Article
Efficient Memory Organization for DNN Hardware Accelerator Implementation on PSoC
by Antonio Rios-Navarro, Daniel Gutierrez-Galan, Juan Pedro Dominguez-Morales, Enrique Piñero-Fuentes, Lourdes Duran-Lopez, Ricardo Tapiador-Morales and Manuel Jesús Dominguez-Morales
Electronics 2021, 10(1), 94; https://doi.org/10.3390/electronics10010094 - 5 Jan 2021
Cited by 4 | Viewed by 3248
Abstract
The use of deep learning solutions in different disciplines is increasing, and their algorithms are computationally expensive in most cases. For this reason, numerous hardware accelerators have appeared that compute their operations efficiently in parallel, achieving higher performance and lower latency. These algorithms need large amounts of data to feed each of their computing layers, which makes it necessary to efficiently handle the data transfers that feed and collect information to and from the accelerators. These accelerators are commonly implemented on hybrid devices that combine an embedded computer, where an operating system can run, with a field-programmable gate array (FPGA), where the accelerator can be deployed. In this work, we present a software API that efficiently organizes the memory, preventing the reallocation of data from one memory area to another; it improves on the native Linux driver with an 85% speed-up and reduces the frame computing time by 28% in a real application.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
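The gist of the memory organization, reserving one contiguous region once and handing out per-layer views into it so data is never copied between memory areas on its way to or from the accelerator, can be sketched as follows (class and method names are hypothetical, not the published API):

```python
import numpy as np

class LayerArena:
    """Reserve one contiguous region once, then hand out per-layer
    views into it. The pool stands in for a DMA-able buffer shared
    between the embedded CPU and the FPGA fabric."""
    def __init__(self, total_bytes):
        self.pool = np.zeros(total_bytes, dtype=np.uint8)
        self.offset = 0

    def alloc(self, nbytes):
        view = self.pool[self.offset:self.offset + nbytes]  # a view, not a copy
        self.offset += nbytes
        return view

arena = LayerArena(1 << 20)
layer_in = arena.alloc(4096)     # region the CPU fills with input data
layer_out = arena.alloc(4096)    # region the accelerator writes results to
assert layer_out.base is arena.pool   # both live in the same buffer: zero copies
```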

20 pages, 1731 KiB  
Article
Implementation of Autoencoders with Systolic Arrays through OpenCL
by Rafael Gadea-Gironés, Vicente Herrero-Bosch, Jose Monzó-Ferrer and Ricardo Colom-Palero
Electronics 2021, 10(1), 70; https://doi.org/10.3390/electronics10010070 - 3 Jan 2021
Cited by 4 | Viewed by 2779
Abstract
In the world of algorithm acceleration and the implementation of deep neural networks' recall phase, OpenCL-based solutions have a clear tendency to produce perfectly adapted kernels for graphics processing unit (GPU) architectures. However, they fail to obtain the same results when applied to field-programmable gate array (FPGA) based architectures. This situation, along with enormous advances in new GPU architectures, makes it difficult to defend an acceleration solution based on FPGAs, even in terms of energy efficiency. Our goal in this paper is to demonstrate that multikernel structures based on classic systolic arrays can be written in OpenCL, extracting the most advanced features of FPGAs without resorting to traditional FPGA development in lower-level hardware description languages (HDLs) such as Verilog or VHDL. This OpenCL methodology is based on the intensive use of channels (an IntelFPGA extension of OpenCL) for communicating both data and control, and on the refinement of the OpenCL libraries with register transfer logic (RTL) code to improve the performance of the base and activation functions of the neurons and, above all, to reflect the importance of adequate communication between the layers when implementing neural networks.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
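The structure the authors express with OpenCL channels can be mimicked in plain software: each processing element of a one-dimensional systolic chain holds one neuron's weights, consumes activations from an input FIFO, and forwards them downstream. A minimal sketch, with Python queues standing in for IntelFPGA channels and sizes chosen for illustration:

```python
from queue import Queue

def pe(weights, in_q, out_q):
    # One processing element: multiply-accumulate its weights against
    # the activation stream, forwarding each activation downstream.
    acc = 0.0
    for w in weights:
        x = in_q.get()
        acc += w * x
        out_q.put(x)
    return acc

x = [1.0, 2.0, 3.0]
W = [[2.0, 0.0, 1.0],            # neuron 0
     [-1.0, 1.0, 0.0]]           # neuron 1
fifos = [Queue() for _ in range(len(W) + 1)]  # stand-ins for OpenCL channels
for v in x:
    fifos[0].put(v)
y = [pe(row, fi, fo) for row, fi, fo in zip(W, fifos, fifos[1:])]
print(y)  # [5.0, 1.0] == W @ x
```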

16 pages, 571 KiB  
Article
An Approach of Feed-Forward Neural Network Throughput-Optimized Implementation in FPGA
by Rihards Novickis, Daniels Jānis Justs, Kaspars Ozols and Modris Greitāns
Electronics 2020, 9(12), 2193; https://doi.org/10.3390/electronics9122193 - 18 Dec 2020
Cited by 17 | Viewed by 3305
Abstract
Artificial Neural Networks (ANNs) have become an accepted approach for a wide range of challenges. Meanwhile, the advancement of chip manufacturing processes is approaching saturation, which calls for new computing solutions. This work presents a novel approach to FPGA-based accelerator development for fully connected feed-forward neural networks (FFNNs). A specialized tool was developed to facilitate different implementations; it splits an FFNN into elementary layers, allocates computational resources, and generates a high-level C++ description for high-level synthesis (HLS) tools. Various topologies are implemented and benchmarked, and a comparison with related work is provided. The proposed methodology is applied to the implementation of a high-throughput virtual sensor.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
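One way to picture the resource-allocation step such a tool performs (a toy model under assumed layer sizes and multiplier budget, not the authors' actual algorithm): give each layer a share of the multipliers proportional to its multiply-accumulate count, so that every pipeline stage takes a similar number of cycles and throughput is maximized:

```python
import math

layers = [(128, 64), (64, 64), (64, 10)]   # (inputs, outputs) per FC layer
macs = [i * o for i, o in layers]          # multiply-accumulates per sample
budget = 512                               # total multipliers available (made up)

# Allocate multipliers proportionally to each layer's work...
alloc = [max(1, round(budget * m / sum(macs))) for m in macs]
# ...so every pipeline stage takes a similar number of cycles.
cycles = [math.ceil(m / a) for m, a in zip(macs, alloc)]
print(alloc, cycles)   # e.g. [324, 162, 25] -> [26, 26, 26]: balanced
```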

19 pages, 9254 KiB  
Article
Entropy-Driven Adaptive Filtering for High-Accuracy and Resource-Efficient FPGA-Based Neural Network Systems
by Elim Yi Lam Kwan and Jose Nunez-Yanez
Electronics 2020, 9(11), 1765; https://doi.org/10.3390/electronics9111765 - 23 Oct 2020
Cited by 5 | Viewed by 2665
Abstract
Binarized neural networks are well suited for FPGA accelerators, since their fine-grained architecture allows the creation of custom operators to support low-precision arithmetic operations, and the reduction in memory requirements means that all the network parameters can be stored in internal memory. Although good progress has been made in improving the accuracy of binarized networks, it can be significantly lower than in networks where weights and activations have multi-bit precision. In this paper, we address this issue by adaptively choosing the number of frames used during inference, exploiting the high frame rates that binarized neural networks can achieve. We present a novel entropy-based adaptive filtering technique that improves accuracy by varying the system's processing rate based on the entropy present in the neural network output. We focus on real data captured with a standard camera, rather than standard datasets that do not realistically represent the artifacts in video stream content. The overall design has been prototyped on the Avnet Zedboard, achieving 70.4% accuracy with a full processing pipeline from video capture to final classification output, 1.9 times better than the base static-frame-rate system. The main feature of the system is that while the classification rate averages a constant 30 fps, the real processing rate is dynamic, varying between 30 and 142 fps and adapting to the complexity of the data. The dynamic processing rate results in better efficiency than simply working at the full frame rate, while delivering high accuracy.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
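The control idea is compact enough to sketch: compute the Shannon entropy of the classifier's output distribution and spend extra frames only when the network is unsure. The threshold and the entropy-to-frame-count mapping below are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def entropy_bits(probs):
    # Shannon entropy of the softmax output, in bits
    p = probs[probs > 0]
    return float(-np.sum(p * np.log2(p)))

def frames_to_average(probs, low=0.5, max_frames=5):
    """High entropy = low confidence: spend extra frames and average.
    The threshold and linear mapping are illustrative choices."""
    h = entropy_bits(probs)
    if h < low:
        return 1                               # confident: base rate
    return min(max_frames, 1 + round(h * 2))   # unsure: slow down, look again

print(frames_to_average(np.array([0.95, 0.03, 0.02])))  # 1
print(frames_to_average(np.array([0.40, 0.35, 0.25])))  # 4
```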

19 pages, 591 KiB  
Article
Accelerating Event Detection with DGCNN and FPGAs
by Zhe Han, Jingfei Jiang, Linbo Qiao, Yong Dou, Jinwei Xu and Zhigang Kan
Electronics 2020, 9(10), 1666; https://doi.org/10.3390/electronics9101666 - 13 Oct 2020
Cited by 8 | Viewed by 2758
Abstract
Recently, Deep Neural Networks (DNNs) have been widely used in natural language processing. However, DNNs are often computation-intensive and memory-expensive, which makes them difficult to deploy in the real world. To solve this problem, we propose a network model based on the dilate gated convolutional neural network, which is very hardware-friendly. We expand the word representations and the depth of the network to improve performance, replace the Sigmoid function with one more amenable to hardware computation without accuracy loss, and quantize the network weights and activations to compress the network size. We then propose the first FPGA (Field Programmable Gate Array)-based event detection accelerator based on the proposed model. The accelerator significantly reduces latency with a fully pipelined architecture. We implemented the accelerator on the Xilinx XCKU115 FPGA. The experimental results show that our model obtains the highest F1-score of 84.6% on the ACE 2005 corpus. Meanwhile, the accelerator achieves 95.2 giga operations per second (GOPS) in performance and 13.4 GOPS/W in energy efficiency, which are 17× and 158× higher, respectively, than a Graphics Processing Unit (GPU).
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
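A sketch of the dilate gated convolution family the model builds on: a dilated convolution produces candidate features, a parallel branch produces a gate, and the gate mixes the candidate with the unchanged input. This version keeps the standard sigmoid gate that the paper replaces with a more hardware-friendly function, and all weights and sizes are random placeholders:

```python
import numpy as np

def dilated_gated_conv1d(x, w_conv, w_gate, dilation):
    """y[t] = g * conv(x)[t] + (1 - g) * x[t], where both the feature
    and the gate come from dilated convolutions over the sequence."""
    T, C = x.shape
    k = w_conv.shape[0]                        # number of taps
    span = (k - 1) * dilation
    y = x.copy()
    for t in range(span, T):
        taps = x[t - span:t + 1:dilation]      # dilated receptive field, (k, C)
        h = np.sum(taps[:, :, None] * w_conv, axis=(0, 1))
        g = 1.0 / (1.0 + np.exp(-np.sum(taps[:, :, None] * w_gate, axis=(0, 1))))
        y[t] = g * h + (1 - g) * x[t]          # gated residual mix
    return y

x = np.random.randn(16, 8)                     # 16 tokens, 8 channels
w = 0.1 * np.random.randn(3, 8, 8)             # (taps, C_in, C_out)
print(dilated_gated_conv1d(x, w, 0.1 * np.random.randn(3, 8, 8), dilation=2).shape)
```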

21 pages, 809 KiB  
Article
Resource Partitioning and Application Scheduling with Module Merging on Dynamically and Partially Reconfigurable FPGAs
by Zhe Wang, Qi Tang, Biao Guo, Ji-Bo Wei and Ling Wang
Electronics 2020, 9(9), 1461; https://doi.org/10.3390/electronics9091461 - 7 Sep 2020
Cited by 9 | Viewed by 2307
Abstract
Dynamically partially reconfigurable (DPR) technology based on FPGAs is applied extensively in the field of high-performance computing (HPC) because of its advantages in processing efficiency and power consumption. To make full use of the advantages of DPR in execution efficiency, we build a DPR system model that meets actual application requirements and objective constraints. Based on the consistency of reconfiguration order and dependencies, we propose two algorithms based on simulated annealing (SA). The algorithms partition the FPGA resources into several regions and schedule tasks onto those regions. To improve the performance of the algorithms, we exploit module merging to increase the parallelism of task execution and design a new solution-generation method to speed up convergence. Experimental results show that the proposed algorithms have a lower time complexity than mixed-integer linear programming (MILP), an iterative scheduler (IS), and Ant Colony Optimization (ACO). For applications with more tasks, the proposed algorithms show performance advantages, producing better partitioning and scheduling results in a shorter time.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
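Both proposed algorithms are built around the standard simulated-annealing skeleton below; the solution encoding, cost function (a crude stand-in for schedule length), and move generator are deliberately toy versions, not the paper's operators:

```python
import math, random

def simulated_annealing(state, cost, neighbor, t=1.0, t_min=1e-3, alpha=0.95):
    c = cost(state)
    while t > t_min:
        cand = neighbor(state)
        dc = cost(cand) - c
        if dc < 0 or random.random() < math.exp(-dc / t):
            state, c = cand, c + dc            # accept; worse moves allowed early on
        t *= alpha                             # cool down
    return state, c

# Toy instance: place 8 tasks into 3 reconfigurable regions, minimizing
# the most-loaded region as a crude proxy for schedule length.
tasks = [4, 7, 2, 5, 3, 6, 1, 8]

def cost(assign):
    return max(sum(t for t, r in zip(tasks, assign) if r == k) for k in range(3))

def neighbor(assign):
    b = list(assign)
    b[random.randrange(len(b))] = random.randrange(3)  # move one task
    return b

print(simulated_annealing([0] * len(tasks), cost, neighbor))
```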

19 pages, 4276 KiB  
Article
A Reconfigurable Convolutional Neural Network-Accelerated Coprocessor Based on RISC-V Instruction Set
by Ning Wu, Tao Jiang, Lei Zhang, Fang Zhou and Fen Ge
Electronics 2020, 9(6), 1005; https://doi.org/10.3390/electronics9061005 - 16 Jun 2020
Cited by 28 | Viewed by 6724
Abstract
As a typical artificial intelligence algorithm, the convolutional neural network (CNN) is widely used in Internet of Things (IoT) systems. In order to improve the computing ability of an IoT CPU, this paper designs a reconfigurable CNN-accelerated coprocessor based on the RISC-V instruction set. The interconnection structure of the acceleration chain designed in prior work is optimized, and the accelerator is connected to the RISC-V CPU core as a coprocessor. The corresponding coprocessor instructions are designed and an instruction compiling environment is established. Through inline assembly in C, the coprocessor instructions are called, coprocessor acceleration library functions are established, and common algorithms in IoT systems are implemented on the coprocessor. Finally, resource consumption evaluation and performance analysis of the coprocessor are completed on a Xilinx FPGA. The evaluation results show that the reconfigurable CNN-accelerated coprocessor consumes only 8534 LUTs, accounting for 47.6% of the total SoC system. The number of instruction cycles required to implement functions such as convolution and pooling with the designed coprocessor instructions is better than with the standard instruction set, and the acceleration ratio of convolution is 6.27 times that of the standard instruction set.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
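For context on what designing a coprocessor instruction involves: RISC-V reserves the custom-0 opcode (0b0001011) for exactly this kind of extension, and an R-type instruction word for it packs its fields as below. The specific funct values and register choices are hypothetical, not the paper's encoding:

```python
def encode_custom0(funct7, rs2, rs1, funct3, rd):
    # R-type layout: funct7 | rs2 | rs1 | funct3 | rd | opcode
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | 0b0001011)

# Hypothetical 'launch convolution' instruction reading x10 and x11.
word = encode_custom0(funct7=0b0000001, rs2=11, rs1=10, funct3=0b000, rd=0)
print(f".word 0x{word:08x}")   # emit-able from C via inline assembly
```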

25 pages, 8786 KiB  
Article
Novel CNN-Based AP2D-Net Accelerator: An Area and Power Efficient Solution for Real-Time Applications on Mobile FPGA
by Shuai Li, Kuangyuan Sun, Yukui Luo, Nandakishor Yadav and Ken Choi
Electronics 2020, 9(5), 832; https://doi.org/10.3390/electronics9050832 - 18 May 2020
Cited by 9 | Viewed by 3892
Abstract
Standard convolutional neural networks (CNNs) contain large amounts of data redundancy, and the same accuracy can be obtained even with lower-bit weights instead of a floating-point representation. Most CNNs have to be developed and executed on high-end GPU-based workstations, and it is hard to transplant the existing implementations onto portable edge FPGAs because of the limited on-chip block memory storage size and battery capacity. In this paper, we present the adaptive pointwise convolution and 2D convolution joint network (AP2D-Net), an ultra-low-power and relatively high-throughput system with dynamic-precision weights and activations. Our system has high performance, and we make a trade-off between accuracy and power efficiency in unmanned aerial vehicle (UAV) object detection scenarios. We evaluate our system on the Zynq UltraScale+ MPSoC Ultra96 mobile FPGA platform. The target board achieves a real-time speed of 30 fps under 5.6 W, and the FPGA on-chip power is only 0.6 W. The power efficiency of our system is 2.8× better than the best system design on a Jetson TX2 GPU and 1.9× better than the design on a PYNQ-Z1 SoC FPGA.
(This article belongs to the Special Issue Advanced AI Hardware Designs Based on FPGAs)
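The dynamic-precision ingredient can be illustrated generically: choose a per-tensor scale from the live value range and round to n-bit signed integers, trading a small accuracy loss for much narrower datapaths. This is the textbook dynamic fixed-point scheme, not AP2D-Net's exact per-layer policy:

```python
import numpy as np

def quantize_dynamic(w, bits):
    # Choose the scale from the live value range, then round to
    # signed 'bits'-bit integers (textbook dynamic fixed point).
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

w = np.random.randn(64).astype(np.float32)
q, s = quantize_dynamic(w, 8)
print(f"max abs error at 8 bits: {np.max(np.abs(w - q * s)):.4f}")
```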
