Deep Learning Scheduling on a Field-Programmable Gate Array Cluster Using Configurable Deep Learning Accelerators
Abstract
1. Introduction
- Benchmarking a reconfigurable FPGA cluster platform: We present a comprehensive evaluation of an FPGA cluster designed for flexible deployment of DL accelerators.
- Evaluation of multiple DLA architectures: We assess five distinct DLA architectures (VTA, NVDLA, Tensil CU, PipeCNN, and the Xilinx DPU) under a ResNet-18 workload, providing standardized performance benchmarks for each configuration under different hardware conditions.
- Integration of advanced compilation and scheduling techniques: We combine a state-of-the-art DL compiler (Apache TVM) with MPI-based scheduling (OpenMPI) to optimize hardware deployment across four scheduling methods; a minimal scheduling sketch is shown after this list.
- Achieving significant speedups: Our experimental evaluation shows that the proposed FPGA cluster methodology steadily reduces processing time as FPGAs are added, with latency reductions of up to 90% (roughly a 10× speedup) in the best configurations, underscoring the efficiency of our approach.
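To make the scheduling contribution concrete, the listing below sketches how a scatter-gather schedule (Section 5.2) might be expressed with OpenMPI's Python bindings (mpi4py). This is a minimal illustration under assumptions, not the authors' implementation: the helper run_inference() is a placeholder for the DLA-specific runtime call (e.g., a TVM-compiled module), and the batch shapes are arbitrary.

```python
# Hypothetical scatter-gather scheduling sketch using mpi4py (OpenMPI bindings).
# run_inference() is a placeholder for the per-node DLA runtime call; it is not
# part of any library API.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # rank 0 acts as the host node
size = comm.Get_size()   # one rank per FPGA board

def run_inference(images):
    """Placeholder: run the compiled ResNet-18 on the local accelerator."""
    return np.zeros((len(images), 1000), dtype=np.float32)

if rank == 0:
    # Host prepares a batch of input images and splits it into one chunk per node.
    batch = np.random.rand(size * 4, 3, 224, 224).astype(np.float32)
    chunks = np.array_split(batch, size)
else:
    chunks = None

# Scatter: each node receives its chunk and runs it on its local DLA.
local_images = comm.scatter(chunks, root=0)
local_outputs = run_inference(local_images)

# Gather: the host collects partial results and reassembles the batch output.
outputs = comm.gather(local_outputs, root=0)
if rank == 0:
    predictions = np.concatenate(outputs)
    print(predictions.shape)  # (size * 4, 1000)
```

Pipeline and fused scheduling would instead replace the single scatter/gather pair with point-to-point transfers between consecutive ranks, each running a subset of the network's layers.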
2. Deep Learning Accelerator Discussion
2.1. Versatile Tensor Accelerator (VTA)
2.2. Nvidia DLA (NVDLA)
2.3. Tensil CU
2.4. PipeCNN
2.5. Xilinx DPU
3. Reconfigurable FPGA Cluster Design
3.1. Hardware Components
3.2. Software/Firmware Stack
4. Neural Network Compilation
4.1. VTA Microcode Generation Using TVM
4.2. TVM Frontend + NVDLA Runtime
4.3. Xilinx Vitis AI
4.4. Tensil AI
4.5. PipeCNN
4.6. Customized Compute Cores
5. Deep Neural Network Scheduling Across the FPGA Cluster
5.1. Pipeline Scheduling
5.2. Scatter-Gather Scheduling
5.3. AI Core Assignment
5.4. Fused Scheduling
6. Results: Comparison Across DLAs and Configurations
6.1. VTA Cluster Implementation
6.1.1. VTA Configuration Parameters
6.1.2. TVM ResNet-18 Compilation
6.1.3. Auto-Tuning from Schedule Templates (AutoTVM)
6.1.4. Evaluation on Hardware Implementations
6.2. NVDLA Implementation
6.2.1. ResNet-18 Compilation with TVM and NVDLA Runtime Engine
6.2.2. NVDLA Inference
6.3. Tensil CU Implementation
6.3.1. NN Compilation Stack
6.3.2. Tensil CU Performance Evaluation
6.4. Xilinx DPU
6.5. PipeCNN
6.5.1. NN Compilation
6.5.2. Performance Evaluation
6.6. Overall Power Consumption
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Latency (ms) and performance speedup for each DLA and scheduling method on the Zynq-7000 cluster (up to 12 Zynq-7020 boards) and the UltraScale+ cluster (up to 5 boards).

Zynq-7000 Cluster

| DLA | Quantity of FPGAs | Scatter-Gather (ms) | AI Core Assignment (ms) | Pipeline Scheduling (ms) | Fused Scheduling (ms) |
|---|---|---|---|---|---|
| VTA (100 MHz) | 1 | 27.34 | 27.34 | 27.34 | 27.34 |
| | 12 | 2.58 | 1.84 | 2.62 | 2.66 |
| | Performance Speedup | 90.56% | 93.27% | 90.42% | 90.27% |
| Tensil CU | 1 | 26.03 | - | - | - |
| | 2 | 15.24 | 29.33 | 28.73 | - |
| | 3 | 9.91 | 25.32 | 24.13 | 23.52 |
| | 12 | 2.25 | 1.84 | 2.62 | 2.66 |
| | Performance Speedup | 91.36% | 93.73% | 90.88% | 88.69% |
| Xilinx DPU | 1 | 51.87 | - | - | - |
| | 2 | 26.6 | 78.36 | 78.12 | - |
| | 3 | 18.19 | 65.39 | 55.56 | 41.38 |
| | 12 | 4.35 | 1.98 | 2.03 | 3.53 |
| | Performance Speedup | 91.61% | 97.47% | 97.40% | 91.47% |
| PipeCNN | 1 | 203.13 | - | - | - |
| | 2 | 106.25 | 244.36 | 228.31 | - |
| | 3 | 69.55 | 217.12 | 215.52 | 218.18 |
| | 12 | 17.83 | 11.51 | 15.12 | 19.07 |
| | Performance Speedup | 91.22% | 95.29% | 93.38% | 91.26% |

UltraScale+ Cluster

| DLA | Quantity of FPGAs | Scatter-Gather (ms) | AI Core Assignment (ms) | Pipeline Scheduling (ms) | Fused Scheduling (ms) |
|---|---|---|---|---|---|
| VTA (100 MHz) | 1 | 25.15 | 25.15 | 25.15 | 25.15 |
| | 5 | 6.01 | 14.14 | 8.58 | 6.93 |
| | Performance Speedup | 76.10% | 43.78% | 65.88% | 72.45% |
| Tensil CU | 1 | 4.67 | - | - | - |
| | 2 | 2.94 | 7.2 | 6.08 | - |
| | 3 | 3.89 | 6.35 | 5.11 | 5.9 |
| | 5 | 2.75 | 3.22 | 3.39 | 3.48 |
| | Performance Speedup | 41.11% | 55.28% | 44.24% | 41.02% |
| Xilinx DPU | 1 | 2.85 | - | - | - |
| | 2 | 3.72 | 8.88 | 7.08 | - |
| | 3 | 4.47 | 6.43 | 5.53 | 5.11 |
| | 5 | 3.05 | 4.81 | 2.93 | 3.03 |
| | Performance Speedup | −7.02% | 45.83% | 58.62% | 40.70% |
| PipeCNN | 1 | 62.45 | - | - | - |
| | 2 | 33.35 | 77.28 | 66.01 | - |
| | 3 | 66.15 | 92.31 | 49.27 | 75.43 |
| | 5 | 29.13 | 57.45 | 22.11 | 43.52 |
| | Performance Speedup | 53.35% | 25.66% | 66.51% | 42.30% |
| NVDLA 64 MAC | 1 | 346.41 | - | 346.41 | 346.41 |
| | 5 | 73.37 | - | 132.53 | 118.72 |
| | Performance Speedup | 78.82% | - | 61.74% | 65.73% |
| NVDLA 256 + 512 MAC | 1 | 73.93 | - | 73.93 | 73.93 |
| | 5 | 56.37 | - | 78.47 | 64.27 |
| | Performance Speedup | 23.75% | - | −6.14% | 13.07% |
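The performance speedup percentages above are consistent with the relative latency reduction from the smallest reported configuration to the full cluster for each scheduling method:

$$
\mathrm{Speedup} = \frac{T_{\mathrm{base}} - T_{\mathrm{cluster}}}{T_{\mathrm{base}}} \times 100\%,
\qquad \text{e.g.} \quad
\frac{27.34\ \mathrm{ms} - 2.58\ \mathrm{ms}}{27.34\ \mathrm{ms}} \times 100\% \approx 90.56\%
$$

for VTA scatter-gather scheduling on the Zynq-7000 cluster (1 board versus 12 boards).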
Parameters | Zynq-7020 |
---|---|
CLOCK_FREQUENCY | 100 MHz |
INPUT_WIDTH | 8-bit |
WEIGHT_WIDTH | 8-bit |
ACCUMULATOR_WIDTH | 32-bit |
BATCH_SIZE | 1 |
BLOCK_SIZE | 16 |
MICRO_OP_BUFFER_SIZE | 32 KB |
INPUT_BUFFER_SIZE | 32 KB |
WEIGHT_BUFFER_SIZE | 256 KB |
ACCUMULATOR_BUFFER_SIZE | 128 KB |
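These parameters correspond to VTA's hardware configuration file (vta_config.json in the TVM/VTA source tree), where tensor widths, batch/block sizes, and buffer capacities are log2-encoded. The snippet below is a sketch of such a file matching the Zynq-7020 column above; the TARGET and HW_VER entries are assumptions, and older VTA versions split LOG_BLOCK into separate LOG_BLOCK_IN/LOG_BLOCK_OUT fields.

```python
# Sketch of a vta_config.json matching the Zynq-7020 VTA parameters above.
# Values are log2-encoded: e.g. LOG_BLOCK = 4 -> 16x16 GEMM block,
# LOG_WGT_BUFF_SIZE = 18 -> 256 KB weight buffer.
import json

vta_config = {
    "TARGET": "pynq",         # assumed board target for the Zynq-7020 nodes
    "HW_VER": "0.0.2",        # assumed hardware version string
    "LOG_INP_WIDTH": 3,       # 8-bit activations
    "LOG_WGT_WIDTH": 3,       # 8-bit weights
    "LOG_ACC_WIDTH": 5,       # 32-bit accumulator
    "LOG_BATCH": 0,           # batch size 1
    "LOG_BLOCK": 4,           # 16x16 GEMM block
    "LOG_UOP_BUFF_SIZE": 15,  # 32 KB micro-op buffer
    "LOG_INP_BUFF_SIZE": 15,  # 32 KB input buffer
    "LOG_WGT_BUFF_SIZE": 18,  # 256 KB weight buffer
    "LOG_ACC_BUFF_SIZE": 17,  # 128 KB accumulator buffer
}

with open("vta_config.json", "w") as f:
    json.dump(vta_config, f, indent=2)
```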
Resource | Utilization |
---|---|
LUT | 25,635 |
LUTRAM | 2092 |
FF | 24,968 |
BRAM | 132 |
DSP | 220 |
Parameters | Size |
---|---|
CLOCK_FREQUENCY | 200 MHz |
INPUT_WIDTH | 8-bit |
WEIGHT_WIDTH | 8-bit |
ACCUMULATOR_WIDTH | 32-bit |
BATCH_SIZE | 1 |
BLOCK_SIZE | 32 |
MICRO_OP_BUFFER_SIZE | 64 KB |
INPUT_BUFFER_SIZE | 64 KB |
WEIGHT_BUFFER_SIZE | 512 KB |
ACCUMULATOR_BUFFER_SIZE | 256 KB |
Parameters | NVDLA 64 MAC | NVDLA 256 MAC | NVDLA 512 MAC |
---|---|---|---|
FEATURE_DATA_TYPE | INT8 | INT8 | INT8 |
WEIGHT_DATA_TYPE | INT8 | INT8 | INT8 |
SDP_FUNCTION | SINGLE SCALING | SINGLE SCALING | SINGLE SCALING |
MAC_ATOMIC_C_SIZE | 8 | 32 | 32 |
MAC_ATOMIC_K_SIZE | 8 | 8 | 32 |
MEMORY_ATOMIC_SIZE | 8 | 8 | 8 |
CONV_BUF_BANK_NUM | 32 | 32 | 32 |
CONV_BUF_BANK_WIDTH | 8 | 32 | 32 |
CONV_BUF_BANK_DEPTH | 512 | 128 | 512 |
SDP_BS_THROUGHPUT | 1 | 1 | 4 |
SDP_BN_THROUGHPUT | 1 | 1 | 4 |
SDP_EW_THROUGHPUT | 1 | 1 | 4 |
PDP_THROUGHPUT | 1 | 1 | 2 |
CDP_THROUGHPUT | 1 | 1 | 2 |
Resource | NVDLA 64 MAC | NVDLA 256 MAC | NVDLA 512 MAC |
---|---|---|---|
LUT | 78,508 | 100,339 | 161,935 |
LUTRAM | 1812 | 1993 | 3579 |
FF | 88,335 | 113,735 | 157,610 |
BRAM | 64 | 128 | 853 |
URAM | - | - | - |
DSP | 32 | 32 | 65 |
Parameters | ZYNQ-7020 | KV260 | ZCU-104/ZCU-102 |
---|---|---|---|
DATA_TYPE | FP16BP8 | FP16BP8 | FP16BP8 |
ARRAY_SIZE | 8 | 16 | 32 |
DRAM0_DEPTH | 1,048,576 | 2,097,152 | 2,097,152 |
DRAM1_DEPTH | 1,048,576 | 2,097,152 | 2,097,152 |
LOCAL_DEPTH | 8192 | 8192 | 16,384 |
ACCUMULATOR_DEPTH | 2048 | 4096 | 4096 |
SIMD_REG_DEPTH | 1 | 1 | 1 |
STRIDE_0_DEPTH | 8 | 8 | 8 |
STRIDE_1_DEPTH | 8 | 8 | 8 |
NUM_THREADS | 1 | 1 | 1 |
THREAD_QUEUE_DEPTH | 8 | 8 | 8 |
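Tensil captures these parameters in an architecture description (.tarch) file consumed by its compiler. The snippet below sketches what the ZYNQ-7020 column might look like in that format, written here as a Python dictionary serialized to JSON; field names follow the Tensil documentation, and the exact file used by the authors is an assumption.

```python
# Sketch of a Tensil .tarch architecture description mirroring the ZYNQ-7020
# column of the table above (field names per the Tensil docs).
import json

zynq7020_tarch = {
    "data_type": "FP16BP8",      # 16-bit fixed point with an 8-bit binary point
    "array_size": 8,             # 8x8 systolic array
    "dram0_depth": 1048576,
    "dram1_depth": 1048576,
    "local_depth": 8192,
    "accumulator_depth": 2048,
    "simd_registers_depth": 1,
    "stride0_depth": 8,
    "stride1_depth": 8,
    "number_of_threads": 1,
    "thread_queue_depth": 8,
}

with open("zynq7020.tarch", "w") as f:
    json.dump(zynq7020_tarch, f, indent=2)
```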
Resource | ZYNQ-7020 | KV260 | ZCU104 | ZCU102 |
---|---|---|---|---|
LUT | 15,960 | 30,341 | 56,845 | 56,806 |
LUTRAM | 1914 | 3456 | 4214 | 4214 |
FF | 9576 | 18,346 | 58,521 | 58,479 |
BRAM | 44 | 122 | 93.5 | 293.5 |
URAM | - | 20 | 25 | - |
DSP | 73 | 274 | 1057 | 1057 |
Resource | ZYNQ-7020 (1xB1152) |
---|---|
LUT | 43,200 |
LUTRAM | 4562 |
FF | 75,798 |
BRAM | 121 |
URAM | - |
DSP | 196 |
Num. DPU Cores | 1 |
Resource | KV260 (1xB4096) | ZCU104 (2xB4096) | ZCU102 (3xB4096) |
---|---|---|---|
LUT | 58,450 | 107,901 | 157,050 |
LUTRAM | 6145 | 11,729 | 17,331 |
FF | 106,316 | 204,298 | 301,537 |
BRAM | 111 | 218 | 775 |
URAM | 40 | 80 | - |
DSP | 704 | 1394 | 2084 |
Num. DPU Cores | 1 | 2 | 3 |
Parameters | ZYNQ-7020 | KV260 | ZCU-104/ZCU-102 |
---|---|---|---|
VEC_SIZE | 4 | 16 | 16 |
LANE_NUM | 2 | 8 | 16 |
CONV_GP_SIZE_X | 7 | 7 | 7 |
CONV_GP_SIZE_Y | 1 | 1 | 1 |
PIPE_DEPTH | 6 | 24 | 48 |
POOL_GP_SIZE_X | 4 | 4 | 4 |
DP_WIDTH | 8 | 8 | 8 |
Resource | ZYNQ-7020 | KV260 | ZCU104 | ZCU102 |
---|---|---|---|---|
LUT | 48,220 | 85,946 | 130,374 | 130,432 |
LUTRAM | 3568 | 3648 | 3962 | 4021 |
FF | 68,492 | 100,092 | 160,238 | 164,102 |
BRAM | 54.5 | 140.5 | 190.5 | 334.5 |
URAM | - | 11 | 18 | - |
DSP | 51 | 297 | 395 | 395 |
DLAs | Zynq-7020: Per FPGA Unit | Zynq-7020: Stack Total (×12) | UltraScale+: Per FPGA Unit | UltraScale+: Stack Total (×5) |
---|---|---|---|---|
VTA | 1.9 W | 22.8 W | 2.4 W | 12.0 W |
NVDLA 512 MAC | - | - | 4.3 W | 21.5 W |
Tensil CU | 1.6 W | 19.2 W | 3.4 W | 17.0 W |
Xilinx DPU | 2.2 W | 26.4 W | 6.6 W | 33.0 W |
PipeCNN | 2.1 W | 25.2 W | 4.2 W | 21.0 W |
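The stack totals in the table are simply the per-board figures scaled by the number of boards in each cluster, for example for VTA:

$$
P_{\mathrm{stack}} = N \cdot P_{\mathrm{board}}:\qquad
12 \times 1.9\ \mathrm{W} = 22.8\ \mathrm{W},\qquad
5 \times 2.4\ \mathrm{W} = 12.0\ \mathrm{W}.
$$

As a rough illustrative estimate (not a reported measurement), combining this with the latency table gives an energy per inference on the order of $22.8\ \mathrm{W} \times 2.58\ \mathrm{ms} \approx 59\ \mathrm{mJ}$ for VTA scatter-gather scheduling on the fully populated Zynq-7000 cluster.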