Article

Deep Learning Scheduling on a Field-Programmable Gate Array Cluster Using Configurable Deep Learning Accelerators

Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA
*
Author to whom correspondence should be addressed.
Information 2025, 16(4), 298; https://doi.org/10.3390/info16040298
Submission received: 11 February 2025 / Revised: 10 March 2025 / Accepted: 26 March 2025 / Published: 8 April 2025
(This article belongs to the Special Issue Machine Learning and Data Mining: Innovations in Big Data Analytics)

Abstract:
This paper presents the development and evaluation of a distributed system employing low-latency embedded field-programmable gate arrays (FPGAs) to optimize scheduling for deep learning (DL) workloads and to configure multiple deep learning accelerator (DLA) architectures. Aimed at advancing FPGA applications in real-time edge computing, this study focuses on achieving optimal latency for a distributed computing system. A novel methodology was adopted, using configurable hardware to examine clusters of DLAs, varying in architecture and scheduling techniques. The system demonstrated its capability to parallel-process diverse neural network (NN) models, manage compute graphs in a pipelined sequence, and allocate computational resources efficiently to intensive NN layers. We examined five configurable DLAs—Versatile Tensor Accelerator (VTA), Nvidia DLA (NVDLA), Xilinx Deep Processing Unit (DPU), Tensil Compute Unit (CU), and Pipelined Convolutional Neural Network (PipeCNN)—across two FPGA cluster types consisting of Zynq-7000 and Zynq UltraScale+ System-on-Chip (SoC) processors, respectively. Four deep neural network (DNN) workloads were tested: Scatter-Gather, AI Core Assignment, Pipeline Scheduling, and Fused Scheduling. These methods revealed an exponential decay in processing time, with speedups of up to 90%, although deviations were noted depending on the workload and cluster configuration. This research substantiates FPGAs’ utility in adaptable, efficient DL deployment, setting a precedent for future experimental configurations and performance benchmarks.

1. Introduction

Many of the advances that we are seeing in the field of deep learning (DL) are due to collaborative research efforts made in both hardware (HW) and software (SW) designs [1,2,3,4,5,6]. In recent years, DL frameworks have facilitated the exploration of new DL architectures. However, Electronic Design Automation (EDA) tools have not advanced at the same pace, and Register Transfer Level (RTL) designs still rely on traditional hardware description languages (HDLs). Although C++/High-Level Synthesis (HLS) tools have reduced the cost of hardware development, the final RTL can overuse logic resources and can be difficult to change [7]. The result is an environment where the gap between HW and DL architecture continues to grow. One of the main challenges is supporting new operations in HW as neural network (NN) computation graphs become more complex. With an Application-Specific Integrated Circuit (ASIC) neural processing unit (NPU), the processing elements (PEs) are fixed, and the DL compilers must support newer computations on the existing hardware. This process is complicated and tedious since each new operator must be added to the scheduler [8]. Although recent advances in design automation and customized accelerator architectures have mitigated some development challenges, traditional ASIC designs still face significant hurdles. Dedicated ASIC hardware for DL workloads continues to require extended development times due to high RTL development costs and the difficulty of creating adaptable processing elements that can efficiently support the diverse operations in NN graphs.
The increasing demand to perform efficient deep learning (DL) computation at the edge is driving the exploration of optimized neural network (NN) architectures and the allocation of dedicated hardware to the appropriate computational blocks, with an eye toward power efficiency, reduced latency for real-time applications, and scheduling optimizations. As noted above, despite advances in design automation, dedicated ASIC hardware for DL workloads continues to face challenges due to the high cost of RTL development and the difficulty of creating adaptable processing elements that support diverse NN operations. Most of the difficulties related to integrating new operators onto hardware have been addressed by DL compilers, which perform the necessary polyhedral optimizations and graph scheduling so that the hardware can meet the expected performance for the targeted graph. While some applications may require fixed workloads, others require a more dynamic approach, where the workload intensity as well as the number of compute resources vary. For that reason, FPGAs play a pivotal role thanks to their adaptability and parallel nature while maintaining low latency and optimal power levels.
In this context, our paper presents a comprehensive examination of different NN workloads across various DLA architectures. A summary of the results of this study is shown in Table 1 to immediately provide readers with the context and purpose of our experiments. Detailed results, along with the configuration parameters, can be found in Section 6 of this paper. We specifically focused on five key DLA configurations: Versatile Tensor Accelerator (VTA), Nvidia Deep Learning Accelerator (NVDLA), Xilinx DPU (Deep Learning Processing Unit), Tensil CU (Compute Unit), and PipeCNN.
These architectures were selected for their distinct approaches to DL acceleration, ranging from VTA’s flexible, open-source nature, ideal for experimental modifications, to NVDLA’s industry-standard efficiency, Xilinx DPU’s integration capabilities with high-end FPGA systems, Tensil CU’s innovative compute unit design, and PipeCNN’s effectiveness in CNN acceleration. By employing these varied architectures, our research explores the adaptability and performance implications in a real-time edge computing context. We assess their efficiency using four DNN workloads—Scatter-Gather, AI Core Assignment, Pipeline Scheduling, and Fused Scheduling (Scatter-Gather + Pipeline)—each chosen for their relevance in demonstrating how DLAs can be optimized for different stages and types of neural network processing. The Scatter-Gather workload illustrates DLAs’ ability to manage data distribution and collection efficiently, AI Core Assignment tests their capability in dynamic resource allocation, Pipeline Scheduling assesses the sequential processing efficiency, and Fused Scheduling combines the complexities of Scatter-Gather and Pipeline to evaluate a DLA’s overall performance in more intricate scenarios. This diverse set of workloads and architectures allows us to comprehensively evaluate the strengths and limitations of each DLA configuration, demonstrating the flexibility and efficiency of our system in adapting to the continuous evolution of AI algorithms and NN architectures. These benchmarks are crucial in identifying the most effective strategies for FPGA-based DL acceleration, ultimately contributing to the advancement of low-latency, power-efficient solutions in edge computing applications.
There has been increasing interest in FPGA-based machine learning and AI applications in recent years [9,10,11,12,13]. A cost-effective, easily scalable, and application-independent FPGA cluster co-processing platform for machine learning (ML) applications has been proposed by another group [14]. Another group proposed an ethernet infrastructure, similar to ours, for data traffic inside an FPGA cluster and studied the efficiency and performance of FPGA clusters for edge cloud computing services, concluding the approach to be trustworthy and effective [15]. Another group proposed an extensible and modularized FPGA-based processing-unit cluster for neural network accelerators with impressive hardware utilization efficiency [16]. A group from Tsinghua University, Beijing, addressed the inefficiency of previous FPGA mapping methods when adapting to the different data localities among the Inception and Residual layers and proposed a Layer Clusters Paralleling mapping method, achieving performance improvement over the state-of-the-art methods [17]. Finally, another group from Fudan University, Shanghai, presented a memory-efficient CNN accelerator design for resource-constrained devices in Internet of Things (IoT) and autonomous systems; the design was implemented on both FPGAs and ASICs, achieving comparable memory efficiency improvement [18].
The research by these groups explores individual aspects of our own work [19]; this paper serves as a consolidated report of our findings from testing different NN workloads on different DLA architectures. It also thoroughly investigates the technical role of NN compilation within the context of DLA architectures and FPGA hardware, serving as a resource for scientists and engineers seeking to determine the role these methodologies play in reconfigurable hardware and SoC processors.
In Section 2, we discuss the different types of deep learning accelerators that are optimal for our proposed system and introduce the concept of creating specialized cores for different computational layers. Section 3 presents the proposed system as a reconfigurable FPGA cluster architecture in terms of its physical design. In Section 4, we discuss the process of compiling a neural network model and the variety of architectures proposed and explored for our system. Section 5 discusses scheduling the deep neural network workload across the FPGA hardware domain and how to connect the distributed data from different cores in the cluster. Finally, in Section 6, we discuss the results of our approaches and report our findings, and Section 7 concludes this paper.
Our main contributions are as follows:
  • Benchmarking a reconfigurable FPGA cluster platform: We present a comprehensive evaluation of an FPGA cluster designed for flexible deployment of DL accelerators.
  • Evaluation of multiple DLA architectures: We assess five distinct DLA configurations under a ResNet-18 NN workload, providing standardized performance benchmarks for each configuration under different hardware conditions.
  • Integration of advanced compilation and scheduling techniques: We combine state-of-the-art DL compilers (Apache TVM) and scheduling tools (OpenMPI) to optimize hardware deployment for four scheduling methods.
  • Achieving significant speedups: Our experimental evaluation shows that the proposed FPGA cluster methodology achieves an exponential decay in processing time, with speedups of up to 90% in certain configurations, underscoring the efficiency of our approach.

2. Deep Learning Accelerator Discussion

Configurable DLA architectures in an FPGA cluster allowed for the exploration of different performance, area, and latency trade-offs by partitioning the data flow graph across multiple accelerators and reducing overall computation time. In our design, compute cores are implemented using one of two main approaches: direct RTL design in HDL or HLS with C++. These approaches offer clear benefits over fixed or general-purpose hardware, particularly when minimizing idle power and maximizing throughput are critical. Importantly, our work emphasizes evaluating how such partitioning strategies enhance overall cluster performance rather than testing the individual novelty of each accelerator. While the RTL design provides detailed control over hardware resources, HLS offers a more software-oriented development process, and both strategies result in custom compute cores engineered to eliminate unnecessary components and reduce idle power consumption.
In DLA ASIC designs, the emphasis often shifts from merely achieving higher clock frequencies (which may not be necessary for all tasks) to achieving high throughput. High throughput ensures that the DLA can process large volumes of data quickly and efficiently, which is particularly critical in real-time or latency-sensitive applications such as autonomous vehicles, robotics, and edge computing. In our FPGA cluster deployment, achieving high throughput is critical for efficiently managing partitioned machine learning workloads, and our methodology suggests that a cluster approach could offer significant performance benefits in other latency-sensitive scenarios, as highlighted in Table 1. When using Xilinx High-Level Synthesis (HLS), developers can further enhance the efficiency of custom cores by employing pragma definitions and loop optimizations. Pragmas allow for fine-grained control over the synthesis process, specifying how the code should be transformed into hardware description language.
This section delves into each of the five DLAs used in this study and examines its architectural design (such as SIMD or systolic arrays), logic resource utilization, memory configurations, computational components, and overall compute performance, including latency and throughput (measured in Tera Operations per Second, TOPs). The rest of the section aims to provide comprehensive details of how each DLA works in our system, and how it contributes towards the efficiency and capability of our implementation.

2.1. Versatile Tensor Accelerator (VTA)

The Versatile Tensor Accelerator (VTA) is an open-source hardware accelerator optimized for scalability, extensibility, and low-latency inferencing. VTA has four main modules: fetch, load, compute, and store. These modules enable high memory bandwidth usage on memory-bound workloads and high compute resource utilization on the PL side. On-chip SRAM is used through unidirectional data channels to establish communication between the modules. Each of the four modules has a connection to both a consumer and a producer. By introducing Read-After-Write (RAW) and Write-After-Read (WAR) queue dependencies, we ensure the correct timing and execution of producer-to-consumer tokens.
The VTA architecture can simultaneously use compute and memory modules to maximize resource usage in every clock cycle. TVM achieves this optimization by creating virtual threads: tasks are partitioned into two mutually exclusive execution contexts so that fetch, load, compute, and store operations do not interfere. VTA enables the modification of hardware parameters in the accelerator: one can modify the GEMM core tensor intrinsic and the I/O, weight, and accumulator tensor dimensions. The on-chip SRAM port memory and the data type sizes for weights and accumulation are also modifiable. Two classes of tensor operations can be performed on VTA’s register file: the ALU handles element-wise tensor operations such as addition, activation, and pooling, while the GEMM core performs the dense matrix multiplication operations required by, for example, 2D convolutions and dense layers. A block diagram depicting VTA can be found in Figure 1 [20].

2.2. Nvidia DLA (NVDLA)

Nvidia DLA (NVDLA) is an open-source DLA architecture developed by Nvidia and features a modular and power-efficient design capable of handling diverse neural network workloads. The project provides RTL modules and an Inference runtime engine for compiling and executing binaries on the DLA, supporting both FPGA prototypes and a C model emulation system running atop QEMU—a generic and open-source machine emulator and virtualizer. The Verilog source code is configurable at the build time to meet different performance, power, and area trade-offs. NVDLA mainly targets embedded systems and IoT devices with limited power budgets. Figure 2 shows a high-level architecture of NVDLA [21]. As can be seen in this figure, NVDLA has three major top-level blocks.
The Convolutional Core consists of multiply-accumulate (MAC) units for matrix–matrix multiplication in convolutional and fully connected layers of a DNN. The input activations and filter weights are stored in the Convolutional Buffer which is then fed into the Convolutional Core. The post-processing unit is composed of subunits that perform various processes such as pooling and applying non-linear activation functions. These three blocks are programmed and controlled by the Configuration and Control Block, which the host processor accesses via the Configuration and Space Bus (CSB) interface. All the processing units are connected to the Memory Interface block. This block arbitrates access to the main memory via the Data Backbone (DBB) interface.

2.3. Tensil CU

The Tensil Compute Unit (CU) emphasizes modularity, enabling customization for performance and resource constraints, particularly in edge applications. Tensil is a set of customizable tools for machine learning applications running on custom accelerator architectures; it includes an RTL generator, a model compiler, and a set of drivers. It allows computation without quantization or other degradations of the model, significantly better power optimization, and dynamic use on the majority of FPGA platforms. A decoder acting as a control block manages the movement of data between the host (where the samples are) and the FPGA hardware. Data flow to and from memory, are processed through a systolic array, and are then sent into accumulators. From there, the data are streamed between the ALUs and LUTs, and the final results are written back to memory and retrieved by the host.
The Tensil CU supports highly parallelized operations through its systolic array, enabling efficient matrix multiplications. Additionally, the integration of ALUs and LUTs allows for flexible execution of custom operations, allowing for dynamic adjustments to computation flow and optimizing resource usage based on workload requirements. The DLA also incorporates efficient memory hierarchies to minimize data transfer latencies, ensuring high throughput even for memory-intensive layers in deep learning models. The block diagram can be seen in Figure 3 [22].

2.4. PipeCNN

PipeCNN, an open-source pipelined convolutional neural network, employs FPGA-specific optimizations for high throughput and resource efficiency. PipeCNN uses HLS tools to reduce the complexity of designing in traditional RTL code. HLS enables the compilation of a high-level C/C++ program into low-level RTL code, which reduces the time and complexity of generating a suitable RTL that runs on an FPGA. Although logic can be over-utilized in some cases, the fast turnaround time for producing a functional device makes HLS well suited to FPGA prototyping. Implementations of PipeCNN are based on an OpenCL engine. The OpenCL framework is a cross-platform programming standard that assigns an FPGA board as an OpenCL device and a desktop CPU as an OpenCL host through a PCIe connection. Multiple parallel compute units (CUs) are defined as kernel functions; the OpenCL code is compiled for the FPGA accelerator while C/C++ code runs on the host. A hardware-level block diagram for PipeCNN is displayed in Figure 4 [23].
PipeCNN utilizes one or more convolutional layers, pooling layers, and one or more fully connected (FC) layers. The convolutional layers are designed to perform 3D multiply-accumulate (MAC) operations at specified positions on the feature map, and an inner product operation on the weighted summation of all input neurons. The pooling kernel subsamples directly on the output streams of the aforementioned convolutional layer’s kernel, which is then moved to and from global memory by two data mover kernels (MemRD and MemWR). This structure of cascaded kernels forms a deep computation pipeline that serially executes CNN operations without storing interlayer data in global memory. This is important because it significantly reduces the bandwidth requirement of the computations sent through the CNN.

2.5. Xilinx DPU

Finally, the Xilinx Deep Learning Processing Unit (DPU) leverages parallelized computation and optimized memory hierarchies for high-performance inferencing. The DPU is an IP block created by Xilinx dedicated to CNN applications. The Zynq UltraScale+ MPSoC family has a supported DPU IP block denoted DPUCZDX8G. Most of the PL inside the MPSoC is dedicated to the customized DPU IP. Different configurations can be selected based on the desired performance and logic usage limitations. To improve performance, these cores can be customized for higher parallelism, which can be modified across three dimensions: pixel parallelism, input parallelism, and output parallelism. For the proposed implementation, the logic resources were prioritized for DPU core parallelism over the number of DPU cores inside the PL. The configurable hardware architecture allows many applications to connect FPGAs to CNNs. Up to three configurable cores can be implemented for the different types of functionality supported by the IP: convolution and deconvolution, max pooling, ReLU and Leaky ReLU, Concat, Elementwise, Dilation, Reorg, a fully connected (FC) layer, batch normalization, and split. The communication protocol for this IP is AXI. The DPU IP has also been supported through PYNQ since 2021. The hardware-level block diagram is shown in Figure 5 [24,25].
In the Xilinx DPU architecture, instructions are fetched from off-chip memory to control the computing engine. The on-chip memory is used primarily for high throughput and efficiency optimizations. The processing elements (PEs) observed in Figure 5 take advantage of logic blocks such as the DSP-based units (multipliers, accumulators, adders, etc.) in the FPGA to implement a deep, pipelined design for the computing engine.

3. Reconfigurable FPGA Cluster Design

3.1. Hardware Components

Our cluster hardware encompasses two FPGA SoC configurations, both embodying a heterogeneous design philosophy that marries a low-power Processing System (PS) with Programmable Logic (PL). The key distinction between the two FPGA SoC types lies in the PL’s logic resource allocation and the PS’s CPU computational throughput [26]. As depicted in Figure 6, the compute-lite assemblage harnesses the capabilities of a dozen Xilinx Zynq-7020 chips in a mix of PYNQ-Z1 and ZedBoard platforms. Utilization is streamlined to prioritize the ethernet port and the Zynq-7020 chip, a comprehensive APSoC that amalgamates an adaptable FPGA with a dual-core processing unit. The APSoC’s PL provides 13,300 logic slices (each with four 6-input LUTs and eight flip-flops), 630 KB of fast block RAM, and 220 DSP slices, all driven by a 50 MHz input clock. Complementing the PL, the APSoC’s PS features a 650 MHz dual-core Cortex-A9 ARM processor, a DDR3 memory controller with eight DMA channels, and four high-speed AXI3 slave ports enabling PL-PS intercommunication.
For tasks demanding more computational rigor, our design integrates up to five Zynq® UltraScale+™ MPSoC platforms, differentiated by their more substantial logic unit counts. The MPSoC is constructed on a dual-natured CPU-FPGA framework, funneling configurable logic and multicore processing into a singular silicon die. The PL in this MPSoC comprises an array of logic cells (LUTs and FFs), BRAM, URAM, and DSP slices. The robust PS within the MPSoC is equipped with a 1.5 GHz Quad-core Arm® Cortex®-A53 processor, a 600 MHz Dual-core Cortex-R5 real-time processor, Mali™-400 MP2 GPU, and a memory controller with DMA channels, supported by high-performance AXI4 slave ports for PL-PS data exchange. Although the MPSoC platform can run up to 1.5 GHz for the ARM Cortex-A53 cores and 600 MHz for the Cortex-R5 real-time cores, we kept the clock speed across platforms the same for comparative purposes in the results between the two cluster configurations. We specifically used a combination of the ZCU102, ZCU104, and KV260 MPSoC platforms.
Comparatively, the Zynq-7000 series supplies a more modest resource pool and a reduced clock frequency, ensuring timing requirements are met devoid of any negative slack or hold time infractions. Notably, the Arm CPU’s integration with the FPGA fabric presents marked differences in both instruction set architecture and computational prowess. A primary benefit of the Zynq-7020 is its commendable power efficiency and cost-effectiveness, facilitating scalable computing within power-sensitive frameworks. A detailed power consumption estimation for each DLA per FPGA platform is given in Section 6.6.
To connect the boards into a cohesive cluster, a standard Cisco switch has been deployed alongside RJ-45 connectors, establishing a 1 Gb/s ethernet conduit linking the FPGA slave nodes to the cluster’s master node. For this experiment, higher bandwidth ethernet alternatives were not explored, but future exploration into this could lead to potential speedups in large-scale operations. The system is managed by a principal host PC, although an FPGA CPU node could alternatively assume the master role, supplanting the traditional slave configuration. This latter technique could be employed for automated scheduling of the NN workloads in the future, but for now, we manually configured the workloads on a host PC.

3.2. Software/Firmware Stack

The master node in the distributed computing environment operates on a Linux OS using clusterssh [27], serving as the central control and coordination point for the cluster [28,29,30,31,32]. The master node’s Linux environment supports secure remote connections, enabling users to access and manage the FPGA cluster from external networks. In addition to local scheduling and control, external clients can deploy hardware accelerators and perform functional simulations remotely through SSH and web-based interfaces, ensuring seamless integration of the cluster into broader workflows. Figure 7 shows an overall abstraction of how workloads are scheduled from the host PC to the FPGAs. A picture of an active control session for the FPGA cluster is shown on the left side in Figure 7, while the right side displays the FPGAs used in the cluster. Any number of FPGAs can be added to the cluster, and in our experiments, we separated the FPGA clusters by SoC type: Zynq-7000 and UltraScale+. This figure highlights the overall scheduling flow of our system from software to hardware.
The FPGA slave nodes also run on Linux OS, with variations depending on the targeted AI accelerator, potentially using the Python productivity for Zynq (PYNQ). This is an open-source development platform that aims to facilitate hardware design on Xilinx heterogeneous devices. It allows the use of the Python programming language together with Python libraries to work across different IP blocks linked to the PL through the PS. PYNQ simplifies the design of embedded systems applications, allowing for the manipulation of variables contained in the PL through a PYNQ User Interface (UI), which is often carried out through Jupyter Notebook after connecting via SSH to the FPGA. This setup allows for efficient and scalable AI processing in a distributed environment, though we did not interface with Jupyter Notebook for our research.
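As a minimal illustration of this PS-to-PL flow, the sketch below loads a bitstream through the PYNQ Overlay API and streams a buffer through an AXI DMA engine. The bitstream name, DMA instance name, and buffer shapes are hypothetical placeholders rather than the exact design used in our cluster.

```python
import numpy as np
from pynq import Overlay, allocate

# Hypothetical bitstream and DMA instance names; the overlay would expose
# whichever DLA cut was built into the PL for that node.
overlay = Overlay("dla_cluster_node.bit")
dma = overlay.axi_dma_0

# PS-side buffers shared with the PL through the DMA engine.
in_buf = allocate(shape=(224 * 224 * 3,), dtype=np.uint8)   # one input frame
out_buf = allocate(shape=(1000,), dtype=np.int32)           # class scores

in_buf[:] = np.random.randint(0, 255, size=in_buf.shape, dtype=np.uint8)
dma.sendchannel.transfer(in_buf)    # stream the frame into the accelerator
dma.recvchannel.transfer(out_buf)   # receive the result buffer
dma.sendchannel.wait()
dma.recvchannel.wait()
```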
To execute deep neural network (DNN) architectures efficiently on reconfigurable hardware, we employed a cutting-edge neural network compiler. In this particular study, we rely on Apache TVM as our compiler of choice. Apache TVM takes as its input the DNN model sourced from a DL framework and undertakes a transformative process, converting it into a low-level computational graph known as the intermediate representation (IR). This optimized IR graph is then capable of targeting a diverse array of hardware backends for seamless deployment and execution, including custom accelerators like our case.
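The snippet below is a hedged sketch of this frontend flow using TVM’s Python API: an ONNX ResNet-18 is imported into Relay, built, and executed. The model file name and input tensor name are placeholders, and the target here is the host CPU ("llvm"); the DLA-specific backends discussed next take over after the IR stage.

```python
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Placeholder model file and input name.
onnx_model = onnx.load("resnet18.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, {"data": (1, 3, 224, 224)})

# Lower the Relay IR and generate code for the host CPU.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
scores = module.get_output(0).numpy()
```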
In addition, some DLAs rely on their own microcode generation. Therefore, we use TVM as the frontend compiler to generate the IR graph and the specific DLA’s compiler as the code generator. We also use TVM’s runtime engine as a convenient way to schedule neural network graphs. In other cases, such as the Xilinx DPU IP, the runtime engine (XRT) and compiler are based on the Vitis AI software stack. The same applies to the Tensil CU, for which we used its dedicated IR graph and microcode generation as well as its runtime engine. IPs like PipeCNN and dedicated AI cores for specific layers rely on writing OpenCL kernels and use the corresponding OpenCL runtime.
Four separate NN workloads were manually assigned to each of the DLA architectures for AI core acceleration. Although this process could be automated in future work, manual assignment was more practical for our current study. For our experiments, we selected ResNet-18 as the benchmark model across all DLA architectures and their subsequent NN workloads. ResNet-18 is renowned for its deep residual learning framework, which effectively mitigates the vanishing gradient problem and enables deeper networks for improved performance [33,34,35]. Its compact architecture strikes a balance between computational demand and predictive power, making it an exemplary candidate for evaluating AI core acceleration. ResNet-18’s widespread adoption in the DL community and extensive use in prior studies provide a standardized basis for comparison. While additional models could offer further insight, the use of ResNet-18 ensures a controlled and consistent evaluation of our system’s efficacy and scalability.
To efficiently manage deep neural network workloads across the FPGA cluster, our software stack integrates a scheduling strategy that uses OpenMPI and TVM. OpenMPI provides scatter-gather techniques to ensure balanced data exchange among FPGA nodes, while TVM compiles CNN graphs into an optimized intermediate representation (IR) that enables effective pipelining across multiple FPGA cores. This integrated approach not only overlaps layer execution to reduce latency but also enhances throughput and overall computational efficiency, laying a solid foundation for the dynamic partitioning of machine learning tasks across the cluster.

4. Neural Network Compilation

The process of compiling an NN model and generating an executable output can be divided into several steps. The first step consists of extracting the metadata from the NN graph: tensor shapes, input layout (NCHW or NHWC), weights and biases, and the type of each layer. The metadata are then stored in an intermediate representation (IR) graph. After generating the IR, the compiler applies specific graph transformations and optimizations to the IR depending on the targeted HW accelerator. While the graph transformations for a given HW can be applied to the whole IR graph, in some cases the NN runs on a heterogeneous accelerator where computations are offloaded to different architectures. Once the graph-level transformations for the target hardware are defined, the NN compiler creates a low-level IR from which it generates optimized, device-specific code (e.g., LLVM-based machine code for x86 or ISA-specific microcode for FPGA accelerators) [36,37,38,39,40].
Given the variety of architectures implemented on the FPGA logic, the compilation flow varies. For most of the IPs tested on the FPGA cluster, the compiler follows the workflow stated above; however, in the case of PipeCNN and dedicated AI core architectures, the NN compilation requires a less standardized way of generating an executable that can be integrated into the distributed cluster runtime. For VTA, NVDLA, and the Xilinx DPU, we use Apache TVM as the main compiler frontend, where an IR is generated, followed by layer transformations. After generating the IR graph, only VTA follows the TVM IR lowering to generate microcode kernels for the VTA architecture. For NVDLA and the Xilinx DPU, the IR graph is used as input to their own compiler tools, each of which generates its respective executable binary from the graph’s metadata. The generated output is compatible with each DLA runtime engine. When targeting the Tensil CU, TVM is not used as the main frontend compiler; instead, Tensil AI has its own frontend compiler stack for ingesting NN models, generating its own IR transformations, and handling memory management of the model inside the HW.

4.1. VTA Microcode Generation Using TVM

To run a neural network (NN) graph on various hardware targets, a compiler must extract computations from the dataflow graph (DFG), generate a hardware-agnostic intermediate representation (IR), and then optimize these computations for the chosen architecture. Apache TVM meets this need and is increasingly adopted alongside other graph compilers like multi-level intermediate representation (MLIR) [41]. An end-to-end deployment with TVM starts with a trained NN DFG from frameworks such as Caffe, TensorFlow, PyTorch, ONNX, or Keras. The TVM compiler lowers this DFG into a generalized computation graph called Relay, where optimizations like operator fusion and partial evaluation are performed. Relay’s extensibility allows for customization of graph-level optimizations for specific hardware targets. In TVM, schedules define operator transformations that modify loop computations for performance improvements, and these transformations are stored as IR data for generating low-level code.
TVM internally records the loop structure and other critical information used to produce low-level code via Tensor IR transformation passes. Because scheduling for performance can be tedious and hardware-specific, AutoTVM acts as a search-space engine to automatically generate scheduling primitives. It explores a range of template-based transformations, measuring latency and total TOPS on actual hardware to identify high-performing schedule candidates. Through iterative refinement, the search converges on more effective sequences that improve computation block performance. After finalizing these transformations, the best candidates are stored for future inference and the JIT compilation runtime generates the final low-level code for the targeted hardware.
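As a small, generic example of such a schedule transformation (not one of our VTA templates), the following TVM tensor-expression snippet splits and vectorizes a loop and prints the resulting lowered IR.

```python
import tvm
from tvm import te

# Element-wise addition of two length-1024 vectors.
n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule primitives: split the loop over C and vectorize the inner loop.
s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=32)
s[C].vectorize(xi)

# Inspect the lowered IR that these transformations produce.
print(tvm.lower(s, [A, B, C], simple_mode=True))
```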

4.2. TVM Frontend + NVDLA Runtime

In addition to open-sourcing NVDLA’s RTL, Nvidia released the NN compiler, parser, and runtime engine needed to deploy a fully fledged AI application. The compilation tool library loads a pre-trained Caffe model plus defined compiler parameters such as the type of quantization and the calibration scales generated by TensorRT. The compiler toolchain converts each NN layer into a microcode operator that can be run on NVDLA’s compute modules. The open-source compiler tool has one main limitation on the frontend: it only supports parsing Caffe models. To address this, we combine NVDLA’s compiler with TVM’s frontend: TVM generates an IR in which each layer operator is matched against patterns that can run on one of NVDLA’s compute blocks, and a tensorized graph representation of these compute descriptions is extracted. NVDLA’s compiler toolchain then interprets this representation and dumps the loadable binary consumed by NVDLA’s runtime engine [42].

4.3. Xilinx Vitis AI

The Vitis AI compiler interface parses the quantized neural network model and reconstructs it into the Xilinx Intermediate Representation (XIR), an intermediate format that aligns the model’s structure and operations with the underlying DPU hardware. Serving as a bridge between the model and the DPU IP, the interface enables targeted optimizations—such as enhancing computational efficiency, throughput, and resource utilization—based on the specific DPU configuration. After generating the XIR, the compiler refines the compute graph through optimizations and defines a scheduling strategy to ensure the neural network operates at peak performance with minimal latency.
This scheduling defines the sequence and timing of operations within the neural network. The scheduling is tailored to ensure that DPU resources are efficiently allocated and that computations are performed in an order that minimizes idle time, optimizing overall throughput. Following optimization and scheduling, the compiler tool generates microcode specific to the DPU architecture. This microcode is then serialized into a format that can be easily consumed by the Xilinx Runtime (XRT) engine [43].

4.4. Tensil AI

Tensil AI has its own dedicated compiler stack, from the frontend all the way down to the hardware scheduler and code generation backend. The frontend processes an NN graph and generates a high-level IR (HIR) from the compute layers. This HIR is required by the scheduler to generate a lowered IR (LIR). The frontend also interfaces with the memory manager to generate memory objects for moving memory blocks between the two SDRAM banks and host memory. The backend translates the LIR into a set of instructions for the compute unit and into the SDRAM handshaking between the host and the output accumulators of the systolic array [36].

4.5. PipeCNN

The PipeCNN architecture is made of fully pipelined processing cores, each specialized for a type of operation that is frequent in NN models. Unlike other DLAs run on FPGAs, the RTL is generated from HLS tools, which requires a standardized way of handshaking memory between the compute cores and host memory buffers. For HLS designs, it is common to interface through software via the OpenCL API: the FPGA’s HLS-defined compute kernels communicate with the host CPU via C++ code, and the OpenCL code defines multiple parallel compute units (CUs) as runnable kernel functions that interact with other CUs and host memory through OpenCL shared and global memory buffers.
Running an NN model through the PipeCNN architecture requires defining multiple CU kernels, depending on the NN, and handshaking the input–output tensors via shared memory buffers across the deeply pipelined HLS kernels, thereby improving performance when performing inference on multiple image frames [33].
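PipeCNN’s host code is written in C/C++ against the OpenCL API; purely as an illustration of the same host-side pattern (context, queue, kernel, and buffer handshaking), the sketch below uses the pyopencl bindings with a stand-in ReLU kernel in place of PipeCNN’s CU kernels.

```python
import numpy as np
import pyopencl as cl

# Stand-in ReLU kernel in place of PipeCNN's pipelined CU kernels.
KERNEL_SRC = """
__kernel void relu(__global const float *in, __global float *out) {
    int i = get_global_id(0);
    out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}
"""

ctx = cl.create_some_context()          # selects an available OpenCL device
queue = cl.CommandQueue(ctx)
program = cl.Program(ctx, KERNEL_SRC).build()

x = np.random.randn(4096).astype(np.float32)
y = np.empty_like(x)

mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, y.nbytes)

program.relu(queue, x.shape, None, x_buf, y_buf)   # enqueue the kernel
cl.enqueue_copy(queue, y, y_buf)                   # copy results back to the host
queue.finish()
```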

4.6. Customized Compute Cores

For programming the FPGA with a custom dedicated core, we utilized HLS tools to generate the RTL kernel, applying a similar approach to PipeCNN with the OpenCL API interface. Additionally, we explored integrating this HLS-generated custom IP block into the Vitis AI compiler. This integration was achieved by using the XIR API interface, allowing us to extend the compiler’s functionality through custom plugins. This HLS approach not only enabled us to specify the core’s functionality in high-level languages like C or C++ but also streamlined the translation of these specifications into an effective RTL kernel.
We chose the OpenCL API interface to interact with and control our FPGA hardware; it provides a standard way to communicate with the FPGA and invoke the custom core’s functionality, similar to how PipeCNN is managed. As noted above, integrating the HLS-generated custom IP block with the Vitis AI compiler through the XIR (Xilinx Intermediate Representation) API gave us the ability to create custom plugins and extensions to the toolchain, including custom optimizations, schedulers, and code generation strategies tailored to our specific hardware and AI workload requirements.
After the NN is compiled, the allocation of hardware resources must be completed in software using a Message Passing Interface (MPI). The optimization of scheduling is left to the user’s tuning abilities, as it is not yet a fully automated process. This is not an uncommon approach for partitioning intensive computational tasks among FPGA resources, as seen in [44].

5. Deep Neural Network Scheduling Across the FPGA Cluster

By using OpenMPI’s scatter-gather techniques and TVM’s IR-based CNN pipelining, our scheduling method efficiently distributes DNN compute graphs across the FPGA cluster, reducing latency through overlapping layer execution. A key element of our approach is the strategic allocation of FPGA nodes to the computational bottlenecks of the NN: by dedicating specific nodes to the most compute-intensive layers, we establish a dynamic producer–consumer model that ensures immediate consumption of produced data, streamlining the flow of intermediate results. Several ‘consumer’ nodes are poised to process the data as soon as they are produced by the preceding ‘producer’ layer, ensuring a more streamlined and efficient flow of data through the network.
In our research, we analyzed four distinct DNN workloads, each tailored to assess specific aspects of FPGA-based deep learning systems: Pipeline Scheduling, Scatter-Gather Scheduling, AI Core Assignment, and Fused Scheduling.

5.1. Pipeline Scheduling

Pipeline Scheduling is a workload that evaluates the sequential processing capabilities of our system. This method involves the orderly and simultaneous execution of tasks across multiple FPGA cores, mirroring the layered structure of neural networks. Pipelining a model consists of executing segments of an NN model such that each segmented NN block can be allocated to independent or shared HW resources, distributing the overall workload. With the standard scheduling approach, the NN model is executed on one HW device, and overall throughput is restricted because inference runs on one input at a time, independent of batch size; the next input is not accepted until the current input has been processed by the entire network. By segmenting the neural network model, we can start feeding the next input to each segment whenever the consumer is free, and the single-input bottleneck is removed. All segments of the NN graph process input data from the output of the previous segment with few idle tasks, increasing overall utilization of the available HW.
Figure 8 depicts a pipelining process for DNN workloads across a sequence of hardware modules, labeled from mod0 to mod4. Each module is allocated with a certain amount of FPGA hardware and is assigned a part of the neural network workload in sequence. Data enter mod0 and are processed sequentially through the modules, with each outputting to the next: mod0’s output, data_n_0, becomes mod1’s input, and so on, culminating in the final output from mod4. This illustrates a segmented execution of an NN model, where each segment/module handles a portion of the network, allowing the next input to start processing as soon as the preceding module is free. This effectively removes single-input bottlenecks, enhancing hardware utilization and increasing overall throughput by maintaining most segments in active processing mode, thus optimizing the available hardware resources. FPGAs can be assigned per module to divide NN computational tasks among the FPGAs as necessary. By leveraging inter-module buffering and low-latency data transfers, pipeline scheduling minimizes communication overhead and ensures seamless propagation of intermediate results, maximizing throughput across the system.
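A minimal sketch of this segmented, producer–consumer flow is shown below using mpi4py point-to-point messages, where each MPI rank stands in for one module (mod0 to mod4). The frame count, tensor shape, and the run_segment placeholder are illustrative; the real system invokes a DLA runtime at that point.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

NUM_FRAMES = 16
SHAPE = (224, 224, 3)

def run_segment(segment_rank, tensor):
    # Placeholder for the NN segment mapped to this node (a DLA runtime call).
    return tensor

for i in range(NUM_FRAMES):
    if rank == 0:
        data = np.random.rand(*SHAPE).astype(np.float32)   # stand-in for frame i
    else:
        data = np.empty(SHAPE, dtype=np.float32)
        comm.Recv(data, source=rank - 1, tag=i)            # wait for the producer
    out = run_segment(rank, data)
    if rank < size - 1:
        comm.Send(out, dest=rank + 1, tag=i)               # feed the consumer
    # The last rank holds the final output for frame i.
```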

5.2. Scatter-Gather Scheduling

Scatter-Gather Scheduling investigates the system’s ability to efficiently distribute and collect data across our FPGA cluster. This workload demonstrates how data can be segmented and processed in parallel before being aggregated, a method that significantly enhances computational efficiency. When feeding a video frame-based input to the distributed system, the first batch of input frames are scattered across a selected number of FPGA channels. Each of these FPGA channels consists of one or more sequential FPGAs, depending on the bifurcation configuration of the cluster. All cluster bifurcations start with a scatter operation to distribute the data across each channel and end up gathering all the outputs and storing them into an ordered batch.
Figure 9 describes the Scatter-Gather workload in a block diagram. A series of input frames, labeled from ‘frame_0’ to ‘frame_n’, enter the system and are directed towards ‘module 0’ (mod_0). The process flows sequentially through ‘module 1’ (mod_1) and finally to ‘module 2’ (mod_2), where the outputs, ranging from ‘out_0’ to ‘out_n’, are generated. This configuration shows that data are processed in a stepwise manner, passed along to the next module without waiting for the entire batch to be processed at once. This approach can lead to efficiency gains, as it minimizes the waiting time between stages and ensures that all parts of the system work simultaneously on different parts of the task.
The points where Scatter-Gather operations happen can be at both ends of a DFG or can start and end at any point in between; multiple scatter-gather operations can also occur across the NN graph. Each channel can be composed of one or more FPGAs in series, where each one performs a given workload until the gathering phase. Both the scatter and gather phases require a sending and receiving buffer instantiation in the master node’s host memory [45].
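The following mpi4py sketch illustrates the basic scatter-gather pattern described above: the master (rank 0) scatters one chunk of frames per FPGA channel and gathers the ordered outputs. Chunk sizes and the placeholder computation are illustrative.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

FRAMES_PER_NODE = 4
CHUNK = (FRAMES_PER_NODE, 224 * 224 * 3)   # flattened frames per FPGA channel

if rank == 0:
    batch = np.random.rand(size, *CHUNK).astype(np.float32)   # one chunk per node
else:
    batch = None

local = np.empty(CHUNK, dtype=np.float32)
comm.Scatter(batch, local, root=0)          # distribute chunks across the channels

local_out = local * 2.0                     # placeholder for on-FPGA inference

gathered = np.empty((size, *CHUNK), dtype=np.float32) if rank == 0 else None
comm.Gather(local_out, gathered, root=0)    # collect the ordered output batch
```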

5.3. AI Core Assignment

AI Core Assignment focuses on the dynamic allocation of FPGA resources. This workload tests the system’s capability to intelligently assign specific tasks to FPGA cores based on their computational requirements and current load, ensuring optimal utilization of the hardware. A computational graph will always have certain nodes/layers which will determine the maximum performance our NN can achieve over a certain HW accelerator. According to Amdahl’s law [46,47], overall latency will be impacted by the most heavily compute-intensive workload across the graph independent of the latency performance obtained on other layers. Therefore, it is in our best interest to maximize overall performance for that workload. One way to achieve this is by assigning more compute resources to the bottleneck workload, increasing the number of consumer nodes for the given task, and minimizing the graph latency. It is important to keep track of the subsequent computations deployed on each assigned hardware so that tensors can be gathered and maintain the same order as they arrived. Node indexing informs the order of the data and determines the unvaried output for a fully pipelined inference task. A block diagram of the AI Core Assignment scheduling process can be found in Figure 10.
As previously mentioned, the order of computing resources is important; therefore, the inference sequence will be influenced by how the bottleneck is reordered. The compute sequence is stored in temporary buffers allocated in the master node’s memory. This sequence of data holds the necessary parameters (weights, biases, and input data from the previous layer) for performing the inference. When a compute queue becomes ready to consume, the memory buffer associated with the corresponding queue is allocated onto the LPDDR of one of the free FPGA nodes. After the computation, the output is sent back to the master host memory via a TCP transaction and stored in a buffer queue that maintains the indices of the output queues, so that the next sequence of computations can be scheduled in the correct order and follow the DFG path. A special case arises when the NN graph contains a skip connection: either the output of the corresponding skip connection is stored in the DRAM of a target FPGA node, or the master host memory temporarily stores it for the subsequent layers that operate on the skip-layer output.
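A simplified master-side sketch of this dispatch-and-reorder logic is given below; node names, the task list, and run_on_node() are placeholders for the actual buffer allocation and TCP transactions.

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

# Consumer nodes dedicated to the bottleneck layer.
NODES = ["fpga-node-0", "fpga-node-1", "fpga-node-2"]
free_nodes = Queue()
for node in NODES:
    free_nodes.put(node)

def run_on_node(idx, tensor):
    node = free_nodes.get()                      # block until a node is free
    try:
        # Placeholder for copying the buffer to the node's LPDDR, running the
        # layer, and reading the output back over TCP.
        return idx, f"conv_out({tensor})@{node}"
    finally:
        free_nodes.put(node)                     # release the node

tasks = [(i, f"tensor_{i}") for i in range(8)]   # buffered (index, tensor) queue
with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    results = dict(pool.map(lambda t: run_on_node(*t), tasks))

ordered_outputs = [results[i] for i in sorted(results)]   # restore DFG order
```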

5.4. Fused Scheduling

Fused Scheduling combines Scatter-Gather and Pipeline Scheduling to manage complex, concurrent workloads. In our proposed pipelined scheduling, the main focus was to increase HW utilization by making all segments available to compute as soon as they free up. However, obtaining the maximum benefit from pipeline execution requires that none of the subgraphs constituting the neural network graph becomes a limiting factor due to the time required to perform its corresponding workload.
To address this challenge, our approach focused on adding more compute units, specifically to the highest-demanding segment, a refinement of the previous AI core assignment method rather than replicating it directly. As shown in Figure 11, by combining pipelined scheduling together with compute core assignment methodologies, the utilization increases, and the intensive compute tasks are distributed whenever an assigned core is free, reducing the NN bottleneck and performing computations continually across the NN subgraphs. A key aspect of this approach is avoiding bottlenecks within the neural network graph, where certain subgraphs with high computational demands could slow down the entire processing pipeline. This adaptive resource allocation is crucial for maintaining a balanced load across the FPGA cluster, thereby enhancing overall processing efficiency.
In addition to mitigating bottlenecks, fused scheduling leverages dynamic reallocation of idle compute units to maximize overall system throughput. This strategy enables cores to process lower-priority tasks when high-priority workloads are delayed, further balancing the load across the cluster. Furthermore, by integrating fine-grained data synchronization mechanisms, fused scheduling ensures that intermediate results from scatter-gather operations align seamlessly with pipelined tasks, minimizing latency due to interdependencies. This combination of task prioritization, resource reallocation, and data synchronization ensures that fused scheduling fully exploits the parallelism and modularity of individual and interconnected FPGA architectures, enabling efficient execution of complex neural network workloads across heterogeneous compute stacks/clusters.

6. Results: Comparison Across DLA and Configurations

In this section, we will discuss the experiments that were performed using the FPGA cluster. We will present comparisons, benchmarks, and optimizations that were carried out for configurable accelerators. We will also discuss how the graph was optimized for a target accelerator, the DLA configuration parameters that were chosen, and how these configuration parameters affected the overall performance of the cluster.
To compare various strategies for the different types of DLAs, we examined the ResNet-18 NN architecture. ResNet-18 was chosen for its relatively low network depth, which allowed us to test how well the hardware handles deep learning NNs: it is complex enough to represent real-world tasks but not so resource-intensive that it becomes impractical for edge or embedded applications. ResNet architectures are also widely adopted and well studied [48,49,50], making them a de facto standard for performance evaluation in the deep learning community.
As previously mentioned, two types of clusters are tested: a high-performance edge cluster consisting of five Zynq UltraScale+ MPSoCs and a low-power compute-lite cluster made of twelve Zynq-7000 SoCs. Each cluster evaluates the FPGA DLAs using different design configurations. The applicable configurations come from global parameters that define performance capabilities in terms of total on-chip SRAM, clock frequency, and total MACs while maintaining the overall architecture for all DLA cuts. In addition, the cluster is tested following the scheduling strategies for distributed workload inference introduced previously. Several metrics are gathered from the FPGA cluster, including but not limited to total TOPS, logic utilization, inferences per second (IPS), and power estimation. In the following subsections, the importance of optimizing the scheduling methods is discussed.

6.1. VTA Cluster Implementation

The RTL design was generated from Vivado HLS C++-defined modules with configuration parameters. Instruction fetch, load, compute, and store are the main modules, each with pragmas defining the port interface configurations of the generated RTL. The instruction fetch module had AXI stream interfaces to communicate with the compute, load, and store modules, generating FIFO interfaces to achieve full synchronization between modules. Additionally, an AXI memory-mapped interface to external DRAM provided DMA transactions for fast input and weight parameter memory transfers. The control and status registers of VTA were programmed via an AXI-LITE slave port, which allowed the read–write control register to initialize the fetch module, the instruction count register to monitor the number of executable instructions, and the instruction register to hold the DRAM address of a given instruction.
For the load and store modules, two new AXI stream queues were opened for each module, directly interfacing with the compute (GEMM and ALU) instance. Whenever a load or store instruction was processed, we performed a 2D tensor DMA read/write transaction between off-chip DRAM and on-chip SRAM (synthesized as BRAM in the FPGA’s fabric). BRAM port definitions served as memory buffers for handling input/output and weight transfers as well as a micro-op cache for storing the microcode ISA to perform on the DLA. As for improving performance on VTA compute modules, ALU and GEMM cores used the pipeline II (Initiation Interval) pragma to reduce the overall clock cycle count between each loop iteration in the compute block definition. In this case, the GEMM core generated an RTL with an aggressive strategy of II = 1, whereas ALU operator RTL generation was targeted to achieve II = 2.

6.1.1. VTA Configuration Parameters

The VTA configuration parameters involved mainly the on-chip memory buffer sizes (input, weight, and accumulator) and tensor data types, including the matrix-multiply quantization and the data types of the inputs and parameters used for computation inside the GEMM block. Independent of the targeted FPGA platform, the initial hardware parameters were set according to Table 2.
The initial clock frequency was set to 100 MHz since, given the VTA configuration, meeting timing on the Zynq-7000 SoC was the limiting factor. For the UltraScale+ devices, however, we increased the clock frequency up to 300 MHz for the proposed VTA parameters. Table 3 shows the FPGA resource utilization of the VTA configuration.
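For reference, VTA hardware parameters are expressed as a JSON-style configuration in the public TVM/VTA distribution; the dictionary below shows illustrative values in that format. The exact parameters used in our experiments are those listed in Table 2, and Table 3 lists the resulting resource utilization.

```python
# Illustrative VTA hardware parameters in the style of TVM's vta_config.json;
# the exact values used in our experiments are those listed in Table 2.
vta_config = {
    "TARGET": "pynq",          # Zynq-7000 boards; UltraScale+ boards use their own target
    "HW_VER": "0.0.2",
    "LOG_INP_WIDTH": 3,        # 2^3 = 8-bit input activations
    "LOG_WGT_WIDTH": 3,        # 8-bit weights
    "LOG_ACC_WIDTH": 5,        # 32-bit accumulator
    "LOG_BATCH": 0,            # GEMM batch dimension of 1
    "LOG_BLOCK": 4,            # 16x16 GEMM tensor intrinsic
    "LOG_UOP_BUFF_SIZE": 15,   # 32 KB micro-op cache
    "LOG_INP_BUFF_SIZE": 15,   # 32 KB input buffer
    "LOG_WGT_BUFF_SIZE": 18,   # 256 KB weight buffer
    "LOG_ACC_BUFF_SIZE": 17,   # 128 KB accumulator buffer
}
```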

6.1.2. TVM ResNet-18 Compilation

VTA GEMM and ALU operations were performed with low-precision INT8; therefore, the targeted network (in this case ResNet-18) needed either to be trained with a quantization-aware technique or to have its layer operators quantized after training with floating-point precision. The layers supported by the DLA intrinsics were quantized, and those not supported were offloaded to the CPU as lower-precision (INT8) or higher-precision (FP16/32) arithmetic.
To gather benchmark metrics for the proposed network, the model was quantized to INT8 precision beforehand. When the model was fed through the TVM frontend, all layer operators were stored in a Relay IR. The DFG representation had enough information to start optimizing graph operators for the targeted DLA. Looking at the Relay IR, four types of graph operators were identified: injective (e.g., add), reduction (e.g., sum), operators suitable for fused optimizations (e.g., conv2d), and operators that cannot be fused (e.g., gather).
Based on these types of graph operators, the TVM backend searched for layer patterns in the Relay IR specified by the compiler tools for the given HW target, in this case VTA. For TVM’s compilation, the ResNet was quantized from FP32 to INT8. The ResNet-18 architecture is characterized by a skip-connection pattern, or ResBlock (Residual Block), applied across the DAG to reduce degradation due to gradient diffusion. The network had the following layer operators: 2D convolution, Add, ReLU, MaxPool, GlobalAveragePool, fully connected (FC), and SoftMax.
Once the model was fully quantized, TVM's Relay graph inferred the INT8 type on the layer operators, storing tensor parameters as a 32-bit data type. Following layer quantization, the Relay graph sequentially went through a series of transformation passes. These optimization passes fused layer nodes into single computations, removed layers unnecessary for inference, folded constants, and canonicalized complex expressions into basic ones.
The next step in the compiler toolchain was to lower the Relay graph into a tensor-level IR, where each computation primitive was assigned a schedule strategy to generate the lowered microcode that evaluates the function primitive on the DLA hardware intrinsics. To support MaxPool and GlobalAveragePool, we created schedule functions for these operators to be computed on the ALU block. New scheduling strategies had to be registered with the Tensor Operation Inventory (TOPI) API so that, when lowering an operation, the compiler knew the set of primitive functions with which to schedule that operation for the targeted device. Add and ReLU operators were processed on the ALU, while 2D convolution and Dense layers were performed on the GEMM block. At the end of the network, the flattened output from the FC layer was evaluated with SoftMax, which was offloaded to the CPU. Once all operators across the graph were lowered into instructions for the target device, the last step was to generate a runtime module that stored the lowered passes of each node operator as packed functions.
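The compilation flow described above can be condensed into a short TVM Python script. The following is a hedged sketch of the main steps (quantization, graph packing for VTA, and building the runtime module); function names follow the public TVM/VTA tutorials, but argument lists vary between TVM releases, and the model-loading step shown here (Relay's built-in ResNet-18 workload) is only one possible frontend.

```python
# Hedged sketch of the Relay-to-VTA compilation flow described above, based on the
# public TVM/VTA tutorials; exact argument lists vary between TVM releases.
import tvm
from tvm import relay
from tvm.relay import testing
import vta
from vta.top import graph_pack

env = vta.get_env()

# 1. Obtain a ResNet-18 Relay module (any supported frontend could be used instead).
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=env.BATCH)

with tvm.transform.PassContext(opt_level=3, disabled_pass={"AlterOpLayout"}):
    # 2. Quantize to INT8 so conv2d/dense can map onto the GEMM intrinsic.
    with relay.quantize.qconfig(global_scale=8.0, skip_conv_layers=[0]):
        mod = relay.quantize.quantize(mod, params=params)

    # 3. Pack the graph into VTA's tensorized layout (start/stop names are tutorial defaults).
    relay_prog = graph_pack(mod["main"], env.BATCH, env.BLOCK_OUT, env.WGT_WIDTH,
                            start_name="nn.max_pool2d", stop_name="nn.global_avg_pool2d")

    # 4. Lower and build: supported ops target VTA, the rest falls back to the ARM CPU.
    with vta.build_config(opt_level=3, disabled_pass={"AlterOpLayout"}):
        lib = relay.build(relay_prog, target=env.target, params=params,
                          target_host=env.target_host)
```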

6.1.3. Auto-Tuning from Schedule Templates (AutoTVM)

Defining compute schedules requires non-trivial knowledge of both hardware and algorithmic optimizations to achieve the lowest latency, as generating the schedule that gives the best performance is a difficult task. The TVM compiler integrates AutoTVM to generate more optimized kernels for a set of computations given a template schedule. This was achieved by creating a search space based on the parameters defined by the scheduling template. The AutoTVM engine carried out reward-based policy exploration by comparing candidate schedules against the best-performing one in terms of latency. The best candidate was then selected as the target schedule for lowering the operation. For ResNet-18, schedule templates were created for the Conv2D and Dense operators by registering the Tensor Expression schedules with the AutoTVM engine.
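As an illustration, the AutoTVM search described above roughly follows the pattern below, continuing from the compilation sketch in the previous subsection: tasks are extracted from the Relay program for the tunable operators (conv2d and dense), each task is explored by a tuner, and the best measured configurations are written to a log that is applied at build time. This is a hedged sketch; the RPC tracker host, port, and device key are placeholders for our cluster setup.

```python
# Hedged sketch of AutoTVM schedule exploration for the VTA conv2d/dense templates.
# The RPC tracker address and device key ("pynq") are placeholders for the cluster setup.
from tvm import autotvm

tasks = autotvm.task.extract_from_program(mod["main"], params=params,
                                          target=env.target, target_host=env.target_host)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.RPCRunner("pynq", host="0.0.0.0", port=9190, number=5, timeout=60))

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task, loss_type="rank")
    tuner.tune(n_trial=min(1000, len(task.config_space)),
               measure_option=measure_option,
               callbacks=[autotvm.callback.log_to_file("vta_resnet18.log")])

# Apply the best schedules found during exploration when rebuilding the graph.
with autotvm.apply_history_best("vta_resnet18.log"):
    pass  # relay.build(...) as in the compilation sketch above
```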

6.1.4. Evaluation on Hardware Implementations

The first cluster test on the VTA platform was performed with the parameters and clock frequency specifications mentioned at the beginning of this section. For this VTA configuration, it was possible to generate a bitstream without any timing violations or node overlaps for both the Zynq-7000 and UltraScale+ platforms.
Figure 12 shows the inference time needed to process one image through the ResNet-18 graph on the compute-lite Zynq-7000 FPGA cluster. The VTA model was trained with an input shape of (N, 224, 224, 3) = (batch size/input samples, height, width, channels/colors), and no input resizing was performed for the inference step. The values obtained were categorized by the number of compute resources used and the cluster strategy used for distributing the NN workloads. The inference time results were obtained by performing 10 evaluations on 10,000 random test images extracted from the ImageNet test dataset. For each evaluation, the average inference time was calculated, and these averages were then averaged again over the 10 evaluations. After each full run, the evaluation was verified to ensure that no data discrepancy or deviation from the expected time interval appeared.
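For clarity, the measurement procedure can be summarized by the sketch below, where run_inference() stands in for whichever DLA runtime executes the packed ResNet-18 module; it is only an illustration of the averaging scheme, not the exact benchmarking harness.

```python
# Illustration of the averaging scheme used to report inference time.
# run_inference() is a placeholder for the DLA-specific runtime call.
import time
import random

def run_inference(image):
    pass  # placeholder: execute ResNet-18 on the DLA/cluster for one image

def evaluate(images, n_evals=10, samples_per_eval=10_000):
    eval_means = []
    for _ in range(n_evals):
        batch = random.sample(images, samples_per_eval)
        times = []
        for img in batch:
            start = time.perf_counter()
            run_inference(img)
            times.append(time.perf_counter() - start)
        eval_means.append(sum(times) / len(times))    # mean latency of this evaluation
    return sum(eval_means) / len(eval_means)          # mean of the 10 evaluation means
```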
Running an inference on just one FPGA, we achieved 27.34 ms after generating the optimized micro-kernel from AutoTVM schedule exploration. For all cluster strategies, as we increased the number of FPGA resources, the workload became more distributed and the expected inference time for each input image was therefore reduced. Figure 12 shows, however, that this is not always the case and that the latency reduction is not linear but rather decays exponentially as more FPGA devices are added to the cluster, suggesting convergence toward inference times below 5 ms. Among the four strategies, distributing the bottleneck operators (those requiring the most computation) across more FPGAs became more effective as the number of FPGAs in the cluster increased.
The same was performed for the UltraScale+ FPGA cluster, as seen in Figure 13. It is important to note that this type of DLA strategy negatively affects latency when the number of FPGA nodes is 2 or 3 for the UltraScale+ cluster. The main factors causing this performance loss are network bandwidth and processor involvement in sending packet streams of data between two or more FPGA devices. The distributed cluster was tested over RJ-45 connections of up to 1 Gb/s; in addition, the FPGA's CPU had to DMA the data buffer out of the FPGA logic and send it through the network to the next cluster node, which translates into considerable CPU handling overhead. Additionally, buffers are sent as blocking MPI messages, which also impacts the overall node message-passing handshake.
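To make the communication pattern concrete, the sketch below shows how a blocking Scatter-Gather round looks with mpi4py, which is representative of the message-passing layer used between FPGA nodes. The local run_batch() call is a placeholder for each node's DLA runtime; the point of the sketch is the blocking scatter/gather pair whose overhead is discussed above.

```python
# Hedged sketch of the blocking Scatter-Gather distribution across FPGA nodes (mpi4py).
# run_batch() is a placeholder for the node-local DLA inference call.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def run_batch(images):
    return [np.zeros(1000, dtype=np.int8) for _ in images]  # placeholder result per image

if rank == 0:
    dataset = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(10_000)]
    chunks = [dataset[i::size] for i in range(size)]         # one slice per node
else:
    chunks = None

local_images = comm.scatter(chunks, root=0)    # blocking: every rank waits here
local_outputs = run_batch(local_images)
gathered = comm.gather(local_outputs, root=0)  # blocking: root collects all results

if rank == 0:
    print("collected results from", len(gathered), "nodes")
```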
To mitigate the latency issues caused by network bandwidth and processor overhead in a multi-FPGA UltraScale+ cluster, we considered implementing a direct communication link between FPGAs. If the standard network protocols over RJ-45 connections become the bottleneck, we could instead use the high-speed serial transceivers for direct FPGA-to-FPGA communication through channels such as the Aurora IP core [51] or PCIe. This approach would enable communication without involving the embedded CPU cores housed in the Zynq UltraScale+ MPSoC Processing System IP, minimizing the data handling overhead on the CPU and avoiding the network congestion that occurs with Ethernet-based communication, specifically when blocking MPI calls are involved. With direct FPGA links, we could also implement non-blocking communication strategies and potentially use DMA engines within the FPGA fabric to manage data transfers more efficiently and with less overhead. This would also be extremely useful in the future for automated scheduling.
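Even before moving to dedicated links, replacing the blocking calls with non-blocking ones on the software side already overlaps communication with the DLA's computation. A minimal mpi4py sketch of that pattern is shown below; the tensor shape and the compute_next_segment() call are placeholders.

```python
# Minimal sketch of overlapping DLA compute with non-blocking MPI transfers.
# Shapes and compute_next_segment() are placeholders.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def compute_next_segment():
    return np.zeros((1, 512, 7, 7), dtype=np.float32)  # placeholder intermediate tensor

recv_buf = np.empty((1, 512, 7, 7), dtype=np.float32)
req_recv = comm.Irecv(recv_buf, source=(rank - 1) % size)   # post the receive early

result = compute_next_segment()                              # keep the DLA busy meanwhile

req_send = comm.Isend(result, dest=(rank + 1) % size)        # overlap send with local work
MPI.Request.Waitall([req_recv, req_send])                    # synchronize once both complete
```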
When testing the UltraScale+ FPGA cluster, the results obtained for the first VTA configuration improved by around 6% compared to the Zynq-7000 cluster. However, only five UltraScale+ boards were available, so comparisons were limited to five UltraScale+ devices. A five-node cluster reduced the possibility of achieving higher throughput when assigning dedicated compute operations to a set of rank nodes across the network. When comparing the Zynq-7000 and UltraScale+ cluster systems side by side with the same VTA configuration parameters, it is clear that the quad-core ARM Cortex-A53 gives the UltraScale+ an advantage when sending data buffers through MPI calls with our current setup. Unlike the Zynq-7000 devices, the Zynq UltraScale+ was able to support larger architecture parameters for the VTA blocks without timing violations in the RTL design.
For the same configuration parameters as Table 2, we increased the UltraScale+ FPGA cluster’s clock frequency to 350 MHz. The results can be found in Figure 14. Though the graph looks similar to Figure 13, it is clear that an increase in clock frequency brings down the average scheduling approach time.
For this case, it was not necessary to recompile the NN graph and generate new scheduling tasks, as the core architecture remained the same and the defined HW intrinsic parameters were unaltered. For another inference case, we increased the GEMM block size to 32, the micro-op cache and input buffers to 64 Kb each, the weight buffer to 512 Kb, and the accumulator buffer to 256 Kb. The data types remained the same to maintain the same quantization procedure during graph compilation. The clock frequency was kept at 200 MHz to avoid negative hold slack. The adjusted parameters can be seen in Table 4, and the results in Figure 15.
Most of the logic utilization increases were in the form of BRAM because all related cache memory buffers were increased to double the size. For this experiment, the process of compiling and lowering ResNet-18 operators needed to be redone as the overall VTA architecture changed, including the AutoTVM scheduling exploration to generate a new micro-kernel for the custom VTA configuration.
For this setup, the overall latency was reduced for all strategies and followed the same pattern, in which AI Core Assignment experienced a higher delay when the node count was low and Fused Scheduling performance fell between Scatter-Gather and Pipeline Scheduling.

6.2. NVDLA Implementation

The NVDLA IP interacted with a management processor, which sent the layer configuration to be deployed on one of the HW layer blocks, together with the execution command for each layer, through the configuration space bus (CSB) interface. For the hardware used here, the management processor was the PS CPU of the FPGA. Multiple hardware layer commands could be issued if there was no data dependency between them, and the dual configuration buffer available on each HW layer block made pipelined computation possible, since an additional layer configuration could be stored while another was in the processing phase. Once a layer block finished processing, it sent an interrupt request (IRQ) to the PS to register the layer module as finished. The PS then began handling the next layer operator with the same command–execute–interrupt steps.
Being a fully parameterizable IP, the main architectural components of NVDLA can be customized depending on the specific application. The main logic blocks of NVDLA’s IP are the layer operator blocks, configuration blocks, and memory interface blocks.
Among the layer operator, configuration, and memory interface blocks, only the memory interface and layer operator blocks were affected by changing configuration parameters. While the configuration interface is represented as a Configuration Space Bus (CSB), the NVDLA configuration registers were accessed through the Advanced Peripheral Bus (APB) interface. The APB interface is 32 bits wide, which required an additional APB-to-AXI-Lite interface conversion so that the Zynq PS could correctly map the NVDLA address registers. For the memory interface, NVDLA had a Data Backbone Interface (DBBIF) in the form of an AXI4 memory-mapped interface to access external memory such as DRAM, either bypassing the DMA engine or using NVDLA's own DMA engine to load inputs and parameters from DRAM. This path incurs an overall increase in latency due to the need to access high-latency external DRAM. FPGA devices with enough logic can additionally attach a second memory block to the NVDLA IP as SRAM; this small custom memory works as a cache to reduce NVDLA's latency. For the targeted devices, no additional SRAM was added to NVDLA's IP core, as most of the logic was dedicated to the compute tile to increase the overall number of MAC (multiply-accumulate) units.
Three core configurations were evaluated on the FPGA cluster. The main difference resided in the size of the compute tile array; as a result, the memory width and depth were also altered to accommodate the new tile distribution. The total number of MAC operations was determined by the number of operations that can be performed in parallel across the input channel and output channel dimensions, represented by the Atomic-C and Atomic-K sizes, respectively. This gave the total number of MAC operations in the compute tile, and based on this parameter, three configurations were selected: NVDLA small with 8 × 8 tiles (64 MAC), medium with 32 × 8 (256 MAC), and big with 32 × 32 (512 MAC). For all configurations, the main clock frequency of the design was set to 100 MHz to facilitate routability.
To avoid timing violations, all clock gating modules were disabled in the RTL design. The convolution buffer memory width needed to match, in bit size, either the input channel size or the Atomic-C parameter. To perform the non-linear activation functions in the NN graph, Single Data Point (SDP) support was added to the DLA's block modules. Additionally, the Planar Data Processor (PDP) block was included as a compute block for spatial operations commonly seen in CNN architectures, such as Max Pool and Average Pool. For multi-plane operations such as local response normalization (LRN), a Cross-Channel Data Processor (CDP) was added to the compute block. For the SDP, PDP, and CDP blocks, the throughput parameter establishes the total number of output features produced in one clock cycle.
The three architecture configurations were tested on the UltraScale+ devices, as LUT resources were too limited on the Zynq-7000 platform. While the NVDLA 64 and 256 MAC cuts could be deployed on both the Kria SOM and ZCU104 devices, the 512 MAC cut could only be implemented on the ZCU102 board due to LUT constraints. We therefore evaluated the NVDLA core with two different cluster configurations: 64 MAC cuts on all UltraScale+ nodes for the compute-lite version, against 256 MAC on the Kria SOM and ZCU104 together with 512 MAC on the ZCU102 board. The target architectures for the NVDLA implementation on the UltraScale+ cluster can be found in Table 5. The resource utilization for each of the three architecture configurations can be found in Table 6.

6.2.1. ResNet-18 Compilation with TVM and NVDLA Runtime Engine

NVDLA's frontend compiler toolchain is limited to parsing Caffe models; therefore, we leveraged TVM's graph IR to make the necessary graph transformations and generate, as codegen output, a .json file containing both the skeleton of the NN architecture and the metadata needed to parse it with the NVDLA compiler, which then created a loadable binary that can be interpreted by NVDLA's runtime engine. No quantization step was performed in TVM's backend optimizations because the NVDLA compiler is responsible for quantizing the trained weights. The NVDLA compiler toolchain performs two steps during quantization: the first is the precision lowering itself, while the second determines how to effectively allocate the weights onto the convolution buffers according to the sequence of hardware layer commands executed during inference. The resulting binary included the sequence of all hardware layer commands, together with the quantized weights for each HW layer block.

6.2.2. NVDLA Inference

The 64 MAC inference time can be found in Figure 16. The combined 256 + 512 MAC inference time can be found in Figure 17. For the 64 MAC configuration, the inference times decrease as the quantity of FPGAs increases, which was expected due to parallel processing capabilities. Specifically, the Scatter-Gather and AI Core Assignment methods showed a pronounced decrease in latency, more than halving as we went from one to five FPGAs. The Pipeline Scheduling and Fused Schedule methods also benefited from the additional FPGAs, although the performance gains tapered off slightly, suggesting that these methods might have been approaching a point of diminishing returns due to factors such as inter-FPGA communication overhead or synchronization challenges.
In contrast, the 256 + 512 MAC configuration presented a different pattern. Initially, with one FPGA, the inference times are significantly lower than those in the 64 MAC setup, indicating the benefit of a larger number of MAC units. However, when moving to two FPGAs, there is an unexpected increase in inference times for all methods. This could indicate that the network and processing overhead for managing a larger number of MACs across multiple FPGAs begins to outweigh the parallel processing benefits. As we added more FPGAs, the inference times decreased again, suggesting that the system could better leverage the parallelism with more nodes, although not to the extent seen with the 64 MAC configuration.
The runtime environment differs from TVM's VTA flow due to the lack of TVM runtime support; the distributed inference was therefore performed locally on each node using NVDLA's runtime engine APIs, together with the necessary User Mode and Kernel Mode Drivers (UMD and KMD). The interconnection between FPGA nodes was established using basic blocking MPI call protocols. The UMD-integrated API includes a basic image processing pipeline for feeding inputs and submitting inference jobs through the KMD, a Linux kernel-mode driver responsible for scheduling the HW layer blocks across the pipeline. For the NVDLA software stack, implementing AI Core Assignment as a distributed runtime strategy posed a more difficult task than for VTA; therefore, this strategy was discarded for this DLA.
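As a rough illustration of how Pipeline Scheduling was layered over the NVDLA runtime, each MPI rank runs its assigned portion of the network locally and forwards the intermediate tensor to the next rank with blocking send/recv calls. The nvdla_run_segment() call below is a placeholder for the UMD API invocation on each node; the sketch only conveys the rank-to-rank handshake.

```python
# Hedged sketch of Pipeline Scheduling over blocking MPI calls.
# nvdla_run_segment() stands in for the node-local NVDLA UMD runtime call.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def nvdla_run_segment(tensor, segment_id):
    return tensor  # placeholder: execute this node's HW layer blocks via the UMD/KMD

n_images = 100
for i in range(n_images):
    if rank == 0:
        x = np.zeros((1, 3, 224, 224), dtype=np.float32)   # new input image
    else:
        x = comm.recv(source=rank - 1, tag=i)               # blocking receive from upstream

    y = nvdla_run_segment(x, segment_id=rank)

    if rank < size - 1:
        comm.send(y, dest=rank + 1, tag=i)                  # blocking send downstream
    # the last rank holds the classification result for image i
```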

6.3. Tensil CU Implementation

6.3.1. NN Compilation Stack

Presently, there are two frontends available for the Tensil AI project: one supporting TensorFlow and the other supporting ONNX. The frontend parses the model, which is represented as a graph, and utilizes one or more output nodes to linearize the graph into a sequence of nodes that respect dataflow dependencies. As part of its processing, the frontend in this system groups model nodes together to form layers, each of which represents a complete cycle that begins with matrix multiplication, followed by a sequence of accumulator operations, and ends with the movement of the result out of the accumulators. It is worth noting that the accumulators and systolic array weights are never shared between layers, ensuring that the content of each layer remains independent and self-contained.
The frontend of this system communicated with the memory manager to acquire the required memory objects. The host could directly access two memory banks, DRAM0 and DRAM1. DRAM0 is reserved by the compiler to store variable data objects (Vars), such as inputs, outputs, and the data passed between layers. In contrast, DRAM1 is designated for various constants (Consts), including matrix multiplication weights and bias, constants used in accumulator operations, and constants used to blend with variable data objects (such as zero-padding). The memory manager was used to allocate and release memory objects.
A memory object is a set of memory addresses (memory span) associated with tensor dimensions. In our system, the scheduler used these dimensions to ensure the dataflow’s accuracy. Additionally, the memory manager kept track of pending constants found in the model nodes. For each layer, the frontend generated a new instance of the scheduler and submitted a sequence of high-level intermediate representation (HIR) operations based on the model nodes presented in that layer. The frontend created special temporary (Temp) memory objects to transfer data between HIR operations within a single layer. Later, the scheduler mapped this temporary memory to the available accumulators. The scheduler transformed the high-level intermediate representation (HIR) generated by the frontend to a low-level intermediate representation (LIR) used by the backend. This transformation is necessary to schedule HIR operations, which are expressed in terms of relatively large Vars, Consts, and unlimited Temp memories, to the limited SRAM local memory and accumulators available in a specific processing unit configuration.
The scheduler accomplished this by constructing a dataflow graph based on memory addresses and determining the max fit partition in local memory and accumulators, which is called a stage. The scheduler then generated LIR for each stage separately, and stages do not share weights in the systolic array or the content of accumulators. The backend translated the LIR into a binary representation contained in “model.tprog” and “model.tmodel” files, which were necessary for the driver to input the program into the processing unit. The instruction layout was computed by the backend based on compiler options, such as memory and SIMD register depth. The backend determines instruction flags by inferring LIR arguments to produce binary instruction form.
The Tensil Compute Unit (TCU) communicated with the PS of the SoC via three interfaces: two for data handling and one for register configuration. The instructions (MatMul, DataMove, and SIMD) were sent directly to the chip via the AXI-Stream interface. The registers include address offsets for the DRAM0 and DRAM1 banks' memory allocation, cache behavior for the AXI4 protocol, a timeout register, a program counter, and a sample interval. For memory interfacing, two AXI4 memory-mapped interfaces were used to handle data transfers on both memory banks. To reduce CPU overhead, an AXI-DMA IP was added between the PS and the Tensil AXI4 interfaces so that both instructions and memory data could be streamed directly to on-chip memory. We were able to apply all four NN scheduling approaches to the Tensil CU DLA, the same as for our VTA configuration.
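On the software side, running the compiled artifacts amounts to a short script using the Tensil PYNQ driver. The sketch below follows the public Tensil tutorials; the module paths (tcu_pynq.driver, tcu_pynq.architecture), the overlay, .tmodel path, and tensor names are assumptions that depend on the board files and compiler output.

```python
# Hedged sketch of running a compiled Tensil model from Python on a PYNQ-enabled board.
# Module, overlay, and file names follow the public Tensil tutorials and are assumptions here.
import numpy as np
from pynq import Overlay
from tcu_pynq.driver import Driver
from tcu_pynq.architecture import pynqz1   # architecture descriptor matching the RTL build

overlay = Overlay("/home/xilinx/tensil_pynqz1.bit")   # bitstream containing the TCU and AXI DMA
tcu = Driver(pynqz1, overlay.axi_dma_0)                # instructions/data streamed over AXI DMA

# Load the compiler artifacts (.tmodel references the generated .tprog/.tdata files).
tcu.load_model("/home/xilinx/resnet18_onnx_pynqz1.tmodel")

img = np.zeros((224, 224, 3), dtype=np.float32)        # placeholder preprocessed input
outputs = tcu.run({"x:0": img.flatten()})              # input tensor name depends on the frontend graph
print(list(outputs.keys()))                            # e.g., the logits tensor produced by the FC layer
```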

6.3.2. Tensil CU Performance Evaluation

The architecture configurations of the Tensil CU DLA on the FPGAs implemented into our Zynq-7000 and UltraScale+ clusters can be found in Table 7. An assessment of the utilization across these FPGAs can be found in Table 8. The inference times of the Tensil CU DLA across our NN approaches can be found in Figure 18. The same inference time data but for the UltraScale+ cluster can be found in Figure 19.

6.4. Xilinx DPU

The DPUCZDX8G is a supported DPU IP block for the Zynq UltraScale+ MPSoC family that mainly utilizes PL resources for its customization. There are various configurations available for the DPU IP block, which can be selected based on the desired performance and logic usage limitations. Increasing the number of DPU cores up to four can enhance performance, and customization can be performed to achieve higher parallelism along three dimensions: pixel parallelism (PP), input channel parallelism (ICP), and output channel parallelism (OCP). For the proposed implementation, DPU core parallelism was prioritized over the number of DPU cores to improve performance. The peak number of operations per cycle is PP × ICP × OCP × 2.
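As a quick check of this formula for the configurations used later (B4096 and B1152), the short sketch below evaluates it; the resulting 4096 operations per cycle at the 300 MHz data-controller clock corresponds to the roughly 1.2 TOPS figure quoted later for a single B4096 core, and 1152 operations per cycle at 200 MHz to the 230 GOPS of the B1152 cut.

```python
# Peak ops/cycle for a DPU core: pixel, input-channel, and output-channel parallelism,
# times 2 because each MAC counts as a multiply plus an add.
def dpu_peak_ops(pp, icp, ocp):
    return pp * icp * ocp * 2

b4096 = dpu_peak_ops(8, 16, 16)        # = 4096 ops/cycle
b1152 = dpu_peak_ops(4, 12, 12)        # = 1152 ops/cycle

print(b4096 * 300e6 / 1e12, "TOPS")    # ~1.23 TOPS for 1x B4096 at 300 MHz
print(b1152 * 200e6 / 1e9, "GOPS")     # ~230 GOPS for 1x B1152 at 200 MHz
```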
The B4096 (8 × 16 × 16) DPU core architecture was chosen for its ability to perform the highest number of operations per clock cycle. To further accelerate the DPU cores, certain operations such as Depthwise convolutions, Elementwise multiply, max pooling, ReLU, and Softmax were computed within the PL.
To improve timing performance, the cascade length of the DSP48 slices was reduced, which requires extra DSP48 slices for deeper pipelining. The DPU IP uses three clock domains: one for the registers, one for the data controller, and one for computation. A 100 MHz clock is recommended for the register configuration module, which is connected through the AXI slave interface. The data controller schedules data transfers with a 300 MHz clock, while a Double Data Rate (DDR) clocking technique is applied to enhance DSP48 performance; to achieve DDR on the DSP slices, the corresponding clock domain is set to double the data controller's clock frequency. To buffer weights, biases, and intermediate features, the DPU uses RAM, which can be BRAM, URAM, or a combination of both. For this implementation, the DPU cores were configured with high RAM usage, as the majority of the FPGA logic was dedicated to improving DPU performance. The Xilinx DPU interfaces include the AXI Interconnect for high-speed connectivity between components, the DPU input and output interfaces for passing data to and from the DPU, the DMA controller for efficient data transfer between the DPU and external memory, the interrupt controller for interrupt handling, and clock and reset signals for synchronization.
Since the DPU processing elements perform computations with INT8 precision, models trained using data precision of FP32 require quantization to achieve higher inference power efficiency. In this context, the Vitis AI Post Training Quantization (PTQ) method employs a cross-layer equalization and AdaQuant layer calibration. The cross-layer equalization technique ensures that the activations across all layers are distributed uniformly in the range of −128 to 127. The AdaQuant layer calibration method is used to optimize the weights of the model, ensuring minimal accuracy loss during quantization.
To perform the Vitis AI PTQ, a small sample of images was extracted from the ImageNet test set for the calibration iterations. The calibration process adjusts the scale factors of the weights and activations, and these optimized scale factors were used for quantization. The resulting quantized model then went through a compilation process using the Vitis AI compiler, which translates the network into a graph intermediate representation (XIR) describing the neural network topology and its corresponding hyperparameters. The compiler also optimizes the network topology, removes redundant operations, and fuses similar operations to reduce computation time.
When some intermediate operation in the DFG (Data Flow Graph) cannot be performed by the DPU, the Vitis AI compiler divides the DFG into multiple XIR subgraphs, enabling the DPU to perform operations in parallel and thus increasing the inference throughput. The Vitis AI Post-Training Quantization method combined with the Vitis AI compiler allows us to deploy high-precision machine learning models on the DPU, achieving higher inference power efficiency while maintaining accuracy.
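For reference, the quantize-then-compile flow described above typically looks like the hedged sketch below when using the Vitis AI PyTorch quantizer followed by the XIR compiler; API and tool names follow the Vitis AI documentation, but argument details, file paths, and the arch.json for the chosen DPU configuration are placeholders that differ per release.

```python
# Hedged sketch of Vitis AI post-training quantization (PyTorch flow) for ResNet-18.
# API names follow the Vitis AI documentation; paths and the calibration set are placeholders.
import torch
from torchvision.models import resnet18
from pytorch_nndct.apis import torch_quantizer

model = resnet18(pretrained=True).eval()
dummy = torch.randn(1, 3, 224, 224)
calibration_batches = []   # placeholder: small sample drawn from the ImageNet test set

# Calibration pass: forward the calibration sample through the wrapped model.
quantizer = torch_quantizer("calib", model, (dummy,))
quant_model = quantizer.quant_model
for batch in calibration_batches:
    quant_model(batch)
quantizer.export_quant_config()

# Deployment pass: export the quantized graph as an .xmodel for the XIR compiler.
quantizer = torch_quantizer("test", model, (dummy,))
quantizer.quant_model(dummy)
quantizer.export_xmodel(output_dir="quantize_result")

# The .xmodel is then compiled for the target DPU (e.g., B4096), typically via the shell:
#   vai_c_xir -x quantize_result/ResNet_int.xmodel -a arch_b4096.json -o compiled -n resnet18
```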
Implementing a Xilinx DPU on a KV260 or ZCU104/ZCU102 board to run ResNet-18 involved several steps. First, the hardware platform must be set up, which involves configuring the FPGA and connecting the DPU to the ARM host processor and DDR memory banks.
In a single-threaded implementation, the DPU processes one image at a time, with the host processor handling input/output and other tasks. This can be suitable for low-latency applications where response time is critical. In a multi-threaded implementation, the DPU processes multiple images in parallel, with the host processor managing the threads and scheduling, enabling maximum throughput from the DLA. The table below shows the performance obtained by each UltraScale+ MPSoC variant running ResNet-18 while fully utilizing multi-threaded CPU inference and running the DPU cores at 300 MHz.
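A hedged sketch of the multi-threaded runtime side is shown below using the VART Python API (xir/vart), where one DPU runner per thread processes its own slice of images; subgraph selection follows the Vitis AI examples, while the .xmodel path, batch slicing, and tensor shapes are placeholders that must be adjusted for a specific compiled model.

```python
# Hedged sketch of multi-threaded DPU inference with the VART Python API.
# Subgraph handling follows the Vitis AI examples; shapes and paths are placeholders.
import threading
import numpy as np
import xir
import vart

graph = xir.Graph.deserialize("compiled/resnet18.xmodel")
dpu_subgraphs = [s for s in graph.get_root_subgraph().toposort_child_subgraph()
                 if s.has_attr("device") and s.get_attr("device").upper() == "DPU"]

def worker(images):
    runner = vart.Runner.create_runner(dpu_subgraphs[0], "run")   # one runner per thread
    in_t = runner.get_input_tensors()[0]
    out_t = runner.get_output_tensors()[0]
    for img in images:
        inp = [np.asarray(img, dtype=np.int8).reshape(tuple(in_t.dims))]
        out = [np.empty(tuple(out_t.dims), dtype=np.int8)]
        job_id = runner.execute_async(inp, out)
        runner.wait(job_id)

# Split a placeholder batch of images across four host threads.
batches = [list(np.zeros((25, 224, 224, 3), dtype=np.int8)) for _ in range(4)]
threads = [threading.Thread(target=worker, args=(b,)) for b in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()
```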
The high-compute cluster is made up of three FPGA board types (3x KV260, 1x ZCU104, 1x ZCU102); therefore, to measure the distributed inference performance, we scaled the cluster starting from the best possible configuration. The cluster started with the ZCU102, followed by the ZCU104, and the rest of the scaling consisted of adding the three KV260 boards. The FPGA resource utilization for the FPGAs used to assess the Xilinx DPU on the Zynq-7000 FPGA cluster can be found in Table 9, and the resource utilization for our UltraScale+ cluster can be found in Table 10.
The performance results follow a pattern similar to the previous DLA cluster evaluations. Since we started with a 3xB4096 DPU (ZCU102), adding devices with fewer DPU cores to the cluster does not improve performance but instead increases overall inference time. To outperform the three-core DPU, the cluster needed to add more than three devices, each with a single-core DPU. Since the Scatter-Gather approach requires image indexing to rearrange the output order of the input images, it affected overall real-time processing in the context of a vision processing pipeline. However, Pipeline Scheduling becomes beneficial when a scalable cluster is used for a real-time video processing application.
The performance of the DPU B4096 configuration was only evaluated on the UltraScale+ devices, as the Zynq-7020 chips have too few resources for that DPU configuration. For the Zynq-7020, the DPU was therefore reduced to a single-core B1152 (4 × 12 × 12) configuration, which lowers the peak performance from 1.2 TOPS (1xB4096 @300 MHz) to 230 GOPS (1xB1152 @200 MHz). The software runtime for these FPGA models has to be the legacy version of Vitis AI, called DNNDK, because the current Xilinx Runtime (XRT) drivers only support the MPSoC architecture and Alveo-type FPGAs.
This DPU cut managed to close timing on the Zynq-7020 with a 200 MHz clock for the data controller path. Compared to the B4096 design, running ResNet-18 on the B1152 worsened the overall inference time. Additionally, since the B1152 is deployed on the Zynq-7020, the FPGA coprocessor (ARM Cortex-A9) is not as powerful as the one available in the UltraScale+ MPSoC family (ARM Cortex-A53). This affects the data handshake transfers via Ethernet, where DMA was not available. The DMA transaction time between the DDR banks and the PL was also affected by the reduction in AXI width from 128 bits (B4096) to 64 bits (B1152) and by the DMA controller specifications. On the other hand, since the compute-lite cluster has up to 12 FPGA boards available, its scaling can be pushed further than that of the high-compute cluster. The results below show the effect of successively increasing the number of FPGA nodes in the cluster, demonstrating the full performance benefit of scaling distributed inference on embedded devices.
The results for the inference times on the Zynq-7000 cluster can be found in Figure 20. The inference time evaluation for the UltraScale+ cluster can be found in Figure 21.

6.5. PipeCNN

This DLA architecture follows the OpenCL framework development guidelines, where an OpenCL device (in this case, the FPGA PL) interacts with a host CPU (the FPGA PS). The device functional logic was defined as multiple parallel compute units (CUs) written as HLS/C++ kernel functions. These kernel functions were synthesized with the Vitis HLS compiler to generate the corresponding RTL code. Once the FPGA's device side was programmed with the corresponding CU RTL design, the host CPU interacted with the kernels mapped onto the FPGA fabric through C/C++ code using the OpenCL host API.
PipeCNN consists of four optimized HLS kernels connected in a pipelined fashion: a convolution kernel, a pooling kernel, and two data mover kernels that perform the NDRange kernel data transfer between shared and global memory. The convolution kernel supports both the common 3D MAC operations found in convolutions and the inner-product computations used for the weighted summations in fully connected layers. Optimizations for this kernel include HLS pragmas specifying loop unrolling on a five-stage nested loop and array partitioning to build a deep MAC pipeline with buffer delays in between.
Since pooling computations appear in most CNN architectures, a dedicated pooling kernel is implemented. The kernel reads the data in a line-buffer fashion, and once the buffers fill up, the subsampled data are sent to the next pipeline stage. In addition, two NDRange kernels are implemented for the data handshake between global memory and the CU kernels. The data mover kernels facilitate efficient data reuse, leading to a significant reduction in the global memory bandwidth requirements. Data are fetched by the MemRD kernel and transferred to the convolution kernel CU to perform parallel processing of the output features, the number of which is defined by CU_NUM. Data offloading to global memory happens through the MemWR kernel. The overall block diagram can be found in Appendix E. The architecture configurations for the different FPGAs implemented in our design can be found in Table 11 for PipeCNN, and the FPGA resource utilization for the FPGAs used in our clusters can be found in Table 12.

6.5.1. NN Compilation

Executing ResNet-18 with PipeCNN followed a host C/C++ script which defined the OpenCL kernels and subsequent intermediate buffers for data transfer between CUs. Each layer was defined as a kernel and grouped into an OpenCL array of kernels for each layer type.
An OpenCL buffer was defined for each pooling layer, fully connected layer, and input/output data array, and the weights and biases of each convolution were provided by setting the kernel arguments of the corresponding convolution kernels. The last step was to enqueue the kernels as tasks in the same order as in the CNN. Once the host OpenCL script finished, an MPI layer script was run on top of the host application program to schedule the tasks according to the distributed strategy. Note that for AI Core Assignment and Pipeline Scheduling, the OpenCL host scripts differ on each FPGA board depending on the assigned computation. It is also important to mention that SoftMax runs on the CPU due to the lack of a SoftMax implementation in the PipeCNN architecture.
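Although the actual PipeCNN host is written in C/C++ against the OpenCL API, the per-layer kernel-enqueue order it follows can be sketched compactly in Python with pyopencl, as below. The kernel names (memRead, coreConv, maxPool, memWrite), the binary path, and the simplified argument lists are assumptions for illustration only and do not match the real kernel signatures.

```python
# Hedged pyopencl sketch of the host-side kernel enqueue order used per layer.
# The real PipeCNN host is C/C++; kernel names and simplified arguments are assumptions.
import numpy as np
import pyopencl as cl

ctx = cl.Context(dev_type=cl.device_type.ACCELERATOR)   # the FPGA PL as the OpenCL device
queue = cl.CommandQueue(ctx)

with open("pipecnn.xclbin", "rb") as f:                  # precompiled FPGA binary (placeholder path)
    prg = cl.Program(ctx, ctx.devices, [f.read()]).build()

mem_rd, conv, pool, mem_wr = prg.memRead, prg.coreConv, prg.maxPool, prg.memWrite

weights = cl.Buffer(ctx, cl.mem_flags.READ_ONLY, size=4 << 20)
data_in = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=4 << 20)
data_out = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=4 << 20)

# For each ResNet-18 layer: fetch -> convolve -> (optionally) pool -> write back.
for layer_cfg in range(18):                              # placeholder per-layer configurations
    mem_rd.set_args(data_in, weights)                    # simplified argument lists
    conv.set_args(np.int32(layer_cfg))
    pool.set_args(np.int32(layer_cfg))
    mem_wr.set_args(data_out)
    for kernel in (mem_rd, conv, pool, mem_wr):
        cl.enqueue_nd_range_kernel(queue, kernel, (1,), (1,))
queue.finish()
```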

6.5.2. Performance Evaluation

The performance results were determined by the selected HW configuration parameters; the VEC_SIZE and LANE_NUM parameters had the largest impact on performance. The results in Figure 22 show the inference time for running the full ResNet-18 network on each FPGA device with its respective PipeCNN configuration for the Zynq-7000 cluster, and Figure 23 shows the results for the UltraScale+ cluster. For the UltraScale+ cluster, we initially started with the best-performing HW configurations, in this case the ZCU104 and ZCU102, followed by incrementally adding the KV260 boards.

6.6. Overall Power Consumption

Using the Xilinx Power Estimation (XPE) tool [52], we estimate the power consumption per DLA using the data from Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12 to build Table 13. An estimation is given for each individual FPGA and their respective clusters, as used in our experiments.

7. Conclusions

This study demonstrated the effectiveness of a distributed FPGA system in handling various DNN workloads, with a focus on optimized scheduling for real-time edge computing applications. We evaluated five distinct DLA architectures, and our research provides insights into the adaptability and performance of these systems when applied to a distributed FPGA cluster that optimizes latency across computing units. The results, showing up to a 90% speedup in processing time in some cases, underscore the potential of leveraging FPGAs for DL and NN tasks and position this hardware stack as a fast, powerful, and efficient alternative computing solution for high-speed machine learning applications.
The importance of this work hinges on its real-world implementations, especially in sectors where decision-making speed is life-critical, such as autonomous vehicle guidance systems or the real-time analysis and characterization of medical diagnostics. The low-latency and high-throughput attributes of these FPGAs could be the difference between success and failure in such applications. The adaptability of FPGAs, shown in our study by loading them with different DLAs and connecting them seamlessly into a stack, enables them to handle the growing complexity of DNN models efficiently and ensures that hardware innovation keeps pace with the rapid advancements in artificial intelligence. Through this work, we have established a practical foundation for the use of FPGAs in complex computational tasks, showing their worth beyond traditional hardware limitations for low-latency computation and opening doors for more innovative uses in the field of AI.

Author Contributions

J.S., A.P.-V., T.F. and H.J. conceived the concept of this paper; A.P.-V., T.F. and H.J., as graduate researchers, implemented the goals and objectives; J.S., as the faculty advisor, oversaw and guided the overall direction of this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions and results presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shawahna, A.; Sait, S.M.; El-Maleh, A. FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review. IEEE Access 2019, 7, 7823–7859. [Google Scholar] [CrossRef]
  2. Sentieys, O.; Filip, S.; Briand, D.; Novo, D.; Dupuis, E.; O’Connor, I.; Bosio, A. AdequateDL: Approximating Deep Learning Accelerators. In Proceedings of the 2021 24th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), Vienna, Austria, 7–9 April 2021; pp. 37–40. [Google Scholar] [CrossRef]
  3. Pal, S.; Venkataramani, S.; Srinivasan, V.; Gopalakrishnan, K. Efficient Management of Scratch-Pad Memories in Deep Learning Accelerators. In Proceedings of the 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Stony Brook, NY, USA, 28–30 March 2021; pp. 240–242. [Google Scholar] [CrossRef]
  4. Yang, A. Deep Learning Training At Scale Spring Crest Deep Learning Accelerator (Intel® Nervana™ NNP-T). In Proceedings of the 2019 IEEE Hot Chips 31 Symposium (HCS), Cupertino, CA, USA, 18–20 August 2019; pp. 1–20. [Google Scholar] [CrossRef]
  5. Song, L.; Chen, F.; Chen, Y.; Li, H. Parallelism in Deep Learning Accelerators. In Proceedings of the 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), Beijing, China, 13–16 January 2020; pp. 645–650. [Google Scholar] [CrossRef]
  6. Choubey, A.; Choubey, S.B. Efficient Design of Adaptable Deep Learning Accelerator. In Proceedings of the 2021 10th IEEE International Conference on Communication Systems and Network Technologies (CSNT), Bhopal, India, 18–19 June 2021; pp. 588–592. [Google Scholar] [CrossRef]
  7. Faber, C.J.; Harris, S.D.; Xiac, Z.; Chamberlain, R.D.; Cabrera, A.M. Challenges Designing for FPGAs Using High-Level Synthesis. In Proceedings of the 2022 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 19–23 September 2022; pp. 1–7. [Google Scholar] [CrossRef]
  8. Li, M.; Liu, Y.; Liu, X.; Sun, Q.; You, X.; Yang, H.; Luan, Z.; Gan, L.; Yang, G.; Qian, D. The Deep Learning Compiler: A Comprehensive Survey. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 708–727. [Google Scholar] [CrossRef]
  9. Gonzalez-Carabarin, L.; Schmid, A.; Sloun, R.J.v. Structured and tiled-based pruning of Deep Learning models targeting FPGA implementations. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022; pp. 1392–1396. [Google Scholar] [CrossRef]
  10. Li, Z.; Ge, F.; Zhou, F.; Wu, N. An A3C Deep Reinforcement Learning FPGA Accelerator based on Heterogeneous Compute Units. In Proceedings of the 2022 IEEE 22nd International Conference on Communication Technology (ICCT), Nanjing, China, 11–14 November 2022; pp. 1521–1525. [Google Scholar] [CrossRef]
  11. Soltani, S.; Sagduyu, Y.E.; Hasan, R.; Davaslioglu, K.; Deng, H.; Erpek, T. Real-Time Experimentation of Deep Learning-based RF Signal Classifier on FPGA. In Proceedings of the 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), Newark, NJ, USA, 11–14 November 2019; pp. 1–2. [Google Scholar] [CrossRef]
  12. Yin, H.; Hong, H.; Liu, J. FPGA-based Deep Learning Acceleration for Visual Grasping Control of Manipulator. In Proceedings of the 2021 IEEE International Conference on Real-time Computing and Robotics (RCAR), Xining, China, 15–19 July 2021; pp. 881–886. [Google Scholar] [CrossRef]
  13. Lu, Y.; Zhai, X.; Saha, S.; Ehsan, S.; McDonald-Maier, K.D. FPGA based Adaptive Hardware Acceleration for Multiple Deep Learning Tasks. In Proceedings of the 2021 IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 20–23 December 2021; pp. 204–209. [Google Scholar] [CrossRef]
  14. Rupanetti, D.; Nepal, K.; Salamy, H.; Min, C.H. Cost-Effective, Re-Configurable Cluster Approach for Resource Constricted FPGA Based Machine Learning and AI Applications. In Proceedings of the 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 6–8 January 2020; pp. 228–233. [Google Scholar] [CrossRef]
  15. Kan, H.; Li, R.; Su, D.; Wang, Y.; Shen, Y.; Liu, W. Trusted Edge Cloud Computing Mechanism Based on FPGA Cluster. In Proceedings of the 2020 IEEE 8th International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 20–22 November 2020; pp. 146–149. [Google Scholar] [CrossRef]
  16. Wu, C.B.; Hsiao, Y.K.; Chang, W.H. Extensible and Modularized Processing Unit Design and Implementation for AI Accelerator. In Proceedings of the 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), Incheon, Republic of Korea, 13–15 June 2022; pp. 238–241. [Google Scholar] [CrossRef]
  17. Lin, X.; Yin, S.; Tu, F.; Liu, L.; Li, X.; Wei, S. LCP: A Layer Clusters Paralleling mapping method for accelerating Inception and Residual networks on FPGA. In Proceedings of the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 24–28 June 2018; pp. 1–6. [Google Scholar] [CrossRef]
  18. Xu, J.; Huan, Y.; Huang, B.; Chu, H.; Jin, Y.; Zheng, L.R.; Zou, Z. A Memory-Efficient CNN Accelerator Using Segmented Logarithmic Quantization and Multi-Cluster Architecture. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 2142–2146. [Google Scholar] [CrossRef]
  19. Johnson, H.; Fang, T.; Perez-Vicente, A.; Saniie, J. Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators. In Proceedings of the 2023 IEEE International Conference on Electro Information Technology (eIT), Romeoville, IL, USA, 18–20 May 2023; pp. 1–5. [Google Scholar] [CrossRef]
  20. Apache.org. Overview—tvm 0.19.dev0 Documentation. Available online: https://tvm.apache.org/docs/get_started/overview.html (accessed on 19 December 2024).
  21. Farshchi, F.; Huang, Q.; Yun, H. Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim. In Proceedings of the 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), Washington, DC, USA, 17 February 2019; pp. 21–25. [Google Scholar] [CrossRef]
  22. Tensil. Hardware Architecture and Implementation Details. Available online: https://www.tensil.ai/docs/reference/hardware/ (accessed on 19 December 2024).
  23. Wang, D.; Xu, K.; Jiang, D. PipeCNN: An OpenCL-based open-source FPGA accelerator for convolution neural networks. In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, VIC, Australia, 11–13 December 2017; pp. 279–282. [Google Scholar] [CrossRef]
  24. Taylor, A. MicroZed Chronicles: The Deep Learning Processing Unit. Available online: https://www.hackster.io/news/microzed-chronicles-the-deep-learning-processing-unit-659221f58883 (accessed on 19 December 2024).
  25. AMD. DPU for Convolutional Neural Network. Available online: https://www.xilinx.com/products/intellectual-property/dpu.html#overview (accessed on 19 December 2024).
  26. Ramagond, S.; Yellampalli, S.; Kanagasabapathi, C. A review and analysis of communication logic between PL and PS in ZYNQ AP SoC. In Proceedings of the 2017 International Conference On Smart Technologies For Smart Nation (SmartTechCon), Bengaluru, India, 17–19 August 2017; pp. 946–951. [Google Scholar] [CrossRef]
  27. Ferguson, D. Clusterssh, 2018. Copyright 1999–2018 Duncan Ferguson. Licensed under the GNU GPL or Artistic License. Available online: https://github.com/duncs/clusterssh (accessed on 19 December 2024).
  28. Zhang, Z. The Analysis of Distributed Computing Systems with Machine Learning. In Proceedings of the 2023 International Conference on Networking, Informatics and Computing (ICNETIC), Palermo, Italy, 29–31 May 2023; pp. 67–70. [Google Scholar] [CrossRef]
  29. Gavankar, T.; Joshi, A.; Sharma, S. Distributed Computing and Image Processing for Autonomous Driving Systems. In Proceedings of the 2018 IEEE Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER), Mangalore, India, 13–14 August 2018; pp. 13–18. [Google Scholar] [CrossRef]
  30. Yao, Y.; Liu, B.; Zhao, Y.; Shi, W. Towards Edge-enabled Distributed Computing Framework for Heterogeneous Android-based Devices. In Proceedings of the 2022 IEEE/ACM 7th Symposium on Edge Computing (SEC), Seattle, WA, USA, 5–8 December 2022; pp. 531–536. [Google Scholar] [CrossRef]
  31. Chen, H.; Wu, Y. Coded Computing for Master-Aided Distributed Computing Systems. In Proceedings of the 2020 IEEE Information Theory Workshop (ITW), Riva del Garda, Italy, 11–15 April 2021; pp. 1–5. [Google Scholar] [CrossRef]
  32. Wen, J.; Zhang, W. Billing System in Distributed Computing Environment. In Proceedings of the 2020 International Conference on Computer Engineering and Intelligent Control (ICCEIC), Chongqing, China, 6–8 November 2020; pp. 310–313. [Google Scholar] [CrossRef]
  33. Dıker, A. A Performance Comparison of Pre-trained Deep Learning Models to Classify Brain Tumor. In Proceedings of the IEEE EUROCON 2021—19th International Conference on Smart Technologies, Lviv, Ukraine, 6–8 July 2021; pp. 246–249. [Google Scholar] [CrossRef]
  34. Khan, S.U.; Mynuddin, M.; Ahad, D.M.A.; Hossain, M.I.; Islam, M.J.; Kabir, M.F. A Comparative Analysis of Deep Learning Models for Power Quality Disturbance Classification. In Proceedings of the 2023 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 7–10 June 2023; pp. 0317–0323. [Google Scholar] [CrossRef]
  35. Poomrittigul, S.; Chomkwah, W.; Tanpatanan, T.; Sakorntanant, S.; Treebupachatsakul, T. A Comparison of Deep Learning CNN Architecture Models for Classifying Bacteria. In Proceedings of the 2022 37th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC), Phuket, Thailand, 5–8 July 2022; pp. 290–293. [Google Scholar] [CrossRef]
  36. Gan, H.S.; Ramlee, M.H.; Wahab, A.A.; Mahmud, W.M.H.W.; Setiadi, D.R.I.M. Image-to-Graph Transformation via Superpixel Clustering to Build Nodes in Deep Learning for Graph. In Proceedings of the 2022 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), Kuala Lumpur, Malaysia, 7–9 December 2022; pp. 213–217. [Google Scholar] [CrossRef]
  37. He, R.; Gopinath, K.; Desrosiers, C.; Lombaert, H. Spectral Graph Transformer Networks for Brain Surface Parcellation. In Proceedings of the 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), Iowa City, IA, USA, 3–7 April 2020; pp. 372–376. [Google Scholar] [CrossRef]
  38. Chen, X.; Lin, X.; Shen, Q.; Qian, X. Combined Spiral Transformation and Model-Driven Multi-Modal Deep Learning Scheme for Automatic Prediction of TP53 Mutation in Pancreatic Cancer. IEEE Trans. Med. Imaging 2021, 40, 735–747. [Google Scholar] [CrossRef] [PubMed]
  39. Bertalanič, B.; Vnučec, M.; Fortuna, C. Graph Neural Networks Based Anomalous RSSI Detection. In Proceedings of the 2023 International Balkan Conference on Communications and Networking (BalkanCom), İstanbul, Turkey, 5–8 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  40. Xie, Y.; Xu, Z.; Zhang, J.; Wang, Z.; Ji, S. Self-Supervised Learning of Graph Neural Networks: A Unified Review. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2412–2429. [Google Scholar] [CrossRef] [PubMed]
  41. Lattner, C.; Amini, M.; Bondhugula, U.; Cohen, A.; Davis, A.; Pienaar, J.; Riddle, R.; Shpeisman, T.; Vasilache, N.; Zinenko, O. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Seoul, Republic of Korea, 27 February–3 March 2021; pp. 2–14. [Google Scholar] [CrossRef]
  42. Wang, Y.; Xie, F. Extending Tensor Virtual Machine to Support Deep-Learning Accelerators with Convolution Cores. In Proceedings of the 2022 26th International Conference on Engineering of Complex Computer Systems (ICECCS), Hiroshima, Japan, 26–30 March 2022; pp. 189–194. [Google Scholar] [CrossRef]
  43. Xilinx. Xilinx/Vitis-AI. Available online: https://github.com/Xilinx/Vitis-AI (accessed on 19 December 2024).
  44. Jadhav, S.S.; Gloster, C.; Naher, J.; Doss, C.; Kim, Y. A Multi-Memory Field-Programmable Custom Computing Machine for Accelerating Compute-Intensive Applications. In Proceedings of the 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 1–4 December 2021; pp. 619–628. [Google Scholar] [CrossRef]
  45. Zou, C.; Cui, X.; Kuang, Y.; Liu, K.; Wang, Y.; Wang, X.; Huang, R. A Scatter-and-Gather Spiking Convolutional Neural Network on a Reconfigurable Neuromorphic Hardware. Front. Neurosci. 2021, 15, 694170. [Google Scholar] [CrossRef] [PubMed]
  46. Rodgers, D.P. Improvements in multiprocessor system design. ACM SIGARCH Comput. Archit. News 1985, 13, 225–231. [Google Scholar] [CrossRef]
  47. Reddy, M. API Design for C++; Morgan Kaufmann Publishers: Burlington, MA, USA, 2011; p. 210. [Google Scholar] [CrossRef]
  48. Bello, I.; Fedus, L.B.; Du, X.; Cubuk, E.D.; Srinivas, A.; Lin, T.Y.; Shlens, J.; Zoph, B.R. Revisiting ResNets: Improved Training Methodologies and Scaling Principles. 2021. Available online: https://research.google/pubs/revisiting-resnets-improved-training-methodologies-and-scaling-principles/ (accessed on 19 December 2024).
  49. Bressem, K.K.; Adams, L.C.; Erxleben, C.; Hamm, B.; Niehues, S.M.; Vahldiek, J.L. Comparing Different Deep Learning Architectures for Classification of Chest Radiographs. Sci. Rep. 2020, 10, 13590. [Google Scholar] [CrossRef] [PubMed]
  50. Pandey, G.K.; Srivastava, S. ResNet-18 comparative analysis of various activation functions for image classification. In Proceedings of the 2023 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 26–28 April 2023; pp. 595–601. [Google Scholar] [CrossRef]
  51. AMD. Aurora 64B/66B. Available online: https://www.xilinx.com/products/intellectual-property/aurora64b66b.html (accessed on 19 December 2024).
  52. AMD. Power Estimator. Available online: https://www.amd.com/en/products/adaptive-socs-and-fpgas/technologies/power-efficiency/power-estimator.html (accessed on 19 December 2024).
Figure 1. VTA block diagram.
Figure 2. NVDLA block diagram.
Figure 3. Tensil CU block diagram.
Figure 4. PipeCNN block diagram.
Figure 5. Xilinx DPU block diagram.
Figure 6. The FPGA cluster: twelve Xilinx Zynq-7020 SoC processors in a combination of PYNQ-Z1 and ZedBoard FPGAs.
Figure 7. Flowchart for partitioning DLA and NN workloads onto an FPGA cluster. Each command prompt corresponds to an FPGA in the stack.
Figure 8. Pipeline Scheduling block diagram for distributed DNN workload management.
Figure 9. Scatter-Gather approach block diagram depicting flow control for the generalized FPGA cluster setup.
Figure 10. AI Core Assignment scheduling. In this instance, a part of the mod1 resources is relocated to the skip connections, depicting the process of reassigning hardware resources to reduce bottleneck.
Figure 11. Fused schedule. Tasks with higher computational demands are allocated additional hardware resources, whereas less complex tasks receive fewer resources. This allocation strategy mitigates bottlenecks and enhances the overall efficiency of hardware resource utilization.
Figure 12. Zynq-7000 cluster VTA scheduling approaches.
Figure 13. UltraScale+ cluster VTA scheduling approaches.
Figure 14. UltraScale+ cluster VTA scheduling approaches with clock frequency increased to 350 MHz.
Figure 15. UltraScale+ cluster VTA scheduling approaches with increased clock frequency and buffer size.
Figure 16. NVDLA 64 MAC inference time on the UltraScale+ cluster.
Figure 17. NVDLA 256 + 512 MAC inference time on the UltraScale+ cluster.
Figure 18. Tensil CU’s inference time on the Zynq-7000 cluster.
Figure 19. Tensil CU’s inference time on the UltraScale+ cluster.
Figure 20. Xilinx DPU scheduling approaches’ inference times on the Zynq-7000 cluster.
Figure 21. Xilinx DPU scheduling approaches’ inference times on the UltraScale+ cluster.
Figure 22. PipeCNN scheduling approaches’ inference times on the Zynq-7000 cluster.
Figure 23. PipeCNN scheduling approaches’ inference times on the UltraScale+ cluster.
Table 1. Summary of the results of the performance characterization for a single unit vs. full-stack FPGA clusters running DNN workloads for separate DLAs.

| DLA | Cluster | Quantity of FPGAs | Scatter-Gather (ms) | AI Core Assignment (ms) | Pipeline Scheduling (ms) | Fused Scheduling (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| VTA (100 MHz) | Zynq-7000 | 1 | 27.34 | 27.34 | 27.34 | 27.34 |
| VTA (100 MHz) | Zynq-7000 | 12 | 2.58 | 1.84 | 2.62 | 2.66 |
| VTA (100 MHz) | Zynq-7000 | Performance Speedup | 90.56% | 93.27% | 90.42% | 90.27% |
| VTA (100 MHz) | UltraScale+ | 1 | 25.15 | 25.15 | 25.15 | 25.15 |
| VTA (100 MHz) | UltraScale+ | 5 | 6.01 | 14.14 | 8.58 | 6.93 |
| VTA (100 MHz) | UltraScale+ | Performance Speedup | 76.10% | 43.78% | 65.88% | 72.45% |
| Tensil CU | Zynq-7000 | 1 | 26.03 | - | - | - |
| Tensil CU | Zynq-7000 | 2 | 15.24 | 29.33 | 28.73 | - |
| Tensil CU | Zynq-7000 | 3 | 9.91 | 25.32 | 24.13 | 23.52 |
| Tensil CU | Zynq-7000 | 12 | 2.25 | 1.84 | 2.62 | 2.66 |
| Tensil CU | Zynq-7000 | Performance Speedup | 91.36% | 93.73% | 90.88% | 88.69% |
| Tensil CU | UltraScale+ | 1 | 4.67 | - | - | - |
| Tensil CU | UltraScale+ | 2 | 2.94 | 7.2 | 6.08 | - |
| Tensil CU | UltraScale+ | 3 | 3.89 | 6.35 | 5.11 | 5.9 |
| Tensil CU | UltraScale+ | 5 | 2.75 | 3.22 | 3.39 | 3.48 |
| Tensil CU | UltraScale+ | Performance Speedup | 41.11% | 55.28% | 44.24% | 41.02% |
| Xilinx DPU | Zynq-7000 | 1 | 51.87 | - | - | - |
| Xilinx DPU | Zynq-7000 | 2 | 26.6 | 78.36 | 78.12 | - |
| Xilinx DPU | Zynq-7000 | 3 | 18.19 | 65.39 | 55.56 | 41.38 |
| Xilinx DPU | Zynq-7000 | 12 | 4.35 | 1.98 | 2.03 | 3.53 |
| Xilinx DPU | Zynq-7000 | Performance Speedup | 91.61% | 97.47% | 97.40% | 91.47% |
| Xilinx DPU | UltraScale+ | 1 | 2.85 | - | - | - |
| Xilinx DPU | UltraScale+ | 2 | 3.72 | 8.88 | 7.08 | - |
| Xilinx DPU | UltraScale+ | 3 | 4.47 | 6.43 | 5.53 | 5.11 |
| Xilinx DPU | UltraScale+ | 5 | 3.05 | 4.81 | 2.93 | 3.03 |
| Xilinx DPU | UltraScale+ | Performance Speedup | −7.02% | 45.83% | 58.62% | 40.70% |
| PipeCNN | Zynq-7000 | 1 | 203.13 | - | - | - |
| PipeCNN | Zynq-7000 | 2 | 106.25 | 244.36 | 228.31 | - |
| PipeCNN | Zynq-7000 | 3 | 69.55 | 217.12 | 215.52 | 218.18 |
| PipeCNN | Zynq-7000 | 12 | 17.83 | 11.51 | 15.12 | 19.07 |
| PipeCNN | Zynq-7000 | Performance Speedup | 91.22% | 95.29% | 93.38% | 91.26% |
| PipeCNN | UltraScale+ | 1 | 62.45 | - | - | - |
| PipeCNN | UltraScale+ | 2 | 33.35 | 77.28 | 66.01 | - |
| PipeCNN | UltraScale+ | 3 | 66.15 | 92.31 | 49.27 | 75.43 |
| PipeCNN | UltraScale+ | 5 | 29.13 | 57.45 | 22.11 | 43.52 |
| PipeCNN | UltraScale+ | Performance Speedup | 53.35% | 25.66% | 66.51% | 42.30% |
| NVDLA 64 MAC | UltraScale+ | 1 | 346.41 | - | 346.41 | 346.41 |
| NVDLA 64 MAC | UltraScale+ | 5 | 73.37 | - | 132.53 | 118.72 |
| NVDLA 64 MAC | UltraScale+ | Performance Speedup | 78.82% | - | 61.74% | 65.73% |
| NVDLA 256 + 512 MAC | UltraScale+ | 1 | 73.93 | - | 73.93 | 73.93 |
| NVDLA 256 + 512 MAC | UltraScale+ | 5 | 56.37 | - | 78.47 | 64.27 |
| NVDLA 256 + 512 MAC | UltraScale+ | Performance Speedup | 23.75% | - | −6.14% | 13.07% |
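The Performance Speedup rows in Table 1 are consistent with comparing, for each scheduling approach, the latency at the smallest cluster size for which that approach is reported against the latency of the full stack. The short Python sketch below reproduces this calculation for two entries from the table; the helper name and baseline convention are our own framing for illustration, not code from the paper.

```python
def speedup(baseline_ms: float, final_ms: float) -> float:
    """Relative latency reduction, as a percentage, from baseline to final."""
    return (baseline_ms - final_ms) / baseline_ms * 100.0

# VTA on the Zynq-7000 cluster (Table 1): 1 FPGA -> 12 FPGAs, Scatter-Gather.
print(f"{speedup(27.34, 2.58):.2f}%")   # 90.56%

# Xilinx DPU on the UltraScale+ cluster: Scatter-Gather slows down (2.85 ms -> 3.05 ms),
# which is why Table 1 reports a negative speedup for that case.
print(f"{speedup(2.85, 3.05):.2f}%")    # -7.02%
```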
Table 2. Initial VTA configuration parameters.

| Parameters | Zynq-7020 |
| --- | --- |
| CLOCK_FREQUENCY | 100 MHz |
| INPUT_WIDTH | 8-bit |
| WEIGHT_WIDTH | 8-bit |
| ACCUMULATOR_WIDTH | 32-bit |
| BATCH_SIZE | 1 |
| BLOCK_SIZE | 16 |
| MICRO_OP_BUFFER_SIZE | 32 Kb |
| INPUT_BUFFER_SIZE | 32 Kb |
| WEIGHT_BUFFER_SIZE | 256 Kb |
| ACCUMULATOR_BUFFER_SIZE | 128 Kb |
Table 3. FPGA resource utilization of the VTA configuration.

| Resource | Utilization |
| --- | --- |
| LUT | 25,635 |
| LUTRAM | 2092 |
| FF | 24,968 |
| BRAM | 132 |
| DSP | 220 |
Table 4. VTA configuration parameters with increased clock frequency and buffer size.

| Parameters | Size |
| --- | --- |
| CLOCK_FREQUENCY | 200 MHz |
| INPUT_WIDTH | 8-bit |
| WEIGHT_WIDTH | 8-bit |
| ACCUMULATOR_WIDTH | 32-bit |
| BATCH_SIZE | 1 |
| BLOCK_SIZE | 32 |
| MICRO_OP_BUFFER_SIZE | 64 Kb |
| INPUT_BUFFER_SIZE | 64 Kb |
| WEIGHT_BUFFER_SIZE | 512 Kb |
| ACCUMULATOR_BUFFER_SIZE | 256 Kb |
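For context, TVM's VTA hardware generator expresses parameters such as those in Tables 2 and 4 as base-2 logarithms in a JSON configuration file (vta_config.json). The sketch below shows how the Table 4 values might map onto that convention; it assumes the tabulated buffer sizes denote kilobytes, the TARGET field is a placeholder, and the key names follow the stock VTA distribution, so treat it as an illustrative mapping rather than the authors' exact configuration.

```python
import json
import math

# Table 4 parameters (buffer sizes interpreted as kilobytes; an assumption).
params = {
    "INPUT_WIDTH": 8,            # bits
    "WEIGHT_WIDTH": 8,           # bits
    "ACCUMULATOR_WIDTH": 32,     # bits
    "BATCH_SIZE": 1,
    "BLOCK_SIZE": 32,
    "MICRO_OP_BUFFER_SIZE": 64 * 1024,      # bytes
    "INPUT_BUFFER_SIZE": 64 * 1024,
    "WEIGHT_BUFFER_SIZE": 512 * 1024,
    "ACCUMULATOR_BUFFER_SIZE": 256 * 1024,
}

def log2(v: int) -> int:
    return int(math.log2(v))

# VTA expects log2-encoded fields; key names follow the stock vta_config.json.
# The 200 MHz clock from Table 4 is set in the FPGA build, not in this file.
vta_config = {
    "TARGET": "ultra96",                 # placeholder FPGA target
    "HW_VER": "0.0.2",
    "LOG_INP_WIDTH": log2(params["INPUT_WIDTH"]),
    "LOG_WGT_WIDTH": log2(params["WEIGHT_WIDTH"]),
    "LOG_ACC_WIDTH": log2(params["ACCUMULATOR_WIDTH"]),
    "LOG_BATCH": log2(params["BATCH_SIZE"]),
    "LOG_BLOCK": log2(params["BLOCK_SIZE"]),
    "LOG_UOP_BUFF_SIZE": log2(params["MICRO_OP_BUFFER_SIZE"]),
    "LOG_INP_BUFF_SIZE": log2(params["INPUT_BUFFER_SIZE"]),
    "LOG_WGT_BUFF_SIZE": log2(params["WEIGHT_BUFFER_SIZE"]),
    "LOG_ACC_BUFF_SIZE": log2(params["ACCUMULATOR_BUFFER_SIZE"]),
}
print(json.dumps(vta_config, indent=2))
```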
Table 5. Three target architecture configurations of the NVDLA implementation.

| Parameters | NVDLA 64 MAC | NVDLA 256 MAC | NVDLA 512 MAC |
| --- | --- | --- | --- |
| FEATURE_DATA_TYPE | INT8 | INT8 | INT8 |
| WEIGHT_DATA_TYPE | INT8 | INT8 | INT8 |
| SDP_FUNCTION | SINGLE SCALING | SINGLE SCALING | SINGLE SCALING |
| MAC_ATOMIC_C_SIZE | 8 | 32 | 32 |
| MAC_ATOMIC_K_SIZE | 8 | 8 | 32 |
| MEMORY_ATOMIC_SIZE | 8 | 8 | 8 |
| CONV_BUF_BANK_NUM | 32 | 32 | 32 |
| CONV_BUF_BANK_WIDTH | 8 | 32 | 32 |
| CONV_BUF_BANK_DEPTH | 512 | 128 | 512 |
| SDP_BS_THROUGHPUT | 1 | 1 | 4 |
| SDP_BN_THROUGHPUT | 1 | 1 | 4 |
| SDP_EW_THROUGHPUT | 1 | 1 | 4 |
| PDP_THROUGHPUT | 1 | 1 | 2 |
| CDP_THROUGHPUT | 1 | 1 | 2 |
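As a quick reading aid for Table 5, the MAC parallelism of an NVDLA build is the product of the atomic-C and atomic-K sizes, and the convolution buffer capacity follows from the bank count, bank width (in bytes), and bank depth. The sketch below reproduces this arithmetic for the 64 MAC column; it is illustrative bookkeeping on the tabulated parameters, not part of the NVDLA tool flow.

```python
# Bookkeeping for the NVDLA 64 MAC column of Table 5 (illustrative only).
MAC_ATOMIC_C_SIZE = 8
MAC_ATOMIC_K_SIZE = 8
CONV_BUF_BANK_NUM = 32
CONV_BUF_BANK_WIDTH = 8       # bytes per bank entry
CONV_BUF_BANK_DEPTH = 512

# With INT8 features and weights, the MAC array consumes ATOMIC_C input
# channels for ATOMIC_K output kernels in parallel.
macs_per_cycle = MAC_ATOMIC_C_SIZE * MAC_ATOMIC_K_SIZE
print(f"MAC parallelism: {macs_per_cycle}")          # 64

# Convolution buffer capacity = banks x width x depth.
conv_buf_bytes = CONV_BUF_BANK_NUM * CONV_BUF_BANK_WIDTH * CONV_BUF_BANK_DEPTH
print(f"CONV_BUF: {conv_buf_bytes // 1024} KiB")     # 128 KiB
```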
Table 6. FPGA resource utilization of the three NVDLA architecture configurations.

| Resource | NVDLA 64 MAC | NVDLA 256 MAC | NVDLA 512 MAC |
| --- | --- | --- | --- |
| LUT | 78,508 | 100,339 | 161,935 |
| LUTRAM | 1812 | 1993 | 3579 |
| FF | 88,335 | 113,735 | 157,610 |
| BRAM | 64 | 128 | 853 |
| URAM | - | - | - |
| DSP | 32 | 32 | 65 |
Table 7. Architecture configurations of the Tensil CU DLA on different FPGAs.

| Parameters | ZYNQ-7020 | KV260 | ZCU-104/ZCU-102 |
| --- | --- | --- | --- |
| DATA_TYPE | FP16BP8 | FP16BP8 | FP16BP8 |
| ARRAY_SIZE | 8 | 16 | 32 |
| DRAM0_DEPTH | 1,048,576 | 2,097,152 | 2,097,152 |
| DRAM1_DEPTH | 1,048,576 | 2,097,152 | 2,097,152 |
| LOCAL_DEPTH | 8192 | 8192 | 16,384 |
| ACCUMULATOR_DEPTH | 2048 | 4096 | 4096 |
| SIMD_REG_DEPTH | 1 | 1 | 1 |
| STRIDE_0_DEPTH | 8 | 8 | 8 |
| STRIDE_1_DEPTH | 8 | 8 | 8 |
| NUM_THREADS | 1 | 1 | 1 |
| THREAD_QUEUE_DEPTH | 8 | 8 | 8 |
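The Tensil parameters in Table 7 largely determine on-chip memory demand: with the FP16BP8 data type each element occupies 2 bytes, and the local and accumulator memories each hold vectors of ARRAY_SIZE elements. The sketch below estimates these footprints for the three configurations; it treats accumulator entries as 2 bytes as well, which is a simplification, and is meant only to indicate why the BRAM/URAM usage in Table 8 grows with array size.

```python
# Rough on-chip memory estimate for the Tensil CU configurations in Table 7.
# Assumption: FP16BP8 elements are 2 bytes; accumulator entries are also
# treated as 2 bytes here (a simplification for illustration).
ELEM_BYTES = 2

configs = {
    "ZYNQ-7020":       {"array_size": 8,  "local_depth": 8192,  "accumulator_depth": 2048},
    "KV260":           {"array_size": 16, "local_depth": 8192,  "accumulator_depth": 4096},
    "ZCU-104/ZCU-102": {"array_size": 32, "local_depth": 16384, "accumulator_depth": 4096},
}

for name, c in configs.items():
    local_kib = c["local_depth"] * c["array_size"] * ELEM_BYTES / 1024
    acc_kib = c["accumulator_depth"] * c["array_size"] * ELEM_BYTES / 1024
    print(f"{name}: local ~{local_kib:.0f} KiB, accumulators ~{acc_kib:.0f} KiB")
```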
Table 8. FPGA resource utilization of the Tensil CU on different FPGAs.

| Resource | ZYNQ-7020 | KV260 | ZCU104 | ZCU102 |
| --- | --- | --- | --- | --- |
| LUT | 15,960 | 30,341 | 56,845 | 56,806 |
| LUTRAM | 1914 | 3456 | 4214 | 4214 |
| FF | 9576 | 18,346 | 58,521 | 58,479 |
| BRAM | 44 | 122 | 93.5 | 293.5 |
| URAM | - | 20 | 25 | - |
| DSP | 73 | 274 | 1057 | 1057 |
Table 9. FPGA resource utilization of the Xilinx DPU on different FPGAs used in the Zynq-7000 cluster.

| Resource | ZYNQ-7020 (1xB1152) |
| --- | --- |
| LUT | 43,200 |
| LUTRAM | 4562 |
| FF | 75,798 |
| BRAM | 121 |
| URAM | - |
| DSP | 196 |
| Num. DPU Cores | 1 |
Table 10. FPGA resource utilization of the Xilinx DPU on different FPGAs used in the UltraScale+ cluster.

| Resource | KV260 (1xB4096) | ZCU104 (2xB4096) | ZCU102 (3xB4096) |
| --- | --- | --- | --- |
| LUT | 58,450 | 107,901 | 157,050 |
| LUTRAM | 6145 | 11,729 | 17,331 |
| FF | 106,316 | 204,298 | 301,537 |
| BRAM | 111 | 218 | 775 |
| URAM | 40 | 80 | - |
| DSP | 704 | 1394 | 2084 |
| Num. DPU Cores | 1 | 2 | 3 |
Table 11. Architecture configurations of the PipeCNN on different FPGAs used in the UltraScale+ cluster.

| Parameters | ZYNQ-7020 | KV260 | ZCU-104/ZCU-102 |
| --- | --- | --- | --- |
| VEC_SIZE | 4 | 16 | 16 |
| LANE_NUM | 2 | 8 | 16 |
| CONV_GP_SIZE_X | 7 | 7 | 7 |
| CONV_GP_SIZE_Y | 1 | 1 | 1 |
| PIPE_DEPTH | 6 | 24 | 48 |
| POOL_GP_SIZE_X | 4 | 4 | 4 |
| DP_WIDTH | 8 | 8 | 8 |
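In PipeCNN, VEC_SIZE controls the input-feature vectorization of the convolution kernel and LANE_NUM the number of parallel output lanes, so their product approximates the multiply-accumulate operations issued per cycle. The sketch below applies this to the Table 11 configurations at an assumed kernel clock to give a rough peak-throughput comparison; the 200 MHz clock is a placeholder assumption, not a measured figure from this work.

```python
# Rough peak MAC throughput for the PipeCNN configurations in Table 11.
# The 200 MHz kernel clock is a placeholder assumption for illustration only.
CLOCK_HZ = 200e6

configs = {
    "ZYNQ-7020":       {"VEC_SIZE": 4,  "LANE_NUM": 2},
    "KV260":           {"VEC_SIZE": 16, "LANE_NUM": 8},
    "ZCU-104/ZCU-102": {"VEC_SIZE": 16, "LANE_NUM": 16},
}

for name, c in configs.items():
    macs_per_cycle = c["VEC_SIZE"] * c["LANE_NUM"]
    peak_gops = 2 * macs_per_cycle * CLOCK_HZ / 1e9   # 2 ops (multiply + add) per MAC
    print(f"{name}: {macs_per_cycle} MACs/cycle, ~{peak_gops:.1f} GOPS peak")
```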
Table 12. Resource utilization of the PipeCNN on different FPGAs used in the UltraScale+ cluster.

| Resource | ZYNQ-7020 | KV260 | ZCU104 | ZCU102 |
| --- | --- | --- | --- | --- |
| LUT | 48,220 | 85,946 | 130,374 | 130,432 |
| LUTRAM | 3568 | 3648 | 3962 | 4021 |
| FF | 68,492 | 100,092 | 160,238 | 164,102 |
| BRAM | 54.5 | 140.5 | 190.5 | 334.5 |
| URAM | - | 11 | 18 | - |
| DSP | 51 | 297 | 395 | 395 |
Table 13. Power estimation for DLA configuration on FPGA clusters.

| DLAs | Zynq-7020: Per FPGA Unit | Zynq-7020: Stack Total (×12) | UltraScale+: Per FPGA Unit | UltraScale+: Stack Total (×5) |
| --- | --- | --- | --- | --- |
| VTA | 1.9 W | 22.8 W | 2.4 W | 12.0 W |
| NVDLA 512 MAC | - | - | 4.3 W | 21.5 W |
| Tensil CU | 1.6 W | 19.2 W | 3.4 W | 17.0 W |
| Xilinx DPU | 2.2 W | 26.4 W | 6.6 W | 33.0 W |
| PipeCNN | 2.1 W | 25.2 W | 4.2 W | 21.0 W |
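Combining Table 13 with the latencies in Table 1 gives a rough energy-per-inference comparison: for example, the five-board UltraScale+ VTA stack at 12.0 W and a 6.93 ms fused-scheduling latency works out to roughly 83 mJ per inference. The sketch below repeats this arithmetic for a few tabulated DLA and cluster combinations; it assumes the estimated stack power is drawn for the entire inference window, which ignores idle periods and host power.

```python
# Rough energy-per-inference estimate: stack power (Table 13) x latency (Table 1).
# Assumes the estimated stack power is sustained for the whole inference window.
cases = [
    # (DLA, cluster, stack power in W, fused-scheduling latency in ms)
    ("VTA",        "Zynq-7000 (x12)",  22.8, 2.66),
    ("VTA",        "UltraScale+ (x5)", 12.0, 6.93),
    ("Xilinx DPU", "UltraScale+ (x5)", 33.0, 3.03),
    ("PipeCNN",    "UltraScale+ (x5)", 21.0, 43.52),
]

for dla, cluster, power_w, latency_ms in cases:
    energy_mj = power_w * latency_ms      # W x ms = mJ
    print(f"{dla} on {cluster}: ~{energy_mj:.1f} mJ per inference")
```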
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
