1. Introduction
FPGAs are widely used in high-performance computing and have developed rapidly, owing to their parallel execution, high computational performance, low power consumption, and shorter development cycle compared with Application-Specific Integrated Circuits (ASICs). However, as the scale of systems-on-chip increases, the design process becomes more and more complex, and the drawbacks of traditional Register Transfer Level (RTL) approaches become prominent. Moreover, FPGA development mainly relies on hardware description languages (HDLs), which are difficult to use, have few practitioners, lead to long development cycles, and are not conducive to rapid product updates.
Many designers still manually rewrite their sequential algorithms in HDL. To increase productivity and open FPGAs to a wider user community, new design methodologies based on high-level design abstraction have emerged in recent years, including FPGA high-level synthesis (HLS) [1].
HLS has advantages in the following aspects: (1) It offers a higher level of abstraction than HDL. (2) The C++ language is more familiar to software developers. (3) Designers can incrementally optimize the code by gradually replacing loop structures with functional patterns without compromising the portability of the code. HLS also has obvious drawbacks: designers remain exposed to various aspects of hardware design, development cycles are still time-consuming, and the quality of results of HLS tools is far behind that of RTL flows. For instance, loop unrolling is a standard HLS optimization, but loop code segments cannot always be fully unrolled and executed concurrently because of access conflicts and data dependencies. It is therefore not easy to implement a full pipeline with a minimal initiation interval (II).
Other HLS-related Domain-Specific Languages (DSLs) include Chisel [2] and SpinalHDL [3]; each is a language designed for a specific domain. Because the domain is restricted, the problem to be solved is delimited, so the language does not need to be complex to be precisely expressive. These languages are also usually easy to learn and use. However, because they are domain languages, they are far less general than C++; they are only used to generate Verilog code, and their optimization under timing constraints falls far short of that of Vivado HLS.
In this paper, we propose a high-performance computing pattern that uses a template-based hardware generation strategy and extends the C++-based Xilinx Vivado HLS tools. We design computational patterns based on the MapReduce model that can be rapidly adapted to high-performance parallel pipelined algorithms on FPGA. The patterns are implemented as C++ templates and expressed in a functional programming style. HLS-specific structural optimizations (specific pragmas) are added to the templates, and the details of the internal structure are parametrized so that the user can adjust the parameters to reach different parallelization and pipeline levels and finally obtain an efficient pipeline structure (II = 1) regardless of the computational operators.
Templating computations have the following benefits: (1) Parametric C++ templates provide enough flexibility to exploit the structural properties. (2) Independent of the analysis capabilities of HLS tools, template-based constructs allow us to manually extend the implementation to the resource and data bandwidth constraints of the target device, thus improving coding efficiency. (3) Predefined templates integrated with specific computational instances can be quickly adapted to the application.
We evaluate our work on two algorithms: the vector distances algorithm and the Quantum-behaved Particle Swarm Optimization (QPSO) algorithm. The vector distances algorithm is built from multiplications and additions, which are frequent operations in deep learning computations. We design experiments with different degrees of concurrency and pipeline levels for the relevant computations, analyze the differences in resource utilization, and explore how different parameter types affect the pipeline hierarchy. The goal of QPSO is to find the optimal solution for all particles in a multidimensional hypervolume. All particles in the space are first assigned an initial random position and an initial random velocity. The position of each particle is then advanced in turn according to its velocity, the known optimal global position in the problem space, and the known optimal position of the particle. As the computation progresses, the particles cluster around one or more optimal positions by exploring and exploiting the known favorable positions in the search space. The key idea of the algorithm is that it preserves information about both the optimal global position and the particle’s known optimal position [4,5]. Li et al. proposed a framework to accelerate the QPSO algorithm on FPGA [6]. This framework reschedules the dataflow of QPSO to decrease the data transmission between different memory hierarchies, which improves the overall throughput. Because of the distributed memory architecture and the customized deep variable pipeline of the FPGA, this framework achieves better throughput. In this work, we further raise the abstraction level of the QPSO framework on FPGA by analyzing the algorithm structure, simplifying the coding process with functions from our library, and defining more parametric interfaces. The results show that QPSO on a Xilinx Kintex Ultrascale xcku040 achieves a speedup of up to 123 times over an Intel Core i7-6700 CPU. Taking this as an example, our proposed library greatly reduces the coding difficulty, improves the efficiency of FPGA programming, and simplifies the process of algorithm reproduction while ensuring the high performance of the algorithm implementation.
The results show that our templates, which contain predefined, schema-specific control logic, greatly enhance programming flexibility, computational efficiency, and code reusability, and that for complex computational tasks they can be adapted to more resource-efficient computational models with constant time complexity.
The contributions of this paper are as follows:
- (1) Several standard functional operators suitable for hardware parallel computing are defined.
- (2) A functional concurrent programming paradigm is abstracted based on C++ templates and Xilinx HLS.
- (3) The efficiency of this programming paradigm is verified with two algorithms of different complexity.
2. Related Work
HLS has existed for many years and is very active in academia because it shortens development and verification time. Many HLS-related DSLs have been proposed to bring functional programming to hardware design, including Lift [7], Chisel, Bluespec [8], SpinalHDL, Lava [9], and CλaSH [10]. They raise the abstraction level of hardware code and offer the following advantages: automatic bit-width inference (even across module boundaries), error checking, parameterization, and a large number of basic components and reusable Intellectual Property (IP) cores. Although these approaches exploit features of modern programming languages, they are essentially HDLs: their constructs correspond directly to real circuits, and designs are mapped to hardware without much compiler optimization. In high-level synthesis, however, timing optimizations are crucial for achieving high-performance circuits, and with these tools timing constraints must be added manually at integration time. For algorithms whose best-performing structure has to be explored, the lack of timing optimization further increases the workload and requires the programmer to understand hardware design concepts, which raises the threshold for software engineers to use FPGAs. In contrast, Xilinx HLS accepts designs in high-level languages (e.g., C, C++, and SystemC) and generates synthesizable, cycle-accurate RTL through code transformations and synthesis optimizations. In standard, statically scheduled HLS, such optimizations are typically performed in conjunction with modulo scheduling [11,12,13]: the aim is to create pipelines with the best possible loop initiation intervals under the given clock and resource constraints. Vivado HLS [14] estimates timing and area based on built-in libraries for each FPGA. When logic synthesis later compiles the RTL into a gate-level implementation, places the gates in the FPGA, and routes the interconnections between them, it may make additional optimizations that change the Vivado HLS estimates.
The latest generation of HLS technology focuses on C/C++, captures design intent, provides a common hardware-software model, enables joint design and joint verification, and has achieved considerable success [15,16,17]. HLS has many advantages: FPGA designs are completed at a higher level of abstraction, less hardware knowledge is required, the design space can be explored faster with only minor modifications to the program, and verification and debugging are richer and more convenient. In addition, FPGAs have a wide range of applications in the implementation and acceleration of various algorithms. Because these algorithms change frequently and must be easy to iterate, debug, and maintain, developing them in RTL is very difficult and slows project progress. How to use high-level languages such as C and C++ for FPGA development has therefore become a hot topic in EDA software [18].
Dataflow circuits are fundamentally different: their schedules are not predetermined at compile time but devised as the circuit runs. Lana [19,20] investigates how to create timing-efficient, high-throughput pipelines; their mixed-integer linear programming (MILP) model is based on the theory of marked graphs and allows for resource-optimal buffer placement and sizing, with the purpose of maximizing throughput at the desired clock frequency. However, this work optimizes the computational model at a theoretical level without abstracting a generalized computational template from it, so it still requires a complete understanding of the circuit structure and does not improve the user’s coding efficiency.
Current HLS tools [21] do not always produce the most performance-optimized implementations: no matter how well optimized the HLS hardware is, it generally does not perform as well as a hand-written HDL implementation. As reported in previous work [22], HLS generates efficient hardware when the input code is written in a specific coding style (with added pragmas), which we call refactored code. Creating optimized hardware with HLS therefore still requires deep knowledge of the underlying hardware architecture and of how to use HLS tools effectively. Nevertheless, using HLS instead of HDL keeps the entire application in a high-level language, which has benefits: simulation is generally faster and debugging is less difficult.
Many algorithms have been implemented with HLS. One example is the Genetic Algorithm (GA) [23], one of the most popular evolutionary search algorithms, which simulates the natural selection of genetic evolution to search for solutions to arbitrary engineering problems. However, it is computationally intensive, which becomes a limiting factor when evolving solutions to most real-life problems because a large number of parameters need to be determined. For neural networks, Zhang et al. proposed Caffeine [24], a hardware/software co-designed library to accelerate convolutional neural networks on FPGA, which accelerates convolutional layers and fully connected layers with a uniform representation. The authors of [25] proposed DeepBurning, an automation tool to generate FPGA-based accelerators for NN models. DeepBurning compiles DNNs described in a Caffe-like script and generates the corresponding RTL-level accelerator under user-specified constraints. Cong et al. [26] proposed an automated framework for mapping deep neural networks onto FPGA with RTL-HLS hybrid templates, which takes symbolic descriptions (in TensorFlow) of DNNs as input and outputs implementations of the corresponding FPGA-based accelerators for model inference. They implement accelerators with RTL-HLS hybrid templates and convert model inference into general-purpose computations such as matrix multiplication. Several optimization kernels are developed and invoked to ensure the functionality, performance, and energy efficiency of the accelerator.
Haggai et al. [27] proposed an HLS methodology and design patterns that enable code reuse. They evaluated their proposal by implementing two networking applications, a key-value store cache and a UDP-based firewall for FPGA-based SmartNICs, and showed that their methodology can simplify the implementation of high-performance networking applications using HLS. However, their library (ntl) only achieves a pipelining level of II = 3 and, apart from a reduction in code size, shows no performance improvement over their experimental control group, whereas our library operators easily achieve II = 1 while maintaining coding efficiency.
3. Functional Operators in Our Model
Functional languages are a much more natural fit for high-level hardware generation: they have few or no side effects and naturally express a dataflow representation of an application, which can be mapped directly to hardware pipelines. The core idea of functional programming is the pure function: a function uses, without modifying, only the actual parameters passed to it to compute its result. If a pure function is called several times with the same actual parameters, it gives the same result each time and leaves no traces (no side effects). Pure functions therefore cannot modify the state of the program, read from the standard input, or write to the standard output. In a functional language, all functions are referentially transparent, so even complex functions can be evaluated in parallel. As a result, functional expressions can be represented directly as dataflow graphs, and the dataflow can be mapped directly to hardware functions.
The difficulty with functional programming lies in the granularity of the functional units, which mainly comprise arithmetic functions and storage functions. First, the arithmetic and data storage parts of the algorithm have to be abstracted and decomposed into independent sub-functional units with the same arithmetic function. This process involves algorithm transformation and possibly even a redesign of the algorithm. Second, the granularity of the functional units has to be weighed carefully. Too coarse a granularity leads to complex state machines, redundant circuits, and reduced chip utilization; too fine a granularity may increase the burden on the I/O interface. The granularity is therefore chosen on the principle of minimizing unnecessary generality [28]. Finally, because computing and memory functions require resources in different ratios, the utilization of each resource on the FPGA chip should also be taken into account when designing functional units, so that at least one resource is fully utilized and performance improves at the scale of the whole algorithm.
In this paper, C++ templates are used to implement several operators that are typical of the functional style, taking into account the specific requirements of concurrent computing in hardware. C++ templates are instantiated at compile time as classes and functions determined by the template parameters, so they essentially act as a code generation mechanism.
3.1. TreeOP
For dealing with the problem of concurrent computational partitioning of large arrays, we propose a TreeOP template that iteratively expands the code using a binary tree based on the input array.
As shown in Algorithm 1, at each iteration, code lines 4 and 5 divide the array into left and right subtrees, each about half the length of the original array, and pass each subtree and its length to TreeOp for the next iteration; code line 6 combines the two results with pairOp. The recursion ends when the subtree length is 1, which is handled by the specialization at code line 9. After division, the code is fully expanded and each leaf node is entered as a single operand for subsequent calculations. The fully expanded state essentially generates the corresponding code, and the leaf nodes after partitioning can continue to be the input of subsequent operators.
Algorithm 1 Implementation of TreeOP.
1: template<typename Result, typename Item, Result (*pairOp)(const Item&, const Item&), int num, int idx = 0>
2: class TreeOp { public:
3:     static inline Result tree(const Item numbers[num]) {
4:         Result t1 = TreeOp<Result, Item, pairOp, num/2, idx>::tree(numbers);               // left subtree
5:         Result t2 = TreeOp<Result, Item, pairOp, num - num/2, idx + num/2>::tree(numbers); // right subtree
6:         return pairOp(t1, t2);                                                             // combine both halves
7:     }
8: };
9: template<typename Result, typename Item, Result (*pairOp)(const Item&, const Item&), int idx> class TreeOp<Result, Item, pairOp, 1, idx> { public:
10:     static inline Result tree(const Item numbers[]) {
11:         return numbers[idx];   // leaf: a subtree of length 1 is the element itself
12:     }
13: };
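To make the compile-time expansion concrete, the following is a minimal, self-contained sketch of the same recursion in standard C++, without HLS pragmas. It assumes that the leaf simply returns the element at index idx; the operators addPair and maxPair and the helpers sum8 and max8 are hypothetical names introduced only for this illustration and are not part of the library described here.

```cpp
// Hypothetical pair operators, used only for this illustration.
inline int addPair(const int& a, const int& b) { return a + b; }
inline int maxPair(const int& a, const int& b) { return a > b ? a : b; }

// Primary template: split the range [idx, idx + num) into two halves,
// reduce each half recursively, then combine the two partial results.
template<typename Result, typename Item,
         Result (*pairOp)(const Item&, const Item&), int num, int idx = 0>
struct TreeOp {
    static_assert(num >= 1, "TreeOp needs at least one element");
    static inline Result tree(const Item numbers[]) {
        Result left  = TreeOp<Result, Item, pairOp, num / 2, idx>::tree(numbers);
        Result right = TreeOp<Result, Item, pairOp, num - num / 2,
                              idx + num / 2>::tree(numbers);
        return pairOp(left, right);
    }
};

// Leaf (assumption of this sketch): a subtree of length 1 is the element itself.
template<typename Result, typename Item,
         Result (*pairOp)(const Item&, const Item&), int idx>
struct TreeOp<Result, Item, pairOp, 1, idx> {
    static inline Result tree(const Item numbers[]) { return numbers[idx]; }
};

// Both helpers expand at compile time into a balanced tree of seven pairOp calls.
inline int sum8(const int (&v)[8]) { return TreeOp<int, int, addPair, 8>::tree(v); }
inline int max8(const int (&v)[8]) { return TreeOp<int, int, maxPair, 8>::tree(v); }
```

For the values 1 to 8, sum8 returns 36 and max8 returns 8. Because the recursion is resolved during template instantiation, no loop remains in the generated code, and the length does not have to be a power of two.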
3.2. MapOP
A map function applies a specified operation to each element of a conceptually organized list of independent elements (e.g., a list of test scores). All iterations are independent, and when the number of iterations is known in advance, each calculation depends only on the index value. Map transforms collections by functions: it applies a function (hereafter referred to as an operator function) to all elements of the collection in parallel. Each operator function accesses a separate data element, and each parallel application of the operator function is called an instance. Map can execute all instances in any order without side effects. Because of this independence, the instances can run simultaneously, which achieves maximum parallelism.
The Map function we implement on HLS has the form map<DataType, Function>(), where DataType is a custom data type that can be defined as int, double, float, etc., and Function is a highly concurrent function that can be customized by the user. The code is shown in Algorithm 2.
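Algorithm 2 Implementation of MapOP.
template<typename Item, typename Result, Result (*binOp)(const Item&), typename UpStream>
class MapOp {
private:
    UpStream& up;   // upstream stage that supplies the input items
public:
    MapOp(UpStream& up) : up(up) { }
    // Chain a further map stage (mapOp) on top of this one.
    template<typename MapResult, MapResult (*mapOp)(const Result&)>
    MapOp<Result, MapResult, mapOp, MapOp<Item, Result, binOp, UpStream> >
    map() {
        return MapOp<Result, MapResult, mapOp,
                     MapOp<Item, Result, binOp, UpStream> >(*this);
    }
    // Apply binOp to the next upstream item.
    Result get() {
        return binOp(up.get());
    }
    // Apply binOp to the upstream item at position idx (NumType is the index type).
    Result get(NumType idx) {
        return binOp(up.get(idx));
    }
};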
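The chaining of map stages can be tried out in plain C++ with the sketch below. It is a simplified model under stated assumptions: ArraySource, square, halve, and squareThenHalve are hypothetical names introduced for the example, the index type is std::size_t, and each stage stores its upstream by value (rather than by reference) so that temporaries can be chained safely.

```cpp
#include <cstddef>

// Hypothetical source stream for this illustration: wraps a plain array.
template<typename Item>
class ArraySource {
    const Item* data;
public:
    explicit ArraySource(const Item* d) : data(d) {}
    Item get(std::size_t idx) const { return data[idx]; }
};

// A map stage: applies binOp to whatever the upstream stage produces.
template<typename Item, typename Result,
         Result (*binOp)(const Item&), typename UpStream>
class MapOp {
    UpStream up;  // stored by value in this sketch
public:
    explicit MapOp(const UpStream& u) : up(u) {}

    // Chain a further map stage on top of this one.
    template<typename MapResult, MapResult (*mapOp)(const Result&)>
    MapOp<Result, MapResult, mapOp, MapOp> map() const {
        return MapOp<Result, MapResult, mapOp, MapOp>(*this);
    }

    // Each element depends only on its own index, so all calls are independent.
    Result get(std::size_t idx) const { return binOp(up.get(idx)); }
};

// Hypothetical operator functions without side effects.
inline int square(const int& x) { return x * x; }
inline double halve(const int& x) { return x / 2.0; }

// Element idx of the two-stage pipeline: v[idx] * v[idx] / 2.
inline double squareThenHalve(const int* v, std::size_t idx) {
    ArraySource<int> src(v);
    MapOp<int, int, square, ArraySource<int> > squared(src);
    return squared.map<double, halve>().get(idx);
}
```

Nothing is computed until get is called, so a chain of map stages describes a dataflow graph whose nodes are the operator functions; this is the property that lets an HLS tool turn each stage into a pipeline stage.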
3.3. ZipwithOP
Zipwith operates on two input sets: a binary element function produces one new result from each pair of corresponding inputs, creating a new structure. As the lambda functions in Map and Zipwith have no side effects, individual function calls for different input elements are independent of each other and can be executed in parallel. Calculations in computational models such as Map, Zip, and Reduce can thus operate on multi-element data structures without side effects, taking full advantage of the available parallelism.
The Zipwith function we implement on HLS has the form zipWith(UpStream), where UpStream is another set of inputs. The code is shown in Algorithm 3.
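Algorithm 3 Implementation of ZipwithOP.
template<typename UpStream1, typename UpStream2>
class ZipOp {
private:
    UpStream1& up1;
    UpStream2& up2;
public:
    ZipOp(UpStream1& up1, UpStream2& up2) : up1(up1), up2(up2) { }
    // Pair the next items of both upstreams.
    ZipItem get() {
        return std::make_pair(up1.get(), up2.get());
    }
    // Pair the items at position idx of both upstreams.
    ZipItem get(NumType idx) {
        return std::make_pair(up1.get(idx), up2.get(idx));
    }
};
// zipWith, a member of the source stream ATStream<Item, data>,
// wraps the stream and a second upstream in a ZipOp.
template<typename UpStream2>
ZipOp<ATStream<Item, data>, UpStream2>
zipWith(UpStream2& up2) {
    return ZipOp<ATStream<Item, data>, UpStream2>(*this, up2);
}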
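A minimal sketch of the pairing step in standard C++ is given below. It is not the library code: ArraySource, the free function zipWith, and diffAt are hypothetical names chosen for the example, the index type is assumed to be std::size_t, and the upstreams are stored by value for simplicity.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical source stream for this illustration: wraps a plain array.
template<typename Item>
class ArraySource {
    const Item* data;
public:
    typedef Item ItemType;
    explicit ArraySource(const Item* d) : data(d) {}
    Item get(std::size_t idx) const { return data[idx]; }
};

// Zip stage: element idx is the pair (up1[idx], up2[idx]).
template<typename UpStream1, typename UpStream2>
class ZipOp {
    UpStream1 up1;  // stored by value in this sketch
    UpStream2 up2;
public:
    typedef std::pair<typename UpStream1::ItemType,
                      typename UpStream2::ItemType> ItemType;
    ZipOp(const UpStream1& a, const UpStream2& b) : up1(a), up2(b) {}
    ItemType get(std::size_t idx) const {
        return std::make_pair(up1.get(idx), up2.get(idx));
    }
};

// Free-function form of zipWith, so the stream types are deduced.
template<typename U1, typename U2>
ZipOp<U1, U2> zipWith(const U1& a, const U2& b) {
    return ZipOp<U1, U2>(a, b);
}

// Binary function applied to one zipped pair: a[idx] - b[idx].
inline int diffAt(const int* a, const int* b, std::size_t idx) {
    std::pair<int, int> p =
        zipWith(ArraySource<int>(a), ArraySource<int>(b)).get(idx);
    return p.first - p.second;
}
```

Since every pair is built from one index only, the calls for different indices share no state and could all be evaluated at once.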
3.4. ReduceOP
Reduce merges the elements of a list into a smaller set of values. Usually, each Reduce produces zero or one output value. The intermediate values are provided to the user’s reduce function via an iterator. Although not as parallel as the Map function, the Reduce function is useful in highly parallel environments because it always yields a simple result, is relatively independent of large-scale operations, has no data dependencies, and obeys the commutative law.
The Reduce function we implement on HLS has the form reduce<DataType, Function, TotalData, CONC, PipeStep>(), where DataType and Function are the same as above, TotalData is the total amount of data to be computed, and CONC is the user-chosen degree of concurrency. PipeStep addresses computations that take too long: it adds pipelining levels to break the data dependencies that would otherwise limit concurrency, as described in detail later in StreamReduce. The code is shown in Algorithm 4.
Algorithm 4 Implementation of ReduceOP.
template<typename Item, typename Result, Result (*pairOp)(const Item&, const Item&), int total, int parallel, int pipestep, typename UpStream>
class ReduceOp {
private:
    UpStream& up;
public:
    typedef Result ItemType;
    ReduceOp(UpStream& up) : up(up) { }
    Item get() {
        const int round = total / parallel;   // number of groups of `parallel` inputs
        Item roundReduce[pipestep];           // one partial result per pipeline step
#pragma HLS RESOURCE variable=roundReduce core=RAM_S2P_LUTRAM
        ReduceFor:
        for (int r = 0; r < round; ++r) {
#pragma HLS PIPELINE
            Item pvalue[parallel];
            for (int i = 0; i < parallel; ++i) {
                pvalue[i] = up.get(r * parallel + i);
            }
            // Reduce the current group with the fully expanded tree.
            Result reduceTmp = TreeOp<Result, Item, pairOp, parallel>::tree(pvalue);
            if (r < pipestep)
                roundReduce[r % pipestep] = reduceTmp;   // first visit: initialize the slot
            else
                roundReduce[r % pipestep] = pairOp(reduceTmp, roundReduce[r % pipestep]);
        }
    }
4. GroupPipeReduce Model
4.1. GroupReduce
In the traditional sequential execution of Reduce, the code is not optimized and the HLS tool evaluates the expanded operations one after another, one per clock cycle. For example, eight inputs require seven calculations, and the time complexity is O(N). TreeReduce is the fully concurrent version of Reduce: it accepts a set of elements, merges each even-indexed element of the set with the following odd-indexed element, and repeats the calculation on the results. The time complexity of the operator is O(log₂ N). We use C++ templates to implement TreeReduce. Instead of using the usual UNROLLFOR method, we use TreeOP to divide and expand the data for a set of inputs. Assume that the data coming in from the upstream is at level i. The elements Array[2n] and Array[2n + 1] at this level are combined by the user-required reduce operation, and the result becomes an input at level i + 1. The reduce calculation is complete when a level outputs a single result. The level-to-level computation is fully pipelined, with one result guaranteed per clock cycle.
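The level-by-level merging can be checked with a small software model. The sketch below is only an illustration in standard C++ (the names TreeReduceResult and treeReduceSum are hypothetical): it sums a vector by pairing element 2n with element 2n + 1 at every level and counts both the number of pair operations and the number of levels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of the model: the reduced value plus the cost figures discussed above.
struct TreeReduceResult {
    int value;       // final reduced value
    int levels;      // tree depth (steps when each level runs concurrently)
    int operations;  // total number of pair operations
};

// Merge element 2n with element 2n + 1 at every level until one value remains.
inline TreeReduceResult treeReduceSum(std::vector<int> layer) {
    assert(!layer.empty());
    TreeReduceResult r = {0, 0, 0};
    while (layer.size() > 1) {
        std::vector<int> next;
        for (std::size_t n = 0; 2 * n + 1 < layer.size(); ++n) {
            next.push_back(layer[2 * n] + layer[2 * n + 1]);
            ++r.operations;
        }
        if (layer.size() % 2 == 1) {
            next.push_back(layer.back());  // an unpaired element passes through
        }
        layer.swap(next);
        ++r.levels;
    }
    r.value = layer[0];
    return r;
}
```

For the eight inputs 1 to 8 the model reports seven operations, the same work as the sequential version, but only three levels, which matches the logarithmic depth of the tree.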
4.2. PipeReduce
For high-precision floating-point calculations, the growing complexity of the operands may cause a computation unit to take more than one clock cycle, so we designed an alternative computation model, PipeReduce, for computations that cannot be completed in a single clock cycle. The model implements stepwise pipelining: it divides the pipeline into levels according to the number of clock cycles the operator needs, which breaks the data dependency on the critical path that would otherwise prevent concurrency and improves computational efficiency. In traditional HDL development, by contrast, each pipeline level needs its own state machine, and changing the number of levels means rewriting a lot of state-machine code. Our templates are designed for this case: the pipeline hierarchy can be changed simply by changing the PipeStep parameter. In addition, because HLS works at a high level, the handling of different degrees of concurrency and different pipeline levels is generated automatically without manual adjustment.
As shown in Figure 1, the left half of the computational model is GroupReduce, whose output results serve as the inputs to PipeReduce. We can set the degree of concurrency (CONC) to M (a set of M BRAM inputs). Assuming that the calculation requires N clock cycles to complete, we set the number of PipeSteps to N. Each input is then handled in two steps:
- (1) The input data is reduced with the data in the current PipeStep, and the result is stored in the current PipeStep.
- (2) The data input in the next cycle is reduced with the data in the next PipeStep, and the result is stored in that PipeStep.
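The rotation over PipeSteps described in the two steps above can be modelled in software as follows. This is a sketch under assumptions, not the library code: the function name pipeReduceSum is hypothetical, the reduce operator is fixed to addition, and the final merge of the partial results is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Software model of the PipeStep rotation: input r is merged into slot
// r % pipeStep, so two consecutive inputs never update the same partial
// result and an operator with a latency of pipeStep cycles can still
// accept one new input per cycle.
inline double pipeReduceSum(const std::vector<double>& in, std::size_t pipeStep) {
    assert(pipeStep > 0);
    std::vector<double> slot(pipeStep, 0.0);
    for (std::size_t r = 0; r < in.size(); ++r) {
        if (r < pipeStep) {
            slot[r % pipeStep] = in[r];                       // first visit: initialize
        } else {
            slot[r % pipeStep] = slot[r % pipeStep] + in[r];  // later visits: accumulate
        }
    }
    // Assumed final step: merge the partial results of all slots.
    double total = 0.0;
    for (std::size_t s = 0; s < pipeStep; ++s) {
        total += slot[s];
    }
    return total;
}
```

With pipeStep set to 1 the model degenerates to an ordinary sequential accumulation; larger values trade a few extra registers for a longer distance between dependent updates.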
6. Discussion
In this paper, we propose a high-performance computing pattern that uses a template-based hardware generation strategy and extends the C++-based Xilinx Vivado HLS tools, so that optimized algorithms can be implemented quickly on FPGAs by specifying a few simple parameters. The four operators are TreeOP, MapOP, ZipwithOP, and ReduceOP. TreeOP solves the problem of partitioning large arrays for concurrent computation; implementing this function directly in HLS would require defining several similar functions by hand, which is repetitive and troublesome. MapOP takes customized functions as parameters for the concurrent execution of algorithms with no data dependencies. ZipwithOP has two input sets, and its element function outputs a new result from each pair of corresponding inputs. ReduceOP merges the elements of a list into a smaller set of values.
In addition to the fully concurrent execution of the Reduce function in a binary tree structure, we made a further change: to handle cases where a calculation takes longer than one clock cycle, we proposed a long pipeline structure. The advantage of increasing the pipeline level instead of using full concurrency is that it raises the computation rate: a new set of data is processed at the beginning of each clock cycle, and the results of the previous computations are stored in a buffer. This structure greatly improves resource utilization and reuse.
Using the above computational model, we designed two experiments: (1) the sum of squared vector distances and (2) the QPSO algorithm. We then show the code implementation and resource utilization of the former and a performance comparison for the latter. The results show that our computational model guarantees high performance while improving coding efficiency.
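As a software reference for the first experiment, the sum of squared vector distances can be written as a zip of the two vectors, a map of each pair to its squared difference, and a reduce by addition. The sketch below expresses this with std::inner_product from the C++ standard library rather than with the templates of this paper; squaredDiff and sumSquaredDistance are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Map step applied to one zipped pair: (x - y)^2.
inline double squaredDiff(double x, double y) { return (x - y) * (x - y); }

// Zip the two vectors element by element, map each pair with squaredDiff,
// and reduce the mapped values with addition.
inline double sumSquaredDistance(const std::vector<double>& a,
                                 const std::vector<double>& b) {
    assert(a.size() == b.size());
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0,
                              std::plus<double>(), squaredDiff);
}
```

A result computed this way on the CPU can serve as a golden value when checking a hardware implementation of the same computation.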