Article

A Parameterized Parallel Design Approach to Efficient Mapping of CNNs onto FPGA

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100094, China
3 School of Integrated Circuits, University of Chinese Academy of Sciences, Beijing 100049, China
4 Shandong Industrial Institute of Integrated Circuits Technology Ltd., Jinan 250001, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(5), 1106; https://doi.org/10.3390/electronics12051106
Submission received: 6 February 2023 / Revised: 20 February 2023 / Accepted: 22 February 2023 / Published: 23 February 2023
(This article belongs to the Special Issue FPGAs Based Hardware Design)

Abstract

In recent years, Convolutional Neural Networks (CNNs) have been widely applied in artificial intelligence (AI) systems such as computer vision. Among existing hardware accelerators, the FPGA is regarded as a suitable platform for implementing CNNs because of its high energy efficiency and flexible reconfigurability. In this paper, a parameterized design approach is proposed to explore the maximum parallelism that can be implemented when mapping a CNN algorithm onto the targeted FPGA resources. Four types of parallelism are employed in our parameterized design to fully exploit the processing resources available in the FPGA. Meanwhile, a hardware library consisting of a set of modules is established to accommodate various CNN models. Further, an algorithm is proposed to find the optimal level of parallelism for a constrained amount of resources. As a case study, the typical LeNet-5 is implemented on a Xilinx Zynq7020. Compared with existing works using the high-level synthesis design flow, our design achieves higher FPS and lower latency while using fewer LUTs and FFs.

1. Introduction

Convolutional neural networks (CNNs) have been widely used in tasks such as image recognition [1] due to their excellent performance. However, CNNs are computationally intensive, which poses challenges for hardware implementation. CPUs and GPUs are the most common hardware platforms for running CNNs. The computing performance of a CPU is usually limited, so it cannot cope with computation-intensive tasks. GPUs usually require large batches to achieve high utilization; in practical CNN deployment scenarios, small batches typically lead to low GPU utilization, so GPUs are better suited to training than to inference. Owing to these shortcomings of CPUs and GPUs, energy-efficient platforms such as Field Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs) are used as accelerators for the forward inference of CNNs. An ASIC offers high energy efficiency, but it is inflexible and costly, and a particular ASIC chip may not keep up with the rapid evolution of CNN algorithms.
The FPGA achieves a good balance between energy efficiency, performance, reconfigurability, and cost, and has therefore received increasing attention [2,3]. Many researchers have designed FPGA-based CNN accelerators [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]. Because CNN parameters are so numerous that data must frequently be fetched from off-chip storage, and because the computational load is large, various optimization methods are applied when accelerating a CNN on an FPGA. Several works use low bit-width data in the design [4,5]. In [4], mixed-precision data of 1-8 bits improves data throughput and image processing speed. In [5], the weights are quantized to 1-bit width, which reduces both the amount of computation and the demand for storage bandwidth. Some works optimize computation and memory access by exploiting sparsity in matrix multiplication [7,19]. The work in [7] reduces the amount of computation by packing sparse matrices and proposes an efficient memory access module for sparse matrices in order to utilize the available bandwidth. Other works use separable convolutions to reduce the matrix multiplication workload [9,12]. For the convolution operations that dominate a CNN, some works apply fast matrix multiplication algorithms [6,15]. For example, in [6], a unified hardware structure is used to compute both general matrix multiplication and the Winograd fast algorithm; Winograd can effectively accelerate convolutional layers and fully connected layers.
We propose an architecture model that exploits various sources of parallelism in a CNN to improve performance and resource utilization. The design consists of modules with variable parameters at both the algorithm level and the hardware level, so that a variety of CNN models can be mapped onto different FPGA devices. The key contributions of this work are summarized as follows.
  • We parameterize the CNN design architecture at both the algorithm and hardware levels. Furthermore, four types of computational parallelism existing in CNNs are characterized to find an optimal relationship among those parameters for performance- and energy-efficient implementations.
  • We construct a hardware module library having various fundamental functions necessary for building up a complete CNN.
  • We devise a search algorithm to attain the maximized parallel computation in FPGA design.
  • As a case study, we implement the typical LeNet-5 on a Xilinx Zynq7020. The results show an improvement in FPS and a reduction in single-image latency compared to previous works using high-level synthesis.
The rest of the paper is organized as follows. Section 2 gives an introduction to the background knowledge of convolutional neural networks. Section 3 gives the implementation details of the proposed architectures. Section 4 presents the devised algorithms for exploring optimal parallelism under the constraint of limited resources on FPGA. Section 5 gives an analysis of the experimental results. Section 6 concludes the paper.

2. Background

A typical CNN is composed of several layers: the convolutional layer, the pooling layer, and the fully-connected layer. The convolutional layer is devoted to core computations. The operation of a convolution layer can be expressed as:
$O_{n,x,y} = f\left( \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} I_{m,\,xS+i,\,yS+j} \cdot W_{m,n,i,j} + \beta_n \right)$    (1)
where $O_{n,x,y}$ is the output neuron at position $(x, y)$ of the $n$th output feature map, $K$ is the kernel size, $S$ is the kernel stride, $W_{m,n}$ is the kernel between the $m$th input feature map and the $n$th output feature map, $\beta_n$ is the bias of the $n$th output feature map, and $f$ represents the nonlinear activation function.
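As a concrete illustration of Equation (1), the sketch below is a plain Python reference model of one convolution layer. It also accumulates over all M input feature maps, an accumulation that Equation (1) leaves implicit and that Section 3 realizes with the adder tree across Convolution Blocks; the function and argument names are illustrative only.

```python
def conv_layer(I, W, beta, S, f=lambda v: max(v, 0.0)):
    """Reference model of Equation (1): I[m][x][y] input maps,
    W[m][n][i][j] K x K kernels, beta[n] biases, S stride, f activation."""
    M, N, K = len(I), len(beta), len(W[0][0])
    OX = (len(I[0]) - K) // S + 1
    OY = (len(I[0][0]) - K) // S + 1
    O = [[[0.0] * OY for _ in range(OX)] for _ in range(N)]
    for n in range(N):                      # every output feature map
        for x in range(OX):
            for y in range(OY):
                acc = beta[n]               # bias folded into the accumulation
                for m in range(M):          # sum over the input feature maps
                    for i in range(K):
                        for j in range(K):
                            acc += I[m][x * S + i][y * S + j] * W[m][n][i][j]
                O[n][x][y] = f(acc)         # nonlinear activation
    return O
```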
Common nonlinear activation functions include ReLU, sigmoid, and tanh. Because ReLU is simple to compute and alleviates the vanishing-gradient problem of sigmoid and tanh, it has become a popular activation function. The nonlinear activation function provides the nonlinearity of the entire network; without it, the network would reduce to a composition of linear models. Therefore, nonlinear activation functions are an integral part of a CNN model.
The pooling layer is used to reduce the size of the feature maps, which in turn reduces the computational load of the entire network. Average pooling and max pooling are two common pooling methods. Taking a 2 × 2 pooling window as an example, average pooling computes the mean of the four points, while max pooling takes their maximum.
A fully connected (FC) layer is normally placed at the end of a CNN, acting as a classifier that outputs the final result. A fully connected layer has connections between all input nodes and all output nodes. Unlike the shared weights of a convolutional layer, the weights of a fully connected layer are all distinct. As a result, a fully connected layer typically contains more weights than a convolutional layer but requires less computation.

3. Hardware Architecture

3.1. Overview of Accelerator Architecture

In the computation of a whole convolutional neural network, since the required number of computation layers is predetermined, the latency of passing one image through all successive layers is usually hard to reduce. Fortunately, throughput can be improved by introducing a pipelined processing mechanism. If all the layers were processed concurrently, the waiting time between any two consecutive input images could be reduced to the computation time of the slowest layer alone. Note, however, that an extra image buffer would be needed to avoid collisions between two adjacent layers, where one layer reads from and the other writes to the same memory at the same time. We therefore propose a balanced strategy that computes the even-numbered layers in one phase and the odd-numbered layers in the other, as illustrated in Figure 1. Thus, no extra image buffer (memory) is needed, because neighboring layers are processed at a tick-tock pace. In a plain (non-pipelined) implementation, by contrast, the current image would start to be processed only after the processing of the previous image had been completed.
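The effect of the tick-tock schedule on throughput can be checked with a few lines of code: because a layer and its neighbor never compute in the same phase, a new image may be admitted once the slowest adjacent pair of layers has finished, which is exactly the interval later formalized in Equation (2). The helper below is a sketch; the example uses the LeNet-5 execution cycles reported later in Table 2.

```python
def image_interval(layer_cycles):
    """Inter-image interval of the tick-tock pipeline: bounded by the
    slowest pair of adjacent layers, since neighbors alternate phases and
    share a feature-map buffer."""
    return max(layer_cycles[i] + layer_cycles[i + 1]
               for i in range(len(layer_cycles) - 1))

# Execution cycles of the five LeNet-5 layers (Table 2).
cycles = [10214, 2635, 6457, 5053, 433]
print(image_interval(cycles))
# 12849 cycles: conv1 + conv2 dominate;
# 1e8 / 12849 is roughly 7782 frames per second at 100 MHz (cf. Table 4)
```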
Figure 2 gives a generic view of the computing architecture of the entire network. In our design, different layers are computed by different computing units, and the overall design adopts a pipelined structure. Synchronization between the layers is realized with handshake signals: each computation layer has a start and a done signal. When the start signal is pulled high, the layer starts to compute, and when the computation is completed, the done signal is pulled high. Both signals are generated in the control module. According to the characteristics of our hardware structure, the start signal of a layer is generated only after the done signals of its two adjacent layers have been pulled high. The handshake signals thereby guarantee functional correctness and conflict-free reading and writing. Synchronization of the input and output data is realized by FIFOs: external input data is first written into a FIFO and then into the BRAM that stores the input feature map. Whenever a datum is written into the BRAM, the counter holding the write address is incremented by one. The output data leaves the accelerator through a FIFO.

3.2. Parallel Computing Architecture of One Layer

For a K × K convolution kernel, all K × K multiplications can, in theory, be calculated concurrently. In practice, however, the degree of parallelism can be adjusted to fit the specific DSP block structure of the targeted FPGA device. Let k denote the number of multiplications executed in parallel. For example, Figure 3 shows a data fetch sequence, i.e., d1, d2, …, d9, where data marked in the same color is fetched during the same cycle. In this case, k = 2, meaning that two data items are fetched from memory at the same time to participate in the multiplication operation.
To maximize the processing parallelism, we set an objective to have an appropriate number of input as well as output feature maps within a layer computed concurrently. As shown in Figure 4, suppose N(i) and M(i) denote the total numbers of output and input feature maps, respectively, of the ith layer. Further, PN(i) and PM(i) respectively represent the number of output feature maps and the number of input feature maps bundled for parallel processing in that layer. Here, the Convolution Block (marked in blue) processes two pixels of an input feature map at a time, and the Convolution Group (marked in green) computes the output feature map pixel by pixel.
A Convolution Group consists of PM(i) Convolution Blocks. The memories that store the weights are directly connected to the Convolution Blocks, whereas the input feature maps reach the Convolution Blocks through a multiplexer. When a weight and the corresponding feature-map value are delivered to a multiplier, the two numbers are multiplied; the products are accumulated, and the accumulated results are summed by the subsequent adder tree. The results are then fed into the following activation and pooling modules and finally written into memory.
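To relate this structure to Equation (1), the sketch below models one Convolution Group producing a single output pixel: each of the PM Convolution Blocks accumulates the partial products of its own input feature map, and an adder tree then combines the PM partial sums. This is a behavioural model with illustrative names, not the RTL.

```python
def convolution_block(window, kernel, k=2):
    """One Convolution Block: multiply-accumulate a K*K window against its
    kernel, consuming k products per cycle (k = 2 as in Figure 3)."""
    acc = 0
    flat_d = [p for row in window for p in row]
    flat_w = [w for row in kernel for w in row]
    for start in range(0, len(flat_d), k):          # k multiplications per cycle
        for d, w in zip(flat_d[start:start + k], flat_w[start:start + k]):
            acc += d * w
    return acc

def convolution_group(windows, kernels):
    """One Convolution Group: PM blocks in parallel, combined by an adder tree."""
    partial = [convolution_block(win, ker) for win, ker in zip(windows, kernels)]
    return sum(partial)    # adder-tree reduction of the PM partial sums
```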

3.3. Detailed Design of Computation Unit

As an example, Figure 5 shows a Convolution Group that includes two Convolution Blocks. Several address generators are employed for the memory blocks; each generates addresses according to the sequence of operations and sends them to its memory, from which the data is read. When the data arrives, the Convolution Block performs two multiplications simultaneously, followed by an accumulation, to yield a partial result for one pixel of the output feature map.
Below, we introduce the hardware implementation of bias addition. From (1), a bias value must be added to the final result. This is accomplished by merging the bias addition into the convolution operation as a multiplication by 1. As Figure 6 suggests, the bias value is placed in the same memory as the weights (w1, w2, ..., w9), and a multiplexer selects either a pixel of the input feature map (d1, d2, ..., d9) or the constant 1 for the multiplication.
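A minimal sketch of this bias-folding trick: the bias is appended to the weight stream, and the multiplexer feeds the multiplier a constant 1 in the corresponding slot, so the same multiply-accumulate datapath absorbs the bias at no extra cost. The names below are illustrative.

```python
def window_mac_with_bias(pixels, weights, bias):
    """One K*K window as a single MAC sequence, with the bias folded in."""
    weight_stream = list(weights) + [bias]   # bias stored after w1..w9
    data_stream = list(pixels) + [1]         # mux selects the constant 1 last
    acc = 0
    for d, w in zip(data_stream, weight_stream):
        acc += d * w                         # the same multiplier handles the bias
    return acc

# 3 x 3 example: 9 * (1 * 2) + 5 = 23
print(window_mac_with_bias([1] * 9, [2] * 9, bias=5))
```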
Below, we present the hardware implementation of the activation and pooling modules. Since ReLU is one of the most widely used activation functions, it is the one deployed in our implementation. The detailed operation of the activation and pooling functions is as follows. The output neuron values are sent to the pooling module sequentially. When the first value arrives at the ReLU-and-pooling module, it is compared with zero and then written to the output memory. When the next value arrives, it is compared with the stored value, and the larger of the two is written back to memory. After repeated read/write and comparison operations, max pooling is complete. Pooling does not occupy extra time, because each output value is handled immediately as it leaves the convolution block; consequently, pooling and convolution finish almost simultaneously.
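The fused ReLU-and-max-pooling behaviour described above amounts to a running comparison against the value already held in the output memory. The sketch below is a behavioural model of one pooling window, not the RTL.

```python
def relu_maxpool_window(conv_outputs):
    """Process the convolution outputs of one pooling window (e.g., four
    values for a 2 x 2 window) in arrival order, as the module does."""
    stored = None
    for v in conv_outputs:
        v = max(v, 0)                     # ReLU: compare with zero
        if stored is None or v > stored:  # compare with the value in memory
            stored = v                    # write back the larger one
    return stored

print(relu_maxpool_window([-3, 1, 4, 2]))   # 4
```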

3.4. Storage Mode

For a given layer, the memory must hold the M input feature maps, the N output feature maps, and the weights, which are distributed over PM × PN weight memories.
Figure 7 shows the storage mode for an example with N = 4, M = 6, PM = 3, and PN = 2. Wij represents the weight kernel between the ith input feature map and the jth output feature map. Because PN Convolution Groups compute results at the same time, the number of memories storing output feature maps is PN, which means each memory stores N/PN feature maps. The storage of the input feature maps is determined by the previous layer, as shown in Figure 7. As for the weights, each Convolution Block needs a memory of its own, so PM × PN weight memories are required. The weights of all M × N convolution kernels are divided among these memories and, as Figure 7 shows, they are placed in an interleaved way. Owing to the interleaved storage, the computation units can fetch data sequentially.
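To make the interleaved placement concrete, the sketch below assigns the M × N kernels to the PM × PN weight memories so that, at every step, each Convolution Block finds its next kernel at the same sequential address in its own memory. This is one layout consistent with the description of Figure 7; the data structures are illustrative.

```python
def interleave_weights(M, N, PM, PN):
    """Map kernel Wij (ith input map, jth output map) to weight memories.

    Memory (pm, pn) receives, in order, every kernel with i % PM == pm and
    j % PN == pn, so the PM * PN Convolution Blocks all fetch sequentially."""
    memories = {(pm, pn): [] for pm in range(PM) for pn in range(PN)}
    for j in range(N):                       # output maps, PN handled at a time
        for i in range(M):                   # input maps, PM handled at a time
            memories[(i % PM, j % PN)].append(f"W{i}{j}")
    return memories

# Figure 7 example: M = 6 input maps, N = 4 output maps, PM = 3, PN = 2
for slot, kernels in interleave_weights(6, 4, 3, 2).items():
    print(slot, kernels)    # each of the 6 memories holds 4 kernels
```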

3.5. Hardware Library

For the sake of parameterizing a design, a module library is established, including the commonly used functions such as convolution, pooling, activation (ReLU in our case), the fully connected layer, and the control logic. In our experiment, the accuracy drops by only 0.26% when 32-bit floating-point data is replaced with 16-bit fixed-point data in LeNet-5. Thus, to reduce the overhead of the FPGA implementation, a 16-bit fixed-point format is applied.
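For reference, a 16-bit fixed-point conversion of the kind used here can be sketched as below. The paper does not state the exact split between integer and fractional bits, so the 8-bit fraction is an assumption for illustration.

```python
def to_fixed16(value, frac_bits=8):
    """Quantize a float to a 16-bit two's-complement fixed-point word
    (frac_bits fractional bits, saturating at the int16 range)."""
    scaled = round(value * (1 << frac_bits))
    return max(-(1 << 15), min((1 << 15) - 1, scaled))

def from_fixed16(word, frac_bits=8):
    """Interpret a 16-bit fixed-point word back as a real value."""
    return word / (1 << frac_bits)

w = 0.7321                       # example 32-bit floating-point weight
q = to_fixed16(w)
print(q, from_fixed16(q))        # 187 0.73046875 (quantization error < 0.002)
```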
Parameterization at both the algorithm level and implementation level is taken into consideration with respect to the modules just mentioned. The main design parameters are given in Table 1.

4. Optimal Parallelism Characterization

In this section, we aim to develop a methodology that maps various CNNs onto a target FPGA device in a computationally energy-efficient way. The proposed mapping architecture is characterized by pipelined data processing across the layers, yet the computation time varies from layer to layer. The time interval between two consecutive images at the input is estimated by Equation (2).
$T_{\mathrm{image}} = \max_{0 \le i < n-1} \left( t_{\mathrm{layer},i} + t_{\mathrm{layer},i+1} \right)$    (2)
where $n$ is the total number of layers and $t_{\mathrm{layer},i}$ is the computation time of the $i$th layer.
The number of multiplication operations required by the ith convolution layer is given below, where O(i) denotes the output feature-map size of that layer.
$NUM_{\mathrm{conv}}(i) = M(i) \times N(i) \times O(i) \times O(i) \times K(i) \times K(i)$    (3)
Similarly, for the number of multiplication operations required by the jth fully connected layer, we have
$NUM_{\mathrm{fc}}(j) = FI(j) \times FO(j)$    (4)
Thus, the total number of multiplication operations required by an entire CNN can be calculated using the following equation, where $num_c$ and $num_f$ are the numbers of convolution layers and fully connected layers, respectively.
$SUM_{\mathrm{ops}} = \sum_{i=0}^{num_c-1} NUM_{\mathrm{conv}}(i) + \sum_{j=0}^{num_f-1} NUM_{\mathrm{fc}}(j)$    (5)
Suppose the total number of multipliers available on an FPGA device is denoted $Total_{\mathrm{multiplier}}$. Then the number of multipliers that can be allocated to the ith convolution layer is determined by
$SubTotal_{\mathrm{multiplier}}^{c}(i) = \dfrac{NUM_{\mathrm{conv}}(i)}{SUM_{\mathrm{ops}}} \times Total_{\mathrm{multiplier}}$    (6)
The same applies to the number of multipliers allocated to the jth fully connected layer, as given in
$SubTotal_{\mathrm{multiplier}}^{f}(j) = \dfrac{NUM_{\mathrm{fc}}(j)}{SUM_{\mathrm{ops}}} \times Total_{\mathrm{multiplier}}$    (7)
The search for the optimal $P_M(i)$ or $P_N(i)$ is constrained by the following conditions.
$P_M(i) \times P_N(i) \le SubTotal_{\mathrm{multiplier}}^{c}(i)$    (8)
$P_{FO}(j) \le SubTotal_{\mathrm{multiplier}}^{f}(j)$    (9)
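Equations (3)-(9) can be read as a short resource-budgeting routine: count the multiplications of each layer, hand out the FPGA's multipliers in proportion, and use each layer's share as the upper bound on its parallelism. The sketch below assumes the share is kept as a real-valued bound, with rounding left to the fine-tuning step later in this section; the layer dimensions are those of the standard LeNet-5 used in Section 5.

```python
def conv_ops(M, N, O, K):
    """Multiplications of a convolution layer, Equation (3)."""
    return M * N * O * O * K * K

def fc_ops(FI, FO):
    """Multiplications of a fully connected layer, Equation (4)."""
    return FI * FO

def multiplier_shares(layer_ops, total_multipliers):
    """Proportional multiplier budget per layer, Equations (5)-(7).
    Each convolution layer must then satisfy PM * PN <= share (Equation (8)),
    and each fully connected layer PFO <= share (Equation (9))."""
    sum_ops = sum(layer_ops)                           # Equation (5)
    return [ops / sum_ops * total_multipliers for ops in layer_ops]

# LeNet-5: three convolution layers followed by two fully connected layers.
ops = [conv_ops(1, 6, 28, 5), conv_ops(6, 16, 10, 5), conv_ops(16, 120, 1, 5),
       fc_ops(120, 84), fc_ops(84, 10)]
print(multiplier_shares(ops, total_multipliers=220))
```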
To determine the memory volume required by a given CNN, we only need to consider the output feature maps (for a convolution layer) or the output neurons (for a fully connected layer). This is because the input feature maps or input neurons of a layer are simply the output feature maps or output neurons of its previous layer and share the same memory. Hence, the volume of each concurrently operating memory at the ith convolution layer can be estimated according to the following equation.
$MVol_{\mathrm{OFM}}^{c}(i) = F_{\mathrm{int}}\left( \dfrac{N(i) \times OX(i) \times OY(i)}{P_N(i)} \right)$    (10)
Similarly, for the jth fully connected layer, we have
$MVol_{\mathrm{NEU}}(j) = F_{\mathrm{int}}\left( \dfrac{FO(j)}{P_{FO}(j)} \right)$    (11)
In Equations (10) and (11), $F_{\mathrm{int}}$ rounds the value up to the nearest available memory size. For example, on a Stratix III device, $F_{\mathrm{int}} \in \{9n\ \mathrm{Kb}, 144n\ \mathrm{Kb}\}$, where $n = 1, 2, 3, \ldots$ In a similar way, we can calculate the memory volume required for the weights of a convolution layer and of a fully connected layer, as given below.
$MVol_{\mathrm{weight}}^{c}(i) = F_{\mathrm{int}}\left( \dfrac{M(i) \times N(i) \times K(i) \times K(i)}{P_N(i) \times P_M(i)} \right)$    (12)
$MVol_{\mathrm{weight}}^{f}(j) = F_{\mathrm{int}}\left( \dfrac{FO(j) \times FI(j)}{P_{FO}(j)} \right)$    (13)
Further, the total memory volume required to store all the feature maps, neurons, and weights can be expressed as follows.
$MV = \sum_{i=0}^{num_c-1} \left[ P_N(i) \times MVol_{\mathrm{OFM}}^{c}(i) + P_N(i) \times P_M(i) \times MVol_{\mathrm{weight}}^{c}(i) \right] + \sum_{j=0}^{num_f-1} \left[ P_{FO}(j) \times MVol_{\mathrm{NEU}}(j) + P_{FO}(j) \times MVol_{\mathrm{weight}}^{f}(j) \right]$    (14)
The throughput of the architecture can then be determined by
$Throughput = \dfrac{SUM_{\mathrm{ops}}}{T_{\mathrm{image}}}$    (15)
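Equations (10)-(13) and (15) together form a small analytical model that can be evaluated before any RTL is written. The sketch below assumes 16-bit words (a factor the equations leave to $F_{\mathrm{int}}$) and uses a 9 Kb block granularity for $F_{\mathrm{int}}$, following the Stratix III example above; all names and the rounding granularity are illustrative.

```python
import math

WORD_BITS = 16                      # fixed-point word width from Section 3.5

def f_int(values, block_kb=9):
    """Round a storage requirement (in values) up to whole memory blocks,
    as F_int does; block_kb follows the Stratix III example (9 Kb blocks)."""
    bits = values * WORD_BITS
    return math.ceil(bits / (block_kb * 1024)) * block_kb * 1024

def mvol_ofm_conv(N, OX, OY, PN):               # Equation (10)
    return f_int(N * OX * OY / PN)

def mvol_neu_fc(FO, PFO):                       # Equation (11)
    return f_int(FO / PFO)

def mvol_weight_conv(M, N, K, PM, PN):          # Equation (12)
    return f_int(M * N * K * K / (PN * PM))

def mvol_weight_fc(FI, FO, PFO):                # Equation (13)
    return f_int(FI * FO / PFO)

def throughput(sum_ops, t_image):               # Equation (15)
    """Multiplications per unit time: total work over the image interval."""
    return sum_ops / t_image
```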
On this basis, an algorithm for exploring the optimal parameters is devised.
From Equation (14), we notice that increasing the parallelism across the input feature maps (PM) increases the number of memories that store weights. Owing to the interleaved storage mode, the weight memories are usually not completely filled with data, and increasing their number aggravates this problem. We therefore increase PN first, and increase PM only once PN has been fully exploited.
After this initial allocation of resources, a fine-tuning adjustment is performed. First, we check whether the memory requirement is satisfied after the first allocation.
  • If it is not satisfied, the quickest adjacent pair of layers is selected. Since the image interval is limited by the slowest pair, the interval is not affected when we decrease the parallelism of the quicker layer of this quickest pair. The memory requirement is then re-evaluated; if it is still not satisfied, the quickest adjacent pair is re-selected and the parallelism of the quicker of the two newly selected layers is decreased. Parallelism is decreased in this way until the memory requirement is satisfied.
  • If the memory requirement is satisfied after the first allocation, the slowest adjacent pair of layers is selected, because it is the performance bottleneck of the whole network. We increase the parallelism of the slower layer of this pair; following the analysis above, PN is increased preferentially. The memory requirement is then re-evaluated; if it is still satisfied, the slowest adjacent pair is re-selected and the parallelism of the slower of the two newly selected layers is increased, after which the memory requirement is checked again. Parallelism is increased in this way until the memory requirement is no longer satisfied.
Algorithm 1 gives an overview of the parallelism exploration method.
Algorithm 1: Optimal parallelism exploration
Input: CNN model parameters, FPGA resource parameters
Output: parallelism of each layer in the CNN
1  Distribute the DSPs among the layers in proportion to their operation counts (Equations (6) and (7))
2  Detect whether the memory requirement is satisfied; MEMREQUIREMENT == 0 represents not satisfied, 1 represents satisfied
3  while MEMREQUIREMENT == 0 do
4      select the quickest adjacent pair of layers
5      decrease the parallelism of the quicker layer of the pair
6      re-evaluate MEMREQUIREMENT
7  end
8  while MEMREQUIREMENT == 1 do
9      select the slowest adjacent pair of layers
10     increase the parallelism of the slower layer of the pair (PN before PM)
11     re-evaluate MEMREQUIREMENT
12 end
13 Return the final parallelism of each layer
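For readers who prefer code, a compact sketch of Algorithm 1 is given below. It assumes a per-layer object with increase()/decrease() methods that raise or lower the parallelism one step at a time (PN before PM, as argued above), plus externally supplied layer_time and total_memory estimators built from Equations (2)-(14); rolling back the final step so that the returned configuration still fits in memory is an added assumption. This is an outline of the search loop, not the paper's code generator.

```python
def explore_parallelism(layers, fpga, layer_time, total_memory):
    """Sketch of Algorithm 1: proportional DSP allocation followed by
    memory-aware fine-tuning of the per-layer parallelism."""
    # Step 1: distribute the DSPs according to each layer's operation count.
    for layer in layers:
        layer.init_from_dsp_share(fpga.dsp_total)

    def mem_ok():
        return total_memory(layers) <= fpga.memory_total

    def pair_times():
        return [(i, layer_time(layers[i]) + layer_time(layers[i + 1]))
                for i in range(len(layers) - 1)]

    # Memory over budget: shrink the quicker layer of the quickest
    # adjacent pair, which does not lengthen the image interval.
    while not mem_ok():
        i, _ = min(pair_times(), key=lambda p: p[1])
        quicker = min(layers[i], layers[i + 1], key=layer_time)
        quicker.decrease()

    # Memory within budget: grow the slower layer of the slowest adjacent
    # pair (the bottleneck) until memory runs out or no headroom is left.
    slower = None
    while mem_ok():
        i, _ = max(pair_times(), key=lambda p: p[1])
        slower = max(layers[i], layers[i + 1], key=layer_time)
        if not slower.increase():          # increase() returns False when
            break                          # PM and PN are already maximal
    if slower is not None and not mem_ok():
        slower.decrease()                  # roll back the overshooting step

    return [(layer.PM, layer.PN) for layer in layers]
```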
Figure 8 describes the whole design flow. First, the parameters of the CNN model and of the FPGA resource model are fed into the analyzer. The iterative algorithm is then used for DSP and memory allocation; when the allocation is finished, the parallelism of each layer is determined. Finally, the chosen parallelism and the CNN model parameters are delivered to the hardware library, where the code generator joins the parameterized modules and the code of the CNN accelerator is generated.

5. Evaluation

Our design is implemented on a Xilinx Zynq 7020 FPGA, which contains a processing system (PS) and programmable logic (PL). Our hardware accelerator, written in Verilog, runs on the PL side, and the design tools are Xilinx Vivado 2020.1 and Vitis 2020.1.
We take the typical LeNet-5 [25] as a design example. LeNet-5 contains three convolutional layers, two pooling layers, and two fully connected layers. In our design, we first explore the optimal parameters according to the numbers of DSPs and memories in the FPGA chip and the parameters of the LeNet-5 network structure. All weights and feature maps of LeNet-5 can be stored in on-chip memory. The degree of parallelism obtained through exploration, the number of DSPs occupied by each network layer, and the execution cycles are shown in Table 2. In the hardware design, a convolution block consists of two DSPs; taking the second layer as an example, DSP used = 2 × PM × PN = 2 × 3 × 16 = 96.
It can be seen from the table that the main performance bottleneck of LeNet-5 is the first layer. The first convolutional layer has one input feature map and six output feature maps, which limits its parallelism to PM ≤ 1 and PN ≤ 6. The parallelism of the second layer is larger than that of the first, so its execution time is shorter. Because the convolutional layers involve far more computation than the fully connected layers, they use more DSPs; each fully connected layer uses only two DSPs to meet the performance requirement.
After synthesis and implementation in Vivado, the detailed resource occupation of the design is shown in Table 3.
The resource utilization shows that the usage of DSPs and BRAM is high, which is consistent with the fact that a convolutional neural network requires substantial computing and storage resources. The usage of LUTs and registers is low, which reflects our use of hand-optimized Verilog code.
In order to make a fair comparison, we compare our design with three works that also use the Zynq 7020 to implement LeNet-5. The comparison results are shown in Table 4.
In [26], twenty-five MAC units are used to compute a 5 × 5 convolution kernel in parallel, and techniques such as array partitioning are used to optimize memory access. In [27], the first few layers use the SLIT layer to compute more efficiently. In [28], dynamic partial reconfiguration is used to change the data bit width of different network-layer parameters while the FPGA is running.
Compared with [26], our design has a higher effective utilization of DSPs, which helps improve performance. At the same time, compared with the code generated by high-level synthesis in [26], our hand-written Verilog occupies fewer registers and LUTs, which eases place and route.
Compared with [27], in which part of the multiplication operations are replaced with binary calculations, our design occupies fewer logic resources despite using 16-bit integers. Our design achieves a latency of 248 μs per image, reducing the latency by 54.9% while performing the more complex full-width calculations.
The work in [28] uses a dynamic data bit width; when the bit width decreases, the resources used decrease accordingly. As shown in Table 4, when the data bit width of [28] is 16 bits, our design saves 68.8% of LUTs and 46.7% of registers. When the data bit width of [28] is 5 bits, our 16-bit design uses 26.5% more LUTs and 68.9% more FFs. Since our hardware design uses an efficient parallel scheme, our solution has 89.6% lower latency than [28], and its FPS is 18.5 times higher.

6. Conclusions

This work proposes an architecture model that utilizes four types of parallelism to achieve high resource utilization and performance. We construct a hardware library consisting of a set of modules for building CNNs; the modules are parameterized at both the algorithm level and the hardware implementation level. Furthermore, an iterative algorithm is proposed to explore the optimal parallelism for a constrained amount of resources. Finally, the code of a CNN accelerator can be generated automatically from the parameterized modules and the optimal parameters. Our hardware accelerator is implemented on a Xilinx Zynq7020, and the experimental results show that our design obtains higher FPS and lower latency while using fewer LUTs and FFs.

Author Contributions

Conceptualization, N.M., H.Y. and Z.H.; methodology, N.M., H.Y. and Z.H.; software, N.M.; validation, N.M.; investigation, N.M. and H.Y.; writing—original draft preparation, N.M.; writing—review and editing, N.M., H.Y. and Z.H.; supervision, H.Y.; project administration, H.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 61876172.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  2. Stewart, R.; Berthomieu, B.; Garcia, P.; Ibrahim, I.; Michaelson, G.; Wallace, A. Verifying parallel dataflow transformations with model checking and its application to FPGAs. J. Syst. Archit. 2019, 101, 101657. [Google Scholar] [CrossRef]
  3. Garcia, P.; Bhowmik, D.; Stewart, R.; Michaelson, G.; Wallace, A. Optimized Memory Allocation and Power Minimization for FPGA-Based Image Processing. J. Imaging 2019, 5, 7. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Latotzke, C.; Ciesielski, T.; Gemmeke, T. Design of High-Throughput Mixed-Precision CNN Accelerators on FPGA. In Proceedings of the 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), Belfast, UK, 29 August–2 September 2022; pp. 358–365. [Google Scholar]
  5. Zhang, Y.; Pan, J.; Liu, X.; Chen, H.; Chen, D.; Zhang, Z. FracBNN: Accurate and FPGA-Efficient Binary Neural Networks with Fractional Activations. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Virtual Event, USA, 28 February–2 March 2021; pp. 171–182. [Google Scholar]
  6. Kala, S.; Jose, B.R.; Mathew, J.; Nalesh, S. High-Performance CNN Accelerator on FPGA Using Unified Winograd-GEMM Architecture. IEEE Trans. Very Large Scale Integr. Syst. 2019, 27, 2816–2828. [Google Scholar] [CrossRef]
  7. Jiang, C.; Ojika, D.; Patel, B.; Lam, H. Optimized FPGA-based Deep Learning Accelerator for Sparse CNN using High Bandwidth Memory. In Proceedings of the 2021 IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Orlando, FL, USA, 9–12 May 2021; pp. 157–164. [Google Scholar]
  8. Nguyen, D.T.; Nguyen, T.N.; Kim, H.; Lee, H.-J. A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection. IEEE Trans. Very Large Scale Integr. Syst. 2019, 27, 1861–1873. [Google Scholar] [CrossRef]
  9. Wu, D.; Zhang, Y.; Jia, X.; Tian, L.; Li, T.; Sui, L.; Xie, D.; Shan, Y. A High-Performance CNN Processor Based on FPGA for MobileNets. In Proceedings of the 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 8–12 September 2019; pp. 136–143. [Google Scholar]
  10. Yang, Y.; Huang, Q.; Wu, B.; Zhang, T.; Ma, L.; Gambardella, G.; Blott, M.; Lavagno, L.; Vissers, K.; Wawrzynek, J.; et al. Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on Embedded FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 24–26 February 2019; pp. 23–32. [Google Scholar]
  11. Azarmi Gilan, A.; Emad, M.; Alizadeh, B. FPGA-Based Implementation of a Real-Time Object Recognition System Using Convolutional Neural Network. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 755–759. [Google Scholar] [CrossRef]
  12. Knapheide, J.; Stabernack, B.; Kuhnke, M. A High Throughput MobileNetV2 FPGA Implementation Based on a Flexible Architecture for Depthwise Separable Convolution. In Proceedings of the 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 31 August–4 September 2020; pp. 277–283. [Google Scholar]
  13. Niu, Y.; Kannan, R.; Srivastava, A.; Prasanna, V. Reuse Kernels or Activations? A Flexible Dataflow for Low-latency Spectral CNN Acceleration. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 23–25 February 2020; pp. 266–276. [Google Scholar]
  14. Xing, Y.; Liang, S.; Sui, L.; Jia, X.; Qiu, J.; Liu, X.; Wang, Y.; Shan, Y.; Wang, Y. DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-Based CNN Accelerators. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2020, 39, 2668–2681. [Google Scholar] [CrossRef] [Green Version]
  15. Yang, T.; Liao, Y.; Shi, J.; Liang, Y.; Jing, N.; Jiang, L. A Winograd-Based CNN Accelerator with a Fine-Grained Regular Sparsity Pattern. In Proceedings of the 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 31 August–2 September 2020; pp. 254–261. [Google Scholar]
  16. Yu, Y.; Zhao, T.; Wang, K.; He, L. Light-OPU: An FPGA-based Overlay Processor for Lightweight Convolutional Neural Networks. In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Seaside, CA, USA, 23–25 February 2020; pp. 122–132. [Google Scholar]
  17. Huang, Q.; Wang, D.; Dong, Z.; Gao, Y.; Cai, Y.; Li, T.; Wu, B.; Keutzer, K.; Wawrzynek, J. CoDeNet: Efficient Deployment of Input-Adaptive Object Detection on Embedded FPGAs. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Virtual Event, USA, 28 February–2 March 2021; pp. 206–216. [Google Scholar]
  18. Li, J.; Un, K.-F.; Yu, W.-H.; Mak, P.-I.; Martins, R.P. An FPGA-Based Energy-Efficient Reconfigurable Convolutional Neural Network Accelerator for Object Recognition Applications. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 3143–3147. [Google Scholar] [CrossRef]
  19. Meng, J.; Venkataramanaiah, S.K.; Zhou, C.; Hansen, P.; Whatmough, P.; Seo, J.-s. FixyFPGA: Efficient FPGA Accelerator for Deep Neural Networks with High Element-Wise Sparsity and without External Memory Access. In Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30–31 August 2021; pp. 9–16. [Google Scholar]
  20. Yan, S.; Liu, Z.; Wang, Y.; Zeng, C.; Liu, Q.; Cheng, B.; Cheung, R.C.C. An FPGA-based MobileNet Accelerator Considering Network Structure Characteristics. In Proceedings of the 2021 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 17–23. [Google Scholar]
  21. Jia, X.; Zhang, Y.; Liu, G.; Yang, X.; Zhang, T.; Zheng, J.; Xu, D.; Wang, H.; Zheng, R.; Pareek, S.; et al. XVDPU: A High Performance CNN Accelerator on the Versal Platform Powered by the AI Engine. In Proceedings of the 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), Belfast, UK, 29 August–2 September 2022; pp. 209–217. [Google Scholar]
  22. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.-S. Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 45–54. [Google Scholar]
  23. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170. [Google Scholar]
  24. Wu, X.; Ma, Y.; Wang, M.; Wang, Z. A Flexible and Efficient FPGA Accelerator for Various Large-Scale and Lightweight CNNs. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 1185–1198. [Google Scholar] [CrossRef]
  25. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
  26. Yanamala, R.M.R.; Pullakandam, M. An Efficient Configurable Hardware Accelerator Design for CNN on Low Memory 32-Bit Edge Device. In Proceedings of the 2022 IEEE International Symposium on Smart Electronic Systems (iSES), Warangal, India, 19–21 December 2022; pp. 112–117. [Google Scholar]
  27. Tran, T.D.; Nakashima, Y. SLIT: An Energy-Efficient Reconfigurable Hardware Architecture for Deep Convolutional Neural Networks. IEICE Trans. Electron. 2021, E104.C, 319–329. [Google Scholar] [CrossRef]
  28. Youssef, E.; Elsimary, H.A.; El-Moursy, M.A.; Mostafa, H.; Khattab, A. Energy-Efficient Precision-Scaled CNN Implementation With Dynamic Partial Reconfiguration. IEEE Access 2022, 10, 95571–95584. [Google Scholar] [CrossRef]
Figure 1. Timing graph of the pipelined structure.
Figure 2. A generic overview of the CNN computing architecture.
Figure 3. A parallel data fetch sequence.
Figure 4. The computational architecture of a convolution layer.
Figure 5. Design structure in a Convolution Group.
Figure 6. Method of merging bias-adding into convolution.
Figure 7. Data storage mode.
Figure 8. End-to-end flow.
Table 1. Parameterized items of RTL modules in the hardware library.

Parameterized Item | Meaning
M(i) | total number of the input feature maps for the ith layer
N(i) | total number of the output feature maps for the ith layer
IX(i) | length of the input feature maps for the ith layer
IY(i) | width of the input feature maps for the ith layer
OX(i) | length of the output feature maps for the ith layer
OY(i) | width of the output feature maps for the ith layer
K(i) | size of the convolution kernel for the ith layer
S(i) | stride of the convolution kernel for the ith layer
PM(i) | parallelism degree across the input feature maps for the ith layer
PN(i) | parallelism degree across the output feature maps for the ith layer
FI(j) | number of the input neurons for the jth fully connected layer
FO(j) | number of the output neurons for the jth fully connected layer
PFI(j) | number of the memories required for the input neurons at the jth fully connected layer
PFO(j) | parallelism degree across the output neurons for the jth fully connected layer
Table 2. Parameters of parallelism, DSP usage, and execution cycles of LeNet-5.

Layer | PM (conv)/PFI (fc) | PN (conv)/PFO (fc) | DSP Used | Execution Cycles
conv1 | 1 | 6 | 12 | 10,214
conv2 | 3 | 16 | 96 | 2635
conv3 | 1 | 4 | 8 | 6457
fc4 | 4 | 1 | 2 | 5053
fc5 | 1 | 1 | 2 | 433
Table 3. Resource utilization of LeNet-5.

Resource | LUT | Register | BRAM | DSP
Total | 53,200 | 106,400 | 140 | 220
Used | 6129 | 4855 | 79 | 120
Utilization | 11.52% | 4.56% | 56.43% | 54.55%
Table 4. Comparison with previous work.

 | [26] | [27] | [28] | Our Work
Year | 2022 | 2021 | 2022 | 2023
FPGA | Zynq 7020 | Zynq 7020 | Zynq 7020 | XC7Z020
Network | LeNet-5 | LeNet-5 | LeNet-5 | LeNet-5
Frequency | 100 MHz | 115 MHz | 50 MHz | 100 MHz
Precision | N/A | 24 bit | 16 bit | 16 bit
DSP | 65 | 160 | N/A | 120
BRAM | 98.5 | 127 | 80 (16 bit), 25 (5 bit) | 79
LUT | 30585 | 6853 | 19651 (16 bit), 4846 (5 bit) | 6129
FF | 51419 | 6378 | 9103 (16 bit), 2874 (5 bit) | 4855
FPS (Frames Per Second) | 114.9 | N/A | 420.09 | 7782
Latency | N/A | 550 μs | 2380 μs | 248 μs

Share and Cite

Mao, N.; Yang, H.; Huang, Z. A Parameterized Parallel Design Approach to Efficient Mapping of CNNs onto FPGA. Electronics 2023, 12, 1106. https://doi.org/10.3390/electronics12051106