Article

Efficient Two-Stage Max-Pooling Engines for an FPGA-Based Convolutional Neural Network

Agency for Defense Development, Yuseong P.O. Box 35, Daejeon 34186, Republic of Korea
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(19), 4043; https://doi.org/10.3390/electronics12194043
Submission received: 8 August 2023 / Revised: 20 September 2023 / Accepted: 21 September 2023 / Published: 26 September 2023
(This article belongs to the Section Artificial Intelligence Circuits and Systems (AICAS))

Abstract

This paper proposes two max-pooling engines, named the RTB-MAXP engine and the CMB-MAXP engine, with a scalable window size parameter for FPGA-based convolutional neural network (CNN) implementation. The max-pooling operation for the CNN can be decomposed into two stages, i.e., a horizontal axis max-pooling operation and a vertical axis max-pooling operation. These two one-dimensional max-pooling operations are performed by tracking the rank of the values within the window in the RTB-MAXP engine and cascading the maximum operations of the values in the CMB-MAXP engine. Both the RTB-MAXP engine and the CMB-MAXP engine were implemented using VHSIC hardware description language (VHDL) and verified by simulations. The implementation results demonstrate that the 16 CMB-MAXP engines achieved a remarkable throughput of about 9 GBPS (gigabytes per second) while utilizing only about 3% of the available resources on the Xilinx Virtex UltraScale+ FPGA XCVU9P. On the other hand, the 16 RTB-MAXP engines exhibited somewhat lower throughput and resource utilization, although they did offer a slightly better latency when compared to the CMB-MAXP engines. In the comparison with existing techniques, the CMB-MAXP engine exhibited comparable implementation results in terms of the resource utilization and maximum operating frequency. It is crucial to note that only the proposed engines provide the features of runtime window scalability and boundary padding capability, which are essential requirements for CNN accelerators. The proposed max-pooling engines were employed and tested in our CNN accelerator targeting the CNN model YOLOv4-CSP-S-Leaky for object detection.

1. Introduction

Convolutional neural networks (CNNs) have demonstrated remarkable performance in various domains, including image classification, object detection, and speech recognition [1,2]. However, effectively integrating CNNs into embedded systems with limited power and size budgets remains a significant challenge, primarily because of the high computational demands of CNNs, which can be resource intensive for embedded systems. Typically, embedded systems revolve around general-purpose central processing units (CPUs) capable of handling a wide range of tasks. However, CPUs have limitations when it comes to implementing CNNs, mainly because of the repetitive and computationally intensive nature of large-scale convolution operations. To address this challenge and efficiently implement CNNs, dedicated hardware accelerators such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs) are commonly employed [3]. Among these options, FPGAs have gained popularity for implementing CNNs in embedded systems primarily due to their ability to perform convolution operations in parallel with high energy efficiency. Compared to GPUs, FPGAs offer higher energy efficiency, making them an attractive choice for resource-constrained embedded systems [4,5,6,7].
The generation of feature maps, achieved through the fundamental convolution operation using multiple kernels, plays a crucial role in CNNs. To minimize computational costs and simplify the model, reducing the size of these feature maps is necessary. The max-pooling technique is employed to achieve this while preserving spatial invariance of distinct features within the feature maps [8,9]. Typically, a window of size 2 × 2 is used in max-pooling operations, ensuring spatial overlap of the maximum values and sampling values along the horizontal and vertical axes every two positions [9]. As a result, the feature map’s width and height can be reduced by half, resulting in a 4× reduction in size while preserving the maximum values that represent the distinct features of the feature map. In recent years, there has been a growing focus on improving object detection performance by utilizing feature maps of various resolutions. This approach often involves incorporating max-pooling techniques with larger and more diverse window sizes [10]. For instance, YOLOv4’s spatial pyramid pooling (SPP) employed max pooling with window sizes of 5, 9, and 13 [10,11,12]. By employing different window sizes, the pooling operation captures multi-scale information from the feature maps, enabling the model to detect objects at different sizes and scales more effectively. This approach improves object detection accuracy and robustness in complex scenes [10,11,12]. Therefore, we are focusing on the FPGA-based implementation of the max-pooling engine with a scalable window size parameter ranging from 2 to 13.
The 2 × 2 max-pooling operation was initially implemented by using two delay buffers and three comparators on FPGAs [13,14]. This basic 2 × 2 max-pooling engine was later extended to handle max pooling with a k × k window, achieved by employing k delay buffers and k × k − 1 comparators [15,16]. However, despite its simplicity, the k × k − 1 comparator-based max-pooling engine is inefficient in terms of hardware utilization, as large window sizes require numerous comparators and thus many logic cells, leading to higher power consumption. An alternative approach to achieving max pooling involves using a traditional rank order filter, which selects a specific ranked value from a window of values, typically the maximum, minimum, or median value [17]. The implementation results of a two-dimensional rank order filter for FPGA are presented in [18], an eight-stage pipelined architecture utilizing a bit-serial algorithm for the rank filter is presented in [19], and a low-complexity pipelined rank filter is presented in [20]. Subsequently, a hybrid sorting network architecture for median filtering was introduced to achieve high power efficiency [21]. However, these existing architectures [17,18,19,20,21] are optimized for fixed window sizes, whereas CNN applications require a scalable window size that can be dynamically adjusted during operation. In this paper, we propose two efficient max-pooling engines, named the rank-tracking-based max-pooling (RTB-MAXP) engine and the cascaded maximum-based max-pooling (CMB-MAXP) engine, designed with a scalable window size parameter for FPGA-based CNN implementations. The two-dimensional max-pooling operation is decomposed into a horizontal max-pooling operation and a vertical max-pooling operation for operational efficiency.
Thus, the two-dimensional max-pooling operation can be accomplished with the two-step one-dimensional max-pooling operation, which leads to the reduction of comparison operations from k × k − 1 to 2k − 2. In the RTB-MAXP engine, the max-pooling operation is accomplished by tracking the ranks of the values within the scalable window and extracting the top ranked value as the maximum value. On the other hand, the CMB-MAXP engine employs the cascaded maximum operations to find the maximum value within the window.
This paper is organized as follows: In Section 2, the two-dimensional max-pooling operation is expressed in the form of two-stage max-pooling operations; the architecture of the RTB-MAXP engine is then introduced in Section 3. Section 4 describes the architecture of the proposed CMB-MAXP engine. Section 5 shows the implementation results of both the RTB-MAXP engine and the CMB-MAXP engine, followed by a comparison with existing techniques. Finally, the conclusion is provided in Section 6.

2. Max-Pooling Operation

Max pooling is a fundamental operation in deep learning used for reducing the spatial dimensions of a feature map. It involves sliding a window over the feature map and selecting the maximum value within the window [5]. Since the values in the feature map represent features, it is crucial not to lose important large values when reducing the size of the feature map. Max pooling ensures that the maximum value within each window is retained, preventing the loss of distinct features during spatial sampling. Given an input feature map x, containing three-dimensional data of P channels, width W, and height H, the output of the max-pooling operation z can be expressed as the following equation:
zp(i, j) = max{ xp(i + l, j + m) : −⌊k/2⌋ ≤ l, m ≤ ⌊(k − 1)/2⌋ },  (1)
where p (0 ≤ p ≤ P − 1) represents the index of the channel; i and j (0 ≤ i ≤ H − 1, 0 ≤ j ≤ W − 1) denote the vertical and horizontal indices of the feature map, respectively; the operator “max” represents the maximum operation; and the operator ⌊·⌋ represents the floor operation. The value of k determines the size of the k × k square window. Note that the subsampling that reduces the spatial size of the output feature map z is not considered in this paper because it can be implemented easily and simply.
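For concreteness, the floor-based index bounds of the window in Equation (1) can be checked with a one-line helper (an illustrative function, not from the paper):

```python
def window_bounds(k):
    """Return the index range -floor(k/2) .. floor((k-1)/2) used in Equation (1)."""
    return -(k // 2), (k - 1) // 2

# Odd k gives a symmetric window, even k an asymmetric one:
# window_bounds(5) -> (-2, 2), window_bounds(2) -> (-1, 0)
```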
The two-dimensional max-pooling operation in Equation (1) can be decomposed into two distinct one-dimensional max-pooling operations as follows:
yp(i, j) = max{ xp(i, j + m) : −⌊k/2⌋ ≤ m ≤ ⌊(k − 1)/2⌋ },  (2)

zp(i, j) = max{ yp(i + l, j) : −⌊k/2⌋ ≤ l ≤ ⌊(k − 1)/2⌋ }.  (3)
The operations in Equations (2) and (3) can be named the horizontal axis max-pooling operation and the vertical axis max-pooling operation, respectively. The output of the horizontal max-pooling operation is fed to the input of the vertical max-pooling operation. Using these two sequential max-pooling operations reduces the number of comparison operations compared to the two-dimensional max-pooling operation in Equation (1), i.e., from k × k − 1 to 2k − 2.
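This decomposition can be sanity-checked with a short software model. The following Python sketch (illustrative function names, not from the paper; zero-padding is assumed at the feature-map boundaries, a capability the proposed engines also provide) computes Equation (1) directly and via the two-stage form of Equations (2) and (3):

```python
def maxpool2d_direct(x, k):
    """Equation (1): max over the k x k window -floor(k/2) <= l, m <= floor((k-1)/2),
    with zero-padding outside the feature map."""
    H, W = len(x), len(x[0])
    lo, hi = -(k // 2), (k - 1) // 2
    z = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            z[i][j] = max(
                x[i + l][j + m] if 0 <= i + l < H and 0 <= j + m < W else 0
                for l in range(lo, hi + 1) for m in range(lo, hi + 1)
            )
    return z

def maxpool2d_two_stage(x, k):
    """Equations (2) and (3): horizontal 1-D pass, then vertical 1-D pass."""
    H, W = len(x), len(x[0])
    lo, hi = -(k // 2), (k - 1) // 2
    # Equation (2): horizontal axis max pooling
    y = [[max(x[i][j + m] if 0 <= j + m < W else 0
              for m in range(lo, hi + 1)) for j in range(W)] for i in range(H)]
    # Equation (3): vertical axis max pooling
    return [[max(y[i + l][j] if 0 <= i + l < H else 0
                 for l in range(lo, hi + 1)) for j in range(W)] for i in range(H)]
```

Both functions produce identical outputs, while the two-stage form needs only 2k − 2 comparisons per output position instead of k × k − 1.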

3. Rank-Tracking-Based Max-Pooling (RTB-MAXP) Engine

In this section, a two-step max-pooling operation engine, named the RTB-MAXP engine, is introduced. This engine is designed on the basis of the rank-tracking concept to find the maximum value within a kernel window, as shown in Figure 1. It has registers that store the sample values within a window in descending order. The maximum value is stored in register r0, and the values within the window are stored in descending order down to rK − 1, where K is the maximum window size. When a new value of the feature map is input to the engine, the registers need to be updated. For example, if the new input value xp (i,j) is smaller than r0 but larger than r1, the value in r3 is the value pushed out of the window, the value of r2 is shifted to r3, the value of r1 is shifted to r2, and the value xp (i,j) is stored in r1 simultaneously. Meanwhile, the values of r0 and r4, …, rK − 1 remain unchanged. The resulting output of this one-dimensional max-pooling engine corresponds to the value stored in r0. This rank-tracking concept is implemented through two distinct blocks: the rank-counting block and the delay-counting block.
The rank-counting block is for storing the input values in a corresponding register based on their ranks. It is composed of K blocks named Rv (v = 0, 1, …, K − 1) and comparator blocks denoted as “>,” as shown in Figure 2. Each Rv block is composed of a multiplexer, a multiplexer switch (MS), and a register rv, as depicted in Figure 3. The inputs for the multiplexer include the input feature map value xp (i,j), the value of rv, the value of rv − 1 (one step larger than rv), and the value of rv + 1 (one step smaller than rv). The multiplexer selects one out of these inputs based on the MS’s output values cv [0] and cv [1]. Table 1 shows the relationship between the input values of mv − 1, mv, mv + 1, and nv and the corresponding output values cv [0] and cv [1]. The values of mv − 1, mv, and mv + 1 are obtained from the outputs of the comparators shown in Figure 2, whereas the value of nv is obtained from the delay-counting block. The comparator block outputs a value of 1 when xp (i,j) is larger than rv, and a value of 0 otherwise.
The value pushed out of the window is also required to be tracked for the max-pooling operation, for which the delay-counting block is employed. The delay-counting block indicates the value nv (v = 0, 1, …, K − 1) that is pushed out of the window. It consists of K blocks, named Dv (v = 0, 1, …, K − 1); the comparator blocks, denoted as “=”; and OR gates, as depicted in Figure 4. The Dv block outputs the delay-indicating value corresponding to the rank-counting value rv. Similar to the rank-counting block, the Dv block is composed of a multiplexer, an MS, and a register denoted as “dv,” as shown in Figure 5. The inputs for the multiplexer include the increased delay-counting value dv + 1, corresponding to rv; the delay-counting value 0, corresponding to xp (i,j); the increased delay-counting value dv − 1 + 1, corresponding to rv − 1; and the increased delay-counting value dv + 1 + 1, corresponding to rv + 1. The multiplexer selects one of these inputs based on the MS’s output values cv [0] and cv [1]. Note that the MS is the same as that in the Rv block. As shown in Figure 4, the comparator block outputs a value of 1 if the delay-counting value is equal to the selected window size k and a value of 0 otherwise. This makes it possible to determine which value is pushed out of the window. The value nv is the output of the OR operation between nv − 1 and the output of the comparator block, as shown in Figure 4.
The two-dimensional max-pooling engine is implemented by employing multiple one-dimensional rank-tracking-based max-pooling engines, as depicted in Figure 6. The block marked with “MH” represents the horizontal one-dimensional max-pooling engine of Equation (2). Specifically, yp (i,j) is obtained from the highest-ranked value r0 of the rank-counting block illustrated in Figure 2. In order to obtain zp (i,j), where 0 ≤ j ≤ W − 1, each column of yp (i,j) is fed into W vertical max-pooling engines MV (j). Then, the final output zp (i,j) is obtained from the multiplexer (MUX) based on the value of the horizontal index j.
The proposed max-pooling engine can be efficiently implemented by sharing common components. Firstly, the MS in the Rv block and the Dv block for each v can be shared. Since the output values cv [0] and cv [1] of these two MSs are identical, they can be used commonly for a one-dimensional max-pooling engine. Secondly, since every vertical max-pooling engine MV (j) operates at non-overlapped timing for sequentially incoming yp (i,j) values, all components of MV (j) for 0 ≤ j ≤ W − 1 can be shared except for the registers used to store rank-counting values and delay-counting values. The RTB-MAXP engine appears somewhat intricate, as it involves tracking the ranks of the values within a window. However, this complexity offers potential benefits in terms of reduced critical paths or latency, since the output can be obtained merely by comparing the incoming value with the first-ranked value.
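The rank-tracking principle of Figures 1–5 can be modelled behaviourally in software. The Python sketch below is an approximation of the idea, not the VHDL: a sorted list plays the role of the registers r0, r1, …, and each entry's age plays the role of the delay-counting value dv, so the entry whose age reaches k is the value pushed out of the window:

```python
def rtb_maxpool_1d(row, k):
    """Behavioural model of the 1-D rank-tracking max-pooling engine.
    'ranks' mimics the rank-counting registers (values in descending order);
    each entry's age mimics the delay-counting value, and an entry is
    discarded once its age reaches k (pushed out of the window).  The row is
    zero-padded by floor(k/2) / floor((k-1)/2) at its edges, and the output
    at every position is the top-ranked value r0."""
    padded = [0] * (k // 2) + list(row) + [0] * ((k - 1) // 2)
    ranks = []  # list of (value, age), kept sorted by value, descending
    out = []
    for j, x in enumerate(padded):
        # age every entry; evict the one that left the k-wide window
        ranks = [(v, a + 1) for v, a in ranks if a + 1 < k]
        ranks.append((x, 0))                     # insert new value at its rank
        ranks.sort(key=lambda t: (-t[0], t[1]))
        if j >= k - 1:                           # window is full: emit r0
            out.append(ranks[0][0])
    return out
```

In the hardware, the insertion and eviction happen in a single clock cycle via the MS-controlled shifts of Table 1 rather than by re-sorting.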

4. Cascaded Maximum-Based Max-Pooling (CMB-MAXP) Engine

In this section, another two-step max-pooling operation engine, named the CMB-MAXP engine, is presented. In the CMB-MAXP engine, the horizontal axis max-pooling operation of Equation (2) is accomplished through cascaded maximum operations, as illustrated in Figure 7. The cascaded F/Fs and the maximum operators find the maximum values yp^w(i, j) for 1 ≤ w ≤ K − 1 of the input sequence elements xp(i, j), xp(i, j − 1), xp(i, j − 2), …, xp(i, j − w). The cascaded maximum operation can be expressed in a recursive form, as shown in Equation (4).
yp^w(i, j) = max( xp(i, j), xp(i, j − 1) ),  w = 1,
yp^w(i, j) = max( yp^{w−1}(i, j), xp(i, j − w) ),  w ≥ 2.  (4)
The element yp^{k−1}(i, j) is chosen as the output sequence element yp(i, j) of the horizontal axis max-pooling operation by a MUX according to the scalable kernel size parameter k, i.e., yp(i, j) = yp^{k−1}(i, j).
The structure of the vertical axis max-pooling operation is shown in Figure 8, and it is almost identical to that of the horizontal axis max-pooling engine except for the additional two-dimensional memory elements M(v, w) for 0 ≤ v ≤ K − 1 and 0 ≤ w ≤ W − 1. The memory elements are employed for storing the previous rows’ values of the feature map yp(i − 1, j), yp(i − 2, j), …, yp(i − K + 1, j) for the vertical max-pooling operations. The previously stored elements yp(i − 1, j), yp(i − 2, j), …, yp(i − K + 1, j) are fed into F/Fs from the memory elements Mp(0, j), Mp(1, j), …, Mp(K − 1, j), and the newly delayed elements yp(i, j), yp(i − 1, j), …, yp(i − K + 2, j) are recursively fed into the corresponding memory elements. The maximum values zp^w(i, j) of the elements yp(i, j), yp(i − 1, j), yp(i − 2, j), …, yp(i − w, j) are obtained by comparators, and the final output sequence element zp(i, j) is obtained from the MUX based on the value of k.
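The cascade of Equation (4) can likewise be modelled in a few lines of Python (an illustrative sketch, not the VHDL; in the hardware the centred window alignment is produced by the engine's latency, whereas this model simply takes the maximum of the most recent k samples):

```python
def cascaded_max(seq, k):
    """Software model of the cascaded maximum chain of Equation (4):
    y^1(j) = max(x(j), x(j-1)), y^w(j) = max(y^{w-1}(j), x(j-w)), and the
    MUX selects y^{k-1}(j), i.e. the maximum of the last k samples.  The
    first k-1 outputs simply use however many samples are available."""
    out = []
    for j in range(len(seq)):
        y = seq[j]
        for w in range(1, k):        # cascade of k - 1 maximum operators
            if j - w >= 0:
                y = max(y, seq[j - w])
        out.append(y)                # MUX output y^{k-1}(j)
    return out
```

Because k only selects a tap in the cascade, the same chain of K − 1 maximum operators serves every window size 2 ≤ k ≤ K at runtime.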

5. Implementations

The proposed RTB-MAXP engine and CMB-MAXP engine were implemented for employment in an FPGA-based CNN accelerator. The target model for the CNN was YOLOv4-CSP-S-Leaky designed for object detection [8]. It consists of 108 layers, including 3 × 3 convolution layers, 1 × 1 convolution layers, residual addition layers, concatenation layers, max-pooling layers, and up-sampling layers. The max-pooling layers are utilized for the SPP operation [10,11,12]. When the input feature map size of the model is 256 × 256, the input feature map size of the max-pooling operations is 32 × 32, with window sizes of 5, 9, and 13. Therefore, the maximum window size K was chosen as 13 and the maximum width of the feature map W was set as 32 for the proposed max-pooling engines. We applied 8-bit quantization to the weight and bias parameters for the convolution operation, along with 8-bit representation for the feature maps.
The max-pooling engines were designed using VHSIC hardware description language (VHDL), and their behaviors were verified through simulations using ModelSim, Mentor Graphics’ simulation and debugging tool for digital logic circuits. Figure 9 and Figure 10 depict the simulation results for the RTB-MAXP engine and the CMB-MAXP engine, respectively. The parameters k and W were denoted as ksize_i and width_i and set to 5 and 0x20 (=32), respectively. Note that the signal suffixes _i and _o denote engine inputs and outputs, respectively. The signals inpchan_i, row_i, and valid_i were used for data synchronization, marking the start of a new input channel, the start of a new row, and a valid data point, respectively. The signal data_i represents the feature map data synchronized with these signals. The output data signal data_o of the max-pooling engines comes out along with the data synchronization signals outchan_o and valid_o. The latency of the RTB-MAXP engine is defined as ⌊W/2⌋ × ⌊k/2⌋ + ⌊k/2⌋ + 1, whereas the CMB-MAXP engine exhibits a latency of ⌊W/2⌋ × ⌊k/2⌋ + ⌊k/2⌋ + k. Buffering by W × ⌊k/2⌋ + ⌊k/2⌋ is technically necessitated to determine the maximum value within a k × k window. However, owing to the zero-padding by ⌊k/2⌋ at the edges of the feature maps, only a ⌊W/2⌋ × ⌊k/2⌋ delay is required. The additional delay of 1 in the RTB-MAXP engine and k in the CMB-MAXP engine results from the implementation of comparators with flip-flops, aimed at reducing the critical paths. Therefore, with the specific parameters k = 5 and W = 32, the latency is given as 21 clock cycles for the RTB-MAXP engine and 25 clock cycles for the CMB-MAXP engine.
In order to accelerate the max-pooling operations for the YOLOv4-CSP-S-Leaky CNN model, we employed either 16 RTB-MAXP engines or 16 CMB-MAXP engines to simultaneously process 16 feature maps (0 ≤ p ≤ 15). These 16 max-pooling engines collectively increase the processing speed by a factor of 16, at the expense of FPGA resource utilization. For the implementation on the Xilinx VCU118 evaluation platform, we synthesized, placed, and routed the two designed max-pooling engines using the Xilinx Vivado Design Suite. The platform features the Xilinx Virtex UltraScale+ FPGA XCVU9P. In Table 2, we present the resource utilization of both the RTB-MAXP engine and the CMB-MAXP engine with K = 13 and W = 32 as a result of the implementation for the targeted FPGA. The RTB-MAXP engine utilized 158,515 LUTs and 99,342 flip-flops, constituting 13.4% and 4.2% of the total available resources, respectively. On the other hand, the CMB-MAXP engine employed 28,765 LUTs, 2688 LUTRAMs, and 76,906 flip-flops, representing 2.43%, 0.45%, and 3.25%, respectively. It is worth noting that the CMB-MAXP engine requires significantly fewer LUTs than the RTB-MAXP engine, but it involves the use of LUTRAMs as well. This is due to the necessity of buffering the previous row data, with LUTs serving this purpose in the RTB-MAXP engine and LUTRAMs in the CMB-MAXP engine. The maximum operating frequency was determined to be 483.2 MHz for the RTB-MAXP engine and 562.4 MHz for the CMB-MAXP engine. Consequently, with each engine processing one byte per clock cycle, the aggregate throughput is 16 times the operating frequency, i.e., 7731.2 (= 16 × 483.2) MBPS (megabytes per second) and 8998.4 (= 16 × 562.4) MBPS, respectively.
To perform a comparison with previous relevant studies [18,19,20,21], we conducted additional synthesis, placement, and routing of both the RTB-MAXP engine and the CMB-MAXP engine using the Xilinx ISE Design Suite, targeting the Xilinx Virtex-4 XC4VSX25 device utilized in those studies [19,20]. For this comparison, we considered only one instance of the RTB-MAXP engine and one instance of the CMB-MAXP engine, with the maximum window size set to 5 (K = 5). Whereas the RTB-MAXP engine utilizes approximately three times the resources of the eight-stage pipelined architecture in [19], the resource usage of the CMB-MAXP engine is comparable. The low-complexity pipelined rank filter described in [20] requires only about half the resources of the CMB-MAXP engine; however, the two proposed max-pooling engines exhibited slightly superior performance in terms of maximum operating frequency. In contrast, when comparing the CMB-MAXP engine with the hybrid sorting network architecture in [21], the CMB-MAXP engine demonstrated about half the resource utilization but a slightly lower maximum operating frequency. It is worth emphasizing that only the two proposed max-pooling engines offer runtime window scalability and boundary padding capability, which are essential requirements for CNN accelerators. Please note that the numbers of slices (marked as “-” in the # Slices row of Table 3) were not provided in the respective papers [18,21].
The proposed RTB-MAXP engine and CMB-MAXP engine were integrated into our CNN accelerator, which was tested on the target VCU118 evaluation platform board. Figure 11 displays the results of the CNN tests using the YOLOv4-CSP-S-Leaky model for object detection [10]. The results demonstrate that the CNN accelerator detected three people and one bus. Note that a comparison in terms of mean average precision (mAP) is not included in this paper, since the max-pooling operation is not part of the loss operation.

6. Conclusions

In this paper, we proposed two two-stage max-pooling engines for the max-pooling operation in CNNs: the RTB-MAXP engine and the CMB-MAXP engine. The RTB-MAXP engine determines the maximum value by tracking the rank of the values within the window and the CMB-MAXP engine obtains the maximum value through cascading multiple maximum operations. These engines were implemented using VHDL and thoroughly verified by simulations. The implementation results demonstrate that the 16 CMB-MAXP engines achieved a remarkable throughput of about 9 GBPS (gigabytes per second) while utilizing only around 3% of the available resources on the Xilinx Virtex UltraScale+ FPGA XCVU9P. On the other hand, the 16 RTB-MAXP engines exhibited somewhat lower throughput and resource utilization, although they did offer a slightly better latency compared to the CMB-MAXP engines. In the comparison with existing techniques, the CMB-MAXP engine exhibited comparable implementation results in terms of the resource utilization and maximum operating frequency. It is essential to note that only the proposed engines provide the features of runtime window scalability and boundary padding capability, which are essential requirements for CNN accelerators. The proposed max-pooling engines were employed and tested in our CNN accelerator targeting the CNN model YOLOv4-CSP-S-Leaky for object detection.

Author Contributions

Conceptualization, E.H.; methodology, E.H., K.-A.C. and J.J.; validation, E.H., K.-A.C. and J.J.; investigation, E.H., K.-A.C. and J.J.; resources, K.-A.C.; data curation, J.J.; writing—original draft preparation, E.H.; writing—review and editing, E.H., K.-A.C. and J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are unavailable due to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed]
  2. Lee, D.-H. Fully Convolutional Single-Crop Siamese Networks for Real-Time Visual Object Tracking. Electronics 2019, 8, 1084. [Google Scholar] [CrossRef]
  3. Shawahna, A.; Sait, S.; El-Maleh, A. FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review. IEEE Access 2018, 4, 7823–7859. [Google Scholar] [CrossRef]
  4. Huang, J.; Liu, X.; Guo, T.; Zhao, Z. A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator. Electronics 2023, 12, 1571. [Google Scholar] [CrossRef]
  5. Xie, Y.; Majoros, T.; Oniga, S. FPGA-Based Hardware Accelerator on Portable Equipment for EEG Signal Patterns Recognition. Electronics 2022, 11, 2410. [Google Scholar] [CrossRef]
  6. Zhang, L.; Tang, X.; Hu, X.; Zhou, T.; Peng, Y. FPGA-Based BNN Architecture in Time Domain with Low Storage and Power Consumption. Electronics 2022, 11, 1421. [Google Scholar] [CrossRef]
  7. Pettersson, L. Convolutional Neural Networks on FPGA and GPU on the Edge: A Comparison; Uppsala University: Uppsala, Sweden, 2020. [Google Scholar]
  8. Lomas-Barrie, V.; Silva-Flores, R.; Neme, A.; Pena-Cabrera, M. A Multiview Recognition Method of Predefined Objects for Robot Assembly Using Deep Learning and Its Implementation on an FPGA. Electronics 2022, 11, 696. [Google Scholar] [CrossRef]
  9. Zhou, H.; Xiao, Y.; Zheng, Z.; Yang, B. YOLOv2-tiny Target Detection System Based on FPGA Platform. In Proceedings of the 2022 3rd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), Xi’an, China, 15–17 July 2022; pp. 289–292. [Google Scholar]
  10. Wang, C.; Bochkovskiy, A.; Liao, H. Scaled-Yolov4: Scaling Cross Stage Partial Network. In Proceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  11. Bochkovskiy, A.; Wang, C.; Liao, H. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  12. Rzaev, E.; Khanaev, A.; Amerikanov, A. Neural Network for Real-Time Object Detection on FPGA. In Proceedings of the 2021 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), Sochi, Russia, 17–21 May 2021; pp. 719–723. [Google Scholar]
  13. Archana, V. An FPGA-Based Computation-Efficient Convolutional Neural Network Accelerator. In Proceedings of the 2022 IEEE International Power and Renewable Energy Conference (IPRECON), Kollam, India, 16–18 December 2022; pp. 1–4. [Google Scholar] [CrossRef]
  14. Wang, Z.; Xu, K.; Wu, S.; Liu, L.; Wang, D. Sparse-YOLO: Hardware/Software Co-Design of an FPGA Accelerator for YOLOv2. IEEE Access 2020, 8, 116569–116585. [Google Scholar] [CrossRef]
  15. Zhao, B.; Chong, Y.; Do, A. Area and Energy Efficient 2D Max-Pooling for Convolutional Neural Network Hardware Accelerator. In Proceedings of the IECON 2020 The 46th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 18–21 October 2020; pp. 423–427. [Google Scholar]
  16. Zhao, D. F-CNN: An FPGA-based framework for training Convolutional Neural Networks. In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, VIC, Australia, 11–13 December 2017; pp. 107–114. [Google Scholar] [CrossRef]
  17. Satti, P.; Sharma, N.; Garg, B. Min-Max Average Pooling Based Filter for Impulse Noise Removal. IEEE Signal Process. Lett. 2020, 27, 1475–1479. [Google Scholar] [CrossRef]
  18. Szedo, G. Two-Dimensional Rank Order Filter; Xilinx Application Note XAPP953; 2006; pp. 1–17. Available online: https://docs.xilinx.com/v/u/en-US/xapp953 (accessed on 7 August 2023).
  19. Choo, C.; Verma, P. A Real-Time Bit-Serial Rank Filter Implementation Using Xilinx FPGA. Real-Time Image Process. 2008, 6811, 125–132. [Google Scholar] [CrossRef]
  20. Prokin, D.; Prokin, M. Low Hardware Complexity Pipelined Rank Filter. IEEE Trans. Circuits Syst. II Express Briefs 2010, 57, 446–450. [Google Scholar] [CrossRef]
  21. Sambamurthy, N.; Kamaraju, M. Power Optimized Hybrid Sorting-Based Median Filtering. Int. J. Digit. Signals Smart Syst. 2020, 4, 80–86. [Google Scholar] [CrossRef]
Figure 1. Operating principle of the horizontal max-pooling engine with maximum possible window size K when the incoming value xp(i,j) satisfies r1 < xp(i,j) < r0 and r3 is pushed out of the window.
Figure 2. Rank-counting block consisting of Rv blocks (v = 0, 1, …, K − 1) and comparator blocks, marked as “>”.
Figure 3. Structure of the Rv block consisting of the multiplexer, the multiplexer switch (MS), and the register rv.
Figure 4. Delay-counting block consisting of Dv (v = 0, 1, …, K − 1); comparators, marked as “=”; and OR gates.
Figure 5. Structure of Dv consisting of a multiplexer, a multiplexer switch (MS), and the register dv.
Figure 6. Block diagram for the two-dimensional RTB-MAXP engine.
Figure 7. The horizontal axis max-pooling operation of the CMB-MAXP engine with the scalable kernel size k.
Figure 8. The vertical axis max-pooling operation of the CMB-MAXP engine with scalable kernel size k.
Figure 9. The simulation result of the RTB-MAXP engine with k = 5 and W = 32.
Figure 10. The simulation result of the CMB-MAXP engine with k = 5 and W = 32.
Figure 11. The test results of our CNN accelerator employing the CMB-MAXP engine on the VCU118 platform.
Table 1. The relationships between the multiplexer switch (MS)’s input and output and the output of the multiplexer exploited as Rv and Dv.

| MS input: mv−1 | mv | mv+1 | nv | MS output: cv[0] | cv[1] | Multiplexer output: rv | dv |
|---|---|---|---|---|---|---|---|
| 0 | 0 | X | 0 | 0 | 0 | rv | dv + 1 |
| 0 | 1 | X | 0 | 0 | 1 | xp(i,j) | 0 |
| 0 | 1 | X | 1 | 0 | 0 | rv | dv + 1 |
| 1 | 0 | X | 0 | 0 | 0 | rv | dv + 1 |
| 1 | 1 | X | 0 | 1 | 0 | rv+1 | dv+1 + 1 |
| 1 | 1 | X | 1 | 0 | 0 | rv | dv + 1 |
| X | 0 | 0 | 1 | 1 | 1 | rv−1 | dv−1 + 1 |
| X | 0 | 1 | 1 | 0 | 1 | xp(i,j) | 0 |
Table 2. Implementation results of the 16 RTB-MAXP engines and the 16 CMB-MAXP engines with K = 13 and W = 32 for YOLOv4-CSP-S-Leaky CNN model implementation on Xilinx Virtex UltraScale+ FPGA XCVU9P.

| Resource | RTB-MAXP UA | RTB-MAXP UP | CMB-MAXP UA | CMB-MAXP UP |
|---|---|---|---|---|
| # LUTs | 158,515 | 13.4% | 28,765 | 2.43% |
| # LUTRAMs | 0 | 0% | 2688 | 0.45% |
| # Flip-flops | 99,342 | 4.2% | 76,906 | 3.25% |
| Maximum operating frequency (MHz) | 483.2 | | 562.4 | |
| Latency (clock cycles) | ⌊W/2⌋ × ⌊k/2⌋ + ⌊k/2⌋ + 1 | | ⌊W/2⌋ × ⌊k/2⌋ + ⌊k/2⌋ + k | |

UA: utilization amount, UP: utilization percentage.
Table 3. Comparison of implementation results for the RTB-MAXP engine and the CMB-MAXP engine with previous relevant studies [18,19,20,21].

| | In [18] | In [19] | In [20] | In [21] | RTB-MAXP | CMB-MAXP |
|---|---|---|---|---|---|---|
| Device family | Xilinx Virtex-4 XC4VSX35 | Xilinx Virtex-4 XC4VSX25 | Xilinx Virtex-4 XC4VSX25 | Xilinx Artix-7 XC7A35T | Xilinx Virtex-4 XC4VSX25 | Xilinx Virtex-4 XC4VSX25 |
| # Slices | - | 1137 | 668 | - | 3182 | 1134 |
| # Slice FFs | 1169 | 698 | 964 | 3673 | 2020 | 1630 |
| # 4-input LUTs | 812 | 2055 | 766 | 5424 | 6070 | 2201 |
| # BRAMs | 12 | 5 | 0 | 0 | 0 | 0 |
| Maximum operating frequency (MHz) | 375 | 138.5 | 318.4 | 341.2 | 287.2 | 332.5 |
| Runtime window scalability | No | No | No | No | Yes | Yes |
| Boundary padding capability | No | No | No | No | Yes | Yes |
Hong, E.; Choi, K.-A.; Joo, J. Efficient Two-Stage Max-Pooling Engines for an FPGA-Based Convolutional Neural Network. Electronics 2023, 12, 4043. https://doi.org/10.3390/electronics12194043