Article

Efficient Two-Stage Max-Pooling Engines for an FPGA-Based Convolutional Neural Network

Agency for Defense Development, Yuseong P.O. Box 35, Daejeon 34186, Republic of Korea
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(19), 4043; https://doi.org/10.3390/electronics12194043
Submission received: 8 August 2023 / Revised: 20 September 2023 / Accepted: 21 September 2023 / Published: 26 September 2023
(This article belongs to the Section Artificial Intelligence Circuits and Systems (AICAS))

Abstract

This paper proposes two max-pooling engines, named the RTB-MAXP engine and the CMB-MAXP engine, with a scalable window size parameter for FPGA-based convolutional neural network (CNN) implementation. The max-pooling operation for the CNN can be decomposed into two stages, i.e., a horizontal axis max-pooling operation and a vertical axis max-pooling operation. These two one-dimensional max-pooling operations are performed by tracking the rank of the values within the window in the RTB-MAXP engine and cascading the maximum operations of the values in the CMB-MAXP engine. Both the RTB-MAXP engine and the CMB-MAXP engine were implemented using VHSIC hardware description language (VHDL) and verified by simulations. The implementation results demonstrate that the 16 CMB-MAXP engines achieved a remarkable throughput of about 9 GBPS (gigabytes per second) while utilizing only about 3% of the available resources on the Xilinx Virtex UltraScale+ FPGA XCVU9P. On the other hand, the 16 RTB-MAXP engines exhibited somewhat lower throughput and resource utilization, although they did offer a slightly better latency when compared to the CMB-MAXP engines. In the comparison with existing techniques, the CMB-MAXP engine exhibited comparable implementation results in terms of the resource utilization and maximum operating frequency. It is crucial to note that only the proposed engines provide the features of runtime window scalability and boundary padding capability, which are essential requirements for CNN accelerators. The proposed max-pooling engines were employed and tested in our CNN accelerator targeting the CNN model YOLOv4-CSP-S-Leaky for object detection.

1. Introduction

Convolutional neural networks (CNNs) have demonstrated remarkable performance in various domains, including image classification, object detection, and speech recognition [1,2]. However, effectively integrating CNNs into embedded systems with limited power and size budgets remains a significant challenge, primarily because of the high computational demands of CNNs, which can be resource intensive for embedded systems. Typically, embedded systems revolve around general-purpose central processing units (CPUs) capable of handling a wide range of tasks. However, CPUs have limitations when it comes to implementing CNNs, mainly because of the repetitive and computationally intensive nature of large-scale convolution operations. To address this challenge and efficiently implement CNNs, dedicated hardware accelerators such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs) are commonly employed [3]. Among these options, FPGAs have gained popularity for implementing CNNs in embedded systems primarily due to their ability to perform convolution operations in parallel with high energy efficiency. Compared to GPUs, FPGAs offer higher energy efficiency, making them an attractive choice for resource-constrained embedded systems [4,5,6,7].
The generation of feature maps, achieved through the fundamental convolution operation using multiple kernels, plays a crucial role in CNNs. To minimize computational costs and simplify the model, reducing the size of these feature maps is necessary. The max-pooling technique is employed to achieve this while preserving spatial invariance of distinct features within the feature maps [8,9]. Typically, a window of size 2 × 2 is used in max-pooling operations, ensuring spatial overlap of the maximum values and sampling values along the horizontal and vertical axes every two positions [9]. As a result, the feature map’s width and height can be reduced by half, resulting in a 4× reduction in size while preserving the maximum values that represent the distinct features of the feature map. In recent years, there has been a growing focus on improving object detection performance by utilizing feature maps of various resolutions. This approach often involves incorporating max-pooling techniques with larger and more diverse window sizes [10]. For instance, YOLOv4’s spatial pyramid pooling (SPP) employed max pooling with window sizes of 5, 9, and 13 [10,11,12]. By employing different window sizes, the pooling operation captures multi-scale information from the feature maps, enabling the model to detect objects at different sizes and scales more effectively. This approach improves object detection accuracy and robustness in complex scenes [10,11,12]. Therefore, we are focusing on the FPGA-based implementation of the max-pooling engine with a scalable window size parameter ranging from 2 to 13.
The 2 × 2 max-pooling operation was initially implemented by using two delay buffers and three comparators on FPGAs [13,14]. This basic 2 × 2 max-pooling engine was later extended to handle max pooling with a k × k window, achieved by employing k delay buffers and k × k − 1 comparators [15,16]. However, despite its simplicity, the k × k − 1 comparator-based max-pooling engine is inefficient in terms of hardware utilization, as large window sizes require numerous comparators and thus many logic cells, leading to higher power consumption. An alternative approach to achieving max pooling involves using a traditional rank order filter, which selects a specific ranked value from a window of values, typically the maximum, minimum, or median value [17]. The implementation results of a two-dimensional rank order filter for FPGA are presented in [18], an eight-stage pipelined architecture utilizing a bit-serial algorithm for the rank filter is presented in [19], and a low-complexity pipelined rank filter is presented in [20]. Subsequently, a hybrid sorting network architecture for median filtering was introduced to achieve high power efficiency [21]. However, these existing architectures [17,18,19,20,21] are optimized for fixed window sizes, whereas CNN applications require a scalable window size that can be dynamically adjusted during operation. In this paper, we propose two efficient max-pooling engines, named the rank-tracking-based max-pooling (RTB-MAXP) engine and the cascaded maximum-based max-pooling (CMB-MAXP) engine, designed with a scalable window size parameter for FPGA-based CNN implementations. The two-dimensional max-pooling operation is decomposed into a horizontal max-pooling operation and a vertical max-pooling operation for operational efficiency.
Thus, the two-dimensional max-pooling operation can be accomplished with the two-step one-dimensional max-pooling operation, which leads to the reduction of comparison operations from k × k − 1 to 2k − 2. In the RTB-MAXP engine, the max-pooling operation is accomplished by tracking the ranks of the values within the scalable window and extracting the top ranked value as the maximum value. On the other hand, the CMB-MAXP engine employs the cascaded maximum operations to find the maximum value within the window.
This paper is organized as follows: In Section 2, the two-dimensional max-pooling operation is expressed in the form of two-stage max-pooling operations; the architecture of the RTB-MAXP engine is then introduced in Section 3. Section 4 describes the architecture of the proposed CMB-MAXP engine. Section 5 shows the implementation results of both the RTB-MAXP engine and the CMB-MAXP engine, followed by a comparison with existing techniques. Finally, the conclusion is provided in Section 6.

2. Max-Pooling Operation

Max pooling is a fundamental operation in deep learning used for reducing the spatial dimensions of a feature map. It involves sliding a window over the feature map and selecting the maximum value within the window [5]. Since the values in the feature map represent features, it is crucial not to lose important large values when reducing the size of the feature map. Max pooling ensures that the maximum value within each window is retained, preventing the loss of distinct features during spatial sampling. Given an input feature map x, containing three-dimensional data of P channels, width W, and height H, the output of the max-pooling operation z can be expressed as the following equation:
zp(i, j) = max{ xp(i + l, j + m) : −⌊k/2⌋ ≤ l, m ≤ ⌊(k − 1)/2⌋ },  (1)
where p (0 ≤ p ≤ P − 1) represents the index of the channel; i and j (0 ≤ i ≤ H − 1, 0 ≤ j ≤ W − 1) denote the vertical and horizontal indices of the feature map, respectively; the operator “max” represents the maximum operation; and the operator ⌊·⌋ represents the floor operation. The value of k determines the size of the k × k square window. Note that the subsampling that reduces the spatial size of the output feature map z is not considered in this paper because it can be implemented easily and simply.
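For concreteness, the floor-based index bounds of the window in Equation (1) can be checked with a one-line helper (an illustrative function, not from the paper):

```python
def window_bounds(k):
    """Return the index range -floor(k/2) .. floor((k-1)/2) used in Equation (1)."""
    return -(k // 2), (k - 1) // 2

# Odd k gives a symmetric window, even k an asymmetric one:
# window_bounds(5) -> (-2, 2), window_bounds(2) -> (-1, 0)
```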
The two-dimensional max-pooling operation in Equation (1) can be decomposed into two distinct one-dimensional max-pooling operations as follows:
yp(i, j) = max{ xp(i, j + m) : −⌊k/2⌋ ≤ m ≤ ⌊(k − 1)/2⌋ },  (2)

zp(i, j) = max{ yp(i + l, j) : −⌊k/2⌋ ≤ l ≤ ⌊(k − 1)/2⌋ }.  (3)
The operations in Equations (2) and (3) can be named the horizontal axis max-pooling operation and the vertical axis max-pooling operation, respectively. The output of the horizontal max-pooling operation is fed to the input of the vertical max-pooling operation. Using these two sequential max-pooling operations reduces the number of comparison operations compared to the two-dimensional max-pooling operation in Equation (1), i.e., from k × k − 1 to 2k − 2.
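This decomposition can be sanity-checked with a short software model. The following Python sketch (illustrative function names, not from the paper; zero-padding is assumed at the feature-map boundaries, a capability the proposed engines also provide) computes Equation (1) directly and via the two-stage form of Equations (2) and (3):

```python
def maxpool2d_direct(x, k):
    """Equation (1): max over the k x k window -floor(k/2) <= l, m <= floor((k-1)/2),
    with zero-padding outside the feature map."""
    H, W = len(x), len(x[0])
    lo, hi = -(k // 2), (k - 1) // 2
    z = [[0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            z[i][j] = max(
                x[i + l][j + m] if 0 <= i + l < H and 0 <= j + m < W else 0
                for l in range(lo, hi + 1) for m in range(lo, hi + 1)
            )
    return z

def maxpool2d_two_stage(x, k):
    """Equations (2) and (3): horizontal 1-D pass, then vertical 1-D pass."""
    H, W = len(x), len(x[0])
    lo, hi = -(k // 2), (k - 1) // 2
    # Equation (2): horizontal axis max pooling
    y = [[max(x[i][j + m] if 0 <= j + m < W else 0
              for m in range(lo, hi + 1)) for j in range(W)] for i in range(H)]
    # Equation (3): vertical axis max pooling
    return [[max(y[i + l][j] if 0 <= i + l < H else 0
                 for l in range(lo, hi + 1)) for j in range(W)] for i in range(H)]
```

Both functions produce identical outputs, while the two-stage form needs only 2k − 2 comparisons per output position instead of k × k − 1.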

3. Rank-Tracking-Based Max-Pooling (RTB-MAXP) Engine

In this section, a two-step max-pooling operation engine, named the RTB-MAXP engine, is introduced. This engine is designed on the basis of the rank-tracking concept to find the maximum value within a kernel window, as shown in Figure 1. It has registers that store the sample values within a window in descending order. The maximum value is stored in register r0, and the values within the window are stored in descending order down to rK − 1, where K is the maximum window size. When a new value of the feature map is input to the engine, the registers need to be updated. For example, if the new input value xp (i,j) is smaller than r0 but larger than r1, the value in r3 is the value pushed out of the window, the value of r2 is shifted to r3, the value of r1 is shifted to r2, and the value xp (i,j) is stored in r1 simultaneously. Meanwhile, the values of r0 and r4, …, rK − 1 remain unchanged. The resulting output of this one-dimensional max-pooling engine corresponds to the value stored in r0. This rank-tracking concept is implemented through two distinct blocks: the rank-counting block and the delay-counting block.
The rank-counting block is for storing the input values in a corresponding register based on their ranks. It is composed of K blocks named Rv (v = 0, 1, …, K − 1) and comparator blocks denoted as “>,” as shown in Figure 2. Each Rv block is composed of a multiplexer, a multiplexer switch (MS), and a register rv, as depicted in Figure 3. The inputs for the multiplexer include the input feature map value xp (i,j), the value of rv, the value of rv − 1 (one step larger than rv), and the value of rv + 1 (one step smaller than rv). The multiplexer selects one out of these inputs based on the MS’s output values cv [0] and cv [1]. Table 1 shows the relationship between the input values of mv − 1, mv, mv + 1, and nv and the corresponding output values cv [0] and cv [1]. The values of mv − 1, mv, and mv + 1 are obtained from the outputs of the comparators shown in Figure 2, whereas the value of nv is obtained from the delay-counting block. The comparator block outputs a value of 1 when xp (i,j) is larger than rv, and a value of 0 otherwise.
The value pushed out of the window is also required to be tracked for the max-pooling operation, for which the delay-counting block is employed. The delay-counting block indicates the value nv (v = 0, 1, …, K − 1) that is pushed out of the window. It consists of K blocks, named Dv (v = 0, 1, …, K − 1); the comparator blocks, denoted as “=”; and OR gates, as depicted in Figure 4. The Dv block outputs the delay-indicating value corresponding to the rank-counting value rv. Similar to the rank-counting block, the Dv block is composed of a multiplexer, an MS, and a register denoted as “dv,” as shown in Figure 5. The inputs for the multiplexer include the increased delay-counting value dv + 1, corresponding to rv; the delay-counting value 0, corresponding to xp (i,j); the increased delay-counting value dv − 1 + 1, corresponding to rv − 1; and the increased delay-counting value dv + 1 + 1, corresponding to rv + 1. The multiplexer selects one of these inputs based on the MS’s output values cv [0] and cv [1]. Note that the MS is the same as that in the Rv block. As shown in Figure 4, the comparator block outputs a value of 1 if the delay-counting value is equal to the selected window size k and a value of 0 otherwise. This makes it possible to determine which value is pushed out of the window. The value nv is the output of the OR operation between nv − 1 and the output of the comparator block, as shown in Figure 4.
The two-dimensional max-pooling engine is implemented by employing multiple one-dimensional rank-tracking-based max-pooling engines, as depicted in Figure 6. The block marked with “MH” represents the horizontal one-dimensional max-pooling engine of Equation (2). Specifically, yp (i,j) is obtained from the highest-ranked value r0 of the rank-counting block illustrated in Figure 2. In order to obtain zp (i,j), where 0 ≤ j ≤ W − 1, each column of yp (i,j) is fed into W vertical max-pooling engines MV (j). Then, the final output zp (i,j) is obtained from the multiplexer (MUX) based on the value of the horizontal index j.
The proposed max-pooling engine can be efficiently implemented by sharing common components. Firstly, the MS in the Rv block and the Dv block for each v can be shared. Since the output values cv [0] and cv [1] of these two MSs are identical, they can be used commonly for a one-dimensional max-pooling engine. Secondly, since every vertical max-pooling engine MV (j) operates at non-overlapped timing for sequentially incoming yp (i,j) values, all components of MV (j) for 0 ≤ j ≤ W − 1 can be shared except for the registers used to store rank-counting values and delay-counting values. The RTB-MAXP engine appears somewhat intricate, as it involves tracking the ranks of the values within a window. However, this complexity offers potential benefits in terms of reduced critical paths or latency, since the output can be obtained merely by comparing the incoming value with the first-ranked value.
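The rank-tracking principle of Figures 1–5 can be modelled behaviourally in software. The Python sketch below is an approximation of the idea, not the VHDL: a sorted list plays the role of the registers r0, r1, …, and each entry's age plays the role of the delay-counting value dv, so the entry whose age reaches k is the value pushed out of the window:

```python
def rtb_maxpool_1d(row, k):
    """Behavioural model of the 1-D rank-tracking max-pooling engine.
    'ranks' mimics the rank-counting registers (values in descending order);
    each entry's age mimics the delay-counting value, and an entry is
    discarded once its age reaches k (pushed out of the window).  The row is
    zero-padded by floor(k/2) / floor((k-1)/2) at its edges, and the output
    at every position is the top-ranked value r0."""
    padded = [0] * (k // 2) + list(row) + [0] * ((k - 1) // 2)
    ranks = []  # list of (value, age), kept sorted by value, descending
    out = []
    for j, x in enumerate(padded):
        # age every entry; evict the one that left the k-wide window
        ranks = [(v, a + 1) for v, a in ranks if a + 1 < k]
        ranks.append((x, 0))                     # insert new value at its rank
        ranks.sort(key=lambda t: (-t[0], t[1]))
        if j >= k - 1:                           # window is full: emit r0
            out.append(ranks[0][0])
    return out
```

In the hardware, the insertion and eviction happen in a single clock cycle via the MS-controlled shifts of Table 1 rather than by re-sorting.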

4. Cascaded Maximum-Based Max-Pooling (CMB-MAXP) Engine

In this section, another two-step max-pooling operation engine, named the CMB-MAXP engine, is presented. In the CMB-MAXP engine, the horizontal axis max-pooling operation of Equation (2) is accomplished through cascaded maximum operations, as illustrated in Figure 7. The cascaded F/Fs and the maximum operators find the maximum values yp^w(i, j) for 1 ≤ w ≤ K − 1 of the input sequence elements xp(i, j), xp(i, j − 1), xp(i, j − 2), …, xp(i, j − w). The cascaded maximum operation can be expressed in a recursive form, as shown in Equation (4).
yp^w(i, j) = max( xp(i, j), xp(i, j − 1) ),  w = 1,
yp^w(i, j) = max( yp^{w−1}(i, j), xp(i, j − w) ),  w ≥ 2.  (4)
The element yp^{k−1}(i, j) is chosen as the output sequence element yp(i, j) of the horizontal axis max-pooling operation by a MUX according to the scalable kernel size parameter k, i.e., yp(i, j) = yp^{k−1}(i, j).
The structure of the vertical axis max-pooling operation is shown in Figure 8, and it is almost identical to that of the horizontal axis max-pooling engine except for the additional two-dimensional memory elements M(v, w) for 0 ≤ v ≤ K − 1 and 0 ≤ w ≤ W − 1. The memory elements are employed for storing the previous rows’ values of the feature map yp(i − 1, j), yp(i − 2, j), …, yp(i − K + 1, j) for the vertical max-pooling operations. The previously stored elements yp(i − 1, j), yp(i − 2, j), …, yp(i − K + 1, j) are fed into F/Fs from the memory elements Mp(0, j), Mp(1, j), …, Mp(K − 1, j), and the newly delayed elements yp(i, j), yp(i − 1, j), …, yp(i − K + 2, j) are recursively fed into the corresponding memory elements. The maximum values zp^w(i, j) of the elements yp(i, j), yp(i − 1, j), yp(i − 2, j), …, yp(i − w, j) are obtained by comparators, and the final output sequence element zp(i, j) is obtained from the MUX based on the value of k.
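The cascade of Equation (4) can likewise be modelled in a few lines of Python (an illustrative sketch, not the VHDL; in the hardware the centred window alignment is produced by the engine's latency, whereas this model simply takes the maximum of the most recent k samples):

```python
def cascaded_max(seq, k):
    """Software model of the cascaded maximum chain of Equation (4):
    y^1(j) = max(x(j), x(j-1)), y^w(j) = max(y^{w-1}(j), x(j-w)), and the
    MUX selects y^{k-1}(j), i.e. the maximum of the last k samples.  The
    first k-1 outputs simply use however many samples are available."""
    out = []
    for j in range(len(seq)):
        y = seq[j]
        for w in range(1, k):        # cascade of k - 1 maximum operators
            if j - w >= 0:
                y = max(y, seq[j - w])
        out.append(y)                # MUX output y^{k-1}(j)
    return out
```

Because k only selects a tap in the cascade, the same chain of K − 1 maximum operators serves every window size 2 ≤ k ≤ K at runtime.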

5. Implementations

The proposed RTB-MAXP engine and CMB-MAXP engine were implemented for employment in an FPGA-based CNN accelerator. The target model for the CNN was YOLOv4-CSP-S-Leaky designed for object detection [8]. It consists of 108 layers, including 3 × 3 convolution layers, 1 × 1 convolution layers, residual addition layers, concatenation layers, max-pooling layers, and up-sampling layers. The max-pooling layers are utilized for the SPP operation [10,11,12]. When the input feature map size of the model is 256 × 256, the input feature map size of the max-pooling operations is 32 × 32, with window sizes of 5, 9, and 13. Therefore, the maximum window size K was chosen as 13 and the maximum width of the feature map W was set as 32 for the proposed max-pooling engines. We applied 8-bit quantization to the weight and bias parameters for the convolution operation, along with 8-bit representation for the feature maps.
The max-pooling engines were designed using VHSIC hardware description language (VHDL), and their behaviors were verified through simulations using ModelSim, Mentor Graphics’ simulation and debugging tool for digital logic circuits. Figure 9 and Figure 10 depict the simulation results for the RTB-MAXP engine and the CMB-MAXP engine, respectively. The parameters k and W were denoted as ksize_i and width_i and set to 5 and 0x20 (=32), respectively. Note that the signal suffixes _i and _o denote engine inputs and outputs, respectively. The signals inpchan_i, row_i, and valid_i were used for data synchronization, marking the start of a new input channel, the start of a new row, and a valid data point, respectively. The signal data_i represents the feature map data synchronized with these signals. The output data signal data_o of the max-pooling engines comes out along with the data synchronization signals outchan_o and valid_o. The latency of the RTB-MAXP engine is defined as ⌊W/2⌋ × ⌊k/2⌋ + ⌊k/2⌋ + 1, whereas the CMB-MAXP engine exhibits a latency of ⌊W/2⌋ × ⌊k/2⌋ + ⌊k/2⌋ + k. Buffering by W × ⌊k/2⌋ + ⌊k/2⌋ is technically necessitated to determine the maximum value within a k × k window. However, owing to the zero-padding by ⌊k/2⌋ at the edges of the feature maps, only a ⌊W/2⌋ × ⌊k/2⌋ delay is required. The additional delay of 1 in the RTB-MAXP engine and k in the CMB-MAXP engine results from the implementation of comparators with flip-flops, aimed at reducing the critical paths. Therefore, with the specific parameters k = 5 and W = 32, the latency is given as 21 clock cycles for the RTB-MAXP engine and 25 clock cycles for the CMB-MAXP engine.
In order to accelerate the max-pooling operations for the YOLOv4-CSP-S-Leaky CNN model, we employed either 16 RTB-MAXP engines or 16 CMB-MAXP engines to simultaneously process 16 feature maps (0 ≤ p ≤ 15). These 16 max-pooling engines collectively increase the processing speed by a factor of 16, at the expense of FPGA resource utilization. For the implementation on the Xilinx VCU118 evaluation platform, we synthesized, placed, and routed the two designed max-pooling engines using the Xilinx Vivado Design Suite. The platform features the Xilinx Virtex UltraScale+ FPGA XCVU9P. In Table 2, we present the resource utilization of both the RTB-MAXP engine and the CMB-MAXP engine with K = 13 and W = 32 as a result of the implementation for the targeted FPGA. The RTB-MAXP engine utilized 158,515 LUTs and 99,342 flip-flops, constituting 13.4% and 4.2% of the total available resources, respectively. On the other hand, the CMB-MAXP engine employed 28,765 LUTs, 2688 LUTRAMs, and 76,906 flip-flops, representing 2.43%, 0.45%, and 3.25%, respectively. It is worth noting that the CMB-MAXP engine requires significantly fewer LUTs than the RTB-MAXP engine, but it involves the use of LUTRAMs as well. This is due to the necessity of buffering the previous row data, with LUTs serving this purpose in the RTB-MAXP engine and LUTRAMs in the CMB-MAXP engine. The maximum operating frequency was determined to be 483.2 MHz for the RTB-MAXP engine and 562.4 MHz for the CMB-MAXP engine. Consequently, with each engine processing one byte per clock cycle, the aggregate throughput is 16 times the operating frequency, i.e., 7731.2 (= 16 × 483.2) MBPS (megabytes per second) and 8998.4 (= 16 × 562.4) MBPS, respectively.
To perform a comparison with previous relevant studies [18,19,20,21], we conducted additional synthesis, placement, and routing of both the RTB-MAXP engine and the CMB-MAXP engine using the Xilinx ISE Design Suite, targeting the Xilinx Virtex-4 XC4VSX25 device utilized in those studies [19,20]. For this comparison, we considered only one instance of the RTB-MAXP engine and one instance of the CMB-MAXP engine, with the maximum window size set to 5 (K = 5). Whereas the RTB-MAXP engine utilizes approximately three times the resources of the eight-stage pipelined architecture in [19], the resource usage of the CMB-MAXP engine is comparable. The low-complexity pipelined rank filter described in [20] requires only about half the resources of the CMB-MAXP engine; however, the two proposed max-pooling engines exhibited slightly superior performance in terms of maximum operating frequency. In contrast, when comparing the CMB-MAXP engine with the hybrid sorting network architecture in [21], the CMB-MAXP engine demonstrated about half the resource utilization but a slightly lower maximum operating frequency. It is worth emphasizing that only the two proposed max-pooling engines offer runtime window scalability and boundary padding capability, which are essential requirements for CNN accelerators. Please note that the numbers of slices (marked as “-” in the # Slices row of Table 3) were not provided in the respective papers [18,21].
The proposed RTB-MAXP engine and CMB-MAXP engine were integrated into our CNN accelerator, which was tested on the target VCU118 evaluation platform board. Figure 11 displays the results of the CNN tests using the YOLOv4-CSP-S-Leaky model for object detection [10]. The results demonstrate that the CNN accelerator detected three people and one bus. Note that a comparison in terms of mean average precision (mAP) is not included in this paper, since the max-pooling operation is not part of the loss operation.

6. Conclusions

In this paper, we proposed two two-stage max-pooling engines for the max-pooling operation in CNNs: the RTB-MAXP engine and the CMB-MAXP engine. The RTB-MAXP engine determines the maximum value by tracking the rank of the values within the window and the CMB-MAXP engine obtains the maximum value through cascading multiple maximum operations. These engines were implemented using VHDL and thoroughly verified by simulations. The implementation results demonstrate that the 16 CMB-MAXP engines achieved a remarkable throughput of about 9 GBPS (gigabytes per second) while utilizing only around 3% of the available resources on the Xilinx Virtex UltraScale+ FPGA XCVU9P. On the other hand, the 16 RTB-MAXP engines exhibited somewhat lower throughput and resource utilization, although they did offer a slightly better latency compared to the CMB-MAXP engines. In the comparison with existing techniques, the CMB-MAXP engine exhibited comparable implementation results in terms of the resource utilization and maximum operating frequency. It is essential to note that only the proposed engines provide the features of runtime window scalability and boundary padding capability, which are essential requirements for CNN accelerators. The proposed max-pooling engines were employed and tested in our CNN accelerator targeting the CNN model YOLOv4-CSP-S-Leaky for object detection.

Author Contributions

Conceptualization, E.H.; methodology, E.H., K.-A.C. and J.J.; validation, E.H., K.-A.C. and J.J.; investigation, E.H., K.-A.C. and J.J.; resources, K.-A.C.; data curation, J.J.; writing—original draft preparation, E.H.; writing—review and editing, E.H., K.-A.C. and J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are unavailable due to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object Detection with Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed]
  2. Lee, D.-H. Fully Convolutional Single-Crop Siamese Networks for Real-Time Visual Object Tracking. Electronics 2019, 8, 1084. [Google Scholar] [CrossRef]
  3. Shawahna, A.; Sait, S.; El-Maleh, A. FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review. IEEE Access 2018, 4, 7823–7859. [Google Scholar] [CrossRef]
  4. Huang, J.; Liu, X.; Guo, T.; Zhao, Z. A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator. Electronics 2023, 12, 1571. [Google Scholar] [CrossRef]
  5. Xie, Y.; Majoros, T.; Oniga, S. FPGA-Based Hardware Accelerator on Portable Equipment for EEG Signal Patterns Recognition. Electronics 2022, 11, 2410. [Google Scholar] [CrossRef]
  6. Zhang, L.; Tang, X.; Hu, X.; Zhou, T.; Peng, Y. FPGA-Based BNN Architecture in Time Domain with Low Storage and Power Consumption. Electronics 2022, 11, 1421. [Google Scholar] [CrossRef]
  7. Pettersson, L. Convolutional Neural Networks on FPGA and GPU on the Edge: A Comparison; Uppsala University: Uppsala, Sweden, 2020. [Google Scholar]
  8. Lomas-Barrie, V.; Silva-Flores, R.; Neme, A.; Pena-Cabrera, M. A Multiview Recognition Method of Predefined Objects for Robot Assembly Using Deep Learning and Its Implementation on an FPGA. Electronics 2022, 11, 696. [Google Scholar] [CrossRef]
  9. Zhou, H.; Xiao, Y.; Zheng, Z.; Yang, B. YOLOv2-tiny Target Detection System Based on FPGA Platform. In Proceedings of the 2022 3rd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), Xi’an, China, 15–17 July 2022; pp. 289–292. [Google Scholar]
  10. Wang, C.; Bochkovskiy, A.; Liao, H. Scaled-Yolov4: Scaling Cross Stage Partial Network. In Proceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  11. Bochkovskiy, A.; Wang, C.; Liao, H. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  12. Rzaev, E.; Khanaev, A.; Amerikanov, A. Neural Network for Real-Time Object Detection on FPGA. In Proceedings of the 2021 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), Sochi, Russia, 17–21 May 2021; pp. 719–723. [Google Scholar]
  13. Archana, V. An FPGA-Based Computation-Efficient Convolutional Neural Network Accelerator. In Proceedings of the 2022 IEEE International Power and Renewable Energy Conference (IPRECON), Kollam, India, 16–18 December 2022; pp. 1–4. [Google Scholar] [CrossRef]
  14. Wang, Z.; Xu, K.; Wu, S.; Liu, L.; Wang, D. Sparse-YOLO: Hardware/Software Co-Design of an FPGA Accelerator for YOLOv2. IEEE Access 2020, 8, 116569–116585. [Google Scholar] [CrossRef]
  15. Zhao, B.; Chong, Y.; Do, A. Area and Energy Efficient 2D Max-Pooling for Convolutional Neural Network Hardware Accelerator. In Proceedings of the IECON 2020 The 46th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 18–21 October 2020; pp. 423–427. [Google Scholar]
  16. Zhao, D. F-CNN: An FPGA-based framework for training Convolutional Neural Networks. In Proceedings of the 2017 International Conference on Field Programmable Technology (ICFPT), Melbourne, VIC, Australia, 11–13 December 2017; pp. 107–114. [Google Scholar] [CrossRef]
  17. Satti, P.; Sharma, N.; Garg, B. Min-Max Average Pooling Based Filter for Impulse Noise Removal. IEEE Signal Process. Lett. 2020, 27, 1475–1479. [Google Scholar] [CrossRef]
  18. Szedo, G. Two-Dimensional Rank Order Filter; Xilinx Application Note XAPP953; 2006; pp. 1–17. Available online: https://docs.xilinx.com/v/u/en-US/xapp953 (accessed on 7 August 2023).
  19. Choo, C.; Verma, P. A Real-Time Bit-Serial Rank Filter Implementation Using Xilinx FPGA. Real-Time Image Process. 2008, 6811, 125–132. [Google Scholar] [CrossRef]
  20. Prokin, D.; Prokin, M. Low Hardware Complexity Pipelined Rank Filter. IEEE Trans. Circuits Syst. II Express Briefs 2010, 57, 446–450. [Google Scholar] [CrossRef]
  21. Sambamurthy, N.; Kamaraju, M. Power Optimized Hybrid Sorting-Based Median Filtering. Int. J. Digit. Signals Smart Syst. 2020, 4, 80–86. [Google Scholar] [CrossRef]
Figure 1. Operating principle of the horizontal max-pooling engine with maximum possible window size K when the incoming value xp(i,j) satisfies r1 < xp(i,j) < r0 and r3 is pushed out of the window.
Figure 2. Rank-counting block consisting of Rv blocks (v = 0, 1, …, K − 1) and comparator blocks, marked as “>”.
Figure 3. Structure of the Rv block consisting of the multiplexer, the multiplexer switch (MS), and the register rv.
Figure 4. Delay-counting block consisting of Dv (v = 0, 1, …, K − 1); comparators, marked as “=”; and OR gates.
Figure 5. Structure of Dv consisting of a multiplexer, a multiplexer switch (MS), and the register dv.
Figure 6. Block diagram for the two-dimensional RTB-MAXP engine.
Figure 7. The horizontal axis max-pooling operation of the CMB-MAXP engine with the scalable kernel size k.
Figure 8. The vertical axis max-pooling operation of the CMB-MAXP engine with scalable kernel size k.
Figure 9. The simulation result of the RTB-MAXP engine with k = 5 and W = 32.
Figure 10. The simulation result of the CMB-MAXP engine with k = 5 and W = 32.
Figure 11. The test results of our CNN accelerator employing the CMB-MAXP engine on the VCU118 platform.
Table 1. The relationships between the multiplexer switch (MS)’s input and output and the output of the multiplexer exploited as Rv and Dv.

| MS input: mv−1 | mv | mv+1 | nv | MS output: cv[0] | cv[1] | Multiplexer output: rv | dv |
|---|---|---|---|---|---|---|---|
| 0 | 0 | X | 0 | 0 | 0 | rv | dv + 1 |
| 0 | 1 | X | 0 | 0 | 1 | xp(i,j) | 0 |
| 0 | 1 | X | 1 | 0 | 0 | rv | dv + 1 |
| 1 | 0 | X | 0 | 0 | 0 | rv | dv + 1 |
| 1 | 1 | X | 0 | 1 | 0 | rv+1 | dv+1 + 1 |
| 1 | 1 | X | 1 | 0 | 0 | rv | dv + 1 |
| X | 0 | 0 | 1 | 1 | 1 | rv−1 | dv−1 + 1 |
| X | 0 | 1 | 1 | 0 | 1 | xp(i,j) | 0 |
Table 2. Implementation results of the 16 RTB-MAXP engines and the 16 CMB-MAXP engines with K = 13 and W = 32 for YOLOv4-CSP-S-Leaky CNN model implementation on Xilinx Virtex UltraScale+ FPGA XCVU9P.

| Resource | RTB-MAXP UA | RTB-MAXP UP | CMB-MAXP UA | CMB-MAXP UP |
|---|---|---|---|---|
| # LUTs | 158,515 | 13.4% | 28,765 | 2.43% |
| # LUTRAMs | 0 | 0% | 2688 | 0.45% |
| # Flip-flops | 99,342 | 4.2% | 76,906 | 3.25% |
| Maximum operating frequency (MHz) | 483.2 | | 562.4 | |
| Latency (clock cycles) | ⌊W/2⌋ × ⌊k/2⌋ + ⌊k/2⌋ + 1 | | ⌊W/2⌋ × ⌊k/2⌋ + ⌊k/2⌋ + k | |

UA: utilization amount, UP: utilization percentage.
Table 3. Comparison of implementation results for the RTB-MAXP engine and the CMB-MAXP engine with previous relevant studies [18,19,20,21].

| | In [18] | In [19] | In [20] | In [21] | RTB-MAXP | CMB-MAXP |
|---|---|---|---|---|---|---|
| Device family | Xilinx Virtex-4 XC4VSX35 | Xilinx Virtex-4 XC4VSX25 | Xilinx Virtex-4 XC4VSX25 | Xilinx Artix-7 XC7A35T | Xilinx Virtex-4 XC4VSX25 | Xilinx Virtex-4 XC4VSX25 |
| # Slices | - | 1137 | 668 | - | 3182 | 1134 |
| # Slice FFs | 1169 | 698 | 964 | 3673 | 2020 | 1630 |
| # 4-input LUTs | 812 | 2055 | 766 | 5424 | 6070 | 2201 |
| # BRAMs | 12 | 5 | 0 | 0 | 0 | 0 |
| Maximum operating frequency (MHz) | 375 | 138.5 | 318.4 | 341.2 | 287.2 | 332.5 |
| Runtime window scalability | No | No | No | No | Yes | Yes |
| Boundary padding capability | No | No | No | No | Yes | Yes |
Hong, E.; Choi, K.-A.; Joo, J. Efficient Two-Stage Max-Pooling Engines for an FPGA-Based Convolutional Neural Network. Electronics 2023, 12, 4043. https://doi.org/10.3390/electronics12194043