Object Detection Post Processing Accelerator Based on Co-Design of Hardware and Software
Abstract
1. Introduction
- Minimizing redundant IoU calculations: Traditional methods often perform redundant IoU calculations. We introduce the priority of confidence threshold checking combined with classification (CTC-C) into NMS, effectively reducing the computation of ineffective IoUs (see the first sketch after this list). Given the operational similarities between the inference post process and the training post process, we also implement the priority of CTC-C in the training post process of Faster-RCNN. For YOLOv8s and GFL, we incorporate the priority of center point position checking combined with classification (CPC-C) to further reduce redundant IoU calculations.
- Reducing redundant comparisons: Traditional Top-K selection methods perform a full sort, which incurs unnecessary comparisons. We introduce a hybrid bitonic sorter that avoids this, efficiently sorting the bounding boxes by their IoUs and identifying the Top-K elements, thereby optimizing the selection of the best bounding boxes during training (see the second sketch after this list).
- Post processing software–hardware acceleration: We highlight the importance of hardware acceleration in both the training and inference post processes, proposing a co-design strategy to integrate the software and hardware for improved efficiency.
- Unified algorithm framework and hardware design: We explore unifying the training and inference post processing algorithms and present a hardware design framework that achieves significant speedups. Compared with recent accelerators, our design accelerates NMS by at least 7.55 times.
- Priorities of CTC-C and CPC-C: We address the redundant IoU calculations by introducing the priorities of CTC-C and CPC-C into the training and inference post processes, reducing the bounding boxes to 12.9% and 11.9% of their original counts, respectively, and achieving a speedup of 1.10 to 1.19 times.
- Hybrid bitonic sorter: We introduce a hybrid bitonic sorter that enables efficient parallel comparisons and Top-K selection, reducing redundant comparisons. This design achieves a speedup of 1.05 to 1.25 times over recent sorters.
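To make the first two contributions concrete, the two sketches below restate them in plain Python. They are minimal illustrations under assumed names and thresholds (`nms_ctc_c`, `bitonic_sort_desc`, `conf_thr`, and `iou_thr` are ours, not the paper's) and stand in for the accelerator's behavior rather than its RTL or exact software.

```python
import numpy as np

def iou(box, boxes):
    # IoU between one (x1, y1, x2, y2) box and an array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms_ctc_c(boxes, scores, classes, conf_thr=0.25, iou_thr=0.5):
    # CTC-C-style NMS: the cheap confidence check and per-class grouping
    # run before any IoU, so the expensive IoU is evaluated only among
    # same-class survivors -- the redundancy the CTC-C priority removes.
    keep = []
    m = scores >= conf_thr                       # confidence check first
    boxes, scores, classes = boxes[m], scores[m], classes[m]
    orig = np.where(m)[0]                        # indices into the full set
    for c in np.unique(classes):                 # classification check second
        order = np.where(classes == c)[0]
        order = order[np.argsort(-scores[order])]
        while order.size:
            best, order = order[0], order[1:]
            keep.append(int(orig[best]))
            if order.size:                       # IoU only on survivors
                order = order[iou(boxes[best], boxes[order]) <= iou_thr]
    return keep
```

The hybrid sorter builds on Batcher's bitonic network [65]; the plain power-of-two network below is only that baseline building block. As the sorter comparison later in the paper notes (Bitonic Sorter + Sorter Tree), our design pairs the network with a sorter tree so that Top-K selection does not require completing a full sort; that early-exit logic is not reproduced here.

```python
def bitonic_sort_desc(a):
    # Batcher's bitonic sorting network [65], largest first, in place on a
    # list whose length is a power of two. Each (k, j) level is a rank of
    # compare-exchange units that hardware evaluates in parallel.
    n = len(a)
    assert n and n & (n - 1) == 0, "network size must be a power of two"
    k = 2
    while k <= n:
        j = k // 2
        while j:
            for i in range(n):
                l = i ^ j
                if l > i:
                    if ((i & k) == 0 and a[i] < a[l]) or \
                       ((i & k) != 0 and a[i] > a[l]):
                        a[i], a[l] = a[l], a[i]
            j //= 2
        k *= 2
    return a

# Top-K on top of the full network, e.g. the 3 largest of 8 values:
top3 = bitonic_sort_desc([5, 1, 7, 3, 8, 2, 6, 4])[:3]   # [8, 7, 6]
```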
2. Related Works
2.1. Object Detection and Post Process Algorithm
2.2. The Hardware Accelerator in Inference Post Processing Algorithms
2.3. The Hardware Accelerator in the Sorter Algorithm
3. The Unification of the Post Process Algorithm and Optimization
3.1. Analysis of the Uniformity of Post Processing Algorithms
- Step 1. Pre Process: This step involves extracting the bounding box with the highest confidence or verifying whether a bounding box’s center point lies within the ground truth boxes. In detail, as shown in Figure 4, the light green background represents the Pre Process stage, where the input is data related to the bounding boxes, and the output is the qualified bounding boxes that have passed the classification and confidence checks and whose center positions have been verified. The core element is the Pre Process Engine.
- Step 2. Alignment: This step calculates the IoU between the bounding boxes and the ground truth. Both the Task-Aligned Assigner and the Dynamic Soft Label Assigner additionally compute alignment scores by integrating the classification values with the IoUs (a minimal sketch of these computations follows this list). As shown in Figure 4, the light pink background represents the Alignment stage, where the input includes the coordinate and category information of the qualified bounding boxes as well as that of the ground truth. Possible outputs include the IoU matrix and the alignment score matrix constructed from the qualified bounding boxes and the ground truth. The core elements are the IoU Engine, the Score Engine, and the Cost Engine: the IoU Engine calculates the IoU matrix, the Score Engine calculates the classification score matrix, and the Cost Engine calculates the alignment score matrix.
- Step 3. Dynamic Sampling: This step determines the number of bounding boxes corresponding to each ground truth target, a requirement specific to the Dynamic Soft Label Assigner based on its alignment scores. In Figure 4, the light yellow background represents the Dynamic Sampling stage, where the input is the alignment score matrix, and the output is the maximum number of bounding boxes that each ground truth can be assigned to. The core element is the Top-K Sampling Engine, used only for the Dynamic Soft Label Assigner.
- Step 4. Sorter: This final step involves sorting and selecting the bounding boxes based on the IoU matrix or the alignment score matrix. In Figure 4, the light blue background represents the Sorter stage, where possible inputs include the IoU matrix, the alignment score matrix, and the number of bounding boxes that each ground truth can be assigned to, with the output being the assignment of the bounding boxes to each ground truth or true target.
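As a concrete reference for Step 2, the sketch below computes the Alignment stage's two products in NumPy: the pairwise IoU matrix and a task-aligned score matrix. The `alpha`/`beta` defaults follow the TOOD paper [23] and are assumptions here; the engines' fixed-point hardware arithmetic is not modeled.

```python
import numpy as np

def iou_matrix(pred, gt):
    # Pairwise IoU between N predicted boxes and M ground-truth boxes,
    # both as (x1, y1, x2, y2) arrays; the result has shape (N, M).
    tl = np.maximum(pred[:, None, :2], gt[None, :, :2])  # intersection top-left
    br = np.minimum(pred[:, None, 2:], gt[None, :, 2:])  # intersection bottom-right
    inter = np.clip(br - tl, 0, None).prod(axis=-1)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p[:, None] + area_g[None, :] - inter + 1e-9)

def alignment_scores(cls_scores, ious, alpha=1.0, beta=6.0):
    # Task-aligned score t = s**alpha * u**beta, combining the (N, M)
    # classification-score matrix s with the IoU matrix u (cf. TOOD [23]).
    return (cls_scores ** alpha) * (ious ** beta)
```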
- NMS: In the Pre Process stage, NMS identifies the bounding box with the highest confidence score. During the Alignment stage, it calculates the IoU between the selected bounding box and the remaining ones. In the Sorter stage, the IoUs are sorted, and the bounding boxes are filtered based on predefined IoU thresholds. This iterative process continues until no bounding boxes remain.
- Max IoU Assigner: This algorithm filters the bounding boxes based on the confidence in the Pre Process stage. It then computes the IoU matrix between the bounding boxes and the ground truth boxes in the Alignment stage. In the Sorter stage, the IoUs are sorted and used to filter the bounding boxes according to the IoU thresholds, mapping each bounding box to its corresponding ground truth box.
- The Task-Aligned Assigner: In the Pre Process stage, this algorithm filters the bounding boxes based on confidence. The Alignment stage involves calculating the IoU matrix and the alignment score matrix. During the Sorter stage, the alignment scores are sorted, and bounding boxes are selected based on a fixed number of boxes per ground truth target, mapping them accordingly.
- The Dynamic Soft Label Assigner: This method filters the bounding boxes based on confidence in the Pre Process stage. During the Alignment stage, it calculates the IoU matrix and the alignment score matrix, using both the classification and IoU results. The number of allowable bounding boxes per ground truth target is determined from the alignment scores in the Dynamic Sampling stage. In the Sorter stage, the alignment scores are sorted, and bounding boxes are selected according to the number required for each ground truth target, effectively mapping them to the ground truth boxes (a sketch of these two stages follows this list).
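For the Dynamic Sampling and Sorter stages of the Dynamic Soft Label Assigner, the sketch below uses the common dynamic-k heuristic of summing each ground truth's top-q IoUs, as popularized by SimOTA in YOLOX [41]. Whether this exact rule matches the assigner evaluated in this work is an assumption; the sketch only illustrates the stage boundaries.

```python
import numpy as np

def dynamic_k(ious, q=10, k_min=1):
    # Dynamic Sampling sketch: budget k boxes per ground truth as the
    # (floored) sum of its top-q IoUs; ious has shape (N boxes, M gts).
    topq = np.sort(ious, axis=0)[-q:, :]      # best q IoUs per gt column
    return np.maximum(topq.sum(axis=0).astype(int), k_min)

def sorter_topk(align, ks):
    # Sorter-stage sketch: per ground truth, keep the k boxes with the
    # highest alignment scores; returns {gt index: [box indices]}.
    return {j: np.argsort(-align[:, j])[:int(k)].tolist()
            for j, k in enumerate(ks)}
```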
3.2. Redundancy Analysis of Post Processing Algorithms
3.3. Redundancy Optimizations for Post Processing Algorithms
4. Accelerator Framework Design
4.1. Overall Framework Design of the Accelerator
4.2. The Design of the Hybrid Bitonic Sorter
4.3. The Design of the Pre Process Submodule
4.4. The Design of the Alignment Submodule
4.5. The Design of the Dynamic Sampling and Sorter Submodules
5. Experiments and Result Analysis
5.1. The Experimental Setup
5.2. Redundancy Result Analysis
5.3. Analysis of the Results for Sorter Optimization
5.4. Analysis of the Results for the Post Processing Accelerator
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Hu, Y.; Yang, J.; Chen, L.; Li, K.; Sima, C.; Zhu, X.; Chai, S.; Du, S.; Lin, T.; Wang, W.; et al. Planning-oriented Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–23 June 2023; pp. 17853–17862.
- Zhou, X.; Lin, Z.; Shan, X.; Wang, Y.; Sun, D.; Yang, M.H. DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 21634–21643.
- Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402.
- Vandome, P.; Leauthaud, C.; Moinard, S.; Sainlez, O.; Mekki, I.; Zairi, A.; Belaud, G. Making technological innovations accessible to agricultural water management: Design of a low-cost wireless sensor network for drip irrigation monitoring in Tunisia. Smart Agric. Technol. 2023, 4, 100227.
- Chakraborty, R.; Kereszturi, G.; Pullanagari, R.; Durance, P.; Ashraf, S.; Anderson, C. Mineral prospecting from biogeochemical and geological information using hyperspectral remote sensing—Feasibility and challenges. J. Geochem. Explor. 2022, 232, 106900.
- Wang, R.; Chen, F.; Wang, J.; Hao, X.; Chen, H.; Liu, H. Prospecting criteria for skarn-type iron deposits in the thick overburden area of Qihe-Yucheng mineral-rich area using geological and geophysical modelling. J. Appl. Geophys. 2024, 105442.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Lee, Y.; Hwang, J.; Lee, S.; Bae, Y.; Park, J. An energy and GPU-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–20 June 2019.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–30 October 2019; pp. 6569–6578.
- Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012.
- Li, X.; Wang, W.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 16–22 June 2021; pp. 11632–11641.
- Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 9759–9768.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229.
- Zhang, S.; Li, C.; Jia, Z.; Liu, L.; Zhang, Z.; Wang, L. Diag-IoU loss for object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7671–7683.
- Basalama, S.; Sohrabizadeh, A.; Wang, J.; Guo, L.; Cong, J. FlexCNN: An end-to-end framework for composing CNN accelerators on FPGA. ACM Trans. Reconfigurable Technol. Syst. 2023, 16, 1–32.
- Jia, X.; Zhang, Y.; Liu, G.; Yang, X.; Zhang, T.; Zheng, J.; Xu, D.; Liu, Z.; Liu, M.; Yan, X. XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine. ACM Trans. Reconfigurable Technol. Syst. 2024, 17, 1–24.
- Wu, C.; Wang, M.; Chu, X.; Wang, K.; He, L. Low-precision floating-point arithmetic for high-performance FPGA-based CNN acceleration. ACM Trans. Reconfigurable Technol. Syst. 2021, 15, 1–21.
- Wu, D.; Fan, X.; Cao, W.; Wang, L. SWM: A high-performance sparse-winograd matrix multiplication CNN accelerator. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 936–949.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
- NanoDet-Plus: Super Fast and High Accuracy Lightweight Anchor-Free Object Detection Model. Available online: https://github.com/RangiLyu/nanodet (accessed on 8 January 2024).
- Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.; Huang, W. TOOD: Task-aligned one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499.
- Lin, J.; Zhu, L.; Chen, W.; Wang, W.; Gan, C.; Han, S. On-device training under 256KB memory. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; pp. 22941–22954.
- Lyu, B.; Yuan, H.; Lu, L.; Zhang, Y. Resource-constrained neural architecture search on edge devices. IEEE Trans. Netw. Sci. Eng. 2021, 9, 134–142.
- Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
- Tan, M.; Pang, R.; Le, Q. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 10–16 October 2016; pp. 21–37.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
- Law, H.; Deng, J. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750.
- Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 850–859.
- Liu, Z.; Zheng, T.; Xu, G.; Yang, Z.; Liu, H.; Cai, D. Training-time-friendly network for real-time object detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 11685–11692.
- Ghahremannezhad, H.; Shi, H.; Liu, C. Object detection in traffic videos: A survey. IEEE Trans. Intell. Transp. Syst. 2023, 24, 6780–6799.
- Geng, X.; Su, Y.; Cao, X.; Li, H.; Liu, L. YOLOFM: An improved fire and smoke object detection algorithm based on YOLOv5n. Sci. Rep. 2024, 14, 4543.
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355.
- Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel-Aviv, Israel, 23–27 October 2022; pp. 526–543.
- Kim, K.; Lee, H. Probabilistic anchor assignment with IoU prediction for object detection. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 355–371.
- Ge, Z.; Liu, S.; Li, Z.; Yoshie, O.; Sun, J. OTA: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 303–312.
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Pu, Y.; Liang, W.; Hao, Y.; Yuan, Y.; Yang, Y.; Zhang, C.; Hu, H.; Huang, G. Rank-DETR for high quality object detection. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023.
- Giroux, J.; Bouchard, M.; Laganiere, R. T-FFTRadNet: Object detection with Swin vision transformers from raw ADC radar signals. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4030–4039.
- Zeng, C.; Kwong, S.; Ip, H. Dual Swin-transformer based mutual interactive network for RGB-D salient object detection. Neurocomputing 2023, 559, 126779.
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5561–5569.
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 12993–13000.
- Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586.
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 658–666.
- Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR), Hong Kong, China, 20–24 August 2006; pp. 850–855.
- Song, Y.; Pan, Q.; Gao, L.; Zhang, B. Improved non-maximum suppression for object detection using harmony search algorithm. Appl. Soft Comput. 2019, 81, 105478.
- Chu, X.; Zheng, A.; Zhang, X.; Sun, J. Detection in crowded scenes: One proposal, multiple predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12214–12223.
- Liu, S.; Huang, D.; Wang, Y. Adaptive NMS: Refining pedestrian detection in a crowd. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 6459–6468.
- Gao, P.; Zheng, M.; Wang, X.; Dai, J.; Li, H. Fast convergence of DETR with spatially modulated co-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3621–3630.
- Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 14454–14463.
- Shi, M.; Ouyang, P.; Yin, S.; Liu, L.; Wei, S. A fast and power-efficient hardware architecture for non-maximum suppression. IEEE Trans. Circuits Syst. II Express Briefs 2019, 66, 1870–1874.
- Fang, C.; Derbyshire, H.; Sun, W.; Yue, J.; Shi, H.; Liu, Y. A sort-less FPGA-based non-maximum suppression accelerator using multi-thread computing and binary max engine for object detection. In Proceedings of the IEEE Asian Solid-State Circuits Conference (A-SSCC), Busan, Republic of Korea, 7–10 November 2021; pp. 1–3.
- Chen, C.; Zhang, T.; Yu, Z.; Raghuraman, A.; Udayan, S.; Lin, J.; Aly, M.M.S. Scalable hardware acceleration of non-maximum suppression. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 21–25 March 2022; pp. 96–99.
- Choi, S.B.; Lee, S.S.; Park, J.; Jang, S.J. Standard greedy non maximum suppression optimization for efficient and high speed inference. In Proceedings of the IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Gangwon, Republic of Korea, 1–3 November 2021; pp. 1–4.
- Anupreetham, A.; Ibrahim, M.; Hall, M.; Boutros, A.; Kuzhively, A.; Mohanty, A.; Nurvitadhi, E.; Betz, V.; Cao, Y.; Seo, J. End-to-end FPGA-based object detection using pipelined CNN and non-maximum suppression. In Proceedings of the 31st International Conference on Field-Programmable Logic and Applications (FPL), Dresden, Germany, 30 August–3 September 2021; pp. 76–82.
- Guo, Z.; Liu, K.; Liu, W.; Li, S. Efficient FPGA-based Accelerator for Post-Processing in Object Detection. In Proceedings of the International Conference on Field Programmable Technology (ICFPT), Yokohama, Japan, 11–14 December 2023; pp. 125–131.
- Chen, Y.; Zhang, J.; Lv, D.; Yu, X.; He, G. O3 NMS: An Out-Of-Order-Based Low-Latency Accelerator for Non-Maximum Suppression. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5.
- Sun, K.; Li, Z.; Zheng, Y.; Kuo, H.W.; Lee, K.P.; Tang, K.T. An area-efficient accelerator for non-maximum suppression. IEEE Trans. Circuits Syst. II Express Briefs 2023, 70, 2251–2255.
- Anupreetham, A.; Ibrahim, M.; Hall, M.; Boutros, A.; Kuzhively, A.; Mohanty, A.; Nurvitadhi, E.; Betz, V.; Cao, Y.; Seo, J. High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design. ACM Trans. Reconfigurable Technol. Syst. 2024, 17, 1–20.
- Batcher, K.E. Sorting networks and their applications. In Proceedings of the Spring Joint Computer Conference, Atlantic City, NJ, USA, 30 April–2 May 1968; pp. 307–314.
- Chen, Y.R.; Ho, C.C.; Chen, W.T.; Chen, P.Y. A Low-Cost Pipelined Architecture Based on a Hybrid Sorting Algorithm. IEEE Trans. Circuits Syst. I Regul. Pap. 2023, 71, 717–730.
- Zhao, J.; Zeng, P.; Shen, G.; Chen, Q.; Guo, M. Hardware-software co-design enabling static and dynamic sparse attention mechanisms. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2024, 43, 2783–2796.
- Fang, C.; Sun, W.; Zhou, A.; Wang, Z. Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2023, 43, 506–519.
- Zhang, H.; Wu, W.; Ma, Y.; Wang, Z. Efficient hardware post processing of anchor-based object detection on FPGA. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Limassol, Cyprus, 6–8 July 2020; pp. 580–585.
| Design | Standard IoU | Priority of Confidence Threshold Checking | Priority of Classification | Support for Training Post Process |
|---|---|---|---|---|
| 2019-TCASII [56] | × | × | × | × |
| 2021-ASSCC [57] | × | × | × | × |
| 2021-FPL [60] | ✓ | ✓ | × | × |
| 2021-ICCE [59] | ✓ | ✓ | * | × |
| 2022-DATE [58] | × | ✓ | × | × |
| 2023-FPT [61] | ✓ | ✓ | × | × |
| 2023-ISCAS [62] | ✓ | ✓ | × | × |
| 2023-TCASII [63] | × | ✓ | × | × |
| 2024-TReTS [64] | ✓ | ✓ | * | × |
| Ours | ✓ | ✓ | ✓ | ✓ |
| Design | Sorter Type | Support for Full Sorting | Support for Top-K Sorting |
|---|---|---|---|
| - | Bitonic Sorter | ✓ | ✓ |
| 2023-TCAD [68] | Insertion Sorter | × | ✓ |
| 2023-TCASI [66] | Bitonic Sorter + Insertion Sorter | ✓ | ✓ |
| 2024-TCAD [67] | Variant of Bitonic Sorters | × | ✓ |
| Ours | Bitonic Sorter + Sorter Tree | ✓ | ✓ |
| Sorter | Sorter Type | Cycles |
|---|---|---|
| Parallel Bubble Sorter | Bubble Sorter | 7, 7@32:1, 7@32:16 |
| 2023-TCAD [68] | Insertion Sorter | 261, 261@32:1, 261@32:16 |
| Bitonic Sorter | Bitonic Sorter | 21, 21@32:1, 21@32:16 |
| 2023-TCASI [66] | Bitonic Insertion Sorter | 106, 15@32:1, 106@32:16 |
| Ours | Hybrid Bitonic Sorter | 21, 15@32:1, 21@32:16 |
| | 2023-TCAD [68] | Bitonic Sorter | 2023-TCASI [66] | Ours |
|---|---|---|---|---|
| Sorter Type | Insertion Sorter | Bitonic Sorter | Bitonic Insertion Sorter | Hybrid Bitonic Sorter |
| Frequency (MHz) | 200 | 200 | 200 | 200 |
| Time (ms) | 0.416@F | 0.252@F | 0.309@F | 0.248@F |
| | 0.208@G | 0.184@G | 0.193@G | 0.182@G |
| | 0.206@Y | 0.186@Y | 0.193@Y | 0.187@Y |
| | 0.388@F-NMS | 0.255@F-NMS | 0.304@F-NMS | 0.253@F-NMS |
| | 0.220@G-NMS | 0.192@G-NMS | 0.202@G-NMS | 0.190@G-NMS |
| | 0.210@Y-NMS | 0.188@Y-NMS | 0.196@Y-NMS | 0.186@Y-NMS |
| Static Power (W) | 0.222 | 0.218 | 0.228 | 0.216 |
| Dynamic Power (W) | 2.806 | 2.010 | 3.512 | 2.105 |
| LUT | 204,200 | 158,329 | 351,365 | 166,417 |
| FF | 154,203 | 135,901 | 181,732 | 136,652 |
| | 2023-TCASII [63] | 2024-TReTS [64] | Ours |
|---|---|---|---|
| Confidence Threshold Checking in Inference | ✓ | ✓ | ✓ |
| Box Center Position Checking in Training | ✓ | ✓ | ✓ |
| Classification Priority | × | * | ✓ |
| Frequency (MHz) | 200 | 200 | 200 |
| Time (ms) | 0.295@F | 0.272@F | 0.248@F (1.10×) |
| | 0.217@G | 0.214@G | 0.182@G (1.18×) |
| | 0.209@Y | 0.212@Y | 0.187@Y (1.13×) |
| | 0.368@F-NMS | 0.287@F-NMS | 0.253@F-NMS (1.13×) |
| | 0.228@G-NMS | 0.226@G-NMS | 0.190@G-NMS (1.19×) |
| | 0.209@Y-NMS | 0.209@Y-NMS | 0.186@Y-NMS (1.12×) |
| | 2020-ISVLSI [69] | 2021-ICCE [59] | 2022-DATE [58] | 2023-FPT [61] | Ours |
|---|---|---|---|---|---|
| Support Train | × | × | × | × | ✓ |
| IoU Accuracy Expression | ✓ | ✓ | ✓ | ✓ | ✓ |
| Testing Mode | IP Simulation | Data From DDR | IP Simulation | IP Simulation | Data Fetched by DMA From DDR / IP Simulation |
| Frequency (MHz) | 100 | - | 400 | 150 | 200 |
| Time (ms) | 0.032@NMS | 1.91@NMS; 5.19@NMS | 0.05@NMS | 0.014@NMS | DMA: 0.248@F, 0.182@G, 0.187@Y, 0.253@F-NMS, 0.190@G-NMS, 0.186@Y-NMS; IP sim: 0.080@F, 0.014@G, 0.015@Y, 0.083@F-NMS, 0.022@G-NMS, 0.017@Y-NMS |
| Box Number | 3000 | - | 1000 | 2880 | 22,000@F; 8400@G; 6300@Y |
| Selected Boxes | 5 | 11/48 | 5 | 24 | 10 |
| Platform | ZYNQ7-ZC706 | ZCU106 | Gensys2 | Virtex7 690T | ZCU102 |
| LUT | - | - | 14,188 | 26,404 | 166,417 |
| FF | 11,139 | - | 2924 | 21,127 | 136,652 |
| LUTRAM | 714 | - | 7504 | 2 | 0 |
| DSP | 22 | - | - | 450 | 0 |