An Enhanced Detector for Vulnerable Road Users Using Infrastructure-Sensors-Enabled Device
Abstract
1. Introduction
2. Methods
2.1. Overview of the Enhanced VRU Detector
2.1.1. Original YOLOv7 Framework
2.1.2. Overall Architecture for the VRU Detector
2.2. More Precise VRU Detection
2.2.1. ReXNet-SimAM
- (1) Rethinking the Backbone Network as ReXNet
- (2) An Attention Mechanism without Increasing Parameter Quantity
2.2.2. Dyhead for YOLOv7-Tiny
2.3. More Efficient VRU Detection
2.3.1. VoV-GSCSP Block for Slim-Neck
2.3.2. Lightweight Network Using LAMP
3. Experiments and Results
3.1. Setups and Datasets
3.1.1. Experimental Environment
3.1.2. Training and Testing Datasets
3.2. Evaluation Metrics for Object Detection
3.3. Implementation Results
3.3.1. Quantitative Analysis for Public Benchmarks
3.3.2. Qualitative Analysis with Visualization and Case Study
4. Conclusions and Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- World Health Organization. Global Status Report on Road Safety 2018: Summary; World Health Organization: Geneva, Switzerland, 2018.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–25 July 2017; pp. 7263–7271.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
- Mallela, N.C.; Volety, R.; Rk, N. Detection of the triple riding and speed violation on two-wheelers using deep learning algorithms. Multimed. Tools Appl. 2021, 80, 8175–8187.
- Wang, H.; Jin, L.; He, Y.; Huo, Z.; Wang, G.; Sun, X. Detector–Tracker Integration Framework for Autonomous Vehicles Pedestrian Tracking. Remote Sens. 2023, 15, 2088.
- Kumar, C.; Ramesh, J.; Chakraborty, B.; Raman, R.; Weinrich, C.; Mundhada, A. VRU Pose-SSD: Multiperson pose estimation for automated driving. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021.
- Aziz, K.; De Greef, E.; Rykunov, M.; Bourdoux, A.; Sahli, H. Radar-camera fusion for road target classification. In Proceedings of the 2020 IEEE Radar Conference (RadarConf20), Florence, Italy, 21–25 September 2020; pp. 1–6.
- Mordan, T.; Cord, M.; Pérez, P.; Alahi, A. Detecting 32 pedestrian attributes for autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1–13.
- Zhou, C.; Wu, M.; Lam, S.K. Group Cost-Sensitive BoostLR with Vector Form Decorrelated Filters for Pedestrian Detection. IEEE Trans. Intell. Transp. Syst. 2019, 21, 5022–5035.
- Savkin, A.; Lapotre, T.; Strauss, K.; Akbar, U.; Tombari, F. Adversarial Appearance Learning in Augmented Cityscapes for Pedestrian Recognition in Autonomous Driving. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 21–25 September 2020; pp. 3305–3311.
- Zhao, M.; Liu, Q.; Jha, A. VoxelEmbed: 3D Instance Segmentation and Tracking with Voxel Embedding Based Deep Learning. In Proceedings of the 12th International Workshop on Machine Learning in Medical Imaging (MLMI 2021), Strasbourg, France, 27 September 2021; pp. 437–446.
- Zhao, M.; Jha, A.; Liu, Q.; Millis, B.A.; Mahadevan-Jansen, A.; Lu, L.; Landman, B.A.; Tyska, M.J.; Huo, Y. Faster Mean-shift: GPU-accelerated clustering for cosine embedding-based cell segmentation and tracking. Med. Image Anal. 2021, 71, 102048.
- Talaat, F.M.; ZainEldin, H. An Improved Fire Detection Approach Based on YOLO-v8 for Smart Cities. Neural Comput. Appl. 2023, 35, 20939–20954.
- Zhang, J.; Letaief, K.B. Mobile Edge Intelligence and Computing for the Internet of Vehicles. Proc. IEEE 2020, 108, 246–261.
- Savaglio, C.; Barbuto, V.; Awan, F.M.; Minerva, R.; Crespi, N.; Fortino, G. Opportunistic Digital Twin: An Edge Intelligence Enabler for Smart City. ACM Trans. Sen. Netw. 2023, accepted.
- Deng, S.; Zhao, H.; Fang, W.; Yin, J.; Dustdar, S.; Zomaya, A.Y. Edge Intelligence: The Confluence of Edge Computing and Artificial Intelligence. IEEE Internet Things J. 2020, 7, 7457–7469.
- Dai, Y.; Liu, W.; Xie, W.; Liu, R.; Zheng, Z.; Long, K.; Wang, L.; Mao, L.; Qiu, Q.; Ling, G. Making you only look once faster: Toward real-time intelligent transportation detection. IEEE Intell. Transp. Syst. Mag. 2023, 15, 8–25.
- Lan, Q.; Tian, Q. Instance, scale, and teacher adaptive knowledge distillation for visual detection in autonomous driving. IEEE Trans. Intell. Veh. 2022, 8, 2358–2370.
- Song, F.; Li, P. YOLOv5-MS: Real-time multi-surveillance pedestrian target detection model for smart cities. Biomimetics 2023, 8, 480.
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
- Han, D.; Yun, S.; Heo, B.; Yoo, Y. ReXNet: Diminishing representational bottleneck on convolutional neural network. arXiv 2020, arXiv:2007.00992.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11863–11874.
- Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 7373–7382.
- Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
- He, Y.; Zhang, X.; Sun, J. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1389–1397.
- Courbariaux, M.; Bengio, Y.; David, J.P. BinaryConnect: Training deep neural networks with binary weights during propagations. Adv. Neural Inf. Process. Syst. 2015, 28, 3123–3131.
- Lee, J.; Park, S.; Mo, S.; Ahn, S.; Shin, J. Layer-adaptive Sparsity for the Magnitude-based Pruning. In Proceedings of the International Conference on Learning Representations, Online, 3–7 May 2021.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
- Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2633–2642.
Input | Operator | Channels | Stride |
---|---|---|---|
224² × 3 | Conv3 × 3 | 32 | 2
112² × 32 | Bottleneck1 | 16 | 1
112² × 16 | Bottleneck6 | 27 | 2
56² × 27 | Bottleneck6 | 38 | 1
56² × 38 | Bottleneck6 | 50 | 2
28² × 50 | Bottleneck6 | 61 | 1
28² × 61 | Bottleneck6 | 72 | 2
14² × 72 | Bottleneck6 | 84 | 1
14² × 84 | Bottleneck6 | 95 | 1
14² × 95 | Bottleneck6 | 106 | 1
14² × 106 | Bottleneck6 | 117 | 1
14² × 117 | Bottleneck6 | 128 | 1
14² × 128 | Bottleneck6 | 140 | 2
7² × 140 | Bottleneck6 | 151 | 1
7² × 151 | Bottleneck6 | 162 | 1
7² × 162 | Bottleneck6 | 174 | 1
7² × 174 | Bottleneck6 | 185 | 1
7² × 185 | Conv1 × 1, Pooling7 × 7 | 1280 | 1
1² × 1280 | FC | 1000 | 1
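Bottleneck1 and Bottleneck6 in the table denote inverted-residual blocks with expansion ratios 1 and 6, the MobileNetV2-style design that ReXNet builds on. The following PyTorch snippet is a minimal sketch of such a block under that assumption; it is not the authors' exact implementation (ReXNet's activation choices and partial-residual connection are simplified here).

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Inverted-residual block sketch (expansion t = 1 or 6, as in the table above).
    Simplified relative to the official ReXNet code: plain skip connection and
    ReLU6 activations are assumed."""
    def __init__(self, c_in, c_out, stride=1, t=6):
        super().__init__()
        c_mid = c_in * t
        layers = []
        if t != 1:
            layers += [nn.Conv2d(c_in, c_mid, 1, bias=False),
                       nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True)]    # 1x1 expansion
        layers += [nn.Conv2d(c_mid, c_mid, 3, stride, 1, groups=c_mid, bias=False),
                   nn.BatchNorm2d(c_mid), nn.ReLU6(inplace=True),        # 3x3 depthwise
                   nn.Conv2d(c_mid, c_out, 1, bias=False),
                   nn.BatchNorm2d(c_out)]                                # 1x1 linear projection
        self.block = nn.Sequential(*layers)
        self.use_skip = (stride == 1 and c_in == c_out)

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y
```

Stacking these blocks with the channel and stride schedule listed above would reproduce the backbone configuration summarized in the table.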
Dataset | Image Size (Pixels) | Acquisition View | Training Set | Testing Set | VRU Classes Annotated |
---|---|---|---|---|---|
BDD100K | 1280 × 720 | Ego View | 70,000 | 20,000 | 4 (Person, Bike, Motor, Rider) |
LLVIP | 1280 × 1024 | CCTV View | 12,025 | 3463 | 1 (Person) |
De_VRU (ours) | 1920 × 1080 | CCTV View | 5000 | 1500 | 6 (Person, Bike, Motor, Rider, PTW, Tricycle) |
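The ablation tables that follow report mAP@0.5 and mAP@0.5:0.95 (Section 3.2). As a reminder, a detection counts as a true positive only if its IoU with an unmatched ground-truth box exceeds the threshold; the helper below is a generic sketch of that computation, not the authors' evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# mAP@0.5 averages per-class AP at an IoU threshold of 0.5;
# mAP@0.5:0.95 further averages over thresholds 0.50, 0.55, ..., 0.95 (COCO convention).
```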
Model | Model Size (M) | Params (M) | FLOPs (G) | mAP@0.5 | mAP@0.5:0.95 | Inference Time (s) |
---|---|---|---|---|---|---|
Baseline | 13.8 | 6.9 | 14.8 | 0.583 | 0.379 | 0.15 |
ReXNet1.0 | 13.1 ↑ | 5.6 ↑ | 14.6 ↑ | 0.531 ↓ | 0.342 ↓ | 0.13 ↑ |
ReXNet1.0-SimAM | 13.2 ↑ | 5.6 ↑ | 14.6 ↑ | 0.569 ↓ | 0.361 ↓ | 0.13 ↑ |
VoVGSCSP | 13.2 ↑ | 6.3 ↑ | 14.7 ↑ | 0.612 ↑ | 0.375 ↓ | 0.15 --
Dyhead | 13.8 -- | 8.0 ↓ | 16.0 ↓ | 0.643 ↑ | 0.401 ↑ | 0.16 ↓ |
ReXNet1.0-SimAM+VoVGSCSP | 12.9 ↑ | 5.1 ↑ | 14.2 ↑ | 0.607 ↑ | 0.401 ↑ | 0.12 ↑ |
ReXNet1.0-SimAM+ Dyhead | 13.8 -- | 6.8 ↑ | 14.6 ↑ | 0.622 ↑ | 0.454 ↑ | 0.15 --
VoVGSCSP +Dyhead | 13.7 ↑ | 7.4 ↓ | 15.4 ↓ | 0.718 ↑ | 0.437 ↑ | 0.18 ↓ |
ReXNet1.0-SimAM+VoVGSCSP +Dyhead | 14.1 ↓ | 7.1 ↓ | 15.2 ↓ | 0.709 ↑ | 0.441 ↑ | 0.17 ↓ |
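The "ReXNet1.0-SimAM" rows use the parameter-free SimAM attention of Yang et al., which is why they add essentially no parameters over plain ReXNet1.0. The PyTorch sketch below follows the published formulation; the λ value of 1e-4 is an assumed default, not a setting confirmed by this paper.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention: each activation is reweighted by a sigmoid of its
    inverse 'energy', computed from its deviation from the channel mean."""
    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        _, _, h, w = x.shape
        n = h * w - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # (x - mu)^2 per position
        v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance estimate
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5         # inverse energy
        return x * torch.sigmoid(e_inv)                     # reweighted feature map
```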
Model | Model Size (M) | Params (M) | FLOPs (G) | mAP@0.5 | mAP@0.5:0.95 | Inference Time (s) |
---|---|---|---|---|---|---|
Baseline | 13.2 | 6.4 | 14.1 | 0.598 | 0.364 | 0.14 |
ReXNet1.0 | 12.1 ↑ | 5.5 ↑ | 13.3 ↑ | 0.506 ↓ | 0.317 ↓ | 0.12 ↑ |
ReXNet1.0-SimAM | 12.2 ↑ | 5.5 ↑ | 13.3 ↑ | 0.610 ↑ | 0.357 ↓ | 0.13 ↑ |
VoVGSCSP | 12.7 ↑ | 5.8 ↑ | 13.8 ↑ | 0.590 ↓ | 0.373 ↑ | 0.14 --
Dyhead | 14.4 ↓ | 7.7 ↓ | 15.6 ↓ | 0.652 ↑ | 0.418 ↑ | 0.15 ↓ |
ReXNet1.0-SimAM+VoVGSCSP | 12.5 ↑ | 5.1 ↑ | 12.9 ↑ | 0.589 ↓ | 0.404 ↑ | 0.13 ↑ |
ReXNet1.0-SimAM+ Dyhead | 13.3 ↓ | 6.7 ↓ | 14.1 -- | 0.637 ↑ | 0.396 ↑ | 0.13 ↑ |
VoVGSCSP +Dyhead | 13.9 ↓ | 7.1 ↓ | 14.3 ↓ | 0.719 ↑ | 0.448 ↑ | 0.17 ↓ |
ReXNet1.0-SimAM+VoVGSCSP +Dyhead | 13.7 ↓ | 6.9 ↓ | 14.7 ↓ | 0.723 ↑ | 0.428 ↑ | 0.16 ↓ |
Model | Model Size (M) | Params (M) | FLOPs (G) | mAP@0.5 | mAP@0.5:0.95 | Inference Time (s) |
---|---|---|---|---|---|---|
Baseline | 13.8 | 6.9 | 14.8 | 0.583 | 0.379 | 0.15 |
ReXNet1.0-SimAM+VoVGSCSP +Dyhead | 14.1 (100%) | 7.1 (100%) | 15.2 (100%) | 0.709 (−0.000) | 0.441 (−0.000) | 0.17 |
LAMP—1.5× | 7.7 (54.6%) | 4.1 (57.0%) | 10.2 (66.9%) | 0.701 (−0.008) | 0.420 (−0.023) | 0.08 |
LAMP—2.0× | 5.9 (41.8%) | 3.0 (42.3%) | 7.6 (50.0%) | 0.645 (−0.064) | 0.405 (−0.036) | 0.07 |
Model | Model Size (M) | Params (M) | FLOPs (G) | mAP@0.5 | mAP@0.5:0.95 | Inference Time (s) |
---|---|---|---|---|---|---|
Baseline | 13.2 | 6.4 | 14.1 | 0.598 | 0.364 | 0.14 |
ReXNet1.0-SimAM+VoVGSCSP +Dyhead | 13.7 (100%) | 6.9 (100%) | 14.6 (100%) | 0.723 (−0.000) | 0.428 (−0.000) | 0.16 |
LAMP—1.5× | 7.4 (54.0%) | 3.9 (56.5%) | 9.8 (66.9%) | 0.721 (−0.002) | 0.421 (−0.007) | 0.08 |
LAMP—2.0× | 5.5 (40.1%) | 2.8 (40.6%) | 7.3 (50.0%) | 0.695 (−0.028) | 0.414 (−0.014) | 0.07 |
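The LAMP—1.5× and LAMP—2.0× rows compress the enhanced model with layer-adaptive magnitude-based pruning (Lee et al.): each weight is scored by its squared magnitude normalized by the sum of squared magnitudes of all not-smaller weights in the same tensor, and the globally lowest-scoring connections are removed. The sketch below illustrates that scoring under those assumptions; the function names are illustrative and not taken from the paper's code.

```python
import torch

def lamp_scores(weight: torch.Tensor) -> torch.Tensor:
    """LAMP score: w^2 divided by the sum of w'^2 over all weights in the same
    tensor whose magnitude is >= |w|."""
    w2 = weight.detach().flatten().pow(2)
    order = torch.argsort(w2, descending=True)   # largest magnitudes first
    denom = w2[order].cumsum(0)                   # running sum over not-smaller weights
    scores = torch.empty_like(w2)
    scores[order] = w2[order] / denom
    return scores.view_as(weight)

def global_masks(weights, sparsity=0.5):
    """Keep-masks that zero out the globally lowest-scoring fraction of weights."""
    all_scores = torch.cat([lamp_scores(w).flatten() for w in weights])
    k = int(sparsity * all_scores.numel())
    thresh = all_scores.sort().values[k]          # score at the target sparsity level
    return [(lamp_scores(w) >= thresh).float() for w in weights]
```

The "1.5×" and "2.0×" labels appear to denote the targeted compression/speed-up ratios, with the parenthesized deltas giving the mAP drop relative to the unpruned enhanced model.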
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Shi, J.; Sun, D.; Kieu, M.; Guo, B.; Gao, M. An Enhanced Detector for Vulnerable Road Users Using Infrastructure-Sensors-Enabled Device. Sensors 2024, 24, 59. https://doi.org/10.3390/s24010059