Lightweight Vehicle Detection Based on Mamba_ViT
Abstract
1. Introduction
- We propose an efficient feature extraction network named Mamba_ViT. This network comprises two modules: Mamba_F and iRMB_F. These modules are designed to separate shallow features from deep features and process them independently through different network structures. The iRMB_F module focuses on extracting shallow features, such as edges and textures, while the Mamba_F module is responsible for capturing deep features, such as object contours and shapes. This separation approach optimizes the feature extraction process and reduces the loss of vehicle information.
- In the Mamba_ViT network, we incorporate a multi-scale feature fusion mechanism that combines features from different scales, so that the shallow and deep features extracted by Mamba_ViT are fused more effectively.
- On the UA-DETRAC dataset, our proposed algorithm achieves a 3.2% improvement in mAP@50 compared to the latest YOLOv8 algorithm, while utilizing only 60% of the parameters of YOLOv8.
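The shallow/deep separation described above can be sketched, very roughly, in plain Python: a local mean filter stands in for the convolution-like iRMB_F branch (edges, textures), and a linear state-space recurrence stands in for the Mamba_F branch (long-range contour/shape context). All function names, parameters, and the toy input below are illustrative assumptions, not the paper's actual implementation.

```python
def irmb_f_shallow(x, k=3):
    """Illustrative stand-in for iRMB_F: a k x k mean filter that, like a
    convolution, responds only to local structure (edges, textures)."""
    n, m = len(x), len(x[0])
    pad = k // 2
    out = []
    for i in range(n):
        row = []
        for j in range(m):
            patch = [x[min(max(i + di, 0), n - 1)][min(max(j + dj, 0), m - 1)]
                     for di in range(-pad, pad + 1)
                     for dj in range(-pad, pad + 1)]
            row.append(sum(patch) / len(patch))
        out.append(row)
    return out


def mamba_f_deep(x, a=0.9, b=0.1):
    """Illustrative stand-in for Mamba_F: a linear state-space recurrence
    h_t = a*h_{t-1} + b*u_t over the flattened map. It runs in O(N) in the
    sequence length, so long-range context is aggregated cheaply."""
    seq = [v for row in x for v in row]
    h, ys = 0.0, []
    for u in seq:
        h = a * h + b * u  # hidden state carries history of the whole scan
        ys.append(h)
    m = len(x[0])
    return [ys[i * m:(i + 1) * m] for i in range(len(x))]


feat = [[float((3 * i + 7 * j) % 5) for j in range(8)] for i in range(8)]
shallow = irmb_f_shallow(feat)   # edge/texture-level features
deep = mamba_f_deep(shallow)     # contour/shape-level context
```

The point of the split is that each branch only pays for the kind of computation its feature level needs: the local filter never models long-range dependencies, and the scan never recomputes local neighborhoods.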
2. Related Work
Vehicle Detection
3. Method
3.1. Overall Architecture
3.2. Mamba_ViT
3.3. iRMB_F
3.4. Mamba_F
3.5. Feature Fusion
4. Results
4.1. Dataset
4.2. Experimental Equipment and Evaluation Metrics
4.3. Comparison Experiment
- Mamba_ViT utilizes two modules, Mamba_F and iRMB_F, to achieve effective separation and independent processing of shallow and deep features. By fully leveraging the strengths of both Mamba and the Transformer, it ensures comprehensive and accurate feature extraction. Its efficient feature extraction capability enables the network to better handle complex traffic scenarios and reduces detection errors caused by insufficient local features.
- The integration of the bidirectional pyramid feature fusion network into Mamba_ViT ensures thorough fusion of shallow and deep features. The bidirectional pyramid structure maximizes the complementarity between features at different layers, allowing shallow features to contribute detailed local information to deep features, while deep features provide global contextual information to shallow features, thereby enhancing the overall feature representation.
- The 2D selective scanning (SS2D) module in Mamba_F and the windowed attention in iRMB_F effectively reduce the quadratic computational complexity of attention, greatly lowering the computational cost. In the feature fusion stage, the bidirectional pyramid feature fusion network reduces the model parameters while fully integrating the information. Compared with several advanced vehicle detection algorithms, Mamba_ViT_YOLO achieves higher accuracy with fewer parameters and less computation, and therefore, in theory, fulfills the criteria of a lightweight algorithm.
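A back-of-the-envelope count illustrates the complexity argument above: global self-attention over N tokens of dimension d costs on the order of N²·d multiply-accumulates, windowed attention with window size w costs about N·w·d, and a linear selective scan costs about N·d per scan direction. The token, channel, and window sizes below are illustrative assumptions, not values taken from the paper.

```python
# Order-of-magnitude op counts for one attention layer (illustrative).
N = 80 * 80   # tokens in an assumed 80x80 feature map
d = 64        # assumed channel dimension
w = 7 * 7     # assumed 7x7 local attention window

global_attn = N * N * d   # quadratic in sequence length
window_attn = N * w * d   # each token attends only within its window
ss2d_scan = N * d         # linear recurrent scan, per scan direction

print(global_attn // window_attn)  # windowed attention: ~130x fewer ops
print(global_attn // ss2d_scan)    # linear scan: N = 6400x fewer ops
```

The ratio global/windowed is simply N/w, so the saving grows with feature-map resolution, which is exactly the regime where dense traffic scenes are processed.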
4.4. Ablation Experiment
4.5. Comparison of Heat Maps
4.6. Comparison of Detection Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wang, Z.; Zhan, J.; Duan, C.; Guan, X.; Lu, P.; Yang, K. A review of vehicle detection techniques for intelligent vehicles. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 3811–3831.
- Nigam, N.; Singh, D.P.; Choudhary, J. A review of different components of the intelligent traffic management system (ITMS). Symmetry 2023, 15, 583.
- Badi, I.; Bouraima, M.B.; Muhammad, L.J. The role of intelligent transportation systems in solving traffic problems and reducing environmental negative impact of urban transport. Decis. Mak. Anal. 2023, 1, 1–9.
- Zhang, Y.; Sun, Y.; Wang, Z.; Jiang, Y. YOLOv7-RAR for urban vehicle detection. Sensors 2023, 23, 1801.
- Bie, M.; Liu, Y.; Li, G.; Hong, J.; Li, J. Real-time vehicle detection algorithm based on a lightweight You-Only-Look-Once (YOLOv5n-L) approach. Expert Syst. Appl. 2023, 213, 119108.
- Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I, pp. 21–37.
- Yang, Z.; Yuan, Y.; Zhang, M.; Zhao, X.; Tian, B. Safety Distance Identification for Crane Drivers Based on Mask R-CNN. Sensors 2019, 19, 2789.
- Li, Z.; Li, Y.; Yang, Y.; Guo, R.; Yang, J.; Yue, J.; Wang, Y. A high-precision detection method of hydroponic lettuce seedlings status based on improved Faster RCNN. Comput. Electron. Agric. 2021, 182, 106054.
- Wang, C.C.; Samani, H.; Yang, C.Y. Object Detection with Deep Learning for Underwater Environment. In Proceedings of the 2019 4th International Conference on Information Technology Research (ICITR), Moratuwa, Sri Lanka, 10–13 December 2019.
- Yu, W.; Liu, Z.; Zhuang, Z.; Liu, Y.; Wang, X.; Yang, Y.; Gou, B. Super-Resolution Reconstruction of Speckle Images of Engineered Bamboo Based on an Attention-Dense Residual Network. Sensors 2022, 22, 6693.
- Wang, K.; Liu, M.; Ye, Z. An advanced YOLOv3 method for small-scale road object detection. Appl. Soft Comput. 2021, 112, 107846.
- Kasper-Eulaers, M.; Hahn, N.; Berger, S.; Sebulonsen, T.; Myrland, Ø.; Kummervold, P.E. Short Communication: Detecting heavy goods vehicles in rest areas in winter conditions using YOLOv5. Algorithms 2021, 14, 114.
- Dong, X.; Yan, S.; Duan, C. A lightweight vehicles detection network model based on YOLOv5. Eng. Appl. Artif. Intell. 2022, 113, 104914.
- Zhang, X.; Zhang, X.; He, M. Research on vehicle detection method based on improved YOLOX-s. J. Syst. Simul. 2024, 36, 487–496.
- Elhanashi, A.; Saponara, S.; Dini, P.; Zheng, Q.; Morita, D.; Raytchev, B. An integrated and real-time social distancing, mask detection, and facial temperature video measurement system for pandemic monitoring. J. Real-Time Image Process. 2023, 20, 95.
- Babenko, A.; Lempitsky, V. Aggregating deep convolutional features for image retrieval. arXiv 2015, arXiv:1510.07493.
- Zhang, Y.; Zhao, H.; Duan, Z.; Huang, L.; Deng, J.; Zhang, Q. Congested crowd counting via adaptive multi-scale context learning. Sensors 2021, 21, 3777.
- Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788.
- Sun, Y.; Wang, W.; Zhang, Q.; Ni, H.; Zhang, X. Improved YOLOv5 with transformer for large scene military vehicle detection on SAR image. In Proceedings of the 2022 7th International Conference on Image, Vision and Computing (ICIVC), Xi’an, China, 26–28 July 2022; pp. 87–93.
- Liu, P.; Fu, H.; Ma, H. An end-to-end convolutional network for joint detecting and denoising adversarial perturbations in vehicle classification. Comput. Vis. Media 2021, 7, 217–227.
- Lee, D.-S. Effective Gaussian mixture learning for video background subtraction. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 827–832.
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893.
- Viola, P.A.; Jones, M.J. Rapid Object Detection using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8–14 December 2001.
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
- Amit, Y.; Felzenszwalb, P.; Girshick, R. Object Detection. In Computer Vision: A Reference Guide; Ikeuchi, K., Ed.; Springer International Publishing: Cham, Switzerland, 2021; pp. 875–883.
- Jheng, Y.-J.; Yen, Y.-H.; Sun, T.-Y. A symmetry-based forward vehicle detection and collision warning system on Android smartphone. In Proceedings of the 2015 IEEE International Conference on Consumer Electronics-Taiwan, Taipei, Taiwan, 6–8 June 2015; pp. 212–213.
- Munajat, M.E.; Widyantoro, D.H.; Munir, R. Vehicle detection and tracking based on corner and lines adjacent detection features. In Proceedings of the 2016 2nd International Conference on Science in Information Technology (ICSITech), Balikpapan, Indonesia, 26–27 October 2016; pp. 244–249.
- Satzoda, R.K.; Trivedi, M.M. Multipart vehicle detection using symmetry-derived analysis and active learning. IEEE Trans. Intell. Transp. Syst. 2015, 17, 926–937.
- Zhang, P.-p. Moving Target Detection and Tracking in Video Monitoring System. 2010. Available online: https://www.semanticscholar.org/paper/Moving-Target-Detection-and-Tracking-in-Video-Peng-pen/f46d58f1545bddcf49f0c5e339cf03c7f891d9b3 (accessed on 4 November 2024).
- Wu, X.; Song, X.; Gao, S.; Chen, C. Review of target detection algorithms based on deep learning. In Proceedings of the CCEAI 2021: 5th International Conference on Control Engineering and Artificial Intelligence, Sanya, China, 14–16 January 2021.
- Xie, W.; Zhu, D.; Tong, X. Small target detection method based on visual attention. Comput. Eng. Appl. 2013, 49, 125–128.
- Yin, S.; Li, H.; Teng, L. Airport Detection Based on Improved Faster RCNN in Large Scale Remote Sensing Images. Sens. Imaging 2020, 21, 49.
- Borji, A.; Cheng, M.M.; Jiang, H.; Li, J. Salient Object Detection: A Benchmark. IEEE Trans. Image Process. 2015, 24, 5706–5722.
- Karangwa, J.; Liu, J.; Zeng, Z. Vehicle detection for autonomous driving: A review of algorithms and datasets. IEEE Trans. Intell. Transp. Syst. 2023, 24, 11568–11594.
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
- Li, J.; Shan, H. YOLOv3 Based Object Tracking Method. Electron. Opt. Control 2019, 26, 87–93.
- Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Proc. Mach. Learn. Res. 2019, 97, 6105–6114.
- Chen, J.; Kao, S.-h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031.
- Liu, Z.; Hao, Z.; Han, K.; Tang, Y.; Wang, Y. GhostNetV3: Exploring the Training Strategies for Compact Models. arXiv 2024, arXiv:2404.11202.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021.
- Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 1389–1400.
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166.
- Zheng, Y.; Zhang, X.; Zhang, R.; Wang, D. Gated Path Aggregation Feature Pyramid Network for Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 4614.
- Yu, H.; Li, X.; Feng, Y.; Han, S. Multiple attentional path aggregation network for marine object detection. Appl. Intell. 2023, 53, 2434–2451.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
- Lyu, S.; Chang, M.-C.; Du, D.; Li, W.; Wei, Y.; Coco, M.D.; Carcagnì, P.; Schumann, A.; Munjal, B.; Dang, D.-Q.-T.; et al. UA-DETRAC 2018: Report of AVSS2018 & IWT4S Challenge on Advanced Traffic Monitoring. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6.
- Lyu, S.; Chang, M.-C.; Du, D.; Wen, L.; Qi, H.; Li, Y.; Wei, Y.; Ke, L.; Hu, T.; Del Coco, M. UA-DETRAC 2017: Report of AVSS2017 & IWT4S challenge on advanced traffic monitoring. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–7.
- Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.-C.; Qi, H.; Lim, J.; Yang, M.-H.; Lyu, S. UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. Comput. Vis. Image Underst. 2020, 193, 102907.
| Model | Params (M) | FLOPs (G) | mAP@50 |
|---|---|---|---|
| Faster-RCNN | 41.3 | 60.5 | 0.508 |
| SSD | 24.1 | 30.5 | 0.532 |
| YOLOv3-tiny | 8.7 | 12.9 | 0.498 |
| YOLOv4-tiny | 6.0 | 16.2 | 0.508 |
| YOLOv5s | 7.0 | 16.0 | 0.524 |
| YOLOv6n | 4.6 | 11.3 | 0.536 |
| YOLOv7-tiny | 6.0 | 13.0 | 0.472 |
| YOLOv8-tiny | 3.0 | 8.1 | 0.556 |
| Mamba_ViT_YOLO | 1.8 | 6.1 | 0.588 |
| Model | Params (M) | FLOPs (G) | mAP@50 | mAP@50:95 |
|---|---|---|---|---|
| YOLOv3-tiny | 8.7 | 12.9 | 0.254 | 0.110 |
| YOLOv5n | 1.8 | 4.2 | 0.370 | 0.185 |
| YOLOv7-tiny | 6.0 | 13.0 | 0.286 | 0.153 |
| Mamba_ViT_YOLO | 1.8 | 4.2 | 0.398 | 0.220 |
| YOLOv8 | Mamba_ViT | BiFPN | Params (M) | mAP@50 |
|---|---|---|---|---|
| ✓ | ✗ | ✗ | 3.0 | 0.556 |
| ✓ | ✓ | ✗ | 2.8 | 0.575 |
| ✓ | ✗ | ✓ | 2.0 | 0.570 |
| ✓ | ✓ | ✓ | 1.8 | 0.588 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Song, Z.; Wang, Y.; Xu, S.; Wang, P.; Liu, L. Lightweight Vehicle Detection Based on Mamba_ViT. Sensors 2024, 24, 7138. https://doi.org/10.3390/s24227138