BEV-CAM3D: A Unified Bird’s-Eye View Architecture for Autonomous Driving with Monocular Cameras and 3D Point Clouds
Abstract
1. Introduction
- (i) A BEV fusion framework, BEVCAM3D, that delivers improved performance through efficient feature alignment and the fusion of camera and LiDAR sensor modalities across the perspective view and the BEV.
- (ii) A detailed yet concise review of the BEV and BEV multi-modal sensor fusion paradigms.
- (iii) A BEV representation for point clouds that incorporates fast ground segmentation to improve feature extraction.
- (iv) A cross-modality attention module with a cross-modality fusion loss for the efficient fusion and alignment of camera and LiDAR features.
2. Literature Review
2.1. Monocular Camera BEV Representation
2.2. Point Cloud BEV Representation
2.3. Multi-Modal BEV Fusion
3. Methodology
3.1. Preliminaries
3.2. BEVCAM3D Overview
3.3. Point Cloud BEV Representation
- A mapping function assigns each point with index i to its corresponding BEV cell index j (a minimal rasterization sketch illustrating this mapping follows below).
- An intensity value is associated with each point in the given point cloud.
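As an illustration of this point-to-cell mapping, the following minimal NumPy sketch rasterizes a ground-filtered point cloud into a BEV grid with occupancy, maximum-height, and mean-intensity channels. The grid extent, resolution, and channel choices are assumptions made for the sketch, not the paper's exact configuration.

```python
import numpy as np

def pointcloud_to_bev(points, intensity, ground_mask,
                      x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                      resolution=0.5):
    """Rasterize an (N, 3) point cloud into occupancy / max-height / mean-intensity
    BEV channels. Extents and resolution are illustrative, not the paper's values."""
    keep = ~ground_mask                      # drop points flagged as ground before binning
    pts, inten = points[keep], intensity[keep]

    # Keep only points inside the BEV region of interest.
    in_roi = ((pts[:, 0] >= x_range[0]) & (pts[:, 0] < x_range[1]) &
              (pts[:, 1] >= y_range[0]) & (pts[:, 1] < y_range[1]))
    pts, inten = pts[in_roi], inten[in_roi]

    W = int((x_range[1] - x_range[0]) / resolution)
    H = int((y_range[1] - y_range[0]) / resolution)

    # Mapping from point index i to its BEV cell index j (row-major).
    col = ((pts[:, 0] - x_range[0]) / resolution).astype(np.int64)
    row = ((pts[:, 1] - y_range[0]) / resolution).astype(np.int64)
    cell = row * W + col

    occupancy = np.zeros(H * W, dtype=np.float32)
    max_height = np.full(H * W, -np.inf, dtype=np.float32)
    intensity_sum = np.zeros(H * W, dtype=np.float32)
    counts = np.zeros(H * W, dtype=np.float32)

    np.add.at(counts, cell, 1.0)             # points per cell
    np.add.at(intensity_sum, cell, inten)    # summed intensity per cell
    np.maximum.at(max_height, cell, pts[:, 2])

    occupancy[counts > 0] = 1.0
    mean_intensity = np.where(counts > 0, intensity_sum / np.maximum(counts, 1), 0.0)
    max_height[max_height == -np.inf] = 0.0

    return np.stack([occupancy, max_height, mean_intensity]).reshape(3, H, W)
```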
3.4. BEV Feature Encoder
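The experimental setup later lists ResNet-50 with a feature pyramid as the encoder backbone. As a hedged sketch of how such a BEV feature encoder could be assembled with torchvision, the snippet below adapts the stem to an arbitrary-channel BEV raster; the input channel count, returned stages, and output width are assumptions, not the authors' exact architecture.

```python
import torch.nn as nn
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor
from torchvision.ops import FeaturePyramidNetwork

class BEVFeatureEncoder(nn.Module):
    """Illustrative ResNet-50 + FPN encoder applied to a multi-channel BEV raster."""
    def __init__(self, in_channels=3, out_channels=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        # Replace the RGB stem so the encoder accepts the BEV raster channels.
        backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                                   stride=2, padding=3, bias=False)
        self.body = create_feature_extractor(
            backbone, return_nodes={'layer1': 'c2', 'layer2': 'c3',
                                    'layer3': 'c4', 'layer4': 'c5'})
        self.fpn = FeaturePyramidNetwork([256, 512, 1024, 2048], out_channels)

    def forward(self, bev):
        feats = self.body(bev)   # multi-scale features from the four ResNet stages
        return self.fpn(feats)   # dict of FPN maps at strides 4 / 8 / 16 / 32
```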
3.5. BEVCAM3D Feature Fusion Network
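Contribution (iv) describes a cross-modality attention module (with a cross-modality fusion loss) for aligning camera and LiDAR features in the BEV. The PyTorch sketch below illustrates only the general idea of camera-query/LiDAR-key cross-attention on a shared BEV grid; the layer choices, channel widths, use of nn.MultiheadAttention, and the omission of the fusion loss are assumptions of this sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalityBEVFusion(nn.Module):
    """Illustrative cross-attention fusion of camera and LiDAR BEV feature maps."""
    def __init__(self, cam_ch=256, lidar_ch=256, embed_dim=256, heads=8):
        super().__init__()
        self.q_proj = nn.Conv2d(cam_ch, embed_dim, 1)
        self.kv_proj = nn.Conv2d(lidar_ch, embed_dim, 1)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.out = nn.Conv2d(embed_dim * 2, embed_dim, 1)

    def forward(self, cam_bev, lidar_bev):
        # cam_bev, lidar_bev: (B, C, H, W) feature maps on the same BEV grid.
        B, _, H, W = cam_bev.shape
        cam = self.q_proj(cam_bev)                               # (B, D, H, W)
        q = cam.flatten(2).transpose(1, 2)                       # (B, HW, D) camera queries
        kv = self.kv_proj(lidar_bev).flatten(2).transpose(1, 2)  # (B, HW, D) LiDAR keys/values
        attended, _ = self.attn(q, kv, kv)                       # camera attends to LiDAR
        attended = attended.transpose(1, 2).reshape(B, -1, H, W)
        # Fuse the attended LiDAR context with the camera stream and project.
        return self.out(torch.cat([attended, cam], dim=1))
```

Dense attention over every BEV cell is quadratic in grid size; the sketch keeps it only for clarity, whereas practical fusion modules typically use deformable or windowed attention to stay tractable.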
4. Results
4.1. Experimental Setup
- Frameworks: PyTorch 2.3.1 (CUDA 12.6 backend), NumPy 1.24.4.
- Data Loading: NuScenes devkit v1.1.11 [28], OpenCV 4.11.0.86 (a minimal loading sketch follows this list).
- Augmentation: Albumentations 1.4.18 (applied to camera images).
- Visualization: Matplotlib 3.5.3, Pillow (PIL) 10.4.0.
- Sensors: Six surround-view cameras (1600 × 900 resolution) and one LiDAR (32 beams).
- Annotations: 1.4 M 3D bounding boxes across 10 classes: car, truck, bus, trailer, construction vehicle, pedestrian, motorcycle, bicycle, barrier, and traffic cone.
- Evaluation Metrics: NuScenes Detection Score (NDS) and mean Average Precision (mAP).
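A hedged sketch of how the data-loading stack listed above might be wired together with the nuScenes devkit and OpenCV; the dataset root and the chosen sample are placeholders, and this is not the authors' pipeline.

```python
import cv2
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.data_classes import LidarPointCloud

# Dataset root is a placeholder; v1.0-trainval is the full nuScenes split.
nusc = NuScenes(version='v1.0-trainval', dataroot='/data/nuscenes', verbose=True)

sample = nusc.sample[0]                                    # one annotated keyframe
cam_path, _, cam_intrinsic = nusc.get_sample_data(sample['data']['CAM_FRONT'])
lidar_path, _, _ = nusc.get_sample_data(sample['data']['LIDAR_TOP'])

image = cv2.imread(cam_path)                               # 1600 x 900 surround-view image
points = LidarPointCloud.from_file(lidar_path).points      # (4, N): x, y, z, intensity
```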
4.2. Experimental Parameters
- Initial LR: , Batch Size: 16 (2 GPUs, 8 samples/GPU).
- Training Schedule: 50 epochs, 20% warmup.
- Precision: FP16 via PyTorch AMP.
- Gradient Checkpointing: Enabled for ResNet-50 (see the training-loop sketch after this list).
- BEV Grid: , , resolution .
- Image Size: 224 × 224 (center-cropped and resized).
- View Transformation: Focal length scaled to (aligned with camera intrinsics and adjusted for 224 × 224 resolution).
- Rotation: , Flip: Horizontal (50%), Shift: 10% of image size.
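The training schedule above can be reproduced with standard PyTorch utilities. In the sketch below, the optimizer type, the OneCycle warmup scheduler, and the build_model/build_dataloader helpers are assumptions (hypothetical names), and base_lr is a placeholder because the initial learning-rate value was not preserved in this excerpt.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Values from the list above; base_lr is a placeholder.
epochs, warmup_frac, base_lr = 50, 0.2, 1e-4

model = build_model().cuda()          # hypothetical constructor for the BEVCAM3D network
dataloader = build_dataloader()       # hypothetical nuScenes dataloader (batch size 16 over 2 GPUs)

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=base_lr,
    total_steps=epochs * len(dataloader), pct_start=warmup_frac)
scaler = GradScaler()                 # FP16 mixed precision (PyTorch AMP)
# Gradient checkpointing would be enabled inside the ResNet-50 backbone
# (e.g., via torch.utils.checkpoint) to trade compute for memory.

for epoch in range(epochs):
    for images, lidar_bev, targets in dataloader:
        optimizer.zero_grad(set_to_none=True)
        with autocast():              # run the forward pass in mixed precision
            loss = model(images, lidar_bev, targets)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
```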
4.3. Evaluation Metrics
- Mean Average Precision (mAP):
- Description: The mAP measures detection quality by matching predictions to ground truth with BEV center-distance thresholds of 0.5, 1, 2, and 4 m, and is calculated by integrating the precision–recall curve for each class.
- Formula: mAP = (1/|C|) Σ_{c ∈ C} AP_c, where C denotes the set of 10 object classes in nuScenes and AP_c represents the Average Precision for class c, averaged over 10 recall thresholds ranging from 0.1 to 0.9.
- NuScenes Detection Score (NDS):
- Description: The NDS is a composite metric that combines the mAP with the five true-positive error metrics (translation, scale, orientation, velocity, and attribute errors) to provide an overall assessment of detection performance; its closed form is restated after this list.
- Frames Per Second (FPS):
- Description: FPS measures the number of frames processed per second and serves as an indicator of real-time inference throughput (a measurement sketch follows this list).
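For completeness, the NDS used above has the standard nuScenes closed form, combining the mAP with the five true-positive error metrics:

```latex
\mathrm{NDS} = \frac{1}{10}\Bigl[\, 5\,\mathrm{mAP} + \sum_{\mathrm{mTP}\,\in\,\mathbb{TP}} \bigl(1 - \min(1,\ \mathrm{mTP})\bigr) \Bigr],
\qquad \mathbb{TP} = \{\mathrm{mATE},\ \mathrm{mASE},\ \mathrm{mAOE},\ \mathrm{mAVE},\ \mathrm{mAAE}\}
```

FPS, in turn, is typically measured with warmup iterations and explicit GPU synchronization. A minimal sketch follows; the model and its inputs are placeholders rather than the paper's benchmarking harness.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, sample_inputs, iters=100, warmup=20):
    """Average frames per second over repeated single-sample forward passes."""
    model.eval()
    for _ in range(warmup):                  # warm up CUDA kernels / memory allocator
        model(*sample_inputs)
    torch.cuda.synchronize()                 # wait for queued GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(*sample_inputs)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```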
5. Discussion
6. Analysis Study
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Raj, T.; Hashim, F.; Huddin, A.; Ibrahim, M.; Hussain, A. A Survey on LiDAR Scanning Mechanisms. Electronics 2020, 9, 741. [Google Scholar] [CrossRef]
- Li, H.; Sima, C.; Dai, J.; Wang, W.; Lu, L.; Wang, H.; Zeng, J.; Li, Z.; Yang, J.; Deng, H.; et al. Delving Into the Devils of Bird’s-Eye-View Perception: A Review, Evaluation and Recipe. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 2151–2170. [Google Scholar] [CrossRef] [PubMed]
- Arnold, E.; Al-Jarrah, O.; Dianati, M.; Fallah, S.; Oxtoby, D.; Mouzakitis, A. A Survey on 3D Object Detection Methods for Autonomous Driving Applications. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3782–3795. [Google Scholar] [CrossRef]
- Ma, Y.; Wang, T.; Bai, X.; Yang, H.; Hou, Y.; Wang, Y.; Qiao, Y.; Yang, R.; Manocha, D.; Zhu, X. Vision-Centric BEV Perception: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022. [Google Scholar] [CrossRef]
- Yang, C.; Chen, Y.; Tian, H.; Tao, C.; Zhu, X.; Zhang, Z.; Huang, G.; Li, H.; Qiao, Y.; Lu, L.; et al. BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Li, Z.; Yu, Z.; Wang, W.; Anandkumar, A.; Lu, T.; Alvarez, J. FB-BEV: BEV Representation from Forward-Backward View Transformations. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023. [Google Scholar]
- Li, Z.; Yu, Z.; Austin, D.; Fang, M.; Lan, S.; Kautz, J.; Alvarez, J. FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation. In Proceedings of the 2023 IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Chen, S.; Cheng, T.; Wang, X.; Meng, W.; Zhang, Q.; Liu, W. Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer. arXiv 2022, arXiv:2206.04584. [Google Scholar]
- Yoo, J.; Kim, Y.; Kim, J.; Choi, J. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2020; Volume 12372, pp. 720–736. [Google Scholar] [CrossRef]
- Liang, T.; Xie, H.; Yu, K.; Xia, Z.; Lin, Z.; Wang, Y.; Tang, T.; Wang, B.; Tang, Z. BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework. In Proceedings of the Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
- Gosala, N.; Valada, A. Bird’s-Eye-View Panoptic Segmentation Using Monocular Frontal View Images. IEEE Robot. Autom. Lett. 2022, 7, 1968–1975. [Google Scholar] [CrossRef]
- Liu, Y.; Yan, J.; Jia, F.; Li, S.; Gao, A.; Wang, T.; Zhang, X.; Sun, J. PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
- Philion, J.; Fidler, S. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12359, pp. 194–210. [Google Scholar] [CrossRef]
- Hu, A.; Murez, Z.; Mohan, N.; Dudas, S.; Hawke, J.; Badrinarayanan, V.; Cipolla, R.; Kendall, A. FIERY: Future Instance Prediction in Bird’s-Eye View from Surround Monocular Cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15253–15262. [Google Scholar] [CrossRef]
- Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2022; Volume 13669, pp. 1–18. [Google Scholar] [CrossRef]
- Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 37, pp. 1477–1485. [Google Scholar] [CrossRef]
- Mallot, H.; Bülthoff, H.; Little, J.; Bohrer, S. Inverse perspective mapping simplifies optical flow computation and obstacle detection. Biol. Cybern. 1991, 64, 177–185. [Google Scholar] [CrossRef]
- Bertozzi, M.; Broggi, A.; Fascioli, A. Stereo inverse perspective mapping: Theory and applications. Image Vis. Comput. 1998, 16, 585–590. [Google Scholar] [CrossRef]
- Zhou, B.; Krahenbuhl, P. Cross-view Transformers for real-time Map-view Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13750–13759. [Google Scholar] [CrossRef]
- Li, Q.; Wang, Y.; Wang, Y.; Zhao, H. HDMapNet: An Online HD Map Construction and Evaluation Framework. In Proceedings of the International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 4628–4634. [Google Scholar] [CrossRef]
- Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; Geiger, A. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4455–4465. [Google Scholar] [CrossRef]
- Huang, L.; Wang, H.; Zeng, J.; Zhang, S.; Cao, L.; Yan, J.; Li, H. Geometric-aware Pretraining for Vision-centric 3D Object Detection. arXiv 2023, arXiv:2304.03105. [Google Scholar]
- Elfes, A. Using Occupancy Grids for Mobile Robot Perception and Navigation. Computer 1989, 22, 46–57. [Google Scholar] [CrossRef]
- Maturana, D.; Scherer, S. VoxNet: A 3D Convolutional Neural Network for real-time object recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–3 October 2015; pp. 922–928. [Google Scholar] [CrossRef]
- Roddick, T.; Cipolla, R. Predicting semantic map representations from images using pyramid occupancy networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11135–11144. [Google Scholar] [CrossRef]
- Oladele, D.A.; Markus, E.D.; Abu-Mahfouz, A.M. Fastseg3d: A Fast, Efficient, and Adaptive Ground Filtering Algorithm for 3d Point Clouds in Mobile Sensing Applications. SSRN 2024. [Google Scholar] [CrossRef]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-View 3D Object Detection Network for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534. [Google Scholar] [CrossRef]
- Caesar, H.; Bankiti, V.; Lang, A.; Vora, S.; Liong, V.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11618–11628. [Google Scholar] [CrossRef]
- Wofk, D.; Ma, F.; Yang, T.; Karaman, S.; Sze, V. FastDepth: Fast monocular depth estimation on embedded systems. In Proceedings of the International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 6101–6108. [Google Scholar] [CrossRef]
- Roy, A.; Todorovic, S. Monocular Depth Estimation Using Neural Regression Forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5506–5514. [Google Scholar]
- Ming, Y.; Meng, X.; Fan, C.; Yu, H. Deep learning for monocular depth estimation: A review. Neurocomputing 2021, 438, 14–33. [Google Scholar] [CrossRef]
- Ku, J.; Pon, A.; Waslander, S. Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11859–11868. [Google Scholar] [CrossRef]
- Naiden, A.; Paunescu, V.; Kim, G.; Jeon, B.; Leordeanu, M. Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 61–65. [Google Scholar] [CrossRef]
- Mousavian, A.; Anguelov, D.; Košecká, J.; Flynn, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5632–5640. [Google Scholar] [CrossRef]
- Roddick, T.; Kendall, A.; Cipolla, R. Orthographic Feature Transform for Monocular 3D Object Detection. In Proceedings of the 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, 9–12 September 2019. [Google Scholar]
- Qin, Z.; Wang, J.; Lu, Y. MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January 2019–1 February 2019; pp. 8851–8858. [Google Scholar] [CrossRef]
- Kehl, W.; Manhardt, F.; Tombari, F.; Ilic, S.; Navab, N. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017; pp. 1530–1538. [Google Scholar] [CrossRef]
- Pham, C.; Jeon, J. Robust object proposals re-ranking for object detection in autonomous driving using convolutional neural networks. Signal Process. Image Commun. 2017, 53, 110–122. [Google Scholar] [CrossRef]
- Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teulière, C.; Chateau, T. Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 1827–1836. [Google Scholar] [CrossRef]
- Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Data-driven 3D Voxel Patterns for object category recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1903–1911. [Google Scholar] [CrossRef]
- Wang, X.; Zhu, Z.; Zhang, Y.; Huang, G.; Ye, Y.; Xu, W.; Chen, Z.; Wang, X. Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9600–9610. [Google Scholar] [CrossRef]
- Liu, W.; Sun, J.; Li, W.; Hu, T.; Wang, P. Deep Learning on Point Clouds and Its Application: A Survey. Sensors 2019, 19, 4188. [Google Scholar] [CrossRef]
- Alatise, M.; Hancke, G. A Review on Challenges of Autonomous Mobile Robot and Sensor Fusion Methods. IEEE Access 2020, 8, 39830–39846. [Google Scholar] [CrossRef]
- Chib, P.; Singh, P. Recent Advancements in End-to-End Autonomous Driving using Deep Learning: A Survey. IEEE Trans. Intell. Veh. 2023, 9, 103–118. [Google Scholar] [CrossRef]
- Zhao, J.; Shi, J.; Zhuo, L. BEV perception for autonomous driving: State of the art and future perspectives. Expert Syst. Appl. 2024, 258, 125103. [Google Scholar] [CrossRef]
- Reading, C.; Harakeh, A.; Chae, J.; Waslander, S. Categorical Depth Distribution Network for Monocular 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8551–8560. [Google Scholar] [CrossRef]
- Huang, J.; Huang, G. BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection. arXiv 2022, arXiv:2203.17054. [Google Scholar]
- Li, Z.; Lan, S.; Alvarez, J.M.; Wu, Z. BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 20113–20123. [Google Scholar]
- Xie, E.; Yu, Z.; Zhou, D.; Philion, J.; Anandkumar, A.; Fidler, S.; Luo, P.; Alvarez, J.M. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation. arXiv 2022, arXiv:2204.05088. [Google Scholar]
- Li, Y.; Huang, B.; Chen, Z.; Cui, Y.; Liang, F.; Shen, M.; Liu, F.; Xie, E.; Sheng, L.; Ouyang, W.; et al. Fast-BEV: A Fast and Strong Bird’s-Eye View Perception Baseline. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 8665–8679. Available online: https://arxiv.org/abs/2301.12511v1 (accessed on 11 September 2023). [CrossRef]
- Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View. arXiv 2021, arXiv:2112.11790. Available online: https://arxiv.org/abs/2112.11790v3 (accessed on 11 September 2023).
- Lang, A.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 12689–12697. [Google Scholar] [CrossRef]
- Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11779–11788. [Google Scholar] [CrossRef]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. Available online: https://arxiv.org/abs/2010.04159v4 (accessed on 14 April 2025).
- Seo, S.; Huang, J.; Yang, H.; Liu, Y. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the RecSys’17: Eleventh ACM Conference on Recommender Systems, Como, Italy, 27–31 August 2017; Volume 17, pp. 297–305. [Google Scholar] [CrossRef]
- Bello, I.; Zoph, B.; Le, Q.; Vaswani, A.; Shlens, J. Attention Augmented Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3285–3294. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 2017, 5999–6009. [Google Scholar]
- Wang, Y.; Guizilini, V.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. Proc. Mach. Learn. Res. 2021, 164, 180–191. [Google Scholar]
- Zhang, Y.; Zhu, Z.; Du, D. OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar]
- Tesla. Tesla AI Day 2021—YouTube, n.d. Available online: https://www.youtube.com/watch?v=j0z4FweCy4M (accessed on 12 February 2024).
- Palazzi, A.; Borghi, G.; Abati, D.; Calderara, S.; Cucchiara, R. Learning to Map Vehicles into Bird’s Eye View; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2017; Volume 10484 LNCS, pp. 233–243. [Google Scholar] [CrossRef]
- Reiher, L.; Lampe, B.; Eckstein, L. A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird’s Eye View. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems ITSC 2020, Rhodes, Greece, 20–23 September 2020. [Google Scholar] [CrossRef]
- Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. 2015; Available online: https://papers.nips.cc/paper_files/paper/2015/hash/33ceb07bf4eeb3da587e268d663aba1a-Abstract.html (accessed on 14 April 2025).
- Li, S.; Yang, K.; Shi, H.; Zhang, J.; Lin, J.; Teng, Z.; Li, Z. Bi-Mapper: Holistic BEV Semantic Mapping for Autonomous Driving. IEEE Robot. Autom. Lett. 2023, 8, 7034–7041. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar] [CrossRef]
- Zhu, X.; Yin, Z.; Shi, J.; Li, H.; Lin, D. Generative Adversarial Frontal View to Bird View Synthesis. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 454–463. [Google Scholar] [CrossRef]
- Gupta, D.; Pu, W.; Tabor, T.; Schneider, J. SBEVNet: End-to-End Deep Stereo Layout Estimation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, 3–8 January 2022; pp. 667–676. [Google Scholar] [CrossRef]
- Chen, C.; Fan, Q.; Panda, R. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 347–356. [Google Scholar] [CrossRef]
- Xia, Z.; Pan, X.; Song, S.; Li, L.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4784–4793. [Google Scholar] [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10778–10787. [Google Scholar] [CrossRef]
- Liu, Y.; Wang, T.; Zhang, X.; Sun, J. PETR: Position Embedding Transformation for Multi-View 3D Object Detection; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2022; Volume 13687 LNCS, pp. 531–548. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2020; Volume 12346 LNCS, pp. 213–229. [Google Scholar] [CrossRef]
- Chitta, K.; Prakash, A.; Geiger, A. NEAT: Neural Attention Fields for End-to-End Autonomous Driving. In Proceedings of the IEEE International Conference on Computer Vision 2021, Montreal, BC, Canada, 11–17 October 2021; pp. 15773–15783. [Google Scholar] [CrossRef]
- Yang, W.; Li, Q.; Liu, W.; Yu, Y.; Ma, Y.; He, S.; Pan, J. Projecting your view attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15531–15540. [Google Scholar] [CrossRef]
- Xie, Y.; Tian, J.; Zhu, X. Linking Points With Labels in 3D: A Review of Point Cloud Semantic Segmentation. IEEE Geosci. Remote Sens. Mag. 2019, 8, 38–59. [Google Scholar] [CrossRef]
- Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep Learning for 3D Point Clouds: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4338–4364. [Google Scholar] [CrossRef]
- Wang, R.; Peethambaran, J.; Chen, D. LiDAR Point Clouds to 3-D Urban Models: A Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 606–627. [Google Scholar] [CrossRef]
- Nguyen, A.; Le, B. 3D Point Cloud Segmentation: A Survey. In Proceedings of the IEEE International Conference on Robotics, Automation and Mechatronics (RAM 2013), Manila, Philippines, 12–15 November 2013; pp. 225–230. [Google Scholar] [CrossRef]
- Xu, Y.; Tong, X.; Stilla, U. Voxel-based Representation of 3D Point Clouds: Methods, Applications, and Its Potential Use in the Construction Industry. Autom. Constr. 2021, 126, 103675. [Google Scholar] [CrossRef]
- Brown, R. Building a Balanced k-d Tree in O(kn log n) Time. J. Comput. Graph. Tech. 2014, 4, 50–68. [Google Scholar]
- Meagher, D. Geometric Modeling Using Octree Encoding. Comput. Graph. Image Process. 1982, 19, 129–147. [Google Scholar] [CrossRef]
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4490–4499. [Google Scholar] [CrossRef]
- Thrun, S.; Montemerlo, M.; Dahlkamp, H.; Stavens, D.; Aron, A.; Diebel, J.; Fong, P.; Gale, J.; Halpenny, M.; Hoffmann, G.; et al. Stanley: The robot that won the DARPA Grand Challenge. J. Field Robot. 2006, 23, 661–692. [Google Scholar] [CrossRef]
- Qi, C.; Su, H.; Mo, K.; Guibas, L. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar] [CrossRef]
- Qi, C.; Yi, L.; Su, H.; Guibas, L. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 2017, pp. 5100–5109. [Google Scholar] [CrossRef]
- Kang, Z.; Yang, J.; Zhong, R.; Wu, Y.; Shi, Z.; Lindenbergh, R. Voxel-Based Extraction and Classification of 3-D Pole-Like Objects from Mobile LiDAR Point Cloud Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 4287–4298. [Google Scholar] [CrossRef]
- Shen, Z.; Liang, H.; Lin, L.; Wang, Z.; Huang, W.; Yu, J. Fast Ground Segmentation for 3D LiDAR Point Cloud Based on Jump-Convolution-Process. Remote Sens. 2021, 13, 3239. [Google Scholar] [CrossRef]
- Huang, W.; Liang, H.; Lin, L.; Wang, Z.; Wang, S.; Yu, B.; Niu, R. A Fast Point Cloud Ground Segmentation Approach Based on Coarse-To-Fine Markov Random Field. IEEE Trans. Intell. Transp. Syst. 2022, 23, 7841–7854. [Google Scholar] [CrossRef]
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231. [Google Scholar] [CrossRef]
- Lv, X.; Wang, S.; Ye, D. CFNet: LiDAR-Camera Registration Using Calibration Flow Network. Sensors 2021, 21, 8112. [Google Scholar] [CrossRef]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar] [CrossRef]
- Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10526–10535. [Google Scholar] [CrossRef]
- Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection. Int. J. Comput. Vis. 2021, 131, 531–551. [Google Scholar] [CrossRef]
- He, C.; Zeng, H.; Huang, J.; Hua, X.; Zhang, L. Structure Aware Single-Stage 3D Object Detection from Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11870–11879. [Google Scholar] [CrossRef]
- Simon, M.; Milz, S.; Amende, K.; Gross, H.M. Complex-YOLO: Real-time 3D Object Detection on Point Clouds. arXiv 2018, arXiv:1803.06199. [Google Scholar]
- Barrera, A.; Guindel, C.; Beltrán, J.; García, F. BirdNet+: End-to-End 3D Object Detection in LiDAR Bird’s Eye View. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems ITSC 2020, Rhodes, Greece, 20–23 September 2020. [Google Scholar] [CrossRef]
- Mohapatra, S.; Yogamani, S.; Gotzig, H.; Milz, S.; Mader, P. BEVDetNet: Bird’s Eye View LiDAR Point Cloud based Real-time 3D Object Detection for Autonomous Driving. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 2809–2815. [Google Scholar] [CrossRef]
- Luo, L.; Zheng, S.; Li, Y.; Fan, Y.; Yu, B.; Cao, S.Y.; Li, J.; Shen, H.L. BEVPlace: Learning LiDAR-based Place Recognition using Bird’s Eye View Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 5750–5757. [Google Scholar] [CrossRef]
- Qi, C.; Liu, W.; Wu, C.; Su, H.; Guibas, L. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 918–927. [Google Scholar] [CrossRef]
- Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying Voxel-based Representation with Transformer for 3D Object Detection. Adv. Neural Inf. Process. Syst. 2022, 35, 18442–18455. [Google Scholar]
- Borse, S.; Klingner, M.; Kumar, V.R.; Cai, H.; Almuzairee, A.; Yogamani, S.K.; Porikli, F.M. X-Align++: Cross-modal cross-view alignment for Bird’s-eye-view segmentation. Mach. Vis. Appl. 2022, 34, 1–16. [Google Scholar]
- Hao, X.; Diao, Y.; Wei, M.; Yang, Y.; Hao, P.; Yin, R.; Zhang, H.; Li, W.; Zhao, S.; Liu, Y. MapFusion: A novel BEV feature fusion network for multi-modal map construction. Inf. Fusion 2025, 119, 103018. [Google Scholar] [CrossRef]
- Wang, S.; Caesar, H.; Nan, L.; Kooij, J. UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju Island, Republic of Korea, 2–5 June 2024. [Google Scholar]
- Man, Y.; Gui, L.; Wang, Y. BEV-Guided Multi-Modality Fusion for Driving Perception. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21960–21969. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
- Harley, A.W.; Fang, Z.; Li, J.; Ambrus, R.; Fragkiadaki, K. Simple-BEV: What Really Matters for Multi-Sensor BEV Perception? In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar] [CrossRef]
- Yin, T.; Zhou, X.; Krähenbühl, P. Multimodal virtual point 3D detection. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021. NIPS’21. [Google Scholar]
- Kuhn, H. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
- Gupta, A.; Jain, S.; Choudhary, P.; Parida, M. Dynamic object detection using sparse LiDAR data for autonomous machine driving and road safety applications. Expert Syst. Appl. 2024, 255, 124636. [Google Scholar] [CrossRef]
- Chu, P.M.; Cho, S.; Park, J.; Fong, S.; Cho, K. Enhanced ground segmentation method for Lidar point clouds in human-centric autonomous robot systems. Hum.-Centric Comput. Inf. Sci. 2019, 9, 17. [Google Scholar] [CrossRef]
- Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1080–1089. [Google Scholar] [CrossRef]
- Shi, G.; Li, R.; Ma, C. PillarNet: Real-Time and High-Performance Pillar-Based 3D Object Detection. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part X. Springer: Berlin/Heidelberg, Germany, 2022; pp. 35–52. [Google Scholar] [CrossRef]
- Chen, Y.; Yu, Z.; Chen, Y.; Lan, S.; Anandkumar, A.; Jia, J.; Alvarez, J.M. FocalFormer3D: Focusing on Hard Instance for 3D Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 8360–8371. [Google Scholar] [CrossRef]
- Vora, S.; Lang, A.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4603–4611. [Google Scholar] [CrossRef]
- Xie, Y.; Xu, C.; Rakotosaona, M.J.; Rim, P.; Tombari, F.; Keutzer, K.; Tomizuka, M.; Zhan, W. SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 17545–17556. [Google Scholar] [CrossRef]
- Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; Zhang, X. Cross Modal Transformer: Towards Fast and Robust 3D Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 18222–18232. [Google Scholar] [CrossRef]
- Yin, J.; Shen, J.; Chen, R.; Li, W.; Yang, R.; Frossard, P.; Wang, W. IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 14905–14915. [Google Scholar] [CrossRef]
Method | Modality | mAP↑ | NDS↑ | Car | Truck | Bus | Trailer | Const. Veh. | Ped. | Barrier | Traffic Cone | Bicycle | Motorcycle |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[115] | L | 65.5 | 70.2 | 86.2 | 56.7 | 66.3 | 58.8 | 28.2 | 86.1 | 78.2 | 82.0 | 44.2 | 68.3 |
[116] | L | 66.0 | 71.4 | 87.6 | 57.5 | 63.6 | 63.1 | 27.9 | 87.3 | 77.2 | 83.3 | 42.3 | 70.1 |
[117] | L | 68.7 | 72.6 | 87.2 | 57.1 | 69.6 | 64.9 | 34.4 | 88.2 | 77.8 | 82.3 | 49.6 | 76.2 |
Ours | L | 68.1 | 71.3 | 85.9 | 56.2 | 66.8 | 63.5 | 31.7 | 86.9 | 76.1 | 81.5 | 48.7 | 72.9 |
[118] | C + L | 46.4 | 58.1 | 77.9 | 35.8 | 36.1 | 37.3 | 15.8 | 73.3 | 60.2 | 62.4 | 24.1 | 41.5 |
[111] | C + L | 66.4 | 70.5 | 86.8 | 58.5 | 67.4 | 57.3 | 26.1 | 89.1 | 74.8 | 85.0 | 49.3 | 70.0 |
[115] | C + L | 68.9 | 71.7 | 87.1 | 60.0 | 68.3 | 60.8 | 33.1 | 88.4 | 78.1 | 86.7 | 52.9 | 73.6 |
[110] | C + L | 70.2 | 72.9 | 88.1 | 60.9 | 69.3 | 62.1 | 34.4 | 89.2 | 78.2 | 85.2 | 52.2 | 72.2 |
[117] | C + L | 71.6 | 73.9 | 88.5 | 61.4 | 71.7 | 66.4 | 35.9 | 89.7 | 79.3 | 85.3 | 57.1 | 80.3 |
[119] | C + L | 72.0 | 73.8 | 88.0 | 60.2 | 72.0 | 64.9 | 38.7 | 90.9 | 79.2 | 87.9 | 59.8 | 78.5 |
[120] | C + L | 72.0 | 74.1 | 88.0 | 63.3 | 75.4 | 65.4 | 37.3 | 87.9 | 78.2 | 84.7 | 60.6 | 79.1 |
[121] | C + L | 73.0 | 75.2 | 88.3 | 62.7 | 74.9 | 67.3 | 38.4 | 89.3 | 78.1 | 89.2 | 59.5 | 82.4 |
Ours † | C + L | 73.9 | 76.2 | 89.0 | 63.7 | 75.6 | 68.2 | 38.8 | 91.0 | 79.4 | 89.9 | 61.3 | 83.4 |
Method | Backbone | mAP (%) | NDS (%) | FPS (Hz) |
---|---|---|---|---|
TransFusion [115] | ResNet-50 | 65.6 | 69.7 | 3.8 |
BEVFusion [110] | Swin-T | 68.5 | 71.4 | 4.2 |
SparseFusion [119] | ResNet-50 | 72.8 | 70.4 | 5.6 |
SparseFusion [119] | Swin-T | 71.0 | 73.1 | 5.3 |
CMT [120] | VoVNet-99 | 70.3 | 72.9 | 3.8 |
ISFusion [121] | Swin-T | 72.8 | 74.0 | 3.2 |
Ours | EfficientDet-B3 | 72.1 | 74.9 | 11.2 |
Ours | ResNet-50 | 73.3 | 75.4 | 8.6 |
Ours | Swin-T | 73.9 | 76.2 | 6.3 |
Configuration | Daytime mAP (%) | Daytime NDS (%) | Nighttime mAP (%) | Nighttime NDS (%) |
---|---|---|---|---|
W/o Ground Segmentation | 62.9 | 65.4 | 33.2 | 37.1 |
Cross-Modality Attention W/o CC | 70.8 | 71.9 | 52.6 | 55.9 |
Full Model | 73.9 | 76.2 | 62.3 | 63.5 |