C2L3-Fusion: An Integrated 3D Object Detection Method for Autonomous Vehicles
Abstract
1. Introduction
- Algorithmic advancements: Introduction of a new fusion architecture leveraging the CLOCs (Camera-LiDAR Object Candidates) framework for optimal feature integration and robust object detection [14].
- Implementation on embedded systems: Real-time deployment of the model on the Nvidia Jetson AGX Xavier, demonstrating its applicability [12].
- Comprehensive evaluation: Validation of the model on the KITTI dataset, achieving significant improvements in mean Average Precision (mAP) across easy, moderate, and hard scenarios [11].
2. Related Work
2.1. Motivation for Combining 2D and 3D Data in Object Detection
- Image-based methods rely on RGB cameras, which are effective in capturing detailed semantic information, including texture, color, and shape. However, these methods lack intrinsic depth perception, making it difficult to estimate object distances accurately, especially in occluded, poorly lit, or highly dynamic environments.
- LiDAR sensors provide precise depth and spatial data, essential for accurate 3D localization. Nevertheless, LiDAR data is often sparse, especially for small or distant objects, and lacks the rich semantic information necessary to distinguish between objects of similar shapes. Additionally, processing dense point clouds can be computationally intensive, posing challenges for real-time performance.
- The fusion of 2D and 3D data addresses these limitations by leveraging the advantages of both modalities (see Figure 1):
  - Semantic Enrichment from Images: High-resolution RGB images provide detailed object context, improving class recognition and distinguishing between visually similar objects.
  - Depth and Spatial Precision from LiDAR: LiDAR point clouds deliver accurate 3D localization, enabling robust distance and orientation estimation.
  - Improved Robustness in Complex Scenarios: The combined approach enhances detection performance in challenging environments, such as those with occlusions, varying lighting conditions, or high object density.
- Accurate real-time object localization by leveraging both the 2D image plane and the 3D spatial geometry of objects (a minimal LiDAR-to-image projection sketch follows this list).
- Prediction of object motion and behavior, such as determining movement direction (e.g., turning left or going straight), using orientation angles (α), velocity vectors (→v), and the object’s position in space.
- Effective operation in diverse and dynamic environments.
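Associating the two modalities requires projecting 3D LiDAR points (or boxes) onto the image plane using the sensor calibration. The sketch below shows the standard pinhole projection; the intrinsic matrix, extrinsic transform, and point cloud are illustrative placeholders, not values from this work.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K):
    """Project LiDAR points (N x 3) into pixel coordinates so they can be
    associated with 2D image detections.

    T_cam_lidar: 4 x 4 rigid transform from the LiDAR frame to the camera frame.
    K:           3 x 3 camera intrinsic matrix.
    Both matrices come from the sensor calibration (e.g., KITTI calib files).
    """
    # Homogeneous coordinates, then transform into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T)[:3]            # 3 x N

    # Keep points in front of the camera, then apply the pinhole projection.
    in_front = pts_cam[2] > 0.1
    pts_cam = pts_cam[:, in_front]
    pix = K @ pts_cam
    pix = pix[:2] / pix[2]                           # (u, v) pixel coordinates
    return pix.T, in_front

# Placeholder calibration: identity extrinsics and a simple intrinsic matrix.
K = np.array([[700.0, 0.0, 620.0], [0.0, 700.0, 190.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
cloud = np.array([[2.0, 1.0, 10.0], [-1.0, 0.5, 20.0]])
print(project_lidar_to_image(cloud, T, K)[0])
```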
2.2. Image-Based Detection
- Mono3D [4]: This approach uses monocular images to generate 3D bounding boxes by integrating semantics, context, and geometric priors. While Mono3D improves detection precision through Fast R-CNN-based refinements, its reliance on hand-crafted features limits scalability in complex scenarios.
- DSGN [15]: Utilizing stereo image pairs, this method estimates depth and object location independently of specific classes. Despite its effectiveness, real-time implementation is hindered by synchronization issues between stereo cameras.
- MonoDTR [16]: This more recent approach integrates depth-aware transformers, capturing spatial relationships effectively and achieving competitive results on the KITTI dataset. However, its performance is constrained in dynamic, crowded environments due to reliance on monocular depth estimation.
2.3. Point-Cloud-Based Detection
- PointPillars [17]: This method converts LiDAR point clouds into pseudo-images for processing with 2D CNNs, achieving a balance between efficiency and accuracy. However, the voxelization process can introduce information loss.
- PV-RCNN [18]: This hybrid model combines voxel-based and point-based features for improved detection performance, excelling in dense urban environments.
- Voxel R-CNN [19]: This voxel-based architecture refines proposals through a lightweight design, achieving competitive performance on the KITTI and Waymo Open datasets while maintaining computational efficiency.
2.4. Multi-Sensor Fusion-Based Detection
- Multi-View Fusion Methods: Because a single view (e.g., the front-view image or the BEV point cloud) usually does not provide enough information to understand real scenes, several works explore multi-view fusion to improve 3D object detection. AVOD [20] refines detection boxes by fusing BEV and camera feature maps for each region of interest (ROI). Although these multi-view approaches usually outperform single-view methods, they still suffer from information loss introduced when converting the point cloud to a specific view.
- Voxel and Image Fusion Methods: Many recent LiDAR-only methods convert the raw LiDAR point cloud into regular voxel grids for 3D object detection, owing to the effectiveness and efficiency of this representation. To further improve the robustness of 3D detectors, several works [21,22,23,24] combine voxel-based LiDAR features with camera images. Specifically, ContFuse [23] proposes a continuous fusion layer that achieves voxel-wise alignment between BEV and image feature maps while also capturing local information to improve detection performance. MVX-Net [21] enhances voxel feature representations with semantic image features by fusing camera image and LiDAR point cloud features at an early stage. Using a cross-view spatial feature fusion strategy, 3D-CVF [22] effectively fuses spatial features from both the camera image and the LiDAR point cloud. However, because of the quantization error introduced by voxelization, these methods struggle to establish an accurate correspondence between the camera image and the LiDAR point cloud.
- Raw Point Cloud and Image Fusion Methods: Considering that the point cloud possesses rich geometric structure information but lacks plentiful semantic information, some researchers [25,26,27] have tried to fuse the raw point cloud and the camera image. Specifically, PointFusion [27] and SIFRNet [28] first extract semantic features and produce 2D proposals from camera images using off-the-shelf 2D detectors [29,30,31]. The extracted semantic features are then combined with the point features extracted from the corresponding frustum to generate 3D bounding boxes. PointPainting [25] enriches each point feature with the corresponding output class scores predicted by a pre-trained image semantic segmentation network. PI-RCNN [32] employs a segmentation sub-network to extract full-resolution semantic feature maps from images and then fuses the multi-sensor features via a powerful PACF attention module. ImVoteNet [33] further improves the detection performance of VoteNet [34] by lifting 2D camera image votes as well as geometric, semantic, and texture cues from an off-the-shelf 2D detector and then combining them with 3D votes in point clouds.
- YOLOv8 for 2D object detection using RGB cameras. YOLOv8 is known for its exceptional speed and accuracy, making it one of the best choices for real-time vision tasks.
- PointPillars for 3D object detection using LiDAR. This model efficiently processes point clouds to generate precise 3D bounding boxes with minimal computational overhead.
3. Methodology
3.1. System Design and Fusion Network Architecture
(1) Sensor Fusion Framework
- Preprocessing: Each sensor’s raw data is independently processed to remove noise and extract key features.
- Time Synchronization: A synchronization mechanism ensures that data from different sensors correspond to the same timestamps (a minimal timestamp-matching sketch follows this list).
- Fusion Strategy: Integration is performed using a deep-learning-based method that combines YOLOv8 for 2D object detection and PointPillars for 3D object detection.
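The synchronization mechanism is not spelled out in detail; as one plausible illustration, the sketch below pairs each LiDAR sweep with the nearest camera frame by timestamp. The 50 ms tolerance (`max_offset`) is an assumed value, not a parameter from this work.

```python
import bisect

def match_nearest_timestamps(camera_stamps, lidar_stamps, max_offset=0.05):
    """Pair each LiDAR sweep with the closest camera frame in time.

    camera_stamps, lidar_stamps: sorted lists of timestamps in seconds.
    max_offset: maximum tolerated time difference (assumed 50 ms here).
    Returns a list of (lidar_idx, camera_idx) pairs.
    """
    pairs = []
    for i, t in enumerate(lidar_stamps):
        j = bisect.bisect_left(camera_stamps, t)
        # Candidates: the camera frame just before and just after the LiDAR sweep.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(camera_stamps)]
        if not candidates:
            continue
        k = min(candidates, key=lambda c: abs(camera_stamps[c] - t))
        if abs(camera_stamps[k] - t) <= max_offset:
            pairs.append((i, k))
    return pairs

# Example: a 10 Hz LiDAR matched against a 30 Hz camera.
lidar = [0.00, 0.10, 0.20, 0.30]
camera = [round(0.033 * n, 3) for n in range(12)]
print(match_nearest_timestamps(camera, lidar))
```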
(2) Data Processing Pipeline
- 2D Object Detection (YOLOv8):
  - Backbone: Extracts low- and high-level features from RGB images.
  - Neck: Combines multi-scale features to enhance detection robustness.
  - Head: Produces 2D bounding boxes, class probabilities, and confidence scores.
- 3D Object Detection (PointPillars):
  - Pillar Feature Network: Converts sparse 3D point clouds into structured pseudo-images via voxelization (a minimal pillar-gridding sketch follows this list).
  - Backbone: Uses 2D CNN layers to extract spatial features from the pseudo-images.
  - Detection Head: Outputs 3D bounding boxes and confidence scores.
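To make the pillar-gridding step concrete, the sketch below scatters a raw point cloud onto a bird’s-eye-view grid and averages the point features per pillar. The real Pillar Feature Network learns these features with a small PointNet-style encoder, so this is only an illustration of the gridding; the detection range and pillar size are assumed common KITTI settings, not the exact configuration used in this work.

```python
import numpy as np

def points_to_pseudo_image(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                           pillar_size=0.16):
    """Scatter a LiDAR point cloud (N x 4: x, y, z, intensity) onto a BEV pillar grid.

    This toy encoder averages raw point features inside each pillar instead of
    learning them, so it only illustrates the gridding step of PointPillars.
    """
    nx = int(round((x_range[1] - x_range[0]) / pillar_size))   # grid width
    ny = int(round((y_range[1] - y_range[0]) / pillar_size))   # grid height
    pseudo_image = np.zeros((points.shape[1], ny, nx), dtype=np.float32)
    counts = np.zeros((ny, nx), dtype=np.float32)

    # Keep only points inside the detection range.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    # Compute each point's pillar index in the bird's-eye-view grid.
    ix = ((pts[:, 0] - x_range[0]) / pillar_size).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / pillar_size).astype(int)

    # Accumulate features and counts per pillar, then average.
    for c in range(pts.shape[1]):
        np.add.at(pseudo_image[c], (iy, ix), pts[:, c])
    np.add.at(counts, (iy, ix), 1.0)
    nonzero = counts > 0
    pseudo_image[:, nonzero] /= counts[nonzero]
    return pseudo_image

# Example with random points: 4 feature channels (x, y, z, intensity).
cloud = np.random.rand(1000, 4) * np.array([60.0, 40.0, 3.0, 1.0])
cloud[:, 1] -= 20.0  # shift y into the symmetric range
print(points_to_pseudo_image(cloud).shape)  # (4, 496, 432)
```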
(3) C2L3-Fusion and CLOCs Framework
- IoU Matching: Matches 2D and 3D detections based on Intersection over Union (IoU) and confidence scores.
- Fusion Tensor Processing (a minimal PyTorch sketch of this head follows this list):
  - 1 × 1 Convolutions: Reduce dimensionality while preserving critical information.
  - Max Pooling: Further reduces the feature size, generating a refined tensor.
  - Fully Connected Layers: Produce the final object class, refined 3D bounding box (x, y, z, w, h, l, θ), and confidence score.
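A minimal PyTorch sketch of a CLOCs-style fusion head is shown below. It only re-scores the 3D candidates; the channel widths, the pooling axis, and the omission of the fully connected layers that also refine the box and class are assumptions made for brevity, not the exact head used in C2L3-Fusion.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Hedged sketch of a CLOCs-style fusion head.

    Input: a fusion tensor of shape (1, 4, k, n) built from k 2D and n 3D
    candidates (channel layout as described in Section 3.2).
    Output: one refined confidence score per 3D candidate.
    """
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(4, 18, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(18, 36, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(36, 1, kernel_size=1),          # -> (1, 1, k, n)
        )

    def forward(self, fusion_tensor):
        x = self.convs(fusion_tensor)                 # per-(2D, 3D) pair score
        # Max-pool over the 2D-candidate axis: keep the best-supporting 2D box
        # for every 3D candidate, giving a (1, 1, 1, n) tensor.
        x = torch.max(x, dim=2, keepdim=True).values
        return x.squeeze()                            # n refined scores

# Example: 20 2D candidates, 50 3D candidates.
head = FusionHead()
t = torch.rand(1, 4, 20, 50)
print(head(t).shape)   # torch.Size([50])
```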
(4) Evaluation Metrics
- Mean Average Precision (mAP): Measures the overall detection performance.
- Intersection over Union (IoU): Evaluates the spatial overlap between predictions and ground truth (a minimal BEV IoU sketch follows this list).
- Processing Time: Determines real-time feasibility.
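For reference, the sketch below computes the IoU of two bird’s-eye-view boxes. The KITTI evaluation uses rotated 3D/BEV boxes; this axis-aligned version ignores the yaw angle θ and is only a simplified illustration of the metric.

```python
def bev_iou_axis_aligned(box_a, box_b):
    """Axis-aligned IoU of two bird's-eye-view boxes given as (x, y, w, l)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # Intersection rectangle (zero if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# Two 2 m x 4 m vehicles offset by 1 m along x.
print(round(bev_iou_axis_aligned((0, 0, 2, 4), (1, 0, 2, 4)), 3))  # 0.333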
(5) Experimental Setup
- Simulation Environment: The CARLA simulator is used to generate realistic urban driving scenarios (a minimal sensor-rig sketch follows this list).
- Training and Testing Split: The dataset is divided into 70% for training and 30% for testing to ensure robust validation.
- Hardware Implementation: Experiments are conducted on an Nvidia Jetson AGX Xavier and a high-performance GPU for comparative analysis.
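The sketch below sets up a generic CARLA 0.9.x sensor rig, assuming a CARLA server is running on localhost:2000; the vehicle choice, mounting transform, and LiDAR attributes are illustrative, not the authors’ exact configuration.

```python
import carla

# Connect to a running CARLA server (default host/port assumed).
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()
blueprints = world.get_blueprint_library()

# Spawn an ego vehicle at the first recommended spawn point.
vehicle_bp = blueprints.filter("vehicle.*")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(vehicle_bp, spawn_point)

# Attach an RGB camera and a LiDAR roughly where a roof-mounted rig would sit.
camera_bp = blueprints.find("sensor.camera.rgb")
lidar_bp = blueprints.find("sensor.lidar.ray_cast")
lidar_bp.set_attribute("rotation_frequency", "10")   # 10 Hz sweeps
mount = carla.Transform(carla.Location(x=0.0, z=2.0))
camera = world.spawn_actor(camera_bp, mount, attach_to=vehicle)
lidar = world.spawn_actor(lidar_bp, mount, attach_to=vehicle)

# Buffer timestamped frames for the synchronization step in Section 3.1.
frames = {"camera": [], "lidar": []}
camera.listen(lambda img: frames["camera"].append((img.timestamp, img)))
lidar.listen(lambda sweep: frames["lidar"].append((sweep.timestamp, sweep)))
```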
3.2. Implementation
Algorithm 1: C2L3-Fusion
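As a minimal Python outline of the data flow in Algorithm 1 (not a reproduction of the listing itself), the sketch below wires the components described above; `yolo_2d`, `pointpillars_3d`, `build_fusion_tensor`, and `fusion_head` are hypothetical callables standing in for the trained detectors and the fusion head, and `score_threshold` is an assumed value.

```python
def c2l3_fusion_frame(image, point_cloud, yolo_2d, pointpillars_3d,
                      build_fusion_tensor, fusion_head, score_threshold=0.3):
    """Hedged outline of one C2L3-Fusion inference step.

    All four callables are stand-ins: the trained YOLOv8 detector, the
    PointPillars detector, a tensor builder (e.g., a variant of the sketch in
    Section 3.2), and the CLOCs-style fusion head. Arrays are NumPy arrays.
    """
    # 1. Run both single-modality detectors on a time-synchronized frame pair.
    boxes_2d, scores_2d = yolo_2d(image)                 # (k, 4), (k,)
    boxes_3d, scores_3d = pointpillars_3d(point_cloud)   # (n, 7), (n,)

    # 2. Build the k x n x 4 fusion tensor from pairwise IoU and the scores.
    tensor = build_fusion_tensor(boxes_2d, scores_2d, boxes_3d, scores_3d)

    # 3. Re-score the 3D candidates with the fusion head and keep confident ones.
    refined_scores = fusion_head(tensor)                  # (n,)
    keep = refined_scores >= score_threshold
    return boxes_3d[keep], refined_scores[keep]
```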
(1) Detection Pipelines
- 2D Object Detection (YOLOv8): YOLOv8 is a state-of-the-art 2D object detection model optimized for speed and accuracy. It identifies object locations in the image plane and generates bounding boxes and class probabilities (a minimal inference sketch follows this list):
  - Backbone: Extracts low- and high-level visual features from the input image.
  - Neck: Up-samples and combines features at different scales for robust detection.
  - Head (Detect): Outputs 2D bounding boxes, class probabilities, and confidence scores.
- 3D Object Detection (PointPillars): The PointPillars model processes raw 3D point cloud data to detect objects in 3D space. Key components are as follows:
  - Pillar Feature Net: Converts the sparse 3D point cloud into dense pseudo-images via pillar-based voxelization and feature encoding; the encoded pillar features are then scattered back onto the bird’s-eye-view grid to form the pseudo-image.
  - Backbone: Applies 2D convolutional layers to extract spatial features from the pseudo-images.
  - Detection Head: Outputs 3D bounding boxes and confidence scores.
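The 2D branch can be exercised with the public ultralytics API as sketched below; the "yolov8n.pt" checkpoint and the image path are placeholders for the fine-tuned model and KITTI frames actually used in this work.

```python
from ultralytics import YOLO

# Load a pretrained YOLOv8 model; "yolov8n.pt" is the public nano checkpoint,
# used here as a placeholder for the detector fine-tuned in this work.
model = YOLO("yolov8n.pt")

# Run inference on a KITTI-style image and unpack the 2D candidates that the
# fusion stage consumes: boxes, class ids, and confidence scores.
results = model("kitti_000001.png")          # path is illustrative
for result in results:
    boxes = result.boxes.xyxy.cpu().numpy()  # (k, 4) pixel coordinates
    classes = result.boxes.cls.cpu().numpy() # (k,) class indices
    scores = result.boxes.conf.cpu().numpy() # (k,) confidence scores
    print(boxes.shape, classes.shape, scores.shape)
```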
(2) CLOCs Fusion Framework
- IoU Matching and Normalized Scores (a minimal tensor-building sketch follows this list):
  - Matches 2D and 3D detections based on their Intersection over Union (IoU) and confidence scores.
  - Matched detections are weighted by their confidence scores.
  - Produces a fusion tensor of shape k × n × 4, where k is the number of 2D detections and n is the number of 3D detections.
- Fusion Tensor Processing:
  - 1 × 1 Convolutions: Lightweight 1 × 1 convolutions reduce dimensionality while preserving key features.
  - Max Pooling: Further reduces dimensionality while retaining essential features, producing a tensor of shape k × n × 1.
  - Fully Connected Layers: Output refined bounding boxes, class predictions, and confidence scores.
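One possible construction of the k × n × 4 fusion tensor is sketched below in NumPy. The four channels (pairwise IoU, 2D score, 3D score, normalized distance) follow the CLOCs paper, but the exact channel layout, the `max_distance` normalizer, and the assumption that 3D boxes are already projected onto the image plane are simplifications.

```python
import numpy as np

def iou_2d(a, b):
    """Plain axis-aligned IoU of two (x1, y1, x2, y2) boxes in pixels."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def build_fusion_tensor(boxes_2d, scores_2d, boxes_3d_projected, scores_3d,
                        distances_3d, max_distance=80.0):
    """Hedged sketch of the k x n x 4 fusion tensor described above.

    boxes_2d:            (k, 4) YOLOv8 boxes in pixel coordinates.
    boxes_3d_projected:  (n, 4) PointPillars boxes already projected onto the
                         image plane with the camera calibration (not shown).
    distances_3d:        (n,) range of each 3D candidate from the ego vehicle.
    """
    k, n = boxes_2d.shape[0], boxes_3d_projected.shape[0]
    tensor = np.zeros((k, n, 4), dtype=np.float32)
    for i in range(k):
        for j in range(n):
            tensor[i, j, 0] = iou_2d(boxes_2d[i], boxes_3d_projected[j])
            tensor[i, j, 1] = scores_2d[i]
            tensor[i, j, 2] = scores_3d[j]
            tensor[i, j, 3] = distances_3d[j] / max_distance
    return tensor
```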
(1) Data Preparation
(2) Training of the Component Models
(3) Post-Processing and Optimization (a minimal score-thresholding and NMS sketch follows this list)
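A standard post-processing step is non-maximum suppression over the fused detections; the sketch below uses axis-aligned BEV boxes and an assumed IoU threshold, whereas full 3D NMS would use rotated boxes.

```python
import numpy as np

def nms_bev(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression on axis-aligned BEV boxes (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]          # highest-confidence boxes first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Vectorized IoU between the best box and all remaining candidates.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou < iou_threshold]     # drop heavily overlapping boxes
    return np.array(keep)

boxes = np.array([[0, 0, 4, 2], [0.5, 0, 4.5, 2], [10, 10, 14, 12]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms_bev(boxes, scores))  # [0 2]: the second box overlaps the first too much
```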
4. Results and Discussion
4.1. KITTI Dataset Results
4.2. Real-World Testing Results
- Bounding box precision: The 3D localization is generally accurate, though slight alignment variations may occur in cases involving distant or partially occluded objects.
- Classification confidence: In rare instances, especially near image boundaries, detected objects may exhibit minor uncertainties in classification or bounding box definition.
- Benchmark dataset differences: Although the model shows reliable performance in controlled scenarios, natural variations in real-world sensor data quality and dynamic factors (such as moving pedestrians or vehicles) pose additional considerations compared to benchmark datasets like KITTI.
5. Conclusions and Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Thrun, S.; Montemerlo, M.; Dahlkamp, H.; Stavens, D.; Aron, A.; Diebel, J.; Fong, P.; Gale, J.; Halpenny, M.; Hoffmann, G.; et al. Stanley: The robot that won the DARPA Grand Challenge. J. Field Robot. 2006, 23, 661–692. [Google Scholar] [CrossRef]
- Litman, T. Autonomous Vehicle Implementation Predictions: Implications for Transport Planning. Victoria Transport Policy Institute. 2024. Available online: https://vtpi.org/avip.pdf (accessed on 21 April 2025).
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
- Chen, X.; Kundu, K.; Zhang, Z.; Ma, H.; Fidler, S.; Urtasun, R. Monocular 3D Object Detection for Autonomous Driving. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
- Yang, B.; Liang, M.; Urtasun, R. HDNET: Exploiting HD Maps for 3D Object Detection. arXiv 2020. [Google Scholar] [CrossRef]
- Nguyen, T.T.H.T.; Dao, T.T.; Ngo, T.B.; Phi, V.A. Self-Driving Car Navigation With Single-Beam LiDAR and Neural Networks Using JavaScript. IEEE Access 2024, 12, 190203–190219. [Google Scholar] [CrossRef]
- Binh, N.T.; Dung, B.N.; Chieu, L.X.; Long, N.; Soklin, M.; Thanh, N.D.; Tung, H.X.; Dung, N.V.; Truong, N.D.; Hoang, L.M. Deep Learning-Based Object Tracking and Following for AGV Robot. In Intelligent Systems and Networks (ICISN 2023), Lecture Notes in Networks and Systems; Springer: Singapore, 2023. [Google Scholar] [CrossRef]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
- Simon, M.; Milz, S.; Amende, K.; Gross, H. Complex-YOLO: An Euler-Region-Proposal for Real-Time 3D Object Detection on Point Clouds. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar] [CrossRef]
- Dasari, S.; Ebert, F.; Tian, S.; Nair, S.; Bucher, B.; Schmeckpeper, K.; Singh, S.; Levine, S.; Finn, C. RoboNet: Large-Scale Multi-Robot Learning. arXiv 2020. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite. Available online: http://www.cvlibs.net/datasets/kitti/ (accessed on 21 April 2025).
- Nvidia Corporation. Jetson AGX Xavier: The World’s First AI Computer for Autonomous Machines. Available online: https://developer.nvidia.com/embedded/jetson-agx-xavier (accessed on 21 April 2025).
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. arXiv 2017. [Google Scholar] [CrossRef]
- Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 1–6. [Google Scholar] [CrossRef]
- Chen, Y.; Liu, S.; Shen, X.; Jia, J. DSGN: Deep Stereo Geometry Network for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12536–12545. [Google Scholar]
- Huang, K.-C.; Wu, T.-H.; Su, H.-T.; Hsu, W.H. MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer. arXiv 2022. [Google Scholar] [CrossRef]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar] [CrossRef]
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
- Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel R-CNN: Towards High Performance Voxel-based 3D Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 1201–1209. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. AVOD: Aggregate View Object Detection for Autonomous Driving. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 7204–7211. [Google Scholar]
- Sindagi, V.A.; Zhou, Y.; Tuzel, O. MVX-Net: Multimodal VoxelNet for 3D object detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 7276–7282. [Google Scholar] [CrossRef]
- Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3D-CVF: Generating joint camera and LiDAR features using cross-view spatial feature fusion for 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 720–736. [Google Scholar] [CrossRef]
- Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 641–656. [Google Scholar] [CrossRef]
- Zhang, W.; Wang, Z.; Loy, C.C. Multi-modality cut and paste for 3D object detection. arXiv 2020, arXiv:2012.12741. [Google Scholar] [CrossRef]
- Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612. [Google Scholar] [CrossRef]
- Huang, T.; Liu, Z.; Chen, X.; Bai, X. EPNet: Enhancing point features with image semantics for 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 35–52. [Google Scholar] [CrossRef]
- Xu, D.; Anguelov, D.; Jain, A. PointFusion: Deep sensor fusion for 3D bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 244–253. [Google Scholar] [CrossRef]
- Zhao, X.; Liu, Z.; Hu, R.; Huang, K. 3D object detection using scale invariant and feature reweighting networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9267–9274. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; Volume 28, pp. 91–99. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
- Xie, L.; Xiang, C.; Yu, Z.; Xu, G.; Yang, Z.; Cai, D.; He, X. PI-RCNN: An efficient multi-sensor 3D object detector with point-based attentive cont-conv fusion module. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12460–12467. [Google Scholar] [CrossRef]
- Qi, C.R.; Chen, X.; Litany, O.; Guibas, L.J. ImVoteNet: Boosting 3D object detection in point clouds with image votes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4404–4413. [Google Scholar] [CrossRef]
- Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers. arXiv 2022. [Google Scholar] [CrossRef]
| Method | Input Data | Easy | Moderate | Hard |
|---|---|---|---|---|
| MonoDTR [16] | Camera | 21.99 | 15.39 | 12.73 |
| PointPillars [17] | LiDAR | 79.05 | 74.99 | 68.30 |
| AVOD [20] | LiDAR + RGB | 81.94 | 71.88 | 66.38 |
| CLOCs_PVCas [14] | LiDAR + RGB | 88.94 | 80.67 | 77.15 |
| C2L3-Fusion (ours) | LiDAR + RGB | 89.91 | 79.26 | 78.01 |