MDFusion: Multi-Dimension Semantic–Spatial Feature Fusion for LiDAR–Camera 3D Object Detection
Abstract
1. Introduction
- We propose a 3D semantic-level feature fusion method to address the semantic sparsity of object point clouds. Using an inverse projection operation, we precisely align the 3D voxel features with image semantic features lifted into 3D space. The module then achieves dense semantic feature fusion efficiently through element-wise summation.
- We present a 2D spatial-level feature fusion module to tackle the spatial sparsity of point clouds. RGB image features and LiDAR BEV features are aligned using learnable spatial position offsets and fused with dilated convolutions.
- Our proposed MDFusion achieves competitive results on the KITTI and ONCE datasets, excelling in particular on complex-shaped cyclists and in the long-range (50 m–Inf) setting, where it outperforms several advanced methods. Extensive experiments demonstrate that our fusion method effectively integrates multi-modal features and overcomes the sparsity of the point cloud.
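The 3D semantic-level fusion described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it only shows the general "paint voxels with projected image semantics, then sum" pattern. The projection matrix `P`, nearest-neighbor gathering, and all function names are illustrative assumptions.

```python
import numpy as np

def project_voxels_to_image(centers, P):
    """Project N x 3 voxel centers (camera frame, z > 0) to pixel
    coordinates with a 3 x 4 pinhole projection matrix P."""
    homo = np.hstack([centers, np.ones((centers.shape[0], 1))])  # N x 4
    uvw = homo @ P.T                                             # N x 3
    return uvw[:, :2] / uvw[:, 2:3]                              # N x 2 (u, v)

def semantic_level_fusion(voxel_feats, centers, sem_map, P):
    """Gather the image semantic feature at each voxel's projection
    (nearest pixel) and fuse by element-wise summation.

    voxel_feats: N x C voxel features
    sem_map:     C x H x W image semantic feature map
    """
    C, H, W = sem_map.shape
    uv = np.rint(project_voxels_to_image(centers, P)).astype(int)
    u = np.clip(uv[:, 0], 0, W - 1)   # clamp projections to image bounds
    v = np.clip(uv[:, 1], 0, H - 1)
    sem = sem_map[:, v, u].T          # N x C gathered semantic features
    return voxel_feats + sem          # dense fusion by summation
```

A voxel centered at (0, 0, 1) under `P = [[1,0,2,0],[0,1,2,0],[0,0,1,0]]` projects to pixel (2, 2), so its feature is incremented by the semantic value stored there.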
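Likewise, the 2D spatial-level fusion idea (align image BEV features to the LiDAR BEV grid with per-position offsets, then fuse with a dilated convolution) can be sketched as follows. This is a hedged illustration, not the paper's module: the offsets here are given tensors rather than learned, sampling is nearest-neighbor, and the convolution is single-channel.

```python
import numpy as np

def sample_with_offsets(img_bev, offsets):
    """Resample a C x H x W image BEV feature map at positions displaced
    by per-pixel offsets (dy, dx), aligning it to the LiDAR BEV grid."""
    C, H, W = img_bev.shape
    out = np.zeros_like(img_bev)
    for y in range(H):
        for x in range(W):
            sy = int(np.clip(y + offsets[0, y, x], 0, H - 1))
            sx = int(np.clip(x + offsets[1, y, x], 0, W - 1))
            out[:, y, x] = img_bev[:, sy, sx]  # nearest-neighbor gather
    return out

def dilated_conv2d(feat, kernel, dilation=2):
    """Single-channel 2D convolution with dilation (zero padding),
    enlarging the receptive field without extra parameters."""
    H, W = feat.shape
    k = kernel.shape[0]
    pad = dilation * (k // 2)
    padded = np.pad(feat, pad)
    out = np.zeros_like(feat)
    for y in range(H):
        for x in range(W):
            patch = padded[y : y + 2 * pad + 1 : dilation,
                           x : x + 2 * pad + 1 : dilation]
            out[y, x] = np.sum(patch * kernel)
    return out
```

With zero offsets the sampler is the identity, and a kernel whose only nonzero entry is the center reproduces the input, which makes both pieces easy to sanity-check before fusing (e.g., summing the aligned image BEV map with the LiDAR BEV map).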
2. Related Work
2.1. LiDAR-Only 3D Detection
2.2. Camera-Only Detection
2.3. LiDAR–Camera 3D Detection
2.4. Deformable Convolution
3. Method
3.1. Overall Architecture
3.2. Three-Dimensional Semantic-Level Fusion
3.3. Two-Dimensional Spatial-Level Fusion
3.4. Loss Function
4. Experiments
4.1. Implementation Details
4.1.1. Network Settings
4.1.2. Training Configuration
4.1.3. Dataset
4.1.4. Evaluation Metrics
4.2. Experimental Results
4.2.1. Main Results on KITTI
4.2.2. Visualizations
4.2.3. Main Results on ONCE
4.3. Ablation Study
5. Limitations and Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
3D | Three-dimensional |
LiDAR | Light detection and ranging |
BEV | Bird’s eye view |
2D | Two-dimensional |
CNN | Convolutional neural network |
IoU | Intersection over union |
VFE | Voxel feature encoding |
RPN | Region proposal network |
RoI | Region of interest |
3D-SLF | 3D semantic-level fusion |
2D-SLF | 2D spatial-level fusion |
ASPP | Atrous spatial pyramid pooling |
FNN | Feedforward neural network |
mAP | Mean average precision |
AP | Average precision |
References
- Ning, Y.; Cao, J.; Bao, C.; Hao, Q. DVST: Deformable Voxel Set Transformer for 3D Object Detection from Point Clouds. Remote Sens. 2023, 15, 5612. [Google Scholar] [CrossRef]
- Qiao, R.; Ji, H.; Zhu, Z.; Zhang, W. Local-to-Global Semantic Learning for Multi-View 3D Object Detection from Point Cloud. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9371–9385. [Google Scholar] [CrossRef]
- Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4604–4612. [Google Scholar]
- Zhang, Y.; Huang, D.; Wang, Y. PointAugmenting: Cross-modal augmentation for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11794–11803. [Google Scholar]
- Huang, T.; Liu, Z.; Chen, X.; Bai, X. EPNet: Enhancing point features with image semantics for 3D object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 35–52. [Google Scholar]
- Xie, S.; Yang, D.; Jiang, K.; Zhong, Y. Pixels and 3-d points alignment method for the fusion of camera and lidar data. IEEE Trans. Instrum. Meas. 2018, 68, 3661–3676. [Google Scholar]
- Wu, Q.; Li, X.; Wang, K.; Bilal, H. Regional feature fusion for on-road detection of objects using camera and 3d-lidar in high-speed autonomous vehicles. Soft Comput. 2023, 27, 18195–18213. [Google Scholar]
- Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 2774–2781. [Google Scholar]
- Hu, C.; Zheng, H.; Li, K.; Xu, J.; Mao, W.; Luo, M.; Wang, L.; Chen, M.; Peng, Q.; Liu, K.; et al. FusionFormer: A Multi-sensory Fusion in Bird’s-Eye-View and Temporal Consistent Transformer for 3D Object Detection. arXiv 2023, arXiv:2309.05257. [Google Scholar]
- Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 770–779. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems; NIPS: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Ding, Z.; Han, X.; Niethammer, M. Votenet: A deep learning label fusion method for multi-atlas segmentation. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019; Proceedings, Part III 22; Springer: Cham, Switzerland, 2019; pp. 202–210. [Google Scholar]
- Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
- Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
- Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
- Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
- Deng, J.; Shi, S.; Li, P.; Zhou, W.; Zhang, Y.; Li, H. Voxel r-cnn: Towards high performance voxel-based 3d object detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 1201–1209. [Google Scholar] [CrossRef]
- Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 3164–3173. [Google Scholar]
- Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 1951–1960. [Google Scholar]
- Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
- Wu, P.; Gu, L.; Yan, X.; Xie, H.; Wang, F.L.; Cheng, G.; Wei, M. Pv-rcnn++: Semantical point-voxel feature interaction for 3d object detection. Vis. Comput. 2022, 39, 2425–2440. [Google Scholar]
- Liu, Z.; Wu, Z.; Tóth, R. Smoke: Single-stage monocular 3d object detection via keypoint estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 996–997. [Google Scholar]
- Li, Z.; Gao, Y.; Hong, Q.; Du, Y.; Serikawa, S.; Zhang, L. Keypoint3d: Keypoint-based and anchor-free 3d object detection for autonomous driving with monocular vision. Remote Sens. 2023, 15, 1210. [Google Scholar] [CrossRef]
- Philion, J.; Fidler, S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. BEVDet: High-performance multi-camera 3D object detection in bird-eye-view. arXiv 2021, arXiv:2112.11790. [Google Scholar]
- Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
- Yoo, J.H.; Kim, Y.; Kim, J.; Choi, J.W. 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16; Springer: Cham, Switzerland, 2020; pp. 720–736. [Google Scholar]
- Pang, S.; Morris, D.; Radha, H. Clocs: Camera-lidar object candidates fusion for 3d object detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10386–10393. [Google Scholar]
- Wu, X.; Peng, L.; Yang, H.; Xie, L.; Huang, C.; Deng, C.; Liu, H.; Cai, D. Sparse fuse dense: Towards high quality 3d detection with depth completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5418–5427. [Google Scholar]
- Wu, H.; Wen, C.; Li, W.; Li, X.; Yang, R.; Wang, C. Transformation-equivariant 3d object detection for autonomous driving. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2795–2802. [Google Scholar] [CrossRef]
- Qin, Y.; Wang, C.; Kang, Z.; Ma, N.; Li, Z.; Zhang, R. SupFusion: Supervised LiDAR-camera fusion for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 22014–22024. [Google Scholar]
- Xu, X.; Dong, S.; Xu, T.; Ding, L.; Wang, J.; Jiang, P.; Song, L.; Li, J. Fusionrcnn: Lidar-camera fusion for two-stage 3d object detection. Remote Sens. 2023, 15, 1839. [Google Scholar] [CrossRef]
- Xiang, X.; Zhang, J. Fusionvit: Hierarchical 3d object detection via lidar—Camera vision transformer fusion. arXiv 2023, arXiv:2311.03620. [Google Scholar]
- Jiao, Y.; Jie, Z.; Chen, S.; Chen, J.; Ma, L.; Jiang, Y.-G. Msmdfusion: Fusing lidar and camera at multiple scales with multi-depth seeds for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21643–21652. [Google Scholar]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 9308–9316. [Google Scholar]
- Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. DeepLab: Semantic segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [PubMed]
- OpenPCDet Development Team. OpenPCDet: An Open-Source Toolbox for 3D Object Detection from Point Clouds. 2020. Available online: https://github.com/open-mmlab/OpenPCDet (accessed on 20 November 2024).
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar]
- Mao, J.; Niu, M.; Jiang, C.; Liang, H.; Chen, J.; Liang, X.; Li, Y.; Ye, C.; Zhang, W.; Li, Z.; et al. One million scenes for autonomous driving: Once dataset. arXiv 2021, arXiv:2106.11037. [Google Scholar]
- Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. Part-A^2: Part-aware and aggregation neural network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
- Wang, Y.; Chen, X.; Ma, Y. BtcDet: A bidirectional temporal context network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
- Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. F-ConvNet: Feature fusion convolutional network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 641–650. [Google Scholar]
- Li, Y.; Chen, X.; Ma, Y. CasA: A context-aware sparse attention network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Wang, Y.; Chen, X.; Ma, Y. M3DETR: Multi-modal 3D Detection Transformer for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Chen, Y.; Zhang, X.; Sun, J. PI-RCNN: An Efficient Multi-sensor Fusion Framework for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal sparse convolutional networks for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5428–5437. [Google Scholar]
- Xie, S.; Gu, J.; Guo, D.; Qi, C.R.; Guibas, L.; Litany, O. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16; Springer: Cham, Switzerland, 2020; pp. 574–591. [Google Scholar]
- Zhang, Z.; Girdhar, R.; Joulin, A.; Misra, I. Self-supervised pretraining of 3d features on any point-cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 10252–10263. [Google Scholar]
- Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 2020, 33, 9912–9924. [Google Scholar]
Sensor Type | Merits | Limitations |
---|---|---|
LiDAR | High resolution, light-independent, rich 3D information | Degraded performance in rainy, foggy, and other adverse weather |
Monocular Camera | Fine-grained object details | Susceptible to lighting variations and occlusion, lacks direct depth perception |
Stereo Camera | Rich fine-grained object details with depth acquisition | Vulnerable to changes in lighting and occlusion |
Method | Modality | mAP | Car-R40 Easy | Car-R40 Mod. | Car-R40 Hard | Cyclist-R40 Easy | Cyclist-R40 Mod. | Cyclist-R40 Hard | Pedestrian-R40 Easy | Pedestrian-R40 Mod. | Pedestrian-R40 Hard |
---|---|---|---|---|---|---|---|---|---|---|---|
PointRCNN [11] | L | 69.96 | 86.99 | 76.09 | 73.12 | 85.32 | 70.98 | 66.59 | 65.03 | 56.75 | 48.78 |
Part-A2 [44] | L | 63.99 | 87.81 | 78.49 | 73.51 | 79.17 | 63.52 | 56.93 | 53.10 | 43.35 | 40.06 |
STD [21] | L | 63.60 | 87.95 | 79.71 | 75.09 | 78.69 | 61.59 | 55.30 | 53.29 | 42.47 | 38.35 |
BtcDet [45] | L | 65.96 | 90.64 | 82.86 | 78.09 | 82.81 | 68.68 | 61.81 | 47.80 | 41.63 | 39.30 |
PV-RCNN [22] | L | 70.67 | 91.73 | 82.55 | 80.06 | 88.71 | 71.27 | 66.44 | 58.50 | 50.72 | 46.06 |
F-PointNet [46] | L+C | 57.85 | 82.19 | 69.79 | 60.59 | 72.27 | 56.12 | 49.01 | 50.53 | 42.15 | 38.08 |
AVOD-FPN [28] | L+C | 56.84 | 83.07 | 71.76 | 65.73 | 50.46 | 42.27 | 39.04 | 63.76 | 50.55 | 44.93 |
F-ConvNet [47] | L+C | 63.15 | 87.36 | 76.39 | 66.69 | 81.98 | 65.07 | 56.54 | 52.16 | 43.38 | 38.80 |
EPNet [5] | L+C | 70.96 | 88.76 | 78.65 | 78.32 | 83.88 | 65.50 | 62.70 | 66.74 | 59.29 | 54.82 |
CasA [48] | L+C | 69.77 | 91.58 | 83.06 | 80.08 | 87.91 | 73.47 | 66.17 | 54.04 | 47.09 | 44.56 |
MDFusion (Ours) | L+C | 75.76 | 92.96 | 86.03 | 83.72 | 91.61 | 74.20 | 69.73 | 68.53 | 60.89 | 54.24 |
Method | Modality | Car-R11 Easy | Car-R11 Mod. | Car-R11 Hard | Cyclist-R11 Easy | Cyclist-R11 Mod. | Cyclist-R11 Hard |
---|---|---|---|---|---|---|---|
PointRCNN [11] | L | 86.96 | 75.64 | 70.70 | 74.96 | 58.82 | 52.53 |
PV-RCNN [22] | L | 90.25 | 81.43 | 76.82 | 78.60 | 63.71 | 57.65 |
Part-A2 [44] | L | 87.81 | 78.49 | 73.51 | 79.17 | 63.52 | 56.93 |
M3DETR [49] | L | 90.28 | 81.73 | 76.96 | 83.83 | 66.74 | 59.03 |
STD [21] | L | 87.95 | 79.71 | 75.09 | 78.69 | 61.59 | 55.30 |
F-PointNet [46] | L+C | 82.19 | 69.79 | 60.59 | 72.27 | 56.12 | 49.01 |
AVOD-FPN [28] | L+C | 83.07 | 71.76 | 65.73 | 63.76 | 50.55 | 44.93 |
PI-RCNN [50] | L+C | 84.37 | 74.82 | 70.03 | - | - | - |
EPNet [5] | L+C | 89.81 | 79.28 | 74.59 | - | - | - |
3D-CVF [29] | L+C | 89.20 | 80.05 | 73.11 | - | - | - |
MDFusion (Ours) | L+C | 88.35 | 82.24 | 77.43 | 83.37 | 67.15 | 60.81 |
Method | Vehicle Overall | Vehicle 0–30 m | Vehicle 30–50 m | Vehicle 50 m–Inf | Pedestrian Overall | Pedestrian 0–30 m | Pedestrian 30–50 m | Pedestrian 50 m–Inf | Cyclist Overall | Cyclist 0–30 m | Cyclist 30–50 m | Cyclist 50 m–Inf | mAP |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline (SECOND) [17] | 71.19 | 84.04 | 63.02 | 47.25 | 26.44 | 29.33 | 24.05 | 18.05 | 58.04 | 69.96 | 52.43 | 34.61 | 51.89 |
Point Contrast [52] | 71.07 | 83.31 | 64.90 | 49.34 | 22.52 | 23.73 | 21.81 | 16.06 | 56.36 | 68.11 | 50.35 | 34.06 | 49.98 |
Depth Contrast [53] | 71.88 | 84.26 | 65.58 | 49.97 | 23.57 | 26.36 | 21.15 | 14.39 | 56.63 | 68.26 | 50.82 | 34.67 | 50.69 |
PV-RCNN [22] | 77.77 | 89.39 | 72.55 | 58.64 | 23.50 | 25.61 | 22.84 | 17.27 | 59.37 | 71.66 | 52.58 | 36.17 | 53.55 |
SwAV [54] | 72.71 | 83.68 | 65.91 | 50.10 | 25.13 | 27.77 | 22.77 | 16.36 | 58.05 | 69.99 | 52.23 | 34.86 | 51.96 |
PointRCNN [11] | 52.09 | 74.45 | 40.89 | 16.81 | 4.28 | 6.17 | 2.40 | 0.91 | 29.84 | 46.03 | 20.94 | 5.46 | 28.74 |
PointPillars [18] | 68.57 | 80.86 | 62.07 | 47.04 | 17.63 | 19.74 | 15.15 | 10.23 | 46.81 | 58.33 | 40.32 | 25.86 | 44.34 |
MDFusion (Ours) | 73.78 | 84.59 | 68.05 | 51.16 | 31.50 | 34.69 | 29.58 | 18.02 | 59.23 | 71.49 | 52.19 | 37.15 | 54.83 |
Improvement | +2.59 | +0.55 | +5.03 | +3.91 | +5.06 | +5.36 | +5.53 | −0.03 | +1.19 | +1.53 | −0.24 | +2.54 | +2.94 |
Method | 3D-Level Fusion | 2D-Level Fusion | BEV Easy | BEV Mod. | BEV Hard | Runtime |
---|---|---|---|---|---|---|
Baseline | | | 89.19 | 87.44 | 85.16 | 112 ms |
3D-SLF | √ | | 90.58 | 88.56 | 88.63 | 122 ms |
2D-SLF | | √ | 90.25 | 88.14 | 87.97 | 86 ms |
Multi-Dimension Fusion | √ | √ | 91.02 | 89.66 | 89.24 | 98 ms |
Improvement | - | - | +1.83 | +2.22 | +4.08 | −14 ms |
Fusion Stage | Car-R40 Easy | Car-R40 Mod. | Car-R40 Hard | Cyclist-R40 Easy | Cyclist-R40 Mod. | Cyclist-R40 Hard | Pedestrian-R40 Easy | Pedestrian-R40 Mod. | Pedestrian-R40 Hard | mAP |
---|---|---|---|---|---|---|---|---|---|---|
C1 | 92.93 | 86.09 | 83.74 | 92.63 | 76.24 | 71.62 | 71.54 | 62.47 | 55.59 | 76.98 |
C2 | 92.85 | 85.72 | 83.45 | 90.06 | 73.47 | 68.96 | 64.74 | 56.99 | 50.18 | 74.05 |
C3 | 92.47 | 83.25 | 80.63 | 83.35 | 66.96 | 63.46 | 53.08 | 43.79 | 37.29 | 67.14 |
C4 | - | - | - | - | - | - | - | - | - | - |
Fusion Approach | Cyclist-R40 Easy | Cyclist-R40 Mod. | Cyclist-R40 Hard | Pedestrian-R40 Easy | Pedestrian-R40 Mod. | Pedestrian-R40 Hard | mAP |
---|---|---|---|---|---|---|---|
Add | 85.8 | 72.72 | 69.44 | 68.81 | 60.88 | 55.29 | 68.82 |
Multiple | 86.78 | 73.35 | 70.06 | 69.15 | 61.23 | 57.29 | 69.54 |
Conv | 86.53 | 73.04 | 71.58 | 68.64 | 60.52 | 57.14 | 69.58 |
SAA+Dilated-Conv | 86.48 | 74.16 | 72.07 | 69.5 | 61.51 | 58.15 | 70.31 |
Method | Car-R40 Easy | Car-R40 Mod. | Car-R40 Hard | Cyclist-R40 Easy | Cyclist-R40 Mod. | Cyclist-R40 Hard | Pedestrian-R40 Easy | Pedestrian-R40 Mod. | Pedestrian-R40 Hard | mAP |
---|---|---|---|---|---|---|---|---|---|---|
HRNet | 92.58 | 85.25 | 82.87 | 90.40 | 73.76 | 69.35 | 66.08 | 59.71 | 53.82 | 74.87 |
PSPNet | 92.55 | 84.77 | 82.48 | 88.63 | 71.45 | 67.20 | 66.29 | 59.26 | 54.81 | 74.16 |
Ours | 92.96 | 86.03 | 83.72 | 91.61 | 74.20 | 69.73 | 68.53 | 60.89 | 54.24 | 75.77 |
Method | Parameters | Runtime |
---|---|---|
FocalConv-F | 13.70 M | 125 ms |
PV-RCNN | 13.16 M | 103 ms |
Ours | 11.21 M | 98 ms |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Qiao, R.; Yuan, H.; Guan, Z.; Zhang, W. MDFusion: Multi-Dimension Semantic–Spatial Feature Fusion for LiDAR–Camera 3D Object Detection. Remote Sens. 2025, 17, 1240. https://doi.org/10.3390/rs17071240