A Review of Multi-Sensor Fusion in Autonomous Driving
Abstract
1. Introduction
- To introduce the theoretical foundations behind sensor fusion, including mathematical models, uncertainty handling, and fusion principles.
- To analyze and categorize major fusion strategies based on the abstraction level, target tasks, and architectural design.
- To discuss the advantages and limitations of camera–LiDAR fusion, BEV transformation, cross-modal Transformer layers, and temporal fusion methods.
- To highlight real-world deployment challenges, including robustness under sensor failure, calibration errors, and domain generalization.
- To explore inspirations from other domains, particularly agriculture and robotics, for understanding the transferability of fusion architectures.
- To envision future directions with emerging paradigms such as foundation models, diffusion-based representation recovery, and large language model (LLM)-driven decision making.
2. Theoretical Foundations and Sensor Characteristics
2.1. Theoretical Foundations of Sensor Fusion
2.1.1. Bayesian Filtering and Probabilistic Estimation
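For reference, the recursive Bayes filter alternates a prediction step and a measurement-update step, with the Kalman filter as its linear-Gaussian special case (standard formulation, stated here for completeness):

```latex
% Recursive Bayes filter: prediction, then measurement update
\overline{bel}(x_t) = \int p(x_t \mid x_{t-1}, u_t)\, bel(x_{t-1})\, dx_{t-1}, \qquad
bel(x_t) \propto p(z_t \mid x_t)\, \overline{bel}(x_t)

% Linear-Gaussian special case (Kalman filter)
\hat{x}_{t|t-1} = A\,\hat{x}_{t-1}, \qquad P_{t|t-1} = A P_{t-1} A^{\top} + Q
K_t = P_{t|t-1} H^{\top}\left(H P_{t|t-1} H^{\top} + R\right)^{-1}
\hat{x}_{t} = \hat{x}_{t|t-1} + K_t\left(z_t - H\,\hat{x}_{t|t-1}\right), \qquad P_t = (I - K_t H)\, P_{t|t-1}
```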
2.1.2. Multi-View Learning and Representation Fusion
- Concatenation or summation (early fusion).
- Cross-modal attention (mid fusion).
- Weighted ensemble or voting (late fusion).
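A minimal NumPy sketch of the three operations listed above; the feature shapes, token counts, and confidence weights are illustrative assumptions rather than settings taken from any cited method:

```python
import numpy as np

rng = np.random.default_rng(0)
f_cam = rng.normal(size=(256,))    # hypothetical camera feature vector
f_lidar = rng.normal(size=(256,))  # hypothetical LiDAR feature vector

# Early fusion: concatenation (or summation) of low-level features.
early = np.concatenate([f_cam, f_lidar])   # shape (512,)
early_sum = f_cam + f_lidar                # shape (256,)

# Mid fusion: cross-modal attention -- the camera feature queries a set of LiDAR tokens.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

lidar_tokens = rng.normal(size=(32, 256))            # hypothetical LiDAR token set
attn = softmax(lidar_tokens @ f_cam / np.sqrt(256))  # relevance of each LiDAR token
mid = attn @ lidar_tokens                            # attention-weighted LiDAR summary

# Late fusion: weighted ensemble of per-modality decisions (e.g., class scores).
p_cam, p_lidar = np.array([0.7, 0.3]), np.array([0.4, 0.6])
w_cam, w_lidar = 0.6, 0.4                            # e.g., confidence-derived weights
late = w_cam * p_cam + w_lidar * p_lidar
```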
2.1.3. Uncertainty Modeling in Sensor Fusion
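A common building block here is precision-weighted averaging of two independent Gaussian estimates of the same quantity (e.g., a depth value from camera and from LiDAR); learned heteroscedastic variants let the network predict the variances itself:

```latex
\hat{x}_{\mathrm{fused}} = \frac{\sigma_L^{2}\,\hat{x}_C + \sigma_C^{2}\,\hat{x}_L}{\sigma_C^{2} + \sigma_L^{2}},
\qquad
\frac{1}{\sigma_{\mathrm{fused}}^{2}} = \frac{1}{\sigma_C^{2}} + \frac{1}{\sigma_L^{2}}
```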
2.1.4. Transformer-Based Fusion and Token Alignment
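The core operation in most transformer-based fusion layers is scaled dot-product cross-attention, with queries drawn from one modality (or from learned object queries) and keys/values from the other:

```latex
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q = X_{\mathrm{cam}} W_Q, \quad K = X_{\mathrm{lidar}} W_K, \quad V = X_{\mathrm{lidar}} W_V
```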
2.2. Sensing Modalities and Fusion Properties
2.2.1. Sensors
2.2.2. Complementarity in Sensor Fusion
- Cross-modal redundancy (e.g., camera and LiDAR overlap).
- Temporal complementarity (e.g., Radar providing velocity between LiDAR frames).
- Robust fallback paths under sensor degradation (e.g., Radar-only detection under heavy fog).
2.3. Learning Paradigms for Fusion
2.3.1. Deep Learning-Based Sensor Fusion
- Scalability to high-dimensional sensory data.
- End-to-end optimization from raw input to control output.
- Ability to capture nonlinear and long-range cross-modal dependencies.
2.3.2. Probabilistic Fusion Models
- Explicit modeling of noise and confidence in sensor readings.
- Smooth state estimation over time.
- Lightweight implementations suitable for real-time systems.
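As an illustration of how lightweight such filters are in practice, a minimal 1-D constant-velocity Kalman filter is sketched below; the noise covariances and sensor period are assumed values, not taken from any cited system:

```python
import numpy as np

# 1-D constant-velocity Kalman filter fusing noisy position measurements.
dt = 0.1                                  # assumed sensor period [s]
A = np.array([[1.0, dt], [0.0, 1.0]])     # state transition (position, velocity)
H = np.array([[1.0, 0.0]])                # only position is observed
Q = 1e-3 * np.eye(2)                      # assumed process noise
R = np.array([[0.25]])                    # assumed measurement noise

x = np.zeros(2)                           # initial state estimate
P = np.eye(2)                             # initial covariance

def kf_step(x, P, z):
    # Predict.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update with measurement z.
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + (K @ (z - H @ x_pred)).ravel()
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

for z in [np.array([0.10]), np.array([0.22]), np.array([0.29])]:
    x, P = kf_step(x, P, z)
print(x)  # fused position/velocity estimate
```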
2.3.3. Hybrid and Self-Supervised Approaches
- Improved generalization under label scarcity.
- Robustness to partial sensor failure.
2.3.4. Opportunities for Online Adaptation and Continual Learning
3. Fusion Architectures and Methodologies
3.1. Fusion Strategies
3.1.1. Early Fusion
3.1.2. Mid-Level Fusion
3.1.3. Late Fusion
3.1.4. Comparative Evaluation
3.2. Camera–LiDAR Fusion Architectures
3.2.1. Cross-Modal Token Alignment and Attention-Based Fusion
- Modality-agnostic structure with adaptive interaction.
- Rich spatial and semantic alignment via attention weights.
- Compatibility with long-horizon reasoning and memory fusion.
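A NumPy sketch of this query-centric interaction pattern, in which a small set of object queries attends to camera and LiDAR token sets; the shapes and the simple additive aggregation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # hypothetical embedding width
queries = rng.normal(size=(10, d))       # learned object queries
cam_tok = rng.normal(size=(100, d))      # camera feature tokens
lid_tok = rng.normal(size=(200, d))      # LiDAR feature tokens

def cross_attend(q, kv):
    """Scaled dot-product cross-attention of queries over one modality's tokens."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

# Modality-agnostic interaction: each query gathers evidence from both streams,
# and the results are summed (a simple aggregation choice for this sketch).
fused_queries = queries + cross_attend(queries, cam_tok) + cross_attend(queries, lid_tok)
print(fused_queries.shape)  # (10, 64)
```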
3.2.2. Unified Representation Fusion in BEV Space
- Consistent geometry for downstream reasoning.
- Compatibility with CNN-based detectors and planners.
- Robustness to partial observations through spatial pooling.
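A minimal sketch of the LiDAR side of such a pipeline: rasterizing a point cloud into a BEV grid of occupancy and maximum height. The grid extents and resolution are assumed values chosen only for illustration:

```python
import numpy as np

def lidar_to_bev(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), res=0.5):
    """Rasterize LiDAR points (N, 3) into a BEV grid of occupancy and max height."""
    nx = int((x_range[1] - x_range[0]) / res)
    ny = int((y_range[1] - y_range[0]) / res)
    occupancy = np.zeros((nx, ny), dtype=np.float32)
    max_height = np.full((nx, ny), -np.inf, dtype=np.float32)

    # Keep only points inside the grid extents.
    m = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
         (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[m]
    ix = ((pts[:, 0] - x_range[0]) / res).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / res).astype(int)

    occupancy[ix, iy] = 1.0
    np.maximum.at(max_height, (ix, iy), pts[:, 2])     # per-cell maximum z
    max_height[max_height == -np.inf] = 0.0
    return np.stack([occupancy, max_height], axis=0)   # (2, nx, ny) BEV features

rng = np.random.default_rng(0)
bev = lidar_to_bev(rng.uniform(-60, 60, size=(10000, 3)))
print(bev.shape)  # (2, 200, 200)
```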
3.3. Multi-Modal BEV Construction
3.3.1. Unified BEV Projection of Modalities
3.3.2. Spatial Alignment and Fusion in BEV
3.3.3. Modular and Scalable Fusion Pipelines
3.3.4. Limitations and Considerations
3.4. Transformer-Based Query-Aligned Fusion
3.4.1. Query-to-Modality Attention and Cross-Modal Sampling
3.4.2. Temporal and Sequential Query Extension
3.4.3. Unified Query Spaces and Semantic Conditioning
3.4.4. Comparison to BEV Fusion and Challenges
- Dynamic Sampling: Tokens can attend to informative regions regardless of fixed grid structures.
- Cross-Task Flexibility: Query tokens can be tailored per task or per modality.
- End-to-End Learning: Attention weights can implicitly learn fusion relevance, bypassing hand-crafted projection steps.
- Computation Cost: Full self-attention across large feature maps is expensive.
- Stability: Query-token learning is sensitive to initialization and data imbalance.
- Interpretability: Attention maps may be hard to interpret in safety-critical contexts.
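The dynamic-sampling advantage listed above ultimately reduces to gathering features at continuous reference points; a simplified NumPy sketch of bilinear sampling from a single camera feature map is given below (single head, no learned offset network, illustrative shapes):

```python
import numpy as np

def bilinear_sample(feat, pts):
    """Sample a feature map feat (H, W, C) at continuous (x, y) points (N, 2)."""
    H, W, _ = feat.shape
    x = np.clip(pts[:, 0], 0, W - 1.001)
    y = np.clip(pts[:, 1], 0, H - 1.001)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy))[:, None] * feat[y0, x0] + \
           (wx * (1 - wy))[:, None] * feat[y0, x1] + \
           ((1 - wx) * wy)[:, None] * feat[y1, x0] + \
           (wx * wy)[:, None] * feat[y1, x1]

rng = np.random.default_rng(0)
feat_map = rng.normal(size=(64, 176, 32))               # hypothetical camera feature map
ref_pts = rng.uniform(0, 1, size=(10, 2)) * [176, 64]   # projected query reference points
sampled = bilinear_sample(feat_map, ref_pts)            # (10, 32) features per query
print(sampled.shape)
```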
4. Task-Specific Applications of Multi-Modal Sensor Fusion
4.1. Depth Completion
4.2. Dynamic Object Detection
4.3. Static Object Detection
4.4. Semantic and Instance Segmentation
4.5. Multi-Object Tracking
4.6. Online Cross-Sensor Calibration
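Online calibration methods continuously re-estimate the extrinsic transform that projects LiDAR points into the image plane; in homogeneous notation (up to scale):

```latex
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \left( R\, X_{\mathrm{lidar}} + t \right),
\qquad (R, t) = T_{\mathrm{cam} \leftarrow \mathrm{lidar}}
```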
5. Current Challenges in Multi-Modal Fusion Perception for Autonomous Driving
5.1. Sensor Degradation and Failure Robustness
5.2. Temporal and Spatial Misalignment
5.3. Domain Generalization and Scene Diversity
5.4. Computational Burden and Real-Time Constraints
5.5. Interpretability and Trustworthiness
5.6. Data Scarcity and Annotation Cost
- Self-supervised Pretraining: Models are pre-trained using proxy tasks such as masked reconstruction, contrastive matching, or temporal prediction. For instance, BEV-MAE [85] adopts masked autoencoding for BEV fusion pretraining, leveraging cross-view consistency without human labels.
- Pseudo-labeling and Label Propagation: Networks trained on labeled data are used to infer labels for unlabeled samples, which are then reused for retraining. For example, CRN [86] propagates spatial priors from Radar onto BEV maps, while L3PS [87] bootstraps LiDAR labels from camera segmentation masks.
- Domain Adaptation and Style Transfer: Adversarial training or style translation methods reduce the distribution mismatch. For example, DA-MLF [89] performs adversarial alignment between real and simulated modalities, while CAPIT [90] applies image-to-image translation to adapt camera input to target domains.
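As a concrete instance of the contrastive-matching family mentioned above, a symmetric InfoNCE objective over paired camera/LiDAR embeddings can be sketched as follows; the temperature and embedding sizes are assumptions for illustration:

```python
import numpy as np

def _logsoftmax_rows(m):
    m = m - m.max(axis=1, keepdims=True)          # numerical stability per row
    return m - np.log(np.exp(m).sum(axis=1, keepdims=True))

def info_nce(z_cam, z_lidar, tau=0.07):
    """Symmetric InfoNCE loss for N paired camera/LiDAR embeddings of shape (N, d)."""
    z_cam = z_cam / np.linalg.norm(z_cam, axis=1, keepdims=True)
    z_lidar = z_lidar / np.linalg.norm(z_lidar, axis=1, keepdims=True)
    logits = z_cam @ z_lidar.T / tau              # (N, N) pairwise similarities
    loss_c2l = -np.mean(np.diag(_logsoftmax_rows(logits)))     # camera -> LiDAR
    loss_l2c = -np.mean(np.diag(_logsoftmax_rows(logits.T)))   # LiDAR -> camera
    return 0.5 * (loss_c2l + loss_l2c)

rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(8, 128)), rng.normal(size=(8, 128))))
```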
5.7. Evaluation Metrics and Benchmark Limitations
- Benchmark extensions with adverse conditions: Incorporating fog, rain, night-time driving, or partial sensor occlusion into datasets like nuScenes-rain or Waymo-weather would better reflect deployment scenarios.
- Multi-objective metrics: Proposals like Fusion Robustness Score (FRS) or Alignment-Weighted IoU (AW-IoU) aim to jointly measure accuracy, fusion consistency, and resilience to modality dropout.
- Task-specific fusion benchmarks: Dedicated benchmarks for fusion challenges—e.g., joint depth and semantics or cross-modality correspondence—would foster architectural innovation.
5.8. Towards Foundation Models and Self-Adaptive Fusion
6. Future Directions and Research Prospects
6.1. Structured Generation via Diffusion Models for Robust Fusion
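For context, the standard denoising-diffusion formulation corrupts a clean representation x_0 with Gaussian noise and trains a network to predict that noise, which is the mechanism these fusion methods reuse for representation recovery:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t)\, I\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s),
\qquad
\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\!\left[\, \big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^{2} \,\right]
```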
6.2. Long-Context Fusion with Mamba and State Space Models
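State space models process sequences through a linear recurrence with constant memory per step; selective variants such as Mamba make the discretized parameters input-dependent, which enables content-aware, long-context fusion at linear cost:

```latex
h_t = \bar{A}\, h_{t-1} + \bar{B}\, u_t, \qquad y_t = C\, h_t
```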
6.3. Semantic Reasoning and Fusion Guidance with LLMs
6.4. Toward Universal Fusion Foundation Models
- Accept arbitrary sensor combinations as input (plug-and-play).
- Output universal representations (e.g., BEV maps, occupancy fields, trajectories).
- Transfer across domains with minimal fine-tuning.
6.5. Self-Adaptive and Continually Evolving Fusion Systems
- Online uncertainty estimation to modulate fusion weights [119].
- Memory-augmented architectures (e.g., retrieval-based Mamba) for context-aware fusion.
- Reinforcement learning (RL) policies to optimize sensor selection or fusion mode based on downstream performance (e.g., braking accuracy, latency) [120].
- Curriculum learning frameworks to gradually expose fusion models to harder conditions (e.g., sensor failure, low light) [121].
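A minimal sketch of the first mechanism in the list above, online uncertainty-modulated fusion weights; the softmax-over-negative-variance gating and the example variance values are assumptions for illustration:

```python
import numpy as np

def adaptive_fusion(feats, variances, temperature=1.0):
    """Fuse per-modality features (M, d) with weights derived from predicted uncertainty.

    Lower predicted variance -> higher fusion weight (softmax over negative variance).
    """
    logits = -np.asarray(variances) / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ np.asarray(feats), w

rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 256))    # hypothetical camera / LiDAR / Radar features
variances = [0.2, 0.05, 1.5]         # e.g., Radar flagged as degraded this frame
fused, weights = adaptive_fusion(feats, variances)
print(weights)                       # LiDAR dominates, Radar is down-weighted
```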
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
References
- Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May 2023–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 2774–2781. [Google Scholar] [CrossRef]
- Chitta, K.; Prakash, A.; Jaeger, B.; Yu, Z.; Renz, K.; Geiger, A. TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12878–12895. [Google Scholar] [CrossRef]
- Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Wu, B.; Lu, Y.; Zhou, D.; et al. DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 17161–17170. [Google Scholar] [CrossRef]
- Xie, E.; Yu, Z.; Zhou, D.; Philion, J.; Anandkumar, A.; Fidler, S.; Luo, P.; Alvarez, J.M. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird’s-Eye View Representation. arXiv 2022, arXiv:2204.05088. [Google Scholar]
- Shao, H.; Hu, Y.; Wang, L.; Song, G.; Waslander, S.L.; Liu, Y.; Li, H. LMDrive: Closed-Loop End-to-End Driving with Large Language Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 15120–15130. [Google Scholar] [CrossRef]
- Ye, T.; Jing, W.; Hu, C.; Huang, S.; Gao, L.; Li, F.; Wang, J.; Guo, K.; Xiao, W.; Mao, W.; et al. FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving. arXiv 2023, arXiv:2308.01006. [Google Scholar] [CrossRef]
- Phillips, J.; Martinez, J.; Barsan, I.A.; Casas, S.; Sadat, A.; Urtasun, R. Deep Multi-Task Learning for Joint Localization, Perception, and Prediction. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 4677–4687. [Google Scholar] [CrossRef]
- Qin, Y.; Wang, C.; Kang, Z.; Ma, N.; Li, Z.; Zhang, R. SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 21957–21967. [Google Scholar] [CrossRef]
- Jiang, B.; Chen, S.; Xu, Q.; Liao, B.; Chen, J.; Zhou, H.; Zhang, Q.; Liu, W.; Huang, C.; Wang, X. VAD: Vectorized Scene Representation for Efficient Autonomous Driving. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
- Shao, H.; Wang, L.; Chen, R.; Li, H.; Liu, Y. Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer. arXiv 2022, arXiv:2207.14024. [Google Scholar] [CrossRef]
- Malawade, V.; Mortlock, T.; Faruque, M.A.A. HydraFusion: Context-Aware Selective Sensor Fusion for Robust and Efficient Autonomous Vehicle Perception. arXiv 2022, arXiv:2201.06644. [Google Scholar]
- Wu, P.; Chen, S.; Metaxas, D. MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird’s Eye View Maps. arXiv 2020, arXiv:2003.06754. [Google Scholar]
- Pang, S.; Morris, D.; Radha, H. CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; IEEE: New York, NY, USA, 2021; pp. 10386–10393. [Google Scholar] [CrossRef]
- Du, X.; Ang, M.H.; Karaman, S.; Rus, D. A General Pipeline for 3D Detection of Vehicles. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; IEEE: New York, NY, USA, 2018; pp. 3194–3200. [Google Scholar] [CrossRef]
- Prakash, A.; Chitta, K.; Geiger, A. Multi-Modal Fusion Transformer for End-to-End Autonomous Driving. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 7073–7083. [Google Scholar] [CrossRef]
- Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; Zhang, X. Cross Modal Transformer: Towards Fast and Robust 3D Object Detection. arXiv 2023, arXiv:2301.01283. [Google Scholar] [CrossRef]
- Zeng, Y.; Zhang, D.; Wang, C.; Miao, Z.; Liu, T.; Zhan, X.; Hao, D.; Ma, C. LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 17151–17160. [Google Scholar] [CrossRef]
- Qin, Z.; Chen, J.; Chen, C.; Chen, X.; Li, X. UniFusion: Unified Multi-view Fusion Transformer for Spatial-Temporal Representation in Bird’s-Eye-View. arXiv 2023, arXiv:2207.08536. [Google Scholar]
- Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection. arXiv 2022, arXiv:2207.10316. [Google Scholar]
- Ye, M.; Xu, S.; Cao, T. HVNet: Hybrid Voxel Network for LiDAR Based 3D Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 1628–1637. [Google Scholar] [CrossRef]
- Yang, Z.; Chen, J.; Miao, Z.; Li, W.; Zhu, X.; Zhang, L. DeepInteraction: 3D Object Detection via Modality Interaction. arXiv 2022, arXiv:2208.11112. [Google Scholar] [CrossRef]
- Liu, Y.; Wang, T.; Zhang, X.; Sun, J. PETR: Position Embedding Transformation for Multi-View 3D Object Detection. arXiv 2022, arXiv:2203.05625. [Google Scholar] [CrossRef]
- Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3D Object Detection and Tracking. arXiv 2021, arXiv:2006.11275. [Google Scholar] [CrossRef]
- Li, Y.; Yu, Z.; Choy, C.; Xiao, C.; Alvarez, J.M.; Fidler, S.; Feng, C.; Anandkumar, A. VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion. arXiv 2023, arXiv:2302.12251. [Google Scholar]
- Lin, X.; Lin, T.; Pei, Z.; Huang, L.; Su, Z. Sparse4D: Multi-view 3D Object Detection with Sparse Spatial-Temporal Fusion. arXiv 2023, arXiv:2211.10581. [Google Scholar]
- Xu, S.; Zhou, D.; Fang, J.; Yin, J.; Bin, Z.; Zhang, L. FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; IEEE: New York, NY, USA, 2021; pp. 3047–3054. [Google Scholar] [CrossRef]
- Zhu, B.; Jiang, Z.; Zhou, X.; Li, Z.; Yu, G. Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection. arXiv 2019, arXiv:1908.09492. [Google Scholar] [CrossRef]
- Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying Voxel-based Representation with Transformer for 3D Object Detection. arXiv 2022, arXiv:2206.00630. [Google Scholar] [CrossRef]
- Li, Z.; Deng, H.; Li, T.; Huang, Y.; Sima, C.; Geng, X.; Gao, Y.; Wang, W.; Li, Y.; Lu, L. BEVFormer++: Improving BEVFormer for 3D Camera-only Object Detection: 1st Place Solution for Waymo Open Dataset Challenge. 2022. Available online: https://storage.googleapis.com/waymo-uploads/files/research/3DCam/3DCam_BEVFormer.pdf (accessed on 23 September 2025).
- Wu, Z.; Chen, G.; Gan, Y.; Wang, L.; Pu, J. MVFusion: Multi-View 3D Object Detection with Semantic-aligned Radar and Camera Fusion. arXiv 2023, arXiv:2302.10511. [Google Scholar]
- Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep Learning for Image and Point Cloud Fusion in Autonomous Driving: A Review. IEEE Trans. Intell. Transp. Syst. 2022, 23, 722–739. [Google Scholar] [CrossRef]
- Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.-L. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 1080–1089. [Google Scholar] [CrossRef]
- Chen, Z.; Zhang, J.; Tao, D. Progressive LiDAR Adaptation for Road Detection. IEEE/CAA J. Autom. Sin. 2019, 6, 693–702. [Google Scholar] [CrossRef]
- Li, Y.; Qi, B. Design and Analysis of a Sowing Depth Detection and Control Device for a Wheat Row Planter Based on Fuzzy PID and Multi-Sensor Fusion. Agronomy 2025, 15, 1490. [Google Scholar] [CrossRef]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: New York, NY, USA, 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Computer Vision—ECCV 2012; Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7576, pp. 746–760. [Google Scholar] [CrossRef]
- Fong, W.K.; Mohan, R.; Hurtado, J.V.; Zhou, L.; Caesar, H.; Beijbom, O.; Valada, A. Panoptic nuScenes: A Large-Scale Benchmark for LiDAR Panoptic Segmentation and Tracking. arXiv 2021, arXiv:2109.03805. [Google Scholar] [CrossRef]
- Qiu, J.; Cui, Z.; Zhang, Y.; Zhang, X.; Liu, S.; Zeng, B.; Pollefeys, M. DeepLiDAR: Deep Surface Normal Guided Depth Prediction for Outdoor Scene from Sparse LiDAR Data and Single Color Image. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 3308–3317. [Google Scholar] [CrossRef]
- Shi, K.; Feng, Y.; Hu, T.; Cao, Y.; Wu, P.; Liang, Y.; Zhang, Y.; Yan, Q. FusionNet: Multi-model Linear Fusion Framework for Low-light Image Enhancement. arXiv 2025, arXiv:2504.19295. [Google Scholar]
- Yang, G.; Tang, H.; Ding, M.; Sebe, N.; Ricci, E. Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction. arXiv 2021, arXiv:2103.12091. [Google Scholar]
- Hu, T.; Wang, W.; Gu, J.; Xia, Z.; Zhang, J.; Wang, B. Research on Apple Object Detection and Localization Method Based on Improved YOLOX and RGB-D Images. Agronomy 2023, 13, 1816. [Google Scholar] [CrossRef]
- Jiang, L.; Wang, Y.; Wu, C.; Wu, H. Fruit Distribution Density Estimation in YOLO-Detected Strawberry Images: A Kernel Density and Nearest Neighbor Analysis Approach. Agriculture 2024, 14, 1848. [Google Scholar] [CrossRef]
- Nunekpeku, X.; Zhang, W.; Gao, J.; Adade, S.Y.-S.S.; Li, H.; Chen, Q. Gel strength prediction in ultrasonicated chicken mince: Fusing near-infrared and Raman spectroscopy coupled with deep learning LSTM algorithm. Food Control. 2024, 168, 110916. [Google Scholar] [CrossRef]
- Xu, J.; Liu, H.; Shen, Y.; Zeng, X.; Zheng, X. Individual nursery trees classification and segmentation using a point cloud-based neural network with dense connection pattern. Food Control. 2024, 168, 112945. [Google Scholar] [CrossRef]
- Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum PointNets for 3D Object Detection from RGB-D Data. arXiv 2018, arXiv:1711.08488. [Google Scholar]
- Zhang, L.; Zhang, B.; Zhang, H.; Yang, W.; Hu, X.; Cai, J.; Wu, C.; Wang, X. Multi-Source Feature Fusion Network for LAI Estimation from UAV Multispectral Imagery. Agronomy 2025, 15, 988. [Google Scholar] [CrossRef]
- Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep Continuous Fusion for Multi-Sensor 3D Object Detection. arXiv 2020, arXiv:2012.10992. [Google Scholar] [CrossRef]
- Ren, Y.; Huang, X.; Aheto, J.H.; Wang, C.; Ernest, B.; Tian, X.; He, P.; Chang, X.; Wang, C. Application of volatile and spectral profiling together with multimode data fusion strategy for the discrimination of preserved eggs. Food Chem. 2021, 343, 128515. [Google Scholar] [CrossRef]
- Ma, Z.; Han, M.; Li, Y.; Yu, S.; Chandio, F.A. Comparing kernel damage of different threshing components using high-speed cameras. Int. J. Agric. Biol. Eng. 2020, 13, 215–219. [Google Scholar] [CrossRef]
- Yuanyuan, Z.; Bin, Z.; Cheng, S.; Haolu, L.; Jicheng, H.; Kunpeng, T.; Zhong, T. Review of the field environmental sensing methods based on multi-sensor information fusion technology. Int. J. Agric. Biol. Eng. 2024, 17, 1–13. [Google Scholar] [CrossRef]
- Tao, K.; Wang, A.; Shen, Y.; Lu, Z.; Peng, F.; Wei, X. Peach Flower Density Detection Based on an Improved CNN Incorporating Attention Mechanism and Multi-Scale Feature Fusion. Horticulturae 2022, 8, 904. [Google Scholar] [CrossRef]
- Ji, W.; Pan, Y.; Xu, B.; Wang, J. A Real-Time Apple Targets Detection Method for Picking Robot Based on ShufflenetV2-YOLOX. Agriculture 2022, 12, 856. [Google Scholar] [CrossRef]
- Ma, J.; Zhao, Y.; Fan, W.; Liu, J. An Improved YOLOv8 Model for Lotus Seedpod Instance Segmentation in the Lotus Pond Environment. Agronomy 2024, 14, 1325. [Google Scholar] [CrossRef]
- Li, H.; Luo, X.; Haruna, S.A.; Zareef, M.; Chen, Q.; Ding, Z.; Yan, Y. Au-Ag OHCs-based SERS sensor coupled with deep learning CNN algorithm to quantify thiram and pymetrozine in tea. Food Chem. 2023, 428, 136798. [Google Scholar] [CrossRef]
- Zhang, X.; Wang, Y.; Zhou, Z.; Zhang, Y.; Wang, X. Detection Method for Tomato Leaf Mildew Based on Hyperspectral Fusion Terahertz Technology. Foods 2023, 12, 535. [Google Scholar] [CrossRef] [PubMed]
- Tian, Y.; Sun, J.; Zhou, X.; Yao, K.; Tang, N. Detection of soluble solid content in apples based on hyperspectral technology combined with deep learning algorithm. J. Food Process. Preserv. 2022, 46, e16414. [Google Scholar] [CrossRef]
- Zhang, Z.; Lu, Y.; Zhao, Y.; Pan, Q.; Jin, K.; Xu, G.; Hu, Y. TS-YOLO: An All-Day and Lightweight Tea Canopy Shoots Detection Model. Agronomy 2023, 13, 1411. [Google Scholar] [CrossRef]
- Luo, Y.; Wei, L.; Xu, L.; Zhang, Q.; Liu, J.; Cai, Q.; Zhang, W. Stereo-vision-based multi-crop harvesting edge detection for precise automatic steering of combine harvester. Biosyst. Eng. 2022, 215, 115–128. [Google Scholar] [CrossRef]
- Yang, N.; Chang, K.; Dong, S.; Tang, J.; Wang, A.; Huang, R.; Jia, Y. Rapid image detection and recognition of rice false smut based on mobile smart devices with anti-light features from cloud database. Biosyst. Eng. 2022, 218, 229–244. [Google Scholar] [CrossRef]
- Guan, H.; Yu, Y.; Peng, D.; Zang, Y.; Lu, J.; Li, A.; Li, J. A Convolutional Capsule Network for Traffic-Sign Recognition Using Mobile LiDAR Data with Digital Images. IEEE Geosci. Remote. Sens. Lett. 2020, 17, 1067–1071. [Google Scholar] [CrossRef]
- Dai, A.; Nießner, M. 3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation. arXiv 2018, arXiv:1803.10409. [Google Scholar]
- Wang, J.; Sun, B.; Lu, Y. MVPNet: Multi-View Point Regression Networks for 3D Object Reconstruction from A Single Image. arXiv 2018, arXiv:1811.09410. [Google Scholar] [CrossRef]
- Sun, Y.; Luo, Y.; Zhang, Q.; Xu, L.; Wang, L.; Zhang, P. Estimation of Crop Height Distribution for Mature Rice Based on a Moving Surface and 3D Point Cloud Elevation. Agronomy 2022, 12, 836. [Google Scholar] [CrossRef]
- Yu, S.; Huang, X.; Wang, L.; Chang, X.; Ren, Y.; Zhang, X.; Wang, Y. Qualitative and quantitative assessment of flavor quality of Chinese soybean paste using multiple sensor technologies combined with chemometrics and a data fusion strategy. Food Chem. 2022, 405, 134859. [Google Scholar] [CrossRef]
- Zhang, F.; Chen, Z.; Ali, S.; Yang, N.; Fu, S.; Zhang, Y. Multi-class detection of cherry tomatoes using improved YOLOv4-Tiny. Int. J. Agric. Biol. Eng. 2023, 16, 225–231. [Google Scholar] [CrossRef]
- Zhu, J.; Jiang, X.; Rong, Y.; Wei, W.; Wu, S.; Jiao, T.; Chen, Q. Label-free detection of trace level zearalenone in corn oil by surface-enhanced Raman spectroscopy (SERS) coupled with deep learning models. Food Chem. 2023, 414, 135705. [Google Scholar] [CrossRef]
- Luiten, J.; Fischer, T.; Leibe, B. Track to Reconstruct and Reconstruct to Track. IEEE Robot. Autom. Lett. 2020, 5, 1803–1810. [Google Scholar] [CrossRef]
- Simon, M.; Amende, K.; Kraus, A.; Honer, J.; Samann, T.; Kaulbersch, H.; Milz, S.; Gross, H.M. Complexer-YOLO: Real-Time 3D Object Detection and Tracking on Semantic Point Clouds. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; IEEE: New York, NY, USA, 2019; pp. 1190–1199. [Google Scholar] [CrossRef]
- Liu, H.; Zhu, H. Evaluation of a Laser Scanning Sensor in Detection of Complex-Shaped Targets for Variable-Rate Sprayer Development. Trans. ASABE 2016, 59, 1181–1192. [Google Scholar] [CrossRef]
- Xu, S.; Xu, X.; Zhu, Q.; Meng, Y.; Yang, G.; Feng, H.; Yang, M.; Zhu, Q.; Xue, H.; Wang, B. Monitoring leaf nitrogen content in rice based on information fusion of multi-sensor imagery from UAV. Precis. Agric. 2023, 24, 2327–2349. [Google Scholar] [CrossRef]
- Xue, Y.; Jiang, H. Monitoring of Chlorpyrifos Residues in Corn Oil Based on Raman Spectral Deep-Learning Model. Foods 2023, 12, 2402. [Google Scholar] [CrossRef] [PubMed]
- Zhu, Y.-D.; Zou, X.-B.; Shi, J.-Y.; Zhao, J.-W.; Huang, X.-W. Observation of the Oil Content of Fried Lotus (Nelumbo nucifera Gaertn.) Root Slices by Confocal Laser Scanning Microscopy Based on Three-Dimensional Model: Determination of Oil Location and Content in Fried Snacks by Clsm. J. Food Process. Preserv. 2016, 41, e12762. [Google Scholar] [CrossRef]
- Li, H.; Sheng, W.; Adade, S.Y.-S.S.; Nunekpeku, X.; Chen, Q. Investigation of heat-induced pork batter quality detection and change mechanisms using Raman spectroscopy coupled with deep learning algorithms. Food Chem. 2024, 461, 140798. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Gao, Z.; Zhang, Y.; Zhou, J.; Wu, J.; Li, P. Real-Time Detection and Location of Potted Flowers Based on a ZED Camera and a YOLO V4-Tiny Deep Learning Algorithm. Horticulturae 2021, 8, 21. [Google Scholar] [CrossRef]
- Pei, J.; Jiang, T.; Tang, H.; Liu, N.; Jin, Y.; Fan, D.-P.; Heng, P.-A. CalibNet: Dual-branch Cross-modal Calibration for RGB-D Salient Instance Segmentation. arXiv 2024, arXiv:2307.08098. [Google Scholar] [CrossRef]
- Yuan, L.; Cai, J.; Sun, L.; Ye, C. A Preliminary Discrimination of Cluster Disqualified Shape for Table Grape by Mono-Camera Multi-Perspective Simultaneously Imaging Approach. Food Anal. Methods 2015, 9, 758–767. [Google Scholar] [CrossRef]
- Sun, J.; Zhang, L.; Zhou, X.; Yao, K.; Tian, Y.; Nirere, A. A method of information fusion for identification of rice seed varieties based on hyperspectral imaging technology. J. Food Process. Eng. 2021, 44, e13797. [Google Scholar] [CrossRef]
- Yang, N.; Yuan, M.; Wang, P.; Zhang, R.; Sun, J.; Mao, H. Tea diseases detection based on fast infrared thermal image processing technology. J. Sci. Food Agric. 2019, 99, 3459–3466. [Google Scholar] [CrossRef]
- Hong, Y.; Dai, H.; Ding, Y. Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection. In European Conference on Computer Vision; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
- Zhu, H.; Deng, J.; Zhang, Y.; Ji, J.; Mao, Q.; Li, H.; Zhang, Y. VPFNet: Improving 3D Object Detection with Virtual Point based LiDAR and Stereo Data Fusion. arXiv 2021, arXiv:2111.14382. [Google Scholar] [CrossRef]
- Bavirisetti, D.P.; Dhuli, R. Fusion of Infrared and Visible Sensor Images Based on Anisotropic Diffusion and Karhunen-Loeve Transform. IEEE Sens. J. 2015, 16, 203–209. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12346, pp. 213–229. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017; pp. 1–16. [Google Scholar]
- Alfasly, S.; Lu, J.; Xu, C.; Zou, Y. Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 20176–20185. [Google Scholar] [CrossRef]
- Lin, Z.; Wang, Y.; Qi, S.; Dong, N.; Yang, M.-H. BEV-MAE: Bird’s Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios. arXiv 2024, arXiv:2212.05758. [Google Scholar] [CrossRef]
- Kim, Y.; Shin, J.; Kim, S.; Lee, I.-J.; Choi, J.W.; Kum, D. CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception. arXiv 2023, arXiv:2304.00670. [Google Scholar] [CrossRef]
- Çanakçı, S.; Vödisch, N.; Petek, K.; Burgard, W.; Valada, A. Label-Efficient LiDAR Panoptic Segmentation. arXiv 2025, arXiv:2503.02372. [Google Scholar]
- Meng, Z.; Li, H.; Zhang, Z.; Shen, Z.; Yu, Y.; Song, X.; Wu, X. CoMoFusion: Fast and High-quality Fusion of Infrared and Visible Image with Consistency Model. arXiv 2024, arXiv:2405.20764. [Google Scholar]
- Yu, L.; Shen, L.; Yang, H.; Jiang, X.; Yan, B. A Distortion-Aware Multi-Task Learning Framework for Fractional Interpolation in Video Coding. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 2824–2836. [Google Scholar] [CrossRef]
- Xia, Y.; Monica, J.; Chao, W.-L.; Hariharan, B.; Weinberger, K.Q.; Campbell, M. Image-to-Image Translation for Autonomous Driving from Coarsely-Aligned Image Pairs. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 7756–7762. [Google Scholar] [CrossRef]
- Cheng, J.; Sun, J.; Yao, K.; Xu, M.; Tian, Y.; Dai, C. A decision fusion method based on hyperspectral imaging and electronic nose techniques for moisture content prediction in frozen-thawed pork. LWT 2022, 165, 113778. [Google Scholar] [CrossRef]
- Chen, S.; Hu, J.; Shi, Y.; Peng, Y.; Fang, J.; Zhao, R.; Zhao, L. Vehicle-to-Everything (v2x) Services Supported by LTE-Based Systems and 5G. IEEE Commun. Stand. Mag. 2017, 1, 70–76. [Google Scholar] [CrossRef]
- Yu, H.; Luo, Y.; Shu, M.; Huo, Y.; Yang, Z.; Shi, Y.; Guo, Z.; Li, H.; Hu, X.; Yuan, J.; et al. DAIR-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative 3D Object Detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 21329–21338. [Google Scholar] [CrossRef]
- Sima, C.; Renz, K.; Chitta, K.; Chen, L.; Zhang, H.; Xie, C.; Beißwenger, J.; Luo, P.; Geiger, A.; Li, H. DriveLM: Driving with Graph Visual Question Answering. arXiv 2025, arXiv:2312.14150. [Google Scholar]
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Caron, M.; Touvron, H.; Misra, I.; Jegou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. arXiv 2021, arXiv:2104.14294. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
- Wang, M.; Wang, H.; Li, Y.; Chen, L.; Cai, Y.; Shao, Z. MSAFusion: Object Detection Based on Multisensor Adaptive Fusion Under BEV. IEEE Trans. Instrum. Meas. 2025, 74, 9509212. [Google Scholar] [CrossRef]
- Lai, H.; Yin, P.; Scherer, S. AdaFusion: Visual-LiDAR Fusion with Adaptive Weights for Place Recognition. IEEE Robot. Autom. Lett. 2022, 7, 12038–12045. [Google Scholar] [CrossRef]
- Zhou, S.; Yuan, G.; Hua, Z.; Li, J. DGFEG: Dynamic Gate Fusion and Edge Graph Perception Network for Remote Sensing Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 3581–3598. [Google Scholar] [CrossRef]
- Shao, Z.; Wang, H.; Cai, Y.; Chen, L.; Li, Y. UA-Fusion: Uncertainty-Aware Multimodal Data Fusion Framework for 3-D Object Detection of Autonomous Vehicles. IEEE Trans. Instrum. Meas. 2025, 74, 3548184. [Google Scholar] [CrossRef]
- Lu, J.; Clark, C.; Zellers, R.; Mottaghi, R.; Kembhavi, A. Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks. arXiv 2022, arXiv:2206.08916. [Google Scholar]
- Choi, S.; Kim, J.; Shin, H.; Choi, J.W. Mask2Map: Vectorized HD Map Construction Using Bird’s Eye View Segmentation Masks. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2025; Volume 15127, pp. 19–36. [Google Scholar] [CrossRef]
- Jiao, Y.; Jie, Z.; Chen, S.; Chen, J.; Ma, L.; Jiang, Y.-G. MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 21643–21652. [Google Scholar] [CrossRef]
- Le, D.-T.; Shi, H.; Cai, J.; Rezatofighi, H. DifFUSER: Diffusion Model for Robust Multi-Sensor Fusion in 3D Object Detection and BEV Segmentation. arXiv 2024, arXiv:2404.04629. [Google Scholar]
- Lin, Z.; Liu, Z.; Xia, Z.; Wang, X.; Wang, Y.; Qi, S.; Dong, Y.; Dong, N.; Zhang, L.; Zhu, C. RCBEVDet: Radar-Camera Fusion in Bird’s Eye View for 3D Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 14928–14937. [Google Scholar] [CrossRef]
- Yin, J.; Shen, J.; Chen, R.; Li, W.; Yang, R.; Frossard, P.; Wang, W. IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: New York, NY, USA, 2024; pp. 14905–14915. [Google Scholar] [CrossRef]
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. arXiv 2020, arXiv:2006.11239. [Google Scholar]
- Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection. arXiv 2022, arXiv:2211.09386. [Google Scholar]
- Zou, J.; Zhu, Z.; Ye, Y.; Wang, X. DiffBEV: Conditional Diffusion Model for Bird’s Eye View Perception. arXiv 2023, arXiv:2303.08333. [Google Scholar] [CrossRef]
- Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
- He, H.; Bai, Y.; Zhang, J.; He, Q.; Chen, H.; Gan, Z.; Wang, C.; Li, X.; Tian, G.; Xie, L. MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection. arXiv 2025, arXiv:2404.06564. [Google Scholar]
- Zhou, X.; Han, X.; Yang, F.; Ma, Y.; Knoll, A.C. OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model. arXiv 2025, arXiv:2503.23463. [Google Scholar]
- Li, X.; Zhang, Z.; Tan, X.; Chen, C.; Qu, Y.; Xie, Y.; Ma, L. PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection. arXiv 2024, arXiv:2404.05231. [Google Scholar]
- Yeong, D.J.; Panduru, K.; Walsh, J. Exploring the Unseen: A Survey of Multi-Sensor Fusion and the Role of Explainable AI (XAI) in Autonomous Vehicles. Sensors 2025, 25, 856. [Google Scholar] [CrossRef] [PubMed]
- Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. arXiv 2022, arXiv:2204.14198. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv 2022, arXiv:2201.12086. [Google Scholar]
- Dey, P.; Merugu, S.; Kaveri, S. Uncertainty-Aware Fusion: An Ensemble Framework for Mitigating Hallucinations in Large Language Models. In Proceedings of the Companion Proceedings of the ACM on Web Conference 2025, Sydney, NSW, Australia, 23 May 2025; pp. 947–951. [Google Scholar] [CrossRef]
- Aghdasian, J.; Ardakani, A.H.; Aqabakee, K.; Abdollahi, F. Autonomous Driving using Residual Sensor Fusion and Deep Reinforcement Learning. arXiv 2023, arXiv:2312.16620. [Google Scholar] [CrossRef]
- Ozturk, A.; Gunel, M.B.; Dagdanov, R.; Vural, M.E.; Yurdakul, F.; Dal, M.; Ure, N.K. Investigating Value of Curriculum Reinforcement Learning in Autonomous Driving Under Diverse Road and Weather Conditions. In Proceedings of the 2021 IEEE Intelligent Vehicles Symposium Workshops (IV Workshops), Nagoya, Japan, 11–17 July 2021; IEEE: New York, NY, USA, 2021; pp. 358–363. [Google Scholar] [CrossRef]
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef]
Approach | Reasoning Paradigm | Prior Knowledge | Fusion Stage | Strengths |
---|---|---|---|---|
Bayesian Filtering | Probabilistic Inference | Strong priors | Early/Middle | Handles uncertainty, recursive updates |
Multi-View Learning | Representation Alignment | Weak priors | Late Fusion | Learns from view diversity |
Uncertainty Modeling | Variational Approximation | Moderate priors | Middle/Late | Enhances robustness, supports risk-awareness |
Transformer-based Fusion | Attention-based Structure | No explicit prior | All Stages | Captures global dependencies |
Type | Advantages | Limitations |
---|---|---|
Camera | Rich semantic information, affordable hardware, passive sensing. | Sensitive to lighting conditions (e.g., night, glare) and weather (e.g., fog, rain); lacks depth information without additional geometric cues.
LiDAR | Accurate depth perception, strong spatial resolution, invariant to lighting. | High cost, mechanical complexity (for spinning LiDAR), reduced performance in fog or rain. |
Radar | Robust to lighting and weather, direct velocity measurement. | Low spatial resolution, noisy data, ghost objects. |
Ultrasonic sensor | Low cost, simple integration. | Limited range (~2–5 m), poor angular resolution, unreliable in high-speed scenarios. |
IMU | High-frequency data, unaffected by external conditions. | Prone to drift; requires integration with other sensors for accurate long-term localization.
GNSS | Global positioning, widely available. | Prone to multipath errors, signal loss in urban canyons or tunnels. |
Fusion Methodology | Structural Assumptions | Data Dependency | Adaptability | Advantages |
---|---|---|---|---|
Complementarity in Fusion | Assumes informative diversity | Moderate | Task-specific | Exploits heterogeneous sensor strengths |
Probabilistic Fusion Models | Likelihood-based reasoning | High | Static environments | Handles noise and ambiguity |
Hybrid & Self-Supervised Approaches | Multi-loss or multi-branch | Low-to-moderate | Generalizable | Reduces labeling cost, learns semantics |
Online Adaptation & Continual Learning | Temporal shift awareness | Dynamic data | High adaptability | Supports evolving environments |
Dataset | Modalities | Tasks Supported | Weather & Time Diversity | Annotation Quality |
---|---|---|---|---|
KITTI | Camera, LiDAR, GPS, IMU | Detection, Tracking, Depth | Limited (Daylight only) | Moderate (Sparse LiDAR, 2D/3D Boxes) |
nuScenes | Camera (6), LiDAR, Radar, GPS, IMU | Detection, Tracking, Segmentation | High (Night, Rain, Fog) | Rich (360°, 3D, 2 Hz) |
Waymo | Camera (5), LiDAR (5) | Detection, Tracking | Medium (Day/Night, Light Rain) | High (Dense LiDAR, HD Maps) |
PandaSet | Camera, LiDAR, Radar | Detection, Segmentation | Medium (Clear to Rain) | Detailed (Point-level Labels) |
RADIATE | Camera, LiDAR, Radar, GPS, IMU | Detection, Tracking, Weather Testing | Very High (Snow, Fog, Rain, Night) | Dense (Multi-weather frames) |
Task Type | Key Fusion Challenges | Representative Methods |
---|---|---|
Depth Completion | Sparse-to-dense LiDAR reconstruction, geometric projection error, uncertainty modeling | DeepLiDAR, FusionNet, TransDepth |
Dynamic Object Detection | Temporal misalignment, scale variance, occlusion handling, motion-aware fusion | TransFusion, ContFuse, BEVFusion |
Static Object Detection | Long-range low-SNR targets, geometric detail preservation, semantic boundary ambiguity | CenterFusion, HVNet, FusionPainting |
Semantic & Instance Segmentation | Cross-modal semantic alignment, class imbalance, fine-grained spatial matching | 3DMV, MVPNet, DeepInteraction |
Multi-Object Tracking | Temporal consistency, identity preservation, modality-dependent Re-ID drift | BEVTrack, VPFNet, Sparse4D |
Online Cross-Sensor Calibration | Real-time extrinsic drift compensation, cross-modal geometric alignment | CalibNet, RegNet, DeepCalib |
Method | Modality | Reference | mAP ↑ | NDS ↑ |
---|---|---|---|---|
TransFusion | Camera + LiDAR | CVPR 2022 | 68.9 | 71.6 |
DeepInteraction | Camera + LiDAR | NeurIPS 2022 | 70.8 | 73.4 |
BEVFusion | Camera + LiDAR | ICRA 2023 | 70.2 | 72.9 |
MSMDFusion [105] | Camera + LiDAR | CVPR 2023 | 71.0 | 73.0 |
CMT | Camera + LiDAR | ICCV 2023 | 72.0 | 74.1 |
DifFUSER [106] | Camera + LiDAR | ECCV 2024 | 71.3 | 73.8 |
RCBEVDet [107] | Camera + Radar | CVPR 2024 | 67.3 | 72.7 |
IS-FUSION [108] | Camera + LiDAR | CVPR 2024 | 73.0 | 75.2 |
MambaFusion | Camera + LiDAR | ICCV 2025 | 73.2 | 75.9 |