Article

Robust 3D Object Detection in Complex Traffic via Unified Feature Alignment in Bird’s Eye View

1 College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
2 Tukrin Technology, Beijing 101300, China
* Author to whom correspondence should be addressed.
World Electr. Veh. J. 2025, 16(10), 567; https://doi.org/10.3390/wevj16100567
Submission received: 21 August 2025 / Revised: 28 September 2025 / Accepted: 30 September 2025 / Published: 2 October 2025
(This article belongs to the Special Issue Recent Advances in Intelligent Vehicle)

Abstract

Reliable three-dimensional (3D) object detection is critical for intelligent vehicles to ensure safety in complex traffic environments, and recent progress in multi-modal sensor fusion, particularly between LiDAR and camera, has advanced environment perception in urban driving. However, existing approaches remain vulnerable to occlusions and dense traffic, where depth estimation errors, calibration deviations, and cross-modal misalignment are often exacerbated. To overcome these limitations, we propose BEVAlign, a local–global feature alignment framework designed to generate unified bird's eye view (BEV) representations from heterogeneous sensor modalities. The framework incorporates a Local Alignment (LA) module that enhances camera-to-BEV view transformation through graph-based neighbor modeling and dual-depth encoding, mitigating local misalignment from depth estimation errors. To further address global misalignment in BEV representations, we present the Global Alignment (GA) module, comprising a bidirectional deformable cross-attention (BDCA) mechanism and convolution–batch normalization–ReLU (CBR) blocks. BDCA employs dual queries from LiDAR and camera to jointly predict spatial sampling offsets and aggregate features, enabling bidirectional alignment within the BEV domain. The stacked CBR blocks then refine and integrate the aligned features into unified BEV representations. Experiments on the nuScenes benchmark highlight the effectiveness of BEVAlign, which achieves 71.7% mAP, outperforming BEVFusion by 1.5%. Notably, it achieves strong performance on small and occluded objects, particularly in dense traffic scenarios. These findings provide a basis for advancing cooperative environment perception in next-generation intelligent vehicle systems.

1. Introduction

Three-dimensional object detection is crucial for intelligent vehicles, serving as the foundation of environmental perception and enhancing safety in dynamic traffic environments [1,2]. To achieve robust and high-precision environmental understanding, recent studies have adopted multi-modal fusion paradigms [3,4,5], which leverage the complementary characteristics of visual and LiDAR sensors. Camera images offer dense semantic information but lack geometric accuracy. In contrast, LiDAR delivers precise spatial measurements but remains sparse and semantically limited, which constrains comprehensive scene interpretation. Therefore, developing an effective fusion strategy that integrates the strengths of both modalities while mitigating their respective limitations is crucial for advancing intelligent vehicle systems, particularly in improving localization and navigation in complex traffic scenarios.
Recent advances in multi-modal 3D detection have explored unified fusion strategies that project LiDAR and camera inputs into a common spatial domain, such as the voxel grid [6,7] or the unified BEV representation [3,5,8], to generate consistent multi-modal representations. These strategies have proven effective for enhancing the reliability of cross-modal perception. Within this trend, two mainstream paradigms are generally adopted, namely voxel-space fusion and BEV-space fusion, as shown in Figure 1a,b. The former fuses LiDAR and camera features within a 3D voxel space by voxelizing point clouds and employing learnable spatial alignment techniques; for instance, AutoAlignV2 [7] utilizes deformable cross-attention to learn 3D spatial offsets for aligning heterogeneous features. In contrast, BEV-based fusion, exemplified by BEVFusion [9], maps both modalities into a shared top-down view, which simplifies the spatial layout and benefits downstream detection.
Despite notable strides in multi-modal fusion, existing approaches still suffer from challenges caused by depth estimation errors and sensor misalignment. As illustrated in Figure 2, depth discontinuities often arise at object boundaries in the image domain, where the predicted depth fails to preserve sharp geometric transitions, leading to local feature misalignment (Figure 2a) during view transformation. Furthermore, calibration inaccuracies between the two modalities cause projection errors (Figure 2b) in the LiDAR-to-image transformation, producing cross-modal feature misalignment. These discrepancies are particularly detrimental to BEV-based fusion frameworks such as LSS [10] and BEVFusion [9], which depend on depth-guided projection and direct feature concatenation yet lack explicit mechanisms to correct such alignment errors.
To overcome these challenges, we propose a LiDAR–camera fusion framework that resolves both local and global misalignment in BEV space. The LA module uses LiDAR-projected neighborhood depth to correct depth-induced errors in the camera-to-BEV transform, while the GA module applies BDCA to correct large-scale inconsistencies and then uses stacked CBR blocks to refine the fused features. Existing methods, by comparison, offer only partial alignment capability: GraphAlign [11] focuses on graph matching at the correspondence level but remains restricted to local alignment, and ObjectFusion [12] relies on object-centric proposal fusion without explicit BEV alignment. In contrast, our approach integrates depth-guided local alignment and bidirectional global alignment to produce a unified BEV representation, ensuring both semantic and geometric consistency. Experiments on the nuScenes [13] benchmark demonstrate the effectiveness of BEVAlign, which achieves 71.7% mAP and 75.3% NDS, surpassing BEVFusion and other state-of-the-art methods. The main contributions of this work are summarized as follows:
  • We propose BEVAlign, a unified feature alignment framework that fuses LiDAR and camera data in BEV space to mitigate cross-modal misalignment.
  • We introduce a Local Alignment (LA) module that uses LiDAR-projected depth and graph-based neighbor depth to correct local errors in the camera-to-BEV transform.
  • We propose a Global Alignment (GA) module that employs bidirectional deformable cross-attention (BDCA) to align LiDAR and camera BEV features, ensuring semantic and geometric consistency. The aligned features are then fused by CBR blocks to produce a unified BEV representation.
  • Evaluation on the nuScenes benchmark shows that BEVAlign achieves robust detection performance, particularly in detecting small and partially occluded objects.

2. Related Work

2.1. LiDAR-Driven Approaches for 3D Object Detection

LiDAR-based 3D detectors can be broadly grouped according to their data representation, namely point-level, voxelized, and hybrid formulations. Point-driven approaches [14,15,16] directly handle raw point clouds, using MLPs to capture detailed spatial structures, but often face high computational costs due to the irregularity of point data. In contrast, voxel-based approaches [17,18,19,20] convert point clouds into voxel grids or pillars [21], facilitating efficient feature extraction through sparse 3D convolutions or 2D backbones. However, voxelization introduces quantization errors and may degrade spatial resolution. To balance the trade-offs, hybrid point–voxel methods [22,23] have emerged, jointly leveraging fine-grained point features and structured voxel representations to enhance detection performance.
Overall, LiDAR-based detectors provide precise geometric priors and localization but remain limited in semantic understanding. This shortcoming motivates the integration of complementary modalities such as cameras.

2.2. Image-Centric Methods for 3D Object Detection

Camera-driven techniques [24,25,26], which analyze images to predict object locations in 3D space, have attracted increasing interest, largely owing to the cost-effectiveness of cameras compared to LiDAR. Recent research [27,28,29] adapts 2D detection frameworks, utilizing monocular cameras to estimate 3D bounding box parameters. However, monocular cameras are inherently limited by their lack of depth information, which constrains their ability to capture precise 3D geometry. To overcome this limitation, some methods employ stereo [26,28,29,30] or multi-view images [31,32] to construct rich three-dimensional geometry for object detection. For instance, Pseudo-LiDAR [33], a groundbreaking stereo-based technique, reconstructs depth from stereo images to generate virtual LiDAR points.
Overall, camera-driven approaches are cost-effective and semantically informative but suffer from unreliable depth estimation. This limitation underscores the need to combine visual features with LiDAR geometry for robust 3D detection.

2.3. Multi-Sensor Fusion for 3D Object Detection

Recent advancements in 3D object understanding increasingly rely on multi-modal fusion [8,34,35] frameworks that combine LiDAR and camera data to leverage their complementary strengths. Among them, BEV-based fusion [3,5,9,36] has emerged as a dominant paradigm due to its spatial consistency and scalability. Early works such as PointPainting [37] and MV3D [38] perform fusion at the point or proposal level, projecting LiDAR points into the image plane to retrieve semantic features. However, these methods pair only sparse point samples with limited image regions, resulting in suboptimal use of rich visual context. In contrast, unified BEV fusion methods such as GAFusion [5] and BEVFusion [9] project image features into the BEV space, where they are deeply fused with LiDAR representations to form a unified spatial understanding. However, these models often depend on static calibration matrices and overlook intrinsic spatial discrepancies [11] between sensors, which can impair detection performance in complex traffic scenes.
In summary, while existing fusion strategies benefit from combining LiDAR and camera data, they remain limited in cross-modal alignment. In particular, they lack a unified paradigm to handle both local errors and global inconsistencies, motivating our BEVAlign framework in BEV space.

3. Methods

3.1. Overview

Cross-modal misalignment between LiDAR and camera features remains a major challenge for robust 3D detection. To address this, we propose a BEV-based unified architecture that aligns LiDAR and camera features for consistent fusion. The overall framework is illustrated in Figure 3. The pipeline begins with modality-specific encoders: LiDAR point clouds are voxelized and encoded by SECOND [39], with the resulting features compressed along the height dimension to form BEV representations, while multi-view images are processed by a Swin Transformer [40] backbone. The Local Alignment (LA) module then refines the camera-to-BEV transformation. It leverages LiDAR-projected depth and neighborhood depth cues to guide feature alignment; these cues are encoded by a Dual-Depth Encoder (DDE) and fused with image features by a Depth-Guided Head (DGH), ultimately producing semantically consistent camera BEV representations. The Global Alignment (GA) module further mitigates cross-modal discrepancies. It employs the BDCA mechanism with dual querying, allowing each modality to attend to geometrically and semantically consistent regions of the other. The aligned features are then fused and refined through stacked CBR blocks, yielding a unified BEV representation for downstream detection tasks. A simplified sketch of this pipeline is given below.
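The following PyTorch-style skeleton illustrates how these stages fit together. It is a minimal sketch for orientation only; the class name, constructor arguments, and tensor shapes are our own illustrative assumptions rather than the authors' implementation.

```python
import torch.nn as nn

class BEVAlignSketch(nn.Module):
    """Illustrative skeleton of the BEVAlign pipeline; submodule internals are placeholders."""

    def __init__(self, lidar_encoder, camera_backbone, la_module, ga_module, det_head):
        super().__init__()
        self.lidar_encoder = lidar_encoder      # e.g., SECOND-style voxel encoder + BEV flattening
        self.camera_backbone = camera_backbone  # e.g., Swin Transformer + FPN
        self.la_module = la_module              # Local Alignment (DDE + DGH view transformation)
        self.ga_module = ga_module              # Global Alignment (BDCA + stacked CBR blocks)
        self.det_head = det_head                # downstream 3D detection head

    def forward(self, points, images, calib):
        lidar_bev = self.lidar_encoder(points)              # (B, C, H, W) LiDAR BEV features
        img_feats = self.camera_backbone(images)            # multi-view image features
        cam_bev = self.la_module(img_feats, points, calib)  # depth-guided camera-to-BEV transform
        fused_bev = self.ga_module(lidar_bev, cam_bev)      # aligned, unified BEV representation
        return self.det_head(fused_bev)                     # 3D boxes and scores
```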

3.2. Depth-Guided Local Alignment

To mitigate local errors during the camera-to-BEV transformation, we introduce a depth-guided Local Alignment (LA) module. The key idea is to use LiDAR-projected depth as supervision for camera features. Specifically, LiDAR points are projected into the image plane using the intrinsic parameters K and extrinsic parameters [R | T] (Equation (1)), producing a sparse depth map aligned with image pixels. A KD-tree is further constructed to retrieve local geometric neighbors, providing complementary geometric cues. These projected and neighbor-guided depths are then encoded by a Dual-Depth Encoder (DDE) to form compact depth-aware features, as illustrated in Figure 4b. A Depth-Guided Head (DGH) subsequently fuses the encoded depth with multi-view image features and applies a view transformation to generate depth-aware camera BEV representations, as shown in Figure 4c. By injecting LiDAR-derived depth cues, the LA module effectively reduces depth ambiguity and enhances the semantic and geometric consistency of the resulting camera BEV features. The projection from LiDAR to the image plane is defined as:

z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = h \, \mathbf{K} \, [\mathbf{R} \mid \mathbf{T}] \begin{bmatrix} P_x \\ P_y \\ P_z \\ 1 \end{bmatrix} \qquad (1)

where (u, v) denote image coordinates, z_c is the depth in camera coordinates, and h is a scaling factor due to image downsampling.
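As a concrete illustration of this projection and the KD-tree neighbor retrieval, the NumPy/SciPy sketch below follows Equation (1). The function names, and the treatment of h as a feature-map downsampling scale applied to the pixel coordinates, are our own assumptions rather than the authors' code.

```python
import numpy as np
from scipy.spatial import cKDTree

def project_lidar_to_image(points, K, R, T, h=1.0):
    """Project LiDAR points (N, 3) into the image plane following Eq. (1).

    K: (3, 3) intrinsics; R: (3, 3) rotation; T: (3,) translation;
    h: scaling factor (assumed here to map pixels to the downsampled feature map).
    """
    cam = points @ R.T + T                  # LiDAR frame -> camera frame
    depth = cam[:, 2]                       # z_c, depth in camera coordinates
    valid = depth > 1e-3                    # keep points in front of the camera
    proj = cam[valid] @ K.T                 # pinhole projection
    uv = h * proj[:, :2] / proj[:, 2:3]     # perspective division + scaling factor h
    return uv, depth[valid]                 # sparse depth samples aligned with pixels

def neighbor_depths(uv, depth, k=8):
    """Retrieve the k depth-adjacent neighbors of each projected point with a KD-tree."""
    tree = cKDTree(uv)
    _, idx = tree.query(uv, k=k + 1)        # the nearest neighbor is the point itself
    return depth[idx[:, 1:]]                # (N, k) neighbor depths as complementary cues
```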

3.3. Global Alignment with BDCA

Direct fusion of LiDAR and camera features in the BEV space often suffers from spatial misalignment caused by projection disparities. To mitigate this issue, we introduce a Global Alignment (GA) module that explicitly aligns cross-modal features. The internal details of the GA module are illustrated in Figure 5. The GA module consists of two components. The first is the Bidirectional Deformable Cross-Attention (BDCA) mechanism. In BDCA, LiDAR and camera BEV features alternately act as queries to predict spatial offsets and attention weights. These offsets dynamically shift the sampling locations, allowing each modality to extract information from geometrically consistent regions of the other. This bidirectional design strengthens both semantic and geometric alignment. The second component is a convolution–batch normalization–ReLU (CBR) fusion block, which further refines and integrates the aligned features, producing coherent BEV representations for downstream detection. We define the deformable attention (DA) operation as follows:
\mathrm{DeformAttn}(z_q, p_q, F) = \sum_{m=1}^{M} W_m \sum_{k=1}^{K} A_{mqk} \cdot W_m \, F(p_q + \Delta p_{mqk}) \qquad (2)

F_{\mathrm{BEV}}^{C} \leftarrow \mathrm{DeformAttn}(z_q^{L}, p_q^{L}, F_{\mathrm{BEV}}^{C}), \qquad F_{\mathrm{BEV}}^{L} \leftarrow \mathrm{DeformAttn}(z_q^{C}, p_q^{C}, F_{\mathrm{BEV}}^{L}) \qquad (3)

F_{\mathrm{fused}} = \mathrm{Concat}(F_{\mathrm{BEV}}^{C}, F_{\mathrm{BEV}}^{L}) \qquad (4)

F_{\mathrm{fused}}' = \mathrm{CBR}_4(\mathrm{CBR}_3(\mathrm{CBR}_2(\mathrm{CBR}_1(F_{\mathrm{fused}})))) \oplus F_{\mathrm{fused}} \qquad (5)
The DA mechanism aggregates multi-modal features from adaptively sampled locations, guided by learned offsets and attention weights (Equation (2)). To strengthen cross-modal alignment, this operation is applied bidirectionally so that LiDAR and camera BEV features alternately act as queries to extract information from geometrically consistent regions of the other (Equation (3)). The aligned features are concatenated to form an intermediate fused representation (Equation (4)), which is subsequently refined by a CBR fusion block with residual concatenation and stacked convolutional layers (Equation (5)). This process yields coherent BEV representations for downstream detection.
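To make the bidirectional querying and CBR refinement concrete, the sketch below implements a minimal single-head variant of Equations (2)–(5) in PyTorch. It is an illustrative approximation under our own assumptions (one attention head, four sampling points, simple offset normalization), not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossAttentionBEV(nn.Module):
    """Minimal single-head deformable cross-attention on BEV maps (sketch only)."""

    def __init__(self, channels, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Conv2d(channels, 2 * num_points, 1)  # Delta p_{mqk}: 2D offsets per sample
        self.weights = nn.Conv2d(channels, num_points, 1)      # A_{mqk}: attention weights
        self.value_proj = nn.Conv2d(channels, channels, 1)     # value projection of F
        self.out_proj = nn.Conv2d(channels, channels, 1)       # output projection W_m

    def forward(self, query_bev, value_bev):
        B, C, H, W = query_bev.shape
        value = self.value_proj(value_bev)
        offsets = self.offsets(query_bev).view(B, self.num_points, 2, H, W)
        attn = self.weights(query_bev).softmax(dim=1)           # normalize over sampling points

        # Reference grid in normalized [-1, 1] coordinates for grid_sample.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W))
        ref = torch.stack((xs, ys), dim=-1).to(query_bev)        # (H, W, 2)

        out = 0
        for k in range(self.num_points):
            # Shift sampling locations by learned offsets (scaled to grid units; offsets are learned).
            loc = ref + offsets[:, k].permute(0, 2, 3, 1) / torch.tensor([W, H]).to(query_bev)
            sampled = F.grid_sample(value, loc, align_corners=False)
            out = out + attn[:, k:k + 1] * sampled               # Eq. (2), single head
        return self.out_proj(out)

class BDCAFusion(nn.Module):
    """Bidirectional deformable cross-attention (Eq. (3)) followed by CBR fusion (Eqs. (4)-(5))."""

    def __init__(self, channels):
        super().__init__()
        self.cam_from_lidar = DeformableCrossAttentionBEV(channels)  # LiDAR queries sample camera BEV
        self.lidar_from_cam = DeformableCrossAttentionBEV(channels)  # camera queries sample LiDAR BEV
        self.cbr = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(2 * channels if i == 0 else channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ) for i in range(4)])

    def forward(self, lidar_bev, cam_bev):
        cam_aligned = self.cam_from_lidar(lidar_bev, cam_bev)    # Eq. (3), camera branch
        lidar_aligned = self.lidar_from_cam(cam_bev, lidar_bev)  # Eq. (3), LiDAR branch
        fused = torch.cat((cam_aligned, lidar_aligned), dim=1)   # Eq. (4)
        # Eq. (5): refined features concatenated with the residual; a downstream 1x1 conv
        # would typically reduce the channel count again.
        return torch.cat((self.cbr(fused), fused), dim=1)
```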

4. Experiments

4.1. Dataset and Evaluation Metrics

We conduct experiments on the nuScenes dataset [13], which was collected in Boston, Massachusetts (USA), and Singapore using a full-scale autonomous driving platform. The driving routes cover the Boston Seaport area and three representative districts in Singapore, namely One North, Queenstown, and Holland Village. The dataset comprises 1000 scenes of 20 s each, partitioned into 700 for training, 150 for validation (val), and 150 for testing, selected to ensure diversity in location, traffic density, and weather conditions. The sensor suite includes six Basler acA1600-60gc RGB cameras (Basler, An der Strusbek 60–62, 22926 Ahrensburg, Germany) with a native resolution of 1600 × 1200 pixels (cropped to 1600 × 900 for recording), operating at 12 Hz with a horizontal field of view of approximately 122°. Complementary depth information is provided by a Velodyne HDL-32E LiDAR, offering 32 beams at 20 Hz with a 360° horizontal field of view and an effective range of up to 70 m. For all experiments, images are downsampled from 1600 × 900 to 256 × 704 to reduce computational cost while preserving aspect ratio, following common practice in Swin Transformer-based detectors. Additional details of the dataset and sensor configuration are available in the official nuScenes documentation (https://www.nuscenes.org/nuscenes) (accessed on 2 August 2025).
For evaluation, we follow the official 3D detection protocol, reporting both mAP and NDS as key performance indicators. The mAP reflects localization accuracy across ten categories, while NDS serves as a comprehensive indicator of detection quality, formulated as follows:
\mathrm{NDS} = \frac{1}{10} \left[ 5 \times \mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \big( 1 - \min(1, \mathrm{mTP}) \big) \right] \qquad (6)

where \mathbb{TP} denotes the set of mean true positive (mTP) metrics, and each mTP is the average value of the corresponding true positive metric across all object classes.
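For reference, the following small helper shows how the score combines mAP with the five mean true-positive error terms of the official protocol; the error values in the example call are illustrative placeholders, not results reported in this paper.

```python
def nuscenes_detection_score(mAP, tp_errors):
    """Compute NDS from mAP and the mean true-positive error metrics, as in Eq. (6)."""
    tp_terms = sum(1.0 - min(1.0, err) for err in tp_errors.values())
    return (5.0 * mAP + tp_terms) / 10.0

# Illustrative (hypothetical) error values for demonstration only.
nds = nuscenes_detection_score(
    mAP=0.717,
    tp_errors={"mATE": 0.27, "mASE": 0.25, "mAOE": 0.30, "mAVE": 0.25, "mAAE": 0.19},
)
```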

4.2. Implementation Details

BEVAlign is developed with PyTorch 1.8.1+cu111, building upon the BEVFusion [9] and OpenPCDet [41] toolkits. For the LiDAR stream, we use SECOND [39] as the voxel-based encoder; voxel dimensions are fixed at [0.075 m, 0.075 m, 0.2 m], with spatial coverage spanning [−54 m, 54 m] on both the x and y axes and [−5 m, 3 m] along the z-axis. The camera branch adopts a Swin Transformer [40] backbone, with input images resized to 256 × 704, and multi-scale image features are fused using an FPN. For the view transformation, we follow the LSS [10] setup, with a depth range of [1 m, 60 m] and frustum settings aligned with the point cloud space. The network is optimized using Adam with a learning rate of 0.001 over 20 training epochs, with a batch size of 24 and distributed computation across eight NVIDIA RTX 3090 GPUs (NVIDIA Corporation, 2788 San Tomas Expressway, Santa Clara, CA, USA). Inference is carried out without any test-time augmentation.
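The key settings above can be summarized in a configuration fragment such as the one below; the dictionary layout and key names are our own illustrative convention (loosely following OpenPCDet-style configs), not the authors' exact files.

```python
# Hypothetical configuration sketch mirroring the implementation details above.
config = dict(
    lidar=dict(
        encoder="SECOND",
        voxel_size=[0.075, 0.075, 0.2],                           # meters (x, y, z)
        point_cloud_range=[-54.0, -54.0, -5.0, 54.0, 54.0, 3.0],  # x/y/z bounds in meters
    ),
    camera=dict(
        backbone="SwinTransformer-Tiny",
        image_size=(256, 704),                                    # resized input resolution
        neck="FPN",
        depth_range=[1.0, 60.0],                                  # LSS-style frustum depth (meters)
    ),
    training=dict(
        optimizer="Adam",
        lr=1e-3,
        epochs=20,
        batch_size=24,
        num_gpus=8,                                               # NVIDIA RTX 3090
    ),
)
```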

4.3. Comparison Results

Results on the nuScenes test set. We report quantitative results of representative approaches on the nuScenes test set in Table 1. Abbreviations used in the table are as follows: ‘C.V.’ = construction vehicle, ‘T.L.’ = trailer, ‘Ped.’ = pedestrian, ‘M.T.’ = motorcycle, ‘B.R.’ = barrier, and ‘T.C.’ = traffic cone, whereas ‘L + C’ denotes LiDAR–camera fusion. The best and second-best scores are highlighted in bold red and bold blue, respectively. BEVAlign achieves 71.7% mAP and 75.3% NDS, surpassing the strong baseline BEVFusion by 1.5% and 2.4%, respectively. These improvements highlight the effectiveness of the proposed Local Alignment (LA) and Global Alignment (GA) modules in resolving feature misalignment across modalities. Compared with existing multi-modal approaches such as TransFusion [42] and DeepInteraction [43], BEVAlign achieves more consistent improvements across both large and small object categories. In particular, it delivers clear gains on small and dynamic objects, improving motorcycle by 5.5%, bicycle by 9.3%, and pedestrian by 1.7%; these categories are particularly vulnerable to calibration errors and depth discontinuities. The gains demonstrate that the proposed alignment modules help resolve fine-grained discrepancies in the multi-modal BEV representation. In addition, BEVAlign also enhances detection of large-scale traffic participants, with car and bus gains of 1.0% and 3.2%, respectively, indicating strong adaptability across diverse road environments. Overall, BEVAlign presents a promising solution for 3D detection based on multi-modal BEV fusion, and it is particularly effective in handling small-scale targets.
Performance on the nuScenes validation set. BEVAlign achieves state-of-the-art performance on the nuScenes val set as shown in Table 2, obtaining 71.2% mAP and 73.5% NDS, surpassing all competing multi-modal fusion baselines. Relative to BEVFusion, a strong baseline utilizing the same Swin-T backbone, BEVAlign delivers consistent gains, with +2.7% mAP and +2.1% NDS improvements. When compared against other recent fusion frameworks, including CMT [3] and SparseFusion [4], BEVAlign consistently demonstrates stronger detection capabilities, highlighting the effectiveness of the proposed dual-level alignment strategy.
Runtime and efficiency trade-off. Beyond accuracy, inference efficiency is critical for practical deployment. As summarized in Table 2, we extend the comparison to both multi-modal and representative camera-only detectors. Vision-only methods such as StreamPETR [53] and BEVNeXT [32] generally achieve higher throughput but at the expense of accuracy. In contrast, multi-modal frameworks deliver stronger detection performance while requiring more computation. Within this context, BEVAlign provides a balanced solution, achieving a runtime comparable to other fusion-based methods and at the same time offering the highest detection accuracy. This balance between precision and efficiency highlights its suitability for real-world autonomous driving scenarios.

4.4. Ablation Studies

Effect of the hyperparameter K_graph in the LA module. As shown in Table 3, we examine how varying the neighborhood size K_graph in the LA module, which defines the number of depth-adjacent points used for LiDAR-to-camera feature aggregation, influences detection performance. Setting K_graph to eight offers the most favorable compromise between detection quality and processing speed. Reducing the neighborhood to five slightly degrades performance, whereas increasing it to twelve brings no further improvement while causing a noticeable slowdown. Notably, BEVAlign with graph-based alignment consistently surpasses the BEVFusion [9] baseline, which corresponds to K_graph = 0, across all evaluated configurations.
Effectiveness of different cross-modal alignment strategies. We evaluate different alignment strategies for LiDAR–camera fusion, with results reported on the nuScenes validation benchmark and summarized in Table 4. LearnableAlign [56] adopts LiDAR features as queries while treating camera features as keys and values under a standard cross-attention (CA) scheme. However, this one-directional querying design has limited alignment capacity, which restricts its effectiveness in capturing cross-modal correspondences. To strengthen cross-modal interaction, DeformCAFA [7] introduces deformable attention (DA) with LiDAR × Image queries, allowing adaptive sampling from camera features. This voxel-level alignment alleviates modality gaps and improves detection performance to 68.5% mAP and 71.4% NDS, validating the advantage of DA in handling spatial heterogeneity. Building upon these insights, our GA module adopts a bidirectional deformable cross-attention (BDCA) design: both LiDAR and camera BEV features alternately serve as queries, allowing mutual feature interaction, while keys and values are jointly derived from both modalities, ensuring dynamic semantic and geometric alignment across modalities. This dual querying strategy significantly strengthens the fused BEV representation and achieves the best performance, outperforming both the CA- and DA-based baselines.
Effectiveness of different components. Table 5 summarizes the ablation analysis of different components in our BEVAlign framework on the nuScenes benchmark. The baseline follows a standard BEVFusion design. When the Local Alignment (LA) module is introduced, mAP increases by 1.2% and NDS by 1.0%, confirming that LA effectively reduces local misalignment in the camera-to-BEV transformation through depth-guided neighborhood modeling. To further evaluate the Global Alignment (GA) module, we decompose it into its BDCA and CBR subcomponents; this design alleviates semantic and geometric inconsistencies at a broader scale. Method (b), which integrates GA alone, improves mAP to 70.5% and NDS to 72.8%. The performance gain stems mainly from BDCA, which enables bidirectional interaction between LiDAR and camera features to dynamically align them in the BEV domain, while CBR further refines the fused representation. When LA and GA are combined, mAP and NDS improve by 2.7% and 2.1% over the baseline, respectively. These improvements result from the complementary effects of fine-grained local corrections and global semantic alignment. This joint design yields the most consistent and semantically aligned BEV representations, further enhancing fusion robustness in diverse traffic scenarios. The additional runtime cost remains acceptable, as most of the overhead arises from attention-based operations, while the CBR module introduces negligible latency.
Qualitative Results. Figure 6 presents a comparison between BEVAlign and the BEVFusion baseline on the nuScenes val set. While BEVFusion often suffers from missed and incomplete detections, our approach effectively alleviates these issues and produces more complete and reliable predictions; the red circles in Figure 6 highlight missed detections, whereas the light blue boxes indicate incomplete predictions where objects are only partially localized. Figure 7 further illustrates detection results of BEVAlign under various driving scenarios. The model demonstrates robust performance across different environmental conditions, including sunny, overcast, and rainy weather. In these qualitative examples, 3D bounding boxes are shown in orange for cars, blue for pedestrians, and red for bicycles, projected consistently across multi-camera views and LiDAR BEV maps. The yellow circles and boxes highlight representative regions where BEVAlign offers enhanced robustness, especially for small or partially occluded targets. Notably, BEVAlign accurately localizes small-scale objects such as pedestrians and cyclists, even when partially occluded or embedded in dense traffic. These improvements stem from the joint effect of local and global alignment, which collaboratively strengthen cross-modal fusion in the BEV space. Overall, the findings confirm the resilience of BEVAlign in complex urban traffic scenes with high vehicle density.

5. Conclusions

This study presented BEVAlign, a multi-modal feature alignment architecture that advances 3D object detection for intelligent vehicles. BEVAlign constructs consistent BEV representations by explicitly aligning LiDAR and camera features at both local and global levels. The Local Alignment (LA) module reduces depth-induced errors in the camera-to-BEV transformation, while the Global Alignment (GA) module employs bidirectional deformable cross-attention and CBR fusion to correct large-scale inconsistencies. Extensive experiments on nuScenes demonstrate the robustness of BEVAlign, with consistent improvements in challenging conditions such as dense traffic and occluded scenes.
Limitations and Future Work. Although BEVAlign demonstrates strong performance across complex traffic scenes, several limitations remain. One challenge lies in its computational demand, which may constrain deployment on embedded processors with limited resources. Future work will focus on lightweight adaptations for urban driving, with particular attention to robustness under disturbances such as uneven roads, speed bumps, and dense traffic. To achieve this goal, knowledge distillation will be explored to reduce inference cost while maintaining accuracy. Strategies such as foreground-guided transfer [57] from LiDAR-enhanced teachers and inner-geometry distillation [58] across modalities can transfer semantic and geometric priors from large teacher models to compact student networks, thereby improving efficiency without compromising detection performance.
Another limitation arises from environmental diversity. Our experiments are conducted mainly under clear-weather conditions, while heavy rainfall, dense fog, and nighttime driving can reduce image clarity, introduce noise in LiDAR returns, and make small or distant objects harder to detect. Similarly, highly congested traffic often leads to occlusions that compromise detection stability. Future research will therefore investigate more robust feature alignment under adverse scenarios, incorporating uncertainty-aware [59] and weather-adaptive fusion [60] strategies. Overall, these limitations are not unique to our framework but are also shared by multi-modal fusion systems already deployed in practice, including NIO’s ET series, the Baidu Apollo platform, and Waymo’s autonomous fleets. Improving robustness in adverse conditions and enhancing computational efficiency are thus challenges of broad relevance, and the directions outlined above may provide useful insights for advancing both academic research and industrial deployment.

Author Contributions

Conceptualization, A.L. and J.C.; methodology, A.L. and Y.Z.; software, A.L. and H.S.; validation, A.L., Y.Z. and J.C.; formal analysis, Y.Z.; investigation, H.S.; data curation, A.L. and H.S.; writing—original draft preparation, A.L. and Y.Z.; writing—review and editing, A.L. and J.C.; visualization, Y.Z.; supervision, H.S. and J.C.; project administration, J.C.; funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (No. 61771034).

Data Availability Statement

The data presented in this study are openly available from the official website of the nuScenes dataset: https://www.nuscenes.org/ (accessed on 28 September 2025).

Conflicts of Interest

Yandi Zhang is an employee of Tukrin Technology, Beijing. The remaining authors (Ajian Liu, Huichao Shi, and Juan Chen) declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
3D: Three-Dimensional
BEV: Bird's Eye View
LA: Local Alignment
GA: Global Alignment
BDCA: Bidirectional Deformable Cross-Attention
CBR: Convolution–Batch Normalization–ReLU
DeformCAFA: Deformable Cross-Attention Feature Aggregation
VT: View Transformation
DDE: Dual-Depth Encoder
DGH: Depth-Guided Head
DA: Deformable Attention
PE: Positional Encoding
IoU: Intersection over Union
mAP: mean Average Precision
NDS: nuScenes Detection Score
FPS: Frames Per Second
K: Intrinsic camera matrix
[R | T]: Extrinsic camera parameters
(u, v): Image pixel coordinates
h: Scaling factor
(P_x, P_y, P_z): 3D LiDAR point coordinates
z_c: Depth value in camera coordinates
p_q: Reference point of the query
z_q: Query feature vector
F: Input feature map for attention
M: Number of attention heads
K: Number of sampling locations per head
W_m: Projection weights for head m
A_mqk: Attention weight assigned to each sampled location
Δp_mqk: Learned offset applied to each sampled location
F_BEV^C: Camera BEV feature
F_BEV^L: LiDAR BEV feature
F_fused: Concatenated fused feature
F'_fused: Refined fused feature after stacked CBR blocks
p_q^L: Reference point of the LiDAR branch at a BEV location
p_q^C: Reference point of the camera branch at a BEV location
z_q^L: Query vector derived from the LiDAR BEV feature
z_q^C: Query vector derived from the camera BEV feature
K_graph: Number of graph neighbors

References

  1. Song, Z.; Liu, L.; Jia, F.; Luo, Y.; Jia, C.; Zhang, G.; Yang, L.; Wang, L. Robustness-Aware 3D Object Detection in Autonomous Driving: A Review and Outlook. IEEE Trans. Intell. Transp. Syst. 2024, 25, 15407–15436. [Google Scholar] [CrossRef]
  2. Wang, L.; Zhang, X.; Song, Z.; Bi, J.; Zhang, G.; Wei, H.; Tang, L.; Yang, L.; Li, J.; Jia, C.; et al. Multi-Modal 3D Object Detection in Autonomous Driving: A Survey and Taxonomy. IEEE Trans. Intell. Veh. 2023, 8, 3781–3798. [Google Scholar] [CrossRef]
  3. Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; Zhang, X. Cross Modal Transformer: Towards Fast and Robust 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 18268–18278. [Google Scholar] [CrossRef]
  4. Xie, Y.; Xu, C.; Rakotosaona, M.-J.; Rim, P.; Tombari, F.; Keutzer, K.; Tomizuka, M.; Zhan, W. SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 17591–17602. [Google Scholar] [CrossRef]
  5. Li, X.; Fan, B.; Tian, J.; Fan, H. GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 21209–21218. [Google Scholar] [CrossRef]
  6. Wu, X.; Peng, L.; Yang, H.; Xie, L.; Huang, C.; Deng, C.; Liu, H.; Cai, D. Sparse Fuse Dense: Towards High Quality 3D Detection with Depth Completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 5418–5427. [Google Scholar] [CrossRef]
  7. Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 628–644. [Google Scholar] [CrossRef]
  8. Song, Y.; Wang, L. BiCo-Fusion: Bidirectional Complementary LiDAR–Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection. IEEE Robot. Autom. Lett. 2025, 10, 1457–1464. [Google Scholar] [CrossRef]
  9. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.L.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 2774–2781. [Google Scholar] [CrossRef]
  10. Philion, J.; Fidler, S. Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 194–210. [Google Scholar] [CrossRef]
  11. Song, Z.; Jia, C.; Yang, L.; Wei, H.; Liu, L. GraphAlign++: An Accurate Feature Alignment by Graph Matching for Multi-Modal 3D Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 2619–2632. [Google Scholar] [CrossRef]
  12. Cai, Q.; Pan, Y.; Yao, T.; Ngo, C.-W.; Mei, T. ObjectFusion: Multi-Modal 3D Object Detection with Object-Centric Fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 18067–18076. [Google Scholar] [CrossRef]
  13. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 11621–11631. [Google Scholar] [CrossRef]
  14. Liu, Q.; Dong, Y.; Zhao, D.; Xiao, L.; Dai, B.; Min, C.; Zhang, J.; Nie, Y.; Lu, D. MT-SSD: Single-Stage 3D Object Detector Based on Magnification Transformation. IEEE Trans. Intell. Veh. 2024, 1–11. [Google Scholar] [CrossRef]
  15. Liu, H.; Ma, Y.; Wang, H.; Zhang, C.; Guo, Y. AnchorPoint: Query Design for Transformer-Based 3D Object Detection and Tracking. IEEE Trans. Intell. Transp. Syst. 2023, 24, 10988–11000. [Google Scholar] [CrossRef]
  16. He, X.; Wang, Z.; Lin, J.; Nai, K.; Yuan, J.; Li, Z. Do-SA&R: Distant Object Augmented Set Abstraction and Regression for Point-Based 3D Object Detection. IEEE Trans. Image Process. 2023, 32, 5852–5864. [Google Scholar] [CrossRef]
  17. Song, Z.; Wei, H.; Jia, C.; Xia, Y.; Li, X.; Zhang, C. VP-Net: Voxels as Points for 3-D Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  18. Mahmoud, A.; Hu, J.S.K.; Waslander, S.L. Dense Voxel Fusion for 3D Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–7 January 2023; IEEE: New York, NY, USA, 2023; pp. 663–672. [Google Scholar] [CrossRef]
  19. An, P.; Duan, Y.; Huang, Y.; Ma, J.; Chen, Y.; Wang, L.; Yang, Y.; Liu, Q. SP-Det: Leveraging Saliency Prediction for Voxel-Based 3D Object Detection in Sparse Point Cloud. IEEE Trans. Multimed. 2023, 26, 2795–2808. [Google Scholar] [CrossRef]
  20. Wu, H.; Wen, C.; Li, W.; Li, X.; Yang, R.; Wang, C. Transformation-Equivariant 3D Object Detection for Autonomous Driving. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023; AAAI Press: Palo Alto, CA, USA, 2023; pp. 2795–2802. [Google Scholar] [CrossRef]
  21. Tao, L.; Wang, H.; Chen, L.; Li, Y.; Cai, Y. Pillar3D-Former: A Pillar-Based 3-D Object Detection and Tracking Method for Autonomous Driving Scenes. IEEE Trans. Instrum. Meas. 2024, 73, 1–14. [Google Scholar] [CrossRef]
  22. Liu, A.; Yuan, L.; Chen, J. CSA-RCNN: Cascaded Self-Attention Networks for High-Quality 3-D Object Detection from LiDAR Point Clouds. IEEE Trans. Instrum. Meas. 2024, 73, 1–13. [Google Scholar] [CrossRef]
  23. Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point–Voxel Feature Set Abstraction with Local Vector Representation for 3D Object Detection. Int. J. Comput. Vis. 2023, 131, 531–551. [Google Scholar] [CrossRef]
  24. Yang, L.; Zhang, X.; Li, J.; Wang, L.; Zhu, M.; Zhang, C.; Liu, H. Mix-Teaching: A Simple, Unified and Effective Semi-Supervised Learning Framework for Monocular 3D Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6832–6844. [Google Scholar] [CrossRef]
  25. Yang, Z.; Yu, Z.; Choy, C.; Wang, R.; Anandkumar, A.; Alvarez, J.M. Improving Distant 3D Object Detection Using 2D Box Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 14853–14863. [Google Scholar] [CrossRef]
  26. Tao, C.; Cao, J.; Wang, C.; Zhang, Z.; Gao, Z. Pseudo-mono for Monocular 3D Object Detection in Autonomous Driving. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 3962–3975. [Google Scholar] [CrossRef]
  27. Jiang, X.; Jin, S.; Lu, L.; Zhang, X.; Lu, S. Weakly Supervised Monocular 3D Detection with a Single-View Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 10508–10518. [Google Scholar] [CrossRef]
  28. Chen, W.; Zhao, J.; Zhao, W.-L.; Wu, S.-Y. Shape-Aware Monocular 3D Object Detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 6416–6424. [Google Scholar] [CrossRef]
  29. Huang, C.; He, T.; Ren, H.; Wang, W.; Lin, B.; Cai, D. OBMO: One Bounding Box Multiple Objects for Monocular 3D Object Detection. IEEE Trans. Image Process. 2023, 32, 6570–6581. [Google Scholar] [CrossRef] [PubMed]
  30. Li, Y.; Bao, H.; Ge, Z.; Yang, J.; Sun, J.; Li, Z. BEVStereo: Enhancing Depth Estimation in Multi-View 3D Object Detection with Temporal Stereo. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023; AAAI Press: Palo Alto, CA, USA, 2023; pp. 1486–1494. [Google Scholar] [CrossRef]
  31. Wang, B.; Zheng, H.; Zhang, L.; Liu, N.; Anwer, R.M.; Cholakkal, H.; Zhao, Y.; Li, Z. BEVRefiner: Improving 3D Object Detection in Bird’s-Eye View via Dual Refinement. IEEE Trans. Intell. Transp. Syst. 2024, 25, 15094–15105. [Google Scholar] [CrossRef]
  32. Li, Z.; Lan, S.; Alvarez, J.M.; Wu, Z. BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: New York, NY, USA, 2024; pp. 20113–20123. [Google Scholar] [CrossRef]
  33. Wang, Y.; Chao, W.-L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8445–8453. [Google Scholar] [CrossRef]
  34. Wang, Z.; Huang, Z.; Gao, Y.; Wang, N.; Liu, S. MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 1–15. [Google Scholar] [CrossRef]
  35. Guo, K.; Ling, Q. PromptDet: A Lightweight 3D Object Detection Framework with LiDAR Prompts. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025; AAAI Press: Palo Alto, CA, USA, 2025; pp. 3266–3274. [Google Scholar] [CrossRef]
  36. Liang, T.; Xie, H.; Yu, K.; Xia, Z.; Lin, Z.; Wang, Y.; Tang, T.; Wang, B.; Tang, Z. BEVFusion: A Simple and Robust LiDAR–Camera Fusion Framework. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 10421–10434. [Google Scholar]
  37. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 4604–4612. [Google Scholar] [CrossRef]
  38. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-View 3D Object Detection Network for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 1907–1915. [Google Scholar] [CrossRef]
  39. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
  40. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
  41. OD Team. OpenPCDet: An Open-Source Toolbox for 3D Object Detection from Point Clouds. Available online: https://github.com/open-mmlab/OpenPCDet (accessed on 28 September 2025).
  42. Zhang, C.; Wang, H.; Cai, Y.; Chen, L.; Li, Y. TransFusion: Multi-Modal Robust Fusion for 3D Object Detection in Foggy Weather Based on Spatial Vision Transformer. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10652–10666. [Google Scholar] [CrossRef]
  43. Yang, Z.; Chen, J.; Miao, Z.; Li, W.; Zhu, X.; Zhang, L. DeepInteraction: 3D Object Detection via Modality Interaction. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 1992–2005. [Google Scholar]
  44. Chen, Y.; Li, Y.; Zhang, X.; Sun, J.; Jia, J. Focal Sparse Convolutional Networks for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 5418–5427. [Google Scholar] [CrossRef]
  45. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: New York, NY, USA, 2023; pp. 21674–21683. [Google Scholar] [CrossRef]
  46. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-Based 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 11784–11793. [Google Scholar] [CrossRef]
  47. Chen, Y.; Yu, Z.; Chen, Y.; Lan, S.; Anandkumar, A.; Jia, J.; Alvarez, J.M. FocalFormer3D: Focusing on Hard Instance for 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 8394–8405. [Google Scholar] [CrossRef]
  48. Yin, T.; Zhou, X.; Krähenbühl, P. Multimodal Virtual Point 3D Detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2021), Virtual Event, 6–14 December 2021; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 16494–16507. [Google Scholar]
  49. Wang, C.; Ma, C.; Zhu, M.; Yang, X. PointAugmenting: Cross-Modal Augmentation for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 11794–11803. [Google Scholar] [CrossRef]
  50. Wang, H.; Tang, H.; Shi, S.; Li, A.; Li, Z.; Schiele, B.; Wang, L. UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 6792–6802. [Google Scholar] [CrossRef]
  51. Song, Z.; Wei, H.; Bai, L.; Yang, L.; Jia, C. GraphAlign: Enhancing Accurate Feature Alignment by Graph Matching for Multi-Modal 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 3358–3369. [Google Scholar] [CrossRef]
  52. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Yu, Q.; Dai, J. BEVFormer: Learning Bird’s-Eye-View Representation From LiDAR–Camera via Spatiotemporal Transformers. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2020–2036. [Google Scholar] [CrossRef] [PubMed]
  53. Wang, S.; Liu, Y.; Wang, T.; Li, Y.; Zhang, X. Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 3621–3631. [Google Scholar] [CrossRef]
  54. Park, J.; Xu, C.; Yang, S.; Keutzer, K.; Kitani, K.; Tomizuka, M.; Zhan, W. Time Will Tell: New Outlooks and a Baseline for Temporal Multi-View 3D Object Detection. arXiv 2022, arXiv:2210.02443. [Google Scholar] [CrossRef]
  55. Chen, X.; Zhang, T.; Wang, Y.; Wang, Y.; Zhao, H. Futr3D: A Unified Sensor Fusion Framework for 3D Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 172–181. [Google Scholar] [CrossRef]
  56. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. DeepFusion: LiDAR–Camera Deep Fusion for Multi-Modal 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 17182–17191. [Google Scholar] [CrossRef]
  57. Li, J.; Lu, M.; Liu, J.; Guo, Y.; Du, Y.; Du, L.; Zhang, S. BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for Multi-View BEV 3D Object Detection. IEEE Trans. Intell. Veh. 2023, 9, 2489–2498. [Google Scholar] [CrossRef]
  58. Xu, S.; Li, F.; Huang, P.; Song, Z.; Yang, Z.-X. TiGDistill-BEV: Multi-View BEV 3D Object Detection via Target Inner-Geometry Learning Distillation. IEEE Trans. Circuits Syst. Video Technol. 2025. [Google Scholar] [CrossRef]
  59. Shao, Z.; Wang, H.; Cai, Y.; Chen, L.; Li, Y. UA-Fusion: Uncertainty-Aware Multimodal Data Fusion Framework for 3-D Object Detection of Autonomous Vehicles. IEEE Trans. Instrum. Meas. 2025, 74, 1–16. [Google Scholar] [CrossRef]
  60. Yue, J.; Lin, Z.; Lin, X.; Zhou, X.; Li, X.; Qi, L.; Wang, Y.; Yang, M.-H. RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird’s Eye View for 3D Object Detection. arXiv 2025, arXiv:2502.13071. [Google Scholar]
Figure 1. Architectural overview of BEVAlign compared with prior multi-modal fusion methods.
Figure 2. Representative challenges for LiDAR–camera fusion.
Figure 3. Overall architecture of BEVAlign.
Figure 4. Illustration of key sub-modules within the LA module.
Figure 5. Framework of the GA module.
Figure 6. Qualitative comparison between BEVFusion and BEVAlign on the nuScenes val set.
Figure 7. Qualitative visualization of BEVAlign evaluated on nuScenes scenes.
Table 1. Quantitative results of representative approaches on the nuScenes test set.

| Method | Modality | mAP | NDS | Car | Bus | C.V. | Truck | T.L. | Ped. | M.T. | Bike | B.R. | T.C. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Focals Conv [44] | LiDAR | 63.8 | 70.0 | 86.7 | 67.7 | 23.8 | 56.3 | 59.5 | 87.5 | 64.5 | 36.3 | 74.1 | 81.4 |
| VoxelNeXt [45] | LiDAR | 64.5 | 70.0 | 84.6 | 64.7 | 28.7 | 53.0 | 55.8 | 85.8 | 73.2 | 45.7 | 74.6 | 79.0 |
| TransFusion-L [42] | LiDAR | 65.5 | 70.2 | 86.2 | 66.3 | 28.2 | 56.7 | 58.8 | 86.1 | 68.3 | 44.2 | 78.2 | 82.0 |
| CenterPoint [46] | LiDAR | 60.3 | 67.3 | 85.2 | 63.6 | 20.0 | 53.5 | 56.0 | 84.6 | 59.5 | 30.7 | 71.1 | 78.4 |
| FocalFormer3D [47] | LiDAR | 68.7 | 72.6 | 87.2 | 69.6 | 34.4 | 57.1 | 64.9 | 88.2 | 76.2 | 49.6 | 77.8 | 82.3 |
| MVP [48] | L + C | 66.4 | 70.5 | 86.8 | 67.4 | 26.1 | 58.5 | 57.3 | 89.1 | 70.0 | 49.3 | 74.8 | 85.0 |
| PointAugmenting [49] | L + C | 66.8 | 71.0 | 87.5 | 65.2 | 28.0 | 57.3 | 60.7 | 87.9 | 74.3 | 50.9 | 72.6 | 83.6 |
| BEVFusion-PKU [36] | L + C | 69.2 | 71.8 | 88.1 | 69.3 | 34.4 | 60.9 | 62.1 | 89.2 | 72.2 | 52.2 | 78.2 | 85.2 |
| TransFusion [42] | L + C | 68.9 | 71.7 | 87.1 | 68.3 | 33.1 | 60.0 | 60.8 | 88.4 | 73.6 | 52.9 | 78.1 | 86.7 |
| AutoAlignV2 [7] | L + C | 68.4 | 72.4 | 87.0 | 69.3 | 33.1 | 59.0 | 59.3 | 87.6 | 72.9 | 52.1 | - | - |
| UniTR [50] | L + C | 70.9 | 74.5 | 87.9 | 72.2 | 39.2 | 60.2 | 65.1 | 89.4 | 75.8 | 52.2 | 76.8 | 89.7 |
| DeepInteraction [43] | L + C | 70.8 | 73.4 | 87.9 | 70.8 | 37.5 | 60.2 | 63.8 | 90.3 | 75.4 | 54.5 | 80.4 | 87.0 |
| GraphAlign [51] | L + C | 66.5 | 70.6 | 87.6 | 66.2 | 26.1 | 57.7 | 57.8 | 87.2 | 72.5 | 49.0 | 74.1 | 86.3 |
| ObjectFusion [12] | L + C | 71.0 | 73.3 | 89.4 | 71.8 | 40.5 | 59.0 | 63.1 | 90.7 | 78.1 | 53.2 | 76.6 | 87.7 |
| BEVFusion [9] | L + C | 70.2 | 72.9 | 88.6 | 69.8 | 39.3 | 60.1 | 63.8 | 89.2 | 74.1 | 51.0 | 80.0 | 86.5 |
| BEVAlign (Ours) | L + C | 71.7 | 75.3 | 89.6 | 73.0 | 38.5 | 60.6 | 65.2 | 90.9 | 79.6 | 60.3 | 80.3 | 87.3 |
Table 2. Quantitative evaluation of representative state-of-the-art methods on the nuScenes val set.

| Method | Modality | Backbone | mAP | NDS | FPS |
|---|---|---|---|---|---|
| BEVFormer [52] | C | ResNet-101 | 41.6 | 51.7 | 3.0 |
| StreamPETR [53] | C | ResNet-50 | 43.2 | 54.0 | 6.4 |
| SOLOFusion [54] | C | ResNet-50 | 42.7 | 53.4 | 1.5 |
| BEVNeXT [32] | C | ResNet-50 | 53.5 | 62.2 | 4.4 |
| AutoAlignV2 [7] | L + C | ResNet-50 | 67.1 | 71.2 | - |
| TransFusion-LC [42] | L + C | ResNet-50 | 67.5 | 71.3 | 3.2 |
| FUTR3D [55] | L + C | ResNet-101 | 64.2 | 68.0 | 2.3 |
| CMT [3] | L + C | VoV-99 | 70.3 | 72.9 | 3.8 |
| DeepInteraction [43] | L + C | Swin-T | 69.9 | 72.6 | 2.6 |
| SparseFusion [4] | L + C | Swin-T | 71.0 | 73.1 | 5.3 |
| BEVFusion [9] | L + C | Swin-T | 68.5 | 71.4 | 4.2 |
| BEVAlign (Ours) | L + C | Swin-T | 71.2 | 73.5 | 3.8 |
Table 3. Ablation study of the number of graph neighbors K_graph in the LA module on the nuScenes val set.

| K_graph | mAP | NDS | FPS |
|---|---|---|---|
| 0 | 68.5 | 71.4 | 4.2 |
| 5 | 69.3 | 72.1 | 4.0 |
| 8 | 71.2 | 73.5 | 3.8 |
| 12 | 70.4 | 73.0 | 3.5 |
Table 4. Ablation studies of attention mechanisms.

| Method | Query | Key | Value | Attention | mAP | NDS |
|---|---|---|---|---|---|---|
| LearnableAlign | LiDAR | Image | Image | CA | 65.7 | 69.2 |
| DeformCAFA | LiDAR × Image | Image | Image | DA | 68.5 | 71.4 |
| BDCA | LiDAR & Image | Image & LiDAR | Image & LiDAR | DA × 2 | 71.2 | 73.5 |
Table 5. Ablation experiments of different components on the nuScenes val set. RT indicates run time.

| Method | LA | BDCA | CBR | mAP | NDS | RT |
|---|---|---|---|---|---|---|
| Baseline [9] | | | | 68.5 | 71.4 | 238 ms |
| (a) | ✓ | | | 69.7 (↑1.2) | 72.4 (↑1.0) | 247 ms |
| (b) | | ✓ | ✓ | 70.5 | 72.8 | 254 ms |
| (c) | ✓ | ✓ | | 71.0 | 73.2 | 260 ms |
| (d) | ✓ | ✓ | ✓ | 71.2 (↑2.7) | 73.5 (↑2.1) | 263 ms |