Figure 1.
Overview of BevDrive. In the scene encoding stage, a bidirectional BEV feature construction module extracts image and LiDAR features separately and transforms them into a unified BEV space using depth-guided image BEV feature construction and image-guided LiDAR BEV feature construction. A dual-attention mechanism integrates multi-modal BEV features at both local and global levels. In the motion decoding stage, a BEV-based decoder predicts future trajectory points and control commands by leveraging scene context in the fused BEV feature.
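Read end to end, the figure describes a two-stage pipeline: bidirectional BEV construction, dual-attention fusion, and BEV-based motion decoding. A minimal sketch of that composition, using hypothetical module interfaces (the individual stages are sketched under Figures 2 to 6; none of these names come from the paper):

```python
import torch.nn as nn

class BevDriveSketch(nn.Module):
    """Schematic two-stage flow: scene encoding, then motion decoding."""

    def __init__(self, image_branch, lidar_branch, fusion, decoder):
        super().__init__()
        self.image_branch = image_branch   # depth-guided image BEV construction
        self.lidar_branch = lidar_branch   # image-guided LiDAR BEV construction
        self.fusion = fusion               # dual-attention BEV feature fusion
        self.decoder = decoder             # BEV-based motion planning decoder

    def forward(self, images, lidar):
        img_bev = self.image_branch(images, lidar)     # image features lifted into BEV
        lidar_bev = self.lidar_branch(lidar, images)   # LiDAR features enhanced by images
        fused_bev = self.fusion(img_bev, lidar_bev)    # local + global attention fusion
        return self.decoder(fused_bev)                 # waypoints and control commands
```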
Figure 2.
The lifting and projection approach in depth-guided image BEV feature construction. The 2D image features are lifted into 3D space using depth estimation, where the discrete depth distribution is aligned with the image features.
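A minimal sketch of the lift step described here, in the spirit of Lift-Splat-style view transforms: a per-pixel categorical depth distribution weights the image features across depth bins before they are splatted onto the BEV grid. The tensor shapes, bin count, and the outer-product formulation are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DepthGuidedLift(nn.Module):
    """Lift 2D image features into a depth-weighted 3D frustum.

    Assumed shapes: image features (B, C, H, W); D discrete depth bins.
    """
    def __init__(self, in_channels: int, depth_bins: int):
        super().__init__()
        # Predict a categorical depth distribution per pixel.
        self.depth_head = nn.Conv2d(in_channels, depth_bins, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        depth_logits = self.depth_head(feats)                    # (B, D, H, W)
        depth_prob = depth_logits.softmax(dim=1)                 # discrete depth distribution
        # Outer product: each pixel's feature is spread across the depth bins.
        frustum = depth_prob.unsqueeze(1) * feats.unsqueeze(2)   # (B, C, D, H, W)
        return frustum  # subsequently projected onto the BEV grid via camera geometry


feats = torch.randn(2, 64, 32, 88)
frustum = DepthGuidedLift(64, depth_bins=48)(feats)
print(frustum.shape)  # torch.Size([2, 64, 48, 32, 88])
```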
Figure 3.
The process of generating a dense depth map. The sparse LiDAR point cloud is projected onto the image plane using intrinsic and extrinsic matrices to create a sparse depth map, which is then refined through spatial interpolation to produce the final dense depth map.
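A minimal sketch of the projection step, assuming calibrated extrinsics and intrinsics; the subsequent spatial interpolation that densifies the map is not shown. Variable names and the nearest-return tie-breaking are assumptions for illustration.

```python
import numpy as np

def sparse_depth_map(points, T_cam_lidar, K, height, width):
    """Project LiDAR points (N, 3) into the image to form a sparse depth map.

    `T_cam_lidar` is the 4x4 LiDAR-to-camera extrinsic matrix and `K` the 3x3
    camera intrinsic matrix; densification (spatial interpolation) follows separately.
    """
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # homogeneous (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                   # LiDAR -> camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                         # keep points in front of the camera
    uvz = (K @ pts_cam.T).T
    u = np.round(uvz[:, 0] / uvz[:, 2]).astype(int)
    v = np.round(uvz[:, 1] / uvz[:, 2]).astype(int)
    z = pts_cam[:, 2]
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.zeros((height, width), dtype=np.float32)
    # Keep the nearest return when several points fall on the same pixel.
    order = np.argsort(-z[valid])
    depth[v[valid][order], u[valid][order]] = z[valid][order]
    return depth
```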
Figure 4.
LiDAR BEV enhancement. This module first compresses the initial LiDAR BEV features into a bottleneck representation and then performs cross-attention between the LiDAR BEV queries and image feature key-value pairs. After attention-based feature aggregation, the module decodes the enhanced representation to produce refined LiDAR BEV features, enriching them with image semantics.
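A minimal sketch of the enhancement path described here: the LiDAR BEV map is spatially compressed, the compressed tokens cross-attend to flattened image features, and the result is decoded back and added residually. The channel width, bottleneck factor, and single attention layer are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LidarBevEnhancer(nn.Module):
    """Cross-attend compressed LiDAR BEV queries to image feature key-value pairs."""

    def __init__(self, dim: int = 128, heads: int = 4, stride: int = 4):
        super().__init__()
        # Bottleneck: spatially compress the BEV map before attention.
        self.encode = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Decode back to the original BEV resolution.
        self.decode = nn.ConvTranspose2d(dim, dim, kernel_size=stride, stride=stride)

    def forward(self, lidar_bev: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        b, c, h, w = lidar_bev.shape
        q = self.encode(lidar_bev)                     # (B, C, h', w') bottleneck
        hq, wq = q.shape[-2:]
        q = q.flatten(2).transpose(1, 2)               # (B, h'*w', C) LiDAR BEV queries
        kv = img_feats.flatten(2).transpose(1, 2)      # (B, H*W, C) image keys/values
        fused, _ = self.attn(q, kv, kv)                # image-guided aggregation
        fused = fused.transpose(1, 2).reshape(b, c, hq, wq)
        return lidar_bev + self.decode(fused)          # residual refinement


lidar_bev = torch.randn(1, 128, 64, 64)
img_feats = torch.randn(1, 128, 16, 44)
print(LidarBevEnhancer()(lidar_bev, img_feats).shape)  # torch.Size([1, 128, 64, 64])
```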
Figure 5.
Dual-attention BEV feature fusion. This module integrates image and LiDAR BEV features through dual attention mechanisms: window self-attention captures local spatial correlations, while global self-attention models long-range dependencies.
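A minimal sketch of the dual-attention idea: the two BEV maps are merged, window self-attention runs inside non-overlapping patches, and global self-attention runs over the whole grid. The concatenation-plus-projection input stage, window size, and the single layer of each attention type are assumptions for illustration (and this sketch requires the BEV size to be divisible by the window size).

```python
import torch
import torch.nn as nn

class DualAttentionFusion(nn.Module):
    """Fuse image and LiDAR BEV maps with windowed, then global, self-attention."""

    def __init__(self, dim: int = 128, heads: int = 4, window: int = 8):
        super().__init__()
        self.window = window
        self.proj = nn.Conv2d(2 * dim, dim, kernel_size=1)     # merge the two modalities
        self.win_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.glob_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        x = self.proj(torch.cat([img_bev, lidar_bev], dim=1))  # (B, C, H, W)
        b, c, h, w = x.shape
        s = self.window
        # Window self-attention: local correlations inside s x s patches.
        win = x.reshape(b, c, h // s, s, w // s, s)
        win = win.permute(0, 2, 4, 3, 5, 1).reshape(-1, s * s, c)
        win, _ = self.win_attn(win, win, win)
        win = win.reshape(b, h // s, w // s, s, s, c)
        win = win.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        x = x + win
        # Global self-attention: long-range dependencies over the whole BEV grid.
        g = x.flatten(2).transpose(1, 2)                       # (B, H*W, C)
        g, _ = self.glob_attn(g, g, g)
        return x + g.transpose(1, 2).reshape(b, c, h, w)


img_bev = torch.randn(1, 128, 64, 64)
lidar_bev = torch.randn(1, 128, 64, 64)
print(DualAttentionFusion()(img_bev, lidar_bev).shape)  # torch.Size([1, 128, 64, 64])
```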
Figure 6.
BEV-based motion planning decoder. In the BEV-based motion planning decoder, self-attention connects trajectory point queries with control command queries, while cross-attention enables interactions with fused BEV features to extract contextual information and iteratively refine the queries, enhancing their understanding and adaptability to complex scenes.
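A minimal sketch of the decoder described here, using a standard transformer decoder so that self-attention couples the trajectory-point and control queries while cross-attention lets them read the fused BEV tokens over several refinement layers. The number of waypoints, the control dimensionality, and the layer count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MotionPlanningDecoder(nn.Module):
    """Trajectory-point and control-command queries attend to the fused BEV map."""

    def __init__(self, dim: int = 128, n_points: int = 4, n_layers: int = 3):
        super().__init__()
        # Learned queries: one per future trajectory point, plus one for control.
        self.queries = nn.Parameter(torch.randn(n_points + 1, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.waypoint_head = nn.Linear(dim, 2)   # (x, y) per future step
        self.control_head = nn.Linear(dim, 3)    # e.g. steer, throttle, brake

    def forward(self, fused_bev: torch.Tensor):
        b, c, h, w = fused_bev.shape
        memory = fused_bev.flatten(2).transpose(1, 2)       # (B, H*W, C) BEV tokens
        tgt = self.queries.unsqueeze(0).expand(b, -1, -1)   # (B, N+1, C) joint queries
        out = self.decoder(tgt, memory)                     # self-attn + cross-attn refinement
        waypoints = self.waypoint_head(out[:, :-1])         # (B, N, 2)
        control = self.control_head(out[:, -1])             # (B, 3)
        return waypoints, control


fused_bev = torch.randn(2, 128, 64, 64)
wp, ctrl = MotionPlanningDecoder()(fused_bev)
print(wp.shape, ctrl.shape)  # torch.Size([2, 4, 2]) torch.Size([2, 3])
```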
Figure 7.
Qualitative results in typical driving scenarios. BevDrive stops and resumes at traffic lights, avoids collisions by detecting obstacles and braking, and navigates intersections with precise speed, throttle, and steering adjustments for smooth turns.
Figure 8.
Interpretability analysis of our BevDrive. (a) visualizes the BEV feature map focusing on traffic light changes, while (b) visualizes the BEV feature map highlighting the model yielding to a turning vehicle and resuming safely.
Figure 9.
Attention analysis of the BEV-based motion planning decoder: (a) sensor inputs of BevDrive; (b) attention heatmaps corresponding to different trajectory point queries. In (b), the red dot represents the target point, while the white dots represent the predicted points. “n” indicates the trajectory point query corresponding to the attention heatmap. In the heatmaps, red indicates higher attention, and blue indicates lower attention.
Figure 10.
Performance in representative driving scenarios.
Figure 11.
Performance in challenging driving conditions.
Figure 12.
Our autonomous vehicle and its onboard devices. (a) Our full-scale autonomous vehicle equipped with a LiDAR sensor, cameras, and an integrated navigation module. (b) The sensors mounted on the vehicle. (c) The industrial computer and power system installed in the vehicle.
Figure 13.
Statistics of our dataset. We categorized the driving scenarios in our dataset into three main types: Turn Right, Turn Left, and Keep Straight. For the Keep Straight category, we further divided the data into four subcategories: driving on straight roads, driving on curved roads, crossing intersections, and slowing down for collision avoidance.
Figure 14.
Visualization of trajectories in typical driving scenarios. BevDrive demonstrates its ability to generate collision-free trajectories in a variety of driving scenarios, including lane following (a), turning (b, f), collision avoidance (c) and traffic light reaction (d, e).
Table 1.
Performance on the Town05 Long benchmark. ↑ indicates that higher values are better, while ★ highlights the primary metric. For modalities, C represents camera images, and L denotes LiDAR point clouds. Expert refers to imitation learning based on the driving behavior of a privileged expert. Seg. stands for semantic segmentation, Map. corresponds to BEV map segmentation, Dep. represents depth estimation, and Det. refers to object detection. Optimal values are shown in bold.
Method | Modality | Supervision | Driving Score ↑ ★ | Route Completion ↑ | Infraction Score ↑
---|---|---|---|---|---
CILRS [19] | C | Expert | 7.8 ± 0.3 | 10.3 ± 0.0 | 0.75 ± 0.05
LBC [29] | C | Expert | 12.3 ± 2.0 | 31.9 ± 2.2 | 0.66 ± 0.02
Roach [32] | C | Expert | 41.6 ± 1.8 | **96.4 ± 2.1** | 0.43 ± 0.03
TCP [38] | C | Expert | 57.2 ± 1.5 | 80.4 ± 1.5 | 0.73 ± 0.02
CrossFuser [26] | C + L | Expert | 25.8 ± 1.7 | 71.8 ± 4.3 | -
Transfuser [23] | C + L | Expert, Dep., Seg., Map., Det. | 31.0 ± 3.6 | 47.5 ± 5.3 | **0.77 ± 0.04**
LAV [30] | C + L | Expert, Seg., Map., Det. | 46.5 ± 2.3 | 69.8 ± 2.3 | 0.73 ± 0.02
Interfuser [25] | C + L | Expert, Map., Det. | 51.6 ± 3.4 | 88.9 ± 2.5 | 0.59 ± 0.05
BevDrive (ours) | C + L | Expert, Dep. | **58.1 ± 1.3** | 89.3 ± 2.9 | 0.66 ± 0.03
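For orientation on how the three metrics relate, the CARLA leaderboard protocol underlying this benchmark defines the per-route driving score as the route completion weighted by the infraction score, with the reported numbers being averages over routes and evaluation seeds (so the identity holds only approximately at the aggregate level); this relation is stated here for reference, not taken from the paper:

$$\mathrm{DS}_i = \mathrm{RC}_i \times \mathrm{IS}_i, \qquad \mathrm{DS} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{RC}_i\,\mathrm{IS}_i .$$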
Table 2.
Performance comparison of different modules in the feature encoder. DIFC represents the depth-guided image BEV feature construction module, ILFC represents the image-guided LiDAR BEV feature construction module, and DF represents the dual-attention BEV feature fusion module.
ID | DIFC | ILFC | DF | Driving Score ↑ | Route Completion ↑ | Infraction Score ↑
---|---|---|---|---|---|---
0 | | | | 28.5 | 56.7 | 0.56
1 | ✓ | | | 45.2 | 75.3 | 0.56
2 | ✓ | ✓ | | 46.5 | 74.5 | 0.63
3 | ✓ | | ✓ | 49.6 | 77.3 | 0.66
4 | ✓ | ✓ | ✓ | 51.8 | 78.2 | 0.66
Table 3.
Ablation study of the depth-guided image BEV feature construction module. LP represents the lifting and projection module, and Dep represents the depth supervision.
ID | Module | Driving Score ↑ | Route Completion ↑ | Infraction Score ↑
---|---|---|---|---
5 | None | 28.5 | 56.7 | 0.56
6 | + LP | 36.3 | 62.8 | 0.64
7 | + LP + Dep | 45.2 | 75.3 | 0.56
Table 4.
Ablation study of the dual-attention BEV feature fusion module. WSA represents the window self-attention mechanism and GSA represents the global self-attention mechanism.
ID | Module | Driving Score ↑ | Route Completion ↑ | Infraction Score ↑
---|---|---|---|---
8 | None | 46.5 | 74.5 | 0.63
9 | + WSA | 49.2 | 75.7 | 0.67
10 | + WSA + GSA | 51.8 | 78.2 | 0.66
Table 5.
Ablation study of the motion decoder modules. HMQ denotes joint motion queries, AMP denotes the attention-based motion planning network, and OFS represents the output fusion strategy.
ID | AMP | HMQ | OFS | Driving Score ↑ | Route Completion ↑ | Infraction Score ↑
---|---|---|---|---|---|---
11 | | | | 51.8 | 78.2 | 0.66
12 | ✓ | | | 55.7 | 83.9 | 0.65
13 | ✓ | ✓ | | 56.2 | 85.6 | 0.67
14 | ✓ | ✓ | ✓ | 58.1 | 89.3 | 0.66
Table 6.
Ablation study of the parameters of the output fusion strategy. This table presents the impact of varying fusion weights on driving performance metrics.
ID | Weight | Driving Score ↑ | Route Completion ↑ | Infraction Score ↑
---|---|---|---|---
15 | 0.0 | 56.2 | 85.6 | 0.67
16 | 0.2 | 58.1 | 89.3 | 0.66
17 | 0.5 | 55.6 | 86.1 | 0.66
18 | 0.8 | 52.1 | 84.3 | 0.65
19 | 1.0 | 48.5 | 81.4 | 0.64
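One plausible reading of the weight, stated here as an assumption rather than the paper's definition (the table only reports the weight values), is a convex blend of the two decoder outputs, for instance the directly predicted control command and the control derived from following the predicted trajectory:

$$a_{\text{final}} = w\,a_{\text{ctrl}} + (1 - w)\,a_{\text{traj}} .$$

Under this reading, w = 0.0 and w = 1.0 correspond to using a single branch alone; consistently, ID 15 (w = 0.0) matches ID 13 in Table 5, where the output fusion strategy is disabled.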
Table 7.
Performance comparison of different methods on the NAVSIM benchmark. NC denotes no at-fault collisions, DAC drivable area compliance, TTC time-to-collision, Comf. comfort, EP ego progress, and PDMS the overall PDM score; ↑ indicates that higher values are better.
Method | NC ↑ | DAC ↑ | TTC ↑ | Comf. ↑ | EP ↑ | PDMS ↑
---|---|---|---|---|---|---
ConsVelo [47] | 68.0 | 57.8 | 50.0 | 100 | 19.4 | 20.6
EgoStatMLP [47] | 93.0 | 77.3 | 83.6 | 100 | 62.8 | 65.6
VADv2 [48] | 97.2 | 89.1 | 91.6 | 100 | 76.0 | 80.9
DrivingGPT [49] | 98.9 | 90.7 | 94.9 | 95.6 | 79.7 | 82.4
BevDrive | 97.7 | 92.5 | 92.9 | 100 | 78.7 | 83.8
Table 8.
L2 error comparison across methods at different time horizons.
Method | L2 @ 0.5 s (m) | L2 @ 1 s (m) | L2 @ 1.5 s (m) | L2 @ 2 s (m) | Avg. L2 (m)
---|---|---|---|---|---
Late Fusion | 0.06 | 0.09 | 0.16 | 0.31 | 0.16
Transfuser | 0.05 | 0.07 | 0.11 | 0.23 | 0.12
BevDrive | 0.04 | 0.05 | 0.08 | 0.13 | 0.08
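For reference, under the standard definition the per-horizon value is the mean Euclidean distance (in meters) between the predicted and ground-truth ego positions at that horizon, and the last column averages the four horizons (e.g. (0.04 + 0.05 + 0.08 + 0.13)/4 ≈ 0.08 for BevDrive):

$$\mathrm{L2}(t) = \frac{1}{M}\sum_{m=1}^{M}\bigl\lVert \hat{\mathbf{p}}_m(t) - \mathbf{p}_m^{\mathrm{gt}}(t) \bigr\rVert_2 .$$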