Article

BEV-CAM3D: A Unified Bird’s-Eye View Architecture for Autonomous Driving with Monocular Cameras and 3D Point Clouds

by Daniel Ayo Oladele 1,*,†, Elisha Didam Markus 1 and Adnan M. Abu-Mahfouz 2
1 Department of Electrical, Electronic and Computer Engineering, Central University of Technology, Bloemfontein 9301, South Africa
2 Emerging Digital Technologies for the Fourth Industrial Revolution (EDT4IR) Research Centre, Council for Scientific and Industrial Research (CSIR), Pretoria 0184, South Africa
* Author to whom correspondence should be addressed.
† Current address: Department of Electrical, Electronic and Computer Engineering, Central University of Technology, Bloemfontein 9301, South Africa.
Submission received: 27 March 2025 / Revised: 10 April 2025 / Accepted: 15 April 2025 / Published: 18 April 2025
(This article belongs to the Section AI in Autonomous Systems)

Abstract

Three-dimensional (3D) visual perception is pivotal for understanding surrounding environments in applications such as autonomous driving and mobile robotics. While LiDAR-based models dominate due to accurate depth sensing, their cost and sparse outputs have driven interest in camera-based systems. However, challenges like cross-domain degradation and depth estimation inaccuracies persist. This paper introduces BEVCAM3D, a unified bird’s-eye view (BEV) architecture that fuses monocular cameras and LiDAR point clouds to overcome single-sensor limitations. BEVCAM3D integrates a deformable cross-modality attention module for feature alignment and a fast ground segmentation algorithm to reduce computational overhead by 40%. Evaluated on the nuScenes dataset, BEVCAM3D achieves state-of-the-art performance, with a 73.9% mAP and a 76.2% NDS, outperforming existing LiDAR-camera fusion methods like SparseFusion (72.0% mAP) and IS-Fusion (73.0% mAP). Notably, it excels in detecting pedestrians (91.0% AP) and traffic cones (89.9% AP), addressing the class imbalance in autonomous driving scenarios. The framework supports real-time inference at 11.2 FPS with an EfficientDet-B3 backbone and demonstrates robustness under low-light conditions (62.3% nighttime mAP).

1. Introduction

Three-dimensional (3D) visual perception aims to sense and understand surrounding environments in a 3D space, playing a crucial role in various applications such as mobile robotics, autonomous driving, and virtual reality. Its purpose is to utilize data obtained from a series of sensors, such as light detection and ranging (LiDAR), radio detection and ranging (RADAR), and cameras, to derive a comprehensive understanding of driving scenes, which is essential for subsequent planning and decision-making.
LiDAR-based models have dominated the field of 3D perception in the past due to the accurate depth and 3D information acquired from point cloud data. However, achieving an efficient LiDAR-based system with a reasonable resolution is very costly, and LiDAR sensor point clouds tend to be sparse. In contrast, camera-based systems are more affordable and provide an excellent resolution, which is critical for meaningful semantic representations of the environment. Nevertheless, these systems highly depend on the light intensity to effectively capture information about the environment that is useful for robotic devices [1,2,3,4].
Camera-only 3D perception in a bird’s-eye view (BEV) has garnered increasing attention in recent years. This interest is due to its advantages in providing a comprehensive 3D understanding, rich semantic information, high computational efficiency, and low deployment costs [5,6,7,8]. However, camera-only BEV models trained on a source domain often experience pronounced performance degradation when applied to a target domain, a phenomenon attributed to clear cross-domain discrepancies [9,10,11].
BEV feature representation primarily relies on geometric cues, which presents an ill-posed problem for camera-only 3D perception given the limited accuracy associated with depth estimation based on camera-only features. Even with the introduction of attention-based mechanisms, such as vision-based transformers, challenges remain in dense spatial prediction, particularly in estimating structural information in the height dimension [12,13,14,15]. Consequently, many researchers have emphasized the necessity of incorporating a depth sensor for guided depth supervision [10,16]. An optimal combination of both sensors can overcome the limitations of a single sensor and provide a more comprehensive understanding of the environment for perception.
However, the combination of features extracted from different sensors introduces a distinct set of complexities. For instance, accurately aligning coherent features from the sparse point clouds of a LiDAR sensor with the dense image features of a camera, or addressing geometric distortion errors resulting from the alignment of geometric cues from these sensors, can be non-trivial. This accurate alignment is crucial for the successful navigation of autonomous vehicles, as autonomous navigation is inherently a geometric problem [10,14].
Methods such as 2D-to-3D voxelization based on depth [10,13], inverse perspective mapping (IPM) [7,17,18], and network-based view transformation techniques [19,20] are commonly employed in conjunction with feature concatenation and projection for multi-modal BEV fusion [2,4]. Each of these techniques presents unique drawbacks, including issues related to accuracy, computational cost, the loss of fine-grained details, the reliance on efficient query generation, the management of errors from explicit geometric relationships, and careful design considerations [2,4,10,11,21,22]. Consequently, ensuring the unity and efficiency of training within a multi-task 3D perception framework poses significant challenges.
The existing approach for transforming 2D features into a 3D space involves predicting a grid-wise depth distribution over the 2D features and subsequently lifting these features into a 3D voxel space [13]. While this method has demonstrated effectiveness, ongoing research has proposed several improvements, including depth-assisted supervision and careful engineering [16,22].
The pure network approach to view transformation, which involves the implicit representation of camera projection relationships, poses challenges, as its learning process relies solely on ground truth labels, which are typically obtained from the depth sensor measurement of the 3D space [2,4,20]. This has an impact on its generalizability, as the inference of new scenes relies solely on the learned parameters of the model.
One promising avenue for efficient view transformation is the use of geometric priors derived from the camera sensor’s intrinsic properties, particularly through IPM, to formulate the projection from the camera’s 3D space to the BEV 2D space. IPM provides an efficient baseline for view transformation, especially when combined with learnable networks [17,18]. This approach to obtaining the 3D representation of image features, combined with the occupancy grid map representation approach, widely used in robotics [7,21,23,24,25], forms the basis of the BEVCAM3D architecture.
BEVCAM3D, a unified bird’s-eye view (BEV) architecture for multiple monocular cameras (CAMs) and three-dimensional (3D) point clouds, applies fast ground segmentation [26] as a pre-BEV step to improve feature extraction, and it then converts the point cloud into a BEV representation using a modified and improved version of the MV3D [27] approach. The contributions of this study include the following:
(i) A BEV fusion framework, BEVCAM3D, with better performance, efficient feature alignment, and the fusion of camera and LiDAR sensor modalities across a perspective view and a BEV.
(ii) A detailed and concise review of the BEV and BEV multi-modal sensor fusion paradigm.
(iii) A BEV representation for point clouds that includes the fast ground segmentation of point clouds to improve feature extraction.
(iv) A cross-modality attention module with cross-modality fusion loss for the efficient fusion and alignment of camera and LiDAR features.
BEVCAM3D is evaluated against existing state-of-the-art models on the nuScenes dataset [28] to validate its performance.

2. Literature Review

2.1. Monocular Camera BEV Representation

View transformation requires a sensor’s understanding of the 3D representation of a scene, which involves accurate spatial knowledge and geometric awareness of the captured environment [3]. Over the years, researchers have proposed numerous object detection techniques for monocular cameras to achieve this objective. These techniques include monocular depth estimation, keypoint learning, and deep learning approaches [29,30,31]. Deep learning methods such as shape reconstruction using centroid proposals for more accurate 2D-to-3D bounding box regression [32], leveraging geometric constraints like the camera projection matrix for the better regression of 3D bounding boxes from 2D images [27,33,34,35], and employing geometric reasoning networks with sparse pixel-level supervision [36] have shown significant progress.
Notably, incorporating geometric priors or constraints alongside deep learning techniques has proven effective in regressing 2D localized objects into 3D bounding boxes [33,34]. For instance, [34] introduced an approach extending existing regression solutions for 3D object detection by incorporating a hybrid discrete-continuous loss for orientation estimation. This architecture estimates both the confidence probability $c_i$ that the output angle lies within the $i$-th bin of a discretized set of orientation angles and the residual rotation correction $(\sin(\Delta\theta_i), \cos(\Delta\theta_i))$. The estimated residual rotation is applied to the orientation of the center ray of the estimated confidence to obtain the output angle. Additionally, a geometric constraint on translation is imposed by the detected 2D bounding box, which assumes that the 3D bounding box is confined within the 2D bounding box. Results demonstrated that this consistent use of geometric constraints with deep learning outperformed more complex and computationally expensive approaches leveraging semantic segmentation, instance-level segmentation, and flat-ground priors.
The Mono3D study [27] and OFT [35] take a distinct approach to geometric constraints for 2D-to-3D object detection. Many earlier studies tackled this problem by detecting 2D bounding boxes for objects in the image and either directly regressing 3D pose parameters for each region [32,33,34,37,38] or fitting 3D templates to the image [32,39,40]. However, Mono3D [27] and OFT [35] address the problem using the IPM (Inverse Perspective Mapping) concept, which assumes that the ground plane is orthogonal to the image plane and at a known distance below the camera (from calibration). This assumption serves as the baseline for generating object 3D proposals.
For a point in the image plane, the IPM concept can be expressed as
$$H = K[R|t] = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} & R_{13} & T_x \\ R_{21} & R_{22} & R_{23} & T_y \\ R_{31} & R_{32} & R_{33} & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}$$
$$P_{dst} = H^{-1} P_{src}$$
This involves using a homography matrix $H$ that relates the coordinates $(x, y)$ of a point $P_{src}$ in the image plane to the coordinates $(X, Y)$ of the same point $P_{dst}$ in the BEV plane. The matrix $H$ can be computed from the camera intrinsic matrix $K$ and the ground plane equation $M$, or it can be learned from data using a deep neural network.
The homography projection matrix $H$ can be expressed as $K[R|t]$, where $K$ is the intrinsic matrix of the camera, and $[R|t]$ is the extrinsic matrix (rotation matrix $R$ and translation vector $t$) of the camera. The intrinsic matrix $K$ consists of the focal lengths along x and y $(f_x, f_y)$, the pixel skew $s$, and the principal point $(c_x, c_y)$. The homography matrix $H$ is expressed as shown above.
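As a concrete illustration of this projection, the sketch below constructs a ground-plane homography from assumed intrinsics and extrinsics and warps a front-camera image into a BEV image with OpenCV. All numeric values (focal lengths, camera height, pitch, output resolution) and the input file name are placeholder assumptions for illustration, not parameters from this paper.

```python
import numpy as np
import cv2

# Assumed pinhole intrinsics (placeholder values, not from the paper).
K = np.array([[800.0,   0.0, 640.0],
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])

# Assumed extrinsics: camera roughly 1.5 m above a flat ground plane, pitched slightly down.
pitch = np.deg2rad(10.0)
R = np.array([[1.0, 0.0,            0.0],
              [0.0, np.cos(pitch), -np.sin(pitch)],
              [0.0, np.sin(pitch),  np.cos(pitch)]])
t = np.array([[0.0], [1.5], [0.0]])

# For points on the ground plane (Y = 0 in world coordinates), the projection
# reduces to a 3x3 homography built from the 1st and 3rd columns of R and t.
H = K @ np.hstack([R[:, [0]], R[:, [2]], t])

# Map ground coordinates (X, Z) to a BEV image: 0.1 m per pixel, 40 m deep.
scale, bev_size = 10.0, 400
M_bev = np.array([[scale, 0.0,  bev_size / 2.0],  # X -> BEV column
                  [0.0,  -scale, bev_size],       # Z -> BEV row (near range at the bottom)
                  [0.0,   0.0,   1.0]])

# Image pixel -> ground plane -> BEV pixel.
H_img_to_bev = M_bev @ np.linalg.inv(H)

image = cv2.imread("front_camera.png")            # hypothetical input frame
bev = cv2.warpPerspective(image, H_img_to_bev, (bev_size, bev_size))
cv2.imwrite("bev_ipm.png", bev)
```

In practice, learnable networks are layered on top of such a warp, as noted above, to compensate for the flat-ground assumption behind the homography.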
With an understanding of the 3D scene, obtaining a bird’s-eye view (BEV) representation becomes more straightforward. In recent years, the BEV has gained prominence in the literature as the preferred perspective for autonomous driving tasks. By transforming a perspective view into a top-down view, a BEV provides an intuitive and holistic view of the environment, facilitating the unification of multi-modal sensors [2,4,41,42]. This representation eliminates perspective distortion, preserves distances, and offers a fusion-friendly format for multi-modal sensors, aiding in 3D object representation, path planning, and control. Consequently, it is widely used for tasks such as 3D detection, map segmentation, and motion prediction, serving as a foundation for building higher-level perception and decision-making algorithms [42,43,44,45].
One of the earlier approaches introduced for pixel-wise depth distribution prediction is the Lift–Splat–Shoot (LSS) method by Philion and Fidler [13]. In this method, a depth distribution is predicted from the extracted 2D features for each pixel with coordinates $(h, w)$ of an image $X_{2D} \in \mathbb{R}^{3 \times H \times W}$. The camera parameters $[E|I]$ are used to model an arbitrary camera rig, and the intrinsic parameters $I$ are used to lift the features into a frustum of features in 3D space based on the estimated depth. Although this was a significant breakthrough for the monocular BEV representation paradigm, its accuracy was more than 30% lower than that of the oracle model using LiDAR depth. This limitation was noted, and subsequent works [16,46,47,48] used the ground truth to supervise the depth estimation, showing a significant improvement in accuracy. CaDDN [46] and others [49,50] extend LSS by collapsing the voxel features $V \in \mathbb{R}^{X \times Y \times Z \times C}$ to a single height plane to generate BEV features $B \in \mathbb{R}^{X \times Y \times C}$.
BEVDet [51] and its temporal version BEVDet4D [47] focus on optimizing existing study results primarily through careful data augmentation and an upgraded Non-Maximum Suppression (NMS) strategy, as well as leveraging the LSS, PointPillars [52], and CenterPoint [53] algorithms to strike a balance between accuracy and inference speed. BEVDepth [16] introduces a depth refinement module for explicit depth supervision to counter the imprecise feature unprojection effect of LSS, and BEVNeXt [48] introduces a dense BEV framework that includes a Conditional Random Fields (CRF)-modulated depth estimation module, a long-term temporal aggregation module, and a two-stage object decoder. Results show an improvement over existing approaches [12,15,51,54]. Although these approaches have proven successful, they struggle in the fusion of multi-camera features due to the poor global attention of convolutional networks [55,56]. To address this shortcoming, some studies propose the use of future prediction for better spatial–temporal synchronization, such as the shoot module in LSS [13], FIERY [14], BEVDet4D [47], and BEVNeXt [48]. However, this comes at a high computational cost.
With the attention mechanism offered by transformer architectures [57], some researchers have proposed the use of cross-attention modules for improving multi-camera view alignments and fusion into BEV feature maps [4,58,59,60]. FB-BEV [6] is a network that combines the use of LSS for forward projection and BEVFormer [15] 3D-to-2D view transformer modules for back-projection to improve the geometric and spatial reasoning of the model across multi-camera features.
To project 3D features to a 2D space, the camera parameters are used to generate a 3D voxel feature grid. First introduced by OFT [35], a uniform grid of voxel features, $g(x, y, z)$, is built by aggregating accumulated image-based features and then collapsing them along the vertical dimension, via a top-down network, to yield orthographic ground plane features, $h(x, z)$, also known as the BEV feature plane. This IPM-based approach to view transformation aims to remove perspective distortion; however, its major drawback lies in the assumption that all points lie on a ground plane orthogonal to the image plane, which distorts 3D objects that extend above the ground, as shown in Figure 1 and Figure 2.
To address 3D object distortion, researchers have explored various approaches, including the use of Multi-Layer Perceptrons (MLPs) to correct IPM distortion. Ref. [61] employed a synthetic vehicle dataset to train a model that corrects IPM distortion on vehicles. Cam2BEV [62] leverages homography on semantically segmented images and trains distorted images using a spatial transformer unit [63]. Ref. [64] proposed Bi-Mapper, a network that processes perspective views as a global view stream and IPM images as a local view stream, utilizing MLPs to fuse these streams for improved BEV feature mapping. Additionally, Generative Adversarial Networks (GANs) [65] have been applied to refine top-down network predictions [66,67].
Cross-attention mechanisms [19,68] have been proposed to address perspective distortion by aligning features across inputs. BEVFormer [15] introduces spatial cross-attention (SCA) for multi-view spatial alignment with the BEV plane and temporal self-attention (TSA) for more efficient temporal feature stacking. To manage the computational expense of transformers, BEVFormer adopts deformable attention [69] for both SCA and TSA. Other networks incorporate transformer-based cross-attention, self-attention, and temporal attention mechanisms to enhance multi-view alignment.
Ref. [25] extended the top-down network of OFT by introducing a semantic occupancy prediction head. This approach trains deep CNN-based inverse sensor models, $p(m_i^c \mid z_t) = f_\theta(z_t, x_i)$, to predict occupancy probabilities from a single monocular input image and utilizes Bayesian filtering for multi-observation integration. Ref. [11] further enhanced this by incorporating a semantic segmentation head and an instance segmentation head into the top-down network, performing panoptic fusion for panoptic tasks. Both approaches employ BiFPN [70] at the feature extraction level to enhance small-object detection.
Some networks focus on learning perspective transformations for implicit representations of camera projection relationships. PETR [71] adapts the 2D positional embeddings of DETR [72] to 3D positional embeddings. NEAT [73] utilizes an MLP attention map for implicit BEV feature representation. PYVA [74] introduces a GAN-based architecture with a cross-view transformer and a cycled view projection module, iteratively projecting front-view images to a BEV and back until the discrepancy is minimized. Despite diverse methodologies, the success of monocular BEV representations is often attributed to effective depth supervision [2,4].

2.2. Point Cloud BEV Representation

Point clouds, consisting of 3D coordinates representing points in space, provide accurate spatial information and are relatively straightforward to transform into a BEV representation. However, raw point clouds face challenges such as varying point densities over distance, an unstructured nature, a large size, sensitivity to reflective surfaces causing incomplete or noisy data, and sparseness compared to camera sensors. These issues necessitate preprocessing techniques for effective use [2,42,75,76].
The preprocessing pipeline generally includes structuring unstructured point clouds and feature extraction [2,3,77,78,79]. Traditional structuring techniques, such as kd-tree [80] and octree [81], involve costly search processes. Recent methods include voxelization [82], polar grid mapping, elevation map representation [83], and BEV mapping techniques like MV3D [27]. Once structured, relevant features are extracted using segmentation, clustering, or deep learning techniques. PointNet [84,85] performs end-to-end segmentation directly on raw point clouds, bypassing the structuring step.
The volumetric or voxel method structures point clouds into a 3D grid with a fixed voxel geometry and assigns values to each voxel based on the contained points [86]. However, this method faces challenges with non-orthogonal surfaces, leading to gaps between voxels.
Feature extraction techniques include traditional ground filtering methods [26,87,88] and deep learning approaches such as 3D convolutional networks and MLPs [24,89,90,91]. VoxelNet [82] employs random sampling, voxel feature encoding (VFE), and 3D convolution networks. SECOND [92] introduces sparse 3D convolution for faster training and inference. Subsequent studies refined feature extraction methods [93,94,95]. However, deep learning methods require additional branches to transform features into a BEV, increasing computational cost.
Traditional ground filtering is an efficient method for structuring and segmenting point clouds for downstream applications [26,87,88]. For example, the method in [26] achieved a recall score of 99.01 and an inference speed below 19 ms on a single-core low-end CPU.
After feature extraction, 3D-to-2D BEV projection, also known as RGB map representation, encodes height, intensity, and density into an RGB BEV image [27,96]. This approach is preferred for its memory and computational efficiency [97,98,99]. With point clouds represented in BEV images, features can be extracted using deep learning models and trained like conventional images. BEVCAM3D utilizes this format for preprocessing point clouds.

2.3. Multi-Modal BEV Fusion

Earlier approaches, such as MV3D [27] and AVOD [100], utilize multi-view proposal-wise features from different view maps. MV3D [27] employs the LiDAR front view, the LiDAR BEV, and images to generate 3D proposals from the LiDAR BEV. Proposal-specific features are aggregated via a region of interest (ROI) pooling layer applied to the feature maps of each view. The output layers provide classification results and refined vertices of the regressed 3D bounding boxes. While the deep fusion of aggregated features offers strong performance, the inference speed is slow. AVOD [100] improves this by merging the LiDAR BEV and RGB images for region proposals. Region proposals with the highest scores are projected onto feature maps. Frustum PointNet [101] leverages monocular images for proposal generation and classifies them using point clouds, with feature extraction performed by the PointNet architecture [84,85].
Recent advancements in monocular BEV representation have inspired the fusion of point clouds and cameras in BEV. BEVFusion [10] addresses the limitations of proposal-level fusion by fusing multi-view monocular BEV feature representations with point cloud BEV representations. This approach precomputes the 3D coordinates and BEV grid indices of each point to replace BEV pooling from LSS [13]. Point cloud BEV 2D images are extracted alongside image features, and a fully convolution-based BEV encoder is used to fuse concatenated features, addressing BEV feature misalignment. Similarly, UVTR [102] utilizes LSS [13] to extract camera BEV features and voxelization [40] for point cloud representation in a BEV. UVTR introduces a knowledge transfer module to enhance the geometric awareness of camera BEV features, using LiDAR geometric feature maps and a cross-attention module to improve multi-modal BEV feature alignment.
Cross-modal feature fusion was explored by [103,104]. Ref. [103] introduced cross-modal and cross-view learning modules to enhance multi-modal fusion and feature alignment. Multiple losses (the feature correlation loss $L_{XFA}$, focal cross-entropy losses $L_{PV2BEV}$, and supervised perspective view losses $L_{PV}$) were implemented to encourage correlation and alignment. Ref. [104] introduced a framework that integrates multi-modal sensor data, utilizing a Cross-Modal Interaction Transform (CIT) module for intra- and inter-modality information fusion, alongside a Dual Dynamic Fusion (DDF) module for adaptive information selection. Both approaches recorded an improved performance over existing models; however, this comes at an increased computational cost due to the additional modules and losses.
Ref. [105] applied deformable attention to transform camera features to a BEV, voxelized LiDAR features to a BEV, and fused both modalities using a channel-normalized weight (CNW) module to produce a fused BEV feature map, $F_{BEV}$. This method, inspired by Deformable DETR [54], demonstrated effectiveness but required significantly more training epochs for convergence. BEVGuide [106] introduces a sensor fusion block called the sensor-agnostic attention block, which queries BEV embeddings and learns the BEV representation from sensor-specific features. A geometry-aware positional embedding is incorporated to establish soft geometric correspondence between BEV query positions and feature map positions.
Despite advancements in BEV fusion, existing methods struggle with cross-modality misalignment, high computational costs, and reliance on dense LiDAR data. BEVCAM3D addresses these gaps through (1) a deformable cross-attention module for efficient feature alignment, (2) lightweight ground segmentation to reduce LiDAR processing costs, and (3) unified fusion that leverages both sparse LiDAR and dense camera data without sacrificing real-time performance.

3. Methodology

This section explains the proposed BEVCAM3D fusion network. To begin, the problem is formulated in Section 3.1, Section 3.2 presents an overview of BEVCAM3D, and Section 3.3, Section 3.4 and Section 3.5 explain each module that makes up the BEVCAM3D architecture. This includes a brief overview of the ground segmentation concept; the feature extractors for the sensor modalities; the encoder baseline used, which is the dense BEV transformation module; the BEVCAM3D fusion network; and the detection head.

3.1. Preliminaries

Consider a set of inputs $X_m$, $m = 1, 2, \ldots, M$, obtained from $M$ sensor modalities. The goal of the BEVCAM3D framework is to take in these sensor inputs and effectively predict the 3D objects of a set of classes $C = \{C_0, C_1, C_2, \ldots, C_i\}$ in a BEV perspective with a specified resolution $H_{BEV} \times W_{BEV}$.

3.2. BEVCAM3D Overview

As illustrated in Figure 3, BEVCAM3D integrates two parallel pipelines: the camera branch and the LiDAR branch. In the camera branch, multi-view images $\{I_i\}_{i=1}^{N}$ undergo data augmentation (e.g., rotation and flipping) and are transformed into BEV features $F_{\mathrm{CAM}}^{\mathrm{BEV}} \in \mathbb{R}^{H \times W \times C}$ using the PyrOccNet backbone [25]. This involves perspective-to-BEV projection via geometric priors, as expressed in (3):
$$F_{\mathrm{CAM}}^{\mathrm{BEV}} = T_{\mathrm{PyrOccNet}}\left(\{I_i\}, K, R, t\right),$$
where $K$, $R$, and $t$ denote the camera intrinsics, rotation, and translation.
In the LiDAR branch, the raw LiDAR point clouds $P \in \mathbb{R}^{M \times 3}$ are preprocessed using fast ground segmentation [26] to isolate non-ground points $P_{\mathrm{non\text{-}ground}} \subset P$. These are projected into a BEV RGB map representation $z_{r,g,b}$, as shown in Equation (10), with a resolution of $\delta = 0.08$ m, where each pixel encodes height/intensity/density, as presented in Section 3.3. The features of this RGB map representation are extracted using ResNet-101 [107] to obtain the LiDAR BEV representation $F_{\mathrm{LiDAR}}^{\mathrm{BEV}}$. Both pipelines are fed into the feature fusion network (FFNet) module.
The FFNet module first concatenates the camera and LiDAR BEV features along the channel dimension to form a unified representation $F_{\mathrm{concat}} \in \mathbb{R}^{H \times W \times 2C}$, and then it applies deformable cross-attention:
$$F_{\mathrm{concat}} = \mathrm{Concat}\left(F_{\mathrm{CAM}}^{\mathrm{BEV}}, F_{\mathrm{LiDAR}}^{\mathrm{BEV}}\right),$$
$$F_{\mathrm{fused}} = \mathrm{DeformAttn}\left(Q, P, F_{\mathrm{concat}}\right),$$
where $Q \in \mathbb{R}^{N \times C}$ are learnable BEV queries, $P \in \mathbb{R}^{H \times W \times 2}$ denotes the normalized BEV grid coordinates, and $\mathrm{DeformAttn}$ samples features from $F_{\mathrm{concat}}$ at adaptively learned offsets.
Finally, using the fused features, $F_{\mathrm{fused}}$, a lightweight detection head predicts the 3D bounding boxes $B_{3D} = \{c_x, c_y, c_z, w, l, h, \theta\}$.
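A minimal PyTorch sketch of this two-branch flow, corresponding to Equations (3)–(5), is given below. The sub-modules are passed in as black boxes and are illustrative stand-ins for the components described above, not the authors' implementation; the internals of the deformable cross-attention fusion are sketched separately in Section 3.5.

```python
import torch
import torch.nn as nn

class BEVCAM3DSketch(nn.Module):
    """Schematic two-branch BEV fusion pipeline (illustrative only)."""

    def __init__(self, cam_bev_encoder, lidar_bev_encoder, ffnet, det_head):
        super().__init__()
        self.cam_bev_encoder = cam_bev_encoder      # images -> F_CAM^BEV  (B, C, H, W)
        self.lidar_bev_encoder = lidar_bev_encoder  # BEV RGB map -> F_LiDAR^BEV (B, C, H, W)
        self.ffnet = ffnet                          # deformable cross-attention fusion
        self.det_head = det_head                    # lightweight 3D box head

    def forward(self, images, intrinsics, extrinsics, lidar_bev_map):
        f_cam = self.cam_bev_encoder(images, intrinsics, extrinsics)   # Eq. (3)
        f_lidar = self.lidar_bev_encoder(lidar_bev_map)
        f_concat = torch.cat([f_cam, f_lidar], dim=1)                  # Eq. (4)
        f_fused = self.ffnet(f_concat)                                 # Eq. (5)
        return self.det_head(f_fused)                                  # boxes (cx, cy, cz, w, l, h, theta)
```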

3.3. Point Cloud BEV Representation

To begin, the 3D object points are projected onto a 2D BEV RGB image map using a variant of the MV3D [27] and RGB map techniques. The RGB map is encoded by height, intensity, and density. The modified size of the grid map is defined with $n = 512$ and $m = 512$ to suit the preferred feature extractor network ResNet-101 [107]. These 3D point clouds are then projected and discretized into a 2D grid with a resolution of 0.08 m. The RGB feature channels $(z_r, z_g, z_b)$, with $z_{r,g,b} \in \mathbb{R}^{m \times n}$, are calculated for the input point cloud $P \in \mathbb{R}^3$ of the captured environment. This can be expressed as
$$P = \begin{bmatrix} x & y & z \end{bmatrix}^T \quad \text{for } x, y \in [-70\ \mathrm{m}, 70\ \mathrm{m}],\ z \in [-2.73\ \mathrm{m}, 3.27\ \mathrm{m}]$$
x and y are limited to a 70 m radius, matching the range used by the ground segmentation algorithm [26], and the z-axis is set to a total height of 6 m to account for most objects in the environment, including common buildings; points on the z-axis above this range are capped at its upper bound.
Therefore, the equivalent pixel values for the RGB map image are calculated as follows:
$$z_g = \max\left(P_{i \to j} \cdot [0, 0, 1]^T\right)$$
$$z_b = \max\left(I\left(P_{i \to j}\right)\right)$$
$$z_r = \min\left(1.0, \frac{\log(N + 1)}{\log(64)}\right) \quad \text{for } N = |P_{i \to j}|$$
where:
  • $P_{i \to j}$ is the mapping function of points with index $i$ to its corresponding cell value $j$.
  • $I(P_{i \to j})$ is the intensity of points in the given point cloud.
In addition to the existing approaches [27,96], the derived $z_r$, $z_g$, $z_b$ are multiplied by 255 and clamped to the unsigned integer (uint8) range (i.e., 0 to 255) to account for the normalization performed during the RGB map calculation. This is expressed as
$$z_{r,g,b} = \max\left(0, \min\left(255, z_{r,g,b} \times 255\right)\right)$$
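The NumPy sketch below shows one way to realize Equations (6)–(10): crop and clamp the point cloud, map points onto a 512 × 512 grid, and fill the density, height, and intensity channels before converting to uint8. It is an illustrative re-implementation of the encoding described above (the height channel is normalized to [0, 1] here so that the scaling of Equation (10) stays in range, and the intensity is assumed pre-normalized to [0, 1]); it is not the authors' released code.

```python
import numpy as np

def pointcloud_to_bev_rgb(points, intensity,
                          x_range=(-70.0, 70.0), y_range=(-70.0, 70.0),
                          z_range=(-2.73, 3.27), grid=(512, 512)):
    """Encode a point cloud as a BEV RGB map (density, height, intensity).

    points:    (N, 3) array of x, y, z coordinates in metres.
    intensity: (N,) array of per-point reflectance values, assumed in [0, 1].
    Returns an (m, n, 3) uint8 image in the spirit of Eqs. (7)-(10).
    """
    m, n = grid
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    # Crop to the region of interest and cap heights at the upper bound (Eq. (6)).
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) & (z >= z_range[0]))
    x, y, z, intensity = x[keep], y[keep], np.minimum(z[keep], z_range[1]), intensity[keep]

    # Map metric coordinates to integer grid cells (the P_{i->j} mapping).
    col = ((x - x_range[0]) / (x_range[1] - x_range[0]) * n).astype(int).clip(0, n - 1)
    row = ((y - y_range[0]) / (y_range[1] - y_range[0]) * m).astype(int).clip(0, m - 1)
    cell = row * n + col

    z_g = np.zeros(m * n)    # max height per cell
    z_b = np.zeros(m * n)    # max intensity per cell
    count = np.zeros(m * n)  # point count per cell

    np.maximum.at(z_g, cell, (z - z_range[0]) / (z_range[1] - z_range[0]))  # normalized to [0, 1]
    np.maximum.at(z_b, cell, intensity)
    np.add.at(count, cell, 1.0)
    z_r = np.minimum(1.0, np.log(count + 1.0) / np.log(64.0))               # density channel, Eq. (9)

    bev = np.stack([z_r, z_g, z_b], axis=-1).reshape(m, n, 3)
    return np.clip(bev * 255.0, 0, 255).astype(np.uint8)                    # Eq. (10)
```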

3.4. BEV Feature Encoder

BEVCAM3D employs a dual-branch encoder to extract and transform features from multi-view camera images and LiDAR BEV maps.
The camera branch inputs the multi-view perspective images $\{I_i \in \mathbb{R}^{3 \times H \times W}\}_{i=1}^{N}$, and ResNet-101 [107] extracts hierarchical features $\{F_i^{2D} \in \mathbb{R}^{C \times H/2^s \times W/2^s}\}$ at scales $s \in \{2, 3, 4\}$. The FPN [108] aggregates features across scales to enhance small-object detection, as expressed below:
$$F_{\mathrm{BiFPN}} = \sum_{s} \omega_s \cdot \mathrm{Conv}_{1 \times 1}\left(F_i^{2D}\right),$$
where $\omega_s$ denotes the learnable weights for scale $s$.
The PyrOccNet backbone [25] lifts 2D features to a BEV via a Dense Transformer and TopDown Network [74], expressed as follows:
$$F_{\mathrm{CAM}}^{\mathrm{BEV}} = T_{\mathrm{Dense}}\left(F_{\mathrm{BiFPN}}, K, R, t\right) \in \mathbb{R}^{X \times Y \times C},$$
where $K$, $R$, and $t$ are the camera intrinsics, rotation, and translation matrices.
The LiDAR branch inputs the ground-removed LiDAR BEV map $z_{r,g,b} \in \mathbb{R}^{H \times W \times 3}$ (encoded with height, intensity, and density), and EfficientDet-B3 processes $z_{r,g,b}$ to produce BEV features $F_{\mathrm{LiDAR}}^{\mathrm{BEV}} \in \mathbb{R}^{X \times Y \times C}$.
The camera BEV features $F_{\mathrm{CAM}}^{\mathrm{BEV}}$ and LiDAR BEV features $F_{\mathrm{LiDAR}}^{\mathrm{BEV}}$ are concatenated [109] and passed to the fusion network (Section 3.5).
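As a small aside, the following PyTorch sketch illustrates the weighted multi-scale aggregation of Equation (11). The 1 × 1 projections, the softmax-normalized scale weights, and the resizing of all scales to a common resolution are assumptions made to keep the example self-contained; the channel widths are placeholder values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedScaleFusion(nn.Module):
    """Illustrative weighted multi-scale aggregation in the spirit of Eq. (11)."""

    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        self.weights = nn.Parameter(torch.ones(len(in_channels)))  # learnable omega_s

    def forward(self, feats):  # feats: list of (B, C_s, H_s, W_s), finest scale first
        target = feats[0].shape[-2:]
        w = torch.softmax(self.weights, dim=0)
        fused = 0.0
        for ws, proj, f in zip(w, self.proj, feats):
            f = proj(f)                                            # Conv_{1x1}(F_i^{2D})
            if f.shape[-2:] != target:                             # resize to the finest scale (assumption)
                f = F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            fused = fused + ws * f                                 # sum_s omega_s * ...
        return fused

# Example with hypothetical ResNet-101 stage widths at scales s = 2, 3, 4.
feats = [torch.randn(1, c, 56 // 2**i, 56 // 2**i) for i, c in enumerate([512, 1024, 2048])]
fusion = WeightedScaleFusion([512, 1024, 2048])
print(fusion(feats).shape)  # torch.Size([1, 256, 56, 56])
```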

3.5. BEVCAM3D Feature Fusion Network

BEVCAM3D-FFNet (Figure 4) fuses the camera and LiDAR BEV features through three stages:
To begin the feature fusion, the camera BEV features $F_{\mathrm{CAM}}^{\mathrm{BEV}}$ and LiDAR BEV features $F_{\mathrm{LiDAR}}^{\mathrm{BEV}}$ are concatenated to preserve the raw modality-specific features, providing a dense input for deformable attention to learn cross-modal relationships without prior alignment:
$$F_{\mathrm{concat}} = \mathrm{Concat}\left(F_{\mathrm{CAM}}^{\mathrm{BEV}}, F_{\mathrm{LiDAR}}^{\mathrm{BEV}}\right) \in \mathbb{R}^{H \times W \times 2C}$$
The computational cost of traditional cross-attention mechanisms, as described in [20,110], scales quadratically with the input length, i.e., $O(N^2)$, where $N = X \times Z$ is the product of the height and width of the BEV feature map. To address this inefficiency and mitigate cross-modality feature misalignment—often a consequence of the translation invariance inherent in traditional convolutional networks—BEVCAM3D-FFNet employs deformable multi-modal cross-attention for feature fusion, as proposed by [54].
Learnable BEV queries $Q \in \mathbb{R}^{N \times C_q}$ attend to $F_{\mathrm{concat}}$ using deformable attention [54], which predicts sampling offsets $\Delta P_i$ per query and can be expressed as
$$\Delta P_i = \mathrm{MLP}(Q_i), \quad F_{\mathrm{fused}} = \mathrm{DeformAttn}\left(Q, P + \Delta P, F_{\mathrm{concat}}\right)$$
Here, $P \in \mathbb{R}^{H \times W \times 2}$ denotes the normalized BEV grid coordinates. $P$ defines the reference points (e.g., BEV grid centers), and the attention samples features from $F_{\mathrm{concat}}$ at the learned offsets $\Delta P$.
A CenterPoint-style head [111] predicts 3D boxes $B_{3D} = \{c_x, c_y, c_z, w, l, h, \theta\}$ from $F_{\mathrm{fused}}$. Training uses the Hungarian algorithm [112] for target assignment and minimizes the loss $\mathcal{L}$, which is expressed as
$$\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda \mathcal{L}_{\mathrm{offset}},$$
where $\mathcal{L}_{\mathrm{det}}$ is the focal loss for detection, and $\mathcal{L}_{\mathrm{offset}}$ penalizes large offsets $\Delta P$.
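To make the fusion step of Equations (13)–(15) concrete, the sketch below implements a single-head, didactically simplified version of deformable cross-attention over the concatenated BEV features: each query predicts a few sampling offsets and weights, features are gathered with grid_sample, and the weighted sum forms the fused feature. The head count, offset scaling, and tensor layout are assumptions; this is not the authors' module or the full multi-head formulation of [54].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedDeformableBEVFusion(nn.Module):
    """Single-head sketch of deformable cross-attention over concatenated BEV features."""

    def __init__(self, channels, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_mlp = nn.Linear(channels, 2 * num_points)   # predicts dP per query
        self.weight_mlp = nn.Linear(channels, num_points)       # attention weight per sample
        self.value_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.out_proj = nn.Linear(channels, channels)

    def forward(self, queries, ref_points, f_concat):
        # queries:    (B, N, C)    learnable BEV queries, one per BEV cell
        # ref_points: (B, N, 2)    reference points P, normalized (x, y) in [-1, 1]
        # f_concat:   (B, C, H, W) concatenated camera + LiDAR BEV features
        B, N, C = queries.shape
        K = self.num_points

        offsets = self.offset_mlp(queries).view(B, N, K, 2).tanh() * 0.1  # small offsets (assumption)
        weights = self.weight_mlp(queries).softmax(dim=-1)                # (B, N, K)

        sample_grid = (ref_points.unsqueeze(2) + offsets).clamp(-1, 1)    # P + dP, (B, N, K, 2)
        values = self.value_proj(f_concat)
        sampled = F.grid_sample(values, sample_grid, align_corners=False) # (B, C, N, K)
        fused = (sampled * weights.unsqueeze(1)).sum(dim=-1)              # weighted aggregation, (B, C, N)
        return self.out_proj(fused.transpose(1, 2))                       # (B, N, C)
```

The returned (B, N, C) tensor can be reshaped back to an H × W BEV map before the detection head, and the predicted offsets can be penalized through the $\lambda \mathcal{L}_{\mathrm{offset}}$ term in Equation (15).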

4. Results

4.1. Experimental Setup

All experiments were conducted on a workstation running Ubuntu 20.04 with an AMD Ryzen 9 5950X CPU and NVIDIA RTX 4090 GPU. The implementation leveraged the following libraries:
  • Frameworks: PyTorch 2.3.1 (CUDA 12.6 backend), NumPy 1.24.4.
  • Data Loading: NuScenes devkit v1.1.11 [28], OpenCV 4.11.0.86.
  • Augmentation: Albumentations 1.4.18 (applied to camera images).
  • Visualization: Matplotlib 3.5.3, PIL 10.4.0.
The NuScenes dataset [28] was used for evaluation, comprising 1000 scenes (700/150/150 train/val/test splits) with the following:
  • Sensors: Six surround-view cameras (1600 × 900 resolution) and one LiDAR (32 beams).
  • Annotations: 1.4 M 3D bounding boxes across 10 classes: car, truck, bus, trailer, construction vehicle, pedestrian, motorcycle, bicycle, barrier, and traffic cone.
  • Evaluation Metrics: NuScenes Detection Score (NDS) and mean Average Precision (mAP).

4.2. Experimental Parameters

Training used the AdamW optimizer with cosine learning rate decay:
  • Initial LR: $4 \times 10^{-4}$, Batch Size: 16 (2 GPUs, 8 samples/GPU).
  • Training Schedule: 50 epochs, 20% warmup.
  • Precision: FP16 via PyTorch AMP.
  • Gradient Checkpointing: Enabled for ResNet-50.
Architecture Details:
  • Backbone: ResNet-50 [107] with FPN [108] (feature depth: 256).
  • BEV Grid: $x \in [-35\ \mathrm{m}, 35\ \mathrm{m}]$, $z \in [1\ \mathrm{m}, 70\ \mathrm{m}]$, resolution $\delta = 0.2\ \mathrm{m}$.
  • Image Size: 224 × 224 (center-cropped and resized).
  • View Transformation: Focal length scaled to f = 375 (aligned with camera intrinsics and adjusted for 224 × 224 resolution).
Data Augmentation:
  • Rotation: ±30°, Flip: Horizontal (50%), Shift: 10% of image size.
Ground Segmentation: The LiDAR ground removal algorithm [26] filtered out points with a height of z < 0.5   m , which contributed to a significant reduction in computational load. In the study by [26], this height threshold was selected based on both empirical observations and prior studies, which showed that ground points generally fall below this value [26,113,114]. Removing these low-height points helps eliminate non-relevant ground data, which often constitute a substantial portion of raw LiDAR scans. This allows the system to focus on higher-level objects of interest—such as vehicles, pedestrians, and road obstacles—thereby reducing the processing time and memory consumption. The chosen threshold corresponds to the lower bound of common object heights in autonomous driving environments, ensuring that essential scene elements are preserved in the filtered point cloud [26,113,114].
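The snippet below illustrates the kind of height-threshold pre-filter described here. The threshold value comes from the paragraph above, while the assumption that z is referenced to the ground plane and the synthetic input are illustrative; the full ground-segmentation algorithm [26] is considerably more elaborate than a single global cut-off.

```python
import numpy as np

def remove_low_points(points, height_threshold=0.5):
    """Discard points below the height threshold (in metres) before BEV encoding.

    points: (N, 4) array of x, y, z, intensity. Assumes z is measured relative to
    the ground plane, as implied by the threshold discussed above; the full
    algorithm in [26] additionally handles slopes and uneven terrain.
    """
    keep = points[:, 2] >= height_threshold
    return points[keep]

# Typical effect: a large fraction of raw LiDAR returns are ground hits, so the
# downstream BEV encoder and detection head process far fewer points.
raw = np.random.uniform(low=[-70, -70, -2.0, 0.0], high=[70, 70, 3.0, 1.0], size=(120_000, 4))
filtered = remove_low_points(raw)
print(len(raw), "->", len(filtered), "points after ground removal")
```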

4.3. Evaluation Metrics

To evaluate the performance of the proposed multi-sensor fusion model, the mean Average Precision (mAP) in Equation (16), the NuScenes Detection Score (NDS) in Equation (17), and the frames per second (FPS, Hz) in Equation (18) were used, as shown in Table 1. These metrics are expressed as follows, and a short computational sketch is provided after the list:
  • Mean Average Precision (mAP):
  • Description: The mAP measures detection quality using center distance thresholds (2 m for cars and 4 m for pedestrians/cyclists). It is calculated by integrating the precision–recall curve for each class.
    $$\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c,$$
    where:
    $C = 10$ is the number of classes in nuScenes.
    $\mathrm{AP}_c$ represents the Average Precision for class $c$, averaged over 10 recall thresholds ranging from 0.1 to 0.9.
  • NuScenes Detection Score (NDS):
  • Description: The NDS is a composite metric that combines the mAP with the five true-positive error metrics (ATE, ASE, AOE, AVE, and AAE) to provide an overall assessment of detection performance.
$$\mathrm{NDS} = \frac{1}{10}\left[5 \times \mathrm{mAP} + \sum_{m \in \{\mathrm{ATE}, \mathrm{ASE}, \mathrm{AOE}, \mathrm{AVE}, \mathrm{AAE}\}} \left(1 - \min(1, m)\right)\right]$$
  • Frames Per Second (FPS):
  • Description: FPS measures the number of frames displayed or processed in one second, serving as an indicator of performance in video processing or real-time systems.
$$\mathrm{FPS} = \frac{\text{Total number of frames}}{\text{Total time taken (in seconds)}}$$
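For concreteness, the helpers below evaluate Equations (16)–(18) from precomputed per-class AP values, mean true-positive errors, and frame counts. The numeric inputs are placeholders for illustration only; producing real values requires the full nuScenes evaluation pipeline.

```python
def mean_average_precision(per_class_ap):
    """Eq. (16): average the per-class AP over the C = 10 nuScenes classes."""
    return sum(per_class_ap.values()) / len(per_class_ap)

def nuscenes_detection_score(map_value, tp_errors):
    """Eq. (17): combine mAP with the five true-positive errors (ATE, ASE, AOE, AVE, AAE)."""
    error_terms = sum(1.0 - min(1.0, e) for e in tp_errors.values())
    return (5.0 * map_value + error_terms) / 10.0

def frames_per_second(num_frames, total_seconds):
    """Eq. (18): throughput of the detector."""
    return num_frames / total_seconds

# Placeholder numbers for illustration only (not results from this paper).
ap = {"car": 0.88, "truck": 0.65, "bus": 0.72, "trailer": 0.45, "construction": 0.38,
      "pedestrian": 0.90, "motorcycle": 0.78, "bicycle": 0.62, "barrier": 0.74, "traffic_cone": 0.89}
errors = {"ATE": 0.25, "ASE": 0.24, "AOE": 0.30, "AVE": 0.25, "AAE": 0.13}
print(round(nuscenes_detection_score(mean_average_precision(ap), errors), 3))
```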

5. Discussion

The performance evaluation of BEVCAM3D on the nuScenes dataset demonstrates its state-of-the-art capability in 3D object detection. As shown in Table 1, BEVCAM3D achieves an impressive 73.9% mAP and 76.2% NDS, surpassing recent LiDAR-camera fusion methods, including ISFusion [121] (73.0% mAP) and SparseFusion [119] (72.0% mAP). This superior performance can be attributed to the integration of a unified bird’s-eye view (BEV) architecture, which effectively fuses camera semantic features with LiDAR geometric information using deformable cross-modality attention.
A critical advantage of BEVCAM3D is its exceptional detection accuracy for underrepresented object classes, such as pedestrians (91.0% AP) and traffic cones (89.9% AP). These results highlight the system’s capability to precisely align sparse LiDAR data with high-resolution camera features, ensuring robust detection in challenging scenarios. Compared to the methods in [110] (70.2% mAP) and [115] (68.9% mAP), our method reduces feature misalignment errors by 15% while maintaining a competitive inference speed (6.3 FPS for Swin-T backbone).
As shown in Table 2, BEVCAM3D incorporates a fast ground segmentation module that significantly reduces computational overhead. By filtering non-essential road points, the system achieves real-time performance, reaching 11.2 FPS with an EfficientDet-B3 backbone. This optimization starkly contrasts with traditional methods, such as the method in [27], which rely on processing raw point clouds and consequently achieve only 3.8 FPS.
The qualitative results in Figure 5 further validate BEVCAM3D’s detection accuracy by visualizing the alignment between predicted 3D bounding boxes (orange) and ground truth annotations (green) in Bird’s-Eye-View (BEV) space. The system demonstrates robust performance across diverse object scales and densities. It performs remarkably in localizing large objects like trucks and buses, as seen in the central and upper-right regions of the scene. Smaller objects, such as traffic cones, are also accurately detected despite sparser LiDAR reflections, showcasing the effectiveness of the cross-modality attention module in resolving ambiguities between camera semantics and LiDAR geometry.
Despite these advancements, BEVCAM3D’s reliance on precise sensor calibration poses a limitation. As presented in Table 3, in scenarios with calibration drift, the model’s performance deteriorates, particularly under low-light conditions, where its mAP drops to 62.3% at nighttime compared to 73.9% during the day. While the fusion of LiDAR and cameras introduces additional hardware costs compared to camera-only systems, BEVCAM3D mitigates this by leveraging sparse LiDAR data and low-cost monocular cameras, striking a balance between affordability and performance for commercial applications.
Furthermore, the system’s robustness to partial sensor failures (e.g., a malfunctioning camera) is bolstered by the redundancy of multi-camera inputs and the cross-modality attention module, which dynamically prioritizes LiDAR features when camera data are incomplete. However, a comprehensive cost–benefit analysis of multi-sensor deployment and rigorous stress testing under extreme sensor failure scenarios remain to be conducted. Future research could explore the integration of RADAR sensors to enhance robustness under adverse lighting and weather conditions while also investigating adaptive sensor fusion strategies to further reduce dependency on precise calibration or expensive hardware configurations.

6. Analysis Study

To further validate the design choices for BEVCAM3D, an analysis study (Table 3) was conducted. The results confirm the effectiveness of its core components. Removing the ground segmentation module resulted in an 11.2% drop in the daytime mAP (from 70.8% to 62.9%), showing the segmentation module’s role in efficient point cloud processing. This module also enhances the robustness of BEVCAM3D under low-light conditions, achieving a 31.7% improvement (48.6% mAP) over baseline models (e.g., 33.2% mAP).
Another key observation is the importance of the concatenation step with the deformable cross-modality attention. The implementation of modality fusion without this step (W/o CC) led to a 3.1% reduction in the NDS (from 76.2% to 71.9%) for daytime and a 12.0% reduction in the NDS (from 63.5% to 55.9%) for nighttime.
In addition to the architectural analysis, a study of different image encoders (Table 2) demonstrated a trade-off between accuracy and inference speed. The Swin-T backbone provided the highest accuracy (73.9% mAP) but operated at a lower speed (6.3 FPS), whereas the ResNet-50 backbone achieved a slightly lower accuracy (73.3% mAP) but offered an improved inference speed (8.6 FPS). BEVCAM3D’s flexibility allows for backbone selection based on specific deployment requirements, balancing detection performance with real-time constraints.

7. Conclusions

This study presents BEVCAM3D, a novel LiDAR-camera fusion model designed for efficient and accurate 3D object detection in autonomous driving applications. Through an innovative combination of BEV-based feature fusion, deformable cross-modality attention, and fast ground segmentation, BEVCAM3D surpasses existing state-of-the-art models on the nuScenes benchmark. It achieves a 73.9% mAP and a 76.2% NDS, outperforming previous fusion methods while maintaining real-time inference speeds (11.2 FPS).
The findings emphasize the critical role of ground segmentation in optimizing computational efficiency and the necessity of robust feature alignment techniques for improving detection accuracy across diverse object classes. While BEVCAM3D demonstrates superior performance, challenges remain regarding its sensitivity to sensor calibration errors and degraded accuracy in low-light environments. Future research should focus on integrating additional sensory modalities, such as RADAR, to further enhance detection robustness under adverse conditions.
Overall, BEVCAM3D represents a significant advancement in multi-modal 3D object detection, paving the way for more reliable and efficient perception systems in autonomous vehicles.

Author Contributions

Conceptualization, D.A.O.; methodology, D.A.O.; software, D.A.O.; validation, D.A.O.; data curation, D.A.O.; writing—original draft preparation, D.A.O.; writing—review and editing, D.A.O., E.D.M.; supervision, E.D.M.; funding acquisition, E.D.M., A.M.A.-M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Council for Scientific and Industrial Research, Pretoria, South Africa, through the Smart Networks collaboration initiative and Internet of Things–Factory Program (funded by the Department of Science and Innovation, South Africa).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The codes used in this study are available on request from the corresponding author due to the current patent application process. However, the dataset used can be found at https://www.nuscenes.org/nuscenes (accessed on 25 April 2022).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Raj, T.; Hashim, F.; Huddin, A.; Ibrahim, M.; Hussain, A. A Survey on LiDAR Scanning Mechanisms. Electronics 2020, 9, 741. [Google Scholar] [CrossRef]
  2. Li, H.; Sima, C.; Dai, J.; Wang, W.; Lu, L.; Wang, H.; Zeng, J.; Li, Z.; Yang, J.; Deng, H.; et al. Delving Into the Devils of Bird’s-Eye-View Perception: A Review, Evaluation and Recipe. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 2151–2170. [Google Scholar] [CrossRef] [PubMed]
  3. Arnold, E.; Al-Jarrah, O.; Dianati, M.; Fallah, S.; Oxtoby, D.; Mouzakitis, A. A Survey on 3D Object Detection Methods for Autonomous Driving Applications. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3782–3795. [Google Scholar] [CrossRef]
  4. Ma, Y.; Wang, T.; Bai, X.; Yang, H.; Hou, Y.; Wang, Y.; Qiao, Y.; Yang, R.; Manocha, D.; Zhu, X. Vision-Centric BEV Perception: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022. [Google Scholar] [CrossRef]
  5. Yang, C.; Chen, Y.; Tian, H.; Tao, C.; Zhu, X.; Zhang, Z.; Huang, G.; Li, H.; Qiao, Y.; Lu, L.; et al. BEVFormer v2: Adapting Modern Image Backbones to Bird’s-Eye-View Recognition via Perspective Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  6. Li, Z.; Yu, Z.; Wang, W.; Anandkumar, A.; Lu, T.; Alvarez, J. FB-BEV: BEV Representation from Forward-Backward View Transformations. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023. [Google Scholar]
  7. Li, Z.; Yu, Z.; Austin, D.; Fang, M.; Lan, S.; Kautz, J.; Alvarez, J. FB-OCC: 3D Occupancy Prediction based on Forward-Backward View Transformation. In Proceedings of the 2023 IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  8. Chen, S.; Cheng, T.; Wang, X.; Meng, W.; Zhang, Q.; Liu, W. Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer. arXiv 2022, arXiv:2206.04584. [Google Scholar]
  9. Yoo, J.; Kim, Y.; Kim, J.; Choi, J. 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2020; Volume 12372, pp. 720–736. [Google Scholar] [CrossRef]
  10. Liang, T.; Xie, H.; Yu, K.; Xia, Z.; Lin, Z.; Wang, Y.; Tang, T.; Wang, B.; Tang, Z. BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework. In Proceedings of the Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  11. Gosala, N.; Valada, A. Bird’s-Eye-View Panoptic Segmentation Using Monocular Frontal View Images. IEEE Robot. Autom. Lett. 2022, 7, 1968–1975. [Google Scholar] [CrossRef]
  12. Liu, Y.; Yan, J.; Jia, F.; Li, S.; Gao, A.; Wang, T.; Zhang, X.; Sun, J. PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–2 October 2022. [Google Scholar]
  13. Philion, J.; Fidler, S. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12359, pp. 194–210. [Google Scholar] [CrossRef]
  14. Hu, A.; Murez, Z.; Mohan, N.; Dudas, S.; Hawke, J.; Badrinarayanan, V.; Cipolla, R.; Kendall, A. FIERY: Future Instance Prediction in Bird’s-Eye View from Surround Monocular Cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 15253–15262. [Google Scholar] [CrossRef]
  15. Li, Z.; Wang, W.; Li, H.; Xie, E.; Sima, C.; Lu, T.; Qiao, Y.; Dai, J. BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2022; Volume 13669, pp. 1–18. [Google Scholar] [CrossRef]
  16. Li, Y.; Ge, Z.; Yu, G.; Yang, J.; Wang, Z.; Shi, Y.; Sun, J.; Li, Z. BEVDepth: Acquisition of Reliable Depth for Multi-view 3D Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 37, pp. 1477–1485. [Google Scholar] [CrossRef]
  17. Mallot, H.; Bülthoff, H.; Little, J.; Bohrer, S. Inverse perspective mapping simplifies optical flow computation and obstacle detection. Biol. Cybern. 1991, 64, 177–185. [Google Scholar] [CrossRef]
  18. Bertozzi, M.; Broggi, A.; Fascioli, A. Stereo inverse perspective mapping: Theory and applications. Image Vis. Comput. 1998, 16, 585–590. [Google Scholar] [CrossRef]
  19. Zhou, B.; Krahenbuhl, P. Cross-view Transformers for real-time Map-view Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13750–13759. [Google Scholar] [CrossRef]
  20. Li, Q.; Wang, Y.; Wang, Y.; Zhao, H. HDMapNet: An Online HD Map Construction and Evaluation Framework. In Proceedings of the International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 4628–4634. [Google Scholar] [CrossRef]
  21. Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; Geiger, A. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4455–4465. [Google Scholar] [CrossRef]
  22. Huang, L.; Wang, H.; Zeng, J.; Zhang, S.; Cao, L.; Yan, J.; Li, H. Geometric-aware Pretraining for Vision-centric 3D Object Detection. arXiv 2023, arXiv:2304.03105. [Google Scholar]
  23. Elfes, A. Using Occupancy Grids for Mobile Robot Perception and Navigation. Computer 1989, 22, 46–57. [Google Scholar] [CrossRef]
  24. Maturana, D.; Scherer, S. VoxNet: A 3D Convolutional Neural Network for real-time object recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–3 October 2015; pp. 922–928. [Google Scholar] [CrossRef]
  25. Roddick, T.; Cipolla, R. Predicting semantic map representations from images using pyramid occupancy networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11135–11144. [Google Scholar] [CrossRef]
  26. Oladele, D.A.; Markus, E.D.; Abu-Mahfouz, A.M. Fastseg3d: A Fast, Efficient, and Adaptive Ground Filtering Algorithm for 3d Point Clouds in Mobile Sensing Applications. SSRN 2024. [Google Scholar] [CrossRef]
  27. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-View 3D Object Detection Network for Autonomous Driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 6526–6534. [Google Scholar] [CrossRef]
  28. Caesar, H.; Bankiti, V.; Lang, A.; Vora, S.; Liong, V.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11618–11628. [Google Scholar] [CrossRef]
  29. Wofk, D.; Ma, F.; Yang, T.; Karaman, S.; Sze, V. FastDepth: Fast monocular depth estimation on embedded systems. In Proceedings of the International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 6101–6108. [Google Scholar] [CrossRef]
  30. Roy, A.; Todorovic, S. Monocular Depth Estimation Using Neural Regression Forest. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5506–5514. [Google Scholar]
  31. Ming, Y.; Meng, X.; Fan, C.; Yu, H. Deep learning for monocular depth estimation: A review. Neurocomputing 2021, 438, 14–33. [Google Scholar] [CrossRef]
  32. Ku, J.; Pon, A.; Waslander, S. Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11859–11868. [Google Scholar] [CrossRef]
  33. Naiden, A.; Paunescu, V.; Kim, G.; Jeon, B.; Leordeanu, M. Shift R-CNN: Deep Monocular 3D Object Detection with Closed-Form Geometric Constraints. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 61–65. [Google Scholar] [CrossRef]
  34. Mousavian, A.; Anguelov, D.; Košecká, J.; Flynn, J. 3D Bounding Box Estimation Using Deep Learning and Geometry. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5632–5640. [Google Scholar] [CrossRef]
  35. Roddick, T.; Kendall, A.; Cipolla, R. Orthographic Feature Transform for Monocular 3D Object Detection. In Proceedings of the 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, 9–12 September 2019. [Google Scholar]
  36. Qin, Z.; Wang, J.; Lu, Y. MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January 2019–1 February 2019; pp. 8851–8858. [Google Scholar] [CrossRef]
  37. Kehl, W.; Manhardt, F.; Tombari, F.; Ilic, S.; Navab, N. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017; pp. 1530–1538. [Google Scholar] [CrossRef]
  38. Pham, C.; Jeon, J. Robust object proposals re-ranking for object detection in autonomous driving using convolutional neural networks. Signal Process. Image Commun. 2017, 53, 110–122. [Google Scholar] [CrossRef]
  39. Chabot, F.; Chaouch, M.; Rabarisoa, J.; Teulière, C.; Chateau, T. Deep MANTA: A Coarse-to-fine Many-Task Network for joint 2D and 3D vehicle analysis from monocular image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 1827–1836. [Google Scholar] [CrossRef]
  40. Xiang, Y.; Choi, W.; Lin, Y.; Savarese, S. Data-driven 3D Voxel Patterns for object category recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1903–1911. [Google Scholar] [CrossRef]
  41. Wang, X.; Zhu, Z.; Zhang, Y.; Huang, G.; Ye, Y.; Xu, W.; Chen, Z.; Wang, X. Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9600–9610. [Google Scholar] [CrossRef]
  42. Liu, W.; Sun, J.; Li, W.; Hu, T.; Wang, P. Deep Learning on Point Clouds and Its Application: A Survey. Sensors 2019, 19, 4188. [Google Scholar] [CrossRef]
  43. Alatise, M.; Hancke, G. A Review on Challenges of Autonomous Mobile Robot and Sensor Fusion Methods. IEEE Access 2020, 8, 39830–39846. [Google Scholar] [CrossRef]
  44. Chib, P.; Singh, P. Recent Advancements in End-to-End Autonomous Driving using Deep Learning: A Survey. IEEE Trans. Intell. Veh. 2023, 9, 103–118. [Google Scholar] [CrossRef]
  45. Zhao, J.; Shi, J.; Zhuo, L. BEV perception for autonomous driving: State of the art and future perspectives. Expert Syst. Appl. 2024, 258, 125103. [Google Scholar] [CrossRef]
  46. Reading, C.; Harakeh, A.; Chae, J.; Waslander, S. Categorical Depth Distribution Network for Monocular 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8551–8560. [Google Scholar] [CrossRef]
  47. Huang, J.; Huang, G. BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection. arXiv 2022, arXiv:2203.17054. [Google Scholar]
  48. Li, Z.; Lan, S.; Alvarez, J.M.; Wu, Z. BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 20113–20123. [Google Scholar]
  49. Xie, E.; Yu, Z.; Zhou, D.; Philion, J.; Anandkumar, A.; Fidler, S.; Luo, P.; Alvarez, J.M. M2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation. arXiv 2022, arXiv:2204.05088. [Google Scholar]
  50. Li, Y.; Huang, B.; Chen, Z.; Cui, Y.; Liang, F.; Shen, M.; Liu, F.; Xie, E.; Sheng, L.; Ouyang, W.; et al. Fast-BEV: A Fast and Strong Bird’s-Eye View Perception Baseline. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 8665–8679. Available online: https://arxiv.org/abs/2301.12511v1 (accessed on 11 September 2023). [CrossRef]
51. Huang, J.; Huang, G.; Zhu, Z.; Ye, Y.; Du, D. BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View. arXiv 2021, arXiv:2112.11790. Available online: https://arxiv.org/abs/2112.11790v3 (accessed on 11 September 2023).
52. Lang, A.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 12689–12697. [Google Scholar] [CrossRef]
53. Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3D Object Detection and Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11779–11788. [Google Scholar] [CrossRef]
  54. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. Available online: https://arxiv.org/abs/2010.04159v4 (accessed on 14 April 2025).
  55. Seo, S.; Huang, J.; Yang, H.; Liu, Y. Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In Proceedings of the RecSys’17: Eleventh ACM Conference on Recommender Systems, Como, Italy, 27–31 August 2017; Volume 17, pp. 297–305. [Google Scholar] [CrossRef]
  56. Bello, I.; Zoph, B.; Le, Q.; Vaswani, A.; Shlens, J. Attention Augmented Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3285–3294. [Google Scholar] [CrossRef]
  57. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 2017, 5999–6009. [Google Scholar]
  58. Wang, Y.; Guizilini, V.; Zhang, T.; Wang, Y.; Zhao, H.; Solomon, J. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. Proc. Mach. Learn. Res. 2021, 164, 180–191. [Google Scholar]
59. Zhang, Y.; Zhu, Z.; Du, D. OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar]
  60. Tesla. Tesla AI Day 2021—YouTube, n.d. Available online: https://www.youtube.com/watch?v=j0z4FweCy4M (accessed on 12 February 2024).
  61. Palazzi, A.; Borghi, G.; Abati, D.; Calderara, S.; Cucchiara, R. Learning to Map Vehicles into Bird’s Eye View; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2017; Volume 10484 LNCS, pp. 233–243. [Google Scholar] [CrossRef]
  62. Reiher, L.; Lampe, B.; Eckstein, L. A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird’s Eye View. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems ITSC 2020, Rhodes, Greece, 20–23 September 2020. [Google Scholar] [CrossRef]
63. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015. Available online: https://papers.nips.cc/paper_files/paper/2015/hash/33ceb07bf4eeb3da587e268d663aba1a-Abstract.html (accessed on 14 April 2025).
  64. Li, S.; Yang, K.; Shi, H.; Zhang, J.; Lin, J.; Teng, Z.; Li, Z. Bi-Mapper: Holistic BEV Semantic Mapping for Autonomous Driving. IEEE Robot. Autom. Lett. 2023, 8, 7034–7041. [Google Scholar] [CrossRef]
65. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar] [CrossRef]
  66. Zhu, X.; Yin, Z.; Shi, J.; Li, H.; Lin, D. Generative Adversarial Frontal View to Bird View Synthesis. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 454–463. [Google Scholar] [CrossRef]
  67. Gupta, D.; Pu, W.; Tabor, T.; Schneider, J. SBEVNet: End-to-End Deep Stereo Layout Estimation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, 3–8 January 2022; pp. 667–676. [Google Scholar] [CrossRef]
  68. Chen, C.; Fan, Q.; Panda, R. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 347–356. [Google Scholar] [CrossRef]
69. Xia, Z.; Pan, X.; Song, S.; Li, L.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4784–4793. [Google Scholar] [CrossRef]
70. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10778–10787. [Google Scholar] [CrossRef]
  71. Liu, Y.; Wang, T.; Zhang, X.; Sun, J. PETR: Position Embedding Transformation for Multi-View 3D Object Detection; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2022; Volume 13687 LNCS, pp. 531–548. [Google Scholar] [CrossRef]
  72. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers; Lecture Notes in Computer Science (Including Its Subseries Lecture Notes in Artificial Intelligence and Lecture Notes Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2020; Volume 12346 LNCS, pp. 213–229. [Google Scholar] [CrossRef]
  73. Chitta, K.; Prakash, A.; Geiger, A. NEAT: Neural Attention Fields for End-to-End Autonomous Driving. In Proceedings of the IEEE International Conference on Computer Vision 2021, Montreal, BC, Canada, 11–17 October 2021; pp. 15773–15783. [Google Scholar] [CrossRef]
74. Yang, W.; Li, Q.; Liu, W.; Yu, Y.; Ma, Y.; He, S.; Pan, J. Projecting your view attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15531–15540. [Google Scholar] [CrossRef]
  75. Xie, Y.; Tian, J.; Zhu, X. Linking Points With Labels in 3D: A Review of Point Cloud Semantic Segmentation. IEEE Geosci. Remote Sens. Mag. 2019, 8, 38–59. [Google Scholar] [CrossRef]
  76. Guo, Y.; Wang, H.; Hu, Q.; Liu, H.; Liu, L.; Bennamoun, M. Deep Learning for 3D Point Clouds: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4338–4364. [Google Scholar] [CrossRef]
  77. Wang, R.; Peethambaran, J.; Chen, D. LiDAR Point Clouds to 3-D Urban Models: A Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 606–627. [Google Scholar] [CrossRef]
  78. Nguyen, A.; Le, B. 3D Point Cloud Segmentation: A Survey. In Proceedings of the IEEE International Conference on Robotics, Automation and Mechatronics (RAM 2013), Manila, Philippines, 12–15 November 2013; pp. 225–230. [Google Scholar] [CrossRef]
  79. Xu, Y.; Tong, X.; Stilla, U. Voxel-based Representation of 3D Point Clouds: Methods, Applications, and Its Potential Use in the Construction Industry. Autom. Constr. 2021, 126, 103675. [Google Scholar] [CrossRef]
  80. Brown, R. Building a Balanced k-d Tree in O(kn log n) Time. J. Comput. Graph. Tech. 2014, 4, 50–68. [Google Scholar]
  81. Meagher, D. Geometric Modeling Using Octree Encoding. Comput. Graph. Image Process. 1982, 19, 129–147. [Google Scholar] [CrossRef]
82. Zhou, Y.; Tuzel, O. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4490–4499. [Google Scholar] [CrossRef]
  83. Thrun, S.; Montemerlo, M.; Dahlkamp, H.; Stavens, D.; Aron, A.; Diebel, J.; Fong, P.; Gale, J.; Halpenny, M.; Hoffmann, G.; et al. Stanley: The robot that won the DARPA Grand Challenge. J. Field Robot. 2006, 23, 661–692. [Google Scholar] [CrossRef]
  84. Qi, C.; Su, H.; Mo, K.; Guibas, L. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar] [CrossRef]
  85. Qi, C.; Yi, L.; Su, H.; Guibas, L. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 2017, pp. 5100–5109. [Google Scholar] [CrossRef]
  86. Kang, Z.; Yang, J.; Zhong, R.; Wu, Y.; Shi, Z.; Lindenbergh, R. Voxel-Based Extraction and Classification of 3-D Pole-Like Objects from Mobile LiDAR Point Cloud Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 4287–4298. [Google Scholar] [CrossRef]
  87. Shen, Z.; Liang, H.; Lin, L.; Wang, Z.; Huang, W.; Yu, J. Fast Ground Segmentation for 3D LiDAR Point Cloud Based on Jump-Convolution-Process. Remote Sens. 2021, 13, 3239. [Google Scholar] [CrossRef]
  88. Huang, W.; Liang, H.; Lin, L.; Wang, Z.; Wang, S.; Yu, B.; Niu, R. A Fast Point Cloud Ground Segmentation Approach Based on Coarse-To-Fine Markov Random Field. IEEE Trans. Intell. Transp. Syst. 2022, 23, 7841–7854. [Google Scholar] [CrossRef]
  89. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231. [Google Scholar] [CrossRef]
  90. Lv, X.; Wang, S.; Ye, D. CFNet: LiDAR-Camera Registration Using Calibration Flow Network. Sensors 2021, 21, 8112. [Google Scholar] [CrossRef]
91. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision Workshop, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar] [CrossRef]
  92. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef]
93. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10526–10535. [Google Scholar] [CrossRef]
  94. Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector Representation for 3D Object Detection. Int. J. Comput. Vis. 2021, 131, 531–551. [Google Scholar] [CrossRef]
95. He, C.; Zeng, H.; Huang, J.; Hua, X.; Zhang, L. Structure Aware Single-Stage 3D Object Detection from Point Cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11870–11879. [Google Scholar] [CrossRef]
  96. Simon, M.; Milz, S.; Amende, K.; Gross, H.M. Complex-YOLO: Real-time 3D Object Detection on Point Clouds. arXiv 2018, arXiv:1803.06199. [Google Scholar]
  97. Barrera, A.; Guindel, C.; Beltrán, J.; García, F. BirdNet+: End-to-End 3D Object Detection in LiDAR Bird’s Eye View. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems ITSC 2020, Rhodes, Greece, 20–23 September 2020. [Google Scholar] [CrossRef]
  98. Mohapatra, S.; Yogamani, S.; Gotzig, H.; Milz, S.; Mader, P. BEVDetNet: Bird’s Eye View LiDAR Point Cloud based Real-time 3D Object Detection for Autonomous Driving. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 2809–2815. [Google Scholar] [CrossRef]
  99. Luo, L.; Zheng, S.; Li, Y.; Fan, Y.; Yu, B.; Cao, S.Y.; Li, J.; Shen, H.L. BEVPlace: Learning LiDAR-based Place Recognition using Bird’s Eye View Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023. [Google Scholar]
  100. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 5750–5757. [Google Scholar] [CrossRef]
101. Qi, C.; Liu, W.; Wu, C.; Su, H.; Guibas, L. Frustum PointNets for 3D Object Detection from RGB-D Data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 918–927. [Google Scholar] [CrossRef]
  102. Li, Y.; Chen, Y.; Qi, X.; Li, Z.; Sun, J.; Jia, J. Unifying Voxel-based Representation with Transformer for 3D Object Detection. Adv. Neural Inf. Process. Syst. 2022, 35, 18442–18455. [Google Scholar]
  103. Borse, S.; Klingner, M.; Kumar, V.R.; Cai, H.; Almuzairee, A.; Yogamani, S.K.; Porikli, F.M. X-Align++: Cross-modal cross-view alignment for Bird’s-eye-view segmentation. Mach. Vis. Appl. 2022, 34, 1–16. [Google Scholar]
  104. Hao, X.; Diao, Y.; Wei, M.; Yang, Y.; Hao, P.; Yin, R.; Zhang, H.; Li, W.; Zhao, S.; Liu, Y. MapFusion: A novel BEV feature fusion network for multi-modal map construction. Inf. Fusion 2025, 119, 103018. [Google Scholar] [CrossRef]
  105. Wang, S.; Caesar, H.; Nan, L.; Kooij, J. UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju Island, Republic of Korea, 2–5 June 2024. [Google Scholar]
  106. Man, Y.; Gui, L.; Wang, Y. BEV-Guided Multi-Modality Fusion for Driving Perception. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21960–21969. [Google Scholar] [CrossRef]
  107. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  108. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  109. Harley, A.W.; Fang, Z.; Li, J.; Ambrus, R.; Fragkiadaki, K. Simple-BEV: What Really Matters for Multi-Sensor BEV Perception? In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar]
  110. Liu, Z.; Tang, H.; Amini, A.; Yang, X.; Mao, H.; Rus, D.; Han, S. BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023. [Google Scholar] [CrossRef]
  111. Yin, T.; Zhou, X.; Krähenbühl, P. Multimodal virtual point 3D detection. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Online, 6–14 December 2021. NIPS’21. [Google Scholar]
  112. Kuhn, H. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  113. Gupta, A.; Jain, S.; Choudhary, P.; Parida, M. Dynamic object detection using sparse LiDAR data for autonomous machine driving and road safety applications. Expert Syst. Appl. 2024, 255, 124636. [Google Scholar] [CrossRef]
  114. Chu, P.M.; Cho, S.; Park, J.; Fong, S.; Cho, K. Enhanced ground segmentation method for Lidar point clouds in human-centric autonomous robot systems. Hum.-Centric Comput. Inf. Sci. 2019, 9, 17. [Google Scholar] [CrossRef]
  115. Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C. TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1080–1089. [Google Scholar] [CrossRef]
  116. Shi, G.; Li, R.; Ma, C. PillarNet: Real-Time and High-Performance Pillar-Based 3D Object Detection. In Proceedings of the Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part X. Springer: Berlin/Heidelberg, Germany, 2022; pp. 35–52. [Google Scholar] [CrossRef]
  117. Chen, Y.; Yu, Z.; Chen, Y.; Lan, S.; Anandkumar, A.; Jia, J.; Alvarez, J.M. FocalFormer3D: Focusing on Hard Instance for 3D Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 8360–8371. [Google Scholar] [CrossRef]
118. Vora, S.; Lang, A.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4603–4611. [Google Scholar] [CrossRef]
  119. Xie, Y.; Xu, C.; Rakotosaona, M.J.; Rim, P.; Tombari, F.; Keutzer, K.; Tomizuka, M.; Zhan, W. SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 17545–17556. [Google Scholar] [CrossRef]
  120. Yan, J.; Liu, Y.; Sun, J.; Jia, F.; Li, S.; Wang, T.; Zhang, X. Cross Modal Transformer: Towards Fast and Robust 3D Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 18222–18232. [Google Scholar] [CrossRef]
  121. Yin, J.; Shen, J.; Chen, R.; Li, W.; Yang, R.; Frossard, P.; Wang, W. IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 14905–14915. [Google Scholar] [CrossRef]
Figure 1. Traditional IPM with 3D objects skewed by IPM distortion.
Figure 2. Traditional IPM without 3D objects.
Figure 3. Overview of BEVCAM3D. Camera and LiDAR inputs are processed separately, and ground segmentation of the 3D LiDAR point cloud serves as a region proposal for the BEVCAM Fusion Net, specifically the cross-modality attention module, to improve 3D BEV object detection. The BEV queries Q are a set of learnable queries with positional embeddings PE that correlate with the BEV features of both modalities.
Figure 4. Overview of BEVCAM3D-FFNet. FFNet adopts the CrossViT [68] module, modified to the DETR [54] architecture to improve computational efficiency.
Figure 5. Detection results on LiDAR data. Predicted 3D bounding boxes (orange) and ground truth annotations (green) are overlaid on the BEV LiDAR map.
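To make the query-based fusion described in the captions of Figures 3 and 4 concrete, the following is a minimal PyTorch-style sketch of learnable BEV queries with positional embeddings cross-attending to camera and LiDAR BEV features. It is illustrative only: the module names, feature dimensions, and the use of standard multi-head attention in place of the deformable attention of [54,69] are assumptions, not the released implementation.

```python
# Minimal sketch of query-based cross-modality BEV fusion (illustrative only).
# Standard multi-head attention stands in for the deformable attention used in
# the paper; names, shapes, and hyperparameters below are assumptions.
import torch
import torch.nn as nn

class CrossModalityBEVFusion(nn.Module):
    def __init__(self, num_queries=2500, dim=256, heads=8):
        super().__init__()
        # Learnable BEV queries Q and their positional embeddings PE.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.pos_embed = nn.Parameter(torch.randn(num_queries, dim))
        # One cross-attention layer per modality (camera BEV, LiDAR BEV).
        self.attn_cam = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_lidar = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_bev, lidar_bev):
        # cam_bev, lidar_bev: (B, H*W, dim) flattened BEV feature maps.
        b = cam_bev.size(0)
        q = (self.queries + self.pos_embed).unsqueeze(0).expand(b, -1, -1)
        # Queries attend to each modality in turn; the results are summed.
        q_cam, _ = self.attn_cam(q, cam_bev, cam_bev)
        q_lid, _ = self.attn_lidar(q, lidar_bev, lidar_bev)
        return self.norm(q + q_cam + q_lid)  # fused BEV queries

# Toy usage with a 50x50 BEV grid to keep memory small:
fusion = CrossModalityBEVFusion(num_queries=50 * 50, dim=256)
cam = torch.randn(2, 50 * 50, 256)
lidar = torch.randn(2, 50 * 50, 256)
fused = fusion(cam, lidar)  # (2, 2500, 256)
```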
Table 1. The 3D object detection performance on the nuScenes test set.
meth. | mod. | mAP↑ | NDS↑ | Car | Tk. | Bus | Tl. | C.Vh. | Ped. | Br. | T.C. | Bike | Mtc.
[115] | L | 65.5 | 70.2 | 86.2 | 56.7 | 66.3 | 58.8 | 28.2 | 86.1 | 78.2 | 82.0 | 44.2 | 68.3
[116] | L | 66.0 | 71.4 | 87.6 | 57.5 | 63.6 | 63.1 | 27.9 | 87.3 | 77.2 | 83.3 | 42.3 | 70.1
[117] | L | 68.7 | 72.6 | 87.2 | 57.1 | 69.6 | 64.9 | 34.4 | 88.2 | 77.8 | 82.3 | 49.6 | 76.2
Ours | L | 68.1 | 71.3 | 85.9 | 56.2 | 66.8 | 63.5 | 31.7 | 86.9 | 76.1 | 81.5 | 48.7 | 72.9
[118] | C + L | 46.4 | 58.1 | 77.9 | 35.8 | 36.1 | 37.3 | 15.8 | 73.3 | 60.2 | 62.4 | 24.1 | 41.5
[111] | C + L | 66.4 | 70.5 | 86.8 | 58.5 | 67.4 | 57.3 | 26.1 | 89.1 | 74.8 | 85.0 | 49.3 | 70.0
[115] | C + L | 68.9 | 71.7 | 87.1 | 60.0 | 68.3 | 60.8 | 33.1 | 88.4 | 78.1 | 86.7 | 52.9 | 73.6
[110] | C + L | 70.2 | 72.9 | 88.1 | 60.9 | 69.3 | 62.1 | 34.4 | 89.2 | 78.2 | 85.2 | 52.2 | 72.2
[117] | C + L | 71.6 | 73.9 | 88.5 | 61.4 | 71.7 | 66.4 | 35.9 | 89.7 | 79.3 | 85.3 | 57.1 | 80.3
[119] | C + L | 72.0 | 73.8 | 88.0 | 60.2 | 72.0 | 64.9 | 38.7 | 90.9 | 79.2 | 87.9 | 59.8 | 78.5
[120] | C + L | 72.0 | 74.1 | 88.0 | 63.3 | 75.4 | 65.4 | 37.3 | 87.9 | 78.2 | 84.7 | 60.6 | 79.1
[121] | C + L | 73.0 | 75.2 | 88.3 | 62.7 | 74.9 | 67.3 | 38.4 | 89.3 | 78.1 | 89.2 | 59.5 | 82.4
Ours | C + L | 73.9 | 76.2 | 89.0 | 63.7 | 75.6 | 68.2 | 38.8 | 91.0 | 79.4 | 89.9 | 61.3 | 83.4
All errors (mATE/mASE/mAOE/mAVE/mAAE) are computed only for true positives (IoU ≥ 0.5). The arrows indicate metrics where higher values are better. The methods presented in this table include TransFusion [115], PillarNet [116], FocalFormer3D [117], PointPainting [118], MVP [111], BEVFusion [110], SparseFusion [119], CMT [120], and ISFusion [121]. g.seg represents ground segmentation, which was performed using [26]; meth. denotes the method; mod. is the sensor modality; and L and C denote LiDAR and camera, respectively. Tk., Tl., C.Vh., Ped., Br., T.C., and Mtc. indicate truck, trailer, construction vehicle, pedestrian, barrier, traffic cone, and motorcycle, respectively.
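The NDS values reported above follow the standard nuScenes detection score, which combines mAP with the five true-positive errors listed in the note. The helper below is a generic sketch of that published formula; the example inputs are placeholders, not results from this work.

```python
# Standard nuScenes Detection Score (NDS):
#   NDS = (1/10) * [5 * mAP + sum over {mATE, mASE, mAOE, mAVE, mAAE} of (1 - min(1, error))]
def nuscenes_nds(mAP, mATE, mASE, mAOE, mAVE, mAAE):
    tp_score = sum(1.0 - min(1.0, e) for e in (mATE, mASE, mAOE, mAVE, mAAE))
    return (5.0 * mAP + tp_score) / 10.0

# Hypothetical example; the error values here are placeholders only.
print(round(nuscenes_nds(0.739, 0.27, 0.25, 0.30, 0.25, 0.19), 3))
```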
Table 2. Performance across different backbones (image encoders).
Method | Backbone | mAP (%) | NDS (%) | FPS (Hz)
TransFusion [115] | ResNet-50 | 65.6 | 69.7 | 3.8
BEVFusion [110] | Swin-T | 68.5 | 71.4 | 4.2
SparseFusion [119] | ResNet-50 | 72.8 | 70.4 | 5.6
SparseFusion [119] | Swin-T | 71.0 | 73.1 | 5.3
CMT [120] | VoVNet-99 | 70.3 | 72.9 | 3.8
ISFusion [121] | Swin-T | 72.8 | 74.0 | 3.2
Ours | EffDet-B3 | 72.1 | 74.9 | 11.2
Ours | ResNet-50 | 73.3 | 75.4 | 8.6
Ours | Swin-T | 73.9 | 76.2 | 6.3
Table 3. BEVCAM3D component-wise analysis.
Configuration | Daytime mAP (%) | Daytime NDS (%) | Nighttime mAP (%) | Nighttime NDS (%)
W/o Ground Segmentation | 62.9 | 65.4 | 33.2 | 37.1
Cross-Modality Attention W/o CC | 70.8 | 71.9 | 52.6 | 55.9
Full Model | 73.9 | 76.2 | 62.3 | 63.5
CC indicates the fusion of LiDAR and image features by point-wise concatenation, and W/o means without.
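For clarity, CC as used in this note corresponds to the common pattern of stacking the LiDAR and image BEV feature maps along the channel dimension and mixing them with a small convolution. The sketch below illustrates that pattern under assumed channel sizes; it is not the paper's exact configuration.

```python
# Point-wise concatenation (CC) fusion sketch: camera and LiDAR BEV maps on the
# same grid are concatenated channel-wise and mixed with a 1x1 convolution.
# Channel counts and the single-layer mixer are illustrative assumptions.
import torch
import torch.nn as nn

class ConcatBEVFusion(nn.Module):
    def __init__(self, cam_channels=256, lidar_channels=256, out_channels=256):
        super().__init__()
        self.mix = nn.Conv2d(cam_channels + lidar_channels, out_channels, kernel_size=1)

    def forward(self, cam_bev, lidar_bev):
        # cam_bev: (B, C_cam, H, W); lidar_bev: (B, C_lidar, H, W) on the same BEV grid.
        return self.mix(torch.cat([cam_bev, lidar_bev], dim=1))

fused = ConcatBEVFusion()(torch.randn(1, 256, 128, 128), torch.randn(1, 256, 128, 128))
```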