1. Introduction
Autonomous driving (AD) is a safety-critical task. Multi-modal sensors fitted to self-driving cars, such as cameras, radar, and LiDAR (light detection and ranging), are designed to enhance the accuracy and robustness of AD operations [1,2,3]. The camera captures ambient light, obtaining rich color and material information that, in turn, provides rich semantic information. The millimeter-wave radar transmits and receives electromagnetic waves to obtain sparse orientation, distance, and velocity information from target objects. LiDAR uses lasers for ranging; in AD, a multibeam LiDAR is commonly employed to perform dense ranging of the environment, providing geometric information. To achieve advanced autonomous driving, it is crucial to fully exploit multi-sensor data through fusion methods that integrate information from different sensors.
There are two main challenges facing current multi-sensor fusion approaches in autonomous driving. The first is data heterogeneity: multi-sensor data are generated by sensors with different data representations, modalities of expression (color or geometric), coordinate systems, and levels of sparsity, which poses difficulties for fusion. Most deep-learning-based fusion methods require data to be aligned accurately, both temporally and spatially. Additionally, during feature fusion, multi-source features are obtained at different scales and from different viewpoints, which causes feature blurring and degrades model accuracy [4,5]. The second is dynamic scene adaptation: when one of the modalities is disturbed, such as under adverse weather conditions, misalignment, or sensor failure, model performance can drop significantly [6]. Many fusion methods focus primarily on achieving state-of-the-art benchmarks, which addresses only one aspect of the multi-sensor fusion challenge. An ideal fusion model should be comprehensive: it should not fail regardless of the presence, absence, or integrity of any other modality, and it should achieve improved accuracy when incorporating multi-sensor data.
Facing the challenge posed by the heterogeneity of multi-sensor data, transformer-based methods have gained significant attention in autonomous driving. Transformers establish a connection between spatial information and the features extracted from the front view (camera plane), and they are the state of the art (SOTA) in 3D object detection. For example, DETR3D [7], inspired by methods like DETR [8,9], realizes end-to-end 3D object detection by constructing 3D object queries. BEVFormer [10] implements bird's-eye-view (BEV) space interaction between current and temporal image features through a spatiotemporal transformer, achieving outstanding results in 3D perception tasks. The transformer's impressive performance in monocular image-based 3D detection also allows it to implicitly capture correlations between data from different modalities, which is particularly crucial for multi-sensor data fusion. Furthermore, because image features can be sampled in BEV space, it becomes possible to represent multi-sensor data in a unified space. BEVFusion [5,11] proposes a unified representation of image and point cloud data in BEV space by reconstructing the depth distributions of multi-view images in 3D space through LSS [12] and fusing them with the 3D point cloud data represented in BEV through a residual fusion module. However, BEVFusion suffers from feature blurring in the fusion process caused by depth estimation errors.
Facing the challenge posed by dynamic scenes, CMT [13] introduces a masked-modality training strategy, which improves the robustness of the model by feeding modal-failure data into the network during training. DeepFusion [14] tackles the alignment issue between point cloud and image features by leveraging a global attention mechanism, achieving an implicit feature-level alignment of the point cloud with the image. Other methods [10,13,15], while indirectly forming an implicit alignment between multi-sensor features through reference points, all rely on accurate camera extrinsic parameters when projecting the reference points onto the image features, which does not alleviate the problems caused by misalignment.
To address the challenges above, we propose the adaptive fusion transformer (AFTR), a simple, robust, end-to-end framework for 3D object detection. First, we propose an adaptive spatial cross-attention (ASCA) mechanism. ASCA realizes the implicit association of 3D object queries with spatial multi-sensor features through learnable offsets, and it interacts only with the corresponding features to realize local attention. ASCA avoids the information loss caused by 3D-to-2D feature projection, since it can sample directly in space. Second, we propose a spatial temporal self-attention (STSA) mechanism, which models the displacement caused by ego motion and target motion as learnable offsets. The contributions of the proposed AFTR are as follows:
To the best of our knowledge, the AFTR is the first fusion model that interacts with both 2D and 3D representational features as well as with 3D temporal information.
The AFTR excels at 3D detection tasks through its cross-modal and cross-temporal attention mechanisms, demonstrating SOTA performance on the nuScenes dataset.
The AFTR is the most robust framework compared to existing fusion models; it has the smallest performance drop in the face of misalignment, and better robustness can be achieved via augmented learning using extra noisy data.
Here, we present the organization of the full paper. In Section 2, we first present current frameworks for 3D object detection based on single-sensor data, followed by the current state of the art in multi-sensor data fusion frameworks. In Section 3, we discuss the structure of the proposed AFTR framework in detail. In Section 4, we present the dataset used for the AFTR and the evaluation metrics for 3D object detection, and we describe the experimental setup in detail. In Section 5, we compare the experimental results of the AFTR with those of SOTA methods, illustrate the effects of parameter settings and components on the AFTR through a detailed ablation study, and further test the robustness of the AFTR in dynamic scenes by applying noise to the alignment parameters. In Section 6, we summarize the proposed AFTR with a brief description of its advancements and limitations.
2. Related Works
In this section, we provide an introduction to relevant single-sensor-based (camera-only and LiDAR-only) and fusion-based 3D object detectors. In Section 2.1, we focus on transformer-based camera-only 3D object detectors, while CNN-based methods are described only briefly, for two reasons: (1) in the field of 3D object detection, transformer-based architectures have become dominant and have overtaken CNN-based methods in performance, and (2) the proposed AFTR is a transformer-based framework inspired by both image-based and fusion-based transformer frameworks. In Section 2.2, we present the most relevant and commonly used LiDAR-only 3D object detectors based on different point cloud representations. In Section 2.3, we detail current SOTA transformer-based fusion models.
2.1. Camera-Only 3D Object Detector
In this section, we present only those CNN-based methods referenced later in the paper and focus on transformer-based camera-only 3D detectors.
2.1.1. CNN-Based Method
LSS [12] introduces the lift-splat-shoot paradigm to address bird's-eye-view perception from multi-view cameras. It performs bin-based depth prediction to lift image features into 3D frustums, splats these frustums onto a unified bird's-eye view, and carries out downstream tasks on the resulting BEV feature map. FCOS3D [16] inherits from FCOS [17] and predicts 3D objects by transforming 7-DoF 3D ground truths to the image view.
Since 3D object detection involves depth estimation, CNN-based methods have difficulty modeling planar images in space, which is precisely what transformers excel at. In particular, after BEV-based perception methods were proposed, transformer-based frameworks outperformed CNN-based methods in the field of 3D object detection.
2.1.2. Transformer-Based Method
Benefiting from the fact that transformers can establish a correlation between 3D space and image features, transformer-based camera-only detectors achieve better performance in 3D object detection tasks. These methods can be broadly categorized into object-query-based, BEV-query-based, and BEV-depth-based methods.
DETR3D [7] inherits from DETR [8]: it introduces object queries and generates a 3D reference point for each query. These reference points are used to aggregate multi-view image features as keys and values, and cross-attention is applied between object queries and image features. This approach allows each query to decode a 3D bounding box for object detection. DETR4D [18] adds temporal modeling on top of DETR3D, resulting in better performance. PETR [19] achieves 3D object detection by encoding 3D position embeddings into 2D images to generate 3D position-aware features. PolarFormer [20] proposes a polar cross-attention mechanism based on polar coordinates, which achieves excellent detection performance in BEV. BEVDet [21] extracts features from multi-view images through LSS [12] and a BEV encoder, transforms them into BEV space, and performs 3D object detection. BEVDet4D [22] obtains better results than BEVDet by extending it to fuse BEV features from historical and current timestamps. BEVDepth [23] further optimizes BEVDet and BEVDet4D by supervising depth estimation with camera extrinsic parameters and the point cloud. BEVStereo [24] addresses the blurring and sparsity problems caused by depth estimation in the BEVDet family by improving the temporal multi-view stereo (MVS) technique; the improved MVS can handle complex indoor and outdoor scenes to achieve better 3D detection. BEVFormer [10] and BEVFormerV2 [25] are based on Deformable DETR [26]; they interact with image features by generating reference points in BEV, avoiding the explicit transformation of 2D features into 3D features, and realize robust and efficient 3D object detection. Although transformer-based camera-only frameworks have made breakthroughs in 3D object detection, they still suffer a considerable performance disadvantage compared to point cloud or fusion-based methods that natively obtain 3D geometric information.
2.2. LiDAR-Only 3D Object Detector
In this subsection, we briefly describe the original papers and detectors behind commonly used LiDAR feature extraction methods. Features are usually extracted from point cloud data under three representations: points, voxels, and pillars.
PointNet [27] pioneered feature extraction directly on the raw point cloud with its MLP (multilayer perceptron) layers and max-pooling layers. On this basis, PointNet++ [28] achieves better performance in 3D detection and segmentation tasks by improving local feature extraction.
VoxelNet [29] converts sparse point cloud data into regular stereo grids, which provides the basis for CNN implementations, and SECOND [30] improves the efficiency of feature extraction under the voxel representation by employing a sparse convolution network [31]. Sparse convolution under the voxel representation is currently the most commonly used feature extraction method.
PointPillars [32] extracts pillar features from the point cloud along the vertical direction through PointNet, forming a particular type of regular 2D grid data with channels, which makes 2D CNN methods applicable.
PointVoxel-RCNN (PV-RCNN) [33] achieves better object detection performance by fusing features under two representations (points and voxels).
Although point cloud data natively possess 3D geometric information and perform well in 3D perception, their sparseness makes it difficult to accurately detect occluded, distant, and small targets.
2.3. Fusion-Based 3D Object Detector
F-PointNet [34] and PointPainting [4], two typical sequential result-level fusion models, require accurate image detection frameworks and precise multi-modal sensor calibration, and they are susceptible to the wrong detections, omissions, and misalignment introduced by the image detector. FusionPainting [35] directly fuses the segmentation results of the LiDAR and camera data via adaptive attention, and these are fed into a 3D detector to obtain the results. MVX-Net [36] is a feature-level fusion model that samples and aggregates image features by projecting voxels onto the image plane, and it is also affected by misalignment.
Recently, feature-level fusion models based on transformers have become major players, benefiting from the fact that transformers can establish feature-to-feature relationships, which is important for multi-sensor data fusion. TransFusion [37] uses image features to initialize the object queries; it updates the queries by interacting with LiDAR features, then interacts with the image features and outputs the 3D detection results. DeepFusion [14], by contrast, uses LiDAR features as queries to interact with image features, then updates the output features with LiDAR features and outputs the 3D detection results. DeepInteraction [38] argues that the model should learn and maintain individual modal representations, and it proposes that LiDAR and camera features interact with each other in order to fully learn the features of each modality. BEVFusion [5,11] proposes a simple and efficient framework that predicts the depth distribution of multi-view images using LSS [12], represents the image features in BEV, and subsequently generates fusion features by aggregating the BEV LiDAR features and BEV camera features through a BEV encoder to alleviate feature blurring between multi-sensor features. UVTR [39] avoids the information loss caused by compression into BEV space by representing both the image and the point cloud in voxel space. FUTR3D [15] and CMT [13] generate 3D reference points from object queries, use 3D reference point sampling or interaction with multi-modal features to update the object queries, and then perform 3D object detection through a transformer-based decoder. However, both FUTR3D and CMT use calibration parameters to achieve direct exact matching of multi-sensor data, which is detrimental to robustness.
3. AFTR Architecture
In this paper, we propose the AFTR (adaptive fusion transformer), which implicitly aligns the features of multi-sensor data to achieve more robust 3D object detection results. The AFTR can be divided into four parts, as shown in Figure 1. The AFTR takes the multi-view camera data and LiDAR data as input and extracts features through individual backbones (Section 3.1). At the same time, the fusion queries $Q_f^{t-1}$ of the historical timestamp are also input into the AFTR encoder. The randomly generated 3D object queries $Q$ interact with the features of the multi-sensor data and with the historical information, and they are finally updated into the fusion queries $Q_f^{t}$ of the current timestamp. Then, the fusion queries $Q_f^{t}$ are position-encoded and input into the DETR3D [7] and Deformable DETR [26] transformer decoders (Section 3.4). The fusion queries $Q_f^{t}$ interact with the initialized 3D object queries through layer-by-layer refinement in the transformer decoder, which finally outputs the 3D object detection results. The proposed AFTR has two main components, as shown in Figure 2a: the adaptive spatial cross-attention (ASCA) module (Section 3.2) and the spatial temporal self-attention (STSA) module (Section 3.3). The input data of ASCA comprise the multi-camera features $F_{img}$ and the LiDAR features $F_{pc}$ represented by voxels, and the input data of STSA comprise the 3D representations of the historical-frame fusion queries $Q_f^{t-1}$. Finally, the fusion queries $Q_f^{t}$ are output through the feed-forward module and used for 3D object detection.
3.1. Feature Extraction
The proposed AFTR learns features from multi-view images and the point cloud, and any feature extraction method applicable to images or point clouds can be employed in our framework.
For multi-view images, $I \in \mathbb{R}^{N \times H \times W \times 3}$, where $H$, $W$, and $N$ are the height, width, and number of views of the image, respectively. We follow previous work [7,10,13,15,16] in using ResNet [40] or VoVNet [41] for feature extraction, and we use an FPN [42] to output multi-scale features, denoted as $F_i = \{F_i^1, \ldots, F_i^S\}$ for the $i$-th image view with $S$ scales, where $F_i^s \in \mathbb{R}^{C \times H_s \times W_s}$, $C$ is the channel size of the feature, and $H_s$ and $W_s$ denote the height and width of the $s$-th scale features, respectively.
For the point cloud, we use VoxelNet [29] for feature extraction, and we follow FUTR3D [15] in outputting multi-scale voxel features using an FPN [42]. It should be noted that the point cloud features extracted in our method are represented in 3D space instead of being projected into BEV space [13,15]; the point cloud features can be denoted as $F_{pc}^s \in \mathbb{R}^{C \times X_s \times Y_s \times Z_s}$, where $X_s$, $Y_s$, and $Z_s$ are the sizes of the 3D voxel feature at the $s$-th scale.
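To make the tensor layout concrete, the following is a minimal PyTorch sketch of the feature shapes produced by the two branches. The batch size, number of views, channel width, and per-scale downsampling factors are illustrative assumptions, not values taken from the paper.

```python
import torch

# Assumed sizes: batch B, N = 6 camera views, C = 256 channels, S = 4 scales.
B, N, H, W, C, S = 1, 6, 900, 1600, 256, 4

# Multi-view image features after ResNet/VoVNet + FPN:
# one tensor per scale s, each of shape (B, N, C, H_s, W_s).
image_feats = [torch.randn(B, N, C, H // 2 ** (s + 3), W // 2 ** (s + 3))
               for s in range(S)]

# Multi-scale voxel features after VoxelNet + FPN, kept in 3D space
# (not collapsed to BEV): one tensor per scale, shape (B, C, X_s, Y_s, Z_s).
X, Y, Z = 180, 180, 8
voxel_feats = [torch.randn(B, C, X // 2 ** s, Y // 2 ** s, max(Z // 2 ** s, 1))
               for s in range(S)]

for s in range(S):
    print(s, tuple(image_feats[s].shape), tuple(voxel_feats[s].shape))
```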
3.2. Adaptive Spatial Cross-Attention
Adaptive spatial cross-attention (ASCA) is a critical component of the AFTR; it fuses multi-sensor features while achieving implicit alignment by interacting with multi-view, multi-scale image features and 3D point cloud features through an object-query-based cross-attention mechanism. A schematic diagram of the ASCA module is shown in Figure 2c. The detection head of the AFTR takes a set of object queries $Q$ containing $N_q$ 3D object queries $q$, where each $q$ corresponds to a reference point $p = (x, y, z)$ in real-world 3D space. Considering the handling of multi-scale features, we normalize the 3D reference point coordinates, giving $\hat{p} = (\hat{x}, \hat{y}, \hat{z}) \in [0, 1]^3$. ASCA dynamically updates each query $q$ by interacting with and fusing multi-sensor data features.
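The query setup above can be sketched as follows; the embedding width and the use of a sigmoid over a learnable embedding to keep reference points in $[0,1]^3$ are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class ObjectQueries(nn.Module):
    """N_q learnable 3D object queries, each tied to a normalized reference point."""

    def __init__(self, num_queries: int = 900, embed_dim: int = 256):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, embed_dim)
        # Unconstrained 3D positions; sigmoid maps them into [0, 1]^3 so one
        # reference point can index every feature scale consistently.
        self.ref_points = nn.Embedding(num_queries, 3)

    def forward(self):
        q = self.query_embed.weight           # (N_q, C) query features
        p = self.ref_points.weight.sigmoid()  # (N_q, 3) normalized (x, y, z)
        return q, p

q, p = ObjectQueries()()
print(q.shape, p.shape)  # torch.Size([900, 256]) torch.Size([900, 3])
```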
3.2.1. Interaction with Multi-View Image Features
ASCA adopts the idea of Deformable DETR [26] to produce interactions between the query and multi-sensor data features for two reasons. First, a 3D reference point corresponds to only a few features, while the native attention mechanism [9] requires a query to interact with all features, which results in extreme computational costs; Deformable DETR, by adding offsets, focuses only on query-related features. Second, determining how to locate the reference point in an image is a major challenge. Previous approaches directly project the 3D reference point onto the corresponding image plane using calibration parameters, which is not robust; ASCA instead learns to associate the 3D reference point with the correct features by using offsets to achieve implicit alignment. We follow the hit view $\mathcal{V}_{hit}$ of BEVFormer [10] and project the 3D reference points onto BEV to determine their possible projected views $\mathcal{V}_{hit}$. Ultimately, interaction with the features in $\mathcal{V}_{hit}$ is achieved through ASCA. The adaptive spatial cross-attention process with image features can be formulated as Equation (1):

$$\mathrm{ASCA}_{img}\big(q, \hat{p}, F_{img}\big) = \frac{1}{|\mathcal{V}_{hit}|} \sum_{i \in \mathcal{V}_{hit}} \sum_{s=1}^{S} \mathrm{DeformAttn}\big(q, \mathcal{P}(\hat{p}, i), F_i^s\big) \quad (1)$$
where $q$ is the 3D object query, $S$ denotes the number of scales, $F_i^s$ represents the image feature of the $s$-th scale in the $i$-th view, and $\mathcal{P}(\hat{p}, i)$ is the projection function that transforms the 3D reference point $\hat{p}$ to the $i$-th image plane. $\mathcal{P}(\hat{p}, i)$ can be represented as Equation (2):

$$d_i \cdot \left[u_i,\ v_i,\ 1\right]^{T} = \mathbf{K}_i \cdot \big(R_i \cdot \hat{p} + t_i\big) \quad (2)$$
where $u_i$ and $v_i$ denote the normalized coordinate positions of the width and height in the $i$-th image plane, respectively; $d_i$ is the depth of the pixel, which is not used in our method; $R_i$ and $t_i$ denote the rotation and translation of the LiDAR-to-$i$-th-camera transformation, respectively; and $\mathbf{K}_i$ represents the intrinsic parameters of the $i$-th camera. Following Deformable DETR, the features at the offset locations are computed using bilinear interpolation [43] from the four closest pixels.
In general, ASCA interacts only with the hit-view image features corresponding to the object query, which reduces computation. While ASCA employs camera extrinsic parameters to project 3D reference points onto the image, the projection serves only as a sampling reference; ASCA uses dynamically updated offsets to implicitly align the reference points with the image features, so that the object query interacts only with the related features.
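A minimal sketch of the projection in Equation (2), the hit-view test, and the bilinear sampling step follows. The calibration tensor layout, the point cloud range used to de-normalize reference points, and all function names are assumptions for illustration; in the AFTR these projected locations only seed the learnable offsets.

```python
import torch
import torch.nn.functional as F

def project_to_views(p_norm, pc_range, K, R, t, img_hw):
    """Eq. (2): project normalized 3D reference points onto each camera plane.

    p_norm: (N_q, 3) in [0, 1]; pc_range: (x0, y0, z0, x1, y1, z1) in meters.
    K: (V, 3, 3) intrinsics; R: (V, 3, 3), t: (V, 3) LiDAR-to-camera extrinsics.
    Returns uv in [0, 1], shape (V, N_q, 2), and a hit-view mask (V, N_q).
    """
    lo, hi = p_norm.new_tensor(pc_range[:3]), p_norm.new_tensor(pc_range[3:])
    p = p_norm * (hi - lo) + lo                              # metric LiDAR coords
    cam = torch.einsum('vij,nj->vni', R, p) + t[:, None, :]  # R p + t
    uvd = torch.einsum('vij,vnj->vni', K, cam)               # K (R p + t)
    uv = uvd[..., :2] / uvd[..., 2:3].clamp(min=1e-5)        # divide by depth d
    uv = uv / uv.new_tensor([img_hw[1], img_hw[0]])          # normalize by (W, H)
    hit = (cam[..., 2] > 0) & (uv > 0).all(-1) & (uv < 1).all(-1)
    return uv, hit

def sample_image_features(feat, uv):
    """Bilinear interpolation over the four closest pixels (as in [43])."""
    grid = uv.unsqueeze(2) * 2 - 1                        # (V, N_q, 1, 2) in [-1, 1]
    out = F.grid_sample(feat, grid, align_corners=False)  # feat: (V, C, H_s, W_s)
    return out.squeeze(-1).permute(0, 2, 1)               # (V, N_q, C)
```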
3.2.2. Interaction with Point Cloud Features
Since point cloud features are natively represented in 3D space, indicating the geometric features of an object in real-world space, 3D reference points can interact with point cloud features without projection. However, the point cloud coordinates deviate from the real-world or ego coordinates in two cases: first, when the sensor position is translated or rotated, and second, when there is a delay due to the sampling frequency of the LiDAR. ASCA can learn such deviations to ensure accurate implicit alignment. The adaptive spatial cross-attention process with point cloud features can be formulated as Equation (3):

$$\mathrm{ASCA}_{pc}\big(q, \hat{p}, F_{pc}\big) = \sum_{s=1}^{S} \mathrm{DeformAttn}\big(q, \hat{p}, F_{pc}^s\big) \quad (3)$$
The offsets of the reference point are generated in 3D space; since the point cloud is encoded as stereo grids regularly arranged in space, each offset falls within a certain stereo grid. We express the $s$-th scale point cloud features corresponding to the offsets as Equation (4):

$$F_{pc}^{s}\big(\hat{p} + \Delta p_k^s\big) = F_{pc}^{s}\big(\big\lceil \hat{p} + \Delta p_k^s \big\rceil\big), \quad k = 1, \ldots, K \quad (4)$$

where $\Delta p_k^s$ denotes the $k$-th offset in the $s$-th scale point cloud feature, and $K$ is the number of offsets. We obtain the index of the 3D grid by rounding up the offset.
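Equation (4)'s grid lookup can be sketched as below. Producing the offsets with a linear head on the query and the particular boundary clamping are assumptions; the `ceil` call mirrors the rounding-up step described in the text.

```python
import torch

def sample_voxel_features(voxel_feat, p_norm, offsets):
    """Eq. (4): fetch voxel features at offset positions by rounding up.

    voxel_feat: (C, X, Y, Z), one scale of the 3D point cloud features.
    p_norm: (N_q, 3) normalized reference points.
    offsets: (N_q, K, 3) learnable per-query displacements in voxel units
             (e.g., from an nn.Linear head on the query, as in Deformable DETR).
    """
    C, X, Y, Z = voxel_feat.shape
    size = voxel_feat.new_tensor([X, Y, Z])
    pos = p_norm[:, None, :] * size + offsets     # (N_q, K, 3) voxel coordinates
    idx = pos.ceil().long().clamp(min=0)          # round up to a grid cell index
    idx = torch.minimum(idx, (size - 1).long())   # keep indices inside the grid
    return voxel_feat[:, idx[..., 0], idx[..., 1], idx[..., 2]]  # (C, N_q, K)
```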
3.2.3. Multi-Modal Fusion
After obtaining the results of the query's interaction with the multi-view image and point cloud features, we fuse them and update $q$. First, we concatenate the results of the ASCA interaction with the multi-sensor data and encode them using an MLP network; the process can be described as Equation (5):

$$F_{fused} = \mathrm{MLP}\big(\mathrm{Concat}\big(\mathrm{ASCA}_{img},\ \mathrm{ASCA}_{pc}\big)\big) \quad (5)$$
Finally, we update the object query $q$ using Equation (6):

$$q \leftarrow q + F_{fused} \quad (6)$$
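Equations (5) and (6) reduce to a concatenate-encode-add step. A minimal sketch follows; the two-layer MLP depth and hidden width are assumptions.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Eq. (5)-(6): concatenate both ASCA outputs, encode with an MLP,
    and update the query through a residual connection."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, q, asca_img, asca_pc):
        fused = self.mlp(torch.cat([asca_img, asca_pc], dim=-1))  # Eq. (5)
        return q + fused                                          # Eq. (6)

fuse = MultiModalFusion()
q = fuse(torch.randn(900, 256), torch.randn(900, 256), torch.randn(900, 256))
```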
3.3. Spatial Temporal Self-Attention
The incorporation of temporal information has been demonstrated to be beneficial for camera-only 3D object detection [10,18,22,44], and it remains beneficial in multi-sensor data fusion models.
Using features or queries from historical timestamps rather than the current timestamp introduces two problems: first, the misalignment of the coordinate system due to ego motion and, second, the misalignment of the features or queries due to the motion of objects. BEVDet4D [22], BEVFormer [10], and DETR4D [18] perform the transformation between different timestamps by means of ego-vehicle motion. When facing object motion, BEVFormer predicts the offsets in Deformable DETR [26] from the current-frame queries and aggregates features in historical frames, which makes it challenging to align each object query with its own historical query. DETR4D globally interacts with queries from different timestamps by performing multi-head attention [9] to aggregate relevant features, which induces significant computational costs.
We propose spatial temporal self-attention (STSA), as shown in Figure 2b. Following Deformable DETR [26], STSA realizes the implicit alignment of current and historical object features by sampling and interacting with the historical 3D object queries $Q_f^{t-1}$ and finding the specific queries associated with the current timestamp $t$ by dynamically updating the offsets, which effectively counteracts the misalignment caused by both ego motion and object motion. STSA can be expressed as Equation (7):

$$\mathrm{STSA}\big(q, \hat{p}, Q_f^{t-1}\big) = \mathrm{DeformAttn}\big(q, \hat{p}, Q_f^{t-1}\big) \quad (7)$$

where $\hat{p}$ is the 3D reference point corresponding to the current timestamp object query $q$; notice that the offsets $\Delta p$ are represented in 3D space.
Finally, we update the object query $q$ using Equation (8):

$$q \leftarrow q + \mathrm{STSA}\big(q, \hat{p}, Q_f^{t-1}\big) \quad (8)$$
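A single-head sketch of STSA under simplifying assumptions: the offset and weight heads follow the Deformable DETR recipe, and a nearest-neighbour lookup over the historical reference points stands in for the paper's sampling of $Q_f^{t-1}$, whose exact form Figure 2b would specify.

```python
import torch
import torch.nn as nn

class STSA(nn.Module):
    """Eq. (7)-(8): gather historical queries near learned 3D offset locations
    and aggregate them with learned attention weights (single head, K offsets)."""

    def __init__(self, embed_dim: int = 256, num_offsets: int = 4):
        super().__init__()
        self.K = num_offsets
        self.offset_head = nn.Linear(embed_dim, 3 * num_offsets)  # 3D offsets
        self.weight_head = nn.Linear(embed_dim, num_offsets)
        self.value_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, q, p, q_hist, p_hist):
        # q: (N_q, C) current queries at reference points p: (N_q, 3);
        # q_hist: (N_h, C) historical fusion queries at points p_hist: (N_h, 3).
        off = self.offset_head(q).view(-1, self.K, 3)
        loc = p[:, None, :] + off                        # (N_q, K, 3) sample sites
        near = torch.cdist(loc.reshape(-1, 3), p_hist).argmin(-1)
        v = self.value_proj(q_hist)[near].view(-1, self.K, q.shape[-1])
        w = self.weight_head(q).softmax(dim=-1)          # (N_q, K) attention weights
        return q + (w.unsqueeze(-1) * v).sum(dim=1)      # Eq. (8) residual update
```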
3.4. Detection Head and Loss
We design a learnable end-to-end transformer-based 3D detection head based on the 2D detector Deformable DETR [26], which refines the object queries used for detection through $L$ layers of deformable attention blocks. Specifically, we use the AFTR-generated fusion features as inputs to the decoder to interact with the predefined object queries, update all object queries $q$ at the output of each decoder layer, and predict the updated 3D reference point $\hat{p}$ through a learnable linear projection of the updated $q$ followed by the sigmoid function, as shown in Equation (9):

$$\hat{p} = \sigma\big(\mathrm{Linear}(q)\big) \quad (9)$$
The detector finally predicts the 3D bounding box $\hat{b}$ and classification $\hat{c}$ of the object after two feed-forward network (FFN) layers, which can be expressed as Equation (10):

$$\hat{b} = \mathrm{FFN}_{reg}(q), \qquad \hat{c} = \mathrm{FFN}_{cls}(q) \quad (10)$$
Finally, for the set prediction, the Hungarian algorithm is used to find a bipartite matching between the predictions and the ground truth. We use Gaussian focal loss [45] for classification and L1 loss for 3D bounding box regression, and we represent the total 3D object detection loss as Equation (11):

$$\mathcal{L} = \lambda_{cls}\,\mathcal{L}_{cls}\big(\hat{c}, c\big) + \lambda_{reg}\,\mathcal{L}_{L1}\big(\hat{b}, b\big) \quad (11)$$

where $\lambda_{cls}$ and $\lambda_{reg}$ are the coefficients of the individual costs, and $b$ and $c$ are the ground-truth 3D bounding box and the ground-truth classification of the set, respectively.
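A compact sketch of the per-layer head of Equations (9) and (10) and the loss of Equation (11). The box parameterization (10 values, as is common on nuScenes), the FFN widths, and the loss coefficients are assumptions; `sigmoid_focal_loss` from torchvision stands in for the Gaussian focal loss of [45], and the Hungarian matching is reduced to a precomputed index.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

class DetectionHead(nn.Module):
    """Eq. (9)-(10): refine reference points and predict boxes and classes."""

    def __init__(self, embed_dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.ref_proj = nn.Linear(embed_dim, 3)  # Eq. (9): p = sigmoid(Linear(q))
        self.reg_ffn = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                     nn.Linear(embed_dim, 10))  # box parameters
        self.cls_ffn = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                     nn.Linear(embed_dim, num_classes))

    def forward(self, q):
        p_new = self.ref_proj(q).sigmoid()  # updated 3D reference point
        return p_new, self.reg_ffn(q), self.cls_ffn(q)

def detection_loss(box_pred, cls_pred, box_gt, cls_gt, match,
                   lam_cls=2.0, lam_reg=0.25):
    """Eq. (11) with Hungarian matching reduced to the index tensor `match`."""
    target = F.one_hot(cls_gt, cls_pred.shape[-1]).float()
    loss_cls = sigmoid_focal_loss(cls_pred[match], target, reduction='mean')
    loss_reg = F.l1_loss(box_pred[match], box_gt)
    return lam_cls * loss_cls + lam_reg * loss_reg
```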
6. Conclusions
In this paper, we proposed a transformer-based end-to-end multi-modal fusion 3D object detection framework, named the adaptive fusion transformer (AFTR). The AFTR achieves an implicit alignment of cross-modal and cross-temporal features by adaptively sampling and interacting with multi-sensor data features and temporal information in 3D space via adaptive spatial cross-attention (ASCA) and spatial temporal self-attention (STSA), yielding accurate and efficient 3D object detection. Our experiments on the nuScenes dataset demonstrated that the AFTR achieves better performance by fusing multi-sensor features and improves the detection of occluded and small targets by acquiring temporal information. In addition, when studying the AFTR with respect to the misalignment problem, we found that it is strongly robust to the minor misalignments caused by various factors, benefiting from its ability to adaptively correlate features.
While the proposed AFTR has many advantages, there are still some limitations. First, current transformer-based models are more computationally intensive than CNN-based models; a feasible solution is to reduce the number of queries by making them focus mainly on the foreground. Second, when faced with sensor failures or distorted sensor data, the performance of the default AFTR will degrade and may even fall below that of AFTR-L or AFTR-C trained on data from a single sensor; a possible solution is to incorporate failures and distortions into training to make the model more robust.