1. Introduction
Among the various tasks of scene understanding, object detection is crucial for autonomous driving [1], robotics, and augmented reality. Deep learning-based 2D object detection, which aims to predict the position and category of targets in given images, has made unprecedented achievements in recent years [2]. RGB images provide fine-grained contextual information but lack accurate depth information, so 2D object detection suffers from spatial ambiguity [3]. Recently, extensive research has focused on 3D object detection to estimate the accurate 3D location of targets, benefiting from available point cloud sources.
LIDAR provides spatial and geometric descriptions of the 3D environment in which targets exist, but point clouds lack the texture and color information of RGB images. Therefore, LIDAR-RGB fusion-based 3D object detection exploits the two sensors to compensate for each other's weaknesses and to capture more discriminative object features. However, the two distinct modalities, with their different data formats and properties, make this task challenging. RGB images have an ordered grid structure that has been studied extensively, while point clouds have an unordered and sparse structure. Moreover, the problem of how to correlate the semantic features of images with the geometric features of point clouds is indispensable in the fusion process. In general, the semantic and contextual information of images is extracted in high-level features, while shape and texture information resides in low-level features [4]. These changing encoding characteristics give each stage of LIDAR-RGB fusion a specific demand and cooperation manner. For instance, in the low-level image features, the apparent texture and accurate shape information can more easily match the geometric appearance of objects, whereas in deeper layers the semantic and contextual image features require implicit category-wise geometric information. To solve these problems, existing works mainly rely on cross-modality feature alignment to fuse RGB and LIDAR features.
According to the way multi-modality sensor data are fused, we classify previous works into three categories: (1) early fusion-based methods, (2) late fusion-based methods, and (3) deep fusion-based methods. Early fusion-based methods usually utilize a separate perception algorithm to process the multi-modality raw sensor data. However, they require precise alignment of the data: if the raw sensor data are not well aligned in the early stage, the resulting feature dislocation leads to heavy performance degradation. Relying on the coordinate relationship between the two sensors, PointPainting [5] and PI-RCNN [6] project image semantic segmentation into the point cloud space through a projection matrix. Although this early fusion enables the network to handle the aligned two-modality information as a whole without modality-specific adjustment, it also conveys the noise of one modality to the other. This noise is unavoidably aligned and combined with the discriminative features of objects, significantly damaging their prominence.
Late fusion-based methods fuse the processed features only at the decision level, because the spatial and modal difference between the point cloud and the image is greatly reduced at this stage. MV3D [7], AVOD [3], and CLOCs [8] extract point cloud and image features through independent modules and fuse them at the decision-making layer. However, fusion at the decision-making layer has little effect on the fusion of raw data information, and the confidence scores of the proposals generated by the two modules are not correlated. Among deep fusion-based methods, 3D-CVF [9] and MMF [10] adopt separate feature extractors for LIDAR and image and fuse the two modalities hierarchically and semantically, finally achieving a semantic fusion of multi-scale information. However, these methods struggle with the differences in data formats and sensor positions. Moreover, 3D-CVF lacks continuous feature fusion during feature extraction, which results in insufficient feature fusion, and MMF only utilizes the sparse depth map projected from the point cloud, which weakens the influence of the point cloud data on anchor generation.
To address these challenges, we observe that it is hard to align two-modality features in a single shot. As mentioned above, features with different characteristics require corresponding features from the other modality, but this demand is unknown to a hand-crafted fusion design. Moreover, the encoders of RGB and LIDAR have a dynamic appetite for specific features during optimization; for example, the contextual features of images lie in the deeper layers while low-level features lie in the early layers, and which layers are suitable for feature alignment changes as training proceeds. Therefore, it is more reasonable to build a dynamic multi-modal fusion method. In this paper, we propose a cascaded cross-modality fusion network (CCFNet) for LIDAR-RGB fusion-based 3D object detection to address the above challenges. CCFNet establishes a dynamic alignment manner by letting each stage choose specific salient features from previous stages. It mainly consists of a cascaded multi-scale fusion module (CMF) and a novel center 3D IoU loss.
In order to build a dynamically aligned network, we insert a cascaded multi-scale fusion module (CMF) between each stage of the LIDAR and RGB streams. CMF collects point cloud features from adjacent stages and aligns them with image features. By processing CMF in a cascaded way, the alignment in each stage can adaptively select specific point cloud features from previous stages to meet its demand. Besides, as pointed out in [11], the traditional IoU loss has a plateau that makes it infeasible to optimize non-overlapping bounding boxes, a problem that is much more severe in 3D. These non-overlapping anchors are still useful for giving aligned RoIs a rough location sensitivity, i.e., guiding the RPN to generate anchors close to the true bounding boxes. In this work, we advocate a novel center 3D IoU loss to exploit this benefit of non-overlapping bounding boxes. By introducing a measurement of the distance between the anchor center and the ground truth center, our center 3D IoU loss is able to decrease the probability of unreasonable anchors.
Our contributions can be summarized as follows:
(1) We propose a novel cascaded approach to fuse and align LIDAR-RGB information. It contains multiple residual operations that back-propagate the alignment guidance gradient to earlier parts of the encoder, so that informative point cloud features can be selected.
(2) In order to make use of non-overlapping bounding boxes, we propose a novel center 3D IoU loss that makes the model sensitive to the location of generated anchors.
(3) Our approach achieves better performance on the KITTI benchmark and performs favorably against existing methods.
3. Our Approach
In this paper, we propose a cascaded cross-modality fusion network (CCFNet) for LIDAR-RGB fusion-based 3D object detection. As shown in Figure 1, the features of LIDAR and RGB images are extracted by two separate streams. We use ResNet50 and four set abstraction modules of PointNet as the feature extractors of the RGB images and LIDAR, respectively. Between each stage of the two streams, we insert our cascaded multi-scale fusion module (CMF) to connect and fuse the image features and LIDAR features that share the same downsampling ratio. Finally, the outputs of the two streams are concatenated and sent to the detection head. We also describe the components of the total training loss, including our novel center 3D IoU loss and the spatial setting of anchors and targets.
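For orientation, the following is a minimal, non-authoritative sketch of this two-stream layout in PyTorch; every stage module is a stub (`nn.Identity`), real extractors would be ResNet50 stages and PointNet set abstraction layers, and the CMF fusion and projection steps of Section 3.1 are only indicated by comments.

```python
import torch
import torch.nn as nn

class CCFNetSketch(nn.Module):
    """Schematic of the two-stream layout; every sub-module is a stub."""

    def __init__(self, num_stages: int = 4):
        super().__init__()
        self.rgb_stages = nn.ModuleList(nn.Identity() for _ in range(num_stages))  # e.g. ResNet50 stages
        self.pc_stages = nn.ModuleList(nn.Identity() for _ in range(num_stages))   # e.g. PointNet set abstraction

    def forward(self, rgb_feat: torch.Tensor, pc_feat: torch.Tensor) -> torch.Tensor:
        prev_pc_feat = pc_feat
        for rgb_stage, pc_stage in zip(self.rgb_stages, self.pc_stages):
            rgb_feat = rgb_stage(rgb_feat)
            pc_feat = pc_stage(pc_feat)
            # A CMF module would fuse prev_pc_feat with pc_feat here and project the
            # fused point features onto rgb_feat at the matching downsampling ratio.
            prev_pc_feat = pc_feat
        # The outputs of the two streams are combined and sent to the detection head.
        return torch.cat([rgb_feat.flatten(1), pc_feat.flatten(1)], dim=1)
```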
3.1. Cascaded Multi-Scale Fusion Module
We tackle the optimization of stage-wise LIDAR-RGB feature fusion by building a cascaded structure in which the image features of each stage can access the point cloud information from previous stages. In this way, the image features can dynamically select suitable multi-scale point cloud features from different stages, which improves the efficiency of LIDAR-RGB feature fusion.
Let $i$ denote the stage index of the LIDAR and RGB streams, and let $F_i^m \in \mathbb{R}^{H_i \times W_i \times C_i}$ and $F_i^p \in \mathbb{R}^{N_i \times D_i}$ be the RGB and LIDAR feature maps at the $i$-th stage, where $H_i$, $W_i$ and $C_i$ denote the height, width and channel dimension of the image feature, and $N_i$ and $D_i$ are the number of points and the channel dimension of the LIDAR feature, respectively. Our CMF has two main procedures, i.e., multi-scale fusion and LIDAR-RGB projecting. Since the CMF modules in different stages differ only slightly, we first introduce the general mechanism of the CMF module and then describe its actual implementation in different stages.
As shown in Figure 2, the CMF module has two inputs, i.e., the point cloud feature $F_{i-1}^p$ from the previous stage and $F_i^p$ from the current stage. We first select the points with salient features in $F_{i-1}^p$, which has $N_{i-1}$ points, and then fuse them with $F_i^p$. In detail, we first apply a scoring function to highlight the category characteristics of the features and then select the $K$ points with the largest values from the point set of $F_{i-1}^p$. According to the indices of the selected points, the selected point features $F_{sel}$ can easily be gathered. $F_{sel}$ is then modulated by the global vector obtained from $F_i^p$ through global average pooling, which generates $\tilde{F}_{sel}$. Finally, a fusion layer is applied after concatenating $\tilde{F}_{sel}$ and $F_i^p$ along the $N$ dimension. We call this process multi-scale fusion, which can be represented as follows:
$$ F_i^{fuse} = \Phi\Big(\mathrm{Concat}\big[\mathrm{Tile}\big(\mathrm{GAP}(F_i^p)\big) \odot \mathrm{Select}_K\big(F_{i-1}^p\big),\; F_i^p\big]\Big), $$
where $\mathrm{Select}_K$ denotes the step of choosing the $K$ points with salient features from $F_{i-1}^p$, $\mathrm{GAP}$ is global average pooling, $\mathrm{Tile}$ is the function of tiling the pooled vector along the $N$ dimension to generate a tensor whose resolution is the same as $F_{sel}$, $\odot$ denotes element-wise multiplication, and $\Phi$ is the fusion layer.
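To make the procedure concrete, below is a minimal PyTorch sketch of this multi-scale fusion step under the notation assumed above (`f_prev` standing in for $F_{i-1}^p$, `f_cur` for $F_i^p$). The linear scoring function, the channel-alignment layer, the value of $K$, and the fusion MLP are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    def __init__(self, c_prev: int, c_cur: int, k: int):
        super().__init__()
        self.k = k
        self.score = nn.Linear(c_prev, 1)          # per-point saliency score (assumed form)
        self.align = nn.Linear(c_prev, c_cur)      # match channel dimensions before gating
        self.fuse = nn.Sequential(nn.Linear(c_cur, c_cur), nn.ReLU(inplace=True))

    def forward(self, f_prev: torch.Tensor, f_cur: torch.Tensor):
        # f_prev: (B, N_prev, c_prev) from stage i-1; f_cur: (B, N_cur, c_cur) from stage i.
        scores = self.score(f_prev).squeeze(-1)                    # (B, N_prev)
        idx = scores.topk(self.k, dim=1).indices                   # indices of the K salient points
        sel = torch.gather(f_prev, 1, idx.unsqueeze(-1).expand(-1, -1, f_prev.size(-1)))
        sel = self.align(sel)                                      # (B, K, c_cur)
        # Gate the selected features with the global vector of the current stage
        # (global average pooling over the point dimension; broadcasting tiles it along N).
        gate = f_cur.mean(dim=1, keepdim=True)                     # (B, 1, c_cur)
        sel = sel * gate
        # Concatenate along the point (N) dimension and apply the fusion layer.
        fused = self.fuse(torch.cat([sel, f_cur], dim=1))          # (B, K + N_cur, c_cur)
        return fused, idx                                          # idx also drives the projection step
```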
Besides, we project the fused point cloud features $F_i^{fuse}$ to image space through the LIDAR-RGB projecting procedure. Specifically, we utilize the principle of spatial perspective to project LIDAR points onto the image. Each position $(x_a, y_a)$ of a point $a$ belonging to the point set of $F_i^{fuse}$ should be multiplied by the image size, since the coordinates along the $X$ and $Y$ axes have been normalized to $[0, 1]$. Therefore, the corresponding position $(x_a^m, y_a^m)$ of point $a$ in the image, where the superscript $m$ denotes the image domain, can be calculated by:
$$ x_a^m = x_a \cdot W_i, \qquad y_a^m = y_a \cdot H_i. $$
Then, $F_i^{fuse}$ is transmitted to the next stage as one of the inputs of the next CMF module.
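As a small illustration, the following function sketches this projection of normalized point coordinates onto an image feature map of height $H_i$ and width $W_i$; the clamping and integer conversion are added assumptions so that the result can be used for indexing.

```python
import torch

def project_points_to_image(xy_norm: torch.Tensor, feat_h: int, feat_w: int) -> torch.Tensor:
    """xy_norm: (N, 2) point coordinates normalized to [0, 1]; returns integer pixel positions."""
    x_img = (xy_norm[:, 0] * feat_w).clamp(0, feat_w - 1).long()
    y_img = (xy_norm[:, 1] * feat_h).clamp(0, feat_h - 1).long()
    return torch.stack([x_img, y_img], dim=1)  # (N, 2) positions in the image domain
```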
For the CMF module in the first stage, we directly project $F_1^p$ to the image space of the first-stage image feature. For the CMF module in the second stage, we follow the full process described above. However, in the third and fourth stages, the standard CMF process would lead to biased saturation of some points, since the features inherited from previous stages contain numerous repeated points. If the selection process is not regulated, these repeated points would account for most of the point candidates with salient features, and this phenomenon becomes more severe as the stage goes deeper. Therefore, we design a regulation algorithm to avoid this over-saturation problem, as shown in Algorithm 1. The whole procedure of CMF in the third and fourth stages is also shown in Figure 2.
Algorithm 1 Regulation algorithm of the CMF module to avoid over-saturation of repeated points among the point candidates.
Input: point cloud features $F_{i-1}^p$ from the previous stage and the point set of $F_i^p$. Output: regulated point cloud features.
1: Count the repeated point set $R$ and the unrepeated point set $U$ among the candidate points of $F_{i-1}^p$.
2: if the number of unrepeated points in $U$ is larger than the required number of candidates then
3: Randomly select the candidate points from $U$ and gather the corresponding features.
4: else
5: Randomly select a regulated number of points from $R$ to build a new repeated point set.
6: Randomly select the remaining points from $U$ and collect the corresponding features.
7: end if
8: return the regulated point cloud features.
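The sketch below illustrates the regulation idea under the reading given above: unrepeated points are preferred, and repeated points are admitted only to fill the remaining slots. The index-set interface and the sampling rule are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def regulate_candidates(repeated_idx: torch.Tensor,
                        unrepeated_idx: torch.Tensor,
                        k: int) -> torch.Tensor:
    """Return (at most) k candidate point indices with a regulated share of repeats."""
    if unrepeated_idx.numel() >= k:
        # Enough unrepeated points: fill the candidate set from them alone.
        perm = torch.randperm(unrepeated_idx.numel())[:k]
        return unrepeated_idx[perm]
    # Otherwise admit a regulated number of repeated points to fill the gap.
    n_rep = min(k - unrepeated_idx.numel(), repeated_idx.numel())
    rep_perm = torch.randperm(repeated_idx.numel())[:n_rep]
    return torch.cat([unrepeated_idx, repeated_idx[rep_perm]])
```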
3.2. Center 3D IoU Loss
The 3D object detector generates abundant anchors to predict the true bounding boxes, and their number is much larger than that of the ground truth boxes. However, due to the definition of IoU and the corresponding traditional IoU loss, only anchors that overlap with the ground truth contribute to the optimization, while anchors with no overlap are simply penalized and contribute nothing. As shown in Figure 3a, the anchor and the bounding box do not overlap, so the IoU score is 0 and no training gradient is back-propagated. The positive anchors account for only a small part of all generated anchors, resulting in limited benefit relative to the heavy computational cost. Nevertheless, these non-overlapping anchors are still meaningful. We believe that, even without overlap, anchors near the ground truth bounding boxes are more useful than those far away. Such non-overlapping anchors provide a constraint that helps the RPN build position sensitivity, enabling it to generate anchors close to the true bounding boxes. Moreover, this distance information can correct the inconsistency between the loss and the quality of the obtained bounding boxes. As shown in Figure 3b, the two cases have the same IoU score, but the left case is clearly better than the right one: two axes are aligned in the left case, whereas only one axis is aligned in the right.
Inspired by this observation, we introduce our center 3D IoU loss to address the aforementioned limitations. We define a 3D anchor as $(x, y, z, h, w, d, \theta)$, where $(x, y, z)$ is the 3D coordinate of the anchor (or bounding box) center, $h$, $w$ and $d$ are the height, width and depth of the 3D anchor, and $\theta$ is the rotation angle of the anchor around the $Z$ axis. In order to calculate the IoU score between the predicted anchor and the ground truth bounding box, denoted by the superscripts $p$ and $g$, we first obtain the whole 3D volumes of the predicted anchor $V^p$ and the ground truth bounding box $V^g$, their overlapped volume $V^o$, and the volume of the smallest enclosing convex region $V^c$. Therefore, the 3D IoU can be formulated as:
$$ \mathrm{IoU}_{3D} = \frac{V^o}{V^p + V^g - V^o} - \frac{V^c - (V^p + V^g - V^o)}{V^c}. $$
In order to correct the inconsistency between the loss and the quality of the obtained bounding boxes illustrated in Figure 3, we add the center distance term $D_c$ between the anchor and the ground truth bounding box:
$$ D_c = \frac{(x^p - x^g)^2 + (y^p - y^g)^2 + (z^p - z^g)^2}{\rho^2}, $$
where $\rho$ is the diagonal length of the smallest enclosing region. Our center 3D IoU loss consists of these two parts and is formulated as follows:
$$ L_{c3DIoU} = 1 - \mathrm{IoU}_{3D} + D_c. $$
This center 3D IoU loss optimizes two IoU-related metrics: the overlapped volume and the distance from the anchor center to the ground truth center. It helps the anchor generator become sensitive to the position of the ground truth bounding box, since even without overlap, an anchor near the ground truth reduces the loss. As a result, the NMS process retains more accurate bounding boxes. The quantitative results and analysis in Section 4.4 indicate the effectiveness of our center 3D IoU loss in improving 3D detection performance.
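For illustration, here is a simplified, axis-aligned PyTorch sketch of such a loss; rotation around the $Z$ axis is ignored to keep the example short, and the box layout (center plus per-axis sizes), the axis-aligned enclosing region, and the normalization by its diagonal follow the reconstruction above rather than a reference implementation.

```python
import torch

def center_3d_iou_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """pred, gt: (..., 6) tensors of (cx, cy, cz, sx, sy, sz); rotation is ignored."""
    pc, ps = pred[..., :3], pred[..., 3:6]
    gc, gs = gt[..., :3], gt[..., 3:6]
    p_min, p_max = pc - ps / 2, pc + ps / 2
    g_min, g_max = gc - gs / 2, gc + gs / 2
    # Overlapped volume V_o and union volume V_p + V_g - V_o.
    inter = (torch.min(p_max, g_max) - torch.max(p_min, g_min)).clamp(min=0).prod(-1)
    union = (ps.prod(-1) + gs.prod(-1) - inter).clamp(min=1e-6)
    # Smallest enclosing axis-aligned region (stand-in for the convex region V_c).
    enc = torch.max(p_max, g_max) - torch.min(p_min, g_min)
    enc_vol = enc.prod(-1).clamp(min=1e-6)
    iou3d = inter / union - (enc_vol - union) / enc_vol
    # Center-distance term normalized by the squared diagonal of the enclosing region.
    center_term = (pc - gc).pow(2).sum(-1) / enc.pow(2).sum(-1).clamp(min=1e-6)
    # Loss decreases as overlap grows and as the anchor center approaches the GT center.
    return (1.0 - iou3d + center_term).mean()
```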
3.3. 3D Region Proposal Network
We use a multibox SSD-like [31] RPN as the detection head of our CCFNet. The input of the 3D RPN is the feature map generated by combining the LIDAR and image features. Specifically, the RPN consists of three stages, each composed of several convolutional layers followed by a downsampling convolutional layer. The outputs of the three stages are upsampled to a fixed size and concatenated into one feature map. Finally, the concatenated feature map is sent to three convolutional layers for classification and orientation estimation.
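A compact PyTorch sketch of this RPN layout is given below; the channel widths, number of layers per stage, anchor count, and the extra box-regression head are illustrative assumptions, not the exact configuration.

```python
import torch
import torch.nn as nn

class RPNSketch(nn.Module):
    def __init__(self, c_in: int = 128, c_stage=(128, 128, 256), num_anchors: int = 2):
        super().__init__()
        self.stages, self.ups = nn.ModuleList(), nn.ModuleList()
        c_prev = c_in
        for i, c in enumerate(c_stage):
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_prev, c, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # downsample
            ))
            # Upsample each stage output back to the input resolution before concatenation.
            self.ups.append(nn.ConvTranspose2d(c, 128, 2 ** (i + 1), stride=2 ** (i + 1)))
            c_prev = c
        self.cls_head = nn.Conv2d(128 * len(c_stage), num_anchors, 1)        # classification
        self.reg_head = nn.Conv2d(128 * len(c_stage), num_anchors * 7, 1)    # box regression (assumed)
        self.dir_head = nn.Conv2d(128 * len(c_stage), num_anchors * 2, 1)    # orientation estimation

    def forward(self, x: torch.Tensor):
        outs = []
        for stage, up in zip(self.stages, self.ups):
            x = stage(x)
            outs.append(up(x))
        feat = torch.cat(outs, dim=1)
        return self.cls_head(feat), self.reg_head(feat), self.dir_head(feat)
```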