1. Introduction
With the development of deep learning, computer vision has advanced rapidly in several areas, such as image object detection [1,2] and image segmentation [3]. Object detection using cameras has found widespread application in various fields. RGB images offer the advantages of low acquisition cost, high image resolution, and the inclusion of semantic information such as object color and texture. However, they are susceptible to environmental influences, such as weather and lighting conditions, and lack depth information. With the development of remote sensing technology, LiDAR, a remote sensing instrument, has been widely used by researchers to capture data; for example, 3D point clouds can be acquired from LiDAR sensors. These 3D point cloud data provide accurate geometric information, are widely used in tracking and reconstruction, and have also attracted attention in object detection.
The 3D point cloud collected by LiDAR includes spatial coordinates (X, Y, Z) and reflection intensity, offering high detection accuracy and providing precise scene information for 3D object detection. Standard outdoor datasets include KITTI, nuScenes, and Waymo.
Although point cloud data offer various advantages, they also have limitations. When objects are far away or heavily occluded, the point cloud becomes sparse, leading to unclear object representations and making detection difficult.
Figure 1 illustrates this problem by demonstrating ambiguous objects in point cloud data. The input point cloud data are visualized in (a), with green boxes representing the 3D ground truth box projections, while (b) shows the points representing objects in the input point cloud. The smallest ground truth object contains only 20 points. It can be observed that points in distant regions are quite sparse, and their shapes are difficult to recognize. Although increasing the number of LiDAR scan lines can alleviate this problem, it significantly raises the hardware cost. A Velodyne 64 costs USD 80,000 and emits 64 laser beams, while a Velodyne 32 costs only USD 20,000 but emits only 32 beams, which leads to a severe sparsity problem.
As shown in Figure 2, green boxes denote the ground truth boxes, and red boxes denote prediction boxes. The single-frame detector fails to detect the objects and generates many false predictions in distant areas. Figure 2a,b are two adjacent frames. As can be seen, the failure to detect objects in the previous frame is repeated in the following frame. In both cases, the detector missed the two objects at the top of the point cloud while producing multiple false positives.
Using multiple-frame point clouds can effectively compensate for the missing information. Multiple-frame point clouds, also called spatio-temporal data, are used in several fields; for example, 4D dynamic scenes can be reconstructed from spatio-temporal data [5,6]. In the detection field, using multiple-frame point clouds may alleviate the sparsity problem in 3D object detection. Although the above example shows detection failing in two consecutive frames, the two frames are processed independently, without combining their data to improve detection. With a proper fusion scheme, using multiple point cloud frames can be similar to using a denser-line LiDAR. One intuitive approach is concatenating the points at input time, that is, aligning multiple frames of point clouds into a single scene for input. Besl [7] proposed the classical iterative closest point (ICP) algorithm, which laid the foundation for point cloud registration. This method uses the sum of Euclidean distances between all points of two point clouds as the matching cost and iteratively searches until the matching cost is minimized; the transformation matrix between the two point clouds is then computed. We conducted a simple experiment using the point concatenation method, as shown in Figure 3, using ICP [7] to align the point clouds. However, this alignment approach has its drawbacks. It requires many iterations, resulting in a long computation time, and a suitable initial position must be provided. As we are dealing with large outdoor datasets, most objects are moving and have different velocities. The movement poses further challenges to registration, which is usually restricted to stationary scenes. Despite the increased point density, shadows appear on some objects. When magnifying the point cloud representing small objects in Figure 3b, it can be observed that the alignment effect is unsatisfactory, leading to shadows on the small objects [8,9,10].
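The ICP loop described above can be written compactly; the following is a minimal numpy sketch of the point-to-point variant (brute-force nearest neighbors, no initial-pose handling), not the exact implementation used in our experiment:

```python
import numpy as np

def icp(src, dst, iters=50, tol=1e-8):
    """Minimal point-to-point ICP: returns (R, t) such that src @ R.T + t ~ dst."""
    R, t = np.eye(3), np.zeros(3)
    cur = src.copy()
    prev_err = np.inf
    for _ in range(iters):
        # 1) correspondences: brute-force nearest neighbor in dst
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        idx = d2.argmin(axis=1)
        matched = dst[idx]
        # 2) closed-form rigid transform via SVD (Kabsch algorithm)
        mu_c, mu_m = cur.mean(0), matched.mean(0)
        H = (cur - mu_c).T @ (matched - mu_m)
        U, _, Vt = np.linalg.svd(H)
        S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R_step = Vt.T @ S @ U.T
        t_step = mu_m - R_step @ mu_c
        # 3) apply the step and accumulate the total transform
        cur = cur @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
        # stop when the matching cost no longer decreases
        err = np.sqrt(d2[np.arange(len(cur)), idx]).mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R, t
```

The iterative correspondence search is exactly the cost of this method: each pass recomputes all pairwise distances, which is why registration of large outdoor scenes is slow and why a good initial pose matters.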
Besides registration, other approaches have been proposed for multiple-frame point clouds, such as Long Short-Term Memory (LSTM) [11] and concatenation [12,13]. However, these methods require intensive computation and suffer from the shadow problem. In this study, we propose a novel multi-frame object detection method based on fusing proposal features, called proposal features fusion (PFF). The proposed method introduces an attention mechanism for feature-level fusion. Using an anchor-based detector [4], a region proposal network (RPN) generates proposals for multiple frames. Cosine similarity is then used to associate proposal features between adjacent frames. We further propose an Attention-Weighted Fusion (AWF) module that adaptively adjusts and integrates the associated proposal features from different frames.
We summarize our contributions as follows:
A feature-level fusion method is proposed by fusing the extracted features from proposals of previous frames to the current frame. The feature-level fusion can improve detection performance while ensuring computational efficiency.
We apply an attention module in feature fusion to make the model robust and flexible. The proposed Attention-Weighted Fusion (AWF) module is shown to play an important role in suppressing unimportant information and enhancing key features.
The KITTI dataset is used for the ablation study to demonstrate the effectiveness of the proposed method. The nuScenes dataset is further used to compare the proposed method with other multiple-frame point cloud methods in the literature. The comparison shows that our method outperforms the conventional multi-frame method by 6.64% mAP.
2. Related Work
Single-frame point cloud object detection methods can be roughly divided into two categories: point-based [14,15,16,17] and voxel-based methods [4,18,19]. Since point clouds are obtained from LiDAR scans and only contain the surface information of objects, the distances and spatial distributions between points are non-uniform. Point clouds also exhibit sparsity and disorder. PointNet [20] and the subsequent PointNet++ [21] use Farthest Point Sampling (FPS) to sample non-uniform points while preserving the shape of the point cloud, and introduce max pooling to address the disorder of point clouds. PointNet++ has been widely used as a backbone network. F-PointNet [14] employs a two-dimensional detector to generate candidate boxes and then combines these 2D bounding boxes with depth information to form three-dimensional frustums. PointNet is then used to encode the point clouds within the frustums and generate the 3D object detection results. PointRCNN utilizes PointNet++ as the backbone network and proposes a two-stage network that refines proposal boxes to achieve good detection results. Gao et al. [22] propose a dynamic clustering algorithm using elliptic functions, since point cloud data have a non-uniform distribution. SASA [23] introduces S-FPS, an improved sampling method for small objects, to sample point clouds in the feature layer.
Another point cloud encoding approach is the voxel-based method, which divides the point cloud into 3D voxels. VoxelNet [18] proposes an end-to-end network that divides the point cloud into voxels and then applies a voxel feature encoder (VFE) to combine the features of individual points within each voxel with global features. A 3D CNN is then employed to predict and regress the object’s bounding box. SECOND [4] uses 3D sparse convolution networks [24] to accelerate 3D voxel processing. VoxelRCNN [25] proposes voxel region of interest (ROI) pooling to optimize the features within the ROI. CenterPoint [26] utilizes a voxel-based method for point cloud encoding and introduces an anchor-free 3D box regression method for bounding boxes.
In Ref. [27], a combination method is proposed that uses both multi-scale voxel features and keypoints. Ref. [28] uses both RGB and point cloud information by extracting 3D proposal boxes in the Bird’s Eye View (BEV) and projecting them onto the RGB image to obtain more features. The performance of single-frame detection is unsatisfactory due to the sparsity and occlusion in single-frame data. With the release of multiple-frame datasets [29,30], exploring how to utilize multiple frames has become a research topic in recent years.
In order to leverage multiple-frame point cloud data effectively, several branches of studies have been proposed. Ref. [31] divides multiple-frame point cloud studies into two branches, the data branch and the model branch, and classifies its own work as a data-based approach. Ref. [31] proposes a data augmentation method and achieves 0.7 mAP on the nuScenes dataset. Ref. [32] focuses on false negatives by using heatmap prediction to excavate hard samples and omitting the training of easy positive candidates.
Some studies [11,33] use the LSTM network to leverage spatio-temporal information in point cloud sequences. Yolo4D [33] utilizes Yolo3D [34] as the backbone network and integrates contextual information using a Recurrent Neural Network (RNN). It first employs a CNN to extract information from each frame and then feeds it into the LSTM to incorporate historical information. FaF [12] uses aligned frames as inputs and employs a 3D CNN to extract features from the aligned data. However, pre-aligning multiple frames of point clouds increases processing time and computational complexity. WYSIWYG [13] concatenates different frames into a single frame to expand the visible area, enabling a broader perspective in the detection process. Another method, 3DVID [35], explores spatial–temporal correlations between multiple point cloud frames by using a Spatial Transformer Attention (STA) module to suppress background noise and emphasize objects, and a Temporal Transformer Attention (TTA) module to correlate moving objects between frames.
3. Methods
The framework of the 3D object detection method based on multiple-frame fusion is shown in Figure 4. We use the LiDAR point cloud as the input and adopt a two-stage detection framework: a region proposal network (RPN) and a proposal refinement network. In the preprocessing stage, we use a voxel feature encoder (VFE) to encode the input point cloud data as voxels. In the RPN stage, we extract features from the voxels and generate predictions according to the anchor features to obtain high-quality 3D proposals. Then, non-maximum suppression (NMS) is used to select candidate proposals for the proposal refinement stage. We associate and merge the features of 3D proposals from consecutive frames: cosine similarity is used to associate proposals in consecutive frames, and the AWF module adaptively adjusts the features from matched proposals. Based on the fusion results, bounding box classification and regression determine the object category, size, and location.
3.1. RPN Stage
Point cloud data form a disordered 3D point set. A point cloud is divided into 3D grids (voxels) in preprocessing. Given an input point cloud with depth, height, and width of (D, H, W) and a predefined voxel size of (v_D, v_H, v_W), the entire input point cloud is divided into D/v_D × H/v_H × W/v_W voxels along the coordinate axes. These voxels are then encoded to generate features and extract multi-scale 3D features. Subsequently, compression is applied along the z-axis to obtain a pseudo-2D feature map. In this process, the network only processes non-empty voxels to speed up feature extraction.
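The voxel partitioning can be sketched as follows; the range and voxel-size values in the test are arbitrary illustrations, not the settings used in this work:

```python
import numpy as np

def voxelize(points, pc_range, voxel_size):
    """Assign each point to a voxel index; drop points outside the range.

    points: (N, 3 + C) array, first three columns are (x, y, z).
    pc_range: [x_min, y_min, z_min, x_max, y_max, z_max].
    voxel_size: (v_x, v_y, v_z).
    """
    pmin = np.array(pc_range[:3], dtype=float)
    pmax = np.array(pc_range[3:], dtype=float)
    vs = np.array(voxel_size, dtype=float)
    # keep only points inside the detection range
    mask = np.all((points[:, :3] >= pmin) & (points[:, :3] < pmax), axis=1)
    pts = points[mask]
    # integer voxel index of each point along each axis
    idx = ((pts[:, :3] - pmin) / vs).astype(np.int64)
    # number of voxels along each axis, e.g. D/v_D x H/v_H x W/v_W
    grid = np.floor((pmax - pmin) / vs).astype(np.int64)
    return pts, idx, grid
```

A sparse encoder would then group points by `idx` and process only the non-empty voxels, which is what makes this stage fast.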
The input voxelized point cloud has shape (batch size, 3 + C, N), where N is the number of points and C represents the number of additional information channels apart from the (x, y, z) coordinates, usually including reflection intensity, time stamp, etc. The output features are feature maps stacked along the z-axis direction. This network consists of two components: sparse convolution (spconv) and submanifold convolution (subconv).
Table 1 shows the structure of the voxel feature encoding layers. Spconv is composed of a 3 × 3 × 3 convolution with a stride of 2, followed by BatchNorm and ReLU activation, and is used for downsampling. Subconv involves a 3 × 3 × 3 convolution with a stride of 1, followed by BatchNorm and ReLU activation, and is used for feature extraction. Notably, only eight-times downsampling is applied in the h and w directions. The last spconv layer has a stride of 2 in the z-axis direction. The final output features are represented by (batch size, channel, z, h, w), and a height compression operation stacks the z-axis and channel dimensions, resulting in a pseudo-2D feature map whose shape is (batch size, channel × z, h, w).
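The height compression step amounts to a reshape that folds the z dimension into the channel dimension; the tensor sizes below are illustrative only, not the actual network dimensions:

```python
import numpy as np

# (batch, channel, z, h, w) feature volume from the 3D backbone;
# the sizes here are made up for illustration.
feat = np.zeros((1, 64, 2, 200, 176))
b, c, z, h, w = feat.shape
# fold the z axis into the channel axis to get the pseudo-2D BEV map
pseudo_2d = feat.reshape(b, c * z, h, w)
print(pseudo_2d.shape)  # (1, 128, 200, 176)
```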
After obtaining the pseudo-2D feature map, two separate branches perform 2D convolutions with a kernel size of 3 × 3. Both conv2d and deconv operations are followed by BatchNorm and ReLU activation. In one branch, two-times downsampling is applied, followed by deconvolution to restore the feature map’s shape. The features from both branches are then concatenated to obtain multi-scale features. Finally, the multi-scale features are fed through Conv2d layers for proposal prediction and regression. We use NMS [15] with an IoU threshold of 0.45 to remove redundant proposals. The selected proposals are kept for the proposal refinement stage.
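A minimal greedy NMS over axis-aligned BEV boxes illustrates this selection step; real proposals are rotated 3D boxes, so this is a deliberate simplification:

```python
import numpy as np

def nms_bev(boxes, scores, iou_thr=0.45):
    """Greedy NMS on axis-aligned BEV boxes given as (x1, y1, x2, y2)."""
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of the top box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # drop boxes overlapping the kept box beyond the threshold
        order = rest[iou <= iou_thr]
    return keep
```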
3.2. Proposal Refinement Stage
This stage aims to generate accurate 3D detection results from the candidate proposals through further optimization and regression. In this part, we find the proposals of the same object between consecutive frames through the feature association module. The AWF feature fusion module is then used to adaptively fuse the candidate proposal features from different frames, and the fused results are sent to the network for regression and classification.
3.2.1. Feature Association Module
Proposal sets P_t and P_{t−1} are generated by the region proposal network from frames F_t and F_{t−1}, separately, where P = {p_1, …, p_n}. One approach for establishing associations between box proposals in consecutive frames is utilizing the nearest object center distance metric. The position offset of the object’s center point between the frames is calculated as follows:

d = ((x_t − x_{t−1})^2 + (y_t − y_{t−1})^2 + (z_t − z_{t−1})^2)^{1/2} < τ,

where (x_t, y_t, z_t) and (x_{t−1}, y_{t−1}, z_{t−1}) represent the proposal centers in the two frames, respectively, and τ is a manually set threshold. When more than one proposal falls inside this threshold, the nearest proposal is chosen for fusion.
We also consider using the cosine distance as the correlation metric, computed from the latent features obtained from the network. Compared with the Euclidean distance, cosine similarity is more sensitive to the pattern of two features and is widely used in many applications [36,37]. That is,

cos θ = (F_t · F_{t−1}) / (‖F_t‖ ‖F_{t−1}‖) = Σ_i a_i b_i / ((Σ_i a_i^2)^{1/2} (Σ_i b_i^2)^{1/2}),

where a_i and b_i represent the components of features F_t and F_{t−1}, respectively, F_t and F_{t−1} are features from consecutive frames, and θ denotes the angle between the two features.
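This cosine-similarity association between two sets of proposal features can be sketched as follows; the similarity threshold `sim_thr` is an illustrative assumption, not a value from this work:

```python
import numpy as np

def associate(feats_t, feats_prev, sim_thr=0.5):
    """Match each current-frame proposal to the most similar previous-frame
    proposal by cosine similarity; returns a list of (i, j) index pairs."""
    # L2-normalize rows so that a dot product equals cosine similarity
    a = feats_t / np.linalg.norm(feats_t, axis=1, keepdims=True)
    b = feats_prev / np.linalg.norm(feats_prev, axis=1, keepdims=True)
    sim = a @ b.T  # pairwise cosine similarity matrix
    pairs = []
    for i in range(sim.shape[0]):
        j = int(sim[i].argmax())
        if sim[i, j] >= sim_thr:  # keep only confident matches
            pairs.append((i, j))
    return pairs
```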
3.2.2. Feature Fusion Module
The addition operation is a commonly used feature fusion method. That is,

F̃ = F_t + F_{t−1},

where F_t and F_{t−1} represent the proposal features from consecutive frames, and + is the element-wise addition operation. However, the addition operation can be contextually unaware [38]. Here, we use the addition operation as a fusion baseline and introduce an attention-weighted fusion (AWF) module that weights feature channels and fuses the features adaptively.
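The channel-weighting idea behind the AWF module can be sketched as an SE-style squeeze-and-excitation step; the weight matrices and reduction ratio below are illustrative stand-ins, not trained parameters of the actual module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def awf_weights(feat, w1, w2):
    """SE-style channel attention on a (C, H, W) feature map.

    w1: (C//r, C) and w2: (C, C//r) play the role of the two 1x1
    convolutions, with r the channel reduction ratio.
    """
    squeezed = feat.mean(axis=(1, 2))        # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # 1x1 conv + ReLU, C -> C/r
    weights = sigmoid(w2 @ hidden)           # 1x1 conv + sigmoid, C/r -> C
    return feat * weights[:, None, None]     # channel-wise reweighting
```

Because the sigmoid bounds each channel weight in (0, 1), the module can only suppress or preserve a channel, which matches its role of damping unimportant information before fusion.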
As shown in Figure 5, the input proposal is first enlarged to 3 × 3 on the feature map to include the surrounding areas for additional information. The module performs average pooling on the input feature map, reducing the spatial dimensions to 1 × 1 while preserving the number of channels. It then applies a 1 × 1 convolution on the pooled tensor to reduce the channel dimension: using a smaller number of output channels (C/r) than input channels projects the feature into a lower-dimensional space and removes redundant information. The output tensor is passed through a ReLU function, which introduces nonlinearity into the feature representation. Following this, the module applies another 1 × 1 convolution, which expands the feature back to the original number of channels. This convolution is followed by a sigmoid activation function, which scales the learned weights to the range [0, 1]. These weights represent the importance of each channel in the input feature map, with higher weights indicating more discriminative features. Finally, the input feature map is multiplied element-wise with the learned weights to obtain a weighted feature map. That is,