1. Introduction
With the rapid growth of computer vision applications, especially autonomous driving, the demands on a vehicle's perception of its surrounding environment keep increasing. Point cloud-based 3D-object-detection techniques have therefore received much attention from both industry and academia. The main sensors deployed in intelligent vehicles are cameras and LiDAR. However, camera-based detection is strongly affected by environmental conditions such as glare, darkness, rain, and fog, under which its performance degrades severely [
1]. As increasingly affordable and advanced LiDAR technology becomes available, LiDAR-based 3D object detection has become an inseparable part of self-driving cars [
2]. Compared to traditional camera-based systems, LiDAR is far less susceptible to ambient noise and interference, and its ability to capture precise, structured point cloud data makes it an invaluable tool for determining the location and geometric attributes of surrounding objects. It forms the basis of the complex decision-making and planning processes that underpin autonomous driving [
3]. However, despite the many advantages of LiDAR-based 3D object detection, significant challenges must be overcome to achieve optimal accuracy. The inherent disorder, sparsity, and inhomogeneous distribution of the point cloud pose major obstacles to accurate and reliable detection, requiring sophisticated algorithms and innovative approaches to boost object-detection performance [
4].
Given the numerous obstacles associated with point cloud 3D object detection, a wide range of groundbreaking algorithms have emerged in recent years. Two main categories dominate the field of 3D object detection: point-based methods and voxel-based methods. Point-based methods take raw point data directly as the network input and extract features for each individual point in the scene. Their unique advantage is the ability to exploit the inherent geometric structure of the point cloud, which facilitates highly accurate object detection. However, their bottleneck is inference time, since the sample points must be collected by Farthest Point Sampling (FPS) [
5]. FPS is inherently sequential and time-consuming (a minimal sketch is given after this paragraph), which makes point-based methods difficult to apply in real-time detection scenarios. As the field of 3D object detection continues to evolve, voxel-based methods (e.g., VoxelNet [
6] and PointPillars [
7]) have become prominent, characterized by their ability to rasterize the point cloud into a discrete grid representation. This process divides the point cloud into voxels or pillars, whose features are aggregated into Bird's-Eye View (BEV) feature maps and then analyzed by 2D convolutional neural networks or 3D sparse convolutional networks. However, the limitation of voxel-based approaches is that encoding the point cloud into voxels can discard critical feature information, which degrades detection accuracy, especially for small objects. HotSpotNet [
8] transforms voxel features into a novel representation, reducing the complexity and irregularity of the input data. However, this conversion also loses detailed information from the original point cloud, which can be detrimental to small-object detection. In addition, the loss of fine-grained localization accuracy is an important issue, as it impairs the ability of 3D-object-detection systems to work effectively in the real world. Compared to existing voxel-based methods, which use only a single-scale voxel representation, Voxel-FPN [
9] encodes multi-scale voxel features. However, while this improves fine-grained localization accuracy, it introduces additional computational overhead. As detection technology continues to advance at a breakneck pace, whether voxel-based methods can reach point-based detection performance while maintaining their detection speed has emerged as a formidable challenge.
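To make the FPS bottleneck mentioned above concrete, the following NumPy sketch shows the standard farthest-point-sampling procedure (an illustrative reimplementation, not code from any cited work). Selecting M samples requires a full distance pass over all N points per iteration, so the loop is O(N·M) and inherently sequential:

```python
import numpy as np

def farthest_point_sampling(points, num_samples):
    """Greedily pick num_samples points that are maximally spread out.

    points: (N, 3) array of xyz coordinates.
    Returns the indices of the sampled points.
    """
    n = points.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    # Distance from every point to its nearest already-selected point.
    min_dist = np.full(n, np.inf)
    farthest = 0  # start from an arbitrary point
    for i in range(num_samples):
        selected[i] = farthest
        diff = points - points[farthest]
        min_dist = np.minimum(min_dist, np.sum(diff * diff, axis=1))
        # The next sample is the point farthest from the current set.
        farthest = int(np.argmax(min_dist))
    return selected

# Downsampling a typical LiDAR sweep (~100k points) to 4096 keypoints
# already costs hundreds of millions of distance evaluations.
pts = np.random.rand(100_000, 3).astype(np.float32)
idx = farthest_point_sampling(pts, 4096)
```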
Voxel-based methods divide the point cloud spatially into voxels and then transform the point features into voxel features. However, existing voxel-based methods do not perform well on small, hard-to-detect objects such as pedestrians, motorcycles, and traffic cones. The main reasons are: (1) small objects such as pedestrians are smaller than vehicles, so LiDAR scans fewer valid points on the target, as shown in
Figure 1; (2) voxel-based methods encode the point cloud into voxels and average the corresponding features within each voxel grid. While this averaging provides superior computational efficiency, the small number of valid points on small objects means that it discards a large amount of valid feature information, which degrades small-object detection performance. To address this challenge, we sought to make the network focus more on feature extraction from the foreground point cloud during voxelization. Attention mechanisms [
10] have emerged as a particularly promising approach, as they enable models to focus on key information without imposing significant computational overhead. Specifically, an attention mechanism allows the neural network to concentrate on specific features of the input, and its effectiveness is consistent regardless of the input's dimensionality. Hu et al. [
11] introduced SENet, a network architecture built from SE blocks. The architecture explicitly models channel dependencies and employs global average pooling to compute the weight assigned to each channel of the feature map. Woo et al. [
12] introduced the Convolutional Block Attention Module (CBAM), which builds on SENet. CBAM combines the complementary advantages of spatial and channel attention and further refines the feature map by multiplying it with the attention weights. Inspired by these works, we propose a Dual-Attention Voxel Feature Extractor (DA-VFE), which integrates pointwise and channelwise attention (Dual-Attention) with the mean Voxel Feature Extractor (VFE) to extract features globally from the point cloud. The distribution of points within voxels is inhomogeneous; the DA-VFE ameliorates this and refines more-representative point cloud information.
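As a rough illustration of this idea, the following PyTorch module re-weights the points and the channels inside each voxel before mean pooling. It is a minimal sketch under our own assumptions (the layer widths, the SE-style channel branch, and the masking scheme are illustrative), not the exact DA-VFE architecture:

```python
import torch
import torch.nn as nn

class DualAttentionVFE(nn.Module):
    """Sketch of a dual-attention voxel feature extractor.

    feats: (V, T, C) -- V non-empty voxels, T zero-padded points per
    voxel, C point features. mask: (V, T), 1 for real points.
    Output: (V, C) voxel features.
    """
    def __init__(self, c):
        super().__init__()
        self.point_fc = nn.Linear(c, 1)            # one score per point
        self.channel_fc = nn.Sequential(           # SE-style channel branch
            nn.Linear(c, c // 2), nn.ReLU(), nn.Linear(c // 2, c))

    def forward(self, feats, mask):
        # Pointwise attention, zeroed on the padding points.
        point_w = torch.sigmoid(self.point_fc(feats)) * mask.unsqueeze(-1)
        # Channelwise attention computed from the voxel's mean feature.
        denom = mask.sum(1, keepdim=True).clamp(min=1).unsqueeze(-1)   # (V,1,1)
        mean = (feats * mask.unsqueeze(-1)).sum(1, keepdim=True) / denom
        chan_w = torch.sigmoid(self.channel_fc(mean))                  # (V,1,C)
        # Re-weight points and channels, then mean-pool as in a plain VFE.
        refined = feats * point_w * chan_w
        return refined.sum(1) / denom.squeeze(-1)

vfe = DualAttentionVFE(c=10)
feats = torch.randn(2000, 32, 10)               # 2000 voxels, up to 32 points
mask = (torch.rand(2000, 32) > 0.5).float()
voxel_feats = vfe(feats, mask)                  # (2000, 10)
```

The key difference from a plain mean VFE is that the learned weights let sparsely hit foreground points contribute more than background clutter before the averaging step discards per-point detail.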
At the heart of the voxel-based method lies the interaction between two backbones: a 3D Backbone and a 2D Backbone. The 3D Backbone processes each voxel and extracts its features, which are then projected onto a Bird's-Eye View (BEV) to generate a Two-Dimensional (2D) pseudo-image that serves as the input to the 2D Backbone [
13]. The 2D Backbone plays a crucial role in extracting features from the pseudo-image, using a series of convolutions to generate high-quality proposals and detection results. Typically, the 2D backbones used in voxel-based methods closely resemble established models (e.g., VoxelNet [
6] and SECOND [
14]). However, most methods pay relatively little attention to the 2D Backbone and instead focus on optimizing the 3D Backbone to extract additional features from the underlying point cloud data. This focus has led researchers to overlook the potentially transformative impact of an enhanced 2D Backbone on the accuracy and reliability of object detection. Therefore, we designed a more-sophisticated 2D Backbone named the MFF Module, which consists of self-calibrated convolutions [
15], coordinate attention [
16], and a residual structure [
17]. Our proposed 2D Backbone significantly expands the receptive field, enhances contextual information capture, and improves overall detection performance, particularly for small objects. We evaluated our framework on the nuScenes [
18] dataset. The experimental results showed that AMFF-Net achieved significant performance improvements compared to the baseline network CenterPoint [
19] with a nearly constant inference speed and a significant reduction in the overall number of model parameters, demonstrating that our proposed framework achieves performance improvements with reduced computational overhead.
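The sketch below illustrates, in PyTorch, the kind of residual block this design implies. It pairs a coordinate-attention layer (following Hou et al. [16]) with a plain 3x3 convolution standing in for the self-calibrated convolution, so it approximates the structure of the MFF Module rather than reproducing its exact definition:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention: factorizes spatial attention into two 1D
    encodings, one along height and one along width."""
    def __init__(self, c, reduction=8):
        super().__init__()
        mid = max(c // reduction, 8)
        self.conv1 = nn.Conv2d(c, mid, 1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU()
        self.conv_h = nn.Conv2d(mid, c, 1)
        self.conv_w = nn.Conv2d(mid, c, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # (N, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (N, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                       # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (N, C, 1, W)
        return x * a_h * a_w    # attention along both spatial axes

class MFFBlock(nn.Module):
    """One residual block of a 2D Backbone in this style: convolution
    plus coordinate attention, wrapped in a skip connection."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, bias=False),
            nn.BatchNorm2d(c), nn.ReLU())
        self.ca = CoordinateAttention(c)

    def forward(self, x):
        return x + self.ca(self.conv(x))    # residual structure

bev = torch.randn(2, 64, 128, 128)          # BEV pseudo-image features
out = MFFBlock(64)(bev)                     # same shape: (2, 64, 128, 128)
```

Because the attention maps are computed along entire rows and columns of the BEV pseudo-image, each output location is influenced by a far wider spatial context than a stacked 3x3 convolution alone would provide.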
Our main contributions can be summarized as follows:
We designed a Dual-Attention Voxel Feature Extractor (DA-VFE) that integrates the Dual-Attention mechanism into the voxel-feature-extraction process, extracts more-representative point cloud features, and reduces the information loss incurred when extracting voxel features.
We propose a novel 2D Backbone named the MFF Module, which extends the receptive field of the 2D Backbone and captures more contextual information, extracting richer features with higher accuracy and robustness.
The proposed network can achieve performance gains while reducing the computational overhead.
2. Related Work
In recent years, with the development of automated driving technology, the requirements for real-world 3D perception have been increasing. An emerging trend in 3D-object-detection research is a gradual shift away from RGB images, which provide only limited 2D planar information, in favor of point cloud data, which offer a more comprehensive and accurate representation of depth information. Typically, point-based methods leverage the features of the original point cloud data, with F-PointNet [
20] representing a pioneering approach to applying PointNet [
21] to point clouds cropped using 2D detection results. The follow-up work [
5] achieved good detection performance by extracting features directly from raw, irregular point clouds. F-ConvNet [
22] optimizes the F-PointNet design in finer detail. PointRCNN [
23] segments the point cloud directly, generating high-quality 3D proposals from the raw points rather than projecting the point cloud into a pseudo-image. VoteNet [
24] proposed a Hough voting strategy to improve the grouping of point cloud data. Furthermore, STD [
25] introduced an innovative spherical anchor designed to generate highly accurate and reliable proposals. 3D-SSD [26] proposed new sampling methods to better group object features. IA-SSD [27] employs an instance-aware downsampling strategy to identify foreground points of interest. While this strategy effectively reduces redundant points, it also results in inevitable information loss. In general, point-based methods retain the original geometric structure information but suffer from slow inference and high computational cost.
Voxel-based methods are dominant in the realm of 3D object detection. At their core is a voxelization process that transforms the input point cloud into a voxel representation, followed by 3D convolutions that extract voxel features across the scene. The pioneering VoxelNet [
6] revolutionized the field by dividing the point cloud into a large number of homogeneous voxels so that all points within a voxel can be converted into voxel features by a voxel-feature-encoding layer (a minimal sketch of this step is given after this paragraph). The voxel features are then used as the input to the 3D Backbone. However, voxels are sparse in autonomous driving scenarios: a large number of voxels contain no points, and processing empty voxels places a significant burden on the processor. To decrease this computational overhead, 3D sparse convolution was introduced by SECOND [
14], which realizes efficient 3D convolution processing and greatly improves the network's inference speed. PointPillars [
7] represents a paradigm shift in voxel-based point cloud 3D object detection: it further elongates voxels into pillars along the z-axis, divides the point cloud into these pillars, and then extracts features from the points within each pillar to form a pseudo 2D BEV image. This method avoids inefficient 3D convolution, infers faster, and is popular in industry. Part-A2 [
28] uses pointwise part-location features as additional supervisory information, which are fused with pointwise semantic features to generate higher-quality 3D proposals. TANet [
29] utilizes a sophisticated and innovative triple-attention mechanism to boost the robustness and accuracy of voxel feature learning. PV-RCNN [
30] utilizes sparse convolution to extract high-level features from voxels and generate high-quality proposals. In this method, multi-scale voxel features are encoded into keypoints, and the boxes are refined by aggregating features from grid points located around the keypoints. The additional keypoint branch preserves the structural information of the 3D point cloud and improves detection performance, but at the cost of slower inference. Voxel-RCNN [
31] introduced a novel detector that utilizes only 3D voxel features for 3D object detection. This method applies voxel ROI pooling to further refine the proposals, and its inference speed is dramatically improved compared to PV-RCNN [
30]. Voxel ROI pooling does not require point-level information and avoids the interaction between point features and voxel features. BtcDet [
32] addresses the shape incompleteness in point cloud data caused by occlusion and truncation by learning additional point cloud data that align with the target shape, effectively completing the occluded parts of the target. All the methods mentioned above rely on anchors attached to objects. CenterPoint [
19] instead proposed an anchor-free method, which treats the target as a keypoint, predicts the center of the target, regresses its orientation, and performs better in pedestrian detection than anchor-based methods.
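For concreteness, the following NumPy sketch shows the mean voxel-feature-encoding step shared by this family of methods (an illustrative reimplementation under our own assumptions; the grid parameters and feature layout are arbitrary):

```python
import numpy as np

def mean_voxelize(points, voxel_size, pc_range):
    """Group points into voxels and average their features.

    points: (N, C) array with xyz in the first three columns.
    Returns (voxel_features, voxel_coords) for non-empty voxels only,
    mirroring the sparse representation consumed by a 3D Backbone.
    """
    mins = np.asarray(pc_range[:3], dtype=np.float32)
    coords = ((points[:, :3] - mins) / np.asarray(voxel_size)).astype(np.int64)
    # Group identical voxel coordinates together.
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    feats = np.zeros((uniq.shape[0], points.shape[1]), dtype=np.float64)
    np.add.at(feats, inverse, points)              # sum features per voxel
    counts = np.bincount(inverse).astype(np.float64)
    feats /= counts[:, None]                       # mean VFE: average per voxel
    return feats.astype(np.float32), uniq

# A synthetic sweep: 50k points with (x, y, z, intensity) features.
pts = (np.random.rand(50_000, 4) * [70.0, 80.0, 4.0, 1.0]).astype(np.float32)
feats, coords = mean_voxelize(pts, voxel_size=(0.1, 0.1, 0.2),
                              pc_range=(0, 0, 0, 70.4, 80.0, 4.0))
```

The averaging on the final line of the function is precisely where the fine-grained detail of sparsely hit small objects is lost, which is what the DA-VFE described in the Introduction is designed to mitigate.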
Our analysis of voxel-based methods shows that voxel feature encoding is a critical step in all voxel-based 3D-object-detection methods. Therefore, we propose a Dual-Attention Voxel Feature Extractor (DA-VFE), which enables the learning of voxel-level feature representations that are both more robust and more discriminative. In addition, this paper proposes a Multi-scale Feature Fusion Module (MFF Module), which contains self-calibrated convolution, coordinate attention, and a residual structure. Compared to a traditional RPN, this module has a larger receptive field and can capture more contextual information, which allows richer features to be extracted and boosts detection performance for small objects. The proposed AMFF-Net is based on CenterPoint; compared to CenterPoint, it boosts the detection performance for small objects and reduces the computational overhead while maintaining overall accuracy and inference speed.