Article

Hybrid Attention-Based 3D Object Detection with Differential Point Clouds

1 School of Transportation, Fujian University of Technology, Fuzhou 350118, China
2 Department of Information and Communication System, Hohai University, Changzhou 213022, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(23), 4010; https://doi.org/10.3390/electronics11234010
Submission received: 2 November 2022 / Revised: 27 November 2022 / Accepted: 1 December 2022 / Published: 2 December 2022

Abstract

Object detection based on point clouds has been widely used for autonomous driving, although how to improve its detection accuracy remains a significant challenge. Foreground points are more critical for 3D object detection than background points; however, most current detection frameworks cannot effectively preserve foreground points. Therefore, this work proposes a hybrid attention-based 3D object detection method with differential point clouds, which we name HA-RCNN. The method differentiates the foreground points from the background ones to preserve the critical information of foreground points. Extensive experiments conducted on the KITTI dataset show that the model outperforms the state-of-the-art methods, especially in recognizing large objects such as cars and cyclists.

1. Introduction

3D object detection based on point clouds has many applications in natural scenes, especially in autonomous driving. Point cloud data provide reliable geometric and depth information. However, point clouds are disordered, sparse, and unevenly distributed, increasing the difficulty of object detection [1].
Existing object detection methods mainly fall into image-based, point cloud-based, and multi-sensor approaches [2]. Among them, image-based methods lack depth and 3D structure information, making it challenging to identify and locate objects accurately in 3D space; approaches based on image information therefore tend to be less effective than those based on point clouds [3,4,5]. The authors of [6] proposed fusing point cloud and image data for object detection. Subsequently, classic methods such as MV3D [2], PC-CNN [7], AVOD [8], and PointPainting [9] were proposed. Although these fusion methods can integrate the characteristics of point clouds and images to a certain extent, the large amount of computation involved and the complexity of the networks pose considerable challenges. Thus, point cloud-based methods remain the mainstay for autonomous driving. Point cloud-based methods have developed rapidly in the last few years, and many classic methods have been proposed, including PointNet [10], PointNet++ [11], VoxelNet [12], and SE-SSD [13].
Early works usually convert raw point clouds into regular intermediate representations, such as projecting the 3D points into 2D images from a bird's-eye or frontal view, or into dense 3D voxels. However, such conversion improves efficiency at the cost of critical information, resulting in false and missed detections. PointPillars [14] encodes point clouds with pillar coding and achieves extremely fast detection, but it simultaneously loses many important foreground points, so fine details are handled poorly and many missed and false detections occur. To address this, TANet [15] enhances the local characteristics of each voxel by introducing an attention mechanism; however, the information lost during voxel conversion still makes false and missed detections unavoidable. In DA-PointRCNN [16], density-aware sampling pays better attention to regions where the point cloud is sparse and reduces missed detections, yet false detections remain because the importance of feature information is ignored. Therefore, we retain as many detail-rich foreground points as possible during sampling and remove a large number of background points that do not affect recognition. This makes the entire network architecture more lightweight and reduces missed detections. Furthermore, introducing an attention mechanism over the point-wise features helps avoid false detections. Figure 1 shows missed and false detections produced by PointRCNN due to missing foreground point information.
In the existing methods, important foreground points are often discarded before the final bounding box regression step. Therefore, the proposed model pays more attention to the foreground points, aiming to achieve high recognition accuracy through a hybrid attention mechanism. Moreover, we propose a novel HA module that generates point-wise features from the sampled input point cloud. Finally, the original point-wise features are concatenated with the enhanced point-wise features to make them more discriminative.
Specifically, the main contributions of this work are as follows:
  • To make full use of the crucial information, we introduce a new Hybrid Sampling (HS) module in the sampling layer which integrates multiple sampling methods, such as D-FPS, F-FPS, and random sampling.
  • To preserve critical information from foreground points, we propose a hybrid attention-based model to differentiate the foreground and background points.
  • Based on this hybrid attention mechanism, we propose a 3D object detection method with differential point clouds. Extensive experiments on the KITTI dataset show that our model significantly outperforms state-of-the-art methods.

2. Related Work

In this section, we briefly outline the relevant development history of current voxel-based and point-based detection methods.

2.1. Voxel-Based Methods

Among point cloud-based methods, converting the raw point cloud into a regular voxel grid and extracting local features for object detection has attracted much attention. The voxel concept was first proposed with VoxelNet, in which the point cloud is divided into voxels and objects are detected by extracting local features from each voxel; however, this requires considerable computation. SECOND [17] adds sparse convolution on top of VoxelNet to speed up the calculation. PointPillars directly converts point clouds into pseudo-images, avoiding time-consuming 3D convolution.
According to their detection stage, existing voxel-based detectors can be roughly divided into single-stage and two-stage detectors. While single-stage methods are efficient and straightforward, the reduced spatial resolution and insufficient structural information significantly affect their detection performance when the point cloud is relatively sparse. Thus, SA-SSD [18] supplements the utilization of structural information by adding auxiliary networks. HVNet [19] offers a hybrid voxel network that refines the projected and aggregated feature maps at multiple scales to improve detection performance. CIA-SSD [20] introduces a network incorporating IoU-aware confidence correction to extract spatially informative features of detected objects. In comparison, two-stage detectors can achieve higher performance at the cost of more computation and storage. Part-A2 [21] proposes a two-stage detector consisting of part-aware and part-aggregation modules, which better utilizes the location information of detected objects.
In general, voxel-based detection methods can achieve good detection performance with high efficiency. However, voxelizing the point cloud inevitably causes information loss. Later work has compensated for this loss and distortion by introducing increasingly complex module designs, which mitigates the defect to a certain extent but greatly impacts detection efficiency. Therefore, voxelization of point cloud data has certain limitations.

2.2. Point-Based Methods

Unlike voxel-based detection methods, point-based methods directly process the disordered and cluttered point cloud, obtaining features point by point in order to make a prediction for each point. The point cloud itself contains very rich physical structure information. A point-wise processing network was first proposed in the form of PointNet, which directly takes the original point cloud as input and therefore loses none of its physical information. Subsequently, PointNet++ improved on PointNet to increase detection efficiency and further optimize the network structure. Most subsequent point-based methods use this network or its variants to process the point cloud. PointRCNN [22] utilizes PointNet++ to extract features from raw point clouds and a Region Prediction Network (RPN) to generate prediction boxes. 3DSSD [23] introduces a 3D single-stage detection network which samples features for distant points in Euclidean space. PointGNN [24] adds a graph neural network to the 3D object detection framework, effectively improving recognition accuracy. ProposalContrast [25] proposes a new unsupervised point cloud pre-training framework to achieve better detection results. Proficient Teachers [26] introduces a new 3D SSL framework that provides better results and removes the need for confidence-based thresholds to filter pseudo-labels.
Point-based detection methods directly process the raw point cloud and effectively utilize the physical information of the point cloud itself. However, the huge amount of data inevitably takes up a lot of time and computing resources. Therefore, improving the efficiency of point-based detection is a bottleneck for this method.

3. Hybrid Attention Regions with CNN Features (HA-RCNN)

Unlike voxel-based methods, point-based methods need to perform point-wise detection, and as such need to pay more attention to foreground points (i.e., cars, pedestrians, etc.). However, most current point-based object detection frameworks usually adopt downsampling methods, such as random sampling [27] or farthest point sampling. Although these sampling methods can improve computational efficiency, the essential foreground points are ignored. Therefore, in this work we aim to train a point-based model to better retain the information of foreground points and efficiently detect multiple types of objects at one time. Based on this, we propose an efficient point cloud-based object detection algorithm.
As shown in Figure 2a, the proposed model framework mainly consists of three parts: Hybrid Sampling (HS), a Hybrid Attention Mechanism (HA), and Foreground Point Segmentation. First, the input point cloud is processed through hybrid sampling, retaining as many foreground points as possible. Then, point-wise features are generated and attended to by the HA module. Subsequently, the foreground segmentation network segments the foreground points and generates prediction boxes. Finally, 3D NMS is used to filter the prediction boxes and the refinement module outputs the final boxes. In Figure 2b, point-wise features are extracted from each sampled point cloud input and refined in the attention layer; the original point-wise features and the point-wise features produced by the attention layer are then concatenated.

3.1. Hybrid Sampling

To improve the efficiency of 3D object detection, especially when facing huge amounts of point cloud data, progressive downsampling must be used to speed up computation and reduce cost. However, aggressive downsampling may discard foreground points, losing valuable information about the detected object and easily causing missed or false detections. Therefore, we propose a hybrid sampling strategy as our sampling method; the specific sampling rules are shown in Figure 3.
Most current models use D-FPS (Distance-Farthest Point Sampling) or F-FPS (Feature-Farthest Point Sampling). F-FPS can retain many foreground points through the SA layer, but because the total number of representative points is limited, many background points are discarded during downsampling. While this makes regression easier, it is not conducive to classification: the SA layer aggregates features from neighboring points, and background points often cannot find enough surrounding points, making it difficult to distinguish foreground from background and resulting in poor classification performance. To better preserve the foreground points without affecting the later regression, we propose a new hybrid sampling (HS) method in which multiple sampling strategies, such as D-FPS, F-FPS, and random sampling, are mixed in parallel. This preserves more foreground points for localization along with enough background points for classification.
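As a concrete illustration of how such a parallel mixture can be implemented, the sketch below combines D-FPS on coordinates, F-FPS on point-wise features, and random sampling, and then merges the selected indices. This is a minimal PyTorch reconstruction based on the description above, not the authors' released code; the function names `farthest_point_sample` and `hybrid_sample` and the equal three-way split of the sampling budget are our own assumptions.

```python
import torch

def farthest_point_sample(x, n_samples):
    """Generic farthest point sampling over the rows of x (N, C).
    Applied to coordinates this is D-FPS; applied to features it is F-FPS."""
    n = x.shape[0]
    idx = torch.zeros(n_samples, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(0, n, (1,)).item()
    for i in range(n_samples):
        idx[i] = farthest
        d = torch.sum((x - x[farthest]) ** 2, dim=-1)  # squared distance to the newest pick
        dist = torch.minimum(dist, d)                  # distance of each point to the chosen set
        farthest = int(torch.argmax(dist))             # next pick: farthest from the chosen set
    return idx

def hybrid_sample(xyz, feats, n_out):
    """Hypothetical HS layer: draw points with D-FPS, F-FPS, and random sampling
    in parallel, then merge the index sets (duplicates are removed, so slightly
    fewer than n_out points may be returned)."""
    k = n_out // 3
    d_idx = farthest_point_sample(xyz, k)                  # D-FPS on 3D coordinates
    f_idx = farthest_point_sample(feats, k)                # F-FPS on point-wise features
    r_idx = torch.randperm(xyz.shape[0])[: n_out - 2 * k]  # random sampling
    return torch.unique(torch.cat([d_idx, f_idx, r_idx]))
```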
Furthermore, we introduce an additional branch to exploit the underlying feature semantics. In particular, two MLP layers are attached to the encoding layer in order to further estimate each point's semantic category. Here, we use the vanilla cross-entropy loss function
L_{HS} = -\sum_{i=1}^{C} \left[ S_i \log \hat{S}_i + (1 - S_i) \log (1 - \hat{S}_i) \right]
where $C$ represents the number of categories, $S_i$ is the one-hot label, and $\hat{S}_i$ represents the predicted logit. At inference, the top $k$ foreground points are kept, regarded as feedback, and sent to the following encoding layer as representative points.
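The sketch below shows one way the two-layer semantic branch and top-k foreground selection described above could be realized. It is an illustrative PyTorch reconstruction assuming a binary foreground/background label; the class name, argument names, and hidden width are ours, not the authors'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBranch(nn.Module):
    """Illustrative side branch: two shared MLP (1x1 conv) layers estimate a
    per-point foreground score, trained with vanilla cross-entropy (Eq. 1)."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(in_dim, hidden, 1), nn.ReLU(),
            nn.Conv1d(hidden, 1, 1),
        )

    def forward(self, feats, labels=None, k=1024):
        # feats: (B, C, N) point-wise features; labels: (B, N), 1 = foreground
        logits = self.mlp(feats).squeeze(1)             # (B, N) per-point scores
        loss = None
        if labels is not None:
            loss = F.binary_cross_entropy_with_logits(logits, labels.float())
        top_idx = torch.topk(logits, k, dim=1).indices  # keep the top-k likely foreground points
        return top_idx, loss
```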

3.2. Differentiation Mechanism with Hybrid Attention

Consider the original point cloud, represented as $G = (V, D)$, where $V = \{p_1, p_2, \ldots, p_n\}$ indicates $n$ points in a $D$-dimensional metric space. In our approach, $D$ is set to 4; each point in 3D space is defined as $v_i = (x_i, y_i, z_i)$, where $x_i, y_i, z_i$ denote the coordinates of the point along the $X$, $Y$, $Z$ axes, while the fourth dimension is the laser reflection intensity $s_i$. Each input $P$ contains $N$ points, $P = \{p_i = (v_i, s_i)^T \in \mathbb{R}^4,\ i = 1, 2, \ldots, N\}$.
As shown in Figure 2, for each input $h_i = \{p_j = (x_j, y_j, z_j, s_j)^T,\ j = 1, 2, \ldots, t\}$, the coordinate information of all the points inside forms an input vector. We extract the point-wise features of each input by learning a mapping implemented as a three-layer MLP with sizes (64, 128, 128):
f(h_i) = \mathrm{MLP}(\{p_j,\ j = 1, 2, \ldots, t\}),
allowing us to obtain the point-wise feature representation $F$, which is transformed by the subsequent layers for deeper feature learning.
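A minimal sketch of this shared three-layer MLP is given below; the (64, 128, 128) sizes come from the text, while the use of 1x1 convolutions with batch normalization is a common implementation choice that we assume here.

```python
import torch.nn as nn

class PointwiseMLP(nn.Module):
    """Shared MLP of Eq. (2): maps each 4-D input point (x, y, z, intensity)
    to a 128-D point-wise feature using layer sizes (64, 128, 128)."""
    def __init__(self, in_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
        )

    def forward(self, points):   # points: (B, 4, N)
        return self.net(points)  # F: (B, 128, N)
```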
In the HA module, we adopt two kinds of attention mixed in parallel; the specific architecture is shown in Figure 4. First, the point-wise features obtained during sampling are used as the point-wise and channel-wise inputs. To increase the spatial receptive field of each channel feature, two pooling operations (average pooling and max pooling) are applied independently, producing descriptors denoted by $F_c^{avg}$ and $F_c^{max}$, respectively. The sigmoid function then performs the final nonlinear activation, generating the required channel attention weight matrix $M_c \in \mathbb{R}^{C \times 1 \times 1}$. The specific formula is as follows:
M_C(F) = \sigma\{\mathrm{MLP}[\mathrm{AvgPool}(F)] + \mathrm{MLP}[\mathrm{MaxPool}(F)]\} = \sigma\{W_1[W_0(F_c^{avg})] + W_1[W_0(F_c^{max})]\}
where $\sigma$ is the sigmoid function and $W_0$ and $W_1$ are the channel attention weights learned through the MLP.
Finally, the original feature $F_1$ obtained from the sampling process is concatenated with the enhanced features to produce the final output feature $F_2$. Through these operations the point-wise features are enhanced, which significantly contributes to the final task of foreground point segmentation while suppressing irrelevant and noisy features. We call this attention mechanism the HA module.
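The channel-attention path of Eq. (3) and the final feature splicing might look like the following sketch. The reduction ratio of 8 and the exact way the re-weighted copy is concatenated with $F_1$ are our assumptions, since these details are not given in the text.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of the HA module (Eq. 3): average- and max-pooled
    channel descriptors pass through a shared MLP (W0, W1); a sigmoid yields
    the channel weight matrix Mc, and the re-weighted features are spliced
    with the original features F1 to give F2."""
    def __init__(self, channels, reduction=8):   # reduction ratio is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 1), nn.ReLU(),  # W0
            nn.Conv1d(channels // reduction, channels, 1),             # W1
        )

    def forward(self, f):                                      # f: (B, C, N) point-wise features
        f_avg = torch.mean(f, dim=2, keepdim=True)             # F_c^avg: (B, C, 1)
        f_max, _ = torch.max(f, dim=2, keepdim=True)           # F_c^max: (B, C, 1)
        mc = torch.sigmoid(self.mlp(f_avg) + self.mlp(f_max))  # Mc: (B, C, 1)
        return torch.cat([f, f * mc], dim=1)                   # splice F1 with enhanced features -> F2
```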

3.3. Foreground Point Segmentation

Previous modules reserve foreground points with rich information for this task. Segmenting the foreground points enables the point cloud network to capture contextual information. This can improve the accuracy of pointwise prediction and benefit the generation of 3D prediction boxes. For this, we use a bottom-up 3D prediction box generation method. Foreground point segmentation and prediction box generation are performed simultaneously, generating prediction boxes directly from the reserved foreground points.
After processing by the HA module, the full point-wise feature map $F_{full} \in \mathbb{R}^{N \times (d + c)}$ is generated. On this basis, a foreground segmentation branch composed of two convolution layers is attached to it, and the confidence $S_{fore} \in \mathbb{R}^{N}$ of each point in the input point set $P$ is estimated. The sigmoid function is used to normalize $S_{fore}$ into the foreground mask $S_{fore}^{norm} \in \mathbb{R}^{N}$, which serves as an important basis for the subsequent segmentation.
In most scenes, the number of foreground points tends to be much less than that of background points. Therefore, we use the focal loss function [28] to address the classification imbalance issue:
L_{focal}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \quad \text{where } p_t = \begin{cases} p & \text{for a foreground point} \\ 1 - p & \text{otherwise} \end{cases}
where $p_t$ is the probability of the foreground point. During point cloud segmentation training, we keep the default settings $\alpha_t = 0.25$ and $\gamma = 2$.
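A direct transcription of Eq. (4) could look like the snippet below. Following the text, $\alpha_t$ is kept fixed at 0.25 for all points (the original focal loss paper [28] uses $\alpha$ for foreground and $1-\alpha$ for background), and the small epsilon is added only for numerical stability.

```python
import torch

def focal_loss(p, is_foreground, alpha_t=0.25, gamma=2.0, eps=1e-6):
    """Per-point focal loss of Eq. (4); p is the predicted foreground
    probability and is_foreground a boolean mask over the points."""
    p_t = torch.where(is_foreground, p, 1.0 - p)            # p_t as defined in the text
    return -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t + eps)
```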

3.4. Loss Function

Because the object detected in this work is three-dimensional, the traditional 2D detection box is no longer applicable; thus, we use a 3D detection box. Accordingly, in designing the loss function we use a three-dimensional Intersection over Union, which we call 3D-IoU. Its specific form is shown in Figure 5.
The loss function of hybrid sampling is designed with reference to [29], and its specific formula is shown in Formula (1). The overall loss function in this work is designed based on [15,22]. According to our 3D box design, the regression targets between the ground truth of the detected object and the predicted anchor are defined as
\Delta x = \frac{x_{gt} - x_a}{d_a}, \quad \Delta y = \frac{y_{gt} - y_a}{d_a}, \quad \Delta z = \frac{z_{gt} - z_a}{h_a}
\Delta w = \log \frac{w_{gt}}{w_a}, \quad \Delta l = \log \frac{l_{gt}}{l_a}, \quad \Delta h = \log \frac{h_{gt}}{h_a}
\Delta \theta = \sin(\theta_{gt} - \theta_a)
where the subscript gt represents the ground truth of the object, while the subscript a represents the predicted value (Anchor Box):
d_a = \sqrt{w_a^2 + l_a^2}
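For clarity, the sketch below computes the regression targets of Eqs. (5) and (6) for a single ground-truth/anchor pair; the dictionary-based interface and key names are illustrative only.

```python
import math

def encode_box_targets(gt, anchor):
    """Regression targets of Eqs. (5)-(6) for one ground-truth box and one
    anchor box, each given as a dict with keys x, y, z, w, l, h, theta."""
    d_a = math.sqrt(anchor["w"] ** 2 + anchor["l"] ** 2)   # anchor diagonal, Eq. (6)
    return {
        "dx": (gt["x"] - anchor["x"]) / d_a,
        "dy": (gt["y"] - anchor["y"]) / d_a,
        "dz": (gt["z"] - anchor["z"]) / anchor["h"],
        "dw": math.log(gt["w"] / anchor["w"]),
        "dl": math.log(gt["l"] / anchor["l"]),
        "dh": math.log(gt["h"] / anchor["h"]),
        "dtheta": math.sin(gt["theta"] - anchor["theta"]),
    }
```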
The center error loss function of the bounding box is defined as
L_{loc} = \sum_{b \in (x, y, z, w, l, h, \theta)} \mathrm{SmoothL1}(\Delta b)
where $\mathrm{SmoothL1}$ is the smoothing function, calculated as
\mathrm{SmoothL1}(\Delta b) = \begin{cases} 0.5 (\Delta b)^2 & \text{if } |\Delta b| < 1 \\ |\Delta b| - 0.5 & \text{otherwise} \end{cases}
Because the sample classification data are quite different when the foreground points are segmented, we choose focal loss as the classification loss function, as shown in Formula (4). Finally, the total loss function is
L = \frac{1}{N_{pos}} \left( \beta_{HS} L_{HS} + \beta_{loc} L_{loc} + \beta_{focal} L_{focal} \right)
where $N_{pos}$ is the number of positive anchors, $\beta_{HS} = 0.2$, $\beta_{loc} = 2$, and $\beta_{focal} = 1$.
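Putting Eqs. (7)-(9) together, the overall loss can be sketched as follows; the tensor shapes and the way the individual terms are reduced are assumptions, since only the weights $\beta_{HS} = 0.2$, $\beta_{loc} = 2$, and $\beta_{focal} = 1$ are specified in the text.

```python
import torch

def smooth_l1(delta):
    """Smooth-L1 of Eq. (8), applied element-wise to the regression residuals."""
    return torch.where(delta.abs() < 1.0, 0.5 * delta ** 2, delta.abs() - 0.5)

def total_loss(l_hs, reg_deltas, l_focal, n_pos,
               beta_hs=0.2, beta_loc=2.0, beta_focal=1.0):
    """Total loss of Eq. (9): hybrid-sampling, localization, and focal terms,
    normalized by the number of positive anchors."""
    l_loc = smooth_l1(reg_deltas).sum()   # Eq. (7): sum over (x, y, z, w, l, h, theta)
    return (beta_hs * l_hs + beta_loc * l_loc + beta_focal * l_focal) / n_pos
```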

4. Experiments

We used HA-RCNN to conduct systematic experiments on the KITTI dataset and compared it with state-of-the-art models. Finally, HA-RCNN was analyzed through the loss function, P-R curves, and visualized detection results.

4.1. Experimental Data

In this work, the Velodyne, Image, Calib, and Label data in KITTI are selected for the experiments, and the object detection results are displayed in the visualized point cloud space [30]. The KITTI dataset contains three detection difficulty levels, namely, Easy, Moderate, and Hard. To analyze the experimental results, we employ KITTI's official evaluation method, in which the IoU (Intersection over Union) is used as the measure of the positioning accuracy of the object detection box. If the overlap (IoU) between the detection box and the label box exceeds 50%, the object is considered to be correctly detected. The intersection-over-union is calculated as
\mathrm{IoU} = \frac{\mathrm{area}(b_{ij} \cap b_{gt})}{\mathrm{area}(b_{ij} \cup b_{gt})}
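For illustration, the snippet below computes the IoU of two axis-aligned 3D boxes; the official KITTI evaluation additionally accounts for box rotation (the 3D-IoU of Figure 5), which is omitted here for brevity.

```python
def axis_aligned_iou_3d(box_a, box_b):
    """IoU for two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

    # Overlap along each of the three axes
    inter_dims = [max(0.0, min(box_a[i + 3], box_b[i + 3]) - max(box_a[i], box_b[i]))
                  for i in range(3)]
    inter = inter_dims[0] * inter_dims[1] * inter_dims[2]
    union = volume(box_a) + volume(box_b) - inter
    return inter / union if union > 0 else 0.0
```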

4.2. Evaluation Indicator

Because the training and testing described in this article are based on the KITTI dataset, the metrics officially provided by KITTI are used directly in this work. We therefore use the Average Precision (AP) as our evaluation index, calculated as follows:
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{Recall} = \frac{TP}{TP + FN}
AP = \int_0^1 f_{PR}(r)\, dr
where TP is the number of detection boxes with IoU > 0.5, FP is the number of detection boxes with IoU ≤ 0.5 or of redundant detection boxes matching the same ground truth, and FN is the number of ground-truth objects that are not detected.
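The sketch below approximates AP as the area under the precision-recall curve (Eqs. 10-12) by sorting detections by confidence; note that the official KITTI evaluation samples precision at fixed recall points, which is not reproduced here.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Approximate AP: sort detections by confidence, accumulate TP/FP counts,
    and integrate precision over recall."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp_flags = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    precision = tp / np.maximum(tp + fp, 1e-9)
    recall = tp / max(num_gt, 1)
    return float(np.trapz(precision, recall))   # area under the P-R curve
```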

4.3. Experimental Details

During training, we sampled 16,384 points from each scene as input. For scenes with fewer than 16,384 points, we randomly repeated points in the scene to reach 16,384. In the sampling stage, we followed the network structure of PointNet++ in the SA layers and used four set abstraction layers with multi-scale grouping to divide the points into groups of 4096, 1024, 256, and 64. The detailed parameter settings are shown in Table 1.
In the foreground point segmentation stage, for more robust segmentation we ignore the background points near object boundaries during training by enlarging each 3D ground-truth box by 0.2 m on each side. Prediction boxes are generated from the point-wise vectors. For prediction box classification training, a prediction box is considered positive if its maximum 3D IoU with a ground-truth box is above 0.6 and negative if the maximum 3D IoU is below 0.45. In the experiments, we use a 3D IoU of 0.55 as the minimum threshold for prediction box regression training.
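The proposal classification rule described above can be summarized in a few lines; the ignore label of -1 for boxes falling between the two thresholds is our assumption, as the text only states which proposals count as positive or negative.

```python
def assign_proposal_label(max_iou, pos_thr=0.6, neg_thr=0.45):
    """Classification label for one proposal based on its maximum 3D IoU with
    any ground-truth box: positive above 0.6, negative below 0.45."""
    if max_iou > pos_thr:
        return 1    # positive proposal
    if max_iou < neg_thr:
        return 0    # negative proposal
    return -1       # in-between proposals are ignored for classification (assumption)
```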

4.4. Results and Analysis

We evaluated our model on the 3D detection benchmark on the KITTI test server; the results are shown in Table 2. The HA-RCNN used in this work was trained for 200 epochs, with the training loss curve shown in Figure 6. As the number of epochs increases, the loss value of HA-RCNN decreases rapidly; the decrease slows after 25 epochs, flattens after 50 epochs, and the curve plateaus after 150 epochs. As can be seen from the figure, the loss converges quickly and the final loss value is maintained at around 0.5.
To further validate the superior performance of our model, we evaluated the performance of HA-RCNN on the Waymo dataset. This dataset consists of nearly 160k 360-degree lidar samples in the training set and 40k samples in the validation set with panoramic annotation objects. To make a fair comparison, we adjusted our framework during the evaluation, changing the number of input points from 16,384 to 65,536, and increased the sampling ratio of each sampling layer to four times. The results after comparison are shown in Table 3. Compared with other methods, our HA-RCNN has obvious vehicle and cyclist detection advantages. In pedestrian detection, HA-RCNN has only a slight advantage. We speculate that the pedestrian point cloud is relatively sparse, which is not conducive to detection. Therefore, we try to solve this problem in future work.
To analyze the effects of the different components of HA-RCNN, we conducted extensive ablation experiments on the car class. We used the initial structure as the baseline of the experiment, cutting off only the connection between the HS module and the HA module. As shown in Table 4, extensive ablation experiments were performed on the proposed HS and HA modules. With only the baseline, the mAP reached 86.43%, 77.39%, and 75.87% on the Easy, Moderate, and Hard levels, respectively. When only the HS module was added, the mAP increased to 87.26%, 78.55%, and 76.91%. With only the HA module, the mAP reached 88.49%, 79.15%, and 77.44%. Finally, with both modules added together, the mAP reached 89.23%, 79.88%, and 77.92%. It can be clearly observed that the HS and HA modules are both significant, and that the improvement in detection accuracy is largest when the two modules are integrated into the model together.
In addition, this work uses Precision and Recall as measures of the results. To this end, we show the multi-view P-R curves of HA-RCNN for vehicle detection in Figure 7; each view contains P-R curves for the three difficulty levels: Easy, Moderate, and Hard.

5. Discussion

After extensive systematic experiments to verify the effectiveness of our proposed hybrid sampling approach, we report the actual recall (i.e., the ratio of instances still retained after sampling) for each layer in Table 5. For comparison, we also report the recall of the random sampling method, the Euclidean distance-based sampling method (D-FPS), and the feature distance-based sampling method (F-FPS).
From the analysis in Table 5, we can reach the following conclusions:
(1) After multiple random sampling operations the recall rate drops significantly, which means that a large number of foreground points are discarded.
(2) D-FPS and F-FPS have a reasonable recall rate in the early stage and the later effect is slightly worse, which causes loss of information on the foreground points. Therefore, it is challenging to accurately detect objects of interest, especially after multiple samplings with a limited number of preserved foreground points.
(3) Our mixed sampling approach has a significant performance advantage over most current methods, achieving higher recall and retaining more foreground points.
From the analysis in Table 2, it can be concluded that compared with several previous classic methods our model has clear advantages in detecting cars and cyclists at the Easy and Moderate difficulty levels, while the advantage is relatively small on complex (Hard) samples. Many current models take both RGB images and point cloud data as input to improve detection accuracy; our model takes only point cloud data as input and nonetheless achieves better performance.
For pedestrian detection, our method is not much different from previous ones that only use lidar data. However, our detection performance is slightly inferior to the multi-sensor approaches. We believe this is because pedestrians are smaller, making their associated point clouds much sparser than those of cars and cyclists. Although our approach reserves as many foreground points as possible, the effect is insufficient because the object is too small. Images can capture more detail on such objects, meaning that multi-sensor detection methods have an advantage over our approach with respect to pedestrian detection.
In order to more intuitively observe the detection performance of our HA-RCNN model, we visualize the detection results in Figure 8. It can be seen from the visualization that the proposed HA-RCNN model detects cars and cyclists well in various environments.

6. Conclusions

This work proposes a novel 3D object detection model for complex and changeable scenes called HA-RCNN. Hybrid Sampling (HS) and Hybrid Attention (HA) are the key parts of HA-RCNN, extracting useful foreground points from huge point cloud data and classifying them from background points to achieve higher accuracy. Extensive experiments conducted on the KITTI dataset show that our model outperforms state-of-the-art methods, especially on recognizing large objects such as cars and cyclists.
However, our detection of small objects such as pedestrians shows certain shortcomings. We speculate that the size of small objects such as pedestrians is too small to obtain sufficient points for detection. We intend to improve the model’s detection ability for small objects in future work.

Author Contributions

Conceptualization, G.H. and Y.Z.; methodology, L.L.; validation, H.Y., Z.Z. and Q.Z.; formal analysis, Y.Z.; investigation, G.H. and L.L.; data curation, H.Y. and Z.Z.; writing, all authors; visualization, Y.Z.; supervision, Y.Z. and Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a project of the Fujian University of Technology (GY-Z19066), the projects of the National Natural Science Foundation of China (41971340), projects of Fujian Provincial Department of Science and Technology (2021Y4019), and the project of Fujian Provincial Universities Engineering Research Center for Intelligent Driving Technology (Fujian University of Technology) (KF-J21012).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HA-RCNN  Hybrid Attention Regions with CNN Features
D-FPS    Distance-Farthest Point Sampling
F-FPS    Feature-Farthest Point Sampling
IoU      Intersection over Union
RPN      Region Prediction Network
HA       Hybrid Attention
HS       Hybrid Sampling
AP       Average Precision

References

  1. Huang, T.; Liu, Z.; Chen, X.; Bai, X. Epnet: Enhancing point features with image semantics for 3d object detection. In European Conference on Computer Vision, Proceedings of the Computer Vision ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; Volume 12360, pp. 35–52. [Google Scholar]
  2. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
  3. Qin, Z.; Wang, J.; Lu, Y. Monogrnet: A geometric reasoning network for monocular 3d object localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Hawaii, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8851–8858. [Google Scholar]
  4. Li, B.; Ouyang, W.; Sheng, L.; Zeng, X.; Wang, X. Gs3d: An efficient 3d object detection framework for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1019–1028. [Google Scholar]
  5. Liu, Z.; Zhou, D.; Lu, F.; Fang, J.; Zhang, L. Autoshape: Real-time shape-aware monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 15641–15650. [Google Scholar]
  6. Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision, Proceedings of the Computer Vision ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; Volume 8695, pp. 345–360. [Google Scholar]
  7. Du, X.; Ang, M.H.; Karaman, S.; Rus, D. A general pipeline for 3d detection of vehicles. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 3194–3200. [Google Scholar]
  8. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
  9. Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. Pointpainting: Sequential fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 4604–4612. [Google Scholar]
  10. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  11. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5099–5108. [Google Scholar]
  12. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  13. Zheng, W.; Tang, W.; Jiang, L.; Fu, C.W. SE-SSD: Self-ensembling single-stage object detector from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 14494–14503. [Google Scholar]
  14. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 12697–12705. [Google Scholar]
  15. Liu, Z.; Zhao, X.; Huang, T.; Hu, R.; Zhou, Y.; Bai, X. Tanet: Robust 3d object detection from point clouds with triple attention. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11677–11684. [Google Scholar]
  16. Li, J.; Hu, Y. A Density-Aware PointRCNN for 3D Object Detection in Point Clouds. arXiv 2020, arXiv:2009.05307. [Google Scholar]
  17. Yan, Y.; Mao, Y.; Li, B. SECOND: Sparsely Embedded Convolutional Detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Du, L.; Ye, X.; Tan, X.; Feng, J.; Xu, Z.; Ding, E.; Wen, S. Associate-3Ddet: Perceptual-to-conceptual association for 3D point cloud object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 13329–13338. [Google Scholar]
  19. He, C.; Zeng, H.; Huang, J.; Hua, X.S.; Zhang, L. Structure aware single-stage 3d object detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 11873–11882. [Google Scholar]
  20. Ye, M.; Xu, S.; Cao, T. Hvnet: Hybrid voxel network for lidar based 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1631–1640. [Google Scholar]
  21. Shi, S.; Wang, Z.; Shi, J.; Wang, X.; Li, H. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2647–2664. [Google Scholar] [CrossRef]
  22. Shi, S.; Wang, X.; Li, H. Pointrcnn: 3d object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 770–779. [Google Scholar]
  23. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-Based 3D Single Stage Object Detector. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Seattle, WA, USA, 14–19 June 2020; pp. 11037–11045. [Google Scholar]
  24. Shi, W.; Rajkumar, R. Point-GNN: Graph Neural Network for 3D Object Detection in a Point Cloud. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Seattle, WA, USA, 14–19 June 2020; pp. 1708–1716. [Google Scholar]
  25. Yin, J.; Zhou, D.; Zhang, L.; Fang, J.; Xu, C.Z.; Shen, J.; Wang, W. Proposalcontrast: Unsupervised pre-training for lidar-based 3D object detection. In European Conference on Computer Vision, Proceedings of the Computer Vision ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; Volume 13699, pp. 17–33. [Google Scholar]
  26. Yin, J.; Fang, J.; Zhou, D.; Zhang, L.; Xu, C.Z.; Shen, J.; Wang, W. Semi-supervised 3D object detection with proficient teachers. In European Conference on Computer Vision, Proceedings of the Computer Vision ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; Volume 13698, pp. 727–743. [Google Scholar]
  27. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society, Seattle, WA, USA, 14–19 June 2020; pp. 11105–11114. [Google Scholar]
  28. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  29. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 18953–18962. [Google Scholar]
  30. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  31. Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 918–927. [Google Scholar]
  32. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. STD: Sparse-to-Dense 3D Object Detector for Point Cloud. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1951–1960. [Google Scholar]
Figure 1. Vehicle detection results. The first row shows the corresponding 2D image. The second row is the ground truth and the detection results from PointRCNN. We highlight missed and false detections in the dot column with red arrows.
Figure 2. HA-RCNN model frame diagram (a: Overall frame, b: HA module refinement).
Figure 3. Hybrid sampling diagram.
Figure 4. HA module architecture.
Figure 5. IoU calculation formula and schematic diagram.
Figure 6. Loss curve of HA-RCNN.
Figure 7. P-R variation curve in multiple viewing angle detection tasks.
Figure 8. Visualization of the detection effect of the HA-RCNN model. The ground-truth bounding boxes are shown in red, and the predicted bounding boxes are shown in green for cars, cyan for pedestrians, and yellow for cyclists.
Table 1. SA layer parameter settings.
SA | Point | Radii | Nquery | Feature Dimension
1 | 4096 | [0.1, 0.5] | [16, 32] | [[16, 16, 32], [32, 32, 64]]
2 | 1024 | [0.5, 1.0] | [16, 32] | [[64, 64, 128], [64, 96, 128]]
3 | 256 | [1.0, 2.0] | [16, 32] | [[128, 196, 256], [128, 196, 256]]
4 | 64 | [2.0, 4.0] | [16, 32] | [[256, 256, 512], [256, 384, 512]]
Table 2. Performance comparison with other 3D object detection algorithms (‘R’ denotes RGB image input and ‘L’ denotes Lidar point cloud input.)
Method | Modality | Car-3D Detection (Easy / Moderate / Hard) | Pedestrian-3D Detection (Easy / Moderate / Hard) | Cyclist-3D Detection (Easy / Moderate / Hard)
MV3D [2] | R + L | 71.09 / 62.35 / 55.12 | - / - / - | - / - / -
AVOD [8] | R + L | 83.07 / 71.76 / 65.73 | 50.46 / 42.27 / 39.04 | 63.76 / 50.55 / 44.93
F-PointNet [31] | R + L | 82.19 / 69.79 / 60.59 | 50.53 / 42.15 / 38.08 | 72.27 / 56.12 / 49.01
VoxelNet [12] | L | 77.47 / 65.11 / 57.73 | 39.48 / 33.69 / 31.51 | 61.22 / 48.36 / 44.37
SECOND [17] | L | 83.13 / 73.66 / 66.20 | 51.07 / 42.56 / 37.29 | 70.51 / 53.85 / 46.90
PointPillars [14] | L | 82.58 / 74.31 / 68.99 | 51.45 / 41.92 / 38.89 | 77.10 / 58.65 / 51.92
PointRCNN [22] | L | 86.96 / 75.61 / 70.70 | 47.98 / 39.37 / 36.01 | 74.96 / 58.82 / 52.53
STD [32] | L | 87.95 / 79.71 / 75.09 | 53.29 / 42.47 / 38.35 | 78.69 / 61.59 / 55.30
HA-RCNN (Ours) | L | 89.23 / 79.88 / 77.92 | 50.31 / 41.32 / 37.69 | 79.26 / 63.64 / 56.12
Table 3. State-of-the-art comparisons for 3D detection on the Waymo test set showing the mAP and mAPH for both Level 1 and Level 2 benchmarks.
Difficulty | Method | Vehicle (mAP / mAPH) | Pedestrian (mAP / mAPH) | Cyclist (mAP / mAPH)
Level 1 | SECOND [17] | 67.58 / 67.25 | 60.47 / 50.12 | 54.35 / 53.14
Level 1 | PointPillars [14] | 60.67 / 59.38 | 43.12 / 23.14 | 35.83 / 27.96
Level 1 | HA-RCNN (Ours) | 71.53 / 70.45 | 60.85 / 50.55 | 60.71 / 58.35
Level 2 | SECOND [17] | 59.57 / 59.04 | 53.00 / 43.56 | 52.67 / 51.37
Level 2 | PointPillars [14] | 52.64 / 51.86 | 37.17 / 20.03 | 34.21 / 27.09
Level 2 | HA-RCNN (Ours) | 62.15 / 60.87 | 53.12 / 44.03 | 57.18 / 55.15
Table 4. Ablation experiments on the effect of the HS and HA modules.
HS | HA | Easy | Moderate | Hard
   |    | 86.43 | 77.39 | 75.87
✓  |    | 87.26 | 75.55 | 76.91
   | ✓  | 88.49 | 79.15 | 77.44
✓  | ✓  | 89.23 | 79.88 | 77.92
Table 5. Comparison of commonly used centralized sampling methods and HS.
Sampling Strategy | 4096 Points (Car / Ped. / Cyc.) | 1024 Points (Car / Ped. / Cyc.) | 256 Points (Car / Ped. / Cyc.) | 64 Points (Car / Ped. / Cyc.)
Random [27] | 96.5% / 98.9% / 97.5% | 87.6% / 92.5% / 83.8% | 78.5% / 84.9% / 72.6% | 60.9% / 65.4% / 51.8%
F-FPS [23] | 98.5% / 100% / 97.3% | 98.2% / 99.5% / 97.3% | 97.3% / 91.5% / 95.6% | 83.4% / 78.6% / 84.8%
D-FPS [11] | 98.5% / 100% / 97.3% | 98.1% / 99.3% / 97.3% | 97.2% / 90.4% / 91.1% | 82.1% / 71.8% / 75.2%
HS (ours) | 98.5% / 100% / 97.5% | 98.3% / 99.5% / 97.3% | 97.3% / 93.6% / 97.3% | 94.9% / 90.8% / 94.3%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

Han, G.; Zhu, Y.; Liao, L.; Yao, H.; Zhao, Z.; Zheng, Q. Hybrid Attention-Based 3D Object Detection with Differential Point Clouds. Electronics 2022, 11, 4010. https://doi.org/10.3390/electronics11234010
