1. Introduction
Three-dimensional point cloud data provide precise geometric and spatial information, which is crucial for computer vision applications such as autonomous driving [1,2], augmented reality [3,4], and domestic robots [5,6]. The development of commercial depth cameras and LiDAR sensors has made 3D object detection an active research field, attracting increasing attention. The goal of this task is to obtain the 3D bounding box, orientation, and semantic label of each object in the input scene. However, detecting objects in point clouds can be more challenging than in 2D images due to their unstructured nature. Although CNNs have proven very effective in 2D object detection, applying them directly to point clouds is difficult.
Compared to grid-structured images, 3D point clouds provide pure shape and geometric information, unaffected by lighting and reflectance. However, point clouds are irregular, unordered, and sparse; thus, how to apply currently successful CNN methods to 3D point clouds is the main challenge. Early work attempted to transform point clouds into "grid"-structured data and process them using 2D-based detectors. Song [7] proposed a 3D ConvNet formulation for deep sliding shapes, with a 3D region proposal network and an object recognition network, to obtain 3D bounding boxes for the input scene. Hou [8] extended the standard 2D object detection framework R-CNN [9,10] to 3D. These methods project point clouds into voxelized spaces and can obtain plausible performance; however, they suffer from high computation costs due to expensive 3D convolutions. Alternatively, some researchers have focused on projecting point clouds into a bird's-eye-view space [11,12]. However, bird's-eye-view methods sacrifice the depth-related geometric cues that are crucial for cluttered indoor scenes, which decreases detection performance. Other researchers have adopted two-stage cascade approaches to address these issues [13,14]: in the first stage, 2D object detectors are used to obtain 2D bounding boxes; in the second stage, these 2D boxes are extruded to form the final 3D bounding boxes.
Another technical branch is point-based methods. Point clouds have emerged as a powerful representation for 3D deep learning tasks, such as classification [15,16,17,18,19,20], semantic segmentation [21,22,23], point cloud normal estimation [24], 3D reconstruction [25,26,27], and 3D object detection [28,29,30,31]. Most of these works operate on raw point clouds to extract expressive representations, building on the pioneering works PointNet/PointNet++ [15,16].
The first end-to-end point-based work, VoteNet [32], utilizes PointNet++ as the backbone to extract point features, followed by a neural network that reproduces the classic Hough voting scheme. It consists of three main components: a PointNet++-based point feature extraction module, a voting module, and a cluster generation module for object proposal and classification. Building on these three modules, many subsequent works attempt to improve performance by enhancing or modifying each module. Some works propose a combined MLP (CMLP) [33] or an attention MLP (AMLP) [28] to enhance the modelling ability of the PointNet++ backbone. Other works consider the voting and cluster generation modules insufficiently powerful: MLCVNet [29] proposes a patch-to-patch context module and an object-to-object context module to capture contextual information and obtain more accurate vote centers and clusters, respectively. RBGNet [34] proposes a ray-based feature grouping module to improve the grouping scheme of VoteNet, as well as a foreground-based feature extraction module to enhance feature representation ability.
Although the aforementioned methods achieve considerable improvements over VoteNet, we find that there is still room for improvement in all three modules. For feature extraction in the backbone network, most methods focus on obtaining more reasonable feature representations. Although RBGNet [34] proposes a foreground-based module that exploits foreground points, it does not explicitly weight the different contributions of foreground and background points in the input point cloud, nor does it take into account the relationship between them. There is also much room to improve the vote generation part: RBGNet [34] utilizes a ray-based module to learn better representations of object shape, but the authors did not consider the relationships among the votes on each ray. For the cluster generation and classification part, most works ignore the contextual information of surrounding clusters, which is crucial for classifying the current cluster.
To address the aforementioned issues, this paper studies indoor 3D object detection from point clouds. Specifically, we propose a foreground-aware module that weights the different contributions of foreground and background points in the backbone; we adopt PointNet++ as our backbone network, and this module produces a two-channel weighted map for foreground and background points separately. For the second problem, inspired by [34], we introduce a voting-aware module to model the spatial relationships between votes along rays. For the third issue, we propose a cluster-aware module that builds spatial dependencies among clusters to exploit rich contextual information for the final 3D bounding box classification. With these three modules, we propose a unified ray-based enhancement network (REGNet) that incorporates them into VoteNet for 3D object detection.
The framework of REGNet is shown in Figure 1; the three main modules are highlighted with red bounding boxes. In summary, the contributions of this paper include:
We propose a ray-based enhancement 3D object detection network that exploits contextual information at the foreground/background, voting-patch, and cluster levels.
We design three sub-modules, namely a foreground-aware module, a voting-aware module, and a cluster-aware module. The new modules fit nicely into the VoteNet framework.
Experiments on ScanNet V2 and SUN RGB-D datasets demonstrate the effectiveness and superiority of the proposed modules in improving detection accuracy.
The rest of the article is organized as follows.
Section 2 briefly reviews the most relevant work.
Section 3 provides detailed information on the proposed REGNet. We present experimental results and in-depth analysis in
Section 4. Finally, the conclusions are drawn in
Section 5.
2. Related Work
Grid Projection/Voxelization-based Detection. 3D object detection is a challenging task due to the irregular, sparse, and orderless nature of 3D points. Most existing work can be classified into three categories in terms of point cloud representation, i.e., voxel-based, bird's-eye-view-based, and point-based. Thanks to the success of deep neural networks, marvelous progress has been achieved in 2D object detection. However, 2D object detection ignores depth information, which is important for understanding the whole scene. Early 3D object detection methods project point clouds onto 2D grids [35,36,37] or into 3D voxels [8,38], so that the most successful convolutional networks can be applied directly. Most bird's-eye-view-based methods target autonomous driving: in outdoor scenes, most objects are distributed on the same plane, so there is little mutual occlusion in the top-down view. In indoor scenes, however, many objects are on top of each other, such as photos on walls or tables covering sofas. Therefore, some works project the point cloud onto a frontal view and use 2D ConvNets to tackle the problem; however, self-occlusion of objects in indoor scenes still poses many challenges. Voxel-based methods transform the point cloud into 3D voxels and have been shown to yield more reasonable performance, but they suffer from high memory and computational costs, as well as quantization errors.
Point-based Detection. To tackle the problems noted above, most recent methods process point clouds directly for 3D object detection. Because point clouds are irregular and sparse, extracting feature representations from them is the core task of these methods. The pioneering works PointNet/PointNet++ [15,16] provide a powerful and robust backbone for point-cloud-based tasks. VoteNet [32] adopts PointNet++ as the backbone and reproduces a Hough voting strategy, yielding the first end-to-end point-cloud-based framework for this task. There are numerous successors to this work, such as BGNet [39], which further improves the traditional Hough voting mechanism [40]: the authors propose a back-tracing strategy that generatively backtracks representative points from the vote centers and then revisits the seed points. H3DNet [30] and MCGNet [41] recognize the limited modeling ability of feature extraction from a single backbone branch and utilize a four-way backbone to extract more plausible feature representations. MLCVNet [29] proposes three sub-modules to capture multi-level contextual information in point cloud data to boost performance. VENet [28] improves the voting procedure in the "before, during, and after" stages to address the limitations of current voting schemes.
Attention Mechanism/Transformer-based Detection. The transformer is the dominant network architecture for natural language processing (NLP) tasks. Due to its powerful feature modelling ability, it has also been applied to 2D image recognition [42,43,44]. Most recently, many works have applied the transformer scheme to 3D object detection [45,46]. GFNet [31] proposes a group-free strategy that adopts a powerful transformer module to replace the proposal head in VoteNet [32]. In this paper, we also resort to the transformer scheme to build spatial context dependencies among different vote centers and clusters to boost prediction performance.
Point Cloud Sampling. Since point clouds are sparse and irregular, they cannot be processed by conventional grid-based methods. Therefore, to facilitate training and inference, sampling operations [47,48,49] play a key role in point cloud analysis. Examples include farthest point sampling (FPS) and k-closest points sampling (KPS), which have been widely leveraged in point cloud object detection. However, these downsampling strategies are class-agnostic, treating all points equally without considering their different contributions. Therefore, some redundant points may inadvertently be retained, while important information may be lost after downsampling. 3DSSD [50] proposes the F-FPS sampling strategy, based on feature distance, to preserve interior points. RBGNet [34] proposes foreground-biased sampling to keep foreground points during sampling. However, these methods still do not consider the spatial relationship between foreground and background points. To address this issue, we propose a foreground-aware module that samples more points on the foreground and establishes relationships among the sampled points.
3. Our Approach
The proposed REGNet is built on VoteNet and is inspired by RBGNet for the ray-based voting-aware module. Our method aims to establish relationships between foreground and background points to enhance the representation ability of seed points, to model the spatial relationships between vote centers on rays to obtain better object feature representations, and to compute contextual information across all clusters to improve the final prediction. To achieve this, we develop three new modules on top of VoteNet, namely the foreground-aware module, the voting-aware module, and the cluster-aware module. This section elaborates on the learning details of the proposed REGNet, and the overall framework is shown in
Figure 1.
3.1. Background
PointNet++ is a pioneering work for 3D point cloud learning tasks. It is a hierarchical neural network that processes a set of points sampled from a metric space in a layered manner, providing powerful feature extraction capabilities. It is therefore often used as the backbone network for downstream tasks.
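To make the hierarchical processing concrete, the sketch below outlines a single set abstraction (SA) step in the style of PointNet++: sample centroids, group neighbors within a radius, and summarize each group with a shared MLP and max pooling. The function names, radius, and dimensions are illustrative assumptions rather than the exact configuration used here; farthest_point_sampling is an assumed helper (a minimal version appears in Section 3.2).

```python
import torch

def set_abstraction(xyz, feats, n_centroids, radius, mlp):
    """One PointNet++-style set abstraction step (simplified sketch).

    xyz: (N, 3) coordinates; feats: (N, C) features; mlp: shared point-wise module.
    """
    centroid_idx = farthest_point_sampling(xyz, n_centroids)  # assumed helper (Section 3.2)
    centroids = xyz[centroid_idx]                             # (M, 3) sampled centers
    dists = torch.cdist(centroids, xyz)                       # (M, N) pairwise distances
    pooled = []
    for d in dists:                                           # ball query around each centroid
        idx = (d < radius).nonzero().squeeze(1)
        if idx.numel() == 0:                                  # fall back to the nearest point
            idx = d.argmin().unsqueeze(0)
        pooled.append(mlp(feats[idx]).max(dim=0).values)      # shared MLP + max-pool per group
    return centroids, torch.stack(pooled)                     # (M, 3), (M, C_out)
```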
The original VoteNet can be summarized into three modules, namely the feature extraction module, the voting module, and the vote aggregation module. The feature extraction module generates seed points and corresponding features based on PointNet++. The voting module regresses object centers from each seed point, and the vote aggregation module groups the votes for each object center and combines the features of the contributing seed points. The object proposals are then classified, and the accurate position and size of each 3D object are regressed from the aggregated features.
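As a reading aid, a toy sketch of this three-stage flow is given below. It assumes the backbone has already produced seed coordinates and features; the layer sizes, the random cluster-center selection, and the box parameterization are our simplifications, not VoteNet's exact design (VoteNet samples cluster centers with FPS and uses a deeper proposal head).

```python
import torch
import torch.nn as nn

class VoteNetSketch(nn.Module):
    """Toy sketch of VoteNet's voting and aggregation stages (backbone output assumed)."""

    def __init__(self, feat_dim=256, n_proposals=128, radius=0.3, n_classes=18):
        super().__init__()
        self.n_proposals, self.radius = n_proposals, radius
        self.vote_mlp = nn.Sequential(                       # voting module
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3 + feat_dim),               # center offset + feature residual
        )
        self.proposal_head = nn.Linear(feat_dim, n_classes + 6)  # class scores + toy box params

    def forward(self, seed_xyz, seed_feats):                 # (N, 3), (N, C) from PointNet++
        out = self.vote_mlp(seed_feats)                      # each seed votes for a center
        vote_xyz = seed_xyz + out[:, :3]
        vote_feats = seed_feats + out[:, 3:]
        # Aggregation: VoteNet samples centers with FPS and groups nearby votes;
        # random centers are used here purely to keep the sketch short.
        centers = vote_xyz[torch.randperm(len(vote_xyz))[: self.n_proposals]]
        mask = (torch.cdist(centers, vote_xyz) < self.radius).unsqueeze(-1)  # (P, N, 1)
        expanded = vote_feats.unsqueeze(0).expand(mask.shape[0], -1, -1)
        cluster_feats = expanded.masked_fill(~mask, float('-inf')).max(dim=1).values
        return self.proposal_head(cluster_feats)             # per-proposal class + box
```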
3.2. Foreground-Aware Module
Current sampling methods, such as FPS and KPS, do not take foreground and background information into account; however, foreground points and background points can contribute differently. For example, if too many points are sampled from a chair, it may be incorrectly classified as a table, as shown in Figure 5a. Foreground points provide rich clues about object shape, such as position and orientation information, which are important for 3D object detection. However, it is also unreasonable to consider only foreground points and completely ignore background points, since background points can provide important contextual information for the final prediction. Most sampling methods, such as the FPS employed in the backbone, are class-agnostic: they randomly sample an initial point and iteratively select the point farthest from the previously selected points as the next point in the subset. Although such methods capture the basic characteristics of the data, they introduce some uncertainty, since the selection is based solely on the spatial distribution of the points and their distance from previously selected points, without considering any class-specific information such as foreground and background.
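For reference, the class-agnostic procedure just described can be written in a few lines; the following is a minimal illustrative implementation, not the optimized CUDA kernel typically used in point cloud backbones.

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Class-agnostic FPS: greedily pick the point farthest from the chosen set.

    xyz: (N, 3) point coordinates; returns indices of the sampled subset.
    """
    n = xyz.shape[0]
    chosen = torch.empty(n_samples, dtype=torch.long)
    chosen[0] = torch.randint(n, (1,))               # random initial point
    dist = torch.full((n,), float('inf'))            # squared distance to nearest chosen point
    for i in range(1, n_samples):
        diff = xyz - xyz[chosen[i - 1]]
        dist = torch.minimum(dist, (diff * diff).sum(dim=1))
        chosen[i] = dist.argmax()                    # farthest from all points chosen so far
    return chosen
```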
Moreover, even sampling methods that can distinguish foreground from background information in point cloud sampling do not consider the interaction between them. Although foreground and background information is distributed in different locations and expresses different content in the scene, there is still a certain correlation between them. For example, the information around a chair can provide background knowledge for classifying the objects within its bounding box.
To address the above problems, we propose a foreground-aware module. For simplicity, the point cloud data in this paper are denoted as $P = \{p_i\}_{i=1}^{N}$. To be specific, there are four SA layers in the standard PointNet++. After the first SA layer, the input point cloud is downsampled to $N_1 = 2048$ points with corresponding features $F = \{f_i\}_{i=1}^{N_1}$, where $f_i \in \mathbb{R}^{C}$. To better separate foreground and background in the obtained point cloud data, this paper employs an additional 2-class segmentation network for foreground and background point acquisition. The segmentation network obtains a 2-dimensional score map through a standard softmax layer, and an argmax operation determines the category to which each point belongs. The detail is as follows:

$$l_i = \operatorname{argmax}\big(\operatorname{softmax}(S(f_i))\big),$$

where $p_i$ denotes a point, $S$ denotes a 2-class point cloud segmentation network, and $f_i$ is the point feature corresponding to $p_i$. After the operation noted above, all 2048 points have a label; points with label 1 are grouped into $P_{fg}$, and points with label 0 are grouped into $P_{bg}$. Based on these two point sets, we apply farthest point sampling to the foreground and background sets, respectively, and combine them into the final sample set, as shown below:

$$P_s = \operatorname{FPS}(P_{fg}) \cup \operatorname{FPS}(P_{bg}).$$

The next issue is how to combine $F_{fg}$ and $F_{bg}$ into the final sample set, which is very important. Ref. [34] combines them directly without considering the contribution of each to the next stage. We instead propose a weighted map over all 2048 sampled points, where each item denotes the contribution of the corresponding point. The weighted map is generated by a 2-layer fully connected network with a softmax layer, and we then build the relationship between $F_{fg}$ and $F_{bg}$ to form the final sample set. The detail is as follows:

$$F_s = (W_{fg} \odot F_{fg}) \oplus (W_{bg} \odot F_{bg}), \qquad W = [W_{fg}; W_{bg}] = \operatorname{softmax}\big(\mathrm{FC}_2(\mathrm{FC}_1(F))\big),$$

where $W$ denotes the learned two-channel weighted map, the operator $\odot$ represents the Hadamard product, $\oplus$ represents the sum operation, and $\mathrm{FC}_2$, $\mathrm{FC}_1$ denote the 2-layer fully connected layers.
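Putting the pieces of this module together, a minimal sketch might look as follows. It reuses the FPS sketch above; the layer widths, the foreground/background sample budget, and the way the two-channel map gates each point are our illustrative assumptions about the design, not a verified implementation.

```python
import torch
import torch.nn as nn

class ForegroundAwareSampler(nn.Module):
    """Sketch: segment fg/bg, FPS each set, then reweight features with a learned map."""

    def __init__(self, feat_dim=256, n_fg=1536, n_bg=512):
        super().__init__()
        self.seg_head = nn.Linear(feat_dim, 2)        # 2-class fg/bg segmentation network S
        self.weight_fc = nn.Sequential(               # 2-layer FC + softmax -> weighted map W
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 2),
        )
        self.n_fg, self.n_bg = n_fg, n_bg

    def forward(self, xyz, feats):                    # xyz: (N, 3), feats: (N, C)
        labels = self.seg_head(feats).softmax(-1).argmax(-1)     # label l_i per point
        fg = (labels == 1).nonzero().squeeze(1)                  # P_fg (assumed non-empty)
        bg = (labels == 0).nonzero().squeeze(1)                  # P_bg (assumed non-empty)
        fg = fg[farthest_point_sampling(xyz[fg], min(self.n_fg, len(fg)))]
        bg = bg[farthest_point_sampling(xyz[bg], min(self.n_bg, len(bg)))]
        idx = torch.cat([fg, bg])                                # final sample set P_s
        w = self.weight_fc(feats[idx]).softmax(-1)               # two-channel weighted map
        gate = torch.where(labels[idx] == 1, w[:, 0], w[:, 1])   # fg channel or bg channel
        return xyz[idx], gate[:, None] * feats[idx]              # Hadamard-weighted features
```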
3.3. Voting-Aware Module
The voting scheme in VoteNet does not consider the relationships between point patches; however, this information is very important for the localization and classification of the final prediction. MLCVNet proposes a patch-to-patch context (PPC) module to capture the relationships between point patches. It adopts the compact generalized non-local network (CGNL) [51] to explicitly build rich correlations between any pair of point patches. The detail is as follows:

$$\mathbf{y} = f\big(\theta(\mathbf{x}), \phi(\mathbf{x})\big)\, g(\mathbf{x}),$$

where $\theta$, $\phi$, and $g$ denote three transform functions, and $f(\cdot,\cdot)$ calculates the similarity between the two positions $a$ and $b$. However, CGNL has only one head and one layer to build the relationships among the vote centers. Moreover, CGNL can only model the vote centers and cannot capture the relationship between the vote centers and the clusters. Additionally, in VoteNet, vote centers and cluster centers are generated through sampling and grouping from seed points, which lacks consideration of the appearance information of objects. RBGNet [34] employs a ray-based feature grouping method, which can learn a better feature representation of the surface geometry of foreground objects; however, the relationship between the vote centers and the clusters has yet to be considered. With the development of transformer techniques, more and more researchers are trying to use transformer schemes to model the interconnections between different modules.
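For concreteness, a simplified, single-head version of the generalized non-local operation above can be sketched as follows; the compact approximations that give CGNL [51] its efficiency are deliberately omitted.

```python
import torch
import torch.nn as nn

class SimpleGeneralizedNonLocal(nn.Module):
    """Simplified y = f(theta(x), phi(x)) g(x); omits CGNL's compact approximation."""

    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim)
        self.phi = nn.Linear(dim, dim)
        self.g = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (N, C) patch features
        affinity = self.theta(x) @ self.phi(x).t()     # f(a, b): similarity of positions a, b
        affinity = affinity / x.shape[0]               # simple normalization
        return x + affinity @ self.g(x)                # aggregate values, add residual
```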
In this paper, we adopt the ray-based feature grouping method proposed in [34] to generate vote centers and clusters, and stacked multi-head self-attention and multi-head cross-attention modules are leveraged to establish the relationship between the vote centers and the clusters. The multi-head self-attention module models the relationships among the votes, and the multi-head cross-attention module models the interaction between the votes and the clusters. After the stacked attention modules, a feed-forward network (FFN) is used to obtain more reasonable transformed features for each object. The structure of this module is shown in Figure 2. Denote the point features of the vote centers as $F_v$ and the cluster features as $F_c$. The self-attention module builds the relationships among the votes, formulated as follows:

$$\tilde{F}_v = \operatorname{MHSA}(Q, K, V), \quad Q = K = V = F_v,$$

and the cross-attention module adopts the point features to compute the object features, formulated as follows:

$$\tilde{F}_c = \operatorname{MHCA}(Q, K, V),$$

where $Q = F_c$ and $K = V = \tilde{F}_v$.
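A sketch of one such stacked layer, built on PyTorch's multi-head attention, is shown below; the feature dimension, head count, and residual placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VoteClusterAttentionLayer(nn.Module):
    """One stacked layer: self-attention over votes, cross-attention from clusters to votes."""

    def __init__(self, dim=288, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))

    def forward(self, vote_feats, cluster_feats):      # (B, Nv, C), (B, Nc, C)
        # Self-attention models relationships among the votes (Q = K = V = votes).
        v, _ = self.self_attn(vote_feats, vote_feats, vote_feats)
        vote_feats = vote_feats + v
        # Cross-attention computes object features from point features
        # (Q = clusters, K = V = votes).
        c, _ = self.cross_attn(cluster_feats, vote_feats, vote_feats)
        cluster_feats = cluster_feats + c
        # FFN produces the transformed object features.
        return vote_feats, cluster_feats + self.ffn(cluster_feats)
```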
3.4. Cluster-Aware Module
VoteNet detects each object's class and bounding box by feeding the generated votes to MLP layers. However, this grouping method lacks consideration of surrounding information. MLCVNet proposes an object-to-object context (OOC) module to combine the features of surrounding objects and provide more information about object relationships. It feeds the grouped vote centers into an MLP with max-pooling to form a single vector representing each cluster, and then introduces a self-attention module to establish relationships between these clusters instead of processing them separately. It adopts the CGNL attention module to generate a weighted map that measures the affinity between all clusters. The detail is as follows:

$$\tilde{f}_c = A(f_c),$$

where $A$ is the CGNL attention module, and $f_c$ is the feature of the $c$-th cluster.

However, as noted above, the CGNL attention module can only model the affinity between clusters, yet the vote centers can also provide useful information. In this paper, we adopt stacked self-attention and cross-attention modules to establish these relationships. The detail of the self-attention module is as follows:

$$\tilde{F}_c = \operatorname{MHSA}(Q, K, V), \quad Q = K = V = F_c,$$

and the cross-attention module is formulated as follows:

$$\hat{F}_c = \operatorname{MHCA}(Q, K, V), \quad Q = \tilde{F}_c, \; K = V = \tilde{F}_v.$$
After the stacked self-attention and cross-attention modules, a feed-forward network (FFN) is utilized to extract transformed features for the final prediction task.
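Analogously to the voting-aware module, this step can be sketched with the same attention primitives, with clusters attending to each other and then drawing context from the vote centers; all shapes below are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 128 clusters and 256 vote centers with 288-dim features.
clusters = torch.randn(1, 128, 288)
votes = torch.randn(1, 256, 288)

self_attn = nn.MultiheadAttention(288, 8, batch_first=True)
cross_attn = nn.MultiheadAttention(288, 8, batch_first=True)

c, _ = self_attn(clusters, clusters, clusters)   # Q = K = V = clusters: inter-cluster affinity
clusters = clusters + c
c, _ = cross_attn(clusters, votes, votes)        # Q = clusters, K = V = votes: vote context
clusters = clusters + c                          # followed by an FFN and the prediction head
```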
By combining these three modules, the final 3D bounding box inference incorporates all surrounding contextual information while also taking into account the distinct contributions of foreground and background points. This results in a more accurate and reasonable final prediction.