REGNet: Ray-Based Enhancement Grouping for 3D Object Detection Based on Point Cloud

Zhou, Feng; Rao, Junkai; Shen, Pei; Zhang, Qi; Qi, Qianfang; Li, Yao

doi:10.3390/app13106098

Open AccessArticle

REGNet: Ray-Based Enhancement Grouping for 3D Object Detection Based on Point Cloud

by

Feng Zhou

^1,*,

Junkai Rao

¹,

Pei Shen

¹,

Qi Zhang

¹,

Qianfang Qi

² and

Yao Li

³

¹

School of Information Science and Technology, North China University of Technology, Beijing 100144, China

²

TravelSky Technology Ltd., Beijing 101318, China

³

State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(10), 6098; https://doi.org/10.3390/app13106098

Submission received: 14 April 2023 / Revised: 7 May 2023 / Accepted: 15 May 2023 / Published: 16 May 2023

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

:

Currently, 3D objects are usually represented by 3D bounding boxes. Much research work has focused on detecting 3D objects directly from point clouds, and significant progress has been made in this field. However, we find there are there is still room for improvement in three aspects. First is point cloud feature extraction. Many successful methods are based on PointNet/PointNet++, which uses multi-layer perceptrons (MLP) to extract features to generate seed points, without considering foreground and background clues. The second aspect is grouping. The “vote-based cluster” grouping method defined by the pioneering VoteNet ignores shape information that is very important in the object detection field. The final aspect is the modeling ability of grouped clusters. Most successful methods treat grouped clusters separately, regardless of their different contributions to the final detection. To address these challenges, we propose three modules to address them: the foreground-aware module, the voting-aware module, and the cluster-aware module. Extensive experiments on two large datasets of real 3D scans, ScanNet and SUN RGB-D, demonstrate the effectiveness of our method for 3D object detection on point clouds.

Keywords:

3D object detection; point cloud; Hough voting; ray-based detection; indoor scene

1. Introduction

Three-dimensional point cloud data provide precise geometric and spatial information, which is crucial for computer vision applications such as autonomous driving [1,2], augmented reality [3,4], and domestic robots [5,6]. The development of commercial depth cameras and LiDARs has made this task an active research field, attracting increasing attention. The goal of this task is to obtain 3D bounding box orientation and semantic label information for each object in the input scene. However, detecting objects in point clouds may be more challenging than in 2D due to their non-structured nature. Although current CNN networks have been proven to be very effective in 2D object detection, applying these CNN networks directly to point clouds is very difficult.

Compared to grid-structured images, 3D point clouds can provide pure shape and geometric information, unaffected by lighting and reflectance. However, point clouds are irregular, disordered, and sparse, thus how to apply the current successful CNN methods to 3D point clouds is the main challenge. Early work attempted to transform point clouds into “grid” structured data and process them using 2D-based detectors. Song [7] proposed a 3D ConvNet formulation for deep sliding shapes with a 3D region proposal network and an object recognition network to obtain 3D bounding boxes of the input scene. Hou [8] extended the standard 2D object detection framework R-CNN [9,10] to 3D. The above methods try to project point clouds into voxelization spaces, which would obtain a plausible performance; however, this kind of method suffers from high computation costs due to expensive 3D convolutions. Alternatively, some researchers have focused on projecting point clouds into a bird’s-eye-view space [11,12]. However, bird’s-eye-view methods sacrifice depth in the geometric clues that are crucial for clustering indoor scenes, which decrease detection performance. Some other researchers have adopted two-stage cascade approaches to address the issues [13,14]. In the first stage, 2D object detectors will be used to obtain 2D object bounding boxes. Then, in the second stage, these 2D bounding boxes will be extruded to form the final 3D bounding boxes.

Another technical branch is point-based methods. Point clouds have emerged as a great powerful representation for 3D deep learning tasks, such as classification [15,16,17,18,19,20], semantic segmentation [21,22,23], point cloud normal estimation [24], 3D reconstruction [25,26,27], and 3D object detection [28,29,30,31]. Most of these works adopt raw point clouds to extract expressive representations based on pioneering work PointNet/PointNet++ [15,16].

The first end-to-end point-based work, VoteNet [32], utilizes PointNet++ as the backbone to extract point features, followed by a neural network that reproduces the classic traditional Hough voting scheme. It consists of three main components: a PointNet++-based point feature extraction module, a voting module, and a cluster generation module for object proposal and classification. Based on these three modules, many subsequent works attempt to improve performance by enhancing or modifying each module. Some works propose a combined MLP (CMLP) [33] and attention MLP (AMLP) [28] to enhance the modelling ability of the PointNet++ backbone. Some works consider the voting and clustering generation module not powerful enough, such as MLCVNet [29], which proposes a patch-to-patch context module and object-to-object context modules to capture contextual information to obtain more accurate vote centers and clusters, respectively. RBGNet [34] proposes a ray-based feature grouping module to improve the grouping scheme of VoteNet as well as a foreground-based feature extraction module to enhance feature representation ability.

Compared with VoteNet, although the aforementioned methods can achieve many improvements, we find that there is still room for improvement in these three modules. For feature extraction in the backbone network, most methods focus on obtaining more reasonable feature representations. Although RBGNet [34] proposes a foreground-based module to consider different contributions of the foreground points and background points in the input point clouds, it did not consider the different contributions of foreground and background points in the input point cloud, and it did not take into account the relationship between foreground and background points. There is also much work to be done to improve the voting generation part, such as RBGNet [34], which utilizes a ray-based module to learn better representations of object shape. However, the authors did not consider the relationship among the votes on each ray. For the cluster generation and classification part, most of the works ignore the surrounding clusters’ contextual information, which is crucial for the current clustering classification.

To address the aforementioned issues, this paper discusses the task of indoor scene 3D object detection via point cloud. Specifically, we propose a foreground-based module to calculate the different contributions of the foreground and background points in the backbone. In this paper, we adopt PointNet++ as our backbone network. In this module, we produce a two-channel weighted map for both foreground and background points separately. Towards the second problem, inspired by [34], we introduce a vote-aware module to simulate the spatial relationship between votes along rays. For the third issue, we propose a cluster-aware module to build the spatial dependencies among clusters to utilize the rich contextual information for the final 3D bounding box classification. With these three modules, we propose a unified ray-based enhancement network (REGNet) to incorporate into VoteNet for 3D object detection.

The framework of REGNet is shown in Figure 1; the three main modules are shown in red font bounding boxes. In summary, the contributions of this paper include:

We propose a ray-based enhancement 3D object detection network that exploits contextual information at foreground background, voting patch, and cluster.
We design three sub-modules, involving a foreground-aware module, a voting-aware module, and a cluster-aware module. The new modules nicely fit into thde VoteNet framework.
Experiments on ScanNet V2 and SUN RGB-D datasets demonstrate the effectiveness and superiority of the proposed modules in improving detection accuracy.

The rest of the article is organized as follows. Section 2 briefly reviews the most relevant work. Section 3 provides detailed information on the proposed REGNet. We present experimental results and in-depth analysis in Section 4. Finally, the conclusions are drawn in Section 5.

2. Related Work

Grid Projection/Voxelization-based Detection. 3D object detection is a challenging task due to the irregular, sparse, and orderless characteristics of 3D points. Most current existing work could be classified into three categories in terms of point cloud representations, i.e., voxel-based, bird’s-eye-view-based, and point-based. Thanks to the success of the deep neural network, marvelous progress has been achieved in 2D object detection. However, the 2D object detection task ignores the depth information, which is important to understand the whole scene. Early 3D object detection methods project point clouds to 2D grids [35,36,37] or 3D voxels [8,38], so that the most successful convolutional networks could be directly applied. Concerning bird’s-eye-view based methods, most of them are applied to autonomous driving. In outdoor scenes, most objects are distributed on the same plane, so there is little mutual occlusion of objects in the top-down view. However, in indoor scenes, many objects are on top of each other, such as photos on walls or tables covering sofas. Therefore, some works project the point cloud onto a frontal view and use 2D ConvNets to tackle the problem. However, self-occlusion of objects in indoor scenes poses many challenging issues to address. Voxel-based methods transform the point cloud into 3D voxels, which have been shown to yield more reasonable performance. However, these methods suffer from high memory and computational costs, as well as quantization errors.

Point based-Detection. To tackle the problems noted above, recently, most methods process point clouds directly for 3D object detection. Due to the point clouds being irregular and sparse, how to extract feature representations from them is the core task of these methods. The pioneering works PointNet/PointNet++ [15,16] provide a powerful and robust backbone for point cloud based tasks. VoteNet [32] adopts PointNet++ as the backbone and reproduces a Hough voting strategy to propose an end-to-end framework for 3D object detection. This is the first end-to-end framework based on point clouds for this task. There are numerous successors based on this work, such as BGNet [39], which further improves the traditional Hough voting mechanism [40]. The authors proposed a back-tracing strategy that generatively backtracked representative points from the vote centers and then revisited the seed point. H3DNet [30] and MCGNet [41] recognize the lack of modeling ability with only feature extraction from a single backbone branch. They utilized a four-way backbone to extract more plausible feature representations. MLCVNet [29] proposes three sub-modules to capture multi-level contextual information in point cloud data to boost performance. VENet [28] improves the voting procedure in “before, during and after” stages to address the limitations of current voting schemes.

Attention Mechanism/Transformer-based Detection. The transformer scheme is the dominant network architecture for the tasks of neural language of processing (NLP). Due to the powerful ability of feature modelling, it is applied in the field of 2D image recognition [42,43,44]. Most recently, many works apply the transformer scheme into 3D object detection [45,46]. GFNet [31] proposes a group-free strategy that adopts a powerful transformer module to replace the proposal head in VoteNet [32]. In this paper, we also resorts to the transformer scheme to build spatial context dependencies among different vote centers and clusters to boost prediction performance.

Point Cloud Sampling. Since the point cloud is sparse and irregular, it cannot be calculated by conventional grid-based methods. Therefore, in order to facilitate training and inference, sampling operation [47,48,49] plays a very key role in the task of point cloud analysis. Examples include farthest point sampling (FPS) and k-closest points sampling (KPS), which have been widely leveraged in the task of object detection in point clouds. However, these downsampling strategies are class-agnostic, treating all points equally without considering their different contributions. Therefore, some redundant points may inadvertently be retained, while important information may be lost after downsampling. 3DSSD [50] proposed F-FPS sampling strategy based on feature distance to preserve interior points. BGRNet [34] proposes a foreground-biased sampling to keep the foreground point during sampling. However, these methods still do not consider the spatial relationship between foreground points and background points. To address this issue, we proposed a foreground-aware module to sample more points on the foreground and establish relationships among the sampled points.

3. Our Approach

The proposed REGNet is based on VoteNet and inspired by the RGBNet for the ray-aware module. This method aims to establish relationships between foreground and background points to enhance seed point representation capability as well as model spatial relationships between voting centers on rays to obtain better object feature representation. Furthermore, it computes contextual information for all clusters to improve the final prediction performance. To achieve this goal, we have developed three new modules based on VoteNet, namely the foreground-aware module, voting-aware module, and cluster-aware module. This section elaborates on the learning details of the proposed REGNet, and the overall framework is shown in Figure 1.

3.1. Backgrounds

PointNet++ is a pioneering work applied in 3D point cloud learning tasks. It is a hierarchical neural network that processes a set of points sampled in metric space in a layered manner. This work provides more powerful feature extraction capabilities. Therefore, it is often used as a backbone network for downstream tasks.

The original VoteNet can be summarized as three modules, namely the feature extraction module, voting module, and voting aggregation module. The feature extraction module generates seed points and corresponding features based on PointNet++. The voting module regresses object centers from each seed point, and the voting aggregation module combines features from different seed points to vote for the object center. Then, the object proposals are classified and the accurate position and size of the 3D object are regressed based on the aggregated features.

3.2. Foreground-Aware Module

Current sampling methods, such as FPS and KPS, lack consideration for foreground and background information. However, sometimes foreground points and background points can provide different contributions. For example, if too many points are sampled from a chair, it may result in incorrect classification as a table, as shown in Figure 5a. Foreground points provide rich clues about object shape, such as position and orientation information, which are important for 3D object detection tasks. However, it is also unreasonable to completely ignore background points while only considering foreground points. The background points could provide some important contextual information for the final prediction. Most sampling methods, such as FPS employed in the backbone, are class-agnostic. They do not consider foreground and background, but rather randomly sample an initial point from the dataset and iteratively select the furthest point from the previously selected point as the next point in the subset. Although these methods can capture the basic characteristics of the data, they bring about some uncertainty of information since the selection of points is solely based on the properties of the dataset, rather than any specific location information such as foreground and background. This is because the selection is based on the spatial distribution of points and their distance from previously selected points, without considering any class-specific information.

Although the FPS method can effectively solve the problem of distinguishing foreground and background information in point cloud sampling, it does not consider the interaction between different foreground and background information. Although foreground and background information is distributed in different locations and expresses different information in the scene, there is still a certain correlation between them. For example, the information around a chair can provide certain background knowledge for the classification of objects within the bounding box.

To address the above problem, we propose a foreground-aware module. For simplicity, the point cloud data in this paper are denoted as

P

= {p_{1}, \dots, p_{N}} \in R^{N \times 3}

. To be specific, there are four SA layers in the standard PointNet++. After the first SA layer, the input point cloud is downsampled to

P = {p_{1}, \dots, p_{2048}} \in R^{2048 \times 3}

points and corresponding features

f_{i}^{128}

, where

i = {1, 2, \dots, 2048}

. To better separate foreground and background from the obtained point cloud data, this article employs an additional 2-class segmentation network for foreground and background point acquisition. The segmentation network obtains a 2-dimensional score map through a standard softmax network, and the argmax operation can be used to determine the category to which each point belongs. The detail is as follows:

\begin{matrix} S e g (p_{i}, f_{i}^{128}) = 1, i f : a r g m a x (s o f t m a x (S e g (\cdot))) = 1 \\ S e g (p_{i}, f_{i}^{128}) = 0, i f : a r g m a x (s o f t m a x (S e g (\cdot))) = 0 \end{matrix}

(1)

where

p_{i}

denotes a point,

S e g (\cdot)

denotes a 2-class point cloud segmentation network, and

f_{i} \in f_{i}^{128}

is the the point feature corresponding to

p_{i}

. After the operation noted above, all 2048 points have a label; the point with label 1 is grouped to

P^{f o r e}

, and point with label 0 is grouped to

P^{b a c k}

. Based on these two point clouds, we applied farthest point sampling to the foreground and background sets, respectively, and combined them into the final sample set, as shown below:

\begin{matrix} P^{\overset{ˇ}{f o r e}} = F P S (P^{f o r e}), P^{\overset{ˇ}{b a c k}} = F P S (P^{b a c k}) \end{matrix}

(2)

The next issue is how to combine

P^{\overset{ˇ}{f o r e}}

and

P^{\overset{ˇ}{b a c k}}

into the final sample set, which is very important. Ref. [34] combines them directly without considering each contribution to the next stage. We propose a weighted map for all 2048 sampled points. Each item in the weighted map denotes the contribution of each point. The weighted map is generated by a 2-layer fully connected layer with a softmax layer, and then we build the relationship of the

P^{\overset{ˇ}{f o r e}}

and

P^{\overset{ˇ}{b a c k}}

to form the final sample set. The detail is as follows:

\begin{matrix} w p = softmax ({FC}_{2} ({FC}_{1} (P^{\overset{ˇ}{f o r e}}, P^{\overset{ˇ}{b a c k}}))) \\ P^{f i n a l} = (P^{\overset{ˇ}{f o r e}} \oplus P^{\overset{ˇ}{b a c k}}) ⊙ w p \end{matrix}

(3)

where

w p

denotes the learned weighted map, the operator ⊙ represents the Hadamard product, ⊕ represents sum operation, and FC₂, FC₁ denote the 2-layer fully connected layer.

3.3. Voting-Aware Module

The vote scheme in the VoteNet does not consider relationships between point patches; however, this information is very important to the location and classification of the final prediction. MLCVNet proposes patch-patch context (PPC) to capture the relationships between point patches. It adopts the compact generalized non-local network (CGNL) [51] to explicitly build rich correlations between any pair of point patches. The detail is as follows:

\begin{matrix} R e = s i m (θ (f), ϕ (f)) g (f) \end{matrix}

(4)

where

θ (\cdot)

,

ϕ (\cdot)

,

g (\cdot)

denote three transform functions, and

s i m (a, b)

calculates the similarity between the two positions of a and b. However, CGNL only has one head and one layer to build the relationship among the vote centers. Moreover, CGNL can only model the vote center and cannot consider the relationship between the vote center and the clusters. Additionally, in VoteNet, vote centers and cluster centers are generated through sampling and grouping from seed points. However, this method lacks consideration for the appearance information of objects. For [34], a ray-based feature grouping method is employed, which can learn a better feature representation of the surface geometry of foreground objects. However, the relationship between the vote center and the cluster has yet to be considered. With the development of transformer technology, more and more researchers are trying to use a transformer scheme to model the interconnections between different modules.

In this paper, we adopt the ray-based feature grouping method proposed in [34] to generate vote centers and clusters, and the stacked multi-head self-attention and multi-head cross-attention modules are leveraged to establish the relationship between the voting center and the cluster. The multi-head self-attention module is used to simulate the relationship between votes, and the multi-head cross-attention module is used to simulate the interaction between votes and clusters. After stacking the attention modules, a feed-forward network (FFN) is used to obtain more reasonable transformation features for each object. The structural description of this module is shown in Figure 2. Denote the point features of the voting center as

{f_{v}}_{i = 1}^{M}

and the cluster feature as

{f_{c}}_{i = 1}^{K}

. The self-attention module build the relationship between votes, formulated as follows:

\begin{matrix} s e l f - a t t (θ (f_{v}), ϕ (f_{v})) g (f_{v}) \end{matrix}

(5)

and the cross-attention module adopts point features to compute object features, formulated as follows:

\begin{matrix} c r o s s - a t t (f_{v_{i}}, f_{c_{i}}) \end{matrix}

(6)

where

f_{v_{i}} \in {f_{v}}_{i = 1}^{M}

and

f_{c_{i}} \in {f_{c}}_{i = 1}^{K}

.

3.4. Cluster-Aware Module

VoteNet detects each object class and bounding box by feeding the generated votes to MLP layers in the scene. However, this grouping method lacks consideration of surrounding information. MLCVNet proposes object–object context (OOC) to combine features of surrounding objects to provide more information about object relationships. It inputs the grouped voting centers to an MLP with max-pooling to form a single vector representing the cluster, and then introduces a self-attention module to establish relationships between these clusters instead of processing them separately. It adopts the CGNL attention module to generate a weighted map to calculate the affinity between all clusters. The detail is as follows:

\begin{matrix} R e_{O O C} = s e l f - a t t (\underset{i = 1, 2, \dots, n}{m a x} {M L P (c_{i})}) \end{matrix}

(7)

where

s e l f - a t t (\cdot)

is the CGNL attention module, and

c_{i}

is the feature of the cth cluster.

However, as noted above, the CGNL attention module can only simulate the affinity between clusters. Yet, a voting center can also provide useful information. In this paper, we adopt stacked self-attention and cross-attention modules to establish relationships. The detail of the self-attention module is as follows:

\begin{matrix} s e l f - a t t (θ_{1} (f_{v}), ϕ_{1} (f_{v})) g_{1} (f_{v}) \end{matrix}

(8)

and the cross-attention module is formulated as follows:

\begin{matrix} c r o s s - a t t (f_{c_{i}}, f_{v_{i}}) \end{matrix}

(9)

After the stacked self-attention and cross-attention modules, a feed-forward neural network (FFN) is utilized to extract transformation features for the final prediction task.

By combining these three modules, the final 3D bounding box inference incorporates all surrounding contextual information while also taking into account the distinct contributions of foreground and background points. This results in a more accurate and reasonable final prediction.

4. Experiments and Discussions

4.1. Dateset

We evaluated our REGNet on two large datasets, ScanNet V2 and SUN RGB-D, both of which have been captured from real 3D scans. ScanNet V2 is a well-known public indoor scene dataset that comprises 1513 indoor scenes with detailed annotations, including instance points, semantic labels for 40 categories, and 3D bounding boxes categorized into 18 categories.

SUN RGB-D contains 10,335 indoor RGB and depth images with per-point semantic labels and object bounding boxes. The data in this dataset were captured by four consumer-grade depth cameras, namely Intel Realsense, Asus Xtion, Kinect v1, and Kinect v2. A total of 5285 pairs of data are used for training, and the rest are used for validation. Prior to training and validation, each RGB-D scene is first converted into a point cloud representation by aligning the RGB and depth images and then projecting the depth information onto the RGB image.

4.2. Training Details

Our network is optimized end-to-end using the Adam optimizer with a batch size of 8. The base learning rate for the ScanNet V2 dataset is set to 0.01 and for the SUN RGB-D dataset it is set to 0.001. The network is trained for 350 epochs on ScanNet V2 and 300 epochs on SUN RGB-D. The learning rate decay steps are set to {100, 200, 300} for ScanNet and {100, 200, 250} for SUN RGB-D, and the decay rates are {0.1, 0.1, 0.1} for both datasets. Gradnorm clipping was used to ensure stable training dynamics. Due to unstable and difficult-to-train transformer blocks in the voting-aware module and the cluster-aware module, we use the same setting as in [31], performing three training runs for each experiment and also testing each training experiment three times. To illustrate the randomness of the algorithm, the average performance of nine trials is reported. All experiments are implemented by the PyTorch platform using RTX P6000 GPU.

Following [32], we report the detection performance on the validation set by calculating the average precision (AP) at IoU thresholds of 0.25 ([email protected]) and 0.5 ([email protected]) in 3D space. The detection performance and average results are displayed for each class. As shown in [32], there are four set abstraction (SA) layers in the network. In this paper, we use the same approach as [34], using a foreground bias sampling (FBS) module before each SA layer to enhance the network’s ability to distinguish foreground and background points during the learning process.

4.3. Comparisons with the State-of-the-Art Methods

Experiments with ScanNet V2. To demonstrate the effectiveness of the proposed REGNet, we compared it with the most recent state-of-the-art techniques. The results are summarized in Table 1. As shown, our proposed REGNet outperforms its baseline VoteNet and achieves comparable results with the current state-of-the-art techniques. Furthermore, REGNet performs the best in 8 out of 18 categories, which is almost three times more than the second-ranked GFNet, which only achieves the best results in 3 categories. This indicates that the three proposed enhancement modules can effectively improve subsequent object localization and classification tasks.

We also provide the qualitative comparisons in Figure 3. In the figure, we give the raw point cloud, the visualization results of VoteNet, the current GFNet, the visualization results of our REGNet, and the corresponding ground truth. To ensure a fair comparison, we only compare the results of methods using only 3D geometric information. The comparison results show that our approach can obtain more accurate location results for objects such as beds and sofas, especially the bounding boxes indicated by the red dashed lines in the figure.

Experiments with SUN RGB-D. We also evaluate our REGNet on the SUN RGB-D dataset; the quantitative results are provided in Table 2. In it, we compare our method with many other state-of-the-art methods, such as VoteNet, H3DNet, and many others on the SUN RGB-D benchmark. From the comparison results, we find that our method achieves the best performance in six classes, whereas almost all of the rest achieve the second-best performance. We also illustrate the qualitative comparisons of SUN RGB-D in Figure 4. In the figure, from left to right, are the image, raw point cloud, the prediction results of VoteNet and GFNet, our results, and the corresponding ground truth.

4.4. Ablation Study and Qualitative Results

In this section, a set of ablative studies are conducted on the ScanNet V2 and SUN RGB-D datasets to demonstrate the effectiveness of the three proposed modules of our REGNet.

Effectiveness of the foreground-aware module. To demonstrate the effectiveness of the foreground-aware module, we provide its performance across different architecture methods, as shown in Table 3. In the experiments, we replace the standard backbone (PointNet++) in three state-of-the-art architectures: VoteNet, H3DNet, and GFNet (L6, O256). Due to the fact that VoteNet and GFNet both use a branch of PointNet++ to extract feature representations, we only replace the backbone part. For H3DNet, there are four branches in the backbone used to extract four different geometric primitives, and we replaces all four branches.

To better validate the effectiveness of the module, we visualize the results of VoteNet with and without the module, as shown in Figure 5. From the visual comparison of (a) and (b), it can be observed that without the module, the localization of object bounding boxes is prone to error, and misclassification can occur. With the module, both localization and classification have been improved accordingly.

Effectiveness of the voting-aware module. To validate the advantages of the voting-aware module, we provide visual results with and without using the module on the validation dataset of ScanNet V2, as shown in Figure 6. In Figure 6a, we visualize some voting centers, shown as green points around the table and red points on the table. In VoteNet, the relationship between voting centers is not considered, resulting in many incorrect bounding boxes. However, when the voting perception module is used in VoteNet, the boundary boxes of the table are more concise, and the predicted results are more reasonable.

Effectiveness of the cluster-aware module. Finally, to demonstrate the effectiveness of our cluster-aware module, we suggest conducting experiments on ScanNet V2 with and without the module. The qualitative results are shown in Figure 7. From left to right, we provide the raw point cloud, the visualization results of VoteNet, the results of VoteNet with this module, and the ground truth. From the raw visualization, we can see that the object within the red dashed circle on the right and the chair within the red dashed circle on the left appear very similar in color and shape. VoteNet is unable to distinguish between them. With this module, we believe that it can extract more reasonable feature representations based on more surrounding information and provide more reasonable prediction results.

Quantitative depth analysis. In Table 4, we show detailed ablation studies on the proposed REGNet model on the ScanNet V2 dataset. First, we provided the results of VoteNet on ScanNet V2, which is 58.6% in [email protected]. To demonstrate the effectiveness of our proposed modules, we conduct ablation experiments on the VoteNet model. We incrementally add the modules one by one and compare the results to the baseline. As shown in Table 4, the results clearly indicate that all three proposed modules improve the final prediction results. Furthermore, we compare our proposed method to other state-of-the-art methods on the ScanNet V2 and SUN RGB-D datasets. The quantitative comparison results are shown in Table 1 and Table 2. Our method achieve a 12.1 improvement on the ScanNet V2 dataset based on VoteNet and a 6.4 improvement on the SUN RGB-D dataset compared to VoteNet. These results demonstrate that the proposed modules are effective for improving 3D object detection tasks. We believe that the inclusion of these modules can significantly enhance the performance of existing models.

Although our method has achieved significant benefits compared to VoteNet, we have not increased the model size. VoteNet has a model size of 11.2 M, and ours is 37.6 M. Compared to H3DNet and MCGNet, which are four-branch networks, our method is smaller yet more accurate. We provide detailed comparisons of backbone and model sizes in Table 5 to demonstrate the efficiency and compactness of our model.

To better visualize the effectiveness of the three proposed modules in this paper, we provide a comparison of our method and two classic methods, VoteNet and VENet, on the ScanNet V2 dataset in terms of category-wise performance, as shown in Figure 8. We also include a comparison of our method and VoteNet, as well as ImVoteNet, on the SUN RGB-D dataset in terms of the [email protected] criteria to further demonstrate the effectiveness and stability of our proposed method, as shown in Figure 9. Through comparison, we find that although our method decreases performance in some categories, there is an increase in performance to varying degrees in the majority of categories.

The realistic inference speed of our method is competitive with other state-of-the-art methods. The results are shown in Table 6. Our method achieves better performance with a competitive speed.

5. Conclusions

In this paper, we propose a novel REGNet to tackle the challenging task of 3D object detection in indoor scenes. Our method is inspired by VoteNet, based on which we propose the foreground-aware module, voting-aware module, and cluster-aware module to improve the detection performance. The foreground-aware module establishes relationships between foreground and background points to obtain more expressive seed point features. The voting-aware module proposes ray-based voting with a transformer strategy to calculate the contributions of each vote centers. The cluster-aware module is used to adaptively aggregate multiple clusters to achieve better performance. Qualitative and quantitative experimental results on the public ScanNet V2 and SUN RGB-D datasets show that our REGNet achieves reasonable performance improvement in 3D object detection.

Author Contributions

Conceptualization, F.Z. and J.R.; methodology, F.Z.; software, Q.Z.; validation, P.S. and Q.Z.; formal analysis, F.Z. and Q.Q.; investigation, F.Z.; resources, F.Z. and Q.Q.; data curation, J.R.; writing—original draft preparation, F.Z.; writing—review and editing, Q.Q. and Y.L.; visualization, Q.Z.; supervision, F.Z.; project administration, F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by the R&D Program of Beijing Municipal Education Commission (KM202310009002), and in part by Beijing Natural Science Foundation (4232023), and in part by BeiHang University Yunnan Innovation Institute Yunding Technology Plan (2021) of Yunnan Provincial Key RD Program (202103AN080001-003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

The authors thank the editor and the anonymous reviewers for their valuable suggestions. The authors also thank Yukun Lai, Paul L. Rosin, and Ju Dai for their suggestions to and discussion of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

Bansal, M.; Krizhevsky, A.; Ogale, A. Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst. arXiv 2018, arXiv:1812.03079. [Google Scholar]
Wang, D.; Devin, C.; Cai, Q.Z.; Krähenbühl, P.; Darrell, T. Monocular plan view networks for autonomous driving. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems, Macau, China, 3–8 November 2019; pp. 2876–2883. [Google Scholar]
Azuma, R.T. A survey of augmented reality. Presence Teleoperators Virtual Environ. 1997, 6, 355–385. [Google Scholar] [CrossRef]
Billinghurst, M.; Clark, A.; Lee, G. A survey of augmented reality. Found. Trends Hum. Comput. Interact. 2015, 8, 73–272. [Google Scholar] [CrossRef]
Wang, H.; Wang, W.; Zhu, X.; Dai, J.; Wang, L. Collaborative visual navigation. arXiv 2021, arXiv:2107.01151. [Google Scholar]
Zhu, Y.; Mottaghi, R.; Kolve, E.; Lim, J.J.; Gupta, A.; Fei-Fei, L.; Farhadi, A. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation, Singapore, 29 May–3 June 2017; pp. 3357–3364. [Google Scholar]
Song, S.; Xiao, J. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 808–816. [Google Scholar]
Hou, J.; Dai, A.; Nießner, M. 3d-sis: 3d semantic instance segmentation of rgb-d scans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4421–4430. [Google Scholar]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef] [PubMed]
Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915. [Google Scholar]
Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
Lahoud, J.; Ghanem, B. 2d-driven 3d object detection in rgb-d images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4622–4630. [Google Scholar]
Qi, C.R.; Liu, W.; Wu, C.; Su, H.; Guibas, L.J. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 918–927. [Google Scholar]
Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
Wang, P.S.; Liu, Y.; Guo, Y.X.; Sun, C.Y.; Tong, X. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Trans. Graph. 2017, 36, 1–11. [Google Scholar] [CrossRef]
Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph cnn for learning on point clouds. ACM Trans. Graph. 2019, 38, 1–12. [Google Scholar] [CrossRef]
Xie, S.; Liu, S.; Chen, Z.; Tu, Z. Attentional shapecontextnet for point cloud recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4606–4615. [Google Scholar]
Su, H.; Jampani, V.; Sun, D.; Maji, S.; Kalogerakis, E.; Yang, M.H.; Kautz, J. Splatnet: Sparse lattice networks for point cloud processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2530–2539. [Google Scholar]
Tatarchenko, M.; Park, J.; Koltun, V.; Zhou, Q.Y. Tangent convolutions for dense prediction in 3d. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3887–3896. [Google Scholar]
Klokov, R.; Lempitsky, V. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 863–872. [Google Scholar]
Atzmon, M.; Maron, H.; Lipman, Y. Point convolutional neural networks by extension operators. arXiv 2018, arXiv:1803.10091. [Google Scholar] [CrossRef]
Fan, H.; Su, H.; Guibas, L.J. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 605–613. [Google Scholar]
Pham, Q.H.; Nguyen, T.; Hua, B.S.; Roig, G.; Yeung, S.K. Jsis3d: Joint semantic-instance segmentation of 3d point clouds with multi-task pointwise networks and multi-value conditional random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8827–8836. [Google Scholar]
Tatarchenko, M.; Dosovitskiy, A.; Brox, T. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2088–2096. [Google Scholar]
Xie, Q.; Lai, Y.K.; Wu, J.; Wang, Z.; Lu, D.; Wei, M.; Wang, J. Venet: Voting enhancement network for 3d object detection. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3712–3721. [Google Scholar]
Xie, Q.; Lai, Y.K.; Wu, J.; Wang, Z.; Zhang, Y.; Xu, K.; Wang, J. Mlcvnet: Multi-level context votenet for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10447–10456. [Google Scholar]
Zhang, Z.; Sun, B.; Yang, H.; Huang, Q. H3dnet: 3d object detection using hybrid geometric primitives. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 311–329. [Google Scholar]
Liu, Z.; Zhang, Z.; Cao, Y.; Hu, H.; Tong, X. Group-free 3d object detection via transformers. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2949–2958. [Google Scholar]
Qi, C.R.; Litany, O.; He, K.; Guibas, L.J. Deep hough voting for 3d object detection in point clouds. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9277–9286. [Google Scholar]
Huang, Z.; Yu, Y.; Xu, J.; Ni, F.; Le, X. Pf-net: Point fractal network for 3d point cloud completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7662–7670. [Google Scholar]
Wang, H.; Shi, S.; Yang, Z.; Fang, R.; Qian, Q.; Li, H.; Schiele, B.; Wang, L. Rbgnet: Ray-based grouping for 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1110–1119. [Google Scholar]
Philion, J.; Fidler, S. Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3d proposal generation and object detection from view aggregation. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 1–8. [Google Scholar]
Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep continuous fusion for multi-sensor 3d object detection. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 641–656. [Google Scholar]
Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
Cheng, B.; Sheng, L.; Shi, S.; Yang, M.; Xu, D. Back-Tracing Representative Points for Voting-Based 3D Object Detection in Point Clouds. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 8963–8972. [Google Scholar]
Hough, P.V. Machine analysis of bubble chamber pictures. In Proceedings of the International Conference on High Energy Accelerators and Instrumentation, CERN, Geneva, Switzerland, 14–19 September 1959; pp. 554–556. [Google Scholar]
Chen, K.; Zhou, F.; Dai, J.; Shen, P.; Cai, X.; Zhang, F. MCGNet: Multi-Level Context-aware and Geometric-aware Network for 3D Object Detection. In Proceedings of the IEEE International Conference on Image Processing, Bordeaux, France, 16–19 October 2022; pp. 1846–1850. [Google Scholar]
Hu, H.; Zhang, Z.; Xie, Z.; Lin, S. Local relation networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3464–3473. [Google Scholar]
Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3d object detection with pointformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7463–7472. [Google Scholar]
Misra, I.; Girdhar, R.; Joulin, A. An end-to-end transformer model for 3d object detection. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2906–2917. [Google Scholar]
Dovrat, O.; Lang, I.; Avidan, S. Learning to sample. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2760–2769. [Google Scholar]
Lang, I.; Manor, A.; Avidan, S. Samplenet: Differentiable point cloud sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7578–7588. [Google Scholar]
Nezhadarya, E.; Taghavi, E.; Razani, R.; Liu, B.; Luo, J. Adaptive hierarchical down-sampling for point cloud classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12956–12964. [Google Scholar]
Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048. [Google Scholar]
Yue, K.; Sun, M.; Yuan, Y.; Zhou, F.; Ding, E.; Xu, F. Compact generalized non-local network. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
Li, Y.; Ma, L.; Tan, W.; Sun, C.; Cao, D.; Li, J. GRNet: Geometric relation network for 3D object detection from point clouds. ISPRS J. Photogramm. Remote Sens. 2020, 165, 43–53. [Google Scholar] [CrossRef]
Griffiths, D.; Boehm, J.; Ritschel, T. Finding your (3D) center: 3D object detection using a learned loss. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 70–85. [Google Scholar]
Du, H.; Li, L.; Liu, B.; Vasconcelos, N. SPOT: Selective point cloud voting for better proposal in point cloud object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 230–247. [Google Scholar]
Gwak, J.; Choy, C.; Savarese, S. Generative sparse detection networks for 3d single-shot object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 297–313. [Google Scholar]
Chen, J.; Lei, B.; Song, Q.; Ying, H.; Chen, D.Z.; Wu, J. A hierarchical graph network for 3d object detection on point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 392–401. [Google Scholar]
Zhao, N.; Chua, T.S.; Lee, G.H. Sess: Self-ensembling semi-supervised 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11079–11087. [Google Scholar]
Najibi, M.; Lai, G.; Kundu, A.; Lu, Z.; Rathod, V.; Funkhouser, T.; Pantofaru, C.; Ross, D.; Davis, L.S.; Fathi, A. Dops: Learning to detect 3d objects and predict their 3d shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11913–11922. [Google Scholar]
Zheng, Y.; Duan, Y.; Lu, J.; Zhou, J.; Tian, Q. HyperDet3D: Learning a Scene-conditioned 3D Object Detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5585–5594. [Google Scholar]
Qi, C.R.; Chen, X.; Litany, O.; Guibas, L.J. Imvotenet: Boosting 3d object detection in point clouds with image votes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4404–4413. [Google Scholar]

Figure 1. The overall architecture of our REGNet model. REGNet is built based on VoteNet with three new modules: (1) the foreground-aware module aims to automatically weight foreground and background features for obtaining better seed point features; (2) the voting-aware module concentrates on weighting different votes on the rays to produce better clusters; (3) the cluster-aware module is utilized for adaptively gathering different clusters for the final prediction. Best viewed in color.

Figure 2. The architecture of the voting-aware module. The seed points and vote centers are obtained from the previous stage. Best viewed in color.

Figure 3. The qualitative comparison results of 3D object detection on ScanNet V2 are presented. We provide a comparison between VoteNet and GFNet, as well as the raw point clouds and corresponding ground truth for comparison. The red dashed bounding boxes provide detailed information on qualitative improvements. For example, in the first row, VoteNet generates many bounding boxes for the objects in the dashed bounding box, and GFNet can eliminate many false positive results; however, it still eliminates true positive results. Our REGNet could generate more plausible results. Best viewed in color.

Figure 4. The qualitative comparison results of 3D object detection on SUN RGB-D are presented. We provide a comparison between VoteNet and GFNet, as well as the raw point clouds, corresponding RGB image, and ground truth for comparison. The red dashed bounding boxes provide detailed information on qualitative improvements. For example, in the second row, VoteNet could detect the desk beside the bed, whereas GFNet can detect the missing partial desk; however, it cannot detect the nightstand on the other side. Our REGNet could generate more plausible results. Best viewed in color.

Figure 5. Qualitative comparison results between the use and non-use of the proposed foreground-aware module on VoteNet: (a) visualization results of VoteNet, (b) visualization results of VoteNet with our foreground-aware module, and (c) corresponding ground truth. The bounding boxes in VoteNet are more cluttered, with some being misclassified, whereas our foreground-aware module produces more concise and reasonable results. Best viewed in color.

Figure 6. Qualitative comparison results with and without the proposed voting-aware module on VoteNet: (a) visualization results of VoteNet, (b) visualization results of VoteNet with our voting-aware module, and (c) corresponding ground truth. The red and green dots represent the voting centers. In (a), VoteNet does not consider the relationship between the voting centers, which leads to some misclassifications, such as the object in the upper right corner being classified as a table, but there is nothing but cluttered points. Using the proposed voting-aware module, more reliable results can be achieved, as shown in (b). Best viewed in color.

Figure 7. Qualitative comparison results between using and not using the cluster-aware module on VoteNet: (a) original point cloud of the scene, (b) bounding box results of VoteNet, (c) visualization results of VoteNet combined with our cluster-aware module, and (d) corresponding ground truth. The red dashed circles give detailed improvement visualization results. Best viewed in color.

Figure 8. Performance analysis. Per-class [email protected] improvement of our REGNet over VoteNet, VENet, and GFNet on the ScanNet V2 dataset. Best viewed in color.

Figure 9. Performance analysis. Per-class [email protected] improvement of our REGNet over VoteNet, ImVoteNet, and GFNet on the SUN RGB-D dataset. Best viewed in color.

Table 1. Performance comparisonon the ScanNet V2 validation set with state-of-the-art methods. The top two results are shown in red and blue, respectively.

Methods	Presented at	Cabinet	Bed	Chair	Sofa	Table	Door	Window	Bookshelf	Picture	Counter	Desk	Curtain	Refrigerator	Showercurtain	Toilet	Sink	Bathtub	Ofurn	[email protected]	[email protected]
VoteNet [32]	ICCV’19	36.27	87.92	88.71	89.62	58.77	47.32	38.1	44.62	7.83	56.13	71.69	47.23	45.37	57.13	94.94	54.7	92.11	37.2	58.6	33.5
GRNet [52]	ISPRS’20	39.45	88.78	89.18	88.34	58.16	48.46	32.7	46.97	4.94	63.48	69.81	48.46	49.06	66.37	94.07	49.7	90.9	35.6	59.1	39.1
Griffiths [53]	ECCV’20	43.0	70.8	58.3	16.0	44.6	28.0	13.4	58.2	4.9	69.9	74.0	75.0	36.0	58.9	79.0	47.0	77.9	48.2	50.2	–
SPOT [54]	ECCV’20	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	59.8	40.4
GSDN [55]	ECCV’20	41.58	82.5	92.14	86.95	61.05	42.41	40.66	51.14	10.23	64.18	71.06	54.92	40.0	70.54	99.97	75.5	93.23	53.07	62.8	34.8
H3DNet [30]	ECCV’20	49.4	88.6	91.8	90.2	64.9	61.0	51.9	54.9	18.6	62.0	75.9	57.3	57.2	75.3	97.9	67.4	92.5	53.6	67.2	48.1
HGNet [56]	CVPR’20	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	61.3	34.4
SESS [57]	CVPR’20	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	62.1	–
DOPS [58]	CVPR’20	53.2	83.3	91.6	82.6	60.5	54.8	45.2	41.0	26.3	51.9	73.7	53.9	49.2	64.7	98.0	71.3	86.6	59.2	63.7	38.2
MLCVNet [29]	CVPR’20	42.45	88.48	88.98	87.4	63.5	56.93	46.98	56.94	11.94	63.94	76.05	63.94	60.86	65.91	98.33	59.18	87.22	47.89	64.5	41.4
VENet [28]	ICCV’21	50.4	87.7	92.7	88.1	68.6	60.7	46.0	55.2	18.2	70.2	77.5	59.9	58.4	75.9	95.1	67.2	92.3	54.4	67.7	–
GFNet (L6, O256) [31]	ICCV’21	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	67.3 (66.3)	48.9 (48.5)
GFNet (L12, O512) [31]	ICCV’21	52.1	92.9	93.6	88.0	70.7	60.7	53.7	62.4	16.1	58.5	80.9	67.9	47.0	76.3	99.6	72.0	95.3	56.4	69.1 (68.6)	52.8 (51.8)
Hyper3D [59]	CVPR’22	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	70.9	57.2
RBGNet (R66, O256) [34]	CVPR’22	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	70.2 (69.6)	54.2 (53.6)
RBGNet (R66, O512) [34]	CVPR’22	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	–	70.6 (69.9)	55.2 (54.7)
Our REGNet	–	55.1	93.0	93.0	90.3	70.9	61.8	55.0	62.7	24.7	69.1	81.3	68.1	56.2	73.1	98.1	70.0	95.1	55.9	70.7 (69.9)	55.3 (53.9)

Table 2. Performance comparison on the SUN RGB-D validation set with state-of-the-art methods. The top two results are shown in red and blue, respectively.

Methods	Presented at	Table	Sofa	Bookshelf	Chair	Desk	Dresser	Nightstand	Bed	Bathtub	Toilet	[email protected]
F-PointNet [14]	CVPR’18	51.1	61.1	33.3	64.2	24.7	32.0	58.1	81.1	43.3	90.9	54.0
VoteNet [32]	ICCV’19	47.3	64.0	28.8	75.3	22.0	29.8	62.2	83.0	74.4	90.1	57.7
GRNet [52]	ISPRS’20	51.1	64.8	29.3	76.2	26.0	26.1	59.2	84.3	76.8	90.4	58.4
MLCVNet [29]	CVPR’20	50.4	66.3	31.9	75.8	26.5	31.3	61.5	85.8	79.2	89.1	59.8
ImVoteNet [60]	CVPR’20	51.1	70.7	41.3	76.7	28.7	41.4	69.9	87.6	75.9	90.5	63.4
H3DNet [30]	ECCV’20	50.8	66.5	31.0	76.7	29.6	33.4	65.5	85.6	73.8	88.2	60.1
GFNet [31]	ICCV’21	53.8	70.0	32.5	79.4	32.6	36.0	66.7	87.8	80.0	91.1	63.0 (62.6)
Hyper3D [59]	CVPR’22	–	–	–	–	–	–	–	–	–	–	63.5
RGBNet [34]	CVPR’22	–	–	–	–	–	–	–	–	–	–	64.1 (63.6)
Our GFENet	–	54.0	69.1	34.1	79.7	33.9	41.0	70.8	88.1	79.2	91.7	64.1 (63.7)

Table 3. Performance on different architecture methods of our foreground-aware module (FAM) module.

Methods	[email protected]
VoteNet + FAM	63.2
H3DNet + FAM	68.1
GFNet(L6, O256) + FAM	67.2

Table 4. Ablation study of the proposed REGNet model on ScanNet V2 dataset; F, V, and C denote foreground-aware module, voting-aware module, and cluster-aware module, respectively.

	Methods	mIoU
a.	VoteNet	58.6
b.	VoteNet + F	63.2
c.	VoteNet + F + V	69.1
d.	VoteNet + W + C	68.3
e.	VoteNet + F + V + C	70.7

Table 5. Comparison with VoteNet [31], GFNet, MLCVNet, H3DNet, and MCGNet with various configurations on ScanNet V2. The main comparison is based on the best results of multiple experiments between different methods, and the number within the bracket is the average result. Here, 4×PointNet++ denotes that this method adopted four-branch individual PointNet++ as backbone; PointNet++w2× denotes that the backbone width in the method is expanded by 2; L denotes the depth of decoder, and O denotes the number of the proposal.

Model	Backbone	Model Size	[email protected]
VoteNet [32]	PointNet++	11.2 M	58.6
GFNet (L6, O256) [31]	PointNet++	14.5 M	67.3
GFNet (L12, O512) [31]	PointNet++w2×	29.6 M	69.1
MLCVNet [29]	PointNet++	13.9 M	64.5
H3DNet [30]	4×PointNet++	56.0 M	67.2
MCGNet [41]	4×PointNet++	57.2 M	67.5
REGNet	PointNet++w2×	37.6 M	70.7

Table 6. Comparison on realistic inference speed on ScanNet V2 with MLCVNet, BRNet, H3DNet, GFNet, MCGNet, and RBGNet.

Model	[email protected]	[email protected]	Frames/s
MLCVNet [29]	64.5	41.4	5.37
BRNet [39]	66.1	50.9	7.37
H3DNet [30]	67.2	48.1	3.75
GFNet [31]	66.3	48.5	6.64
MCGNet [41]	67.5	50.1	3.59
RBGNet [34]	70.2	54.2	4.75
REGNet	70.7	55.3	4.69

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhou, F.; Rao, J.; Shen, P.; Zhang, Q.; Qi, Q.; Li, Y. REGNet: Ray-Based Enhancement Grouping for 3D Object Detection Based on Point Cloud. Appl. Sci. 2023, 13, 6098. https://doi.org/10.3390/app13106098

AMA Style

Zhou F, Rao J, Shen P, Zhang Q, Qi Q, Li Y. REGNet: Ray-Based Enhancement Grouping for 3D Object Detection Based on Point Cloud. Applied Sciences. 2023; 13(10):6098. https://doi.org/10.3390/app13106098

Chicago/Turabian Style

Zhou, Feng, Junkai Rao, Pei Shen, Qi Zhang, Qianfang Qi, and Yao Li. 2023. "REGNet: Ray-Based Enhancement Grouping for 3D Object Detection Based on Point Cloud" Applied Sciences 13, no. 10: 6098. https://doi.org/10.3390/app13106098

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

REGNet: Ray-Based Enhancement Grouping for 3D Object Detection Based on Point Cloud

Abstract

1. Introduction

2. Related Work

3. Our Approach

3.1. Backgrounds

3.2. Foreground-Aware Module

3.3. Voting-Aware Module

3.4. Cluster-Aware Module

4. Experiments and Discussions

4.1. Dateset

4.2. Training Details

4.3. Comparisons with the State-of-the-Art Methods

4.4. Ablation Study and Qualitative Results

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI