Technical Note

MSPV3D: Multi-Scale Point-Voxels 3D Object Detection Net

1 School of Information, North China University of Technology, Beijing 100144, China
2 Corporation of Information, Beijing Mass Transit Railway Operation Co., Ltd., Beijing 100044, China
3 School of Information, Beijing University of Chemical Technology, Beijing 100029, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(17), 3146; https://doi.org/10.3390/rs16173146
Submission received: 16 July 2024 / Revised: 11 August 2024 / Accepted: 22 August 2024 / Published: 26 August 2024
(This article belongs to the Special Issue Point Cloud Processing with Machine Learning)

Abstract
Autonomous vehicle technology is advancing, and 3D object detection based on point clouds plays a crucial role in it. However, the irregularity, sparsity, and large data volume of point clouds, together with the many irrelevant background points they contain, hinder detection accuracy. We propose a two-stage multi-scale 3D object detection network. Firstly, since the ground typically generates a large number of useless background points during detection, we propose a new ground filtering algorithm that increases the proportion of foreground points and improves the accuracy and efficiency of the two-stage detection. Secondly, because the targets to be detected vary considerably in size and single-scale voxelization can cause an excessive loss of detailed information, voxels of different scales are introduced to extract the features of objects of different sizes from the point cloud and to integrate them into the second-stage detection. Lastly, a multi-scale feature fusion module is proposed that simultaneously enhances and integrates the features extracted from voxels of different scales. This module fully utilizes the valuable information present in the point cloud across various scales, ultimately leading to more precise 3D object detection. Experiments are conducted on the KITTI and nuScenes datasets. Compared with our baseline, "Pedestrian" detection improved by 3.37–2.72% and "Cyclist" detection by 3.79–1.32% across difficulty levels on KITTI, while NDS and mAP increased by 2.4% and 3.6%, respectively, on nuScenes.

1. Introduction

In recent years, with the continuous development of artificial intelligence, 3D object detection based on point clouds has played an irreplaceable role in various fields such as driverless vehicles, 3D reconstruction, virtual reality, and augmented reality. Compared to traditional images, point clouds are generally collected by LiDAR sensors. Due to the strong penetrating power of LiDAR, the point clouds obtained from it can better perceive the spatial information of the areas to be detected. However, point clouds are characterized by sparsity, irregularity, large data volume, and a high proportion of irrelevant points. Therefore, finding more efficient ways to utilize point clouds has become a key and challenging focus of numerous research efforts.
To cope with the sparsity and irregularity of point clouds, some research [1,2,3,4] has divided point clouds into regular grids and extracted features from these grids using 3D convolutions, thus regularizing irregular point clouds. This type of method is known as a grid-based detection method. In addition to grid-based methods, PointNet [5] first proposed a point-based processing approach, and the subsequent PointNet++ [6] used abstract set modules to directly process point clouds. This method has also been widely used in numerous studies.
Compared to point-based detection methods [7,8,9], grid-based methods offer higher inference speeds but inevitably lose more information during point cloud gridding and 3D convolution. In response, HVNet [10] proposed the concept of using grids of different scales. This algorithm divides the point cloud into equal parts using cylinders of different scales and thereby extracts features from different receptive fields, reducing the information loss caused by gridding. PV-RCNN [11] and PV-RCNN++ [12] adopt voxel set abstraction modules to select key points and aggregate features from the voxelized point cloud. These two algorithms combine point-based feature extraction with voxel-based detection and integrate the extracted features into the two-stage detection, reducing the information loss of 3D convolution processing and achieving excellent detection results.
To address the issues of large point cloud data volume and numerous invalid points, some point-based detection methods [8,9,13,14] focus on sampling more foreground points as key points, enabling the aggregated features to be closer to the target to be detected. This approach shifts the focus of network detection onto the object of interest. On the other hand, some voxel-based detection methods [1,15,16] directly process the raw point cloud. Before voxelization, they use algorithms to filter out some background points in the point cloud, increasing the proportion of foreground points, and thereby improving detection efficiency.
Inspired by the aforementioned research, this paper proposes a voxel-based two-stage multi-scale 3D object detection network. We observed that point clouds generated by LiDAR contain large detected ground areas. However, in most object detection tasks the ground is not a target of interest and is treated as background. Therefore, this paper introduces a novel ground removal algorithm to partially filter out the irrelevant ground points in the point cloud, thereby improving detection efficiency. Additionally, to reduce information loss during voxelization and fully extract the features of objects of different scales, we introduce voxels of different sizes in the second stage and use voxel abstraction modules to extract features from these multi-scale voxels. Finally, we design a fusion module to integrate the features extracted from different scale voxels, strengthening the overall feature representation.
In summary, the work presented in this paper has the following three main innovations:
  • We propose a partial ground removal algorithm. Leveraging the Z-axis characteristics and reflectivity properties of the ground in the point cloud, this algorithm first separates a portion of the point cloud based on the Z-axis coordinates. Subsequently, it further extracts ground points based on the reflectivity differences between the ground and the objects of interest. Finally, a certain ratio of the extracted ground points is randomly filtered out, thereby increasing the proportion of foreground points.
  • This paper introduces the utilization of features from voxels of different sizes. Given the significant size differences among various targets, voxelization using a single scale may result in a significant loss of detailed information. Therefore, we propose a two-stage detection network that introduces voxel features of different scales in the second detection stage. The voxel abstraction set module is employed to extract features from each scale separately and to highlight the features of key points. Additionally, the method of key point feature aggregation has been improved. Finally, the extracted features are utilized to assist in the refinement of candidate bounding boxes for targets of different sizes in the second stage, minimizing information loss during the voxelization process of the point cloud.
  • We have designed a fusion module for multi-scale voxel features. Since voxelization using different scales leads to the selection of different key voxels during subsequent feature extraction, thus resulting in different aggregated features, it is crucial to highlight the uniqueness of key voxel features extracted from different scales and enhance their representativeness. This module first extracts the global features of voxels from each scale separately and then combines these global features with the voxel-specific features. Finally, the fused features from different scale voxels are further integrated to enrich the final features used for bounding box refinement.

2. Related Works

Currently, many mainstream algorithms [13,16,17,18,19,20,21] used for the detection and recognition of both 2D and 3D targets can be broadly categorized into one-stage detection and two-stage detection based on whether the algorithm directly generates prediction boxes through the backbone network. As shown in Figure 1, we have listed some existing 3D object detection algorithms [1,8,11,15,22], and conducted a simple classification and analysis of them.

2.1. One-Stage Detection

In one-stage detection networks, detection is typically performed by directly generating prediction boxes and class information through the backbone network. In 2D object detection, the YOLO series of networks [23,24,25,26,27], as representatives of one-stage detection networks, excel in inference speed compared to other networks. Similarly, one-stage 3D object detection networks [17,28,29,30], represented by 3DSSD [8], also possess this characteristic. These networks extract multi-level deep features from point clouds through the backbone network, and the resulting deep features are directly used by the prediction head to generate prediction boxes and class information. While one-stage detection networks offer excellent inference speeds, their detection accuracy is typically inferior to two-stage detection networks due to limitations in their network structure. Voxel Transformer [3] introduced Transformer [31] into one-stage 3D object detection networks to improve detection accuracy. SFSS-Net [15] proposed a method for filtering out background points, increasing the proportion of foreground points, and thereby further improving the detection efficiency of the network. DCGNN [17] segments the point cloud space through density-based clustering queries, optimizing the original ball query method to ensure that key point sets contain more detailed object features, and as a single-stage 3D object detection network, it achieves fast inference speed. PillarNeXt [28] introduced a new Voxel2Pillar feature encoding method, which uses a sparse convolution constructor to build pillars with richer point cloud features, especially height features. This encoding approach added more learnable parameters to the initial pillars, thereby improving performance. PillarNet++ [30] expanded on the foundation of PillarNet [29] and introduced multi-scale feature fusion technology, allowing the model to capture and integrate features at different scales, thereby better understanding small and large targets in the scene. In 3D object detection, there is a large number of point-based detection networks [8,14,32] that use one-stage detection to improve the detection speed of point-based methods. Since point-based detection methods directly process point clouds, they can more effectively compensate for the insufficient detection accuracy of one-stage detection.

2.2. Two-Stage Detection

In 2D detection, Fast R-CNN [33] is a representative two-stage network. In the first detection stage, the network extracts deep features from images through the backbone network and generates initial candidate boxes. In the second detection stage, the network classifies and regresses these candidate boxes through a neural network to obtain the final prediction boxes. In point cloud-based 3D object detection, two-stage detection is also widely used [9,11,34,35]. PointRCNN [9] uses the network structure of PointNet++ [6] in the first detection stage to obtain point-wise features, extracts features through a deep network to generate candidate boxes, and completes the regression correction of the candidate boxes in the second detection stage. Additionally, two-stage detection networks benefit from their inherent two-stage structure, which allows more methods to be integrated into the detection network. For example, STD [36] uses PointNet++ [6] to extract features in the first detection stage and employs the VFE (Voxel Feature Encoding) module from VoxelNet [1] in the second stage to refine the boxes. PSA-Det3D [37] combines pillar-based and point-based methods to perform two-stage detection, and its results indicate a significantly improved performance in detecting small objects. PV-RCNN [11] and PV-RCNN++ [12] use 3D sparse convolution in the first detection stage to extract multi-level deep features from the voxelized point cloud and generate proposals; in the second detection stage, they use a point-based set abstraction approach to extract features from voxels at different feature levels and finally use the extracted features for the regression correction of the proposals.
From the networks mentioned above, it can be seen that for two-stage detection networks, finding efficient ways to combine different methods to improve the utilization of features is crucial for enhancing the final detection accuracy. Especially in the second detection stage, fully extracting relevant features and effectively utilizing the advantages of two-stage detection are of great significance for improving the detection accuracy of two-stage networks.
In order to fully utilize point cloud features and improve detection accuracy, this paper proposes a point cloud-based two-stage detection network for 3D object detection tasks. This network focuses on the feature extraction method in the second stage to enhance the correction effect of candidate boxes, ultimately aiming to improve detection accuracy.

3. Proposed Methods

In this section, we will provide a detailed explanation of the network proposed in this paper. As shown in Figure 2, MSPV3D is a two-stage 3D object detection network that focuses on fully utilizing the information provided by the point cloud in the second detection stage. Firstly, to increase the proportion of foreground points in the point cloud, we propose a new ground filtering algorithm that filters out a certain proportion of the background ground. Secondly, we introduce the point cloud with an increased proportion of foreground points into the second detection stage, voxelize it using different scales of voxels, and extract features from these voxels of different scales, which are then used as relevant features for subsequent candidate box correction. Finally, we design a feature fusion module to fuse the features extracted from voxels of different scales to enhance the final output features.

3.1. Ground Partly Filtering Algorithm

In most 3D object detection tasks, the ground is not considered a target for detection. However, in general point cloud data, the ground is often detected over a large area. This not only increases the amount of data and slows down the network inference speed, but also interferes with the recognition and detection of target objects.
Based on Algorithm 1, first, we filter the point cloud based on the Z-axis coordinates to select points near the ground. Then, we take the features (reflectivity) of the point cloud into consideration. Since these point clouds are selected near the ground, the proportion of ground point clouds is quite large. We extract the features of these points and calculate their mean m and standard deviation sd. Since only a portion of the ground points will be selected for filtering out later, here we consider points with feature values within the range [m − 3sd, m + 3sd] as ground points. Finally, we randomly filter out these points based on the filtering rate a.
Algorithm 1: Ground Partly Filter
Input: the raw points P = {p_n}, n = 0, …, N − 1; the number of raw points N; the Z-axis threshold Zt; the Z-axis coordinates Z = {z_n}, n = 0, …, N − 1; the random sample rate a
Data: the set of points close to the ground Pg; the set of ground points Pt
Initialize Pg and Pt as empty sets;
for i = 0 to N − 1 do
  if z_i ≤ Zt then Pg.append(p_i); end if
end for
m = mean(Pg.f); sd = std(Pg.f);
for i = 0 to len(Pg) − 1 do
  if m − 3sd ≤ Pg[i].f ≤ m + 3sd then Pt.append(Pg[i]); end if
end for
Randomly sample a subset Pt1 from Pt with sample rate a;
Remove Pt1 from P;
return P
In reality, the ground generally lies below the targets to be detected, so the Z-axis coordinates of ground points are almost always smaller than those of points near the objects of interest. Therefore, as shown in Algorithm 1, we first set a threshold Zt; based on actual testing, we set Zt to −1.2 to better isolate the ground. If the Z-axis coordinate of a point is less than or equal to Zt, it is judged to be a near-ground point; otherwise, it is judged to be a non-ground point. After obtaining the near-ground point set Pg, we calculate the mean m and standard deviation sd of its features (reflectivity). Since ground points account for a large proportion of Pg, we take the points whose feature values fall within [m − 3sd, m + 3sd] as the ground point set Pt. Finally, we set a ratio a and randomly filter out that proportion of Pt, ultimately obtaining point cloud data with part of the ground removed.
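For concreteness, the following is a minimal NumPy sketch of Algorithm 1; the (N, 4) array layout (x, y, z, reflectivity) and the helper name ground_partly_filter are assumptions for illustration, not the exact implementation.

```python
import numpy as np

def ground_partly_filter(points, z_t=-1.2, rate=0.7, rng=None):
    """Randomly remove a fraction `rate` of likely ground points.

    Assumes `points` is an (N, 4) array with columns x, y, z, reflectivity.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Step 1: points at or below the Z-axis threshold are treated as near-ground.
    near_ground = points[:, 2] <= z_t

    # Step 2: among near-ground points, those whose reflectivity lies within
    # mean +/- 3 * std are labeled as ground points.
    refl = points[:, 3]
    m = refl[near_ground].mean()
    sd = refl[near_ground].std()
    is_ground = near_ground & (refl >= m - 3 * sd) & (refl <= m + 3 * sd)

    # Step 3: randomly drop the fraction `rate` of the ground points.
    ground_idx = np.flatnonzero(is_ground)
    drop_idx = rng.choice(ground_idx, size=int(rate * len(ground_idx)), replace=False)
    keep = np.ones(len(points), dtype=bool)
    keep[drop_idx] = False
    return points[keep]
```

With Zt = −1.2 and the filter rate a = 0.7 selected in Section 4.5, roughly 70% of the detected ground points are removed while the remainder is kept as background context.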

3.2. Multi-Scale Voxels Feature Extraction

Both PV-RCNN++ [12] and Part-A2 [22] demonstrate the importance of extracting features around foreground points in the second detection stage for refining candidate boxes: PV-RCNN++ [12] searches for key points near the regions of interest, while Part-A2 [22] uses a semantic segmentation module to extract features from the decoded point cloud. We therefore also aggregate features around key points in the second stage; the specific process is shown in Equations (1)–(3).
$f_n^m = \mathrm{MLP}\left\{ k_n^m\left(x + o_i^x,\; y + o_i^y,\; z + o_i^z\right) \right\}$  (1)
$\left\{ o_i^j \right\}_{i=0}^{\sqrt[3]{N+1}} = \left( \frac{R - R\sqrt[3]{N+1}}{2} + \frac{2R \times i}{\sqrt[3]{N+1}} \right)_j$  (2)
$\left\{ k_n^m(x, y, z) \right\}_{n=0}^{C} = S\left( v^m(p) \right)$  (3)
where $v^m$ denotes voxelization with voxel size m, and S denotes the random sampling that selects C key points (the central coordinates (x, y, z) of the key voxels). N is the number of voxels participating in the aggregation around each key point, and R is the maximum distance between a key point and the surrounding voxels participating in the aggregation. $o_i$ is the coordinate offset of the i-th aggregated voxel relative to the key point, j indexes the three coordinate axes, and $f_n^m$ is the final feature obtained after aggregation.
Firstly, we input the point cloud after ground filtering into the second detection stage and perform feature extraction using multi-scale voxels. We used three different scales of voxels to extract features in our experiments.
Secondly, in practical 3D object detection tasks, the sizes of different types of target objects can vary significantly. We believe that voxelization of point clouds using only a single scale of voxels can lead to excessive information loss. Therefore, to obtain more accurate information in the second detection stage, we adopt multiple scales of voxels to voxelize the point cloud after removing a portion of the ground.
Finally, to efficiently extract features from the generated multi-scale voxels, we utilize a voxel set abstraction module for feature extraction, similar to the set abstraction module in PointNet++ [6]. Furthermore, to highlight the features of key points in the second detection stage, we have improved the key-point feature aggregation. Since key points are selected near the regions of interest generated in the first detection stage, we take the feature differences between voxels into account during aggregation so that the features of different target categories become more distinct: we first select voxels over a larger range around each key point and then, from these voxels, further select those whose features are more similar to the key point for the final feature aggregation.
As shown in Figure 3, this module randomly selects a fixed number of key points within the voxels, limiting the selection range using the candidate boxes generated in the first detection stage to ensure that the key points are closer to the foreground points. Subsequently, feature aggregation is performed on the voxels surrounding these key points to extract voxel features. The extracted voxel features are then used for subsequent candidate box refinement.
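As a rough single-scale illustration of this procedure (Equations (1)–(3)), the sketch below samples key voxel centers, gathers neighboring voxel centers within radius R, and aggregates their features with an MLP. The module name, tensor shapes, and brute-force cdist neighbor search are simplifying assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class SingleScaleVoxelSA(nn.Module):
    """Aggregate features of voxels within radius R around randomly sampled key voxels."""

    def __init__(self, in_dim=4, out_dim=32, radius=1.0, n_keypoints=2048, n_neighbors=16):
        super().__init__()
        self.radius, self.n_keypoints, self.n_neighbors = radius, n_keypoints, n_neighbors
        self.mlp = nn.Sequential(nn.Linear(in_dim + 3, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, voxel_centers, voxel_feats):
        # voxel_centers: (V, 3); voxel_feats: (V, in_dim).
        # Randomly sample C key voxels (in MSPV3D the sampling range is further
        # restricted by the candidate boxes from the first detection stage).
        idx = torch.randperm(voxel_centers.shape[0])[: self.n_keypoints]
        keypoints = voxel_centers[idx]                                   # (C, 3)

        # Brute-force neighbor search within the aggregation radius R.
        dist = torch.cdist(keypoints, voxel_centers)                     # (C, V)
        dist = dist.masked_fill(dist > self.radius, float("inf"))
        nn_dist, nn_idx = dist.topk(self.n_neighbors, dim=1, largest=False)

        # Relative offsets o_i and neighbor features, zeroed outside the radius.
        offsets = voxel_centers[nn_idx] - keypoints.unsqueeze(1)         # (C, K, 3)
        feats = voxel_feats[nn_idx]                                      # (C, K, in_dim)
        valid = torch.isfinite(nn_dist).unsqueeze(-1).float()
        grouped = torch.cat([offsets, feats], dim=-1) * valid

        # Point-wise MLP followed by max pooling over each neighborhood.
        return keypoints, self.mlp(grouped).max(dim=1).values            # (C, out_dim)
```

Running three such modules with different voxel sizes and keeping their outputs separate yields the per-scale features that are passed to the fusion module described in Section 3.3.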

3.3. Multi-Scale Voxel Features Fusion

To ensure that the final features used for refining the candidate boxes are comprehensive, we have designed a multi-scale voxel feature fusion module as shown in Figure 4. This module enhances and fuses the features extracted from different scales of voxels. The specific process is shown in Equations (4)–(6).
$f_o = \mathrm{MLP}\left( C\left( \{ f_i \}_{i=1}^{M} \right) \right)$  (4)
$f_i = \mathrm{MLP}\left( C\left( f_m,\; R(f_g) \right) \right)$  (5)
$f_g = C\left( \mathrm{maxpool}(f),\; \mathrm{avgpool}(f) \right)$  (6)
where $f_m$ represents the voxel features for voxel size m, $f_g$ represents the global features, and M denotes the number of different voxel sizes used. C and R represent the concatenation and repeat operations, respectively.
Firstly, for a specific scale of voxel features, we perform both max pooling and average pooling and then concatenate the pooled features to obtain the global features for that scale of voxels. Secondly, we concatenate the original features of this scale of voxels with the global features to obtain the processed new voxel features, which are further enhanced through a Multilayer Perceptron (MLP) layer. Finally, we concatenate the features extracted from each scale of voxels after undergoing the above operations and pass them through an MLP layer to obtain the final fused features.
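A possible PyTorch sketch of Equations (4)–(6) is given below; the per-voxel feature width of 32 follows Figure 4, while the hidden sizes, the output width F, and the module name are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleVoxelFusion(nn.Module):
    """Fuse per-scale voxel features via global pooling, per-scale MLPs, and a final MLP."""

    def __init__(self, n_scales=3, in_dim=32, out_dim=128):
        super().__init__()
        # One enhancement MLP per scale: input = per-voxel feature + repeated global feature.
        self.scale_mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim * 3, in_dim), nn.ReLU()) for _ in range(n_scales)
        )
        # Final MLP over the concatenation of all enhanced scales.
        self.out_mlp = nn.Sequential(nn.Linear(in_dim * n_scales, out_dim), nn.ReLU())

    def forward(self, scale_feats):
        # scale_feats: list of (N, 32) tensors, one per voxel scale.
        fused = []
        for f, mlp in zip(scale_feats, self.scale_mlps):
            # Global feature f_g: concatenation of max pooling and average pooling (Eq. 6).
            f_g = torch.cat([f.max(dim=0).values, f.mean(dim=0)], dim=-1)   # (64,)
            # Repeat f_g for every voxel and concatenate with f (Eq. 5).
            f_g = f_g.unsqueeze(0).expand(f.shape[0], -1)                   # (N, 64)
            fused.append(mlp(torch.cat([f, f_g], dim=-1)))                  # (N, 32)
        # Concatenate all scales and apply the final MLP (Eq. 4).
        return self.out_mlp(torch.cat(fused, dim=-1))                       # (N, out_dim)
```

Pooling over the N voxels of each scale produces one global vector per scale, which is repeated and concatenated back onto every voxel feature before the final cross-scale fusion.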

4. Experiment

4.1. Dataset

To validate the effectiveness of our proposed network, we tested it on the KITTI dataset and nuScenes dataset.

4.1.1. KITTI Dataset [38]

The KITTI dataset is a publicly available dataset widely used in the field of computer vision, primarily for research and evaluation of tasks such as autonomous driving, scene understanding, and object detection. Based on the streets of Karlsruhe, Germany, the dataset offers a diverse range of urban driving scenarios. Due to its provision of real-world scene data, the KITTI dataset exhibits high authenticity and representativeness, making it a mainstream standard for 3D object detection in traffic scenarios.
In the original KITTI dataset, each sample contains point cloud data from multiple consecutive frames. In our experiments, there were a total of 7481 point clouds with 3D bounding box annotations for training and 7518 samples for testing. Following common practice, we further divided the training samples into 3712 training samples and 3769 validation samples. Our model was trained on the training split and evaluated on the validation split.

4.1.2. nuScenes Dataset [39]

The nuScenes dataset is a particularly demanding collection for autonomous driving, consisting of 380,000 LiDAR scans drawn from 1000 different scenes. Each scan is annotated with up to ten distinct object categories, and the annotations include 3D bounding boxes, object velocities, and various attributes. The dataset covers a full 360-degree detection range.
Performance on the nuScenes dataset is gauged with two key metrics: the widely recognized mean Average Precision (mAP) and the nuScenes Detection Score (NDS), which offers a comprehensive assessment of detection quality across several error types.
When using the nuScenes dataset, we amalgamate the LiDAR points of the current keyframe with those of the preceding frames within the last 0.5 s. This can result in training samples containing up to 400,000 LiDAR points, so we reduce the quantity of input LiDAR points.
To achieve this reduction, we voxelize the point cloud from the current keyframe and the aggregated data from the previous frames, using a voxel size of 0.1 m in each dimension. We then proceed to randomly select 16,384 voxels from the keyframe and 49,152 voxels from the previous frames. For each of these selected voxels, we randomly pick one internal LiDAR point. Consequently, the network is fed a total of 65,536 points, each accompanied by its 3D coordinates, reflectivity value, and timestamp information.
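A possible NumPy sketch of this voxel-based subsampling is shown below; the unique-row voxel hashing and the helper name voxel_subsample are assumptions, and the exact pipeline may differ.

```python
import numpy as np

def voxel_subsample(points, voxel_size=0.1, n_voxels=16384, rng=None):
    """Keep one random point from each of up to `n_voxels` randomly chosen occupied voxels.

    Assumes `points` is an (N, 5) array: x, y, z, reflectivity, timestamp.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Hash each point to its voxel index (0.1 m in each dimension).
    voxel_idx = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    _, inverse = np.unique(voxel_idx, axis=0, return_inverse=True)

    # Shuffle the points, then take the first point seen in each voxel
    # (a random point per voxel), and keep at most n_voxels voxels.
    order = rng.permutation(len(points))
    first_in_voxel = np.unique(inverse[order], return_index=True)[1]
    chosen = order[first_in_voxel]
    if len(chosen) > n_voxels:
        chosen = rng.choice(chosen, size=n_voxels, replace=False)
    return points[chosen]

# Keyframe and history frames are subsampled separately and concatenated,
# giving 16,384 + 49,152 = 65,536 points fed to the network:
# key = voxel_subsample(keyframe_points, n_voxels=16384)
# hist = voxel_subsample(history_points, n_voxels=49152)
# network_input = np.concatenate([key, hist], axis=0)
```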

4.1.3. Evaluating Indicator

In the experiments on the KITTI dataset, the 11-point interpolated average precision (AP) proposed by Salton and McGill [40] was used; the IoU (Intersection over Union) thresholds for each category are given in Section 4.3. The specific formula for AP|R is shown in Equations (7) and (8).
$AP|_R = \frac{1}{|R|} \sum_{r \in R} \rho_{\mathrm{interp}}(r)$  (7)
$\rho_{\mathrm{interp}}(r) = \max_{\tilde{r}:\, \tilde{r} \ge r} p(\tilde{r})$  (8)
where p(r) denotes the precision at recall r. AP uses exactly 11 equally spaced recall levels: R11 = {0, 0.1, 0.2, …, 1}.
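For reference, Equations (7) and (8) can be computed with the short, generic sketch below (an illustration, not the official KITTI evaluation code).

```python
import numpy as np

def eleven_point_ap(recalls, precisions):
    """11-point interpolated AP over R11 = {0, 0.1, ..., 1.0}.

    `recalls` and `precisions` are matched arrays sampled from one class's
    precision-recall curve.
    """
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recalls >= r
        # rho_interp(r): the maximum precision at any recall >= r, or 0 if none.
        rho = precisions[mask].max() if mask.any() else 0.0
        ap += rho / 11.0
    return ap
```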
In the nuScenes dataset, as mentioned above, we use NDS (nuScenes Detection Score) and mAP (mean Average Precision) as the evaluation metrics. The specific formula for NDS is shown in Equation (9).
$NDS = \frac{1}{10}\left[ 5\,\mathrm{mAP} + \sum_{\mathrm{mTP} \in \mathbb{TP}} \left(1 - \min(1, \mathrm{mTP})\right) \right]$  (9)
where mTP stands for the mean True Positive metrics, comprising five terms: mean translation error, mean scale error, mean orientation error, mean velocity error, and mean attribute error.
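Given mAP and the five mTP errors, Equation (9) reduces to a few lines (an illustrative helper, not the official nuScenes devkit; the example values are placeholders).

```python
def nds(mean_ap, mtp_errors):
    """mtp_errors: the five mTP metrics (translation, scale, orientation, velocity, attribute)."""
    tp_score = sum(1.0 - min(1.0, e) for e in mtp_errors)
    return (5.0 * mean_ap + tp_score) / 10.0

# Example with placeholder error values; returns an NDS in [0, 1]:
# nds(0.482, [0.3, 0.25, 0.4, 0.5, 0.2])
```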

4.2. Experiment Setting

MSPV3D is implemented based on OpenPCDet [41] and is trained on a single GPU. All experiments are conducted on an Ubuntu 16.04 system with an NVIDIA RTX 2080 Ti graphics card (NVIDIA, Santa Clara, CA, USA).

4.2.1. KITTI Dataset

During the training process, the batch size was set to 1, and for each batch, 16,384 points were randomly sampled from the remaining points to input into the detector. The training utilized the Adam optimizer with a cyclically varying learning rate, and a total of 20 epochs were trained with an initial learning rate set to 0.01. Additionally, three common data augmentation methods were used during training: random flipping along the X- and Y-axes, random scaling, and random rotation around the Z-axis.
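The three augmentations can be sketched as follows; the flip probability of 0.5, the scaling range [0.95, 1.05], and the rotation range [−π/4, π/4] are typical values assumed for illustration rather than the exact configuration used.

```python
import numpy as np

def augment(points, boxes, rng):
    """Random flip, global scaling, and rotation around the Z-axis.

    points: (N, 3+) array; boxes: (M, 7) array [x, y, z, dx, dy, dz, yaw];
    rng: np.random.Generator.
    """
    # Random flip: mirror the scene by negating the x and/or y coordinate.
    for axis in (0, 1):
        if rng.random() < 0.5:
            points[:, axis] *= -1.0
            boxes[:, axis] *= -1.0
            boxes[:, 6] = (np.pi - boxes[:, 6]) if axis == 0 else -boxes[:, 6]

    # Random global scaling of coordinates and box sizes.
    scale = rng.uniform(0.95, 1.05)
    points[:, :3] *= scale
    boxes[:, :6] *= scale

    # Random rotation around the Z-axis.
    theta = rng.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    points[:, :2] = points[:, :2] @ rot.T
    boxes[:, :2] = boxes[:, :2] @ rot.T
    boxes[:, 6] += theta
    return points, boxes
```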

4.2.2. nuScenes Dataset

During the training process, the batch size was set to 1, and 65,536 points per frame were fed into the network without any additional filtering. The training employed the Adam optimizer with a cyclically varying learning rate. Due to the vast amount of data in the nuScenes dataset, a total of 5 epochs were trained with an initial learning rate of 0.01.

4.3. Network Evaluation

The experiments compared the detection performance of the proposed method against several existing methods from the literature on the KITTI and nuScenes datasets.
In the KITTI dataset, the test set was divided into three difficulty levels: “Easy”, “Moderate”, and “Hard”, and three categories of detection targets: “Car”, “Pedestrian”, and “Cyclist”. We adopted the widely used evaluation metric of 3D bounding box accuracy from the KITTI dataset as the primary metric. The IoU threshold for the “Car” category was set to 0.7, while the IoU thresholds for the “Pedestrian” and “Cyclist” categories were set to 0.5.
As shown in Table 1, after 20 epochs of training, MSPV3D exhibits a decline in detection accuracy for the “Car” category compared to the benchmark network PV-RCNN++ [12] we used. However, for the “Pedestrian” category, the detection accuracy has improved by 3.37%, 2.63%, and 2.72% in the “Easy”, “Moderate”, and “Hard” difficulty levels, respectively. Similarly, for the “Cyclist” category, the detection accuracy has increased by 3.79%, 1.26%, and 1.32% in the “Easy”, “Moderate”, and “Hard” difficulty levels, respectively. We believe that the selection of features from different scales of voxels within the network plays a significant role in the regression of bounding boxes for the “Pedestrian” and “Cyclist” categories.
The results on the nuScenes dataset are shown in Table 2. Compared with the baseline network, MSPV3D demonstrates an increase of 2.4% in the NDS metric and 3.6% in the mAP metric.

4.4. Ablation Experiment

To further validate the effectiveness of the multi-scale voxel feature fusion module designed in this paper, we conducted an ablation experiment to test the module.
As indicated in Table 3, compared to networks that only have a multi-scale voxel feature extraction module, the feature fusion module has achieved a respective enhancement of 1.62%, 2.47%, and 1.07% for the “Pedestrian” category across the “Easy”, “Mod.”, and “Hard” difficulty levels. Additionally, for the “Cyclist” category, it has resulted in improvements of 2.23%, 0.49%, and 1.98% across the “Easy”, “Mod.”, and “Hard” difficulty levels. Based on the aforementioned data, it is evident that the fusion module we have designed can further enhance the accuracy of candidate box corrections for both the “Pedestrian” and “Cyclist” categories. From this, we infer that the different scales of voxels used in this paper better capture the detailed features of the detection targets “Pedestrian” and “Cyclist”, and the feature fusion module further strengthens these detailed features.

4.5. Ground Filter Rate Test

During testing, we found that completely removing ground points from the point cloud can reduce the final detection accuracy; instead, the ground should only be filtered out up to a certain proportion. As shown in Figure 5, different filtering rates produce different filtering effects. We therefore designed experiments to determine the optimal value of the filtering rate a.
As shown in Table 4, detection accuracy gradually improves as the filtering rate increases up to 0.7. When a is 0, the network applies no ground filtering, and the results are lower than those obtained after partial ground removal. However, when the entire ground is filtered out (a = 1), the detection accuracy does not reach its optimum. We believe that having too few background points may reduce the diversity of features, which can affect the classification of different objects and ultimately influence the final confidence scores. Based on the data in Table 4, we select 0.7 as the final value of a.

4.6. Detection Effect

Figure 6 shows the actual detection results; although a few missed detections remain, most objects are detected, and the 3D bounding boxes are highly accurate.

5. Discussion

First, in the ground filter rate test, we observed that filtering out more points does not necessarily lead to better detection results. We infer that too few background points may reduce the diversity of features, which can affect the classification of different objects; a similar issue has also been mentioned in SASA [14]. This is also why we did not adopt the filtering operation in the first stage of detection. However, the partial removal of ground points can indeed make the features of foreground points more prominent, thereby improving detection accuracy.
Second, extracting multi-scale features is computationally expensive. Therefore, we adopted the method of removing some ground points, and in the experiment, we only used three different voxel sizes to reduce the number of voxels generated after voxelization at different scales. This approach was taken to decrease the number of network parameters and spatial complexity in the subsequent feature extraction process. Moreover, since using three different voxel sizes better captured some features of small targets, it led to an increase in detection accuracy for small targets in the ablation study. However, for targets like vehicles, the introduction of feature enhancement modules resulted in a slight decrease in detection accuracy. In subsequent work, we will further explore methods for partially removing point cloud background points (not limited to the ground) to reduce computational consumption in the voxelization and subsequent voxel feature extraction processes. Additionally, we will adopt a wider range of voxel sizes to better extract features of objects across more categories.
Finally, there are still many shortcomings in our work. Although our ground point removal algorithm can handle some uneven road conditions, it cannot cope with situations where the slope is too steep. In cases of steep slopes, the ground points are not completely removed, leading to increased computational load in the subsequent voxelization and feature extraction processes, and affecting the final detection accuracy. In future work, we will continue to explore different methods to partially remove background points.

6. Conclusions

In this paper, we propose a two-stage 3D object detection network, MSPV3D, which explores background point filtering, feature extraction from voxels of different scales, and feature application. By partially filtering out the ground points and extracting features from voxels of different scales in the second stage, we achieve improved detection accuracy on the KITTI and nuScenes datasets. This demonstrates that selecting voxels of different scales for voxelization can reduce some of the information loss caused by using a single scale, and filtering out some background points can help extract features more efficiently. Although point cloud data is vast and contains many irrelevant points, it also holds a wealth of spatial information. Efficiently extracting spatial features from point clouds remains a crucial focus for future research.

Author Contributions

Conceptualization, Z.Z. and Z.B.; methodology, Z.Z. and Z.B.; software, Z.B.; validation, Z.Z.; formal analysis, Z.B.; investigation, Q.T. and Z.B.; resources, Q.T. and Y.W.; data curation, Z.B.; writing, Z.Z. and Z.B.; original draft preparation, Z.B.; visualization, Z.Z. and M.L.; supervision, Z.B., Y.Z. and Q.T.; project administration, Q.T. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

Authors Yun Wei and Ming Li were employed by the Corporation of Information, Beijing Mass Transit Railway Operation Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhou, Y.; Tuzel, O. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4490–4499. [Google Scholar]
  2. Yan, Y.; Mao, Y.; Li, B. Second: Sparsely embedded convolutional detection. Sensors 2018, 18, 3337. [Google Scholar] [CrossRef] [PubMed]
  3. Mao, J.; Xue, Y.; Niu, M.; Bai, H.; Feng, J.; Liang, X.; Xu, H.; Xu, C. Voxel transformer for 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3164–3173. [Google Scholar]
  4. Lang, A.H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; Beijbom, O. PointPillars: Fast Encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12697–12705. [Google Scholar]
  5. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  6. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  7. Zheng, W.; Tang, W.; Jiang, L.; Fu, C.W. SE-SSD: Self-ensembling single-stage object detector from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14494–14503. [Google Scholar]
  8. Yang, Z.; Sun, Y.; Liu, S.; Jia, J. 3DSSD: Point-based 3D single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11040–11048. [Google Scholar]
  9. Shi, S.; Wang, X.; Li, H. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 770–779. [Google Scholar]
  10. Ye, M.; Xu, S.; Cao, T. HVNet: Hybrid voxel network for LiDAR based 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1631–1640. [Google Scholar]
  11. Shi, S.; Guo, C.; Jiang, L.; Wang, Z.; Shi, J.; Wang, X.; Li, H. PV-RCNN: Point-voxel feature set abstraction for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10529–10538. [Google Scholar]
  12. Shi, S.; Jiang, L.; Deng, J.; Wang, Z.; Guo, C.; Shi, J.; Wang, X.; Li, H. PV-RCNN++: Point-voxel feature set abstraction with local vector representation for 3D object detection. Int. J. Comput. Vis. 2023, 131, 531–551. [Google Scholar] [CrossRef]
  13. Zhang, Z.; Bao, Z.; Tian, Q.; Lyu, Z. SAE3D: Set Abstraction Enhancement Network for 3D Object Detection Based Distance Features. Sensors 2023, 24, 26. [Google Scholar] [CrossRef] [PubMed]
  14. Chen, C.; Chen, Z.; Zhang, J.; Tao, D. SASA: Semantics-augmented set abstraction for point-based 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 221–229. [Google Scholar]
  15. Zhu, L.; Chen, Z.; Wang, B.; Tian, G.; Ji, L. SFSS-Net: Shape-awared filter and sematic-ranked sampler for voxel-based 3D object detection. Neural Comput. Appl. 2023, 35, 13417–13431. [Google Scholar] [CrossRef]
  16. Chen, Y.; Liu, J.; Zhang, X.; Qi, X.; Jia, J. Voxelnext: Fully sparse voxelnet for 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21674–21683. [Google Scholar]
  17. Xiong, S.; Li, B.; Zhu, S. DCGNN: A single-stage 3D object detection network based on density clustering and graph neural network. Complex Intell. Syst. 2023, 9, 3399–3408. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Xu, R.; Tian, Q. PIDFusion: Fusing Dense LiDAR Points and Camera Images at Pixel-Instance Level for 3D Object Detection. Mathematics 2023, 11, 4277. [Google Scholar] [CrossRef]
  19. Wang, S.; Lu, K.; Xue, J.; Zhao, Y. Da-Net: Density-aware 3D object detection network for point clouds. IEEE Trans. Multimed. 2023, 1–14. [Google Scholar] [CrossRef]
  20. Pu, Y.; Liang, W.; Hao, Y.; Yuan, Y.; Yang, Y.; Zhang, C.; Hu, H.; Huang, G. Rank-DETR for high quality object detection. Adv. Neural Inf. Process. Syst. 2024, 36, 16100–16113. [Google Scholar]
  21. Gao, J.; Zhang, Y.; Geng, X.; Tang, H.; Bhatti, U.A. PE-Transformer: Path enhanced transformer for improving underwater object detection. Expert Syst. Appl. 2024, 246, 123253. [Google Scholar] [CrossRef]
  22. Shi, S.; Wang, Z.; Wang, X.; Li, H. Part-A2 Net: 3D Part-aware and aggregation neural network for object detection from point cloud. arXiv 2019, arXiv:1907.03670. [Google Scholar]
  23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. arXiv 2015, arXiv:1506.02640. [Google Scholar]
  24. Wu, D.; Liao, M.-W.; Zhang, W.-T.; Wang, X.-G.; Bai, X.; Cheng, W.-Q.; Liu, W.-Y. YOLOP: You only look once for panoptic driving perception. Mach. Intell. Res. 2022, 19, 550–562. [Google Scholar] [CrossRef]
  25. Yang, Y.; Deng, H. GC-YOLOv3: You only look once with global context block. Electronics 2020, 9, 1235. [Google Scholar] [CrossRef]
  26. Wong, A.; Famuori, M.; Shafiee, M.J.; Li, F.; Chwyl, B.; Chung, J. YOLO nano: A highly compact you only look once convolutional neural network for object detection. In Proceedings of the 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), Vancouver, BC, Canada, 13 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 22–25. [Google Scholar]
  27. Shafiee, M.J.; Chywl, B.; Li, F.; Wong, A. Fast YOLO: A fast you only look once system for real-time embedded object detection in video. arXiv 2017, arXiv:1709.05943. [Google Scholar] [CrossRef]
  28. Li, J.; Luo, C.; Yang, X. PillarNeXt: Rethinking network designs for 3D object detection in LiDAR point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17567–17576. [Google Scholar]
  29. Shi, G.; Li, R.; Ma, C. PillarNet: Real-time and high-performance pillar-based 3D object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 35–52. [Google Scholar]
  30. Guo, D.; Yang, G.; Wang, C. PillarNet++: Pillar-based 3D object detection with multi-attention. IEEE Sens. J. 2023, 23, 27733–27743. [Google Scholar] [CrossRef]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  32. Zhang, Y.; Hu, Q.; Xu, G.; Ma, Y.; Wan, J.; Guo, Y. Not all points are equal: Learning highly efficient point-based detectors for 3D lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18953–18962. [Google Scholar]
  33. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  34. Li, Z.; Wang, F.; Wang, N. Lidar R-CNN: An efficient and universal 3D object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7546–7555. [Google Scholar]
  35. Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  36. Yang, Z.; Sun, Y.; Liu, S.; Shen, X.; Jia, J. STD: Sparse-to-dense 3D object detector for point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1951–1960. [Google Scholar]
  37. Huang, Z.; Zheng, Z.; Zhao, J.; Hu, H.; Wang, Z.; Chen, D. PSA-Det3D: Pillar set abstraction for 3D object detection. Pattern Recognit. Lett. 2023, 168, 138–145. [Google Scholar] [CrossRef]
  38. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  39. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  40. Salton, G.; McGill, M.J. Introduction to Modern Information Retrieval; McGraw-Hill, Inc.: New York, NY, USA, 1986. [Google Scholar]
  41. OD Team. OpenPCDet: An Open-Source Toolbox for 3D Object Detection from Point Clouds; GitHub: San Francisco, CA, USA, 2020. [Google Scholar]
Figure 1. Related works.
Figure 2. Overall Network Structure. In the overall network structure, the first stage extracts voxel features of Size0 scale through sparse convolution and generates initial 3D candidate boxes. In the second stage, various features from the ground-filtered point cloud are extracted, and the candidate boxes are refined to obtain confidence scores.
Figure 3. Flowchart of Feature Extraction from Multi-Scale Voxels. Where Scale X (X = 0, 1, 2) represents the input voxels of different scales.
Figure 4. Multi-Scale Voxel Feature Fusion Module. In this module, the input is voxel features of different scales with a size of (N, 32), and the output is fused features with a size of (N, F). Scale X (X = 1, 2, …) represents the voxel features of different scales. C represents the token concatenation operation.
Figure 5. Illustration of Ground Filtering Effects. From (left) to (right) and (top) to (bottom), the ground filtering rates a are 0, 0.5, 0.7, and 1, respectively.
Figure 6. Actual detection results on the KITTI dataset. Green, red, and blue boxes represent the three detected categories: “Car”, “Ped.”, and “Cyc.”.
Table 1. The test results of bounding boxes on the KITTI dataset’s test set.
All values are 3D AP (%).

| Methods | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard |
|---|---|---|---|---|---|---|---|---|---|
| SECOND [2] | 85.44 | 77.96 | 73.94 | 48.63 | 41.08 | 39.48 | 44.25 | 36.14 | 34.23 |
| SE-SSD [7] | 87.05 | 76.43 | 75.12 | 52.78 | 44.79 | 42.42 | 48.59 | 41.28 | 39.24 |
| PointPillars [4] | 82.57 | 74.13 | 67.89 | 50.49 | 41.89 | 38.96 | 74.10 | 56.65 | 49.92 |
| PointRCNN [9] | 85.96 | 74.64 | 69.70 | 47.98 | 39.37 | 36.01 | 72.96 | 56.62 | 50.73 |
| IA-SSD [32] | 88.34 | 80.13 | 75.04 | 45.51 | 38.03 | 34.61 | 75.35 | 61.94 | 55.70 |
| PSA-Det3D [37] | 87.46 | 78.80 | 74.47 | 49.72 | 42.81 | 39.58 | 75.82 | 61.79 | 55.12 |
| PVRCNN++ [12] | 88.94 | 78.54 | 77.50 | 58.68 | 52.88 | 48.33 | 76.30 | 66.98 | 62.81 |
| Ours | 88.64 | 78.12 | 77.32 | 62.05 | 55.51 | 51.05 | 83.09 | 68.24 | 64.13 |
Table 2. The test results on the nuScenes dataset. Abbreviations: pedestrian (Ped.), traffic cone (T.C.), construction vehicle (C.V.), truck (Tru.), trailer (Tra.), motor (Mo.), bicycle (Bic.), barrier (Bar.).
| Methods | NDS | mAP | Car | Tru. | Bus | Tra. | C.V. | Ped. | Mo. | Bic. | T.C. | Bar. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointPillars [4] | 45.2 | 25.8 | 70.3 | 32.9 | 44.9 | 18.5 | 4.2 | 46.8 | 14.8 | 0.6 | 7.5 | 21.3 |
| 3DSSD [8] | 51.7 | 34.5 | 75.9 | 34.7 | 60.7 | 21.4 | 10.6 | 59.2 | 25.5 | 7.4 | 14.8 | 25.5 |
| SASA [14] | 55.3 | 36.1 | 71.7 | 42.2 | 63.5 | 29.6 | 12.5 | 62.6 | 27.5 | 9.1 | 12.2 | 30.4 |
| PVRCNN++ [12] | 58.9 | 44.6 | 79.3 | 45.6 | 59.4 | 32.4 | 11.3 | 71.2 | 25.3 | 10.2 | 16.3 | 32.6 |
| MSPV3D | 61.3 | 48.2 | 76.1 | 40.2 | 62.8 | 31.3 | 14.2 | 75.3 | 28.4 | 13.8 | 18.4 | 31.2 |
Table 3. Ablation experiment results for the feature fusion module on the KITTI dataset. Abbreviations: multi-scale voxel feature extraction (Multi), multi-scale voxel feature fusion (Fusion).
All values are 3D AP (%).

| Multi | Fusion | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard |
|---|---|---|---|---|---|---|---|---|---|---|
| ✗ | ✗ | 88.94 | 78.54 | 77.50 | 58.68 | 52.88 | 48.33 | 76.30 | 66.98 | 62.81 |
| ✓ | ✗ | 88.42 | 78.37 | 77.65 | 60.43 | 53.04 | 49.98 | 80.86 | 67.75 | 62.15 |
| ✓ | ✓ | 88.64 | 78.12 | 77.32 | 62.05 | 55.51 | 51.05 | 83.09 | 68.24 | 64.13 |
Table 4. Results for the selected ground filter rates.
All values are 3D AP (%).

| Filter Rate (a) | Car Easy | Car Mod. | Car Hard | Ped. Easy | Ped. Mod. | Ped. Hard | Cyc. Easy | Cyc. Mod. | Cyc. Hard |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 88.74 | 78.21 | 77.32 | 59.28 | 52.74 | 48.32 | 79.20 | 66.05 | 61.63 |
| 0.3 | 88.88 | 78.23 | 77.08 | 59.47 | 52.88 | 48.65 | 79.89 | 65.95 | 61.73 |
| 0.5 | 88.96 | 78.25 | 77.42 | 60.80 | 53.75 | 48.65 | 78.26 | 67.21 | 62.06 |
| 0.7 | 88.64 | 78.12 | 77.32 | 62.05 | 55.51 | 51.05 | 83.09 | 68.24 | 64.13 |
| 1 | 88.60 | 78.02 | 76.73 | 60.29 | 54.18 | 49.61 | 80.94 | 66.26 | 63.71 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
