In the first stage, the original point cloud is transformed into voxels by the data processing module and then fed into the symmetry point generation module for foreground point segmentation and symmetry point prediction. The generated symmetry points and the non-empty voxel centers form an enhanced point cloud. In the second stage, the enhanced point cloud is fed into the baseline, an anchor-based region proposal network, to produce the final detection results.
3.2.1. Data Processing
The complete 3D point cloud scene scanned by LiDAR often covers a very large space. Without any preprocessing, direct voxelization is not only time-consuming but also occupies a large amount of GPU memory during subsequent feature extraction. Hence, we need to limit the scene size. For a given point cloud, we only process points within the range of [0, 70.4] m along the X-axis, [−40, 40] m along the Y-axis, and [−3, 1] m along the Z-axis in the point cloud coordinate system. We also keep only the points inside the field of view of the left camera. Specifically, we use the coordinate transformation matrix to project the point cloud onto the left image plane and discard the points that fall outside the image.
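As an illustration, the cropping step can be sketched as follows. This is not the authors' code; the 3 × 4 projection matrix `P` (mapping LiDAR coordinates to left-image pixels) and the image size are assumed inputs.

```python
import numpy as np

def crop_point_cloud(points, P, img_w, img_h):
    """Keep points inside the detection range and the left-camera field of view.

    points: (N, 4) array of (x, y, z, intensity) in LiDAR coordinates.
    P:      (3, 4) projection matrix from LiDAR coordinates to image pixels
            (assumed to be the composed calibration matrix).
    """
    # Range crop: [0, 70.4] m in X, [-40, 40] m in Y, [-3, 1] m in Z.
    in_range = (
        (points[:, 0] >= 0.0) & (points[:, 0] < 70.4) &
        (points[:, 1] >= -40.0) & (points[:, 1] < 40.0) &
        (points[:, 2] >= -3.0) & (points[:, 2] < 1.0)
    )
    # Project to the left image plane and keep points that land inside the image.
    xyz1 = np.hstack([points[:, :3], np.ones((points.shape[0], 1))])  # (N, 4)
    uvw = xyz1 @ P.T                                                  # (N, 3)
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    in_view = (uvw[:, 2] > 0) & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    return points[in_range & in_view]
```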
After the above operations on the original point cloud, we voxelize the scene with a voxel size of 0.05 m × 0.05 m × 0.1 m, so the resolution of the voxelized 3D space along the X-, Y-, and Z-axes is 1408 × 1600 × 40. Each point falls into its corresponding voxel. When the number of points in a voxel exceeds a threshold N, the points are randomly subsampled so that only N remain; when it is smaller than N, the voxel is padded with zeros. The mean coordinates of the points in each non-empty voxel are used as the initial voxel features. The point set composed of the centers of the non-empty voxels can be regarded as a regular point cloud, and the smaller the voxel size, the smaller its difference from the original point cloud. Therefore, the features of a voxel can also be regarded as the features of its voxel center.
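A minimal voxelization sketch under the settings above (not the authors' implementation); the voxel size and range follow the text, while the per-voxel point cap N and everything else are illustrative.

```python
import numpy as np

VOXEL_SIZE = np.array([0.05, 0.05, 0.1])     # meters along (x, y, z)
RANGE_MIN = np.array([0.0, -40.0, -3.0])     # lower bound of the cropped scene
GRID_SHAPE = (1408, 1600, 40)                # resolution along X, Y, Z

def voxelize(points, max_points_per_voxel=5):
    """Group points into voxels; cap each voxel at N points and mean-pool coordinates."""
    idx = np.floor((points[:, :3] - RANGE_MIN) / VOXEL_SIZE).astype(np.int64)
    voxels = {}
    for p, i in zip(points, map(tuple, idx)):
        voxels.setdefault(i, []).append(p)

    coords, features = [], []
    for i, pts in voxels.items():
        pts = np.asarray(pts)
        if len(pts) > max_points_per_voxel:            # random sampling down to N points
            pts = pts[np.random.choice(len(pts), max_points_per_voxel, replace=False)]
        # Zero-padding to a fixed (N, 4) buffer, used for batching, is omitted here;
        # the mean coordinates of the actual points serve as the initial voxel feature.
        coords.append(i)
        features.append(pts[:, :3].mean(axis=0))
    return np.asarray(coords), np.asarray(features)
```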
3.2.2. Symmetry Point Generation Network
In practical applications, we cannot directly calculate the location of a symmetry point using Equation (1) for data that we have not labeled. So, we need a network that can segment the foreground points and predict the positions of their symmetry points. To accomplish these two tasks, we use a feature extractor similar to the UNet [26] proposed by [27] to extract non-empty voxel-wise features. The network structure is shown in Figure 3; it is mainly composed of sparse convolution layers and submanifold convolution [28] layers. The encoder uses three sparse convolutions with stride 2 to downsample the spatial resolution eight times, and each sparse convolution layer is followed by two submanifold convolutions. The decoder uses four upsampling blocks to restore the spatial resolution to the original scale so that we can obtain an effective non-empty voxel-wise feature representation. After the decoder, a segmentation head and a regression head are added to segment the point cloud and estimate the positions of the symmetry points, respectively.
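To make the structure concrete, the following is a minimal sketch of such a sparse UNet-style extractor with the two heads, written against the spconv.pytorch API (SubMConv3d, SparseConv3d, SparseInverseConv3d). The channel widths and block counts are illustrative and skip connections are omitted, so this is not the authors' network.

```python
import torch.nn as nn
import spconv.pytorch as spconv   # spconv v2.x; in v1.x the import is `import spconv`


def enc_block(cin, cout, key):
    # One stride-2 sparse convolution followed by two submanifold convolutions.
    return spconv.SparseSequential(
        spconv.SparseConv3d(cin, cout, 3, stride=2, padding=1, indice_key=f"down{key}"),
        nn.ReLU(),
        spconv.SubMConv3d(cout, cout, 3, padding=1, indice_key=f"subm{key}"),
        nn.ReLU(),
        spconv.SubMConv3d(cout, cout, 3, padding=1, indice_key=f"subm{key}"),
        nn.ReLU(),
    )


class SymmetryPointNet(nn.Module):
    """Sparse UNet-style extractor with segmentation and offset-regression heads."""

    def __init__(self, cin=3):
        super().__init__()
        self.stem = spconv.SparseSequential(
            spconv.SubMConv3d(cin, 16, 3, padding=1, indice_key="subm0"), nn.ReLU())
        # Encoder: three stride-2 blocks -> spatial resolution downsampled 8x.
        self.enc1 = enc_block(16, 32, 1)
        self.enc2 = enc_block(32, 64, 2)
        self.enc3 = enc_block(64, 64, 3)
        # Decoder: inverse convolutions restore the original resolution.
        self.dec3 = spconv.SparseSequential(
            spconv.SparseInverseConv3d(64, 64, 3, indice_key="down3"), nn.ReLU())
        self.dec2 = spconv.SparseSequential(
            spconv.SparseInverseConv3d(64, 32, 3, indice_key="down2"), nn.ReLU())
        self.dec1 = spconv.SparseSequential(
            spconv.SparseInverseConv3d(32, 16, 3, indice_key="down1"), nn.ReLU())
        # Per-voxel heads: foreground score and (dx, dy) symmetry-point offset.
        self.seg_head = nn.Linear(16, 1)
        self.reg_head = nn.Linear(16, 2)

    def forward(self, x):
        # x: spconv.SparseConvTensor built from the non-empty voxel features.
        x = self.stem(x)
        x = self.enc3(self.enc2(self.enc1(x)))
        x = self.dec1(self.dec2(self.dec3(x)))
        feats = x.features                      # (num_non_empty_voxels, 16)
        return self.seg_head(feats), self.reg_head(feats)
```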
The 3D ground-truth boxes naturally provide the semantic label of each point: the points inside the boxes are regarded as foreground points, and the remaining points as background points. Considering the large gap between the numbers of foreground and background points in autonomous driving scenes, we use the focal loss proposed by [29] to alleviate the foreground–background class imbalance problem:

$$\mathcal{L}_{\text{seg}} = -\frac{1}{N_{\text{pos}}} \sum_{i} \alpha_{t,i} \left(1 - p_{t,i}\right)^{\gamma} \log p_{t,i}, \qquad p_{t,i} = \begin{cases} p_i, & y_i = 1 \\ 1 - p_i, & y_i = 0 \end{cases}, \qquad \alpha_{t,i} = \begin{cases} \alpha, & y_i = 1 \\ 1 - \alpha, & y_i = 0 \end{cases}$$

where $y_i$ is a binary label indicating whether point $i$ is a foreground point, $p_i \in [0, 1]$ is the probability estimated by the network that the point is a foreground point, and $N_{\text{pos}}$ is the number of foreground points. $\alpha$ and $\gamma$ are hyperparameters, set to $\alpha = 0.25$ and $\gamma = 2$ as in the original paper.
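A small sketch of this per-point focal loss in PyTorch (an illustrative re-implementation, not the authors' code):

```python
import torch

def point_focal_loss(logits, labels, alpha=0.25, gamma=2.0):
    """Focal loss for foreground/background point segmentation.

    logits: (P,) raw foreground scores for the non-empty voxels.
    labels: (P,) binary labels, 1 for foreground, 0 for background.
    """
    p = torch.sigmoid(logits)
    p_t = torch.where(labels > 0, p, 1.0 - p)
    alpha_t = torch.where(labels > 0, torch.full_like(p, alpha), torch.full_like(p, 1.0 - alpha))
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-6))
    num_pos = labels.float().sum().clamp(min=1.0)    # normalize by the number of foreground points
    return loss.sum() / num_pos
```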
In Section 3.1, we have introduced the method of calculating the position label of the symmetry point, which is expressed as $(x_s, y_s, z_s)$. From it we can obtain the position offsets $\Delta = (\Delta x, \Delta y)$ of the symmetry point relative to its corresponding foreground point. It should be noted that a pair of symmetric points here lies in the same height plane, so only the offsets in the X and Y directions need to be computed, i.e., $\Delta x = x_s - x$ and $\Delta y = y_s - y$. Denoting the position offset predicted by the network as $\hat{\Delta} = (\Delta\hat{x}, \Delta\hat{y})$, the loss of the position prediction branch can be represented by the following smooth-$L_1$ loss:

$$\mathcal{L}_{\text{sym}} = \frac{1}{N_{\text{pos}}} \sum_{i} \mathbb{1}\left[y_i = 1\right] \, \text{SmoothL1}\left(\hat{\Delta}_i - \Delta_i\right),$$

where $\mathbb{1}[\cdot]$ is an indicator function that restricts the loss to foreground points.
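The training target and masked regression loss could be sketched as follows (illustrative; the symmetry-point labels computed as in Section 3.1 are assumed to be given):

```python
import torch
import torch.nn.functional as F

def symmetry_offset_loss(pred_offsets, point_xy, sym_xy, fg_mask):
    """Smooth-L1 loss on the predicted (dx, dy) symmetry-point offsets.

    pred_offsets: (P, 2) offsets predicted by the regression head.
    point_xy:     (P, 2) x/y coordinates of the non-empty voxel centers.
    sym_xy:       (P, 2) x/y coordinates of the symmetry-point labels.
    fg_mask:      (P,)   boolean mask selecting the foreground points.
    """
    target = sym_xy - point_xy                       # ground-truth offsets (dx, dy)
    loss = F.smooth_l1_loss(pred_offsets[fg_mask], target[fg_mask], reduction="sum")
    return loss / fg_mask.sum().clamp(min=1)         # average over foreground points
```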
When the segmentation score of a point is greater than a threshold T, the point is considered to be a foreground point, and we can also obtain the position of its corresponding symmetry point estimated by the network, denoted as $(\hat{x}_s, \hat{y}_s, \hat{z}_s)$. Let the point set composed of these symmetry points be $\mathcal{S} = \{s_1, \ldots, s_M\}$ and the point set composed of the non-empty voxel centers be $\mathcal{V} = \{v_1, \ldots, v_N\}$, where $v_i$ is the vector of a point, composed of its coordinates and additional feature channels such as color, intensity, etc., and so is $s_j$. Since the intensity of a symmetry point is unknown, we only use the coordinates of the points as the initial features in this paper. The generated symmetry points and the non-empty voxel centers form an enhanced point cloud expressed as $\mathcal{P} = \mathcal{V} \cup \mathcal{S}$, which will be the input of the baseline.
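Putting the two heads together, the enhanced point cloud can be assembled roughly as follows (a sketch; the threshold value and tensor layouts are assumptions):

```python
import torch

def build_enhanced_point_cloud(voxel_centers, seg_logits, pred_offsets, threshold=0.5):
    """Append predicted symmetry points to the non-empty voxel centers.

    voxel_centers: (P, 3) coordinates of the non-empty voxel centers.
    seg_logits:    (P,)   foreground scores from the segmentation head.
    pred_offsets:  (P, 2) predicted (dx, dy) offsets from the regression head.
    """
    fg = torch.sigmoid(seg_logits) > threshold
    sym_points = voxel_centers[fg].clone()
    sym_points[:, :2] += pred_offsets[fg]                    # same height plane: only x/y shift
    return torch.cat([voxel_centers, sym_points], dim=0)     # enhanced point cloud P = V U S
```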
3.2.3. Region Proposal Network
The region proposal network consists of a backbone network and a detector head. The backbone network has the same structure as the encoder in the first stage and only adds a sparse convolution after the encoder to compress the height of the feature volume. The enhanced point cloud $\mathcal{P}$ is divided into voxels with a spatial resolution of $L \times W \times H$. The feature volume extracted by the backbone network is downsampled eight times, which can be expressed in tensor form as $\frac{L}{8} \times \frac{W}{8} \times \frac{H}{8} \times C$. After that, the height channels of this feature volume are compressed to obtain a BEV representation of size $\frac{L}{8} \times \frac{W}{8} \times \left(\frac{H}{8} \cdot C\right)$. Then, we apply the RPN head shown in Figure 4 to 3D box classification and regression, which is similar to SSD [30].
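The height-to-channel compression that yields the BEV feature map amounts to a reshape of the dense feature volume; a sketch assuming an spconv SparseConvTensor output (the axis layout of `.dense()` is an assumption):

```python
import spconv.pytorch as spconv

def to_bev(x: spconv.SparseConvTensor):
    """Collapse the height axis of the 3D feature volume into the channel axis.

    x.dense() is assumed to have shape (B, C, D, H, W);
    the returned BEV map has shape (B, C*D, H, W).
    """
    dense = x.dense()
    b, c, d, h, w = dense.shape
    return dense.reshape(b, c * d, h, w)
```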
For cars, we place anchors with two orientations, $0^\circ$ and $90^\circ$, on each pixel of the BEV feature maps, so there are $2 \times \frac{L}{8} \times \frac{W}{8}$ anchors, whose size is the average size of the cars in the dataset. An IoU (intersection over union) threshold of 0.6 is used to assign an anchor to a ground-truth box, and if the IoU is less than 0.45, the anchor is assigned to the background. We also apply the focal loss with its default parameter settings to box classification, and the smooth-$L_1$ loss is adopted to regress the normalized box parameters:

$$\mathcal{L}_{\text{box}} = \frac{1}{N_{\text{pos}}} \sum_{i} \text{SmoothL1}\left(\hat{\delta}_i - \delta_i\right),$$
where $\hat{\delta}_i$ is the predicted residual and $\delta_i = (x_t, y_t, z_t, w_t, l_t, h_t, \theta_t)$ is the regression target. The specific encoding method is as follows:

$$x_t = \frac{x_g - x_a}{d_a}, \quad y_t = \frac{y_g - y_a}{d_a}, \quad z_t = \frac{z_g - z_a}{h_a},$$
$$w_t = \log\frac{w_g}{w_a}, \quad l_t = \log\frac{l_g}{l_a}, \quad h_t = \log\frac{h_g}{h_a}, \quad \theta_t = \theta_g - \theta_a,$$

where $d_a = \sqrt{l_a^2 + w_a^2}$ is the diagonal of the anchor's base, and the subscripts $a$, $t$, and $g$ respectively indicate the anchor, the encoded value, and the ground truth.
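This encoding can be written compactly as below (an illustrative sketch of the standard residual encoding, not taken from the authors' code):

```python
import numpy as np

def encode_box(anchor, gt):
    """Encode a ground-truth box relative to an anchor.

    anchor, gt: arrays (x, y, z, w, l, h, theta) of box center, size, and yaw.
    Returns the regression target (x_t, y_t, z_t, w_t, l_t, h_t, theta_t).
    """
    xa, ya, za, wa, la, ha, ta = anchor
    xg, yg, zg, wg, lg, hg, tg = gt
    da = np.sqrt(la ** 2 + wa ** 2)              # diagonal of the anchor's base
    return np.array([
        (xg - xa) / da, (yg - ya) / da, (zg - za) / ha,
        np.log(wg / wa), np.log(lg / la), np.log(hg / ha),
        tg - ta,
    ])
```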