In this section, we explain in detail how to prune points that remain almost stationary across consecutive point clouds and how to utilize SIMD vector parallelism to accelerate the Gather-GEMMs-Scatter dataflow to further speed up the process.
3.1. Prune Method for Redundancy of Point Clouds
In submanifold convolutions, the output coordinates are the same as the input coordinates when the input stride $s$ is 1, as follows:

$$P_{out} = P_{in}, \quad s = 1.$$
Employing sparse convolution techniques can effectively preserve the data's inherent sparsity, ensuring efficient computation and resource utilization. However, when the input stride $s$ is 2 or larger, as in a downsampling or max-pooling operation, both background and foreground points dilate simultaneously, and the identity above no longer holds: the output coordinates must be recomputed. As shown in Figure 6, the input coordinates are first offset to derive candidate output coordinates; potential output points within the expanded neighborhood of the input coordinates are then identified, and the final output coordinates are selected according to the stride of the input coordinates, ensuring precise placement. This can be expressed as follows:

$$\hat{P} = \{\, \mathbf{p} + \mathbf{k} \mid \mathbf{p} \in P_{in},\ \mathbf{k} \in K \,\}, \qquad P_{out} = \{\, \mathbf{q} \in \hat{P} \mid \mathbf{q} \,\%\, s = \mathbf{0} \,\},$$

where $K$ denotes the set of kernel offsets and % represents the mod operation, applied element-wise to the coordinates.
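For concreteness, the following is a minimal sketch of this coordinate-generation step, assuming a 3 × 3 × 3 kernel neighborhood; the function and type names are illustrative and not taken from any particular library.

```cpp
#include <set>
#include <tuple>
#include <vector>

using Coord = std::tuple<int, int, int>;

// Generate stride-s output coordinates: offset each input coordinate by every
// kernel position, then keep only candidates aligned to the stride (q % s == 0).
std::vector<Coord> downsample_coords(const std::vector<Coord>& in, int s) {
    std::set<Coord> out;                                  // de-duplicated output coordinates
    for (const auto& [x, y, z] : in) {
        for (int dx = -1; dx <= 1; ++dx)                  // offset by every kernel position
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    int qx = x + dx, qy = y + dy, qz = z + dz;
                    if (qx % s == 0 && qy % s == 0 && qz % s == 0)
                        out.insert({qx, qy, qz});
                }
    }
    return {out.begin(), out.end()};
}
```

Even in this toy form, every input point spawns up to 27 candidates, which is exactly the dilation effect discussed next.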
This process can disrupt the sparsity of the input points: because background points account for a significant portion of the point cloud, the background dilates more quickly than the foreground. As illustrated in
Figure 3, sparse convolution relies on a hash table that establishes the relationship between input and output coordinates, thereby facilitating computations. The faster the expansion of background points, the more significant the increase in the length of the hash table, leading to a rapid growth in the computational load of the entire sparse convolution. This escalation ultimately undermines the sparsity of the point cloud data, thereby negating the advantages of sparse convolution, as shown in
Figure 7.
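To make the role of the hash table concrete, a minimal sketch of such a coordinate table is shown below; the 64-bit key packing and all names are assumptions for illustration, not the implementation used here.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Coord3 { int32_t x, y, z; };

// Pack three signed coordinates into one 64-bit key (21 bits per axis).
uint64_t pack(const Coord3& c) {
    auto u = [](int32_t v) { return static_cast<uint64_t>(v & 0x1FFFFF); };
    return (u(c.x) << 42) | (u(c.y) << 21) | u(c.z);
}

// Build the coordinate -> feature-row-index table that the kernel map queries.
std::unordered_map<uint64_t, int> build_coord_table(const std::vector<Coord3>& coords) {
    std::unordered_map<uint64_t, int> table;
    table.reserve(coords.size());
    for (int i = 0; i < static_cast<int>(coords.size()); ++i)
        table.emplace(pack(coords[i]), i);
    return table;
}
```

The more the background dilates, the more entries this table holds, and the larger the subsequent Gather-GEMMs-Scatter workload becomes.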
In the domain of 2D image analysis, one approach to enhancing computational speed and efficiency involves generating masks to flag features that should be ignored [
22,
23,
24,
25]. This motivates us to investigate applying analogous masking techniques to 3D point clouds. Inspired by this unequal dilation rate, we propose a pruning method based on the difference of coordinates between adjacent keyframes during the downsampling operation. The coordinate difference is defined as follows:

$$\Delta \mathbf{p}_m^t = \mathbf{p}_m^t - \mathbf{p}_m^{t-1},$$

where $\mathbf{p}_m^t$ represents the coordinates at the $t$-th timeslot and $m$ denotes the index of the point. The difference for background points tends to be small, while for foreground points it is the opposite. We therefore define a mask based on the coordinate difference to decide which points are more likely to be background points, as follows:

$$M_m^t = \begin{cases} \text{False}, & d_m^t < T, \\ \text{True}, & d_m^t \geq T, \end{cases}$$

where $d_m^t$ represents the sum of squared coordinate differences and is expressed as follows:

$$d_m^t = \left\lVert \Delta \mathbf{p}_m^t \right\rVert^2 = (\Delta x_m^t)^2 + (\Delta y_m^t)^2 + (\Delta z_m^t)^2 .$$

This implies that for any given point $m$ at the $t$-th timeslot, the coordinate displacement is measured as $d_m^t$, and a smaller displacement corresponds to a smaller $d_m^t$.
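A minimal sketch of this displacement computation is given below; it assumes the two adjacent keyframes index the same points in the same order, which in practice requires a prior point-association step.

```cpp
#include <vector>

struct Point { float x, y, z; };

// Sum of squared coordinate differences d_m^t between two adjacent keyframes.
std::vector<float> displacement(const std::vector<Point>& prev,
                                const std::vector<Point>& curr) {
    std::vector<float> d(curr.size());
    for (size_t m = 0; m < curr.size(); ++m) {
        float dx = curr[m].x - prev[m].x;
        float dy = curr[m].y - prev[m].y;
        float dz = curr[m].z - prev[m].z;
        d[m] = dx * dx + dy * dy + dz * dz;   // d_m^t
    }
    return d;
}
```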
To determine the threshold $T$, which is one of the values in the collection $D^t = \{ d_m^t \}$ of all points' coordinate displacements at the $t$-th timeslot, we first sort $D^t$ in ascending order and denote the result as $\tilde{D}^t$. Subsequently, we establish a pruning ratio $r$ to denote the proportion of points that should be considered background points. We then multiply the pruning ratio by the length $l$ of $\tilde{D}^t$ to find the index $i$ of the last background point, whose displacement value is set as the threshold $T$. The index $i$ is defined as follows:

$$i = \lfloor r \cdot l \rfloor .$$

Then, we index $\tilde{D}^t$ at position $i$ to obtain the threshold $T$.
Any point whose displacement falls below this threshold receives a false value in the mask at the corresponding location, ensuring the precise identification and handling of sub-threshold points.
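The threshold selection and mask construction described above can be sketched as follows; the floor-based indexing and the function name are illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Build the pruning mask: sort the displacements, take the value at index
// floor(r * l) as the threshold T, and mark points below T as background (false).
std::vector<bool> prune_mask(const std::vector<float>& d, float ratio) {
    if (d.empty()) return {};
    std::vector<float> sorted = d;
    std::sort(sorted.begin(), sorted.end());              // ascending order
    const size_t l = sorted.size();
    size_t idx = static_cast<size_t>(std::floor(ratio * static_cast<float>(l)));
    if (idx >= l) idx = l - 1;                            // clamp for ratio == 1.0
    const float T = sorted[idx];                          // threshold
    std::vector<bool> mask(l);
    for (size_t m = 0; m < l; ++m)
        mask[m] = (d[m] >= T);                            // false => likely background
    return mask;
}
```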
To identify an appropriate threshold, we conducted an investigation using a mini dataset from nuScenes. We analyzed the change in the number of foreground and background points at pruning ratios of 0.3, 0.5, and 0.7. The results revealed that when the pruning ratio $r$ is around 0.3, there is a significant reduction in the number of background points, while the reduction in foreground points is minimal and can be considered negligible. Therefore, we recommend setting the pruning ratio to around 0.3.
This mask clearly differentiates background and foreground points, facilitating accurate segmentation and analysis.
Applying the mask allows the downsampling operation to be performed only on the selected points, significantly reducing the required computation, as shown in
Figure 7. According to Equations (4) and (6), this process can be expressed as follows:

$$P'_{in} = \{\, \mathbf{p}_m \in P_{in} \mid M_m^t = \text{True} \,\}, \qquad \hat{P}' = \{\, \mathbf{p} + \mathbf{k} \mid \mathbf{p} \in P'_{in},\ \mathbf{k} \in K \,\}.$$

Then, the output coordinates can be represented as:

$$P_{out} = \{\, \mathbf{q} \in \hat{P}' \mid \mathbf{q} \,\%\, s = \mathbf{0} \,\}.$$
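Putting the pieces together, the following sketch filters by the mask and then runs the same offset-and-stride selection on the surviving points; the 3 × 3 × 3 neighborhood and all names are illustrative assumptions.

```cpp
#include <set>
#include <tuple>
#include <vector>

using Coord = std::tuple<int, int, int>;

// Pruned downsampling: skip points whose mask entry is false, then dilate the
// remaining points and keep candidates aligned to the stride.
std::vector<Coord> masked_downsample(const std::vector<Coord>& in,
                                     const std::vector<bool>& mask, int s) {
    std::set<Coord> out;
    for (size_t m = 0; m < in.size(); ++m) {
        if (!mask[m]) continue;                           // pruned background point
        const auto& [x, y, z] = in[m];
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    int qx = x + dx, qy = y + dy, qz = z + dz;
                    if (qx % s == 0 && qy % s == 0 && qz % s == 0)
                        out.insert({qx, qy, qz});
                }
    }
    return {out.begin(), out.end()};
}
```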
As shown in
Figure 8, the pruning method is applied to the first downsampling layer of the encoder, because the coordinate-difference information is most precise before any downsampling has taken place; once downsampling has generated new coordinates, that information is harder to reuse for pruning.
3.2. Optimization for Gather-GEMMs-Scatter
Unlike 2D images, where computations can be performed by continuously sliding a convolution window, sparse convolution relies on mapping operations. A common implementation records the coordinates of non-zero inputs in a hash table with entries of the form (coordinate, index). After mapping, sparse convolution queries the hash table to retrieve the inputs associated with each kernel offset. These inputs are then rearranged to satisfy the contiguity requirements of the subsequent matrix multiplication. Once the computation is complete, the results are scattered back to their respective output positions. This entire process is referred to as the Gather-GEMMs-Scatter dataflow. Matrix multiplication is the core component of sparse convolution, and it takes up about 20–50% of the whole execution time [
19]. However, the data orchestration that makes the data regular also consumes a significant portion of the execution time, accounting for nearly half of it. Therefore, optimizing the Gather and Scatter operations can noticeably reduce the overall execution time.
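For a single kernel offset, the dataflow can be sketched as below; the (input row, output row) pair list stands in for the hash-table lookup result, and the naive GEMM and all names are illustrative assumptions rather than an optimized implementation.

```cpp
#include <utility>
#include <vector>

// Gather-GEMM-Scatter for one kernel offset: gather the mapped input rows into
// a dense buffer, multiply by this offset's weight matrix, and scatter the
// partial sums back to the output rows.
void gather_gemm_scatter(const std::vector<std::pair<int, int>>& pairs, // (in row, out row)
                         const std::vector<float>& in,  int c_in,       // N_in  x c_in
                         const std::vector<float>& w,                   // c_in  x c_out
                         std::vector<float>& out, int c_out) {          // N_out x c_out
    const int n = static_cast<int>(pairs.size());
    std::vector<float> buf(n * c_in), prod(n * c_out, 0.f);
    for (int i = 0; i < n; ++i)                                // Gather
        for (int c = 0; c < c_in; ++c)
            buf[i * c_in + c] = in[pairs[i].first * c_in + c];
    for (int i = 0; i < n; ++i)                                // GEMM (naive)
        for (int k = 0; k < c_in; ++k)
            for (int j = 0; j < c_out; ++j)
                prod[i * c_out + j] += buf[i * c_in + k] * w[k * c_out + j];
    for (int i = 0; i < n; ++i)                                // Scatter (accumulate)
        for (int c = 0; c < c_out; ++c)
            out[pairs[i].second * c_out + c] += prod[i * c_out + c];
}
```

The Gather and Scatter loops are exactly the per-element copies that the SIMD rewriting below targets.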
Considering the parallelism of modern CPU platforms, we use SIMD to execute the Gather and Scatter operations in parallel. As shown in
Figure 5, SIMD technology utilizes vector registers or vector processors to perform the same operation on a batch of data simultaneously. Under the control of a single instruction, the vector processor can process multiple data elements at once, significantly improving computational efficiency and performance. This means more data can be processed in the same amount of time on the CPU.
For sparse convolution, as shown in
Figure 3, a matrix multiplication is performed for each unique kernel weight to obtain partial sums, which are then added together to obtain the final result. During the Gather and Scatter operations, the first step is to extract the parts of the discontinuous point cloud data involved in the current kernel's computation and arrange them in a contiguous sequence for the next operation (in the Scatter operation, the contiguous results are placed back in their original positions), as shown in
Figure 9. In this step, data are read in batches of the channel size and then moved to their new positions. Compared with scalar operations, the efficient use of SIMD allows parallel acceleration along the channel dimension, enabling multiple data elements to be processed simultaneously. This significantly enhances computational speed in high-dimensional data processing tasks.
In SIMD operations, channel-wise batched data are read in chunks of the vector register length rather than one element at a time from the cache or memory. In scenarios where the channel count is not an integer multiple of the SIMD register length, a full register-length chunk is fetched from the tail end of the batch rather than padding the data. This precaution prevents errors stemming from array out-of-bounds accesses, thereby maintaining the computation's integrity and accuracy. The detailed steps are outlined in Algorithm 1.
Complexity Analysis: The original complexity is $O(N \cdot C)$, where $N$ is the number of gathered input–output pairs and $C$ is the number of channels, because the Gather and Scatter operations require two nested for loops. After the SIMD rewriting, the number of iterations in the inner loop is significantly reduced, especially when the numbers of input and output channels are small (the inner loop's iteration count is determined by these two channel counts). In this case, the inner-loop iterations can be regarded as constant relative to the outer loop, giving a time complexity of $O(N)$. However, when the number of channels is large, the limited size of the vector registers makes it difficult to substantially reduce the inner-loop iterations; the time complexity then degrades back to $O(N \cdot C)$, although it is still faster than the original unmodified version by a constant factor.
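As a rough iteration count, assuming $N$ gathered index pairs, $C$ channels per point, and $W$ lanes per vector register:

$$\text{scalar iterations} = N \cdot C, \qquad \text{SIMD iterations} = N \cdot \left\lceil \frac{C}{W} \right\rceil .$$

For example, with $C = 64$ channels and $W = 8$ (a 256-bit register holding eight 32-bit floats), the inner loop runs 8 times per point instead of 64.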
With the SIMD optimization of the Gather and Scatter operations, the performance of the submanifold convolution is significantly enhanced.
Algorithm 1: SIMD optimization for Gather and Scatter.
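A minimal AVX sketch of the gather step with this tail handling is given below; the eight-float vector width, the intrinsics used, and the function names are illustrative assumptions and may differ from Algorithm 1 in detail.

```cpp
#include <immintrin.h>
#include <vector>

// SIMD gather along the channel dimension: copy each mapped input row into a
// contiguous buffer in register-width chunks. When c_in is not a multiple of
// the 8-float vector width, the last chunk is re-read from the tail end of the
// row instead of padding, so no access goes out of bounds.
void simd_gather(const std::vector<int>& in_rows,   // input row index per pair
                 const float* in, int c_in,         // source features, row-major
                 float* buf) {                      // dense buffer, pairs x c_in
    const int W = 8;                                // floats per 256-bit register
    const int body = (c_in / W) * W;                // largest multiple of W <= c_in
    for (size_t i = 0; i < in_rows.size(); ++i) {
        const float* src = in + static_cast<size_t>(in_rows[i]) * c_in;
        float* dst = buf + i * static_cast<size_t>(c_in);
        int c = 0;
        for (; c < body; c += W)                    // full-width vector copies
            _mm256_storeu_ps(dst + c, _mm256_loadu_ps(src + c));
        if (c < c_in) {
            if (c_in >= W) {                        // tail: re-read the last W floats
                const int t = c_in - W;
                _mm256_storeu_ps(dst + t, _mm256_loadu_ps(src + t));
            } else {
                for (; c < c_in; ++c) dst[c] = src[c];  // tiny channel count: scalar copy
            }
        }
    }
}
```

The scatter step is symmetric: the same register-width chunks are loaded from the contiguous result buffer and accumulated back into the rows given by the output indices.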