Article

Fast Motion State Estimation Based on Point Cloud by Combining Deep Learning and Spatio-Temporal Constraints

School of Automation, Chengdu University of Information Technology, Chengdu 610225, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8969; https://doi.org/10.3390/app14198969
Submission received: 21 August 2024 / Revised: 30 September 2024 / Accepted: 1 October 2024 / Published: 5 October 2024
(This article belongs to the Special Issue State-of-the-Art of Computer Vision and Pattern Recognition)

Abstract

Moving objects in the environment demand higher priority and pose greater challenges in growing domains such as unmanned vehicles and intelligent robotics. Estimating the motion state of objects from point clouds in outdoor scenarios remains a challenging research area, owing to factors such as limited temporal information, large data volumes, long network processing times, and ego-motion. A single point cloud frame typically contains 60,000–120,000 points, yet most current motion state estimation methods downsample to only a few thousand points for fast processing. This downsampling discards scene information, leaving such methods far from practical application. This paper therefore proposes a motion state estimation method that combines spatio-temporal constraints with deep learning. It first estimates and compensates the ego-motion of multi-frame point cloud data, mapping the frames into a unified coordinate system; a point cloud motion segmentation model operating on the multi-frame data is then proposed to segment moving objects. Finally, spatio-temporal constraints are used to associate moving objects across time and estimate their motion vectors. Experiments on KITTI, nuScenes, and real captured data show that the proposed method performs well, with an average vector deviation of only 0.036 m on KITTI and 0.043 m on nuScenes at a processing time of about 80 ms. The EPE3D error on the KITTI data is only 0.076 m, which demonstrates the effectiveness of the method.

1. Introduction

Technologies for systems such as unmanned vehicles and intelligent mobile robots have developed rapidly in recent years [1], and environment perception is one of the key technologies for ensuring their safety. In environment perception, Lidar is a sensor that directly acquires three-dimensional information. It can capture surrounding objects, obstacles, and road conditions as high-precision, high-resolution point cloud data, making it one of the key sensors for unmanned environment perception. Research on environmental perception, recognition, and analysis using Lidar point cloud data has therefore gained significant importance [2,3].
Generally, dynamic objects such as pedestrians, bicycles, and automobiles occupy only a very small spatial area within the detectable range of Lidar, while most of the area is covered by static scene elements such as buildings, roads, and vegetation. Nevertheless, detecting moving objects is essential for unmanned vehicles and intelligent mobile robots: recognizing these objects and estimating their motion states is equally, if not more, critical than identifying static scene elements. A growing body of research addresses moving object segmentation [4] and scene flow estimation [5] for Lidar point clouds, but the two tasks are typically treated in isolation. Moving object segmentation algorithms only identify moving entities, while scene flow estimation algorithms do not distinguish moving from stationary objects and instead estimate motion vectors for all scene points from the previous and current frames. Predicting motion vectors directly with scene flow methods therefore wastes considerable effort on the vast majority of non-moving points, since only a small portion of a scanned scene belongs to moving objects; this also creates an imbalance between dynamic and static data. Although moving targets have equal or even higher demands, directly predicting motion vectors via scene flow estimation must process a large number of points in consecutive frames and consumes substantial resources.
Therefore, in order to estimate object motion states and obtain accurate motion vectors on a moving platform in practical applications, this paper combines motion segmentation and scene flow estimation techniques and divides the moving object vector estimation problem into two phases: multi-frame-based moving object segmentation and moving object vector estimation. To address the apparent relative motion of stationary objects caused by the ego-motion of the Lidar sensor, we first compensate for the platform's ego-motion. Moving objects within the scene are then segmented using an enhanced motion segmentation model, and motion vectors for the segmented objects are estimated through spatio-temporal clustering and fusion.
The remainder of this paper is organized as follows: Section 2 introduces the related works. The proposed motion state estimation method is presented in detail in Section 3. Section 4 shows the experimental results, and the conclusions follow in Section 5.

2. Related Work

The research methodology presented in this paper primarily encompasses two key aspects: point cloud segmentation and motion vector estimation. Most existing studies focus on addressing specific components of these topics. Therefore, the following discussion will elaborate on and analyze related research efforts concerning both point cloud segmentation and scene flow estimation.

2.1. Point Cloud Segmentation

The introduction of PointNet [6] pushed deep learning-based point cloud analysis into a stage of rapid development. PointNet learns the rotational transformation properties of a point cloud with a T-Net, builds the model from multi-layer perceptrons (MLPs), obtains global features through pooling, and finally combines point features with the global features, showing satisfactory results on semantic segmentation tasks. Subsequently, to exploit local neighborhood information, PointNet++ was proposed to aggregate neighboring point features within a sphere radius through the PointNet pooling operation [7]. Wang et al. proposed graph attention convolution, which adaptively aggregates neighborhood features for point cloud semantic segmentation [8]. Hu et al. proposed RandLA-Net [9], arguing that random sampling can achieve good results while significantly saving memory and computational cost, and designed local spatial encoding and attentive pooling modules that better fuse local information and retain fine geometric details of the point cloud. To cope with the irregularity of point clouds, Zhou et al. proposed the voxelization-based VoxelNet model [10], composed mainly of a Voxel Feature Encoding (VFE) layer and a Region Proposal Network (RPN). However, a fixed voxel grid configuration struggles with the varying sparsity of point clouds produced by different Lidar sensors and scenes, and the voxel resolution and boundary artifacts caused by partitioning the point cloud limit performance. Tchapmi et al. proposed SegCloud [11], which employs trilinear interpolation combined with conditional random fields to better learn fine-grained scene information and remaps the coarse voxel predictions onto the point cloud via a 3D-FCNN [12].
Because 3D convolution is computationally expensive, projection-based methods, which convert point clouds into 2.5D structured image arrays or project 3D coordinates onto 2D planes, can benefit from 2D image deep learning techniques and have also developed rapidly. Wu et al. proposed SqueezeSeg [13], which is widely used for road object segmentation of point cloud data in autonomous driving. Milioto et al. proposed RangeNet++ to explicitly address semantic segmentation for rotating 3D Lidar, effectively handling the object scale problem by semantically reconstructing the raw points [14]. To make better use of neighborhood features, SalsaNext [15] introduced a context-aware module that extracts richer feature representations by considering the neighborhood information of points, and fused features at different scales through a multi-scale pyramid structure to handle semantic objects of different sizes and shapes. Chen et al. proposed LMNet for moving object segmentation by combining multi-frame point cloud data using projected images and motion residuals [4]. To better exploit the motion residuals, Sun et al. proposed a motion attention module to optimize LMNet [16].

2.2. Scene Flow Estimation

Dewan et al. proposed a method for estimating dense rigid scene flow in 3D Lidar scans [17]. The method formulates scene flow estimation as an energy minimization problem; by assuming local geometric constancy and regularizing toward smooth motion fields, analyzing the dynamics at the point level helps infer fine-grained details of the motion. As deep learning entered scene flow estimation, most studies turned to using raw point cloud data directly. In 2019, FlowNet3D [18] was proposed to learn the scene flow of 3D point clouds in an end-to-end manner; the model introduces a flow embedding layer that accurately recovers scene flow by learning deep hierarchical features of point cloud geometry together with flow embeddings representing motion. HPLFlowNet [19] operates directly on large-scale 3D point clouds, using bilateral convolutional layers to recover structural information from unstructured point clouds and to fuse information from two consecutive point clouds. In contrast to estimating scene flow directly at the point level, Puy et al. proposed the FLOT model, which builds on graph matching and optimal transport theory [20], accounts for point cloud matching errors, and uses feature similarity to establish the transport cost. To address the high computational cost and latency associated with complex matching mechanisms and feature decoding, FH-Net [21] employs a robust local geometric prior to directly derive keypoint flows via a lightweight cross layer; it can optionally backpropagate the computed sparse flows through an inverse cross layer to obtain hierarchical flows at different resolutions.
Despite the large number of studies on motion segmentation and scene flow estimation, problems such as high resource consumption and slow speed remain, so further in-depth research is needed to acquire the motion state of moving objects in the scene.

3. The Motion State Estimation Method

3.1. Overview

The workflow of the proposed method is given in Figure 1. Concretely, for a multi-frame point cloud $(P_t, P_{t+1}, \ldots, P_{t+i})$, $i \in [1, \ldots, 9]$, the ego-motion parameters (rotation matrix $R$ and translation vector $T$) are estimated to match the point cloud $P_{t+i}$ scanned at time $t+i$ with $P_t$ for ego-motion compensation. The ego-motion-compensated point cloud is then projected to a range image, and the differential image of the range image is computed. The concatenated range image and differential image are fed to a CNN backbone network for moving object segmentation. Finally, the motion vector estimation module associates moving objects at different times through spatio-temporal fusion and clustering to estimate their motion vectors. The details of each module are illustrated in the following sections.

3.2. Ego-Motion Compensation

In applications such as unmanned vehicles and intelligent mobile robots, the Lidar is mounted on a mobile platform. The motion of the platform induces apparent relative motion of stationary scene objects, which adversely affects the motion segmentation results of the perception system, and the estimated motion vectors of moving objects may not be consistent with their actual trajectories. Eliminating the ego-motion is therefore a prerequisite for accurate motion segmentation and vector estimation. The ego-motion can be estimated with other sensors (e.g., IMU, GPS) [22], but this paper studies motion vector estimation from Lidar data only; the ego-motion is therefore estimated and compensated mainly by means of point cloud alignment.
When performing ego-motion estimation, ground points at different locations usually have similar features and lack distinctive texture and structure, which makes matching ground points before and after motion error-prone; the ground points are therefore removed before matching. In this paper, we adopt the method proposed in [23] to exclude ground points. The non-ground points in different frames are then matched following the ICP (Iterative Closest Point) principle [24]. Because the naïve ICP method is computationally expensive, the VGICP method [25] is utilized to realize the matching between point clouds.
Concretely, for point clouds $P_t = \{p_n, n = 1, \ldots, N\}$ and $P_{t+i} = \{q_m, m = 1, \ldots, M\}$ scanned at times $t$ and $t+i$, the goal of VGICP is to estimate the transformation matrix $M = [R, T]$ from $P_{t+i}$ to $P_t$. Given a point $p_n$ and its neighboring points in $P_{t+i}$, $\mathcal{N}_n = \{ q_m \mid \|q_m - p_n\| < \theta_d,\ q_m \in P_{t+i} \}$, the distance $D_n$ is obtained as in Equation (1).

$$D_n = \sum_{q_m \in \mathcal{N}_n} (q_m - M p_n) \quad (1)$$
If $P_t$ and $P_{t+i}$ are Gaussian sampled with distributions $p_n \sim \mathcal{N}(\hat{p}_n, C_n^A)$ and $q_m \sim \mathcal{N}(\hat{q}_m, C_m^B)$, then $D_n$ has the distribution given in Equation (2).

$$D_n \sim \mathcal{N}(\mu_d, C_d), \quad \mu_d = \sum_{q_m \in \mathcal{N}_n} (\hat{q}_m - M \hat{p}_n) = 0, \quad C_d = \sum_{q_m \in \mathcal{N}_n} \left( C_m^B + M C_n^A M^T \right) \quad (2)$$
The transformation matrix $M$ can then be obtained by maximizing the log-likelihood, which is equivalent to the minimization in Equation (3).

$$M = \arg\min_M \sum_n \Big( \sum_{\mathcal{N}_n} (q_m - M p_n) \Big)^T \Big( \sum_{\mathcal{N}_n} (C_m^B + M C_n^A M^T) \Big)^{-1} \Big( \sum_{\mathcal{N}_n} (q_m - M p_n) \Big) \quad (3)$$
After rearrangement, we have Equation (4).

$$M = \arg\min_M \sum_n |\mathcal{N}_n| \left( \frac{\sum_{\mathcal{N}_n} q_m}{|\mathcal{N}_n|} - M p_n \right)^T \left( \frac{\sum_{\mathcal{N}_n} C_m^B}{|\mathcal{N}_n|} + M C_n^A M^T \right)^{-1} \left( \frac{\sum_{\mathcal{N}_n} q_m}{|\mathcal{N}_n|} - M p_n \right) \quad (4)$$
where $|\mathcal{N}_n|$ denotes the number of points in $\mathcal{N}_n$. By voxelizing the point cloud $P_{t+i}$, the terms $\sum_{\mathcal{N}_n} q_m / |\mathcal{N}_n|$ and $\sum_{\mathcal{N}_n} C_m^B / |\mathcal{N}_n|$ can be obtained easily and quickly; details can be found in [25]. After obtaining the transformation matrix $M$, $P_{t+i}$ is transformed into the coordinate system of $P_t$.
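To make Equation (4) concrete, the following NumPy sketch voxelizes the target cloud and scores a candidate transform $(R, T)$. It is an illustration only, not the implementation used in the paper (which relies on the VGICP code of [25]): the covariance of the points inside each voxel stands in for the averaged per-point covariances $\sum_{\mathcal{N}_n} C_m^B / |\mathcal{N}_n|$, and a small isotropic covariance is assumed for $C_n^A$.

```python
import numpy as np

def voxel_stats(target, voxel_size=1.0):
    """Per-voxel mean, covariance, and point count of the (N, 3) target cloud."""
    buckets = {}
    for p in target:
        key = tuple(np.floor(p / voxel_size).astype(np.int64))
        buckets.setdefault(key, []).append(p)
    stats = {}
    for key, pts in buckets.items():
        pts = np.asarray(pts)
        cov = np.cov(pts.T) if len(pts) > 1 else np.zeros((3, 3))
        # Small regularizer keeps single-point or degenerate voxels invertible.
        stats[key] = (pts.mean(axis=0), cov + 1e-3 * np.eye(3), len(pts))
    return stats

def vgicp_cost(source, R, T, stats, voxel_size=1.0, source_cov=1e-3 * np.eye(3)):
    """Equation (4)-style cost of mapping the source cloud onto the voxelized target."""
    cost = 0.0
    for p in source:
        q = R @ p + T
        key = tuple(np.floor(q / voxel_size).astype(np.int64))
        if key not in stats:
            continue  # no neighboring voxel; the point contributes nothing
        mean_q, mean_cov, count = stats[key]
        d = mean_q - q                              # (sum q_m / |N_n|) - M p_n
        C = mean_cov + R @ source_cov @ R.T         # (sum C_m^B / |N_n|) + M C_n^A M^T
        cost += count * d @ np.linalg.solve(C, d)   # weighted Mahalanobis distance
    return cost

# Usage: score the identity transform between two synthetic clouds.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.uniform(-5, 5, size=(2000, 3))
    source = target + rng.normal(scale=0.02, size=target.shape)
    print(vgicp_cost(source, np.eye(3), np.zeros(3), voxel_stats(target)))
```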

3.3. Moving Object Segmentation

After estimating the ego-motion, the moving objects in the point cloud are segmented using the model illustrated in Figure 2. The ego-motion-compensated multi-frame point cloud is first projected into a compact range image (Section 3.3.1). The range images are differenced to obtain the differential image, which is concatenated with the range image to form a 5D range image (Section 3.3.1). The 5D range image is then fed into a CNN segmentation model for moving object segmentation [4].

3.3.1. Range Image Projection and Differential Image Generation

In typical use, the Lidar rotates to generate a point cloud centered on the sensor. In this paper, the point cloud data are projected into a compact range image via spherical projection. A schematic of the spherical projection is shown in Figure 3.
The spherical coordinates are computed mainly by the following equation [14],

$$\theta = \arctan(y / x), \quad \varphi = \arctan\left( z / \sqrt{x^2 + y^2} \right) \quad (5)$$
where $(x, y, z)$ are the coordinates of the point in the Lidar coordinate system. The yaw and pitch of the point can then be calculated by Equation (6),

$$yaw = (\theta + \pi) / 2\pi, \quad pitch = (up - \varphi) / (up - down) \quad (6)$$
where $up$ and $down$ are the upper and lower limits of the Lidar vertical field of view. By setting the range image resolution to $(W, H)$, for a point $p(x, y, z)$, the range image coordinates $(row, col)$ are

$$col = yaw \cdot W, \quad row = pitch \cdot H \quad (7)$$
In this way, each point $p(x, y, z)$ of the point cloud is projected to a pixel coordinate $(row, col)$ of the range image $r$. The corresponding pixel value is a four-dimensional vector containing $(x, y, z)$ and the distance of the point to the Lidar emitting source, as shown in Equation (8).

$$range = \sqrt{x^2 + y^2 + z^2}, \quad r(row, col) = (x, y, z, range) \quad (8)$$
The differential image $d$ can then be obtained by Equation (9),

$$d_i(row, col) = \frac{r(row, col)_t - r(row, col)_{t+i}}{r(row, col)_t} \quad (9)$$
where $r(row, col)_t$ is the range image of the source point cloud $P_t$, $r(row, col)_{t+i}$ is the range image of the ego-motion-compensated point cloud $P_{t+i}$, and $d_i$ is the differential image. The range image feature $f(row, col)$ is obtained by concatenating $r(row, col)$ and $d(row, col)$, that is, $f(row, col) = (x, y, z, range, d)$.
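The projection and the differential image can be sketched in a few lines of NumPy, as below. The image size and the vertical field of view are assumptions (typical 64-beam values, not taken from the paper), and points falling on the same pixel are resolved by keeping the last write rather than the closest point.

```python
import numpy as np

def range_projection(points, W=2048, H=64, fov_up=np.radians(3.0), fov_down=np.radians(-25.0)):
    """Project an (N, 3) point cloud to an (H, W, 4) range image following Equations (5)-(8)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.sqrt(x**2 + y**2 + z**2)                     # Equation (8), range channel
    theta = np.arctan2(y, x)                              # Equation (5)
    phi = np.arctan2(z, np.sqrt(x**2 + y**2))
    yaw = (theta + np.pi) / (2.0 * np.pi)                 # Equation (6), normalized to [0, 1]
    pitch = (fov_up - phi) / (fov_up - fov_down)
    col = np.clip((yaw * W).astype(np.int32), 0, W - 1)   # Equation (7)
    row = np.clip((pitch * H).astype(np.int32), 0, H - 1)
    img = np.zeros((H, W, 4), dtype=np.float32)
    img[row, col] = np.stack([x, y, z, rng], axis=-1)     # last write wins on collisions
    return img

def differential_image(range_t, range_ti, eps=1e-6):
    """Differential image of Equation (9), computed on the range channel of two range images."""
    r_t, r_ti = range_t[..., 3], range_ti[..., 3]
    d = np.zeros_like(r_t)
    valid = r_t > eps                                      # pixels with a valid source range
    d[valid] = (r_t[valid] - r_ti[valid]) / r_t[valid]
    return d
```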

3.3.2. CNN Segmentation Model

The generated 5D range image is fed to a CNN segmentation model for moving object segmentation. In this model, the meta-kernel convolution technique [16] is used after the context extraction module to effectively incorporate 3D feature information, and the backbone network uses the classical range image semantic segmentation model SalsaNext [15] to estimate the moving points and generate binarized motion segmentation labels. Finally, the range image segmentation results are projected back into 3D space and refined using kNN, as in [14,16].
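Before detailing the two modules, the following PyTorch sketch shows how such a 5-channel input could be assembled and passed to a segmentation backbone; the tiny two-layer head only stands in for the actual SalsaNext backbone with the meta-kernel context module, and all names and shapes here are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def build_segmentation_input(range_img, diff_imgs):
    """Stack an (H, W, 4) range image with n differential images into a (1, 4+n, H, W) tensor.

    With a single differential image this yields the 5D per-pixel feature
    f(row, col) = (x, y, z, range, d) described in Section 3.3.1.
    """
    r = torch.from_numpy(range_img).permute(2, 0, 1).float()             # (4, H, W)
    d = torch.stack([torch.from_numpy(di).float() for di in diff_imgs])  # (n, H, W)
    return torch.cat([r, d], dim=0).unsqueeze(0)                         # (1, 4+n, H, W)

# Stand-in backbone assuming one differential image (5 input channels) and two output
# classes (static/moving); the paper uses SalsaNext [15] with meta-kernel convolution [16].
toy_backbone = nn.Sequential(
    nn.Conv2d(5, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, kernel_size=1),
)

if __name__ == "__main__":
    rng_img = np.random.rand(64, 2048, 4).astype(np.float32)
    diff = [np.random.rand(64, 2048).astype(np.float32)]
    logits = toy_backbone(build_segmentation_input(rng_img, diff))       # (1, 2, 64, 2048)
    print(logits.shape)
```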
(1) Context Extraction Module
The context extraction module utilizes the residual structure shown in Figure 4. The 3D information of the original points is fused by the meta-kernel convolution after the residual structure. The meta-kernel convolution is illustrated in Figure 5; operational details can be found in [16]. Its output is further fed to the SalsaNext model for motion segmentation.
(2) Point Reassignment
Although motion segmentation based on the range image has a fast inference speed, the blurring of range image boundaries makes dynamic and static points prone to segmentation errors in the junction area. As can be observed from Figure 6, some locations that are segmented well on the 2D range image are still entirely estimated as moving points in the 3D point cloud, despite the large differences in the positions of the foreground and background points.
To solve this problem, a k-nearest-neighbor (kNN) search is used to reassign ambiguous points at the boundaries [4,14]. For each estimated moving point, the k closest points are selected for consistency voting, and the majority label among the k points is assigned to that point. The number of neighboring points k is set empirically to the same value as in [4,14].
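A minimal sketch of this refinement, using a kd-tree over the 3D points, is given below. Note that [4,14] perform the search on the range image with range-based gating, whereas this simplified version votes directly in 3D space, and the value k = 5 is an assumption rather than the paper's setting.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_label_refinement(points, labels, predicted_moving_idx, k=5):
    """Reassign ambiguous boundary points by majority vote over their k nearest neighbors.

    points               : (N, 3) point cloud in 3D space
    labels               : (N,) binary motion labels projected back from the range image
    predicted_moving_idx : indices of points currently labeled as moving
    k                    : neighborhood size (k=5 is an assumption; the paper follows [4,14])
    """
    tree = cKDTree(points)
    refined = labels.copy()
    for i in predicted_moving_idx:
        # k + 1 because the query point is returned as its own nearest neighbor.
        _, idx = tree.query(points[i], k=k + 1)
        neighbor_labels = labels[idx[1:]]
        # Keep the moving label only if the majority of neighbors agree.
        refined[i] = 1 if neighbor_labels.sum() * 2 > k else 0
    return refined
```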

3.4. Motion State Estimation

Although the moving objects are segmented in the previous section, the same moving object cannot yet be reliably located across consecutive frames. Therefore, the moving objects in different frames are transformed into the same coordinate system according to the ego-motion parameters. The moving objects at different times are then associated and clustered using spatio-temporal constraints, and the motion state of each moving object is estimated by point matching. The structure of the motion state estimation is shown in Figure 7.

3.4.1. Spatio-Temporal Correlation and Clustering

The Lidar acquisition frame rate is generally 10 Hz, that is, 100 ms per frame. By overlaying the moving object point clouds from different times, a point cloud containing both spatial and temporal information is obtained, and the same object can be associated across times by clustering the motion-segmented objects in the temporal and spatial dimensions. Although point cloud data from different time frames are overlaid, the moving objects occupy only a small proportion of the points in each scan, so the overall number of moving points remains low and the processing stays highly efficient.
To realize fast spatio-temporal clustering of objects, we use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [26] to cluster the object points, fusing temporal information into the clustering. Specifically, the description of a point $p(x, y, z)$ is extended with the time $t$ indicating the acquisition sequence of the point cloud, in addition to the 3D spatial coordinates $(x, y, z)$, giving $p = (x, y, z, t)$. The moving objects in consecutive frames are related by accumulating the points in the spatial dimensions during clustering and utilizing the temporal information saved in the t-channel. Since the density shifts between time frames and the exact offset vector is unknown, the density core of the initial frame is carried into the current frame to compensate for the density offset in the temporal dimension. By accumulating the points over time, the individual moving objects can be obtained and associated over time after the clustering operation. In this way, the point cloud cluster of each moving object over multiple frames can be identified as in Equation (10),

$$C_N = \left\{ \bigcup_{i=1}^{n} P_i^t \;\middle|\; \left\| P_i^t - P_{center} \right\|_2 < \epsilon, \; P_i^t \in \mathbb{R}^{3 \times n} \right\}_{t=1}^{T} \quad (10)$$
where $P_{center}$ is the core point of DBSCAN and $\epsilon$ is a hyper-parameter.
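The paper realizes this association by extending each point with a time channel and propagating the DBSCAN density core across frames; since the exact procedure is not spelled out, the sketch below approximates it with per-frame DBSCAN followed by nearest-center association between frames (a full implementation would enforce one-to-one matching). The radius $\epsilon$ and the other parameter values below are placeholders; the choice of $\epsilon$ is discussed in Section 4.4.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def associate_moving_objects(frames, eps=0.2, min_samples=5, max_center_shift=1.5):
    """Cluster moving points frame by frame and associate clusters over time via their centers.

    frames           : list of (N_i, 3) ego-motion-compensated moving-point arrays, one per time step
    eps              : DBSCAN radius (placeholder; see Section 4.4)
    min_samples      : DBSCAN core-point threshold (an assumption, not reported in the paper)
    max_center_shift : maximum center displacement accepted between consecutive frames
                       (an assumption standing in for the paper's core propagation)
    Returns {track id: [(t, points), ...]} grouping the same moving object across frames.
    """
    tracks, centers, next_id = {}, {}, 0
    for t, pts in enumerate(frames):
        if len(pts) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
        for lab in set(labels) - {-1}:                  # -1 marks DBSCAN noise points
            seg = pts[labels == lab]
            c = seg.mean(axis=0)
            # Greedy association with the nearest previously seen cluster center.
            cands = [(np.linalg.norm(c - pc), tid) for tid, pc in centers.items()]
            dist, tid = min(cands, default=(np.inf, None))
            if tid is None or dist > max_center_shift:
                tid, next_id = next_id, next_id + 1     # start a new object track
            tracks.setdefault(tid, []).append((t, seg))
            centers[tid] = c                            # the center acts as the propagated core
    return tracks
```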

3.4.2. Motion Vector Estimation

After spatio-temporal clustering, the points of the same moving object within a frame and across different frames are aggregated into one class. Due to the sparse sampling of the point cloud, estimating the motion vector of each individual point would have a large bias. Therefore, based on the rigid-body assumption for moving objects [27], the motion vector of the whole object is estimated directly. Because the time interval between neighboring frames is short, key frames separated by $s$ frames are used for motion vector estimation. Estimating the motion state of a single object is equivalent to estimating the rotation $R$ and translation $T$ between the object points in different frames, and can thus also be achieved by point cloud matching. In this paper, we use the ICP method to align the point clouds of a moving object at different times to obtain $R$ and $T$. The motion vector $V$ of the object is then obtained from the core point (the clustering center) of the previous section as $V = P_{center} - (R \cdot P_{center} + T)$. To obtain a more stable motion state, this paper combines the motion vector $V_k$ between key frames with the accumulated motion vector $V_s$ of all frames within a key-frame interval to estimate the final key-frame motion vector $V$, as given in Equation (11). The schematic diagram is shown in Figure 8.
$$V = \omega_s V_s + \omega_k V_k \quad (11)$$
where $\omega_s$ and $\omega_k$ are weights, and $V_s$ is the accumulated motion vector of adjacent frames between two key frames.
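A sketch of the per-object estimation is given below: a few nearest-neighbor ICP iterations recover $(R, T)$ between two point sets of the same object, the displacement of the segment centroid (standing in for the cluster core) gives the motion vector, and Equation (11) combines the key-frame vector with the accumulated per-frame vectors. The weights, the iteration count, and the sign convention (mapping the earlier frame onto the later one) are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

def rigid_fit(src, dst):
    """Least-squares rigid transform (R, T) mapping paired src points onto dst (Kabsch/SVD)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def icp_object(src, dst, iters=20):
    """A few nearest-neighbor ICP iterations for one rigid moving object (no convergence test)."""
    R, T = np.eye(3), np.zeros(3)
    tree = cKDTree(dst)
    for _ in range(iters):
        _, idx = tree.query(src @ R.T + T)
        R, T = rigid_fit(src, dst[idx])
    return R, T

def keyframe_motion_vector(object_frames, w_s=0.5, w_k=0.5):
    """Combine the key-frame vector V_k and the accumulated vector V_s as in Equation (11).

    object_frames : list of (N_i, 3) point sets of one object, ordered in time; the first
                    and last entries are the two key frames
    w_s, w_k      : weights of Equation (11) (0.5/0.5 is an assumption, not reported)
    """
    def displacement(a, b):
        R, T = icp_object(a, b)
        c = a.mean(axis=0)              # centroid stands in for the cluster core of Section 3.4.1
        return (R @ c + T) - c          # motion of the core from the earlier to the later frame

    v_k = displacement(object_frames[0], object_frames[-1])
    v_s = sum(displacement(a, b) for a, b in zip(object_frames[:-1], object_frames[1:]))
    return w_s * v_s + w_k * v_k
```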

4. Experiments

In order to fully evaluate the performance of the proposed algorithm, we validate it experimentally on the KITTI [22], nuScenes [28], and SemanticKITTI [29] datasets as well as on actually collected data, and compare it with existing mainstream methods.

4.1. Experimental Settings

All experiments are evaluated on the same platform with an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), an Intel Core i9 processor (Intel Corporation, Santa Clara, CA, USA), and 64 GB of RAM. The server runs Ubuntu 18.04 LTS with a Python 3.8 environment, and CUDA, cuDNN, and PyTorch are installed to accelerate training and inference on the GPU. In addition, software tools such as PCL for processing 3D point cloud data and TensorBoard for visualizing the training process are required; the specific experimental configuration is shown in Table 1.

4.2. Ego-Motion Estimation

For ego-motion estimation, the matching result of VGICP is verified experimentally on SemanticKITTI, and the matched pose matrices are compared with the ground truth poses provided by the dataset. For the comparison, each pose matrix is decomposed into a position part and an attitude part, where the attitude is expressed through Euler angles; the deviations of the VGICP displacements and attitude angles from the ground truth are given in Table 2 and Table 3. The ego-motion estimates computed by VGICP deviate from the ground truth by less than 0.02 m on each of the X, Y, and Z axes, which is a relatively small difference, and the standard deviations are within 0.015 m. For the attitude angles, the largest deviation is in pitch, which is still within 0.0018 rad, or about 0.1°.
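For reference, this decomposition can be reproduced in a few lines; the Euler convention ('zyx', i.e., yaw, pitch, roll) is an assumption, since the paper does not state which convention is used.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_errors(estimated_pose, gt_pose):
    """Per-axis displacement error (m) and Euler-angle error (rad) between two 4x4 poses."""
    rel = np.linalg.inv(gt_pose) @ estimated_pose                      # residual transform
    displacement = np.abs(rel[:3, 3])                                  # X, Y, Z position deviation
    euler = np.abs(Rotation.from_matrix(rel[:3, :3]).as_euler('zyx'))  # yaw, pitch, roll
    return displacement, euler

# Usage: compare an estimate against ground truth (identity here, for illustration).
if __name__ == "__main__":
    gt = np.eye(4)
    est = np.eye(4)
    est[:3, 3] = [0.01, 0.01, 0.02]                                    # small translation offset
    print(pose_errors(est, gt))
```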
The comparison between VGICP and other point cloud matching methods (NDT, ICP, and GICP) is shown in Table 4. The fitness score is an evaluation metric of the matching error [24], and ↓ indicates that smaller values are better. Compared with ICP, NDT, and GICP, the fitness score of VGICP is comparable and its displacement and angular errors are at the same level, while VGICP has the fastest processing time, especially when multi-threading is used.

4.3. Moving Object Segmentation

To validate the comprehensive performance of motion segmentation, this section compares our method with current mainstream 3D motion segmentation methods on the SemanticKITTI dataset, as shown in Table 5. The evaluation considers only the IoU (Intersection over Union) of the moving objects; the Structure column indicates the representation of the point cloud, and RV denotes the range view image.
The point-based methods PointNet and EmPointMovSeg have a lower $IoU_{MOS}$ than our method in motion segmentation. The voxel-based method Cylinder has a higher $IoU_{MOS}$ than the point-based methods but no clear advantage in running speed. Range-view image methods are usually faster. Considering that the scanning rate of Lidar in real scenarios is generally 10 Hz, the RV structure has a clear advantage among these models, since it exceeds the Lidar scanning rate and can operate in real time. LMNet achieves the highest FPS among the RV methods. The accuracy of the RV methods improves when the number of neighboring frames increases from 1 to 9, but the FPS drops significantly: the 10-frame LMNet has an 8% improvement in $IoU_{MOS}$ over the 2-frame version, but a 73.9% reduction in speed. MotionSeg3D achieves the highest $IoU_{MOS}$, but its FPS is only 8.55. Our method achieves the third-highest $IoU_{MOS}$ and the second-fastest speed. Thus, although MotionSeg3D reaches a higher $IoU_{MOS}$, this comes at the cost of a significant speed reduction, while our method maintains a fast speed and performs better than LMNet at a comparable speed.

4.4. Motion Vector Estimation

Because the SemanticKITTI dataset contains only semantic labels, pose information, and sensor calibration information, the ground truth motion vectors of moving objects cannot be obtained directly from it. In this paper, the KITTI tracking dataset is used to construct a motion vector evaluation dataset that better reflects the motion of point cloud instances, and the nuScenes dataset is also tested.
Specifically, to obtain the ground truth of the instance motion, the ground point cloud is removed by a predetermined height threshold, and the point cloud within the field of view of the front camera is extracted. The motion vector of each moving instance is measured from the instance's tracking box, and instances whose motion vectors are below a certain threshold are marked as static. The ground truth motion vector for each instance is set as
$$V_{GT} = \left\{ T \left( P_i^{t+1} - P_i^t \right) \right\}_{i=1}^{I} \quad (12)$$

where $T$ represents the time offset between two adjacent frames and $P_i^t$ represents the position of the center point of the $i$-th instance at time $t$.
In the experiments, the key frames are set five frames apart. To accurately segment instances from the moving point cloud, the choice of $\epsilon$ is essential in principle. In practice, however, different instances are usually far apart, while the positions of the same instance at different times are close to each other; we found that $\epsilon$ values from 0.1 m to 0.4 m achieve accurate instance segmentation. The segmentation accuracy reaches 100% on all 10 randomly selected datasets, and $\epsilon$ is set to 0.2 m in this paper. The errors of the estimated vectors on the three axes for the KITTI and nuScenes datasets are listed in Table 6. The errors on the X-axis and Y-axis are between 0.01 m and 0.02 m, whereas the error on the Z-axis is between 0.02 m and 0.04 m. The main reason is that the Z-axis is the vertical direction, and the matching result is less accurate there due to the low vertical resolution of the Lidar.
By combining the segmentation labels and the estimated motion vectors, the prediction results can be converted into typical scene flow results. Table 7 compares the proposed method with classical scene flow methods on the scene flow evaluation metric EPE3D [19]. EPE3D is the average 3D end-point error, that is, the average Euclidean distance between all estimated motion vectors and the ground truth motion vectors.
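A minimal sketch of this conversion and of the EPE3D metric is shown below. How the per-object vectors are spread over the points for Table 7 is not spelled out in the paper, so assigning each moving point its object's vector and zero flow to static points is an assumption.

```python
import numpy as np

def segmentation_to_scene_flow(points, moving_mask, object_ids, object_vectors):
    """Expand per-object motion vectors into a dense (N, 3) scene-flow field.

    points         : (N, 3) source-frame point cloud
    moving_mask    : (N,) boolean mask of points predicted as moving
    object_ids     : (N,) cluster id per point (meaningful only where moving_mask is True)
    object_vectors : {cluster id: (3,) estimated motion vector}
    Static points receive a zero flow vector.
    """
    flow = np.zeros_like(points, dtype=np.float64)
    for oid, vec in object_vectors.items():
        flow[moving_mask & (object_ids == oid)] = vec
    return flow

def epe3d(pred_flow, gt_flow):
    """Average 3D end-point error: mean Euclidean distance between predicted and ground-truth vectors."""
    return float(np.linalg.norm(pred_flow - gt_flow, axis=1).mean())
```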
As Table 7 shows, the proposed method has clear advantages in EPE3D error and running time compared with current scene flow algorithms, and its accuracy improves by 31% compared with the FLOT network. At the same time, traditional scene flow algorithms are time-consuming and cannot keep up with the 10 Hz scanning rate of the Lidar, which requires the processing time to be below 100 ms. Most scene flow algorithms cannot meet this requirement, whereas our fast motion segmentation takes only 18.27 ms on average and the motion vector estimation takes 60.24 ms on average, giving the proposed method a clear advantage in running time.
The results of the motion vector estimation are illustrated in Figure 9, where the blue points represent the moving objects and the red arrows denote both the direction and magnitude of motion, with longer arrows indicating larger displacements. The effect of transforming the source object into the reference frame using the estimated motion vectors is depicted in Figure 10: the red points correspond to the source frame, the green points to the reference frame, and the blue points to the source frame shifted by the predicted motion vector. The significant overlap between the green and blue points indicates that the estimated motion vector accurately captures the object's motion state.

4.5. Validation on Actual Collected Data

Finally, we tested and validated the proposed method on data collected on our campus, as shown in Figure 11, where each set of images shows, from left to right, the RGB image captured by the ZED camera, the original point cloud, and the motion segmentation result. The red boxes mark the corresponding moving object in the RGB image, the original point cloud, and the segmentation result, and the blue boxes mark a stationary object. In the campus road scene, our method accurately predicts the moving objects while filtering out the stationary ones, which facilitates the subsequent motion vector estimation; this demonstrates the feasibility of our method for motion segmentation.
For motion vector estimation, every 10th frame is used as a key frame, and the displacement vector between key frames is estimated; the result is shown in Figure 12. The red points indicate the predicted moving object in the scene, the green points its position 10 frames later, and the blue points its position in the original frame shifted by the predicted motion vector. The relatively high overlap between the blue and green points demonstrates the feasibility of the proposed motion vector estimation. To quantify the motion vector errors, all moving objects in the source frame, after adding the estimated motion vectors, are matched with the corresponding objects in the reference frame; the resulting transformation matrix is decomposed into X, Y, Z, Yaw, Roll, and Pitch components and averaged as the estimated motion vector deviation, with the results shown in Table 8. The proposed method achieves adequate performance, and the moving objects of the source and target frames show no large deviations after the motion vectors are applied. In Scene II, the car is turning, and the proposed method has a slightly larger error in this case than in the other two scenes.
Overall, the method proposed in this paper can quickly extract moving objects and estimate their motion vectors. Compared with typical motion segmentation methods, it maintains high efficiency while keeping sufficient accuracy to support the subsequent motion vector estimation. Estimating object motion vectors in two stages is faster than estimating the scene flow directly and still guarantees sufficient accuracy.

5. Conclusions

In this paper, we propose a fast motion vector estimation method for platforms with ego-motion based only on point clouds. After ego-motion compensation through point cloud matching, a point cloud motion segmentation model is designed to segment moving objects; the segmented objects are then associated through spatio-temporal clustering, and their motion vectors are estimated by point cloud matching. In experiments on the KITTI and nuScenes datasets, the proposed method accurately compensates for the ego-motion and achieves an average motion vector deviation of only 0.036 m on KITTI and 0.043 m on nuScenes with a processing time of about 80 ms; the EPE3D error on the KITTI data is only 0.076 m, which demonstrates the effectiveness of the method. Good results are also achieved on the actually collected data. Although the proposed method obtains fast and accurate motion states of objects, the motion segmentation and motion vector estimation are performed sequentially, which lengthens the processing time; future work will therefore focus on combining the two steps to further reduce the processing time.

Author Contributions

Conceptualization, methodology, writing—review and editing, S.W.; implementation, validation and visualization, E.Z.; writing—original draft preparation, E.Z. and L.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China, under Grant No. 62103064; the Sichuan Science and Technology Program, under Grant No. 2024YFFK0442, No. 2023YFG0196, No. 2022YFNO020, No. 2023YFN0077, No. 2023JDZH0023, No. 2023YFG0045, and No. 2023NSFSC1985; the Opening Project of Unmanned System Intelligent Perception Control Technology Engineering Laboratory of Sichuan Province, under Grant No. WRXT2020-005; the Key Laboratory of Lidar and Device, P. R. China, under Grant No. LLD2023-411010; and the Scientific Research Foundation of CUIT, under Grant No. KYTZ202109.

Data Availability Statement

Dataset available on request from the authors.

Acknowledgments

The authors would like to thank Tao Jiang for useful suggestions and Yuyong Cui for funding support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, L.; Wu, P.; Chitta, K.; Jaeger, B.; Geiger, A.; Li, H. End-to-end autonomous driving: Challenges and frontiers. IEEE Trans. Pattern Anal. Mach. Intell. 2024. [Google Scholar] [CrossRef] [PubMed]
  2. Qian, R.; Lai, X.; Li, X. 3D object detection for autonomous driving: A survey. Pattern Recognit. 2022, 130, 108796. [Google Scholar] [CrossRef]
  3. Mao, J.; Shi, S.; Wang, X.; Li, H. 3D object detection for autonomous driving: A comprehensive survey. Int. J. Comput. Vis. 2023, 131, 1909–1963. [Google Scholar] [CrossRef]
  4. Chen, X.; Li, S.; Mersch, B.; Wiesmann, L.; Gall, J.; Behley, J.; Stachniss, C. Moving object segmentation in 3D LiDAR data: A learning-based approach exploiting sequential data. IEEE Robot. Autom. Lett. 2021, 6, 6529–6536. [Google Scholar] [CrossRef]
  5. Li, Z.; Xiang, N.; Chen, H.; Zhang, J.; Yang, X. Deep learning for scene flow estimation on point clouds: A survey and prospective trends. Comput. Graph. Forum 2023, 42, e14795. [Google Scholar] [CrossRef]
  6. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  7. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5099–5108. [Google Scholar]
  8. Wang, L.; Huang, Y.; Hou, Y.; Zhang, S.; Shan, J. Graph attention convolution for point cloud semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10296–10305. [Google Scholar]
  9. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. RandLA-Net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
  10. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  11. Tchapmi, L.; Choy, C.; Armeni, I.; Gwak, J.; Savarese, S. Segcloud: Semantic segmentation of 3d point clouds. In Proceedings of the 2017 International Conference on 3d Vision (3DV), Qingdao, China, 10–12 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 537–547. [Google Scholar]
  12. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  13. Wu, B.; Wan, A.; Yue, X.; Keutzer, K. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1887–1893. [Google Scholar]
  14. Milioto, A.; Vizzo, I.; Behley, J.; Stachniss, C. Rangenet++: Fast and accurate lidar semantic segmentation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4213–4220. [Google Scholar]
  15. Cortinhal, T.; Tzelepis, G.; Erdal Aksoy, E. SalsaNext: Fast, uncertainty-aware semantic segmentation of lidar point clouds. In Proceedings of the Advances in Visual Computing: 15th International Symposium, ISVC 2020, San Diego, CA, USA, 5–7 October 2020; Proceedings, Part II 15. Springer: Cham, Switzerland, 2020; pp. 207–222. [Google Scholar]
  16. Sun, J.; Dai, Y.; Zhang, X.; Xu, J.; Ai, R.; Gu, W.; Chen, X. Efficient spatial-temporal information fusion for lidar-based 3d moving object segmentation. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 11456–11463. [Google Scholar]
  17. Dewan, A.; Caselitz, T.; Tipaldi, G.D.; Burgard, W. Rigid scene flow for 3d lidar scans. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1765–1770. [Google Scholar]
  18. Liu, X.; Qi, C.R.; Guibas, L.J. FlowNet3D: Learning scene flow in 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 529–537. [Google Scholar]
  19. Gu, X.; Wang, Y.; Wu, C.; Lee, Y.J.; Wang, P. HPLFlowNet: Hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3254–3263. [Google Scholar]
  20. Puy, G.; Boulch, A.; Marlet, R. FLOT: Scene flow on point clouds guided by optimal transport. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 527–544. [Google Scholar]
  21. Ding, L.; Dong, S.; Xu, T.; Xu, X.; Wang, J.; Li, J. FH-Net: A fast hierarchical network for scene flow estimation on real-world point clouds. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2022; pp. 213–229. [Google Scholar]
  22. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 3354–3361. [Google Scholar]
  23. Wu, S.; Ren, L.; Du, B.; Yuan, J. Robust ground segmentation in 3d point cloud for autonomous driving vehicles. In Proceedings of the 2023 2nd International Conference on Robotics, Artificial Intelligence and Intelligent Control (RAIIC), Mianyang, China, 11–13 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 109–114. [Google Scholar]
  24. Wang, F.; Zhao, Z. A survey of iterative closest point algorithm. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 4395–4399. [Google Scholar]
  25. Koide, K.; Yokozuka, M.; Oishi, S.; Banno, A. Voxelized GICP for fast and accurate 3D point cloud registration. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 11054–11059. [Google Scholar]
  26. Khan, K.; Rehman, S.U.; Aziz, K.; Fong, S.; Sarasvady, S. DBSCAN: Past, present and future. In Proceedings of the Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014), Bangalore, India, 17–19 February 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 232–238. [Google Scholar]
  27. Gojcic, Z.; Litany, O.; Wieser, A.; Guibas, L.J.; Birdal, T. Weakly supervised learning of rigid 3D scene flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5692–5703. [Google Scholar]
  28. Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
  29. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 15–20 June 2019; pp. 9297–9307. [Google Scholar]
  30. Li, S.; Chen, X.; Liu, Y.; Dai, D.; Stachniss, C.; Gall, J. Multi-scale interaction for real-time lidar data segmentation on an embedded platform. IEEE Robot. Autom. Lett. 2021, 7, 738–745. [Google Scholar] [CrossRef]
  31. He, Z.; Fan, X.; Peng, Y.; Shen, Z.; Jiao, J.; Liu, M. Empointmovseg: Sparse tensor-based moving-object segmentation in 3-d lidar point clouds for autonomous driving-embedded system. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2022, 42, 41–53. [Google Scholar] [CrossRef]
Figure 1. Overview of the motion state estimation method.
Figure 2. The structure of the motion segmentation model.
Figure 3. The diagram of the spherical projection.
Figure 4. The context module.
Figure 5. The meta-kernel convolution [16].
Figure 6. The comparison of the motion segmentation result on the range image and 3D coordinate.
Figure 7. The structure of the motion state estimation.
Figure 8. The diagram of the motion vector estimation.
Figure 9. The prediction of the motion vector.
Figure 10. Comparison of the source frame with the reference frame under the predicted motion vector.
Figure 11. Motion segmentation result of campus data.
Figure 12. Motion vector estimation result of campus data.
Table 1. The experimental setting.
CPU | GPU | CUDA | cuDNN | OS | Python | PyTorch
i9-10850K | RTX 3090 | 11.1 | 8.0.4 | Ubuntu 18.04 LTS | 3.8 | 1.10.0
Table 2. The error of position displacement between VGICP and the ground truth on SemanticKITTI.
Displacement Axis | Average Error (m) | Standard Deviation (m)
X | 0.0127 | 0.0099
Y | 0.0108 | 0.0146
Z | 0.0166 | 0.0149
Table 3. The error of pose angle between VGICP and the ground truth on SemanticKITTI.
Euler Angle | Average Error (rad) | Standard Deviation (rad)
Yaw | 0.000226 | 0.000734
Roll | 0.000686 | 0.000335
Pitch | 0.001800 | 0.000645
Table 4. The comparison of VGICP and other matching methods on SemanticKITTI.
Method | Time (ms) ↓ | Fitness Score ↓ | Displacement Error (m) ↓ | Angular Error (rad) ↓
ICP | 133.279 | 0.204890 | 0.0103 | 0.0016
NDT | 51.954 | 0.229616 | 0.1613 | 0.0057
GICP (Single thread) | 122.868 | 0.204376 | 0.0102 | 0.0014
GICP (Multi-thread) | 17.959 | 0.204384 | 0.0102 | 0.0014
VGICP (Single thread) | 93.106 | 0.205022 | 0.0134 | 0.0019
VGICP (Multi-thread) | 13.549 | 0.205022 | 0.0134 | 0.0019
Table 5. The comparison of moving object segmentation on SemanticKITTI.
Method | Structure | IoU_MOS | FPS | Achieve Lidar Scanning Speed
PointNet [6] | Point | 0.16 | 3.65 | ×
MINet [30] | RV | 0.19 | 10.84 | ✓
MINet/N = 10 | RV | 0.31 | 9.5 | ×
SalsaNext [15] | RV | 0.46 | 12.64 | ✓
SalsaNext/N = 10 | RV | 0.62 | 3.41 | ×
LMNet [4] | RV | 0.58 | 75.13 | ✓
LMNet/N = 10 | RV | 0.66 | 19.60 | ✓
EmPointMovSeg [31] | Point | 0.43 | 8.74 | ×
Cylinder | Voxel | 0.61 | 8.03 | ×
MotionSeg3D [16] | RV | 0.71 | 8.55 | ×
Ours | RV | 0.65 | 54.72 | ✓
Table 6. The errors of the motion vectors in the three axes.
Axis | KITTI (m) | nuScenes (m)
X | 0.0110389 | 0.0146704
Y | 0.0173906 | 0.0194629
Z | 0.0299285 | 0.0350560
Total | 0.0363620 | 0.0426959
Table 7. Comparison of motion vector estimation with current scene flow methods.
Method | EPE3D (m) ↓ | Time (ms)
FlowNet3D [18] | 0.173 | 434.12
PointPWC-Net | 0.165 | 552.26
FLOT [20] | 0.110 | 1111.12
Ours | 0.076 | 78.51 (18.27 + 60.24)
Table 8. The error of motion vector on our campus data.
Scene | X (m) | Y (m) | Z (m) | Yaw (rad) | Roll (rad) | Pitch (rad)
Scene I | 0.057 | 0.035 | 0.004 | 0.025 | 0.040 | 0.001
Scene II | 0.053 | 0.207 | 0.025 | 0.108 | 0.101 | 0.037
Scene III | 0.004 | 0.017 | 0.001 | 0.002 | 0.001 | 0.002
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wu, S.; Ren, L.; Zhu, E. Fast Motion State Estimation Based on Point Cloud by Combining Deep Learning and Spatio-Temporal Constraints. Appl. Sci. 2024, 14, 8969. https://doi.org/10.3390/app14198969

AMA Style

Wu S, Ren L, Zhu E. Fast Motion State Estimation Based on Point Cloud by Combining Deep Learning and Spatio-Temporal Constraints. Applied Sciences. 2024; 14(19):8969. https://doi.org/10.3390/app14198969

Chicago/Turabian Style

Wu, Sidong, Liuquan Ren, and Enzhi Zhu. 2024. "Fast Motion State Estimation Based on Point Cloud by Combining Deep Learning and Spatio-Temporal Constraints" Applied Sciences 14, no. 19: 8969. https://doi.org/10.3390/app14198969


