1. Introduction
Point cloud registration (PCReg) refers to the problem of finding the rigid transformation that maximizes the overlap between similar sections of two or more point clouds. As a fundamental technique in 3D data processing, it is employed in many fields including computer vision, robotics, medical image analysis and computer-assisted surgery.
Researchers have proposed many methods [1,2,3,4,5] to address the PCReg problem; however, most of them are prone to converging to local optima. With the advent of deep neural networks (DNNs), it has been shown [6,7,8] that DNN-based PCReg methods can achieve higher accuracy and greater robustness to inaccurate initial transformations than traditional methods. A learning-based PCReg method processes unordered point clouds and extracts features through a deep learning network [9,10,11]; the similarity of these features is then used to calculate the transformation. However, most of these methods cannot cope with large transformations [7,12]; specifically, they achieve high accuracy only when the rotation and translation are limited to [−45°, 45°] and [−0.5 m, 0.5 m], respectively. Most researchers directly use the 3D coordinates of points as inputs for feature extraction, but 3D coordinates are sensitive to rigid transformation: the same point yields different features after being transformed. Since such inputs are not robust to transformation, they cannot serve as stable inputs for DNNs, and the networks cannot learn transform-invariant features [13,14].
In this work, we propose TIF-Reg, a novel PCReg method built on rigid transform-invariant features. By constructing features that are invariant to rigid transformation, TIF-Reg overcomes the limitations of existing learning-based methods and enables accurate registration under large rigid transformations.
Figure 1 shows a registration result of TIF-Reg. The method consists of four modules: the transform-invariant feature extraction module, the deep feature embedding module, the corresponding point generation module and the decoupled SVD module. The transform-invariant feature extraction module constructs transform-invariant features (TIF) based on the spatial structure of the point cloud. The deep feature embedding module embeds the TIF into a high-dimensional space, leveraging mini-DGCNN to improve the expressivity of the features. The corresponding point generation module generates the corresponding points of the input clouds through an attention-based module. The decoupled SVD module calculates the transformation using SVD. We test our method on ModelNet40 and TUM3D under various settings and compare it with traditional and learning-based methods, demonstrating the superior performance of the proposed method in terms of accuracy and complexity.
The key contributions of this work are summarized as follows:
We propose leveraging transform-invariant features for the PCReg problem and evaluate their expressivity;
We propose a novel PCReg method that is robust to large rigid transformations between source and target clouds;
We evaluate the performance of our method under several settings, demonstrating the effectiveness of the proposed method.
3. TIF-Reg Algorithm
The architecture of the proposed TIF-Reg is shown in Figure 2. The input includes the source point cloud X (blue points) and the target point cloud Y (red points). First, we extract TIF from the input and map the TIF into a high-dimensional space via a DNN. Then, we generate the corresponding points using an attention mechanism. Lastly, we calculate the transformation using a decoupled SVD.
3.1. Transform-Invariant Feature Extraction
TIF are point cloud features that are invariant under rigid transformations of the point cloud, including rotations and translations.
As shown in Figure 3, consider a point cloud with N points, $X = \{x_1, x_2, \ldots, x_N\}$. For each $x_i \in X$, we construct the neighborhood set of $x_i$, denoted $\mathcal{N}_i = \{x_{i1}, x_{i2}, \ldots, x_{iK}\}$, through the k-nearest neighbors algorithm (k-NN). Hence, there are N neighborhoods in X, and each neighborhood contains K points. Each point in X is described by four Euclidean distances, and we define the TIF of $x_i$ as

$$\left( \|x_i - \bar{x}\|,\ \|x_i - \bar{x}_i\|,\ \|\bar{x} - \bar{x}_i\| \right) \quad (1)$$

and $\|x_i - x_{iK}\|$, where $\bar{x}$ is the center of X, $\bar{x}_i$ is the center of $\mathcal{N}_i$, and $x_{iK}$ is the last (farthest) point in $\mathcal{N}_i$. The three distances in (1) form a triangular structure, which we call a triangular feature. $\|x_i - x_{iK}\|$ describes the density of the k-NN neighborhood to some degree and is called the local density feature. The triangular feature and local density feature represent the relative position characteristics between the points and the local distribution characteristics of the point cloud. Unlike 3D coordinates, the TIF remain stable when point clouds are transformed; therefore, they are more suitable for PCReg problems than 3D coordinates. In Figure 2, the input is an $N \times 3$ tensor representing the 3D coordinates of the point cloud. After TIF extraction, the point cloud is represented as an $N \times K \times 4$ tensor, where K refers to the points from k-NN.
3.1.1. Triangular Feature
Since point clouds are sets of points without any specific order, an input cloud with N points has $N!$ possible permutations, making it difficult to index the position of a specific point [11]. However, the relative distance between points is invariant. To ensure the invariance of TIF to rigid transformation, we seek out two points whose relative positions are fixed (we regard them as indexable points) and define the Euclidean distances to the indexable points as the descriptor of a point.
Firstly, it is easy to see that the shape distribution of X does not change after rigid transformation, meaning the relative position of the center $\bar{x}$ in a point cloud remains the same after transformation. Note that we are focused on relative positions, not coordinates. Moreover, since the relative distances of points are not affected by transformations, for each $x_i$, the k-NN neighborhood $\mathcal{N}_i$ is constant under transformations as well. That is, for each $x_i$, its neighborhood set center $\bar{x}_i$ remains stable during transformation. According to the above analyses, we assign each $x_i$ two indexable points: $\bar{x}$ and $\bar{x}_i$. Connecting the three points together, we obtain the triangular feature of $x_i$, represented as the side lengths of the triangle $\triangle\, x_i \bar{x} \bar{x}_i$, as shown in (1).
The side $\|\bar{x} - \bar{x}_i\|$ may seem to offer nothing in terms of improving the representation ability of the triangular feature; however, the full triangular feature can be proved to be more effective than considering only $\|x_i - \bar{x}\|$ and $\|x_i - \bar{x}_i\|$. For example, consider two points $x_i$ and $x_j$ in Figure 4 with neighborhoods $\mathcal{N}_i$ and $\mathcal{N}_j$. If we only consider the two sides $\|x_i - \bar{x}\|$ and $\|x_i - \bar{x}_i\|$, then the features of $x_i$ and $x_j$ are identical whenever $\|x_i - \bar{x}\| = \|x_j - \bar{x}\|$ and $\|x_i - \bar{x}_i\| = \|x_j - \bar{x}_j\|$, leading to weak uniqueness of the features. What makes the situation worse is that innumerable points with the same feature can be found on the sphere with $\bar{x}$ as center and $\|x_i - \bar{x}\|$ as radius. Therefore, $\|\bar{x} - \bar{x}_i\|$ is necessary to describe the global characteristics of point clouds and is helpful for distinguishing different k-NN neighborhoods.
3.1.2. Local Density Feature
Although we have built a triangular feature for each point, this is still insufficient for effective uniqueness in a 3D point cloud. For example, in Figure 4, if we rotate a point $x_i$ about the line through $\bar{x}$ and $\bar{x}_i$ as the axis, we obtain a circle, and each point on this circle has the same triangular feature as $x_i$.
To overcome this issue, we take inspiration from NDT [1]. The distribution of a point cloud is not affected by reordering its points and remains unchanged under rigid transformation. NDT places the point cloud on a grid, calculates the probability distribution function of a point being observed in a particular grid cell, and then performs registration using the likelihood function of the point cloud distribution. Similarly, in this work, we directly construct the local density feature of the point cloud with the k-NN neighborhood as the unit. To prevent either feature from being concealed by a magnitude difference between the triangular feature and the local density feature, we express the local density feature in terms of Euclidean distance as well. Since the number of points in $\mathcal{N}_i$ is fixed, the radius of $\mathcal{N}_i$ is an indicator of the density of the local point cloud: generally, the sparser the point cloud, the larger the radius; the denser the point cloud, the smaller the radius. Therefore, the radius of $\mathcal{N}_i$, i.e., $\|x_i - x_{iK}\|$, which combines naturally with the triangular feature, is used in this paper to describe the local density feature of $x_i$.
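To make the construction concrete, below is a minimal NumPy sketch of TIF extraction as reconstructed above (one 4D descriptor per point). The function name, the brute-force k-NN and the default K are our illustrative choices, not the authors' implementation:

```python
import numpy as np

def extract_tif(X: np.ndarray, K: int = 20) -> np.ndarray:
    """X: (N, 3) point cloud. Returns an (N, 4) array of TIF descriptors."""
    x_bar = X.mean(axis=0)                          # global center of X
    # Brute-force k-NN (O(N^2) memory; adequate for clouds of ~1-2k points).
    dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    knn_idx = np.argsort(dists, axis=1)[:, :K]      # the point itself is neighbor 0
    neigh = X[knn_idx]                              # (N, K, 3) neighborhoods
    x_bar_i = neigh.mean(axis=1)                    # neighborhood centers
    x_iK = neigh[:, -1, :]                          # farthest of the K neighbors
    d1 = np.linalg.norm(X - x_bar, axis=1)          # ||x_i - x_bar||
    d2 = np.linalg.norm(X - x_bar_i, axis=1)        # ||x_i - x_bar_i||
    d3 = np.linalg.norm(x_bar_i - x_bar, axis=1)    # ||x_bar - x_bar_i||
    d4 = np.linalg.norm(X - x_iK, axis=1)           # k-NN radius (local density)
    return np.stack([d1, d2, d3, d4], axis=1)
```

All four components are Euclidean distances, so they are unchanged by any rigid transformation of X.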
3.2. Deep Feature Embedding
In Section 3.1, the original 3D points were transformed into 4D TIF features. In this section, we embed the TIF features into a high-dimensional space via a deep neural network to strengthen the representation ability of the feature descriptors. Mini-DGCNN, a simplified version of DGCNN, is used here. DGCNN uses a dynamic graph structure: it constructs a local k-NN graph for each $x_i$ and pools the features of the points in $\mathcal{N}_i$ together using a max-pooling layer. In this work, mini-DGCNN utilizes only a static graph, which reduces the network complexity while still achieving the same registration performance. As shown in Figure 2, the Deep Feature Embedding (DFE) layer transforms the $N \times K \times 4$ TIF tensor into an $N \times 320$ tensor.
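A hedged PyTorch sketch of this embedding is given below. Only the filter widths (64, 64, 64, 128 and 320; see Section 4) come from the paper; the exact wiring of the original mini-DGCNN may differ:

```python
import torch
import torch.nn as nn

class MiniDGCNN(nn.Module):
    """Embeds a (B, N, K, 4) TIF tensor into (B, N, 320) deep features."""
    def __init__(self, in_dim: int = 4, widths=(64, 64, 64, 128, 320)):
        super().__init__()
        layers, prev = [], in_dim
        for w in widths:                    # shared 1x1 convs = per-neighbor MLP
            layers += [nn.Conv2d(prev, w, 1), nn.BatchNorm2d(w), nn.ReLU()]
            prev = w
        self.mlp = nn.Sequential(*layers)

    def forward(self, tif: torch.Tensor) -> torch.Tensor:
        x = self.mlp(tif.permute(0, 3, 1, 2))        # (B, 320, N, K)
        return x.max(dim=3).values.permute(0, 2, 1)  # max-pool over neighbors
```

The static k-NN graph is fixed by the TIF tensor itself, so no graph needs to be rebuilt between layers; this is where the complexity saving over dynamic DGCNN comes from.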
3.3. Corresponding Point Cloud Generation
A prominent part of the typical point cloud registration process is the construction of a matching between the points of the source and target point clouds. The ICP algorithm iteratively updates the transformation by minimizing the distance between corresponding points to gradually optimize their alignment. However, this approach is prone to stalling in local optima, which can lead to poor registration results. Inspired by the attention mechanism in [7,33], we propose a destination point cloud generation method based on point cloud similarity rather than a point-to-point mapping between the source and target point clouds.
The attention mechanism is derived from the study of human vision and is widely used in natural language processing (NLP) to handle sequence-to-sequence (seq2seq) problems such as machine translation and question answering. During observation, in order to efficiently distribute limited attention resources, humans tend to selectively focus on the more important data or regions of a subject and ignore the less useful noise. Similarly, in seq2seq problems, researchers use the attention mechanism to select the information that is critical to the task at hand from a large amount of input. In this paper, we regard PCReg as a seq2seq problem, with point clouds X and Y as the source and target sequences, respectively. The goal is to generate a destination point cloud Z that is as similar as possible to Y, together with a correspondence pairing each point in X with a point in Z. With this goal in mind, we apply the attention mechanism to generate Z.
The attention weight W is obtained from the similarity between the features of X and Y:

$$W = \operatorname{softmax}\!\left( F_X F_Y^{\top} \right) \quad (2)$$

where $F_X$ and $F_Y$ are the deep features obtained in Section 3.2 from X and Y, respectively. Then Z, the corresponding point cloud of X, can be generated from W and Y:

$$Z = W\, Y \quad (3)$$
For each $x_i \in X$, we generate its corresponding point $z_i$ using the similarity between the features of X and Y. This approach avoids constructing a direct matching of points between X and Y, since the rigid transformation is obtained with respect to X and Z instead of X and Y. Because Z has a one-to-one point correspondence with X, we can obtain the result in one shot, avoiding the local optima that can arise during iteration.
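In code, the generation step is a single soft-attention read; a minimal PyTorch sketch under the reconstruction of (2) and (3) above:

```python
import torch

def generate_corresponding_cloud(FX: torch.Tensor, FY: torch.Tensor,
                                 Y: torch.Tensor) -> torch.Tensor:
    """FX: (N, C), FY: (M, C) deep features; Y: (M, 3). Returns Z: (N, 3)."""
    W = torch.softmax(FX @ FY.T, dim=1)   # eq. (2): each row sums to 1
    return W @ Y                          # eq. (3): z_i is a weighted average of Y
```

Because each $z_i$ is a convex combination of the points of Y, Z inherits X's ordering, which is what allows the one-shot SVD solution that follows.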
3.4. Decoupled SVD
After obtaining the optimal destination point cloud, the final step is to calculate the relative transformation between it and the source. Multilayer perceptrons (MLP) and singular value decomposition (SVD) are commonly used to compute this result; in this work, we apply the latter, as it was proven more effective for registration than MLP in recent work [7]. More concretely, we aim to find the transformation $(R, t)$ between X and Y that minimizes the error E:

$$E(R, t) = \frac{1}{N} \sum_{i=1}^{N} \left\| R x_i + t - y_i \right\|^2 \quad (4)$$

Here, each $y_i$ is replaced by $z_i$, where Z was calculated in the last section to stand in for Y with a one-to-one mapping. The cross-covariance matrix H of X and Z is

$$H = \sum_{i=1}^{N} \tilde{x}_i \tilde{z}_i^{\top} \quad (5)$$

Here, $\tilde{x}_i = x_i - \bar{x}$ and $\tilde{z}_i = z_i - \bar{z}$, where $\bar{x}$ and $\bar{z}$ are the centers of X and Z, respectively. We define the centralized X and Z as $\tilde{X} = \{\tilde{x}_i\}_{i=1}^{N}$ and $\tilde{Z} = \{\tilde{z}_i\}_{i=1}^{N}$. Using SVD, the cross-covariance matrix H can be decomposed as

$$H = U S V^{\top} \quad (6)$$

We can then obtain the R and t that minimize (4) based on (6):

$$R = V U^{\top}, \qquad t = \bar{z} - R \bar{x} \quad (7)$$
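For reference, a compact NumPy sketch of the closed-form solution (4)–(7); this is essentially the standard Kabsch/Procrustes solution, and the determinant check guarding against reflections is a standard numerical safeguard that the paper does not spell out:

```python
import numpy as np

def solve_rigid_transform(X: np.ndarray, Z: np.ndarray):
    """X, Z: (N, 3) with row-wise correspondence. Returns R (3, 3), t (3,)."""
    x_bar, z_bar = X.mean(0), Z.mean(0)
    H = (X - x_bar).T @ (Z - z_bar)       # cross-covariance matrix, eq. (5)
    U, _, Vt = np.linalg.svd(H)           # H = U S V^T, eq. (6)
    R = Vt.T @ U.T                        # eq. (7)
    if np.linalg.det(R) < 0:              # reflection guard (standard Kabsch step)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, z_bar - R @ x_bar           # t, eq. (7)
```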
From the experimental results (see Table 3 and Table 4 in Section 4.2), we find that when using the original SVD, the proposed method maintains high accuracy when the rotation is within [−180°, 180°] and the translation is within [−20 m, 20 m]; however, accuracy gradually degrades under larger translations. To solve this issue, we decouple the calculation of translation and rotation by introducing a two-step method. The proposed method with the original SVD is dubbed TIF-Reg, and the proposed method with the decoupled SVD is dubbed TIF-Reg2. We discuss the details of TIF-Reg2 below.
In step 1, instead of X and Y, we use the centralized clouds $\tilde{X}$ and $\tilde{Y}$ as the inputs of the proposed method's attention mechanism to generate $\tilde{Z}$. According to (4), the rotation between X and Y can be calculated using only $\tilde{X}$ and $\tilde{Y}$; that is, the rotation has no relation to the translation. $\tilde{Z}$ coincides completely with $\tilde{Y}$ only when X has the same distribution as Y; otherwise, there is a translation $t_1$ between them, and the greater the difference between the distributions, the greater this translation. Generally, $t_1$ is much smaller than $t$, which avoids the previously mentioned effect of large translations on overall accuracy. In step 1, $R$ and $t_1$ are calculated by applying the SVD module of (5)–(7) to $\tilde{X}$ and $\tilde{Z}$. Here, since $\tilde{X}$ is centered at the origin, $t_1$ is the center of $\tilde{Z}$.
In step 2, we first note that $t$, the relative translation between X and Y, can be decomposed as

$$t = t_1 + t_2$$

where $t_1$ is as defined in step 1 and $t_2$ is the remainder of the final translation. To calculate $t_2$, we first transform X to $X'$ using the values obtained in step 1: $X' = R X + t_1$. We denote the center of $X'$ as $\bar{x}'$ and obtain $t_2 = \bar{y} - \bar{x}'$, completing our calculation of the translation $t = t_1 + t_2$ between X and Y.
In this section, we decomposed $t$ into $t_1$ and $t_2$ by centralizing the point clouds: $R$ and $t_1$ are calculated in step 1, and $t_2$ is calculated in step 2. This approach decouples rotation from translation and therefore increases the robustness of the proposed method to large translations.
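A hedged NumPy sketch of the full two-step procedure, as reconstructed above, is shown below. The callable `embed_and_attend` stands in for the modules of Sections 3.2 and 3.3 and is an assumption of this sketch, not an API from the paper:

```python
import numpy as np

def decoupled_registration(X: np.ndarray, Y: np.ndarray, embed_and_attend):
    """X: (N, 3), Y: (M, 3); embed_and_attend(A, B) -> cloud corresponding to A.
    Returns R (3, 3) and t (3,) aligning X toward Y."""
    x_bar, y_bar = X.mean(0), Y.mean(0)
    Xc, Yc = X - x_bar, Y - y_bar            # step 1: centralize both clouds
    Zc = embed_and_attend(Xc, Yc)            # (N, 3) generated destination cloud
    t1 = Zc.mean(0)                          # center of Zc = small residual t1
    H = Xc.T @ (Zc - t1)                     # cross-covariance, eq. (5)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # reflection guard
        Vt[-1] *= -1
        R = Vt.T @ U.T
    X_prime = X @ R.T + t1                   # step 2: apply the step-1 estimate
    t2 = y_bar - X_prime.mean(0)             # remainder of the translation
    return R, t1 + t2                        # t = t1 + t2
```

Note that $t_1 + t_2$ algebraically reduces to $\bar{y} - R\bar{x}$; the benefit of the two-step form is that the attention and SVD modules only ever see centered, small-magnitude coordinates.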
3.5. Loss Function
Considering the relationship between X and Y, we have

$$y_i^{gt} = R^{gt} x_i + t^{gt} \quad (8)$$

Due to the unordered nature of point clouds, the difference between $Y$ and $Z$ cannot be calculated directly, where $Y$ represents the ground truth (the actual target point cloud) and $Z$ represents the prediction (the destination point cloud obtained by the algorithm). Instead, we represent the difference using the loss function

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left\| y_i^{gt} - z_i \right\|^2 \quad (9)$$

where $y_i^{gt}$, defined in (8), shares the point ordering of X and can therefore be compared with $z_i$ index by index.
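Under our reconstruction of (9), the loss is a few lines of PyTorch; `tif_reg_loss` is our illustrative name:

```python
import torch

def tif_reg_loss(Z: torch.Tensor, X: torch.Tensor,
                 R_gt: torch.Tensor, t_gt: torch.Tensor) -> torch.Tensor:
    """Z, X: (N, 3); R_gt: (3, 3); t_gt: (3,). Mean squared point error, eq. (9)."""
    Y_gt = X @ R_gt.T + t_gt      # ground-truth destination, ordered like X
    return ((Y_gt - Z) ** 2).sum(dim=1).mean()
```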
4. Experiments
The proposed TIF-Reg method was evaluated on the ModelNet40 [29] and TUM 3D object (TUM3D) [37] datasets. All experiments were performed on a laptop with an Intel i7-8750 CPU, an Nvidia GTX 1060 GPU and 24 GB of RAM.
Implementation details of TIF-Reg: The architecture of TIF-Reg is shown in Figure 2. In the deep feature embedding module, EdgeConv layers from DGCNN [10] were used in mini-DGCNN, with 64, 64, 64, 128 and 320 filters in the successive layers. The optimizer was Adam with an initial learning rate of 0.0001, divided by 10 at epochs 40 and 60. Training ran for 80 epochs in total and took approximately 4 h on our hardware.
Baselines: We used five baselines: the traditional methods ICP [3], Go-ICP [22] and RANSAC+ICP, and the learning-based methods PointNetLK [6] and DCP-v2 [7] (referred to as DCP).
Evaluation metrics: We measured the root mean squared error (RMSE) and mean absolute error (MAE) between the ground truth and predicted values for both rotation (R) and translation (t), denoted RMSE(R), RMSE(t), MAE(R) and MAE(t), respectively. The rotation metrics are in degrees; the translation metrics are in meters.
ModelNet40 dataset: This dataset consists of 12,311 CAD models from 40 categories. We randomly sampled 2048 points from the mesh faces of each model and rescaled them into a unit sphere. In our experiments, we split each category randomly, obtaining 9843 models for training and 2468 models for testing. For each model, 1024 points from the outer surface were then uniformly sampled, and all of the points were centered and rescaled to fit in the unit sphere.
TUM3D: This dataset includes 20 CAD models from 20 different categories and is significantly different from ModelNet40. We used all of the 3D models for testing. For each model, 4096 points were uniformly sampled from the original CAD model, and all of the points were rescaled to fit in the unit sphere.
4.1. Train and Test on ModelNet40
Firstly, we trained the learning-based methods on the first 20 categories and tested all of the PCReg methods on the same 20 categories. We took the point cloud sampled from the CAD model as the target Y; X was obtained through an arbitrary transformation of Y, where the rotation was in the range [−45°, 45°] and the translation was in the range [−0.5 m, 0.5 m].
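For concreteness, a pair (X, Y) can be generated as in the sketch below; the per-axis Euler-angle sampling and the use of SciPy are our assumptions, as the paper does not specify its sampling scheme:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def make_pair(Y: np.ndarray, max_deg: float = 45.0, max_t: float = 0.5):
    """Y: (N, 3) target cloud. Returns the source cloud X and the applied (R, t)."""
    angles = np.random.uniform(-max_deg, max_deg, size=3)   # per-axis Euler angles
    R = Rotation.from_euler("xyz", angles, degrees=True).as_matrix()
    t = np.random.uniform(-max_t, max_t, size=3)
    return Y @ R.T + t, R, t                                # X = R Y + t
```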
Table 1 shows the results of this experiment. ICP had the largest errors (RMSE and MAE) in both rotation and translation, while the traditional Go-ICP algorithm achieved results similar to the deep-network-based PointNetLK. RANSAC+ICP ranked mid-range among all methods but was the best of the traditional methods. Both DCP and TIF-Reg had lower errors, but TIF-Reg performed best, outperforming the other methods by roughly an order of magnitude.
We tested the generalizability of the different methods using different categories for training and testing. The learning-based methods were trained on the first 20 categories and tested on the last 20 categories. The traditional methods, which do not require model training, were also tested on the last 20 categories.
As shown in Table 2, ICP still had the largest error, while Go-ICP and RANSAC+ICP had similar errors, with RANSAC+ICP achieving a much better result than in the previous experiment. TIF-Reg still exhibited the best performance among all methods. In this experiment, the accuracy of almost all methods declined to varying degrees, except for RANSAC+ICP and TIF-Reg. This shows that the deep learning-based methods DCP and PointNetLK generalized somewhat poorly to unseen categories, whereas our method was essentially unaffected by the change in data categories.
4.2. Robustness to Transformation (Rotation and Translation)
This experiment tested the robustness of TIF-Reg to transformation, which is essential for evaluating the effectiveness of a PCReg method. The experiment was divided into two steps: first, we kept the translation within [−0.5 m, 0.5 m] while gradually expanding the rotation from [−45°, 45°] to [−180°, 180°] to test robustness to rotation; then, we kept the rotation within [−180°, 180°] while gradually expanding the translation from [−2 m, 2 m] to [−20 m, 20 m] to test robustness to translation. In this section, the learning-based methods were trained on the first 20 categories and tested on the last 20 categories.
Table 3 shows the rotation robustness of all methods (see Table 2 for rotation within [−45°, 45°]), and Table 4 shows the translation robustness. According to Table 3 and Table 4, ICP, Go-ICP and PointNetLK almost failed under larger rotations and translations, and DCP was no longer valid under larger translations. The performance of RANSAC+ICP was much better than that of the above methods, but compared with the first two experiments, its error was still large under larger rotations and translations. Of all methods, TIF-Reg demonstrated the highest robustness to transformation throughout the experiment. As the angle of rotation increased, the accuracy of TIF-Reg decreased slightly, but it retained the lowest error and was the most stable.
4.3. Effectiveness of TIF
In this experiment, in order to verify the effectiveness of TIF, we compared the performance of the proposed method when using 3D coordinates, an incomplete TIF (only three of the four TIF components were selected) and the complete TIF. The training and test sets used here were the same as in Section 4.2, and the random transformation was within [−180°, 180°] and [−8 m, 8 m].
As shown in Table 5, the algorithm failed when using 3D coordinates. It performed well with an incomplete TIF, but the best results were obtained with the complete TIF. This demonstrates the effectiveness not only of TIF in the PCReg problem but also of each individual element of TIF in improving its representation ability.
4.4. Robustness to Large Translation
We had already tested the translation robustness of the proposed method in Section 4.2; in this experiment, we tested its performance under even larger translations. The dataset used here was the same as in Section 4.2, the rotation was within [−180°, 180°], and the translation was expanded from [−20 m, 20 m] to [−120 m, 120 m].
Figure 5 displays the errors of TIF-Reg and TIF-Reg2 under large translations. According to Figure 5a,c, as the translation increased, the rotation error of TIF-Reg increased significantly, while TIF-Reg2 maintained high precision. Figure 5b,d demonstrates that the translation errors of both TIF-Reg and TIF-Reg2 hardly increased, remaining below 0.01, with TIF-Reg's translation error slightly lower than that of TIF-Reg2. Taken together with the rotation results, this shows that the decoupled SVD module is superior to using SVD directly.
4.5. Generalization on New Test Set
In this experiment, in order to further test the generalization of the proposed method, we used the new TUM3D dataset. We randomly performed 36 transformations on each of the 20 sampled CAD models to produce 720 source point clouds for the test set. The settings here were the same as in Section 4.2 except for the test set.
The experimental results are shown in Table 6. There are two arguments, R and t, and Table 6 only shows t: the first line of Table 6 corresponds to R in the range [−45°, 45°], the second line to R in the range [−90°, 90°], and the remaining lines to R in the range [−180°, 180°]. The table shows that the TIF-Reg model trained on ModelNet40 was still able to maintain high accuracy on the TUM3D dataset and retained strong robustness to transformation as well.
4.6. Complexity
This experiment compared the complexity of the algorithms, including time complexity and model complexity. The complexity of an algorithm involves many factors, such as computation, real-time performance and hardware cost.
4.6.1. Time Complexity
We profiled the inference time of the different methods in this experiment. In order to make the comparison more comprehensive, we tested the time complexity with point clouds of different sizes; the inference time was measured in seconds. Note that Go-ICP was excluded from this experiment, as it took over 16 s, far exceeding the other methods.
As shown in Table 7, the time complexity of RANSAC+ICP was the highest, though it was less affected by the number of points than the other methods. The two deep learning-based methods, PointNetLK and DCP, were the most affected by point cloud size: as the number of points increased, so did their runtime. TIF-Reg showed the best real-time performance among the learning-based methods and was equivalent to ICP, the best of the baselines (the blue line of TIF-Reg covers the black line of ICP, and the "O" markers of ICP cover the "X" markers of TIF-Reg).
4.6.2. Model Complexity
Since the traditional methods (ICP, RANSAC+ICP, Go-ICP) have no learned models, only the learning-based methods (PointNetLK, DCP, TIF-Reg) were compared in this experiment. As shown in Table 8, the TIF-Reg model occupied the least space. This indicates that the computation involved in our method is the simplest of the three neural-network-based methods.
6. Conclusions
We have presented TIF-Reg, a novel point cloud registration approach based on transform-invariant features. By constructing transform-invariant features, the proposed method achieves high-precision registration when the rotation is within [−180°, 180°] and the translation is within [−20 m, 20 m]. Moreover, the proposed method is almost unaffected by translation due to the decoupling of translation and rotation in the SVD. Experiments have shown that TIF-Reg outperforms state-of-the-art methods in many aspects, including accuracy, robustness and complexity. The TIF itself can be easily integrated into other networks, giving it considerable potential in many applications. Finally, we believe that our work presents an important step forward for the community, as it affords an effective strategy for the point cloud registration framework as well as an innovation in deep feature extraction for deep learning networks.