Article

NrtNet: An Unsupervised Method for 3D Non-Rigid Point Cloud Registration Based on Transformer

Xiaobo Hu, Dejun Zhang, Jinzhi Chen, Yiqi Wu and Yilin Chen
1 School of Computer Science, China University of Geosciences, Wuhan 430074, China
2 Hubei Key Laboratory of Intelligent Robot (Wuhan Institute of Technology), Wuhan 430205, China
* Author to whom correspondence should be addressed.
Sensors 2022, 22(14), 5128; https://doi.org/10.3390/s22145128
Submission received: 9 May 2022 / Revised: 23 June 2022 / Accepted: 6 July 2022 / Published: 8 July 2022
(This article belongs to the Section Intelligent Sensors)

Abstract

Self-attention networks have revolutionized the field of natural language processing and have also made impressive progress in image analysis tasks. Corrnet3D proposed the idea of first obtaining the point cloud correspondence in point cloud registration. Inspired by these successes, we propose an unsupervised network for non-rigid point cloud registration, namely NrtNet, which is the first network to use a transformer for unsupervised large-deformation non-rigid point cloud registration. Specifically, NrtNet consists of a feature extraction module, a correspondence matrix generation module, and a reconstruction module. Given a pair of point clouds, our model first learns point-by-point features and feeds them to the transformer-based correspondence matrix generation module, which uses the transformer to learn the correspondence probability between pairs of point sets; the correspondence probability matrix is then normalized to obtain the correct point-set correspondence matrix. We then permute the point clouds and learn the relative drift of the point pairs to reconstruct the point clouds for registration. Extensive experiments on synthetic and real datasets of non-rigid 3D shapes show that NrtNet outperforms state-of-the-art methods, including methods that use grids as input and methods that directly compute point drift.

1. Introduction

3D objects offer great flexibility, and with the continuous development of 3D sensing technology in recent years, 3D point clouds have been widely used in fields such as virtual reality [1], autonomous driving [2], and augmented reality [3]. Since LiDAR-scanned point clouds do not correspond with each other, this greatly inconveniences downstream tasks such as point cloud classification [4,5], segmentation [6,7], registration [8,9], and reconstruction [10,11].
Non-rigid point cloud registration can be divided into similarity registration [12,13] and affine registration [14,15]. Similarity registration is mostly based on ICP and improves registration by changing the optimization objective function and adding correspondences, while affine registration ensures that the parallelism between lines remains unchanged during the transformation. Corrnet3D [16] proposes the idea of first finding the correspondence between point clouds and then reconstructing them, which inspired our work. However, most of these methods require large-scale labeled data, and producing labels costs considerable time and money, which has driven the development of unsupervised methods. In our work, we focus on unsupervised large-deformation non-rigid point cloud registration, which requires only 3D point cloud data as input.
Figure 1 illustrates our idea: if we can put two point clouds A and B into correspondence, the registration process between them becomes easy. We permute the point clouds with a transformer because transformers excel at handling correspondences in natural language [17,18]. We design a reconstruction module that reconstructs $A_{re\_order} \in \mathbb{R}^{n \times 3}$ to B, which is more meaningful than reconstructing directly from A to B.
Based on the above idea, we propose an unsupervised transformer-based registration network (NrtNet) for large-deformation non-rigid point clouds. We propose a transformer-based permutation process. Specifically, this process uses the encoder and decoder of the transformer to generate a point-set correspondence matrix, which represents the correspondence between the source point cloud A and the target point cloud B. During training, the global features of the target point cloud and the permuted source point cloud $A_{re\_order}$ are fed to the reconstruction module to obtain the reconstructed point cloud. The reconstruction module drives the learning of the correspondence matrix and the relative drift by optimizing the reconstruction error and additional regularization terms to achieve registration.
In general, our main contributions are:
  • We propose a transformer-based point cloud correspondence learning framework for learning dense correspondences between point clouds, and we are the first to introduce a transformer into the field of non-rigid point cloud registration.
  • Our network eliminates the reliance on ground truth and achieves unsupervised learning of non-rigid point cloud registration in an end-to-end manner, and has a better registration effect for different objects.
  • Experiments demonstrate that NrtNet has significant advantages in non-rigid point cloud registration. In particular, it is superior to methods that directly compute the drift of coherent points between point clouds and methods that use a grid as input.

2. Related Work

In this section, we introduce the application of point clouds in deep learning, the study of non-rigid point cloud registration, and the development of transformer-based deep learning.

2.1. Deep Learning on Point Cloud

Compared with well-developed image-based deep learning methods, point cloud-based deep learning methods are more challenging and still in a developing stage due to the irregularity and disorder of point clouds. Three-dimensional data can be represented in various forms, such as 2D multi-views, unstructured point clouds, and voxelized volumes. Voxelization methods convert 3D data into regular volume-occupancy voxels, resulting in structured volumes that are well suited for 3D CNNs. Early point cloud tasks used end-to-end 3D convolutional networks [19,20,21]. Because 3D data is sparse and 3D convolutions are expensive, voxelized representations are limited by resolution; the methods of [22,23] effectively address this resolution problem. Qi et al. [24] projected 3D data into multiple 2D views and processed them with popular 2D CNNs.
PointNet [25] learns features directly from the point cloud, maps the points to a higher-dimensional space before aggregation, and applies symmetric operations in that space. Mapping to higher dimensions generates redundant information, which the max operation can exploit to avoid loss of geometric information. PointNet uses only MLPs and max-pooling and cannot capture local structures, a shortcoming that PointNet++ [26] improves upon. DGCNN [27] designs an EdgeConv operator that efficiently extracts features of local shapes of point clouds while maintaining permutation invariance. Later, researchers investigated merging overall features with pointwise features [26] or more sophisticated RNN-based feature extraction [28,29]. MortonNet [30] extracts more effective features by learning an ordered sequence of point clouds. FoldingNet [31] learns to deform a predefined 2D regular grid into a 3D shape; AtlasNet [32] and 3D-CODED [33] are also deformation-based networks that use fixed template deformations to reconstruct the point cloud or mesh.

2.2. Non-Rigid Point Cloud Registration

The development of registration optimization algorithms has attracted the attention of many researchers; these algorithms refine the geometric transformation over iterations. The Iterative Closest Point (ICP) algorithm [34] is a classic example in rigid registration: ICP initializes an estimate of the rigid transformation and then iteratively selects corresponding points to revise it. However, ICP cannot handle non-rigid point cloud variations efficiently and is sensitive to initial values. Non-rigid point cloud registration can be divided into parametric and non-parametric registration according to the target transformation. Among parametric methods, the TPS-RSM algorithm [35] estimates the parameters of the non-rigid transformation with a penalty on the second derivative.
For classical non-parametric methods, coherent point drift (CPD) [36] fits a Gaussian mixture likelihood that aligns the source point set with the target point set. Ma et al. [37] showed the importance of exploiting local and global structures in non-rigid point set registration. CPD-Net [38] uses deep neural networks to fit functions that can adapt to geometric transformations of varying complexity. DispVoxNets [39] converts point clouds to voxels for nonlinear deformation in a supervised manner. PR-Net [40] introduces point-set shape features that determine the correlation between the source and target point sets to predict the transformation, allowing the two point sets to be statistically aligned. CorrNet3D [16] uses a new, efficient de-smoothing module to optimize the point-set pairs with better results. Ma et al. [41] used a robust transformation estimation method based on manifold regularization for non-rigid point set registration, in which the spatial transformation between two point sets is estimated by iteratively recovering the point correspondence. However, the existing methods do not perform well for point cloud registration with large deformations, most of them rely on ground truth, and they also work poorly for data whose point sets do not correspond. Our method eliminates the reliance on ground truth and produces better registration results for large deformations and non-corresponding data sets.

2.3. Deep Learning Based on Transformer

CNNs are the standard network model in computer vision [42]; with the introduction of AlexNet [43], CNNs became the dominant model. Transformers and self-attention models revolutionized natural language processing [44,45], and some studies used self-attention and transformers to replace some or all of the spatial convolutional layers in the popular ResNet [46]. The encoder-decoder design of the transformer has recently been applied to object detection and instance segmentation tasks [47], and ViT [48] directly applies a transformer to non-overlapping medium-sized image patches for image classification. AiR [49] is the first transformer-based image registration method. Point Transformer [50] is the first to introduce the transformer into the 3D point cloud domain; it proposes a highly expressive point transformer layer and uses it to build a high-performance network for point cloud classification and dense prediction. Point Cloud Transformer [51] proposes a new transformer-based point cloud learning framework, PCT, with offset attention based on implicit Laplace operators and normalization refinement. Our method uses the transformer to derive correspondences between points to improve the effectiveness of the final registration.

3. Framework

3.1. Overview

As shown in Figure 2, NrtNet is composed of three modules: the feature extraction module, the transformer module, and the point cloud reconstruction module. First, to obtain the point cloud features, the source point cloud $A \in \mathbb{R}^{n \times 3}$ and the target point cloud $B \in \mathbb{R}^{n \times 3}$ are fed into the feature extraction module to generate the point cloud features $F_a \in \mathbb{R}^{n \times d}$ and $F_b \in \mathbb{R}^{n \times d}$, where d is the feature dimension; the pointwise features are obtained by setting d to the same dimension as the number of points. The pointwise features $F_a$ and $F_b$ are then fed into the transformer module, which finds the correspondence between the source and target point clouds. The resulting point-set correspondence matrix $P \in \mathbb{R}^{n \times n}$ represents the point-set correspondence: an entry $p_{ij} = 1$ indicates that the i-th point $a_i$ of the source point cloud corresponds to the j-th point $b_j$ of the target point cloud. The transformer module is composed of a transformer encoder and a transformer decoder. The source point cloud A is permuted using P to obtain $A_{re\_order} \in \mathbb{R}^{n \times 3}$. Finally, the global feature of the target point cloud $V_b \in \mathbb{R}^{d}$ and the permuted source point cloud $A_{re\_order}$ are fed into the reconstruction module to obtain $A_{last}$, which is similar to the target point cloud B. The global feature $V_b$ is aggregated from the pointwise features. As in most papers, we optimize our model parameters by minimizing the dissimilarity between the reconstructed point cloud and the target point cloud. To better learn the correspondence between point sets, we regularize the point cloud correspondence matrix and minimize the overall objective to obtain the optimal point-set correspondence. We can express it as follows:
$$M_{optimal} = \arg\min \left\| B - A_{last} \right\|_F^2 + \left\| A - B_{last} \right\|_F^2 + G(P), \qquad (1)$$
where $A_{last} \in \mathbb{R}^{n \times 3}$ and $B_{last} \in \mathbb{R}^{n \times 3}$ are the point clouds after registration, $\left\| \cdot \right\|_F$ denotes the Frobenius norm, and $G(P)$ is a regularization term on the correspondence matrix.
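To make the data flow of Figure 2 concrete, the following PyTorch-style sketch composes the three modules described above. It is a minimal illustration rather than the authors' implementation: the three sub-modules (feature extractor, transformer module, reconstructor) are placeholders whose internals are sketched in the following subsections.

```python
import torch
import torch.nn as nn

class NrtNetSketch(nn.Module):
    """Illustrative data flow only: the three sub-modules are placeholders,
    not the authors' released implementation."""
    def __init__(self, feature_extractor, transformer_module, reconstructor):
        super().__init__()
        self.feature_extractor = feature_extractor      # DGCNN-style, Section 3.2
        self.transformer_module = transformer_module    # Section 3.3
        self.reconstructor = reconstructor              # Section 3.4

    def forward(self, A, B):
        # A, B: (batch, n, 3) source and target point clouds
        Fa, Va = self.feature_extractor(A)   # pointwise (b, n, d) and global (b, d) features
        Fb, Vb = self.feature_extractor(B)
        P = self.transformer_module(Fa, Fb)  # (b, n, n) correspondence matrix
        A_reorder = P.transpose(1, 2) @ A    # permute the source points, Eq. (10)
        A_last = self.reconstructor(A_reorder, Vb)  # drift-based reconstruction toward B
        return A_last, P
```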

3.2. Feature Extraction Module

For the feature extraction module, instead of using the traditional PointNet or PointNet++, we use a DGCNN with shared parameters to map the points of A and B to high-dimensional features. DGCNN uses edge convolution (EdgeConv) to dynamically build a graph structure at each layer of the network, taking each point as a centroid, characterizing its edges with the features of each neighboring point, and then aggregating these features to obtain a new representation of that point. DGCNN first defines the edge feature as:
$$e_{ij} = h\left( x_i, \, x_i - x_j \right), \qquad (2)$$
where h is the edge convolution operation, which considers both the global information $x_i$ and the local neighborhood information $x_i - x_j$, and $x_i \in \mathbb{R}^{1 \times d}$ is the feature of the i-th point fed into the edge convolution. Aggregating the edge features to obtain the feature $e_{ij}^{l+1}$ at the l-th layer is then expressed as:
$$e_{ij}^{l+1} = \chi_{x_j \in \Omega_i} \, h\left( x_i, \, x_i - x_j \right), \qquad (3)$$
where $\chi$ is the aggregation operation consisting of an MLP and max-pooling, and $\Omega_i$ denotes the set of points that form point pairs with the center point $x_i$. After multi-layer edge convolution, MLP, and max-pooling operations, we extract the pointwise features $F_a \in \mathbb{R}^{n \times d}$ and $F_b \in \mathbb{R}^{n \times d}$. The pointwise features are fed into a max-avg-pooling layer to obtain the global features $V_a \in \mathbb{R}^{d}$ and $V_b \in \mathbb{R}^{d}$, which are used later for reconstruction.
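For illustration, a minimal EdgeConv layer is sketched below in PyTorch. It is a simplification under assumptions (a single linear layer as h, k = 20 neighbors, max aggregation) and is not the exact DGCNN layer configuration used in the paper.

```python
import torch
import torch.nn as nn

def knn(x, k):
    # x: (batch, n, d) -> indices of the k nearest neighbours, (batch, n, k)
    dist = torch.cdist(x, x)                                  # pairwise Euclidean distances
    return dist.topk(k + 1, largest=False).indices[..., 1:]   # drop the point itself

class EdgeConv(nn.Module):
    """Minimal EdgeConv sketch: h(x_i, x_i - x_j) followed by max aggregation.
    A simplification of DGCNN, not the paper's exact layer configuration."""
    def __init__(self, in_dim, out_dim, k=20):
        super().__init__()
        self.k = k
        self.h = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x):
        # x: (batch, n, in_dim)
        idx = knn(x, self.k)                                      # (b, n, k)
        neighbors = torch.gather(
            x.unsqueeze(1).expand(-1, x.size(1), -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, x.size(-1)))     # (b, n, k, d)
        center = x.unsqueeze(2).expand_as(neighbors)
        edge = torch.cat([center, center - neighbors], dim=-1)    # (x_i, x_i - x_j)
        return self.h(edge).max(dim=2).values                     # max-pool over neighbours
```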

3.3. Transformer Module

Because of the effectiveness of the transformer for word correspondence in NLP, we use the transformer to establish the point-set correspondence. As shown in Figure 3, the transformer module consists of three parts: the transformer encoder, the transformer decoder, and a smoothing module. The transformer module takes the features of the source and target point clouds as input and learns the point-set correspondence matrix $P \in \mathbb{R}^{n \times n}$, which explicitly represents the correspondence between any two points in A and B. An entry $p_{ij} = 1$ in the matrix indicates that the i-th point $a_i$ of the source point cloud A and the j-th point $b_j$ of the target point cloud B correspond. This matrix is a permutation matrix: since any two points either correspond or do not, the matrix should contain only 0 or 1, the points in the source point cloud should correspond one-to-one with the points in the target point cloud, and each row and column of the matrix should contain exactly one 1. The transformer module uses the transformer to find the similarity between the point clouds, i.e., the probability matrix $P_{rand}$, which represents the probability of correspondence between points. Finally, a smoothing process is applied to this probability matrix to obtain the exact binary permutation matrix P.
The transformer encoder, which follows Point Transformer [50], is shown in Figure 4. The feature $f_i^a \in \mathbb{R}^{1 \times d}$ of the i-th point is fed into the standard scalar dot-product attention layer, which is expressed as:
$$y_i = \sum_{f_j^a \in F_a} \mathrm{softmax}\left( \gamma\left( \varphi(f_i^a)^T \psi(f_j^a) \right) + \delta \right) \alpha(f_j^a), \qquad (4)$$
where $\varphi$, $\psi$, and $\alpha$ are MLP feature transformation layers, $\delta$ is a position encoder, and $\gamma$ is a mapping function whose output vector is used as the attention weights. The mapping function $\gamma$ consists of an MLP with two linear layers and a ReLU activation. Attention vectors are generated for later feature aggregation, and the relationship between the features transformed by $\varphi$ and $\psi$ yields the attention between the points. Finally, the transformed features $y_i$ are obtained after softmax normalization.
Due to the disorder of point clouds and their irregular embedding in the vector space, self-supervision is performed using the positions of the points themselves. The positional encoding $\delta$ is also added to the transformed feature $\alpha$. The transformation feature is then expressed as:
$$y_i = \sum_{f_j^a \in F_{a_i}} \mathrm{softmax}\left( \gamma\left( \varphi(f_i^a)^T \psi(f_j^a) \right) + \delta \right) \cdot \left( \alpha(f_j^a) + \delta \right), \qquad (5)$$
where $F_{a_i} \subseteq F_a$ contains the features of the k neighboring points around the query point $f_i^a \in \mathbb{R}^{1 \times d}$. Self-attention is applied to each data point within its local neighborhood. In 3D point cloud registration, the point cloud itself carries position information, and the trainable parametric position encoder can be expressed as:
$$\delta = \theta\left( p_j - p_i \right), \qquad (6)$$
where $p_i$ represents the i-th point, $p_j$ represents a neighboring point of the i-th point, and $\theta$ has the same structure as $\gamma$. This position encoder enhances both attention generation and feature transformation.
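The following PyTorch sketch illustrates one way to realize Equations (4)-(6): scalar dot products between transformed neighbor features, a learned relative-position encoding, and softmax-weighted aggregation. The layer widths, the reduction of $\delta$ to a scalar inside the attention score, and the exact placement of $\gamma$ are assumptions made to obtain a runnable illustration; they are not the authors' exact layer.

```python
import torch
import torch.nn as nn

class PointAttentionSketch(nn.Module):
    """Simplified sketch of Equations (4)-(6): scalar dot-product attention over the
    k neighbours of each point, with a learned relative-position encoding delta.
    The exact placement of gamma and delta is an assumption based on the text."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.phi = nn.Linear(dim, dim)     # varphi: query transform
        self.psi = nn.Linear(dim, dim)     # psi:    key transform
        self.alpha = nn.Linear(dim, dim)   # alpha:  value transform
        # gamma: mapping function (two linear layers with a ReLU), applied to the score
        self.gamma = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # theta: position encoder on p_j - p_i, same structure as gamma, Eq. (6)
        self.theta = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, f_i, f_j, p_i, p_j):
        # f_i: (b, n, d) centre features; f_j: (b, n, k, d) neighbour features
        # p_i: (b, n, 3) centre coords;   p_j: (b, n, k, 3) neighbour coords
        delta = self.theta(p_j - p_i.unsqueeze(2))                   # (b, n, k, d)
        score = (self.phi(f_i).unsqueeze(2) * self.psi(f_j)).sum(-1, keepdim=True)
        score = self.gamma(score) + delta.mean(-1, keepdim=True)     # add position term
        attn = torch.softmax(score, dim=2)                           # over the k neighbours
        return (attn * (self.alpha(f_j) + delta)).sum(dim=2)         # y_i, Eq. (5)
```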
As shown in Figure 5, the transformed features are fed into the transformer decoder and the smoothing module to generate the point-set correspondence matrix P. First, a one-dimensional global feature is obtained for each point from its transformed features:
$$f_a^a = \sum_{i=1}^{d} \left( f_{trans\_i}^a \right)^2, \qquad f_b^b = \sum_{j=1}^{d} \left( f_{trans\_j}^b \right)^2, \qquad (7)$$
i.e., the squared entries of each transformed feature $f_{trans\_i}^a$ are summed to obtain a one-dimensional global feature $f_a^a$, giving the global features $F_a^a \in \mathbb{R}^{n \times 1}$ and $F_b^b \in \mathbb{R}^{n \times 1}$ of the source and target point clouds. To obtain the probability matrix, we first compute the distances between $F_a^a$ and $F_b^b$:
$$P_{dis} = F_a^a \cdot I^T + I \cdot (F_b^b)^T - 2 \, F_a^a \cdot (F_b^b)^T, \qquad (8)$$
where I is an $n \times 1$ all-ones column vector. Equation (8) returns an $n \times n$ point-set distance matrix: the larger the distance between two points, the smaller the probability that they correspond. The probability matrix $P_{rand}$ is obtained by inverting the distance matrix:
$$P_{rand} = \frac{1}{P_{dis}}, \qquad (9)$$
Since a pair of points either corresponds or does not, the probability matrix cannot directly represent the correspondence between point sets; it only represents the probability of correspondence. We therefore smooth this matrix, following Corrnet3D. Each row of the probability matrix should follow a normal distribution with mean $\mu_i$ and variance $\sigma_i^2$, i.e., $p_{rand\_ij} \sim N(\mu_i, \sigma_i^2)$. To better filter out incorrect point-set correspondences, we standardize this distribution, $z_{ij} = (p_{rand\_ij} - \mu_i)/\sigma_i$, so that $z_{ij}$ obeys the standard normal distribution $z_{ij} \sim N(0, 1)$. Finally, we select the corresponding point sets according to a threshold $\tau$; the number of correctly corresponding point pairs is $z_{num}$. Points close to the middle of the distribution should find a larger number of corresponding candidates, so $z_{num}$ should also obey a normal distribution and follow the three-sigma rule: the probability that a value falls within $[\mu_{z_{num}} - 3\sigma_{z_{num}}, \mu_{z_{num}} + 3\sigma_{z_{num}}]$ is 0.9973, which is almost 1. Applying a softmax operation to $z_{num}$ then yields the correct point-set correspondence matrix P.
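A minimal sketch of this decoder-and-smoothing step is given below, assuming the reconstructions of Equations (7)-(9) above: squared-sum per-point features, a pairwise distance matrix, its element-wise reciprocal as the probability matrix, and a row-wise standardization followed by a softmax. The final hard thresholding against $\tau$ is omitted, and the epsilon terms are added only for numerical safety.

```python
import torch

def correspondence_matrix(feat_a, feat_b, eps=1e-8):
    """Sketch of the decoder + smoothing step (Eqs. 7-9 plus the normalisation):
    the final hard thresholding against tau from the text is omitted here."""
    # feat_a, feat_b: (b, n, d) transformed features from the transformer encoder
    Faa = feat_a.pow(2).sum(dim=-1, keepdim=True)      # (b, n, 1), Eq. (7)
    Fbb = feat_b.pow(2).sum(dim=-1, keepdim=True)      # (b, n, 1)
    # Eq. (8): pairwise "distance" matrix built from the one-dimensional features
    P_dis = Faa + Fbb.transpose(1, 2) - 2.0 * Faa @ Fbb.transpose(1, 2)
    P_rand = 1.0 / (P_dis.abs() + eps)                 # Eq. (9): larger distance -> lower probability
    # row-wise standardisation z_ij = (p_ij - mu_i) / sigma_i, then softmax
    mu = P_rand.mean(dim=-1, keepdim=True)
    sigma = P_rand.std(dim=-1, keepdim=True) + eps
    z = (P_rand - mu) / sigma
    return torch.softmax(z, dim=-1)                    # soft correspondence matrix P
```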

3.4. The Reconstruction Module

When the correct correspondence labels between point clouds A and B are given, the shape-feature relationship between them can be learned well, and thus it is easy to learn the amount of drift between point sets. FoldingNet [31] and AtlasNet [32] reconstruct shapes from global features by deforming points of a 2D grid. CPD-Net [38] learns point-to-point drift by concatenating point features and global features. As shown in Figure 6, we propose a reconstruction module based on point correspondence.
Point clouds A and B are permuted by the point-set correspondence matrix P. The permuted A and B can be expressed as:
$$A_{re\_order} = P^T A, \qquad B_{re\_order} = P B, \qquad (10)$$
The permuted source point cloud $A_{re\_order}$ corresponds to the points of B one by one, so that large-deformation registration can be learned, which CPD-Net cannot do. The relative drifts between $A_{re\_order}$ and B are learned using the global features. As shown in Figure 6, $A_{re\_order}$ and the global feature $V_b \in \mathbb{R}^{d}$ are concatenated, and the drift of each point is then learned through three MLPs. The reconstructed point cloud $A_{last}$ is the source point cloud A plus the drift. The module is able to efficiently learn the drift between points for the purpose of registration. The reconstruction module learns a displacement field function to estimate the geometric transformation and is able to predict the geometric transformation that aligns the two objects.
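A minimal PyTorch sketch of this reconstruction step is shown below. The hidden widths of the three MLP layers are illustrative assumptions, and in this sketch the per-point drift is added back to the permuted source so that the output can be compared index-wise with B (the text above describes adding the drift to the source point cloud A).

```python
import torch
import torch.nn as nn

class ReconstructionSketch(nn.Module):
    """Sketch of the reconstruction module: permute the source points with P,
    concatenate the target's global feature to every permuted point, regress a
    per-point drift with three MLP layers, and add it back (layer widths assumed)."""
    def __init__(self, global_dim, hidden=128):
        super().__init__()
        self.drift_mlp = nn.Sequential(
            nn.Linear(3 + global_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))                       # per-point 3D drift

    def forward(self, A, P, V_b):
        # A: (b, n, 3) source points, P: (b, n, n) correspondence matrix, V_b: (b, d)
        A_reorder = P.transpose(1, 2) @ A               # Eq. (10): A_reorder = P^T A
        V = V_b.unsqueeze(1).expand(-1, A.size(1), -1)  # broadcast the global feature
        drift = self.drift_mlp(torch.cat([A_reorder, V], dim=-1))
        # A_last: permuted source plus drift (this sketch's simplification;
        # the paper text describes adding the drift to the source point cloud A)
        return A_reorder + drift
```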

3.5. Unsupervised Loss Function

After registration, the source point cloud $A_{last}$ should be similar to the target point cloud B. The Euclidean distance loss between $B_{last}$ and A is added to the standard loss, which helps the network learn the relationship between A and B. Based on the similarity of the source and target point clouds after deformation, the distance loss is expressed as:
$$L_{dis} = \left\| B - A_{last} \right\|_2 + \left\| A - B_{last} \right\|_2, \qquad (11)$$
Since the points in A and B should be in one-to-one correspondence, their correspondence matrix should be a permutation matrix, and the product of its transpose with itself should be arbitrarily close to the identity matrix. Based on this property, the matrix optimization loss is expressed as:
$$L_{mat} = \left\| P^T P - I_n \right\|_2, \qquad (12)$$
where P is the correspondence matrix and $I_n$ is the $n \times n$ identity matrix.
There are similar local features between the target point cloud B and the permuted source point cloud $A_{re\_order}$, and similarly between the source point cloud A and the permuted target point cloud $B_{re\_order}$. Based on this property, the proximity similarity loss is expressed as:
$$L_{pro} = \sum_{i=1}^{n} \left( \sum_{k \in \Omega_i^a} \left( \left\| b_{re\_i} - b_{re\_k} \right\|_2^2 - \left\| a_i - a_k \right\|_2^2 \right) + \sum_{l \in \Omega_i^b} \left( \left\| a_{re\_i} - a_{re\_l} \right\|_2^2 - \left\| b_i - b_l \right\|_2^2 \right) \right), \qquad (13)$$
where $\Omega_i^a$ represents the set of k neighbor indexes around the i-th point in A, and $b_{re\_i} \in B_{re\_order}$ and $a_{re\_i} \in A_{re\_order}$ are the points after rearrangement.
Finally, we aggregate these losses, and the total loss is expressed as:
$$L = L_{dis} + \lambda L_{mat} + \eta L_{pro}, \qquad (14)$$
where $\lambda, \eta > 0$ are hyperparameters that balance the losses.
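For concreteness, the sketch below combines the three losses in PyTorch with the weights reported in Section 4.1. It follows Equations (11)-(14) under assumptions: the proximity term is a simplified k-nearest-neighbor version of Equation (13), and the distance terms are averaged per point.

```python
import torch

def nrtnet_style_loss(A, B, A_last, B_last, P, lam=0.1, eta=0.01, k=8):
    """Sketch of Equations (11)-(14): distance loss, permutation-matrix loss, and a
    simplified kNN version of the proximity similarity loss, combined with the
    weights from Section 4.1."""
    b, n, _ = A.shape
    # Eq. (11): Euclidean distance losses between registered and target clouds
    L_dis = (B - A_last).norm(dim=-1).mean() + (A - B_last).norm(dim=-1).mean()
    # Eq. (12): P^T P should be close to the identity (P is ideally a permutation)
    I = torch.eye(n, device=P.device).expand(b, n, n)
    L_mat = (P.transpose(1, 2) @ P - I).norm()
    # Eq. (13), simplified: local distance structure of the permuted clouds should
    # match that of the original clouds over k nearest neighbours
    A_re = P.transpose(1, 2) @ A
    B_re = P @ B
    def local_dists(x):
        d = torch.cdist(x, x)                               # (b, n, n)
        return d.topk(k + 1, largest=False).values[..., 1:] # k nearest-neighbour distances
    L_pro = (local_dists(B_re) - local_dists(A)).abs().mean() + \
            (local_dists(A_re) - local_dists(B)).abs().mean()
    return L_dis + lam * L_mat + eta * L_pro
```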

4. Experiment

In this section, the experimental results of NrtNet's non-rigid point cloud registration are presented. Section 4.1 describes the datasets and experimental parameters used for training and testing and briefly introduces the evaluation method. Section 4.2 discusses a comparison of non-rigid point cloud registration results across different networks. Section 4.3 discusses the results of non-rigid methods applied to rigid registration. Section 4.4 presents the registration results of NrtNet on small-deformation datasets. Section 4.5 compares the effects of different losses. Section 4.6 shows the registration performance of NrtNet on real scan data.

4.1. Experimental Setup

Dataset. We use the 200k sampled shapes from Surreal [33] as the unsupervised training dataset and divide these 200k samples into 100 random pairs for registration training. We used the 300-pair dataset from Shrec [52] as the test dataset. We downsampled Shrec's meshes to 1024 vertices and took the mesh vertices as input to keep the variables consistent. To compare robustness across datasets, we also used the dataset of Bednarik et al. [53], which includes the small-deformation datasets paper, t-shirt, sweater, and cloth, to verify the reliability of NrtNet on different data.
Evaluation. We reviewed CPD-Net [38], DispVoxNets [39], and other works such as PR-Net [40]; most of their evaluations directly compare CD loss or subjectively compare plots of the registration results. Almost none of them register by first finding point-pair correspondences as we do, so we follow the evaluation method of Corrnet3D [16] and evaluate the model by whether the point sets correspond correctly. The point correspondence rate is expressed as:
$$P_{rate} = \frac{1}{n} \left\| P \circ P_{gt} \right\|_1, \qquad (15)$$
where $\circ$ is the Hadamard product, $\left\| \cdot \right\|_1$ is the $\ell_1$ matrix norm, and $P_{gt}$ is the ground-truth point-set correspondence matrix. We report the percentage of correct correspondences under different tolerances to compare the methods. The tolerance under a given error tolerance ratio is expressed as:
$$P_{corr} = r \, \max_{i,j} \left\| a_i - a_j \right\|_2, \qquad (16)$$
where r is the error tolerance radius.
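The evaluation can be sketched as follows; this is our reading of Equations (15)-(16), in the spirit of Corrnet3D-style evaluation: a predicted correspondence counts as correct when the matched target point lies within r times the point-cloud diameter (r = 0 demands exact matches). The function below is an illustrative assumption, not the authors' evaluation script.

```python
import torch

def correspondence_rate(P_pred, P_gt, B, r=0.0):
    """Sketch of the evaluation in Eqs. (15)-(16): a predicted correspondence is
    correct if the matched target point lies within r times the point-cloud
    diameter. This is an assumed reading, not the authors' evaluation code."""
    # P_pred, P_gt: (n, n) binary correspondence matrices, B: (n, 3) target points
    pred_idx = P_pred.argmax(dim=1)           # predicted target index for each source point
    gt_idx = P_gt.argmax(dim=1)               # ground-truth target index
    diameter = torch.cdist(B, B).max()        # maximum pairwise distance
    tol = r * diameter                        # error tolerance radius, Eq. (16)
    err = (B[pred_idx] - B[gt_idx]).norm(dim=-1)
    return (err <= tol).float().mean().item() # fraction of correct correspondences
```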
Experimental parameters and configuration. We set the hyperparameters $\lambda = 0.1$ and $\eta = 0.01$. Our method was implemented in PyTorch, and our evaluation system was trained and tested on an NVIDIA GTX 1080 GPU. The learning rate was $1 \times 10^{-4}$, the batch size was two, and we trained for 50 epochs on the large Surreal dataset [33] and 500 epochs on the small-deformation dataset [53].
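The reported settings can be wired together as in the sketch below, which reuses the placeholder model and loss sketches from Section 3. The choice of the Adam optimizer and the swapped B-to-A pass used to obtain $B_{last}$ are assumptions; the paper only states the learning rate, batch size, and epoch counts.

```python
import torch

def configure_training(model, train_pairs, epochs=50):
    """Illustrative training setup with the reported hyperparameters (lr 1e-4,
    batch size 2); optimizer type and the symmetric B->A pass are assumptions.
    `model` and `train_pairs` are placeholders for the sketches above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loader = torch.utils.data.DataLoader(train_pairs, batch_size=2, shuffle=True)
    for _ in range(epochs):                      # 50 epochs on Surreal, 500 on the small sets
        for A, B in loader:
            A_last, P = model(A, B)
            B_last, _ = model(B, A)              # one possible way to obtain B_last
            loss = nrtnet_style_loss(A, B, A_last, B_last, P, lam=0.1, eta=0.01)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```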

4.2. Experimental Evaluation of Non-Rigid Point Cloud Registration

NrtNet was compared with the unsupervised FlowNet3D [54], Corrnet3D [16], and CPD-Net [38]. Figure 7 and Table 1 show a quantitative comparison of the different methods; our method consistently outperforms the other unsupervised methods. In particular, the performance advantage over FlowNet3D and CPD-Net is significant, and we also improve on Corrnet3D. The point-set correspondence rate of CPD-Net is low, and its registration performance is poor on large-deformation datasets. Although FlowNet3D has a high correspondence rate, its registration quality strongly depends on the dataset: some test sets register well, while others register poorly. Only NrtNet and the recently published Corrnet3D consistently achieve good registration. Because NrtNet uses a transformer, which establishes point correspondences better than Corrnet3D, it can still achieve good registration on datasets with larger deformations.
Figure 8 shows the qualitative comparison results. NrtNet suffers less from large deformations in unsupervised non-rigid point cloud registration and can generate a point cloud with accurate correspondences. In contrast, CPD-Net and FlowNet3D are affected by large deformations, which makes their correspondences deviate, and they cannot achieve effective registration when the target point cloud differs greatly from the source point cloud. NrtNet learns the point-set correspondence between the target and source point clouds and can thus register more effectively. Our network can further enhance robustness to the degree of deformation by learning a specific object type.

4.3. Experimental Evaluation of Rigid Point Cloud Registration

We use non-rigid registration methods to register rigid point clouds and compare the performance of our method with FlowNet3D, Corrnet3D, and CPD-Net on rigid point cloud registration. Figure 9 and Table 2 compare our method with the other methods under different error tolerances. The table shows that our method achieves the best results under the same tolerance: Corrnet3D improves greatly on FlowNet3D, and our method improves further on Corrnet3D. Compared with non-rigid registration, the unsupervised registration performance of CPD-Net improves little on rigid point clouds, while NrtNet achieves good registration in rigid point cloud registration as well as in non-rigid registration.

4.4. Comparison between Different Datasets

NrtNet was trained and tested on the dataset of Bednarik et al. [53] to evaluate its stability on different datasets. The dataset was divided into a training set and a test set at a ratio of 8:2. As shown in Figure 10, NrtNet registers small-deformation datasets well, not only learning the deformed parts efficiently but also handling rigid transformations of the deformed point clouds. Compared with existing methods, NrtNet achieves good registration not only on large-deformation datasets but also on small-deformation datasets. This shows that first obtaining the point-set correspondence through the transformer makes the registration more effective.

4.5. Comparison of Different Losses

In the experiments, we compared the Euclidean distance loss $L_{dis}$ with the CD loss, and we also show the improvement brought to both by adding the optimization losses $L_{mat} + L_{pro}$. Figure 11 and Table 3 show the point-set correspondence rate under different losses; the loss of NrtNet achieves the best results at the same error tolerance. Comparing $L_{dis}$ with $L_{dis} + L_{mat} + L_{pro}$ shows that the improvement from $L_{mat} + L_{pro}$ is very clear. When CD loss and $L_{dis}$ are compared on their own, CD loss gives some improvement. When CD loss and CD loss $+ L_{mat} + L_{pro}$ are compared, $L_{mat} + L_{pro}$ has little effect on CD loss, and the increase in correspondence rate is minimal. The experiments show that the loss of NrtNet achieves the best results.

4.6. Real Scan Data

This section shows the registration performance of NrtNet on real data. The experiments used Shrec's real human scan dataset [52]. Since the experiments were conducted without ground truth, it is hard to evaluate the results quantitatively, so Figure 12 shows the final registration results for a reasonable analysis. NrtNet is able to effectively register the poses and shapes of the point clouds, and it can register data without ground truth. As shown in Figure 12, the same color represents corresponding point sets, and NrtNet produces good point-pair correspondences. Although there are registration errors in some details, the results are already much better than those of traditional non-rigid registration networks, and good registration is achieved for each movement.

5. Conclusions

We propose NrtNet, an unsupervised transformer-based registration architecture that can learn the correspondence between pairs of largely deformed point sets to effectively improve registration performance. NrtNet is much better than FlowNet3D in large-deformation point cloud registration and also significantly outperforms the state-of-the-art Corrnet3D. This shows that NrtNet can be used for most large-deformation registration applications. We also show registration results on real scan data in the absence of ground truth, where NrtNet still registers well. NrtNet takes an important step in large-deformation non-rigid point cloud registration and eliminates the reliance on ground truth.
In future work, NrtNet can be extended to voxels for non-rigid point cloud registration. Our correspondence may be less reliable for points that are far apart. To address this, in future experiments we will sort the point cloud and then use the transformer to establish the point-set correspondence, analogous to word correspondence in NLP; we believe the registration performance can be further improved in this way. We also believe that NrtNet can help in large-scene point cloud registration, as well as in human motion analysis and in the analysis of animal and plant growth. Meanwhile, the model size of NrtNet can be further optimized to reduce training time.

Author Contributions

X.H. conceived of and designed the algorithm and the experiments; X.H. and D.Z. analyzed the data; X.H. wrote the manuscript; D.Z. supervised the research; J.C., Y.W. and Y.C. provided suggestions for the proposed method and its evaluation and assisted in the preparation of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (grant No. 61802355) and the Hubei Key Laboratory of Intelligent Robot (HBIR 202105).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Our experiments employ open datasets in [33,52,53].

Acknowledgments

We would like to thank the anonymous reviewers for their constructive and valuable suggestions on the earlier drafts of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Apostolopoulos, J.G.; Chou, P.A.; Culbertson, B.; Kalker, T.; Trott, M.D.; Wee, S. The road to immersive communication. Proc. IEEE 2012, 100, 974–990.
  2. Raviteja, T.; Vedaraj, I.R. An introduction of autonomous vehicles and a brief survey. J. Crit. Rev. 2020, 7, 196–202.
  3. Silva, R.; Oliveira, J.C.; Giraldi, G.A. Introduction to augmented reality. Natl. Lab. Sci. Comput. 2003, 11, 1–11.
  4. Hackel, T.; Savinov, N.; Ladicky, L.; Wegner, J.D.; Schindler, K.; Pollefeys, M. Semantic3D.net: A new large-scale point cloud classification benchmark. arXiv 2017, arXiv:1704.03847.
  5. Reisi, A.R.; Moradi, M.H.; Jamasb, S. Classification and comparison of maximum power point tracking techniques for photovoltaic system: A review. Renew. Sustain. Energy Rev. 2013, 19, 433–443.
  6. Rabbani, T.; Van Den Heuvel, F.; Vosselmann, G. Segmentation of point clouds using smoothness constraint. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2006, 36, 248–253.
  7. Brox, T.; Malik, J. Object segmentation by long term analysis of point trajectories. In Proceedings of the European Conference on Computer Vision, Heraklion, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 282–295.
  8. Myronenko, A.; Song, X. Point set registration: Coherent point drift. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 2262–2275.
  9. Pomerleau, F.; Colas, F.; Siegwart, R. A review of point cloud registration algorithms for mobile robotics. Found. Trends Robot. 2015, 4, 1–104.
  10. Tang, P.; Huber, D.; Akinci, B.; Lipman, R.; Lytle, A. Automatic reconstruction of as-built building information models from laser-scanned point clouds: A review of related techniques. Autom. Constr. 2010, 19, 829–843.
  11. Vosselman, G.; Dijkman, S. 3D building model reconstruction from point clouds and ground plans. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2001, 34, 37–44.
  12. Zha, H.; Ikuta, M.; Hasegawa, T. Registration of range images with different scanning resolutions. In Proceedings of the SMC 2000 Conference, 2000 IEEE International Conference on Systems, Man and Cybernetics, Nashville, TN, USA, 8–11 October 2000; IEEE: Piscataway, NJ, USA, 2000; Volume 2, pp. 1495–1500.
  13. Zinßer, T.; Schmidt, J.; Niemann, H. Point set registration with integrated scale estimation. In Proceedings of the International Conference on Pattern Recognition and Image Processing, Estoril, Portugal, 7–9 June 2005; pp. 116–119.
  14. Amberg, B.; Romdhani, S.; Vetter, T. Optimal step nonrigid ICP algorithms for surface registration. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–8.
  15. Wang, C.; Shu, Q.; Yang, Y.; Yuan, F. Point cloud registration in multidirectional affine transformation. IEEE Photonics J. 2018, 10, 1–15.
  16. Zeng, Y.; Qian, Y.; Zhu, Z.; Hou, J.; Yuan, H.; He, Y. CorrNet3D: Unsupervised end-to-end learning of dense correspondence for 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6052–6061.
  17. Vyas, A.; Katharopoulos, A.; Fleuret, F. Fast transformers with clustered attention. Adv. Neural Inf. Process. Syst. 2020, 33, 21665–21674.
  18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
  19. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839.
  20. Huang, J.; You, S. Point cloud labeling using 3D convolutional neural network. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2670–2675.
  21. Maturana, D.; Scherer, S. VoxNet: A 3D convolutional neural network for real-time object recognition. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–3 October 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 922–928.
  22. Engelcke, M.; Rao, D.; Wang, D.Z.; Tong, C.H.; Posner, I. Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1355–1361.
  23. Graham, B. Spatially-sparse convolutional neural networks. arXiv 2014, arXiv:1409.6070.
  24. Qi, C.R.; Su, H.; Nießner, M.; Dai, A.; Yan, M.; Guibas, L.J. Volumetric and multi-view CNNs for object classification on 3D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5648–5656.
  25. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660.
  26. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 1–10.
  27. Phan, A.V.; Le Nguyen, M.; Nguyen, Y.L.H.; Bui, L.T. DGCNN: A convolutional neural network over large-scale labeled graphs. Neural Netw. 2018, 108, 533–543.
  28. Huang, Q.; Wang, W.; Neumann, U. Recurrent slice networks for 3D segmentation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2626–2635.
  29. Ye, X.; Li, J.; Huang, H.; Du, L.; Zhang, X. 3D recurrent neural networks with context fusion for point cloud semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 403–417.
  30. Thabet, A.; Alwassel, H.; Ghanem, B. MortonNet: Self-supervised learning of local features in 3D point clouds. arXiv 2019, arXiv:1904.00230.
  31. Yang, Y.; Feng, C.; Shen, Y.; Tian, D. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 206–215.
  32. Vakalopoulou, M.; Chassagnon, G.; Bus, N.; Marini, R.; Zacharaki, E.I.; Revel, M.P.; Paragios, N. AtlasNet: Multi-atlas non-linear deep networks for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Granada, Spain, 16–20 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 658–666.
  33. Groueix, T.; Fisher, M.; Kim, V.G.; Russell, B.C.; Aubry, M. 3D-CODED: 3D correspondences by deep deformation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 230–246.
  34. Besl, P.J.; McKay, N.D. Method for registration of 3-D shapes. In Proceedings of the Sensor Fusion IV: Control Paradigms and Data Structures, Boston, MA, USA, 12–15 November 1991; SPIE: Bellingham, WA, USA, 1992; Volume 1611, pp. 586–606.
  35. Chui, H.; Rangarajan, A. A new algorithm for non-rigid point matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000), Hilton Head Island, SC, USA, 13–15 June 2000; IEEE: Piscataway, NJ, USA, 2000; Volume 2, pp. 44–51.
  36. Myronenko, A.; Song, X.; Carreira-Perpinan, M. Non-rigid point set registration: Coherent point drift. Adv. Neural Inf. Process. Syst. 2006, 19, 1–8.
  37. Ma, J.; Zhao, J.; Yuille, A.L. Non-rigid point set registration by preserving global and local structures. IEEE Trans. Image Process. 2015, 25, 53–64.
  38. Wang, L.; Li, X.; Chen, J.; Fang, Y. Coherent point drift networks: Unsupervised learning of non-rigid point set registration. arXiv 2019, arXiv:1906.03039.
  39. Shimada, S.; Golyanik, V.; Tretschk, E.; Stricker, D.; Theobalt, C. DispVoxNets: Non-rigid point set alignment with supervised learning proxies. In Proceedings of the 2019 International Conference on 3D Vision (3DV), Quebec City, QC, Canada, 16–19 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 27–36.
  40. Wang, L.; Chen, J.; Li, X.; Fang, Y. Non-rigid point set registration networks. arXiv 2019, arXiv:1904.01428.
  41. Ma, J.; Wu, J.; Zhao, J.; Jiang, J.; Zhou, H.; Sheng, Q.Z. Nonrigid point set registration with robust transformation learning under manifold regularization. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 3584–3597.
  42. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  43. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1–9.
  44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
  45. Wu, F.; Fan, A.; Baevski, A.; Dauphin, Y.N.; Auli, M. Pay less attention with lightweight and dynamic convolutions. arXiv 2019, arXiv:1901.10430.
  46. Tchapmi, L.; Choy, C.; Armeni, I.; Gwak, J.; Savarese, S. SEGCloud: Semantic segmentation of 3D point clouds. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 537–547.
  47. Dovrat, O.; Lang, I.; Avidan, S. Learning to sample. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2760–2769.
  48. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  49. Wang, Z.; Delingette, H. Attention for Image Registration (AiR): An unsupervised Transformer approach. arXiv 2021, arXiv:2105.02282.
  50. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.; Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 16259–16268.
  51. Guo, M.H.; Cai, J.X.; Liu, Z.N.; Mu, T.J.; Martin, R.R.; Hu, S.M. PCT: Point cloud transformer. Comput. Vis. Media 2021, 7, 187–199.
  52. Donati, N.; Sharma, A.; Ovsjanikov, M. Deep geometric functional maps: Robust feature learning for shape correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8592–8601.
  53. Bednarik, J.; Fua, P.; Salzmann, M. Learning to reconstruct texture-less deformable surfaces from a single view. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 606–615.
  54. Liu, X.; Qi, C.R.; Guibas, L.J. FlowNet3D: Learning scene flow in 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 529–537.
Figure 1. The idea of our proposed NrtNet. The point cloud is first rearranged using a transformer, and then the permuted point cloud is reconstructed to achieve registration.
Figure 2. NrtNet is an unsupervised, end-to-end network for non-rigid point cloud registration. The source point cloud $A \in \mathbb{R}^{n \times 3}$ and the target point cloud $B \in \mathbb{R}^{n \times 3}$ are fed into the feature extraction module and the transformer module to generate the point-set correspondence matrix $P \in \mathbb{R}^{n \times n}$. Then, the permuted point cloud is fed into the reconstruction module to generate the point cloud $A_{last}$, which is the same as B, achieving the purpose of registration.
Figure 3. Transformer module. The probability matrix $P_{rand}$ is obtained by feeding the high-dimensional features of the point clouds, $F_a \in \mathbb{R}^{n \times d}$ and $F_b \in \mathbb{R}^{n \times d}$, into the transformer encoder and transformer decoder, respectively. Then, the probability matrix $P_{rand}$ is fed into the smoothing module to obtain the exact permutation correspondence matrix P.
Figure 4. Transformer encoder. $\varphi$ and $\psi$ are linear layers and $\alpha$ is an MLP; all three are feature transformation layers. $\delta$ is a linear-layer position encoder, and $\gamma$ is a mapping function.
Figure 5. Transformer decoder and the smooth module. These two modules convert the transformed features obtained from the transformer encoder into an exact point set correspondence matrix.
Figure 6. The reconstruction module concatenates the global feature $V_b$ of the target point cloud to each point of the permuted point cloud $A_{re\_order}$ and feeds them into the MLP to reconstruct the point cloud $A_{last}$.
Figure 7. Quantitative comparison of point set correspondence rates for non-rigid registration under different methods.
Figure 8. A qualitative comparison between NrtNet and other methods on a large-deformation human pose; the experiments show the effectiveness of the different methods for non-rigid point cloud registration.
Figure 9. Quantitative comparison of point set correspondence rates for rigid registration under different methods.
Figure 10. Registration performance of NrtNet on the small-deformation datasets paper, cloth, sweater, and t-shirt.
Figure 11. The point-set correspondence rate under different losses. The experiments compare the correspondence differences between NrtNet's loss and ordinary losses.
Figure 12. The registration effect of NrtNet on the real scan data.
Table 1. Point-set correspondence rates of different methods under different error tolerance rates in non-rigid point cloud registration.

Method     | 0% Error Tolerance | 10% Error Tolerance | 20% Error Tolerance
CPD-Net    | 0.3311             | 6.8212              | 24.8963
FlowNet3D  | 1.2133             | 19.7614             | 41.3494
Corrnet3D  | 2.0494             | 25.68               | 48.8636
NrtNet     | 2.6889             | 30.0429             | 51.8758
Table 2. Point-set correspondence rates of different methods under different error tolerance rates in rigid point cloud registration.

Method     | 0% Error Tolerance | 10% Error Tolerance | 20% Error Tolerance
CPD-Net    | 0.1769             | 9.6191              | 24.286
FlowNet3D  | 10.8945            | 71.5576             | 90.368
Corrnet3D  | 12.4062            | 80.895              | 95.61
NrtNet     | 12.7793            | 85.6191             | 96.247
Table 3. Point-set correspondence rates for different losses under different error tolerance rates.

Loss                    | 0% Error Tolerance | 10% Error Tolerance | 20% Error Tolerance
L_dis                   | 0.0214             | 2.4287              | 16.081
CD loss                 | 0.0879             | 6.3662              | 24.9873
CD loss + L_mat + L_pro | 0.1045             | 6.7686              | 30.7441
L_dis + L_mat + L_pro   | 12.7793            | 85.6191             | 96.247
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
