Article

HALNet: Partial Point Cloud Registration Based on Hybrid Attention and Deep Local Features

School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200444, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(9), 2768; https://doi.org/10.3390/s24092768
Submission received: 25 March 2024 / Revised: 15 April 2024 / Accepted: 22 April 2024 / Published: 26 April 2024
(This article belongs to the Section Sensing and Imaging)

Abstract

Point cloud registration is an important task in computer vision and robotics that is widely used in 3D reconstruction, target recognition, and other fields. At present, many deep-learning-based registration methods achieve good accuracy on complete point clouds, but their accuracy on partial point clouds is poor. Therefore, a partial point cloud registration network, HALNet, is proposed. Firstly, a feature extraction network consisting mainly of adaptive graph convolution (AGConv), two-dimensional convolution, and the convolutional block attention module (CBAM) is used to learn the features of the initial point clouds. Then, overlap estimation is used to remove the non-overlapping points of the two point clouds, and a hybrid attention mechanism composed of self-attention and cross-attention is used to fuse their geometric information. Finally, the rigid transformation is obtained using fully connected layers. Five methods with excellent registration performance were selected for comparison. Compared with SCANet, the best-performing of the five, HALNet reduces RMSE(R) and MAE(R) by 10.67% and 12.05%, respectively. In addition, the ablation experiments verify that the hybrid attention mechanism and the fully connected layers are conducive to improving registration performance.

1. Introduction

Abundant geometric information is contained within 3D point clouds, which are primarily collected by light detection and ranging (LiDAR). LiDAR scanning methods are classified as terrestrial or airborne based on the sensor's spatial position and further categorized as static or mobile based on its motion status. The scenes scanned by LiDAR mainly consist of urban, rural, and indoor environments. Point cloud data obtained from LiDAR exhibit challenges such as large data volume and variation in point density, so data processing algorithms require high computational capacity and long processing times. Machine learning can effectively address these issues. In recent years, applications of machine learning to LiDAR data have encompassed tasks such as building detection, scene segmentation, vegetation detection, and road marking classification [1,2].
LiDAR cannot capture complete scene information in a single data collection pass. For instance, when scanning urban ground structures, scanning range limitations prevent comprehensive coverage of all ground information in one scan. Moreover, multiple scans are acquired in inconsistent spatial coordinate systems, which makes it difficult to stitch them together. To acquire a complete scene, the scans must be aligned into a unified spatial coordinate system. Therefore, point cloud registration is a basic task of point cloud data processing and the basis of 3D reconstruction [3,4] and 3D localization [5,6,7].
In the process of point cloud registration, point cloud data often suffer from noise and missing regions, which have a great impact on registration accuracy. Many learning-based registration methods [8,9,10,11] have shown good registration performance on complete point clouds, but their performance is poor when dealing with partial point cloud registration. HALNet, a learning-based end-to-end partial point cloud registration network based on hybrid attention and deep local features, is proposed in this paper with reference to the registration model proposed in DCP [8] (as shown in Figure 1).
HALNet consists of four modules: feature extraction, overlap prediction, feature rectification, and transformation prediction. The feature extraction module consists of two parallel networks, one learning local features and the other extracting global features. Global features represent the overall structure and distribution of the entire point cloud, such as its overall shape, size, and orientation. These features facilitate rapid matching and localization of the entire point cloud, thus providing an initial reference for registration. Local features concentrate on detailed information such as local shapes and edges, enabling differentiation between different parts and facilitating the identification and matching of regions with similar structures, leading to more precise registration. AGConv [12] is used to extract the local features of the point cloud and embed them into a high-dimensional feature space, and CBAM [13] is used to enhance the local features. The global features of the point cloud are obtained by a convolution layer composed of multiple two-dimensional convolutions. The point cloud features extracted by the feature extraction module are used by the overlap prediction module to calculate the overlapping point set between the source point cloud and the target point cloud. The hybrid attention mechanism is used to learn the interrelated features within each point cloud and the structural relationship between the source and target point clouds (and vice versa). The overlapping point set features processed by the hybrid attention mechanism are used as a residual term to correct the overlapping point set features. The rigid-transformation-solving module is composed of fully connected layers, whose feature grouping capability enables the registration network to obtain the rotation matrix and translation vector that align the source point cloud and the target point cloud. The main contributions of this paper are as follows:
  • An end-to-end partial point cloud registration network, HALNet, is proposed. The network can extract deep local features through AGConv and CBAM and use the similarity scores of the point pairs of source and target point clouds to remove non-overlapping points in two point clouds.
  • A hybrid attention mechanism composed of self-attention and cross-attention is proposed to modify the extracted features and help to improve the accuracy of feature grouping for predicting rigid transformation.

2. Related Work

Research on point cloud registration based on deep learning mainly falls into two categories: feature-learning-based point cloud registration and end-to-end point cloud registration. The main idea of the feature-learning-based method is to utilize deep learning networks to learn deep features for estimating accurate correspondences, followed by employing differentiable singular value decomposition (SVD) or random sample consensus (RANSAC) to estimate rigid transformations. The end-to-end method addresses the registration problem with a neural network alone. This method takes two point clouds as input and outputs the rotation matrix and translation vector aligning these two point clouds. In other words, the transformation estimation is performed by the neural network, distinguishing it from the feature-learning method, where feature learning and transformation estimation are separated. Various attention mechanisms are employed in feature learning or point-pair matching for point cloud registration to enhance registration performance. These attention mechanisms were adapted from image processing and specifically designed for point cloud features.

2.1. Feature-Learning-Based Methods

AlexNet is utilized by 3DMatch [14] to learn 3D features from RGB-D datasets. In the workflow of 3DMatch, the first step involves converting 3D point cloud data into 3D voxel data, which is then fed into a neural network to extract local features. These extracted local features contain characteristics of representative points in the local region and their surrounding structures. 3DSmoothNet [15] introduces a novel compact learning-based local feature descriptor for point cloud matching. This method uses smooth density value (SDV) voxelization to generate an SDV voxel grid as the input representation for the feature learning network, saving network capacity by learning highly descriptive features. Unlike providing volumetric data to the feature learning network, PPFNet [16] learns only geometric local descriptors and has a strong understanding of global context. 3DFeatNet [17] introduces a weakly supervised method for learning 3D feature detectors and point cloud descriptors. This method utilizes alignment operations and attention mechanisms to learn feature point matching for 3D point clouds labeled with a Global Positioning System/Inertial Navigation System (GPS/INS). RPMNet [18] introduces a more robust deep-learning-based point cloud registration method that is less sensitive to initialization. The network of this method can obtain soft assignments of point correspondences, addressing the issue of local visibility in point clouds. DCP [8] employs dynamic graph convolutional neural networks for feature extraction and uses attention modules to generate new embeddings that consider the relationship between two point clouds. IDAM [9] combines geometric and distance features into the iterative matching process to learn features of partially overlapping point clouds.

2.2. End-to-End Methods

Deng et al. [19] proposed RelativeNet to directly estimate poses from features. Lu et al. [20] proposed DeepVCP, which automatically avoids dynamic objects and selects easily matchable key points by incorporating semantic features. Within the candidate matching region of each key point, corresponding points are generated by calculating feature similarity probabilities, and the network loss comprehensively integrates the local and global matching effects of the key points. PointNetLK [21] employs PointNet [22] to extract global features of the two input point clouds and then iteratively estimates the transformation matrix with a modified Lucas–Kanade algorithm; the objective is to minimize the discrepancy between the two features by estimating the transformation matrix. DeepGMR [10] uses neural networks to learn the correspondence between pose-invariant points and distribution parameters. These correspondences are then fed into a GMM optimization module to estimate the transformation matrix. Elbaz et al. [23] proposed an auto-encoder-based registration network that combines super-point extraction and unsupervised feature learning. FMR [24] introduced a fast feature-metric point cloud registration framework, which performs registration optimization by minimizing the projection error of feature metrics without correspondences.

2.3. Attention Mechanism

The application of the attention mechanism is already a common way to solve problems in visual tasks, and attention has also been introduced into the learning of point cloud features. Yang et al. [25] proposed Point Attention Transformers (PATs), using parameter-efficient group shuffle attention instead of a multi-head attention mechanism to learn the relationships between points. Zhang et al. [26] proposed a Point Contextual Attention Network (PCAN), which, based on the point context, can predict the importance of local point features and pay more attention to task-related features when aggregating local features. The GAPNet proposed by Chen et al. [27] learns the attention features of each point by highlighting different attention weights in the neighborhood and uses a multi-head mechanism to aggregate different features from independent heads. Zhao et al. [28] designed a self-attention layer for point clouds and used it to build self-attention networks for different task scenarios. Lu et al. [29] designed a spatial-channel attention module, which selects differentiated features according to the generated spatial and channel attention, effectively combining multi-scale and global information. Guo et al. [30] proposed the PCT network, using offset attention instead of the existing self-attention as the attention module so that it has better semantic feature learning ability.

3. Methodology

The registration network’s capability to learn point cloud features has a significant impact on registration performance, particularly for partial point cloud registration. Learning rich features and selecting overlapping point sets are critical for HALNet to demonstrate excellent registration performance. Additionally, leveraging attention to assign higher weights to important features and mitigate interference from irrelevant features, along with exploiting the feature classification ability of fully connected layers, has enhanced registration accuracy. The network architecture of HALNet is shown in Figure 2. Note that all of the following equations refer to expressions commonly used in similar studies [8,12,31,32,33].

3.1. Preliminary

The purpose of point cloud registration is to estimate the optimal rigid transformation that aligns two given point clouds, $X = \{x_i \mid i = 1,2,\dots,N_x\} \in \mathbb{R}^{N_x \times 3}$ and $Y = \{y_i \mid i = 1,2,\dots,N_y\} \in \mathbb{R}^{N_y \times 3}$. Here, $x_i$ and $y_i$ represent the $(x, y, z)$ coordinates of the $i$-th point in $X$ and $Y$, respectively. $N_x$ and $N_y$ are the numbers of points in $X$ and $Y$; for convenience of description, assume that $N_x$ equals $N_y$. In the point cloud registration process, $X$ is defined as the source point cloud and $Y$ is defined as the target point cloud. Assuming that $Y$ results from applying a rigid transformation to $X$, this transformation can be represented by a rotation matrix $R \in SO(3)$ and a translation vector $t \in \mathbb{R}^3$. The rotation matrix $R$ is obtained from the quaternion $q = w + xi + yj + zk$ through Equation (1).
$$R = \begin{bmatrix} 1-2y^2-2z^2 & 2xy-2wz & 2xz+2wy \\ 2xy+2wz & 1-2x^2-2z^2 & 2yz-2wx \\ 2xz-2wy & 2yz+2wx & 1-2x^2-2y^2 \end{bmatrix} \qquad (1)$$
The translation vector t is obtained from Equation (2).
$$t = \begin{bmatrix} t_x & t_y & t_z \end{bmatrix}^T \qquad (2)$$
where $t_x$, $t_y$, and $t_z$ are the three translation parameters.
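For illustration, the conversion in Equations (1) and (2) can be sketched in PyTorch as follows; the function name, batch handling, and quaternion normalization step are illustrative assumptions rather than part of the published HALNet code.

```python
# A minimal sketch of Equation (1): converting a unit quaternion q = w + xi + yj + zk
# into a rotation matrix, plus assembling the translation vector of Equation (2).
import torch

def quaternion_to_rotation_matrix(q: torch.Tensor) -> torch.Tensor:
    """q: (B, 4) tensor ordered as (w, x, y, z); returns (B, 3, 3) rotation matrices."""
    q = q / q.norm(dim=1, keepdim=True)          # ensure unit quaternions
    w, x, y, z = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    R = torch.stack([
        1 - 2*y**2 - 2*z**2, 2*x*y - 2*w*z,       2*x*z + 2*w*y,
        2*x*y + 2*w*z,       1 - 2*x**2 - 2*z**2, 2*y*z - 2*w*x,
        2*x*z - 2*w*y,       2*y*z + 2*w*x,       1 - 2*x**2 - 2*y**2,
    ], dim=1).reshape(-1, 3, 3)
    return R

# Example: the 7-dimensional pose predicted later (Section 3.5) splits into q and t.
pose = torch.randn(2, 7)                          # hypothetical network output
R = quaternion_to_rotation_matrix(pose[:, :4])    # rotation matrix, Equation (1)
t = pose[:, 4:]                                   # translation (t_x, t_y, t_z), Equation (2)
```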

3.2. Feature Extraction

The structure of the feature extraction network is shown in Figure 2, consisting of two parallel networks for extracting local and global features. The input point cloud data record the 3D coordinates of each point, where multiple points collectively form the overall shape of the point cloud. Since the given point cloud data only contain the 3D coordinates of each point, the k-nearest neighbors (KNN) algorithm is employed to construct the graph of the point cloud, denoted as $K_l$ and $K_g$ in Figure 2. $K_l$ is constructed as a directed graph for extracting local features, while $K_g$ is constructed as an undirected graph for extracting global features.
We define the input point cloud as $P = \{p_i \mid i = 1,2,\dots,N\} \in \mathbb{R}^{N \times 3}$, where $p_i$ represents the $(x, y, z)$ position of the $i$-th point, and its corresponding feature set as $F = \{f_i \mid i = 1,2,\dots,N\} \in \mathbb{R}^{N \times D}$, where $D$ is the feature dimension of each point. Based on the provided point cloud, a graph $G = (\nu, \xi)$ is computed, where $\nu = \{1, \dots, N\}$ and $\xi \subseteq \nu \times \nu$ represent the vertices and edges, and the set of point indices in the neighborhood of the local area is represented as $S_i = \{j : (i, j) \in \xi\}$.
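As an illustration of how the neighborhood sets $S_i$ can be obtained, the following sketch builds a k-nearest-neighbor graph with PyTorch; the function name and the value of k are assumptions, since this excerpt does not specify them.

```python
# A minimal sketch of the KNN graph construction used for K_l and K_g: for every point,
# return the indices of its k nearest neighbors (the set S_i).
import torch

def knn_graph(points: torch.Tensor, k: int = 20) -> torch.Tensor:
    """points: (B, N, 3); returns (B, N, k) neighbor indices."""
    dist = torch.cdist(points, points)            # (B, N, N) pairwise Euclidean distances
    # the smallest distance is the point itself, so take k + 1 and drop column 0
    idx = dist.topk(k + 1, dim=-1, largest=False).indices[..., 1:]
    return idx

points = torch.rand(1, 1024, 3)                   # a toy point cloud
neighbors = knn_graph(points, k=20)               # S_i for every point i
```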
The local feature extraction network mainly consists of AGConv [12] and CBAM [13]. The input point cloud data, after computation by $K_l$, yield $p_{ij}$, which is first embedded into a high-dimensional feature space by AGConv. Compared to conventional convolution, AGConv generates adaptive convolutional kernels based on dynamically learned point features. The uniqueness of the relationships between the central point and each neighboring point ensures that the learned point features in the local region are also unique. The generation process of the adaptive convolutional kernels can be represented as:
$$Kernels = \mathrm{conv2d}(p_{ij}) \qquad (3)$$
where $p_{ij} = [p_i, p_j - p_i]$, $p_i$ represents the spatial position of the central point of a local region composed of $K$ points, $p_j$ represents the spatial positions of the neighboring points, $[\cdot, \cdot]$ is the concatenation operation, and $\mathrm{conv2d}$ is a mapping function that embeds low-dimensional features into a high-dimensional feature space.
The features embedded in the high-dimensional space undergo three layers of 2D convolution operations and are then fed into CBAM for feature enhancement. Skip connections are introduced in this network to concatenate the output of each layer, ensuring that the details of the point cloud features are preserved during the learning process. After concatenation, the features undergo another layer of 2D convolution to obtain the local features of the point cloud, $F_{loc} = \{f_{loc}^i \mid i = 1,2,\dots,N\} \in \mathbb{R}^{N \times M}$. The specific process described above can be represented as:
$$f_{loc}^i = \sigma\!\left(\mathrm{conv2d}\!\left(\left[f_i^{M_0}, \mathrm{pool}(f_i^{M_1}), \mathrm{pool}(f_i^{M_2}), \mathrm{pool}(f_i^{M_3})\right]\right)\right) \qquad (4)$$
where $f_{loc}^i$ is the local feature of the point, $\sigma$ is the activation function, $\mathrm{pool}$ is a pooling operation that reduces the number of parameters to prevent overfitting, $f_i^{M_1} = \mathrm{conv2d}(f_i^{M_0})$, $f_i^{M_2} = \mathrm{conv2d}(f_i^{M_1})$, $f_i^{M_3} = \mathrm{Att}(\mathrm{conv2d}(f_i^{M_2}))$, $f_i^{M_0} = \sigma(\langle Kernels, p_{ij} \rangle)$, $\langle \cdot, \cdot \rangle$ is the inner product of the output vectors, $\mathrm{Att}$ is CBAM, and $M_0, M_1, M_2, M_3$ represent the numbers of feature channels of the respective convolutional layer outputs.
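A compact sketch of the aggregation in Equation (4) is given below, assuming the AGConv embedding $f^{M_0}$ has already been computed on a (batch, channels, points, neighbors) tensor; the channel sizes, the activation, and the identity stand-in for CBAM are assumptions, not the actual HALNet layer configuration.

```python
# Illustrative skip-connection aggregation of Equation (4): three conv2d layers,
# attention-enhanced deepest features, concatenation with the AGConv output, and a
# final conv2d followed by neighborhood pooling.
import torch
import torch.nn as nn

class LocalAggregation(nn.Module):
    def __init__(self, c0=64, c1=64, c2=128, c3=256, out_dim=512):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(c0, c1, 1), nn.LeakyReLU(0.2))
        self.conv2 = nn.Sequential(nn.Conv2d(c1, c2, 1), nn.LeakyReLU(0.2))
        self.conv3 = nn.Sequential(nn.Conv2d(c2, c3, 1), nn.LeakyReLU(0.2))
        self.attention = nn.Identity()            # stand-in for the CBAM module
        self.fuse = nn.Conv2d(c0 + c1 + c2 + c3, out_dim, 1)

    def forward(self, f_m0):                      # f_m0: (B, c0, N, k), output of AGConv
        f_m1 = self.conv1(f_m0)                   # f^{M1}
        f_m2 = self.conv2(f_m1)                   # f^{M2}
        f_m3 = self.attention(self.conv3(f_m2))   # f^{M3}, attention-enhanced features
        k = f_m0.shape[-1]
        pool = lambda f: f.max(dim=-1, keepdim=True).values.expand(-1, -1, -1, k)
        fused = torch.cat([f_m0, pool(f_m1), pool(f_m2), pool(f_m3)], dim=1)
        out = torch.relu(self.fuse(fused))        # skip connections keep low-level details
        return out.max(dim=-1).values             # (B, out_dim, N) per-point local features
```

The global branch of Equation (5) follows the same skip-connection pattern, with $p_{ij}$ itself entering the concatenation and batch normalization applied before the activation.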
The global feature extraction network mainly consists of four layers of 2D convolution. The input point cloud data, after computation by $K_g$, yield $p_{ij}$, which first undergoes three layers of 2D convolution operations. Similar to the local feature learning network, this network also introduces skip connections. The concatenated features are then subjected to one layer of 2D convolution and normalization to obtain the global features of the point cloud, $F_{glo} = \{f_{glo}^i \mid i = 1,2,\dots,N\} \in \mathbb{R}^{N \times C}$. The specific process described above can be represented as:
$$f_{glo}^i = \sigma\!\left(\mathrm{BN}\!\left(\mathrm{conv2d}\!\left(\left[p_{ij}, \mathrm{pool}(f_i^{C_1}), \mathrm{pool}(f_i^{C_2}), \mathrm{pool}(f_i^{C_3})\right]\right)\right)\right) \qquad (5)$$
where $f_{glo}^i$ is the global feature of the point, $f_i^{C_1} = \mathrm{conv2d}(p_{ij})$ with $p_{ij} = [p_i, p_j]$, $f_i^{C_2} = \mathrm{conv2d}(f_i^{C_1})$, $f_i^{C_3} = \mathrm{conv2d}(f_i^{C_2})$, $\mathrm{BN}$ is batch normalization, and $C_1, C_2, C_3$ represent the numbers of feature channels of the respective convolutional layer outputs.

3.3. Overlapping Region Estimation

In partial point cloud registration, the features of the overlapping regions have the greatest impact on registration accuracy. To achieve higher registration accuracy, it is necessary to estimate the overlapping regions of the source and target point clouds. Each point is assigned an estimated overlap score, which determines whether it belongs to the overlapping region and, therefore, whether it should be removed. The operation of removing non-overlapping points can also be referred to as pruning; it is a bidirectional operation performed simultaneously on both the source and target point clouds, rather than on a single point cloud.
The feature extraction module outputs the local and global features of the source and target point clouds. The local features $F_{loc}^{X_i} \in F_{loc}^X$ and $F_{loc}^{Y_j} \in F_{loc}^Y$ of each point are concatenated with the global features $F_{glo}^X$ and $F_{glo}^Y$ and then fed into a shared multi-layer perceptron (MLP) to estimate overlap consistency scores:
$$S_X^i = g\!\left(\left[F_{loc}^{X_i}, \delta(F_{glo}^X), \delta(F_{glo}^Y)\right]\right) \qquad (6)$$
$$S_Y^j = g\!\left(\left[F_{loc}^{Y_j}, \delta(F_{glo}^X), \delta(F_{glo}^Y)\right]\right) \qquad (7)$$
where $S_X^i$ is the score of the probability that the $i$-th point in $X$ lies in the region overlapping $Y$, $S_Y^j$ is the score of the probability that the $j$-th point in $Y$ lies in the region overlapping $X$, $g(\cdot)$ is the shared MLP, $\delta(\cdot)$ is the channel-wise repeat operation, and $[\cdot, \cdot]$ is the concatenation operation.
Based on the obtained overlap consistency scores, the top $\hat{N}$ points and their corresponding features are retained. The retained points and features are denoted as $X_O, Y_O \in \mathbb{R}^{\hat{N} \times 3}$ and $F_O^X, F_O^Y \in \mathbb{R}^{\hat{N} \times d}$, respectively.
$$X_O = X\!\left[\mathrm{Indices}_{\mathrm{top}\hat{N}}(S_X)\right], \quad F_O^X = F_{loc}^X\!\left[\mathrm{Indices}_{\mathrm{top}\hat{N}}(S_X)\right] \qquad (8)$$
$$Y_O = Y\!\left[\mathrm{Indices}_{\mathrm{top}\hat{N}}(S_Y)\right], \quad F_O^Y = F_{loc}^Y\!\left[\mathrm{Indices}_{\mathrm{top}\hat{N}}(S_Y)\right] \qquad (9)$$
where $\mathrm{Indices}_{\mathrm{top}\hat{N}}(\cdot)$ returns the indices of the top $\hat{N}$ points according to $S_X$ or $S_Y$.
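The overlap scoring and pruning of Equations (6)-(9) can be sketched as follows; the MLP layer widths, the number of retained points $\hat{N}$, and the tensor layout are assumptions made for illustration.

```python
# A minimal sketch of overlap estimation: score every point with a shared MLP (Eqs. 6-7)
# and keep the top-N_hat points and their features (Eqs. 8-9).
import torch
import torch.nn as nn

class OverlapEstimation(nn.Module):
    def __init__(self, local_dim=512, global_dim=1024, n_keep=768):
        super().__init__()
        self.n_keep = n_keep
        self.mlp = nn.Sequential(                 # shared MLP g(.)
            nn.Conv1d(local_dim + 2 * global_dim, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 1, 1))

    def score(self, f_loc, f_glo_x, f_glo_y):
        """f_loc: (B, local_dim, N); f_glo_*: (B, global_dim); returns (B, N) scores."""
        n = f_loc.shape[-1]
        rep = lambda g: g.unsqueeze(-1).expand(-1, -1, n)   # channel-wise repeat delta(.)
        return self.mlp(torch.cat([f_loc, rep(f_glo_x), rep(f_glo_y)], dim=1)).squeeze(1)

    def prune(self, pts, f_loc, scores):
        """pts: (B, N, 3); f_loc: (B, C, N); keeps the top-N_hat points and features."""
        idx = scores.topk(self.n_keep, dim=-1).indices       # Indices_topN(S)
        pts_o = torch.gather(pts, 1, idx.unsqueeze(-1).expand(-1, -1, 3))
        f_o = torch.gather(f_loc, 2, idx.unsqueeze(1).expand(-1, f_loc.shape[1], -1))
        return pts_o, f_o
```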

3.4. Hybrid Attention

The self-attention and cross-attention are combined to form a hybrid attention mechanism, which enables the registration network to learn the features of the point cloud context. The structure of the hybrid attention is shown in Figure 3.

3.4.1. Self-Attention

For machine learning tasks, the ability of a model to capture discriminative features from a plethora of given information is crucial for the target task. As a variant of attention mechanisms, self-attention enables models to focus on important aspects, making them widely applicable in machine learning tasks such as machine translation, speech recognition [34], and caption generation [35]. The principle of self-attention can be represented as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (10)$$
where $Q$ is the query, $K$ is the key, $V$ is the value, and $d_k$ is the dimension of the key $K$.
The introduction of self-attention enables the point cloud features used for estimating rigid transformations to incorporate internal contextual information. At the same time, it helps filter out features that are difficult to match and weaken their influence. Using the overlapping features $F_O^X \in \mathbb{R}^{\hat{N} \times d}$ of the source point cloud $X$ as an example, the specific implementation of self-attention in the registration network is described. Firstly, compute $Q$, $K$, and $V$:
$$Q = W_Q F_O^X, \quad K = W_K F_O^X, \quad V = W_V F_O^X \qquad (11)$$
where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices. Based on the obtained $Q$ and $K$, calculate the self-correlation weight matrix of the features:
$$W_{Att} = \mathrm{Softmax}\!\left(QK^T\right) \in \mathbb{R}^{\hat{N} \times \hat{N}} \qquad (12)$$
Multiply the obtained self-correlation weight matrix with V to obtain the source point cloud features with incorporated self-correlation information.
$$F_S^X = W_{Att} V \qquad (13)$$
The overlapping region features of the target point cloud undergo the same calculation process, resulting in the target point cloud features $F_S^Y$ with incorporated self-correlation information.
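A minimal sketch of the self-attention step in Equations (11)-(13) is shown below; realizing $W_Q$, $W_K$, and $W_V$ as linear layers and the feature dimension are assumptions.

```python
# Self-attention over the overlapping features F_O (shape: batch x N_hat x d),
# following Equations (11)-(13); note that Eq. (12) applies softmax to Q K^T directly.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V

    def forward(self, f_o):                          # f_o: (B, N_hat, d)
        q, k, v = self.w_q(f_o), self.w_k(f_o), self.w_v(f_o)      # Eq. (11)
        w_att = torch.softmax(q @ k.transpose(1, 2), dim=-1)       # Eq. (12), (B, N_hat, N_hat)
        return w_att @ v                                           # F_S, Eq. (13)
```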

3.4.2. Cross-Attention

Cross-attention is an extension of self-attention. Unlike self-attention, where the inputs come from the same sequence, cross-attention takes its inputs from different sequences. Apart from this distinction, the calculation process is similar to that of self-attention.
For registration tasks, finding point pairs in the target point cloud that are highly correlated with the source point cloud is one of the crucial steps. While self-attention enables the registration network to capture dependencies within individual point clouds, it fails to capture the correlation between the source and target point clouds. Cross-attention is therefore cascaded after self-attention to address this issue. This ensures that the final learned features of the source point cloud contain both internal correlation information and learned external correlation information with the target point cloud, and vice versa for the target point cloud.
For cross-attention, $K$ and $V$ come from one input, while $Q$ comes from the other. Firstly, the calculation of the source point cloud features in cross-attention is described. The first step is to compute the correlation matrix $W_C^X$ between the source point cloud and the target point cloud:
$$W_C^X = \mathrm{Softmax}\!\left(\frac{F_S^Y (F_S^X)^T}{\sqrt{d_k}}\right) \in \mathbb{R}^{\hat{N} \times \hat{N}} \qquad (14)$$
Then, multiply $W_C^X$ with $F_S^X$ to obtain the source point cloud features with incorporated information related to the target point cloud:
$$F_C^X = W_C^X F_S^X \qquad (15)$$
The processing steps for the target point cloud are similar. The correlation matrix between the target point cloud and the source point cloud is denoted as $W_C^Y$:
$$W_C^Y = \mathrm{Softmax}\!\left(\frac{F_S^X (F_S^Y)^T}{\sqrt{d_k}}\right) \in \mathbb{R}^{\hat{N} \times \hat{N}} \qquad (16)$$
The target point cloud features with incorporated information related to the source point cloud are denoted as $F_C^Y$:
$$F_C^Y = W_C^Y F_S^Y \qquad (17)$$
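The cross-attention computation of Equations (14)-(17) can be sketched as a single function; the tensor layout is an assumption, and the $\sqrt{d_k}$ scaling follows the equations above.

```python
# Cross-attention between the self-attention outputs F_S^X and F_S^Y, Equations (14)-(17).
import torch

def cross_attention(f_s_x: torch.Tensor, f_s_y: torch.Tensor):
    """f_s_x, f_s_y: (B, N_hat, d); returns (F_C^X, F_C^Y)."""
    d_k = f_s_x.shape[-1]
    w_c_x = torch.softmax(f_s_y @ f_s_x.transpose(1, 2) / d_k**0.5, dim=-1)  # Eq. (14)
    f_c_x = w_c_x @ f_s_x                                                    # Eq. (15)
    w_c_y = torch.softmax(f_s_x @ f_s_y.transpose(1, 2) / d_k**0.5, dim=-1)  # Eq. (16)
    f_c_y = w_c_y @ f_s_y                                                    # Eq. (17)
    return f_c_x, f_c_y
```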

3.5. Rigid Transformation Calculation

The fully connected layers are employed to classify the learned point cloud features, ultimately obtaining the rotation matrix and translation vector that align the source point cloud with the target point cloud. The specific process is described as follows.
The features generated by the feature extraction network are filtered by the overlapping estimation module to select the features of the overlapping point set. To ensure that the point cloud features used for registration have more complete structural information, the features obtained through hybrid attention are used as residual terms to correct the overlapping features.
$$\Phi_X = F_O^X + F_C^X \qquad (18)$$
$$\Phi_Y = F_O^Y + F_C^Y \qquad (19)$$
The corrected features of the source point cloud and target point cloud are concatenated after undergoing max pooling operations. The concatenated features are then fed into fully connected layers for feature classification.
$$Pose = \mathrm{FC}\!\left(\left[\mathrm{pool}(\Phi_X), \mathrm{pool}(\Phi_Y)\right]\right) \qquad (20)$$
where $Pose$ is the output of the fully connected layers and $\mathrm{FC}$ denotes the fully connected layers. There are a total of six fully connected layers for feature classification, as shown in Figure 2. $Pose$ is a seven-dimensional vector, with the first four dimensions representing the quaternion for rotation and the last three dimensions representing the translation vector. The procedure for converting the quaternion to a rotation matrix is described in Section 3.1.
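The residual correction, pooling, and fully connected pose regression of Equations (18)-(20) can be sketched as follows; the widths of the six fully connected layers and the use of max pooling are assumptions for illustration.

```python
# Residual correction (Eqs. 18-19), max pooling, and a six-layer fully connected head
# producing the 7-dimensional pose of Eq. (20): quaternion (4) + translation (3).
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        widths = [2 * dim, 1024, 512, 256, 128, 64]
        layers = []
        for w_in, w_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(w_in, w_out), nn.ReLU()]
        layers.append(nn.Linear(widths[-1], 7))   # sixth layer outputs the 7-dim pose
        self.fc = nn.Sequential(*layers)

    def forward(self, f_o_x, f_c_x, f_o_y, f_c_y):   # all (B, N_hat, d)
        phi_x = f_o_x + f_c_x                     # residual correction, Eq. (18)
        phi_y = f_o_y + f_c_y                     # residual correction, Eq. (19)
        pooled = torch.cat([phi_x.max(dim=1).values, phi_y.max(dim=1).values], dim=-1)
        pose = self.fc(pooled)                    # (B, 7), Eq. (20)
        q = pose[:, :4] / pose[:, :4].norm(dim=1, keepdim=True)   # quaternion
        t = pose[:, 4:]                           # translation vector
        return q, t
```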
Compared to using SVD to obtain rigid transformation, the fully connected layers with learnable weights and bias parameters can be optimized through backpropagation. This optimization enables the registration network to better fit the training data, thereby improving the registration accuracy.

3.6. Loss Function

The objective is to align the provided point clouds by accurately predicting the transformation. Hence, the predicted transformation and the true transformation are selected as variables to formulate the loss function, which is defined as Equation (21).
$$Loss = \left\| R_p^T R_g - I \right\|^2 + \left\| t_p - t_g \right\|^2 + \lambda \left\| \theta \right\|^2 \qquad (21)$$
where $R_g$ and $R_p$ denote the ground-truth and predicted rotation matrices, respectively, and $t_g$ and $t_p$ denote the ground-truth and predicted translation vectors. The first two terms define a simple distance on SE(3), and the third term is the Tikhonov regularization of the registration network parameters $\theta$.
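A sketch of Equation (21) in PyTorch is given below; the regularization weight $\lambda$ and its realization as a sum of squared parameters are assumptions, since the excerpt does not state how the Tikhonov term is implemented.

```python
# Registration loss of Equation (21): rotation term, translation term, and regularization.
import torch

def registration_loss(r_pred, r_gt, t_pred, t_gt, params, lambda_reg=1e-4):
    """r_*: (B, 3, 3) rotation matrices; t_*: (B, 3) translations; params: model parameters."""
    eye = torch.eye(3, device=r_pred.device).expand_as(r_pred)
    rot_term = ((r_pred.transpose(1, 2) @ r_gt - eye) ** 2).sum(dim=(1, 2)).mean()
    trans_term = ((t_pred - t_gt) ** 2).sum(dim=1).mean()
    reg_term = sum((p ** 2).sum() for p in params)    # Tikhonov regularization of theta
    return rot_term + trans_term + lambda_reg * reg_term

# Usage: loss = registration_loss(R_pred, R_gt, t_pred, t_gt, model.parameters())
```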

4. Experiments

The experiments were divided into performance evaluation and ablation study. In the performance experiments, HALNet was compared with five deep-learning-based methods to validate its registration performance in partial point cloud registration. The ablation experiments examined the influence of hybrid attention and fully connected layers on registration performance.

4.1. Experimental Setup

All experiments were conducted on a 64-bit Windows operating system with an NVIDIA GeForce RTX 4080 GPU (NVIDIA, Santa Clara, CA, USA). The framework employed for the experiments was PyTorch version 1.12.1. The initial learning rate for network training was set to 1 × 10−4. The Adam optimizer [36] was used to optimize the network parameters over 300 training epochs.

4.2. Dataset

The dataset used in this paper was ModelNet40 [37], which consists of 12,311 meshed CAD models from 40 categories. Of these, 9843 models were used for training and 2468 models were used for testing. Each source point cloud was randomly rotated within the range of [0°, 45°] and randomly translated within the range of [−0.5, 0.5]; the transformed point cloud served as the target point cloud corresponding to the source point cloud. Then, for both the source and target point clouds, a nearest-neighbor search algorithm was employed to randomly select 1024 points, and the resulting subsets served as the source and target point clouds for partial point cloud registration.
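The training-pair generation described above can be sketched as follows; the Euler-angle parameterization of the random rotation is an assumption, and the nearest-neighbor cropping step is only indicated because its exact parameters are not given in this excerpt.

```python
# Generating a source/target pair: random rotation in [0, 45] degrees and random
# translation in [-0.5, 0.5] applied to a ModelNet40 point cloud.
import numpy as np
from scipy.spatial.transform import Rotation

def make_pair(source: np.ndarray):
    """source: (N, 3) array sampled from a ModelNet40 model; returns (source, target, R, t)."""
    angles = np.random.uniform(0, 45, size=3)                 # Euler angles in degrees
    R = Rotation.from_euler('xyz', angles, degrees=True).as_matrix()
    t = np.random.uniform(-0.5, 0.5, size=3)
    target = source @ R.T + t                                 # rigidly transformed copy
    # A nearest-neighbor-based cropping/sampling of 1024 points per cloud would follow here.
    return source, target, R, t
```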

4.3. Evaluation Indicators

The root mean square error (RMSE) and the mean absolute error (MAE) between the ground truth and the predicted value were used as evaluation indicators for the experimental results. Ideally, if the rigid alignment is perfect, all of the above error measures should be zero. When evaluating the registration performance, lower values of the evaluation metrics indicate better registration performance for the corresponding method.
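The evaluation indicators can be sketched as follows; applying RMSE and MAE to Euler angles in degrees for the rotation error is an assumption about the convention, as the excerpt does not state the rotation parameterization used for evaluation.

```python
# RMSE and MAE between predicted and ground-truth values; both are zero for a perfect alignment.
import numpy as np

def rmse_mae(pred: np.ndarray, gt: np.ndarray):
    err = pred - gt
    return np.sqrt(np.mean(err ** 2)), np.mean(np.abs(err))

# Example with hypothetical Euler-angle predictions (degrees) and translations:
rot_pred, rot_gt = np.random.rand(100, 3) * 45, np.random.rand(100, 3) * 45
rmse_r, mae_r = rmse_mae(rot_pred, rot_gt)    # RMSE(R), MAE(R)
```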

4.4. Performance Evaluation

HALNet was compared with DeepGMR [10], DCP [8], IDAM [9], SCANet [38], and VRNet [11]. The experimental results for each method were obtained by training with the source code provided by the respective authors.

4.4.1. Unseen Shapes

The unseen shapes experiment uses the same categories in both the training and testing datasets but different shapes for training and testing. Additionally, because of the secondary processing of the data, the point clouds do not represent complete shapes, so the specific shapes within the same category are unknown. There were 9843 objects in the training set and 2468 objects in the testing set. This experiment validates the partial point cloud registration performance of the registration methods. The final registration results are shown in Table 1; for each metric, the lowest value indicates the best performance.
From the experimental results presented in the table, it is evident that HALNet outperformed the comparative methods in partial point cloud registration performance. Among all the compared methods, DeepGMR exhibited the poorest performance in partial point cloud registration, with significantly higher rotation evaluation metric values compared to the other methods. Although the RMSE(t) and MAE(t) of HALNet were not the lowest, ranking second after SCANet, considering the overall results of rotation and translation evaluation metrics, HALNet still demonstrated excellent registration performance.
The registration performance of the above comparison methods was inferior to that of HALNet for two reasons. First, the feature extraction networks differ. For example, DeepGMR used a shared MLP to extract the global features of the point cloud while ignoring the impact of local features on registration performance. SCANet extracted local and global information at different levels using a network composed mainly of a self-attention mechanism and fully connected layers; however, it did not use local graphs of points to learn local information, resulting in the loss of local details. HALNet first constructs the local relationship graph of the points and then uses AGConv and CBAM to extract rich local features. Second, the selection of overlapping points differs. After learning the point cloud features, HALNet removes non-overlapping points to reduce their interference with registration, whereas the other compared algorithms ignore the impact of non-overlapping points on registration performance.
To give readers a more intuitive understanding of the registration performance, five point clouds from the testing dataset were randomly selected and registered with each method. The results are shown in Figure 4.

4.4.2. Noise

Point clouds collected from sensing devices may contain noise and outliers due to measurement errors. To verify robustness, noise was sampled from $\mathcal{N}(0, 0.01)$, clipped to the range $[-0.05, 0.05]$, and then added to point clouds $X$ and $Y$ for testing. The dataset was the same as that in the unseen shapes experiment, and the model used was the one trained in the unseen shapes experiment. The experimental results after adding noise are shown in Table 2.
Comparing the experimental results from Table 1 and Table 2, it was observed that, except for DeepGMR which showed minimal change in registration performance, the registration performance of other methods generally declined after adding noise. However, HALNet still outperformed the other comparative methods in terms of registration accuracy. Despite a slight decrease in registration performance after the addition of noise, HALNet remained robust to noise.

4.4.3. Unseen Categories

Unseen categories refers to the situation where the object categories seen in the training set do not appear in the testing set. Since the training dataset cannot contain all data categories, evaluating the generalization of the registration network was necessary. In this experiment, the first 20 categories from ModelNet40 were used for the training set, and the remaining 20 categories were used for the testing set. The results of the point cloud registration experiment with unseen categories are shown in Table 3.
The results listed in Table 3 indicate that, compared to the unseen shapes registration results, the registration performance of each method decreased. DeepGMR almost failed. This phenomenon occurred because the categories involved in the training and testing sets were different. In this experiment, HALNet still exhibited better registration performance than other methods.

4.5. Ablation Experiment

4.5.1. Hybrid Attention

To evaluate the impact of hybrid attention on registration performance, the attention module was removed as the first comparison method (M1) and the attention module in HALNet was replaced with a transformer [39] as the second comparison method (M2). For readability, HALNet was labeled as M0 in this experiment. The experiments still included the Unseen Shapes, Noise, and Unseen Categories. The results are shown in Table 4.
The results in the table indicated that the use of attention improved the registration performance of the network, and the hybrid attention proposed in this paper had a greater impact on registration performance improvement than the transformer. The higher network complexity of the transformer compared to hybrid attention may be one of the reasons for its weaker performance.

4.5.2. Fully Connected Layer

This experiment compared the performance of using fully connected layers and SVD to solve the rigid transformation. The experiment included the unseen shapes, noise, and unseen categories conditions. The fully connected layers are labeled FCL. The results are shown in Table 5. They indicate that using fully connected layers enables the registration network to achieve better registration performance.

5. Conclusions

For partial point cloud registration, HALNet is proposed. HALNet extracts local features of point clouds using AGConv and CBAM and uses a hybrid attention mechanism to fuse the geometry of the source and target point clouds. The hybrid attention mechanism is composed of self-attention and cross-attention, and its structure is simpler than that of the transformer, which is also composed of self-attention and cross-attention. The ablation experiments show that HALNet with hybrid attention achieves better registration performance than HALNet with a transformer or without an attention mechanism. Fully connected layers with learnable parameters are used to predict the rigid transformation instead of SVD, and the ablation experiments show that this choice yields better registration performance than solving the rigid transformation with SVD. HALNet was compared with five other registration methods with advanced registration performance and showed excellent registration accuracy together with robustness and generalization.
Since all the experiments in this paper were carried out on the ModelNet40 dataset, the performance on real scan data is unknown. In future research, HALNet will be applied to datasets scanned in real scenes to test the viability of the method in practice. In addition, the simple structure of the hybrid attention mechanism limits its ability to discover detailed features, and future work can focus on designing better attention mechanisms.

Author Contributions

Conceptualization, methodology, data curation, software, writing—original draft preparation, D.W.; investigation, resources, H.H. and D.W.; formal analysis, validation, visualization, H.H.; supervision, project administration, J.Z.; writing—review and editing, D.W. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in [32].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gharineiat, Z.; Kurdi, F.T.; Campbell, G. Review of Automatic Processing of Topography and Surface Feature Identification LiDAR Data Using Machine Learning Techniques. Remote Sens. 2022, 14, 4685. [Google Scholar] [CrossRef]
  2. Mirzaei, K.; Arashpour, M.; Asadi, E.; Masoumi, H.; Bai, Y.; Behnood, A. 3D point cloud data processing with machine learning for construction and infrastructure applications: A comprehensive review. Adv. Eng. Inform. 2022, 51, 101501. [Google Scholar] [CrossRef]
  3. Izadi, S.; Kim, D.; Hilliges, O.; Molyneaux, D.; Newcombe, R.; Kohli, P.; Shotton, J.; Hodges, S.; Freeman, D.; Davison, A.; et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, 16–19 October 2011; pp. 559–568. [Google Scholar]
  4. Huang, X.; Mei, G.; Zhang, J.; Abbas, R. A comprehensive survey on point cloud registration. arXiv 2021, arXiv:2103.02690. [Google Scholar]
  5. Elhousni, M.; Huang, X. Review on 3D Lidar Localization for Autonomous Driving Cars. arXiv 2020, arXiv:2006.00648. [Google Scholar]
  6. Nagy, B.; Benedek, C. Real-Time Point Cloud Alignment for Vehicle Localization in a High Resolution 3D Map. In Proceedings of the Computer Vision—ECCV 2018 Workshops, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2019; pp. 226–239. [Google Scholar]
  7. Fu, Y.; Brown, N.M.; Saeed, S.U.; Casamitjana, A.; Baum, Z.M.C.; Delaunay, R.; Yang, Q.; Grimwood, A.; Min, Z.; Blumberg, S.B.; et al. DeepReg: A deep learning toolkit for medical image registration. J. Open Source Softw. 2020, 5, 2705. [Google Scholar] [CrossRef]
  8. Wang, Y.; Solomon, J.M. Deep Closest Point: Learning Representations for Point Cloud Registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3522–3531. [Google Scholar] [CrossRef]
  9. Li, J.; Zhang, C.; Xu, Z.; Zhou, H.; Zhang, C. Iterative Distance-Aware Similarity Matrix Convolution with Mutual-Supervised Point Elimination for Efficient Point Cloud Registration. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 378–394. [Google Scholar]
  10. Yuan, W.; Eckart, B.; Kim, K.; Jampani, V.; Fox, D.; Kautz, J. DeepGMR: Learning Latent Gaussian Mixture Models for Registration. In Proceedings of the Computer Vision—ECCV, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 733–750. [Google Scholar]
  11. Zhang, Z.Y.; Sun, J.D.; Dai, Y.C.; Fan, B.; He, M.Y. VRNet: Learning the Rectified Virtual Corresponding Points for 3D Point Cloud Registration. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 4997–5010. [Google Scholar] [CrossRef]
  12. Zhou, H.; Feng, Y.; Fang, M.; Wei, M.; Qin, J.; Lu, T. Adaptive Graph Convolution for Point Cloud Analysis. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 4945–4954. [Google Scholar]
  13. Woo, S.H.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. Lect. Notes Comput. Sci. 2018, 11211, 3–19. [Google Scholar] [CrossRef]
  14. Zeng, A.; Song, S.; Nießner, M.; Fisher, M.; Xiao, J.; Funkhouser, T.A. 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 199–208. [Google Scholar]
  15. Gojcic, Z.; Zhou, C.; Wegner, J.D.; Wieser, A. The Perfect Match: 3D Point Cloud Matching With Smoothed Densities. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5540–5549. [Google Scholar]
  16. Deng, H.; Birdal, T.; Ilic, S. PPFNet: Global Context Aware Local Features for Robust 3D Point Matching. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 195–205. [Google Scholar]
  17. Yew, Z.J.; Lee, G.H. 3DFeat-Net: Weakly Supervised Local 3D Features for Point Cloud Registration. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 630–646. [Google Scholar]
  18. Yew, Z.J.; Lee, G.H. RPM-Net: Robust Point Matching Using Learned Features. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11821–11830. [Google Scholar]
  19. Deng, H.W.; Birdal, T.; Ilic, S. 3D Local Features for Direct Pairwise Registration. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3239–3248. [Google Scholar] [CrossRef]
  20. Lu, W.X.; Wan, G.W.; Zhou, Y.; Fu, X.Y.; Yuan, P.F.; Song, S.Y. DeepVCP: An End-to-End Deep Neural Network for Point Cloud Registration. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 12–21. [Google Scholar] [CrossRef]
  21. Aoki, Y.; Goforth, H.; Srivatsan, R.A.; Lucey, S. PointNetLK: Robust & Efficient Point Cloud Registration using PointNet. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7156–7165. [Google Scholar] [CrossRef]
  22. Charles, R.Q.; Su, H.; Kaichun, M.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar]
  23. Elbaz, G.; Avraham, T.; Fischer, A. 3D Point Cloud Registration for Localization using a Deep Neural Network Auto-Encoder. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (Cvpr 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 2472–2481. [Google Scholar] [CrossRef]
  24. Huang, X.; Mei, G.; Zhang, J. Feature-Metric Registration: A Fast Semi-Supervised Approach for Robust Point Cloud Registration Without Correspondences. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11363–11371. [Google Scholar]
  25. Yang, J.; Zhang, Q.; Ni, B.; Li, L.; Liu, J.; Zhou, M.; Tian, Q. Modeling Point Clouds With Self-Attention and Gumbel Subset Sampling. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3318–3327. [Google Scholar]
  26. Zhang, W.; Xiao, C. PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12428–12437. [Google Scholar]
  27. Chen, C.; Fragonara, L.Z.; Tsourdos, A. GAPointNet: Graph attention based point neural network for exploiting local feature of point cloud. Neurocomputing 2021, 438, 122–132. [Google Scholar] [CrossRef]
  28. Zhao, H.; Jiang, L.; Jia, J.; Torr, P.H.S.; Koltun, V. Point Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 16239–16248. [Google Scholar]
  29. Lu, H.; Chen, X.; Zhang, G.; Zhou, Q.; Ma, Y.; Zhao, Y. Scanet: Spatial-channel Attention Network for 3D Object Detection. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), London, UK, 12–17 May 2019; pp. 1992–1996. [Google Scholar]
  30. Guo, M.-H.; Cai, J.-X.; Liu, Z.-N.; Mu, T.-J.; Martin, R.R.; Hu, S.-M. PCT: Point cloud transformer. Comput. Vis. Media 2021, 7, 187–199. [Google Scholar] [CrossRef]
  31. Huang, S.Y.; Gojcic, Z.; Usvyatsov, M.; Wieser, A.; Schindler, K. PREDATOR: Registration of 3D Point Clouds with Low Overlap. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 20–25 June 2021; pp. 4265–4274. [Google Scholar] [CrossRef]
  32. Wang, G.H.; Zhai, Q.Y.; Liu, H. Cross self-attention network for 3D point cloud. Knowl.-Based Syst. 2022, 247, 108769. [Google Scholar] [CrossRef]
  33. Shi, J.T.; Ye, H.L.; Yang, B.; Cao, F.L. An iteration-based interactive attention network for 3D point cloud registration. Neurocomputing 2023, 560, 126822. [Google Scholar] [CrossRef]
  34. Bahdanau, D.; Chorowski, J.; Serdyuk, D.; Brakel, P.; Bengio, Y. End-to-end attention-based large vocabulary speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4945–4949. [Google Scholar]
  35. Xu, K.; Ba, J.L.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 2048–2057. [Google Scholar]
  36. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  37. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1912–1920. [Google Scholar]
  38. Zhou, R.; Li, X.; Jiang, W. SCANet: A Spatial and Channel Attention based Network for Partial-to-Partial Point Cloud Registration. Pattern Recognit. Lett. 2021, 151, 120–126. [Google Scholar] [CrossRef]
  39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
Figure 1. The original framework of DCP. (Y is the target point cloud. X is the source point cloud. R and T are the rotation matrix and translation vector that align the source point cloud with the target point cloud. SVD is an abbreviation for singular value decomposition).
Figure 2. Network architecture of HALNet. (The input of HALNet is the source and target point clouds to be registered, and the output is the rotation matrix $R$ and translation vector $t$ that align the source point cloud $X$ to the coordinate system of the target point cloud $Y$. $P_X^l$ and $P_X^g$ represent the local and global graph structures used to extract the local and global features of $X$, respectively, and $P_Y^l$ and $P_Y^g$ are the same for $Y$. $F_{glo}^X$, $F_{loc}^X$, $F_{glo}^Y$, and $F_{loc}^Y$ represent the global and local features corresponding to $X$ and $Y$, respectively. $F_O^X$ and $F_O^Y$ are the features of $X$ and $Y$ without non-overlapping points. $F_C^X$ and $F_C^Y$ are features that include local details and incorporate the geometry of $X$ and $Y$.)
Figure 3. The structure of hybrid attention. The inputs $F_O^X$ and $F_O^Y$ are the features of the overlapping point sets, and the outputs $F_C^X$ and $F_C^Y$ are the features containing the internal features and the geometric structure of the two point clouds.
Figure 4. Registration results of unseen shapes (red: source point cloud; blue: target point cloud; green: predicted point cloud). The first row shows the initial position of the source and target point clouds, and the other rows show the registration results.
Table 1. Partial point cloud registration results in unseen shapes.

Model      RMSE(R)     MAE(R)     RMSE(t)    MAE(t)
DeepGMR    13.382577   8.870650   0.041859   0.029198
DCP-V2     5.478204    3.538027   0.026153   0.019312
IDAM       7.621095    3.713719   0.047063   0.024128
SCANet     3.465979    1.982889   0.014385   0.010212
VRNet      5.093359    3.364641   0.030701   0.022927
HALNet     3.096093    1.743890   0.022106   0.017327
Table 2. Partial point cloud registration results in noise.

Model      RMSE(R)     MAE(R)     RMSE(t)    MAE(t)
DeepGMR    13.352157   8.878562   0.041774   0.029208
DCP-V2     5.897242    3.899607   0.026251   0.019345
IDAM       7.472706    3.453048   0.043778   0.022701
SCANet     3.484932    2.003082   0.014277   0.010201
VRNet      5.478986    3.679850   0.030500   0.022766
HALNet     3.357249    1.948029   0.022781   0.017951
Table 3. Partial point cloud registration results in unseen categories.

Model      RMSE(R)     MAE(R)     RMSE(t)    MAE(t)
DeepGMR    14.122125   9.947880   0.043104   0.031322
DCP-V2     6.295135    4.116909   0.029018   0.021801
IDAM       8.044392    3.891389   0.048142   0.025395
SCANet     4.545772    2.933826   0.020703   0.014922
VRNet      5.947994    4.035112   0.033973   0.025189
HALNet     4.091592    2.616030   0.026665   0.020893
Table 4. The comparison experiment results for hybrid attention.

Condition           Method   RMSE(R)    MAE(R)     RMSE(t)    MAE(t)
Unseen Shapes       M1       3.934837   2.495351   0.025345   0.019687
                    M2       3.469968   2.138049   0.023129   0.018045
                    M0       3.096093   1.743890   0.022106   0.017327
Noise               M1       4.132365   2.701966   0.025643   0.019908
                    M2       3.551702   2.234420   0.023199   0.018124
                    M0       3.357249   1.948029   0.022781   0.017951
Unseen Categories   M1       5.165231   3.534752   0.030933   0.024414
                    M2       4.227642   2.852501   0.028591   0.022475
                    M0       4.091592   2.616030   0.026665   0.020893
Table 5. The comparison experiment results for the fully connected layer.

Condition           Method   RMSE(R)    MAE(R)     RMSE(t)     MAE(t)
Unseen Shapes       SVD      4.905889   2.797221   0.0222372   0.016029
                    FCL      3.096093   1.743890   0.022106    0.017327
Noise               SVD      5.034445   2.943128   0.022696    0.016189
                    FCL      3.357249   1.948029   0.022781    0.017951
Unseen Categories   SVD      6.336225   4.035086   0.029551    0.022089
                    FCL      4.091592   2.616030   0.026665    0.020893
