4.1. Datasets and Implementation Details
We evaluated our method on the 7Scenes [8] dataset and an industrial components dataset. 7Scenes [8] is a widely used indoor point cloud dataset collected with a handheld Kinect RGB-D camera; each point cloud was obtained by fusing the depth map and color map before being fed into the subsequent model. It contains point clouds of seven different indoor environments: Chess, Fires, Heads, Office, Pumpkin, RedKitchen, and Stairs. Each point cloud in the dataset is composed of 50 consecutive frames, each with an RGB image and the corresponding depth map. Following [41], we projected each point cloud along the main axis of the camera and manually selected, from the corresponding 50 RGB frames, the image closest to the point cloud projection as the RGB image of that point cloud. Building on the extensive work of IMFNet [41], we inspected and corrected the first-round filtered data: for pairs with a significant disparity between the RGB image and the projected point cloud image, we re-selected the closest RGB image from the 50 frames to correct the training samples. Among the 1184 7Scenes samples, we screened out 26 images that differed significantly from the point cloud projections and replaced them with the images closest to the projections. Note that there are domain differences among the samples of each scene; we therefore used the Chess, Stairs, Heads, Pumpkin, and RedKitchen scenes as source domain samples, the Fires scene as target domain samples, and the Office scene as the test set.
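The frame-selection step described above can be approximated as follows. This is a hypothetical sketch, since the paper does not specify the projection model or the similarity measure: a pinhole occupancy projection and a mean-absolute-difference score are assumed here, and `project_to_image` and `closest_frame` are illustrative helpers, not the authors' implementation.

```python
import numpy as np

def project_to_image(points, fx, fy, cx, cy, h, w):
    """Project a point cloud along the camera's principal (z) axis with a
    pinhole model, producing a binary occupancy image. Hypothetical stand-in
    for the projection used during RGB frame selection."""
    z = points[:, 2]
    valid = z > 1e-6  # keep only points in front of the camera
    u = np.round(points[valid, 0] * fx / z[valid] + cx).astype(int)
    v = np.round(points[valid, 1] * fy / z[valid] + cy).astype(int)
    img = np.zeros((h, w), dtype=np.uint8)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    img[v[inside], u[inside]] = 1
    return img

def closest_frame(proj, frames):
    """Pick the index of the candidate frame that best matches the projected
    point cloud, using mean absolute difference as an assumed similarity."""
    scores = [np.abs(proj.astype(float) - f.astype(float)).mean() for f in frames]
    return int(np.argmin(scores))
```

In practice the 50 candidate RGB frames would first be converted to a comparable representation (e.g., an edge or silhouette map) before scoring; any such preprocessing is omitted here for brevity.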
The industrial components dataset was acquired with a Mech-Eye PRO structured light camera in two different scenarios: a pure background plate and a complex workshop background. The Mech-Eye PRO camera directly acquires a point cloud and its corresponding image. Samples from the pure background plate and the complex workshop background are shown in Figure 4. The dataset comprises 48 samples: 25 pure background plate samples and 23 complex workshop background samples. It should be noted that the pure plate background data are ideal samples, whereas data acquisition in most industrial scenarios inevitably contains a large amount of background noise. Therefore, we used the 7Scenes public dataset as the source domain samples and the components scanned against the pure background plate as the target domain samples; the training set was composed jointly of samples from both domains. The samples under the complex workshop background served as the test set, creating a large domain gap that validates the domain adaptation capability of our model. The division of the 7Scenes and industrial components datasets is shown in Figure 5.
Moreover, for each RGB image, the resolution was unified to
by bilinear interpolation. Each point cloud was downsampled to 2048 points, and the downsampled point cloud was voxel-normalized. The source and target point clouds were then downsampled to 1536 points (75% of the complete point cloud). The inputs to the model were the downsampled source and target point clouds and their corresponding RGB images. Our model was implemented on one GeForce RTX 3090 GPU. We optimized the parameters with the adaptive moment estimation (Adam) optimizer [42], with an initial learning rate of 0.001. For both the 7Scenes and industrial components datasets, we trained the network for 100 epochs and multiplied the learning rate by 0.7 at epochs 25, 50, and 75. As for the hyperparameters, we set
to 0.5, 1, 1, and 1, respectively.
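The preprocessing and learning-rate schedule above can be sketched as follows. This is a minimal sketch under assumptions: the exact normalization and sampling strategies are not specified by the paper, so `voxel_normalize` (unit-sphere normalization) and `random_downsample` are hypothetical interpretations, while the milestone schedule mirrors the stated 0.7 decay at epochs 25, 50, and 75.

```python
import numpy as np

def voxel_normalize(points):
    """Center a point cloud and scale it into a unit sphere. One common
    reading of 'voxel-normalized'; the paper's exact scheme may differ."""
    centered = points - points.mean(axis=0)
    scale = np.linalg.norm(centered, axis=1).max()
    return centered / scale

def random_downsample(points, n):
    """Randomly downsample a point cloud to n points (assumed strategy)."""
    idx = np.random.choice(len(points), n, replace=False)
    return points[idx]

def lr_at_epoch(epoch, base_lr=0.001, milestones=(25, 50, 75), gamma=0.7):
    """Step schedule: multiply the learning rate by 0.7 at epochs 25/50/75."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Full cloud -> 2048 points; source/target partials -> 1536 points (75%).
cloud = np.random.rand(100000, 3)
full = voxel_normalize(random_downsample(cloud, 2048))
source = random_downsample(full, 1536)
```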
4.2. Compared Methods and Evaluation Metrics
We conducted comparative experiments with three traditional methods: ICP [9], Go-ICP [10], and FPFH [12], where FPFH [12] was used to extract feature descriptors, followed by RANSAC [13] to estimate the rigid transformation. In addition, we selected five deep learning approaches, PointNetLK [3], DCP [16], PRNet [19], RPMNet [20], and RIENet [6], to validate the effectiveness of our DAMF-Net. To introduce texture information into the compared models and ensure the fairness of the comparative experiments, we used six-dimensional (x, y, z, R, G, B) point cloud features in the following experiments, representing the spatial position and color information of the points, respectively.
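The six-dimensional inputs can be formed by concatenating coordinates and colors per point. This is a minimal sketch: `make_xyzrgb` is a hypothetical helper, and scaling 8-bit colors to [0, 1] is an assumption rather than the paper's stated convention.

```python
import numpy as np

def make_xyzrgb(points, colors):
    """Concatenate spatial coordinates (N, 3) with per-point RGB values
    (N, 3, uint8), scaled to [0, 1], to form (N, 6) input features."""
    return np.hstack([points, colors.astype(np.float32) / 255.0])

pts = np.random.rand(2048, 3).astype(np.float32)
cols = np.random.randint(0, 256, (2048, 3), dtype=np.uint8)
feat = make_xyzrgb(pts, cols)  # shape (2048, 6)
```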
We evaluated the registration by the RMSE of the Euler angles and of the translation vector, as used in PRNet [19], together with the chamfer distance used in PointMBF [43]. We employed RMSE(R) and RMSE(t) to denote the RMSE between the ground-truth and predicted values of the Euler angles and translation vectors, respectively, and CD to denote the chamfer distance between two point clouds. The three evaluation metrics are calculated as follows:

$\mathrm{RMSE}(R) = \sqrt{\tfrac{1}{3}\left[(\alpha - \hat{\alpha})^2 + (\beta - \hat{\beta})^2 + (\gamma - \hat{\gamma})^2\right]}$

$\mathrm{RMSE}(t) = \sqrt{\tfrac{1}{3}\left[(t_x - \hat{t}_x)^2 + (t_y - \hat{t}_y)^2 + (t_z - \hat{t}_z)^2\right]}$

$\mathrm{CD}(X, Y) = \frac{1}{|X|}\sum_{x \in X}\min_{y \in Y}\lVert x - y\rVert_2^2 + \frac{1}{|Y|}\sum_{y \in Y}\min_{x \in X}\lVert y - x\rVert_2^2$

where $\alpha$, $\beta$, and $\gamma$ represent the pitch angle, yaw angle, and roll angle of the Euler angles, respectively, recovered from the rotation matrix as $\alpha = \operatorname{arctan2}(-R_{31}, \sqrt{R_{32}^2 + R_{33}^2})$, $\beta = \operatorname{arctan2}(R_{21}, R_{11})$, and $\gamma = \operatorname{arctan2}(R_{32}, R_{33})$, and $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$ represent their predicted values. $R_{ij}$ represents the element of row $i$ and column $j$ in the rotation matrix. $t_x$, $t_y$, and $t_z$ represent the ground truth of the translation distances in the $x$, $y$, and $z$ directions, $\hat{t}_x$, $\hat{t}_y$, and $\hat{t}_z$ represent their predicted values, and $X$ and $Y$ denote the two point clouds being compared.
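The three metrics can be sketched in NumPy as follows. A ZYX Euler convention is assumed (the convention is not stated in the text), and the O(NM) brute-force chamfer distance is illustrative rather than the authors' implementation.

```python
import numpy as np

def euler_from_rotation(R):
    """Recover (roll, pitch, yaw) from a 3x3 rotation matrix,
    assuming a ZYX convention."""
    roll = np.arctan2(R[2, 1], R[2, 2])
    pitch = np.arctan2(-R[2, 0], np.sqrt(R[2, 1] ** 2 + R[2, 2] ** 2))
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return np.array([roll, pitch, yaw])

def rmse(gt, pred):
    """Root-mean-square error between ground-truth and predicted vectors."""
    return np.sqrt(np.mean((gt - pred) ** 2))

def chamfer_distance(X, Y):
    """Symmetric chamfer distance between point clouds X (N, 3) and Y (M, 3),
    via a brute-force pairwise distance matrix."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return (d.min(axis=1) ** 2).mean() + (d.min(axis=0) ** 2).mean()
```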
4.3. Comparative Evaluation
To validate our method, we conducted a series of experiments on the publicly available 7Scenes dataset and our own industrial components dataset. In Figure 6, we used the Office scene in 7Scenes as a test sample, which contains a desk and a chair; the corresponding image is displayed in the second column of Figure 4. In Figure 7, we used an industrial component as a test sample, where the point cloud characterizes industrial parts scanned against a complex workshop background; the corresponding images are shown in the fourth column of Figure 4. We considered a registration preferable when the two partial point clouds formed a complete point cloud, the corresponding parts overlapped as much as possible, and there was no large rotation or translation error. The results of our method were compared with those of the state-of-the-art methods mentioned above, as illustrated in Figure 6, Figure 7, and Table 1. As shown in Table 1, on the 7Scenes dataset, RMSE(R) was greater than two for the traditional methods ICP [9], Go-ICP [10], and FPFH [12] + RANSAC [13]. It can also be observed in Figure 6b–d that the registration results obtained by ICP [9], Go-ICP [10], and FPFH + RANSAC [12] had a significant angular discrepancy with the target point cloud. Early deep learning methods such as PointNetLK [3] also produced a significant translation error, as shown in Figure 6e. Notably, DCP [16] exhibited the lowest accuracy among all evaluated methods, with an RMSE(R) of 6.0744; its registration result, shown in Figure 6f, had substantial errors in both the rotation matrix and the translation vector. Our DAMF-Net achieved the highest accuracy among all methods, as illustrated in Figure 6j, with RMSE(R) and RMSE(t) reaching 0.0109 and 0.0001, respectively, outperforming other partial point cloud registration methods such as PRNet [19], RPMNet [20], and the baseline RIENet [6], whose chamfer distances of 8.7981, 6.5086, and 2.7058, respectively, were all higher than DAMF-Net's 0.5674. To highlight more clearly the improved accuracy of our method over the RIENet baseline, we show a top view of the measured sample in Figure 6k,l, in which the green arrows mark the region where RIENet had a larger rotation deviation than our method in alignment.
On the industrial components dataset, the performance of the traditional methods remained poor, as shown in Table 1; for example, ICP [9] showed the highest RMSE(R) at 6.1539. Furthermore, deep learning-based partial point cloud registration methods also failed to achieve satisfactory results on the industrial components dataset, which contains repetitive structures. For instance, the RMSE(R) of PRNet [19] increased from 0.1568 to 0.8901, while the values for RPMNet [20] and RIENet [6] increased from 0.0781 and 0.0247 to 0.1878 and 0.2137, respectively. Figure 7 illustrates the registration results on industrial components to further evaluate the performance of the different models. It indicates that the registration performance of all selected methods on industrial components was slightly inferior to that on the 7Scenes dataset. However, for 3D object defect detection, it is crucial to reconstruct the 3D model of the target object with high precision, as any minor registration deviation could lead to the incorrect identification of defects.
Our DAMF-Net achieved the smallest RMSE(R) among all methods, reaching 0.0109 and 0.0116 on the 7Scenes and industrial components datasets, respectively. Its translation error RMSE(t) was also the lowest, with a value of 0.0001. Notably, the performance of DAMF-Net on the industrial components dataset with a complex workshop background, which has a significant domain difference from the training set, was very close to its results on the 7Scenes dataset. This demonstrates the good generalization ability of DAMF-Net. We also show side views of the tested samples in Figure 7k,l, in which the green arrows indicate that RIENet had a larger translation deviation than our method in alignment.
Overall, the traditional methods yielded relatively poor results, indicating that even though they achieve satisfactory outcomes on datasets with distinct geometric features, such as ModelNet40 [7], their performance deteriorates on more complex scene datasets, particularly those with repetitive geometric structures like 7Scenes. Some deep learning-based point cloud registration methods, such as PointNetLK [3] and DCP [16], even underperformed the traditional methods. We believe this arises because these network architectures are primarily designed for the registration of complete point clouds, whereas in our setting only a subset of the source and target point clouds correspond to each other. It is also noticeable that the performance of all selected methods on the industrial components dataset was even poorer than on the 7Scenes dataset, including methods designed for partial point cloud registration such as RPMNet [20] and PRNet [19]. The baseline RIENet [6] performed well on the 7Scenes dataset, with RMSE(R) and RMSE(t) reaching 0.0247 and 0.0001, respectively; however, these values increased to 0.2137 and 0.0006 on the industrial components dataset, indicating that RIENet [6] cannot generalize well to few-shot industrial scene datasets. We attribute this to the repetitive planar structures in industrial components, which are far more common than in the 7Scenes dataset, and to the large domain gap between the 7Scenes and industrial components datasets. Despite the significant discrepancy between the industrial components dataset used during testing and the public dataset employed during training, our DAMF-Net still achieved closely matched RMSE(R) values of 0.0109 and 0.0116 on the public and industrial components datasets, respectively. This demonstrates that DAMF-Net possesses strong domain adaptation capabilities and also achieves good results in industrial scenes with a large number of repetitive planar structures.