Article

RFF-PoseNet: A 6D Object Pose Estimation Network Based on Robust Feature Fusion in Complex Scenes

by Xiaomei Lei 1,2,*, Wenhuan Lu 1, Jiu Yong 1,3 and Jianguo Wei 1
1 College of Intelligence and Computing, Tianjin University, Tianjin 300072, China
2 Gansu Meteorological Information and Technical Equipment Support Center, Gansu Meteorological Bureau, Lanzhou 730020, China
3 College of Intelligence and Computing, Lanzhou Jiaotong University, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3518; https://doi.org/10.3390/electronics13173518
Submission received: 26 July 2024 / Revised: 22 August 2024 / Accepted: 1 September 2024 / Published: 4 September 2024

Abstract

Six degrees-of-freedom (6D) object pose estimation plays an important role in pattern recognition for fields such as robotics and augmented reality. However, 6D object pose estimation in complex scenes suffers from low accuracy and poor real-time performance. To address these challenges, in this article, RFF-PoseNet (a 6D object pose estimation network based on robust feature fusion) is proposed for complex scenes. First, a lightweight Ghost module is used to replace the convolutional blocks in the feature extraction network. Then, a pyramid pooling module is added to the semantic label branch of PoseCNN to fuse the features of different pooling layers and enhance the network's ability to capture information about objects in complex scenes and the correlations between contextual information. Finally, a pose regression and optimization module is used to further improve object pose estimation in complex scenes. Simulation experiments conducted on the YCB-Video and Occlusion LineMOD datasets show that the RFF-PoseNet algorithm can strengthen the correlation of features between different levels and the recognition of unclear targets, thereby achieving excellent accuracy and real-time performance, as well as strong robustness.

1. Introduction

Six degrees-of-freedom (6D) object pose estimation is the process of obtaining the rigid transformation between a target and the camera coordinate system, which comprises a 3-DoF rotation and a 3-DoF translation. As shown in Figure 1, for a target of a known category, 6D object pose estimation recovers the transformation from the target to the camera coordinate system, typically represented by a transformation matrix composed of the rotation R and the translation t. This is widely used in fields such as robot grasping, augmented reality, and autonomous driving [1]. However, many factors in the process of 6D object pose estimation can affect the final accuracy [2], resulting in weak robustness of 6D object pose estimation in complex scenes.
The main challenge of 6D object pose estimation is to establish a correspondence between the input image and available 3D models, followed by the use of the PnP (Perspective-n-Point) algorithm to calculate pose parameters. Although the PnP algorithm has established a solid theoretical foundation in this field, the quality of corresponding relationships is sensitive to many factors, such as lighting changes, weak textures, and cluttered backgrounds. Common deep-learning-based 6D object pose estimation algorithms are often susceptible to the complexity of the network itself, which greatly affects the computational speed. For some images with complex backgrounds, surrounding objects often have intricate relationships with the target object, which can easily affect the accuracy, real-time performance, and robustness of the 6D object pose estimation.
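To make the correspondence-plus-PnP pipeline described above concrete, the following minimal Python sketch recovers a pose from a handful of 2D–3D correspondences with OpenCV's solvePnP. The 3D points, their 2D projections, and the camera intrinsics are illustrative placeholders rather than values from this paper.

```python
import numpy as np
import cv2

# Hypothetical 3D keypoints of an object in its model frame (meters)
# and their observed 2D projections in the image (pixels).
object_points = np.array([[0.0, 0.0, 0.0],
                          [0.1, 0.0, 0.0],
                          [0.0, 0.1, 0.0],
                          [0.0, 0.0, 0.1],
                          [0.1, 0.1, 0.0],
                          [0.1, 0.0, 0.1]], dtype=np.float64)
image_points = np.array([[320.0, 240.0],
                         [380.0, 242.0],
                         [322.0, 180.0],
                         [318.0, 230.0],
                         [378.0, 182.0],
                         [379.0, 231.0]], dtype=np.float64)

# Assumed pinhole intrinsics (fx, fy, cx, cy) with no lens distortion.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)

# PnP estimates the 3-DoF rotation (as a Rodrigues vector) and the 3-DoF
# translation that map model coordinates into the camera frame.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist,
                              flags=cv2.SOLVEPNP_ITERATIVE)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix
print(ok, R, tvec)
```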
In this article, RFF-PoseNet, based on PoseCNN, is proposed as a 6D pose estimation network with robust feature fusion in complex scenes to perform pose estimation tasks. Firstly, the conventional convolutional blocks in the PoseCNN feature extraction network are replaced with Ghost lightweight modules [3]. Then, the PPM pyramid pooling module [4] and PANet feature fusion module [5] are added to PoseCNN to enable the network to obtain information from different subregions in the feature map. Finally, an iterative optimization method is used to gradually correct pose estimation errors in the RFF-PoseNet network based on the initial 6D object pose estimation results predicted by the point with the highest confidence, further improving the ability of 6D object pose estimation in complex scenes. The main contributions of this article are:
(1)
Convolutional neural networks generate a large number of redundant feature maps during the feature extraction process. By replacing the conventional convolutional blocks in the feature extraction network with Ghost convolutions, the network model is compressed, effectively solving the problem of excessive parameter and computational complexity in 6D object pose estimation networks.
(2)
The PoseCNN network is susceptible to the influence of complex backgrounds on object segmentation and recognition. When the target is in a complex background, the semantic segmentation module in the network is prone to produce deviations in the segmentation results. A PPM pyramid pooling module and PANet feature fusion module are therefore added to the RFF-PoseNet network to enable the network to obtain information from different subregions of the feature map.
(3)
To further improve the accuracy of 6D object pose estimation, the pose estimated at the highest confidence point is specified as the initial pose of the current target, and an iterative optimization method is adopted to gradually correct pose estimation errors in the RFF-PoseNet network, further improving the accuracy of 6D object pose estimation.

2. Related Work

In recent years, deep learning has made rapid progress with the support of high-performance computing hardware (GPUs) and platforms such as Caffe, PyTorch, and TensorFlow, and excellent deep-learning algorithms have emerged in various fields of machine vision. At the same time, the powerful scene-understanding ability of deep learning has also driven technological innovation in the field of pose estimation. Deep-learning-based 6D object pose estimation algorithms can be divided into keypoint-matching, voting-based, and direct-regression methods.
In terms of 6D object pose estimation based on keypoint matching, Mahdi et al. [6] proposed the BB8 algorithm, which solves the object pose by predicting the projected coordinates of the 3D bounding box vertices on the target image to form 2D–3D point correspondences. However, the algorithm cannot be trained end-to-end and has poor real-time performance. Tekin et al. [7] modified the Yolov2 [8] network to use only one network to complete both target localization and projection point prediction; this single-shot method achieves fast pose estimation without the need for subsequent processing. Many other methods adopt a two-stage approach, such as that of Oberweger et al. [9], who predicted the projected coordinates of the target's 3D keypoints on the image using 2D heatmaps. Although keypoint-matching-based methods are effective for weakly textured targets, they require knowledge of the real 3D bounding box vertex coordinates of the target and the 3D coordinates of special keypoints during network training, requirements that are difficult to meet in real-world scenarios, particularly when handling targets with rotational symmetry.
Voting-based 6D object pose estimation refers to each pixel or 3D point contributing through voting to the final target pose result. The PVNet (Pixel-wise Voting Network) proposed by Peng et al. [10] was the first voting-based pose estimation method. PVNet first trains a CNN to predict the vector field pointing to keypoints pixel by pixel and then uses the RANSAC algorithm to vote on the keypoint positions to obtain the final keypoint positions. After obtaining the keypoints, PnP is used to infer the object pose. This method predicts the vector pixel by pixel, an approach that can also recover the occluded keypoints in the case of target occlusion. In subsequent research, Yu et al. [11] proposed an effective loss function based on PVNet to achieve more accurate vector field prediction; Song et al. [12] proposed extending a single intermediate representation to a mixed representation to achieve more accurate pose estimation; He et al. [13] fused the features of RGB images and depth images, and used a 3D keypoint Hough voting network based on instance segmentation for pose estimation. Keypoint voting methods based on vector fields can achieve better pose estimation results; however, vector field voting methods ignore the influence of the distance between pixels and keypoints. In other words, when pixels are far from keypoints, small errors in the direction vector may have a serious impact on pose estimation results.
In the case of 6D object pose estimation based on direct regression of the target pose, Xiang et al. [14] proposed the PoseCNN network, which decouples pose estimation into three branches: semantic segmentation, translation prediction, and rotation prediction. Translation is represented by the distance between the center point and the camera, and rotation is represented by a regressed quaternion (w, x, y, z); PoseLoss and ShapeMatch-Loss were proposed to handle symmetric targets. Li et al. [15] proposed the DenseFusion algorithm, which fuses features of targets in RGB and depth images pixel by pixel and optimizes the initial pose through a four-layer fully connected neural network. Hu et al. [16] proposed a dual-stream network that uses the DenseFusion network to obtain fused target features, learns the consistency between target features in two image frames, and outputs the pose estimation result. Reid et al. [17] proposed Deep-6DPose, an end-to-end deep-learning framework that can detect, segment, and recover the 6D pose of object instances from a single RGB image; the framework extends Mask R-CNN [18] with a 6D object pose estimation branch, decouples the pose parameters into translation and rotation, and regresses the rotation via a Lie algebra representation. Wang et al. [19] argue that pose rotation and translation differ significantly and should be treated separately; they therefore proposed a coordinate-decoupling-based pose network, CDPN, which combines regression-based methods with the PnP algorithm for pose estimation. However, the existing methods mentioned above are computationally intensive and time-consuming in complex scenes, making it difficult to meet the requirements of real-time, accurate, and robust 6D object pose estimation. A direct-regression-based 6D object pose estimation method, in contrast, can extract features directly from the input data and learn the pose mapping through end-to-end training, achieving accurate pose estimation of the target. Therefore, in this paper, we study a direct-regression 6D object pose estimation method based on robust feature fusion in complex scenes, built on the PoseCNN network.

3. RFF-PoseNet Network

Existing methods based on direct regression of the object pose have a high computational burden and slow running speed. In this article, RFF-PoseNet is proposed as a 6D object pose estimation network with robust feature fusion for complex scenes. As shown in Figure 2, the model's parameter and computational complexity are first reduced by introducing lightweight Ghost modules into the feature extraction stage of the PoseCNN pose estimation network. Then, a PPM pyramid pooling module is added to the semantic label branch of the RFF-PoseNet network, making full use of the global information of the image, and a feature fusion module is added to fuse the pooled feature maps of different layers, fully extracting feature information from different regions, enhancing the contextual correlation of the image and the ability to capture inconspicuous objects. Finally, a pose regression and optimization network is added to refine the obtained results, further ensuring the accuracy of RFF-PoseNet pose estimation.

3.1. Ghost Lightweight Processing

The PoseCNN pose estimation method relies on convolutional feature extraction modules, which produce many redundant feature maps. Although the redundant information in these feature maps helps the model make accurate predictions, generating them requires a large amount of computation, which increases the computational burden of the pose estimation network. As shown in Figure 3, the GhostNet module is a method for compressing network models that addresses the high computational cost of redundant feature maps from the perspective of model structure design: it reduces network parameters and computational complexity, thereby improving the network's computing speed and greatly reducing latency. Ghost modules can replace the convolutional layers of conventional convolutional networks, and their plug-and-play nature can further improve model performance. Therefore, in RFF-PoseNet, Ghost convolution modules are used to replace the convolutional blocks in the VGG feature extraction network.
Unlike conventional convolution, a Ghost module first applies a conventional convolution to the input feature map with the number of output channels set to half of the original; that is, if the original output feature map has N channels, the intermediate feature map in the left dashed box in Figure 3 has N/2 channels. Next, linear operations are applied to this intermediate feature map to generate additional Ghost feature maps. Finally, the grouped convolution results inside the dashed box on the right are concatenated with the intermediate feature map, which reduces the model's parameter and computational complexity while leaving the number of channels and the resolution of the input and output feature maps unchanged.
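As an illustration of the Ghost convolution just described, the following PyTorch sketch builds a Ghost-style block in which a primary convolution produces half of the output channels and a cheap depthwise convolution generates the remaining "ghost" features before concatenation. It is a minimal re-implementation under the assumption s = 2 and a 3 × 3 cheap kernel, not the exact module used in RFF-PoseNet.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Minimal Ghost-style convolution: primary conv + cheap depthwise conv, concatenated."""
    def __init__(self, in_channels, out_channels, kernel_size=3, ratio=2, cheap_kernel=3):
        super().__init__()
        primary_channels = out_channels // ratio            # e.g. N/2 when ratio (s) = 2
        ghost_channels = out_channels - primary_channels    # channels produced by the cheap op

        # Conventional convolution producing only a fraction of the output channels.
        self.primary = nn.Sequential(
            nn.Conv2d(in_channels, primary_channels, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(primary_channels),
            nn.ReLU(inplace=True),
        )
        # Cheap linear operation: depthwise convolution over the primary feature map.
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_channels, ghost_channels, cheap_kernel,
                      padding=cheap_kernel // 2, groups=primary_channels, bias=False),
            nn.BatchNorm2d(ghost_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y = self.primary(x)
        # Same number of output channels as a normal convolution, at a fraction of the cost.
        return torch.cat([y, self.cheap(y)], dim=1)

x = torch.randn(1, 64, 60, 80)
print(GhostConv(64, 128)(x).shape)  # torch.Size([1, 128, 60, 80])
```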
Assume that the input feature map of a conventional convolution has size $h \times w \times c$, where c is the number of channels, that the convolution kernel size is $k \times k$, and that the output feature map has spatial size $h \times w$ with n channels. The computational cost of the conventional convolution is then:
$M_1 = n \times c \times k \times k \times h \times w$    (1)
In the Ghost module, the feature map output by the primary convolution (before the linear operations) has size $w \times h \times \frac{n}{s}$, and the total size of the Ghost feature maps generated by the linear operations is:
$M_2 = w \times h \times (s-1) \times \frac{n}{s}$    (2)
Thus, the overall computational cost of the Ghost module is:
$M_3 = \frac{n}{s} \times c \times k \times k \times h \times w + (s-1) \times \frac{n}{s} \times d \times d \times h \times w$    (3)
where $d \times d$ is the kernel size of the cheap linear operation. Ultimately, the computational cost of the Ghost module in the RFF-PoseNet network is approximately s times lower than that of conventional convolution.
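Substituting Equations (1) and (3) and assuming, as in the GhostNet formulation, that the cheap-operation kernel size $d \times d$ is comparable to $k \times k$ and that $c \gg s$, the theoretical speed-up works out to roughly s:

$\frac{M_1}{M_3} = \frac{n \times c \times k \times k \times h \times w}{\frac{n}{s} \times c \times k \times k \times h \times w + (s-1) \times \frac{n}{s} \times d \times d \times h \times w} \approx \frac{s \times c}{c + s - 1} \approx s$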

3.2. Pyramid Pooling Module (PPM)

Existing 6D object pose estimation scenes often have complex backgrounds, objects with high mutual similarity, and targets that are small or inconspicuous and therefore difficult to capture. These situations can bias the label types produced by the semantic label branch of the PoseCNN network, leading to inaccurate or misidentified segmentation results and, in turn, to deviations in the final 6D object pose estimation. The Pyramid Scene Parsing Network (PSPNet) was proposed for parsing such complex scenes. To suppress the interference from the background and surrounding objects, the Pyramid Pooling Module (PPM) from PSPNet is added to the semantic label branch of the PoseCNN network. As shown in Figure 4, this module fuses the features of the same feature map at different scales, so that the extracted feature map fully combines contextual information, strengthens the network's ability to capture inconspicuous objects, makes the semantic segmentation results more accurate, and further improves the accuracy of the subsequent pose estimation.
The PPM module first pools the input feature map into four feature maps of different sizes. The top red layer in Figure 4 represents the coarsest pooling level, i.e., global pooling; the remaining three levels evenly divide the input feature map into sub-regions of different sizes: the yellow layer is 2 × 2, the blue layer is 3 × 3, and the green layer is 6 × 6. A 1 × 1 convolution is then applied to each pyramid level to reduce its dimensionality; if the pyramid has n levels, the channel dimension of each level is reduced to 1/n of the input dimension. Next, bilinear interpolation is used to upsample each low-dimensional feature map back to the size of the original feature map. Finally, the features of the different levels are concatenated to form the final pyramid-pooled global feature.
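The following PyTorch sketch shows a pyramid pooling module of the kind described above, with the 1 × 1 (global), 2 × 2, 3 × 3, and 6 × 6 pooling levels, 1 × 1 dimension-reduction convolutions, bilinear upsampling, and concatenation. The channel counts and the omission of batch normalization are simplifications for illustration rather than the exact configuration used in RFF-PoseNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Minimal PSPNet-style pyramid pooling: pool at several grid sizes, reduce channels
    with 1x1 convs, upsample back, and concatenate with the input feature map."""
    def __init__(self, in_channels, bin_sizes=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // len(bin_sizes)  # each level keeps 1/n of the channels
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(size),                       # pool to size x size sub-regions
                nn.Conv2d(in_channels, reduced, 1, bias=False),   # 1x1 conv for dimension reduction
                nn.ReLU(inplace=True),
            )
            for size in bin_sizes
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x]
        for stage in self.stages:
            y = stage(x)
            # Bilinear upsampling back to the original spatial resolution.
            feats.append(F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False))
        return torch.cat(feats, dim=1)  # pyramid-pooled global feature

print(PyramidPooling(512)(torch.randn(1, 512, 60, 80)).shape)  # torch.Size([1, 1024, 60, 80])
```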

3.3. PANet Feature Fusion Module

In complex 6D object pose estimation scenes, the segmentation and recognition ability of the PoseCNN network is easily affected by interference from the background and surrounding objects, making it difficult to classify target objects correctly and leading to incorrectly recovered object poses. The traditional feature fusion method FPN (Feature Pyramid Network) [20] combines the high-level features extracted by the feature extraction network with low-level features through a top-down pathway and lateral connections, allowing the high-level features to absorb some low-level detail and enriching the feature information expressed by the network. As shown in Figure 5, PANet [5] adds a bottom-up feature fusion path on top of FPN and, to a certain extent, strengthens the shallow feature information in FPN. As the network depth increases, ever deeper features are extracted, and the repeated downsampling in the feature extraction path can easily lead to a loss of shallow feature information. Therefore, PANet incorporates low-level feature information into high-level feature information by adding a bottom-up fusion path, so that high-level features are propagated downwards while low-level features are also propagated upwards.
The pyramid pooling module (PPM) contains pooled feature maps from different levels, whose sub-region features play different roles in the network. Through a feature fusion network, features from different levels can be fully combined, so that features of the target in different regions verify one another, improving the method's ability to understand features [21]. Therefore, RFF-PoseNet adopts the feature fusion module of the PANet network. First, the PPM produces pooled feature maps of different sizes; the network then fuses these feature maps along top-down and bottom-up paths and outputs fused feature maps of different sizes. These feature maps are adjusted and concatenated through the convolution operations in the PANet module. Finally, the optimized semantic segmentation result is output through the semantic segmentation branch of the PoseCNN network, and the prediction accuracy of the translation and rotation branches is further improved to obtain better 6D object pose estimation.
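A minimal sketch of the PANet-style fusion described above is given below: a top-down FPN pass followed by a bottom-up path-aggregation pass over three feature levels. The choice of three levels, a common channel width of 256, and nearest-neighbor upsampling are assumptions made for illustration and do not reproduce the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathAggregation(nn.Module):
    """Minimal PANet-style fusion: FPN top-down pass followed by a bottom-up pass.
    Assumes the input levels have already been projected to the same channel width."""
    def __init__(self, channels=256):
        super().__init__()
        self.smooth = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))
        self.downsample = nn.ModuleList(nn.Conv2d(channels, channels, 3, stride=2, padding=1)
                                        for _ in range(2))

    def forward(self, c3, c4, c5):
        # Top-down path (FPN): propagate high-level semantics to shallower levels.
        p5 = c5
        p4 = c4 + F.interpolate(p5, size=c4.shape[2:], mode="nearest")
        p3 = c3 + F.interpolate(p4, size=c3.shape[2:], mode="nearest")
        p3, p4, p5 = (conv(p) for conv, p in zip(self.smooth, (p3, p4, p5)))

        # Bottom-up path (PANet): pass well-localized shallow features back up.
        n3 = p3
        n4 = p4 + self.downsample[0](n3)
        n5 = p5 + self.downsample[1](n4)
        return n3, n4, n5

c3, c4, c5 = (torch.randn(1, 256, s, s) for s in (80, 40, 20))
print([t.shape for t in PathAggregation()(c3, c4, c5)])
```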

3.4. RFF-PoseNet Pose Regression and Optimization

In the pose estimation and iterative optimization stage, the trained network completes the pose estimation and iteratively optimizes the corresponding results. First, the fused features at N position points are used to predict, for each point on the object, the direction of the line to the object center and a depth value. The 2D center coordinates are obtained through voting, and the average predicted depth of the points covered by the semantic segmentation result is taken as the depth of the center. The spatial position of the center is then calculated through inverse perspective transformation to obtain the translation t. Regression is used to obtain the predicted target rotation R and a confidence level c. The RFF-PoseNet method takes the pose estimated at the point with the highest confidence as the initial pose of the current target. Iterative optimization is then used to gradually correct pose estimation errors in the network and improve pose estimation accuracy. In the early stage of network training, the prediction error is relatively large and optimizing the results is of little value, so iterative optimization is only carried out once the prediction error falls below a set value.
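The translation branch described above can be illustrated with a short sketch: given the voted 2D center (u, v), the mean predicted depth of the segmented object pixels, and the camera intrinsics, inverse perspective projection recovers the 3D center, which serves as the translation t. The intrinsic and input values below are illustrative placeholders.

```python
import numpy as np

def backproject_center(center_uv, depth, fx, fy, cx, cy):
    """Inverse perspective transformation: lift a 2D object center plus its depth
    to a 3D point in the camera frame, which serves as the translation t."""
    u, v = center_uv
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical values: voted 2D center, mean depth of the segmented pixels, intrinsics.
t = backproject_center((412.0, 251.0), depth=0.83, fx=1066.8, fy=1067.5, cx=312.9, cy=241.3)
print(t)  # 3-DoF translation of the object's center in meters
```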
The RFF-PoseNet network loss function weights pose loss through confidence and uses a logarithmic function to regularize confidence. The overall loss function of the 6D object pose estimation network is:
$Loss = \frac{1}{N}\sum_{i}\left(c_i L_i^{p} - \omega \log(c_i)\right)$    (4)
where i indexes the N position points, $L_i^{p}$ is the pose loss of the i-th point, $c_i$ is the confidence of the pose predicted at the i-th point, and $\omega$ is a balance parameter, set to 0.01 in this experiment. $L_i^{p}$ is defined over the 3D coordinates of the target model as the average distance between the points transformed by the ground-truth pose matrix $[R|t]$ and by the estimated pose matrix $[\bar{R}|\bar{t}]$, respectively. The calculation formula is:
$L_i^{p} = \frac{1}{M}\sum_{j}\left\|(R x_j + t) - (\bar{R}_i x_j + \bar{t}_i)\right\|$    (5)
where M denotes the set of sampling points of the target 3D model, j indexes these sampling points, the superscript p marks the pose loss, $x_j$ is the j-th point of the set, $(R x_j + t)$ gives the coordinates after the ground-truth pose transformation, and $(\bar{R}_i x_j + \bar{t}_i)$ gives the coordinates after the pose transformation estimated at the i-th point. The pose loss in Formula (5) is only applicable to asymmetric targets. For targets with rotational symmetry, the pose loss is calculated as the average distance between each point transformed by the estimated pose matrix $[\bar{R}|\bar{t}]$ and the nearest point transformed by the ground-truth pose matrix $[R|t]$. The calculation formula is:
$L_i^{p} = \frac{1}{M}\sum_{j}\min_{0 < k < M}\left\|(R x_j + t) - (\bar{R}_i x_k + \bar{t}_i)\right\|$    (6)
where $x_k$ denotes the point closest to $x_j$. In order to obtain a more accurate result, the pose is refined starting from the target's initial pose: the pose obtained in the previous step is fed to an iterative optimization network consisting of four fully connected layers. After K iterations, the final pose estimation result is:
$\hat{p} = [R_K | t_K]\,[R_{K-1} | t_{K-1}] \cdots [R_0 | t_0]$    (7)
where $[R_K | t_K]$ represents the pose estimation after K iterations, and K is set to 2 in the experiment.
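For reference, the following PyTorch sketch implements the confidence-weighted loss of Equation (4) with the asymmetric per-point loss of Equation (5), and the composition of refinement steps in Equation (7). It is a schematic re-implementation based on the formulas above, not the authors' released code, and the tensor shapes are assumptions.

```python
import torch

def pose_loss(model_points, R_gt, t_gt, R_pred, t_pred, conf, w=0.01):
    """Confidence-weighted pose loss (Eqs. (4)-(5)) for N per-point pose hypotheses.
    model_points: (M, 3) sampled 3D model points; R_pred/t_pred: (N, 3, 3)/(N, 3);
    conf: (N,) predicted confidences (assumed positive)."""
    gt = model_points @ R_gt.T + t_gt                                   # (M, 3) ground-truth transform
    pred = model_points @ R_pred.transpose(1, 2) + t_pred[:, None, :]   # (N, M, 3) per-hypothesis transform
    per_point = (pred - gt[None]).norm(dim=-1).mean(dim=-1)             # L_i^p, shape (N,)
    return (conf * per_point - w * torch.log(conf)).mean()              # Eq. (4)

def compose_poses(refinements):
    """Eq. (7): accumulate K refinement steps [R_k | t_k] on top of the initial pose [R_0 | t_0]."""
    R, t = refinements[0]
    for R_k, t_k in refinements[1:]:
        R, t = R_k @ R, R_k @ t + t_k   # left-multiply each successive refinement
    return R, t
```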

4. Performance Analysis of RFF-PoseNet

In this section, the experimental environment and parameter settings are first introduced, followed by the experimental datasets and evaluation metrics; accuracy and real-time comparison experiments are then conducted on two public datasets, YCB-Video and Occlusion LineMOD, and finally, ablation experiments are performed on the RFF-PoseNet network.

4.1. Experimental Environment and Parameter Settings

The specific experimental environment configuration used in this article is shown in Table 1. For training the RFF-PoseNet network, momentum-based stochastic gradient descent (SGD) was used as the optimizer with an initial learning rate of 0.001; a learning-rate decay strategy was adopted, and L2 regularization was introduced to enhance the model's generalization ability and avoid overfitting. Considering the convergence speed of the model and the size of the dataset, the network was trained for 100 epochs.
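A minimal sketch of the optimizer setup described above is shown below. The learning rate (0.001), the use of momentum SGD, L2 regularization, learning-rate decay, and 100 epochs come from the text; the specific momentum, weight-decay, and step-decay values are placeholders, since they are not given in the paper.

```python
import torch

model = torch.nn.Conv2d(3, 64, 3)  # stand-in for the RFF-PoseNet model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,           # initial learning rate from the paper
                            momentum=0.9,       # placeholder momentum value
                            weight_decay=1e-4)  # L2 regularization strength (placeholder)
# Placeholder decay strategy: reduce the learning rate by 10x every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... one training pass over the (repeated) training set, with optimizer.step(),
    # would go here ...
    scheduler.step()
```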

4.2. Experimental Dataset and Evaluation Indicators

4.2.1. Experimental Dataset and Parameter Settings

As shown in Figure 6, 21 different objects from the YCB-Video dataset [14] and images from the Occlusion LineMOD dataset are utilized in the experiment.
The YCB-Video dataset [14] is derived from the YCB dataset [22], which contains 21 different object types. RGB-D cameras are used to capture actual objects. There are a total of 92 videos, with each frame containing at least three objects and a maximum of nine objects. The average number of objects per frame is about five. The 92 videos contain more than 130,000 frames, and all videos have a resolution of 640 × 480. The dataset includes the depth sequence, RGB sequence, annotation information for each frame, and model information for the 21 objects corresponding to each video sequence, such as point cloud information and map information. The dataset was developed for PoseCNN and can be used for comparative experiments.
The Occlusion LineMOD dataset [23] is a collection of 1218 images selected from the LineMOD dataset, most of which contain factors such as object occlusion and background clutter.
For training the RFF-PoseNet network model, 15% of the RGB-D images of each target were used as the training set, and the rest were used as the test set. Because the training data of both datasets is limited, the training set was iterated 20 times per epoch to increase the effective amount of training.

4.2.2. Estimation Metrics for Experimental Results

The experimental results are evaluated using four metrics: ADD, ADD-S, AUC (Area Under the Curve), and cm-degree.
ADD accuracy. ADD transforms the sampling points of the target 3D model by the ground-truth pose matrix $[R|t]$ and by the estimated pose matrix $[\bar{R}|\bar{t}]$, respectively, and computes the average distance between the corresponding transformed points. The calculation formula for ADD is:
$ADD = \frac{1}{m}\sum_{x \in M}\left\|(R x + t) - (\bar{R} x + \bar{t})\right\|$    (8)
where M is the set of sampling points of the target 3D model, x is a point in this set, and m is the number of sampling points. ADD accuracy is the ratio of correctly estimated poses to the total number of ground-truth poses, expressed as a percentage; a pose estimate is considered correct if its ADD is below the threshold. The calculation formula for ADD accuracy is:
$accuracy = \frac{Num_{pre}}{Num_{GT}} \times 100\%$    (9)
where $Num_{pre}$ is the number of correct pose estimates and $Num_{GT}$ is the total number of ground-truth poses.
ADD-S accuracy. ADD-S is an ambiguity-invariant pose error metric that handles both symmetric and asymmetric targets. For each sampling point of the target 3D model transformed by the ground-truth pose matrix $[R|t]$, ADD-S computes the distance to the nearest point transformed by the estimated pose matrix $[\bar{R}|\bar{t}]$ and averages these distances. The calculation formula for ADD-S is:
$ADD\text{-}S = \frac{1}{m}\sum_{x_1 \in M}\min_{x_2 \in M}\left\|(R x_1 + t) - (\bar{R} x_2 + \bar{t})\right\|$    (10)
where M is the set of sampling points of the target 3D model, $x_1$ and $x_2$ are points of this set under the two transformations, and $x_2$ is the point closest to $x_1$. ADD-S accuracy is the ratio of correctly estimated poses to the total number of ground-truth poses, expressed as a percentage; a pose estimate is considered correct if its ADD-S is below the threshold.
AUC area. The AUC is the area enclosed by the accuracy–threshold curve and the threshold axis. The ADD-S metric is used to compute the pose estimation accuracy at different thresholds (with a maximum threshold of 0.1 m), from which the accuracy–threshold curve is drawn.
Cm-degree metric. The translation and rotation errors between the predicted pose $P = [R|t]$ and the ground-truth pose $P_{gt} = [R_{gt}|t_{gt}]$ are calculated. Performance is evaluated as the percentage of frames whose translation and rotation errors are below a specified number of centimeters and degrees; typically, thresholds of 5° for rotation and 5 cm for translation are used. The two errors are defined as:
$e_r = \arccos\left(\frac{trace(R^{T} R_{gt}) - 1}{2}\right)$    (11)
$e_t = \left\|t - t_{gt}\right\|_2$    (12)
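For clarity, the metrics defined above can be implemented in a few lines of NumPy; the sketch below computes ADD (Equation (8)), ADD-S (Equation (10)), and the cm-degree check (Equations (11) and (12)). It assumes points are given in meters and, following the usual 5 cm/5° convention, counts a pose as correct when both errors are below their thresholds.

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_est, t_est):
    """ADD (Eq. (8)): mean distance between corresponding transformed model points."""
    gt = pts @ R_gt.T + t_gt
    est = pts @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()

def add_s_metric(pts, R_gt, t_gt, R_est, t_est):
    """ADD-S (Eq. (10)): mean distance to the closest estimated point (symmetry-aware)."""
    gt = pts @ R_gt.T + t_gt
    est = pts @ R_est.T + t_est
    d = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)  # pairwise distances
    return d.min(axis=1).mean()

def cm_degree_correct(R_gt, t_gt, R_est, t_est, t_thresh=0.05, r_thresh=5.0):
    """5 cm / 5 degree criterion (Eqs. (11)-(12)): both errors must be below threshold."""
    e_t = np.linalg.norm(t_est - t_gt)
    cos = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    e_r = np.degrees(np.arccos(cos))
    return e_t < t_thresh and e_r < r_thresh
```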

4.3. Experimental Results and Analysis

4.3.1. Experimental Results on YCB-Video Dataset

Based on the above experimental configuration, the visual pose estimation results of RFF-PoseNet on the YCB-Video dataset are shown in Figure 7. The first row is the original input image, and the middle row is the final result of the semantic segmentation task. The marker point in the center of each object is the 2D center point of each object in the image predicted by the network. The last row in Figure 7 shows the 6D object pose estimation result of the RFF-PoseNet algorithm, which is reflected by the degree of fit between the final predicted 3D pose and the actual object. As shown in Figure 7, the RFF-PoseNet pose estimation network has high accuracy. Furthermore, although the third input image has a relatively complex background compared to the other images, the 6D object pose estimation result still has good accuracy.
On the YCB-Video dataset, the performance of the RFF-PoseNet algorithm was compared with that of the PoseCNN [14], DenseFusion (with iterative optimization) [15], Dual-Stream [16], Uni6D [24], REDE [25], and CoS-PVNet [26] algorithms in terms of AUC and ADD-S. The experimental results are shown in Table 2, which reports the AUC and ADD-S metrics for each method; larger values indicate better pose estimation performance for the given target. The results show that the RFF-PoseNet algorithm achieves good estimates for most objects. Object names marked with "*" indicate that the target has rotational symmetry, and bold indicates the highest result for the current target.
As shown in Table 2, the average AUC of RFF-PoseNet is 7.5% higher than that of PoseCNN, 1.8% higher than that of the iteratively optimized DenseFusion algorithm, 1.2% higher than that of Dual-Stream, 0.3% higher than that of Uni6D, 0.1% higher than that of REDE, and 23.6% higher than that of CoS-PVNet. The average ADD-S score of RFF-PoseNet is 16.1% higher than that of PoseCNN, 0.4% higher than that of DenseFusion, 0.1% higher than that of Dual-Stream, 0.3% higher than that of Uni6D, 0.3% higher than that of REDE, and 31.1% higher than that of CoS-PVNet. Therefore, the RFF-PoseNet algorithm performs relatively well overall. However, RFF-PoseNet performs relatively poorly on two targets, 006_mustard_bottle and 009_gelatin_box, and has lower accuracy when estimating the pose of the symmetric target 061_foam_brick *, which may be because RFF-PoseNet uses the same multi-scale radius for different targets, resulting in poorer pose estimation for these targets than for the others.

4.3.2. Experimental Results on Occlusion LineMOD Dataset

The visualization results of RFF-PoseNet on the Occlusion LineMOD dataset are shown in Figure 8. The first row is the original input image, the second row is the result produced by our method in the semantic segmentation branch, and the last row is the final pose estimation result. As Figure 8 shows, the RFF-PoseNet network estimates poses with good accuracy and handles complex backgrounds and occlusion well, achieving relatively accurate 6D object pose estimation results.
On the Occlusion LineMOD dataset, the performance of RFF-PoseNet was compared with that of the PoseCNN [14], iteratively optimized DenseFusion [15], Dual-Stream [16], Uni6D [24], REDE [25], and CoS-PVNet [26] algorithms in terms of the ADD-S metric. The experimental results are shown in Table 3; larger ADD-S values indicate better pose estimation for the given target. It can be seen that RFF-PoseNet achieves good estimates for most objects. In Table 3, bold indicates the highest result for the current target.
As shown in Table 3, the Occlusion LineMOD dataset contains many occlusion relationships and cluttered backgrounds, and PoseCNN does not achieve good performance overall. RFF-PoseNet improves accuracy by 53.1% compared to PoseCNN, 13.8% compared to the iteratively optimized DenseFusion, 1.4% compared to Dual-Stream, 48.4% compared to Uni6D, 13.7% compared to REDE, and 29.9% compared to CoS-PVNet. This indicates that RFF-PoseNet can effectively improve pose estimation accuracy and thus perform better 6D object pose estimation in complex environments.

4.4. Comparison of Real-Time Performance

Runtime is affected by many factors such as hardware performance and the software environment. To ensure fair and comparable results, this paper selected three representative 6D object pose estimation algorithms that use the same semantic segmentation approach as RFF-PoseNet, namely PoseCNN [14], DenseFusion [15], and Dual-Stream [16], together with the Yolo-6D [7] algorithm, which has excellent real-time performance, and the more recent Uni6D [24], REDE [25], and CoS-PVNet [26] algorithms. These seven representative algorithms and the RFF-PoseNet algorithm were compared in terms of runtime in the same experimental environment according to four aspects: segmentation time, pose estimation time, optimization time, and total time. Table 4 shows the runtime comparison results of the studied 6D object pose estimation methods.
As shown in Table 4, in terms of segmentation time, the Yolo-6D algorithm does not require segmentation, so its segmentation time is zero. PoseCNN, DenseFusion, Dual-Stream, and RFF-PoseNet all use the same semantic segmentation method. Among them, RFF-PoseNet uses Ghost convolution blocks for lightweight processing and introduces the PPM module and PANet feature fusion module to improve segmentation, which requires processing feature information in higher dimensions and takes a certain amount of time; as a result, the segmentation times of the four algorithms are similar. In terms of pose estimation time, RFF-PoseNet is 0.08 s faster than PoseCNN but slower than the other algorithms; note that DenseFusion, Dual-Stream, and several of the other algorithms take RGB-D data combined with point clouds as input, which imposes high computational requirements when processing high-resolution images and large amounts of point cloud data. In terms of optimization time, the algorithms use different optimization strategies and therefore differ: the Yolo-6D, Uni6D, REDE, and CoS-PVNet algorithms do not refine their pose estimation results; PoseCNN is refined with the ICP algorithm, which gives it by far the longest optimization time; and RFF-PoseNet uses an iterative optimization method with an optimization time of 0.01 s. In terms of total time, RFF-PoseNet takes about 0.09 s longer than the fastest algorithm, Yolo-6D, to complete one pose estimation, giving it moderate time performance; however, its running efficiency is greatly improved compared to the benchmark network PoseCNN. Overall, the RFF-PoseNet algorithm achieves good accuracy in complex scenes and can meet the real-time requirements of practical 6D object pose estimation applications.

4.5. Ablation Experiment

To verify the effectiveness of each module in RFF-PoseNet, ablation experiments are conducted to analyze each module of RFF-PoseNet. As shown in Table 5, the ablation experiment gradually added each module of the RFF-PoseNet algorithm for comparison. Due to the characteristics of severe occlusion, lighting changes, and background clutter in the Occlusion LineMOD dataset, the dataset has a certain representativeness for evaluating model robustness and generalization ability. Therefore, this article conducted ablation experiments on the Occlusion LineMOD dataset to analyze the impact of each module component of the RFF-PoseNet algorithm and comprehensively test the accuracy and real-time processing speed of the RFF-PoseNet algorithm.
Table 5 presents the ablation results on the accuracy and speed of 6D object pose estimation for the different modules of RFF-PoseNet. According to the cm-degree metric, a predicted object pose is considered correct if its translation and rotation errors with respect to the actual pose are less than 5 cm and 5°, respectively. Using only the Ghost lightweight module with the pose regression and optimization module gives an accuracy of 45.6% at 135 ms per estimate. Adding the PPM pyramid pooling module, which combines associated features to infer the pose, improves the accuracy by 16.5% while increasing the processing time by 12 ms. Adding the PANet feature fusion module, which incorporates low-level features for inferring the pose, improves the accuracy by a further 14.8% while increasing the processing time by 4 ms. It can thus be seen that if the RFF-PoseNet network relies only on the Ghost lightweight module with pose regression and optimization, it is prone to underfitting or overfitting, resulting in lower pose estimation accuracy. The PPM pyramid pooling module and PANet feature fusion module exploit local and global contextual information directly, which rapidly improves the accuracy of 6D object pose estimation with only a slight decrease in speed, raising the accuracy to 76.9%.

5. Discussion

With the rapid development of artificial intelligence models and pattern recognition, 6D object pose estimation technology based on deep learning methods has made significant breakthroughs and has high application value in multiple fields. However, current 6D object pose estimation networks lack correlation between features at different levels, and the segmentation and recognition of target objects in complex scenes are easily affected by the interference of background and surrounding objects, making the object difficult to correctly recognize. Therefore, this paper proposes a robust feature fusion 6D object pose estimation network, RFF-PoseNet, for complex scenes. The RFF-PoseNet algorithm uses Ghost convolution blocks with smaller computational complexity than the convolution module in the PoseCNN feature extraction network, greatly reducing the model’s parameter and computational complexity. In addition, a pyramid pooling module and a feature fusion module are added to the semantic segmentation branch, enhancing the ability of the network to capture objects and the correlation between contexts. This enables the RFF-PoseNet algorithm to recognize some subtle objects in complex backgrounds, thereby improving the accuracy of pose estimation. Finally, the pose regression module is used for optimization, further improving the accuracy and robustness of 6D object pose estimation.
Six degrees-of-freedom target pose estimation has important research significance in the field of computer vision. For example, in the field of robot grasping, pose estimation can obtain the position of an object to be grasped relative to the robotic arm, thereby helping the robotic arm complete the grasping task automatically without manual control. In the field of augmented reality, pose estimation can enable virtual elements to maintain their relative pose with an object during its movement, overlay the virtual object on the image sequence frames, and present different scene effects. The RFF-PoseNet algorithm proposed in this study also achieves good 6D object pose estimation accuracy and real-time performance in complex scenes involving occlusion and rotation. However, 6D target pose estimation is closely related to object detection. Pose estimation not only needs to determine the 6-DoF pose of the target object but also needs to recover its size information in 3D space, whereas object detection focuses on identifying the type of target and its position in the image. It is therefore necessary to combine the two and fuse data of more dimensions so that more complex tasks can be completed automatically based on the detected target type, estimated size, and pose. Overall, this study has effectively improved the robustness of 6D object pose estimation methods based on direct regression, which can promote the application of robust 6D target pose estimation algorithms in complex scenarios such as robot grasping, augmented reality, and autonomous driving.

6. Conclusions

In response to the low accuracy, poor real-time performance, and weak robustness of 6D target pose estimation in complex scenes, RFF-PoseNet, based on the PoseCNN algorithm, is proposed as a 6D object pose estimation network with robust feature fusion for complex scenes. The RFF-PoseNet network first uses a Ghost lightweight module to reduce algorithm complexity, then adds a PPM pyramid pooling module and PANet feature fusion module to obtain information from different subregions of the feature map, and finally uses an iterative optimization method to gradually correct pose estimation errors. Experiments conducted on the YCB-Video and Occlusion LineMOD benchmark datasets show that the proposed RFF-PoseNet method can strengthen the correlation of fused features between different levels and the recognition of unclear targets, with excellent accuracy and real-time performance, further improving 6D target pose estimation in complex scenes. However, this article uses only 15% of each benchmark dataset for training; although this provides enough data to train the network, each scenario is relatively unique, which to some extent limits the algorithm's generalization ability. The next step of this work will focus on creating and expanding datasets according to the 6D object pose estimation requirements of different application scenarios. Synthetic image data from different scenarios can also be added to the training set to further improve the overall performance of the RFF-PoseNet algorithm.

Author Contributions

Conceptualization, X.L.; methodology, X.L.; software, X.L. and J.Y.; validation, X.L.; formal analysis, W.L. and J.W.; data curation, J.Y.; writing—original draft preparation, X.L.; writing—review and editing, X.L., W.L., J.W. and J.Y.; visualization, X.L. and J.Y.; supervision, W.L. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China, grant numbers 62367005 and 62067006; in part by the Research Projects of the Humanities and Social Sciences Foundation of the Ministry of Education of China, grant numbers 21YJC880085; in part by the Natural Science Foundation of Gansu Province, grant numbers 23JRRA845; and in part by the Youth Science and Technology Talent Innovation Project of Lanzhou, grant numbers 2023-QN-117.

Data Availability Statement

The data are contained within this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Danielsson, O.; Holm, M.; Syberfeldt, A. Augmented reality smart glasses in industrial assembly: Current status and future challenges. J. Ind. Inf. Integr. 2020, 20, 100175. [Google Scholar] [CrossRef]
  2. Yang, T.; Jia, S.; Yang, B.; Kan, C. Research on tracking and registration algorithm based on natural feature point. Intell. Autom. Soft Comput. 2021, 28, 683–692. [Google Scholar] [CrossRef]
  3. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  4. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  5. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 640–651. [Google Scholar]
  6. Mahdi, R.; Vincent, L. BB8: A Scalable, Accurate, Robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In Proceedings of the IEEE/CVF Conference on International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3848–3856. [Google Scholar]
  7. Bugra, T.; Pascal, F. Real-time seamless single shot 6D object pose prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 229–238. [Google Scholar]
  8. Joseph, R.; Ali, F. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  9. Markus, O.; Mahdi, R.; Vincent, L. Making deep heatmaps robust to partial occlusions for 3D object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 119–134. [Google Scholar]
  10. Peng, S.; Liu, Y.; Huang, X.; Zhou, X.; Bao, H. PVNet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4561–4570. [Google Scholar]
  11. Yu, X.; Zhuang, Z.Y.; Koniusz, P.; Li, H. 6dof object pose estimation via differentiable proxy voting loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7568–7579. [Google Scholar]
  12. Song, C.; Song, J.; Huang, Q. Hybridpose: 6D object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 431–440. [Google Scholar]
  13. He, Y.S.; Sun, W.; Huang, H.B.; Liu, J.R.; Fan, H.Q.; Sun, J. PVN3D: A deep point-wise keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11629–11638. [Google Scholar]
  14. Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: Convolutional neural network for 6D object pose estimation in cluttered scenes. In Proceedings of the 14th Conference on Robotics—Science and Systems, Carnegie Mellon University, Pittsburgh, PA, USA, 26–30 June 2018; pp. 3109–3119. [Google Scholar]
  15. Li, F.F.; Xu, D.F.; Wang, C.; Zhu, Y.K.; Martín-Martín, R.; Lu, C.W.; Savarese, S. DenseFusion: 6D object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3338–3347. [Google Scholar]
  16. Hu, R.M.; Li, Q.Q.; Xiao, E.; Wang, Z.; Chen, Y. Learning latent geometric consistency for 6D object pose estimation in heavily cluttered scenes. J. Vis. Commun. Image Represent. 2020, 70, 175–184. [Google Scholar]
  17. Thanh, D.; Cai, M.; Reid, I. Deep-6DPose: Recovering 6D object pose from a single RGB image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 26–28 April 2018; pp. 2168–2177. [Google Scholar]
  18. He, K.M.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  19. Wang, G.; Li, Z.G.; Ji, X.Y. CDPN: Coordinates-based disentangled pose network for real-time RGB based 6dof object pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7677–7686. [Google Scholar]
  20. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. Comput. Sci. 2014. [Google Scholar] [CrossRef]
  21. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  22. Calli, B.; Singh, A.; Walsman, A.; Srinivasa, S.; Abbeel, P.; Dollar, A.M. The YCB object and model set: Towards common benchmarks for manipulation research. In Proceedings of the International Conference on Advanced Robotics (ICAR), Istanbul, Turkey, 27–31 July 2015; pp. 510–517. [Google Scholar]
  23. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  24. Jiang, X.; Li, D.; Chen, H.; Zheng, Y.; Zhao, R.; Wu, L.W. Uni6D: A unified CNN framework without projection breakdown for 6D pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11174–11184. [Google Scholar]
  25. Hua, W.; Zhou, Z.; Wu, J.; Huang, H.; Wang, Y.; Xiong, R. REDE: End-to-end object 6D pose robust estimation using differentiable outliers elimination. IEEE Robot. Autom. Lett. 2021, 6, 2886–2893. [Google Scholar] [CrossRef]
  26. Yong, J.; Lei, X.M.; Dang, J.W.; Wang, Y. A Robust CoS-PVNet Pose Estimation Network in Complex Scenarios. Electronics 2024, 13, 2089. [Google Scholar] [CrossRef]
Figure 1. Principle of 6D object pose estimation.
Figure 2. Overall framework of RFF-PoseNet network.
Figure 3. Ghost convolutional network.
Figure 4. Multi-scale contextual information fusion pyramid pooling module.
Figure 5. PANet feature fusion module.
Figure 6. YCB-Video and Occlusion LineMOD dataset.
Figure 7. Visualization results of RFF-PoseNet on the YCB-Video dataset.
Figure 8. Visualization results of RFF-PoseNet on the Occlusion LineMOD dataset.
Table 1. Experimental environment configuration.

Hardware or Environment | Model or Version
CPU model | Intel(R) Xeon(R) CPU E5-2678
GPU | NVIDIA GeForce RTX 3080
Graphics card memory | 12 GB
System | Ubuntu 17.04
Development language | Python 3.8
Development framework | PyTorch 0.4.1
Table 2. Quantitative evaluation results on the YCB-Video dataset (AUC (%) and ADD-S < 2 cm (%)). Each cell reports AUC / <2 cm.

Object | PoseCNN | DenseFusion | Dual-Stream | Uni6D | REDE | CoS-PVNet | RFF-PoseNet
002_master_chef_can | 68.1 / 51.1 | 95.2 / 100.0 | 94.2 / 100.0 | 95.4 / 98.7 | 95.1 / 100.0 | 81.6 / 74.3 | 95.7 / 100.0
004_sugar_box | 97.5 / 99.5 | 95.1 / 100.0 | 97.3 / 100.0 | 96.4 / 99.1 | 97.4 / 100.0 | 84.9 / 99.3 | 98.0 / 100.0
006_mustard_bottle | 98.0 / 98.6 | 95.9 / 100.0 | 95.9 / 100.0 | 95.4 / 99.0 | 96.7 / 100.0 | 88.3 / 82.5 | 97.6 / 100.0
007_tuna_fish_can | 83.9 / 72.1 | 94.9 / 100.0 | 96.5 / 100.0 | 95.2 / 98.8 | 96.6 / 100.0 | 62.2 / 50.4 | 97.2 / 99.6
009_gelatin_box | 98.1 / 100.0 | 95.8 / 100.0 | 97.1 / 100.0 | 97.4 / 100.0 | 97.8 / 100.0 | 88.7 / 84.6 | 96.8 / 100.0
011_banana | 91.1 / 88.1 | 91.5 / 93.9 | 94.3 / 96.6 | 96.4 / 99.4 | 97.0 / 94.2 | 51.8 / 50.1 | 97.4 / 97.8
025_mug | 81.1 / 55.2 | 95.5 / 100.0 | 95.3 / 100.0 | 96.6 / 99.7 | 94.2 / 99.3 | 81.5 / 58.9 | 95.0 / 99.8
040_large_marker | 85.3 / 87.2 | 94.7 / 99.2 | 94.1 / 99.0 | 96.7 / 99.8 | 97.8 / 100.0 | 35.8 / 46.8 | 95.4 / 99.6
061_foam_brick * | 97.2 / 100.0 | 92.4 / 100.0 | 92.0 / 100.0 | 96.1 / 99.2 | 94.6 / 100.0 | 80.6 / 69.7 | 94.1 / 100.0
Average | 88.9 / 83.5 | 94.6 / 99.2 | 95.2 / 99.5 | 96.1 / 99.3 | 96.3 / 99.3 | 72.8 / 68.5 | 96.4 / 99.6
Table 3. Quantitative evaluation results on the Occlusion LineMOD dataset (ADD-S).

Object | PoseCNN | DenseFusion | Dual-Stream | Uni6D | REDE | CoS-PVNet | RFF-PoseNet
ape | 10.8 | 55.81 | 76.7 | 33.0 | 53.1 | 22.9 | 77.2
can | 47.2 | 63.34 | 86.9 | 51.0 | 88.5 | 74.6 | 88.5
cat | 4.43 | 52.88 | 54.2 | 4.6 | 35.9 | 25.4 | 54.2
duck | 18.8 | 65.24 | 89.9 | 58.4 | 77.8 | 35.7 | 91.5
driller | 42.3 | 65.65 | 76.8 | 34.8 | 46.2 | 74.2 | 78.9
eggbox | 21.5 | 70.17 | 72.1 | 1.7 | 71.8 | 52.9 | 73.5
glue | 39.5 | 69.62 | 73.9 | 30.2 | 75.0 | 56.5 | 77.0
holepuncher | 23.1 | 79.67 | 91.1 | 32.1 | 75.5 | 51.7 | 92.2
Average | 26.0 | 65.3 | 77.7 | 30.7 | 65.4 | 49.2 | 79.1
Table 4. Real-time comparison results of different 6D object pose estimation methods (ms).

Methods | Yolo-6D | PoseCNN | DenseFusion | Dual-Stream | Uni6D | REDE | CoS-PVNet | RFF-PoseNet
Segmentation time | 0 | 32 | 30 | 30 | 33 | 29 | 47 | 30
Pose estimation time | 60 | 190 | 49 | 55 | 42 | 35 | 71 | 110
Optimization time | 0 | 11,000 | 15 | 30 | 0 | 0 | 0 | 10
Total time | 60 | 11,222 | 94 | 115 | 75 | 64 | 118 | 150
Table 5. Ablation experiment results of RFF-PoseNet.

Number | Pose Estimation Methods | Accuracy (%) | Speed (ms)
1 | Ghost lightweight + pose regression and optimization | 45.6 | 135
2 | Ghost lightweight + PPM + pose regression and optimization | 62.1 | 147
3 | Ghost lightweight + PPM + PANet + pose regression and optimization | 76.9 | 151
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

