*1.4. Structure Layout*

As shown in Figure 2, the rest of this paper is divided into four parts. First, we review recent research contributions for each of the three approaches and describe in detail their advantages and the challenges they face: learning-based approaches are introduced in the second section, and 2D-information-based and 3D-information-based approaches in the third section. Next, we focus on the challenges of 6D pose estimation. Then, we compare the three approaches qualitatively. Finally, we offer some views on the future development of 6D pose estimation.

**Figure 2.** Structure layout of the review.

#### **2. Learning-Based Approaches**

In recent years, machine learning-based algorithms have become a hot topic owing to the emergence of new concepts such as deep learning and neural networks. Many scholars have applied machine learning-based methods to 6D pose estimation and achieved good results. PoseNet [18] is a monocular 6D relocalization system that trains a convolutional neural network (CNN) to regress 6D poses. The network casts 6D pose estimation as an end-to-end regression problem in which the input is a single RGB image and the output is the camera's 6D pose. To address the lack of training data, the authors proposed a method that automatically generates large regression datasets of camera poses based on structure from motion. Crivellaro et al. [19] also trained a CNN (Figure 3) to predict the 6D poses of partially visible objects. The key idea of the method is to predict the 2D projections of 3D feature points on each part of the object. When the object is only partially visible in the test image, the method estimates the 6D pose from the feature points of the visible parts. When the object is fully visible, the method achieves more accurate results by combining the feature points of all parts. Particle Swarm Optimization (PSO)-based methods [20,21] have also been reported to outperform Iterative Closest Point (ICP) algorithms. Hoang et al. [22] combined a CNN with Simultaneous Localization and Mapping (SLAM), improving on the method in [23] by adding a 6D object pose detector and estimating the 6D pose from multiple viewpoints to obtain a more robust object detection system. However, 6D pose estimation methods based on deep learning face a particular difficulty with symmetrical objects.
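To make the idea of pose regression concrete, the following is a minimal sketch of a loss that a pose-regression network could be trained with. It assumes that the 6D pose is parameterised as a 3D translation plus a unit quaternion, which is one common choice and is not specified in the text above; the function names and the weighting factor `beta` are illustrative only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(q):
    """Scale a quaternion to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]

def pose_regression_loss(t_pred, q_pred, t_true, q_true, beta=500.0):
    """Translation error plus beta-weighted orientation error.

    t_* are 3D translations, q_* are quaternions (w, x, y, z).
    beta is a hypothetical hyperparameter balancing the two terms;
    its value here is an assumption, not a recommended setting.
    """
    q_true = normalize(q_true)          # ground truth as a unit quaternion
    pos_err = euclidean(t_pred, t_true)
    rot_err = euclidean(q_pred, q_true)  # prediction is used as regressed
    return pos_err + beta * rot_err

# Example: a prediction that is 5 units away in translation but has the
# correct orientation contributes only the translation term.
loss = pose_regression_loss([3.0, 4.0, 0.0], [1.0, 0.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0])
print(loss)
```

The two terms live in different units, so in practice the balance between them has to be tuned for each scene or dataset.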

**Figure 3.** Architecture of the CNN predicting the projections of the control points.

The 6D poses of symmetrical objects (Figure 4) are difficult to estimate correctly with standard deep learning methods [24], because the appearance of such an object from a fixed viewpoint does not change when it is rotated by its symmetry angle (for example, 180 degrees), even though the ground-truth poses differ. For instance, if a network is trained to predict the pose using the squared loss between the ground-truth poses and the predicted poses, it will converge to a model that predicts the average of the possible poses for an input image, which is meaningless. Pitteri et al. [25] proposed an efficient method, combined with Faster R-CNN, that relies on normalizing the pose rotation. Manhardt et al. [26] proposed a method that detects rotation ambiguities and characterizes the uncertainty in the problem without further annotation or supervision. Zhang et al. [27] used rigid-transformation-invariant point-wise features of the point cloud as input and applied a hierarchical neural network that combines global point cloud information with local patches to predict the keypoint coordinates.
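The averaging effect described above can be illustrated with a small numerical sketch. It assumes that rotations are represented as 3×3 matrices and compared with a squared (Frobenius) loss, and that one image is equally consistent with a rotation of 0 degrees and of 180 degrees about the z-axis; these choices and all names are illustrative, not taken from the cited works.

```python
import math

def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z-axis."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def mean_matrix(mats):
    """Element-wise mean, which minimizes the summed squared loss."""
    n = len(mats)
    return [[sum(m[i][j] for m in mats) / n for j in range(3)]
            for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix (a valid rotation has determinant 1)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def squared_loss(pred, targets):
    """Sum of squared element-wise differences to every target."""
    return sum((pred[i][j] - t[i][j]) ** 2
               for t in targets for i in range(3) for j in range(3))

# Two ground-truth rotations that look identical for a symmetric object.
targets = [rot_z(0), rot_z(180)]
avg = mean_matrix(targets)

print("loss of the average:", squared_loss(avg, targets))
print("loss of a valid pose:", squared_loss(targets[0], targets))
print("determinant of the average:", det3(avg))
```

The average attains a lower squared loss than either valid pose, yet its determinant is (numerically) zero, so it is not a rotation at all. This is the sense in which the converged prediction is meaningless.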

### *2.1. Keypoint-Based Approaches*

Keypoint-based approaches establish 2D-3D correspondences between the input image and the 3D object model and then estimate the pose from these correspondences [28–30]. The procedure can be divided into two steps: (1) extract the 2D feature points in the input image; and (2) compute the 6D pose from the resulting correspondences using a Perspective-n-Point (PnP) algorithm. BB8 [31] used a CNN to predict the 2D projections of the eight vertices of the object's 3D bounding box (Figure 5). To handle textureless symmetrical objects, BB8 restricted the range of the rotation angle in the training data and used a classifier to predict the rotation angle during the estimation step. However, when the object is partially occluded, BB8 may not obtain the correct 3D bounding box, which adversely affects the PnP step. To solve this problem, Hu et al. [32] proposed a method that segments the image into several patches and has each patch predict both the object to which it belongs and the locations of the 2D projections. All patches belonging to the same object are then combined to estimate the 6D pose using PnP. Because each patch contributes its own local pose estimate, this method performs well under occlusion. PVNet [33] predicted, for each pixel, the direction to each keypoint; the spatial probability distribution of the 2D keypoints can then be obtained through a RANSAC-like voting scheme. Based on this distribution, an uncertainty-driven PnP can be used to estimate the 6D pose. Predicting the directions from pixels to keypoints makes the local features more prominent: even if a keypoint is not visible, it can be located from other visible parts of the object. Jeon et al. [34] proposed a method that learns orientation-induced primitives, rather than employing 3D bounding boxes, and calculates the rotation and translation vectors in separate modules.

The methods mentioned above are all two-step methods; however, Hu et al. [35] pointed out the weaknesses of this kind of method. First, it is not an end-to-end system. Second, the loss function used to train the neural network does not directly reflect the accuracy of the final 6D pose estimate. Therefore, Hu et al. proposed a single-stage 6D pose estimation method that directly regresses the 6D pose from groups of 3D-to-2D correspondences associated with each 3D object keypoint.
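The pixel-wise voting idea behind PVNet described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: the per-pixel directions are synthesized rather than predicted by a network, one fifth of them are deliberately corrupted, and the keypoint location, thresholds and function names are all hypothetical.

```python
import math
import random

def unit_direction(pixel, keypoint):
    """Unit vector pointing from a pixel towards a keypoint."""
    dx, dy = keypoint[0] - pixel[0], keypoint[1] - pixel[1]
    norm = math.hypot(dx, dy)
    return (dx / norm, dy / norm)

def intersect(p1, d1, p2, d2):
    """Intersection of the lines p1 + s*d1 and p2 + t*d2 (None if parallel)."""
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(cross) < 1e-9:
        return None
    s = ((p2[0] - p1[0]) * d2[1] - (p2[1] - p1[1]) * d2[0]) / cross
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])

def vote_keypoint(pixels, directions, n_hypotheses=50, cos_thresh=0.9999, seed=0):
    """RANSAC-like voting: sample pixel pairs, intersect their direction
    lines to get a hypothesis, and count the pixels whose predicted
    direction agrees with that hypothesis."""
    rng = random.Random(seed)
    best, best_votes = None, -1
    for _ in range(n_hypotheses):
        i, j = rng.sample(range(len(pixels)), 2)
        h = intersect(pixels[i], directions[i], pixels[j], directions[j])
        if h is None:
            continue
        votes = 0
        for p, d in zip(pixels, directions):
            vx, vy = h[0] - p[0], h[1] - p[1]
            n = math.hypot(vx, vy)
            # A pixel votes if its direction points at the hypothesis.
            if n > 0 and (vx * d[0] + vy * d[1]) / n >= cos_thresh:
                votes += 1
        if votes > best_votes:
            best, best_votes = h, votes
    return best, best_votes

# Hypothetical setup: the keypoint lies outside the visible 10x10 patch,
# as it would for an occluded or truncated object part.
true_keypoint = (14.0, 4.5)
pixels = [(float(x), float(y)) for x in range(10) for y in range(10)]
directions = []
for idx, p in enumerate(pixels):
    dx, dy = unit_direction(p, true_keypoint)
    if idx % 5 == 0:            # corrupt every fifth prediction by 90 degrees
        dx, dy = -dy, dx
    directions.append((dx, dy))

best, votes = vote_keypoint(pixels, directions)
print("estimated keypoint:", best, "votes:", votes)
```

In this synthetic setting the hypothesis supported by the uncorrupted pixels should win the vote even though the keypoint itself lies outside the visible patch. In a full pipeline, the voted 2D keypoints and their 3D counterparts would then be passed to a PnP solver.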

**Figure 4.** Symmetrical objects.

**Figure 5.** The red bounding boxes for the pose estimation results using BB8.
