#### 3.1.1. CAD Image-Based Approaches

In contrast to real image-based approaches, CAD image-based approaches are more suitable for industrial products. The virtual images (used as templates) generated from CAD models are more accurate than real images, as the rendering process is not affected by illumination or blur. Moreover, CAD models are readily available in industrial applications. Therefore, many methods based on 3D CAD models have been proposed. A hierarchical model [53] has been proposed (Figure 6), combining a coarse-to-fine search with similarity scores [54] calculated between a template and a real image or between templates. In [55], a perspective cumulated orientation feature (PCOF) was proposed based on the orientation histograms extracted from randomly generated 2D projection images of CAD models. Muñoz et al. [56] proposed the use of edge correspondences to estimate poses, with a similarity measure encoded using a pre-computed linear regression matrix. The Fine pose Parts-based Model (FPM) [57] was introduced to localize objects in an image and to estimate their fine pose using the given CAD models. Pei et al. [58] proposed a robust method that used only one pair of vanishing points and one structural line to estimate the relative pose between image pairs. Peng et al. [59] proposed a method which used several cameras to detect geometrical features and then combined their results to obtain the final result.
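The similarity score at the heart of such template-matching pipelines can be illustrated with normalized cross-correlation (NCC). The sketch below is a minimal, exhaustive-search illustration of the idea; the function names are ours, not from [53,54], and real systems use pyramids or pre-computed response maps instead of a brute-force scan:

```python
import numpy as np

def ncc_score(patch: np.ndarray, template: np.ndarray) -> float:
    """Normalized cross-correlation between an image patch and a template."""
    p = patch.astype(np.float64) - patch.mean()
    t = template.astype(np.float64) - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    return float((p * t).sum() / denom) if denom > 0 else 0.0

def match_template(image: np.ndarray, template: np.ndarray):
    """Slide the template over the image; return (best score, (row, col))."""
    th, tw = template.shape
    best_score, best_loc = -1.0, (0, 0)
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            s = ncc_score(image[y:y + th, x:x + tw], template)
            if s > best_score:
                best_score, best_loc = s, (y, x)
    return best_score, best_loc
```

Because NCC subtracts the mean and normalizes by the variance, the score is insensitive to global brightness and contrast changes, which is one reason rendered CAD templates can be matched against real images at all.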

In [60], the epipolar geometry method and a direct estimation method were used to estimate the 3D parameters, which were then used to construct the transformation matrix. In [61], 6D pose estimation was used in augmented reality. A local moving edge tracker was used to provide real-time tracking of points normal to the object contours. In addition, an M-estimator was integrated with a robust control law to obtain good robustness. Both straight lines and curves were used in this method to complement the virtual visual servoing. In [62], 6D pose estimation was used for end-effector tracking in a scanning electron microscope to enable more precise automated manipulations and measurements. Visible line features were also used to update the pose results. Kemal et al. [6] proposed a CAD model-based tracking method for visually guided microassembly. They used multiple cameras to track objects and find feature points along the edges of objects. Then, the 3D-3D correspondence for each feature point was established, and the 6D pose was calculated.

Whether real images or CAD models are used as templates, the performance of the matching step determines the accuracy of pose estimation. Improving the efficiency and accuracy of template matching has therefore become an important problem for ensuring reliable pose estimation results.

Additionally, the number of templates influences the accuracy of the pose estimation. The more templates there are, the more accurate the pose estimation will be. However, a large number of templates requires substantial storage and search time. In [63], a part-based efficient template matching method was proposed that was able to accelerate the matching step and improve the accuracy of pose estimation. Each template leveraged a different forest and an independently encoded similarity function.
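The coarse-to-fine search mentioned above trades a small accuracy risk for a large reduction in the number of templates evaluated. A toy one-dimensional illustration of the idea, where the score function stands in for a template-similarity measure (nothing here is from [53] or [63]):

```python
def coarse_to_fine_search(score, coarse_step=30, fine_step=5):
    """Toy 1-D coarse-to-fine viewpoint search: pick the best angle on a
    sparse grid, then refine on a dense grid around the coarse winner."""
    coarse_best = max(range(0, 360, coarse_step), key=score)
    candidates = [coarse_best + k * fine_step
                  for k in range(-coarse_step // fine_step,
                                 coarse_step // fine_step + 1)]
    return max(candidates, key=score)
```

Instead of the 72 evaluations a flat 5-degree grid would need, this performs 12 coarse plus 13 fine evaluations; real systems apply the same idea hierarchically over the full viewpoint sphere.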

#### 3.1.2. Real Image-Based Approaches

Although it is possible to achieve precise results when using 3D CAD models, accurate 3D CAD models sometimes cannot be obtained. In such cases, real images are used as templates. The histogram of oriented gradients (HoG) [64] is a popular feature that is computed on a dense grid with uniform intervals for better performance. Guo et al. [65] used multiple cooperative logos to measure the 6D pose. Hinterstoisser et al. [66] proposed a method including a novel image representation for template matching designed to be robust to small image transformations. It used the gradient orientations of object edges to create templates. The method could be extended if 3D information was available. However, because obtaining adequate real image templates is time-consuming and challenging, while generating images from CAD models is becoming easier, there is not much research on real image-based approaches.
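The basic building block of a HoG descriptor is an orientation histogram computed over one cell, with each pixel voting for its gradient orientation weighted by gradient magnitude. A simplified single-cell sketch (real HoG additionally uses bilinear bin interpolation and block normalization):

```python
import numpy as np

def hog_cell(patch: np.ndarray, n_bins: int = 9) -> np.ndarray:
    """Gradient-orientation histogram for a single cell: the basic building
    block of a HoG descriptor (unsigned orientations over 0-180 degrees)."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)                          # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0    # unsigned orientation
    hist = np.zeros(n_bins)
    idx = (ang / (180.0 / n_bins)).astype(int) % n_bins
    np.add.at(hist, idx.ravel(), mag.ravel())       # magnitude-weighted votes
    return hist / (np.linalg.norm(hist) + 1e-9)     # L2-normalized
```

Because the histogram discards exact pixel positions within the cell, the descriptor tolerates the small image translations that make raw pixel-wise matching brittle.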

#### 3.1.3. Conclusions

In this part, we introduced 2D-information-based approaches and classified the approaches into two categories: CAD image-based approaches and real image-based approaches. The difference between them is which type of template is used. CAD image-based approaches require a CAD model of the object, but templates can be generated conveniently. Real image-based approaches use real images as templates, but if there is clutter in the real images, the information may be extracted incorrectly.

Compared with 3D-information-based approaches, 2D-information-based approaches are less robust, because 2D information carries less data than 3D information. Additionally, complex scenes have a bad influence on the performance of these approaches. The biggest weakness of 2D-information-based approaches is that they are not able to adapt to some special scenes, such as scenes with strong changes in illumination, large numbers of repeated structures, textureless scenes, and so on.

### *3.2. 3D-Information-Based Approaches*

Although there are many kinds of 2D information, it is difficult to use 2D information to measure the 6D pose under some special conditions, or the method requires complex algorithms to obtain precise results. With the development of hardware, more and more devices that can record 3D scene information are appearing, such as depth cameras [67] and 3D scanners [68]. Compared with 2D information, 3D information preserves the original appearance of the object, which is more useful for measuring 6D pose [44,69]. Combined with 3D information, the method is more robust and can obtain more accurate results. Mainstream methods can be broadly divided into the following two categories.

#### 3.2.1. Matching-Based Approaches

The aim of matching-based approaches is to search for the most similar template in the dataset and return the 6D pose of the template. Because the raw 3D data is always very large, processing it directly can be computationally expensive. Therefore, many preprocessing methods have been proposed to reduce the complexity of the task. Zhang et al. [70] proposed two methods for solving this problem. One was to use a 2D/2.5D object detector for scene point clouds: YOLO was used to segment the scene point cloud with 2D bounding boxes due to its lower time consumption. The other preprocessing method was to extract keypoints in the template point clouds. Points carrying more information, such as points on edges, were preserved, and points on surfaces were removed in order to compress the point cloud. Konishi et al. [71] combined PCOF-MOD (multimodal PCOF), a balanced pose tree (BPT), and optimum memory rearrangement into 6D pose estimation to optimize the data storage structure and lookup speed. To improve the accuracy of matching, Park et al. [72] proposed a novel multi-task template matching (MTTM) framework that finds the nearest template of a target object from an image while predicting segmentation masks and a pose transformation between the template and a detected object in the scene using the same feature map of the object region. In [73], research was carried out on tracking and control for micro-electro-mechanical system (MEMS) microassembly. The correlation between the real-time 3D vision tracking method and the control law based on 3D vision was demonstrated, and a pose-based visual servoing approach was used to enable precise regulation of the 3D error toward zero.
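The point-cloud compression idea described above can be approximated by simple voxel-grid downsampling, which keeps one representative point per occupied voxel. A hedged numpy sketch (the edge-aware keypoint selection of [70] is omitted; this only shows the compression principle):

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel: float = 0.05) -> np.ndarray:
    """Compress a point cloud by keeping one centroid per occupied voxel."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    n_voxels = inverse.max() + 1
    counts = np.bincount(inverse).astype(np.float64)
    out = np.empty((n_voxels, 3))
    for d in range(3):                    # per-voxel centroid, axis by axis
        out[:, d] = np.bincount(inverse, weights=points[:, d]) / counts
    return out
```

The voxel size directly trades accuracy against speed: larger voxels shrink the cloud (and the matching cost) but blur fine surface detail.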

#### 3.2.2. Local Descriptor Approaches

Approaches based on local descriptors define and calculate a global descriptor on the model offline. The global descriptor should be invariant with respect to rotation and translation. Then, the local descriptor is calculated and matched with the global descriptor online. The iterative closest point (ICP) [74,75] algorithm is a classical one that is able to calculate the pose relation between two coordinate frames from two sets of point clouds. The 6D pose can be measured from the correspondence between them or as the result of voting. Guo et al. [76] proposed a global method named Super Key 4-Points Congruent Sets (SK-4PCS), combined with invariant local features of 3D shapes, thus reducing the amount of processed data. Akizuki et al. [77] proposed a method aimed at everyday tools that does not rely on the 3D model of the measured object. It assumed that the same kinds of tools possess the same part-affordance. Therefore, using the 3D models of similar kinds of object, the method was able to measure the pose of objects of that kind on the basis of the spatial relationships of part-affordances. Yu et al. [78] used the improved Oriented FAST and Rotated BRIEF (ORB) [79] feature and the rBRIEF descriptor (a descriptor developed from binary robust independent elementary features (BRIEF) [80]) to obtain a coarse match, and then culled mismatches to retain more correct matches. They also proposed a hybrid reprojection errors optimization model (HREOM) to improve the accuracy of the result by minimizing 3D-3D and 3D-2D reprojection errors. To measure the 6D pose of large-scale objects that are only partially visible, David et al. [81] proposed a method based on semi-global descriptors. They used semi-global descriptors for scene segments and model views in combination with up-sampling and segment label merging techniques, and obtained more reliable results than with other descriptors.
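ICP alternates two steps until convergence: match each source point to its nearest target point, then solve the resulting correspondence set in closed form with an SVD. A compact point-to-point sketch (real implementations use k-d trees for the nearest-neighbour step and a convergence test instead of a fixed iteration count):

```python
import numpy as np

def icp(src: np.ndarray, dst: np.ndarray, iters: int = 20):
    """Point-to-point ICP: alternate nearest-neighbour matching with a
    closed-form SVD (Kabsch) alignment of the matched pairs."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # 1. match every current source point to its nearest target point
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
        matched = dst[d2.argmin(axis=1)]
        # 2. closed-form rigid alignment of the matched pairs
        cur_c, mat_c = cur.mean(axis=0), matched.mean(axis=0)
        U, _, Vt = np.linalg.svd((cur - cur_c).T @ (matched - mat_c))
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = mat_c - R @ cur_c
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

Because the nearest-neighbour matching is only locally valid, ICP needs a reasonable initial alignment, which is why it is typically used to refine a pose produced by descriptor matching rather than to estimate one from scratch.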

#### 3.2.3. Conclusions

In this part, we introduced 3D-information-based approaches and classified them into two categories: matching-based approaches and local descriptor approaches. Matching-based approaches are more computationally expensive online; however, local descriptor approaches require some preprocessing offline.

The biggest weakness of approaches based on 3D information is that they do not perform well when the object is reflective, which follows from how these approaches work: when measuring the 6D pose of reflective objects, they are not able to obtain the depth information (or accurate point clouds) of the object. Another weakness is that the efficiency of these approaches is relatively low, because point clouds and depth images contain a lot of data, causing a computational burden.

#### **4. Comparison**

In this section, we compare these approaches in detail according to their performance, as summarized in Table 1. Specifically, the accuracy, storage cost, robustness, time cost, real-time performance, and range of application will be discussed. These indicators are compared qualitatively in Table 1 according to the information in the cited papers, where A represents the best performance and C represents the worst.


**Table 1.** Comparison of three kinds of approaches.

Accuracy: The accuracy of holistic approaches is worse than that of 2D-information-based approaches and 3D-information-based approaches. This is because holistic approaches transform the problem of 6D pose estimation into a problem of classification. The accuracy of classification decides the accuracy of 6D pose estimation. However, RGB-D-based approaches use more comprehensive information. The depth information is able to provide the overall morphology of a rigid body, while the color information is able to describe the position of keypoints. Therefore, RGB-D-based approaches that combine depth information and color information are able to improve the accuracy. 2D-information-based approaches and 3D-information-based approaches convert the 6D pose estimation problem into a template matching or coordinate transformation problem. Both of them are able to measure the 6D pose more specifically. 3D-information-based approaches use more information than 2D-information-based approaches, so they perform better on accuracy.

Robustness: Robustness in this paper mainly refers to the anti-interference performance of approaches to noise and environmental changes. Plenty of data is used to train a network in learning-based approaches. The model adequately considers the information and situation in the input scene, so learning-based approaches have better robustness than the other two approaches. Additionally, using only color information may lead to missing keypoints, while depth information provides some global pose information and is complementary to color information. After a long period of training, learning models are able to distinguish object features and environmental noise. Although the other two approaches are also able to distinguish information from the environment efficiently, some wrong information may be taken into consideration under some complex situations where background clutter and foreground occlusion are present.

Storage cost: Learning-based approaches need the most storage because the training process of the learning models needs plenty of data. The other two approaches need to store templates. However, the number of templates is lower than the training data required for learning-based approaches. In particular, for coarse-to-fine methods, which just need to match a basically similar template with the input image, much fewer templates are needed. In terms of a comparison between 2D-information-based approaches and 3D-information-based approaches, the former needs less storage, because 2D information is smaller than 3D information.

Time cost and real-time performance: Because the time an approach takes determines its real-time performance, these two indicators are discussed together in this part. Learning-based approaches can be divided into two steps: offline training and online measuring. The offline step is time-consuming because it needs to train a model using plenty of data, which means that repeated calculations are necessary, although a GPU can accelerate the training process. However, in the online step, the 6D pose can be measured directly by the trained model, which costs very little time. 2D-information-based approaches and 3D-information-based approaches can also be divided into an offline step and an online step. In the offline step, many templates are generated from different angles (however, there are also some methods that do not need a lot of templates). Meanwhile, in the online step, the proper template needs to be retrieved from the template dataset, which is somewhat time-consuming. Therefore, in the online step, the real-time performance of learning-based approaches is the best. 2D-information-based approaches cost less time than 3D-information-based approaches because the retrieval of 3D information is more time-consuming.

Range of application: Due to their principle, 3D-information-based approaches are not able to handle the problem of 6D pose estimation of reflective objects. However, such objects are common in industry, such as metal parts. Learning-based approaches need plenty of time to train a network, which may not satisfy the requirements of real-time performance.

In this section, we compared the approaches with respect to six different aspects. In general, learning-based approaches have the best robustness, but these approaches need lots of storage to save training data and plenty of time to train the model. 3D-information-based approaches achieve the most accurate results, but they cannot be used on reflective objects. The range of application of 2D-based approaches is the widest; however, 2D information is easily affected by the environment. Therefore, the three approaches all have their advantages. In different situations, different approaches are needed.

#### **5. Challenges**

#### *5.1. Textureless and Reflective Objects*

As a special case in pose estimation, the pose estimation of textureless objects is challenging. Due to the lack of reliable texture information on the surface of such objects, it is difficult to extract feature points on them. Therefore, many methods based on the surface texture information of objects cannot effectively measure the 6D pose of such objects. However, textureless objects are common in industry. Therefore, the pose estimation of textureless objects is very important.

Zhang et al. [82] transformed sliding windows to scale-invariant RGB-D patches and applied a hash voting-based hypothesis generation scheme to compute a rough 6D pose hypothesis and then employed particle swarm optimization to improve the result, which achieved high precision and good performance on three datasets. Pan et al. [83] coarsely segmented the object in the point cloud and precisely measured the pose results in the gray image using a view-based matching method.

The methods mentioned above used a depth camera or 3D scanning device to obtain the depth image or the point cloud of objects. However, for objects with reflective surfaces, such as metal parts, depth information is hard to acquire. Therefore, some other kinds of information are used in pose estimation. Geometrical information is a reliable kind of information for textureless objects. Based on the above perspective, He et al. [84] proposed a new method (Figure 7) making full use of the geometrical information of objects. It used geometric features, such as straight lines, to generate descriptors of the objects, and proposed the GIIBOLD algorithm for matching the input image and the template image. Furthermore, accurate 2D-3D point pairs were acquired on the basis of the matched geometric features. Finally, the 6D pose was measured using the PnP-RANSAC algorithm. This method leveraged simple geometrical information, achieving fast matching and accurate measurement without requiring a great deal of storage or time. Zhang et al. [85] first detected the object in the RGB image using a 2D bounding box and then measured the pose result in the edge image. Pan et al. [86] used multiple appearance features, including color, size, and aspect ratio, to distinguish objects from environmental clutter and measured the 6D pose. Zhang et al. [51] proposed a novel method for measuring textureless objects on the basis of RGB images. It followed a coarse-to-fine procedure, using only the shape and contour information of the input image. Several template images with poses similar to the input image were selected to match with the input. Then the contour and shape information, specifically the ORB features, were used to establish 2D-3D correspondences and finally to calculate the 6D pose. On the basis of the studies above, even without reliable texture information, there are many kinds of information that can be leveraged, such as contour, color, shape, and so on. Accurate results can be obtained through the proper use of this information.

**Figure 7.** The overall workflow chart.
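PnP-RANSAC combines a minimal-set pose solver with the generic RANSAC loop, which repeatedly fits a model to a random minimal sample and keeps the hypothesis with the most inliers. The loop itself is independent of the model being fitted; the sketch below illustrates it on 2-D line fitting rather than pose estimation, and all names here are ours, not from [84]:

```python
import numpy as np

def ransac(data, fit, residual, n_min, thresh, iters=200, seed=0):
    """Generic RANSAC: fit models on random minimal samples, keep the one
    with the most inliers, then refit on all inliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(data), dtype=bool)
    for _ in range(iters):
        sample = data[rng.choice(len(data), n_min, replace=False)]
        model = fit(sample)
        inliers = residual(model, data) < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit(data[best_inliers]), best_inliers

# helpers for the toy line-fitting instance: model = (slope, intercept)
def fit_line(pts):
    return np.polyfit(pts[:, 0], pts[:, 1], 1)

def line_residual(model, pts):
    return np.abs(np.polyval(model, pts[:, 0]) - pts[:, 1])
```

In PnP-RANSAC the minimal sample is a handful of 2D-3D point pairs, the model is the camera pose returned by a minimal PnP solver, and the residual is the reprojection error; the surrounding loop is exactly the one shown.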

#### *5.2. Foreground Occlusion*

In complex industrial scenes, the condition of object occlusion appears frequently. Because the target is obscured by other objects, the recognition of the target's features is disturbed. In addition, due to part of the object information being missing, it is difficult to calculate accurate pose results. As a common condition, object occlusion has been studied by many researchers.

Crivellaro et al. [19] proposed a method for representing the pose of each part on the basis of the 2D reprojections of a small set of 3D control points. It was able to predict the 6D pose of the object by predicting the pose of the visible part through the reprojections of 2D control points. Even if the object was only partially visible, the method was able to calculate the 6D pose of the object. If the object had several visible parts, the method was able to combine all the information and obtain more accurate results. Distinct from the method above, Dong et al. [87] used 3D information as input and chose an end-to-end strategy. They proposed a novel network named Point-wise Pose Regression Network (PPR-Net). For each of the points in the point cloud, the network regressed a 6D pose of the object instance that the point belonged to. In the pose space, a clustering method was adopted in order to segment multiple instances from the clutter point cloud, and an instance's pose can be computed by averaging each subsidiary point's pose prediction. Essentially, the method used the information of visible parts to predict the pose of the object. The more parts of an object that could be seen, the more reliable the results obtained. Chen et al. [88] proposed a network which took the point cloud as input and regressed the point-wise unit vectors pointing to the 3D keypoints. Then the vectors were used to generate keypoint hypotheses from which 6D object pose hypotheses were computed. Tekin et al. [28] predicted the 2D image locations of the projected vertices of the object's 3D bounding box and used a PnP algorithm to estimate the object's 6D pose. Taking RGB-D images as input, Zhang et al. [89] combined holistic and local patches to measure the 6D object pose and obtained high precision and good performance under conditions of foreground occlusion and background clutter.
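The cluster-then-average aggregation used by PPR-Net-style methods can be sketched in a simplified form: per-point pose votes are grouped in pose space, and each cluster's pose is the mean of its members. The sketch below only clusters translation votes with a greedy rule; the real network also handles rotations and uses a proper clustering algorithm, so everything here is an illustrative simplification, not code from [87]:

```python
import numpy as np

def cluster_translation_votes(votes, radius=0.1):
    """Greedily group per-point translation votes; each cluster's estimate
    is the mean of its member votes (simplified PPR-Net-style aggregation)."""
    centers = []   # running cluster centroids
    members = []   # votes assigned to each cluster
    for v in votes:
        for i, c in enumerate(centers):
            if np.linalg.norm(v - c) < radius:
                members[i].append(v)
                centers[i] = np.mean(members[i], axis=0)  # update centroid
                break
        else:
            centers.append(v.copy())   # start a new cluster
            members.append([v])
    return [np.mean(m, axis=0) for m in members]
```

Averaging many per-point votes is what gives these methods their robustness to occlusion: even if most of an instance is hidden, the votes from its visible points still form a cluster at the correct pose.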

In conclusion, the principle of the method for dealing with object occlusion is to use the information of any visible parts to predict the 6D pose of the whole object. Generally, the number of methods to achieve this based on 3D information is greater than the number of those based on 2D information. In addition, among the methods based on 2D information, it is more appropriate to use outer information such as bounding boxes and contours. Because inner information, such as the texture, is occluded by other objects, the extraction of the information is affected to a great extent.

#### *5.3. Background Clutter*

Background clutter is also a challenge in 6D pose estimation. Because the target is surrounded by much useless information, it is difficult to measure the 6D pose directly. However, due to the complexity of practical scenarios, there are many conditions under which it is necessary to measure the 6D pose of objects in clutter.

He et al. [41] proposed Mask R-CNN, which efficiently detected objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It extended Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mitash et al. proposed a method for measuring the 6D pose of objects in clutter. A global optimization process was employed to improve candidate poses by taking into account scene-level physical interactions between objects. Then, combinations of candidate object poses were searched using a Monte Carlo Tree Search (MCTS) process that used the similarity between the observed depth image of the scene and a rendering of the scene under the hypothesized poses as a score to guide the search procedure. Li et al. [90] proposed a two-step method for measuring the 6D pose, whose first step was few-shot instance segmentation to segment the known objects from RGB images. Chen et al. [91] proposed a method based on point clouds that detected objects in an end-to-end manner. They introduced a point cloud-based 6D target object detection method that used segmented object point cloud patches to predict object 6D poses and identities. It used a point cloud segmentation procedure that was easier to visualize and tune in order to overcome the problems caused by background clutter.

#### *5.4. Deformable Objects*

The 6D pose estimation of deformable objects is a huge challenge in the field, because the posture of the objects is unpredictable and there are many ways for the objects to deform. Thus, lots of conditions need to be taken into consideration, resulting in pressure on the algorithm and calculation.

Li et al. [92] proposed a novel method that was able to classify and estimate the categories and poses of deformable objects. They simulated the deformable objects and obtained depth images from different viewpoints in different postures as a dataset. By extracting features and using deep learning, they set up a codebook for the object, which could be used in the training process. This method uses learning-based approaches and sets up a two-layer framework to solve the problem of 6D pose estimation. A lot of storage space is needed, but the method is robust in such complex situations. In [93], a predictive and model-driven approach was proposed to solve this problem. A dataset was built by using picked-up garments in multiple poses under gravity. The dataset was organized in an efficient way, increasing the speed of the searching process. The proposed method constructed a fully featured 3D model of the garment in real time and used volumetric features to obtain the most similar model in the database in order to predict the object category and pose. Accurate model simulation could also be used to optimize the trajectories of the manipulation of deformable objects. Caporali et al. [94] proposed a four-step method for grasping clothes using a point cloud. First, instance segmentation was performed, and then a wrinkledness measure was implemented to robustly detect the graspable regions of the cloth. Next, each individual wrinkle was identified by fitting a piecewise curve, and finally a pose for each detected wrinkle was estimated.
