#### *2.3. RGB-D-Based Approaches*

Compared with the learning-based approaches mentioned above, which only use RGB information, RGB-D-based learning approaches combine color information and depth information to measure the pose of objects, and are able to overcome the problem of insufficient information in approaches that use color information alone. Additionally, with the emergence of many RGB-D datasets, an increasing number of studies on RGB-D-based learning approaches are being performed. In [43], Wang et al. proposed a novel two-stage method named DenseFusion for measuring the 6D pose. In the first stage, DenseFusion uses a heterogeneous network to process the RGB data and the point cloud data separately, preserving their original structures. In the second stage, a fully convolutional network maps each pixel of the RGB crop into a color feature space, while a PointNet-based network maps each point of the point cloud into a geometric feature space. The features from the two spaces are then merged per point, and a 6D pose estimate is produced from the fused features; finally, the result is refined iteratively. In [44], Chen et al. proposed a 6D pose estimation framework named G2L-Net. Firstly, it extracts a coarse point cloud from the RGB-D images. Then, the point cloud is fed into the network to perform 3D segmentation and object translation prediction. Finally, the fine point cloud is transformed into a local canonical coordinate system to estimate the initial object rotation. In [45], PVN3D was proposed, which extends 2D-keypoint-based methods to 3D keypoints, making full use of the geometric constraint information of rigid objects and significantly improving the accuracy of 6D estimation. In [46], a method named CosyPose was proposed, which uses multiple cameras to estimate the 6D pose. Firstly, it estimates the 6D pose of the objects in each image, and then it matches the individual 6D object pose hypotheses across the different input images in order to jointly estimate the camera viewpoints and the 6D poses of all objects in a single consistent scene.
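
The dense per-pixel fusion at the heart of DenseFusion can be sketched in a few lines. The feature dimensions, the random embeddings standing in for network outputs, and the average-pool global feature below are illustrative assumptions, not the actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-pixel embeddings: in DenseFusion, a CNN produces a color feature
# for every pixel of the RGB crop and a PointNet-style network produces a
# geometric feature for every corresponding point of the point cloud.
# All shapes and dimensions here are illustrative assumptions.
n_points = 500            # pixels/points sampled from the object region
color_dim, geo_dim = 32, 64

color_feat = rng.standard_normal((n_points, color_dim))  # RGB branch
geo_feat = rng.standard_normal((n_points, geo_dim))      # point cloud branch

# Dense fusion: concatenate the two embeddings per point, then append a
# global feature (here, a simple average pool) to every per-point feature.
per_point = np.concatenate([color_feat, geo_feat], axis=1)   # (N, 96)
global_feat = per_point.mean(axis=0, keepdims=True)          # (1, 96)
fused = np.concatenate(
    [per_point, np.repeat(global_feat, n_points, axis=0)], axis=1)

print(fused.shape)  # (500, 192): one fused descriptor per point, from
                    # which a pose (R, t) and confidence are regressed
```

Keeping one fused descriptor per point (rather than a single global vector) is what lets the network predict a pose and confidence for every sampled point and pick the best hypothesis.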

#### *2.4. Conclusions*

In this section, we introduced learning-based approaches and classified them into three categories: keypoints-based approaches, holistic approaches, and RGB-D-based approaches. Keypoints-based approaches are two-step approaches that extract 2D-3D point pairs and then use PnP to calculate the 6D pose of the object. Holistic approaches use an end-to-end structure to measure the 6D pose, which is faster and more robust than keypoints-based approaches. RGB-D-based learning approaches combine color information and depth information to achieve a more accurate and robust 6D pose result.
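
The second step of the keypoints-based pipeline, recovering the pose from 2D-3D point pairs, is typically handled by a PnP solver (e.g., `cv2.solvePnP` in OpenCV). A minimal Direct Linear Transform (DLT) sketch, assuming noise-free correspondences (the function name and synthetic check are illustrative):

```python
import numpy as np

def pose_from_correspondences(pts3d, pts2d, K):
    """Minimal DLT solution of the PnP problem (noise-free sketch).

    pts3d: (N, 3) object-frame keypoints, pts2d: (N, 2) pixel
    coordinates, K: (3, 3) camera intrinsics. Returns (R, t) with R
    projected onto SO(3). Needs N >= 6 non-degenerate correspondences.
    """
    n = len(pts3d)
    # Work in normalized image coordinates: x = K^-1 [u, v, 1]^T
    xn = (np.linalg.inv(K) @ np.hstack([pts2d, np.ones((n, 1))]).T).T
    A = []
    for (X, Y, Z), (u, v, _) in zip(pts3d, xn):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    # The projection matrix [R|t] (up to scale) is the null vector of A.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)
    # Remove the unknown scale: det(P[:, :3]) = s^3 for P = s*[R|t].
    P /= np.cbrt(np.linalg.det(P[:, :3]))
    # Project the rotation part onto SO(3).
    U, _, Vt2 = np.linalg.svd(P[:, :3])
    return U @ Vt2, P[:, 3]

# Synthetic check: recover a known pose from exact correspondences.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 1.5])
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
pts3d = np.random.default_rng(1).uniform(-0.2, 0.2, (8, 3))
cam = pts3d @ R_true.T + t_true        # points in camera frame (Z > 0)
proj = cam @ K.T
pts2d = proj[:, :2] / proj[:, 2:3]
R_est, t_est = pose_from_correspondences(pts3d, pts2d, K)
```

Real pipelines pair this with RANSAC and a robust solver such as EPnP, since predicted keypoints are noisy and may contain outliers.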

Learning-based approaches deliver strong overall performance, as can be seen in the Amazon Robotic Challenge. The first Amazon Robotic Challenge was held in 2015 in the USA [47]. The competition presents a challenging problem that integrates many fields, including pose estimation. Many teams used learning-based approaches and obtained robust results. In [48], the researchers presented a method for multi-class segmentation from RGB-D data: objects were segmented by using a network to compute the probability of each pixel belonging to each object. However, learning-based approaches still have some weaknesses, as they require plenty of storage for the training data and considerable time to train the model. In conclusion, there are three challenges facing learning-based approaches. Firstly, they require plenty of training data. Secondly, they require plenty of time to train the model before online use. For complex objects that need much training data, training may take several hours or even several days, restricting the possible applications of such approaches. Finally, these approaches cannot perform well when measuring poses that are not represented in the dataset.
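
The per-pixel class probabilities used for segmentation in [48] amount to a softmax over per-pixel class scores. A toy sketch with assumed shapes (a 4x4 image, 3 classes, and random logits standing in for network outputs):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy per-pixel class scores, as a segmentation network might output:
# one logit per object class (including background) at every pixel.
# The 4x4 image size and 3 classes are illustrative assumptions.
H, W, n_classes = 4, 4, 3
logits = rng.standard_normal((H, W, n_classes))

# Softmax over the class axis gives, per pixel, the probability of
# belonging to each object; argmax yields the segmentation mask.
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)
mask = probs.argmax(axis=-1)

print(mask.shape)  # (4, 4): one object label per pixel
```

Subtracting the per-pixel maximum before exponentiating is the standard trick to keep the softmax numerically stable.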

#### **3. Non-Learning-Based Approaches**

#### *3.1. 2D-Information-Based Approaches*

Compared with 3D information, 2D information can be obtained more easily by simpler devices such as Charge-Coupled Device (CCD) cameras, Complementary Metal Oxide Semiconductor (CMOS) cameras, or even color cameras that capture the color information of objects. Many kinds of 2D information can be used to measure the 6D pose of objects, for instance, geometric information, texture information, and color information. Scale-Invariant Feature Transform (SIFT) features [49] and Speeded-Up Robust Features (SURF) [50] are early, classical features for pose estimation based on the texture of objects. SIFT and SURF features are both reliable and can achieve precise matching; however, they rely on the texture information of objects and therefore cannot be used for the pose estimation of textureless objects. Zhang et al. [51] used geometric information, which does not rely on texture, to solve the pose estimation problem for textureless objects. Miyake et al. [52] incorporated color information into 6D pose estimation to improve accuracy and robustness. In these approaches, 2D information is extracted for matching against a template. According to the source of the template, approaches based on 2D information can be divided into two categories: 1. CAD-image-based methods; and 2. real-image-based methods.
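
Template matching on 2D information can be illustrated with a brute-force normalized cross-correlation search; the function name and synthetic data below are assumptions for illustration (real systems would use descriptor matching such as SIFT/SURF, or optimized routines like OpenCV's `matchTemplate`):

```python
import numpy as np

def match_template_ncc(image, template):
    """Locate a template in a grayscale image by normalized
    cross-correlation (brute-force sketch for illustration).

    Returns the (row, col) of the best-matching top-left corner.
    """
    th, tw = template.shape
    t = template - template.mean()
    tnorm = np.linalg.norm(t)
    best, best_pos = -np.inf, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            w = image[r:r+th, c:c+tw]
            wc = w - w.mean()
            denom = np.linalg.norm(wc) * tnorm
            if denom == 0:
                continue  # flat window: correlation undefined
            score = float((wc * t).sum() / denom)
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos

# Synthetic check: embed a bright patch in a dim image and find it again.
rng = np.random.default_rng(3)
img = rng.uniform(0.0, 0.1, (40, 40))
patch = rng.uniform(0.5, 1.0, (8, 8))
img[12:20, 25:33] = patch
pos = match_template_ncc(img, patch)
```

Because the correlation is normalized by both window norms, the score is invariant to brightness and contrast changes, which is why NCC is a common baseline for template-based matching.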
