In this section, we provide a structured summary of each of the 84 analyzed 6D pose estimation models, giving a brief description of the inputs, real-time capability, datasets, and evaluation metrics. Later, in the analysis chapter, analytical tables present more detailed and organized information. The models are categorized into two distinct sections: Real-Time Models, with 54 models, and Non-Real-Time Models, with 30 models. This division allows researchers to efficiently navigate the models based on their real-time processing needs and computational constraints.
4.1. Real-Time Models
The model proposed by Tristan et al. [
4], published in 2015, presents a pose estimation method for rendezvous and docking with passive objects, relevant to space debris removal, by fusing data from a 3D time-of-flight camera and a high-resolution grayscale camera, achieving an average distance error of 3 cm and up to 60 FPS in real-time tests using the European Proximity Operations Simulator (EPOS), making it applicable to space operations.
The model BB8 [
5], published in 2017, presents a method for 3D object detection and pose estimation using only color images by leveraging segmentation and a CNN-based holistic approach to predict 3D poses, addressing rotationally symmetric objects by restricting the range of training poses and adding a classification step to resolve the remaining pose ambiguity.
The model SSD6D [
6], published in 2017, introduces an SSD-based approach for 3D instance detection and 6D pose estimation from RGB images, trained exclusively on synthetic model data, achieving high-speed performance (10Hz) while demonstrating that color-based methods can effectively replace depth-dependent techniques, with future improvements focusing on robustness to color variations and loss term optimization.
The model PoseCNN [
7], published in 2017, presents PoseCNN, a convolutional neural network for 6D object pose estimation that separately predicts 3D translation and rotation, introduces loss functions to handle occlusions and object symmetries, and operates using only color images, making it applicable for robotic vision tasks while also contributing the YCB-Video dataset for further research.
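To make the symmetry handling concrete, the following minimal NumPy sketch illustrates the idea behind a symmetry-tolerant point-matching loss in the spirit of PoseCNN's handling of symmetric objects: each model point under the predicted pose is matched to its nearest point under the ground-truth pose, so two poses related by an object symmetry incur little penalty. The function name and the brute-force nearest-neighbor formulation are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def symmetry_tolerant_loss(model_pts, R_pred, t_pred, R_gt, t_gt):
    """Average nearest-neighbor distance between the model under the
    predicted pose and under the ground-truth pose. Because points are
    matched to their *closest* counterpart rather than by index, poses
    related by an object symmetry incur (near-)zero loss."""
    pred = model_pts @ R_pred.T + t_pred        # (N, 3) under predicted pose
    gt = model_pts @ R_gt.T + t_gt              # (N, 3) under ground-truth pose
    d = np.linalg.norm(pred[:, None] - gt[None, :], axis=-1)  # (N, N) distances
    return d.min(axis=1).mean()
```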
The model CDPN [
8], published in 2019, presents the Coordinates-based Disentangled Pose Network (CDPN), a 6-DoF object pose estimation method that separately predicts rotation and translation using specialized techniques to enhance accuracy and robustness, making it effective for handling texture-less and occluded objects in real-world applications such as robotics and augmented reality.
The model CullNet [
9], published in 2019, proposes CullNet, a confidence calibration network for single-view object pose estimation that enhances the reliability of pose proposal selection by refining confidence scores using pose masks rendered from 3D models and cropped image regions, with experimental validation on LINEMOD and Occlusion LINEMOD datasets demonstrating its effectiveness.
The model Pix2Pose [
10], published in 2019, proposes Pix2Pose, a 6D object pose estimation method that predicts per-pixel 3D coordinates from RGB images without requiring textured 3D models, using an auto-encoder architecture to estimate 3D coordinates and errors per pixel, generative adversarial training for occlusion recovery, and a transformer loss to handle object symmetries, with validation on multiple benchmark datasets.
The model PoseRBPF [
11], published in 2019, proposes PoseRBPF, a Rao–Blackwellized particle filter for 6D object pose tracking that decouples 3D translation and 3D rotation by discretizing the rotation space and using an autoencoder-based codebook of feature embeddings, enabling robust tracking of symmetric and textureless objects, with validation on two benchmark datasets for robotic applications such as manipulation and navigation.
The DPOD (Dense Pose Object Detector) model [
12], published in 2019, presents DPOD, a deep learning-based method for 3D object detection and 6D pose estimation from RGB images, which computes dense multi-class 2D-3D correspondences to estimate object poses via PnP and RANSAC, followed by a deep learning-based refinement step, demonstrating effectiveness on both synthetic and real training data while being real-time capable.
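As a minimal sketch of this correspondence-based pipeline, the snippet below feeds 2D-3D correspondences to OpenCV's solvePnPRansac; the random arrays and the camera intrinsics K are placeholder assumptions standing in for the dense per-pixel predictions a network like DPOD would produce.

```python
import numpy as np
import cv2

# Placeholder correspondences: in a DPOD-style pipeline these come from
# the dense per-pixel 2D-3D correspondence maps predicted by the network.
obj_pts = np.random.rand(200, 3).astype(np.float32)          # 3D model coords
img_pts = (np.random.rand(200, 2) * 480).astype(np.float32)  # matching pixels
K = np.array([[572.4, 0.0, 325.3],                           # assumed intrinsics
              [0.0, 573.6, 242.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_pts, img_pts, K, distCoeffs=None,
    iterationsCount=150, reprojectionError=3.0)
R, _ = cv2.Rodrigues(rvec)   # R (3x3) and tvec (3x1) form the 6D pose estimate
```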
The model AAE [
13], published in 2018, presents a real-time RGB-based pipeline for object detection and 6D pose estimation, utilizing an Augmented Autoencoder trained on synthetic 3D model views with domain randomization to enable robust self-supervised 3D orientation estimation across various RGB sensors, without requiring pose-annotated training data, while inherently handling object symmetries and perspective errors.
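A minimal sketch of this codebook-based orientation lookup follows: the latent embedding of a test crop is compared against embeddings of synthetic views rendered at known rotations, and the rotation of the most similar view is returned. All names and shapes are illustrative assumptions.

```python
import numpy as np

def lookup_orientation(z_query, codebook_z, codebook_R):
    """Return the rotation whose rendered view has the most similar
    latent embedding to the query crop (cosine similarity), given a
    codebook built offline from synthetic views at known rotations."""
    z = z_query / np.linalg.norm(z_query)
    Z = codebook_z / np.linalg.norm(codebook_z, axis=1, keepdims=True)
    return codebook_R[np.argmax(Z @ z)]   # best-matching view's rotation
```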
The model CosyPose [
14], published in 2020, presents CosyPose, a multi-view 6D object pose estimation method that integrates single-view predictions, robust hypothesis matching, and object-level bundle adjustment for global scene refinement, using a render-and-compare approach to handle unknown camera viewpoints, explicitly managing object symmetries, and operating without depth measurements, with applications in visually driven robotic manipulation.
The model G2L-Net [
15], published in 2020, presents G2L-Net, a real-time 6D object pose estimation framework that processes point clouds from RGB-D detection using a divide-and-conquer approach, incorporating 3D sphere-based localization to constrain the search space, embedding vector features (EVF) for viewpoint-aware rotation estimation, and a rotation residual estimator for refinement.
The model HybridPose [
16], published in 2020, presents HybridPose, a real-time 6D object pose estimation method that integrates keypoints, edge vectors, and symmetry correspondences as a hybrid intermediate representation, enhancing robustness to occlusions and extreme poses while filtering outliers through a robust regression module, with future extensions planned for incorporating additional geometric features.
The model YOLO-6D+ [
17], published in 2020, presents YOLO-6D+, an end-to-end deep learning framework for real-time 6D object pose estimation from a single RGB image, incorporating a silhouette prediction branch to enhance feature learning and an edge restrain loss to improve 3D shape constraints, predicting 2D keypoints for PnP-based pose estimation, demonstrating efficiency for augmented reality and robotic grasping applications.
The model ASPP-DF-PVNet [
18] presents ASPP-DF-PVNet, an occlusion-resistant 6D pose estimation framework that enhances PVNet with an Atrous Spatial Pyramid Pooling (ASPP) module for improved segmentation and a distance-filtered voting scheme for better keypoint localization, demonstrating effectiveness on the LINEMOD and Occlusion LINEMOD datasets.
The model BundleTrack [
19], published in 2021, presents BundleTrack, a real-time 6D pose tracking framework that operates without relying on instance- or category-level 3D models, combining deep learning-based segmentation, robust feature extraction, and pose graph optimization for long-term, low-drift tracking under challenging conditions, with applications in robotic manipulation, pick-and-place, and in-hand dexterous tasks.
The model CloudAAE [
20], published in 2021, presents a point cloud-based 6D pose estimation system that leverages an augmented autoencoder (AAE) for pose regression and a lightweight synthetic data generation pipeline, significantly reducing training costs while enabling agile deployment for robotic applications, demonstrating effectiveness among synthetic-trained methods on public benchmarks.
The model FS-Net [
21], published in 2021, presents FS-Net, a real-time category-level 6D pose estimation framework that utilizes a 3D graph convolutional autoencoder for feature extraction, a decoupled rotation mechanism for improved orientation decoding, and a residual-based translation and size estimation strategy, demonstrating strong generalization and efficiency for fast and accurate object pose tracking.
The model RePOSE [
22], published in 2021, presents RePOSE, a real-time 6D object pose refinement method that replaces CNN-based refinement with efficient deep texture rendering and differentiable Levenberg-Marquardt optimization, enabling fast and accurate pose tracking, with potential applications in real-time object manipulation and robotics.
The SO-Pose model [
23], published in 2021, presents SO-Pose, an end-to-end 6D pose estimation framework that introduces a two-layer model combining 2D-3D correspondences and self-occlusion reasoning, enhancing spatial reasoning and robustness in cluttered environments, with potential applications in object tracking, manipulation, and self-supervised pose estimation.
The model developed by Jingrui Song et al. [
24], published in 2021, presents a satellite pose estimation framework that leverages a degraded image rendering pipeline to simulate atmospheric turbulence and a deep learning-based method inspired by YOLO-6D, incorporating a relative depth prediction branch to enhance pose estimation accuracy, with applications in ground-based optical telescope tracking of non-cooperative space targets.
The model GDR-NET [
25], published in 2021, presents GDR-Net, a geometry-guided direct regression network for end-to-end 6D object pose estimation, integrating dense geometric feature representations with a Patch-PnP module to enable real-time, differentiable, and accurate pose estimation, with applications in robotics, augmented reality, and computer vision tasks requiring differentiable poses.
The model FFB6D [
26], published in 2021, introduces FFB6D, a Full Flow Bidirectional fusion network for 6D pose estimation that integrates RGB and depth features throughout the encoding and decoding process and incorporates a SIFT-FPS keypoint selection algorithm, demonstrating strong performance on benchmark datasets with potential applications in 3D perception tasks.
The model DPODv2 [
27], published in 2022, presents DPODv2, a three-stage 6D object pose estimation framework that integrates YOLOv3-based detection, CENet for dense 2D-3D correspondences, and multi-view optimization, supporting both RGB and depth modalities. It also introduces a differentiable rendering-based refinement stage to improve pose consistency across multiple views, demonstrating scalability and strong performance across multiple datasets.
The model FS6D-DPM [
28], published in 2022, introduces FS6D-DPM, a few-shot 6D object pose estimation framework that predicts the pose of unseen objects using only a few support views, leveraging dense RGBD prototype matching with transformers, ShapeNet6D for large-scale pre-training, and online texture blending for enhanced generalization, addressing challenges in open-set pose estimation with potential applications in robotics and augmented reality.
The model GPV-Pose [
29], published in 2022, presents GPV-Pose, a category-level 6D pose estimation framework that introduces a confidence-driven rotation representation and a geometry-guided point-wise voting mechanism to improve robustness across intra-class variations, achieving real-time inference speed and state-of-the-art performance on public datasets, with applications in robotics, augmented reality, and 3D scene understanding.
The model MV6D [
30], published in 2022, presents MV6D, an end-to-end multi-view 6D pose estimation framework that fuses RGB-D data from multiple perspectives using DenseFusion and joint point cloud processing, achieving robust pose estimation in cluttered and occluded scenes, with applications in robotics, augmented reality, and autonomous systems.
The model OVE6D [
31], published in 2022, presents OVE6D, a model-based 6D pose estimation framework that decomposes pose into viewpoint, in-plane rotation, and translation, using lightweight cascaded modules trained purely on synthetic data, enabling strong generalization to real-world objects without fine-tuning, with applications in robotics, industrial automation, and large-scale object recognition.
The model PVNet (Pixel-wise Voting Network) [
32], published in 2022, presents PVNet, a 6D object pose estimation framework that uses a pixel-wise voting mechanism for keypoint localization and an uncertainty-driven PnP solver for final pose estimation, improving robustness against occlusion and truncation, with applications in robotics, augmented reality, and automated inspection.
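The voting scheme can be sketched as a RANSAC loop: two random object pixels hypothesize a keypoint at the intersection of their voting rays, and the hypothesis is scored by how many other pixels point toward it. The snippet below is a simplified illustration of this idea with assumed array shapes; PVNet additionally weights hypotheses and fits a spatial covariance for its uncertainty-driven PnP solver.

```python
import numpy as np

def ransac_keypoint(pixels, dirs, iters=100, cos_thresh=0.99):
    """pixels: (N, 2) object-pixel locations; dirs: (N, 2) unit vectors,
    each pointing from its pixel toward the (unknown) keypoint."""
    best_k, best_score = None, -1
    n = len(pixels)
    for _ in range(iters):
        i, j = np.random.choice(n, 2, replace=False)
        A = np.stack([dirs[i], -dirs[j]], axis=1)   # intersect the two rays
        if abs(np.linalg.det(A)) < 1e-6:            # near-parallel rays: skip
            continue
        s, _ = np.linalg.solve(A, pixels[j] - pixels[i])
        k = pixels[i] + s * dirs[i]                 # keypoint hypothesis
        to_k = k - pixels                           # vectors from pixels to k
        to_k /= np.linalg.norm(to_k, axis=1, keepdims=True) + 1e-9
        inliers = np.sum(np.sum(to_k * dirs, axis=1) > cos_thresh)
        if inliers > best_score:
            best_k, best_score = k, inliers
    return best_k
```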
The model SC6D [
33], published in 2022, presents SC6D, a symmetry-agnostic, correspondence-free 6D object pose estimation framework that utilizes an SO(3) encoder for rotation learning, object-centric coordinate transformations for localization, and classification-based depth estimation, eliminating the need for CAD models, with validation on the T-LESS dataset.
The model SSP-Pose [
34], published in 2022, presents SSP-Pose, an end-to-end category-level 6D pose estimation framework that integrates shape priors into direct pose regression, leveraging symmetry-aware constraints and multiple learning branches to improve accuracy while maintaining real-time inference speed, with applications in robotics, autonomous driving, and augmented reality.
The model proposed by Yan Ren et al. [
35], published in 2022, presents a multi-scale convolutional feature fusion framework for 6D object pose estimation, enhancing correspondence-based keypoint extraction through residual learning and improved feature representation, achieving higher accuracy and robustness in challenging conditions, with applications in robotic grasping and real-time object tracking.
The model proposed by Yi-Hsiang Kao et al. [
36], published in 2022, presents a PVNet-based 6D object pose estimation framework for robotic arm applications in smart manufacturing, leveraging multi-angle image transformations and point cloud registration to improve pose estimation, feature extraction, and object grasping accuracy, addressing limitations of traditional 2D vision-based systems.
The model ZebraPose [
37], published in 2022, presents a coarse-to-fine surface encoding technique for 6D object pose estimation, introducing a hierarchical binary grouping strategy for efficient 2D-3D correspondence prediction, leveraging a PnP solver for final pose estimation, achieving state-of-the-art accuracy on benchmark datasets, with applications in robotics, augmented reality, and industrial automation.
The model Gen6D [
38], published in 2022, introduces Gen6D, a generalizable model-free 6D pose estimator that predicts object poses using only posed images, integrating a novel viewpoint selector and volume-based pose refiner to achieve accurate results for unseen objects in arbitrary environments.
The model proposed by Antoine Legrand et al. [
39], published in 2023, presents a two-stage deep learning framework for spacecraft 6D pose estimation, where a convolutional network predicts keypoint locations, and a Pose Inference Network estimates the final pose, achieving efficient processing for space-grade hardware while maintaining competitive accuracy on the SPEED dataset.
The model Compressed YOLO-6D [
40], published in 2023, presents a mobile-optimized real-time 6D pose estimation framework for augmented reality (AR), enhancing YOLO-6D with model compression techniques such as channel pruning and knowledge distillation to improve inference speed and efficiency, enabling low-latency AR interactions on mobile devices.
The CRT-6D model [
41], published in 2022, presents CRT-6D (Cascaded Pose Refinement Transformers), a real-time 6D object pose estimation framework that replaces dense intermediate representations with Object Surface Keypoint Features (OSKFs) and employs deformable transformers for iterative pose refinement, achieving state-of-the-art performance on benchmark datasets while being significantly faster than existing methods, with applications in robotics, augmented reality, and industrial automation.
The model DFTr network [
42], published in 2023, presents the DFTr network, a 6D object pose estimation method that enhances RGB-D feature integration using a Deep Fusion Transformer (DFTr) block and improves 3D keypoint localization with a weighted vector-wise voting algorithm.
The model PoET [
43], published in 2022, presents PoET, a transformer-based 6D multi-object pose estimation framework that predicts object poses using only a single RGB image, integrating object detection with a transformer decoder to process multiple objects simultaneously, with applications in robotic grasping, localization, and real-time perception in resource-constrained environments.
The model HS-Pose [
44], published in 2023, presents HS-Pose, a category-level 6D object pose estimation framework that introduces the HS-layer, an improved 3D graph convolutional module designed to enhance local-global feature extraction, robustness to noise, and encoding of size and translation information, running in real-time, with applications in robotics, augmented reality, and industrial automation.
The Improved PVNet model [
45], published in 2023, presents an enhanced PVNet-based 6D pose estimation framework for amphibious robots, introducing confidence score prediction and keypoint filtering to improve accuracy in occluded target scenarios, with applications in robotic tracking, docking, and grasping, and plans for further optimization to reduce computational costs and improve real-time performance.
The 6D object pose estimation model proposed by Lichun Wang et al. [
46], published in 2023, presents an enhanced voting-based 6D object pose estimation method that improves PVNet by introducing a distance-aware vector-field prediction loss and a vector screening strategy, reducing angular deviations in keypoint predictions and preventing parallel vector hypotheses, achieving higher accuracy on benchmark datasets, with future work aimed at applying these improvements to robotic grasping tasks.
The model PLuM [
47], published in 2023, presents PLuM (Pose Lookup Method), a probabilistic reward-based 6D pose estimation framework that replaces complex geometric operations with precomputed lookup tables, enabling accurate and efficient pose estimation from point clouds, with applications in field robotics, including real-time haul truck tracking in excavation scenarios.
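A minimal sketch of the lookup idea, under assumed table layout and naming: the reward of a candidate pose is obtained by transforming the sensor points into the model frame and summing precomputed per-voxel rewards, so no nearest-neighbor search or ray casting is evaluated at runtime.

```python
import numpy as np

def pose_reward(points, R, t, table, origin, voxel_size):
    """Score pose (R, t) by transforming sensor points into the model
    frame and summing precomputed per-voxel rewards; higher rewards mark
    voxels near the model surface, so scoring reduces to table lookups."""
    local = (points - t) @ R                          # world -> model frame
    idx = np.floor((local - origin) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < table.shape), axis=1)
    return table[tuple(idx[valid].T)].sum()
```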
The model proposed by Zih-Yun Chiu et al. [
48], published in 2023, presents a real-time 6D pose estimation framework for in-hand suture needle localization, incorporating a novel state space and feasible grasping constraints into Bayesian filters, ensuring consistent and accurate needle tracking relative to the end-effector, with applications in autonomous suturing and surgical robotics.
The model SE-UF-PVNet [
49], published in 2023, is a 6DoF object pose estimation framework that enhances keypoint localization from a single RGB image by integrating structural information via a keypoint graph and Graph Convolution Network, utilizing novel vector field predictions and multi-scale feature extraction to improve robustness, particularly in occlusion scenarios, while maintaining real-time inference.
The model BSAM-PVNet [
50], published in 2024, introduces BSAM-PVNet, a two-stage pose estimation method combining ResNet18 with blueprint separable convolutions and a convolutional attention mechanism for feature extraction, validated on a self-built insulator dataset, highlighting improvements in accuracy and efficiency while addressing generalization challenges in future research.
The model CoS-PVNet [
51], published in 2024, is a robust 6D pose estimation framework designed for complex environments in augmented reality, virtual reality, robotics, and autonomous driving, enhancing PVNet with pixel-weighting, dilated convolutions, and a global attention mechanism to improve keypoint localization and feature extraction, demonstrating strong performance in occlusion-heavy conditions and potential for broader industry applications.
The model EPro-PnP [
52], published in 2025, is a probabilistic PnP layer that transforms traditional non-differentiable pose estimation into an end-to-end learnable framework by modeling pose as a probability distribution on the SE(3) manifold, enhancing 2D-3D correspondence learning with KL divergence minimization and derivative regularization, demonstrating strong applicability in robotics and autonomous driving.
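The "pose as a distribution" idea can be sketched by converting per-candidate reprojection errors into normalized probabilities with a softmax, as below; this is a simplified stand-in for EPro-PnP's continuous distribution on the SE(3) manifold, and all names are illustrative.

```python
import numpy as np

def pose_distribution(reproj_errors, temperature=1.0):
    """Map per-candidate total reprojection errors to normalized
    probabilities: low reprojection error -> high probability mass."""
    logits = -np.asarray(reproj_errors, dtype=float) / temperature
    logits -= logits.max()              # numerical stability
    w = np.exp(logits)
    return w / w.sum()
```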
The model “Focal Segmentation” proposed in [
53], published in 2024, presents an improved 6D pose estimation method for augmented reality, enhancing PVNet with a focal segmentation mechanism to improve object pixel segmentation under severe occlusion, enabling robust keypoint localization and pose estimation via a PnP algorithm while maintaining real-time performance, with future work exploring backbone network modifications and hyperparameter tuning.
The model proposed by Fupan Wang et al. [
54] introduces a lightweight 6DoF pose estimation method that enhances PVNet with depth-wise separable convolutions, coordinate attention, and an improved ASPP module for better multi-scale feature fusion, achieving robustness to scale variations while reducing computational complexity, making it suitable for deployment on low-performance devices such as mobile platforms.
The model Lite-HRPE [
55], published in 2024, is a lightweight 6DoF pose estimation network designed for intelligent robotics, incorporating a multi-branch parallel structure, Ghost modules, and an optimized feature extraction network to reduce computational complexity while maintaining accuracy, making it well-suited for real-time deployment in resource-constrained environments.
The model YOLOX-6D-Pose [
56], published in 2024, is an end-to-end multi-object 6D pose estimation framework that enhances the YOLOX detector to directly predict object poses from a single RGB image, eliminating the need for correspondences, CAD models, or post-processing, making it a highly efficient and accurate solution for real-time applications.
The model IFFNeRF [
57], published in 2024, is a real-time 6DoF camera pose estimation method that utilizes NeRF-based Metropolis-Hastings sampling and an attention-driven ray matching mechanism to estimate poses without an initial guess, demonstrating improved robustness and efficiency across synthetic and real-world datasets.
4.2. Non-Real-Time Models
The model “Chen et al.” [
58], published in 2019, introduces a 3D object detection and pose estimation framework for robotic applications, combining SSD-based object detection, modified LineMOD template matching, and ICP refinement to improve accuracy and reduce false positives, demonstrating its effectiveness in real-world robotic grasping and polishing tasks.
The model DeepVCP [
59], published in 2019, presents DeepVCP, an end-to-end deep neural network for point cloud registration that learns to detect keypoints and generate correspondences from learned features rather than hand-crafted descriptors, achieving accuracy comparable to classical geometric methods with improved robustness in unconstrained environments.
The “Autonomous Mooring” model [
60], published in 2020, proposes an algorithm for autonomous mooring that refines bollard detection by converting Mask R-CNN segmentation into a single reference point, improving localization accuracy and reducing error for precise maritime navigation and offshore operations.
The model DeepIM [
61], published in 2018, is a deep learning-based 6D pose estimation framework for robot manipulation and virtual reality, using an iterative pose refinement approach that predicts relative transformations from color images, enabling accurate pose estimation without depth data or prior object models.
The model EPOS [
62], published in 2020, introduces a 6D object pose estimation method that models objects using compact surface fragments, predicts per-pixel 3D correspondences, and refines poses with a robust PnP-RANSAC algorithm, enabling accurate pose estimation for diverse objects, including those with global or partial symmetries.
The model LatentFusion [
63], published in 2020, presents LatentFusion, a 6D pose estimation framework for unseen objects that reconstructs a latent 3D representation of an object from a small number of reference views and estimates its pose by rendering the representation and comparing it against the observed image through gradient-based optimization, requiring no retraining for novel objects.
The model Self6D [
64], published in 2020, introduces Self6D, a self-supervised 6D pose estimation framework that refines models trained on synthetic data using neural rendering and geometric constraints, enabling improved accuracy on real-world data without requiring 6D pose annotations.
The model “Category Level Metric Scale Object Shape and Pose Estimation” [
65], published in 2021, introduces a framework for estimating metric scale shape and 6D pose from a single RGB image, utilizing the MSOS and NOCS branches along with the NOCE module to predict object mesh, coordinate space, and geometrically aligned object centers, demonstrating strong performance in robotics and augmented reality applications.
The model ConvPoseCNN2 [
66], published in 2021, is a fully convolutional 6D object pose estimation framework for robotics, utilizing dense pixel-wise predictions to improve spatial resolution and inference speed while integrating an iterative refinement module to enhance accuracy, making it well-suited for cluttered environments.
The model DCL-Net [
67], published in 2022, is a deep learning-based 6D object pose estimation framework that enhances correspondence learning between partial object observations and complete CAD models using dual Feature Disengagement and Alignment modules, integrating confidence-weighted pose regression and iterative refinement for improved accuracy across multiple benchmark datasets.
The “Hayashi et al.” model [
68], published in 2021, introduces a joint learning framework that combines an Augmented Autoencoder with Faster R-CNN, integrating object detection and pose estimation through shared feature maps to enhance pose accuracy by reducing errors from mislocalized bounding boxes, with potential applications in real-world object recognition tasks.
The model proposed by Ivan Shugurov et al. [
69], published in 2021, introduces a multi-view 6DoF object pose refinement method that enhances DPOD-based 2D-3D correspondences using a differentiable renderer and geometric constraints, demonstrating robust performance across multiple datasets and enabling automatic annotation of real-world training data for practical applications.
The model proposed by Haotong Lin et al. [
70], published in 2022, presents a self-supervised learning framework for 6DoF object pose estimation, utilizing depth-based pose refinement to supervise an RGB-based estimator, achieving competitive accuracy without real image annotations and offering a scalable solution for applications with limited labeled data.
The model “Primitive Pose” [
71], published in 2022, is a 3D pose and size estimation framework for robotic applications, using stereo depth-based geometric cues to predict oriented 3D bounding boxes for unseen objects without CAD models or semantic information, enabling open-ended object recognition in dynamic environments.
The model “RANSAC Voting” [
72], published in 2022, introduces a 6D pose estimation method for robotic grasping, leveraging EfficientNet for feature extraction and per-pixel keypoint prediction using RANSAC voting, demonstrating high accuracy under occlusions and enabling stable object grasping in real-world environments.
The model 3DNEL [
73], published in 2023, is a probabilistic inverse graphics framework for robotics and 3D scene understanding, integrating neural embeddings from RGB with depth information to improve robustness in sim-to-real 6D pose estimation, enabling uncertainty quantification, multi-object tracking, and principled scene modeling.
The model Multistream ValidNet [
74] introduces a validation framework for robotic manipulation that enhances 6D pose estimation by distinguishing between True and False Positive results using depth images and point clouds, improving pose accuracy and robustness in real-world applications.
The model C2FNET [
75], published in 2023, is a two-stage 6D pose estimation framework that refines keypoint localization using deformable convolutional networks (DCN), improving accuracy in occlusion-heavy conditions, as demonstrated on LINEMOD and Occlusion LINEMOD datasets.
The model DR-Pose [
76], published in 2023, presents DR-Pose, a two-stage 6D pose and 3D size estimation framework that integrates point cloud completion for unseen part recovery and scaled registration for pose-sensitive feature extraction, with applications in robotics, augmented reality, and automated object recognition.
The model Improved CDPN [
77], published in 2023, presents an enhanced pose estimation method for robotics by employing separate networks for translation and rotation prediction, integrating Convolutional Block Attention Module (CBAM) and Pyramid Pooling Module (PPM) to mitigate feature degradation, and validating its effectiveness in object pose estimation and robotic grasping tasks using the Linemod dataset and real-world experiments.
The Improved PVNet 2 model [
78], published in 2023, presents an improved PVNet-based pose estimation method with an enhanced ResNet18 backbone and ELU activation function to support real-time pose tracking in virtual-real fusion maintainability test scenarios, validated on a custom Linemod-format air filter dataset for improved accuracy in component detection and virtual reconstruction.
The model MSDA [
79], published in 2023, introduces a self-supervised domain adaptation approach for 6D object pose estimation that fine-tunes a synthetic pre-trained model using real RGB(-D) images with pose-aware consistency and depth-guided pseudo-labeling, reducing reliance on real pose labels and differentiable rendering while improving real-world applicability.
The Scale Adaptive Skip-ASP Pixel Voting Network (SASA-PVNet) model [
80], published in 2023, presents SASA-PVNet, a scale-adaptive 6D object pose estimation method that enhances keypoint-based approaches by dynamically resizing small objects and incorporating a Skip-ASP module for multi-scale information fusion.
The model proposed by Sijin Luo et al. [
81], published in 2022, presents a vision system for UAV-based logistics that integrates 2D image and 3D point cloud data to detect, segment, and estimate 6D object poses, demonstrating its effectiveness through experiments on the YCB-Video and SIAT datasets and highlighting future improvements for real-time deployment.
The model StereoPose [
82], published in 2023, presents StereoPose, a stereo image-based framework for category-level 6D pose estimation of transparent objects, integrating a back-view NOCS map, a parallax attention module for feature fusion, and an epipolar loss to enhance stereo consistency, with validation on the TOD dataset and potential applications in robotics.
The model SwinDePose [
83], published in 2023, proposes SwinDePose, a fusion network that extracts geometric features from depth images, combines them with point cloud representations, and predicts poses through semantic segmentation and keypoint localization, validated on the LineMod, Occlusion LineMod, and YCB-Video datasets.
The YOLOv7-based model [
84], published in 2023, presents an improved YOLOv7-based 6D pose estimation method that extends prediction networks, modifies the loss function, and incorporates keypoint interpolation to enhance pose accuracy while reducing reliance on precise 3D models, validated on public and custom datasets.
The model developed by Junhao Cai et al. [
85], published in 2024, introduces an open-vocabulary 6D object pose and size estimation framework that utilizes pre-trained DinoV2 and text-to-image diffusion models to infer NOCS maps, enabling generalization to novel object categories described in human text, supported by the large-scale OO3D-9D dataset.
The model RNNPose [
86], published in 2024, presents RNNPose, a recurrent neural network-based framework for 6D object pose refinement that iteratively optimizes poses using a differentiable Levenberg-Marquardt algorithm, leveraging descriptor-based consistency checks for robustness against occlusions and erroneous initial poses, validated on multiple public datasets.
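For reference, a single damped Gauss-Newton (Levenberg-Marquardt) update of the kind such refinement loops iterate has the closed form sketched below; the 6-DoF parameterization and names are assumptions for illustration, not RNNPose's exact implementation.

```python
import numpy as np

def lm_step(J, r, lam=1e-3):
    """One Levenberg-Marquardt update for pose refinement:
    J: (M, 6) Jacobian of residuals w.r.t. a 6-DoF pose increment,
    r: (M,) residual vector. Returns the 6-vector pose update
    delta = -(J^T J + lam * I)^{-1} J^T r."""
    H = J.T @ J + lam * np.eye(J.shape[1])   # damped Gauss-Newton Hessian
    return -np.linalg.solve(H, J.T @ r)
```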
The model developed by Yiwei Song et al. [
87], published in 2024, presents a cross-modal fusion network for 6D pose estimation that extracts and integrates RGB and depth features using transformer-based fusion, demonstrating improved accuracy in occluded and truncated environments through evaluations on Occlusion Linemod and Truncation Linemod datasets.