6-DoF Pose Estimation from Single RGB Image and CAD Model Retrieval Using Feature Similarity Measurement
Abstract
1. Introduction
2. Internal and External Pose Parameters
3. Proposed Methods
3.1. Dataset for Reference CAD Models
- The upper category should have a matching category in the Microsoft COCO dataset [25] because the object detection network is trained on this dataset.
- The appearance of the CAD models in a subcategory should be different enough to facilitate similarity measurement.
- The appearance and shape of the CAD models across all categories should be general so that they appear in a sufficient amount of RGB image data.
3.2. Dataset for RGB Images
- To ensure usability in object detection and pose estimation networks, the image size is constrained between 224 × 224 and 1600 × 1600 pixels.
- Accepted image formats include jpg, jpeg, and png.
- For similarity measurement between the RGB object and the CAD rendering, it is necessary that the RGB object is minimally occluded.
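The size and format constraints listed above can be expressed as a simple acceptance check. The following is a minimal sketch, not the authors' tooling: the function name `is_acceptable_image` is hypothetical, the bounds are assumed to be inclusive and to apply to both width and height, and extensions are assumed to be matched case-insensitively. The occlusion criterion is not covered, since it would need a per-object annotation or manual review.

```python
from pathlib import Path

# Assumed inclusive bounds taken from the 224 x 224 to 1600 x 1600 constraint.
MIN_SIDE = 224
MAX_SIDE = 1600
# Accepted formats from the list above, compared case-insensitively (assumption).
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def is_acceptable_image(filename, width, height):
    """Return True if the file extension and pixel dimensions meet the constraints.

    `width` and `height` are expected to be read beforehand with whatever
    image library the pipeline already uses.
    """
    suffix_ok = Path(filename).suffix.lower() in ALLOWED_SUFFIXES
    size_ok = all(MIN_SIDE <= side <= MAX_SIDE for side in (width, height))
    return suffix_ok and size_ok


# Hypothetical candidates: (filename, width, height)
candidates = [
    ("chair.JPG", 640, 480),
    ("sofa.png", 200, 480),   # too small on one side
    ("car.bmp", 640, 480),    # unsupported format
    ("bus.jpeg", 1600, 1600),
]
accepted = [name for name, w, h in candidates if is_acceptable_image(name, w, h)]
```

Keeping the check independent of any image library makes it easy to reuse in a dataset-collection script before the more expensive occlusion review.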
3.3. Proposed Pose Estimation Network
3.4. Similarity Measurement Between RGB Image and CAD Rendering
3.5. Translation and Focal Length Estimation
4. Experiment and Evaluation
4.1. Experimental Setup
4.2. Evaluation of Pose Estimation
4.3. Evaluation of CAD Model Retrieval
4.4. Correlation Analysis Between Pose Estimation and CAD Retrieval
5. Ablation Study of the Proposed Pose Estimation Model
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Wang, S.; Clark, R.; Wen, H.; Trigoni, N. DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 2043–2050.
2. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
3. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359.
4. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An Efficient Alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: New York, NY, USA; pp. 2564–2571.
5. Besl, P.J.; McKay, N.D. A Method for Registration of 3-D Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256.
6. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395.
7. Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. arXiv 2018, arXiv:1711.00199.
8. Do, T.-T.; Cai, M.; Pham, T.; Reid, I. Deep-6DPose: Recovering 6D Object Pose from a Single RGB Image. arXiv 2018, arXiv:1802.10367.
9. Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2938–2946.
10. Dong, Z.; Liu, S.; Zhou, T.; Cheng, H.; Zeng, L.; Yu, X.; Liu, H. PPR-Net: Point-Wise Pose Regression Network for Instance Segmentation and 6D Pose Estimation in Bin-Picking Scenarios. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: New York, NY, USA; pp. 1773–1780.
11. Wang, C.; Xu, D.; Zhu, Y.; Martin-Martin, R.; Lu, C.; Fei-Fei, L.; Savarese, S. DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; IEEE: New York, NY, USA; pp. 3338–3347.
12. Zakharov, S.; Shugurov, I.; Ilic, S. DPOD: 6D Pose Object Detector and Refiner. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA; pp. 1941–1950.
13. Xu, Y.; Lin, K.-Y.; Zhang, G.; Wang, X.; Li, H. RNNPose: Recurrent 6-DoF Object Pose Refinement with Robust Correspondence Field Estimation and Pose Optimization. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 21–24 June 2022; IEEE: New York, NY, USA; pp. 14860–14870.
14. Hu, N.; Zhou, H.; Liu, A.-A.; Huang, X.; Zhang, S.; Jin, G.; Guo, J.; Li, X. Collaborative Distribution Alignment for 2D Image-Based 3D Shape Retrieval. J. Vis. Commun. Image Represent. 2022, 83, 103426.
15. Park, S.-Y.; Son, C.-M.; Jeong, W.-J.; Park, S. Relative Pose Estimation between Image Object and ShapeNet CAD Model for Automatic 4-DoF Annotation. Appl. Sci. 2023, 13, 693.
16. Gümeli, C.; Dai, A.; Nießner, M. ROCA: Robust CAD Model Retrieval and Alignment from a Single Image. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 4012–4021.
17. Xiao, Y.; Du, Y.; Marlet, R. PoseContrast: Class-Agnostic Object Viewpoint Estimation in the Wild with Pose-Aware Contrastive Learning. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; IEEE: New York, NY, USA; pp. 74–84.
18. Tang, J.; Chen, Z.; Fu, B.; Lu, W.; Li, S.; Li, X.; Ji, X. ROV6D: 6D Pose Estimation Benchmark Dataset for Underwater Remotely Operated Vehicles. IEEE Robot. Autom. Lett. 2024, 9, 65–72.
19. Noh, H.; Araujo, A.; Sim, J.; Weyand, T.; Han, B. Large-Scale Image Retrieval with Attentive Deep Local Features. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA; pp. 3476–3485.
20. Qiao, S.; Chen, L.-C.; Yuille, A. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 18–24 June 2021; IEEE: New York, NY, USA; pp. 10208–10219.
21. Home of the Blender Project—Free and Open 3D Creation Software. Available online: https://www.blender.org/ (accessed on 1 June 2023).
22. Xiang, Y.; Mottaghi, R.; Savarese, S. Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, 24–26 March 2014; IEEE: New York, NY, USA; pp. 75–82.
23. Sun, X.; Wu, J.; Zhang, X.; Zhang, Z.; Zhang, C.; Xue, T.; Tenenbaum, J.B.; Freeman, W.T. Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
24. Xiang, Y.; Kim, W.; Chen, W.; Ji, J.; Choy, C.; Su, H.; Mottaghi, R.; Guibas, L.; Savarese, S. ObjectNet3D: A Large Scale Database for 3D Object Recognition. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9912, pp. 160–176. ISBN 978-3-319-46483-1.
25. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer International Publishing: Cham, Switzerland, 2014.
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA; pp. 770–778.
27. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
28. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
29. Bangunharcana, A.; Cho, J.W.; Lee, S.; Kweon, I.S.; Kim, K.-S.; Kim, S. Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September 2021; IEEE: New York, NY, USA; pp. 3542–3548.
30. Xu, G.; Zhou, H.; Yang, X. CGI-Stereo: Accurate and Real-Time Stereo Matching via Context and Geometry Interaction. arXiv 2023, arXiv:2301.02789.
31. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 3–19. ISBN 978-3-030-01233-5.
32. Su, H.; Qi, C.R.; Li, Y.; Guibas, L.J. Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; IEEE: New York, NY, USA; pp. 2686–2694.
33. Tulsiani, S.; Malik, J. Viewpoints and Keypoints. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA; pp. 1510–1519.
Method | Aeroplane (275) | Bicycle (118) | Boat (232) | Bottle (251) | Bus (154) | Car (308) | Chair (244) | Dining Table (21) | Motorbike (136) | Sofa (39) | Train (113) | TV Monitor (222) | Average (2113) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Acc30 (%) | |||||||||||||
Grabner et al. [17] | 80 | 82 | 57 | 90 | 97 | 94 | 72 | 67 | 90 | 80 | 82 | 85 | 81 |
StarMap [17] | 82 | 86 | 50 | 92 | 97 | 92 | 79 | 62 | 88 | 92 | 77 | 83 | 82 |
3DPoseLight [17] | 80 | 82 | 58 | 93 | 96 | 92 | 77 | 57 | 88 | 82 | 80 | 79 | 80 |
PoseContrast | 81 | 82 | 59 | 92 | 96 | 93 | 84 | 52 | 88 | 85 | 81 | 85 | 82 |
Proposed | 87 | 81 | 65 | 92 | 97 | 94 | 83 | 62 | 90 | 82 | 81 | 86 | 83 |
Median Error (°) | |||||||||||||
Grabner et al. [17] | 10.9 | 12.2 | 23.4 | 9.3 | 3.4 | 5.2 | 15.9 | 16.2 | 12.2 | 11.6 | 6.3 | 11.2 | 11.5 |
StarMap [17] | 10.1 | 14.5 | 30.3 | 9.1 | 3.1 | 6.5 | 11.0 | 23.7 | 14.1 | 11.1 | 7.4 | 13.0 | 12.8 |
PoseContrast | 11.25 | 15.63 | 22.39 | 7.68 | 2.68 | 4.67 | 10.7 | 20.88 | 11.48 | 10.76 | 5.64 | 11.19 | 11.25 |
Proposed | 11.7 | 14.71 | 18.47 | 7.84 | 2.84 | 4.83 | 9.98 | 11.71 | 11.98 | 10.75 | 6.18 | 10.88 | 10.16 |
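
The Acc30 and Median Error rows in the table above are reported without a definition in this section. As a hedged illustration, the sketch below follows the convention commonly used in viewpoint estimation, which is assumed here rather than taken from the paper: the rotation error is the geodesic angle between the predicted and ground-truth rotation matrices, Acc30 is the fraction of samples whose error is below 30°, and Median Error is the median of those angles. All function names (`geodesic_error_deg`, `rot_z`, `acc30_and_median`) are hypothetical.

```python
import math
from statistics import median


def geodesic_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(r_pred^T @ r_gt) equals the sum of element-wise products.
    trace = sum(r_pred[k][i] * r_gt[k][i] for k in range(3) for i in range(3))
    cos_angle = (trace - 1.0) / 2.0
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


def rot_z(deg):
    """Rotation about the z-axis, used only to build toy inputs."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def acc30_and_median(errors_deg, threshold_deg=30.0):
    """Return (fraction of errors below the threshold, median error)."""
    acc = sum(e < threshold_deg for e in errors_deg) / len(errors_deg)
    return acc, median(errors_deg)


# Toy example: ground truth is the identity, predictions are off by 10, 20, 45, 90 degrees.
gt = rot_z(0.0)
errors = [geodesic_error_deg(rot_z(d), gt) for d in (10.0, 20.0, 45.0, 90.0)]
acc30, med_err = acc30_and_median(errors)
```

In this toy case two of the four predictions fall under the 30° threshold, so the accuracy is 0.5, and the median of the four errors lies midway between 20° and 45°.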
Method | Tool (46) | Misc (61) | Bookcase (130) | Wardrobe (166) | Desk (297) | Bed (394) | Table (738) | Sofa (1092) | Chair (2894) | Average (5818) |
---|---|---|---|---|---|---|---|---|---|---|
Acc30 (%) | ||||||||||
3DPoseLight [17] | 9 | 10 | 62 | 57 | 66 | 58 | 40 | 94 | 50 | 58 |
PoseContrast | 11 | 25 | 76 | 58 | 78 | 71 | 53 | 97 | 86 | 62 |
Proposed | 11 | 21 | 85 | 80 | 76 | 68 | 53 | 96 | 85 | 64 |
Median Error (°) | ||||||||||
PoseContrast | 128.75 | 58.36 | 17.29 | 22.76 | 12.49 | 13.27 | 23.79 | 14.49 | 11.7 | 33.66 |
Proposed | 107.18 | 75.99 | 12.44 | 16.11 | 12.76 | 13.06 | 25.4 | 13.13 | 11.66 | 31.97 |
Method | Base Classes | Novel Classes | All Samples |
---|---|---|---|
Acc30 (%) | |||
PoseContrast | 46 | 45 | 50 |
Proposed | 47 | 48 | 52 |
Median Error (°) | |||
PoseContrast | 51.32 | 56.92 | 29.43 |
Proposed | 50.14 | 50.16 | 27.57 |
Similarity Measure | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 |
---|---|---|---|---|---|
SM-AT | 720 | 250 | 156 | 117 | 87 |
SM-NI | 741 | 258 | 179 | 101 | 91 |
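
Per-rank counts such as those in the table above are often easier to compare as cumulative top-k totals. The snippet below is a small sketch under one assumption that this section does not state explicitly: each cell is taken to count the test images whose matching CAD model was retrieved at exactly that rank. The helper name `cumulative_top_k` is hypothetical.

```python
from itertools import accumulate

# Per-rank counts copied from the table above (Rank 1 to Rank 5).
rank_counts = {
    "SM-AT": [720, 250, 156, 117, 87],
    "SM-NI": [741, 258, 179, 101, 91],
}


def cumulative_top_k(counts):
    """Running totals: entry k-1 is the number of images matched within the top k."""
    return list(accumulate(counts))


top_k = {name: cumulative_top_k(counts) for name, counts in rank_counts.items()}
# top_k["SM-AT"] -> [720, 970, 1126, 1243, 1330]
# top_k["SM-NI"] -> [741, 999, 1178, 1279, 1370]
```

Under that reading, SM-NI stays ahead of SM-AT at every cutoff from top-1 to top-5. Converting the totals to rates would require the total number of test images, which the table does not give.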
Category | Better | SS | Worse |
---|---|---|---|
Bag (144) | 47 | 33 | 37 |
Bed (110) | 22 | 13 | 18 |
Bench (146) | 47 | 57 | 23 |
Bottle (184) | 32 | 79 | 24 |
Car (188) | 18 | 153 | 14 |
Chair (199) | 50 | 106 | 26 |
Clock (173) | 56 | 43 | 27 |
Motorbike (170) | 45 | 87 | 30 |
Stove (176) | 66 | 52 | 34 |
Table (72) | 10 | 56 | 6 |
Category | Better | SS | Worse |
---|---|---|---|
Bag (144) | 68 | 25 | 37 |
Bed (110) | 15 | 36 | 46 |
Bench (146) | 87 | 20 | 31 |
Bottle (184) | 109 | 3 | 8 |
Car (188) | 75 | 89 | 6 |
Chair (199) | 95 | 73 | 16 |
Clock (173) | 51 | 19 | 26 |
Motorbike (170) | 70 | 81 | 7 |
Stove (176) | 116 | 31 | 14 |
Table (72) | 47 | 21 | 4 |
Similarity Measure | Better | SS | Worse |
---|---|---|---|
SM-AT | 0.519 | 0.0 | −0.364 |
SM-NI | 0.338 | 0.110 | −0.226 |
Similarity Measure | Better | SS | Worse |
---|---|---|---|
SM-AT | 1.138 | 0.271 | −0.349 |
SM-NI | 1.075 | 0.261 | −0.097 |
Model | VUF | GCE | CGF | Q loss |
---|---|---|---|---|
PoseContrast (baseline) | ||||
VUF-module | ||||
VUF + GCE | ||||
VUF-full | ||||
Proposed | ||||
Method | Aeroplane (275) | Bicycle (118) | Boat (232) | Bottle (251) | Bus (154) | Car (308) | Chair (244) | Dining Table (21) | Motorbike (136) | Sofa (39) | Train (113) | TV Monitor (222) | Average (2113) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Acc30 (%) | |||||||||||||
PoseContrast | 81 | 82 | 59 | 92 | 96 | 93 | 84 | 52 | 88 | 85 | 81 | 85 | 82 |
VUF-module | 82 | 81 | 62 | 92 | 95 | 93 | 82 | 62 | 89 | 87 | 79 | 80 | 82 |
VUF + GCE | 85 | 81 | 60 | 93 | 97 | 93 | 86 | 57 | 89 | 90 | 81 | 85 | 83 |
VUF-full | 82 | 84 | 64 | 92 | 99 | 92 | 87 | 57 | 89 | 92 | 82 | 85 | 84 |
Proposed | 87 | 81 | 65 | 92 | 97 | 94 | 83 | 62 | 90 | 82 | 81 | 86 | 83 |
Median Error (°) | |||||||||||||
PoseContrast | 11.25 | 15.63 | 22.39 | 7.68 | 2.68 | 4.67 | 10.7 | 20.88 | 11.48 | 10.76 | 5.64 | 11.19 | 11.25 |
VUF-module | 12.66 | 14.31 | 21.79 | 9.05 | 3.56 | 5.61 | 10.31 | 14.02 | 12.92 | 10.99 | 7.13 | 12.5 | 11.24 |
VUF + GCE | 10.62 | 14 | 20.18 | 8.17 | 3.01 | 4.85 | 10.41 | 14.27 | 12.13 | 11.64 | 5.58 | 12.29 | 10.6 |
VUF-full | 10.47 | 15.67 | 18.96 | 7.5 | 3.08 | 4.86 | 9.85 | 13.51 | 11.95 | 13.05 | 5.56 | 11.85 | 10.53 |
Proposed | 11.7 | 14.71 | 18.47 | 7.84 | 2.84 | 4.83 | 9.98 | 11.71 | 11.98 | 10.75 | 6.18 | 10.88 | 10.16 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Park, S.; Jeong, W.-J.; Manawadu, M.; Park, S.-Y. 6-DoF Pose Estimation from Single RGB Image and CAD Model Retrieval Using Feature Similarity Measurement. Appl. Sci. 2025, 15, 1501. https://doi.org/10.3390/app15031501