Monocular Object-Level SLAM Enhanced by Joint Semantic Segmentation and Depth Estimation
Abstract
1. Introduction
The main contributions of this work are summarized as follows (a hedged code sketch of the first item follows the list):
1. We propose JSDNet for the joint learning of depth estimation and semantic segmentation, focusing on stable depth estimation and a feature fusion block between the two task branches.
2. We propose a semantic consistency procedure that preserves the spatial consistency between semantic segmentation and depth estimation, which not only benefits both tasks but also improves localization robustness.
3. We design an object-level SLAM system based on JSDNet that exploits both pixel-level and object-level semantic information. Specifically, the system incorporates this information throughout feature matching and local and global bundle adjustment (BA) to improve accuracy and robustness, and adds a scale uniformization procedure to recover a stable scale.
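As a concrete illustration of the first contribution, below is a minimal PyTorch sketch of a two-branch head coupled by a feature fusion block. This is an assumption-laden toy, not JSDNet itself: the internals of the block are not spelled out here, and the module names (`FeatureFusionBlock`, `JointHead`), channel sizes, and the gated residual exchange are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionBlock(nn.Module):
    """Hypothetical fusion block: project the concatenated branch features,
    then hand a gated share of the fused map back to each branch as a residual."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat_depth, feat_seg):
        fused = F.relu(self.proj(torch.cat([feat_depth, feat_seg], dim=1)))
        gate = torch.sigmoid(self.gate(fused))
        # Residual exchange: each branch keeps its own features plus a gated
        # portion of the cross-task representation.
        return feat_depth + gate * fused, feat_seg + (1.0 - gate) * fused

class JointHead(nn.Module):
    """Toy two-branch head over one shared backbone feature map."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.depth_branch = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.seg_branch = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.fusion = FeatureFusionBlock(in_channels)
        self.depth_out = nn.Conv2d(in_channels, 1, 1)
        self.seg_out = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, feat):
        fd = F.relu(self.depth_branch(feat))
        fs = F.relu(self.seg_branch(feat))
        fd, fs = self.fusion(fd, fs)
        return self.depth_out(fd), self.seg_out(fs)

if __name__ == "__main__":
    head = JointHead(in_channels=64, num_classes=40)  # 40 classes as in NYUv2
    feat = torch.randn(2, 64, 60, 80)                 # stand-in backbone feature
    depth, seg = head(feat)
    print(depth.shape, seg.shape)  # (2, 1, 60, 80), (2, 40, 60, 80)
```

The gate lets each branch decide how much cross-task evidence to absorb, which is one common way such fusion blocks are built; the real JSDNet block may differ.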
2. Related Work
2.1. Monocular Visual SLAM
2.2. Semantic SLAM
2.3. Joint Semantic Segmentation and Depth Estimation
3. JSDNet
3.1. Architecture
3.2. Loss Function
3.3. Semantic Consistency
4. Semantic SLAM
4.1. Visual Odometry
4.2. Three-Dimensional Object Representation and Generation
4.3. Object-Level Bundle Adjustment
4.4. Scale Restoring
5. Experiments
5.1. Datasets and Implementation Details
5.2. Depth and Segmentation Results
5.3. Object-Level SLAM Result
5.4. Ablation Studies
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Wei, X.; Huang, J.; Ma, X. Real-time monocular visual SLAM by combining points and lines. In Proceedings of the IEEE International Conference on Multimedia and Expo, Shanghai, China, 8–12 July 2019; pp. 103–108.
2. Wang, J.; Qi, Y. Scene-independent Localization by Learning Residual Coordinate Map with Cascaded Localizers. In Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, Sydney, Australia, 16–20 October 2023; pp. 79–88.
3. Hu, W.; Zhang, Y.; Liang, Y.; Yin, Y.; Georgescu, A.; Tran, A.; Kruppa, H.; Ng, S.K.; Zimmermann, R. Beyond geo-localization: Fine-grained orientation of street-view images by cross-view matching with satellite imagery. In Proceedings of the ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 6155–6164.
4. Leng, K.; Yang, C.; Sui, W.; Liu, J.; Li, Z. SitPose: A Siamese Convolutional Transformer for Relative Camera Pose Estimation. In Proceedings of the International Conference on Multimedia and Expo, Brisbane, Australia, 10–14 July 2023; pp. 1871–1876.
5. Li, W.; Wang, Y.; Guo, Y.; Wang, S.; Shao, Y.; Bai, X.; Cai, X.; Ye, Q.; Li, D. ColSLAM: A Versatile Collaborative SLAM System for Mobile Phones Using Point-Line Features and Map Caching. In Proceedings of the ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 9032–9041.
6. Chang, Y.; Hu, J.; Xu, S. OTE-SLAM: An Object Tracking Enhanced Visual SLAM System for Dynamic Environments. Sensors 2023, 23, 7921.
7. Lin, B.H.; Shivanna, V.M.; Chen, J.S.; Guo, J.I. 360° Map Establishment and Real-Time Simultaneous Localization and Mapping Based on Equirectangular Projection for Autonomous Driving Vehicles. Sensors 2023, 23, 5560.
8. Vial, P.; Puig, V. Kinematic/Dynamic SLAM for Autonomous Vehicles Using the Linear Parameter Varying Approach. Sensors 2022, 22, 8211.
9. Luo, H.; Gao, Y.; Wu, Y.; Liao, C.; Yang, X.; Cheng, K. Real-Time Dense Monocular SLAM With Online Adapted Depth Prediction Network. IEEE Trans. Multimed. 2019, 21, 470–483.
10. McCormac, J.; Handa, A.; Davison, A.; Leutenegger, S. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation, Singapore, 29 May–3 June 2017; pp. 4628–4635.
11. Liang, H.J.; Sanket, N.J.; Fermüller, C.; Aloimonos, Y. SalientDSO: Bringing attention to direct sparse odometry. IEEE Trans. Autom. Sci. Eng. 2019, 16, 1619–1626.
12. Yang, S.; Scherer, S. CubeSLAM: Monocular 3-D object SLAM. IEEE Trans. Robot. 2019, 35, 925–938.
13. Wang, J.; Qi, Y. Simultaneous Scene-independent Camera Localization and Category-level Object Pose Estimation via Multi-level Feature Fusion. In Proceedings of the IEEE Conference on Virtual Reality and 3D User Interfaces, Shanghai, China, 25–29 March 2023; pp. 254–264.
14. Frost, D.; Prisacariu, V.; Murray, D. Recovering Stable Scale in Monocular SLAM Using Object-Supplemented Bundle Adjustment. IEEE Trans. Robot. 2018, 34, 736–747.
15. Jiao, J.; Cao, Y.; Song, Y.; Lau, R. Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 53–69.
16. Nekrasov, V.; Dharmasiri, T.; Spek, A.; Drummond, T.; Shen, C.; Reid, I. Real-time joint semantic segmentation and depth estimation using asymmetric annotations. In Proceedings of the International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 7101–7107.
17. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
18. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2320–2327.
19. Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163.
20. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890.
21. Klein, G.; Murray, D. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 225–234.
22. Von Stumberg, L.; Cremers, D. DM-VIO: Delayed marginalization visual-inertial odometry. IEEE Robot. Autom. Lett. 2022, 7, 1408–1415.
23. Song, S.; Lim, H.; Lee, A.J.; Myung, H. DynaVINS: A visual-inertial SLAM for dynamic environments. IEEE Robot. Autom. Lett. 2022, 7, 11523–11530.
24. Bescos, B.; Fácil, J.M.; Civera, J.; Neira, J. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robot. Autom. Lett. 2018, 3, 4076–4083.
25. Yu, C.; Liu, Z.; Liu, X.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments. In Proceedings of the International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 1168–1174.
26. Bescos, B.; Campos, C.; Tardós, J.D.; Neira, J. DynaSLAM II: Tightly-coupled multi-object tracking and SLAM. IEEE Robot. Autom. Lett. 2021, 6, 5191–5198.
27. Wang, J.; Qi, Y. Visual camera relocalization using both hand-crafted and learned features. Pattern Recognit. 2024, 145, 109914.
28. Long, R.; Rauch, C.; Zhang, T.; Ivan, V.; Vijayakumar, S. RigidFusion: Robot localisation and mapping in environments with large dynamic rigid objects. IEEE Robot. Autom. Lett. 2021, 6, 3703–3710.
29. Wu, W.; Guo, L.; Gao, H.; You, Z.; Liu, Y.; Chen, Z. YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint. Neural Comput. Appl. 2022, 34, 6011–6026.
30. Qiu, K.; Qin, T.; Gao, W.; Shen, S. Tracking 3-D motion of dynamic objects using monocular visual-inertial sensing. IEEE Trans. Robot. 2019, 35, 799–816.
31. Xiao, L.; Wang, J.; Qiu, X.; Rong, Z.; Zou, X. Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robot. Auton. Syst. 2019, 117, 1–16.
32. Song, B.; Yuan, X.; Ying, Z.; Yang, B.; Song, Y.; Zhou, F. DGM-VINS: Visual-Inertial SLAM for Complex Dynamic Environments with Joint Geometry Feature Extraction and Multiple Object Tracking. IEEE Trans. Instrum. Meas. 2023, 72, 8503711.
33. Zheng, Z.; Lin, S.; Yang, C. RLD-SLAM: A Robust Lightweight VI-SLAM for Dynamic Environments Leveraging Semantics and Motion Information. IEEE Trans. Ind. Electron. 2024, 71, 14328–14338.
34. Mousavian, A.; Pirsiavash, H.; Košecká, J. Joint semantic segmentation and depth estimation with deep convolutional networks. In Proceedings of the International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 611–619.
35. He, L.; Lu, J.; Wang, G.; Song, S.; Zhou, J. SOSD-Net: Joint semantic object segmentation and depth estimation from monocular images. Neurocomputing 2021, 440, 251–263.
36. Gao, T.; Wei, W.; Cai, Z.; Fan, Z.; Xie, S.Q.; Wang, X.; Yu, Q. CI-Net: A joint depth estimation and semantic segmentation network using contextual information. Appl. Intell. 2022, 52, 18167–18186.
37. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 746–760.
38. Song, S.; Lichtenberg, S.P.; Xiao, J. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 567–576.
39. Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
40. Qi, X.; Liao, R.; Jia, J.; Fidler, S.; Urtasun, R. 3D Graph Neural Networks for RGBD Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5209–5218.
41. Song, S.; Yu, F.; Zeng, A.; Chang, A.X.; Savva, M.; Funkhouser, T. Semantic Scene Completion from a Single Depth Image. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 190–198.
42. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
43. Kendall, A.; Cipolla, R. Geometric loss functions for camera pose regression with deep learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5974–5983.
44. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A Benchmark for the Evaluation of RGB-D SLAM Systems. In Proceedings of the International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012.
45. Hoyer, L.; Dai, D.; Chen, Y.; Koring, A.; Saha, S.; Van Gool, L. Three ways to improve semantic segmentation with self-supervised depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11130–11140.
46. Lopes, I.; Vu, T.H.; de Charette, R. Cross-task attention mechanism for dense multi-task learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 2329–2338.
47. Palazzolo, E.; Behley, J.; Lottes, P.; Giguere, P.; Stachniss, C. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In Proceedings of the International Conference on Intelligent Robots and Systems, Macau, China, 3–8 November 2019; pp. 7855–7862.
48. Du, Z.J.; Huang, S.S.; Mu, T.J.; Zhao, Q.; Martin, R.R.; Xu, K. Accurate dynamic SLAM using CRF-based long-term consistency. IEEE Trans. Vis. Comput. Graph. 2020, 28, 1745–1757.
49. Wang, K.; Yao, X.; Ma, N.; Jing, X. Real-time motion removal based on point correlations for RGB-D SLAM in indoor dynamic environments. Neural Comput. Appl. 2023, 35, 8707–8722.
50. Cheng, S.; Sun, C.; Zhang, S.; Zhang, D. SG-SLAM: A real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information. IEEE Trans. Instrum. Meas. 2022, 72, 7501012.
Method | RMSE | Rel | Log | δ < 1.25 | δ < 1.25² | δ < 1.25³ | Speed (fps)
---|---|---|---|---|---|---|---
Eigen et al. [39] | 0.641 | 0.158 | 0.214 | 0.769 | 0.950 | 0.988 | -
Mousavian et al. [34] | 0.816 | 0.200 | 0.314 | 0.568 | 0.856 | 0.956 | -
Nekrasov et al. [16] | 0.565 | 0.149 | 0.205 | 0.790 | 0.955 | 0.990 | 12.6
SOSD-Net [35] | 0.514 | 0.145 | - | 0.805 | 0.962 | 0.992 | -
CI-Net [36] | 0.504 | 0.129 | 0.181 | 0.812 | 0.957 | 0.990 | -
Hoyer et al. [45] | 0.622 | - | - | - | - | - | -
Lopes et al. [46] | 0.604 | - | - | - | - | - | -
Ours (0.5 m–3.0 m, no segmentation) | 0.358 | 0.148 | 0.148 | 0.811 | 0.961 | 0.988 | -
Ours (full depth, no segmentation) | 0.543 | 0.148 | 0.192 | 0.816 | 0.959 | 0.992 | -
Ours (0.5 m–3.0 m) | 0.335 | 0.138 | 0.132 | 0.821 | 0.968 | 0.994 | -
Ours (full depth) | 0.495 | 0.128 | 0.168 | 0.838 | 0.992 | 0.995 | 12.1
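The columns above follow the standard monocular depth evaluation protocol. For reference, here is a minimal NumPy sketch of these metrics under the usual definitions; note that the paper's exact evaluation code is not shown here, and "Log" is read as log-scale RMSE while some papers report the log10 error instead.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Standard monocular depth-estimation metrics (assumed definitions).

    `pred` and `gt` are depth maps in metres; pixels without ground truth
    (gt == 0) are masked out, mirroring common NYUv2 evaluation practice.
    """
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    ratio = np.maximum(p / g, g / p)
    return {
        "RMSE": float(np.sqrt(np.mean((p - g) ** 2))),
        "Rel": float(np.mean(np.abs(p - g) / g)),  # absolute relative error
        # "Log" taken as log-scale RMSE; some papers report log10 MAE instead.
        "Log": float(np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```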
Method | Pixel-Acc (%) | Mean-Acc (%) | IoU (%)
---|---|---|---
Eigen et al. [39] | 65.6 | 45.1 | 34.1
Mousavian et al. [34] | 68.6 | 52.3 | 39.2
Nekrasov et al. [16] | - | - | 42.0
SOSD-Net [35] | 72.2 | 62.5 | 43.3
CI-Net [36] | 72.7 | - | 42.6
Hoyer et al. [45] | - | - | 39.5
Lopes et al. [46] | - | - | 38.9
Ours (no depth estimation) | 69.8 | 53.2 | 40.7
Ours | 76.4 | 61.1 | 46.9
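For clarity on the segmentation columns, a minimal NumPy sketch of their conventional definitions computed from a confusion matrix follows; this assumes the standard formulas, as the paper's evaluation code is not given here.

```python
import numpy as np

def seg_metrics(conf: np.ndarray):
    """Pixel-Acc, Mean-Acc and IoU (all in %) from a KxK confusion matrix,
    rows indexed by ground-truth class, columns by predicted class."""
    tp = np.diag(conf).astype(float)
    gt_total = conf.sum(axis=1).astype(float)    # pixels per ground-truth class
    pred_total = conf.sum(axis=0).astype(float)  # pixels per predicted class
    pixel_acc = tp.sum() / conf.sum()
    # Guard against classes absent from the ground truth (denominator 0).
    mean_acc = float(np.mean(tp / np.maximum(gt_total, 1.0)))
    iou = tp / np.maximum(gt_total + pred_total - tp, 1.0)
    return 100.0 * pixel_acc, 100.0 * mean_acc, 100.0 * float(iou.mean())
```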
Static Scenes (Monocular), ATE RMSE (m)

Method | fr1_xyz | fr2_xyz | fr1_desk | fr3_office | Mean
---|---|---|---|---|---
ORB-SLAM2 [17] | 0.041 | 0.039 | 0.035 | 0.865 | 0.245
Luo et al. [9] | 0.027 | 0.026 | 0.192 | 0.635 | 0.220
Ours | 0.007 | 0.012 | 0.019 | 0.161 | 0.050

Static Scenes (RGB-D), ATE RMSE (m)

Method | fr1_xyz | fr2_xyz | fr1_desk | fr3_office | Mean
---|---|---|---|---|---
ORB-SLAM2 [17] | 0.005 | 0.004 | 0.016 | 0.010 | 0.009
Ours | 0.003 | 0.004 | 0.011 | 0.008 | 0.007

Dynamic Scenes (fr3_walking), ATE RMSE (m)

Method | xyz | static | rpy | half | Mean
---|---|---|---|---|---
ORB-SLAM2 [17] (RGB-D) | 0.459 | 0.090 | 0.662 | 0.351 | 0.391
DS-SLAM [25] (RGB-D) | 0.025 | 0.008 | 0.444 | 0.030 | 0.127
DynaSLAM [24] (RGB-D) | 0.015 | 0.006 | 0.035 | 0.025 | 0.020
ReFusion [47] (RGB-D) | 0.099 | 0.017 | - | 0.104 | -
LC-CRF SLAM [48] (RGB-D) | 0.016 | 0.011 | 0.046 | 0.028 | 0.025
DGM-VINS [32] (RGB-D) | 0.036 | 0.013 | 0.071 | 0.033 | 0.038
Wang et al. [49] (RGB-D) | 0.015 | 0.007 | 0.047 | 0.023 | 0.023
SG-SLAM [50] (RGB-D) | 0.015 | 0.007 | 0.032 | 0.020 | 0.019
Ours (Monocular) | 0.039 | 0.010 | 0.076 | 0.045 | 0.043
Ours (RGB-D) | 0.014 | 0.006 | 0.028 | 0.017 | 0.016
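Monocular trajectories are defined only up to scale, which is why the scale-restoring step of Section 4.4 matters and why ATE on TUM sequences is conventionally computed after a similarity alignment of the estimated trajectory to the ground truth. The sketch below assumes the standard closed-form Umeyama alignment followed by the ATE RMSE; the paper's exact evaluation protocol is not shown here.

```python
import numpy as np

def ate_rmse(est: np.ndarray, gt: np.ndarray, align_scale: bool = True) -> float:
    """ATE RMSE between estimated and ground-truth positions (both N x 3),
    after a closed-form similarity alignment (Umeyama)."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g                 # centred point sets
    U, S, Vt = np.linalg.svd(G.T @ E)            # unnormalized cross-covariance
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # reflection guard
    R = U @ D @ Vt                               # optimal rotation
    # Optimal scale (set to 1 for pure SE(3) alignment, e.g. RGB-D input).
    s = (S * np.diag(D)).sum() / (E ** 2).sum() if align_scale else 1.0
    t = mu_g - s * (R @ mu_e)                    # optimal translation
    residual = gt - (s * (R @ est.T).T + t)
    return float(np.sqrt((residual ** 2).sum(axis=1).mean()))
```

Setting `align_scale=False` gives the rigid-body (SE(3)) alignment typically used for RGB-D results, where the metric scale is already observable.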
Setting | Feature Fusion | Semantic Consistency | RMSE | IoU (%)
---|---|---|---|---
S1 | | | 0.517 | 43.8
S2 | ✓ | | 0.505 | 45.3
S3 | | ✓ | 0.508 | 45.5
S4 | ✓ | ✓ | 0.495 | 46.9
Setting | Feature Fusion | Semantic Consistency | Object-Level BA | xyz | static | rpy | half | Mean
---|---|---|---|---|---|---|---|---
S1 | | | | 0.044 | 0.021 | 0.082 | 0.049 | 0.049
S2 | ✓ | | | 0.037 | 0.019 | 0.077 | 0.044 | 0.044
S3 | | ✓ | | 0.033 | 0.016 | 0.064 | 0.039 | 0.038
S4 | | | ✓ | 0.029 | 0.010 | 0.061 | 0.033 | 0.033
S5 | ✓ | ✓ | | 0.028 | 0.011 | 0.052 | 0.030 | 0.030
S6 | ✓ | | ✓ | 0.015 | 0.007 | 0.030 | 0.017 | 0.017
S7 | | ✓ | ✓ | 0.024 | 0.010 | 0.049 | 0.029 | 0.028
S8 | ✓ | ✓ | ✓ | 0.014 | 0.006 | 0.028 | 0.017 | 0.016