Review
Peer-Review Record

Deep Learning for Visual SLAM: The State-of-the-Art and Future Trends

Electronics 2023, 12(9), 2006; https://doi.org/10.3390/electronics12092006
by Margarita N. Favorskaya
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 21 February 2023 / Revised: 31 March 2023 / Accepted: 23 April 2023 / Published: 26 April 2023
(This article belongs to the Special Issue Enhanced Perception in Robotics Control and Manipulation)

Round 1

Reviewer 1 Report

This paper reviews SLAM technology, and this review paper seems likely to be of interest in practical fields. The paper looks well organized and well written, but I have some questions, suggestions, and comments, as follows.

1. Could you please add an analysis of VSLAM technologies focusing on a comparison of various techniques?

2. In Table 5, the short descriptions need to be shortened further.

3. A "Conclusion" section needs to be added, or the last section's name needs to be changed. Then, the last section needs to include some analysis of the drawbacks and advantages (i.e., pros and cons) of deep learning-based VSLAM.

4. A flow diagram or figure that shows the contribution of this paper needs to be inserted to increase readability.

Author Response

Dear Editor and Reviewers, I would like to express sincere appreciation for your letter and the reviewers’ constructive comments concerning the article entitled “Deep Learning for Visual SLAM: The State-of-the-Art and Future Trends” (Manuscript ID: electronics-2265708). All of these comments are valuable and helpful in improving the survey. In accordance with the Associate Editor’s and Reviewers’ comments, extensive modifications have been made to the manuscript. In this revised version, all changes in the manuscript are highlighted in blue text. Point-by-point responses to the reviewers are listed below this letter.

Reviewer #1

General comments

Comments and Suggestions for Authors

This paper reviews SLAM technology, and this review paper seems likely to be of interest in practical fields. The paper looks well organized and well written, but I have some questions, suggestions, and comments, as follows.

Response

Thank you very much for the valuable time spent reading our paper and for your insightful feedback. We sincerely appreciate your positive comments on the manuscript.

Comment #1

  1. Could you please add an analysis of VSLAM technologies focusing on a comparison of various techniques?

Thank you for your comment. The following paragraph has been improved.

VSLAM technologies can be formally classified as visual-only, visual-inertial, or RGB-D-based. Visual-only VSLAM systems use monocular or stereo cameras to process 2D images and are considered the most studied and least expensive systems, even though stereo cameras cost more than monocular cameras. The main benefit of stereo cameras is that they provide real information about scene depth and pose in indoor and/or outdoor environments. This is currently the most researched field in VSLAM. For example, the MonoSLAM algorithm utilized an extended Kalman filter on visual measurements to estimate the ego-motion of the camera and the 3D coordinates of feature points in the scene [9]. In addition to MonoSLAM, many advanced but similar algorithms have been developed, such as parallel tracking and mapping [10], collaborative visual SLAM [11], monocular/stereo visual ORB-SLAM [12], and RKSLAM, which is robust to fast camera motions [13], among others. Visual-inertial VSLAM systems provide rich information about the angular velocity, acceleration, and the magnetic field around the devices, which allows the position of the sensors to be assessed accurately in dynamic scenes. However, the fusion of visual data and inertial measurements remains a real problem, far from a reasonable implementation at the algorithmic and software levels. RGB-D sensors, comprising a monocular RGB camera and a depth sensor, simultaneously generate a color image and a dense depth map, which helps significantly in pose estimation and mapping. Thus, the RGB-D-based VSLAM architecture is simplified. However, this approach is only suitable for indoor environments due to the limitations of the depth sensor, and it requires large memory and power consumption. Sometimes other distance sensors are used instead of RGB-D sensors, such as SONAR sensors (for underwater UAVs), 2D laser scanners, or LiDAR scanners. In this case, data fusion is also necessary.
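As a purely illustrative aside, the filtering machinery behind MonoSLAM-style estimation reduces, in its simplest linear form, to a predict/update loop. The sketch below is a minimal Kalman-filter loop with a hypothetical constant-velocity state and arbitrary noise levels; it is not the actual MonoSLAM state, which also carries the 3D feature coordinates and uses nonlinear measurement models.

```python
import numpy as np

# State x = [px, py, vx, vy]; all matrices and noise levels are hypothetical.
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)   # constant-velocity motion model
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)    # we observe position only
Q = 0.01 * np.eye(4)                         # process noise
R = 0.10 * np.eye(2)                         # measurement noise

def predict(x, P):
    # Propagate state and covariance through the motion model.
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    # Correct the prediction with a position measurement z.
    y = z - H @ x                            # innovation
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

x, P = np.zeros(4), np.eye(4)
for z in [np.array([1.0, 0.0]), np.array([2.0, 0.1]), np.array([3.0, 0.2])]:
    x, P = predict(x, P)
    x, P = update(x, P, z)
# x now tracks the measured track and carries a positive estimated x-velocity.
```

The extended variant used in MonoSLAM replaces F and H by Jacobians of nonlinear motion and projection functions, but the predict/update structure is the same.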

Comment #2

  2. In Table 5, the short descriptions need to be shortened further.

Response

The content of the “Short description” attribute in Table 5 has been made shorter:

The dataset contains image pairs captured in outdoor scenes with ground truth labeling

The dataset is a synthetic dataset for training and testing autonomous driving models

The data was recorded as color and depth frames with the ground truth trajectories

The dataset includes physical scenes for segmenting the visible areas of objects

The dataset allows evaluation of dense tracking and mapping, as well as relocalization

The dataset presents high-resolution stereo images over a 36.8 km trajectory

The dataset includes labeled scenes with a living room and an office

The dataset consists of stereo images synchronized with IMU measurements

The dataset contains long trajectories in outdoor scenes with complex weather conditions

The dataset contains complex street scenes from 50 different cities

The dataset consists of photo-realistic 3D indoor scenes with AI agent navigation

The dataset is a set of annotated 3D indoor reconstructions

The dataset consists of varying conditions and traffic densities with complex scenarios

The dataset provides a large amount of synchronized data corresponding to flight records

The dataset provides RGB and depth images with semantic maps for reference

Comment #3

  3. A "Conclusion" section needs to be added, or the last section's name needs to be changed. Then, the last section needs to include some analysis of the drawbacks and advantages (i.e., pros and cons) of deep learning-based VSLAM.

Response

Thank you for your comment. The following generalized sentence has been added in Conclusions.

Generally speaking, deep learning models offer opportunities for efficient processing of visual data in real time but, at the current stage of technology development, still have limitations in fusing data obtained from different types of sensors.

 

Comment #4

  4. A flow diagram or figure that shows the contribution of this paper needs to be inserted to increase readability.

Response

Thank you for your suggestion. A structure diagram has been added.

Author Response File: Author Response.pdf

Reviewer 2 Report

This manuscript is a review paper on the field of visual SLAM, specifically on deep learning approaches. The authors provide a comprehensive survey of the existing literature and introduce recent state-of-the-art techniques. A large part of the content describes the currently available methods, citing more than a hundred papers, but most of them are only briefly described. In the reviewer's opinion, knowing all these models does not benefit the audience. It is suggested to focus on fewer, more important works instead of providing encyclopedia-type statements. As a survey paper, although the main focus is deep learning-based approaches for SLAM, it is also suggested to provide a brief discussion and references related to general robot localization and visual odometry techniques. For instance, it would be suitable to bring out approaches combined with other methods, such as UWB incorporated with visual-inertial odometry, UWB integrated with visual SLAM, etc. The authors can also discuss the relationships between SLAM and sparse visual odometry for finding pose. As indicated in the title, future trends of deep learning-based SLAM are a main theme of this paper, but there is relatively little content on them. In addition to some itemized aspects with problems and possibilities, it is suggested to provide several key papers for reference and discussion.

Author Response

Dear Editor and Reviewers, I would like to express sincere appreciation for your letter and the reviewers’ constructive comments concerning the article entitled “Deep Learning for Visual SLAM: The State-of-the-Art and Future Trends” (Manuscript ID: electronics-2265708). All of these comments are valuable and helpful in improving the survey. In accordance with the Associate Editor’s and Reviewers’ comments, extensive modifications have been made to the manuscript. In this revised version, all changes in the manuscript are highlighted in blue text. Point-by-point responses to the reviewers are listed below this letter.

Reviewer #2

Comment #1

This manuscript is a review paper on the field of visual SLAM, specifically on deep learning approaches. The authors provide a comprehensive survey of the existing literature and introduce recent state-of-the-art techniques. A large part of the content describes the currently available methods, citing more than a hundred papers, but most of them are only briefly described. In the reviewer's opinion, knowing all these models does not benefit the audience. It is suggested to focus on fewer, more important works instead of providing encyclopedia-type statements.

Response

Thank you for your comment. However, in the current review, only 46 articles that deserve attention are “simply described”; please see Tables 2-4. We focus readers' attention on the latest high-quality articles in the area under discussion. A 28-page review article cannot claim the title of an encyclopedia.

Comment #2

As a survey paper, although the main focus is deep learning-based approaches for SLAM, it is also suggested to provide a brief discussion and references related to general robot localization and visual odometry techniques. For instance, it would be suitable to bring out approaches combined with other methods, such as UWB incorporated with visual-inertial odometry, UWB integrated with visual SLAM, etc.

Response

Thank you for your comment. General robot localization is more related to traditional SLAM problems. Deep learning-based visual odometry techniques are discussed in Section 1 and Sections 3.1-3.3. At present, it is difficult to find adequate information about UWB integrated with visual SLAM. For example, Ching et al. [Ching, P.L.; Tan, S.C.; Ho, H.W. Ultra-wideband localization and deep-learning-based plant monitoring using micro air vehicles. Journal of Aerospace Information Systems 2022, 19(11). https://doi.org/10.2514/1.I011075] used the Tiny-YOLOv4 model in parallel with UWB localization, even though much newer YOLO models (up to YOLOv8) have since appeared. The article [Wei, J.; Wang, H.; Su, S.; Tang, Y.; Guo, X.; Sun, X. NLOS identification using parallel deep learning model and time-frequency information in UWB-based positioning system. Measurement 2022, 195, 111191] does not discuss VSLAM problems.

To the best of our knowledge, it is too early to talk about real integrated techniques.

Comment #3

The authors can also discuss the relationships among SLAM and sparse visual odometry for finding pose.

Response

Thank you for your comment. We believe that the relationship between SLAM and sparse visual odometry deserves special consideration in the form of a separate review covering the entire range of methods. The application of deep learning models to sparse visual odometry is at an early stage of development, and progress so far is modest.

Comment #4

As indicated in the title, future trends of deep learning-based SLAM are a main theme of this paper, but there is relatively little content on them. In addition to some itemized aspects with problems and possibilities, it is suggested to provide several key papers for reference and discussion.

Response

Thank you for your comment. We note, however, that the title of the survey is “The State-of-the-Art and Future Trends”, not just “Future Trends”. Eighteen recent surveys are provided in Table 1, and the 12 surveys published in 2022-2023 can be considered key papers.

Author Response File: Author Response.pdf

Reviewer 3 Report

This long paper is well written, and all the ideas are clearly presented. It is easy to read and follow the key ideas. This survey contributes by providing a historical development of deep-based VSLAM tasks, a comprehensive classification of recent VSLAM methods based on deep learning integration, and descriptions of multi-modal VSLAM datasets. Additionally, it offers a critical analysis of the advantages, disadvantages, and future research directions in this field.

Reviewer 4 Report

It is evident that evaluating this survey paper can be quite burdensome for a reviewer.

However, the content seems to be reasonably sufficient for a survey paper, and through the paper revision process, the accuracy of the content and expression has been adequately ensured.

Furthermore, the paper appears to be up-to-date with the latest information.

The author, Margarita N. Favorskaya, has a track record of writing numerous papers in this field and has also authored related books, which cautiously suggests the paper's suitability as a survey paper.

Round 2

Reviewer 1 Report

The revised manuscript is well written. The authors have addressed the reviewer's comments. I recommend that this manuscript be accepted.

Author Response

Dear Editor and Reviewers,

First of all, we would like to express our thanks to the reviewers for their valuable comments on this manuscript. The revised version of the manuscript entitled “Deep Learning for Visual SLAM: The State-of-the-Art and Future Trends” (Manuscript ID: electronics-2265708) is attached. We believe that the revised survey has become more comprehensive and responds to the comments of the reviewers. To facilitate the next review round, the changes are highlighted in green.

Reviewer #1

General comments

The revised manuscript is well written. The authors have addressed the reviewer's comments. I recommend that this manuscript be accepted.

Response

Thank you very much for your opinion; we greatly appreciate it.

Author Response File: Author Response.pdf

Reviewer 2 Report

As a survey paper, the reviewer does not think it is very informative to researchers in the field. The authors did provide some tables describing recent techniques; however, the descriptions of the approaches are relatively shallow. There have been a large number of SLAM algorithms in recent years. It is suggested to focus on a limited number of state-of-the-art methods and provide a more in-depth discussion.

Author Response

Dear Editor and Reviewers,

First of all, we would like to express our thanks to the reviewers for their valuable comments on this manuscript. The revised version of the manuscript entitled “Deep Learning for Visual SLAM: The State-of-the-Art and Future Trends” (Manuscript ID: electronics-2265708) is attached. We believe that the revised survey has become more comprehensive and responds to the comments of the reviewers. To facilitate the next review round, the changes are highlighted in green.

We thank the Reviewer for the review and for appreciating the presentation of our contribution. We have clarified the issues in the revised manuscript.

Reviewer #2

General comments

As a survey paper, the reviewer does not think it is very informative to researchers in the field.

Response

The motivation for writing a survey article for the Special Issue "Enhanced Perception in Robotics Control and Manipulation" was a gap in the coverage of deep learning methods in VSLAM. The goal is two-fold. First, we want to inform researchers in the VSLAM field how deep learning can help and even change the traditional VSLAM problem statement. Second, we want to attract researchers in the deep learning field to actively solve VSLAM problems. As is well known, in recent years thousands of papers based on deep learning have been published in the fields of object detection, pattern recognition, face recognition, video surveillance, etc. At the same time, only a few dozen papers in the field of deep learning-based VSLAM have been published since 2017.

Comment #1

The authors did provide some tables describing recent techniques; however, the descriptions of the approaches are relatively shallow.

Response

Thank you for your comment. We agree that the descriptions of the approaches in this survey are relatively short, but not shallow. A survey should not contain detailed information about the methods; relevant links to the full research articles are provided. On the other hand, there is enough information for developers in the field of deep learning: the essentials, such as the core idea, the base deep model, the type of learning, and the datasets, are mentioned for each approach.

Comment #2

There have been a large number of SLAM algorithms in recent years. It is suggested to focus on a limited number of state-of-the-art methods and provide a more in-depth discussion.

Response

Thank you for your comment. We fully agree with the Reviewer about the large number of SLAM algorithms; this topic deserves to be covered in a handbook, not a survey. Thus, we focused first on a limited topic (visual SLAM) and, second, on an even more limited topic (deep learning in visual SLAM). It is unreasonable to describe traditional VSLAM methods, as has been done in many recent surveys. For example, in [Chen, W.; Shang, G.; Ji, A.; Zhou, C.; Wang, X.; Xu, C.; Li, Z.; Hu, K. An overview on visual SLAM: From tradition to semantic. Remote Sensing 2022, 14, 3010.1-3010.47. https://doi.org/10.3390/rs14133010], only 41 of the 295 articles in the References can be directly related to deep learning in VSLAM, and in [Tang, Y.; Zhao, C.; Wang, J.; Zhang, C.; Sun, Q.; Zheng, W.X.; Du, W.; Qian, F.; Kurths, J. Perception and navigation in autonomous systems in the era of learning: A survey. IEEE Transactions on Neural Networks and Learning Systems 2022, 1-21 (early access). https://doi.org/10.1109/TNNLS.2022.3167688], only 49 (published mainly in 2015-2018) of the 271 articles in the References can be directly related to deep learning in VSLAM. In the presented survey, 52 (published mainly in 2020-2022) of the 135 articles in the References (actually 120 articles, because 15 links refer to datasets) can be directly related to deep learning in VSLAM. The survey contains a discussion of high-quality articles in the field of deep learning-based VSLAM methods.

We have added a few comments in Section 5, “Discussion and Future Trends”, marked in green.

This survey presents the current advances in the field of deep-based VSLAM methods since 2017, focusing on two aspects. First, high-quality studies show how deep learning paradigms help to solve the VSLAM tasks and even change traditional VSLAM problem statements. Second, the new approaches proposed in the articles mentioned above open up great perspectives for future investigations. Every year, deep learning models improve their performance and demonstrate new capabilities for solving more and more complex problems. Obviously, the implementation of deep learning methods in VSLAM is currently far from desirable, but the first steps are very promising.

Recently, three ways to develop deep learning-based VSLAM software components have been identified, with different degrees of implementation: auxiliary modules, original deep learning modules, and end-to-end deep neural networks. The development of auxiliary deep-based modules accounts for most of the published studies, including feature extraction [48,49,50], semantic segmentation [51,52,53,54,55,56,57,58,59,60], pose estimation [8,45,46,61,62,63], map construction [3,64,65,66], and loop closure [67,68,69,70]. It should be noted that deep neural networks extract low-level features from images and convert them to high-level ones layer by layer. Thus, deep learning “changes” the term “feature extraction” from conventional keypoint extraction to complex tasks, such as matching keypoints of a 2D image to 3D LiDAR points [48], keypoint extraction from an optical flow [49], extraction of image patches using the famous ORB-SLAM algorithm [50], etc. Semantic segmentation seems to be a more explored area, with semantic filtering [51,52], object detection followed by semantic segmentation in static and dynamic environments [55,56,57], and scene representation [58,59] being the main approaches. Deep learning-based pose estimation is a wide area of study in many scientific fields, but only a few approaches have been implemented in VSLAM systems, related to VO tasks [8,45,62], the ego-motion of the camera [46,61], and low-illumination conditions [63]. Currently, map construction is not well explored by deep learning paradigms and is represented by several attempts to incorporate optical flow networks, RNNs, and stereo vision into validated SLAM systems; better results are achieved by combining depth, LiDAR, and optical data. Auxiliary modules in the loop closure eliminate the influence of moving objects in the scene [67,70] and extract keyframes to search for the correct trajectory [68,69].

There are several studies devoted to the development of original deep learning modules for camera relocalization [82], distance estimation [83], object segmentation [84,85], path planning [86], and scene reconstruction [87]. The architecture of original deep learning modules becomes more complex when multiple deep neural network models are used in serial or parallel pipelines with RNNs or GANs. It was shown in [84] that the accuracy and robustness of the proposed DDL-SLAM model outperform those of the ORB-SLAM2 model in highly dynamic scenarios. At the same time, the DDL-SLAM model has several limitations in real-time performance and scene inpainting.

Obviously, the development of end-to-end deep neural networks is the most promising approach for VSLAM systems, since self-supervised learning and reinforcement learning provide a high adaptive ability in a real dynamic environment. Interesting experimental results were obtained in VO/VIO [96,97,98,99] and ego-motion tasks [94,95]. Sometimes traditional methods such as the Kalman filter [47] or the Savitzky-Golay filter [91] are combined with end-to-end deep models, which provides improved results; these are the so-called hybrid approaches. Some end-to-end deep neural networks have original applications, for example, in surgery [90], UAV pose estimation [91,98], autonomous underwater vehicles [100], drone navigation and height mapping [102], etc.
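As an illustration of such a hybrid step (not the pipeline of [91] itself), a Savitzky-Golay filter can smooth the noisy per-axis output of a pose network before it is used downstream. The signal, noise level, window length, and polynomial order below are arbitrary assumptions for the sketch.

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical 1D coordinate track: a smooth ground-truth motion plus the
# kind of per-frame jitter an end-to-end pose network might produce.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 101)
truth = np.sin(t)                                   # assumed true x-coordinate
noisy = truth + 0.1 * rng.standard_normal(t.size)   # network-style jitter

# Savitzky-Golay: fit a local cubic in an 11-sample window at each point.
smooth = savgol_filter(noisy, window_length=11, polyorder=3)

# The smoothed track should sit closer to the underlying signal.
err_raw = np.mean((noisy - truth) ** 2)
err_smooth = np.mean((smooth - truth) ** 2)
```

Unlike a plain moving average, the polynomial fit preserves local curvature, which matters for trajectories with turns.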

It should be noted that the implementation of deep learning in VSLAM systems is a very complex process. However, impressive results have recently been achieved: the Absolute Trajectory Error (ATE) metric improves by a factor of 50 with auxiliary deep modules compared to without them [57]; depth reconstruction estimates, in terms of time and accuracy, are better with the DRM-SLAM model [65]; and the same holds for precision-recall results on different datasets in the loop closure problem [70].
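For reference, a simplified sketch of how an ATE value can be computed. Full benchmarks usually apply a rigid Umeyama alignment between the estimated and ground-truth trajectories; this illustrative version removes only the mean translation, and the `ate_rmse` helper and toy trajectories are hypothetical.

```python
import numpy as np

def ate_rmse(estimated, ground_truth):
    """RMSE of per-frame position errors after mean-translation alignment."""
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    # Simplified alignment: cancel the constant offset between the tracks.
    est_aligned = est - est.mean(axis=0) + gt.mean(axis=0)
    errors = np.linalg.norm(est_aligned - gt, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
est = gt + np.array([0.5, 0.0])   # a constant offset is removed by alignment
print(ate_rmse(est, gt))          # → 0.0
```

A per-frame drift, by contrast, survives the alignment and raises the score, which is exactly what ATE is meant to capture.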

Author Response File: Author Response.pdf
