Article
Peer-Review Record

DM-SLAM: Monocular SLAM in Dynamic Environments

Appl. Sci. 2020, 10(12), 4252; https://doi.org/10.3390/app10124252
by Xiaoyun Lu 1,2, Hu Wang 1,2,*, Shuming Tang 2,3, Huimin Huang 1,2 and Chuang Li 1,2
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 11 May 2020 / Revised: 14 June 2020 / Accepted: 16 June 2020 / Published: 21 June 2020
(This article belongs to the Section Robotics and Automation)

Round 1

Reviewer 1 Report

Compared to the previous version, the presentation of the paper is vastly improved. The proposed methods, DLRSAC and the feature candidate selection, while not groundbreaking, provide a reasonable boost to the efficiency of the SLAM algorithm.

The paper is well structured and easy to follow.

 

Author Response

Dear reviewer:

   It is our honor to receive your approval, and we hope to have more exchanges with you in future research.

Reviewer 2 Report

This paper proposes an approach to perform monocular SLAM in dynamic environments (scenes containing movement by objects and the camera itself).

While the contribution of the paper appears to be generally relevant, there are some weaknesses that should be addressed, including:

  • It is unclear whether ORB-SLAM or ORB-SLAM2 is used in the proposed approach; it seems that ORB-SLAM has been used. If this is the case, why is ORB-SLAM2 not used? Indeed, whilst the authors state that "ORB-SLAM has excellent and robust performance in most practical situations", it would be beneficial to evaluate other systems to definitively show the advantages of ORB-SLAM/ORB-SLAM2 with respect to other methods.
  • Furthermore, the overall evaluation could be improved. Whilst the focus of the paper is on dynamic environments, other environments could also be included: not only scenes in the TUM RGB-D dataset as used in the paper, but also those in other datasets (in a real-world implementation, the algorithm could be faced with any type of situation). Indeed, more videos depicting dynamic environments could have been used too (more videos are available in the TUM RGB-D dataset; why were these not used?). Lastly, comparison to other methods proposed in the literature is rather limited, with virtually no direct performance evaluations being made.
  • The authors state that "Since the camera frame rate is high enough, the change of the scene including the moving objects isn’t obvious, which makes distinguishing moving objects very challenging for algorithms."; why couldn't the frame rate be reduced, e.g. by using only alternate frames (e.g. odd-numbered frames)?
  • The authors state that "there is no obvious improment when compared DM-SLAM with ORB-SLAM2. In sequence fr3_walking, the RMSE and S.D. improvement values of ATE can reach up to 96.42%, 98.45%."; these two sentences seem contradictory, and clarification is required.
  • It would be desirable to include links to videos showing the behaviour of the proposed system (i.e. the feature points chosen).
  • Lastly, while the paper is generally understandable, there are numerous typos and grammatical errors throughout the paper (including figures, e.g. Figure 1: Initilized -> Initialized); these should be addressed, preferably by a native English speaker. It should also be ensured that the flow of the paper is good; for example, in line 80, the authors state that "Presently, we propose a distribution and local-based RANSAC algorithm (DLRSAC) to address this problem."; however, the authors were previously describing methods proposed in the literature, and thus the "problem" is unclear (it had been mentioned several paragraphs earlier). The authors could also consider renaming the last section from "Discussion" to "Conclusion". The formatting also leaves much to be desired: figures and tables sometimes overflow from one page to another, while the text font is sometimes inconsistent. Moreover, some figures are unclear; for instance, in Figure 3, it is hard to see the points represented by yellow and green circles. It should also be ensured that references in the text to any figures or tables are correct; for example, Table II-IV does not seem to exist, and it seems the authors are referring to Table 3 instead. Naming conventions should also be consistent: some references to tables in the text use Roman numerals, while the table numbers themselves use Arabic numerals.
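The frame-rate suggestion in the third bullet above can be illustrated with a minimal sketch (hypothetical, not part of DM-SLAM): keeping only every second frame doubles the effective inter-frame motion, which could make moving objects easier to distinguish.

```python
def subsample_frames(frames, step=2):
    """Keep every `step`-th frame (step=2 keeps frames 0, 2, 4, ...),
    halving the effective frame rate so inter-frame motion is larger."""
    return frames[::step]

# A 30 fps sequence becomes an effective 15 fps sequence:
frames = list(range(10))          # stand-ins for video frames
print(subsample_frames(frames))   # [0, 2, 4, 6, 8]
```

The trade-off, of course, is that larger inter-frame motion also stresses feature matching and pose tracking, which may be why the authors did not adopt it.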

In conclusion, the paper should be revised prior to reconsideration for publication.

Author Response

Dear reviewer:

   The authors appreciate the reviewers' valuable comments. All the comments were considered seriously, and our manuscript has been revised carefully in accordance with them where possible. In the attachment, we respond to the reviewers' queries point by point and describe the corrections made to the article.


Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

The authors have addressed some, but not all, of the concerns of this reviewer.

In particular, the evaluation does not seem to have been changed, and reasons to at least justify the evaluation protocol were not given either (if point-by-point answers were given in response to each of the previous version's comments, these were not received). 

The authors also did not consider the query regarding the frame rates, and the recommendation to publish videos still stands. 

Lastly, while this reviewer appreciates the improvements to the manuscript formatting, several grammatical errors still remain; for example, even in the revised Figure 1, the incorrect word was changed to another incorrect word (previously 'Initilized', now 'Initailized'; it should be 'Initialized', as mentioned in the previous report). The authors need to ensure that such typos, grammatical errors, and general issues with the use of the English language are rectified; this reviewer reiterates the recommendation to employ a native English speaker to review the manuscript. Moreover, some formatting issues still remain, such as the difficulty of seeing the yellow/green circles in some of the images.

Author Response

Dear reviewer:

     We responded to the questions from the previous round and made targeted changes to the article. Regarding the evaluation method used in the article, we used the evaluation method provided by the TUM CVPR team; it is also the method used by ORB-SLAM2 and other classic frameworks such as DynaSLAM and DS-SLAM. Details can be found in the cover letter. If there is a better evaluation method, we welcome your suggestions.

Author Response File: Author Response.docx

Round 3

Reviewer 2 Report

As mentioned in the previous reports, the concern of this reviewer with the evaluation is NOT with the metrics used for evaluation (ATE, RPE, etc.) per se, but with (i) the lack of comparison with other methods proposed in the literature and (ii) the limited number of sequences and datasets used. Indeed, even the DynaSLAM and DS-SLAM papers mentioned by the authors include more evaluations. This reviewer is simply asking why there are no further comparisons with other methods and, especially, why more data is not used (more scenes within the TUM RGB-D dataset and other datasets such as KITTI).

There also seem to be some substantial differences between the results of ORB-SLAM2 and those published in the literature; the authors could indicate why this is the case and provide further details as to how it was evaluated. For example, unless this reviewer missed the mention in the manuscript, the authors do not seem to state the number of times that the algorithms (ORB-SLAM2 and the proposed method) were run in obtaining the results; in the original paper [1], ORB-SLAM2 was run 5 times, while in [2] it was run 10 times to account for the exacerbation of the non-deterministic nature of the algorithm caused by the larger movements of objects in dynamic scenes.
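The multi-run protocol the reviewer refers to can be sketched as follows. This is a hypothetical illustration using only the standard library; `run_slam` stands in for one full evaluation run returning per-frame trajectory errors, and the aggregation (median here) is an assumption, since [1] and [2] report a single aggregated figure per sequence.

```python
import math
import statistics

def ate_rmse(errors):
    """RMSE of per-frame absolute trajectory errors (metres)."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def median_ate_over_runs(run_slam, n_runs=5):
    """Run the nondeterministic SLAM pipeline n_runs times and report
    the median ATE RMSE, absorbing run-to-run variation."""
    return statistics.median(ate_rmse(run_slam()) for _ in range(n_runs))

# Hypothetical usage: `fake_run` stands in for one SLAM evaluation.
fake_run = lambda: [3.0, 4.0]
print(round(median_ate_over_runs(fake_run), 4))  # 3.5355
```

Reporting which aggregate is used (median, mean, or best of N) is exactly the kind of detail the reviewer is asking the authors to state.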

With regards to the lack of clarity in some of the images, the authors have stated that "The reason why we can’t see some of the images clearly is that our original picture has higher resolution, which is reduced too many times"; but that does not really solve the problem. Indeed, the issue stems from both the resolution and the colours used. At the very least, the original high-resolution images could be made available online, although of course it would be preferable if the images were clear in the manuscript itself.

Finally, there still remain several typos/incorrect terms. 

 

References

[1] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,” IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.

[2] B. Bescos, J. M. Fácil, J. Civera and J. Neira, “DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, Oct. 2018.

Author Response

Dear reviewer:

   We really appreciate the reviewers' valuable comments. We have modified the article based on them and added the corresponding experiments and experiment descriptions. The specific reply letter can be found in the attachment.

Author Response File: Author Response.docx

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

The paper proposes a new monocular SLAM system, based on ORB-SLAM, capable of dealing with dynamic scenes. The authors propose two extensions/alterations of the original system:

DLRSAC: an algorithm for finding a consensus on the fundamental matrix estimation, based on splitting the image into a rectangular grid, estimating local fundamental matrices, and finding the global agreement;

a feature initialization method based on neighborhood filtering.

Although neither of those is highly complicated or innovative, they seem to provide the necessary performance boost to the system.

The paper is well structured, the design of the research is appropriate, and the conclusions are supported by the results. The main drawback of the paper is its level of English: extensive editing, preferably done by a native speaker, is needed before the paper can be published.
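The grid-splitting step summarized above can be sketched in outline. This is a hypothetical illustration, not the authors' implementation: feature correspondences are bucketed into rectangular cells, and a local model (here left abstract) would then be estimated per cell before seeking global agreement.

```python
from collections import defaultdict

def bucket_features(points, width, height, rows=4, cols=4):
    """Assign 2D feature points to cells of a rows x cols grid.
    Each cell's points would feed a local model estimate (e.g. a
    local fundamental matrix) in a DLRSAC-style scheme."""
    cells = defaultdict(list)
    cw, ch = width / cols, height / rows
    for x, y in points:
        c = min(int(x // cw), cols - 1)   # clamp border points into the grid
        r = min(int(y // ch), rows - 1)
        cells[(r, c)].append((x, y))
    return cells

# Three features in a 640x480 image land in three different cells:
cells = bucket_features([(10, 10), (630, 470), (320, 240)], 640, 480)
print(sorted(cells))  # [(0, 0), (2, 2), (3, 3)]
```

The appeal of such a scheme is that points within one cell are more likely to lie on a single rigid motion, which is the locality assumption the later reviews debate.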

 

Author Response

 We are grateful for the reviewer's valuable comments.

 We will revise and polish the English writing of the full text according to your comments.

We think that although the method we take is simple, the idea of "separation-integration" behind it provides a new way to deal with dynamic environments. Its validity is verified by the experiments in the article.

Reviewer 2 Report

This paper addresses the SLAM problem with a monocular camera in dynamic environments. This issue is challenging, and a solution can be particularly useful when only a monocular camera is available. The authors insist that this paper makes contributions on RANSAC and map updates, but the paper lacks novelty and its presentation is not clear enough. The experimental results seem to show the effect of the proposed method; however, they are not thorough enough to validate the contribution and lack persuasive analysis. Finally, the paper is poorly written and hard to follow.

 

Line 87: This paper mentions that this problem has chicken-and-egg characteristics. Why is this a chicken-and-egg problem? The authors do not describe it and do not refer to any related papers. When we say "chicken-and-egg problem" in the computer vision community, it generally refers to a set of problems that are closely connected, such that solving one helps solve the other. But I do not see why this is a chicken-and-egg problem.

 

Line 110: The observation that closer features are more likely to belong to the same model is not general, and I think it is not always true, although it might be somewhat valid in particular situations.

 

Equation (2): This cost function seems to be the main contribution of this paper, but there is no concrete explanation of or intuition for it.

 

Lines 115-116: This seems to be a measure for key point selection for finally updating maps. The approach is very heuristic and not elegant. It can be done on the engineering side, but it is not appropriate as an academic contribution.

Author Response

Dear reviewer:

      We appreciate the reviewer's valuable comments and have replied to them point by point in the cover letter. As for the English writing, we will use the extensive English editing service. Thank you very much for reviewing; any questions are welcome.

Yours

Lu Xiaoyun.

Author Response File: Author Response.docx

Reviewer 3 Report

The paper presents a novel method for visual monocular SLAM in dynamic environments. Although the results seem promising, the paper is written in low-quality English and contains many errors and insufficient or poor explanations, which make the paper's ideas difficult to follow.

The structure of the introduction should be improved. The related work mostly refers to the specifics of other algorithms, but it is not very clear how the proposed method differs from them; a clearer problem definition could help with this. It is hard to connect the related work in the introduction with the methods used for comparison in chapter 3. For example, ARSAC is mentioned for the first time in chapter 3, and only by reference number is it possible to tell that it is the same method mentioned in the introduction. It is not explained at all what ARSAC and PARSAC are. This reviewer believes that more context should be given: in the submission's current state, only readers familiar with the referenced methods are able to read this paper without further questions.

The significance of the data sets used remains unexplained. What do the fr3_walking_xyz, fr3_walking_rpy, fr3_walking_half, and fr1_xyz sequences represent? What are their characteristics? Also, their evaluation remains unclear. The authors refer to source [17] for the calculation method of the improvements, but that is not enough (especially because it is not clear how the referenced source is used to calculate improvements other than via the absolute trajectory metric). Why are all the error calculations necessary if there is almost no analysis of the results? Also, SLAM implies building a map. Is there some visual example of the resulting map, or at least the trajectory, for both methods?

DM-SLAM is compared with ORB-SLAM2, for which there is no reference. As far as this reviewer is aware, ORB-SLAM2 is not intended for dynamic environments. Then why compare the proposed method to this algorithm and not to the one proposed in reference [9]?

The figures are not of high enough quality. Their intention is to demonstrate differences between methods, but 1) the yellow and green circles on yellow books are hard to see (Figure 3), and 2) the figures are low resolution: even when zoomed in, the circles are blurred and hard to see. Larger, higher-resolution pictures (and another color for the books) might help with this problem.

Some concepts are often used but their meaning is not sufficiently explained, e.g. 'nature difference', 'chicken-and-egg problem', 'segmentation and reconstruction'. The formulation of lines 200-218 is very confusing. The definitions in lines 124-135 and the algorithm in lines 138-152 should be at least partly visualized with examples; in their current description it is hard to follow what they actually mean. Algorithm 1 is visually difficult to read: it mostly contains IF THEN ELSE sentences, and a block diagram might be a better representation. Also, some variables in Algorithm 1 are unexplained.

The text contains numerous grammar and style errors, and sometimes they make the idea difficult to understand. Some examples are 'take dominate' in several places, 'girds' in line 124, and 'to initial camera pose' in lines 223-224 (these are just some examples, not an extensive list). There are various minor errors in addition to the grammar and style ones: lines 70 and 73 contain the word 'Ref' instead of specific authors; Figure 1 contains an unnecessary letter (a) under it, and its colors and steps are not sufficiently explained; lines 209-210 repeat lines 202-203 (also, what is the 'ilness problem' mentioned here?); what does DRLACS in line 234 mean? Heading formatting is not consistent throughout the paper; the table references in lines 316 and 321 seem to be incorrect; and lines 173-176 lack bullet points.

Author Response

Dear reviewer:

We appreciate the reviewer's valuable comments and have replied to them point by point in the cover letter. As for the English writing, we will use the extensive English editing service. Thank you very much for reviewing; any questions are welcome.

Yours

Lu Xiaoyun.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

1. Line 87: This paper mentions that this problem has chicken-and-egg characteristics. Why is this a chicken-and-egg problem? The authors do not describe it and do not refer to any related papers. When we say "chicken-and-egg problem" in the computer vision community, it generally refers to a set of problems that are closely connected, such that solving one helps solve the other. But I do not see why this is a chicken-and-egg problem.

Our Reply:

We really appreciate the reviewer’s valuable comments.

We have described this problem in line 47:

“The premise of solving the camera pose is to get the static feature set, while camera pose is required to filter the static features from the image features of the noise features, mismatches and dynamic features.”

We have added reference [7], which contains a more detailed description of the chicken-and-egg problem. We think that the chicken-and-egg problem mainly involves clustering and model solving, where each solution is premised on the other.

Response: the paper [7] that the authors mention describes that “Dynamic object segmentation (also known as multibody motion segmentation [73, 132, 153] or egomotion segmentation [133]) clusters all feature correspondences into n number of different object motions.” That is true for some dynamic object segmentation work, because such methods cluster the camera and objects by exploiting their motion models, and the clustered results are used again for estimating the motion models until they converge. Nevertheless, the typical SLAM problem in dynamic scenes is not a chicken-and-egg problem, because outlier rejection for pose and map estimation is a part of SLAM, and that outlier rejection is generally done with RANSAC. It is also not usual to call RANSAC a chicken-and-egg problem, although there are works that use prior knowledge of the motion model (i.e., adaptive RANSAC).
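The outlier-rejection role of RANSAC discussed above can be illustrated with a minimal, self-contained sketch on a toy line-fitting problem (a hypothetical illustration; a SLAM front end would estimate a fundamental matrix from point correspondences instead of a line, but the sample-score-keep loop is the same).

```python
import random

def ransac_line(points, n_iters=200, thresh=0.5, seed=0):
    """Fit y = a*x + b by RANSAC: repeatedly fit a minimal 2-point
    sample and keep the model with the largest inlier set; points
    far from every good model are rejected as outliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical sample, skip
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Ten points on y = 2x plus two gross outliers (moving-object stand-ins):
pts = [(float(x), 2.0 * x) for x in range(10)] + [(1.0, 10.0), (2.0, -5.0)]
model, inliers = ransac_line(pts)
print(model, len(inliers))  # (2.0, 0.0) 10
```

In a dynamic scene, features on moving objects play the role of the two outliers: they do not fit the dominant (camera-motion) model and are discarded, which is why the reviewer treats outlier rejection as a standard part of SLAM rather than a chicken-and-egg problem.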

2. Line 110: The observation that closer features are more likely to belong to the same model is not general, and I think it is not always true, although it might be somewhat valid in particular situations.

Our Reply: We really appreciate the reviewer’s valuable comments.

As you said, at the edges of different moving objects, adjacent points do not belong to the same motion model. But when the specific object classification cannot be known, the distance between feature points helps build our confidence that the model is pure (i.e., that the feature points used to solve the model belong to the same motion model); cases that cannot be covered by this observation are rejected in the subsequent optimizations. We have added lines 113-118 to illustrate the significance of the observation.

Response: it might be useful in SLAM, but my point is that it is not elegant for a paper to exploit such a restrictive observation. There are plenty of counterexamples that come to mind, such as the slanted surfaces of houses, the ground floor, or the left and right sides of cars in the KITTI dataset.

3. This cost function seems to be the main contribution of this paper, but there is no concrete explanation of or intuition for it.

Our Reply: We have added lines 130-135 to explain why we chose this cost function and its relationship to the observations mentioned above.

Response: thanks for updating the manuscript. But this paper is still seriously hard to follow and understand, because it uses its own terminology rather than that commonly used in the computer vision and robotics communities.

4. This seems to be a measure for key point selection for finally updating maps. The approach is very heuristic and not elegant. It can be done on the engineering side, but it is not appropriate as an academic contribution.

Our Reply: The neighbor exclusion algorithm is essentially a variant of our understanding and application of the degree of distribution. In the process of map expansion, this method minimizes the influence of moving feature points while ensuring the expansion speed. As you said, this method seems to lie on the engineering side rather than being an academic contribution; however, as a supplement to our DLRSAC tracking process, it allows our DM-SLAM system to fully handle moving objects in the scene.

Response: although the system might work fine, that does not mean the paper makes an academic contribution worthy of publication.

Author Response

Dear reviewer:

       We have applied the extensive English editing service from the journal, and our reply is in the attachment. We appreciate your comments very much.

Yours

Lu Xiaoyun.

Author Response File: Author Response.docx
