Review Reports - CDSFusion: Dense Semantic SLAM for Indoor Environment Using CPU Computing

Round 1

Reviewer 1 Report

Contribution:

An indoor dense semantic Simultaneous Localization and Mapping (SLAM) method using only CPU computing was proposed, named CDSFusion.

Major issues:

My only broad question is: does needing a GPU represent a significant limitation for UAVs? As a researcher, I like the flexibility of just using a CPU, so I see the point. But how much of an issue is this in practice? It doesn't strike me as a payload issue, but please advise if I am wrong.

It seems the error is comparable between your approach and VINS-RGBD. The main improvement, I think, is the 2.5-3x increase in speed, which does support your CPU vs. GPU point. Excellent result.

Please state what CDS stands for. I'm assuming it is an acronym.

l29: In the sentence "The purpose of the dense semantic SLAM is to locate and reconstruct the dense 3D semantic map simultaneously", what are you locating? I don't think it is the map.

l41: "Some works such as Kimera [13] have achieved dense semantic SLAM without GPU, but they just execute semantic segmentation offline, and cannot be classified as dense semantic SLAM strictly." Is this because they do not operate in real time? I think this is worth stating explicitly.

l45: "However, UAV is essential in some scenes, such as exploring dangerous buildings." I think you mean intelligent UAV.

For future research, I suggest looking at ways to improve the "slow" performance. This seems to be the primary weakness of your algorithm, and thus the best target for improvement. However, it may be inherent to your approach, in which case I would not list it as future research. Your call.

Minor issues:

l13: "High-performance computing on most UAVs is always difficultly available however" needs rewording. Suggest "However, high-performance computing is generally not available on most UAVs"

l34: "there is" -> "there are"

l36: "has not" -> "have not"

l50: "was presented" -> "is presented". Generally, you should use present tense, unless you are saying "we did this..."

Generally, spell out / define acronyms on first use. Maybe not in the abstract, but certainly in the paper. E.g. VIO, CDS, PSP, FAST, etc., as you did for inertial measurement unit (IMU).

l97: "Double-window" -> "double-window"

l98: "were developed" -> "was developed"

l105: "high gradient" -> "high-gradient"

l111: "tracking lost" -> "tracking loss"

l116: "jointing all" -> "joining all"

l140: "but relying" -> "but relies"

l140: "and lacking of loop" -> "and lacks loop"

l154: "provide higher higher-level" -> "provide higher-level"

l165: "method were described" -> "method are described" (again, use present tense)

l173: "in real-time" -> "in real time" (this case, no dash)

l180: "and composed" -> "and is composed"

l182: "Each a per-pixel semantically annotated segmentation result of input RGB frames was gained in real-time using our lightweight semantic segmentation module." Sentence needs revision.

l187: "the three module" -> "the three modules"

l214: "rely on" -> "relies on"

l225: "the PSPNet. PSPNet" -> "the PSPNet, which"

l245: "semantically annotate" -> "semantically annotating"

l246: "map, A voxel" -> "map, a voxel"

l271: "And Root" -> "Root"

l331: "consumption was shown" -> "consumption is shown"

l364: paragraph beginning "The results show that the result of our method running on a CPU". Make sure cross-references to (a) - (d) in Figure 4 are all correct. It's hard to tell on review. They might be OK, just confirm.

l382: "achieved a same completeness with groundtruth". Your results look very good, but I am not sure "same completeness" is well defined and repeatable.

l383: "needed GPU acceleration" -> "needing GPU acceleration"

Table 3: It's not clear which method these results are for. Modify the caption, headings, and/or text description to clarify. Is it VINS-RGBD? Say so.

Author Response

Please see the attachment.

Author Response File: Author Response.doc

Reviewer 2 Report

The paper presents a method to perform SLAM in indoor environments using UAVs equipped with RGBD and inertial sensors.

The system designed by the authors seems very interesting and promising. According to the authors, the algorithms presented are able to reconstruct the environment under different conditions:
- VIO (Visual-Inertial Odometry) using different sensor inputs (RGB, RGBD and IMU)
- Dense semantic SLAM

I have some comments:

C1 Too broad
To me, the paper is too broad and condensed. The authors are describing a system that is able to perform visual-inertial odometry and a lightweight semantic SLAM using different information from the sensors (RGB, RGBD and IMU). In my opinion, focusing only on visual inertial and comparing between different methods would be fine. A different manuscript could focus on the SLAM part using semantic information. Please note that 3D reconstruction is vaguely mentioned in Section 3.4.

C2 Implementation
In my opinion, the authors should publish the code in a public repository. In this way, other datasets could be tested using your proposed algorithms.

C3 Results and figures

Figure 2 is split across two pages and extends beyond the page margins. Figures 2d, 2e and 2f are impossible to decipher (too tiny). The same occurs with j, k and l.
Figure 3 is split across two pages and extends beyond the margins. Figures 3a, b, c and d should be larger, so that the result of the algorithm is clear to others. In addition, colors represent semantic information. However, the authors do not state which color corresponds to which class in the semantic map of the scene.

Figure 4 is also split across two pages.

C4 Title
Remove "Using CPU Computing" from the title. The reason: many SLAM and visual SLAM algorithms already run on a CPU. What the authors suggest is that their proposal can run on a CPU with a low computational time. I would suggest something like: CDSFusion: A lightweight Dense Semantic SLAM for indoor environments

English needs extensive editing. It is difficult to understand sometimes. Next, I suggest some changes.
- 10 Unmanned Aerial Vehicle (UAV) --> Unmanned Aerial Vehicles (UAV)
- 10 requires --> require (plural)
- 15 to resolve --> to solve
- 20 FAST feature --> FAST features
- 52 the proposed method has the ability to localize the UAV and perform dense semantic 3D reconstruction simultaneously...
- 95 split --> splits
- 116 jointing --> joining
- 137 compared with stereo camera --> compared to stereo
- 170 FAST feature was --> FAST features were used
- 172 We optimized a PSPNET --> a PSPNet pre-trained
- 179 takes RGB frame, depth frame --> takes an RGB frame, a depth frame
- 190 FAST corner features --> FAST features (not actually corners, as in Shi-Tomasi)
- 213 using Ceres solver --> using the Ceres solver
- throughout the paper: Shi-Tomas --> Shi-Tomasi
- 318 achieved expected results --> achieves the expected results.
- 327 for handheld simple dataset --> for the handheld simple dataset

Author Response

Please see the attachment.

Author Response File: Author Response.doc

Reviewer 3 Report

The paper 'CDSFusion: Dense Semantic SLAM for Indoor Environment using CPU Computing' proposes a dense semantic 3D RGB-D SLAM algorithm that runs in real time on a modern CPU. The motivation is reasonable. However, the paper is not well written, contains grammatical errors, and lacks many important technical details. The experiments are also inadequate. In its current presentation, the method appears to be just a combination of VINS-Mono, PSPNet and Voxblox, which makes the research insufficiently innovative for the journal. Thus, I would like to rate it as 'major revision' and encourage the authors to modify the paper based on the following suggestions, if the research contains further technical contributions. Kindly revise the content of this paper and clearly convey the technical and experimental details in future versions.

1) In the introduction, the authors said 'The FAST [15] feature was adapted to improve the efficiency of pose estimation instead of the Shi-Tomasi [16] feature used in VINS-Mono.' In section 3.2, the authors claimed 'The design choice for FAST corners was not only driven by functionalities, but also experimental results shown that the FAST features are much faster than Shi-Tomas and ORB with the same or more robustness and accuracy.' There are no direct experiments to back these two claims. Only a general SLAM comparison with VINS-RGBD is presented. Since this is the major innovation in this research, the authors should present a direct experiment to validate the two claims.
A table is needed to compare the number of tracked features, time consumption, robustness, and accuracy between Shi-Tomasi and FAST corner points.
2) There are two tiny figures in Fig. 1 that are not readable.
3) Paragraphs 2-4 in Section 3.2 seem to describe the original VINS-Mono. Please clarify whether this is the case.
4) In the experiment, it is claimed 'all other parameters are kept constant using system default values unless stated otherwise'. Please present all the default values in the draft paper.
5) In Table I, what is the exact metric used to quantify accuracy? A formulation is needed. In addition to the trajectory error, please also compare the orientation error. Moreover, Table I indicates that the proposed method achieves better accuracy than VINS-RGBD. Since both methods perform localization with the VINS-Mono framework, please conduct an ablation study to show why the proposed method is more accurate.
6) Table II shows the proposed method is almost two times faster than VINS-RGBD. Although FAST in CDSFusion may be faster than Shi-Tomasi in VINS-RGBD, CDSFusion contains additional modules like PSPNet and semantic reconstruction. In theory, CDSFusion should therefore be slower. Please conduct an ablation study to show why it is faster.
7) The authors said 'For comparison, we show the 3D map of the complete scene generated by a method under GPU acceleration, as shown in Figure 3'. This is very informal. Please cite this method in the references. There are several places where this method is mentioned. Besides, please also compare with more state-of-the-art methods like Kimera.
8) Deploying PSPNet on a CPU in real time may benefit the community. More details are needed to explain the time consumption. Does it run on the CPU cores or on the Intel integrated graphics?
9) In Figure 4, the results from the CPU and the GPU are different. An explanation of the difference is needed, since both versions are based on the same theory.
10) The authors acknowledge that the method 'is sensitive to fast camera motion'. Do the FAST features cause this? More studies and experiments are needed.
11) Minor issue. There are many grammatical mistakes in the text. Please carefully polish the paper and improve its readability. Some examples are cited below:
a. High-performance computing on most UAVs is always difficultly available however, so a CPU-only real-time semantic reconstruction method is necessary.
b. Simultaneous Localization and Mapping (SLAM) method only using CPU computing was proposed,
c. Contrary to most efforts for only VIO or semantic 3D reconstruction,
d. We also interested in
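
To illustrate the kind of formulation requested in comment 5), the following is a minimal sketch of two commonly used metrics: the root-mean-square translational error over a trajectory and the angle between estimated and ground-truth orientations. It is not taken from the manuscript. The function names, the (w, x, y, z) quaternion ordering, and the assumption that both trajectories are already time-matched and expressed in the same frame are all assumptions made for illustration only.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of the translational error between two position lists.

    Assumes (hypothetically) that the poses are already associated by
    timestamp and aligned to a common reference frame.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and equally long")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rotation_angle_deg(q_est, q_gt):
    """Angle in degrees between two unit quaternions given as (w, x, y, z)."""
    # abs() treats q and -q as the same rotation; min() guards acos
    # against values slightly above 1 caused by rounding.
    dot = min(1.0, abs(sum(a * b for a, b in zip(q_est, q_gt))))
    return math.degrees(2.0 * math.acos(dot))
```

Reporting both values per sequence would make the accuracy comparison in the table explicit; whether the authors align trajectories before computing the error is a separate choice that should be stated.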

Author Response

Please see the attachment.

Author Response File: Author Response.doc

Round 2

Reviewer 2 Report

The quality of the manuscript has been significantly improved from the previous version. I am highly satisfied with the result and recommend that the paper be published in its current form after some minor English language revision.

Reviewer 3 Report

The authors have made great efforts in revising the manuscript, and the review comments have been well addressed.
The manuscript meets the journal's requirements for publication.