Self-Supervised Monocular Depth Estimation Using Global and Local Mixed Multi-Scale Feature Enhancement Network for Low-Altitude UAV Remote Sensing
Round 1
Reviewer 1 Report
The paper addresses the task of depth estimation from the perspective of UAVs flying at low altitudes. Notably, the authors argue that this is more challenging than the more usual indoor or automotive scenarios owing to the varying scale and non-uniform distribution of depth. For this reason, the authors propose a multi-scale depth decoder and a novel attention module inspired by the Squeeze-and-Excitation approach. The method is validated on the UAVid dataset on different metrics, showing that it outperforms some state-of-the-art approaches by a reasonable margin.
There are a handful of points of criticism. First, the related work review should be improved, especially with respect to more recent unsupervised/self-supervised monocular depth estimation. Similarly, while the method from Godard et al. is a milestone in this area, there are more advanced models and techniques that could have been considered for the validation and possibly also for building a more performant network, e.g. Vision Transformers https://doi.org/10.1109/TIP.2022.3167307. The authors could explain why these have not been taken into consideration.
Regarding the methodology description, the description of the network structure lacks a precise explanation of the number of channels at each stage. Concerning the GD loss, the authors should acknowledge that previous papers have already experimented with second-order gradients and other techniques for introducing the smoothness prior.
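To make the point about smoothness priors concrete, the following is a minimal sketch of an edge-aware second-order smoothness penalty of the kind referred to above. It is not the GD loss of the manuscript: the function name `second_order_smoothness`, the argument layout, and the exponential edge weighting are assumptions chosen for illustration only.

```python
import numpy as np


def second_order_smoothness(disp, image):
    """Edge-aware second-order smoothness penalty (illustrative sketch).

    disp:  (H, W) array, predicted disparity or depth map
    image: (H, W, 3) array, color image used to relax the penalty at edges
    """
    # Second-order finite differences of the prediction along x and y.
    d2x = disp[:, :-2] - 2.0 * disp[:, 1:-1] + disp[:, 2:]
    d2y = disp[:-2, :] - 2.0 * disp[1:-1, :] + disp[2:, :]

    # Image gradient magnitude, averaged over color channels and taken over
    # the same three-pixel support so the shapes match d2x and d2y.
    gx = np.mean(np.abs(image[:, 2:] - image[:, :-2]), axis=-1)
    gy = np.mean(np.abs(image[2:, :] - image[:-2, :]), axis=-1)

    # Down-weight the penalty where the image itself has strong edges.
    return float(np.mean(np.abs(d2x) * np.exp(-gx))
                 + np.mean(np.abs(d2y) * np.exp(-gy)))
```

Unlike a first-order penalty, this term is zero for any planar (linearly varying) depth map, which is one reason second-order variants have been explored for slanted surfaces such as ground planes seen from a UAV.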
Overall, the method shows improvement on the selected benchmark, but doubts remain that more recent methods not considered in the study may already have solved the investigated challenges.
Only a few sentences need correction, e.g., lines 140 and 354. A quick review of the English style should suffice.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
I recommend replacing the acronym UAV with UAS (Unmanned Aerial System) when you refer to the whole system (UAV and sensor): lines 1, 6.
Please capitalize the first letters of "Unmanned Aerial Vehicles": lines 19, 75, 148, 283. It is also unnecessary to spell out the full term again once it has been explained at its first mention in the manuscript.
Please use ":" or another way to separate the method name from the sentence in lines 309, 318.
In line 3, you mention traditional depth estimation methods; perhaps you could name a few of them, in about two lines.
I recommend minor changes to the grammar and quality of the English language and minor text editing.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
The analysis of single images is an important question for many problems. The possibility of reconstructing secondary features from the analysis of one image is a very pressing problem, especially in the areas of automated control and remote decision making. Possible areas of application include remote sensing and mapping; autonomous driving and UAV control systems; security systems; and entertainment and games, among others. Recovering volume data and reconstructing spatial data is a complex problem, and its automation and intellectualization is an important area of research.
The strong point of the article is that the authors use deep learning to estimate depth from a single low-altitude image. This approach has a wide range of applications. The weak point of the work is that it shows excellent results only on data with good lighting, captured in summer over urban areas.
I have the following problems with this paper:
- The work introduces a combined criterion for the loss function. This criterion includes many parameters, each of which will affect the result. One of the parameters is an attention map, the training of which is also a separate task. The paper describes the application of the trained function with perfectly matched parameters to the elements of the UAVid2020 dataset. The choice of criteria is not justified, although the results obtained are good. It is possible that excluding one parameter would not worsen the result and might even improve it on some input data.
- The UAVid2020 dataset was used for the research. The work would be stronger if you showed that real data captured under other shooting conditions (sunset, rain, winter, fog) can be processed. Demonstrating the versatility of your approach would broaden its scope.
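The first concern above amounts to asking for a leave-one-out ablation of the combined loss. A minimal sketch of how such configurations could be enumerated is given below; the term names (`photometric`, `smoothness`, `gd`) and the weights are hypothetical placeholders, not values taken from the manuscript.

```python
def combined_loss(terms, weights):
    """Weighted sum of individual loss terms (both are dicts keyed by term name)."""
    return sum(weights[name] * terms[name] for name in weights)


def leave_one_out_configs(weights):
    """Yield (dropped_term, new_weights) with exactly one term's weight set to zero."""
    for name in weights:
        config = dict(weights)  # copy so the original weights stay untouched
        config[name] = 0.0
        yield name, config


# Hypothetical weights and per-term loss values, for illustration only.
weights = {"photometric": 1.0, "smoothness": 0.1, "gd": 0.5}
terms = {"photometric": 2.0, "smoothness": 1.0, "gd": 4.0}

for dropped, config in leave_one_out_configs(weights):
    print(dropped, combined_loss(terms, config))
```

In an actual study each configuration would be used to retrain the network and the depth metrics would be compared, rather than comparing the loss values themselves.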
In my opinion, the paper can be recommended for acceptance after some small changes.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors have fully addressed the issues raised in the first review.