Article
Peer-Review Record

Depth Estimation Using Feature Pyramid U-Net and Polarized Self-Attention for Road Scenes

Photonics 2022, 9(7), 468; https://doi.org/10.3390/photonics9070468
by Bo Tao 1,2,*, Yunfei Shen 1,3, Xiliang Tong 3, Du Jiang 4 and Baojia Chen 5
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 15 May 2022 / Revised: 17 June 2022 / Accepted: 30 June 2022 / Published: 4 July 2022
(This article belongs to the Special Issue Optical 3D Sensing Systems)

Round 1

Reviewer 1 Report

This manuscript, by the authors, studied “Self-Supervised Monocular Depth Estimation for Road Scenes Based on Polarized Self-Attention”.

Overall, the topic is of interest to Photonics readers. However, serious concerns were found in every section, and the manuscript needs major revision before publication.

Please revise the whole manuscript and summarize every section with specific details and significant findings.

-Irrelevant details make the manuscript confusing.

-Follow the scientific way as Introduction, Materials and Methods (Methodology), Results and Discussion (Separate), and then conclusions. Re

- Wrong references and confusing numbers. After 8-12, then 41- 48?

 

Specific Comments and Suggestions

Title

-Please improve the title with related work, not just copy

-Abstract

-The first methodology, need to be clear or improved in a better way for treatments, then, show the results of your study.

 

Rewrite abstract with Scientific problem, Solution as your study, Significance, Results, and specific findings?

e.g. “We propose a new method of self-supervised monocular depth estimation based on an attention mechanism.”

Already, methods of self-supervised monocular depth estimation have been discussed in earlier studies.

 

-Add more results in detail.

 

-Significance of your study?

 

- Specific findings?

- Revise all figures with clear resolution and detailed captions.

 

-Introduction

 

-Revise it, please.

 

-Need more details and significance of the study providing other references.

 

-Need to clear objectives by points.

 

 

-Materials and Methods

 

- Not clear

 

- Related work? Part of the introduction section? What methodology do you use?

-Tables 1, 2, and 3 need to be revised, and references should be in the last column.

It would be better to write “the current study” instead of “ours”. Please read other papers carefully to improve your presentation of the results and discussion.

 

 

-Results and Discussion

 

-Revise it carefully with your results.

 

-Discussion should be related and brief with other references.

 

-Conclusions

-Unnecessary details.

-One or two paragraphs are appropriate for specific conclusions. But not too short and without specific findings

-Significant differences on which basis?

References

 

-Not consistent 

Author Response

Thank you very much for your kind comments on our manuscript. These comments are valuable and very helpful for revising and improving it. In what follows, we answer the questions you raised and give a detailed account of the changes made to the original manuscript.

 

This manuscript, by the authors, studied “Self-Supervised Monocular Depth Estimation for Road Scenes Based on Polarized Self-Attention”.

Overall, the topic is of interest to Photonics readers. However, serious concerns were found in every section, and the manuscript needs major revision before publication.

Please revise the whole manuscript and summarize every section with specific details and significant findings.

-Irrelevant details make the manuscript confusing.

Response: It was revised in the new manuscript.

 

-Follow the scientific way as Introduction, Materials and Methods (Methodology), Results and Discussion (Separate), and then conclusions. Re

Response: It was revised in the new manuscript.

 

- Wrong references and confusing numbers. After 8-12, then 41- 48?

Response: It was revised in the new manuscript.

 

Specific Comments and Suggestions

Title

-Please improve the title with related work, not just copy

Response: The title was changed to “Depth Estimation Using Feature Pyramid U-Net and Polarized Self-Attention for Road Scenes”.

 

-Abstract

-The first methodology, need to be clear or improved in a better way for treatments, then, show the results of your study.

Rewrite abstract with Scientific problem, Solution as your study, Significance, Results, and specific findings?

e.g. “We propose a new method of self-supervised monocular depth estimation based on an attention mechanism.”

Already, methods of self-supervised monocular depth estimation have been discussed in earlier studies.

Response: It was revised in the new manuscript.

-Add more results in detail.

Response: More details were added to the new manuscript.

-Significance of your study?

Response: The proposed method is intended to produce a more accurate depth map and to make detected objects intuitively recognizable in the predicted depth map. Compared with previous models, the predicted depth map achieves better results both in detecting object boundaries and in depth accuracy.

- Specific findings?

Response: The main contributions of this research are as follows:

  • PSA is used in a monocular self-supervised depth estimation model. It guides the model to learn pixel-level semantic information, which yields depth maps with more accurate boundaries.
  • We design a new decoder splicing method by combining the skip connections of U-net with FPN. This approach obtains better results without significantly increasing the amount of computation.

 

- Revise all figures with clear resolution and detailed captions.

Response: It was revised in the new manuscript.

-Introduction

-Revise it, please.

Response: It was revised in the new manuscript.

-Need more details and significance of the study providing other references.

Response: It was revised in the new manuscript.

-Need to clear objectives by points.

Response: It was revised in the new manuscript.

Our objective is a new self-supervised monocular depth estimation method that estimates the depth of the image accurately and preserves the contour lines of the image.

 

-Materials and Methods

- Not clear

Response: Received. It was revised in the new manuscript.

- Related work? Part of the introduction section? What methodology do you use?

Response: Following your suggestion, we revised the introduction and the methods section.

-Tables 1, 2, and 3 need to be revised, and references should be in the last column.

It would be better to write “the current study” instead of “ours”. Please read other papers carefully to improve your presentation of the results and discussion.

Response: It was revised in the new manuscript.

-Results and Discussion

-Revise it carefully with your results.

Response: It was revised in the new manuscript.

-Discussion should be related and brief with other references.

Response: It was revised in the new manuscript. Section 4.5, Discussion, was added.

-Conclusions

-Unnecessary details.

Response: It was revised in the new manuscript.

-One or two paragraphs are appropriate for specific conclusions. But not too short and without specific findings

Response: The conclusion has been rewritten.

-Significant differences on which basis?

Response: Objects with the same semantic information can have different depths. For example, when one white car blocks most of the area of another white car, their depths are different.

References

-Not consistent

Response: It was revised in the new manuscript.

Reviewer 2 Report

Valid work with NN for depth estimation. Nonetheless, there is still quite some work ahead regarding the format, structure and content of almost all the sections of the paper:

The introduction is too succinct. More details that help lead up to the contributions of this work have to be given.

Related work is too straightforward, since it talks about methods with theoretical details which haven't been presented. This section should concentrate on the state of the art from the point of view of approaches. Afterwards, the Materials and Methods section has to deal with more specific details about the theoretical models and implementation.

Again, this section is still too succinct. It has to provide facts that allow the authors to propose their contributions and demonstrate their preliminary validity and novelty against the current trend in this field. For example, approaches such as Lidar/Laser, RGB-D cameras, and similar devices applied to depth estimation are nearly forgotten.

Section 3.

What features are extracted and how?

Sections 3.2.1 and 3.2.2 start with a dull equation, without any sentence to introduce the presented term and relation to the previous architecture.

Section 4. A legend for the variables in Table 1 is needed, along with more details and comments/insights.

Author Response

Thank you very much for your kind comments on our manuscript. These comments are valuable and very helpful for revising and improving it. In what follows, we answer the questions you raised and give a detailed account of the changes made to the original manuscript.

 

 

Valid work with NN for depth estimation. Nonetheless, there is still quite some work ahead regarding the format, structure and content of almost all the sections of the paper:

The introduction is too succinct. More details that help lead up to the contributions of this work have to be given.

Response: It was revised in the new manuscript.

Related work is too straightforward, since it talks about methods with theoretical details which haven't been presented. This section should concentrate on the state of the art from the point of view of approaches. Afterwards, the Materials and Methods section has to deal with more specific details about the theoretical models and implementation.

Response: Following your suggestion, we made significant changes to the related work and methods sections in the new manuscript.

Again, this section is still too succinct. It has to provide facts that allow the authors to propose their contributions and demonstrate their preliminary validity and novelty against the current trend in this field. For example, approaches such as Lidar/Laser, RGB-D cameras, and similar devices applied to depth estimation are nearly forgotten.

Response: It was revised in the new manuscript.

 

Section 3.

What features are extracted and how?

Response: We use a depth encoder based on ResNet18 to extract features. The feature maps extracted by convolution are then upsampled to generate a single depth map.

Sections 3.2.1 and 3.2.2 start with a dull equation, without any sentence to introduce the presented term and relation to the previous architecture.

Response: We added detailed explanations.

Section 4. A legend for the variables in Table 1 is needed, along with more details and comments/insights.

Response: In Table 1 of Section 4, the experimental results of the other methods are taken from their papers. In addition, we selected GCNDepth, which has the best results, for comparison; the results are shown in Figure 2. More discussion was added.

Round 2

Reviewer 1 Report

The authors made a good effort to revise the manuscript, but it is still not satisfactory.

"It was revised in the manuscript".

Seems as in highlighted parts. How?

Please explain, for every comment, how the manuscript was improved.

Then revise it carefully again, reconsidering all the comments mentioned earlier.

Author Response

Thank you very much for your kind comments on our manuscript. These comments are valuable and very helpful for revising and improving it. In what follows, we answer the questions you raised and give a detailed account of the changes made to the original manuscript.

 

This manuscript, by the authors, studied “Self-Supervised Monocular Depth Estimation for Road Scenes Based on Polarized Self-Attention”.

Overall, the topic is of interest to Photonics readers. However, serious concerns were found in every section, and the manuscript needs major revision before publication.

Please revise the whole manuscript and summarize every section with specific details and significant findings.

-Irrelevant details make the manuscript confusing.

Response: We deleted the irrelevant details in the new manuscript.

-Follow the scientific way as Introduction, Materials and Methods (Methodology), Results and Discussion (Separate), and then conclusions. Re

Response: The article now follows this structure: 1. Introduction; 2. Related Work; 3. Methods; 4. Results and Discussion; 5. Conclusions.

- Wrong references and confusing numbers. After 8-12, then 41- 48?

Response: We have renumbered the references according to the order in which they appear in the manuscript.

 

Specific Comments and Suggestions

Title

-Please improve the title with related work, not just copy

Response: The title has been changed to “Depth Estimation Using Feature Pyramid U-Net and Polarized Self-Attention for Road Scenes”.

-Abstract

-The first methodology, need to be clear or improved in a better way for treatments, then, show the results of your study.

Rewrite abstract with Scientific problem, Solution as your study, Significance, Results, and specific findings?

e.g. “We propose a new method of self-supervised monocular depth estimation based on an attention mechanism.”

Already, methods of self-supervised monocular depth estimation have been discussed in earlier studies.

Response: We revised the abstract, which is quoted here:

Studies have shown that observed image texture details and semantic information are of great significance for depth estimation in road scenes. However, previous methods produce ambiguous and inaccurate boundary information for observed objects. For this reason, we aim to design a new depth estimation method that achieves higher accuracy and more accurate boundary information for detected objects. Based on polarized self-attention (PSA) and a feature pyramid U-net, we propose a new self-supervised monocular depth estimation model to extract more accurate texture details and semantic information. First, we add a PSA module at the end of the depth encoder and the pose encoder so that the network can extract more accurate semantic information. Then, based on the U-net, we feed the multi-scale images obtained by the object detection module FPN (Feature Pyramid Network) directly into the decoder. This guides the model to learn semantic information, thus enhancing the boundaries in the image. We evaluated our method on the KITTI 2015 and Make3D datasets, and our model achieved better results than previous studies. To verify the generalization of the model, we ran monocular, stereo, and monocular plus stereo experiments. The experimental results show that our model achieves better results on several main evaluation metrics and clearer boundary information. To compare different forms of the PSA mechanism, we ran ablation experiments. Adding the PSA module gives better evaluation results than omitting it. We also found that our model performs better with monocular training than with stereo training or monocular plus stereo training.

-Add more results in detail.

Response: More results were added to the abstract, quoted here: “We evaluated our method on the KITTI 2015 and Make3D datasets, and our model achieved better results than previous studies. To verify the generalization of the model, we ran monocular, stereo, and monocular plus stereo experiments. The experimental results show that our model achieves better results on several main evaluation metrics and clearer boundary information.”

-Significance of your study?

Response: In the abstract, the significance of our study is stated as designing “a new depth estimation method that achieves higher accuracy and more accurate boundary information for detected objects.”

- Specific findings?

Response: In the abstract, the specific findings are that adding the PSA module gives better evaluation results than omitting it, and that our model performs better with monocular training than with stereo training or monocular plus stereo training.

In Table 3, it is evident that adding PSA_p and PSA_s together leads to better results, especially in monocular depth estimation.

- Revise all figures with clear resolution and detailed captions.

Response: We revised all figures to have clear resolution and detailed captions.

 

-Introduction

-Revise it, please.

Response: We revised the introduction. The first three paragraphs have been rewritten, and two more paragraphs, including the contributions, have been added. The added parts are quoted here.

By contrast, self-supervised monocular depth estimation [8-12], which relies only on stereo image pairs or monocular video for supervised training, has attracted more attention from both industry and the academic community. The state-of-the-art (SOTA) self-supervised monocular depth estimation methods [8-11] can successfully estimate relative depth. However, the existing methods are weak at image edges, and the estimation of edge contour details still needs to be improved. To address this problem, we propose a new self-supervised monocular depth estimation method based on polarized self-attention (PSA) and a feature pyramid U-net, which can estimate the depth of the image accurately and preserve the contour lines of the image. The PSA mechanism combines the characteristics of the channel self-attention mechanism and the spatial self-attention mechanism, connecting them in novel parallel and serial ways. We add it directly after the encoder's 512 layer, so it is plug-and-play and does not change the main structure of the network. With this special structure, the model can learn pixel-level semantic features by convolution without significantly increasing the size of the model. Inspired by the U-net model and the object detection module FPN, we pass the original image, processed by a max pooling operation, to the decoder. Experiments show that our method achieves good results on the depth estimation task.
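As a rough illustration of the parallel and serial connections mentioned in the quoted passage, the sketch below composes a channel gate and a spatial gate in both layouts. It is not the authors' implementation. The gate functions are simple mean-plus-sigmoid stand-ins chosen for illustration, whereas an actual PSA module uses learned convolutions, and all function names are hypothetical. It also assumes that the parallel layout sums the two branch outputs and that the serial layout applies the channel branch first.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate(x):
    """Stand-in channel attention: one weight in (0, 1) per channel.

    x is a feature map stored as x[channel][row][col].
    """
    return [sigmoid(sum(map(sum, ch)) / (len(ch) * len(ch[0]))) for ch in x]


def spatial_gate(x):
    """Stand-in spatial attention: one weight in (0, 1) per pixel."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    return [[sigmoid(sum(x[k][i][j] for k in range(c)) / c) for j in range(w)]
            for i in range(h)]


def apply_channel(x, a):
    # Scale every value in channel k by the channel weight a[k].
    return [[[a[k] * v for v in row] for row in ch] for k, ch in enumerate(x)]


def apply_spatial(x, a):
    # Scale the value at (i, j) in every channel by the pixel weight a[i][j].
    return [[[a[i][j] * v for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in x]


def psa_parallel(x):
    """Parallel layout (assumed): sum of channel-gated and spatial-gated inputs."""
    ch = apply_channel(x, channel_gate(x))
    sp = apply_spatial(x, spatial_gate(x))
    return [[[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
            for c1, c2 in zip(ch, sp)]


def psa_sequential(x):
    """Serial layout (assumed): channel gating first, then spatial gating."""
    y = apply_channel(x, channel_gate(x))
    return apply_spatial(y, spatial_gate(y))
```

Both functions return a feature map with the same shape as the input, which is consistent with the quoted statement that the module can be added without changing the main structure of the network.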

The main contributions of this research are as follows:

1) PSA is used in a monocular self-supervised depth estimation model. It guides the model to learn pixel-level semantic information, which yields depth maps with more accurate boundaries.

2) We design a new decoder splicing method by combining the skip connections of U-net with FPN. This approach obtains better results without significantly increasing the amount of computation.
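The quoted introduction says the original image is processed by a maximum pool operation before being passed to the decoder. A minimal sketch of how such a multi-scale image pyramid could be built is given below. The 2x2 window, the number of levels, and the function names are assumptions for illustration; the record does not state the kernel size or how the pooled images are fused into the decoder.

```python
def max_pool_2x2(image):
    """Downsample a 2D grid (list of rows) by taking the max of each 2x2 block.

    A trailing odd row or column is dropped.
    """
    h, w = len(image), len(image[0])
    return [
        [max(image[r][c], image[r][c + 1],
             image[r + 1][c], image[r + 1][c + 1])
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]


def build_pyramid(image, levels):
    """Return [image, pooled once, pooled twice, ...] with `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(max_pool_2x2(pyramid[-1]))
    return pyramid


# Example: a 4x4 single-channel image gives scales 4x4, 2x2 and 1x1.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
scales = build_pyramid(image, levels=3)
```

Each level halves the resolution, so every decoder stage could in principle receive an image at its own scale.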

 

-Need more details and significance of the study providing other references.

Response: In Section 2, Related Work, more details on other references and their significance were added and reorganized. Section 2.1 covers training methods for self-supervised monocular depth estimation; Sections 2.2 and 2.3 cover improvements to network models, including networks combining FPN and U-net and the self-attention mechanism.

 

-Need to clear objectives by points.

Response:

  • Compared with the previous models, its predicted depth map has achieved better results in detecting the boundary of the object and the accuracy of depth estimation.
  • The experimental results show that our model has achieved better results in several main evaluation indexes.

 

-Materials and Methods

- Not clear

- Related work? Part of the introduction section? What methodology do you use?

Response: Section 2, Related Work, is a comprehensive review of related work on monocular depth estimation and network improvements. In Section 4, the results of the related monocular depth estimation methods are compared with ours.

The related work in the introduction is brief and leads to the problem that we address in this manuscript.

Methodology: Based on the related work, we believe some improvements can be made. Based on polarized self-attention (PSA) and a feature pyramid U-net, we propose a new self-supervised monocular depth estimation model to extract more accurate texture details and semantic information. It can obtain more accurate boundary information for detected objects. A detailed description of the methods is given in Section 3.

We added more description and introduction of the network architecture in Section 3.1.

 

-Tables 1, 2, and 3 need to be revised, and references should be in the last column.

It would be better to write “the current study” instead of “ours”. Please read other papers carefully to improve your presentation of the results and discussion.

Response: 

We have modified Tables 1, 2, and 3. In Tables 1 and 2, the first column is the existing method that we are comparing.

We have modified "ours" to "current study". 

 

-Results and Discussion

-Revise it carefully with your results.

-Discussion should be related and brief with other references.

Response: Section 4, Results and Discussion, has been revised. Some descriptions of the results and Section 4.5, Discussion, were added. The added parts and the discussion are quoted here.

Section 4.1. From Table 1, we find that our method performs well compared with the other self-supervised depth estimation models. Our model achieves the best result on every metric in monocular training. For stereo training, our model achieves the best result except for the absolute relative error. For monocular plus stereo training, our model achieves the best result except for the absolute relative error and square relative error.

Section 4.2. From Table 2, we find that our method achieves the best evaluation results compared with the other self-supervised monocular depth estimation models.

Section 4.4. For stereo training, our model did not achieve the best result for the absolute relative error. In addition, for monocular plus stereo training, our model did not achieve the best result for the absolute relative error and square relative error.

Section 4.5 Discussion. The results of this research show that PSA combined with a feature pyramid U-net is better than traditional methods at semantic information extraction. Our method provides not only better results on the evaluation metrics but also better boundary information in the predicted depth maps. Another key point is the comparison of monocular, stereo, and monocular plus stereo training with different PSA modules. In Table 3, it is evident that adding PSA_p and PSA_s together leads to better results, especially in monocular training. However, our models do not achieve the best results on every evaluation metric. Therefore, the PSA modules are better suited to monocular training. Finally, we also verified our model on other datasets, where it also achieves good results. This shows that the model generalizes to some extent.
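The absolute relative error and square relative error named in the quoted results can be sketched as follows. This assumes the definitions commonly used in depth estimation benchmarks and treats ground-truth depths of zero as invalid pixels; the manuscript's own evaluation code is not part of this record, and the function names are hypothetical.

```python
def abs_rel(pred, gt):
    """Absolute relative error: mean of |gt - pred| / gt over valid pixels."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(g - p) / g for p, g in pairs) / len(pairs)


def sq_rel(pred, gt):
    """Square relative error: mean of (gt - pred)^2 / gt over valid pixels."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum((g - p) ** 2 / g for p, g in pairs) / len(pairs)


# Example with flattened depth maps in metres; the last pixel has no ground truth.
pred = [1.0, 8.0, 3.0]
gt = [2.0, 10.0, 0.0]
print(abs_rel(pred, gt), sq_rel(pred, gt))
```

Lower is better for both. Because the square relative error squares the difference before normalizing, it penalizes large individual depth errors more heavily than the absolute relative error.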

 

-Conclusions

-Unnecessary details.

-One or two paragraphs are appropriate for specific conclusions. But not too short and without specific findings

Response: We simplified the conclusion and added the specific findings. It is quoted here.

In this study, we proposed a new self-supervised monocular depth estimation method. This method is intended to produce a more accurate depth map and to make detected objects intuitively recognizable in the predicted depth map. Compared with previous models, its predicted depth map achieves better results both in detecting object boundaries and in depth accuracy. Finally, we also verified our model on other datasets, where it also achieves good results. This shows that the model generalizes to some extent.

-Significant differences on which basis?

Response: The predicted depth map achieves better results both in detecting object boundaries and in depth accuracy.

The experimental results show that our model achieves better results on several main evaluation metrics.

References

-Not consistent

Response: We checked and revised all the references. They are consistent now.

 

Reviewer 2 Report

Comments have been sufficiently addressed

Author Response

Thank you very much for your kind comments on our manuscript.

Round 3

Reviewer 1 Report

The authors improved the manuscript, which can be accepted in its present form.
