Article
Peer-Review Record

Super-Resolution Reconstruction Model of Spatiotemporal Fusion Remote Sensing Image Based on Double Branch Texture Transformers and Feedback Mechanism

Electronics 2022, 11(16), 2497; https://doi.org/10.3390/electronics11162497
by Hui Liu 1, Yurong Qian 2,*, Guangqi Yang 2 and Hao Jiang 2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5:
Submission received: 27 June 2022 / Revised: 31 July 2022 / Accepted: 1 August 2022 / Published: 10 August 2022
(This article belongs to the Special Issue Remote Sensing Image Processing)

Round 1

Reviewer 1 Report

The general scientific soundness and the research design are both appropriate. The quality of presentation could be improved by introducing some summary schemes or diagrams.

The research is conceived according to current objectives and methods. Unfortunately, the bibliography does not contain many references to the most recent literature.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The contributions are clearly presented, and the methodology used in the study is sound.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments:

1) Include more recent techniques.

2) The methodology is not well explained; please refine it.

3) Provide the mathematical foundation of the work.

4) A comparison with other techniques is required.

5) Ablation studies are needed.

6) English editing is required.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

This paper proposes a super-resolution reconstruction model for spatiotemporal fusion of remote sensing images, based on double-branch texture transformers and a feedback mechanism. The double-branch design separates the network paths for the coarse and fine images, which have similar structures, while limiting the dependence of the images on the time series.

The topic of this article is quite interesting, and its contribution is significant. Moreover, the mathematical background is satisfactory. However, the following corrections are necessary in order to improve the paper:

Please modify the abstract. The abstract should be concise, and many of the details should be removed. These details can be included in the “Introduction” section.

Please modify the structure of the article. The “Introduction” and “Related Works” sections could be merged into one section. Moreover, the “Discussion” section is missing. Please include a “Discussion” section or modify the “Conclusions” section into a “Discussion – Conclusions” section, adding the discussion part.

Please clearly define the paper's objective in the “Introduction” section, as well as the questions that your research answers.

Please include more details in the Figure captions. For example, M0, M1, M2, L0, and L1 should be explained in the Figure 1 caption. Please do the same for all Figure captions.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report

This manuscript presents a super-resolution image reconstruction method based on ConvNet and Transformer backbones. The proposed network incorporates a feedback mechanism not only to extract spatiotemporal features but also to fuse them in an end-to-end encoder-decoder model. Experiments on three original benchmark datasets revealed that the proposed method achieved higher accuracy than the four previous methods. Nevertheless, the questions below must be addressed in this manuscript.

 

1. Please provide the basis for the assertion in lines 36-40. Is it common to divide the data into these two major categories, or is this your original remark?

 

2. The abbreviation MODIS suddenly appears in line 41. In this context, this term is difficult to understand.

 

3. The abbreviation LTHS is used in line 71 before it is defined in line 145.

 

4. Numerous abbreviations are used in this manuscript. However, the list on page 20 is merely a subset of all abbreviations.

 

5. The size of the input features is one of the hyperparameters of deep learning networks. As described in lines 224-227, three input block sizes were applied in consideration of memory consumption and computational cost. Although the appropriate input size varies with the target images, these settings are relatively small compared to other approaches. How were these values determined? How much effect did the sizes have on accuracy?

 

6. Did you experiment with optimizing the hyperparameters of your proposed model, which was developed using two deep learning backbones? Is the batch size of 8 optimal? Did the learning process converge sufficiently? Why doesn't this manuscript include preliminary experimental results?

 

7. Statements regarding datasets are repeated several times throughout this manuscript. Is it necessary? Please check lines 93-39 and 287-288.

 

8. Is reference [17] in line 105 appropriate? Is this a pioneering work?

 

9. Reference [66] should cite NeurIPS instead of arXiv.

 

10. The drawbacks of skip connections are explained in lines 129-133. However, this mechanism is also important as a measure against the vanishing gradient problem. The discussion should address both aspects.
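For illustration, a minimal sketch of the residual skip connection at issue (PyTorch-style, illustrative only; this is not the authors' implementation) shows how the identity path carries both the benefit and the drawback:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block; a generic sketch, not the manuscript's network."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity term "x +" is the skip connection: gradients flow through
        # it unattenuated (mitigating vanishing gradients), but it also forwards
        # low-level features unchanged (the drawback discussed in lines 129-133).
        return x + self.body(x)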

 

11. The text in lines 380-383 should also be written above the images in Figs. 5, 6, and 7.

 

12. Are there no labels in each column of Figs. 8, 9, and 11? Why is there no image in the panel of the second row and first column?

 

13. Are there no labels in Figs. 13, 14, and 15?

 

14. Does the phrase "true value" in line 382 mean "ground truth (GT)"?

 

15. The word "good" used in lines 372, 394, 430, and 436 is unsuitable for a scientific paper because it derives from a subjective evaluation.

 

16. The proposed method was evaluated using only your original benchmark datasets. Why did you not use publicly available benchmark datasets? A vast number of public datasets are available as benchmarks in the field of remote sensing research. It is common to evaluate performance using public datasets before original datasets. Is there really no public dataset that matches the specifications described in lines 192-198?

 

17. What is the meaning of the inequality sign on the right side of equation (2)?

 

18. Where is the content loss defined on the right side of Equation (14)?

 

19. Since the information in Table 1 is shown, it would be desirable to also indicate the calculation time.

 

20. Do you use a 1×1 convolution kernel in Fig. 2?

 

21. Variables in the text and figures should be changed to an italic font.

 

22. Is the abbreviation LRFB on line 270 correct? Is the abbreviation ERGAS on line 332 correct?
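For reference, the standard definition of ERGAS (erreur relative globale adimensionnelle de synthèse), against which the manuscript's usage can be checked, is as follows; the notation here is generic and not taken from the manuscript:

% h/l: ratio of high to low spatial resolution; N: number of bands;
% RMSE_k and \mu_k: RMSE and mean of band k.
\mathrm{ERGAS} = 100 \, \frac{h}{l} \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left( \frac{\mathrm{RMSE}_k}{\mu_k} \right)^{2}}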

 

23. The sizes of the images should include the unit (pixels).

 

24. The mixed terminology of "datasets" and "data sets" should be unified.

 

25. Why are the distributions in Figs. 13 and 14 lighter in color than those in Fig. 15?

 

26. It might be helpful to avoid citing references in the last section of the conclusion. The statement in lines 447-448 should be moved to the first section of the introduction.

 

27. The abbreviations in the first column of Tables 3, 4, and 5 should be written in capital letters.

 

28. What backbones were used for the ConvNet and Transformer modules of the proposed method? Did you use VGGNet for the ConvNet module? Is the texture transformer a new backbone?

 

29. The second section on related work presents a number of existing methods. How did you choose the four existing methods, STARFM [32], FSDAF [36], DCSTFN [44], EDCSTFN [45], for comparison? Clearly describe your selection criteria. Why didn't you choose other state-of-the-art methods proposed in recent years?

 

30. Does your method exhibit superior accuracy compared to the state-of-the-art existing methods listed below?

 

- F. Yang, H. Yang, J. Fu, H. Lu, and B. Guo, "Learning Texture Transformer Network for Image Super-Resolution," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5790-5799, doi: 10.1109/CVPR42600.2020.00583.

 

- G. Yang et al., "MSFusion: Multistage for Remote Sensing Image Spatiotemporal Fusion Based on Texture Transformer and Convolutional Neural Network," in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 4653-4666, 2022, doi: 10.1109/JSTARS.2022.3179415.

 

- H. M. Kasem, K. -W. Hung and J. Jiang, "Spatial Transformer Generative Adversarial Network for Robust Image Super-Resolution," in IEEE Access, vol. 7, pp. 182993-183009, 2019, doi: 10.1109/ACCESS.2019.2959940.

 

- Z. Lu, J. Li, H. Liu, C. Huang, L. Zhang and T. Zeng, "Transformer for Single Image Super-Resolution," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 457-466.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 5 Report

The authors have revised the manuscript accordingly.

All my questions have been resolved.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.


Round 1

Reviewer 1 Report

I have included a few non-substantive suggestions in the attached file.

The paper is very well organized. The authors walk the reader through a lot of information. Section 2: Related Works is especially noteworthy for all of the great background information in there. This would be a good reference for people to keep on hand.

The experiment itself is well explained, as are the results. The inferences from the results are well spelled out. The Conclusion Section is very brief but it does a decent job of summing up the paper. 

If I were to suggest any improvement, it might be to be a little more specific in the abstract. Many readers will read the abstract but not have time to read the paper. The concluding sentence of the abstract is so vague that the reader won't really get a good idea of what you learned.

 

Comments for author File: Comments.pdf

Reviewer 2 Report

Dear authors, 

The paper, "Super-resolution reconstruction model of spatiotemporal fusion remote sensing image based on double branch texture transformers and feedback mechanism," is interesting in terms of image processing. The paper's goal is to create a fusion model for enhanced deep convolutional spatiotemporal fusion networks on the basis of a double-branch feedback mechanism and texture transformers. The paper is well structured in terms of conveying the idea to the reader, so I have only some minor questions.

However, the figures and diagrams are too small to see the differences between input and output. Could you please use larger images?

Please also place Table 1 within its section.

Could you please also explain why only the SAM results are better than yours for TTSR + double branch + feedback? What is the effect of feedback there?

I have the same question for the quantitative evaluation of the fusion results on the CIA data sets; please explain.

 

 

Reviewer 3 Report

The authors present a novel spatiotemporal data fusion method for medium- and coarse-spatial-resolution remotely sensed imagery over three sites. While the method presented is interesting, a major justification for it is that it is applicable to a variety of datasets and sensors; however, the method is largely confined to MODIS imagery and Landsat-5, 7, and 8. Limited spectral information from the Landsat data was also used, which reduces the utility of the fused dataset. There was also next to no discussion of the results, and the conclusion was extremely brief and presented few meaningful insights or directions for future research.

Furthermore, the manuscript itself seems to be more of a rough draft than a finished product, as there are numerous inconsistencies in capitalization, acronyms, and colloquial language. In fact, I stopped attempting to point out individual issues when I reached the methods section. The figures are also of poor quality, as there is minimal reference to or explanation of the variations between methods.

I believe the authors' method shows promise and is interesting; however, the generalization of the method should be explored further, and any future submission should undergo careful proofreading and polishing before consideration in an academic journal.

 

Ln 33 – “Satellite remote sensing data can be divided into two types: Landsat 8…”

I’m not entirely sure what the authors are attempting to state in this sentence, but there is a large variety of remotely sensed datasets: aerial and terrestrial LiDAR, hyperspectral imagery, and numerous coarse-, medium-, high-, and very-high-resolution satellites (e.g., WorldView-3 and -4).

Are the authors attempting to refer to satellite imagery with high temporal resolution? Even so, there are numerous other satellites with temporal resolution similar to Landsat-8 and MODIS.

Ln 38 – This sentence is incomplete: “The other data are the MODIS imagery, with a spatial resolution; The other is the MODIS image..”

Ln 38 – Also, please define the MODIS acronym (Moderate Resolution Imaging Spectroradiometer).

Ln 45 – Landsat 30 m imagery is not considered “high-resolution” in the remote sensing literature. High-resolution imagery typically has much finer spatial resolution, i.e., 1 m to 10 m. Landsat-8 imagery has a higher spatial resolution than MODIS, but it is not considered high-resolution imagery, even with pansharpening.

Ln 71 – Is “atrocious” an appropriate term? What is this referring to: poor accuracy, lack of meaningful results, or high error? Perhaps “unusable”?

Ln 77 – Should your list start with a colon?

Ln 91 – “the model is suitable for various data sources” – is this true? I thought the model only worked with MODIS, Landsat-8, Landsat-7 ETM, and Landsat-5 TM data. That’s only four specific data sources.

Ln 113 – Again, please define acronyms: SRCNN, VDSR, SRResNet, EDSR, etc.

Ln 186 – What makes the other datasets “ideal”?

Ln 302 – The first four bands of the Landsat data or of MODIS? Bands also differ by sensor (e.g., Band 1 on Landsat-7 ETM is “blue, 0.411–0.514 µm”, while Band 1 on Landsat-8 OLI is “coastal/aerosol, 0.435–0.451 µm”).

Ln 302 – Was pansharpening considered, given the 15 m panchromatic bands on Landsat-7 and -8? That would further improve the spatial resolution of the Landsat imagery.
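For illustration, a minimal Brovey-transform pansharpening sketch (a generic technique, not taken from the manuscript; NumPy-style, assuming the multispectral bands have already been upsampled to the panchromatic grid):

import numpy as np

def brovey_pansharpen(ms, pan, eps=1e-6):
    # ms: multispectral bands upsampled to the pan grid, shape (bands, H, W)
    # pan: panchromatic band, shape (H, W)
    intensity = ms.mean(axis=0)          # crude intensity estimate from the MS bands
    ratio = pan / (intensity + eps)      # per-pixel detail-injection ratio
    return ms * ratio[np.newaxis, :, :]  # redistribute pan detail into each band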

In general, the entire Section 2 seems a bit long and unnecessary, as it is mainly just a listing of other methods, some of which are unrelated. It would be more effective to discuss methods that are relevant to the super-resolution image problem and the idealized datasets, as discussed in the final paragraph of this section.
