Article
Peer-Review Record

MeViT: A Medium-Resolution Vision Transformer for Semantic Segmentation on Landsat Satellite Imagery for Agriculture in Thailand

Remote Sens. 2023, 15(21), 5124; https://doi.org/10.3390/rs15215124
by Teerapong Panboonyuen, Chaiyut Charoenphon and Chalermchon Satirapod *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 12 September 2023 / Revised: 22 October 2023 / Accepted: 25 October 2023 / Published: 26 October 2023
(This article belongs to the Special Issue Deep Learning and Computer Vision in Remote Sensing-II)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper introduces MeViT for semantic segmentation; however, it is very confusing.

(1) The structure of the paper is chaotic, from the Introduction to the Discussion, and it lacks a conventional organization of material, which makes reading this article very difficult.

(2) The motivation behind this article is unclear. On page 2, the phrase 'state-of-the-art' is applied repeatedly, which may be unsuitable; we cannot see the advantages and disadvantages of the discussed methods. The authors should make the motivation clearer in the Introduction section.

(3) This paper is based on the Vision Transformer; the authors should give a detailed introduction to this technology.

(4) In Fig. 2, the content of the second column of text boxes is confusing; from the reader's perspective, these text boxes should express their content clearly.

(5) Figs. 3-4 should be in Section 3.

(6) The testing area should be introduced more clearly.

(7) Section 3 contains only the Datasets subsection. 'Our dataset (see Fig. 1) includes many medium-quality images of 53,289 × 52,737 pixels.' What is the relationship between your research data and the experimental area? Which bands did you use? And why are the northern images divided into 1,100 training, 400 validation, and 200 test images?

(8) The locations of equations 1-4 are unsuitable.

(9) For a new semantic segmentation method, adaptability to different sensors needs to be tested.

Comments on the Quality of English Language

The main problem is that the content is poorly organized, making it hard to understand the paper's major contribution.

Author Response

We genuinely appreciate the thoroughness of your comments and valuable suggestions. The issues and concerns you have raised are indeed crucial, and we are committed to addressing them comprehensively in our revision.

We have also highlighted the changes made in our revised paper.

Our revised paper with highlights: https://drive.google.com/drive/folders/1BpKychyO0ZU1A9HqgYlqN2bVjMTFdY-b?usp=sharing

Please also see the attachment. We appreciate your detailed comments and suggestions. All reviewers identified some critical points that we hope to clarify and address here and in our revision.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors proposed the Medium-Resolution Vision Transformer (MeViT) for semantic segmentation of Landsat satellite imagery for agricultural purposes in Thailand. It enhances ViT by integrating medium-resolution multi-branch architectures and mixed-scale convolutional feedforward networks (MixCFN) to learn semantically rich and spatially precise multi-scale representations. Overall, I think the application goal of this paper is meaningful, but the innovation of the method is limited. The authors need to revise it and resubmit. My concerns are as follows:

 

1. In the Introduction, the authors only briefly mention image segmentation in LULC and mainly review literature from the computer vision field. I think this is inappropriate and does not properly situate the authors' work, for the following reasons. First, there are already many Transformer-based methods in the remote sensing field that have been successfully used for various types of image segmentation, especially of high-resolution remote sensing images; many modules in these works involve multi-resolution (multi-scale) branches and other components like those proposed by the authors, so reviewing these works from the remote sensing perspective would better highlight the significance of this paper. Second, the characteristics of medium-resolution and high-resolution images are very different, and many existing methods focus on high-resolution images; the authors do not distinguish between them, which undermines the point of their work (since the title contains "Medium-Resolution"). Third, the authors did not explain in the Introduction why agriculture needs semantic segmentation.

 

 

2. Please clarify how the proposed enhancement is specifically designed for Landsat OLI images (medium resolution images).

 

3. The description of the proposed method is unclear. I could not figure out how the green component (Figure 2) operates. The authors adopted almost all of the modules from HRViT, including MixCFN; they seem only to incorporate multiple depthwise convolution paths and to use ReLU instead of GELU (from Figures 2 and 5). All of this seems to lack innovation. The authors should therefore introduce their method and its innovations in more detail and more clearly.

 

4. The experimental section does not need such a detailed introduction to Landsat 8, as authors and readers in the remote sensing field are very familiar with it. By the way, the authors do not seem to have used the TIRS sensor either.

 

5. As a reader, I am very interested in the authors' dataset (labeled images). Please include more relevant content about it in the experiments, rather than about Landsat 8.

 

6. The comparison methods are from the CV field. As mentioned above, there are already many transformer-based methods in the remote sensing field (including ones proposed by the authors); please add some of these methods for comparison. In addition, the specific versions of the comparison methods are not given.

 

   [1] Enhanced Feature Pyramid Vision Transformer for Semantic Segmentation on Thailand Landsat-8 Corpus

 

7. The specific parameters of the experiments (for both the comparison methods and the proposed method) are not given, such as the learning rate, whether pre-trained parameters are used, and whether PyTorch or TensorFlow was used, etc.

 

8. From Figures 10-13, it is hard to see the advantages of the proposed method. If possible, please provide enlarged detail images.

 

Some typos:

HRVit, MeVit, etc.

 

Author Response

We genuinely appreciate the thoroughness of your comments and valuable suggestions. The issues and concerns you have raised are indeed crucial, and we are committed to addressing them comprehensively in our revision.

We have also highlighted the changes made in our revised paper.

Our revised paper with highlights: https://drive.google.com/drive/folders/1BpKychyO0ZU1A9HqgYlqN2bVjMTFdY-b?usp=sharing

Please also see the attachment. We appreciate your detailed comments and suggestions. All reviewers identified some critical points that we hope to clarify and address here and in our revision.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

L25. When introducing acronyms such as "LULC" and "CNNs", provide their full meanings in parentheses upon first use to aid readers who may not be familiar with these terms.

 

Section 2 starts with a clear description of the HRViT architecture, which is good for introducing the base model. However, there is a repetition of this information later in the section, which might be redundant. Consider consolidating the repeated information to avoid redundancy.

 

It's crucial to clearly explain why HRViT was chosen as the base model and how MeViT builds upon it to address the specific challenges related to semantic segmentation. Explain how and why HRViT is not suitable for semantic segmentation, emphasizing the need for MeViT.

 

Highlight the benefits and advantages of incorporating the revised MixCFN with ReLU in MeViT. Explain how this modification contributes to improved feature extraction and mitigates the vanishing gradient problem.

 

You provide the categories (corn, para rubber, and pineapple) in your dataset, which is helpful. Consider adding a brief explanation of why these categories were chosen or their relevance to the study.

 

Double-check the formula for Mean Intersection over Union (mIoU) (Eq. 4) for accuracy. It appears to contain a typographical error with two plus signs.
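For reference, the standard definitions the reviewer is alluding to can be written as follows (a sketch, assuming $K$ classes with per-class true positives $\mathrm{TP}_k$, false positives $\mathrm{FP}_k$, and false negatives $\mathrm{FN}_k$; the paper's own Eq. 4 may use different symbols):

```latex
\mathrm{IoU}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k},
\qquad
\mathrm{mIoU} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{IoU}_k
```

Note that each denominator contains exactly two plus signs in total, one sum per class; a third plus sign would indicate a typographical error.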

 

I wish that my comment would be helpful in improving the quality of this research.

Thank you.

Author Response

We genuinely appreciate the thoroughness of your comments and valuable suggestions. The issues and concerns you have raised are indeed crucial, and we are committed to addressing them comprehensively in our revision.

We have also highlighted the changes made in our revised paper.

Our revised paper with highlights: https://drive.google.com/drive/folders/1BpKychyO0ZU1A9HqgYlqN2bVjMTFdY-b?usp=sharing

Please also see the attachment. We appreciate your detailed comments and suggestions. All reviewers identified some critical points that we hope to clarify and address here and in our revision.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

I thank the authors for addressing most of my concerns. I find this manuscript to be essentially suitable for publication. However, there are still some issues that need to be resolved.

 

1. (Major concern) The authors highlighted two innovations that improve HRViT for Landsat-8 image classification in the revised manuscript. These two improvements are simple, but I did not find any ablation experiments or any analysis of the contribution and importance of these two innovations to the improvements. If these are missing, please add the relevant experiments; this is very important.

 

2. The authors did not understand what I meant by the specific versions of the comparison methods in the first round. For example, Swin Transformer has variants such as Swin-T, Swin-L, etc., and the different variants have different accuracy and efficiency. The authors need to list these specific versions so that readers can clearly understand the setup of the comparison experiments.

 

3. Eq. (5) is the equation for "IoU", not "mean IoU".

 

4. What is the size of each image used for training/validation/testing?

 

5. F1 is also a per-class metric; for multiple classes, a "Mean F1" should be used (Table 1), and likewise for "Precision" and "Recall".
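The per-class versus macro-averaged distinction the reviewer raises can be sketched as follows (not the authors' code; the 3×3 confusion matrix is a hypothetical example, loosely echoing the paper's three crop classes):

```python
# Macro-averaged precision, recall, and F1 from a per-class confusion
# matrix: compute each metric per class, then average over classes.

def per_class_prf(conf):
    """conf[i][j] = count of samples with true class i predicted as j.
    Returns a list of (precision, recall, f1) tuples, one per class."""
    k = len(conf)
    stats = []
    for c in range(k):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(k)) - tp  # column sum minus TP
        fn = sum(conf[c]) - tp                       # row sum minus TP
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        stats.append((prec, rec, f1))
    return stats

def macro_average(stats):
    """Unweighted mean of each metric over all classes."""
    k = len(stats)
    return tuple(sum(s[i] for s in stats) / k for i in range(3))

if __name__ == "__main__":
    # Hypothetical confusion matrix for three classes
    # (e.g., corn, para rubber, pineapple).
    conf = [[50, 5, 5],
            [10, 80, 10],
            [0, 10, 30]]
    mp, mr, mf1 = macro_average(per_class_prf(conf))
    print(f"Mean Precision={mp:.3f}  Mean Recall={mr:.3f}  Mean F1={mf1:.3f}")
```

Reporting these macro averages in a results table makes clear that every class contributes equally, regardless of how many pixels it covers.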

   

6. I still found some typos; the authors should check the paper more carefully. E.g., "Land Use and Land Cove" (l. 22); the parentheses are unnecessary (l. 104); l. 219 is duplicated with l. 229…

 

Author Response

We genuinely appreciate the thoroughness of your comments and valuable suggestions. The issues and concerns you have raised are crucial, and we are committed to addressing them comprehensively in our revision.

We have also highlighted the changes made in our revised paper (Round 2).

Our revised paper (Round 2) with highlights: https://drive.google.com/drive/folders/1BpKychyO0ZU1A9HqgYlqN2bVjMTFdY-b?usp=sharing

Please also see the attachment. We appreciate your detailed comments and suggestions. All reviewers identified some critical points that we hope to clarify and address here and in our revision.

Author Response File: Author Response.pdf
