
Open Access Article

Peer-Review Record

Change Detection Needs Neighborhood Interaction in Transformer

Remote Sens. 2023, 15(23), 5459; https://doi.org/10.3390/rs15235459

by Hangling Ma¹, Lingran Zhao²,³, Bingquan Li²,³, Ruiqing Niu¹,²,³,* and Yueyue Wang¹

Reviewer 1: Zhenwei Shi

Reviewer 2: Anonymous

Reviewer 3: Gabriela Droj


Submission received: 22 September 2023 / Revised: 12 November 2023 / Accepted: 17 November 2023 / Published: 22 November 2023

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The proposed article, "Change Detection Needs Neighborhood Interaction in Transformer," leverages a sparse sliding-window attention mechanism to restrict the attention range of each pixel to its nearest neighbors, thereby introducing inductive biases into the change detection (CD) task to achieve better results while reducing computational cost. Supported by sufficient experiments, the proposed network goes beyond the state of the art in remote sensing change detection. However, the manuscript could be improved if the following questions are addressed.
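To make the mechanism summarized above concrete, here is a minimal NumPy sketch of single-head neighborhood attention: each query pixel attends only to a `kernel_size` x `kernel_size` window of key pixels, and a `dilation` parameter spaces those neighbors apart, as in dilated neighborhood attention (DiNA). This is a hedged illustration, not the authors' implementation: the function name `neighborhood_attention`, the 1/sqrt(d) scaling, and the border rule (out-of-range neighbors are simply dropped, whereas the original neighborhood attention shifts the window inward at borders) are all assumptions. Passing `q` from one acquisition date and `k`, `v` from the other would give a cross-temporal variant in the spirit of the Cross-NA module, again as an assumption.

```python
import numpy as np


def neighborhood_attention(q, k, v, kernel_size=3, dilation=1):
    """Single-head (dilated) neighborhood attention on an H x W feature map.

    q, k, v: arrays of shape (H, W, d). Hypothetical sketch: each query pixel
    attends to at most kernel_size**2 key pixels spaced `dilation` apart;
    neighbors falling outside the map are dropped (simplified border rule).
    """
    assert kernel_size % 2 == 1, "kernel_size must be odd"
    H, W, d = q.shape
    r = kernel_size // 2
    out = np.zeros_like(v, dtype=float)
    # Relative offsets of the (dilated) window around each query pixel.
    offsets = [(dy * dilation, dx * dilation)
               for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    for i in range(H):
        for j in range(W):
            # Keep only neighbors inside the map; (0, 0) is always included.
            idx = [(i + oy, j + ox) for oy, ox in offsets
                   if 0 <= i + oy < H and 0 <= j + ox < W]
            keys = np.stack([k[y, x] for y, x in idx])   # (n, d)
            vals = np.stack([v[y, x] for y, x in idx])   # (n, d)
            scores = keys @ q[i, j] / np.sqrt(d)          # (n,)
            weights = np.exp(scores - scores.max())       # stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ vals                    # (d,)
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
y = neighborhood_attention(x, x, x, kernel_size=3, dilation=2)
print(y.shape)  # (8, 8, 16)
```

The cost per pixel grows with `kernel_size**2` rather than with H * W, which is the source of the computational savings the reviewer mentions; when the window covers the whole map, the sketch reduces to ordinary global attention.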

1. The experimental results seem promising compared with the other methods. However, we would like to see more up-to-date state-of-the-art (SOTA) methods included for comparison in Section 4.3. In addition, all compared methods should be tested under the same experimental conditions.

2. The annotation in Fig. 3 is misleading, and the model structure is not clearly illustrated. In the ablation study, you state that “we removed the interaction of the DiNA module in the Temporal Neighborhood Cross Differ Module as the baseline”, but Fig. 3 gives no clear indication of a DiNA module within the Temporal Neighborhood Cross Differ Module. Did you mean that you removed the Cross-NA module, given that “The Cross-NA Module is a modified version of the DiNAT”, as described in Section 3.4? It is also advisable to add illustrations of DiNA and Cross-NA for better understanding.

3. Insufficient innovation. DiNAT appears simply to replace the convolution in the self-attention of PVT, CVT, and SegFormer with a dilated (atrous) convolution. The Change Decoder is very similar to SegFormer's decoder.

4. Insufficient description. DiNAT and Cross-NAT appear in the list of contributions without being introduced earlier, which reduces the readability of the article.

5. Spelling and grammatical errors. On page 6, line 235, the following sentence is garbled: ‘and filters the change noise while filtering the change noise in the interaction phase of the bi-temporal change features by acquiring the inductive bias of the local neighborhoods by means of it. further capturing the temporal-phase semantic change features of the CD task.’ In addition, formula (5) contains typographical errors. Please check the spelling and correct it.

6. More transformer-based methods for remote sensing (RS) segmentation tasks should be included in the introduction for comparison, e.g., “Building extraction from remote sensing images with sparse token transformers”; “Transformers in remote sensing: A survey”; and “Remote Sensing Image Change Detection Based on Deep Multi-Scale Multi-Attention Siamese Transformer Network”.

7. The table for the ablation experiment does not indicate which dataset the results were obtained on.

8. It is not clear in Figure 3c which features the Concat&Conv operation is applied to; please specify.

Comments on the Quality of English Language

n/a

Author Response

Dear Reviewer, please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors propose a Transformer-based Siamese network to perform the change detection task in remote sensing images.

Overall, the paper is well written and presents a valid analysis.

I have some minor comments that I would like to see addressed:

- In formula (1), the exact dot-product formulation should be Q^T K, since vectors are usually taken to be column vectors. You can also check this against formula (2) of reference [9] (A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection, by Chen and Shi). Of course, the same applies to formula (2) after line 282 of the current manuscript.

- In formula (4), after line 286, the symbol A_i should be bold, in accordance with its definition in formula (2), as should V_i, following formula (3). Please also make the corresponding symbols bold in formula (5).

- About the experiments:

- Please specify somewhere that the images used are RGB.

- WHU-CD dataset: I think the results from the paper in [16] (Fully Convolutional Networks for Multisource Building Extraction From Open Aerial and Satellite Imagery Data Set, by Ji, Wei, and Lu) should also be included in Table 3 and properly discussed.

About the English grammar/style, I found a couple of typos:

- Line 235: [...] local neighborhoods by means of it. further [...] ---> [...] by means of it, further (i.e., remove the full stop after "it").

- Lines 259-260: the same sentence is repeated twice.
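To illustrate the column-vector convention raised in the first comment above: if queries and keys are column vectors, the score between query i and key j is the inner product q_i^T k_j, so stacking the vectors as columns of Q and K yields the score matrix Q^T K. The following is a generic sketch; the 1/sqrt(d) scaling and the symbols a, α, and o are assumptions and are not taken from the manuscript's formulas.

```latex
\[
a_{ij} = \frac{\mathbf{q}_i^{\top}\mathbf{k}_j}{\sqrt{d}}, \qquad
\mathbf{Q}^{\top}\mathbf{K} = \big[\mathbf{q}_i^{\top}\mathbf{k}_j\big]_{i,j} \in \mathbb{R}^{n \times n}
\quad \text{for } \mathbf{Q} = [\mathbf{q}_1, \dots, \mathbf{q}_n],\ \mathbf{K} = [\mathbf{k}_1, \dots, \mathbf{k}_n] \in \mathbb{R}^{d \times n},
\]
\[
\alpha_{ij} = \frac{\exp(a_{ij})}{\sum_{j'} \exp(a_{ij'})}, \qquad
\mathbf{o}_i = \sum_{j} \alpha_{ij}\, \mathbf{v}_j .
\]
```

Writing Q K^T instead would only be correct under a row-vector convention, where each query is a row of Q.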

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Dear authors,

The article deals with a very important tool for surface observation and data analysis in remote sensing: change detection. The authors propose a method for change detection in high-resolution remote sensing images. The proposed method uses a neural network for segmentation and training sets based on ground data; more specifically, it is a transformer-based Siamese network structure called BTNIFormer.

The article seems to be based on extensive research and testing, but the manuscript itself should be revised by the authors in light of the following comments:

- Please revise the abstract to clearly state the overall purpose of the article and its findings.

- I recommend expanding the description of change detection in remote sensing images in the introduction, which currently fills only one paragraph.

- I recommend moving the contributions to the conclusion and leaving the objectives in the introduction, since the contributions only become clear at the end.

- Figure 3 should be explained in detail in the text rather than in the caption. Please make the caption more concise and move the description into the text to structure it better.

- Please enlarge Figure 3; it is difficult to read.

- Is the resolution of the Christchurch images really that high (0.075 m)?

- The results should be described in more detail in the text. The numbers are very meaningful, but I recommend a more detailed description.

- The conclusion is more of a synthesis of the article itself. The discussion and conclusions should serve as a debate on, and synthesis of, the results obtained in relation to the methodological procedures used (whether previous methods are better, and how), and should state the limitations and contributions. Please expand these sections.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

All concerns have been addressed.

Comments on the Quality of English Language

n/a
