Article
Peer-Review Record

MFATNet: Multi-Scale Feature Aggregation via Transformer for Remote Sensing Image Change Detection

Remote Sens. 2022, 14(21), 5379; https://doi.org/10.3390/rs14215379
by Zan Mao 1,2, Xinyu Tong 1,2, Ze Luo 1,* and Honghai Zhang 1
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 20 September 2022 / Revised: 22 October 2022 / Accepted: 25 October 2022 / Published: 27 October 2022
(This article belongs to the Section Remote Sensing Image Processing)

Round 1

Reviewer 1 Report

4 - difficulties

6 - employed, introducing other …

8 - Thus in this paper we propose

23 - here and anywhere “ remote sensed images”

40 - complicated

49 - here and everywhere: leave a space before a “(“ or a “[“

67- what is “it” here?

82 - CNN not CNNs

85 - here you mention for the first time the spatial semantic tokenizer but you actually define it way later. This is confusing. Either move here the definition or give an anticipation. Also capital letters for the first letter (it is SST not sst). Or a reference. 

Figure 1 caption - “means” seems the wrong word. Indicates?

92 - you use ResNet18 or resnet18. Be consistent.

158 - “With the dramatic development of deep learning” I would remove this

166 - 4, 8, 16, 32 times

170 - “proposed by Google” doesn’t sound professional

172 - “While” ? As opposed to what?

174 - which period?

179 - “instinctive” remove

180 - this reference to a PyTorch method is a useless technicality. If you want to keep it, add a reference to PyTorch.

Figure 3. “This is” -> remove

181 - h,w,c ? What are these?

Figure 3 - The labels of the blocks are quite misleading w.r.t. the text.

In the text you use fig. or Fig. or Figure. Choose one.

181 - Token first appears here without any definition

183 - “ Therefore, in order to learn long-range dependencies in Transformer more effectively after” this is obscure

188 - i being? i=0…3?

192 - 193 Xi -> the i should be subscript

199-202 - This definition would be useful above

206 -  Rephrase as “models such as BiT[2], former a 206 CNN+Transformer model, and ChangeFormer[3], a pure Transformer model, …”

211 - Compared

214 - All this sentence seems redundant.

225 - a few? be quantitative

230 - “without using post norm” remove

All the block below is missing line numbering so it’s quite difficult to map the comments.

235-236 query, key, value appears here for the first time undefined 

235-236 PE ? it’s unclear if it is a step of the chain or some chunk data

235-236 d is the sum of the dimensions?

235-236 you are redefining q, k, v from data dimensions to data values. This is confusing.

235-236 here and elsewhere it seems you are not using “However” properly. It should be used to introduce a statement that contrasts with the previous one. Here it is not.

236 - are you sure of this last W0 factor?

237 - 239 - I can’t really grasp what this sentence means “Yet … Fi”.

241 - define or reference GELU

247 - perform A bilinear

268 - module not mudule

Formula 11 -> OA define the acronym 

311 - remove (%)

311 - given in percentage

232 - Table.1 illustrates

4.2 - all of these are performed on the test datasets, I imagine. I don’t see it stated clearly.

323 - perceive -> understand, infer

323 - remove obviously

General comment. All these tables are nice but a bit confusing. A plot for each category would help. Also having the column/plot with the overall metrics could be nice.

Also, some of the metrics seem to be very close across different cases (just as an example: Table 3 rows for L=16 and L=32 for F1 in the first dataset). It would be useful to have error estimates on these numbers. Splitting the test dataset into N parts and computing the metrics N times could be a solution. Without errors, the conclusions you draw for L from Table 3 are hard to support.

369 - emphasis 

Again, like Figure, you use Table, tab., Tab. : be consistent.

Table 4 shows the same issue I was mentioning above. All the values are close in the first column. Without an error estimate you can’t really say anything.

417 - Table.5.6.7 would be better to use In Tables from 5 to 6

446 - F1 score AND IoU ?

466 - figure.10.11.12: same as above for Table.5.6.7

460-475 all these conclusions are quite qualitative. You could bin the 8 categories a bit more (especially A and B) to have more points in your plots and to perform a Kolmogorov-Smirnov test for the compatibility of the two distributions.

Author Response

Response to Reviewer 1 Comments

 

Point 1: 4 - difficulties

 

Response 1: Thank you for pointing out the deficiency. We’ve changed [difficulty] to [difficulties].

 

Point 2: 6 - employed, introducing other …

 

Response 2: Thank you for pointing out the deficiency. We’ve changed [employed, but this introduces other…] to [employed, introducing other…].

 

Point 3: 8 - Thus in this paper we propose

 

Response 3: Thank you for pointing out the deficiency. We’ve changed [Thus in this paper, we propose] to [Thus in this paper we propose].

 

Point 4: 23 - here and anywhere “ remote sensed images”

 

Response 4: Thank you for your suggestion. We have searched the literature and found that "remote sensing images" is widely used, for example in [2,3,4].

 

Point 5: 40 - complicated

 

Response 5: Thank you for pointing out the deficiency. We’ve changed [complicate] to [complicated].

 

Point 6: 49 - here and everywhere: leave a space before a “(“ or a “[“

 

Response 6: Thank you for pointing out the deficiency. We have revised.

 

Point 7: 67- what is “it” here?

 

Response 7: Thank you for pointing out the deficiency. We’ve modified the description in line 67. We’ve changed [it seems to bring a fresh twist to the above problems] to [the above mentioned problems seem to have some fresh solutions].

 

Point 8: 82 - CNN not CNNs

 

Response 8: Thank you for pointing out the deficiency. We’ve changed [CNNs-based] to [CNN-based].

 

Point 9: 85 - here you mention for the first time the spatial semantic tokenizer but you actually define it way later. This is confusing. Either move here the definition or give an anticipation. Also capital letters for the first letter (it is SST not sst). Or a reference.

 

Response 9: Thank you for pointing out the deficiency. We agree and have moved its definition here. We’ve changed [we design a spatial semantic tokenizer (SST)] to [we follow the semantic tokenizer in BiT and introduce the spatial attention module, resulting in the design of a Spatial Semantic Tokenizer (SST)].

 

Point 10: Figure 1 caption - “means” seems the wrong word. Indicates?

 

Response 10: Thank you for pointing out the deficiency. This observation is correct. We’ve changed [means] to [indicates].

 

Point 11: 92 - you use ResNet18 or resnet18. Be consistent.

 

Response 11: Thank you for pointing out the deficiency. We have unified the modification in the revised version to ResNet18.

 

Point 12: 158 - “With the dramatic development of deep learning” I would remove this

 

Response 12: Thank you for your suggestion. We’ve removed [With the dramatic development of deep learning] and updated to [In recent years].

 

Point 13: 166 - 4, 8, 16, 32 times

 

Response 13: Thank you for pointing out the deficiency. We’ve changed [4 times, 8 times, 16 times, and 32 times] to [4, 8, 16, 32 times].

 

Point 14: 170 - “proposed by Google” doesn’t sound professional

 

Response 14: Thank you for your suggestion. We’ve removed “proposed by Google”.

 

Point 15: 172 - “While” ? As opposed to what?

 

Response 15: Thank you for pointing out the deficiency. We’ve changed [While] to [However,].

 

Point 16: 174 - which period?

 

Response 16: Thank you for pointing out the deficiency. We’ve changed [in the same period] to [from the same period as ViT].

 

Point 17: 179 - “instinctive” remove

 

Response 17: Thank you for pointing out the deficiency. We’ve removed “instinctive”.

 

Point 18: 180 - this reference to a PyTorch method is an useless technicality. If you want to keep it add a reference to PyTorch.

 

Response 18: Thank you for your suggestion. We’ve changed [Pytorch] to [deep learning open source framework].

 

Point 19: Figure 3. “This is” -> remove

 

Response 19: Thank you for pointing out the deficiency. We’ve removed “This is”. And we made similar changes in the titles of other figures.

 

Point 20: 181 - h,w,c ? What are these?

 

Response 20: Thank you for pointing out the deficiency. We’ve added the description of h, w, c at the end of the sentence: “where h, w, c are the height, width and channel of the input image respectively.”.

 

Point 21: Figure 3 - The labels of the blocks are quite misleading w.r.t. the text.

 

Response 21: Thank you for pointing out the deficiency. We made minor modifications to Figure 3 to make it better readable.

 

Point 22: In the text you use fig. or Fig. or Figure. Choose one.

 

Response 22: Thank you for pointing out the deficiency. We use “Fig.” and “Table” uniformly.

 

Point 23: 181 - Token first appears here without any definition

 

Response 23: Thank you for pointing out the deficiency. We’ve added the definition of Token at the end of the sentence: “each pixel in the vector is a token similar to a word in a sentence.”.
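For illustration (this sketch is ours, not taken from the manuscript), the reshaping described here can be written in PyTorch as follows; the shapes are hypothetical:

```python
import torch

# A feature map of shape (b, c, h, w) is flattened so that each spatial
# position becomes one token of dimension c, analogous to a word in a sentence.
feature_map = torch.randn(1, 64, 16, 16)           # hypothetical (b, c, h, w)
tokens = feature_map.flatten(2).transpose(1, 2)    # (b, h*w, c): h*w tokens of dimension c
print(tokens.shape)                                # torch.Size([1, 256, 64])
```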

 

Point 24: 183 - “ Therefore, in order to learn long-range dependencies in Transformer more effectively after” this is obscure

 

Response 24: Thank you for pointing out the deficiency. We’ve changed [Therefore, in order to learn long-range dependencies in Transformer more effectively afterwards] to [Therefore, to make the subsequent transformer more effective in learning long-range dependencies].

 

Point 25: 188 - i being? i=0…3?

 

Response 25: Thank you for pointing out the deficiency. We’ve added a description of the value range of i at the end of the sentence: “and i = {1, 2, 3, 4} corresponds to the last four stages of ResNet.”.

 

Point 26: 192 - 193 Xi -> the i should be subscript

 

Response 26: Thank you for pointing out the deficiency. We’ve changed [Xi] to [X_i], with the i as a subscript.

 

Point 27: 199-202 - This definition would be useful above

 

Response 27: Thank you for your suggestion. At the corresponding point above, we now give a preliminary definition.

 

Point 28: 206 - Rephrase as “models such as BiT[2], former a 206 CNN+Transformer model, and ChangeFormer[3], a pure Transformer model, …”

 

Response 28: Thank you for your suggestion. We’ve made the change. The new sentence is as follows: “models such as BiT[2], a CNN + Transformer model, and ChangeFormer[3], a pure Transformer model,…”.

 

Point 29: 211 - Compared

 

Response 29: Thank you for pointing out the deficiency. We’ve changed [Compare] to [Compared].

 

Point 30: 214 - All this sentence seems redundant.

 

Response 30: Thank you for pointing out the deficiency. We’ve changed this sentence to [When the multi-scale feature maps are obtained, the SST module is used to yield the semantic tokens.]

 

Point 31: 225 - a few? be quantitative

 

Response 31: Thank you for pointing out the deficiency. We’ve changed [a few] to [4L groups of].

 

Point 32: 230 - “without using post norm” remove

 

Response 32: Thank you for your suggestion. We’ve removed “without using post norm but” here.

 

Point 33: 235-236 query, key, value appears here for the first time undefined

 

Response 33: Thank you for pointing out the deficiency. We’ve added the definition of query, key, and value here and changed the sentence to “When we input tokens into the transformer to calculate self-attention, Q, K, and V are computed from the token T, where Q, K, and V denote Query, Key, and Value, respectively.”.

 

Point 34: 235-236 PE ? it’s unclear if it is a step of the chain or some chunk data

 

Response 34: Thank you for pointing out the deficiency. We’ve added a description of PE, which is an abbreviation for positional encoding.

 

Point 35: 235-236 d is the sum of the dimensions?

 

Response 35: Thank you for pointing out the deficiency. This is indeed confusing. We’ve modified “d is the number of channels in the query, key, and value” to “d is the channel dimension of Q/K/V”. We also added the definition of Q/K/V above for clearer elaboration.

 

Point 36: 235-236 you are redefining q, k, v from data dimensions to data values. This is confusing.

 

Response 36: Thank you for pointing out the deficiency. This is indeed confusing. We intended it to be short for query, key and value, but after reviewing the previous literature, we modified the description here for a clearer understanding.

 

Point 37: 235-236 here and elsewhere it seems you are not using “However” properly. It should be used to introduce a statement that contrasts with the previous one. Here it is not.

 

Response 37: Thank you for pointing out the deficiency. We realized that "However" was an inappropriate usage, so we removed "However".

 

Point 38: 236 - are you sure of this last W0 factor?

 

Response 38: Yes, we are sure about the final W0 factor. We also fixed a punctuation error here.
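For context, a minimal sketch of standard multi-head self-attention is shown below; it illustrates where a final output-projection matrix (the W0 in question) appears in the usual Transformer formulation. This is a generic illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention over tokens of shape (B, N, C)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads      # d: channel dimension of each Q/K/V head
        self.qkv = nn.Linear(dim, dim * 3)    # Q, K, V are all computed from the tokens T
        self.proj = nn.Linear(dim, dim)       # final output projection (the "W0" factor)

    def forward(self, tokens):
        B, N, C = tokens.shape
        qkv = self.qkv(tokens).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                     # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # scale by sqrt(d)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)        # concatenate the heads
        return self.proj(out)                                    # multiply by W0
```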

 

Point 39: 237 - 239 - I can’t realy grasp what this sentence means “Yet … Fi”. 

 

Response 39: Thank you for pointing out the deficiency. We’ve revised the text to address your concerns and hope that it is now clearer.

 

Point 40: 241 - define or reference GELU

 

Response 40: Thank you for your suggestion. We’ve added a reference to GELU.
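As background (not part of the manuscript), the GELU activation referenced here, introduced by Hendrycks and Gimpel in "Gaussian Error Linear Units", is defined as

\mathrm{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right]\right),

where \Phi is the standard normal cumulative distribution function.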

 

Point 41: 247 - perform A bilinear

 

Response 41: Thank you for pointing out the deficiency. We’ve changed [perform bilinear] to [perform a bilinear].

 

Point 42: 268 - module not mudule

 

Response 42: Thank you for pointing out the deficiency. We’ve changed [mudule] to [module].

 

Point 43: Formula 11 -> OA define the acronym

 

Response 43: Thank you for pointing out the deficiency. We’ve changed [overall accuracy] to [Overall Accuracy].
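For reference, the standard definition of Overall Accuracy used in change detection, which we assume is what Formula 11 expresses, is

\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN},

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.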

 

Point 44: 311 - remove (%)

 

Response 44: Thank you for your suggestion. We’ve removed it.

 

Point 45: 311 - given in percentage

Response 45: Thank you for your suggestion. We’ve changed [are all in percentage] to [are all given in percentage].

 

Point 46: 232 - Table.1 illustrates

 

Response 46: Thank you for your suggestion. We added the description that the experiments were performed on the test sets.

 

Point 47: 4.2 - are all of these performed on test datasets I imagine. I don’t see it stated clearly.

 

Response 47: Thank you for your suggestion. We redescribed the title of Table 1, adding the description that the experiments were performed on the test sets of LEVIR-CD, WHU-CD, and DSIFN-CD.

 

 

Point 48: 323 - perceive -> understand, infer

 

Response 48: Thank you for your suggestion. We’ve changed [perceive] to [understand].

 

Point 49: 323 - remove obviously

 

Response 49: Thank you for your suggestion. We’ve removed it.

 

Point 50: Also, some of the metrics seem to be very close across different cases (just as an example: Table 3 rows for L=16 and L=32 for F1 in the first dataset). It would be useful to have error estimates on these numbers. Splitting the test dataset into N parts and computing the metrics N times could be a solution. Without errors, the conclusions you draw for L from Table 3 are hard to support.

 

Response 50: Thank you for your suggestion. We followed your suggestion and randomly sampled the test set of LEVIR-CD (N = 4, 512 images per sub-test set). We then performed four evaluations on the sub-test sets, as shown in Table 1.

 L  |   sub-test1    |   sub-test2    |   sub-test3    |   sub-test4
    |  IoU     F1    |  IoU     F1    |  IoU     F1    |  IoU     F1
 2  | 81.724  89.943 | 81.213  89.632 | 81.425  89.762 | 82.470  90.393
 4  | 81.865  90.028 | 81.408  89.751 | 81.205  89.628 | 82.099  90.170
 8  | 81.740  89.953 | 81.371  89.729 | 80.986  89.494 | 82.334  90.311
 16 | 81.628  89.885 | 81.444  89.773 | 81.317  89.696 | 82.527  90.427
 32 | 81.832  90.009 | 81.091  89.558 | 81.386  89.738 | 82.532  90.430

Table 1. The results on the four sub-test sets.

We then calculated the mean values of IoU and F1 based on Table 1, as shown in Table 2.

 L  | mean-IoU | mean-F1
 2  | 81.708   | 89.9325
 4  | 81.64425 | 89.89425
 8  | 81.60775 | 89.87175
 16 | 81.729   | 89.94525
 32 | 81.71025 | 89.93375

Table 2. The mean values of IoU and F1.

Actually, as in previous work [1], L was also explored as a hyperparameter, L being the length of the compact token; different values of L may lead to subtle differences in the semantic information contained in the tokens (BiT [1] concluded that too small an L leads to the loss of some useful information, while too large an L contains redundant information, but its experiments were on single-scale feature maps). In our article, however, the multi-scale feature maps are constructed as tokens, and the semantic and localization information contained in the feature maps at different scales is inherently different (e.g., low-level features versus high-level features). Also, as mentioned in the Discussion, the train and test sets of LEVIR-CD have quite close distributions, which is friendly to model learning. Therefore, the gap in the ablation experiments on L for the LEVIR-CD test set is not as obvious as for the other two datasets. We conjecture that tokenizing the features at different scales with different values of L (e.g., a larger L for low-level features and a smaller L for high-level features) and then merging them for later training is feasible. Of course, this is just our speculation, and we will explore it in future studies.
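As a sketch of the error-estimation procedure used above (splitting the test set into N folds and recomputing the metrics per fold), something like the following could be used; the per-image confusion counts are hypothetical inputs that any change-detection evaluation loop can accumulate:

```python
import numpy as np

def folded_metrics(per_image_tp, per_image_fp, per_image_fn, n_splits=4, seed=0):
    """Randomly split a test set into n_splits folds and report mean/std of IoU and F1.

    per_image_tp/fp/fn: 1-D arrays of per-image true-positive, false-positive and
    false-negative pixel counts (hypothetical inputs).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(per_image_tp))
    ious, f1s = [], []
    for fold in np.array_split(idx, n_splits):
        tp = per_image_tp[fold].sum()
        fp = per_image_fp[fold].sum()
        fn = per_image_fn[fold].sum()
        ious.append(tp / (tp + fp + fn))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return np.mean(ious), np.std(ious), np.mean(f1s), np.std(f1s)
```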

 

Point 52: 369 - emphasis

 

Response 52: Thank you for pointing out the deficiency. We’ve changed [emphasize] to [emphasis].

 

Point 53: Again, like Figure, you use Table, tab., Tab. : be consistent.

 

Response 53: Thank you for pointing out the deficiency. We use “Fig.” and “Table” uniformly.

 

Point 54: Table 4 shows the same issue I was mentioning above. All the values are close in the first column. Without an error estimate you can’t really say anything.

 

Response 54: Thank you for your suggestion. We followed your suggestion and randomly sampled the test set of LEVIR-CD (N = 4, 512 images per sub-test set). We then performed four evaluations on the sub-test sets, as shown in Table 3.

 Method |      sub1      |      sub2      |      sub3      |      sub4
        |  IoU     F1    |  IoU     F1    |  IoU     F1    |  IoU     F1
 max    | 82.191  90.225 | 81.561  89.844 | 81.875  90.034 | 82.466  90.391
 avg    | 82.493  90.407 | 81.498  89.806 | 82.126  90.186 | 82.909  90.656
 st     | 82.336  90.312 | 81.297  89.684 | 82.365  90.330 | 82.700  90.531
 sst    | 82.582  90.460 | 81.792  89.984 | 82.146  90.198 | 82.669  90.512

Table 3. The results on the four sub-test sets.

We then calculated the mean values of IoU and F1 based on Table 3, as shown in Table 4.

 Method | mean-IoU | mean-F1
 max    | 82.02325 | 90.1235
 avg    | 82.2565  | 90.26375
 st     | 82.175   | 90.21425
 sst    | 82.29725 | 90.2885

Table 4. The mean values of IoU and F1.

After randomly dividing the test set of LEVIR-CD into 4 portions, we observed that the IoU and F1 scores obtained were consistent with those reported in the paper. We also think that it is the high quality of the LEVIR-CD dataset that causes the gap to be not as large as on the other two datasets.

 

Point 55: 417 - Table.5.6.7 would be better to use In Tables from 5 to 6

 

Response 55: Thank you for your suggestion. We’ve changed [Table.5.6.7] to [In Tables from 5 to 7].

 

Point 56: 446 - F1 score AND IoU ?

 

Response 56: Thank you for your suggestion. We’ve changed [F1 score, IoU] to [F1 score and IoU].

 

Point 57: 466 - figure.10.11.12: same as above for Table.5.6.7

 

Response 57: Thank you for your suggestion. We’ve changed [figure.10.11.12] to [Figures 10 to 12], in the same way as for Tables 5 to 7.

 

Point 58: 460-475 all these conclusions are quite qualitative. You could bin the 8 categories a bit more (especially A and B) to have more points in your plots and to perform a Kolmogorov-Smirnov test for the compatibility of the two distributions.

 

Response 58: Thank you for your suggestion. We reorganized the descriptions in the Discussion section to make them more readable. Our intuition for dividing A-H with 4096 as the interval is that the input image is 256 x 256, and if the change area is no more than one-eighth of the image, we consider the interval to be appropriate. We also initially tried to use more points in our plots, but the resulting visualization was not as clear as the current one, so we chose 8 categories. Placing more points between the 8 categories would require a further division of the interval, which would be contrary to our original intention.

In addition, we followed your comments, performed the Kolmogorov-Smirnov test, and present it in Table 9 in the paper.
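For completeness, a two-sample Kolmogorov-Smirnov test of this kind can be run with SciPy's ks_2samp; the scores below are placeholder values, not results from the paper:

```python
from scipy.stats import ks_2samp

# Placeholder per-category scores for two methods (e.g., F1 over the
# change-area bins A-H); the KS test checks whether the two score
# distributions are compatible.
scores_method_a = [0.88, 0.90, 0.91, 0.89, 0.92, 0.90, 0.87, 0.93]
scores_method_b = [0.85, 0.88, 0.90, 0.86, 0.91, 0.89, 0.84, 0.90]

statistic, p_value = ks_2samp(scores_method_a, scores_method_b)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3f}")
```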

 

References

  1. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Transactions on Geoscience and Remote Sensing 2021.
  2. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing 2020, 166, 183–200.
  3. Bandara, W.G.C.; Patel, V.M. A transformer-based siamese network for change detection. arXiv preprint arXiv:2201.01293 2022.
  4. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sensing 2020, 12, 1662.

Author Response File: Author Response.docx

Reviewer 2 Report

Mao et al. report an interesting approach for detecting changes in remote sensing images. The authors use aggregation of features at multiple scales via a transformer to learn coherent details of feature maps at different scales. The result is a more accurate description of changes in images. In addition, the authors present their approach to error correction between result maps at multiple scales and provide the code for their work.


I recommend publication after fixing the following minor issues.


Page 1 line 19. Please put a space in "changethe".


The description of figure 7 is vague (page 7). The same is true for the figure caption. Each image of the figure should be named and properly described. The details are missing. The case is similar for the other figures. For example, what do the colors mean in Figures 8 and 9?

Author Response

Response to Reviewer 2 Comments

 

Point 1: Page 1 line 19. please put a space bar in "changethe".

 

Response 1: Thank you for pointing out the deficiency. We modified "changethe" to "change the".

 

Point 2: The description of figure 7 is vague (page 7). The same is true for the figure caption. Each image of the figure should be named and properly described. The details are missing. The case is similar for the other figures. For example, what the colors mean in figures 8 and 9?

 

Response 2: Thank you for pointing out the deficiency. We redescribed the illustrations in figures 7, 8, and 9, adding more details.

Author Response File: Author Response.docx

Reviewer 3 Report

 

Overall Assessment

This paper proposed an MFATNet network. The network acquires multi-scale features of bi-temporal remote sensing images through ResNet18, uses the Spatial Semantic Tokenizer (SST) to extract semantic tokens from the features, then obtains an accurate change map through the transformer, and integrates the IICAM attention module to further obtain more convincing change maps. This study compared the performance of MFATNet on three datasets, LEVIR-CD, WHU-CD and DSIFN-CD, and achieved good results in comparison with other excellent methods such as FCEF, FC-Siam-D, and FC-Siam-Conc.

However, the manuscript still has some flaws, and therefore, I recommend that the manuscript be published in this journal after major revisions.

There are too many typos, grammar errors, and repetitive statements in the current version; the English of the paper needs improvement, and we recommend you find a native English-speaking expert to help you revise the paper.

More specifical comments are shown below

Abstract

Line 14 “…semantic gaps and localization errors…”. This statement appears only twice, in lines 14 and 104, and the context does not give a good understanding of what these two terms represent.

Introduction

The introductory section of the manuscript provides a good overview of the advantages and disadvantages of the transformer algorithm and how to improve these disadvantages. However, there is less content about SST and IICAM. I read your paper and found that the innovation focuses on inserting the SST and IICAM into the transformer network structure. I suggest the authors add more descriptions of SST and IICAM.

Related Work

Line 143-145 “However, transformer is inherently capable of calculating the global receptive field of input, which is a perfect match for remote sensing.” However, it is well known that remote sensing images are different from RGB photos; the authors should further explain and confirm the effect of the global receptive field on different land cover types in remote sensing images.

Materials and Method

Line 178 The authors finally adopt the visual transformer, between the ViT and the other visual transformer networks. But the workflow of Figure 2 uses a “Multi Transformer Module”; I did not understand what the Multi Transformer Module means. The caption of Figure 2 uses the phrase “multi-scale transformer module”. So, I am confused by the diagram of MFATNet in this paper.

Experiment

Line 417-443 The comparisons between methods are not well described, and I did not understand very clearly the advantages and disadvantages of MFATNet on the different datasets. For example, “…from the second-best method by 1.05…”: which is the second-best method specifically? I think the authors should reorganize this paragraph and make the statement more readable.

Discussion

I suggest that the authors clarify and explain the advantages and disadvantages of the MFATNet algorithm, and how this method improves the classification accuracy for different remote sensing datasets. I think the authors should rewrite the discussion section.

More specifical comments are shown below

Line 160-161 “…, and it does not have a strong demand for deeper networks…”. It is necessary for the authors to add some references to support this statement.

Figure 9. The sub-figure labels used in the text differ from those in the original figures, e.g., “a.”, “b.”, etc. in the figure but “a”, “b” in the text. I think the authors should use a unified format.

Line 148 and 163~164 and 485 “…resnet18…”. The writing of ResNet18 in the paper needs to follow a unified form.

 

Table 8-9. The titles of Tables 8 and 9 are the same.

 

 

Author Response

Response to Reviewer 3 Comments

 

Point 1: Line 14 “…semantic gaps and localization errors…”. This relevant statement appears only twice in lines 14 and 104, and does not have a good understanding of what these two words represent through the context.

 

Response 1: Thank you for pointing out the deficiency. We realized that the previous text did not read well. Therefore, in the revised version, we modified the description in line 14 to differentiate it from the statement in line 104.

 

Point 2: The introductory section of the manuscript provides a good overview of the advantages and disadvantages of the transformer algorithm and how to improve these disadvantages. However, there is less content about SST and IICAM. I read your paper, and found that the innovation focused on inserting the SST and IICAM into transformer network structure. I suggest authors to add more descriptions of SST and IICAM.

 

Response 2: Thank you for your suggestion about SST and IICAM. We have supplemented the description of SST and IICAM for better readability and ease of understanding.

 

Point 3: Line 143-145 “However, transformer is inherently capable of calculating the global receptive field of input, which is a perfect match for remote sensing.” However, it is well-known that remote sensing images are different from RGB photos, authors should further explain and confirm the effect of global receptive field on different land cover types in remote sensing images.

 

Response 3: Thank you for your suggestion. We have redescribed this sentence, adding the explanation that the global receptive field is beneficial for remote sensing images.

 

Point 4: Line 178 The authors finally adopt visual transformer between the ViT and the visual transformer networks. But the workflow of Figure 2 used the “Multi Transformer Module”, I did not what is mean of Multi Transformer Module? The title of Figure 2 had a statement using the “multi-scale transformer module”. So, I confusion the diagram of MFATNet in this paper.

 

Response 4: Thank you for pointing out the deficiency. Actually, our intention in using "Multi Transformer Module" was to illustrate our use of the Transformer Module to aggregate learning of multi-scale features. But we did not realize that this description was confusing. Therefore, we have changed "Multi Transformer Module" to "Transformer Module" in the workflow of Figure 2, and correspondingly in the caption of Figure 2.

 

 

Point 5: Line 417-443 The comparisons between classification methods are not well described, and I did not understand very clearly the advantages and disadvantages of MFATNet using different datasets. For example, “…from the second-best method by 1.05…”. Which is the second-best method specifically? I think authors should reorganize this paragraph, and make the statement more readable.

 

Response 5: Thank you for your suggestion. We supplemented the three datasets with more detailed descriptions of the experimental results, and for a clearer presentation, we marked the top three accuracies with different colors (red indicates the best, blue the second-best, and black the third-best) in Tables 7, 8, and 9. Meanwhile, we have also added a description of how our method differs from the other methods to better convey our proposal.

 

Point 6: I suggested that the authors would be better to clarify and explain the advantages and disadvantages of MFATNet algorithm, and how this method improves the classification accuracy for different remote sensing datasets. I think authors should rewrite the discussion section.

 

Response 6: Thank you for your suggestion. We have supplemented the advantages and disadvantages of the MFATNet algorithm, described how we improve classification accuracy compared with the other excellent methods, and preliminarily explained how our method improves classification accuracy on the three datasets.

 

Point 7: Line 160-161 “…, and it does not have a strong demand for deeper networks…”. It is necessary for authors to add some literatures to support this statement.

 

Response 7: Thank you for pointing out the deficiency. We realize that our previous statement was a bit arbitrary; thus, we reorganized the presentation. In practice, our intuition comes from previous work [1,2,3,4] that used ResNet18 as a feature extractor.

 

Point 8: Figure 9. The statements of sequence number of figures in the text were different from the note of original figures, such as using the “a.”, “b.”, etc. in the figure, but using the “a”, “b” in the text. I think authors should use the unified format.

 

Response 8: Thank you for pointing out the deficiency. We unified the formatting in the figure and text as "a." and "b.".

 

Point 9: Line 148 and 163~164 and 485 “…resnet18…”. The writing of ResNet18 in the paper needs to follow a unified form.

 

Response 9: Thank you for pointing out the deficiency. We have unified the writing in the revised version to ResNet18.

 

Point 10: Table 8-9. The titles of Tables 8 and 9 are the same.

 

Response 10: Thank you for pointing out the deficiency. We have modified the titles of Tables 8 and 9.

 

References

  1. Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Transactions on Geoscience and Remote Sensing 2021.
  2. Chen, H.; Li, W.; Shi, Z. Adversarial Instance Augmentation for Building Change Detection in Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 2021, PP, 1–16.
  3. Song, F.; Zhang, S.; Lei, T.; Song, Y.; Peng, Z. MSTDSNet-CD: Multiscale Swin Transformer and Deeply Supervised Network for Change Detection of the Fast-Growing Urban Regions. IEEE Geoscience and Remote Sensing Letters 2022, 19, 1–5.
  4. Ke, Q.; Zhang, P. Hybrid-TransCD: A Hybrid Transformer Remote Sensing Image Change Detection Network via Token Aggregation. ISPRS International Journal of Geo-Information 2022, 11, 26.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The vast majority of the comments were addressed. Thank you for all the effort!

Reviewer 3 Report

All my concerns have been addressed and revised, and I have no other comments. I agree to this paper being published in the Journal.
