Article
Peer-Review Record

A Swin Transformer-Based Encoding Booster Integrated in U-Shaped Network for Building Extraction

Remote Sens. 2022, 14(11), 2611; https://doi.org/10.3390/rs14112611
by Xiao Xiao 1,2,3, Wenliang Guo 1,*, Rui Chen 1,4, Yilong Hui 1, Jianing Wang 5 and Hongyu Zhao 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 1 April 2022 / Revised: 25 May 2022 / Accepted: 27 May 2022 / Published: 29 May 2022
(This article belongs to the Special Issue Recent Advances in Neural Network for Remote Sensing)

Round 1

Reviewer 1 Report

The paper describes a novel deep learning-based architecture to deal with the building extraction problem in remote sensing images. The architecture consists of a vision Transformer with shifted windows (swin Transformer) integrated into a U-shaped network as an encoding booster, called STEB-UNet.
The work is interesting, and its motivations and main contributions are clear.
The methodology is adequately detailed. Experiments are appropriate, showing strong performance of the proposed approach.

As the manuscript is already well written and the method is adequately tested, I have no major comments to enhance the paper, but I have two main questions which could be addressed by the authors as possible future developments of the work:
- References [48] and [49] show approaches based on fusing U-Nets and Transformers, but only for medical image segmentation. The authors claim that such a "direct fusion creates a potential ambiguity and imbalance in local and global feature extraction", but is this also the case in building extraction? The pyramid structure of the Transformer booster seems a great idea for feature fusion, but I wonder whether other types of Transformers in the literature (not only the basic ViT) could achieve good performance in the specific task of building extraction when fused with a U-Net.
- Since the visual comparison of STEB-UNet was also done against U2-Net (Figures 5 and 6), is it possible to improve the capabilities (and performance) of the STEB-UNet approach even further by fusing the STEB with U2-Net instead? Even if not, a comparison could be interesting.


Minor comment:
On page 2, lines 67 and 85: I think that the capitalization in "connectiViTy" and "DosoViTskiy" was unintentional; it needs to be corrected.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors presented a deep learning approach to identify buildings in satellite/aerial images. 

I have one concern regarding the "transfer learning" on page 14. "Transfer learning" has a broader meaning than training on one dataset and testing on another. Usually, part of the network is first trained on a much larger dataset from another application, and these trained layers are then "transferred" to another network. I would recommend not using the term "transfer learning" unless the authors justify a reason for doing so.
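For clarity, here is a minimal PyTorch-style sketch of the distinction this reviewer (and Reviewer 3 below) draws: fine-tuning on the target dataset, which is the usual deep learning sense of "transfer learning", versus the direct cross-dataset evaluation the manuscript appears to report. The model class, weight file, and data loader names are hypothetical placeholders, not names from the manuscript.

```python
import torch

model = BuildingSegNet()                              # hypothetical model class
model.load_state_dict(torch.load("whu_weights.pth"))  # weights trained on dataset A
criterion = torch.nn.BCEWithLogitsLoss()

# (a) Transfer learning in the usual deep learning sense:
#     re-train (fine-tune) the pre-trained model on dataset B.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model.train()
for images, masks in mas_loader:                      # hypothetical dataset-B loader
    optimizer.zero_grad()
    criterion(model(images), masks).backward()
    optimizer.step()

# (b) What the manuscript appears to report: direct cross-dataset
#     evaluation, with no re-training on dataset B at all.
model.eval()
with torch.no_grad():
    for images, masks in mas_loader:
        preds = model(images)                         # compute metrics only
```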

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

In this manuscript, the authors proposed a novel semantic segmentation framework for building detection by combining the swin Transformer and U-Net. My comments are listed below:

1) Just a question about Figure 1. In the upper-left part (the four images), why is a "real" RS image provided for the largest one but ground-truth images for the smaller ones? Is there any specific meaning to that?

2) Lines 171-185. In this paragraph, I guess you are trying to explain the strategy for dealing with the divisibility issue between image size and window size (a generic sketch of one common padding strategy is given after this list). However, it is not clear enough, including but not limited to: (a) what does the l-th calculation refer to? Does it refer to one specific image, or to a specific step when processing one image? (b) There are no sub-titles for the two sub-figures in Figure 2; in this paragraph you only write "Figure 2", but some of the text actually refers to "Figure 2(b)" (the right sub-figure). Please review and revise this paragraph to make it clearer.

3) Line 371. It would be better if detailed explanations of TP, FP, and FN were provided (a short generic illustration is given after this list).

4) Section 3.4: I think it would be better to discuss the performance of different loss functions, and of different loss-function parameters, in the "Discussion" section.

5) Figure 5: could you please provide zoomed-in views of the red-rectangle annotations?

6) Still related to Figure 5: could you add another comparison figure for the Massachusetts dataset?

7) Lines 409-427. Maybe a question/comment/discussion point. In the deep learning era, the meaning of transfer learning seems to have changed. Traditionally, transfer learning meant storing knowledge gained from domain A and applying it to a different domain B. In deep learning, many people take it to mean re-training, on dataset B, a model that has been pre-trained on dataset A. However, your approach differs from both: you train your model only on dataset A (e.g., the WHU dataset) and apply it to dataset B (e.g., the Massachusetts dataset) directly, without a re-training procedure. So I think it may not be appropriate to use the term "transfer learning" here.
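Regarding comment 2, a minimal sketch of one common way to handle the divisibility issue: zero-pad the feature map so its height and width become multiples of the window size before window partitioning, as standard swin Transformer implementations do. This is a generic illustration, not necessarily the exact strategy of the manuscript.

```python
import torch
import torch.nn.functional as F

def pad_to_window_multiple(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """x: (B, H, W, C) feature map; pads H and W up to multiples of window_size."""
    _, h, w, _ = x.shape
    pad_h = (window_size - h % window_size) % window_size
    pad_w = (window_size - w % window_size) % window_size
    # F.pad pads the last dims first: (C_left, C_right, W_left, W_right, H_left, H_right)
    return F.pad(x, (0, 0, 0, pad_w, 0, pad_h))

x = torch.randn(1, 30, 45, 96)             # H=30, W=45 not divisible by 7
print(pad_to_window_multiple(x, 7).shape)  # torch.Size([1, 35, 49, 96])
```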
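Regarding comment 3, a short, generic illustration of the requested definitions for binary building masks (1 = building, 0 = background); this is standard usage, not code from the manuscript.

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9):
    tp = np.sum((pred == 1) & (gt == 1))  # building pixels correctly predicted
    fp = np.sum((pred == 1) & (gt == 0))  # background predicted as building
    fn = np.sum((pred == 0) & (gt == 1))  # building pixels that were missed
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, f1, iou

p, r, f1, iou = pixel_metrics(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]]))
```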

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Efficient building extraction algorithms can identify and segment building areas, providing informative data for other problems. Building extraction is usually achieved by deep CNNs based on the U-shaped encoder-decoder architecture. However, the local perceptive field of the convolutional operation poses a challenge for CNNs to fully capture the semantic information of large buildings, especially in high-resolution remote sensing images. In this study, the authors proposed a shifted-window (swin) Transformer-based encoding booster that uses a Transformer pyramid containing patch-merging layers for down-sampling, which enables the encoding booster to extract semantics from multi-level features at different scales. The authors integrated the encoding booster into a specially designed U-shaped network (the Swin Transformer-based Encoding Booster U-shaped Network, STEB-UNet), resulting in the feature-level fusion of local and large-scale semantics. Owing to its low computational complexity and memory requirement, the novel framework can effectively discriminate and extract buildings of different scales and demonstrates higher accuracy than the state-of-the-art methods.
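To make the patch-merging down-sampling mentioned above concrete, here is a minimal sketch in the style of standard swin Transformer implementations: each 2x2 group of neighboring patches is concatenated along the channel axis and linearly projected, halving the spatial resolution. This is a generic illustration, not code from the manuscript.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):  # x: (B, H, W, C), with H and W even
        x0 = x[:, 0::2, 0::2, :]                   # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                   # bottom-left
        x2 = x[:, 0::2, 1::2, :]                   # top-right
        x3 = x[:, 1::2, 1::2, :]                   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)

print(PatchMerging(96)(torch.randn(1, 8, 8, 96)).shape)  # torch.Size([1, 4, 4, 192])
```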

Comments:

  1. The quality and performance of the designed system depend on the weight parameter alpha used in Eq. (11), where two loss functions are combined. Please explain the chosen value of alpha. This reviewer did not see a statistically significant difference between the different loss functions presented in Table 1 or their combination. Please comment on this fact and explain the choice of one of them or of their combination (a generic sketch of such a weighted loss is given after this list).
  2. The authors should present many more details of the experimental settings employed for their framework, and likewise for the methods used in the comparison: U-Net, U2-Net, SETR-Naive, BRRNet, and RFANet; in particular, the experimental settings used in the experiments with these systems.
  3. What was the purpose of using different volumes for training, validation, and testing in the two datasets? In the first dataset, about 60% was used for training and about 30% for testing. In the second dataset, the data were augmented via random cropping, rotating, and shifting; as this reviewer understood, this dataset is a limited one, and more than 90% was used for training while less than 7% was used for testing. Please explain the choice of these volumes for the different stages (training, validation, and testing), and also explain how the comparison systems were used on these datasets.
  4. Please present more discussion and results in Subsection 3.5 (transfer learning); such an approach is very important in practical applications.
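Regarding comment 1, a generic sketch of a weighted two-term loss is shown below. The specific terms (binary cross-entropy and a soft Dice loss) and the convex form alpha * L1 + (1 - alpha) * L2 are illustrative assumptions, not necessarily the manuscript's Eq. (11).

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, alpha: float = 0.5, eps: float = 1e-6):
    # Weighted combination of a pixel-wise BCE term and a region-level Dice term.
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    return alpha * bce + (1 - alpha) * dice

logits = torch.randn(2, 1, 64, 64)
target = torch.randint(0, 2, (2, 1, 64, 64)).float()
print(combined_loss(logits, target, alpha=0.75))
```

Sweeping alpha over, e.g., {0.25, 0.5, 0.75} and reporting the resulting validation scores would directly answer how sensitive the system is to this parameter.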

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 4 Report

The authors have addressed all comments of this reviewer.
