Article
Peer-Review Record

EfficientUNet+: A Building Extraction Method for Emergency Shelters Based on Deep Learning

Remote Sens. 2022, 14(9), 2207; https://doi.org/10.3390/rs14092207
by Di You 1,2,†, Shixin Wang 1,2,†, Futao Wang 1,2,3,*, Yi Zhou 1,2, Zhenqing Wang 1,2, Jingming Wang 1,2 and Yibing Xiong 1,2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 8 April 2022 / Revised: 30 April 2022 / Accepted: 2 May 2022 / Published: 5 May 2022
(This article belongs to the Topic Computational Intelligence in Remote Sensing)

Round 1

Reviewer 1 Report

This paper presents a U-Net variant for building extraction, and experimental results are provided to validate its performance. The following are my comments on this paper:

  1. In Sec. 4, it would also be interesting to see the training/inference time efficiency.
  2. Sensitivity to the amount of training samples should be analyzed, e.g., 10%, 20%, etc., without any data augmentation.
  3. Training history vs. epoch (loss vs. epoch) should be added to compare training efficiency with the competing methods.
  4. The experimental results are not convincing; the proposed method should be compared with state-of-the-art U-Net methods, especially those that use an EfficientNet encoder.
  5. Table 3 should include references for each competing method. I did not see DeepLabV3+ cited in the paper, or it was not clearly mentioned under this abbreviation.

 

Author Response

Response to Reviewer 1 Comments

 

Point 1: In Sec. 4, it would also be interesting to see the training/inference time efficiency.

 

Response 1: In Section 4.3, the operation time of building extraction for each method has been added. Details are as follows:

To verify the extraction efficiency of the proposed method, the operation time of each method is compared in this study, as shown in Table 7. The inference time and training time of the proposed method are 11.16 s and 279.05 min, respectively, the shortest of all the compared methods. This shows that the proposed method can quickly extract the buildings in emergency shelters.

Table 7. Operation time of buildings extracted by different methods.

Time           | DeepLabv3+ | PSPNet     | ResUNet    | HRNet      | EfficientUNet+
Inference time | 16.31 s    | 13.42 s    | 15.96 s    | 32.05 s    | 11.16 s
Training time  | 362.77 min | 312.82 min | 334.77 min | 427.98 min | 279.05 min
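For readers who wish to reproduce such timings, a minimal PyTorch sketch is given below; the `model` and `loader` objects are illustrative assumptions, not the authors' code, and the absolute numbers depend on hardware and batch size.

```python
import time
import torch

@torch.no_grad()
def inference_time(model, loader, device="cuda"):
    """Total forward-pass time over a dataset (illustrative sketch)."""
    model.eval().to(device)
    if device == "cuda":
        torch.cuda.synchronize()  # flush pending GPU work before starting the timer
    start = time.perf_counter()
    for images, _ in loader:
        model(images.to(device))
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the last batch to finish on the GPU
    return time.perf_counter() - start
```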

 

Point 2: Sensitivity to the amount of training samples should be analyzed, e.g., 10%, 20%, etc., without any data augmentation.

 

Response 2: The dataset we created is small, so this study falls under few-shot learning. Theoretically, at our data scale, the more training samples there are, the better the results will be. This is also the reason we use transfer learning.

 

Point 3: Training history vs. epoch (loss vs. epoch) should be added to compare training efficiency with the competing methods.

 

Response 3: "Section 4.3 Efficiency evaluation" has been added to visualize training loss versus epochs. The specific modifications are as follows:

4.3 Efficiency evaluation

We visualize the training loss versus epoch in Figure 15. The training loss of the proposed method decreases the fastest, far outpacing the other comparison methods, which verifies its efficiency in the training phase. In addition, we report the operation time on the validation set in Table 7.

 

Figure 15. Training loss versus epochs for each method.
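A figure like Figure 15 can be produced with a few lines of matplotlib; the loss histories below are hypothetical placeholders, not the paper's recorded values.

```python
import matplotlib.pyplot as plt

# Hypothetical per-epoch training losses for each method (placeholders only).
histories = {
    "EfficientUNet+": [0.62, 0.31, 0.22, 0.17, 0.14],
    "DeepLabv3+":     [0.70, 0.45, 0.34, 0.28, 0.24],
    "HRNet":          [0.74, 0.52, 0.41, 0.34, 0.29],
}

for name, losses in histories.items():
    plt.plot(range(1, len(losses) + 1), losses, label=name)
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.legend()
plt.savefig("training_loss_vs_epoch.png", dpi=300)
```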

 

Point 4: The experimental results are not convincing; the proposed method should be compared with state-of-the-art U-Net methods, especially those that use an EfficientNet encoder.

 

Response 4: In the ablation experiments of Section 4.2, when discussing the effect of adding the scSE module (i.e., Section 4.2.1), the proposed method is already compared with the state-of-the-art UNet method that uses EfficientNet as the encoder, namely EfficientUNet; see Figure 12 and Table 4 for details. Placing the EfficientUNet results in Section 4.1 "Comparison with other methods" as well would duplicate them.

 


Figure 12. Building extraction results with or without the scSE. (a) Original image. (b) Ground truth. (c) EfficientUNet+. (d) EfficientUNet (without scSE).

Table 4. Accuracy comparison of extraction results of different decoders.

Method         | Decoder      | Precision | Recall | F1-Score | mIoU
EfficientUNet  | Without scSE | 90.81%    | 88.23% | 89.50%   | 89.54%
EfficientUNet+ | With scSE    | 93.01%    | 89.17% | 91.05%   | 90.97%

 

Point 5: Table 3 should include references for each competing method. I did not see DeepLabV3+ cited in the paper, or it was not clearly mentioned under this abbreviation.

 

Response 5: References have been added for each method in Section 4.1 and Table 3. Details are as follows:

4.1 Comparison to state-of-the-art studies

To verify whether the proposed method performs better than other state-of-the-art methods, several deep learning methods commonly used in semantic segmentation and building extraction are selected for comparison, namely DeepLabv3+, the pyramid scene parsing network (PSPNet), deep residual UNet (ResUNet), and the High-Resolution Net (HRNet). Among these methods, DeepLabv3+ introduces a decoder, which achieves accurate semantic segmentation while reducing computational complexity [53]. PSPNet extends pixel-level features with global pyramid pooling to make predictions more reliable [54]. ResUNet is a variant of the UNet structure with state-of-the-art results in road extraction [55]. HRNet maintains high-resolution representations throughout the whole process, and its effectiveness has been demonstrated in previous studies [56].

Table 3. Accuracy comparison of the extraction results of different methods.

Methods         | Precision | Recall | F1-Score | mIoU
DeepLabv3+ [53] | 90.52%    | 87.15% | 88.80%   | 88.92%
PSPNet [54]     | 76.40%    | 75.34% | 75.87%   | 78.36%
ResUNet [55]    | 88.51%    | 80.72% | 84.44%   | 85.16%
HRNet [56]      | 89.14%    | 83.43% | 86.19%   | 86.63%
EfficientUNet+  | 93.01%    | 89.17% | 91.05%   | 90.97%
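The four reported metrics follow their standard binary-segmentation definitions; below is a minimal sketch of computing them from 0/1 NumPy masks (an illustration, not the authors' evaluation code).

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Precision, Recall, F1-Score, and mIoU for binary masks (standard definitions)."""
    tp = np.sum((pred == 1) & (gt == 1))  # building pixels correctly detected
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # mIoU averages the IoU of the building class and the background class.
    miou = (tp / (tp + fp + fn) + tn / (tn + fp + fn)) / 2
    return precision, recall, f1, miou
```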

 

Supplementary references are as follows:

  1. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851.
  2. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
  3. Zhang, Z.X.; Liu, Q.J.; Wang, Y.H. Road Extraction by Deep Residual U-Net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753.
  4. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; Liu, W.; Xiao, B. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.

Author Response File: Author Response.doc

Reviewer 2 Report

The paper focuses on analyzing remote sensing images and their segmentation. This task is important due to issues like blurred boundaries, integrity, etc. For this purpose, a new neural architecture was proposed. The idea is UNet-based and is called EfficientUNet+. It contains encoder and decoder parts that embed spatial and channel squeeze & excitation. Moreover, the authors propose a joint loss function. In general, the idea of improving the UNet model is interesting and needed. However, many parts are missing, such as some details of the model and a comparison with the state of the art (beyond the models implemented by the authors). In my opinion, the paper must be improved in the following aspects:
1) The abstract is very long. Is it possible to reduce the amount of information and focus mainly on the novelty, the improved UNet model, and the results?
2) Explain why you improved UNet and not ENet.
3) The literature review is well made, but it seems to me that it was prepared 2 years ago. I must ask the authors to update these sections with the knowledge published in the last 2-3 years.
4) Remote sensing data are also difficult due to their size. How did you select the reduction rate of the images before processing them in the UNet?
5) Discuss also the possibility of using your solution in remote sensing areas like side-scan sonar. Discuss the current approaches there, like ROI analysis with neural networks.
6) You mentioned that the joint loss function was selected for blurred boundaries. How does this loss function impact this problem?
7) A bullet list of contributions should be added at the end of the first section.
8) In Fig. 2, the model is shown, but how was it modeled? There is no justification for this model.
9) Your proposition is described mainly in terms of explanation, but a mathematical model is missing.
10) Explain the feature extraction in your proposition. I think some figures from different layers and their feature maps would be good to add.
11) The experimental section should be extended to a proper comparison with the state of the art on a publicly available dataset.

Author Response

Response to Reviewer 2 Comments

 

Point 1: The abstract is very long. Is it possible to reduce the amount of information and focus mainly on the novelty, the improved UNet model, and the results?

 

Response 1: For the special scenario of emergency shelters, this study proposes an improved building extraction method, namely EfficientUNet+. Considering Beijing's important geographical location and its history of strong earthquakes, Beijing is used as a demonstration area to verify the effectiveness of the method. This study not only proposes a new method according to actual needs but also demonstrates its application. The abstract is revised as follows:

Quickly and accurately extracting buildings from remote sensing images is essential for urban planning, change detection, and disaster management applications. In particular, extracting buildings that cannot provide shelter in emergency shelters can help establish and improve a city's overall disaster prevention system. However, small building extraction often involves problems such as integrity, missed and false detection, and blurred boundaries. In this study, EfficientUNet+, an improved method for building extraction from remote sensing images based on the UNet model, is proposed. This method uses EfficientNet-b0 as the encoder and embeds spatial and channel squeeze & excitation (scSE) in the decoder to realize forward correction of features and improve the accuracy and speed of model extraction. Next, for the problem of blurred boundaries, we propose a joint loss function of building-boundary-weighted cross-entropy and Dice loss to enforce constraints on building boundaries. Finally, model pretraining is performed using the WHU aerial building dataset, which has a large amount of data, and transfer learning is used to complete the high-precision extraction of buildings with few training samples in specific scenarios. We create a Google building image dataset of emergency shelters within the Fifth Ring Road of Beijing and conduct experiments to verify the effectiveness of the method. The proposed method is compared with state-of-the-art methods, namely DeepLabv3+, PSPNet, ResUNet, and HRNet. Results show that the EfficientUNet+ method is superior in terms of Precision, Recall, F1-Score, and mean intersection over union (mIoU). Its accuracy is the highest on every index, at 93.01%, 89.17%, 91.05%, and 90.97%, respectively, indicating that the proposed method can effectively extract buildings in emergency shelters and has important reference value for guiding urban emergency evacuation.

 

Point 2: Explain why you improved UNet and not ENet.

 

Response 2: At the end of the introduction, an explanation has been added of why the UNet model was improved. Details are as follows:

Most of the above building extraction methods are applied to standard public datasets or large-scale building scenarios. They rarely involve buildings in special scenarios such as emergency shelters. The volume and footprint of buildings in emergency shelters are generally small. For such small buildings, the UNet [30] structure can effectively integrate high- and low-level features and restore fine edges, thereby reducing missed and false detections and blurred edges during building extraction. We therefore use UNet as the overall framework to design a fully convolutional neural network, namely the EfficientUNet+ method.
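For illustration, the fusion of high- and low-level features that motivates this choice reduces to an upsample-concatenate-convolve step in each decoder stage; the PyTorch block below is a generic sketch of that mechanism, not the authors' exact layers.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Generic UNet decoder step: upsample, concatenate the encoder skip, convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # restore spatial resolution
        x = torch.cat([x, skip], dim=1)  # fuse low-level (skip) and high-level features
        return self.conv(x)
```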

 

Point 3: The literature review is well made, but it seems to me that it was prepared 2 years ago. I must ask the authors to update these sections with the knowledge published in the last 2-3 years.

 

Response 3: The literature review has been reorganized to add the latest research from the past 2-3 years. The specific modifications are as follows:

The convolutional neural network (CNN) is the most widely used method for structural image classification and change detection [20]. A CNN can solve the problems caused by inaccurate, empirically designed features by eliminating the gap between different semantics; it can also learn feature representations from the data in the hierarchical structure itself [21], improving the accuracy of building extraction. Tang et al. [22] proposed using vector "capsules" to store building features: the encoder extracts the capsules from the remote sensing image, and the decoder computes the target building, which not only realizes effective building extraction but also generalizes well. Li et al. [23] used an improved faster regions with convolutional neural network (Faster R-CNN) detector in which the spectral residual method is embedded into the deep learning model to extract rural built-up areas. Chen et al. [24] used a multi-scale feature learning module in a CNN to achieve better results in extracting buildings from remote sensing images. However, CNNs require large storage space, and repeated calculations lead to low computational efficiency. Moreover, only local features can be extracted, which limits classification performance.

……

The UNet network model is one of the FCN variants. It adds skip connections between the encoder and decoder of the FCN. Through these skip connections, the decoder can receive low-level features from the encoder, retain boundary information, and fuse the high- and low-level semantic features of the network, achieving good extraction results [30]. In recent years, many image segmentation algorithms have used the UNet network as the original segmentation model and have fine-tuned and optimized it on this basis. Ye et al. [31] proposed RFN-UNet, which considers the semantic gap between features at different stages and uses an attention mechanism to bridge the gap during feature fusion; it achieves good building extraction results on public datasets. Qin et al. [32] proposed U2Net, a network structure with a two-level nested UNet, which can capture a large amount of context information and performs remarkably in change detection. Peng et al. [33] used UNet++ as the backbone extraction network and proposed a differentially enhanced dense attention CNN for detecting changes in bitemporal optical remote sensing images. To improve the spatial information perception ability of the network, Wang et al. [34] proposed B-FGC-Net, a building extraction network with prominent features, global perception, and cross-level information fusion. Wang et al. [35] combined UNet with residual learning, atrous spatial pyramid pooling, and focal loss to propose the ResUNet model for building extraction. Based on refined attention pyramid networks (RAPNets), Tian et al. [36] embedded salient multi-scale features into a convolutional block attention module to improve the accuracy of building extraction.

Supplementary references are as follows:

  1. Wang, Y.; Zeng, X.; Liao, X.; Zhuang, D. B-FGC-Net: A Building Extraction Network from High Resolution Remote Sensing Imagery. Remote Sens. 2022, 14, 269.
  2. Wang, H.; Miao, F. Building Extraction from Remote Sensing Images Using Deep Residual U-Net. Eur. J. Remote Sens. 2022, 55, 71–85.
  3. Tian, Q.; Zhao, Y.; Li, Y.; Chen, J.; Chen, X.; Qin, K. Multiscale Building Extraction with Refined Attention Pyramid Networks. IEEE Geosci. Remote Sens. Lett. 2022, 19.

 

Point 4: Remote sensing data are also difficult due to their size. How did you select the reduction rate of the images before processing them in the UNet?

 

Response 4: We do not fully understand this comment. Could you please describe its intended meaning in more detail, for example, what does "reduction rate of the images" refer to?

 

Point 5: Discuss also the possibility of using your solution in remote sensing areas like side-scan sonar. Discuss the current approaches there, like ROI analysis with neural networks.

 

Response 5: According to the suggestion, the relevant content has been supplemented in Section 5 "Conclusion". The specific modifications are as follows:

However, the model sometimes misses buildings that are obscured by trees. In the future, we will continue to optimize and improve the EfficientUNet+ method, try to extract buildings under different phenological conditions in summer and winter, and improve the accuracy and performance of building extraction from remote sensing images. The method proposed in this study is designed for optical remote sensing images. In the future, we will also try to apply it to other data sources, such as side-scan sonar, to further verify its advantages in small building extraction.

 

Point 6: You mentioned that the joint loss function was selected for blurred boundaries. How does this loss function impact this problem?

 

Response 6: According to the suggestion, the corresponding content has been supplemented in Section 2.4. Moreover, Section 4.2.2 also contains ablation experiments on the loss functions. The supplementary content in Section 2.4 is as follows:

In the plain cross-entropy function, every pixel carries equal weight, yet the boundary area of a building is the hardest to segment. We therefore weight the cross-entropy loss of the boundary area, so that during backpropagation the network is pushed to learn the boundary regions.
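As a concrete illustration of this mechanism, the sketch below combines a boundary-weighted binary cross-entropy with Dice loss in PyTorch. The `boundary_mask` input (1 on boundary pixels, 0 elsewhere) and the weighting factor are assumptions for illustration; the authors' exact formulation is the one given in Section 2.4 of the paper.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, target, boundary_mask, boundary_weight=2.0, smooth=1.0):
    """Boundary-weighted cross-entropy plus Dice loss (illustrative sketch)."""
    # Per-pixel weights: boundary pixels contribute more to the cross-entropy term,
    # so backpropagation emphasizes the hard-to-segment building edges.
    weights = 1.0 + (boundary_weight - 1.0) * boundary_mask
    bce_map = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    bce = (weights * bce_map).mean()

    # Dice loss measures region overlap and counters foreground/background imbalance.
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + smooth) / (probs.sum() + target.sum() + smooth)
    return bce + dice
```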

 

Point 7: A bullet list of contributions should be added at the end of the first section.

 

Response 7: According to the suggestion, revisions have been made at the end of the Introduction. The specific modifications are as follows:

This paper is organized as follows: Section 2 "Methods" introduces the EfficientUNet+ model overview, which includes EfficientNet-b0, the scSE module, the loss function, and transfer learning; Section 3 "Experimental results" presents the study area and data, experimental environment and parameter settings, and accuracy evaluation and experimental results of the EfficientUNet+ method; Section 4 "Discussion" validates the effectiveness of the proposed method through comparative experiments and ablation experiments; Section 5 "Conclusion" presents the main findings of this study.

 

Point 8: In Fig. 2, the model is shown, but how was it modeled? There is no justification for this model.

 

Response 8: The relevant content has been modified in Section 2.1. Details are as follows:

The UNet model is an encoder-decoder architecture, consisting of a contracting path that captures context and a symmetric expanding path that enables precise localization. It uses skip connections to fuse the high- and low-level semantic information of the network [37]. Good segmentation results can be obtained even when the training set is small. However, the original UNet model uses VGG-16 as the encoder, which has many parameters and weak feature learning ability. This study follows the UNet model framework, applies EfficientNet in the UNet encoder, and proposes a deep learning based method for extracting buildings in emergency shelters, namely EfficientUNet+. Figure 2 shows the EfficientUNet+ module structure. The emergency shelters within the Fifth Ring Road of Beijing were taken as the research area to verify the effectiveness of the method. The method is improved as follows: (1) The deep learning model used by the encoder is EfficientNet-b0, a model developed by using compound coefficients to scale the three dimensions of width, depth, and resolution, which achieves good classification accuracy with few model parameters and fast inference [38-39]. (2) The scSE is embedded in the decoder. Embedding spatial squeeze & excitation (sSE) into low-level features can emphasize salient location information and suppress background information; combining channel squeeze & excitation (cSE) with high-level features extracts salient, meaningful information [40], thereby reducing false detections of buildings. (3) The cross-entropy function is used to weight the boundary area, improving the accuracy of building boundary extraction, and it is combined with Dice loss to solve the problem of blurred boundary extraction. (4) Given the small number of samples in the study area, a transfer learning method is introduced to transfer features learned on the existing WHU aerial building dataset to the current task of extracting emergency shelter buildings within Beijing's Fifth Ring Road, thereby reducing the labor cost of acquiring new samples and further improving the accuracy of building extraction.
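For readers who want to experiment with a comparable architecture, the open-source segmentation_models_pytorch library offers an EfficientNet-b0 encoder and an scSE-attention UNet decoder out of the box; the configuration below is one plausible setup under those assumptions, not the authors' released code (and the paper pretrains on the WHU dataset rather than ImageNet).

```python
import segmentation_models_pytorch as smp

# UNet with an EfficientNet-b0 encoder and scSE attention in the decoder.
# ImageNet weights only initialize the encoder; fine-tuning (transfer learning)
# on a building dataset such as WHU would follow this step.
model = smp.Unet(
    encoder_name="efficientnet-b0",
    encoder_weights="imagenet",
    decoder_attention_type="scse",
    in_channels=3,   # RGB remote sensing imagery
    classes=1,       # binary building mask
)
```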

 

Point 9: Your proposition is described mainly in terms of explanation, but a mathematical model is missing.

 

Response 9: The model basis of this study is UNet, and the related content has been supplemented in Section 2.1. In addition, the equations supplemented in Section 2.3 are as follows:

$\hat{U}_{sSE} = \left[\sigma(q_{1,1})\,u^{1,1},\; \ldots,\; \sigma(q_{i,j})\,u^{i,j},\; \ldots,\; \sigma(q_{H,W})\,u^{H,W}\right]$                (1)

where $\hat{U}_{sSE}$ is the new feature map; $\sigma(\cdot)$ is the sigmoid activation function; $q_{i,j}$ is the linear combination of all C channels at spatial position (i, j); $u^{i,j}$ is the feature at that spatial location.

$u_s = v_s * X = \sum_{s'=1}^{C'} v_s^{s'} * x^{s'}$                (2)

where $u_s$ is the output feature map; $C'$ and $C$ are the numbers of input and output channels, respectively; $v_s^{s'}$ is the s-th two-dimensional spatial convolution kernel; $*$ denotes the convolution operation; $x^{s'}$ is the s'-th input feature map.

$z_s = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_s(i, j)$                (3)

where $z_s$ is the vector generated from $u_s$ by global average pooling (the squeeze operation); H and W are the height and width of the feature map, respectively.

$\hat{z} = W_1\left(\delta(W_2 z)\right)$                (4)

where $\hat{z}$ is the vector output by the excitation operation; $W_1 \in \mathbb{R}^{C \times \frac{C}{r}}$, $W_2 \in \mathbb{R}^{\frac{C}{r} \times C}$, and r is the scaling factor. This operation converts $z$ to $\hat{z}$ and generates a new feature map as follows:

$\hat{U}_{cSE} = \left[\sigma(\hat{z}_1)\,u_1,\; \sigma(\hat{z}_2)\,u_2,\; \ldots,\; \sigma(\hat{z}_C)\,u_C\right]$                (5)

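Equations (1)-(5) describe the concurrent spatial and channel squeeze & excitation (scSE) of Roy et al. [40]; a compact PyTorch rendering consistent with these equations (a sketch, not necessarily the authors' implementation) is:

```python
import torch.nn as nn

class SCSE(nn.Module):
    """Concurrent spatial (Eq. 1) and channel (Eqs. 3-5) squeeze & excitation."""
    def __init__(self, channels, r=16):
        super().__init__()
        # cSE: squeeze by global average pooling (Eq. 3), excite through the
        # W2 -> ReLU -> W1 bottleneck (Eq. 4), then rescale channels (Eq. 5).
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # sSE: a 1x1 convolution gives q_{i,j}, the linear combination over all
        # channels at each spatial position (Eq. 1), squashed by a sigmoid.
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, u):
        # Element-wise recalibration along channels and space, fused by addition.
        return u * self.cse(u) + u * self.sse(u)
```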
 

Point 10: Explain the feature extraction in your proposition. I think some figures from different layers and their feature maps would be good to add.

 

Response 10: In Section 3.4, the relevant content on feature maps has been supplemented. The specific content is as follows:

We further visualize the multi-scale building features extracted by the proposed model at different depths, as shown in Figure 9. From Figure 9(b)-(f), we can see that the low-resolution building features are gradually refined as the feature resolution increases. The example in column (f) of Figure 9 illustrates that the semantic information of small-scale buildings cannot be captured by the high-level features, because such buildings occupy less than one pixel at that low resolution.

 

Figure 9. Feature map visualization. (a) Sample image. (b) Depth = 1. (c) Depth = 2. (d) Depth = 3. (e) Depth = 4. (f) Depth = 5.
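Intermediate feature maps like those in Figure 9 are typically captured with forward hooks; the sketch below assumes a model whose decoder stages live in `model.decoder.blocks` (true of the segmentation_models_pytorch UNet sketched earlier, but an assumption in general), and `sample_image` is a hypothetical (1, 3, H, W) tensor.

```python
import torch

feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach().cpu()
    return hook

# Register a hook on each decoder stage so one forward pass records every depth.
for depth, block in enumerate(model.decoder.blocks, start=1):
    block.register_forward_hook(save_output(f"depth_{depth}"))

with torch.no_grad():
    model(sample_image)  # fills feature_maps with one tensor per depth

# Average over channels to obtain a single-band map for display, e.g. at depth 3.
heatmap = feature_maps["depth_3"].mean(dim=1)[0]
```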

 

Point 11: The experimental section should be extended to a proper comparison with the state of the art on a publicly available dataset.

 

Response 11: The state-of-the-art method HRNet has been added to the comparison in Section 4.1. The experimental results are as follows:

Figure 9. Partial details of the building in the emergency shelter through different methods. (a) Original image. (b) Ground truth. (c) EfficientUNet+. (d) DeepLabv3+. (e) PSPNet. (f) ResUNet. (g) HRNet.

Table 3. Accuracy comparison of the extraction results of different methods.

Methods         | Precision | Recall | F1-Score | mIoU
DeepLabv3+ [53] | 90.52%    | 87.15% | 88.80%   | 88.92%
PSPNet [54]     | 76.40%    | 75.34% | 75.87%   | 78.36%
ResUNet [55]    | 88.51%    | 80.72% | 84.44%   | 85.16%
HRNet [56]      | 89.14%    | 83.43% | 86.19%   | 86.63%
EfficientUNet+  | 93.01%    | 89.17% | 91.05%   | 90.97%


Figure 10. Accuracy comparison chart of different methods.

Author Response File: Author Response.doc

Round 2

Reviewer 1 Report

My comments are well addressed.

Reviewer 2 Report

All my comments have been addressed.
