AGs-Unet: Building Extraction Model for High Resolution Remote Sensing Images Based on Attention Gates U Network
Abstract
1. Introduction
2. Materials and Methods
2.1. AGs-Unet Model Framework
- (1) The encoder is composed of four convolution blocks and extracts features at different levels using global and local context information (x_i denotes the feature map of layer i). Each convolution block consists of two convolutional (Conv) layers, two batch normalization (BN) layers, and two rectified linear unit (ReLU) activation layers. A max-pooling layer then takes the maximum value over each local region of the feature map to build a downsampled feature map, which reduces the number of parameters in the model and helps prevent overfitting (a code sketch of one encoder block follows this list);
- (2) The converter is composed of the fifth convolution block and the AGs placed in the four skip connections. The fifth convolution block includes Conv, Maxpool, and BN layers and abstracts the encoder feature map to the highest level: the number of channels is increased to 1024, the spatial size is reduced to 16 × 16, and the most abstract, highest-dimensional feature map is obtained. The AGs in the skip connections retain the feature points in the low-dimensional feature maps that are useful for extracting buildings while suppressing irrelevant features. The four AGs extract effective features at four grid-dimension levels, from low to high. The converter connects the corresponding encoder and decoder feature maps and, to some extent, alleviates the vanishing gradient during back-propagation;
- (3) The decoder adjusts the number of channels of each feature map by convolution and enlarges the fused feature map by up-sampling. Starting from the deepest high-dimensional feature map, each decoder stage fuses the up-sampled features with the encoder feature map of the matching size, progressively restoring the original input resolution (a sketch of one decoder stage also follows this list).
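As a concrete illustration of one encoder block, the following is a minimal sketch assuming PyTorch; the class and variable names (ConvBlock, in_ch, out_ch) are illustrative and are not taken from the authors' code.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two Conv-BN-ReLU layers, as in each encoder block of Section 2.1."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

# One encoder stage: convolution block followed by 2 x 2 max pooling,
# which halves the spatial size (cf. Blocks 1-4 in the architecture table).
pool = nn.MaxPool2d(kernel_size=2, stride=2)
x  = torch.randn(1, 3, 256, 256)   # input image, 256 x 256 x 3
x1 = ConvBlock(3, 64)(x)           # 256 x 256 x 64
p1 = pool(x1)                      # 128 x 128 x 64
```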
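Likewise, a minimal sketch of one decoder stage, assuming PyTorch and reusing the ConvBlock class from the sketch above; UpBlock and its parameters are illustrative names, not the authors' implementation. In AGs-Unet the skip feature would first pass through an attention gate (see the AG sketch in Section 2.2.2 below).

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Upsample, reduce channels, concatenate the (gated) skip feature, fuse."""
    def __init__(self, in_ch: int, out_ch: int):  # in_ch = 2 * out_ch in U-Net
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        self.reduce = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.fuse = ConvBlock(in_ch, out_ch)  # ConvBlock from the encoder sketch

    def forward(self, g: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        g = self.reduce(self.up(g))                    # e.g., 16x16x1024 -> 32x32x512
        return self.fuse(torch.cat([skip, g], dim=1))  # concat, then two Conv-BN-ReLU

# Example matching Block 6 of the architecture table:
up4 = UpBlock(1024, 512)
out = up4(torch.randn(1, 1024, 16, 16),   # deepest feature from Conv5
          torch.randn(1, 512, 32, 32))    # gated skip feature from Conv4
# out.shape == (1, 512, 32, 32)
```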
2.2. Grid-Based AGs Module under U-Net Architecture
2.2.1. Attention Mechanism
2.2.2. AG Module
2.2.3. U-Net Architecture with AGs Module
3. Experimental Datasets and Evaluation
3.1. Dataset
3.1.1. The WHU Dataset
3.1.2. The INRIA Aerial Image Labeling Dataset
3.2. Evaluation Criterion
3.2.1. Evaluation Metrics
3.2.2. Model Complexity
4. Experiment
4.1. Experimental Settings
4.2. Experimental Results
4.2.1. Results of Qualitative Analysis
4.2.2. Results of Quantitative Analysis
5. Discussion
5.1. Model Comparison
5.1.1. Comparison of Prediction Results
5.1.2. Model Complexity Comparison
5.2. Ablation Experiment
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
| Block | Type | Filter | Channel Size | Output Size |
| --- | --- | --- | --- | --- |
| Input | | | | 256 × 256 × 3 |
| 1 | Conv1 | (3, 3) | 3 → 64 | 256 × 256 × 64 |
| | Maxpool1 | (2, 2) | 64 → 64 | 128 × 128 × 64 |
| 2 | Conv2 | (3, 3) | 64 → 128 | 128 × 128 × 128 |
| | Maxpool2 | (2, 2) | 128 → 128 | 64 × 64 × 128 |
| 3 | Conv3 | (3, 3) | 128 → 256 | 64 × 64 × 256 |
| | Maxpool3 | (2, 2) | 256 → 256 | 32 × 32 × 256 |
| 4 | Conv4 | (3, 3) | 256 → 512 | 32 × 32 × 512 |
| | Maxpool4 | (2, 2) | 512 → 512 | 16 × 16 × 512 |
| 5 | Conv5 | (3, 3) | 512 → 1024 | 16 × 16 × 1024 |
| 6 | Up_conv4 | Conv-(2, 2) | 1024 → 512 | 32 × 32 × 512 |
| | AGs4 | | 512 → 512 | 32 × 32 × 512 |
| | Up4 | Up-(3, 3) | 512 → 512 | 32 × 32 × 512 |
| 7 | Up_conv3 | Conv-(2, 2) | 512 → 256 | 64 × 64 × 256 |
| | AGs3 | | 256 → 256 | 64 × 64 × 256 |
| | Up3 | Up-(3, 3) | 256 → 256 | 64 × 64 × 256 |
| 8 | Up_conv2 | Conv-(2, 2) | 256 → 128 | 128 × 128 × 128 |
| | AGs2 | | 128 → 128 | 128 × 128 × 128 |
| | Up2 | Up-(3, 3) | 128 → 128 | 128 × 128 × 128 |
| 9 | Up_conv1 | Conv-(2, 2) | 128 → 64 | 256 × 256 × 64 |
| | AGs1 | | 64 → 64 | 256 × 256 × 64 |
| | Up1 | Up-(3, 3) | 64 → 64 | 256 × 256 × 64 |
| 10 | Conv_1 × 1 | (1, 1) | 64 → 1 | 256 × 256 × 1 |
| | Sigmoid | | | 256 × 256 × 1 |
| Conv Block | Layers | Filter | Channel Size |
| --- | --- | --- | --- |
| W_g | Conv + BN | (1, 1) | F_g → F_int |
| W_x | Conv + BN | (1, 1) | F_l → F_int |
| Activation | ReLU | W_g + W_x | — |
| Ps | Conv + BN | (1, 1) | F_int → 1 |
| Pout | BN | F_int × Ps | 1 → F_int |
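Reading the table rows as the standard additive attention gate of Attention U-Net, a minimal sketch assuming PyTorch follows. Interpreting "Ps" as the single-channel ψ (psi) convolution and simplifying the final "Pout" step to an element-wise re-weighting of the skip feature are our assumptions; F_int is a free hyperparameter (often half of F_l).

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate: alpha = sigmoid(Ps(ReLU(W_g g + W_x x)))."""
    def __init__(self, F_g: int, F_l: int, F_int: int):
        super().__init__()
        # W_g: (1, 1) Conv + BN on the gating signal g (coarser decoder feature)
        self.W_g = nn.Sequential(nn.Conv2d(F_g, F_int, kernel_size=1),
                                 nn.BatchNorm2d(F_int))
        # W_x: (1, 1) Conv + BN on the skip-connection feature x
        self.W_x = nn.Sequential(nn.Conv2d(F_l, F_int, kernel_size=1),
                                 nn.BatchNorm2d(F_int))
        # Ps: (1, 1) Conv + BN squeezing F_int channels to a 1-channel map
        self.psi = nn.Sequential(nn.Conv2d(F_int, 1, kernel_size=1),
                                 nn.BatchNorm2d(1),
                                 nn.Sigmoid())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, g: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # g and x share the same spatial size here (g comes after Up_conv)
        alpha = self.psi(self.relu(self.W_g(g) + self.W_x(x)))  # map in [0, 1]
        return x * alpha  # re-weight the skip feature (simplified Pout step)

# Example matching AGs4 in the architecture table (F_int = 256 is an assumption):
ag4 = AttentionGate(F_g=512, F_l=512, F_int=256)
gated = ag4(torch.randn(1, 512, 32, 32), torch.randn(1, 512, 32, 32))
# gated.shape == (1, 512, 32, 32)
```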
| Groups | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- |
| U-Net Accuracy | 0.759 | 0.961 | 0.921 | 0.977 |
| AGs-Unet Accuracy | 0.819 | 0.963 | 0.937 | 0.973 |
| U-Net IoU | 0.709 | 0.827 | 0.802 | 0.944 |
| AGs-Unet IoU | 0.715 | 0.840 | 0.883 | 0.946 |
| Model | OA (WHU) | Precision (WHU) | IoU (WHU) | OA (INRIA) | Precision (INRIA) | IoU (INRIA) |
| --- | --- | --- | --- | --- | --- | --- |
| SegNet | 0.944 | 0.856 | 0.775 | 0.888 | 0.791 | 0.413 |
| FCN8s | 0.948 | 0.870 | 0.776 | 0.838 | 0.703 | 0.484 |
| DANet | 0.952 | 0.922 | 0.790 | 0.929 | 0.839 | 0.676 |
| U-Net | 0.967 | 0.931 | 0.843 | 0.916 | 0.862 | 0.671 |
| PISANet | 0.962 | 0.935 | 0.853 | 0.906 | 0.885 | 0.651 |
| ARC-Net | 0.952 | 0.855 | 0.793 | 0.921 | 0.835 | 0.679 |
| AGs-Unet | 0.969 | 0.937 | 0.855 | 0.919 | 0.907 | 0.682 |
| Model | Parameters (M) | Computation (GMac) | WHU Training Time (s/Epoch) | INRIA Training Time (s/Epoch) |
| --- | --- | --- | --- | --- |
| SegNet | 16.31 | 23.77 | 222 | 69 |
| FCN8s | 134.27 | 62.81 | 393 | 74 |
| DANet | 49.48 | 10.93 | 138 | 70 |
| U-Net | 13.4 | 23.77 | 212 | 66 |
| PISANet | 11.03 | 23.89 | 294 | 70 |
| ARC-Net | 16.19 | 16.6 | 193 | 69 |
| AGs-Unet | 34.88 | 51.02 | 316 | 72 |
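For reference, the Parameters (M) column can be reproduced with a few lines of PyTorch. The GMac unit matches the output style of the ptflops package, but which profiler the authors actually used is an assumption here.

```python
import torch.nn as nn

def count_parameters_millions(model: nn.Module) -> float:
    """Trainable parameters in millions, as in the Parameters (M) column."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# MACs (GMac) can be estimated with an external profiler, e.g. ptflops:
#   from ptflops import get_model_complexity_info
#   macs, params = get_model_complexity_info(model, (3, 256, 256), as_strings=True)
```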
| Model | OA | Precision | IoU |
| --- | --- | --- | --- |
| U-Net | 0.9713 | 0.9689 | 0.8358 |
| AGs-Unet-2 | 0.9778 | 0.9251 | 0.8402 |
| AGs-Unet | 0.9781 | 0.9051 | 0.8438 |
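The figures in the tables above are pixel-level metrics. Below is a minimal sketch, assuming binary NumPy masks, of how OA, precision, and IoU are commonly computed; it is not the paper's own evaluation code.

```python
import numpy as np

def binary_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9):
    """OA, precision, and IoU for binary building masks, from pixel counts."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred,  gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    oa        = (tp + tn) / (tp + tn + fp + fn + eps)  # overall accuracy
    precision = tp / (tp + fp + eps)
    iou       = tp / (tp + fp + fn + eps)              # intersection over union
    return oa, precision, iou
```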