Method of Building Detection in Optical Remote Sensing Images Based on SegFormer
Abstract
1. Introduction
2. The Building Extraction Network of Improved SegFormer
2.1. SegFormer
- ① A lightweight decoder is constructed, in which an MLP is used for feature aggregation [22].
- ② A hierarchical structure is adopted: multi-layer feature maps are obtained via a hierarchical Transformer encoder [23] without positional encoding, yielding high-resolution shallow features and low-resolution fine features at multiple scales.
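To make these two design points concrete, the following is a minimal PyTorch-style sketch of the lightweight all-MLP decoder operating on the four multi-scale feature maps produced by the hierarchical Transformer encoder. The class name, channel sizes, and embedding dimension are illustrative assumptions and do not reproduce the original SegFormer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AllMLPDecoder(nn.Module):
    """Minimal SegFormer-style decoder sketch (illustrative, not the original code).

    Each of the four encoder feature maps (strides 4, 8, 16, 32) is projected to a
    common embedding dimension with a 1x1 convolution (the per-stage "MLP"),
    upsampled to the 1/4-resolution grid, concatenated, fused, and classified.
    """

    def __init__(self, in_channels=(32, 64, 160, 256), embed_dim=256, num_classes=2):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels]
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * embed_dim, embed_dim, kernel_size=1, bias=False),
            nn.BatchNorm2d(embed_dim),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, features):
        # features: [c1, c2, c3, c4] from the hierarchical Transformer encoder
        target = features[0].shape[2:]
        ups = [
            F.interpolate(proj(f), size=target, mode="bilinear", align_corners=False)
            for proj, f in zip(self.proj, features)
        ]
        return self.classifier(self.fuse(torch.cat(ups, dim=1)))
```

Given encoder outputs at 1/4, 1/8, 1/16, and 1/32 of the input resolution, the decoder returns building/background logits at 1/4 resolution, which are bilinearly upsampled to the input size for the final prediction.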
2.2. Improvement Strategy
- (1) The structure of the detection network is optimized, and the sampling of semantic and detailed building features is reinforced, to address the poor applicability of existing models to building extraction from remote sensing images. A transposed convolutional network is coupled with the sampling module to correlate semantic feature information, fuse high- and low-level features, and cascade inter-layer receptive fields. This reduces the impact of spatial heterogeneity and complex ground-object backgrounds and strengthens the modeling of building features in the detection network. Multiple normalization activation layers are fused [24], which improves the screening efficiency of building semantic information by reducing the number of parameters, preventing gradient vanishing, and speeding up convergence. To obtain semantic building features efficiently, effective screening and classification of intra-class feature information are realized by embedding the activation layers [25]. (A minimal sketch of such a sampling block is given after this list.)
- (2) The decoding output of the model is optimized, and an atrous spatial pyramid pooling (ASPP) decoding module is used to address issues such as strong interference, loss of semantic and detailed building feature information, and missed and false detection of buildings extracted from remote sensing images. Multi-scale contextual information is explored by applying multi-rate atrous convolution [26] and multi-receptive-field convolution to the input feature map. Sufficient contextual information is captured by pooling at different resolutions, and clear target boundaries are recovered by gradually restoring spatial information. A semantic fusion module in the output part aggregates coarse information from the shallow layers and fine information from the deep layers, and channel shuffle [27] enables the interaction of feature information from different layers. The shuffle operation addresses the partial loss of building information and the lack of long-distance information, and it also helps to relieve the loss of feature information caused by resolution differences among heterogeneous data sources. Consequently, the loss of edge details of multi-scale buildings and the missed and false classification of tiny buildings can be mitigated. (Illustrative sketches of an ASPP module and of the shuffle-based fusion are given under the corresponding headings in Section 2.2.2.)
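As referenced in item (1) above, the sketch below illustrates one plausible form of the transposed-convolution sampling block with L stacked normalization activation layers. The layer count, channel sizes, and the use of BatchNorm + ReLU are assumptions for illustration; the paper's exact configuration is not reproduced here.

```python
import torch.nn as nn

class TransposedConvSampling(nn.Module):
    """Sketch of a transposed-convolution sampling block followed by L stacked
    normalization activation layers (illustrative; not the paper's exact module)."""

    def __init__(self, in_channels, out_channels, num_norm_act_layers=4):
        super().__init__()
        # Transposed convolution doubles the spatial resolution (stride 2),
        # recovering detail while correlating semantic feature information.
        self.up = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)
        # L normalization + activation layers; the type (BatchNorm + ReLU) and the
        # count are assumptions, but stacking them adds almost no parameters.
        layers = []
        for _ in range(num_norm_act_layers):
            layers += [nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True)]
        self.norm_act = nn.Sequential(*layers)

    def forward(self, deep_feat, skip_feat=None):
        x = self.up(deep_feat)
        if skip_feat is not None:
            # Fuse high-level (upsampled) and low-level (skip) feature information;
            # skip_feat is assumed to match the upsampled tensor's shape.
            x = x + skip_feat
        return self.norm_act(x)
```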
2.2.1. Sampling Module Based on the Transposed Convolutional Network
Transposed Convolution
Multiple Normalization Activation Layers
2.2.2. Spatial Pyramid Pooling Decoding Module Fused with Atrous Convolution
Atrous Convolution (Dilated Convolution)
Atrous Spatial Pyramid Pooling
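A conventional ASPP module of the kind described in Section 2.2 can be sketched as follows, here instantiated with the dilation-rate combination (1, 3, 11, 17) that performs best in the ablation of Section 3.3.3. The layout (parallel atrous branches plus an image-level pooling branch) and all channel sizes are illustrative assumptions rather than the authors' exact decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Conventional atrous spatial pyramid pooling sketch using the dilation-rate
    combination (1, 3, 11, 17) from the ablation in Section 3.3.3."""

    def __init__(self, in_ch, out_ch=256, rates=(1, 3, 11, 17)):
        super().__init__()
        self.branches = nn.ModuleList()
        for r in rates:
            # rate 1 behaves like an ordinary 3x3 convolution
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ))
        # Image-level pooling branch for global context.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
        )
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (len(rates) + 1), out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        size = x.shape[2:]
        outs = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=size,
                               mode="bilinear", align_corners=False)
        outs.append(pooled)
        return self.project(torch.cat(outs, dim=1))
```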
Semantic-Guided Fusion Module
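The following is a rough sketch of a semantic-guided fusion that aggregates shallow (coarse) and deep (fine) features and applies a channel shuffle so that information from the two sources interacts, as described in improvement (2) of Section 2.2. The projection layers, group count, and overall layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    """Interleave channels across groups so that information from different
    feature sources can interact (ShuffleNet-style shuffling)."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

class SemanticGuidedFusion(nn.Module):
    """Sketch: fuse a shallow high-resolution map with a deep low-resolution map,
    then shuffle channels so the two sources exchange information.
    out_ch is assumed to be even so the two halves concatenate evenly."""

    def __init__(self, shallow_ch, deep_ch, out_ch):
        super().__init__()
        self.shallow_proj = nn.Conv2d(shallow_ch, out_ch // 2, kernel_size=1)
        self.deep_proj = nn.Conv2d(deep_ch, out_ch // 2, kernel_size=1)
        self.post = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, shallow_feat, deep_feat):
        # Upsample the deep (fine, low-resolution) features to the shallow grid.
        deep_up = F.interpolate(self.deep_proj(deep_feat),
                                size=shallow_feat.shape[2:],
                                mode="bilinear", align_corners=False)
        fused = torch.cat([self.shallow_proj(shallow_feat), deep_up], dim=1)
        fused = channel_shuffle(fused, groups=2)  # mix the two sources' channels
        return self.post(fused)
```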
3. Experiment and Analysis
3.1. Experimental Dataset
- (1) The AISD contains four ground-object labels: buildings, roads, water systems, and vegetation; in this paper, only the building labels are used. The dataset covers areas of Berlin, Chicago, Paris, Potsdam, and Zurich at a resolution of 0.3 m and includes 1672 images of 3328 × 2816 pixels derived from Google Earth, with pixel-level labels from OpenStreetMap.
- (2) The MBD covers the city and suburban areas of Boston and contains 151 images of 1500 × 1500 pixels. Each image covers an area of 2.25 km² at a resolution of 1 m.
- (3) The WHU data used here are from the Satellite Dataset (Global Cities) subset, which includes 204 images of 512 × 512 pixels covering urban areas around the world, with a resolution of 0.3 m.
3.2. Experimental Environment and Parameter Setting
3.3. Ablation Experiment
3.3.1. Ablation Experiment with Different Mechanism Modules
- (1) Ablation experiments with different improvement mechanisms
- (2) Visualization of the training process
3.3.2. Ablation Experiment of Multiple Normalization Activation Layers
3.3.3. Dilation Rate Ablation Experiment of Atrous Spatial Pyramid Pooling Decoding Module
3.4. Comparison Experiment
- (1) The integrity of detected edges is better, and the boundary transition is more regular and smooth, so the architectural form of the building appears neat. The 1st and 2nd rows in Figure 12 show that HRNet, PSPNet, and SegFormer detect single independent buildings poorly, especially the right-angle sides of the buildings. U-Net and the improved network can accurately locate the sidelines of the buildings; nevertheless, in edge detection the improved network is more complete and handles the transition of edge details more reasonably than U-Net. Compared with SegFormer, the improved network better distinguishes the background from the building boundary, restores the regular edges of buildings more faithfully, and achieves a clear improvement in detection.
- (2) The construction of building feature information is improved, missed and false detections caused by inter-class similarity of ground objects are effectively suppressed, the multi-scale generalization capacity is strong, and the detection rate of tiny buildings is enhanced. In the 3rd and 4th rows of Figure 12, when the color of tiny buildings in the original image is similar to that of surrounding ground objects, and when there is a large scale difference between tiny buildings and the surrounding buildings, PSPNet and DeepLabv3+ miss buildings. HRNet, U-Net, and SegFormer locate some of these buildings but cannot detect them reliably. In contrast, the improved network accurately locates tiny buildings and detects their distribution range and edges.
- (3) Detailed feature information is retained more completely, and the ability to distinguish buildings from their surrounding background is strong. According to the 5th and 6th rows in Figure 12, when detecting polygonal buildings, the comparison algorithms show poor fitting of edge-texture transitions, severe loss of edge details at angles and corners, and erroneous judgment of building distribution at large scales. The improved network restores details more completely than the other networks, and its detection results differ little from the label information in distribution range and outline.
- (4) The method is still limited by feature similarity: spectral similarity causes strong interference, and the suppression of negative features is weak. When backgrounds with similar characteristics are mixed in, building classification performance is mediocre, and discrimination remains a great challenge for detection algorithms.
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhou, J.; Liu, Y.; Nie, G.; Cheng, H.; Yang, X.; Chen, X.; Gross, L. Building Extraction and Floor Area Estimation at the Village Level in Rural China via a Comprehensive Method Integrating UAV Photogrammetry and the Novel EDSANet. Remote Sens. 2022, 14, 5175.
- Wang, Y.; Cui, L.; Zhang, C.; Chen, W.; Xu, Y.; Zhang, Q. A Two-Stage Seismic Damage Assessment Method for Small, Dense, and Imbalanced Buildings in Remote Sensing Images. Remote Sens. 2022, 14, 1012.
- Degas, A.; Islam, M.R.; Hurter, C.; Barua, S.; Rahman, H.; Poudel, M.; Ruscio, D.; Ahmed, M.U.; Begum, S.; Rahman, M.A.; et al. A survey on artificial intelligence (AI) and explainable AI in air traffic management: Current trends and development with future research trajectory. Appl. Sci. 2022, 12, 1295.
- Haq, M.A.; Rehman, Z.; Ahmed, A.; Khan, M.A.R. Machine Learning-based Classification of Hyperspectral Imagery. Int. J. Comput. Sci. Netw. Secur. 2022, 22, 193–202.
- Jiang, H.; Peng, M.; Zhong, Y.; Xie, H.; Hao, Z.; Lin, J.; Ma, X.; Hu, X. A Survey on Deep Learning-Based Change Detection from High-Resolution Remote Sensing Images. Remote Sens. 2022, 14, 1552.
- Shafique, A.; Cao, G.; Khan, Z.; Asad, M.; Aslam, M. Deep learning-based change detection in remote sensing images: A review. Remote Sens. 2022, 14, 871.
- Baduge, S.K.; Thilakarathna, S.; Perera, J.S.; Arashpour, M.; Sharafi, P.; Teodosio, B.; Shringi, A.; Mendis, P. Artificial intelligence and smart vision for building and construction 4.0: Machine and deep learning methods and applications. Autom. Constr. 2022, 141, 104440.
- Han, Q.; Yin, Q.; Zheng, X.; Chen, Z. Remote sensing image building detection method based on Mask R-CNN. Complex Intell. Syst. 2022, 8, 1847–1855.
- Chen, F.; Wang, N.; Yu, B.; Wang, L. Res2-Unet, a New Deep Architecture for Building Detection from High Spatial Resolution Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1494–1501.
- Yu, M.; Chen, X.; Zhang, W.; Liu, Y. AGs-Unet: Building Extraction Model for High Resolution Remote Sensing Images Based on Attention Gates U Network. Sensors 2022, 22, 2932.
- Guo, H.; Du, B.; Zhang, L.; Su, X. A coarse-to-fine boundary refinement network for building footprint extraction from remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 183, 240–252.
- Liu, C.; Sepasgozar, S.M.; Zhang, Q.; Ge, L. A novel attention-based deep learning method for post-disaster building damage classification. Expert Syst. Appl. 2022, 202, 117268.
- Sun, Y.; Zheng, W. HRNet- and PSPNet-based multiband semantic segmentation of remote sensing images. Neural Comput. Appl. 2022, 35, 1–9.
- Yang, H.; Xu, M.; Chen, Y.; Wu, W.; Dong, W. A Postprocessing Method Based on Regions and Boundaries Using Convolutional Neural Networks and a New Dataset for Building Extraction. Remote Sens. 2022, 14, 647.
- Yoo, S.; Lee, J.; Farkoushi, M.G.; Lee, E.; Sohn, H.G. Automatic generation of land use maps using aerial orthoimages and building floor data with a Conv-Depth Block (CDB) ResU-Net architecture. Int. J. Appl. Earth Observ. Geoinf. 2022, 107, 102678.
- Luo, J.; Zhao, T.; Cao, L.; Biljecki, F. Semantic Riverscapes: Perception and evaluation of linear landscapes from oblique imagery using computer vision. Landsc. Urban Plan. 2022, 228, 104569.
- Wang, Y.; Zeng, X.; Liao, X.; Zhuang, D. B-FGC-Net: A Building Extraction Network from High Resolution Remote Sensing Imagery. Remote Sens. 2022, 14, 269.
- Pinto, G.; Wang, Z.; Roy, A.; Hong, T.; Capozzoli, A. Transfer learning for smart buildings: A critical review of algorithms, applications, and future perspectives. Adv. Appl. Energy 2022, 5, 100084.
- Zhou, Y.; Chang, H.; Lu, Y.; Lu, X. CDTNet: Improved image classification method using standard, Dilated and Transposed Convolutions. Appl. Sci. 2022, 12, 5984.
- Ahmad, I.; Qayyum, A.; Gupta, B.B.; Alassafi, M.O.; AlGhamdi, R.A. Ensemble of 2D Residual Neural Networks Integrated with Atrous Spatial Pyramid Pooling Module for Myocardium Segmentation of Left Ventricle Cardiac MRI. Mathematics 2022, 10, 627.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Zhang, J.; Li, C.; Yin, Y.; Zhang, J.; Grzegorzek, M. Applications of artificial neural networks in microorganism image analysis: A comprehensive review from conventional multilayer perceptron to popular convolutional neural network and potential visual transformer. Artif. Intell. Rev. 2022, 56, 1–58.
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2022, 54, 1–41.
- Segu, M.; Tonioni, A.; Tombari, F. Batch normalization embeddings for deep domain generalization. Pattern Recognit. 2022, 135, 109115.
- Bingham, G.; Miikkulainen, R. Discovering parametric activation functions. Neural Netw. 2022, 148, 48–65.
- Huang, Y.; Wang, Q.; Jia, W.; Lu, Y.; Li, Y.; He, X. See more than once: Kernel-sharing atrous convolution for semantic segmentation. Neurocomputing 2021, 443, 26–34.
- Cui, F.; Jiang, J. Shuffle-CDNet: A Lightweight Network for Change Detection of Bitemporal Remote-Sensing Images. Remote Sens. 2022, 14, 3548.
- Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding Convolution for Semantic Segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460.
- Abdollahi, A.; Pradhan, B.; Alamri, A.M. An ensemble architecture of deep convolutional Segnet and Unet networks for building semantic segmentation from high-resolution aerial images. Geocarto Int. 2022, 37, 3355–3370.
- Wang, L.; Fang, S.; Meng, X.; Li, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11.
- Wang, H.; Miao, F. Building extraction from remote sensing images using deep residual U-Net. Eur. J. Remote Sens. 2022, 55, 71–85.
| Dataset | Category | Number of Images | Image Size (pixels) |
|---|---|---|---|
| AISD | train | 1504 | 2611 × 2453 |
| AISD | val | 168 | 2611 × 2453 |
| AISD | test | 168 | 2611 × 2453 |
| MBD | train | 135 | 1500 × 1500 |
| MBD | val | 16 | 1500 × 1500 |
| MBD | test | 16 | 1500 × 1500 |
| WHU | train | 183 | 512 × 512 |
| WHU | val | 21 | 512 × 512 |
| WHU | test | 21 | 512 × 512 |
| Item | AISD | MBD | WHU |
|---|---|---|---|
| Input size | 2611 × 2453 | 1500 × 1500 | 512 × 512 |
| Train size | 512 × 512 | 512 × 512 | 512 × 512 |
| Test size | 512 × 512 | 512 × 512 | 512 × 512 |
| Freeze epochs | 50 | 50 | 50 |
| Unfreeze epochs | 500 | 600 | 200 |
| Batch size | 8 | 16 | 4 |
| Optimizer | SGD | SGD | SGD |
| Learning rate decay | Cosine | Cosine | Cosine |
| Weight decay | 0.0005 | 0.0005 | 0.0005 |
| Learning rate | 1.00 × 10⁻⁵ | 1.00 × 10⁻⁵ | 1.00 × 10⁻⁵ |
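The optimizer settings in the table correspond roughly to the following standard PyTorch setup; the momentum value, the stand-in model, and the freeze/unfreeze handling are assumptions added for illustration and are not specified by the table.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-in for the improved SegFormer network (placeholder only).
model = nn.Conv2d(3, 2, kernel_size=3, padding=1)

# SGD with cosine learning-rate decay, weight decay 5e-4, initial lr 1e-5
# (values from the table; momentum 0.9 is an assumed, not stated, value).
optimizer = SGD(model.parameters(), lr=1.0e-5, momentum=0.9, weight_decay=5e-4)

total_epochs = 200   # WHU setting; AISD uses 500 and MBD uses 600
freeze_epochs = 50   # backbone kept frozen during the first 50 epochs
scheduler = CosineAnnealingLR(optimizer, T_max=total_epochs)

for epoch in range(total_epochs):
    if epoch == freeze_epochs:
        for p in model.parameters():
            p.requires_grad = True   # unfreeze the backbone after 50 epochs
    # ... run one training epoch on 512 x 512 crops (batch size 4/8/16) ...
    scheduler.step()
```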
| Dataset | Network | mIoU | F1 | Precision | Recall |
|---|---|---|---|---|---|
| AISD | SegFormer | 82.17 | 90.11 | 89.93 | 90.30 |
| AISD | SegFormer + TransConv | 83.58 | 90.97 | 90.90 | 91.05 |
| AISD | SegFormer + ASPP | 83.19 | 91.30 | 91.80 | 90.80 |
| AISD | This study | 88.43 (↑6.26) | 93.81 (↑3.70) | 93.85 (↑3.92) | 93.77 (↑3.47) |
| MBD | SegFormer | 66.09 | 77.70 | 82.49 | 73.44 |
| MBD | SegFormer + TransConv | 74.23 | 85.44 | 87.35 | 83.61 |
| MBD | SegFormer + ASPP | 72.74 | 86.97 | 92.49 | 82.08 |
| MBD | This study | 75.85 (↑9.76) | 88.24 (↑10.54) | 93.07 (↑10.58) | 83.89 (↑10.45) |
| WHU | SegFormer | 79.50 | 88.23 | 88.17 | 88.31 |
| WHU | SegFormer + TransConv | 80.69 | 89.01 | 88.86 | 89.18 |
| WHU | SegFormer + ASPP | 81.23 | 91.16 | 92.95 | 89.45 |
| WHU | This study | 81.69 (↑2.19) | 91.09 (↑2.86) | 92.46 (↑4.78) | 89.76 (↑1.45) |
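The evaluation indicators reported in this and the following tables (mIoU, F1, Precision, Recall) are the standard pixel-level measures; the sketch below shows one way to compute them from a binary building mask, treating building as the positive class and averaging IoU over the building and background classes (an assumption of this illustration).

```python
import numpy as np

def binary_segmentation_metrics(pred, target):
    """Compute mIoU, precision, recall, and F1 for a binary building mask.
    `pred` and `target` are same-shaped NumPy arrays of 0/1 values."""
    pred = pred.astype(bool)
    target = target.astype(bool)

    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    tn = np.logical_and(~pred, ~target).sum()

    eps = 1e-10  # avoid division by zero
    iou_building = tp / (tp + fp + fn + eps)
    iou_background = tn / (tn + fp + fn + eps)
    miou = (iou_building + iou_background) / 2

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"mIoU": miou, "Precision": precision, "Recall": recall, "F1": f1}
```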
(L denotes the number of normalization activation layers.)

| Evaluation Indicator (%) | L = 0 | L = 1 | L = 2 | L = 3 | L = 4 | Improvement (L = 4 vs. L = 0) |
|---|---|---|---|---|---|---|
| mIoU | 72.74 | 74.02 | 74.78 | 75.42 | 75.85 | 3.11 |
| F1 | 86.97 | 87.02 | 87.52 | 87.54 | 88.24 | 1.27 |
| Precision | 92.49 | 92.50 | 92.74 | 93.11 | 93.07 | 0.58 |
| Recall | 82.08 | 82.16 | 82.87 | 82.60 | 83.89 | 1.81 |
| Dilation Rate Combination | mIoU |
|---|---|
| 1, 2, 2, 2 | 71.92 |
| 1, 3, 6, 9 | 72.25 |
| 1, 2, 5, 11 | 72.96 |
| 1, 6, 12, 18 | 73.43 |
| 1, 5, 11, 17 | 73.70 |
| 1, 3, 11, 17 | 75.85 |
| Dataset | Evaluation Indicator (%) | HRNet | PSPNet | U-Net | DeepLabv3+ | SegFormer | Method of This Paper | Optimal Promotion (%) |
|---|---|---|---|---|---|---|---|---|
| AISD | mIoU | 77.17 | 70.75 | 80.39 | 76.37 | 82.17 | 88.43 | 17.68 |
| AISD | F1 | 86.95 | 82.56 | 89.00 | 86.42 | 90.11 | 93.81 | 11.25 |
| AISD | Precision | 86.74 | 82.68 | 88.94 | 86.26 | 89.93 | 93.85 | 11.17 |
| AISD | Recall | 87.17 | 82.45 | 89.07 | 86.59 | 90.30 | 93.77 | 11.32 |
| MBD | mIoU | 55.93 | 45.41 | 73.58 | 53.52 | 66.09 | 75.85 | 30.44 |
| MBD | F1 | 68.71 | 60.50 | 83.68 | 67.64 | 77.70 | 88.24 | 27.74 |
| MBD | Precision | 76.10 | 73.05 | 86.92 | 78.15 | 82.49 | 93.07 | 20.02 |
| MBD | Recall | 62.62 | 51.63 | 80.67 | 59.62 | 73.44 | 83.89 | 32.26 |
| WHU | mIoU | 75.32 | 66.43 | 77.10 | 72.22 | 79.50 | 81.69 | 15.26 |
| WHU | F1 | 85.41 | 78.87 | 86.67 | 83.21 | 88.24 | 91.09 | 12.22 |
| WHU | Precision | 85.62 | 80.60 | 87.84 | 84.14 | 88.17 | 92.46 | 11.86 |
| WHU | Recall | 85.20 | 77.22 | 85.54 | 82.30 | 88.31 | 89.76 | 12.54 |
| Network | GFLOPs | Parameters (M) | Computation Time (s/epoch) | mIoU |
|---|---|---|---|---|
| HRNet | 79.539 | 29.538 | 14 | 75.32 |
| PSPNet | 5.873 | 2.377 | 7 | 66.43 |
| U-Net | 451.94 | 24.891 | 20 | 77.10 |
| DeepLabv3+ | 52.846 | 5.818 | 7 | 72.22 |
| SegFormer | 13.674 | 3.72 | 15 | 79.50 |
| Method of this paper | 27.17 | 3.903 | 18 | 81.69 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).