Multi-Scale Frequency-Spatial Domain Attention Fusion Network for Building Extraction in Remote Sensing Images
Abstract
1. Introduction
- Frequency-Spatial Domain Attention Fusion Module (FSAFM): We integrate a frequency-domain component into the attention module, enhancing the detection of features of interest within the feature maps. By combining frequency- and spatial-domain feature enhancement, the module concentrates the model’s attention on the most critical features and yields a more sensitive perception of boundary information.
- Attention-Guided Multi-Scale Fusion Upsampling Module (AGMUM): The module consists of two parts: (1) Multi-Scale Feature Fusion (MSF), which integrates attention feature maps from different depths, improving the model’s grasp of contextual information and its ability to capture boundary details; and (2) Attention Upsampling (AU), which uses frequency-spatial fusion attention to compensate for the information lost during upsampling.
- Multi-Scale Frequency-Spatial Domain Attention Fusion Network (MFSANet): Building on the modules above, we propose the MFSANet architecture, which integrates rich frequency-domain information while preserving spatial-domain details. The network operates in two phases: the first applies dual-domain attention to refine feature extraction and direct the network’s focus toward salient features; the second combines multi-scale feature concatenation with attention-guided upsampling so that the network perceives multi-scale information effectively.
2. Related Works
2.1. CNN-Based Semantic Segmentation
2.2. Transformer-Based Semantic Segmentation
2.3. Learning in Frequency Domain
3. Method
3.1. Multi-Scale Frequency-Spatial Domain Attention Fusion Network (MFSANet)
Algorithm 1: Segmentation Model Based on Dual-Domain Enhancement
Input: 512 × 512 remote sensing images
Output: Building segmentation map
1: Use VGG16 to extract feature maps at five levels from the input image
2: Apply FSAFM to the first four levels of feature maps for frequency-spatial domain attention enhancement
3: Apply MSF to fuse the enhanced multi-scale features
4: Apply attention-guided upsampling:
   a. Skip connections
   b. Apply FSAFM again to the concatenated features
5: Use the segmentation head to produce the final segmentation map
6: Output the building segmentation map
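A minimal PyTorch-style sketch of Algorithm 1 is given below, assuming the torchvision VGG16 backbone split into five stages. The decoder wiring and channel widths are our assumptions rather than the paper’s specification, and for brevity the multi-scale fusion of step 3 is folded into the skip-connection decoder of step 4; `make_attn` stands in for the FSAFM module sketched in Section 3.2.

```python
# Hypothetical sketch of Algorithm 1; decoder wiring is our assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class MFSANetSketch(nn.Module):
    def __init__(self, make_attn, num_classes=1):
        super().__init__()
        feats = vgg16(weights=None).features
        # Step 1: five VGG16 stages -> features at 1/1, 1/2, 1/4, 1/8, 1/16 scale.
        self.stages = nn.ModuleList(
            [feats[:4], feats[4:9], feats[9:16], feats[16:23], feats[23:30]])
        chans = [64, 128, 256, 512, 512]
        # Step 2: dual-domain attention on the first four encoder levels.
        self.enc_attn = nn.ModuleList([make_attn(c) for c in chans[:4]])
        # Step 4: decoder convs that merge each skip with the upsampled path.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(chans[i] + chans[i + 1], chans[i], 3, padding=1)
             for i in range(4)])
        self.dec_attn = nn.ModuleList([make_attn(c) for c in chans[:4]])
        self.head = nn.Conv2d(chans[0], num_classes, 1)  # step 5

    def forward(self, x):
        skips = []
        for stage in self.stages:
            x = stage(x)
            skips.append(x)
        enhanced = [a(f) for a, f in zip(self.enc_attn, skips[:4])]
        y = skips[4]
        for i in range(3, -1, -1):  # deepest to shallowest level
            y = F.interpolate(y, size=enhanced[i].shape[2:],
                              mode='bilinear', align_corners=False)
            y = self.reduce[i](torch.cat([enhanced[i], y], 1))  # step 4a
            y = self.dec_attn[i](y)                             # step 4b
        return self.head(y)  # steps 5-6: building segmentation map
```

For instance, `MFSANetSketch(make_attn=FSAFM)`, with the FSAFM sketch from Section 3.2, maps a (1, 3, 512, 512) input tensor to a (1, 1, 512, 512) building logit map.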
3.2. Frequency-Spatial Domain Attention Fusion Module (FSAFM)
Algorithm 2: Frequency-Spatial Domain Attention Fusion Module
Input: Multi-level feature maps
Output: Fusion-attention-enhanced feature map
1: for each feature map derived from the different tiers of the backbone network do
2:   Frequency-domain attention component:
3:     Split the feature map into n parts along the channel dimension
4:     Assign a 2D-DCT frequency component to each part
5:     Output the 2D-DCT-enhanced feature map
6:   Spatial-domain attention component:
7:     Refine the feature map with average and max pooling
8:     Concatenate the two pooled maps
9:     Apply the resulting spatial weight map to the original feature map to obtain a spatial-attention-enhanced feature map
10:  Add the frequency-domain-enhanced feature map and the spatial attention map pixel-wise
11: end for
12: Output the feature map Z refined by frequency-spatial domain attention
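The following PyTorch sketch reflects our reading of Algorithm 2: the frequency branch builds FcaNet-style [33] channel attention from a small set of 2D-DCT components (steps 3–5), the spatial branch implements the average/max-pooling scheme of steps 7–9, and the two enhanced maps are added pixel-wise (step 10). The pooled grid size `hw`, the particular frequency set `freqs`, and the channel-reduction ratio of 16 are our assumptions, not values from the paper.

```python
# Hedged sketch of FSAFM; hyperparameters are our assumptions.
import math
import torch
import torch.nn as nn

def dct_filter(h, w, u, v):
    """2D-DCT basis function for frequency pair (u, v) on an h x w grid."""
    t = torch.zeros(h, w)
    for i in range(h):
        for j in range(w):
            t[i, j] = (math.cos(math.pi * (i + 0.5) * u / h) *
                       math.cos(math.pi * (j + 0.5) * v / w))
    return t

class FSAFM(nn.Module):
    def __init__(self, channels, hw=16, freqs=((0, 0), (0, 1), (1, 0), (1, 1))):
        super().__init__()
        n = len(freqs)
        assert channels % n == 0
        # Steps 3-4: one DCT basis per channel group.
        weight = torch.zeros(channels, hw, hw)
        step = channels // n
        for k, (u, v) in enumerate(freqs):
            weight[k * step:(k + 1) * step] = dct_filter(hw, hw, u, v)
        self.register_buffer('dct_weight', weight)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 16), nn.ReLU(inplace=True),
            nn.Linear(channels // 16, channels), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.hw = hw

    def forward(self, x):
        b, c, h, w = x.shape
        # Frequency branch: pool to a fixed grid, project onto DCT bases.
        xp = nn.functional.adaptive_avg_pool2d(x, self.hw)
        freq = (xp * self.dct_weight).sum(dim=(2, 3))        # (b, c)
        x_freq = x * self.fc(freq).view(b, c, 1, 1)          # step 5
        # Spatial branch: avg + max over channels -> weight map (steps 7-9).
        s = torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True).values], dim=1)
        x_spat = x * self.spatial(s)
        return x_freq + x_spat                               # step 10
```

With `hw = 16`, the module is resolution-independent: feature maps of any spatial size are pooled to a fixed 16 × 16 grid before projection onto the DCT bases, so the same module can serve every encoder and decoder level.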
3.2.1. Spatial Domain Attention Enhancement
3.2.2. Frequency Domain Attention Enhancement
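For reference, the frequency components in Algorithm 2 follow the standard 2D discrete cosine transform. For a single channel $x \in \mathbb{R}^{H \times W}$, the component at frequency pair $(u, v)$ is

$$ f_{u,v} = \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} x_{i,j} \cos\!\left(\frac{\pi u}{H}\left(i + \tfrac{1}{2}\right)\right) \cos\!\left(\frac{\pi v}{W}\left(j + \tfrac{1}{2}\right)\right), $$

so the DC term $f_{0,0}$ reduces to global average pooling up to a constant factor, while non-zero $(u, v)$ pairs retain the higher-frequency information (e.g., building boundaries) that GAP-only channel attention discards; this is the observation FcaNet [33] builds on.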
3.3. Attention-Guided Multi-Scale Fusion Upsampling Module (AGMUM)
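A speculative sketch of AGMUM’s two parts follows, under our reading of this section: MSF projects the four FSAFM-enhanced maps to a common width, resizes them to the finest resolution, and fuses them by concatenation; AU interleaves bilinear upsampling with frequency-spatial attention to compensate for the loss incurred during upsampling. The 1 × 1 projections, the fused width `out_ch`, and bilinear interpolation are our assumptions.

```python
# Hedged sketch of AGMUM's two parts (MSF and AU); details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSF(nn.Module):
    def __init__(self, in_chans=(64, 128, 256, 512), out_ch=64):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, out_ch, kernel_size=1) for c in in_chans])
        self.fuse = nn.Conv2d(out_ch * len(in_chans), out_ch, 3, padding=1)

    def forward(self, feats):
        # Bring every level to the finest spatial size, then concatenate.
        size = feats[0].shape[2:]
        up = [F.interpolate(p(f), size=size, mode='bilinear',
                            align_corners=False)
              for p, f in zip(self.proj, feats)]
        return self.fuse(torch.cat(up, dim=1))

class AttentionUpsample(nn.Module):
    def __init__(self, attn):
        super().__init__()
        self.attn = attn  # dual-domain attention compensates upsampling loss

    def forward(self, x, scale=2):
        x = F.interpolate(x, scale_factor=scale, mode='bilinear',
                          align_corners=False)
        return self.attn(x)
```

An FSAFM instance (Section 3.2) is the natural choice for the `attn` argument of `AttentionUpsample`.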
4. Experiments
4.1. Data Sets and Hardware Environment
- WHU building data set [37]. The data set comprises aerial and satellite imagery; we use the aerial subset to validate the model’s effectiveness. It contains 187,000 buildings spread over more than 450 square kilometers, in 8189 tiles of 512 × 512 pixels. We use 4736 images for training (about 58% of the data set), 1036 for validation (about 13%), and 2416 for testing (about 29%).
- Inria aerial image data set [38]. The data set covers five cities, with 360 remote sensing images of 5000 × 5000 pixels each. We selected 180 of these images and cropped them into 512 × 512 patches, which were then allocated to training (12,600 images, 70%), validation (2700 images, 15%), and test (2700 images, 15%) sets.
4.2. Evaluation Metrics
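The tables in the following subsections report IoU, Recall, Precision, and F1, all in percent. We assume the standard pixel-wise definitions over the building class, with TP, FP, and FN counted per pixel:

$$ \mathrm{IoU} = \frac{TP}{TP + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$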
4.3. Experiment Analysis
4.3.1. Quantitative Comparison Results
4.3.2. Qualitative Results
4.4. Ablation Study
4.4.1. Quantitative Comparison Results
4.4.2. Qualitative Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Huang, X.; Cao, Y.X.; Li, J.Y. An automatic change detection method for monitoring newly constructed building areas using time-series multi-view high-resolution optical satellite images. Remote Sens. Environ. 2020, 244, 111802.
2. Chen, Z.; Wei, Y.; Shi, K.; Zhao, Z.; Wang, C.; Wu, B.; Qiu, B.; Yu, B. The potential of nighttime light remote sensing data to evaluate the development of digital economy: A case study of China at the city level. Comput. Environ. Urban Syst. 2022, 92, 101749.
3. Bai, H.; Li, Z.W.; Guo, H.L.; Chen, H.P.; Luo, P.P. Urban Green Space Planning Based on Remote Sensing and Geographic Information Systems. Remote Sens. 2022, 14, 4213.
4. Sakellariou, S.; Sfougaris, A.I.; Christopoulou, O.; Tampekis, S. Integrated wildfire risk assessment of natural and anthropogenic ecosystems based on simulation modeling and remotely sensed data fusion. Int. J. Disaster Risk Reduct. 2022, 78, 103129.
5. Jiang, X.; Zhang, X.; Xin, Q.; Xi, X.; Zhang, P. Arbitrary-Shaped Building Boundary-Aware Detection with Pixel Aggregation Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 2699–2710.
6. Ok, A.O. Automated detection of buildings from single VHR multispectral images using shadow information and graph cuts. ISPRS J. Photogramm. Remote Sens. 2013, 86, 21–40.
7. Guo, H.; Du, B.; Zhang, L.; Su, X. A coarse-to-fine boundary refinement network for building footprint extraction from remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 183, 240–252.
8. Shao, P.; Shi, W.; Liu, Z.; Dong, T. Unsupervised Change Detection Using Fuzzy Topology-Based Majority Voting. Remote Sens. 2021, 13, 3171.
9. You, S.; Liu, Y.; Lei, B.; Wang, S. Fine Perceptive GANs for Brain MR Image Super-Resolution in Wavelet Domain. arXiv 2020.
10. Chen, H.; Yokoya, N.; Chini, M. Fourier domain structural relationship analysis for unsupervised multimodal change detection. ISPRS J. Photogramm. Remote Sens. 2023, 198, 99–114.
11. Yu, B.; Yang, A.; Chen, F.; Wang, N.; Wang, L. SNNFD, spiking neural segmentation network in frequency domain using high spatial resolution images for building extraction. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102930.
12. Sun, H.; Luo, Z.; Ren, D.; Du, B.; Chang, L.; Wan, J. Unsupervised multi-branch network with high-frequency enhancement for image dehazing. Pattern Recognit. 2024, 156, 110763.
13. Xu, K.; Qin, M.; Sun, F.; Wang, Y.; Chen, Y.-K.; Ren, F. Learning in the Frequency Domain. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1737–1746.
14. Gupta, A.; Mahobiya, C. Analysis of Image Compression Algorithm Using DCT. Int. J. Sci. Technol. Eng. 2016, 3, 121–127.
15. Chen, Z.; Liu, T.; Xu, X.; Leng, J.; Chen, Z. DCTC: Fast and Accurate Contour-Based Instance Segmentation with DCT Encoding for High-Resolution Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8697–8709.
16. Zhang, C.; Lam, K.-M.; Wang, Q. CoF-Net: A Progressive Coarse-to-Fine Framework for Object Detection in Remote-Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5600617.
17. Zheng, J.; Shao, A.; Yan, Y.; Wu, J.; Zhang, M. Remote Sensing Semantic Segmentation via Boundary Supervision-Aided Multiscale Channelwise Cross Attention Network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4405814.
18. Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A Fully Convolutional Neural Network for Automatic Building Extraction From High-Resolution Remote Sensing Images. Remote Sens. 2020, 12, 1050.
19. Inglada, J. Automatic recognition of man-made objects in high resolution optical remote sensing images by SVM classification of geometric image features. ISPRS J. Photogramm. Remote Sens. 2007, 62, 236–248.
20. Li, E.; Femiani, J.C.; Xu, S.; Zhang, X.; Wonka, P. Robust Rooftop Extraction From Visible Band Images Using Higher Order CRF. IEEE Trans. Geosci. Remote Sens. 2015, 53, 4483–4495.
21. Du, S.; Zhang, F.; Zhang, X. Semantic classification of urban buildings combining VHR image and GIS data: An improved random forest approach. ISPRS J. Photogramm. Remote Sens. 2015, 105, 107–119.
22. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597.
23. Chen, F.; Wang, N.; Yu, B.; Wang, L. Res2-Unet, a New Deep Architecture for Building Detection From High Spatial Resolution Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1494–1501.
24. Ali, S.; Lee, Y.R.; Park, S.Y.; Tak, W.Y.; Jung, S.K. Towards Efficient and Accurate CT Segmentation via Edge-Preserving Probabilistic Downsampling. arXiv 2024, arXiv:2404.03991.
25. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
26. Zhou, X.; Wei, X. Feature Aggregation Network for Building Extraction from High-resolution Remote Sensing Images. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Jakarta, Indonesia, 15–19 November 2023; pp. 105–116.
27. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Álvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Neural Information Processing Systems, Online, 6–14 December 2021.
28. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In Proceedings of the ECCV Workshops, Montreal, QC, Canada, 11 October 2021.
29. Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-high-resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408820.
30. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
31. Dong, B.; Wang, P.; Wang, F. Head-Free Lightweight Semantic Segmentation with Linear Transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 516–524.
32. Huang, J.; Guan, D.; Xiao, A.; Lu, S. FSDR: Frequency Space Domain Randomization for Domain Generalization. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6887–6898.
33. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 763–772.
34. Zhu, Y.; Fan, L.; Li, Q.; Chang, J. Multi-Scale Discrete Cosine Transform Network for Building Change Detection in Very-High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 5243.
35. Fan, J.; Li, J.; Liu, Y.; Zhang, F. Frequency-aware robust multidimensional information fusion framework for remote sensing image segmentation. Eng. Appl. Artif. Intell. 2024, 129, 107638.
36. Zhang, J.; Shao, M.; Wan, Y.; Meng, L.; Cao, X.; Wang, S. Boundary-Aware Spatial and Frequency Dual-Domain Transformer for Remote Sensing Urban Images Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5637718.
37. Ji, S.P.; Wei, S.Q.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586.
38. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 3226–3229.
39. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
40. Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-Resolution Representations for Labeling Pixels and Regions. arXiv 2019, arXiv:1904.04514.
41. Liu, Z.Y.; Shi, Q.; Ou, J.P. LCS: A Collaborative Optimization Framework of Vector Extraction and Semantic Segmentation for Building Extraction. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5632615.
42. Wang, L.B.; Fang, S.H.; Meng, X.L.; Li, R. Building Extraction With Vision Transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625711.
43. Jiang, W.X.; Chen, Y.; Wang, X.F.; Kang, M.L.; Wang, M.Y.; Zhang, X.J.; Xu, L.X.; Zhang, C. Multi-branch reverse attention semantic segmentation network for building extraction. Egypt. J. Remote Sens. Space Sci. 2024, 27, 10–17.
Table: Quantitative comparison with other methods on the WHU building data set (all metrics in %).

| Method | IoU | Recall | Precision | F1 |
|---|---|---|---|---|
| UNet [22] | 88.28 | 93.81 | 94.74 | 93.77 |
| SegNet [39] | 85.35 | 91.31 | 92.65 | 91.97 |
| HRNet [40] | 86.51 | 92.67 | 93.66 | 93.16 |
| UNetFormer [30] | 87.33 | 92.84 | 93.64 | 93.24 |
| LCS [41] | 90.71 | 94.86 | 95.38 | 95.12 |
| BuildFormer [42] | 90.73 | 95.14 | 95.15 | 95.14 |
| MRANet [43] | 90.59 | 95.22 | 94.90 | 95.06 |
| Ours | 91.01 | 95.12 | 95.47 | 95.29 |
Table: Quantitative comparison with other methods on the Inria aerial image data set (all metrics in %).

| Method | IoU | Recall | Precision | F1 |
|---|---|---|---|---|
| UNet | 74.40 | 84.28 | 86.39 | 85.32 |
| SegNet | 72.00 | 82.33 | 84.69 | 83.49 |
| HRNet | 75.03 | 84.92 | 86.56 | 85.73 |
| LCS | 78.82 | 86.77 | 89.58 | 88.15 |
| BuildFormer | 81.24 | 88.78 | 90.65 | 89.71 |
| MRANet | 81.79 | 90.72 | 89.26 | 89.99 |
| Ours | 82.45 | 90.09 | 90.68 | 90.38 |
Table: Ablation study (all metrics in %); the upper block ablates the proposed modules on the baseline, and the lower block transfers them to SegNet.

| Method | IoU | Recall | Precision | F1 |
|---|---|---|---|---|
| Baseline | 88.28 | 93.81 | 94.74 | 93.77 |
| Baseline + FSAFM | 89.64 | 94.31 | 94.76 | 94.53 |
| Baseline + AGMUM | 90.51 | 94.66 | 95.38 | 95.01 |
| Baseline + FSAFM + AGMUM | 91.01 | 95.12 | 95.47 | 95.29 |
| SegNet | 82.83 | 87.59 | 93.90 | 90.60 |
| SegNet + FSAFM | 83.67 | 89.09 | 93.22 | 91.10 |
| SegNet + AGMUM | 84.08 | 89.73 | 93.03 | 91.35 |
Table: Model complexity and resource consumption of the ablation variants.

| Method | FLOPs (G) | Parameters (M) | Runtime (min:s) | Memory (MiB) |
|---|---|---|---|---|
| Baseline | 225.66 | 24.89 | 4:34 | 10,165 |
| Baseline + FSAFM | 305.42 | 28.00 | 6:48 | 14,033 |
| Baseline + AGMUM | 364.79 | 33.72 | 7:50 | 14,725 |
| Baseline + FSAFM + AGMUM | 364.85 | 33.76 | 8:01 | 15,267 |
| SegNet | 160.83 | 29.44 | 4:30 | 13,227 |
| SegNet + FSAFM | 160.89 | 29.52 | 5:18 | 21,499 |
| SegNet + AGMUM | 175.67 | 35.21 | 5:26 | 13,281 |