1. Introduction
Buildings, as essential places for people’s daily life and production, are crucial carriers of urban culture, history, and modernization, and an integral component of basic urban geographic information. At the same time, the rational layout of buildings has a crucial impact on urban development, environmental construction, and people’s lives, while construction, demolition, renovation, and expansion activities reflect the growth potential of an area [1,2,3]. Therefore, accurate building data play a crucial role in urban planning and development [4,5], land use change detection [6], national defense construction [7], disaster prevention and mitigation [8], environmental protection [9], and other fields.
With the rapid development of high-resolution sensor technology and equipment, the acquisition of remote sensing images (RSIs) has become more flexible and efficient, and spectral and spatial resolutions have further improved. This means richer information, sharper detail in the imagery, and a high-quality data source for more accurate building extraction [10]. However, extracting building information from high-resolution RSIs also makes image interpretation more challenging. Because of the large amount of information and broad coverage of high-resolution RSIs, as well as the diversity of buildings in spectral characteristics, scale, and morphology, accurately and efficiently extracting buildings from them remains a vital research direction [11,12]. More automated, accurate, and efficient interpretation methods are therefore needed, and the rapid growth of high-resolution remote sensing image data is also driving a shift toward a data-intensive scientific paradigm in Earth observation research.
Deep learning can automatically extract the required features and uncover deeper information about objects. As an inheritance and development of traditional machine learning, it brings new solutions to semantic segmentation tasks for building extraction in remote sensing images [13]. Notably, the classic U-shaped network (U-Net) [14] and residual network (ResNet) [15] have achieved significant segmentation results. Inspired by their structures, many scholars have improved or deepened them to construct deep convolutional neural networks (DCNNs) that extract deeper building features. Jin et al. [16] combined dense atrous spatial pyramid pooling (DenseASPP) [17] with the U-Net structure to build the bilateral attention refinement network (BARNet), which refines the perception of building boundaries. Xu et al. [18] replaced the U-Net encoder with ResNeXt101 and combined it with a feature pyramid to fuse multi-scale features, improving building segmentation accuracy on small-sample datasets. Yu et al. [19] added a recurrent residual deformable convolution unit to the U-Net structure and blended enhanced multi-head self-attention into the skip connections to improve extraction detail for complex buildings. Aryal et al. [20] incorporated multi-scale feature maps with a partial feature pyramid network into the U-Net framework to achieve higher precision and robustness in building extraction. DCNNs have become the mainstream method for automatic building extraction [21]. Still, their small, fixed field of view focuses on local context information and limits feature extraction for buildings in the complex backgrounds of high-resolution remote sensing images.
In recent years, the transformer architecture, which captures global context information and long-range dependencies between pixels, has provided new technical support for building extraction. The Swin transformer [22] performs attention operations and information sharing within shifted windows. Networks such as the segmentation transformer (SETR) [23], the semantic segmentation transformer SegFormer [24], and the Unet-like pure transformer (Swin-Unet) [25] adopt a pure transformer structure and have achieved good segmentation results. However, while transformers capture global information, they cannot entirely replace the local feature extraction strengths of convolutional neural networks (CNNs) [26]. The UNet-like transformer (UNetFormer) [27] uses a lightweight ResNet18 encoder combined with a transformer-based decoder to extract both global and local features. Zhang et al. [28] used the Swin transformer as the encoder and constructed a DCNN decoder to improve segmentation of building boundaries in very high-resolution RSIs. Wang et al. [29] proposed a multiscale transformer with the convolutional block attention module (MTCNet), which combines CBAM [30] and the transformer to improve segmentation performance for buildings in RSIs. He et al. [31] proposed the Swin transformer-assisted UNet (ST-UNet) for feature extraction in RSIs, which uses U-Net as the primary encoder and the Swin transformer as an auxiliary encoder; a designed relational aggregation module (RAM) guides the primary encoder with channel relationships, improving the network’s global modeling capability. Li et al. [32] added multiple transposed convolution sampling modules to SegFormer to enhance local and long-range building detail. Xia et al. [33] constructed a dual-stream feature extraction encoder based on ResNet34 [15] and the Swin transformer, aggregating features at each network stage and effectively enhancing building extraction. However, despite these advantages, transformers remain computationally complex, and their performance may be unsatisfactory when the training dataset is small. Moreover, because of the complex background information in high-resolution RSIs, extracted buildings may resemble and adjoin surrounding roads, so false positives and false negatives must be mitigated. Further research that leverages the local feature extraction of DCNNs together with the global information-capturing capability of transformers is therefore of significant importance for building extraction in high-resolution remote sensing images.
The effectiveness of deep learning networks for building extraction largely depends on the diversity and quality of the dataset, and differences in building forms and distributions across regions further increase the difficulty of semantic segmentation. Commonly used open-source building datasets include the aerial image segmentation dataset [34], the WHU building dataset [35], the INRIA aerial image dataset [36], and the Massachusetts buildings dataset [37]. These datasets were obtained through aerial photography and have high resolution, e.g., 0.075 m for the aerial image segmentation dataset and 0.3 m for the WHU building dataset, and they cover diverse architectural styles across Europe, East Asia, the United States, and New Zealand. However, current building datasets contain few images of architectural styles within China, and most provide only three bands (red (R), green (G), and blue (B)). Further exploration of image spectral information is still needed to complement semantic segmentation networks for more accurate building extraction.
To address the above issues and improve the extraction of multi-scale buildings across different data sources, this paper explores combining a transformer with a DCNN and further mining the spectral information of high-resolution RSIs. Based on the U-Net architecture, we employ a parallel Swin transformer and ResNet encoder to extract building features at various scales, exploiting their complementary local and global extraction strengths. To improve recognition of locally blurry features in remote sensing images, we introduce the DenseASPP module in the decoding process and combine it with the feature information obtained from the skip connections. Attention mechanisms are also introduced in the skip connections at different network stages to capture and retain the positional and semantic feature information of the downsampled feature maps. In addition, we construct the spectral information enhancement module (SIEM) to enhance the spectral information of RSIs and further improve building extraction accuracy. We selected part of Xi’an City, Shaanxi Province, China, as the study area and used Gaofen-2 (GF-2) satellite images to build the Xi’an building dataset. The study area contains buildings of various forms, distributions, and scales, yielding a dataset with the characteristic architectural style of Xi’an. The images include R, G, B, and near-infrared (NIR) bands, providing data support for various spectral processing methods.
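The parallel-encoder idea can be illustrated with a minimal numpy sketch: a CNN branch and a transformer branch each produce a feature map at the same resolution, and the two are concatenated along the channel axis and projected back with a 1×1-convolution-style matrix. The function and shapes below are illustrative assumptions, not the paper’s actual fusion operator.

```python
import numpy as np

def fuse_branches(local_feat, global_feat, w_proj):
    """Concatenate CNN (local) and transformer (global) feature maps along
    the channel axis, then project with a 1x1-convolution-style matrix.
    Shapes: both branches (C, H, W); w_proj is (C_out, 2*C)."""
    stacked = np.concatenate([local_feat, global_feat], axis=0)  # (2C, H, W)
    c2, h, w = stacked.shape
    flat = stacked.reshape(c2, h * w)   # each pixel becomes a 2C-vector
    fused = w_proj @ flat               # a 1x1 conv is a per-pixel matmul
    return fused.reshape(-1, h, w)

# Toy example: two 4-channel 8x8 feature maps fused back to 4 channels.
rng = np.random.default_rng(0)
local_feat = rng.standard_normal((4, 8, 8))
global_feat = rng.standard_normal((4, 8, 8))
w_proj = rng.standard_normal((4, 8))
out = fuse_branches(local_feat, global_feat, w_proj)
print(out.shape)  # (4, 8, 8)
```

In a real network the projection would be a learned convolution and the branch outputs would be aligned per encoder stage; the sketch only shows why the two branches can be combined without losing either representation.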
The contributions of this paper mainly include the following aspects:
- (1)
We used high-resolution satellite images from GF-2 to construct the Xi’an building dataset, which includes complex background information and features various building forms, distributions, and scales. This dataset enriches the diversity of datasets used for building extraction and presents more challenges for building extraction networks.
- (2)
We designed an effective building extraction network, MARS-Net, to improve the extraction performance of buildings with different architectural characteristics in different regions. We compared MARS-Net with other building extraction methods on our self-built Xi’an building dataset and the WHU building dataset, and conducted ablation experiments, demonstrating the effectiveness and generalization of the proposed network in this study.
- (3)
Using the Xi’an building dataset, we propose the spectral information enhancement module (SIEM) to strengthen the relationships between the bands of high-resolution remote sensing images and supply them with reinforced building shape information. Experiments demonstrate that this module effectively enhances the extraction of buildings in complex backgrounds by semantic segmentation models.
The rest of this paper is organized as follows:
Section 2 mainly introduces the design of the proposed method.
Section 3 presents the building dataset constructed for this study, the comparative datasets used in the experiments, the primary parameter settings, and the selection of evaluation metrics. It reports the results of the network’s comparative and ablation experiments and conducts experimental analysis of the SIEM module after validating the effectiveness of the network.
Section 4 discusses the work carried out in this paper.
Section 5 summarizes the entire paper and looks forward to future work.
4. Discussion
The scale and morphology of buildings vary due to natural, cultural, and social development. At the same time, the segmentation of buildings is also affected by the different image resolutions and label quality of the dataset, leading to a certain degree of fluctuation in the segmentation results.
Comparing the characteristics of the two experimental datasets, the WHU aerial building dataset has a higher image resolution than the Xi’an satellite building dataset, and both contain a large proportion of small buildings. However, due to geographic variation, small buildings in the WHU dataset tend to be roughly square, as shown in the first three images in Figure 11, whereas small buildings in the Xi’an dataset tend to be long strips and more densely distributed, as shown in the first three images in Figure 10. Influenced by these dataset characteristics, the training accuracy of each model in the comparison experiments is generally better on the WHU dataset than on the Xi’an dataset. As shown in Table 1 and Table 2, ResUNet++, despite its deeper structure and higher accuracy elsewhere, performs worse than U-Net on the Xi’an dataset and is less stable than the other DCNN networks. On both experimental datasets, the transformer-based networks achieve higher evaluated accuracy than the DCNN networks and perform more stably; among them, the proposed MARS-Net performs best. Based on the dataset characteristics analyzed above, this may be because the buildings in the Xi’an dataset are mostly long strips, a morphology that plays to the transformer’s long-range extraction advantage and enables transformer-like networks to maintain good extraction on the Xi’an dataset despite its slightly lower image resolution.
In addition, compared with the other networks in the experiments, MARS-Net also segments the densely distributed buildings in both experimental datasets better, as shown in Figure 10 and Figure 11. From a multi-scale perspective, the richness of spatial and semantic information varies between deep and shallow layers during feature learning. Shallow features contain more spatial information but, because of their limited learning depth, lack rich semantic information. Deep features, after multiple rounds of object feature learning, carry more accurate semantic information; however, repeated downsampling loses spatial location information, leading to misclassification and omission of details such as boundaries, corners, and interiors for small buildings and dense complexes. The proposed MARS-Net combines the local extraction advantages of the DCNN with the global learning of the transformer. It bridges the encoder and decoder at different positions with the CA and CBAM modules to reconcile the trade-off between deep and shallow feature acquisition, effectively collecting spatial features from the shallow layers and semantic features from the deep layers. Meanwhile, DenseASPP enlarges the receptive field during feature map restoration in the decoder to obtain semantic feature information at a larger scale. These structures play an essential role in the network and make it more effective at segmenting buildings of various scales, forms, and distributions.
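The receptive-field gain from DenseASPP in the decoder can be illustrated with simple arithmetic: each stacked dilated 3×3 convolution with dilation d adds (k − 1)·d = 2d pixels to the receptive field. The dilation rates (3, 6, 12, 18, 24) below are those reported in the original DenseASPP paper; the rates used in MARS-Net may differ.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of sequentially stacked dilated convolutions:
    each layer with dilation d and kernel k adds (k - 1) * d pixels."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

# A single atrous conv with rate 24 sees far less context than the
# densely cascaded rate sequence from the DenseASPP paper.
print(receptive_field([24]))                 # 49
print(receptive_field([3, 6, 12, 18, 24]))   # 127
```

This is why the dense cascade captures larger-scale semantic context than any single atrous convolution of the same maximum rate.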
The land cover information in high-resolution RSIs is intricate and contains redundant object details, and the similar appearance of roads and rooftops significantly affects the interpretation of buildings. However, high-resolution RSIs also have high spectral resolution. By mining their spectral information, this paper proposes SIEM, which processes the maximum weight ratio results of each image band with the MBI to enhance the spectral information of the image and further improve the network’s building segmentation. We also considered fusing the high-weight-ratio bands with MNF at the final stage of SIEM and then stacking the result with the MBI to assess accuracy in MARS-Net, as shown in the last row of Table 6. In Table 6, our SIEM offers a more significant performance improvement than the test in the previous row, possibly because it fuses the building morphological feature information from the MBI across all bands.
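The band-ratio-plus-MBI idea can be sketched schematically in numpy. The weighting scheme and the final stacking below are illustrative assumptions only; the paper’s exact SIEM formulation is not reproduced here, and `siem_sketch` is a hypothetical helper.

```python
import numpy as np

def siem_sketch(bands, mbi):
    """Schematic spectral enhancement: weight each band by its share of the
    total per-pixel radiance (a 'band ratio'), keep the maximum-weight
    response per pixel, and stack it with the morphological building index
    (MBI). Illustrates the idea only; the actual SIEM differs."""
    stack = np.stack(bands)                       # (B, H, W)
    weights = stack / (stack.sum(axis=0) + 1e-8)  # per-pixel band ratios
    max_ratio = weights.max(axis=0)               # maximum-weight ratio map
    return np.stack([*bands, max_ratio * mbi])    # enhanced input tensor

# Toy 16x16 R, G, B, NIR bands plus an MBI map -> a 5-channel input.
rng = np.random.default_rng(1)
r, g, b, nir = (rng.random((16, 16)) for _ in range(4))
mbi = rng.random((16, 16))
enhanced = siem_sketch([r, g, b, nir], mbi)
print(enhanced.shape)  # (5, 16, 16)
```

The point of the sketch is that the enhancement happens entirely before the network: the segmentation model simply receives extra channels carrying spectral-ratio and building-morphology cues.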
While the proposed method achieves good building segmentation results, some limitations remain. For example, influenced by geographical location, culture, and the level of social development, the land object information in high-resolution RSIs varies across regions, so spectral information enhancement methods may need corresponding adjustment. We also quantitatively compared the FLOPs and Params of the experimental networks. As shown in Table 7, the two metrics of MARS-Net are generally lower than those of U-Net and ResUNet++ but higher than those of Swin-Unet and UNetFormer. A further comparison in Table 7 shows that the two metrics of the baseline structure are very close to those of our network; we speculate that using both ResNet and Swin transformer structures to build the parallel encoder leads to MARS-Net’s higher FLOPs and Params. In future research, we will further adjust and lighten the parallel encoding structure and continue exploring universal spectral enhancement methods to adapt to the application scenarios of various high-resolution RSIs. We will also carry out targeted model improvement and design according to spectrally enhanced image features; for example, better combining the semantic segmentation model with the spectral enhancement method may improve the model’s resistance to interference such as shadows and spectrally similar ground objects, and more comprehensively enhance the network’s building segmentation capability.
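The cost of a dual-branch encoder can be reasoned about with the standard per-layer formulas for convolutions. The layer sizes below are purely illustrative, not taken from MARS-Net.

```python
def conv2d_cost(c_in, c_out, k, h, w):
    """Parameter count and multiply-accumulate (MAC) count for one
    k x k convolution over an h x w feature map (stride 1, with bias)."""
    params = (k * k * c_in + 1) * c_out
    macs = k * k * c_in * c_out * h * w
    return params, macs

# Illustrative only: a 3x3 conv, 64 -> 128 channels, on a 128x128 map.
params, macs = conv2d_cost(64, 128, 3, 128, 128)
print(params)       # 73856
print(macs / 1e9)   # ~1.21 GMac

# Two parallel encoder branches roughly double such per-stage costs,
# consistent with the heavier FLOPs/Params of a dual-encoder design.
```

Counting stage by stage in this way makes it easy to see where a lightened parallel encoder would recover the most FLOPs and Params.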
5. Conclusions
Existing building extraction methods often suffer from complex land cover information and the diverse shapes and distributions of buildings, so segmentation results exhibit similarity and adjacency between buildings and surrounding roads, and the extraction of building interiors and boundaries is incomplete and inaccurate. First, we constructed the Xi’an building dataset from imagery within the third ring road of Xi’an, Shaanxi Province, China, which possesses the characteristic features of local buildings. Owing to variations in building shapes, distributions, and image resolution, this dataset poses greater challenges for semantic segmentation networks. Second, we proposed a deep learning network called MARS-Net, which uses ResNet50 and Swin-T to construct parallel encoders, leveraging the deep local feature extraction of the DCNN and the global contextual feature extraction of the transformer. CA and CBAM modules were introduced at different network depths to bridge the encoder and decoder, preserving richer spatial and semantic feature information of the buildings. The DenseASPP module was added during decoding to enhance the network’s extraction of multi-scale building edge features. Beyond that, we designed the SIEM module, which enhances the spectral information of the images by processing the MBI and band ratio results computed from the R, G, B, and NIR bands of the RSIs, further improving the network’s segmentation accuracy. Finally, we conducted performance analysis experiments on MARS-Net and SIEM.
Comparative and ablation experiments against U-Net, ResUNet++, Deeplab v3+, Swin-Unet, and UNetFormer on the Xi’an building dataset and the WHU building dataset show that MARS-Net achieves better segmentation of multi-scale buildings with different distribution characteristics, stronger resistance to interference from similar spectral features, and higher accuracy evaluation metrics. Processing the Xi’an building dataset images with the SIEM module and repeating the experiments with MARS-Net shows that MARS-Net with SIEM extracts multi-scale building cluster information more effectively under phenomena such as similar spectral information and complex backgrounds, yielding clearer boundaries and further improvements across the accuracy evaluation metrics.