1. Introduction
Buildings play a crucial role in human life, acting as a direct reflection of the socio-economic development of a region. The accurate extraction of building areas from high-resolution remote sensing imagery (HRRSI) is essential to provide fundamental surface information, which supports urban planning, disaster management, and socio-economic development [1,2].
With the continuous advancement of satellite remote sensing technology, the imaging quality of remote sensing imagery, especially in terms of resolution and precision, has been significantly enhanced. Remote sensing imagery has evolved from meter-level high resolution (HR) to sub-meter very high resolution (VHR), with ongoing efforts to further improve spatial resolution.
Manual image interpretation is both time-consuming and labor-intensive, making the automatic or semi-automatic extraction of buildings from satellite remote sensing images increasingly important. Remote sensing images often exhibit phenomena such as “same object, different spectra” or “different objects, same spectra”. Moreover, the improvement in the spatial resolution of remote sensing imagery has resulted in richer ground feature information, presenting greater challenges for building extraction [3,4]. Buildings in HRRSI exhibit more complex color and texture patterns, which makes it difficult for basic neural network models to extract them with high precision.
With the rapid development of deep learning algorithms in computer vision, building extraction from HRRSI has transitioned from traditional methods to deep learning-based approaches. Traditional methods typically rely on manual and expert interpretation of unique building features, such as spectral information [5], spatial information [6], texture features [7,8], and morphological building indices (MBI) [9]. However, these feature-based methods heavily depend on manual feature design and implementation, which often vary with different application domains. Furthermore, the diversity in building structures, spectra, scales, and distributions complicates the extraction task, making these methods unsuitable for diverse scenarios and datasets.
In recent years, deep learning technologies have significantly advanced building extraction from remote sensing imagery, making it the mainstream approach. The Convolutional Neural Network (CNN), popularized by Krizhevsky et al. [10], revolutionized the field of computer vision by enabling the automatic learning of contextual features within images through multiple convolutional layers. CNNs have shown great potential in object detection within remote sensing imagery. Building upon this, CNN architectures have evolved; notably, the Fully Convolutional Network (FCN) [11] replaces fully connected layers with convolutional layers to handle input images of arbitrary sizes and generates output feature maps of the same dimensions, enabling pixel-level semantic segmentation. This innovation has made FCNs highly effective for building extraction. However, the feature maps generated by FCNs often suffer from blurriness and smoothing due to deconvolutional upsampling, which affects the clarity of fine details. To address this, several improvements have been proposed. For example, Cui Weihong et al. [12] employed upsampling techniques from SegNet to overcome the roughness caused by deconvolutional upsampling, while Shrestha et al. [13] combined post-processing with Conditional Random Fields (CRFs) to further refine edge detection and enhance boundary precision. Recent studies like UANet [14] demonstrate that uncertainty quantification can improve model robustness, but such mechanisms are absent in most existing frameworks.
Recent studies reveal three persistent challenges in state-of-the-art methods: (1) ineffective fusion of multi-source data due to modality gaps [15]; (2) limited capability in modeling irregular geometric characteristics of building structures; (3) boundary ambiguity caused by insufficient edge-sensitive learning mechanisms [16]. Additionally, uncertainty propagation from sensor noise and environmental factors remains understudied, as shown in [17]. U-Net [18] introduced an encoder–decoder architecture with skip connections that effectively merges multi-scale features, achieving significant results in restoring building edges and shapes. However, U-Net and its variants (U2-Net [19], MA-Unet [20], etc.) remain limited in handling multi-source inputs [21,22]. DANet [23] demonstrates superior spatial attention but may struggle with segmentation performance when dealing with very small objects or noisy images, as it tends to overly focus on local details and faces challenges in attention distribution within local regions. HRNet [24] maintains high-resolution features yet fails to capture topological dependencies between distant building components. DeepLab v3 [25,26,27] achieves multi-scale perception but produces blurred boundaries due to dilated convolution artifacts. While Swin-Unet [28] introduces transformer-based global modeling, its quadratic complexity hinders practical deployment on large-scale RS imagery.
In addition, attention mechanisms [29] and multi-task frameworks [30] have been widely adopted. Yu et al. [31] integrated Attention Gates (AGs) in U-Net to effectively suppress noise and highlight the edges and fine details of small buildings, while Hong et al. [32] proposed a multi-task learning framework incorporating the Swin Transformer, enabling simultaneous building extraction and change detection. This multi-task framework efficiently handles tasks such as building detection, edge extraction, and shape restoration, addressing issues related to detail loss caused by insufficient image resolution.
In addition to the aforementioned convolutional network-based methods, transformer-based models have also been employed in building extraction from remote sensing imagery. Building extraction with vision transformers [22] achieves state-of-the-art performance on standard benchmarks. STT (Sparse Token Transformer) [15] is an efficient dual-path transformer architecture that learns spatial and channel-wise long-range dependencies to capture the spatial location information of building edges, achieving high accuracy on benchmark datasets. MAP-Net [33] leverages multiple parallel paths to learn multi-scale features with preserved spatial location information, achieving the precise segmentation of building contours at different scales through the adaptive squeezing of features extracted from each path. While these transformer-based structures have made progress in preserving edge details and restoring building shapes, their complex architecture and computational cost present challenges in achieving an optimal balance between computational efficiency and accuracy.
In practical applications, numerous post-processing techniques have been proposed to further optimize the boundary and detail quality of building extraction. For instance, Shrestha et al. [13] used Conditional Random Fields (CRFs) and Exponential Linear Units (ELUs) to enhance edge stability, while Alshehhi et al. [34] employed a patch-based CNN architecture combined with post-processing techniques to integrate low-level features from adjacent regions, reducing misclassification. However, post-processing effectiveness often depends on the accuracy of the initial segmentation and cannot fundamentally solve issues with incomplete building boundary information.
In conclusion, deep learning methods have made significant progress in building extraction from remote sensing images, particularly for high-resolution datasets. However, current encoder–decoder architectures still face challenges in accurately restoring building shape features and detecting edges. The integration of multi-source data fusion, attention mechanisms, and transformer models provides promising solutions, but fully utilizing building detail features in high-resolution remote sensing data to achieve more accurate and efficient building extraction remains a key research direction.
In building extraction tasks, Digital Surface Model (DSM) data serve as a crucial auxiliary data source [35], commonly used to provide ground elevation information. The DSM reflects the height of all objects on the surface, including buildings, trees, and other structures, in contrast to the Digital Terrain Model (DTM), which only represents the elevation of the bare ground. In HRRSI, buildings often exhibit minimal spectral and geometric differences from the surrounding environment, which increases the difficulty of extraction. By incorporating DSM data, height information can be provided for buildings, enabling effective differentiation between buildings and other land cover types, particularly in complex urban environments.
This study proposes MMRAD-Net, a high-precision model for building extraction from HRRSI that remains effective for complex building scenes and shapes, such as those in GF-7 imagery. The model is designed to address the unique challenges posed by high-resolution images by incorporating DSM data. It effectively integrates multi-layer features, ensuring a strong representation of both fine-grained details and global semantic information.
The key contributions of this study are as follows:
We introduce MMRAD-Net, a model based on the encoder–decoder structure, which integrates the innovative GCN OA-SWinT Dense Module (GSTDM) and the Res DualAttention Dense Fusion Block (R-DDFB), along with the novel Hybrid Loss. By employing hierarchical feature modeling and a refined cross-scale fusion mechanism, the model enhances building extraction capabilities in complex environments.
By combining the texture and morphological features of HRRSI with elevation information from DSM data, we employ a multi-source fusion approach that significantly enhances building extraction performance.
A comparison with other models demonstrates that MMRAD-Net achieves superior extraction accuracy in boundary delineation, detail preservation, and adaptability to complex scenes.
We conducted generalization experiments by applying the model trained on the GF-7 Dataset to the WHU Building Dataset, which verified that our model performs well across different resolutions, spectral features, and geographic environments.
2. Methods
2.1. Model Overview
This study proposes an innovative building extraction model for remote sensing imagery, MMRAD-Net, aimed at addressing the challenges of multi-scale feature extraction, complex building boundary representation, and detail recovery in HRRSI segmentation tasks. In addition to inheriting the symmetric encoder–decoder structure of U-Net, MMRAD-Net incorporates several innovative designs. To overcome the limitations of the classic U-Net in global relationship modeling, feature fusion, and complex scene segmentation, MMRAD-Net combines multi-source data, including Digital Orthophoto Map (DOM) and Digital Surface Model (DSM), and introduces two core modules: the GCN OA-SWinT Dense Module (GSTDM) and the Res DualAttention Dense Fusion Block (R-DDFB). By carefully designing the encoder–decoder structure and optimizing the loss function, the model significantly improves the accuracy and robustness of building extraction. The model architecture is shown in Figure 1.
In the design of MMRAD-Net, the encoder is responsible for progressively extracting features and downsampling, ultimately passing the features to the bottleneck layer for global information integration and feature refinement. This stage enhances feature extraction capability through the introduction of residual blocks (ResBlocks), ensuring the adequate representation of multi-scale features. Additionally, the design of the bottleneck layer with GSTDM facilitates multi-module collaboration, enabling the model to effectively integrate global structure understanding, contextual relationship modeling, and feature fusion from coarse to fine details.
The decoder portion restores details and strengthens semantic information using the DANet attention mechanism and Dense Block. Notably, we propose the R-DDFB, which combines spatial and channel dual attention mechanisms with deep residual learning on top of the traditional U-Net skip connections. This significantly enhances the model’s ability to capture building shapes, boundaries, and details in complex scenes.
To further enhance the model’s robustness, this paper also proposes a Hybrid Loss function tailored to building boundaries. The Hybrid Loss combines Binary Cross-Entropy (BCE) loss, Dice Loss, and an edge-sensitive term, ensuring that the model maintains global segmentation accuracy while strengthening the extraction of building details and boundary information.
The key innovation of MMRAD-Net lies in its effective fusion of multi-source data (DOM + DSM), and its multi-layer, multi-module collaborative design addresses the challenges of multi-scale, multi-morphological, and complex boundary extraction in HRRSI. This enables the model to achieve more accurate segmentation results, particularly in large-scale remote sensing data processing, while maintaining both high efficiency and high precision.
2.2. GCN OA-SwinT Dense Module (GSTDM)
Our network is designed to address the challenges of multi-scale variability, diverse morphological structures, and complex boundary extraction in HRRSI for building segmentation. Traditional U-Net-based architectures face significant limitations in capturing global contextual relationships and intricate building boundaries. Building contours in remote sensing imagery often exhibit irregular geometric characteristics (e.g., branching structures and discontinuous edges) [36,37], which pose challenges for traditional convolution operations based on regular grids. Graph Convolutional Networks (GCNs) [38] can effectively address this issue through topological modeling using adjacency matrices. However, standalone GCN-based local graph modeling fails to perceive the overall spatial distribution of building clusters. Additionally, the bottleneck in encoder–decoder structures often leads to the loss of fine details in small-scale buildings.
To mitigate these challenges, we propose an innovative module—GSTDM. Its design strictly adheres to a hierarchical feature learning paradigm. From a data flow perspective, GSTDM can be regarded as a multi-stage feature processing pipeline. The sequential feature processing in GSTDM follows a structured approach: spectral graph convolution-based topological modeling, windowed self-attention for global relationship constraints, and densely connected layers for multi-scale feature retention.
From a functional complementarity standpoint, GCN captures local topology (e.g., building corner connectivity) at shallow layers, while OA-SwinT models global layout patterns (e.g., spatial distribution of building clusters) at deeper layers. This hierarchical representation ensures a more comprehensive understanding of urban structures, enhancing both global context modeling and fine-grained boundary delineation.
The input to the network passes through the encoder of U-Net, generating low-resolution feature maps with high semantic levels. After these feature maps enter the GSTDM, they are first processed by a GCN. At this stage, due to the limited receptive field of convolution operations, the global structure and complex boundaries of buildings cannot be effectively captured. However, GCNs overcome this limitation by modeling the structure through graphs. The structure of the GCN Block consists of two graph convolution layers, each outputting 64 feature channels with ReLU activation. The feature update process in GCN can be expressed as follows:
$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\,\tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I$ is the adjacency matrix with self-loops, $\tilde{D}$ is the degree matrix, $H^{(l)}$ is the feature matrix at the $l$-th layer, and $W^{(l)}$ is the learnable weight matrix at the $l$-th layer. $H^{(l+1)}$ represents the feature matrix at the $(l+1)$-th layer, which is computed by applying the graph convolution operation to the feature matrix $H^{(l)}$ from the previous layer. $\sigma(\cdot)$ denotes the ReLU activation function. Each pixel in the feature map is mapped as a graph node, and the adjacency matrix $\tilde{A}$ dynamically encodes the non-Euclidean spatial relationships between nodes (such as connections across gaps or fractured regions). Through the normalization operation $\tilde{D}^{-\frac{1}{2}}\tilde{A}\,\tilde{D}^{-\frac{1}{2}}$, GCN enables the adaptive learning of complex topologies, effectively capturing the sharp corners and discontinuous edges of building contours.
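As an illustration, the propagation rule above can be written compactly in PyTorch. The sketch below assumes the bottleneck feature map has already been flattened into an N × C node-feature matrix and that a binary adjacency matrix over the pixels is available; the two-layer, 64-channel configuration follows the GCN Block described above, while tensor shapes and the adjacency construction are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One spectral graph convolution: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        # adj: (N, N) binary adjacency; add self-loops and normalize symmetrically
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a_tilde.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return F.relu(self.weight(a_norm @ h))

class GCNBlock(nn.Module):
    """Two graph convolution layers, each outputting 64 feature channels (as in GSTDM)."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.gc1 = GCNLayer(in_dim, hidden_dim)
        self.gc2 = GCNLayer(hidden_dim, hidden_dim)

    def forward(self, h, adj):
        return self.gc2(self.gc1(h, adj), adj)
```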
To address the limitation of the fixed weight distribution in GCN and enhance the perception of linear building structures, we innovatively designed the Orientation-Aware Swin Transformer (OA-SwinT) submodule to refine the feature representations produced by the GCN. The schematic diagram of this module is illustrated in Figure 2 and Figure 3. By leveraging a direction-aware windowed self-attention mechanism, OA-SwinT extends the advantages of the standard Swin Transformer with the following enhancements:
Building Principal Orientation Calculation: computes the principal orientation $\theta$ within each window from DSM data through the second-order moment method:
$$\theta = \frac{1}{2}\arctan\!\left(\frac{2\mu_{11}}{\mu_{20} - \mu_{02}}\right)$$
where $\mu_{pq} = \sum_{(x,y)\in W}(x - \bar{x})^{p}(y - \bar{y})^{q}\,h(x,y)$, $h(x,y)$ represents DSM elevation values, and $(\bar{x}, \bar{y})$ are the window centroid coordinates.
Orientation-Aware Positional Encoding (OA-PE): integrates the orientation information into the standard positional encoding, modulated by the height mask $M_{h}$ derived from the DSM.
Enhanced Window Attention (OA-W-MSA): infuses the orientation information into the attention bias matrix of the windowed multi-head self-attention.
This design incorporates OA-PE to enhance the model’s sensitivity to features along building boundaries while suppressing interference from non-building regions. Theoretical analysis indicates that OA-PE effectively guides attention towards the correct spatial orientation, leading to an improved IoU for building boundary extraction. While GCN focuses on “point-edge” connectivity, OA-SwinT integrates orientation-aware encoding to model “region-surface” relationships, forming a hierarchical representation from micro to macro levels. This hierarchical structure enables the model to handle multi-scale and complex building boundaries more effectively. Based on dataset statistics (the window covers the minimum bounding rectangle of 90% of independent buildings), the stability of the orientation calculation (second-order moment variance below 5°), and the trade-off with computational efficiency, the optimal window size is determined to be 8 × 8.
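For concreteness, the principal-orientation computation can be sketched as follows; this is a minimal reading of the second-order moment method applied to a single 8 × 8 DSM window, and the exact weighting and masking used in OA-SwinT may differ.

```python
import torch

def window_principal_orientation(dsm_window: torch.Tensor) -> torch.Tensor:
    """Estimate the principal orientation (radians) of an 8x8 DSM window
    from elevation-weighted second-order central moments."""
    h, w = dsm_window.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=dsm_window.dtype),
        torch.arange(w, dtype=dsm_window.dtype),
        indexing="ij",
    )
    weights = dsm_window.clamp(min=0)                 # DSM elevation values as weights
    total = weights.sum().clamp(min=1e-6)
    x_bar = (weights * xs).sum() / total              # window centroid coordinates
    y_bar = (weights * ys).sum() / total
    mu11 = (weights * (xs - x_bar) * (ys - y_bar)).sum()
    mu20 = (weights * (xs - x_bar) ** 2).sum()
    mu02 = (weights * (ys - y_bar) ** 2).sum()
    # Principal orientation from the second-order moments
    return 0.5 * torch.atan2(2 * mu11, mu20 - mu02)
```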
Finally, the Dense Block performs a deep fusion of the outputs from the previous two stages through multi-layer dense connections. The dense connection mechanism is implemented through cross-layer feature reuse, ensuring that shallow-layer geometric details (e.g., rooftop edge curvature) are preserved while deep-layer semantic information (e.g., building/non-building classification confidence) is effectively integrated. At this stage, the Dense Block progressively receives outputs from the previous layer and concatenates them with the current layer’s features, ensuring the sufficient flow and reuse of information in the feature maps. Through multiple convolution operations, the Dense Block not only enhances meaningful features but also effectively suppresses noisy features. The Dense Block designed in this paper combines feature-extracting convolutions with channel-reducing convolutions, retaining key features while reducing dimensionality. This design effectively lowers computational complexity and improves efficiency. Through repeated convolutions, the Dense Block refines and integrates features that already contain global relationships and contextual information, enhancing their relevance.
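A compact sketch of such a densely connected fusion stage is given below; the growth rate, number of layers, and kernel sizes are illustrative placeholders rather than the exact GSTDM configuration, but the cross-layer concatenation and the channel-reducing transition follow the description above.

```python
import torch
import torch.nn as nn

class DenseFusionBlock(nn.Module):
    """Dense connections: each layer consumes the concatenation of all previous
    outputs, then a 1x1 transition compresses channels back down."""
    def __init__(self, in_ch, growth=32, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True),
            ))
            ch += growth
        # Channel-reducing 1x1 convolution: retains key features, lowers dimensionality
        self.transition = nn.Conv2d(ch, in_ch, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # cross-layer feature reuse
        return self.transition(torch.cat(feats, dim=1))
```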
GCN is positioned at the front-end, performing sparse graph operations on low-resolution feature maps to filter noise while preserving key structures. OA-SwinT is placed in the intermediate stage, where attention mechanisms operate in the dimensionally reduced semantic space, balancing accuracy and computational efficiency. Dense connections are applied at the final stage, conducting lightweight feature reassembly in high-dimensional space to achieve efficient feature fusion. The design of GSTDM innovatively integrates multiple modules in a synergistic manner, enhancing the accuracy and robustness of the model in real-world applications. Notably, when processing large-scale remote sensing data, GSTDM maintains high efficiency and precision, making it well-suited for practical deployment.
2.3. Res DualAttention Dense Fusion Block (R-DDFB)
This paper proposes a novel encoder–decoder skip connection structure, based on a newly introduced module: the Res DualAttention Dense Fusion Block (R-DDFB). The schematic diagram of the module structure is shown in Figure 4. The module consists of two parts: one is the ResBlock and the other is the DualAttention Dense Fusion Block (DDFB).
2.3.1. ResBlock
The traditional U-Net architecture extracts features through convolutional layers. Although it demonstrates certain effectiveness in feature learning, it also exposes limitations such as gradient vanishing and inadequate feature representation in deep networks. To address these issues, we introduce a deep residual learning strategy, stacking ResBlocks to enhance feature extraction capability. As shown in Figure 1, each ResBlock consists of two convolution layers and a residual connection. The residual connection effectively alleviates the gradient vanishing problem, promotes deeper feature learning, and enhances the network’s expressive power. Through deep residual learning, the network can progressively extract multi-scale building features from HRRSI and flexibly capture structural information at different scales.
In the encoder part, we design four stages, each containing a ResBlock and a max-pooling layer. During the progressive downsampling process, the spatial resolution of the feature maps gradually decreases, while the number of channels increases to 64, 128, 256, and 512. This ensures rich feature representation and the effective expression of semantic information.
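The ResBlock and the four-stage encoder can be sketched as follows; the kernel sizes, normalization layers, and the 1 × 1 projection on the identity path are assumptions, while the channel progression (64, 128, 256, 512), the max-pooling stages, and the four-channel DOM + nDSM input follow the text.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Two convolution layers with a residual connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the identity path matches the output channel count
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

class Encoder(nn.Module):
    """Four stages of ResBlock + max pooling, channels 64-128-256-512."""
    def __init__(self, in_ch=4):  # DOM (RGB) plus nDSM as a fourth channel
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for c in (64, 128, 256, 512):
            self.stages.append(ResBlock(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skips = []
        for stage in self.stages:
            x = stage(x)
            skips.append(x)   # kept for the decoder's skip connections
            x = self.pool(x)
        return x, skips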
2.3.2. DDFB
To further enhance the model’s performance in building extraction tasks, we designed the DualAttention Dense Fusion Block (DDFB), which serves as the core module of the decoder. The DDFB combines spatial and channel dual attention mechanisms with Dense Block’s multi-layer feature fusion, significantly enhancing the decoder’s ability to recover details and model global semantics in complex scenes. The structural design of the module is shown in Figure 4, and features are gradually refined through four iterations.
The traditional U-Net decoder performs image detail recovery via simple upsampling and convolution operations, but it does not apply feature selection or weighting to the features output by the encoder. This kind of skip connection may introduce noise, affecting the accuracy of feature extraction. Moreover, directly concatenating shallow and deep features without considering semantic differences or importance may lead to a suboptimal use of information. To address these issues, we introduce the DANet attention mechanism in the DDFB, which effectively filters and weighs the skip-connected features through spatial and channel attention. This ensures that the decoder focuses on target-relevant regions, significantly reducing noise interference and enhancing the expression of detailed features. Furthermore, the DDFB innovatively fuses the shallow early features from the encoder (rich in detail but weak in semantics) with the upsampled deep features, balancing global semantics and local details. This feature fusion strategy prevents the loss of important detailed information during the recovery of spatial resolution, significantly improving the decoder’s ability to reconstruct building boundaries and shapes.
The features obtained from the previous workflow need to be efficiently fused. Therefore, we design the Dense Block, which performs multi-layer dense connections to progressively fuse the high-level features from the encoder side with the high-semantic information output by the bottleneck layer. This also efficiently integrates the low-resolution detail features after upsampling with high-level features. This layer-by-layer feature accumulation and integration not only enhances the flow of information but also optimizes the reuse of features, thereby improving the model’s performance in complex environments.
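A simplified sketch of one decoder stage is shown below: skip features from the encoder are re-weighted by channel and spatial attention before being concatenated with the upsampled deep features and fused. The attention used here is a lightweight stand-in for the DANet position/channel attention modules, the fusion convolution stands in for the Dense Block, and the channel counts are illustrative.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Lightweight channel + spatial attention used to filter skip features."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(                  # channel attention (SE-style stand-in)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(                  # spatial attention map
            nn.Conv2d(ch, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel(x)
        return x * self.spatial(x)

class DDFBStage(nn.Module):
    """One decoder stage: attend to skip features, concatenate with upsampled
    deep features, then fuse (a Dense Block performs this fusion in the paper)."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(deep_ch, skip_ch, kernel_size=2, stride=2)
        self.attn = DualAttention(skip_ch)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, skip):
        deep = self.up(deep)
        skip = self.attn(skip)                         # suppress noise in the skip features
        return self.fuse(torch.cat([deep, skip], dim=1))
```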
The proposed R-DDFB achieves a deep integration of detailed and semantic information by combining residual learning, dual attention mechanisms, and dense feature fusion strategies. Through feature extraction and fusion strategies, the R-DDFB significantly enhances the model’s performance in building extraction tasks under complex backgrounds, ensuring the coordinated integration of detailed and global semantic information.
2.4. Hybrid Loss
In the task of building extraction, Binary Cross-Entropy (BCE) and Dice Loss are two commonly used loss functions, each serving different roles in terms of segmentation accuracy and boundary detail optimization. The Binary Cross-Entropy Loss function is designed for binary classification tasks, aiming to minimize the classification error at each pixel. The formula is as follows:
$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $y_i$ is the ground truth and $\hat{y}_i$ is the predicted value. BCE emphasizes the overall pixel classification accuracy, enabling the model to distinguish between building and background pixels effectively. However, in building extraction tasks, the foreground (building pixels) is often much smaller than the background area. As a result, BCE calculates the error uniformly across each pixel, which leads to insufficient performance in small foreground areas, such as buildings.
In contrast, Dice Loss is specifically designed for segmentation tasks, particularly in scenarios with class imbalance. Dice Loss measures the quality of segmentation by computing the overlap between the predicted and true labels, making it particularly effective in handling class imbalance. The Dice Loss formula is as follows:
$$L_{Dice} = 1 - \frac{2\,|X \cap Y| + \epsilon}{|X| + |Y| + \epsilon}$$
where $|X \cap Y|$ represents the intersection of the predicted and true values, $|X|$ and $|Y|$ are the sums of the true and predicted values, and $\epsilon$ is a small constant to avoid division by zero. Dice Loss emphasizes maximizing the overlap between the predicted and true regions, with higher sensitivity to the foreground areas. As such, Dice Loss effectively reduces the missed detection of building contours, enhancing the accuracy of building boundaries. This makes Dice Loss particularly beneficial in tasks where the foreground is small and the segmentation accuracy heavily depends on boundary information.
We introduce an Edge Sensitivity Term to further enhance the model’s focus on building boundaries. This is achieved by incorporating gradient-based information, which generates an edge mask on the segmentation map, concentrating the loss on edge pixels. Specifically, an edge mask $E$ is first generated by applying an edge detection operator (we use the Sobel operator) to the true label $y$. The loss is then computed only at the locations of the edge mask, and the Edge Sensitivity Term is defined as follows:
$$L_{Edge} = -\frac{1}{|E|}\sum_{i \in E}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $|E|$ represents the number of pixels in the edge region. This approach ensures that the model produces higher gradients in the edge regions, allowing for more accurate boundary detection of buildings.
In summary, to optimize the model’s performance, we design a combined loss function, Hybrid Loss, as shown in the following equation:
$$L_{Hybrid} = \alpha L_{BCE} + \beta L_{Dice} + \gamma L_{Edge}$$
where $\alpha$, $\beta$, and $\gamma$ are weight parameters used to adjust the relative contribution of each loss term. The BCE loss focuses on the binary classification accuracy at the pixel level, suitable for controlling the overall segmentation accuracy of background and building regions. The Dice Loss enhances the overlap between the foreground and the predicted segmentation, making it robust to class imbalance and helping to reduce the probability of building omission. The Edge Sensitivity Term is designed for the boundary regions of buildings, increasing the loss weight at the true label edge locations, thereby enabling the model to more precisely capture building contours and improve edge prediction quality. Considering the foreground–background class imbalance in building extraction tasks and the importance of edge details, and after extensive experimentation and tuning, we selected weight settings of $\alpha = 0.4$, $\beta = 0.5$, and $\gamma = 0.1$. Specifically, the weight $\alpha$ for the BCE loss was set to 0.4 to ensure accurate classification of both foreground and background. The weight $\beta$ for the Dice Loss was set to 0.5 to enhance the extraction of the building foreground, particularly in the presence of class imbalance. The weight $\gamma$ for the Edge Sensitivity Term was set to 0.1 to give moderate attention to the edge details of buildings. This configuration effectively balanced overall segmentation accuracy, foreground extraction, and edge precision, demonstrating superior convergence in the experiments and ultimately achieving optimal performance on the validation set. This combined loss function optimizes the model in terms of overall classification accuracy, consistency of overlapping regions, and the quality of boundary details.
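A PyTorch sketch of the Hybrid Loss is given below. It assumes the network outputs probabilities in [0, 1] with shape (B, 1, H, W) and implements the edge term as the pixel-wise BCE restricted to a Sobel-derived edge mask of the ground truth, which is one possible reading of the definition above; the threshold used to binarize the Sobel response is illustrative.

```python
import torch
import torch.nn.functional as F

def sobel_edge_mask(target: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """Binary edge mask E obtained by applying the Sobel operator to the ground truth.

    target: (B, 1, H, W) binary ground-truth mask.
    """
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=target.device, dtype=target.dtype).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                       # Sobel-y is the transpose of Sobel-x
    gx = F.conv2d(target, kx, padding=1)
    gy = F.conv2d(target, ky, padding=1)
    return ((gx.abs() + gy.abs()) > thresh).float()

def hybrid_loss(pred, target, alpha=0.4, beta=0.5, gamma=0.1, eps=1e-6):
    """L = alpha*BCE + beta*Dice + gamma*edge-restricted BCE (weights from the paper)."""
    bce = F.binary_cross_entropy(pred, target)
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    edge = sobel_edge_mask(target)
    pixel_bce = F.binary_cross_entropy(pred, target, reduction="none")
    edge_term = (pixel_bce * edge).sum() / edge.sum().clamp(min=1.0)
    return alpha * bce + beta * dice + gamma * edge_term
```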
2.5. DSM Multi-Source Data Fusion
DSM data record the absolute height information of surface objects from the ground level upwards. These height data play a significant role in assisting building extraction from remote sensing imagery, particularly in the following aspects:
Object Differentiation: DSM data help distinguish between different types of objects, such as buildings versus ground or trees versus low vegetation, thereby improving classification accuracy.
Dense Area Identification: In densely built-up areas, DSM data effectively address the challenge of distinguishing between adjacent buildings, ensuring independent identification of each building.
Geometric Distortion Mitigation: DSM data help alleviate geometric distortions caused by the viewing angle of remote sensing imagery, such as shadow occlusion and the misclassification of building walls as roofs, thus improving the accuracy of building boundary delineation.
Integrating DSM with HRRSI for building extraction better accommodates the complexity of the research subject, significantly enhancing the accuracy and reliability of building detection. In our method, we first process the DSM data into a normalized Digital Surface Model (nDSM), which is then used as a fourth channel and input into the network alongside HRRSI data for training. This multi-source data fusion approach fully leverages the height information provided by the DSM, strengthening the model’s ability to recognize building features.
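The channel-level fusion can be sketched as follows; the min–max scaling used here to bring the nDSM into the image value range is an illustrative choice (the actual nDSM generation and normalization may differ), and the array names are placeholders.

```python
import numpy as np

def fuse_dom_ndsm(dom_rgb: np.ndarray, ndsm: np.ndarray) -> np.ndarray:
    """Stack a normalized surface model as a fourth channel next to the DOM bands.

    dom_rgb: (H, W, 3) orthophoto; ndsm: (H, W) height above ground in meters.
    """
    # Scale heights to [0, 1] so the extra channel matches the image value range
    # (simple min-max scaling used here for illustration).
    h = ndsm.astype(np.float32)
    h = (h - h.min()) / max(h.max() - h.min(), 1e-6)
    return np.dstack([dom_rgb.astype(np.float32), h])  # (H, W, 4) network input
```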
5. Conclusions
This paper proposes MMRAD-Net, an innovative multi-source multi-scale deep learning network designed to address the challenges of building extraction from HRRSI in complex scenes. Two key novel modules are introduced: GSTDM and R-DDFB. These modules combine the strengths of GCN and OA-SwinT, enabling the model to capture local details and understand global context, while the dual attention mechanism ensures a robust fusion of cross-scale features. The network built upon these two modules overcomes the limitations of traditional U-Net in global relationship modeling, feature fusion, and segmentation of complex scenes.
In addition, we integrated DOM and DSM data to alleviate the issue of spectral confusion between buildings and the ground. A Hybrid Loss is proposed, which combines Binary Cross-Entropy Loss, Dice Loss, and an edge-sensitive term, effectively improving the extraction accuracy of building boundaries and details.
The experimental results on the GF-7 Dataset and WHU Building Dataset show that MMRAD-Net outperforms five other classical segmentation models in both quantitative metrics and qualitative analysis, particularly in boundary delineation, detail recovery, and adaptability to complex scenes. Ablation and transfer learning experiments further validate the model’s design rationality and generalization capability.
Despite achieving significant progress, there is still room for improvement in MMRAD-Net. Future work could explore more efficient feature extraction methods or hybrid attention mechanisms to reduce model complexity without sacrificing performance. Overall, MMRAD-Net provides a promising framework for improving building extraction from HRRSI.