1. Introduction
Buildings play a crucial role in human life, acting as a direct reflection of the socio-economic development of a region. The accurate extraction of building areas from high-resolution remote sensing imagery (HRRSI) is essential to provide fundamental surface information, which supports urban planning, disaster management, and socio-economic development [1,2].
With the continuous advancement of satellite remote sensing technology, the imaging quality of remote sensing imagery, especially in terms of resolution and precision, has been significantly enhanced. Remote sensing imagery has evolved from meter-level high resolution (HR) to sub-meter very high resolution (VHR), with ongoing efforts to further improve spatial resolution.
Manual image interpretation is both time-consuming and labor-intensive, making the automatic or semi-automatic extraction of buildings from satellite remote sensing images increasingly important. Remote sensing images often exhibit phenomena such as “same object, different spectra” or “different objects, same spectra”. Moreover, the improvement in the spatial resolution of remote sensing imagery has resulted in richer ground feature information, presenting greater challenges for building extraction [3,4]. Buildings in HRRSI exhibit more complex color and texture patterns, which makes it difficult for basic neural network models to extract them with high precision.
With the rapid development of deep learning algorithms in computer vision, building extraction from HRRSI has transitioned from traditional methods to deep learning-based approaches. Traditional methods typically rely on manual and expert interpretation of unique building features, such as spectral information [5], spatial information [6], texture features [7,8], and morphological building indices (MBI) [9]. However, these feature-based methods heavily depend on manual feature design and implementation, which often vary with different application domains. Furthermore, the diversity in building structures, spectra, scales, and distributions complicates the extraction task, making these methods unsuitable for diverse scenarios and datasets.
In recent years, deep learning technologies have significantly advanced building extraction from remote sensing imagery, making it the mainstream approach. The Convolutional Neural Network (CNN), popularized by Krizhevsky et al. [10], revolutionized the field of computer vision by enabling the automatic learning of contextual features within images through multiple convolutional layers. CNNs have shown great potential in object detection within remote sensing imagery. Building upon this, CNN architectures have evolved; notably, the Fully Convolutional Network (FCN) [11] replaces fully connected layers with convolutional layers to handle input images of arbitrary sizes and generates output feature maps of the same dimensions, enabling pixel-level semantic segmentation. This innovation has made FCNs highly effective for building extraction. However, the feature maps generated by FCNs often suffer from blurriness and smoothing due to deconvolutional upsampling, which affects the clarity of fine details. To address this, several improvements have been proposed. For example, Cui Weihong et al. [12] employed upsampling techniques from SegNet to overcome the roughness caused by deconvolutional upsampling, while Shrestha et al. [13] combined post-processing with Conditional Random Fields (CRFs) to further refine edge detection and enhance boundary precision. Recent studies like UANet [14] demonstrate that uncertainty quantification can improve model robustness, but such mechanisms are absent in most existing frameworks.
Recent studies reveal three persistent challenges in state-of-the-art methods: (1) ineffective fusion of multi-source data due to modality gaps [15]; (2) limited capability in modeling irregular geometric characteristics of building structures; (3) boundary ambiguity caused by insufficient edge-sensitive learning mechanisms [16]. Additionally, uncertainty propagation from sensor noise and environmental factors remains understudied, as shown in [17]. U-Net [18] introduced an encoder–decoder architecture with skip connections that effectively merges multi-scale features, achieving significant results in restoring building edges and shapes. However, U-Net and its variants (U2-Net [19], MA-Unet [20], etc.) remain limited in handling multi-source inputs [21,22]. DANet [23] demonstrates superior spatial attention but may struggle with segmentation performance when dealing with very small objects or noisy images, as it tends to overly focus on local details and faces challenges in attention distribution within local regions. HRNet [24] maintains high-resolution features yet fails to capture topological dependencies between distant building components. DeepLab v3 [25,26,27] achieves multi-scale perception but produces blurred boundaries due to dilated convolution artifacts. While Swin-Unet [28] introduces transformer-based global modeling, its quadratic complexity hinders practical deployment on large-scale RS imagery.
In addition, attention mechanisms [29] and multi-task frameworks [30] have been widely adopted. Yu et al. [31] integrated Attention Gates (AGs) in U-Net to effectively suppress noise and highlight the edges and fine details of small buildings, while Hong et al. [32] proposed a multi-task learning framework incorporating the Swin Transformer, enabling simultaneous building extraction and change detection. This multi-task framework efficiently handles tasks such as building detection, edge extraction, and shape restoration, addressing issues related to detail loss caused by insufficient image resolution.
In addition to the aforementioned convolutional network-based methods, transformer-based models have also been employed in building extraction from remote sensing imagery. Building extraction with vision transformers [22] achieves state-of-the-art performance on standard benchmarks. STT (Sparse Token Transformer) [15] is an efficient dual-path transformer architecture that learns spatial and channel-wise long-range dependencies to capture the spatial location information of building edges, achieving high accuracy on benchmark datasets. MAP-Net [33] leverages multiple parallel paths to learn multi-scale features with preserved spatial location information, achieving the precise segmentation of building contours at different scales through the adaptive squeezing of features extracted from each path. While these transformer-based structures have made progress in preserving edge details and restoring building shapes, their complex architecture and computational cost present challenges in achieving an optimal balance between computational efficiency and accuracy.
In practical applications, numerous post-processing techniques have been proposed to further optimize the boundary and detail quality of building extraction. For instance, Shrestha et al. [13] used Conditional Random Fields (CRFs) and Exponential Linear Units (ELUs) to enhance edge stability, while Alshehhi et al. [34] employed a patch-based CNN architecture combined with post-processing techniques to integrate low-level features from adjacent regions, reducing misclassification. However, post-processing effectiveness often depends on the accuracy of the initial segmentation and cannot fundamentally solve issues with incomplete building boundary information.
In conclusion, deep learning methods have made significant progress in building extraction from remote sensing images, particularly for high-resolution datasets. However, current encoder–decoder architectures still face challenges in accurately restoring building shape features and detecting edges. The integration of multi-source data fusion, attention mechanisms, and transformer models provides promising solutions, but fully utilizing building detail features in high-resolution remote sensing data to achieve more accurate and efficient building extraction remains a key research direction.
In building extraction tasks, Digital Surface Model (DSM) data serve as a crucial auxiliary data source [35], commonly used to provide ground elevation information. The DSM reflects the height of all objects on the surface, including buildings, trees, and other structures, in contrast to the Digital Terrain Model (DTM), which only represents the elevation of the bare ground. In HRRSI, buildings often exhibit minimal spectral and geometric differences from the surrounding environment, which increases the difficulty of extraction. By incorporating DSM data, height information can be provided for buildings, enabling effective differentiation between buildings and other land cover types, particularly in complex urban environments.
This study proposes MMRAD-Net, a high-precision model for building extraction from HRRSI that remains effective for complex building scenes and shapes, such as those in GF-7 imagery. The model is designed to address the unique challenges posed by high-resolution images by incorporating DSM data. It effectively integrates multi-layer features, ensuring a strong representation of both fine-grained details and global semantic information.
The key contributions of this study are as follows:
We introduce MMRAD-Net, a model based on the encoder–decoder structure, which integrates the innovative GCN OA-SWinT Dense Module (GSTDM) and the Res DualAttention Dense Fusion Block (R-DDFB), along with the novel Hybrid Loss. By employing hierarchical feature modeling and a refined cross-scale fusion mechanism, the model enhances building extraction capabilities in complex environments.
By combining the texture and morphological features of HRRSI with elevation information from DSM data, we employ a multi-source fusion approach that significantly enhances building extraction performance.
A comparison with other models demonstrates that MMRAD-Net achieves superior extraction accuracy in boundary delineation, detail preservation, and adaptability to complex scenes.
We conducted generalization experiments by applying the model trained on the GF-7 Dataset to the WHU Building Dataset, which verified that our model performs well across different resolutions, spectral features, and geographic environments.
2. Methods
2.1. Model Overview
This study proposes an innovative building extraction model for remote sensing imagery, MMRAD-Net, aimed at addressing the challenges of multi-scale feature extraction, complex building boundary representation, and detail recovery in HRRSI segmentation tasks. In addition to inheriting the symmetric encoder–decoder structure of U-Net, MMRAD-Net incorporates several innovative designs. To overcome the limitations of the classic U-Net in global relationship modeling, feature fusion, and complex scene segmentation, MMRAD-Net combines multi-source data, including Digital Orthophoto Map (DOM) and Digital Surface Model (DSM), and introduces two core modules: the GCN OA-SWinT Dense Module (GSTDM) and the Res DualAttention Dense Fusion Block (R-DDFB). By carefully designing the encoder–decoder structure and optimizing the loss function, the model significantly improves the accuracy and robustness of building extraction. The model architecture is shown in Figure 1.
In the design of MMRAD-Net, the encoder is responsible for progressively extracting features and downsampling, ultimately passing the features to the bottleneck layer for global information integration and feature refinement. This stage enhances feature extraction capability through the introduction of residual blocks (ResBlocks), ensuring the adequate representation of multi-scale features. Additionally, the design of the bottleneck layer with GSTDM facilitates multi-module collaboration, enabling the model to effectively integrate global structure understanding, contextual relationship modeling, and feature fusion from coarse to fine details.
The decoder portion restores details and strengthens semantic information using the DANet attention mechanism and Dense Block. Notably, we propose the R-DDFB, which combines spatial and channel dual attention mechanisms with deep residual learning on top of the traditional U-Net skip connections. This significantly enhances the model’s ability to capture building shapes, boundaries, and details in complex scenes.
To further enhance the model’s robustness, this paper also proposes a Hybrid Loss function tailored to building boundaries. The Hybrid Loss combines Binary Cross-Entropy (BCE) loss, Dice Loss, and an edge-sensitive term, ensuring that the model maintains global segmentation accuracy while strengthening the extraction of building details and boundary information.
The key innovation of MMRAD-Net lies in its effective fusion of multi-source data (DOM + DSM), and its multi-layer, multi-module collaborative design addresses the challenges of multi-scale, multi-morphological, and complex boundary extraction in HRRSI. This enables the model to achieve more accurate segmentation results, particularly in large-scale remote sensing data processing, while maintaining both high efficiency and high precision.
2.2. GCN OA-SwinT Dense Module (GSTDM)
Our network is designed to address the challenges of multi-scale variability, diverse morphological structures, and complex boundary extraction in HRRSI for building segmentation. Traditional U-Net-based architectures face significant limitations in capturing global contextual relationships and intricate building boundaries. Building contours in remote sensing imagery often exhibit irregular geometric characteristics (e.g., branching structures and discontinuous edges) [36,37], which pose challenges for traditional convolution operations based on regular grids. Graph Convolutional Networks (GCNs) [38] can effectively address this issue through topological modeling using adjacency matrices. However, standalone GCN-based local graph modeling fails to perceive the overall spatial distribution of building clusters. Additionally, the bottleneck in encoder–decoder structures often leads to the loss of fine details in small-scale buildings.
To mitigate these challenges, we propose an innovative module—GSTDM. Its design strictly adheres to a hierarchical feature learning paradigm. From a data flow perspective, GSTDM can be regarded as a multi-stage feature processing pipeline. The sequential feature processing in GSTDM follows a structured approach: spectral graph convolution-based topological modeling, windowed self-attention for global relationship constraints, and densely connected layers for multi-scale feature retention.
From a functional complementarity standpoint, GCN captures local topology (e.g., building corner connectivity) at shallow layers, while OA-SwinT models global layout patterns (e.g., spatial distribution of building clusters) at deeper layers. This hierarchical representation ensures a more comprehensive understanding of urban structures, enhancing both global context modeling and fine-grained boundary delineation.
The input to the network passes through the encoder of U-Net, generating low-resolution feature maps with high semantic levels. After these feature maps enter the GSTDM, they are first processed by a GCN. At this stage, due to the limited receptive field of convolution operations, the global structure and complex boundaries of buildings cannot be effectively captured. However, GCNs overcome this limitation by modeling the structure through graphs. The structure of the GCN Block consists of two graph convolution layers, each outputting 64 feature channels with ReLU activation. The feature update process in GCN can be expressed as follows:
$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\,\tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I$ is the adjacency matrix with self-loops, $\tilde{D}$ is the degree matrix, $H^{(l)}$ is the feature matrix at the $l$-th layer, and $W^{(l)}$ is the learnable weight matrix at the $l$-th layer. $H^{(l+1)}$ represents the feature matrix at the $(l+1)$-th layer, which is computed by applying the graph convolution operation to the feature matrix $H^{(l)}$ from the previous layer. $\sigma(\cdot)$ denotes the ReLU activation function. Each pixel in the feature map is mapped as a graph node, and the adjacency matrix $\tilde{A}$ dynamically encodes the non-Euclidean spatial relationships between nodes (such as connections across gaps or fractured regions). Through the normalization operation $\tilde{D}^{-\frac{1}{2}}\tilde{A}\,\tilde{D}^{-\frac{1}{2}}$, GCN enables the adaptive learning of complex topologies, effectively capturing the sharp corners and discontinuous edges of building contours.
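As an illustration, the propagation rule above can be written compactly in PyTorch. The sketch below assumes the bottleneck feature map has already been flattened into an N × C node-feature matrix and that a binary adjacency matrix over the pixels is available; the two-layer, 64-channel configuration follows the GCN Block described above, while tensor shapes and the adjacency construction are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One spectral graph convolution: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        # adj: (N, N) binary adjacency; add self-loops and normalize symmetrically
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a_tilde.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return F.relu(self.weight(a_norm @ h))

class GCNBlock(nn.Module):
    """Two graph convolution layers, each outputting 64 feature channels (as in GSTDM)."""
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.gc1 = GCNLayer(in_dim, hidden_dim)
        self.gc2 = GCNLayer(hidden_dim, hidden_dim)

    def forward(self, h, adj):
        return self.gc2(self.gc1(h, adj), adj)
```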
To address the limitation of the fixed weight distribution in GCN and enhance the perception of linear building structures, we innovatively designed the Orientation-Aware Swin Transformer (OA-SwinT) submodule to refine the feature representations produced by the GCN. The schematic diagram of this module is illustrated in Figure 2 and Figure 3. By leveraging a direction-aware windowed self-attention mechanism, OA-SwinT extends the advantages of the standard Swin Transformer with the following enhancements:
Building Principal Orientation Calculation: computes the principal orientation $\theta$ within each window from DSM data through the second-order moment method:
$$\theta = \frac{1}{2}\arctan\!\left(\frac{2\mu_{11}}{\mu_{20} - \mu_{02}}\right)$$
where $\mu_{pq} = \sum_{(x,y)\in W}(x - \bar{x})^{p}(y - \bar{y})^{q}\,h(x,y)$, $h(x,y)$ represents DSM elevation values, and $(\bar{x}, \bar{y})$ are the window centroid coordinates.
Orientation-Aware Positional Encoding (OA-PE): integrates the orientation information into the standard positional encoding, modulated by the height mask $M_{h}$ derived from the DSM.
Enhanced Window Attention (OA-W-MSA): infuses the orientation information into the attention bias matrix of the windowed multi-head self-attention.
This design incorporates OA-PE to enhance the model’s sensitivity to features along building boundaries while suppressing interference from non-building regions. Theoretical analysis indicates that OA-PE effectively guides attention towards the correct spatial orientation, leading to an improved IoU for building boundary extraction. While GCN focuses on “point-edge” connectivity, OA-SwinT integrates orientation-aware encoding to model “region-surface” relationships, forming a hierarchical representation from micro to macro levels. This hierarchical structure enables the model to handle multi-scale and complex building boundaries more effectively. Based on dataset statistics (the window covers the minimum bounding rectangle of 90% of independent buildings), the stability of the orientation calculation (second-order moment variance below 5°), and the trade-off with computational efficiency, the optimal window size is determined to be 8 × 8.
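For concreteness, the principal-orientation computation can be sketched as follows; this is a minimal reading of the second-order moment method applied to a single 8 × 8 DSM window, and the exact weighting and masking used in OA-SwinT may differ.

```python
import torch

def window_principal_orientation(dsm_window: torch.Tensor) -> torch.Tensor:
    """Estimate the principal orientation (radians) of an 8x8 DSM window
    from elevation-weighted second-order central moments."""
    h, w = dsm_window.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=dsm_window.dtype),
        torch.arange(w, dtype=dsm_window.dtype),
        indexing="ij",
    )
    weights = dsm_window.clamp(min=0)                 # DSM elevation values as weights
    total = weights.sum().clamp(min=1e-6)
    x_bar = (weights * xs).sum() / total              # window centroid coordinates
    y_bar = (weights * ys).sum() / total
    mu11 = (weights * (xs - x_bar) * (ys - y_bar)).sum()
    mu20 = (weights * (xs - x_bar) ** 2).sum()
    mu02 = (weights * (ys - y_bar) ** 2).sum()
    # Principal orientation from the second-order moments
    return 0.5 * torch.atan2(2 * mu11, mu20 - mu02)
```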
Finally, the Dense Block performs a deep fusion of the outputs from the previous two stages through multi-layer dense connections. The dense connection mechanism is implemented through cross-layer feature reuse, ensuring that shallow-layer geometric details (e.g., rooftop edge curvature) are preserved while deep-layer semantic information (e.g., building/non-building classification confidence) is effectively integrated. At this stage, the Dense Block progressively receives outputs from the previous layer and concatenates them with the current layer’s features, ensuring the sufficient flow and reuse of information in the feature maps. Through multiple convolution operations, the Dense Block not only enhances meaningful features but also effectively suppresses noisy features. The Dense Block designed in this paper combines feature-extracting convolutions with channel-reducing convolutions, retaining key features while reducing dimensionality. This design effectively lowers computational complexity and improves efficiency. Through repeated convolutions, the Dense Block refines and integrates features that already contain global relationships and contextual information, enhancing their relevance.
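A compact sketch of such a densely connected fusion stage is given below; the growth rate, number of layers, and kernel sizes are illustrative placeholders rather than the exact GSTDM configuration, but the cross-layer concatenation and the channel-reducing transition follow the description above.

```python
import torch
import torch.nn as nn

class DenseFusionBlock(nn.Module):
    """Dense connections: each layer consumes the concatenation of all previous
    outputs, then a 1x1 transition compresses channels back down."""
    def __init__(self, in_ch, growth=32, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True),
            ))
            ch += growth
        # Channel-reducing 1x1 convolution: retains key features, lowers dimensionality
        self.transition = nn.Conv2d(ch, in_ch, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # cross-layer feature reuse
        return self.transition(torch.cat(feats, dim=1))
```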
GCN is positioned at the front-end, performing sparse graph operations on low-resolution feature maps to filter noise while preserving key structures. OA-SwinT is placed in the intermediate stage, where attention mechanisms operate in the dimensionally reduced semantic space, balancing accuracy and computational efficiency. Dense connections are applied at the final stage, conducting lightweight feature reassembly in high-dimensional space to achieve efficient feature fusion. The design of GSTDM innovatively integrates multiple modules in a synergistic manner, enhancing the accuracy and robustness of the model in real-world applications. Notably, when processing large-scale remote sensing data, GSTDM maintains high efficiency and precision, making it well-suited for practical deployment.
2.3. Res DualAttention Dense Fusion Block (R-DDFB)
This paper proposes a novel encoder–decoder skip connection structure, based on a newly introduced module: the Res DualAttention Dense Fusion Block (R-DDFB). The schematic diagram of the module structure is shown in Figure 4. The module consists of two parts: one is the ResBlock and the other is the DualAttention Dense Fusion Block (DDFB).
2.3.1. ResBlock
The traditional U-Net architecture extracts features through convolutional layers. Although it demonstrates certain effectiveness in feature learning, it also exposes limitations such as gradient vanishing and inadequate feature representation in deep networks. To address these issues, we introduce a deep residual learning strategy, stacking ResBlocks to enhance feature extraction capability. As shown in Figure 1, each ResBlock consists of two convolution layers and a residual connection. The residual connection effectively alleviates the gradient vanishing problem, promotes deeper feature learning, and enhances the network’s expressive power. Through deep residual learning, the network can progressively extract multi-scale building features from HRRSI and flexibly capture structural information at different scales.
In the encoder part, we design four stages, each containing a ResBlock and a max-pooling layer. During the progressive downsampling process, the spatial resolution of the feature maps gradually decreases, while the number of channels increases to 64, 128, 256, and 512. This ensures rich feature representation and the effective expression of semantic information.
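The ResBlock and the four-stage encoder can be sketched as follows; the kernel sizes, normalization layers, and the 1 × 1 projection on the identity path are assumptions, while the channel progression (64, 128, 256, 512), the max-pooling stages, and the four-channel DOM + nDSM input follow the text.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Two convolution layers with a residual connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the identity path matches the output channel count
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))

class Encoder(nn.Module):
    """Four stages of ResBlock + max pooling, channels 64-128-256-512."""
    def __init__(self, in_ch=4):  # DOM (RGB) plus nDSM as a fourth channel
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for c in (64, 128, 256, 512):
            self.stages.append(ResBlock(prev, c))
            prev = c
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skips = []
        for stage in self.stages:
            x = stage(x)
            skips.append(x)   # kept for the decoder's skip connections
            x = self.pool(x)
        return x, skips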
2.3.2. DDFB
To further enhance the model’s performance in building extraction tasks, we designed the DualAttention Dense Fusion Block (DDFB), which serves as the core module of the decoder. The DDFB combines spatial and channel dual attention mechanisms with Dense Block’s multi-layer feature fusion, significantly enhancing the decoder’s ability to recover details and model global semantics in complex scenes. The structural design of the module is shown in Figure 4, and features are gradually refined through four iterations.
The traditional U-Net decoder performs image detail recovery via simple upsampling and convolution operations, but it does not apply feature selection or weighting to the features output by the encoder. This kind of skip connection may introduce noise, affecting the accuracy of feature extraction. Moreover, directly concatenating shallow and deep features without considering semantic differences or importance may lead to a suboptimal use of information. To address these issues, we introduce the DANet attention mechanism in the DDFB, which effectively filters and weighs the skip-connected features through spatial and channel attention. This ensures that the decoder focuses on target-relevant regions, significantly reducing noise interference and enhancing the expression of detailed features. Furthermore, the DDFB innovatively fuses the shallow early features from the encoder (rich in detail but weak in semantics) with the upsampled deep features, balancing global semantics and local details. This feature fusion strategy prevents the loss of important detailed information during the recovery of spatial resolution, significantly improving the decoder’s ability to reconstruct building boundaries and shapes.
The features obtained from the previous workflow need to be efficiently fused. Therefore, we design the Dense Block, which performs multi-layer dense connections to progressively fuse the high-level features from the encoder side with the high-semantic information output by the bottleneck layer. This also efficiently integrates the low-resolution detail features after upsampling with high-level features. This layer-by-layer feature accumulation and integration not only enhances the flow of information but also optimizes the reuse of features, thereby improving the model’s performance in complex environments.
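A simplified sketch of one decoder stage is shown below: skip features from the encoder are re-weighted by channel and spatial attention before being concatenated with the upsampled deep features and fused. The attention used here is a lightweight stand-in for the DANet position/channel attention modules, the fusion convolution stands in for the Dense Block, and the channel counts are illustrative.

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    """Lightweight channel + spatial attention used to filter skip features."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(                  # channel attention (SE-style stand-in)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(                  # spatial attention map
            nn.Conv2d(ch, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel(x)
        return x * self.spatial(x)

class DDFBStage(nn.Module):
    """One decoder stage: attend to skip features, concatenate with upsampled
    deep features, then fuse (a Dense Block performs this fusion in the paper)."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(deep_ch, skip_ch, kernel_size=2, stride=2)
        self.attn = DualAttention(skip_ch)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, deep, skip):
        deep = self.up(deep)
        skip = self.attn(skip)                         # suppress noise in the skip features
        return self.fuse(torch.cat([deep, skip], dim=1))
```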
The proposed R-DDFB achieves a deep integration of detailed and semantic information by combining residual learning, dual attention mechanisms, and dense feature fusion strategies. Through feature extraction and fusion strategies, the R-DDFB significantly enhances the model’s performance in building extraction tasks under complex backgrounds, ensuring the coordinated integration of detailed and global semantic information.
2.4. Hybrid Loss
In the task of building extraction, Binary Cross-Entropy (BCE) and Dice Loss are two commonly used loss functions, each serving different roles in terms of segmentation accuracy and boundary detail optimization. The Binary Cross-Entropy Loss function is designed for binary classification tasks, aiming to minimize the classification error at each pixel. The formula is as follows:
$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $y_i$ is the ground truth and $\hat{y}_i$ is the predicted value. BCE emphasizes the overall pixel classification accuracy, enabling the model to distinguish between building and background pixels effectively. However, in building extraction tasks, the foreground (building pixels) is often much smaller than the background area. As a result, BCE calculates the error uniformly across each pixel, which leads to insufficient performance in small foreground areas, such as buildings.
In contrast, Dice Loss is specifically designed for segmentation tasks, particularly in scenarios with class imbalance. Dice Loss measures the quality of segmentation by computing the overlap between the predicted and true labels, making it particularly effective in handling class imbalance. The Dice Loss formula is as follows:
$$L_{Dice} = 1 - \frac{2\,|X \cap Y| + \epsilon}{|X| + |Y| + \epsilon}$$
where $|X \cap Y|$ represents the intersection of the predicted and true values, $|X|$ and $|Y|$ are the sums of the true and predicted values, and $\epsilon$ is a small constant to avoid division by zero. Dice Loss emphasizes maximizing the overlap between the predicted and true regions, with higher sensitivity to the foreground areas. As such, Dice Loss effectively reduces the missed detection of building contours, enhancing the accuracy of building boundaries. This makes Dice Loss particularly beneficial in tasks where the foreground is small and the segmentation accuracy heavily depends on boundary information.
We introduce an Edge Sensitivity Term to further enhance the model’s focus on building boundaries. This is achieved by incorporating gradient-based information, which generates an edge mask on the segmentation map, concentrating the loss on edge pixels. Specifically, an edge mask $E$ is first generated by applying an edge detection operator (we use the Sobel operator) to the true label $y$. The loss is then computed only at the locations of the edge mask, and the Edge Sensitivity Term is defined as follows:
$$L_{Edge} = -\frac{1}{|E|}\sum_{i \in E}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
where $|E|$ represents the number of pixels in the edge region. This approach ensures that the model produces higher gradients in the edge regions, allowing for more accurate boundary detection of buildings.
In summary, to optimize the model’s performance, we design a combined loss function, Hybrid Loss, as shown in the following equation:
$$L_{Hybrid} = \alpha L_{BCE} + \beta L_{Dice} + \gamma L_{Edge}$$
where $\alpha$, $\beta$, and $\gamma$ are weight parameters used to adjust the relative contribution of each loss term. The BCE loss focuses on the binary classification accuracy at the pixel level, suitable for controlling the overall segmentation accuracy of background and building regions. The Dice Loss enhances the overlap between the foreground and the predicted segmentation, making it robust to class imbalance and helping to reduce the probability of building omission. The Edge Sensitivity Term is designed for the boundary regions of buildings, increasing the loss weight at the true label edge locations, thereby enabling the model to more precisely capture building contours and improve edge prediction quality. Considering the foreground–background class imbalance in building extraction tasks and the importance of edge details, and after extensive experimentation and tuning, we selected weight settings of $\alpha = 0.4$, $\beta = 0.5$, and $\gamma = 0.1$. Specifically, the weight $\alpha$ for the BCE loss was set to 0.4 to ensure accurate classification of both foreground and background. The weight $\beta$ for the Dice Loss was set to 0.5 to enhance the extraction of the building foreground, particularly in the presence of class imbalance. The weight $\gamma$ for the Edge Sensitivity Term was set to 0.1 to give moderate attention to the edge details of buildings. This configuration effectively balanced overall segmentation accuracy, foreground extraction, and edge precision, demonstrating superior convergence in the experiments and ultimately achieving optimal performance on the validation set. This combined loss function optimizes the model in terms of overall classification accuracy, consistency of overlapping regions, and the quality of boundary details.
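A PyTorch sketch of the Hybrid Loss is given below. It assumes the network outputs probabilities in [0, 1] with shape (B, 1, H, W) and implements the edge term as the pixel-wise BCE restricted to a Sobel-derived edge mask of the ground truth, which is one possible reading of the definition above; the threshold used to binarize the Sobel response is illustrative.

```python
import torch
import torch.nn.functional as F

def sobel_edge_mask(target: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """Binary edge mask E obtained by applying the Sobel operator to the ground truth.

    target: (B, 1, H, W) binary ground-truth mask.
    """
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=target.device, dtype=target.dtype).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                       # Sobel-y is the transpose of Sobel-x
    gx = F.conv2d(target, kx, padding=1)
    gy = F.conv2d(target, ky, padding=1)
    return ((gx.abs() + gy.abs()) > thresh).float()

def hybrid_loss(pred, target, alpha=0.4, beta=0.5, gamma=0.1, eps=1e-6):
    """L = alpha*BCE + beta*Dice + gamma*edge-restricted BCE (weights from the paper)."""
    bce = F.binary_cross_entropy(pred, target)
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    edge = sobel_edge_mask(target)
    pixel_bce = F.binary_cross_entropy(pred, target, reduction="none")
    edge_term = (pixel_bce * edge).sum() / edge.sum().clamp(min=1.0)
    return alpha * bce + beta * dice + gamma * edge_term
```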
2.5. DSM Multi-Source Data Fusion
DSM data record the absolute height information of surface objects from the ground level upwards. These height data play a significant role in assisting building extraction from remote sensing imagery, particularly in the following aspects:
Object Differentiation: DSM data help distinguish between different types of objects, such as buildings versus ground or trees versus low vegetation, thereby improving classification accuracy.
Dense Area Identification: In densely built-up areas, DSM data effectively address the challenge of distinguishing between adjacent buildings, ensuring independent identification of each building.
Geometric Distortion Mitigation: DSM data help alleviate geometric distortions caused by the viewing angle of remote sensing imagery, such as shadow occlusion and the misclassification of building walls as roofs, thus improving the accuracy of building boundary delineation.
Integrating DSM with HRRSI for building extraction better accommodates the complexity of the research subject, significantly enhancing the accuracy and reliability of building detection. In our method, we first process the DSM data into a normalized Digital Surface Model (nDSM), which is then used as a fourth channel and input into the network alongside HRRSI data for training. This multi-source data fusion approach fully leverages the height information provided by the DSM, strengthening the model’s ability to recognize building features.
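The channel-level fusion can be sketched as follows; the min–max scaling used here to bring the nDSM into the image value range is an illustrative choice (the actual nDSM generation and normalization may differ), and the array names are placeholders.

```python
import numpy as np

def fuse_dom_ndsm(dom_rgb: np.ndarray, ndsm: np.ndarray) -> np.ndarray:
    """Stack a normalized surface model as a fourth channel next to the DOM bands.

    dom_rgb: (H, W, 3) orthophoto; ndsm: (H, W) height above ground in meters.
    """
    # Scale heights to [0, 1] so the extra channel matches the image value range
    # (simple min-max scaling used here for illustration).
    h = ndsm.astype(np.float32)
    h = (h - h.min()) / max(h.max() - h.min(), 1e-6)
    return np.dstack([dom_rgb.astype(np.float32), h])  # (H, W, 4) network input
```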
5. Conclusions
This paper proposes MMRAD-Net, an innovative multi-source multi-scale deep learning network designed to address the challenges of building extraction from HRRSI in complex scenes. Two key novel modules are introduced: GSTDM and R-DDFB. These modules combine the strengths of GCN and OA-SwinT, enabling the model to capture local details and understand global context, while the dual attention mechanism ensures a robust fusion of cross-scale features. The network built upon these two modules overcomes the limitations of traditional U-Net in global relationship modeling, feature fusion, and segmentation of complex scenes.
In addition, we integrated DOM and DSM data to alleviate the issue of spectral confusion between buildings and the ground. A Hybrid Loss is proposed, which combines Binary Cross-Entropy Loss, Dice Loss, and an edge-sensitive term, effectively improving the extraction accuracy of building boundaries and details.
The experimental results on the GF-7 Dataset and WHU Building Dataset show that MMRAD-Net outperforms five other classical segmentation models in both quantitative metrics and qualitative analysis, particularly in boundary delineation, detail recovery, and adaptability to complex scenes. Ablation and transfer learning experiments further validate the model’s design rationality and generalization capability.
Despite achieving significant progress, there is still room for improvement in MMRAD-Net. Future work could explore more efficient feature extraction methods or hybrid attention mechanisms to reduce model complexity without sacrificing performance. Overall, MMRAD-Net provides a promising framework for improving building extraction from HRRSI.