1. Introduction
Under the impetus of globalization and digitization, remote sensing technology has emerged as a cornerstone of geographic information science, offering a fresh perspective on Earth exploration [1]. Ultra-high-resolution remote sensing images, with pixel counts exceeding 4 million, reveal finer texture details and object characteristics than lower-resolution imagery [2,3,4]. In fields such as automated road detection and multi-angle urban classification analysis, these images offer significant advantages, greatly improving decision-making precision [5,6]. However, handling such complex image data poses significant challenges in memory management and computational efficiency, increasing the need for manual intervention and potentially causing delays in critical applications such as emergency response [2]. Although advances in semantic segmentation offer new opportunities for understanding and automating the annotation of ultra-high-resolution images, developing more efficient and accurate segmentation algorithms remains a critical issue in remote sensing image processing.
In recent years, deep learning has made significant advances in remote sensing image semantic segmentation, driving innovations in image analysis and interpretation. Classical methods such as U-Net and its variants [7,8,9], multi-scale context aggregation networks [10,11,12], and multi-level feature fusion networks [13,14,15] have demonstrated strong segmentation performance. Attention mechanisms, which focus on relevant information while ignoring irrelevant details, have also been widely adopted for their ability to enhance key feature identification and boost model efficiency [16,17,18]. Meanwhile, Transformer-based and generative adversarial network (GAN)-based methods have attracted increasing attention in recent years [19,20,21]. For example, integrating Transformer models with U-Net and a guided focal attention mechanism has achieved outstanding segmentation accuracy, further enhancing feature representation ability [22]. These advances highlight a clear trend of combining Transformer-based models with attention mechanisms to address the challenges posed by high-resolution remote sensing imagery. Nevertheless, balancing segmentation quality against computational efficiency and memory usage remains a persistent challenge, particularly as the demand for processing ultra-high-resolution images continues to rise. Models that can reconcile fine-grained segmentation with memory and computational efficiency are therefore a critical focus of current research.
For example, Chen et al. [2] and Shan et al. [3] found that, when segmenting a 6-million-pixel DeepGlobe ultra-high-resolution remote sensing image, high-precision models such as FCN-8s [23] and DANet [24] require 5 GB to 10 GB of GPU memory for inference. Although these models offer high accuracy, they come with high memory consumption. In contrast, fast segmentation methods such as ICNet [25] and BiSeNet [26] significantly reduce memory usage, but at the cost of accuracy drops ranging from 9.6% to 30%. These findings illustrate the difficulty traditional model designs face in balancing accuracy and memory efficiency when handling ultra-high-resolution remote sensing imagery.
To address this challenge, various methods specifically designed for ultra-high-resolution image segmentation have emerged [2,27,28]. These methods jointly model features through dual-branch structures, significantly reducing the dependence on high-performance hardware while maintaining segmentation precision. For example, GLNet [2] uses a global and local dual-branch structure, with the global branch extracting contextual information and the local branch capturing image details; together, these improve segmentation accuracy and optimize memory usage. Although GLNet offers advantages in accuracy and memory usage over traditional methods, its image-block processing strategy increases the computational burden, resulting in slower inference. To further optimize segmentation performance and memory usage, ISDNet [27] captures semantic information at different levels using a dual-branch structure and integrates this information via a relation-aware feature fusion module; experiments on several ultra-high-resolution datasets demonstrate the model's strong generalizability. However, the relation-aware feature fusion module is complex and difficult to interpret, raising concerns about the reliability of its performance in practice. To address this, RSDNet [28] introduced a simplified Res-CBAM module to replace the relation-aware feature fusion module in ISDNet, boosting feature extraction effectiveness and improving segmentation accuracy. Although RSDNet improves memory efficiency and computational load balancing, and excels in large-scale crop scenarios, its inference speed is still slower than ISDNet's, indicating that input processing and the model itself can be optimized further.
Motivated by these challenges, we propose TDBAN, a dual-branch feature enhancement framework for ultra-high-resolution remote sensing image segmentation. TDBAN optimizes the conventional dual-branch structure with a lightweight Transformer architecture and incorporates a cross-sharing module to fuse global and local information efficiently in one step. Additionally, TDBAN incorporates a data-related learnable fusion module (DRLF) that adaptively adjusts the weights of global and local features, significantly reducing computational costs while maintaining high segmentation accuracy. In general, our work makes the following contributions.
- (1) Optimized dual-branch network architecture: Drawing inspiration from the GLNet framework, we optimize the processing of global and local information in ultra-high-resolution remote sensing images to achieve more effective information processing and feature recognition.
- (2) Innovative sharing strategy: An innovative cross-collaborative module (CCM) achieves feature fusion in one step through a Transformer-based cross-attention mechanism and significantly enhances model efficiency through its lightweight design.
- (3) Dynamic weight adjustment: The newly developed DRLF module automatically adjusts the weights of global and local information based on image content, optimizing information fusion to improve segmentation accuracy and processing speed. The module also adopts knowledge distillation, effectively reducing the model's computational complexity.
- (4) Significant performance improvement: These techniques yield significant performance gains on the DeepGlobe and Inria Aerial datasets, validating the network's efficiency.
3. Method
3.1. Overview
The entire network architecture is divided into two main pathways (Figure 1): the global branch and the local branch. The architecture begins with a dataset $\mathcal{D}$ containing $N$ pairs of high-resolution images and their corresponding segmentation maps, each of size $H \times W$. The global branch $G$ processes downsampled low-resolution image and label pairs, denoted as $\mathcal{D}_{g} = \{(X_{i}^{g}, Y_{i}^{g})\}_{i=1}^{N}$, where $X_{i}^{g}$ and $Y_{i}^{g}$ are images and labels of size $h \times w$, significantly smaller than the original $H \times W$ size. The local branch $L$ focuses on processing high-resolution image patches extracted from the original dataset $\mathcal{D}$, denoted as $\mathcal{D}_{l} = \{(X_{ij}^{l}, Y_{ij}^{l})\}$, where each patch is of size $h \times w$, noticeably smaller than the original image size. Each image $X_{i}$ and label $Y_{i}$ is cropped from top to bottom and left to right into $n \times n$ overlapping image blocks, preserving the characteristics of the ultra-high resolution. Both branches use the same backbone network: the downsampled images and cropped patches sequentially pass through ResNet, FPN, CCM, and DRLF, and a branch aggregation layer $f_{agg}$ then aggregates the two sets of high-level feature maps to generate the final segmentation mask. To ensure stability and improve the training of both branches, weakly coupled regularization is also applied to the training of the local branch. The CCM and DRLF are our innovations, introduced in Section 3.2 and Section 3.3, respectively.
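To make this data flow concrete, the following is a minimal PyTorch sketch of the input preparation described above: downsampling for the global branch and overlapping patch cropping for the local branch. The patch size, stride, and function names are illustrative assumptions, not TDBAN's exact settings.

```python
import torch
import torch.nn.functional as F

def prepare_branch_inputs(image, low_res=(500, 500), patch=500, stride=250):
    """Illustrative input preparation for a dual-branch network.

    image: (C, H, W) ultra-high-resolution tensor.
    Returns a downsampled copy for the global branch and a list of
    overlapping patches for the local branch. All sizes are assumptions.
    """
    # Global branch input: bilinear downsampling to a small fixed size.
    global_in = F.interpolate(image.unsqueeze(0), size=low_res,
                              mode="bilinear", align_corners=False)[0]

    # Local branch input: overlapping patches scanned top-to-bottom,
    # left-to-right (stride < patch gives the overlap). Border padding
    # is omitted here for brevity.
    _, H, W = image.shape
    patches = []
    for top in range(0, max(H - patch, 0) + 1, stride):
        for left in range(0, max(W - patch, 0) + 1, stride):
            patches.append(image[:, top:top + patch, left:left + patch])
    return global_in, patches

# Example: a DeepGlobe-sized 2448x2448 image yields one global input
# and a grid of overlapping 500x500 local patches.
img = torch.rand(3, 2448, 2448)
g, locs = prepare_branch_inputs(img)
print(g.shape, len(locs))
```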
3.2. Cross-Collaborative Module
To fully harness the potential of the dual-branch structure, this study introduces an efficient and flexible cross-collaborative module (CCM) that aims to achieve deep collaboration and information exchange between the global and local branch features. The module is based on the Transformer architecture and incorporates a cross-attention mechanism, which effectively handles embedded sequences of two distinct dimensions: the global and local branch features. By fusing these features asymmetrically and assigning them adaptive weights, the module thoroughly exploits their interrelation and complementarity.
Unlike the iterative fusion strategy employed by GLNet [2], which fuses four layers of features three times, TDBAN's cross-collaborative module leverages the Transformer mechanism to achieve efficient feature fusion in a single step. We obtain the lightweight Transformer by adopting a single-layer channel attention (CA) mechanism and simplifying the acquisition of queries (Q) and values (V), directly taking the global (G) and local (L) features as input. Furthermore, we significantly enhance both computational efficiency and model performance through the following two lightweight design strategies:
- (1) Dimensionality reduction and load mitigation: Dimensionality reduction operations are incorporated within the fully connected layers to alleviate the computational burden.
- (2) Local attention optimization: A local attention strategy refines the cross-attention mechanism, effectively capturing fine-grained feature relationships.
Additionally, feed-forward networks (FFNs) are omitted, and the desired fusion effect is achieved through a single cross-attention calculation, yielding a computationally concise and efficient lightweight Transformer. This design not only enhances the collaborative efficiency between global and local branch features but also significantly improves the overall performance of the model.
The computation of cross-fusion can be expressed as
\[
\mathrm{CrossAttention}(F_{g}, F_{l}) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right) V,
\]
where $F_{g}$ and $F_{l}$ represent the features of the global and local branches, respectively; $Q$, $K$, and $V$ denote the linearly transformed query, key, and value matrices; $d_{k}$ is the dimension of the keys; and $\mathrm{softmax}(\cdot)$ is the normalization function.
Figure 2 presents the architecture of the cross-collaborative module, which integrates global and local features. The diagram illustrates how the cross-attention mechanism efficiently integrates diverse feature layers to enhance the model's discriminative power and accuracy. The input features originate from the final four layers of ResNet. The local feature array, represented by the input array M, captures rich local details, emphasizing detailed performance in specific regions. The global feature array, represented by the latent array N, captures features from a global perspective, providing extensive contextual information.

The refined query array undergoes initial encoding and processing to further refine and extract features during the decoding stage. The cross-attention mechanism is the core of the module: it dynamically adjusts the weights and importance of features by computing interactions among the query (Q), key (K), and value (V) arrays, optimizing feature integration based on the correlation between global and local information. Q is the query vector generated dynamically from the global feature array, guiding how relevant information is extracted from the local features; the K and V vectors are generated from the local feature array and weighted against the query vector to enrich the global representation with local detail. The attention scores measure the strength of interaction between features, determining how they are combined into refined outputs.

The final output features represent the comprehensive understanding of the input data after cross-attention processing and are suitable for various subsequent applications. Specifically, the cross-collaborative module first takes the local branch features as K and V and the global branch features as Q; it computes the dot-product attention weights between Q and K and applies them to V to obtain the output. The output features have the same dimensions as the global branch features, retaining the overview of global information while integrating the details of local information. The output of this cross-attention is then fed in again as K and V, with the global branch features once more serving as Q. This repeated process strengthens the interaction between global and local features, enhancing the model's feature expression capability and recognition accuracy.
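As a concrete illustration of the one-step fusion described above, the following is a minimal PyTorch sketch of a lightweight cross-attention block in the spirit of the CCM: a single cross-attention computation without an FFN, with dimensionality-reducing projections, followed by a second pass in which the fused output serves as K and V. All class names, layer sizes, and the reduction ratio are assumptions for illustration, not TDBAN's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightCrossAttention(nn.Module):
    """Single-step cross-attention fusion (illustrative sketch).

    Queries come from one branch, keys/values from the other; no FFN,
    and the projections reduce dimensionality to cut computation.
    """
    def __init__(self, dim, reduced_dim=None):
        super().__init__()
        reduced_dim = reduced_dim or dim // 4  # assumed reduction ratio
        self.q_proj = nn.Linear(dim, reduced_dim)
        self.kv_proj = nn.Linear(dim, 2 * reduced_dim)  # joint K/V projection
        self.out_proj = nn.Linear(reduced_dim, dim)
        self.scale = reduced_dim ** -0.5

    def forward(self, query_feats, context_feats):
        # query_feats: (B, Nq, C), context_feats: (B, Nc, C)
        q = self.q_proj(query_feats)
        k, v = self.kv_proj(context_feats).chunk(2, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out_proj(attn @ v)

class CCMSketch(nn.Module):
    """Two-pass fusion per Section 3.2: global queries attend to local
    features, then the fused result is re-queried by the global features."""
    def __init__(self, dim):
        super().__init__()
        self.pass1 = LightweightCrossAttention(dim)
        self.pass2 = LightweightCrossAttention(dim)

    def forward(self, global_feats, local_feats):
        fused = self.pass1(global_feats, local_feats)  # Q: global, K/V: local
        return self.pass2(global_feats, fused)         # Q: global, K/V: fused

# Example with token sequences flattened from feature maps.
g = torch.rand(2, 256, 128)   # global tokens
l = torch.rand(2, 1024, 128)  # local tokens
out = CCMSketch(128)(g, l)
print(out.shape)  # torch.Size([2, 256, 128])
```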
3.3. Data-Related Learnable Fusion Module
To fully leverage the potential of multi-branch networks, we designed the data-related learnable fusion module (DRLF), which adaptively adjusts weights to differentiate its focus between the global and local information flows. The module emphasizes local features in edge areas and prioritizes global information in ambiguous regions, greatly improving segmentation accuracy through dynamic weight allocation.
Module Architecture:
The data-related learnable fusion module (Figure 3) consists of two parallel learning submodules: the spatial attention generation module and the channel attention generation module. These submodules replicate the spatial and channel attention outputs of the zoom module of Shan et al. [3], obtaining accurate attention maps without extensive computational operations.
Knowledge distillation is a commonly used method for model compression. Unlike pruning and quantization, knowledge distillation constructs a lightweight, small model and trains it under the supervision of a larger, better-performing model, aiming for improved performance and accuracy. It was first proposed and applied to classification tasks by Hinton et al. in 2015 [47]. The large model is called the "teacher" model, and the small model the "student" model; the supervisory information in the teacher model's outputs is referred to as "knowledge", and the student's process of learning from the teacher's supervision is termed "distillation".
The DRLF adopts a knowledge distillation strategy, using the spatial and channel attention masks from the zoom module of MBNet [3] as supervision from a more sophisticated, complex model. The learning of each submodule is guided by this supervision signal, with the generated spatial and channel attention masks serving as training targets. By optimizing these submodules to generate attention patterns that match those of the zoom module in MBNet [3], the DRLF approximates the functionality of the original module at lower computational complexity.
Knowledge distillation typically involves several steps:
Soft label distillation loss:
The output of the teacher network is computed with a softmax at a high temperature $T$, generating soft labels that guide the behavior of the student network. In the standard formulation [47],
\[
\mathcal{L}_{\mathrm{soft}} = -\sum_{i} p_{i}^{T} \log q_{i}^{T}, \qquad p_{i}^{T} = \frac{\exp(z_{i}/T)}{\sum_{j} \exp(z_{j}/T)},
\]
where $p_{i}^{T}$ represents the smoothed probability distribution of the teacher network's output (with logits $z_{i}$), and $q_{i}^{T}$ denotes the corresponding temperature-softened probability distribution of the student network.
Hard target student loss:
Simultaneously, the student network optimizes its performance by learning to predict the correct hard targets (the actual labels):
\[
\mathcal{L}_{\mathrm{hard}} = -\sum_{i} y_{i} \log q_{i},
\]
where $y_{i}$ is the one-hot encoding of the true labels, and $q_{i}$ is the student network's softmax probability output (at temperature $T = 1$).
Combined distillation and student loss:
The final loss function is a linear combination of the distillation loss and the hard target loss, enabling the student network to learn from both soft and hard labels:
\[
\mathcal{L} = \alpha \mathcal{L}_{\mathrm{soft}} + \beta \mathcal{L}_{\mathrm{hard}},
\]
where $\alpha$ and $\beta$ are hyperparameters that balance the contributions of the two losses.
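As a concrete illustration, the following is a minimal PyTorch sketch of this combined loss in its standard classification form [47]; the DRLF applies analogous supervision to attention masks. The temperature, the weights $\alpha$ and $\beta$, and the $T^{2}$ gradient-scaling factor are conventional choices, not values reported for TDBAN.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      T=4.0, alpha=0.7, beta=0.3):
    """Combined soft/hard distillation loss (standard Hinton-style form).

    student_logits, teacher_logits: (B, C) raw logits.
    targets: (B,) integer class labels.
    T, alpha, and beta are illustrative hyperparameter choices.
    """
    # Soft-label term: KL divergence between temperature-softened teacher
    # and student distributions; the T^2 factor keeps gradient magnitudes
    # comparable across temperatures (Hinton et al., 2015).
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)

    # Hard-target term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, targets)

    return alpha * soft + beta * hard

# Example usage with random logits for a 6-class head.
s = torch.randn(8, 6)
t = torch.randn(8, 6)
y = torch.randint(0, 6, (8,))
print(distillation_loss(s, t, y))
```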
Computational benefits:
Our design ensures that the submodules can learn the complex attention maps of the zoom module in MBNet [3] while significantly reducing computational costs. This inherent efficiency makes the module well suited for resource-constrained environments while maintaining the model's focus on essential features at different resolutions. In this way, the DRLF facilitates rapid and efficient information exchange between features of different resolutions, enhancing the overall performance of the model.
3.4. Loss Functions and Training Process
To effectively address feature conflicts and convergence instability in the dual-branch network, we have designed a comprehensive loss function and introduced a specific weighting scheme to stabilize the training process.
Dual-Branch Feature Aggregation and Loss Function Design
In this study, the global and local branches generate high-level feature maps at their respective final layers, denoted as $F_{G}$ and $F_{L}$. These feature maps are then passed through an aggregation layer $f_{agg}$, which concatenates them along the channel dimension and applies a convolution filter, producing the aggregated feature map $F_{agg}$ that serves as the final segmentation output.
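A minimal sketch of such an aggregation layer follows, assuming channel-wise concatenation followed by a 3×3 convolution; the kernel size and class count are assumptions.

```python
import torch
import torch.nn as nn

class BranchAggregation(nn.Module):
    """Concatenate global and local feature maps along channels and fuse
    them with a single convolution (illustrative sketch)."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, num_classes,
                              kernel_size=3, padding=1)

    def forward(self, f_global, f_local):
        # Both inputs: (B, C, H, W) high-level feature maps.
        return self.fuse(torch.cat([f_global, f_local], dim=1))

agg = BranchAggregation(channels=128, num_classes=7)
out = agg(torch.rand(1, 128, 64, 64), torch.rand(1, 128, 64, 64))
print(out.shape)  # torch.Size([1, 7, 64, 64])
```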
To maintain consistency between the outputs of the two branches, we introduce an auxiliary loss function that constrains the outputs of the local and global branches to approach their respective target feature maps. This auxiliary loss function helps stabilize the training process, avoiding conflicts between branches and assisting the network in balancing local and global features.
Weak Coupling Regularization
To further alleviate feature conflicts in the dual-branch network, especially when the local branch overfits local features, we introduce weak coupling regularization. This regularization adds a constraint term to limit the difference between the local and global feature maps. The regularization term effectively controls the learning rate of the local branch, preventing it from dominating the entire training process.
In the experiments, we adjusted the regularization coefficient $\lambda$ (from 0.15 to 0.25, 0.35, 0.45, and 1) to explore the impact of different regularization strengths on the training dynamics. The results (Table 1) show that a reasonable regularization coefficient helps maintain synchronized updates of the local and global branches and significantly improves the stability of the training process.
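The exact form of the constraint term is not specified above; the following is a plausible minimal sketch, assuming an L2 penalty between the local-branch features and the matching (upsampled) region of the global feature map, weighted by $\lambda$. Both the L2 form and the coefficient value are assumptions; the text only states that the term limits the local-global difference.

```python
import torch
import torch.nn.functional as F

def weak_coupling_reg(f_local, f_global_crop, lam=0.15):
    """Hypothetical weak-coupling regularizer: an L2 penalty limiting the
    gap between local-branch features and the corresponding region of the
    global feature map (upsampled to match resolution)."""
    # Align spatial sizes before comparing (global features are coarser).
    f_global_crop = F.interpolate(f_global_crop, size=f_local.shape[-2:],
                                  mode="bilinear", align_corners=False)
    return lam * F.mse_loss(f_local, f_global_crop)

reg = weak_coupling_reg(torch.rand(1, 128, 64, 64), torch.rand(1, 128, 16, 16))
print(reg)
```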
6. Ablation Study
In this ablation study, we analyzed the contributions of the CCM and DRLF to the overall performance of the model. All models included the aggregation layer (Agg) and feature map regularization (Fmreg), which play essential roles in the model's aggregation capability and feature normalization.
6.1. CCM
In this experiment, the CCM introduces a lightweight Transformer mechanism and an effective dimensionality reduction design, maintaining low computational costs while improving model performance (Table 5). Specifically, the CCM achieves 73.0% mIoU on the DeepGlobe dataset, significantly outperforming the other ablation configurations, while maintaining low memory usage (1994 MB) and inference time (94.8 s). With 12 M parameters and only 39 G FLOPs, it substantially reduces computational complexity compared with the traditional Transformer (48 M parameters, 156 G FLOPs) and MBNet (90.5 M parameters, 141 G FLOPs). The comparison shows that using the cross-attention mechanism alone (whether GtoL or LtoG) yields limited improvement (69.1% and 70.0% mIoU, respectively), while the full Transformer offers better performance (71.1% mIoU) at a substantially higher computational cost. The CCM retains the benefits of cross-attention while effectively controlling computational costs through its streamlined design, demonstrating higher computational efficiency and easier deployment.
In summary, the CCM module optimizes the Transformer design and effectively merges cross-domain features, not only improving segmentation performance but also achieving a good balance between model complexity and computational efficiency, enabling efficient operation in resource-constrained environments and offering high practical value.
6.2. DRLF
Table 6 displays the ablation experiment results of the DRLF module on the DeepGlobe dataset, analyzing the effectiveness of the knowledge distillation strategy by comparing different configurations. The DRLF module effectively reduces computational complexity through knowledge distillation, significantly lowering the computational costs while maintaining high performance. DRLF and the Zoom module have the same mIoU (72.6%), but DRLF consumes less memory and computation time (1543 MB and 71 s, respectively), demonstrating that DRLF, with distillation, can significantly enhance computational efficiency while preserving segmentation accuracy. Using only channel attention results in a performance of 71.9%, while using only spatial attention leads to a lower mIoU (50.4%). This indicates that channel attention plays a more significant role in feature modeling and image segmentation, with the DRLF module combining the strengths of both, optimizing the learning process of submodules, and enhancing performance. This ablation study confirms the crucial role of knowledge distillation in enhancing the balance between model efficiency and performance. The DRLF module not only increases segmentation accuracy but also offers a more efficient solution in environments with limited computational resources.
6.3. Comprehensive Analysis
In our study, we replace the GLNet model's traditional bidirectional deep feature map sharing strategy (4-layer feature fusion performed three times) with a Transformer-based fast fusion mechanism, the CCM. This mechanism completes feature fusion in a single step, greatly improving fusion efficiency. Moreover, our Transformer fusion uses a lightweight design, including dimensionality reduction in the fully connected layers and a local attention strategy in the cross-attention mechanism, further reducing computational complexity. While keeping the parameter count at 12 M and FLOPs at 39 G, the model's mIoU increased to 73%, and the processing time dropped by at least 140 s to a final 94.8 s (Table 7).
Subsequently, we introduced the DRLF module with distillation technology, which further optimized the model's performance: the mIoU improved to 73.6% and the processing time fell to 68.9 s (Table 7), while memory usage (1771 MB) remained almost unchanged. This indicates that the DRLF module significantly simplifies the computation process and improves computational speed.
Although this paper mainly focuses on ultra-high-resolution remote sensing image segmentation tasks, the design of the CCM and DRLF modules is highly adaptable and can be applied to a wide range of other computer vision tasks. For example, the CCM module improves small object detection accuracy in object detection by fusing multi-scale features, while in image classification tasks, it enhances the combination of global and local information through the cross-attention mechanism, improving classification performance. The DRLF module optimizes computational efficiency via knowledge distillation, enabling fast and accurate feature fusion in tasks such as semantic segmentation, image super-resolution, and medical image analysis, and is especially well suited for resource-limited environments. Nonetheless, in cross-task applications, the module designs must still be adjusted according to the specific requirements of each task. Future studies may explore ways to further optimize the computational efficiency of these modules and enhance their performance in multi-task learning.
7. Conclusions
This study proposes the TDBAN model, which significantly improves the semantic segmentation accuracy of ultra-high-resolution remote sensing images through an innovative dual-branch architecture and module design, while also demonstrating excellent performance in memory efficiency and processing speed. Experiments on the DeepGlobe and Inria Aerial datasets show that TDBAN achieves mIoU scores of 73.6% and 72.7%, respectively, with only 12 million parameters and a computational complexity (FLOPs) of 39 G, exhibiting low memory consumption and a short processing time.
The success of TDBAN can be attributed to its innovative dual-branch architecture, which cleverly balances the processing of global and local information. The global branch captures broad contextual information through downsampling, providing background support for the image, while the local branch focuses on fine details, improving the segmentation accuracy in finer regions. CCM further optimizes the fusion of global and local features, reducing redundant information and enhancing the coordination of information, thus improving overall performance. The introduction of the DRLF module effectively utilizes global background information while preserving local details, further enhancing model accuracy and optimizing computational overhead.
Additionally, TDBAN significantly improves feature fusion efficiency by adopting a lightweight Transformer-based design and a fast feature fusion mechanism. The lightweight fully connected layers and local attention mechanism not only reduce computational complexity but also enhance the processing speed while maintaining performance. Notably, the introduction of knowledge distillation technology further improves segmentation accuracy while reducing computational time in the DRLF module.
It is worth emphasizing that our method is a general improvement, not an optimization for specific scenes or environments. Its effectiveness has been proven on datasets such as DeepGlobe, and due to its innovative generalizability, it is expected to perform well in various remote sensing application scenarios. The diversity of the DeepGlobe dataset further validates the effectiveness of the method.
Overall, TDBAN achieves a good balance between accuracy and efficiency, making it particularly suitable for resource-constrained environments. Its performance demonstrates a feasible solution that balances accuracy and resource utilization, with broad application prospects in fields such as 3D object detection.