1. Introduction
Under the impetus of globalization and digitization, remote sensing technology has emerged as a cornerstone of geographic information science, offering a fresh perspective on Earth exploration [1]. Ultra-high-resolution remote sensing images, with pixel counts exceeding 4 million, reveal finer texture details and object characteristics than lower-resolution imagery [2,3,4]. In fields such as automated road detection and multi-angle urban classification analysis, these images offer significant advantages, greatly improving decision-making precision [5,6]. However, handling such complex image data poses significant challenges in memory management and computational efficiency, increasing the need for manual intervention and potentially causing delays in critical applications such as emergency response [2]. Although advances in semantic segmentation offer new opportunities for understanding and automating the annotation of ultra-high-resolution images, developing more efficient and accurate segmentation algorithms remains a critical issue in remote sensing image processing.
In recent years, deep learning has made significant advances in remote sensing image semantic segmentation, driving innovations in image analysis and interpretation. Classical methods such as U-Net and its variants [7,8,9], multi-scale context aggregation networks [10,11,12], and multi-level feature fusion networks [13,14,15] have demonstrated strong segmentation performance. Attention mechanisms, which focus on relevant information while ignoring irrelevant details, have also been widely adopted for their ability to enhance key feature identification and boost model efficiency [16,17,18]. Meanwhile, Transformer-based and generative adversarial network (GAN)-based methods have attracted increasing attention in recent years [19,20,21]. For example, integrating Transformer models with U-Net and a guided focal attention mechanism has achieved outstanding segmentation accuracy, further enhancing feature representation ability [22]. These advances highlight a clear trend of combining Transformer-based models with attention mechanisms to address the challenges posed by high-resolution remote sensing imagery. Nevertheless, balancing segmentation quality against computational efficiency and memory usage remains a persistent challenge, particularly as the demand for processing ultra-high-resolution images continues to rise. Models that can reconcile fine-grained segmentation with memory and computational efficiency are therefore a critical focus of current research.
For example, Chen et al. [2] and Shan et al. [3] found that, when segmenting a 6-million-pixel DeepGlobe ultra-high-resolution remote sensing image, high-precision models such as FCN-8s [23] and DANet [24] require 5 GB to 10 GB of GPU memory for inference. Although these models offer high accuracy, they come with high memory consumption. In contrast, fast segmentation methods such as ICNet [25] and BiSeNet [26] significantly reduce memory usage, but at the cost of accuracy drops ranging from 9.6% to 30%. These findings illustrate the difficulty traditional model designs face in balancing accuracy and memory efficiency when handling ultra-high-resolution remote sensing imagery.
To address this challenge, various methods specifically designed for ultra-high-resolution image segmentation have emerged [2,27,28]. These methods jointly model features through dual-branch structures, significantly reducing the dependence on high-performance hardware while maintaining segmentation precision. For example, GLNet [2] uses a global and local dual-branch structure, with the global branch extracting contextual information and the local branch capturing image details; together, these improve segmentation accuracy and optimize memory usage. Although GLNet offers advantages in accuracy and memory usage over traditional methods, its image-block processing strategy increases the computational burden, resulting in slower inference. To further optimize segmentation performance and memory usage, ISDNet [27] captures semantic information at different levels using a dual-branch structure and integrates this information via a relation-aware feature fusion module; experiments on several ultra-high-resolution datasets demonstrate the model's strong generalizability. However, the relation-aware feature fusion module is complex and difficult to interpret, raising concerns about the reliability of its performance in practice. To address this, RSDNet [28] introduced a simplified Res-CBAM module to replace the relation-aware feature fusion module in ISDNet, boosting feature extraction effectiveness and improving segmentation accuracy. Although RSDNet improves memory efficiency and computational load balancing, and excels in large-scale crop scenarios, its inference speed is still slower than ISDNet's, indicating that input processing and the model itself can be optimized further.
Motivated by these challenges, we propose TDBAN, a dual-branch feature enhancement framework for ultra-high-resolution remote sensing image segmentation. TDBAN optimizes the conventional dual-branch structure with a lightweight Transformer architecture and incorporates a cross-sharing module to fuse global and local information efficiently in one step. Additionally, TDBAN incorporates a data-related learnable fusion module (DRLF) that adaptively adjusts the weights of global and local features, significantly reducing computational costs while maintaining high segmentation accuracy. In general, our work makes the following contributions.
- (1) Optimized dual-branch network architecture: Drawing inspiration from the GLNet framework, we optimize the processing of global and local information in ultra-high-resolution remote sensing images to achieve more effective information processing and feature recognition.
- (2) Innovative sharing strategy: An innovative cross-collaborative module (CCM) achieves feature fusion in one step through a Transformer-based cross-attention mechanism and significantly enhances model efficiency through its lightweight design.
- (3) Dynamic weight adjustment: The newly developed DRLF module automatically adjusts the weights of global and local information based on image content, optimizing information fusion to improve segmentation accuracy and processing speed. The module also adopts knowledge distillation, effectively reducing the model's computational complexity.
- (4) Significant performance improvement: These techniques yield significant performance gains on the DeepGlobe and Inria Aerial datasets, validating the network's efficiency.
3. Method
3.1. Overview
The entire network architecture is divided into two main pathways (Figure 1): the global branch and the local branch. The architecture begins with a dataset $\mathcal{D}$ containing $N$ pairs of high-resolution images and their corresponding segmentation maps, each of size $H \times W$. The global branch $G$ processes downsampled low-resolution image and label pairs, denoted as $\mathcal{D}_{g} = \{(X_{i}^{g}, Y_{i}^{g})\}_{i=1}^{N}$, where $X_{i}^{g}$ and $Y_{i}^{g}$ are images and labels of size $h \times w$, significantly smaller than the original $H \times W$ size. The local branch $L$ focuses on processing high-resolution image patches extracted from the original dataset $\mathcal{D}$, denoted as $\mathcal{D}_{l} = \{(X_{ij}^{l}, Y_{ij}^{l})\}$, where each patch is of size $h \times w$, noticeably smaller than the original image size. Each image $X_{i}$ and label $Y_{i}$ is cropped from top to bottom and left to right into $n \times n$ overlapping image blocks, preserving the characteristics of the ultra-high resolution. Both branches use the same backbone network: the downsampled images and cropped patches sequentially pass through ResNet, FPN, CCM, and DRLF, and a branch aggregation layer $f_{agg}$ then aggregates the two sets of high-level feature maps to generate the final segmentation mask. To ensure stability and improve the training of both branches, weakly coupled regularization is also applied to the training of the local branch. The CCM and DRLF are our innovations, introduced in Section 3.2 and Section 3.3, respectively.
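To make this data flow concrete, the following is a minimal PyTorch sketch of the input preparation described above: downsampling for the global branch and overlapping patch cropping for the local branch. The patch size, stride, and function names are illustrative assumptions, not TDBAN's exact settings.

```python
import torch
import torch.nn.functional as F

def prepare_branch_inputs(image, low_res=(500, 500), patch=500, stride=250):
    """Illustrative input preparation for a dual-branch network.

    image: (C, H, W) ultra-high-resolution tensor.
    Returns a downsampled copy for the global branch and a list of
    overlapping patches for the local branch. All sizes are assumptions.
    """
    # Global branch input: bilinear downsampling to a small fixed size.
    global_in = F.interpolate(image.unsqueeze(0), size=low_res,
                              mode="bilinear", align_corners=False)[0]

    # Local branch input: overlapping patches scanned top-to-bottom,
    # left-to-right (stride < patch gives the overlap). Border padding
    # is omitted here for brevity.
    _, H, W = image.shape
    patches = []
    for top in range(0, max(H - patch, 0) + 1, stride):
        for left in range(0, max(W - patch, 0) + 1, stride):
            patches.append(image[:, top:top + patch, left:left + patch])
    return global_in, patches

# Example: a DeepGlobe-sized 2448x2448 image yields one global input
# and a grid of overlapping 500x500 local patches.
img = torch.rand(3, 2448, 2448)
g, locs = prepare_branch_inputs(img)
print(g.shape, len(locs))
```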
3.2. Cross-Collaborative Module
To fully harness the potential of the dual-branch structure, this study introduces an efficient and flexible cross-collaborative module (CCM) that aims to achieve deep collaboration and information exchange between the global and local branch features. The module is based on the Transformer architecture and incorporates a cross-attention mechanism, which effectively handles embedded sequences of two distinct dimensions: the global and local branch features. By fusing these features asymmetrically and assigning them adaptive weights, the module thoroughly exploits their interrelation and complementarity.
Unlike the iterative fusion strategy employed by GLNet [2], which fuses four layers of features three times, TDBAN's cross-collaborative module leverages the Transformer mechanism to achieve efficient feature fusion in a single step. We obtain the lightweight Transformer by adopting a single-layer channel attention (CA) mechanism and simplifying the acquisition of queries (Q) and values (V), directly taking the global (G) and local (L) features as input. Furthermore, we significantly enhance both computational efficiency and model performance through the following two lightweight design strategies:
- (1) Dimensionality reduction and load mitigation: Dimensionality reduction operations are incorporated within the fully connected layers to alleviate the computational burden.
- (2) Local attention optimization: A local attention strategy refines the cross-attention mechanism, effectively capturing fine-grained feature relationships.
Additionally, feed-forward networks (FFNs) are omitted, and the desired fusion effect is achieved through a single cross-attention calculation, yielding a computationally concise and efficient lightweight Transformer. This design not only enhances the collaborative efficiency between global and local branch features but also significantly improves the overall performance of the model.
The computation of cross-fusion can be expressed as
\[
\mathrm{CrossAttention}(F_{g}, F_{l}) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right) V,
\]
where $F_{g}$ and $F_{l}$ represent the features of the global and local branches, respectively; $Q$, $K$, and $V$ denote the linearly transformed query, key, and value matrices; $d_{k}$ is the dimension of the keys; and $\mathrm{softmax}(\cdot)$ is the normalization function.
Figure 2 presents the architecture of the cross-collaborative module, which integrates global and local features. The diagram illustrates how the cross-attention mechanism efficiently integrates diverse feature layers to enhance the model's discriminative power and accuracy. The input features originate from the final four layers of ResNet. The local feature array, represented by the input array M, captures rich local details, emphasizing detailed performance in specific regions. The global feature array, represented by the latent array N, captures features from a global perspective, providing extensive contextual information.

The refined query array undergoes initial encoding and processing to further refine and extract features during the decoding stage. The cross-attention mechanism is the core of the module: it dynamically adjusts the weights and importance of features by computing interactions among the query (Q), key (K), and value (V) arrays, optimizing feature integration based on the correlation between global and local information. Q is the query vector generated dynamically from the global feature array, guiding how relevant information is extracted from the local features; the K and V vectors are generated from the local feature array and weighted against the query vector to enrich the global representation with local detail. The attention scores measure the strength of interaction between features, determining how they are combined into refined outputs.

The final output features represent the comprehensive understanding of the input data after cross-attention processing and are suitable for various subsequent applications. Specifically, the cross-collaborative module first takes the local branch features as K and V and the global branch features as Q; it computes the dot-product attention weights between Q and K and applies them to V to obtain the output. The output features have the same dimensions as the global branch features, retaining the overview of global information while integrating the details of local information. The output of this cross-attention is then fed in again as K and V, with the global branch features once more serving as Q. This repeated process strengthens the interaction between global and local features, enhancing the model's feature expression capability and recognition accuracy.
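As a concrete illustration of the one-step fusion described above, the following is a minimal PyTorch sketch of a lightweight cross-attention block in the spirit of the CCM: a single cross-attention computation without an FFN, with dimensionality-reducing projections, followed by a second pass in which the fused output serves as K and V. All class names, layer sizes, and the reduction ratio are assumptions for illustration, not TDBAN's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightCrossAttention(nn.Module):
    """Single-step cross-attention fusion (illustrative sketch).

    Queries come from one branch, keys/values from the other; no FFN,
    and the projections reduce dimensionality to cut computation.
    """
    def __init__(self, dim, reduced_dim=None):
        super().__init__()
        reduced_dim = reduced_dim or dim // 4  # assumed reduction ratio
        self.q_proj = nn.Linear(dim, reduced_dim)
        self.kv_proj = nn.Linear(dim, 2 * reduced_dim)  # joint K/V projection
        self.out_proj = nn.Linear(reduced_dim, dim)
        self.scale = reduced_dim ** -0.5

    def forward(self, query_feats, context_feats):
        # query_feats: (B, Nq, C), context_feats: (B, Nc, C)
        q = self.q_proj(query_feats)
        k, v = self.kv_proj(context_feats).chunk(2, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out_proj(attn @ v)

class CCMSketch(nn.Module):
    """Two-pass fusion per Section 3.2: global queries attend to local
    features, then the fused result is re-queried by the global features."""
    def __init__(self, dim):
        super().__init__()
        self.pass1 = LightweightCrossAttention(dim)
        self.pass2 = LightweightCrossAttention(dim)

    def forward(self, global_feats, local_feats):
        fused = self.pass1(global_feats, local_feats)  # Q: global, K/V: local
        return self.pass2(global_feats, fused)         # Q: global, K/V: fused

# Example with token sequences flattened from feature maps.
g = torch.rand(2, 256, 128)   # global tokens
l = torch.rand(2, 1024, 128)  # local tokens
out = CCMSketch(128)(g, l)
print(out.shape)  # torch.Size([2, 256, 128])
```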
3.3. Data-Related Learnable Fusion Module
To fully leverage the potential of multi-branch networks, we designed the data-related learnable fusion module (DRLF), which adaptively adjusts weights to differentiate its focus between the global and local information flows. The module emphasizes local features in edge areas and prioritizes global information in ambiguous regions, greatly improving segmentation accuracy through dynamic weight allocation.
Module Architecture:
The data-related learnable fusion module (Figure 3) consists of two parallel learning submodules: the spatial attention generation module and the channel attention generation module. These submodules replicate the spatial and channel attention outputs of the zoom module of Shan et al. [3], obtaining accurate attention maps without extensive computational operations.
Knowledge distillation is a commonly used method for model compression. Unlike pruning and quantization, knowledge distillation constructs a lightweight, small model and trains it under the supervision of a larger, better-performing model, aiming for improved performance and accuracy. It was first proposed and applied to classification tasks by Hinton et al. in 2015 [47]. The large model is called the "teacher" model, and the small model the "student" model; the supervisory information in the teacher model's outputs is referred to as "knowledge", and the student's process of learning from the teacher's supervision is termed "distillation".
The DRLF adopts a knowledge distillation strategy, using the spatial and channel attention masks from the zoom module of MBNet [3] as supervision from a more sophisticated, complex model. The learning of each submodule is guided by this supervision signal, with the generated spatial and channel attention masks serving as training targets. By optimizing these submodules to generate attention patterns that match those of the zoom module in MBNet [3], the DRLF approximates the functionality of the original module at lower computational complexity.
Knowledge distillation typically involves several steps:
Soft label distillation loss:
The output of the teacher network is computed with a softmax at a high temperature $T$, generating soft labels that guide the behavior of the student network. In the standard formulation [47],
\[
\mathcal{L}_{\mathrm{soft}} = -\sum_{i} p_{i}^{T} \log q_{i}^{T}, \qquad p_{i}^{T} = \frac{\exp(z_{i}/T)}{\sum_{j} \exp(z_{j}/T)},
\]
where $p_{i}^{T}$ represents the smoothed probability distribution of the teacher network's output (with logits $z_{i}$), and $q_{i}^{T}$ denotes the corresponding temperature-softened probability distribution of the student network.
Hard target student loss:
Simultaneously, the student network optimizes its performance by learning to predict the correct hard targets (the actual labels):
\[
\mathcal{L}_{\mathrm{hard}} = -\sum_{i} y_{i} \log q_{i},
\]
where $y_{i}$ is the one-hot encoding of the true labels, and $q_{i}$ is the student network's softmax probability output (at temperature $T = 1$).
Combined distillation and student loss:
The final loss function is a linear combination of the distillation loss and the hard target loss, enabling the student network to learn from both soft and hard labels:
\[
\mathcal{L} = \alpha \mathcal{L}_{\mathrm{soft}} + \beta \mathcal{L}_{\mathrm{hard}},
\]
where $\alpha$ and $\beta$ are hyperparameters that balance the contributions of the two losses.
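As a concrete illustration, the following is a minimal PyTorch sketch of this combined loss in its standard classification form [47]; the DRLF applies analogous supervision to attention masks. The temperature, the weights $\alpha$ and $\beta$, and the $T^{2}$ gradient-scaling factor are conventional choices, not values reported for TDBAN.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      T=4.0, alpha=0.7, beta=0.3):
    """Combined soft/hard distillation loss (standard Hinton-style form).

    student_logits, teacher_logits: (B, C) raw logits.
    targets: (B,) integer class labels.
    T, alpha, and beta are illustrative hyperparameter choices.
    """
    # Soft-label term: KL divergence between temperature-softened teacher
    # and student distributions; the T^2 factor keeps gradient magnitudes
    # comparable across temperatures (Hinton et al., 2015).
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)

    # Hard-target term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, targets)

    return alpha * soft + beta * hard

# Example usage with random logits for a 6-class head.
s = torch.randn(8, 6)
t = torch.randn(8, 6)
y = torch.randint(0, 6, (8,))
print(distillation_loss(s, t, y))
```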
Computational benefits:
Our design ensures that the submodules can learn the complex attention maps of the zoom module in MBNet [3] while significantly reducing computational costs. This inherent efficiency makes the module well suited for resource-constrained environments while maintaining the model's focus on essential features at different resolutions. In this way, the DRLF facilitates rapid and efficient information exchange between features of different resolutions, enhancing the overall performance of the model.
3.4. Loss Functions and Training Process
To effectively address feature conflicts and convergence instability in the dual-branch network, we have designed a comprehensive loss function and introduced a specific weighting scheme to stabilize the training process.
Dual-Branch Feature Aggregation and Loss Function Design
In this study, the global and local branches generate high-level feature maps at their respective final layers, denoted as $F_{G}$ and $F_{L}$. These feature maps are then passed through an aggregation layer $f_{agg}$, which concatenates them along the channel dimension and applies a convolution filter, producing the aggregated feature map $F_{agg}$ that serves as the final segmentation output.
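A minimal sketch of such an aggregation layer follows, assuming channel-wise concatenation followed by a 3×3 convolution; the kernel size and class count are assumptions.

```python
import torch
import torch.nn as nn

class BranchAggregation(nn.Module):
    """Concatenate global and local feature maps along channels and fuse
    them with a single convolution (illustrative sketch)."""
    def __init__(self, channels, num_classes):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, num_classes,
                              kernel_size=3, padding=1)

    def forward(self, f_global, f_local):
        # Both inputs: (B, C, H, W) high-level feature maps.
        return self.fuse(torch.cat([f_global, f_local], dim=1))

agg = BranchAggregation(channels=128, num_classes=7)
out = agg(torch.rand(1, 128, 64, 64), torch.rand(1, 128, 64, 64))
print(out.shape)  # torch.Size([1, 7, 64, 64])
```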
To maintain consistency between the outputs of the two branches, we introduce an auxiliary loss function that constrains the outputs of the local and global branches to approach their respective target feature maps. This auxiliary loss function helps stabilize the training process, avoiding conflicts between branches and assisting the network in balancing local and global features.
Weak Coupling Regularization
To further alleviate feature conflicts in the dual-branch network, especially when the local branch overfits local features, we introduce weak coupling regularization. This regularization adds a constraint term to limit the difference between the local and global feature maps. The regularization term effectively controls the learning rate of the local branch, preventing it from dominating the entire training process.
In the experiments, we adjusted the regularization coefficient $\lambda$ (from 0.15 to 0.25, 0.35, 0.45, and 1) to explore the impact of different regularization strengths on the training dynamics. The results (Table 1) show that a reasonable regularization coefficient helps maintain synchronized updates of the local and global branches and significantly improves the stability of the training process.
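The exact form of the constraint term is not specified above; the following is a plausible minimal sketch, assuming an L2 penalty between the local-branch features and the matching (upsampled) region of the global feature map, weighted by $\lambda$. Both the L2 form and the coefficient value are assumptions; the text only states that the term limits the local-global difference.

```python
import torch
import torch.nn.functional as F

def weak_coupling_reg(f_local, f_global_crop, lam=0.15):
    """Hypothetical weak-coupling regularizer: an L2 penalty limiting the
    gap between local-branch features and the corresponding region of the
    global feature map (upsampled to match resolution)."""
    # Align spatial sizes before comparing (global features are coarser).
    f_global_crop = F.interpolate(f_global_crop, size=f_local.shape[-2:],
                                  mode="bilinear", align_corners=False)
    return lam * F.mse_loss(f_local, f_global_crop)

reg = weak_coupling_reg(torch.rand(1, 128, 64, 64), torch.rand(1, 128, 16, 16))
print(reg)
```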
6. Ablation Study
In this ablation study, we analyzed the contributions of the CCM and DRLF to the overall performance of the model. All models included the aggregation layer (Agg) and feature map regularization (Fmreg), which play essential roles in the model's aggregation capability and feature normalization.
6.1. CCM
In this experiment, the CCM introduces a lightweight Transformer mechanism and an effective dimensionality reduction design, maintaining low computational costs while improving model performance (Table 5). Specifically, the CCM achieves 73.0% mIoU on the DeepGlobe dataset, significantly outperforming the other ablation configurations, while maintaining low memory usage (1994 MB) and inference time (94.8 s). With 12 M parameters and only 39 G FLOPs, it substantially reduces computational complexity compared with the traditional Transformer (48 M parameters, 156 G FLOPs) and MBNet (90.5 M parameters, 141 G FLOPs). The comparison shows that using the cross-attention mechanism alone (whether GtoL or LtoG) yields limited improvement (69.1% and 70.0% mIoU, respectively), while the full Transformer offers better performance (71.1% mIoU) at a substantially higher computational cost. The CCM retains the benefits of cross-attention while effectively controlling computational costs through its streamlined design, demonstrating higher computational efficiency and easier deployment.
In summary, the CCM module optimizes the Transformer design and effectively merges cross-domain features, not only improving segmentation performance but also achieving a good balance between model complexity and computational efficiency, enabling efficient operation in resource-constrained environments and offering high practical value.
6.2. DRLF
Table 6 displays the ablation experiment results of the DRLF module on the DeepGlobe dataset, analyzing the effectiveness of the knowledge distillation strategy by comparing different configurations. The DRLF module effectively reduces computational complexity through knowledge distillation, significantly lowering the computational costs while maintaining high performance. DRLF and the Zoom module have the same mIoU (72.6%), but DRLF consumes less memory and computation time (1543 MB and 71 s, respectively), demonstrating that DRLF, with distillation, can significantly enhance computational efficiency while preserving segmentation accuracy. Using only channel attention results in a performance of 71.9%, while using only spatial attention leads to a lower mIoU (50.4%). This indicates that channel attention plays a more significant role in feature modeling and image segmentation, with the DRLF module combining the strengths of both, optimizing the learning process of submodules, and enhancing performance. This ablation study confirms the crucial role of knowledge distillation in enhancing the balance between model efficiency and performance. The DRLF module not only increases segmentation accuracy but also offers a more efficient solution in environments with limited computational resources.
6.3. Comprehensive Analysis
In our study, we replace the GLNet model's traditional bidirectional deep feature map sharing strategy (4-layer feature fusion performed three times) with a Transformer-based fast fusion mechanism, the CCM. This mechanism completes feature fusion in a single step, greatly improving fusion efficiency. Moreover, our Transformer fusion uses a lightweight design, including dimensionality reduction in the fully connected layers and a local attention strategy in the cross-attention mechanism, further reducing computational complexity. While keeping the parameter count at 12 M and FLOPs at 39 G, the model's mIoU increased to 73%, and the processing time dropped by at least 140 s to a final 94.8 s (Table 7).
Subsequently, we introduced the DRLF module with distillation technology, which further optimized the model's performance: the mIoU improved to 73.6% and the processing time fell to 68.9 s (Table 7), while memory usage (1771 MB) remained almost unchanged. This indicates that the DRLF module significantly simplifies the computation process and improves computational speed.
Although this paper mainly focuses on ultra-high-resolution remote sensing image segmentation tasks, the design of the CCM and DRLF modules is highly adaptable and can be applied to a wide range of other computer vision tasks. For example, the CCM module improves small object detection accuracy in object detection by fusing multi-scale features, while in image classification tasks, it enhances the combination of global and local information through the cross-attention mechanism, improving classification performance. The DRLF module optimizes computational efficiency via knowledge distillation, enabling fast and accurate feature fusion in tasks such as semantic segmentation, image super-resolution, and medical image analysis, and is especially well suited for resource-limited environments. Nonetheless, in cross-task applications, the module designs must still be adjusted according to the specific requirements of each task. Future studies may explore ways to further optimize the computational efficiency of these modules and enhance their performance in multi-task learning.
7. Conclusions
This study proposes the TDBAN model, which significantly improves the semantic segmentation accuracy of ultra-high-resolution remote sensing images through an innovative dual-branch architecture and module design, while also demonstrating excellent performance in memory efficiency and processing speed. Experiments on the DeepGlobe and Inria Aerial datasets show that TDBAN achieves mIoU scores of 73.6% and 72.7%, respectively, with only 12 million parameters and a computational complexity (FLOPs) of 39 G, exhibiting low memory consumption and a short processing time.
The success of TDBAN can be attributed to its innovative dual-branch architecture, which cleverly balances the processing of global and local information. The global branch captures broad contextual information through downsampling, providing background support for the image, while the local branch focuses on fine details, improving the segmentation accuracy in finer regions. CCM further optimizes the fusion of global and local features, reducing redundant information and enhancing the coordination of information, thus improving overall performance. The introduction of the DRLF module effectively utilizes global background information while preserving local details, further enhancing model accuracy and optimizing computational overhead.
Additionally, TDBAN significantly improves feature fusion efficiency by adopting a lightweight Transformer-based design and a fast feature fusion mechanism. The lightweight fully connected layers and local attention mechanism not only reduce computational complexity but also enhance the processing speed while maintaining performance. Notably, the introduction of knowledge distillation technology further improves segmentation accuracy while reducing computational time in the DRLF module.
It is worth emphasizing that our method is a general improvement, not an optimization for specific scenes or environments. Its effectiveness has been proven on datasets such as DeepGlobe, and due to its innovative generalizability, it is expected to perform well in various remote sensing application scenarios. The diversity of the DeepGlobe dataset further validates the effectiveness of the method.
Overall, TDBAN achieves a good balance between accuracy and efficiency, making it particularly suitable for resource-constrained environments. Its performance demonstrates a feasible solution that balances accuracy and resource utilization, with broad application prospects in fields such as 3D object detection.