Article

Fast Semantic Segmentation of Ultra-High-Resolution Remote Sensing Images via Score Map and Fast Transformer-Based Fusion

1 Department of Landscape Architecture, School of Architecture, Tsinghua University, Beijing 100084, China
2 Urban-Rural Ecological Landscape Construction & Research Institute, China Urban Construction Design & Research Institute, Beijing 100120, China
3 Department of Horticulture, Life Science and Technology College, Dalian University, Dalian 116622, China
4 Department of Environmental Art Design, School of Art and Design, Beijing Forestry University, Beijing 100083, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(17), 3248; https://doi.org/10.3390/rs16173248
Submission received: 16 July 2024 / Revised: 28 August 2024 / Accepted: 30 August 2024 / Published: 2 September 2024
(This article belongs to the Special Issue Deep Learning for Satellite Image Segmentation)

Abstract

For ultra-high-resolution (UHR) image semantic segmentation, striking a balance between computational efficiency and storage space is a crucial research direction. This paper proposes an Efficient Feature Fusion Network (EFFNet) to improve UHR image semantic segmentation performance. EFFNet designs a score map that can be embedded into the network for training, enabling the selection of the most valuable features to reduce storage consumption, accelerate inference, and enhance accuracy. In the fusion stage, we improve upon previous redundant multistep feature fusion methods by utilizing a transformer structure for one-time fusion. Additionally, combining the transformer structure with a multibranch structure allows it to be employed for feature fusion, significantly improving accuracy while keeping the computational cost within an acceptable range. We evaluated EFFNet on the ISPRS two-dimensional semantic labeling Vaihingen and Potsdam datasets, demonstrating that its architecture offers an exceptionally effective solution with outstanding semantic segmentation precision and optimized inference speed. EFFNet substantially enhances critical performance metrics such as Intersection over Union (IoU), overall accuracy, and F1-score, highlighting its superiority as an architectural innovation in ultra-high-resolution remote sensing image semantic segmentation.

1. Introduction

Semantic segmentation in remote sensing imagery is used in urban planning [1], agriculture [2], disaster management [3], environmental monitoring [4], transportation [5], land use change detection [6], forestry management [7], and climate studies [8]. It classifies land cover types, aiding in development, crop monitoring, disaster response, conservation, infrastructure management, and climate research, enhancing decision making and resource management. Ultra-high-resolution image semantic segmentation is a crucial domain within computer vision and remote sensing information processing. Presently, images with a minimum resolution of 2048 × 1080 pixels (∼2.2 M), referred to as 2K high-resolution media, represent a significant standard in terms of image quality [9]. Furthermore, images with at least 3840 × 1080 pixels (∼4.1 M) are considered to meet the basic threshold for 4K resolution [10], while 4K ultra-high-definition media typically possesses a resolution starting from 3840 × 2160 pixels (∼8.3 M) [11]. However, segmenting ultra-high-resolution images presents unique challenges and prospects due to their immense data volume and inherent complexity [12]. Advanced methodologies and algorithms are required to effectively parse, interpret, and utilize the abundant information encapsulated within ultra-high-resolution imagery. Deep learning methodologies in computer vision have significantly advanced ultra-high-resolution image semantic segmentation through innovative neural network architectures specifically designed for this type of imagery.
Ultra-high-resolution image semantic segmentation faces several challenges, including managing large data volumes and computational complexity, which demand efficient algorithms and hardware optimization. Downsampling these images can lead to a loss of critical details and contextual information, reducing accuracy [13]. The complexity of backgrounds and interclass confusion further complicates the process, necessitating advanced methods to maintain class-specific features and reduce variability. Efficient GPU memory utilization and the integration of multiscale information are also critical, requiring innovative network designs and multistep fusion techniques to achieve high precision and semantic consistency [14]. It has been observed that the accuracy of semantic segmentation tasks in minority classes can be as low as 50–60% due to this class imbalance [15]. Moreover, the memory consumption for processing a single 4K image can exceed 16 GB, thereby imposing limitations on the batch size and prolonging the training time [16]. To address these challenges, certain networks employ fractional boundary filters during semantic segmentation [17], simplifying convolution operations through low-rank approximation methods to reduce computational demand. Additionally, some networks incorporate inputs of varying-sized low-resolution images into the leading network to acquire a coarse semantic segmentation map fused with high-resolution feature maps via feature fusion units, resulting in reduced computation and enhanced speed. Ultra-high-resolution remote sensing images necessitate high-precision accuracy, which poses a challenge for real-time networks. Collaborative global–local networks (GLNet) have been introduced as an effective solution due to their commendable real-time performance [18]. The GLNet architecture incorporates downsampled global and full-resolution cropped images as inputs, effectively preserving both detailed information and global contextual understanding while minimizing GPU memory usage. It leverages ResNet [19] and the feature pyramid network (FPN) [20] as its backbones, thereby enhancing memory efficiency without compromising ultra-high-resolution image semantic segmentation accuracy. However, the side output from the global branch backbone has a relatively low feature map resolution, limiting its representative capacity. Additionally, the learning process in the local branch encounters confusion issues. Addressing these challenges requires a multifaceted approach that involves developing more efficient neural architectures to improve semantic segmentation accuracy for complex scenes and strategically managing computational resources.
This study aims to enhance the accuracy and speed of image semantic segmentation by leveraging the structural characteristics of existing networks designed for ultra-high-resolution images. Our objective is to achieve this goal through the design of attention mechanisms and modules that improve semantic segmentation accuracy and network inference speed. The proposed network is referred to as the Efficient Feature Fusion Network (EFFNet). During both training and prediction, the network integrates features from global and local information sources, effectively preserving global contextual information while capturing detailed local information to achieve improved accuracy without compromising inference speed. Moreover, the architectural design of EFFNet optimizes the use of computational resources, enabling the model to deliver high semantic segmentation accuracy at a high inference speed. Networks with similar architectures include Ada-MBA [21], RS3Mamba [22], and MBATrans [23]. Ada-MBA uses a multilevel multimodal fusion transformer, focusing on integrating features from different modalities and scales to enhance semantic segmentation accuracy in remote sensing. RS3Mamba employs a dual-branch structure, combining a convolution-based main branch with a Visual State Space (VSS) auxiliary branch and using a Collaborative Completion Module (CCM) for cross-branch feature fusion. MBATrans is a transformer-based network that utilizes cross-attention and self-attention mechanisms to align high-level feature maps across domains and is designed for unsupervised domain adaptation in very high-resolution remote sensing images. Compared with these networks, EFFNet also adopts a dual-branch structure but innovates with its score map module and fast fusion mechanism, which efficiently select and integrate global and local features. While RS3Mamba’s CCM handles feature fusion across branches and MBATrans’s complex attention mechanisms are tailored for cross-domain tasks, EFFNet prioritizes computational efficiency and precise feature selection, particularly in single-domain high-resolution imagery. The strength of EFFNet lies in its efficient feature fusion and attention-driven feature selection, enabling it to excel in both accuracy and speed when processing ultra-high-resolution images and making it particularly competitive for tasks that require high performance and computational efficiency. We have implemented a meticulously designed evaluation system that assesses multiple features within local patches, generating detailed feature score maps. These score maps guide the selection of optimal local feature maps for fusion with global features, reducing fusion operations while maintaining high-quality features and thus increasing efficiency. In the feature fusion stage, we introduce a deep fusion mechanism based on multihead attention, referred to as the fast fusion mechanism, which significantly enhances the efficiency of multiscale feature fusion. By improving both semantic segmentation accuracy and inference speed, EFFNet makes a valuable contribution to this field. This study provides the following contributions:
  • A novel approach for improving the accuracy and speed of semantic segmentation of ultra-high-resolution images, called the Efficient Feature Fusion Network (EFFNet), is proposed.
  • A score map module is proposed to reduce fusion operations and increase efficiency. The score map module is based on a dimension reduction convolutional attention mechanism. This mechanism calculates the global feature vector through global average pooling and learns the relationship between channels using a one-dimensional convolution operation.
  • A fast fusion mechanism is introduced to improve the efficiency of multiscale feature fusion. The fast fusion mechanism promotes seamless integration between global and local branches, achieving extensive collaboration by using multiple attention weights to fuse feature maps at each layer.
  • The experimental results show that EFFNet outperforms other state-of-the-art network architectures on challenging datasets such as Vaihingen and Potsdam, with improvements in both efficiency and accuracy.

2. Related Work

With the integration of deep learning techniques, remarkable advancements have been witnessed in the semantic segmentation of remote sensing images, thereby enhancing the precision and efficiency of these systems. The three pivotal dimensions driving these advancements are categorized into multiscale feature representation, context aggregation, and the incorporation of attention mechanisms. Currently, network architectures have also evolved to specifically cater to high-resolution imagery characteristics. This evolution in network architectures serves as a testament to the dynamic nature of our field and our constant pursuit of more efficient and accurate solutions.

2.1. Semantic Segmentation of Remote Sensing Images

The emergence of semantic segmentation in remote sensing can be traced back to pioneering models such as fully convolutional networks (FCN) [24]. FCN marked a significant milestone by enabling the end-to-end training of semantic segmentation models directly on image data, producing pixel-wise predictions without the need for preprocessing steps such as region proposals. Several FCN derivatives followed: Kampffmeyer [25] enhanced the FCN model for segmenting small objects in remote sensing images, and Liu employed graph-based segmentation methods such as Selective Search [26] for data augmentation and utilized conditional random fields to refine the boundaries of the segmented results.
Subsequent models, such as SegNet and U-Net, further refined these ideas. SegNet’s encoder–decoder architecture was particularly effective in preserving spatial information, and studies have demonstrated that SegNet performs exceptionally well in medium-resolution satellite image semantic segmentation, especially when incorporating remote sensing indices derived from the NIR and SWIR bands [27]. U-Net, initially designed for biomedical image semantic segmentation, was adapted to remote sensing, demonstrating its robustness in capturing fine details and segmenting complex scenes. Several studies have focused on enhancing the original U-Net architecture to improve its performance in specific remote sensing tasks. For example, E-Unet++ retains the nested, dense skip-connection structure of UNet++, which captures multiscale information effectively and improves accuracy by refining details across different scales [28]. RSUnet is designed as a full-scale U-Net model with a focus on adaptive feature selection and multiscale feature processing [29]. The adoption of more advanced architectures has played a pivotal role in addressing the complexities associated with remote sensing imagery.

2.2. Multiscale, Context Aggregation, and Attention Mechanism

Multiscale approaches [30,31,32] leverage the hierarchical nature of deep neural networks, extracting both low-level details and high-level semantic information, and are pivotal in capturing and integrating features across various scales. For instance, HRNet maintains high-resolution representations across the network, ensuring that fine details are preserved and accurately represented [33,34]. MEC-Net takes this further by integrating edge features into multiscale features, significantly enhancing structure-delineation precision in remote sensing images, notably in complex urban landscapes [35]. Liu [36] combined U-Net and PSPNet, incorporating pyramid pooling modules to enhance accuracy. Nong [37] proposed a boundary-aware dual-stream network based on U-Net, incorporating an auxiliary edge detection stream to explicitly supervise object boundaries and improve boundary segmentation results; in addition, spatial and channel attention modules replaced the skip connections of U-Net to capture local details.
The semantic segmentation accuracy and computational efficiency have been significantly improved by innovative models in the field of context aggregation. Notable advancements include LDCANet’s lightweight dual-range context aggregation, which effectively balances accuracy and computational efficiency [38]. BCANet incorporates a Multi-Scale Boundary extractor and a Boundary Context Aggregation module, skillfully capturing long-range dependencies [39]. Additionally, HCANet utilizes Compact Atrous Spatial Pyramid Pooling modules to extract multiscale context information [40]. Furthermore, EGCAN introduces an edge-guided context aggregation branch and a minority category extraction branch that exemplify the innovation in this domain [41].
Advanced techniques such as attention mechanisms have revolutionized the field of semantic segmentation by enabling models to selectively focus on salient features while suppressing irrelevant ones. The Global Multi-Attention UResNeXt (GMAUResNeXt) model is a prime example that incorporates a global attention gate (GAG) module, leveraging the interdependence between context and multiscale features to enhance semantic segmentation results. This framework has established a new benchmark in the field [42]. MSCSA-Net is another model that has significantly enhanced its ability to discriminate multiscale objects. By employing local channel spatial and multiscale attention, MSCSA-Net effectively improves feature representation and object boundary discrimination through attention mechanisms [43].

2.3. Network for High-Resolution Images

Significant advancements have been made in the semantic segmentation of high-resolution images, particularly within the context of remote sensing, through the integration of deep learning models. Starting with traditional models, Guo [44] substituted standard convolutions with dilated convolutions in the FCN architecture to achieve high-resolution image semantic segmentation. Liu [45] integrated a spatial residual inception module into FCN to capture multiscale contextual features for extracting buildings in high-resolution remote sensing imagery. Additionally, combining SegNet with other architectures like U-Net has led to better semantic segmentation accuracy, particularly in building extraction tasks from high-resolution aerial images [46].
Recently, more sophisticated architectures have been developed to address the unique challenges posed by high-resolution remote sensing data. Qiao presents a weakly supervised approach for extracting damaged buildings from high-resolution remote sensing images after earthquakes. This approach incorporates a multiscale dependence (MSD) module and a spatial correlation refinement (SCR) module to enhance localization accuracy and suppress noise, resulting in improved overall model performance [47]. The DNAS framework addresses a key challenge in high-resolution image processing by proposing a hierarchical search space and employing a decoupling search optimization strategy to reduce memory usage [48]. Both MACANet [49] and CEN [50] have demonstrated exceptional performance in extracting multiscale information and segmenting small-scale objects in high-resolution images. Networks such as MHLDet [51] have significantly enhanced semantic segmentation accuracy and multiscale object detection by integrating attention mechanisms and multiscale feature fusion strategies.
The semantic segmentation of high-resolution images has witnessed significant advancements, yet this field still encounters certain limitations. One prominent challenge pertains to the computational power and memory demands, particularly when dealing with large-scale datasets or real-time applications. Moreover, while multiscale feature representation, context aggregation, and attention mechanisms have improved segmentation accuracy, they often struggle with efficiency when applied to ultra-high-resolution imagery. The complexity and data volume of such images require further progress in developing models that can balance accuracy, robustness, and computational efficiency for consistent performance across diverse datasets.

3. Method

3.1. Overview

Our ultra-high-resolution remote sensing image semantic segmentation system is called EFFNet (Figure 1). The network’s primary structures comprise a global branch and a local branch. We select N original images and their corresponding semantic segmentation maps from the high-resolution image dataset D as input to the network:
$D = \{(I_i, S_i)\}_{i=1}^{N},$
where $I_i$ and $S_i$ represent the i-th original image and its semantic segmentation map, respectively, with $I_i, S_i \in \mathbb{R}^{H \times W}$, where $H \times W$ is the image size. In addition, the global branch takes the downsampled low-resolution image dataset $D^G$ as input:
$D^G = \{(I_i^G, S_i^G)\}_{i=1}^{N},$
where $I_i^G$ and $S_i^G$ represent the downsampled low-resolution image and semantic segmentation map corresponding to the i-th image in D, respectively. The local branch performs cropping on D to obtain a set of cropped images $D^L$ at full resolution:
$D^L = \{\{(I_{ij}^L, S_{ij}^L)\}_{j=1}^{n_i}\}_{i=1}^{N},$
where each image in D is cropped into $n_i$ patches, and $I_{ij}^L$ and $S_{ij}^L$ represent the j-th cropped subimage and its corresponding semantic segmentation map from the i-th image, respectively. The cropping of $I_i$ and $S_i$ is not random but systematically performed to enhance consistency for training and testing purposes. Here, $I_i^G, S_i^G \in \mathbb{R}^{h_1 \times w_1}$ indicates that the input image size for the global branch is $h_1 \times w_1$, and $I_{ij}^L, S_{ij}^L \in \mathbb{R}^{h_2 \times w_2}$ indicates that the input image size for the local branch is $h_2 \times w_2$, with $h_1, h_2 \le H$ and $w_1, w_2 \le W$. The EFFNet design allows it to effectively combine features from both branches, capturing more details from high-resolution local features and richer contextual information from downsampled global features. These characteristics make EFFNet capable of achieving higher accuracy without compromising inference speed.
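For concreteness, the sketch below (not the authors’ released code) illustrates how the two inputs could be prepared in PyTorch: each full-resolution image is bilinearly downsampled for the global branch and systematically tiled into full-resolution patches for the local branch. The downsample and patch sizes are illustrative assumptions.

```python
# A minimal sketch of the input preparation described above (illustrative only).
import torch
import torch.nn.functional as F

def prepare_inputs(image, label, global_size=(512, 512), patch_size=(512, 512)):
    """image: (C, H, W) tensor; label: (H, W) tensor of class indices."""
    # Global branch: bilinear downsampling to (h1, w1); nearest for the label.
    img_g = F.interpolate(image.unsqueeze(0), size=global_size,
                          mode="bilinear", align_corners=False).squeeze(0)
    lbl_g = F.interpolate(label[None, None].float(), size=global_size,
                          mode="nearest").long().squeeze()

    # Local branch: systematic (non-random) tiling into full-resolution patches.
    ph, pw = patch_size
    patches = []
    _, H, W = image.shape
    for top in range(0, H - ph + 1, ph):
        for left in range(0, W - pw + 1, pw):
            patches.append((image[:, top:top + ph, left:left + pw],
                            label[top:top + ph, left:left + pw]))
    return (img_g, lbl_g), patches
```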
One key innovation of EFFNet lies in its local branch, where a score map module assesses the importance of the numerous feature maps in local patches and selects the local features with the highest fusion value (Section 3.2). These high-value local features, containing fine-grained local details, are then fused with the global features through the fast fusion mechanism (Section 3.3). This approach to feature fusion is a significant factor in EFFNet’s ability to achieve higher accuracy without compromising inference speed.

3.2. Score Map Module

In previous network architectures, local patches were typically encoded to form features. However, this approach does not explicitly model the relationships among the many local features. Consequently, some features contributed little to specific tasks during the feature fusion stage while other channels were more critical, decreasing semantic segmentation accuracy and inference speed.
The score map module is a convolution attention mechanism based on dimensional reduction, which aims to enhance the efficiency of information propagation between channels within the local branch (Figure 2). This mechanism utilizes global average pooling to compute the global feature vector, thereby representing the contextual information of the entire image by averaging feature maps across each channel.
The score map module operates as an innovative convolutional attention mechanism designed to efficiently evaluate and enhance the most informative local features for subsequent fusion with global features. It begins by processing the input feature map, generated from the local branch, through two 3 × 3 convolutional layers in ResNet. This feature map, with dimensions H × W × C, undergoes global average pooling across each channel to produce a global feature vector, representing the contextual information of the entire image. A one-dimensional convolution is then applied to this global vector to learn the inter-channel relationships, identifying which channels are most critical for the task at hand. The module generates a score map by applying a Sigmoid activation function, which assigns importance scores to different spatial locations within the feature map. These scores are then used to weight the local features, producing a refined feature map that emphasizes key details while suppressing irrelevant information. The output of the score map module is a weighted local feature map that carries the most valuable information, ready for integration in the fast fusion module. This approach ensures that the network focuses on the most relevant features, improving segmentation accuracy, particularly in complex, high-resolution imagery.
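A minimal PyTorch sketch of this channel-weighting reading of the score map module is given below; the 1-D kernel size and the exact placement of the Sigmoid are assumptions, since only the overall pipeline (global average pooling, 1-D convolution over channels, Sigmoid scoring, feature reweighting) is specified above.

```python
import torch
import torch.nn as nn

class ScoreMapModule(nn.Module):
    """Sketch of the dimension-reduction convolutional attention described above."""
    def __init__(self, kernel_size=3):
        super().__init__()
        # 1-D convolution over the channel dimension learns inter-channel
        # relationships from the globally pooled descriptor (kernel size assumed).
        self.conv1d = nn.Conv1d(1, 1, kernel_size,
                                padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                        # x: (B, C, H, W) local features
        gap = x.mean(dim=(2, 3))                 # global average pooling -> (B, C)
        w = self.conv1d(gap.unsqueeze(1))        # (B, 1, C): channel relationships
        scores = self.sigmoid(w).squeeze(1)      # importance scores in (0, 1)
        weighted = x * scores[:, :, None, None]  # emphasize valuable channels
        return weighted, scores
```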

3.3. Fast Fusion Mechanism

The fast fusion mechanism (Figure 3) mitigates the model’s excessive focus on current-position information and promotes a more reasonable distribution of attention weights, enhancing the model’s ability to selectively emphasize and integrate relevant features from both global and local contexts. The aim is to leverage the multihead attention (MHA) mechanism to improve the efficiency and effectiveness of multiscale feature fusion. The MHA mechanism learns different subspaces of the input global and local features in parallel, reducing the fusion steps required for integrating features across three channels and four layers from 12 instances of fusion to just one. MHA enhances the model’s capacity to capture relationships between diverse features and streamlines the fusion process by facilitating more efficient information integration, allowing the subsequent feature fusion to proceed more accurately. Its expression is as follows:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},$
where Q, K, and V are the query, key, and value vectors, respectively, and h is the number of heads. The output of each head, denoted $\mathrm{head}_i$, is expressed as follows:
$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the transformation matrices for the query, key, and value of the i-th head, respectively. The attention function is the mechanism for computing attention; in MHA, scaled dot-product self-attention is typically used:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$
where $d_k$ is the dimension of the key vectors, and the softmax normalizes the similarities to obtain a weight for each key vector. These weights are then applied to the value vectors, and the final attention output is obtained through weighted summation.
The module receives two inputs: the global features, which act as the query (Q), and the weighted local features from the score map module, which serve as both the key (K) and value (V). Within the module, the multihead attention mechanism operates by calculating the attention weights that capture the relationships between the global context and the local details across different subspaces. This mechanism allows the model to aggregate information from multiple scales and locations within the image, effectively merging fine-grained local features with the broader global context. The attention process involves computing the similarity between Q and K, which is then used to weight the V features, resulting in a fused feature map. This fused map integrates the detailed, high-resolution information from the local features with the overall scene understanding provided by the global features.
The output is a comprehensive, high-resolution feature map that retains critical spatial details and contextual information, enabling more precise and accurate segmentation. This fusion process is performed layer-by-layer, ensuring that the model maintains a balanced focus on both local and global aspects throughout the network, ultimately enhancing the segmentation performance, especially in ultra-high-resolution remote sensing images. This ultimately enhances the model’s ability to allocate attention weights appropriately and improves the outcomes of feature fusion.
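The following sketch shows how such a one-shot cross-attention fusion could be realized with PyTorch’s nn.MultiheadAttention, with the global features as the query and the score-weighted local features as key and value; the embedding dimension, head count, and residual normalization are assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class FastFusion(nn.Module):
    """Sketch of the one-shot multihead-attention fusion described above."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_feat, local_feat):
        # global_feat: (B, C, Hg, Wg); local_feat: (B, C, Hl, Wl)
        B, C, Hg, Wg = global_feat.shape
        q = global_feat.flatten(2).transpose(1, 2)   # queries: (B, Hg*Wg, C)
        kv = local_feat.flatten(2).transpose(1, 2)   # keys/values: (B, Hl*Wl, C)
        fused, _ = self.attn(q, kv, kv)              # cross-attention fusion
        fused = self.norm(fused + q)                 # residual + layer norm (assumed)
        return fused.transpose(1, 2).reshape(B, C, Hg, Wg)
```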

3.4. Training Process and Loss Function

The training process of the model unfolds in three sequential steps. The model consists of a global branch and a local branch, whose feature maps are combined to create a comprehensive feature map. First, the feature maps in the global branch are cropped according to their spatial locations within the local patches and downsampled to the corresponding sizes; these downsized feature maps are then fused as additional channels with the local branch’s feature maps at the same layer. Simultaneously, we obtain the feature maps of the local branch. Second, we apply the score map module to these local feature maps and select those with high fusion suitability for combination into a complete feature map. Finally, we fuse the global context features and local fine-structure features across all layers except the final layer of both the global and local branches using the fast fusion module applied layer-wise.
The training process utilized the AdamW optimizer for 60 epochs, with a batch size of 64. To adjust the learning rate, a cosine annealing mechanism was employed. Initially, the learning rate was set to $5 \times 10^{-4}$, accompanied by a momentum of 0.9 and no weight decay regularization. Typically, a cross-entropy loss is used as the main loss function. The cross-entropy loss measures the dissimilarity between two probability distributions for the same random variable, effectively quantifying the disparity between the true and predicted probability distributions. In multiclass tasks, this loss function treats the probability distributions for each class as independent, making it effective in handling multiclass classification problems. The cross-entropy loss is expressed as follows:
$L_{\mathrm{main}}(p, q) = -\sum_{i=1}^{C} p_i \log q_i,$
where C is the number of categories, $p_i$ is the ground truth probability, and $q_i$ is the predicted probability.
In addition to the main loss function, two auxiliary loss functions are applied—one on the global branch and one on the local branch. These auxiliary losses are designed to ensure that the network effectively learns both global context and local details when processing images at different resolutions. An auxiliary loss of global branch helps produce a coarse but contextually accurate semantic segmentation map. It also facilitates gradient flow, helping the network converge faster during training.
$L_{\mathrm{global}}(p, q) = -\sum_{i=1}^{C} p_i^{\mathrm{global}} \log q_i^{\mathrm{global}},$
where $p_i^{\mathrm{global}}$ and $q_i^{\mathrm{global}}$ are the true and predicted probabilities for the global branch output.
An auxiliary loss of local branch encourages the local branch to refine the semantic segmentation, particularly in areas with small objects or complex textures. It ensures that the local details complement the broader context provided by the global branch.
$L_{\mathrm{local}}(p, q) = -\sum_{i=1}^{C} p_i^{\mathrm{local}} \log q_i^{\mathrm{local}},$
where $p_i^{\mathrm{local}}$ and $q_i^{\mathrm{local}}$ are the true and predicted probabilities for the local branch output.
The total loss used to train the network is a weighted sum of the main loss and the two auxiliary losses:
$L_{\mathrm{total}} = L_{\mathrm{main}} + \alpha L_{\mathrm{global}} + \beta L_{\mathrm{local}},$
where $\alpha$ and $\beta$ are weights that balance the contribution of the auxiliary losses relative to the main loss. During the optimization process, both the global branch auxiliary loss and the local branch auxiliary loss are assigned equal weights, each set to 1.0. This approach helps the network simultaneously learn global context and local details without prioritizing one over the other. By assigning equal weights, the network maintains a balanced focus on both the broader, downsampled view and the detailed, high-resolution patches during training.
The main loss ensures that the combined output is accurate, while the auxiliary losses help in refining the features learned by the individual branches. This multiloss strategy leads to better semantic segmentation performance, particularly in complex and high-resolution scenarios.
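A compact sketch of this multiloss strategy is shown below, assuming per-branch logits and targets are available; with α = β = 1.0 it reduces to a plain sum of the three cross-entropy terms, as described above.

```python
import torch
import torch.nn.functional as F

def total_loss(main_logits, global_logits, local_logits,
               target_main, target_global, target_local,
               alpha=1.0, beta=1.0):
    """Main cross-entropy plus two auxiliary cross-entropy terms,
    weighted by alpha and beta (both 1.0 here, as stated above)."""
    l_main = F.cross_entropy(main_logits, target_main)      # combined output
    l_global = F.cross_entropy(global_logits, target_global)  # global branch
    l_local = F.cross_entropy(local_logits, target_local)     # local branch
    return l_main + alpha * l_global + beta * l_local
```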

4. Experiments

4.1. Dataset

Our framework was evaluated using the International Society for Photogrammetry and Remote Sensing (ISPRS) dataset, specifically the Vaihingen and Potsdam datasets. The Vaihingen dataset comprises 33 tiles of true orthophoto (TOP) images derived from three-band color-infrared aerial imagery with a spatial resolution of 9 cm. This dataset provides manually annotated ground truth for six categories: impervious surfaces, buildings, low vegetation, trees, cars, and clutter/background. Notably, it encompasses a diverse range of urban structures including residential areas, commercial zones, and vegetated parks, making it an excellent resource for evaluating the performance of semantic segmentation models in varied urban environments. The Potsdam dataset complements the Vaihingen dataset by offering a higher spatial resolution of 5 cm across 38 TOP image tiles. This dataset features four-band imagery (red, green, blue, and near-infrared) and provides ground truth annotations for the same six categories as the Vaihingen dataset. The Potsdam dataset is characterized by its detailed representation of urban landscapes, including intricate building geometries, diverse vegetation types, and various vehicles, thus presenting a challenging scenario for semantic segmentation tasks.
The original image tiles were cropped into patches of 512 × 512 pixels with a 50-pixel overlap in a sequential manner. This dataset configuration accurately captures the intricacies involved in semantic segmentation tasks on ultra-high-resolution remote sensing imagery.
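The sequential tiling can be sketched as follows (an illustrative implementation, not the authors’ preprocessing script): the stride equals the patch size minus the overlap, i.e., 462 pixels for 512 × 512 patches with a 50-pixel overlap.

```python
import numpy as np

def tile_with_overlap(image, patch=512, overlap=50):
    """Sequentially crop an (H, W, ...) array into overlapping patches."""
    stride = patch - overlap                      # 462 pixels for the settings above
    H, W = image.shape[:2]
    patches, coords = [], []
    for top in range(0, max(H - patch, 0) + 1, stride):
        for left in range(0, max(W - patch, 0) + 1, stride):
            patches.append(image[top:top + patch, left:left + patch])
            coords.append((top, left))            # keep positions for later stitching
    return patches, coords
```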

4.2. Implementation Details

Both the global and local branches employ ResNet50 and FPN as their backbone networks. A noteworthy innovation lies within our local branch, where we compute entropy maps for patches and use them as tensors to generate scores, thereby effectively identifying the uncertain patches within the top 10% of fusion values. The downsampled images fed into the global branch have a fixed pixel size, while the cropped images supplied to the local branch are consistently maintained at a resolution of 512 × 512 pixels. All experiments were conducted on a workstation equipped with 8 NVIDIA 3090 GPUs, each with 24 GB of memory, ensuring fair comparisons and reliable results.
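The entropy-based patch selection could look like the following sketch, which scores each patch by the mean pixel-wise entropy of its softmax output and keeps the most uncertain 10%; the use of coarse per-patch logits as the entropy source is an assumption.

```python
import torch
import torch.nn.functional as F

def select_uncertain_patches(patch_logits, top_ratio=0.10):
    """patch_logits: (N, C, H, W) coarse predictions for N candidate patches.
    Returns the indices of the top 10% most uncertain patches."""
    probs = F.softmax(patch_logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)   # (N, H, W)
    scores = entropy.mean(dim=(1, 2))                         # one score per patch
    k = max(1, int(round(top_ratio * len(scores))))
    return torch.topk(scores, k).indices
```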
The frames per second (FPS) monitoring method was utilized to quantify the inference speed of our network. An increase in FPS signifies an enhanced inference speed, denoting a reduced duration for network processing per image.
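A typical way to measure FPS on a GPU, consistent with the description above, is sketched below; the warm-up and iteration counts are arbitrary choices, and CUDA synchronization is assumed to be available.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, sample, warmup=10, iters=50):
    """Average images processed per second after a GPU warm-up phase."""
    model.eval()
    for _ in range(warmup):            # warm-up passes (excluded from timing)
        model(sample)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(sample)
    torch.cuda.synchronize()
    return iters / (time.time() - start)
```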

4.3. Evaluation Metric

The assessment process consists of two essential elements: the evaluation of semantic segmentation accuracy and the assessment of network model efficiency.
Three standardized evaluation criteria were utilized to ensure an impartial and equitable comparison with other proposed network models: Intersection over Union (IoU), overall accuracy (OA), and F1-score. The definition for each category varies slightly because multiple categories are present.
The IoU metric measures the agreement between a model’s prediction and the ground truth for a specific category as the ratio of the intersection to the union of the predicted and true regions. The mIoU (Mean Intersection over Union) is the average of these ratios across all predicted categories, giving a better overall measure of model performance. It is computed as follows:
$IoU_j = \dfrac{\sum_{i=1}^{n} TP_{ij}}{\sum_{i=1}^{n} TP_{ij} + \sum_{i=1}^{n} FP_{ij} + \sum_{i=1}^{n} FN_{ij}}$
The OA assesses the global accuracy of the model’s predictive outcomes:
$OA_j = \dfrac{\sum_{i=1}^{n} TP_{ij} + \sum_{i=1}^{n} TN_{ij}}{\sum_{i=1}^{n} TP_{ij} + \sum_{i=1}^{n} TN_{ij} + \sum_{i=1}^{n} FP_{ij} + \sum_{i=1}^{n} FN_{ij}}$
The F1-score is a composite measure that considers both precision and recall, and the mean F1-score (mean F1) represents the average across all categories. Precision quantifies the proportion of samples detected as positive by the model that are truly positive, whereas recall quantifies the proportion of all true positive samples that the model correctly detects:
$Precision_j = \dfrac{\sum_{i=1}^{n} TP_{ij}}{\sum_{i=1}^{n} TP_{ij} + \sum_{i=1}^{n} FP_{ij}}$
$Recall_j = \dfrac{\sum_{i=1}^{n} TP_{ij}}{\sum_{i=1}^{n} TP_{ij} + \sum_{i=1}^{n} FN_{ij}}$
$F1_j = \dfrac{2 \times Precision_j \times Recall_j}{Precision_j + Recall_j} = \dfrac{2\sum_{i=1}^{n} TP_{ij}}{2\sum_{i=1}^{n} TP_{ij} + \sum_{i=1}^{n} FN_{ij} + \sum_{i=1}^{n} FP_{ij}}$
where $TP_{ij}$ is the count of pixels in image i correctly predicted as class j, $FP_{ij}$ is the count of pixels in image i incorrectly predicted as class j, $FN_{ij}$ is the count of pixels of class j in image i incorrectly predicted as another class, and $TN_{ij}$ is the count of pixels in image i correctly predicted as a class other than j. Note that there is an inactive unknown class in our evaluation; predictions for pixels of this class are not factored into the calculation and hence do not affect the final score.
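For reference, the sketch below computes these metrics from an accumulated C × C confusion matrix (rows as ground truth, columns as predictions); pixels of the inactive unknown class are assumed to have been excluded before accumulation.

```python
import numpy as np

def metrics_from_confusion(conf):
    """conf: (C, C) confusion matrix with rows = ground truth, cols = prediction."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / (tp + fp + fn + 1e-8)
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    oa = tp.sum() / conf.sum()                     # overall accuracy
    return {"mIoU": iou.mean(), "OA": oa, "mean_F1": f1.mean()}
```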

4.4. Quantitative Results: Accuracy and Inference Speed Comparison

We conducted a comparative analysis of EFFNet against various other state-of-the-art encoder–decoder semantic segmentation models on two representative remote sensing image datasets: Vaihingen and Potsdam. The models used for comparison include FCN-8s [24], UNet [52], SegNet [53], EncNet [54], RefineNet [55], CCEM [56], DeepLabv3+ [57], and S-RA-FCN [58]. In addition, eight semantic segmentation models designed to address the characteristics of ultra-high-resolution images were selected, namely GLNet [18], MBNet [59], UHRSNet [60], FCtL [61], MagNet [62], EHSNet [63], Mask2Former [64], and SegGPT [65]. Table 1 and Table 2 compare the experimental results on the Vaihingen and Potsdam datasets.
The semantic segmentation performance of EFFNet surpasses that of other models in terms of various metrics, such as overall accuracy (OA), mean F1 score, and Mean Intersection over Union (mIoU). Our incorporation of global and local branches enables us to effectively capture boundary-relevant information, encompassing both intricate features and contextual details. This approach significantly enhances the accuracy in segmenting buildings, cars, and impervious surfaces. Moreover, our model generates high-quality feature maps through the score map module, enabling precise and detailed feature extraction. The fast fusion module further improves information integration capabilities, resulting in superior identification scores for low vegetation and trees.
To validate the stability of the model’s results, Monte Carlo experiments were conducted to assess the statistical significance of the experimental outcomes. Specifically, we evaluated the variability in the results by adjusting the model’s training initialization settings 10 times, while keeping all other hyperparameters constant. Table 3 presents the average results and their standard deviations obtained from experiments conducted on the Vaihingen and Potsdam datasets. The results indicate that the mean F1 and mIoU metrics across different classes exhibit some fluctuations on both datasets, suggesting that the model’s initialization does indeed have an impact on the outcomes. However, overall, these fluctuations are relatively minor, with standard deviations mostly ranging between 0.01 and 0.05, indicating that the model maintains a certain level of stability across multiple training runs.
The results of Table 4 and Figure 4 provide a comparative analysis of the network’s inference speed with other networks. Furthermore, the proposed EFFNet model achieved an inference frame rate of 0.22, indicating a significant improvement of 0.05 compared with GLNet, which is specifically designed for ultra-high-resolution images using the global–local architecture. This enhancement represents an approximate increase of 30%. When evaluated on the Vaihingen dataset, known for its high accuracy, EFFNet outperforms FCtL by 1.1% in terms of mIoU and achieves more than a twenty-fold improvement in performance. Additionally, EFFNet surpasses DeepLab V3+, which boasts superior accuracy, by over ten times in terms of inference speed on the Potsdam dataset while also demonstrating a noteworthy improvement of 1.2% in mIoU. By comparing the FLOPs results, it is evident that the computational complexity varies significantly across different models. EFFNet, with a FLOPs value of 4.75, exhibits the highest computational cost among all models, indicating that EFFNet achieves superior segmentation accuracy at a higher computational expense. Overall, EFFNet’s performance across the three key metrics—mIoU, FPS, and FLOPs—demonstrates that it not only delivers superior accuracy but also excels in inference speed and computational efficiency, achieving a well-balanced trade-off among these aspects.

4.5. Visualization Results

Figure 5 illustrates some semantic segmentation results on the Vaihingen and Potsdam datasets. It showcases the superiority of our approach in boundary detection, particularly for building boundaries, which appear more accurate and smoother. Moreover, EFFNet delivers more precise recognition results for classes such as trees and clutter, where boundary information is relatively less distinct.

4.6. Ablation Study

In the ablation study, we primarily verified the impact of the transformer’s position, the number of transformer blocks, the score map module, and the number of patches on the semantic segmentation outcomes. We present the main results utilizing the Vaihingen dataset.

4.6.1. Location of Transformer

Table 5 presents an ablation study on the impact of the transformer’s location. We investigated the effects of placing the transformer in the backbone versus the decoder of the network, using the mIoU metric to measure semantic segmentation accuracy. The table reveals that relocating the transformer from the backbone to the decoder leads to a notable increase in mIoU, reaching 81.1%, an improvement of roughly 2.4%. These findings suggest that placing the transformer in the decoder handles target features, captures target boundaries, and manages contextual information better than placing it in the backbone.

4.6.2. Number of Transformer Blocks

Figure 6 presents an ablation study on the impact of the number of transformer blocks. Building upon the default model with eight transformer blocks, experiments were performed with varying numbers of transformer blocks to explore their effect on model performance. The findings show that the model performs best when using four transformer blocks with a downsampling ratio of 4×, achieving an mIoU of 81.1%. The experimental results show that four transformer blocks are effective in extracting multilevel features, such as image edges and textures, while also capturing extensive contextual background information. However, further increasing the number of transformer blocks can cause overfitting to the training data, leading to decreased performance on the test data, as reflected in the reduction in mIoU.

4.6.3. Without Score Map

In this experiment, we investigated the influence of the score map module on our model. After removing the score map module, the mIoU decreased to 77.4%, a reduction of approximately 4.6% compared with the baseline model. These findings demonstrate the significant contribution of the score map module to semantic segmentation accuracy and offer valuable insight into its influence on model performance. They also highlight the significance of the proposed score map module as a practical application of attention mechanisms in upsampling techniques.
Table 6 shows that by weighting features from different channels, this module assists the network in better object semantic segmentation. It also simultaneously allows the model to capture a broader context, enhances resolution to preserve finer details, and suppresses background interference through weighted features, reducing conflicts between background and foreground objects. Figure 7 demonstrates that the inclusion of the Score Map module results in more accurate boundary segmentation for the Building and Tree categories, while effectively reducing misclassification between Low Vegetation and the background.

4.6.4. Number of Patch

We used the top 5% to 40% weighted patches in the score map to extract local features separately, which were then fused with global features. Figure 8 shows that when increasing from the top 5% weighted patches to the top 20% weighted patches, the mIoU gradually rises from 78.3 to 81.4, reaching its maximum value of 81.4 at the top 20% weighted patches. As the top percentage continues to increase to 40%, the mIoU gradually decreases to 80.9.
This result can be explained by several factors. As the proportion of selected patches increases, additional redundant or less relevant features might be introduced, which could introduce noise and reduce the overall effectiveness of feature fusion, leading to a decrease in mIoU. The attention mechanism, designed to highlight critical features, may lose its selectivity as the proportion of top-weighted patches increases, resulting in a less focused integration of features. Additionally, optimal performance is likely achieved when there is a balance between local and global features; increasing the proportion of top-weighted patches beyond 20% might disrupt this balance, causing the model to overemphasize local features at the expense of global semantic understanding. This disruption likely contributes to the observed decrease in mIoU.

5. Discussion and Conclusions

This work introduces a novel approach for ultra-high-resolution remote sensing imagery called EFFNet. It features a unique design that includes both a global and local branch, effectively preserving the global context and intricate details of images. On the one hand, the score map module utilizes a dimension reduction-based convolutional attention mechanism to precisely evaluate the importance of local features within the overall semantic segmentation task. This targeted selection not only reduces computational overhead but also prevents irrelevant features from diluting the model’s predictions, thereby improving both accuracy and efficiency. On the other hand, the fast fusion mechanism leverages a multihead attention strategy to efficiently integrate global and local features. In contrast to conventional multistep fusion processes, which require repeated information exchange across different layers and scales—often leading to increased computational complexity and potential information loss—EFFNet performs a single, efficient fusion that preserves critical details. This approach captures the intricate relationships between global context and fine-grained local features, maintaining high sensitivity to details while ensuring consistent global semantics.
Experimental results demonstrate the effectiveness of EFFNet in accurately delineating object boundaries, facilitating precise semantic segmentation. Notably, when evaluated on challenging datasets such as Vaihingen and Potsdam, EFFNet outperforms several previous network architectures. In future research, we plan to enhance EFFNet by integrating the probabilistic hierarchical clustering approach to improve feature extraction and semantic segmentation processes [66]. Specifically, we will introduce an endmember estimation module that applies PCA and K-means clustering to the feature maps extracted by EFFNet, followed by agglomerative clustering using Independent Component Analysis (ICA) to identify key spectral signatures. These endmember features will then guide the attention mechanisms within EFFNet, refining the selection and fusion of global and local features. Additionally, we will explore multiscale feature fusion strategies that leverage clustering results to reduce redundancy and enhance the network’s responsiveness to various scales. Through these steps, we aim to further improve accuracy and efficiency, while expanding EFFNet’s applicability to complex remote sensing scenarios.

Author Contributions

Conceptualization, Y.S. (Yihao Sun) and Y.S. (Yinan Sun); methodology, Y.S. (Yihao Sun); software, Y.S. (Yihao Sun); validation, M.W. and X.H.; formal analysis, Y.S. (Yihao Sun); investigation, Y.S. (Yihao Sun); resources, Y.S. (Yinan Sun); data curation, X.H. and C.X.; writing—original draft preparation, Y.S. (Yihao Sun); writing—review and editing, M.W.; visualization, Y.S. (Yihao Sun); supervision, Y.S. (Yinan Sun); project administration, Y.S. (Yinan Sun); funding acquisition, Y.S. (Yinan Sun). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Beijing Forestry University Science and Technology Innovation Program Project (No. BLX201732).

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank Beijing Forestry University Science and Technology Innovation Program for helpful support related to this work. The authors would like to thank the anonymous editors and reviewers for their helpful remarks. We thank LetPub (www.letpub.com, accessed on 31 August 2024) for its linguistic assistance during the preparation of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pan, X.; Gao, L.; Marinoni, A.; Zhang, B.; Yang, F.; Gamba, P. Semantic labeling of high resolution aerial imagery and LiDAR data with fine segmentation network. Remote Sens. 2018, 10, 743. [Google Scholar] [CrossRef]
  2. Kamilaris, A.; Prenafeta-Boldu, F. Deep learning in agriculture: A survey, computers and electronics in agriculture. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
  3. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
  4. Li, W.; Guo, Q.; Elkan, C. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE Trans. Geosci. Remote Sens. 2010, 49, 717–725. [Google Scholar] [CrossRef]
  5. Zhang, C.; Xie, Z. Object-based vegetation mapping in the Kissimmee River watershed using HyMap data and machine learning techniques. Wetlands 2013, 33, 233–244. [Google Scholar] [CrossRef]
  6. Liu, C.; Frazier, P.; Kumar, L. Comparative assessment of the measures of thematic classification accuracy. Remote Sens. Environ. 2007, 107, 606–616. [Google Scholar] [CrossRef]
  7. Fassnacht, F.E.; Latifi, H.; Stereńczak, K.; Modzelewska, A.; Lefsky, M.; Waser, L.T.; Straub, C.; Ghosh, A. Review of studies on tree species classification from remotely sensed data. Remote Sens. Environ. 2016, 186, 64–87. [Google Scholar] [CrossRef]
  8. Stow, D.A.; Hope, A.; McGuire, D.; Verbyla, D.; Gamon, J.; Huemmrich, F.; Houston, S.; Racine, C.; Sturm, M.; Tape, K.; et al. Remote sensing of vegetation and land-cover change in Arctic Tundra Ecosystems. Remote Sens. Environ. 2004, 89, 281–308. [Google Scholar] [CrossRef]
  9. Ascher, S.; Pincus, E. The Filmmaker’s Handbook: A Comprehensive Guide for the Digital Age; Penguin: New York, NY, USA, 1999. [Google Scholar]
  10. Lilly, P. Samsung Launches Insanely Wide 32: 9 Aspect Ratio Monitor with HDR and Freesync 2. 2017. Available online: https://www.pcgamer.com/samsung-launches-a-massive-49-inch-ultrawide-hdr-monitor-with-freesync-2/ (accessed on 31 August 2024).
  11. Akundy, V.A.; Wang, Z. 4K or not?—Automatic image resolution assessment. In Proceedings of the Image Analysis and Recognition: 17th International Conference, ICIAR 2020, Póvoa de Varzim, Portugal, 24–26 June 2020; Proceedings, Part I 17. Springer: Berlin/Heidelberg, Germany, 2020; pp. 61–65. [Google Scholar]
  12. Dong, S.; Li, Y.; Zhang, Z.; Gou, T.; Xie, M. A transfer-learning-based windspeed estimation on the ocean surface: Implication for the requirements on the spatial-spectral resolution of remote sensors. Appl. Intell. 2024, 54, 7603–7620. [Google Scholar] [CrossRef]
  13. Du, X.; He, S.; Yang, H.; Wang, C. Multi-Field Context Fusion Network for Semantic Segmentation of High-Spatial-Resolution Remote Sensing Images. Remote Sens. 2022, 14, 5830. [Google Scholar] [CrossRef]
  14. Su, Y.; Cheng, J.; Bai, H.; Liu, H.; He, C. Semantic segmentation of very-high-resolution remote sensing images via deep multi-feature learning. Remote Sens. 2022, 14, 533. [Google Scholar] [CrossRef]
  15. Smith, L.N.; Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. In Proceedings of the Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, Baltimore, MD, USA, 15–17 April 2019; SPIE: Bellingham, WA, USA, 2019; Volume 11006, pp. 369–386. [Google Scholar]
  16. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  17. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar]
  18. Chen, W.; Jiang, Z.; Wang, Z.; Cui, K.; Qian, X. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8924–8933. [Google Scholar]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  20. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  21. Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403215. [Google Scholar] [CrossRef]
  22. Ma, X.; Zhang, X.; Pun, M.O. RS 3 Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405. [Google Scholar] [CrossRef]
  23. Ma, X.; Zhang, X.; Wang, Z.; Pun, M.O. Unsupervised domain adaptation augmented by mutually boosted attention for semantic segmentation of VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5400515. [Google Scholar] [CrossRef]
  24. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  25. Kampffmeyer, M.; Salberg, A.B.; Jenssen, R. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1–9. [Google Scholar]
  26. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171. [Google Scholar] [CrossRef]
  27. Chantharaj, S.; Pornratthanapong, K.; Chitsinpchayakun, P.; Panboonyuen, T.; Vateekul, P.; Lawavirojwong, S.; Srestasathiern, P.; Jitkajornwanich, K. Semantic segmentation on medium-resolution satellite images using deep convolutional networks with remote sensing derived indices. In Proceedings of the 2018 IEEE 15th International Joint Conference on Computer Science and Software Engineering (JCSSE), Nakhonpathom, Thailand, 11–13 July 2018; pp. 1–6. [Google Scholar]
  28. Bao, Y.; Liu, W.; Gao, O.; Lin, Z.; Hu, Q. E-Unet++: A Semantic Segmentation Method for Remote Sensing Images. In Proceedings of the 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 18–20 June 2021; Volume 4, pp. 1858–1862. [Google Scholar] [CrossRef]
  29. Chen, S.; Zhang, B. RSUnet: A New Full-scale Unet for Semantic Segmentation of Remote Sensing Images. 2022. Available online: https://www.researchsquare.com/article/rs-1211375/v1 (accessed on 31 August 2024).
  30. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  31. Chen, L.C.; Yang, Y.; Wang, J.; Xu, W.; Yuille, A.L. Attention to scale: Scale-aware semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3640–3649. [Google Scholar]
  32. Xia, F.; Wang, P.; Chen, L.C.; Yuille, A.L. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 648–663. [Google Scholar]
  33. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  34. Zhang, J.; Lin, S.; Ding, L.; Bruzzone, L. Multi-scale context aggregation for semantic segmentation of remote sensing images. Remote Sens. 2020, 12, 701. [Google Scholar] [CrossRef]
  35. Wang, Z.; Zhou, Y.; Wang, F.; Wang, S.; Qin, G.; Zou, W.; Zhu, J. A Multi-Scale Edge Constraint Network for the Fine Extraction of Buildings from Remote Sensing Images. Remote Sens. 2023, 15, 927. [Google Scholar] [CrossRef]
  36. Liu, Y.; Gross, L.; Li, Z.; Li, X.; Fan, X.; Qi, W. Automatic building extraction on high-resolution remote sensing imagery using deep convolutional encoder-decoder with spatial pyramid pooling. IEEE Access 2019, 7, 128774–128786. [Google Scholar] [CrossRef]
  37. Nong, Z.; Su, X.; Liu, Y.; Zhan, Z.; Yuan, Q. Boundary-Aware Dual-Stream Network for VHR Remote Sensing Images Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5260–5268. [Google Scholar] [CrossRef]
  38. He, G.; Dong, Z.; Feng, P.; Muhtar, D.; Zhang, X. Dual-Range Context Aggregation for Efficient Semantic Segmentation in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 2500605. [Google Scholar] [CrossRef]
  39. Ma, H.; Yang, H.; Huang, D. Boundary guided context aggregation for semantic segmentation. arXiv 2021, arXiv:2110.14587. [Google Scholar]
  40. Bai, H.; Cheng, J.; Huang, X.; Liu, S.; Deng, C. HCANet: A hierarchical context aggregation network for semantic segmentation of high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 6002105. [Google Scholar] [CrossRef]
  41. Liu, Z.; Li, J.; Song, R.; Wu, C.; Liu, W.; Li, Z.; Li, Y. Edge Guided Context Aggregation Network for Semantic Segmentation of Remote Sensing Imagery. Remote Sens. 2022, 14, 1353. [Google Scholar] [CrossRef]
  42. Chen, Z.; Zhao, J.; Deng, H. Global Multi-Attention UResNeXt for Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 1836. [Google Scholar] [CrossRef]
  43. Liu, K.H.; Lin, B.Y. MSCSA-Net: Multi-scale channel spatial attention network for semantic segmentation of remote sensing images. Appl. Sci. 2023, 13, 9491. [Google Scholar] [CrossRef]
  44. Guo, R.; Liu, J.; Li, N.; Liu, S.; Chen, F.; Cheng, B.; Duan, J.; Li, X.; Ma, C. Pixel-wise classification method for high resolution remote sensing imagery using deep neural networks. ISPRS Int. J. Geo-Inf. 2018, 7, 110. [Google Scholar] [CrossRef]
  45. Liu, P.; Liu, X.; Liu, M.; Shi, Q.; Yang, J.; Xu, X.; Zhang, Y. Building footprint extraction from high-resolution images via spatial residual inception convolutional neural network. Remote Sens. 2019, 11, 830. [Google Scholar] [CrossRef]
  46. Alam, M.; Wang, J.F.; Guangpei, C.; Yunrong, L.; Chen, Y. Convolutional neural network for the semantic segmentation of remote sensing images. Mob. Netw. Appl. 2021, 26, 200–215. [Google Scholar] [CrossRef]
  47. Qiao, W.; Shen, L.; Wang, J.; Yang, X.; Li, Z. A weakly supervised semantic segmentation approach for damaged building extraction from postearthquake high-resolution remote-sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6002705. [Google Scholar] [CrossRef]
  48. Wang, Y.; Li, Y.; Chen, W.; Li, Y.; Dang, B. DNAS: Decoupling Neural Architecture Search for High-Resolution Remote Sensing Image Semantic Segmentation. Remote Sens. 2022, 14, 3864. [Google Scholar] [CrossRef]
  49. Li, X.; Lei, L.; Kuang, G. Multilevel adaptive-scale context aggregating network for semantic segmentation in high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 6003805. [Google Scholar] [CrossRef]
  50. Chong, Y.; Chen, X.; Pan, S. Context union edge network for semantic segmentation of small-scale objects in very high resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 19, 6003805. [Google Scholar] [CrossRef]
  51. Zhou, L.; Zhao, H.; Liu, Z.; Cai, K.; Liu, Y.; Zuo, X. MHLDet: A Multi-Scale and High-Precision Lightweight Object Detector Based on Large Receptive Field and Attention Mechanism for Remote Sensing Images. Remote Sens. 2023, 15, 4625. [Google Scholar] [CrossRef]
  52. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  53. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  54. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160. [Google Scholar]
  55. Lin, G.; Milan, A.; Shen, C.; Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934. [Google Scholar]
  56. Marmanis, D.; Schindler, K.; Wegner, J.D.; Galliani, S.; Datcu, M.; Stilla, U. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 2018, 135, 158–172. [Google Scholar] [CrossRef]
  57. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 801–818, ISBN 978-3-030-01233-5. [Google Scholar]
  58. Mou, L.; Hua, Y.; Zhu, X.X. A relation-augmented fully convolutional network for semantic segmentation in aerial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12416–12425. [Google Scholar]
  59. Shan, L.; Wang, W. MBNet: A Multi-Resolution Branch Network for Semantic Segmentation Of Ultra-High Resolution Images. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; pp. 2589–2593. [Google Scholar]
  60. Shan, L.; Li, M.; Li, X.; Bai, Y.; Lv, K.; Luo, B.; Chen, S.B.; Wang, W. Uhrsnet: A semantic segmentation network specifically for ultra-high-resolution images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 1460–1466. [Google Scholar]
  61. Li, Q.; Yang, W.; Liu, W.; Yu, Y.; He, S. From contexts to locality: Ultra-high resolution image segmentation via locality-aware contextual correlation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7252–7261. [Google Scholar]
  62. Huynh, C.; Tran, A.T.; Luu, K.; Hoai, M. Progressive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, QC, Canada, 11–17 October 2021; pp. 16755–16764. [Google Scholar]
  63. Chen, W.; Li, Y.; Dang, B.; Zhang, Y. EHSNet: End-to-End Holistic Learning Network for Large-Size Remote Sensing Image Semantic Segmentation. arXiv 2022, arXiv:2211.11316. [Google Scholar]
  64. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  65. Wang, X.; Zhang, X.; Cao, Y.; Wang, W.; Shen, C.; Huang, T. SegGPT: Segmenting everything in context. arXiv 2023, arXiv:2304.03284. [Google Scholar]
  66. Prades, J.; Safont, G.; Salazar, A.; Vergara, L. Estimation of the number of endmembers in hyperspectral images using agglomerative clustering. Remote Sens. 2020, 12, 3585. [Google Scholar] [CrossRef]
Figure 1. Overview of the Efficient Feature Fusion Network (EFFNet). The network takes cropped full-resolution patches and a downsampled view of the image as inputs and consists of a local branch and a global branch. After the local feature maps pass through ResNet and a one-dimensional convolution, the score map module applies a Sigmoid activation to extract the most significant local features. These local features are then efficiently fused with the global features by the fast fusion mechanism, producing a high-resolution, information-rich feature map used for the final semantic segmentation. The two attention-based modules are designed to reduce the processing load on local features while improving feature matching across samples, thereby enhancing both the accuracy and the efficiency of the network.
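To make the two-branch input described above concrete, the following sketch (illustrative only, not the authors' implementation; the patch size and global resolution are placeholder values) shows how an ultra-high-resolution image can be split into full-resolution local patches while a downsampled global view is prepared in parallel.

```python
# Illustrative sketch: preparing the two inputs of a local/global pipeline --
# full-resolution local patches and a downsampled global view of the same
# ultra-high-resolution image. Patch size and global resolution are arbitrary
# placeholder values, not the paper's settings.
import torch
import torch.nn.functional as F

def make_branch_inputs(image: torch.Tensor, patch: int = 512, global_size: int = 512):
    """image: (C, H, W) ultra-high-resolution tensor."""
    c, h, w = image.shape
    # Global branch: downsample the whole image to a manageable resolution.
    global_view = F.interpolate(image.unsqueeze(0), size=(global_size, global_size),
                                mode="bilinear", align_corners=False)
    # Local branch: non-overlapping full-resolution crops (edge remainders ignored here).
    patches = [image[:, y:y + patch, x:x + patch]
               for y in range(0, h - patch + 1, patch)
               for x in range(0, w - patch + 1, patch)]
    return global_view.squeeze(0), torch.stack(patches)

g, p = make_branch_inputs(torch.rand(3, 2048, 2048))
print(g.shape, p.shape)  # torch.Size([3, 512, 512]) torch.Size([16, 3, 512, 512])
```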
Figure 2. Score map module. The input is passed through ResNet and two successive 3 × 3 convolution layers, producing a feature map of dimensions H × W × 1. After Sigmoid activation, the feature map is indexed and the high-value features are selectively retained.
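The selection step in Figure 2 can be pictured with the following minimal PyTorch sketch (not the authors' implementation); the backbone channel width, intermediate width, and the fraction of positions kept are placeholder assumptions.

```python
# Minimal sketch of a score-map style gate: two 3x3 convolutions reduce the
# backbone feature map to a single-channel score map, a Sigmoid maps scores to
# (0, 1), and only the highest-scoring positions are retained and re-weighted.
# Channel widths and keep_ratio are assumptions, not the paper's values.
import torch
import torch.nn as nn

class ScoreMap(nn.Module):
    def __init__(self, in_ch: int = 256, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Sequential(
            nn.Conv2d(in_ch, in_ch // 4, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch // 4, 1, kernel_size=3, padding=1),  # -> (B, 1, H, W)
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor):
        b, c, h, w = feat.shape
        scores = self.score(feat).flatten(1)                   # (B, H*W)
        k = max(1, int(self.keep_ratio * h * w))
        top_val, top_idx = scores.topk(k, dim=1)               # indices of high-value positions
        flat = feat.flatten(2)                                 # (B, C, H*W)
        idx = top_idx.unsqueeze(1).expand(-1, c, -1)           # (B, C, k)
        selected = flat.gather(2, idx) * top_val.unsqueeze(1)  # keep and re-weight features
        return selected, top_idx

sel, idx = ScoreMap()(torch.rand(2, 256, 64, 64))
print(sel.shape)  # torch.Size([2, 256, 1024])
```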
Figure 3. The fast fusion mechanism integrates the global and local branches, enabling extensive collaboration by fusing the feature maps at each layer with multiple attention weights. The model’s depth determines the number of layers, and the merging process is repeated N times according to the number of cropped global patches. The attention weights are computed from the local and global features, which serve as the query (Q), key (K), and value (V). The optimization objective comprises a primary loss derived from the merged result and two additional losses.
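As a schematic of the one-shot transformer-style fusion in Figure 3, the snippet below shows cross-attention between global and local tokens. Which branch supplies the query versus the keys/values, the embedding width, and the head count are illustrative assumptions rather than the paper's exact configuration.

```python
# Illustrative sketch of a single cross-attention fusion between global feature
# tokens and score-map-selected local tokens. Hyperparameters are assumptions.
import torch
import torch.nn as nn

class FastFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, global_tokens: torch.Tensor, local_tokens: torch.Tensor):
        # global_tokens: (B, Ng, C) queries; local_tokens: (B, Nl, C) keys/values.
        fused, _ = self.attn(query=global_tokens, key=local_tokens, value=local_tokens)
        return self.norm(global_tokens + fused)   # residual connection + normalization

g = torch.rand(2, 32 * 32, 256)   # downsampled global feature map flattened into tokens
l = torch.rand(2, 1024, 256)      # selected local features as tokens
print(FastFusion()(g, l).shape)   # torch.Size([2, 1024, 256])
```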
Figure 4. The GPU inference frames per second (FPS) and Mean Intersection over Union (mIoU) accuracy are evaluated on the (a) Vaihingen and (b) Potsdam datasets. EFFNet (represented by red dots) outperforms existing networks, including GLNet, in terms of both inference speed and accuracy for segmenting ultra-high-resolution images.
Figure 5. Semantic segmentation results when adopting different modules on (a) the Vaihingen and (b) the Potsdam datasets.
Figure 6. Ablation study of different transformer locations.
Figure 7. Comparison of semantic segmentation results with and without score map on (a) Vaihingen and (b) Potsdam datasets.
Figure 8. Ablation study of different numbers of patches.
Table 1. Experimental results on the Vaihingen dataset.
| Model Name | Impervious Surface | Building | Low Vegetation | Tree | Car | OA | Mean F1 | mIoU |
|---|---|---|---|---|---|---|---|---|
| FCN-8s [24] | 90.0 | 93.0 | 77.7 | 86.5 | 80.4 | 88.3 | 85.5 | 75.5 |
| UNet [52] | 90.5 | 93.3 | 79.6 | 87.5 | 76.4 | 89.2 | 85.5 | 75.5 |
| SegNet [53] | 90.2 | 93.7 | 78.5 | 85.8 | 83.9 | 88.5 | 86.4 | 76.8 |
| EncNet [54] | 91.2 | 94.1 | 79.2 | 86.9 | 83.7 | 89.4 | 87.0 | 77.8 |
| RefineNet [55] | 91.1 | 94.1 | 79.8 | 87.2 | 82.3 | 88.9 | 86.9 | 77.1 |
| CCEM [56] | 91.5 | 93.8 | 79.4 | 87.3 | 83.5 | 88.6 | 87.1 | 78.0 |
| DeepLabv3+ [57] | 91.4 | 94.7 | 79.6 | 87.6 | 85.8 | 89.9 | 87.8 | 79.0 |
| S-RA-FCN [58] | 90.5 | 93.8 | 79.6 | 87.5 | 82.6 | 89.2 | 86.8 | 77.3 |
| GLNet [18] | 89.3 | 92.4 | 79.0 | 85.7 | 79.6 | 88.5 | 85.2 | 78.4 |
| MBNet [59] | 90.4 | 93.3 | 79.2 | 86.0 | 80.1 | 85.0 | 85.8 | 77.9 |
| UHRSNet [60] | 90.1 | 92.9 | 79.4 | 86.2 | 79.7 | 85.7 | 85.7 | 78.7 |
| FCtL [61] | 90.4 | 90.7 | 80.2 | 87.9 | 84.0 | 86.6 | 86.1 | 80.0 |
| MagNet [62] | 91.1 | 92.7 | 79.0 | 83.6 | 84.1 | 85.9 | 86.2 | 79.8 |
| EHSNet [63] | 87.9 | 89.7 | 78.2 | 81.9 | 80.7 | 83.7 | 87.4 | 78.3 |
| Mask2Former [64] | 89.9 | 91.8 | 80.4 | 82.1 | 81.4 | 84.9 | 88.9 | 81.9 |
| SegGPT [65] | 88.3 | 89.6 | 77.0 | 86.4 | 83.4 | 84.9 | 89.0 | 80.3 |
| EFFNet | 92.2 | 95.3 | 82.1 | 88.8 | 87.0 | 91.0 | 89.1 | 81.1 |
Table 2. Experimental results on the Potsdam dataset.
| Model Name | Impervious Surface | Building | Low Vegetation | Tree | Car | OA | Mean F1 | mIoU |
|---|---|---|---|---|---|---|---|---|
| FCN-8s [24] | 89.9 | 93.7 | 83.0 | 85.2 | 93.5 | 87.8 | 89.1 | 71.7 |
| UNet [52] | 88.2 | 91.1 | 82.8 | 84.9 | 91.6 | 86.2 | 87.7 | 68.5 |
| SegNet [53] | 87.8 | 90.7 | 81.0 | 84.7 | 89.7 | 85.7 | 86.8 | 66.2 |
| EncNet [54] | 91.0 | 94.9 | 84.4 | 85.9 | 93.6 | 89.0 | 90.0 | 73.2 |
| RefineNet [55] | 88.1 | 93.1 | 85.6 | 86.3 | 90.3 | 88.3 | 88.7 | 72.6 |
| CCEM [56] | 88.3 | 93.2 | 84.7 | 86.0 | 92.8 | 89.1 | 89.0 | 72.8 |
| DeepLabv3+ [57] | 91.3 | 94.8 | 84.2 | 86.6 | 93.8 | 89.2 | 90.1 | 73.8 |
| S-RA-FCN [58] | 90.7 | 94.2 | 83.8 | 85.8 | 93.6 | 88.5 | 89.6 | 72.5 |
| GLNet [18] | 89.7 | 90.4 | 80.9 | 83.3 | 91.7 | 84.7 | 87.2 | 72.1 |
| MBNet [59] | 88.2 | 90.6 | 81.2 | 85.1 | 91.8 | 86.2 | 87.4 | 72.8 |
| UHRSNet [60] | 90.3 | 91.1 | 82.0 | 86.0 | 92.0 | 87.0 | 88.3 | 73.0 |
| FCtL [61] | 88.3 | 92.1 | 82.9 | 85.4 | 93.1 | 88.4 | 87.0 | 69.9 |
| MagNet [62] | 90.3 | 93.7 | 83.4 | 84.6 | 93.0 | 89.0 | 86.8 | 70.3 |
| EHSNet [63] | 88.7 | 92.2 | 82.6 | 83.6 | 92.7 | 88.0 | 85.1 | 67.3 |
| Mask2Former [64] | 89.0 | 91.7 | 83.9 | 86.4 | 92.9 | 88.8 | 88.1 | 73.2 |
| SegGPT [65] | 89.2 | 91.3 | 79.5 | 81.2 | 87.9 | 85.8 | 85.6 | 72.0 |
| EFFNet | 92.1 | 94.6 | 86.3 | 87.9 | 93.9 | 92.0 | 91.0 | 74.6 |
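The OA, Mean F1, and mIoU columns in Tables 1 and 2 follow the standard confusion-matrix definitions; as a quick reference, a minimal NumPy sketch of these metrics is given below (illustrative only; the example confusion matrix is hypothetical).

```python
# Standard confusion-matrix metrics: overall accuracy, mean F1, and mean IoU.
import numpy as np

def metrics(conf: np.ndarray):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    oa = tp.sum() / conf.sum()                        # overall accuracy
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-9)  # per-class F1
    iou = tp / np.maximum(tp + fp + fn, 1e-9)         # per-class IoU
    return oa, f1.mean(), iou.mean()

conf = np.array([[50, 2, 3], [4, 60, 1], [2, 5, 70]])  # hypothetical 3-class example
print(metrics(conf))
```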
Table 3. Monte Carlo analysis of the stability of the experimental results.
| Dataset | Impervious Surface | Building | Low Vegetation | Tree | Car | OA | Mean F1 | mIoU |
|---|---|---|---|---|---|---|---|---|
| Vaihingen | 92.2 ± 0.01 | 95.3 ± 0.01 | 82.1 ± 0.01 | 88.8 ± 0.02 | 87.0 ± 0.01 | 91.0 ± 0.01 | 89.1 ± 0.05 | 81.1 ± 0.20 |
| Potsdam | 92.1 ± 0.02 | 94.6 ± 0.01 | 86.3 ± 0.01 | 87.9 ± 0.01 | 93.9 ± 0.01 | 92.0 ± 0.02 | 91.0 ± 0.04 | 74.6 ± 0.12 |
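The kind of repeated-run aggregation behind Table 3 can be sketched as follows; the number of runs, the example values, and the use of the sample standard deviation (rather than a confidence interval) are assumptions, not the paper's exact protocol.

```python
# Illustrative aggregation of a metric over independent runs into "mean ± spread".
import numpy as np

def summarize(runs):
    """runs: list of metric values from independent runs of the same model."""
    arr = np.asarray(runs, dtype=float)
    return f"{arr.mean():.1f} ± {arr.std(ddof=1):.2f}"

miou_runs = [81.3, 80.9, 81.1, 81.0, 81.2]   # hypothetical per-run mIoU values
print(summarize(miou_runs))                  # e.g., "81.1 ± 0.16"
```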
Table 4. Mean Intersection over Union (mIoU) accuracy, GPU inference FPS, and FLOPs on the (a) Vaihingen and (b) Potsdam datasets.
(a)
| Model | Accuracy (mIoU%) | FPS | FLOPs |
|---|---|---|---|
| UNet | 75.5 | 0.09 | 1.36 |
| FCN-8S | 75.5 | 0.04 | 4.52 |
| DeepLab V3+ | 79.0 | 0.02 | 4.44 |
| GLNet | 78.4 | 0.17 | 0.20 |
| MBNet | 77.9 | 0.10 | 0.32 |
| UHRSNet | 78.7 | 0.04 | 0.15 |
| FCtL | 80.0 | 0.01 | 0.13 |
| MagNet | 79.8 | 0.11 | 0.80 |
| EFFNet | 81.1 | 0.22 | 4.75 |
(b)
| Model | Accuracy (mIoU%) | FPS | FLOPs |
|---|---|---|---|
| UNet | 68.5 | 0.09 | 1.36 |
| FCN-8S | 71.7 | 0.04 | 4.52 |
| DeepLab V3+ | 73.8 | 0.02 | 4.44 |
| GLNet | 72.1 | 0.17 | 0.20 |
| MBNet | 72.8 | 0.10 | 0.32 |
| UHRSNet | 73.0 | 0.04 | 0.15 |
| FCtL | 69.9 | 0.01 | 0.13 |
| MagNet | 70.3 | 0.11 | 0.80 |
| EFFNet | 74.6 | 0.22 | 4.75 |
Table 5. Ablation study of different transformer locations.
|  | Backbone | Decoder |
|---|---|---|
| mIoU | 79.2 | 81.1 |
Table 6. Ablation study of whether to implement the score map module.
|  | With Score Map | Without Score Map |
|---|---|---|
| mIoU | 81.1 | 77.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
