Article

A Novel Global-Local Feature Aggregation Framework for Semantic Segmentation of Large-Format High-Resolution Remote Sensing Images

1
China Aero Geophysical Survey and Remote Sensing Center for Natural Resources, Beijing 100083, China
2
College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
3
Faculty of Artificial Intelligence in Education, Central China Normal University, 152 Luoyu Road, Wuhan 430079, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6616; https://doi.org/10.3390/app14156616
Submission received: 28 May 2024 / Revised: 13 July 2024 / Accepted: 26 July 2024 / Published: 29 July 2024
(This article belongs to the Special Issue Remote Sensing Image Processing and Application)

Abstract

High-resolution remote sensing images often contain weak-texture areas, such as large building roofs, that occupy a large number of pixels. These areas make it difficult for traditional semantic segmentation networks to obtain ideal results. Common strategies such as downsampling, patch cropping, and cascade models often sacrifice fine details or global context, resulting in limited accuracy. To address these issues, this paper presents a novel semantic segmentation framework for large-format high-resolution remote sensing images that aggregates global and local features. The framework consists of two branches: one processes low-resolution downsampled images to capture global features, while the other focuses on cropped patches to extract high-resolution local details. This paper also introduces a feature aggregation module based on the Transformer structure, which effectively aggregates global and local information. Additionally, to reduce GPU memory usage, a novel three-step training method is developed. Extensive experiments on two public datasets demonstrate the effectiveness of the proposed approach, with an IoU of 90.83% on the AIDS dataset and 90.30% on the WBDS dataset, surpassing state-of-the-art methods such as DANet, DeepLab v3+, U-Net, ViT, TransUNet, CMTFNet, and UANet.

1. Introduction

Remote sensing images are widely used in economic construction, national defense construction, and people’s daily life. Semantic segmentation of remote sensing images is an important aspect of remote sensing image processing, which plays a crucial role in urban planning, disaster relief, traffic management, and climate modeling. Semantic segmentation of remote sensing images refers to the process of assigning specific class labels to individual pixels in an image, enabling a comprehensive understanding of the image content [1]. The classified objects include buildings, roads, vegetation, etc.
Deep learning-based semantic segmentation algorithms have become the mainstream for remote sensing image processing. According to differences in network structure, existing networks fall into three main categories: convolutional neural network (CNN)-based, Transformer-based, and hybrid CNN-Transformer models.
In terms of CNN-based semantic segmentation, classical convolutional neural networks include FCN [2], U-Net [3], DeepLab [4,5,6], PSPNet [7], etc. FCN [2] was the first convolutional neural network to achieve end-to-end image semantic segmentation. It eliminates the fully connected layers of the conventional classification network and transforms the image classification network into an image segmentation network by restoring the resolution of the feature map through deconvolution. U-Net [3] is a symmetrical encoder-decoder network that makes full use of the multi-scale convolutional features generated by the encoder through skip connections. DeepLab uses atrous convolution to counteract the loss of spatial resolution caused by downsampling in traditional CNNs. To enhance the DeepLab networks, many researchers have incorporated depthwise separable convolution, parallel atrous separable convolution, and symmetric encoder-decoder structures, resulting in more advanced variants. PSPNet [7] introduces a new way of computing and using multi-scale features: four pooling layers of different sizes generate feature maps at different levels, which are then aggregated into image features that fuse multi-scale information, thereby improving the ability to mine global information. Inspired by the feature map accumulation of ResNet [8], DenseNet [9] improves feature reuse through denser connections and alleviates gradient vanishing. Beyond the network structure, many other attempts have been made to improve the semantic segmentation of remote sensing images. These include attention mechanisms to enhance feature representation [10,11,12,13,14,15,16], generative adversarial networks for cross-domain semantic segmentation [17,18,19], a self-updating CNN model [20] and a progressive edge guidance network [21] to incorporate geographic knowledge into CNN models, a semantic category balance-aware involved anti-interference network (SCBANet) [22] to handle the category imbalance issue, a high-order semantic decoupling network (HSDN) [23] to disentangle features, and an uncertainty-aware network (UANet) [24] to facilitate level-by-level feature refinement.
However, since the convolution operation is only performed within a limited range around the center point, the receptive field is restricted and a global understanding of the image is lacking. To make full use of image context, some researchers have introduced the Transformer [25] structure, with its global modeling capability, into image processing, and a series of Transformer-based semantic segmentation networks have emerged, such as the Vision Transformer (ViT) [26] and the Swin Transformer [27]. ViT [26] was the first to apply a pure Transformer structure to image recognition and has since been widely adopted for semantic segmentation. It splits and serializes the entire image, introduces positional information by adding position embeddings, and then processes the resulting sequence of vectors with the Transformer structure to obtain more powerful image features, thereby improving the accuracy of image semantic segmentation. Since ViT splits the entire image into fixed-size patches, the result is coarse when the patch size is large, while memory consumption and computation grow rapidly when the patch size is small. To this end, some researchers proposed the Swin Transformer [27], which adopts a hierarchical design with shifted windows: self-attention is computed within local windows that are shifted between layers, significantly reducing computational complexity while retaining image details.
Semantic segmentation networks built on Transformers require substantial training data and computational resources. Despite numerous efforts to improve and optimize these networks, exemplified by the Swin Transformer [27], TopFormer [28] and SiamixFormer [29], they can only mitigate training complexity to a certain degree. To this end, some researchers have designed semantic segmentation networks that combine CNNs and Transformers, such as TransUNet. TransUNet [30] is one of the most classic hybrid networks; it improves the classic U-Net by adding a series of Transformer units at the end of the encoder to model deep features with global context, thereby significantly improving segmentation accuracy. Since then, Wu et al. [31] introduced the CNN and Multiscale Transformer Fusion Network (CMTFNet), an encoder-decoder architecture that integrates CNN and Transformer techniques to extract local details and merge multiscale global context for precise high-resolution remote sensing image segmentation. Other hybrid networks built on Transformer backbones, such as STransFuse [32], CMT [33] and CMFNet [34], have also emerged. STransFuse [32] combines Swin Transformer coding branches with CNN coding, extracting feature representations at various scales through hierarchical stages. CMT [33] is a Transformer-based hybrid network that substitutes convolution for the multilayer perceptron in the Transformer, enhancing model accuracy without sacrificing speed. CMFNet [34] is a crossmodal multiscale fusion network that uses the Transformer architecture to capture long-range dependencies across multiscale convolutional feature maps of remote sensing data from diverse modalities, with the goal of enhancing semantic segmentation.
Deep learning-based semantic segmentation has achieved good results on both natural and remote sensing images. However, classical semantic segmentation algorithms mainly focus on processing small-sized images. Compared with natural images, remote sensing images have high resolution and large format, and they contain many large objects with weak-texture areas such as large building roofs, water bodies, and vegetation. Small image patches (e.g., 256 × 256) in these areas carry little information and often yield low segmentation accuracy. Meanwhile, due to the limitations of graphics processing unit (GPU) memory, large-format high-resolution images (e.g., 2048 × 2048) cannot be processed directly. Typical solutions such as downsampling, patch cropping, and cascade models trade away either fine details or global contextual information, which constrains accuracy. To address this issue, we propose a novel global-local feature aggregation framework for the semantic segmentation of large-format high-resolution remote sensing images. This framework extracts global information and local details through two branches and then aggregates them, which effectively resolves category determination in weakly textured areas with few spatial features, such as water bodies and large building rooftops. Moreover, to optimize GPU memory usage during training, we introduce a novel three-step training method that divides the large semantic segmentation network into three parts, thereby reducing the demands on graphics cards during model training.
The main contributions of this paper are summarized as follows:
  • A new semantic segmentation framework is designed for large-format high-resolution remote sensing images. In this framework, two branches are used to extract global and local features respectively, and then a global-local feature aggregation module is designed based on the transformer global attention mechanism, which effectively improves the accuracy of semantic segmentation through the fusion of global and local information. Comprehensive comparisons with the state of the arts on two public datasets and ablation studies are conducted to verify its effectiveness.
  • To reduce the large memory occupation that occurs during large-format image processing, a new model training method based on step-by-step training and gradient accumulation is devised. This method divides the entire network into three parts and trains them in three steps by designing a separate classification head for each part. Extensive experiments demonstrate that this method effectively reduces the memory requirements of the GPU during model training.
  • This framework can effectively integrate the mainstream semantic segmentation networks and its strong scalability has been validated using four mainstream backbone networks, including U-Net, ViT, TransUNet, and ConvNeXt V2.
In the following sections, we first present the methodology of our framework. We will then describe the experimental results and analysis, followed by the conclusion.

2. Methodology

2.1. Overview

The network structure of our proposed method is depicted in Figure 1 and comprises five key components: global feature extraction module, feature coding module, feature aggregation module, information decoding module, and lightweight convolution module.
Here’s a brief overview of the functions of each module:
  • Global feature extraction module: This module’s primary function is to extract global contextual information from large-format images. It takes a downsampled low-resolution image as input and produces global semantic features with a consistent size as the output. The training dataset for this module includes low-resolution images and their corresponding labels. In the subsequent sections of this paper, global semantic features refer to the features output by this module.
  • Feature coding module: The feature coding module processes the cropped small image patches. It generates multi-scale convolutional features from the input; the highest-level (smallest-sized) features are fused with the output of the global feature extraction module, while the other multi-scale features are used for skip connections. In the subsequent sections of this paper, local semantic features refer to the features output by this module.
  • Feature aggregation module: This module is responsible for combining the global features produced by the global feature extraction module with the local features generated by the feature coding module. Initially, these two features are concatenated, and then a series of Transformer units process the concatenated features to obtain new aggregated features that incorporate both global and local information.
  • Information decoding module: The information decoding module's primary purpose is to generate a prediction map consistent with the original image size. It takes as input the new features produced by the feature aggregation module and the multi-scale shallow features generated by the feature coding module. Through a series of upsampling, convolution, and skip connection operations, the feature map is gradually restored to the original size, yielding the corresponding prediction map.
  • Lightweight convolution module: This module stitches the patch-level feature maps back together at the original resolution; the stitching is the inverse of the partition (cropping) operation in the feature coding module and produces a feature map of the same size as the original image. A simple stack of lightweight convolutions then restores the prediction map at the full image size.

2.2. Global Feature Extraction Module

The global feature extraction module is primarily used to extract global semantic features from large-format input images. In this paper, we first downscale the large-format images into low-resolution small-size images and also downscale the corresponding labels to the same small size. Then, we select an appropriate semantic segmentation model to extract global semantic features. Here we have chosen TransUNet [30], a classic semantic segmentation network composed of three components: an encoder, an attention module, and a decoder. The encoder uses the ResNet-50 network to generate multi-scale feature maps whose heights and widths are 1/2, 1/4, 1/8, and 1/16 of the small-size image's dimensions, respectively. The attention module consists of 12 Transformer units connected in series that process the encoder's final output feature map, whose height and width are 1/16 of the small-size image; this yields an enhanced feature map rich in contextual information. The decoder consists of a series of convolutional upsampling and skip connection units and employs a U-Net-like strategy to progressively restore the feature map layer by layer. The final output of TransUNet is a pixel-by-pixel semantic feature map that matches the size of the low-resolution small-size image.
Subsequently, to fuse with the local features output by the subsequent feature coding module, we divide the global semantic features into 64 blocks, an operation we refer to as the partition operation. Finally, to fuse with the highest-level features output by the feature coding module, we further resize each block of the global semantic features to the same size as those highest-level features, i.e., 1/16 of the small-size image's dimensions.
The equation for the global feature extraction module is summarized as follows:
x_g = Resize(Partition(TransUNet(Resize(x)))),
where x denotes the large-format input image, Resize represents the operation of downscaling the image, TransUNet represents the network structure used by the global feature extraction module in this paper, Partition stands for the partition operation, and x_g represents the global feature of the final output.
It should be noted that, to better utilize the global semantic information extracted by the global feature extraction module, the high-dimensional semantic features, rather than the semantic segmentation results, are used as input to the subsequent feature aggregation module. To alleviate the computational load and GPU memory usage when training on large-format images, we train this module in advance using the downsampled images and the corresponding labels. During the training of the entire proposed network, the parameters of the global feature extraction module are fixed and do not participate in backpropagation or gradient updates, greatly reducing the GPU memory consumed by the network.
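For concreteness, the following PyTorch sketch shows how this branch could be implemented. It is a minimal illustration under stated assumptions rather than the authors' code: global_net stands in for the pretrained TransUNet, the partition helper and the 8 × 8 grid (64 blocks) follow the description above, the 16 × 16 target size assumes 256 × 256 patches, and the frozen parameters are emulated with torch.no_grad().

```python
import torch
import torch.nn.functional as F


def partition(feat, grid=8):
    """Split a feature map of shape (B, C, H, W) into grid x grid spatial blocks."""
    b, c, h, w = feat.shape
    ph, pw = h // grid, w // grid
    blocks = feat.unfold(2, ph, ph).unfold(3, pw, pw)          # (B, C, grid, grid, ph, pw)
    return blocks.permute(0, 2, 3, 1, 4, 5).reshape(b * grid * grid, c, ph, pw)


def global_branch(image, global_net, low_res=256, grid=8, target_hw=16):
    """x_g = Resize(Partition(TransUNet(Resize(x)))), with global_net standing in for TransUNet."""
    small = F.interpolate(image, size=(low_res, low_res), mode="bilinear", align_corners=False)
    with torch.no_grad():                                      # the global branch is frozen (Section 2.2)
        g = global_net(small)                                  # (B, C, low_res, low_res) semantic features
    blocks = partition(g, grid)                                # 64 blocks per image
    return F.interpolate(blocks, size=(target_hw, target_hw), mode="bilinear", align_corners=False)
```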

2.3. Feature Coding Module

The feature coding module crops the large-format image into a series of small-size image patches, which are then processed by a convolutional neural network to extract the multi-scale convolutional features of each patch.
The input of the feature coding module is the cropped image patches (e.g., a large-format image is divided into 64 patches), and ResNet-50 is used as the network structure. The output is multi-scale feature maps with dimensions of 1/2, 1/4, 1/8, and 1/16 of the original image, respectively. The equation for the feature coding module is described as follows:
[x_l1, x_l2, x_l3, x_l4] = ResNet(Partition(x)),
where x represents the large-format input image, Partition stands for the partition operation, ResNet represents the network structure adopted by the feature coding module in this paper, and [x_l1, x_l2, x_l3, x_l4] represents the final output of four local features at different scales.
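A possible PyTorch realization of this module is sketched below, assuming the standard torchvision ResNet-50 (so the channel counts 64/256/512/1024 are those of torchvision, not necessarily the authors' exact configuration); the input patches are assumed to be produced by a partition/cropping step such as the helper shown in Section 2.2.

```python
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights


class FeatureCoding(nn.Module):
    """ResNet-50 encoder returning local features at 1/2, 1/4, 1/8, and 1/16 of the patch size."""

    def __init__(self, pretrained=True):
        super().__init__()
        net = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1 if pretrained else None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)   # stride 2  -> 1/2
        self.stage1 = nn.Sequential(net.maxpool, net.layer1)      # stride 4  -> 1/4
        self.stage2 = net.layer2                                  # stride 8  -> 1/8
        self.stage3 = net.layer3                                  # stride 16 -> 1/16

    def forward(self, patch):
        x_l1 = self.stem(patch)       # (B, 64,   H/2,  W/2)
        x_l2 = self.stage1(x_l1)      # (B, 256,  H/4,  W/4)
        x_l3 = self.stage2(x_l2)      # (B, 512,  H/8,  W/8)
        x_l4 = self.stage3(x_l3)      # (B, 1024, H/16, W/16)
        return [x_l1, x_l2, x_l3, x_l4]
```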

2.4. Feature Aggregation Module

The feature aggregation module primarily serves to integrate both global and local features, resulting in a new feature that encompasses local and global information. The global features are derived from the global feature extraction module, while the local features are obtained from the feature coding module. The module comprises 6 transformer units within the network architecture, with the detailed computation described as follows:
x_gl = Transformers(Concat(x_l4, x_g)),
where x_g represents the global feature output by the global feature extraction module, x_l4 represents the last layer of local features output by the feature coding module, Concat stands for the feature concatenation operation, which connects the two sets of features along the channel dimension to form a new, larger set of features, Transformers represents the network structure adopted by the feature aggregation module in this paper, which consists of 6 multi-head Transformer units, and x_gl represents the new features after feature aggregation. The Transformer unit used here is the same as that of TransUNet [30] in Section 2.2.
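The sketch below illustrates one plausible way to implement this aggregation in PyTorch; the channel counts, embedding dimension, and head count are illustrative assumptions, while the channel-wise concatenation followed by a six-unit Transformer stack follows the equation above.

```python
import torch
import torch.nn as nn


class FeatureAggregation(nn.Module):
    """x_gl = Transformers(Concat(x_l4, x_g)): fuse local and global features with 6 Transformer units."""

    def __init__(self, local_ch=1024, global_ch=256, dim=768, depth=6, heads=12):
        super().__init__()
        self.embed = nn.Conv2d(local_ch + global_ch, dim, kernel_size=1)   # project the concatenation
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)  # 6 multi-head units
        self.out = nn.Conv2d(dim, local_ch, kernel_size=1)

    def forward(self, x_l4, x_g):
        fused = torch.cat([x_l4, x_g], dim=1)           # Concat along the channel dimension
        tokens = self.embed(fused)                      # (B, dim, h, w)
        b, c, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)         # (B, h*w, dim) token sequence
        seq = self.transformer(seq)                     # global self-attention over all positions
        x_gl = seq.transpose(1, 2).reshape(b, c, h, w)  # back to a spatial feature map
        return self.out(x_gl)
```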

2.5. Information Decoding Module

The information decoding module takes two inputs: the high-order features output by the feature aggregation module and the multi-scale features output by the feature coding module. It decodes the high-order features layer by layer and outputs a feature map matching the size of the input image patch. This module is mainly composed of convolutional upsampling blocks and skip connection blocks. The convolutional upsampling block consists of convolution and upsampling operations and is mainly used to increase the size of the feature map. The skip connection block is implemented by the concatenation operation; it connects the multi-scale features output by the feature coding module and improves the detail of the feature map by introducing low-level features. The specific calculation process is as follows:
y_1 = Concat(x_l3, Conv_up(x_gl)),
y_2 = Concat(x_l2, Conv_up(y_1)),
y_patch = Concat(x_l1, Conv_up(y_2)),
where x_gl represents the new features after feature aggregation, x_l1, x_l2, and x_l3 represent the multi-scale shallow features output by the feature coding module, which are used to gradually restore the feature map to its original size, Conv_up represents the convolution and upsampling operations, Concat stands for the feature concatenation operation, and y_patch represents the new feature after information decoding, whose size is the same as the image patch.
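A rough PyTorch sketch of these three decoding steps is given below. The ConvUp block and the channel configuration of up3/up2/up1 are assumptions; as noted in the comments, a final upsampling and classification head (not shown) would bring y_patch to the full patch resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvUp(nn.Module):
    """Conv_up block: 3x3 convolution followed by 2x bilinear upsampling."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch),
                                  nn.ReLU(inplace=True))

    def forward(self, x):
        return F.interpolate(self.conv(x), scale_factor=2, mode="bilinear", align_corners=False)


def decode(x_gl, x_l1, x_l2, x_l3, up3, up2, up1):
    """Apply the three decoding equations above; up3/up2/up1 are ConvUp blocks with matching channels."""
    y_1 = torch.cat([x_l3, up3(x_gl)], dim=1)      # 1/16 -> 1/8, fused with x_l3
    y_2 = torch.cat([x_l2, up2(y_1)], dim=1)       # 1/8  -> 1/4, fused with x_l2
    y_patch = torch.cat([x_l1, up1(y_2)], dim=1)   # 1/4  -> 1/2, fused with x_l1
    return y_patch  # a final upsampling/classification head (assumed) restores the full patch resolution
```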

2.6. Lightweight Convolution Module

The lightweight convolution module is primarily used to restore the prediction map to the same size as the original image and improve the segmentation result at the tile boundary of the image. Since convolution of large-format feature maps consumes a significant amount of GPU memory, this module consists of only a few simple convolutional layers. Specifically, this module combines the small patch-size feature map into a large-format feature map that is consistent with the original large-format image through the stitching operation. Finally, it obtains the final prediction map through the convolution operation. The calculation process is as follows:
y = Conv(Union(y_patch)),
where Union represents the feature stitching operation, which is the inverse of the Partition operation in the feature coding module, and Conv represents the convolution operation.
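Below is a minimal sketch of the Union operation and the lightweight convolution, mirroring the partition helper from Section 2.2; the two small convolution layers are an illustrative choice, since the paper only states that a few simple convolutional layers are used.

```python
import torch.nn as nn


def union(patch_feats, grid=8):
    """Inverse of Partition: stitch (B*grid*grid, C, ph, pw) patches into (B, C, grid*ph, grid*pw)."""
    n, c, ph, pw = patch_feats.shape
    b = n // (grid * grid)
    x = patch_feats.reshape(b, grid, grid, c, ph, pw)
    x = x.permute(0, 3, 1, 4, 2, 5)                 # (B, C, grid, ph, grid, pw)
    return x.reshape(b, c, grid * ph, grid * pw)


class LightweightConv(nn.Module):
    """y = Conv(Union(y_patch)): a few cheap convolutions on the stitched full-size feature map."""

    def __init__(self, in_ch, num_classes=1, grid=8):
        super().__init__()
        self.grid = grid
        self.conv = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(32, num_classes, 1))

    def forward(self, y_patch):
        return self.conv(union(y_patch, self.grid))
```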

2.7. Training Process

Training our model on large-format images demands significant GPU memory and computational power, which would normally necessitate high-performance GPUs and makes direct training on a regular GPU difficult. To address this challenge, we break the training process down into three distinct steps, as shown in Figure 2. Details are as follows.
Step 1: Training the Global Feature Extraction Module
In the first step, we train the global feature extraction module using downsampled images and their corresponding labels. The training process aligns with that of a traditional semantic segmentation network.
Step 2: Training the Feature Encoding, Feature Aggregation, and Information Decoding Modules
Building upon the fixed parameters of the global feature extraction module, we proceed to train the following three modules: the feature encoding module, the feature aggregation module, and the information decoding module. To facilitate this training, we introduce modifications to the information decoding module, incorporating a classification head. The classification head serves the dual purpose of generating classification results and computing the loss. In this step, we scale the input large-sized image down to a smaller 256 × 256 image, and leverage the global feature extraction model to extract global features. Subsequently, these three modules are trained, with inputs comprising global feature information, small image patches, and their corresponding labels, while the output yields prediction results determined by the classification head.
During the training process, we encountered high GPU memory consumption. To mitigate this issue, we implemented a gradient-accumulation training approach, following these specific steps: (1) Dividing the input image patches, corresponding labels, and global features into several small batches. (2) Performing backward propagation for each small batch, but refraining from updating the model parameters. (3) Updating the model parameters only after all batches have been processed.
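A compact PyTorch sketch of this gradient-accumulation loop is shown below, assuming a callable network that bundles the three trainable modules plus the temporary classification head, and a data loader that yields image patches, labels, and the precomputed global features; it is an illustration of the strategy, not the authors' training script.

```python
import torch


def train_step2_epoch(network, optimizer, loader, accum_steps=4, device="cuda"):
    """One epoch of Step 2 with gradient accumulation: backpropagate per micro-batch,
    update parameters only after accum_steps micro-batches have been processed."""
    criterion = torch.nn.BCEWithLogitsLoss()             # binary cross-entropy on logits
    optimizer.zero_grad()
    for i, (patches, labels, global_feats) in enumerate(loader):
        patches = patches.to(device)
        labels = labels.to(device)
        global_feats = global_feats.to(device)           # precomputed by the frozen global branch
        logits = network(patches, global_feats)          # feature coding + aggregation + decoding + head
        loss = criterion(logits, labels) / accum_steps   # scale so the summed gradient matches a large batch
        loss.backward()                                  # gradients accumulate across micro-batches
        if (i + 1) % accum_steps == 0:
            optimizer.step()                             # update only after all micro-batches
            optimizer.zero_grad()
```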
Step 3: Training the Lightweight Convolution Module
In the final step, we train the lightweight convolution module based on the fixed parameters of the global feature extraction module, feature encoding module, feature aggregation module, and information decoding module. We begin by employing the global feature extraction module to extract global features from the downsampled image. Subsequently, we utilize the feature encoding module, feature aggregation module, and information decoding module to obtain feature maps with the same resolution as the original image. Finally, these feature maps are seamlessly stitched together into a large feature map mirroring the original image’s dimensions. This unified feature map is then fed into the lightweight convolution module for parameter training.
This partitioned training approach optimizes GPU memory usage and computational efficiency, making it feasible to train our model for large-format images on GPUs with more constrained resources.

3. Experimental Results and Analysis

3.1. Datasets

To evaluate the effectiveness of our proposed method, we used two datasets for validation, both sourced from the WHU Building Dataset [35], available at http://gpcv.whu.edu.cn/data/building_dataset.html (accessed on 2 March 2023). To differentiate between them, we refer to them as the Aerial Imagery Dataset (AIDS) and the Building Change Detection Dataset (WBDS). Both datasets consist of large-area images and corresponding building labels in Shapefile format. We converted the vector data in Shapefile format into raster binary images with the same resolution as the imagery, with buildings represented in white and the background in black. Due to GPU memory constraints, we were unable to process these large-area images directly. We therefore cropped them into relatively large-format images of 2048 × 2048 pixels, eight times the common 256 × 256 image size in each dimension. The large-format labels were cropped in exactly the same way as the images. Prior to cropping, the datasets were divided into training, validation, and test sets in a 6:2:2 ratio.
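As a simple illustration of the tiling step, the following sketch crops an aligned image/label pair (both assumed to be NumPy arrays, with the label already rasterized) into non-overlapping 2048 × 2048 patches; it is our own utility, not the dataset's official tooling.

```python
import numpy as np


def crop_tiles(image, label, tile=2048):
    """Crop an aligned (image, rasterized label) pair into non-overlapping tile x tile patches."""
    patches = []
    h, w = image.shape[:2]
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            patches.append((image[top:top + tile, left:left + tile],
                            label[top:top + tile, left:left + tile]))
    return patches
```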
AIDS Dataset: Collected in New Zealand, the AIDS dataset comprises more than 22,000 individual buildings. The original area image is a large image that measures 1,560,159 pixels × 517,909 pixels with a spatial resolution of 0.075 m. Cropping the entire area image resulted in 12,940 large-format images with sizes of 2048 × 2048. The dataset was divided into 7764 training sets, 2588 validation sets, and 2588 test sets according to the 6:2:2 ratio.
WBDS Dataset: Covering a substantial portion of Christchurch, New Zealand, the WBDS dataset includes bi-temporal images from 2012 and 2016, each accompanied by corresponding building label data. The original image size is 32,507 pixels × 15,345 pixels with a spatial resolution of 0.2 m. After cropping, 630 large-format images with sizes of 2048 × 2048 were generated. The dataset was divided into 378 training sets, 126 validation sets, and 126 test sets in a 6:2:2 ratio.

3.2. Implementation Details

Data Preprocessing: To enhance the network’s robustness and avoid overfitting, we employed data augmentation techniques such as random rotation, mirror flipping, and adjustments to image color, saturation, and contrast.
Training Details: Our experiments were conducted using the PyTorch framework with CUDA 11.0 on Ubuntu 20. We used an NVIDIA A40 with 46 GB of memory to accelerate model training. The AdamW optimizer was chosen with an initial learning rate of 0.0001, a decay rate of 1 × 10−8, β1 = 0.9, and β2 = 0.999. The batch size varies slightly across steps: 4 for steps 1 and 2, and 2 for step 3. The number of epochs is set to 200. Details are shown in Table 1. For our large-format image semantic segmentation task, we chose binary cross-entropy as the loss function. The global feature extraction module and feature coding module were initialized with ResNet-50 weights pretrained on ImageNet-1k. During network training, to conserve GPU memory, the parameters of the global feature extraction module remained fixed and were not updated, and we employed a gradient accumulation strategy for image patch processing.
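The optimizer and loss configuration described above can be set up roughly as follows; model is only a placeholder module, and interpreting the reported decay rate as AdamW's weight decay is our assumption.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)      # placeholder for the trainable part of the network

# Settings from Table 1; mapping the reported "decay rate" to AdamW's weight_decay is an assumption.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-8, betas=(0.9, 0.999))
criterion = nn.BCEWithLogitsLoss()         # binary cross-entropy for the building/background task
```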
Metrics: To evaluate our method’s performance, we used four common metrics: Intersection over Union (IoU), precision, recall, and F1-score. These metrics are calculated based on true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) as follows:
IoU = TP / (TP + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F1-score = 2 × Precision × Recall / (Precision + Recall).
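These metrics can be computed from binary masks as in the short sketch below (our own utility, not from the paper's code).

```python
import numpy as np


def binary_metrics(pred, target):
    """Compute IoU, precision, recall, and F1 from binary prediction and ground-truth masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    eps = 1e-8                                         # guard against division by zero
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return iou, precision, recall, f1
```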

3.3. Evaluation of GPU Memory Usage

To evaluate GPU memory usage, we monitored GPU memory consumption during model training. We used a single NVIDIA A40 graphics card with 46 GB of memory for the experiments. When training the entire model directly, GPU memory consumption exceeded 46 GB, rendering training infeasible on a single A40. Consequently, we employed a step-by-step training strategy consisting of three steps:
  • Step 1: Training the global feature extraction module with a batch size of 4, resulting in GPU memory consumption of less than 20 GB.
  • Step 2: Training the feature encoding module, feature aggregation module, and information decoding module with a batch size of 1 and GPU memory consumption of 43 GB. Using the gradient accumulation strategy with a batch size of 4 for small image groups reduced GPU memory usage to a maximum of 16 GB.
  • Step 3: Training the lightweight convolution module with a batch size of 2 and GPU memory consumption of 26 GB.
This three-step training strategy substantially reduced GPU memory usage and eased the hardware requirements for model training.

3.4. Ablation Study

To further assess the effectiveness of our proposed framework, we conducted an ablation study. All experimental data and parameter settings remained consistent throughout the study. The ablation study was designed as follows:
  • Assessment of the Proposed Framework’s Effectiveness: We evaluated the performance improvement achieved by incorporating four popular networks (i.e., U-Net, ViT, TransUNet and ConvNeXt V2) as global feature extraction modules in our proposed framework.
  • Evaluation of Different Modules’ Effectiveness: Using TransUNet as the global feature extraction module, we added feature aggregation module and lightweight convolution module sequentially to form three cases, aiming to verify the role and effectiveness of these modules.

3.4.1. Assessment of the Proposed Framework’s Effectiveness

To gauge the effectiveness of our framework, we employed four mainstream networks (i.e., U-Net, ViT, TransUNet and ConvNeXt V2) as global feature extraction modules and compared their performance improvements within our framework on the AIDS and WBDS datasets. Table 2 presents the performance improvements of the proposed framework with different global feature extraction modules.
Table 2 reveals that TransUNet outperforms U-Net, ViT and ConvNeXt V2 on both the AIDS and WBDS datasets. This demonstrates that TransUNet, chosen as our global feature extraction module, is effective. Moreover, adopting our proposed framework yields performance improvements over direct stitching results of U-Net, ViT, TransUNet and ConvNeXt V2, highlighting the effectiveness of our framework in utilizing the global characteristics of large-format images.

3.4.2. Evaluation of Different Modules’ Effectiveness

To assess the individual contributions of each module, we examined the results on the WBDS dataset when various combinations of modules were employed. The different module combinations were as follows:
  • Baseline: The TransUNet network is employed as the baseline for processing small-sized image patches, and the outcomes from these smaller images are subsequently stitched together directly to produce large-format results.
  • +Global Feature Extraction Module (GFEM): TransUNet serves as the global feature extraction module, ResNet-50 acts as the feature coding model, and concatenation operations are employed to replace the feature aggregation module in the global-local feature aggregation process.
  • +Feature Aggregation Module (FAM): We added the feature aggregation module to integrate global and local features, without employing lightweight convolution. The output was the prediction result of image patches, and we stitched these predictions to obtain large-format predictions.
  • +Lightweight Convolution Module (LCM): The lightweight convolution module was introduced to process large-format feature maps stitched from small patches. The small patch feature maps were produced by the feature aggregation module.
Table 3 showcases the results of these different module combinations. It demonstrates that, using TransUNet as the baseline, the model's performance is incrementally enhanced as the global feature extraction module, the feature aggregation module, and the lightweight convolution module are incorporated one by one: each addition raises the IoU by 0.78, 0.91, and 2.49 percentage points, respectively, for a cumulative gain of 4.18 points over the baseline TransUNet. This demonstrates the effectiveness of the modules within the network structure designed in this paper.
To further illustrate the efficacy of the modules used in the ablation study, we selected several images from the dataset to visualize the semantic segmentation outcomes of different module combinations, as depicted in Figure 3. It is evident that adding the global feature extraction module leads to a slight improvement in performance; however, without the feature aggregation module, the enhancement is not substantial. Upon incorporating the feature aggregation module, global and local information are fully integrated, resulting in a significantly improved outcome. The inclusion of the lightweight convolution module effectively enhances the edge areas of buildings. Hence, this further affirms the validity of the modules presented in this paper.

3.5. Comparisons with the State of the Arts

To further validate the effectiveness of the proposed method, we conducted comparative experiments on the AIDS and WBDS datasets against DANet [10], DeepLab v3+ [6], U-Net [3], ViT [26], TransUNet [30], CMTFNet [31] and UANet [24]. To make the experimental results more convincing, except for ViT, we selected ResNet-50 as the feature extraction backbone for all networks; ResNet-50 was initialized with weights pretrained on ImageNet-1k, and ViT was initialized with the ViT-B weights pretrained on ImageNet-21k. Table 4 displays the comparisons on the AIDS and WBDS datasets, with the best results indicated in bold.
Table 4 illustrates that the early CNN-based semantic segmentation networks (such as U-Net and DeepLab v3+) are not satisfactory due to the absence of attention mechanisms. Results from the ViT network, relying solely on self-attention mechanisms, also proved unsatisfactory. However, subsequent networks combining CNNs with attention mechanisms (such as TransUNet, CMTFNet and UANet) notably improved results by integrating various attention modules into CNNs. Among them, despite its relatively simple structure, TransUNet exhibited stable segmentation results. Compared with the state of the arts, the proposed method achieves superior semantic segmentation results on both datasets due to the introduction of large-format global features. Additionally, implementing gradient accumulation strategies introduced some fluctuations in accuracy, but still yielded favorable outcomes. This is primarily because our framework effectively combines the global semantic features from large-format images with the local details of image patches. This integration allows for a more precise handling and extraction of objects across various scales, which notably enhances the accuracy and reliability of semantic segmentation in high-resolution remote sensing images.
To further verify the computational efficiency of the proposed method, we report the training parameters and floating point operations (FLOPs) of the different models on the AIDS dataset in Table 4. Our network is divided into two parts for these statistics: global semantic feature extraction (Part I) and the remaining components (Part II). Part I is TransUNet, which is mainly responsible for extracting global semantic features. Part II comprises our four modules: feature coding, feature aggregation, information decoding, and lightweight convolution. As Table 4 indicates, the supplementary modules for large-format image processing (Part II) add only a modest computational load, thereby preserving computational efficiency.
To further demonstrate the effectiveness of the proposed method, we selected several images from two datasets to visualize the semantic segmentation results of buildings. The results are shown in Figure 4. It can be observed that the segmentation results of the comparison method in some building areas are not ideal, and there are problems such as blurred and incoherent boundaries. The proposed method achieves the best detection results, especially in large building areas, which significantly reduces error detection.

4. Conclusions

In this paper, we have introduced a novel global-local feature aggregation framework for the semantic segmentation of large-format high-resolution remote sensing images. This framework effectively combines the advantages of local information from cropped small-size images and global information from downsampled large-format images. Furthermore, we have implemented strategies such as step-by-step training and gradient accumulation, resulting in a significant reduction in GPU memory consumption.
Extensive experiments on two public datasets have demonstrated that our framework effectively enhances the accuracy and reliability of semantic segmentation for large-format high-resolution remote sensing images. We have also conducted thorough comparisons with state-of-the-art models, including DANet, DeepLab v3+, U-Net, ViT, TransUNet, CMTFNet and UANet, further showcasing the effectiveness of our proposed method. Additionally, our framework exhibits strong scalability, making it adaptable to various mainstream semantic segmentation networks such as U-Net, ViT, TransUNet and ConvNeXt V2.
In our future work, we will explore semantic segmentation models that are tightly integrated with large models and integrate them into our large-format image processing frameworks to meet the needs of more diverse remote sensing image scenarios.

Author Contributions

Conceptualization, S.W., Z.Z. and S.P.; methodology, S.W., Z.Z. and S.P.; validation, Z.Z., S.Y. and W.Z.; writing—original draft preparation, S.W., Z.Z. and W.Z.; writing—review and editing, S.W., S.Y. and S.P.; visualization, S.Y. and W.Z.; supervision, S.W.; project administration, S.W.; funding acquisition, S.W. and S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by National Key Research and Development Program of China under Grant 2021YFC3000400, Ministry of Education of the People’s Republic of China under Grant 22YJC880058, Knowledge Innovation Program of Wuhan-Shuguang Project under Grant 2022010801020281, University-Industry Collaborative Education Program under Grant 230806008021539 and the Fundamental Research Funds for the Central Universities under Grant CCNU22QN011 and CCNU22QN019.

Data Availability Statement

The original data presented in the study are openly available at http://gpcv.whu.edu.cn/data/building_dataset.html, accessed on 13 July 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AIDS Aerial Imagery DataSet
CMFNet Crossmodal Multiscale Fusion Network
Cmt Convolutional neural networks Meet vision Transformers
CMTFNet CNN and Multiscale Transformer Fusion Network
DANet Dual Attention Network
DenseNet Densely Connected Convolutional Network
DRDG Depth-assisted ResiDualGAN
FCN Fully Convolutional Network
FN false negative
FP false positive
GLF-Net A Semantic Segmentation Model Fusing Global and Local Features for High-Resolution Remote Sensing Images
FLOPs Floating Point Operations
GPU Graphics Processing Unit
HSDN High-order Semantic Decoupling Network
IoU Intersection over Union
LANet Local Attention Network
MACU-Net Multiscale skip connected and Asymmetric-Convolution-based U-Net
MLDANets MultiLevel Deformable Attention-aggregated Network
PEG-Net Progressive Edge Guidance Network
PSPNet Pyramid Scene Parsing Network
ResNet Residual Network
SCBANet Semantic Category Balance-Aware involved anti-interference Network
SLU-CNN Self-Learning-Update CNN
SPGAN Semantic-Preserving Generative Adversarial Network
SSCNet Spectral-Spatial Cooperation Network
STransFuse Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation
TN true negative
TopFormer Token Pyramid Transformer for Mobile Semantic Segmentation
TP true positive
UANet Uncertainty-Aware Network
ViT Vision Transformer
WFE Wavelet Feature Enhancement

References

  1. Camps-Valls, G.; Tuia, D.; Bruzzone, L.; Benediktsson, J.A. Advances in Hyperspectral Image Classification: Earth Monitoring with Statistical Learning Methods. IEEE Signal Process. Mag. 2014, 31, 45–54. [Google Scholar] [CrossRef]
  2. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef] [PubMed]
  3. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Cham, Switzerland, 5–9 October 2015; pp. 234–241. [Google Scholar]
  4. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  5. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; pp. 833–851. [Google Scholar]
  6. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  7. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  9. Huang, G.; Liu, Z.; Maaten, L.V.D.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
  10. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3141–3149. [Google Scholar]
  11. Ding, L.; Tang, H.; Bruzzone, L. LANet: Local Attention Embedding to Improve the Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 426–435. [Google Scholar] [CrossRef]
  12. Song, W.; Zhou, X.; Zhang, S.; Wu, Y.; Zhang, P. GLF-Net: A Semantic Segmentation Model Fusing Global and Local Features for High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 4649. [Google Scholar] [CrossRef]
  13. Li, Y.; Liu, Z.; Yang, J.; Zhang, H. Wavelet Transform Feature Enhancement for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2023, 15, 5644. [Google Scholar] [CrossRef]
  14. Zhang, X.; Yu, W.; Pun, M.O. Multilevel Deformable Attention-Aggregated Networks for Change Detection in Bitemporal Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  15. Li, X.; Xu, F.; Yong, X.; Chen, D.; Xia, R.; Ye, B.; Gao, H.; Chen, Z.; Lyu, X. SSCNet: A Spectrum-Space Collaborative Network for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2023, 15, 5610. [Google Scholar] [CrossRef]
  16. Li, R.; Duan, C.; Zheng, S.; Zhang, C.; Atkinson, P.M. MACU-Net for Semantic Segmentation of Fine-Resolution Remotely Sensed Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  17. Zhao, Y.; Guo, P.; Gao, H.; Chen, X. Depth-Assisted ResiDualGAN for Cross-Domain Aerial Images Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  18. Xi, Z.; Meng, Y.; Chen, J.; Deng, Y.; Liu, D.; Kong, Y.; Yue, A. Learning to Adapt Adversarial Perturbation Consistency for Domain Adaptive Semantic Segmentation of Remote Sensing Images. Remote Sens. 2023, 15, 5498. [Google Scholar] [CrossRef]
  19. Li, Y.; Shi, T.; Zhang, Y.; Ma, J. SPGAN-DA: Semantic-Preserved Generative Adversarial Network for Domain Adaptive Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17. [Google Scholar] [CrossRef]
  20. Zheng, C.; Hu, C.; Chen, Y.; Li, J. A Self-Learning-Update CNN Model for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  21. Pan, S.; Tao, Y.; Nie, C.; Chong, Y. PEGNet: Progressive Edge Guidance Network for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 18, 637–641. [Google Scholar] [CrossRef]
  22. Nie, J.; Wang, Z.; Liang, X.; Yang, C.; Zheng, C.; Wei, Z. Semantic Category Balance-Aware Involved Anti-Interference Network for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  23. Zheng, C.; Nie, J.; Wang, Z.; Song, N.; Wang, J.; Wei, Z. High-Order Semantic Decoupling Network for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  24. Li, J.; He, W.; Cao, W.; Zhang, L.; Zhang, H. UANet: An Uncertainty-Aware Network for Building Extraction From Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  27. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 10–17 October 2021; pp. 9992–10002. [Google Scholar]
  28. Zhang, W.; Huang, Z.; Luo, G.; Chen, T.; Wang, X.; Liu, W.; Yu, G.; Shen, C. TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12073–12083. [Google Scholar]
  29. Mohammadian, A.; Ghaderi, F. SiamixFormer: A fully-transformer Siamese network with temporal Fusion for accurate building detection and change detection in bi-temporal remote sensing images. Int. J. Remote Sens. 2023, 44, 3660–3678. [Google Scholar] [CrossRef]
  30. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  31. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
  32. Gao, L.; Liu, H.; Yang, M.; Chen, L.; Wan, Y.; Xiao, Z.; Qian, Y. STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2021, 14, 10990–11003. [Google Scholar] [CrossRef]
  33. Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. CMT: Convolutional Neural Networks Meet Vision Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12165–12175. [Google Scholar]
  34. Ma, X.; Zhang, X.; Pun, M.O. A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3463–3474. [Google Scholar] [CrossRef]
  35. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
Figure 1. Network structure of the proposed method.
Figure 2. Training process.
Figure 3. Visualization of the ablation study on the WBDS dataset. White represents the building area, black represents the background, red represents missed detection, and green represents error detection.
Figure 4. Comparisons with the state of the arts on the AIDS and WBDS datasets. White represents the building area, black represents the background, red represents missed detection, and green represents error detection.
Table 1. Hyperparameters.

Items                  Settings
Optimizer              AdamW
Initial learning rate  0.0001
Decay rate             1 × 10−8
Momentum               β1 = 0.9, β2 = 0.999
Batch size             4 for steps 1 and 2, 2 for step 3
Epochs                 200
Table 2. Performance improvements of the proposed framework with different global feature extraction modules.

Dataset  Network      Direct Stitching (IoU / F1)   Ours (IoU / F1)   Improvement (IoU / F1)
AIDS     ViT          84.29 / 91.51                 -- / --           -- / --
AIDS     U-Net        83.53 / 91.19                 90.00 / 94.73     6.47 / 3.54
AIDS     TransUNet    87.11 / 93.12                 90.12 / 94.80     3.01 / 1.68
AIDS     ConvNeXt V2  87.17 / 93.14                 89.98 / 94.73     2.81 / 1.59
WBDS     ViT          83.78 / 91.56                 91.74 / 95.69     7.96 / 4.13
WBDS     U-Net        85.17 / 91.99                 91.17 / 95.38     6.00 / 3.39
WBDS     TransUNet    87.89 / 93.60                 92.07 / 95.87     4.18 / 2.27
WBDS     ConvNeXt V2  87.81 / 93.51                 89.65 / 94.54     1.84 / 1.03
Table 3. Results of different module combinations on the WBDS dataset. The items with "√" indicate the options that have been adopted. The boldfaced values indicate the optimal result.

TransUNet  +GFEM  +FAM  +LCM   IoU     F1
√                              87.89   93.60
√          √                   88.67   93.99
√          √      √            89.58   94.52
√          √      √     √      92.07   95.87
Table 4. Comparisons of different models on the AIDS and WBDS datasets. The boldfaced values indicate the optimal result. For "Ours", Params and FLOPs are reported as Part I + Part II (see Section 3.5).

Model                         Params (×10^6)   FLOPs (×10^9)   AIDS: IoU / Precision / Recall / F1   WBDS: IoU / Precision / Recall / F1
DANet                         46.19            125.14          79.68 / 88.25 / 89.14 / 88.69         77.95 / 94.07 / 81.98 / 87.61
DeepLab v3+                   40.35            17.36           82.98 / 96.73 / 85.37 / 90.70         85.12 / 95.99 / 88.25 / 91.96
U-Net                         34.53            65.52           83.53 / 96.10 / 86.47 / 91.03         85.17 / 93.55 / 90.43 / 91.97
ViT                           87.80            24.28           84.29 / 96.08 / 87.30 / 91.48         83.78 / 94.79 / 87.82 / 91.17
TransUNet                     93.23            32.23           87.11 / 96.20 / 90.22 / 93.11         87.89 / 94.05 / 93.07 / 93.56
CMTFNet                       30.07            8.56            88.05 / 94.26 / 93.03 / 93.64         86.21 / 90.12 / 95.21 / 92.59
UANet                         26.73            7.45            88.83 / 94.90 / 93.28 / 94.08         88.50 / 92.73 / 95.09 / 93.90
Ours                          93.23 + 30.28    32.23 + 19.93   90.12 / 95.67 / 93.95 / 94.80         92.07 / 96.65 / 95.11 / 95.87
Ours (gradient accumulation)  93.23 + 30.28    32.23 + 19.93   90.83 / 95.58 / 94.80 / 95.19         90.30 / 95.80 / 94.02 / 94.90
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, S.; Zuo, Z.; Yan, S.; Zeng, W.; Pang, S. A Novel Global-Local Feature Aggregation Framework for Semantic Segmentation of Large-Format High-Resolution Remote Sensing Images. Appl. Sci. 2024, 14, 6616. https://doi.org/10.3390/app14156616
