Figure 1.
Swin Transformer backbone. Each Swin Transformer block contains a W-MSA block and an SW-MSA block. Except for the first stage, each stage is preceded by a Patch Merging layer that reduces the feature map size and deepens the feature map depth, generating a multilevel feature representation. The image is divided into nonoverlapping patches by a Patch Partition layer and mapped to C dimensions by a Linear Embedding layer as the input to the backbone network.
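The Patch Partition / Linear Embedding stem and the Patch Merging layer described above can be sketched as follows. This is an illustrative re-implementation, not the authors' code; the 4×4 patch size and C = 96 follow the standard Swin Transformer.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into nonoverlapping 4x4 patches and map them to C dims."""
    def __init__(self, in_ch=3, embed_dim=96, patch=4):
        super().__init__()
        # A strided convolution implements partition + linear embedding in one step.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):            # x: (B, 3, H, W)
        return self.proj(x)          # (B, C, H/4, W/4)

class PatchMerging(nn.Module):
    """Halve the spatial size and double the channel depth between stages."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, C, H, W), H and W even
        # Gather each 2x2 neighborhood into the channel axis, then project 4C -> 2C.
        tl, bl = x[:, :, 0::2, 0::2], x[:, :, 1::2, 0::2]
        tr, br = x[:, :, 0::2, 1::2], x[:, :, 1::2, 1::2]
        x = torch.cat([tl, bl, tr, br], dim=1)           # (B, 4C, H/2, W/2)
        x = x.permute(0, 2, 3, 1)                        # (B, H/2, W/2, 4C)
        return self.reduction(x).permute(0, 3, 1, 2)     # (B, 2C, H/2, W/2)

x = torch.randn(1, 3, 224, 224)
y = PatchEmbed()(x)                  # (1, 96, 56, 56)
z = PatchMerging(96)(y)              # (1, 192, 28, 28)
```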
Figure 2.
(a) Window self-attention. The window self-attention mechanism divides the input feature map into multiple windows within which self-attention is computed; information interaction across windows is achieved with shifted windows. (b) Cross-shaped window self-attention. The cross-shaped self-attention mechanism divides the feature map into vertical and horizontal branches, adjusts the strip-window size on each branch, and computes self-attention within each strip window. Finally, the attention results of the two directions are fused.
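The two partitioning schemes contrasted in the figure can be sketched as the following tensor reshapes (an illustrative sketch, not the authors' code): square windows for (a), horizontal and vertical strips for (b). Self-attention is then computed independently inside each partition.

```python
import torch

def square_windows(x, w):
    """(B, C, H, W) -> (B * H/w * W/w, C, w, w) nonoverlapping square windows."""
    B, C, H, W = x.shape
    x = x.reshape(B, C, H // w, w, W // w, w)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, C, w, w)

def strip_windows(x, sw):
    """Split into horizontal strips of height sw and vertical strips of width sw."""
    B, C, H, W = x.shape
    horiz = x.reshape(B, C, H // sw, sw, W).permute(0, 2, 1, 3, 4).reshape(-1, C, sw, W)
    vert = x.reshape(B, C, H, W // sw, sw).permute(0, 3, 1, 2, 4).reshape(-1, C, H, sw)
    return horiz, vert

x = torch.randn(2, 96, 56, 56)
wins = square_windows(x, 7)      # (2*8*8, 96, 7, 7): attention within each 7x7 window
h, v = strip_windows(x, 7)       # horizontal (16, 96, 7, 56) and vertical (16, 96, 56, 7) strips
```

Each strip spans the full width (or height) of the map, which is why the fused result of the two branches covers a cross-shaped region around every position.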
Figure 3.
Cross-shaped Swin Transformer backbone. The cross-Swin Transformer block consists of an LN layer, cross-shaped window self-attention, and an MLP layer. We reduce the feature map size and deepen the feature map depth with a convolution of kernel size 3 and stride 2 to generate a multiscale representation. Before stage 1, the image is encoded by a convolutional layer with kernel size 7 and stride 4.
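The convolutional stem (kernel 7, stride 4) and inter-stage downsampling (kernel 3, stride 2) can be sketched as below; the channel widths (64, 128) and paddings are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

# Stem before stage 1: kernel 7, stride 4 -> 1/4-resolution encoding of the image.
stem = nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3)
# Between stages: kernel 3, stride 2 -> halved size, deepened channels.
down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 224, 224)
f1 = stem(x)    # (1, 64, 56, 56)
f2 = down(f1)   # (1, 128, 28, 28)
```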
Figure 4.
The overall structure of SaSPE. High-level feature maps contain rich semantic information, while low-level feature maps contain detailed information about the features. The position-aware module establishes bidirectional global position relationships through two branches. The semantic-space optimization module optimizes the semantic embedding space to obtain a spatial attention map that preserves detailed and global information. Cross-channel attention maps are generated adaptively by 1-D convolution to combine the advantages of the two levels of feature maps. (a) Position-aware module. (b) Semantic-space optimization module. (c) Adaptively selected kernel size k.
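The adaptive 1-D cross-channel convolution in (c) can be sketched as follows. This assumes the ECA-Net rule for choosing k from the channel count C (k = |log2(C)/γ + b/γ| rounded to the nearest odd number); the paper may use a different mapping, so treat the formula and the γ = 2, b = 1 defaults as assumptions.

```python
import math
import torch
import torch.nn as nn

def adaptive_kernel_size(channels, gamma=2, b=1):
    """Pick an odd 1-D kernel size that grows with log2 of the channel count."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1          # force an odd kernel size

class ChannelAttention1D(nn.Module):
    """Cross-channel attention via a 1-D conv of adaptively chosen size k."""
    def __init__(self, channels):
        super().__init__()
        k = adaptive_kernel_size(channels)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                 # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))            # global average pool -> (B, C)
        w = self.conv(w.unsqueeze(1))     # 1-D conv across the channel axis
        w = torch.sigmoid(w).squeeze(1)[:, :, None, None]
        return x * w                      # reweight channels

out = ChannelAttention1D(96)(torch.randn(1, 96, 8, 8))
```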
Figure 5.
The visual effects of the semantic-space optimization module. The optimized feature information is more distinct, and both interclass differences and intraclass similarities are significantly enhanced. (a) Input image. (b) Original semantic space. (c) Optimized semantic space.
Figure 6.
Progressive feature fusion decoder. The high-level feature map is downsampled by convolution and upsampled to the same size as the sub-level feature map; high-level and sub-level feature maps are then fused in turn by the SaSPE module, forming a progressive feature fusion.
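The top-down fusion loop can be sketched as below. This is a minimal sketch, not the authors' decoder: the real fusion step is the SaSPE module, for which a plain 3×3 convolution stands in here, and the channel widths mirror typical Swin-T stage dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def progressive_fuse(feats, dims, out_dim=96):
    """feats: stage outputs from low level (large map) to high level (small map)."""
    lateral = [nn.Conv2d(d, out_dim, 1) for d in dims]   # channel reduction per stage
    fuse = nn.Conv2d(out_dim, out_dim, 3, padding=1)     # stand-in for the SaSPE module
    x = lateral[-1](feats[-1])                           # start from the highest level
    for f, lat in zip(reversed(feats[:-1]), reversed(lateral[:-1])):
        # Upsample the running high-level map to the sub-level size, then fuse.
        x = F.interpolate(x, size=f.shape[-2:], mode="bilinear", align_corners=False)
        x = fuse(x + lat(f))
    return x

feats = [torch.randn(1, c, s, s) for c, s in [(96, 56), (192, 28), (384, 14), (768, 7)]]
out = progressive_fuse(feats, [96, 192, 384, 768])       # (1, 96, 56, 56)
```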
Figure 7.
Progressive MLP feature fusion decoder. The high-level feature map is downsampled by convolution and upsampled to the same size as the sub-level feature map, then fused with the low-level feature map to form a progressive feature fusion.
Figure 8.
Multiscale MLP feature fusion decoder. The feature maps of each stage are mapped to the same scale by MLP blocks and input to the SaSPE module, which models the global information at different scales level by level and fuses the multilevel features.
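The scale-unification step can be sketched as follows: each stage's map is projected to a common width (a 1×1 convolution acts as the per-stage MLP here) and resized to a common scale before the multiscale global modeling. The output width and target size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def unify_scales(feats, dims, out_dim=96, size=(56, 56)):
    """Map each stage's feature map to the same width and spatial scale."""
    outs = []
    for f, d in zip(feats, dims):
        f = nn.Conv2d(d, out_dim, 1)(f)      # per-stage MLP block (1x1 projection)
        f = F.interpolate(f, size=size, mode="bilinear", align_corners=False)
        outs.append(f)
    # Stack all stages for the subsequent multiscale global modeling (SaSPE).
    return torch.cat(outs, dim=1)

feats = [torch.randn(1, c, s, s) for c, s in [(96, 56), (192, 28), (384, 14), (768, 7)]]
fused_in = unify_scales(feats, [96, 192, 384, 768])      # (1, 384, 56, 56)
```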
Figure 9.
Multiscale feature extraction decoder. The feature maps of each stage are mapped to the same scale by MLP blocks. The MLA head extracts features from each layer's result, and finally global modeling is performed by the SaSPE module.
Figure 10.
Some samples in the Beijing land use dataset. We show the categories corresponding to the tags.
Figure 11.
Some samples in the Gaofen image dataset (GID). We show the categories corresponding to the tags and the abbreviation for each GID category.
Figure 12.
Comparison of the visualization results of features on the BLU test dataset with different semantic segmentation methods. The blue label indicates “Water” and the gray label indicates “Road”. Observe the highlighted area in the lower right corner, where our model can identify “Road” regions that are obscured by other features, while the other models cannot.
Figure 13.
Comparison of the visualization results of features on the GID test dataset with different semantic segmentation methods. The real labels in the highlighted area indicate RV (River) and TL (Traffic Land), respectively.
Figure 14.
Visualization of the SaSPE intermediate layer. (a) Image. (b) Ground truth. (c) Input of the position-aware module. (d) Output of the position-aware module. (e) Semantic-space optimization results. (f) Output of the SaSPE module. (g) Prediction.
Figure 15.
Visualization of the effective receptive field of the Swin Transformer and the cross-shaped Swin Transformer at each stage. From left to right: stages 1 to 4. Red areas indicate a high impact on the target value and blue areas a low impact, i.e., the nonblue area indicates the effective receptive field at the current stage.
Figure 16.
Comparison of visualization results of the PFF decoder, the PMLP decoder, the MSMLP decoder, and the MFE decoder on the BLU test set. The yellow label indicates “Agricultural”, the blue label indicates “Water”, and the gray label indicates “Road”.
Table 1.
SaSPE module ablation experiments on the BLU dataset which uses the PFF decoder and the swin-T backbone.
(Position-Aware and Semantic-Space Optimization are the two components of the SaSPE module.)

| Decoder | Backbone | Position-Aware | Semantic-Space Optimization | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| PFF | swin-T | √ | | 95.98 | 83.39 | 72.39 |
| PFF | swin-T | | √ | 95.89 | 84.88 | 71.85 |
| PFF | swin-T | √ | √ | 96.05 | 83.84 | 72.99 |
Table 2.
SaSPE module ablation experiments on the GID dataset which uses the PFF decoder and the swin-T backbone.
(Position-Aware and Semantic-Space Optimization are the two components of the SaSPE module.)

| Decoder | Backbone | Position-Aware | Semantic-Space Optimization | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|
| PFF | swin-T | √ | | 97.84 | 71.28 | 55.41 |
| PFF | swin-T | | √ | 97.88 | 70.68 | 54.91 |
| PFF | swin-T | √ | √ | 97.89 | 67.35 | 55.73 |
Table 3.
Ablation experiments with window self-attention and cross-shaped window self-attention on the BLU dataset, with results on four decoders and using the SaSPE module by default.
(Window Self-Attention and Cross-Shaped Window Self-Attention indicate which backbone is used.)

| Decoder | FLOPs (GMac) | Params (M) | Window Self-Attention | Cross-Shaped Window Self-Attention | SaSPE Module | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| PFF | 30.17 | 30.93 | √ | | √ | 96.05 | 83.84 | 72.99 |
| PFF | 24.16 | 23.48 | | √ | √ | 96.18 | 84.49 | 73.84 |
| PMLP | 28.94 | 28.51 | √ | | √ | 95.96 | 83.66 | 72.71 |
| PMLP | 23.56 | 22.34 | | √ | √ | 96.15 | 84.28 | 73.56 |
| MSMLP | 28.58 | 28.19 | √ | | √ | 95.95 | 83.57 | 72.56 |
| MSMLP | 23.40 | 22.19 | | √ | √ | 96.12 | 84.34 | 73.66 |
| MFE | 38.15 | 28.77 | √ | | √ | 95.94 | 83.65 | 72.71 |
| MFE | 26.67 | 22.45 | | √ | √ | 96.16 | 84.45 | 73.80 |
Table 4.
Ablation experiments with window self-attention and cross-shaped window self-attention on the GID dataset, with results on four decoders and using the SaSPE module by default.
(Window Self-Attention and Cross-Shaped Window Self-Attention indicate which backbone is used.)

| Decoder | FLOPs (GMac) | Params (M) | Window Self-Attention | Cross-Shaped Window Self-Attention | SaSPE Module | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|
| PFF | 30.17 | 30.93 | √ | | √ | 97.89 | 67.35 | 55.73 |
| PFF | 24.16 | 23.48 | | √ | √ | 97.89 | 68.81 | 56.87 |
| PMLP | 28.94 | 28.51 | √ | | √ | 97.88 | 68.61 | 56.32 |
| PMLP | 23.56 | 22.34 | | √ | √ | 97.87 | 73.14 | 56.73 |
| MSMLP | 28.58 | 28.19 | √ | | √ | 97.89 | 68.54 | 56.37 |
| MSMLP | 23.40 | 22.19 | | √ | √ | 97.93 | 70.03 | 57.63 |
| MFE | 38.15 | 28.77 | √ | | √ | 97.85 | 68.42 | 56.20 |
| MFE | 26.67 | 22.45 | | √ | √ | 97.94 | 69.56 | 56.96 |
Table 5.
Experimental results of four decoders on the BLU dataset using the swin-T backbone and SaSPE module by default.
| Decoder | Backbone | SaSPE Module | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|
| PFF | swin-T | √ | 96.05 | 83.84 | 72.99 |
| PMLP | swin-T | √ | 95.96 | 83.66 | 72.71 |
| MSMLP | swin-T | √ | 95.95 | 83.57 | 72.56 |
| MFE | swin-T | √ | 95.94 | 83.65 | 72.71 |
Table 6.
Experimental results of four decoders on the GID dataset using the swin-T backbone and SaSPE module by default.
| Decoder | Backbone | SaSPE Module | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|
| PFF | swin-T | √ | 97.89 | 67.35 | 55.73 |
| PMLP | swin-T | √ | 97.88 | 68.61 | 56.32 |
| MSMLP | swin-T | √ | 97.89 | 68.54 | 56.37 |
| MFE | swin-T | √ | 97.85 | 68.42 | 56.20 |
Table 7.
Ablation experiments with various combinations of decoders, backbones, and the SaSPE module on the BLU dataset.
(Backbone: Window Self-Attention / Cross-Shaped Window Self-Attention; SaSPE module: Position-Aware / Semantic-Space Optimization.)

| Decoder | Window Self-Attention | Cross-Shaped Window Self-Attention | Position-Aware | Semantic-Space Optimization | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|
| PFF | √ | | √ | | 95.98 | 83.39 | 72.39 |
| PFF | √ | | √ | √ | 96.05 | 83.84 | 72.99 |
| PFF | | √ | √ | | 96.12 | 84.26 | 73.55 |
| PFF | | √ | √ | √ | 96.18 | 84.49 | 73.84 |
| PMLP | √ | | √ | | 95.91 | 83.45 | 72.41 |
| PMLP | √ | | √ | √ | 95.96 | 83.66 | 72.71 |
| PMLP | | √ | √ | | 96.09 | 84.11 | 73.28 |
| PMLP | | √ | √ | √ | 96.15 | 84.28 | 73.56 |
| MSMLP | √ | | √ | | 95.92 | 83.44 | 72.38 |
| MSMLP | √ | | √ | √ | 95.95 | 83.57 | 72.56 |
| MSMLP | | √ | √ | | 96.05 | 84.09 | 73.23 |
| MSMLP | | √ | √ | √ | 96.12 | 84.34 | 73.66 |
| MFE | √ | | √ | | 95.93 | 83.52 | 72.52 |
| MFE | √ | | √ | √ | 95.94 | 83.65 | 72.71 |
| MFE | | √ | √ | | 96.06 | 84.30 | 73.56 |
| MFE | | √ | √ | √ | 96.16 | 84.45 | 73.80 |
Table 8.
Ablation experiments with various combinations of decoders, backbones, and the SaSPE module on the GID dataset.
(Backbone: Window Self-Attention / Cross-Shaped Window Self-Attention; SaSPE module: Position-Aware / Semantic-Space Optimization.)

| Decoder | Window Self-Attention | Cross-Shaped Window Self-Attention | Position-Aware | Semantic-Space Optimization | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|
| PFF | √ | | √ | | 97.84 | 71.28 | 55.41 |
| PFF | √ | | √ | √ | 97.89 | 67.35 | 55.73 |
| PFF | | √ | √ | | 97.93 | 72.76 | 56.71 |
| PFF | | √ | √ | √ | 97.89 | 68.81 | 56.87 |
| PMLP | √ | | √ | | 97.86 | 67.48 | 55.44 |
| PMLP | √ | | √ | √ | 97.88 | 68.61 | 56.32 |
| PMLP | | √ | √ | | 97.87 | 72.29 | 56.63 |
| PMLP | | √ | √ | √ | 97.87 | 73.14 | 56.73 |
| MSMLP | √ | | √ | | 97.91 | 68.12 | 55.99 |
| MSMLP | √ | | √ | √ | 97.89 | 68.54 | 56.37 |
| MSMLP | | √ | √ | | 97.92 | 73.06 | 56.53 |
| MSMLP | | √ | √ | √ | 97.93 | 70.03 | 57.63 |
| MFE | √ | | √ | | 97.90 | 67.66 | 55.80 |
| MFE | √ | | √ | √ | 97.85 | 68.42 | 56.20 |
| MFE | | √ | √ | | 97.88 | 68.64 | 56.65 |
| MFE | | √ | √ | √ | 97.94 | 69.56 | 56.96 |
Table 9.
Comparison of experiment results with other models on the BLU dataset.
(The Background through Water columns report per-class IoU (%).)

| Methods | Backbone | Background | Built-Up | Vegetation | Road | Agricultural | Water | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| DANet [35] | swin-T | 49.02 | 69.18 | 80.12 | 43.09 | 78.44 | 72.79 | 94.84 | 78.15 | 65.44 |
| SE-Net [14] | swin-T | 59.36 | 78.16 | 83.61 | 49.79 | 79.82 | 77.47 | 95.87 | 82.65 | 71.37 |
| CBAM [19] | swin-T | 59.27 | 78.20 | 83.78 | 50.84 | 80.05 | 77.00 | 95.90 | 82.78 | 71.52 |
| SKNet [20] | swin-T | 59.92 | 79.37 | 83.90 | 52.35 | 80.03 | 77.05 | 95.96 | 83.23 | 72.10 |
| ECA-Net [15] | swin-T | 59.67 | 78.35 | 83.71 | 51.03 | 80.49 | 76.06 | 95.92 | 82.82 | 71.55 |
| EMANet [36] | swin-T | 59.55 | 79.15 | 83.36 | 52.69 | 79.44 | 75.70 | 95.87 | 82.94 | 71.65 |
| UperNet [37] | swin-T | 59.14 | 78.86 | 83.97 | 52.06 | 80.23 | 77.19 | 95.95 | 82.45 | 71.91 |
| Efficient Transformer [40] | Efficient-T | 59.51 | 78.92 | 83.78 | 55.06 | 79.96 | 78.27 | 95.94 | 83.62 | 72.58 |
| WicoNet [38] | ResNet50+transformer | 55.08 | 76.79 | 81.22 | 49.55 | 75.02 | 74.53 | 95.17 | 80.82 | 68.70 |
| SETR-PUP [39] | swin-T | 60.56 | 78.77 | 83.54 | 52.75 | 79.92 | 77.35 | 95.91 | 83.29 | 72.15 |
| SETR-MLA [39] | swin-T | 60.90 | 78.39 | 83.91 | 53.56 | 79.58 | 77.76 | 95.96 | 83.45 | 72.35 |
| SegFormer [41] | MiT-B2 | 59.96 | 80.14 | 84.19 | 54.80 | 80.36 | 78.06 | 96.04 | 83.83 | 72.92 |
| PFF | cross-shaped swin-T | 61.35 | 80.21 | 84.60 | 56.60 | 81.52 | 78.73 | 96.18 | 84.49 | 73.84 |
| PMLP | cross-shaped swin-T | 60.87 | 79.85 | 84.57 | 55.86 | 81.42 | 78.78 | 96.15 | 84.28 | 73.56 |
| MSMLP | cross-shaped swin-T | 61.00 | 80.00 | 84.38 | 55.63 | 80.96 | 79.99 | 96.12 | 84.34 | 73.66 |
| MFE | cross-shaped swin-T | 61.59 | 80.39 | 84.34 | 55.88 | 81.36 | 79.21 | 96.16 | 84.45 | 73.80 |
Table 10.
Comparison experiment results with other models on the GID dataset.
(The BG through PN columns report per-class IoU (%).)

| Methods | Backbone | BG | IDL | UR | RR | TL | PF | IL | DC | GP | AW | SL | NG | AG | RV | LK | PN | Acc (%) | Mean F1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DANet [35] | swin-T | 72.33 | 58.40 | 68.37 | 48.66 | 38.25 | 78.92 | 73.32 | 39.65 | 16.58 | 67.42 | 1.73 | 36.99 | 7.47 | 91.88 | 72.55 | 79.75 | 97.76 | 64.89 | 53.27 |
| SE-Net [14] | swin-T | 73.24 | 59.20 | 67.83 | 47.17 | 52.12 | 79.89 | 74.08 | 38.75 | 25.20 | 68.48 | 7.04 | 36.88 | 10.82 | 93.54 | 76.91 | 78.88 | 97.86 | 67.63 | 55.63 |
| CBAM [19] | swin-T | 73.71 | 59.58 | 69.81 | 48.96 | 48.94 | 77.80 | 76.01 | 46.57 | 18.86 | 68.39 | 0.00 | 37.23 | 7.16 | 94.25 | 73.42 | 79.24 | 97.92 | 70.76 | 55.00 |
| SKNet [20] | swin-T | 74.50 | 58.87 | 68.63 | 47.00 | 51.51 | 79.29 | 75.08 | 38.12 | 19.71 | 68.47 | 7.18 | 27.97 | 10.22 | 94.66 | 76.98 | 81.75 | 97.74 | 66.65 | 55.00 |
| ECA-Net [15] | swin-T | 74.27 | 60.43 | 69.91 | 51.15 | 51.08 | 79.66 | 74.63 | 41.69 | 15.93 | 67.12 | 0.00 | 37.69 | 14.57 | 95.02 | 69.99 | 82.35 | 97.94 | 71.28 | 55.34 |
| EMANet [36] | swin-T | 73.95 | 58.09 | 69.17 | 50.65 | 51.98 | 79.03 | 73.80 | 45.90 | 23.22 | 65.74 | 6.78 | 38.78 | 13.99 | 93.41 | 74.15 | 80.32 | 97.90 | 68.32 | 56.19 |
| UperNet [37] | swin-T | 73.21 | 57.04 | 69.74 | 42.96 | 51.73 | 79.89 | 70.41 | 41.88 | 25.67 | 66.65 | 8.32 | 42.43 | 10.71 | 93.59 | 74.21 | 81.34 | 97.84 | 67.80 | 55.61 |
| Efficient Transformer [40] | Efficient-T | 71.98 | 56.37 | 68.06 | 52.46 | 52.18 | 82.65 | 74.17 | 40.38 | 21.56 | 68.32 | 1.89 | 45.62 | 13.03 | 94.84 | 64.35 | 76.08 | 97.81 | 67.22 | 55.25 |
| WicoNet [38] | ResNet50+transformer | 69.54 | 40.81 | 65.01 | 49.47 | 47.28 | 79.99 | 67.21 | 23.29 | 11.87 | 64.49 | 7.11 | 46.79 | 13.81 | 93.29 | 61.17 | 82.15 | 97.52 | 63.74 | 51.46 |
| SETR-PUP [39] | swin-T | 73.49 | 58.94 | 69.00 | 46.15 | 51.09 | 76.64 | 73.83 | 30.64 | 26.98 | 69.79 | 0.00 | 39.13 | 8.35 | 92.47 | 72.87 | 79.92 | 97.89 | 70.37 | 54.33 |
| SETR-MLA [39] | swin-T | 72.54 | 55.84 | 65.39 | 46.09 | 52.53 | 81.24 | 74.80 | 45.41 | 25.84 | 66.44 | 13.81 | 33.36 | 5.87 | 93.48 | 74.98 | 82.10 | 97.79 | 67.70 | 55.61 |
| SegFormer [41] | MiT-B2 | 72.97 | 58.84 | 69.37 | 51.77 | 53.49 | 82.22 | 73.76 | 39.46 | 17.48 | 67.24 | 11.14 | 43.28 | 12.79 | 93.09 | 74.74 | 76.87 | 97.86 | 68.26 | 56.16 |
| PFF | cross-shaped swin-T | 73.36 | 60.80 | 68.74 | 52.38 | 54.62 | 81.50 | 74.61 | 36.96 | 25.59 | 66.64 | 5.08 | 39.08 | 17.53 | 93.18 | 82.71 | 77.17 | 97.89 | 68.81 | 56.87 |
| PMLP | cross-shaped swin-T | 73.48 | 60.36 | 67.66 | 52.54 | 53.65 | 81.75 | 74.32 | 40.26 | 23.43 | 67.92 | 0.00 | 41.64 | 21.91 | 92.64 | 79.65 | 76.41 | 97.87 | 73.14 | 56.73 |
| MSMLP | cross-shaped swin-T | 73.71 | 61.67 | 69.62 | 52.50 | 54.88 | 80.87 | 75.54 | 41.74 | 25.80 | 67.12 | 7.18 | 43.11 | 25.30 | 92.51 | 72.76 | 77.81 | 97.93 | 70.03 | 57.63 |
| MFE | cross-shaped swin-T | 73.52 | 63.00 | 70.88 | 53.27 | 53.86 | 81.59 | 74.35 | 35.30 | 29.13 | 67.46 | 10.51 | 42.13 | 21.21 | 93.32 | 64.98 | 76.77 | 97.94 | 69.56 | 56.96 |
Table 11.
Comparison of model parameters, FLOPs, accuracy, and mIoU.
| Methods | Backbone | BLU mIoU (%) | GID mIoU (%) | Params (M) | FLOPs (GMac) |
|---|---|---|---|---|---|
| DANet [35] | swin-T | 65.44 | 53.27 | 30.89 | 25.54 |
| SE-Net [14] | swin-T | 71.37 | 55.63 | 27.99 | 26.27 |
| CBAM [19] | swin-T | 71.52 | 55.00 | 28.02 | 26.27 |
| SKNet [20] | swin-T | 72.10 | 55.00 | 29.34 | 29.10 |
| ECA-Net [15] | swin-T | 71.55 | 55.34 | 28.00 | 26.27 |
| EMANet [36] | swin-T | 71.65 | 56.19 | 28.38 | 27.18 |
| UperNet [37] | swin-T | 71.91 | 55.61 | 30.35 | 31.42 |
| Efficient Transformer [40] | Efficient-T | 72.58 | 55.25 | 29.92 | 23.06 |
| WicoNet [38] | ResNet50+transformer | 68.70 | 51.46 | 39.07 | 83.56 |
| SETR-PUP [39] | swin-T | 72.15 | 54.33 | 27.30 | 28.10 |
| SETR-MLA [39] | swin-T | 72.35 | 55.61 | 35.98 | 28.33 |
| SegFormer [41] | MiT-B2 | 72.92 | 56.16 | 27.35 | 56.70 |
| PFF | cross-shaped swin-T | 73.84 | 56.87 | 23.48 | 24.16 |
| PMLP | cross-shaped swin-T | 73.56 | 56.73 | 22.34 | 23.56 |
| MSMLP | cross-shaped swin-T | 73.66 | 57.63 | 22.19 | 23.40 |
| MFE | cross-shaped swin-T | 73.80 | 56.96 | 22.45 | 26.67 |