Article

A Local–Global Framework for Semantic Segmentation of Multisource Remote Sensing Images

1 School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
2 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430070, China
3 School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(1), 231; https://doi.org/10.3390/rs15010231
Submission received: 17 November 2022 / Revised: 27 December 2022 / Accepted: 27 December 2022 / Published: 31 December 2022
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Recently, deep learning has been widely used in segmentation tasks of remote sensing images. However, most existing deep learning methods focus on local contextual information and have a limited field of perception, which makes it difficult to capture the long-range contextual features of objects at large scales from very-high-resolution (VHR) images. In this paper, we present a novel Local–global Framework consisting of a dual-source fusion network and local–global transformer modules, which efficiently utilizes features extracted from multiple sources and fully captures features of local and global regions. The dual-source fusion network is an encoder designed to extract features from multiple sources such as spectra, synthetic aperture radar, and elevation; it selectively fuses features from multiple sources and reduces the interference of redundant features. The local–global transformer module is proposed to capture fine-grained local features and coarse-grained global features, which enables the framework to focus on recognizing multiscale objects in local and global regions. Moreover, we propose a pixelwise contrastive loss, which encourages the prediction to be pulled closer to the ground truth. The Local–global Framework achieves state-of-the-art performance, with a 90.45% mean f1 score on the ISPRS Vaihingen dataset and a 93.20% mean f1 score on the ISPRS Potsdam dataset.

1. Introduction

Semantic segmentation aims to assign a label to each pixel in an image [1], e.g., buildings, trees, low vegetation, and cars, and is required in a wide range of practical applications such as city planning, economic protection, and land use mapping [2]. In past years, image semantic segmentation mainly relied on manual vectorization and traditional machine learning (ML) methods with low accuracy, such as support vector machines (SVM) [3] and random forests (RF) [4]. Currently, driven by rapid technological advancement in computer vision, deep learning (DL) methods have dominated the semantic segmentation of remote sensing (RS) images for several years. Compared with traditional ML methods, DL-based methods can capture more high-dimensional contextual features, which underpins their significant capabilities in feature representation [5].
With the advancement of sensor and unmanned aerial vehicle (UAV) technologies, numerous and varied RS images with very high resolution (VHR) originating from different sensors have been generated. VHR images include fine multiscale objects, such as cars and city furniture at small scales, ground at large scales, and buildings and trees of various sizes. Compared with ordinary RS images, VHR images collected from multiple sources can provide complementary features for object recognition, such as spectral features, height features, and intensity features. However, existing deep convolutional neural networks (DCNNs) usually capture only spectral or texture features from single-source images based on pixel-level fusion modules [6,7] and feature-level fusion modules [8]. Some frameworks directly input images from multiple sources as different channels, which makes it difficult to extract consistent features between multisource data, especially when the spatial positions or coordinates of objects in multisource images are not strictly aligned. Moreover, VHR images require more pixels to represent the same ground object, whereas DCNNs focus on local contextual information and have a limited field of perception, which makes it challenging to capture the long-range contextual features of objects at large scales from VHR images. Although some multiscale fusion modules [9,10] and architectures [8,11] have been proposed to enhance the recognition ability of DCNNs for multiscale objects, they still fail to capture the global features of an image. In general, the segmentation of RS images involves two feature fusion difficulties: (1) the fusion of features from multiple sources and (2) the fusion of the local region's short-range visual dependencies with the global region's long-range visual dependencies.
In this paper, we propose a novel Local–global Framework (LGNet) aiming at fusing multisource features and aggregating local–global features to accomplish segmentation tasks on VHR images. The contributions are described as follows:
  • A feature-level fusion method (DFNet) for multisource images is proposed, which effectively fuses complementary features extracted from different sources and reduces the inaccurate features caused by shadow.
  • The LG-Trans module takes advantage of the local convolution operation and the global transformer mechanism to fully capture the short-range visual dependencies of the local region and the long-range visual dependencies of the global region.
  • Furthermore, the LGNet introduces a contrastive loss (CT) as a universal regularization to help the framework approximate the positive ground truth and move away from wrong negative predictions.
  • Integrating the DFNet, LG-Trans modules, and CT loss, a novel LGNet is proposed and we evaluate the proposed network on the ISPRS Vaihingen and Potsdam datasets. Experimental results demonstrate that the LGNet has an excellent performance.
The paper is organized as follows. In Section 2, we describe the related work on semantic segmentation for RS images. In Section 3, we explain the overall architecture of the LGNet. In Section 4, we present the experimental results and analysis. The discussion is given in Section 5. Finally, in Section 6, we summarize the paper.

2. Related Work

2.1. Semantic Segmentation of Multisource Remote Sensing Images

Because RS images are collected from multiple sensors, they have different data structures, and objects captured by different sensors show slightly different details at the same spatiotemporal location [12,13]. As shown in Figure 1, 3-band infrared/red/green (IRRG) images have more explicit spectral details but also suffer from problems such as object shadows, which result in inaccurate segmentation. Although normalized digital surface models/digital surface models (NDSM/DSM) are mapped from the spatial information of three-dimensional objects, they suffer from discreteness and a scarcity of semantic features. It is therefore necessary to fuse complementary features from multiple sources and reduce segmentation inaccuracies caused by shadow.
With the remarkable success of DCNNs in computer vision [14,15], researchers have been trying to apply DCNNs to solve problems in RS images [16,17,18,19,20,21]. Because a single DCNN has difficulty selectively utilizing features from multisource images to supplement the details presented by objects, it is inappropriate to use single-source images or to stack images from different sources as the input of a DCNN [22,23]. Recently, some scholars have proposed methods to solve the above problems, which can be divided into two categories. One is pixel-level fusion [6], which fuses pixels from different source images and inputs them into the DCNN for training, thus reducing the demands on the DCNN's ability to learn features. While the pixel-level fusion method has advantages such as low complexity, it ignores the semantic and structural components of objects, resulting in artifacts and blocky effects [24]. The other is the feature-level fusion method [8], which operates on the RS image features extracted by the DCNN. This method can automatically fuse features from multisource images during training. Most feature-level fusion methods mainly rely on summing features [25] or concatenating individual features [26]. Although these methods perform better, they cannot suppress the mutual interference of redundant features from multiple sources and introduce many redundant parameters into the model.

2.2. Local and Global Feature Fusion

With the development of RS technology, the spatial resolution of RS images has increased. RS images contain many spatial texture details of multiscale objects, and the distribution of object categories is unbalanced [27]. For example, buildings vary in size, cars are small, and similar objects are confusing and difficult to distinguish. It is difficult for general DCNNs to identify these complex objects accurately.
Some researchers have improved the architecture of DCNNs to capture the features of different objects in RS images more broadly [8,11]. Diao et al. [11] presented a superpixel attention graph neural network, which aggregates features at different scales and extracts higher-quality features. Moreover, scholars have also designed embeddable multiscale fusion modules [9,10], which capture multiple receptive fields by stacking convolution layers and can efficiently fuse feature information extracted at various scales, thus improving the networks' ability to identify objects. However, these DCNNs are based on local convolution operations, which expand the receptive field of the model by increasing the convolution kernel size or stacking multiple convolution operations. Because DCNNs cannot model the connections between features extracted from different receptive fields, they lack a global comprehension of the entire image and cannot fully utilize contextual information. Compared with the convolutions of DCNNs, the self-attention mechanism of transformers is not limited to local interactions; it can mine long-distance connections and learn the most appropriate inductive biases for different tasks. Recently, some scholars have tried introducing transformers from the natural language domain into the computer vision domain [28,29,30]. Dosovitskiy et al. [29] presented a vision transformer for image classification. Although the vision transformer performs better, it needs to be trained on large amounts of images. Many researchers are still trying to combine CNNs and transformers to break the dominance of DCNNs in the image domain.
Based on the above analysis, we introduce the LG-Trans module, which combines the advantages of local convolution and the global transformer and can fully capture image features.

3. Method

3.1. Overall Architecture

The Local–global framework (LGNet) consists of the dual-source fusion network (DFNet), local–global transformer (LG-Trans) modules, and a multiple upsampling network (MUNet), as shown in Figure 2.
During segmentation, IRRG and DSM images of size 512 × 512 are first processed by two EfficientNet models to generate multiple-level features with sizes 16 × 256 × 256, 24 × 128 × 128, 48 × 64 × 64, and 120 × 32 × 32. Then, the DFNet fuses the multiple-level features through selective fusion modules (SFM) and generates deep semantic features with sizes of 16 × 256 × 256, 24 × 128 × 128, 48 × 64 × 64, and 120 × 32 × 32. After that, the deep semantic features fused by the SFM modules are input into LG-Trans modules, which generate multiscale features with sizes of 128 × 256 × 256, 128 × 128 × 128, 128 × 64 × 64, and 128 × 32 × 32. The MUNet first generates features u1 by upsampling the features extracted by the fourth LG-Trans module to 128 × 128 with bilinear interpolation and concatenating them with the output of the second LG-Trans module. Then, the LGNet generates features u2 by upsampling the features extracted by the third LG-Trans module to 256 × 256 with bilinear interpolation and concatenating them with the output of the first LG-Trans module. After that, it restores features u1 to size 256 × 256 and fuses them with u2 by adding. By gradually restoring the original image size, the MUNet effectively alleviates the loss of detail caused by excessive upsampling amplitude and improves semantic segmentation accuracy.
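To make the decoding order concrete, the following is a minimal PyTorch sketch of the MUNet fusion described above; the function name, the bilinear upsampling calls, and the assumption that every LG-Trans output has 128 channels are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def munet_decode(lg1, lg2, lg3, lg4):
    """Minimal sketch of the MUNet fusion order (channel sizes assumed).

    lg1..lg4: LG-Trans outputs with shapes
      (B,128,256,256), (B,128,128,128), (B,128,64,64), (B,128,32,32).
    """
    # u1: upsample the 4th LG-Trans output to 128x128 and concatenate with the 2nd
    u1 = torch.cat([F.interpolate(lg4, size=lg2.shape[-2:], mode="bilinear",
                                  align_corners=False), lg2], dim=1)
    # u2: upsample the 3rd LG-Trans output to 256x256 and concatenate with the 1st
    u2 = torch.cat([F.interpolate(lg3, size=lg1.shape[-2:], mode="bilinear",
                                  align_corners=False), lg1], dim=1)
    # restore u1 to 256x256 and merge the two branches by addition
    u1 = F.interpolate(u1, size=u2.shape[-2:], mode="bilinear", align_corners=False)
    return u1 + u2
```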
Furthermore, the LGNet is optimized by minimizing $Loss_{Seg}$, and we use backpropagation to obtain the optimal parameters of the LGNet. Specifically, the learning process of the LGNet is summarized in Algorithm 1.
Algorithm 1: Optimization of the LGNet.
Input: IRRG image set irrg_x; DSM image set dsm_x; label set y; category number n
Output: the prediction ŷ of the LGNet
1:  Initialize the feature extractor EfficientNet E(·) with pre-trained parameters
2:  for e in Epochs do
3:      Extract IRRG features via E(irrg_x)
4:      Extract DSM features via E(dsm_x)
5:      Use the DFNet F_DFNet(·) to fuse each level of irrg_feature and dsm_feature
6:      for I in Levels do
7:          d_I = F_DFNet(irrg_feature_I, dsm_feature_I)
8:          Use the LG-Trans module to fuse the global and local features extracted from level I
9:          lg_trans_I = F_LGTrans(d_I)
10:     end for
11:     Use the MUNet to restore the size of the prediction
12:     ŷ = F_DF(lg_trans_1, lg_trans_2, lg_trans_3, lg_trans_4)
13:     Calculate the joint Loss_Seg via Equation (17)
14:     Optimize the parameters of the LGNet by minimizing Loss_Seg
15: end for
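For readers who prefer code, the sketch below mirrors Algorithm 1 as a standard PyTorch training loop; the `LGNet` model, the `loss_seg` callable, and the data loader interface are placeholders for the components described in this section rather than the authors' released code.

```python
import torch

def train_lgnet(model, loader, loss_seg, epochs=100, lr=1e-3, device="cuda"):
    """Sketch of Algorithm 1: one optimization step per (IRRG, DSM, label) batch."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for irrg_x, dsm_x, y in loader:
            irrg_x, dsm_x, y = irrg_x.to(device), dsm_x.to(device), y.to(device)
            # DFNet + LG-Trans + MUNet are assumed to be wrapped inside model.forward
            y_hat = model(irrg_x, dsm_x)
            loss = loss_seg(y_hat, y)      # Loss_Seg = Loss_CE + beta * Loss_CT, Eq. (17)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```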

3.2. Dual-Source Fusion Network

The DFNet is a novel dual-source feature fusion structure that uses two EfficientNet B2 models to extract multiple-level features from the IRRG and NDSM/DSM images and fuses the features of each level through the SFM module. The SFM module can effectively use the spatial information among neighborhoods reflected by the NDSM/DSM images to reduce the inaccurate predictions caused by shadows and to fully capture texture details and spatial features from multisource images.
As shown in Figure 3, the SFM module concatenates the features extracted from the IRRG and DSM images. Then, it applies adaptive global average pooling to extract global features and acquires weight parameters for each channel by compressing the global features into a one-dimensional vector with two fully connected layers. Afterward, features of the same size are obtained by multiplying the per-channel weight parameters by the features extracted from the IRRG images at the same level. Finally, the selective fusion features are obtained by adding the features extracted from the IRRG images to the previous features.
The following equations explain the SFM module. First, we express a convolution operator as Equation (1), where $x$ represents image features, $W_{n \times n}$ indicates a kernel of size $n \times n$, $*$ denotes the convolution operator, and $b$ is the bias vector:
$$W_n(x) = W_{n \times n} * x + b \quad (1)$$
Then, we use the two EfficientNet B2 models to extract features $irrg\_feature$ and $dsm\_feature$ from the IRRG image $irrg\_x$ and the NDSM/DSM image $dsm\_x$, respectively:
$$irrg\_feature = \mathrm{EfficientNet}(irrg\_x) \quad (2)$$
$$dsm\_feature = \mathrm{EfficientNet}(dsm\_x) \quad (3)$$
We use the concatenation operation $F_{Concat}(\cdot)$ to connect the IRRG features $irrg\_feature_i$ and NDSM/DSM features $dsm\_feature_i$ of the same level:
$$u_i = F_{Concat}(irrg\_feature_i,\; dsm\_feature_i) \quad (4)$$
Finally, the selective fusion module (SFM) fuses the different levels of features extracted from the IRRG and NDSM/DSM images as in Equation (5):
$$F_{DSAF}(u_i) = F_{Add}\left(irrg\_feature_i \otimes F_{Sigmoid}\left(W_2^{1 \times 1} F_{ReLU}\left(W_1^{1 \times 1} F_{AdaptiveAvgPool1}(u_i)\right)\right),\; irrg\_feature_i\right) \quad (5)$$
where $\otimes$ indicates element-wise multiplication, $F_{Add}$ denotes the addition operation, and $F_{AdaptiveAvgPool1}(\cdot)$ represents the adaptive average pooling operation.
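A compact PyTorch sketch of the SFM as described by Equations (4) and (5) is given below; the reduction ratio of the two fully connected layers (implemented here as 1 × 1 convolutions) is an assumption for illustration, not a value taken from the paper.

```python
import torch
import torch.nn as nn

class SFM(nn.Module):
    """Sketch of the selective fusion module, Eqs. (4)-(5); channel sizes assumed."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # two 1x1 convolutions play the role of the fully connected layers
        self.fc = nn.Sequential(
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, irrg_feat, dsm_feat):
        u = torch.cat([irrg_feat, dsm_feat], dim=1)   # Eq. (4): concatenate the two sources
        w = self.fc(self.pool(u))                     # per-channel weights from global pooling
        return irrg_feat * w + irrg_feat              # Eq. (5): weighted IRRG + residual IRRG
```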

3.3. Local–Global Transformer Module

To fully extract local texture details and improve the global understanding of the entire image, we propose an LG-Trans module consisting of submodule A and submodule B. As shown in Figure 4, submodule A focuses on the global long-range visual dependencies of images through a multihead attention module, which helps the architecture extract solid semantic features. Submodule B pays more attention to the texture details of small objects in the local region and compensates for the neglect of local texture details by the transformer-based global fusion module. The LG-Trans module not only captures rich spatial information but also obtains the contextual features of images.
In submodule A, we are given features $x \in \mathbb{R}^{H \times W \times C}$ with spatial size $H \times W$ and $C$ channels. The traditional transformer first performs tokenization by reshaping the input features $x$ into a sequence of flattened 2D patches, where each patch has size $P \times P$ and $N = \frac{HW}{P^2}$ is the number of image patches. Then, the transformer maps the vectorized patches $x_i$ into a $D$-dimensional embedding space using a trainable linear projection. The transformer learns specific position embeddings for encoding the patch spatial information, and these embeddings are added to the patch embeddings to retain positional details as in Equation (6):
$$F_0 = \left[x_i^1 P;\; x_i^2 P;\; \ldots;\; x_i^N P\right] + P_{pos} \quad (6)$$
where $P \in \mathbb{R}^{(P^2 \cdot C) \times D}$ denotes the patch embedding projection and $P_{pos} \in \mathbb{R}^{N \times D}$ represents the position embedding. The transformer consists of $L$ layers of multihead self-attention (MSA) modules and multilayer perceptrons (MLP). Therefore, the output of the $I$-th layer can be written as Equations (7) and (8), where $LN(\cdot)$ denotes layer normalization and $F_I$ is the encoded feature:
$$F_I' = MSA\left(LN\left(F_{I-1}\right)\right) + F_{I-1} \quad (7)$$
$$F_I = MLP\left(LN\left(F_I'\right)\right) + F_I' \quad (8)$$
However, directly using transformers for segmentation is not optimal, because the features fused by the SFM modules are usually much smaller than the original image size $H \times W$, which inevitably results in a loss of low-level details. Submodule A therefore uses the features extracted by the dual-source fusion network as its input and applies the patch embedding layer to $1 \times 1$ patches extracted from these features instead of from the raw images.
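The global branch can be summarized by the sketch below, which applies a 1 × 1 patch embedding to the fused features, adds a learned position embedding, and stacks pre-norm attention and MLP blocks following Equations (6)–(8); the embedding dimension, depth, and head count are assumed values, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SubmoduleA(nn.Module):
    """Sketch of the global branch: 1x1 patch embedding, learned position embedding,
    and pre-norm transformer layers (Eqs. (6)-(8)). dim/depth/heads are assumptions."""

    def __init__(self, in_channels, num_tokens, dim=128, depth=2, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(in_channels, dim, kernel_size=1)   # 1x1 patch projection
        # learned P_pos; num_tokens must equal H*W of the input feature map
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        self.blocks = nn.ModuleList(
            nn.ModuleDict({
                "ln1": nn.LayerNorm(dim),
                "attn": nn.MultiheadAttention(dim, heads, batch_first=True),
                "ln2": nn.LayerNorm(dim),
                "mlp": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim)),
            })
            for _ in range(depth)
        )

    def forward(self, x):                                         # x: (B, C, H, W)
        b, _, h, w = x.shape
        f = self.embed(x).flatten(2).transpose(1, 2) + self.pos   # tokens (B, H*W, D), Eq. (6)
        for blk in self.blocks:
            y = blk["ln1"](f)
            f = blk["attn"](y, y, y, need_weights=False)[0] + f   # Eq. (7)
            f = blk["mlp"](blk["ln2"](f)) + f                     # Eq. (8)
        return f.transpose(1, 2).reshape(b, -1, h, w)             # back to (B, D, H, W)
```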
Moreover, the LG-Trans module also uses submodule B to compensate for the loss of local information, which helps the architecture fully capture local texture details. Submodule B consists of a convolution operation, two receptive field blocks (RFB), and an adaptive average pooling operation. To avoid degrading the model's feature extraction ability with excessive atrous rates, we apply atrous convolutions with rates of 3, 6, and 9 in the RFB blocks. First, we define the features extracted by atrous convolution as $F_{Atrous(rate)}$, where $x$ represents the input, $w$ represents the convolution kernel, $k$ represents the kernel size, $rate$ represents the atrous rate, and $\odot$ represents the dot-product operator. Furthermore, $b$ and $b_i$ denote the bias vector and the bias vector of the $i$-th convolution, respectively:
$$F_{Atrous(rate)} = \sum_{i=1}^{k} x[i + rate \cdot k] \odot w[k] + b \quad (9)$$
The first branch extracts features $m_1$ with a single convolution, which increases the network's non-linearity without changing the spatial structure of the features:
$$m_1 = W_1^{3 \times 3} u_i + b_1 \quad (10)$$
The second branch extracts features $m_2$ through a cascaded atrous pyramid, which consists of a $1 \times 1$ convolution and atrous convolutions with rates of 3, 6, and 9:
$$m_2 = W_3^{1 \times 1} F_{concat}\left(W_2^{1 \times 1} x + b_2,\; F_{Atrous(3)},\; F_{Atrous(6)},\; F_{Atrous(9)}\right) + b_3 \quad (11)$$
The third branch extracts features $m_3$ through a cascaded atrous pyramid, which consists of a $1 \times 1$ convolution, an adaptive pooling operation, and atrous convolutions with rates of 3, 6, and 9:
$$m_3 = W_5^{1 \times 1} F_{concat}\left(W_4^{1 \times 1} x + b_4,\; F_{Atrous(3)},\; F_{Atrous(6)},\; F_{Atrous(9)},\; F_{AdaptiveAvgPool1}(x)\right) + b_5 \quad (12)$$
The fourth branch consists of an adaptive pooling operation, which effectively addresses the architecture's insensitivity to object scale changes by fusing image features under different receptive fields:
$$m_4 = F_{AdaptiveAvgPool1}(x) \quad (13)$$
Finally, the features extracted from the multiple branches are fused by concatenation, which captures multiscale image features from different receptive fields and enhances the semantic information of the local space. Furthermore, submodule B applies batch normalization and the ReLU activation function after each branch's convolution or pooling operation to reduce the internal covariate shift of the features:
$$F_B(x) = W_6^{1 \times 1} F_{Concat}\left(m_1, m_2, m_3, m_4\right) + b_6 \quad (14)$$
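The local branch can be approximated by the ASPP-style sketch below: parallel atrous convolutions with rates 3, 6, and 9 plus a pooled branch, concatenated and fused by a 1 × 1 convolution with batch normalization and ReLU, in the spirit of Equations (10)–(14); the exact cascaded RFB wiring and the channel sizes are simplified assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, k=1, dilation=1):
    """Convolution followed by batch normalization and ReLU, size-preserving padding."""
    pad = dilation * (k - 1) // 2
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class SubmoduleB(nn.Module):
    """Simplified sketch of the local branch: atrous rates 3/6/9 + pooled branch + 1x1 fusion."""

    def __init__(self, in_ch, out_ch=128):
        super().__init__()
        self.b1 = conv_bn_relu(in_ch, out_ch, k=1)
        self.b2 = conv_bn_relu(in_ch, out_ch, k=3, dilation=3)
        self.b3 = conv_bn_relu(in_ch, out_ch, k=3, dilation=6)
        self.b4 = conv_bn_relu(in_ch, out_ch, k=3, dilation=9)
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  conv_bn_relu(in_ch, out_ch, k=1))
        self.fuse = conv_bn_relu(5 * out_ch, out_ch, k=1)         # Eq. (14)-style fusion

    def forward(self, x):
        h, w = x.shape[-2:]
        gp = F.interpolate(self.pool(x), size=(h, w), mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x), gp], dim=1))
```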

3.4. Loss Function

Contrastive learning is widely used in self-supervised representation learning [31,32]. He et al. [31] constructed a large and consistent dictionary on-the-fly through a dynamic dictionary with a queue and a moving-averaged encoder, which facilitates contrastive unsupervised visual representation learning. For a given anchor point, contrastive learning aims to pull the anchor close to positive points and push it far away from negative points in the representation space [33]. Previous works [34,35] often apply contrastive learning to high-level vision tasks, since these tasks are inherently suited to modelling the contrast between positive and negative samples. However, there are still few works introducing contrastive samples and contrastive losses into semantic segmentation.
Inspired by these works, we propose a pixelwise contrastive loss (CT), which exploits the prediction and the ground truth as negative and positive samples and ensures that the prediction is pulled closer to the ground truth. It satisfies the following basic criteria: the smaller the distance between similar samples, the better, and the larger the distance between dissimilar samples, the better. In this paper, a hyperparameter $m$ is added to the second criterion to make the training target bounded. The second criterion has a figurative interpretation, like a spring of length $m$: if it is compressed, it recovers to length $m$ because of the repulsive force. The contrastive loss is therefore:
$$Loss_{CT}\left(W, Y, X_1, X_2\right) = (1-Y)\,\frac{1}{2} D_W^2 + Y\,\frac{1}{2}\max\left(0,\; m - D_W\right)^2 \quad (15)$$
where $W$ is the framework weight, $Y$ is the pairwise label ($Y = 0$ if the pair of samples $(X_1, X_2)$ belongs to the same class and $Y = 1$ if they belong to different classes), and $D_W$ is the distance between the samples.
Moreover, we add a cross-entropy loss (CE) to the total loss so that the model can accurately measure the differences between the prediction and the ground truth, as in Equation (16):
$$Loss_{CE} = -\frac{1}{C}\sum_{i=1}^{C} Y \log \hat{Y} \quad (16)$$
where $Y$ is the pairwise label, $\hat{Y}$ is the pairwise prediction, and $C$ is the number of object categories.
Thus, the loss function can be reformulated as Equation (17), which can be trained via the Adam optimizer in an end-to-end manner.
$$Loss_{Seg} = Loss_{CE} + \beta\, Loss_{CT} \quad (17)$$
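A hedged sketch of the joint objective is shown below; the pixelwise pair construction for the contrastive term and the value of β are assumptions made for illustration, since the paper does not fix them in code form.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pred, target_onehot, margin=1.0):
    """Sketch of the pixelwise contrastive term of Eq. (15).

    Each pixel's predicted probability vector and its one-hot ground truth form a
    sample pair; D_W is their Euclidean distance, and the pair label Y follows
    Eq. (15). The authors' exact pair construction may differ.
    """
    d = torch.norm(pred - target_onehot, dim=1)                         # D_W per pixel
    y = (pred.argmax(dim=1) != target_onehot.argmax(dim=1)).float()     # 0: same class, 1: different
    pos = (1 - y) * 0.5 * d ** 2
    neg = y * 0.5 * torch.clamp(margin - d, min=0) ** 2
    return (pos + neg).mean()

def loss_seg(logits, labels, beta=0.1, num_classes=6):
    """Eq. (17): cross-entropy plus beta-weighted contrastive regularization (beta assumed)."""
    ce = F.cross_entropy(logits, labels)                                # labels: int64 class map
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(labels, num_classes).permute(0, 3, 1, 2).float()
    return ce + beta * contrastive_loss(probs, onehot)
```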

4. Experiments

4.1. Experimental Settings

4.1.1. Dataset

Two datasets from the ISPRS segmentation competition are used to evaluate the LGNet. Figure 5 shows IRRG/IRRGB images, DSM images, and ground truth from the ISPRS Vaihingen and Potsdam datasets. In both datasets, the ground truth images are labeled by experts; we use the training set of the competition to train the LGNet and the test set to evaluate its performance.
The ISPRS Vaihingen dataset contains 3-band infrared/red/green (IRRG) images and digital surface models (DSM). There are 33 images of approximately 2494 × 2064 pixels, annotated with five significant categories: impervious surface (Imp_surf), buildings, low vegetation (Low_veg), trees, and cars. In the experiments, 16 images were used as the training set, 17 images as the test set (IDs 1, 3, 5, 7, 9, 11, 13, 15, 17, 21, 23, 26, 28, 30, 32, 34, 37), and 2 images as the validation set (IDs 1, 13).
The ISPRS Potsdam dataset contains 4-band infrared/red/green/blue (IRRGB) images with the corresponding DSM and normalized digital surface models (NDSM). There are 38 images of about 6000 × 6000 pixels. To keep the number of input channels of the model consistent, we extract 3-band IRRG images from the 4-band IRRGB images and use them together with the NDSM images as the model input. In the experiments, 17 images are used as the training set, 14 images as the test set (IDs 2_13, 2_14, 3_13, 3_14, 4_1, 4_13, 4_14, 5_13, 5_14, 5_15, 6_13, 6_14, 6_15, 7_13), and 4 images as the validation set (IDs 2_11, 4_11, 6_9, and 6_11).

4.1.2. Metrics

To evaluate the performance of the LGNet, several metrics are employed in these experiments, including precision, recall, f1 score, and overall accuracy, where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively.
$$Precision = \frac{TP}{TP + FP} \quad (18)$$
Recall is specific to the original samples and characterizes the frequency of correct predictions among the instances labeled as positive:
$$Recall = \frac{TP}{TP + FN} \quad (19)$$
The f1 score is calculated from precision and recall and is used to evaluate the model's overall performance:
$$F1\ score = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (20)$$
The overall accuracy (OA) is the ratio of correctly classified pixels to the total number of pixels:
$$Overall\ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (21)$$
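These four metrics can be computed from two label maps with a few lines of NumPy; the sketch below treats each class in turn, and the class count and flattening are illustrative assumptions.

```python
import numpy as np

def segmentation_metrics(pred, target, num_classes=6):
    """Per-class precision, recall, and f1 plus overall accuracy from two integer
    label maps (Eqs. (18)-(21))."""
    pred, target = pred.ravel(), target.ravel()
    oa = (pred == target).mean()                          # Eq. (21)
    stats = {}
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        precision = tp / (tp + fp + 1e-12)                # Eq. (18)
        recall = tp / (tp + fn + 1e-12)                   # Eq. (19)
        f1 = 2 * precision * recall / (precision + recall + 1e-12)   # Eq. (20)
        stats[c] = {"precision": precision, "recall": recall, "f1": f1}
    return stats, oa
```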

4.1.3. Implementation Details

The LGNet is built with PyTorch. Two Matrox G200eW3 22 GB graphics cards were used as the GPU platform. To address overfitting, we use the following data augmentation strategies in the training phase: (a) flip transformation, (b) random rotation, (c) size scaling, (d) up-down and left-right translation, (e) random cropping, and (f) contrast transformation. Moreover, due to memory limitations, we first slide the IRRG and DSM images into 512 × 512 crops with an overlap stride of 32 px. We train the LGNet using the Adam optimizer with a learning rate of 0.001, a momentum of 0.9, and a batch size of 4. For the LGNet, the DFNet weights are initialized with EfficientNet B2 trained on ImageNet. In the testing phase, the data augmentation operations are horizontal flip, vertical flip, and diagonal flip; we input the augmented images into the model and fuse the multiple predictions to obtain the final result.
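The sliding-window cropping and flip-based test-time augmentation can be sketched as follows; the interpretation of the 32 px overlap, the simplified border handling, the averaging of flipped predictions, and the model signature `model(irrg, dsm)` are assumptions for illustration.

```python
import torch

def sliding_window_crops(image, size=512, overlap=32):
    """Cut a large tile (C, H, W) into overlapping 512x512 patches.

    Interprets the paper's "overlap stride of 32 px" as a 32-pixel overlap
    between neighbouring crops; border handling is simplified.
    """
    stride = size - overlap
    _, h, w = image.shape
    crops, coords = [], []
    for top in range(0, max(h - size, 0) + 1, stride):
        for left in range(0, max(w - size, 0) + 1, stride):
            crops.append(image[:, top:top + size, left:left + size])
            coords.append((top, left))
    return crops, coords

def flip_tta(model, irrg, dsm):
    """Average predictions over identity, horizontal, vertical, and diagonal flips."""
    outputs = []
    for dims in ([], [-1], [-2], [-1, -2]):
        p = model(torch.flip(irrg, dims) if dims else irrg,
                  torch.flip(dsm, dims) if dims else dsm)
        outputs.append(torch.flip(p, dims) if dims else p)   # undo the flip on the prediction
    return torch.stack(outputs).mean(0)
```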

4.2. Ablation Study

In this subsection, we decompose the LGNet to illustrate the effect of the DFNet and the LG-Trans module. Each model is evaluated on the ISPRS Vaihingen and Potsdam datasets. We compare the OA and mean f1 score of each model (see Table 1, Table 2, Table 3 and Table 4). Some samples of the test set results are shown in Figure 6, Figure 7 and Figure 8.

4.2.1. The Effect of Different Source Images

LGNet-IR. The original DeepLab v3+ is selected as the baseline, and we replace the Xception backbone in DeepLab v3+ with EfficientNet B2. When the input is IRRG images, we name this network LGNet-IR. The OA of the LGNet-IR on the Vaihingen test set is 90.04%, and the OA on the Potsdam test set is 89.72%.
LGNet-S. We stack NDSM/DSM images onto the IRRG images and train a modified baseline network, which changes the number of input channels of the first convolutional layer to the number of channels of the stacked image. We name this architecture LGNet-S. Compared with the LGNet-IR, the LGNet-S improves the mean f1 score from 88.00% to 89.32% on the Vaihingen test set and from 90.14% to 91.03% on the Potsdam test set.
Figure 6 illustrates the effect of using single-source and multisource inputs on the results. In the first and third groups of results, the red clutter regions in the upper left and center are barely recognized correctly by the LGNet-IR. When the LGNet-S stacks DSM with IRRG images for training, the red clutter regions in the prediction improve significantly, but the result is still far from satisfactory. As shown in the second and fifth groups, a similar phenomenon exists for trees and low vegetation: there are obvious errors in the trees and low vegetation extracted by the LGNet-IR, and the same problem remains in the LGNet-S. In the LGNet-S, we notice that the incorrectly classified flaws of trees and low vegetation in the bottom left region of the second image and the upper right region of the fifth image have improved. In the third and sixth groups of results, the buildings and roads extracted by the LGNet-IR have obvious errors, with scattered pixels around the objects. After adding the DSM images to eliminate inconsistent features, the boundaries of buildings and roads become relatively regular, so the DSMs are important for object extraction.
Figure 6. Some samples of the LGNet-IR and LGNet-S results on the ISPRS Vaihingen and Potsdam datasets.

4.2.2. The Effect of the Dual-Source Fusion Network

LGNet-C. We use pretrained models as the initial feature extraction networks, where both the main and the auxiliary branches are EfficientNet B2, and feed the IRRG and DSM images into the two branches, respectively. When the fusion method is the concatenation operation, we name this network LGNet-C. The OA of the LGNet-C on the Vaihingen test set is 90.5%, and the OA on the Potsdam test set is 90.63%.
LGNet-A. When the initial feature extraction networks are EfficientNet B2 models and the fusion method is the addition operation, we name this architecture LGNet-A. The OA of the LGNet-A on the Vaihingen test set is 90.7%. Compared with the LGNet-C, the LGNet-A improves the mean f1 score from 89.64% to 89.78% on the Vaihingen test set and from 92.01% to 92.48% on the Potsdam test set.
Table 2. The effect of the dual-source fusion network on the ISPRS Vaihingen and Potsdam datasets.

| Method | DataSet | Fusion | Imp_surf | Building | Low_veg | Tree | Car | OA | Mean F1 |
|---|---|---|---|---|---|---|---|---|---|
| LGNet-C | Vaihingen | Concat | 93.0 | 95.7 | 83.0 | 89.1 | 87.4 | 90.5 | 89.64 |
| LGNet-A | Vaihingen | Add | 93.1 | 95.8 | 83.5 | 89.3 | 87.2 | 90.7 | 89.78 |
| LGNet-D | Vaihingen | SFM | 92.9 | 95.6 | 84.2 | 89.8 | 88.3 | 90.9 | 90.16 |
| LGNet-C | Potsdam | Concat | 93.21 | 97.04 | 87.57 | 87.81 | 94.42 | 90.63 | 92.01 |
| LGNet-A | Potsdam | Add | 93.29 | 97.19 | 87.64 | 88.71 | 95.57 | 91.07 | 92.48 |
| LGNet-D | Potsdam | SFM | 93.16 | 97.35 | 88.04 | 89.27 | 96.18 | 91.38 | 92.80 |

The per-class columns report f1 scores; all values are in %.
Figure 7 shows the effect of the DFNet on the results. Compared with the first and third groups of Figure 6, the red clutter regions in the upper left and center recognized by the LGNet-S are significantly improved, but they still contain obvious errors; for example, the structure of the red clutter regions is incomplete. In the second and fifth groups of results, we find similar problems in the trees and low vegetation extracted by the LGNet-C and the LGNet-A, which contain classification errors and cannot be distinguished. In the second group of results, after adding the dual-source fusion network, the redundant part of the trees in the upper left region almost disappears, because the spatial characteristics extracted from the DSMs effectively reduce the influence of tree shadows and help the model suppress these disturbances. The same problem exists for the buildings in the fourth and sixth groups. After using the dual-source fusion network to fuse the texture details and spatial characteristics extracted from the IRRG and DSM images, the incorrectly classified regions on the road and building become smaller, and the road and building boundaries become regular. Thus, the DFNet is important for multisource feature extraction.

4.2.3. The Effect of Trans-Based Global–Local Fusion Module

LGNet-L. We apply submodule B on top of LGNet-D, which captures multiscale local features through convolution operations with different receptive fields. We name this architecture LGNet-L. The OA of the LGNet-L on the Vaihingen test set is 91.14%, and the OA on the Potsdam test set is 91.37%.
LGNet-G. We apply submodule A on top of LGNet-D, which enhances the ability of the model to capture global features. We name this architecture LGNet-G. The mean f1 score of the LGNet-G on the ISPRS Vaihingen test set is 90.31%, and the mean f1 score on the Potsdam test set is 92.65%.
LGNet-LG. We apply the local–global transformer modules on top of LGNet-D, which helps the model capture both global and local features of the image. We name this architecture LGNet-LG. Compared with the LGNet-L, the LGNet-LG improves the mean f1 score from 90.20% to 90.39% on the ISPRS Vaihingen test set and from 92.78% to 93.16% on the Potsdam test set.
Table 3. The effect of the trans-based global–local fusion module on the ISPRS Vaihingen and Potsdam datasets (A: submodule A; B: submodule B).

| Method | DataSet | A | B | Imp_surf | Building | Low_veg | Tree | Car | OA | Mean F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| LGNet-L | Vaihingen | | ✓ | 93.05 | 95.52 | 84.61 | 90.09 | 87.75 | 91.14 | 90.20 |
| LGNet-G | Vaihingen | ✓ | | 93.48 | 96.07 | 84.59 | 89.93 | 87.48 | 91.22 | 90.31 |
| LGNet-LG | Vaihingen | ✓ | ✓ | 93.51 | 96.01 | 84.63 | 90.12 | 87.67 | 91.34 | 90.39 |
| LGNet-L | Potsdam | | ✓ | 93.33 | 97.27 | 88.47 | 88.51 | 96.34 | 91.37 | 92.78 |
| LGNet-G | Potsdam | ✓ | | 93.52 | 97.04 | 88.08 | 88.38 | 96.25 | 91.28 | 92.65 |
| LGNet-LG | Potsdam | ✓ | ✓ | 94.12 | 97.53 | 88.25 | 89.60 | 96.31 | 91.73 | 93.16 |

The per-class columns report f1 scores; all values are in %.
Figure 7. Some samples of the LGNet-C, LGNet-A and LGNet-D results on the ISPRS Vaihingen and Potsdam datasets.
Figure 8 illustrates the effect of submodule A, submodule B, and the local–global transformer modules on the results. Compared with the LGNet-G, the LGNet-L shows significantly better recognition of trees, low vegetation, and cars, and there are apparent differences between the LGNet-L and the LGNet-G. In the first and second groups of results, we find that some cars extracted by the LGNet-G are stacked together and have unclear boundaries; in the LGNet-L, this stacking problem of cars and buildings is improved. In the fourth and fifth groups of results, some trees and low vegetation in the left region extracted by the LGNet-G are classified incorrectly, with some low vegetation classified as trees. With the addition of the multiscale local fusion submodules in the LGNet-L, the model pays more attention to the local region and the misclassification problem is resolved. In the third and sixth groups of results, the buildings and impervious surfaces extracted by the LGNet-G are more complete. In the sixth group, the building in the bottom right region of the image appears to have holes; with the addition of submodule A in the LGNet-G, most of the building and low vegetation misclassification disappears, and the holes in the buildings almost disappear. After gradually adding submodule A and submodule B to the LGNet-D, the accuracy of object recognition continues to improve and many details are fixed. The LGNet-LG performs better than the LGNet-L and LGNet-G. In summary, the DFNet and the LG-Trans module presented in this paper progressively improve the model performance for RS images.
Figure 8. Some samples of the test set of the LGNet-L, LGNet-G and LGNet-LG on the ISPRS Vaihingen and Potsdam datasets.

4.2.4. The Effect of Contrastive Loss

LGNet-LG. The LGNet-LG only applies the cross-entropy loss (CE). The mean f1 score of the LGNet-LG on the ISPRS Vaihingen test set is 90.39%, and the mean f1 score on the Potsdam test set is 93.16%.
LGNet. We add the contrastive loss (CT) on top of LGNet-LG, which reduces the distance between the prediction and the ground truth. Compared with the LGNet-LG, the LGNet improves the mean f1 score from 90.39% to 90.45% on the ISPRS Vaihingen test set and from 93.16% to 93.20% on the Potsdam test set.
Figure 9 illustrates the effect of the contrastive regularization loss. After adding the contrastive loss, the accuracy of low vegetation and trees continues to improve, and many details are refined. The LGNet performs better than the LGNet-LG. In summary, the CT loss reduces the differences between the prediction and the ground truth and improves the model performance for RS images.
Figure 9. Some samples of the LGNet-LG and LGNet results on the ISPRS Vaihingen and Potsdam datasets.
Table 4. The effect of CT loss on the ISPRS Vaihingen and Potsdam datasets.

| Method | DataSet | CE | CT | Imp_surf | Building | Low_veg | Tree | Car | OA | Mean F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| LGNet-LG | Vaihingen | ✓ | | 93.51 | 96.01 | 84.63 | 90.12 | 87.67 | 91.34 | 90.39 |
| LGNet | Vaihingen | ✓ | ✓ | 93.49 | 96.12 | 84.65 | 90.38 | 87.62 | 91.38 | 90.45 |
| LGNet-LG | Potsdam | ✓ | | 94.12 | 97.53 | 88.25 | 89.60 | 96.31 | 91.73 | 93.16 |
| LGNet | Potsdam | ✓ | ✓ | 93.75 | 97.21 | 88.75 | 90.04 | 96.23 | 91.84 | 93.20 |

The per-class columns report f1 scores; all values are in %.

4.3. Comparison Method

Comparisons between the LGNet and other methods are shown in Table 5 and Table 6. Some samples of the test set results are shown in Figure 10 and Figure 11. The LANet, CGFDN, BAM-UNet-sc, MACANet, and SAGNN currently have no published code or complete test set results, so we could not visually compare these methods with the LGNet.

4.3.1. Experiments on the Vaihingen Dataset

The OA and mean f1 score comparisons between the LGNet and state-of-the-art methods on the ISPRS Vaihingen dataset are shown in Table 5. The DLR_9 [36] used a boundary detection strategy to accomplish image segmentation. The CASIA2 [18] improves the OA by aggregating contextual information, reaching an OA of 91.10%. The UFMG_4 [37] used global context modules and an attention mechanism to improve model performance, but these modules failed to classify small objects such as cars, and its f1 score for cars is only 81.30%. The HUSTW3 [38] shows excellent recognition of trees and low vegetation; its f1 scores for low vegetation and trees are 85.60% and 90.50%. The LANet [39] utilized attention modules to capture contextual information and can accurately recognize impervious surfaces. The CGFDN [40] encoded class co-occurrence relationships as convolutional features and decoupled the most obvious co-occurrence relationships to infer the segmentation result, achieving competitive results on cars.
Table 5. Comparisons between the LGNet and other state-of-the-art methods on the Vaihingen dataset.

| Method | Year | Imp_surf | Building | Low_veg | Tree | Car | OA | Mean F1 |
|---|---|---|---|---|---|---|---|---|
| DLR_9 [36] | 2018 | 92.40 | 95.20 | 83.90 | 89.90 | 81.20 | 90.3 | 88.52 |
| CASIA2 [18] | 2018 | 93.20 | 96.00 | 84.70 | 89.90 | 86.70 | 91.1 | 90.10 |
| UFMG_4 [37] | 2019 | 91.10 | 94.50 | 82.90 | 88.80 | 81.30 | 89.4 | 87.72 |
| HUSTW3 [38] | 2019 | 92.10 | 95.30 | 85.60 | 90.50 | 78.30 | 90.7 | 88.36 |
| LANet [39] | 2020 | 92.41 | 94.90 | 82.89 | 88.92 | 81.31 | 89.8 | 88.09 |
| CGFDN [40] | 2020 | 91.90 | 95.00 | 81.50 | 88.70 | 85.00 | 90.6 | 88.40 |
| BAM-UNet-sc [41] | 2021 | 92.26 | 96.17 | 80.36 | 88.14 | 88.55 | 89.8 | 89.10 |
| MACANet [10] | 2021 | 88.44 | 91.64 | 77.79 | 85.57 | 86.88 | 84.5 | 84.38 |
| LRDNet [42] | 2022 | 91.90 | 95.80 | 84.70 | 90.10 | 86.50 | 91.1 | 90.00 |
| DGCRNet [43] | 2022 | 91.32 | 93.16 | 80.10 | 87.27 | 74.56 | 88.2 | 85.28 |
| SAGNN [11] | 2022 | 92.01 | 95.13 | 83.09 | 88.36 | 87.25 | 89.3 | 89.16 |
| Ours | 2022 | 93.49 | 96.12 | 84.65 | 90.38 | 87.62 | 91.38 | 90.45 |

The per-class columns report f1 scores; all values are in %.
The BAM-UNet-sc [41] used a boundary attention module to alleviate the problem that object boundaries cannot be recovered finely, and it achieves excellent results on man-made objects with clear boundaries; its f1 scores for buildings and cars are 96.17% and 88.55%. The MACANet [10] used a sequence aggregation module to aggregate adaptive-scale features at multiple levels; it can capture features with appropriate scale information and mitigate the semantic gaps between features at different levels, and its mean f1 score is 84.38%. The LRDNet [42] introduced local relations and similarity and performs well on buildings, with an f1 score of 95.80% for buildings. The DGCRNet [43] used a dynamic graph contextual reasoning module to explore more effective contextual representations; it performs better on large objects but fails to classify small objects such as cars accurately, with f1 scores of 93.16% for buildings and 74.56% for cars. The SAGNN [11] constructs graph nodes to diversely aggregate neighbor information and thereby extract higher-quality features, and its OA on the test set is 89.30%. The LGNet trained with an independent validation set achieves an OA of 91.38%, which surpasses all competitors in Table 5. Moreover, the mean f1 score of the LGNet reaches 90.45%, meaning that all categories of objects are recognized well; in particular, the f1 score for impervious surfaces reaches 93.49%.
As shown in Figure 10, the LGNet outperforms the other methods. Because the DLR_9 [36] does not pay attention to global context features, the model lacks a global understanding of images; there are many misclassification problems for buildings and trees, and the red clutter regions in the second group cannot be identified. Compared with the DLR_9 [36], the CASIA2 [18] aggregates global-to-local context features, which results in excellent object recognition. The UFMG_4 [37] has some flaws in object recognition: the buildings, trees, and impervious surfaces in the upper left region of the third group contain many misclassifications, and there are still holes in the building in the bottom left region.
Figure 10. Some samples of the results of the test set on the ISPRS Vaihingen dataset. The LANet, CGFDN, BAM-UNet-sc, MACANet, LRDNet, DGCRNet, and SAGNN currently have no published code and complete test set prediction results [18,36,37,38].
Moreover, the HUSTW3 [38] uses multiple strategies during training and inference to improve object recognition, but its recognition of the red clutter regions and yellow cars is not ideal: the red clutter region is barely identified in the first group, and only one car is identified in the bottom right region of the fourth group. Some methods struggle compared with the LGNet; they often make mistakes on objects that resemble the background or have complex appearances. The LGNet solves the problems mentioned above. As shown in Figure 10, in the bottom region of the first group, the LGNet accurately identifies cars, red clutter, low vegetation, and trees, which shows that the LGNet fully captures local details. On the other hand, the results of the LGNet still have certain shortcomings: in the bottom left region of the second group, the LGNet misidentifies objects as cars because the white objects have an appearance similar to cars. Overall, the LGNet achieves advanced performance on the ISPRS Vaihingen dataset.

4.3.2. Experiments on the Potsdam Dataset

We also evaluate the LGNet on the ISPRS Potsdam dataset. The images of the Potsdam dataset are about three times larger than those of the Vaihingen dataset, with more details and textures of the objects. Because the Potsdam dataset has more images for training, DL models can obtain better generalizability than on the Vaihingen dataset. The accuracy and f1 score comparisons between the LGNet and other methods are shown in Table 6. Some samples of the test set are shown in Figure 11.
Table 6. Comparisons between the LGNet and other state-of-the-art methods on the Potsdam dataset.

| Method | Year | Imp_surf | Building | Low_veg | Tree | Car | OA | Mean F1 |
|---|---|---|---|---|---|---|---|---|
| CASIA2 [18] | 2018 | 93.30 | 97.00 | 87.70 | 88.40 | 96.20 | 91.1 | 92.52 |
| UFMG_4 [37] | 2019 | 90.80 | 95.60 | 84.40 | 84.30 | 92.40 | 87.9 | 89.50 |
| HUSTW3 [38] | 2019 | 93.80 | 96.70 | 88.00 | 89.00 | 96.00 | 91.6 | 92.70 |
| CGFDN [40] | 2020 | 92.10 | 95.60 | 86.30 | 87.90 | 94.90 | 90.3 | 91.40 |
| LANet [39] | 2020 | 93.05 | 97.19 | 87.30 | 88.04 | 94.19 | 90.8 | 91.95 |
| MACANet [10] | 2021 | 90.58 | 94.31 | 83.47 | 84.02 | 90.02 | 88.9 | 88.48 |
| BAM-UNet-sc [41] | 2021 | 91.50 | 95.56 | 86.94 | 83.37 | 95.09 | 89.1 | 88.59 |
| LRDNet [42] | 2022 | 91.75 | 95.23 | 85.51 | 85.08 | 94.40 | 88.9 | 90.40 |
| SAGNN [11] | 2022 | 92.59 | 95.96 | 87.86 | 87.78 | 96.18 | 90.2 | 92.01 |
| DGCRNet [43] | 2022 | 94.10 | 97.30 | 88.30 | 88.90 | 96.40 | 91.8 | 93.00 |
| Ours | 2022 | 93.75 | 97.21 | 88.75 | 90.04 | 96.23 | 91.84 | 93.20 |

The per-class columns report f1 scores; all values are in %.
As shown in Table 6, the CASIA2 [18] performs excellently on cars, with an f1 score of 96.20% for cars. The UFMG_4 [37] paid more attention to global features and ignored local features, resulting in poorer recognition of small objects; its f1 score for cars is only 92.40%. The HUSTW3 [38] used multiple strategies to enhance object recognition; compared with the UFMG_4 [37], its mean f1 score increases by 3.2%. The CGFDN [40] also achieved competitive results on buildings, with a mean f1 score of 91.40%. The LANet [39] effectively bridges the gap in physical information content and spatial distribution, and its OA is 90.8%. Although the MACANet [10] used multiple modules to aggregate multiple-level features, it does not improve the ability to recognize similar objects such as trees and low vegetation; its f1 scores for trees and low vegetation are 84.02% and 83.47%. The BAM-UNet-sc [41] still performs well for man-made objects, with f1 scores of 95.56% and 95.09% for buildings and cars. Moreover, both the SAGNN [11] and the DGCRNet [43] are based on the k-nearest neighbor graph; they perform excellently on complex images, which may be because the graph convolutional network (GCN) can more accurately filter out higher-quality features. The mean f1 score of the SAGNN [11] on the Potsdam dataset is 92.01%, and that of the DGCRNet [43] is 93.00%. The LGNet achieves the best mean f1 score on the Potsdam dataset and leads in the low vegetation and tree categories; its f1 score for buildings is 97.21%, and its f1 score for trees is 90.04%.
Figure 11. Some examples of the results of the test set on the ISPRS Potsdam dataset. The LANet, CGFDN, BAM-UNet-sc, MACANet and SAGNN currently have no published code and complete test set results [18,37,38].
Similar to its performance on the Vaihingen dataset, the LGNet achieves great performance. The CASIA2 [18] performs poorly on buildings, trees, and low vegetation: the building in the upper left region of the first group is misidentified as low vegetation, and the roads in the second group cannot be identified. The UFMG_4 [37] cannot identify the differences between trees and low vegetation, as shown by the results in the fourth group. Compared with other methods, the results of the HUSTW3 [38] are more accurate; for instance, the trees and low vegetation in the second and fourth groups are similar to the ground truth, and the buildings in the third and fifth groups have clear edges. As shown in Figure 11, the LGNet shows better robustness than the other methods and performs better in the recognition of buildings and cars. For instance, in the upper region of the third group, buildings extracted by the LGNet have complete structures, and buildings with holes almost disappear, which shows that the LGNet can comprehensively capture the global contextual information of images compared with other methods. However, the results of the LGNet still contain some mistakes: in the right region of the fifth group, the LGNet confuses trees and low vegetation. We consider that the LGNet cannot correctly distinguish similar objects based on their appearance when segmenting objects such as rectangular cars and round trees. In future work, we will consider introducing appearance information into the model as auxiliary features.

5. Discussion

As shown in Table 7, we conduct model capacity experiments on the LGNet and other methods. The DLR_9 [36] has 768.66 M parameters, and the LANet [39] has 44.47 M. The UFMG_4 [37] has the fewest parameters, but its performance is much lower than that of the LGNet. The other methods currently have no published models. The LGNet-IR has 9.81 M parameters. After adding the DFNet as a multisource feature extractor, the parameter count of the LGNet-D increases by 7.21 M compared with the LGNet-IR. The LGNet-L uses the multiscale local fusion modules to capture the texture details of small objects, adding 1.39 M parameters compared with the LGNet-D; that is, the multiscale local fusion modules use 1.39 M parameters. The LGNet-G uses submodule A to obtain long-range global visual dependencies, which adds 0.12 M parameters; that is, submodule A uses 0.12 M parameters. Furthermore, the LGNet uses the DFNet and LG-Trans modules simultaneously. The CT loss adds no parameters at test time, and the LGNet has 18.77 M parameters. The LGNet has shown extraordinary results in segmentation tasks, but its parameter count is still relatively large compared with some other models. In future work, we will explore how to make the LGNet lightweight and efficient.
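The parameter counts discussed above can be reproduced for any PyTorch module with a short helper; a minimal sketch:

```python
def count_parameters(model):
    """Number of trainable parameters in millions, as reported in model-capacity tables."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```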

6. Conclusions

In this paper, we proposed a novel Local–global framework (LGNet) for multisource RS image segmentation. The LGNet uses the dual-source fusion network to extract multiple-level features from the IRRG and DSM images and selectively fuse the features of different levels. Moreover, the LGNet uses an LG-Trans module to equip the framework with both long-range and short-range visual dependency capabilities and to improve the recognition of multiscale objects. Furthermore, the LGNet introduces the CT loss to encourage the prediction to be pulled closer to the ground truth and pushed away from wrong results. We evaluated the LGNet on the ISPRS Vaihingen and Potsdam datasets and achieved excellent performance compared with other methods. Our attempts at applying the LGNet to broader image segmentation tasks are preliminary and are left as future work.

Author Contributions

L.Q. designed the algorithm, performed the experiments and wrote the paper; D.Y., C.Z. and X.Z. revised the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the China National Postdoctoral Program for Innovative Talents (No. BX2021223) and the China Postdoctoral Science Foundation (No. 2021M702510).

Data Availability Statement

We thank the ISPRS for providing the research community with the awesome challenge datasets https://www.isprs.org.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. He, J.; Jia, X.; Chen, S.; Liu, J. Multi-source domain adaptation with collaborative learning for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11008–11017.
  2. Luo, X.; Tong, X.; Pan, H. Integrating multiresolution and multitemporal Sentinel-2 imagery for land-cover mapping in the Xiongan New Area, China. IEEE Trans. Geosci. Remote Sens. 2020, 59, 1029–1040.
  3. Cervantes, J.; Garcia-Lamont, F.; Rodríguez-Mazahua, L.; Lopez, A. A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing 2020, 408, 189–215.
  4. Guo, Y.; Wu, Z.; Shen, D. Learning longitudinal classification-regression model for infant hippocampus segmentation. Neurocomputing 2020, 391, 191–198.
  5. Sengupta, S.; Basak, S.; Saikia, P.; Paul, S.; Tsalavoutis, V.; Atiah, F.; Ravi, V.; Peters, A. A review of deep learning with special emphasis on architectures, applications and recent trends. Knowl.-Based Syst. 2020, 194, 105596.
  6. Li, Q.; Yang, X.; Wu, W.; Liu, K.; Jeon, G. Pansharpening multispectral remote-sensing images with guided filter for monitoring impact of human behavior on environment. Concurr. Comput. Pract. Exp. 2021, 33, e5074.
  7. Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A. Context-driven fusion of high spatial and spectral resolution images based on oversampled multiresolution analysis. IEEE Trans. Geosci. Remote Sens. 2002, 40, 2300–2312.
  8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Annual Conference on Neural Information Processing Systems, Long Beach City, CA, USA, 4–9 December 2017; pp. 5998–6008.
  9. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160.
  10. Li, X.; Lei, L.; Kuang, G. Multilevel Adaptive-Scale Context Aggregating Network for Semantic Segmentation in High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
  11. Diao, Q.; Dai, Y.; Zhang, C.; Wu, Y.; Feng, X.; Pan, F. Superpixel-based attention graph neural network for semantic segmentation in aerial images. Remote Sens. 2022, 14, 305.
  12. Luo, A.; Li, X.; Yang, F.; Jiao, Z.; Cheng, H.; Lyu, S. Cascade graph neural networks for RGB-D salient object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 346–364.
  13. Liang, J.; Zhou, J.; Tong, L.; Bai, X.; Wang, B. Material based salient object detection from hyperspectral images. Pattern Recognit. 2018, 76, 476–490.
  14. Ning, X.; Tian, W.; Yu, Z.; Li, W.; Bai, X.; Wang, Y. HCFNN: High-order coverage function neural network for image classification. Pattern Recognit. 2022, 131, 108873.
  15. Wang, C.; Wang, X.; Zhang, J.; Zhang, L.; Bai, X.; Ning, X.; Zhou, J.; Hancock, E. Uncertainty estimation for stereo matching based on evidential deep learning. Pattern Recognit. 2022, 124, 108498.
  16. Xiong, D.; He, C.; Liu, X.; Liao, M. An end-to-end Bayesian segmentation network based on a generative adversarial network for remote sensing images. Remote Sens. 2020, 12, 216.
  17. Ren, Y.; Yu, Y.; Guan, H. DA-CapsUNet: A dual-attention capsule U-Net for road extraction from remote sensing imagery. Remote Sens. 2020, 12, 2866.
  18. Liu, Y.; Fan, B.; Wang, L.; Bai, J.; Xiang, S.; Pan, C. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens. 2018, 145, 78–95.
  19. Mou, L.; Hua, Y.; Zhu, X.X. Relation matters: Relational context-aware fully convolutional network for semantic segmentation of high-resolution aerial images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 7557–7569.
  20. Liu, Y.; Chen, D.; Ma, A.; Zhong, Y.; Fang, F.; Xu, K. Multiscale U-shaped CNN building instance extraction framework with edge constraint for high-spatial-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2020, 59, 6106–6120.
  21. Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18.
  22. Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A context-aware detection network for objects in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024.
  23. Wang, C.; Bai, X.; Wang, S.; Zhou, J.; Ren, P. Multiscale visual attention networks for object detection in VHR remote sensing images. IEEE Geosci. Remote Sens. Lett. 2018, 16, 310–314.
  24. Xing, Q.; Xu, M.; Li, T.; Guan, Z. Early exit or not: Resource-efficient blind quality enhancement for compressed images. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 275–292.
  25. Zhao, J.; Zhou, Y.; Shi, B.; Yang, J.; Zhang, D.; Yao, R. Multi-stage fusion and multi-source attention network for multi-modal remote sensing image segmentation. ACM Trans. Intell. Syst. Technol. 2021, 12, 1–20.
  26. Yang, H.; Shan, C.; Bouwman, A.; Kolen, A.F.; de With, P.H. Efficient and robust instrument segmentation in 3D ultrasound using patch-of-interest-FuseNet with hybrid loss. Med. Image Anal. 2021, 67, 101842.
  27. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  28. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4055–4064.
  29. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  30. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
  31. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738.
  32. Sermanet, P.; Lynch, C.; Chebotar, Y.; Hsu, J.; Jang, E.; Schaal, S.; Levine, S.; Brain, G. Time-contrastive networks: Self-supervised learning from video. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1134–1141.
  33. Wu, H.; Qu, Y.; Lin, S.; Zhou, J.; Qiao, R.; Zhang, Z.; Xie, Y.; Ma, L. Contrastive learning for compact single image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10551–10560.
  34. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent—A new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284.
  35. Henaff, O. Data-efficient image recognition with contrastive predictive coding. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 3–18 July 2020; pp. 4182–4192.
  36. Marmanis, D.; Schindler, K.; Wegner, J.D.; Galliani, S.; Datcu, M.; Stilla, U. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 2018, 135, 158–172.
  36. Marmanis, D.; Schindler, K.; Wegner, J.D.; Galliani, S.; Datcu, M.; Stilla, U. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 2018, 135, 158–172. [Google Scholar] [CrossRef] [Green Version]
  37. Nogueira, K.; Dalla Mura, M.; Chanussot, J.; Schwartz, W.R.; Dos Santos, J.A. Dynamic multicontext segmentation of remote sensing images based on convolutional networks. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7503–7520. [Google Scholar] [CrossRef] [Green Version]
  38. Sun, Y.; Tian, Y.; Xu, Y. Problems of encoder-decoder frameworks for high-resolution remote sensing image segmentation: Structural stereotype and insufficient learning. Neurocomputing 2019, 330, 297–304. [Google Scholar] [CrossRef]
  39. Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 426–435. [Google Scholar] [CrossRef]
  40. Zhou, F.; Hang, R.; Liu, Q. Class-guided feature decoupling network for airborne image segmentation. IEEE Trans. Geosci. Remote Sens. 2020, 59, 2245–2255. [Google Scholar] [CrossRef]
  41. Nong, Z.; Su, X.; Liu, Y.; Zhan, Z.; Yuan, Q. Boundary-Aware Dual-Stream Network for VHR Remote Sensing Images Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5260–5268. [Google Scholar] [CrossRef]
  42. Lin, B.; Yang, G.; Zhang, Q.; Zhang, G. Semantic Segmentation Network Using Local Relationship Upsampling for Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  43. Su, Y.; Cheng, J.; Wang, W.; Bai, H.; Liu, H. Semantic segmentation for high-resolution remote-sensing images via dynamic graph context reasoning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
Figure 1. A sample of the IRRG and DSM images and the corresponding ground truth.
Figure 2. Overview of the Local–global Framework (LGNet).
Figure 3. Overview of the selective fusion module (SFM module).
Figure 4. Overview of the local–global transformer module (LG-Trans module).
Figure 5. Samples from the ISPRS Vaihingen and Potsdam datasets.
Table 1. The effect of different source images on the ISPRS Vaihingen and Potsdam datasets. The columns Imp_surf to Car report per-class F1 scores (%); OA and Mean F1 are also given in %.

Method   | Dataset   | Data Source | Imp_surf | Building | Low_veg | Tree  | Car   | OA    | Mean F1
LGNet-IR | Vaihingen | IRRG        | 92.15    | 94.65    | 83.53   | 89.44 | 80.25 | 90.04 | 88.00
LGNet-S  | Vaihingen | IRRG/DSM    | 92.21    | 95.02    | 83.65   | 89.56 | 86.17 | 90.20 | 89.32
LGNet-IR | Potsdam   | IRRG        | 91.75    | 94.64    | 86.38   | 86.16 | 90.23 | 89.72 | 90.14
LGNet-S  | Potsdam   | IRRG/DSM    | 92.73    | 95.84    | 86.54   | 87.71 | 93.89 | 90.26 | 91.03
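For reference, the metrics reported in Table 1 can be reproduced from an accumulated confusion matrix. The sketch below is illustrative rather than the authors' evaluation code: it assumes per-class F1 is computed from per-class true/false positives and false negatives, OA is the fraction of correctly labeled pixels, and Mean F1 is the unweighted average of the per-class F1 scores; the function names are ours.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix (rows: ground truth, cols: prediction)."""
    mask = (gt >= 0) & (gt < num_classes)  # ignore pixels outside the label range
    return np.bincount(
        num_classes * gt[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)

def f1_oa(cm):
    """Per-class F1 (%), overall accuracy OA (%), and unweighted mean F1 (%) from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as class c but labeled as another class
    fn = cm.sum(axis=1) - tp  # labeled as class c but predicted as another class
    f1 = 100 * 2 * tp / (2 * tp + fp + fn + 1e-12)
    oa = 100 * tp.sum() / cm.sum()
    return f1, oa, f1.mean()

# Usage: sum the confusion matrices over all test tiles, then compute the metrics once.
# cm = sum(confusion_matrix(p.ravel(), g.ravel(), 5) for p, g in zip(preds, gts))
# per_class_f1, oa, mean_f1 = f1_oa(cm)
```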
Table 7. Comparison of the number of parameters (PRM) of each method, in millions (M).

Method           | PRM (M) | Method   | PRM (M)
DLR_9 [36]       | 768.66  | LGNet-IR | 9.81
UFMG_4 [37]      | 1.91    | LGNet-D  | 17.02
LANet [39]       | 23.80   | LGNet-L  | 18.41
BAM-UNet-sc [41] | 5.18    | LGNet-G  | 17.13
LRDNet [42]      | 44.47   | LGNet-LG | 18.77
DGCRNet [43]     | 8.96    | LGNet    | 18.77
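A parameter count such as the PRM values in Table 7 can be obtained directly from a model definition. The following is a minimal PyTorch sketch, not the authors' code; it assumes all parameters are counted (the paper does not state whether only trainable parameters are included), and the helper name and the example block are ours.

```python
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Total number of parameters of a model, in millions (M)."""
    return sum(p.numel() for p in model.parameters()) / 1e6

# Hypothetical example: a small convolutional block, not a network from the paper.
block = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU())
print(f"{count_parameters_m(block):.4f} M")  # Conv: 64*3*3*3 + 64 = 1792; BN: 2*64 = 128 -> ~0.0019 M
```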