Article

Multi-Source Remote Sensing Images Semantic Segmentation Based on Differential Feature Attention Fusion

by Di Zhang 1,*, Peicheng Yue 1, Yuhang Yan 1, Qianqian Niu 1, Jiaqi Zhao 2 and Huifang Ma 1

1 College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
2 The School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(24), 4717; https://doi.org/10.3390/rs16244717
Submission received: 6 October 2024 / Revised: 12 December 2024 / Accepted: 13 December 2024 / Published: 17 December 2024

Abstract

Multi-source remote sensing image semantic segmentation can provide more detailed feature attribute information, making it an important research field for the intelligent interpretation of remote sensing imagery. However, due to the complexity of remote sensing scenes and the feature redundancy caused by multi-source fusion, multi-source remote sensing semantic segmentation still faces several challenges. In this paper, we propose a multi-source remote sensing semantic segmentation method based on differential feature attention fusion (DFAFNet) to alleviate two problems: the difficulty of extracting discriminative multi-source features and the poor quality of decoder feature reconstruction. Specifically, we achieve effective fusion of multi-source remote sensing features through a differential feature fusion module and an unsupervised adversarial loss. Additionally, we improve decoder feature reconstruction without introducing additional parameters by employing an attention-guided upsampling strategy. Experimental results show that our method achieved 2.8% and 2.0% mean intersection over union (mIoU) improvements over a competitive baseline algorithm on the publicly available US3D and ISPRS Potsdam datasets, respectively.

1. Introduction

Remote sensing is an important means of Earth observation, and its rapid development has provided humans with a new way to efficiently perceive the land and ocean on which they depend for survival [1]. Semantic segmentation of remote sensing images is one of the important applications of the intelligent interpretation of remote sensing images [2]. Its core goal is to assign a category label to each pixel in the image based on its features, so that pixels with the same category label share the same visual characteristics [3]. The resulting full-scene, high-precision ground object recognition makes remote sensing image semantic segmentation widely used in ecological monitoring [4], urban planning [5], disaster assessment [6], change detection [7], and other fields.
In recent years, driven by massive data, models, and computing power, artificial intelligence technologies represented by deep convolutional neural networks have greatly promoted the development of remote sensing semantic segmentation [8]. Due to the complexity of remote sensing scenes, a single remote sensing image often cannot effectively represent different types of objects; for example, low vegetation and trees, or gray buildings and the ground, may be misidentified as one another. Meanwhile, the rapid development of remote sensing sensors and platforms has greatly increased the quantity and quality of remote sensing data [9]. Multi-source remote sensing data can provide more detailed information about land cover attributes [10]. Infrared images can reflect the radiation characteristics of land cover, visible light images can reflect visual features such as the texture and geometry of land cover, and multi-spectral images can reflect the spectral characteristics of land cover. These multi-source remote sensing images with different characteristics can effectively compensate for the limitations of single data sources, greatly improving the application scope and analysis accuracy of remote sensing images [11].
The single-source remote sensing image semantic segmentation method mainly considers the combination of low-level detail information and high-level semantic information, while deep multi-source models further integrate intermediate or hierarchical features to improve performance by mining complementary information. A large number of improved segmentation models have been proposed to address the characteristics of multi-source remote sensing images and the limitations of existing methods. Zhou et al. [12] proposed a multi-source semantic segmentation network based on modal memory fusion and morphological multi-scale assistance to promote cross-modal feature fusion by considering the connections between different samples. In view of the fact that existing methods ignore the unique characteristics of each modality at different encoding layers, Liang et al. [13] proposed a multi-branch bidirectional fusion segmentation network, which guides the distinguishable fusion of features at different layers through a detail enhancement module and an RGB-guided semantic enhancement module. Ma et al. [14] proposed a new adjacent two-layer fusion semantic segmentation network to address the problems of small targets and complex scenes in remote sensing images; it fully explores multi-scale complementary features through a DSM enhancement module and an adjacent context exploration module. The above methods construct their networks from the perspective of complementary feature mining and multi-stage fusion strategy design. However, while multi-source data enrich the information of ground object categories, they also introduce feature redundancy and noise. Therefore, it remains a challenge to effectively mine the differential information of object categories, maximize the advantages of different types of remote sensing images, reduce the impact of noise on segmentation performance, and achieve a collaborative representation of multi-source semantic information in complex scenes.
To solve the above problems, we propose a multi-source remote sensing image semantic segmentation method based on differential feature attention fusion, built by studying the multi-source fusion mechanism, enriching contextual information, and maintaining the diversity of fused features. As mentioned above, the model mainly focuses on two basic issues: one is how to efficiently achieve the full fusion of multi-source remote sensing image features and improve the model’s ability to mine discriminative features, and the other is how to overcome the loss of shallow image information caused by downsampling operations in the encoding stage. For the first problem, we use a differential feature fusion module and an unsupervised adversarial loss to mine the discriminative and complementary features of multi-source data. For the second problem, we design an upsampling module based on the attention mechanism to model the correlation between pixels. Different from existing methods, DFAFNet uses the attention mechanism to mine differential features and guide the upsampling process, which can significantly improve the performance of remote sensing image semantic segmentation tasks.
To summarize, the main contributions of this work are four-fold:
  • We propose an end-to-end multi-source semantic segmentation network based on differential feature attention fusion. The network enhances the feature expression ability of the model through the multi-source differential feature fusion mechanism and enriches the context information in the decoding stage.
  • We develop a differential feature fusion module based on spatial attention. By aligning the weight distribution of multi-source feature maps and giving higher attention to the differential part, the model’s discriminative feature mining ability is improved.
  • We design a shallow attention-guided upsampling method based on the self-attention mechanism to better achieve image reconstruction without introducing additional parameters.
  • We use an unsupervised loss function to perform deep supervision on the feature extraction and fusion modules so that the fused features can better retain the diversity of multi-source data features.
The rest of this paper is organized as follows. In Section 2, we briefly review the current situation of the remote sensing semantic segmentation task. Section 3 introduces the details of the architecture. Section 4 gives experiments and comparisons with other methods. Finally, this paper is concluded in Section 5.

2. Related Work

In this section, relevant works on remote sensing image semantic segmentation and multi-source semantic segmentation are briefly reviewed.

2.1. Remote Sensing Image Semantic Segmentation

Remote sensing image semantic segmentation is a multi-disciplinary research field that continuously develops and improves by integrating theoretical knowledge from remote sensing image processing and computer science [15]. According to the differences in image feature extraction methods, remote sensing image segmentation methods are mainly divided into traditional image segmentation methods [16] and semantic segmentation methods based on deep learning [17,18]. Traditional image segmentation methods rely on manually designed shallow features for processing and analysis. However, due to the complexity of remote sensing image segmentation scenarios and the diversity of segmentation requirements, traditional image segmentation methods are less able to effectively meet the needs of remote sensing image semantic segmentation tasks in terms of segmentation accuracy and generalization performance [19].
With the breakthrough progress of convolutional neural networks in computer vision tasks, more and more researchers are paying attention to deep learning-based semantic segmentation methods for remote sensing images [20]. Compared to traditional image segmentation methods, deep neural networks can automatically learn abstract, nonlinear, high-level semantic features, greatly improving the performance of semantic segmentation tasks [21]. In 2015, the introduction of the fully convolutional network laid a solid foundation for using deep neural networks to solve the semantic segmentation problem of remote sensing images in an end-to-end manner [22]. However, it still has two limitations: first, after the image is repeatedly pooled, the spatial position information of the pixels is lost; second, the segmentation process does not consider contextual information and cannot fully utilize the local and global features of the image. A large number of improved semantic segmentation models for remote sensing images have been proposed, taking into account the characteristics of remote sensing images and the limitations of existing methods. These models improve feature representation ability and segmentation performance through encoder–decoder structures [23], feature fusion mechanisms [24,25], attention mechanisms [21,26], Transformer techniques [27,28], and other methods.
These methods have greatly promoted the development of remote sensing image semantic segmentation technology by enhancing the representation ability of feature maps. However, due to the complex background of remote sensing images, the large difference in feature distribution within a class, and the small difference in feature distribution between classes, remote sensing image semantic segmentation is still a difficult task to accomplish.

2.2. Multi-Source Fusion Semantic Segmentation

With the rapid development of remote sensing sensors, the number and application level of remote sensing images with different imaging mechanisms are constantly increasing. It has become a development trend to use the complementarity of multi-source data to enhance the segmentation accuracy and robustness of remote sensing image semantic segmentation tasks [29]. Multi-source data can reflect the various characteristics of ground object categories.
How to fully consider the correlation of multi-source data, effectively improve the ability to extract discriminative features, and reduce the impact of redundant features on segmentation accuracy is an important research direction. Multi-source remote sensing images with different characteristics can effectively make up for the limitations of single data sources and greatly improve the application scope and analysis accuracy of remote sensing images [30]. In response to the limitations of single-modal data, Guo et al. [31] proposed a multi-modal semantic segmentation network, PIF-Net. This network extracts deep multi-modal features through two independent branches, incorporating Res-Pooling and point attention blocks. Currently, most multi-source fusion methods merely combine the features of two modalities without considering their differences and complementarity. Fan et al. [32] proposed a novel network called the progressive adjacent-layer coordinated symmetric cascade network (PACSCNet). This network employs a two-stage fusion symmetric cascade encoder that exploits the similarities and differences between adjacent features for cross-layer fusion, thereby preserving spatial details. In addition, Ma et al. [33] proposed a multi-level multi-modal fusion scheme called FTransUNet, which integrates CNN and ViT into a unified fusion framework to provide a robust and effective multi-modal fusion backbone for semantic segmentation.
Multi-source data play an important role in improving the performance of remote sensing semantic segmentation. However, due to the interference of the noise carried by multi-source data and the complexity of remote sensing scenes, existing methods often struggle to adapt to the semantic segmentation of multi-source remote sensing images. Therefore, building on existing methods, we explore how to better mine the complementary and discriminative features of multi-source data.

3. Method

In this section, we introduce the details of DFAFNet, which consists of three parts: a spatial attention-based differential feature fusion module, a shallow pixel-guided upsampling operation, and an unsupervised adversarial loss function. On the one hand, DFAFNet exploits the differences among multi-source data to extract discriminative information for feature fusion. On the other hand, it introduces the relationships between shallow pixels into the feature map reconstruction process to enhance the contextual information of the decoder, thereby improving the accuracy and robustness of the model.
The DFAFNet model architecture is shown in Figure 1. It can be seen from the figure that DFAFNet is a multi-branch encoding–decoding structure, which consists of two feature extraction encoding branches, a feature fusion branch, and a decoding branch.
In the encoder part of the model, each feature extraction branch contains five CBR modules (throughout this section, a convolution layer followed by a batch normalization layer and a ReLU activation is abbreviated as CBR). For each convolution block, the convolution kernel size, stride, and padding are set to 3, 1, and 1, respectively. The output feature map of each CBR module is then downsampled using a max pooling operation with a pooling kernel size and stride of 2. Taking an RGB image and a DSM image as the multi-source inputs, the RGB feature map, the depth feature map, and the fused feature map are obtained through the encoder feature extraction branches. The multi-source feature maps are interactively fused through the difference feature fusion module (DFF).
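As a concrete reference for the encoder description above, the following is a minimal PyTorch sketch of a CBR block and one five-stage feature extraction branch (kernel 3, stride 1, padding 1, followed by 2 × 2 max pooling). The module names and channel widths are our own assumptions and are not taken from the released code; since the paper ultimately adopts ResNet34 as the backbone, this sketch only illustrates the plain CBR variant described in this paragraph.

```python
import torch
import torch.nn as nn

class CBR(nn.Module):
    """Convolution -> BatchNorm -> ReLU block (kernel 3, stride 1, padding 1)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class EncoderBranch(nn.Module):
    """One feature extraction branch: five CBR stages, each followed by 2x2 max pooling.
    The channel widths below are illustrative assumptions."""
    def __init__(self, in_ch, widths=(64, 128, 256, 512, 512)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.stages.append(CBR(prev, w))
            prev = w
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        feats = []                       # per-stage feature maps (before pooling), reusable as skip connections
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
            x = self.pool(x)
        return feats
```

In DFAFNet, one such branch would process the RGB input and a second one the DSM input, with the DFF module fusing the two streams stage by stage.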
In the decoder part of the model, the structure is consistent with the decoder of the original UNet, except for the upsampling operation: we replace the traditional upsampling operation with the proposed self-attention-guided upsampling operation (AGU). Similarly, DFAFNet adds skip connections between the encoder and decoder to enrich the contextual information of the decoder. In addition, considering the significant role of residual blocks in ResNet, we use ResNet34 as the backbone network of the model.

3.1. Difference Feature Attention Fusion Module

Accurate feature representation is the core issue of multi-source remote sensing semantic segmentation. In order to achieve optimal multi-source feature learning, a large number of studies have been conducted to improve the performance of segmentation models by fusing the most discriminative features of multi-source remote sensing images. However, existing methods often have limitations such as information redundancy and incomplete discriminative features. In contrast, a better strategy is to extract differential features that are focused on in the depth image feature map and ignored in the visible light image feature map and then fuse them with the original visible light image feature map. The obtained fused feature map not only enhances the discriminative features, but it also effectively reduces information redundancy. Based on this idea, we designed a differential feature fusion module, which consists of two stages: one is feature alignment, and the other is differential feature mining. The structure of the fusion module is shown in Figure 2.
Feature alignment: It aims to match the activation regions of common interest between RGB image feature maps and depth image feature maps. Regions with high activation values indicate that the model pays more attention to the corresponding regions. In other words, the region has a higher impact on model performance, while regions with low activation values indicate a lower impact on model performance. For example, in the RGB image feature map, the activation response values of the forest and grassland categories are not significantly different, while in the depth feature map, the activation response value of the forest category is higher than that of the grassland category. Therefore, we used the feature activation response value to represent the importance of the features. After obtaining regions of common interest from multiple feature maps, we selected regions with different response values from the two feature maps, laying the foundation for subsequent differential feature mining.
First, we calculated the Hadamard product of the RGB image feature map $F_{rgb}$ and the depth feature map $F_{dep}$ to mine the feature areas of common interest. In order to increase the diversity of features, we also used max pooling and average pooling operations to process the feature map and concatenated the pooled results. Given a feature map $F$, this operation can be formulated as follows:
$\mathrm{Pool}(F) = \mathrm{MaxPool}(F) \oplus \mathrm{AvgPool}(F),$  (1)

where $\oplus$ denotes concatenation along the channel dimension.
Secondly, we used a 1D convolution to compress the number of channels of the concatenated, pooled feature maps to 1, thereby obtaining activation maps for different regions. Then, the sigmoid function was used to map the score of each region into the range 0 to 1. The feature map extracted by the $l$-th CBR block of the RGB feature extraction branch is denoted $F_{rgb}^{(l)}$ and, similarly, the feature map extracted by the depth feature extraction branch is denoted $F_{dep}^{(l)}$. In addition, in the $l$-th difference feature fusion block, the activation maps produced by the common attention region extraction branch and the depth feature extraction branch are denoted $A_{had}^{(l)}$ and $A_{dep}^{(l)}$, with corresponding convolution weights $W_{had}^{(l)}$ and $W_{dep}^{(l)}$. The two activation maps are computed as follows:
$A_{had}^{(l)} = \mathrm{Sigmoid}\left(W_{had}^{(l)} \circledast \mathrm{Pool}\left(F_{rgb}^{(l)} \otimes F_{dep}^{(l)}\right)\right),$  (2)

$A_{dep}^{(l)} = \mathrm{Sigmoid}\left(W_{dep}^{(l)} \circledast \mathrm{Pool}\left(F_{dep}^{(l)}\right)\right),$  (3)
where $\circledast$ means 1D convolution, and $\otimes$ means the Hadamard product. We can obtain the activation maps $A_{had}^{(l)}$ and $A_{dep}^{(l)}$ through the above steps, which reflect the areas that the model focuses on. The former represents the important areas common to the RGB image and the depth image, while the latter represents the important areas of the depth image.
Differential feature mining: The activation map reflects the importance of the regions the model attends to. The higher the score, the greater the weight assigned by the model and the more valuable the corresponding feature. To obtain the difference information between multi-source feature maps, a simple approach is to negate the above activation map. We therefore need to match $A_{had}^{(l)}$ and $A_{dep}^{(l)}$ in order to select the complementary regions from the different feature maps. We can formalize this with the following logical proposition:
$\Delta A^{(l)} = \neg A_{had}^{(l)} \wedge A_{dep}^{(l)},$  (4)
where $\neg A_{had}^{(l)}$ refers to the areas that the RGB and depth branches do not jointly focus on. Thus, $\Delta A^{(l)}$ represents the regions that the depth features attend to but the RGB features ignore. We call $\Delta A^{(l)}$ the complementary region scoring map. In practice, we use specific mathematical operations in place of the logical operations described above:
$\Delta A^{(l)} = \left(1 - A_{had}^{(l)}\right) \otimes A_{dep}^{(l)}.$  (5)
Since the sigmoid function maps the score of each region into $[0, 1]$, we can use $1 - A_{had}^{(l)}$ in place of $\neg A_{had}^{(l)}$ to perform the inversion. Under the guidance of $\Delta A^{(l)}$, we can select the features unique to the depth feature map to strengthen the RGB features. First, we purify the RGB features using $A_{had}^{(l)}$: since $A_{had}^{(l)}$ only focuses on the regions that both the RGB and depth images concern, it can significantly filter out the background noise of the RGB feature maps. Then, we extract the features unique to the depth images via $\Delta A^{(l)}$. Finally, we obtain the fused feature map $F_{fuse}^{(l)}$. The operations above can be formulated as follows:
$F_{fuse}^{(l)} = A_{had}^{(l)} \otimes F_{rgb}^{(l)} + \Delta A^{(l)} \otimes F_{dep}^{(l)}.$  (6)
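To make the module concrete, below is a minimal PyTorch sketch of Equations (1)–(6). Two details are our own assumptions rather than statements from the paper: we read Pool(·) as channel-wise max/average pooling concatenated into a 2-channel spatial map (CBAM-style spatial attention), and we approximate the channel-compressing “1D convolution” with a small 2D convolution producing a single-channel map.

```python
import torch
import torch.nn as nn

class DFF(nn.Module):
    """Sketch of the difference feature fusion block, Eqs. (1)-(6)."""
    def __init__(self, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_had = nn.Conv2d(2, 1, kernel_size, padding=pad, bias=False)  # plays the role of W_had
        self.conv_dep = nn.Conv2d(2, 1, kernel_size, padding=pad, bias=False)  # plays the role of W_dep

    @staticmethod
    def pool(f):
        # Eq. (1): concatenate channel-wise max and average pooling -> (B, 2, H, W)
        return torch.cat([f.max(dim=1, keepdim=True).values,
                          f.mean(dim=1, keepdim=True)], dim=1)

    def forward(self, f_rgb, f_dep):
        a_had = torch.sigmoid(self.conv_had(self.pool(f_rgb * f_dep)))  # Eq. (2)
        a_dep = torch.sigmoid(self.conv_dep(self.pool(f_dep)))          # Eq. (3)
        delta_a = (1.0 - a_had) * a_dep                                 # Eq. (5)
        f_fuse = a_had * f_rgb + delta_a * f_dep                        # Eq. (6)
        return f_fuse
```

In use, one DFF instance would sit after each pair of corresponding CBR stages in the two encoder branches, producing the fused feature map $F_{fuse}^{(l)}$ for level $l$.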

3.2. Attention-Guided Upsampling Module

In semantic segmentation networks based on encoder–decoder structures, the quality of feature upsampling has a non-negligible impact on segmentation performance. However, traditional upsampling operations are mainly bilinear interpolation and transposed convolution, which have limitations to varying degrees because different regions share the same sampling weights. Therefore, we proposed a novel attention-guided upsampling module to improve segmentation performance. The structure of the attention-guided upsampling module is shown in Figure 3.
Given an image feature map $F$, we denote the set of its feature vectors as $X_F$ and the feature vector of the pixel in row $i$ and column $j$ as $x_{ij} \in \mathbb{R}^{d}$, where $d$ is the embedding dimension of the feature vectors. Thus, any feature vector $x_{ij}$ belonging to $F$ is an element of $X_F$. Assuming that the elements of $X_F$ are distributed on a low-dimensional manifold, downsampling the feature map should meet the following conditions: (a) the number of feature vectors should decrease after downsampling, and (b) the information loss during downsampling should be as small as possible. Thus, downsampling can be described as the process of selecting a proper subset $X_s$ of $X_F$ such that $X_s$ contains a set of basis vectors; that is, any feature vector $x_i$ in the set $X_F$ can be represented linearly by the elements of $X_s$.
In practice, considering that the number $n$ of elements of $X_F$ is much larger than the embedding dimension $d$ ($n \gg d$) and that image features contain a great deal of redundant information, the above conditions are easily satisfied for 2× downsampling according to the pigeonhole (drawer) principle. So, if we obtain a proper $X_s$ after downsampling, any feature vector $x_i$ in the set $X_F$ can be described as follows:
$x_i = \sum_{k=1}^{n_s} \alpha_k x_k, \quad x_k \in X_s,$  (7)
where $\alpha_k$ defines the weight of each vector $x_k$, and $n_s$ is the number of elements of $X_s$. We can organize the elements of $X_F$ and $X_s$ as matrices $\mathbf{X}_F$ and $\mathbf{X}_s$, respectively, and collect the corresponding weights $\alpha_k$ into a weight matrix $\mathbf{W}$. Thus, Equation (7) can be written in matrix form as follows:
$\mathbf{X}_F = \mathbf{W} \mathbf{X}_s,$  (8)
where $\mathbf{W}$ encodes the relationships between the elements of $X_F$ and the elements of $X_s$. Since $X_s \subset X_F$, the matrix $\mathbf{W}$ can be regarded as 2nd-order information of the image. According to Equation (8), the quality of upsampling is decided by two factors: (a) the quality of $X_s$ and (b) the 2nd-order information $\mathbf{W}$. To illustrate this clearly, consider the $i$-th decoder block, which processes the feature maps from the $(i-1)$-th decoder block. Between the two blocks, the feature map from the $(i-1)$-th decoder block must first be upsampled, and this feature map can be seen as the source of the matrix $\mathbf{X}_s$. To improve the quality of $X_s$, it is a good idea to directly utilize the encoded features from the encoder block at the same level as the $i$-th decoder block; this is why UNet adds skip connections between the encoder and decoder. As for the 2nd-order matrix $\mathbf{W}$, which is often ignored in related research, it can be computed from the encoded features of the same-level encoder block and transmitted through the skip connection.
According to the analysis above, we designed the algorithm to improve the upsampling results by introducing the 2nd-order information of images. In this way, the relationship between the pixel to be reconstructed and all global pixels can be established, and this relationship has an adaptive characteristic, that is, different weights can be obtained by inputting different feature maps into the module.
Given the $i$-th layer encoding feature map $F_i^{enc}$ and the $i$-th layer decoding feature map $F_i^{dec}$, the $(i+1)$-th layer encoding feature map $F_{i+1}^{enc}$ is obtained by downsampling $F_i^{enc}$. Since the feature maps $F_i^{dec}$ and $F_i^{enc}$ are at the same level, they have the same spatial size. Therefore, the encoding feature map $F_i^{enc}$ and the decoding feature map $F_i^{dec}$ can be concatenated to obtain a fused feature map $F_i$ that enhances the context representation of the feature map. The calculation is as follows:
$F_i = F_i^{dec} \oplus F_i^{enc}.$  (9)
During the downsampling of $F_i^{enc}$ to generate $F_{i+1}^{enc}$, we can record the positional information of the retained pixels to guide the upsampling of the decoded feature map $F_{i+1}^{dec}$. In this paper, this positional information is defined as the indices $\mathcal{I}$. Next, we calculate the 2nd-order relationship matrix $\mathbf{W}$. According to the indices $\mathcal{I}$, the feature map $F_i^{enc}$ can be divided into two parts, $F_i^{a}$ and $F_i^{b}$: $F_i^{a}$ contains all feature vectors indexed by $\mathcal{I}$, and the remaining vectors constitute $F_i^{b}$. We convert $F_i^{a}$ and $F_i^{b}$ into $(\mathbf{X}_a, \mathcal{I}_a)$ and $(\mathbf{X}_b, \mathcal{I}_b)$, respectively, where $\mathbf{X}_a$ and $\mathbf{X}_b$ are matrices and $\mathcal{I}_a$ and $\mathcal{I}_b$ are indices indicating each feature vector’s position in the original feature map. For convenience of calculation, we borrow the self-attention mechanism to construct the matrix $\mathbf{W}$:
$W_{ij} = \dfrac{\exp\left(x_i \cdot y_j / \sqrt{d}\right)}{\sum_{k=1}^{m} \exp\left(x_k \cdot y_j / \sqrt{d}\right)}, \quad x_i \in \mathbf{X}_a, \; y_j \in \mathbf{X}_b.$  (10)
So far, we can obtain the 2nd-order information $\mathbf{W}$ of the image through the above steps, which can guide the upsampling process of the decoder. Then, the final feature map after upsampling can be generated according to the position indices $\mathcal{I}_a$ and $\mathcal{I}_b$.
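The following is a minimal PyTorch sketch of the 2nd-order weighting in Equation (10) and of how it could be used to fill in the dropped positions during upsampling. It operates on flattened per-position feature vectors; gathering and scattering with the recorded indices $\mathcal{I}_a$ and $\mathcal{I}_b$ to rebuild the full-resolution map is omitted, and the roles of $\mathbf{X}_a$ (retained positions) and $\mathbf{X}_b$ (dropped positions) follow our reading of the text rather than the released code.

```python
import torch
import torch.nn.functional as F

def second_order_weights(x_a, x_b):
    """Eq. (10): scaled dot-product affinities between retained vectors x_a (m, d)
    and dropped vectors x_b (n, d); each column over x_a sums to one."""
    d = x_a.shape[-1]
    logits = (x_a @ x_b.t()) / d ** 0.5          # (m, n) pairwise similarities
    return F.softmax(logits, dim=0)              # normalize over the m retained positions

def attention_guided_upsample(dec_kept, x_a, x_b):
    """Reconstruct decoder features at the dropped positions as weighted sums of the
    decoder features at the retained positions (dec_kept: (m, d_dec))."""
    w = second_order_weights(x_a, x_b)           # (m, n)
    return w.t() @ dec_kept                      # (n, d_dec): one reconstructed vector per dropped position
```

For example, with m = 64 retained positions, n = 192 dropped positions, and 256-dimensional encoder vectors, `second_order_weights` returns a 64 × 192 matrix whose columns each sum to one, so every dropped pixel is a convex combination of the retained pixels.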

3.3. Loss Function

The loss function used in this paper consists of two parts: one is the unsupervised adversarial loss function, and the other is the semantic segmentation loss.
In order to better extract fused features from multi-source data and maintain their diversity, we used an unsupervised adversarial loss function for deep supervision of the fusion in the feature encoder. Independently of the segmentation results, an effective multi-source fused feature should stay close to the source features while not being biased too strongly towards any single data source; otherwise, multi-source fusion degenerates into single-source feature extraction. Therefore, we introduced an adversarial idea when designing the loss function to supervise the fusion process of multi-source data. The specific calculation is as follows:
At the end of the encoder, we have three feature maps: the fused feature map $F_{fuse}$, the RGB feature map $F_{rgb}$, and the depth feature map $F_{dep}$. There is a constraint among the three: the final fused feature map $F_{fuse}$ can be neither too similar to the RGB feature map $F_{rgb}$ nor too similar to the depth feature map $F_{dep}$, otherwise the use of multi-source data loses its meaning. We therefore supervise the fusion process with the following adversarial loss:
$\mathcal{L}_{adv} = \mathcal{L}(F_{fuse}, F_{rgb}) + \mathcal{L}(F_{fuse}, F_{dep}),$  (11)
where $\mathcal{L}(\cdot,\cdot)$ denotes the PIKD function [34], which can be described as follows:
$\mathcal{L}(F_1, F_2) = \left(\mathrm{AvgPool}(F_1) - \mathrm{AvgPool}(F_2)\right)^2 + \varepsilon^2,$  (12)
where $\varepsilon$ is a balancing term, which prevents the model from overfitting and improves the generalization ability of the model. In this paper, the value of $\varepsilon$ used is $1 \times 10^{-6}$. The PIKD function can measure the distance between two feature distributions. Its loss reaches the global minimum only when the fused feature distribution lies between the multiple source feature distributions. Minimizing the distance among these three feature distributions inherently forms an adversarial relationship.
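As a reference, here is a minimal PyTorch sketch of how Equations (11) and (12) could be computed. Reading AvgPool as global average pooling over the spatial dimensions and reducing the squared channel-wise differences with a mean are our assumptions; the paper does not spell out these details.

```python
import torch
import torch.nn.functional as F

def pikd(f1, f2, eps=1e-6):
    """Eq. (12), as we read it: squared distance between globally average-pooled feature maps."""
    p1 = F.adaptive_avg_pool2d(f1, 1).flatten(1)   # (B, C) channel statistics of f1
    p2 = F.adaptive_avg_pool2d(f2, 1).flatten(1)   # (B, C) channel statistics of f2
    return ((p1 - p2) ** 2 + eps ** 2).mean()

def adversarial_loss(f_fuse, f_rgb, f_dep):
    """Eq. (11): keep the fused features close to, but not collapsed onto, either source."""
    return pikd(f_fuse, f_rgb) + pikd(f_fuse, f_dep)
```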
Next, we introduced the Lovász Softmax loss function [35], which is calculated as follows:
$\mathcal{L}_{seg} = \dfrac{1}{|C|} \sum_{c \in C} \overline{\Delta_{J_c}}\left(m(c)\right),$  (13)
where $|C|$ represents the total number of categories, $J_c$ is the Jaccard similarity coefficient, $m(c)$ is the error vector of class $c$, and $\overline{\Delta_{J_c}}(\cdot)$ represents the convex closure of the Jaccard loss.
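For readers unfamiliar with this loss, the sketch below reproduces the standard Lovász–Softmax computation of [35] in PyTorch: per-class errors are sorted in decreasing order and combined with the gradient of the Lovász extension of the Jaccard loss. Skipping classes that are absent from the batch is a common convention and our choice here, not something stated in this paper.

```python
import torch

def lovasz_grad(gt_sorted):
    """Gradient of the Jaccard-loss Lovász extension w.r.t. sorted errors (after [35])."""
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1.0 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    if gt_sorted.numel() > 1:
        jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax_flat(probs, labels):
    """probs: (P, C) softmax outputs over P pixels; labels: (P,) ground-truth class ids."""
    losses = []
    for c in range(probs.shape[1]):
        fg = (labels == c).float()                 # binary ground truth for class c
        if fg.sum() == 0:
            continue                               # skip classes absent from this batch
        errors = (fg - probs[:, c]).abs()          # per-pixel error vector m(c)
        errors_sorted, perm = torch.sort(errors, descending=True)
        losses.append(torch.dot(errors_sorted, lovasz_grad(fg[perm])))
    return torch.stack(losses).mean()
```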
In summary, the loss function of the difference feature attention fusion model can be expressed as follows:
$\mathrm{Loss} = \mathcal{L}_{adv} + \mathcal{L}_{seg}.$  (14)

4. Experiments

In order to fully verify the effectiveness of the DFAFNet method, we conducted a series of experiments on two public datasets. First, we introduced the datasets, experimental details, evaluation indicators, and comparison methods. Secondly, we conducted an ablation experiment on the proposed method to verify the effectiveness of the module. Finally, we demonstrated the performance of the DFAFNet method by comparing it with other advanced methods in terms of indicator results and visual effects.

4.1. Experimental Conditions

Datasets. Two public and commonly used multi-modal remote sensing datasets were used in our experiments. One is Urban Semantic 3D (https://ieee-dataport.org/open-access/data-fusion-contest-2019-dfc2019/, accessed on 8 October 2024) (US3D); the other is the ISPRS Potsdam Semantic Labeling dataset (https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx, accessed on 8 October 2024) (ISPRS Potsdam). Table 1 shows examples and legends of the two datasets, including the categories they contain with the corresponding proportions and color markings. In the US3D dataset, the image size is 512 × 512 and the ground sampling distance is 1.3 m. We obtained a total of 10,848 images by cropping; 8692 images were used for training and 2156 images for testing. There are five labels in this dataset, and the background class is ignored. In ISPRS Potsdam, the image size is 512 × 512 and the ground sampling distance is 5 cm; a total of 4598 images were obtained by cropping, of which 3678 were used for training and 920 for testing. The first category in the original dataset was changed to the background class.
Implementation Details. The deep learning framework used in this paper is PyTorch. The model optimizer was Adam, the initial learning rate was 0.0001, and the momentum and weight decay coefficients were set to 0.9 and 0.0005, respectively. For the experiments on the US3D dataset, the model was trained for 150 epochs, the learning rate was halved every 100 epochs, and the batch size was set to 15. For the experiments on the ISPRS Potsdam dataset, the model was trained for 100 epochs, the learning rate was halved every 50 epochs, and the batch size was set to 6. An NVIDIA Tesla P100 GPU (12 GB) was used to train and test the network.
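The sketch below shows one way to wire these hyperparameters together in PyTorch (US3D settings). It is illustrative only: `model`, `train_loader`, `seg_loss`, and `adv_loss` are hypothetical placeholders for the network, data pipeline, and the losses of Equations (11)–(14); mapping the stated momentum of 0.9 to Adam’s first beta and reading the training schedule as epoch-based are our assumptions.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

def train_us3d(model, train_loader, seg_loss, adv_loss, device="cuda"):
    """Training loop sketch with the US3D hyperparameters from Section 4.1."""
    model.to(device)
    optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=5e-4)
    scheduler = StepLR(optimizer, step_size=100, gamma=0.5)   # halve the learning rate every 100 epochs

    for epoch in range(150):                                  # 150 training epochs
        for rgb, dsm, target in train_loader:                 # batches of 15 RGB/DSM pairs with labels
            rgb, dsm, target = rgb.to(device), dsm.to(device), target.to(device)
            optimizer.zero_grad()
            pred, f_fuse, f_rgb, f_dep = model(rgb, dsm)      # assumed model outputs
            loss = seg_loss(pred, target) + adv_loss(f_fuse, f_rgb, f_dep)   # Eq. (14)
            loss.backward()
            optimizer.step()
        scheduler.step()
```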
Evaluation Metrics. Five different evaluation metrics were adopted to quantitatively compare the performance of DFAFNet with that of the comparison methods. In this paper, we use mIoU as the main metric. The evaluation metrics are calculated as follows:
(1) Pixel Accuracy (PA): It measures the ratio of correctly classified pixels to the total number of pixels, as calculated by Equation (15).
$PA = \dfrac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}},$  (15)
where $k$ represents the number of classes, $p_{ii}$ denotes the number of pixels correctly classified, and $p_{ij}$ denotes the number of pixels whose true class is $i$ but which are predicted as class $j$.
(2) Mean Pixel Accuracy (MPA): It measures the average accuracy of all class pixels, as calculated by Equation (16).
$MPA = \dfrac{1}{k+1} \sum_{i=0}^{k} \dfrac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}.$  (16)
(3) Mean Intersection Over Union (mIoU): It measures the ratio of the intersection to the union of the predicted pixel set and the true pixel set, as calculated by Equation (17).
$mIoU = \dfrac{1}{k+1} \sum_{i=0}^{k} \dfrac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}},$  (17)
where $p_{ji}$ denotes the number of pixels whose true class is $j$ but which are predicted as class $i$.
(4) Frequency-Weighted Intersection Over Union (FWIoU): It is an improvement to the mIoU metric, where the weight for each class is set based on its frequency of occurrence, as calculated by Equation (18).
$FWIoU = \dfrac{1}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \sum_{i=0}^{k} \dfrac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}.$  (18)
(5) Kappa: It measures the consistency between the model’s prediction and the true results. The higher the Kappa coefficient, the more accurate the semantic segmentation result of the remote sensing image. The calculation method is shown in Equation (19).
$Kappa = \dfrac{N \cdot \mathrm{tr}(\mathbf{M}) - \nu \mathbf{M}^{2} \nu^{T}}{N^{2} - \nu \mathbf{M}^{2} \nu^{T}},$  (19)
where $N$ represents the number of pixels in the image, $\mathrm{tr}(\mathbf{M})$ denotes the trace of the confusion matrix $\mathbf{M}$, and $\nu$ is an all-ones row vector whose length equals the number of classes.
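All five metrics can be derived from a single confusion matrix. The NumPy sketch below is a minimal reference implementation of Equations (15)–(19); ignoring classes that never appear in the test set when averaging (via nan-aware reductions) is our convention and is not specified by the paper.

```python
import numpy as np

def metrics_from_confusion(conf):
    """conf: (K, K) confusion matrix with conf[i, j] = pixels of true class i predicted as class j."""
    conf = conf.astype(np.float64)
    total = conf.sum()
    tp = np.diag(conf)                  # correctly classified pixels per class
    row = conf.sum(axis=1)              # pixels per true class
    col = conf.sum(axis=0)              # pixels per predicted class

    with np.errstate(divide="ignore", invalid="ignore"):
        acc = tp / row                  # per-class pixel accuracy
        iou = tp / (row + col - tp)     # per-class intersection over union

    pa = tp.sum() / total                       # Eq. (15)
    mpa = np.nanmean(acc)                       # Eq. (16)
    miou = np.nanmean(iou)                      # Eq. (17)
    fwiou = np.nansum((row / total) * iou)      # Eq. (18)
    pe = (row * col).sum() / total ** 2         # chance agreement, equal to (nu M^2 nu^T) / N^2
    kappa = (pa - pe) / (1.0 - pe)              # Eq. (19)
    return pa, mpa, miou, fwiou, kappa
```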
Comparative Methods. In the experiments, we compared DFAFNet with the following seven recent semantic segmentation methods to verify the effectiveness of the proposed model: the Unified Perceptual Parsing Network (UperNet) [36], Multi-Path Residual Network (MP-ResNet) [37], Dual-Domain Optimized Class-Aware Network (DOCNet) [38], Attention-Based Network (ACNet) [39], Efficient Scene Analysis Network (ESANet) [40], Fusion Height Network (FHNet) [41], and Multi-Source Segmentation Fusion Network (SegFusion) [42].

4.2. Ablation Studies

We conducted a series of ablation experiments on the US3D and ISPRS Potsdam datasets to verify the effectiveness of the key modules in the DFAFNet. The ablation points of the DFAFNet method include the two-stream structure, the difference feature fusion module, the attention-guided upsampling module, and the unsupervised adversarial loss. We used the dual-stream structure as the baseline model for comparison, which is an improved UNet network based on dual encoders. The ablation experiment results are shown in Table 2.
First, we analyzed the index results of the DFAFNet components on the US3D dataset. From the upper part of Table 2, it can be seen that the difference feature fusion module increased the PA value, mIoU value, and Kappa value of the baseline model by 0.42%, 1.3%, and 0.7%, respectively. Using unsupervised adversarial loss on the basis of the difference feature fusion module increased the mIoU value and Kappa value of the baseline model by 1.69% and 0.92%, respectively. The combination of the three modules increased the mIoU and Kappa values of the baseline model by 2.8% and 1.08%, respectively.
Second, we analyzed the index results of the DFAFNet components on the ISPRS dataset. From the lower part of Table 2, it can be seen that the difference feature fusion module increased the PA, mIoU, and Kappa values of the baseline model by 0.64%, 0.69%, and 0.87%, respectively. Using unsupervised adversarial loss on the basis of the difference feature fusion module increased the mIoU and Kappa values of the baseline model by 1.07% and 1.14%, respectively. The combination of the three modules improved the mIoU and Kappa values of the baseline model by 2.0% and 1.99%, respectively.
In the ablation experiments on the two datasets, we found that the complete architecture achieved the highest performance. If any of the three modules—the difference feature fusion module, the attention upsampling module, and the unsupervised adversarial loss—is removed, the performance of the indicators will decrease. Therefore, it can be determined that the improvement in DFAFNet’s performance is due to the effective design of these modules.
In addition, to further demonstrate the performance of the designed modules, we also counted the average pixel accuracy MPA of different module combinations in each category, and the results are shown in Table 3. It can be seen from the table that DFAFNet achieved the best MPA in the four categories of “ground”, “vegetation”, “water”, and “road” on the US3D dataset. Although the accuracy of the “building” category did not improve, it was not much different from the best result. The reason may be that the addition of multi-source elevation information greatly improved the segmentation accuracy of the building category. Therefore, only the baseline model was needed to increase the segmentation accuracy of the building category to more than 91%.
Compared to the baseline model, the proposed method has obvious advantages in each category of the US3D dataset. As the number of modules continued to increase on the baseline model, the average segmentation accuracy of the three categories of “ground”, “water”, and “road” showed a gradual improvement trend. However, after adding the unsupervised adversarial loss, the segmentation effect of the “vegetation” category did not increase but decreased. The reason for this phenomenon may be that the distinction between this category and the surrounding categories is not high, so the addition of the unsupervised adversarial loss caused the model to overfit. Overall, the three proposed modules are all conducive to the segmentation of remote sensing images.

4.3. Comparative Experiments and Discussions

4.3.1. Comparative Experiments on US3D Data Set

The comparative experimental results of the DFAFNet method and other advanced methods on the US3D dataset are shown in Table 4.
From the table, we can see that DFAFNet outperformed the other seven comparison methods in all five evaluation metrics. Except for ACNet, the methods taking multi-source data as input were better than the methods taking single-source data as input in all five metrics, which fully supports the research motivation of this paper. Among the three methods with single-source input, MP-ResNet achieved good performance, and its results are comparable to those of the networks designed specifically for multi-source data. Compared with the suboptimal model SegFusion, DFAFNet achieved a certain degree of improvement in all five metrics: PA increased by 0.55%, MPA by 1.1%, mIoU by 0.79%, FWIoU by 0.99%, and Kappa by 0.71%.
In order to further visualize the segmentation performance of DFAFNet and other comparison methods on the US3D dataset, we also compared the segmentation visualization results, as shown in Figure 4.
As can be seen from the figure, all methods showed good segmentation performance for the recognition of “water” and “vegetation”. The reason may be that the appearance of these two categories is not easily confused with other categories, and the number of samples is large. In the four rows (a), (b), (d), and (f), most methods had low recognition accuracy for the “road” category and the “building” category. This is because the colors of these two categories are very close in the US3D dataset. However, this situation did not occur in the DFAFNet proposed in this paper. On the one hand, the fusion strategy adopted by DFAFNet fully exploited the discriminative features and complementary features. On the other hand, the attention-guided upsampling module of this method enhanced the contextual information of the decoder.

4.3.2. Comparative Experiments on ISPRS Data Set

In order to further verify the generalization performance of DFAFNet, we also selected the ISPRS dataset and the same comparison method and evaluation index for comparative analysis. The results are shown in Table 5.
From the table, we can see that DFAFNet also achieved the best performance compared with the other seven comparison algorithms on the ISPRS dataset. The mIoU value reached 82.73%. Unlike the methods with a single source as input on the US3D dataset, which are generally worse than the methods with multiple sources as input, on the ISPRS dataset, the three methods with a single source as input, UperNet, MP-ResNet, and DOCNet, performed well. It may be due to the lower segmentation difficulty of the ISPRS dataset compared to the US3D dataset. For example, the MPA of the comparison method on the US3D dataset was about 80%, while on the ISPRS dataset, this indicator was about 86%. Compared to the suboptimal model SegFusion, DFAFNet improved the mIoU and Kappa values by 1.37% and 0.88%, respectively, on the ISPRS dataset.
In order to vividly demonstrate the segmentation performance of DFAFNet and other methods on the ISPRS dataset, the segmentation visualization comparison results are shown in Figure 5. It can be seen from the figure that the segmentation results of DFAFNet are visually superior to the other seven comparison methods. In particular, the DFAFNet prediction results are almost consistent with the label map in the three rows (a), (e), and (f). Although other methods also yielded high segmentation accuracy, DFAFNet’s accuracy was significantly higher than other methods for the recognition of the “building” category. The reason may be that the "building" category occupies a larger area in the image, which improved the model’s fitting ability for this category.
In addition, in order to clearly show the segmentation performance of the various methods on the ISPRS dataset, we show the confusion matrices of the eight methods in Figure 6. In a confusion matrix, the values on the diagonal represent the percentage of correct classifications, while the remaining positions represent the percentages of incorrect classifications. From the figure, it can be seen that DFAFNet is slightly lower than DOCNet on the diagonal only for the “tree” category and is higher than all seven comparison methods for the other categories. The segmentation accuracy of all methods for the “tree” and “vegetation” categories is lower than that for the other three categories, which is due to the similar appearance of these two categories. Among all the comparison methods, ESANet has the lowest category segmentation accuracy; the reason may be that its fusion strategy cannot effectively handle redundant information.

5. Conclusions

In this paper, we focused on multi-source discriminative feature mining and on enhancing the quality of decoder feature reconstruction, and we proposed a multi-source remote sensing semantic segmentation method based on differential feature attention fusion. It consists of three parts: a differential feature fusion module, an attention-guided upsampling mechanism, and an unsupervised adversarial loss. On the one hand, the proposed method achieves sufficient fusion of multi-source remote sensing image features and enhances the model’s ability to mine discriminative features. On the other hand, it overcomes the loss of shallow image information caused by downsampling operations during the encoding stage, thereby improving the quality of feature reconstruction. In addition, it preserves the diversity of multi-source fused features. In future work, we will optimize the structure of multi-source fusion networks to improve the segmentation accuracy of edge features and small targets.

Author Contributions

Conceptualization, D.Z. and P.Y.; methodology, D.Z.; validation, Q.N. and Y.Y.; investigation, Q.N.; data curation, Y.Y. and J.Z.; writing—original draft preparation, D.Z.; writing—review and editing, D.Z.; visualization, D.Z.; supervision, H.M.; project administration, D.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Gansu Province (23JRRA683), the Northwest Normal University Young Teachers Research Capacity Promotion Plan (NWNU-LKQN2023-12), the Industrial Support Project of Gansu Colleges: China (No. 2022CYZC11).

Data Availability Statement

The code used in this paper is available at https://github.com/Hellc07/DFAFNet/tree/master/, accessed on 27 November 2024.

Acknowledgments

The authors would like to thank the Johns Hopkins University Applied Physics Laboratory and IARPA for providing the data used in this study, the research group for Signal Processing in Earth Observation at the Technical University of Munich for providing the data used in this study, and the IEEE GRSS Image Analysis and Data Fusion Technical Committee for organizing the Data Fusion Contest.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Leyva-Mayorga, I.; Martinez-Gost, M.; Moretti, M.; Pérez-Neira, A.; Vázquez, M.Á.; Popovski, P.; Soret, B. Satellite edge computing for real-time and very-high resolution earth observation. IEEE Trans. Commun. 2023, 71, 6180–6194.
  2. Zhou, W.; Jin, J.; Lei, J.; Yu, L. CIMFNet: Cross-layer interaction and multiscale fusion network for semantic segmentation of high-resolution remote sensing images. IEEE J. Sel. Top. Signal Process. 2022, 16, 666–676.
  3. Gao, Y.; Luo, X.; Gao, X.; Yan, W.; Pan, X.; Fu, X. Semantic segmentation of remote sensing images based on multiscale features and global information modeling. Expert Syst. Appl. 2024, 249, 123616.
  4. Li, Q.; Guo, J.; Wang, F.; Song, Z. Monitoring the Characteristics of Ecological Cumulative Effect Due to Mining Disturbance Utilizing Remote Sensing. Remote Sens. 2021, 13, 5034.
  5. Jia, P.; Chen, C.; Zhang, D.; Sang, Y.; Zhang, L. Semantic segmentation of deep learning remote sensing images based on band combination principle: Application in urban planning and land use. Comput. Commun. 2024, 217, 97–106.
  6. Chowdhury, T.; Rahnemoonfar, M. Attention based semantic segmentation on uav dataset for natural disaster damage assessment. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 2325–2328.
  7. Feng, J.; Yang, X.; Gu, Z.; Zeng, M.; Zheng, W. SMBCNet: A transformer-based approach for change detection in remote sensing images through semantic segmentation. Remote Sens. 2023, 15, 3566.
  8. Wang, W.; Fu, Y.; Dong, F.; Li, F. Semantic segmentation of remote sensing ship image via a convolutional neural networks model. IET Image Process. 2019, 13, 1016–1022.
  9. Gao, W.; Chen, N.; Chen, J.; Gao, B.; Xu, Y.; Weng, X.; Jiang, X. A Novel and Extensible Remote Sensing Collaboration Platform: Architecture Design and Prototype Implementation. ISPRS Int. J. Geo-Inf. 2024, 13, 83.
  10. Wang, X.; Tan, L.; Fan, J. Performance evaluation of mangrove species classification based on multi-source Remote Sensing data using extremely randomized trees in Fucheng Town, Leizhou city, Guangdong Province. Remote Sens. 2023, 15, 1386.
  11. Ma, J.; Qian, K.; Zhang, X.; Ma, X. Weakly Supervised Instance Segmentation of Electrical Equipment Based on RGB-T Automatic Annotation. IEEE Trans. Instrum. Meas. 2020, 69, 9720–9731.
  12. Zhou, W.; Zhang, H.; Yan, W.; Lin, W. MMSMCNet: Modal Memory Sharing and Morphological Complementary Networks for RGB-T Urban Scene Semantic Segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7096–7108.
  13. Liang, W.; Shan, C.; Yang, Y.; Han, J. Multi-branch Differential Bidirectional Fusion Network for RGB-T Semantic Segmentation. IEEE Trans. Intell. Veh. 2024, 1–11.
  14. Ma, J.; Zhou, W.; Lei, J.; Yu, L. Adjacent Bi-Hierarchical Network for Scene Parsing of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
  15. Li, X.; Xu, F.; Liu, F.; Lyu, X.; Tong, Y.; Xu, Z.; Zhou, J. A synergistical attention model for semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16.
  16. Mostafa, R.R.; Houssein, E.H.; Hussien, A.G.; Singh, B.; Emam, M.M. An enhanced chameleon swarm algorithm for global optimization and multi-level thresholding medical image segmentation. Neural Comput. Appl. 2024, 36, 8775–8823.
  17. He, X.; Zhou, Y.; Liu, B.; Zhao, J.; Yao, R. Remote sensing image semantic segmentation via class-guided structural interaction and boundary perception. Expert Syst. Appl. 2024, 252, 124019.
  18. Hong, S.; Oh, J.; Lee, H.; Han, B. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3204–3212.
  19. Freixenet, J.; Munoz, X.; Raba, D.; Martí, J.; Cufí, X. Yet another survey on image segmentation: Region and boundary information integration. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, 28–31 May 2002; pp. 408–422.
  20. Kampffmeyer, M.; Salberg, A.B.; Jenssen, R. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1–9.
  21. Wang, J.; Feng, Z.; Jiang, Y.; Yang, S.; Meng, H. Orientation attention network for semantic segmentation of remote sensing images. Knowl. Based Syst. 2023, 267, 110415.
  22. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  23. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  24. Shang, R.; Zhang, J.; Jiao, L.; Li, Y.; Marturi, N.; Stolkin, R. Multi-scale adaptive feature fusion network for semantic segmentation in remote sensing images. Remote Sens. 2020, 12, 872.
  25. Liu, R.; Mi, L.; Chen, Z. AFNet: Adaptive fusion network for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7871–7886.
  26. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612.
  27. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272.
  28. Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A Novel Mamba Architecture with a Semantic Transformer for Efficient Real-Time Remote Sensing Semantic Segmentation. Remote Sens. 2024, 16, 2620.
  29. Zhou, W.; Jin, J.; Lei, J.; Hwang, J.N. CEGFNet: Common extraction and gate fusion network for scene parsing of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–10.
  30. Zhang, J. Multi-source remote sensing data fusion: Status and trends. Int. J. Image Data Fusion 2010, 1, 5–24.
  31. Guo, Z.; Xu, R.; Feng, C.C.; Zeng, Z. PIF-Net: A Deep Point-Image Fusion Network for Multimodality Semantic Segmentation of Very High-Resolution Imagery and Aerial Point Cloud. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
  32. Fan, X.; Zhou, W.; Qian, X.; Yan, W. Progressive Adjacent-Layer coordination symmetric cascade network for semantic segmentation of Multimodal remote sensing images. Expert Syst. Appl. 2024, 238, 121999.
  33. Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
  34. Liu, Y.; Chen, K.; Liu, C.; Qin, Z.; Luo, Z.; Wang, J. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2604–2613.
  35. Berman, M.; Triki, A.R.; Blaschko, M.B. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4413–4421.
  36. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8 September 2018; pp. 418–434.
  37. Ding, L.; Zheng, K.; Lin, D.; Chen, Y.; Liu, B.; Li, J.; Bruzzone, L. MP-ResNet: Multipath residual network for the semantic segmentation of high-resolution PolSAR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5.
  38. Ma, X.; Che, R.; Wang, X.; Ma, M.; Wu, S.; Feng, T.; Zhang, W. DOCNet: Dual-Domain Optimized Class-Aware Network for Remote Sensing Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
  39. Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNet: Attention based network to exploit complementary features for rgbd semantic segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1440–1444.
  40. Seichter, D.; Köhler, M.; Lewandowski, B.; Wengefeld, T.; Gross, H.M. Efficient rgb-d semantic segmentation for indoor scene analysis. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13525–13531.
  41. Ma, C.; Zhang, Y.; Guo, J.; Zhou, G.; Geng, X. FusionHeightNet: A Multi-Level Cross-Fusion Method from Multi-Source Remote Sensing Images for Urban Building Height Estimation. Remote Sens. 2024, 16, 958.
  42. Liu, B.; Ren, B.; Hou, B.; Gu, Y. Multi-Source Fusion Network for Remote Sensing Image Segmentation with Hierarchical Transformer. In Proceedings of the IGARSS 2023–2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 6318–6321.
Figure 1. The architecture of the proposed DFAFNet.
Figure 2. The structure of differential feature attention fusion module.
Figure 3. The structure of attention-guided upsampling module.
Figure 4. Comparison of visualization results between DFAFNet and other methods on the US3D dataset. Subfigures (a–f) show the prediction results of different methods in the same scene.
Figure 5. Comparison of visualization results between DFAFNet and other methods on the ISPRS dataset. Subfigures (a–f) show the prediction results of different methods in the same scene.
Figure 6. Comparison of confusion matrices between DFAFNet and other methods on the ISPRS dataset.
Table 1. The detailed information of the US3D and ISPRS Potsdam datasets for DFAFNet (class colors are shown as color swatches in the original table).

US3D | Ground | Vegetation | Building | Water | Road | Background
Training | 63.05% | 15.8% | 15.34% | 4.46% | 1.34% | 0.01%
Testing | 73.65% | 14.46% | 9.08% | 1.93% | 0.86% | 0.02%

ISPRS Potsdam | Low vegetation | Tree | Car | Building | Surface | Background
Training | 24.28% | 16.11% | 1.86% | 26.91% | 30.83% | 0.01%
Testing | 22.28% | 15.95% | 1.89% | 28.26% | 31.61% | 0.01%
Table 2. The results of ablation studies on DFAFNet 1. “DS” denotes the dual-stream structure, “DFF” is the difference feature fusion module, “AGU” means the attention-guided upsampling operation, and “UAL” denotes the unsupervised adversarial loss function. The values of the evaluation metrics in bold indicate the best results.

Datasets | DS | DFF | AGU | UAL | PA | mIoU | FWIoU | Kappa
US3D | ✓ | – | – | – | 93.15 | 79.53 | 87.74 | 84.34
US3D | ✓ | ✓ | – | – | 93.57 | 80.83 | 88.33 | 85.04
US3D | ✓ | ✓ | – | ✓ | 93.69 | 81.22 | 88.50 | 85.26
US3D | ✓ | ✓ | ✓ | ✓ | 93.73 | 82.33 | 88.62 | 85.42
ISPRS | ✓ | – | – | – | 89.59 | 80.73 | 81.49 | 86.02
ISPRS | ✓ | ✓ | – | – | 90.23 | 81.42 | 82.60 | 86.89
ISPRS | ✓ | ✓ | – | ✓ | 90.43 | 81.80 | 82.96 | 87.16
ISPRS | ✓ | ✓ | ✓ | ✓ | 91.07 | 82.73 | 83.95 | 88.01

1 “✓” indicates that the corresponding module is retained. “–” indicates that the corresponding module is removed.
Table 3. The performance of different module combinations on each category. The values of the evaluation metrics in bold indicate the best results.

Datasets | DS | DFF | AGU | UAL | Ground | Vegetation | Building | Water | Road
US3D | ✓ | – | – | – | 84.68 | 83.52 | 91.14 | 93.74 | 90.62
US3D | ✓ | ✓ | – | – | 85.35 | 84.22 | 91.61 | 94.67 | 91.66
US3D | ✓ | ✓ | – | ✓ | 86.60 | 83.39 | 91.63 | 95.18 | 92.37
US3D | ✓ | ✓ | ✓ | ✓ | 86.93 | 85.28 | 91.41 | 95.71 | 93.05

“✓” indicates that the corresponding module is retained. “–” indicates that the corresponding module is removed.
Table 4. Comparison results of DFAFNet and other methods on the US3D dataset. The values of the evaluation metrics in bold indicate the best results.

Methods | PA | MPA | mIoU | FWIoU | Kappa
UperNet [36] | 87.96 | 75.82 | 67.79 | 79.11 | 70.79
MP-ResNet [37] | 89.14 | 80.73 | 72.54 | 80.88 | 73.57
DOCNet [38] | 88.71 | 79.39 | 70.37 | 81.39 | 75.08
ACNet [39] | 90.78 | 75.87 | 60.97 | 84.38 | 79.05
ESANet [40] | 92.92 | 80.05 | 72.41 | 87.26 | 83.43
FHNet [41] | 91.92 | 87.79 | 79.39 | 85.58 | 82.34
SegFusion [42] | 93.18 | 89.36 | 81.54 | 87.63 | 84.71
DFAFNet | 93.73 | 90.46 | 82.33 | 88.62 | 85.42
Table 5. Comparison results of DFAFNet and other methods on the ISPRS dataset. The values of the evaluation metrics in bold indicate the best results.

Methods | PA | MPA | mIoU | FWIoU | Kappa
UperNet [36] | 88.99 | 88.43 | 79.52 | 80.56 | 85.22
MP-ResNet [37] | 89.76 | 88.96 | 80.39 | 81.80 | 86.27
DOCNet [38] | 89.24 | 89.03 | 80.78 | 81.57 | 86.53
ACNet [39] | 86.99 | 85.57 | 75.31 | 77.49 | 82.57
ESANet [40] | 85.03 | 83.67 | 70.76 | 74.65 | 79.91
FHNet [41] | 88.79 | 85.35 | 79.83 | 80.30 | 85.84
SegFusion [42] | 90.23 | 89.41 | 81.36 | 81.72 | 87.13
DFAFNet | 91.07 | 90.12 | 82.73 | 83.95 | 88.01
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
