OPT-SAR-MS2Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images

Hu, Wei; Wang, Xinhui; Zhan, Feng; Cao, Lu; Liu, Yong; Yang, Weili; Ji, Mingjiang; Meng, Ling; Guo, Pengyu; Yang, Zhi; Liu, Yuhang

doi:10.3390/rs16111850

Open AccessArticle

OPT-SAR-MS²Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images

by

Wei Hu

¹,

Xinhui Wang

¹,

Feng Zhan

^2,3,

Lu Cao

¹,

Yong Liu

¹,

Weili Yang

¹,

Mingjiang Ji

¹,

Ling Meng

¹,

Pengyu Guo

^1,*

,

Zhi Yang

⁴ and

Yuhang Liu

⁴

¹

Defense Innovation Institute, Academy of Military Sciences (AMS), Beijing 100000, China

²

School of Aeronautics, Harbin Institute of Technology, Harbin 150001, China

³

Shandong Institute of Space Electronic Technology, Yantai 250100, China

⁴

DFH Satellite Co., Ltd., Beijing 100094, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(11), 1850; https://doi.org/10.3390/rs16111850

Submission received: 3 April 2024 / Revised: 3 May 2024 / Accepted: 15 May 2024 / Published: 22 May 2024

(This article belongs to the Special Issue Object Detection and Information Extraction Based on Remote Sensing Imagery)

Download

Browse Figures

Versions Notes

Abstract

:

The utilization of optical and synthetic aperture radar (SAR) multi-source data to obtain better land classification results has received increasing research attention. However, there is a large property and distributional difference between optical and SAR data, resulting in an enormous challenge to fuse the inherent correlation information to better characterize land features. Additionally, scale differences in various features in remote sensing images also influence the classification results. To this end, an optical and SAR Siamese semantic segmentation network, OPT-SAR-MS²Net, is proposed. This network can intelligently learn effective multi-source features and realize end-to-end interpretation of multi-source data. Firstly, the Siamese network is used to extract features from optical and SAR images in different channels. In order to fuse the complementary information, the multi-source feature fusion module fuses the cross-modal heterogeneous remote sensing information from both high and low levels. To adapt to the multi-scale features of the land object, the multi-scale feature-sensing module generates multiple information perception fields. This enhances the network’s capability to learn contextual information. The experimental results obtained using WHU-OPT-SAR demonstrate that our method outperforms the state of the art, with an mIoU of 45.2% and an OA of 84.3%. These values are 2.3% and 2.6% better than those achieved by the most recent method, MCANet, respectively.

Keywords:

multi-source Siamese network; multi-modal feature fusion; multi-scale feature perception

Graphical Abstract

1. Introduction

Intelligent interpretation of remote sensing images has a broad spectrum of practical applications from environmental conservation [1] to transportation navigation [2], ecological analysis, land use planning [3] and so on. The utilization of multi-source remote sensing data allows for the capture of a wider range of information about the Earth’s surface, thereby improving the accuracy of land object classification. Compared to single-source information, the use of multi-source remote sensing data can better generate a more diverse representation and mitigate the ambiguity of the same land objects by exploiting the complementary features of multi-source data [4].

Optical remote sensing images are obtained by capturing energy in the visible wavelength band of electromagnetic radiation. They are more sensitive to the color information of the features and can reflect the characteristics of landforms more intuitively, but are sensitive to the impacts of weather and lighting conditions. When it is raining, snowing, cloudy or foggy, the information in optical remote sensing images will be obscured to some extent. SAR systems enhance our understanding of the Earth’s surface by capturing and analyzing radar signals reflected from various objects. These systems gather both intensity and phase information from the returning radar waves, providing detailed insights into the scattering properties of objects on the Earth’s surface. SAR remote sensing systems are available all day and are unaffected by bad weather or poor lighting conditions. SAR remote sensing systems are more effective than optical remote sensing images for identifying specific surface features of the Earth, such as water bodies. As illustrated in Figure 1a, SAR images provide more intuitive visibility of water bodies in cloudy weather compared to optical remote sensing images. The optical remote sensing image in Figure 1b provides a clearer visualization of the houses and roads compared to the SAR image. Therefore, it is important to extract complementary data from optical and SAR remotely sensed images and integrate features by removing irrelevant information to improve land object classification studies [2].

Currently, research on classifying land objects from multi-source remote sensing images using deep learning methods is rapidly advancing [5]. The accuracy of feature classification can be improved by effectively fusing optical and SAR remote sensing data [6,7]. Studies on land object classification that combine optical and SAR remote sensing imagery are typically classified into three primary categories: (1) pixel-level classification fusion methods; (2) decision-level classification fusion methods; and (3) feature-level classification fusion methods. The first type of method uses multiple data sources to obtain pixel values from various modal images. These values are then directly input into a multi-layer perceptron for analysis [8,9], such as MCNN [10]. This method employs pixel stacking for multi-source fusion. It is a simple and user-friendly approach. However, it does not consider the unique characteristics of each data source. It introduces redundant information, resulting in increased computation. This potentially affects classification accuracy [11]. Methods for decision-level fusion use voting or other methods to fuse the outputs of training on different modal data [12]. Although these methods utilize the individual features of heterogeneous data from multiple sources, they do not consider the complementary nature of these features. Basic feature-level fusion methods are realized by direct feature stacking and splicing, such as PSCNN [13]. However, this method does not exploit the implicit and complementary information of multi-source data, leaving the characterization of fusion features incomplete. In response to this problem, a Siamese network offers a solution. MCANet [14] constructs a Siamese network to extract multi-source fusion features. Unfortunately, this method does not address the issue of the limited receptive fields at a multi-scale in convolutional neural networks. There are also feature-level fusion methods for multiple source remote sensing data that use a multi-stage training approach [15], but the multi-stage training strategy increases the training difficulty and computational effort.

In addition, the overhead view of remote sensing images provides a wealth of information, such as meandering rivers, large areas of vegetation, oceans, cities, villages and so on. Therefore, modeling multi-scale information from remote sensing images can extract features from different classes of land object information [16]. Many deep learning methods use both optical and SAR remote sensing data to classify land objects [17]. However, they often fail to incorporate multi-scale land object information, which limits a network’s performance in distinguishing between different types of land object features.

To tackle the issues mentioned, we propose a joint optical and SAR remote sensing imagery intelligence network for land object classification. This network is called OPT-SAR-MS²Net. Its backbone network is built with two branches that can extract features from optical and SAR remote sensing images in parallel. OPT-SAR-MS²Net includes a feature fusion module that fuses heterogeneous features from a multi-source, enhancing complementary information and suppressing redundant information. Additionally, OPT-SAR-MS²Net consists of a module for perceiving multi-scale information about objects of different sizes in remote sensing images. This work offers the following contributions:

To avoid complex training strategies with multi-stage, multi-source feature fusion networks for optical and SAR remote sensing data, we designed an end-to-end architectural network named OPT-SAR-MS²Net for land object classification.
The OPT-SAR-MFF multi-modal feature fusion module is designed to combine complementary information obtained from optical and SAR remote sensing data. It employs a shallow–deep-level feature fusion strategy to compensate for information loss during network transmission.
To address the issue of a single receptive field in convolutional neural networks, we designed the multi-scale information perception module, OPT-SAR-MIP. This module enhances the feature representation of multi-scale land objects in the top view of remote sensing images.
Our work outperforms other SOTA methods by improving the mIoU and OA by 2.3% and 2.6%, respectively, on the dataset WHU-OPT-SAR [18].

2. Related Work

2.1. Land Object Classification of Single-Source Remote Sensing Imagery

The classification of land objects using remote sensing imagery can be approached in two principal ways: through the application of traditional machine learning methods and through the use of deep learning methods. (1) Traditional machine learning approaches include kernel feature engineering, random forests, Markov random fields, conditional random fields and support vector machines [19,20,21]. However, the majority of non-parametric methods are based on a priori knowledge and hand-designed features, which renders them computationally complex. Furthermore, their design transferability is constrained by the specific problem [22,23]. Consequently, the limitations of machine learning algorithms in terms of generalization ability, coupled with the increase in the spatial resolution of remote sensing images, render the manual extraction of features and the design of classifiers inadequate for the accurate land object classification of remote sensing images acquired by different kinds of sensors [12]. (2) Deep learning techniques are capable of fully exploiting the intrinsic properties of data while demonstrating a robust ability to generalize and maintain their effectiveness in a wide range of contexts [24]. Deep learning methods for land object classification have gone through a development process from FCNs to CNNs [25]. This novel method can more effectively extract implicit information from features and has greater generalization ability [26]. Researchers explored FCN-based land object classification studies for single-source remote sensing images in the early days [27], either through optical [28,29,30] or SAR networks [31,32]. The CNN-based land object classification studies of ConvNets designed by Reichstein et al. [33] enabled high-precision feature classification and had a profound impact on subsequent related research. Some research has employed ConvNets to extract deep features and subsequently utilized machine learning methods for classification and categorization, achieving significantly higher accuracy than machine learning methods alone [34,35]. In order to reduce the necessity for human intervention and to facilitate the automatic interpretation of remote sensing images, researchers have proposed the use of end-to-end deep learning methods [36,37]. To better exploit features in single-source remote sensing images, scholars have focused on attention mechanisms [18]. Yang et al. [38] employed attention mechanisms to calibrate features for feature classification in single-source remote sensing imagery. Some methods [39,40] explored the effectiveness of single-source remote sensing images feature classification for urban scenario planning. In addition, there are several approaches [41,42] to extracting features from single-source remote sensing images using transfer learning. However, relying solely on a single source of data for land object classification has limitations. These limitations arise from the diverse imaging mechanisms, trajectories and duty cycles of each sensor.

2.2. Land Object Classification of Multi-Source Remote Sensing Imagery

The continuous improvement in remote sensing technology has led to an increase in the abundance of remote sensing data acquired by various types of sensors. This has resulted in the availability of multi-modal remote sensing data for the same land object, which was previously more difficult to obtain. Furthermore, the increasing capacity of computational resources to process vast quantities of data enables the deployment of deep learning methodologies with substantial computational power, which can more comprehensively delineate the characteristics of land objects and facilitate the automated interpretation of remotely sensed images [43,44]. This have driven the trend towards utilizing images from multiple remote sensing sources for smart analysis [45]. Paisitkriangkrai et al. [34] directly utilized pixel stacking of multi-source remote sensing data for land object classification. Although this approach is simple and easy to follow, it lacks the ability to extract implicit features from multi-source data. Concurrently, this method does not filter essential features, and the final land object classification effect is susceptible to being easily perturbed by local information [5]. In the methods in [15,46], decision-level multi-source fusion was used for land use classification, and although such methods are able to characterize multi-source features, they do not take into account the complementarity between them. To better explore the effective complementary information of different modal remote sensing data, some studies [47,48] have utilized a machine learning approach to manually construct multi-source feature fusion models for land use classification. However, this method [49,50] of manually selecting features is less efficient and also more complex to compute. Feature-level fusion methods employ fused features extracted by manually designed or deep learning models for classification. Basic feature-level fusion is accomplished by splicing and summing [11], which involves merging and aggregating heterogeneous remote sensing features from multiple sources. But this method cannot filter redundant information, which may affect the classification accuracy. In response to this challenge, Li et al. [17] introduced an attention mechanism that effectively mitigates the problem. Attention-based methods can select valid information from cross-modal data and eliminate redundant noise. in the literature [51], the impact of combining complementary features from multi-source remote sensing data on classification effectiveness was investigated. However, most methods do not give specific consideration to the characteristics of top views in remote sensing images. These images typically capture complex scenes with varying sizes of the Earth’s surface objects. Therefore, integrating multi-scale perception is crucial for intelligent interpretation.

3. Method

Using deep learning techniques, we propose a new network, OPT-SAR-MS²Net, for land object classification by fusing multi-source heterogeneous features of optical and SAR remote sensing data. Figure 2 displays the overall framework. Our design approach is based on the classic DeepLabV3+ [52] encoding and decoding architecture. The encoder part comprises a two-branch Siamese multi-source feature extraction module (MFE). It extracts both deep and shallow features from SAR and optical remote sensing data simultaneously. This approach avoids the complexity of multi-stage training strategies for feature classification from multi-source remote sensing data and achieves an efficient network architecture. To integrate complementary information and reduce redundancy of heterogeneous features from multiple sources, we developed a multi-modal feature fusion module named OPT-SAR-MFF. This module fully exploits implicit features and interactive relationships in the fusion of heterogeneous remote sensing images. Additionally, a module called OPT-SAR-MIP was developed to sense multi-scale information in remotely sensed images that feature overhead views and objects of various sizes on the Earth’s surface. This module constructs parallel branches of sensory fields at different scales to compensate for the limitations of convolutional natural receptive fields. It enhances contextual semantic information extraction and global information characterization for better classification results. The decoder section generates final fine-grained feature classification results by aggregating cross-modal fusion features for upsampling.

During the training process, firstly, the optical and SAR remote sensing images are input separately into the dual-branch Siamese multi-source feature extraction module, MFE. With this module, shallow-level and deep-level optical and SAR feature maps can be extracted. Subsequently, the shallow optical and SAR feature maps are input into the OPT-SAR-MFF multi-source feature fusion module to extract learnable multi-modal heterogeneous fusion features. This helps guide the network to adaptively and accurately control the fusion ratio of different modal information, resulting in lower-level cross-modal fusion features that are rich in more detailed characteristics. The optical and SAR feature maps are stacked and input into the OPT-SAR-MIP multi-scale information perception module. This module extracts higher-level cross-modal fusion features with multi-scale receptive fields that are rich in semantic information. Finally, the cross-modal fusion features with lower-level, detail-rich information and higher-level semantic information are spliced to improve the loss of effective features during network transmission, enhancing the completeness of the fusion feature representations. The encoding part of multiple source feature classification is then completed. The decoding part utilizes a decoder to generate the final land object classification results. This is achieved through the use of a convolutional layer and upsampling of the encoder-processed features.

3.1. OPT-SAR-MS²Net Encoder Part

The OPT-SAR-MS²Net encoder part consists of two parts: multi-source feature extraction and feature fusion for optical and SAR remote sensing data, respectively. The first part relies on the Siamese network architecture to construct a double branch and input the remote sensing information from multiple sources in parallel with extracting the unique features of the two different modes. The optical and SAR remote sensing data’s respective features are fused using hierarchical features from deep and shallow convolutional layers to minimize information loss during network transmission. Simultaneously, the two subnetworks perform cross-modal feature fusion. The fusion module precisely controls the fusion ratio of different modes to extract cross-modal features. This adaptively guides the entire network to efficiently learn the effective characterization of cross-modal features under different sample distributions. Subsequently, the deep and shallow features as well as cross-modality features extracted from the network are vertically and horizontally integrated to achieve feature encoding for the fine classification of features in multi-source remote sensing images. Additionally, a multi-scale receptive field module with parallel branching and multi-scale capabilities was designed to capture contextual semantic information. This module enhances global features in response to the receptive field limitations of convolutional neural networks.

3.1.1. Dual Branch Siamese Multi-Source Feature Extraction Architecture

Optical and SAR remote sensing images have different image appearance characteristics due to their distinct imaging mechanisms. Therefore, a Siamese architecture was designed to process these two types of images. The architecture comprises a two-branch convolutional stream with the same structure but different parameters. This enables the independent and parallel extraction of features from the two different modalities. As shown in Figure 2, firstly, the backbone network receives the optical remote sensing image of size 256

\times

256

\times

4 and the SAR remote sensing image of size 256

\times

256

\times

1 as inputs. The architecture generates feature maps from remotely sensed images of two different modalities with a size of 128

\times

128

\times

64. The feature maps are input into a convolutional block that includes a Conv2d convolutional layer, a BatchNorm2d regularization layer and a Relu activation layer. The resulting output is then resized to 64

\times

64

\times

256. Secondly, the feature maps are processed through three consecutive convolutional blocks. The blocks produce optical and SAR feature maps of three different sizes: 32

\times

32

\times

512, 32

\times

32

\times

1024 and 32

\times

32

\times

2048. Finally, the feature maps of size 32

\times

32

\times

2048 are fed into the deformable convolutional layer (DCN) to output feature maps of size 32

\times

32

\times

256. At a certain network depth, the feature map streams of the optical image

P_{o p t}

and the SAR image

P_{S A R}

can be described separately as the following:

{C o v}_{o p t} = \sum_{i = 1}^{L} F_{o p t} (P_{o p t}^{i}, W_{o p t}^{i})

(1)

{C o v}_{S A R} = \sum_{j = 1}^{L} F_{S A R} (P_{S A R}^{j}, W_{S A R}^{j})

(2)

where

i

and

j

represent each convolutional unit, L represents the depth of the convolutional layer, and

W_{o p t}^{i} a n d W_{S A R}^{j}

represent linear projections of optical and SAR remote sensing data.

F_{o p t} (P_{o p t}^{i}, W_{o p t}^{i}) a n d F_{S A R} (P_{S A R}^{j}, W_{S A R}^{j})

denote the feature maps learned by the network through optical and SAR remote sensing image slices, respectively.

{C o v}_{o p t}

and

{C o v}_{S A R}

denote the summation of the feature maps learned from the initial to the L-layer convolutional layer networks for optical and SAR remote sensing images, respectively.

3.1.2. Multi-Source Feature Fusion Module OPT-SAR-MFF

To ensure optimal utilization of complementary information from optical and SAR remote sensing data and minimize redundancy, we propose a multi-source feature fusion module named OPT-SAR-MFF. This module fuses information from the two feature maps

{C o v}_{S A R}^{l o w}, {C o v}_{o p t}^{l o w}

. It guides the network to capture useful components at each pixel location on the feature map of a single data source, resulting in effective spatial fusion, as illustrated in Figure 3.

The fusion module is divided into three fusion phases. Firstly, the feature maps of the two modes are fed into the depth-wise overparameterized convolutional layer (Do-Conv) to obtain the shallow-level feature maps of the two modes with a size of 64

\times

64

\times

256,

{C o v}_{S A R}^{l o w}, {C o v}_{o p t}^{l o w}

, respectively. Subsequently, the two feature maps are stacked, and channel compression is then performed to obtain a primary fusion feature map with a size of 64 × 64 × 256. This feature map is subjected to a first stage of initial fusion of the features of the two modes after simple stacking and channel extrusion, using the following mathematical expression:

X_{S A R - o p t}^{F 1} = S q u e e z e d ({C o v}_{S A R}^{l o w} ∥ {C o v}_{o p t}^{l o w})

(3)

where ‘

∥

’ represents the Concat channel stacking operation and ‘Squeezed’ represents the channel compression operation.

X_{S A R - o p t}^{F 1}

represents the feature map resulting from the fusion of optical and SAR remote sensing images in the initial stage of the process.

The primary fusion stage does not however focus on the unique properties of each modality and the redundancy of cross-modal information, so a second fusion stage is required. The second stage of the fusion process, due to the consideration that SAR remote sensing images are more sensitive to the Earth’s surface texture information, introduces a global mean pooling operation. This method preserves texture information and reduces errors caused by parameter errors in the convolutional layer, which lead to a shift in the estimated mean value. On the other hand, optical remote sensing images can effectively reflect overall feature structure information. Therefore, the global maximum pooling operation is implemented to preserve essential background information from the image and to mitigate errors due to the increased variance in the estimated values caused by the limited neighborhood size. The feature maps from the first stage are combined through global average pooling and global maximum pooling operations and then added together. The two pooling techniques are represented by the following mathematical formulas:

X_{a v g} = G A P (X) = \frac{1}{H \times W} \sum_{i = 1}^{H} \cdot \sum_{j = 1}^{W} X_{i, j}, X_{a v g} \in R^{C \times 1 \times 1}

(4)

X_{m a x} = G M P (X) = m a x \sum_{i = 1}^{H} \cdot \sum_{j = 1}^{W} X_{i, j}, X_{m a x} \in R^{C \times 1 \times 1}

(5)

X_{S A R - o p t}^{F 2} = X_{a v g} + X_{m a x}

(6)

The intermediate input tensor is represented by

X \in R^{C \times H \times W}

. The dimensions of the feature map are represented by

H

and

W

, respectively. The value of the feature on the ith row and jth column is denoted by

X_{i, j}

. Two global spatial context maps are generated following two distinct pooling operations: the global average pooling operation

G A P (\cdot)

and the global maximum pooling operation

G M P (\cdot)

.

X_{a v g}

represents the global average pooled features, while

X_{m a x}

represents the global maximum pooled features. The fusion features at the second level, which have a size of 1

\times

1

\times

256, are obtained by summing the results of the two pooling operations.

X_{S A R - o p t}^{F 2}

denotes the features obtained from the fusion of the second-stage optical and SAR remote sensing data.

In the third stage, the adaptive weights for the two sets of cross-modal fusion features are obtained by upscaling the channel into a tensor of size 1

\times

1

\times

512. The tensor is then divided into two tensors of size 256

\times

1

\times

1, and these two components are normalized using the Softmax function. Use the following mathematical expression:

W_{o p t} = s o f t m a x (X_{S A R - o p t}^{F 3 - 1})

(7)

W_{S A R} = s o f t m a x (X_{S A R - o p t}^{F 3 - 2})

(8)

s o f t m a x (x_{i}) = - \log \frac{e^{x_{i}}}{\sum_{c = 1}^{2} e^{x_{c}}}

(9)

where

x_{i}

is the value of each vector channel of the different mode, and the Softmax function can convert the values of the respective channels of the two modal vectors into a probability distribution in the range of [0,1]. These probabilities are then summed to unity, representing the corresponding weights of each mode.

X_{S A R - o p t}^{F 3 - 1}

and

X_{S A R - o p t}^{F 3 - 2}

represent features derived from the integration of optical and SAR remote sensing data in the first and second branches of the third stage, respectively.

In this stage, the corresponding input features are multiplied by two sets of weights to obtain two feature tensors of size 64

\times

64

\times

256, which are then weighted by the cross-modal fusion weights. These two feature tensors are then summed and channel compressed to obtain a low-level fusion feature tensor of size 64

\times

64

\times

48.

3.1.3. Multi-Scale Information Perception Module OPT-SAR-MIP

Convolution inherently suffers from the limitation of perceiving a single receptive field, which is contradicted by the fact that complex scenes in remote sensing images often contain multi-scale land objects. Therefore, the ability to perceive multi-scale land objects is particularly important for understanding contextual information in complex background scenes. Thus, we propose a new module to extract multi-source fusion information from the higher-level convolutional layer portion for multi-scale information perception modeling. This module is able to extract different receptive field information of input features to obtain multi-scale expression and contextual semantic information of the features and improve the utilization efficiency of the features, as shown in Figure 4.

Firstly, we obtain a 256

\times

64

\times

64 feature tensor by using the deeper-level semantic features

{C o v}_{S A R}^{h i g h}

and

{C o v}_{o p t}^{h i g h}

extracted by the Siamese network. These features are used as inputs for the channel stacking operation and are then processed by the OPT-SAR-MIP multi-scale information perception module. The module uses parallel independent branches to aggregate multi-scale contextual information, and multi-scale receptive fields are realized using dilated convolutional layers with different dilation rates. The OPT-SAR-MIP module comprises two parts. The first part consists of four parallel branches, each of which includes a series-connected 1

\times

1 convolutional layer, 3

\times

3 convolutional layer and 3

\times

3 dilated convolutional layer, and the dilation rates for each of the four branches are 2, 3, 5 and 7, respectively. This allows for the capture of receptive fields of four different scales, with sizes of 7, 9, 15 and 29. The second part consists of a series of layers including a 1

\times

1 convolutional layer, a 1

\times

1 pooling layer and an upsampling layer. This series serves to complement the lost features of the network propagation process. The data stream with five branches is then passed through a 1

\times

1 convolutional layer to compress the channels after the concatenation operation, resulting in a feature tensor of size 64

\times

64

\times

256.

The encoder part finally acquires the fusion feature of optical and SAR multi-source data. This fusion feature is characterized by cross-modal, multi-level and multi-scale receptive fields.

3.2. OPT-SAR-MS²Net Decoder Part

The encoder part’s multi-modal fusion features are then input into the decoder part to produce the ultimate land object classification outcome. The decoder part consists of 1

\times

1 convolutional layers and an upsampling layer to obtain the classification result of the joint optical and SAR remote sensing features. OPT-SAR-MS²Net is an end-to-end multi-source remote sensing classification network for images of land objects. The decoder part’s output is projected onto N feature maps, indicating the probability of each pixel being assigned to one of the N classes in the optical and SAR multi-source remote sensing images.

3.3. Loss Function

To enhance feature classification results, we used the cross-entropy loss function to constrain feature classification. This enables the network to converge rapidly when fusing features from various modalities’ information. The cross-entropy loss function is a commonly used method for multi-categorization in semantic segmentation tasks, and its mathematical expression is given by Formula (10):

l o s s (x, c l a s s) = - \log (\frac{e^{x_{class}}}{\sum_{j}^{} e^{x_{j}}})

(10)

where the variable

x

represents the prediction result, which is a one-dimensional tensor representing the number of categories. The term ‘class’ refers to the true label of the sample.

Joint multi-source datasets of optical and SAR remote sensing imagery are characterized by unbalanced data distribution. We counted the proportion of pixels in each category of the dataset and calculated the penalty weight for each category, using the following equation:

F_{w e i g h t} (x, c l a s s) = \frac{\sum_{i}^{N} {P i x}^{x_{i}}}{\sum {P i x}^{x_{c l a s s}}}

(11)

The variable N means the total number of categories,

{P i x}^{x_{i}}

signifies the number of pixels in category

i

and

{P i x}^{x_{c l a s s}}

denotes the number of pixels to be counted in a particular category.

During the training process, penalty weights are assigned as weighted targets to each corresponding category, using the following mathematical expression as Equation (12):

l o s s (x, c l a s s, w e i g h t) = F_{w e i g h t} . (- \log (\frac{e^{x_{class}}}{\sum_{j}^{} e^{x_{j}}}))

(12)

where ‘

F_{w e i g h t}

’ represents the penalty weight for different categories, and it is used to weigh the loss function for different categories.

4. Experiment and Analysis

To demonstrate OPT-SAR-MS²Net’s effectiveness in land object classification, we trained and tested its performance on the publicly available dataset WHU-OPT-SAR [14]. Additionally, we conducted numerous ablation experiments to validate our method.

4.1. Experimental Settings

4.1.1. Description of the Dataset

The WHU-OPT-SAR dataset is a publicly available collection of multi-source remote sensing images released by Wuhan University. The dataset comprises 100 pairs of optical remote sensing images, each with corresponding SAR remote sensing images. Each image has dimensions of 5556 × 3704 pixels. The data have a resolution of 5 m and cover a non-contiguous area of 5,144,856 km² in Hubei Province. They cover various geomorphological features and are primarily used for land object classification. They are divided into seven categories: farmlands, cities, villages, waters, forests, roads and others. Unlabeled pixels are uniformly classified as background. Pixel category counts were performed on the dataset and the corresponding penalty weights were calculated in the loss, as presented in Table 1.

The penalty weight in Table 1 was calculated using Equation (11), which aims to assign higher weights to different land object classes with less pixel occupancy for the calculation of loss and lower weights to those with more pixel occupancy. The objective is to enhance the imbalance between classes and accelerate network convergence.

We processed the WHU-OPT-SAR dataset into 256

\times

256 non-overlapping patches as the data were fed into the network. Table 2 displays the quantity of images in the training, validation and test sets, with a ratio of 6:2:2.

4.1.2. Experimental Configuration and Evaluation Criteria

The pretrained models of ResNet101 [53] on ImageNet [54] were embedded into the backbone of OPT-SAR-MS²Net for feature extraction. The new first layer of convolution in ResNet was initialized using a Gaussian distribution. The model was trained for 200 epochs on the Ubuntu 18.04 platform using the NVIDIA GTX4090 (24 GB) and the PyTorch architecture with Python 3.8 compiler language. The hyperparameter values were the following: the batch size was set 16, the learning rate was set 0.001, the SGD optimizer was used and the momentum was 0.9. The learning strategy employs the Cosine Annealing, CosineAnnealingLR formula (13):

η_{t} = η_{m i n} + \frac{1}{2} (η_{m a x} - η_{m i n}) (1 + \cos (\frac{T_{c u r}}{T_{m a x}} π))

(13)

where the current learning rate is denoted as

η_{t}

, the minimum learning rate is

η_{m i n}

, the maximum learning rate is

η_{m a x}

, the current epoch is

T_{c u r}

and the maximum epoch is

T_{m a x}

.

To evaluate the performance of OPT-SAR-MS²Net, the OA, mIou and Kappa coefficient were used as evaluation metrics. These can be expressed using the following mathematical formula:

O A = \frac{\sum_{i = 1}^{K} P_{i j}}{\sum_{i = 1}^{K} \sum_{j = 1}^{K} P_{i j}}

(14)

where the variable

K

is the total count of the categories and

P_{i j}

represents the pixels of each category.

mIou = \frac{1}{K + 1} \sum_{i = 0}^{k} \frac{P_{i i}}{\sum_{j = 0}^{k} P_{i j} + \sum_{j = 0}^{k} P_{j i} - P_{i i}}

(15)

In the given equation,

P_{i i}

is the number of correctly classified pixels. Meanwhile,

P_{i j (j i)}

represents the pixels where the category

i (j)

is misclassified to be category

j (i)

.

K a p p a = \frac{{p_{o} - p}_{e}}{{1 - p}_{e}}

(16)

In this equation, the

K a p p a

coefficient is utilized to quantify the effectiveness of classification. The overall classification accuracy (

p_{o}

) and chance consistency (

p_{e}

) are used to calculate it. A lower Kappa value indicates a lower effectiveness of classification, especially when the confusion matrix is unbalanced under the same overall accuracy condition [55].

4.2. Experimentation on the Dataset

In this part, we compared the OPT-SAR-MS²Net model with other methods using qualitative and quantitative experiments. The various comparison methods employed in the comparison experiments were subjected to training and testing using the same optical and SAR multi-source remote sensing images. The methods were SegNet [56], DeeplabV3+ [57], non-local neural network [58], SA-Gate [11], UNet3+ [59] and MCANet [34]. Compared to several other methods, OPT-SAR-MS²Net obtained the highest OA (0.843) and mIOU (0.452), with 2.6% and 2.3% improvement over the recent method MCANet, respectively. The Kappa coefficient (0.720) is lower than MCANet’s by 1.5%, with the indicators coming in third place compared to other methods. The main reason is that OPT-SAR-MFF, the multi-modal feature fusion module of OPT-SAR-MS²Net, fully exploits the complementary information from multi-modal data and efficiently models the interaction relationship of the fused features. Additionally, OPT-SAR-MS²Net demonstrates superior classification accuracy in various categories when compared to other state-of-the-art methods. It achieved the highest scores in the categories of villages (0.759), waters (0.796), roads (0.868) and others (0.285), respectively. The classification accuracy for villages is superior to that of other algorithms, primarily due to the fact that OPT-SAR-MS²Net employs not only multi-source feature fusion but also deep-level and shallow-level feature fusion. This enables the vertical and horizontal fusion of features and the integration of information. In this manner, the model is capable of extracting texture features of villages at shallow depths that are not lost as the network propagates. Furthermore, it is able to complement the fusion features with information. The classification accuracy of farmlands (0.723), forests (0.922) and cities (0.537) was slightly lower than the latest model MCANet’s by 0.074, 0.036 and 0.051, in each case. It is evident that OPT-SAR-MS²Net performs better in categories with a wide distribution area. This demonstrates that OPT-SAR-MIP, the multi-scale information perception module in the model, enhances the expression of global information features. The experimental results are presented in Table 3.

The results of the land object classification visualization and the experimental results are divided into eight groups as shown in Figure 5, Figure 6 and Figure 7. The groups are numbered from a_i to j_i, with the subscript i indicating the group number. The presentation begins with optical remote sensing images, followed by SAR remote sensing images, labels and results from other comparison methods such as SegNet, DeeplabV3+, non-local, SA-Gate, UNet3+ and MCANet. Finally, the results for OPT-SAR-MS²Net are presented.

The first group of experiments is presented from a₁ to j₁ in Figure 5. The classification of roads (highlighted in red boxes) by OPT-SAR-MS²Net exhibits a more continuous pattern, featuring clearer and more detailed edges compared to other methods. However, DeeplabV3+, non-local and MCANet also extracted more continuous roads, but they have rougher edges. Moreover, the road classification results in d₁ (SegNet) and h₁ (UNet3+) exhibit broken and blurred edges. As can be seen in Figure 5, OPT-SAR-MS²Net classified meandering water bodies (in the pink box) much better than other methods, providing accurate contours and detailed curved edges of water bodies. From the results in Figures d₁ to i₁, it can be seen that other methods could not extract an entire water body, resulting in varying degrees of fragmentation and inaccurate representation of its physical characteristics. Meanwhile, OPT-SAR-MS²Net correctly extracted farmland categories with small ranges (in the white circle), while all other comparison methods exhibit misclassification phenomena. The results of the second group of experiments can be shown from Figures a₂ to j₂, OPT-SAR-MS²Net accurately extracted continuous feature information from the mesh roads (in the red box), Although i₂ (MCANet) also shows good continuity in the extracted mesh roads, the edges are relatively rough. Conversely, the road classification results of the other comparison methods have varying degrees of residuals and misclassification of neighboring areas. For instance, d₂ (SegNet) and g₂ (SA-Gate) misclassified some parts of the ‘roads’ category as ‘villages’. All other comparison methods exhibit varying degrees of misclassification in the lower-left position (in the white box). It is evident that OPT-SAR-MS²Net yields superior classification results for road information. This demonstrates the effectiveness of OPT-SAR-MIP, the multi-scale information perception module within the network, in modeling long-distance dependencies between pixels. This allows for an effective description and characterization of global information for feature targets, ultimately improving the continuity and integrity of classification results for narrow and curved feature targets, such as roads and water bodies.

Images labelled from a₃ to j₃ are the results of the third group of experiments. This group contained large water bodies (in red boxes), and OPT-SAR-MS²Net demonstrated a superior classification of various read contours compared to other methods. Additionally, with the complex background area shown by the blue box in Figure c₃, OPT-SAR-MS²Net outperformed other methods in terms of accuracy. In the fourth and fifth groups of experiments (from Figure a₄ to j₅), OPT-SAR-MS²Net was able to extract large, curved and narrow water features completely and continuously, With smoother and sharper edges than other methods. The networks represented by d₄ (SegNet) and i₄ (MCANet) were able to extract continuous water bodies in the fourth set of experiments, but with more granular edges. However, other methods resulted in water bodies with varying degrees of fracture and blurring, and they also misclassified large areas of forest as farmland. The above three groups of experiments demonstrate the effectiveness of the module OPT-SAR-MFF in OPT-SAR-MS²Net. It is noteworthy that the water body has lower reflection coefficients for the SAR remote sensing system. In SAR remote sensing images, the water body is depicted as a black area with a clear contour. Therefore, the model can effectively adjust its fusion ratio to obtain better results for water body feature classification.

The sixth group (from Figure a₆ to j₆) of experiments demonstrates that OPT-SAR-MS²Net (j₆) is effective in classifying farmland (highlighted in the red box), which is a type of feature target that covers a flat area. In the seventh and eighth groups of experiments, OPT-SAR-MS²Net also demonstrated a relatively high accuracy rate when processing complex feature scenes. In the seventh group of experiments (from Figure a₇ to j₇), the red box shows that a small portion of the village’s concave fossa was prone to misclassification. Similarly, in the eighth group of experiments (from Figure a₈ to j₈), the blue box shows that smaller water bodies were also easily misclassified. However, OPT-SAR-MS²Net (j₈) could accurately and completely extract feature information from all of them. It has been demonstrated that the high- and low-order longitudinal fusion strategy of the multi-source feature extraction module MFE in the model can effectively utilize the detailed and shape features in the shallow network as a supplement to the fused features. This reduces the loss of shallow features due to network transmission phenomena. From Figure a₈ to j₈, the qualitative results presented in the red box, when considered alongside the quantitative data presented in Table 3, demonstrate that the class of roads yielded superior outcomes when utilizing DeeplabV3+ and OPT-SAR-MS²Net. Nevertheless, the classification effect of OPT-SAR-MS²Net for roads was more continuous and the edges are more detailed compared to DeeplabV3+. The primary rationale for this analysis is that OPT-SAR-MS²Net is founded upon DeeplabV3+ encoding–decoding architecture. Consequently, both methodologies are underpinned by a comparable data processing framework, which implies that they exhibit comparable sensitivity to the same type of features. The OPT-SAR-MS²Net developed a multi-scale sensing module OPT-SAR-MIP within the encoding section of the algorithm. This module enables the algorithm to more effectively capture contextual semantic information and to perform better deciphering of roads.

4.3. Ablation Experiments

In order to validate the effectiveness and necessity of each of the modules in OPT-SAR-MS²Net, ablation experiments were performed on the WHU-OPT-SAR dataset in this subsection.

4.3.1. Effectiveness of Each Source of Data

Table 4 shows that using two homogeneous SAR remote sensing images as inputs for inference with OPT-SAR-MS²Net improved the metrics OA, mIoU and Kappa by 0.6%, 0.8% and 0.6%, respectively, compared to using two homogeneous optical remote sensing images trained on the same dataset. Furthermore, the accuracy of the ‘water’ classification was improved by 16.2%. This improvement is primarily due to the fact that water bodies have a significantly lower scattering signal compared to other targets, making them more detectable by the SAR remote sensing system. Additionally, water bodies exhibit a distinct specular reflection feature on SAR remote sensing images, which facilitates their classification as a separate class. In contrast, training the network using only SAR remote sensing images resulted in lower accuracies for cities and forests by 4.8% and 2.5%, respectively, compared to using only optical remote sensing images. The shadowing and overlapping of SAR images was due to its side-view imaging characteristics. The presence of buildings and trees with varying heights and mutual shading in cities and forests can also affect classification accuracy. Table 4 shows that OPT-SAR-MS²Net achieved better results for the OA (0.843), mIou (0.452), Kappa (0.720) and accuracy of each class when using both optical and SAR data as inputs, compared to using individual modalities. This confirms the significance of using multi-source feature fusion for land object classification. It takes advantage of the complementary strengths of optical and SAR remote sensing imagery.

4.3.2. Effectiveness of Each Module

As shown in Figure 2, the encoding part of OPT-SAR-MS²Net contains two modules, OPT-SAR-MFF and OPT-SAR-MIP. ResNet’s pretrained model on ImageNet is used in the backbone network for transfer learning. The effectiveness of various modules was validated by combining them in different combinations and verifying them on test sets. ‘√ ’ means that the model was used in the experiment.

Table 5 demonstrates the consistent superiority of the pretrained model augmented with ResNet101 on ImageNet over its counterpart enhanced with ResNet50 on ImageNet. The ResNet101 residual network has more parameters that can guide the network to learn more about what is beneficial for performance. This is because of its architecture. If the same backbone network is used, adding only the OPT-SAR-MIP module results in more significant overall performance improvement than adding only the OPT-SAR-MFF module. Although OPT-SAR-MFF is capable of fusing and expressing features from multiple sources, it cannot achieve better results by solely considering fusion without addressing the issue of large differences in feature sizes caused by the unique overhead view of remote sensing images. Similarly, the best results cannot be obtained if only the feature multi-scale problem of remote sensing images is considered without effective fusion of multiple source remote sensing data. Thus, the best outcomes are achieved by utilizing ResNet101’s pretraining weights for the backbone network and incorporating both the OPT-SAR-MFF and OPT-SAR-MIP modules.

4.3.3. Effectiveness of Loss with Penalty Weight

We optimized the model using the cross-entropy loss function. Because of the data imbalance in the dataset, we assigned penalty weights to different categories to improve the model’s effectiveness. This operation mitigates the negative impact of data imbalance issues.

Table 6 demonstrates that incorporating penalty weights into the cross-entropy loss function improves model inference results when using the same dataset, hyperparameters and training strategy. The OA and mIoU values are 0.843 and 0.452, respectively, and the Kappa coefficient is 0.720. These results are 2.5%, 15.2% and 5.9% better than the results obtained without penalty weights. The closer the Kappa coefficient is to 1, the better the model’s performance. This verifies the optimization effect of using penalty weights in the cross-entropy loss function.

4.3.4. Robustness of the Model

In order to assess the robustness of the model, Gaussian noise with a mean of 0 and a variance of 25 was added to the test sets as shown in Figure 8. In contrast to other types of noise, such as salt-and-pepper noise, Gaussian noise is present at almost every pixel point, with the depth of the noise being randomly distributed. This can effectively simulate the remote sensing image that is interfered with by noise in a real scene.

As shown in Figure 8, it can be observed that the addition of Gaussian noise to optical remote sensing images has a more intuitive effect. In contrast, the visual appeal of SAR images with Gaussian noise is diminished due to the interference of coherent spot noise generated by the imaging mechanism of the SAR system. Qualitative analysis of the classification results e₁ and e₂ revealed that OPT-SAR-MS²Net is an effective network architecture for various land objects. In particular, roads and water bodies can be segmented into complete, continuous entities with smooth edges, thus demonstrating the superior robustness of OPT-SAR-MS²Net.

Table 7 illustrates that the addition of Gaussian noise had a minimal impact on the OA (0.439) and Kappa (0.7211), which remained only 0.01% lower than those without noise. Furthermore, the mIou remained unchanged. These results demonstrate that OPT-SAR-MS²Net is capable of removing redundant features and noise in multi-source information during multi-source feature fusion, thereby obtaining more effective information characterization. Furthermore, the multi-scale perception module OPT-SAR-MIP in OPT-SAR-MS²Net is capable of fully extracting contextual information and modelling the overall feature characteristics with a certain degree of robustness and anti-interference ability.

5. Conclusions

This paper explored the method of using joint optical and SAR remote sensing data for classifying land objects, with the following conclusions: (1) A multi-source end-to-end Siamese semantic segmentation network for land use classification, OPT-SAR-MS²Net, is proposed. The OPT-SAR-MS²Net algorithm considers the respective advantages and disadvantages of optical and SAR remote sensing images, effectively integrates the complementary information, and exploits the potential advantages of the features. (2) The module OPT-SAR-MFF was designed to efficiently extract the complementary features of two different modes, optical and SAR remote sensing images. Therefore, the construction of multi-source feature fusion architecture can effectively merge the cross-modal features of remote sensing data and improve the intelligent interpretation effect. (3) The OPT-SAR-MIP module was designed to address the issue of large differences in the scale of various feature targets in remote sensing image overhead scenes. (4) The performance of OPT-SAR-MS²Net was tested on the WHU-OPT-SAR dataset, demonstrating the advantages of OPT-SAR-MS²Net in land object classification joint with optical and SAR remote sensing data. In addition, OPT-SAR-MS²Net significantly improved the classification accuracy for village and roads.

Synthetic aperture radar sensors produce images with shadows and overlaps due to their side-view imaging characteristics. This occurs in various scenarios, such as in cities or forested areas. Currently, our approach is not specifically designed for these land objects. Subsequent research will continue to enhance the network’s generalization ability. Subsequent extensions of the OPT-SAR-MS²Net algorithm will optimize its portability to other domains, such as in the analysis and interpretation of sonar system data [60,61,62].

Author Contributions

Conceptualization, W.H., Y.L. (Yong Liu), M.J., P.G., Z.Y. and Y.L. (Yuhang Liu); Methodology, W.H., Y.L. (Yong Liu), M.J., L.M. and P.G.; Software, W.H. and W.Y.; Formal analysis, X.W.; Investigation, L.C. and P.G.; Data curation, W.Y.; Writing—original draft, W.H.; Writing—review & editing, X.W., M.J., P.G. and Y.L. (Yuhang Liu); Visualization, F.Z. and Z.Y.; Supervision, Y.L. (Yong Liu) and L.M.; Project administration, L.C. and L.M.; Funding acquisition, L.C. and Y.L. (Yong Liu). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, Grant Number 61901504.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Zhi Yang was employed by the company DFH Satellite Co., Ltd. The remaining authors declare that the re-search was conducted in the absence of any commercial or financial relationships that could be construed as a poten-tial conflict of interest.

References

Abdi, A.M. Land cover and land use classification performance of machine learning algorithms in a boreal landscape using Sentinel-2 data. GIScience Remote Sens. 2020, 57, 1–20. [Google Scholar] [CrossRef]
Bai, Y.; Sun, G.; Li, Y.; Ma, P.; Li, G.; Zhang, Y. Comprehensively analyzing optical and polarimetric SAR features for land-use/land-cover classification and urban vegetation extraction in highly-dense urban area. Int. J. Appl. Earth Obs. Geoinf. 2021, 103, 102496. [Google Scholar] [CrossRef]
Girma, R.; Fürst, C.; Moges, A. Land use land cover change modeling by integrating artificial neural network with cellular Automata-Markov chain model in Gidabo river basin, main Ethiopian rift. Environ. Chall. 2022, 6, 100419. [Google Scholar] [CrossRef]
Liu, J.; Gong, M.; Qin, K.; Zhang, P. A deep convolutional coupling network for change detection based on heterogeneous optical and radar images. IEEE Trans. Neural Netw. Learn. Syst. 2016, 29, 545–559. [Google Scholar] [CrossRef]
Audebert, N.; Le Saux, B.; Lefèvre, S. Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks. ISPRS J. Photogramm. Remote Sens. 2018, 140, 20–32. [Google Scholar] [CrossRef]
Schmitt, M.; Zhu, X.X. Data fusion and remote sensing: An ever-growing relationship. IEEE Geosci. Remote Sens. Mag. 2016, 4, 6–23. [Google Scholar] [CrossRef]
Mou, L.; Zhu, X.; Vakalopoulou, M.; Karantzalos, K.; Paragios, N.; Le Saux, B.; Moser, G.; Tuia, D. Multitemporal very high resolution from space: Outcome of the 2016 IEEE GRSS data fusion contest. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3435–3447. [Google Scholar] [CrossRef]
Yuan, H.; Van Der Wiele, C.F.; Khorram, S. An automated artificial neural network system for land use/land cover classifi-cation from Landsat TM imagery. Remote Sens. 2009, 1, 243–265. [Google Scholar] [CrossRef]
Zhang, C.; Sargent, I.; Pan, X.; Li, H.; Gardiner, A.; Hare, J.; Atkinson, P.M. Joint Deep Learning for land cover and land use classification. Remote Sens. Environ. 2019, 221, 173–187. [Google Scholar] [CrossRef]
Zhao, W.; Du, S. Learning multiscale and deep representations for classifying remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2016, 113, 155–165. [Google Scholar] [CrossRef]
Chen, X.; Lin, K.-Y.; Wang, J.; Wu, W.; Qian, C.; Li, H.; Zeng, G. Bi-directional cross-modality feature propagation with sep-aration-and-aggregation gate for RGB-D semantic segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 561–577. [Google Scholar]
Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.-S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef]
Mou, L.; Schmitt, M.; Wang, Y.; Zhu, X.X. Identifying corresponding patches in SAR and optical imagery with a convolutional neural network. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 5482–5485. [Google Scholar]
Li, X.; Zhang, G.; Cui, H.; Hou, S.; Wang, S.; Li, X.; Chen, Y.; Li, Z.; Zhang, L. MCANet: A joint semantic segmentation framework of optical and SAR images for land use classification. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102638. [Google Scholar] [CrossRef]
Li, X.; Zhang, G.; Cui, H.; Hou, S.; Chen, Y.; Li, Z.; Li, H.; Wang, H. Progressive fusion learning: A multimodal joint segmentation framework for building extraction from optical and SAR images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 178–191. [Google Scholar] [CrossRef]
Jensen, J.R.; Qiu, F.; Patterson, K. A neural network image interpretation system to extract rural and urban land use and land cover information from remote sensor data. Geocarto Int. 2001, 16, 21–30. [Google Scholar] [CrossRef]
Li, X.; Lei, L.; Sun, Y.; Li, M.; Kuang, G. Collaborative attention-based heterogeneous gated fusion network for land cover classification. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3829–3845. [Google Scholar] [CrossRef]
Li, X.; Lei, L.; Sun, Y.; Li, M.; Kuang, G. Multimodal bilinear fusion network with second-order attention-based channel se-lection for land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1011–1026. [Google Scholar] [CrossRef]
Solberg, A.; Taxt, T.; Jain, A. A Markov random field model for classification of multisource satellite imagery. IEEE Trans. Geosci. Remote Sens. 1996, 34, 100–113. [Google Scholar] [CrossRef]
Pacifici, F.; Del Frate, F.; Emery, W.J.; Gamba, P.; Chanussot, J. Urban mapping using coarse SAR and optical data: Outcome of the 2007 GRSS data fusion contest. IEEE Geosci. Remote Sens. Lett. 2008, 5, 331–335. [Google Scholar] [CrossRef]
Talukdar, S.; Singha, P.; Mahato, S.; Shahfahad; Pal, S.; Liou, Y.-A.; Rahman, A. Land-use land-cover classification by machine learning classifiers for satellite observations—A review. Remote Sens. 2020, 12, 1135. [Google Scholar] [CrossRef]
Casals-Carrasco, P.; Kubo, S.; Madhavan, B.B. Application of spectral mixture analysis for terrain evaluation studies. Int. J. Remote Sens. 2000, 21, 3039–3055. [Google Scholar] [CrossRef]
Pu, R.; Landry, S. A comparative analysis of high spatial resolution IKONOS and WorldView-2 imagery for mapping urban tree species. Remote Sens. Environ. 2012, 124, 516–533. [Google Scholar] [CrossRef]
Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 2020, 237, 111322. [Google Scholar] [CrossRef]
Dickenson, M.; Gueguen, L. Rotated rectangles for symbolized building footprint extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 225–228. [Google Scholar]
Kuo, T.-S.; Tseng, K.-S.; Yan, J.-W.; Liu, Y.-C.; Frank Wang, Y.-C. Deep aggregation net for land cover classification. In Pro-ceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. Salt Lake City, UT, USA, 18–23 June 2018; pp. 252–256. [Google Scholar]
Aich, S.; van der Kamp, W.; Stavness, I. Semantic binary segmentation using convolutional networks without decoders. In Proceedings of the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 197–201. [Google Scholar]
Dong, S.; Zhuang, Y.; Yang, Z.; Pang, L.; Chen, H.; Long, T. Land cover classification from VHR optical remote sensing images by feature ensemble deep learning network. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1396–1400. [Google Scholar] [CrossRef]
Liu, Y.; Fan, B.; Wang, L.; Bai, J.; Xiang, S.; Pan, C. Semantic labeling in very high resolution images via a self-cascaded con-volutional neural network. ISPRS J. Photogramm. Remote Sens. 2018, 145, 78–95. [Google Scholar] [CrossRef]
Sellami, A.; Tabbone, S. Deep neural networks-based relevant latent representation learning for hyperspectral image classi-fication. Pattern Recognit. 2022, 121, 108224. [Google Scholar] [CrossRef]
Kang, W.; Xiang, Y.; Wang, F.; Wan, L.; You, H. Flood detection in Gaofen-3 SAR images via fully convolutional networks. Sensors 2018, 18, 2915. [Google Scholar] [CrossRef]
Ding, L.; Zheng, K.; Lin, D.; Chen, Y.; Liu, B.; Li, J.; Bruzzone, L. MP-ResNet: Multipath residual network for the semantic segmentation of high-resolution PolSAR images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, f. Deep learning and process un-derstanding for data-driven Earth system science. Nature 2019, 566, 195–204. [Google Scholar] [CrossRef]
Paisitkriangkrai, S.; Sherrah, J.; Janney, P.; Hengel, V.-D. Effective semantic pixel labelling with convolutional networks and conditional random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 May 2015; pp. 36–43. [Google Scholar]
Paisitkriangkrai, S.; Sherrah, J.; Janney, P.; Van Den Hengel, A. Semantic labeling of aerial and satellite imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2868–2881. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer International Publishing: Munich, Germany, 2015; pp. 234–241. [Google Scholar]
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep con-volutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
Yang, X.; Li, S.; Chen, Z.; Chanussot, J.; Jia, X.; Zhang, B.; Li, B.; Chen, P. An attention-fused network for semantic segmentation of very-high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 177, 238–262. [Google Scholar] [CrossRef]
Zhang, X.; Han, L.; Han, L.; Zhu, L. How well do deep learning-based methods for land cover classification and object detection perform on high resolution remote sensing imagery. Remote Sens. 2020, 12, 417. [Google Scholar] [CrossRef]
Gao, T.; Chen, H.; Chen, W. Adaptive heterogeneous support tensor machine: An extended STM for object recognition using an arbitrary combination of multisource heterogeneous remote sensing data. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–22. [Google Scholar] [CrossRef]
Zheng, X.; Huan, L.; Xia, G.-S.; Gong, J. Parsing very high resolution urban scene images by learning deep ConvNets with edge-aware loss. ISPRS J. Photogramm. Remote Sens. 2020, 170, 15–28. [Google Scholar] [CrossRef]
Hu, F.; Xia, G.-S.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707. [Google Scholar] [CrossRef]
Hong, D.; Chanussot, J.; Yokoya, N.; Kang, J.; Zhu, X.X. Learning-shared cross-modality representation using multispectral-LiDAR and hyperspectral data. IEEE Geoscience and Remote Sensing Letters 2020, 17, 1470–1474. [Google Scholar] [CrossRef]
Gao, T.; Chen, H. Multicycle disassembly-based decomposition algorithm to train multiclass support vector machines. Pattern Recognit. 2023, 140, 109479. [Google Scholar] [CrossRef]
Jiang, L.; Liao, M.; Lin, H.; Yang, L. Synergistic use of optical and InSAR data for urban impervious surface mapping: A case study in Hong Kong. Int. J. Remote Sens. 2009, 30, 2781–2796. [Google Scholar] [CrossRef]
Zhang, Y.; Zhang, H.; Lin, H. Improving the impervious surface estimation with combined use of optical and SAR remote sensing images. Remote Sens. Environ. 2014, 141, 155–167. [Google Scholar] [CrossRef]
Gunatilaka, A.H.; Baertlein, B.A. Feature-level and decision-level fusion of noncoincidently sampled sensors for land mine detection. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 577–589. [Google Scholar] [CrossRef]
Liao, W.; Bellens, R.; Pizurica, A.; Gautama, S.; Philips, W. Combining feature fusion and decision fusion for classification of hyperspectral and LiDAR data. In Proceedings of the 2014 IEEE Geoscience and Remote Sensing Symposium, Quebec City, QC, Canada, 13–18 July 2014; pp. 1241–1244. [Google Scholar] [CrossRef]
Huang, X.; Zhang, L. An SVM ensemble approach combining spectral, structural, and semantic features for the classification of high-resolution remotely sensed imagery. IEEE Trans. Geosci. Remote Sens. 2012, 51, 257–272. [Google Scholar] [CrossRef]
Li, W.; Du, Q. Gabor-filtering-based nearest regularized subspace for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 1012–1022. [Google Scholar] [CrossRef]
Kang, W.; Xiang, Y.; Wang, F.; You, H. CFNet: A cross fusion network for joint land cover classification using optical and SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 1562–1574. [Google Scholar] [CrossRef]
Yurtkulu, S.C.; Şahin, Y.H.; Unal, G. Semantic segmentation with extended DeepLabv3 architecture. In Proceedings of the 2019 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 24–26 April 2019; pp. 1–4. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
Olofsson, P.; Foody, G.M.; Herold, M.; Stehman, S.V.; Woodcock, C.E.; Wulder, M.A. Good practices for estimating area and assessing accuracy of land change. Remote Sens. Environ. 2014, 148, 42–57. [Google Scholar] [CrossRef]
Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmen-tation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
Baheti, B.; Innani, S.; Gajre, S.; Talbar, S. Semantic scene segmentation in unstructured environment with modified DeepLabV3+. Pattern Recognit. Lett. 2020, 138, 223–229. [Google Scholar] [CrossRef]
Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.-W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 4–9 May 2020; pp. 1055–1059. [Google Scholar]
Zhang, X.; Yang, P.; Wang, Y.; Shen, W.; Yang, J.; Ye, K.; Zhou, M.; Sun, H. LBF-based CS Algorithm for Multireceiver SAS. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1502505. [Google Scholar] [CrossRef]
Yang, P. An imaging algorithm for high-resolution imaging sonar system. Multimed. Tools Appl. 2023, 83, 31957–31973. [Google Scholar] [CrossRef]
Grządziel, A. The Impact of Side-Scan Sonar Resolution and Acoustic Shadow Phenomenon on the Quality of Sonar Imagery and Data Interpretation Capabilities. Remote Sens. 2023, 15, 5599. [Google Scholar] [CrossRef]

Figure 1. (a) A pair of SAR and optical remote sensing images in cloudy weather; (b) A pair of SAR and optical remote sensing images containing the land objects of houses and roads.

Figure 2. Architecture of OPT-SAR-MS²Net.

Figure 3. Multi-module feature extraction module OPT-SAR-MFF.

Figure 4. Multi-scale information perception module OPT-SAR-MIP.

Figure 5. First part of land object classification results in the WHU-OPT-SAR dataset.

Figure 6. Second part of land object classification results in the WHU-OPT-SAR dataset.

Figure 7. Third part of object classification results in the WHU-OPT-SAR dataset.

Figure 8. Land object classification results in the dataset add Gaussian noise.

Table 1. Quantitative analysis of the WHU-OPT-SAR dataset.

Category	Pixel Count per Category	Penalty Weight	Proportion of Pixels (%)
farmlands	664,037,272	0.015	34.5
forests	90,315,847	0.014	37.7
cities	112,245,660	0.113	4.6
villages	273,311,896	0.090	5.8
waters	725,600,699	0.037	14.2
roads	18,457,677	0.551	1.0
others	32,882,578	0.310	1.7
background	9,906,771	1.0	0.5

Table 2. Number of divisions of the dataset.

Public Datasets	Training Datasets	Validation Datasets	Test Datasets
Number of images	17,660	5870	5870

Table 3. Results of the quantitative analysis of the dataset for its accuracy, mIoU, Kappa and OA.

Methods	OA (%)	mIoU (%)	Kappa (%)	Accuracy of Each Class (%)
Methods	OA (%)	mIoU (%)	Kappa (%)	Farmlands	Cities	Villages	Waters	Forests	Roads	Others
SegNet	0.757	0.374	0.669	0.765	0.428	0.451	0.684	0.969	0.428	0.140
DeeplabV3+	0.809	0.412	0.726	0.795	0.658	0.393	0.752	0.942	0.790	0.127
Non-local	0.724	0.305	0.601	0.714	0.401	0.399	0.576	0.805	0.366	0.108
SA-Gate	0.733	0.312	0.611	0.722	0.410	0.421	0.616	0.878	0.395	0.137
U-Net 3+	0.785	0.385	0.683	0.695	0.567	0.374	0.674	0.804	0.519	0.220
MCANet	0.817	0.429	0.735	0.797	0.588	0.497	0.786	0.958	0.352	0.272
Ours	0.843	0.452	0.720	0.723	0.537	0.759	0.796	0.922	0.868	0.285

Table 4. Quantitative analysis of ablation experiments for data sources in networks.

OPT-SAR-MS²Net	OA	mIoU	Kappa	Accuracy of Each Class
OPT-SAR-MS²Net	OA	mIoU	Kappa	Farmlands	Cities	Villages	Waters	Forests	Roads	Others
Optical input	0.768	0.312	0.593	0.648	0.488	0.598	0.535	0.926	0.742	0.107
SAR input	0.774	0.320	0.599	0.709	0.440	0.674	0.697	0.901	0.793	0.277
Optical+SAR	0.843	0.452	0.720	0.723	0.537	0.759	0.796	0.922	0.868	0.285

Table 5. Quantitative analysis of ablation experiments for modules in networks.

ResNet50	ResNet101	OPT-SAR-MFF	OPT-SAR-MIP	OA	mIou	Kappa
√		√		0.793	0.355	0.649
	√	√		0.816	0.394	0.676
√			√	0.825	0.409	0.693
	√		√	0.827	0.416	0.701
√		√	√	0.835	0.400	0.702
	√	√	√	0.836	0.424	0.709

Table 6. Quantitative analysis of penalty weight ablation experiments for loss functions in networks.

Penalty Weight	OA	mIou	Kappa
With	0.843	0.452	0.720
Without	0.818	0.300	0.661

Table 7. Qualitative analysis of model robustness.

Gaussian Noise	OA	mIou	Kappa
Without	0.8440	0.4542	0.7213
With	0.8439	0.4542	0.7212

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hu, W.; Wang, X.; Zhan, F.; Cao, L.; Liu, Y.; Yang, W.; Ji, M.; Meng, L.; Guo, P.; Yang, Z.; et al. OPT-SAR-MS²Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images. Remote Sens. 2024, 16, 1850. https://doi.org/10.3390/rs16111850

AMA Style

Hu W, Wang X, Zhan F, Cao L, Liu Y, Yang W, Ji M, Meng L, Guo P, Yang Z, et al. OPT-SAR-MS²Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images. Remote Sensing. 2024; 16(11):1850. https://doi.org/10.3390/rs16111850

Chicago/Turabian Style

Hu, Wei, Xinhui Wang, Feng Zhan, Lu Cao, Yong Liu, Weili Yang, Mingjiang Ji, Ling Meng, Pengyu Guo, Zhi Yang, and et al. 2024. "OPT-SAR-MS²Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images" Remote Sensing 16, no. 11: 1850. https://doi.org/10.3390/rs16111850

APA Style

Hu, W., Wang, X., Zhan, F., Cao, L., Liu, Y., Yang, W., Ji, M., Meng, L., Guo, P., Yang, Z., & Liu, Y. (2024). OPT-SAR-MS²Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images. Remote Sensing, 16(11), 1850. https://doi.org/10.3390/rs16111850

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

OPT-SAR-MS²Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images

Abstract

1. Introduction

2. Related Work

2.1. Land Object Classification of Single-Source Remote Sensing Imagery

2.2. Land Object Classification of Multi-Source Remote Sensing Imagery

3. Method

3.1. OPT-SAR-MS²Net Encoder Part

3.1.1. Dual Branch Siamese Multi-Source Feature Extraction Architecture

3.1.2. Multi-Source Feature Fusion Module OPT-SAR-MFF

3.1.3. Multi-Scale Information Perception Module OPT-SAR-MIP

3.2. OPT-SAR-MS²Net Decoder Part

3.3. Loss Function

4. Experiment and Analysis

4.1. Experimental Settings

4.1.1. Description of the Dataset

4.1.2. Experimental Configuration and Evaluation Criteria

4.2. Experimentation on the Dataset

4.3. Ablation Experiments

4.3.1. Effectiveness of Each Source of Data

4.3.2. Effectiveness of Each Module

4.3.3. Effectiveness of Loss with Penalty Weight

4.3.4. Robustness of the Model

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

OPT-SAR-MS2Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images

Abstract

1. Introduction

2. Related Work

2.1. Land Object Classification of Single-Source Remote Sensing Imagery

2.2. Land Object Classification of Multi-Source Remote Sensing Imagery

3. Method

3.1. OPT-SAR-MS2Net Encoder Part

3.1.1. Dual Branch Siamese Multi-Source Feature Extraction Architecture

3.1.2. Multi-Source Feature Fusion Module OPT-SAR-MFF

3.1.3. Multi-Scale Information Perception Module OPT-SAR-MIP

3.2. OPT-SAR-MS2Net Decoder Part

3.3. Loss Function

4. Experiment and Analysis

4.1. Experimental Settings

4.1.1. Description of the Dataset

4.1.2. Experimental Configuration and Evaluation Criteria

4.2. Experimentation on the Dataset

4.3. Ablation Experiments

4.3.1. Effectiveness of Each Source of Data

4.3.2. Effectiveness of Each Module

4.3.3. Effectiveness of Loss with Penalty Weight

4.3.4. Robustness of the Model

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

OPT-SAR-MS²Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images

3.1. OPT-SAR-MS²Net Encoder Part

3.2. OPT-SAR-MS²Net Decoder Part