Article

MDSCNN: Remote Sensing Image Spatial–Spectral Fusion Method via Multi-Scale Dual-Stream Convolutional Neural Network

1 School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
2 Shaanxi Key Laboratory of Complex System Control and Intelligent Information Processing, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(19), 3583; https://doi.org/10.3390/rs16193583
Submission received: 16 August 2024 / Revised: 12 September 2024 / Accepted: 23 September 2024 / Published: 26 September 2024

Abstract

Pansharpening refers to enhancing the spatial resolution of multispectral images using panchromatic images while preserving their spectral features. However, existing traditional and deep learning methods often introduce distortions in the spatial or spectral dimensions. This paper proposes a remote sensing spatial–spectral fusion method based on a multi-scale dual-stream convolutional neural network, which includes feature extraction, feature fusion, and image reconstruction modules at each scale. For feature fusion, we propose a multi-level cascade module to better fuse image features. We also design a new loss function aimed at enforcing consistency between the fused and reference images in both spatial detail and spectral information. To validate its effectiveness, we conduct thorough experimental analyses on two widely used remote sensing datasets: GeoEye-1 and Ikonos. Compared with nine leading pansharpening techniques, the proposed method demonstrates superior performance on multiple key evaluation metrics.

1. Introduction

Remote sensing imaging technology serves as a vital means of acquiring information about objects on the Earth. It achieves non-contact, long-distance observation of the Earth’s surface, significantly enriching our understanding of the planet. Optical satellites can provide multi-spectral (MS) images with multiple spectral bands and panchromatic (PAN) images with only a single band. However, due to current technological constraints, ordinary satellite sensors still encounter significant barriers when attempting to capture high-resolution multi-spectral (HRMS) images. Therefore, it is necessary to combine the complementary information of the two types of images to obtain HRMS images that simultaneously satisfy both spectral and spatial requirements. Pansharpening refers to the use of the spatial information contained in PAN images to sharpen MS images and generate the desired HRMS images [1]. The HRMS images produced through fusion show extensive application prospects across multiple domains, including object detection and classification, Earth observation, environmental monitoring, and agriculture [2,3,4,5,6].
Over the past few decades, numerous scholars have proposed a variety of efficient pansharpening techniques. According to their underlying theories, these pansharpening methods can be divided into four categories [7]: the component substitution (CS) method, the multi-resolution analysis (MRA) method, the model-based method, and the deep learning-based method. The CS-based method [8,9] first projects the upsampled low-resolution multispectral (LRMS) image onto a new space through a specific spatial transformation. In this space, spatial information and spectral information are decomposed into different components. Then, the original spatial component is replaced using the histogram-matched PAN image. Finally, the fused image is obtained through inverse transformation. This method is simple, fast, and efficient, but it is prone to spectral distortion. In the CS framework, the choice of transformation and the injection gain are two crucial factors that determine the quality of the fusion performance. Representative CS methods include the intensity–hue–saturation method [10], the principal component analysis method [11], the Brovey transformation method [12], and the Gram–Schmidt transformation [13]. A typical algorithm based on MRA extracts high-frequency detail information from PAN images and injects it into upsampled MS images to construct HRMS images. Although the fused images obtained by MRA can retain good spectral information, they are prone to spatial distortion. Typical MRA-based algorithms include the wavelet transform [14,15], the Laplacian pyramid [16], etc. Compared to the CS method, the multi-resolution analysis-based approach excels in preserving spectral information, but it may cause spatial distortion in local regions. The pansharpening methods based on CS and MRA both utilize certain transformations to infer the missing information in LRMS images from PAN images. However, due to the intricate coupling between spatial and spectral information, existing fusion methods encounter difficulties in completely eliminating the distortion in fused images while striving for high resolution and spectral fidelity. Subsequently, model-based methods have emerged, employing diverse optimization algorithms for their solutions. These methods incorporate various prior knowledge to constrain and regularize the solution space, with numerous proposed priors such as sparsity [17,18], gradient priors [19,20], and low-rank priors [21,22]. Based on these priors, a multitude of model-based remote sensing fusion methods have been proposed.
In the field of pansharpening, convolutional neural networks (CNNs) have demonstrated immense potential. Through their distinctive convolutional and pooling layers, CNN models are capable of efficiently extracting spatial hierarchical features and spectral information from images, offering robust technical support for pansharpening. Deep learning-based pansharpening methods primarily encompass three categories: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning methods rely on a large amount of labeled training data for model training. By continuously adjusting the model parameters, the model can deeply explore the various features of LRMS images and PAN images, thereby generating MS images with high spatial resolution. Although this approach has achieved remarkable results in practical applications, its demand for labeled data and computing resources is relatively high. Unsupervised learning methods do not rely on labeled data but instead learn by mining the inherent structure and patterns within the data. These methods often utilize the self-similarity of images or the correlation between images to achieve the pansharpening of LRMS images. While unsupervised learning methods are simpler in terms of data acquisition and model training, their performance is often limited by the complexity of the model and the quality of the data. Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data to improve learning efficiency and reduce cost, and it is particularly suitable for situations where labeled data are scarce, enhancing the generalization ability of the model. Such methods have also achieved certain results in the field of pansharpening, but their effectiveness in practical applications is often affected by the quality and quantity of the training data.
Some deep learning-based fusion methods tend to directly combine all input data without fully leveraging the unique advantages of the two image types, namely the richness of the spectral information and the high precision of the spatial resolution, resulting in a fusion process that overlooks the correlation between spectral and spatial information. Given these deficiencies, there is still significant potential for optimization in the field of pansharpening, particularly in terms of enhancing the spatial detail representation of images while ensuring the integrity of the spectral information.
In this paper, we innovatively propose a network model for pansharpening, which is a remote sensing image fusion network based on a multi-scale dual-stream convolutional neural network (MDSCNN). This network is capable of accurately extracting image features across multiple scales and effectively fusing these features to enrich image details, ultimately leading to a significant improvement in the quality of the fused results. The main contributions of this paper are summarized as follows.
  • We construct a multi-scale dual-stream network that can simultaneously extract rich image features from different scales. It can overcome the inherent limitations of single-scale fusion methods, better process the detailed information in images, and improve the accuracy and reliability of fusion results.
  • We design a multi-level fusion module that integrates the feature information from multiple layers of the network, promoting the transmission of information from lower layers to higher layers, thereby achieving a more comprehensive fusion of images.
The remaining parts of this paper are organized as follows. Section 2 introduces related work. Section 3 details the network structure of the proposed method. Section 4 details the experiments and performance evaluations that were conducted on the proposed method. Section 5 outlines the ablation experiments that were performed, and Section 6 summarizes the paper.

2. Related Work

2.1. Traditional Methods

The fundamental principle of CS technology lies in its ability to successfully separate the spatial details from spectral characteristics in LRMS images through the use of specific feature mapping or spatial transformation techniques. The general model of the CS method is as follows:
\hat{M}_b = \tilde{M}_b + g_b (P - I), \quad (1)
where b is the band index of the MS image, $\tilde{M}_b$ is the b-th band of the MS image upsampled to the same size as the PAN image, P is the PAN image after histogram matching with the intensity component I, $g_b$ is the detail injection coefficient, and $\hat{M}_b$ is the b-th band of the fused image.
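As an illustration, the following minimal Python sketch implements the generic CS injection of Equation (1). The choice of the intensity component (a plain band average here) and the unit injection gains are assumptions made for the sketch, since the actual components depend on the specific CS method.

```python
import numpy as np

def cs_fusion(ms_up, pan, gains=None):
    """Generic component-substitution fusion following the model above.

    ms_up : upsampled MS image, shape (bands, H, W), already at the PAN size
    pan   : histogram-matched PAN image, shape (H, W)
    gains : per-band injection coefficients g_b (unit gains by default)
    """
    bands = ms_up.shape[0]
    gains = np.ones(bands) if gains is None else np.asarray(gains)
    intensity = ms_up.mean(axis=0)                # I: simple band average (assumed)
    detail = pan - intensity                      # P - I
    return ms_up + gains[:, None, None] * detail  # M_hat_b = M_tilde_b + g_b (P - I)
```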
The basic idea of remote sensing image fusion algorithms utilizing multi-resolution analysis involves dividing PAN images and LRMS images into low-frequency and high-frequency sub-bands at various scales through decomposition. Following this, appropriate fusion rules are then selected, taking into account the characteristics of each sub-band, to effectively integrate the high-frequency detail information from the PAN image with the spectral information from the LRMS image. Lastly, the fused image is reconstructed through an inverse transformation process, resulting in an output that combines the best features of both input images.
Model-based methods have significant advantages in the field of image fusion as they utilize dedicated optimization algorithms. Specifically, the relations among the LRMS, PAN, and HRMS images can be defined by Equations (2) and (3), respectively.
X = DBZ + n, \quad (2)
Y = SZ + n, \quad (3)
where X, Y, and Z are the LRMS image, the PAN image, and the HRMS image, respectively; D and B represent the spatial downsampling matrix and the blurring matrix, respectively; S represents the spectral downsampling matrix; and n represents the additive noise. By introducing various priors to constrain the solution space, the optimization model can be expressed as Equation (4):
\min_{Z} \ \frac{1}{2}\|X - DBZ\|_F^2 + \frac{\lambda}{2}\|Y - SZ\|_F^2 + \alpha R(Z), \quad (4)
where $R(Z)$ represents the regularization term, and λ and α represent the adjustable parameters.
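The sketch below illustrates one way such a model can be solved numerically, by minimizing Equation (4) with gradient descent and automatic differentiation. The average-pooling proxy for DB, the flat spectral response standing in for S, and the Tikhonov regularizer standing in for R(Z) are all simplifying assumptions of this sketch, not the priors used in the cited works.

```python
import torch
import torch.nn.functional as F

def spatial_degrade(z, ratio=4):
    # D B Z: blur + downsample, approximated here by average pooling.
    return F.avg_pool2d(z, ratio)

def spectral_degrade(z, srf):
    # S Z: mix the MS bands into a single PAN-like band with response srf.
    return torch.einsum('bchw,c->bhw', z, srf).unsqueeze(1)

def solve_variational(x_lrms, y_pan, ratio=4, lam=1.0, alpha=1e-3,
                      iters=500, lr=0.1):
    """Minimize 0.5||X-DBZ||^2 + (lam/2)||Y-SZ||^2 + alpha*R(Z) by gradient descent.
    x_lrms: (B, C, h, w) LRMS image; y_pan: (B, 1, h*ratio, w*ratio) PAN image."""
    c = x_lrms.shape[1]
    z = F.interpolate(x_lrms, scale_factor=ratio, mode='bicubic',
                      align_corners=False).clone().requires_grad_(True)
    srf = torch.full((c,), 1.0 / c, device=x_lrms.device)  # assumed flat response
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        fid_ms = 0.5 * (spatial_degrade(z, ratio) - x_lrms).pow(2).sum()
        fid_pan = 0.5 * lam * (spectral_degrade(z, srf) - y_pan).pow(2).sum()
        reg = alpha * z.pow(2).sum()   # Tikhonov stand-in for R(Z)
        (fid_ms + fid_pan + reg).backward()
        opt.step()
    return z.detach()
```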

2.2. Deep Learning-Based Pansharpening Methods

Deep learning technology has made significant achievements in the field of remote sensing image fusion due to its powerful feature representation ability. Dong et al. [23] constructed a super-resolution reconstruction network called SRCNN, which utilizes a three-layer convolutional neural network to directly learn and reconstruct high-resolution images from low-resolution images. It is simple and efficient, providing new ideas for the field of image super-resolution. Masi et al. [24] first proposed using a CNN for pansharpening. Based on the SRCNN model, they constructed a three-layer convolutional neural network architecture that upsamples LRMS images and stacks them with PAN images as the input. Through feature extraction, nonlinear mapping, and reconstruction steps, they generated HRMS images with both high spatial and high spectral resolution. Yang et al. [25] proposed the PanNet network for pansharpening. They utilized convolutional neural networks combined with spectral and spatial information preservation mechanisms, and they achieved high-resolution multispectral image reconstruction through upsampling and spectral mapping techniques. The dual-stream fusion network (TFNet) [26] proposed by Liu et al. employs a dual parallel structure, where one path extracts features from LRMS images and the other path processes PAN images to capture spatial details. He et al. [27] proposed a new detail injection-based method called DiCNN, which extracts the high-frequency details from PAN images through a CNN and injects them into LRMS images to enhance the spectral and spatial resolution. Zhang et al. [28] used the spatial–spectral dual back-projection method for image fusion, which aims to effectively fuse PAN images and LRMS images in the feature domain or pixel domain through the dual back-projection mechanism. Zhang et al. [29] proposed a new cross-interaction kernel attention network (CIKANet). This network architecture consists of two main branches: one focuses on extracting spectral information, and the other focuses on capturing spatial information. Through cross-interaction kernel attention structures, the complementary fusion information between the spatial domain and the spectral domain is enhanced. Bandara et al. [30] introduced a new attention mechanism that spans multiple scales, further enhancing fusion performance. Zhang et al. [31] proposed a progressive pansharpening network based on deep spectral transformation for image fusion. This method also introduced a trained module that uses PAN image information at different resolutions for fusion control. Shang et al. [32] introduced an innovative generative adversarial network consisting of two parts: the former extracts information from MS and PAN images, while the latter performs multi-scale interactive fusion.

3. Proposed Method

3.1. Motivation

This paper delves into an innovative fusion strategy that elegantly combines a multi-scale framework with a dual-stream network. The objective is to deeply explore and fully utilize the rich and multi-layered feature information within PAN images and LRMS images.
First, recognizing that single-scale fusion methods tend to lose crucial details and features during image processing, thereby compromising the overall quality of the fused image, we devised a multi-scale framework. This framework not only pays close attention to hierarchical details, but also emphasizes the synergistic effect of information across different resolutions. Second, we designed an efficient and precise fusion module. Each layer within this module is capable of fusing different abstract features from both the PAN and LRMS images at a more refined level of detail. Finally, to further optimize the performance of the model, we adopted a composite loss function that combines pixel-level and gradient-level information. This method aims to ensure that, while enhancing the spectral fidelity of the image, it can also better reflect the clarity of the details in the image.

3.2. Network Architecture

The proposed multi-scale dual-stream fusion network framework is shown in Figure 1. The entire fusion network comprises three parts: feature extraction, feature fusion, and image restoration. We first downsampled the PAN image using bicubic interpolation to obtain a spatially degraded PAN image. Then, we downsampled the original LRMS image using bicubic interpolation to obtain a spatially degraded LRMS image. Finally, the degraded LRMS image was upsampled to match the size of the degraded PAN image.
Based on the multi-scale approach, we performed two downsampling operations on the PAN and LRMS images, resulting in three levels of fused images. From bottom to top, we refer to the three scales as the coarse scale, medium scale, and fine scale. The PAN and LRMS images at all three scales pass through the fusion module to obtain fused features. Finally, the decoder is used to obtain the final HRMS image. We first concatenated the coarse-scale fused image with the medium-scale MS image, and then we concatenated the medium-scale fused image with the fine-scale MS image, thereby constructing a multi-scale network system that enables effective information transfer and fusion across different scales.

3.3. Feature Extraction

For feature extraction, we utilized a sub-network comprising two sequentially arranged convolutional layers to extract features from the upsampled LRMS and PAN images. The sub-network structures corresponding to the PAN and LRMS images are the same, but their weights are not shared. The PAN input is a single band, while the LRMS input has four bands. The first convolutional layer has a stride of 1, and the second convolutional layer uses a stride of 2 for downsampling. Each convolutional layer is followed by a PReLU activation function. The feature map representations of PAN and LRMS are described in Equations (5) and (6).
F_P^i = f_P^i(P^i), \quad i = 1, 2, \quad (5)
F_M^i = f_M^i(M^i), \quad i = 1, 2, \quad (6)
where $F_P^i$ and $F_M^i$ denote the output feature maps of the i-th convolutional layer of the PAN and LRMS branches, respectively.
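A minimal PyTorch sketch of one extraction branch is given below. The 3 × 3 kernel size and the 64-channel width are assumptions of this sketch, since only the strides and the activation are specified above.

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """One extraction branch: a stride-1 conv and a stride-2 conv, each followed
    by a PReLU activation. in_ch = 1 for the PAN branch, 4 for the LRMS branch."""
    def __init__(self, in_ch, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, width, kernel_size=3, stride=1, padding=1),
            nn.PReLU(),
            nn.Conv2d(width, width, kernel_size=3, stride=2, padding=1),
            nn.PReLU(),
        )

    def forward(self, x):
        return self.net(x)

# The two branches share the same structure but not their weights:
# pan_branch, ms_branch = FeatureExtractor(1), FeatureExtractor(4)
```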

3.4. Feature Fusion

Inspired by [33], we propose a global cascaded module (GCM) to implement feature fusion. Figure 1 graphically illustrates how the global cascading occurs. The outputs of the intermediate features are cascaded into higher blocks and ultimately converge into a single 1 × 1 convolutional layer.
Our global cascaded module (GCM) primarily consists of local cascaded modules (LCMs). Both adopt the same cascaded architecture: the LCMs are formed by cascading residual modules, and the GCM is composed of cascaded LCMs. Figure 2 graphically depicts how the cascading occurs. The LCMs are constructed by alternately stacking three residual modules and 1 × 1 convolutional layers, where each 1 × 1 convolutional layer receives not only the outputs of all the previous residual modules, but also the original input features. Similarly, the GCM is built by alternately stacking three LCMs and 1 × 1 convolutional layers, with the internal cascading pattern being identical to that of the LCMs. Our residual module consists of two consecutive 3 × 3 grouped convolutions and a pointwise convolution. Grouped convolutions are used to reduce the number of parameters, greatly improving the computational efficiency and scalability of the model.
First, for the local cascaded module, we define the output of the i-th residual module as $r_i(f_{i-1}, w_i)$, where $f_{i-1}$ denotes the input to the i-th residual module, $w_i$ represents the feature parameters of the i-th residual module, $a_i$ denotes the output of the i-th residual module within the local cascaded module, and $b_i$ denotes the output of the i-th 1 × 1 convolutional layer within the LCM. The expressions can be defined as
a_i = r_i(f_{i-1}, w_i),
b_i = g([a_1, \ldots, a_i, f_0]),
where g represents the 1 × 1 convolution, $f_0$ is the original input, and the value of i ranges from 1 to 3.
Similarly, for the global cascaded module, we adopted a consistent approach to describe it. We define the output of the i-th local cascaded module as $R_i(F_{i-1}, W_i)$, where $F_{i-1}$ denotes the input to the i-th local cascaded module, $W_i$ represents the feature parameters of the i-th local cascaded module, $A_i$ denotes the output of the i-th local cascaded module within the global cascaded module, and $B_i$ denotes the output of the i-th 1 × 1 convolutional layer within the GCM. The expressions can be defined as
A_i = R_i(F_{i-1}, W_i),
B_i = g([A_1, \ldots, A_i, F_0]),
where g represents the 1 × 1 convolution, $F_0$ is the original input, and the value of i ranges from 1 to 3.
The model learns deep multi-level feature representations by integrating features from multiple layers. In this process, the multi-scale cascade connection, as a crucial form of multi-level shortcut connection, leverages its unique structural advantage to let information be transmitted effortlessly and efficiently from the lower layers to the higher layers, ensuring both the continuity and the integrity of the information. This design not only enhances the model’s comprehensive understanding and analysis capabilities, but also boosts its performance and efficiency in handling complex tasks involving multi-level features.
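The following PyTorch sketch illustrates the cascading pattern shared by the LCM and GCM, with channel widths taken from Table 3. The group count of the grouped convolutions, the PReLU placement, and the skip connection inside the residual module are assumptions, and the final skip addition that Table 3 lists for the GCM is omitted for brevity.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual module: two 3x3 grouped convolutions and a pointwise convolution."""
    def __init__(self, ch=128, groups=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=groups), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1, groups=groups), nn.PReLU(),
            nn.Conv2d(ch, ch, 1),
        )

    def forward(self, x):
        return x + self.body(x)

class CascadeModule(nn.Module):
    """Cascading pattern shared by the LCM and GCM: each 1x1 convolution fuses
    the outputs of all previous blocks together with the original input."""
    def __init__(self, block_fn, ch=128, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([block_fn() for _ in range(n_blocks)])
        # Fusion input channels grow as 2*ch, 3*ch, 4*ch (256, 384, 512 for ch=128).
        self.fuse = nn.ModuleList(
            [nn.Conv2d(ch * (i + 2), ch, 1) for i in range(n_blocks)])

    def forward(self, x):
        feats, out = [x], x
        for block, fuse in zip(self.blocks, self.fuse):
            feats.append(block(out))             # a_i = r_i(f_{i-1})
            out = fuse(torch.cat(feats, dim=1))  # b_i = g([f_0, a_1, ..., a_i])
        return out

def make_lcm():
    return CascadeModule(lambda: ResBlock(128))

def make_gcm():
    return CascadeModule(make_lcm)
```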

3.5. Image Restoration

For the image restoration module, inspired by TFnet [26], we used three sequentially arranged convolutional layers to reconstruct the HRMS images. The module consists of a deconvolutional layer with a kernel size of 2 × 2 and two convolutional layers with a kernel size of 3 × 3. The first layer is the deconvolution, which uses a stride of 2 to upsample the feature maps back to the original size; it is followed by the two regular 3 × 3 convolutional layers. In this design, each convolutional layer is immediately followed by an activation function, which is aimed at enhancing the capability of the network to handle nonlinearities. Specifically, we employed a PReLU activation after the first layer, with the purpose of further boosting the efficiency of the network in feature extraction. For the second layer, we opted for a Tanh activation, which ensures smoothness in the output images and stabilizes gradients during computation, effectively preventing issues such as gradient explosion or vanishing. The module is illustrated in Figure 3. Each convolutional layer is preceded by a residual module; this residual module, shown in Figure 3b, contains two convolution kernels of size 3 × 3.
After completing the image fusion at the coarse and medium scales, we performed upsampling on the fused images. Following that, we connected these upsampled images with their corresponding LRMS images from the previous scale, thereby establishing relationships between images of different scales.
Lastly, at the fine scale, the LRMS image of that scale was added to the fused output image of the same scale through skip connections to generate the final HRMS image.
\tilde{M} = F_1 + M_L,
where $\tilde{M}$ represents the fused HRMS image, $F_1$ represents the output of the fine-scale image restoration network, and $M_L$ is the LRMS image.
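A PyTorch sketch of the image restoration module is shown below. The padding choices and the exact placement of the residual modules and activations follow one plausible reading of the description above and should be treated as assumptions rather than the exact implementation.

```python
import torch.nn as nn

class PlainRes(nn.Module):
    """Residual module of the decoder (Figure 3b): two 3x3 convolutions with a skip."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.PReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class ImageRestoration(nn.Module):
    """Decoder: a 2x2 transposed conv (stride 2) recovers the original size,
    followed by two 3x3 convs; each layer is preceded by a residual module."""
    def __init__(self, in_ch=128, out_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            PlainRes(in_ch),
            nn.ConvTranspose2d(in_ch, in_ch, kernel_size=2, stride=2), nn.PReLU(),
            PlainRes(in_ch),
            nn.Conv2d(in_ch, 64, kernel_size=3, padding=1), nn.Tanh(),
            PlainRes(64),
            nn.Conv2d(64, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.net(x)
```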

3.6. Loss Function

Wald et al. [34] recognized that ideal HRMS reference images are not available in practice, leading them to propose an innovative training strategy. In this strategy, downscaled PAN images and low-resolution versions of the original LRMS images are selected as the input data for the network, while the original LRMS images serve as the reference standard during the training process. To optimize the training effect, the loss function adopted in this paper comprises two parts: gradient loss and pixel loss.
In this paper, we use the L1 norm of the difference between the gradients of the fused image and the input PAN image as the gradient loss function $L_{grad}$, which is defined as
L_{grad} = \|\nabla \tilde{M} - \nabla P\|_1,
where $\nabla P$ denotes the gradient of the PAN image, and $\nabla \tilde{M}$ denotes the gradient of the fused image.
We use the L1 norm of the difference between the fused image and the reference image as the pixel loss function $L_m$, which is defined as
L_m = \|\tilde{M} - \hat{M}\|_1,
where $\hat{M}$ represents the reference image.
The overall loss function in this paper is as follows:
L_i = L_{grad} + L_m.
Compared to the L2 norm, the L1 norm is less prone to becoming trapped in poor local minima, leading to a more stable training process. Given that our work involves generating HRMS images at three different scales, we designed corresponding loss functions for each scale to guide the training process. Consequently, the total loss function throughout the training process is the average of the individual loss functions at these three scales. The expression is as follows:
Loss = \frac{1}{3}\sum_{i=1}^{3} L_i,
where $L_i$ is the loss corresponding to each scale, with i taking the values 1, 2, and 3.
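A PyTorch sketch of the composite loss is given below. The finite-difference gradient operator and the band-averaging of the fused image before comparison with the single-band PAN gradient are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def gradients(img):
    """Horizontal and vertical finite-difference gradients of a (B, C, H, W) tensor."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy

def scale_loss(fused, ref, pan):
    """Per-scale loss L_i = L_grad + L_m."""
    l_m = F.l1_loss(fused, ref)  # pixel loss against the reference image
    # Gradient loss against the PAN image; the fused bands are averaged to
    # match PAN's single band (an assumption of this sketch).
    fdx, fdy = gradients(fused.mean(dim=1, keepdim=True))
    pdx, pdy = gradients(pan)
    l_grad = F.l1_loss(fdx, pdx) + F.l1_loss(fdy, pdy)
    return l_grad + l_m

def total_loss(fused_scales, ref_scales, pan_scales):
    """Average of the per-scale losses over the three scales."""
    losses = [scale_loss(f, r, p)
              for f, r, p in zip(fused_scales, ref_scales, pan_scales)]
    return sum(losses) / len(losses)
```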

4. Experiment

4.1. Dataset Introduction and Generation

We tested the model on datasets constructed from the Ikonos and GeoEye-1 satellite sensors. GeoEye-1 operates in a sun-synchronous orbit with an orbital height of 681 km, an inclination of 98 degrees, and an orbital period of 98 min; it captures 0.41 m-resolution PAN images and 1.64 m-resolution LRMS images. Ikonos is a commercial satellite capable of capturing 1 m-resolution PAN images and 4 m-resolution LRMS images. The characteristics of the two remote sensing satellites are shown in Table 1.
In GeoEye-1, the source PAN image had a size of 13,532 × 31,624, and the source LRMS image had the size of 3383 × 7906. For Ikonos, the size of the source PAN image was also 13,532 × 31,624, while the size of the source LRMS image was 3383 × 7906. For each type of remote sensing imagery, we first removed the black borders at the edges, and then we divided them into non-overlapping 200 × 200 LRMS image patches and 800 × 800 PAN image patches. Following this, we categorized these image patches into training samples and testing samples. When constructing the training set, we adhered to the Wald protocol to generate simulated datasets. Specifically, we applied modulation transfer function (MTF) filtering to each pair of original LRMS and PAN images to perform spatial degradation. The degraded data were then used as the input images for the network. In this manner, the original LRMS images served as reference images, enabling the supervision of the network model during training and the subsequent calculation of loss values. The dataset division is shown in Table 2. We cropped the input and reference images into small 32 × 32 image patches with a specified stride, with the aim of obtaining a training set that contains a larger volume of data.
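The sketch below illustrates how such reduced-resolution training pairs can be generated under the Wald protocol. A Gaussian blur is used here as a stand-in for the sensor-specific MTF filter, and the kernel size and standard deviation are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=9, sigma=2.0):
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-0.5 * (ax / sigma) ** 2)
    k = torch.outer(g, g)
    return k / k.sum()

def degrade(img, ratio=4, sigma=2.0):
    """Blur (Gaussian stand-in for the sensor MTF) and downsample by `ratio`."""
    c = img.shape[1]
    k = gaussian_kernel(sigma=sigma).to(img).expand(c, 1, -1, -1).contiguous()
    blurred = F.conv2d(img, k, padding=k.shape[-1] // 2, groups=c)
    return blurred[..., ::ratio, ::ratio]

def make_training_pair(lrms, pan, ratio=4):
    """Wald-protocol pair: the degraded LRMS and PAN become the network inputs,
    and the original LRMS serves as the reference."""
    return degrade(lrms, ratio), degrade(pan, ratio), lrms
```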

4.2. Implementation Details

This paper is based on the PyTorch framework, and training and testing were conducted in Python 3.8. In terms of hardware, an RTX 3060 GPU with 16 GB of memory was used. We set the batch size to 32. The initial learning rate was set to 0.001 and was reduced by 50% every 2000 epochs. The training time on the two training sets was approximately 20 h. Table 3 shows the specific parameters of some of the network modules.
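A minimal sketch of the corresponding training configuration is given below. The choice of the Adam optimizer is an assumption, since the optimizer is not specified above; the learning-rate schedule follows the description in the text.

```python
import torch

def build_optimizer(model):
    # Adam is assumed (the optimizer is not specified in the text).
    # Schedule per the text: start at 1e-3 and reduce by 50% every 2000 epochs;
    # training used a batch size of 32.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=0.5)
    return optimizer, scheduler
```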

4.3. Comparison Algorithms and Evaluation Metrics

To validate the effectiveness of our proposed strategy, this study rigorously followed the Wald protocol and selected six key evaluation metrics. These metrics encompass the Spectral Angle Mapper (SAM) [35], the Relative Dimensionless Global Error in Synthesis (ERGAS) [36], the Relative Average Spectral Error (RASE) [37], the Spatial Correlation Coefficient (SCC) [38], the Quality (Q) index [39], and the Structural Similarity (SSIM) index [40]. For the evaluation of performance at the full-resolution scale, SAM, the Quality with No Reference (QNR) index [41], the spectral distortion index $D_\lambda$, and the spatial distortion index $D_S$ were used to ensure a comprehensive and precise evaluation.
SAM measures the spectral similarity between the pansharpened image and the corresponding reference image. The smaller the value of SAM, the higher the spectral similarity between two images. It is defined by the following equation:
\mathrm{SAM}(R, F) = \cos^{-1}\!\left(\frac{R^{T} F}{\|R\|\,\|F\|}\right).
ERGAS can measure the overall spatial and spectral quality of fused images, which is defined as
\mathrm{ERGAS}(R, F) = 100 \, \frac{r_h}{r_l} \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{\mathrm{RMSE}(R_i, F_i)}{\mu(F_i)}\right)^{2}},
where $\mathrm{RMSE}(R_i, F_i)$ represents the root mean square error of the i-th band, and $r_h$ and $r_l$ are the spatial resolutions of the PAN and LRMS images, respectively. The ideal value of ERGAS is 0.
RASE reflects the global spectral quality of the fused image, which is defined as
\mathrm{RASE}(R, F) = \frac{100}{M}\sqrt{\frac{1}{N}\sum_{i=1}^{N}\mathrm{RMSE}^{2}(R_i, F_i)},
where M represents the average brightness of the N spectral bands of the reference image R. The ideal value of RASE is 0.
SCC measures the spatial detail similarity between the fused image and the reference image. Before calculating SCC, it is necessary to obtain high-frequency information from two images through high-pass filtering, and then the correlation coefficient between the high-frequency components of the two images is calculated. The closer the SCC value is to 1, the higher the spatial quality of the fused image.
Q can estimate the global spatial quality of the fused image. It measures the correlation, average brightness similarity, and contrast similarity between the fused image and the reference image. The closer the value of Q is to 1, the higher the quality of the fused image. The Q index is defined as
Q(R, F) = \frac{\sigma_{RF}}{\sigma_R \sigma_F} \cdot \frac{2\mu_R \mu_F}{\mu_R^{2} + \mu_F^{2}} \cdot \frac{2\sigma_R \sigma_F}{\sigma_R^{2} + \sigma_F^{2}},
where $\sigma_{RF}$ represents the covariance between the two images, and σ and μ denote the standard deviation and the mean of the respective images.
The SSIM metric globally evaluates the similarity between two images in terms of three aspects: brightness, contrast, and structure. The expression is as follows:
\mathrm{SSIM}(R, F) = \frac{(2\mu_R \mu_F + c_1)(2\sigma_{RF} + c_2)}{(\mu_R^{2} + \mu_F^{2} + c_1)(\sigma_R^{2} + \sigma_F^{2} + c_2)}.
$D_\lambda$ is used to determine the spectral similarity between the fused image and the observed LRMS image. A value closer to 0 indicates that the spectral information of the fused image is more similar to that of the original LRMS image. The $D_\lambda$ index is defined as
D_{\lambda} = \sqrt[p]{\frac{1}{N(N-1)}\sum_{l=1}^{N}\sum_{\substack{r=1 \\ r \neq l}}^{N}\left|Q(\hat{M}_l, \hat{M}_r) - Q(M_l, M_r)\right|^{p}},
where Q(·,·) calculates the Q value between two images, and the parameter p is a positive exponent that emphasizes the degree of spectral distortion, usually taken as p = 1.
$D_S$ is used to measure the spatial similarity between the fused image and the original PAN image. A value closer to 0 indicates that the spatial distortion of the fused image is smaller. $D_S$ is defined as
D_S = \sqrt[q]{\frac{1}{N}\sum_{l=1}^{N}\left|Q(\hat{M}_l, P) - Q(M_l, p)\right|^{q}},
where p represents a reduced-resolution PAN image with the same size as the original LRMS image, and q is a positive exponent that emphasizes the degree of spatial distortion in the fused image, usually taken as q = 1.
QNR jointly measures the degree of spectral distortion and spatial distortion of fused images using the two indexes $D_\lambda$ and $D_S$. The QNR is defined as
\mathrm{QNR} = (1 - D_{\lambda})^{\alpha}(1 - D_S)^{\beta},
where α and β are positive exponents that control the relative weights of the spectral and spatial distortion terms, usually taken as α = β = 1. The ideal value of QNR is 1.
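For illustration, the following NumPy sketch computes two of the reduced-resolution metrics, SAM and ERGAS, directly from their definitions above. The degree-based SAM output and the (H, W, B) array layout are conventions assumed here.

```python
import numpy as np

def sam(ref, fused, eps=1e-12):
    """Mean spectral angle (in degrees) between (H, W, B) reference and fused images."""
    num = (ref * fused).sum(axis=-1)
    den = np.linalg.norm(ref, axis=-1) * np.linalg.norm(fused, axis=-1) + eps
    angles = np.arccos(np.clip(num / den, -1.0, 1.0))
    return np.degrees(angles.mean())

def ergas(ref, fused, r_h=1.0, r_l=4.0):
    """ERGAS for (H, W, B) images; r_h and r_l are the PAN and MS spatial resolutions."""
    n_bands = ref.shape[-1]
    acc = 0.0
    for b in range(n_bands):
        rmse = np.sqrt(np.mean((ref[..., b] - fused[..., b]) ** 2))
        acc += (rmse / fused[..., b].mean()) ** 2
    return 100.0 * (r_h / r_l) * np.sqrt(acc / n_bands)
```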
This study selected nine algorithms for experimentation on the GeoEye-1 and Ikonos datasets, including four traditional methods and five deep learning methods. The traditional comparison algorithms are the generalized Laplacian pyramid with an MTF-matched filter (MTF-GLP) [42], the Gram–Schmidt mode 2 algorithm with a generalized Laplacian pyramid (GS2-GLP) [42], the robust band-dependent spatial-detail method (BDSD-PC) [43], and the additive wavelet luminance proportional method with haze correction (AWLP-H) [44]. The CNN-based methods included TFnet [26], PanNet [25], DiCNN1 [27], S2DBPN [28], and CIKANet [29].

4.4. Performance Evaluation on Remote Sensing Images at a Reduced Resolution

In this section, we will compare the performance of the different pansharpening methods on two simulated datasets. Figure 4 provides a detailed comparison of the different image fusion algorithms on the simulated GeoEye-1 dataset. Figure 5 shows the corresponding residual images. Through observation, we clearly noticed that the GS2-GLP algorithm exhibited significant spatial distortion in the fusion results, especially in the large grass area. Meanwhile, although the BDSD-PC, MTF-GLP, and AWLP-H fusion methods achieved certain effects, they also exhibited varying degrees of spatial distortion and spectral distortion when processing the grass area. Among them, the BDSD-PC algorithm showed significant differences in color and texture compared to the original image, making it particularly noticeable.
Compared to the traditional algorithms, the fused images based on the deep learning methods exhibited higher similarity with the reference image in terms of spectral fidelity and spatial details. These deep learning methods have learned to extract useful information from the LRMS images and the PAN images by training on a large amount of sample data, thus achieving higher-quality image fusion. However, while these methods obtain good fusion results in most cases, they may still suffer from spectral and spatial distortion. PanNet, TFnet, and S2DBPN showed spectral distortion and a lack of fineness in the fusion of the middle green lawn area, while the DiCNN1 and CIKANet methods retained relatively more spectral and spatial information, resulting in better fusion performance. In addition to the lawn area, the fusion effect of our method in the street and village areas was also closer to the actual scenes. Although the subtle differences in the fused images could not be clearly observed by the naked eye, it can be seen from the residual images in Figure 5 that our fused image contains less residual information and is closer to the reference image.
In order to further quantify and compare the performance of the different fusion methods, Table 4 lists the average evaluation metrics for the 25 test images subjected to the different fusion methods. We highlighted the best indicator values in bold and underlined the second-best values. As can be seen from Table 4, among the traditional algorithms, the AWLP-H algorithm achieved the best results. Compared to the traditional algorithms, the deep learning methods produced a better fusion performance. Specifically, TFnet, PanNet, and S2DBPN performed slightly worse, while DiCNN1 and CIKANet were highly competitive. Our method achieved the best results; except for a slightly poorer performance in the Q value, it excelled in all other metrics. The experimental indicators show that our method maintains the high spatial resolution of the image while preserving high-quality spectral information. This characteristic significantly improves the image fusion performance under complex backgrounds.
Figure 6 shows the image fusion results obtained by applying different algorithms to the simulated Ikonos dataset. Figure 7 shows the corresponding residual images. Firstly, both GS2-GLP and MTF-GLP exhibited spectral distortion on the lawn in the upper right corner. This distortion significantly reduced the clarity and recognition rate of the image. This was primarily due to the lack of capability in handling details, edges, and other information when processing high-resolution images. Although the BDSD-PC and AWLP-H algorithms had good spatial resolution, there was a certain degree of spectral distortion. The six CNN-based methods, including the proposed algorithm, were significantly better than the traditional methods in preserving spatial details. The distortion exhibited in the fused images was markedly reduced in comparison to the traditional methods. However, the fused images of the TFnet, PanNet, and S2DBPN algorithms exhibited color tone deviation at the intersection of villages and rivers. There was color distortion between the fusion results and the reference image. There were two main reasons for this: first, the ability of the model to process specific scene colors during the training phase was weak; second, it could not adapt to complex environments with varying lighting and reflection. By analyzing the residual images, it could be found that the proposed method in this study generated residual images that had less information, indicating that it maintained a higher fidelity to the spatial details and spectral features of the original image during the fusion process.
In order to quantify the performance of each algorithm on the Ikonos dataset, we also list the average evaluation metrics of the 24 sets of simulation experimental results, as shown in Table 5. Similarly, we highlighted the best results in bold and underlined the second-best results. Among the CNN-based algorithms, the proposed algorithm achieved the best results in terms of SAM, SCC, and SSIM, but it performed suboptimally in the ERGAS and RASE metrics. This indicates that the method proposed in this paper is better suited to the GeoEye-1 dataset. Overall, however, its performance on the simulated datasets surpassed that of the traditional methods and of the other deep learning algorithms.

4.5. Performance Evaluation on Remote Sensing Images at Full Resolution

In practical applications, full-resolution remote sensing images play a central role in the accuracy and reliability of experimental results. Therefore, we directly performed image fusion operations on the real test datasets and comprehensively assessed the performance of various fusion algorithms through visual inspection and analysis and quantitative indicators.
Figure 8 shows the real experimental results of our algorithm and other comparative algorithms on the Ikonos dataset. The small image in the green box was enlarged and placed in the lower right corner for clearer observation of its spatial and spectral information. As can be seen from Figure 8, the traditional algorithms exhibited significant deficiencies in detail processing. Specifically, the MTF-GLP algorithm handled boundaries particularly unclearly. However, the distortion issues of the AWLP-H algorithm were the most severe, primarily manifesting in color shifts and distortions in the image. In contrast, the six deep learning algorithms demonstrated a visually similar performance, which achieved significant improvements in detail and better fusion results. We were able to observe the roof of the red house and found that our proposed method had a more delicate fusion effect. It retained excellent spatial details while effectively restoring the original spectrum of the LRMS.
In order to quantify the performance of the various algorithms on the Ikonos dataset, Table 6 lists the average evaluation metrics of the 24 sets of real experimental results. TFnet and CIKANet both demonstrated good performance, with TFnet ranking second in the SAM metric and CIKANet ranking second in the QNR and $D_\lambda$ metrics. However, our proposed algorithm achieved the best performance on all evaluation metrics. This result further validates the advantages of deep learning in the field of remote sensing image fusion and demonstrates the potential of our algorithm in practical applications.
Figure 9 shows the real experimental results of all the algorithms when applied to the GeoEye-1 dataset. Through observation, it can be found that the color distortion of GS2-GLP and MTF-GLP was particularly evident, which affected the overall quality of the fused images. From the enlarged area, it can be observed that the result of the GS2-GLP algorithm appears relatively blurry, while the fused image of the MTF-GLP algorithm shows the most blurring. In contrast, the fused image generated by the AWLP-H algorithm has more saturated colors. The six deep learning methods, including our algorithm, produced better visual results than the traditional methods, showing a more refined and comprehensive fusion of spatial and spectral information.
We also conducted a statistical analysis of the average metrics of the 25 sets of real image fusion results on the GeoEye-1 dataset, as shown in Table 7. Overall, the traditional methods performed poorly in terms of metrics, with only BDSD-PC performing slightly better among them. Among the deep learning fusion algorithms, TFnet and CIKANet both demonstrated good performance, with TFnet ranking second in the $D_\lambda$ and QNR metrics, and CIKANet ranking second in the SAM and $D_S$ metrics. However, our proposed algorithm achieved the best performance on all of the evaluation metrics.

5. Ablation Experiment

5.1. Comparison for Different Network Structures

We used the multi-scale dual-stream fusion network to train and test on the two types of remote sensing images. To rigorously substantiate the efficacy of the multi-scale fusion approach, we designed ablation experiments that comprehensively assess the performance of the model across multiple scales on the high-resolution GeoEye-1 remote sensing image dataset. Table 8 presents a comprehensive overview of the objective evaluation indicators for the various network structures tested on the GeoEye-1 dataset, intuitively comparing the advantages and disadvantages of the single-scale, dual-scale, and multi-scale fusion models. The results conclusively demonstrate that our proposed multi-scale fusion method significantly enhances the performance of image fusion.
In Table 8, we use bold text to indicate the optimal value. As the number of network scales increased, the six evaluation indicators showed an overall upward trend, ultimately indicating that multiple scales have a positive impact on improving the performance of a model. By fusing the input images at different scales, the feature differences of the images can be presented more comprehensively, enabling the fused image to have more spatial details. At the same time, the input images and the fused images at each level possess rich spectral information. Employing these methods can also lead to a notable enhancement of the spectral quality in the final fused image.
However, setting more scale levels for the network would increase the complexity of the model, which not only means that more parameters need to be learned, leading to lower training efficiency, but also tends to cause the network to overfit. Additionally, as shown in Table 8, setting the scale level to 2 resulted in a significant performance improvement compared to setting it to 1, while the model with a scale level of 3 achieved only a slight improvement compared to a scale level of 2. This suggests that, as the scale level increases, the gains in model performance are limited. To strike a balance between maximizing the model performance and managing the implementation complexity, the proposed method ultimately settled on a network with three scale levels.

5.2. Ablation Study of the Proposed Loss Function

In this paper, we propose a new loss function that can effectively reduce the loss of spectral and spatial information in the fused image based on the information of the input images. In addition to using a common loss function to evaluate the pixel differences between the output image and the reference image, we also introduce another loss to measure the gradient differences between the output image and the PAN image, with the aim of further enhancing the ability to preserve structural details during the image fusion process. To verify the effectiveness of this new loss function, we conducted ablation experiments on the simulated GeoEye-1 dataset. In the experiments, we kept the model structure unchanged and trained the network with the ordinary loss function and with our proposed loss function, and we then tested and compared the performance of the two models. The test results are listed in Table 9.
In the ablation experiment mentioned above, we recorded in detail the performance of the different loss functions on six key performance indicators. From the experimental results in Table 9, it is evident that, compared to using the traditional MSE loss function, training the network with our proposed loss function achieves superior performance in terms of spatial resolution and spectral fidelity. This is because, although the MSE evaluation method can quantify the differences between pixel values, it fails to fully consider the spatial structural consistency between the fused image and the input image when evaluating the similarity between the fused image and the reference image. To overcome this limitation, our proposed loss function incorporates a loss term that accounts for the difference in gradients between the output image and the PAN image. This approach not only enhances the spatial consistency of the fused image with the input images, but also improves the fidelity of the spectral information in the fused image.

6. Conclusions

In this paper, we propose a remote sensing image fusion network based on a multi-scale dual-stream convolutional neural network. We constructed a multi-scale dual-stream network structure that can simultaneously extract detailed features at different scales. Additionally, the adoption of a cascading residual fusion module effectively retains crucial features, ultimately enhancing the fusion quality. The proposed algorithm was compared with several advanced methods on two mainstream datasets. It demonstrated remarkable performance advantages in the simulated experiments and achieved highly competitive results on real-world data. This outcome strongly validates the capability of the proposed method in preserving spatial details and spectral features. However, because of the multi-scale dual-stream convolutional neural network structure, this method requires the simultaneous extraction and fusion of features from multiple scales when processing images, which inevitably increases the computational complexity. In addition, although we conducted experiments on widely used remote sensing datasets and achieved good results, remote sensing data from different sources may differ significantly in resolution, spectral characteristics, and other aspects. Therefore, the adaptability of this method to certain specific types of remote sensing data still needs further validation and improvement. In the future, we will continue to explore the applications of unsupervised learning, graph convolutional networks, and lightweight networks in remote sensing image fusion, with the aim of further advancing the field.

Author Contributions

Conceptualization, W.W. and H.L.; methodology, W.W., F.J. and Y.Y.; software, F.J. and K.M.; validation, W.W., F.J. and Y.Y.; formal analysis, W.W. and F.J.; investigation, W.W. and F.J.; writing—original draft preparation, W.W. and F.J.; writing—review and editing, W.W., Y.Y. and K.M.; visualization, Y.Y. and K.M.; supervision, H.L.; funding acquisition, W.W. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was partially supported by the National Natural Science Foundation of China (Grant Nos. 62376214 and 92270117) and the Natural Science Basic Research Program of Shaanxi (Grant No. 2023-JC-YB-533 (corresponding author: Han Liu)).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy concerns in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tsvetkovskaya, I.; Tekutieva, N.; Prokofeva, E.; Vostrikov, A. Methods of Obtaining Geospatial Data Using Satellite Communications and Their Processing Using Convolutional Neural Networks. In Proceedings of the 2020 Moscow Workshop on Electronic and Networking Technologies (MWENT), IEEE, Moscow, Russia, 11–13 March 2020; pp. 1–5.
  2. Li, S.; Wang, Y.; Cai, H.; Lin, Y.; Wang, M.; Teng, F. MF-SRCDNet: Multi-feature fusion super-resolution building change detection framework for multi-sensor high-resolution remote sensing imagery. Int. J. Appl. Earth Obs. Geoinf. 2023, 119, 103303.
  3. Wang, Y.; Li, S.; Teng, F.; Lin, Y.; Wang, M.; Cai, H. Improved mask R-CNN for rural building roof type recognition from uav high-resolution images: A case study in hunan province, China. Remote Sens. 2022, 14, 265.
  4. Bhargava, A.; Sachdeva, A.; Sharma, K.; Alsharif, M.H.; Uthansakul, P.; Uthansakul, M. Hyperspectral Imaging and Its Applications: A Review. Heliyon 2024, 10, e33208.
  5. Paris, C.; Bruzzone, L.; Fernández-Prieto, D. A novel approach to the unsupervised update of land-cover maps by classification of time series of multispectral images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4259–4277.
  6. Tu, X.; Shen, X.; Fu, P.; Wang, T.; Sun, Q.; Ji, Z. Discriminant sub-dictionary learning with adaptive multiscale superpixel representation for hyperspectral image classification. Neurocomputing 2020, 409, 131–145.
  7. Loncan, L.; De Almeida, L.B.; Bioucas-Dias, J.M.; Briottet, X.; Chanussot, J.; Dobigeon, N.; Fabre, S.; Liao, W.; Licciardi, G.A.; Simoes, M.; et al. Hyperspectral pansharpening: A review. IEEE Geosci. Remote Sens. Mag. 2015, 3, 27–46.
  8. Leung, Y.; Liu, J.; Zhang, J. An improved adaptive intensity–hue–saturation method for the fusion of remote sensing images. IEEE Geosci. Remote Sens. Lett. 2013, 11, 985–989.
  9. Ghahremani, M.; Ghassemian, H. Nonlinear IHS: A promising method for pan-sharpening. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1606–1610.
  10. Khan, S.S.; Ran, Q.; Khan, M.; Ji, Z. Pan-sharpening framework based on laplacian sharpening with Brovey. In Proceedings of the 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), IEEE, Chongqing, China, 11–13 December 2019; pp. 1–5.
  11. Kwarteng, P.; Chavez, A. Extracting spectral contrast in Landsat Thematic Mapper image data using selective principal component analysis. Photogramm. Eng. Remote Sens. 1989, 55, 339–348.
  12. Aiazzi, B.; Baronti, S.; Selva, M. Improving component substitution pansharpening through multivariate regression of MS + Pan data. IEEE Trans. Geosci. Remote Sens. 2007, 45, 3230–3239.
  13. Garzelli, A.; Nencini, F.; Capobianco, L. Optimal MMSE pan sharpening of very high resolution multispectral images. IEEE Trans. Geosci. Remote Sens. 2007, 46, 228–236.
  14. Nason, G.P.; Silverman, B.W. The stationary wavelet transform and some statistical applications. In Wavelets and Statistics; Springer: Berlin/Heidelberg, Germany, 1995; pp. 281–299.
  15. Shensa, M.J. The discrete wavelet transform: Wedding the a trous and Mallat algorithms. IEEE Trans. Signal Process. 1992, 40, 2464–2482.
  16. Burt, P.J.; Adelson, E.H. The Laplacian pyramid as a compact image code. In Readings in Computer Vision; Elsevier: Amsterdam, The Netherlands, 1987; pp. 671–679.
  17. Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306.
  18. Li, S.; Yang, B. A new pan-sharpening method using a compressed sensing technique. IEEE Trans. Geosci. Remote Sens. 2010, 49, 738–746.
  19. Palsson, F.; Sveinsson, J.R.; Ulfarsson, M.O. A new pansharpening algorithm based on total variation. IEEE Geosci. Remote Sens. Lett. 2013, 11, 318–322.
  20. Lotfi, M.; Ghassemian, H. A new variational model in texture space for pansharpening. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1269–1273.
  21. Yang, S.; Zhang, K.; Wang, M. Learning low-rank decomposition for pan-sharpening with spatial-spectral offsets. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 3647–3657.
  22. Rong, K.; Jiao, L.; Wang, S.; Liu, F. Pansharpening based on low-rank and sparse decomposition. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 4793–4805.
  23. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
  24. Masi, G.; Cozzolino, D.; Verdoliva, L.; Scarpa, G. Pansharpening by convolutional neural networks. Remote Sens. 2016, 8, 594.
  25. Zhong, J.; Yang, B.; Huang, G.; Zhong, F.; Chen, Z. Remote sensing image fusion with convolutional neural network. Sens. Imaging 2016, 17, 1–16.
  26. Liu, X.; Liu, Q.; Wang, Y. Remote sensing image fusion based on two-stream fusion network. Inf. Fusion 2020, 55, 1–15.
  27. He, L.; Rao, Y.; Li, J.; Chanussot, J.; Plaza, A.; Zhu, J.; Li, B. Pansharpening via detail injection based convolutional neural networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 1188–1204.
  28. Zhang, K.; Wang, A.; Zhang, F.; Wan, W.; Sun, J.; Bruzzone, L. Spatial-spectral dual back-projection network for pansharpening. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16.
  29. Zhang, P.; Mei, Y.; Gao, P.; Zhao, B. Cross-interaction kernel attention network for pansharpening. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5001505.
  30. Bandara, W.G.C.; Patel, V.M. Hypertransformer: A textural and spectral feature fusion transformer for pansharpening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1767–1777.
  31. Zhang, H.; Wang, H.; Tian, X.; Ma, J. P2Sharpen: A progressive pansharpening network with deep spectral transformation. Inf. Fusion 2023, 91, 103–122.
  32. Shang, Y.; Liu, J.; Zhang, J.; Wu, Z. MFT-GAN: A Multiscale Feature-guided Transformer Network for Unsupervised Hyperspectral Pansharpening. IEEE Trans. Geosci. Remote Sens. 2024.
  33. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144.
  34. Wald, L.; Ranchin, T.; Mangolini, M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images. Photogramm. Eng. Remote Sens. 1997, 63, 691–699.
  35. Pushparaj, J.; Hegde, A.V. Evaluation of pan-sharpening methods for spatial and spectral quality. Appl. Geomat. 2017, 9, 1–12.
  36. Yang, Y.; Wan, W.; Huang, S.; Lin, P.; Que, Y. A novel pan-sharpening framework based on matting model and multiscale transform. Remote Sens. 2017, 9, 391.
  37. Choi, M. A new intensity-hue-saturation fusion approach to image fusion with a tradeoff parameter. IEEE Trans. Geosci. Remote Sens. 2006, 44, 1672–1682.
  38. Zhou, J.; Civco, D.L.; Silander, J.A. A wavelet transform method to merge Landsat TM and SPOT panchromatic data. Int. J. Remote Sens. 1998, 19, 743–757.
  39. Wang, Z.; Bovik, A.C. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84.
  40. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  41. Alparone, L.; Aiazzi, B.; Baronti, S.; Garzelli, A.; Nencini, F.; Selva, M. Multispectral and panchromatic data fusion assessment without reference. Photogramm. Eng. Remote Sens. 2008, 74, 193–200.
  42. Aiazzi, B.; Alparone, L.; Baronti, S.; Garzelli, A.; Selva, M. MTF-tailored multiscale fusion of high-resolution MS and Pan imagery. Photogramm. Eng. Remote Sens. 2006, 72, 591–596.
  43. Vivone, G. Robust band-dependent spatial-detail approaches for panchromatic sharpening. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6421–6433.
  44. Vivone, G.; Alparone, L.; Garzelli, A.; Lolli, S. Fast reproducible pansharpening based on instrument and acquisition modeling: AWLP revisited. Remote Sens. 2019, 11, 2315.
Figure 1. The multi-scale dual-stream fusion network framework.
Figure 2. The structure of the LCMs. (a) A diagram of the LCM modules; (b) a diagram of the residual module.
Figure 3. Decoder module. (a) Overall structure of the decoder; (b) a diagram of the residual module.
Figure 4. The fused images of simulated experiments on the GeoEye-1 dataset.
Figure 5. The residual images corresponding to the fusion results in Figure 4.
Figure 6. The fused images of the simulated experiments on the Ikonos dataset.
Figure 7. The residual images corresponding to the fusion results in Figure 6.
Figure 8. The fused images of the real experiments on the Ikonos dataset.
Figure 9. The fused images of real experiments on the GeoEye-1 dataset.
Table 1. The characteristics of the two remote sensing satellites.

Satellite    PAN Spatial Resolution (m)    MS Spatial Resolution (m)
GeoEye-1     0.5                           2
Ikonos       1                             4
Table 2. Division of the LRMS and PAN image datasets.

Satellite    Data Type    Number of Groups    Size (LRMS, PAN)
Ikonos       Train        192                 8 × 8, 32 × 32
Ikonos       Valid        24                  8 × 8, 32 × 32
Ikonos       Test         24                  50 × 50, 200 × 200
GeoEye-1     Train        200                 8 × 8, 32 × 32
GeoEye-1     Valid        25                  8 × 8, 32 × 32
GeoEye-1     Test         25                  50 × 50, 200 × 200
Table 3. Specific parameters of the network modules.

Module               Layer       InChannel    Kernel    OutChannel
LCM                  Residual    128          3 × 3     128
                     Concat1     128          -         256
                     Conv1       256          1 × 1     128
                     Residual    128          3 × 3     128
                     Concat2     128          -         384
                     Conv2       384          1 × 1     128
                     Residual    128          3 × 3     128
                     Concat3     128          -         512
                     Conv3       512          1 × 1     128
GCM                  LCM         128          -         128
                     Concat1     128          -         256
                     Conv1       256          1 × 1     128
                     LCM         128          -         128
                     Concat2     128          -         384
                     Conv2       384          1 × 1     128
                     LCM         128          -         128
                     Concat3     128          -         512
                     Conv3       512          1 × 1     128
                     Add         128          -         128
Image Restoration    Residual    128          3 × 3     128
                     Conv1       128          2 × 2     128
                     Residual    128          3 × 3     128
                     Conv2       128          3 × 3     64
                     Residual    64           3 × 3     64
                     Conv3       64           3 × 3     4
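To make the cascade pattern in Table 3 easier to follow, the sketch below reproduces the LCM column in PyTorch: each stage runs a residual block, concatenates its output with all previously accumulated feature maps (256, 384, then 512 channels), and compresses the stack back to 128 channels with a 1 × 1 convolution. This is a minimal illustration under stated assumptions, not the authors' implementation; the table only specifies in/out channels and kernel sizes, so the internal form of the residual block and all class and variable names here are illustrative.

# A minimal PyTorch sketch of the LCM cascade implied by Table 3. The residual
# block (two 3x3 convolutions with an identity skip) is an assumption.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Assumed residual block: two 3x3 convolutions, a ReLU, and an identity skip."""

    def __init__(self, channels: int = 128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)


class LCM(nn.Module):
    """Cascade module following the channel progression of Table 3.

    Each stage runs a residual block, concatenates its output with every
    previously kept feature map (256 -> 384 -> 512 channels), and compresses
    the stack back to 128 channels with a 1x1 convolution.
    """

    def __init__(self, channels: int = 128, stages: int = 3):
        super().__init__()
        self.residuals = nn.ModuleList([ResidualBlock(channels) for _ in range(stages)])
        self.compress = nn.ModuleList(
            [nn.Conv2d(channels * (i + 2), channels, kernel_size=1) for i in range(stages)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]          # running list of all kept feature maps
        out = x
        for residual, conv in zip(self.residuals, self.compress):
            features.append(residual(out))          # new 128-channel feature
            out = conv(torch.cat(features, dim=1))  # 256/384/512 -> 128
        return out


if __name__ == "__main__":
    # Shape check: a 128-channel feature map goes in and comes out at the same size.
    y = LCM()(torch.randn(1, 128, 32, 32))
    print(y.shape)  # torch.Size([1, 128, 32, 32])

Per Table 3, the GCM column follows the same concatenate-and-compress pattern, with whole LCM blocks in place of the residual blocks and a final element-wise addition.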
Table 4. Average evaluation indicators of the simulated GeoEye-1 dataset. The best values are marked in bold, and the second-best values are underlined.

Method      SAM↓      ERGAS↓    RASE↓      SCC↑      Q↑        SSIM↑
GS2-GLP     3.2744    2.8539    13.3318    0.8168    0.8206    0.8929
BDSD-PC     3.0881    2.5465    11.8565    0.8794    0.8601    0.9222
AWLP-H      2.4347    2.3470    10.8151    0.9124    0.8789    0.9427
MTF-GLP     2.9326    2.7003    12.2147    0.8766    0.8431    0.9186
TFnet       1.6003    1.6858    7.3025     0.9616    0.9098    0.9668
PanNet      1.5137    1.4808    6.7133     0.9609    0.9320    0.9691
DiCNN1      1.4790    1.4567    6.7126     0.9617    0.9326    0.9704
S2DBPN      1.7679    1.7506    7.3598     0.9600    0.8987    0.9633
CIKANet     1.4385    1.4434    6.3030     0.9675    0.9166    0.9708
Proposed    1.3831    1.4245    6.0775     0.9705    0.9191    0.9728
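As a reading aid for Tables 4 and 5, the two headline reduced-resolution indices follow their standard definitions, restated here as a sketch (the paper itself does not reproduce them): SAM is computed between the fused and reference spectral vectors at each pixel and averaged over the image, and ERGAS is the relative dimensionless global error, where B is the number of bands, μ_b the mean of band b, and h/l the PAN-to-MS pixel-size ratio (1/4 for both satellites).

\mathrm{SAM}(\mathbf{f},\mathbf{r}) = \arccos\frac{\langle \mathbf{f}, \mathbf{r} \rangle}{\lVert \mathbf{f} \rVert_2 \, \lVert \mathbf{r} \rVert_2},
\qquad
\mathrm{ERGAS} = 100 \, \frac{h}{l} \sqrt{\frac{1}{B} \sum_{b=1}^{B} \frac{\mathrm{RMSE}_b^{\,2}}{\mu_b^{2}}}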
Table 5. Average evaluation indicators of the simulated Ikonos dataset. The best values are marked in bold, and the second-best values are underlined.

Method      SAM↓      ERGAS↓    RASE↓      SCC↑      Q↑        SSIM↑
GS2-GLP     3.8824    2.6749    10.7669    0.8818    0.7867    0.9070
BDSD-PC     3.8636    2.6820    10.7902    0.8885    0.7965    0.9117
AWLP-H      3.5207    2.6405    10.6203    0.8931    0.8098    0.9201
MTF-GLP     4.0959    2.7863    11.7074    0.8716    0.7747    0.8959
TFnet       2.9284    2.0570    8.4343     0.9373    0.8412    0.9460
PanNet      2.6712    1.8804    7.6997     0.9444    0.8673    0.9527
DiCNN1      2.6529    1.8696    7.6126     0.9466    0.8671    0.9535
S2DBPN      3.0977    2.4527    9.4238     0.9388    0.8387    0.9455
CIKANet     2.7239    1.8799    7.5619     0.9527    0.8556    0.9545
Proposed    2.6079    1.8764    7.5816     0.9528    0.8588    0.9554
Table 6. Average evaluation indicators of the Ikonos real dataset. The best values are marked in bold, and the second-best values are underlined.

Method      SAM↓      QNR↑      Dλ↓       Ds↓
GS2-GLP     1.5272    0.7301    0.1407    0.1663
BDSD-PC     2.0043    0.8021    0.0823    0.1389
AWLP-H      1.4829    0.7583    0.1332    0.1410
MTF-GLP     1.5746    0.7186    0.1510    0.1680
TFnet       1.2775    0.7940    0.1057    0.1162
PanNet      1.5322    0.8443    0.0705    0.0999
DiCNN1      1.6077    0.8337    0.0647    0.1203
S2DBPN      1.8169    0.7920    0.1010    0.1244
CIKANet     1.5810    0.8450    0.0544    0.1117
Proposed    1.2672    0.8738    0.0433    0.0874
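For the full-resolution (real-data) experiments in Tables 6 and 7, QNR is the no-reference index of Alparone et al. [41]: it combines the spectral distortion Dλ and the spatial distortion Ds, so smaller distortions push QNR toward 1. In its usual form (a sketch; the exponents are typically set to 1):

\mathrm{QNR} = \left(1 - D_{\lambda}\right)^{\alpha} \left(1 - D_{s}\right)^{\beta}, \qquad \alpha = \beta = 1 \ \text{by default}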
Table 7. Average evaluation indicators of the GeoEye-1 real dataset. The best values are marked in bold, and the second-best values are underlined.

Method      SAM↓      QNR↑      Dλ↓       Ds↓
GS2-GLP     0.8168    0.8373    0.0532    0.1170
BDSD-PC     1.2219    0.9036    0.0287    0.0702
AWLP-H      0.7132    0.8837    0.0437    0.0764
MTF-GLP     0.7895    0.8056    0.0701    0.1348
TFnet       0.6090    0.9470    0.0181    0.0357
PanNet      0.7997    0.9155    0.0289    0.0575
DiCNN1      0.7984    0.9210    0.0226    0.0582
S2DBPN      0.9402    0.9335    0.0237    0.0439
CIKANet     0.5285    0.9374    0.0291    0.0346
Proposed    0.5012    0.9595    0.0179    0.0230
Table 8. Evaluation indicators at different scales on the GeoEye-1 dataset. The best values are marked in bold.

Scales          SAM↓      ERGAS↓    RASE↓     SCC↑      Q↑        SSIM↑
One scale       1.4335    1.5004    6.3652    0.9674    0.9149    0.9712
Two scales      1.3990    1.4405    6.1769    0.9695    0.9207    0.9722
Three scales    1.3831    1.4245    6.0775    0.9705    0.9191    0.9728
Table 9. Evaluation metrics for different loss functions on the GeoEye-1 dataset. The best values are marked in bold.

Loss Function                 SAM↓      ERGAS↓    RASE↓     SCC↑      Q↑        SSIM↑
The L1 loss function          1.4198    1.4674    6.4413    0.9659    0.9208    0.9713
The proposed loss function    1.3831    1.4245    6.0775    0.9705    0.9191    0.9728
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
