1. Introduction
High-resolution multispectral (HRMS) imagery has numerous applications in the domain of remote sensing image processing, such as target detection [1], military defense [2], climate monitoring [3], area classification [4], and so on. However, due to constraints associated with hardware and physical imaging capabilities, various satellite sensors are unable to directly capture HRMS data. Instead, they can only acquire paired low-resolution multispectral (LRMS) images, which contain extensive spectral information, along with panchromatic (PAN) images that offer high spatial resolution. Therefore, pansharpening technology has been developed to merge the information from these two types of images, thereby producing HRMS imagery that encompasses both rich spectral and spatial information.
Generally speaking, pansharpening methods can be divided into traditional methods and deep learning-based methods. Traditional methods can be further divided into three categories: the component substitution (CS) method [5], the multiresolution analysis (MRA) method [6], and the variational optimization (VO) method [7].
The CS method transforms multispectral images into a certain coordinate space, separates the spectral and spatial information, replaces the spatial component with a histogram-matched PAN image, and finally generates HRMS through inverse transformation. Classic algorithms include GS [8], IHS [9], and PCA [10]. However, this type of method tends to damage the original spectral structure and produce fusion results with serious spectral distortion.
The MRA method extracts spatial detail structures by designing high-pass filters and then injects these details into the multispectral images to obtain HRMS. Representative algorithms include GLP [11], HPF [12], and SFIM [13]. This type of method preserves spectral information well, but spatial details may be lost or artifacts such as shadows introduced during the injection process, resulting in poor spatial quality of the fusion result.
The VO method constructs an energy function by analyzing the relationship between LRMS, PAN, and HRMS and obtains the fusion result by optimizing this energy function. Typical methods include observation model-based methods [14,15] and sparse representation-based methods [16,17]. This type of method relies heavily on prior information, which is often unpredictable, resulting in a certain degree of distortion in the spectral and spatial information of the final fusion result.
The pansharpening method utilizing deep learning techniques, particularly convolutional neural networks (CNNs), has garnered significant attention in recent years. The robust non-linear feature extraction capabilities of CNNs have facilitated their extensive application in the domain of pansharpening. Drawing inspiration from super-resolution convolutional neural networks (SRCNNs) [18], Masi et al. [19] were among the first to incorporate CNNs into pansharpening networks, developing a three-layer convolutional network architecture to achieve pansharpening results. Building upon this foundation, Wei et al. [20] designed convolutional kernels of varying sizes to extract information at different scales from the source images. Recognizing that LRMS and PAN images contain different modalities of information, Liu et al. [21] proposed a dual-stream fusion network that employs independent branches to extract features from both LRMS and PAN source images. However, these independent branches, despite having identical structures, still struggle to effectively characterize spectral and spatial features. In response to this limitation, Yong et al. [22] developed separate channel and spatial attention (SA) mechanisms [23] to extract spectral and spatial features, respectively. They subsequently multiplied the channel weights by the spatial weights to derive overall weights, thereby guiding the LRMS and PAN branches to enhance their feature extraction capabilities. To mitigate redundant information across different subnetworks, Wang et al. [24] introduced an adaptive feature fusion module (AFFM) that infers weight maps for the various branches from the feature map, thereby increasing the network's flexibility. Other researchers reconstruct HRMS by extracting and combining common and unique features. Xu et al. [25] used convolutional sparse coding (CSC) to extract the side information of panchromatic images and, at the same time, split low-resolution multispectral images into panchromatic-image-related feature maps and panchromatic-image-independent feature maps. The former were regularized by the side information of the panchromatic image, and the proposed model was extended to a deep neural network using algorithm unrolling techniques. Cao et al. [26] designed a CSC with two sets of filters (a common filter and a unique filter) to model PAN and MS images separately, a model called PanCSC, and derived an effective optimization algorithm to optimize it. Inspired by the learnable iterative soft thresholding algorithm, Yin [27] proposed a coupled convolutional sparse coding-based pansharpening (PSCSC) model; the solution of PSCSC follows traditional algorithms, and a deep unfolding strategy is used to develop an interpretable end-to-end deep pansharpening network. Zhi et al. [28] proposed a cross-attention-based module to extract common and unique features from MS and PAN images. Lin et al. [29] encode LRMS and PAN into unique and common features, then average the common features and combine them with the unique features of the source images to reconstruct the fused image. Deng et al. [30] developed a cross-convolutional attention network that dynamically adjusts the parameters of the convolutional kernels, thereby enhancing the interaction between the two branches to obtain complementary information. Zhou et al. [31] proposed a PAN-guided multispectral feature enhancement network that employs multi-scale convolutional blocks to aggregate features across multiple scales. Jia et al. [32] introduced an attention-based progressive network that incrementally enhances detail information to generate an enhanced version of LRMS matching the size of PAN; concurrently, they utilized concatenated spectral data and multi-scale SA blocks to progressively extract spatial and spectral features for the reconstruction of HRMS images. Despite the promising spectral and spatial quality of the fusion results produced by existing methods, several challenges remain to be addressed as follows:
Many dual-branch architecture methods overlook the common features shared between LRMS and PAN images, such as those captured by [33], leading to architectural inefficiencies;
A lack of adaptive convolution in most current approaches weakens the network’s flexibility and adaptability;
The majority of methods operate solely in the spatial domain, without incorporating network structures designed to process frequency domain information.
To address the identified issues, this study developed a three-branch pansharpening network that leverages the interaction between spatial and frequency domains. Each branch is responsible for extracting a different type of information: one branch extracts spectral information from LRMS images, another captures spatial information from PAN images, and the third branch integrates common information from both sources. The research implemented a spectral feature extraction module designed to identify nonlinear relationships among the channels in LRMS, thereby facilitating the acquisition of spectral information. Additionally, a spatial feature extraction module was created to capture spatial information from PAN images. This module employs multi-scale blocks to extract spatial features at various scales, followed by the application of SA mechanisms, as well as Sobel and Laplacian operators, to derive general spatial features and to enhance edge and texture features, respectively. Following the interaction and fusion of these two feature sets, a directional perception module is introduced to embed positional information, thereby augmenting the expressive capacity of the spatial features. For the common features derived from both LRMS and PAN, each band of the LRMS image is concatenated with the PAN image separately, and a dynamic convolution mechanism based on the extracted spectral and spatial features was designed. Notably, the content of the convolution kernel is adapted according to the spectral and spatial characteristics of the input image. Subsequently, the features obtained from both the spectral and spatial feature extraction modules are processed to derive their corresponding amplitude and phase using the discrete Fourier transform (DFT). These amplitude and phase features are then integrated and subjected to the inverse discrete Fourier transform (IDFT) to yield frequency domain features. The branch responsible for extracting common features from LRMS and PAN provides the spatial domain inputs, which then interact and are fused with the frequency domain features, ultimately leading to the reconstruction of HRMS images. In summary, the contributions can be delineated as follows:
A three-branch pansharpening network based on spatial and frequency domain interaction was proposed, considering the distinct modal characteristics of LRMS and PAN images, as well as their common features;
A dynamic convolution based on spectral–spatial feature extraction was designed to improve network flexibility and adaptability;
A spatial–frequency domain feature interaction fusion module (SFFIM) was developed to achieve interactive fusion of spatial and frequency domain information, leading to more comprehensive and enriched fused image features.
2. Proposed Method
Figure 1 illustrates the overall proposed network architecture. The design of the network structure is approached from two perspectives: the unique features of the upsampled LRMS image and of the PAN image, and the common features shared between them, where H represents the height of the image, W represents the width, and B represents the number of bands in LRMS. Acknowledging the differences between LRMS and PAN images, this study developed a spectral feature extraction module to extract spectral information from LRMS images and a spatial feature extraction module to derive spatial information from PAN images. Furthermore, recognizing that LRMS and PAN images share certain common features, the research implemented a common feature extraction module. Due to the specificity of each band in LRMS [34], each band of the LRMS image is concatenated with the PAN image individually as the input to this branch, which facilitates substantial interaction between the PAN image and each band of LRMS. Additionally, a feature fusion module was designed to adaptively integrate information from the various branches, ultimately performing feature reconstruction to yield HRMS images.
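To make the data flow described above concrete, the following is a minimal PyTorch-style sketch of the three-branch layout. All class names, layer widths, and the residual reconstruction step are illustrative assumptions rather than the authors' actual implementation, and the dedicated sub-modules detailed in Sections 2.1 through 2.4 are reduced to plain convolutions here.

```python
import torch
import torch.nn as nn

class ThreeBranchPansharpening(nn.Module):
    """Illustrative skeleton of the three-branch layout; the sub-modules are
    placeholders, not the paper's exact spectral/spatial/common modules."""
    def __init__(self, bands: int, channels: int = 32):
        super().__init__()
        # Branch 1: spectral features from the upsampled LRMS image.
        self.spectral_branch = nn.Sequential(
            nn.Conv2d(bands, channels, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 2: spatial features from the PAN image.
        self.spatial_branch = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: common features from the LRMS bands paired with PAN.
        self.common_branch = nn.Sequential(
            nn.Conv2d(2 * bands, channels, 3, padding=1), nn.ReLU(inplace=True))
        # Fusion and reconstruction back to the number of LRMS bands.
        self.reconstruct = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, bands, 3, padding=1))

    def forward(self, lrms_up: torch.Tensor, pan: torch.Tensor) -> torch.Tensor:
        # lrms_up: (N, B, H, W) upsampled LRMS; pan: (N, 1, H, W).
        f_spe = self.spectral_branch(lrms_up)
        f_spa = self.spatial_branch(pan)
        # Pair every LRMS band with the PAN image for the common branch.
        pan_rep = pan.repeat(1, lrms_up.shape[1], 1, 1)
        f_com = self.common_branch(torch.cat([lrms_up, pan_rep], dim=1))
        fused = torch.cat([f_spe, f_spa, f_com], dim=1)
        # Residual connection on the upsampled LRMS, a common pansharpening choice.
        return self.reconstruct(fused) + lrms_up
```

For example, with a four-band sensor, `ThreeBranchPansharpening(bands=4)(lrms_up, pan)` returns a tensor with the same shape as the upsampled LRMS input.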
2.1. Spectral Feature Extraction Module
LRMS images encompass extensive spectral information, which is derived from the interrelationships among the various spectral bands [35]. To facilitate the extraction of this spectral information, a spectral feature extraction module was developed (Figure 2). Firstly, a 3 × 3 convolution followed by a ReLU activation function is applied to preprocess the LRMS image, producing an initial feature map. Subsequently, global average pooling (GAP) is performed on this feature map, and the pooled descriptor is processed through two parallel multi-layer perceptrons (MLPs) to generate two vectors that represent the channel features. This entire procedure is expressed through Equation (1):
Subsequently, a matrix multiplication is performed between one channel vector and the transpose of the other, and the Sigmoid activation function is applied to the result to derive the relationship map among the various channels. Concurrently, the preprocessed feature map is reshaped into a matrix, which is multiplied by the relationship map and then reshaped back to its original dimensions. This entire process can be represented in Equation (2):
where Re(·) denotes the reshaping operation. Finally, a skip connection operation is executed to retain the information from the source image, thereby yielding the output of the module.
Figure 2.
The architecture of the spectral feature extraction module.
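As an illustration of the channel-relationship attention described above (Figure 2), the following is a hedged PyTorch sketch. The layer widths, the MLP reduction ratio, and the choice of skip connection target are assumptions; only the sequence of operations (convolution + ReLU, GAP, two parallel MLPs, outer product with Sigmoid, reshape-multiply-reshape, skip connection) follows the text.

```python
import torch
import torch.nn as nn

class SpectralFeatureExtraction(nn.Module):
    """Sketch of the channel-relationship attention in Figure 2; layer sizes
    are assumptions, only the order of operations follows the description."""
    def __init__(self, bands: int, channels: int = 32, reduction: int = 4):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(bands, channels, 3, padding=1),
                                 nn.ReLU(inplace=True))
        # Two parallel MLPs acting on the globally pooled channel descriptor.
        def mlp():
            return nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // reduction, channels))
        self.mlp1, self.mlp2 = mlp(), mlp()

    def forward(self, lrms_up: torch.Tensor) -> torch.Tensor:
        f = self.pre(lrms_up)                       # (N, C, H, W)
        n, c, h, w = f.shape
        gap = f.mean(dim=(2, 3))                    # global average pooling -> (N, C)
        v1, v2 = self.mlp1(gap), self.mlp2(gap)     # two channel vectors
        # Channel relationship map from the outer product of the two vectors.
        rel = torch.sigmoid(torch.bmm(v1.unsqueeze(2), v2.unsqueeze(1)))  # (N, C, C)
        # Reshape the feature map, apply the relationship map, reshape back.
        f_flat = f.reshape(n, c, h * w)
        out = torch.bmm(rel, f_flat).reshape(n, c, h, w)
        return out + f                              # skip connection
```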
2.2. Spatial Feature Extraction Module
To acquire comprehensive spatial information from PAN imagery, this study developed a spatial feature extraction module comprising three parallel branches (Figure 3a). The following paragraphs detail the functionality of these three branches in a top-to-bottom sequence (Figure 3).
Firstly, the PAN image was preprocessed using a 3 × 3 convolution and ReLU activation function, followed by another 3 × 3 convolution and ReLU activation function and a sequence of multi-scale blocks, thereby facilitating the preliminary extraction of multi-scale features. Subsequently, convolutional layers and ReLU activation functions were employed for channel dimensionality reduction, yielding the preprocessed feature map. In the first branch, this study utilized convolutional layers in conjunction with SA mechanisms to comprehensively extract spatial features from the PAN image, together with a skip connection operation, which yielded the output of the branch.
However, this branch alone neglects the finer textures of PAN images, which are predominantly found in details such as object edges, an observation also noted by several scholars. Drawing inspiration from [36,37], this study incorporated the Sobel operator and the Laplacian operator in the second branch to extract fine texture details. The Sobel operator is a first-order differential operator that emphasizes pixels close to the central pixel, assigning the highest weight to its nearest neighbors, which aids in noise reduction and preserves strong texture features; its superiority in extracting detailed information has also been demonstrated [38]. However, the Sobel operator is only a first-order operator, and second-order differential operators offer better edge localization and a better pansharpening effect. Among second-order operators, the Laplacian is widely used for fine feature extraction in image processing, such as in [37,39], and can extract weak textures that complement the strong textures above; it was therefore chosen here. Specifically, the features extracted by these two operators are concatenated, after which a convolutional layer is applied and a skip connection executed to derive the output of the branch.
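As a concrete illustration of this second branch, the Sobel and Laplacian operators can be applied as fixed (non-learnable) depthwise convolution kernels. The sketch below assumes a learnable convolution for fusing the operator responses and a skip connection; it mirrors the description above but is not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeDetailBranch(nn.Module):
    """Sketch of the edge/texture branch: fixed Sobel (first-order) and
    Laplacian (second-order) filters followed by a learnable fusion conv."""
    def __init__(self, channels: int = 32):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        laplace = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        # One fixed 3x3 kernel per input channel (depthwise filtering).
        def depthwise(kernel):
            return kernel.view(1, 1, 3, 3).repeat(channels, 1, 1, 1)
        self.register_buffer("sobel_x", depthwise(sobel_x))
        self.register_buffer("sobel_y", depthwise(sobel_y))
        self.register_buffer("laplace", depthwise(laplace))
        self.fuse = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = x.shape[1]  # groups = channels, i.e., depthwise convolution
        ex = F.conv2d(x, self.sobel_x, padding=1, groups=g)
        ey = F.conv2d(x, self.sobel_y, padding=1, groups=g)
        el = F.conv2d(x, self.laplace, padding=1, groups=g)
        edges = torch.cat([ex, ey, el], dim=1)      # concatenate operator responses
        return self.fuse(edges) + x                 # conv + skip connection
```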
After extracting the features from the two aforementioned branches, a methodical approach was employed to integrate the information from these branches, ensuring that they mutually inform one another and that the details are fully exploited. Specifically, for each branch, the Sigmoid activation function is applied to its feature map to derive the corresponding weight map. The two weight maps are then cross-multiplied with the feature maps of the opposite branches, and the interacted information from the two branches is combined to yield the fused spatial feature (Equation (3)):
In addition to providing rich spatial information, including details and contours, PAN images also encompass positional information; specifically, the rows and columns within PAN images exhibit interrelatedness. The first two branches do not distinguish between the rows and columns of the image but apply a unified operation, which implies a lack of perception of positional information. To effectively capture this positional information, this study developed a spatial perception module within the third branch (Figure 3c). The feature map is pooled along its width and height dimensions to derive a feature representation in the height dimension and a feature representation in the width dimension. Subsequently, these two representations are multiplied along the channel dimension, and the Sigmoid activation function is applied to generate the direction-aware map. Ultimately, the direction-aware map is multiplied with the fused spatial feature channel by channel and element by element, so that each element obtains a weight carrying directional information; this embeds the positional information into the feature and yields the output of the spatial feature extraction module. This process is articulated in Equation (4) as follows:
where ∗ represents the matrix multiplication along the channel direction and ⊙ represents the multiplication channel by channel and element by element.
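A minimal sketch of this direction-aware weighting is given below, assuming average pooling along each axis and a per-channel outer product as the multiplication along the channel dimension; these specific choices are assumptions consistent with, but not confirmed by, the text.

```python
import torch
import torch.nn as nn

class DirectionPerception(nn.Module):
    """Sketch of the direction-aware weighting of Equation (4): pooling along
    height and width, a per-channel outer product, and a Sigmoid gate applied
    back to the feature map. Names and pooling choices are illustrative."""
    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (N, C, H, W)
        f_h = f.mean(dim=3, keepdim=True)           # pool along width  -> (N, C, H, 1)
        f_w = f.mean(dim=2, keepdim=True)           # pool along height -> (N, C, 1, W)
        # Per-channel outer product recovers an (H, W) map carrying row/column cues.
        direction_map = torch.sigmoid(f_h * f_w)    # broadcast to (N, C, H, W)
        # Channel-by-channel, element-by-element re-weighting of the input feature.
        return f * direction_map
```

Because the map differs per channel, each channel of the feature receives its own positional weighting.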
2.3. Common Feature Extraction Module
LRMS images also contain spatial information and PAN images also contain spectral information, which reveals a shared informational basis between the two modalities. To effectively harness this common information, this study developed a dedicated feature extraction module, the architecture of which is depicted in Figure 4. Given the unique characteristics of each spectral band in LRMS images, the research implemented a dynamic convolution kernel driven by spectral–spatial feature extraction, termed SSDConv, tailored to the branch of each band, rather than employing conventional convolution with uniform parameters. This approach not only accounts for the different properties of each band but also enhances the model's adaptability. The detailed structure of SSDConv is presented in Figure 5a.
The fundamental step in generating dynamic convolution kernels is acquiring the weight matrix of the local blocks, where the local blocks are the patches obtained by concatenating the PAN image with each band of the LRMS data and segmenting the feature map after preprocessing by the convolutional layer; k and the channel number denote the spatial size and the number of channels of the input local blocks, respectively. Given that the methodology for generating dynamic convolution kernels is the same for the branch of every band, a particular band is used as a representative example, and a single local block is selected to elucidate the process.
Each local block comprises information derived from a specific band of LRMS and the PAN imagery. Numerous studies have examined the channel and spatial characteristics of feature maps, leading to three-branch feature extraction architectures that address both the channel (one-dimensional) and spatial (two-dimensional) dimensions. However, the network structures of these three branches are identical and do not adequately account for the inherent differences between spatial and spectral features. To address this limitation, this study proposed a dynamic convolution approach focused on spectral–spatial feature extraction. For channel features, the research employed the "squeeze excitation" operation [40] (SE(·) in Equation (5)) to derive the weight for the channel dimension. For spatial features, this study utilized SA mechanisms to generate the corresponding weight maps for the width and height dimensions of the feature map, respectively. Ultimately, these three weights are combined to obtain W (Equation (5)):
where ⊙ represents channel-wise multiplication and ⊗ represents matrix multiplication along the channel direction. Following this, an inner product operation is conducted between W and a set of candidate kernels to derive a dynamic convolution kernel. Subsequently, this dynamic convolution kernel is employed to execute the convolution operation on the input local block, resulting in the corresponding features of the local block (Equation (6)):
where ⊙ represents the inner product operation and ∗ represents the convolution operation.
At the conclusion of the common feature extraction module, the features from the various band branches are concatenated and subsequently processed through convolutional layers followed by ReLU activation functions to derive the output of the module.
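One plausible reading of Equations (5) and (6) is sketched below: a squeeze-excitation channel weight and two pooled spatial weights are broadcast into a tensor W, which scores a bank of candidate kernels via inner products, and the weighted sum of candidates is used as the convolution kernel. The softmax normalization, the kernel-bank size, and the attention heads are assumptions, so this is an illustrative approximation rather than the authors' SSDConv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSDConv(nn.Module):
    """Approximate sketch of spectral-spatial dynamic convolution: W is built
    from a channel weight (SE) and height/width weights, then used to mix a
    bank of candidate kernels; details beyond the text are assumptions."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, n_candidates: int = 4):
        super().__init__()
        self.k = k
        # Candidate kernel bank {K_1, ..., K_n}.
        self.candidates = nn.Parameter(
            torch.randn(n_candidates, out_ch, in_ch, k, k) * 0.02)
        # Squeeze-excitation producing the channel weight (length in_ch).
        self.se = nn.Sequential(nn.Linear(in_ch, max(in_ch // 2, 1)), nn.ReLU(inplace=True),
                                nn.Linear(max(in_ch // 2, 1), in_ch), nn.Sigmoid())
        # Attention producing weights along the height and width of a k x k window.
        self.attn_h = nn.Sequential(nn.Linear(in_ch, k), nn.Sigmoid())
        self.attn_w = nn.Sequential(nn.Linear(in_ch, k), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = x.shape
        desc = x.mean(dim=(2, 3))                        # (N, C) global descriptor
        w_c = self.se(desc)                              # channel weight (N, C)
        w_h = self.attn_h(desc)                          # height weight  (N, k)
        w_w = self.attn_w(desc)                          # width weight   (N, k)
        # Combine into W of shape (N, C, k, k) via broadcast outer products.
        W = w_c[:, :, None, None] * w_h[:, None, :, None] * w_w[:, None, None, :]
        # Inner product of W with every candidate kernel gives mixing coefficients.
        coeff = torch.einsum('ncij,mocij->nm', W, self.candidates)
        coeff = torch.softmax(coeff, dim=1)              # normalize over candidates
        kernel = torch.einsum('nm,mocij->nocij', coeff, self.candidates)
        # Apply the per-sample dynamic kernel via a grouped convolution trick.
        out_ch = kernel.shape[1]
        x_flat = x.reshape(1, n * c, *x.shape[2:])
        k_flat = kernel.reshape(n * out_ch, c, self.k, self.k)
        out = F.conv2d(x_flat, k_flat, padding=self.k // 2, groups=n)
        return out.reshape(n, out_ch, *x.shape[2:])
```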
2.4. Space–Frequency Domain Feature Interaction Fusion Module
Most existing methodologies primarily focus on designing fusion strategies within the spatial domain, employing techniques such as direct addition, concatenation, or the application of adaptive weights. While these approaches have yielded satisfactory fusion results, there has been a lack of corresponding network designs that address frequency domain information. Furthermore, although certain methods, such as FAFNet [41], extract features in the frequency domain, they tend to overlook the information present in the spatial domain. To enhance the accuracy of the fusion results, this study developed SFFIM (Figure 6). The DFT is applied to the outputs of the spectral and spatial feature extraction modules individually to derive their respective amplitudes and phases. Subsequently, the amplitudes from the two branches are concatenated, integrating the amplitude information A, and the phases are concatenated, integrating the phase information P. This ensures that the amplitude and phase information of the two branches in the frequency domain is fully communicated; if the two branches interacted only in the spatial domain, merely conventional strategies such as addition or concatenation could be used. Therefore, interaction in the frequency domain is regarded as an effective way to integrate the information from these two branches. Finally, IDFT is performed on the combined A and P to obtain the frequency domain feature (Equation (7)):
where Con(·) represents the two consecutive convolutional layers shown in Figure 6.
For the output of the common feature extraction module, two successive convolutional layers are employed to extract the spatial domain feature; if this feature were processed in the frequency domain simply by performing DFT and IDFT transformations, no information exchange would be involved. Following this, the spatial domain feature and the frequency domain feature are fused: Sigmoid activation functions are applied to each of them separately to derive their respective feature map weights, the two weights are cross-multiplied with the feature of the opposite branch, and the results are concatenated to produce the output feature. The detailed process is expressed in Equation (8):
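The following sketch illustrates the SFFIM data flow of Equations (7) and (8) under stated assumptions: 1 × 1 convolutions stand in for the amplitude/phase fusion layers Con(·), an orthonormal FFT is assumed, and the cross-weighted fusion ends with concatenation as described in the text.

```python
import torch
import torch.nn as nn

class SFFIM(nn.Module):
    """Sketch of the spatial-frequency interaction (Equations (7)-(8));
    channel counts and the 1x1 fusion convolutions are assumptions."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.amp_conv = nn.Sequential(nn.Conv2d(2 * channels, channels, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv2d(channels, channels, 1))
        self.pha_conv = nn.Sequential(nn.Conv2d(2 * channels, channels, 1),
                                      nn.ReLU(inplace=True),
                                      nn.Conv2d(channels, channels, 1))
        self.spatial_conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                          nn.ReLU(inplace=True),
                                          nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, f_spe, f_spa, f_com):
        # Frequency-domain interaction of the spectral and spatial branch features.
        spe_fft = torch.fft.fft2(f_spe, norm='ortho')
        spa_fft = torch.fft.fft2(f_spa, norm='ortho')
        amp = self.amp_conv(torch.cat([spe_fft.abs(), spa_fft.abs()], dim=1))      # A
        pha = self.pha_conv(torch.cat([spe_fft.angle(), spa_fft.angle()], dim=1))  # P
        freq = torch.fft.ifft2(torch.polar(amp, pha), norm='ortho').real           # IDFT
        # Spatial-domain feature from the common branch (two successive convs).
        spat = self.spatial_conv(f_com)
        # Cross-weighted fusion: each feature is gated by the other's Sigmoid map.
        w_freq, w_spat = torch.sigmoid(freq), torch.sigmoid(spat)
        return torch.cat([freq * w_spat, spat * w_freq], dim=1)
```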
2.5. Loss Function Design
Most pansharpening networks primarily utilize the ℓ1 loss function during training to evaluate the discrepancy between the fusion output and the reference image. This methodology neglects the supervision of the outputs from the intermediate layers of the network [42]. The unavoidable presence of redundant information across the various branches can adversely impact the quality of the fusion results. To mitigate the redundancy inherent in the different modalities, this study incorporates mutual information constraints, which aim to minimize the mutual information between the different branches (Equation (9)) [28]. This approach reduces redundancy and enhances the feature representation capabilities of the various branches. The mutual information loss function is presented in Equation (9).
Additionally, it was considered that the frequency domain characteristics of the output should closely match those of the ground truth (GT). To address this, a frequency domain loss function was designed to impose constraints on the frequency domain features (Equation (10)):
where the first pair of terms denotes the amplitude and phase of the fusion result and the second pair denotes the amplitude and phase of the reference image, respectively.
In summary, the developed loss function is expressed as follows:
where the weighting coefficients of the individual loss terms are set as indicated by [43,44].
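A minimal sketch of such a frequency-domain term is given below, assuming it is formed from ℓ1 distances between the amplitudes and between the phases of the fused image and the GT; the precise form and weighting are given by Equation (10) and references [43,44].

```python
import torch

def frequency_loss(fused: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Sketch of a frequency-domain loss: L1 distances between the amplitudes
    and between the phases of the fused image and the ground truth."""
    fused_fft = torch.fft.fft2(fused, norm='ortho')
    gt_fft = torch.fft.fft2(gt, norm='ortho')
    amp_term = torch.mean(torch.abs(fused_fft.abs() - gt_fft.abs()))
    pha_term = torch.mean(torch.abs(fused_fft.angle() - gt_fft.angle()))
    return amp_term + pha_term
```

In training, this term would be added to the ℓ1 and mutual information terms with the weighting coefficients indicated by [43,44].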
4. Conclusions
This study introduced a novel three-branch pansharpening network that leverages interactions between the spatial and frequency domains. Each branch is designed to extract specific types of information: spectral features from LRMS images, spatial features from PAN images, and common features shared by both. These extracted features are exchanged across spatial and frequency domains before undergoing feature reconstruction. The proposed design includes a spectral feature extraction module capable of capturing nonlinear relationships between the spectral bands in LRMS images and a spatial feature extraction module that effectively captures texture details and embeds positional information, thereby enhancing the completeness of the spatial features. Additionally, the research proposed a dynamic convolution mechanism that adapts to spectral and spatial features, improving network flexibility. The proposed SFFIM facilitates robust interaction between spatial and frequency domain information. Furthermore, this study developed a loss function aimed at reducing redundant information between branches while ensuring the frequency domain features closely align with the reference image. Comprehensive experiments, including comparative, ablation, and network structure evaluations across three datasets (IKONOS, WV3, and WV4), consistently demonstrate the superior performance of the proposed method.
It is important to acknowledge that, like many CNN-based approaches, the proposed method was trained on simulated datasets and evaluated on both simulated and real datasets. However, there is still room for improvement in the spatial details produced by the proposed method, and we will attempt to use more advanced edge operators to extract detailed textures from the source images. In addition, its performance on real datasets with scale variations remains modest. Future work will focus on jointly training with both simulated and real datasets to improve performance and achieve a balanced effectiveness across diverse datasets.