1. Introduction
With the vigorous development of remote sensing technology, high-resolution (HR) remote sensing images play an important role in many fields, such as object detection [1,2,5], urban planning [3] and semantic labeling [4]. However, most publicly accessible remote sensing datasets cannot provide long-term coverage and high spatial resolution at the same time. For example, the earliest Sentinel-2 data date back only about seven years. Remote sensing datasets with a time coverage of more than 20 years usually cannot maintain a high spatial resolution. To avoid the huge cost of directly improving satellite imaging equipment, image super-resolution (SR) techniques have been proposed to improve the quality of low-resolution (LR) images. Early SR methods based on interpolation [6,7] produce poor reconstructions. In recent years, research has therefore turned to deep learning [8]. However, single image super-resolution (SISR) methods based on convolutional neural networks (CNNs) [9,10,11,12] cannot accurately reconstruct HR textures that have been severely damaged by degradation, so their reconstructions are often blurry. Although GAN-based SR methods [13] account for human subjective visual perception and effectively alleviate this problem, the artifacts they introduce have become another thorny issue.
Considering that the details lost in an LR image can be compensated by the rich information in a similar HR reference (Ref) image, reference-based super-resolution (RefSR) methods [14,15] have emerged. Not only do they avoid the ill-posed nature of the SISR problem, but the reconstructed textures are also more realistic thanks to the rich detail of the Ref image. Image alignment and patch matching are the two mainstream ideas in recent RefSR methods. In remote sensing SR tasks, HR Ref images and LR images can easily be located at the same geographical position through longitude and latitude matching. This ensures that the contents of the Ref and LR images share a certain similarity, which further supports the suitability of RefSR in the remote sensing field. However, due to different shooting viewpoints and geographical coordinate deviations, the alignment between the Ref image and the LR image is still far from ideal. Therefore, we adopt a RefSR method based on patch matching. Because SRNTT [14] searches globally for the most similar LR-Ref patch pairs, it can handle long-range dependencies and remains robust even when the Ref and LR images are severely misaligned.
Although the above-mentioned methods perform well, their results can be further improved. Unlike natural images, remote sensing images contain vast and complex spatial information. Therefore, for most remote sensing SR methods, improving the representation ability of the network yields a higher level of abstraction and better data representations, which is crucial for the final reconstruction of the LR remote sensing image. To improve model performance, previous methods usually redesign the model structure, for example by deepening the network [16], widening it [17] or increasing its cardinality [18]. In contrast, we achieve this goal through lightweight mechanisms (such as attention and dense connections) that do not require heavy network engineering. We therefore propose a residual dense hybrid attention network (R-DHAN), which integrates feature information from different levels, suppresses unimportant channels and spatial regions and improves the effective utilization of features. The major contributions are as follows:
(1) We propose an end-to-end SR algorithm for remote sensing satellite images, called the residual dense hybrid attention network (R-DHAN), which outperforms most classical algorithms in both quantitative and qualitative evaluations.
(2) A spatial attention (SA) module and a channel attention (CA) module are added to the network. This gives the network more flexible discriminative power over different local regions and allows it to re-weigh the importance of different channels, which benefits the final reconstruction.
(3) Based on several lightweight mechanisms, we propose a new residual block named DHAB, which mainly comprises a local feature fusion (LFF) module and a convolutional block attention module (CBAM). The LFF module makes full use of the current intra-block features and the original input features, while CBAM exploits the interdependencies across channels and spatial dimensions to re-weight features according to their importance. Both improve the representation ability of the network.
In the rest of this article, we briefly review related work in Section 2. The details of our proposed method are introduced in Section 3. The experimental setup and final results are provided in Section 4 and our work is summarized in Section 5.
3. Method
Given the good performance of SRNTT when the LR and Ref images are misaligned to a certain extent, we use SRNTT as the backbone of our method. However, we substantially redesign its texture transfer structure in two respects. Firstly, a hybrid channel-spatial attention mechanism (Figure 1) is added to the original network; this is discussed in Section 3.1. Secondly, we replace the original RB with our proposed DHAB (Figure 2) to further improve network performance; this is discussed in Section 3.2.
As shown in Figure 1a, we retain the feature swapping part of SRNTT. First, we apply bicubic up-sampling on $I^{LR}$ to obtain the enlarged image $I^{LR\uparrow}$, which has the same spatial size as the target HR image. In order to obtain a Ref image in the same frequency band as $I^{LR\uparrow}$, we apply bicubic down-sampling and up-sampling on $I^{Ref}$ with the same scale and get $I^{Ref\downarrow\uparrow}$, whose blur degree is similar to that of $I^{LR\uparrow}$. As for the $I^{LR\uparrow}$ and $I^{Ref\downarrow\uparrow}$ patches, as shown in Figure 1b, we use the inner product in the neural feature space $\phi(\cdot)$ to measure the similarity between neural features:

$$s_{i,j} = \left\langle P_i\left(\phi\left(I^{LR\uparrow}\right)\right), \frac{P_j\left(\phi\left(I^{Ref\downarrow\uparrow}\right)\right)}{\left\lVert P_j\left(\phi\left(I^{Ref\downarrow\uparrow}\right)\right)\right\rVert} \right\rangle$$
where $P_i(\cdot)$ denotes sampling the $i$-th patch from the neural feature map and $s_{i,j}$ is the similarity between the $i$-th $I^{LR\uparrow}$ patch and the $j$-th $I^{Ref\downarrow\uparrow}$ patch. The similarity computation can be efficiently implemented as a set of convolution operations over all $I^{LR\uparrow}$ patches, with each kernel corresponding to one normalized $I^{Ref\downarrow\uparrow}$ patch; this yields a similarity map $S_j$ for the $j$-th $I^{Ref\downarrow\uparrow}$ patch. The index of the $I^{Ref\downarrow\uparrow}$ patch with the highest similarity score for each $I^{LR\uparrow}$ patch is denoted as $j^{*}$. Each patch in the swapped feature map $M$ centered at $(x, y)$ is defined as

$$P_{\omega(x,y)}(M) = P_{j^{*}}\left(\phi\left(I^{Ref}\right)\right), \quad j^{*} = \arg\max_{j} S_j(x, y),$$

where $\omega(\cdot)$ maps a patch center to a patch index. Here, we replace the $I^{Ref\downarrow\uparrow}$ patch with the $I^{Ref}$ patch at the same position to preserve the reference information of the original HR image. All the reference patches together constitute the swapped feature map $M$ at this scale.
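To make the patch matching concrete, the following is a minimal PyTorch sketch of this feature swapping step, assuming pre-extracted feature maps and 3×3 patches sampled with stride 1; the function and variable names are illustrative and are not those of the released SRNTT code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def swap_features(feat_lr_up, feat_ref_blur, feat_ref, patch=3):
    """Build the swapped feature map M (illustrative sketch).

    feat_lr_up    : phi(I^{LR, up}),        shape (1, C, H, W)
    feat_ref_blur : phi(I^{Ref, down-up}),  shape (1, C, Hr, Wr)
    feat_ref      : phi(I^{Ref}),           shape (1, C, Hr, Wr)
    """
    C, (H, W) = feat_lr_up.shape[1], feat_lr_up.shape[-2:]
    # Each Ref(down-up) patch becomes one convolution kernel; normalizing the
    # kernels makes conv2d compute the inner-product similarity s_{i,j}.
    kernels = F.unfold(feat_ref_blur, patch)[0].t().reshape(-1, C, patch, patch)
    norms = kernels.flatten(1).norm(dim=1).clamp_min(1e-6)
    sim = F.conv2d(F.pad(feat_lr_up, [patch // 2] * 4),
                   kernels / norms.view(-1, 1, 1, 1))       # (1, N, H, W)
    jstar = sim.argmax(dim=1).flatten()                     # j* for each center

    # Swap in the *original* HR Ref patches at the matched indices, then
    # average the overlapping patches with fold.
    hr_patches = F.unfold(feat_ref, patch)[0]               # (C*p*p, N)
    chosen = hr_patches[:, jstar].unsqueeze(0)              # (1, C*p*p, H*W)
    out_hw = (H + patch - 1, W + patch - 1)
    M = F.fold(chosen, out_hw, patch) / F.fold(torch.ones_like(chosen),
                                               out_hw, patch)
    off = patch // 2
    return M[..., off:off + H, off:off + W]
```

Because the patches are sampled densely (stride 1), the matched patches overlap; the final fold-and-divide step averages the overlapping contributions when assembling $M$.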
The texture transfer network of SRNTT is shown in Figure 1c. The base generative network takes the RBs as its main body and uses skip connections [16,43]. The network output $\psi_k$ of layer $k$ can be expressed as:

$$\psi_k = \left[ \mathrm{Res}\left( \psi_{k-1} \parallel M_{k-1} \right) + \psi_{k-1} \right] \uparrow_{2\times}$$

where $\psi_{k-1}$ denotes the output of layer $k-1$, $M_{k-1}$ denotes the swapped feature map at the corresponding scale and $\mathrm{Res}(\cdot)$ denotes the RBs. The channel concatenation symbol is represented by $\parallel$ and the upscaling sub-pixel convolution [44] with $2\times$ scale is represented by $\uparrow_{2\times}$. The final reconstructed SR image is expressed as:

$$I^{SR} = \mathrm{Conv}\left(\psi_K\right)$$

where $K$ is the index of the last layer.
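As a reading aid, here is a minimal PyTorch sketch of one layer of this recursion under our notation; the class names, block count and channel width are illustrative assumptions rather than the released SRNTT configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block: conv-ReLU-conv with an identity skip."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class TextureTransferLayer(nn.Module):
    """One scale: psi_k = [Res(psi_{k-1} || M_{k-1}) + psi_{k-1}] up-2x."""
    def __init__(self, channels=64, n_blocks=16):
        super().__init__()
        self.merge = nn.Conv2d(2 * channels, channels, 3, padding=1)  # after ||
        self.res = nn.Sequential(*[ResBlock(channels) for _ in range(n_blocks)])
        # Sub-pixel convolution: expand channels 4x, PixelShuffle(2) -> 2x scale.
        self.up = nn.Sequential(nn.Conv2d(channels, 4 * channels, 3, padding=1),
                                nn.PixelShuffle(2))

    def forward(self, psi_prev, m_prev):
        x = self.res(self.merge(torch.cat([psi_prev, m_prev], dim=1)))
        return self.up(x + psi_prev)  # outer skip, then sub-pixel upscaling
```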
Our texture transfer network is shown in Figure 1d. Firstly, the RBs in SRNTT are replaced with DHABs; more details about DHAB are given in Section 3.2. In addition, after the features of $\psi_{k-1}$ and $M_{k-1}$ are extracted by DHAB, a weighted feature map is generated by the hybrid channel-spatial attention mechanism and merged with the target content through a skip connection.
3.1. Channel-Spatial Attention Mechanism
Bottleneck attention modules (BAMs) and CBAMs are two representative channel-spatial attention mechanisms. Although both involve an SA module and a CA module, they differ in how these two modules are arranged: BAM places them in parallel, while CBAM places them in series. Related ablation experiments show that connecting CA and SA in sequence brings the best performance gain. Therefore, we borrow the idea of CBAM here: after DHAB extracts the relevant features, we apply channel attention and spatial attention in sequence, so that the extracted features are guided to focus on the more important and useful regions and content.
In the CBAM module, the feature output $F'$ after the CA module is expressed as:

$$F' = M_c(F) \otimes F$$

The feature output $F''$ after the SA module is expressed as:

$$F'' = M_s(F') \otimes F'$$

where $F \in \mathbb{R}^{h \times w \times c}$ denotes the input feature maps of the CBAM, $M_c(\cdot)$ denotes the CA module and $M_s(\cdot)$ denotes the SA module. The height, width and number of channels of the feature map are represented by $h$, $w$ and $c$, respectively, and $\otimes$ denotes element-wise multiplication.
The CA module is computed as:

$$M_c(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)\right)$$

where $\mathrm{AvgPool}(\cdot)$ denotes the average-pooling operation in the CA module, $\mathrm{MaxPool}(\cdot)$ denotes the max-pooling operation in the CA module and $\mathrm{MLP}(\cdot)$ denotes a multilayer perceptron network. The weights of the MLP are denoted by $W_0 \in \mathbb{R}^{c/t \times c}$ and $W_1 \in \mathbb{R}^{c \times c/t}$, where $t$ denotes the reduction ratio of the number of channels and $\sigma$ denotes the sigmoid activation function.
The SA module is computed as:

$$M_s(F') = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')\right]\right)\right)$$

where $\mathrm{AvgPool}(\cdot)$ denotes the operation of taking the mean value of the feature maps along the channel axis and $\mathrm{MaxPool}(\cdot)$ denotes the operation of taking the maximum value; their outputs are two SA maps. $[\cdot\,;\cdot]$ denotes the concatenation of the SA maps, $f^{7 \times 7}$ denotes a convolution with a filter of size $7 \times 7$ and $\sigma$ denotes the sigmoid activation function.
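Putting the two equations together, the following is a compact PyTorch sketch of this channel-then-spatial attention, assuming a reduction ratio $t$ of 16 and a $7 \times 7$ spatial kernel; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of channel-then-spatial attention in the CBAM style."""
    def __init__(self, channels, t=16, kernel=7):
        super().__init__()
        # Shared MLP for channel attention, W1(W0(.)), reduction ratio t.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // t, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // t, channels, 1, bias=False))
        # 7x7 convolution over the two channel-pooled spatial maps.
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2, bias=False)

    def forward(self, f):
        # M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))); F' = M_c(F) * F
        avg = f.mean(dim=(2, 3), keepdim=True)
        mx = f.amax(dim=(2, 3), keepdim=True)
        f = f * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        # M_s(F') = sigmoid(f7x7([mean(F'); max(F')])); F'' = M_s(F') * F'
        s = torch.cat([f.mean(dim=1, keepdim=True),
                       f.amax(dim=1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.spatial(s))
```

Note that the same MLP is applied to both the average- and max-pooled descriptors before the sigmoid, matching $M_c$ above.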
3.2. Dense Hybrid Attention Block
The improvements made by our DHAB compared with the original RB are shown in Figure 2 and are mainly reflected in the LFF module and CBAM.
Since the BN layer has no substantial positive impact on the super-resolution task, we remove it from the original network to lighten the model and free up GPU memory. In addition, extracting and aggregating features allows the network to make maximal use of them, which enhances its representation ability and further improves the final SR reconstruction. Thus, we add the LFF module to DHAB. The LFF module adaptively fuses the layers within the current DHAB with the output of the preceding DHAB. However, because the features of the $(d-1)$-th DHAB are directly introduced into the $d$-th DHAB, the number of features becomes too large, so we introduce a $1 \times 1$ convolution layer that performs dimensionality reduction to keep the output dimension constant. The above operations are expressed as:

$$F_d = H_d^{1 \times 1}\left(\left[F_{d-1},\, F_{d,1},\, F_{d,2}\right]\right)$$
where $H_d^{1 \times 1}(\cdot)$ denotes the function of the $1 \times 1$ convolution layer in the $d$-th DHAB, $F_{d-1}$ denotes the output of the $(d-1)$-th DHAB, $F_{d,1}$ denotes the feature maps produced by the activation function and $F_{d,2}$ denotes the feature maps generated by the second convolution layer in the $d$-th DHAB. The symbol $[\cdot\,,\cdot]$ denotes the concatenation of the feature maps. After the LFF module, we further apply CBAM to distinguish the importance of different contents and regions.
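Combining the LFF fusion above with the CBAM of Section 3.1, a minimal PyTorch sketch of one DHAB could look as follows. The conv-ReLU-conv layout without BN and the $1 \times 1$ fusion follow the description above, while the exact channel width and the placement of the outer residual skip are our illustrative assumptions; the `CBAM` class is the sketch given in Section 3.1.

```python
import torch
import torch.nn as nn

class DHAB(nn.Module):
    """Sketch of one Dense Hybrid Attention Block (assumes CBAM from Sec. 3.1).

    conv -> ReLU -> conv (no BN), local feature fusion of
    [F_{d-1}, F_{d,1}, F_{d,2}] via a 1x1 convolution, CBAM re-weighting,
    then a residual skip from the block input.
    """
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # LFF: 1x1 conv restores the channel count after dense concatenation.
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        self.cbam = CBAM(channels)

    def forward(self, f_prev):
        f1 = self.relu(self.conv1(f_prev))  # F_{d,1}: after the activation
        f2 = self.conv2(f1)                 # F_{d,2}: after the second conv
        fused = self.fuse(torch.cat([f_prev, f1, f2], dim=1))  # LFF
        return f_prev + self.cbam(fused)    # attention, then residual skip
```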