1. Introduction
Image super-resolution (SR) refers to the task of reconstructing high-resolution (HR) images from their low-resolution (LR) counterparts [1], and it has great significance in image processing, enabling various downstream applications [2]. However, image super-resolution is a well-known ill-posed problem because a single LR image can correspond to multiple HR images. Recent SR studies have addressed this problem by leveraging deep learning networks and achieved remarkable performance improvements over conventional example-based methods [3], even in the absence of prior information [4,5,6,7,8,9,10].
Since the advent of the deep-learning-based SR approach [4], several studies have devised deeper networks using various learning strategies such as residual [6,9,11,12,13,14], recursive [6,7], and adversarial [9,15] learning. Deep-learning-based SR models can be categorized into two groups: convolutional neural network (CNN)-based models and generative adversarial network (GAN)-based models [16]. The first deep-learning-based SR model, SRCNN, is a CNN-based model composed of three convolutional layers, each corresponding to patch extraction, nonlinear mapping, and reconstruction [4]. Following the success of SRCNN, CNN-based models such as VDSR [6] and EDSR [11] have been widely developed to fully leverage the learning capability of deeper networks. Currently, a residual network with a stack of residual blocks is regarded as the basic structure of CNN-based SR models. Zhang et al. [14] proposed RDN, which consists of residual dense blocks with dense local connections to enhance the residual block. In [13], a channel attention mechanism was adopted in a residual structure, demonstrating a significant improvement in SR image quality. However, because most CNN-based models are trained to optimize pixel-wise losses, such as the mean squared error (MSE) loss or L1 loss, they are prone to producing overly smoothed SR outputs and are limited in recovering realistic textures [17].
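To see why pixel-wise objectives favor smoothing, consider a toy illustration (ours, not from any cited work): if one LR observation is equally consistent with two different HR pixel values, the prediction that minimizes the expected MSE is their average rather than either sharp value.

```python
import numpy as np

# Two equally plausible HR pixel values for the same LR observation
# (the one-to-many, ill-posed nature of SR).
hr_candidates = np.array([0.0, 1.0])

def expected_mse(p):
    """Expected MSE of a single prediction p against both plausible HR values."""
    return np.mean((hr_candidates - p) ** 2)

# Grid-search the prediction that minimizes the expected MSE.
grid = np.linspace(0.0, 1.0, 101)
best_p = grid[int(np.argmin([expected_mse(p) for p in grid]))]
print(best_p)  # 0.5 -- the blurry average, not either sharp value
```

The same argument applies pixel-by-pixel to full images, which is why purely pixel-wise training tends to average away high-frequency texture.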
Compared with CNN-based models, GAN-based SR models generate more realistic textures by introducing adversarial training into existing CNN-based models [18]. The basic principle of a GAN is to train two networks (a generator and a discriminator) simultaneously for opposite purposes. The discriminator is trained to distinguish real HR images from SR images, whereas the generator is trained to produce realistic SR images that fool the discriminator. SRGAN [9] and ESRGAN [15] integrated two additional losses (perceptual and adversarial) into the loss function and improved the perceptual quality of the SR results.
A GAN-based SR approach has also been employed for remote sensing image processing to improve the perceptual quality [19]. Jiang et al. [20] complemented a GAN-based model by incorporating a subnetwork for edge enhancement, which refines edge information from satellite image datasets. Furthermore, Rabbi et al. [21] proposed EESRGAN, which trains the SR and object detection networks end-to-end and enhances SR performance by using the detector loss from the subsequent object detection network. Liu et al. [22] proposed SG-GAN, which benefits from a downstream task network by applying a pre-trained saliency detection model to the outputs of the SR network.
In general, deep-learning-based SR methods require LR images and their corresponding HR images as the training dataset. However, owing to the difficulty of obtaining real-world LR-HR datasets, most SR studies have used only HR images and generated LR images by applying degradations to the HR images. The most common method for generating LR images from HR images is downsampling by bicubic interpolation with a predefined scale factor [6,8,9,10,12,15,23]. However, SR models trained on such simple degradation do not reflect the properties of real-world degradation and often suffer deteriorated performance when applied to real-world LR images. Therefore, some researchers have attempted to narrow the gap between simple downsampling and real-world image degradation by applying a blur kernel and noise [4,13,14,24,25,26,27]. Alternatively, several tailored datasets, such as RealSR [28], DRealSR [29], and SR-RAW [30], have been constructed specifically for real-world image super-resolution. These datasets comprise real-world LR-HR image pairs obtained by adjusting the camera’s focal length. Similarly, deep-learning-based SR models for remote sensing images commonly use predefined degradation to generate synthetic LR-HR datasets for training and validation [21,22]. Some recent studies incorporated a degrader [31] or a downsampling generator [32] into the deep-learning architecture so that the model learns both image degradation and super-resolution.
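The degradation pipelines above can be sketched in a few lines of NumPy. This toy model (box averaging stands in for true bicubic resampling, and the blur and noise parameters are illustrative, not taken from any cited dataset) contrasts a clean downsample with a blur-plus-noise degradation:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_lr(hr, scale=4, blur=True, noise_sigma=0.01):
    """Toy degradation: optional 3x3 mean blur, box downsampling, Gaussian noise.
    Box averaging stands in for bicubic interpolation to keep the sketch short."""
    img = hr.astype(np.float64)
    if blur:
        # Simple 3x3 mean blur with edge padding.
        p = np.pad(img, 1, mode="edge")
        img = sum(p[i:i + img.shape[0], j:j + img.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    h, w = img.shape
    img = img[:h - h % scale, :w - w % scale]          # crop to a multiple of scale
    lr = img.reshape(img.shape[0] // scale, scale,
                     img.shape[1] // scale, scale).mean(axis=(1, 3))
    return np.clip(lr + rng.normal(0.0, noise_sigma, lr.shape), 0.0, 1.0)

hr = rng.random((64, 64))
lr = synthesize_lr(hr)                                  # "realistic" LR: blur + noise
lr_clean = synthesize_lr(hr, blur=False, noise_sigma=0.0)  # clean, bicubic-like LR
print(lr.shape)  # (16, 16)
```

Models trained only on the clean variant see none of the blur or noise present in real sensor imagery, which is the train/test mismatch discussed above.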
For HR satellite images, image datasets are usually provided as pairs of panchromatic (PAN) and multispectral (MS) images. These paired images thus provide a favorable opportunity for constructing real-world LR-HR image datasets. In this study, to train and validate the proposed model on real-world LR-HR image datasets, we performed pansharpening [33] on paired PAN and MS images from WorldView-3 (WV3) to generate real-world LR-HR remote sensing image datasets. The pansharpened and original MS images were then used as the HR and LR images, respectively. The scale factor was set to 4, based on the scale ratio of the PAN and MS images. All experimental results in this study were obtained from SR models trained on these real-world LR-HR image datasets. A detailed description of the datasets is provided in Section 3.1.
Figure 1 demonstrates the difference between real-world and synthetic LR images. The ground objects are clearly discernible in the bicubic-downsampled LR images (Figure 1a), whereas the object boundaries are blurred in the real-world LR images (Figure 1b). Therefore, SR models trained on synthetic LR images from bicubic downsampling often fail to achieve satisfactory SR performance on real-world LR images. Furthermore, we observed that the SR models demonstrated better SR performance when trained on synthetic LR-HR image datasets than when trained on real-world LR-HR image datasets (see Appendix A).
Based on these observations, we infer that refining the input LR image is as crucial for enhancing SR performance as designing a complex SR network architecture. Thus, this study proposes a bicubic-downsampled LR image-guided generative adversarial network (BLG-GAN) for the super-resolution of remote sensing images. BLG-GAN performs super-resolution on real-world LR images under the guidance of clean synthetic LR images obtained through a simple bicubic operation. By dividing the SR problem into subproblems handled by separate networks, the learning objective of each network becomes clearer. As a result, the training of BLG-GAN is more stable than that of deep networks trained to learn a direct mapping between real-world LR and HR images.
To the best of our knowledge, this is the first study to introduce a training strategy that uses a synthetic LR image from bicubic downsampling to guide the supervised image super-resolution of remote sensing images. Moreover, we investigated the effectiveness of our method by comparing it with state-of-the-art methods and thoroughly analyzed the influence of its components on SR performance.
The remainder of this study is organized as follows. Section 2 presents the architecture of the proposed BLG-GAN model. Section 3 presents the experimental results on the WV3 datasets. Section 4 validates the effectiveness of the proposed method through ablation studies on the network architecture and loss types. Finally, Section 5 presents the conclusions of this study.
2. Methodology
The proposed model aims to learn a mapping from the real-world LR image domain X to the HR image domain Y by training on the given samples x ∈ X and y ∈ Y under the guidance of bicubic-downsampled LR images. While the real-world LR image x is obtained from the MS bands of WV3, a synthetic LR image is generated from the HR image y by bicubic downsampling and denoted as z. Inspired by [34,35,36], we regard z as a “clean LR image,” which contains less corruption, such as blur and noise. Thus, we use these bicubic-downsampled LR images as a bridge between the real-world LR images and the corresponding HR images to restore clear details from clean LR images. Applying image transfer to the input LR image beforehand reduces the corruption within the real-world LR image and affects the quality of the output images from the subsequent SR process.
As shown in Figure 2, the proposed BLG-GAN model consists of two stages: LR image transfer and super-resolution. In the LR image transfer stage, the real-world LR images are processed through the generator G_LR to produce LR images whose characteristics and distribution resemble those of the synthetic LR images, referred to as “bicubic-like LR images.” The output of the LR image transfer stage is then fed into the generator with upsampling blocks (G_SR) for super-resolution. Both stages include a generator and a discriminator to adopt adversarial training for generating bicubic-like LR and SR images. Each generator is trained to fool its corresponding discriminator and produce bicubic-like LR or SR images, whereas each discriminator is trained to distinguish whether a generated image is real or fake. The following subsections provide detailed explanations of each stage.
2.1. LR Image Transfer
In LR image transfer, the generator G_LR learns the mapping from the real-world LR domain X to the bicubic-like LR domain Z, as illustrated in Stage 1 of Figure 2. For a given input LR image x, G_LR generates a bicubic-like LR image ẑ, which looks similar to the synthetic LR image z. This LR image transfer process can be formulated as:

ẑ = G_LR(x).
Using adversarial training, G_LR is trained to fool the corresponding discriminator D_LR for the generated bicubic-like LR image ẑ. Meanwhile, D_LR is trained to discern the generated LR image as fake and the synthetic LR image z as real.
The generator loss for LR image transfer consists of two losses: the pixel-wise loss L_pix^LR and the adversarial loss L_adv^LR. The pixel-wise loss calculates the L1 distance between the synthetic LR image z and the generated bicubic-like LR image G_LR(x). We chose LSGAN [37] for the adversarial loss, which uses a least-squares loss instead of the negative log-likelihood loss. LSGAN is known to stabilize the learning process while achieving higher SR performance than the standard GAN [38]. The two losses are formulated as:

L_pix^LR = (1/N) Σ_{i=1}^{N} || z_i − G_LR(x_i) ||_1,
L_adv^LR = (1/N) Σ_{i=1}^{N} ( D_LR(G_LR(x_i)) − 1 )^2,
where N denotes the number of training samples. The discriminator loss for D_LR can be formulated as:

L_D^LR = (1/N) Σ_{i=1}^{N} [ ( D_LR(z_i) − 1 )^2 + D_LR(G_LR(x_i))^2 ].
Finally, the total loss for generator G_LR can be expressed as the weighted sum of the pixel-wise loss (L_pix^LR) and the adversarial loss (L_adv^LR):

L_G^LR = L_pix^LR + λ_LR · L_adv^LR,

where λ_LR is the weight of the adversarial loss for LR images.
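The LSGAN objectives for this stage reduce to simple least-squares penalties on the discriminator's scores. A minimal NumPy sketch with toy score values (illustrative numbers, not actual training outputs):

```python
import numpy as np

def lsgan_generator_loss(d_fake):
    """Least-squares GAN loss for the generator: push D's scores on fakes toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

def lsgan_discriminator_loss(d_real, d_fake):
    """Push D's scores toward 1 on real (bicubic) LR images and 0 on generated ones."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

# Toy discriminator outputs for a small batch of patches.
d_real = np.array([0.90, 0.80, 0.95])   # scores on synthetic (bicubic) LR images
d_fake = np.array([0.10, 0.20, 0.05])   # scores on generated bicubic-like LR images
print(lsgan_generator_loss(d_fake))              # large: generator not fooling D yet
print(lsgan_discriminator_loss(d_real, d_fake))  # small: D is separating well
```

Because the penalty is quadratic rather than log-based, gradients stay informative even for samples the discriminator classifies confidently, which is the stabilization property cited above.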
2.2. Super-Resolution
Using the LR image generated by the prior LR image transfer as input, the generator for super-resolution (G_SR) learns the mapping from the bicubic-like LR domain Z to the HR domain Y. As shown in Stage 2 of Figure 2, the output of G_LR, which is a bicubic-like LR image ẑ, is fed into the SR network G_SR to produce an SR image ŷ. In the training phase, the discriminator D_SR interacts with G_SR and helps the network generate an SR image similar to the corresponding HR image y. The super-resolution process can be formulated as follows:

ŷ = G_SR(ẑ) = G_SR(G_LR(x)).
We denote the consecutive processes of G_LR and G_SR as G, i.e., G(x) = G_SR(G_LR(x)). As with the LR image transfer, G_SR is trained to fool the corresponding discriminator D_SR for the generated SR image ŷ, whereas D_SR is trained to distinguish the generated SR image as fake and the ground-truth HR image y as real.
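The two-stage pipeline is a plain function composition: an LR-to-LR transfer followed by an upsampling SR network. A shape-level sketch with hypothetical stand-ins (identity transfer and nearest-neighbor ×4 upsampling, not the actual generators):

```python
import numpy as np

def g_lr(x):
    """Stand-in for the LR transfer generator: same spatial size in and out."""
    return x  # a real transfer network would denoise/deblur toward a bicubic-like image

def g_sr(z, scale=4):
    """Stand-in for the SR generator: nearest-neighbor x4 upsampling."""
    return z.repeat(scale, axis=0).repeat(scale, axis=1)

def g(x):
    """Full pipeline: SR applied to the transferred LR image."""
    return g_sr(g_lr(x))

x = np.zeros((32, 32))       # real-world LR input
y_hat = g(x)                 # SR output
print(x.shape, y_hat.shape)  # (32, 32) (128, 128)
```

Note that the first stage preserves spatial resolution while only the second stage changes scale, which matches the scale-factor-4 setting of the datasets.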
The generator loss function for super-resolution consists of three losses: the pixel-wise loss L_pix^SR, the perceptual loss L_per, and the adversarial loss L_adv^SR. As in the LR image transfer, we chose the L1 norm for the pixel-wise loss and LSGAN for the adversarial loss. The pixel-wise and adversarial losses for HR images are formulated as:

L_pix^SR = (1/N) Σ_{i=1}^{N} || y_i − G_SR(G_LR(x_i)) ||_1,
L_adv^SR = (1/N) Σ_{i=1}^{N} ( D_SR(G_SR(G_LR(x_i))) − 1 )^2.
The discriminator loss for D_SR can then be formulated as:

L_D^SR = (1/N) Σ_{i=1}^{N} [ ( D_SR(y_i) − 1 )^2 + D_SR(G_SR(G_LR(x_i)))^2 ].
Additionally, we added the perceptual loss L_per between y and ŷ. For the perceptual loss, we adopted the learned perceptual image patch similarity (LPIPS) [39], which measures the perceptual similarity of images with multi-layer features. Recent SR studies have verified the usefulness of LPIPS as a perceptual loss by achieving high ranks in challenges on SR tasks [40,41]. In Section 4.3, we also compare the LPIPS-based perceptual loss with the commonly used VGG-based perceptual loss.
The total loss for generator G_SR can be expressed as the weighted sum of L_pix^SR, L_adv^SR, and L_per:

L_G^SR = L_pix^SR + λ_SR · L_adv^SR + η · L_per,

where λ_SR and η are the weights of the adversarial and perceptual losses for HR images, respectively.
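The weighted combination can be sketched as follows; the component loss values and the weights here are purely hypothetical placeholders, not the paper's settings:

```python
# Hypothetical component losses for one batch (illustrative values only).
l_pix, l_adv, l_per = 0.05, 0.30, 0.12

# Hypothetical weights for the adversarial and perceptual terms.
lambda_adv, eta_per = 0.1, 1.0

# Total generator loss: pixel term plus weighted adversarial and perceptual terms.
total = l_pix + lambda_adv * l_adv + eta_per * l_per
print(round(total, 4))  # 0.2
```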
2.3. Network Architecture
The proposed SR network consists of two generators and two discriminators, with one generator and one discriminator each for LR image transfer and super-resolution. In this section, the architecture of each network component is described.
2.3.1. Generator
For the two generators (G_LR and G_SR) in the proposed model, we adopted the network architecture of the residual channel attention network (RCAN) [13] (Figure 3), considering its superior SR performance even without a discriminator. RCAN is based on a residual-in-residual (RIR) architecture with several residual groups and long skip connections. Each residual group comprises multiple residual channel attention blocks (RCABs). As shown in Figure 3b, the RCAB integrates channel attention into the residual block to extract channel-wise features, achieving considerable enhancement in the image quality of the SR outputs. The effectiveness of the RCAN-based generator is further investigated in Section 4.1 through a comparison with other generator architectures.
Although the basic architecture of the two generators is almost identical, we adjusted the network capacity by setting the numbers of residual groups and of RCABs per residual group to (5, 10) for G_LR and (5, 20) for G_SR. Even though our total generative network G is smaller than the original RCAN model, which has 10 residual groups with 20 RCABs each, BLG-GAN achieves superior SR performance by dividing the SR problem into subproblems. In addition, G_LR does not include upsampling blocks because the input and output scales remain the same in the LR image transfer.
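A quick sanity check of this capacity comparison: counting RCABs per configuration confirms that the two BLG-GAN generators together contain fewer blocks than the original RCAN (block counts only; parameter counts would also depend on feature widths, which are not compared here):

```python
# (residual groups, RCABs per group) for each generator.
transfer_cfg = (5, 10)   # LR image transfer generator
sr_cfg = (5, 20)         # super-resolution generator
rcan_cfg = (10, 20)      # original RCAN

blg_gan_rcabs = transfer_cfg[0] * transfer_cfg[1] + sr_cfg[0] * sr_cfg[1]
rcan_rcabs = rcan_cfg[0] * rcan_cfg[1]
print(blg_gan_rcabs, rcan_rcabs)  # 150 200
```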
2.3.2. Discriminator
The discriminators D_LR and D_SR share the same structure, based on the patchGAN architecture [42] (Figure 4). The patchGAN consists of four convolutional layers, with the number of features increasing from 64 to 512 by a factor of 2, followed by a final convolutional layer. The output features represent a patch-based decision on whether each image region is real or fake. To discriminate between the generated SR and HR images, we used a 70 × 70 patchGAN for D_SR. For D_LR, the discriminator for the generated bicubic-like LR images, we changed the stride of the first three convolutional layers of D_SR from two to one [34], because the LR images are smaller than 70 × 70 pixels. As a result, the receptive field of D_LR is reduced to 16 × 16 pixels for LR images.
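The stated receptive fields can be verified with the standard receptive-field recurrence, assuming the usual 70 × 70 patchGAN layout of 4 × 4 kernels with strides 2, 2, 2, 1 plus a final stride-1 convolution (our assumption; the text does not list the exact strides):

```python
def receptive_field(layers):
    """Receptive field of stacked convolutions, iterating from the last layer back:
    r_in = (r_out - 1) * stride + kernel."""
    r = 1
    for kernel, stride in reversed(layers):
        r = (r - 1) * stride + kernel
    return r

# Assumed 70x70 patchGAN: four 4x4 convs (strides 2, 2, 2, 1) + a final 4x4 stride-1 conv.
d_for_hr = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
# LR-image discriminator: first three strides changed from 2 to 1, so all strides are 1.
d_for_lr = [(4, 1), (4, 1), (4, 1), (4, 1), (4, 1)]
print(receptive_field(d_for_hr), receptive_field(d_for_lr))  # 70 16
```

Under this assumed layout, the recurrence reproduces both the 70 × 70 receptive field of the original discriminator and the 16 × 16 receptive field after the stride modification.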