1. Introduction
The increasing maturity of remote sensing technology enables researchers to obtain rich surface observation images that serve a wide range of applications. However, owing to limitations in the imaging hardware, technology, and observation environment [1], obtaining high-resolution (HR) remote sensing images is difficult. The quality of low-resolution (LR) remote sensing images is insufficient for applications such as map updating [2], semantic segmentation [3], and target detection [4]. Therefore, remote sensing image super-resolution (SR) reconstruction has become a vital processing technology for improving the clarity and reliability of remote sensing images at low cost. The local unmixing ability of SR reconstruction can make the observed objects in remote sensing images more prominent, providing more reliable data for subsequent remote sensing image processing. For example, researchers can use SR reconstruction to extract buildings more accurately and to detect changes in target boundaries in remote sensing images [5]. SR technology has also been used to achieve the precise labeling of hyperspectral images [6] and to address fusion-related issues [7]. Realizing remote sensing image SR reconstruction from an algorithmic perspective has therefore become an important research topic in image processing and computer vision [8].
Image SR reconstruction is a technique that uses LR images to reconstruct HR images with richer information. Image SR reconstruction methods fall into three types: interpolation-based, reconstruction-based, and learning-based. Interpolation-based methods rely mainly on mathematical interpolation: using the values of known points and their spatial relationships, an interpolation algorithm infers the values of unknown points, thereby increasing image resolution. These methods mainly include nearest-neighbor interpolation [9], bicubic interpolation [10], and bilinear interpolation [11], among others. Although they are computationally simple and efficient, they often struggle to capture high-frequency image detail, resulting in unsatisfactory performance on complex textures and edges. Reconstruction-based methods are usually grounded in signal reconstruction theory. The downsampling process from HR to LR images serves as prior information for an observation model, regularization methods impose prior constraints on the HR image, and the task is transformed into a constrained cost-function optimization problem to achieve SR reconstruction. Reconstruction-based methods mainly include iterative back projection [12], maximum a posteriori estimation [13], and projection onto convex sets [14]. Although reconstruction-based methods achieve good results, when the amplification factor is large, the difficulty of the problem increases sharply, and, owing to the limited prior information, some texture details are difficult to recover. Learning-based methods achieve SR reconstruction by learning the mapping between HR and LR images in feature space. They can be divided into three categories: neighborhood embedding methods [15], sparse representation methods [16], and deep learning methods [17]. In recent years, the mature application of artificial intelligence in many fields has brought new solutions to the challenges of image SR reconstruction, and methods built by researchers on deep learning frameworks and neural network ideas have achieved excellent reconstruction quality [18]. Deep learning-based SR reconstruction uses neural networks to map the LR feature space to the HR feature space, automatically learning this mapping function from large-scale training data to convert LR images into HR images effectively. Deep learning methods typically involve two main branches: SR reconstruction based on convolutional neural networks (CNNs) and SR reconstruction based on generative adversarial networks (GANs).
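The interpolation-based upsampling described above can be sketched in plain NumPy. The two functions below are illustrative toy implementations (edge clamping only, no anti-aliasing), not the exact algorithms of [9,10,11]:

```python
import numpy as np

def nearest_neighbor_upscale(img, scale):
    """Nearest-neighbor interpolation: each output pixel copies its nearest source pixel."""
    h, w = img.shape
    rows = np.arange(h * scale) // scale
    cols = np.arange(w * scale) // scale
    return img[rows][:, cols]

def bilinear_upscale(img, scale):
    """Bilinear interpolation: each output pixel is a weighted average of its
    four nearest source pixels."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, h * scale)
    xs = np.linspace(0, w - 1, w * scale)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]       # fractional row offsets
    wx = (xs - x0)[None, :]       # fractional column offsets
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

lr = np.array([[0.0, 1.0], [2.0, 3.0]])
sr_nn = nearest_neighbor_upscale(lr, 2)   # 4 x 4, blocky
sr_bl = bilinear_upscale(lr, 2)           # 4 x 4, smoother gradients
```

The blocky nearest-neighbor output versus the smoothed bilinear output illustrates why these methods cannot recover high-frequency detail: neither introduces any information beyond the known points.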
The CNN-based method uses CNN structures, such as convolution and pooling layers, to extract features from LR images and gradually map them to HR images [19]. In 2014, Dong et al. presented a simple SR reconstruction network using a CNN (SRCNN) [20], becoming the first to introduce a CNN into the field of image SR. This network consisted of only three convolution layers, each with a different kernel size and number of filters, enhancing the network's feature extraction ability. In addition, nonlinear mapping functions were added after each convolution layer, further enhancing the network's nonlinear expressive ability. SRCNN generates images with better visual perception than traditional image SR reconstruction methods. However, because the network has few layers and large convolution kernels, it struggles to extract deep image features, and the reconstructed image loses some details. In 2016, Dong et al. presented a fast SR reconstruction method based on a CNN (FSRCNN) [21]. FSRCNN adopts smaller convolution kernels to simplify the network structure and uses small filters to reduce computational complexity; it converts LR images to HR images through convolution and deconvolution layers [22]. FSRCNN achieved higher evaluation scores and better image reconstruction results. In the same year, Kim et al. increased the model depth, enabling the network to extract hierarchical feature information better, and presented the deeply recursive convolutional network (DRCN) [23]. This method uses skip connections and recursive supervision to enhance training stability and improve the quality of the reconstructed images. In 2017, Lim et al. presented the enhanced deep residual network (EDSR) [24] for SR reconstruction based on CNN and residual network (ResNet) ideas [25], applying ResNet to the field of image SR reconstruction. In EDSR, the researchers removed the batch normalization (BN) layers from ResNet, freeing up computational resources and allowing more network layers under the same resource budget, thereby improving the feature extraction ability of each layer. CNN-based methods have made significant progress in image SR reconstruction tasks, but they also have shortcomings. In particular, the commonly used mean square error (MSE) loss function tends to make the model generate images with a high peak signal-to-noise ratio [20] but lacking high-frequency features, making it difficult to produce realistic texture details.
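The over-smoothing tendency of the MSE loss mentioned above can be illustrated with a small NumPy experiment: when two sharp textures are equally plausible reconstructions for the same LR input, their pixel-wise average (a blurry image) attains a lower expected MSE than committing to either sharp candidate:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two equally plausible "ground-truth" textures for the same LR input.
texture_a = rng.random((8, 8))
texture_b = rng.random((8, 8))

def mse(pred, target):
    return np.mean((pred - target) ** 2)

# The pixel-wise average minimizes the expected MSE over the two targets...
blurry = (texture_a + texture_b) / 2
expected_mse_blurry = (mse(blurry, texture_a) + mse(blurry, texture_b)) / 2
# ...and beats committing to either sharp texture.
expected_mse_sharp = (mse(texture_a, texture_a) + mse(texture_a, texture_b)) / 2
assert expected_mse_blurry < expected_mse_sharp
```

This is why MSE-trained SR networks drift toward blurry outputs, and why adversarial and perceptual losses are introduced in the GAN-based methods discussed next.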
The GAN-based method introduces a generation network and a discrimination network, which can generate more realistic textures through adversarial training [26]. In 2017, Ledig et al. used the GAN architecture for the first time to achieve image SR reconstruction and developed an image SR reconstruction method using a GAN (SRGAN) [27]. SRGAN uses deep residual networks and upsampling layers to convert LR images into HR images and uses perceptual loss [28] in adversarial training to effectively increase the texture detail and clarity of the image. Perceptual loss is based on a pre-trained CNN (a VGG network) and compares the feature differences between generated and authentic images. This helps generate more realistic images and alleviates the over-smoothing caused by the MSE loss function. With the introduction of SRGAN, many GAN-based image SR reconstruction models have been presented. In 2018, Wang et al. presented an enhanced SRGAN (ESRGAN) [29]. A residual-in-residual dense block (RRDB) was designed to replace the ResNet block, helping the network better learn residual information and generate more realistic and clear images. ESRGAN removes all BN layers, which improves the SR evaluation metrics, reduces computational complexity, and saves memory. It applies the perceptual loss to features before activation, which addresses the problem of sparse features. In 2020, Carraz et al. presented ESRGAN+ [30], which incorporates residual connections and noise into the RRDB to enhance the generation network's ability to extract features. In 2021, Wang et al. presented Real-ESRGAN [31], replacing the VGG-style discrimination network in ESRGAN with a U-Net discrimination network. Image SR reconstruction is achieved by simulating various degradations during the conversion from HR to LR, which yields good reconstruction results on anime images and videos. However, adversarial training in this field still suffers from instability, which can generate unpleasant artifacts during reconstruction. Generating complex textures and effectively removing artifacts remain key challenges in SR reconstruction techniques for remote sensing images [32].
To address this issue, we present TDEGAN, which can reconstruct higher-quality images and generate clearer, more realistic texture details. Compared to existing GAN-based networks, we have made several improvements. The main contributions of this research are as follows:
Based on the GAN idea, TDEGAN for remote sensing image SR is presented, which can generate more realistic texture details and reduces the impact of artifacts in reconstructed remote sensing images;
The use of multi-level dense connections, SA, and residual connections to improve the generation network structure of the GAN enhances its feature extraction ability;
A PatchGAN-style discrimination network is designed in which the input image passes through multiple convolution layers before producing an output. Each output element corresponds to a receptive field of a certain size, which helps the network generate richer texture details through local discrimination;
The artifact loss function uses local statistics to distinguish artifacts from realistic texture details and, combined with EMA technology, penalizes artifacts and helps the network generate more realistic texture details.
2. Related Work
Although CNN-based image SR reconstruction methods achieve a high peak signal-to-noise ratio when reconstructing remote sensing images, they often suffer from unclear textures and excessive blurring. The GAN-based method has received widespread attention due to its ability to generate images with higher perceptual quality. A GAN consists of a generation network and a discrimination network. The generation network uses LR images to generate corresponding HR images. The discrimination network evaluates the quality and completeness of the generated images by comparing them with the actual HR images in the dataset. The optimization of the generation network relies on the loss values calculated by the discrimination network for these two kinds of images, iterating through adversarial training. In this process, the generation network continuously optimizes itself to generate more realistic and high-quality HR images. The optimization process is shown in Equation (1):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{HR}}(x)}\left[\log D(x)\right] + \mathbb{E}_{y \sim p_{\mathrm{LR}}(y)}\left[\log\left(1 - D(G(y))\right)\right] \tag{1}$$

In (1), $G$ represents the generation network, which outputs the generated image $G(y)$; $D$ represents the discrimination network; the output of $D$ lies in $[0, 1]$, indicating the probability that the input image is an HR image; $x \sim p_{\mathrm{HR}}(x)$ indicates that $x$ is drawn from the HR images; $y \sim p_{\mathrm{LR}}(y)$ indicates that $y$ is drawn from the LR images. The discrimination network continuously improves its ability to distinguish HR images from SR images: it wants the output of $D$ for an HR image to be as close to 1 as possible and the output for the generated image $G(y)$ to be as close to 0 as possible; that is, $D$ wants $V(D, G)$ to be as large as possible. The generation network is just the opposite: it wants $G(y)$ to be as close as possible to the HR image so as to fool the discriminator; that is, $G$ wants $V(D, G)$ to become smaller. The two networks play a game, reaching a dynamic equilibrium by optimizing this loss function.
In recent years, GAN-based image SR reconstruction methods have achieved excellent results in remote sensing. In 2018, Ma et al. developed a remote sensing image SR method based on a transmission GAN (TGAN) [33], which removes BN layers, reducing memory consumption and computational burden while improving accuracy. The method was first trained on a natural image dataset and then fine-tuned on a remote sensing image dataset, achieving good reconstruction results. In 2020, Sustika et al. presented a remote sensing image SR method with residual dense networks (RDNs) [34], and experiments showed that the combination of residual dense networks and a GAN is effective. In 2021, Guo et al. presented a remote sensing image SR method using cascade GANs (CGANs) [35] and designed an edge enhancement module to improve the reconstruction of edge details. In the same year, Huang et al. presented a remote sensing image SR method that combines the wavelet transform with a GAN [36]. This method uses wavelet decomposition coefficients to improve the reconstruction of local image details. Some researchers have combined attention mechanisms with GANs to effectively enhance the generation network's feature extraction ability. In 2021, Moustafa et al. presented a satellite imagery SR method using a squeeze-and-excitation-based GAN (SCSEGAN) [37], which adds squeeze-and-excitation blocks to ensure feature flow and amplify high-frequency details. In the same year, Li et al. presented an attention-based GAN (SRAGAN) [38], which uses local attention and global attention to capture, respectively, the detailed features of the Earth's surface and the correlations across channel and spatial dimensions. In addition, Gao et al. presented a remote sensing image SR method that incorporates residual channel attention (CA) [39]. This network uses the CA module to extract deep feature information from remote sensing images and can reconstruct images with more precise edges. In 2022, Jia et al. presented a multi-attention GAN (MA-GAN) framework [40], which includes attention-based upsampling blocks designed to implement an arbitrary number of upsampling operations, achieving good reconstruction results. In the same year, Xu et al. presented a texture enhancement GAN (TE-SAGAN) [41] for remote sensing image SR. This method uses a self-attention mechanism to improve the generation network and weight normalization to improve the discrimination network, reconstructing edge contours and textures with better visual effects.
In the fields of image restoration and fusion, handling artifacts is also one of the challenges of image processing. Guo et al. proposed a novel dual-stream network for image restoration [42] which couples texture synthesis under structural constraints with texture-guided structural reconstruction. A bidirectional gated feature fusion module and a context feature aggregation module were designed to better refine the generated content. Wang et al. designed a parallel multi-resolution repair network with multi-resolution partial convolution [43]. The low-resolution branch focuses on the global structure, while the high-resolution branch focuses on local texture details, better repairing textures and alleviating artifacts. Xu et al. proposed a texture enhancement network based on structural transformation, designing a structural transformation renderer and a texture enhancement stylist to address artifacts and generate high-quality character images [44]. This study focuses on solving the artifact problem in the super-resolution processing of remote sensing images by modifying the generation network, the discrimination network, and the loss function to generate high-resolution images with the best visual effect.
4. Experiments and Results
This experiment was implemented in the PyTorch framework using an RTX 3060 12 GB GPU. The batch size of the experimental input was set to 16. We randomly cropped the HR images to 128 × 128 and the LR images to 32 × 32. Guo's experiment [58] proved that network performance is best when the residual coefficient is 0.2; therefore, our model's residual coefficient was set to 0.2. Liang's experiment [56] proved that setting the EMA weight parameter to 0.999 can effectively improve the stability of model training; therefore, our EMA weight parameter was set to 0.999. The loss function coefficients were kept fixed during training; in the subsequent ablation experiments, we also discuss the impact of the weight of the artifact loss function on the model. The number of training iterations, the initial learning rate, and the Adam optimizer hyperparameters of the generation and discrimination networks were fixed for all experiments.
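The EMA update with decay 0.999 (following [56]) can be sketched in a few lines; `ema_w` and `w` are hypothetical stand-ins for a model's averaged and current parameters:

```python
import numpy as np

def ema_update(ema_w, w, decay=0.999):
    """Exponential moving average of model weights: ema <- decay * ema + (1 - decay) * w."""
    return decay * ema_w + (1.0 - decay) * w

# With decay 0.999, the averaged weights drift only slowly toward the current
# weights, smoothing out the oscillations of adversarial training.
ema_w = np.zeros(3)
w = np.ones(3)
for _ in range(1000):
    ema_w = ema_update(ema_w, w)
# After 1000 steps, ema_w is about 1 - 0.999**1000, i.e. roughly 0.63.
```

The slow drift is exactly what makes the EMA output a stable reference when distinguishing persistent structure from unstable artifacts later in the artifact loss.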
4.1. Dataset
We used the RHLAI dataset [58] for our experiments, which includes images of surface landscapes in Yichang City, Hubei Province, China, such as farmland, houses, roads, and forests. The researchers obtained remote sensing images with resolutions of 0.2 m and 0.5 m by processing aerial photography and then processed them into 9288 pairs of HR images with a pixel size of 256 × 256 and LR images with a pixel size of 64 × 64. We divided the HR and LR image pairs into training, evaluation, and testing sets in an 8:1:1 ratio.
Figure 7 shows some high-definition images from this dataset.
Most datasets for remote sensing image SR reconstruction contain only HR remote sensing images; researchers often downsample the HR images to obtain the LR images, so there is a fixed mathematical relationship between them. In practical remote sensing applications, however, the observed remote sensing images serve as the original data from which HR images are to be obtained, and these observed images differ somewhat from LR images produced by downsampling in experiments. Therefore, our experiments used the RHLAI dataset, which uses observed remote sensing images as the LR images and can better reflect reconstruction performance in practical applications. The experiments of Guo et al. [58] demonstrated the feasibility of this dataset.
4.2. Evaluation Metrics
In this research, three evaluation metrics for image SR reconstruction methods, peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [59], and learned perceptual image patch similarity (LPIPS) [60], were selected to evaluate the experimental results. PSNR measures the pixel-wise difference between the SR reconstructed image and the HR image. The calculation formulas for PSNR are as follows:

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2 \tag{28}$$

$$\mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) \tag{29}$$

In (28) and (29), $\mathrm{MSE}$ is the mean square error of images $I$ and $K$ with size $m \times n$; $\mathrm{MAX}_I$ is the maximum pixel value.
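Equations (28) and (29) translate directly into NumPy; `max_val` assumes 8-bit images:

```python
import numpy as np

def psnr(img_i, img_k, max_val=255.0):
    """PSNR per Equations (28)-(29): 10 * log10(MAX_I^2 / MSE)."""
    mse = np.mean((img_i.astype(np.float64) - img_k.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

hr = np.full((4, 4), 100.0)
sr = hr + 5.0  # constant error of 5 everywhere -> MSE = 25
# psnr(sr, hr) = 10 * log10(255^2 / 25), about 34.15 dB
```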
SSIM evaluates the structural similarity between the SR reconstructed image and the HR image, paying more attention to local structural differences. The SSIM calculation formula is as follows:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{30}$$

In (30), $\mu_x$ and $\mu_y$ are the pixel means of images $x$ and $y$; $\sigma_{xy}$ is the covariance of images $x$ and $y$; $\sigma_x^2$ and $\sigma_y^2$ are the corresponding variances of images $x$ and $y$; $c_1$ and $c_2$ are non-zero constants.
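Equation (30) can be sketched as follows; note that the standard SSIM [59] averages this expression over local sliding windows, whereas this illustrative version uses global image statistics for brevity:

```python
import numpy as np

def ssim_global(x, y, max_val=255.0):
    """Equation (30) computed over whole images (the standard SSIM averages it
    over local windows; global statistics keep this sketch short)."""
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

rng = np.random.default_rng(0)
hr = rng.random((16, 16)) * 255
s_same = ssim_global(hr, hr)       # ~1.0 for identical images
s_inv = ssim_global(hr, 255 - hr)  # much lower for a structurally inverted image
```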
LPIPS is an evaluation metric that compares features extracted by deep learning networks and can better represent the human eye's perception of image quality. Compared with traditional evaluation metrics, LPIPS is more advantageous in assessing human visual perception of images [60]. It is obtained by calculating the image distance $d(x, x_0)$, which can be expressed as the following formula (31):

$$d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{y}_l^{hw} - \hat{y}_{0l}^{hw} \right) \right\|_2^2 \tag{31}$$

In (31), $\hat{y}_l^{hw}$ and $\hat{y}_{0l}^{hw}$ are the unit-normalized feature maps of $x$ and $x_0$ at network layer $l$ (with spatial size $H_l \times W_l$), and $w_l$ is a learned channel-wise weight vector.
4.3. Analysis of Image Quality Metrics during Training Process
During model training, we saved the model weights every 10,000 iterations, validated the PSNR, SSIM, and LPIPS metrics, and plotted them as line graphs, as shown in
Figure 8.
Figure 8a shows the changes in PSNR during TDEGAN training. In the initial stages of training, the PSNR of the images generated by the model rises rapidly with large fluctuations. Subsequently, the model oscillates within a specific range, with an amplitude of around 0.6, indicating that it is still learning the image features of the dataset. In the later stages of training, the oscillation gradually stabilizes at an amplitude of around 0.2, indicating that the model gradually converges.
Figure 8b shows the changes in SSIM during TDEGAN training. Like the metric shown in
Figure 8a, SSIM rapidly rises in the initial stages of model training and oscillates over an extensive range, with an amplitude of around 0.02. In the later stage of training, the range of SSIM changes gradually decreases, and the model begins to converge with an amplitude of around 0.01.
Figure 8c shows the changes in LPIPS during TDEGAN training. In the initial stage of training, LPIPS rapidly decreases, followed by an extensive range of oscillations with an oscillation range of around 0.03. Subsequently, the oscillation range gradually decreases, and, in the later stage of model training, the amplitude is around 0.01, and the model gradually stabilizes.
4.4. Comparative Experiment
We chose bicubic interpolation, SRCNN [20], EDSR [24], SRGAN [27], ESRGAN [29], SPSR [61], and SAM-DiffSR [62] for comparative experiments conducted with TDEGAN in the same experimental environment.
Table 2 shows the metrics achieved by the different methods on the test dataset. The bicubic interpolation and CNN-based methods have certain advantages in the PSNR and SSIM metrics. However, PSNR focuses on pixel differences, while SSIM focuses on three components, brightness, contrast, and structure, and neither can effectively represent perceptual quality. The experiments of Wang et al. [29] also pointed out that PSNR-oriented methods are prone to producing blurry results. LPIPS is a similarity metric that is closer to the perception of the human visual system. Guo et al.'s experiments [58] likewise showed that bicubic interpolation and CNN-based methods tend to generate blurred images with higher PSNR and SSIM values, while GAN-based methods, although lower in PSNR and SSIM, have better visual perceptual quality and better LPIPS scores. Therefore, we introduced the LPIPS metric, which better reflects human visual perception, to evaluate the reconstruction effects of each method.
Table 2 shows that our model achieves the best LPIPS metric, indicating that its reconstructed images have better visual perceptual effects. Our model also has certain advantages in the PSNR and SSIM metrics among the compared methods.
From
Figure 9, we can see the features selected for visual comparison: roofs, farmland, vehicles, and roads. Overall, the images reconstructed by the bicubic interpolation, SRCNN, and EDSR methods are relatively blurry, with poor texture detail, while the other methods have certain advantages. The roof images in the first row show that the images reconstructed by the SRGAN and SPSR methods have some texture detail, but the roof lines are blurry and discontinuous. The lines reconstructed by the ESRGAN method are relatively clear and complete; however, there are some artifacts around the lines, which affect the visual effect. The SAM-DiffSR method achieves a good reconstruction effect, but the roof lines are not smooth and natural. Our method reconstructs images with clear and complete roof lines, providing the best visual effect. The farmland images in the second row show that the texture of the images reconstructed by the SRGAN and SPSR methods is poor, lacking the farmland gully features, so the specific details of the image cannot be seen clearly. The ESRGAN method reconstructs relatively smooth images that appear somewhat blurry under local magnification. The farmland gullies reconstructed by the SAM-DiffSR method are relatively clear and complete, but some details are missing. Our method reconstructs farmland gullies with clear texture, complete lines, and flat edges. The third row contains cars on a road; the textures on both sides of the car reconstructed by the SRGAN, ESRGAN, and SAM-DiffSR methods are relatively blurry, and the car images reconstructed by the SPSR method contain some artifacts that affect the visual effect. The texture of the car image reconstructed by our model is natural and realistic, clearly showing the car's local details and overall structure.
The fourth and fifth rows show that, when reconstructing the lines on the road, the compared methods often generate artifacts, resulting in unclear edges and poor visual effects. Our method dramatically reduces the impact of artifacts and increases the clarity of the edges of the ground objects.
The above indicates that our model has certain advantages over the classic models in terms of both metrics and visual effects, as it reduces the impact of artifacts and generates images with better texture details.
4.5. Ablation Experiment
In order to systematically demonstrate the contributions of our generation network, discrimination network, and loss function to model performance, we conducted ablation experiments with the following networks: (a) baseline (ESRGAN); (b) baseline + DCSADRDB; (c) baseline + DCSADRDB + PatchGAN-style discrimination network; (d) DCSADRDB + PatchGAN-style discrimination network + artifact loss (TDEGAN). We trained these networks under the same conditions, evaluated them on the test dataset, and calculated the PSNR, SSIM, and LPIPS metrics.
Table 3 shows that the metrics improve to some extent as each module is added to the baseline network. In particular, on the LPIPS metric, which better reflects human visual perception, our final improved model is 0.017024 lower than the baseline model. This indicates that the DCSADRDB, the PatchGAN-style discrimination network, and the artifact loss function can effectively improve the model's ability to generate perceptually better images.
Figure 10 compares the visualization effects of three types of land features: farmland, roof, and playground. The network’s reconstruction effect on texture details is enhanced with the introduction of the modules. From the farmland in the first row, it can be seen that, after the introduction of the DCSADRDB, the image details become more abundant, and the gullies in the farmland gradually become clear. From the texture of the second row of the roof, it can be seen that the roof lines gradually become more three-dimensional and better reflected in visual perception. However, the impact of artifacts also increases. When we introduce artifact loss, the problem of generating artifacts is significantly reduced. From the lines on the playground, it can be seen that the clarity of the image gradually improves. After introducing the PatchGAN-style discrimination network, there are black artifacts on both sides of the white lines on the playground. After improving the loss function, TDEGAN not only enhances the clarity of the lines but also reduces the impact of artifacts around the lines. From this, we can see that our various modules improve the indicators. In terms of visual effects, combining each module enhances details and image clarity and counters the impact of artifacts.
In order to improve the performance of the generator, we added multi-level dense connections and additional residual connections to the generation network, forming a multi-level dense connection network (DCDRDB). To further improve feature extraction performance, SA modules were added to the generation network to form a DCSADRDB. We used an RRDB as the baseline generation network and conducted ablation experiments on these two improvements.
As shown in
Table 4, with the continuous improvement of our generation network, all the tested metrics gradually improve: PSNR increases by about 0.3, SSIM by about 0.018, and LPIPS by about 0.01. As shown in
Figure 11, as our generation network improves, the visualization of the generated images gradually improves. When the generation network is the RRDB, the details of the car windows and nearby grass are unclear, the edges of the lines on the road are blurry, and the texture of the roof is poor. When the generation network is a DCDRDB, the details of the generated image are more abundant. When the generation network is a DCSADRDB, as the network's feature extraction ability improves, the visual effect improves further: the texture details of the car and grass become richer and more realistic, the lines on the road are clearer, and the stripes on the roof are closer to those of the HR image.
A GAN has the potential to generate rich and detailed clear images but often produces unpleasant artifacts. While the model’s ability to generate texture details is enhanced, the impact of artifacts on the image is also enhanced. To solve this problem, we used the artifact loss function for optimization. To further explore the effectiveness of the artifact loss function, we set the weights of the artifact loss function to 0.5, 1, and 1.5 for the ablation experiments.
The artifact loss function distinguishes artifacts from real texture details through local statistics and helps the network generate better images by penalizing artifacts. The weight L of the artifact loss controls the penalty intensity applied to artifact regions. From
Table 5, we can see that, when L is 1, PSNR, SSIM, and LPIPS are all at their best. When L is 0.5, the penalty of the loss function is insufficient, and the optimal effect is not achieved. When L is 1.5, the penalty is excessive; because artifacts and real textures are often generated simultaneously, the real texture is also affected, and the metrics deteriorate. From
Figure 12, it can be seen that, when L is 0.5, there are certain artifacts in the texture of the car, the lines on the road, and the patterns on the playground, which affect the visual effect. When L is 1, the artifacts are greatly reduced, the lines become clearer, the texture details are more realistic, and the visual effect of the image is the best. When L is 1.5, although the effect of artifacts is reduced, texture details are also reduced, and local areas become blurred. Therefore, in our model, the weight of the artifact loss is set to 1.
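As a hedged sketch of the idea, not the paper's exact formulation, an artifact penalty based on local statistics might look like the following. The 3 × 3 window, the particular weighting, and the use of the EMA output as the stability reference are illustrative assumptions:

```python
import numpy as np

def local_variance(img, k=3):
    """Variance over k x k neighborhoods (edge-padded): a simple local statistic."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(p, (k, k))
    return windows.var(axis=(-1, -2))

def artifact_loss(sr, sr_ema, hr, weight=1.0):
    """Hypothetical artifact penalty: weight residuals more heavily where the EMA
    output is locally flat (stable), treating high-variance regions as plausible
    texture and stable deviations as artifacts. A sketch only; the statistic used
    in the actual artifact loss may differ."""
    residual = np.abs(sr - hr)
    stability = 1.0 / (1.0 + local_variance(sr_ema))  # high where EMA output is flat
    return weight * np.mean(stability * residual)
```

The weight argument plays the role of L above: it scales how strongly stable deviations are punished relative to the other loss terms.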
6. Conclusions
Image SR reconstruction technology has been widely used in remote sensing applications. GAN-based image SR reconstruction has attracted attention for its ability to generate clearer images. However, generating more realistic texture details and reducing image artifacts remain challenges. To address these challenges, we propose TDEGAN.
We propose the DCSADRDB as the main part of the generation network, adding multi-level dense connections, SA, and residual connections to improve the network's feature extraction capability. We design a PatchGAN-style discrimination network instead of the classic VGG-style discrimination network, which performs local discrimination and helps the generation network produce richer texture details. However, while enhancing the model's ability to generate texture details, this often produces some unpleasant artifacts. To solve this problem, we introduce artifact loss, which, combined with EMA technology, calculates local statistics to distinguish realistic details from artifacts, thereby helping the network generate more realistic texture details and reducing the impact of artifacts. Compared with existing methods, our model generates more realistic texture details, reconstructs images of higher quality, and achieves better visual perception and evaluation metrics.
Although our model demonstrated certain performance advantages when tested on the RHLAI and NWPU-RESISC45 datasets, the universality of the model still needs to be enhanced. How to train models that can be applied to more remote sensing datasets and improve the universality of SR reconstruction models for remote sensing images remains a challenge to be solved in the future.