2.1. Network Structure
To enhance the quality of super-resolution images, we designed an innovative dual-style network, DSpix2pix, based on the high-quality image generation capabilities of the StyleGAN-v2 framework. This network utilizes downsampling and StyleGAN-v2's style convolution layers to encode low-resolution images into constant vectors, enabling precise control over the content of the generated images. Additionally, during the decoding phase, it incorporates style information extracted from example images to enhance the diversity and controllability of the generated results. The loss structure adopts the pix2pix loss framework [29], combining generative and adversarial losses to further optimize the detail quality and visual consistency of the images, achieving dual optimization of both content and style. The overall network structure is shown in Figure 1.
The network consists of two parts: the generator and the discriminator. The discriminator is an 18-layer residual network (ResNet-18) [30]. The generator comprises three parts: two style structures (Style 1 and Style 2) and a generator structure. The generator structure produces high-resolution images, while Style 1 and Style 2 ensure that the generated images remain as consistent as possible with the high-resolution ground truth in both content and style, especially when generating high-quality images, such as those at a 1024 × 1024 resolution.
Style 1, derived from StyleGAN-v2, is used to learn the style information of the training remote sensing image dataset. Its core component, the mapping network, consists of multiple fully connected layers and transforms the latent vector z (usually random noise) into the style vector space w+. This network can fully decouple the image style features, allowing the style vectors in the w+ space to independently control the styles of generated images at different layers [25].
Style 2 is a trained autoencoder (AE) [31] whose decoder layers are designed identically to the generator's decoder and super layers. By compressing and reconstructing high-resolution data, it learns the features of the high-resolution image decoder layers. We then input a high-resolution image into Style 2 and, after inference, extract the Gram matrix [32] (style) from the features output by the Style 2 decoder layers. This matrix is injected into the generator's decoder layers to continuously control the style of the generated images. The high-resolution image serves as a reference image and can be any image from the training dataset, as long as it contains high-resolution style information.
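As a concrete illustration, the Gram matrix of a feature map can be obtained by flattening its spatial dimensions and taking channel-wise inner products. The PyTorch sketch below is only a minimal example of this step, assuming (B, C, H, W) feature maps; the layer and variable names are placeholders, not the exact implementation.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Compute per-sample Gram matrices from a (B, C, H, W) feature map."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)            # flatten spatial dimensions
    gram = torch.bmm(f, f.transpose(1, 2))    # (B, C, C) channel correlations
    return gram / (c * h * w)                 # normalize by feature size

# Hypothetical usage: extract style from a reference high-resolution image
# passed through the pre-trained Style 2 autoencoder (names are placeholders).
# feats = style2_decoder_layer(reference_image)   # (B, C, H, W)
# style = gram_matrix(feats)                      # injected into the G-Decoder
```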
The generator adopts a network structure similar to U-Net [33]. The encoder of the generator (G-Encoder) provides the content and style information of the low-resolution image. Since the style of the low-resolution image is similar to that of the high-resolution image, it offers strong control, reducing both the difficulty of training the network and the size of the training dataset required. This is the foundation that enables the network to stably generate remote sensing images. The extra residual blocks in the generator (G-Extra) are included to enhance the network's expressive ability, with the number of residual blocks determined by the complexity of the task; in the experiments in this paper, two blocks are used. The decoder of the generator (G-Decoder) restores the feature maps obtained from the encoder to their original size, generating images that contain more high-resolution information while retaining the same size as the low-resolution image. During this process, the decoder continuously extracts style information from Style 1, Style 2, and the low-resolution image. For example, if the input low-resolution image has a size of 256 × 256, the image generated by the decoder will also have a size of 256 × 256, but with more details. The super layers of the generator (G-Super) generate images with higher quality (such as 512 × 512 or 1024 × 1024). During this stage, the network layers no longer receive style information from the low-resolution image; the required style control is obtained solely from Style 1 and Style 2.
The structure of the generator and Style 1 is shown in Figure 2.
In Figure 2, the latent vector z is mapped to the style space w+ through the mapping network. The multilayer perceptron (MLP) transforms the style vector into scaling weights A that can be used by the convolutional layers. It should be noted that the style vector in w+ is intended to directly affect the feature maps; however, StyleGAN-v2 shows that converting the style vector into scaling weights that act on the convolutional layer weights achieves the same effect while being more computationally efficient. In Figure 2, w denotes the convolutional layer weights and b the bias. StyleGAN-v2 designs Mod and DeMod to inject style information into the generator. Mod injects style by scaling the convolutional layer weights, as shown in Equation (1), while DeMod normalizes the convolutional layer weights, making style control more flexible and efficient, as shown in Equation (2).
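In the formulation of StyleGAN-v2, these two operations can be written as

\[ w'_{ijk} = s_i \cdot w_{ijk}, \tag{1} \]

\[ w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} \left( w'_{ijk} \right)^{2} + \epsilon}}. \tag{2} \]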
Here, s_i represents the scaling weight of each input channel mapped from the w+ space, w_{ijk} represents the original weights, w'_{ijk} represents the modulated weights, w''_{ijk} represents the normalized weights of the convolutional layers, i indexes the channels of the input feature maps, j indexes the channels of the output feature maps, and k indexes the spatial footprint of the convolution kernel. ϵ is a very small constant that prevents the denominator from being zero. Although this is not entirely identical to directly scaling the feature maps, the experiments of StyleGAN-v2 show that it does not significantly affect the quality of the generated images.
Since StyleGAN-v2 is designed to generate images with the same style but random content, its input is typically a random constant of size 4 × 4 × 512. To leverage this characteristic for generating super-resolution images, we need to replace the random constant with a deterministic vector carrying content information extracted from the low-resolution image. We therefore add the G-Encoder: convolutional layers encode the low-resolution image into a 4 × 4 × 512 constant, which is then fed into the decoder of the generator to provide content information, thereby generating high-resolution images. In the encoder, the design of the style block is similar to that of the decoder, the only difference being that the upsampling operation is replaced by downsampling to accommodate the low-resolution image input.
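As a rough sketch of this idea (not the actual G-Encoder, which additionally uses style-modulated convolutions and skip connections), a stack of stride-2 convolutions can reduce a 256 × 256 input to the 4 × 4 × 512 constant expected by the decoder; the channel widths below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SimpleLREncoder(nn.Module):
    """Minimal stand-in for G-Encoder: (B, 3, 256, 256) -> (B, 512, 4, 4).
    Style modulation and skip connections are omitted for brevity."""
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512, 512]   # assumed channel widths
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]          # six stride-2 steps: 256 -> 4
        self.net = nn.Sequential(*layers)

    def forward(self, lr_image: torch.Tensor) -> torch.Tensor:
        return self.net(lr_image)                  # content constant for the decoder

# const = SimpleLREncoder()(torch.randn(1, 3, 256, 256))  # torch.Size([1, 512, 4, 4])
```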
Although the style information of the image is typically well preserved during encoding, we still extract the style features of the remote sensing data from the style space w+ to ensure the stability of image encoding. At the same time, we retain the input of random noise B to allow the network to generate more details, improving the realism and quality of the image. We also add skip connections, similar to those in U-Net, between corresponding layers of the encoder and decoder to continuously provide style and content information from the low-resolution image.
In the experiments, we found that the generator in Figure 2b can stably produce high-resolution remote sensing images with the same content information as the low-resolution images, especially at lower scales. However, when generating higher-quality results through the G-Super, the style of the generated image deviated from that of the low-resolution image; for instance, a brown road might turn green like farmland. This phenomenon is particularly evident at the 1024 × 1024 scale. One possible reason is that the generator's loss function incorporates the same L1 loss as used in pix2pix [29]; as the generation quality increases, the farmland category, which already has a high proportion in the samples, contributes more to the loss, leading the network to overfit. Another reason is that during the G-Super generation phase, the network loses the style information provided by the encoder from the low-resolution image and relies solely on the vector in the w+ space for style control.
To mitigate this phenomenon, we add an additional style (Style 2) to the network by extracting the style from an example image and incorporating it into the high-quality image generation process. The structure of Style 2 is shown in Figure 3.
Figure 3a is a simplified schematic of Figure 2b; it clearly shows that in the G-Super, the layers no longer receive the style information of the low-resolution image from the encoder. Figure 3b shows the pre-trained autoencoder. Each convolutional layer within its decoder can output feature maps of the same size as those in the G-Decoder and G-Super in Figure 3a. The additional style is added to the generator using the AdaIN method. AdaIN is a style-transfer technique proposed by Xun Huang et al. in 2017 [28], whose primary goal is to inject the statistical information (mean and standard deviation) of the style image into the features of the content image, enabling the generated image to retain the structure of the content image while adopting the appearance of the style image. This method is efficient and intuitive. The specific formula is shown in Equation (3).
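In the form given by Huang et al. [28], AdaIN replaces the channel-wise statistics of the content feature x with those of the style feature y:

\[ \mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y). \tag{3} \]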
Here, x represents the feature map of the low-resolution image in the G-Decoder and G-Super, and y represents the feature map of the example image. b represents the batch size, c represents the number of feature map channels, and w and h represent the width and height of the feature map, respectively. μ(x) and σ(x) denote the mean and standard deviation of the feature map x, calculated across the spatial dimensions independently for each b and c, as shown in Equations (4) and (5).
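Following the definitions of Huang et al. [28], these channel-wise spatial statistics can be written as

\[ \mu_{bc}(f) = \frac{1}{wh} \sum_{p=1}^{w} \sum_{q=1}^{h} f_{bcpq}, \tag{4} \]

\[ \sigma_{bc}(f) = \sqrt{\frac{1}{wh} \sum_{p=1}^{w} \sum_{q=1}^{h} \left( f_{bcpq} - \mu_{bc}(f) \right)^{2} + \epsilon}. \tag{5} \]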
f represents the feature map x or y. Combined with dual-style control, this forms the proposed network DSpix2pix.
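For clarity, a minimal PyTorch sketch of this style injection step is given below; it assumes (B, C, H, W) feature maps and is only an illustration of Equations (3)-(5), with placeholder variable names rather than the exact implementation.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization: transfer the per-channel spatial statistics
    of `style` (example-image features) onto `content` (low-resolution features)."""
    # Statistics are taken over the spatial dimensions, separately per sample and channel.
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True)
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical usage inside G-Super, injecting Style 2 feature statistics:
# out = adain(gsuper_features, style2_features)
```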
2.2. Training Strategy and Loss Function
We adopt the training strategy of StyleGAN-v2, in which low-quality generated images are used to assist the training of high-quality image generation, as illustrated in Figure 4a. In other words, we upsample the result from the lower scale and add it to the result of the next higher scale. The simplified representation can be expressed as Equation (6).
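Consistent with the symbol definitions that follow, this blending can be written in the progressive fade-in form used by the StyleGAN family:

\[ \mathrm{Images} = \alpha \cdot \mathrm{straight} + (1 - \alpha) \cdot \mathrm{residual}. \tag{6} \]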
Images represents the generated images, straight refers to the images generated directly at the higher scale, residual refers to the upsampled lower-quality images, and α is the parameter that balances the two levels of images. As training progresses, α gradually increases to 1.
To prevent the lower-level features from interfering with the generation of higher-level images, which could cause the well-decoupled weights in the
w+ space to become re-entangled, we retain the mixing regularization training strategy proposed in the StyleGAN series. The specific approach is as follows: during training, latent vectors (
z) from different samples within the same batch are randomly mixed, and these mixed style vectors are injected into different layers of the generator. For example, the style vector of one sample can be applied to the earlier layers of the generator, while the style vector of another sample is applied to the later layers. This method enables the generator to learn the ability to handle different style information, thereby enhancing the disentanglement properties of the generative network and avoiding overfitting. The method is shown in
Figure 4b.
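A minimal sketch of this mixing step is shown below; the mapping-network call and layer count are placeholders, and in practice the per-layer styles are applied through the generator's modulated convolutions.

```python
import random
import torch

def mixed_styles(z_a: torch.Tensor, z_b: torch.Tensor, mapping, num_layers: int):
    """Style mixing regularization: feed style `a` to the earlier generator layers
    and style `b` to the later ones, with a random crossover point."""
    w_a = mapping(z_a)                      # style vector of sample a in the w+ space
    w_b = mapping(z_b)                      # style vector of sample b in the w+ space
    crossover = random.randint(1, num_layers - 1)
    # One style vector per generator layer: a before the crossover, b after it.
    return [w_a if layer < crossover else w_b for layer in range(num_layers)]
```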
In Figure 4a, ToRGB refers to the convolutional layer that converts feature maps into three-channel color images, and Up represents an upsampling operation. In Figure 4b, a and b denote the style vectors of two different samples within the same batch. The loss function consists of the traditional cGAN loss [34], as shown in Equation (7); an L1 loss, which controls the content information, as shown in Equation (8); and an AdaIN loss for injecting the additional styles, as shown in Equation (9) [28].
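The first two terms follow the standard pix2pix formulation [29,34], and the style term follows the AdaIN style loss of [28]; schematically, under these assumptions,

\[ \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D\left(x, G(x, z)\right)\right)\right], \tag{7} \]

\[ \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\left\lVert y - G(x, z) \right\rVert_{1}\right], \tag{8} \]

\[ \mathcal{L}_{AdaIN}(G) = \sum_{t=1}^{i} \left( \left\lVert \mu\left(f_{t}(g)\right) - \mu\left(f_{t}(y)\right) \right\rVert_{2} + \left\lVert \sigma\left(f_{t}(g)\right) - \sigma\left(f_{t}(y)\right) \right\rVert_{2} \right). \tag{9} \]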
Here, x represents the input low-resolution images, y represents the ground truth corresponding to x, z represents the random noise, G(x, z) represents the generated high-resolution images, and N represents the number of samples in a batch. f represents the feature map of the corresponding convolutional layer, g represents the generator's computation process, t denotes a convolutional layer where an additional style needs to be injected, λ is the hyperparameter that adjusts the balance between content loss and style loss, i indicates the total number of convolutional layers where additional style is injected, μ represents the calculation of the mean, and σ represents the calculation of the standard deviation. The total loss is expressed in Equation (10).
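One plausible combination, consistent with the symbol definitions above and the pix2pix objective [29], is

\[ G^{*} = \arg\min_{G} \max_{D} \; \mathcal{L}_{cGAN}(G, D) + \mathcal{L}_{L1}(G) + \lambda \, \mathcal{L}_{AdaIN}(G). \tag{10} \]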