2.1. Network Structure
To enhance the quality of super-resolution images, we designed an innovative dual-style network, DSpix2pix, based on the high-quality image generation capabilities of the StyleGAN-v2 framework. This network utilizes downsampling and StyleGAN-v2's style convolution layers to encode low-resolution images into constant vectors, enabling precise control over the content of the generated images. Additionally, during the decoding phase, it incorporates style information extracted from example images to enhance the diversity and controllability of the generated results. The loss structure adopts the pix2pix loss framework [29], combining generative and adversarial losses to further optimize the detail quality and visual consistency of the images, achieving dual optimization of both content and style. The overall network structure is shown in Figure 1.
The network consists of two parts: the generator and the discriminator. The discriminator is an 18-layer residual network (ResNet-18) [30]. The generator comprises three parts: two style structures (Style 1 and Style 2) and a generator structure. The generator structure produces high-resolution images, while Style 1 and Style 2 ensure that the generated images remain as consistent as possible with the high-resolution ground truth in both content and style, especially when generating high-quality images, such as those at a 1024 × 1024 resolution.
Style 1, derived from StyleGAN-v2, is used to learn the style information of the training remote sensing image dataset. Its core component, the mapping network, consists of multiple fully connected layers and transforms the latent vector z (usually random noise) into the style vector space w+. This network can fully decouple the image style features, allowing the style vectors in the w+ space to independently control the styles of generated images at different layers [25].
Style 2 is a trained autoencoder (AE) [31] whose decoder layers are designed identically to the generator's decoder and super layers. By compressing and reconstructing high-resolution data, it learns the features of the high-resolution image decoder layers. We then input a high-resolution image into Style 2 and, after inference, extract the Gram matrix [32] (style) from the features output by the Style 2 decoder layers. This matrix is injected into the generator's decoder layers to continuously control the style of the generated images. The high-resolution image serves as a reference image and can be any image from the training dataset, as long as it contains high-resolution style information.
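As a concrete illustration, the Gram matrix of a feature map can be obtained by flattening its spatial dimensions and taking channel-wise inner products. The PyTorch sketch below is only a minimal example of this step, assuming (B, C, H, W) feature maps; the layer and variable names are placeholders, not the exact implementation.

```python
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Compute per-sample Gram matrices from a (B, C, H, W) feature map."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)            # flatten spatial dimensions
    gram = torch.bmm(f, f.transpose(1, 2))    # (B, C, C) channel correlations
    return gram / (c * h * w)                 # normalize by feature size

# Hypothetical usage: extract style from a reference high-resolution image
# passed through the pre-trained Style 2 autoencoder (names are placeholders).
# feats = style2_decoder_layer(reference_image)   # (B, C, H, W)
# style = gram_matrix(feats)                      # injected into the G-Decoder
```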
The generator adopts a network structure similar to U-Net [33]. The encoder of the generator (G-Encoder) provides the content and style information of the low-resolution image. Since the style of the low-resolution image is similar to that of the high-resolution image, it offers strong control, reducing both the difficulty of training the network and the size of the training dataset required. This is the foundation that enables the network to stably generate remote sensing images. The extra residual blocks in the generator (G-Extra) are included to enhance the network's expressive ability, with the number of residual blocks determined by the complexity of the task; in the experiments in this paper, two blocks are used. The decoder of the generator (G-Decoder) restores the feature maps obtained from the encoder to their original size, generating images that contain more high-resolution information while retaining the same size as the low-resolution image. During this process, the decoder continuously extracts style information from Style 1, Style 2, and the low-resolution image. For example, if the input low-resolution image has a size of 256 × 256, the image generated by the decoder will also have a size of 256 × 256, but with more details. The super layers of the generator (G-Super) generate images with higher quality (such as 512 × 512 or 1024 × 1024). During this stage, the network layers no longer receive style information from the low-resolution image; the required style control is obtained solely from Style 1 and Style 2.
The structure of the generator and Style 1 is shown in Figure 2.
In Figure 2, the latent vector z is mapped to the style space w+ through the mapping network. The multilayer perceptron (MLP) transforms the style vector into scaling weights A that can be used by the convolutional layers. It should be noted that the style vector in w+ is intended to directly affect the feature maps; however, StyleGAN-v2 shows that converting the style vector into scaling weights that act on the convolutional layer weights achieves the same effect while being more computationally efficient. In Figure 2, w denotes the convolutional layer weights and b the bias. StyleGAN-v2 designs Mod and DeMod to inject style information into the generator. Mod injects style by scaling the convolutional layer weights, as shown in Equation (1), while DeMod normalizes the convolutional layer weights, making style control more flexible and efficient, as shown in Equation (2).
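In the formulation of StyleGAN-v2, these two operations can be written as

\[ w'_{ijk} = s_i \cdot w_{ijk}, \tag{1} \]

\[ w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} \left( w'_{ijk} \right)^{2} + \epsilon}}. \tag{2} \]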
Here, s_i represents the scaling weight of each input channel mapped from the w+ space, w_{ijk} represents the original weights, w'_{ijk} represents the modulated weights, w''_{ijk} represents the normalized weights of the convolutional layers, i indexes the channels of the input feature maps, j indexes the channels of the output feature maps, and k indexes the spatial footprint of the convolution kernel. ϵ is a very small constant that prevents the denominator from being zero. Although this is not entirely identical to directly scaling the feature maps, the experiments of StyleGAN-v2 show that it does not significantly affect the quality of the generated images.
Since StyleGAN-v2 is designed to generate images with the same style but random content, its input is typically a random constant of size 4 × 4 × 512. To leverage this characteristic for generating super-resolution images, we need to replace the random constant with a deterministic vector carrying content information extracted from the low-resolution image. We therefore add the G-Encoder: convolutional layers encode the low-resolution image into a 4 × 4 × 512 constant, which is then fed into the decoder of the generator to provide content information, thereby generating high-resolution images. In the encoder, the design of the style block is similar to that of the decoder, the only difference being that the upsampling operation is replaced by downsampling to accommodate the low-resolution image input.
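As a rough sketch of this idea (not the actual G-Encoder, which additionally uses style-modulated convolutions and skip connections), a stack of stride-2 convolutions can reduce a 256 × 256 input to the 4 × 4 × 512 constant expected by the decoder; the channel widths below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SimpleLREncoder(nn.Module):
    """Minimal stand-in for G-Encoder: (B, 3, 256, 256) -> (B, 512, 4, 4).
    Style modulation and skip connections are omitted for brevity."""
    def __init__(self):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 512, 512]   # assumed channel widths
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]          # six stride-2 steps: 256 -> 4
        self.net = nn.Sequential(*layers)

    def forward(self, lr_image: torch.Tensor) -> torch.Tensor:
        return self.net(lr_image)                  # content constant for the decoder

# const = SimpleLREncoder()(torch.randn(1, 3, 256, 256))  # torch.Size([1, 512, 4, 4])
```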
Although the style information of the image is typically well preserved during encoding, we still extract the style features of the remote sensing data from the style space w+ to ensure the stability of image encoding. At the same time, we retain the input of random noise B to allow the network to generate more details, improving the realism and quality of the image. We also add skip connections, similar to those in U-Net, between corresponding layers of the encoder and decoder to continuously provide style and content information from the low-resolution image.
In the experiments, we found that the generator in Figure 2b can stably produce high-resolution remote sensing images with the same content information as the low-resolution images, especially at lower scales. However, when generating higher-quality results through the G-Super, the style of the generated image deviated from that of the low-resolution image; for instance, a brown road might turn green like farmland. This phenomenon is particularly evident at the 1024 × 1024 scale. One possible reason is that the generator's loss function incorporates the same L1 loss as used in pix2pix [29]; as the generation quality increases, the farmland category, which already has a high proportion in the samples, contributes more to the loss, leading the network to overfit. Another reason is that during the G-Super generation phase, the network loses the style information provided by the encoder from the low-resolution image and relies solely on the vector in the w+ space for style control.
To mitigate this phenomenon, we add an additional style (Style 2) to the network by extracting the style from an example image and incorporating it into the high-quality image generation process. The structure of Style 2 is shown in Figure 3.
Figure 3a is a simplified schematic of Figure 2b; it clearly shows that in the G-Super, the layers no longer receive the style information of the low-resolution image from the encoder. Figure 3b shows the pre-trained autoencoder. Each convolutional layer within its decoder can output feature maps of the same size as those in the G-Decoder and G-Super in Figure 3a. The additional style is added to the generator using the AdaIN method. AdaIN is a style-transfer technique proposed by Xun Huang et al. in 2017 [28], whose primary goal is to inject the statistical information (mean and standard deviation) of the style image into the features of the content image, enabling the generated image to retain the structure of the content image while adopting the appearance of the style image. This method is efficient and intuitive. The specific formula is shown in Equation (3).
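In the form given by Huang et al. [28], AdaIN replaces the channel-wise statistics of the content feature x with those of the style feature y:

\[ \mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y). \tag{3} \]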
Here, x represents the feature map of the low-resolution image in the G-Decoder and G-Super, and y represents the feature map of the example image. b represents the batch size, c represents the number of feature map channels, and w and h represent the width and height of the feature map, respectively. μ(x) and σ(x) denote the mean and standard deviation of the feature map x, calculated across the spatial dimensions independently for each b and c, as shown in Equations (4) and (5).
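Following the definitions of Huang et al. [28], these channel-wise spatial statistics can be written as

\[ \mu_{bc}(f) = \frac{1}{wh} \sum_{p=1}^{w} \sum_{q=1}^{h} f_{bcpq}, \tag{4} \]

\[ \sigma_{bc}(f) = \sqrt{\frac{1}{wh} \sum_{p=1}^{w} \sum_{q=1}^{h} \left( f_{bcpq} - \mu_{bc}(f) \right)^{2} + \epsilon}. \tag{5} \]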
f represents the feature map x or y. Combined with dual-style control, this forms the proposed network DSpix2pix.
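For clarity, a minimal PyTorch sketch of this style injection step is given below; it assumes (B, C, H, W) feature maps and is only an illustration of Equations (3)-(5), with placeholder variable names rather than the exact implementation.

```python
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive instance normalization: transfer the per-channel spatial statistics
    of `style` (example-image features) onto `content` (low-resolution features)."""
    # Statistics are taken over the spatial dimensions, separately per sample and channel.
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True)
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical usage inside G-Super, injecting Style 2 feature statistics:
# out = adain(gsuper_features, style2_features)
```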
2.2. Training Strategy and Loss Function
We adopt the training strategy of StyleGAN-v2, in which low-quality generated images are used to assist the training of high-quality image generation, as illustrated in Figure 4a. In other words, we upsample the result from the lower scale and add it to the result of the next higher scale. The simplified representation can be expressed as Equation (6).
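Consistent with the symbol definitions that follow, this blending can be written in the progressive fade-in form used by the StyleGAN family:

\[ \mathrm{Images} = \alpha \cdot \mathrm{straight} + (1 - \alpha) \cdot \mathrm{residual}. \tag{6} \]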
Images represents the generated images, straight refers to the images generated directly at the higher scale, residual refers to the upsampled lower-quality images, and α is the parameter that balances the two levels of images. As training progresses, α gradually increases to 1.
To prevent the lower-level features from interfering with the generation of higher-level images, which could cause the well-decoupled weights in the
w+ space to become re-entangled, we retain the mixing regularization training strategy proposed in the StyleGAN series. The specific approach is as follows: during training, latent vectors (
z) from different samples within the same batch are randomly mixed, and these mixed style vectors are injected into different layers of the generator. For example, the style vector of one sample can be applied to the earlier layers of the generator, while the style vector of another sample is applied to the later layers. This method enables the generator to learn the ability to handle different style information, thereby enhancing the disentanglement properties of the generative network and avoiding overfitting. The method is shown in
Figure 4b.
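A minimal sketch of this mixing step is shown below; the mapping-network call and layer count are placeholders, and in practice the per-layer styles are applied through the generator's modulated convolutions.

```python
import random
import torch

def mixed_styles(z_a: torch.Tensor, z_b: torch.Tensor, mapping, num_layers: int):
    """Style mixing regularization: feed style `a` to the earlier generator layers
    and style `b` to the later ones, with a random crossover point."""
    w_a = mapping(z_a)                      # style vector of sample a in the w+ space
    w_b = mapping(z_b)                      # style vector of sample b in the w+ space
    crossover = random.randint(1, num_layers - 1)
    # One style vector per generator layer: a before the crossover, b after it.
    return [w_a if layer < crossover else w_b for layer in range(num_layers)]
```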
In Figure 4a, ToRGB refers to the convolutional layer that converts feature maps into three-channel color images, and Up represents an upsampling operation. In Figure 4b, a and b denote the style vectors of two different samples within the same batch. The loss function consists of the traditional cGAN loss [34], as shown in Equation (7); an L1 loss, which controls the content information, as shown in Equation (8); and an AdaIN loss for injecting the additional styles, as shown in Equation (9) [28].
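The first two terms follow the standard pix2pix formulation [29,34], and the style term follows the AdaIN style loss of [28]; schematically, under these assumptions,

\[ \mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D\left(x, G(x, z)\right)\right)\right], \tag{7} \]

\[ \mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\left\lVert y - G(x, z) \right\rVert_{1}\right], \tag{8} \]

\[ \mathcal{L}_{AdaIN}(G) = \sum_{t=1}^{i} \left( \left\lVert \mu\left(f_{t}(g)\right) - \mu\left(f_{t}(y)\right) \right\rVert_{2} + \left\lVert \sigma\left(f_{t}(g)\right) - \sigma\left(f_{t}(y)\right) \right\rVert_{2} \right). \tag{9} \]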
Here, x represents the input low-resolution images, y represents the ground truth corresponding to x, z represents the random noise, G(x, z) represents the generated high-resolution images, and N represents the number of samples in a batch. f represents the feature map of the corresponding convolutional layer, g represents the generator's computation process, t denotes a convolutional layer where an additional style needs to be injected, λ is the hyperparameter that adjusts the balance between content loss and style loss, i indicates the total number of convolutional layers where additional style is injected, μ represents the calculation of the mean, and σ represents the calculation of the standard deviation. The total loss is expressed in Equation (10).
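One plausible combination, consistent with the symbol definitions above and the pix2pix objective [29], is

\[ G^{*} = \arg\min_{G} \max_{D} \; \mathcal{L}_{cGAN}(G, D) + \mathcal{L}_{L1}(G) + \lambda \, \mathcal{L}_{AdaIN}(G). \tag{10} \]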