4.1. Experimental Dataset
We build two remote-sensing image datasets with distinct styles: the Chongzhou area in Sichuan Province of China, covering longitudes 103°37′ to 103°45′ E and latitudes 30°35′ to 30°40′ N, and the Wuzhen area in Zhejiang Province of China, covering longitudes 120°26′ to 120°33′ E and latitudes 30°43′ to 30°47′ N. The original images are satellite optical orthorectified images captured over two time periods, with each image measuring 5826 × 3884 pixels and a spatial resolution of 0.51 m. The Chongzhou dataset features complex land characteristics, including large factories, intricate residential structures, and rural clusters. In contrast, the Wuzhen dataset primarily consists of water bodies surrounded by villages, with a landscape dominated by vegetation and rural buildings.
Both datasets exhibit class imbalance, each with its own characteristics. The Chongzhou dataset contains complex and diverse buildings, which makes it challenging for the model to generate such underrepresented classes, buildings in particular. The Wuzhen dataset contains water and vegetation with high semantic similarity, which requires the model to have strong discriminative capability.
We also conducted comparative experiments on the publicly available LoveDA [50] dataset to further validate the advantages of MCGGAN across different geographic locations and satellite sensors. The LoveDA dataset covers three cities: Nanjing, Changzhou, and Wuhan. It features inconsistent sample distributions between urban and rural areas, posing significant challenges for generative models.
In total, as shown in Figure 4, we utilized two custom datasets and one public dataset, covering multiple cities in China and different satellite sensors. We focus on generating five typical remote-sensing land-cover classes: background, water, vegetation, buildings, and roads. To facilitate this, we annotated the Chongzhou and Wuzhen datasets with the corresponding categories to create semantic labels. For the LoveDA dataset, we merged forest and agriculture into vegetation and barren areas into background, resulting in the same five label categories.
The images are cropped into 512 × 512 pixels. After cropping, both custom datasets are randomly divided into training and testing sets at a 4:1 ratio. The final Chongzhou dataset comprises 845 training samples and 211 testing samples, while the Wuzhen dataset contains 704 training samples and 176 testing samples. To maintain consistency in dataset size, we randomly selected 1000 512 × 512 images from the LoveDA dataset and split them at the same 4:1 ratio, yielding 800 training samples and 200 testing samples.
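For reference, the tiling and splitting procedure described above can be sketched as follows. This is a minimal sketch, assuming non-overlapping crops and a fixed random seed; the directory layout and helper names are illustrative and not taken from the paper's code.

```python
import random
from pathlib import Path
from PIL import Image

def tile_image(src_path: Path, out_dir: Path, tile: int = 512) -> list[Path]:
    """Crop a large orthoimage into non-overlapping tile x tile patches."""
    img = Image.open(src_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    w, h = img.size
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            patch = img.crop((left, top, left + tile, top + tile))
            p = out_dir / f"{src_path.stem}_{top}_{left}.png"
            patch.save(p)
            paths.append(p)
    return paths

def split_4_to_1(paths: list[Path], seed: int = 0) -> tuple[list[Path], list[Path]]:
    """Randomly split tiles into training/testing sets at a 4:1 ratio."""
    rng = random.Random(seed)
    shuffled = paths[:]
    rng.shuffle(shuffled)
    n_train = int(round(0.8 * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]
```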
Figure 4 illustrates the percentage of each feature in the Chongzhou, Wuzhen, and LoveDA datasets. In Chongzhou, buildings occupy a large proportion, predominantly representing urban scenes. In contrast, Wuzhen features abundant vegetation and water bodies, has fewer buildings, and is predominantly characterized by rural farmland. The LoveDA dataset contains a higher proportion of background and includes both urban and rural sample distributions. It features a diverse range of buildings, thus posing a greater challenge for generative models.
4.2. Evaluation Metrics
To assess the perceptual similarity between generated images and real images, we employ three representative metrics tailored to the characteristics of remote-sensing images: Fréchet Inception Distance (FID) [51,52], Learned Perceptual Image Patch Similarity (LPIPS) [53], and Structural Similarity (SSIM).
FID is used to measure the distribution differences between generated images and real images in feature space. The process begins by extracting features using the Inception network, followed by modeling the feature space with a Gaussian model, and finally calculating the distance between the two feature distributions. A lower FID indicates higher image quality and diversity. The formula is as follows:

$$\mathrm{FID}(g, r) = \left\lVert \mu_g - \mu_r \right\rVert^2 + \mathrm{Tr}\!\left(\Sigma_g + \Sigma_r - 2\left(\Sigma_g \Sigma_r\right)^{1/2}\right)$$

where $g$ denotes the generated image; $r$ denotes the original (real) image; $\mu_g$, $\mu_r$, $\Sigma_g$, and $\Sigma_r$ represent the means and covariance matrices of the image features; and $\mathrm{Tr}(\cdot)$ represents the trace of a matrix.
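A minimal numerical sketch of this computation, assuming the Inception features of the real and generated sets have already been extracted into two arrays (the function and variable names are illustrative):

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Compute FID between two sets of Inception features of shape (N, D)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_g @ sigma_r, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real

    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(sigma_g + sigma_r - 2.0 * covmean))
```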
LPIPS is a metric used to assess the perceptual similarity between images. Unlike traditional metrics such as PSNR and SSIM, which primarily focus on pixel-level differences, LPIPS aligns more closely with human visual perception. The formula is as follows:

$$\mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{N_l} \left\lVert f_l(x) - f_l(y) \right\rVert_2^2$$

where $l$ denotes different layers in the network (e.g., convolutional layers); $f_l(x)$ and $f_l(y)$ are the feature maps of images $x$ and $y$ at layer $l$; $N_l$ is the number of elements in the feature map at layer $l$; and $\lVert \cdot \rVert_2$ represents the L2 norm (Euclidean distance).
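A minimal sketch of the simplified formula above, assuming the per-layer feature maps have already been extracted (e.g., with a pretrained VGG backbone); note that the reference LPIPS implementation additionally normalizes features channel-wise and applies learned linear weights before averaging:

```python
import numpy as np

def lpips_like_distance(feats_x: list[np.ndarray], feats_y: list[np.ndarray]) -> float:
    """Perceptual distance per the simplified formula: sum over layers of (1/N_l)*||f_l(x)-f_l(y)||_2^2.

    feats_x / feats_y are lists of feature maps, one per network layer l,
    extracted from images x and y.
    """
    total = 0.0
    for fx, fy in zip(feats_x, feats_y):
        diff = fx - fy
        total += float(np.sum(diff ** 2)) / diff.size  # (1 / N_l) * squared L2 distance
    return total
```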
SSIM is an index that estimates the structural resemblance of two images, one serving as the undistorted, uncompressed reference and the other as the distorted image to be evaluated. The specific calculation is as follows:

$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x\mu_y + c_1\right)\left(2\sigma_{xy} + c_2\right)}{\left(\mu_x^2 + \mu_y^2 + c_1\right)\left(\sigma_x^2 + \sigma_y^2 + c_2\right)}$$

where the means, $\mu_x$ and $\mu_y$, represent the estimate of luminance; the standard deviations, $\sigma_x$ and $\sigma_y$, represent the estimate of contrast; the covariance, $\sigma_{xy}$, represents the evaluation of the degree of structural correspondence; and $c_1$ and $c_2$ are small constants that stabilize the division.
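A minimal global-statistics sketch of this formula in NumPy; the widely used SSIM variant averages the same expression over local sliding windows, and the constants follow the conventional choice $c_1=(0.01L)^2$, $c_2=(0.03L)^2$ for dynamic range $L$:

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, data_range: float = 255.0) -> float:
    """SSIM computed from global image statistics (single-window form of the formula)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()          # sigma_x^2, sigma_y^2
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # sigma_xy

    return float(((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2))
                 / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))
```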
The ultimate goal of generating these images is to augment the dataset for deep learning tasks. To evaluate the quality of the generated images, we utilize a U-Net network trained on the three datasets for semantic segmentation. The model’s accuracy is quantified using two metrics: Frequency-Weighted Intersection over Union (FWIoU) and Overall Accuracy (OA). If the generated images are highly realistic and closely resemble real images, a segmentation network trained on real images should accurately segment the generated outputs.
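As a reference for how these two segmentation metrics are computed, a small sketch from a confusion matrix follows; the helper names are illustrative, and the class count corresponds to the five label categories used here.

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, label: np.ndarray, num_classes: int = 5) -> np.ndarray:
    """Accumulate a num_classes x num_classes confusion matrix from label/prediction maps."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def overall_accuracy(cm: np.ndarray) -> float:
    """OA: correctly classified pixels over all pixels."""
    return float(np.diag(cm).sum() / cm.sum())

def frequency_weighted_iou(cm: np.ndarray) -> float:
    """FWIoU: per-class IoU weighted by each class's pixel frequency."""
    freq = cm.sum(axis=1) / cm.sum()
    iou = np.diag(cm) / (cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm))
    return float((freq[freq > 0] * iou[freq > 0]).sum())
```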
4.4. Hyperparameter Settings
The loss function used in MCGGAN combines the basic losses of the dual-branch network with two additional terms, the perceptual loss $L_{\mathrm{per}}$ and the texture matching loss $L_{\mathrm{tex}}$, each scaled by a weighting coefficient. For the weights of the basic losses, we adopt the empirical values used in LGGAN [54]. We perform ablation studies and a sensitivity analysis to investigate the impact of the perceptual loss and the texture matching loss on generation quality.
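The paper's exact formulations of these two terms are not reproduced here; a common way to realize a VGG-based perceptual loss and a Gram-matrix texture matching loss, which the sketch below assumes (along with a recent torchvision and an illustrative VGG19 cut-off layer), is:

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen, truncated VGG19 feature extractor; the cut-off layer is an illustrative choice.
_vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:21].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def _gram(feat: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a feature map: (B, C, H, W) -> (B, C, C), normalized by C*H*W."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_loss(fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    """L1 distance between VGG feature maps of generated and real images."""
    return F.l1_loss(_vgg(fake), _vgg(real))

def texture_matching_loss(fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    """L1 distance between Gram matrices of the VGG features (texture statistics)."""
    return F.l1_loss(_gram(_vgg(fake)), _gram(_vgg(real)))

def total_loss(basic: torch.Tensor, fake: torch.Tensor, real: torch.Tensor,
               lam_per: float, lam_tex: float) -> torch.Tensor:
    """Combine the precomputed dual-branch basic losses with the two auxiliary terms."""
    return basic + lam_per * perceptual_loss(fake, real) + lam_tex * texture_matching_loss(fake, real)
```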
The experimental results are shown in Table 1. The first three rows present the ablation study of the losses. As observed, introducing the losses leads to improvements in all metrics, indicating that both the perceptual loss and the texture matching loss contribute positively to model training.
After introducing the two losses, we conducted a sensitivity analysis on their weights. Fixing one weight and gradually increasing the other, we observed that the metrics initially improved and then gradually declined, with significant fluctuations. Fixing the second weight and gradually decreasing the first produced a similar trend, although the sensitivity to this weight was lower and the fluctuations were less pronounced. Based on the experimental results, we ultimately chose the best-performing combination of the two weights.
4.5. Ablation Experiments
Ablation experiments are conducted to decompose the generator model and evaluate how various structures influence image quality. This approach allows us to verify the contribution of each functional module within the MCGGAN generator to the enhancement of generated image quality.
The ablation experiments are structured around five schemes, as shown in Table 2 and Figure 5. Pix2Pix serves as the baseline model. Pix2Pix++ incorporates the perceptual loss ($L_{\mathrm{per}}$) and the texture matching loss ($L_{\mathrm{tex}}$) into Pix2Pix, augmenting the original Pix2Pix objective with these two weighted terms. DBGAN employs Pix2Pix as the global generator and adds the multi-class generator for different features to form the dual-branch generative model. DBGAN++ builds upon DBGAN by introducing the shared-parameter encoder, thereby balancing the training process. MCGGAN enhances the class generators by introducing the spatial decoder to form the final proposed model. In this context, DBGAN, DBGAN++, and MCGGAN all use the same augmented loss function as Pix2Pix++.
Ablation experiments are conducted on the Chongzhou and Wuzhen datasets. Table 3 presents the evaluation metrics for each scheme on the respective datasets.
Table 3 indicates that, compared to the baseline model Pix2Pix, Pix2Pix++ shows significant improvements on the Chongzhou dataset, with gains of 3.61% in FWIoU, 3.52% in OA, 0.0072 in LPIPS, 0.0607 in SSIM, and a remarkable 17.05 in FID. Similarly, on the Wuzhen dataset, FWIoU and OA improve by 2.12% and 1.39%, LPIPS by 0.0183, SSIM by 0.014, and FID by 12.27.
Figure 6b,c and Figure 7c illustrate that incorporating the VGG perceptual loss and the texture matching loss effectively mitigates issues in water generation. This enhancement also improves the model's capacity to learn color and texture, which is particularly evident in the extraction of urban building colors in the Wuzhen dataset.
When comparing DBGAN to Pix2Pix++ on the Chongzhou dataset, DBGAN demonstrates improvements of 1.60% in FWIoU and 1.67% in OA. However, LPIPS, SSIM, and FID worsen by 0.0124, 0.0036, and 10.43, respectively.
Figure 6c,d visually illustrate that DBGAN produces buildings with clearer outlines than Pix2Pix++. Moreover, DBGAN's road generation results feature contours and textures that more closely resemble real road characteristics. On the Wuzhen dataset, DBGAN outperforms Pix2Pix++ with a 2.29% increase in FWIoU and a 1.63% increase in OA. Conversely, LPIPS, SSIM, and FID worsen by 0.0186, 0.0105, and 23.72, respectively. Figure 7c,d show that DBGAN enhances the generation of colors and contours for small-scale buildings in the Wuzhen dataset. These results illustrate that the dual-branch structure, together with the multi-class generator, can effectively enhance the model's ability to generate underrepresented land-cover classes.
Table 4 shows the complexity of the different modules. With the introduction of the shared-parameter encoder, DBGAN++ successfully addresses the issues present in DBGAN. Although the shared-parameter encoder increases the computational cost, DBGAN++ achieves a lower overall loss than DBGAN and converges significantly faster. This indicates that the shared-parameter encoder balances the training of the two generators, effectively accelerating convergence and reducing training difficulty. In terms of image quality, DBGAN++ also achieves significant improvements. As indicated by the metrics in Table 3, DBGAN++ improves LPIPS, SSIM, and FID by 0.0178, 0.0235, and 30.85 on the Chongzhou dataset, and by 0.0290, 0.0247, and 57.58 on the Wuzhen dataset, respectively. Visual comparisons in Figure 6d and Figure 7d demonstrate that DBGAN++ effectively mitigates the mode collapse and noise issues found in DBGAN, resulting in images that closely resemble real ones. However, the introduction of the shared-parameter encoder causes a slight decrease in FWIoU and OA: on the Chongzhou dataset they decrease by 0.36% and 0.1%, and on the Wuzhen dataset by 2% and 0.73%. This reduction can be attributed to the interference introduced during the convolution process of the shared-parameter encoder, which complicates the multi-class generator's ability to generate specific categories.
MCGGAN leverages the spatial decoder to balance the influences among the class generators, which significantly enhances the performance of the multi-class generator. As shown in Figure 8, DBGAN++ converges faster but exhibits some fluctuations in the later stages of training, suggesting potential instability. MCGGAN shows the best stability, with the loss decreasing steadily and minimal fluctuations. Stable training leads to higher generation quality, and MCGGAN achieves the best overall metrics. On the Chongzhou dataset, compared to DBGAN++, MCGGAN's FWIoU and OA improve by 6.3% and 5.29%, respectively; LPIPS improves by 0.0178, SSIM by 0.0015, and FID by 30.85. Compared to the baseline model Pix2Pix, MCGGAN demonstrates improvements of 11.24% and 10.38% in FWIoU and OA, respectively; LPIPS improves by 0.0297, SSIM by 0.0821, and FID by 52.86. On the Wuzhen dataset, MCGGAN outperforms DBGAN++ with improvements of 6.35% and 3.88% in FWIoU and OA, respectively; LPIPS improves by 0.0432, SSIM by 0.0049, and FID by 41.26. Compared to the baseline model Pix2Pix, MCGGAN shows improvements of 8.76% and 6.17% in FWIoU and OA, respectively; LPIPS improves by 0.0719, SSIM by 0.0331, and FID by 87.29.
Visually, MCGGAN demonstrates significant enhancements in generating building outlines on the Chongzhou dataset. As illustrated in the first row of Figure 6, the model produces more realistic representations of complex residential buildings, while the fourth row shows improved generation of factory buildings. Additionally, MCGGAN effectively captures vegetation textures and rural buildings that closely resemble real remote-sensing images, as seen in the fifth row of Figure 6. The model also excels in generating roads and complex backgrounds, aligning better with the inherent characteristics of these features, as highlighted in the second and third rows of Figure 6.
In summary, the improvements provided by MCGGAN not only ensure network stability but also enhance the generation of fine details across various land-cover categories. This results in a higher quality of generated samples for underrepresented land-cover classes. Moreover, MCGGAN strengthens the depiction of complex features like building outlines, leading to samples that more closely match real remote-sensing images.
On the Wuzhen dataset, MCGGAN’s superior understanding of global context enables it to generate diverse remote-sensing images, reflecting both lush spring/summer scenes and darker autumn/winter tones based on the layout features of the semantic image.
However, DBGAN exhibits certain shortcomings, as evidenced by the generated images depicted in Figure 9. Specifically, the red dashed box in Figure 9a highlights a texture replication issue in the generated Chongzhou image, while the white dashed box in Figure 9b indicates the presence of noise in the generated Wuzhen image. These problems primarily arise from the difficulty of maintaining a balance between the global generator and the multi-class generator during training.
4.6. Comparison Experiments
In order to further verify the effectiveness of the MCGGAN model, we compare it with Pix2PixHD [37], DAGAN [39], DPGAN [40], the stable diffusion model [14], LGGAN [54], and Lab2Pix-V2 [55] on the Chongzhou, Wuzhen, and LoveDA datasets. The evaluation metrics of the experimental results are shown in Table 5, Table 6 and Table 7. To visualize the generation quality of the different models, representative result images are shown in Figure 10, Figure 11 and Figure 12.
The experimental results demonstrate that MCGGAN achieves the highest accuracy on the Chongzhou dataset, with FWIoU and OA metrics consistently outperforming those of existing models. This suggests that MCGGAN’s generated images are more realistic and reliable, contributing to enhanced segmentation performance. Additionally, the FID, LPIPS and SSIM scores for MCGGAN-generated images show significant improvements, indicating that these images align more closely with real remote-sensing data in both overall distribution and individual characteristics.
As shown in Figure 10, for the complex buildings in Chongzhou, MCGGAN generates images with clearer, more defined contours than other models, while maintaining better consistency in intra-class information for the same semantic label. Furthermore, MCGGAN excels in generating realistic textures for less common features, such as water bodies and roads. In terms of background and vegetation, MCGGAN offers richer color and texture details, resulting in generated images that are both more realistic and trustworthy.
In the Wuzhen dataset, land-cover types with high semantic similarity, such as water and vegetation, are often confused by other models. MCGGAN, however, accurately distinguishes between them.
Table 6 indicates that images generated by MCGGAN significantly surpass existing methods across various metrics. The generation metrics reveal that MCGGAN-generated images closely mimic real data in terms of style and distribution. Moreover, the superior FWIoU and OA metrics suggest that the generated images offer more relevant information for the U-Net segmentation network. MCGGAN’s advantage lies in its ability to produce vegetation with rich color and texture details, while also excelling in generating features with smaller sample sizes, such as buildings (6.50% of the sample) and roads (2.51% of the sample).
Despite the significant land-cover style differences and sensor inconsistencies in the LoveDA dataset, MCGGAN still achieves superior performance. This is due to the dual-branch architecture, in which the multi-class generator focuses on targeted generation for individual categories, compensating for the global generator's limited learning capability on complex datasets. As shown in Figure 12, MCGGAN excels at generating vegetation, background, and other extensive land-cover types. Its advantages are particularly evident in generating buildings. The LoveDA dataset includes both urban and rural scenarios, featuring diverse architectural styles ranging from low-rise houses in rural areas to high-rise buildings in urban settings. Other models fail to generate buildings effectively, with building contours blending into the background and internal details appearing chaotic. In contrast, MCGGAN successfully captures the structure and style of buildings, accurately delineating contours and preserving internal features.
Comparison experimental results across three datasets demonstrate that MCGGAN exhibits significant advantages over other GAN networks. Compared to the advanced dual-branch network LGGAN, MCGGAN improves the FID metric by 28% and the SSIM metric by 19%. In terms of generated image quality, the multi-class generator equipped in MCGGAN not only produces high-quality images for underrepresented classes (e.g., buildings and roads) but also effectively distinguishes land-cover types with high semantic similarity (e.g., vegetation and water bodies).
Compared to diffusion models, MCGGAN also shows strong performance in scenarios with limited sample sizes. For instance, MCGGAN outperforms the stable diffusion model with a 19% improvement in FID and a 15% improvement in SSIM. Diffusion models, constrained by their complex noise addition and removal processes, face increased computational costs during training and inference, making it difficult to fully leverage their strengths when only a few hundred training samples are available.