1. Introduction
Remote sensing image generation holds significant importance in fields such as Earth observation, resource management, and environmental protection. When the cost of obtaining actual data is high or there is a lack of data, generating high-quality remote sensing images can provide essential support for modeling and analysis. Such image generation can not only enhance the application of few-shot learning in remote sensing but also expand the coverage of remote sensing tasks, including disaster monitoring, land use analysis, and ecosystem change tracking. With the rapid development of generative models such as generative adversarial networks (GANs) and diffusion models, remote sensing image generation technology has made significant progress in improving generation accuracy and efficiency. However, current technologies still face challenges such as the realism of generated images, resolution enhancement, and semantic accuracy. Therefore, there is a need for improved remote sensing image generation techniques to provide high-quality image data for remote sensing image processing tasks. Currently, there are three main architectural approaches to image generation methods: autoencoders, generative adversarial networks (GANs), and diffusion models.
Autoencoder neural networks, introduced by David E. Rumelhart et al. [
1], are unsupervised learning algorithms that can be used for high-dimensional complex data processing. In 2010, researchers [
2] applied autoencoders to image reconstruction and feature learning, marking the first use of autoencoder networks for image generation. Diederik P. Kingma et al. [
3] proposed the variational autoencoder (VAE), a probabilistic generative model, which achieved significant progress in image generation and latent space interpolation. VAE has since become a popular method in image generation models. However, due to the limitations of the latent space and the insufficient handling of image space consistency, the realism and detail of images generated using autoencoder-based models often fall short compared to other generative models. Therefore, traditional autoencoders usually need to be combined with other generative models to better meet the needs of remote sensing image generation. For instance, David Berthelot et al. [
4] introduced boundary equilibrium GANs (BEGANs), which balance image quality and diversity using both autoencoder loss and generative adversarial network loss.
Since their introduction by Ian J. Goodfellow et al. [
5] in 2014, generative adversarial networks (GANs) have been widely applied in image generation. The principle behind GANs involves two adversarial neural networks (a generator and a discriminator) that compete with each other. This competition drives the generator to continuously improve its generated images to the point where the discriminator has difficulty distinguishing between the real and generated images, thereby enhancing the quality of the generated images. However, early GANs lacked controllability in their generated images, and the training process was unstable. Although conditional generative adversarial networks (cGANs) [
6] introduced conditional generation for specific types of images, thereby improving controllability, they still struggled to capture image details when generating complex types of objects, often resulting in blurriness and distortion.
The introduction of diffusion models has provided new solutions in the field of image generation. The concept of diffusion models, proposed by Jascha Sohl-Dickstein et al. [
7], is inspired by non-equilibrium statistical physics. The core idea involves systematically disrupting the data distribution structure through an iterative forward diffusion process and then learning the reverse diffusion process to restore the data structure, resulting in a highly flexible and manageable generative model. Jonathan Ho et al. [
8] advanced this concept with the denoising diffusion probabilistic model (DDPM), which learns the data distribution through a progressive noise injection process and generates new samples through a reverse denoising process. Subsequent research by Prafulla Dhariwal et al. [
9] demonstrated that diffusion models outperform generative adversarial networks in the field of image generation. This approach excels in generating image details, diversity, and realism, and it has gradually become a focus of research in remote sensing image generation.
Although models like OpenAI’s DALL·E 2 [
10] and Robin Rombach et al.’s Stable Diffusion model [
11] have made significant advancements in image generation, enabling the creation of images in various styles (e.g., anime, oil painting) based on text prompts and keywords, research focused on remote sensing image generation remains relatively scarce. Existing methods for remote sensing image generation, such as the multi-stage structured generative adversarial network (StrucGAN) proposed by Rui Zhao et al. [
12], still face challenges like low semantic matching between generated images and textual descriptions, spatial structure errors, and poor fidelity in generated images. Consequently, remote sensing image generation encounters unique challenges, such as high resolution, multi-scale features [
13], complex backgrounds, and a lack of annotated data. Specifically, tasks involving remote sensing images demand high resolution and richness in detail. Additionally, the complex and diverse backgrounds found in remote sensing images (e.g., urban areas, forests, and bodies of water) require models to accurately understand and describe these semantic details [
14].
To address the aforementioned challenges, this paper employs a method combining diffusion models with the Transformer architecture [
15] based on the VQ-Diffusion model [
16]. The model is modified and trained to obtain the text-to-remote-sensing image generation model RSVQ-Diffusion. Experimental results demonstrate that the modified model ensures the quality of the generated remote sensing images, producing spatially coherent images with significant improvements in diversity and realism. Moreover, it alleviates image distortion and improves semantic accuracy. The specific contributions of this work are as follows:
The architecture combining diffusion models with the Transformer is applied to remote sensing image generation. To address the spatial characteristics of remote sensing images, spatial position encoding is incorporated into the model’s image decoder, enhancing the Transformer model’s ability to capture the overall spatial positional information of remote sensing images during sequence processing (an illustrative sketch is given after this list);
The concept of local feature extraction from sequences [
17] is integrated with the self-attention mechanism of the Transformer, leading to the proposed TransLocalBlock module. By controlling the sequence length and combining it with the multi-head attention mechanism, this module enables the Transformer to focus on local information during the processing of long sequences;
We designed ablation experiments and compared the proposed model with existing text-to-remote-sensing image generation models. Through the evaluation metrics and the improvements in aspects such as the spatial structure, realism, and details of the generated images, the effectiveness of the proposed method is demonstrated.
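As an illustration of the first contribution, the following sketch shows one way a two-dimensional (row/column) positional embedding can be added to the flattened image-token sequence before it enters the Transformer decoder. The class name SpatialPositionalEncoding, the grid size, and the embedding width are illustrative assumptions and do not reproduce the exact RSVQ-Diffusion implementation.

```python
import torch
import torch.nn as nn

class SpatialPositionalEncoding(nn.Module):
    """Illustrative 2D positional encoding for a flattened grid of image tokens.

    Assumes the image has been quantized into an (h x w) grid of tokens that the
    Transformer processes as a sequence of length h * w.
    """
    def __init__(self, height: int, width: int, embed_dim: int):
        super().__init__()
        # Separate learnable embeddings for the row and column coordinates.
        self.row_embed = nn.Embedding(height, embed_dim)
        self.col_embed = nn.Embedding(width, embed_dim)
        self.height, self.width = height, width

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, h * w, embed_dim)
        device = token_embeddings.device
        rows = torch.arange(self.height, device=device)
        cols = torch.arange(self.width, device=device)
        # Build an (h, w, embed_dim) grid of position codes and flatten it.
        pos = self.row_embed(rows)[:, None, :] + self.col_embed(cols)[None, :, :]
        pos = pos.reshape(self.height * self.width, -1)
        return token_embeddings + pos.unsqueeze(0)

# Usage: a 32 x 32 token grid with 512-dimensional embeddings.
tokens = torch.randn(2, 32 * 32, 512)
pos_enc = SpatialPositionalEncoding(32, 32, 512)
out = pos_enc(tokens)  # same shape, now carrying row/column information
```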
The organization of the paper is as follows:
Section 2 describes various studies related to the generation of remote sensing images from text.
Section 3 provides a detailed description of the proposed model’s improvement methods.
Section 4 outlines the datasets and evaluation metrics used in the experiments, followed by a comprehensive presentation and analysis of the experimental results. Finally,
Section 5 summarizes the proposed method and offers an outlook for future work.
2. Related Works
In recent years, generative adversarial networks (GANs) and diffusion models have been continuously applied in the field of image generation. Text-to-remote-sensing image generation methods based on diffusion models have also played a significant role in remote sensing image processing. The earliest application of GANs in remote sensing image processing was proposed by Zhu Lin et al. [
18], who introduced two GAN frameworks, a spectral classifier and a spectral–spatial classifier, for hyperspectral remote sensing image classification tasks. To address the issue of low resolution in generated remote sensing images, Boyu Pang et al. [
19] proposed a conditional GAN for remote sensing image super-resolution, which guides the generator to produce corresponding high-resolution images based on input low-resolution images. To generate images with details that better match textual descriptions, AttnGAN [
20] introduces an attention mechanism that aligns the focus points of text descriptions with the corresponding regions of generated images and applies local attention at different generation stages. Later, AttnGAN was improved by incorporating a multi-stage generation strategy to incrementally introduce image detail features and by structurally processing text descriptions to capture fine-grained textual information. Despite improvements in the quality of generated remote sensing images, issues such as blurriness, distortion, or deformation in certain areas still persist.
Subsequently, researchers have progressively applied diffusion models to the field of remote sensing image generation, often combining semantic segmentation networks like U-Net with attention mechanisms to achieve the diffusion process. The U-Net network extracts image features, while the attention mechanism computes the weights of the input conditional information to establish the matching relationship between images and their text. Among these, Samar Khanna et al. [
21] proposed the DiffusionSat model, an improved version of the Stable Diffusion model that enables remote sensing image generation guided by metadata and text. Datao Tang et al. [
22] introduced ConcededNet based on improved diffusion models, which allows multi-condition inputs to control remote sensing image generation. Its adaptive feature extraction module dynamically adjusts the generation strategy according to different input conditions, further enhancing image quality and diversity.
Existing image generation models that commonly use convolutional neural networks (CNNs) are limited by the size of the convolutional kernels, hindering the extraction and learning of detailed image features. In contrast, the Transformer architecture, initially designed for natural language processing, can learn image features on a pixel-by-pixel basis. It also employs positional encoding to mark positions in the unfolded one-dimensional sequence of images, facilitating the generation of images rich in details. Yonghao Xu et al. [
23] combined the Transformer architecture with energy models by leveraging two pre-trained models, VQVAE [
24] and VQGAN [
25]. They proposed a Hopfield network to maintain and update states during image generation, enhancing consistency between the generated images and textual descriptions, thereby resulting in remote sensing images with abundant details. The VQ-Diffusion model integrates the principles of diffusion and the Transformer architecture, demonstrating outstanding performance in general image generation. Therefore, this study uses VQ-Diffusion as the benchmark model for remote sensing image generation. To address issues such as unclear details, unreasonable spatial structures, and regional distortions in generated images, the model’s structure has been improved, significantly enhancing the quality of the generated remote sensing images.
4. Experimental Results and Analysis
To validate the effective application and performance improvement of the proposed RSVQ-Diffusion model in the field of remote sensing image generation, both the original model and the RSVQ-Diffusion model were trained on publicly available remote sensing image datasets. Additionally, ablation experiments were designed to verify the improvement effects, and comparisons were made with existing text-to-remote-sensing image generation models.
4.1. Datasets and Evaluation Metrics
The Remote Sensing Image Caption Dataset (RSICD) [
28] is a commonly used dataset in the field of multimodal learning for remote sensing. It contains 10,921 image–text pairs, covering 30 categories such as playgrounds, bridges, beaches, deserts, and cities. As a critical resource for training text-to-remote-sensing image models, it possesses significant academic value in the field of remote sensing image understanding due to its large scale, rich semantic descriptions, and multi-source image acquisition characteristics. This dataset not only provides 10,921 images along with five corresponding descriptions for each image, enhancing the model’s understanding and generalization of complex remote sensing scenes, but also presents challenges for accurate classification and recognition due to its high intra-class diversity and low inter-class variability. Moreover, the RSICD supports a multi-task evaluation framework, including but not limited to image classification, retrieval, and object counting, making it a comprehensive benchmark for evaluating model performance. Cross-dataset comparative studies further expand its academic application scope, and performance improvements of fine-tuned models on this dataset, such as the significant increase in top-1 accuracy of the CLIP model [
29], further demonstrate the important role of the RSICD in advancing intelligent remote sensing image analysis technologies. Therefore, this paper selects the RSICD as the training dataset for the text-to-remote-sensing image generation model, with the training, validation, and test sets divided in an 8:1:1 ratio.
The evaluation metrics used in this study to assess the quality of generated images are commonly used in the field of image generation: Fréchet Inception Distance (FID) [
30], Contrastive Language-Image Pre-training (CLIP) score, and Inception Score (IS) [
31]. FID primarily measures the distribution distance between real and generated image features. First, the generated images are embedded into the latent feature space of the selected layer of the inception network, and the embeddings of the generated and real images are treated as two continuous multivariate Gaussian samples to facilitate the calculation of their means and covariances. The calculation process is detailed in Equation (11).
$$\mathrm{FID} = \left\lVert \mu_r - \mu_g \right\rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right) \tag{11}$$
where $\mu_r$ and $\Sigma_r$ represent the mean and covariance of the feature distribution of the real images, and $\mu_g$ and $\Sigma_g$ represent the mean and covariance of the feature distribution of the generated images. $\lVert \mu_r - \mu_g \rVert_2^2$ denotes the Euclidean distance between the means of the feature distributions of the real and generated images, and $\mathrm{Tr}(\cdot)$ denotes the trace of the sum of the covariance matrices, which is the sum of the diagonal elements of the square matrix.
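To make Equation (11) concrete, the following Python sketch computes the FID from pre-extracted Inception features; the arrays real_feats and gen_feats are illustrative placeholders rather than part of our implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Compute FID between two sets of Inception features (Equation (11))."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; discard the tiny imaginary
    # parts introduced by numerical error.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))

# Hypothetical usage with 2048-dimensional Inception-v3 pooling features.
real_feats = np.random.randn(500, 2048)
gen_feats = np.random.randn(500, 2048)
print(frechet_inception_distance(real_feats, gen_feats))
```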
CLIP is used to evaluate the semantic alignment between generated images and their corresponding input semantic information. It measures their correlation by calculating the cosine similarity between image and text embeddings. The model converts images and texts into high-dimensional vectors, and the similarity between these vectors is calculated to obtain the CLIP score, as detailed in Equation (12).
$$\mathrm{CLIP\;score} = w \cdot \frac{E_I \cdot E_T}{\lVert E_I \rVert \,\lVert E_T \rVert} \tag{12}$$
where $E_I$ represents the feature vector extracted by the image encoder; $E_T$ represents the feature vector extracted by the text encoder; $\lVert E_I \rVert$ and $\lVert E_T \rVert$ are the L2 norms of the respective feature vectors (normalizing them to unit length); and $w$ is the scaling parameter used to adjust the score range.
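As a concrete illustration of Equation (12), the sketch below computes the CLIP score from pre-computed image and text embeddings. The default scaling parameter w = 2.5 follows a common convention and is an assumption here, as are the randomly generated embeddings.

```python
import torch
import torch.nn.functional as F

def clip_score(image_emb: torch.Tensor, text_emb: torch.Tensor, w: float = 2.5) -> torch.Tensor:
    """Scaled cosine similarity between paired image and text embeddings (Equation (12))."""
    image_emb = F.normalize(image_emb, dim=-1)  # divide by the L2 norm
    text_emb = F.normalize(text_emb, dim=-1)
    cos = (image_emb * text_emb).sum(dim=-1)    # cosine similarity per pair
    return w * cos

# Hypothetical 512-dimensional CLIP embeddings for a batch of 8 image-text pairs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_score(img, txt).mean())
```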
The Inception Score (IS) is a widely used metric for evaluating the performance of generative models. It employs a pre-trained Inception v3 network to classify generated images, assessing the clarity and diversity of these images. Specifically, the IS measures the network’s confidence in the classification of generated images and evaluates the diversity of the generated samples. A higher IS indicates higher quality images and greater diversity among the images. The IS calculation involves obtaining the classification probabilities for each generated image and then measuring the Kullback–Leibler (KL) divergence between the conditional distribution for each image and the marginal distribution of all images. The calculation process is shown in Equation (13).
$$\mathrm{IS} = \exp\left(\frac{1}{N}\sum_{i=1}^{N} D_{\mathrm{KL}}\left(p\left(y \mid x_i\right) \,\Vert\, p\left(y\right)\right)\right) \tag{13}$$
where $N$ represents the number of generated remote sensing images; $p(y \mid x_i)$ is the conditional probability distribution of category $y$ given image $x_i$; $p(y)$ is the marginal distribution over all generated images, indicating the diversity of categories in the generated set; and $D_{\mathrm{KL}}(p(y \mid x_i) \Vert p(y))$ denotes the Kullback–Leibler divergence between the conditional distribution of image $x_i$ and the marginal distribution of all images.
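The following sketch computes Equation (13) from the softmax class probabilities produced by a pre-trained Inception v3 network; the probability array below is a random placeholder used only for illustration.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Compute the Inception Score from class probabilities (Equation (13)).

    probs: (N, num_classes) softmax outputs of Inception v3 for N generated images.
    """
    marginal = probs.mean(axis=0, keepdims=True)             # p(y)
    kl = probs * (np.log(probs + eps) - np.log(marginal + eps))
    kl_per_image = kl.sum(axis=1)                            # D_KL(p(y|x_i) || p(y))
    return float(np.exp(kl_per_image.mean()))

# Hypothetical probabilities for 100 generated images over 1000 classes.
logits = np.random.randn(100, 1000)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(inception_score(probs))
```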
4.2. Experimental Setup
The VQ-Diffusion model, based on the diffusion principle and the Transformer network architecture, has hundreds of millions of parameters and requires a large dataset for training to effectively learn image features. However, remote sensing images are difficult to acquire and limited in quantity, which cannot fully support the model’s learning of image features. The Conceptual Captions dataset [
32] consists of a large number of natural images and their descriptive texts, covering a wide range of scenes and object types with high visual diversity. Although this dataset is derived from natural images, the visual features it contains, such as object shapes, lighting, and perspective variations, have certain similarities to the common land-cover types and scenes found in remote sensing images. Therefore, pre-training with the Conceptual Captions dataset can help the model learn general visual features and provide effective initialization for training with remote sensing images. Especially in situations where remote sensing data are scarce, cross-domain knowledge transfer can significantly improve model performance and accelerate the training process. Hence, this study uses the pre-trained VQ-Diffusion model trained on the Conceptual Captions dataset to initialize the training of the VQ-Diffusion model on remote sensing datasets.
In this experiment, efficient hardware and software configurations were employed for the training process. The hardware platform included an Intel Xeon Platinum 8352S CPU, 512 GB of memory, and an NVIDIA RTX 4090 GPU (Santa Clara, CA, USA) to ensure efficient computation while training large models. The operating system was Windows 11, and the deep learning framework was PyTorch 1.12.0. To ensure the stability and efficiency of the training process, the AdamW optimizer was selected; it helps prevent overfitting through decoupled weight decay regularization. Additionally, the ReduceLROnPlateauWithWarmup learning rate scheduler was used, which automatically adjusts the learning rate based on the performance on the validation set, ensuring stable convergence across the different training stages. A weight decay value of 4.5 × 10⁻² was chosen to control overfitting and help the model achieve better generalization during training. For the training parameters, after multiple adjustments to determine the optimal settings, a batch size of 4 was chosen to fully utilize the GPU memory, and the number of epochs was set to 100 to ensure the model could learn effective image features over sufficient training cycles. The initial learning rate was set to 0.3 × 10⁻⁶ with a minimum learning rate of 1.0 × 10⁻⁶ to ensure rapid learning in the early stages and fine-tuning with a smaller learning rate in the later stages.
Details regarding the model training environment and parameter settings are provided in
Table 1 and
Table 2.
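The following minimal PyTorch sketch mirrors the optimizer and learning-rate settings described above. The warmup phase of the scheduler is omitted, its factor and patience values are assumptions, and the placeholder model and validation loss stand in for the actual network and evaluation pass.

```python
import torch

# Placeholder module standing in for the RSVQ-Diffusion network.
model = torch.nn.Linear(512, 512)

# AdamW with the reported initial learning rate and weight decay of 4.5e-2.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.3e-6, weight_decay=4.5e-2)

# Approximation of the plateau scheduler with the reported minimum learning
# rate of 1.0e-6; factor and patience are illustrative choices.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5, min_lr=1.0e-6
)

for epoch in range(100):
    # Dummy validation loss standing in for the real evaluation pass.
    val_loss = 1.0 / (epoch + 1)
    scheduler.step(val_loss)
```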
The VQ-Diffusion model, which combines diffusion principles and the Transformer architecture, has significant computational complexity. The diffusion model generates images through multiple iterative steps, with each iteration (denoising step) requiring complex calculations using the Transformer’s self-attention mechanism. Consequently, the training process demands substantial computational resources and longer training times. To accelerate training and reduce memory consumption, automatic mixed precision (AMP) training was employed during the model’s training on the RSICD, taking approximately 52 h on a single RTX 4090 GPU. Although the RSVQ-Diffusion model introduces an additional spatial positional encoding step, its TransLocalBlock module replaces the original global self-attention with a local awareness mechanism that divides long sequences into multiple short sequences and processes them in parallel through multi-head attention. Thus, the proposed improvements enhance the quality of the generated images while having minimal impact on the model’s training and inference speed.
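To illustrate the local awareness idea described above, the following sketch restricts multi-head self-attention to fixed-length windows of the token sequence and processes all windows in parallel. The class name LocalWindowAttention and the window size of 64 tokens are illustrative assumptions rather than the exact TransLocalBlock implementation.

```python
import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    """Multi-head self-attention restricted to fixed-length local windows."""
    def __init__(self, embed_dim: int, num_heads: int, window: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim); seq_len is assumed divisible by window.
        b, n, d = x.shape
        w = self.window
        # Fold windows into the batch dimension so they are attended to in parallel.
        x = x.reshape(b * n // w, w, d)
        out, _ = self.attn(x, x, x)          # attention only within each window
        return out.reshape(b, n, d)

# Usage: a 1024-token image sequence processed in windows of 64 tokens.
tokens = torch.randn(2, 1024, 512)
local_attn = LocalWindowAttention(embed_dim=512, num_heads=8, window=64)
out = local_attn(tokens)  # shape (2, 1024, 512)
```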
4.3. Ablation Experiments
We conducted a series of ablation experiments on the RSICD, and the results are detailed in
Table 3, which lists the changes in model performance under different experimental settings. We evaluated the impacts of spatial positional encoding, the local awareness mechanism, and their combination. The model was quantitatively assessed using FID and CLIP scores. The results indicate that incorporating spatial positional encoding (VQ-Diffusion_SP) and the local awareness mechanism (VQ-Diffusion_LMA) effectively enhances model performance. The combination of both (RSVQ-Diffusion) yielded the most significant improvements. Specifically, the data in
Table 3 demonstrate the improvements in image generation quality for each setting, further validating the effectiveness of these strategies in enhancing the performance of the VQ-Diffusion model.
Using the VQ-Diffusion model as a baseline, it can be observed that the VQ-Diffusion_SP model shows a significant reduction in FID compared to the baseline model. The VQ-Diffusion_LMA model achieves an even greater reduction, while the proposed improved model, RSVQ-Diffusion, achieves the largest reduction in FID relative to the original model. Regarding the CLIP text–image matching score, although the VQ-Diffusion_SP and VQ-Diffusion_LMA models exhibit similar performance, the RSVQ-Diffusion model shows a noticeable increase in this metric. Additionally, all three improved models show an increase in the IS evaluation metric compared to the baseline model. Although the IS improvement of RSVQ-Diffusion is not as large as that of VQ-Diffusion_LMA, considering the performance across the other metrics, the RSVQ-Diffusion model stands out with the best overall effectiveness.
Figure 5 presents eight groups of text descriptions (a)–(h) and their corresponding generated images. The improvements made to the model to account for the spatial characteristics of remote sensing images, as well as the introduction of the local awareness mechanism, have significantly enhanced the image generation quality. The RSVQ-Diffusion model demonstrates high-resolution and more stable image generation capabilities.
Specifically, for group (a), the text description is “This theme park has a lake and some buildings near the river”. The VQ-Diffusion model fails to capture the keyword “river”. The images generated by the VQ-Diffusion_SP and VQ-Diffusion_LMA models show enhanced spatial positioning and details. The RSVQ-Diffusion model effectively generates images that include buildings, a lake, and a river, with reasonable ground features and clear boundaries. The differences from the real images highlight their diversity. For group (b), the text keywords are “building”, “river”, “tree”, and “overpass”. The VQ-Diffusion model shows deformed overpasses, missing buildings, and blurred details. The VQ-Diffusion_SP model generates features that match the text description, but the overpass appears deformed. The VQ-Diffusion_LMA model produces a reasonably shaped overpass but lacks the semantic information of the “river”. The RSVQ-Diffusion model accurately places the overpass, with the buildings and trees reasonably structured. The river detail in the upper right corner is slightly blurred. For group (c), the VQ-Diffusion model generates distorted details; the VQ-Diffusion_SP model’s mountain textures are blurred, and the VQ-Diffusion_LMA model generates mountains with reasonable colors but with a chaotic distribution. The RSVQ-Diffusion model produces mountains with continuous textures, naturally transitioning to the brown exposed ground, with reasonable spatial structure. For group (d), the RSVQ-Diffusion model generates farmland with semi-regular geometric shapes, better color transitions, and better spatial distribution compared to the original model’s generated images. From groups (e) to (h), the images generated by the VQ-Diffusion model exhibit target deformations, twisted spatial structures, and unreasonable color distributions, failing to reflect the corresponding textual information. The VQ-Diffusion_SP and VQ-Diffusion_LMA models generate images that reflect improvements in spatial structure and local details. The RSVQ-Diffusion model produces images with complete lake structures and clear boundaries, factory buildings that align with real target scene distributions and are neatly arranged, realistic tree colors, and well-organized oil tanks in the factories. The overall image quality generated by the RSVQ-Diffusion model surpasses the other three models.
Based on the evaluation metrics from the ablation experiments and the generated images, it is evident that the image generation capabilities of the RSVQ-Diffusion model significantly surpass those of the VQ-Diffusion model. The two improved versions, VQ-Diffusion_SP and VQ-Diffusion_LMA, each demonstrate visual enhancements that align with the principles of their respective modules. Although there are still differences between the images generated by the RSVQ-Diffusion model and real images, the generated images from the improved model show clear target details and spatial consistency, making the visuals more reasonable. They effectively reflect the scenes described in the text.
4.4. Comparison Experiments
A comparative analysis was conducted between the RSVQ-Diffusion model and the state-of-the-art text-to-remote-sensing image generation model, Txt2Img-MHN, using the RSICD test set. The evaluation employed FID, CLIP, and IS as the assessment metrics to quantitatively analyze the generated results. The comparative experiment results are shown in
Table 4.
It can be observed that the RSVQ-Diffusion model outperforms the Txt2Img-MHN model in both metrics, and it also shows improvements compared to the VQ-Diffusion model. This indicates that the RSVQ-Diffusion model excels in remote sensing image generation quality, text–image matching, and diversity, meeting the practical application needs of the remote sensing image processing field.
Figure 6 showcases eight groups (a)–(h) of images generated by four models: Txt2Img-MHN with pre-trained models VQVAE and VQGAN, VQ-Diffusion, and RSVQ-Diffusion. Visually, although the RSVQ-Diffusion model’s generated images exhibit minor issues such as irregular marking lines on the football field in group (b) and missing demarcated areas in the bottom left of the baseball field in group (f), it outperforms the other three models in terms of ground feature details and boundaries, text–image alignment, and overall visual quality. The RSVQ-Diffusion model’s images are closer to real images.
For instance, in group (a), with the text “A bridge is built over the river”, the image generated by the Txt2Img-MHN (VQVAE) model has an unclear lower left edge of the bridge, making it difficult to discern ground features, and it fails to capture the local structural features of the image. The Txt2Img-MHN (VQGAN) model’s image shows a deformed and blurred bridge with no clear distinction from the background. The VQ-Diffusion model’s image has relatively clear boundaries but shows deformation at the upper right edge of the bridge and distortion in the bridge area. The RSVQ-Diffusion model’s image, however, presents a complete bridge structure with clear boundaries and shadows beneath the bridge, enhancing realism and spatial coherence. In group (b), the Txt2Img-MHN models generate images with blurred and deformed stadium boundaries and indistinguishable details. The VQ-Diffusion model’s stadium edge has an unreasonable shape, and the surrounding buildings are deformed. The RSVQ-Diffusion model generates a clear image of the stadium with a visible central circle and clear, reasonable boundaries. In group (c), the Txt2Img-MHN models produce images with basic spatial structures but blurred details, while the VQ-Diffusion model’s left black area has an unnatural color texture distribution. The RSVQ-Diffusion model generates images with a reasonably structured river and clear details. In group (d), the Txt2Img-MHN models’ images have blurred building boundaries and overlapped ground features. The VQ-Diffusion model’s image shows deformed and distorted building boundaries, while the RSVQ-Diffusion model generates images with neatly arranged buildings and clear boundaries, displaying visually realistic images. In groups (e) to (h), other models’ images exhibit blurred details, deformed ground features, twisted boundaries, and missing target areas. The RSVQ-Diffusion model’s images show a natural texture distribution at the beach–sea boundary, orderly structured baseball fields and buildings, a consistent spatial distribution of residential areas and roads, and clear, reasonable surrounding ground features of swimming pools. These target areas are rich in detail, have realistic colors, and have different regional structures that match the distribution of real remote sensing images, clearly outperforming the other models.
Based on the evaluation metrics and generated images from the comparative experiments, it is evident that the RSVQ-Diffusion model significantly outperforms the other three models. The images generated by the RSVQ-Diffusion model exhibit clearer ground features and boundaries, with minimal blurring and distortion. The spatial structure is more consistent, with reasonable positional relationships between ground features, avoiding overlapping and misalignment issues. Visually, the images more closely resemble real images with natural textures and materials. Additionally, the semantic information satisfies the scenes described in the input text.
5. Conclusions
The proposed RSVQ-Diffusion model achieves high-quality remote sensing image generation that aligns well with input semantic information. By incorporating the spatial characteristics of remote sensing images into the diffusion image decoder and employing the sequential local perception mechanism integrated with the Transformer architecture, the RSVQ-Diffusion model effectively generates target remote sensing images based on the input text. The generated images exhibit clear local details, reasonable spatial structure, and closely align with remote sensing image characteristics, achieving both controllability and realism. Additionally, we conducted a comprehensive comparative analysis to evaluate the impact of various improvements on the model’s image generation quality. The experimental results indicate that the RSVQ-Diffusion model outperforms other models across various evaluation metrics and in terms of generated image quality, such as detail richness and reduced distortion and deformation. The improvement in FID scores demonstrates higher realism and closer resemblance to real images. The increase in CLIP scores reflects better alignment of generated images with input text, and the enhancement in ISs highlights superior image diversity. This study extends the application of diffusion models and Transformer architectures to the field of remote sensing imagery, broadening the research scope of remote sensing image generation methods. The RSVQ-Diffusion model generates remote sensing images with high fidelity, spatial consistency, and rich details, which can enhance data for remote sensing image processing tasks, simulate specific scenarios, and provide robust support for fields such as environmental monitoring, urban planning, and agricultural management.
Although the proposed methods in this study have significantly improved the realism of remote sensing image generation and text–image matching accuracy, existing remote sensing datasets lack extensive text annotations, and manual labeling is costly, which limits the variety of generated images. There is therefore room for further improvement in the model’s generation capabilities and performance. Future work will focus on three aspects. First, we aim to collect and construct diverse, high-resolution remote sensing image datasets to support large-scale generation tasks. Second, we will conduct in-depth research into Transformer architecture variants to optimize computational efficiency, improving training efficiency and inference speed while maintaining robust model performance. Lastly, by incorporating control conditions such as metadata and geolocation information, and utilizing prior information like time and weather conditions, we aim to generate more informative remote sensing images, further enhancing the model’s ability to generate realistic images and its practical applicability.