1. Introduction
Temperature is one of the most important parameters in meteorological data collection [1,2,3]. The near-surface air temperature, which is typically measured at a height of 1.5 to 2 m above the ground, reflects the ambient temperature [4]. As an indicator of surface radiation exchange and heat balance, it governs most land-surface processes [2,5]. This parameter, which signifies the warmth or coolness of the near-surface air, is closely associated with the growth and development of plants and animals, as well as human activities. Consequently, it serves as a crucial element in climate change research [6]. At present, near-surface air temperature data are primarily collected by meteorological observation stations, which measure atmospheric temperature at a height of 1.5 to 2 m near individual monitoring sites. However, the distribution of these observation stations is uneven, due to geographical and environmental factors, and they are particularly sparse or even absent in remote and geographically unique areas. This uneven distribution restricts the near-surface air temperature monitoring range, impeding the acquisition of high-resolution spatial temperature distributions and complicating the study of the spatial distribution and internal structure of temperature levels in specific locations. These limitations hinder subsequent data analysis and applications [7,8,9,10].
In recent years, satellite remote sensing technology and associated operational products have significantly improved. Satellite remote sensing provides continuous observations with broad spatial coverage, minimal ground interference, and high spatial resolution, making it an effective method for addressing the limitations of observation stations. This technology offers a feasible way to obtain continuous spatiotemporal distribution information of near-surface air temperature [11,12]. As a result, the estimation of near-surface air temperature using remote sensing satellites has emerged as a novel approach, with methods classified into the following types, each possessing its own strengths and limitations.
Statistical methods utilize linear regression models to link remote sensing data with ground-based measurements [13], but their effectiveness is limited by the variability in meteorological station density and the high demand for input data, resulting in poor model portability [14]. The temperature–vegetation index (TVX) method estimates temperature by correlating normalized vegetation indices with surface temperature [15]; however, it struggles with accuracy in areas with sparse or no vegetation cover and relies heavily on empirical determinations [16]. The surface energy balance method models temperature through iterative calculations based on energy exchange principles [17], but its complexity and the requirement for difficult-to-obtain ground parameters often lead to error propagation and limited practical application [17]. Lastly, the atmospheric profile extrapolation method assumes continuous vertical temperature changes to interpolate near-surface temperatures [1], but it lags in accuracy compared to other methods, particularly under cloudy conditions, where thermal infrared data cannot be effectively utilized [18,19].
In recent years, deep learning methods have brought about revolutionary changes in the field of meteorology, driven by their rapid development and powerful data processing capabilities [20]. With the advancement of satellite remote sensing technology and the onset of the big data era, numerous studies have started to use deep learning techniques to improve the prediction and estimation of meteorological elements. The strengths of these techniques, in terms of feature extraction and pattern recognition, have been leveraged to enhance the efficiency and accuracy of weather forecasting. These methods not only allow for the management of large-scale remote sensing data but also uncover deep patterns within the data, providing new perspectives and tools for meteorological research [21].
In 2016, Tao et al. [22] proposed the use of a stacked denoising autoencoder network to process homogeneous PERSIANN-CCS data for the error correction of PERSIANN-CCS precipitation products. The results demonstrated significant improvements in precipitation identification and rainfall intensity assessment with the application of this method. In 2017, Tao et al. [23] developed a deep learning network with two hidden layers that utilized 10.8 μm infrared and 6.7 μm water vapor channel images from the Geostationary Operational Environmental Satellite (GOES) as inputs to generate precipitation identification maps of the region as outputs. The experimental results indicated that the identification of precipitation through deep-learning-based feature extraction from cloud images significantly outperformed traditional manual feature extraction methods. In 2019, Sadeghi et al. [24] introduced a ten-layer convolutional neural network (CNN) for training and used water vapor and infrared channel images from the GOES as input data. They found that the CNN model surpassed PERSIANN-CCS and PERSIANN-SDAE in terms of precipitation identification accuracy and the correlation coefficient for precipitation prediction. In 2020, Shen et al. [25] proposed a deep belief network (DBN) that integrates remote sensing, socio-economic, and station data to estimate near-surface air temperature. In the same year, Jiang et al. [26] proposed a deep-learning-based method for reconstructing satellite data and supplementing missing radar echo data. In 2021, Wang et al. [27] combined data-driven and knowledge-based technology with deep learning models to develop a new method for overcoming the ill-posed problem of surface temperature retrieval. In 2022, Guo et al. [28] proposed a deep-learning-based rainfall classification method using synthetic aperture radar images, aiming to improve sea surface wind speed retrieval through integrating existing rainfall correction models.
Relevant research has clearly demonstrated the immense potential and significant advantages of deep learning in meteorology. However, several challenges and limitations persist in practical applications that cannot be overlooked. One area that needs more exploration is the estimation of the near-surface air temperature. Existing methods require improvements in accuracy and reliability, particularly when considering complex terrain and variable climate conditions. As a result, there is a need for the development of new deep learning architectures to improve the accuracy and generalization ability of near-surface air temperature estimation approaches in meteorological research.
In 2014, generative adversarial networks (GANs) were introduced [29]. GANs consist of a generator and a discriminator. The generator generates images from random noise, while the discriminator determines whether the input data are real or fake. The generator’s goal is to produce images that closely resemble real ones, attempting to fool the discriminator. In contrast, the discriminator aims to accurately distinguish between real and generated images. This adversarial process continues until an equilibrium is reached, where the discriminator can no longer distinguish between real and generated images. GANs offer several advantages, including the ability to train without labeled data, update gradients through backpropagation, and rapidly generate images [30]. These features have led to their widespread application in various fields, such as image generation, segmentation [31], inpainting [32], super-resolution [33], and denoising [34]. However, GANs also have limitations. Because the generation process is driven by random noise, the outputs are uncontrollable and lack interpretability, and high consistency is required in the training data. Additionally, the original GAN is difficult to train and often encounters gradient vanishing issues when there is little overlap between the generated and real sample distributions. To address these challenges, numerous GAN variants have been developed.
Mirza et al. introduced conditional generative adversarial networks (cGANs) in 2014 [35], which incorporate a conditional variable into both the generator and discriminator of the original GAN. By adding additional information to constrain the model, cGANs establish a paired relationship between input and output, guiding the image generation process and resolving the issue of uncontrollable outputs. Unlike the original GAN, where the discriminator’s input was simply real or fake images, cGANs use a combination of real images with conditions or fake images with conditions. During prediction, the corresponding conditional information is input to obtain the desired output. Mirza and colleagues successfully applied cGANs to generate corresponding MNIST handwritten digit images by inputting random noise and one-hot encoded digits into the generator. The introduction of cGANs paved the way for the application of GANs in multimodal learning, leading to the development of many improved cGAN models.
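The input pairing described above can be illustrated with a minimal numpy sketch of the MNIST cGAN setup: the generator receives noise concatenated with a one-hot condition, and the discriminator receives an image concatenated with the same condition. The vector sizes here (100-dimensional noise, 28 × 28 images) are the conventional choices and are assumptions for illustration, not values from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(digit, num_classes=10):
    """One-hot encode a class label, as in the MNIST cGAN example."""
    v = np.zeros(num_classes)
    v[digit] = 1.0
    return v

# Generator input: random noise z concatenated with the condition x.
z = rng.standard_normal(100)                 # noise vector
x = one_hot(7)                               # conditional information (digit "7")
generator_input = np.concatenate([z, x])

# Discriminator input: an image paired with the same condition, so the
# discriminator judges "real/fake given this condition".
fake_image = rng.standard_normal(28 * 28)    # placeholder for G's output
discriminator_input = np.concatenate([fake_image, x])

print(generator_input.shape)                 # (110,)
print(discriminator_input.shape)             # (794,)
```

Concatenation is only one conditioning scheme; for image-to-image settings such as ARM-cGAN, the condition is itself an image stacked along the channel axis rather than a one-hot vector.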
In this study, a new method for near-surface air temperature estimation based on deep learning is proposed, specifically an attention- and residual-enhanced multi-scale conditional generative adversarial network (ARM-cGAN). This method improves upon the conditional generative adversarial network (cGAN) framework through a generator that combines a U-Net backbone with a self-attention mechanism and cascaded residual blocks. The goal of this design is to selectively extract features that are crucial for temperature data generation while reducing redundant information. Additionally, a discriminator is designed to combine multi-scale spatial features at different levels, further improving the network’s ability to perceive details and overall structure, resulting in precise temperature estimation. The ARM-cGAN method effectively takes advantage of cGANs by incorporating conditional information to control the image generation process. It uses cloud image data from channels 12 and 13 of the FY-4A satellite as conditional information; these channels capture cloud and surface temperature and thus effectively represent near-surface air temperature. This guided approach helps to control the generation of temperature images, producing high-resolution spatio-temporal near-surface air temperature data and effectively addressing the problem of missing temperature data due to a lack of meteorological observation stations. The proposed ARM-cGAN method is compared with several deep learning models that have proven effective in meteorological estimation and prediction tasks. The experimental results indicate that this method offers superior temperature estimation performance.
3. Methodology
Generative adversarial networks (GANs), as introduced by Goodfellow et al. in 2014, are a type of generative model that learns through unsupervised training [29]. The GAN architecture is inspired by the concept of Nash equilibrium. The network learns the mapping from a random noise vector z to an output image y through the adversarial training of two modules, the generator and the discriminator, denoted as $G: z \rightarrow y$. Expanding on this concept, Mehdi Mirza et al. proposed conditional generative adversarial networks (cGANs) in 2014 [35]. cGANs adopt a supervised learning approach, enabling the incorporation of additional conditional information to guide the generation process. By introducing conditional information into both the generator and discriminator, the outputs become more controllable and predictable. cGANs learn the mapping from the conditional information x and the random noise vector z to y, represented as $G: \{x, z\} \rightarrow y$. The objective function of a cGAN is expressed as
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right], \tag{1}$$
where the symbol $\mathbb{E}$ represents the expectation operator, which is used in two different contexts. The first term denotes the expectation under the true data distribution for all pairs of real data and their corresponding condition $(x, y)$, while the second term denotes the expectation under the noise distribution, which serves as input to the generator, for all pairs of generated data and their corresponding condition $(G(x, z), x)$. The generator G aims to minimize the loss $\mathcal{L}_{cGAN}(G, D)$, while the discriminator D seeks to maximize it, leading to a competitive relationship between the two, which can be denoted as $\min_G \max_D \mathcal{L}_{cGAN}(G, D)$.
Phillip Isola et al. [42] have demonstrated that, by including an $L_1$ regularization loss in the generator’s loss function within the cGAN framework, the network can capture low-frequency information features and produce less blurry results. This significantly improves the mapping performance. Therefore, in this study, we use the loss function detailed in Equation (2):
$$G^{*} = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G), \tag{2}$$
where the $L_1$ loss is defined as
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\left\| y - G(x, z) \right\|_{1}\right]. \tag{3}$$
In Equation (2), the first term represents the adversarial loss between the generator G and the discriminator D, which is in line with the cGAN objective function. The second term—the $L_1$ pixel loss—evaluates the quality of the generator’s output. By optimizing the combined objective of adversarial loss and pixel loss, it is ensured that the generated images both have high realism and accurately reflect the mapping relationship with the input images.
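The generator-side combined objective can be evaluated numerically with a small sketch. The binary cross-entropy form of the adversarial term and the weight λ = 100 follow common pix2pix practice and are assumptions here, not values confirmed by this paper.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy, the usual realization of the adversarial loss."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def l1_loss(y, g_out):
    """L1 pixel loss between the real image y and the generator output."""
    return np.mean(np.abs(y - g_out))

rng = np.random.default_rng(0)
y = rng.random((8, 8))           # "real" temperature patch
g_out = rng.random((8, 8))       # generator output for the same condition
d_on_fake = np.full(16, 0.3)     # discriminator scores on the fake patch

lam = 100.0                      # illustrative weight on the L1 term
# Generator side of the combined objective: fool D (target label 1)
# plus the weighted pixel loss against the paired real image.
g_loss = bce(d_on_fake, np.ones_like(d_on_fake)) + lam * l1_loss(y, g_out)
print(g_loss > 0.0)
```

The adversarial term pushes outputs toward realism, while the L1 term anchors them to the paired target; λ trades off the two.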
The ARM-cGAN model’s overall structure is illustrated in Figure 1. The model starts by taking the dual-channel cloud image data (L12 and L13) from the FY-4A satellite and inputting them into the generator network G as conditional information. The generator then generates a synthetic “fake” temperature sample image. This generated image, along with the ERA5 temperature image, is combined with the original conditional information L12 and L13. These combined data are then fed into the discriminator network D, which evaluates whether the generated near-surface temperature image matches the real samples. Throughout this process, the generator and discriminator are trained adversarially, with the generator continuously improving to create more realistic temperature images, while the discriminator enhances its ability to distinguish between generated images and real samples. Ultimately, this adversarial training enables the generator to produce high-quality near-surface air temperature estimation samples.
3.1. Generator
The generator structure of the ARM-cGAN network follows the U-Net’s encoder–decoder architecture, as depicted in the upper part of Figure 2. The encoder progressively reduces the input image size through a series of convolutional and pooling operations, thereby extracting essential features and decreasing the spatial dimensions. Throughout the encoding process, the output feature map at each level is transmitted directly to the corresponding level of the decoder via a skip connection, preserving the continuity and integrity of spatial information. On the other hand, the decoder gradually restores the image’s spatial resolution through up-sampling and convolution while receiving feature maps from the corresponding levels of the encoder through the skip connection. These feature maps, along with the up-sampling results of the decoder, are simultaneously input into attention gates to further refine and capture effective features, thereby enhancing the model’s reconstruction capabilities.
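The size bookkeeping behind this encoder–decoder symmetry can be sketched briefly: each decoder level must restore exactly the resolution its paired encoder level produced, so the skip features can be fused. The 256 × 256 input and three levels below are illustrative assumptions, not the network’s actual configuration.

```python
def down(h, w):
    """One encoder level: 2x2 pooling halves each spatial dimension."""
    return h // 2, w // 2

def up(h, w):
    """One decoder level: up-sampling doubles each spatial dimension."""
    return h * 2, w * 2

# Track feature-map sizes through a three-level encoder, saving the size
# of each feature map that is sent across a skip connection.
size = (256, 256)
skips = []
for _ in range(3):
    skips.append(size)          # feature map handed to the decoder
    size = down(*size)

# The decoder restores resolution level by level; each up-sampled map must
# match its saved skip size so the encoder features can be fused (in
# ARM-cGAN, via the attention gates).
for skip in reversed(skips):
    size = up(*size)
    assert size == skip
print(size)  # back to the input resolution
```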
The ARM-cGAN model introduces four multi-scale residual modules between the encoder and decoder of the generator [43]. These modules deepen the network without changing the size or channel number of the input feature maps. This expands the receptive field and improves the extraction of the correspondence between FY-4A dual-channel data and ERA5 temperature values. Each residual block consists of two 3 × 3 convolutional layers and employs a residual connection design, as depicted in the lower-left part of Figure 2. The feature maps output by the encoder serve as input to the residual blocks. Each convolutional layer applies edge reflection padding with a padding size of 1 pixel to maintain the feature map size and reduce edge effects, facilitating subsequent convolution operations while preserving spatial resolution. Each convolutional layer uses a 3 × 3 convolutional kernel with a stride of 1, keeping the dimensions of the input and output feature maps constant. Instance normalization is applied after convolution to normalize each channel of each sample independently, effectively maintaining sample independence and preserving the original details of the image; it is particularly suitable for cross-domain data, such as the temperature estimation task on remote sensing imagery. The ReLU activation function introduces non-linearity by retaining features when the input is greater than zero and setting the feature value to zero otherwise. This allows the model to capture complex features, contributing to network sparsity and improving computational efficiency. Dropout (with a rate of 50%) is added between the convolutional layers to enhance the generalization of the model. Finally, the input is added to the output of the two convolutional layers via an identity mapping. Overall, the four residual blocks deepen the network and enhance the model’s representational capacity. This enables the model to learn structural information more effectively and recover key indicative features related to temperature, such as gradients and local anomalies. The residual connections also retain the original input information through skip connections, mitigating the gradient vanishing problem, increasing the network’s adaptability, and making the model better suited to variations in different meteorological data.
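A minimal numpy sketch of such a residual block is given below, assuming single-channel feature maps and omitting instance normalization and dropout for brevity; it shows the two size-preserving 3 × 3 convolutions with reflection padding and the identity shortcut.

```python
import numpy as np

def conv3x3_reflect(x, kernel):
    """3x3 convolution with 1-pixel reflection padding and stride 1,
    so the output keeps the input's spatial size."""
    padded = np.pad(x, 1, mode="reflect")
    h, w = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def residual_block(x, k1, k2):
    """Two 3x3 convolutions with a ReLU between them, plus the identity
    shortcut (instance norm and dropout from the paper are omitted here)."""
    h = np.maximum(conv3x3_reflect(x, k1), 0.0)   # conv + ReLU
    h = conv3x3_reflect(h, k2)                    # second conv
    return x + h                                  # identity mapping

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
k1 = rng.standard_normal((3, 3)) * 0.1
k2 = rng.standard_normal((3, 3)) * 0.1
y = residual_block(x, k1, k2)
print(y.shape)  # spatial size is unchanged: (16, 16)
```

Because the shortcut is an identity, the block can only add a learned correction to its input, which is what makes deep stacks of such blocks easy to optimize.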
The self-attention mechanism is a technique that dynamically adjusts the weights based on the relationships between elements in the input sequence. When combined with attention gates and the U-Net architecture, the model can suppress irrelevant parts and enhance task-related features during the learning process, ultimately achieving better performance [44]. As the near-surface air temperature is influenced by various geographic and meteorological factors, the self-attention mechanism helps the model to identify key factors when estimating the temperature distribution, particularly in regions where surface cover and terrain diversity have significant impacts on the temperature. Therefore, this study incorporates attention gates into the decoder, as shown on the right side of Figure 2. This module receives feature maps from the encoder’s down-sampling and the decoder’s up-sampling processes. It first applies a 1 × 1 × 1 convolution operation, with a kernel size of 1, a stride of 1, and no padding, applying weights to the feature channels. This operation compresses the number of input feature map channels into an intermediate feature dimension, reducing computational complexity and emphasizing important features. Additionally, a batch normalization layer is applied after the convolution to balance the feature channels, resulting in the feature maps E and D. The feature responses are then fine-tuned using ReLU and sigmoid activation functions, respectively, thereby generating attention weights for the feature maps of the encoder and decoder. Additionally, another 1 × 1 × 1 convolution operation is used to extract contextual information and further enhance the weights, ensuring that the model can capture significant features related to the temperature distribution. This process is followed by re-sampling to ensure that the spatial resolution of the attention weights matches that of the feature maps. Finally, the attention weights are multiplied by the feature maps, resulting in weighted features. Through this process, the weighted features are fused with the original feature maps, enabling the model to capture long-range dependencies between different features and causing the network to focus more on spatial features related to the temperature distribution.
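One common form of the additive attention gate described above can be sketched as follows. The channel sizes are illustrative, normalization and re-sampling are omitted, and the exact placement of the activations in ARM-cGAN may differ, so this is a simplified sketch under those assumptions, not the paper's verified module.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def attention_gate(enc, dec, w_e, w_d, w_psi):
    """Additive attention gate on (C, H, W) feature maps. The 1x1
    convolutions act as per-pixel linear maps over channels."""
    e = np.einsum("ic,chw->ihw", w_e, enc)   # 1x1 conv on encoder features
    d = np.einsum("ic,chw->ihw", w_d, dec)   # 1x1 conv on decoder features
    q = np.maximum(e + d, 0.0)               # ReLU on the summed responses
    alpha = sigmoid(np.einsum("ic,chw->ihw", w_psi, q))  # weights in (0, 1)
    return alpha * enc                       # re-weight the encoder features

rng = np.random.default_rng(0)
enc = rng.standard_normal((8, 16, 16))       # encoder skip features
dec = rng.standard_normal((8, 16, 16))       # up-sampled decoder features
w_e = rng.standard_normal((4, 8)) * 0.1      # channel-compressing 1x1 convs
w_d = rng.standard_normal((4, 8)) * 0.1
w_psi = rng.standard_normal((1, 4)) * 0.1    # maps to a single weight channel
out = attention_gate(enc, dec, w_e, w_d, w_psi)
print(out.shape)  # same shape as the encoder features: (8, 16, 16)
```

Because the attention weights lie in (0, 1), the gate can only attenuate encoder features, never amplify them, which is how irrelevant spatial regions are suppressed before fusion.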
3.2. Discriminator
Oscillation and instability often occur during the cGAN training process, leading to issues such as mode collapse and non-convergence. These problems primarily arise from the imbalance between the convergence rates of the generator and the discriminator. This imbalance is particularly notable when the generator is tasked with translating across multiple classes of remote sensing images, as its convergence is significantly slower than that of the discriminator. On the other hand, the discriminator often reaches an optimal state early on and consistently distinguishes real from fake images effectively. Furthermore, as this study incorporates both global and local information into the generator, the discriminator—which typically operates at a single local scale—also faces an imbalance. Therefore, it is necessary to provide the discriminator with multi-scale information.
In this study, a multi-level and multi-scale discriminator network is introduced, as shown in Figure 3. Unlike the global-scale discriminators used in traditional GANs, this study proposes a pyramid-like multi-scale spatial feature fusion strategy for the discriminator. The network consists of three local-scale sub-discriminator networks at different levels, cascaded within the discriminator. Initially, the input image pairs are down-sampled by factors of 2 and 4, and the resulting data are directed to the corresponding sub-discriminators, namely, discriminator 1, discriminator 2, and discriminator 3. Each sub-discriminator shares a similar structure, as depicted on the right side of Figure 3. By employing different scale factors, the sub-discriminator networks achieve receptive fields of 34 × 34, 68 × 68, and 136 × 136 on the input image pairs, allowing the discriminator to operate across various scales. After hierarchical features are extracted at different scales, the results from the three sub-discriminators are integrated to differentiate between real and generated input images. This design enhances the network’s perception of both detailed and global structures, guiding the generator to produce finer details. It also enables a comprehensive understanding of the input images, from the microscopic texture details of temperature variations to the macroscopic patterns of temperature distribution. Furthermore, the proposed discriminator network structure mitigates instability and mode collapse during network training, thereby improving the ARM-cGAN model’s ability to distinguish between generated and real images.
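The stated receptive fields can be checked with the standard receptive-field recurrence. The layer configuration below (4 × 4 kernels, two stride-2 layers, then stride-1 layers and a 1 × 1 output) is an assumption chosen because it reproduces the 34-pixel base field, not the paper's verified sub-discriminator architecture; feeding 2x- and 4x-down-sampled inputs then scales the effective field on the original image by the same factors.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers, each given as
    (kernel, stride), using the recurrence rf += (k - 1) * jump."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k-1) strides
        jump *= s              # accumulated stride ("jump") so far
    return rf

# Hypothetical sub-discriminator stack (an assumption, see lead-in).
stack = [(4, 2), (4, 2), (4, 1), (4, 1), (1, 1)]
base = receptive_field(stack)
# Down-sampling the input by 2 and 4 scales the effective receptive field
# on the original image pair accordingly.
print(base, base * 2, base * 4)  # 34 68 136
```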
The sub-discriminators in this study use a Markovian discriminator known as PatchGAN. PatchGAN is a fully convolutional network designed to penalize structures at the patch scale. In this study, the FY-4A dual-channel data are combined with the corresponding ERA5 temperature data and the generator output at the channel level, resulting in a 401 × 401 × 3 input data pair. After two down-sampling operations, the dimensions are reduced to 200 × 200 × 3 and 100 × 100 × 3, respectively. Subsequently, these three sets of data are processed by the corresponding sub-discriminators for evaluation. Each sub-discriminator uses five convolutional layers to extract high-frequency information from the images, ultimately mapping the input to an N × N matrix, in which each point represents the evaluation value for a small patch of the original image. The mean value of this matrix is then output as the final discrimination result. According to Phillip Isola et al. [42], N can be significantly smaller than the full size of the original image while still yielding high-quality results. Smaller PatchGANs have fewer parameters, operate more efficiently, and can be applied to images of any size, addressing issues related to training instability and slow convergence. This characteristic makes them particularly well suited for handling remote sensing images with large amounts of high-resolution information.
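The three-scale input preparation and the patch-score averaging can be sketched as follows. The use of 2 × 2 average pooling for down-sampling and the 13 × 13 score-map size are illustrative assumptions; only the 401 → 200 → 100 sizes and the mean-of-patch-scores decision come from the text above.

```python
import numpy as np

def downsample2(img):
    """Half-resolution view via 2x2 average pooling (odd edges cropped)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    x = img[:h, :w]
    return (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4.0

rng = np.random.default_rng(0)
pair = rng.random((401, 401))     # one channel of the 401x401x3 input pair
half = downsample2(pair)          # fed to the second sub-discriminator
quarter = downsample2(half)       # fed to the third sub-discriminator
print(pair.shape, half.shape, quarter.shape)  # (401, 401) (200, 200) (100, 100)

# Each sub-discriminator emits an N x N patch-score map; the mean of the
# map is that scale's real/fake score (score values here are placeholders).
score_maps = [rng.random((13, 13)) for _ in range(3)]
per_scale = [float(m.mean()) for m in score_maps]
final_score = float(np.mean(per_scale))  # fused multi-scale decision
```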
5. Discussion
In this work, we propose an improved conditional generative adversarial network, called ARM-cGAN. It estimates near-surface air temperature in near real time using FY-4A satellite data, providing timely and accurate near-surface air temperature data for regions lacking station observations. A series of comparison and ablation experiments are conducted to verify the performance of ARM-cGAN using the operational product ERA5 as the benchmark standard. Based on the experimental results, our discussions are as follows.
Estimating near-surface air temperature from FY-4A satellite imagery based on deep learning is effective. This new method addresses the limitations of previous approaches that relied solely on remote sensing satellites for estimation. For example, statistical methods have poor model transferability due to variations in the density of ground meteorological stations. The TVX method is not suitable for regions with diverse terrain coverage. The surface energy balance method is complex and requires numerous parameters, while the atmospheric profile extrapolation method suffers from lower accuracy. In contrast, deep learning networks are known for approximating arbitrary functions with high accuracy. The proposed ARM-cGAN model uses the cGAN network architecture, incorporating cascaded residual blocks and a self-attention mechanism, and introduces a multi-scale discriminator to significantly enhance the accuracy of near-surface air temperature estimation using remote sensing satellite data. The model is structurally simple, has few parameters, and is not constrained by terrain, making it applicable to any region covered by remote sensing satellites. The effectiveness of the model was validated using two commonly employed evaluation metrics in temperature estimation tasks: RMSE and CC. In comparative experiments, the ARM-cGAN model outperformed other deep learning models that have proven effective in meteorological element estimation and prediction tasks, achieving an RMSE of 1.4815 °C and a CC of 0.9897 compared to ERA5 reanalysis data. This demonstrates the model’s superior performance in near-surface air temperature estimation.
In addition, the selection of a suitable deep neural network can enhance performance. Unlike traditional deep learning networks, which are commonly used in meteorological tasks, this study exploits the unique benefits of the cGAN network for image generation. By effectively integrating these strengths with the specific requirements of temperature estimation, we achieve exceptional estimation performance. The proposed model expands the cGAN framework by creating a generator that combines a U-Net backbone with a self-attention mechanism and cascaded residual blocks. This design enables the model to selectively extract features that are essential for temperature data generation while enhancing the network’s ability to suppress redundant information. Furthermore, a multi-scale spatial feature fusion discriminator was developed to further enhance the network’s perception of both fine details and global structures, ultimately leading to precise temperature estimation. Ablation experiments conducted in this study validated the effectiveness of each module in the proposed task.
Moreover, it is important to consider that near-surface air temperature estimation can be influenced by meteorological conditions, seasonal fluctuations, and geographic variations. After thorough analysis, it has been found that the accuracy and consistency of temperature estimation, as measured by RMSE and CC values, can vary based on the month/season and location. For example, in August, RMSE values tend to be smaller and more stable, with higher and more consistent CC values during this period. Furthermore, regions at mid-to-high latitudes often show more diverse and unstable variations in RMSE and CC values. Considering these findings, future work should take into account factors such as latitude, longitude, topography, and climate to enhance the accuracy and reliability of near-surface air temperature estimation.
Lastly, although the ARM-cGAN model proposed in this study has demonstrated superior near-surface air temperature estimation capabilities, its generalization across different geographic regions and atmospheric conditions remains to be further explored. The model’s training and validation rely heavily on remote sensing data from the FY-4A satellite, whose observational range and resolution may limit global applicability. In regions where FY-4A data are unavailable or where their quality differs significantly from that of other remote sensing sources, the model’s effectiveness may be compromised, restricting its broader applicability. Additionally, geographic variation and atmospheric conditions, such as complex terrain, cloudy weather, or extreme climates, may affect the quality of remote sensing data and, in turn, the accuracy of the model’s temperature estimates. While the ARM-cGAN model enhances adaptability to diverse meteorological conditions by incorporating a self-attention mechanism, cascaded residual blocks, and a multi-scale discriminator, its generalization performance still requires evaluation across more regions and climatic contexts. Future research should incorporate additional remote sensing data, such as MODIS or Sentinel satellite data, to assess the model’s performance under varying data conditions and explore its potential for global-scale meteorological forecasting and temperature estimation.
In conclusion, the ARM-cGAN model is currently limited to regions with FY-4A data. Future work should aim to expand its applicability and test its robustness and generalization across a broader range of geographic environments and atmospheric conditions.
6. Conclusions
Near-surface air temperature reflects the thermal characteristics of the air close to the ground and is a critical factor in the study of hydrology, ecology, climate dynamics, and processes such as vegetation photosynthesis, transpiration, and evaporation. Typically, near-surface temperature values are obtained from meteorological observation stations. However, geographic constraints make it difficult to acquire accurate temperature data with high spatial and temporal resolution.
This study addressed the challenge of near-surface air temperature estimation by proposing an improved conditional generative adversarial network, called ARM-cGAN. This model is based on a conditional GAN framework and incorporates several network modules, including cascaded residual blocks, a self-attention mechanism, and a multi-scale discriminator. Data from the FY-4A meteorological satellite were used as conditional guiding information, and ERA5 reanalysis data were used as target information for training. The experimental results demonstrated the model’s excellent performance and its advantages in terms of estimating near-surface air temperature, effectively improving the accuracy and quality of the estimated temperature data. This research not only provides an innovative tool for fields such as meteorology, ecology, and hydrology but also emphasizes the significant potential of deep learning in analyzing complex climate data.
Future work will concentrate on enhancing the model by including extra meteorological factors such as wind speed and direction. This will help to improve the accuracy of near-surface air temperature estimates, with the aim of developing a more comprehensive and precise temperature estimation model that can better meet the requirements of different fields relying on climate data analysis.
With the rapid development of remote sensing technology and deep learning, temperature estimation is poised for new opportunities and challenges. Future research can focus on the fusion of multi-source remote sensing data, such as integrating satellite data from MODIS, Sentinel, and Landsat, to improve the model’s generalization across diverse geographic regions and variable climate conditions globally. As the demand for real-time meteorological monitoring grows, model optimization and lightweighting will become key priorities. By employing model compression and optimization techniques, deep learning models can achieve efficient real-time temperature estimation, even in resource-limited environments. Additionally, combining global climate forecasting with regional micro-scale analysis to build a multi-scale climate prediction framework will enable a more precise capture of interactions between global climate changes and local characteristics. Future research should also aim to improve model interpretability, offering deeper insights into how the model extracts and estimates temperature features, thereby providing more reliable support for meteorological decision-making. In summary, the fusion of multi-source data, real-time model optimization, the integration of global and local analyses, and enhanced interpretability are critical future directions for advancing temperature estimation.