1. Introduction
Due to global warming, increasingly frequent heatwaves have drawn growing attention from the academic community. Because population acclimatisation and adaptation vary across regions, the definition of a heatwave varies accordingly [1]. Generally, a heatwave is a period of consecutive days during which the weather is considerably hotter and drier than normal conditions [2]. In recent decades, the intensity, frequency, and duration of heatwaves have increased [3]. In 2020, the Centre for Research on the Epidemiology of Disasters (CRED) and the United Nations Office for Disaster Risk Reduction (UNDRR) [4] found that heatwaves had increased sharply, by 232%, worldwide from 2000 to 2019. Rising air temperature can escalate the risk of illness and death for vulnerable residents [5]. The urban heat island (UHI) effect further raises air temperature during heatwaves, especially in urban areas. Meanwhile, the share of the population residing in urban areas increased from 41% to 55% between 1985 and 2017 [6]. As a result, heatwaves pose significant risks to urban residents, particularly those living in big cities.
Air temperature is a crucial variable in climate models and is widely used as a fundamental metric for defining heatwaves [7]. Meteorological stations generally observe it at 2 m above the land surface with high accuracy and temporal continuity. However, the limited number of meteorological stations restricts their ability to capture the spatial distribution of air temperature, especially in urban areas [8]. Analysing abnormal mortality during heatwaves at the city or community level often requires high-resolution air temperature data [9]. In addition, understanding the evolution mechanism of heatwaves and their relationship with climate change relies on accurate and temporally continuous data [10]. To address these challenges, many studies have used other data sources to analyse air temperature changes in areas of interest, most commonly reanalysis models and satellite products. Reanalysis models, such as the land component of the fifth generation of European reanalysis (ERA5-Land), generally offer good temporal continuity (e.g., hourly), but their spatial resolution (e.g., 0.1° × 0.1°) is too coarse for urban studies. Zou et al. [11] attempted to use ERA5-Land to evaluate air temperature for coastal urban agglomerations. However, their findings indicated that the coarse spatial resolution of the data made it difficult to differentiate between built-up areas and other land covers (e.g., grasslands). Unlike reanalysis data, satellite products usually offer a higher spatial resolution (e.g., 1 km). However, they often suffer from temporal discontinuity caused by weather conditions, such as cloud cover. For example, even though the Landsat 7 satellite is scheduled to revisit an area every 16 days, cloud cover can leave gaps in data availability that last several months [12]. Therefore, despite the availability of numerous datasets, none of them alone can meet the requirement for high-resolution air temperature monitoring.
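To make the resolution gap between the two data sources concrete, the short sketch below converts a 0.1° grid spacing into approximate ground distances. The spherical-Earth radius, the latitude of 51.5° N (roughly that of London), and the function name are illustrative assumptions and are not taken from the studies cited above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation (assumption)

def grid_cell_size_km(resolution_deg, latitude_deg):
    """Approximate north-south and east-west extent (km) of one grid cell."""
    km_per_deg_lat = 2.0 * math.pi * EARTH_RADIUS_KM / 360.0
    north_south = resolution_deg * km_per_deg_lat
    # Meridians converge towards the poles, so east-west extent shrinks with cos(latitude).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

# Hypothetical example: a 0.1 degree reanalysis cell at 51.5 N.
ns_km, ew_km = grid_cell_size_km(0.1, 51.5)
pixels_per_cell = ns_km * ew_km  # number of 1 km x 1 km pixels covered by one coarse cell

print(f"0.1 degree cell: {ns_km:.1f} km (N-S) x {ew_km:.1f} km (E-W)")
print(f"approx. {pixels_per_cell:.0f} one-kilometre pixels per coarse cell")
```

Under these assumptions, one 0.1° cell spans roughly 11 km by 7 km, so a single reanalysis value stands in for several dozen 1 km pixels, which is why built-up and vegetated surfaces are hard to separate at that scale.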
Data fusion has been widely used to obtain higher-quality and more relevant information from multi-source data [13]. It was first introduced in the 1960s as a mathematical model that combined data from multiple sources to acquire improved data [14]. In atmospheric science, many scholars have used statistical downscaling-based data fusion to obtain high-resolution temperature data. Abunnasr and Mhawej [8] utilised a linear regression model to integrate multiple satellite products, including a digital elevation model (DEM), the normalised difference vegetation index (NDVI), the enhanced vegetation index (EVI), and evapotranspiration (ET), to produce a five-year night air temperature trend analysis at 1 km spatial resolution. Although the research achieved promising results, with a coefficient of determination (R-squared) of 0.895 and a root mean square error (RMSE) of 0.49 °C, the five-year temporal resolution was too coarse for most urban-scale studies. To obtain daily maximum air temperature at 1 km resolution, Dos Santos [15] employed machine learning and six satellite products to calibrate a regression model. However, the performance of the regression model was relatively poor, with an RMSE of 2.03 °C and an R-squared of 0.68. For reanalysis-based datasets, interpolation is the main method used to fuse multi-source data. Combining reanalysis data and ground measurements, both Wakjira et al. [16] and Viggiano et al. [17] used interpolation approaches to downscale temperature data. However, their spatial resolutions remained relatively coarse, at 0.05° × 0.05° and 2 km, respectively.
Although some studies have attempted to fuse both satellite and reanalysis datasets using statistical downscaling for air temperature retrieval, their resolution and accuracy remain relatively poor for urban-scale research. For instance, Karaman and Akyürek [18] employed a downscaling approach that combines five reanalyses and four satellite products to obtain daily mean temperatures at 0.05° resolution, with an RMSE of 2.14 °C. Considering the spatial and temporal autocorrelation of in situ observed air temperature, Zhu et al. [19] proposed a method for air temperature reconstruction based on multi-source data and machine learning. However, despite a mean absolute error (MAE) and RMSE both below 0.5 K, its temporal resolution is too coarse, allowing only monthly estimates. To acquire high-resolution data, Shen et al. [20] first employed deep learning to estimate daily maximum air temperature at 0.01° resolution from remote sensing and ground station observations, with an RMSE of 1.996 °C and an R-squared of 0.986. Zhang et al. [21] integrated eight types of reanalysis and satellite datasets using machine learning to obtain daily average air temperature data at 1 km resolution. Despite its high spatial resolution, the dataset could not represent temperature changes during urban heatwaves well because of its low accuracy (RMSE of 1.70 °C). Zhang et al. [22] further explored the potential of machine learning in fusing multi-source data to obtain high-resolution and accurate air temperature data. They developed a novel five-layer deep belief network model to generate daily air temperature data, yielding promising results with an RMSE of 1.086 °C and an R-squared of 0.986. However, because the explanatory variables have a daily temporal resolution, this method cannot estimate hourly air temperature. For public health applications, a daily temporal resolution remains inadequate for accurately assessing the duration of high-risk periods within a single day during a heatwave. Therefore, it is imperative to explore a relatively simple and highly accurate statistical method for fusing multi-source data to acquire high-resolution air temperature data at the city level.
The Genetic Programming (GP) algorithm is widely used in atmospheric science as a data fusion technique. As an extension of the genetic algorithm, the GP algorithm can automatically generate interpretable statistical climate models by combining multiple data sources based on genetic evaluation [23]. Like the genetic algorithm, the GP algorithm offers a relatively simple way to identify optimal solutions without requiring users to have extensive knowledge of the specific problem. Stanislawska et al. [24] demonstrated the potential of the GP algorithm for air temperature downscaling (R-squared > 0.90). Coulibaly [25] also showed that the GP algorithm was more straightforward and efficient than other statistical methods for estimating local-scale daily extreme temperature. Despite these potential benefits, to our knowledge, few studies have used the GP algorithm to build downscaling models for obtaining air temperature data at high spatial-temporal resolutions.
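As a rough illustration of how a GP algorithm can evolve an interpretable regression formula, the following minimal sketch implements tree-based symbolic regression with tournament selection, subtree crossover, subtree mutation, and elitism. It is not the configuration used in this study or in the cited works: the operator set, population size, size cap, and all function names are assumptions, and the data are synthetic (a coarse-grid temperature in °C and an elevation in km, linked by an assumed 6.5 °C/km lapse rate plus noise).

```python
import operator
import random

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
MAX_NODES = 25  # size cap to limit tree bloat (illustrative choice)

def random_tree(rng, n_vars, depth):
    """Grow a random tree: ('var', i), ('const', c) or (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("var", rng.randrange(n_vars))
        return ("const", rng.uniform(-10.0, 10.0))
    op = rng.choice(sorted(OPS))
    return (op, random_tree(rng, n_vars, depth - 1), random_tree(rng, n_vars, depth - 1))

def evaluate(tree, row):
    if tree[0] == "var":
        return row[tree[1]]
    if tree[0] == "const":
        return tree[1]
    return OPS[tree[0]](evaluate(tree[1], row), evaluate(tree[2], row))

def size(tree):
    return 1 if tree[0] not in OPS else 1 + size(tree[1]) + size(tree[2])

def rmse(tree, X, y):
    """Fitness: root mean square error of the tree's predictions."""
    return (sum((evaluate(tree, r) - t) ** 2 for r, t in zip(X, y)) / len(y)) ** 0.5

def random_subtree(rng, tree):
    """Walk down from the root and stop at a random node."""
    while tree[0] in OPS and rng.random() < 0.7:
        tree = tree[rng.choice((1, 2))]
    return tree

def replace_random_subtree(rng, tree, new):
    """Return a copy of tree with one randomly chosen subtree replaced by new."""
    if tree[0] not in OPS or rng.random() < 0.3:
        return new
    if rng.random() < 0.5:
        return (tree[0], replace_random_subtree(rng, tree[1], new), tree[2])
    return (tree[0], tree[1], replace_random_subtree(rng, tree[2], new))

def to_string(tree):
    if tree[0] == "var":
        return f"x{tree[1]}"
    if tree[0] == "const":
        return f"{tree[1]:.2f}"
    symbol = {"add": "+", "sub": "-", "mul": "*"}[tree[0]]
    return f"({to_string(tree[1])} {symbol} {to_string(tree[2])})"

def evolve(X, y, pop_size=100, generations=20, seed=0):
    rng = random.Random(seed)
    n_vars = len(X[0])
    pop = [random_tree(rng, n_vars, 3) for _ in range(pop_size)]
    history = []  # best RMSE per generation
    for _ in range(generations):
        ranked = sorted(pop, key=lambda t: rmse(t, X, y))
        history.append(rmse(ranked[0], X, y))

        def pick():
            # Tournament of three: the lowest rank index is the fittest entrant.
            return ranked[min(rng.randrange(pop_size) for _ in range(3))]

        next_pop = [ranked[0]]  # elitism: the best tree always survives
        while len(next_pop) < pop_size:
            parent = pick()
            if rng.random() < 0.7:
                donor = random_subtree(rng, pick())   # subtree crossover
            else:
                donor = random_tree(rng, n_vars, 2)   # subtree mutation
            child = replace_random_subtree(rng, parent, donor)
            next_pop.append(child if size(child) <= MAX_NODES else parent)
        pop = next_pop
    best = min(pop, key=lambda t: rmse(t, X, y))
    history.append(rmse(best, X, y))
    return best, history

# Synthetic data (not from the study): x0 = coarse temperature, x1 = elevation in km.
data_rng = random.Random(42)
X = [(data_rng.uniform(15.0, 30.0), data_rng.uniform(0.0, 0.3)) for _ in range(40)]
y = [t - 6.5 * z + data_rng.gauss(0.0, 0.2) for t, z in X]

best, history = evolve(X, y)
print("best formula:", to_string(best))
print("RMSE of first and last generation:", history[0], history[-1])
```

Because the evolved model is an explicit expression over the input variables, it can be read and checked directly, which is the interpretability advantage noted above. Elitism guarantees that the best RMSE never worsens between generations, although how close the final formula comes to the generating relationship depends on the random seed and settings.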
With heatwave events becoming increasingly frequent, an important challenge for scholars is how to efficiently obtain air temperature data at high spatiotemporal resolution by fusing multi-source data, which is crucial for analysing the impacts of heatwaves. To address this research gap, our study proposes a novel two-step data fusion model that integrates multi-source data while exploring the key explanatory variables for accurate air temperature estimation. To illustrate the effectiveness of our model, we conducted a case study in London. Specifically, moderate resolution imaging spectroradiometer (MODIS) land surface temperature (LST), other satellite-based local-scale variables (NDVI, normalised difference water index (NDWI), modified normalised difference water index (MNDWI), elevation, and emissivity), and ERA5-Land hourly air temperature were fused in a two-step statistical downscaling model built with GP-assisted regression modelling. The fusion of satellite and reanalysis products for estimating temporally continuous air temperature data at high spatial-temporal resolution (hourly, 1 km), especially for heatwave-related studies, has not been explored in previous literature. Such datasets can significantly benefit local authorities in assessing heatwave-related health risks, as well as other heatwave-related studies such as resilient urban planning.
5. Conclusions
In this research, we proposed a new two-step data fusion model to produce temporally continuous air temperature data at high spatial-temporal resolution. Using London as a case study, hourly air temperature at 1 km resolution during the daytime (6:00–18:00) was successfully obtained by fusing satellite and reanalysis datasets with station-based observations. The two-step downscaling model based on the GP algorithm achieved an RMSE of 0.335 °C, an R-squared of 0.949, an MAE of 1.115 °C, and an NSE of 0.924, outperforming comparable previous studies and demonstrating its potential for estimating hourly air temperature. Whereas other downscaling models can only provide daily temperature data, the proposed model offers better temporal continuity while maintaining high accuracy, allowing hourly air temperature to be estimated during heatwave events.
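For readers who wish to reproduce this kind of evaluation, the sketch below computes the four metrics named above for paired station observations and model estimates. It assumes R-squared is the squared Pearson correlation and NSE is one minus the ratio of squared errors to the variance of the observations. The section does not state the exact formulas used, so these definitions, the function names, and the sample values are all illustrative.

```python
import math

def rmse(obs, est):
    """Root mean square error."""
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / len(obs))

def mae(obs, est):
    """Mean absolute error."""
    return sum(abs(e - o) for o, e in zip(obs, est)) / len(obs)

def r_squared(obs, est):
    """Squared Pearson correlation between observations and estimates (assumed definition)."""
    mean_o = sum(obs) / len(obs)
    mean_e = sum(est) / len(est)
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov ** 2 / (var_o * var_e)

def nse(obs, est):
    """Nash-Sutcliffe efficiency: 1 - SSE / total sum of squares of the observations."""
    mean_o = sum(obs) / len(obs)
    sse = sum((o - e) ** 2 for o, e in zip(obs, est))
    sst = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - sse / sst

# Hypothetical values in degrees Celsius: the estimates carry a constant +1 degree bias.
observed = [20.0, 22.0, 24.0, 26.0]
estimated = [21.0, 23.0, 25.0, 27.0]

metrics = {
    "RMSE": rmse(observed, estimated),
    "MAE": mae(observed, estimated),
    "R-squared": r_squared(observed, estimated),
    "NSE": nse(observed, estimated),
}
print(metrics)
```

In this constructed example the estimates are perfectly correlated with the observations, so the correlation-based R-squared equals 1 even though every estimate is 1 °C too warm, whereas RMSE, MAE (both 1 °C), and NSE (0.8) all register the bias.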
The significance of the explanatory variables was ranked using a forward stepwise regression model. The results showed that elevation considerably affected the spatial distribution of air temperature, while emissivity was the least influential variable. This was primarily because emissivity values were numerically similar across land covers, making some of them (e.g., forested areas and water bodies) difficult to distinguish in emissivity images. In addition, surface emissivity is sensitive to precipitation, which could also affect its values. Adding precipitation-related explanatory variables, such as soil moisture and rainfall, may therefore improve the model. Comparing the four error metrics revealed a limitation of R-squared for evaluating the downscaling model: its values varied within a narrow range and tended to be inflated. Therefore, future research should give more attention to other validation indicators, such as RMSE. Furthermore, using ERA5-Land data as a global-scale variable for downscaling in urban areas can lead to spatial discrepancies in air temperature at the microenvironment scale because of the complex urban surface. Despite this limitation, our results demonstrated that the proposed multi-source data fusion model could generate high-quality air temperature data suitable for heatwave-related studies. Given the limited number of meteorological stations in urban areas, the resulting air temperature datasets have important implications for public health research, which requires quantitative data, especially continuous and high-resolution data, to support the analysis of excess mortality associated with heatwaves. Moreover, the resulting dataset can also support research on the environmental impacts of urbanisation, such as the UHI effect and its implications for building energy consumption and human health.
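The variable ranking described above can be illustrated with a generic forward stepwise procedure: at each step, the predictor whose inclusion most reduces the residual sum of squares of an ordinary least-squares fit is added. The sketch below is a simplified version under stated assumptions and not necessarily the exact procedure used in this study. The data are synthetic, the predictors are scaled to 0–1, and the response is constructed so that elevation has a strong effect, NDVI a weak one, and emissivity none.

```python
import random

def fit_sse(columns, y):
    """Fit least squares with an intercept via the normal equations; return residual SSE."""
    n = len(y)
    A = [[1.0] + [col[i] for col in columns] for i in range(n)]  # design matrix
    p = len(A[0])
    # Augmented normal equations: (A^T A | A^T y)
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(p)]
         + [sum(A[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(M[r][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for r in range(p):
            if r != k:
                factor = M[r][k] / M[k][k]
                M[r] = [a - factor * b for a, b in zip(M[r], M[k])]
    beta = [M[r][p] / M[r][r] for r in range(p)]
    return sum((y[i] - sum(b * a for b, a in zip(beta, A[i]))) ** 2 for i in range(n))

def forward_stepwise(predictors, y):
    """Greedily add the predictor that most reduces SSE; return [(name, sse), ...]."""
    remaining = dict(predictors)
    chosen, ranking = [], []
    while remaining:
        best = min(remaining, key=lambda name: fit_sse(chosen + [remaining[name]], y))
        ranking.append((best, fit_sse(chosen + [remaining[best]], y)))
        chosen.append(remaining.pop(best))
    return ranking

# Synthetic, standardised predictors (assumption: not data from the study).
rng = random.Random(0)
n = 60
predictors = {
    "emissivity": [rng.uniform(0.0, 1.0) for _ in range(n)],
    "ndvi": [rng.uniform(0.0, 1.0) for _ in range(n)],
    "elevation": [rng.uniform(0.0, 1.0) for _ in range(n)],
}
# Response built by construction: strong elevation effect, weak NDVI effect, no emissivity effect.
y = [20.0 - 3.0 * e + 0.5 * v + rng.gauss(0.0, 0.1)
     for e, v in zip(predictors["elevation"], predictors["ndvi"])]

ranking = forward_stepwise(predictors, y)
for name, sse in ranking:
    print(f"{name:<10} SSE after inclusion: {sse:.3f}")
```

The order in which predictors enter, together with the drop in residual error at each step, gives a simple importance ranking. A predictor that barely lowers the error once the others are present, as emissivity does in this constructed example, ends up last.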