1. Introduction
Global climate models (GCMs) play a crucial role in predicting the state of the atmosphere [1]. However, these models’ spatial resolution is low, typically between 25 and 100 km, because of computational limitations. Such a spatial resolution is suitable for analyzing large-scale features, e.g., synoptic or mesoscale processes, but too coarse for capturing the spatial variability of meteorological parameters in complex terrain [2,3]. This limits the straightforward use of GCM output for decision-making processes.
The near-surface air temperature of Earth is an essential parameter for a series of processes, especially for terrestrial life, including that of humans and the ecosystem. In particular, the air temperature measured 2 m above the ground (T2M) is a standard reference variable adopted to represent processes relevant to various sectors like agriculture and transportation. The high-resolution daily mean temperature data can also have applications in various studies, such as those dealing with urban heat islands and thermal comfort. The interaction between surface temperature and urban morphology can be explored in greater detail by utilizing high-resolution temperature data [
4]. Similarly, daily mean temperature can help study landscape interventions, such as the effect of increased vegetation on thermal conditions [
5]. For these applications and to better predict these processes, there is an increasing need for high-resolution T2M datasets. Two main types of downscaling techniques are used to translate data from coarse resolution to the local scale, i.e., dynamic and statistical downscaling. Dynamic downscaling techniques use regional climate models to simulate physical processes, taking initial and boundary conditions from GCM output. Fine-scale processes are represented with more accuracy, thus improving the provision of local-scale data [
6,
7]. However, dynamic downscaling usually requires considerable expertise and computational resources. Instead, statistical downscaling stems from a statistical relationship between large-scale climate variables (predictors) and local-scale variables (predictands) [
8]. These relationships are exploited to predict values of the local-scale variables at different times and locations. Statistical downscaling techniques have an advantage in reduced computation and more straightforward implementation and thus are widely used in applications related to climate change [
9,
10,
11].
Machine learning (ML) techniques have been increasingly used in recent years for statistical downscaling of daily mean temperature because of their ability to capture patterns within complex data. Various studies have employed different ML models, such as linear regression [12], support vector regression (SVR) [13,14], artificial neural network (ANN) [9,15,16], random forest (RF) [9], and k-nearest neighbor (KNN) [10] for downscaling temperature.
The ANN model is inspired by the human brain and consists of neurons arranged in layers. The input layer is the first layer and, as the name suggests, provides the input data to the model; it is followed by one or more hidden layers, which transform the inputs and largely determine the model performance. The last is the output layer, which returns the model prediction. ANN can capture nonlinear relationships between parameters, making it useful for complex tasks such as downscaling temperature.
RF constructs multiple decision trees and takes the ensemble mean of outputs of all trees for final prediction. Gradient boosting (GB) is also an ensemble method that builds models sequentially. Each new model in GB corrects the errors of the earlier one. Both RF and GB leverage the strengths of multiple individual models to improve prediction accuracy, albeit through different mechanisms.
The multiple linear regression (MLR) model uses a linear equation to capture the relationship between the predictor and predictand parameters. SVR fits a function that keeps most data points within a tolerance margin and, when combined with nonlinear kernels, can capture nonlinear temperature patterns. The KNN model predicts the value at a location from the values of its nearest neighbors in the dataset.
In the past, several studies have compared ML techniques for downscaling purposes. Pang et al. [
9] conducted a study in the Pearl River Basin (Southern China) using three methods, namely multiple linear regression (MLR), RF, and ANN, to downscale mean temperature data from coarse resolution. Their study revealed that RF exhibited superior performance compared to both ANN and MLR. Azari et al. [
10] used six different models—MLR, KNN, ANN, RF, SVR, and adaptive boost—in daily mean temperature downscaling at Memphis International Airport (USA). The results showed that ANN outperformed other models. Hanoon et al. [
17] demonstrated that neural networks outperformed GB, RF, and linear regression in predicting daily temperature in Terengganu state (Malaysia). Hence, the literature suggests that no model can be considered invariably superior when comparisons are made across diverse regions. This emphasizes the need for conducting intercomparison studies specifically suited to where downscaling applications are required. However, it should be noted that most of these studies have employed ML techniques to downscale the daily mean temperature on a point scale. This approach involves taking low-resolution data from the closest grid point of the weather station for downscaling. Hutengs and Vohland [
18] used RF for spatial downscaling of land surface temperature (LST) from 1 km to a resolution of 250 m in the Jordan River Valley. However, the application of these techniques for spatial downscaling of the daily mean temperature is limited.
Recent studies on the downscaling of atmospheric variables have explored the application of a convolutional neural network (CNN). CNN models are similar to ANN models; however, in addition to fully connected dense layers, CNN consists of a few more layers, such as convolution and pooling layers. CNNs are especially effective for handling gridded data, such as spatial data or images. CNN models have gained increasing popularity for downscaling spatial gridded data, considering their ability to capture spatial features effectively. Bano Medina et al. [
19] conducted a study in Europe with the objective of intercomparing CNNs of different complexity with linear models, bringing the horizontal resolution from 2° to 0.5° both for latitude and longitude. Their findings suggest that CNNs perform better than linear models. However, the downscaling was performed on a continental scale, over a large area, and with a relatively low scaling ratio. (The scaling ratio is calculated as the predictor resolution divided by the target or predictand resolution.) Therefore, the transferability of these results to higher resolutions remains an open question.
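As a small worked example of the scaling ratio defined above, the following Python sketch evaluates it for the 2° to 0.5° case and for a 9 km to 1 km case (the input and target resolutions quoted later in the text). The function name `scaling_ratio` is a hypothetical helper introduced only for illustration.

```python
def scaling_ratio(predictor_res: float, predictand_res: float) -> float:
    """Predictor (coarse) resolution divided by predictand (fine) resolution.

    Both resolutions must be expressed in the same unit (degrees or km).
    """
    if predictor_res <= 0 or predictand_res <= 0:
        raise ValueError("resolutions must be positive")
    return predictor_res / predictand_res


# Continental-scale case: 2 degrees down to 0.5 degrees.
ratio_continental = scaling_ratio(2.0, 0.5)

# High-resolution case: 9 km input grid down to a 1 km target grid.
ratio_high_res = scaling_ratio(9.0, 1.0)

print(ratio_continental, ratio_high_res)
```

Note that the ratio is per horizontal direction: a ratio of 9 means that each coarse cell covers 9 × 9 = 81 target cells.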
Mountainous terrains are known to exhibit a variety of climatic situations [
20] with very different features compared to surrounding plain areas nearby [
21]. Accordingly, peculiar boundary layer processes occur therein, deeply affecting surface–atmosphere exchanges and, hence, surface temperatures, resulting from a variety of combinations of different land forms, ambient conditions, and surface properties [
22]. In particular, under weather characterized by wide and persistent anticyclonic situations typically associated with clear skies and calm wind at the synoptic scale, daily-periodic, thermally-driven wind systems are generated by the regular cycle of surface heating and cooling [
23]. Enhanced heating and cooling are both favored by clear skies, allowing for both strong incoming solar radiation during daytime and strong radiative loss during nighttime, respectively [
24,
25]. Under the different phases characterizing these winds, air typically flows up the slopes during daytime and downslope during nighttime, with transitional reversals at sunset and sunrise, respectively [
26]. These factors affect surface temperature, particularly at the floor of valleys and basins, where long-lasting, ground-based temperature inversions and persistent cold pools often occur [
27] fed by katabatic winds flowing down from the surrounding sidewall slopes [
28].
Given such variability arising from a nontrivial combination of factors, downscaling is a particularly challenging task over mountainous terrain. On the other hand, the availability of high-resolution T2M is critical for a series of applications [
29], ranging from air pollutant transport [
30,
31] to water resource management to agriculture. High-resolution data from numerical weather prediction (NWP) models are typically obtained with smoothed topography for computational stability [
32]. Smoothing the topography can lead to a less accurate representation of the terrain affecting local atmospheric processes. However, ML models can leverage high-resolution data, overcoming these limitations to provide more accurate forecasts. Mutiibwa et al. [
33] investigated the relationship between air temperature near surface and land surface temperature (LST) in mountainous terrain. Their study found that the LST serves as a reliable proxy for air temperature near the surface, with higher accuracy in the daytime compared to the nighttime. Li et al. [
34] evaluated machine learning model performance for downscaling the LST, highlighting the better performance of machine learning algorithms compared to traditional regression approaches. Wang et al. [
35] used CNN to downscale the daily temperature from three coarser resolutions (100, 50, and 25 km) to a fine resolution (4 km), with better results obtained for downscaling from 25 km. Their study achieves good results with a scaling ratio of approximately 6. However, 4 km is still very coarse for areas of complex terrain. Sha et al. [
3] conducted a study over complex terrains in the western United States to downscale temperature from 0.25° (approximately 27 km) to a resolution of 4 km using CNN and found a higher MAE in mountainous areas and a lower MAE in the plains. Furthermore, when developing a model for downscaling temperature in complex terrains, the selection of predictors is crucial. For example, Karaman and Akyurek [
36] conducted a study in Turkey, aiming to downscale the daily mean temperature on station data using an RF model, and found that incorporating static features such as elevation as additional predictors significantly improved the model’s performance. Some studies have included dynamic parameters such as dew point, pressure, and wind speed as additional predictors to improve the performance of the model [
10,
17]. Sebbar et al. [
37] downscaled hourly temperature using SVR, XGBoost, and MLR by incorporating the environmental lapse rate for temperature corrections. However, this approach is limited in regions where temperature data at various vertical levels are unavailable. Therefore, our study employs ANN, RF, and CNN models to capture spatial and temporal variability directly from surface data.
This study presents several novel contributions to atmospheric research, especially in the field of spatial downscaling. Its objective is to present a comprehensive intercomparison of the performance of three machine learning algorithms, namely ANN, RF, and CNN, in spatially downscaling daily mean temperature. While the proposed models have been individually employed in downscaling studies, our work is novel in rigorously comparing their performance at high resolution and in complex terrain settings. Our study achieves a higher downscaling ratio (9), which is a significant leap compared to earlier studies [19]. This study places a strong emphasis on assessing the importance of elevation through a sensitivity experiment in which elevation is provided as an additional input. Here, we examine how different models respond to elevation as an additional input and whether it improves their performance. We conduct a feature importance analysis showing the key predictors that primarily contribute to model performance in different seasons. Furthermore, this is one of the first attempts at daily mean temperature downscaling using machine learning in the Non and Adige Valleys, a region with very complex terrain in northern Italy.
Spatial downscaling often encounters constraints due to the unavailability of high-resolution predictand or target datasets, which can limit the scope of such studies. Instead, our research benefits from a gridded 1 km dataset for daily mean temperature created using ground-based measurements [
38], providing a unique opportunity to explore spatial downscaling at high resolutions in this area.
The paper is organized as follows. Section 2 describes the materials and methods, including the study region, the datasets used, and the methods adopted. Section 3 describes the ML models employed in the study. Section 4 presents the results, which include the spatial consistency of models, average metrics of model performance, spatial and seasonal variation in model performance, sensitivity of models to elevation, and feature importance. Section 5 provides a discussion of the results, and Section 6 draws some conclusions and presents an outlook on possible future developments.
4. Results
In this section, we show the results of downscaling in terms of the consistency of models in predicting T2M, comparison of models using aggregated metrics, spatial and seasonal variation in the performance of models, and the importance of elevation for the models using sensitivity and feature importance.
4.1. Spatial Consistency of Models
We selected one representative day for each season to evaluate the spatial consistency of machine learning models in predicting T2M: 15 January 2015 for winter, 15 April 2015 for spring, 15 July 2015 for summer, and 15 October 2015 for autumn (Figure 5). In the figure, Input-T2M is the input ERA5L-T2M with a coarser resolution of 9 km. ANN, RF, and CNN represent the downscaled T2M output of the respective models at a spatial resolution of 1 km, whereas Crespi represents the target reference T2M with a spatial resolution of 1 km. The days were selected on the basis of typical seasonal characteristics. The primary goal here was to visually assess whether the proposed models can reproduce patterns in the target data across different seasons. Hence, spatial consistency refers to the ability of models to accurately capture spatial pattern variations in temperature. Overall, all models effectively improve the spatial resolution of the input ERA5L-T2M, generating more detailed high-resolution outputs. However, the CNN model consistently matches the spatial variability and fine-scale features of the target most closely across all seasons, suggesting better performance than the other models. Especially in winter (15 January 2015) and autumn (15 October 2015), CNN excels at capturing spatial consistency, whereas ANN and RF struggle to reproduce the target data. The RF model performed better than ANN in terms of spatial details; however, it did not capture finer details as effectively as CNN. The ANN model can reproduce broader spatial patterns but tends to produce less detailed predictions. In spring and summer, however, all models (ANN, RF, and CNN) show comparable performance, effectively capturing spatial patterns and variability in the target data. To assess in more detail the accuracy and performance of the models in reproducing the target data for the entire test period (2010–2018) across different seasons, a comparison of the models using evaluation metrics such as RMSE, MAE, R2, and MBE is shown in Section 4.2. These metrics support the comparison of model performance, complementing the visual overview from the spatial plots.
4.2. Average Metrics of Model Performance
Figure 6 shows the values of the spatially and temporally aggregated metrics of the three models adopted (ANN, RF, and CNN) for downscaling across different seasons. The metrics are first calculated for each of the 12 months; for example, for January, the metrics are computed for data across latitude (74), longitude (19), and 279 prediction days (the 31 days of January in each of the 9 years from 2010 to 2018). Then, the metrics are averaged for every season to compare and evaluate the performance of the models. CNN shows better metric values than the other models across all seasons, closely followed by RF. Particularly for summer and spring, both CNN and RF show values of RMSE (1 °C, 1.2 °C) and R2 (0.94, 0.92) quite similar to each other, signifying close performance. However, particularly for winter, CNN performs better than RF, with a lower RMSE (1.29 °C vs. 1.62 °C), a lower MAE (1 °C vs. 1.26 °C), and a higher R2 (0.87 vs. 0.79). A similar pattern of CNN performing better than RF is observed for autumn as well. On the other hand, the ANN model lags behind the RF and CNN models across all seasons.
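The aggregation described above can be sketched in Python with NumPy for one month of data arranged as (days, latitude, longitude) = (279, 74, 19). This is a minimal sketch under stated assumptions, not the code used in the study: MBE is taken as prediction minus target, R2 is taken as the coefficient of determination, and the arrays are synthetic placeholders. The function name `downscaling_metrics` is hypothetical.

```python
import numpy as np


def downscaling_metrics(pred, target):
    """Aggregate RMSE, MAE, MBE and R2 over all days and grid points.

    Assumptions: MBE = mean(pred - target); R2 = 1 - SS_res / SS_tot.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    err = pred - target
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MBE": float(np.mean(err)),
        "R2": float(1.0 - ss_res / ss_tot),
    }


# Synthetic placeholder data: 279 January days, 74 x 19 grid points.
rng = np.random.default_rng(0)
target = rng.normal(loc=0.0, scale=5.0, size=(279, 74, 19))
pred = target + 0.5  # a prediction that is uniformly 0.5 degC too warm

metrics = downscaling_metrics(pred, target)
print(metrics)
```

Seasonal values would then follow by averaging the monthly metrics of the three months in each season.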
All models exhibit seasonal variation in performance for downscaling spatial T2M. The best performance for all the models is obtained in summer, whereas the worst model performance is obtained in winter. The CNN model exhibits the best performance in summer (achieving RMSE = 1.01 °C, MAE = 0.78 °C, and R2 = 0.94), whereas the ANN model shows the lowest performance in winter (RMSE = 1.63 °C, MAE = 1.28 °C, and R2 = 0.79). The performances of the models for spring and autumn are very close to each other, which is better than winter and poorer than summer.
The aggregated MBE for all models shows low biases, with values ranging only from −0.16 °C to +0.21 °C (Figure 6). However, from the average metrics for MBE, it appears that the ANN model outperforms the others, with the lowest MBE values across all seasons. To check whether this is really the case, we looked at the spatial distribution of MBE and its variation with elevation for all models and seasons.
4.3. Spatial and Seasonal Variation in Model Performance
Figure 7 shows an elevation map of the study region (a), the season-wise spatial distribution of MBE (b), and the variation of MBE across elevation bins (c) for the ANN, RF, and CNN models. To show the spatial distribution, MBE is computed at each spatial location over the study region for the prediction period from 2010 to 2018. The spatial distribution shows that both positive and negative MBE values are much larger in magnitude for ANN than for its counterparts RF and CNN. The RF model shows performance comparable to that of CNN, although with larger values, both positive and negative, particularly in winter. The analysis of the spatial distributions reveals that the ANN model exhibits higher positive MBE at higher elevations and negative MBE at lower elevations; these biases compensate each other when aggregated over space and time, resulting in overall lower MBE metrics.
MBE varies with elevation and also with season, showing clear patterns, particularly for summer and winter (Figure 7b,c). In summer, positive MBE decreases with increasing elevation for all models, shown as a green line (Figure 7c). Conversely, in winter, positive MBE increases with increasing elevation for all models except CNN, for which MBE decreases with increasing elevation, shown as a blue line (Figure 7c). In general, we observed that the ANN and RF models exhibit positive biases at higher elevations and negative biases at lower elevations. In addition, the seasonal pattern of MBE with elevation can be observed specifically for ANN and RF, whereas CNN shows a decrease in MBE with increasing elevation in both summer and winter.
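The compensation effect and the elevation bin-wise analysis described above can be illustrated with a toy NumPy sketch. All numbers, the bin edges, and the helper name `mbe_by_elevation` are hypothetical and chosen only to show how biases of opposite sign at low and high elevations can cancel in a domain-wide mean.

```python
import numpy as np


def mbe_by_elevation(bias_map, elevation, bin_edges):
    """Average a per-pixel MBE map within elevation bins.

    Returns one value per bin (NaN where a bin contains no pixels).
    Pixels outside the range covered by bin_edges are ignored.
    """
    bias = np.asarray(bias_map, dtype=float).ravel()
    elev = np.asarray(elevation, dtype=float).ravel()
    idx = np.digitize(elev, bin_edges) - 1  # bin index of each pixel
    n_bins = len(bin_edges) - 1
    out = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            out[b] = bias[mask].mean()
    return out


# Toy 2 x 2 domain: two valley pixels and two high-elevation pixels (m).
elevation = np.array([[300.0, 500.0], [1900.0, 2100.0]])
# Hypothetical per-pixel MBE (degC): cold bias in the valley, warm bias aloft.
bias_map = np.array([[-0.8, -0.6], [0.6, 0.8]])
bin_edges = np.array([0.0, 1000.0, 3000.0])

binned = mbe_by_elevation(bias_map, elevation, bin_edges)
aggregated = bias_map.mean()
print(binned, aggregated)
```

In this toy case the bin-wise MBE is about −0.7 °C and +0.7 °C, while the aggregated MBE is essentially zero, which is why a small aggregated MBE alone does not imply small local biases.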
Figure 8c shows the variation of MAE with different elevations for each model for all seasons. MAE is also computed at each spatial location over a study region for the prediction time from 2010 to 2018. The ANN model shows an increase in MAE with increasing elevation across all seasons, with a slight decrease in MAE for elevation bins from 1700 m to 2300 m. However, the RF model shows different patterns, with relatively invariant MAE across elevations for almost all seasons except summer, where it decreases slightly with elevation.
Conversely, for the CNN model, MAE decreases during summer with an increase in elevation, whereas in autumn, MAE remains consistent for all elevation bins. However, during winter, we observe mixed responses, with higher MAE values for both lower and higher elevation bins and lower MAE values for medium elevation bins for winter and autumn.
4.4. Sensitivity to Elevation
When we used ERA5L-T2M as the only predictor, both the ANN and RF models resulted in scatter density plots exhibiting clearly visible horizontal lines (Figure 9). These patterns suggest that the relationship between the predictor ERA5L-T2M and the predictand is not well captured, as also suggested by the lower correlation and R2 values. Conversely, when EL is provided as an additional input along with ERA5L-T2M to the ANN and RF models, there is a significant improvement in model performance. The scatter plot no longer shows visible horizontal lines, suggesting a more consistent relationship between predictors and predictand. This improvement is marked by a notable increase in the correlation coefficient and R2 values, as shown in Figure 9.
When the CNN model is tested with only ERA5L-T2M as a predictor and with both ERA5L-T2M and EL as predictors, no noticeable difference in its performance is observed. Both cases resulted in identical correlation coefficients and R2 values. Furthermore, for CNN, data points on the scatter plot between prediction and target lie more closely along the 1-to-1 line than for ANN and RF, indicating consistent model performance. The CNN showed a similar pattern of results across the other seasons too, suggesting that the inclusion of elevation as an additional feature does not enhance model performance. This underscores the inherent capability of the CNN model to extract spatial information from the input datasets, making elevation an unnecessary additional input for this model even over complex terrain. However, for ANN and RF, the inclusion of elevation as an additional predictor plays a crucial role in enhancing the model’s performance in complex terrain.
4.5. Feature Importance of Models
Feature importance is a method used in ML to quantify the contribution of each input predictor or feature to model performance. Figure 10 shows the feature importance for the ANN, RF, and CNN models. In the study of downscaling spatial temperature, the identification of influential predictors is important for enhancing model accuracy and interpretability.
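One common, model-agnostic way of quantifying such contributions is permutation importance: a predictor is shuffled and the resulting increase in error is recorded. The NumPy sketch below illustrates the idea only; the text does not state which importance method was used, so this should not be read as the procedure behind Figure 10. The data are synthetic, a least-squares fit stands in for the trained model, the lapse-rate-like coefficient is illustrative, and all names are hypothetical.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in RMSE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base = np.sqrt(np.mean((predict(X) - y) ** 2))
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the link to y
            rmse = np.sqrt(np.mean((predict(Xp) - y) ** 2))
            increases.append(rmse - base)
        scores[j] = np.mean(increases)
    return scores


# Synthetic predictors: coarse temperature, elevation, and an unrelated one.
rng = np.random.default_rng(42)
n = 2000
t2m_coarse = rng.normal(10.0, 5.0, n)
elev = rng.uniform(200.0, 3000.0, n)
unrelated = rng.normal(0.0, 1.0, n)
X = np.column_stack([t2m_coarse, elev, unrelated])
# Hypothetical target: coarse temperature corrected by a lapse-rate-like term.
y = t2m_coarse - 0.0065 * (elev - 1000.0) + rng.normal(0.0, 0.5, n)

# Stand-in model: ordinary least squares with an intercept.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def predict(M):
    return np.column_stack([M, np.ones(len(M))]) @ coef


importance = permutation_importance(predict, X, y)
print(dict(zip(["t2m_coarse", "elev", "unrelated"], importance)))
```

In this synthetic setting the coarse temperature and elevation receive large scores, whereas the unrelated predictor scores close to zero.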
During winter, for the ANN and RF models, the dominant feature is ERA5L-T2M, followed by EL, suggesting an emphasis on broader atmospheric conditions, followed by elevation. For the CNN model, however, the dominant feature in all seasons is T2M, followed by D2M, except in winter, where D2M is the primary predictor, followed by ERA5L-T2M, indicating a different ranking in the winter season. In spring, for the ANN and RF models, T2M still appears as the dominant feature, with EL as the second most important feature. However, the importance given to ERA5L-T2M is reduced, whereas the importance given to EL increases. This adjustment suggests that the relationship between temperature and elevation changes as the season transitions. In summer, ANN and RF agree, with EL emerging as the dominant feature, followed by ERA5L-T2M. This pattern underlines the key role of altitude in summer temperature, as expected. A stronger relationship between temperature and altitude may in turn contribute to the better performance of the models in summer. In autumn, for ANN, EL is the dominant feature, followed by T2M, D2M, and SC. For RF, ERA5L-T2M emerges as the dominant feature once again, suggesting that broader atmospheric conditions influence daily mean temperature in autumn.
5. Discussion
Results suggest the superior performance of the CNN model in downscaling gridded daily mean T2M in complex terrain across all seasons, even with fewer features. This may be due to its better ability to capture spatial features in datasets compared to the other models. Additionally, CNN is inherently designed to handle image-like data and capture spatial dependencies in it, which makes it well suited to gridded datasets and to tasks such as spatial downscaling. In this study, the CNN model applies convolution filters to the predictors (e.g., elevation, temperature) to detect gradients and spatial patterns. CNN captures temperature variations associated with elevation changes or other features by detecting edges and gradients in the input data. During training, these filters learn to represent these spatial relationships, allowing the model to identify how temperature is affected by factors such as elevation. The hierarchical architecture of CNN helps it capture both low-level and high-level spatial features. For example, the first convolutional layer captures low-level spatial features, such as small changes in temperature, whereas deeper convolutional layers extract more complex relationships, e.g., the combined effect of wind and elevation on temperature. Thus, by stacking multiple layers, CNN builds a hierarchical representation of the data, which allows it to recognize both broader and local spatial patterns in the dataset. Pooling layers reduce the dimension of the feature maps while retaining the most significant features. The final fully connected dense layers combine the captured spatial features and produce temperature predictions for each grid cell. This architecture enables CNN to effectively extract and learn complex spatial patterns, giving more accurate predictions for complex terrains.
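The convolution and pooling operations described above can be illustrated on a toy temperature field with NumPy. This is only a sketch of the mechanism: in a trained CNN the filter weights are learned, whereas here a fixed Sobel-type kernel is assumed so that the response to a horizontal gradient is easy to follow. The helper names `conv2d_valid` and `max_pool2d` are hypothetical.

```python
import numpy as np


def conv2d_valid(field, kernel):
    """2-D cross-correlation without padding, as applied in CNN layers."""
    kh, kw = kernel.shape
    h, w = field.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(field[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; trailing rows/columns are dropped."""
    h2, w2 = fmap.shape[0] // size, fmap.shape[1] // size
    trimmed = fmap[:h2 * size, :w2 * size]
    return trimmed.reshape(h2, size, w2, size).max(axis=(1, 3))


# Toy 6 x 6 temperature field (degC) with a sharp west-east step,
# e.g. a valley floor next to a warmer slope.
field = np.zeros((6, 6))
field[:, 3:] = 5.0

# Fixed Sobel-type kernel that responds to west-east gradients.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(field, sobel_x)  # shape (4, 4)
pooled = max_pool2d(feature_map, size=2)    # shape (2, 2)
print(feature_map[0], pooled)
```

The feature map is nonzero only where the filter window straddles the step, and pooling halves each dimension while keeping the strongest response in each 2 × 2 block.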
Considering a downscaling ratio of 9, our study showed a good performance for CNN followed by RF for spatial downscaling of daily mean T2M. The metrics for CNN and RF show R2 > 0.90, RMSE < 1.25 °C, and MAE < 0.97 °C in all seasons except winter, where R2 ranges from 0.79 to 0.87, RMSE ranges from 1.29 °C to 1.63 °C, and MAE ranges from 1 °C to 1.28 °C. Moreover, ANN also shows comparable performance but lags behind CNN and RF in each season.
ANN and RF are predominantly employed for point scale downscaling [
9,
14,
16]. Our study successfully tested these models for spatial downscaling of daily T2M. When compared to the Karaman and Akyurek [
36] study that focused on downscaling monthly ERA5-Land T2M using RF for point scale downscaling, our study shows significant advantages in terms of temporal resolution with the RF model, with MAE values ranging from 0.80 °C to 1.26 °C and RMSE values from 1.04 °C to 1.62 °C across different seasons. Also, our results for daily downscaling of T2M using RF are comparable with the MAE (1.22 °C) and RMSE (1.65 °C) achieved on the monthly scale by Karaman and Akyurek [36]. Moreover, in our study, CNN attains the lowest MAE and RMSE among all models, with MAE values ranging from 0.78 °C to 1.00 °C and RMSE values ranging from 1.01 °C to 1.29 °C, showing superior performance. Both studies show the robustness of ML models in downscaling T2M over complex topography. However, our focus on daily T2M provides finer temporal resolution with better performance for CNN.
Results indicate that ANN and RF also exhibit performances comparable to that of CNN, especially during spring and summer. Our study underlines the importance of elevation as an auxiliary predictor for enhancing the performance of ANN and RF in spatial downscaling of temperature in complex terrains, and recommends its use. However, it also reveals no significant improvement for CNN, implying CNN’s ability to extract spatial features from input data without relying on elevation as an extra input feature. Notably, our study achieves a greater downscaling ratio (9) over complex terrains, advancing beyond the downscaling ratios of approximately 7 and 4 in the earlier studies [3,19], respectively.
We observed a pattern in seasonal variation in the models with better performance in summer and lower performance in winter, in line with earlier studies [
36,
59]. This strengthens the notion that seasonal variation significantly influences model performance. To analyze plausible reasons behind seasonal variation in the performance of the model, we conducted a seasonal comparison between input ERA5L-T2M and reference target T2M for errors. We calculated different metrics such as RMSE, MAE, and MBE between input ERA5L-T2M and target T2M data for different seasons (
Figure 11).
ERA5L-T2M exhibits the highest errors during winter, implying that ERA5L-T2M is less accurate in winter than in other seasons (
Figure 11). Vanella et al. [
60] observed similar results showing less accuracy of ERA5L-T2M data in winter and higher accuracy in autumn and summer when ERA5L-T2M is compared with ground-based observations over the regions of Lombardy, Apulia, Sicily, and Campania in Italy. This implies that large errors in input data during winter months may lead to poorer model performance. The errors reduced through spring and summer, reaching the lowest in autumn. However, it is also worth noting that, although the input seems most accurate in autumn, the best model performance was observed in summer and not in autumn. This suggests that errors in input data may be a contributing factor but not the sole reason for the lower performance of models in winter.
Some atmospheric processes are more frequent and exclusive to colder seasons, especially in winter, and are rarely present in other seasons. Examples include thermal inversions and katabatic winds, which occur at a very local scale. These types of phenomena are more common in winter because longer nights lead to a stronger radiation loss from the earth and, as a consequence, surface cooling. For instance, thermal inversions occur when the air near the surface is cooler than the air above [
61]. Similarly, katabatic winds typically flow during the night, when air layers on mountain slopes cool faster than the valley atmosphere and make colder air drain toward valley floors, thus affecting the surface temperature there. The occurrence of such phenomena at the local scale recalls the complex interaction between topography and atmospheric conditions. While these processes are crucial in understanding temperature patterns, capturing them is still a significant challenge for numerical weather prediction models. These limitations may contribute to discrepancies in the performance of models during winter or colder seasons.
The MBE and MAE show distinct patterns across the models, particularly across elevation bins. For the ANN and RF models, MBE increases with elevation in winter and decreases in summer. In winter, the models seem to struggle more with elevation-related variations when ERA5L-T2M is the dominant feature. Additionally, the complexities of winter conditions, such as snow cover and temperature inversions, introduce additional interactions that these models may not be able to take fully into account. Conversely, during summer, MBE decreases with elevation when the dominant feature is EL. This shift may reflect a better representation of the influence of elevation on temperature. The CNN model shows a consistent decrease in MBE with elevation in both winter and summer. The robustness of the CNN model and its superior performance can be attributed to the ability of the convolution layers to capture spatial patterns more effectively than ANN and RF.
The models also show variations of MAE along the elevation. The ANN model consistently shows the trend of increasing MAE towards higher elevations across all seasons, similar to findings [
3] obtained for a study over the western United States on the downscaling temperature in complex terrains. It seems that ANN might struggle at higher elevations, with increasing complexity leading to higher MAE values irrespective of the seasons. The elevation distribution data have fewer data points at higher elevations (
Figure 2), which could also play an important role in model training and performance. With less exposure to high-elevation data points, the ANN may not adequately learn the specific patterns and conditions in these regions, which might have contributed to increased errors in those elevation ranges. In contrast, the RF model shows comparatively stable MAE across elevation, with a slight decrease in summer. This stability implies that RF may reduce errors more effectively as a result of its ensemble learning approach. The CNN model structure and its ability to extract spatial features from the data, in turn, might mitigate the negative impacts of this data imbalance, yielding lower and more stable MBE and MAE across elevations. The observed MBE and MAE patterns suggest that shifts in feature importance, model structure, seasonal complexity, and data distribution should all be considered when assessing model performance.
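As a minimal sketch of how elevation-binned MBE and MAE of this kind can be computed, the following Python snippet groups prediction errors by elevation bin. The function name `errors_by_elevation`, the bin edges, the synthetic values, and the sign convention (prediction minus observation) are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def errors_by_elevation(pred, obs, elev, bin_edges):
    """Return (MBE, MAE, count) per elevation bin; NaN marks empty bins."""
    pred, obs, elev = (np.asarray(a, dtype=float) for a in (pred, obs, elev))
    diff = pred - obs                        # assumed sign: prediction minus observation
    idx = np.digitize(elev, bin_edges) - 1   # index of the bin each sample falls in
    n_bins = len(bin_edges) - 1
    mbe = np.full(n_bins, np.nan)
    mae = np.full(n_bins, np.nan)
    count = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        mask = idx == b
        count[b] = mask.sum()
        if count[b] > 0:
            mbe[b] = diff[mask].mean()          # mean bias error
            mae[b] = np.abs(diff[mask]).mean()  # mean absolute error
    return mbe, mae, count

# Synthetic values for illustration only (elevation in m, temperature in deg C).
elev = np.array([200.0, 400.0, 1200.0, 1800.0, 2600.0])
obs = np.array([10.0, 9.0, 4.0, 1.0, -3.0])
pred = np.array([11.0, 8.0, 5.0, 2.0, -1.0])
edges = np.array([0.0, 1000.0, 2000.0, 3000.0])

mbe, mae, count = errors_by_elevation(pred, obs, elev, edges)
```

Returning the sample count per bin alongside the errors makes it easier to judge whether larger errors at high elevations coincide with sparse data in those bins.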
We observed variations in the performance of the ANN, RF, and CNN models depending on the inclusion of elevation as a predictor, as shown in
Figure 9. The CNN model behaved differently from the ANN and RF models, showing lower sensitivity to elevation as an explicit predictor. A possible reason lies in the architectural differences between the CNN and the other models. CNN models are designed to handle images or gridded data, which makes them effective at capturing spatial features and relationships between neighboring grid points. In the CNN model, convolutional layers apply filters to the input data, which helps in detecting spatial features such as temperature gradients and features related to topography. As a result, the CNN can implicitly account for the effects of elevation through the spatial data itself. For example, even when ERA5L-T2M was the only predictor, the CNN could still infer the impact of elevation by learning from the spatial context, which includes temperature variations related to changes in elevation. In effect, the CNN model learns elevation information indirectly through its convolution and pooling operations. This ability to capture spatial dependencies explains why the explicit inclusion of elevation as an additional feature does not change its performance. On the other hand, the ANN and RF models do not have an inherent ability to capture spatial patterns and dependencies. For ANN and RF, the inclusion of elevation as an explicit predictor is crucial for improving their performance, especially in complex terrain. When the ANN and RF models were provided with ERA5L-T2M as the only predictor, excluding elevation, the scatter density plots showed visible horizontal lines. The inclusion of elevation as an explicit predictor allows the ANN and RF models to better capture the relationship between temperature and topography, as shown by the disappearance of the horizontal lines and the improved R² and correlation values.
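One plausible reading of the horizontal lines (an interpretation, not a mechanism stated in the text) is that a pointwise model fed only the coarse temperature can return just one value per distinct coarse input, so all fine-scale pixels inside a coarse cell share a single prediction. The sketch below illustrates this with two hypothetical pointwise functions; the coefficients, the grid layout, and the lapse rate of -0.0065 K/m are assumptions chosen for illustration, not fitted models from the study.

```python
import numpy as np

# Hypothetical setup: 3 coarse cells, each covering 4 fine-scale pixels.
coarse_t2m = np.array([2.0, 5.0, 8.0])       # one coarse temperature per cell (deg C)
coarse_on_fine = np.repeat(coarse_t2m, 4)    # coarse value copied onto every fine pixel
elev = np.linspace(200.0, 2500.0, coarse_on_fine.size)  # fine-scale elevation (m)

def pointwise_t2m_only(t):
    """Stand-in for a pointwise model that sees only the coarse temperature."""
    return 0.9 * t + 0.5

def pointwise_with_elevation(t, z, lapse=-0.0065, z_ref=1000.0):
    """Stand-in for a pointwise model that also sees fine-scale elevation."""
    return 0.9 * t + 0.5 + lapse * (z - z_ref)

pred_a = pointwise_t2m_only(coarse_on_fine)
pred_b = pointwise_with_elevation(coarse_on_fine, elev)

# Number of distinct predicted values across the 12 fine pixels.
n_levels_a = np.unique(pred_a).size   # one level per coarse cell
n_levels_b = np.unique(pred_b).size   # predictions now vary within each cell
```

In a scatter plot of predictions against observations, the few discrete levels in `pred_a` would appear as horizontal bands, while the elevation-aware predictions in `pred_b` spread within each coarse cell.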
The detailed feature importance analysis reveals distinct patterns for the proposed models, showing their strengths and limitations in different seasons (
Figure 10). The ANN model assigns high importance to T2M and elevation and moderate importance to all dynamic features across all seasons, whereas the static features (slope, aspect, and curvatures) are consistently deemed less important. In contrast, RF is highly selective, relying mostly on ERA5L-T2M and EL as the primary predictors, with the other features having less importance across all seasons. For CNN, ERA5L-T2M is consistently the dominant feature, followed by D2M, in all seasons except winter, when D2M is the most dominant one. In addition, elevation is given no importance at all, which suggests that the CNN model can extract spatial features on its own without relying on elevation as an explicit additional input. This aspect of CNN underscores its characteristic strength in spatial feature extraction in the context of spatial temperature downscaling.
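This section does not state how the feature importances were computed, so the following is only a generic illustration of one common model-agnostic approach, permutation importance, swapped in here as an example. It measures how much the MAE grows when a single feature column is shuffled. The function name, the synthetic data, and the column labels in the comments are hypothetical.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MAE when each feature column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base_mae = np.mean(np.abs(predict(X) - y))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and y
            increases.append(np.mean(np.abs(predict(X_perm) - y)) - base_mae)
        importances[j] = np.mean(increases)
    return importances

# Synthetic data: columns stand in for a coarse temperature, elevation, and slope.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1]   # the third column has no effect by construction

def predict(X):
    """Hypothetical fitted model that reproduces the synthetic relationship."""
    return 2.0 * X[:, 0] + 0.5 * X[:, 1]

imp = permutation_importance(predict, X, y)
```

A feature the model never uses, like the third column here, receives zero importance under this measure, which is the kind of signature the CNN shows for elevation.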
Based on our analysis, model performance varied with region and season, underscoring the importance of choosing the appropriate model for specific applications. The ANN model performed well in summer across the parts of the study region with moderate elevations. However, it struggled in regions with higher elevations, as shown by higher biases. Thus, the ANN model’s ability to capture nonlinear relationships between temperature and other predictors can be a good fit for regions with less complex terrain. The RF model handled both lower and higher elevations relatively well. It exhibited consistent performance in summer, spring, and autumn, but its accuracy declined in winter, indicating a limited ability to capture extreme cold temperatures. The CNN model, in contrast, outperformed all others in the complex region across all seasons. Although its performance declined slightly in winter, with some further tuning, CNN could potentially also be used to predict temperature for important practical applications such as frost forecasting.