**3. Results**

A comparison between several FNN configurations is first shown in Figure 4. Each graph represents the overall accuracy of a certain FNN when iterating just one of its parameters, and while increasing the number of neurons in the hidden layers. From this iterative process, a common, optimal FNN configuration for both the TEMP and the UHII approach was established. The optimal structure was defined as a neural network with seven inputs, two hidden layers of 18 and 9 neurons respectively, and one output. In that sense, it was found that increasing from one to two hidden layers produced a significant improvement in the models' accuracy, while increasing the number of hidden layers further did not. Similarly, moving from 100 to 200 epochs during the training phase could reduce the error of the FNN, while the computational expense of using 500 epochs instead of 200 did not seem justified. This was particularly evident when having tens of neurons in the hidden layers.
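The optimal structure described above (seven inputs, hidden layers of 18 and 9 neurons, one output) can be sketched as a plain forward pass. The weights below are randomly initialised for illustration only; this is a minimal sketch, not the authors' implementation, with the ELU/linear activations taken from the configuration described later in this section:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes of the optimal configuration: 7 inputs -> 18 -> 9 -> 1 output
sizes = [7, 18, 9, 1]

# Randomly initialised weights and biases, for illustration only
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def elu(x, alpha=1.0):
    """Exponential Linear Unit, used here for the hidden layers."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def forward(x):
    """Forward pass: ELU on the hidden layers, linear output."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = elu(x @ w + b)
    return x @ weights[-1] + biases[-1]

# A single (dummy) sample with the seven meteorological inputs
y = forward(np.ones(7))
print(y.shape)  # (1,)
```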

**Figure 4.** Average Root Mean Squared Error for the different FNN configurations of Table 2. Results obtained with the models derived from the TEMP and the UHII approach for the site *Embajadores*. Three weeks were selected, each one representing a different atmospheric stability scenario. The timeframe used to train these models extends from July 2016 to September 2017.

In some cases, due to the performance differences between the TEMP and the UHII approach, a common ground had to be reached in terms of the optimal configuration. That was the case of the optimiser, where Stochastic Gradient Descent (SGD) seemed to produce the best results for those FNNs modelling the UHI intensity, but it led to exploding gradient problems when modelling the temperature. Thus, the Adaptive Moment Estimation (Adam) optimiser was used instead, which performed well in both scenarios. For the activation functions, a combination of the Exponential Linear Unit (ELU) for the hidden layers and the linear function for the output layer was used.
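The Adam optimiser referred to above combines momentum with a per-parameter scaling of the learning rate, which is what helps it avoid the exploding gradients observed with SGD here. The sketch below implements the standard bias-corrected Adam update on a toy quadratic; the learning rate and iteration count are arbitrary choices for the example, not values from this study:

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment (scaling)
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, (m, v, t)

# Minimise f(theta) = theta**2, whose gradient is 2*theta
theta = 1.0
state = (0.0, 0.0, 0)
for _ in range(2000):
    theta, state = adam_step(theta, 2 * theta, state, lr=0.05)
print(theta)  # drifts toward the minimum at 0
```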

Overall, UHII models presented fewer convergence problems than TEMP models, which seemed to have difficulties with some activation functions and optimisers. Furthermore, the UHII approach usually outperformed the TEMP approach: the former not only produced models with relatively smaller errors than the latter, but also required fewer neurons per hidden layer to reach a similar accuracy. This behaviour might be indicative of a clearer and more direct relationship between inputs and output, which in the case of the UHII approach links parameters such as wind speed, precipitation, or solar radiation with the UHI formation.

Differences between the two modelling approaches also arise when looking at the inputs' relevance. In that sense, the sensitivity analyses presented in Figures 5 and 6 reveal significant variations among them. The temperature from the reference site shifts from being the most relevant parameter of the entire FNN (TEMP approach) to being one of the least important (UHII approach). This is especially visible at night, when inter- and intra-urban temperature differences are most pronounced. The other parameters, albeit with different magnitudes, seem to condition the outcome of both models in a similar way: wind speed and direction appear highly influential during the night, while solar radiation and relative humidity seem to be key during the day.

Although the UHII approach appears to yield more balanced models, this apparent advantage does not seem to have a significant impact on their outputs when trained with 12 months of data. In this scenario, reasonably good results, with similar error patterns, are obtained for both approaches. As can be noted in Figure 7, modelled temperatures fit satisfactorily with the measured temperatures at the urban site under a wide variety of circumstances, including different UHI scenarios: a rainy and windy week with generalised low UHI intensities (<2 °C); a week with varying meteorological conditions, during which a sudden weather change from calm to rainy was observed, leading to a rapid change in the UHI intensities; and a calm week with strong UHI intensities (>5 °C), probably reinforced by temperature inversions. The greatest errors seem to accumulate on those nights when unusual conditions occur, such as when very high UHI intensities, close to 10 °C, are registered, or when the UHI intensity drops and rises abruptly, perhaps coinciding with occasional and localised weather events such as rainfall. Overall, the models produced relatively smooth time series, without spikes or large variations from one hour to the next, despite not having a built-in temporal dependence between consecutive outputs. Using a moving average for the wind speed seems to have contributed to reducing the noise in the models' output.
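The wind-speed smoothing mentioned above can be reproduced with a simple moving average; the three-hour window below is an assumption made for illustration, as the window length is not specified here:

```python
import numpy as np

def moving_average(x, window=3):
    """Simple moving average over a 1-D series (e.g. hourly wind speed)."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

wind = np.array([2.0, 8.0, 2.0, 8.0, 2.0])  # noisy hourly wind speed (m/s)
print(moving_average(wind))  # [4. 6. 4.]
```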

**Figure 5.** Sensitivity analysis of the inputs of the models. The same FNN configuration was used targeting the urban temperature (top) and the UHI intensity (bottom). The hour was fixed at 12 a.m. (Nighttime).

**Figure 6.** Sensitivity analysis of the inputs of the models. The same FNN configuration was used targeting the urban temperature (top) and the UHI intensity (bottom). The hour was fixed at 12 p.m. (Daytime).

**Figure 7.** Results obtained with the models derived from the TEMP and the UHII approach for the site *Embajadores*. Three weeks were selected, each one representing a different atmospheric stability scenario. The timeframe used to train these models extends from July 2016 to September 2017.

Models targeting the UHI intensity achieved a slightly better score in the error metrics, with a reduction of the error between 6.4% and 11.7% (see Table 3). The RMSE was 1.09 °C and 1.02 °C for the TEMP and the UHII approach, respectively. These results are in line with previous studies, such as Kim and Baik (RMSE = 1.18 °C, [70]) or Demirezen et al. (RMSE = 1.29 °C, [75]), both modelling outdoor air temperature. The only exception is the coefficient of determination, which is extraordinarily high when targeting the temperature (*R*<sup>2</sup> = 0.99). This is also in line with previous studies (e.g., [75,77]) and is further addressed in the discussion section.
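The error metrics reported in Table 3 follow their standard definitions; a minimal sketch with made-up observed and modelled temperatures:

```python
import numpy as np

def rmse(y_obs, y_mod):
    """Root Mean Squared Error."""
    return float(np.sqrt(np.mean((y_obs - y_mod) ** 2)))

def mae(y_obs, y_mod):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_obs - y_mod)))

# Hypothetical observed vs. modelled urban temperatures (°C)
obs = np.array([20.0, 22.0, 25.0, 24.0])
mod = np.array([21.0, 21.0, 24.0, 26.0])
print(round(rmse(obs, mod), 2))  # 1.32
print(round(mae(obs, mod), 2))   # 1.25
```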

**Table 3.** Metrics of the selected models targeting both the air temperature and the UHI intensity. Both models were trained using 12 months of data (July 2016–September 2017). The two variables regressed are modelled and monitored air temperatures.


#### *Shortening the Training Dataset*

The results presented above correspond to FNN models trained with one year of hourly data. So far, the TEMP and the UHII approach have proved to yield similar results. When training models with less data, however, differences started to arise. Results show that using 9 months instead of 12 months of data slightly increased the RMSE, by 0.9% and 2.4% for the TEMP and the UHII approach, respectively. When using 6 months of data, the accuracy loss grew more markedly, especially in the case of the TEMP models (11.7% vs. 6.2%). The error kept growing when using 3 months of data, with the tendency becoming more accentuated and leading to significant differences between both approaches (63.1% vs. 40.7%). A similar trend was observed for the MAE and MAD metrics, which can be found in Table 4.
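The relative accuracy losses quoted above are percentage increases of the error with respect to the 12-month baseline. The RMSE values in the example below are hypothetical, inserted only to show the calculation:

```python
def relative_loss(err_short, err_baseline):
    """Accuracy loss (%) of a model trained on a shortened dataset,
    relative to the error of the 12-month baseline model."""
    return 100.0 * (err_short - err_baseline) / err_baseline

# Hypothetical RMSEs: 1.02 °C for the baseline vs. 1.44 °C for a 3-month model
print(round(relative_loss(1.44, 1.02), 1))  # 41.2
```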

**Table 4.** Relative accuracy loss when reducing the size of the training dataset for both the TEMP and the UHII approach. The number of months in each column establishes the accuracy baseline. The accuracy was obtained using the evaluation dataset.


These results are the average error yielded by several models trained with shortened datasets and are relative to the accuracy of the models trained with 12 months of data. Figure 8 presents the models' absolute accuracy levels, including the accuracy of all models trained with each shortened dataset. As already noted, differences arise not only when reducing the training datasets, but also when changing from one approach to the other. The large variability of error among the models trained with 3 months of data is noticeable, and it is more accentuated in the case of the TEMP approach. Depending on the data used during training, the outcome ranges from models with an acceptable overall accuracy (RMSE < 1.5 °C, in line with previously developed models) to others whose suitability for reasonable modelling is doubtful (RMSE > 2 °C).

Yet, these results represent the average cumulative error over a year. A more detailed analysis showed that the models' error is unevenly distributed over the months, with accuracy degrading outside the months for which they were trained. It was also observed that their results do not suffer excessively within the months for which they were trained, being comparable with models trained on more data. In that sense, Figure 9 shows the additional error yielded by models trained with only 3 months of data. For convenience, these months were made coincident with the seasons of the year, and a model trained with all 12 months of data was used as a reference.

**Figure 8.** Comparison of the error obtained by several models trained on datasets of different lengths, differentiating between those targeting the air temperature and the UHI intensity. The RMSE is presented on the left and the MAD on the right.

**Figure 9.** Additional error yielded by models trained with just 3 months of data, using both the TEMP and the UHII approach. These models are named according to the season they were trained on. The reference error baseline was established by the same ANN configuration trained with 12 months of data.

The results show that the models systematically tend to minimise their error within their own season, with the RMSE gradually increasing as they move away from it. This is accentuated for models trained in winter and summer. The reason behind this could lie in the annual cyclical behaviour of temperatures: between solstices and equinoxes, temperatures remain at one extreme of the annual cycle, either at the high end (summer) or at the low end (winter). Between the equinoxes and solstices (spring and autumn), though, the transition between the two extremes takes place. This could favour the training of the ANN, as it would extend pattern recognition to practically the entire annual temperature range; only the extremes would then depend on the neural network's ability to generalise and extrapolate its modelling capacity beyond what is known during training.

This dynamic is noticeable in the case of the UHII approach as well, although it seems to be rather less pronounced. As pointed out in the introduction, Madrid's UHI does not seem to follow a seasonal pattern, which means it might reach its highest and lowest intensities at any time during the year (see Figure A1). However, these UHI peaks depend on the meteorological conditions, so the loss of accuracy registered by these UHII models is likely related to the concentration of certain meteorological conditions during the training phase. In other words, these FNNs would have difficulties in refining the modelling if, within the three months of data used to train them, there is not a sufficiently large record of the different meteorological conditions that favour the occurrence of UHI.

The performance differences between the TEMP and the UHII approach become clearly noticeable when plotting the data. In this respect, Figure 10 shows how a TEMP model trained from May to August produces quite precise results for June of the next year, similar to those obtained by models trained with 12 months of data. However, when trying to obtain the temperature profile in February, that same model barely captures the global trend. In that scenario, the UHII model, trained with the same three months of data, was able to fit the observed values with higher accuracy. It accumulated error at the same moments as the models trained with 12 months of data, in many cases amplifying it. Despite the unusual distribution of temperatures and UHI intensities for that week, the UHII model was able to capture most of it, which was surprising given the relatively small amount of data used for its training.

**Figure 10.** Comparison of TEMP and UHII models trained with 12 and 3 months of data for the site Embajadores. Results are presented for one week of June (top) and one week of February (bottom). The TEMP model trained on 3 months of data shows difficulties when modelling the temperatures further away from its training temporal range (i.e., February, bottom).

## **4. Discussion**

The results of this research point towards the potential reduction of the training datasets without a significant loss of accuracy. This could facilitate the work of urban climate researchers, thus promoting the development of shorter and simpler monitoring campaigns. This does not mean that it is preferable to use smaller amounts of data to train ANN models, but that their accuracy might not be compromised when they are trained in this manner. Although using large amounts of high-quality data is always desirable, in some cases it is not possible due to varying circumstances, such as budget constraints or human resources limitations. In this context, knowing where the accuracy limits of the models lie when trained with less data might help researchers explore their experimental data or design new measurement campaigns efficiently.

In this study we propose the use of empirical, FNN-based models to extend the temporal coverage of urban monitoring campaigns. These models, although limited in their ability to make temporal predictions into the future, can be used to adjust long-term records gathered outside the city to the urban context. This approach, the generation of long-term datasets by looking backwards, might be potentially useful in many disciplines, including the generation of site-specific weather files for building energy modelling [120–123], the downscaling of heat-related epidemiological studies to evaluate the effect of urban temperatures on health [4,124–127], or the identification and characterisation of energy-poor households in urban environments [128–132].

It is worth noting that the use of the UHI intensity instead of the outdoor temperature as the output of the FNN models yielded significantly better results mainly when reducing the size of the training dataset. The accuracy improvement was limited when using 9 or more months of data during the training phase. The benefits of targeting the UHI intensity with the FNN model are, therefore, linked to the potential of using smaller datasets to model outdoor urban temperatures. However, using the UHI intensity instead of the temperature as the output, on the grounds of the lower seasonality of the former, could be arguable. ANNs are universal function approximators [133] and, for that reason, using one parameter or the other should not produce significant differences. Although this has been mathematically demonstrated, Curry [134] showed that modelling the seasonality of a time series with an FNN would require a very large structure. This structure would grow exponentially when increasing the length of the dataset, since more turning points are likely to appear. In fact, Zhang and Qi [135] recommended not only deseasonalising the time series, but also removing its trend (if any). Nowadays, pre-processing the dataset to make it stationary before feeding the ANN is a widespread practice and has proven to be very effective with RNNs as well [88,136]. This approach might be helpful in the future for other studies such as Han et al. [86], where the UHI intensity could be used instead of the outdoor air temperature to remove much of the seasonality from their temperature forecasts. However, it is unclear whether these findings can be extended to FNNs that use a reference site for modelling outdoor urban temperatures without any time dependence.
Other reasons, such as the range of temperatures or the concentration of meteorological stability conditions in the training dataset, might explain the varying accuracy results between the TEMP and the UHII approach when training these types of models, especially when using just 3 months of training data.
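The deseasonalisation recommended by Zhang and Qi can be sketched as subtracting the mean profile observed at each phase of the cycle; the daily period and synthetic series below are assumptions chosen only for the example:

```python
import numpy as np

def deseasonalize(series, period=24):
    """Remove a seasonal cycle by subtracting the mean value observed at
    each phase of the cycle (each hour of the day for period=24)."""
    series = np.asarray(series, dtype=float)
    phases = np.arange(len(series)) % period
    profile = np.array([series[phases == p].mean() for p in range(period)])
    return series - profile[phases], profile

# A pure daily cycle is removed entirely, leaving a zero residual
t = np.arange(24 * 10)
cycle = 10.0 + 5.0 * np.sin(2 * np.pi * t / 24)
residual, profile = deseasonalize(cycle, period=24)
print(np.allclose(residual, 0.0))  # True
```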

In line with the latter, it seems that the selection of days with different meteorological conditions and at different times of the year might be more relevant for the modelling than the continuity of the monitoring campaign. Thus, it may be more appropriate for future studies to work with shorter, discontinuous monitoring campaigns covering a wider range of meteorological situations rather than a single, continuous campaign that might be concentrated in a specific time of the year. The results may also support the use of data from sources whose long-term continuity may be compromised (e.g., CWS). In these cases, it would be relevant not only to apply filtering techniques to reduce the risk of introducing outliers, but also to carry out frequency distribution analyses to ensure that all meteorological conditions are included in the modelling.

Some attention should be drawn to the pertinence of using certain error metrics. Despite being very widespread (e.g., [75–77]), the use of *R*<sup>2</sup> as a performance indicator could be misleading [137,138]. As can be seen in Equation (1), *R*<sup>2</sup> relies both on the size of the residuals (*SSres*, the actual deviation of the predictions from the observed values) and on the total variance of the dependent variable (*SStot*):

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \overline{y}\right)^2} = 1 - \frac{SS_{res}}{SS_{tot}} \tag{1}$$

Thus, obtaining a higher *R*<sup>2</sup> does not necessarily mean having less error (numerator), but might be the result of a higher variance of the output (denominator). This was observed in this study when comparing the two approaches. In the case of the TEMP approach, the variance of the output temperature (which ranges from −2 to 41 °C) is much higher than the variance of the UHI intensity (ranging from 0 to 8 °C). Furthermore, since the TEMP approach contains an input variable (airport temperature) that explains most of the variance of the output variable (urban temperature), the *R*<sup>2</sup> tends to be extremely high (*R*<sup>2</sup> > 0.99). This explains why significantly lower *R*<sup>2</sup> values were obtained when using the UHII approach despite yielding better results on the rest of the performance indicators (RMSE, MAE, MAD).
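The dependence of *R*<sup>2</sup> on the output variance can be demonstrated numerically: the synthetic series below share identical residuals, yet the wider-ranging output (as in the TEMP approach) scores a much higher *R*<sup>2</sup>. The ranges mirror those quoted above, but the sample size and noise level are arbitrary:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination, as in Equation (1)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 1.0, 1000)   # identical errors in both cases

temp = rng.uniform(-2.0, 41.0, 1000)     # wide output range (TEMP approach)
uhii = rng.uniform(0.0, 8.0, 1000)       # narrow output range (UHII approach)

print(round(r2(temp, temp + residuals), 3))  # close to 1
print(round(r2(uhii, uhii + residuals), 3))  # noticeably lower
```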

Taking the above into consideration, it would be worthwhile to investigate whether the behaviour of the TEMP and the UHII models presented in this paper is only attributable to the case of Madrid or if, on the contrary, it might be common to other cities at different latitudes and under different climatic conditions. Some existing studies have identified strong seasonal differences in UHI intensities in other cities [139–142], while others have not found such differences [143–145]. In this respect, a strong annual seasonality of the UHI intensity would probably limit the capacity of the models to produce accurate results when trained with small datasets. Conversely, in cities where air temperatures remain within a narrow range throughout the year, such as in tropical regions, the TEMP approach might perform better.
