*Article* **Evaluation of Evaporation from Water Reservoirs in Local Conditions at Czech Republic**

**Eva Melišová 1,2,\* ,† , Adam Vizina 1,2,† , Martin Hanel 1,2,† , Petr Pavlík 1,2 and Petra Šuhájková 1,2**


**Abstract:** Evaporation is an important factor in the overall hydrological balance. It is usually derived as the difference between runoff, precipitation and the change in water storage in a catchment. The magnitude of actual evaporation is determined by the quantity of available water and is heavily influenced by climatic and meteorological factors. Currently, statistical methods such as linear regression, random forest regression or other machine learning methods are available to calculate evaporation. However, deriving these relationships requires observations of evaporation from evaporation stations. In the present study, linear regression and random forest regression were used to calculate evaporation, with part of the models designed manually and the other part built using stepwise regression. Observed data from 24 evaporation stations and ERA5-Land climate reanalysis data were used to create the regression models. The proposed regression formulas were tested on 33 water reservoirs. The results show that manual regression is a more appropriate method for calculating evaporation than stepwise regression, with the caveat that it is more time consuming. Linear and random forest regression differ in the variance of the results; random forest regression is better able to fit the observed data. On the other hand, the result of linear regression is simpler to interpret. The study showed that ERA5-Land reanalysis data combined with the random forest regression method are suitable for calculating evaporation from water reservoirs in the conditions of the Czech Republic.

**Keywords:** evaporation; water reservoir; regression; observed data; ERA5-Land data; R language

## **1. Introduction**

Water management, changes in the natural water regime and sustainable landscape have become important topics of social discussion and policy, not only in the Czech Republic but around the world [1]. Global and local climatic conditions are clearly changing and will have an impact on the water management sector, so they deserve the highest attention. Evaporation in the Czech Republic is also changing [2].

However, it is not only the climatic conditions that change, but also the technology and knowledge that can be used in water management and specifically in hydrology. With the rapid development of remote sensing tools in recent decades, easy-to-use, high-quality products have become available to both professionals and the public in the field of water resources.

In recent years, there has been significant development in the supply of remote sensing information that can be used in water management, and not only by professionals [3–5]. Another option is the use of globally available climate reanalyses or other available data sources. Despite the growing availability of data and modelling tools, a question arises: How significant is the impact of the ongoing climate change on hydrological balance components and the consequent impact on water management [6]?

**Citation:** Melišová, E.; Vizina, A.; Hanel, M.; Pavlík, P.; Šuhájková, P. Evaluation of Evaporation from Water Reservoirs in Local Conditions at Czech Republic. *Hydrology* **2021**, *8*, 153. https://doi.org/10.3390/hydrology8040153

Academic Editors: Aristoteles Tegos and Nikolaos Malamos

Received: 31 August 2021 Accepted: 29 September 2021 Published: 12 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The hydrological balance is tied to rainfall-runoff processes, which are driven by climatic, geographical and geomorphological factors. The climatic factors include meteorological factors affecting the evaporation and evapotranspiration from the catchment, such as: precipitation, humidity, soil moisture, evaporation, air temperature, wind speed and direction and atmospheric pressure [7].

The key role of evapotranspiration in the hydrological balance has been the subject of many recent studies, e.g., [8–11], which point out that evapotranspiration significantly affects the balance. It is now widely recognized that evaporation plays a crucial role in the hydrological cycle over most of the Earth's surface.

The study [12] illustrates the impacts of climate change on the water cycle, which may affect total evaporation, precipitation, atmospheric humidity and horizontal moisture transport at the global scale.

There are many methods to calculate evaporation, which can be calculated from free water, from the soil surface or from vegetation over a period of time. Evaporation can be evaluated by direct methods (measurement) or by indirect methods: empirical methods, remote sensing of the Earth at regional or global scales [13,14], or models classified as fully physically based combination models, semi-physically based models or black-box models [15].

The total evaporation can be divided into actual, potential or reference evapotranspiration. The potential evaporation can be determined by empirical relationships or by measurement; the empirical relationships may differ in the input data or in the time step [8,16]. The calculation of the reference evapotranspiration follows the FAO methodology, with the reference surface being defined in [17].

The studies [18,19] evaluated evapotranspiration calculated on the basis of empirical equations, which were divided into three categories: mass-transfer, radiation-based and temperature-based methods. The best equations from each category were then selected and compared based on the FAO and Penman–Monteith equations [20].

The estimation of reference evapotranspiration was addressed in the study [21], where the temperature-based Penman–Monteith equation achieved the best rating because it preserves the physical philosophy of the Penman–Monteith method. The method was applied at a global scale using the Köppen climate classification system with respect to the world dataset under different climate conditions. Calculation of reference evapotranspiration based on indirect methods can provide acceptable results when direct measurements are not available [15].

Since most of the empirical formulas are tied to a geographical location, the empirical calculation of evapotranspiration is not the same for different regions, due to the different climatic conditions [17]. National standards, legislation and expertise also play a role, so that different methods are preferred in different countries, e.g., Makkink's method in the Netherlands [22], Budyko's method in Slovakia [23] and the Delibaltov–Hristov–Tsonev method in Bulgaria [24].

The Penman–Monteith method is considered the sole standard for calculating reference evapotranspiration. The inputs to the equation are climatic data, solar radiation, air temperature, humidity and wind speed. It allows the calculation of evapotranspiration at different times of the year and in different regions, yet a precise measurement at a given location can easily replace the simplified Penman–Monteith equation [17].

Other methods of calculating evapotranspiration include the use of empirical relationships, e.g., the relationship between observed evaporation from evaporation stations and meteorological quantities. These relationships can be calculated either linearly or nonlinearly [25,26] using machine learning algorithms [27,28], linear regression or random forest regression.

The assessment of long-term climate variables can be based on time series. A time series is a sequence of measurements recorded over time that can be analysed using, e.g., Least-Squares Spectral Analysis, Least-Squares Wavelet Analysis or Least-Squares Cross Wavelet Analysis [29].

Other methods for evaluation may include parametric and non-parametric trend tests, which are used in machine learning [30,31]. Parametric methods (logistic regression, linear discriminant analysis and simple neural networks) use a fixed number of parameters to build models and require fewer variables, but the result may be affected by outliers. Non-parametric methods (the Mann–Kendall test, Spearman's rho and k-nearest neighbors) use a flexible number of parameters, both variables and attributes can be used in the models, and the result is not affected by outliers.

In this paper, we explore the relationships for the calculation of evaporation from water surface in the Czech Republic using reanalyzed climate data and the constructed linear models (LM) and random forest models (RFM) for the calculation of evaporation. Evaporation estimated from the derived models was compared with observed evaporation from evaporation stations. Finally, the derived relationships were applied to the selected water reservoirs.

Specifically, we aim to answer the following questions: Which statistical method achieves better results for calculating evaporation, linear regression or random forest regression? How many variables are important for determining the formula for calculating evaporation? How important is the geomorphological information (elevation and location) for calculating evaporation using linear and non-linear models? The main objective of the estimation of evaporation from the water surface was to derive a universal relationship for the whole territory of the Czech Republic.

This paper is structured as follows: Section 2 introduces the area of interest and input data. The statistical methods for evaluating evaporation with respect to goodness-of-fit (GOF), implemented in the R environment [32], are described in Section 3. The results and discussion are in Section 4, along with a detailed evaluation of the goodness-of-fit (GOF) of the regression for evaporation stations and subsequently for water reservoirs. The paper is concluded in Section 5.

#### **2. Study Area and Data**

The study area is defined by the state border of the Czech Republic. Within the region (51°03′ N to 48°33′ N latitude and 12°05′ E to 18°51′ E longitude), the long-term (1981–2010) mean annual precipitation total is 709.5 mm, the mean annual air temperature is 7.9 °C and the mean runoff is 205.5 mm [33]; the long-term runoff coefficient is thus 0.29 (29% of the precipitation total runs off).

Figure 1 shows the long-term trends in temperature and evaporation at the Hlasivo evaporation station. The Hlasivo station provides a consistent time series of 58 years; the evaporation values are measured by a 20 [m<sup>2</sup>] benchmark evaporator. Other observed variables are: air temperature at 2 m [°C], water surface temperature in the evaporimeter [°C], relative humidity [%], global solar radiation [W·m<sup>−2</sup>] and wind speed at 2 m [m·s<sup>−1</sup>] [34].

Figure 2 shows the selected 24 evaporation stations and 33 water reservoirs. The evaporation stations were assigned to water reservoirs based on the Quitt classification and the elevation [35]. The elevation differences between the evaporation stations and water reservoirs do not exceed 100 m. The Quitt classification divides the Czech Republic into three climatic regions (cold, moderately warm and warm), and a reservoir was always assigned an evaporation station in the same climatic region. The observed evaporation from the evaporation stations was recorded between 1957 and 2019 (most stations have records from 2005 onwards).

**Figure 1.** Mean monthly temperature (green lines), water temperature (blue line) and evaporation (red line) at Hlasivo evaporation station for period 1957–2019.

**Figure 2.** Study area: Czech Republic with climatic regions [35], blue color: water reservoirs with altitude of dam and red color: evaporation stations with altitude.

The data from the evaporimeter (EWM) were provided by the Czech Hydrometeorological Institute and Palivový kombinát Ústí, state-owned enterprise. The T. G. Masaryk Water Research Institute, public research institution (TGM WRI, p.r.i.) provided data from the floating evaporator and data from the evaporation station Hlasivo.

Observed data from evaporation stations were aggregated to a monthly step, because the measured daily values are affected by random error [36], and were then used to evaluate evaporation from the water reservoir surface. The observed evaporation (May–October) ranges from 459 [mm·year<sup>−1</sup>] (Pec pod Sněžkou) to 760 [mm·year<sup>−1</sup>] (Holešov), with a mean over the evaporation stations of 627 [mm·year<sup>−1</sup>]. The minimum mean daily rate (1.38 [mm·day<sup>−1</sup>]) occurs in October and the maximum mean daily rate (4.53 [mm·day<sup>−1</sup>]) in July, with the highest values in June 2017 (5 stations exceeded 6.5 [mm·day<sup>−1</sup>]).

The relationships for calculating evaporation from the water surface were developed using linear and nonlinear regression. Measured evaporation from evaporation stations serves as the dependent variable. ERA5-Land climate reanalysis data from 1981 to 2019 were used for the independent variables.

#### *Climate Reanalysis*

The purpose of the reanalysis is to provide an estimate of quantities describing atmospheric, climatic and hydropedological processes and behavior of oceans with global coverage and relatively high spatiotemporal resolution.

The reanalyses are outputs of various models, usually including a hydrological, atmospheric and ocean model and a model of the Earth's surface. Their advantage is that they provide multidimensional, spatially complete and coherent information about the global circulation and hydroclimatic quantities. Climate reanalyses are generated in a similar manner to numerical weather forecasts, in which prediction models evolve the climate system from an initial state to predict the future state of the atmosphere. The initial state of the climate is a key input into the forecast and determines the future development of the model simulation. Data assimilation is used to estimate the initial state that best matches the available data, while taking model errors into account. A climate reanalysis is produced with a single version of the data assimilation system, including the prediction model used [37].

The reanalysis uses a combination of modeled data and observed data with emphasis on the laws of physics. The data are stored in the ECMWF archive and copied to the COPERNICUS Climate Data Store archive, from where they are freely downloadable using the CDS catalog or the CDS API application in the GRIB or NetCDF format.

The data were downloaded in NetCDF, which is a common format in drought or flood forecasting [38]. The spatial resolution is 0.1° × 0.1°, which corresponds to a grid of approximately 9 km × 9 km.

The data set selected to calculate evaporation from water reservoirs consisted of 2 m temperature [K], skin temperature [K], 2 m dew-point temperature [K], 10 m v-component of wind [m·s<sup>−1</sup>], surface pressure [Pa] and surface net solar radiation [J·m<sup>−2</sup>]. Temperatures were converted from [K] to [°C], and energy from [J·m<sup>−2</sup>] to [W·m<sup>−2</sup>] by dividing the values by the accumulation time expressed in seconds. Relative humidity [%] was calculated from dew point and temperature using the August–Roche–Magnus approximation [39].
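The conversions described above can be sketched in a few lines of Python. This is an illustrative sketch only: the study itself worked in R, the function names are hypothetical, the Magnus coefficients (17.625 and 243.04 °C) are one commonly used parameter set and are not stated in the paper, and the one-day accumulation time is an assumption.

```python
import math

def kelvin_to_celsius(t_k):
    """Convert a temperature from kelvin to degrees Celsius."""
    return t_k - 273.15

def accumulated_to_flux(energy_j_m2, accumulation_s):
    """Convert accumulated energy [J m-2] to a mean flux [W m-2]."""
    return energy_j_m2 / accumulation_s

def relative_humidity(t_c, td_c, a=17.625, b=243.04):
    """Relative humidity [%] from air and dew-point temperature [deg C].

    Magnus-type approximation; the coefficients a and b are an assumption
    (a commonly used parameter set), not values taken from the paper.
    """
    return 100.0 * math.exp(a * td_c / (b + td_c)) / math.exp(a * t_c / (b + t_c))

# Hypothetical grid-cell values, for illustration only
t = kelvin_to_celsius(293.15)               # 2 m temperature
td = kelvin_to_celsius(283.15)              # 2 m dew-point temperature
rad = accumulated_to_flux(1.728e7, 86400)   # one-day accumulation assumed
rh = relative_humidity(t, td)
print(round(t, 2), round(td, 2), round(rad, 1), round(rh, 1))
```

When the dew point equals the air temperature the function returns 100%, which is a quick sanity check on the formula.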

In the final dataset preparation, evaporation data from evaporation stations and geomorphological variables (elevation, latitude and longitude) were added to the reanalyzed data.

#### **3. Methods**

Statistical methods of linear and non-linear regression (random forest regression) were used to evaluate evaporation from the water reservoirs. In this case, the main objective of the regression is to determine the best fit between the observed values from the evaporation stations and the variables from the ERA5-Land project. The resulting linear and nonlinear models were evaluated based on cross validation and goodness-of-fit (GOF): mean absolute error (MAE), root mean squared error (RMSE), coefficient of determination (R<sup>2</sup>) and relative error (RERR). This section describes how the linear and non-linear models were built and evaluated.

#### *3.1. Linear Regression*

Linear regression attempts to explain the values of a dependent variable through other quantities. In our case, the dependent variable (evaporation value or evaporation rate from evaporimeter stations and evaporimeters EWM) was explained using other variables (air temperature, surface temperature, wind, surface net solar radiation, dew point, pressure, latitude and altitude, evaporation type distribution). In total, 18 linear models were created: 8 models by manual sequential testing and 10 models on the basis of stepwise regression.

The first set of models (built manually) was evaluated based on the Akaike Information Criterion (AIC) [40], and the QQ plot was used for visual diagnostics [41]. The value of AIC is the sum of two terms: the first is proportional to the logarithm of the residual sum of squares, the second is proportional to the complexity of the model (the number of its terms). When building the LM models, adding more independent variables often reduces the residual sum of squares (improves the fit of the model to the observed data); however, this can result in an overfitted LM. The part of the AIC that penalizes the complexity of the model should prevent overfitting. The QQ plot of residuals can help when verifying the assumptions of the model (normality of residuals). In the QQ plot of residuals, two sets of quantiles are plotted against each other: the theoretical quantiles of the distribution and the quantiles of the actual residuals of the model.
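The trade-off between fit and complexity that AIC expresses can be illustrated with a minimal Python sketch. It uses the common least-squares form n·ln(RSS/n) + 2k, which equals the Gaussian AIC only up to an additive constant (R's `AIC()` also counts the error variance as a parameter); the residual values are hypothetical and are not taken from the study.

```python
import math

def aic_least_squares(residuals, n_coefficients):
    """AIC (up to an additive constant) of a least-squares fit:
    n * ln(RSS / n) + 2 * k, where k is the number of fitted coefficients."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * n_coefficients

# Hypothetical residuals of two nested models fitted to the same six cases
res_small = [0.9, -1.1, 1.0, -0.8, 1.2, -1.0]   # simpler model, 2 coefficients
res_large = [0.9, -1.0, 1.0, -0.8, 1.1, -1.0]   # one extra coefficient

aic_small = aic_least_squares(res_small, 2)
aic_large = aic_least_squares(res_large, 3)
print(aic_small < aic_large)
```

In this made-up case the larger model reduces the residual sum of squares only slightly, so the complexity penalty outweighs the gain and the simpler model has the lower AIC.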

The second part of the linear models was developed using stepwise regression. The R-packages caret, leaps and MASS [42] were used for this regression. The R-package caret provides a machine learning framework, and the R-package leaps is used to compute the stepwise regression. The R-package caret has a function *train*(), which allows a sequential selection of predictors for linear regression.


In this work, a method with backward selection was selected. The hyperparameter nvmax corresponds to the maximum number of predictors that are included in the model. In this work, 11 predictors were used. Furthermore, it is also possible to set the parameters of the validation method, in this work it was cross validation with 500 iterations.
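The idea of backward selection can be sketched in plain Python. This is a simplified stand-in, not the caret/leaps implementation used in the study: it starts from all predictors and repeatedly drops the one whose removal increases the residual sum of squares the least, down to a fixed number `n_keep` (a hypothetical parameter; in caret the number of predictors up to nvmax is tuned by cross validation). The data are synthetic.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def rss(rows, y, cols):
    """Residual sum of squares of an ordinary least-squares fit (with intercept)
    that uses only the predictor columns listed in cols."""
    x = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bk * rk for bk, rk in zip(beta, r))) ** 2
               for r, yi in zip(x, y))

def backward_selection(rows, y, n_keep):
    """Drop, one at a time, the predictor whose removal hurts the fit least."""
    cols = list(range(len(rows[0])))
    while len(cols) > n_keep:
        drop = min(cols, key=lambda c: rss(rows, y, [k for k in cols if k != c]))
        cols.remove(drop)
    return cols

# Synthetic data: the response depends on columns 0 and 2 only
rows = [(1, 5, 2), (2, 3, 1), (3, 8, 4), (4, 1, 3), (5, 7, 5), (6, 2, 2)]
y = [2 * a + 3 * c for a, _, c in rows]
kept = backward_selection(rows, y, n_keep=2)
print(kept)
```

Because the synthetic response ignores the middle column, that column is the one eliminated; with real, noisy data the choice of how many predictors to keep would be made by cross validation rather than fixed in advance.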

#### *3.2. Random Forest Regression*

Random forest (RF) is an ensemble learning method for classification and regression that builds multiple decision trees during learning. For classification it outputs the mode (most frequent value) of the classes returned by the individual trees; for regression, the resulting regression function is defined as a weighted average of the regression functions of the individual trees. Regression forests belong to the so-called committee or ensemble methods, the main idea of which is to combine several separate models into a single ensemble. Thus, the method relies on a so-called collective decision [26,43]. A random forest consists of a set of trees *T*<sub>1</sub>, …, *T*<sub>N</sub> whose classification or regression functions can be expressed as follows:

$$h(\mathbf{X}, \mathbf{O}_1), \dots, h(\mathbf{X}, \mathbf{O}_N), \tag{1}$$

where *h* is a function, **X** is a predictor and **O**<sub>1</sub>, …, **O**<sub>N</sub> are independent, identically distributed random vectors. The random forest method uses binary trees of the CART type [44]. As in the construction of individual trees or other calibrations, the data are split into training and test sets. The R-package randomForest [27] was used in this work.

Random forest is an approach to building predictive models for both classification and regression tasks. It combines poorer-performing baseline models to obtain a better predictive model. Due to their simple nature, few assumptions and high performance, RF models have been widely used in machine learning. The term "forest" refers to a set of decision trees that are themselves "weak" learners. A stand-alone regression tree does not have the same predictive power as a regression forest: because a single tree splits on a single criterion at each node, it is very sensitive to changes in the data. RF models rank variables by their importance in order to achieve the best RF model [45].
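The principle of averaging many weak trees can be shown with a toy Python sketch. It is deliberately reduced: each "tree" is a depth-1 regression stump on a single predictor, fitted to a bootstrap resample (bagging), and the ensemble prediction is the plain average of the stumps. It therefore omits the random feature subsets and the deeper CART trees of the randomForest package used in the study; all names and data are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold on a single predictor,
    chosen to minimise the squared error around the two leaf means."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:              # resample contained a single distinct x value
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]

def predict_stump(stump, x):
    t, ml, mr = stump
    return ml if x <= t else mr

def fit_forest(xs, ys, n_trees=50, seed=0):
    """Fit each stump on a bootstrap resample of the data (bagging)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Ensemble prediction: the average of the individual tree predictions."""
    return mean([predict_stump(s, x) for s in forest])

# Hypothetical temperature [deg C] and evaporation values
temp = [8, 10, 12, 14, 16, 18, 20, 22]
evap = [1.2, 1.5, 1.9, 2.6, 3.1, 3.8, 4.3, 4.6]
forest = fit_forest(temp, evap)
low, high = predict_forest(forest, 9), predict_forest(forest, 21)
print(low < high)
```

Each stump alone predicts only two distinct values, but because the bootstrap resamples place the split at different thresholds, the averaged prediction varies more smoothly with the predictor.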

#### *3.3. Evaluation of Regression*

Cross validation is used to improve the quality of regression models [46]. Depending on the method chosen, cross validation is divided into k-fold cross validation and leave-one-out validation. In our experiment, leave-one-out validation was selected. The dataset consisted of the selected stations. In each iteration, the data of one station were withheld as the test data and the data of the remaining stations formed the training data; with a total of 24 stations this resulted in 24 iterations. Goodness-of-fit (GOF) criteria were used for further evaluation.
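A leave-one-station-out split of this kind can be sketched as a small Python generator. The record layout, station labels and values below are hypothetical and serve only to show the mechanics; the study performed the split in R.

```python
def leave_one_station_out(records):
    """Yield (held_out_station, train, test) once for every station.

    records is a list of dicts that each carry at least a 'station' key.
    """
    stations = sorted({r["station"] for r in records})
    for held_out in stations:
        train = [r for r in records if r["station"] != held_out]
        test = [r for r in records if r["station"] == held_out]
        yield held_out, train, test

# Hypothetical monthly records from three stations
records = [
    {"station": "A", "month": 5, "evap": 2.1},
    {"station": "A", "month": 6, "evap": 3.0},
    {"station": "B", "month": 5, "evap": 1.8},
    {"station": "C", "month": 5, "evap": 2.4},
    {"station": "C", "month": 6, "evap": 3.3},
]
folds = list(leave_one_station_out(records))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Splitting by station rather than by individual record keeps all months of the withheld station out of the training data, so each fold tests how well a model transfers to a location it has not seen.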

#### *3.4. Evaluation of Regression by Goodness-of-Fit (GOF)*

The sets of linear regression and random forest regression models were evaluated based on their GOF (R<sup>2</sup> [47], RMSE [48], MAE [48] and RERR [49]). The aim is to identify the model that is most suitable for the calculation of evaporation in the Czech Republic.

(i) The R<sup>2</sup> is given by:

$$R^2 = 1 - \frac{RSS}{TSS}, \tag{2}$$

where *RSS* is the residual sum of squares and *TSS* is the total sum of squares, computed from the predicted evaporation values *Ep* and the tested data of cross validation *Et*.

(ii) The RMSE is given by:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Ep_i - Et_i)^2}, \tag{3}$$

where *Ep<sub>i</sub>* is the predicted evaporation value for the *i*-th case, *Et<sub>i</sub>* is the corresponding tested value from cross validation and *n* is the total number of simulated values.

(iii) The MAE is given by:

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |Ep_i - Et_i|. \tag{4}$$

The mean absolute error (MAE) is calculated as the average of the absolute differences between the predicted evaporation values *Ep<sub>i</sub>* and the tested data from cross validation *Et<sub>i</sub>*.

(iv) The RERR is given by:

$$\delta = \frac{|Ep - Et|}{Et}. \tag{5}$$

The relative error is the ratio of the absolute error between the predicted evaporation value *Ep* and the tested value *Et* to the tested value *Et*. It is a dimensionless quantity and can be given in percentages; because of the absolute value in Equation (5), it is non-negative for positive *Et*. Relative error can be used to compare quantities with different dimensions.
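The four criteria can be computed together in a short Python function. This is an illustrative sketch: it takes the total sum of squares about the mean of the tested data and averages the relative error of Equation (5) over all cases, both of which are assumptions about details the text does not spell out. The input values are hypothetical.

```python
import math

def gof(ep, et):
    """Goodness-of-fit of predicted values ep against tested values et."""
    n = len(et)
    mean_et = sum(et) / n
    rss = sum((p - t) ** 2 for p, t in zip(ep, et))   # residual sum of squares
    tss = sum((t - mean_et) ** 2 for t in et)         # total sum of squares
    return {
        "R2": 1 - rss / tss,
        "RMSE": math.sqrt(rss / n),
        "MAE": sum(abs(p - t) for p, t in zip(ep, et)) / n,
        "RERR": sum(abs(p - t) / t for p, t in zip(ep, et)) / n,
    }

# Hypothetical predicted and tested evaporation values
result = gof([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print({k: round(v, 3) for k, v in result.items()})
```

For R<sup>2</sup> a higher value indicates a better fit, whereas for RMSE, MAE and RERR lower values are better, which matters when the criteria are later combined into a ranking.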

#### *3.5. Final Evaluation of Regression Models*

The last step of the evaluation was to create a scoring matrix and consecutively remove the models from the end (order of removal was from the worst models to the best). In order for the removal to occur, the individual models had to be ranked (from best to worst) or standardized using a GOF. Based on this procedure, the final evaluation was performed.
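One way to read the scoring-matrix procedure is sketched below in Python. It is an interpretation, not the authors' code: models are ranked per criterion (rank 1 is best), the ranks are summed, and the model with the worst total is removed until the requested number remains. Re-ranking after every removal and the tie handling are assumptions (R's `rank()` averages tied ranks by default), and the model names and GOF values are hypothetical.

```python
def rank_values(values, higher_is_better=False):
    """Rank 1 = best. Tied values share the rank of their first position."""
    ordered = sorted(values, reverse=higher_is_better)
    return [ordered.index(v) + 1 for v in values]

def select_models(scores, n_keep):
    """scores maps a model name to a dict of GOF values.

    Ranks are summed over all criteria and the worst model is removed,
    repeatedly, until n_keep models remain.
    """
    names = list(scores)
    higher_better = {"R2"}        # for all other criteria lower is better
    while len(names) > n_keep:
        total = {m: 0 for m in names}
        for crit in scores[names[0]]:
            ranks = rank_values([scores[m][crit] for m in names],
                                crit in higher_better)
            for m, r in zip(names, ranks):
                total[m] += r
        names.remove(max(names, key=lambda m: total[m]))
    return names

# Hypothetical scoring matrix
scores = {
    "M1": {"R2": 0.80, "RMSE": 0.55, "MAE": 0.40},
    "M2": {"R2": 0.70, "RMSE": 0.70, "MAE": 0.55},
    "M3": {"R2": 0.85, "RMSE": 0.50, "MAE": 0.38},
    "M4": {"R2": 0.60, "RMSE": 0.90, "MAE": 0.70},
}
selected = select_models(scores, n_keep=2)
print(selected)
```

Summing ranks rather than raw values avoids having to standardize criteria that are measured on different scales, which is the alternative the text mentions.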

#### **4. Results and Discussion**

In this section, a detailed evaluation of linear and random forest regression with respect to GOF (R<sup>2</sup>, RMSE, MAE and RERR) is presented. After evaluating all GOFs, RMSE was selected as the deciding criterion. Then, the best evaporation formulas are selected from the groups of linear models (LM) and random forest models (RFM). The selected models were used to calculate evaporation from the water reservoirs.

#### *4.1. Evaluation of Regression Models*

Regression models LM and RFM were evaluated by cross-validation, following the leave-one-out procedure described in Section 3.3.


The models were evaluated and compared using GOF (see Figure 3). The results show that RF models can fit the data better than LM models. RF models are more consistent than LM models for all criterion functions. It can also be seen from the graph and results that for some stations the models do not achieve a good fit.

Outliers (the worst 10% of GOF values) are present in all LM models and also occur in RF models, but on a smaller scale. The outliers corresponded to 70% of the maximum value, which was therefore set as the limit value for the selected GOF. Table 1 shows the evaporation stations that exceeded the limit values for the selected GOF.


**Table 1.** Evaluation of evaporation stations based on GOF.

**Figure 3.** Model evaluation using GOF (R<sup>2</sup>, RMSE, MAE, RERR). The lines in the plot represent the LM and RFM models. Part (**a**) linear regression models (18 models) is divided into two parts: orange line: models created manually (8 models), grey part: stepwise regression was used (10 models). Part (**b**) random forest regression models (15 models).

By the method of sorting, using the R function rank(), 3 linear and 3 random forest models (RFM) were selected. The selected regression models with average values of GOF are presented in Table 2.


**Table 2.** Average values of goodness-of-fit of selected models.

The top 3 linear models according to all criterion functions are LM1, LM7 and LM8, and the top 3 RFM are RFM4, RFM5 and RFM15. The selected models are shown in Figure 4, where the green lines represent linear models and the blue lines represent random forest models. The average value of RMSE for the selected linear models is 0.57 and the minimum value is 0.22. The selected RFM had an average RMSE value of 0.51 and a minimum value of 0.18. The models built manually on the basis of data analysis achieved better results than the models designed using stepwise regression; however, some stepwise models achieved good results in some cases, with less demanding inputs. The linear models were therefore supplemented with LM12, which also showed good results and whose derived equation is more useful in practice due to its simplicity. All regression models are presented in Table A1 in Appendix A.

**Figure 4.** Evaluation of models using GOF, where the best models are selected. Part (**a**) linear regression models, green lines are LM1, LM7 and LM8. Part (**b**) random forest regression models, blue lines are RFM4, RFM5 and RFM15.

Selected regression formulas based on linear models:

$$LM1 = 45.84 + (0.173T \cdot (-0.004R))^{0.0008} - 0.183D - 0.0002P - 0.0002 \text{asl} - 0.475Y + 0.063X,\tag{6}$$

$$LM7 = 16.97 + 0.082N + \left(0.235T \cdot (-0.263D)\right)^{0.007} + 0.008R - 0.0003 \text{asl} - 0.368Y + 0.063X,\tag{7}$$

$$LM8 = 17.33 + 0.055X - 0.367Y - 0.0003 \text{asl} + (0.2134T \cdot (-0.277D))^{0.009} + 0.008R,\tag{8}$$

$$LM12 = 19.82 + 0.302ST + 0.006R - 0.170D - 0.419Y. \tag{9}$$
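Equation (9) is simple enough to be evaluated directly, as the following Python sketch shows. The coefficients are those of Equation (9); the function name, the input values and the assumption that latitude *Y* is given in decimal degrees are illustrative and not confirmed by the text.

```python
def lm12(st, r, d, y):
    """Evaporation estimate from Equation (9).

    st: surface temperature [deg C]
    r:  surface net solar radiation [W m-2]
    d:  dew point [deg C]
    y:  latitude (decimal degrees assumed)
    """
    return 19.82 + 0.302 * st + 0.006 * r - 0.170 * d - 0.419 * y

# Hypothetical summer conditions at a hypothetical location
estimate = lm12(st=20.0, r=200.0, d=12.0, y=50.0)
print(round(estimate, 2))
```

The negative latitude coefficient means that, with all other inputs equal, the estimate decreases towards the north of the study area.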

Selected variables for best random forest regression models:

$$RFM4: W, (T \cdot D), R, \text{asl}, Y, X, \tag{10}$$

$$RFM5: X, Y, \text{asl}, (T \cdot D), R, \tag{11}$$

$$RFM15: W, T, ST, R, D, P, H, \text{asl}, Y, X, \tag{12}$$

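where *T* is the 2 m air temperature [°C], *ST* the surface temperature [°C], *W* the wind speed [m·s<sup>−1</sup>], *R* the surface net solar radiation [W·m<sup>−2</sup>], *D* the dew point [°C], *H* the relative humidity [%], *P* the surface pressure [Pa], asl the elevation above sea level [m], *X* the longitude and *Y* the latitude (see Table A1).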


#### *4.2. Model Application to Water Reservoirs*

For testing, the best LM models (LM1, LM7, LM8 and LM12) and RF models (RFM4, RFM5 and RFM15) described above were applied to selected reservoirs in the Czech Republic for the period May–October. The May–October period was selected because evaporation is not measured in the winter months due to freezing. The calculated evaporation values for the water reservoirs are given in Table A2 in Appendix A.

The differences and seasonality in evaporation between the water reservoirs are shown in Figure 5, where green lines represent linear models (LM), blue lines random forest models (RFM) and red lines observed data. The average across all data is represented by the bold lines. The mean value for the reservoirs for May–October over the period (1981–2020) is 546.54 [mm·year<sup>−1</sup>] from the LM models and 546.02 [mm·year<sup>−1</sup>] from the RFM models. The mean value of the evaporation stations (2005–2019) is 497.26 [mm·year<sup>−1</sup>]. The highest increase in evaporation is observed in July; however, a significant increase in evaporation can be observed for all models throughout the summer months (June–August).

**Figure 5.** Monthly evaporation for the water reservoirs throughout 1981–2019 (green lines represent linear regression, blue lines represent random forest regression and red lines observed data) and from evaporation stations throughout 2005–2019. Bold lines represent means for water reservoirs and evaporation stations.

The top models LM1 and RFM12 are compared with elevation for all water reservoirs. Figure 6 shows the relationship between elevation and evaporation, where the green line represents the linear regression model and the blue line represents the random forest model. The elevation of the water reservoirs ranges from 170.54 to 781.91 m a.s.l. Evaporation decreases with elevation above sea level. Both models are influenced by local conditions because both take geographic coordinates and elevation as inputs.

**Figure 6.** The lines (a) and (b) represent the relationship between yearly total evaporation and elevation based on the derived formulas. The points represent yearly evaporation and altitude for both models.

The results of the study will be implemented in the hydrological model Bilan [50,51] and used in climate change assessment studies in the Czech Republic [52].

#### **5. Concluding Remarks**

The main objective of the estimation of evaporation from the water reservoirs was to derive a universal relationship for the whole territory of the Czech Republic.

The estimation of evaporation from water reservoirs is complicated because a large number of water reservoirs do not have observed evaporation data. In this work, Quitt's climate classification was used to assign an evaporimeter station that is not near a reservoir to a given reservoir based on climate region and elevation. Within the Czech Republic, the evaporation value from water reservoirs is determined on the basis of a handling order, which is established according to a Czech technical standard that is based on old climatic data and does not deal with climate change. For this reason, the determination of evaporation from water reservoirs is based on estimation using statistical methods rather than exact measurement.

The ERA5-Land climate reanalysis data were used for the derivation and were chosen for their comprehensiveness, availability, high spatial resolution, long time series and ease of handling. Relative humidity was calculated using the August–Roche–Magnus approximation and included in the data. The climate reanalysis data were exported for stations and water reservoirs.

The derivation of the relationship for evaporation was based on the multiple linear regression method, where the values of the dependent variable (evaporation) were sought based on two or more variables (predictors: air temperature, surface temperature, wind speed, surface net solar radiation, dew point, surface pressure, altitude, latitude, longitude and calculated humidity). The models were constructed (i) manually, where the evaluation was done using the AIC parameter and the quantile–quantile (QQ) plot was used for visual diagnostics; this method was time consuming; and (ii) using stepwise regression, where the predictors are entered sequentially and models with one up to a selected number X of variables were generated; this method is not time consuming. Random forest regression was used to account for non-linear relationships. Linear and random forest regression models were cross-validated and evaluated using criterion functions (R<sup>2</sup>, RMSE, MAE and RERR). Finally, 3(+1) LM models and 3 RF models were selected. The models contained a large number of independent variables (6–7), possibly leading to model overfitting; therefore another model was selected which performed best for the RMSE criterion function, is based on only 4 independent variables and is thus more user friendly.

It turned out that geomorphological information (elevation, location) appeared more in the manually derived models as opposed to models constructed using the stepwise regression method. When comparing linear models (LM) and random forest models (RFM), LM was found to have much more variability in the outcome compared to the RFM. The advantage of RFM is their adaptability, but the subsequent interpretation of the results can be a problem. This has been shown in the design of LM and RFM as well as when applying the proposed models to water reservoirs.

Evaporation values for the period 1981–2019 were calculated for the selected water reservoirs and selected formulas based on ERA5-Land climate reanalysis data.

For the evaluation of evaporation, models from both the LM and RFM groups were used. Among the best models evaluated by linear regression, model LM1 from the manual linear regression group and model LM12 from the stepwise regression group were used. Model LM1 was selected as the best model with six predictors. The LM1 model can be replaced by the alternative model LM12, which also performed satisfactorily with four predictors.

**Author Contributions:** Conceptualization, A.V., E.M. and M.H.; methodology, A.V., E.M. and M.H.; validation, A.V., E.M., M.H., P.P. and P.Š.; formal analysis, A.V., E.M., M.H. and P.P.; investigation, A.V.; resources, A.V.; data curation, A.V., E.M. and M.H.; writing—original draft preparation, A.V., P.P. and P.Š.; writing—review and editing, A.V., E.M., M.H. and P.P.; visualization, A.V., E.M., M.H. and P.P.; supervision, A.V. and M.H; project administration, P.Š.; funding acquisition, A.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by T. G. Masaryk Water Research Institute, public research institution (VÚV TGM, v.v.i.) grant number 3600.52.26.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data underlying the analyses in this paper are available from the lead authors E.M., A.V., M.H. and P.P.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:



#### **Appendix A**

**Table A1.** Review of linear models and random forest models, where evaporation E [mm·month<sup>−1</sup>], temperature T (2 m) [°C], surface temperature ST [°C], wind speed W [m·s<sup>−1</sup>], surface net solar radiation R [W·m<sup>−2</sup>], dew point D [°C], relative humidity H [%], surface pressure P [Pa], elevation above sea level asl [m], longitude X, latitude Y.



**Table A2.** Average yearly evaporation [mm·year<sup>−1</sup>] values for selected reservoirs by model.

#### **References**

