*2.2. Validation Sites*

Year 2017 measurements from eleven eddy covariance (EC) sites were used in this study, covering a wide range of land cover types and climates. Sites are summarised in Table 1 and data used in validation included the four components of net radiation *Rn*(shortwave/longwave downwelling/upwelling), soil heat flux *G*, sensible heat flux *H*, and latent heat flux *λE*. In addition, friction velocity, Monin-Obukhov length, and wind direction data from the EC system was used to estimate the satellite pixel footprint contribution [61,62] to the turbulent fluxes at the satellite overpass. Validation sites comprise 5 agricultural sites, both irrigated and rainfed, including row crops (e.g., Sierra Loma vineyard) and an olive grove (Taous). In addition, two sites over grassland, one humid meadow (Grillenburg) and a semi-arid steppe (Kendall Grassland), one in conifer (Hyltemossa) and one in broadleaved forests (Harvard Forest) are also included in the validation list. Finally, complex heterogeneous landscapes are represented by two wooded savannas (Dahra and Majadas de Tiétar). From all these sites, 3 are in Mediterranean climate, and two more in semi-arid climates, whereas the rest of the sites are located in temperate climates.

**Table 1.** Description of eddy covariance sites used for validation. Sites are listed in alphabetic order. Z shows the EC measurement height in meters, while the contact person for the EC tower is credited in PI column.


q Error metrics included mean bias error (∑ (*Obs*. − *Pred*.)/*N*), root mean squared error (*RMSE* = ∑ (*Obs*. − *Pred*.) 2 /*N*), relative RMSE (*RMSE*/*Obs*), and Pearson correlation coefficient between observed and predicted. Due to the lack of energy closure in the eddy covariance data, unless otherwise stated all metrics were computed after adding the energy balance residual (residual = *Rn*,*EC* − *GEC* − *λEEC* − *HEC*) to the latent heat flux, taking the assumption that errors in measurements of *λE* are larger than errors in the measurements of *H* [68].

#### *2.3. Input Data Sources*

The input data required to run the evapotranspiration models came from three main and two ancillary sources. The main sources were: Sentinel-2 shortwave observations, Sentinel-3 thermal observations and European Center for Medium-range Weather Forecasts (ECMWF) ERA-5 meteorological reanalysis data. The ancillary sources were: a digital elevation model (DEM) from the Shuttle Radar Topography Mission (SRTM) satellite, and land cover map created as part of the ESA Climate Change Initiative (CCI).

#### 2.3.1. Satellite Data

The main satellite data inputs come from the Sentinel-2 (both A and B) and Sentinel-3 (A only) satellites. Sentinel-2 and Sentinel-3 were chosen as the primary satellite data sources for this study for a number of reasons. Firstly, as mentioned previously, when treated as a constellation those satellites are able to satisfy the need for temporally dense observations at high spatial resolution and with good radiometric accuracies. Secondly, they are part of the Copernicus programme, meaning that there is a guaranteed long-term continuity of data into the future. Thirdly, the data from those satellites is free and open and will remain so, again due to being part of the Copernicus programme.

High-resolution shortwave observations needed to characterise the state of vegetation in the evapotranspiration model as well as to sharpen TIR data were obtained by the MultiSpectral Instrument (MSI) on board of the Sentinel-2A & 2B satellites. MSI acquires reflectance information in 13 spectral bands (with central wavelength ranging from 444 nm to 2202 nm) with a spatial resolution of 10 m, 20 m, or 60 m (depending the spectral band) and global geometric revisit of at least 5 days when both satellites are considered [6]. The MSI sensor has 3 spectral bands in the leaf-pigment sensitive red-edge part of the electromagnetic spectrum and two bands in water-content sensitive shortwave-infrared part of the spectrum, in addition to the more traditional visible and near-infrared bands, which makes it well suited for vegetation mapping and monitoring [69]. For each of the validation sites, all Sentinel-2 images for year 2017 were visually scanned and the ones which were cloud, fog and shadow free in the closest vicinity of the flux towers locations were selected for processing.

L1C top of the atmosphere images were converted to bottom-of-atmosphere (BOA) reflectances (L2A) using the Sen2Cor atmospheric correction processor [70] v2.5.5. BOA reflectance values were then used as input to the Biophysical Processor [71] available in the SNAP software v6.0.1 (step.esa.int—last accessed 28 November 2018) in order to obtain effective values of green Leaf Area Index (LAI), Fractional Vegetation Cover (FVC), Fraction of Absorbed Photosynthetically Active Radiation (FAPAR), Canopy Chlorophyll Content (CCC) and Canopy Water Content (CWC). The outputs of the SNAP Biophysical Processor have been validated in a number of studies, with good performance in herbaceous crops [72,73] but overestimation of LAI in bare-soil cases and underestimation of LAI in forests [74]. Those inaccuracies could have an impact on the results of this study in semi-arid and forested areas.

The fraction of vegetation which is green and transpiring (*fg*) was estimated based on Fisher et al. [75] (Equation (7)):

$$f\_{\mathcal{J}} = \text{FAPAR} / \text{FIPAR} \tag{7}$$

where FIPAR is the fraction of photosynthetically active radiation intercepted by green and brown vegetation. FAPAR was obtained from the biophysical processor as described above, while FIPAR was derived iteratively from Equation (8) of Campbell and Norman [51]:

$$FIPAR = 1 - \exp\frac{-0.5PAI}{\cos\theta} \tag{8}$$

where *θ* is the solar zenith angle at the time of the S2 overpass, and PAI is the plant area index with initial PAI equal to LAI and in subsequent iterations

$$\text{PAI} = \text{LAI} / f\_{\mathcal{X}} \tag{9}$$

until *f<sup>g</sup>* converges. Two assumptions made in Equation (8) are that all intercepted PAR comes from the solar beams, and that both FAPAR and FIPAR are computed from a canopy with a spherical leaf inclination distribution. Indeed, from the the average leaf angle histogram, from which the training database was built in Weiss and Baret [71], most training cases in the Biophysical processor correspond to a spherical distribution (mode at 60◦ leaf angle). Equation (9) was subsequently used within the land surface models to convert LAI, which was assumed to represent green LAI [71], into PAI. Afterwards, PAI, leaf bi-hemispherical reflectance and transmittance, together with constant values for soil reflectance in the visible (*V IS* = [400–700] nm, *ρsoil*,*V IS* = 0.15) and near infrared (*N IR* = [700–2500] nm, *ρsoil*,*N IR* = 0.25) were used to quantify the shortwave net radiation of the soil and canopy. Leaf chlorophyll concentration (i.e., *Ca*+*<sup>b</sup>* = *CCC*/*LAI*) was used to derive the leaf bihemispherical reflectance and transmittance in the visible spectrum after a curve fitting of 45,000 ProspectD [76] simulations. Likewise, equivalent water thickness (i.e., *C<sup>w</sup>* = *CWC*/*LAI*) was used to retrieve leaf bihemispherical reflectance and transmittance in the NIR spectral region.

The thermal data needed to drive the evapotranspiration model was obtained from the Sea and Land Surface Temperature Radiometer (SLSTR) on board of the Sentinel-3A satellite [5]. SLSTR contains 3 thermal infrared (TIR) channels (with two dynamic range settings—for fire monitoring and for sea/land surface temperature monitoring) with 1 km spatial resolution and less than two days temporal resolution with one satellite (less than one day with both A and B satellites) at the equator. For each selected S2 scene, all the S3 scenes falling on the day of S2 overpass or within ten days before and after, were selected for processing. In the current study we used the L2A Land Surface Temperature (LST) product as downloaded from the Copernicus Open Access Hub (https://scihub.copernicus.eu/—last accessed 10 September 2019). The accuracy of this product is reported to be below 1 K when comparing against in situ radiometer measurements and independent operational reference products [77].

Finally, the parameters in the ET models that could not be directly retrieved from shortwave observations (e.g., vegetation height or leaf inclination angle) were set based on a land cover map and a look-up table (see Table 2). The CCI landcover map from 2015 [78], which was produced with a global coverage and 300 m spatial resolution, was used as the initial input layer before being reclassified into the smaller number of classes as shown in Table 2. Out of the parameters set according to the look-up table, the vegetation height (*hC*) has the largest influence on the modelled fluxes as it effects aerodynamic roughness [79,80]. Therefore in herbaceous classes where it can change throughout the growing season (grasslands and croplands) it was scaled with PAI using a power law, with maximum value *hC*,*max* indicated in Table 2 reached at a prescribed maximum PAI *PAImax* (5 in croplands and 4 in grasslands) and a minimum value set to 10% of the maximum value.

**Table 2.** Land cover based Look-Up-Table for ancillary parameters used in ET models. CCI-LC is the land cover code for the ESA's CCI land cover legend (http://maps.elie.ucl.ac.be/CCI/viewer/ download/CCI-LC\_Maps\_Legend.pdf—last accessed 13 April 2020); *hC*,*max* is the maximum canopy height occurring when PAI reaches *PAImax*; *f<sup>C</sup>* is fraction of the ground occupied by a clumped canopy (*f<sup>C</sup>* = 1 for a homogeneous canopy); *w<sup>C</sup>* is canopy shape parameter, representing the canopy width to canopy height ratio; *lw* is the average leaf size; *χ* Campbell [52] leaf angle distribution parameter.


### 2.3.2. Meteorological Data Source

The meteorological data used in this study consists of air temperature at 2 m, dew point temperature at 2 m, wind speed at 100 m, surface pressure, total column water vapour (TCWV), aerosol optical thickness (AOT) at 550 nm and surface geopotential. Those inputs are obtained from the ECMWF ERA5 reanalysis ensemble means dataset [81]. The only exception was AOT which come from the Copernicus Atmosphere Monitoring Service (CAMS) forecast dataset [82], since it is not included in ERA5. Inputs at the time of the satellite overpass are computed by linear interpolation between the previous and posterior reanalysis timestep. Due to the low spatial resolution of the air temperature and wind speed fields (tens of kilometers) they are assumed to represent the surface conditions derived from conditions above the blending height (100 m above the surface) rather then

the actual surface conditions. Therefore, air temperature at 100 m is calculated using the 2 m estimate, ECMWF surface geopotential, SRTM DEM and lapse rate for moist air. Those 100 m estimates are then used as inputs into the land surface flux models. AOT together with TCWV, surface pressure, SRTM DEM elevation and solar zenith angle at the time of Sentinel-3 satellite overpass were used to estimate the instantaneous shortwave irradiance on a horizontal surface at the satellite overpass [83,84].

#### *2.4. Thermal Data Sharpening Approach*

The thermal data sharpening approach used in this study is based on ensemble of modified decision trees. The basic scheme of the method (Figure 1) is based on Gao et al. [8] and has been previously applied by Guzinski and Nieto [4] to sharpen thermal data to be used as input to evapotranspiration models. Each S3 scene is matched with an S2 scene acquired at most ten days before or after the S3 acquisition and the regression model used for sharpening is derived specifically for each scene pair.

**Figure 1.** General thermal sharpening workflow. Explanatory variables include both shortwave bands as well any other ancillary explanatory variable, such as elevation, land cover type or exposure. Model could be any regression model, such as multivariate linear regression or machine learning techniques.

Briefly, the atmospherically corrected Sentinel-2 shortwave data (all the 10 m and 20 m spectral bands) with a spatial resolution of 20 m is resampled to match the pixel sampling of the SLSTR sensor (around 1 km spatial resolution). Concurrently, the SRTM DEM is used to derive slope and aspect maps which, together with S3 overpass time, are used to estimate the solar irradiance incident angle of a flat tilted surface. The DEM and the solar angle maps are also resampled to the SLSTR resolution. A multivariate regression model is then trained with the three resampled datasets used as predictors and the *Trad* used as the dependent variable. The selection of training samples is performed automatically by estimating the coefficient of variation (CV) of all the high-resolution pixels falling within one low-resolution pixel and selecting 80% of pixels with lowest CV. The regression model is based on bagging ensemble [85] of decision trees. The decision trees are additionally modified such that all samples within a regression tree leaf node are fitted with a multivariate linear model, as proposed by [8].

The regression models are trained on the whole S2 tile (100 km by 100 km) as well as on subsets of 30 by 30 S3 pixels in a moving window fashion. Once they are trained they are also applied on the whole scene and on each window. The bias between the predicted high-resolution *Trad* pixels aggregated to the low-resolution and the original low-resolution *Trad* is calculated and the outputs of the whole-scene and moving-window regressions are combined based on a weight inversely proportional to the bias [8]. Finally, the *Trad* predicted by the regression model is corrected by comparing the emitted longwave radiance of the sharpened fine *Trad* versus the original coarse *Trad*. A bias-corrected *Trad* is therefore re-calculated by adding an offset all fine scale pixels falling within coarse scale pixel in order to remove any residual bias. This is done to ensure the conservation of energy between the two thermal images with different spatial resolutions [8]. The output of the sharpening is a 20 m representation of the LST.

This image sharpening approach relies on the direct or indirect relationship that different regions of the shortwave spectrum have with the radiometric temperature and/or the ET process. For instance the temperature of denser canopies, with higher contrast between visible and near-infrared bands, is lower than the temperature of bare soils [86,87]. In addition, surfaces with higher water content (i.e., larger absorption in the short-wave infrared) have a larger evaporative capability and hence lower temperature [88]. Likewise, higher chlorophyll concentrations (i.e., larger absorption in the red and red-edge regions) might lead to higher light and water use efficiency and hence lower temperatures. The information contained in the DEM (i.e., the altitude and solar illumination conditions) also has a direct relationship with radiometric temperature, with sunlit areas having higher temperatures than shaded ones and lower altitude sufraces having higher temperatures than higher altitude surfaces.

#### **3. Results**

The overall performance of the tested models using sharpened temperatures from Decision Trees regressor (hereinafter *Trad*,*DT*) is shown in Table 3. Scatter plots of modelled versus measured fluxes for all the validation sites are in the Supplement. We removed all the cases in which the S3 image was contaminated by clouds in the vicinity of the flux towers or in which the SLSTR view zenith angle was larger than 45 degrees. In addition, we filtered all cases where estimated *R<sup>n</sup>* ≤ 50 W m−<sup>2</sup> , assuming that noisy outputs will be produced under low available energy, as well as those yielding unrealistic fluxes during daytime (≤−500 W m−<sup>2</sup> and ≥1000 W m−<sup>2</sup> ). After filtering the data, more than 400 cases were available overall for the following analyses. However, it is worth noting that ESVEP yielded significantly fewer valid retrievals. This issue might be due to the fact that ESVEP's end-member estimation equations were designed and parametrised for herbaceous crops [44] while in this study they were applied to varied land-covers. All models returned a similar performance regarding the estimation of *Rn*, with mean bias between −10 and −24 W m−<sup>2</sup> , RMSE ranging between 49 and 59 W m−<sup>2</sup> and *r* above 0.91. This similar behaviour is explained by the fact that all models share the same approach and same inputs in modelling net shortwave radiation, which is the component with larger magnitude of *Rn*. Likewise, G showed similar behaviour as well, but in this case *GMETRIC* is computed differently as it is a function of surface *Rn* [31,32] as opposed to TSEB and ESVEP where, as two-source models, G is computed from *Rn*,*<sup>S</sup>* [36,44].

The main differences in model performance are therefore in the estimation of turbulent fluxes (i.e., sensible and latent heat fluxes), and TSEB (TSEB-PT and disTSEB) usually produced most accurate estimates in terms of RMSE (≈80 W m−<sup>2</sup> , 45% relative error, in H; and ≈90 W m−<sup>2</sup> , 45% relative error, in *λE*) and higher correlation between observed and predicted values (≈0.67 for H and ≈0.76 for *λE*). disTSEB performs slightly better than TSEB-PT but the difference is not significant. For METRIC and ESVEP, the RMSE values are in all cases higher than 120 W m−<sup>2</sup> (going as high as 220 W m−<sup>2</sup> in case of H modelled with ESVEP) and with lower correlation (≤0.47).

The choice of closing the energy balance gap in field measurements by assigning it to *λE* has influence on the above results. Therefore, in Table 4 we also present the accuracy statistics of the turbulent fluxes when Bowen ratio is preserved during the energy gap closure procedure. The overall ranking of the models is preserved with the TSEB models still obtaining the lowest RMSE and highest correlation coefficients. However, the differences between the models (particularly in case of RMSE) are not as large as in Table 3. In particular the RMSE of the TSEB models increases significantly while there is a decrease in *r*, while the influence of closure method on the other two models is much weaker with the RMSE of ESVEP even decreasing slightly. In subsequent analysis we always assign the residual energy to *λE*.

**Table 3.** Error metrics for METRIC, TSEB-PT, disTSEB (TSEB-PT with flux disaggregation) and ESVEP modelled fluxes using Decision Tree sharpened temperatures and closing the energy balance gap in field measurements by assigning residual energy to latent heat flux. N, number of valid cases; Obs.; mean of observed values (W m−<sup>2</sup> ); bias, mean difference between predicted and observed (W m−<sup>2</sup> ); MAE, Mean Absolute Error (W m−<sup>2</sup> ), RMSE, Root Mean Square Error (W m−<sup>2</sup> ); rRMSE, Relative RMSE (–); r, Pearson correlation coefficient (–).


**Table 4.** Error metrics for METRIC, TSEB-PT, disTSEB (TSEB-PT with flux disaggregation) and ESVEP modelled fluxes using Decision Tree sharpened temperatures and closing the energy balance gap in field measurements by preserving the Bowen ratio. N, number of valid cases; Obs.; mean of observed values (W m−<sup>2</sup> ); bias, mean difference between predicted and observed (W m−<sup>2</sup> ); MAE, Mean Absolute Error (W m−<sup>2</sup> ), RMSE, Root Mean Square Error (W m−<sup>2</sup> ); rRMSE, Relative RMSE (–); r, Pearson correlation coefficient (–).


In order to evaluate the model sensitivity and uncertainty to different vegetation types, we have split the results of Table 3 into four main vegetation types, depending on differences in aerodynamic roughness, horizontal homogeneity and/or seasonal dynamics/senescence (i.e., croplands, grasslands, savannas and forests, Table 5). Similar to the overall results, the TSEB models output most accurate turbulent fluxes across all four vegetation types. They obtain the best results for H in grassland (RMSE ≈ 70 W m−<sup>2</sup> , r ≈ 0.8) and for *λE* in cropland (RMSE ≈ 80 W m−<sup>2</sup> , r ≈ 0.75). In grassland and cropland TSEB-PT and disTSEB produce very similar fluxes while in savanna disTSEB improves the accuracy of modelled H and *λE* by up to 10 W m−<sup>2</sup> . METRIC has its best overall performance in

savanna (RMSE of 132 W m−<sup>2</sup> and r of 0.43 for H; RMSE of 99 W m−<sup>2</sup> and r of 0.61 for *λE*) followed by cropland while ESVEP produces inaccurate H in all vegetation types (rRMSE > 1) and its best overall *λE* in grassland. It should also be noted that RMSE of *Rn* is for all models double in savanna (≈65 W m−<sup>2</sup> ) than in the other land cover types. This is due to vegetation being most sparse at those sites meaning that uncertainties in estimation of albedo and emissivity of soil have the biggest influence on shortwave and longwave net radiation respectively. Finally, very few valid cases are available to evaluate the forest sites and hence the results are not very conclusive, with the TSEB models again outperforming the METRIC and ESVEP.

**Table 5.** Error dependence on land cover for METRIC, TSEB-PT, disTSEB (TSEB-PT with flux disaggregation) and ESVEP modelled fluxes using Decision Trees sharpened temperatures. N, number of valid cases; Obs.; mean of observed values (W m−<sup>2</sup> ); bias, mean difference between predicted and observed (W m−<sup>2</sup> ); MAE, Mean Absolute Error (W m−<sup>2</sup> ), RMSE, Root Mean Square Error (W m−<sup>2</sup> ); rRMSE, Relative RMSE (–); r, Pearson correlation coefficient (–).



**Table 5.** *Cont.*

The agriculture class was further split into herbaceous and woody types, with results shown in Table 6. The former sub-class represents crops such as corn, soybean or wheat while the latter represents olive groves and vineyards. TSEB models produce the most consistent results for both types of crops, although somewhat surprisingly the RMSE of *λE* in woody crops (76–79 W m−<sup>2</sup> ) is significantly lower than in herbaceous crops (91–93 W m−<sup>2</sup> ), while opposite is the case for RSME of H (69–71 W m−<sup>2</sup> in herbaceous crops and 91–94 W m−<sup>2</sup> ). rRMSE of *λE* in both agricultural sub-classes was 0.32 which is of the same magnitude as energy closure gap at the validation sites (e.g., the mean value at CH was 0.34 at the times at which fluxes were modelled). METRIC is very clearly performing better in woody crops, while ESVEP obtains better results for H in herbaceous crops and better results for *λE* in woody crops. It is also worth noting that *Rn* and G showed larger relative errors in woody crops than in herbaceous crops, since woody canopies are more complex and therefore more difficult to capture by the models and/or parametrizations used [89,90].


**Table 6.** Crop type dependent errors for METRIC, TSEB and ESVEP modelled fluxes using Decision Tree sharpened temperatures. N, number of valid cases; Obs.; mean of observed values (W m−<sup>2</sup> ); bias, mean difference between predicted and observed (W m−<sup>2</sup> ); MAE, Mean Absolute Error (W m−<sup>2</sup> ), RMSE, Root Mean Square Error (W m−<sup>2</sup> ); rRMSE, Relative RMSE (–); r, Pearson correlation coefficient (–).

Finally, Table 7 lists the model performance depending on whether sites are under Mediterranean and semi-arid climate (i.e., water limited sites), or sites under temperate climate (i.e., energy limited sites). First of all it is worth noting that due to cloud coverage conditions, more valid cases are obtained over semi-arid conditions than in temperate areas. TSEB models showed similar range of errors in both climatic conditions, with RMSE in *λE* at around 85 W and 99 W m−<sup>2</sup> for semi-arid and temperate conditions, and correspondingly around 80 and 70 W m−<sup>2</sup> for H. ESVEP and METRIC yielded more varying results between climates, with METRIC producing more accurate estimates of both H and *λE* in semi-arid conditions and ESVEP showing better performance for H in temperate climates and better performance for *λE* in semi-arid climates.


**Table 7.** Climate dependence of errors for METRIC, TSEB and ESVEP modelled fluxes using Decision Trees sharpened temperatures. N, number of valid cases; Obs.; mean of observed values (W m−<sup>2</sup> ); bias, mean difference between predicted and observed (W m−<sup>2</sup> ); MAE, Mean Absolute Error (W m−<sup>2</sup> ), RMSE, Root Mean Square Error (W m−<sup>2</sup> ); rRMSE, Relative RMSE (–); r, Pearson correlation coefficient (–).
