**1. Introduction**

The fluxes of water (e.g., evapotranspiration—ET) and energy (e.g., of latent and sensible heat) at the surface of the Earth are critical to quantify for many applications in the fields of climatology, meteorology, hydrology and agronomy. Easy access to reliable estimations of ET is considered a key requirement within natural resource management, and if ET can be estimated accurately enough it holds a vast potential to assist in the current attempts of meeting the UN Sustainable Development Goals (SDG), e.g., SDG2—zero hunger, or SDG6—clean water and sanitation (https://sustainabledevelopment.un.org, last accessed 10 December 2018).

Water and energy fluxes show large spatio-temporal variability since they are highly dependent not only on the meteorological conditions, but also on different characteristics and properties of the land surface, such as soil moisture/water availability, land cover type and amount of vegetation biomass and its health. Remote sensing data can provide spatially-distributed information about relevant land surface states and properties used to model the relevant fluxes and hence this technology addresses a key limitation of conventional point scale observations when estimating fluxes at watershed and regional scales. In particular, thermal remote sensing has been widely used for assessing land surface turbulent fluxes [1]. While there are a variety of existing remote sensing ET methods and data options available [2,3], none is fully satisfying the user needs for reliable, operational and easy accessible estimates and tools able to derive ET at agricultural-parcel scale. The limitations have so far primarily been centred on the lack of suitable satellite-based input data sources.

With the recent launch of Sentinel-2 and Sentinel-3 satellites, the data foundation for producing operational ET maps has been set since as a constellation they contain most of the required spatial, temporal and spectral characteristics [4]. Sentinel-3 Sea and Land Surface Temperature Radiometer (SLSTR) instrument acquires daily thermal infrared (TIR) information of the surface at ca. 1 km scale [5]. However, the reliable estimation of ET in agricultural and heterogeneous landscapes requires that the model's spatial resolution matches the dominant landscape feature scale, usually tens or hundreds of meters. Sentinel 2, with a spatial resolution ranging from 10 to 60 m and 5 day revisit time with Sentinel 2A & B combined [6], can resolve part of these scaling issues, although it lacks a TIR instrument at high spatial resolution such as in the Landsat missions. Therefore sharpening [7–9] and/or disaggregation methods [10] are required to bridge the spatial gap between the currently available Sentinel constellation's thermal-infrared (referred to as "thermal" in the reminder of the paper) and optical-shortwave (referred to as "shortwave" in the reminder of the paper) observational capabilities in order to optimally exploit the synergies of both types of sensors for field-scale ET estimations. The aim of this study is to develop an optimal combination of thermal sharpening and ET modelling methods for the derivation field-scale ET with combined Sentinel-2 and Sentinel-3 observations.

Several data fusion methods have been proposed to merge low resolution thermal infrared imagery with high resolution shortwave imagery in order to obtain estimates of surface temperature (*Trad*) and/or ET at high spatial resolution. In this study we focus on different, but possibly complementary, approaches: empirical and semi-empirical methods that exploit relationships between shortwave bands and thermal or ET data (hereinafter called image sharpening methods); and physically-based ET downscaling methods (hereinafter called ET disaggregation).

Thermal image sharpening uses information from the thermal and shortwave images themselves to calibrate empirical or semi-empirical models. Those models relate coarse resolution *Trad* (or ET) with coarse resolution (or fine resolution aggregated to coarse resolution) shortwave bands, and then apply the calibrated model to the fine scale shortwave image, producing either a sharpened *Trad*, or directly an ET product.

One of the first attempts to sharpen *Trad* was TsHARP [11], who tested different regression models between *Trad* and NDVI. Since then, TsHARP has been utilised as reference method for developing and testing other sharpening methods [8,12,13]. The Data Mining Sharpening (DMS) approach [8] used local and global regression trees between reflective bands and *Trad* of homogeneous samples at coarse scale (based on coefficient of variation threshold). Residual analysis was performed to ensure energy conservation (based on emitted radiances) between original resolution and sharpened images. To avoid overfitting of regression trees such as in DMS the use of random forests was proposed instead [14]. Following with the machine learning algorithms, Yang et al. [15] used an Artificial Neural Network with Genetic Algorithm and Self-Organizing Feature Mapping trained with different land surface parameters for each land cover class (vegetation, bare soil, urban and water). A different approach used an unmixing method to derive brightness temperature and emissivity at fine scale [16]. The unmixed brightness temperature and emissivity were then the inputs to a generalized split-window algorithm to retrieve fine resolution *Trad*.

The use of a contextual algorithm can also be applied in sharpening, such as is the case of DISPATCH-LST (DISaggregation based on Physical And Theoretical scale CHange) by Merlin et al. [7] who used shortwave information on fractional vegetation cover and fractional photosynthetically active vegetation cover in contextual scatterplots of fractional green vegetation cover versus *Trad* and albedo versus *Trad* to define minimum and maximum soil and canopy endmember temperatures. Finally, two or more different methods can be used together and combined through weighted averaging, such as in Chen et al. [17], who combined TsHARP and a Thin Plate Spline interpolation by weighting their corresponding residuals. Besides of the fact that all methods described above can be used as well to sharpen ET, other studies have already suggested methods to directly downscale coarse scale ET using shortwave data [18–21]. In any case, shortwave images provide limited information related to some surface energy balance processes, such as turbulent transport, soil moisture, and meteorological forcing. Therefore ancillary variables could be included in *Trad* or ET sharpening such as land cover maps (to account for different aerodynamic roughness), local meteorology, or surface geometry [22].

A previous study [4] found that using a "disaggregation" approach [10,23] significantly enhanced the accuracy of turbulent fluxes derived with sharpened *Trad*. That approach ensures spatial consistency between fluxes derived at fine and coarse spatial scales by first estimating them at the coarse scale at which the thermal observations were acquired. In the following step, the low-resolution air temperature is varied to adjust the flux estimates for all high-resolution pixels falling within one low-resolution pixel. This is repeated until a consistency between the two scales is obtained. This approach assumes that since the coarse scale estimates are derived with *Trad* at original spatial resolution they are of higher accuracy. The disaggregation was shown to improve ET model skill when compared with outputs produced at either coarse or fine resolution alone [4,23,24].

The sharpened *Trad* can be used as input to land-surface energy flux models. The latent heat flux *λE* (or energy used for ET) can be estimated as the residual of the surface energy budget, using estimates of the net radiation (Rn), soil heat flux (G) and sensible heat flux (H). The thermal-based ET models were originally formulated for computing H, which is governed by the bulk resistance equation for heat transfer [25], and is driven by the gradient between an ensemble surface temperature, called the "aerodynamic surface temperature" (*T*0), and the surface layer air temperature. Besides of the estimation of that surface-to-air temperature gradient, the estimation of H requires the modelling of an aerodynamic resistance term, which can be viewed as a simplification of the complex turbulent transport of heat, momentum and water vapour, by using a similarity with Ohm's law for electric transport. These resistances therefore represent how efficiently a scalar (heat, momentum or water vapour) is transported from one point to another following a gradient (i.e., vertical differences of temperature and/or vapour pressure). Several formulations and/or parametrizations have been proposed to describe these turbulent transport processes but generally they include variables related to surface aerodynamic roughness, wind speed as well as wind attenuation through the canopy, and atmospheric stability [26].

The challenge in resistance energy balance models is that *T*<sup>0</sup> cannot be directly estimated by remote sensing [27,28]. Hence, remote sensing ET models differ from each other on how the existing difference between the radiometric temperature (*Trad*) observed by satellite sensors and *T*<sup>0</sup> is considered. Single-source or bulk transfer schemes for modelling H treat soil and canopy as a single flux source and often employ an additional resistance term (*RAH*, usually dependent on the Stanton number *kB*−<sup>1</sup> ) because heat transport is less efficient than momentum transport from land surface (see e.g., Garratt and Hicks [29] or Verhoef et al. [30]). Appropriately calibrated, one-source energy balance (OSEB) models have shown satisfactory estimates of surface energy fluxes in heterogeneous landscapes [31–34]. However, due to the difficulty in robustly and parsimoniously parametrizing *RAH* for OSEB schemes at different landscapes, climates, and observational configurations [35], the two-source energy balance (TSEB) modelling approach was developed [36]. TSEB models partition the surface energy fluxes and the radiometric temperature between nominal soil and canopy sources, and include a more physical representation of processes related to *Trad* and *T*<sup>0</sup> without requiring any additional input information beyond that needed by single-source models using more sophisticated *kB*−<sup>1</sup> parametrizing. However, because direct measurements of canopy (*TC*) and soil (*TS*) temperatures rarely are available, in most applications these component temperatures are derived from a measurement of the bulk surface radiometric temperature *Trad*. Partitioning of *Trad* between *T<sup>C</sup>* and *T<sup>S</sup>* requires some assumptions related to the evaporative efficiency of soil or canopy [36–38].

Finally, like all remote sensing retrievals, satellite radiometric temperature is prone to uncertainty due to sensor noise, surface emissivity and atmospheric effects. To overcome this issue in ET estimation, several methods have been proposed based on either contextual models [39–41], by constraining the ET range between hot (no ET) and cold (potential ET) pixels [31,32], or using time-differenced morning temperature rise [42,43]. Regarding the contextual methods, all of them require homogeneous forcing and coupling between land surface/atmosphere which is a disadvantage when applied at large scales. In addition, those models assume that the coldest pixel in the image means potential transpiration, and the hottest pixel means zero transpiration which is not always the case (e.g., in humid and sub-humid areas).

In this study we will evaluate three different ET models driven by Sentinel-2 and Sentinel-3 imagery: METRIC [32] is a one source energy balance model that is less sensitive to heat transfer coefficient parametrizing than other OSEB model such as SEBS [33]; TSEB-PT [36] as a widely used two source energy balance model; and ESVEP [44] as a hybrid contextual-two source energy balance model.

#### **2. Materials and Methods**

In this section we first describe the evaluated ET models (Section 2.1), before presenting the validation sites (Section 2.2) and the input data sources (Section 2.3).

#### *2.1. Description of ET Models*

The energy balance can be expressed as (1)

$$
\mathcal{R}\_{\mathbb{H}} \approx \mathbb{G} + H + \lambda E \tag{1}
$$

where net radiation *R<sup>n</sup>* is a key element in the energy budget of the land surface as it determines the available energy that the land utilises for water evapotranspiration (latent heat flux, *λE*) and for heating up the overlying air layer (sensible heat flux, *H*). Equation (1) assumes that other energy terms (heat advection, heat storage in the canopy layer, and energy for photosynthesis) are negligible. Since ET is the combined process of soil evaporation and canopy transpiration, *Rn* can be also be partitioned into soil (*Rn*,*S*) and canopy net radiation (*Rn*,*C*), with both sensible and latent heat flux also partitioned between soil (i.e., evaporation process) and canopy (transpiration).

Using remote sensing data to derive *R<sup>n</sup>* has proven to be a sound alternative to ground-based measurements of both shortwave and longwave net radiation. Different approaches have been proposed to estimate surface albedo, ranging from empirical relationships between ground measured albedo and the different reflective bands in satellite [45] to more physically based methods relying on modeling the surface anisotropic effects [46,47]. Indeed, one of the major challenges when estimating albedo with satellite remote sensing data is that such sensors typically measure the outgoing radiance at a given direction while the estimation of albedo needs to account for the outgoing radiance in all the directions of the hemisphere [48,49]. Methods based on the modelling of those bidirectional effects have proven to be effective to overcome this challenge. In this study, we use a method for retrieving soil and canopy shortwave net radiation, proposed by Kustas and Norman [50], based on the different spectral behaviour of soil and vegetation for the photosynthetically active radiation (PAR) and near infrared (NIR) regions of the spectrum. Such approach is based on the radiative transfer model (RTM) described in Campbell and Norman [51] to obtain estimates of soil and canopy albedo as well as canopy transmittance in the PAR and NIR. This approach requires as inputs Leaf Area Index and leaf inclination distribution [52], the different bihemispherical reflectances and transmittances of soil and a single leaf, and the proportion of diffuse irradiance. However, this approach assumes homogeneous canopies and it requires some corrections when dealing with clumped canopies [53], which were also used in this study.

On the other hand, longwave net radiation is primarily driven by the thermal radiation emitted by the surface, which depends on surface emissivity and skin temperature following the Stefan-Boltzman

law. Besides, Kirchoff's law can be applied to derive the atmospheric longwave radiation that is absorbed by the surface. When running OSEB models (i.e., METRIC), only those two components are taken into account. However, when using TSEB models (i.e., TSEB-PT and ESVEP), surface anisotropy can also be accounted for when estimating the net longwave radiation, considering that leaves and soil have different temperatures and hence emit different amounts of thermal radiation [50].

Soil heat flux *G* is usually assumed to be a ratio of the soil net radiation. Choudhury et al. [54], Bastiaanssen et al. [31] suggested that *G* is ca. 35% of net radiation of the soil around midday hours and this is the approach used in this study by TSEB models. Since net radiation of the soil cannot be computed when using OSEB models, a specific equation suggested by Bastiaanssen et al. [31] is used instead.

In all three evaluated models, *λE* is estimated as residual of Equation (1). The main difference between the models is in the way in which *H* is calculated, as briefly described in the sections below.

#### 2.1.1. Mapping Evapotranspiration at High Resolution with Internalized Calibration, METRIC

Sensible heat flux in METRIC [32] is derived in a contextual manner by finding hot and cold pixels (Equation (2)).

$$H = \rho \mathbb{C}\_p \frac{\delta T}{R\_{AH}} \tag{2a}$$

$$
\delta T = \mathfrak{c} + \mathfrak{m} \, T\_{rad} \tag{2b}
$$

where *δT* is the estimated gradient between aerodynamic and air temperature, estimated as a linear equation function of *Trad* with *c* and *m* parameters are linearly solved from expressing Equation (2b) from two cold and hot endpoints:

$$\mathfrak{m} = \frac{\delta T\_{\text{hot}} - \delta T\_{\text{cold}}}{T\_{\text{hot}} - T\_{\text{cold}}} \tag{3a}$$

$$
\mathcal{L} = \delta T\_{\text{hot}} - \mathfrak{m} \, T\_{\text{hot}} \tag{3b}
$$

METRIC scales *λE* between these two hot (*Thot*) and cold (*Tcold*) endmembers based on a linear relationship between actual ET and reference ET using the standardised ASCE Penman-Monteith equation for an ideal alfalfa field [55]. Therefore, METRIC, as opposed to SEBAL [31], does not assume zero sensible heat flux at the cold pixel, which can have a positive impact on model accuracy at well watered areas under large vapour pressure deficit conditions. According to Allen et al. [32], cold pixels yield a 5% larger ET than the reference ET (*λEcold* = 1.05*λEre f*), but earlier in the season and off-season, cold pixel ET is instead a function of fractional cover or NDVI: *λEcold*/*λEre f* = *f* (*NDV I*). On the other hand, METRIC overcomes the issue of estimating *kB*−<sup>1</sup> by computing *RAH* using the profile at two different heights above *z*0*H*. Finally the authors stated the need for either computing an "excess resistance" in aerodynamically rough and dry surfaces when using the *δT* calibration performed over agricultural areas, or calibrating different *δT* slopes at different land covers/environmental conditions [32].

For Equation (2) to hold true, *δT* and *H* require constant wind speed at the application domain, so the model uses wind speed at blending height to overcome this issue. It also requires constant irradiance and air temperature, i.e., *δT* changes are only either due to root-zone soil moisture or aerodynamic roughness. Furthermore, the model requires heterogeneity in hydrologic and vegetation conditions and therefore we applied METRIC over two different vegetation domains, short vegetation (crops, grass and shrubs) and tall vegetation (broadleaved and conifer forests as well as wooded savannas). Finally, METRIC is sensitive to the definition of hot and cold pixels. Several different methodologies to find those endmember values were proposed, which can be especially challenging in heterogeneous areas where pixels become mixed at coarse spatial resolution. In our case we adopted the Exhaustive Search Algorithm solution proposed by Bhattarai et al. [56].
