1. Introduction
The estimation of quantitative vegetation variables is fundamental to assess the dynamic response of vegetation to changing environmental conditions [
1]. Earth observation sensors in the optical domain enable the spatiotemporally-explicit retrieval of plant biophysical parameters [
2]. Since the advent of optical remote sensing science, a variety of retrieval methods for vegetation attribute extraction emerged. Essentially, quantification of surface biophysical variables from spectral data always relies on a model, enabling the interpretation of spectral observations and their translation into a surface biophysical variable. Methodologically, these retrieval models can be classified into the following four categories: (1) parametric regression, e.g., spectral indices combined with a fitting function, (2) non-parametric regression, e.g., machine learning regression algorithms (MLRAs), (3) physically-based, i.e., inverting radiative transfer models (RTMs), and (4) hybrid methods. See [
3,
4] for a comprehensive review of these methods and mapping applications. Each of these categories has their benefits and limitations, depending on the targeted application. Hybrid methods blend the generic properties of physically-based models combined with the flexibility and computational efficiency of MLRAs. Within such a scheme, lookup tables (LUT) are generated from RTM simulations. Then, the MLRA learns the (non-linear) relationship between the pairs of reflectance and vegetation trait of interest. Hybrid methods tend to be preferred when it comes to operational processing to be applied across the globe, given their general applicability, processing speed and competitive performances. One of the major advantages of these methods is that, once a MLRA is trained, it can process an image into a vegetation product quasi-instantaneously. Current hybrid schemes for the generation of land products typically rely on neural networks (NNs) trained using a very large amount of RTM-simulated data [
5]. Previous studies with NNs and LUT-based inversion even suggested LUT sizes from 8000–100,000 combinations of input variables [
6,
7].
In this regard, due to the fast progress in the development of machine learning techniques and their applications, for the last few years alternative MLRAs came forth as appealing alternatives over conventional NN models into hybrid retrieval strategies [
8,
9]. Especially, the MLRA families of decision trees and kernel-based methods proved to deliver outstanding mapping results [
3,
4]. These methods tend to be simpler to train, i.e., no need for such large LUTs, and for vegetation properties estimation can perform more robustly than NNs when a reduced number of training samples is available, while maintaining competitive accuracies [
8,
10]. From the kernel-based algorithms, noteworthy are kernel ridge regression [
11], because of its simplicity and therefore fast run-time, support vector regression [
12] and Gaussian process regression (GPR) [
13]. GPR is particularly attractive because of carrying out statistical learning developed in a Bayesian framework. Lately, GPR became one of the main interesting kernel-based ML methods for vegetation properties retrievals: GPRs excel other ML algorithms through delivering competitive prediction accuracy [
8,
10,
14] and insight in relevant bands [
15], and above all, providing a closed form expression for the uncertainty intervals of the estimates. This distinct feature provides a valuable source of additional information, e.g., to assess the robustness of the predictions at varying spatiotemporal scales [
16].
Essentially, the entire procedure of learning a GPR model only relies on an appropriate selection of the type of kernel and the hyperparameters involved in the estimation of input data covariance. Kernels contain assumptions about the function we wish to learn and define the closeness and similarity between data points. Once a kernel is selected, the unknown hyperparameters of the kernel need to be learned from the training data [
17]. Trained GPRs are usually highly flexible and accurate for prediction over new inputs closer to training data points, whereas their uncertainty increases when the new inputs are further away from the available training information. With these appealing properties, apart from retrieval applications, recent studies have demonstrated the effectiveness of GPR for time series gap-filling applications [
18,
19,
20].
Given the attractive properties of GPR and with ambition of progressing towards operational processing, the algorithm was recently explored as a prototype retrieval algorithm for automated processing of Sentinel-2 (S2) tiles into products of green leaf area index (LAI
) and brown LAI (LAI
) at a pixel resolution of 20 m [
21]. GPR models were developed, applied to S2 imagery, and LAI maps were validated across various sites in Europe. In addition, LAI time series were processed for the selected sites; the regular occurrence of cloud cover in optical data makes that gap-filling and smoothing techniques are mandatory for obtaining meaningful phenology profiles. Among multiple gap-filling fitting functions analyzed [
20,
22], GPR appeared to be one of the top performing algorithms in correctly reconstructing cloud-free LAI
products [
20], and at the same time providing associated uncertainties. Altogether, GPR proved to be appealing for spatiotemporal processing optical satellite data, i.e., not only for retrieving LAI from S2 data but also for gap-filling processing in order to obtain cloud-free LAI
estimates at regular intervals (e.g., each 5 days).
However, when eventually aiming for seamless S2 data processing applicable to any corner in the world, more computationally-efficient solutions have to be sought for. While GPR processes a single S2 tile reasonably fast (in the order of minutes), GPR processing becomes tedious when time series of S2 tiles must be processed, and this is especially true when covering larger regions. Moreover, preprocessing steps, such as selecting and preparing S2 tiles from the Copernicus data hub [
23], adds to additional runtime and eventually becomes a bottleneck when not fully automated. Altogether, in order to achieve dynamic processing of a vast amount of S2 data, it demands for: (1) migrating towards cloud-computing platforms, and (2) integrating the GPR algorithm into the cloud-computing platforms. Hence, this leads to new challenges to overcome, but also opportunities towards interactive on-the-fly processing of vegetation properties estimation in a cloud computing environment. Specifically, for the last few years the Google Earth Engine (GEE) platform emerged as an attractive high-performance computing platform to enable cloud-based processing of petabytes of S2 data [
24]. GEE provides powerful computational capability for planetary-scale data processing and allows creation and training for well-known machine learning algorithms [
25]. However, despite the growing capabilities in advanced machine learning tools in GEE, the possibility to train and apply GPRs, or even just run already trained GPRs, is still lacking. This means that solutions have to be developed to enable integrating GPR into the GEE environment. All in all, the ambition to plug in GPR into the GEE platform for the generation of vegetation products from S2 satellite data in a flawless cloud-based approach brings us to the following main objectives: (1) to adapt the GPR algorithm for multispectral data so it is scalable into GEE environment; (2) to optimize and import a GPR model for LAI
estimation in GEE; (3) to process S2 multispectral data into LAI
in GEE; (4) to tackle the gap-filling of discontinuous LAI
time series by extending the GPR modeling to the time domain; and (5) exemplify the processing power of the developed framework with a few comprehensive case studies.
The remainder of the paper is structured as follows. The GPR standard theory and a simple reformulation tailored to parallelize the prediction process is described in
Section 2.
Section 3 outlines the data and the followed methodology for training the GPR models.
Section 4 summarizes the strategy to plugin GPR into the GEE platform.
Section 5 provides a demonstration case of reconstructing cloud-free composite LAI maps from Sentinel-2 acquisitions over wide areas. A discussion on the workflow’s strengths and weaknesses and future research lines is presented in
Section 6, whereas conclusions are finally presented in
Section 7.
5. Cloud-Free Seamless Mapping of Wide Areas
The aim of this section was to demonstrate the great potential of the methodology here proposed for wide areas monitoring. It is often of key importance being able to provide continuous mapping of LAI
at a specific time all over the area of interest, but the presence of clouds usually hinders or makes unfeasible the comparison of zones evolution on the same dates. The area chosen as demonstration case is the whole Iberian peninsula, then including Portugal, Spain and the south area of France along Pyrenees. In terms of extension, an overall number of 127 tiles from 6 different orbits (relative number from east to west 8, 51, 37, 80, 94, 173) are required for the complete coverage. A detail of tile distribution is shown in the top-left image (a) in
Figure 8. To stress the different evolutions of LAI
according to geographical coordinates of zones, we selected 3 different dates: 2 February, 30 March and 30 June, 2019. For each date, we create a collection of S2 reflectance images within the time span given by the date ± 3 months. Note that this time span corresponds to approximately 6 times the lengthscale hyperparameter in
(
Section 3.2), which ensures all the samples that contribute meaningfully to the prediction in Equation (
12) are accounted for.
The result of the LAI
mapping at 20 m for the three dates is shown in the images (b–d) of
Figure 8, each one representative of a different season: Winter, Spring and Summer. The quality of the reconstructed LAI
maps is evident. No orbit artifact is detectable for the three dates, confirming the capability of the proposed methodology to guarantee a spatial continuity of the result even if it is applied in a pixel-wise fashion. Clear temporal patterns can be observed by comparing the three maps. For instance the north area of Spain and Portugal are characterized by very high values of LAI
, indicating the presence of dense forests, with respect to the central and south part of the peninsula, where seasonal vegetation growth is more common. In addition, the presence of large areas where not significant LAI
is detectable during the three seasons here monitored (and probably during autumn too) can be also observed. Note that the usage of an advanced machine learning model for LAI
such as GPR, which exploits the whole multispectral information of S2 data, allows enhancing the dynamic range of the model, overcoming saturation issues usually exhibited by models based on vegetation indexes.
Different climatic areas can also be deduced from this mapping, which resemble the distribution of the average total precipitation in the Iberian Peninsula (see Figure 69 in [
58]) and confirm the identification of areas characterized by desertification in the Köppen–Geiger Climatic Classification of Iberian Peninsula (see Figure 1 in [
58]). Finally, it is worth emphasizing that the spatial continuity is also maintained at the 20 m resolution. The visual inspection of the maps at the maximum detail did not reveal any radiometric discontinuity. The need to visualize a wide area oblige to compress the whole information in a small representation, so that details of the maximum resolution map cannot be appreciated. This is the purpose of image (e) in
Figure 8, which provide a zoom of the LAI
map of 30 June over the so-called Atlantic Mountain Range and the Pyrenees. This subset area represents a critical test site, being characterized by a severe cloud presence (probability between 60% and 70% between April and October [
59]). Again, no radiometric discontinuities or artifacts can be detected both in coastal and mountainous zones, confirming the high reliability of the kernel-based gap-filling regression method we proposed over multi-tiled areas of study.
One closing comment about computational time. The processing of each tile takes about 1 min, and an average of 26 min were necessary for tile downloading to Google Drive at 20 m; the overall time required for downloading the 127 tiles composing each map was in average 26 h. When setting the output resolution to higher values, the overall time drops significantly. For instance, less than 2 h were required to get a complete LAI map at 300 m spatial resolution.
6. Discussion
This work presented the integration of GPR into GEE for seamless vegetation properties mapping. The advantages of GEE as an image processing platform are unprecedented: GEE has opened a new big data paradigm for storage and analysis of open-access Earth observation (EO) data
over areas with a spatial extension and at a spatial resolution that was not feasible using any desktop processing machines [
24]. In the following, we will discuss its strengths and weaknesses.
In GEE, whole data collections of multiple EO missions from medium to coarse spatial resolution are available on line for free, and the user-friendly Java coding editor allows launching computational-demanding processes over multiple distributed platforms. In addition, efficient mosaicking tools make it possible to deal with raster and vector information at once for selecting specific areas to be studied, or extend the analysis to nationwide coverage [
60,
61]. GEE Python coding is also possible using specific wrapper of the
library, giving the opportunity to link GEE data catalogue to GIS environment such as QGIS [
62,
63], and use it as a starting point for more advanced client-side analysis. Finally, information down- and up-loading can be carried out very efficiently using solution as Google Data Cloud or Google Drive [
64]. All these aspects make GEE extremely appealing for the development of EO applications. Yet, also some inconveniences need to be addressed.
First of all, scarce supporting documentation is provided in the official webpage. Error description during debugging is not exhaustive [
57], and becoming aware of library changes due to continuous updates is not straightforward. To a certain point, the bug report official channel is helpful to find how-to-do examples, but these gaps are mitigated mostly by the ever-growing numbers of users sharing their experience in forums [
65] and unofficial users’ blogs [
66,
67]. Materials shared by the GEE community provide enlightening pieces of code that allowed us to look into
methods and clarify the input-output specific formats they require. The main GEE limitation we had to circumvent is that despite the rapid progress in cloud-based algorithm development, not all MLRAs have been developed into the GEE ecosystem. Particularly, the GPR algorithm appeared to be lacking, which eventually became the rationale of investing into its implementation and the here presented work.
Secondly, although GPR is a highly competitive regression algorithm and has appealing advantages over other MLRAs such as uncertainty estimates and band ranking properties [
3,
8,
14,
15,
16,
26], one reason why GPR did not find its way yet into GEE is likely due to the heavy usage of contiguous memory allocation required by the GEE
data type. Matrix algebra operations involved in the GPR regression implementation cannot be performed using directly the
data type in which all the GEE catalog imageries are made available. To apply these specific algebra operations, the conversion from
to
data type is compulsory. Unfortunately, due to the particular usage of computational memory
type performs, using
type generates “memory exceeded” error messages much more frequently than using
type, even if the number of acquisitions to be processed is low (<40). To avoid these errors, computational workarounds must be devised.
In order to bypass these memory-related limitations, we had to introduce multiple adaptations, which are summarized as follows. We (1) expanded the formulation of standard GPR, (2) aggregated all terms independent of pixel’s hyperspectral information that can be precalculated to avoid repeating cumbersome operations for each pixel, (3) performed data manipulation that can be carried out using
datatype format before moving to
data type, (4) implemented GPR into a matrix algebra formulation and (5) converted the results back to
format adding coordinates information, mandatory for mapping purposes. These main steps have been followed for the implementation of the LAI
model, but also for the gapfilling technique based on GPR, and constitute the result of pursuing a non-straightforward optimization of GEE coding. This is the main reason why we decided to add a few lines of pseudo-code (
Table 5 and
Table 6), where we used mathematical symbology of
Section 2 for variable names and amethyst colored for specific
library functions. The scarce documentation and the few examples available online on this subject made complex algorithm development challenging, and sharing the core part is going to be key for anyone interested in implementing its own GPR model.
On the bright side, the presented workflow provides a generic strategy for importing any SE kernel-based GPR model in GEE. To demonstrate the functioning of the workflow, here we focused on the implementation of the LAI
model for national mapping. This demonstration case was chosen for two reasons. Firstly, LAI
is a key vegetation variable for many applications dealing with permanent (e.g., forests) and seasonal (e.g., croplands) vegetation phenology evolution. Secondly, the imported LAI
model is trustworthy for wide area mapping as: (1) it was trained on agriculture areas, bare soils and dense forests, allowing to cover a dynamic range of this parameter from 0 up to 10 [m
m
] [
21], and (2) a thorough assessment on the hyperparameter optimization was conducted and reported in [
20,
22]. Although the LAI
product was shown to be consistent, in principle any kind of vegetation property can be mapped as long as the model is accurate and robust [
10]. In this respect and moving ahead towards mapping of other biophysical variables, once having an already trained GPR model at disposal, it is possible to integrate it into the developed workflow after substituting the new hyperparameter values and the corresponding training samples and normalization matrices. Although the latter step still requires manual implementation, it is foreseen that in the near future these steps will be further optimized and automated so that eventually GPR models can be smoothly imported into GEE. One last warning deals with the maximum number of samples the GPR model can contain. Keeping it below 150 prevents out-of-memory errors from occurring frequently. As demonstrated in
Section 3.1, active learning optimization constitutes an efficient tool to slim down heavy models but preserve the original training set diversity. Concerning the estimation of GPR uncertainty described in Equation (
2), which represents a property only GPRs offer with respect to any other ML technique, a deeper development is still needed. First tests allowed its estimation for LAI model only over small areas, but the higher amount of information data to be kept in memory makes this issue still pending for multi-tile mapping.
To this end, an alternative research line to be explored in the future deals with using Bayesian Neural Networks, which are currently the state-of-the-art for estimating NN predictive uncertainty [
68,
69]. Recent works have shown the tight relationship between NNs and GPs [
70] and how in particular scenarios NNs can outperform GPs [
71]. The limited number of samples in case only in-situ data are to be used might still represent an issue for proper training purposes, but hybrid solutions able to blend simulated information and real data might offer a feasible workaround.
Another point worth addressing is the resolved temporal aspect. The proposed gapfilling strategy yielded promising and consistent results. The examples of seamless mosaicking of cloud-free LAI
collections demonstrate the great potentials of this regression technique as a gap filler. Although this approach was so far only demonstrated over small areas within the same S2 tile [
20,
22], this work presents the first proof that it is robust to radiometric spatial discontinuities over very large areas even if applied pixel-wise, and hence suitable to blend information from different S2 orbits and perform high-quality large-scale mapping. Despite its promising perspectives, it must also be remarked that the gapfilling processing chain can still be improved. For instance, residual atmospheric errors affecting input reflectance may generate inconsistent intervals along the time series. The GPR model assumes all of them are valid samples, and performs a smoothing effect that may lead to underestimate local LAI
peaks. This can be observed by comparing original and gapfilled LAI
time series shown in
Figure 7. As an attempt to minimize the effect of these unlikely outliers, an additional cleaning step before the gapfilling procedure, or even an iterative or two-step gapfilling approach might be considered.
A final aspect that merits further consideration is the all-comprehensive generic model we trained over multiple vegetated land covers [
28] to ensure consistent any-image processing. As an alternative strategy, rather than relying on one generic model, the use multiple light models might also be considered. For instance, coarse scale classification map from MODIS or Corine can segment the area of interest to enable running multiple GPR models, each of them specialized for the LAI calculation over a specific vegetation class. This approach would not only help to achieve a more reliable information using lighter, dedicated models with beneficial computational time. Moving towards land cover specific processing also opens the possibility to develop further processing schemes depending on the land cover type. For instance for croplands one can think of the determination of key phenological descriptors such as start-of-season, end-of-season or area under the curve [
72,
73,
74,
75,
76]; and for evergreen forests the detection of disturbances (e.g., logging, fires) [
77,
78,
79,
80,
81] or sudden discontinuities [
82]. There is no doubt that it will eventually become possible to combine all these advanced processing schemes into the GEE framework. Moving further along this line, the usage of lighter GPR models can pave the way towards a multi-model framework to retrieve and combine multiple vegetation variables over the same area. This is of interest for a range of purposes. For instance, apart from LAI
, other vegetation variables such as chlorophyll, fractional vegetation cover or fAPAR are considered of key importance for monitoring applications, e.g., related to crop productivity and safeguarding food security, for the estimation of the gross primary production (GPP) at ecosystem level [
83], and ultimately for the estimation of carbon sequestration at global scale [
84,
85].