Article

Comparison of Climate Reanalysis and Remote-Sensing Data for Predicting Olive Phenology through Machine-Learning Methods

1
Vicomtech Foundation, Basque Research Technology Alliance (BRTA), 20009 Donostia, Spain
2
Agricolus s.r.l., 06129 Perugia, Italy
*
Author to whom correspondence should be addressed.
Remote Sens. 2021, 13(6), 1224; https://doi.org/10.3390/rs13061224
Submission received: 16 February 2021 / Revised: 18 March 2021 / Accepted: 18 March 2021 / Published: 23 March 2021

Abstract
Machine-learning algorithms used for modelling olive-tree phenology largely rely on temperature data. In this study, we developed a prediction model on the basis of climate data and geophysical information. Remote measurements of weather conditions, terrain slope, and surface spectral reflectance were considered for this purpose. The accuracy of the temperature data worsened when weather-station measurements were replaced with remote-sensing records, though the addition of more complete environmental data resulted in an efficient prediction model of olive-tree phenology. Filtering and embedded feature-selection techniques were employed to analyze the impact of variables on olive-tree phenology prediction, facilitating the inclusion of measurable information in decision-support frameworks for the sustainable management of olive-tree systems.

1. Introduction

Monitoring the timing of phenological events is key to understanding the effects of climate change on agroecosystems [1] and to scheduling appropriate control strategies. Phenological information can be derived through the direct ground observation of crop development, though this is a time-consuming and labor-intensive process [2]. Proximal sensing techniques (e.g., phenocams) may allow for retrieving accurate phenological data at the canopy scale. Satellite time series can be conveniently used to determine vegetation indices and monitor phenological phases at the regional level [3]. Vegetation indices, such as the normalized difference vegetation index (NDVI) and the enhanced vegetation index (EVI), have been used to infer crop growth and changes in vegetation dynamics and land cover [4,5]. These indices provide useful radiometric and biophysical information for land-surface characterization, with different spectral-saturation effects [6], allowing for the study of phenological variation over large spatial and temporal scales [7]. The use of multitemporal data has proven useful for monitoring and characterizing spatial and temporal patterns of crop-cover changes, with improved accuracy compared to that of a single date [8]. Both NDVI and EVI are important indicators for estimating the start of the growing season (SOS) [9,10,11] and other phenological metrics [10].
Forecasting phenological crop phases is a valuable tool for crop management and decision making, particularly for yield prediction and growth modeling [2,12]. Temperature is the most important variable for the explanation of phenological changes [13]. The inclusion of temperature in process-driven models allows for determining thresholds below which physiological processes stop [14]. Models based on the thermal-time concept are widely used to predict phenological events with respect to thermal accumulation above a base temperature, generally calculated in hourly or daily (i.e., growing degree days, GDD) time steps up to a fixed amount that is crop- and variety-dependent [12,14,15]. However, GDD-based models are not sufficient for predicting crop phenology, as they are intrinsically uncertain and speculative [16]. Additional meteorological, geographical, and vegetation parameters can better capture environment-induced variability. Accurate predictive phenological models depend, therefore, on data availability and quality, in addition to robust modeling methods.
Because they operate at different observation scales, satellite or proximal-sensing approaches and ground-based surveys have rarely provided consistent information on the seasonal course of phenological events. Indeed, phenotypic plasticity determines a wide range of phenological responses [17]. Statistical models based on meteorological parameters provide a potential solution for merging ground-based phenological observations and satellite-derived vegetation indices [18,19]. In particular, machine-learning techniques may improve the accuracy of phenological models [20]. Dense meteorological networks and crop simulation models can be combined to predict phenological crop events at high resolution and apply sustainable management practices. Machine-learning techniques were previously applied to crop-yield optimization and predictions of plant phenology [16]. Indeed, machine-learning techniques have proven useful in selecting appropriate environmental variables and identifying important features, thus improving the accuracy of models aimed at predicting ecological phenomena [21]. In particular, decision-tree models, in comparison with parametric models, allow numerous variables to be managed while addressing issues related to multicollinearity [22].
Although climate change and crop production are strongly interconnected, the timing of phenological events, the prediction of their trajectories under climate-change scenarios, and the effects on crop production are still uncertain for olive trees. The prognosis of flowering [23,24] and yield [25], and the monitoring of olive-tree phenology, pests, and diseases [26], are crucial tools for modeling the impact of climate change on this crop. In olive trees, a relation between phenological phases and climatic variables was found [27,28], which is also associated with topographic features [29,30]. The impact of both the environmental framework and agronomic management on olive-tree phenology was analyzed in several contributions [6,31,32].
Marchi et al. [33] used a comprehensive regional phenology and agrometeorological database of the Tuscany region (Italy), implementing a simple phenological model for olive trees. Afterwards, Oses et al. [34] adjusted the random-forest method with an optimized base temperature, considering as input the phase-specific cumulative GDD and the day of year (DOY). However, olive-tree agroecosystems across the Mediterranean region represent a complex landscape mosaic, and models based only on temperature are likely to reveal widely ranging temperature sensitivities. Here, we propose a novel approach based on vegetation-index accumulation quantities as input features to predict olive-tree phenology, which further improves previous phenological models for olive trees [27,28]. The daily sampling period of such data plays an essential role in the creation of features highly correlated with the phenological phase. Other novel features, such as the plant-phenology index described in [9] (and its respective accumulation), were tested without successful results. We first considered distinct datasets and derived features. The quality of the random forest proposed in [34] was contrasted across temperature datasets provided by different agencies (weather stations vs. Copernicus data).
Then, filter and embedded feature-selection techniques [35] were employed to analyze the impact of different feature sets on phenology prediction. In addition, a reduction in input variables was pursued to simplify model complexity [36] and the final physical interpretation of the model [37]. Lastly, multiple machine-learning models were compared to select the most accurate forecasting model with the most compliant feature sets.

2. Materials and Methods

A flowchart of the model is shown in Figure 1.

2.1. Original Data

Olive-tree phenology observation data were obtained from the monitoring network established in Tuscany (Italy), consisting of 21, 19, and 18 olive-tree orchards in 2008, 2009, and 2010, respectively. Olive-tree orchards were selected for their proximity to one of the agrometeorological stations of the regional network and for being representative of the areas most suited to the production of high-quality olive oil in the region, in both coastal and inland conditions. For each olive-tree orchard, 5 olive trees belonging to the Frantoio cultivar, in good health and with homogeneous productivity, were selected in the core area of the orchard, excluding those at the edge. Phenological observations were carried out by trained field technicians who visited the orchards at different times throughout the year, according to the main developmental stages of the olive tree. A simplified BBCH scale was used to assess the phenological stage at each observation (the abbreviation derives from the names of the originally participating institutions: Biologische Bundesanstalt, Bundessortenamt und CHemische Industrie).
Weather data, used for the phenological baseline model [33], were obtained from 19 different weather stations of the regional agrometeorological network of Tuscany. There are 822 phenological observations in the dataset that belong to 22 different olive-tree orchards, with an average of 34 phenological observations per location and an average of 274 phenological observations per year.
In addition to temperature, other climate variables were used: precipitation, which may improve phenology prediction [38,39], and the vegetation indices NDVI and EVI, which can also be used to predict other phenological parameters [10]. We considered climate and MODIS-derived features to improve upon the prediction of phenological events based only on temperature, as in [34,40].
Three diverse datasets were utilized to calculate predictors for the target feature, the phenological phases of the olive tree. The first dataset was the weather registry provided by the open-access ERA5 dataset, the latest generation of ECMWF atmospheric reanalysis (Copernicus Climate Change Service, 2017). Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset. ERA5 provides hourly data, distributed in a fixed grid and with a native resolution of 9 km. Climate reanalysis data are available through the interface of Google Earth Engine (GEE) Data Catalog [41].
The second dataset, a static (time-invariant) satellite-derived dataset, contains the slope label provided by Conservation Science Partners. As a geophysical characteristic, the slope (or type of landform) was considered as an input feature together with the latitude and longitude of the olive-tree orchards.
The last dataset corresponded to diverse collections of MODIS (see the Data Source column of Table 1), the Moderate Resolution Imaging Spectroradiometer imaging sensor launched by NASA. The data products of MODIS are also available via the interface of the GEE Data Catalog. In this case, the data provider was NASA LP DAAC.
All parameters used as features or for modeling purposes are reported in Table 1.

2.2. Data Preprocessing

The predicted variable is the phenological phase, considering a simplified BBCH scale for the olive tree. The BBCH scale uses numerical codes to label the different phases; however, since the distance between numeric labels bears no relevance, it cannot be treated as a numerical variable. A ranking of the BBCH scale for the olive tree was therefore generated, and a new variable ranking the development label for each data point was created. This rank variable was used as the target variable in the prediction models. When regression models were applied, the output was discretized (with a rounding function) to obtain an integer value. This phenological parameter is predicted for the day of year (DOY), representing the day of observation.
ERA5 provides daily aggregated values for each of the climate reanalysis parameters described in Table 1. Following the Allen methodology [34,42], growing degree days (GDD) were calculated using the single sine method to approximate hourly temperatures given the daily minimal and maximal temperatures from ERA5, a base temperature, and no upper cutoff. Heat accumulation started on 1 January. The GDD values computed from ERA5 data are denoted by GEEcumt, with t being the base temperature. Lastly, we created the precipitation-accumulation feature, denoted by “cum precipitation”. All accumulation-derived features were measured from 1 January until the day of the field observation of the target variable.
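The single sine approximation of daily GDD (with no upper cutoff) can be sketched as below: the day's temperature curve is modeled as one sine wave between the daily minimum and maximum, and the degree days are the area of that curve above the base temperature. The function names are ours; `t_base` corresponds to the base temperature t of GEEcumt:

```python
import math

def single_sine_gdd(t_min: float, t_max: float, t_base: float) -> float:
    """Daily growing degree days via the single sine method, no upper cutoff."""
    if t_base <= t_min:                      # whole day above the base temperature
        return (t_max + t_min) / 2.0 - t_base
    if t_base >= t_max:                      # whole day below the base temperature
        return 0.0
    m = (t_max + t_min) / 2.0                # mean temperature (sine midline)
    alpha = (t_max - t_min) / 2.0            # amplitude
    theta = math.asin((t_base - m) / alpha)  # time angle where T crosses t_base
    return ((m - t_base) * (math.pi / 2.0 - theta) + alpha * math.cos(theta)) / math.pi

def cumulative_gdd(daily_min_max, t_base):
    """Accumulate GDD from 1 January up to the observation day (GEEcum-style feature)."""
    return sum(single_sine_gdd(lo, hi, t_base) for lo, hi in daily_min_max)
```

When the base temperature lies below the daily minimum, the formula reduces to the familiar average-temperature GDD, (t_max + t_min)/2 − t_base.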
Considering the daily measured enhanced vegetation index (EVI) and normalized difference vegetation index (NDVI) that were used for forest metrics [10], we derived the cumulative sum for each day. Due to the increasing nature of the BBCH output, the accumulative strategy was applied to different MODIS features, namely, the ones that could explain the phenological events. The accumulation of EVI or NDVI is denoted by “EVIcum” and “NDVIcum”, respectively. For the red and NIR surface reflectance, we computed the corresponding cumulative sum, represented by “REDcum” and “NIRcum”.
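The accumulation features above share one generic operation: a running sum of the daily index values from 1 January up to each day. A minimal sketch (the NDVI numbers are illustrative only):

```python
from itertools import accumulate

def cumulative_feature(daily_values):
    """Running sum of a daily index (e.g., EVI -> 'EVIcum', NDVI -> 'NDVIcum').

    daily_values: index values ordered from 1 January onward; entry i of the
    result is the accumulated value up to and including day i.
    """
    return list(accumulate(daily_values))

# Toy NDVI series; the result is approximately [0.30, 0.62, 0.97, 1.37].
ndvi = [0.30, 0.32, 0.35, 0.40]
ndvicum = cumulative_feature(ndvi)
```

The same helper would serve REDcum and NIRcum from the red and NIR surface reflectances.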

Splitting of Dataset and Evaluation Metrics

For each experiment, the dataset of phenological data was divided into training ( 65 % ) and validation ( 35 % ) disjoint sets, obtained with stratified sampling depending on the different years and locations. For each model, we created a grid of model parameters, with the parameters of each model being optimized by cross-validated grid search over these parameter grids. The employed corroboration strategy was stratified 5-fold cross-validation, completed with the training set. The different experiments were compared according to the root-mean-square error (RMSE) metric on the validation set. The model with the lowest RMSE was the most accurate. We used the RMSE metric of regression because the classes were strictly ordered, and RMSE correctly penalized higher deviances of the predicted values from the observed values more than lower deviances.
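The evaluation metric can be sketched in plain Python; with scikit-learn, the same quantity can be obtained by taking the square root of the mean squared error:

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error: larger deviations are penalized quadratically,
    which suits the strictly ordered phenological ranks."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))
```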

2.3. Baseline Model and Base-Temperature Optimization

The baseline model used as a benchmark for this study was random forest because, in recent studies [40], the random-forest model had better accuracy than the original GDD baseline technique detailed in [34,40]. Random forest operates by constructing several decision trees during training and outputting the most voted class as the prediction of all the trees. We used the scikit-learn package to build a pipeline with various machine-learning tools. More precisely, decision-tree-based methods such as random forest belong to the ensemble-methods module, which combines the predictions of several base estimators in order to improve generalizability and robustness over a single estimator. The common and distinct characteristics of the random-forest classifier and regressor methods are explained in the documentation at https://scikit-learn.org/stable/modules/ensemble.html#id6, accessed on 15 February 2020. The random-forest regressor method was then selected because it produced more accurate results. In both cases, the random-forest method allows for decreasing the variance of the forest estimator, though this reduction in variance can result in a slight increase in bias. In order to avoid overfitting, and to ensure the stability and efficiency of the model, parameter estimation using grid search with cross-validation was performed over the following parameters and values: number of trees (number of estimators: 100, 200, 300), minimal number of samples required at a leaf node (2, 3, 5, 7), maximal depth of the tree (7, 9, 11, 13, 15), and minimal cost-complexity pruning (0, 0.0001, 0.001, 0.01, 0.1). As demonstrated in [34,40], the selection of the base temperature for GDD calculation plays an important role in determining model accuracy. A new value was proposed for the base temperature, below the original base temperature of 10 °C, in order to optimize olive phenology prediction.
Indeed, at lower base-temperature values, the RMSE of the model improved. This was tested using temperature data from weather stations and from the ERA5 dataset, that is, with GDD and DOY as input features. To this end, the RMSE of the baseline model was computed for different base-temperature values (from 0 to 14 °C).
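The size of the cross-validated search can be sketched by enumerating the parameter grid above. The dictionary keys follow scikit-learn's RandomForestRegressor naming; in practice the grid would be passed to GridSearchCV rather than enumerated by hand:

```python
from itertools import product

# Parameter grid from the text, keyed with scikit-learn RandomForestRegressor names.
param_grid = {
    "n_estimators": [100, 200, 300],
    "min_samples_leaf": [2, 3, 5, 7],
    "max_depth": [7, 9, 11, 13, 15],
    "ccp_alpha": [0, 0.0001, 0.001, 0.01, 0.1],
}

def grid_candidates(grid):
    """Enumerate every parameter combination tried by an exhaustive grid search."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

candidates = grid_candidates(param_grid)
# 3 * 4 * 5 * 5 = 300 candidate settings; with 5-fold CV, 1500 model fits per experiment.
```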

2.3.1. Feature Engineering

Exhaustive feature analysis was conducted to select the most meaningful feature combinations using the parameters reported in Table 1. The reduction in input variables contributed to simplifying the model complexity [36] and enhanced its physical interpretation [37]. We applied the following methods for feature selection.
Hierarchical clustering based on Spearman rank-order correlations is a useful tool to reduce the number of collinear features without decreasing model accuracy. This technique consists of picking a tolerance threshold over the Spearman rank-order correlation and keeping a single feature from each cluster (see Figure 2, which shows the hierarchical clustering with the corresponding collinear feature relations). The lower the threshold, the smaller the number of input parameters; the technique is therefore employed to reduce the quantity of input parameters as long as model efficiency persists.
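The core of this filter can be sketched in pure Python: compute pairwise Spearman correlations, then greedily keep one representative from each group of strongly correlated features. The greedy pass is a simplified stand-in for full hierarchical clustering (which scipy's `cluster.hierarchy` would provide), and the function names are ours:

```python
def _ranks(values):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def select_representatives(features, threshold):
    """Greedy stand-in for hierarchical clustering: keep a feature only if its
    Spearman correlation with every already-kept feature stays below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(spearman(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Lowering `threshold` merges more collinear features into the same cluster, shrinking the selected set, which mirrors the tolerance sweep described above.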
Recursive feature elimination (RFE) removes less influential features from the original set. For this purpose, each feature is discarded in turn, and the accuracy of the model is computed with the remaining features. The supervised learning estimator is the random-forest regressor method. The feature corresponding to the worst mean RMSE is permanently discarded, and the procedure continues, recursively eliminating the less important features.
Recursive feature addition (RFA), contrary to RFE, recursively adds the most influential feature. The initial feature set is composed of the DOY and GEEcum0 (GDD with a base temperature of 0 °C). The recursive procedure consists of adding one candidate feature at a time to the current set and computing the related RMSE interval for the random-forest regressor learning estimator. This computation is performed for each candidate feature, and the feature corresponding to the lowest mean RMSE is added to the current feature composition. The procedure continues, recursively incorporating the most meaningful features.
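The RFA loop can be sketched generically with a pluggable scoring function. Here `evaluate` stands for the cross-validated mean RMSE of a model trained on the given feature set; in the study this would wrap the random-forest regressor, and the toy scorer used in testing is purely illustrative:

```python
def recursive_feature_addition(candidates, initial, evaluate, max_features=None):
    """Greedy RFA: starting from `initial` (e.g., ['DOY', 'GEEcum0']), repeatedly
    add the candidate feature whose inclusion yields the lowest mean RMSE.

    evaluate: callable mapping a list of feature names to a mean RMSE.
    Returns the features in the order they were added (priority ranking).
    """
    selected = list(initial)
    remaining = [f for f in candidates if f not in selected]
    limit = max_features if max_features is not None else len(candidates)
    while remaining and len(selected) < limit:
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        best_rmse, best_feature = min(scored)
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```

The order in which features enter `selected` is exactly the priority ranking reported for RFA.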
Subgroup comparison and RMSE-based weight importance ranking: Subgroup comparison is a computationally expensive technique. All possible feature combinations are tested to determine the best combination. Completely distinct feature combinations obtain similar accuracy results for the random-forest model, as detailed in Appendix A, Table A1. Consequently, we used the following metric to reduce the number of experiments. As a result, feature priority ordering was introduced.
Description of metrics: Initially, an efficiency threshold was considered. The limit was set to 0.7, the RMSE corresponding to the reference model presented in [34]. The technique consisted of computing the accuracy of the selected machine-learning model for all possible feature combinations; those with an RMSE lower than the efficiency threshold were selected. For each test, the weight of the features was the difference in RMSE with respect to 1 (the lower the RMSE, the higher the weight). We computed the total weight per feature as a sum over the experiments in which it appeared, as in Equation (1), where rmse_ex denotes the RMSE corresponding to experiment (feature combination) ex. In order to normalize these quantities, all weight sums were divided by the maximal weight value.

Weight per feature = Σ_ex (1 − rmse_ex)    (1)
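Equation (1) can be sketched directly. Below, `experiments` maps each tested feature combination to its RMSE, and only combinations below the efficiency threshold contribute to the weights:

```python
def feature_weights(experiments, threshold=0.7):
    """Normalized per-feature weights from subgroup comparison (Equation (1)).

    experiments: list of (feature_tuple, rmse) pairs. Each feature in a
    combination with rmse below the threshold accumulates (1 - rmse); the sums
    are then divided by the maximal weight so the best feature scores 1.0.
    """
    weights = {}
    for features, rmse in experiments:
        if rmse < threshold:
            for f in features:
                weights[f] = weights.get(f, 0.0) + (1.0 - rmse)
    if not weights:
        return {}
    top = max(weights.values())
    return {f: w / top for f, w in weights.items()}
```

Sorting features by these normalized weights yields the priority ordering reported in Table 2.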
Model Comparison: Preliminary experiments were carried out for several regression and classification methods. Among regression methods, we used K neighbors, random forest, Bayesian ridge, and SGD regression. As classifier methods, we applied K neighbors, decision tree, random forest, AdaBoost, gradient boosting, extra tree, Gaussian naive Bayes (NB), linear discriminant analysis, quadratic discriminant analysis, and logistic regression. Models with an RMSE higher than 2 were discarded. Models were selected on the basis of [34,40] and considering [43,44,45]. The novel prediction model is the extra-tree regressor, a technique that improves on random-forest models, as demonstrated in [43]. This model is a metaestimator that fits several randomized decision trees (called extra trees) on several subsamples of the dataset and averages their results, providing more efficient predictions while avoiding overfitting. To compare the models, we considered the most influential input feature sets, i.e., the top 14 and 16 features, ranked according to their relative performance (Table 2). Regardless of the number of considered features, RMSE values were similar. Consequently, reducing the feature sets did not affect the accuracy of model predictions, while saving time and computational resources.

3. Results

3.1. Temperature Accuracy

The RMSE of the baseline model was computed using temperature data from weather stations and from the ERA5 dataset (Figure 3).
Considering both ERA5 and weather-station data, RMSE increased as the base temperature increased. In addition, the difference between the two trends was approximately constant.
The most accurate baseline models were those based on temperature measured at the weather stations. Since we aimed to use climate reanalysis data, the base temperature was fixed at 0 °C for the remaining analysis and computations (see the corresponding feature definition GEEcum0, described in Section 2.2).

3.2. Feature Engineering

Several filter feature-selection techniques [35] were tested to contrast the effect of the features on prediction. This feature analysis was followed by a decisive model comparison. An embedded feature-selection procedure was performed to propose the most efficient and adequate feature combinations.

3.2.1. Hierarchical Clustering

Two different performance levels were obtained. In the first case, illustrated in Figure 4a, the tolerance threshold ranged from 0 to 3.8 with a step size of 0.2, while in Figure 4b, it ranged from 0 to 0.9 with a step size of 0.1.
Model accuracy markedly increased with the tolerance threshold. For the sake of consistency, this exercise was repeated with a smaller interval, which resulted in slightly better results for the tolerance limit fixed at 0.8. This threshold reduced the original feature set, discarding 11 features: NDVI, NDVIcum, EVIcum, sur refl b07, sur refl b02, REDcum, NIRcum, GEE TMIN, GEE TMAX, dewpoint 2 m temperature, and GEEcum0. The resulting random-forest model showed an average RMSE of 0.6603.

3.2.2. Recursive Feature Elimination and Addition

The main contribution of this strategy was the inverse priority ranking of the complete feature set. RMSE statistics for each feature elimination are shown in Figure 5a with the inverse feature-priority ordering. The features discarded in common with hierarchical clustering were REDcum, NDVI, sur refl b02, and dewpoint 2 m temperature. RMSE statistics for each feature addition are illustrated in Figure 5b with the feature-priority ranking.

3.2.3. Subgroup Comparison: RMSE-Based Weight Importance

Table 2 shows the final feature-priority ordering, a ranking of features based on their impact on the model. This ordering based on RFA became efficient when at least 6 features were considered (see Figure 6a). In addition, the most efficient feature combinations are depicted in Figure 6b and detailed in Table A1, showing that multiple compositions provided equally competent models.

3.3. Model Comparison

Comparison of two distinct feature sets, namely the 14 and 16 features with the highest weight percentage (Table 2), was used to select the most compelling model. In general, efficient olive phenology prediction was related to decision-tree-based models (see Figure 7a,b). In addition, extremely randomized trees improved on the predictions of the similar random-forest regressor method, as depicted in Figure 7b. In extremely randomized trees, randomness is exploited even further. As in random forests, a random subset of candidate features is used; however, instead of looking for the most discriminative thresholds, splitting thresholds are drawn at random for each candidate feature, and the best of these randomly generated thresholds is selected as the splitting rule. The randomness of the features (the hybrid dataset contained features of different natures) helped improve the accuracy of the extra-trees regressor model (Table 3).

Extra-Tree Regressor Feature Set and Hyperparameter Tuning

Here, the main goal was to identify the most efficient feature combinations for the extra-tree regressor prediction model of olive phenology. The RFA methodology was applied with two distinct adding criteria, and subgroup comparison was performed. On the basis of the weighted-metric feature ordering described in Table 2, features were recursively added. The model became efficient from the 11th added feature onward, with a mean RMSE of 0.6231 (Figure 8a). The RFA methodology was also applied as in Section 2.3.1 (Figure 8b); in this case, the performance of the model was acceptable with 6 features and a mean RMSE of 0.6214. Nonetheless, the strategy of computing all possible combinations was the most competent, as reported in Figure 8c. As detailed in Table A2, several feature combinations are equally capable, with a mean RMSE below 0.6.

4. Discussion

Different strategies were useful for selecting appropriate features to attain accurate models for olive-tree phenology prediction. Hierarchical clustering and RFE failed at proposing a reduced feature set without losing prediction accuracy, whereas RFA was more effective in achieving this objective (see Figure 5b, Figure 6a and Figure 8a,b). However, a common feature-inclusion priority was not evident. The key point is that the RFA method was not applied identically in all cases. On the one hand, as the main objective of the study was to improve on the results obtained in [34,40], we considered DOY and GDD as mandatory inputs in Figure 5b, Figure 6b, Figure 7b and Figure 8b. There, the final priority feature list was the result of recursively adding the feature corresponding to the most accurate combination. On the other hand, the features recursively added in Figure 6a, Figure 7a and Figure 8a were extracted from a previously imposed feature list. Nevertheless, we observed that a feature-priority list derived from one prediction model was not consistent for other models. The imposed feature list of Figure 6a, Figure 7a and Figure 8a was the outcome of Section 3.2.3, which was devoted to extracting a unique priority ordering from several experiments with the reference random-forest model in [34,40]. In parallel, subgroup comparison extracted equally competent feature combinations for the random-forest model; consequently, this procedure was repeated once the most efficient model, the extra-tree regressor, had been identified.
Therefore, although machine learning improved prediction, a unique priority ordering of appropriate environmental variables was not found. Indeed, separate feature combinations provided equally accurate results, as confirmed by subgroup comparison (see Table A1). Although a ranking based on subgroup analogies was provided, the best strategy to find the most efficient feature combination was the comparison of all possible compositions [46]. Nevertheless, such an automated methodology was an advancement per se for selecting predictor variables at varying hierarchical levels to predict future phenological characteristics [22]. This result points to the benefits of the method when large datasets are explored to identify novel important features, or combinations thereof, that may provide superior outcomes in extracting practical information from model forecasts.
The decision-tree-based random forest and extra tree were the most efficient machine-learning models tested in this study. Classification and regression tasks were compared in both models, and the regression algorithm was found to be the most accurate. Although the phenological phases are categorical features, the categories have an increasing order, and this may explain why regression performed better than classification algorithms did.
The decision-tree-based extra-tree regression model had the most accurate olive-tree phenology prediction, with specific feature combinations (Table A2). When the combinations included slope and latitude as model features (lat in Table A2), geographic information was considered for phenology prediction. On the other hand, for generalization purposes, geographic features such as latitude and longitude could be discarded in order to properly adapt the model to new, previously unconsidered locations. As a matter of fact, though using geographic variables might allow for the model to consider the spatial structure of data, it would be difficult to apply resulting models in areas other than the one tested in this study. The terrain of Tuscany (the region from which data was gathered to feed the model) is rather complex, including coastal plains, mountain ranges, hilly areas, and alluvial valleys. As such, it can be considered representative of different geographic areas.
Surface-reflectance-derived NDVI and EVI were present in numerous combinations and, for the first time, to our knowledge, employed in olive-tree phenology prediction. Forecasting species–environment interactions through a combination of satellite observations and numerical models would help to optimize the monitoring of olive-tree phenology on a large scale. Since phenological events are influenced by drivers varying on different spatial scales (e.g., general circulation, local climate), the use of features generated at different spatial resolutions may allow for including in the model landscape predictions that consider overarching controls and finer-scale projections that ponder local variations.
Considering accumulated precipitation and temperature (cum precipitation and GEEcum0, respectively, in Table A2) proved to be beneficial to prediction accuracy. Indeed, equally efficient models could be generated with inputs of a completely different nature, including features derived from climate data and vegetation indices (ERA5- and MODIS-derived features, respectively). This implies that the unavailability of certain data may not invalidate olive-tree phenology prediction, which has important practical consequences for model implementation.

5. Conclusions and Future Work

The main goal of the DEMETER project (h2020-demeter.eu) is to provide Internet of Things (IoT)-based services aimed at providing practical strategies to optimize agricultural production across different European landscapes. In this sense, machine-learning techniques such as decision-tree models may have the advantage of handling numerous variables and allowing for discarding those features that impede the implementation of local models. This would reduce costs through the whole production chain.
Data availability is a key factor in olive-tree phenology prediction, though the use of weather stations can be burdensome in terms of both instrument logistics and data management. Therefore, cost-effective alternatives for acquiring meteorological data need to be explored. The main results of this research demonstrate that the worsening of accuracy when using temperature-derived indices such as GDD can be compensated for with climate-, geography-, and vegetation-based model inputs. Therefore, a streaming satellite data service is advantageous for predicting and scaling olive-tree phenology in comparison with local data.
Since different feature sets provided equally efficient predictions, exploring larger datasets might reveal further feature combinations, increasing both data-source options and phenology-prediction quality. Further improvements to olive-tree phenology prediction may derive from alternative methods, such as the Markov-chain approach to managing meteorological variables [47,48,49,50,51], and from implementing additional prediction models.
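The Markov-chain approach mentioned above can be sketched, under the assumption of a first-order two-state (wet/dry) daily precipitation-occurrence model in the spirit of [47]. The transition probabilities below are illustrative placeholders, not fitted values.

```python
# Minimal sketch of a first-order wet/dry Markov chain generating a synthetic
# daily precipitation-occurrence series (cf. Weiss [47]).
import random

# P[current][next]: illustrative transition probabilities (rows sum to 1).
P = {"dry": {"dry": 0.80, "wet": 0.20},
     "wet": {"dry": 0.45, "wet": 0.55}}

def simulate(days, start="dry", seed=42):
    rng = random.Random(seed)
    state, series = start, []
    for _ in range(days):
        series.append(state)
        state = "wet" if rng.random() < P[state]["wet"] else "dry"
    return series

series = simulate(365)
print(f"wet-day fraction: {series.count('wet') / len(series):.2f}")
```

Fitted to station or reanalysis records, such a chain could supply synthetic precipitation sequences where observations are missing.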

Author Contributions

Conceptualization and methodology, N.O. and S.M.; software and formal analysis, I.A.; validation and investigation, I.A. and N.O.; writing—original-draft preparation, I.A. and I.G.O.; writing—review and editing, S.M. and M.Q.; supervision, D.G. and M.Q.; project administration and funding acquisition, D.G. and I.G.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the H2020 DEMETER project (grant agreement ID 857202) under H2020-EU.2.1.1.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Acknowledgments

We are grateful to the Phytosanitary Service of Toscana Region for providing access to the data of olive-fruit-tree phenology and daily air-temperature levels.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GDD   Growing degree days
NDVI  Normalized difference vegetation index
EVI   Enhanced vegetation index
RMSE  Root-mean-square error
GEE   Google Earth Engine

Appendix A

The most effective feature combinations are listed in Table A1 and Table A2 for the random-forest and extra-tree regressor models, respectively.
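The rankings in Tables A1 and A2 result from scoring many candidate feature subsets by mean RMSE and sorting them in ascending order. A minimal sketch of such a search follows, with synthetic data and an ordinary-least-squares fit standing in for the tree-ensemble regressors actually used.

```python
# Sketch of the subset search behind Tables A1/A2: enumerate feature
# combinations, score each by hold-out RMSE, and rank ascending.
# Synthetic data; a linear fit stands in for the tree ensembles.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
n = 200
features = {"DOY": rng.uniform(1, 365, n),
            "slope": rng.uniform(0, 30, n),
            "sea pressure": rng.normal(1013, 5, n)}
y = 0.01 * features["DOY"] + rng.normal(0, 0.2, n)  # synthetic phenology phase

train, test = slice(0, 150), slice(150, None)

def rmse(subset):
    X = np.column_stack([features[f] for f in subset])
    Xb = np.column_stack([X, np.ones(len(X))])  # add intercept column
    coef, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
    resid = Xb[test] @ coef - y[test]
    return float(np.sqrt(np.mean(resid ** 2)))

ranking = sorted(((rmse(c), c) for r in (1, 2, 3)
                  for c in combinations(features, r)))
for score, combo in ranking[:3]:
    print(f"{score:.4f}  {combo}")
```

In the study, exhaustive enumeration over a larger feature pool produced the 25 best-scoring combinations listed per model.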
Table A1. Description of most efficient feature combinations for random forest regressor prediction model.
Label | Mean RMSE | Feature List
0 | 0.6209 | 'DOY', 'slope', 'sea pressure', 'lat', 'surface pressure'
1 | 0.6214 | 'DOY', 'slope', 'sea pressure', 'lat'
2 | 0.6244 | 'DOY', 'slope', 'sea pressure', 'lat', 'cum precipitacion', 'surface pressure'
3 | 0.6253 | 'DOY', 'slope', 'sea pressure', 'lat', 'GEEcum0'
4 | 0.6272 | 'DOY', 'slope', 'sea pressure', 'NDVI', 'lat', 'cum precipitacion', 'surface pressure'
5 | 0.6278 | 'DOY', 'slope', 'sea pressure', 'NDVI', 'lat', 'surface pressure'
6 | 0.6281 | 'DOY', 'EVI', 'slope', 'sea pressure', 'lat', 'cum precipitacion', 'surface pressure'
7 | 0.6281 | 'DOY', 'sea pressure', 'lat', 'surface pressure'
8 | 0.6287 | 'DOY', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'surface pressure'
9 | 0.6292 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'surface pressure', 'GEEcum0'
10 | 0.6295 | 'DOY', 'sea pressure', 'lat', 'cum precipitacion', 'surface pressure'
11 | 0.6296 | 'DOY', 'sea pressure', 'lat', 'GEEcum0'
12 | 0.6301 | 'DOY', 'EVI', 'slope', 'sea pressure', 'lat', 'cum precipitacion', 'surface pressure', 'mean 2 m air temperature'
13 | 0.6301 | 'DOY', 'GEE TMIN', 'sea pressure', 'NDVI', 'lat', 'GEEcum0'
14 | 0.6302 | 'DOY', 'EVI', 'slope', 'sea pressure', 'lat', 'mean 2 m air temperature', 'GEEcum0'
15 | 0.6304 | 'DOY', 'EVI', 'slope', 'sea pressure', 'lat', 'cum precipitacion', 'mean 2 m air temperature', 'GEEcum0'
16 | 0.6306 | 'DOY', 'slope', 'sea pressure', 'lat', 'surface pressure', 'mean 2 m air temperature'
17 | 0.6307 | 'DOY', 'sea pressure', 'NDVI', 'lat', 'surface pressure'
18 | 0.6308 | 'DOY', 'slope', 'sea pressure', 'lat', 'cum precipitacion', 'surface pressure', 'mean 2 m air temperature'
19 | 0.6310 | 'DOY', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'mean 2 m air temperature', 'GEEcum0'
20 | 0.6312 | 'DOY', 'slope', 'GEE TMIN', 'sea pressure', 'lat'
21 | 0.6312 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'GEEcum0'
22 | 0.6315 | 'DOY', 'EVI', 'slope', 'sea pressure', 'NDVI', 'lat', 'surface pressure', 'mean 2 m air temperature', 'GEEcum0'
23 | 0.6315 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'surface pressure', 'mean 2 m air temperature', 'GEEcum0'
24 | 0.6316 | 'DOY', 'sea pressure', 'NDVI', 'lat', 'cum precipitacion', 'surface pressure'
Table A2. Description of most efficient feature combinations for extra-tree regressor prediction model.
Label | Mean RMSE | Feature List
0 | 0.5857 | 'DOY', 'slope', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure'
1 | 0.5865 | 'DOY', 'EVI', 'slope', 'sea pressure', 'lat', 'cum precipitation', 'surface pressure', 'mean 2 m air temperature'
2 | 0.5877 | 'DOY', 'EVI', 'slope', 'sea pressure', 'lat', 'cum precipitation', 'surface pressure'
3 | 0.5879 | 'DOY', 'slope', 'sea pressure', 'lat', 'cum precipitation', 'surface pressure'
4 | 0.5885 | 'DOY', 'slope', 'sea pressure', 'lat', 'cum precipitation', 'mean 2 m air temperature', 'GEEcum0'
5 | 0.5887 | 'DOY', 'EVI', 'slope', 'sea pressure', 'lat', 'cum precipitation', 'mean 2 m air temperature', 'GEEcum0'
6 | 0.5897 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'cum precipitation', 'surface pressure'
7 | 0.5904 | 'DOY', 'EVI', 'slope', 'sea pressure', 'lat', 'surface pressure', 'mean 2 m air temperature'
8 | 0.5905 | 'DOY', 'slope', 'sea pressure', 'lat', 'cum precipitation', 'surface pressure', 'mean 2 m air temperature'
9 | 0.5905 | 'DOY', 'EVI', 'slope', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'mean 2 m air temperature', 'GEEcum0'
10 | 0.5908 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'cum precipitation', 'surface pressure', 'GEEcum0'
11 | 0.5911 | 'DOY', 'slope', 'GEE TMIN', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure', 'GEEcum0'
12 | 0.5911 | 'DOY', 'slope', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure', 'GEEcum0'
13 | 0.5915 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'cum precipitation', 'mean 2 m air temperature', 'GEEcum0'
14 | 0.5916 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'cum precipitation', 'GEEcum0'
15 | 0.5916 | 'DOY', 'slope', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure', 'mean 2 m air temperature', 'GEEcum0'
16 | 0.5917 | 'DOY', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'cum precipitation', 'GEEcum0'
17 | 0.5919 | 'DOY', 'EVI', 'slope', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure', 'mean 2 m air temperature', 'GEEcum0'
18 | 0.5924 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure', 'GEEcum0'
19 | 0.5924 | 'DOY', 'EVI', 'slope', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure'
20 | 0.5926 | 'DOY', 'slope', 'GEE TMIN', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure', 'mean 2 m air temperature', 'GEEcum0'
21 | 0.5927 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'lat', 'mean 2 m air temperature', 'GEEcum0'
22 | 0.5927 | 'DOY', 'slope', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure', 'mean 2 m air temperature'
23 | 0.5931 | 'DOY', 'EVI', 'slope', 'GEE TMIN', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'GEEcum0'
24 | 0.5931 | 'DOY', 'EVI', 'slope', 'sea pressure', 'NDVI', 'lat', 'cum precipitation', 'surface pressure', 'mean 2 m air temperature'

References

1. Gauzere, J.; Lucas, C.; Ronce, O.; Davi, H.; Chuine, I. Sensitivity analysis of tree phenology models reveals increasing sensitivity of their predictions to winter chilling temperature and photoperiod with warming climate. Ecol. Model. 2019, 411, 108805.
2. Huang, X.; Liu, J.; Zhu, W.; Atzberger, C.; Liu, Q. The optimal threshold and vegetation index time series for retrieving crop phenology based on a modified dynamic threshold method. Remote Sens. 2019, 11, 2725.
3. Snyder, K.A.; Huntington, J.L.; Wehan, B.L.; Morton, C.G.; Stringham, T.K. Comparison of Landsat and land-based phenology camera normalized difference vegetation index (NDVI) for dominant plant communities in the Great Basin. Sensors 2019, 19, 1139.
4. Fraga, H.; Amraoui, M.; Malheiro, A.C.; Moutinho-Pereira, J.; Eiras-Dias, J.; Silvestre, J.; Santos, J.A. Examining the relationship between the Enhanced Vegetation Index and grapevine phenology. Eur. J. Remote Sens. 2014, 47, 753–771.
5. Wang, C.; Li, J.; Liu, Q.; Zhong, B.; Wu, S.; Xia, C. Analysis of differences in phenology extracted from the enhanced vegetation index and the leaf area index. Sensors 2017, 17, 1982.
6. Garcia-Mozo, H.; Orlandi, F.; Galan, C.; Fornaciari, M.; Romano, B.; Ruiz, L.; de la Guardia, C.D.; Trigo, M.; Chuine, I. Olive flowering phenology variation between different cultivars in Spain and Italy: Modeling analysis. Theor. Appl. Climatol. 2009, 95, 385.
7. Osborne, C.; Chuine, I.; Viner, D.; Woodward, F. Olive phenology as a sensitive indicator of future climatic warming in the Mediterranean. Plant Cell Environ. 2000, 23, 701–710.
8. Htitiou, A.; Boudhar, A.; Lebrini, Y.; Hadria, R.; Lionboui, H.; Elmansouri, L.; Tychon, B.; Benabdelouahab, T. The Performance of Random Forest Classification Based on Phenological Metrics Derived from Sentinel-2 and Landsat 8 to Map Crop Cover in an Irrigated Semi-arid Region. Remote Sens. Earth Syst. Sci. 2019, 2, 208–224.
9. Karkauskaite, P.; Tagesson, T.; Fensholt, R. Evaluation of the plant phenology index (PPI), NDVI and EVI for start-of-season trend analysis of the Northern Hemisphere boreal zone. Remote Sens. 2017, 9, 485.
10. Testa, S.; Soudani, K.; Boschetti, L.; Mondino, E.B. MODIS-derived EVI, NDVI and WDRVI time series to estimate phenological metrics in French deciduous forests. Int. J. Appl. Earth Obs. Geoinf. 2018, 64, 132–144.
11. Chang, Q.; Xiao, X.; Jiao, W.; Wu, X.; Doughty, R.; Wang, J.; Du, L.; Zou, Z.; Qin, Y. Assessing consistency of spring phenology of snow-covered forests as estimated by vegetation indices, gross primary production, and solar-induced chlorophyll fluorescence. Agric. For. Meteorol. 2019, 275, 305–316.
12. Yun, K.; Hsiao, J.; Jung, M.P.; Choi, I.T.; Glenn, D.M.; Shim, K.M.; Kim, S.H. Can a multi-model ensemble improve phenology predictions for climate change studies? Ecol. Model. 2017, 362, 54–64.
13. Basler, D. Evaluating phenological models for the prediction of leaf-out dates in six temperate tree species across central Europe. Agric. For. Meteorol. 2016, 217, 10–21.
14. Orlandi, F.; Garcia-Mozo, H.; Dhiab, A.B.; Galán, C.; Msallem, M.; Romano, B.; Abichou, M.; Dominguez-Vilches, E.; Fornaciari, M. Climatic indices in the interpretation of the phenological phases of the olive in Mediterranean areas during its biological cycle. Clim. Chang. 2013, 116, 263–284.
15. Moriondo, M.; Ferrise, R.; Trombi, G.; Brilli, L.; Dibari, C.; Bindi, M. Modelling olive trees and grapevines in a changing climate. Environ. Model. Softw. 2015, 72, 387–401.
16. Lee, M.A.; Monteiro, A.; Barclay, A.; Marcar, J.; Miteva-Neagu, M.; Parker, J. A framework for predicting soft-fruit yields and phenology using embedded, networked microsensors, coupled weather models and machine-learning techniques. Comput. Electron. Agric. 2020, 168, 105103.
17. Fisher, J.I.; Mustard, J.F. Cross-scalar satellite phenology from ground, Landsat, and MODIS data. Remote Sens. Environ. 2007, 109, 261–273.
18. Peng, D.; Wu, C.; Li, C.; Zhang, X.; Liu, Z.; Ye, H.; Luo, S.; Liu, X.; Hu, Y.; Fang, B. Spring green-up phenology products derived from MODIS NDVI and EVI: Intercomparison, interpretation and validation using National Phenology Network and AmeriFlux observations. Ecol. Indic. 2017, 77, 323–336.
19. Wu, C.; Peng, D.; Soudani, K.; Siebicke, L.; Gough, C.M.; Arain, M.A.; Bohrer, G.; Lafleur, P.M.; Peichl, M.; Gonsamo, A.; et al. Land surface phenology derived from normalized difference vegetation index (NDVI) at global FLUXNET sites. Agric. For. Meteorol. 2017, 233, 171–182.
20. Almeida, J.; dos Santos, J.A.; Alberton, B.; Torres, R.D.S.; Morellato, L.P.C. Applying machine learning based on multiscale classifiers to detect remote phenology patterns in cerrado savanna trees. Ecol. Inform. 2014, 23, 49–61.
21. Van de Pol, M.; Bailey, L.D.; McLean, N.; Rijsdijk, L.; Lawson, C.R.; Brouwer, L. Identifying the best climatic predictors in ecology and evolution. Methods Ecol. Evol. 2016, 7, 1246–1257.
22. Holloway, P.; Kudenko, D.; Bell, J.R. Dynamic selection of environmental variables to improve the prediction of aphid phenology: A machine learning approach. Ecol. Indic. 2018, 88, 512–521.
23. Marra, F.P.; Macaluso, L.; Marino, G.; Caruso, T. Predicting Olive Flowering Phenology with Phenoclimatic Models. Acta Hortic. 2018, 88, 189–194.
24. Alcala, A.; Barranco, D. Prediction of Flowering Time in Olive for the Cordoba Olive Collection. HortScience 1992, 27, 1205–1207.
25. Aguilera, F.; Ruiz-Valenzuela, L. A new aerobiological indicator to optimize the prediction of the olive crop yield in intensive farming areas of southern Spain. Agric. For. Meteorol. 2019, 271, 207–213.
26. Mancuso, S.; Pasquali, G.; Fiorino, P. Phenology modelling and forecasting in olive (Olea europaea L.) using artificial neural networks. Adv. Hort. Sci. 2002, 16, 155–164.
27. Avolio, E.; Pasqualoni, L.; Federico, S.; Fornaciari, M.; Bonofiglio, T.; Orlandi, F.; Bellecci, C.; Romano, B. Correlation between large-scale atmospheric fields and the olive pollen season in Central Italy. Int. J. Biometeorol. 2008, 52, 787.
28. Bonofiglio, T.; Orlandi, F.; Sgromo, C.; Romano, B.; Fornaciari, M. Influence of temperature and rainfall on timing of olive (Olea europaea) flowering in southern Italy. N. Z. J. Crop Hortic. Sci. 2008, 36, 59–69.
29. García-Mozo, H.; Galán, C.; Vázquez, L. The reliability of geostatistic interpolation in olive field floral phenology. Aerobiologia 2006, 22, 95.
30. Aguilera, F.; Valenzuela, L.R. Study of the floral phenology of Olea europaea L. in Jaen province (SE Spain) and its relation with pollen emission. Aerobiologia 2009, 25, 217.
31. Bacelar, E.A.; Moutinho-Pereira, J.M.; Gonçalves, B.C.; Lopes, J.I.; Correia, C.M. Physiological responses of different olive genotypes to drought conditions. Acta Physiol. Plant. 2009, 31, 611–621.
32. Dias, A.B.; Peça, J.; Pinheiro, A. Long-term evaluation of the influence of mechanical pruning on olive growing. Agron. J. 2012, 104, 22–25.
33. Marchi, S.; Guidotti, D.; Ricciolini, M.; Sebastiani, L. Un esempio di supporto on line alle decisioni per gli olivicoltori [An example of online decision support for olive growers]. L'Informatore Agrario 2012, 4, 60–63.
34. Oses, N.; Azpiroz, I.; Quartulli, M.; Olaizola, I.; Marchi, S.; Guidotti, D. Machine Learning for olive phenology prediction and base temperature optimisation. In Proceedings of the 2020 Global Internet of Things Summit (GIoTS), Dublin, Ireland, 3 June 2020; pp. 1–6.
35. Bolón-Canedo, V.; Sánchez-Maroño, N.; Alonso-Betanzos, A. A review of feature selection methods on synthetic data. Knowl. Inf. Syst. 2013, 34, 483–519.
36. Murray, A.B. Reducing model complexity for explanation and prediction. Geomorphology 2007, 90, 178–191.
37. Gerretzen, J.; Szymańska, E.; Bart, J.; Davies, A.N.; van Manen, H.J.; van den Heuvel, E.R.; Jansen, J.J.; Buydens, L.M. Boosting model performance and interpretation by entangling preprocessing selection and variable selection. Anal. Chim. Acta 2016, 938, 44–52.
38. Jolly, W.M.; Running, S.W. Effects of precipitation and soil water potential on drought deciduous phenology in the Kalahari. Glob. Chang. Biol. 2004, 10, 303–308.
39. White, M.A.; Thornton, P.E.; Running, S.W. A continental phenology model for monitoring vegetation responses to interannual climatic variability. Glob. Biogeochem. Cycles 1997, 11, 217–234.
40. Oses, N.; Azpiroz, I.; Marchi, S.; Guidotti, D.; Quartulli, M.; Olaizola, G.I. Analysis of Copernicus' ERA5 Climate Reanalysis Data as a Replacement for Weather Station Temperature Measurements in Machine Learning Models for Olive Phenology Phase Prediction. Sensors 2020, 20, 6381.
41. Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-scale geospatial analysis for everyone. Remote Sens. Environ. 2017, 202, 18–27.
42. Allen, J.C. A modified sine wave method for calculating degree days. Environ. Entomol. 1976, 5, 388–396.
43. Mishra, G.; Sehgal, D.; Valadi, J.K. Quantitative structure activity relationship study of the anti-hepatitis peptides employing random forests and extra-trees regressors. Bioinformation 2017, 13, 60.
44. Chaney, N.W.; Herman, J.D.; Ek, M.B.; Wood, E.F. Deriving global parameter estimates for the Noah land surface model using FLUXNET and machine learning. J. Geophys. Res. Atmos. 2016, 121, 13–218.
45. Góraj, M.; Wróblewski, C.; Ciężkowski, W.; Jóźwiak, J.; Chormański, J. Free water table area monitoring on wetlands using satellite and UAV orthophotomaps-Kampinos National Park case study. Meteorol. Hydrol. Water Manag. Res. Oper. Appl. 2019, 7.
46. Hill, M.; Connolly, P.; Reutemann, P.; Fletcher, D. The use of data mining to assist crop protection decisions on kiwifruit in New Zealand. Comput. Electron. Agric. 2014, 108, 250–257.
47. Weiss, L.L. Sequences of wet or dry days described by a Markov chain probability model. Mon. Weather Rev. 1964, 92, 169–176.
48. Khiatani, D.; Ghose, U. Weather forecasting using hidden Markov model. In Proceedings of the 2017 International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN), Gurgaon, India, 12–14 October 2017; pp. 220–225.
49. Hashemi, F.; Decker, W. Using climatic information and weather forecast for decisions in economizing irrigation water. Agric. Meteorol. 1969, 6, 245–257.
50. Kuswanto, H.; Sari, M.R. Bayesian model averaging with Markov chain Monte Carlo for calibrating temperature forecast from combination of time series models. J. Math. Stat. 2013, 9, 349.
51. Carpinone, A.; Langella, R.; Testa, A.; Giorgio, M. Very short-term probabilistic wind power forecasting based on Markov chain models. In Proceedings of the 2010 IEEE 11th International Conference on Probabilistic Methods Applied to Power Systems, Singapore, 14–17 June 2010; pp. 107–112.
Figure 1. Overall methodology adopted in this study for the machine-learning-based olive-tree phenology prediction model. The flowchart provides a graphical cue for the sequence of steps in the modeling process.
Figure 2. Hierarchical clustering on Spearman rank order and corresponding heat map of correlated features.
Figure 3. Root-mean-square error (RMSE) statistics (minimum, 25% quartile, mean, 75% quartile, and maximum) of the baseline model for GEE vs. weather-station temperature measurements.
Figure 4. Accuracy comparison performed for distinct tolerance ranges.
Figure 5. Accuracy comparison for (a) recursive feature elimination (RFE) and (b) recursive feature addition (RFA) feature-selection strategies.
Figure 6. Olive-phenology predictions based on the random-forest regressor model. (a) Recursive feature addition following the priority ranking depicted in Table 2. (b) Accuracy comparison between the most efficient feature combinations.
Figure 7. Performance comparison of the K-neighbors regressor, random-forest regressor, random-forest classifier, gradient-boosting classifier, extra-tree regressor, extra-tree classifier, decision-tree classifier, and GDD baseline models.
Figure 8. Extra-tree regressor model prediction for distinct feature-selection techniques. (a) Feature addition depending on the weighted-metric ordering detailed in Table 2. (b) Feature ordering resulting from the RFA process for the extra-tree regressor model. (c) Feature subgroup combinations listed in Table A2.
Table 1. Features used as predictors for the phenological phases of olive trees.
Abbreviated Feature Name | Feature Description | Predictor Type | Data Source | Data Resolution
DOY | Day of year | Time | – | –
Mean 2 m air temperature | Average air temperature at 2 m height (daily average) | Meteo | ERA5 | 27–28 km
GEE TMIN (minimal air temp.) | Minimal air temperature at 2 m height (daily minimum) | Meteo | ERA5 | 27–28 km
GEE TMAX (maximal air temp.) | Maximal air temperature at 2 m height (daily maximum) | Meteo | ERA5 | 27–28 km
Dewpoint 2 m temperature | Dewpoint temperature at 2 m height (daily average) | Meteo | ERA5 | 27–28 km
Total precipitation | Total precipitation (daily sums) | Meteo | ERA5 | 27–28 km
Surface pressure | Surface pressure (daily average) | Meteo | ERA5 | 27–28 km
Mean sea-level pressure (sea pressure) | Mean sea-level pressure (daily average) | Meteo | ERA5 | 27–28 km
u component of wind 10 m | Horizontal speed of air moving towards the east, at a height of 10 m above the surface of Earth | Meteo | ERA5 | 27–28 km
v component of wind 10 m | Horizontal speed of air moving towards the north | Meteo | ERA5 | 27–28 km
EVI | Enhanced vegetation index generated from the near-IR, red, and blue bands of each scene | MODIS | MOD09GA 006 EVI | 1 km
NDVI | Normalized difference vegetation index generated from the near-IR and red bands of each scene | MODIS | MOD09GA 006 NDVI | 1 km
RED (sur refl b01) | Red surface reflectance | MODIS | 006 MOD09GQ | 0.25 km
NIR (sur refl b02) | NIR surface reflectance | MODIS | 006 MOD09GQ | 0.25 km
sur refl b03 | Blue surface reflectance, 16-day frequency | MODIS | 006 MOD13Q1 | 0.25 km
sur refl b07 | MIR surface reflectance, 16-day frequency | MODIS | 006 MOD13Q1 | 0.25 km
ViewZenith | View zenith angle, 16-day frequency | MODIS | 006 MOD13Q1 | 0.25 km
SolarZenith | Solar zenith angle, 16-day frequency | MODIS | 006 MOD13Q1 | 0.25 km
RelativeAzimuth | Relative azimuth angle, 16-day frequency | MODIS | 006 MOD13Q1 | 0.25 km
Lat | Latitude | Spatial | – | –
Lon | Longitude | Spatial | – | –
Slope | Landform classes created by combining the ALOS CHILI and ALOS mTPI datasets | Spatial | ALOS Landform | –

Created features:
GEEcumt | Growing degree days from GEE temperature measurements; t is the base temperature used.
Cum precipitation | Precipitation accumulated from 1 January until DOY.
EVIcum | EVI accumulated from 1 January until DOY.
NDVIcum | NDVI accumulated from 1 January until DOY.
REDcum | RED accumulated from 1 January until DOY.
NIRcum | NIR accumulated from 1 January until DOY.
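The created features at the bottom of Table 1 (GEEcumt and cum precipitation) are running accumulations over daily series. A minimal sketch follows, using the simple daily-mean GDD formula for brevity (the study cites Allen's modified sine-wave method [42] for degree-day computation); all daily values are illustrative.

```python
# Sketch of the "created features" in Table 1: GDD accumulated with a base
# temperature t, and precipitation accumulated from 1 January up to each DOY.
def gdd_cumulative(tmin, tmax, t_base=0.0):
    """GEEcumt: running sum of max(0, (Tmin + Tmax)/2 - t_base) per day."""
    total, out = 0.0, []
    for lo, hi in zip(tmin, tmax):
        total += max(0.0, (lo + hi) / 2 - t_base)
        out.append(total)
    return out

def cum_precipitation(daily_precip):
    """cum precipitation: running sum from 1 January up to each DOY."""
    total, out = 0.0, []
    for p in daily_precip:
        total += p
        out.append(total)
    return out

tmin, tmax = [2, 4, 6], [10, 12, 14]       # illustrative daily extremes (degC)
print(gdd_cumulative(tmin, tmax, t_base=5.0))   # [1.0, 4.0, 9.0]
```

The EVIcum, NDVIcum, REDcum, and NIRcum features follow the same running-sum pattern applied to the respective index series.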
Table 2. Feature priority ordering based on weighted importance metric.
Feature | Weight Percentage
EVI | 1.00
slope | 0.97
GEE TMIN | 0.77
Sea level | 0.69
NDVI | 0.68
lat | 0.66
Cum precipitacion | 0.55
NDVIcum | 0.48
Surface pressure | 0.43
Mean 2 m air temp. | 0.40
GEEcum0 | 0.40
EVIcum | 0.38
sur refl b03 | 0.36
dewpoint 2 m temp. | 0.36
REDcum | 0.36
NIRcum | 0.35
u comp. of wind | 0.35
sur refl b07 | 0.35
v comp. of wind | 0.35
lon | 0.34
GEE TMAX | 0.33
NIR | 0.33
ViewZenith | 0.33
RED | 0.33
SolarZenith | 0.32
Total precipitation | 0.32
RelativeAzimuth | 0.30
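The weight percentages in Table 2 are importance scores rescaled so that the leading feature equals 1.00. A sketch of that normalization follows; the raw scores below are hypothetical values chosen only so the output reproduces a few Table 2 entries.

```python
# Sketch of the scaling behind Table 2: raw importance scores divided by the
# maximum, so the leading feature gets weight 1.00. Raw values are hypothetical.
raw = {"EVI": 0.142, "slope": 0.138, "GEE TMIN": 0.109, "RelativeAzimuth": 0.043}
top = max(raw.values())
weights = {feature: round(score / top, 2) for feature, score in raw.items()}
print(weights)   # {'EVI': 1.0, 'slope': 0.97, 'GEE TMIN': 0.77, 'RelativeAzimuth': 0.3}
```

This max-normalization keeps the relative ordering of features while making the weights directly comparable across models.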
Table 3. RMSE statistical description of distinct model predictions.
Model | 14 features (mean / min / 25% / 50% / 75% / max) | 16 features (mean / min / 25% / 50% / 75% / max)
ExtraTreesRegressor | 0.600 / 0.555 / 0.581 / 0.613 / 0.620 / 0.627 | 0.589 / 0.532 / 0.576 / 0.600 / 0.610 / 0.636
RandomForestRegressor | 0.635 / 0.583 / 0.618 / 0.642 / 0.650 / 0.670 | 0.632 / 0.575 / 0.612 / 0.635 / 0.662 / 0.677
RandomForestClassifier | 0.779 / 0.677 / 0.732 / 0.782 / 0.828 / 0.872 | 0.787 / 0.661 / 0.755 / 0.782 / 0.816 / 0.901
ExtraTreesClassifier | 0.793 / 0.745 / 0.757 / 0.783 / 0.825 / 0.870 | 0.757 / 0.664 / 0.723 / 0.753 / 0.803 / 0.844
KNeighborsRegressor | 0.831 / 0.769 / 0.789 / 0.827 / 0.845 / 0.947 | 1.799 / 1.692 / 1.769 / 1.810 / 1.842 / 1.873
DecisionTreeClassifier | 0.941 / 0.812 / 0.898 / 0.936 / 0.961 / 1.107 | 0.949 / 0.795 / 0.914 / 0.932 / 1.004 / 1.102
GradientBoostingClassifier | 1.001 / 0.892 / 0.895 / 0.942 / 1.119 / 1.208 | 0.967 / 0.860 / 0.931 / 0.980 / 0.998 / 1.077
Agricolus | 1.190 / 1.135 / 1.173 / 1.188 / 1.205 / 1.250 | 1.190 / 1.135 / 1.173 / 1.188 / 1.205 / 1.250
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
