**1. Introduction**

Agriculture plays a crucial role in the global economy and, as the world's population continues to grow, the pressure on agricultural production also increases [1]. Historically, the primary method for increasing agricultural production was to expand the cultivated land [2]. This approach prevailed until the early years of the "Green Revolution" (GR), during which cereal production tripled while the area devoted to agriculture increased by just 30% [3]. This improvement was driven by heavy public investments in infrastructure and research, as well as the implementation of agricultural promotion policies. The GR was characterized by the widespread use of mechanization, chemical fertilizers, and pesticides, together with genetic improvements in major crops, aspects that played a significant role in yield increases from the 1990s onward [4]. Nitrogen, a key component of fertilizers, is particularly detrimental to the environment when used in excess [5,6]. To address this issue, the European Union has launched the "Farm to Fork" strategy, which aims to reduce the use of pesticides and fertilizers. As crop nutrient requirements are related to production, reliable yield estimates are essential if fertilizer inputs are to be adjusted and losses reduced [7].

**Citation:** Uribeetxebarria, A.; Castellón, A.; Aizpurua, A. Optimizing Wheat Yield Prediction Integrating Data from Sentinel-1 and Sentinel-2 with CatBoost Algorithm. *Remote Sens.* **2023**, *15*, 1640. https://doi.org/10.3390/rs15061640

Academic Editors: Kenji Omasa, Shan Lu and Jie Wang

Received: 4 February 2023; Revised: 10 March 2023; Accepted: 14 March 2023; Published: 17 March 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Recent studies, such as those conducted by Zambon et al. [8], have demonstrated that technological advances can play a crucial role in achieving sustainable intensification in agriculture. The development of precision agriculture (PA) began in the late 1990s as a strategy for improving the sustainability of agricultural production through the consideration of temporal and spatial variability. The utilization of various sensors, including weather stations [9], multispectral cameras [10], electroconductivity meters [11], and LiDAR [12], is a common practice within the framework of PA. The implementation of PA allows input reduction while maintaining yield levels [13] through the targeted distribution of inputs according to specific crop requirements rather than a uniform application [14]. Despite the availability of PA technologies, adoption among farmers, particularly smallholders, remains low [15]. This can be attributed in part to the economic burden associated with acquiring new technology. Additionally, as technology becomes more sophisticated and data-intensive, farmers may require expert assistance to validate their decisions [16].

Despite the challenges faced by small- and medium-sized farmers in adopting PA techniques, the recent deployment of the Sentinel-2 (S2) satellite constellation by the European Space Agency (ESA) has the potential to enhance their utilization. Specifically, the twin satellites of the S2 series (A and B) were engineered to cater to the requirements of the agricultural sector and researchers [17]. These satellites provide high-resolution images, with 13 multispectral bands and a rapid revisit rate, all of which are available free of charge through ESA's Copernicus program (https://scihub.copernicus.eu/, accessed on 13 March 2023). The different bands of the sensor allow the calculation of various vegetation indices (VIs), which are related to a range of crop parameters, including crop growth [18], crop classification [19], and soil conditions [20]. For example, Vallentin et al. [21] analyzed a 13-year time series to examine the correlation between crop yield and different VIs. A comparison of various satellites led to the conclusion that those of higher resolution, such as RapidEye or S2, performed better than lower resolution satellite imagery.

VIs have been widely used in agriculture to estimate crop yield because stressed and healthy crops reflect energy at different wavelengths. For example, the normalized difference vegetation index (NDVI) is calculated from the reflectance of vegetation in the red and near-infrared bands of the electromagnetic spectrum. Vigorous plants absorb more red light and reflect more near-infrared light, so the NDVI value increases with canopy density and biomass and, consequently, with grain yield. NDVI can therefore be used as an indicator of plant health and biomass production. Although the use of VIs for this purpose dates to the early 1980s [22], it was not until the 1990s that it became more common [23,24]. With the release of images provided by satellites such as S2 [25], Landsat [26], MODIS [27], and SPOT [28], the use of VIs has increased exponentially. Recent studies, such as that proposed by the authors of [29], have utilized VIs derived from S2 in combination with random forest (RF) to estimate yield within individual plots across multiple wheat fields in England. VIs have also been used to estimate yield across entire countries [30]. Incorporating satellite-derived information into agrometeorological models has been shown to improve their accuracy [31,32]. For example, Vicente-Serrano et al. [33] in Spain combined advanced very high resolution radiometer (AVHRR) and NDVI data as well as drought indices at different time scales to predict wheat yield in advance. In other cases, VIs have been used to estimate yield directly [34]. More recently, publications such as [35,36] have taken a step further by combining machine learning techniques with satellite information to estimate the yield of specific plots using data from other plots.
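The NDVI computation described above can be sketched in a few lines of Python. The per-pixel reflectance values below are illustrative, not real data; the band roles (red and near-infrared, e.g. Sentinel-2 B4 and B8) follow the usual convention.

```python
def ndvi(red, nir):
    """Normalized difference vegetation index for one pixel."""
    if red + nir == 0:            # guard against division by zero on empty pixels
        return 0.0
    return (nir - red) / (nir + red)

# A vigorous canopy absorbs red and reflects NIR strongly, so NDVI is high;
# bare soil reflects both bands similarly, so NDVI is near zero.
canopy = ndvi(red=0.05, nir=0.45)   # dense wheat canopy (illustrative values)
soil = ndvi(red=0.20, nir=0.25)     # bare soil (illustrative values)

print(round(canopy, 2))  # → 0.8
print(round(soil, 2))    # → 0.11
```

NDVI is bounded between −1 and 1 by construction, which is one reason it transfers well across sensors and dates.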

However, one major limitation of S2 is cloud cover [37], which can restrict the amount of usable data available for certain areas and applications. Additionally, while S2 images have a high spatial resolution, they may not be sufficient for applications that require very high resolution data, such as vineyard management or early disease detection. Other impediments include misalignment with other remotely sensed data, such as Landsat 8, the lack of panchromatic and thermal bands, and variations in the spatial resolution of the bands [38].

Sentinel-1 (S1) data are also available for free through the Copernicus program. S1 carries a synthetic aperture radar (SAR) designed for radar imaging and can provide data in various modes and polarizations (VV, HH, VH, or HV), depending on the signal emission and acquisition mode. S1 operates in the C band, centered at 5.405 GHz, with a wavelength of 5.6 cm, and measures the backscatter of microwave (C-band) pulses from surface objects. Importantly, radar data are not affected by atmospheric conditions such as clouds and can also be acquired at night. The spatial resolution of S1 is 10 m, similar to the maximum resolution of S2, and it typically has a revisit period of 6 days [39]. However, the interpretation of the S1 signal is complex and requires specialized analysis. For example, for a vegetated surface, the C-band signal is a combination of contributions from the soil, the canopy, volume scattering within the canopy, and interactions between the soil and vegetation [40]. As a result, its use in agriculture is not as widespread as that of S2.
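A routine step when working with S1 backscatter, not specific to this paper, is converting between the linear intensity values delivered after calibration and the decibel (dB) scale in which backscatter is usually reported. A minimal sketch, assuming the standard 10·log₁₀ convention:

```python
import math

def to_db(sigma0_linear):
    """Convert linear backscatter intensity to decibels: 10 * log10(sigma0)."""
    return 10.0 * math.log10(sigma0_linear)

def to_linear(sigma0_db):
    """Inverse conversion, needed before averaging pixels over a plot."""
    return 10.0 ** (sigma0_db / 10.0)

# Backscatter on the order of -20 dB is plausible for VH over a crop canopy
# (illustrative value only).
print(round(to_db(0.01), 1))  # → -20.0
```

A common pitfall this illustrates: plot-level statistics (e.g. mean VH backscatter per field) should be computed in the linear domain and only then converted to dB, since averaging dB values directly biases the result.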

The computational development and utilization of machine learning techniques have become increasingly important in the field of PA [41]. These technologies allow for the processing and analysis of large amounts of data collected from various sources, including satellite imagery, drones, and Internet of Things (IoT) sensors, to generate accurate and detailed predictions [42]. Different types of machine learning algorithms can be employed in this process, including supervised and unsupervised algorithms. Supervised learning algorithms, such as decision trees, RF, and support vector machines (SVMs), can be used to classify different crops, predict crop yields or detect patterns in crop growth [43–45]. Unsupervised learning algorithms, such as *k-means* and principal component analysis (PCA), can be utilized to identify patterns or delineate site-specific management zones (SSMZs) [46].
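As a purely hypothetical illustration of the unsupervised case mentioned above, per-pixel NDVI values could be clustered into site-specific management zones with k-means. The zone count, the NDVI values, and the one-dimensional setting are all assumptions made for brevity; real SSMZ delineation typically uses several layers.

```python
def kmeans_1d(values, centers, iters=20):
    """Minimal 1-D k-means: alternate nearest-center assignment and mean update."""
    labels = []
    for _ in range(iters):
        # assignment step: each value goes to its nearest center
        labels = [min(range(len(centers)), key=lambda k: abs(v - centers[k]))
                  for v in values]
        # update step: each center becomes the mean of its cluster
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers

# Illustrative NDVI pixels: a low-vigor zone (~0.2) and a high-vigor zone (~0.7)
ndvi_pixels = [0.21, 0.25, 0.23, 0.71, 0.68, 0.74, 0.70, 0.22]
labels, centers = kmeans_1d(ndvi_pixels, centers=[0.2, 0.8])
print(labels)  # → [0, 0, 0, 1, 1, 1, 1, 0]
```

The two resulting centers can then serve as zone prototypes, e.g. for differentiated fertilizer rates.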

Over the past few years, a variety of algorithms have been tested to estimate wheat yield. Tang et al. [47] utilized multiple linear regression (MLR) to estimate yield, with root mean squared error (RMSE) values ranging from 0.54 to 1.02. In the same study, a backpropagation neural network (BPNN) was also tested, obtaining better results, with RMSE values ranging from 0.30 to 0.68. Hunt et al. [29] used the RF algorithm to estimate wheat yield in different plots. These results were compared with those obtained from MLR, and the RF algorithm consistently obtained superior results for all the considered scenarios. Support vector machine (SVM) is another commonly used algorithm for this purpose. In the study published by Bebbie et al. [25], the coefficient of determination (R<sup>2</sup>) obtained was always greater than 0.80. Meraj et al. [48] compared the ability of SVM and RF to estimate the area of wheat cultivation over large areas of India, obtaining better results with RF. Deep learning algorithms such as the long short-term memory (LSTM) network have also produced adequate results, with an RMSE of 0.64 t ha<sup>−1</sup> when estimating wheat grain yield [49]. Srivastava et al. [50] compared the performance of eight different algorithms using a 20-year time series and found that the convolutional neural network (CNN) produced the best results. Finally, Cao et al. [51] compared the performance of MLR, SVM, RF, and XGBoost to estimate winter wheat yield in northern China, combining machine learning with a global dynamical atmospheric prediction system.
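The RMSE and R<sup>2</sup> metrics used throughout these comparisons follow standard definitions. A minimal sketch with made-up observed and predicted yields (in t ha<sup>−1</sup>, illustrative only):

```python
import math

def rmse(obs, pred):
    """Root mean squared error: sqrt of the mean squared residual."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r2(obs, pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

obs = [5.2, 6.1, 4.8, 7.0]    # observed yield, t/ha (illustrative)
pred = [5.0, 6.4, 4.9, 6.6]   # model predictions (illustrative)

print(round(rmse(obs, pred), 2))  # → 0.27
print(round(r2(obs, pred), 2))    # → 0.9
```

Note that RMSE carries the units of the target (here t ha<sup>−1</sup>), whereas R<sup>2</sup> is dimensionless, which is why studies usually report both.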

In the late 1990s, a new family of supervised algorithms based on gradient boosting emerged. Gradient boosting is a machine learning technique that aims to enhance the accuracy of predictive models. It operates by repeatedly training a sequence of base models, each one focusing on the examples that previous models handled worst, so that learning concentrates on the most challenging samples. These algorithms combine multiple simple models to create a robust ensemble model. The first of these algorithms to be developed was adaptive boosting (AdaBoost), published by Yoav Freund and Robert Schapire [52]. The gradient boosting machine (GBM), proposed by Jerome Friedman [53], is an extension of AdaBoost but, instead of assigning weights to examples, utilizes gradient descent to optimize the parameters of the base model. GBM is an iterative algorithm that generates a series of decision trees, with each tree intended to correct the errors made by the preceding tree. Another gradient boosting algorithm, extreme gradient boosting (XGBoost), was developed by Tianqi Chen [54] and is optimized for working with large datasets. In 2017, the categorical boosting (CatBoost) algorithm was released by Prokhorenkova et al. [55]; it is optimized to handle categorical variables. CatBoost is currently considered a powerful algorithm and is widely used owing to its ability to process categorical data and its high capacity to generalize. However, its application in agriculture is not yet widespread.
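The core gradient boosting loop for squared-error regression can be sketched in miniature: each stage fits a decision stump (a one-split tree) to the residuals left by the ensemble so far, and its prediction is added with a shrinkage factor. This is purely illustrative; GBM, XGBoost, and CatBoost add many refinements (regularization, second-order gradients, ordered boosting, categorical encodings) that this sketch omits. The data and parameters are made up.

```python
def fit_stump(x, residuals):
    """Best single-split regressor on a 1-D feature (exhaustive search)."""
    best = None
    for split in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= split]
        right = [r for xi, r in zip(x, residuals) if xi > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lmean if xi <= split else rmean)) ** 2
                  for xi, r in zip(x, residuals))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda xi: lmean if xi <= split else rmean

def gradient_boost(x, y, n_stages=20, lr=0.3):
    """Each stage fits a stump to the current residuals (negative gradient
    of squared loss) and adds its shrunken prediction to the ensemble."""
    pred = [sum(y) / len(y)] * len(y)   # stage 0: predict the global mean
    for _ in range(n_stages):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return pred

x = [1.0, 2.0, 3.0, 4.0]        # illustrative single feature
y = [1.1, 1.9, 3.2, 3.9]        # illustrative targets
print([round(p, 1) for p in gradient_boost(x, y)])
```

After enough stages the ensemble reproduces the training targets almost exactly, which also illustrates why shrinkage (`lr`) and a limited number of stages are used in practice to avoid overfitting.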

The challenge of yield estimation in modern agriculture presents numerous opportunities for decision making at both the farmer and institutional levels, including future action planning, the modulation of input supply according to crop needs, and harvest storage. In this regard, it should be noted that several global-scale works use weather data [56] and soil information [57] in their yield estimation models, in addition to satellite and yield information. However, it is difficult to obtain weather and soil information at a sufficient level of detail for yield estimation at the intra-plot level.

Remote sensing technologies also offer new possibilities for improving yield estimation through the use of more advanced algorithms. Taking these considerations into account, the aim of the present study is to conduct a comprehensive analysis of the potential of remote sensing and machine learning techniques for yield estimation. More specifically, the study aims to determine whether the utilization of information obtained from S1 and S2 satellite imagery on different days enhances the accuracy of yield predictions. The study also evaluates the potential benefits of combining S1 and S2 data and, finally, aims to determine the effectiveness of the CatBoost algorithm in comparison to other commonly used methods such as MLR, SVM, and RF.

The analyses are conducted with a practical, agriculture-oriented approach. High-resolution wheat yield data from 39 plots, obtained with a yield monitor during the 2021 season, are used. Additionally, three cloud-free S2 images representing different phenological stages of wheat are analyzed, from which 13 VIs are calculated. A total of three S1 images, acquired on dates close to those of S2, are also examined for their backscattering values in vertical-vertical (VV) and vertical-horizontal (VH) polarizations.

#### **2. Materials and Methods**
