#### *2.3. Wheat Grain Yield Acquisition, Preprocessing, and Connection with Sentinel Data*

Spatially dense wheat grain yield data were obtained by installing a yield monitor and a GPS receiver on a John Deere T560 harvester. The GPS receiver could receive RX corrections, enabling it to be positioned with a spatial error below 15 cm, making it suitable for PA. Yield data were collected between 19 and 25 July 2021. To prepare the yield data for further analysis, they were pre-processed to eliminate anomalous measurements that can greatly affect the results [75]. Firstly, records with incorrect latitude/longitude measurements were removed. Next, data with moisture concentrations below 8%, or values recorded while the harvester was operating at an inadequate speed, were removed to ensure the accuracy of the data. Afterwards, some steps of the methodology described by Taylor et al. [76] were applied. In the first step, yield values that fell outside the established thresholds were eliminated. In the next step, data points more than 2.5 standard deviations above or below the plot mean were removed. In the following step, the local Moran's I test [77] was applied to eliminate spatial outliers, i.e., high yield measures surrounded by low yield measures or vice versa. In addition, to ensure that every S2 pixel was entirely within the study plot, a safety buffer of 15 m was established in each plot to minimize the distortion produced by the edge effect, and pixels located outside the buffer were removed. Data were then interpolated by ordinary kriging to produce a continuous yield map, selecting the semivariogram that best fitted the yield data of each plot. The most frequently used semivariograms were exponential, spherical, or rational quadratic. The maps were resampled to a resolution of 10 × 10 m and aligned with S1 and S2 pixels. 
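The ±2.5 standard deviation filtering step described above can be sketched as follows (a minimal illustration assuming a plot's yield measures are held in a NumPy array; the function name is hypothetical):

```python
import numpy as np

def filter_yield_outliers(yield_values, n_sd=2.5):
    """Keep only yield measures within n_sd standard deviations
    of the plot mean (one of the cleaning steps applied here)."""
    mean = yield_values.mean()
    sd = yield_values.std()
    mask = np.abs(yield_values - mean) <= n_sd * sd
    return yield_values[mask]
```

In practice this filter is applied per plot, after the threshold and moisture screens, so the mean and standard deviation reflect only that plot's data.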
Finally, a grid of points was generated in vector format (ESRI shapefile) and the information from the different rasters was transferred to the vector layer using the 'extract values' function in ArcGIS 10.8. This process resulted in a dataset composed of 6219 yield measures.

#### *2.4. Sentinel-1 Data and Backscatter Calculation*

S1 ground range detected (GRD) images [78] were used in this study. These images are synthetic aperture radar (SAR) data acquired by the S1 satellite with a resolution of 5 × 20 m and a swath width of 250 km. The interferometric wide (IW) acquisition mode was used, providing two polarization types: VV and VH. The images provide backscatter intensity information and are pre-processed to Level 1, consisting of detected, multi-looked amplitude data projected to ground range. Three images (27 March 2021, 20 April 2021, and 7 June 2021), acquired on dates close to the S2 acquisitions, were downloaded from the Copernicus Open Access Hub (https://scihub.copernicus.eu/, accessed on 13 March 2023). To make the acquired images useful, they were processed using the SNAP software following the procedure outlined in Figure 2. This processing included adjusting the image tile size to the study area and applying precise orbits, as the metadata provided with the radar products are not always sufficiently accurate. Precise orbit information was obtained using the 'apply orbit file' function, which becomes available a few days after image capture. Other necessary steps included improving image quality through the removal of thermal noise and radiometric artifacts from image edges, image calibration to obtain radiometrically calibrated backscatter images, and the elimination of granular noise caused by backscatter from certain elements (salt-and-pepper effect), for which the Lee Sigma filter was used. Geographical coordinates were subsequently added to the images and, in the last step, backscatter values were converted to decibels (Figure 2). VV and VH backscatter information was extracted using the same georeferenced grid used to extract the S2 VI data. In total, two variables (VV and VH polarization backscatter) were obtained for each date.
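The final step of the workflow, converting calibrated backscatter intensity to decibels, is a simple logarithmic transform; a minimal sketch, assuming the calibrated values are in linear power units:

```python
import numpy as np

def linear_to_db(sigma0_linear):
    """Convert calibrated backscatter intensity (linear power
    units) to decibels, the last step of the S1 workflow."""
    return 10.0 * np.log10(sigma0_linear)
```

In SNAP this corresponds to the 'linear to/from dB' conversion applied after calibration, speckle filtering, and terrain correction.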

**Figure 2.** Workflow followed for the pre-processing of S1 images to obtain backscatter information.

#### *2.5. Machine Learning Algorithms*

Selecting the appropriate algorithm for a specific task is a crucial step in machine learning. The literature suggests that no single algorithm is the best, and the selection should be based on data characteristics and the desired outcome [79]. Therefore, it is imperative to evaluate the suitability of different algorithms for a given task to obtain optimal results.

In this study, the performance of four supervised machine learning algorithms was evaluated: MLR [80], SVM [81], RF [82], and CatBoost [55]. Their selection was based on their popularity and versatility in the modern agriculture literature [83].

Hyperparameter optimization for the SVM, RF, and CatBoost algorithms was performed using the GridSearchCV method implemented in the Scikit-learn library [84]. The MLR algorithm does not require hyperparameter optimization.
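A grid search of this kind can be sketched as follows. This is illustrative only: the data are synthetic stand-ins for the yield dataset, and the gamma candidates are placeholders, while the C candidates match the values tested for the SVM in this study:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the yield dataset (illustrative only).
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Candidate grid: C values as tested in this study, gamma values assumed.
param_grid = {"C": [1, 10, 100, 1000], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5, scoring="r2")
search.fit(X, y)
best = search.best_params_
```

GridSearchCV exhaustively refits the estimator for every parameter combination under cross-validation and keeps the combination with the best mean score.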

In this study, the MLR algorithm was implemented with Lasso regularization to reduce the complexity of the model and mitigate the effects of collinearity present between some of the variables. Collinearity is a phenomenon where two or more predictors in a multiple regression are highly correlated, which can inflate the variance of the regression coefficients [85]. Lasso addresses this issue by penalizing the sum of the absolute values of the model coefficients.
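The effect of the L1 penalty can be shown on synthetic data (a sketch, not the study's dataset; the regularization strength alpha=1.0 is an assumption, not a value reported in the text):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data with only 3 informative predictors out of 10.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# The L1 penalty limits the sum of absolute coefficient values,
# driving the coefficients of redundant predictors to exactly zero.
model = Lasso(alpha=1.0).fit(X, y)
n_zero = int(np.sum(model.coef_ == 0.0))
```

The zeroed coefficients correspond to predictors the model discards, which is how Lasso simultaneously regularizes and performs variable selection.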

For its part, SVMs can be used for classification and regression tasks. The main idea behind SVMs is to find the optimal boundary or decision surface that separates different classes in a dataset. This is achieved by maximizing the margin, i.e., the distance between the boundary and the closest data points from each class [81]. Additionally, SVMs can handle high-dimensional and non-linearly separable data by utilizing kernel functions to map the input data into a higher-dimensional space where a linear boundary can be found, which enables them to perform well on complex and non-linear datasets. In this study, the kernel parameter was changed from 'linear' to 'rbf' for this purpose. However, it should be noted that SVMs are less resistant to overfitting than other algorithms; overfitting is a prevalent issue in machine learning where a model performs well on the training data but poorly on unseen data, and the margin maximization technique employed by SVMs is susceptible to it [86]. To mitigate this risk, effective optimization of the 'C' hyperparameter is required. A large value of C results in a complex model, which minimizes training errors but increases the likelihood of overfitting; conversely, a small value of C produces a simpler model, which is less susceptible to overfitting but may not fit the training data as effectively [87]. In the present study, several values of the C hyperparameter (1, 10, 100, 1000) were experimentally evaluated to determine the value that strikes a balance between predictive accuracy and model overfitting. Among all the tests carried out, the best results were obtained with C = 10. The gamma parameter was set to 0.1.
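The final SVM configuration reported above can be sketched on synthetic data (the one-dimensional sine dataset is an illustrative stand-in, not the yield data):

```python
import numpy as np
from sklearn.svm import SVR

# Toy non-linear regression problem standing in for the yield data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 200)

# Configuration reported in this study: RBF kernel, C = 10, gamma = 0.1.
svr = SVR(kernel="rbf", C=10, gamma=0.1).fit(X, y)
train_r2 = svr.score(X, y)
```

Raising C above 10 would push the training score higher at the cost of a more complex model, which is the accuracy/overfitting trade-off described above.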

The third algorithm used in the study, RF, is an ensemble algorithm that combines multiple decision trees to make predictions. The principle of RF is to construct a large number of decision trees and combine their predictions through methods such as majority voting or averaging [82]. It works by randomly selecting a subset of features to split the decision trees, an approach that reduces the overfitting and variance issues commonly associated with single decision tree models. Additionally, RF can handle high-dimensional and correlated features, and can be used for both classification and regression tasks [88]. Moreover, the algorithm provides an estimate of feature importance, which can be useful for feature selection and for understanding the underlying relationships in the data. Despite its advantages, RF is sensitive to noise in the dataset and can be computationally expensive for large datasets. The algorithm can also be sensitive to the number of trees used in the ensemble, requiring proper tuning to achieve optimal performance. Therefore, one of the hyperparameters optimized using the GridSearchCV function was the number of trees in the ensemble, with the best results achieved with 300 trees. In addition, the maximum number of features parameter was set to 'auto', which makes all features available at each split. After tests with different maximum tree depths (4, 5, 6, 7, 8, 10, 45, 50), the best result was obtained with a depth of 45.
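The tuned RF configuration can be sketched as follows (synthetic data as a stand-in; note that max_features is left at its default here because the 'auto' option has been removed from recent scikit-learn releases):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the yield dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)

# Tuned values reported in this study: 300 trees, maximum depth of 45.
rf = RandomForestRegressor(n_estimators=300, max_depth=45,
                           random_state=0).fit(X, y)
importances = rf.feature_importances_
```

The `feature_importances_` attribute gives the impurity-based importance estimate mentioned above; the values are normalized so they sum to one.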

CatBoost is a gradient boosting algorithm for decision trees that is specifically designed to handle datasets with many categorical variables [55]. The algorithm uses gradient descent to optimize the parameters of the decision trees, which helps to improve the performance of the model [89]. It works by building and combining multiple decision trees: a subset of the data is used to build each tree, and the predictions of all the trees are combined to make the final prediction. The algorithm also utilizes a technique called 'permutation feature importance' to determine the importance of variables in the model. This technique measures the impact of each feature on the model's performance by randomly shuffling the values of a single feature; the feature with the largest impact on performance is considered the most important [55]. Additionally, CatBoost is able to handle missing values in the data without the need for imputation techniques.
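The idea behind permutation feature importance can be sketched generically; the snippet below uses a scikit-learn regressor rather than CatBoost itself, and the helper name is hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def permutation_drop(model, X, y, feature, rng):
    """Drop in R2 after randomly shuffling a single feature;
    a large drop marks an important variable."""
    X_perm = X.copy()
    rng.shuffle(X_perm[:, feature])
    return r2_score(y, model.predict(X)) - r2_score(y, model.predict(X_perm))

# Toy demonstration: feature 0 drives y, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 5.0 * X[:, 0] + rng.normal(scale=0.1, size=300)
model = RandomForestRegressor(random_state=0).fit(X, y)
drop0 = permutation_drop(model, X, y, 0, rng)
drop1 = permutation_drop(model, X, y, 1, rng)
```

Shuffling the informative feature destroys the model's predictions, while shuffling the noise feature barely changes them, which is exactly the contrast the importance score captures.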

The CatBoost configuration that yielded the best results consisted of 18,000 iterations with an early stopping value of 200, implemented to prevent overfitting. The depth of the trees was set to six, and the 'MultiRMSE' loss function was selected, with a learning rate of 0.015. The 'leaf\_estimation\_iterations' parameter was set to 10. As the dataset was not excessively large, the model was trained on the computer's CPU, although CatBoost can also train on a GPU if needed.

In addition to the supervised algorithms, the present study incorporated the iterative self-organizing data analysis technique (ISODATA), an unsupervised algorithm, for data classification. This iterative algorithm begins by assigning an arbitrary mean to each class and subsequently reassigns pixels to the class whose mean minimizes the Euclidean distance. The process iterates until the maximum number of iterations is reached or the number of pixels changing class falls below a set threshold.
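The iterative core of this procedure can be sketched for one-dimensional data such as yield values; this is a simplified illustration (the full ISODATA algorithm also splits and merges classes), and the function name is hypothetical:

```python
import numpy as np

def isodata_like(values, n_classes=2, max_iter=50):
    """Simplified sketch of the iterative core of ISODATA:
    start from arbitrary class means, assign each sample to the
    nearest mean, update the means, and repeat until the
    assignments stabilise or max_iter is reached."""
    values = np.asarray(values, dtype=float)
    means = np.linspace(values.min(), values.max(), n_classes)
    labels = np.full(len(values), -1)
    for _ in range(max_iter):
        new_labels = np.argmin(np.abs(values[:, None] - means[None, :]), axis=1)
        for k in range(n_classes):
            if np.any(new_labels == k):
                means[k] = values[new_labels == k].mean()
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, means
```

Applied to a yield raster, each pixel value would be classified this way, producing the two-class zoning used later for the prescription maps.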

In this study, a data partitioning strategy was implemented to train and validate the algorithms. The strategy entailed the random selection of 70% of the data for training and 30% for testing, which virtually ensures that the test set contains data from all plots. However, in Section 3.6, the authors deviated from this standard partitioning and adopted an alternative approach: the data from one plot were reserved for testing, while the data from all remaining plots were used for training. The same process was repeated iteratively for every plot. This methodology aimed to evaluate the algorithms' ability to predict the yield of a particular plot without utilizing information from that plot. Algorithms were trained and tested on our datasets using functions provided by the Scikit-learn library. The performance of the regression algorithms was evaluated using R2, RMSE, and the percentage of mean absolute error (%MAE).
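The leave-one-plot-out scheme can be sketched with scikit-learn's `LeaveOneGroupOut` splitter (the data and plot identifiers below are hypothetical placeholders, and the regressor stands in for any of the four algorithms):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical stand-in: three plots of 30 samples each with a
# simple linear relationship between predictors and yield.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(scale=0.1, size=90)
plots = np.repeat(np.arange(3), 30)

# Each plot in turn is held out for testing; the rest train the model.
scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=plots):
    model = RandomForestRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))
```

One score is produced per plot, so the spread of `scores` shows how well the model transfers to a plot it has never seen.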

Obtaining an accurate estimated yield map is the first step towards creating a fertilizer prescription map based on yield data in cases where a yield monitor is not available. With this in mind, in Section 3.6, the G15 plot was selected to demonstrate the possibilities offered by the estimated yield map for creating prescription maps. Since prescription maps usually consist of two or three zones, the unsupervised ISODATA algorithm was selected to divide the datasets into two classes. This procedure was applied to the actual yield data and the estimated yield data. The similarity between the classified estimated yield map and the classified real yield map was measured using the 'accuracy' and Kappa Index (KI) metrics, both widely used to assess the performance of classification algorithms.
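Both agreement metrics are available in scikit-learn; a minimal sketch comparing two hypothetical two-zone classifications (the labels below are illustrative, not the G15 data):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Illustrative two-zone labels: classified real yield map vs.
# classified estimated yield map (0 and 1 are management zones).
real_zones = [0, 0, 1, 1, 1, 0, 1, 0]
estimated_zones = [0, 0, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(real_zones, estimated_zones)
kappa = cohen_kappa_score(real_zones, estimated_zones)
```

Unlike raw accuracy, the Kappa Index discounts the agreement expected by chance, which is why both metrics are usually reported together.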

#### *2.6. Accuracy Assessment*

#### 2.6.1. Root Mean Squared Error (RMSE)

The RMSE is a commonly used statistic that measures the difference between predicted and observed values in a regression problem. It is defined as the square root of the mean of the squared differences between the predicted and observed values; a lower RMSE value indicates a better fit of the model to the data (Equation (1)):

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum\_{i=1}^{n} \left(E\_i - O\_i\right)^2} \tag{1}$$

where *Oi* represents the observed value of sample *i*, *Ei* the corresponding estimated value, and *n* the number of samples.
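Equation (1) translates directly into code; a minimal sketch:

```python
import numpy as np

def rmse(observed, estimated):
    """Root mean squared error, as in Equation (1)."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return np.sqrt(np.mean((estimated - observed) ** 2))
```

Because the errors are squared before averaging, RMSE penalizes large deviations more heavily than the mean absolute error also used in this study.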
