#### *4.3. Algorithm Analysis*

The RF, SVM, and CatBoost algorithms outperformed MLR in all scenarios. The highest RMSE was observed when the model was trained with S1 data, as depicted in Figure 5. The reason is that the relationship between backscatter and yield is not linear, and MLR cannot model non-linear relationships. Although the relationship between VIs and wheat yield is primarily linear, it is non-linear enough for the other algorithms to yield superior results [106]. The capacity to handle non-linear relationships is a key advantage of SVM, RF, and CatBoost, as it enables the analysis of complex multivariate relationships between different types of data, which is not feasible with MLR. The results of RF and SVM are comparable, with the SVM model being slightly superior, in contrast to the results reported by other authors [35,107] for wheat yield prediction. Although RF generally outperforms SVM, in some areas of PA, such as disease detection, SVM has performed better than RF [108]. In this study, however, the best results were achieved with CatBoost, a member of the boosting algorithm family. Algorithms in this family have produced inconsistent outcomes in PA. For example, Bebie et al. [25] reported the worst results with the boosting regression (BR) algorithm, while Heremans et al. [108] obtained the best results with the same algorithm. CatBoost, like eXtreme Gradient Boosting (XGBoost), is a gradient boosting algorithm that belongs to the next generation of boosting algorithms, and XGBoost has been used successfully in PA to predict monthly NDVI evolution [109]. However, this group of algorithms is not as prevalent in PA as RF or SVM. For example, a search of the Scopus database returned only seven articles that employ CatBoost in any field of PA. In contrast, it is widely used in other areas such as industry, finance, healthcare, and online advertising.
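The effect of non-linearity on a linear model can be illustrated with a minimal sketch. It fits a straight line and a piecewise-constant model (the kind of function a regression tree represents) to the same saturating curve and compares their RMSE. The data are synthetic and purely illustrative, not the S1 or VI data of this study; the curve shape, the number of bins, and all function names are assumptions, and the errors are computed in-sample.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_line(x, y):
    """Ordinary least squares for a single predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_piecewise_constant(x, y, n_bins):
    """Predict the mean response within equal-width bins of the predictor."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins

    def bin_of(xi):
        # Clamp so that the maximum value falls into the last bin.
        return min(int((xi - lo) / width), n_bins - 1)

    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for xi, yi in zip(x, y):
        sums[bin_of(xi)] += yi
        counts[bin_of(xi)] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda xi: means[bin_of(xi)]

# Synthetic, saturating response (hypothetical): fast rise, then a plateau.
x = [i / 10 for i in range(101)]
y = [8.0 * (1.0 - math.exp(-0.6 * xi)) for xi in x]

slope, intercept = fit_line(x, y)
rmse_linear = rmse(y, [slope * xi + intercept for xi in x])

piecewise = fit_piecewise_constant(x, y, n_bins=10)
rmse_piecewise = rmse(y, [piecewise(xi) for xi in x])

print(f"RMSE linear: {rmse_linear:.3f}, RMSE piecewise: {rmse_piecewise:.3f}")
```

On a curve of this shape the straight line cannot follow both the steep rise and the plateau, so its error stays higher than that of the piecewise model.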

Although this capability was not used here because all the variables are quantitative, one of the main advantages of CatBoost over other algorithms is that it handles categorical variables automatically, without additional pre-processing such as one-hot encoding, which considerably reduces the dimensions of the input matrix. Moreover, CatBoost is engineered to handle large datasets, as it supports training on graphics processing units (GPUs), significantly decreasing computation time. In terms of performance, CatBoost has shown strong predictive performance and generalization ability, outperforming other algorithms such as RF and the generalized regression neural network (GRNN) algorithm [110]. Additionally, CatBoost has built-in mechanisms for handling missing values and for limiting overfitting, which can be a problem with other algorithms such as deep neural networks (DNNs) [111]. Finally, CatBoost includes a built-in feature importance mechanism that allows users to understand the importance of each feature in the dataset.
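The dimensionality argument can be made concrete with a short sketch of how one-hot encoding widens the input matrix. The records, column meanings, and the helper `one_hot_encode` are hypothetical and serve only as an illustration; the code does not call CatBoost, where categorical columns are typically declared through the `cat_features` argument instead of being expanded beforehand.

```python
def one_hot_encode(rows, cat_columns):
    """Expand each categorical column into one 0/1 column per distinct level."""
    levels = {c: sorted({row[c] for row in rows}) for c in cat_columns}
    encoded = []
    for row in rows:
        out = []
        for c, value in enumerate(row):
            if c in levels:
                out.extend(1 if value == level else 0 for level in levels[c])
            else:
                out.append(value)
        encoded.append(out)
    return encoded

# Hypothetical field records: (NDVI, soil type, wheat cultivar)
rows = [
    (0.71, "clay", "cultivar_a"),
    (0.64, "loam", "cultivar_b"),
    (0.58, "sand", "cultivar_c"),
    (0.69, "loam", "cultivar_a"),
]

encoded = one_hot_encode(rows, cat_columns=[1, 2])
original_width = len(rows[0])
encoded_width = len(encoded[0])
print(f"columns before: {original_width}, after one-hot encoding: {encoded_width}")
```

Each categorical column contributes as many columns as it has distinct levels, so the widening grows with the number of categories rather than with the number of variables.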

#### *4.4. CatBoost Algorithm as a Tool for Processing Heterogeneous Data in Precision Agriculture*

Use of the CatBoost algorithm in PA can provide significant advantages in terms of scaling up results. This algorithm is based on gradient boosting and is specifically designed to handle both numerical and categorical variables. This characteristic makes it suitable for PA, where large amounts of heterogeneous data are generated.

Compared to traditional machine learning algorithms such as RF, CatBoost has demonstrated improved performance in terms of accuracy and speed. The algorithm uses decision trees as weak learners and combines them iteratively to build a strong prediction model. The resulting model can generalize well to new data and handle large amounts of data more efficiently than traditional algorithms.
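The iterative combination of weak learners can be illustrated with a minimal sketch of gradient boosting for squared error, using one-split trees (stumps) on a single feature. This is a generic illustration of the boosting principle and not CatBoost's implementation, which uses more elaborate tree structures and training schemes. The synthetic data, the function names, the number of rounds, and the learning rate are all assumptions.

```python
import math

def fit_stump(x, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def fit_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean and repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, learning_rate, stumps

def predict_boosting(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic, non-linear response (hypothetical).
x = [i / 4 for i in range(41)]
y = [8.0 * (1.0 - math.exp(-0.6 * xi)) for xi in x]

model = fit_boosting(x, y)
single_round = fit_boosting(x, y, n_rounds=1)

mse_baseline = mse(y, [sum(y) / len(y)] * len(y))
mse_single_round = mse(y, [predict_boosting(single_round, xi) for xi in x])
mse_boosted = mse(y, [predict_boosting(model, xi) for xi in x])
print(f"MSE mean only: {mse_baseline:.3f}, 1 round: {mse_single_round:.3f}, "
      f"50 rounds: {mse_boosted:.3f}")
```

Each stump on its own is a poor predictor; the ensemble improves because every round corrects part of the error left by the previous rounds. The errors here are training errors, so the sketch says nothing about generalization.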

In PA, the use of remote sensing data is increasingly common. This technology allows the acquisition of information on the physical, chemical, and biological characteristics of crops. Integration of the CatBoost algorithm with remote sensing data can provide valuable insights into crop growth. Another advantage of CatBoost is its ability to operate effectively even when records are missing from a database. This is a common challenge when using information from multiple sensors, as individual sensors can fail at any point in time. Applying imputation techniques in such situations is not ideal, as it adds estimated information that does not enhance the model. Furthermore, CatBoost does not require the input data to be scaled, which reduces the time and effort spent on data preprocessing.
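Both properties follow from the way tree-based models split on thresholds, which a small sketch can illustrate. It assumes one possible strategy for missing records, treating them as smaller than any observed value (CatBoost's documentation describes a `nan_mode` option of this kind), and checks that a min-max rescaling of the feature leaves the chosen partition unchanged. The values, targets, and function names are hypothetical, and the code does not call CatBoost.

```python
def split_sides(values, threshold):
    """Assign each sample to a side; a missing value (None) always goes left,
    i.e. it is treated as smaller than any observed value."""
    return ["L" if v is None or v <= threshold else "R" for v in values]

def sse(group):
    """Sum of squared deviations from the group mean."""
    if not group:
        return 0.0
    mean = sum(group) / len(group)
    return sum((g - mean) ** 2 for g in group)

def best_split(values, targets):
    """Return the threshold and partition that minimize squared error."""
    observed = sorted({v for v in values if v is not None})
    best = None
    for lo, hi in zip(observed, observed[1:]):
        threshold = (lo + hi) / 2
        sides = split_sides(values, threshold)
        left = [t for s, t in zip(sides, targets) if s == "L"]
        right = [t for s, t in zip(sides, targets) if s == "R"]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, threshold, sides)
    return best[1], best[2]

def min_max_scale(values):
    """Rescale observed values to [0, 1], leaving missing records untouched."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    return [None if v is None else (v - lo) / (hi - lo) for v in values]

# Hypothetical sensor feature with one failed reading, and hypothetical yields.
feature = [-12.0, None, -9.5, -15.2, -8.1, -13.4]
yields = [4.1, 3.9, 5.6, 3.2, 6.0, 3.8]

raw_threshold, raw_sides = best_split(feature, yields)
scaled_threshold, scaled_sides = best_split(min_max_scale(feature), yields)

print("raw:   ", raw_threshold, raw_sides)
print("scaled:", scaled_threshold, scaled_sides)
```

The threshold changes numerically after rescaling, but the samples end up on the same sides, so the fitted tree is equivalent; the record with the missing reading is kept and routed rather than dropped or imputed.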
