2.4.1. Model Development

Soil biomass productivity assessment is the process of establishing relationships between soil properties and yields. Data-mining methods are tools for revealing hidden relationships in datasets structured by input variables. In soil assessment, data mining can help to identify the most important factors in yield formation and to establish the weights of these factors. For our analysis we chose the Random Forest technique [27,28], which uses the Classification and Regression Trees method as a basis for growing multiple classification trees. For this operation, the database is divided into a series of training and test datasets to establish and validate relationships, respectively. Each training dataset (80% of the dataset) is a randomly selected subset that is used to develop a tree model using randomly selected predictors. The test data (10% of the dataset), drawn from the data left after the random selection of the training subset, are used to validate the developed model [47]. We used the createDataPartition function from the caret package to select data randomly. The generalization error of the forest depends on two parameters: how accurate each individual classifier is and how independent the different classifiers are from each other (i.e., the strength of each tree in the forest and the correlation between them). The Random Forest analysis was performed with the ranger R package [48]. The long-term mean normalized productivity index (MNPI), which takes into account both the measured AIIR data and the GPP data, was computed as the average of the two normalized datasets. The Random Forest operation was performed with the MNPI as the dependent variable and the environmental (soil, climate) variables as explanatory variables (Figure 2). First, the assessment was carried out separately for winter wheat, maize and sunflowers in order to evaluate the crop-specific productivity of Hungarian croplands using the MNPI data of these crops. 
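The MNPI computation described above can be illustrated with a minimal sketch. It assumes that both input series are already normalized to a common scale and aligned parcel by parcel; the function name, variable names and sample values are hypothetical and are not taken from the study, whose analysis was carried out in R.

```python
def mean_normalized_productivity(aiir_norm, gpp_norm):
    """Average two already-normalized productivity series element by element.

    Assumption: both lists refer to the same parcels in the same order and
    share a common normalized scale.
    """
    if len(aiir_norm) != len(gpp_norm):
        raise ValueError("both series must cover the same parcels")
    return [(a + g) / 2.0 for a, g in zip(aiir_norm, gpp_norm)]


# Hypothetical normalized values for three parcels (illustration only)
aiir_norm = [62.0, 48.0, 90.0]
gpp_norm = [58.0, 52.0, 80.0]
mnpi = mean_normalized_productivity(aiir_norm, gpp_norm)
```

In this sketch each parcel receives the simple arithmetic mean of its two normalized values, so both data sources contribute equally to the index.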
As a result, crop-specific productivity indices were produced for the three main crops. As our overall interest was to establish the MNPI for each Hungarian parcel at 100 m resolution, three parallel models were developed for the three major crops (wheat, maize, sunflowers) based on the crop-specific entries of the normalized yield data, and a fourth, general productivity model was developed based on the MNPI. Consequently, both crop-specific (weighted means, wheat 40%, maize 40% and sunflowers 20%) and general productivity indices were assigned to climate and soil property combinations. Due to the limited information for some minor soil types (i.e., those occupying < 0.5% of agricultural land), statistical testing could not be performed successfully for these soils. To evaluate the productivity of these soils, two approaches were applied and their results were combined. Firstly, an expert-based judgement was carried out. Productivity indices were established considering those of closely related soils in the Hungarian soil taxonomy, using information from previous land evaluation systems [49], related literature [50–54] and expert knowledge. Secondly, a statistical test based solely on the GPP data was carried out to evaluate the effect of soil properties and climate; it did not yield statistically significant results and was used for orientation purposes only. The relative importance of the explanatory variables was also calculated. We analyzed the importance of all variables using the imp function of the bclust package in R [55]. Relative importance was calculated by dividing the importance score of each variable by the largest importance score among all variables, and then multiplying by 100. Harmonizing the results of the two approaches ensured consistency across the system, even for parcels with soils that make up a small proportion of the country's croplands. 
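The two arithmetic steps in this paragraph can be sketched as follows. The relative-importance rescaling follows the description in the text directly. The weighted crop mean is one possible reading of the 40/40/20 weighting and is an assumption, as are all function names, variable names and sample values.

```python
def relative_importance(scores):
    """Rescale importance scores so that the largest one equals 100."""
    largest = max(scores.values())
    if largest <= 0:
        raise ValueError("largest importance score must be positive")
    return {name: value / largest * 100.0 for name, value in scores.items()}


# Weights quoted in the text; how they enter the index is assumed here.
CROP_WEIGHTS = {"wheat": 0.4, "maize": 0.4, "sunflower": 0.2}


def weighted_crop_index(crop_indices, weights=CROP_WEIGHTS):
    """Weighted mean of crop-specific productivity indices (assumed reading)."""
    return sum(weights[crop] * crop_indices[crop] for crop in weights)


# Hypothetical importance scores for three explanatory variables
raw_scores = {"soil_type": 0.5, "precipitation": 0.25, "texture": 0.125}
relative = relative_importance(raw_scores)

# Hypothetical crop-specific indices for one soil and climate combination
combined = weighted_crop_index({"wheat": 60.0, "maize": 50.0, "sunflower": 70.0})
```

Scaling by the largest score makes the most influential variable the reference point at 100, so the remaining values read as percentages of its importance.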
The theoretical range of the final productivity indices was set between 1 and 100, corresponding to the normalized yield values of the test dataset and following the indexing approach of the traditional soil productivity evaluation of Hungary [51]. Model validation was performed using normalized yield data as independent variables of the test dataset. The test dataset included a randomly selected 10% of the data, and the predict function of the ranger package was used. We calculated the correlation coefficient to show the relationship between the observed and the predicted values, the mean absolute error (MAE) to show the distance of the predicted values from the observed values [56], and the mean absolute percentage error (MAPE) to show the percentage of error between observed and predicted values [57].
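The three validation measures named above can be written out as a minimal sketch. The text does not state which correlation coefficient was used, so Pearson's coefficient is assumed here; the function names and the sample values are hypothetical.

```python
import math


def mean_absolute_error(observed, predicted):
    """Mean of the absolute differences between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_absolute_percentage_error(observed, predicted):
    """Mean absolute error relative to the observed values, in percent.

    Observed values must be non-zero, which holds for indices between 1 and 100.
    """
    ratios = [abs((o - p) / o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def pearson_correlation(observed, predicted):
    """Pearson correlation coefficient (assumed; the text says only 'correlation')."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Hypothetical observed and predicted index values for three test parcels
observed = [50.0, 60.0, 80.0]
predicted = [55.0, 57.0, 76.0]
mae = mean_absolute_error(observed, predicted)
mape = mean_absolute_percentage_error(observed, predicted)
r = pearson_correlation(observed, predicted)
```

MAE is expressed in index units, whereas MAPE is scale-free, so the two measures complement each other when comparing models for different crops.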

**Figure 2.** Flowchart of land evaluation modeling process.
