*Article* **Prediction of Soil Organic Carbon at Field Scale by Regression Kriging and Multivariate Adaptive Regression Splines Using Geophysical Covariates**

**Daniela De Benedetto 1, Emanuele Barca 2,\*, Mirko Castellini 1, Stefano Popolizio 3, Giovanni Lacolla <sup>4</sup> and Anna Maria Stellacci <sup>3</sup>**


**Abstract:** Knowledge of the spatial distribution of soil organic carbon (SOC) is of crucial importance for improving crop productivity and assessing the effect of agronomic management strategies on crop response and soil quality. Incorporating secondary variables correlated to SOC allows using information often available at finer spatial resolution, such as proximal and remote sensing data, and improving prediction accuracy. In this study, two nonstationary interpolation methods were used to predict SOC, namely, regression kriging (RK) and multivariate adaptive regression splines (MARS), using as secondary variables electromagnetic induction (EMI) and ground-penetrating radar (GPR) data. Two GPR covariates, representing two soil layers at different depths, and X geographical coordinates were selected by both methods with similar variable importance. Unlike the linear model of RK, the MARS model also selected one EMI covariate. This result can be attributed to the intrinsic capability of MARS to intercept the interactions among variables and highlight nonlinear features underlying the data. The results indicated a larger contribution of GPR than of EMI data due to the different resolution of EMI from that of GPR. Thus, MARS coupled with geophysical data is recommended for prediction of SOC, pointing out the need to improve soil management to guarantee agricultural land sustainability.

**Keywords:** SOC spatial distribution; regression kriging (RK); multivariate adaptive regression splines (MARS); secondary variables; electromagnetic induction technique (EMI); ground-penetrating radar (GPR)

#### **1. Introduction**

Soil organic carbon (SOC) is one of the most important indicators for assessing soil quality and overall soil health [1]. SOC plays a key role in unveiling soil structure development, nutrient turnover and stability, soil water retention, regulation of greenhouse gases, and susceptibility or resilience to land degradation [2]. SOC stock is thus a main factor in soil health, fertility, quality, and productivity [3] and supports important soil-derived ecosystem services (ESs) including water filtration and erosion control, soil strength and stability, nutrient conservation, and climate change adaptation and mitigation by sequestration of atmospheric CO2 [4]. By selecting key soil indicators under different land use and management practices, Shukla et al. [5] concluded that SOC was the main soil quality indicator and suggested using SOC to monitor soil quality changes [6].

**Citation:** De Benedetto, D.; Barca, E.; Castellini, M.; Popolizio, S.; Lacolla, G.; Stellacci, A.M. Prediction of Soil Organic Carbon at Field Scale by Regression Kriging and Multivariate Adaptive Regression Splines Using Geophysical Covariates. *Land* **2022**, *11*, 381. https://doi.org/10.3390/ land11030381

Academic Editor: Manuel López-Vicente

Received: 18 January 2022 Accepted: 2 March 2022 Published: 4 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

SOC distribution is influenced by many factors, including climate variables (temperature and rainfall), topographical features, soil texture, parent material, vegetation, land-use types, and human management at different spatial scales [7].

Agronomic management strategies, with particular regard to fertilization, soil tillage, and irrigation, may significantly modify SOC content and its labile fractions, mainly in the shallower soil layers [6,8–10]. Because of the interaction of the factors described, SOC spatial variation is often wide and complex, and the knowledge of its spatial distribution is the key information in agricultural productivity to improve food security, enhance crop production [11], and predict the effects of different agronomic management strategies. Among these strategies, irrigation with treated municipal wastewater can be considered important for saving limited freshwater resources and protecting the environment, but its effects should be monitored to avoid soil fertility decline in the medium to long term [9].

Conventional laboratory methods for quantifying this soil variable are destructive, time consuming, expensive, and hazardous for the environment. In addition, because of the associated costs, soil is sampled at relatively few spatial locations, which are often irregularly distributed over the study area. The small sample size does not allow meeting the criteria for soil quality assessment for precision farming or for using statistical methods taking into account residual autocorrelation [12]. Making a short review, a number of samples ranging from 50 [13] to 100 [14] is considered well suited for an accurate spatial analysis.

A strategy to enhance the quality of the estimation of SOC content and to reduce the spatial sampling intensity consists of incorporating secondary information correlated to the primary variable [15,16]. This multivariate approach allows utilization of secondary information, such as that derived from proximally and remotely sensed data, that is often much more abundant than information deriving from the primary target variable [17,18].

Proximal sensing data could provide strong support for characterizing the spatial variability at the field or even regional scale. These data are very attractive because of their high resolution, their noninvasive nature, the relatively low cost of data acquisition, the possibility for a mobile survey configuration, and their three-dimensional (3D) information, although their outcome is not a direct measurement of soil properties [19].

Among the geophysical methods, electromagnetic induction (EMI) and groundpenetrating radar (GPR) have been widely applied. EMI methods measure apparent electrical conductivity (ECa), an integrated value of soil physical, chemical, and biological properties [20] that can capture soil spatial variability and characterize soil organic carbon distribution [21,22]. However, since soil properties vary in both the horizontal and vertical domains, soil needs to be described in three dimensions, and EMI sensors may have limitations when highly contrasting horizons are present [23]. Ground-penetrating radar (GPR) technology allows overcoming this limitation by measuring large volumes of soil (about cubic decimetres to cubic meters). Thus, GPR is suggested for field-scale determinations rather than for pointwise measurements, provides higher resolution of subsurface features, and is particularly suited to visualizing soil in two or three dimensions [24]. One of the most useful presentations of GPR data is to display horizontal maps of recorded reflection amplitudes, called "time slice" (or depth slice) maps [25]. There have been several studies involving GPR to determine thickness and characterize depths of organic soil materials [26,27], but few studies have been devoted so far to the potentiality of GPR to study the spatial variation of soil organic carbon.

The use of geophysical proximal sensor data as auxiliary information to effectively support an irregularly sampled target variable is not free from practical difficulties and experimental limits. This is because proximal sensing data are often massive, need to be collected on different spatial and temporal scales, and use different measurement supports. Several statistical methods are able to incorporate secondary information; for example, a multivariate extension of kriging, known as cokriging, is used for improving the prediction of a primary variable by using secondary information [28,29]. This technique assumes intrinsic stationarity, both of the target variables and of more intensively measured secondary variables, supposing a strong correlation between primary and secondary information [30]. These conditions are not always verified. Another way of taking into account the secondary variable is by checking for a spatial trend in the primary variable with respect to the secondary variable(s) and combining the deterministic part and the stochastic component, as in "hybrid methods" [29], or by adopting complex multivariate nonlinear approaches. In recent times, a number of hybrid interpolation techniques, which combine kriging with methods that use auxiliary information (covariates), have been developed and applied. Several authors have compared some of the techniques to incorporate trends and account for nonstationarity [31,32]. Two possible methods of nonstationary interpolation are regression kriging (RK) [28,33] and multivariate adaptive regression splines (MARS) [34]. In many cases, these techniques have been proven superior to common geostatistical methods, yielding more detailed results and higher accuracy of prediction, because they take advantage of being linear hybrid (RK) or nonlinear (MARS) [35]. MARS is a nonparametric predictive method that intrinsically models nonlinearities and interactions between variables, suitably managing local nonstationarity [34]. This method has been successfully applied in various fields, such as estimating the collapse potential for compacted soil, underground gas storage in bedded salt formations, and lateral spreading induced by earthquakes [36,37].

The regression kriging (RK) method is of straightforward use and often performs better than cokriging [38–40].

In this study, we compared the performance of RK and MARS to achieve the following objectives: (i) to prove that there are preferential nonlinear relationships between SOC and geophysical measurements, and (ii) to compare the performance of two nonstationary interpolation methods to effectively model SOC at the field scale. Machine learning techniques may open new perspectives to modelling SOC spatial distribution at the field and regional scales. The study was performed on a dataset deriving from a field experiment in which water of different qualities was used for irrigation.

To the best of authors' knowledge, no comparison between these methods has been presented before; therefore, it can be considered a novelty.

#### **2. Materials and Methods**

#### *2.1. Study Area*

Soil data were derived from a field experiment carried out in an olive grove located in Fasano (Apulia region, Southern Italy). The climate of the study area is "accentuated thermo-Mediterranean", as classified by UNESCO FAO [41,42], characterized by rather mild and rainy winters and warm and dry summer months. The soil of the experimental site is classified as loam (USDA classification), with an average content of silt, clay, and sand fractions of 35.28%, 21.74%, and 42.98%, respectively.

Olive trees were irrigated with treated municipal wastewater (TWW), and the following treatments were applied: irrigation with fresh water and full fertilization supply (FW); irrigation with TWW and full fertilization supply (R1); and irrigation with TWW and fertilizer supply reduced by the amount provided by TWW (R2) [10]. Treatments were arranged in a randomized complete block design (RCBD) with four replicates (Figure 1). Unit plot size was 108 m2, with 3 plants per plot and a plant spacing of 6 m × 6 m; field size was 1296 m2 (whole experimental area was 1728 m2).

#### *2.2. Soil Sampling and Soil Analysis*

Soil samples with absolute coordinates were collected on a regular grid (April 2017) at 6 locations (subreplicates) per plot at a 0–0.20 m depth for a total of 72 observations (Figure 1); only 71 were used in this study. Soil organic carbon (SOC) was quantified on air-dried and sieved samples through dry combustion [43]. Further details about the experimental trial were reported by Barca et al. [44] and Stellacci et al. [10].

**Figure 1.** Location of the field experiment (Google Earth Pro, 2021) the soil sampling locations (black dots), and the electromagnetic induction (EMI) and ground-penetrating radar (GPR) acquisitions along transects (red lines).

#### *2.3. Acquisition and Preprocessing of Auxiliary Information*

A geophysical survey was carried out using an EMI sensor (EM38DD, Geonics Limited, Mississauga, ON, Canada) and a Georadar (RIS 2k-MF Multifrequency Array Radar-System, manufactured by IDS SpA, Italy) connected to the DGPS along 6 parallel transects by sliding the sensors on the surface (Figure 1) on the same day as soil sampling.

EMI soil survey is based on the principle that a transmitter coil in contact with the soil surface produces a time-varying primary magnetic field in the subsoil. The eddy currents induced in the soil generate a secondary magnetic field, which is recorded by a receiver coil in the EM unit. The apparent conductivity near the receiver is determined by the ratio of the magnitude of the secondary magnetic field to that of the primary magnetic field [22]. The EMI sensor used herein consisted of two perpendicularly superposed EM38 sensors that simultaneously measured apparent electrical conductivity (ECa, expressed in mSm<sup>−</sup>1) near the soil surface (0–0.75 m depth) with the horizontal mode (ECa-H) and up to 1.5 m depth with the vertical mode (ECa-V) [22]. Before operation, the instrument was set to zero at a height of 1.5 m, according to the manufacturer's instructions, and at the end of the survey, the zeroing was checked to detect possible drift. The survey was performed using a nonmetallic platform with wood cover, and the sensor was towed behind a tractor/The ECa was recorded every second, with spatial resolution of 0.5 m, on average, along each transect.

Immediately after the EMI survey, the GPR survey was carried out by sliding the sensor along the surface. GPR data were collected with the common offset reflection method, using a monostatic system (the transmitting and receiving antenna placed in the same box) with two central frequencies of 600 and 1600 MHz (IDS Ing-manufactured, RIS 2k-MF Multifrequency Array Radar-System). The GPR worked with a time window of 60 ns and a temporal sampling interval of 0.05 ns; successive traces were collected every 0.024 m. GPR used electromagnetic pulse energy in the frequency range of 10 MHz to 1000 MHz. The transmitter component of the GPR system allowed the passage of generated pulse energy, which propagated through the subsurface materials, and the interactions with the material were sensed by the receiver component. Traditional surveys employ reflections of electromagnetic waves from boundaries between environments of different electromagnetic properties [45]. Theoretical aspects and working principles of radar components can be found in detail in Davis and Annan [46].

Both the data quality check and cleaning procedure characterized the preliminary data analysis. For EMI data, the points at which the instrument was stationary and any negative values were removed.

Processing the raw GPR data consisted of extracting quantifiable variables, such as attenuation, and displaying GPR data in horizontal maps at a specified time (or depth), called amplitude maps or time slices. The preprocessing of GPR signal amplitude data included the application of a set of filters [47] and the extraction of quantifiable variables.

The enveloped amplitude maps (time slices) were built by averaging the amplitude (or the square amplitude) of the radar signal, expressed in digital number (DN), within overlapping time windows of width Δt equal to the order of the dominant period of each antenna (2 and 1 ns for the 600 and 1600 MHz antennas, respectively). The total time interval was of 10 ns for the 600 MHz antenna because this time was comparable with the depth of the soil, and it was 6 ns for 1600 MHz because of the attenuation of radar signal. The time slices were then transformed in depth slices using the velocity of the radar waves determined through the analysis of hyperbolae [48]. Data preprocessing was performed with ReflexW Software [49].

In order to estimate the geophysical covariates at the same locations as the SOC measurements, geostatistical procedures were separately applied to EMI and GPR data by using a multivariate approach and fitting a linear model of coregionalization (LMC) to the experimental variograms. Each group of geophysical data was interpolated with ordinary cokriging (ck) on a 0.5 m × 0.5 m grid. The estimated covariates, migrated at the sample locations, were: the ECa in horizontal (ECaH) and vertical (ECaV) modes; the amplitude for the 600 MHz antenna at ten depths from 0.05 m to 0.50 m with a step of 0.05 m (Amp600MHz\_0.05 m-Amp600MHz\_0.50 m); and the amplitude for 1600 MHz frequency antenna at eleven depths from 0.025 m to 0.275 m with a step 0.025 m (Amp1600MHz\_0.025 m-Amp1600MHz\_0.275 m).

Finally, 25 covariates were considered, namely, the 23 geophysical covariates plus the (two) geographical coordinates expressed in the WGS84 coordinate system.

#### *2.4. Regression Kriging (Residual Kriging)*

In the present paper, kriging combined with linear regression (RK), a hybrid interpolation technique, was applied [35,39] (see Figure 2). In mathematical terms, RK can be described as the sum of a deterministic (regression) component and kriging as shown in the following equation:

$$\mathfrak{E}(\mathbf{s}\_0) = \hat{\boldsymbol{m}}(\mathbf{s}\_0) + \mathfrak{E}(\mathbf{s}\_0) = \sum\_{k=1}^p \hat{\boldsymbol{\beta}}\_k \cdot \boldsymbol{q}\_k(\mathbf{s}\_0) + \sum\_{i=1}^N \lambda\_i \cdot \mathbf{e}(\mathbf{s}\_i) \tag{1}$$

where *<sup>s</sup>*<sup>0</sup> is the spatial location associated with the desired prediction, <sup>ˆ</sup> *m*(*s*0) is the trend,

*<sup>e</sup>*ˆ(*s*0) is the interpolated residual, <sup>ˆ</sup> *β<sup>k</sup>* are the estimated regressive coefficients, *qk* are the covariates, *p* is the number of coviariates, *λ<sup>i</sup>* are kriging weights, *N* is the number of observations, and *e*(*si*) is the residual (i.e., the difference between the regression estimation minus the observation) at the generic observational location *si*.

From a practical standpoint, once the trend component has been estimated, the residual can be interpolated with kriging and then added to the previously estimated component. The prediction of the residual is a very critical step, because in principle, only the autocorrelated components should be estimated, neglecting the purely random component. Unfortunately, it is very difficult to separate the overall residual into the autocorrelated and the noncorrelated components. There are many different opinions about the best way to accomplish this issue [50,51]. In the present paper, the variography directly performed on the residuals provided results that did not depart much from those obtained with more sophisticated statistical methods; in other words, this approach did not significantly bias the final predictions. Therefore, the more straightforward approach, which brutally separates observations from trend values to obtain residuals, was preferred [29,52]. The validation of the RK method is usually carried out by means of the cross-validation procedure, and specifically the leave-one-out method [53]. Cross-validation is structured as a two-stage procedure. In the first stage, a leave-one-out method is applied, which consists of dropping an observation from the dataset and predicting this omitted value using the remaining

data. Leave-one-out is iterated for each value in the dataset, and each time, a residual is computed as the difference between the observed and predicted values. The second stage of the cross-validation consists of making inferences about the residuals' distribution [54,55]. The R library [56] used to perform the aforementioned analysis was {Automap version 1.0–14}.

**Figure 2.** An example of the regression-kriging approach shown by means of a cross-section of the spatial random field (after Hengl, [35]).

#### *2.5. Multivariate Adaptive Regression Splines (MARS)*

MARS is a nonparametric and nonlinear predictive method that automatically models nonlinearities and interactions between variables managing suitably local nonstationarity [34]. Datasets are split into piecewise curves (splines) of differing slopes. Splines consist of two branches, i.e., left-sided (Equation (2)) and right-sided (Equation (3)) truncated functions, separated by a point called the *knot* [57].

$$\mathbf{b}\_{\mathbf{q}}^{-}(\mathbf{x} - \mathbf{t}) = [- (\mathbf{x} - \mathbf{t})]\_{+}^{\mathbf{q}} = \begin{cases} \ (\mathbf{t} - \mathbf{x})^{\mathbf{q}} \text{ if } \mathbf{x} < \mathbf{t} \\\ \mathbf{0} \text{ otherwise} \end{cases} \tag{2}$$

$$\mathbf{b}\_{\mathbf{q}}^{+}(\mathbf{x}-\mathbf{t}) = [+(\mathbf{x}-\mathbf{t})]\_{+}^{\mathbf{q}} = \begin{cases} \ (\mathbf{x}-\mathbf{t})^{\mathbf{q}} \text{ if } \mathbf{x} > \mathbf{t} \\\ \mathbf{0} \text{ otherwise} \end{cases} \tag{3}$$

b− <sup>q</sup> (<sup>x</sup> <sup>−</sup> <sup>t</sup>) and <sup>b</sup><sup>+</sup> <sup>q</sup> (x − t) are splines describing the regions on the right and left sides of the knot (t), respectively, and q is the degree of the polynomial. The subscript "+" indicates that the result of the function is 0 outside the local definition domain. For each of the covariate variables, MARS selects the couple of splines and the knot location more in accordance with the response variable. In a next stage, the different splines are added up in a single multivariate model, which describes the response as a function of the covariates. The result is a nonlinear model assuming the form:

$$\circlearrowleft = \mathbf{a}\_0 \sum\_{\mathbf{m}=1}^{\text{M}} \mathbf{a}\_{\mathbf{m}} \mathbf{B}\_{\mathbf{m}}(\mathbf{x}) \tag{4}$$

where yˆ is the prediction of the response variable; a0 is the known term; M is the number of basic splines; and Bm and am are the m-th basic spline and its coefficient, respectively [58].

Overall, a MARS analysis consists of three stages. Specifically, (i) the variable that best describes the response by means of the splines in terms of R<sup>2</sup> is selected. Afterwards, (ii) other covariates are added stepwise, always using splines, to build a multivariate model (i.e., the global MARS model). The aim of this addition is the improvement of model in terms of performance (R2). The performance is computed on the training set. Since the global MARS model is usually affected by *overfitting*, it needs to be "pruned" in a

further stage, for which iterations of the generalized cross-validations (GCV) alternated with 10-fold cross-validation are used [59]. The GCV index is a sum of squared errors (observations minus predictions) adjusted by embodying a penalty for reducing the model complexity. This criterion is used to prevent overfitting derived from an excessively accurate model with respect to the training set:

$$\text{GCV} = \frac{\frac{1}{\text{n}} \sum\_{\text{m}=1}^{\text{n}} \left( \text{y}\_{\text{i}} - \text{f}\_{\text{m}}(\text{x}\_{\text{i}}) \right)^{2}}{\left(1 - \text{C}(\text{M})/\text{n} \right)^{2}} \tag{5}$$

where C(M) is a parameter that penalizes models involving a large number of splines, defined as follows:

$$\mathbf{C(M)} = (\mathbf{M} + 1) + \mathbf{dM} \tag{6}$$

where M is the number of nonconstant splines (i.e., all terms of Equation (4) except a0) in the MARS model and d is a user-defined penalty value for each spline optimization. Increases in the cost d cause the exclusion of splines. Substantially, d is increased during the pruning step in order to obtain smaller models. Besides its use during the pruning phase, GCV index is essential to rank covariates based on their importance in the model. The definition of the final model is reached in a third phase. This phase (iii) is performed by cross-validation or a new independent test set. The R library used to perform the aforementioned analysis herein is {earth} [59].

#### **3. Results**

#### *3.1. Exploratory Data Analysis*

Descriptive statistics showed that SOC data were normally distributed as confirmed by skewness and kurtosis values (Table 1) and by Shapiro–Wilk test (*p* = 0.656); for this reason, they were not subjected to a normal transform. The reported bubble plot (Figure 3) shows the spatial distribution of the SOC observations, evidencing some clusters of similar values.

**Table 1.** Summary statistics for SOC (g 100 g<sup>−</sup>1).


**Figure 3.** Bubble plot of spatial distribution of SOC values (g 100 g<sup>−</sup>1).

The global Moran index provided an assessment of the spatial autocorrelation strength over the study area and is reported in Table 2. The result (I = 0.42) indicated a significant spatial autocorrelation (*p* = 0.00034). In addition to the global Moran index, the peak of the Moran index (local Moran index) was estimated by means of the computation of the mean of nearest neighbours. Afterwards, a lagged scatterplot provided the Moran computation at such distance lag. For the considered case, the mean of nearest neighbours was 2.63 m, and Figure 4 shows the Moran value corresponding to that distance, indicating a greater spatial correlation at short range (r = 0.75).

**Table 2.** Assessment of the global Moran index.


**Figure 4.** h-scatterplot for assessing local Moran I.

#### *3.2. Linear Model Outcomes*

The correlation matrix between SOC and the 25 covariates (23 geophysical variables plus the geographical coordinates) was first computed, and different sets of highly correlated covariates were derived and used to fit SOC data.

The following equation shows the first attempt to model SOC with the most correlated variables:

SOC ~ ckAmp0.05m\_600MHz+ ckAmp0.1m\_600MHz + ckAmp0.4m\_600MHz

The five-point summary statistics and the coefficients of the linear model are reported in Tables 3 and 4. The outcomes seemed to indicate a larger contribution of the GPR data than of the EMI sensor data. The covariates related to the higher frequency antennae (1600MHz frequency) were therefore excluded.

**Table 3.** Five-point table of the linear model's residuals.


In particular, the GPR data representations for both frequencies showed a first discontinuity in the radar signal at 0.1 m depth, a high level of spatial continuity along the soil profile at least to 0.30 m, and a second discontinuity after 0.30 m depth. Therefore, the selected covariates were representative of information derived by two different layers.


**Table 4.** Coefficients of the linear model.

Signif. codes: 0.01, "\*"; 0.05, ".".

The model was significant (F-statistic: 4.80 on 3 and 67 DF, *p*-value: 0.004) and showed a residual standard error of 0.26 with 67 degrees of freedom; multiple R-squared and adjusted R-squared were 0.177 and 0.14, respectively. Analysing Table 4, it was evident that there was a unique significant covariate, ckAmp0.1m\_600MHz. The result showed the distribution of SOC to be significantly affected by the shallower layer, probably because it was comparable with the portion of sampled soil.

After many other attempts (not reported), a model was developed with the following optimal arrangement of the covariates:

#### SOC ~ X + Y + ckAmp0.35m\_600MHz

This model included the geographical coordinates and a unique geophysical covariate, ckAmp0.35m\_600MHz (see Tables 5 and 6). This model was better that the aforementioned one, with all the covariates significant, a better value of R-squared (multiple R-squared: 0.26, adjusted R-squared: 0.22), and a more significant F-statistic *p*-value (F = 7.9 on 3 and 67 DF, *p*-value: 0.00018). Residual standard error was 0.24 with 67 degrees of freedom.

**Table 5.** Five-point table of the second linear model's residuals.




The model's residuals were then analysed. The Shapiro–Wilk Gaussianity test showed a nonsignificant departure from the normal distribution (W = 0.98567, *p*-value = 0.598); as a consequence, the Gaussian hypothesis was accepted. Afterwards, spatial autocorrelation analysis was performed to check at what extent the linear model filtered out the autocorrelation present in the raw data.

From Table 7, it was evident that in the linear model's residuals, there was still a significant quantity of spatial autocorrelation (*p*-value = 0.0012). Therefore, it made sense to apply regression kriging (RK) to exploit the residual autocorrelation with the aim of improving the goodness of fit.

**Table 7.** Linear model coefficients with related statistics.


#### *3.3. Regression Kriging (RK)*

Geostatistical analysis was then applied to the linear model's residuals with the aim of finding in them a structure that could represent their spatial variability.

The goodness of fit between the selected variogram model and the empirical variogram was evaluated by means of the SSErr index, which provides a value that helps user to judge the quality of the final model. For the case at hand, the value was SSErr = 0.00050, which appeared to be a satisfactory result. Moreover, by analysing the variogram parameters, reported in Table 8, it was possible to figure out the strength of the model by computing the nugget-to-sill ratio index [60], also called the spatial dependence index (SDI; [61]). For the case at hand, the observed value was 0.075, indicating high descriptive capability for the variogram model.

**Table 8.** Variogram model and parameters.


In Figure 5, the experimental variogram and the fitted nested model (nugget + spherical) are reported.

**Figure 5.** Experimental variogram and fitted variogram model.

Cross-validation statistics showed an MAE to RMSE ratio of 0.76, indicating a very good outcome. Mathematically, RMSE is always larger than MAE, because large errors are magnified by the square contained in the formula; therefore, the ratio between MAE and RMSE is always less than 1. However, the closer to 1 the ratio is, the fewer large errors made are by the model. This positive result was confirmed by a MAPE value far lower than 10% (Table 9). Computing the Lin coefficient (CCC) between observations and predictions, the outcomes were 0.65 for overall CCC, 0.68, for overall precision, and 0.95 for overall accuracy. The scatterplot of predicted versus observed values qualitatively showed the adequacy between the two data series (Figure 6).

**Table 9.** Accuracy metrics to assess the goodness of fit of the RK model.


**Figure 6.** Scatterplot of predicted (RK model) vs. observed values.

The spatial distribution of SOC obtained through RK is reported in Figure 7.

**Figure 7.** Map of SOC obtained with regression kriging. The black polygons indicate the four blocks in the RCB experimental design.

#### *3.4. MARS Model Assessment*

The original dataset was split into two complementary subsets, namely, training and test, corresponding to 80% and 20% of the original data, respectively.

Since the model is calibrated by means of the training dataset with the aim to predict the test data, the two subsets should be (statistically) similar at some extent. For this reason, after the splitting, subsets were subjected to the t-test for mean homogeneity and the Levene test for variance homogeneity. In addition, a univariate cluster analysis, carried out to assess the presence of clusters among data, showed that observations could be split into four groups. This represents another constraint about the splitting that has to be taken into account, i.e., the training and test subsets should be formed by a balanced quantity

of elements extracted from all the clusters. Subsets were both checked for Gaussianity by means of Shapiro–Wilk test; results showed for both subsets a nonsignificant departure from normal distribution (W = 0.99, *p*-value = 0.90, for the training set; W = 0.97, *p*-value = 0.81, for the test set).

A Welch two-sample t-test showed that the means of the two subsets were not statistically different (t = −0.25, df = 20.36, *p*-value = 0.81). In addition, a boxplot confirmed the equality of the two means of the SOC variable subsets (Figure 8).

**Figure 8.** Boxplot for SOC comparison between training and test sets.

A Levene test, based on the absolute deviations from the median with a modified structural zero removal method and correction factor, showed the homogeneity of the group variances (test statistic = 0.059, *p*-value = 0.81). In Figure 9, the placement of the observations for the training (red points) and the test (green points) sets is reported.

**Figure 9.** Spatial distribution of training and test sets points.

In summary, the two subsets could be considered similar according to the distribution, mean value, and variance comparisons. Therefore, the training set seemed to be appropriate to calibrate the model and the test set to check for overfitting.

The MARS model selected only 4 out of 25 predictors, namely, ckAmp0.35m\_600MHz, X, ckECaVer, and ckAmp0.1m\_600MHz.

The model included the main GPR covariates selected previously. Regarding EMI data, only the apparent electrical conductivity measured in vertical polarization was selected because the two electrical conductivity variables were strongly correlated and therefore redundant. In addition, the sensor in vertical polarization had a maximum sensitivity approximately at a depth of 0.40 m, which was comparable with the time slices of GPR repeatedly selected (0.35 m).

From Table 10, it can be drawn that the MARS model was formed by four terms; apart from the intercept, the first was linear, and the remaining two were interactions between couples of covariates. After importance analysis was applied, by using the GCV and raw residual sum of squares (Rss) indices, the selected predictors were ranked accordingly (Table 11).

**Table 10.** MARS model structure.


**Table 11.** Covariates of the MARS model listed according to their importance rank with respect to GCV (generalized cross-validation) and Rss (raw residual sum of squares).


As first step, the Gaussianity of the residuals after the training was tested using the Shapiro–Wilk test; the residuals distribution could be considered Gaussian with a distribution ~ *N*(0.0, 0.036) (W = 0.98, *p*-value = 0.50).

By applying a blind cross-validation with k-fold = 10, the resulting R<sup>2</sup> was 0.51, but it should be borne in mind that this was a pessimistic result, as the extractions of blocks of 10 elements (k-fold with k = 10) from the original dataset was performed 200 times in a purely random fashion, neglecting similar subsets. Moreover, the original dataset was relatively small and represents a not-very-homogeneous reality. Finally, the results in terms of goodness of fit were averaged.

The first step consisted of checking the correlation between predicted and observed values for the training set; the results showed a certain agreement (r = 0.72, *p*-value ≈ 0.0). In addition, correlation between residuals and predicted values of training subset was checked and was close to zero, as expected.

Afterwards, the MARS model calibrated on the training set was applied to predict SOC data from the test set, which was independent from the model calibration (training) set.

As a first step, the correlation between observations and (test set) predictions was analysed. This resulted in a highly significant correlation (r = 0.87, *p*-value ≈ 0.0). The value gained after the validation step surprisingly outperformed that of the training set, which is a rare event. The correlation between residuals and (test-set) predicted was not significant.

The residuals, according to the Shapiro–Wilk test, were Gaussian, with a distribution ~ *N*(0.027, 0.025) (W = 0.93, *p*-value = 0.23).

Computing the Lin coefficient (CCC) between observations and predictions, the outcomes showed very good agreement (overall CCC, 0.81; overall precision, 0.88; overall accuracy, 0.93).

Since the observations were available, it was possible to compute the error metrics, which are reported in Table 12.


**Table 12.** Accuracy metrics to assess the goodness of fit of the MARS model.

The error indices were good overall; in particular, MAPE was below 10%, which value has been indicated in literature as a critical threshold. Another very interesting result concerned the ratio between MAE and RMSE, which was larger than that obtained with regression kriging (0.8 vs. 0.76). In conclusion, the MARS model could be considered effective whenever the coefficients of the covariates were not constant over the study domain and the covariates were intertwined in more complex ways than additively.

By comparing the error indices and Lin's coefficients of both methods, it became evident that MARS performed better than RK. The two methods were linear (RK) and nonlinear (MARS), respectively. The main difference concerns the interaction terms, since the MARS model has one linear term and two multiplicative terms (interactions) that represent the added value that allowed improving the predictive capability of MARS with respect to that of RK.

In Figure 10, the map of SOC predictions obtained with the MARS model is reported. Comparing the RK and MARS maps, they showed overall agreement, with a cluster of lower values in the northern part of the study area, a central part with the lowest values, and finally, a southern part with two clusters of larger values and a cluster of lower values.

**Figure 10.** Map of SOC obtained with MARS model. The black polygons indicate the four blocks in the RCB experimental design.

Finally, to quantitatively compare the maps obtained by the two methods, a crosscorrelogram was computed. The result was a value of 0.67 at the distance 0. Therefore, the map gained from RK can be considered a first approximation of that from MARS. This result underlines the reliability of the SOC spatial distribution predicted by the MARS model.

#### **4. Discussion**

Spatial prediction of SOC is critical for assessing the effect of agronomic management strategies on soil quality and crop productivity. In this scope, the sample size is a value that plays a key role in SOC prediction. Thus, it needs to be balanced between economic and predictive constraints. In fact, increasing the sample size may allow the application of statistical methods that take residual autocorrelation into account and thereby reduce the probability of inflation of the type I error rate [12], but at large cost. Regression kriging and MARS, incorporating covariate information often available at a finer resolution than primary data, such as proximal and remote sensor data, may improve the quality of SOC estimation without increasing the sampling size of the primary variable [62,63].

Outcomes obtained from linear models seem to highlight a larger informative contribution of GPR than of EMI data. From a physical standpoint, this result can be explained by the different nature of sensors' outcomes. In fact, GPR information results are more sensitive to near-surface effects than EMI data, which are integrated values over all soil layers [15]. However, unexpectedly [64], the covariates related to the higher-frequency antenna (1600 MHz frequency) were excluded, probably because they did not add further information or were redundant in this study case.

Two GPR covariates, namely, ckAmp0.35m\_600MHz and ckAmp0.1m\_600MHz, were selected by the MARS model. The same variables were also chosen by the final RK model (ckAmp0.35m\_600MHz) and the preliminary RK model (ckAmp0.1m\_600MHz). Similar importance was also assigned to the selected variables by both statistical methods, as shown by the ranking defined by GCV and Rss in MARS model, suggesting that their significance was physically based. In fact, the selected covariates were representative of information derived by two soil layers with different physical properties influencing radar signal and soil organic carbon distribution. The two methods also had the X geographical coordinate in common, indicating a larger continuity along this direction.

The main difference between the two approaches concerned the selection of the EMI covariate in vertical polarization performed only by the MARS model, indicating the different explanatory power of information brought by the two sensors. This result was tied to the intrinsic capability of the MARS model to intercept the interactions among variables and highlight nonlinear features underlying the data [34]. In addition, the coefficients of the MARS model were not constant but piecewise linear (splines), and therefore, their gradient varied over the studied domain [57]. This explains the larger descriptive capability of the MARS model and its ability to select hidden features with respect to regression kriging. Although MARS is not explicitly a spatial method, its capability of modelling covariate coefficients by means of flexible functions allows, when the geographic variables are included in the analysis, filtering out the spatial autocorrelation contained in the data, which makes it substantially a spatial method [65]. A confirmation of this was the statistical nonsignificance of the Moran I index obtained from the MARS residuals.

Studies on the spatial variability of SOC in agricultural soils remain a central theme in assessing the environmental sustainability of agricultural systems [66], because agronomic inputs could be rationalized in order to not impoverish the soil's fertility. Therefore, our results represent a knowledge contribution for future studies aimed at detecting the spatial distribution of soil organic carbon at the field scale. Geophysical methods show new applicative potentialities for environmental sciences (see, among others, [67]) and can represent support for research in this field. However, because of the complex interactions with soil properties, the use of geophysical measurements as covariates needs to be investigated in more detail to draw more precise conclusions. A limit of the present work could be its potential site-specificity, which could not be quantified in advance. Therefore, further experiments in different study areas and agroenvironments should be performed to test the performance of the methods under different conditions.

#### **5. Conclusions**

The results of our investigation showed that MARS outperformed RK in predicting SOC spatial distribution. The nonlinearity of MARS evidenced the contribution of EMI variables neglected by linear approaches. That result would have to be deepened in future works in consideration of the fact that EMI measures are more easily achievable than GPR ones.

The accuracy reached in mapping SOC with the support of MARS was remarkable and opens interesting perspectives in applying other, more powerful machine learning methods (e.g., deep learning) to even better exploit proximally sensed data. In the future, it is hoped that these machine learning methods will be successfully associated with mapping procedures and then applied at the regional and national level.

The use of relatively easy, accurate, and inexpensive geophysical methods for SOC estimation, together with application of advanced statistical techniques for SOC spatialization, can represent a viable solution to investigate agroecosystem sustainability.

**Author Contributions:** Conceptualization, A.M.S., E.B. and D.D.B.; methodology, E.B.; software, E.B.; validation, E.B., D.D.B., A.M.S., M.C. and G.L.; formal analysis, E.B., A.M.S. and D.D.B.; investigation, D.D.B.; resources, M.C. and S.P.; data curation, D.D.B.; writing—original draft preparation, E.B.; writing—review and editing, D.D.B., A.M.S., E.B., M.C. and G.L.; visualization, D.D.B.; supervision, A.M.S. and E.B.; project administration and funding acquisition, A.M.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors would like to thank the EU and MIUR for funding the present methodological contribution under the framework of the collaborative international consortium DESERT (ID-217) financed under the ERA-NET Cofund WaterWorks2014 Call. This ERA-NET is an integral part of the 2015 Joint Activities developed by the Water Challenges for a Changing World Joint Programme Initiative (Water JPI).

**Institutional Review Board Statement:** The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

