Article

Assessing the Information Potential of MIR Spectral Signatures for Prediction of Multiple Soil Properties Based on Data from the AfSIS Phase I Project

by Stanisław Gruszczyński and Wojciech Gruszczyński *
AGH University of Science and Technology, Faculty of Geo-Data Science, Geodesy and Environmental Engineering, Al. Mickiewicza 30, 30-059 Krakow, Poland
* Author to whom correspondence should be addressed.
Int. J. Environ. Res. Public Health 2022, 19(22), 15210; https://doi.org/10.3390/ijerph192215210
Submission received: 9 October 2022 / Revised: 9 November 2022 / Accepted: 11 November 2022 / Published: 17 November 2022
(This article belongs to the Special Issue Soil Remediation and Prophylaxis in Polluted Environments)

Abstract

The aim of the study was to assess the predictive potential of mid-infrared (MIR) spectral response in the estimation of 60 soil properties. It is important to know the accuracy limitations in estimating various soil characteristics using various models in conditions of high spatial variability of the environment. To fully assess this potential, three types of algorithms were used in modeling, i.e., partial least squares regression (PLSR), one-dimensional convolutional neural network (1DCNN), and generalized regression neural network (GRNN). The research used data from 19 sub-Saharan African countries collected as part of the Africa Soil Information Service (AfSIS) Phase I project. The repositories provide 18,250 MIR reflectance recordings and nearly two thousand analytical data records from the determination of many soil properties by reference methods. The modeled subset of these properties included texture (3 variables), bulk density, moisture content at points of the soil water characteristic curve (SWCC, 4 variables), total and organic C and total N content (3 variables), total elemental content (32 variables), elemental content in bioavailable forms (12 variables), electrical conductivity, exchangeable acidity, exchangeable bases, pH, and phosphorus sorption index. It is not possible to indicate a universal optimal prediction model for all soil variables. The best prediction results are provided by all regression models for total and organic C, total Fe, total Al and bioavailable Al content, and pH. For bulk density, total N, and total K content, satisfactory results are provided by specific model types. Many other properties, i.e., texture, SWCC, total Ga, Rb, Na, Ca, Cu, Pb, Hg content, and bioavailable Ca and K content, can be predicted with accuracies sufficient for some less demanding tasks.

1. Introduction

The interpretation of short-range spectral recordings under field or laboratory conditions attracts a lot of attention [1,2,3,4]. The importance of such methods is related to the three-dimensional variability of soils and their profile variation, which are impossible to recognize by recording spectral reflectance of the ground surface. Most studies to date relating to spectral analysis of soils have used the visible near-infrared (Vis-NIR) range covering the 350–2500 nm range of electromagnetic radiation [3,5,6]. Currently, there is increasing interest in investigating the suitability of the mid-infrared (MIR) spectrum covering the 2500–25,000 nm range [7,8,9].
The analysis of the usefulness of MIR signatures in soil research is of increasing interest. Margenot et al. analyzed the technical aspects of using this spectral range in the estimation of soil properties, mainly on the basis of a literature review [10]. They indicate the need to standardize the way in which the disturbing effect of soil moisture on the estimation of soil organic matter (SOM) is taken into account. Nath et al. analyzed the usefulness of MIR signatures in estimating biological properties [11], pointing to the high usefulness of this approach in determining the concentration of C and N. They believe that the estimation of the concentration of enzymes and other organic components requires more sophisticated estimation methods. In a study of Alfisols in eastern India [12], using a relatively small sample set (approximately 330 samples), the authors concluded that the PLSR and SVM models provide a better estimate of SOC, pH, and clay than the random forest model. They point to the need to strengthen these conclusions on the basis of more territorially extensive research. An extensive analysis of the estimates of over 100 soil properties, based on the Kellogg Laboratory’s MIR spectral library, is contained in the publication by Ng et al. [8]. The results obtained confirmed the usefulness of memory-based learning (MBL) models for the estimation of many properties of soils from the territory of the USA.
A very insightful analysis of prediction results using NIR and MIR was conducted by Bellon-Maurel and McBratney [5]. The review analyzed the effect of the measurement method, data preprocessing, and algorithms used on soil organic carbon (SOC) prediction. According to one of the review’s conclusions, NIR and MIR allow the prediction of SOC with similar accuracy, with a slight advantage to MIR, as it provides better reproducibility and a lower error (by 10–40% compared to NIR). In addition, it was found that, regardless of the range of the spectrum, homogeneity of data for calibration and validation is essential. It should be noted that comparison of modeling results based on spectral response is justified only for soil samples with similar properties, geology, and climatic conditions. The importance of the range of variability of the data, the number of cases for calibration, preprocessing, and the type of model should also be emphasized.
In order to use spectral analysis in the soil survey, it is necessary to know the potential possibilities and limitations of this approach in estimating the values of various soil characteristics. The problem is complex. The model error, apart from its architecture, depends on the distribution and differentiation of data, the range of their variability, the number of observations, the method of validation, etc. The work of Nduwamungu et al. [13] summarizes the results of modeling tests based on the NIR spectrum of soil properties such as texture, SOC, pH, and cation exchange capacity (CEC) presented in the work of many authors, whose statistics differed dramatically: the R2 spread was between 0.01 and 0.99. The availability of possibly extensive data with high variability increases the credibility of the information potential of spectral data under various soil conditions. The knowledge of the potential usefulness of a spectrum for determining various soil properties is of significant practical importance for the selection of variables used to estimate soil quality indices [14,15].
Soils are described by a number of properties subject to evaluation from the point of view of their cultivation, economic, and engineering suitability. As evaluation criteria, depending on the use under consideration, various properties of soils are treated as more or less important. Classification of soils, in the context of their use in agriculture or forestry, requires characterization of basic physical (texture, water holding capacity, compactness, permeability, and many other potential physical indicators) and chemical (concentration of macro- and micronutrients, pH, sorption capacity, buffering capacity, and many others) properties. The list of elements to be assessed is constantly growing; for example, by including rare earth elements in the assessment of soils [16,17]—important because of their role as micronutrients and potential pollutants. Given this fact, it is important to appreciate the possibility of assessing the information potential contained in the spectral response of soil samples as a source to support, at least roughly, the assessment of environmentally or economically significant properties. It is important to note that the quality of models linking the spectral response to individual soil properties depends largely on the type of prediction algorithm used as well as the statistical distribution of the modeled properties. For this reason, it is advantageous to confront modeling approaches and validation datasets in order to identify features, prediction methods, and conditions for their use.
The aim of this research is to determine prediction accuracy for many soil variables based on the analysis of the spectral response of soils in the proximal sensing mode in the MIR range. The literature generally presents attempts to model several variables (usually C, N, P, K, pH, texture, and the contents of some heavy metals, such as Zn, Hg, Cr, Cu, and Pb [18,19]) and occasionally other properties, such as the soil water characteristic curve, CEC, EC, or exchangeable bases [8]. AfSIS data are characterized by a very extensive list of variables determined by reference methods along with the spectral response [20]. They allow estimation of the modeling errors of many soil variables from samples taken in highly varied soil and climatic conditions. The validation statistics determined allow for the assessment of the usefulness and limitations of MIR models and data for estimating the values of variables characterizing soils.

2. Materials and Methods

2.1. AfSIS Phase I Data

AfSIS Phase I data were collected as part of a project financed by the Bill and Melinda Gates Foundation, implemented with the participation of five research institutions: The Tropical Soil Biology and Fertility Institute of The International Center for Tropical Agriculture, The International Soil Reference and Information Center – World Soil Information, The Center for International Earth Science Information Network, The Earth Institute at Columbia University, and World Agroforestry (Center for Research in Agroforestry-ICRAF). These data include the results of analyses of samples from soil profiles collected at points determined according to the methodology of the Land Degradation Surveillance Framework [20] in 19 countries of sub-Saharan Africa: Angola, Botswana, Burkina Faso, Cameroon, Ethiopia, Ghana, Guinea, Kenya, Madagascar, Malawi, Mali, Mozambique, Niger, Nigeria, South Africa, Tanzania, Uganda, Zambia, and Zimbabwe. More details regarding the AfSIS database can be found in Vågen et al. [21].
The data provided by the repositories (Amazon AWS, GitHub, and the Agroforestry portal) came from topsoil (0–20 cm) and subsoil (20–50 cm) layers. Soil sampling was carried out in accordance with the Land Degradation Surveillance Framework (LDSF) procedure, taking into account the local and spatial characteristics of the environment. The sampling areas represent a random stratification of the African landscapes south of the Sahara desert (Figure 1). The samples were taken in the areas surrounding the cluster center, which allowed consideration of the spatial variability of soils [22]. A detailed method for determining the location of sampling areas, sampling, and analytical procedures is included in the report written by Leenaars et al. [23].
AfSIS data were the basis for studies on the analysis of the spatial variability of soil chemical composition [24] and the characteristics of soil profiles [23]. On this foundation, cartographic documentation of soils with a resolution of 30 m was prepared [25]. The repositories provide 18,250 MIR reflectance recordings [26,27]. A single spectral signature covers 1749 recorded reflectance values in the range from 2500 to 16,666 nm (wavenumbers from 4000 to 600 cm−1 with a step of 1.9 cm−1). The average spacing of the spectral registration points was about 8.1 nm, and it ranged from 1.2 nm to 53 nm. The GitHub repository also provides nearly two thousand analytical data records from determinations of soil samples by reference methods. Georeferenced data on the location of sampling points are also available. The analytical data of the original dataset are split at the highest level into two groups, i.e., DryChemistry and WetChemistry. The DryChemistry subset includes the results of X-ray determinations of the content of 32 elements. The WetChemistry subset [28] includes data from three different laboratories on soil texture, water characteristics of samples, and many soil chemical properties associated with the use of the Mehlich 3 methodology [29].
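The wavelength figures quoted above follow from the wavenumber grid through the relation λ [nm] = 10^7 / ν [cm−1]. The short check below is a sketch that assumes the 1749 points are evenly spaced in wavenumber between 4000 and 600 cm−1 (an assumption; under it the step comes out slightly above the quoted 1.9 cm−1).

```python
# Illustrative check of the spectral grid; an evenly spaced wavenumber grid is assumed.
N_POINTS = 1749                  # reflectance values per spectrum (from the text)
WN_MAX, WN_MIN = 4000.0, 600.0   # wavenumber limits in cm^-1 (from the text)


def wavenumber_to_nm(wn_cm1: float) -> float:
    """Convert a wavenumber in cm^-1 to a wavelength in nm."""
    return 1.0e7 / wn_cm1


step = (WN_MAX - WN_MIN) / (N_POINTS - 1)
wavenumbers = [WN_MAX - i * step for i in range(N_POINTS)]
wavelengths = [wavenumber_to_nm(wn) for wn in wavenumbers]

# Distance in nm between neighbouring registration points.
spacings = [b - a for a, b in zip(wavelengths, wavelengths[1:])]
mean_spacing = sum(spacings) / len(spacings)

print(f"step: {step:.3f} cm^-1")
print(f"spacing in nm: min {min(spacings):.1f}, mean {mean_spacing:.1f}, max {max(spacings):.1f}")
```

Because the grid is uniform in wavenumber, the spacing in nanometres grows toward the long-wavelength end, which is why the reported spacing ranges over more than an order of magnitude.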
There are slight variations in the amount of data from different laboratories, which resulted in the need for separate modeling for individual parts of the set. Chemical properties for which fewer than 500 reference records were available were not modeled. In addition, properties that can be calculated directly from the modeled values through strict relationships were omitted from the modeling. Moreover, modeling of soil organic N was not carried out, as the method of determination of reference values was different from the commonly accepted standard. Each group of features for which modeling was performed included data from at least 1860 soil samples, of which a randomly selected 70% was used for training and the remaining 30% was used for validation. The training and validation sets were kept the same for each model throughout the study.
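A fixed random split of this kind can be sketched as follows. The function name, the seed value, and the rounding rule are illustrative assumptions; the source does not describe how the authors drew their split, only that it was random, 70/30, and reused for every model.

```python
import random


def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Return (train, validation) index lists; the same seed always gives the same split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # local RNG, so global state is untouched
    n_train = round(n_samples * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Example with the size of the "Texture" group (1307 + 560 = 1867 samples).
train_idx, valid_idx = split_indices(1867)
print(len(train_idx), len(valid_idx))
```

Storing the index lists (rather than re-drawing them) is what allows PLSR, 1DCNN, and GRNN results to be compared on identical validation samples.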
A total of 60 variables were modeled, with the data divided into groups:
“Texture” (3 variables, 1307 training, and 560 validation samples), separated into three particle-size fractions, i.e., sand, silt, and clay;
“Water” (5 variables, 1306 training, and 560 validation samples), i.e., saturation at 0 (Sat. at 0), pF2.0, pF2.5, pF4.2, and bulk density (B. Den.);
“CN” (3 variables, 1331 training, and 567 validation samples): total N, total C, and SOC;
“Elements” (32 variables, 1333 training, and 570 validation samples), i.e., concentrations of: Na, Al, Cl, K, Ca, Ti, V, Cr, Mn, Fe, Ni, Cu, Zn, Ga, Se, Rb, Sr, Y, Zr, Ba, La, Ce, Pr, Nd, Sm, Hf, Ta, W, Hg, Pb, Bi, and Th;
“Mehlich” (17 variables, 1333 training, and 571 validation samples), i.e., electrical conductivity (EC), exchangeable acidity (Ex. Ac.), exchangeable bases (Ex. Bas.), pH, phosphorus sorption index (PSI), and bioavailable forms of elements extracted with use of Mehlich 3 (M3) methodology, namely, M3 Al, M3 B, M3 Ca, M3 Cu, M3 Fe, M3 K, M3 Mg, M3 Mn, M3 Na, M3 P, M3 S, and M3 Zn.
For a description of the reference methods, in addition to standard operating procedures, see Vågen et al. [22]. The texture of the soil was determined with laser diffraction particle size analysis. For AfSIS reference samples, soil moisture release curves were determined using soil fines in a pressure plate apparatus. High-throughput total X-ray fluorescence spectroscopy (TXRF) was used in the analysis of the chemical composition (macro- and microelements). Total and organic C and total N were determined with the combustion method. Extractable Al, Ca, Mg, P, K, Na, S, Fe, Mn, Zn, Cu, B, Mo, H, and other bases were determined using ICP analysis of Mehlich 3 extracts. The P sorption index was measured by single-point P addition.
Figure 2 and Appendix A (Table A1, Table A2, Table A3, Table A4 and Table A5) present selected statistics for all variables in these data groups. For each data group, prediction models were created for all soil properties included in the group.

2.2. Models

Three different groups of algorithms were used in this study: partial least squares regression (PLSR), one-dimensional convolutional neural network (1DCNN), and generalized regression neural network (GRNN).
The PLSR algorithm has been the most widely described and applied to prediction of the state of the environment from remotely sensed data. The deterministic PLSR algorithm [30,31] produces a linear model preceded by an orthogonalization operation of the input data taking into account their effect on the variance of the outputs. The experiments used a module available in the MATLAB package [32]. Depending on the number of model outputs, the number of PLS coefficients was set in the range from 25 to 100. A single input vector consisted of 1749 reflectance values recorded for one soil sample.
A one-dimensional convolutional neural network [33] represents the so-called adaptive deep learning by processing the input data through successive layers of the algorithm. All conducted computational experiments used a network with two convolutional layers, with 32 and 64 filters, or 64 and 128 filters, with widths of three, six, and ten input features, separated by MaxPooling layers. A fully connected layer consisted of 15 to 50 units. The number of outputs corresponded to the number of modeled properties. The spectra were standard normal variate (SNV)-transformed [34] and, like the outputs, were normalized to the numerical interval [0, 1] according to the methodology described in Gruszczyński and Gruszczyński [35]. The calculations associated with this algorithm were performed in the Python language environment using the Keras system and TensorFlow libraries [36,37].
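The two preprocessing steps mentioned here can be illustrated with a minimal sketch. It assumes the usual definition of SNV (each spectrum is centered on its own mean and divided by its own sample standard deviation) and a plain min-max rescaling fitted on training values; the exact procedure of [35] is not given in this text, and the function names and toy numbers are hypothetical.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own statistics."""
    center = mean(spectrum)
    scale = stdev(spectrum)  # sample standard deviation (n - 1 in the denominator)
    return [(value - center) / scale for value in spectrum]


def minmax_fit(train_values):
    """Return the (min, max) pair that defines the [0, 1] scaling."""
    return min(train_values), max(train_values)


def minmax_apply(values, lo, hi):
    """Rescale values so that lo maps to 0 and hi maps to 1."""
    return [(value - lo) / (hi - lo) for value in values]


# Toy reflectance values, not taken from the AfSIS data.
spectrum = [0.42, 0.40, 0.35, 0.37, 0.45]
snv_spectrum = snv(spectrum)

lo, hi = minmax_fit([2.0, 4.0, 6.0])
scaled = minmax_apply([2.0, 4.0, 6.0], lo, hi)
print(snv_spectrum, scaled)
```

Fitting the min-max limits on the training set only, and reusing them for validation data, keeps validation values from influencing the scaling; validation values may then fall slightly outside [0, 1].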
A generalized regression neural network is a memory-based local nonparametric linear and nonlinear regression algorithm [38]. The GRNN algorithm is susceptible to the curse of dimensionality; therefore, the spectral data from the AfSIS project were transformed into vectors of 50 components each using the PCA algorithm, considering all available spectra (18,250). In order to optimize the spread of RBF units, a modeling cycle was performed with spread values between 0.5 and 2.5 (with a step of 0.05). Figure 3 shows the linear correlation coefficient between the observed data and their prediction as a function of spread. In the construction of models, the spread generating the highest average linear correlation coefficient (r) between the prediction and the observed data was used. Optimal spread values ranged from 0.75 for the “Mehlich” data group to 2.0 for variables C and N. Calculations associated with the GRNN model were performed in the MATLAB environment [32].
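The GRNN prediction and the spread search can be sketched in a few lines. This is a single-output illustration under stated assumptions: the kernel is written in the textbook form exp(−d²/(2·spread²)), which may be scaled differently from the MATLAB implementation the authors used, the inputs stand in for the 50 PCA components, and the function names and toy data are hypothetical.

```python
import math


def grnn_predict(x, train_x, train_y, spread):
    """Kernel-weighted average of training targets (GRNN as a kernel regression)."""
    weights = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi)) / (2.0 * spread ** 2))
        for xi in train_x
    ]
    total = sum(weights)
    if total == 0.0:  # every kernel underflowed; fall back to the training mean
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def pearson_r(x, y):
    """Linear correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0.0 else 0.0


def select_spread(train_x, train_y, valid_x, valid_y):
    """Scan spreads 0.5, 0.55, ..., 2.5 and keep the one with the highest r."""
    candidates = [round(0.5 + 0.05 * k, 2) for k in range(41)]

    def score(spread):
        preds = [grnn_predict(x, train_x, train_y, spread) for x in valid_x]
        return pearson_r(valid_y, preds)

    return max(candidates, key=score)


# Toy one-dimensional data, not AfSIS values.
train_x = [[0.0], [1.0], [2.0], [3.0]]
train_y = [0.0, 1.0, 2.0, 3.0]
prediction = grnn_predict([1.5], train_x, train_y, spread=0.75)
chosen = select_spread(train_x, train_y, [[0.5], [1.5], [2.5]], [0.5, 1.5, 2.5])
print(prediction, chosen)
```

In the study the criterion was the average r over all variables of a data group, so a multi-output version would average `score` across outputs before taking the maximum.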
The evaluation criteria for each model were the following validation statistics: coefficient of determination (R2), root mean square error (RMSE), ratio of performance to interquartile distance (RPIQ), bias, standardized bias (Sb), and Lin’s concordance correlation coefficient (LCCC).
The coefficient of determination was calculated according to the equation:
$$R^2 = 1 - \frac{SSR_v}{SST_v}$$
where $SSR_v$ is the sum of squared differences between the modeled value and its prediction in the validation set, and $SST_v$ is the sum of squared differences between the modeled value and its mean in the validation set. The coefficient of determination is considered to be a numerical estimate of the degree to which the variance in a modeled variable is explained.
Root mean square error was calculated according to the equation:
$$RMSE = \sqrt{\frac{SSR_v}{n}}$$
where n is the size of the validation set.
Ratio of performance to interquartile distance was calculated according to the equation:
$$RPIQ = \frac{IQR}{RMSE}$$
$$IQR = Q_3 - Q_1$$
where $IQR$ is the interquartile distance, and $Q_3$ and $Q_1$ are the third and first quartiles of the validation data.
The bias was calculated according to the equation:
$$bias = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)$$
where $x_i$ and $y_i$ are the $i$th observed value and its prediction.
Standardized bias was calculated according to the equation:
$$S_b = \frac{bias}{IQR}$$
Lin’s concordance correlation coefficient (LCCC) [39] was calculated according to the equation:
$$LCCC = \frac{2 \cdot r \cdot s_x \cdot s_y}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}$$
where $s_x$ and $s_y$ are the standard deviations of the observed values and their predictions, and $\bar{x}$ and $\bar{y}$ are the averages of the observed values and their predictions. The LCCC value indicates the degree of agreement between the observed data and their prediction.
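The six validation statistics defined above can be computed together, as in the following sketch. Two details are assumptions, because the text does not specify them: quartiles are taken with linear interpolation (the "inclusive" method of the Python `statistics` module), and the variances in LCCC use the population form (division by n). The product r·s_x·s_y is computed as the covariance, to which it is equal.

```python
import math
from statistics import mean, quantiles


def validation_stats(observed, predicted):
    """Return R2, RMSE, RPIQ, bias, Sb, and LCCC for paired validation data."""
    n = len(observed)
    x_bar, y_bar = mean(observed), mean(predicted)

    ssr = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    sst = sum((x - x_bar) ** 2 for x in observed)
    rmse = math.sqrt(ssr / n)

    # Assumed quartile rule: linear interpolation between data points.
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1

    bias = sum(x - y for x, y in zip(observed, predicted)) / n

    # Population (1/n) moments; r * s_x * s_y equals the covariance s_xy.
    sx2 = sst / n
    sy2 = sum((y - y_bar) ** 2 for y in predicted) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(observed, predicted)) / n

    return {
        "R2": 1.0 - ssr / sst,
        "RMSE": rmse,
        "RPIQ": iqr / rmse,   # undefined for a perfect prediction (RMSE = 0)
        "bias": bias,
        "Sb": bias / iqr,
        "LCCC": 2.0 * sxy / (sx2 + sy2 + (x_bar - y_bar) ** 2),
    }


# Toy example: every prediction is exactly one unit too high.
stats = validation_stats([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
print(stats)
```

The toy case shows why several statistics are reported side by side: a constant offset leaves the linear correlation at 1, while R2, bias, and LCCC all register the disagreement.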

2.3. Hierarchical Summary of Prediction Positions

The relatively large number of modeled variables and the use of three different modeling algorithms justify the need to hierarchize the prediction accuracy of the variables in order to compare their practical utility. The hierarchization procedure was implemented using the technique for order of preference by similarity to ideal solution (TOPSIS) methodology [40]. This technique is used to rank objects described by multidimensional criteria through determination of their distance in multidimensional space from the potentially most favorable and least favorable variants. These variants are determined by identifying, for each criterion, the highest and lowest rated values. As a result, two (not necessarily existing in the dataset) variants are created with the best and the worst properties. The distance, in the multidimensional space, from both of them is a measure of the quality of prediction of the variable, according to the formula:
$$T_{Index} = \frac{Dist_{WORST}}{Dist_{WORST} + Dist_{BEST}}$$
where $T_{Index}$ is a ranking coefficient, $Dist_{WORST}$ is the Euclidean distance in multidimensional space from the worst variant, and $Dist_{BEST}$ is the Euclidean distance in multidimensional space from the most favorable variant.
The criteria used in the TOPSIS method to characterize the prediction quality should be independent of the scale of the values. Thus, R2, RPIQ, Sb, and LCCC statistics scaled to the interval [0, 1] were used. The prediction accuracy of the model is highly dependent on the number of data, their distribution, and their range of variability. For this reason, the hierarchization performed is of relative importance, depending on the type and homogeneity of the data.
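The ranking coefficient can be illustrated with a compact sketch. It assumes that every criterion has already been rescaled to [0, 1] so that larger values are better; how Sb, for which values near zero are preferable, was converted to such a score is not described in this text, so that step is left outside the example. Equal criterion weights are also an assumption.

```python
import math


def topsis_index(scores):
    """TOPSIS ranking coefficient for rows of criteria scaled to [0, 1], higher = better."""
    n_criteria = len(scores[0])
    # Ideal and anti-ideal variants are built per criterion; they need not exist in the data.
    best = [max(row[j] for row in scores) for j in range(n_criteria)]
    worst = [min(row[j] for row in scores) for j in range(n_criteria)]

    result = []
    for row in scores:
        dist_best = math.dist(row, best)
        dist_worst = math.dist(row, worst)
        # Undefined if all rows are identical (both distances are zero).
        result.append(dist_worst / (dist_worst + dist_best))
    return result


# Toy scores for three hypothetical variables and two criteria.
t_index = topsis_index([[1.0, 1.0], [0.0, 0.0], [0.5, 0.5]])
print(t_index)
```

A value of 1 means the variable coincides with the most favorable variant and 0 with the least favorable one, so the index orders variables within the analyzed set rather than on an absolute scale.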

2.4. Importance of Spectral Bands to Soil Property Estimation

The multidimensionality of spectral data is a significant obstacle in some modeling algorithms. The reduction of the dimensionality of the input data is a component of linear models (PCA, PLS). It is always associated with determining the parts of the spectrum that are important for obtaining a good estimation of the sought relationship between the spectral response and soil properties.
In PLSR models, the significance of individual parts of the spectrum for the estimation of a soil variable can be determined by calculating the indicator referred to as variable importance in projection (VIP) scores [30,32]. VIP scores, as a vector of numerical values with a length equal to the number of spectral registration points, can be calculated a posteriori. The sum of squares of the VIP scores for the spectrum is equal to the number of spectrum components. VIP scores less than 1.0 indicate a lower than average importance of a particular input value. Components with VIP scores greater than 1.0 are important for modeling. To assess the importance of spectral bands, calculations of VIP scores for 11 selected soil features (SOC, total N, Ca, Al, Fe, Cr, Cu, Pb, Zn, and Hg contents) were performed during the creation of PLSR models for these properties.
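The property that the squared VIP scores sum to the number of inputs follows directly from the commonly used VIP definition, which weights the normalized PLS weight of each input by the share of output variance explained by each component. The sketch below uses that definition with hypothetical inputs; whether the MATLAB routine used by the authors applies exactly this form is an assumption.

```python
import math


def vip_scores(weights, explained_ss):
    """VIP scores from PLS weights.

    weights[a][j]   : weight of input j in component a (hypothetical values here)
    explained_ss[a] : sum of squares of the output explained by component a
    """
    n_inputs = len(weights[0])
    total_ss = sum(explained_ss)
    norms_sq = [sum(w * w for w in w_a) for w_a in weights]

    scores = []
    for j in range(n_inputs):
        weighted = sum(
            ss_a * w_a[j] ** 2 / norm_sq
            for w_a, ss_a, norm_sq in zip(weights, explained_ss, norms_sq)
        )
        scores.append(math.sqrt(n_inputs * weighted / total_ss))
    return scores


# Toy model: three inputs, two components.
vip = vip_scores([[3.0, 4.0, 0.0], [0.0, 0.0, 2.0]], [3.0, 1.0])
print(vip, sum(v * v for v in vip))
```

Since the squares average to 1.0 by construction, the threshold of 1.0 used in the text separates inputs of above-average importance from those of below-average importance.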
An independent analysis of the significance of spectral bands was performed with the neighborhood components analysis (NCA) algorithm. This algorithm, not related to a specific modeling method, is used to select the number of input variables a priori, especially in the case of a limited number of data [32,41,42,43]. In order to compare the results of this approach with VIP scores, calculations were performed on the same dataset.

3. Results

3.1. Results of Modeling of Texture

The search for an effective model to estimate soil texture is a frequent topic in soil science research [44]. The estimates of the content of the different fractions vary in terms of RMSE values (Table 1). Validation statistics for all algorithm groups analyzed are relatively weak (R2 < 0.75 and RPIQ < 2.0); R2, RPIQ, and LCCC signal significant agreement between predictions and observations, but RMSE is relatively high. In our study, the individual models unanimously indicate that sand has the largest prediction error, while silt and clay have much smaller errors. This is an indirect indication that when soil texture needs to be determined, it is safest to determine it using these two variables. However, it should be assumed that this result is specific to the AfSIS dataset and that it results from a much larger spread of sand content compared to the ranges of silt and clay. With similar R2 values, the RMSEs for silt and clay fractions are significantly smaller. Moreover, the distribution of sand content is characterized by negative asymmetry.

3.2. Results of Modeling of Properties from “Water” Group

The complicated and lengthy process of laboratory determination of water characteristics of soils using sand boxes and ceramic plates justifies the search for alternative means of estimating them. The validation statistics of the prediction models for the “Water” group indicate that the models based on MIR provide a reasonably good basis for estimation of these quantities (Table 2). RPIQs exceeding 2.0 signal a relatively good approximation of all relevant properties. Saturation at 0 and pF2.0 were estimated with the least error by the GRNN model, and soil moisture at pF2.5 and pF4.2 by the 1DCNN model, while the most accurate bulk density estimate is provided by the PLSR model.

3.3. Results of Modeling of C and N Contents

C and N concentrations in soils are the soil properties most commonly modeled, based on NIR and MIR spectral response [2,3]. This is due to the important role of both elements in shaping other soil properties, the role of N as a nutrient, and the importance of C sequestration in soils under conditions of increasing CO2 content in ambient air. Validation statistics (Table 3, Figure 4) signal that, in the AfSIS dataset, the concentration of total C and N, as well as SOC, can be determined with relatively high accuracy, even when the compound content of both elements in soils is low. RMSE estimates of 0.33% for C and 0.03% for N provide good approximations of the true values. Validation results indicate that (among those tested) the best model for estimating C and N is 1DCNN. This provides reason to believe that the relationship between the spectrum and SOC concentration (and associated N) is nonlinear, as indicated by the weaker result of the PLSR model. The result of the GRNN model is also relatively weak.

3.4. Results of Modeling of Properties from “Elements” Group

Table 4 contains validation statistics for the prediction models of the 32 elements. The validation results vary widely, although there are several elements whose concentrations can be estimated with relatively high accuracy. The RPIQ value indicates that a small relative error characterizes the prediction of Fe, Al, Ga, and Rb (RPIQ > 2.0). However, analyzing the RMSE level shows that, despite low values of RPIQ or R2, the content of other macroelements can also be predicted with sufficient accuracy, taking into account the scale of data variability (data spread). Comparison of the validation results from the three models indicates that the GRNN model dominates in terms of prediction quality. There are six exceptions to this rule: Na, K, Ca, and Se, for which better prediction results were obtained with the PLSR method; and Se and Bi, for which a slightly better result was obtained with the 1DCNN model.

3.5. Results of Modeling of Properties from “Mehlich” Group

Table 5 shows the prediction statistics of chemical properties of soils from the “Mehlich” group. Apart from Al content and pH value, the RPIQ of other characteristics does not exceed a value of 2.0. A relatively good validation result is shown by the model of Ca and K content (PLSR), and Mg content (1DCNN).

3.6. Relative Prediction Positions

Table 6 contains hierarchical summaries of the relative prediction positions of soil variables using the TOPSIS method considering the R2, RPIQ, LCCC, and Sb criteria. A high position in the hierarchy indicates a prediction result more consistent with the observations. Hierarchization takes into account all validation results. A higher TIndex value for the same predicted element indicates an algorithm that more accurately approximates the variable. Despite some differences in the position of individual variables, a relatively narrow group of variables can be predicted with the greatest precision: SOC, Al and M3 Al, Fe, K, pH, and total N. Ranks of prediction of soil texture elements and water properties are relatively low. However, this highly formalized hierarchical procedure does not invalidate the usefulness of lower-ranked models, which should be further evaluated. This applies, for example, to the prediction of Ca or pF, especially at higher values of these variables.
Spectral response is used for the estimation of soil properties in digital mapping of the environment as well as in the management of field operations. The use of spectral analysis in laboratory conditions should be included in a similar category as it allows rapid, inexpensive, and approximate estimation of some soil properties in the whole soil profile [45,46]. An important problem is that of identifying the optimal modeling algorithm, as well as the limiting prediction precision, taking into account the minimum documentation requirements.

3.7. Impact of Relationships between Properties

It can be assumed that in the field of spectral analysis used for quantitative estimation of soil properties, there are three categories of variables: those that affect the spectral response directly to the extent that they can be quantitatively identified; those that are correlated with properties that affect the spectral response and thus can be quantitatively estimated; and those that do not affect the spectral response and are not correlated with properties that affect the spectral response. Results of previous studies indicate that, for soils, the first category includes organic carbon compounds [47,48], when using both NIR and MIR spectra. For other features, the possibility of belonging to each category has to be evaluated experimentally.
The linkages between soil properties that enable or interfere with the quantitative identification of some components can be traced by confronting the statistics of the prediction results with the cluster system generated by the variable agglomeration algorithm. Figure 5 shows the dendrogram of associations between soil properties obtained by Ward’s agglomeration procedure using the 1 − r2 metric [49,50]. The algorithm groups statistically related variables using a metric that expresses the similarity or dissimilarity between variables. The value of the association distance indicates the degree of similarity of the variables in the dataset. A cluster composed of mutually correlated variables Ti, Bi, V, Cu, Fe, Th, Pb, and Mn in models using MIR is characterized by coefficients of determination of 0.67, 0.35, 0.63, 0.59, 0.84, 0.76, 0.64, and 0.62, respectively. The highest R2 value of the Fe prediction leader gives reason to believe that the variability of its content reflects to some extent the variability of the other elements of the cluster, supporting their prediction. The cluster related to Fe is in turn related to the cluster including M3 Cu, Pr, Ni, Cr, M3 Fe, Ex. Ac., PSI, M3 Al, and Al (R2: 0.50, 0.33, 0.66, 0.68, 0.49, 0.39, 0.78, 0.86, 0.79, respectively). There is also a cluster of variables related to soil pH, including pH, M3 Mg, M3 K, M3 Ca, Ex. Bas., and Ca. A characteristic group is formed by variables shaping water properties, including bulk density, pF2.0, pF2.5, pF4.2, SOC, total C, and total N. In this case, one might suspect that the supporting factor for quantitative identification in this group is total C, or SOC.
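The dissimilarity that feeds the agglomeration can be illustrated on its own. The sketch below computes 1 − r2 for every pair of variables; the variable names and values are toy placeholders, and the Ward linkage step itself, which would normally be delegated to a clustering library, is not reproduced here.

```python
import math


def pearson_r(x, y):
    """Linear correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0.0 else 0.0


def one_minus_r2(columns):
    """Pairwise dissimilarity 1 - r^2 between named variables."""
    names = list(columns)
    return {
        (a, b): 1.0 - pearson_r(columns[a], columns[b]) ** 2
        for a in names
        for b in names
    }


# Toy variables (hypothetical names, not AfSIS measurements).
toy = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with A
    "C": [4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with A
    "D": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with A
}
dist = one_minus_r2(toy)
print(dist[("A", "B")], dist[("A", "C")], dist[("A", "D")])
```

Because r is squared, strongly negatively correlated variables are treated as being as close as strongly positively correlated ones, which suits the purpose here: either kind of dependence can support prediction of one property through another.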
The graphs in Figure 6 illustrate the statistical relationships between Fe content and the concentration of other metals (Al, V, Mn, Cu, Pb, and Th). It can be assumed that the model of these elements’ content is to some extent related to the content of Fe, which is well quantified by the models. It cannot be excluded that the chain of connections between particular properties is more complex and may lead to correct estimation of features poorly influencing the spectral response or not influencing it at all.
Similar relationships exist between the exchangeable bases and the elements included in them (Figure 7). The exchangeable bases value is shaped by Ca to the largest extent, although the association with K and Mg is also significant. Calcium content, fairly well identified through the MIR spectral response, is nonlinearly related to pH. It is difficult to indicate which of these elements is the primary basis for correct modeling of soil sorption elements.

3.8. Importance of Spectral Bands in Prediction of Selected Properties

Figure 8 shows the diagrams of squares of VIP scores and NCA score values for 11 soil characteristics, including the concentration of heavy metals. The diagrams reflect the properties of the PLSR and NCA algorithms. In the case of PLSR, most orthogonalized inputs are used in the construction of the model. In the NCA procedure, by contrast, only spectral sections with sufficiently high NCA scores are selected. In the analyzed cases, the number of values indicated by this procedure ranged from 12 to 54.
It can be seen that SOC and total N are determined from the same spectral segments. It can also be concluded that the concentrations of Al, and especially Fe, are related to the sections of the spectrum associated with heavy metals.

4. Discussion

Most soil data used for classification purposes or for making economic decisions are interpreted in a semi-fuzzy way, or at least there is a tendency to blur the boundaries between classification criteria. The evaluation of modeling results for many soil characteristics must therefore allow for a certain level of acceptable error. In this context, the estimation of soil texture components, such as silt and clay contents, using the models described in this paper can be considered sufficiently precise. The correctness of the models of soil water properties, which are extremely time-consuming to determine in the laboratory using classical approaches, should be assessed in the same way.
The modeling of SOC concentration based on NIR and MIR signatures is understandably of the greatest interest to researchers [2,3,4,51]. It can be assumed that, as an important component of soil organic compounds, SOC and total N content can be estimated with satisfactory accuracy. In terms of total element prediction, only a few elements can be estimated with reasonable accuracy: Al, Fe, Ca, and K, along with several elements usually found in soils in low concentrations, namely Se, Rb, Ga, Zr, La, and Th. The contents of Cr, Ti, Mn, Ni, and Pb are estimated with slightly lower accuracy.
In modeling the “Mehlich” data group (17 variables), the 1DCNN (10 variables with the best prediction statistics) and GRNN (7 variables with the best prediction statistics) algorithms performed best. The PLSR model was inferior to the other two in every case. Particularly good prediction scores apply to: M3 Al, M3 B, M3 Mg, and pH (1DCNN model); and to total exchangeable bases, M3 Ca, M3 K, M3 Na, and PSI (GRNN model).
The results presented indicate that, among the examined models, no single one provides the best predictions for all variables. Each model shows advantages in predicting a certain group of variables and is less useful in predicting others. This has certainly been influenced by the heterogeneity of data from many different countries, which is evident in the very large spread of variable values [24]. Results for data with less dispersed distributions, covering more than 100 variables from across the United States (Kellogg Soil Survey Laboratory [8]), show in many cases similar coefficients of determination, whereas the AfSIS data are associated with much larger RMSE values. Comparisons between the two sets of results are difficult, mainly because the spread of the Kellogg Laboratory data is smaller than that of AfSIS, often by several times, and also because a slightly wider spectral range was used for the Kellogg data. Data from the USA were also used by Wijewardane et al. [9], who reported satisfactory results for prediction of C, N, CEC, pH, and clay content, but poor results for extractable forms of phosphorus and potassium. In this case, the results (RMSE) are similar to those obtained from the AfSIS data, presumably also due to less restrictive data selection conditions that did not limit the spread of the data. AfSIS observational data have been used in very advanced digital mapping attempts. Hengl et al. [25] presented a modeling methodology to generate a map of soil properties from, among other sources, data collected in the AfSIS project. They used an ensemble of machine learning regression models whose inputs were topographic, hydrological, meteorological, and vegetation data, as well as spectral reflectance images collected by the Sentinel satellites (Copernicus program). Nineteen variables were modeled, mainly Mehlich 3 extractable components, pH, texture, CEC, C, and N contents.
Results of modeling based on spectra recorded at satellite altitude were slightly worse than those obtained by proximal sensing. The disadvantage of this approach is that quantitative identification of variables is limited to the surface layer. The indisputable advantage, however, is the possibility of relatively precise representation of many soil properties with good spatial resolution over a very large area.
If we arbitrarily take R2 ≥ 0.75 and RPIQ ≥ 2.0 as the criteria of correct prediction, then for PLSR models they are fulfilled for 10 of the studied soil properties. For nine variables, these criteria are met when 1DCNN is used, while among predictions using the GRNN model, the adopted criteria are met for seven variables. For six variables, the assumed criteria are met by all tested algorithms. These variables are SOC, total C, Al, Fe, M3 Al, and pH.
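A check against such criteria can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: R2 is computed here as 1 − SSE/SST (the paper may instead use the squared correlation coefficient), RPIQ is taken as the interquartile range of the observed values divided by the RMSE, and both thresholds are treated as inclusive. The function names and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, quantiles


def validation_stats(observed, predicted):
    """Return (R2, RMSE, RPIQ) for paired observed and predicted values."""
    n = len(observed)
    residual_ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    obs_mean = mean(observed)
    total_ss = sum((o - obs_mean) ** 2 for o in observed)
    rmse = sqrt(residual_ss / n)
    r2 = 1.0 - residual_ss / total_ss  # assumption: R2 = 1 - SSE/SST
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    rpiq = (q3 - q1) / rmse  # ratio of performance to interquartile range
    return r2, rmse, rpiq


def meets_criteria(r2, rpiq, r2_min=0.75, rpiq_min=2.0):
    """Apply the two arbitrary thresholds discussed in the text."""
    return r2 >= r2_min and rpiq >= rpiq_min


# Hypothetical validation data, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2, rmse, rpiq = validation_stats(observed, predicted)
print(f"R2={r2:.2f} RMSE={rmse:.3f} RPIQ={rpiq:.2f} ok={meets_criteria(r2, rpiq)}")
```

Because RPIQ scales the error by the interquartile range rather than the standard deviation, it is less sensitive to the strongly skewed distributions typical of these soil variables.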
The decision to use a model based on spectral analysis instead of the reference method is inevitably subjective. Using validation statistics, it is possible to estimate the risk of error based, for example, on R2 and RMSE values. Acceptance or rejection of the model as an approximation for classification or mapping of soils is determined by the mapping scale (spatial resolution) and the way in which the variability of soils is reflected—either as a continuous image or discrete representation with information about the mean value of properties and the range of their variability.
Assuming a symmetrical distribution of deviations from observed values, properties such as sand, clay; total C, N, and SOC contents; and bulk density can be estimated with sufficient precision for large-scale soil mapping. In addition, the Ca content error of 0.5% relative to sample weight is small enough even for assessing the cultivation needs of soils, although the K content (error of 0.7%) is estimated with relatively low accuracy. Fe compounds rarely show signs of deficiency; therefore, an error of 1% is not significant in practice. Similarly, Al concentration itself does not pose a significant problem; it only becomes important once the soil pH is below 5.0, when the presence of aluminum compounds is a threat to crops. The accuracy of estimates of other variables must be considered depending on the purpose and scale of the documentation required. For example, in the context of investigating threats to soils with high metal contents, even estimates of relatively low accuracy may be sufficient to determine the extent of transformation.
For large soil datasets, especially those from diverse geological settings, a significant asymmetry in the distribution of soil characteristics, usually right-skewed, is typical. The aforementioned data from the US territory analyzed by Ng et al. [8] possess such a characteristic. Towett et al. [24], characterizing the chemical properties of the AfSIS data, noted the strong skewness of the distributions of many soil variables. Relative to the distribution of US data, the spread of AfSIS observations is, in most cases, much larger (in extreme cases, several tens of times). This has a significant impact on the level of prediction errors, especially at low values of the variables. In such circumstances, one of the solutions used is to exclude extremely high values (outliers) from the dataset. Such an operation is controversial, given that high values of the variables do occur in reality. Figure 9 illustrates, based on the analyzed data, the relationship between the prediction error of the model with the most favorable characteristics (RMSE) and the prediction errors after removing the values considered as outliers, i.e., those exceeding the assumed cut-off value (HL), defined from the third quartile (Q3) and the interquartile range (IQR):
HL = Q3 + 1.5 · IQR
The graph shows two versions of this approach: the RMSE ratio after removing extreme values from the validation data after prediction with a model derived from the entire data (RMSEOUT); and the RMSE ratio after removing extreme values from the training and validation data before training the model (RMSEmodOUT).
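The first of these variants can be sketched as follows. This is a minimal illustration under assumptions that the text does not confirm: the cut-off is computed from the observed validation values, and the ratio is formed as the RMSE without outliers divided by the RMSE on the full validation data. The function names and sample values are hypothetical. The second variant (RMSEmodOUT) would additionally require retraining the model on the truncated training set and is not shown.

```python
from math import sqrt
from statistics import quantiles


def upper_cutoff(values):
    """Cut-off HL = Q3 + 1.5 * IQR above which values are treated as outliers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def rmse(observed, predicted):
    """Root mean squared error of paired observed and predicted values."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(squared / len(observed))


def rmse_ratio_without_outliers(observed, predicted):
    """RMSE after dropping validation samples above HL, relative to full RMSE."""
    hl = upper_cutoff(observed)
    kept = [(o, p) for o, p in zip(observed, predicted) if o <= hl]
    obs_kept, pred_kept = zip(*kept)
    return rmse(obs_kept, pred_kept) / rmse(observed, predicted)


# Hypothetical right-skewed validation data with one extreme value.
observed = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 30.0]
predicted = [1.5, 2.5, 1.5, 3.5, 2.5, 4.5, 20.0]
ratio = rmse_ratio_without_outliers(observed, predicted)
print(f"HL={upper_cutoff(observed):.2f} ratio={ratio:.3f}")
```

In this toy case a single extreme value dominates the full-data RMSE, so the ratio falls well below 1. This mirrors the pattern described for strongly skewed variables.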
A significant variation in the ratios of RMSE values can be seen. As a rule, the error after removing outliers before training is smaller than the RMSE for the full data and the RMSE after removing outliers from the validation data after training on the full data. The exceptions are clay, Sat. at 0, K, M3 Al, and pH, for which limiting the range of variables in the learning set worsened the validation results. For sand and total N content, the exclusion of extreme feature values did not significantly affect the validation error.
The largest decreases in RMSE are for M3 Na, M3 K, M3 B, Hg, and Zr. Other RMSEs are closer to the variant with removal of outliers after training on the whole data. The problem of outliers for prediction of soil property values is debatable. The increase in accuracy due to truncation of large data values limits the applicability of the model, especially for memory-based models. In extreme cases, the construction of two different models should be considered [21]: one for data close to the modal and mean value, and a separate one for very high values of the variables. However, for models built for high property values, the problem will be the small number of learning examples.

5. Conclusions

Each of the models examined possesses a specific suitability for prediction of different soil variables. It should be assumed that differences in the correctness of estimation of particular variables depend on the type of relationship between the variable and the spectral response. This means that, due to the specificity of the relationship between the MIR spectrum and soil properties, it is not possible to indicate a universal optimal prediction model for all soil variables. Because of the wide range of the variable distributions, this result should be treated as specific for particular geological, morphological, and climatic conditions.
The best prediction results are provided by all regression models for total C, SOC, total Fe, Al, M3 Al, and pH. The clustering and modeling results indicate that many other soil properties do not directly affect the MIR spectral response, but there are correlations of features that can support such a prediction with limited accuracy.
Asymmetry in the distribution of the modeled variables makes it difficult to build a valid regression model. Removing outliers from the training set results in a significant reduction of the prediction error in many cases. However, this operation limits the universality of the model to a range relatively close to the median. Consideration of building two different models for different ranges of data variability is conditioned by the disproportion of the number of examples close to the median and outliers.

Author Contributions

Conceptualization, S.G.; data curation, S.G.; formal analysis, S.G.; funding acquisition, S.G. and W.G.; investigation, S.G.; methodology, S.G. and W.G.; project administration, S.G.; resources, S.G. and W.G.; software, S.G. and W.G.; supervision, S.G.; validation, S.G.; visualization, W.G.; writing—original draft, S.G. and W.G.; writing—review and editing, S.G. and W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by a research subvention of AGH University of Science and Technology (Grant number 16.16.150.545).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

AfSIS Phase I was a collaborative project funded by the Bill and Melinda Gates Foundation, which aimed to provide a consistent baseline of soil information for monitoring soil ecosystem services in sub-Saharan Africa. Partners included: The Tropical Soil Biology and Fertility Institute of The International Center for Tropical Agriculture, The International Soil Reference and Information Center—World Soil Information, The Center for International Earth Science Information Network, The Earth Institute at Columbia University, and World Agroforestry (Center for Research in Agroforestry-ICRAF, previously The International Council for Research in Agroforestry).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Descriptive statistics of properties in the “Texture” group.
| Property | Units | Mean | SD | Min. | Max. | IQR | Skewness |
|---|---|---|---|---|---|---|---|
| Sand | % | 84.97 | 14.32 | 0.44 | 100.00 | 9.50 | −2.71 |
| Silt | % | 8.30 | 5.94 | 0.00 | 33.92 | 5.71 | 1.65 |
| Clay | % | 6.73 | 9.51 | 0.00 | 82.15 | 5.35 | 4.00 |
Table A2. Descriptive statistics of properties in the “Water” group.
| Property | Units | Mean | SD | Min. | Max. | IQR | Skewness |
|---|---|---|---|---|---|---|---|
| Sat. at 0 | g/g | 0.41 | 0.14 | 0.09 | 1.10 | 0.14 | 1.21 |
| pF2.0 | g/g | 0.27 | 0.12 | 0.02 | 0.71 | 0.17 | 0.58 |
| pF2.5 | g/g | 0.23 | 0.12 | 0.01 | 0.67 | 0.18 | 0.69 |
| pF4.2 | g/g | 0.15 | 0.11 | 0.00 | 0.58 | 0.15 | 1.10 |
| B. Den. | g/cm3 | 1.03 | 0.16 | 0.58 | 1.41 | 0.23 | −0.02 |
Table A3. Descriptive statistics of properties in the “CN” group.
| Property | Units | Mean | SD | Min. | Max. | IQR | Skewness |
|---|---|---|---|---|---|---|---|
| Total N | % | 0.08 | 0.08 | 0.00 | 0.58 | 0.07 | 2.46 |
| Total C | % | 1.21 | 1.31 | 0.09 | 11.19 | 1.11 | 2.69 |
| SON | % | 0.08 | 0.08 | 0.00 | 0.56 | 0.06 | 2.46 |
| SOC | % | 1.14 | 1.26 | 0.09 | 10.40 | 1.13 | 2.69 |
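Table A4. Descriptive statistics of properties in the “Elements” group.
| Property | Units | Mean | SD | Min. | Max. | IQR | Skewness |
|---|---|---|---|---|---|---|---|
| Na | mg/kg | 23,184.64 | 8764.41 | 9568.58 | 92,079.60 | 7023.58 | 3.60 |
| Al | mg/kg | 30,499.97 | 18,155.75 | 824.40 | 120,089.47 | 25,022.70 | 0.63 |
| Cl | mg/kg | 214.66 | 1210.23 | 8.90 | 28,193.55 | 111.00 | 21.85 |
| K | mg/kg | 11,055.74 | 12,002.48 | 147.40 | 97,368.89 | 14,449.81 | 2.29 |
| Ca | mg/kg | 7558.74 | 24,037.62 | 55.30 | 426,430.90 | 4772.05 | 11.01 |
| Ti | mg/kg | 3907.00 | 4195.35 | 2.10 | 56,099.97 | 2970.08 | 5.02 |
| V | mg/kg | 28.49 | 39.99 | 1.00 | 357.90 | 28.90 | 3.61 |
| Cr | mg/kg | 61.13 | 84.01 | 1.00 | 1148.79 | 55.30 | 6.08 |
| Mn | mg/kg | 390.29 | 451.06 | 5.40 | 3863.57 | 408.00 | 2.61 |
| Fe | mg/kg | 26,233.94 | 27,297.12 | 74.30 | 181,690.38 | 26,458.38 | 2.43 |
| Ni | mg/kg | 19.40 | 27.15 | 0.30 | 286.00 | 20.70 | 4.43 |
| Cu | mg/kg | 14.91 | 15.06 | 0.70 | 128.40 | 15.60 | 2.67 |
| Zn | mg/kg | 25.66 | 27.97 | 0.90 | 349.00 | 27.40 | 4.38 |
| Ga | mg/kg | 7.18 | 5.95 | 0.20 | 28.20 | 9.20 | 0.95 |
| Se | mg/kg | 886.63 | 16.80 | 877.20 | 1000.00 | 3.20 | 6.47 |
| Rb | mg/kg | 54.33 | 47.21 | 1.30 | 351.30 | 60.10 | 1.64 |
| Sr | mg/kg | 85.46 | 155.68 | 0.40 | 1984.91 | 77.40 | 5.57 |
| Y | mg/kg | 10.73 | 11.41 | 0.20 | 105.40 | 11.20 | 2.76 |
| Zr | mg/kg | 65.63 | 114.80 | 19.70 | 950.41 | 5.10 | 4.03 |
| Ba | mg/kg | 1788.02 | 2985.26 | 16.50 | 42,787.14 | 2237.29 | 6.50 |
| La | mg/kg | 1974.23 | 4428.23 | 1.90 | 65,385.60 | 1506.81 | 8.13 |
| Ce | mg/kg | 48.05 | 48.38 | 1.20 | 413.30 | 58.30 | 1.96 |
| Pr | mg/kg | 2.04 | 3.41 | 0.50 | 29.70 | 0.60 | 4.00 |
| Nd | mg/kg | 8.14 | 8.80 | 0.90 | 56.60 | 9.40 | 2.15 |
| Sm | mg/kg | 148.84 | 737.65 | 0.60 | 9806.80 | 19.20 | 8.40 |
| Hf | mg/kg | 2.90 | 4.37 | 0.20 | 52.90 | 3.30 | 4.36 |
| Ta | mg/kg | 4.54 | 6.68 | 0.10 | 64.30 | 4.10 | 5.24 |
| W | mg/kg | 0.93 | 2.24 | 0.20 | 30.70 | 0.30 | 7.25 |
| Hg | mg/kg | 1.97 | 4.85 | 0.10 | 36.80 | 0.40 | 4.05 |
| Pb | mg/kg | 37.51 | 67.18 | 0.40 | 686.80 | 34.60 | 5.63 |
| Bi | mg/kg | 1.18 | 3.06 | 0.10 | 37.50 | 0.90 | 6.10 |
| Th | mg/kg | 39.82 | 91.74 | 0.20 | 818.50 | 29.20 | 4.89 |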
Table A5. Descriptive statistics of properties in the “Mehlich” group.
| Property | Units | Mean | SD | Min. | Max. | IQR | Skewness |
|---|---|---|---|---|---|---|---|
| EC | dS/m | 0.13 | 0.31 | 0.01 | 3.88 | 0.08 | 8.11 |
| Ex. Ac. | cmol/kg | 0.50 | 0.58 | 0.05 | 4.21 | 0.52 | 2.39 |
| Ex. Bas. | cmol/kg | 14.03 | 23.63 | 0.05 | 186.95 | 10.14 | 3.40 |
| M3 Al | mg/kg | 842.34 | 488.75 | 3.00 | 3041.00 | 660.25 | 1.00 |
| M3 B | mg/kg | 0.48 | 1.71 | 0.00 | 23.55 | 0.35 | 8.82 |
| M3 Ca | mg/kg | 2096.97 | 4126.19 | 0.00 | 35,200.00 | 1332.25 | 4.09 |
| M3 Cu | mg/kg | 1.74 | 2.11 | 0.00 | 23.70 | 1.86 | 4.65 |
| M3 Fe | mg/kg | 116.00 | 92.59 | 1.26 | 782.00 | 88.28 | 2.76 |
| M3 K | mg/kg | 167.05 | 251.24 | 5.50 | 3432.00 | 128.70 | 6.02 |
| M3 Mg | mg/kg | 309.07 | 412.41 | 0.00 | 4740.00 | 300.32 | 4.19 |
| M3 Mn | mg/kg | 101.61 | 104.46 | 1.05 | 592.00 | 116.33 | 1.67 |
| M3 Na | mg/kg | 130.06 | 998.77 | 0.00 | 23,000.00 | 29.75 | 21.22 |
| M3 P | mg/kg | 11.77 | 28.99 | 0.00 | 396.00 | 7.08 | 7.86 |
| M3 S | mg/kg | 30.03 | 212.38 | 0.62 | 3940.00 | 8.25 | 15.07 |
| M3 Zn | mg/kg | 1.46 | 1.61 | 0.00 | 23.80 | 0.96 | 6.42 |
| pH | | 6.19 | 1.09 | 3.61 | 9.72 | 1.33 | 0.68 |
| PSI | | 84.88 | 89.50 | −116.90 | 452.89 | 88.27 | 1.80 |

References

  1. Angelopoulou, T.; Balafoutis, A.; Zalidis, G.; Bochtis, D. From Laboratory to Proximal Sensing Spectroscopy for Soil Organic Carbon Estimation—A Review. Sustainability 2020, 12, 443.
  2. Gholizadeh, A.; Borůvka, L.; Saberioon, M.; Vašát, R. Visible, near-infrared, and mid-infrared spectroscopy applications for soil assessment with emphasis on soil organic matter content and quality: State-of-the-art and key issues. Appl. Spectrosc. 2013, 67, 1349–1362.
  3. Lazaar, A.; Mouazen, A.M.; Hammouti, K.E.L.; Fullen, M.; Pradhan, B.; Memon, M.S.; Andich, K.; Monir, A. The application of proximal visible and near-infrared spectroscopy to estimate soil organic matter on the Triffa Plain of Morocco. Int. Soil Water Conserv. Res. 2020, 8, 195–204.
  4. Viscarra Rossel, R.A.; Lobsey, C.R.; Sharman, C.; Flick, P.; McLachlan, G. Novel proximal sensing for monitoring soil organic C stocks and condition. Environ. Sci. Technol. 2017, 51, 5630–5641.
  5. Bellon-Maurel, V.; McBratney, A. Near-infrared (NIR) and mid-infrared (MIR) spectroscopic techniques for assessing the amount of carbon stock in soils—Critical review and research perspectives. Soil Biol. Biochem. 2011, 43, 1398–1410.
  6. Ramirez-Lopez, L.; Wadoux, A.M.J.-C.; Franceschini, M.H.D.; Terra, F.S.; Marques, K.P.P.; Sayão, V.M.; Demattê, J.A.M. Robust soil mapping at the farm scale with vis–NIR spectroscopy. Eur. J. Soil Sci. 2019, 70, 378–393.
  7. Viscarra Rossel, R.A.; Walvoort, D.J.J.; McBratney, A.B.; Janik, L.J.; Skjemstad, J.O. Visible, near infrared, mid infrared or combined diffuse reflectance spectroscopy for simultaneous assessment of various soil properties. Geoderma 2006, 131, 59–75.
  8. Ng, W.; Minasny, B.; Jeon, S.-H.; McBratney, A. Mid-infrared spectroscopy for accurate measurement of an extensive set of soil properties for assessing soil functions. Soil Secur. 2022, 6, 100043.
  9. Wijewardane, N.K.; Ge, Y.; Wills, S.; Libohova, Z. Predicting physical and chemical properties of US soils with a mid-infrared reflectance spectral library. Soil Sci. Soc. Am. J. 2018, 82, 722–731.
  10. Margenot, A.J.; Calderón, F.J.; Parikh, S.J. Limitations and Potential of Spectral Subtractions in Fourier-Transform Infrared Spectroscopy of Soil Samples. Soil Sci. Soc. Am. J. 2016, 80, 10–26.
  11. Nath, D.; Laik, R.; Singh Meena, V.; Pramanick, B.; Kumar Singh, S. Can mid-infrared (mid-IR) spectroscopy evaluate soil conditions by predicting soil biological properties? Soil Secur. 2021, 4, 100008.
  12. Hati, K.M.; Sinha, N.K.; Mohanty, M.; Jha, P.; Londhe, S.; Sila, A.; Towett, E.; Chaudhary, R.S.; Jayaraman, S.; Vassanda Coumar, M.; et al. Mid-Infrared Reflectance Spectroscopy for Estimation of Soil Properties of Alfisols from Eastern India. Sustainability 2022, 14, 4883.
  13. Nduwamungu, C.; Ziadi, N.; Parent, L.-E.; Tremblay, G.F.; Thuries, L. Opportunities for, and limitations of, near infrared reflectance spectroscopy applications in soil analysis: A review. Can. J. Soil Sci. 2009, 89, 531–541.
  14. Asensio, V.; Guala, S.; Vega, F.A.; Covelo, E.F. A soil quality index for reclaimed mine soils. Environ. Toxicol. Chem. 2013, 32, 2240–2248.
  15. Klimkowicz-Pawlas, A.; Ukalska-Jaruga, A.; Smreczak, B. Soil quality index for agricultural areas under different levels of anthropopressure. Int. Agrophys. 2019, 33, 455–462.
  16. Hu, Z.; Haneklaus, S.; Sparovek, G.; Schnug, E. Rare Earth Elements in Soils. Commun. Soil Sci. Plant Anal. 2006, 37, 1381–1420.
  17. Ramos, S.J.; Dinali, G.S.; Oliveira, C.; Martins, G.C.; Moreira, C.G.; Siqueira, J.O.; Guilherme, L.R.G. Rare Earth Elements in the Soil Environment. Curr. Pollut. Rep. 2016, 2, 28–50.
  18. Xu, X.; Chen, S.; Ren, L.; Han, C.; Lv, D.; Zhang, Y.; Ai, F. Estimation of Heavy Metals in Agricultural Soils Using Vis-NIR Spectroscopy with Fractional-Order Derivative and Generalized Regression Neural Network. Remote Sens. 2021, 13, 2718.
  19. Yang, M.; Xu, Y.; Zhang, J.; Chen, H.; Liu, S.; Li, W.; Hao, Y. Near-Infrared Spectroscopic Study of Heavy-Metal-Contaminated Loess Soils in Tongguan Gold Area, Central China. Minerals 2020, 10, 89.
  20. Vågen, T.-G.; Winowiecki, L.; Walsh, M.G.; Desta, L.T.; Tondoh, J.E. Land Degradation Surveillance Framework (LSDF): Field Guide; International Center for Tropical Agriculture, World Agroforestry Centre, and the Earth Institute at Columbia University: New York, NY, USA, 2010.
  21. Vågen, T.-G.; Winowiecki, L.A.; Tondoh, J.E.; Desta, L.T.; Gumbricht, T. Mapping of soil properties and land degradation risk in Africa using MODIS reflectance. Geoderma 2016, 263, 216–225.
  22. Vågen, T.-G.; Shepherd, K.D.; Walsh, M.G.; Winowiecki, L.; Desta, L.T.; Tondoh, J.E. AfSIS Technical Specifications. Soil Health Surveillance. 2010. Available online: https://worldagroforestry.org/sites/default/files/afsisSoilHealthTechSpecs_v1_smaller.pdf (accessed on 10 April 2022).
  23. Leenaars, J.G.B.; van Oostrum, A.J.M.; Gonzalez, M.R. Africa Soil Profiles Database, Version 1.2. A Compilation of Georeferenced and Standardised Legacy Soil Profile Data for Sub-Saharan Africa (with Dataset); ISRIC Report 2014/01; Africa Soil Information Service (AfSIS) Project; ISRIC, World Soil Information: Wageningen, The Netherlands, 2014.
  24. Towett, E.K.; Shepherd, K.D.; Tondoh, J.E.; Winowiecki, L.A.; Lulseged, T.; Nyambura, M.; Sila, A.; Vågen, T.-G.; Cadisch, G. Total elemental composition of soils in Sub-Saharan Africa and relationship with soil forming factors. Geoderma Reg. 2015, 5, 157–168.
  25. Hengl, T.; Miller, M.; Križan, J.; Shepherd, K.; Sila, A.; Kilibarda, M.; Antonijević, O.; Glušica, L.; Dobermann, A.; Haefele, S.; et al. African soil properties and nutrients mapped at 30 m spatial resolution using two-scale ensemble machine learning. Sci. Rep. 2022, 11, 6130.
  26. Vågen, T.-G.; Winowiecki, L.A.; Desta, L.; Tondoh, J.E.; Weullow, E.; Shepherd, K.; Sila, A. Mid-Infrared Spectra (MIRS) from ICRAF Soil and Plant Spectroscopy Laboratory: Africa Soil Information Service (AfSIS) Phase I 2009–2013. World Agroforestry–Research Data Repository, V1. 2020. Available online: https://data.worldagroforestry.org/dataset.xhtml?persistentId=doi:10.34725/DVN/QXCWP1 (accessed on 10 April 2022).
  27. Summerauer, L.; Baumann, P.; Ramirez-Lopez, L.; Barthel, M.; Bauters, M.; Bukombe, B.; Reichenbach, M.; Boeckx, P.; Kearsley, E.; Van Oost, K.; et al. The central African soil spectral library: A new soil infrared repository and a geographical prediction analysis. Soil 2021, 7, 693–715.
  28. Vågen, T.-G.; Winowiecki, L.A.; Desta, L.; Tondoh, J.; Weullow, E.; Shepherd, K.; Sila, A.; Dunham, S.J.; Hernández-Allica, J.; Carter, J.; et al. Wet Chemistry Data for a Subset of AfSIS: Phase I Archived Soil Samples. World Agroforestry–Research Data Repository, V1. 2021. Available online: https://data.worldagroforestry.org/dataset.xhtml?persistentId=doi:10.34725/DVN/66BFOB (accessed on 10 April 2022).
  29. Mehlich, A. Mehlich 3 Soil Test Extractant. A Modification of the Mehlich 2 Extractant. Commun. Soil Sci. Plant Anal. 1984, 15, 1409–1416.
  30. Wold, S.; Sjöström, M.; Eriksson, L. PLS-regression: A basic tool of chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109–130.
  31. Leone, A.P.; Viscarra-Rossel, R.A.; Amenta, P.; Buondonno, A. Prediction of soil properties with PLSR and vis-NIR Spectroscopy: Application to Mediterranean soils from Southern Italy. Curr. Anal. Chem. 2012, 8, 283–299.
  32. MATLAB, Version 9.13.0 (R2022b); The MathWorks Inc.: Natick, MA, USA, 2022.
  33. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  34. Dhanoa, M.S.; Lister, S.J.; Sanderson, R.; Barnes, R.J. The link between multiplicative scatter correction (MSC) and standard normal variate (SNV) transformations of NIR spectra. J. Near-Infrared Spectrosc. 1994, 2, 43–47.
  35. Gruszczyński, S.; Gruszczyński, W. Supporting soil and land assessment with machine learning models using the Vis-NIR spectral response. Geoderma 2022, 405, 115451.
  36. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org (accessed on 1 January 2022).
  37. Chollet, F. Keras. 2015. Available online: https://keras.io (accessed on 15 June 2021).
  38. Specht, D.F. A general regression neural network. IEEE Trans. Neural Netw. 1991, 2, 568–576.
  39. Lin, L.I. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989, 45, 255–268.
  40. Hwang, C.-L.; Lai, Y.-J.; Liu, T.-Y. A new approach for multiple objective decision making. Comput. Oper. Res. 1993, 20, 889–899.
  41. Goldberger, J.; Roweis, S.; Hinton, G.; Salakhutdinov, R. Neighbourhood Components Analysis. 2022. Available online: https://www.cs.toronto.edu/~hinton/absps/nca.pdf (accessed on 4 November 2022).
  42. Sinaice, B.B.; Owada, N.; Saadat, M.; Toriya, H.; Inagaki, F.; Bagai, Z.; Kawamura, Y. Coupling NCA Dimensionality Reduction with Machine Learning in Multispectral Rock Classification Problems. Minerals 2021, 11, 846.
  43. Yang, W.; Wang, K.; Zuo, W. Neighborhood Component Feature Selection for High-Dimensional Data. J. Comput. 2012, 7, 161–168.
  44. Thomas, C.L.; Hernandez-Allica, J.; Dunham, S.J.; McGrath, S.P.; Haefele, S.M. A comparison of soil texture measurements using mid-infrared spectroscopy (MIRS) and laser diffraction analysis (LDA) in diverse soils. Sci. Rep. 2021, 11, 16.
  45. Debaene, G.; Bartmiński, P.; Niedźwiecki, J.; Miturski, T. Visible and Near-Infrared Spectroscopy as a Tool for Soil Classification and Soil Profile Description. Pol. J. Soil Sci. 2017, 50, 1–10.
  46. Xie, X.-L.; Li, A.-B. Identification of soil profile classes using depth-weighted visible–near-infrared spectral reflectance. Geoderma 2018, 325, 90–101.
  47. Francos, N.; Ogen, Y.; Ben-Dor, E. Spectral assessment of organic matter with different composition using reflectance spectroscopy. Remote Sens. 2021, 13, 1549.
  48. Ramírez, P.B.; Calderón, F.J.; Haddix, M.; Lugato, E.; Cotrufo, M.F. Using diffuse reflectance spectroscopy as a high throughput method for quantifying soil C and N and their distribution in particulate and mineral-associated organic matter fractions. Front. Environ. Sci. 2021, 9, 634472.
  49. De Amorim, R.C. Feature relevance in Ward’s hierarchical clustering using the Lp norm. J. Classif. 2015, 32, 46–62.
  50. TIBCO Software Inc. Statistica (Data Analysis Software System), Version 13. 2017. Available online: http://statistica.io (accessed on 15 June 2021).
  51. Zhou, W.; Li, H.; Wen, S.; Xie, L.; Wang, T.; Tian, Y.; Yu, W. Simulation of Soil Organic Carbon Content Based on Laboratory Spectrum in the Three-Rivers Source Region of China. Remote Sens. 2022, 14, 1521.
Figure 1. Locations of sampling areas in the AfSIS Phase I project. In each sampling area, a dozen sampling points were selected and the samples were taken from topsoil and subsoil layers.
Figure 2. Box plots for modeled soil properties divided into separate data groups, i.e., Elements, Texture, Water, CN, and Mehlich. Blue boxes represent values between first and third quartile, red line represents median value and red crosses represent values outside the range of first quartile minus 1.5 times interquartile range and third quartile plus 1.5 times interquartile range.
Figure 3. Dependence of the linear correlation coefficient between the reference and modeled values as a function of the spread value.
Figure 4. Scatter plots of C and N content prediction validation data using the 1DCNN model. Points on the red line indicate perfect prediction; deviations from the red line indicate prediction errors.
Figure 5. Dendrogram of associations between variables obtained by Ward’s agglomeration method.
Figure 6. Scatter plots of the content of selected elements as a function of Fe content.
Figure 7. Scatter plots of soil variables significantly correlated with the exchangeable bases and Ca content.
Figure 8. Distribution diagrams of squares of VIP scores and NCA scores for selected soil properties. The diagrams include index values exceeding 1.0. The diameters of points are proportional to the squares of VIP scores and NCA scores.
Figure 9. Example RMSE ratios.
Table 1. Summary of validation statistics for models in the “Texture” group. The best prediction result for particular variables is marked in bold.

| Property | PLSR R² | PLSR RMSE | PLSR RPIQ | PLSR LCCC | PLSR Sb | 1DCNN R² | 1DCNN RMSE | 1DCNN RPIQ | 1DCNN LCCC | 1DCNN Sb | GRNN R² | GRNN RMSE | GRNN RPIQ | GRNN LCCC | GRNN Sb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sand [%] | 0.62 | 8.83 | 1.08 | 0.79 | 0.05 | 0.60 | 9.02 | 1.05 | 0.73 | −0.02 | 0.54 | 9.65 | 0.98 | 0.72 | −0.01 |
| Silt [%] | 0.54 | 4.04 | 1.41 | 0.73 | 0.00 | 0.37 | 4.71 | 1.21 | 0.59 | 0.31 | 0.32 | 4.90 | 1.16 | 0.56 | 0.03 |
| Clay [%] | 0.64 | 5.72 | 0.93 | 0.80 | −0.04 | 0.64 | 5.74 | 0.93 | 0.77 | 0.28 | 0.66 | 5.54 | 0.97 | 0.79 | 0.03 |
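As a rough guide to how the statistics reported in Tables 1–5 can be obtained from paired reference and predicted values, the following sketch computes R², RMSE, RPIQ, and LCCC. Several points are assumptions rather than the authors' procedure: R² is taken as the coefficient of determination (1 − SS_res/SS_tot), RPIQ as the interquartile range of the reference values divided by RMSE, and LCCC as Lin's concordance correlation coefficient with 1/n variances. The Sb statistic is omitted because its definition does not appear in this part of the text. The function name `validation_stats` and the toy data are hypothetical.

```python
import math
import statistics


def validation_stats(reference, predicted):
    """R2, RMSE, RPIQ and LCCC for paired reference/predicted values."""
    n = len(reference)
    mean_ref = statistics.fmean(reference)
    mean_pred = statistics.fmean(predicted)

    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    rmse = math.sqrt(ss_res / n)

    # Assumption: R2 is the coefficient of determination, not squared correlation.
    r2 = 1.0 - ss_res / ss_tot

    # Assumption: RPIQ = IQR of the reference values / RMSE ("inclusive" quartiles).
    q1, _, q3 = statistics.quantiles(reference, n=4, method="inclusive")
    rpiq = (q3 - q1) / rmse if rmse > 0 else math.inf

    # Lin's concordance correlation coefficient with population (1/n) moments.
    var_ref = ss_tot / n
    var_pred = sum((p - mean_pred) ** 2 for p in predicted) / n
    cov = sum((r - mean_ref) * (p - mean_pred)
              for r, p in zip(reference, predicted)) / n
    lccc = 2.0 * cov / (var_ref + var_pred + (mean_ref - mean_pred) ** 2)

    return {"R2": r2, "RMSE": rmse, "RPIQ": rpiq, "LCCC": lccc}


# Hypothetical toy data: predictions with a constant offset of 0.5.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.5, 4.5, 5.5]
stats = validation_stats(reference, predicted)
print(stats)
```

In this toy case the predictions are perfectly correlated with the reference values but shifted, so LCCC (about 0.94) and R² (0.875) both fall below 1, which shows how these measures penalize systematic bias that a plain correlation coefficient would miss.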
Table 2. Summary of model validation statistics for the “Water” group. The best prediction result for particular variables is marked in bold.

| Property | PLSR R² | PLSR RMSE | PLSR RPIQ | PLSR LCCC | PLSR Sb | 1DCNN R² | 1DCNN RMSE | 1DCNN RPIQ | 1DCNN LCCC | 1DCNN Sb | GRNN R² | GRNN RMSE | GRNN RPIQ | GRNN LCCC | GRNN Sb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sat. at 0 [g/g] | 0.61 | 0.09 | 1.64 | 0.77 | 0.04 | 0.61 | 0.09 | 1.66 | 0.77 | −0.05 | 0.70 | 0.08 | 1.89 | 0.82 | 0.07 |
| pF2.0 [g/g] | 0.55 | 0.08 | 2.05 | 0.73 | 0.05 | 0.60 | 0.08 | 2.17 | 0.76 | 0.00 | 0.62 | 0.08 | 2.24 | 0.77 | 0.05 |
| pF2.5 [g/g] | 0.56 | 0.08 | 2.20 | 0.74 | 0.05 | 0.62 | 0.07 | 2.38 | 0.77 | −0.01 | 0.62 | 0.07 | 2.38 | 0.77 | 0.05 |
| pF4.2 [g/g] | 0.49 | 0.08 | 1.94 | 0.67 | 0.06 | 0.59 | 0.07 | 2.15 | 0.73 | 0.01 | 0.56 | 0.07 | 2.10 | 0.72 | 0.06 |
| B. Den. [g/cm³] | 0.76 | 0.08 | 2.96 | 0.86 | 0.01 | 0.74 | 0.08 | 2.83 | 0.85 | 0.03 | 0.73 | 0.08 | 2.76 | 0.84 | 0.00 |
Table 3. Summary of model validation statistics for C and N concentration in soils. The best prediction result for particular variables is marked in bold.

| Property | PLSR R² | PLSR RMSE | PLSR RPIQ | PLSR LCCC | PLSR Sb | 1DCNN R² | 1DCNN RMSE | 1DCNN RPIQ | 1DCNN LCCC | 1DCNN Sb | GRNN R² | GRNN RMSE | GRNN RPIQ | GRNN LCCC | GRNN Sb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total N [%] | 0.84 | 0.03 | 2.03 | 0.91 | 0.01 | 0.91 | 0.03 | 2.68 | 0.95 | 0.00 | 0.80 | 0.04 | 1.85 | 0.88 | −0.01 |
| Total C [%] | 0.92 | 0.38 | 2.95 | 0.96 | 0.00 | 0.93 | 0.34 | 3.25 | 0.97 | 0.05 | 0.87 | 0.47 | 2.37 | 0.93 | −0.01 |
| SOC [%] | 0.91 | 0.37 | 3.02 | 0.96 | 0.00 | 0.93 | 0.33 | 3.42 | 0.96 | 0.05 | 0.86 | 0.47 | 2.39 | 0.92 | −0.02 |
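Table 4. Summary of model validation statistics for the “Elements” group. The best prediction result for particular variables is marked in bold.

| Property | PLSR R² | PLSR RMSE | PLSR RPIQ | PLSR LCCC | PLSR Sb | 1DCNN R² | 1DCNN RMSE | 1DCNN RPIQ | 1DCNN LCCC | 1DCNN Sb | GRNN R² | GRNN RMSE | GRNN RPIQ | GRNN LCCC | GRNN Sb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Na [mg/kg] | 0.46 | 6452.1 | 1.09 | 0.64 | 0.04 | 0.36 | 7029.8 | 1.00 | 0.54 | 0.13 | 0.44 | 6553.6 | 1.07 | 0.62 | −0.01 |
| Al [mg/kg] | 0.77 | 8714.3 | 2.87 | 0.87 | −0.02 | 0.76 | 8854.0 | 2.83 | 0.86 | −0.01 | 0.79 | 8377.4 | 2.98 | 0.88 | 0.00 |
| Cl [mg/kg] | 0.00 | 1592.5 | 0.07 | 0.43 | −0.79 | 0.00 | 1199.5 | 0.09 | 0.06 | 0.44 | 0.30 | 1013.5 | 0.11 | 0.44 | −0.37 |
| K [mg/kg] | 0.83 | 4990.3 | 2.89 | 0.91 | −0.03 | 0.79 | 5437.6 | 2.66 | 0.89 | 0.07 | 0.72 | 6299.1 | 2.29 | 0.84 | −0.06 |
| Ca [mg/kg] | 0.95 | 5148 | 0.93 | 0.98 | −0.05 | 0.61 | 14,979 | 0.32 | 0.75 | −0.36 | 0.94 | 5767 | 0.83 | 0.97 | 0.04 |
| Ti [mg/kg] | 0.41 | 3203.3 | 0.93 | 0.69 | −0.05 | 0.36 | 3342.7 | 0.89 | 0.59 | 0.15 | 0.67 | 2422.0 | 1.22 | 0.81 | −0.04 |
| V [mg/kg] | 0.51 | 27.98 | 1.04 | 0.72 | −0.10 | 0.49 | 28.59 | 1.01 | 0.64 | −0.02 | 0.63 | 24.12 | 1.20 | 0.78 | −0.08 |
| Cr [mg/kg] | 0.40 | 64.72 | 0.85 | 0.60 | −0.10 | 0.30 | 70.07 | 0.79 | 0.47 | 0.05 | 0.68 | 47.08 | 1.17 | 0.81 | −0.02 |
| Mn [mg/kg] | 0.44 | 336.76 | 1.21 | 0.70 | −0.02 | 0.54 | 304.55 | 1.34 | 0.71 | 0.07 | 0.62 | 278.77 | 1.46 | 0.76 | −0.03 |
| Fe [mg/kg] | 0.79 | 12,398 | 2.13 | 0.89 | −0.01 | 0.79 | 12,624 | 2.10 | 0.86 | 0.07 | 0.86 | 10,274 | 2.57 | 0.92 | −0.01 |
| Ni [mg/kg] | 0.47 | 19.70 | 1.05 | 0.70 | −0.11 | 0.54 | 18.30 | 1.13 | 0.71 | −0.06 | 0.66 | 15.75 | 1.31 | 0.79 | −0.01 |
| Cu [mg/kg] | 0.55 | 10.04 | 1.56 | 0.72 | 0.01 | 0.52 | 10.38 | 1.50 | 0.72 | −0.01 | 0.59 | 9.64 | 1.62 | 0.74 | 0.00 |
| Zn [mg/kg] | 0.41 | 21.48 | 1.27 | 0.63 | −0.02 | 0.37 | 22.13 | 1.24 | 0.64 | −0.09 | 0.46 | 20.57 | 1.33 | 0.66 | −0.01 |
| Ga [mg/kg] | 0.68 | 3.36 | 2.73 | 0.82 | −0.04 | 0.63 | 3.59 | 2.56 | 0.79 | −0.12 | 0.73 | 3.07 | 2.99 | 0.84 | 0.00 |
| Se [mg/kg] | 0.71 | 9.00 | 0.36 | 0.82 | 0.24 | 0.59 | 10.75 | 0.30 | 0.72 | −0.96 | 0.81 | 7.37 | 0.43 | 0.89 | 0.11 |
| Rb [mg/kg] | 0.71 | 25.31 | 2.38 | 0.84 | −0.02 | 0.65 | 28.00 | 2.15 | 0.78 | 0.11 | 0.72 | 25.07 | 2.40 | 0.83 | −0.03 |
| Sr [mg/kg] | 0.42 | 118.85 | 0.65 | 0.68 | 0.01 | 0.63 | 94.37 | 0.82 | 0.76 | 0.10 | 0.49 | 111.22 | 0.69 | 0.63 | 0.02 |
| Y [mg/kg] | 0.42 | 8.69 | 1.30 | 0.63 | −0.05 | 0.40 | 8.83 | 1.27 | 0.59 | −0.05 | 0.48 | 8.23 | 1.37 | 0.65 | −0.02 |
| Zr [mg/kg] | 0.47 | 83.82 | 0.06 | 0.71 | −1.86 | 0.58 | 74.20 | 0.07 | 0.69 | 0.93 | 0.77 | 55.06 | 0.09 | 0.86 | 0.22 |
| Ba [mg/kg] | 0.39 | 2331.2 | 0.96 | 0.64 | −0.05 | 0.26 | 2560.7 | 0.87 | 0.42 | 0.03 | 0.44 | 2223.5 | 1.00 | 0.62 | 0.02 |
| La [mg/kg] | 0.48 | 3190.5 | 0.47 | 0.62 | 0.12 | 0.39 | 3447.5 | 0.44 | 0.52 | 0.34 | 0.82 | 1874.5 | 0.80 | 0.89 | 0.06 |
| Ce [mg/kg] | 0.36 | 38.65 | 1.51 | 0.67 | −0.04 | 0.41 | 37.10 | 1.57 | 0.67 | 0.05 | 0.56 | 31.97 | 1.83 | 0.75 | −0.06 |
| Pr [mg/kg] | 0.26 | 2.92 | 0.21 | 0.42 | 0.44 | 0.15 | 3.13 | 0.19 | 0.22 | 0.25 | 0.33 | 2.78 | 0.22 | 0.47 | 0.60 |
| Nd [mg/kg] | 0.42 | 6.69 | 1.40 | 0.68 | −0.04 | 0.41 | 6.73 | 1.40 | 0.61 | −0.05 | 0.62 | 5.42 | 1.73 | 0.78 | −0.06 |
| Sm [mg/kg] | 0.34 | 600.16 | 0.03 | 0.46 | 1.75 | 0.17 | 671.35 | 0.03 | 0.24 | 4.85 | 0.58 | 477.83 | 0.04 | 0.72 | 0.65 |
| Hf [mg/kg] | 0.14 | 4.04 | 0.82 | 0.54 | −0.16 | 0.05 | 4.26 | 0.77 | 0.29 | 0.27 | 0.40 | 3.39 | 0.97 | 0.59 | −0.08 |
| Ta [mg/kg] | 0.54 | 4.50 | 0.91 | 0.70 | 0.02 | 0.37 | 5.31 | 0.77 | 0.48 | 0.03 | 0.73 | 3.43 | 1.19 | 0.84 | 0.01 |
| W [mg/kg] | 0.37 | 1.78 | 0.17 | 0.57 | 0.07 | 0.18 | 2.03 | 0.15 | 0.31 | 0.43 | 0.41 | 1.72 | 0.17 | 0.64 | −0.11 |
| Hg [mg/kg] | 0.52 | 3.36 | 0.12 | 0.76 | −1.12 | 0.46 | 3.57 | 0.11 | 0.68 | −2.07 | 0.65 | 2.85 | 0.14 | 0.85 | −0.78 |
| Pb [mg/kg] | 0.34 | 54.29 | 0.64 | 0.59 | −0.02 | 0.39 | 52.50 | 0.66 | 0.53 | 0.22 | 0.70 | 36.57 | 0.94 | 0.81 | 0.00 |
| Bi [mg/kg] | 0.02 | 3.02 | 0.30 | 0.62 | −0.55 | 0.35 | 2.47 | 0.36 | 0.66 | 0.19 | 0.56 | 2.01 | 0.45 | 0.76 | −0.25 |
| Th [mg/kg] | 0.65 | 54.15 | 0.54 | 0.77 | 0.04 | 0.63 | 56.12 | 0.52 | 0.71 | 0.39 | 0.77 | 43.79 | 0.67 | 0.86 | 0.06 |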
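Table 5. Summary of model validation statistics for the “Mehlich” group. The best prediction result for particular variables is marked in bold.

| Property | PLSR R² | PLSR RMSE | PLSR RPIQ | PLSR LCCC | PLSR Sb | 1DCNN R² | 1DCNN RMSE | 1DCNN RPIQ | 1DCNN LCCC | 1DCNN Sb | GRNN R² | GRNN RMSE | GRNN RPIQ | GRNN LCCC | GRNN Sb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EC [dS/m] | 0.19 | 0.28 | 0.29 | 0.47 | −0.05 | 0.38 | 0.25 | 0.34 | 0.64 | −0.18 | 0.31 | 0.26 | 0.32 | 0.50 | 0.16 |
| Ex. Ac. [cmol/kg] | 0.40 | 0.45 | 1.15 | 0.58 | 0.02 | 0.55 | 0.39 | 1.33 | 0.73 | 0.02 | 0.33 | 0.48 | 1.09 | 0.57 | 0.07 |
| Ex. Bas. [cmol/kg] | 0.82 | 9.99 | 1.01 | 0.91 | 0.01 | 0.88 | 8.05 | 1.26 | 0.94 | 0.05 | 0.90 | 7.50 | 1.35 | 0.95 | 0.02 |
| M3 Al [mg/kg] | 0.84 | 192.32 | 3.43 | 0.92 | 0.01 | 0.86 | 181.99 | 3.63 | 0.92 | 0.01 | 0.85 | 189.46 | 3.48 | 0.92 | 0.00 |
| M3 B [mg/kg] | 0.39 | 1.33 | 0.26 | 0.55 | 0.11 | 0.70 | 0.93 | 0.37 | 0.80 | 0.23 | 0.37 | 1.35 | 0.26 | 0.51 | 0.21 |
| M3 Ca [mg/kg] | 0.82 | 1752.3 | 0.76 | 0.90 | 0.02 | 0.82 | 1741.9 | 0.76 | 0.90 | 0.04 | 0.89 | 1394.7 | 0.96 | 0.94 | 0.02 |
| M3 Cu [mg/kg] | 0.19 | 1.90 | 0.98 | 0.38 | −0.03 | 0.33 | 1.73 | 1.07 | 0.53 | −0.04 | 0.50 | 1.49 | 1.25 | 0.70 | −0.04 |
| M3 Fe [mg/kg] | 0.42 | 70.28 | 1.26 | 0.57 | 0.06 | 0.49 | 65.81 | 1.34 | 0.63 | 0.04 | 0.39 | 72.11 | 1.22 | 0.56 | 0.07 |
| M3 K [mg/kg] | 0.18 | 227.55 | 0.57 | 0.55 | −0.13 | 0.40 | 194.01 | 0.66 | 0.56 | 0.18 | 0.65 | 147.94 | 0.87 | 0.80 | −0.04 |
| M3 Mg [mg/kg] | 0.68 | 234.54 | 1.28 | 0.82 | −0.03 | 0.74 | 209.04 | 1.44 | 0.86 | 0.02 | 0.68 | 232.25 | 1.29 | 0.81 | 0.01 |
| M3 Mn [mg/kg] | 0.47 | 75.63 | 1.54 | 0.64 | 0.02 | 0.58 | 67.78 | 1.72 | 0.73 | 0.04 | 0.51 | 73.30 | 1.59 | 0.72 | −0.04 |
| M3 Na [mg/kg] | 0.00 | 1003.1 | 0.03 | 0.32 | 0.88 | 0.11 | 942.55 | 0.03 | 0.47 | 1.14 | 0.89 | 325.35 | 0.09 | 0.83 | 0.85 |
| M3 P [mg/kg] | 0.02 | 28.66 | 0.25 | 0.15 | −0.15 | 0.25 | 25.06 | 0.28 | 0.39 | 0.35 | 0.19 | 25.99 | 0.27 | 0.38 | 0.14 |
| M3 S [mg/kg] | 0.12 | 199.34 | 0.04 | 0.20 | 1.06 | 0.28 | 180.35 | 0.05 | 0.50 | 1.23 | 0.49 | 151.66 | 0.05 | 0.44 | 1.62 |
| M3 Zn [mg/kg] | 0.13 | 1.50 | 0.64 | 0.30 | −0.15 | 0.22 | 1.42 | 0.68 | 0.31 | 0.08 | 0.11 | 1.52 | 0.64 | 0.34 | −0.02 |
| pH | 0.81 | 0.48 | 2.80 | 0.90 | −0.01 | 0.84 | 0.43 | 3.07 | 0.91 | −0.03 | 0.77 | 0.52 | 2.56 | 0.87 | 0.00 |
| PSI | 0.75 | 45.09 | 1.96 | 0.85 | 0.02 | 0.72 | 47.35 | 1.86 | 0.83 | 0.14 | 0.78 | 41.79 | 2.11 | 0.88 | 0.04 |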
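Table 6. Hierarchical summary of relative predictive position of soil variables.

| Rank | PLSR Property | PLSR Tindex | 1DCNN Property | 1DCNN Tindex | GRNN Property | GRNN Tindex |
|---|---|---|---|---|---|---|
| 1 | M3 Al | 0.839 | SOC | 0.853 | M3 Al | 0.843 |
| 2 | SOC | 0.833 | M3 Al | 0.847 | Al | 0.800 |
| 3 | Total C | 0.831 | Total C | 0.846 | Ga | 0.780 |
| 4 | K | 0.809 | pH | 0.824 | Fe | 0.777 |
| 5 | pH | 0.796 | Total N | 0.808 | Total C | 0.776 |
| 6 | B. Den. | 0.790 | Al | 0.782 | SOC | 0.776 |
| 7 | Al | 0.789 | K | 0.777 | B. Den. | 0.762 |
| 8 | Ga | 0.748 | B. Den. | 0.772 | pH | 0.760 |
| 9 | Total N | 0.737 | Fe | 0.725 | Rb | 0.733 |
| 10 | Fe | 0.736 | Ga | 0.718 | PSI | 0.729 |
| 11 | Rb | 0.732 | pF2.5 | 0.692 | K | 0.728 |
| 12 | PSI | 0.704 | PSI | 0.683 | Total N | 0.713 |
| 13 | Ca | 0.666 | Rb | 0.680 | pF2.5 | 0.690 |
| 14 | pF2.5 | 0.652 | Ex. Bas. | 0.678 | Ex. Bas. | 0.689 |
| 15 | Ex. Bas. | 0.646 | pF2.0 | 0.667 | Sat. at 0 | 0.682 |
| 16 | pF2.0 | 0.638 | M3 Mg | 0.660 | pF2.0 | 0.680 |
| 17 | Sat. at 0 | 0.630 | pF4.2 | 0.656 | M3 Ca | 0.655 |
| 18 | M3 Mg | 0.629 | Sat. at 0 | 0.635 | Nd | 0.646 |
| 19 | M3 Ca | 0.626 | M3 Ca | 0.626 | Ca | 0.646 |
| 20 | Cu | 0.600 | M3 Mn | 0.619 | pF4.2 | 0.642 |
| 21 | pF4.2 | 0.598 | Cu | 0.589 | Ta | 0.637 |
| 22 | Sand | 0.593 | Ex. Ac. | 0.583 | Ce | 0.630 |
| 23 | Clay | 0.592 | Sand | 0.578 | M3 Mg | 0.628 |
| 24 | Silt | 0.588 | Mn | 0.576 | Ni | 0.623 |
| 25 | Se | 0.562 | Clay | 0.575 | Ti | 0.621 |
| 26 | Th | 0.559 | Sr | 0.569 | La | 0.621 |
| 27 | M3 Mn | 0.557 | Ni | 0.567 | Cr | 0.620 |
| 28 | V | 0.554 | M3 B | 0.557 | Cu | 0.619 |
| 29 | Nd | 0.546 | Ce | 0.550 | Mn | 0.618 |
| 30 | Ta | 0.545 | M3 Fe | 0.547 | V | 0.608 |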
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Gruszczyński, S.; Gruszczyński, W. Assessing the Information Potential of MIR Spectral Signatures for Prediction of Multiple Soil Properties Based on Data from the AfSIS Phase I Project. Int. J. Environ. Res. Public Health 2022, 19, 15210. https://doi.org/10.3390/ijerph192215210
