Digital Mapping of Soil pH Based on Machine Learning Combined with Feature Selection Methods in East China

Zhao, Zhi-Dong; Zhao, Ming-Song; Lu, Hong-Liang; Wang, Shi-Hang; Lu, Yuan-Yuan

doi:10.3390/su151712874


by Zhi-Dong Zhao ¹,²,³, Ming-Song Zhao ¹,²,³,*, Hong-Liang Lu ¹, Shi-Hang Wang ¹,²,³ and Yuan-Yuan Lu ⁴,⁵,*

¹ School of Geomatics, Anhui University of Science and Technology, Huainan 232001, China

² Key Laboratory of Aviation-Aerospace-Ground Cooperative Monitoring and Early Warning of Coal Mining-Induced Disasters of Anhui Higher Education Institutes, Anhui Provincial Department of Education, Huainan 232001, China

³ Coal Industry Engineering Research Center of Collaborative Monitoring of Mining Area’s Environment and Disasters, Huainan 232001, China

⁴ Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment of the People’s Republic of China, Nanjing 210042, China

⁵ Key Laboratory of Soil Environmental Management and Pollution Control, Ministry of Environment Protection, Nanjing 210042, China

* Authors to whom correspondence should be addressed.

Sustainability 2023, 15(17), 12874; https://doi.org/10.3390/su151712874

Submission received: 26 July 2023 / Revised: 18 August 2023 / Accepted: 23 August 2023 / Published: 25 August 2023


Abstract:

This study evaluated and compared the performance of random forest (RF) and support vector regression (SVR) models combined with three feature selection methods, namely recursive feature elimination (RFE), simulated annealing feature selection (SAFS), and selection by filtering (SBF), in predicting soil pH in Anhui Province, East China. For comparison, RF and SVR models were also built with all original features (ALL-RF, ALL-SVR). A total of 140 samples were selected following the principles of randomness, uniformity, and representativeness, taking into account combinations of landscape elements such as topography, parent material, and land use. Auxiliary data, including climatic, topographic, and vegetation indexes, were used to predict soil pH. The results showed that, compared with using all original features, combining any of the three feature selection algorithms with RF and SVR eliminated some redundant features and improved the prediction accuracy of the soil pH model. Among the four RF models, the RFE-RF model had the highest calibration R², with a calibration RMSE of 0.73 and MAE of 0.57. On the testing set, the RFE-RF model had an R² of 0.61, which was higher than that of the ALL-RF model (R² = 0.45) and lower than those of the SAFS-RF (R² = 0.71) and SBF-RF (R² = 0.69) models. Compared with the RFE-SVR model, the RFE-RF model was more robust and had better generalization ability. Overall, the accuracy of digital soil mapping can be improved through feature selection.

Keywords:

soil pH; feature selection; random forest; support vector regression; parameter tuning; multiple soil classes

1. Introduction

Soil pH affects the physical and chemical properties of soil and is closely related to soil fertility, microbial and faunal activity, the C/N ratio, and humus formation [1,2,3]. Estimating the spatial distribution of soil pH is therefore important for monitoring and managing soil quality. Traditional methods of mapping soil properties are challenging, time consuming, and costly. Digital soil mapping (DSM) techniques have become proven methods for the spatial prediction of soil properties. DSM is based on the soil models CLORPT [4] and SCORPAN [5], which describe soil formation as a function of climate (C), organisms (O), topography (R), parent material (P), age (A), and spatial position. Until recently, DSM relied mainly on statistical and geostatistical methods. The statistical methods include ordinary linear models, generalized linear models [6], and generalized additive models [7], which are mainly used to establish a quantitative model between soil properties and environmental factors, while discriminant analysis is often used to map soil types. The geostatistical methods include cokriging [8], regression kriging [9], and geographically weighted regression [10]. At present, machine learning, data mining, and related methods are widely used in DSM. Compared with geostatistical methods, machine learning and data mining can capture nonlinear relationships between soil properties and environmental variables, and good results have been obtained in previous studies.

Random forest (RF) [11] and support vector regression (SVR) [12] are representative machine learning algorithms; they are widely used in various fields, including digital soil mapping. Azamat et al. [13] used the SVR model to predict the main agrochemical soil properties in the Trans-Ural steppe zone (Republic of Bashkortostan, Russia). After a comparison with multiple linear regression (MLR), they concluded that SVR predicted the spatial distribution and variation of soil nutrients better. Ruhollah et al. [14] implemented W-SVR, a machine learning algorithm that applies a wavelet transformation to the covariates within a DSM framework, to map and predict soil salinity in central Iran. This approach showed strong potential for the accurate prediction of subsoil salinity, where conventional machine learning models typically perform less well. In [15], the authors used an RF algorithm to perform a spectral inversion of rice canopy nitrogen content and reported that the RF algorithm required fewer samples, resisted overfitting, and offered high precision and broad applicability. Were et al. [16] compared the prediction performance of three models (SVR, artificial neural networks, and RF), but their work lacked a discussion of feature engineering and parameters. The boosted regression tree and RF models have also been compared for mapping topsoil organic carbon concentrations in an alpine ecosystem [17].

Feature selection is an important step in machine learning. Blum et al. [18] reviewed feature selection in machine learning and focused on two key issues: the problem of selecting relevant features and the problem of selecting relevant examples. Feature selection approaches are generally divided into three types: embedded, filter, and wrapper. The wrapper method uses a learning algorithm to score each feature subset according to its predictive ability; the filter method selects a subset of features as a preprocessing step, independently of the chosen predictor; the embedded method performs feature selection during training and is usually specific to a given algorithm. The objectives of feature selection are to improve the prediction performance of the predictors, to provide faster and more cost-effective predictors, and to provide a better understanding of the underlying process that generated the data [19]. Previous studies have shown that feature selection has many potential advantages, such as better data visualization and understanding, lower storage requirements, shorter model training time, and higher prediction accuracy [20]; selecting smaller and more optimal feature sets for modeling is therefore necessary [21]. Although the RF and SVR models are both black box models, their tunable parameters also matter. There are few studies on the effects of the parameters of the RF model on accuracy. For the SVR model, numerous studies have shown that choices such as the kernel function have a significant impact on accuracy [22]. Therefore, comparing different feature selection algorithms and optimizing the parameters of the RF and SVR models is worthwhile.

This study took Anhui Province as an example and used GIS spatial analysis and remote sensing imagery to extract environmental features for modeling soil pH. The objectives of this study were: (1) to examine the effects of three feature selection algorithms on the RF and SVR models and to compare the prediction accuracy of the RF and SVR models built with different feature subsets when predicting soil pH in Anhui Province, East China; (2) to analyze the influence of different parameters on the prediction accuracy of the RF and SVR models and to determine their optimal parameters; and (3) to compare the soil pH spatial distributions predicted by the RF and SVR models.

2. Materials and Methods

2.1. Study Areas

Anhui Province (114°54′~119°37′ E, 29°41′~34°38′ N), located in the eastern part of China and spanning the Yangtze River and the Huaihe River, was selected as the study area. The climate is a typical transition from subtropical to temperate. The mean annual temperature ranges from 14 to 16 °C; the mean annual precipitation ranges from 800 to 1800 mm and increases from the north to the south of Anhui. The elevation is generally less than 100 m above sea level, except for the hills in the southwest and south of Anhui. Anhui Province is divided into five geographic regions from north to south: the Huaibei Plain, the Jianghuai hilly downland, the riverine plain, the Dabieshan Mountains in west Anhui, and the south Anhui hilly region (Figure 1).

The total area of Anhui Province is 13.96 × 10⁴ km², with 69% farmland, 14% low mountains and hills, and 17% lakes and rivers. The farmland is mainly distributed in the Huaibei Plain, the Jianghuai hilly downland, and the riverine plain. The forest land and grassland are mainly distributed in the Dabieshan Mountains in west Anhui and the south Anhui hilly region. According to the World Reference Base for Soil Resources, Anhui Province includes ten different soil types. The main cultivated soil types are Hydragric Anthrosols, Haplic Fluvisols, and Calcaric Cambisols. The Hydragric Anthrosols are primarily distributed in the riverine plain and the Jianghuai hilly downland, while the Haplic Fluvisols are mainly found in the Huaibei Plain. The Cutanic Luvisols, Eutric Cambisols, Ferric Luvisols, Eutric Regosols, Lithic Leptosols, and Ferric Acrisols, on the other hand, are mostly located in the hilly regions of the southwest and south [23].

2.2. Data Source

The soil data were obtained from the “Soil Series of China (Anhui Volume)” [24]; sample collection was based on combinations of landscape elements such as topography, parent material, and land use. The soil series survey of China was conducted during 2010–2011. In total, 140 typical soil profiles were collected (Figure 2). The soil profile information included the sampling site and pedon data, such as soil depth, land use, cropping system, and the physical and chemical properties of the soil. Soil pH was selected as the target variable and was determined by potentiometry [25]. Auxiliary data, including topography data, vegetation indexes, and climate data, were used for modeling soil pH. The digital elevation model (DEM) from SRTM had a spatial resolution of 90 m. The derivatives of the DEM, namely aspect, slope, elevation, plan curvature, and profile curvature, were computed in ArcGIS 10.2. The standardized precipitation index (SPI), convergence index (CI), multi-resolution index of valley bottom flatness (MrVBF), multi-resolution ridge top flatness (MrRTF) index, topographical wetness index (TWI), and topographic position index (TPI) were derived using SAGA GIS version 6.3.0. The normalized difference vegetation index (NDVI) and the enhanced vegetation index (EVI) were derived from the 16-day composite vegetation index product (MOD13Q1) of the MODIS terrestrial products, with a spatial resolution of 250 m and a date of June 2010. The climate data comprised mean annual temperature (MAT) and mean annual precipitation (MAP), which were derived from the raster data (1 km resolution) produced by the China Eco-Environmental Background Construction Project of the Institute of Agricultural Resources and Regional Planning of the Chinese Academy of Agricultural Sciences. All the environmental features and the soil property spatial prediction results were resampled to a common resolution of 250 m.

2.3. Feature Selection

Feature selection is an important topic in data mining and a common step in machine learning. To investigate whether selecting among the environmental features affecting soil pH can improve the prediction accuracy of the RF and SVR models, three feature selection methods, namely the recursive feature elimination (RFE) algorithm [26,27], the simulated annealing feature selection (SAFS) algorithm [28,29], and the selection by filtering (SBF) algorithm, were used in combination with the RF and SVR algorithms to establish soil pH prediction models [30]. All the modeling was carried out with the “caret” and “e1071” packages in R 3.5.1. The caret package [31] was used for feature selection and to establish the RF and SVR models. The e1071 package [32] was used to tune the parameters of the RF and SVR models and to create the spatial prediction maps.
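The elimination loop behind a wrapper method such as RFE can be sketched in a few lines. The Python sketch below is illustrative only and is not the caret implementation used in this study: the function `recursive_feature_elimination`, the callback `fit_and_score`, and the toy weights and feature names are all hypothetical, and in practice the callback would refit an RF or SVR model on the current subset and return a resampled error together with a feature ranking.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def recursive_feature_elimination(
    features: Sequence[str],
    fit_and_score: Callable[[List[str]], Tuple[float, Dict[str, float]]],
    min_features: int = 1,
) -> Tuple[List[str], float]:
    """Backward elimination: refit, drop the least important feature, repeat.

    fit_and_score(subset) must return (error, importance per feature);
    a lower error is better. Returns the best subset seen and its error.
    """
    current = list(features)
    best_subset, best_error = list(current), float("inf")
    while len(current) >= min_features:
        error, importance = fit_and_score(current)
        if error < best_error:
            best_subset, best_error = list(current), error
        if len(current) == min_features:
            break
        # Remove the feature the current model considers least important.
        weakest = min(current, key=lambda name: importance[name])
        current.remove(weakest)
    return best_subset, best_error

# Hypothetical toy scorer (not data from the study): three informative
# features reduce the error, two noise features slightly increase it.
TOY_WEIGHTS = {"signal_a": 0.9, "signal_b": 0.6, "signal_c": 0.4,
               "noise_a": 0.05, "noise_b": 0.02}

def toy_fit_and_score(subset: List[str]) -> Tuple[float, Dict[str, float]]:
    signal = sum(TOY_WEIGHTS[n] for n in subset if TOY_WEIGHTS[n] >= 0.1)
    penalty = 0.1 * sum(1 for n in subset if TOY_WEIGHTS[n] < 0.1)
    return 2.0 - signal + penalty, {n: TOY_WEIGHTS[n] for n in subset}

selected, error = recursive_feature_elimination(list(TOY_WEIGHTS), toy_fit_and_score)
print(selected, round(error, 3))
```

Keeping the scorer as a callback separates the search strategy from the learner, which mirrors how the same selection algorithm can be paired with either RF or SVR.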

2.4. Random Forest

Random forest (RF) is an ensemble learning method which was first proposed by Breiman et al. [11]. RF is an ensemble of n trees {T₁(X), …, T_n(X)}, where X = {x₁, ..., x_p} is the p-dimensional vector of the features associated with the response variable. The ensemble produces n outputs {Y₁ = T₁(X), …, Y_n = T_n(X)}, where Y_i, i = 1, …, n, is the prediction of the response variable by the ith tree. For classification problems, Y is obtained by majority vote. For regression problems, Y is the mean value of the individual tree predictions.

Suppose we have a training dataset D = {(X₁, Y₁), …, (X_n, Y_n)}, where X_i, i = 1, …, n, is the feature vector of the ith sample and Y_i is its response (here n denotes the number of samples). The training procedure of RF can be described as follows:

(1): Sample the training set D with replacement to generate a bootstrap resample D_r.
(2): For each bootstrap sample, grow a tree. At each node of the tree, the best split is chosen from a random subset of m_try features.
(3): Repeat the above steps until the random forest is grown.
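The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the R implementation used in the study: the helper names (`best_split`, `grow_tree`, `fit_forest`, `predict_forest`), the `min_size` stopping rule, and the toy data are hypothetical, and trees are stored as nested tuples for brevity.

```python
import random
from statistics import mean

def sum_squared_error(values):
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values)

def best_split(rows, targets, feature_ids):
    """Return the (feature, threshold) with the lowest squared error, or None."""
    best, best_sse = None, float("inf")
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue
            sse = sum_squared_error(left) + sum_squared_error(right)
            if sse < best_sse:
                best, best_sse = (f, threshold), sse
    return best

def grow_tree(rows, targets, m_try, rng, min_size=5):
    """Step (2): at each node, search only a random subset of m_try features."""
    if len(rows) < min_size or len(set(targets)) == 1:
        return mean(targets)  # leaf: mean response of the node
    split = best_split(rows, targets, rng.sample(range(len(rows[0])), m_try))
    if split is None:
        return mean(targets)
    f, thr = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > thr]
    return (f, thr,
            grow_tree([r for r, _ in left], [t for _, t in left], m_try, rng, min_size),
            grow_tree([r for r, _ in right], [t for _, t in right], m_try, rng, min_size))

def predict_tree(node, row):
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node

def fit_forest(rows, targets, n_tree=50, m_try=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_tree):  # step (3): repeat until n_tree trees are grown
        idx = [rng.randrange(n) for _ in range(n)]  # step (1): bootstrap with replacement
        forest.append(grow_tree([rows[i] for i in idx], [targets[i] for i in idx], m_try, rng))
    return forest

def predict_forest(forest, row):
    """Regression output: the mean of the individual tree predictions."""
    return mean(predict_tree(tree, row) for tree in forest)

# Hypothetical toy data: the response depends only on the first feature.
toy_rows = [(float(x), float(x % 3)) for x in range(20)]
toy_targets = [5.0 if x < 10 else 8.0 for x in range(20)]
toy_forest = fit_forest(toy_rows, toy_targets, n_tree=25, m_try=1, seed=42)
print(predict_forest(toy_forest, (2.0, 2.0)))
```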

The RF model also provides an error estimate, the so-called out-of-bag (OOB) error. For each tree, the RF algorithm uses the “left out” data that were not used to build that tree to assess the performance of the prediction model [33,34]. From the OOB predictions, the mean square error of OOB (MSE_OOB) is calculated by:

MSE_{OOB} = \frac{1}{n} \sum_{i = 1}^{n} {(z_{i} - z_{i}^{OOB})}^{2}

(1)

where z_i is the measured value and z_i^{OOB} is the average of all the OOB predictions for sample i. Liaw et al. [35] also proposed the percentage of variance explained by the model (Var_ex) to compare the performance of different models:

Var_{ex} = 1 - \frac{MSE_{OOB}}{Var_{z}}

(2)

where Var_z is the total variance of the response variable. The RF algorithm is robust against outliers and noise, is faster than bagging and boosting algorithms, and is not prone to overfitting [11].
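Equations (1) and (2) can be checked with a short Python sketch. It assumes that the OOB predictions have already been collected per sample, and that Var_z is the population variance (dividing by n); both are assumptions of this illustration, and the function names and toy values are hypothetical.

```python
from statistics import mean, pvariance

def oob_mse(measured, oob_predictions):
    """Equation (1): mean squared difference between z_i and its mean OOB prediction.

    oob_predictions[i] holds the predictions for sample i from the trees
    whose bootstrap sample did not contain sample i.
    """
    pairs = [(z, mean(preds)) for z, preds in zip(measured, oob_predictions) if preds]
    return sum((z - z_oob) ** 2 for z, z_oob in pairs) / len(pairs)

def variance_explained(measured, mse):
    """Equation (2): 1 - MSE_OOB / Var_z, with Var_z taken as the population variance."""
    return 1.0 - mse / pvariance(measured)

# Hypothetical toy values, not data from the study.
z = [5.0, 6.0, 7.0, 8.0]
oob = [[5.5, 4.5], [6.0], [7.5, 6.5, 7.0], [7.0]]
mse = oob_mse(z, oob)
print(mse, variance_explained(z, mse))
```

Samples that were never out of bag are skipped here; with hundreds of trees that case is rare, but a small forest can produce it.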

2.5. Support Vector Regression

The support vector machine (SVM) is a binary classification model whose basic form is a linear classifier defined in feature space. The learning strategy of SVM is to maximize the margin, which can be formalized as a convex quadratic programming problem and is equivalent to minimizing a regularized hinge loss function. The SVR model applies the SVM approach to regression problems. Unlike other regression models, SVR defines an error tolerance ε, and a predicted value that falls within this tolerance of the measured value is not penalized. The regression model is thus determined by the size of the given error interval. For a given training sample {(x_i, y_i), i = 1, 2, …, n}, where x_i is the input feature vector, y_i is the measured value, and f(x_i) is the predicted value, the standard form of the SVR model is:

\min_{ω, b} \frac{1}{2} {‖ω‖}^{2} + C \sum_{i = 1}^{n} L_{ε} (y_{i} - f (x_{i}))

(3)

L_{ε} (y - f (x)) = \begin{cases} 0, & |y - f (x)| \leq ε \\ |y - f (x)| - ε, & \text{otherwise} \end{cases}

(4)

where L_ε is the loss function and C is the penalty coefficient. In general, the larger the value of C, the more closely the model fits the training data; if C is set too high, overfitting occurs. The kernel trick is an important part of SVR. It maps data that are not linearly separable in the input space to a high-dimensional feature space in which they become linearly separable. Four kinds of kernel functions are commonly used in SVR: the linear function, polynomial function, radial basis function (RBF), and sigmoid function.
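The loss in Equation (4), the objective in Equation (3), and the RBF kernel can be written out directly. The Python sketch below is an illustration under stated assumptions: it evaluates the objective for a linear f(x) = ω·x + b rather than solving the optimization, the RBF form exp(−gamma·‖u − v‖²) is the common parameterization and is assumed here, and all function names and toy numbers are hypothetical.

```python
from math import exp

def epsilon_insensitive_loss(y, f_x, epsilon):
    """Equation (4): zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y - f_x) - epsilon)

def svr_objective(weights, bias, xs, ys, C, epsilon):
    """Equation (3) evaluated for a linear model f(x) = w . x + b."""
    regulariser = 0.5 * sum(w * w for w in weights)
    losses = 0.0
    for x, y in zip(xs, ys):
        f_x = sum(w * xi for w, xi in zip(weights, x)) + bias
        losses += epsilon_insensitive_loss(y, f_x, epsilon)
    return regulariser + C * losses

def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy numbers, not data from the study.
print(epsilon_insensitive_loss(6.0, 6.05, 0.1))   # inside the tube
print(epsilon_insensitive_loss(6.0, 6.5, 0.1))    # outside the tube
print(svr_objective([1.0, 2.0], 0.5, [(1.0, 1.0), (0.0, 0.0)], [3.5, 1.0], C=2.0, epsilon=0.1))
```

A larger C multiplies every residual that leaves the tube, which is why raising C pulls the fit toward the training points, while a larger gamma makes the kernel decay faster with distance.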

2.6. Parameter Tuning and Analysis

In RF modeling, the training parameters analyzed were the number of trees to grow in the forest (n_tree) and the number of randomly selected predictor variables at each node (m_try). For the SVR model, the RBF was used as the kernel function, and the penalty coefficient C and the RBF kernel parameter gamma were analyzed. To study the effect of each parameter on the two models, one parameter was varied at a time; the m_try, n_tree, C, and gamma values were set to 1~10, 100~2000, 2⁻⁴~2⁴, and 2⁻⁶~2⁶, respectively. The effects of multiple parameters on the RF and SVR models were also explored. We combined the parameters of each model and used the e1071 package to tune them by grid search with 10-fold cross-validation. For the RF model, the parameter m_try was set to 1~5, and the parameter n_tree was set to 100~2000. For the SVR model, some studies have shown that it is more reasonable to vary the parameters on an exponential scale [36]. Therefore, in this experiment, the parameter C range was set to 2⁻⁴~2⁴, and the parameter gamma range was set to 2⁻⁶~2⁶.
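The grid search with k-fold cross-validation described above can be sketched generically. The following Python sketch is an illustration, not the e1071 routine used in the study: integer exponent steps for the C and gamma grids are an assumption (the step size is not stated), and `exponential_grid`, `k_fold_indices`, `grid_search_cv`, and the toy `shrunken_mean` model are hypothetical names.

```python
import random
from itertools import product

def exponential_grid(low_exp, high_exp, base=2.0):
    """Values base**low_exp ... base**high_exp in integer exponent steps."""
    return [base ** e for e in range(low_exp, high_exp + 1)]

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def grid_search_cv(xs, ys, param_grid, fit_predict, k=10, seed=0):
    """Return (best_params, best_mse) over all parameter combinations.

    fit_predict(train_x, train_y, test_x, **params) must return predictions
    for test_x; the cross-validated mean squared error is minimised.
    """
    folds = k_fold_indices(len(xs), k, seed)
    names = list(param_grid)
    best_params, best_mse = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        squared_errors = []
        for fold in folds:
            held_out = set(fold)
            train = [i for i in range(len(xs)) if i not in held_out]
            preds = fit_predict([xs[i] for i in train], [ys[i] for i in train],
                                [xs[i] for i in fold], **params)
            squared_errors += [(p - ys[i]) ** 2 for p, i in zip(preds, fold)]
        mse = sum(squared_errors) / len(squared_errors)
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse

c_grid = exponential_grid(-4, 4)       # 2^-4 ... 2^4, the C range given in the text
gamma_grid = exponential_grid(-6, 6)   # 2^-6 ... 2^6, the gamma range given in the text
print(len(c_grid) * len(gamma_grid), "parameter combinations")

# Hypothetical stand-in model so the sketch runs without an SVR library.
def shrunken_mean(train_x, train_y, test_x, shrink):
    centre = sum(train_y) / len(train_y)
    return [shrink * centre for _ in test_x]

toy_x = [[float(i)] for i in range(20)]
toy_y = [6.0] * 20
print(grid_search_cv(toy_x, toy_y, {"shrink": [0.5, 1.0, 1.5]}, shrunken_mean, k=10))
```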

2.7. Model Accuracy Assessment

This paper used only a small sample (n = 140) to establish a soil pH prediction model. With so few samples, results from a single split into training and validation sets may be unreliable and unstable. Therefore, this study combined independent validation with cross-validation. First, the data were divided into training and testing sets using stratified sampling. The advantage of this method is that the sample is more representative and the sampling error is small. Second, k-fold cross-validation was used to establish the prediction model on the training set. Finally, the testing set was used to verify the model accuracy.
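A stratified split of the kind described can be sketched as follows. The stratification variable is not stated in the text; this Python sketch assumes strata defined by pH class boundaries, and the function name `stratified_split`, the bin edges, and the synthetic values are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(values, bin_edges, test_fraction, seed=0):
    """Split sample indices into train/test so each stratum keeps its share.

    A value falls in stratum j when exactly j of the bin edges are <= value.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, v in enumerate(values):
        strata[sum(v >= edge for edge in bin_edges)].append(i)
    train, test = [], []
    for members in strata.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test += members[:n_test]
        train += members[n_test:]
    return sorted(train), sorted(test)

# Synthetic pH values for illustration only (not the study data).
synthetic_ph = [5.0] * 10 + [6.0] * 10 + [7.0] * 10
train_idx, test_idx = stratified_split(synthetic_ph, bin_edges=[5.5, 6.5],
                                       test_fraction=0.2, seed=1)
print(len(train_idx), len(test_idx))
```

For the 140 samples in this study, a test fraction of 32/140 (about 0.23) would correspond to the reported 108/32 split, although rounding within each stratum can shift the totals by a sample or two.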

The soil samples were randomly split into training (n = 108) and testing (n = 32) sets. The training data were used to build the RF and SVR models, while the testing data were used to validate them. Root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²) were selected to evaluate the performance of the RF and SVR prediction models. RMSE and MAE were computed from the differences between the predicted and measured values of soil pH.

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(p_{i} - o_{i})}^{2}}

(5)

MAE = \frac{1}{n} \sum_{i = 1}^{n} |o_{i} - p_{i}|

(6)

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(p_{i} - o_{i})}^{2}}{\sum_{i = 1}^{n} {(o_{i} - \bar{o})}^{2}}

(7)

where p_i is the predicted value, o_i is the observed value, and \bar{o} is the mean of the observed values.
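Equations (5) to (7) translate directly into code. The Python sketch below is a minimal illustration; the function names and the toy predicted and observed values are hypothetical and are not results from the study.

```python
from math import sqrt

def rmse(predicted, observed):
    """Equation (5): root mean squared error."""
    n = len(observed)
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def mae(predicted, observed):
    """Equation (6): mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for p, o in zip(predicted, observed)) / n

def r_squared(predicted, observed):
    """Equation (7): 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values.
toy_pred = [5.0, 6.0, 7.0, 8.0]
toy_obs = [5.0, 6.5, 6.5, 8.0]
print(rmse(toy_pred, toy_obs), mae(toy_pred, toy_obs), r_squared(toy_pred, toy_obs))
```

Note that R² defined this way can be negative when a model predicts worse than the mean of the observations, so it is not bounded below by zero on a testing set.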

3. Results and Discussion

3.1. Soil pH Data Analysis

A summary of the soil pH levels is presented in Table 1. The soil pH of the samples ranged from 4.58 to 8.67, with a mean of 6.37, a standard deviation of 1.16, and a coefficient of variation of 18.21%. Out of 140 samples analyzed, 36 samples had pH levels from 4.5 to 5.5; 45 samples had pH levels from 5.5 to 6.5; 33 samples had pH levels from 6.5 to 7.5; 20 samples had pH levels from 7.5 to 8.5; and 20 samples had pH > 8.5. Overall, acidic soils (pH < 6.5) made up a large proportion of the samples in Anhui Province, accounting for about 58% of the total. The spatial distribution of soil sample pH (Figure 2) indicated that soil pH in Anhui Province decreased gradually from north to south. The training set mean (6.41) was larger than that of the testing set (6.25); the training set standard deviation (1.16) was lower than that of the testing set (1.17); and the training set coefficient of variation (18.10%) was lower than that of the testing set (18.72%). The overall distributions of the two sets were similar, so they can be used for model building.
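The coefficients of variation and the acidic share quoted above follow from the reported means, standard deviations, and class counts. A short Python check, using only numbers stated in the paragraph (the function name is hypothetical):

```python
def coefficient_of_variation(mean_value, std_dev):
    """Coefficient of variation in percent: 100 * standard deviation / mean."""
    return 100.0 * std_dev / mean_value

# (mean, standard deviation) pairs as reported in the text.
summary = {
    "all samples": (6.37, 1.16),
    "training set": (6.41, 1.16),
    "testing set": (6.25, 1.17),
}
for label, (m, s) in summary.items():
    print(f"{label}: CV = {coefficient_of_variation(m, s):.2f}%")

# Share of samples in the two classes below pH 6.5 (36 and 45 of 140).
acidic_share = 100.0 * (36 + 45) / 140
print(f"acidic share = {acidic_share:.1f}%")
```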

3.2. Feature Selection Results

The feature selection results of the RFE, SAFS, and SBF algorithms are presented in Table 2. For the RFE and SBF algorithms, the same optimal feature subsets were obtained whether RF or SVR was used as the learner: MAP, MrVBF, EVI, MAT, and MrRTF for RFE, and EVI, NDVI, MrVBF, MrRTF, TWI, plan, slope, elevation, and MAP for SBF. For the SAFS algorithm, the optimal feature subset obtained with RF as the learner was EVI, NDVI, MrVBF, TWI, plan, CI, and MAP, whereas with SVR as the learner it was EVI, MrVBF, TPI, plan, aspect, MAP, and MAT.

Compared with the ALL-RF model (Table 2), the R² of the RFE-RF, SAFS-RF, and SBF-RF models in the testing set increased by 0.20, 0.26, and 0.24, respectively, indicating that feature selection had a substantial impact on the accuracy of the RF model. Compared with the ALL-SVR model, the R² of the RFE-SVR, SAFS-SVR, and SBF-SVR models improved by 0.08, 0.06, and 0.07, respectively; these gains were smaller than those for the RF model but were still improvements. In summary, feature selection improved the accuracy of both the RF and SVR models, and both the wrapper and filter methods outperformed the use of all features.

3.3. Effect of Parameter Optimization on Performance of RF and SVR Model

3.3.1. Single Parameter Optimization Analysis

Figure 3 shows the effect of a single parameter on the RF and SVR models, with R² as the indicator. For the RF model, when m_try was varied over 1~10 and n_tree over 100~2000, the R² of the model changed only slightly, always by less than 0.1 (Figure 3a). This indicates that the values of m_try and n_tree had little influence on the prediction accuracy, and it also suggests that feature selection matters more for the RF model. The results were similar to those of previous studies [37]. The two parameters of the SVR model had a greater impact on the prediction accuracy. When the C value was set to 0.4~0.6, changes in gamma had a more obvious influence on the prediction accuracy of the model, with R² varying between 0.3 and 0.5. Overall, as the C and gamma values increased, R² showed a downward trend (Figure 3b). In summary, a single parameter had no obvious impact on the RF model, but it had a greater impact on the SVR model.

3.3.2. Multi-Parameter Optimization Analysis

It can be seen from Table 2 that the optimal parameters were m_try = 1 and n_tree = 1000 for the RF model and C = 0.5 and gamma = 0.125 for the SVR model. The modeling results of the different parameter combinations are displayed in Figure 4. For the RF model, when m_try was 1 and n_tree ranged from 600 to 1200, the MSE changed considerably, whereas when m_try was 2, 3, 4, or 5, changes in n_tree had no significant effect on the prediction accuracy of the model. Overall, the MSE ranged between 0.55 and 0.63, a spread of no more than 0.08, so different parameter combinations had little influence on the prediction accuracy of the RF model. For the SVR model, gamma and C were analyzed in their exponential forms for convenience. As shown in Figure 4, when gamma was set between 2⁻⁴ and 2⁰, the value of C had a greater effect on the accuracy of the model, and the variation of MSE was obvious. The accuracy was the highest when the value of C was set from 2⁻² to 2²; when gamma ranged from 2⁰ to 2⁶, the value of C did not have much influence on the SVR model. Compared with the RF model, the SVR model had a larger error range across parameter values. This shows that the SVR model was more sensitive to its parameters, while the RF model was relatively stable.

3.4. Mapping Soil pH and Model Accuracy Assessment

The spatial patterns of soil pH predicted by the different models are displayed in Figure 5. Broadly, the eight predicted soil pH distributions were similar: soil pH decreased gradually from north to south across Anhui Province, which is roughly consistent with previous studies [30].

The correlation analysis results (Figure 6) show that MAP, MrVBF, MrRTF, TWI, elevation, slope, and plan are significantly correlated with soil pH. Combining the correlation analysis with the natural environment, the main factors that may cause the trend of the soil pH distribution in Anhui Province are summarized as follows: (1) The correlation analysis shows that MAP has the highest correlation with soil pH. In a rainy environment, leaching of the soil and its parent material leads to strong acidification of the soil, while alkaline soil is generally distributed in arid and semi-arid areas, where annual precipitation is much lower than evaporation, creating conditions for soil alkalization. In Anhui, rainfall is generally higher in the south than in the north. (2) MrVBF and MrRTF are topographical indicators calculated from DEM data. They are mainly used to identify valley floors and hillsides, and distinguishing between the two is an important step in identifying and characterizing hydrology [38]. The soil water content has an important influence on the pH value of the soil. In general, excessive water content leads to soil acidification, and water content which is too low leads to soil alkalization [39]. TWI also describes soil water content. The water content of soil in the north of Anhui Province is lower than that in the south. (3) In higher terrain, the leaching of bases is stronger, and the soil tends to be acidic. Therefore, elevation, slope, and plan, which describe the characteristics of the terrain, also have an important effect on soil pH. In the northern plains of Anhui Province, the topography shows little change, while the south is mostly mountainous, and the terrain is undulating. In summary, a variety of environmental factors cause the soil acidity in the southern part of Anhui Province to be higher than that in the north.

The prediction results of the RF model show that the four RF models produced essentially the same trend in soil pH distribution; the predicted soil pH ranged from 5.14 to 8.32, and the RFE-RF model showed the most distinct spatial pattern. For the SVR model, the SAFS-SVR model discriminated poorly among soil pH values in northern Anhui. The prediction trends of the other three models were essentially the same, ranging from 4.75 to 8.57 (Figure 5). Overall, the maps predicted by the different RF and SVR models were broadly similar.

The correlations of SPI, TPI, and MAT with pH did not reach significance, yet these features were identified as important by the feature selection algorithms (Table 2). This suggests that correlation analysis alone is not well suited to choosing features for machine learning models, and it also shows the value of combining feature selection with machine learning.
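One reason a feature can matter to a nonlinear learner while showing no significant correlation is that a linear correlation coefficient only measures linear association. The Python sketch below illustrates this with synthetic numbers; the text does not state which coefficient was used, so Pearson's r is an assumption here, and the function name and data are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Synthetic example: both responses are fully determined by x.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2.0 * v + 1.0 for v in x]    # linear dependence
quadratic = [v ** 2 for v in x]        # symmetric nonlinear dependence

print(pearson_r(x, linear))      # close to 1
print(pearson_r(x, quadratic))   # 0, although y depends entirely on x
```

A tree-based or kernel model can exploit the quadratic relationship, which is consistent with wrapper methods retaining features that a correlation screen would discard.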

To verify whether the optimal feature subsets obtained by the different feature selection algorithms can improve the accuracy of the model, the selected feature subsets were used to build RF and SVR models, and the parameters were tuned. The results (Table 2) show that, for both the RF and SVR models, accuracy after feature selection with the RFE, SAFS, and SBF algorithms was higher than that of the ALL-RF and ALL-SVR models. Feature selection can therefore effectively improve the prediction accuracy of the model. The explanatory ability of the RFE-RF model (R² = 0.61) was lower than that of the RFE-SVR model (R² = 0.66). In the testing set, the RMSE and MAE of the RFE-RF model were 0.11 and 0.10 lower, respectively, than those of the RFE-SVR model, and the R² was higher by 0.13, indicating that the overall accuracy was greatly improved. The accuracy comparison shows that the difference between the training set R² and the testing set R² was smaller for the RFE-RF model than for the RFE-SVR model, indicating that the RFE-RF model had better generalization ability and was more robust. At the same time, for each feature selection method, the prediction accuracy of the RF model was higher than that of the SVR model. After feature selection, the testing set accuracy of the RF model was higher than its training set accuracy, whereas the testing set accuracy of the SVR model was lower than its training set accuracy. Overall, the RF model predicted soil pH in the study area better than the SVR model.

4. Conclusions

(1): The RF and SVR models combined with the RFE, SAFS, and SBF feature selection methods predicted soil pH more accurately than the models built without any feature selection. Therefore, conducting feature selection before establishing machine learning models is necessary and can significantly enhance model accuracy.
(2): In the study area, the RFE feature selection method combined with RF and SVR modeling produced the best soil pH prediction models, compared with the other feature selection methods, in terms of prediction accuracy and generalization ability. Both the RFE-RF and RFE-SVR models achieved high prediction accuracy. The validation set accuracy of the RFE-RF model was greater than that of the RFE-SVR model, with a smaller difference between the calibration and validation R² values. These findings indicate that the RFE-RF model had higher prediction and generalization ability than the RFE-SVR model, making it more suitable as a soil pH prediction model.
(3): For the RF model, individual parameters did not appear to have a significant impact on model accuracy, and increasing the number of trees (n_tree) did not lead to a significant improvement in prediction accuracy. For the SVR model, the penalty coefficient (C) and the gamma parameter of the radial basis function had a significant impact on model accuracy.
(4): Based on the mapping results, both the RF and SVR models exhibited similar distribution patterns for soil pH prediction, which agreed with the “south acid and north alkaline” characteristics of soil acidity in the study area. Therefore, both models are useful for the spatial prediction and mapping of soil pH.

Author Contributions

Conceptualization, M.-S.Z. and Z.-D.Z.; methodology, H.-L.L.; software, Z.-D.Z.; validation, M.-S.Z., S.-H.W. and Y.-Y.L.; formal analysis, Z.-D.Z.; investigation, Z.-D.Z.; resources, M.-S.Z.; data curation, H.-L.L.; writing—original draft preparation, M.-S.Z.; writing—review and editing, M.-S.Z. and Z.-D.Z.; visualization, Z.-D.Z.; supervision, S.-H.W.; project administration, S.-H.W. and M.-S.Z.; funding acquisition, M.-S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Anhui Province, China, grant number 2208085MD88; the National Natural Science Foundation of China, grant number 41501226; and the Research Fund for Doctoral Program of Anhui University of Science and Technology, grant number ZY020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. Digital elevation model (DEM) of the study area.

Figure 2. Soil pH spatial distribution of sampling points.

Figure 3. Single-parameter model accuracy analysis.

Figure 4. Multi-parameter model accuracy analysis.

Figure 5. Spatial distribution of the measured and predicted soil properties in Anhui Province.

Figure 6. Heat map of correlation coefficients (a cross mark indicates that the correlation did not pass the significance test at the 95% level).

Table 1. Statistical characteristics of soil pH in Anhui Province.

Soil pH	n	Min	25% Quantile	Median	75% Quantile	Max	Mean	Std	CV/%
Samples	140	4.58	5.50	6.01	7.20	8.67	6.37	1.16	18.21
Training set	108	4.58	5.50	6.01	7.20	8.67	6.41	1.16	18.10
Testing set	32	4.68	5.49	6.00	7.27	8.51	6.25	1.17	18.72

Std: standard deviation. CV: coefficient of variation.
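
As a quick consistency check on Table 1, the coefficient of variation can be recomputed from the reported mean and standard deviation, assuming the usual definition CV = Std / Mean × 100%. The sketch below is illustrative only; the dictionary and function names are hypothetical, and the numbers are copied from Table 1.

```python
# (mean, standard deviation) of soil pH as reported in Table 1
table1 = {
    "Samples": (6.37, 1.16),
    "Training set": (6.41, 1.16),
    "Testing set": (6.25, 1.17),
}


def coefficient_of_variation(mean, std):
    """Return the coefficient of variation in percent."""
    return std / mean * 100.0


# Round to two decimals, the precision used in the table.
cv = {
    name: round(coefficient_of_variation(mean, std), 2)
    for name, (mean, std) in table1.items()
}

for name, value in cv.items():
    print(f"{name}: CV = {value:.2f}%")
```

The recomputed values (18.21, 18.10, and 18.72) agree with the CV/% column.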

Table 2. Comparison of model accuracy based on different feature selection methods.

Models	Selected Features	Training RMSE	Training MAE	Training R²	Testing RMSE	Testing MAE	Testing R²	Optimal Parameters
ALL-RF	−	0.78	0.61	0.56	0.85	0.67	0.45	m_try = 3, n_tree = 100
RFE-RF	MAP, MrRVBF, EVI, MAT, MrRTF	0.73	0.57	0.61	0.68	0.56	0.65	m_try = 1, n_tree = 1000
SAFS-RF	EVI, NDVI, MrVBF, TWI, plan, CI, MAP	0.77	0.59	0.57	0.65	0.53	0.71	m_try = 5, n_tree = 200
SBF-RF	EVI, NDVI, MrVBF, MrRTF, TWI, plan, slope, elevation, MAP	0.75	0.59	0.59	0.66	0.57	0.69	m_try = 3, n_tree = 200
ALL-SVR	−	0.83	0.63	0.51	0.89	0.72	0.44	gamma = 0.0625, C = 1
RFE-SVR	MAP, MrRVBF, EVI, MAT, MrRTF	0.69	0.53	0.66	0.79	0.66	0.52	gamma = 0.125, C = 0.5
SAFS-SVR	EVI, MrVBF, TPI, plan, aspect, MAP, MAT	0.69	0.50	0.66	0.84	0.69	0.50	gamma = 0.015625, C = 16
SBF-SVR	EVI, NDVI, MrVBF, MrRTF, TWI, plan, slope, elevation, MAP	0.70	0.52	0.65	0.80	0.65	0.51	gamma = 0.015625, C = 4

The ALL-RF and ALL-SVR models are the random forest and support vector regression models built using all the unprocessed environmental covariates. The RFE, SAFS, and SBF models are the random forest and support vector regression models built using the features selected by the corresponding feature selection method.
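
The generalization comparison in the Conclusions relies on the difference between the training (calibration) and testing (validation) R². The following sketch recomputes that difference for every model from the values in Table 2. It is an illustration only; the names table2_r2 and r2_gap are hypothetical, and the absolute difference is used as a simple proxy for generalization, which is an assumption rather than a metric defined in this paper.

```python
# (training R2, testing R2) copied from Table 2
table2_r2 = {
    "ALL-RF": (0.56, 0.45),
    "RFE-RF": (0.61, 0.65),
    "SAFS-RF": (0.57, 0.71),
    "SBF-RF": (0.59, 0.69),
    "ALL-SVR": (0.51, 0.44),
    "RFE-SVR": (0.66, 0.52),
    "SAFS-SVR": (0.66, 0.50),
    "SBF-SVR": (0.65, 0.51),
}


def r2_gap(train_r2, test_r2):
    """Absolute difference between training and testing R2 (2 decimals)."""
    return round(abs(train_r2 - test_r2), 2)


gaps = {model: r2_gap(*scores) for model, scores in table2_r2.items()}

# List the models from the smallest to the largest gap.
for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model}: |R2_train - R2_test| = {gap:.2f}")
```

With these values the RFE-RF model shows the smallest gap (0.04), against 0.14 for RFE-SVR.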


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

