Article

Research on the Prediction of Several Soil Properties in Heihe River Basin Based on Remote Sensing Images

1
College of Computer Science and Technology, Harbin Engineering University, Harbin 150000, China
2
College of Resources and Environment, Northeast Agricultural University, Harbin 150000, China
3
School of Information Science and Engineering, University of Jinan, Jinan 250000, China
*
Author to whom correspondence should be addressed.
Sustainability 2021, 13(24), 13930; https://doi.org/10.3390/su132413930
Submission received: 27 October 2021 / Revised: 7 December 2021 / Accepted: 9 December 2021 / Published: 16 December 2021

Abstract

Soil property monitoring is useful for sustainable agricultural production and environmental modeling. Remote sensing images make it possible to predict soil properties automatically over a wide area. Heihe River Basin was chosen as the research area, where measurements of three soil properties (pH, organic carbon, and bulk density) were available. Two kinds of attributes were extracted: remote sensing indices and terrain attributes. The prediction models were constructed with the random forest algorithm. The features were determined by combining correlation statistics with prediction error, and different features were selected for each of the three properties. The validation results were as follows: pH (MAE = 0.28, RMSE = 0.39, R² = 0.41), organic carbon (MAE = 4.75, RMSE = 8.26, R² = 0.75), and bulk density (MAE = 0.11, RMSE = 0.13, R² = 0.70). Analysis and comparison of the experimental results showed that the approach in this paper performed well in the prediction of organic carbon and bulk density.

1. Introduction

Soil properties and functions are closely related to soil ventilation, fertilization, water filtering, and other environmental conditions, which are important references for soil utilization, management, and improvement. Soil property prediction is beneficial for sustainable agricultural production and environmental modeling. Detecting soil properties through remote sensing images is highly practical because soil conditions over a large area can be monitored automatically. Traditional soil property detection is implemented by spectrometers. Soriano et al. (2017) tested and compared the performance of portable mid-infrared (MIR) and visible-near-infrared (Vis-NIR) spectrometers by partial least squares regression (PLSR) for the prediction of soil properties [1]. They found that the best spectral ranges for the prediction of soil properties were 1650–2450 nm in the NIR and 2500–5000 nm in the MIR. Martínez et al. (2018) used portable MIR technology to predict total carbon (TC), total nitrogen (TN), cation exchange capacity (CEC), clay, silt, and exchangeable sodium (Na⁺). Gaussian process (GP), random forest (RF), M5 rules, bagging, and decision trees were compared with PLSR [2]. Their results showed that GP was the best.
In soil property prediction using a spectrometer, the spectral values were directly adopted as the features, and the prediction results were very accurate: the coefficient of determination (R²) results were all above 0.70, and the residual prediction deviation (RPD) results were all above 1.90. For prediction from remote sensing images, the main reference value of the spectrometer studies is the spectral ranges related to the soil properties.
In recent years, some soil property prediction methods based on remote sensing images have emerged. Scudiero et al. (2014) used seven years of data on the relationship between canopy reflectance and salinity, calculated from Landsat-7 remote sensing images, together with other auxiliary data (such as meteorological and soil type information) to determine the attributes and the salinity prediction model for the study area of western San Joaquin Valley in California. They discovered that the mean reflectance and temporal variability of the Landsat-7 visible and infrared bands had a strong correlation with soil salinity. They predicted electrical conductivity (EC) and obtained an R² of 0.43 [3]. Peng et al. (2018) collected 225 soil samples from southern Xinjiang. Using Landsat-8 images and related attributes (e.g., terrain attributes and vegetation spectral indices of remote sensing data), they constructed Cubist and PLSR models of electrical conductivity to detect the degree of soil salinization in the study area. The Cubist model gave the best prediction results, with an R² of 0.91 and an RPD of 3.15. Peng’s prediction results were very accurate, reaching the level of the spectrometer [4].
Dou et al. (2019) used soil moisture content, crop residue, soil organic matter (SOM), and other laboratory and field data in concert with multi-temporal MODIS images to predict SOM. Seven optical reflectance bands were extracted from the MODIS images. For these bands, the difference, normalized difference, and ratio between two bands, as well as the tangent of the angle formed by three adjacent bands, were calculated. The prediction model of SOM was constructed by stepwise multiple regression in SPSS Statistics 22. With MODIS Band 6, Band 1, and the ratio of Band 1 to Band 4 as the input variables, the prediction accuracy was the highest (R² = 0.69) [5]. Although their prediction results were very accurate, their method combined the remote sensing images with some ground data.
Forkuor et al. (2017) utilized high-spatial-resolution satellite data (RapidEye and Landsat), terrain and climatic data, and soil data to compute the spectral index, terrain index, and climate index. Furthermore, multiple linear regression (MLR), RF, support vector machine (SVM), and stochastic gradient boosting (SGB) were tested and compared in the prediction of sand, silt, clay, cation exchange capacity (CEC), soil organic carbon (SOC), and nitrogen. The results revealed that RF provided the highest accuracy in most cases. The optimal prediction results obtained by different regressors were sand (R² = 0.34), silt (R² = 0.53), clay (R² = 0.21), CEC (R² = 0.30), SOC (RMSE = 0.52; R² = 0.39), and nitrogen (RMSE = 0.03; R² = 0.30) [6]. In their prediction results, only the R² of silt exceeded 0.5.
Song et al. collected 548 soil data points at which measurements of soil properties were available. The normalized difference vegetation index (NDVI), slope, aspect, altitude, profile curvature, plane curvature, and topographic wetness index derived from Beijing-1 multispectral data were used as the features to predict organic carbon in Heihe River Basin. The multiple linear regression model (MLR), geographically weighted regression model (GWR), geographically weighted ridge regression model (GWRR), kriging with external drift (KED), and GWR-derived local simple kriging (GWRSK) were compared. The KED model obtained the best predictive performance, with an RMSE of 0.515 g/kg and an R² of 0.87. The best RPD result was 2.04 [7].
Five soil properties (i.e., SOM, CEC, magnesium (Mg), potassium (K), and pH) were predicted from multispectral aerial images and terrain data (Sami et al., 2018). The soil property models were constructed by linear regression (LM), RF, the neural network (NN), SVM, and the gradient boosting machine (GBM). The best results derived by different regressors were as follows: SOM (R² = 0.64, RMSE = 0.44), CEC (R² = 0.67), K (R² = 0.21), Mg (R² = 0.22), and pH (R² = 0.15). However, because the aerial images contain only five bands and no shortwave or mid-infrared bands, the prediction results of K, Mg, and pH were not very good [8].
In conclusion, although traditional laboratory and on-site analysis gives exact results, it is time consuming and has difficulty capturing dynamic soil information [9]. There are few soil property prediction methods based on remote sensing images. The best performance came from the salinity prediction method proposed by Peng et al., which reached the level of the spectrometer [4]. The other methods achieved only moderate performance, with accuracy well below that of the spectrometer. The samples of soil property measurements generally numbered fewer than 300, so it was difficult to establish an accurate prediction model. Features for different regions may not be completely consistent, and different soil properties may be related to different features.
The objective of this study was to develop an accurate approach for predicting pH, organic carbon, and bulk density from Landsat-8 images of the study area. The main contribution of the paper is that various features were checked through both correlation and prediction error, and the most effective features and the appropriate regressor were selected.

2. Materials and Methods

2.1. Study Area

Heihe River Basin is located in the northwest inland area of China (37°68’–42°68’ N, 97°06’–101°98’ E). The terrain of this area is complex and varies markedly. The altitude gradually decreases from south to north. The land types mainly include grassland, woodland, and cultivated land. The upstream regions are dominated by nomadic herding. The midstream part includes important agricultural areas where wheat and corn are the main crops. Horizontally, the vegetation types change regularly from southeast to northwest. From upstream to downstream, the vegetation can be divided into four zones: forest, shrub, grassland, and desert. The distribution of soil types corresponding to the three Landsat-8 images in this study is shown in Figure 1. The main soil types are gray-brown desert soils, cold calcic soils, kastanozems, brown pedocals, and so on. The climate of the study area is continental cold temperate with little water vapor and abundant sunshine. The annual average temperature is 0.7–1.1 °C. The mean precipitation is 640–750 mm, mostly concentrated in July to September. In the upstream area, precipitation increases with altitude. The downstream climate is continental arid desert with little precipitation, strong evaporation, high radiation, and long sunshine duration. In the southern Qilian Mountains, precipitation decreases gradually from east to west. The central plain area is well suited to agricultural development; there, precipitation drops from 250 mm in the east to 50 mm in the west, evaporation rises from less than 2000 mm in the east to more than 4000 mm in the west, and the sunshine duration is as long as 3000–4000 h. Because of drought, frost, dry hot wind, and other ecological stress factors in Heihe River Basin, the soil in some areas suffers from desertification and salinization.

2.2. Soil Dataset

The soil dataset comprised pH, soil organic carbon concentration, and soil bulk density of representative samples in Heihe River Basin, with 266 samples for organic carbon and pH and 229 samples for bulk density. The dataset was constructed by the National Tibetan Plateau Scientific Data Center during 2013–2014. The spatial extent is 37°71’–43°34’ N, 96°13’–104°19’ E. The typical soil samples in Heihe River Basin were collected by representative sampling [7]. In this method, representative soil sites are selected from soil-scape units constructed by fuzzy c-means classification of pixels on the basis of the soil-forming factors (covariates). The additional soil sampling was conducted in July 2012 and July 2013 [7]. All the sampling sites were located by handheld Global Positioning System (GPS) receivers. A 100–150 cm deep soil pit was dug at each site, and topsoil (0–20 cm) samples were collected. The samples covered the typical landscapes of the upper, middle, and lower reaches of Heihe River Basin and can therefore reflect the overall spatial distribution of soil properties in this region. According to the Chinese Soil Classification System, based on the diagnostic layers and characteristics, the depth of field sample collection followed the soil genetic horizons of the profile.

2.3. Remote Sensing Images and Elevation Data

Five Landsat-8 OLI images acquired on 23 April 2013 were used in this study. The paths/rows were 134/031, 134/032, and 134/033, and their cloud cover rates were all below 5%. Landsat-8 images have 11 bands (http://www.usgs.gov/ (accessed on 14 March 2020)). B1–B7 were used in this study, in which B1–B4 are visible bands and B5–B7 are infrared bands (B5 is near-infrared and B6–B7 are shortwave infrared). To reduce the radiation error caused by atmospheric scattering, Fast Line-of-sight Atmospheric Analysis of Spectral Hypercubes (FLAASH) atmospheric correction was applied in pre-processing. The positional relationship between the five remote sensing images and the study area is shown in Figure 1.
In Figure 1, the white region is Heihe River Basin and the colored rectangles are the five remote sensing images. The brown rings represent the data points of pH and organic carbon (266 samples). The blue stars represent the data points of bulk density (229 samples).
Terrain indirectly affects soil formation through the redistribution of material and energy. As the altitude rises, temperature, precipitation, and humidity change with it, which forms different climatic zones. Considering the significant influence of terrain on soil properties, a digital elevation model (DEM) with a resolution of 30 m (http://srtm.csi.cgiar.org/srtmdata/ (accessed on 14 March 2020)) was used to predict soil properties in this study [10]. The data were co-published by the National Aeronautics and Space Administration (NASA) and the United States Geological Survey (USGS). The DEM is the digital simulation of terrain through limited elevation data.

2.4. Attributes for Prediction

The attributes of this paper were mainly obtained from [4,8,11]. The first two papers studied soil salinization and soil property prediction, and the third studied soil classification. Soil classification and property prediction are related tasks, so features can be shared between them. In this paper, features that were widely applied and of different types were preferred. There were 22 dimensions of candidate features, as shown in Table 1.
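Table 1 is not reproduced in this text, so the exact formulas the authors used are not visible here. As a rough illustration only, the sketch below computes two of the named indices, NDVI and RVI, from their commonly used definitions, assuming Landsat-8 B4 is the red band and B5 the near-infrared band. The reflectance values are hypothetical.

```python
import math


def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return math.nan  # assumption: undefined for pixels with no signal
    return (nir - red) / total


def rvi(nir: float, red: float) -> float:
    """Ratio vegetation index: NIR / Red."""
    if red == 0:
        return math.nan  # assumption: undefined when the red band is zero
    return nir / red


# Hypothetical surface reflectance for one pixel (not taken from the paper).
b4_red, b5_nir = 0.10, 0.30
print("NDVI:", ndvi(b5_nir, b4_red))
print("RVI:", rvi(b5_nir, b4_red))
```

In practice the same arithmetic would be applied per pixel to the atmospherically corrected band rasters.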
Feature selection was achieved in two steps. The first step was to determine the core feature set: NDVI, NDWI1, NDWI2, RVI, CMR, BI, DEM, and S, a total of eight dimensions. The second step was to add the remaining features in turn to the core feature set. Each time a feature was added, the prediction error was calculated. If the error increased after adding the feature, it was removed; otherwise, the feature was retained. After all 22 features were tested, the core feature set became the optimal feature set, that is, the adopted feature set.
The descriptions and abbreviations of the features are shown in Table 1. The features NDVI, NDWI1, NDWI2, RVI, CMR, BI, DEM, and S were used in at least two of the three above-mentioned papers, so we used them as the eight-dimensional core features. Using the core features, a random forest regressor was adopted to construct a basic model. Although the topographic features C, CA, TWI, SWI, and VD were also used in two papers, they were not significantly correlated with the three soil properties, so they were not included in the core feature set. Then, each remaining feature was added to the feature set in turn. If the prediction error increased after adding the feature, it was removed; otherwise, it was retained. Through this method, we determined different predictor sets for the three soil properties. The fourth column in Table 1 gives the serial number of the property that uses the feature in prediction. Finally, 17 dimensions were selected in total, of which the same 12 dimensions were used for pH and organic carbon and all 17 dimensions for bulk density. Five topographic predictors were not adopted. The error results of feature selection are shown in Section 3.3.
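The two-step procedure described above amounts to a greedy forward selection. The following is a minimal sketch of that loop, not the authors' code: `forward_select` and `toy_error` are hypothetical names, and `toy_error` is a made-up stand-in for the cross-validated prediction error of a regressor trained on the given features.

```python
from typing import Callable, List, Sequence


def forward_select(core: Sequence[str],
                   candidates: Sequence[str],
                   cv_error: Callable[[List[str]], float]) -> List[str]:
    """Start from the core set and try each candidate once, in order."""
    selected = list(core)
    best = cv_error(selected)
    for feature in candidates:
        trial = selected + [feature]
        error = cv_error(trial)
        if error <= best:  # keep the feature unless the error increased
            selected, best = trial, error
    return selected


def toy_error(features: List[str]) -> float:
    """Hypothetical error function used only to exercise the loop."""
    helpful = {"CRSI": 0.05, "EVI": 0.02}
    harmful = {"TWI": 0.03}
    error = 1.0
    for f in features:
        error -= helpful.get(f, 0.0)
        error += harmful.get(f, 0.0)
    return error


core = ["NDVI", "NDWI1", "NDWI2", "RVI", "CMR", "BI", "DEM", "S"]
print(forward_select(core, ["CRSI", "TWI", "EVI"], toy_error))
```

Because each candidate is tested only once against the current set, the result can depend on the order in which candidates are tried.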

2.5. Prediction Model Construction Method

In this study, the prediction model was built with the RF regressor. RF was developed on the basis of the decision tree regression algorithm [27]. The classification and regression tree (CART) algorithm recursively segments the feature space into simple partitions and represents the model as a binary tree. Each split is binary, and its splitting condition is represented as a node of the tree. In regression, the mean of the samples in a partition is used as the fitted output.
The RF algorithm is a “forest” composed of many decision trees. The RF model has two important parameters: the number of decision trees k and the number m of variables randomly sampled as split candidates at each node. RF increases the diversity of the decision trees by sampling with replacement and by randomly changing the combination of predictor variables as the different trees are grown. Each decision tree is grown on a bootstrap sample of the original dataset. In addition, the best predictor among m randomly selected ones is used to split each node. This differs slightly from the standard decision tree, in which the best splitting variable is chosen from all the predictors.
The construction process of the RF model (k decision trees) is as follows:
1.
For i = 1, …, k, a bootstrap sample containing about two-thirds of the original dataset is drawn. On the basis of this sample, m predictor variables are randomly selected at each node, and the best of these is chosen to split the node. Each tree is grown to its maximum depth without pruning;
2.
New samples are predicted from the results of the k decision trees. In regression, the average of the predictions of the k trees is the regression output. Because each tree is built on only about two-thirds of the data, the model is less prone to over-fitting.
RF is a single-output regressor, and in soil property prediction, a regression model should be built for each soil property. Three models were constructed for the three soil properties.
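The construction steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation used in the paper: the function names, the toy data, and the exhaustive threshold search are all hypothetical choices. A bootstrap sample of size n drawn with replacement contains roughly 63% distinct samples, which corresponds to the "about two-thirds" in the text.

```python
import random
from statistics import mean


def _sse(values):
    """Sum of squared deviations from the mean of a non-empty list."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)


def build_tree(rows, targets, m, rng):
    """Grow an unpruned regression tree; a leaf stores the mean target."""
    if len(set(targets)) == 1:
        return mean(targets)
    best = None  # (sse, feature index, threshold)
    for j in rng.sample(range(len(rows[0])), m):  # m random candidate features
        # Skip the smallest value so both sides of the split are non-empty.
        for t in sorted({r[j] for r in rows})[1:]:
            left = [y for r, y in zip(rows, targets) if r[j] < t]
            right = [y for r, y in zip(rows, targets) if r[j] >= t]
            sse = _sse(left) + _sse(right)
            if best is None or sse < best[0]:
                best = (sse, j, t)
    if best is None:  # every sampled feature is constant in this node
        return mean(targets)
    _, j, t = best
    lo = [i for i, r in enumerate(rows) if r[j] < t]
    hi = [i for i, r in enumerate(rows) if r[j] >= t]
    return (j, t,
            build_tree([rows[i] for i in lo], [targets[i] for i in lo], m, rng),
            build_tree([rows[i] for i in hi], [targets[i] for i in hi], m, rng))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] < t else right
    return node


def fit_forest(X, y, k=50, m=1, seed=0):
    """Step 1: grow k trees, each on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Step 2: average the predictions of the k trees."""
    return mean([predict_tree(tree, row) for tree in forest])


# Hypothetical toy data: two features, target depends on the first one.
X = [[i, (7 * i) % 5] for i in range(20)]
y = [2.0 * row[0] for row in X]
forest = fit_forest(X, y, k=25, m=1, seed=42)
print(predict_forest(forest, [10, 0]))
```

One such forest would be fitted per soil property, since each model has a single output.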
In addition, SVM regression and GP regression models were used for comparison with RF. The SVM algorithm performs linear regression by constructing a linear decision function in a high-dimensional space after mapping the inputs to higher dimensions. If the fitted model is viewed as a curve in a multidimensional space, the ε-insensitive loss function yields an “ε-tube” that contains the curve and the training points. Of all the sample points, only those lying on or outside the “wall” of the tube determine its position; these training samples are called support vectors. To handle nonlinearity in the training set, traditional fitting methods usually add higher-order terms to the linear equation. This is effective, but the additional adjustable parameters increase the risk of over-fitting. Support vector regression resolves this tension with a kernel function: replacing the linear term in the linear equation with a kernel function makes the original linear algorithm nonlinear, that is, it performs nonlinear regression. The kernel function raises the dimension implicitly, while the number of adjustable parameters stays under control, which limits over-fitting. A GP is a probability distribution over functions used in Bayesian inference, and GP regression applies it to nonlinear regression. The technique requires a kernel function to be specified, along with a “noise” regularization parameter that controls the closeness of the fit. Moreover, the training data can be normalized or standardized before the regression is learned.
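To make the ε-insensitive loss concrete, the sketch below evaluates it on a few hypothetical values. Residuals inside the tube (absolute error at most ε) cost nothing, and residuals outside it are penalized linearly. The function name and the choice ε = 0.1 are assumptions for illustration only.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y - f(x)| - epsilon)."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Hypothetical observations and predictions.
observed = [1.0, 2.0, 3.0]
predicted = [1.05, 2.5, 3.0]
losses = epsilon_insensitive_loss(observed, predicted, epsilon=0.1)
print(losses)  # only the second sample lies outside the tube
```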

3. Results and Discussion

This section is organized as follows. The first subsection gives the assessment measurements. The second describes the numerical ranges, statistical features, and distributions of the three soil properties in the study area. In the third, the correlation coefficients between the attributes and pH, organic carbon, and bulk density are calculated and analyzed. In the remaining subsections, the prediction method is verified in various ways and the results are analyzed. The prediction experiments were performed using K-fold cross-validation.

3.1. Performance Assessment Measurements

In this study, the mean absolute error (MAE), root-mean-squared error (RMSE), and R² were used to evaluate and compare the performance of the models (Equations (1)–(3)).
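$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right| \tag{1}$$
$$\mathrm{RMSE} = \left(\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2\right)^{\frac{1}{2}} \tag{2}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{m}\left(\bar{y} - y_i\right)^2} \tag{3}$$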
where m is the number of test samples, $y_i$ is the true value of the i-th data point, $\hat{y}_i$ is the predicted value of the i-th data point, and $\bar{y}$ is the average of the true values. Because the MAE is unambiguous, it describes the error accurately. However, when the errors are normally distributed and the dataset is large, the RMSE reflects the error better than the MAE. Therefore, both the MAE and the RMSE were chosen as evaluation measurements. R² is one minus the ratio of the sum of squared prediction errors to the total sum of squares of the true values. The closer R² is to one, the better the model predicts. R² is a measure of relative error and can be used for comparison between different datasets or even between different soil properties. The MAE and RMSE are absolute errors and cannot be used to compare between different datasets.
Like R², the RPD is a dataset-independent evaluation measurement. Although the RPD was not used as a measure in the experiments of this paper, it has been used in other literature and is mentioned here, so it is also introduced:
$$\mathrm{RPD} = \frac{\mathrm{STD}}{\mathrm{RMSE}} \tag{4}$$
where STD is the standard deviation of the analyzed samples and RMSE is the root-mean-squared error of prediction.
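Equations (1)–(4) translate directly into a few lines of Python. The sketch below is illustrative: the toy values are hypothetical, and using the sample standard deviation for STD is an assumption, since the text does not say whether the sample or population form was used. The last line applies Equation (4) to the organic carbon figures quoted in this paper (standard deviation 17.86, RMSE 8.26).

```python
import math
from statistics import mean, stdev


def mae(y_true, y_pred):
    """Equation (1): mean absolute error."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def rmse(y_true, y_pred):
    """Equation (2): root-mean-squared error."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def r2(y_true, y_pred):
    """Equation (3): one minus residual sum of squares over total sum of squares."""
    y_bar = mean(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((y_bar - t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rpd(y_true, y_pred):
    """Equation (4): STD / RMSE (sample standard deviation assumed)."""
    return stdev(y_true) / rmse(y_true, y_pred)


# Hypothetical toy values, used only to exercise the functions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))

# Equation (4) with the organic carbon values reported in this paper.
print(round(17.86 / 8.26, 2))  # 2.16, i.e. about 2.2
```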

3.2. Descriptive Statistics of the Soil Properties

The descriptive statistics of the soil properties in the study area included the minimum, maximum, average, standard deviation, variance, skewness, kurtosis, and coefficient of variation. The specific results are displayed in Table 2.
The pH in the study area ranged from 6.20 to 9.70, and most of the pH values were >8.00. The standard deviation of pH was 0.53, with a negative skewness and an ordinary kurtosis. Organic carbon had a relatively large range of 0.40–118.10 g/kg. Most of the organic carbon values were less than 10 g/kg, while the standard deviation was 17.86, with a positive skewness and a large kurtosis. The range of bulk density was 0.69–1.85 g/cm³, and most of the bulk density values were >1 g/cm³. The standard deviation was 0.24, with a negative skewness and a small kurtosis.
We calculated the probability distribution histograms of the three properties. The distributions of pH and bulk density were approximately normal. The distribution of organic carbon showed a strong positive skew; the skewness was 3.43 (Figure 2c, Table 2). We therefore transformed the organic carbon measurements by taking natural logarithms in the correlation statistics for all prediction methods. The skewness of the logarithmic organic carbon dropped to 0.19 (Table 2).
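The effect of the logarithmic transform on skewness can be illustrated with a small sketch. The values below are hypothetical and only mimic a right-skewed distribution; they are not the paper's samples. The moment-based skewness formula is one common definition and may differ from the one used to produce Table 2.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed organic carbon values in g/kg.
soc = [0.4, 1.0, 2.0, 3.0, 5.0, 8.0, 12.0, 30.0, 118.1]
log_soc = [math.log(v) for v in soc]

print("raw skewness:", skewness(soc))
print("log skewness:", skewness(log_soc))
```

On data of this shape the transformed values are much closer to symmetric, which is the motivation given in the text.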

3.3. Correlation between Attributes and Soil Properties

The Pearson correlation coefficients between the attributes and the soil properties were calculated, and the results are reported in Table 3. Among all the features, 12 were significantly related to pH (p < 0.01), 12 were significantly related to logarithmic organic carbon (p < 0.01), and 12 were significantly related to bulk density (p < 0.01). The correlation patterns of the three properties tended to be the same: most features had similarly high or low correlations with all three properties.
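For reference, the Pearson coefficient used here can be computed as in the sketch below. The elevation and pH values are hypothetical and serve only to show the calculation; the significance test (p-value) is not included.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical elevation (m) and pH values at five sites.
elevation = [1200.0, 1800.0, 2500.0, 3100.0, 3600.0]
ph = [8.9, 8.6, 8.1, 7.8, 7.2]
print(pearson_r(elevation, ph))
```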
Among all the features, elevation had the highest correlation with the three properties. The correlations of elevation with pH, logarithmic organic carbon, and bulk density were all >0.75 (p < 0.01). This was due to the complex and strongly varying terrain of the study area. The NDVI and CRSI are vegetation spectral indices and were significantly correlated with pH, bulk density, and logarithmic organic carbon (p = 0.01), but the correlation values were low. The remote sensing images in this paper were acquired in April, when the vegetation coverage was low, so the correlation between the vegetation spectral indices and the selected soil properties was weak. The vegetation, topography, moisture, and other factors in the study area all had a great influence on the accuracy of the prediction. In general, the higher the vegetation coverage, the better the vegetation spectral indices perform, whereas the remaining spectral indices behave in the opposite way [28,29,30].

3.4. Prediction Errors of Different Predictor Sets

Based on the eight-dimensional core features described in Section 2.4, we constructed the basic model with the RF regressor and obtained its results for the three soil properties. The experimental results are shown in Table 4. In these experiments, the K of the K-fold cross-validation was four. Using all 22 features did not give the optimal result; the prediction results with all 22 features are also shown in Table 4. Then, each remaining feature was added to the feature set in turn. If the prediction error increased after adding the feature, it was removed; otherwise, it was retained. The models of organic carbon and pH had the smallest error with the 12-dimensional feature set NDVI, NDWI1, NDWI2, RVI, CMR, BI, CRSI, EVI, GDVI, SI6, DEM, S. The bulk density model achieved the optimum with the 17-dimensional feature set NDVI, NDWI1, NDWI2, RVI, CMR, BI, CRSI, EVI, GDVI, SI1, SI2, SI3, SI4, SI5, SI6, DEM, S. The optimal prediction results are shown in Table 5. Among them, the optimal model is shown in Table 4.
From the results in Table 2 to Table 4, we can see the relationship between feature correlation and prediction error. The features C, CA, TWI, SWI, and VD are also widely used, but in this study region they had almost no correlation with the three properties. However, the prediction features cannot be determined by correlation alone. The features SI1, SI2, SI3, SI4, and SI5 had high correlations with all three properties, but according to the prediction errors, they only worked well in the prediction of bulk density. In contrast, the CRSI, whose correlations were not very high, was suitable for all three properties. In feature selection, correlation can be used as a reference, but the prediction error is more decisive.

3.5. Prediction Performance Comparison of Different Regressors

In this section, the performance of the RF model is first verified by K-fold cross-validation. The error results are shown in Table 4. Then, we compared the RF results with those of SVM and GP. According to the results in Table 4, when K = 4, the RF model had the best prediction performance for bulk density and organic carbon. The MAE, RMSE, and R² of bulk density were 0.11 g/cm³, 0.13 g/cm³, and 0.70. The MAE, RMSE, and R² of organic carbon were 4.75 g/kg, 8.26 g/kg, and 0.75. When K = 5, the RF model of pH had the best performance. The MAE, RMSE, and R² were 0.28, 0.39, and 0.41.
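The K-fold splits referred to above can be generated as in the following sketch. It is an illustration under assumptions: the function name is hypothetical, and whether the authors shuffled the samples before splitting is not stated in the text. With the 266 pH and organic carbon samples and K = 4, each fold holds 66 or 67 samples.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k folds over n shuffled samples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(266, 4))
for train, test in splits:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so every measurement is predicted once by a model that did not see it during training.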
The numerical ranges of pH, bulk density, and organic carbon in the study area were different (see Table 2 for details). The range of pH was 2.23; the range of bulk density was 1.00 g/cm³; the range of organic carbon was 96.04 g/kg. The numerical dispersion of organic carbon was the highest, and that of bulk density was the lowest. The numerical dispersion is the range of the main data distribution; it can be measured by variance, so it refers to the variance here. In Table 2, the variance of the organic carbon data is the highest, and the variance of the bulk density data is the lowest. Therefore, it is difficult to accurately predict soil properties with relatively dispersed distributions.
Then, we calculated the prediction errors of SVM and GP, which are shown in Table 5 and Table 6. Both SVM and GP had the smallest error when K = 5, so we only give the error results of K = 5 in Table 5 and Table 6.
The predicted results of the SVM and GP models were much worse than those of RF. The optimal RMSE values of RF were 0.39, 8.26 g/kg, and 0.13 g/cm³ for pH, organic carbon, and bulk density, respectively. The RMSE values of SVM were 0.43, 14.60 g/kg, and 0.15 g/cm³; only the result for bulk density was acceptable. The RMSE values of GP were 0.45, 15.74 g/kg, and 0.22 g/cm³, all worse than those of RF. Among the prediction methods based on remote sensing images described in the Introduction, the best R² values were above 0.7, reaching the level of the spectrometer, which is excellent performance; the medium R² values were above 0.5; and some studies reported values below 0.5. Because predicting soil composition from remote sensing images is a difficult task, a medium R², that is, greater than 0.5, is acceptable. Therefore, the experimental results showed that RF was more suitable than SVM and GP for the prediction of the three soil properties in this paper.
To judge whether there was multicollinearity between the features, we performed a principal component analysis (PCA) transformation on the features. The RMSE obtained with different numbers of retained PCA dimensions is shown in Table 7.
The PCA-transformed features were used to construct the models and predict the soil properties. The comparison between the prediction results using the original features and those using the PCA vectors is shown in Table 7. For organic carbon and pH, the error was smallest when 10 PCA dimensions were retained, and it was lower than the error before transformation. For bulk density, the error was smallest with 13 dimensions; it was also lower than the error before transformation, but the difference was small. Based on these results, we judged that there was slight multicollinearity between the prediction variables.

3.6. Analysis of Results

In this paper, the detailed feature determination process and the experimental results were given. These experimental results showed that although the correlation statistical results were consistent with the determination of most features, they were not completely consistent with the determination of all features. Therefore, the feature determination still needs to be determined by calculating the errors. As for the determination of the regressor, the results of this paper were consistent with those of Forkuor [6], proving that RF is more suitable for soil property prediction again. Because the process of feature determination in this paper was more accurate, better prediction results were obtained.
Song et al. reported a best RPD of 2.0 for organic carbon [7]. Although RPD was not used as a measure in this paper, it can be calculated according to Equation (4), and the best RPD of organic carbon in this paper was 2.2. Forkuor et al. obtained an R² of 0.39 for organic carbon [6], whereas the best R² in this paper was 0.75. Morellos et al. obtained an RPD of 2.2 for organic carbon with a spectrometer [31]. However, because the organic carbon data in this paper are highly skewed (skewness of 3.44, Table 2), the R² result may not reflect the true error.
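The RPD value quoted above can be checked with a short calculation. The sketch below assumes that Equation (4) is the usual ratio of the standard deviation of the observations to the RMSE of prediction, and it uses the full-dataset standard deviation from Table 2 as a stand-in for the validation-set value. The function name `rpd` is illustrative.

```python
def rpd(std_dev, rmse):
    """Ratio of performance to deviation, assumed here to be SD / RMSE."""
    return std_dev / rmse

# Organic carbon: SD = 17.86 g/kg (Table 2), best RMSE = 8.26 g/kg (Table 4).
oc_rpd = rpd(17.86, 8.26)
print(f"Organic carbon RPD = {oc_rpd:.1f}")
```

Under these assumptions the result rounds to 2.2, in agreement with the value stated in the text.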
Sami et al. obtained an R² of only 0.15 for pH from aerial images [8], whereas Soriano-Disla et al. obtained an R² of 0.75 using a spectrometer [1]. The best R² of pH in this paper was 0.41, which is higher than that of Sami et al. but still below 0.5, so the pH prediction needs further improvement.
We did not find a published prediction method for bulk density, so the bulk density results in this paper cannot be compared with other methods. Among the three soil properties in this paper, the R² of bulk density was lower than that of organic carbon and higher than that of pH. The lowest R² of a soil property predicted by the spectrometer was 0.70. Our R² of bulk density was also 0.70, which corresponds to the lower end of the spectrometer results and is therefore unlikely to be poor for a prediction based on remote sensing images.
The prediction performance may be affected by many factors, such as the prediction difficulty of the dataset, the type of remote sensing image adopted, the features, and the regressor. In soil property prediction based on machine learning, most studies build their own small datasets, and few open datasets exist, so it is hard to evaluate the prediction difficulty of a dataset. As mentioned in the Introduction, a spectrometer study suggested that infrared bands are more useful for soil property prediction [2]. The features and the regressor are the more important factors. Our experiments showed that a small set of suitable features performed better than a large number of features, so more effective feature extraction methods are needed. Various regressors have been used in different studies. The Heihe region spans over 400 km in both the horizontal and vertical directions, yet it contains fewer than 300 sampling points. Under these conditions, it remains unclear whether the good prediction performance holds across the whole region or only in the sub-regions that were sampled and modeled. The applicability of the predictive models is therefore one of our future research directions.
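To give a rough sense of the sampling density discussed above, the following back-of-the-envelope sketch treats the region as a 400 km × 400 km square with exactly 300 samples. Both figures are simplifying assumptions taken from the bounds in the text ("over 400 km", "fewer than 300"), so the true area per sample is at least as large as the value computed here. The variable names are illustrative.

```python
import math

side_km = 400.0   # assumed side length of a square bounding region
n_samples = 300   # assumed upper bound on the number of sampling points

area_km2 = side_km ** 2                        # total area of the assumed square
area_per_sample = area_km2 / n_samples         # area represented by one sample
mean_spacing_km = math.sqrt(area_per_sample)   # spacing if samples lay on a regular grid

print(f"Area per sample: {area_per_sample:.0f} km^2")
print(f"Equivalent grid spacing: {mean_spacing_km:.1f} km")
```

Each sample therefore stands for several hundred square kilometres, which is why the applicability of a single model across the whole region deserves further study.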

4. Conclusions

Using remote sensing images and DEM data to obtain soil properties can provide useful information for rational soil management. We calculated 17 features from remote sensing images and DEM data and constructed prediction models of the soil properties with the RF algorithm. Correlation statistics and error calculation were used together to determine the features. The predicted R² values for pH, bulk density, and organic carbon were 0.41, 0.70, and 0.75, respectively. Compared with the literature, the prediction of organic carbon in this paper was no worse than previous methods based on remote sensing images and reached the level of the spectrometer. The prediction of pH was better than previous methods based on remote sensing images, but it did not reach the medium level and needs further improvement. Although no comparable method based on remote sensing images or spectrometers could be found for bulk density, its prediction was better than that of pH in this paper and corresponded to the lower end of the spectrometer results. Therefore, it can be concluded that the prediction method in this paper is effective, particularly for organic carbon and bulk density.

Author Contributions

Z.L. designed the study, performed the experiments, and led the manuscript writing and revision. Y.Y. and B.T. performed the experiments and contributed to the interpretation of results and the writing of the paper. S.G. contributed to the interpretation of results. J.Z. contributed to the writing of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by (1) the National Natural Science Foundation of China (Youth), 2020–2022, Grant No. 52001039; (2) the National Natural Science Foundation of China, 2022–2025, Grant No. 52171310; and (3) the Shandong Natural Science Foundation, China, 2020–2022, No. ZR2019LZH005.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The soil organic carbon dataset (2012–2013) [32], soil bulk density dataset (2012–2013) [33], and soil pH dataset (2012–2013) [34] of typical soil samples in Heihe River Basin were obtained from the National Scientific Data Center for the Qinghai-Tibet Plateau (http://data.tpdc.ac.cn) (accessed on 24 March 2021). SRTM data were obtained from http://srtm.csi.cgiar.org/ (accessed on 14 March 2020).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Soriano-Disla, J.M.; Janik, L.J.; Allen, D.J.; McLaughlin, M.J. Evaluation of the performance of portable visible-infrared instruments for the prediction of soil properties. Biosyst. Eng. 2017, 161, 24–36.
  2. Martínez-España, R.; Bueno-Crespo, A.; Soto, J.; Janik, L.J.; Soriano-Disla, J.M. Developing an intelligent system for the prediction of soil properties with a portable mid-infrared instrument. Biosyst. Eng. 2018, 177, 101–108.
  3. Scudiero, E.; Skaggs, T.H.; Corwin, D.L. Regional scale soil salinity evaluation using Landsat 7, western San Joaquin Valley, California, USA. Geoderma Reg. 2014, 2–3, 82–90.
  4. Peng, J.; Biswas, A.; Jiang, Q.; Zhao, R.; Hu, J.; Hu, B.; Shi, Z. Estimating soil salinity from remote sensing and terrain data in southern Xinjiang Province, China. Geoderma 2019, 337, 1309–1319.
  5. Dou, X.; Wang, X.; Liu, H.; Zhang, X.; Meng, L.; Pan, Y.; Yu, Z. Prediction of soil organic matter using multi-temporal satellite images in the Songnen Plain, China. Geoderma 2019, 356, 113896.
  6. Forkuor, G.; Hounkpatin, O.K.L.; Welp, G.; Thiel, M. High resolution mapping of soil properties using remote sensing variables in south-western Burkina Faso: A comparison of machine learning and multiple linear regression models. PLoS ONE 2017, 12, e0170478.
  7. Song, X.D.; Brus, D.J.; Liu, F.; Li, D.C.; Zhao, Y.G.; Yang, J.L.; Zhang, G.L. Mapping soil organic carbon content by geographically weighted regression: A case study in the Heihe River Basin, China. Geoderma 2016, 261, 11–22.
  8. Sami, K.; John, F.; Andrew, K.; Nathan, D.; Scott, S. Integration of high resolution remotely sensed data and machine learning techniques for spatial prediction of soil properties and corn yield. Comput. Electron. Agric. 2018, 153, 213–225.
  9. Harti, A.E.; Lhissou, R.; Chokmani, K.; Ouzemou, J.; Hassouna, M.; Bachaoui, E.M.; Ghmari, A.E. Spatiotemporal monitoring of soil salinization in irrigated Tadla Plain (Morocco) using satellite spectral indices. Int. J. Appl. Earth Obs. Geoinf. 2016, 50, 64–73.
  10. Jarvis, A.; Reuter, H.I.; Nelson, A.; Guevara, E. Hole-Filled Seamless SRTM Data V4, International Centre for Tropical Agriculture (CIAT), [dataset]. 2008. Available online: http://srtm.csi.cgiar.org/ (accessed on 14 March 2020).
  11. Andrei, D.; Lucian, D.; Petru, U. Classification of Soil Types Using Geographic Object-Based Image Analysis and Random Forests. Pedosphere 2018, 28, 913–925.
  12. Ding, Y.; Zhao, K.; Zheng, X.; Jiang, T. Temporal dynamics of spatial heterogeneity over cropland quantified by time-series NDVI, near infrared and red reflectance of Landsat 8 OLI imagery. Int. J. Appl. Earth Obs. Geoinf. 2014, 30, 139–145.
  13. Du, Z.; Li, W.; Zhou, D.; Tian, L.; Ling, F.; Wang, H.; Gui, Y.; Sun, B. Analysis of Landsat-8 OLI imagery for land surface water mapping. Remote Sens. Lett. 2014, 5, 672–681.
  14. Nath, B.; Niu, Z.; Mitra, A.K. Observation of short-term variations in the clay minerals ratio after the 2015 Chile great earthquake (8.3 Mw) using Landsat 8 OLI data. J. Earth Syst. Sci. 2019, 128, 1–21.
  15. Rapinel, S.; Bouzille, J.B.; Oszwald, J.; Bonis, A. Use of bi-seasonal Landsat-8 imagery for mapping marshland plant community combinations at the regional scale. Wetlands 2015, 35, 1043–1054.
  16. Han, L.; Liu, D.; Cheng, G.; Zhang, G.; Wang, L. Spatial distribution and genesis of salt on the saline playa at Qehan Lake, Inner Mongolia, China. Catena 2019, 177, 22–30.
  17. Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 2002, 83, 195–213.
  18. Wu, W.; Al-Shafie, W.M.; Mhaimeed, A.S.; Ziadat, F.; Nangia, V.; Payne, W.B. Soil Salinity Mapping by Multiscale Remote Sensing in Mesopotamia, Iraq. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 4442–4452.
  19. Khan, N.M.; Rastoskuev, V.V.; Sato, Y.; Shiozawa, S. Assessment of hydrosaline land degradation by using a simple approach of remote sensing indicators. Agric. Water Manag. 2005, 77, 96–109.
  20. Douaoui, A.E.K.; Nicolas, H.; Walter, C. Detecting salinity hazards within a semiarid context by means of combining soil and remote-sensing data. Geoderma 2006, 134, 217–230.
  21. Abbas, A.; Khan, S. Using remote sensing techniques for appraisal of irrigated soil salinity. In Proceedings of the MODSIM 2007: International Congress on Modelling and Simulation: Land, Water and Environmental Management: Integrated Systems for Sustainability, Christchurch, New Zealand; 2007; pp. 2632–2638. Available online: https://researchoutput.csu.edu.au/ws/portalfiles/portal/9629947/CSU290411.pdf (accessed on 24 March 2021).
  22. Taghizadeh-Mehrjardi, R.; Minasny, B.; Sarmadian, F.; Malone, B.P. Digital mapping of soil salinity in Ardakan Region, Central Iran. Geoderma 2014, 213, 15–28.
  23. Wilson, J.P.; Gallant, J.C. Secondary topographic attributes. In Terrain Analysis—Principles and Applications; Wilson, J.P., Gallant, J.C., Eds.; Wiley: New York, NY, USA, 2000; pp. 87–132.
  24. Gruber, S.; Peckham, S. Land-surface parameters and objects in hydrology. Dev. Soil Sci. 2009, 33, 171–194.
  25. Böhner, J.; Köthe, R.; Conrad, O.; Gross, J.; Ringeler, A.; Selige, T. Soil regionalisation by means of terrain analysis and process parameterisation. In Soil Classification 2001; Micheli, E., Nachtergaele, F., Montanarella, L., Eds.; European Soil Bureau: Luxembourg, 2002; pp. 213–222.
  26. Olaya, V.F. A Gentle Introduction to SAGA GIS; The SAGA User Group eV: Göttingen, Germany, 2004; Volume 208.
  27. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  28. Allbed, A.; Kumar, L.; Aldakheel, Y.Y. Assessing soil salinity using soil salinity and vegetation indices derived from IKONOS high-spatial resolution imageries: Applications in a date palm dominated region. Geoderma 2014, 230, 1–8.
  29. Alhammadi, M.S.; Glenn, E.P. Detecting date palm trees health and vegetation greenness change on the eastern coast of the United Arab Emirates using SAVI. Int. J. Remote Sens. 2008, 29, 1745–1765.
  30. Yang, R.M.; Zhang, G.L.; Liu, F.; Lu, Y.Y.; Yang, F.; Yang, F.; Yang, M.; Zhao, Y.G.; Li, D.C. Comparison of boosted regression tree and random forest models for mapping topsoil organic carbon concentration in an alpine ecosystem. Ecol. Indic. 2016, 60, 870–878.
  31. Morellos, A.; Pantazi, X.-E.; Moshou, D. Machine Learning based Prediction of Soil Total Nitrogen, Organic Carbon and Moisture Content by Using VIS-NIR Spectroscopy. Biosyst. Eng. 2016, 104–116.
  32. Song, X.D.; Zhang, G.L. Soil organic carbon dataset of typical soil samples in Heihe River Basin. National Tibetan Plateau Data Center. [dataset]. 2020. Available online: http://data.tpdc.ac.cn/zh-hans/data/8aab6846-4af1-4485-a20e-3add39da060b/ (accessed on 24 March 2021).
  33. Song, X.D.; Zhang, G.L. Soil bulk density dataset of typical soil samples in Heihe River Basin. National Tibetan Plateau Data Center. [dataset]. 2020. Available online: http://data.tpdc.ac.cn/zh-hans/data/eaeb6193-4275-489e-91f9-ce1a7c51e3ed/ (accessed on 24 March 2021).
  34. Song, X.D.; Zhang, G.L. pH dataset of typical soil samples in Heihe River Basin. National Tibetan Plateau Data Center. [dataset]. 2020. Available online: http://data.tpdc.ac.cn/zh-hans/data/1070819b-40ed-42ef-adbc-5aaa85483f32/ (accessed on 24 March 2021).
Figure 1. The positional relationship between the five remote sensing images and the study area.
Figure 2. Probability density histograms of pH, bulk density, original organic carbon, and logarithmic organic carbon. (a) Histogram of pH. (b) Histogram of bulk density. (c) Histogram of organic carbon. (d) Histogram of logarithmic organic carbon.
Table 1. The abbreviations, calculation formulas, and references of the candidate features.
| Auxiliary Data | Land Surface Parameters | Abbrev. | Formulations | Indicator Numbers for Target Variables | References |
|---|---|---|---|---|---|
| Remote sensing index attributes | Normalized vegetation index | NDVI | $(B_5 - B_4)/(B_5 + B_4)$ | (1), (2), (3) | Ding et al. (2014) [12] |
| | Water body index I | NDWI1 | $(B_5 - B_6)/(B_5 + B_6)$ | (1), (2), (3) | Du et al. (2014) [13] |
| | Water body index II | NDWI2 | $(B_5 - B_7)/(B_5 + B_7)$ | (1), (2), (3) | Du et al. (2014) [13] |
| | Ratio vegetation index | RVI | $B_5/B_4$ | (1), (2), (3) | Ding et al. (2014) [12] |
| | Clay mineral ratio | CMR | $B_6/B_7$ | (1), (2), (3) | Nath et al. (2019) [14] |
| | Brightness index | BI | $(B_4^2 + B_5^2)^{1/2}$ | (1), (2), (3) | Rapinel et al. (2015) [15] |
| | Canopy response salinity index | CRSI | $(B_5 \times B_4 - B_3 \times B_2)/(B_5 \times B_4 + B_3 \times B_2)$ | (1), (2), (3) | Han et al. (2019) [16] |
| | Enhanced vegetation index | EVI | $g \cdot (B_5 - B_4)/(B_5 + C_1 \cdot B_4 - C_2 \cdot B_2 + L)$ | (1), (2), (3) | Huete et al. (2002) [17] |
| | Generalized difference vegetation index | GDVI | $(B_5^2 - B_4^2)/(B_5^2 + B_4^2)$ | (1), (2), (3) | Wu et al. (2014) [18] |
| | Salinity index 1 | SI1 | $(B_4 \times B_2)^{0.5}$ | (3) | Khan et al. (2005) [19] |
| | Salinity index 2 | SI2 | $(B_5^2 + B_4^2 \times B_3^2)^{0.5}$ | (3) | Douaoui et al. (2006) [20] |
| | Salinity index 3 | SI3 | $B_2/B_4$ | (3) | Abbas and Khan (2007) [21] |
| | Salinity index 4 | SI4 | $(B_2 - B_4)/(B_2 + B_4)$ | (3) | Abbas and Khan (2007) [21] |
| | Salinity index 5 | SI5 | $(B_3 \times B_4)/B_2$ | (3) | Abbas and Khan (2007) [21] |
| | Salinity index 6 | SI6 | $(B_3 + B_4 + B_5)/2$ | (1), (2), (3) | Douaoui et al. (2006) [20] |
| Terrain attributes | Elevation | DEM | $Z$ | (1), (2), (3) | Mehrjardi et al. (2014) [22] |
| | Slope | S | $(Z_x^2 + Z_y^2)^{1/2}$ | (1), (2), (3) | Wilson et al. (2000) [23] |
| | Curvature | C | $Z_{xx}^2 + 2Z_{xy}^2 + Z_{yy}^2$ | | Wilson et al. (2000) [23] |
| | Catchment area | CA | $\mathrm{CA} = \mathrm{Con} \cdot \mathrm{Dpi}^2$ | | Wilson et al. (2000) [23] |
| | Topographic wetness index | TWI | $\mathrm{TWI} = \ln(A/\tan\beta)$ | | Gruber et al. (2009) [24] |
| | SAGA wetness index | SWI | $\mathrm{SWI} = \ln(A_m/\tan\beta)$ | | Böhner et al. (2002) [25] |
| | Valley depth | VD | $\mathrm{VD} = Z(x)$ | | Olaya et al. (2004) [26] |
Indicator numbers for the target variables: (1) pH value, (2) organic carbon, (3) bulk density.
Table 2. Descriptive statistics of the soil properties.
| Descriptive Statistics | pH ($-\log \mathrm{H}^{+}$) | Organic Carbon (g/kg) | Logarithmic Organic Carbon | Bulk Density (g/cm³) |
|---|---|---|---|---|
| Full Range | 3.50 | 117.70 | 5.69 | 1.16 |
| Minimum | 6.20 | 0.40 | −0.92 | 0.69 |
| Maximum | 9.70 | 118.10 | 4.77 | 1.85 |
| Average | 8.25 | 11.72 | 1.68 | 1.26 |
| Standard Deviation | 0.53 | 17.86 | 1.26 | 0.25 |
| Variance | 0.28 | 319.06 | 1.60 | 0.06 |
| Skewness | −0.52 | 3.44 | 0.19 | −0.09 |
| Kurtosis | 4.85 | 17.70 | −0.72 | 2.36 |
Table 3. Correlation coefficients between attributes and soil properties.
| Attributes | Abbrev. | pH | Logarithmic Organic Carbon | Bulk Density | Serial Number of Composition |
|---|---|---|---|---|---|
| Normalized vegetation index | NDVI | −0.30 ** | 0.61 ** | −0.51 ** | (1), (2), (3) |
| Water body index I | NDWI1 | −0.30 ** | −0.01 | −0.09 | (1), (2), (3) |
| Water body index II | NDWI2 | −0.27 ** | 0.23 ** | −0.36 ** | (1), (2), (3) |
| Ratio vegetation index | RVI | −0.31 ** | 0.70 ** | −0.52 ** | (1), (2), (3) |
| Clay mineral ratio | CMR | −0.48 ** | 0.74 ** | −0.61 ** | (1), (2), (3) |
| Brightness index | BI | 0.37 ** | 0.11 | 0.20 ** | (1), (2), (3) |
| Canopy response salinity index | CRSI | 0.17 * | −0.16 ** | 0.08 | (1), (2), (3) |
| Salinity index 1 | SI1 | −0.24 * | 0.31 ** | −0.41 ** | (3) |
| Salinity index 2 | SI2 | 0.08 | −0.18 * | 0.19 * | (3) |
| Salinity index 3 | SI3 | −0.24 ** | 0.26 ** | −0.46 ** | (3) |
| Salinity index 4 | SI4 | −0.24 ** | −0.36 ** | −0.48 ** | (3) |
| Salinity index 5 | SI5 | 0.24 ** | −0.45 ** | −0.45 ** | (3) |
| Salinity index 6 | SI6 | 0.04 | 0.37 ** | 0.12 * | (1), (2), (3) |
| Enhanced vegetation index | EVI | 0.06 | 0.13 * | 0.10 * | (1), (2), (3) |
| Generalized difference vegetation index | GDVI | 0.02 | −0.24 ** | 0.03 | (1), (2), (3) |
| Elevation | DEM | −0.76 ** | 0.80 ** | −0.89 ** | (1), (2), (3) |
| Slope | S | −0.40 ** | 0.65 ** | −0.50 ** | (1), (2), (3) |
| Curvature | C | 0 | 0.07 | 0.01 ** | |
| Catchment area | CA | −0.02 | −0.02 | 0.01 | |
| Topographic wetness index | TWI | 0.04 | 0.07 | 0.01 | |
| SAGA wetness index | SWI | 0.03 | −0.04 | −0.04 | |
| Valley depth | VD | −0.05 | 0.03 | −0.08 | |
** Significant at the 0.01 probability level. * Significant at the 0.05 probability level. Serial number of each component: (1) pH value, (2) organic carbon, (3) bulk density.
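Table 4. Model performance for the estimation of soil properties for all three models.
| Model | K | pH MAE | pH RMSE | pH R² | Organic Carbon MAE | Organic Carbon RMSE | Organic Carbon R² | Bulk Density MAE | Bulk Density RMSE | Bulk Density R² | Logarithmic Organic Carbon MAE | Logarithmic Organic Carbon RMSE | Logarithmic Organic Carbon R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Core feature | 4 | 0.28 | 0.40 | 0.38 | 4.79 | 8.30 | 0.74 | 0.11 | 0.14 | 0.66 | 0.49 | 0.71 | 0.69 |
| 22-dimensional feature | 4 | 0.30 | 0.43 | 0.33 | 4.83 | 8.42 | 0.74 | 0.11 | 0.14 | 0.69 | 0.42 | 0.67 | 0.76 |
| The optimal model | 3 | 0.32 | 0.43 | 0.28 | 5.07 | 8.57 | 0.75 | 0.11 | 0.14 | 0.68 | 0.45 | 0.69 | 0.69 |
| | 4 | 0.30 | 0.43 | 0.33 | 4.75 | 8.26 | 0.75 | 0.11 | 0.13 | 0.70 | 0.40 | 0.62 | 0.75 |
| | 5 | 0.28 | 0.39 | 0.41 | 4.90 | 8.31 | 0.72 | 0.11 | 0.14 | 0.65 | 0.35 | 0.55 | 0.81 |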
Table 5. Verification results of SVM (K = 5).
| Soil Properties | MAE | RMSE | R² |
|---|---|---|---|
| pH | 0.30 | 0.43 | 0.30 |
| Organic carbon | 7.99 | 14.02 | 0.26 |
| Bulk density | 0.11 | 0.15 | 0.60 |
Table 6. Verification results of GP (K = 5).
| Soil Properties | MAE | RMSE | R² |
|---|---|---|---|
| pH | 0.34 | 0.45 | 0.20 |
| Organic carbon | 9.42 | 15.74 | 0.12 |
| Bulk density | 0.19 | 0.22 | 0.35 |
Table 7. PCA prediction RMSE results and the minimum RMSE of non-PCA.
| Soil Properties | PCA Dimension 8 | PCA Dimension 10 | PCA Dimension 12 | PCA Dimension 13 | PCA Dimension 15 | Non-PCA |
|---|---|---|---|---|---|---|
| Bulk density | 0.1396 | 0.1381 | 0.1371 | 0.1339 | 0.1382 | 0.1347 |
| pH | 0.3849 | 0.3836 | 0.3851 | 0.3877 | 0.3869 | 0.3941 |
| Organic carbon | 8.7480 | 8.2520 | 8.7164 | 8.5876 | 8.6273 | 8.2647 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Li, Z.; Yang, Y.; Gu, S.; Tang, B.; Zhang, J. Research on the Prediction of Several Soil Properties in Heihe River Basin Based on Remote Sensing Images. Sustainability 2021, 13, 13930. https://doi.org/10.3390/su132413930
