1. Introduction
Water plays an important role as a transmitter in the Soil–Plant–Atmosphere Continuum (SPAC), linking the soil, plants, and atmosphere into a unified, dynamic system of mutual feedback. The soil water content (SWC) is a crucial soil physicochemical property and one of the necessary conditions for soil to nurture life. It is also one of the time-varying parameters in agricultural, ecological, hydrological, and other research fields [1]. The SWC has been listed as an essential climate variable by the Global Climate Observing System [2]. In agriculture, the SWC has always been a very important indicator, mainly supporting decision-making for irrigation management, efficient water use, and yield prediction [3]. Therefore, rapid and accurate monitoring of the SWC has long been a key concern for scholars, and the results of such research will play an important role in agricultural production.
Different methods have been developed to measure the SWC, including the oven-drying method, the resistance method, and the tensiometer method. Among these, the oven-drying method provides the most precise measurement of the SWC. However, most of these methods have certain limitations. For instance, the oven-drying method involves destructive sampling, is tedious, and cannot deliver real-time data [4]. The resistance soil moisture sensor is influenced by factors such as air gaps, soil salinity, temperature, and bulk density, and requires specific calibration [5]. Tensiometers suffer from response lag and sensitivity to soil temperature and salinity, and they require regular manual monitoring and maintenance [6]. Furthermore, these traditional methods are limited to point-scale measurements and do not provide spatially representative results, making it difficult to meet the requirements of real-time, large-scale, dynamic moisture estimation for precision agriculture. The active heated fiber optics (AHFO) method has demonstrated the potential to continuously determine the SWC at the field scale [7,8,9]. However, poor mobility, high cost, and the need for professional maintenance have limited the widespread application of AHFO. Therefore, achieving large-scale SWC determination in real time with accuracy and continuity remains a challenging task.
Spectral technology has emerged as a rapidly developing analytical technique in recent years owing to its non-destructiveness, accuracy, and speed. Because traditional SWC methods are inherently limited in monitoring at the spatial scale, spectral technology has become a research direction for many scholars in SWC monitoring [10,11]. In the early stages of SWC spectral retrieval research, the majority of scholars focused on the diagnosis of soil moisture deficiency (SWC much lower than the field capacity, θf). Bowers and Hanks [12] discovered that soil reflectance decreased as soil water increased in bare ground, and that the spectral reflectance curve could be altered by the soil water [13,14]. As research progressed, the situation in which the SWC was higher than θf was studied. Neema et al. [15] pointed out that soil spectral reflectance decreased with increasing SWC when the SWC was below θf and increased with increasing SWC when the SWC exceeded a certain threshold value. Liu et al. [16] demonstrated that the threshold is usually greater than θf. Previous remote sensing retrieval studies tended to focus on SWC below θf [17], and few studies have reported on the remote sensing retrieval of SWC above θf. However, in agricultural production, farmers may face many situations that lead to a high SWC, such as heavy rainfall, over-irrigation, and poor drainage. This can negatively impact crop growth, resulting in reduced crop yield and even total crop failure [18]. Therefore, it is also of practical importance to diagnose an SWC above θf. However, because soil reflectance first decreases and then increases as the SWC rises, using the same model to invert the SWC both above and below θf may lead to poor accuracy.
Hyperspectral data contain thousands of bands, many of which are mixed with noise and interfering variables. Data preprocessing and feature extraction algorithms can reduce noise, remove interfering variables, and improve model prediction [19]. However, when only one method is used to extract feature variables, the stability might be poor, and too many variables may be selected, which would make the prediction model too complex [20]. To address this deficiency, coupled variable extraction methods have been used, for example, uninformative variable elimination plus the successive projections algorithm (UVE–SPA). The UVE–SPA method can strengthen the correlation between the feature variables and the target while also reducing the number of variables [21]. Xu et al. [22] employed competitive adaptive reweighted sampling plus the successive projections algorithm (CARS–SPA) to extract variables, which simplified the modeling process and improved the prediction accuracy of potato dry matter. Coupled feature extraction methods have been studied in some fields, but it remains to be investigated whether such a method can effectively extract SWC-sensitive bands when the samples include SWCs both below and above θf, and whether the dimensionality of the hyperspectral data can be sufficiently reduced to simplify model building.
Bowers and Hanks [12] reported absorption bands for soil water at 1400, 1900, and 2200 nm in indoor soil spectral reflectance, and the SWC could be predicted from the feature band at 1900 nm. However, 1400 and 1900 nm lie in the atmospheric water vapor absorption bands, which makes them difficult to apply outdoors. Sun et al. [23] analyzed the absorption spectrum of black soil in northeast China and observed a strong correlation between the soil absorption spectrum and the SWC. The maximum absorbance peak was found at 1946 nm, and the R² of the prediction dataset for the univariate linear regression model of SWC was greater than 0.95. However, soil composition is complex and variable, and a simple linear model may not retrieve the SWC accurately. Relevant findings have revealed that the relationship between soil spectral reflectance and SWC over a large range is usually nonlinear [13,16,24,25,26].
Machine learning has been widely applied in various fields in recent decades because of its ability to learn and approximate complex nonlinear mappings. In particular, quantitative remote sensing in agriculture has become an active research area for machine learning applications, and the spectral monitoring of SWC based on machine learning is an important part of it. Research on estimating the SWC in saline soils indicates that machine learning methods have advantages; for example, the support vector machine model had better overall fitting ability than the multiple linear regression and partial least squares regression models [27]. A study of the spectral estimation of SWC in different soils (sandy and loamy) demonstrated that a nonlinear method (back-propagation artificial neural network, BPANN) can predict well for single-soil and mixed-soil samples, with R² > 0.8 [28]. Previous studies have demonstrated that machine learning methods can effectively handle the nonlinearity between soil reflectance and SWC. However, further research is required to determine which of the commonly used machine learning models is best suited for inverting the SWC.
Based on this, the aims of this study were to (1) divide the sample into two parts with θf as the threshold to establish models, (2) extract the SWC-sensitive bands using a combination of the competitive adaptive reweighted sampling (CARS) and random frog (Rfrog) algorithms and evaluate the effectiveness of this integrated approach for identifying SWC-sensitive bands, and (3) establish and compare the performance of machine learning methods (extreme learning machine, back-propagation artificial neural network, and support vector machine) to select the optimal model for SWC prediction.
2. Materials and Methods
2.1. Preparation of Soil Samples
In this study, soil samples covering a range of water contents were prepared in the laboratory. Red soil was used for the soil sample preparation (porosity, 61.65%; bulk density, 1.01 g/cm³; clay, 20.03%; silt, 62.32%; sand, 17.65%). The collected raw soil was air-dried, finely ground, cleared of impurities, and passed through a 2 mm sieve to reduce the effect of soil particle diameter on the spectral determination. The prepared soil was packed into a disc with a 16 cm inner diameter and 1.7 cm height, with several small holes at the bottom. The soil surface was then leveled, and the disc was placed into a tray with a water depth of approximately 1 cm until the soil was saturated. The disc was then removed and placed on air-dried soil lined with filter paper to allow the water to drain out naturally. Soil samples with various water contents were obtained by controlling the duration of the water removal. This preparation process avoids the uneven sample surfaces caused by adding water from above.
2.2. Remote Sensing Data
The hyperspectral reflectance of the soil samples was determined using an SR-2500 portable geophysical spectrometer (Spectral Evolution, Inc., 1 Canal St., Unit B-1, Lawrence, MA 01840 USA). The wavelength range of the instrument was 350–2500 nm, with a total of 2151 channels. The portable spectroradiometer was equipped with optical fiber with a length of 1.5 m and an 8° field of view (FOV). The sampling intervals were 1.5 nm @ 350–1000 nm and 6 nm @ 1000–2500 nm, and the instrument automatically interpolated the measurement results into 1 nm intervals. To obtain steady spectral data, we chose a clear and cloudless day between 10:00 and 14:00 local time when the solar altitude angle and light intensity were optimal. During sampling, the optical fiber was placed 15 cm above the soil sample in a vertically downward position to ensure that the FOV coverage did not exceed the disc range. The hyperspectral data were collected 10 times for each soil sample, and the average value was utilized as the hyperspectral reflectance to reduce random errors. The instrument was calibrated using a standard whiteboard before measurement, and the calibration process was repeated every 10 min.
2.3. Soil Water Content Determination
After collecting the hyperspectral data, the SWCs of the samples were determined by the oven-drying method (Table 1). The Wilcox method was used to measure the field capacity (θf) of the experimental soil, which was 31.63% (mass water content).
2.4. Spectral Preprocessing
In this study, the soil spectral reflectance was analyzed only in the 350–1349 nm and 1451–1800 nm bands (a total of 1350 bands) due to the strong absorption band of water near 1400 nm and the large signal noise at wavelengths greater than 1800 nm. During the acquisition process, the sample spectra were frequently disturbed by stray light, baseline drift, and other factors, which had an impact on the final analysis results. Therefore, it was necessary to preprocess the raw spectra. Savitzky–Golay (SG; window width, 3; polynomial order, 1) smoothing was used to preprocess the raw spectral data.
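The band selection and smoothing described above can be sketched as follows. This is only an illustration in Python (the study itself used The Unscrambler X); the function names, the synthetic reflectance values, the choice to smooth each retained segment separately, and the handling of the segment endpoints are all assumptions. With a window of 3 and a polynomial order of 1, the SG filter at an interior point reduces to the mean of the point and its two neighbors.

```python
def select_bands(wavelengths, reflectance):
    """Keep 350-1349 nm and 1451-1800 nm as two contiguous segments."""
    segments = []
    for lo, hi in ((350, 1349), (1451, 1800)):
        segments.append([(w, r) for w, r in zip(wavelengths, reflectance)
                         if lo <= w <= hi])
    return segments


def sg_smooth_w3_p1(values):
    """Savitzky-Golay smoothing with window width 3 and polynomial order 1.

    The least-squares line through three equally spaced points, evaluated
    at the centre, equals their mean. Endpoints are left unchanged here
    (an assumption; software packages differ in edge handling).
    """
    if len(values) < 3:
        return list(values)
    out = [values[0]]
    for i in range(1, len(values) - 1):
        out.append((values[i - 1] + values[i] + values[i + 1]) / 3.0)
    out.append(values[-1])
    return out


# Synthetic spectrum: 2151 channels at 1 nm spacing (350-2500 nm).
wavelengths = list(range(350, 2501))
reflectance = [0.2 + 0.0001 * (w - 350) for w in wavelengths]

segments = select_bands(wavelengths, reflectance)
kept_wavelengths = [w for seg in segments for w, _ in seg]
smoothed = [sg_smooth_w3_p1([r for _, r in seg]) for seg in segments]
n_bands = len(kept_wavelengths)  # 1000 + 350 retained bands
```

Smoothing each segment separately keeps the filter from averaging across the excluded 1350–1450 nm gap; whether the original workflow smoothed before or after removing these bands is not stated.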
2.5. Elimination of the Outliers and Sample Data Division
The collection, processing, and analysis of soil samples might introduce a degree of error, particularly human measurement error, which could affect subsequent data analysis and modeling. Samples with such errors are called outliers, and it is often necessary to re-measure or eliminate them to minimize their impact on the subsequent processing results. To address the issue of outliers in the samples, this study employed Monte Carlo cross-validation (MCCV) [29] to identify them. MCCV can efficiently detect outliers in the spectral array by analyzing the sensitivity of the prediction error to anomalous samples.
In the present study, all data were processed centrally. A total of 1000 PLSR models with SWC as the dependent variable and the raw spectrum as the independent variable were established using MCCV, with 70% of the samples randomly selected in each run. The prediction error of each sample was calculated, and the mean (MEAN) and standard deviation (STD) of the prediction errors for each sample were determined. A scatter plot of MEAN versus STD for the sample set was created. Finally, 2.5 times the average value of the MEANs and of the STDs was taken as the respective threshold, and samples exceeding either threshold were treated as outliers. The complete flow of the MCCV is shown in Figure 1.
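The MCCV outlier screening can be sketched as below. This is a minimal illustration and not the authors' MATLAB implementation: a univariate least-squares line stands in for the PLSR model, the prediction error is collected only for the samples left out of each random calibration subset (an assumption), and the data, function names, and random seed are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares line; a stand-in for the PLSR model."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def mccv_outliers(xs, ys, n_runs=1000, ratio=0.7, factor=2.5, seed=0):
    """Flag samples whose MEAN or STD of prediction error is too large."""
    rng = random.Random(seed)
    n = len(xs)
    n_cal = int(round(ratio * n))
    errors = [[] for _ in range(n)]
    for _ in range(n_runs):
        cal = set(rng.sample(range(n), n_cal))  # random calibration subset
        slope, intercept = fit_line([xs[i] for i in cal],
                                    [ys[i] for i in cal])
        for i in range(n):
            if i not in cal:  # error on the held-out samples only
                errors[i].append(abs(ys[i] - (slope * xs[i] + intercept)))
    means = [statistics.fmean(e) for e in errors]
    stds = [statistics.pstdev(e) for e in errors]
    mean_thr = factor * statistics.fmean(means)
    std_thr = factor * statistics.fmean(stds)
    return [i for i in range(n) if means[i] > mean_thr or stds[i] > std_thr]


# Synthetic data: a linear trend with small noise and one corrupted sample.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + 0.1 * ((int(x) * 7) % 5 - 2) for x in xs]
ys[7] += 15.0  # simulated measurement error
flagged = mccv_outliers(xs, ys)
```

In this toy case the corrupted sample has a mean prediction error far above 2.5 times the average, so it appears in `flagged`; other samples may also be flagged through the STD criterion, which is why the scatter plot is usually inspected as well.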
During model construction, sample set partitioning based on joint X–Y distance (SPXY) was used to divide the samples into representative calibration and prediction datasets with a ratio of 2:1. The SPXY algorithm, originally developed by Galvao [30], calculates the distance between samples using both the spectral and target values as characteristic parameters so that the calibration and prediction datasets are diverse and representative. This method effectively covers the multidimensional vector space and improves the model’s prediction accuracy.
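A compact sketch of an SPXY-style split is given below, assuming the commonly described procedure: pairwise distances in the spectra (X) and in the target (y) are each normalized by their maximum and summed, and samples are then selected in Kennard–Stone fashion. The function name and the toy data are hypothetical, and the sketch assumes that not all samples are identical (otherwise the normalization would divide by zero).

```python
import math


def spxy_split(X, y, n_cal):
    """Return (calibration indices, prediction indices)."""
    n = len(X)
    dx = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    dy = [[abs(y[i] - y[j]) for j in range(n)] for i in range(n)]
    max_dx = max(max(row) for row in dx)
    max_dy = max(max(row) for row in dy)
    # Joint X-Y distance: each part is scaled to [0, 1] before summing.
    d = [[dx[i][j] / max_dx + dy[i][j] / max_dy for j in range(n)]
         for i in range(n)]

    # Start from the two samples that are farthest apart.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda p: d[p[0]][p[1]])
    selected = [first, second]
    remaining = set(range(n)) - set(selected)

    # Repeatedly add the sample farthest from its nearest selected sample.
    while len(selected) < n_cal:
        nxt = max(sorted(remaining),
                  key=lambda k: min(d[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, sorted(remaining)


# Toy data: 9 samples with 2 "bands"; a 2:1 split gives 6 and 3 samples.
X = [[float(i), 0.5 * i] for i in range(9)]
y = [float(i) for i in range(9)]
cal_idx, pred_idx = spxy_split(X, y, n_cal=6)
```

Because the extremes of both the spectra and the target are selected first, the calibration set spans the range of the data, and the prediction set is drawn from the samples in between.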
2.6. Feature Variable Extraction
The hyperspectral data contain a large amount of redundant data and irrelevant information in addition to information about the SWC, possibly leading to model complexity. Selecting important bands for the modeling not only reduces the complexity of the model but also results in better performance and higher accuracy.
In this study, competitive adaptive reweighted sampling (CARS) was chosen for spectral feature extraction [31]. The CARS algorithm mimics the “survival of the fittest” principle of Darwinian evolutionary theory by treating wavelength variables as individual entities. During the selection process, bands with a strong adaptive capacity are retained, while those with a weak adaptive capacity are eliminated. Because the CARS algorithm uses Monte Carlo sampling to randomly select modeling samples, the variable regression coefficients change with the random sample selection, so the absolute value of the regression coefficients cannot entirely indicate the significance of the variables, which can affect the accuracy of the model.
To mitigate the influence of the randomness of the CARS algorithm, the random frog (Rfrog) algorithm was adopted to conduct a secondary data filtration after feature extraction by CARS, further simplifying the model while ensuring its accuracy. Rfrog is an iterative feature selection algorithm proposed by Li [32]. The variable selection process was executed within the reversible jump Markov chain Monte Carlo (RJMCMC) framework. A sufficient number (≥10,000) of partial least squares regression (PLSR) models were built across the iterations to calculate the selection probability of each band. The more information a band contains, the greater its selection probability. After completing the iterations, bands were ranked by their selection probability, and variables with a high selection probability were preferred as feature variables.
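The final ranking step of Rfrog can be illustrated with a short sketch. It covers only how selection probabilities are derived from the variable subsets visited during the iterations and how the top bands are kept; the RJMCMC search and the PLSR sub-models are not reproduced, and the function names and the tiny example are hypothetical.

```python
def selection_probabilities(visited_subsets, n_bands):
    """Fraction of iterations in which each band was part of the subset."""
    counts = [0] * n_bands
    for subset in visited_subsets:
        for band in set(subset):
            counts[band] += 1
    return [c / len(visited_subsets) for c in counts]


def rank_bands(probabilities, keep):
    """Indices of the `keep` bands with the highest selection probability."""
    order = sorted(range(len(probabilities)),
                   key=lambda b: (-probabilities[b], b))
    return order[:keep]


# Four iterations over five candidate bands (indices 0-4).
visited = [[0, 2], [2, 3], [2, 4], [0, 2, 4]]
probs = selection_probabilities(visited, n_bands=5)
best = rank_bands(probs, keep=2)
```

Here band 2 is chosen in every iteration and is ranked first; ties are broken by band index, which is an arbitrary choice made for the example.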
2.7. Modeling Method
In this study, based on the nonlinear relationship between the soil spectral reflectance and the SWC [13,16], three nonlinear models, namely the extreme learning machine (ELM), back-propagation artificial neural network (BPANN), and support vector machine (SVM), were selected for modeling.
The ELM is a single-hidden-layer feedforward neural network (SLFN) learning algorithm developed by Huang [33]. In contrast to conventional gradient-based feedforward neural network learning algorithms, the ELM randomly assigns the input weights and hidden-layer biases. Its training requires little manual parameter adjustment and avoids the repeated iterations of traditional training algorithms. As a result, the model trains extremely fast and achieves high generalization performance. In this work, the activation function of the hidden-layer neurons was set to “sigmoid” by default, and the number of hidden-layer nodes was initially set at 3 and gradually increased to 100 in steps of 1. Each model structure was run multiple times, and the optimal number of hidden-layer nodes was determined from the best training results.
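The ELM training idea and the search over the number of hidden-layer nodes can be sketched as follows. This is an illustrative pure-Python version under several assumptions: the output weights are obtained from ridge-regularized normal equations (the small ridge term is added only for numerical stability), the training RMSE is used as the selection criterion, the node range is shortened to 3–10, and the data are synthetic. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def hidden_output(x, W, b):
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


def train_elm(X, y, n_hidden, rng, ridge=1e-6):
    n_in = len(X[0])
    # Input weights and hidden biases are random and never updated.
    W = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
         for _ in range(n_hidden)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    H = [hidden_output(x, W, b) for x in X]
    # Output weights from (H^T H + ridge * I) beta = H^T y, in one step.
    HtH = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    Hty = [sum(h[i] * t for h, t in zip(H, y)) for i in range(n_hidden)]
    return W, b, solve_linear(HtH, Hty)


def predict_elm(model, X):
    W, b, beta = model
    return [sum(bj * hj for bj, hj in zip(beta, hidden_output(x, W, b)))
            for x in X]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))


rng = random.Random(0)
X = [[i / 29.0] for i in range(30)]   # one synthetic input feature
y = [x[0] ** 2 for x in X]            # nonlinear target

best_rmse, best_nodes = float("inf"), None
for n_hidden in range(3, 11):         # the study scanned 3 to 100
    for _ in range(3):                # repeated runs per structure
        model = train_elm(X, y, n_hidden, rng)
        err = rmse(y, predict_elm(model, X))
        if err < best_rmse:
            best_rmse, best_nodes = err, n_hidden

baseline_rmse = rmse(y, [sum(y) / len(y)] * len(y))
```

Because only the output weights are solved for, each candidate structure is trained in a single linear solve, which is what makes scanning many hidden-node counts and repeated runs inexpensive.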
The BPANN is a widely used machine learning algorithm based on the gradient descent method, which uses gradient search techniques to reduce the mean squared error between the actual output value and the desired output value of the network. It consists of an input layer, a hidden layer, and an output layer, each containing several nodes. The weights of each node are calculated through self-learning to derive the training results. These results are compared with the expected outcomes, and if they do not meet expectations, the weights are modified to reduce the errors. Continuous iteration brings the outputs closer to the expected results and minimizes the errors. The BPANN model was created with the “newff” function; the maximum iteration number was 10,000; the minimum error of the training target was 0.000001; the learning rate was 0.01; and the number of hidden-layer nodes was determined using the same method as for the ELM.
The SVM is a learning system that uses linear function hypotheses in high-dimensional feature spaces [34]. Based on structural risk minimization, this method can better address practical problems such as the curse of dimensionality and overfitting. It effectively handles small samples, nonlinearity, high dimensions, and local minima and has good generalization ability. To better address the nonlinear characteristics of the data, the radial basis function (RBF) was used as the SVM kernel function in this study. Two important parameters needed to be adjusted in the model, i.e., the penalty factor (c) and the kernel function parameter (g). If either c or g is too large, the model tends to overfit. By contrast, if either is too small, the model tends to underfit. Either extreme could result in poor generalization ability. A 5-fold cross-validation combined with the grid search method was used to find the optimal penalty factor c and kernel function parameter g within the range [2^−10, 2^10], with a step size of 2^0.5, to determine the final model.
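The parameter grid and the search loop can be sketched as below. The sketch assumes that the step size refers to the exponent increasing by 0.5, which gives 41 candidate values for each of c and g, and it replaces the cross-validated SVM error with a hypothetical `toy_score` function so that the example stays self-contained; `k_fold_indices` shows one simple way to form the five folds.

```python
import itertools
import math


def power_grid(lo=-10.0, hi=10.0, step=0.5):
    """Candidate values 2**lo, 2**(lo + step), ..., 2**hi."""
    n_steps = int(round((hi - lo) / step))
    return [2.0 ** (lo + k * step) for k in range(n_steps + 1)]


def k_fold_indices(n_samples, k=5):
    """Assign sample indices to k folds in an interleaved fashion."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def grid_search(score_fn, c_values, g_values):
    """Return the (c, g) pair with the lowest score."""
    return min(itertools.product(c_values, g_values),
               key=lambda cg: score_fn(*cg))


def toy_score(c, g):
    # Hypothetical stand-in for the 5-fold cross-validated error of an
    # RBF-kernel SVM; its minimum is placed at c = 2**3 and g = 2**-2.
    return (math.log2(c) - 3.0) ** 2 + (math.log2(g) + 2.0) ** 2


c_values = power_grid()
g_values = power_grid()
folds = k_fold_indices(23, k=5)
best_c, best_g = grid_search(toy_score, c_values, g_values)
n_candidates = len(c_values) * len(g_values)
```

In an actual run, `toy_score` would train the SVM on four folds and return the error on the fifth, averaged over the five folds, for each of the 1681 (c, g) pairs.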
2.8. Model Evaluation Metrics and Software
The coefficient of determination (R²), root mean square error (RMSE), and relative root mean square error (RRMSE) were selected as the evaluation metrics. Generally, RRMSE < 10% indicates that the model accuracy is excellent, 10% < RRMSE < 30% indicates that the model accuracy is good, and RRMSE > 30% indicates that the model accuracy is poor.
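The three metrics can be computed as in the sketch below. The formulas are the commonly used definitions and are assumed here because they are not written out in the text: R² is taken as one minus the ratio of the residual to the total sum of squares, and RRMSE as the RMSE divided by the mean of the observed values, expressed in percent. The function name and the example values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return (R2, RMSE, RRMSE in %) for paired observations."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    rrmse = 100.0 * rmse / mean_obs
    return r2, rmse, rrmse


# Three hypothetical SWC values (%) and their predictions.
r2, rmse, rrmse = regression_metrics([20.0, 25.0, 30.0], [21.0, 25.0, 29.0])
```

For this toy case the residual sum of squares is 2 and the total sum of squares is 50, so R² is 0.96, the RMSE is about 0.82 percentage points, and the RRMSE is about 3.3%.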
The Unscrambler X 10.4 software was used for spectral preprocessing (SG smoothing). MATLAB 2020a was adopted for feature extraction and model building. Excel 2021 was employed for data analysis and scientific drawing.
4. Discussion
Initially, it was believed that the soil spectral reflectance declined with increasing SWC due to the absorption effect of water on the spectrum [12,35]. However, subsequent studies [14] have shown that when the SWC exceeds θf, a water film forms on the surface of soil particles, resulting in specular reflection and causing the soil spectral reflectance to increase with increasing SWC. Previous studies tended to keep the SWC of the measured samples below θf when establishing a prediction model of SWC based on hyperspectral reflectance, or possibly did not distinguish whether the SWC was higher or lower than θf. However, because the soil reflectance first decreases and then increases with increasing SWC, with θf as the threshold, the full-spectrum PLSR model built on all samples in this paper performed poorly on the super-θf samples, with an accuracy much lower than that for the sub-θf samples. Therefore, this paper modeled samples with water content below θf and above θf separately to improve the accuracy of the SWC prediction model based on the hyperspectral data.
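The reasoning behind the separate models can be illustrated with a synthetic example. The sketch assumes an idealized V-shaped relationship in which reflectance falls linearly until θf and rises linearly afterwards; the slope, the reflectance values, and the function names are hypothetical, and only the θf value of 31.63% is taken from Section 2.3. A single linear model that inverts SWC from reflectance is compared with two models fitted on either side of θf.

```python
import math

THETA_F = 31.63  # field capacity from Section 2.3 (% mass water content)


def fit_line(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def rmse_of_fit(x, y):
    """RMSE of the least-squares line predicting y from x."""
    slope, intercept = fit_line(x, y)
    return math.sqrt(sum((b - (slope * a + intercept)) ** 2
                         for a, b in zip(x, y)) / len(x))


# Synthetic samples: SWC from 16% to 46% and an idealized reflectance.
swc = [float(s) for s in range(16, 47, 2)]
reflectance = [0.25 + 0.01 * abs(s - THETA_F) for s in swc]

# One model for all samples versus separate sub- and super-theta_f models.
rmse_single = rmse_of_fit(reflectance, swc)
sub = [(r, s) for r, s in zip(reflectance, swc) if s <= THETA_F]
sup = [(r, s) for r, s in zip(reflectance, swc) if s > THETA_F]
rmse_sub = rmse_of_fit([r for r, _ in sub], [s for _, s in sub])
rmse_super = rmse_of_fit([r for r, _ in sup], [s for _, s in sup])
```

In this idealized case the same reflectance value corresponds to two different water contents, so the single model cannot invert it and its error is close to the spread of the SWC itself, whereas each separate model recovers the SWC almost exactly.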
The feature variables extracted by CARS–Rfrog in this paper were mainly concentrated in the near-infrared (NIR) region, with almost none in the visible band (except for the 366–370 nm range). This is consistent with the findings of previous studies [36,37]. The NIR spectrum is generated by transitions between the vibrational and rotational energy levels of molecules. When the vibration and rotation of a molecule jump from the ground state or a low-energy level to a higher-energy state, the molecule absorbs a certain amount of infrared energy from the incident electromagnetic radiation. In the mid-infrared region, fundamental frequency absorption occurs, while in the NIR region, combination and overtone absorption occur. There are three fundamental frequencies of water molecules in the near-infrared band [38]. Therefore, the NIR region can reflect changes in soil moisture.
In this study, secondary extraction (CARS–Rfrog) was applied to extract feature variables from the hyperspectral reflectance. First, CARS was used for the initial screening of the feature variables. However, the CARS extraction results were random [20], and the feature variables extracted by a single method were numerous, making the model too complex.
Therefore, the feature variables extracted by CARS were subjected to a second extraction using the Rfrog method to obtain the variables with the least redundant information. By retaining highly correlated bands and re-extracting the feature variables to reduce their dimensionality, the model was simplified. The Rfrog method applied the partial least squares linear discriminant analysis to construct the classifier and combined the strong ability of synergy interval partial least squares (SiPLS) to handle highly correlated data [32].
The Rfrog method combines the ideas of the memetic algorithm and the particle swarm optimization algorithm, so it has the characteristics of survival of the fittest and random search. It also takes advantage of CARS to simplify the complexity of wavelength selection. Through secondary feature extraction, the number of bands in the modeled data was greatly reduced. Rfrog reduced the number of variables of the super-θf and sub-θf samples extracted using CARS from 83 and 63 to 25 and 18, respectively. CARS–Rfrog minimized the redundant information and achieved the effect of data dimensionality reduction.
Among the three machine learning models, the BPANN model had the highest accuracy (Table 3 and Table 4) and could deliver a better prediction of SWC. This might be because the SVM is more suitable for modeling with fewer samples, while the ELM tends to have low and unstable prediction accuracy when dealing with the quantitative analysis of complex samples [39]. However, the ELM learns quickly and has a strong generalization ability, so it is frequently used in scenarios that require real-time computing. The BPANN is expressive and simple, and theory has demonstrated that a three-layer neural network can approximate a nonlinear continuous function with arbitrary accuracy, which makes it possible to solve complex nonlinear problems with internal mechanisms. However, its generalization ability is slightly inferior, and it easily falls into locally optimal solutions [40]; subsequent studies have attempted to apply optimization algorithms to the BPANN to obtain a model with better performance. Therefore, CARS–Rfrog–BPANN is recommended as a prediction model for the SWC of red soil.
5. Conclusions
In this study, soil samples were prepared in the laboratory, and the hyperspectral reflectance was acquired outdoors. The samples were divided into two parts (sub-θf and super-θf) with θf as a threshold to obtain a more accurate SWC prediction model. The outliers were detected using MCCV, the spectral feature variables were extracted using a secondary extraction method (CARS–Rfrog), and the prediction model of SWC was established using machine learning methods. We draw the following conclusions: (1) The full-spectrum PLSR model performed poorly for samples with water content above θf, indicating that using the same model for the simultaneous inversion of SWC above and below θf led to poor inversion accuracy for samples above θf. (2) By combining CARS and Rfrog for the extraction of the feature variables of soil reflectance, the numbers of feature wavelengths of the sub-θf and super-θf samples extracted by CARS–Rfrog were 25 and 18, respectively; they were widely distributed in the NIR range and represented a significant reduction in comparison to the full spectrum. (3) Among the machine learning methods, the BPANN achieved the best prediction results: the R²p, RMSEp, and RRMSE of the sub-θf samples were 0.941, 1.570%, and 6.685%, respectively, and those of the super-θf samples were 0.764, 1.479%, and 4.205%, respectively.