Comparative Analysis of Machine-Learning Models for Soil Moisture Estimation Using High-Resolution Remote-Sensing Data

Li, Ming; Yan, Yueguan

doi:10.3390/land13081331


by Ming Li and Yueguan Yan *

College of Geoscience and Surveying Engineering, China University of Mining & Technology (Beijing), Beijing 100083, China

* Author to whom correspondence should be addressed.

Land 2024, 13(8), 1331; https://doi.org/10.3390/land13081331

Submission received: 27 July 2024 / Revised: 14 August 2024 / Accepted: 21 August 2024 / Published: 22 August 2024


Abstract


Soil moisture is an important component of the hydrologic cycle and ecosystem functioning, and it has a significant impact on agricultural production, climate change and natural disasters. Despite the availability of machine-learning techniques for estimating soil moisture from high-resolution remote-sensing imagery, including synthetic aperture radar (SAR) data and optical remote sensing, comprehensive comparative studies of these techniques remain limited. This paper addresses this gap by systematically comparing the performance of four tree-based ensemble-learning models (random forest (RF), extreme gradient boosting (XGBoost), light gradient-boosting machine (LightGBM), and category boosting (CatBoost)) and three deep-learning models (deep neural network (DNN), convolutional neural network (CNN), and gated recurrent unit (GRU)) in terms of soil moisture estimation. Additionally, we introduce and evaluate the effectiveness of four different stacking methods for model fusion, an approach that is relatively novel in this context. Moreover, Sentinel-1 C-band dual-polarization SAR and Sentinel-2 multispectral data, as well as NASADEM and geographical code and temporal code features, are used as input variables to retrieve the soil moisture in the ShanDian River Basin in China. Our findings reveal that the tree-based ensemble-learning models outperform the deep-learning models, with LightGBM being the best individual model, while the stacking approach can further enhance the accuracy and robustness of soil moisture estimation. Moreover, the stacking all boosting classes ensemble-learning model (SABM), which integrates only boosting-type models, demonstrates superior accuracy and robustness in soil moisture estimation. The SHAP value analysis reveals that ensemble learning can utilize more complex features than deep learning. 
This study provides an effective method for retrieving soil moisture using machine-learning and high-resolution remote-sensing data, demonstrating the application value of SAR data and high-resolution optical remote-sensing data in soil moisture monitoring.

Keywords:

traditional machine learning; deep learning; stacking; soil moisture; Sentinel-1; Sentinel-2

1. Introduction

Soil moisture plays a crucial role in climate dynamics, affecting the exchanges of water, energy, and carbon between the land and the atmosphere [1,2]. It serves as a crucial indicator in hydrology, agriculture, and meteorology, impacting weather forecasting, drought and flood warnings, and crop management [3,4]. Accurate soil moisture monitoring is essential for understanding and predicting climate change.

Synthetic aperture radar (SAR) estimates soil moisture from microwave backscatter and offers all-weather, day-and-night operation that is independent of cloud cover. However, SAR is susceptible to interference from factors such as terrain, vegetation, and coherent speckle, leading to inconsistent inversion results [5,6]. To mitigate these interferences, SAR data can be integrated with optical remote-sensing data, which provide spectral information on soil and vegetation, helping to reduce vegetation effects and improve the accuracy of soil moisture estimation. The combined use of SAR and optical remote sensing for soil moisture estimation has become a widely adopted approach [7,8,9], especially in regions prone to cloud cover and fog. Integrating SAR data from Sentinel-1 with multi-temporal optical data from Sentinel-2, both launched by the European Space Agency, fuses microwave and optical observations and enables soil moisture estimation at high spatial and temporal resolution [10,11,12,13].

Numerous models have been developed to describe the relationship between radar backscatter and soil moisture. These include empirical and semi-empirical models like the Dubois [14] and Oh [15] models, as well as theoretical models such as the integral equation model (IEM) [16] and its improved version (AIEM) [17]. While these models have shown effectiveness in bare or sparsely vegetated areas, they often struggle in regions with dense vegetation. To address this, vegetation scattering models like the Michigan microwave canopy scattering model [18] and the water cloud model [19] have been developed to separate the contributions of soil and vegetation in SAR signals [20,21]. Despite these advancements, traditional models still rely on numerous assumptions and parameters, which can lead to inaccuracies in complex environments.

Machine learning, with its ability to automatically learn from large datasets and capture nonlinear relationships, has emerged as a powerful alternative for soil moisture estimation [22,23]. In previous studies, traditional machine-learning models have been widely used for soil moisture estimation, particularly tree-based ensemble-learning models [22,23]. Ågren et al. [24] proposed a method based on multiple LIDAR-derived digital terrain indices and machine learning to estimate the surface soil moisture of forested landscapes on a national scale at a spatial resolution of 2 m, utilizing data from about 20,000 field observation sites in Sweden. Nguyen et al. [25] introduced a cost-effective method to accurately predict soil moisture at a 10 m spatial resolution in an Australian study area using multi-source remote-sensing data (including Sentinel-1, Sentinel-2, and ALOS) and XGBoost. Wang and Gao [26] used Sentinel-1, Sentinel-2, a water cloud model (WCM), and ensemble-learning algorithms (RF and AdaBoost) to invert the surface soil moisture in an agricultural region. Greifeneder et al. [23] used Landsat-8 optical and thermal infrared imagery, Sentinel-1 C-band SAR imagery, and simulated data in conjunction with a gradient-boosted tree regression model to invert the surface soil moisture on a global scale.

In recent years, deep learning has also been gradually applied to soil moisture estimation using remote-sensing data. Deep learning can automatically extract advanced features from raw data, capture complex nonlinear relationships, and handle large-scale, high-dimensional data. Shallow artificial neural networks (ANNs) have been more widely used in previous soil moisture estimation studies, but they have been shown to be inferior to traditional machine learning [27,28,29]. To improve the accuracy and robustness of soil moisture inversion, some studies have explored deeper neural network models, such as convolutional neural networks (CNNs), whose feature extraction and representation capabilities allow them to learn soil moisture-related information directly from SAR signals or images. Hegazi et al. [30] developed a CNN-based method to estimate the soil moisture content in agricultural areas from Sentinel-1 SAR images. Guo et al. [31] proposed a method using a convolutional neural network regression (CNNR) model with ultra-wideband (UWB) radar and multispectral remote-sensing data, combined with various scattering models, to estimate the surface soil moisture of the winter wheat cultivation area of the Guanzhong plain in China. R. Wang et al. [32] introduced a technique based on SSA-CNN for retrieving the surface soil moisture in agricultural regions, which integrates Sentinel-1 SAR imagery with Sentinel-2 multispectral data and a suite of scattering models. By contrast, more sophisticated neural networks, including the long short-term memory (LSTM) architecture, have rarely been applied to soil moisture estimation.

Different machine-learning methods have different data utilization mechanisms; therefore, hybrid structures have become an important research area. Some studies have shown that a combination of multiple models often performs better than a single model [33]. One common approach is the stacking method, which can achieve a more accurate estimation of soil moisture by taking the outputs of multiple base models as new inputs and then fusing them through a meta-model. However, previous studies usually stacked models of the same type. For example, Das et al. [34] stacked a gradient-boosting machine, RF, and Cubist, which are all tree-based models, and S. Wang et al. [35] stacked four tree-based models, namely classification and regression tree, RF, gradient-boosting decision tree (GBDT), and extremely randomized trees. Few studies have explored whether the fusion of ensemble-learning models and deep-learning models is effective in soil moisture estimation. In addition, few systematic comparative studies have analyzed whether traditional machine learning or deep learning is more effective for soil moisture estimation.

In this study, we address these gaps by systematically comparing the performance of four popular tree-based ensemble-learning models (RF, XGBoost, LightGBM, and CatBoost) and three deep-learning models (DNN, CNN, and GRU) for soil moisture estimation. Additionally, we introduce and evaluate four different fusion methods: stacking all boosting classes ensemble-learning model (SABM), stacking all boosting and bagging classes ensemble-learning model (SAEM), stacking all deep-learning models (SADM), and stacking ensemble-learning and deep-learning models (SAM). This study aims to provide a comprehensive analysis of the effectiveness of different machine-learning approaches for soil moisture estimation and to offer insights into the selection and combination of models for improved accuracy and robustness. The specific objectives of this study are as follows:

(1): To compare the performance of different machine-learning models for estimating soil moisture;
(2): To assess whether the newly proposed stacking method can accurately estimate soil moisture;
(3): To evaluate the contribution of environmental variables to soil moisture inversion.

2. Dataset

2.1. Study Area and In Situ Soil Moisture Dataset

The ShanDian River, which emerges at the Hebei–Inner Mongolia border and encompasses a basin of roughly 12,700 km², is located in a region that experiences a temperate continental climate, with a significant precipitation period from July to September, averaging 300 to 500 mm annually [36]. The primary land uses in the area include grasslands, farmlands, and forests, where crops such as oats, potatoes, carrots, pasture, and maize are cultivated.

In 2018, the Soil Moisture and Energy Balance Remote Sensing Experiment (SMELR) was initiated in the ShanDian River Basin, establishing a wireless sensor network (SMN-SDR) to monitor the soil temperature and moisture [36]. This network comprised 34 strategically placed stations, covering the large (100 km), medium (50 km), and small (10 km) scales (Figure 1). Utilizing Decagon 5TM sensors, these stations recorded the soil moisture and temperature at depths of 3 cm, 5 cm, 10 cm, 20 cm, and 50 cm, with 20 stations additionally equipped with HOBO rain gauges. Data were collected at 10–15 min intervals and transmitted to a central server in real time. Field measurements of the elevation, soil texture, and land cover were also conducted. For the sensor calibration, soil samples were collected using an auger, and the soil moisture and bulk density were determined through oven-drying. The dataset, spanning from July 2018 to December 2020, is accessible via the National Tibetan Plateau Data Center (https://data.tpdc.ac.cn/home, accessed on 18 August 2024). Given the penetration ability of Sentinel-1 (C-band) radar, the 5 cm soil moisture data were selected for validation purposes in this study.

2.2. Remote-Sensing Dataset

2.2.1. Sentinel-1

The Sentinel-1 series, which includes Sentinel-1A and Sentinel-1B, is outfitted with C-band synthetic aperture radar (SAR) technology. Launched in April 2014 and April 2016, these satellites operate on a 12-day revisit cycle, effectively halving to a 6-day cycle when combined. They offer a diverse range of imaging capabilities, including stripmap, interferometric wide swath, extra wide swath, and wave modes, each tailored to specific resolution and coverage needs. This study used the ground range detected (GRD) products of the interferometric wide swath mode of Sentinel-1A/B, which have VV and VH dual-polarization modes and a spatial resolution of 5 m × 20 m.

Sentinel-1 data were sourced from the Google Earth Engine (GEE) platform, where they had already been preprocessed with radiometric calibration, speckle noise reduction, and geometric correction. Given the spatial variability of soil moisture, a 10 m radius buffer zone was created around each station, and the VV and VH backscattering coefficients and the local incidence angle within the buffer were extracted and averaged for model development.

2.2.2. Sentinel-2

Sentinel-2 is a high-resolution optical mission carrying a multispectral imager (MSI) that covers the visible, near-infrared, and shortwave-infrared regions of the electromagnetic spectrum. Similar to Sentinel-1, there are two satellites: 2A and 2B. Each satellite has a revisit cycle of 10 days, giving a combined revisit cycle of 5 days.

This study selected the bands with spatial resolutions of 10 m and 20 m, including the red, green, blue, near-infrared, and shortwave infrared bands. The data were downloaded from the GEE platform, and the average band values were calculated from the pixels within the 10 m × 10 m area of interest around each station. The vegetation, soil, and water indices derived from the Sentinel-2 data are shown in Table 1. Because clouds and rain severely degrade the data quality, we selected images with cloud cover of less than 20% and used cubic spline interpolation to fill in the missing values, obtaining daily-scale information for the various indices.
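As an illustration of how such indices are derived from band reflectances, the following minimal sketch computes three normalized-difference indices that are named later in the text (NDVI, GNDVI, and MNDWI) using their standard definitions. Table 1 is not reproduced here, so the exact index set and band choices of the study may differ; the band mapping (B3, B4, B8, B11) and the reflectance values are assumptions made for this example.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or NaN when the denominator is zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else float("nan")


def spectral_indices(green: float, red: float, nir: float, swir1: float) -> dict:
    """Standard normalized-difference indices from surface reflectances.

    Assumed Sentinel-2 band mapping: green = B3, red = B4, nir = B8, swir1 = B11.
    """
    return {
        "NDVI": normalized_difference(nir, red),       # vegetation greenness
        "GNDVI": normalized_difference(nir, green),    # green-band variant of NDVI
        "MNDWI": normalized_difference(green, swir1),  # open water / surface wetness
    }


# Hypothetical mean reflectances for one station footprint.
indices = spectral_indices(green=0.08, red=0.06, nir=0.30, swir1=0.20)
print({name: round(value, 4) for name, value in indices.items()})
```

In practice the same function would be applied to every cloud-screened acquisition before the gaps are interpolated to a daily series.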

2.2.3. Other Dataset

NASADEM, a global digital elevation model (DEM) with a spatial resolution of 30 m that enhances the SRTM DEM, was obtained from the GEE platform. The geographic encoding is obtained by transforming the latitude and longitude into a three-dimensional spherical coordinate system, resulting in three features (gx, gy, gz). The temporal encoding is obtained by mapping the month to two features using sine–cosine coding, which preserves the periodicity and order of the months. For more details on the geographic and temporal encoding, please refer to Yang et al. [46].
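The two encodings can be sketched as follows. This is a minimal illustration that assumes the usual unit-sphere conversion for (gx, gy, gz) and a 12-month period for the sine–cosine pair; the exact formulation in Yang et al. [46] may differ, and the function names and the station coordinates are hypothetical.

```python
import math


def geo_encode(lat_deg: float, lon_deg: float) -> tuple:
    """Map latitude/longitude in degrees to a point (gx, gy, gz) on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    gx = math.cos(lat) * math.cos(lon)
    gy = math.cos(lat) * math.sin(lon)
    gz = math.sin(lat)
    return gx, gy, gz


def month_encode(month: int) -> tuple:
    """Map a month (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2.0 * math.pi * (month - 1) / 12.0
    return math.sin(angle), math.cos(angle)


# Hypothetical station location and observation month.
features = geo_encode(42.0, 116.0) + month_encode(7)
print([round(v, 4) for v in features])
```

With this coding, December and January are as close to each other as January and February, which a plain integer month (12 versus 1) would not express.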

3. Method

The research framework for soil moisture estimation includes four main steps (Figure 2):

(1): Data collection and feature extraction: Remote-sensing indicators, such as the vegetation indices, soil indices, and polarization bands, are derived from the optical (Sentinel-2) and synthetic aperture radar (Sentinel-1) data, and site characterization information, such as the land cover and soil type, is extracted from the field measurement data. The geographical encoding features are calculated from the latitude and longitude, and the temporal encoding features are calculated from the month.
(2): Data splitting and base-learner model training: The data are randomly split into a training set (80%) and a test set (20%). The base-learner models (RF, XGBoost, LightGBM, CatBoost, DNN, CNN, and GRU) are trained on the training set using 5-fold cross-validation, and their hyperparameters are optimized using Bayesian optimization. To ensure reproducibility, the random seed is set to 2023.
(3): Model construction and fusion: The stacking method is applied to combine different base-learner models into new machine-learning models, with ridge regression as the meta-learner model, and the performance of the stacked models is compared with that of the base-learner models.
(4): Model evaluation and analysis: The machine-learning models are evaluated using the R², RMSE, and MAE.

3.1. Machine Learning

In order to compare and evaluate various machine-learning models for soil moisture estimation, we selected seven representative algorithms as baselines, covering both traditional machine-learning methods and deep-learning methods. The traditional methods are random forest [47], XGBoost [48], LightGBM [49], and CatBoost [50]. Random forest is an ensemble method that builds multiple independent decision trees. Each tree is trained on a randomly selected subset of the data, and the final prediction is obtained by averaging the predictions from all the trees, which helps mitigate overfitting and enhances the model’s generalization. XGBoost and LightGBM are both gradient-boosting frameworks that build models iteratively, with each new tree correcting the errors made by the previous ones. XGBoost is known for its speed and performance, while LightGBM is optimized for efficiency, particularly with large datasets. CatBoost is another gradient-boosting algorithm that excels at handling categorical features and incorporates techniques such as ordered boosting to reduce overfitting.

The deep-learning methods are DNNs, one-dimensional convolutional neural networks (1D-CNNs) [51], and gated recurrent units (GRUs) [52]. DNNs consist of multiple layers of neurons that learn hierarchical data representations. 1D-CNNs apply convolutional filters to capture local patterns along a sequence, which makes them suitable for time-series data. GRUs are recurrent neural networks that efficiently handle sequential data by capturing temporal dependencies.

In addition, this study explored an ensemble-learning method, the stacking algorithm [53], which uses a meta-regressor to make the final prediction by combining the predictions of multiple base models.

This study introduces a custom stacking algorithm, illustrated in Figure 3, that uses a two-phase learning approach. (1) The data are divided into a training set and a test set at an 8:2 ratio. k-fold cross-validation is then used to train the base models on the training set and to record their predictions for both sets. (2) The base learners’ predictions on the training set are used as new training data, and their predictions on the test set are used as new test data. A meta-learner is then trained and evaluated on these new datasets.

Because the predictions of different base learners are strongly correlated, a simple linear model is usually chosen as the meta-learner to limit the effect of multicollinearity among its inputs. The meta-learner used in this paper is ridge regression, a linear regression model that shrinks the regression coefficients through regularization and can therefore handle highly correlated independent variables.
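The two-phase procedure can be sketched with NumPy alone. This is a hedged illustration, not the authors' implementation: the base learners here are simple stand-ins (a ridge regressor and a k-nearest-neighbour regressor) rather than the boosting or deep-learning models of the study, the data are synthetic, and averaging the fold models' test predictions is one common convention that is assumed here. All class and function names are hypothetical.

```python
import numpy as np

SEED = 2023  # random seed reported in the text


class Ridge:
    """Closed-form ridge regression with an unpenalized intercept."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        self.x_mean, self.y_mean = X.mean(axis=0), y.mean()
        Xc = X - self.x_mean
        A = Xc.T @ Xc + self.alpha * np.eye(X.shape[1])
        self.coef = np.linalg.solve(A, Xc.T @ (y - self.y_mean))
        return self

    def predict(self, X):
        return (X - self.x_mean) @ self.coef + self.y_mean


class KNN:
    """k-nearest-neighbour regressor, a stand-in for a nonlinear base learner."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=2)
        idx = np.argsort(d, axis=1)[:, : self.k]
        return self.y[idx].mean(axis=1)


def stack(base_factories, X_tr, y_tr, X_te, n_folds=5, seed=SEED):
    """Phase 1: out-of-fold training predictions and fold-averaged test predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X_tr)), n_folds)
    Z_tr = np.zeros((len(X_tr), len(base_factories)))
    Z_te = np.zeros((len(X_te), len(base_factories)))
    for j, make in enumerate(base_factories):
        for val_idx in folds:
            fit_idx = np.setdiff1d(np.arange(len(X_tr)), val_idx)
            model = make().fit(X_tr[fit_idx], y_tr[fit_idx])
            Z_tr[val_idx, j] = model.predict(X_tr[val_idx])  # out-of-fold prediction
            Z_te[:, j] += model.predict(X_te) / n_folds      # average over fold models
    return Z_tr, Z_te


# Synthetic data standing in for the real feature table.
rng = np.random.default_rng(SEED)
X = rng.normal(size=(400, 4))
y = 0.2 + 0.05 * X[:, 0] + 0.03 * np.sin(2 * X[:, 1]) + rng.normal(scale=0.01, size=400)

perm = rng.permutation(len(X))
tr, te = perm[:320], perm[320:]  # 8:2 split
bases = [lambda: Ridge(alpha=1.0), lambda: KNN(k=5)]
Z_tr, Z_te = stack(bases, X[tr], y[tr], X[te])

# Phase 2: ridge meta-learner trained on the base learners' predictions.
meta = Ridge(alpha=1e-3).fit(Z_tr, y[tr])
y_hat = meta.predict(Z_te)
rmse = float(np.sqrt(np.mean((y[te] - y_hat) ** 2)))
print(f"stacked test RMSE: {rmse:.4f}")
```

Using out-of-fold predictions as the meta-learner's training inputs keeps each base learner from being scored on samples it was fitted to, so the meta-learner sees predictions of roughly test-like quality.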

3.2. Hyperparameter Tuning

Selecting appropriate hyperparameters is essential for enhancing the efficacy and precision of a predictive model, and hyperparameter tuning seeks the most effective set of these parameters. One such method is Bayesian optimization, which builds a probabilistic model of the objective function and progressively refines it with feedback from evaluated parameter sets, with the goal of maximizing or minimizing the objective [54]. In this study, we use Optuna, an automatic tuning framework based on a Bayesian optimization algorithm, for the hyperparameter optimization. For the deep-learning models, we set the batch size to 512, use a learning rate scheduler to adapt the learning rate during training, and apply early stopping, which monitors the validation loss and stops the training when the loss has not decreased for 60 epochs. All the experiments were conducted in a Google Colab environment equipped with a T4 GPU and 12.68 GB of RAM. The hyperparameter tuning ranges of the models are shown in Table 2.
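The early-stopping rule described above can be sketched independently of any deep-learning framework. The function below is a hypothetical helper that only tracks a stream of validation losses with a patience of 60 epochs; in a real run each loss would come from evaluating the network after an epoch, and the loss curve used here is invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=60):
    """Return (stop_epoch, best_epoch, best_loss) for per-epoch validation losses.

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs, or when the losses run out.
    """
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = -1
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # improvement: restart the count
        elif epoch - best_epoch >= patience:
            break                                # patience exhausted
    return stop_epoch, best_epoch, best_loss


# Hypothetical loss curve: improves for 100 epochs, then plateaus.
losses = [1.0 / (1 + e) for e in range(100)] + [0.5] * 500
stop, best, loss = early_stopping_epoch(losses, patience=60)
print(stop, best, round(loss, 4))
```

The weights saved at `best_epoch` would normally be restored after stopping, so the extra 60 epochs cost time but not accuracy.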

3.3. Feature Importance Assessment Methods

Machine-learning models have far surpassed traditional statistical models in terms of their accuracy and generalization. However, machine-learning methods lose the interpretability of linear models and are often referred to as black-box models. To address this problem, SHAP (SHapley Additive exPlanations) uses a game-theoretic framework to elucidate the functioning of machine-learning models [55]. The fundamental idea of SHAP is to allocate a Shapley value to each feature based on its contribution to the model’s output, thereby measuring the feature’s importance and influence. This method provides transparency and interpretability for the model decisions by decomposing the model predictions into the contribution parts of each feature. This study used TreeExplainer to explain the predictions of LightGBM, XGBoost, CatBoost, and RF, and KernelExplainer to explain the predictions of DNN, CNN, and GRU.
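To make the allocation of Shapley values concrete, the sketch below computes them exactly for a three-feature toy model by enumerating all feature subsets. It is not the algorithm used by TreeExplainer or KernelExplainer, which rely on much faster exact or approximate schemes; the toy model, the zero baseline, and the convention of replacing absent features with baseline values are assumptions made for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (f(with_i) - f(without_i))
    return phi


# Hypothetical toy model with an interaction between features 1 and 2.
def toy_model(v):
    return 0.30 - 0.10 * v[0] + 0.05 * v[1] * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(p, 4) for p in phi])
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline, and the interaction term is shared equally between the two features involved. A negative value, as for the first feature here, lowers the prediction relative to the baseline.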

3.4. Model Evaluation

This study selected three evaluation metrics to quantify the accuracy of the different models in predicting soil moisture. The evaluation metrics used were the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). The calculation formulas are as follows:

R^{2} = \frac{\sum_{i = 1}^{m} {({\hat{y}}_{i} - \bar{y})}^{2}}{\sum_{i = 1}^{m} {(y_{i} - \bar{y})}^{2}} = 1 - \frac{\sum_{i = 1}^{m} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{m} {(y_{i} - \bar{y})}^{2}}

(1)

\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i = 1}^{m} {(y_{i} - {\hat{y}}_{i})}^{2}}

(2)

\mathrm{MAE} = \frac{1}{m} \sum_{i = 1}^{m} | y_{i} - {\hat{y}}_{i} |

(3)

where y_{i} is the actual soil moisture value, {\hat{y}}_{i} is the simulated soil moisture value, \bar{y} is the mean of the actual soil moisture values, and m is the number of observations.
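The three metrics can be computed directly from their definitions. The sketch below uses the residual (rightmost) form of Equation (1) for R² together with Equations (2) and (3); the four observation and prediction values are invented for illustration.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical soil moisture values in cm³/cm³.
y_true = [0.10, 0.20, 0.30, 0.40]
y_pred = [0.12, 0.18, 0.33, 0.37]
print(round(r2_score(y_true, y_pred), 3),
      round(rmse(y_true, y_pred), 4),
      round(mae(y_true, y_pred), 4))
```

RMSE and MAE carry the units of soil moisture, whereas R² is dimensionless, which is why the results report the first two in cm³/cm³.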

4. Results

4.1. Model Performance

Figure 4 and Table 3 show the performance of the ensemble-learning and deep-learning methods for soil moisture estimation on the test dataset. The stacking methods achieved high accuracy and stability. Ranked by accuracy from high to low, the models are: SABM > SAEM > LightGBM > XGBoost > SAM > CatBoost > RF > SADM > GRU > CNN > DNN. SABM exhibited the highest accuracy, with an R² value of 0.861, an RMSE of 0.025 cm³/cm³, and an MAE of 0.019 cm³/cm³, which indicates the effectiveness of our approach for soil moisture estimation. Notably, stacking models of the same type yielded a clear improvement: SABM increased the R² value of the boosting methods from 0.843 to 0.861, while SADM raised the R² value of the deep-learning methods from 0.765 to 0.804.

Among the individual models, LightGBM emerged as the most accurate model across all the methods, with an R² value of 0.858, an RMSE of 0.025 cm³/cm³, and an MAE of 0.019 cm³/cm³. It was followed closely by XGBoost, CatBoost, and RF. Interestingly, the ensemble-learning methods outperformed the deep-learning methods, with GRU standing out as the best performer among the deep-learning methods, with an R² value of 0.799, an RMSE of 0.026 cm³/cm³, and an MAE of 0.020 cm³/cm³. In contrast, DNN demonstrated the weakest performance among all the models, with an R² value of 0.765, an RMSE of 0.032 cm³/cm³, and an MAE of 0.023 cm³/cm³.

Furthermore, in terms of model runtime, RF was the most efficient method, with execution times consistently under 15 s. In contrast, the stacking approach was the slowest, with an approximate runtime of 8 min. The runtime of a stacking model is primarily determined by its slowest first-level model. Among the boosting methods, LightGBM was the fastest, and among the deep-learning methods, the DNN was the most computationally efficient.

Figure 5 illustrates the variation in model performance across different months. In general, during the months of April, May, and October, all the models exhibited robust performance, with low errors and high R² values above 0.9. However, during the period spanning from June to September, the performance of all the models tended to be relatively subpar, with July, in particular, showing significant performance degradation. This may be because the soil moisture changes dramatically during this period, affected by precipitation, evaporation, irrigation, and other factors [56,57]. These factors may cause the relationship between soil moisture and input features to be unstable or nonlinear, making it difficult for machine-learning models to capture this relationship and thus reducing the prediction accuracy. Additionally, the figure demonstrates that the ensemble-learning models consistently outperformed the deep-learning models across all the months, especially in June, July, and August. It is noteworthy that the soil moisture changes drastically in these months, indicating that the ensemble-learning models can explain the soil moisture variations to a greater extent. Among them, the SABM model is the best or second best in all the months, demonstrating that it can better integrate the advantages of each individual model and improve the prediction accuracy.

4.2. Spatiotemporal Variation of Soil Moisture Prediction Accuracy

Due to the spatiotemporal heterogeneity of soil moisture, the overall accuracy of the model cannot reflect the spatiotemporal variations of the model predictions. Therefore, the model performance was further evaluated for each site. As shown in Figure 6, the soil moisture prediction accuracy was considerable, with 56% of the sites having R² values greater than 0.5. In terms of uncertainty measurement, 85% of the sites had low prediction errors, with an RMSE less than 0.03 cm³/cm³. The sites with better prediction accuracy were mainly concentrated in the small-scale range of the experimental area, which might be related to the relatively low spatial heterogeneity of soil moisture within the small scale. Moreover, the large number of sample points within the small-scale range enabled the machine-learning models to learn more consistent results.

In order to further evaluate the capability of SABM in capturing temporal variations in soil moisture, we selected three representative sites at different spatial scales and presented the results in Figure 7, Figure 8 and Figure 9. The soil moisture estimated by SABM demonstrated good temporal consistency with the in situ measured soil moisture, accurately capturing wet and dry periods. Notably, at these three locations, there was a moderate correlation between SM and the indices MNDWI, B8, and NDVI, respectively, as well as some correlation with VV or VH. The combination of these correlated variables facilitated relatively accurate soil moisture estimation at these sites. Conversely, at locations with lower R² values, the weaker correlation between the radar backscatter signals, vegetation indices, band values, and SM appears to be the primary factor contributing to the reduced estimation accuracy. These sites exhibited high heterogeneity and were mainly characterized by forested and cultivated land cover types. Previous research [58] has shown that vegetation coverage can degrade the estimation of soil moisture using radar backscatter signals, with soil moisture in exposed or low-density vegetation areas being more predictable than in areas with high-density vegetation. In forested areas, the dense vegetation cover influences the surface temperature and NDVI variations, which are more reflective of vegetation characteristics than directly indicative of soil moisture [59]. Additionally, cultivated land is more susceptible to human activities. In addition, the sites with poor R² performance had a common feature, which was a high sand content and low silt and clay content, indicating that these sites had poor soil water retention capacity, and the soil moisture remained at a low level for a long time, resulting in the weak radar scattering signal of soil moisture and thus increasing the inversion difficulty. 
Considering that the site data used in this study were limited and that the land cover type was mainly grassland, the above conclusions may not be generalizable.

To further assess the spatial capabilities of the proposed ensemble model, we applied it to estimate the soil moisture within a small-scale area (10 km × 10 km) within the experimental region in July 2019. As illustrated in Figure 10, the result showed that the soil moisture exhibited obvious spatial distribution differences, with the western region having significantly higher soil moisture content than the eastern region, especially in the middle-western and southeastern crop planting areas. Furthermore, the temporal dynamics of the soil moisture exhibited significant trends. During July, there was an initial increase in the soil moisture content across the entire area, followed by a subsequent decrease, which might be influenced by precipitation, irrigation and evaporation. Of particular significance is the influence of irrigation on the soil moisture content within the crop area. It is evident that the soil moisture levels in the crop area were notably impacted by irrigation activities. Consequently, at both the commencement and conclusion of July, the regional soil moisture content remained relatively low. However, the soil moisture content within the crop area consistently surpassed that of the surrounding regions, underscoring the impact of agricultural practices on soil moisture dynamics.

4.3. Feature Importance

Although the newly developed model can estimate soil moisture well, the interpretability of machine learning is a challenge. It is a long-standing pursuit to understand the complex relationship between input variables and output variables, and to determine which variables are useful for model estimation. Therefore, this section calculates the SHAP values of the input variables for each machine-learning model, aiming to reveal the influence of each variable on soil moisture estimation.

Figure 11 shows the distribution of the SHAP values for different features in the four ensemble-learning models and three deep-learning models. SHAP values indicate the degree and direction of the influence of each feature on the model prediction. A positive SHAP value means that the feature pushes the predicted soil moisture higher, while a negative SHAP value means the opposite. The larger the absolute SHAP value, the stronger the influence of the feature. The colors in the figure represent the feature values, with red indicating high values and blue indicating low values.

It can be seen from Figure 11 that the ensemble-learning models and the deep-learning models rely on different features, which may result in the performance difference between the two types of models in estimating soil moisture. The ensemble-learning model mainly relies on features such as the elevation, geographic encoding, temporal encoding, vegetation index, water index, and radar scattering signals, while the deep-learning model mainly relies on features such as the elevation, temporal encoding, vegetation index, water index, and band information. Elevation is the most important feature in all the models, and it has a negative effect on soil moisture, that is, the higher the elevation, the lower the soil moisture. This is consistent with the phenomenon whereby elevation affects the distribution of precipitation and the amount of runoff, leading to a decrease in soil moisture with increasing elevation. Elevation has also been considered a key factor for soil moisture prediction by previous studies [24,25]. The geographical code and temporal code are secondary features in ensemble-learning models, reflecting the spatiotemporal variation of soil moisture. Geographical code information is not important in deep-learning models, but temporal code information still has a significant impact. Previous studies rarely considered the role of these two types of features in soil moisture estimation, while this study proved that they can help the model capture the spatial and temporal patterns of soil moisture. The vegetation index and water index have a large role in both ensemble-learning models and deep-learning models, especially in deep-learning models, where the GNDVI’s role is second only to elevation. The vegetation index and water index can indirectly reflect the situation of soil moisture, where generally speaking, a higher vegetation cover results in a higher soil water content. 
The radar scattering signal contributes only in the ensemble-learning models, mainly through the VV polarization, and plays no role in the deep-learning models. Band information also contributes, especially in the deep-learning models. It reflects the spectral properties of the soil surface or sublayer and therefore correlates with soil moisture.

5. Discussion

This study evaluated the performance of ensemble-learning and deep-learning methods for soil moisture estimation using diverse remote-sensing data sources and proposed a new stacking model, the SABM. The SABM combined three base learners (XGBoost, LightGBM, and CatBoost) and used linear regression as the meta-learner to achieve high-precision estimation of soil moisture. The experimental results showed that the SABM outperformed the other models in all the tests, demonstrating its effectiveness.

The stacking method is a meta-learning technique that combines multiple base-learner models and uses a meta-learner model to synthesize their outputs. It has demonstrated its superiority in several research areas [53,60,61,62], but it is not a one-size-fits-all approach. The base-learner and meta-learner models must be selected according to the characteristics of the data and the task [34,35]. Previous studies have shown that fusing different types of models can provide more complementary information and improve the prediction accuracy [63]. However, this study found that stacking base learners of the same type improved the accuracy to some extent, whereas stacking base learners of different types improved neither the accuracy nor the bias. The SAM, which fused all the models, did not show the best performance, possibly because of the large performance difference between the neural network models and the ensemble-learning models, which therefore did not complement each other. The accuracy gap between the base learners in a stack should not be too large; otherwise, the effectiveness of the fusion cannot be guaranteed [34]. The choice of the meta-learner also has a significant impact on the final result, because the meta-learner determines how the outputs of the base learners are synthesized. Because the outputs of the base learners are highly correlated, the meta-learner is generally a linear model (ridge regression for regression tasks, logistic regression for classification tasks), and a regularized form such as ridge mitigates the effect of multicollinearity [34]. In addition, a linear model is simple and can effectively balance the weights of the base learners.
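The role of the linear meta-learner can be sketched as follows, under stated assumptions. The block fits a ridge meta-learner on synthetic, highly correlated predictions that stand in for the out-of-fold outputs of three base learners. The data, the noise levels, the `alpha` value, and the helper names are all hypothetical; the small hand-written solver is used only to keep the sketch dependency-free and is not the implementation used in this study.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit_ridge_meta_learner(base_preds, y, alpha):
    """Fit ridge weights and an unpenalized intercept on base-learner predictions."""
    n, k = len(y), len(base_preds[0])
    col_means = [sum(row[j] for row in base_preds) / n for j in range(k)]
    y_mean = sum(y) / n
    centred = [[row[j] - col_means[j] for j in range(k)] for row in base_preds]
    # Ridge normal equations on centred data: (P^T P + alpha * I) w = P^T y
    gram = [[sum(r[i] * r[j] for r in centred) + (alpha if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * (t - y_mean) for r, t in zip(centred, y)) for i in range(k)]
    weights = solve_linear_system(gram, rhs)
    intercept = y_mean - sum(w * m for w, m in zip(weights, col_means))
    return weights, intercept


def rmse(y, pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y))


# Synthetic stand-ins for the out-of-fold predictions of three base learners.
rng = random.Random(0)
truth = [rng.uniform(0.05, 0.45) for _ in range(200)]
noise_sd = [0.020, 0.025, 0.030]  # hypothetical error level of each base learner
base_preds = [[t + rng.gauss(0.0, sd) for sd in noise_sd] for t in truth]

weights, intercept = fit_ridge_meta_learner(base_preds, truth, alpha=0.01)
stacked = [intercept + sum(w * p for w, p in zip(weights, row)) for row in base_preds]

base_rmse = [rmse(truth, [row[j] for row in base_preds]) for j in range(3)]
stacked_rmse = rmse(truth, stacked)
print("base RMSE:", [round(v, 4) for v in base_rmse])
print("stacked RMSE:", round(stacked_rmse, 4), "weights:", [round(w, 3) for w in weights])
```

The synthetic errors here are independent across base learners, which favors stacking; errors of real base learners trained on the same data are typically correlated, so the gain is usually smaller.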

This study also explored the gap between tree-based ensemble learning (RF, XGBoost, LightGBM, and CatBoost) and deep learning in soil moisture estimation. Previous studies have shown that tree models usually outperform deep-learning models on tabular data, possibly because tree models can better handle the complexity and nonlinearity of heterogeneous data [64,65,66]. This study confirmed this conclusion, finding that the ensemble-learning models had higher prediction accuracy than the deep-learning models. In addition, the best-performing model of each type, LightGBM and GRU, was used to map the predicted soil moisture. Figure 12 shows that LightGBM predicted higher soil moisture values than GRU, except in irrigated agricultural areas. The LightGBM map also showed an obvious boundary effect, while the GRU map was relatively smooth. This may be related to the way the two methods handle irregular samples. Previous studies have shown that, compared with tree models, neural networks are biased toward smooth functions and have difficulty fitting irregular samples, which makes their predictions smoother [66,67]. This smoothness may be more consistent with the flow characteristics of soil moisture.

To explore the differences in interpretability between the two methods, this study analyzed the models with the SHAP method. The ensemble-learning models used more types of features, notably the geographic coding features and the radar scattering signals. These features are less prominent in the deep-learning models, possibly because of their tendency to disregard features with weaker correlations. In addition, high correlation among the covariates can affect the interpretability results [68]. For example, the RVI, NDVI, and GNDVI are highly correlated (R > 0.9, see Figure 13), which may explain why the GNDVI is ranked as the most important feature while the other two appear less significant. Interestingly, the model can still achieve satisfactory accuracy when any one or a combination of these factors is omitted. Although a physical explanation for including all these factors is difficult to provide, including them all achieves the best accuracy.

This study also had some limitations and shortcomings. First, this study used limited site data, and the soil cover type was mainly grassland, which may affect the generalization ability and robustness of the model. To improve the adaptability of the model under different environmental conditions, future research can consider using more site data and more types of soil cover. Moreover, applying techniques such as transfer learning could further improve the model’s ability to generalize to new and different contexts. Second, the feature information used in this study mainly came from Sentinel-1 radar data and Sentinel-2 optical data, which may limit the model’s ability to use other potentially relevant information. To increase the model’s explanatory power for soil moisture changes, future research can try to use other data sources, such as meteorological data, terrain factor data, soil property data, etc. In particular, future research can explore the joint inversion method using SAR-derived backscatter and thermal data to improve the accuracy and reliability of soil moisture estimation [69]. Finally, this study did not consider more model fusion methods, such as the voting method, weighting method, Bayesian averaging method, etc. These methods may outperform any single algorithm or algorithm series by diversifying the combination of multiple algorithm series [70].

6. Conclusions

In this study, we proposed an innovative stacking method that integrates multiple machine-learning techniques for soil moisture inversion in the ShanDian River Basin in China, using Sentinel-1A radar scattering signals, Sentinel-2A spectral indices, NASADEM, and additional geographic and temporal features. The main conclusions of this paper are as follows:

(1): The proposed stacking model outperforms the individual models, demonstrating the effectiveness of model fusion in enhancing soil moisture estimation accuracy. Notably, the SABM, which uniquely integrates only boosting-type models, achieved the highest performance, with an R² value of 0.861, an RMSE of 0.025 cm³/cm³, and an MAE of 0.019 cm³/cm³. Among the individual models, the tree-based ensemble-learning models perform better than the deep-learning models, with LightGBM being the best-performing single model and GRU being the best-performing deep-learning model.
(2): This study reveals significant spatiotemporal variability in soil moisture prediction accuracy. In terms of time, all the models perform well in April, May, and October, and poorly from June to September, especially in July, when the performance drops sharply. In terms of space, the sites with high prediction accuracy are mainly concentrated in a small-scale range within the experimental area. The sites with low prediction accuracy are mainly dominated by forest and cropland vegetation types.
(3): The SHAP value analysis provided deeper insights into the feature importance, revealing that ensemble-learning models and deep-learning models rely on different sets of features. The ensemble-learning models mainly rely on features such as the elevation, geographic encoding, temporal encoding, vegetation index, water index, and radar scattering signal, while the deep-learning models mainly rely on features such as the elevation, temporal encoding, vegetation index, water index, and band information.

This study uses multi-source remote-sensing data and ground observation data to construct a high-accuracy soil moisture estimation model, providing an effective technical means for monitoring and managing soil moisture resources.

Author Contributions

Conceptualization, M.L. and Y.Y.; methodology, M.L.; software, M.L.; validation, M.L. and Y.Y.; formal analysis, M.L. and Y.Y.; investigation, M.L.; resources, Y.Y.; data curation, M.L.; writing—original draft preparation, M.L.; writing—review and editing, M.L. and Y.Y.; visualization, M.L.; supervision, Y.Y.; project administration, Y.Y.; funding acquisition, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was primarily supported by the Major Program of the National Natural Science Foundation of China (52394191).

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ahmad, S.; Kalra, A.; Stephen, H. Estimating Soil Moisture Using Remote Sensing Data: A Machine Learning Approach. Adv. Water Resour. 2010, 33, 69–80.
2. Babaeian, E.; Paheding, S.; Siddique, N.; Devabhaktuni, V.K.; Tuller, M. Estimation of Root Zone Soil Moisture from Ground and Remotely Sensed Soil Information with Multisensor Data Fusion and Automated Machine Learning. Remote Sens. Environ. 2021, 260, 112434.
3. Chaudhary, S.K.; Srivastava, P.K.; Gupta, D.K.; Kumar, P.; Prasad, R.; Pandey, D.K.; Das, A.K.; Gupta, M. Machine Learning Algorithms for Soil Moisture Estimation Using Sentinel-1: Model Development and Implementation. Adv. Space Res. 2022, 69, 1799–1812.
4. Long, D.; Bai, L.; Yan, L.; Zhang, C.; Yang, W.; Lei, H.; Quan, J.; Meng, X.; Shi, C. Generation of Spatially Complete and Daily Continuous Surface Soil Moisture of High Spatial Resolution. Remote Sens. Environ. 2019, 233, 111364.
5. Bauer-Marschallinger, B.; Freeman, V.; Cao, S.; Paulik, C.; Schaufler, S.; Stachl, T.; Modanesi, S.; Massari, C.; Ciabatta, L.; Brocca, L.; et al. Toward Global Soil Moisture Monitoring With Sentinel-1: Harnessing Assets and Overcoming Obstacles. IEEE Trans. Geosci. Remote Sens. 2019, 57, 520–539.
6. Paloscia, S.; Pettinato, S.; Santi, E.; Notarnicola, C.; Pasolli, L.; Reppucci, A. Soil Moisture Mapping Using Sentinel-1 Images: Algorithm and Preliminary Validation. Remote Sens. Environ. 2013, 134, 234–248.
7. Amazirh, A.; Merlin, O.; Er-Raki, S.; Gao, Q.; Rivalland, V.; Malbeteau, Y.; Khabba, S.; Escorihuela, M.J. Retrieving Surface Soil Moisture at High Spatio-Temporal Resolution from a Synergy between Sentinel-1 Radar and Landsat Thermal Data: A Study Case over Bare Soil. Remote Sens. Environ. 2018, 211, 321–337.
8. Bao, Y.; Lin, L.; Wu, S.; Kwal Deng, K.A.; Petropoulos, G.P. Surface Soil Moisture Retrievals over Partially Vegetated Areas from the Synergy of Sentinel-1 and Landsat 8 Data Using a Modified Water-Cloud Model. Int. J. Appl. Earth Obs. Geoinf. 2018, 72, 76–85.
9. Wang, Q.; Li, J.; Jin, T.; Chang, X.; Zhu, Y.; Li, Y.; Sun, J.; Li, D. Comparative Analysis of Landsat-8, Sentinel-2, and GF-1 Data for Retrieving Soil Moisture over Wheat Farmlands. Remote Sens. 2020, 12, 2708.
10. Bousbih, S.; Zribi, M.; El Hajj, M.; Baghdadi, N.; Lili-Chabaane, Z.; Gao, Q.; Fanise, P. Soil Moisture and Irrigation Mapping in A Semi-Arid Region, Based on the Synergetic Use of Sentinel-1 and Sentinel-2 Data. Remote Sens. 2018, 10, 1953.
11. Liu, Y.; Qian, J.; Yue, H. Combined Sentinel-1A With Sentinel-2A to Estimate Soil Moisture in Farmland. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 1292–1310.
12. Efremova, N.; Seddik, M.E.A.; Erten, E. Soil Moisture Estimation Using Sentinel-1/-2 Imagery Coupled with CycleGAN for Time-Series Gap Filing. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4705111.
13. Benninga, H.-J.F.; van der Velde, R.; Su, Z. Soil Moisture Content Retrieval over Meadows from Sentinel-1 and Sentinel-2 Data Using Physically Based Scattering Models. Remote Sens. Environ. 2022, 280, 113191.
14. Dubois, P.C.; van Zyl, J.; Engman, T. Measuring Soil Moisture with Imaging Radars. IEEE Trans. Geosci. Remote Sens. 1995, 33, 915–926.
15. Oh, Y.; Sarabandi, K.; Ulaby, F.T. An Empirical Model and an Inversion Technique for Radar Scattering from Bare Soil Surfaces. IEEE Trans. Geosci. Remote Sens. 1992, 30, 370–381.
16. Fung, A.K.; Li, Z.; Chen, K.S. Backscattering from a Randomly Rough Dielectric Surface. IEEE Trans. Geosci. Remote Sens. 1992, 30, 356–369.
17. Wu, T.-D.; Chen, K.-S. A Reappraisal of the Validity of the IEM Model for Backscattering from Rough Surfaces. IEEE Trans. Geosci. Remote Sens. 2004, 42, 743–753.
18. Ulaby, F.T.; Sarabandi, K.; Mcdonald, K.; Whitt, M.; Dobson, M.C. Michigan Microwave Canopy Scattering Model. Int. J. Remote Sens. 1990, 11, 1223–1253.
19. Attema, E.P.W.; Ulaby, F.T. Vegetation Modeled as a Water Cloud. Radio Sci. 1978, 13, 357–364.
20. Balenzano, A.; Mattia, F.; Satalino, G.; Davidson, M.W.J. Dense Temporal Series of C- and L-Band SAR Data for Soil Moisture Retrieval Over Agricultural Crops. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2011, 4, 439–450.
21. Kornelsen, K.C.; Coulibaly, P. Advances in Soil Moisture Retrieval from Synthetic Aperture Radar and Hydrological Applications. J. Hydrol. 2013, 476, 460–489.
22. Solomatine, D.P.; Shrestha, D.L. A Novel Method to Estimate Model Uncertainty Using Machine Learning Techniques. Water Resour. Res. 2009, 45, W00B11.
23. Greifeneder, F.; Notarnicola, C.; Wagner, W. A Machine Learning-Based Approach for Surface Soil Moisture Estimations with Google Earth Engine. Remote Sens. 2021, 13, 2099.
24. Ågren, A.M.; Larson, J.; Paul, S.S.; Laudon, H.; Lidberg, W. Use of Multiple LIDAR-Derived Digital Terrain Indices and Machine Learning for High-Resolution National-Scale Soil Moisture Mapping of the Swedish Forest Landscape. Geoderma 2021, 404, 115280.
25. Nguyen, T.T.; Ngo, H.H.; Guo, W.; Chang, S.W.; Nguyen, D.D.; Nguyen, C.T.; Zhang, J.; Liang, S.; Bui, X.T.; Hoang, N.B. A Low-Cost Approach for Soil Moisture Prediction Using Multi-Sensor Data and Machine Learning Algorithm. Sci. Total Environ. 2022, 833, 155066.
26. Wang, L.; Gao, Y. Soil Moisture Retrieval From Sentinel-1 and Sentinel-2 Data Using Ensemble Learning Over Vegetated Fields. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 1802–1814.
27. Araya, S.N.; Fryjoff-Hung, A.; Anderson, A.; Viers, J.H.; Ghezzehei, T.A. Advances in Soil Moisture Retrieval from Multispectral Remote Sensing Using Unoccupied Aircraft Systems and Machine Learning Techniques. Hydrol. Earth Syst. Sci. 2021, 25, 2739–2758.
28. Liu, Y.; Jing, W.; Wang, Q.; Xia, X. Generating High-Resolution Daily Soil Moisture by Using Spatial Downscaling Techniques: A Comparison of Six Machine Learning Algorithms. Adv. Water Resour. 2020, 141, 103601.
29. Senanayake, I.P.; Yeo, I.-Y.; Walker, J.P.; Willgoose, G.R. Estimating Catchment Scale Soil Moisture at a High Spatial Resolution: Integrating Remote Sensing and Machine Learning. Sci. Total Environ. 2021, 776, 145924.
30. Hegazi, E.H.; Yang, L.; Huang, J. A Convolutional Neural Network Algorithm for Soil Moisture Prediction from Sentinel-1 SAR Images. Remote Sens. 2021, 13, 4964.
31. Guo, J.; Bai, Q.; Guo, W.; Bu, Z.; Zhang, W. Soil Moisture Content Estimation in Winter Wheat Planting Area for Multi-Source Sensing Data Using CNNR. Comput. Electron. Agric. 2022, 193, 106670.
32. Wang, R.; Zhao, J.; Yang, H.; Li, N. Inversion of Soil Moisture on Farmland Areas Based on SSA-CNN Using Multi-Source Remote Sensing Data. Remote Sens. 2023, 15, 2515.
33. Semwal, V.B.; Gupta, A.; Lalwani, P. An Optimized Hybrid Deep Learning Model Using Ensemble Learning Approach for Human Walking Activities Recognition. J. Supercomput. 2021, 77, 12256–12279.
34. Das, B.; Rathore, P.; Roy, D.; Chakraborty, D.; Jatav, R.S.; Sethi, D.; Kumar, P. Comparison of Bagging, Boosting and Stacking Algorithms for Surface Soil Moisture Mapping Using Optical-Thermal-Microwave Remote Sensing Synergies. Catena 2022, 217, 106485.
35. Wang, S.; Wu, Y.; Li, R.; Wang, X. Remote Sensing-Based Retrieval of Soil Moisture Content Using Stacking Ensemble Learning Models. Land Degrad. Dev. 2023, 34, 911–925.
36. Zhao, T.; Shi, J.; Lv, L.; Xu, H.; Chen, D.; Cui, Q.; Jackson, T.J.; Yan, G.; Jia, L.; Chen, L.; et al. Soil Moisture Experiment in the Luan River Supporting New Satellite Mission Opportunities. Remote Sens. Environ. 2020, 240, 111680.
37. Rouse, J.W.; Haas, R.H.; Schell, J.A.; Deering, D.W.; Harlan, J.C. Monitoring the Vernal Advancements and Retrogradation; Texas A & M University: College Station, TX, USA, 1974.
38. Gitelson, A.A.; Kaufman, Y.J.; Merzlyak, M.N. Use of a Green Channel in Remote Sensing of Global Vegetation from EOS-MODIS. Remote Sens. Environ. 1996, 58, 289–298.
39. Xu, H. Modification of Normalised Difference Water Index (NDWI) to Enhance Open Water Features in Remotely Sensed Imagery. Int. J. Remote Sens. 2006, 27, 3025–3033.
40. García, M.J.L.; Caselles, V. Mapping Burns and Natural Reforestation Using Thematic Mapper Data. Geocarto Int. 1991, 6, 31–37.
41. Richardson, A.J.; Wiegand, C.L. Distinguishing Vegetation from Soil Background Information. Photogramm. Eng. Remote Sens. 1977, 43, 1541–1552.
42. Marsett, R.C.; Qi, J.; Heilman, P.; Biedenbender, S.H.; Carolyn Watson, M.; Amer, S.; Weltz, M.; Goodrich, D.; Marsett, R. Remote Sensing for Grassland Management in the Arid Southwest. Rangel. Ecol. Manag. 2006, 59, 530–540.
43. Huete, A.R. A Soil-Adjusted Vegetation Index (SAVI). Remote Sens. Environ. 1988, 25, 295–309.
44. Torres, R.; Snoeij, P.; Geudtner, D.; Bibby, D.; Davidson, M.; Attema, E.; Potin, P.; Rommen, B.; Floury, N.; Brown, M.; et al. GMES Sentinel-1 Mission. Remote Sens. Environ. 2012, 120, 9–24.
45. Attema, E.; Cafforio, C.; Gottwald, M.; Guccione, P.; Guarnieri, A.M.; Rocca, F.; Snoeij, P. Flexible Dynamic Block Adaptive Quantization for Sentinel-1 SAR Missions. IEEE Geosci. Remote Sens. Lett. 2010, 7, 766–770.
46. Yang, N.; Shi, H.; Tang, H.; Yang, X. Geographical and Temporal Encoding for Improving the Estimation of PM2.5 Concentrations in China Using End-to-End Gradient Boosting. Remote Sens. Environ. 2022, 269, 112828.
47. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
48. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM: San Francisco, CA, USA, 2016; pp. 785–794.
49. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Long Beach, CA, USA, 2017; Volume 30, pp. 3146–3154.
50. Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient Boosting with Categorical Features Support. arXiv 2018, arXiv:1810.11363.
51. Malek, S.; Melgani, F.; Bazi, Y. One-Dimensional Convolutional Neural Networks for Spectroscopic Signal Regression. J. Chemom. 2018, 32, e2977.
52. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555.
53. Pavlyshenko, B. Using Stacking Approaches for Machine Learning Models. In Proceedings of the 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), Lviv, Ukraine, 21–25 August 2018; pp. 255–258.
54. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Lake Tahoe, NV, USA, 2012; Volume 25, pp. 2960–2968.
55. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Long Beach, CA, USA, 2017; Volume 30, pp. 4765–4774.
56. Koster, R.D.; Dirmeyer, P.A.; Guo, Z.; Bonan, G.; Chan, E.; Cox, P.; Gordon, C.T.; Kanae, S.; Kowalczyk, E.; Lawrence, D.; et al. Regions of Strong Coupling Between Soil Moisture and Precipitation. Science 2004, 305, 1138–1140.
57. Soulis, K.X.; Elmaloglou, S.; Dercas, N. Investigating the Effects of Soil Moisture Sensors Positioning and Accuracy on Soil Moisture Based Drip Irrigation Scheduling Systems. Agric. Water Manag. 2015, 148, 258–268.
58. Millard, K.; Richardson, M. Quantifying the Relative Contributions of Vegetation and Soil Moisture Conditions to Polarimetric C-Band SAR Response in a Temperate Peatland. Remote Sens. Environ. 2018, 206, 123–138.
59. Przeździecki, K.; Zawadzki, J.J.; Urbaniak, M.; Ziemblińska, K.; Miatkowski, Z. Using Temporal Variability of Land Surface Temperature and Normalized Vegetation Index to Estimate Soil Moisture Condition on Forest Areas by Means of Remote Sensing. Ecol. Indic. 2023, 148, 110088.
60. Chen, J.; Yin, J.; Zang, L.; Zhang, T.; Zhao, M. Stacking Machine Learning Model for Estimating Hourly PM2.5 in China Based on Himawari 8 Aerosol Optical Depth Data. Sci. Total Environ. 2019, 697, 134021.
61. Taghizadeh-Mehrjardi, R.; Schmidt, K.; Amirian-Chakan, A.; Rentschler, T.; Zeraatpisheh, M.; Sarmadian, F.; Valavi, R.; Davatgar, N.; Behrens, T.; Scholten, T. Improving the Spatial Prediction of Soil Organic Carbon Content in Two Contrasting Climatic Regions by Stacking Machine Learning Models and Rescanning Covariate Space. Remote Sens. 2020, 12, 1095.
62. Zounemat-Kermani, M.; Batelaan, O.; Fadaee, M.; Hinkelmann, R. Ensemble Machine Learning Paradigms in Hydrology: A Review. J. Hydrol. 2021, 598, 126266.
63. Shwartz-Ziv, R.; Armon, A. Tabular Data: Deep Learning Is Not All You Need. Inf. Fusion 2022, 81, 84–90.
64. Borisov, V.; Leemann, T.; Seßler, K.; Haug, J.; Pawelczyk, M.; Kasneci, G. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 7499–7519.
65. Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; Babenko, A. Revisiting Deep Learning Models for Tabular Data. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Long Beach, CA, USA, 2021; Volume 34, pp. 18932–18943.
66. Grinsztajn, L.; Oyallon, E.; Varoquaux, G. Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data? Adv. Neural Inf. Process. Syst. 2022, 35, 507–520.
67. Rahaman, N.; Baratin, A.; Arpit, D.; Draxler, F.; Lin, M.; Hamprecht, F.; Bengio, Y.; Courville, A. On the Spectral Bias of Neural Networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5301–5310.
68. Yu, J.; Zheng, W.; Xu, L.; Meng, F.; Li, J.; Zhangzhong, L. TPE-CatBoost: An Adaptive Model for Soil Moisture Spatial Estimation in the Main Maize-Producing Areas of China with Multiple Environment Covariates. J. Hydrol. 2022, 613, 128465.
69. Li, Z.-L.; Leng, P.; Zhou, C.; Chen, K.-S.; Zhou, F.-C.; Shang, G.-F. Soil Moisture Retrieval from Remote Sensing Measurements: Current Knowledge and Directions for the Future. Earth-Sci. Rev. 2021, 218, 103673.
70. McElfresh, D.; Khandagale, S.; Valverde, J.; C, V.P.; Ramakrishnan, G.; Goldblum, M.; White, C. When Do Neural Nets Outperform Boosted Trees on Tabular Data? Available online: https://arxiv.org/abs/2305.02997v1 (accessed on 12 June 2023).

Figure 1. Study area.

Figure 2. Flowchart of this study.

Figure 3. The stacking ensemble-learning method framework used in this study.

Figure 4. Scatter plot of the observed and predicted soil moisture values on the testing set for different machine-learning methods. The probability density is represented by the color of the point.

Figure 5. Evaluating the performance of different machine-learning techniques for soil moisture estimation over various months.

Figure 6. Spatial distribution of the soil moisture assessment indicators using SABM on the test dataset.

Figure 7. Temporal variations in the VV, NDVI, observed and predicted soil moisture at site L2.

Figure 8. Temporal variations in the VV, NDVI, observed and predicted soil moisture at site M1.

Figure 9. Temporal variations in the VV, NDVI, observed and predicted soil moisture at site S7.

Figure 10. Spatial distribution of soil moisture using SABM, with a spatial resolution of 20 m.

Figure 11. Distribution of feature importance based on the SHAP method.

Figure 12. Spatial distribution of soil moisture estimated by (a) LightGBM and (b) GRU on 28 July 2019, with a spatial resolution of 20 m.

Figure 13. Correlation analysis of various indicators with soil moisture.

Table 1. Vegetation, soil and water body indices calculated from Sentinel-2 data.

Covariates	Abbreviation	Reference	Formula
Normalized Difference Vegetation Index	NDVI	[37]	$\frac{\mathrm{NIR} - R}{\mathrm{NIR} + R}$
Green Normalized Difference Vegetation Index	GNDVI	[38]	$\frac{\mathrm{NIR} - G}{\mathrm{NIR} + G}$
Modified Normalized Difference Water Index	MNDWI	[39]	$\frac{G - \mathrm{SWIR}}{G + \mathrm{SWIR}}$
Normalized Burn Ratio Index	NBRI	[40]	$\frac{\mathrm{NIR} - \mathrm{SWIR1}}{\mathrm{NIR} + \mathrm{SWIR1}}$
Ratio Vegetation Index	RVI	[41]	$\frac{R}{\mathrm{NIR}}$
Soil-Adjusted Total Vegetation Index	SATVI	[42]	$\frac{\mathrm{SWIR1} - R}{\mathrm{SWIR1} + R + L} \times (1 + L) - \frac{\mathrm{SWIR2}}{2}$
Soil-Adjusted Vegetation Index	SAVI	[43]	$\frac{\mathrm{NIR} - R}{\mathrm{NIR} + R + L} \times (1 + L)$
Blue band of Sentinel-2	B2	[44]
Green band of Sentinel-2	B3	[44]
Red band of Sentinel-2	B4	[44]
Near-infrared band of Sentinel-2	B8	[44]
Shortwave-infrared 1 band of Sentinel-2	B11	[44]
Shortwave-infrared 2 band of Sentinel-2	B12	[44]
Backscattering coefficients of VH band	VH	[44]
Backscattering coefficients of VV band	VV	[45]
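
The index formulas in Table 1 can be written as a short function, given here as a minimal sketch. The band-to-symbol mapping (G = B3, R = B4, NIR = B8, SWIR1 = B11, SWIR2 = B12) follows the band rows of the table. Two points are assumptions rather than statements from the table: the soil adjustment factor is set to L = 0.5, a commonly used default, and the unnumbered SWIR term in the MNDWI is taken to be SWIR1. The reflectance values in the example are hypothetical.

```python
def spectral_indices(b3, b4, b8, b11, b12, soil_factor=0.5):
    """Compute the Table 1 indices from Sentinel-2 surface reflectances.

    b3, b4, b8, b11, b12: green, red, near-infrared, SWIR1 and SWIR2 reflectance.
    soil_factor: the L term of SAVI and SATVI (0.5 is an assumed default).
    """
    g, r, nir, swir1, swir2 = b3, b4, b8, b11, b12
    return {
        "NDVI": (nir - r) / (nir + r),
        "GNDVI": (nir - g) / (nir + g),
        # Assumption: the SWIR term of the MNDWI is the SWIR1 band (B11).
        "MNDWI": (g - swir1) / (g + swir1),
        "NBRI": (nir - swir1) / (nir + swir1),
        "RVI": r / nir,
        "SATVI": (swir1 - r) / (swir1 + r + soil_factor) * (1 + soil_factor) - swir2 / 2,
        "SAVI": (nir - r) / (nir + r + soil_factor) * (1 + soil_factor),
    }


# Hypothetical reflectances of a vegetated pixel.
indices = spectral_indices(b3=0.08, b4=0.06, b8=0.30, b11=0.20, b12=0.12)
for name, value in indices.items():
    print(f"{name}: {value:.4f}")
```

The function does not guard against zero denominators, which can occur over masked or no-data pixels; a production workflow would need to handle those cases.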

Table 2. Hyperparameters of the models in the experiment.

Model	Hyperparameter	Range
RF	n_estimators	[2–256]
	max_depth	[2–128]
	max_features	[1–max_feature]
XGBoost	n_estimators	[5000–10,000]
	colsample_bytree	[0.4–1.0]
	learning_rate	log ([0.01–0.3])
	lambda	log ([1e-5–1])
	alpha	log ([1e-5–1])
	subsample	[0.4–1.0]
LightGBM	n_estimators	[5000–10,000]
	num_leaves	[64–512]
	colsample_bytree	[0.4–1.0]
	learning_rate	log ([0.01–0.3])
	lambda_l1	log ([1e-8–10])
	lambda_l2	log ([1e-8–10])
	subsample	[0.4–1.0]
	min_child_samples	[1–20]
CatBoost	n_estimators	[5000–10,000]
	max_depth	[5–16]
	learning_rate	log ([0.01–0.3])
	bootstrap_type	Bernoulli
	l2_leaf_reg	log ([1e-5–1])
	max_depth	[0.4–1.0]
	min_data_in_leaf	[1–300]
	subsample	[0.4–1.0]
	max_bin	[200–400]
DNN	hidden_layers	[1–5]
	dense_units	[64–512]
	dropout_rate	[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
CNN	num_cnn_layers	[1–5]
	filters	[64–512]
	dense_units	[64–512]
	regularization_rate	[1e-5–1e-1]
	dropout_rate	[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
GRU	num_gru_layers	[1–5]
	gru_units	[64–512]
	dense_units	[64–512]
	regularization_rate	[1e-5–1e-1]
	dropout_rate	[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
Ridge	alpha	[1e-3–1e3]
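
The search ranges in Table 2 can be encoded as a small search space, sketched below for the LightGBM rows. The entries marked "log" are read here as ranges sampled on a logarithmic scale, which is an assumption about the notation. Plain random sampling is used only as a stand-in to show how such a space is drawn from; it is not the tuning procedure of this study, and the names `LIGHTGBM_SPACE`, `SAMPLERS`, and `sample_config` are hypothetical.

```python
import math
import random


def sample_int(rng, low, high):
    # Integer drawn uniformly from [low, high], both ends included.
    return rng.randint(low, high)


def sample_uniform(rng, low, high):
    return rng.uniform(low, high)


def sample_log_uniform(rng, low, high):
    # Uniform in log space, so every decade of the range is equally likely.
    return math.exp(rng.uniform(math.log(low), math.log(high)))


SAMPLERS = {"int": sample_int, "uniform": sample_uniform, "log": sample_log_uniform}

# LightGBM rows of Table 2: (sampling scale, lower bound, upper bound).
LIGHTGBM_SPACE = {
    "n_estimators": ("int", 5000, 10000),
    "num_leaves": ("int", 64, 512),
    "colsample_bytree": ("uniform", 0.4, 1.0),
    "learning_rate": ("log", 0.01, 0.3),
    "lambda_l1": ("log", 1e-8, 10.0),
    "lambda_l2": ("log", 1e-8, 10.0),
    "subsample": ("uniform", 0.4, 1.0),
    "min_child_samples": ("int", 1, 20),
}


def sample_config(space, rng):
    """Draw one hyperparameter configuration from the search space."""
    return {name: SAMPLERS[kind](rng, low, high)
            for name, (kind, low, high) in space.items()}


rng = random.Random(42)
configs = [sample_config(LIGHTGBM_SPACE, rng) for _ in range(100)]
print(configs[0])
```

Sampling the regularization terms on a logarithmic scale matters because their ranges span many orders of magnitude; a uniform draw from 1e-8 to 10 would almost never produce values below 0.1.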

Table 3. Performance of different methods for soil moisture estimation.

Method	Method Type	R²	RMSE (cm³/cm³)	MAE (cm³/cm³)	Run Time (s)
LightGBM	Ensemble method (boosting)	0.858	0.025	0.019	76
XGBoost	Ensemble method (boosting)	0.855	0.025	0.019	112
CatBoost	Ensemble method (boosting)	0.843	0.027	0.021	471
RF	Ensemble method (bagging)	0.839	0.027	0.021	14
DNN	DL	0.765	0.032	0.023	58
CNN	DL	0.783	0.031	0.022	59
GRU	DL	0.799	0.030	0.022	73
SABM	Ensemble method (stacking)	0.861	0.025	0.019	472
SAEM	Ensemble method (stacking)	0.859	0.025	0.019	472
SADM	Ensemble method (stacking)	0.804	0.030	0.021	74
SAM	Ensemble method (stacking)	0.845	0.026	0.020	472

SABM refers to the use of boosting methods as base learners; SAEM refers to the use of ensemble methods as base learners; SADM refers to the use of deep-learning methods as base learners; SAM refers to the use of a combination of machine-learning methods as base learners.
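
The three accuracy metrics reported in Table 3 can be computed as in the following minimal sketch. It assumes that R² is the coefficient of determination, 1 − SS_res/SS_tot; the exact definitions used in this study are given in its model evaluation section and may differ (for example, a squared correlation coefficient). The observed and predicted values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R², RMSE and MAE for paired observed and predicted values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Hypothetical volumetric soil moisture values in cm³/cm³.
observed = [0.10, 0.15, 0.20, 0.25, 0.30]
predicted = [0.12, 0.14, 0.21, 0.23, 0.31]
metrics = regression_metrics(observed, predicted)
print({name: round(value, 4) for name, value in metrics.items()})
```

The RMSE and MAE carry the units of soil moisture (cm³/cm³), whereas R² is dimensionless. Because the RMSE squares the residuals, it is never smaller than the MAE and penalizes large errors more strongly.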

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

