1. Introduction
Salts accumulate in the soil through groundwater-associated salinity, non-groundwater-associated salinity, and irrigation-induced salinity, resulting in soil salinization. This phenomenon can have significant impacts on agricultural production, environmental health, and regional economies [1]. In China, saline soils account for about 4.88% of the country’s total usable land area and are primarily found in arid, semi-arid, and semi-humid regions [2]. The Yellow River Delta, situated in Dongying City, Shandong Province, is China’s best-preserved, largest, and youngest wetland ecosystem within the warm-temperate zone. The area experiences low precipitation and high evaporation, leading to the accumulation of salts on the soil surface. Its proximity to the Bohai Sea facilitates seawater intrusion, which is exacerbated by low-lying topography and inadequate drainage and results in high soil salinity. Over time, owing to the combined effects of climate and seawater, together with limited knowledge of saline land management and outdated technology, more than half of the land in the Yellow River Delta has become salinized, forming the current semi-humid saline area [3]. Accurate monitoring of soil salinity and amelioration of salinized soil have therefore become urgent needs.
Before the adoption of remote sensing, soil salinity was monitored through time-consuming and labor-intensive traditional methods such as gravimetric analysis and electrical conductivity measurements, which limited salinity assessment over large areas [4,5]. Remote sensing technology offers wide-range monitoring with high spatial and temporal resolution and provides immediate, cost-effective data, which play a crucial role in monitoring soil salinity content [6]. Soil salinity can be effectively characterized by band indices [7]. Subsequently, researchers have incorporated other spectral indices as modeling variables in salinity inversion studies, yielding favorable outcomes [8,9,10]. Salt inversion models can be broadly classified into two categories: linear fitting models and nonlinear models. Linear models mainly include the multivariate spline autoregressive model (MSA) [11], multivariate linear regression model (MLR) [12], exponential fitting model (EF) [13], partial least squares regression model (PLSR) [14], and so on. Nonlinear models primarily encompass the back-propagation neural network (BPNN) model [15], support vector machine model (SVM) [16], random forest model (RF) [17], extreme learning machine model (ELM) [18], and other machine learning models.
When machine learning models are used for regression analysis, researchers often encounter a significant challenge: the models may contain numerous parameters, and the initial settings of these parameters critically influence the model’s final performance. Traditionally, these parameters are initialized randomly; however, random initialization may lead the model to converge to a local minimum during training instead of the global optimum, which limits its predictive capability [19]. Intelligent optimization algorithms have proven beneficial in addressing this challenge. Genetic algorithms (GA), simulated annealing, and particle swarm optimization (PSO) are capable of performing global searches within the parameter space. During the model’s iterative training, the optimization algorithm repeatedly evaluates candidate parameters through a fitness value and refines them, thereby reducing prediction errors and improving accuracy [20]. The integration of intelligent optimization algorithms with traditional machine learning models has produced excellent outcomes in regression prediction research, enhancing both the predictive accuracy and the generalizability of the models across various domains. Integrated models such as the particle swarm optimization–extreme learning machine (PSO-ELM) [21], bat optimization algorithm–extreme learning machine (BOA-ELM) [22], estimation distribution algorithm–extreme learning machine (EDA-ELM) [23], genetic algorithm–support vector machine (GA-SVM) [24], and whale optimization algorithm–random forest (WOA-RF) [25] incorporate intelligent optimization algorithms into machine learning models to optimize parameter selection and enhance model performance. These methods have demonstrated efficacy in various fields, including financial market forecasting, bioinformatics, environmental monitoring, and energy consumption prediction. Such integrated models are also widely used in soil salt content (SSC) inversion studies. Zhao Wenju and his team enhanced the BPNN model using PSO, the mind evolutionary algorithm (MEA), and GA, selecting the western corridor of China’s Taolai River Basin as their study area. The accuracies of the PSO-BPNN, MEA-BPNN, and GA-BPNN models surpassed those of the standalone BPNN model, with the GA-BPNN model emerging as the most effective salinity inversion model, achieving an R² of 0.6659 [26]. Yang Lianbing and his colleagues employed the GA and the Bayesian optimization algorithm (BOA) to optimize subsets of inversion parameters and RF model parameters, respectively, subsequently constructing GA-RFR and BOA-RFR salt inversion models. Results indicated that the BOA-RFR model achieved the highest predictive accuracy [27]. To enhance the predictive performance of the SVM, Xiaohong Zhou and his colleagues employed PSO, gray wolf optimization (GWO), and differential evolution (DE) algorithms for SVM parameter optimization, developing PSO-SVM, GWO-SVM, and DE-SVM models for SSC inversion in the Ebinur Lake Wetland National Nature Reserve (ELWNNR) area. Ultimately, the DE-SVM model demonstrated superior performance, with an R² value of 0.56. Using this model, the authors mapped soil salinity in the ELWNNR area for August 2018 and May 2019 [28]. The primary strength of this methodology lies in its systematic exploration of the parameter space, without depending solely on randomness or unidirectional gradient descent. During the iterative process, the algorithm enables the model to circumvent local optima and progressively advance towards a solution that more closely approximates the global optimum. It has been shown that integrating intelligent optimization algorithms into the training of machine learning models can significantly improve their performance on regression tasks, in terms of both prediction accuracy and generalization ability [29].
The crayfish optimization algorithm (COA) [30], proposed in 2023, simulates the heat avoidance, competition, and foraging behaviors of crayfish at varying environmental temperatures. It offers fast search speed, strong search ability, and an effective balance between global and local search. However, the COA converges sluggishly toward the optimal solution during the search phase. In this paper, the COA is improved with a chaotic population initialization method to strengthen its search capability. Furthermore, the improved COA (ICOA) is integrated with a machine learning model to analyze soil salinity information in the Yellow River Delta region. This integration aims to mitigate the influence of randomly initialized parameters on the performance of the salt inversion model.
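As a rough illustration of chaotic population initialization, the sketch below seeds a population with the circle chaotic map, which the Discussion names as the mapping used. The map constants (a = 0.5, b = 0.2), the seed x0, and the function names are assumptions chosen for illustration; this section does not state the settings actually used in the ICOA.

```python
import math


def circle_map_sequence(n, x0=0.7, a=0.5, b=0.2):
    """Return n values in [0, 1) generated by the circle chaotic map.

    The constants a, b and the seed x0 are assumed values, not taken
    from the paper.
    """
    values = []
    x = x0
    for _ in range(n):
        # x_{k+1} = (x_k + b - a / (2*pi) * sin(2*pi*x_k)) mod 1
        x = (x + b - (a / (2 * math.pi)) * math.sin(2 * math.pi * x)) % 1.0
        values.append(x)
    return values


def init_population(pop_size, lower, upper, x0=0.7):
    """Scale one chaotic sequence onto the per-dimension bounds."""
    dim = len(lower)
    seq = circle_map_sequence(pop_size * dim, x0=x0)
    population = []
    for i in range(pop_size):
        chunk = seq[i * dim:(i + 1) * dim]
        individual = [lo + c * (hi - lo) for c, lo, hi in zip(chunk, lower, upper)]
        population.append(individual)
    return population


# Hypothetical 2-dimensional search space with 5 individuals.
population = init_population(pop_size=5, lower=[-1.0, 0.0], upper=[1.0, 10.0])
```

Compared with uniform random sampling, a deterministic chaotic sequence is intended to spread the initial individuals more evenly over the search space; whether it helps in a given problem still has to be checked empirically.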
This paper takes the Yellow River Delta region as the study area. Twenty-nine spectral indices across four categories (band index, salinity index, vegetation index, and composite index) were extracted from Landsat 5 TM image data. Two optimal sets of input variables were determined using two different variable screening methods. SSC inversion models were constructed using BPNN, RELM, and ICOA-RELM. Comparative analysis was performed to evaluate the performance of different combinations of modeling variables and models. The most accurate and stable model was selected to create a spatial distribution map of soil salinity in the Yellow River Delta region.
3. Results and Analysis
3.1. Statistical Analysis of Soil Salt Content Characteristics
According to Feng Xueli’s study on soil salinization monitoring in the Jiefangzha irrigation domain of the Hetao Irrigation District, Inner Mongolia [46], soil salinization was classified into five classes: non-saline, slight saline, moderate saline, severe saline, and extreme saline. The distribution of salinity classes across the 94 sets of measured data is presented in Table 5. Non-saline soil samples accounted for the largest share (38.3%) of all samples. Slight saline and severe saline soil samples each accounted for 21.28% of all samples, and moderate saline soil samples accounted for 19.14%. No extreme saline soil samples were found among the samples in this paper.
3.2. Filtering of Input Variables
Because the spectral indices are all computed from the same basic image bands, they are often significantly correlated with one another. Using correlated variables to train a machine learning model often leads to overfitting. Consequently, before model training, the constructed variables must be analyzed and screened to mitigate overfitting, simplify the model, and enhance its efficiency. In this section, two analysis methods are employed to screen the constructed variables.
Pearson correlation analysis is a statistical method used to measure the strength and direction of the linear relationship between two continuous variables. It helps to quantify the degree of association between variables, which is useful for feature selection, variable screening, and understanding patterns in the data. The variable importance in projection (VIP) score is calculated by considering both the predictive performance of the model and the contribution of the independent variables. Generally, a higher score signifies that the associated independent variable contributes more to the model. This method helps identify the independent variables that matter most for the prediction objective and thus facilitates feature selection and model optimization.
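The Pearson screening step can be sketched as follows. This is a minimal illustration only: the feature names and sample values are made up, and the helper names are hypothetical. The paper itself reports computing the coefficients in IBM SPSS Statistics.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_by_abs_correlation(features, target):
    """Sort features by |r| with the target, strongest first."""
    scores = {name: pearson_r(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up toy data: five samples of SSC and two hypothetical indices.
ssc = [0.2, 0.5, 0.9, 1.4, 2.0]
features = {
    "index_a": [0.10, 0.22, 0.35, 0.61, 0.80],
    "index_b": [0.9, 0.7, 0.8, 0.4, 0.3],
}
ranking = rank_by_abs_correlation(features, ssc)
```

Ranking by the absolute value keeps strongly negative indices in play, which matters here because several of the indices discussed in this section are negatively correlated with SSC.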
3.2.1. Correlation between Spectral Indices and Soil Salt Content
The original band data at the corresponding locations were extracted from the corrected remote sensing image using ArcMap 10.8.12790 software, based on the latitude and longitude of the measured soil samples. The values of all characteristic variables corresponding to each measured soil sample were calculated using IBM SPSS Statistics 24.0.0.0 software. Pearson’s correlation coefficients were then calculated between the four types of characteristic variables and the measured values of soil salinity. The correlation heat map relating SSC to the different categories of characteristic variables is presented in Figure 5. Red indicates a positive correlation between the variables, whereas blue indicates a negative correlation, and the color darkens as the correlation strengthens. Based on the figure, we can conclude the following: (1) Within the band indices group, SWIR1 exhibited the strongest negative correlation with SSC, with a correlation coefficient of −0.6, whereas the GREEN band showed the weakest correlation with SSC, with a correlation coefficient of 0.21. Within the salinity indices group, SI5 demonstrated a robust positive correlation with SSC, with a correlation coefficient of 0.76, in contrast to SI2, which exhibited the weakest correlation. In the vegetation indices category, ENDVI presented the strongest correlation with SSC, whereas ALBEDO displayed the weakest. Within the composite indices category, COSRI registered a correlation coefficient of −0.6 with SSC, the same as that of SWIR1. (2) The mean absolute values of the correlation coefficients between each group of variables and the SSC were calculated separately. The mean absolute correlation values for the band indices group, the salinity indices group, the vegetation indices group, and the composite indices group were 0.375, 0.469, 0.571, and 0.42, respectively. The vegetation indices group had the highest mean absolute correlation with SSC, whereas the band indices group had the lowest. (3) SI5 emerged as the variable most strongly correlated with SSC, followed by ENDVI and EDVI from the vegetation indices category, both of which were negatively correlated with SSC. GREEN and SI2 displayed the weakest correlations with SSC, each with an absolute correlation coefficient of 0.21. Following the correlation analysis, four variables (SWIR1, SI5, ENDVI, and COSRI) were chosen to constitute the input variable group PCC.
3.2.2. Importance Analysis of Characteristic Variables
Variable importance in projection (VIP) analysis was used to screen the twenty-nine spectral indices, comprising six band indices, ten salinity indices, eight vegetation indices, and five composite indices. The results are presented in Figure 6. In the figure, the blue dots represent the VIP value of each variable, and the red circles mark the positions where the VIP value is 1. The figure reveals that certain band and composite indices exhibit low VIP values, whereas several vegetation and salinity indices possess VIP values greater than 1. Among them, SI5 has the highest importance, with a VIP value of 1.444, whereas SI3 has the lowest VIP value, 0.578. (A VIP value greater than 1 indicates that the variable is highly important for the dependent variable, a value greater than 0.5 but less than 1 suggests unclear importance, and a value less than 0.5 indicates that the variable is not important for the dependent variable.) Sixteen variables were selected for the input variable group VIP, comprising SI5, ENDVI, SI4, SWIR1, EDVI, ERVI, SI9, COSRI, SWIR2, SIT, MSAVI, EEVI, GRVI, NDWI, NDSI, and NDVI.
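The VIP screening can be sketched as below, assuming the standard VIP definition for a partial least squares model (weights per component, scaled by the response variance each component explains). This section does not give the formula the authors used, so the function names, the toy weights, and the handling of values exactly equal to 1 or 0.5 are assumptions; only the thresholds and the two quoted VIP values (1.444 for SI5, 0.578 for SI3) come from the text.

```python
import math


def vip_scores(weights, explained_ss):
    """Standard VIP scores from PLS weights (assumed formulation).

    weights[a][j]   : weight of variable j on component a
    explained_ss[a] : response variance explained by component a
    """
    n_vars = len(weights[0])
    total_ss = sum(explained_ss)
    scores = []
    for j in range(n_vars):
        acc = 0.0
        for w_a, ss_a in zip(weights, explained_ss):
            norm_sq = sum(w * w for w in w_a)
            acc += ss_a * (w_a[j] ** 2) / norm_sq
        scores.append(math.sqrt(n_vars * acc / total_ss))
    return scores


def classify_vip(vip):
    """Apply the thresholds quoted in the text.

    Values exactly equal to 1.0 or 0.5 are not covered by the text;
    placing them in the lower class is an assumption.
    """
    if vip > 1.0:
        return "important"
    if vip > 0.5:
        return "unclear"
    return "not important"


# Made-up weights for 3 variables and 2 components, for illustration only.
toy_weights = [[0.8, 0.5, 0.1], [0.2, 0.6, 0.7]]
toy_ss = [3.0, 1.0]
toy_vip = vip_scores(toy_weights, toy_ss)
labels = [classify_vip(v) for v in toy_vip]
```

Under this definition the squared VIP scores average to 1 across variables, which is why 1 is the conventional cut-off for an important variable.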
After variable analysis and screening, three groups of input variables (PCC, VIP, and the full variable group TV) were created for the experimental part of this paper. The spectral indices of the three input variable groups are shown in Table 6.
3.3. Soil Salinity Inversion Model
The variable group PCC, the variable group VIP, and the full variable group TV were each used as modeling variables to build three machine learning models (RELM, BP, and ICOA-RELM), giving nine models in total. The performance of each model was evaluated using R², RMSE, and MAE.
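For reference, the three evaluation metrics can be computed as in the sketch below. It assumes that the first metric is the coefficient of determination in its usual form, 1 − SS_res/SS_tot; the exact definition used by the authors is not given in this section, and the sample values are invented.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented measured and predicted values, for illustration only.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
metrics = {
    "r2": r2_score(measured, predicted),
    "rmse": rmse(measured, predicted),
    "mae": mae(measured, predicted),
}
```

A higher R² and lower RMSE and MAE indicate a better fit; RMSE penalizes large individual errors more heavily than MAE does.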
The R² values of the nine models on the test set, together with the fitting equations for the measured and predicted values, are depicted in Figure 7. The results indicate that optimizing the RELM model using ICOA substantially improved the model’s performance, resulting in enhanced predictions of SSC in the study area. Behind ICOA-RELM-VIP (test set R² of 0.75), ICOA-RELM-TV achieved the second-best performance with a test set R² value of 0.728, followed by BP-TV (R² value of 0.676). RELM-TV had the same R² value of 0.676 but higher MAE and RMSE than BP-TV. BP-PCC attained an R² value of 0.661, whereas RELM-VIP had an R² value of 0.607. ICOA-RELM-PCC achieved an R² value of 0.6, BP-VIP had an R² value of 0.594, and RELM-PCC obtained an R² value of 0.589.
The SSC prediction results for the datasets selected by the different variable screening methods and the different models are shown in Table 7. The table shows that the RELM and BP models perform best when the full set of variables is used as modeling variables, achieving an R² value above 0.67 for both the training and test sets, along with lower MAE and RMSE. Among the ICOA-RELM models, ICOA-RELM-VIP demonstrates the best performance and the highest inversion accuracy, with a test set R² of 0.75, MAE of 0.198, and RMSE of 0.249. ICOA-RELM-TV follows closely, with R² of 0.748 and 0.728 for the training and test sets, respectively. On the other hand, ICOA-RELM-PCC performs the worst, with R² below 0.7 for all the models. Comparatively, the models constructed using ICOA-RELM outperformed the unoptimized RELM model across all three input variable groups, with test set R² gains of 0.052 (TV), 0.143 (VIP), and 0.011 (PCC), an average improvement of approximately 0.07.
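The per-group gain of ICOA-RELM over the unoptimized RELM can be recomputed directly from the test set R² values quoted in this section. The dictionary layout and variable names below are illustrative only; the numbers are those reported in the text.

```python
# Test set R² values as reported in this section.
relm_r2 = {"TV": 0.676, "VIP": 0.607, "PCC": 0.589}
icoa_relm_r2 = {"TV": 0.728, "VIP": 0.750, "PCC": 0.600}

# Gain of the optimized model over the unoptimized one, per variable group.
gains = {group: round(icoa_relm_r2[group] - relm_r2[group], 3) for group in relm_r2}

# Mean gain across the three input variable groups.
mean_gain = sum(gains.values()) / len(gains)

best_group = max(gains, key=gains.get)
worst_group = min(gains, key=gains.get)
```

The gain is largest for the VIP group and smallest for the PCC group, so the benefit of the optimization depends noticeably on which variables are supplied to the model.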
3.4. Inversion of Soil Salt Spatial Distribution Based on ICOA-RELM-VIP Model
The Yellow River Delta wetlands include both perennial and seasonal storage wetlands. Perennial wetlands, predominantly characterized by mudflat ecosystems, consist of rivers, lakes, estuaries, and various types of ponds, including those for salt, shrimp, and crab. Seasonal storage wetlands comprise heavily saline supratidal areas, marshes, wet meadows, and paddy fields. As a result, salinization levels in the Yellow River Delta region vary significantly [47]. Soil salinization in this region arises from both natural factors and human activities. The Yellow River Delta’s geographic location causes an uneven distribution of precipitation, exacerbated by an arid climate and scarce rainfall. Soil moisture evaporation therefore exceeds recharge, resulting in inadequate moisture and subsequent soil salinization. Additionally, a significant decline in the water table accelerates salt migration in groundwater, worsening surface soil salinization. Extensive soil erosion alters the nutrient composition of the land, further contributing to soil salinization. Excessive reclamation and rapid industrialization have disrupted the land’s nutrient composition, and prolonged irrigation and improper water management have further exacerbated soil salinization [48].
The 16 spectral indices selected through variable importance in projection analysis were used as model inputs. The ICOA-RELM model, which performed best, was then applied to the whole study area to obtain the distribution of soil salinity classes in October 2003, as depicted in Figure 8. Subsequently, the percentage of soils in each class was tabulated, and the results are presented in Table 8.
In the study area, spatial distribution patterns reveal higher salinity levels along the coastal regions and lower salinity levels inland. The southeastern coastal region, accompanied by segments of the northwestern coast and the northeastern countryside, predominantly features soil with extreme and severe salinization, encompassing approximately 2351.5 square kilometers, which constitutes 43.36% of the entire study area. These areas are prone to repeated saltwater intrusion, exacerbated by drought and high temperatures, which promotes salt accumulation in the soil. Moderate saline soils are predominantly found in the central region, characterized by granite terraces and fluvial uplands at higher elevations, covering approximately 1266.94 square kilometers, accounting for 23.36% of the study area. Slight saline soils are primarily located along both sides of the Yellow River and in the northwestern region, where irrigation is extensively utilized. This area consists of river terraces, flatlands, and lowlands. Despite the influence of shallow groundwater levels and significant capillary action, these areas benefit from freshwater recharge. This category spans approximately 948.87 square kilometers, comprising 17.5% of the study area. Non-saline soils, the least represented category, comprise 15.78% of the entire study area, covering approximately 855.93 square kilometers, and are primarily found in the northeastern region, excluding coastal areas.
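The area shares quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that the four reported categories together cover the whole study area, which the text implies but does not state; the dictionary keys are illustrative labels.

```python
# Areas in square kilometers, as reported in the text.
areas_km2 = {
    "extreme and severe saline": 2351.5,
    "moderate saline": 1266.94,
    "slight saline": 948.87,
    "non-saline": 855.93,
}

# Assumed total: the sum of the four reported categories.
total_km2 = sum(areas_km2.values())

# Share of each category as a percentage of the assumed total.
shares_pct = {name: 100.0 * area / total_km2 for name, area in areas_km2.items()}

for name, share in shares_pct.items():
    print(f"{name}: {share:.2f}%")
```

Under this assumption the recomputed shares agree with the reported percentages to two decimal places, and the implied study area is about 5423 square kilometers.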
4. Discussion
To explore the effect of combining intelligent optimization algorithms with traditional machine learning models for SSC inversion, scholars have paired such models with algorithms including the GA [26], seagull optimization algorithm (SOA) [49], sparrow search algorithm (SSA), bird swarm algorithm (BSA), moth search algorithm (MSA), Harris hawk optimization algorithm (HHO), grasshopper optimization algorithm (GOA), and particle swarm optimization algorithm (PSO) [50]. Building on these studies, this paper uses measured SSC and different combinations of spectral indices as modeling inputs, improves the crayfish optimization algorithm proposed in 2023, and combines the improved algorithm with the RELM model to train the SSC inversion model. Circle chaotic mapping was introduced to improve the initialization of the crayfish population, which improved the convergence ability of the algorithm, the speed of searching for optimal solutions, and the accuracy of the SSC inversion model. The results show that the ICOA-RELM model can monitor soil salinity conditions in the Yellow River Delta region, which supports soil management in the region.
Comparative analysis of the final inversion model’s accuracy demonstrates that, across all three input variable groups, the ICOA-RELM model introduced in this paper enhances the accuracy of estimating SSC in the study area when compared with the unoptimized model. This enhancement indicates that the optimization algorithm positively impacts the model’s inversion capability. Overall, non-saline, slight saline, and moderate saline soils intersect throughout the central part of the study area. Non-saline soils tend to form dendritic patterns following the direction of the water network’s runoff. Extreme saline soils are primarily found in tidal flats and tidal ditches, as well as other water bodies. The degree of salinization generally increases toward the seaward direction, closely linked to tidal infiltration and ground elevation. In the northern part of the study area, the former estuary area of the old Yellow River channel is dominated by severe saline soils, and the inner part is wrapped by a small amount of extreme saline soils. The coastal area in the southern part of the study area is dominated by extreme saline soils, and this part of the area is mainly tidal flats. The overall salinization level of the soil in the inversion results is consistent with the measured data from the actual sampling.
The accuracy of the model for inverting SSC in the study area is influenced to some extent by the resolution and band information of the remote sensing images. The spectral indices were constructed from reflectance data extracted from Landsat 5 TM imagery collected in 2003, which limits the modeling effectiveness; higher-quality imagery was not used. Satellite data with higher resolution and quality, such as Sentinel-1 and Sentinel-2, Planet, and Landsat 8, are now available and can be used with more recent data to study subsequent soil salinization levels. Additionally, environmental factors such as soil moisture and land use type can affect the level of salinity. Incorporating these environmental covariates alongside the spectral indices as model inputs would allow their relationship with SSC to be investigated and could enhance the accuracy of the inversion model [51]. The sample size used in this paper is limited, which restricts the application of new inversion modeling techniques; collecting more samples in future studies would be beneficial. Additionally, deep learning algorithms may enhance the accuracy of soil salinity level identification. The applicability of the ICOA-RELM inversion model to other regions needs further verification, and comprehensive testing and evaluation are needed to determine the performance of the model under different environmental conditions.