1. Introduction
Heavy metal pollution has seriously endangered human life, and its characteristics, such as difficult degradation, easy enrichment, and high toxicity, have a significant impact on crop growth, production, and quality [
1,
2]. Heavy metal pollution of farmland soil can inhibit microbial growth and the dominant bacterial communities in the soil environment. With the frequent use of chemicals and fertilizers, heavy metal contamination of farmland soils has become increasingly serious [
3]. Lead (Pb) is a typical heavy metal contaminant [
4]. Once it infiltrates the soil, it exhibits low mobility, long residual times, and resistance to microbial degradation, making its remediation and control challenging. Its accumulation and migration within the soil can endanger regional ecological stability and affect the growth and development of flora and fauna, posing serious risks to human health through the food chain [
5,
6]. Therefore, efficient monitoring of Pb content in farmland soil is essential for ensuring food security and ecological safety.
Currently, the inversion research into soil heavy metal content mainly relies on the direct monitoring of ground-measured hyperspectral data. This method meets the present demand for rapid, nondestructive, and quantitative monitoring of soil heavy metal content [
7,
8]. For instance, Zhang et al. [
9] indirectly constructed an improved PLSR soil heavy metal Pb content inversion model with a genetic algorithm (GA) by acquiring the spectral characteristic wavelengths of iron oxides through the field-measured spectra. Wang et al. [
10] obtained measured hyperspectral curves and heavy metal contents using spectrometry and chemical analysis, and constructed six heavy metal inversion models. However, although models built directly from ASD spectral data and heavy metal concentrations can achieve high inversion accuracy, the approach requires expensive instrumentation and equipment and consumes considerable manpower and material resources. This makes regional application costly and less scalable [11].
Fortunately, the use of remote sensing technology with relatively continuous spectral coverage of visible near-infrared reflectance (VNIR) for estimating soil heavy metal content is increasingly becoming a research focus, opening new avenues for cost-effective and large-scale monitoring of heavy metal content [
12]. The ability to estimate soil heavy metal concentrations using remote sensing images from multiple sensors, such as the WorldView-3 [
13], Landsat 8 [
14], Sentinel-2 [
15], GF-2 [
16], and HJ-1A [
17,
18], has been documented in published research. Zhang et al. [
19] used Landsat8 OLI and Sentinel-2 images to invert the Au element of the South African tailings. Wang et al. [
16] utilized GF-2 multispectral and measured spectra to establish multivariable factor models. The first four satellites mentioned above belong to multispectral sensors with limited bands. However, the shortcomings of the currently available sensors were clear. They cannot detail soil responses to heavy metal stress with single bands alone, requiring supplementary auxiliary factor variables for heavy metal content estimation [
20,
21]. Therefore, the above satellite sensors are still deficient in estimating soil heavy metal levels.
However, following the launch of domestic satellites such as GF-5, ZY-1-02D, and ZY-1-02E, accessible hyperspectral satellite data resources are becoming increasingly abundant. The subtle spectral changes in the soil in multiple bands can be captured, which are relevant to the heavy metal content in soil. They promoted the application of hyperspectral remote sensing in this research direction. Zhang et al. [
22] used reflectance of GF-5 hyperspectral images to predict the content of heavy metal Zn, Ni, and Cu in coal mines, with good inversion results. Lin et al. [
23] constructed a stacking model by utilizing ZY-1-02D images and applied it to the images to obtain large-scale soil heavy metal contamination information. The satellite hyperspectral data can take advantage of its “image–spectrum integration” to enable rapid and nondestructive monitoring for soil heavy metal concentrations over large areas. Unfortunately, the GF-5 satellite is no longer in service. Compared with existing hyperspectral satellite data, ZY-1-02E AHSI carries multiple spectral band sensors, which can acquire rich spectral information and is conducive to inverting soil heavy metal content. Yet, there are fewer studies related to inversion for heavy metal content in farmland soil using ZY-1-02E AHSI hyperspectral remote sensing images. There are still challenges in monitoring the concentrations of heavy metals in farmland using hyperspectral images [
24]. Consequently, it is necessary to assess the feasibility of using ZY-1-02E remotely sensed imagery to estimate heavy metal content in farmland, and this study attempts to fill this gap.
Hyperspectral remotely sensed data with hundreds of consecutive narrow bands provide more detailed spectral information than traditional single-band remotely sensed data [
25]. However, data characteristics such as high dimensionality, high redundancy, and low correlation might have a negative influence on the stability and accuracy of modeling. To simplify models and improve prediction accuracy, many scholars have introduced feature selection algorithms into the constructed inversion models [
26]. Methods such as the Pearson method (PCC), genetic algorithm (GA) [
27,
28], competitive adaptive reweighted sampling algorithm (CARS) [
29], successive projection algorithm (SPA) [
30], and slime mold algorithm (SMA) [
10] have been recognized for their ability to explain characteristics and their certain statistical basis [
31]. In addition, the Boruta algorithm has demonstrated its effectiveness in extracting feature bands [
22,
32]. Nevertheless, heavy metal-contaminated soils in farmland exhibit weak signals on their spectral characteristics, which makes it difficult to determine effective spectrum response bands from soil spectral information. Meanwhile, effective spectral transforms may extract useful information from complex spectral reflectance, highlighting the absorption and reflection properties of the spectra [
33,
34], but some band redundancy still exists. Therefore, it is necessary to combine the advantages of both the optimized spectral index and the feature selection algorithm for hyperspectral remotely sensed monitoring of soil heavy metals in farmland.
Hyperspectral remote sensing analysis of soil heavy metals relies heavily on establishing a suitable regression model. While traditional statistical models, such as partial least squares (PLS) [
35] and multiple linear regression (MLR), are commonly used, they often fail to capture the nonlinear relationships between soil heavy metal content and spectral reflectance. Using nonlinear models can yield better prediction results and significantly enhance the accuracies of hyperspectral soil heavy metal estimation [
36,
37]. Machine learning approaches have been widely utilized for soil heavy metal estimation, including random forest (RF), extreme learning machine (ELM), general regression neural network (GRNN) [
38], and deep neural network (DNN) [
39]. Compared to traditional statistical models, machine learning methods offer faster training, better suitability for small samples and high-dimensional data processing, and strong nonlinear and adaptive capabilities. Shen et al. [
35] built a BPNN model to indirectly invert heavy metal Cu content in a mining location. The prediction accuracy was improved by 0.06 compared to a linear model. Zhou et al. [40] found that RF provided better estimation accuracies than SVM and PLSR in estimating the contents of six soil metals. Chen et al. [
41] showed that ELM, SVM, and RF performed better than PLSR in estimating Pb, Cr, and Zn in farmland soil.
Ensemble learning approaches have increasingly become an essential tool for soil heavy metal content inversion studies because of their superior stability, accuracy, and efficiency. Commonly used ensemble learning frameworks of heavy metal content inversion include RF, GBDT, AdaBoost [
42], XGBoost [
43], and Stacking [
12]. Among them, RF is a typical representative of the bagging approach, which improves the diversity and generalization of the model by introducing randomness [
44,
45]. The XGBoost model can efficiently capture complex nonlinear relations from data with a certain degree of interpretability, which is important for understanding the key influencing factors in heavy metal content inversion. It has been particularly successful in handling high-dimensional remote sensing data and providing accurate predictions [
43,
46]. Tan et al. [
42] and Sun et al. [
47] have demonstrated the effectiveness of ensemble models like AdaBoost and XGBoost in estimating heavy metal concentrations, with XGBoost outperforming other methods in predicting Ni levels and other heavy metals. Despite its success in other soil heavy metals [
48], XGBoost has had fewer applications for estimating Pb content in farmland soils, which remains an area for further exploration.
The aim of this study is to develop an efficient framework for estimating heavy metal content in soil over a large area using hyperspectral image data from the ZY-1-02E satellite. We explore an XGBoost-based fusion method that combines the advantages of ensemble learning and wrapper-based feature selection. The key variables with a strong correlation with heavy metal content were identified from the high-dimensional spectral data, and the prediction performance of the XGBoost model was compared with four traditional estimation models. The best prediction model was selected to obtain the spatial distribution map of the heavy metal Pb content in farmland soil. The results of the study provide valuable insights and technical support for building a green, sustainable, and high-quality agricultural system in Xiong’an New Area.
2. Materials and Methods
2.1. Research Region
Xiong’an New Area (115°38′ ~ 116°20′ E, 38°43′ ~ 39°10′ N) is located in Baoding City in the central hinterland of Hebei Province, covering Xiong County, Rongcheng County, Anxin County, and parts of the surrounding areas, with a flat terrain and four distinct seasons [
49]. The sampling area of the study covers the entire Xiong’an New Area, with abundant water resources, including Baiyangdian Lake and Caogou Canal (
Figure 1). The area is crossed by tributaries such as the Tanghe River, Fuhe River, Qingshuihe River, and Caohe River. The sampling sites were mainly selected around potentially polluting enterprises and locations where major pollution events have occurred. Among these, Laohetou Town and Anzhou Town in Anxin County are sites of non-ferrous metal production and processing. Parts of the Fuhe River and Tanghe River watersheds were sites of enterprise and residential wastewater discharge. Rongcheng County is a gathering place of the paper and fleece industries. Unlike Xiong County, Anxin and Rongcheng counties have primarily practiced agriculture reliant on irrigation with polluted water. Crops are predominantly grown in the southwest and northwest of Xiong’an New Area, with winter wheat and corn being the primary crops, supplemented by soybean and sweet potato.
2.2. Soil Sample Collection and Testing
We collected soil samples in late October 2022, after the sowing of winter wheat. The farmland was free of crops and weeds at that time, making it appropriate for sampling. Following the standards of the “Technical Specification for Soil Environment Monitoring” (HJ/T 166-2004) [
50], the sample sites were set up using the uniform distribution method to target soil pollution plots caused by irrigation water and solid waste pollutants (see
Figure 1). The preset positions for the sampling points, along with sampling sequences and routes, were adjusted through field surveys. During the sampling process, all sampling points were located at distances greater than 100 m from other features to guarantee the validity of the pixel values for each sample point. The soil sampling method used was diagonal five-point sampling (
Figure 2, ABCDO). Soil samples were taken at a surface depth of 5 cm, mixed homogeneously from the five sampling points, and placed in sealed plastic bags. We obtained 80 valid soil samples, and the latitude and longitude coordinates of the sampling sites were recorded using GPS positioning, along with the sample names, the surrounding conditions of the sample area, and other relevant details. Following collection, the soil samples were dried in a desiccator, and stones, grass roots, metal fragments, and other impurities were removed to minimize the effect of moisture and other contaminants. Finally, the soil samples were dissolved first with HCl, followed by HClO₄, HF, and HNO₃, and the heavy metal Pb content of the agricultural soil was determined with an atomic absorption spectrophotometer (AA-6880, Spread Strom Technology Co., Ltd., Nanning, China).
2.3. Remote Sensing Image Data
The ZY-1-02E AHSI imagery acquired on 23 October 2022, with cloud coverage of 0%, was chosen as the hyperspectral data for this study; its acquisition was quasi-synchronous with the field sampling. The ZY-1-02E remote sensing images were obtained from https://data.cresda.cn/#/home (accessed on 10 April 2023). Compared with the ZY-1-02D satellite, the ZY-1-02E satellite is provided with a visible near-infrared (VNIR) camera with a resolution of 2.5 m. Additionally, the hyperspectral camera (AHSI) of ZY-1-02E continues to acquire 166 spectral bands, with a spatial resolution of 30 m and a wavelength range of 0.4 to 2.5 μm. The imagery is current and accessible, providing detailed information on spectral features. For regional studies, a spatial resolution of 30 m balances image detail with the efficiency of data processing, limiting the mixed-pixel problem that a coarser resolution may cause. It provides sufficient spatial information for the estimation of soil Pb concentration.
Table 1 shows parameters of the ZY-1-02E hyperspectral satellite data.
Given the attribute complexity of remote sensing image data and the band range and resolution of hyperspectral images, the ZY-1-02E satellite imagery was used as the data source, from which the required band information was extracted. Because of electromagnetic wave interference caused by atmospheric conditions and errors of the sensor itself, the spectral information of ground features and the corresponding geometric information may be distorted. These distortions affect the spectral reflectance of image pixels, which in turn affects the accuracy of heavy metal inversion in farmland. Therefore, in this research, ENVI 5.3 software was used to conduct systematic radiometric calibration and atmospheric correction of the ZY-1-02E AHSI imagery for the study area. The reflectance data were then obtained by orthorectifying the atmospherically corrected results with 30 m digital elevation model data. The atmospherically corrected spectral curves exhibited several “broken” segments because the atmospheric correction identified certain bands affected by air and water vapor as “bad bands.” The actual pixel value of these bands in the original image was zero. Therefore, the severely affected wavelengths (1.36–1.45 μm, 1.82–1.95 μm) were excluded from this study. Finally, using the latitude and longitude coordinates of the collected soil sample points, we extracted the spectral information of the sampling points from the preprocessed hyperspectral images with ArcGIS 10.8 software.
2.4. XGB–Boruta Screening Algorithm
Boruta is a feature selection algorithm built on random forests that compares the importance of the original features with that of shadow features to obtain the optimal combination of features. The method screens out the full set of features that are correlated with the dependent variable and is a type of greedy algorithm. The approach comprehensively explains the effect of the independent variables on the dependent variable, improving the global search ability for spectral features. However, it suffers from problems such as high sample complexity and slow iterative computation.
Therefore, the study used the improved XGB–Boruta algorithm to select features of the hyperspectral data, with the XGBoost model replacing the original RF as the training model. It combines ensemble learning and wrapper methods to improve the robustness of feature selection. With its superior discrimination and identification capabilities, the method can better handle high dimensionality and collinearity between variables, improving the model’s prediction performance. The main ideas are as follows: (1) The raw spectral reflectance matrix (N) is read, a new shadow soil spectral matrix (S) is generated by randomly shuffling the matrix columns, and the original spectral matrix (N) is combined with the shadow matrix (S) based on the sampling point identifiers (ID) to form a new feature combination matrix (P). (2) Multiple iterations of the XGBoost model are performed to calculate feature importance using the XGBoost importance score formula. The feature importance is determined based on the construction of K classification trees. The importance scores (Z-scores) for N and S are obtained from the frequency of split points used in each tree. (3) The maximum Z-score of the shadow features (S) is defined as max_S, and the feature bands in N with Z-scores higher than max_S are marked as important. Conversely, those with lower Z-scores are marked as unimportant and eliminated. (4) All the above processes are repeated until all feature bands are marked. The result is a combination of feature bands with high modeling contribution, providing the optimal solution. This algorithm has excellent optimization capabilities and a fast convergence rate.
2.5. Pearson Correlation Coefficient (PCC)
The PCC is a method used to measure the level of linear relationship between two variables. The coefficient takes values between −1 and 1, with higher absolute values indicating a stronger correlation. This method quantifies and assesses the strength of the relationship between two variables, determining whether a change in one variable can predict the direction and magnitude of change in the other. The formula is as follows:
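$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}$$

In this equation, $\operatorname{cov}(x, y)$ represents the covariance; when the trends of $x$ and $y$ are consistent, $\operatorname{cov}(x, y)$ is positive, and conversely, it is negative. $\sigma_x$ and $\sigma_y$ indicate the standard deviations of $x$ and $y$, reflecting the volatility of the values.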
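As a concrete illustration of the coefficient described in this section, the following minimal sketch computes the covariance divided by the product of the standard deviations. The function name `pearson_r` is illustrative; the common 1/n factors in the covariance and the standard deviations cancel, so they are omitted.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of co-deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the root sums of squares (proportional to
    # the product of the two standard deviations).
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)
```

Perfectly proportional inputs give a value of 1 (up to floating-point rounding), and reversing one of them gives a value of -1.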
2.6. Research Workflow
The high dimensionality, high redundancy, and low correlation inherent in hyperspectral data can severely impair the accuracy and stability of hyperspectral modeling. To overcome these challenges and monitor heavy metal content in farmland soil over a large region from hyperspectral remote sensing imagery, this paper proposes a method that combines XGB–Boruta–PCC feature selection with extreme gradient boosting (XGBoost). The fusion model layers ensemble learning on top of wrapper-based feature selection to improve the accuracy and robustness of predictions of heavy metal concentrations in farmland soils over large areas. The XGB–Boruta–PCC algorithm was first applied to screen for optimal spectral response features. Five estimation models, PLSR, RF, ELM, GBDT, and XGBoost, were then built and assessed for their accuracy in heavy metal Pb estimation. This evaluation led to the identification of XGBoost for the final modeling. The XGB–Boruta–PCC method consists of XGB–Boruta and PCC. First, the extracted hyperspectral image bands are preliminarily selected by XGB–Boruta. Then, the correlation of the spectral response features is enhanced using continuum removal (CR) and its first-order derivative (CR–FD). Finally, the set of feature bands most sensitive to heavy metal Pb is selected by the PCC for modeling. To reach the optimal model performance, the hyperparameters of every model were optimized using the Optuna optimization method during the model building process.
However, the experiment in this study reveals that, although the XGB–Boruta algorithm successfully screens feature bands with high contributions to modeling, these bands still exhibit data redundancy and low correlation with the measured heavy metal Pb content. To solve these problems, this work combined XGB–Boruta with the PCC algorithm and applied spectral transformations that highlight the spectral feature information of soil, making the absorption valleys and reflectance peaks of the spectral curves more prominent. Continuum removal accentuates the absorption and reflection features of the spectra, and the subsequent first-order derivative suppresses background noise; together they enhance the spectral information related to the soil heavy metal response. The method addresses the shortcomings of the feature screening algorithm alone and improves the correlation of the spectral response features, realizing the optimal feature selection of hyperspectral bands for soil heavy metal Pb. The spectral transformation formulas are as follows:
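$$R_{CR}(\lambda_i) = \frac{R(\lambda_i)}{R_C(\lambda_i)}$$

$$R'_{CR}(\lambda_i) = \frac{R_{CR}(\lambda_{i+1}) - R_{CR}(\lambda_{i-1})}{\lambda_{i+1} - \lambda_{i-1}}$$

where $\lambda_i$ represents the wavelength at band $i$, $R(\lambda_i)$ represents the reflectance at wavelength $\lambda_i$, $R_C(\lambda_i)$ represents the continuum value at wavelength $\lambda_i$, $R_{CR}(\lambda_i)$ represents the reflectance after continuum removal, and $R'_{CR}(\lambda_i)$ represents the first-order derivative spectral reflectance at wavelength $\lambda_i$.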
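The two spectral transformations can be illustrated with a short sketch. It assumes the common definitions: the continuum is the upper convex hull of the spectrum, continuum removal divides the reflectance by the continuum, and the first-order derivative is a central difference over neighbouring bands. The study itself used dedicated software, so the function names below are illustrative, and the inputs are assumed to be at least two positive reflectance values on strictly ascending wavelengths.

```python
def upper_hull(wavelengths, reflectance):
    """Indices of the spectrum points lying on the upper convex hull."""
    hull = []
    for i in range(len(wavelengths)):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # Cross product of (b - a) and (i - a); a non-negative value means
            # point b lies on or below the straight line from a to i.
            cross = ((wavelengths[b] - wavelengths[a]) * (reflectance[i] - reflectance[a])
                     - (reflectance[b] - reflectance[a]) * (wavelengths[i] - wavelengths[a]))
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return hull


def continuum_removal(wavelengths, reflectance):
    """Reflectance divided by the continuum (linearly interpolated hull)."""
    hull = upper_hull(wavelengths, reflectance)
    removed = []
    seg = 0  # index of the hull segment that contains the current band
    for i, wl in enumerate(wavelengths):
        while seg < len(hull) - 2 and wl > wavelengths[hull[seg + 1]]:
            seg += 1
        a, b = hull[seg], hull[seg + 1]
        t = (wl - wavelengths[a]) / (wavelengths[b] - wavelengths[a])
        continuum = reflectance[a] + t * (reflectance[b] - reflectance[a])
        removed.append(reflectance[i] / continuum)
    return removed


def first_derivative(wavelengths, values):
    """Central-difference derivative for the interior bands only."""
    return [(values[i + 1] - values[i - 1]) / (wavelengths[i + 1] - wavelengths[i - 1])
            for i in range(1, len(wavelengths) - 1)]
```

Applying `first_derivative` to the output of `continuum_removal` gives the CR–FD transform. Bands on the hull take the value 1 after continuum removal, and absorption features appear as values below 1.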
In the modeling process, the selected spectral feature variables were used as independent variables and the measured heavy metal Pb content as the dependent variable to build inversion models of heavy metal Pb for agricultural soils with strong explanatory power. The samples were split into training and validation sets at a ratio of 3:1, giving a training set of 60 samples and a validation set of 20 samples. The methodology of this study comprised the following steps: (1) The heavy metal Pb content was determined through field sampling, and the raw spectral reflectance for each sampling point was extracted from the hyperspectral images, followed by preprocessing of the anomalous spectra. (2) The preprocessed raw spectral bands were preselected using XGB–Boruta to obtain the preselected feature bands. (3) The CR and CR–FD transformations were applied to the preselected bands to enhance the differences in spectral features between samples. (4) The Pearson method (PCC) was used to assess the correlation between lead (Pb) levels and the transformed spectral indices (CR and CR–FD). The correlation coefficients were tested for significance at a stringent threshold of p < 0.01. (5) The outcomes were combined to count the spectral bands exhibiting a highly significant correlation (p < 0.01), with the correlation coefficient threshold set at ±0.4. For each band, the spectral transformation exhibiting the highest correlation was identified and selected, and the resulting data were used as spectral input variables in the modeling. (6) The final optimal spectral response features were incorporated into five estimation models, including extreme gradient boosting (XGBoost), gradient boosting decision tree (GBDT), extreme learning machine (ELM), random forest (RF), and partial least squares regression (PLSR), for modeling soil heavy metal Pb content. The Optuna optimization method was adopted to optimize the hyperparameters of each model. Subsequently, the effectiveness of these models in predicting Pb concentrations within the research area was determined by assessing their performance. (7) The spatial distribution of soil heavy metal Pb levels within the research area was mapped using the optimal estimation model, and the potential sources of contamination and the distribution pattern were analyzed.
Figure 3 shows the steps.
2.7. Estimation Models
2.7.1. PLSR
PLSR is a statistical approach for multiple regression analysis that offers significant advantages in handling multiple independent and dependent variables. It is suitable for addressing multicollinearity and high-dimensional datasets. When datasets contain many independent variables (features), traditional multiple linear regression models may suffer from multicollinearity and the high dimensionality of the data. PLSR alleviates this issue by reducing data dimensionality, enabling more effective predictive models.
2.7.2. RF
RF is an ensemble algorithm based on decision trees, commonly applied to both regression and classification problems [
51]. RF is suitable for handling complex, high-dimensional, and large-scale datasets, providing robust predictive results. For regression problems, the RF can capture non-linear relationships in data and deal with outliers and noise. In this study, the RF model was trained multiple times by Optuna optimization to determine the optimal values for hyperparameters such as max_features, min_samples_split, max_depth, and n_estimators.
2.7.3. ELM
The ELM represents a novel approach to training single-hidden-layer feedforward networks. The algorithm is distinguished by its random initialization of the input-layer weights and the thresholds of the hidden-layer neurons. These weights and thresholds remain constant throughout the training process, eliminating the need for iterative adjustments. The sole parameter to be determined is the number of neurons in the hidden layer, which ensures that a unique optimal solution is obtained. Beyond the predefined network configuration, there is no requirement for manual parameter adjustments [52]. Based on previous studies, this study used the sigmoid function as the activation function for ELM. The number of hidden-layer nodes was tuned from 50 to 200, with each ELM iteration running for 600 cycles [22].
2.7.4. GBDT
The GBDT is an iterative decision tree algorithm that belongs to the boosting category of ensemble learning; it improves overall predictive performance by iteratively constructing decision trees that correct the errors of the previous models. Based on the principle of gradient boosting, in each iteration GBDT trains a new decision tree by fitting the negative gradient of the loss function, and the trees are combined to update the model, reducing the overall loss and enhancing prediction performance [
53]. For the research, we set the optimization ranges of hyperparameters such as max_depth, learning_rate, and n_estimators. The GBDT model was trained 600 times by Optuna optimization to select the parameter combination with the lowest RMSE and construct the optimal model.
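To make the phrase "fitting the negative gradient of the loss function" concrete, the following toy sketch boosts depth-one trees (stumps) on a single feature under squared-error loss, where the negative gradient is simply the residual. It is a deliberately simplified illustration, not the GBDT configuration used in the study, which relies on deeper trees over many spectral features; all function names are illustrative.

```python
def fit_stump(x, residuals):
    """Best single-threshold split of one feature under squared error."""
    candidates = sorted(set(x))[:-1]  # splitting above the maximum is useless
    if not candidates:
        raise ValueError("the feature needs at least two distinct values")
    best = None
    for threshold in candidates:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbdt(x, y, n_estimators=50, learning_rate=0.1):
    """Gradient boosting with stumps; returns (base value, stumps, rate)."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is y - F.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        stumps.append((threshold, left_value, right_value))
        pred = [pi + learning_rate * (left_value if xi <= threshold else right_value)
                for xi, pi in zip(x, pred)]
    return base, stumps, learning_rate


def predict(model, xi):
    """Sum the shrunken stump outputs on top of the base value."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(left if xi <= threshold else right
                                      for threshold, left, right in stumps)
```

Each round shrinks the remaining residual by the learning rate, which is why `n_estimators` and `learning_rate` are tuned together in practice.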
2.7.5. XGBoost
XGBoost is an improved machine learning algorithm based on gradient boosting decision trees (GBDT) and is commonly used in regression, classification, and ranking problems. XGBoost performs well on structured data, high-dimensional data, nonlinear relational data, and unbalanced data. The fundamental principle involves aggregating the outcomes of multiple decision trees (weak learners) to form a final output (strong learner) [
54]. It supports parallel processing and distributed computing, allowing for effective control of model complexity. We set the optimization ranges of hyperparameters such as colsample_bytree, min_child_weight, max_depth, learning_rate, and n_estimators by Optuna optimization. The parameter settings for the XGBoost model were tested cyclically for different values. The optimal model hyperparameters at the minimum RMSE were determined.
2.7.6. Model Comparison
Selection of a suitable model for soil heavy metal estimation is essential. To ensure rationality, the advantages and disadvantages of PLSR, RF, ELM, GBDT, and XGBoost used in this study are presented, as shown in
Table 2. R² was used to assess the accuracy of the soil heavy metal estimation models.
2.8. Algorithm Assessment Approach
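The assessment indicators include the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE) to assess the stability and prediction accuracies of prediction models in this study. The formulas are as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

Here, $\bar{y}$ denotes the mean of the true values; $y_i$, $\hat{y}_i$, and $n$ represent the true values, forecasted values, and number of samples.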
R² is one of the common measures of regression model fit, and it typically ranges between 0 and 1. It indicates how well the model explains variations in the dependent variable. The closer R² is to 1, the better the model is in terms of explanatory power; the closer it is to 0, the weaker the explanatory power. The MAE reflects the average level of the model’s prediction error and directly indicates model performance; lower values typically signify better prediction accuracy. The RMSE indicates the average deviation between predicted and true values. Compared to the MAE, the RMSE is more sensitive to large differences between predicted and real values. For model validation, a higher R² and a lower RMSE and MAE indicate higher modeling reliability and stability and good predictive ability.
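The three indicators can be computed directly from paired true and predicted values, as in this minimal sketch. The function name `regression_metrics` and the dictionary keys are illustrative. Note that when computed this way on a validation set, R² falls below 0 if the model predicts worse than the mean of the true values.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R², RMSE, and MAE for paired true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual and total sums of squares.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are equal")
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }
```

In the small example `regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])`, a single error of 2 gives an MAE of 0.5 but an RMSE of 1.0, which shows the greater sensitivity of the RMSE to large deviations.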
3. Results and Discussion
3.1. Measured Data of Heavy Metal Pb
Descriptive statistics of the measured heavy metal Pb content in farmland soils at the sample sites in the Xiong’an New Area were calculated with SPSS 27 software. The obtained values were compared against the risk control values for soil heavy metals. The results of the analysis are listed in
Table 3. The criterion is based on the national standard (GB15618-2018) in China (Regulation 2018) [
55]. The Pb content varied widely between samples, with the maximum value reaching 164.12 mg/kg. Some soil samples exceeded the risk screening value for soil Pb contamination, posing a potential risk of crop contamination. The coefficient of variation serves as a metric to characterize the distribution of the sample data. The table shows that Pb exhibited strong variability, indicating high dispersion of the heavy metal Pb and inhomogeneous enrichment among the soil samples. Therefore, it is necessary to monitor heavy metal Pb content by remote sensing in the farmland of the study area to obtain its distribution in a timely and rapid manner.
3.2. Spectral Response Feature Selection
In this study, the XGB–Boruta algorithm was utilized during the characteristic preselection step to identify and preselect relevant characteristic bands for the ZY-1-02E hyperspectral imagery. The outcomes of this preselection process are detailed in
Table 4.
Table 4 shows that 28 bands were retained after feature preselection, accounting for approximately 18.54% of the total number of bands. Numerous bands not sensitive to heavy metals were excluded.
Despite the application of the XGB–Boruta algorithm for initial spectral band selection, it was observed that the selected bands exhibited a weak correlation with the heavy metal Pb (
Figure 4). Most of the correlation coefficients for these bands had absolute values within the range of 0.2 to 0.3 and were predominantly negative. This weak correlation could compromise the precision of subsequent modeling, necessitating further processing of the spectral features to enhance the sensitivity of the characteristic spectra to heavy metal Pb.
Therefore, we utilized the combined spectral transformation methods using continuum removal (CR) and first-order derivative based on continuum removal (CR–FD) to improve the correlation of the preselected results.
Figure 5 shows that the correlation of the spectral indices with heavy metal Pb improved to varying degrees after the spectral transformations. After the first-order derivative transformation, the correlation coefficients with Pb content alternated between positive and negative values, with more pronounced valleys. However, not all enhanced bands contributed substantially to the modeling. Therefore, the Pearson correlation coefficients (PCC) between the spectral indices and heavy metal Pb were calculated separately for the CR and CR–FD transformations. The characteristic bands whose correlation coefficients had absolute values larger than 0.4 (the extremely significant level) were selected, and all results were combined to construct the feature bands for modeling. This method combines the advantages of feature screening algorithms and optimized spectral indices, effectively highlighting the spectral response information for heavy metal Pb.
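The thresholding step described above can be sketched as follows. The sketch assumes that each transformation yields a samples-by-bands matrix and that the results are combined as a list of (transformation, band index) pairs; the function names and the small deterministic arrays are hypothetical.

```python
import numpy as np

def pearson_per_band(spectra, target):
    """Pearson r between each band (column) of spectra and the target vector."""
    Xc = spectra - spectra.mean(axis=0)
    yc = target - target.mean()
    return (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

def select_features(transformed, target, threshold=0.4):
    """Collect (transformation, band index) pairs with |r| above the threshold."""
    target = np.asarray(target, dtype=float)
    features = []
    for name, spectra in transformed.items():
        r = pearson_per_band(np.asarray(spectra, dtype=float), target)
        for band in np.flatnonzero(np.abs(r) > threshold):
            features.append((name, int(band)))
    return features

# Deterministic toy data: one informative band per transformation
pb_toy = np.arange(10, dtype=float)
zigzag = np.tile([1.0, -1.0], 5)            # weakly correlated with pb_toy
toy_spectra = {
    "CR": np.column_stack([2.0 * pb_toy + 1.0, zigzag]),
    "CR-FD": np.column_stack([zigzag, -pb_toy]),
}
features_demo = select_features(toy_spectra, pb_toy)
```

Taking the absolute value before thresholding keeps strongly negative bands, which matters here because the derivative spectra alternate in sign.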
Table 5 presents the optimal set for heavy metal Pb of the feature bands screened by the XGB–Boruta–PCC method. The feature selection process, specifically through PCC, reduced the number of feature bands to 13, accounting for approximately 8.61% of the total band count.
The results indicate that the wavelengths selected for modeling heavy metal Pb were primarily clustered around 800–1000, 1100–1200, 1700, and 2087–2400 nm. Specifically, the core bands for estimating Pb content in farmland soils fell within the absorption ranges of iron oxides and clay minerals: iron oxides absorb strongly at 800–1000 nm, and clay mineral absorption is more pronounced at 1100–2500 nm [
26,
46]. This suggests that the selected bands capture the spectral response of the soil active components (iron oxides and clay minerals) that strongly adsorb heavy metal Pb, supporting the validity of the characteristic spectral bands obtained through the XGB–Boruta–PCC algorithm and giving the selection a more mechanistic interpretation. Reducing the feature set to 13 bands also substantially cuts redundant information and computation in the later estimation modeling.
The selected bands thus identify the characteristic spectral regions of the soil, highlighting the practicality and effectiveness of the proposed method. The XGB–Boruta–PCC hyperspectral feature screening algorithm proposed in this paper eliminates redundant spectral information, reduces computation, and iterates quickly, while retaining the integrity and the original physical explanatory power of the spectral data, meeting the desired goals. Compared to approaches with inherent randomness, the XGB–Boruta–PCC algorithm is more interpretable. Future research could integrate it with other regression models to make it compatible with different modeling methods. Depending on the problem at hand, combining feature selection with the subsequent prediction model can improve overall prediction performance, streamline the model, and raise modeling accuracy.
3.3. Model Evaluation
In this section, the spectral response characteristics chosen by the XGB–Boruta–PCC algorithm served as independent variables, while heavy metal Pb concentrations acted as dependent variables. The training dataset was utilized for model construction and parameter optimization, whereas the validation dataset was applied for assessing model accuracy and generalization. In addition, we utilized PLSR, RF, ELM, GBDT, and XGBoost to estimate soil heavy metal Pb concentration within the research region. The comparative accuracies of these models are detailed in
Table 6.
Comparing the five inversion models built on the XGB–Boruta–PCC features, we observed that PLSR achieved a higher estimation accuracy (R2) on the training data, whereas its R2 on the validation data decreased significantly, indicating weak predictive ability for unseen data. This implies that the linear PLSR model was prone to overfitting when estimating heavy metal Pb content in this research, and its reduced generalization ability led to poor prediction. In contrast, RF, ELM, GBDT, and XGBoost achieved better estimation accuracy (R2 > 0.6) on both the training and validation sets, with acceptable RMSE and MAE values. Their advantages in accuracy and robustness show that these models are more reliable tools for predicting heavy metal Pb levels. This also reflects that, compared to the linear PLSR method, the nonlinear GBDT and XGBoost algorithms can better describe complex nonlinear relationships through parameter optimization and adjustment of the decision tree ensemble, resulting in improved prediction accuracy. While RF and ELM can also handle nonlinear problems, their nonlinear modeling ability is usually inferior to that of GBDT and XGBoost; thus, their prediction accuracy is not optimal.
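For reference, the three accuracy indicators compared above can be computed as in the following sketch, which assumes the usual definitions of R2 (one minus the ratio of residual to total sum of squares), RMSE, and MAE. The function name and the demonstration values are hypothetical and are not taken from Table 6.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, RMSE, and MAE for measured and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = (resid ** 2).sum()                      # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "RMSE": float(np.sqrt((resid ** 2).mean())),
        "MAE": float(np.abs(resid).mean()),
    }

# Made-up measured and predicted Pb values (mg/kg), for illustration only
metrics_demo = regression_metrics([10.0, 20.0, 30.0, 40.0],
                                  [12.0, 18.0, 33.0, 37.0])
```

Computing the same dictionary on the training and validation sets and comparing the two R2 values is one simple way to expose the kind of overfitting described for PLSR.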
To further evaluate the predictive capabilities for the XGBoost model in comparison to conventional machine learning models under the XGB–Boruta–PCC algorithm, the models constructed by RF, ELM, and GBDT using the same training and validation sets were selected for estimating heavy metal Pb concentration in farmland soil within the research region. Their inversion scatter plots are presented in
Figure 6. The ELM model showed a more dispersed distribution, with large deviations between predicted and measured values. The RF and GBDT models tended to underestimate Pb content, performing better at low values but showing greater dispersion at high values, which contributed to their high RMSE, though within an acceptable range. Compared to the other three models, the estimated and measured values of the XGBoost model were more concentrated around the 1:1 line, indicating a good model fit. Therefore, the modeling and estimation ability of the XGBoost regression model was significantly better than that of the other three traditional models: it reached the maximum R2 (0.82) and the minimum RMSE (11.58) and MAE (9.89). This indicates that the XGB–Boruta–PCC algorithm effectively obtained characteristic spectral indices with high contributions to XGBoost modeling. Applied to the XGBoost model, these indices gave better estimation accuracy than the other traditional machine learning models, with rapid computation and stronger generalization ability, enabling effective prediction of heavy metal Pb concentration in farmland soil.
To further assess the stability and generalization ability of the XGBoost model, a four-fold cross-validation was conducted. The results are presented in
Figure 7. The R2 values were relatively high, ranging approximately from 0.75 to 0.85, indicating that the XGBoost model could effectively explain the variability in the data. The median value was close to 0.80, demonstrating a good fit. The distribution of RMSE was relatively stable, with some fluctuations; however, these variations remained within a reasonable range. Despite this, there is still potential for further improvement of the model.
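The four-fold procedure can be sketched as follows. To keep the example self-contained, an ordinary least squares regressor stands in for the XGBoost model, and the data are synthetic; the function names, the random seed, and the shuffling strategy are assumptions of this sketch.

```python
import numpy as np

def k_fold_indices(n_samples, k=4, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def fit_predict_ols(X_train, y_train, X_test):
    """Stand-in regressor (ordinary least squares), NOT the XGBoost model."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def cross_validate(X, y, fit_predict=fit_predict_ols, k=4, seed=0):
    """Return per-fold R2 and RMSE for a fit_predict(X_tr, y_tr, X_te) callable."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        resid = y[test_idx] - pred
        ss_tot = ((y[test_idx] - y[test_idx].mean()) ** 2).sum()
        scores.append({
            "R2": float(1.0 - (resid ** 2).sum() / ss_tot),
            "RMSE": float(np.sqrt((resid ** 2).mean())),
        })
    return scores

# Synthetic demo data with a nearly linear relationship
rng_cv = np.random.default_rng(1)
X_cv = rng_cv.normal(size=(80, 3))
y_cv = X_cv @ np.array([2.0, -1.0, 0.5]) + rng_cv.normal(scale=0.1, size=80)
cv_scores = cross_validate(X_cv, y_cv)
```

Because `cross_validate` accepts any `fit_predict` callable, a wrapper around a tuned gradient-boosted regressor could be substituted without changing the fold logic.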
Overall, the accuracy evaluation shows that XGB–Boruta–PCC effectively screened the characteristic spectral indices, and that integrating them with the XGBoost regression model increased the predictive ability of the original inversion model and enhanced the robustness of the algorithm. Additionally, the viability and stability of using ZY-1-02E hyperspectral imagery for heavy metal content estimation were confirmed, providing an important reference for subsequent related research and applications.
3.4. Mapping Heavy Metal from ZY-1-02E Hyperspectral Satellite Data
To estimate heavy metal Pb content in farmland soil over large areas, the RF classification method was used in this study to extract farmland soil pixels from the remotely sensed images. Then, based on the spectral feature indices selected by XGB–Boruta–PCC and the extracted farmland soil pixels, the best prediction model was applied across the farmland of the study area. This model took ZY-1-02E hyperspectral imagery as input to map the spatial distribution of heavy metal Pb in farmland soil in Xiong’an New Area (
Figure 8).
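The mapping step can be sketched as follows: a trained model is applied only to the pixels flagged as farmland soil, and all other pixels are left empty. The sketch assumes an image cube of shape (bands, rows, cols) that has already been transformed in the same way as the training spectra; `predict_map`, its arguments, and the toy arrays are hypothetical.

```python
import numpy as np

def predict_map(cube, soil_mask, band_idx, predict):
    """Apply predict() to soil pixels of a (bands, rows, cols) cube; NaN elsewhere."""
    bands, rows, cols = cube.shape
    pixels = cube.reshape(bands, -1).T           # one row of band values per pixel
    flat_mask = np.asarray(soil_mask, dtype=bool).reshape(-1)
    out = np.full(rows * cols, np.nan)
    # Only the selected feature bands of the soil pixels are passed to the model
    out[flat_mask] = predict(pixels[flat_mask][:, band_idx])
    return out.reshape(rows, cols)

# Toy example: 3 bands, 2 x 2 pixels, and a placeholder "model" that sums the bands
toy_cube = np.arange(12, dtype=float).reshape(3, 2, 2)
toy_mask = np.array([[True, False], [False, True]])
toy_map = predict_map(toy_cube, toy_mask, [0, 2], lambda f: f.sum(axis=1))
```

In practice the `predict` argument would be the prediction method of the fitted regression model, and the mask would come from the land cover classification.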
The XGBoost estimates from the ZY-1-02E hyperspectral imagery indicate that Pb content was low across most of the research area. However, areas in the southwest and parts of the west, primarily concentrated in Laohetou Town, along the Tanghe River, and near the factories, exceeded the soil pollution risk screening value and the heavy metal standard.
To assess the stability of the spatial predictions made by the XGBoost model, this study generated a spatial prediction from each of the four models obtained in the four-fold cross-validation and mapped the pixel-wise standard deviation of these predictions (
Figure 9). Overall, most regions with Pb prediction values exhibited low standard deviations, as low as 0.042 mg/kg, indicating that the predicted Pb content varied little among the models and that the predictions were relatively stable. Regions with higher standard deviations were primarily located in the southwestern part, where Pb content exhibited greater variation and higher spatial heterogeneity, though still within a reasonable range. These areas were generally characterized by higher soil heavy metal content, so the larger deviations may be attributed to insufficient training on the limited high-value data within the sample. Future studies will collect more data from regions with high levels of heavy metals to expand the sample database and enhance the prediction ability for high-value areas.
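The pixel-wise stability measure can be sketched as follows, assuming one prediction map per model and the population standard deviation (`ddof=0`); the function name and the toy maps are hypothetical.

```python
import numpy as np

def pixelwise_std(prediction_maps):
    """Per-pixel standard deviation across a list of (rows, cols) prediction maps."""
    stack = np.stack(prediction_maps, axis=0)   # shape: (n_models, rows, cols)
    # NaN pixels (for example, non-soil areas) stay NaN in the result
    return np.std(stack, axis=0)

# Four toy maps (1 x 2 pixels): the models agree on pixel 0 and disagree on pixel 1
toy_maps = [np.array([[10.0, v]]) for v in (10.0, 12.0, 14.0, 16.0)]
std_demo = pixelwise_std(toy_maps)
```

Pixels where the models agree give a standard deviation of zero, while disagreement among the models raises the value, which is how unstable regions stand out in such a map.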
The results indicate that heavy metal content estimated by the XGBoost model adequately represented the inhomogeneity and complexity of the spatial distribution for heavy metal Pb content. The estimation is more continuous and aggregated, providing an effective and feasible large-scale technical method for monitoring heavy metal Pb contamination in the farmland of Xiong’an New Area.
As shown in the landscape in
Figure 10, Laohetou Town previously housed numerous non-ferrous metal processing factories and recycling plants. Industrial solid waste pollutants from long-term metal smelting and other industrial activities may have affected the surrounding farmland through dry and wet atmospheric deposition, industrial sewage irrigation, and biological enrichment, posing a potential risk of heavy metal Pb pollution for soil and crops (areas C and F). Meanwhile, high concentrations of heavy metal Pb were enriched in the farmland surrounding densely populated areas, where human activities such as vehicle emissions, improper waste disposal, and pesticide and fertilizer application can lead to high levels of heavy metals (areas B and D). This finding is consistent with field conditions. We also found that heavy metal levels increased with proximity to roads, probably because heavy metal pollution on both sides of the roads was incompletely treated (area A). Parts of the Tanghe River watershed showed heavy metal Pb pollution, as the river was previously a discharge site for enterprise and residential wastewater (area E). Field investigations revealed that crops (mainly wheat) grown in the river channel had curled leaves and stunted growth, and soils in agricultural fields along the watershed are prone to heavy metal enrichment [
56]. The polluted area in the southwest is primarily distributed in the farmland around the non-ferrous metal processing plant, where crop growth is poor, and the heavy metal surface pollution will not degrade and disappear in a short time.
Based on the analysis of heavy metal Pb content distribution for the research region in
Figure 8 and
Figure 10, it can be inferred that industry and human activities are potential sources of heavy metal Pb contamination in the farmland soil of Xiong’an New Area. This study further supports the idea of direct industrial pollution. The massive expansion of factories and the occupation of large amounts of agricultural land have led to increasingly severe crop contamination, with direct sewage irrigation being one of the causes of heavy metal contamination. Moreover, when heavy metals accumulated in the soil exceed its natural capacity to contain and purify them, soil pollution becomes increasingly serious. Therefore, the local government needs to focus on heavy metal Pb contamination in farmland surrounding factory areas to prevent its further expansion within Xiong’an New Area, and to effectively control the potential contamination of farmland and irrigation water by factory discharge, so as to promote sustainable and green agricultural management in Xiong’an New Area.
4. Conclusions
In this paper, we discussed the methodology and feasibility of utilizing ZY-1-02E hyperspectral imagery to estimate heavy metal concentration in farmland. The XGB–Boruta–PCC algorithm integrated with the XGBoost model performed well on datasets with high variability and uneven spatial distribution. The XGB–Boruta–PCC algorithm achieved a two-stage dimensionality reduction of the hyperspectral data while preserving their integrity and original physical explanatory power. We used the XGB–Boruta–PCC algorithm with PLSR, RF, ELM, GBDT, and XGBoost models to analyze and model the relationship between heavy metal Pb and the characteristic spectral indices in farmland. Among these, the Optuna-optimized XGBoost model achieved the best accuracy and stability, demonstrating its strong representation learning ability in addressing typical problems such as small sample data, information redundancy, and nonlinearity. Furthermore, it compensates for the limitations of traditional contamination investigation methods, making it highly operable. This provides an effective approach to estimating heavy metal levels in farmland based on ZY-1-02E AHSI images and offers real-time, accurate, and large-scale information on Pb contamination status in the farmland soil of Xiong’an New Area.
In addition, the spatial distribution of heavy metal Pb content in Xiong’an New Area was analyzed using the estimation results. In the southwest and parts of the west of the research region, there were areas exceeding the screening value for soil contamination risk, with block distribution and enrichment phenomena, indicating that the soil–crop system might have been contaminated. Analysis of the contamination sources suggested that the production activities of heavy metal factories might be the direct cause of heavy metal Pb soil contamination in the surrounding farmland. Human activities, including the use of pesticides, sewage irrigation, and waste disposal, may be additional causes of heavy metal Pb contamination in farmland. Therefore, policymakers need to focus on the pollution of agricultural land around processing plants in the study area, such as those for metal smelting, wool spinning, and building materials. This includes rationalizing the use of fertilizers and pesticides, strengthening the regulation of pollution sources, and encouraging green production and development to protect the agroecological environment and ensure food security.