1. Introduction
Soil heavy metal contamination in abandoned mining areas poses a significant environmental concern and a severe human health risk. These areas, typically remnants of past industrial activities, face an increased risk of soil contamination owing to inadequate environmental remediation. Therefore, accurate prediction of soil contamination is a crucial preliminary step in treating it. However, detailed predictions are hampered by limited accessibility, safety concerns, and the high cost and time demands of extensive sampling. Furthermore, traditional methods for predicting soil contamination, being based on point-by-point surveys, cannot provide a comprehensive spatial understanding of extensive or environmentally sensitive areas.
Geostatistics has been used to model the distribution of heavy metals across contaminated areas. This approach captures the spatial correlations and variability within the soil data, thereby offering reliable results. However, geostatistics has limitations in interpreting complex and nonlinear data patterns. This limitation can be addressed with machine learning, which has become increasingly popular. Machine learning is efficient at interpreting data patterns and can complement geostatistics in generating predictive maps. These maps play a significant role in determining the scale and scope of soil contamination, which, in turn, influences decisions on further exploration or sampling. The reliability of these maps must be ensured, as they guide prioritized actions and investments on site. Therefore, the available sample data, which are often limited by various restrictions, must be used effectively when constructing these maps. Because the appropriate amount of data for a predictive map is difficult to determine, it is important to maximize utility while minimizing opportunity costs. Although producing highly reliable predictive maps from small amounts of sampled data is crucial, the techniques used to create these maps have inherent limitations, necessitating a thorough evaluation and analysis of the influencing factors to minimize errors and generate highly reliable maps [1].
Numerous studies have reported the use of geostatistical techniques [2,3,4] to interpolate values at unsampled points from sampling data. Many studies have been conducted in geographic information system (GIS) environments to create predictive maps by applying geostatistics, and many of these have employed Ordinary Kriging (OK) [5,6,7]. Cross-validation is frequently used to verify Kriging-based estimates [8,9,10]. Additionally, the Kriging application process offers numerous options and parameters that can affect the prediction results. For instance, Kim et al. [11] analyzed the effect of varying the lag distance on the prediction error when creating a semivariogram. However, a comprehensive analysis of how all the available options and parameters in Kriging techniques affect prediction errors is lacking, preventing the identification of the optimal selections.
Recently, machine learning techniques have gained attention for data prediction and analysis [12,13,14,15]. Cross-validation is also commonly performed in machine learning research, and hyperparameter tuning has been used to optimize the available options and parameters [16,17,18]. Recent studies have either compared the prediction performance (error) of geostatistics and machine learning techniques in interpolation [19,20] or combined both techniques for interpolation [21,22,23,24,25]. The code for the combined RF–OK model presented in this study is provided in Appendix A. Chun et al. [19] applied OK and an Artificial Neural Network (ANN) to predict subsurface profile information in an unbored region and found that the machine learning results were distributed in a similar pattern, although they performed worse than geostatistics. Pereira et al. [20] used OK and a Support Vector Machine (SVM) for mapping soil attributes. Their method was coded as a plug-in to GIS software, and three approaches were compared: OK, an SVM that used the attribute itself interpolated by Inverse Distance Weighting as a covariate (SVM1), and an SVM with additional covariates (SVM2). Szatmári and Pásztor [21] compared the digital soil mapping techniques Universal Kriging, Sequential Gaussian Simulation, Random Forest (RF) with Kriging, and Quantile Regression Forest to quantify uncertainty in surveying soil organic carbon stock. Chen et al. [22] applied OK, Geographically Weighted Regression, SVM, ANN, and hybrid approaches (Geographically Weighted Regression Kriging and ANN Kriging) to predict soil organic carbon and concluded that the hybrid approaches were promising. Su et al. [23] combined RF, OK, and Co-Kriging to estimate aboveground biomass; the validation results indicated that the combined model outperformed the single RF model. Song et al. [24] presented a hybrid geostatistical method (Extreme Learning Machine–OK) for estimating soil organic matter and predicting the spatial variability of its contents. It was compared against Simple Kriging, OK, Regression–OK, and ANN–OK, and the results confirmed that the proposed hybrid method showed superior performance. Hsu et al. [25] enhanced the Land Use Regression model with machine learning to estimate the spatial–temporal variations in benzene, toluene, ethylbenzene, and xylene concentrations. They compared Hybrid Kriging–Land Use Regression, Geographically Weighted Regression, RF, and Extreme Gradient Boosting and indicated that the models incorporating machine learning showed improved performance. However, studies combining machine learning and geostatistics to interpolate and predict heavy metal concentrations in the soil of abandoned mine areas are scarce.
The objective of this study was to create a soil contamination map for abandoned mine sites with a reduced prediction error, using an interpolation model that combines geostatistics and machine learning applied to field survey (sampling) data. To achieve this, three techniques were applied to interpolate sampled heavy metal concentrations in soil, and their prediction errors were compared: machine learning, geostatistics, and a combined model. Previous studies on estimating the distribution of soil contamination in abandoned mine areas typically relied entirely on geostatistical techniques, and a single approach has clear limitations. Prediction accuracy is usually improved by increasing the number of sampling data; in this study, by contrast, a combined machine learning and geostatistical model is used, from a methodological point of view, to generate a map with improved prediction accuracy.
2. Materials and Methods
Figure 1 illustrates the research methodology flow of this study. The yellow rectangular boxes represent the sequence of results derived from the analysis using each technique. The process began with an exploratory data analysis (EDA) of the heavy metal concentration data, followed by data preprocessing. Subsequently, a grid was established to define the options and parameters for both the machine learning model and the geostatistical techniques, which facilitated hyperparameter tuning to identify the optimal combination. The interpolation techniques utilized in this study include the RF model (indicated as number 1 in Figure 1), the OK technique (number 2), and the combined RF–OK model (number 3). The interpolation (prediction) errors of each model were compared and evaluated, and a map was generated based on the prediction results.
This study utilized ArcGIS Pro 3.2.1, a commercial GIS software package, for data visualization and analysis, and Python 3.11.5 was used for the machine learning analysis. Additionally, geostatistical analyses were performed using the Python library PyKrige 2.7 [26], which implements Kriging techniques for both two-dimensional and three-dimensional data.
2.1. Study Area and Data
The study area was a metal mine located in Gijang-gun, Busan, South Korea (35°18′31.36″ N, 129°13′25.56″ E). The mine was a major Cu producer from the 1930s to the 1940s, extracted various other mineral resources, and closed around 1990. However, the site was not properly remediated or treated after the mine was abandoned, resulting in heavy metal-laden mine drainage and large quantities of waste rock. High concentrations of Cu were observed in the areas surrounding the abandoned mine [27].
Figure 2 shows the distribution of the soil sampling locations near the metal mine used in this study. Concentrations at these locations were measured using a portable X-ray fluorescence spectrometer, and a total of 40 samples were collected.
2.2. EDA
Before preprocessing the 40 sampled data points, EDA was performed. Its primary purpose was to understand how the collected heavy metal data were distributed within the study area by identifying patterns in the dataset, detecting outliers, and verifying assumptions. The process involved the use of various visualization methods, such as histograms, scatter plots, and Q–Q plots, to investigate the frequency distribution of data, correlations between auxiliary variables, and normality. Basic statistical measures were used to calculate the minimum, maximum, mean, median, skewness, and standard deviation of the data to quantify the central tendency and variance of the data. To normalize the data and stabilize the variance, a log transformation was applied in cases of high variability. This enhanced the reliability and efficiency of the subsequent analyses.
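The EDA steps above can be sketched in Python; the lognormal sample below is a synthetic stand-in for the 40 survey points (an assumption for demonstration, not the actual data), and `skewness` is a hypothetical helper implementing the standard sample skewness:

```python
import numpy as np

def skewness(x):
    # Sample skewness: mean of cubed standardized deviations.
    x = np.asarray(x, dtype=float)
    return np.mean(((x - x.mean()) / x.std()) ** 3)

rng = np.random.default_rng(0)
# Synthetic right-skewed "concentration" data standing in for the 40 samples.
conc = rng.lognormal(mean=4.0, sigma=1.0, size=40)

# Basic statistics quantifying central tendency and variance.
print(conc.min(), conc.max(), conc.mean(), np.median(conc), conc.std())
print(skewness(conc))        # strongly positive: right-skewed raw data

# Log transformation to normalize the data and stabilize the variance.
log_conc = np.log(conc)
print(skewness(log_conc))    # much closer to 0 after the transform
```

Histograms and Q–Q plots of `conc` and `log_conc` (e.g., via matplotlib) would show the same effect graphically.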
2.3. Criteria of Error Evaluation
A common way of partitioning a dataset is to hold out 20% of the data as test data and to split the remaining 80% into 90% training data and 10% validation data. This partitioning is effective when the dataset is sufficiently large. However, with limited data, it can yield a biased validation accuracy for the model. To overcome this problem, k-fold cross-validation is widely used. This method divides the training data into K equal-sized folds; K−1 of these folds are used for training, whereas the remaining fold is used for validation. This process is repeated K times so that each fold is used as validation data exactly once, thereby reducing bias.
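Such a K-fold split (here K = 5) can be sketched with scikit-learn's `KFold`; the toy array is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(20, 2)  # 20 toy samples with 2 features each

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Each of the 5 folds serves as validation data exactly once.
    print(f"fold {fold}: train={len(train_idx)} validate={len(val_idx)}")
```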
A quantitative assessment of the predictive performance of a model is crucial. Therefore, this study employed the root mean squared error (RMSE) as a metric. RMSE is the square root of the average of the squared differences between the actual and predicted values (Equation (1)). This metric directly measures the average magnitude of the predicted errors. A lower RMSE indicates a lower prediction error and higher model accuracy. However, one limitation of the RMSE is its sensitivity to large errors, as it assigns greater penalties to larger errors. Sensitivity to outliers is an important consideration when interpreting RMSE results.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(z(x_i) - \hat{z}(x_i)\bigr)^2} \qquad (1)$$
where $z(x_i)$ represents the actual data value collected at location $x_i$, $\hat{z}(x_i)$ denotes the predicted data value at location $x_i$, and $n$ indicates the number of data points used in the prediction.
The coefficient of determination (R2) was used to compensate for the shortcomings of the RMSE metric. R2 measures the proportion of variance in the dependent variable that is predictable from the independent variables, reflecting the overall fit of the predictive model to the actual data (Equation (2)). In addition, R2 serves as an indicator of how well the model explains the data, with values closer to 1 indicating a better fit between the model predictions and the actual data. However, a high R2 value does not necessarily guarantee the reliability of the predictive model; the predicted data must be graphically represented to verify the actual distribution. Furthermore, the effectiveness of R2 diminishes when the data exhibit nonlinear relationships.
$$R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}} \qquad (2)$$
where SST is the total sum of squares, and SSR is the residual sum of squares.
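Equations (1) and (2) translate directly into code; the toy measured/predicted values below are assumptions for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Equation (1): square root of the mean squared difference.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # Equation (2): R2 = 1 - SSR / SST.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ssr = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    sst = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - ssr / sst

y_true = np.array([100.0, 150.0, 200.0, 250.0])  # e.g., measured mg/kg
y_pred = np.array([110.0, 140.0, 210.0, 240.0])  # predicted mg/kg
print(rmse(y_true, y_pred))  # 10.0
print(r2(y_true, y_pred))    # 0.968
```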
In this study, grid search cross-validation (GridSearchCV) was used to optimize the model. GridSearchCV experiments were conducted using various hyperparameter combinations to identify the optimal set. Hyperparameters are values set before model training begins, and their combinations are defined in a grid format to evaluate the model's performance. For example, with two hyperparameters, the model's performance is tested for every combination by varying one while holding the other constant, and vice versa. This systematic approach enables an extensive search over a specified parameter space, resulting in the identification of the best parameter combination for the model.
In addition, k-fold cross-validation was incorporated into the GridSearchCV process. K was set to 5, meaning that the data were divided into five subsets, and cross-validation was performed on each subset. RMSE and R2 metrics were used to evaluate the performance of each hyperparameter combination. This approach facilitated the selection of the optimal hyperparameter combination, which was then applied to the entire dataset to construct a new model. The final model was optimized and used for further prediction or evaluation.
2.4. RF
Several machine learning models are available for predicting the distribution of soil contaminants. However, for regression targets such as heavy metal concentrations in soil, an RF model tends to have the best predictive performance [28], which is why RF was selected among the machine learning models in this study. RF is an ensemble method commonly used for both regression and classification. It constructs multiple decision trees and outputs either a class (for classification tasks) or an average prediction (for regression tasks). In classification tasks, RF outputs the class that receives the majority vote among all the decision trees, whereas for regression, RF averages the predictions of all the decision trees [29]. This approach improves predictive accuracy and controls overfitting by aggregating the results from multiple trees, with each tree contributing to the final outcome.
RF models typically make predictions for unsampled locations using patterns learned from training data. In this study, auxiliary data, including elevation, slope, aspect, curvature, and flow accumulation, were added to the location coordinates (latitude and longitude) of the 40 sampling data points when the RF model was applied. The RF model trained with data from 40 sampling locations was then applied to the location coordinates and five types of auxiliary data for the entire area. This enabled the prediction of heavy metal concentrations across the entire sampling region.
In the RF model, hyperparameters, such as the number of trees in the forest or the maximum depth of each tree, significantly impact the model’s predictive performance. Therefore, this study optimized the model by adjusting the available options and parameters using GridSearchCV. A grid was defined to set the options and parameters, and GridSearchCV was used to determine the optimal choices. This process was repeated until the RMSE was minimized. The number of trees in the forest (n_estimators) increased from 10 to 300 in intervals of 50, whereas the max_depth increased from 2 to 10 in steps of 2, along with the option of none. For the min_samples_split parameter, the value increased from 2 to 10. Similarly, the min_samples_leaf parameter was set to a range of 1 to 5. The bootstrapping option was tested for both True and False values.
Table 1 summarizes the defined grid of options and parameters for the RF model.
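A sketch of this tuning with scikit-learn, assuming the grid ranges stated above; the synthetic 40-point dataset and the reduced `small_grid` actually fitted are assumptions to keep the example fast (the full `param_grid` works the same way):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Grid following the ranges described in the text (cf. Table 1).
param_grid = {
    "n_estimators": list(range(10, 301, 50)),     # 10, 60, ..., 260
    "max_depth": list(range(2, 11, 2)) + [None],  # 2, 4, ..., 10, plus None
    "min_samples_split": list(range(2, 11)),      # 2 .. 10
    "min_samples_leaf": list(range(1, 6)),        # 1 .. 5
    "bootstrap": [True, False],
}

# Synthetic stand-in for the 40 samples: lon, lat plus 5 terrain covariates.
rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 7))
y = 100.0 * X[:, 0] + 50.0 * X[:, 2] + rng.normal(scale=5.0, size=40)

# Reduced sub-grid for brevity.
small_grid = {"n_estimators": [10, 60], "max_depth": [2, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    small_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # select the lowest RMSE
)
search.fit(X, y)
print(search.best_params_)
```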
2.5. OK
Kriging is one of the most popular geostatistical interpolation techniques and includes several variants, such as Simple Kriging, Universal Kriging, and OK. Simple Kriging minimizes the error variance but assumes a known, constant mean; when this assumption does not hold, the estimates are biased and the assumed mean does not correspond to the population mean. Universal Kriging is often used when the sampling data show a trend, that is, when the mean varies with location. Conversely, OK is commonly used to interpolate spatial data collected from arbitrary regions, providing unbiased estimates while also minimizing the variance; it is therefore referred to as the best linear unbiased estimator (BLUE). This is why OK was chosen among the geostatistical techniques in this study. OK assumes that the unknown population mean is a fixed value and bases the estimation formula on an unbiasedness condition. The fundamental prediction equation for OK is expressed in Equation (3).
$$\hat{z}(x_0) = \sum_{i=1}^{n} \lambda_i \, z(x_i) \qquad (3)$$
where $\hat{z}(x_0)$ represents the predicted value, $n$ is the number of sampled data, $\lambda_i$ is the weight at location $x_i$, and $z(x_i)$ is the value of the sampled data at location $x_i$.
OK implementation relies primarily on the construction and analysis of a semivariogram, which represents the degree of spatial correlation within a dataset over various distances. To construct a semivariogram, key variogram parameters, such as the Nugget, Sill, and Range, are estimated. These parameters significantly influence the weights assigned to the data points in the Kriging model, making their accurate estimation crucial.
Table 2 lists the options and parameters that could be set in OK for this study. The variogram model was set to three different types, and the coordinate-type options included both Euclidean and geographic options. The number of lags ranged from 2 to 20, increasing in steps of two. Weights were categorized as False or True, with the True category further subdivided into increments of 0.1, ranging from 0.1 to 1.0. Finally, the number of closest points used in the estimation, which represents the number of nearest neighboring points, was incrementally increased from 1 to 20. Analyzing the semivariogram is vital, because it provides important information for selecting an appropriate model based on the inherent spatial correlation structure of the data.
2.6. Combined RF–OK Interpolation Model
The process of combining the two techniques involved three steps: prediction using the RF model, application of OK to the residuals of the RF prediction, and summation of the RF model predictions with the OK estimates of the RF residuals. This approach combines the strengths of RF and OK. First, the predicted values were calculated using the RF model. Then, the differences between the predicted and actual values were taken as the residuals of the RF model (Equation (4)). RF was used to capture complex patterns in the data, whereas OK was used to adjust for local discrepancies not captured by the RF model. This enhances the overall accuracy and reliability of the predictions.
$$r(x_i) = z(x_i) - \hat{z}_{\mathrm{RF}}(x_i) \qquad (4)$$
where $r(x_i)$ represents the RF residual at location $x_i$, $z(x_i)$ is the sampled data value at location $x_i$, and $\hat{z}_{\mathrm{RF}}(x_i)$ is the RF model prediction at location $x_i$.
The residuals calculated in this way were subjected to hyperparameter tuning following the previously defined options and parameter grids for OK. This involved searching for the optimal choices for the RF model residuals and conducting OK with the optimal options and parameters obtained. This spatializes the RF residuals to account for spatial autocorrelation not captured by the RF model, thus compensating for the shortcomings of both techniques. Subsequently, a model combining the two techniques was developed by summing the prediction values for the entire study area calculated using the trained RF model with the residual prediction values for the entire study area obtained by applying OK to the RF model's prediction residuals. This combined methodology is mathematically represented by Equation (5).
$$\hat{z}_{\mathrm{RF\text{-}OK}}(x_0) = \hat{z}_{\mathrm{RF}}(x_0) + \hat{r}_{\mathrm{OK}}(x_0) \qquad (5)$$
where $\hat{z}_{\mathrm{RF\text{-}OK}}(x_0)$ represents the RF–OK model prediction at location $x_0$, $\hat{z}_{\mathrm{RF}}(x_0)$ is the RF model prediction at location $x_0$, and $\hat{r}_{\mathrm{OK}}(x_0)$ denotes the OK estimation of the RF residual at location $x_0$.
4. Discussion
To evaluate the quantified predictive performance of the machine learning–geostatistics combination model proposed in this study, a search of existing studies was conducted. However, owing to the lack of research in the same application category, comparisons were made with the combined models from other applications discussed in the Introduction (Table 4). For both metrics, relative rather than direct comparisons were made; in the case of the RMSE, the values differ widely because the error scales with the magnitude of the predicted data values. Nevertheless, compared with the other studies, the RMSE of the combined machine learning and geostatistics model is markedly lower relative to the corresponding single techniques. Similarly, the R2 results show that the hybrid model performs markedly better than in the other cases. However, note that the three compared cases were based on large amounts of data and separated the training and validation data. In contrast, this study was conducted with a small dataset (40 points), learning from 40 points and predicting 40 points with auxiliary variables (RF) or interpolating based on the sampling data (OK). Therefore, further research should be conducted with a larger amount of sampling data.
In this study, the anisotropy option was not used when applying the method to the heavy metal concentration data. The PyKrige library supports this feature through anisotropy scaling and anisotropy angle settings; however, it is only valid when the coordinate type is set to Euclidean and is not applicable to geographic coordinates. Consequently, in areas where anisotropy is prominent only in specific regions, regions with and without significant anisotropy should be differentiated to conduct a thorough analysis. In addition, model performance can be further improved by using additional auxiliary data. The auxiliary variables employed in this study are limited to those derivable from elevation; higher prediction performance is expected if in-soil influencing factors are also obtained and utilized. Finally, numerous other machine learning models and geostatistical techniques were not used here, so several combined models should be created to evaluate the performance of each.
The results of this study are expected to provide a more accurate assessment of the level of contamination in areas where environmental monitoring is required, which will be important for effective remediation and prevention strategies. When monitoring and remediation activities are determined by the concentration of heavy metals in the soil in abandoned mine areas, soil contamination maps generated from the results of single and combined models may have different scope, area, and remediation costs. Therefore, using the combined model is expected to produce soil contamination maps with lower interpolation errors, which may have an impact on soil contamination monitoring, remediation planning, and decision making.
5. Conclusions
This study focused on predictive maps created by applying RF, OK, and a combined RF–OK interpolation model based on soil contamination data. The performance of each model was compared after configuration using a diverse set of options and parameters. Optimization was performed using the GridSearchCV hyperparameter-tuning technique. Predictive maps were generated based on the optimized RF, OK, and RF–OK models, and the interpolation performance was evaluated using the RMSE and R2 metrics. The findings demonstrated that the combined RF–OK model (RMSE = 52.884 mg/kg, R2 = 0.915) outperformed the individual techniques (RF: RMSE = 66.214 mg/kg, R2 = 0.867; OK: RMSE = 65.101 mg/kg, R2 = 0.871).
Combining a machine learning model with a geostatistical technique can produce more reliable predictive maps than individual interpolation models. Moreover, objectively tuning the options and parameter settings of each technique via hyperparameter search is expected to help produce more reliable predictive maps.