1. Introduction
Soil heavy metal contamination in abandoned mining areas poses a significant environmental concern and a severe human health risk. These areas, typically remnants of past industrial activities, face an increased risk of soil contamination owing to inadequate environmental remediation. Therefore, accurate prediction of soil contamination is a crucial preliminary step in treating it. However, detailed predictions are hampered by limited accessibility, safety concerns, and the high cost and time demands of extensive sampling. Furthermore, traditional methods for predicting soil contamination, being based on point-by-point surveys, cannot provide a comprehensive spatial understanding of extensive or environmentally sensitive areas.
Geostatistics has been used to model the distribution of heavy metals across contaminated areas. This approach captures the spatial correlations and variability within the soil data, thereby offering reliable results. However, geostatistics has limitations in interpreting complex and nonlinear data patterns. This limitation can be addressed with machine learning, which has become increasingly popular. Machine learning is efficient at interpreting data patterns and can complement geostatistics in generating predictive maps. These maps play a significant role in determining the scale and scope of soil contamination, which, in turn, influences decisions on further exploration or sampling. The reliability of these maps must be ensured, as they guide prioritized actions and investments on site. Therefore, the available sample data, which are often limited by various restrictions, must be used effectively when constructing these maps. Because the appropriate amount of data for a predictive map is difficult to determine, it is important to maximize utility while minimizing opportunity costs. Although producing highly reliable predictive maps from small amounts of sampled data is crucial, the techniques used to create these maps have inherent limitations, necessitating a thorough evaluation and analysis of the influencing factors to minimize errors and generate highly reliable maps [1].
Numerous studies have reported the use of geostatistical techniques [2,3,4] to interpolate values at unsampled points from sampling data. Many studies have been conducted in geographic information system (GIS) environments to create predictive maps by applying geostatistics, and many of these have employed Ordinary Kriging (OK) [5,6,7]. Cross-validation is frequently used to verify Kriging-based estimates [8,9,10]. Additionally, the Kriging application process offers numerous options and parameters that can affect the prediction results. For instance, Kim et al. [11] analyzed the effect of varying the lag distance on the prediction error when creating a semivariogram. However, a comprehensive analysis of how all the available options and parameters in Kriging techniques affect prediction errors is lacking, preventing the identification of the optimal selections.
Recently, machine learning techniques have gained attention for data prediction and analysis [12,13,14,15]. Cross-validation is also commonly performed in machine learning research, and hyperparameter tuning has been used to optimize the available options and parameters [16,17,18]. Recent studies have either compared the prediction performance (error) of geostatistics and machine learning techniques in interpolation [19,20] or combined both techniques for interpolation [21,22,23,24,25]. The code for the combined RF–OK model presented in this study is provided in Appendix A. Chun et al. [19] applied OK and an Artificial Neural Network (ANN) to predict subsurface profile information in an unbored region and found that the machine learning results were distributed in a similar pattern, although they performed worse than geostatistics. Pereira et al. [20] used OK and a Support Vector Machine (SVM) for mapping soil attributes. Their method was coded as a plug-in to GIS software, and three approaches were compared: OK, an SVM that used the attribute itself interpolated by Inverse Distance Weighting as a covariate (SVM1), and an SVM with additional covariates (SVM2). Szatmári and Pásztor [21] compared the digital soil mapping techniques Universal Kriging, Sequential Gaussian Simulation, Random Forest (RF) with Kriging, and Quantile Regression Forest to quantify uncertainty in surveying soil organic carbon stock. Chen et al. [22] applied OK, Geographically Weighted Regression, SVM, ANN, and hybrid approaches (Geographically Weighted Regression Kriging and ANN Kriging) to predict soil organic carbon and concluded that the hybrid approaches were promising. Su et al. [23] combined RF, OK, and Co-Kriging to estimate aboveground biomass; the validation results indicated that the combined model outperformed the single RF model. Song et al. [24] presented a hybrid geostatistical method (Extreme Learning Machine–OK) for estimating soil organic matter and predicting the spatial variability of its contents. It was compared against Simple Kriging, OK, Regression–OK, and ANN–OK, and the results confirmed that the proposed hybrid method showed superior performance. Hsu et al. [25] enhanced the Land Use Regression model with machine learning to estimate the spatial–temporal variations in benzene, toluene, ethylbenzene, and xylene concentrations. They compared Hybrid Kriging–Land Use Regression, Geographically Weighted Regression, RF, and Extreme Gradient Boosting and indicated that the models incorporating machine learning showed improved performance. However, studies combining machine learning and geostatistics to interpolate and predict heavy metal concentrations in the soil of abandoned mine areas are scarce.
The objective of this study was to create a soil contamination map for abandoned mine sites with a reduced prediction error, using an interpolation model that combines geostatistics and machine learning applied to field survey (sampling) data. To achieve this, three techniques were applied to interpolate sampled heavy metal concentrations in soil, and their prediction errors were compared: machine learning, geostatistics, and a combined model. Previous studies on estimating the distribution of soil contamination in abandoned mine areas typically relied entirely on geostatistical techniques, and a single approach has clear limitations. Prediction accuracy is usually improved by increasing the number of sampling data; in this study, by contrast, a combined machine learning and geostatistical model is used, from a methodological point of view, to generate a map with improved prediction accuracy.
2. Materials and Methods
Figure 1 illustrates the research methodology flow of this study. The yellow rectangular boxes represent the sequence of results derived from the analysis using each technique. The process began with an exploratory data analysis (EDA) of the heavy metal concentration data, followed by data preprocessing. Subsequently, a grid was established to define the options and parameters for both the machine learning model and the geostatistical techniques, which facilitated hyperparameter tuning to identify the optimal combination. The interpolation techniques utilized in this study include the RF model (indicated as number 1 in Figure 1), the OK technique (number 2), and the combined RF–OK model (number 3). The interpolation (prediction) errors of each model were compared and evaluated, and a map was generated based on the prediction results.
This study utilized ArcGIS Pro 3.2.1, a commercial GIS software package, for data visualization and analysis, and Python 3.11.5 was used for the machine learning analysis. Additionally, geostatistical analyses were performed using the Python library PyKrige 2.7 [26], which implements Kriging techniques for both two-dimensional and three-dimensional data.
2.1. Study Area and Data
The study area was a metal mine located in Gijang-gun, Busan, South Korea (35°18′31.36″ N, 129°13′25.56″ E). The mine was a major Cu producer from the 1930s to the 1940s, extracted various other mineral resources, and closed around 1990. However, the site was not properly remediated or treated after the mine was abandoned, resulting in heavy metal-laden mine drainage and large quantities of waste rock. High concentrations of Cu were observed in the areas surrounding the abandoned mine [27].
Figure 2 shows the distribution of the soil sampling locations near the metal mine used in this study. Concentrations at these locations were measured using a portable X-ray fluorescence spectrometer, and a total of 40 samples were collected.
2.2. EDA
Before preprocessing the 40 sampled data points, EDA was performed. Its primary purpose was to understand how the collected heavy metal data were distributed within the study area by identifying patterns in the dataset, detecting outliers, and verifying assumptions. The process involved the use of various visualization methods, such as histograms, scatter plots, and Q–Q plots, to investigate the frequency distribution of data, correlations between auxiliary variables, and normality. Basic statistical measures were used to calculate the minimum, maximum, mean, median, skewness, and standard deviation of the data to quantify the central tendency and variance of the data. To normalize the data and stabilize the variance, a log transformation was applied in cases of high variability. This enhanced the reliability and efficiency of the subsequent analyses.
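The EDA steps above can be sketched in Python; the lognormal sample below is a synthetic stand-in for the 40 survey points (an assumption for demonstration, not the actual data), and `skewness` is a hypothetical helper implementing the standard sample skewness:

```python
import numpy as np

def skewness(x):
    # Sample skewness: mean of cubed standardized deviations.
    x = np.asarray(x, dtype=float)
    return np.mean(((x - x.mean()) / x.std()) ** 3)

rng = np.random.default_rng(0)
# Synthetic right-skewed "concentration" data standing in for the 40 samples.
conc = rng.lognormal(mean=4.0, sigma=1.0, size=40)

# Basic statistics quantifying central tendency and variance.
print(conc.min(), conc.max(), conc.mean(), np.median(conc), conc.std())
print(skewness(conc))        # strongly positive: right-skewed raw data

# Log transformation to normalize the data and stabilize the variance.
log_conc = np.log(conc)
print(skewness(log_conc))    # much closer to 0 after the transform
```

Histograms and Q–Q plots of `conc` and `log_conc` (e.g., via matplotlib) would show the same effect graphically.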
2.3. Criteria of Error Evaluation
A common way of partitioning a dataset is to hold out 20% of the data as test data and to split the remaining 80% into 90% training data and 10% validation data. This partitioning is effective when the dataset is sufficiently large. However, with limited data, it can yield a biased validation accuracy for the model. To overcome this problem, k-fold cross-validation is widely used. This method divides the training data into K equal-sized folds; K−1 of these folds are used for training, whereas the remaining fold is used for validation. This process is repeated K times so that each fold is used as validation data exactly once, thereby reducing bias.
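Such a K-fold split (here K = 5) can be sketched with scikit-learn's `KFold`; the toy array is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(20, 2)  # 20 toy samples with 2 features each

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Each of the 5 folds serves as validation data exactly once.
    print(f"fold {fold}: train={len(train_idx)} validate={len(val_idx)}")
```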
A quantitative assessment of the predictive performance of a model is crucial. Therefore, this study employed the root mean squared error (RMSE) as a metric. RMSE is the square root of the average of the squared differences between the actual and predicted values (Equation (1)). This metric directly measures the average magnitude of the predicted errors. A lower RMSE indicates a lower prediction error and higher model accuracy. However, one limitation of the RMSE is its sensitivity to large errors, as it assigns greater penalties to larger errors. Sensitivity to outliers is an important consideration when interpreting RMSE results.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(z(x_i) - \hat{z}(x_i)\bigr)^2} \qquad (1)$$
where $z(x_i)$ represents the actual data value collected at location $x_i$, $\hat{z}(x_i)$ denotes the predicted data value at location $x_i$, and $n$ indicates the number of data points used in the prediction.
The coefficient of determination (R2) was used to compensate for the shortcomings of the RMSE metric. R2 measures the proportion of variance in the dependent variable that is predictable from the independent variables, reflecting the overall fit of the predictive model to the actual data (Equation (2)). In addition, R2 serves as an indicator of how well the model explains the data, with values closer to 1 indicating a better fit between the model predictions and the actual data. However, a high R2 value does not necessarily guarantee the reliability of the predictive model; the predicted data must be graphically represented to verify the actual distribution. Furthermore, the effectiveness of R2 diminishes when the data exhibit nonlinear relationships.
$$R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}} \qquad (2)$$
where SST is the total sum of squares, and SSR is the residual sum of squares.
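Equations (1) and (2) translate directly into code; the toy measured/predicted values below are assumptions for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Equation (1): square root of the mean squared difference.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # Equation (2): R2 = 1 - SSR / SST.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ssr = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    sst = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - ssr / sst

y_true = np.array([100.0, 150.0, 200.0, 250.0])  # e.g., measured mg/kg
y_pred = np.array([110.0, 140.0, 210.0, 240.0])  # predicted mg/kg
print(rmse(y_true, y_pred))  # 10.0
print(r2(y_true, y_pred))    # 0.968
```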
In this study, grid search cross-validation (GridSearchCV) was used to optimize the model. GridSearchCV experiments were conducted using various hyperparameter combinations to identify the optimal set. Hyperparameters are values set before model training begins, and their combinations are defined in a grid format to evaluate the model's performance. For example, with two hyperparameters, the model's performance is tested for every combination by varying one while holding the other constant, and vice versa. This systematic approach enables an extensive search over a specified parameter space, resulting in the identification of the best parameter combination for the model.
In addition, k-fold cross-validation was incorporated into the GridSearchCV process. K was set to 5, meaning that the data were divided into five subsets, and cross-validation was performed on each subset. RMSE and R2 metrics were used to evaluate the performance of each hyperparameter combination. This approach facilitated the selection of the optimal hyperparameter combination, which was then applied to the entire dataset to construct a new model. The final model was optimized and used for further prediction or evaluation.
2.4. RF
Several machine learning models are available for predicting the distribution of soil contaminants. However, for regression targets such as heavy metal concentrations in soil, an RF model tends to have the best predictive performance [28], which is why RF was selected among the machine learning models in this study. RF is an ensemble method commonly used for both regression and classification. It constructs multiple decision trees and outputs either a class (for classification tasks) or an average prediction (for regression tasks). In classification tasks, RF outputs the class that receives the majority vote among all the decision trees, whereas for regression, RF averages the predictions of all the decision trees [29]. This approach improves predictive accuracy and controls overfitting by aggregating the results from multiple trees, with each tree contributing to the final outcome.
RF models typically make predictions for unsampled locations using patterns learned from training data. In this study, auxiliary data, including elevation, slope, aspect, curvature, and flow accumulation, were added to the location coordinates (latitude and longitude) of the 40 sampling data points when the RF model was applied. The RF model trained with data from 40 sampling locations was then applied to the location coordinates and five types of auxiliary data for the entire area. This enabled the prediction of heavy metal concentrations across the entire sampling region.
In the RF model, hyperparameters, such as the number of trees in the forest or the maximum depth of each tree, significantly impact the model’s predictive performance. Therefore, this study optimized the model by adjusting the available options and parameters using GridSearchCV. A grid was defined to set the options and parameters, and GridSearchCV was used to determine the optimal choices. This process was repeated until the RMSE was minimized. The number of trees in the forest (n_estimators) increased from 10 to 300 in intervals of 50, whereas the max_depth increased from 2 to 10 in steps of 2, along with the option of none. For the min_samples_split parameter, the value increased from 2 to 10. Similarly, the min_samples_leaf parameter was set to a range of 1 to 5. The bootstrapping option was tested for both True and False values.
Table 1 summarizes the defined grid of options and parameters for the RF model.
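A sketch of this tuning with scikit-learn, assuming the grid ranges stated above; the synthetic 40-point dataset and the reduced `small_grid` actually fitted are assumptions to keep the example fast (the full `param_grid` works the same way):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Grid following the ranges described in the text (cf. Table 1).
param_grid = {
    "n_estimators": list(range(10, 301, 50)),     # 10, 60, ..., 260
    "max_depth": list(range(2, 11, 2)) + [None],  # 2, 4, ..., 10, plus None
    "min_samples_split": list(range(2, 11)),      # 2 .. 10
    "min_samples_leaf": list(range(1, 6)),        # 1 .. 5
    "bootstrap": [True, False],
}

# Synthetic stand-in for the 40 samples: lon, lat plus 5 terrain covariates.
rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 7))
y = 100.0 * X[:, 0] + 50.0 * X[:, 2] + rng.normal(scale=5.0, size=40)

# Reduced sub-grid for brevity.
small_grid = {"n_estimators": [10, 60], "max_depth": [2, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    small_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # select the lowest RMSE
)
search.fit(X, y)
print(search.best_params_)
```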
2.5. OK
Kriging is one of the most popular geostatistical interpolation techniques and includes several variants, such as Simple Kriging, Universal Kriging, and OK. Simple Kriging minimizes the error variance but assumes a known, constant mean; when this assumption does not hold, the estimates are biased and the assumed mean does not correspond to the population mean. Universal Kriging is often used when the sampling data show a trend, that is, when the mean varies with location. Conversely, OK is commonly used to interpolate spatial data collected from arbitrary regions, providing unbiased estimates while also minimizing the variance; it is therefore referred to as the best linear unbiased estimator (BLUE). This is why OK was chosen among the geostatistical techniques in this study. OK assumes that the unknown population mean is a fixed value and bases the estimation formula on an unbiasedness condition. The fundamental prediction equation for OK is expressed in Equation (3).
$$\hat{z}(x_0) = \sum_{i=1}^{n} \lambda_i \, z(x_i) \qquad (3)$$
where $\hat{z}(x_0)$ represents the predicted value, $n$ is the number of sampled data, $\lambda_i$ is the weight at location $x_i$, and $z(x_i)$ is the value of the sampled data at location $x_i$.
OK implementation relies primarily on the construction and analysis of a semivariogram, which represents the degree of spatial correlation within a dataset over various distances. To construct a semivariogram, key variogram parameters, such as the Nugget, Sill, and Range, are estimated. These parameters significantly influence the weights assigned to the data points in the Kriging model, making their accurate estimation crucial.
Table 2 lists the options and parameters that could be set in OK for this study. The variogram model was set to three different types, and the coordinate-type options included both Euclidean and geographic options. The number of lags ranged from 2 to 20, increasing in steps of two. Weights were categorized as False or True, with the True category further subdivided into increments of 0.1, ranging from 0.1 to 1.0. Finally, the number of closest points used in the estimation, which represents the number of nearest neighboring points, was incrementally increased from 1 to 20. Analyzing the semivariogram is vital, because it provides important information for selecting an appropriate model based on the inherent spatial correlation structure of the data.
2.6. Combined RF–OK Interpolation Model
The process of combining the two techniques involved three steps: prediction using the RF model, application of OK to the residuals of the RF prediction, and summation of the RF model predictions with the OK estimates of the RF residuals. This approach combines the strengths of RF and OK. First, the predicted values were calculated using the RF model. Then, the differences between the predicted and actual values were taken as the residuals of the RF model (Equation (4)). RF was used to capture complex patterns in the data, whereas OK was used to adjust for local discrepancies not captured by the RF model. This enhances the overall accuracy and reliability of the predictions.
$$r(x_i) = z(x_i) - \hat{z}_{\mathrm{RF}}(x_i) \qquad (4)$$
where $r(x_i)$ represents the RF residual at location $x_i$, $z(x_i)$ is the sampled data value at location $x_i$, and $\hat{z}_{\mathrm{RF}}(x_i)$ is the RF model prediction at location $x_i$.
The residuals calculated in this way were subjected to hyperparameter tuning following the previously defined options and parameter grids for OK. This involved searching for the optimal choices for the RF model residuals and conducting OK with the optimal options and parameters obtained. This spatializes the RF residuals to account for spatial autocorrelation not captured by the RF model, thus compensating for the shortcomings of both techniques. Subsequently, a model combining the two techniques was developed by summing the prediction values for the entire study area calculated using the trained RF model with the residual prediction values for the entire study area obtained by applying OK to the RF model's prediction residuals. This combined methodology is mathematically represented by Equation (5).
$$\hat{z}_{\mathrm{RF\text{-}OK}}(x_0) = \hat{z}_{\mathrm{RF}}(x_0) + \hat{r}_{\mathrm{OK}}(x_0) \qquad (5)$$
where $\hat{z}_{\mathrm{RF\text{-}OK}}(x_0)$ represents the RF–OK model prediction at location $x_0$, $\hat{z}_{\mathrm{RF}}(x_0)$ is the RF model prediction at location $x_0$, and $\hat{r}_{\mathrm{OK}}(x_0)$ denotes the OK estimation of the RF residual at location $x_0$.
4. Discussion
To evaluate the quantified predictive performance of the machine learning–geostatistics combination model proposed in this study, a search of existing studies was conducted. However, owing to the lack of research in the same application category, comparisons were made with the combined models from other applications discussed in the Introduction (Table 4). For both metrics, relative rather than direct comparisons were made; in the case of the RMSE, the values differ widely because the error scales with the magnitude of the predicted data values. Nevertheless, compared with the other studies, the RMSE of the combined machine learning and geostatistics model is markedly lower relative to the corresponding single techniques. Similarly, the R2 results show that the hybrid model performs markedly better than in the other cases. However, note that the three compared cases were based on large amounts of data and separated the training and validation data. In contrast, this study was conducted with a small dataset (40 points), learning from 40 points and predicting 40 points with auxiliary variables (RF) or interpolating based on the sampling data (OK). Therefore, further research should be conducted with a larger amount of sampling data.
In this study, the anisotropy option was not used when applying the method to the heavy metal concentration data. The PyKrige library supports this feature through anisotropy scaling and anisotropy angle settings; however, it is only valid when the coordinate type is set to Euclidean and is not applicable to geographic coordinates. Consequently, in areas where anisotropy is prominent only in specific regions, regions with and without significant anisotropy should be differentiated to conduct a thorough analysis. In addition, model performance can be further improved by using additional auxiliary data. The auxiliary variables employed in this study are limited to those derivable from elevation; higher prediction performance is expected if in-soil influencing factors are also obtained and utilized. Finally, numerous other machine learning models and geostatistical techniques were not used here, so several combined models should be created to evaluate the performance of each.
The results of this study are expected to provide a more accurate assessment of the level of contamination in areas where environmental monitoring is required, which will be important for effective remediation and prevention strategies. When monitoring and remediation activities are determined by the concentration of heavy metals in the soil in abandoned mine areas, soil contamination maps generated from the results of single and combined models may have different scope, area, and remediation costs. Therefore, using the combined model is expected to produce soil contamination maps with lower interpolation errors, which may have an impact on soil contamination monitoring, remediation planning, and decision making.
5. Conclusions
This study focused on predictive maps created by applying RF, OK, and a combined RF–OK interpolation model based on soil contamination data. The performance of each model was compared after configuration using a diverse set of options and parameters. Optimization was performed using the GridSearchCV hyperparameter-tuning technique. Predictive maps were generated based on the optimized RF, OK, and RF–OK models, and the interpolation performance was evaluated using the RMSE and R2 metrics. The findings demonstrated that the combined RF–OK model (RMSE = 52.884 mg/kg, R2 = 0.915) outperformed the individual techniques (RF: RMSE = 66.214 mg/kg, R2 = 0.867; OK: RMSE = 65.101 mg/kg, R2 = 0.871).
Combining a machine learning model with a geostatistical technique can produce more reliable predictive maps than individual interpolation models. Moreover, objectively tuning the options and parameter settings of each technique via hyperparameter search is expected to help produce more reliable predictive maps.