Subsurface Geological Profile Interpolation Using a Fractional Kriging Method Enhanced by Random Forest Regression

Ding, Qile; Wang, Yiren; Zheng, Yu; Wang, Fengyang; Zhou, Shudong; Pan, Donghui; Xiong, Yuchun; Zhang, Yi

doi:10.3390/fractalfract8120717

Open Access Article


by Qile Ding ¹,², Yiren Wang ¹,²,*, Yu Zheng ¹,², Fengyang Wang ¹,², Shudong Zhou ³, Donghui Pan ¹,², Yuchun Xiong ⁴ and Yi Zhang ³

¹ School of Environment and Civil Engineering, Dongguan University of Technology, Dongguan 523808, China
² Guangdong Provincial Key Laboratory of Intelligent Disaster Prevention and Emergency Technologies for Urban Lifeline Engineering, Dongguan 523808, China
³ Dongguan Institute of Building Research Co., Ltd., Dongguan 523809, China
⁴ Guangdong Building Material Research Institute Co., Ltd., Guangzhou 510160, China
* Author to whom correspondence should be addressed.

Fractal Fract. 2024, 8(12), 717; https://doi.org/10.3390/fractalfract8120717

Submission received: 13 October 2024 / Revised: 1 December 2024 / Accepted: 3 December 2024 / Published: 5 December 2024

(This article belongs to the Section Engineering)


Abstract

Analyzing geological profiles is of great importance for various applications such as natural resource management, environmental assessment, and mining engineering projects. This study presents a novel geostatistical approach for subsurface geological profile interpolation using a fractional kriging method enhanced by random forest regression. Using bedrock elevation data from 49 boreholes in a study area in southeast China, we first use random forest regression to predict and optimize variogram parameters. We then use the fractional kriging method to interpolate the data and analyze the variability. We also compare the proposed model with traditional methods, including linear regression, K-nearest neighbors, ordinary kriging, and fractional kriging, using cross-validation metrics. The results indicate that the proposed model reduces prediction errors and enhances spatial prediction reliability compared to other models. The MSE of the proposed model is 25% lower than that of ordinary kriging and 10% lower than that of fractional kriging. In addition, the execution time of the proposed model is slightly higher than other models. The findings suggest that the proposed model effectively captures complex subsurface spatial relationships, offering a reliable and precise solution for performing spatial interpolation tasks.

Keywords: interpolation; fractional kriging method; borehole; bedrock elevation; variogram

1. Introduction

Analyzing geological profiles is crucial for understanding subsurface characteristics and identifying spatial patterns, which are essential for effective decision-making in resource extraction, infrastructure development, environmental protection, and risk assessment related to natural hazards [1,2,3,4]. Geological data are often obtained from boreholes, which provide valuable point-specific information about subsurface conditions. However, due to the high cost and logistical challenges of drilling, borehole data are typically sparse and irregularly distributed [5,6]. Interpolating borehole data can generate continuous representations of geological profiles [7]. Effective interpolation methods are crucial for bridging the gaps between boreholes and providing a complete understanding of the subsurface. These methods can not only enhance geological modeling performance but also reduce uncertainties in practical applications [8,9].

Geostatistical methods have been widely utilized for spatial interpolation and geological data analysis [10,11,12,13]. Commonly used methods to interpolate geological data collected from boreholes include inverse distance weighting (IDW), spline interpolation, and kriging [14]. IDW assigns weights to data points based on their distances from the prediction location, with closer points having a greater influence [15]. It is suitable for datasets with a relatively uniform spatial distribution. Spline interpolation, on the other hand, uses piecewise polynomials to create smooth surfaces, making it effective for capturing gradual changes in geological profiles and providing a smooth spatial representation [16]. Kriging incorporates spatial autocorrelation, providing an optimal, unbiased estimation of unknown values based on a weighted average of known data points, making it particularly effective for geological data interpolation [17,18]. Each of these methods has strengths and limitations, and the choice of method depends on the specific characteristics of the data and the desired level of accuracy. For example, IDW may struggle with datasets that have highly clustered data points, whereas spline interpolation is better suited for smoothly varying datasets but may not handle abrupt changes well [19]. A key advantage of kriging is its capacity to provide statistically optimal estimates by incorporating spatial relationships among data points. However, it relies heavily on the assumption of stationarity, which can be restrictive when dealing with regions that exhibit complex, non-stationary variability. This limitation often results in reduced accuracy in areas with significant geological heterogeneity.

Extensions have been proposed by previous studies to improve traditional kriging methods. For example, non-stationary kriging methods, such as universal kriging, have been introduced to handle spatial trends by incorporating external trend functions [20,21]. The application of advanced kriging techniques, such as co-kriging, Bayesian kriging, and indicator kriging, has also gained increasing attention in recent years [22]. Co-kriging incorporates secondary information from related variables, which has been found to be effective in enhancing predictions when primary data are scarce. Indicator kriging is a non-parametric approach that allows for the estimation of categorical or binary variables, making it useful for modeling geological facies. Bayesian kriging integrates Bayesian inference with traditional kriging, providing a probabilistic framework that accounts for uncertainties in the model and the data [23]. However, these methods may still struggle to fully capture complex spatial correlations in highly heterogeneous geological settings. Therefore, more flexible geostatistical methods to account for spatial variability are needed.

Recent studies have also explored machine-learning-assisted kriging techniques, such as Gaussian Process Regression Kriging, which combines the flexibility of Gaussian processes with kriging to enhance spatial predictions in highly variable settings [24,25,26]. Other machine learning methods, such as random forests, support vector machines, and neural networks, have been integrated with kriging to improve prediction accuracy [27,28,29]. These hybrid models take advantage of the strengths of machine learning algorithms in capturing complex, nonlinear relationships within data, which traditional kriging methods may struggle to achieve [30]. For example, random forests have been used to identify key spatial features, which are then incorporated into kriging models to enhance interpolation accuracy [27,31]. Neural networks have also been used to model complex geological patterns, providing input for kriging models [32,33,34]. These advancements highlight the trend of combining machine learning with kriging methods to address the challenges of spatial heterogeneity and improve the robustness of geological predictions.

Fractional calculus has emerged as a promising approach to enhance spatial modeling in geostatistics [35,36]. Fractional Brownian motion and fractional Gaussian fields have been employed to model long-range dependence and fractal characteristics of geological data. Incorporating these concepts into kriging, Pan et al. [37] proposed the fractional kriging method, which uses fractional differentiation to better capture non-stationary spatial structures. This modification allows for more flexible modeling of spatial correlation structures, accommodating non-stationary variability and capturing subtle heterogeneities in geological data more effectively [38]. By extending the traditional kriging framework, fractional kriging addresses shortcomings of conventional methods, providing improved accuracy and a better representation of complex geological profiles [39]. It allows for more accurate predictions in environments where geological features exhibit non-linear behavior or irregular patterns over space.

This study analyzes geological profile variability using a machine-learning-assisted kriging method. Specifically, random forest regression is combined with the fractional kriging model: the random forest regressor is employed to optimize the variogram parameters within the fractional kriging model, thereby improving the model’s ability to capture spatial heterogeneity and complex spatial relationships. We validate the model using bedrock elevation data from 49 boreholes in a region of southeastern China and compare it with four other related models. The results indicate that the proposed method provides a more precise representation of spatial variability, enabling better prediction and understanding of subsurface properties. The proposed model has significant implications for improving risk assessment, resource estimation, and the planning of engineering projects in areas characterized by complex geological conditions.

2. Model Development

Fractional kriging is an extension of standard kriging developed on the basis of fractional Brownian motion and the Hurst exponent, providing a more robust model for spatial prediction. It accounts for complex spatial dependencies and offers an estimation variance, which gives an indication of the reliability of the interpolated value.

2.1. Standard Kriging Recap

Kriging is a geostatistical interpolation method used to estimate the unknown spatial variable $Z (x_{0})$ at an unsampled location $x_{0}$ based on observed values $Z (x_{i})$ at nearby locations $x_{1}, x_{2}, \dots, x_{n}$.

The estimation is made by calculating a weighted sum of the observed values:

$$Z^{*} (x_{0}) = \sum_{i = 1}^{n} λ_{i} Z (x_{i})$$ (1)

with $Z^{*} (x_{0})$ being the estimated value at $x_{0}$, $Z (x_{i})$ being the observed values at $x_{i}$, and $λ_{i}$ being the kriging weights.

The spatial dependence between values is modeled by the semivariogram $γ (h)$, which is a function of the separation distance h between two points:

$$γ (h) = \frac{1}{2} E [{(Z (x) - Z (x + h))}^{2}]$$ (2)

Based on the shape of the empirical semivariogram, a theoretical semivariogram model (spherical, circular, exponential, Gaussian, or linear) is selected, as shown in Figure 1.
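As a minimal sketch of what these candidate models look like in code, the functions below implement four of the shapes named above (the circular model is omitted for brevity). The parameterization `nugget + sill * (...)` mirrors the exponential form used later in Equation (21), so `sill` here is the contribution added on top of the nugget. The function and argument names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def spherical(h, nugget, sill, rng):
    """Spherical model: reaches nugget + sill exactly at h = rng and stays flat beyond."""
    h = np.asarray(h, dtype=float)
    rising = nugget + sill * (1.5 * h / rng - 0.5 * (h / rng) ** 3)
    return np.where(h < rng, rising, nugget + sill)

def exponential(h, nugget, sill, rng):
    """Exponential model: approaches nugget + sill asymptotically."""
    h = np.asarray(h, dtype=float)
    return nugget + sill * (1.0 - np.exp(-h / rng))

def gaussian(h, nugget, sill, rng):
    """Gaussian model: parabolic near the origin, asymptotic plateau."""
    h = np.asarray(h, dtype=float)
    return nugget + sill * (1.0 - np.exp(-(h / rng) ** 2))

def linear(h, nugget, slope):
    """Linear model: unbounded, no plateau."""
    h = np.asarray(h, dtype=float)
    return nugget + slope * h

# Evaluate each model on the same lags for a quick visual or numeric comparison
lags = np.array([0.0, 10.0, 20.0, 40.0])
curves = {
    "spherical": spherical(lags, 1.0, 2.0, 20.0),
    "exponential": exponential(lags, 1.0, 2.0, 20.0),
    "gaussian": gaussian(lags, 1.0, 2.0, 20.0),
    "linear": linear(lags, 1.0, 0.1),
}
```

Choosing among these is usually done by comparing each curve with the empirical semivariogram, which is the role Figure 1 plays in the text.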

The covariance function $C (h)$ between two points separated by distance h is related to the semivariogram as

$$C (h) = C (0) - γ (h)$$ (3)

where $C (0)$ is the sill, representing the total variance reached at long distances.

Figure 2 shows the fundamentals of how kriging uses the spatial relations captured in the variogram to interpolate values at unsampled locations based on the spatial arrangement of known data points. The nugget is the y-intercept of the variogram, indicating variability at very short distances. The sill is the maximum semi-variance, beyond which there is no further increase; points separated by greater distances have negligible spatial correlation. The range denotes the distance at which the sill is reached, representing the extent of spatial correlation between points, with points beyond this distance considered uncorrelated.

To find the kriging weights $λ_{i}$, the kriging system must be solved. The system is derived from the conditions that the estimator is unbiased (i.e., the weights sum to 1) and that the variance of the estimation error is minimized:

$$\sum_{j = 1}^{n} λ_{j} γ (x_{i} - x_{j}) + μ = γ (x_{i} - x_{0}), \quad i = 1, \dots, n$$ (4)

where $\sum_{j = 1}^{n} λ_{j} = 1$ and $μ$ is a Lagrange multiplier that enforces the unbiasedness condition.
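To make the system concrete, it can help to work the smallest nontrivial case by hand. The derivation below solves Equation (4) for n = 2 under the usual convention that the semivariance at zero lag is zero; the shorthand symbols are introduced here for illustration only.

```latex
% Shorthand (illustrative): \gamma_{12} = \gamma(x_1 - x_2),
% \gamma_{10} = \gamma(x_1 - x_0), \gamma_{20} = \gamma(x_2 - x_0), and \gamma(0) = 0.
\begin{aligned}
\lambda_2 \gamma_{12} + \mu &= \gamma_{10}, \\
\lambda_1 \gamma_{12} + \mu &= \gamma_{20}, \\
\lambda_1 + \lambda_2 &= 1 .
\end{aligned}
% Subtracting the first two rows and using the constraint gives
\lambda_1 = \frac{1}{2} + \frac{\gamma_{20} - \gamma_{10}}{2 \gamma_{12}}, \qquad
\lambda_2 = \frac{1}{2} - \frac{\gamma_{20} - \gamma_{10}}{2 \gamma_{12}}, \qquad
\mu = \frac{\gamma_{10} + \gamma_{20} - \gamma_{12}}{2} .
```

When $x_{0}$ is closer to $x_{1}$, so that the semivariance to $x_{1}$ is smaller than that to $x_{2}$, the first weight exceeds 1/2; when the two semivariances are equal, both weights are 1/2 and the estimate is the plain average.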

2.2. Fractional Order Kriging

Fractional-order kriging incorporates the concept of fractional Brownian motion (FBM), which is a generalization of classical Brownian motion. The covariance function of FBM is

$$E [B_{H} (t) B_{H} (s)] = \frac{1}{2} ({| t |}^{2 H} + {| s |}^{2 H} - {| t - s |}^{2 H})$$ (5)

with H being the Hurst exponent. The Hurst exponent H is a key parameter in FBM that controls the fractal behavior of the data. It reflects the self-similarity and long-range dependence of the data.

The Hurst exponent takes values between 0 and 1 and reflects the correlations in the dataset. It is typically estimated through variogram analysis, where the semivariogram is assumed to follow a power-law relationship, $γ (h) \sim h^{2 H}$, at lag distance h. The empirical variogram is plotted on a log–log scale and fitted with a straight line; if the slope of that line is s, then the Hurst exponent is $H = s / 2$. The Hurst exponent thus characterizes the degree of spatial persistence or anti-persistence and quantifies the long-range dependence inherent in the data:

$H = 0.5$ corresponds to classical Brownian motion with no long-range dependence.
$H > 0.5$ denotes persistent behavior, i.e., an increasing process is likely to continue increasing.
$H < 0.5$ indicates anti-persistence, i.e., increases tend to be followed by decreases.
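The slope-based estimate described above can be sketched in a few lines. This is an illustration under the assumption of a pure power law with no nugget (with a nonzero nugget the log–log relation is no longer a straight line); the function name and the synthetic data are hypothetical.

```python
import numpy as np

def hurst_from_variogram(lags, semivariances):
    """Estimate H as half the slope of log(gamma) against log(h)."""
    lags = np.asarray(lags, dtype=float)
    semivariances = np.asarray(semivariances, dtype=float)
    # Logs are only defined for strictly positive lags and semi-variances
    mask = (lags > 0) & (semivariances > 0)
    slope, _intercept = np.polyfit(np.log(lags[mask]), np.log(semivariances[mask]), 1)
    return slope / 2.0

# Synthetic power-law variogram generated with H = 0.7 (not the paper's data)
lags = np.linspace(1.0, 50.0, 25)
gamma = 0.8 * lags ** (2 * 0.7)
H_est = hurst_from_variogram(lags, gamma)
```

On noisy empirical variograms the fit is usually restricted to short and intermediate lags, where the pair counts are large enough for the semi-variances to be stable.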

For spatial data, FBM is used to construct a fractional variogram that scales with distance according to the Hurst exponent H. In fractional kriging, the variogram is written as

$$γ (h) = C_{0} + C_{1} {| h |}^{2 H}$$ (6)

where $C_{0}$ is the nugget effect, representing short-range variability or measurement errors; $C_{1}$ is the scale coefficient of the spatially structured variance (referred to as the sill in this model, although a power-law variogram has no finite plateau); and h is the distance between two points.

To use the fractional variogram in kriging, the parameters $C_{0}$, $C_{1}$, and H must be estimated. This is carried out by fitting the model to the empirical semivariogram calculated from the data. The empirical semivariogram $γ_{emp} (h)$ is computed as

$$γ_{emp} (h) = \frac{1}{2 | N (h) |} \sum_{(i, j) \in N (h)} {(Z (x_{i}) - Z (x_{j}))}^{2}$$ (7)

where $N (h)$ is the set of all pairs of points $(i, j)$ such that $| x_{i} - x_{j} | \approx h$, and $| N (h) |$ is the number of point pairs at distance h.
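A minimal sketch of Equation (7) follows, where "approximately h" is implemented by grouping pairs into distance bins. The bin edges, function name, and toy data are assumptions made for illustration; the paper does not state its binning.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Binned empirical semivariogram: one value of Equation (7) per distance bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)          # each unordered pair once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (values[i] - values[j]) ** 2
    centers, gammas, counts = [], [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (dist >= lo) & (dist < hi)            # the pair set N(h) for this bin
        n_pairs = int(in_bin.sum())
        if n_pairs == 0:
            continue                                   # skip empty bins
        centers.append(dist[in_bin].mean())
        gammas.append(sqdiff[in_bin].sum() / (2.0 * n_pairs))
        counts.append(n_pairs)
    return np.array(centers), np.array(gammas), np.array(counts)

# Toy example: three collinear points (not the paper's data)
coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
values = [1.0, 2.0, 4.0]
centers, gammas, counts = empirical_semivariogram(coords, values, [0.5, 1.5, 2.5])
```

In the toy example the first bin holds the two pairs at distance 1 and the second bin holds the single pair at distance 2, so the pair counts returned alongside the semi-variances show how much data supports each lag.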

The objective is to fit the fractional variogram model to the empirical semivariogram by minimizing the squared error between them, that is,

$$\min_{C_{0}, C_{1}, H} \sum_{h} {(γ_{emp} (h) - γ (h))}^{2}$$ (8)
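One simple way to carry out the minimization in Equation (8) is sketched below. It exploits the fact that, for a fixed H, the model of Equation (6) is linear in $C_{0}$ and $C_{1}$, so a grid search over H combined with linear least squares is enough. This is a swapped-in technique chosen for transparency, not necessarily the optimizer used by the authors, and it does not enforce non-negative $C_{0}$ and $C_{1}$; all names are illustrative.

```python
import numpy as np

def fit_fractional_variogram(lags, gamma_emp, H_grid=None):
    """Fit gamma(h) = C0 + C1 * h**(2H) by grid search over H and linear least squares."""
    lags = np.asarray(lags, dtype=float)
    gamma_emp = np.asarray(gamma_emp, dtype=float)
    if H_grid is None:
        H_grid = np.linspace(0.01, 0.99, 99)           # candidate Hurst exponents in (0, 1)
    best = None
    for H in H_grid:
        # For fixed H the model is linear in (C0, C1)
        A = np.column_stack([np.ones_like(lags), lags ** (2.0 * H)])
        coef, *_ = np.linalg.lstsq(A, gamma_emp, rcond=None)
        sse = float(np.sum((A @ coef - gamma_emp) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(coef[1]), float(H))
    sse, C0, C1, H = best
    return C0, C1, H, sse

# Synthetic check with known parameters (not the paper's data)
lags = np.linspace(1.0, 40.0, 20)
gamma_emp = 1.5 + 0.3 * lags ** (2 * 0.35)
C0, C1, H, sse = fit_fractional_variogram(lags, gamma_emp)
```

A finer grid, or a local nonlinear refinement started from the best grid point, would be a natural extension if more precision in H is needed.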

The kriging system is modified by using the fractional variogram to capture the long-range dependence and fractal properties of the data. The system of equations that determines the kriging weights $λ_{i}$ is

$$\sum_{j = 1}^{n} λ_{j} γ (x_{i} - x_{j}) + μ = γ (x_{i} - x_{0}), \quad i = 1, \dots, n$$ (9)

where the sum of the weights $\sum_{j = 1}^{n} λ_{j}$ equals 1, $γ (x_{i} - x_{j})$ is the fractional semivariance between $x_{i}$ and $x_{j}$, and $γ (x_{i} - x_{0})$ is the semivariance between $x_{i}$ and $x_{0}$.

This system of equations can be written in matrix form as

$$\begin{bmatrix} Γ & \mathbf{1} \\ \mathbf{1}^{⊤} & 0 \end{bmatrix} \begin{bmatrix} λ \\ μ \end{bmatrix} = \begin{bmatrix} γ \\ 1 \end{bmatrix}$$ (10)

with

$$Γ_{i j} = γ (x_{i} - x_{j})$$ (11)

where $Γ$ is the semivariance matrix, $\mathbf{1}$ is the column vector of ones, $λ$ is the vector of kriging weights, $μ$ is the Lagrange multiplier, and $γ$ is the vector of semivariances between the known points and the unsampled location $x_{0}$.

The solution for the kriging weights $λ_{i}$ and the Lagrange multiplier $μ$ is obtained by solving the matrix equation:

$$\begin{bmatrix} λ \\ μ \end{bmatrix} = {\begin{bmatrix} Γ & \mathbf{1} \\ \mathbf{1}^{⊤} & 0 \end{bmatrix}}^{- 1} \begin{bmatrix} γ \\ 1 \end{bmatrix}$$ (12)

This requires solving an $(n + 1) \times (n + 1)$ linear system, which can be carried out using standard linear algebra methods such as Gaussian elimination or matrix decomposition. Once the $λ_{i}$ are determined, we can compute the estimated value at the unsampled location $x_{0}$ using the kriging estimator:

$$Z^{*} (x_{0}) = \sum_{i = 1}^{n} λ_{i} Z (x_{i})$$ (13)

where $Z (x_{i})$ are the observed values at the known locations $x_{i}$, and $λ_{i}$ are the weights obtained from solving the kriging system. This gives the best linear unbiased estimate for the unknown value at $x_{0}$.

The estimation variance provides a confidence measure for the predicted value $Z^{*} (x_{0})$ and quantifies the uncertainty in the kriging estimate. It evaluates how much the estimated value might differ from the true value due to the spatial variability of the data and the distance from the known points. A smaller variance indicates higher confidence in the estimate, while a larger variance suggests more uncertainty. The estimation variance at the unsampled location $x_{0}$, denoted $σ^{2} (x_{0})$, is given by

$$σ^{2} (x_{0}) = \sum_{i = 1}^{n} λ_{i} γ (x_{i} - x_{0}) + μ$$ (14)

where $λ_{i}$ are the kriging weights; $γ (x_{i} - x_{0})$ are the semivariances between the known points $x_{i}$ and the unsampled location $x_{0}$; and $μ$ is the Lagrange multiplier obtained from solving the kriging system. This is the semivariogram form of the kriging variance that follows from Equation (9) with $γ (0) = 0$; the nugget effect $C_{0}$ enters through the values of $γ (h)$ at $h > 0$.
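The whole chain from Equation (10) to the estimate and its variance can be sketched compactly with NumPy. This is an illustrative implementation under stated assumptions: the variogram is the power-law form of Equation (6) with the semivariance set to zero at zero lag, all points are used as neighbors, and the variance is computed in the semivariogram form $σ^{2} = \sum_{i} λ_{i} γ (x_{i} - x_{0}) + μ$, which is the form consistent with the sign of $μ$ in Equation (9). Function names and the two-point example are hypothetical.

```python
import numpy as np

def fractional_variogram(h, C0, C1, H):
    """Equation (6) for h > 0, with gamma(0) = 0 by definition."""
    h = np.asarray(h, dtype=float)
    return np.where(h > 0, C0 + C1 * h ** (2.0 * H), 0.0)

def krige_point(coords, values, x0, C0, C1, H):
    """Solve the (n + 1) x (n + 1) kriging system and return estimate, variance, weights."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    d_ij = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    d_i0 = np.linalg.norm(coords - np.asarray(x0, dtype=float), axis=1)
    K = np.zeros((n + 1, n + 1))
    K[:n, :n] = fractional_variogram(d_ij, C0, C1, H)   # semivariance matrix
    K[:n, n] = 1.0                                      # column of ones
    K[n, :n] = 1.0                                      # unbiasedness row
    rhs = np.append(fractional_variogram(d_i0, C0, C1, H), 1.0)
    sol = np.linalg.solve(K, rhs)                       # no explicit inverse needed
    lam, mu = sol[:n], sol[n]
    estimate = float(lam @ values)                      # Equation (13)
    variance = float(lam @ rhs[:n] + mu)                # semivariogram form of the variance
    return estimate, variance, lam

# Two known points and a target halfway between them (toy data)
coords = [[0.0, 0.0], [2.0, 0.0]]
values = [10.0, 20.0]
est_mid, var_mid, lam_mid = krige_point(coords, values, [1.0, 0.0], C0=0.5, C1=1.0, H=0.4)
est_at, var_at, lam_at = krige_point(coords, values, [0.0, 0.0], C0=0.5, C1=1.0, H=0.4)
```

Using `np.linalg.solve` rather than forming the inverse is the usual numerical choice; for many prediction points the factorization of the left-hand matrix can be reused because only the right-hand side changes.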

2.3. Fractional Variogram Parameters Optimization Using Random Forest Regression

The objective is to quantify the relation between spatial lag distances and semi-variances using a random forest regressor. We start by calculating the lag distances between all pairs of data points. Let the lag distance between point i and point j be denoted by $h_{i j}$. The lag distance is calculated as

$$h_{i j} = \sqrt{{(x_{i} - x_{j})}^{2} + {(y_{i} - y_{j})}^{2}}$$ (15)

The semi-variance for each pair of points is calculated as follows:

$$γ (h_{i j}) = \frac{1}{2} {(z_{i} - z_{j})}^{2}$$ (16)

where $z_{i}$ and $z_{j}$ are the values of the variable of interest at points i and j.
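Equations (15) and (16) together define what is often called a variogram cloud: one (lag, semi-variance) pair per pair of points. A minimal sketch follows; the function name and toy coordinates are assumptions for illustration.

```python
import numpy as np

def variogram_cloud(x, y, z):
    """Return lag distances and pairwise semi-variances for every unordered pair of points."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    i, j = np.triu_indices(len(z), k=1)                     # pairs with i < j
    h = np.sqrt((x[i] - x[j]) ** 2 + (y[i] - y[j]) ** 2)    # Equation (15)
    gamma = 0.5 * (z[i] - z[j]) ** 2                        # Equation (16)
    return h, gamma

# Toy example with three points (not the paper's data)
h, gamma = variogram_cloud([0.0, 3.0, 0.0], [0.0, 4.0, 4.0], [40.0, 42.0, 41.0])
```

With n points there are n(n - 1)/2 pairs, so the 49 boreholes of the case study would yield 1176 training samples for the regressor.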

The dataset for training the random forest consists of lag distances $h_{i j}$ as input features and semi-variances $γ (h_{i j})$ as target values. The trained random forest regressor approximates the semi-variogram from the data:

$$\hat{γ} (h) = RF (h)$$ (17)

where $RF (h)$ represents the random forest model’s prediction of the semi-variance for a given lag distance h.

The random forest regressor consists of an ensemble of decision trees. For each tree, a subset of the training data is randomly sampled and a subset of features is randomly chosen at each split to reduce correlation between the trees. The prediction is the average of the predictions of all decision trees:

$$\hat{γ} (h) = \frac{1}{T} \sum_{t = 1}^{T} {\hat{γ}}_{t} (h)$$ (18)

with T being the number of trees and ${\hat{γ}}_{t} (h)$ being the prediction from the t-th tree.

The trained random forest regressor is then used to predict semi-variance values for a range of lag distances, denoted by $h_{1}, h_{2}, \dots, h_{N}$. The goal is to fit a parametric variogram model to these predicted semi-variances. Each of the T trees in the ensemble is trained on a bootstrap sample $D_{t}$ of the original dataset D. The training process for each tree involves recursively splitting the data at each node based on a feature and a threshold that minimize the mean squared error (MSE). In generic notation, the prediction ${\hat{y}}_{i}$ for a given data point i is obtained by averaging the predictions from all trees:

$${\hat{y}}_{i} = \frac{1}{T} \sum_{t = 1}^{T} {\hat{y}}_{i, t}$$ (19)

where ${\hat{y}}_{i, t}$ is the prediction of the t-th tree for the data point i. Each tree in the random forest independently generates its own prediction, and the overall model prediction is the average of these individual predictions.

For model training, the hyperparameters were configured as follows: the number of trees was $T = 100$, the maximum depth of each tree was limited to 10, and the minimum number of samples per leaf node was set to 5. The number of features considered for each split was determined by the square root of the total number of features. Hyperparameters were optimized through grid search combined with k-fold cross-validation with $k = 10$: the training set was split into 10 subsets, and the model was trained on $k - 1$ subsets while testing on the remaining subset. This process was repeated k times, so that each subset served once as the test set, and the results were averaged to reduce the influence of any individual split.

During training, out-of-bag error estimates were used to measure model performance without needing a separate validation set. To prevent overfitting, several strategies were employed, including limiting tree depth, using bootstrap aggregation to sample the data for each tree, and leveraging the ensemble nature of the random forest, which reduces variance and mitigates noise sensitivity. The resulting model demonstrated strong predictive accuracy. The variogram parameters fitted to the model’s predicted semi-variances were then used as inputs in the fractional kriging method for spatial interpolation and variability analysis of bedrock elevation in the study area.
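The training setup described above maps naturally onto scikit-learn's `RandomForestRegressor`; the paper states only that Python was used, so the choice of library is an assumption. The sketch below uses the reported hyperparameters on synthetic stand-in data (an exponential variogram with chi-square scatter, not the borehole data) and checks that the forest prediction is the average of the per-tree predictions, as in Equation (18).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic variogram cloud: lag distances and noisy pairwise semi-variances (illustrative only)
h = rng.uniform(0.0, 80.0, size=600)
true_gamma = 1.5 + 13.0 * (1.0 - np.exp(-h / 35.0))
gamma = true_gamma * rng.chisquare(1, size=h.size)    # non-negative, strongly scattered

rf = RandomForestRegressor(
    n_estimators=100,       # number of trees T
    max_depth=10,           # maximum tree depth
    min_samples_leaf=5,     # minimum samples per leaf
    max_features="sqrt",    # features per split (only one feature here)
    bootstrap=True,
    oob_score=True,         # out-of-bag R^2 as an internal performance estimate
    random_state=0,
)
rf.fit(h.reshape(-1, 1), gamma)

# Predict the semi-variogram on a regular grid of lag distances
h_grid = np.linspace(1.0, 80.0, 40).reshape(-1, 1)
gamma_hat = rf.predict(h_grid)

# Equation (18): the forest prediction equals the mean of the tree predictions
per_tree = np.stack([tree.predict(h_grid) for tree in rf.estimators_])
oob_r2 = rf.oob_score_
```

A grid search over these hyperparameters with 10-fold cross-validation could be layered on top with `sklearn.model_selection.GridSearchCV`; it is left out here to keep the sketch short.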

The parametric variogram model is represented as $γ_{model} (h; θ)$, where $θ$ represents the variogram parameters, including nugget, sill, and range. To find the optimal variogram parameters, we minimize the squared error between the predicted semi-variance values $\hat{γ} (h_{i})$ and the parametric variogram model values $γ_{model} (h_{i}; θ)$:

$$\min_{θ} \sum_{i = 1}^{N} {(\hat{γ} (h_{i}) - γ_{model} (h_{i}; θ))}^{2}$$ (20)

To solve this optimization problem, methods such as Levenberg–Marquardt or gradient descent can be used. In fractional kriging, the variogram model $γ_{model} (h; θ)$ can take the form of an exponential model:

$$γ_{model} (h; θ) = \mathrm{nugget} + \mathrm{sill} \cdot (1 - e^{- h / \mathrm{range}})$$ (21)

or a fractional model:

$$γ_{model} (h; θ) = \mathrm{nugget} + \mathrm{sill} \cdot h^{α}, \quad 0 < α < 2$$ (22)

with $α$ being the fractional parameter that controls the degree of smoothness or roughness of the variogram; comparing with Equation (6), $α = 2 H$. The optimization aims to find the parameters $θ = (\mathrm{nugget}, \mathrm{sill}, \mathrm{range}, α)$ that best fit the predicted semi-variance values.

Once the optimal parameters are found, they are used to construct the variogram model, which is then applied in the fractional kriging process to estimate the variable of interest at unknown locations.
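As an illustration of the fit in Equation (20) with the exponential model of Equation (21), the sketch below uses `scipy.optimize.curve_fit`, whose `method="lm"` option runs a Levenberg–Marquardt solver. The "predicted" semi-variances are generated here from the parameter values reported in Equation (23) purely as a placeholder for random forest output; the starting guess and variable names are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def exponential_model(h, nugget, sill, range_):
    """Exponential variogram model of Equation (21)."""
    return nugget + sill * (1.0 - np.exp(-h / range_))

# Placeholder for random forest predictions on a lag grid (synthetic, noise-free)
h_grid = np.linspace(1.0, 80.0, 40)
gamma_hat = exponential_model(h_grid, 1.54, 13.42, 36.62)

# Crude data-driven starting point: smallest value, spread, and mean lag
theta0 = (gamma_hat.min(), gamma_hat.max() - gamma_hat.min(), h_grid.mean())
theta, _cov = curve_fit(exponential_model, h_grid, gamma_hat, p0=theta0, method="lm")
nugget, sill, range_ = theta

# Coefficient of determination of the fitted curve against the predictions
fitted = exponential_model(h_grid, *theta)
ss_res = np.sum((gamma_hat - fitted) ** 2)
ss_tot = np.sum((gamma_hat - gamma_hat.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

Levenberg–Marquardt in this form is unconstrained; if non-negative nugget and sill must be guaranteed, passing `bounds` to `curve_fit` switches it to a bounded trust-region method instead.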

2.4. The Algorithm of the Proposed Model for Geological Profile Interpolation

The model was implemented in Python. The algorithm of the proposed model is summarized as follows:

STEP 1: Prepare the Data. Extract features like lag distances between points, spatial coordinates, and attribute values from the spatial data.
STEP 2: Calculate Lag Distances and Semi-Variances. Calculate the lag distances (h) between all pairs of data points using the Euclidean distance formula (Equation (15)) and the corresponding semi-variances using Equation (16).
STEP 3: Generate Training Data for Random Forest Regression. Use the lag distances (h) as the input features and the semi-variances ( $γ (h)$ ) as the target values of the random forest regressor.
STEP 4: Train the Random Forest Regressor. Train the random forest regressor using h as features and $γ (h)$ as targets so as to estimate the semi-variogram from the data.
STEP 5: Optimization of Variogram Parameters. Fit a parametric variogram model by minimizing the squared error between the semi-variances predicted by the random forest and the parametric variogram model (Equation (20)).
STEP 6: Apply Fractional Kriging Using Optimized Variogram Parameters. Calculate kriging weights using the optimized variogram and estimate the attribute value for unknown points.

3. Engineering Case

Bedrock elevation plays a crucial role in the variability of geological profiles, as it determines the depth and shape of the bedrock, influencing the overall characteristics and complexity of the geological profile. Variations in bedrock elevation lead to significant differences in the thickness of overlying soil layers and geological conditions, resulting in high variability in the composition, soil type, and groundwater distribution within the profile. Therefore, bedrock elevation is a key factor in understanding the variability of geological profiles and its impact on regional soil properties, stability, and engineering characteristics.

In this study, we use bedrock elevation data from a specific region in southeastern China as a case study. Figure 3 shows the spatial coverage of the boreholes within the study area, which is essential for understanding the extent and density of sampling within the region. Data collected from 49 boreholes are used to validate the proposed model. The boreholes are distributed relatively uniformly across the study area, ensuring adequate spatial coverage for geological analysis. Some regions are measured with slightly higher density, indicating more concentrated sampling efforts in those areas.

Figure 4a depicts the linear regression trend surface of bedrock elevation over the study area. A first-order linear regression model is applied to fit the trend surface, $z = 207 - 0.11 x - 0.10 y$, with x and y being the spatial coordinates and z being the bedrock elevation. This model provides a general trend of bedrock elevation, with a slight negative gradient in both the x and y directions, indicating a gradual decrease in elevation from northwest to southeast. The highest elevation values are observed in the northwest part of the study area, and the elevation decreases gradually toward the southeast. This trend suggests a gentle slope across the study area, which may be indicative of underlying geological structures or erosion processes affecting the region. The residuals of the bedrock elevation after removing the trend surface are displayed in Figure 4b, which helps to identify localized variations in bedrock elevation that are not captured by the linear trend, providing insights into the spatial heterogeneity of the study area. The color map illustrates the deviations from the fitted trend surface, where positive residuals are indicated by blue and negative residuals by red.
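A first-order trend surface of this kind is an ordinary least-squares plane, and the detrended residuals are what the variogram analysis then works on. The sketch below fits such a plane to synthetic points generated around the reported coefficients; the coordinates, noise level, and function name are assumptions, not the borehole data.

```python
import numpy as np

def fit_trend_surface(x, y, z):
    """Least-squares fit of z = a + b*x + c*y; returns (a, b, c) and the residuals."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    A = np.column_stack([np.ones_like(x), x, y])    # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    residuals = z - A @ coef                        # what remains after removing the trend
    return coef, residuals

# 49 synthetic points scattered around the plane reported in the text
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 100.0, size=49)
y = rng.uniform(0.0, 100.0, size=49)
z = 207.0 - 0.11 * x - 0.10 * y + rng.normal(0.0, 1.0, size=49)
coef, residuals = fit_trend_surface(x, y, z)
```

Because the fit includes an intercept, the residuals average to zero by construction, which is what makes them a reasonable input for a stationary variogram model.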

4. Results

4.1. Intermediate Steps

Figure 5 illustrates the fitting results of different models for the experimental semi-variogram. The red dotted line represents the experimental semi-variogram, which is calculated from the original data and reflects the spatial correlation at different lag distances. A binned semi-variogram is used: data pairs are grouped into distance ranges (bins) to smooth out the variability in the semi-variogram calculation. The binned semi-variogram exhibits fluctuations with semi-variance values ranging from approximately 2 to 16, indicating substantial variability in spatial dependence across different lag distances. The blue dashed line shows the fitting results using the random forest model, demonstrating the model’s ability to capture nonlinear features and effectively describe the relationship between lag distance and semi-variance. The green solid line represents the fitted variogram model with the support of random forest regression, given by the equation

$$γ (h) = 1.54 + 13.42 \cdot (1 - e^{- h / 36.62})$$ (23)

with an $R^{2}$ value of 0.94, indicating a very good fit. The nugget effect is estimated at 1.54, which represents the variability at very small lag distances, that is, the inherent small-scale variations or measurement errors in bedrock elevation that cannot be explained by spatial correlation. The sill is 13.42, representing the plateau of the semi-variance and indicating the total spatial variability in bedrock elevation that can be explained by spatial dependence. The range parameter of 36.62 characterizes the distance over which the spatial correlation decays: locations separated by less than 36.62 units are clearly correlated in elevation, and the correlation weakens progressively at larger separations. Because the exponential model approaches its plateau asymptotically, the structured semi-variance reaches about 63% of the sill at h = 36.62 and about 95% at roughly three times that distance. The fitted variogram model effectively describes the trend of the experimental semi-variogram, particularly for smaller lag distances under 20 units. At larger lag distances beyond 50 units, the experimental semi-variance exhibits greater variability, which introduces some deviation between the fitted variogram model and the data. The random forest model handles tolerance and bandwidth implicitly, which helps capture nonlinear and complex spatial dependencies without explicitly setting a strict directional tolerance.
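To read the fitted parameters quantitatively, the short sketch below evaluates Equation (23) at a few lag distances and reports what fraction of the sill contribution has been reached. Only the three reported parameter values come from the text; the helper names and the chosen lags are illustrative.

```python
import math

NUGGET, SILL, RANGE = 1.54, 13.42, 36.62   # values reported in Equation (23)

def gamma_fitted(h):
    """Fitted exponential variogram of Equation (23)."""
    return NUGGET + SILL * (1.0 - math.exp(-h / RANGE))

def structured_fraction(h):
    """Fraction of the sill contribution reached at lag h."""
    return 1.0 - math.exp(-h / RANGE)

# Tabulate the curve at a short lag, at the range parameter, and at three times the range
rows = [(h, gamma_fitted(h), structured_fraction(h)) for h in (10.0, RANGE, 3.0 * RANGE)]
for h, g, frac in rows:
    print(f"h = {h:7.2f}   gamma = {g:6.2f}   fraction of sill = {frac:.3f}")
```

For an exponential model the fraction at the range parameter is 1 - 1/e, which is why the distance at which the curve is effectively flat is commonly taken as about three times that parameter.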

Variogram residuals are used to evaluate the fit of spatial models to the data’s spatial structure. Analyzing variogram residuals provides insights into model optimization and helps in understanding how well the model handles spatial dependencies. As shown in Figure 6, the residuals of the random forest model are randomly distributed around the zero line, ranging from approximately −3 to 3, with no discernible pattern or clustering. The fitted variogram model residuals also range from −3 to 3 but show a distinct cyclical pattern across the lag distances.

4.2. Geological Profile Variability Analysis

Figure 7 shows the interpolation results obtained using fractional kriging with a variogram optimized by the random forest model. The map illustrates the spatial distribution of bedrock elevation across the area of interest, where the color gradient represents the elevation ranging from 36.0 to 46.5. The circles represent the data points of the boreholes used in the interpolation. The spatial variability is evident, with higher elevations in the northwest and lower elevations in the southeast. This variability reflects the geological profile variability, characterized by significant differences in bedrock elevation across the region, suggesting the presence of varying geological conditions such as changes in rock type, differential erosion, or tectonic activity that have influenced the topography of the bedrock surface.

Using a k-fold cross-validation technique, a quantitative analysis of the residuals from the interpolation is conducted, highlighting the variability in the model’s performance. k-fold cross-validation evaluates the performance of a model by partitioning the original dataset into k subsets (folds): the dataset is randomly divided into k equal-sized folds, the model is trained on k − 1 folds and tested on the remaining fold, and this process is repeated k times, with each fold being used exactly once as the test set. The final performance is averaged over all k iterations to obtain an overall estimate. As depicted in Figure 8a, the spatial distribution of residuals is shown at each borehole, with the color scale ranging from negative to positive residuals, indicating areas where the interpolation overestimated or underestimated the bedrock elevation. Figure 8b illustrates the frequency distribution of residuals and the distribution curve of residuals. A roughly normal distribution centered around zero is observed, suggesting that the model’s errors are mostly unbiased with minor deviations. The residuals vary between approximately −2 and 2 and are relatively well behaved but with some discrepancies at the tails. This distribution pattern further highlights the variability in the geological profile, as some areas show larger deviations due to complex local geological features.
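The cross-validation loop described above can be sketched independently of the interpolator. In the example below, `predict_fn` is a hypothetical callback standing in for any spatial predictor; a nearest-neighbor rule is used only so the sketch runs on its own, and the value k = 5, the seed, and the synthetic points are assumptions rather than the paper's settings.

```python
import numpy as np

def k_fold_mse(coords, values, predict_fn, k=5, seed=0):
    """Return the per-fold MSE and the held-out residual at every point."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.random.default_rng(seed).permutation(len(values))
    folds = np.array_split(order, k)                  # k nearly equal-sized folds
    fold_mse, residuals = [], np.empty(len(values))
    for test_idx in folds:
        train_idx = np.setdiff1d(order, test_idx)     # the other k - 1 folds
        pred = predict_fn(coords[train_idx], values[train_idx], coords[test_idx])
        residuals[test_idx] = values[test_idx] - pred
        fold_mse.append(float(np.mean(residuals[test_idx] ** 2)))
    return np.array(fold_mse), residuals

def nearest_neighbor_predict(train_xy, train_z, test_xy):
    """Placeholder predictor: value of the closest training point."""
    d = np.linalg.norm(test_xy[:, None, :] - train_xy[None, :, :], axis=2)
    return train_z[np.argmin(d, axis=1)]

# 49 synthetic points with a gentle planar trend (not the borehole data)
rng = np.random.default_rng(1)
xy = rng.uniform(0.0, 100.0, size=(49, 2))
z = 45.0 - 0.05 * xy[:, 0] - 0.04 * xy[:, 1] + rng.normal(0.0, 0.5, size=49)
fold_mse, residuals = k_fold_mse(xy, z, nearest_neighbor_predict, k=5)
```

The `residuals` array holds one held-out error per point, which is the quantity that would be mapped and binned into a histogram, while `fold_mse` gives one value per fold.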

Figure 9 depicts how the cross-validation MSE of the proposed model varies with the fold number. The MSE decreases significantly from the first to the second fold, indicating improved performance with more folds. It then fluctuates slightly but remains relatively low after fold 2, with the lowest MSE observed at fold 3. This suggests that the model’s performance stabilizes with an appropriate number of cross-validation folds, optimizing its predictive accuracy. The variability in MSE also reflects the underlying geological profile variability, as different folds may capture different aspects of the complex spatial relations present in the bedrock elevation data.

4.3. Comparing the Proposed Model with Other Models

Figure 10 shows a comparison of different models for semi-variogram prediction over various lag distances. The random forest model generally captures the trend well but shows some moderate fluctuations, suggesting it can handle variability effectively. The linear regression model has larger deviations and struggles to capture the variability, indicating lower predictive accuracy. The k-nearest neighbors model shows improved performance compared to linear regression but still exhibits significant fluctuations. Ordinary kriging shows reasonable stability but does not capture the variability as effectively as the proposed model.
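For context, the quantity that each model in this comparison tries to reproduce is the semivariance at a set of lag distances. The sketch below computes an empirical semivariogram with the classical (Matheron) estimator, γ(h) = Σ(z_i − z_j)² / (2N(h)), on a tiny made-up transect. The function name, the equal-width lag binning, and the sample values are assumptions for illustration; this section does not state the binning the authors used.

```python
import math
from collections import defaultdict


def empirical_semivariogram(xy, z, lag_width):
    """Matheron estimator: gamma(h) = sum((z_i - z_j)^2) / (2 * N(h)) per lag bin."""
    sq_sum = defaultdict(float)
    count = defaultdict(int)
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.hypot(xy[i][0] - xy[j][0], xy[i][1] - xy[j][1])
            b = int(h // lag_width)  # bin b covers [b*lag_width, (b+1)*lag_width)
            sq_sum[b] += (z[i] - z[j]) ** 2
            count[b] += 1
    # Map the lower edge of each non-empty bin to its semivariance
    return {b * lag_width: sq_sum[b] / (2 * count[b]) for b in sorted(count)}


# Tiny illustrative transect (made-up values, not borehole data)
xy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
z = [0.0, 1.0, 2.0, 3.0]
gamma = empirical_semivariogram(xy, z, lag_width=1.0)
print(gamma)  # {1.0: 0.5, 2.0: 2.0, 3.0: 4.5}
```

A regression model such as a random forest, linear regression, or k-nearest neighbors can then be fitted to these (lag, semivariance) pairs, and its predictions compared with the empirical values.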

Figure 11 shows the interpolation results of four other related methods: fractional kriging, ordinary kriging, linear regression, and k-nearest neighbors. Each subfigure represents the interpolated bedrock elevation, with the x and y axes indicating spatial coordinates and the color gradient depicting elevation values. Red dots mark the borehole locations used in the interpolation. Fractional kriging and ordinary kriging both capture smooth spatial trends, with ordinary kriging emphasizing local variations slightly differently. Linear regression results in a more linear and simplified spatial representation that lacks detailed variability, while k-nearest neighbors produces a more localized effect with sharper transitions, reflecting the influence of nearby data points. This figure highlights how each interpolation method handles spatial variability differently, from smooth transitions in kriging to localized sharp changes in k-nearest neighbors.
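The sharp transitions of k-nearest neighbors follow directly from how the method works: the prediction changes abruptly wherever the set of nearest samples changes. The sketch below illustrates this with an unweighted k-nearest-neighbors interpolator on made-up samples. The function name, the unweighted averaging, and the sample values are assumptions; the paper does not specify its k-nearest-neighbors configuration in this section.

```python
import math


def knn_interpolate(train_xy, train_z, query, k=3):
    """Predict the value at `query` as the unweighted mean of the k nearest samples."""
    by_distance = sorted(
        zip(train_xy, train_z),
        key=lambda s: math.hypot(s[0][0] - query[0], s[0][1] - query[1]),
    )
    nearest = by_distance[:k]
    return sum(zv for _, zv in nearest) / len(nearest)


# Made-up samples: two low values on the left, two high values on the right
xy = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
z = [36.0, 36.0, 46.0, 46.0]
left = knn_interpolate(xy, z, (4.9, 0.5), k=2)   # both nearest samples are on the left
right = knn_interpolate(xy, z, (5.1, 0.5), k=2)  # both nearest samples are on the right
print(left, right)  # 36.0 46.0
```

Two query points only 0.2 units apart receive predictions that differ by the full range of the data, whereas a kriging predictor would vary smoothly between the two groups of samples.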

Figure 12 presents a comparison of the prediction accuracy of each model based on the mean squared error (MSE), which is determined using the k-fold cross-validation method. Here, the dataset is split into seven equal parts, and the model is trained seven times. In each iteration, one unique subset is used for validation, while the remaining six subsets are used for training. This process is repeated seven times, ensuring that every data point serves as a validation point exactly once. The MSE is calculated for each fold, and the results are averaged to provide an overall assessment of the model’s predictive accuracy. The proposed model achieves the lowest MSE value of 2.03, indicating the highest prediction accuracy. Fractional kriging has an MSE of 2.21, while k-nearest neighbors and ordinary kriging have MSE values of 2.34 and 2.73, respectively. The linear regression model shows the highest MSE at 3.52, indicating the lowest prediction performance. The differences in MSE values highlight the variability in model accuracy, with the proposed model demonstrating the best ability to capture the complex spatial relationships inherent in the geological profile.
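The relative improvement of the proposed model over each alternative follows from the MSE values quoted above. The short calculation below uses only those rounded values; the dictionary keys and the helper name `relative_reduction` are chosen for this example.

```python
# MSE values as reported in the text for Figure 12
mse = {
    "proposed": 2.03,
    "fractional kriging": 2.21,
    "k-nearest neighbors": 2.34,
    "ordinary kriging": 2.73,
    "linear regression": 3.52,
}


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Reduction achieved by the proposed model relative to each other model
reductions = {
    name: round(relative_reduction(value, mse["proposed"]), 1)
    for name, value in mse.items()
    if name != "proposed"
}
print(reductions)
```

With these rounded inputs, the reduction is about 25.6% relative to ordinary kriging, 8.1% relative to fractional kriging, 13.2% relative to k-nearest neighbors, and 42.3% relative to linear regression.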

Figure 13 compares the execution times of the different interpolation methods. The proposed model has the highest execution time, approximately 0.025 s, followed closely by ordinary kriging and fractional kriging at around 0.02 s each, making these three methods the most computationally intensive. In contrast, linear regression and k-nearest neighbors are significantly faster, with execution times of around 0.005 s and slightly below 0.005 s, respectively. This comparison highlights that the proposed model achieves higher accuracy at the cost of more computational resources.
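Execution times of a few hundredths of a second are sensitive to measurement noise, so comparisons of this kind are usually based on repeated runs. The sketch below shows one common way to do this with Python's standard library. It is an assumption about how such timings could be collected, not a description of the authors' procedure, and `dummy_interpolator` is a hypothetical placeholder for a real interpolation routine.

```python
import statistics
import time


def time_call(fn, *args, repeats=20, **kwargs):
    """Return the median wall-clock time in seconds of calling fn(*args, **kwargs)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def dummy_interpolator(n):
    """Hypothetical placeholder workload standing in for an interpolation method."""
    return sum(i * i for i in range(n))


elapsed = time_call(dummy_interpolator, 10_000)
print(f"median time: {elapsed:.6f} s")
```

The median is used rather than the mean so that an occasional slow run, for example one caused by other processes on the machine, does not distort the comparison.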

5. Discussion

The motivation for introducing the random-forest-enhanced fractional kriging model stems from the need to address the limitations of traditional spatial interpolation methods in capturing complex subsurface variability. Geological profiles, such as those represented by bedrock elevation data, often exhibit intricate spatial patterns due to natural processes like erosion, tectonic activity, and lithological diversity. Traditional methods like linear regression, ordinary kriging, and even standard fractional kriging struggle to accurately represent these complexities, often resulting in oversimplified predictions with limited reliability. Ordinary kriging tends to assume homogeneity in spatial correlations, which can lead to inaccuracies in heterogeneous geological settings, while fractional kriging offers improved flexibility but still lacks the adaptive optimization capabilities needed for highly variable data. To overcome these challenges, this study employs a hybrid approach that integrates the machine learning capabilities of random forest with fractional kriging to optimize the variogram, thereby enhancing the accuracy of spatial predictions and better capturing the underlying variability.

The results of this study clearly demonstrate the effectiveness of the proposed model for the variability analysis of bedrock elevation data. By comparing the proposed model with traditional interpolation methods, it can be seen that the proposed model provides a more robust solution for capturing complex spatial relations. Notably, the significant reduction in mean squared error (MSE) when using the proposed model underlines its effectiveness in managing the complex variability found in bedrock elevation data. The superior model fitting, reduced prediction errors, and enhanced reliability shown in the cross-validation metrics highlight the strength of this hybrid approach in addressing subsurface variability.

Despite these promising results, there are still challenges and potential areas for future improvement. One notable limitation lies in the computational cost associated with training the random forest model and optimizing the variogram parameters, which can be substantial for larger datasets. Future studies could explore the use of more computationally efficient algorithms or parallel processing techniques to address this challenge. Additionally, incorporating additional geological variables, such as soil type, groundwater levels, or rock permeability, could further enhance the predictive capability of the model and provide a more comprehensive understanding of subsurface variability. Moreover, applying the proposed method to different types of geological data could help validate its generalizability and expand its applicability across various geoscientific disciplines.

6. Conclusions

The objective of this study is to accurately capture complex subsurface geological profiles. Traditional spatial interpolation methods face significant challenges in representing the intricate spatial patterns inherent in such profiles. This study presents a random-forest-enhanced fractional kriging model for subsurface geological profile interpolation. The model is validated using bedrock elevation data from 49 boreholes located in southeastern China. The results show that the proposed method significantly outperforms traditional methods, including ordinary kriging, linear regression, and k-nearest neighbors, in terms of accuracy and the ability to capture complex spatial relations. The proposed model achieves the lowest MSE value of 2.03, a reduction of approximately 25% compared to ordinary kriging. In addition, the execution time of the proposed model is approximately 0.025 s, which is slightly higher than that of ordinary kriging and fractional kriging, indicating that the model is more computationally intensive. By leveraging the machine learning capabilities of random forest to optimize variogram parameters, the proposed method offers a more robust approach for subsurface geological profile interpolation. While there are challenges related to computational cost and the potential for further improvement through the inclusion of additional geological variables, the findings suggest that the random-forest-assisted fractional kriging method is a powerful tool for subsurface geological profile interpolation, with broad applicability to geosciences, environmental engineering, and resource management.

Author Contributions

Conceptualization, Q.D., Y.Z. (Yu Zheng) and Y.W.; methodology, Q.D.; validation, F.W., S.Z. and D.P.; formal analysis, F.W. and Y.X.; data curation, Y.Z. (Yi Zhang); writing—original draft preparation, Q.D.; writing—review and editing, Y.W.; supervision, Q.D. and Y.Z. (Yu Zheng); project administration, Q.D.; funding acquisition, Q.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. 42002274 and 52002410), the Dongguan Science and Technology of Social Development Program (Grant No. 20211800900112), Enterprise cooperation projects (Grant No. 2439001092), and the Guangdong Provincial Key Laboratory of Intelligent Disaster Prevention and Emergency Technologies for Urban Lifeline Engineering (2022) (Grant No. 2022B1212010016).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Authors Shudong Zhou and Yi Zhang were employed by the company Dongguan Institute of Building Research Co., Ltd. Author Yuchun Xiong was employed by the company Guangdong Building Material Research Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Typical semivariogram functions in kriging.

Figure 2. The spatial structure and associated semi-variogram model of kriging interpolation.

Figure 3. The geological distribution of the boreholes.

Figure 4. (a) The linear regression trend surface; (b) the color map of the residual of the bedrock elevation.

Figure 5. The semi-variance γ(h) versus lag distance h fitted by the random forest model and the fitted variogram model.

Figure 6. The variogram residual of (a) random forest model and (b) fitted variogram model.

Figure 7. The interpolation results using fractional kriging with a random forest optimized variogram.

Figure 8. (a) The spatial distribution of residual at each borehole points; (b) the frequency distribution of residual.

Figure 9. Variation of the cross-validation MSE to the fold number of the proposed model.

Figure 10. Comparison of different models for semi-variogram prediction.

Figure 11. The interpolation results of the (a) fractional kriging, (b) ordinary kriging, (c) linear regression, and (d) k-nearest neighbors.

Figure 12. Comparison of the prediction accuracy of the 5 models.

Figure 13. Comparison of execution times for different methods.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ding, Q.; Wang, Y.; Zheng, Y.; Wang, F.; Zhou, S.; Pan, D.; Xiong, Y.; Zhang, Y. Subsurface Geological Profile Interpolation Using a Fractional Kriging Method Enhanced by Random Forest Regression. Fractal Fract. 2024, 8, 717. https://doi.org/10.3390/fractalfract8120717

