1. Introduction
Analyzing geological profiles is crucial for understanding subsurface characteristics and identifying spatial patterns, which are essential for effective decision-making in resource extraction, infrastructure development, environmental protection, and risk assessment related to natural hazards [1, 2, 3, 4]. Geological data are often obtained from boreholes, which provide valuable point-specific information about subsurface conditions. However, due to the high cost and logistical challenges of drilling, borehole data are typically sparse and irregularly distributed [5, 6]. Interpolating borehole data can generate continuous representations of geological profiles [7]. Effective interpolation methods are crucial for bridging the gaps between boreholes and providing a complete understanding of the subsurface. These methods can not only enhance geological modeling performance but also reduce uncertainties in practical applications [8, 9].
Geostatistical methods have been widely utilized for spatial interpolation and geological data analysis [10, 11, 12, 13]. Commonly used methods to interpolate geological data collected from boreholes include inverse distance weighting (IDW), spline interpolation, and kriging [14]. IDW assigns weights to data points based on their distances from the prediction location, with closer points having a greater influence [15]. It is suitable for datasets with a relatively uniform spatial distribution. Spline interpolation, on the other hand, uses piecewise polynomials to create smooth surfaces, making it effective for capturing gradual changes in geological profiles and providing a smooth spatial representation [16]. Kriging incorporates spatial autocorrelation, providing an optimal, unbiased estimation of unknown values based on a weighted average of known data points, making it particularly effective for geological data interpolation [17, 18]. Each of these methods has strengths and limitations, and the choice of method depends on the specific characteristics of the data and the desired level of accuracy. For example, IDW may struggle with datasets that have highly clustered data points, whereas spline interpolation is better suited for smoothly varying datasets but may not handle abrupt changes well [19]. A key advantage of kriging is its capacity to provide statistically optimal estimates by incorporating spatial relationships among data points. However, it relies heavily on the assumption of stationarity, which can be restrictive when dealing with regions that exhibit complex, non-stationary variability. This limitation often results in reduced accuracy in areas with significant geological heterogeneity.
Extensions have been proposed by previous studies to improve traditional kriging methods. For example, non-stationary kriging methods, such as universal kriging, have been introduced to handle spatial trends by incorporating external trend functions [20, 21]. The application of advanced kriging techniques, such as co-kriging, Bayesian kriging, and indicator kriging, has also gained increasing attention in recent years [22]. Co-kriging incorporates secondary information from related variables, which has been found to be effective in enhancing predictions when primary data are scarce. Indicator kriging is a non-parametric approach that allows for the estimation of categorical or binary variables, making it useful for modeling geological facies. Bayesian kriging integrates Bayesian inference with traditional kriging, providing a probabilistic framework that accounts for uncertainties in the model and the data [23]. However, these methods may still struggle to fully capture complex spatial correlations in highly heterogeneous geological settings. Therefore, more flexible geostatistical methods to account for spatial variability are needed.
Recent studies have also explored machine-learning-assisted kriging techniques, such as Gaussian Process Regression Kriging, which combines the flexibility of Gaussian processes with kriging to enhance spatial predictions in highly variable settings [24, 25, 26]. Other machine learning methods, such as random forests, support vector machines, and neural networks, have been integrated with kriging to improve prediction accuracy [27, 28, 29]. These hybrid models take advantage of the strengths of machine learning algorithms in capturing complex, nonlinear relationships within data, which traditional kriging methods may struggle to achieve [30]. For example, random forests have been used to identify key spatial features, which are then incorporated into kriging models to enhance interpolation accuracy [27, 31]. Neural networks have also been used to model complex geological patterns, providing input for kriging models [32, 33, 34]. These advancements highlight the trend of combining machine learning with kriging methods to address the challenges of spatial heterogeneity and improve the robustness of geological predictions.
Fractional calculus has emerged as a promising approach to enhance spatial modeling in geostatistics [35, 36]. Fractional Brownian motion and fractional Gaussian fields have been employed to model long-range dependence and fractal characteristics of geological data. Incorporating these concepts into kriging, Pan et al. [37] proposed the fractional kriging method, which allows for fractional differentiation to better capture non-stationary spatial structures. This modification allows for more flexible modeling of spatial correlation structures, accommodating non-stationary variability and capturing subtle heterogeneities in geological data more effectively [38]. By extending the traditional kriging framework, fractional kriging overcomes the shortcomings of conventional methods, providing improved accuracy and a better representation of complex geological profiles [39]. The use of fractional kriging allows for more accurate predictions in environments where geological features exhibit non-linear behavior or irregular patterns over space.
This study presents an analysis of geological profile variability using a machine-learning-assisted kriging method. Specifically, we combine random forest regression with the fractional kriging model: the random forest regressor is employed to optimize the variogram parameters within the fractional kriging model, thereby improving the model’s ability to capture spatial heterogeneity and complex spatial relationships. We then validate the model using bedrock elevation data from 49 boreholes in a specific region in southeastern China and compare it with four other related models. The results indicate that the proposed method provides a more precise representation of spatial variability, enabling better prediction and understanding of subsurface properties. The proposed model has significant implications for improving risk assessment, resource estimation, and the planning of engineering projects in areas characterized by complex geological conditions.
2. Model Development
Fractional kriging is an extension of standard kriging that is developed based on fractional Brownian motion and the Hurst exponent, providing a more robust model for spatial prediction. It accounts for complex spatial dependencies and offers an estimation variance, which gives an indication of the reliability of the interpolated value.
2.1. Standard Kriging Recap
Kriging is a geostatistical interpolation method used to estimate the unknown spatial variable $Z(s_0)$ at an unsampled location $s_0$ based on observed values $Z(s_i)$ at nearby locations $s_i$, $i = 1, \dots, n$.
The estimation is made by calculating a weighted sum of the observed values:
$$\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i Z(s_i),$$
with $\hat{Z}(s_0)$ being the estimated value at $s_0$, $Z(s_i)$ being the observed values at $s_i$, and $\lambda_i$ being the kriging weights.
The spatial dependence between values is modeled by the semivariogram $\gamma(h)$. The semivariogram $\gamma(h)$ can be written as a function of the spatial distance $h$ between two points:
$$\gamma(h) = \frac{1}{2} E\left[\left(Z(s + h) - Z(s)\right)^2\right].$$
Based on the shape of the empirical semi-variogram, a spherical, circular, exponential, Gaussian, or linear semivariogram model can be selected, as shown in Figure 1.
The covariance function $C(h)$ between two points separated by distance $h$ is related to the semi-variogram as
$$C(h) = C(0) - \gamma(h),$$
where $C(0)$ is the sill, representing the total variance at long distances.
Figure 2 shows the fundamentals of how kriging uses the spatial relations captured in the variogram to interpolate values at unsampled locations based on the spatial arrangement of known data points. The nugget represents the y-intercept of the variogram, indicating variability at very short distances. The sill is the maximum semi-variance, beyond which there is no further increase, meaning that points separated by greater distances have negligible spatial correlation. The range denotes the distance at which the sill is reached, representing the extent of spatial correlation between points; points beyond this distance are considered uncorrelated.
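The roles of the nugget, sill, and range can be made concrete with a short sketch of three of the model shapes named above. The function names are illustrative, and two conventions are assumed here rather than taken from the paper: `sill` denotes the total plateau (so the structured part is `sill - nugget`), and a factor of 3 is used so that the exponential and Gaussian models reach about 95% of the sill at the range.

```python
import numpy as np

def spherical(h, nugget, sill, rng):
    """Spherical model: rises from the nugget and reaches the sill exactly at h = rng."""
    r = np.minimum(np.asarray(h, dtype=float) / rng, 1.0)
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)

def exponential(h, nugget, sill, rng):
    """Exponential model: approaches the sill asymptotically (about 95% at h = rng)."""
    h = np.asarray(h, dtype=float)
    return nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * h / rng))

def gaussian(h, nugget, sill, rng):
    """Gaussian model: parabolic near the origin, about 95% of the sill at h = rng."""
    h = np.asarray(h, dtype=float)
    return nugget + (sill - nugget) * (1.0 - np.exp(-3.0 * (h / rng) ** 2))

# Illustrative parameter values only (not fitted to any data set).
lags = np.array([0.0, 10.0, 50.0, 100.0])
sph = spherical(lags, nugget=0.5, sill=4.0, rng=50.0)
expo = exponential(lags, nugget=0.5, sill=4.0, rng=50.0)
gau = gaussian(lags, nugget=0.5, sill=4.0, rng=50.0)
```

Beyond the range the spherical model stays flat at the sill, whereas the other two keep approaching it; this is the practical difference when choosing among the shapes.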
To find the kriging weights $\lambda_i$, the equations of the kriging system should be solved. The system is derived from the condition that the estimator is unbiased (i.e., the weights sum to 1) and the variance of the estimation error is minimized. The expression is
$$\sum_{j=1}^{n} \lambda_j \gamma(s_i, s_j) + \mu = \gamma(s_i, s_0), \quad i = 1, \dots, n,$$
where $\sum_{i=1}^{n} \lambda_i = 1$ and $\mu$ is a Lagrange multiplier that enforces the unbiasedness condition.
2.2. Fractional Order Kriging
The fractional order kriging incorporates the concept of fractional Brownian motion (FBM), which is a generalization of classical Brownian motion. The covariance function of FBM is:
$$\mathrm{Cov}\left(B_H(t), B_H(s)\right) = \frac{1}{2}\left(|t|^{2H} + |s|^{2H} - |t - s|^{2H}\right),$$
with $H$ being the Hurst exponent. The Hurst exponent $H$ is a key parameter in FBM that controls the fractal behavior of the data. It reflects the self-similarity and long-range dependence of the data.
The Hurst exponent takes values between 0 and 1 and reflects the correlations in the dataset. It is typically estimated through variogram analysis, where the spatial correlation of the data is assumed to follow a power-law relationship, $\gamma(h) \propto h^{2H}$, for the variogram $\gamma(h)$ at a lag distance $h$. This method relies on plotting the empirical variogram on a log–log scale and fitting it to the power-law model. The slope of the resulting log–log plot provides a direct estimate of $H$. Specifically, if the slope of the plot is $s$, then the Hurst exponent is given by $H = s/2$. By analyzing the shape of the variogram and its slope, the Hurst exponent characterizes the degree of spatial persistence or anti-persistence, helping to quantify the long-range dependence inherent in the data.
$H = 0.5$ corresponds to classical Brownian motion with no long-range dependence.
$H > 0.5$ denotes persistent behavior, i.e., the process is likely to continue increasing if it is currently increasing.
$H < 0.5$ indicates anti-persistence, i.e., increases tend to be followed by decreases.
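The log–log slope estimate described above can be sketched in a few lines. This is a minimal illustration, assuming a negligible nugget (otherwise the log–log plot is not a straight line); the function name `estimate_hurst` and the synthetic variogram values are hypothetical.

```python
import numpy as np

def estimate_hurst(lags, semivariances):
    """Estimate H from the slope s of log(gamma) versus log(h), using H = s / 2."""
    lags = np.asarray(lags, dtype=float)
    semivariances = np.asarray(semivariances, dtype=float)
    valid = (lags > 0) & (semivariances > 0)   # logarithms need positive inputs
    slope, _intercept = np.polyfit(np.log(lags[valid]), np.log(semivariances[valid]), 1)
    return slope / 2.0, slope

# Synthetic power-law variogram gamma(h) = 2 * h**(2H) with H = 0.7 (illustrative only).
lags = np.linspace(1.0, 50.0, 25)
gamma = 2.0 * lags ** (2 * 0.7)
H_est, slope = estimate_hurst(lags, gamma)
persistent = H_est > 0.5   # True here, i.e. persistent behavior
```

With real data the empirical variogram is noisy, so the fit is usually restricted to the lag range over which the log–log plot looks linear.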
For spatial data, FBM is used to create a fractional variogram that scales with distance according to the Hurst exponent $H$. In fractional kriging, the variogram is modified to reflect the fractal nature of the data and can be written as
$$\gamma(h) = c_0 + c\,h^{2H},$$
where $c_0$ is the nugget effect, representing short-range variability or measurement errors, $c$ is the sill, representing the total variance contributed by the spatial structure, and $h$ is the distance between two points.
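To use the fractional variogram in kriging, the parameters $c_0$, $c$, and $H$ must be estimated. This is carried out by fitting the model to the empirical semivariogram calculated from the data. The empirical semivariogram $\hat{\gamma}(h)$ is computed as
$$\hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{(i,j) \in N(h)} \left(Z(s_i) - Z(s_j)\right)^2,$$
where $N(h)$ is the set of all pairs of points $(s_i, s_j)$ such that $\lVert s_i - s_j \rVert = h$, and $|N(h)|$ is the number of point pairs at distance $h$.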
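As a rough illustration of the empirical semivariogram, the sketch below groups point pairs into distance bins, since with irregularly spaced boreholes few pairs share exactly the same separation. The binning scheme, the function name `empirical_semivariogram`, and the toy data are assumptions made for this example.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Average half squared differences of all point pairs within each lag bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)          # each unordered pair once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2
    # Bin k holds pairs with bin_edges[k] <= dist < bin_edges[k + 1].
    bin_index = np.digitize(dist, bin_edges) - 1
    n_bins = len(bin_edges) - 1
    lag = np.full(n_bins, np.nan)
    gamma = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for k in range(n_bins):
        in_bin = bin_index == k
        counts[k] = in_bin.sum()
        if counts[k] > 0:
            lag[k] = dist[in_bin].mean()
            gamma[k] = sq_diff[in_bin].sum() / (2.0 * counts[k])
    return lag, gamma, counts

# Toy data: three collinear points (illustrative only).
coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
values = [0.0, 1.0, 3.0]
lag, gamma, counts = empirical_semivariogram(coords, values, bin_edges=[0.5, 1.5, 2.5])
```

Bins with very few pairs give unstable estimates, which is why the pair count is returned alongside the semi-variance.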
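The objective is to fit the fractional variogram model to the empirical semivariogram by minimizing the squared error between them, that is,
$$\min_{c_0,\, c,\, H} \sum_{k} \left[\hat{\gamma}(h_k) - \gamma(h_k)\right]^2,$$
where $h_k$ are the lag distances at which the empirical semivariogram is evaluated.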
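The kriging system is modified by using the fractional variogram to capture the long-range dependence and fractal properties of the data. The system of equations that determines the kriging weights $\lambda_i$ is
$$\sum_{j=1}^{n} \lambda_j \gamma(s_i, s_j) + \mu = \gamma(s_i, s_0), \quad i = 1, \dots, n,$$
where the sum of the weights $\lambda_i$ equals 1, $\gamma(s_i, s_j)$ is the fractional semivariance between $s_i$ and $s_j$, and $\gamma(s_i, s_0)$ is the semivariance between $s_i$ and $s_0$.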
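This system of equations can be written in matrix form as
$$\begin{bmatrix} \Gamma & \mathbf{1} \\ \mathbf{1}^{T} & 0 \end{bmatrix} \begin{bmatrix} \boldsymbol{\lambda} \\ \mu \end{bmatrix} = \begin{bmatrix} \boldsymbol{\gamma}_0 \\ 1 \end{bmatrix},$$
with $\Gamma$ as the semivariance matrix, $\mathbf{1}$ as the column vector of ones, $\boldsymbol{\lambda}$ as the vector of kriging weights, $\mu$ as the Lagrange multiplier, and $\boldsymbol{\gamma}_0$ as the vector of semivariances between the known points and the unsampled location $s_0$.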
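The solution for the kriging weights $\boldsymbol{\lambda}$ and the Lagrange multiplier $\mu$ is obtained by solving the matrix equation:
$$\begin{bmatrix} \boldsymbol{\lambda} \\ \mu \end{bmatrix} = \begin{bmatrix} \Gamma & \mathbf{1} \\ \mathbf{1}^{T} & 0 \end{bmatrix}^{-1} \begin{bmatrix} \boldsymbol{\gamma}_0 \\ 1 \end{bmatrix}.$$
This requires inverting the $(n+1) \times (n+1)$ matrix, which can be carried out using standard linear algebra methods such as Gaussian elimination or matrix decomposition. Once $\boldsymbol{\lambda}$ and $\mu$ are determined, we can compute the estimated value at the unsampled location $s_0$ using the kriging estimator:
$$\hat{Z}(s_0) = \sum_{i=1}^{n} \lambda_i Z(s_i),$$
where $Z(s_i)$ are the observed values at the known locations $s_i$, and $\lambda_i$ are the weights obtained from solving the kriging system. This gives the best linear unbiased estimate for the unknown value at $s_0$.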
The estimation variance provides a confidence measure for the predicted value $\hat{Z}(s_0)$ and quantifies the uncertainty in the kriging estimate. It evaluates how much the estimated value might differ from the true value due to the spatial variability of the data and the distance from the known points. A smaller variance indicates higher confidence in the estimate, while a larger variance suggests more uncertainty. The estimation variance at the unsampled location $s_0$, denoted $\sigma_K^2(s_0)$, is given by
$$\sigma_K^2(s_0) = \sum_{i=1}^{n} \lambda_i \gamma(s_i, s_0) + \mu - \gamma(0),$$
where $\gamma(0)$ is the semivariance at zero distance, typically equal to the nugget effect $c_0$; $\lambda_i$ values are the kriging weights; $\gamma(s_i, s_0)$ are the semivariances between the known points $s_i$ and the unsampled location $s_0$; and $\mu$ is the Lagrange multiplier obtained from solving the kriging system.
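The matrix system, the estimator, and the estimation variance above can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the variogram is taken to be a power law of the form nugget plus `c * h**(2H)`, the nugget is set to zero in the example so that the predictor reproduces the data exactly, and all names and values are hypothetical.

```python
import numpy as np

def fractional_variogram(h, c0, c, H):
    """Assumed power-law form: gamma(h) = c0 + c * h**(2H)."""
    return c0 + c * np.asarray(h, dtype=float) ** (2.0 * H)

def fractional_kriging(coords, values, target, c0, c, H):
    """Solve the bordered kriging system and return estimate, variance, weights."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(values)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    d0 = np.linalg.norm(coords - target, axis=1)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = fractional_variogram(d, c0, c, H)   # semivariance matrix
    A[:n, n] = 1.0                                  # column of ones
    A[n, :n] = 1.0                                  # unbiasedness row
    b = np.append(fractional_variogram(d0, c0, c, H), 1.0)
    solution = np.linalg.solve(A, b)                # avoids forming an explicit inverse
    weights, mu = solution[:n], solution[n]
    estimate = weights @ values
    variance = weights @ b[:n] + mu - fractional_variogram(0.0, c0, c, H)
    return estimate, variance, weights

# Four points on a unit square, predicted at the centre (illustrative values only).
coords = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [10.0, 12.0, 11.0, 15.0]
est, var, w = fractional_kriging(coords, values, [0.5, 0.5], c0=0.0, c=1.0, H=0.5)
est_at_data, var_at_data, _ = fractional_kriging(coords, values, [1.0, 0.0], c0=0.0, c=1.0, H=0.5)
```

By symmetry the four weights at the centre are equal, so the estimate is the plain average of the four values; at a sampled location the estimate equals the observation and the variance vanishes.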
2.3. Fractional Variogram Parameters Optimization Using Random Forest Regression
The objective is to quantify the relation between spatial lag distances and semi-variances using a random forest regressor. We start by calculating the lag distances between all pairs of data points. Let the lag distance between point $i$ and point $j$ be denoted by $h_{ij}$. The lag distance is calculated as
$$h_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2},$$
where $(x_i, y_i)$ and $(x_j, y_j)$ are the spatial coordinates of the two points.
The semi-variance for each pair of points is calculated as follows:
$$\gamma_{ij} = \frac{1}{2}\left(z_i - z_j\right)^2,$$
where $z_i$ and $z_j$ are the values of the variable of interest at points $i$ and $j$.
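The pairwise quantities just defined are straightforward to compute. The sketch below is a minimal version with a hypothetical function name and toy coordinates; it returns one lag distance and one semi-variance per unordered pair of points.

```python
import numpy as np

def pairwise_lags_and_semivariances(coords, z):
    """Return (h_ij, gamma_ij) for every unordered pair i < j."""
    coords = np.asarray(coords, dtype=float)
    z = np.asarray(z, dtype=float)
    i, j = np.triu_indices(len(z), k=1)
    h = np.sqrt(((coords[i] - coords[j]) ** 2).sum(axis=1))  # Euclidean lag distance
    gamma = 0.5 * (z[i] - z[j]) ** 2                         # pairwise semi-variance
    return h, gamma

# Toy example with three points (illustrative values only).
coords = [[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]]
z = [1.0, 3.0, 7.0]
h, gamma = pairwise_lags_and_semivariances(coords, z)
```

For $n$ points this yields $n(n-1)/2$ pairs, so 49 boreholes give 1176 training samples.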
The dataset for training the random forest consists of lag distances as input features ($h_{ij}$) and semi-variances ($\gamma_{ij}$) as the target values. The random forest regressor trained on these pairs is used to approximate the semi-variogram from the data:
$$\hat{\gamma}(h) = f_{RF}(h),$$
where $f_{RF}(h)$ represents the random forest model’s prediction of the semi-variance $\gamma$ for a given lag distance $h$.
The random forest regressor consists of an ensemble of decision trees. For each tree, a subset of the training data is randomly sampled and a subset of features is randomly chosen at each split to reduce correlation between the trees. The prediction is the average of the predictions of all decision trees:
$$f_{RF}(h) = \frac{1}{T} \sum_{t=1}^{T} f_t(h),$$
with $T$ being the number of trees and $f_t(h)$ being the prediction from the $t$-th tree.
The trained random forest regressor is then used to predict semi-variance values for a range of lag distances, denoted by $h_k$. The goal is to fit a parametric variogram model to these predicted semi-variances. The model consists of an ensemble of $N$ decision trees, where each tree $t$ is trained on a bootstrap sample $D_t$ of the original dataset $D$. The training process for each tree involves recursively splitting the data at each node based on a feature and a threshold that minimize the mean squared error (MSE). The prediction $\hat{y}_i$ for a given data point $i$ is obtained by averaging the predictions from all trees:
$$\hat{y}_i = \frac{1}{N} \sum_{t=1}^{N} \hat{y}_i^{(t)},$$
where $\hat{y}_i^{(t)}$ is the prediction of the $t$-th tree for the data point $i$. Each tree in the random forest independently generates its own prediction, and the overall model prediction is the average of these individual predictions.
For model training, the hyperparameters were configured as follows: the number of trees
, the maximum depth of each tree was limited to 10, and the minimum number of samples per leaf node was set to 5. The number of features considered for each split was determined by the square root of the total number of features. Hyperparameters were optimized through grid search combined with k-fold cross-validation, where the training set was split into $k = 10$ subsets, and the model was trained on $k - 1$ subsets while testing on the remaining subset. This process was repeated $k$ times, so that each subset served once as the test set, and the results were averaged to minimize the bias of individual splits.
During training, out-of-bag error estimates were used to provide an unbiased measure of model performance without needing a separate validation set. To prevent overfitting, several strategies were employed, including limiting tree depth, using bootstrap aggregation to sample the data for each tree, and leveraging the ensemble nature of the random forest, which reduces variance and mitigates noise sensitivity. The resulting model demonstrated strong predictive accuracy. The variogram parameters derived from its predictions were then used as inputs in the fractional kriging method for spatial interpolation and variability analysis of bedrock elevation in the study area.
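The training setup described above could be expressed with scikit-learn roughly as follows. This is a sketch under stated assumptions rather than the authors' code: the lag and semi-variance pairs are synthetic, the parameter grid is a small hypothetical one, and the values of `n_estimators` are placeholders because the number of trees is not given in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic (lag, semi-variance) pairs standing in for the borehole-derived data.
rng = np.random.default_rng(0)
h = rng.uniform(1.0, 100.0, size=400)
gamma = 0.5 + 2.0 * (1.0 - np.exp(-h / 30.0)) + rng.normal(0.0, 0.1, size=400)
X = h.reshape(-1, 1)   # a single input feature: the lag distance

# Grid search with 10-fold cross-validation over a small, hypothetical grid.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=0),
    param_grid={"max_depth": [5, 10], "min_samples_leaf": [5, 10]},
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)
search.fit(X, gamma)

# Refit with the selected settings; oob_score=True reports the out-of-bag R^2.
forest = RandomForestRegressor(
    n_estimators=100,          # placeholder value, not taken from the paper
    max_features="sqrt",
    bootstrap=True,
    oob_score=True,
    random_state=0,
    **search.best_params_,
)
forest.fit(X, gamma)

# Predicted semi-variances on a regular grid of lag distances.
h_grid = np.linspace(1.0, 100.0, 50).reshape(-1, 1)
gamma_hat = forest.predict(h_grid)

# The forest prediction is the average of the individual tree predictions.
tree_mean = np.mean([tree.predict(h_grid) for tree in forest.estimators_], axis=0)
```

The out-of-bag score is available as `forest.oob_score_`, and `gamma_hat` is the kind of smoothed semi-variance curve to which a parametric variogram model can then be fitted.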
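The parametric variogram model is represented as $\gamma(h; \theta)$, where $\theta$ represents the variogram parameters including nugget, sill, and range. To find the optimal variogram parameters, we minimize the error between the predicted semi-variance values ($\hat{\gamma}(h_k)$) and the parametric variogram model values ($\gamma(h_k; \theta)$):
$$\theta^{*} = \arg\min_{\theta} \sum_{k} \left[\hat{\gamma}(h_k) - \gamma(h_k; \theta)\right]^2.$$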
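To solve this optimization problem, we can use methods such as Levenberg–Marquardt or gradient descent. The variogram model $\gamma(h; \theta)$ can be in the form of an exponential model in fractional kriging:
$$\gamma(h; \theta) = c_0 + c\left[1 - \exp\left(-\frac{h}{a}\right)\right],$$
or a fractional model:
$$\gamma(h; \theta) = c_0 + c\left[1 - \exp\left(-\left(\frac{h}{a}\right)^{\alpha}\right)\right],$$
with $c_0$, $c$, and $a$ denoting the nugget, sill, and range, and $\alpha$ being the fractional parameter that controls the degree of smoothness or roughness of the variogram. The optimization aims to find the parameters $\theta$ that best fit the predicted semi-variance values.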
Once the optimal parameters are found, they are used to construct the variogram model, which is then applied in the fractional kriging process to estimate the variable of interest at unknown locations.
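The parameter fit described in this subsection can be sketched with SciPy's Levenberg–Marquardt solver. The example below is illustrative only: it assumes an exponential model of the form nugget plus `partial_sill * (1 - exp(-h / a))`, uses noise-free synthetic values in place of the random forest predictions, and derives the starting point from the data; all names are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def exponential_variogram(h, nugget, partial_sill, a):
    """Assumed exponential form: nugget + partial_sill * (1 - exp(-h / a))."""
    return nugget + partial_sill * (1.0 - np.exp(-h / a))

def fit_variogram(h, gamma_pred):
    """Least-squares fit of (nugget, partial_sill, a) via Levenberg-Marquardt."""
    def residuals(theta):
        return exponential_variogram(h, *theta) - gamma_pred

    # Data-driven starting point: lowest value, spread, and a third of the lag span.
    theta0 = np.array([
        gamma_pred.min(),
        gamma_pred.max() - gamma_pred.min(),
        (h.max() - h.min()) / 3.0,
    ])
    result = least_squares(residuals, theta0, method="lm")
    return result.x

# Stand-in for random forest predictions on a lag grid (synthetic, noise-free).
h_grid = np.linspace(1.0, 150.0, 60)
gamma_rf = exponential_variogram(h_grid, 0.5, 2.0, 30.0)
theta_hat = fit_variogram(h_grid, gamma_rf)
```

Levenberg–Marquardt as exposed by `method="lm"` does not accept bounds, so a sensible starting point matters; a bounded solver is an alternative if the parameters must be constrained to stay positive.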
2.4. The Algorithm of the Proposed Model for Geological Profile Interpolation
The proposed model was implemented in Python. The algorithm is summarized as follows:
STEP 1: Prepare the Data. Extract features like lag distances between points, spatial coordinates, and attribute values from the spatial data.
STEP 2: Calculate Lag Distances and Semi-Variances. Calculate lag distances ($h$) between all pairs of data points using the Euclidean distance formula, and calculate the semi-variance ($\gamma$) of each pair.
STEP 3: Generate Training Data for Random Forest Regression. Use the lag distances ($h$) as input features and the semi-variances ($\gamma$) as target values of the random forest regressor.
STEP 4: Train the Random Forest Regressor. Train the random forest regressor using $h$ as features and $\gamma$ as targets so as to estimate the semi-variogram from the data.
STEP 5: Optimization of Variogram Parameters. Fit a parametric variogram model by minimizing the error between the predicted semi-variance and the parametric variogram model.
STEP 6: Apply Fractional Kriging Using Optimized Variogram Parameters. Calculate kriging weights using the optimized variogram and estimate the attribute value for unknown points.
3. Engineering Case
Bedrock elevation plays a crucial role in the variability of geological profiles, as it determines the depth and shape of the bedrock, influencing the overall characteristics and complexity of the geological profile. Variations in bedrock elevation lead to significant differences in the thickness of overlying soil layers and geological conditions, resulting in high variability in the composition, soil type, and groundwater distribution within the profile. Therefore, bedrock elevation is a key factor in understanding the variability of geological profiles and its impact on regional soil properties, stability, and engineering characteristics.
In this study, we use bedrock elevation data from a specific region in southeastern China as a case study. Figure 3 shows the spatial coverage of the boreholes within the study area, which is essential for understanding the extent and density of sampling within the region. Data collected from 49 boreholes are used to validate the proposed model. The boreholes are distributed relatively uniformly across the study area, ensuring adequate spatial coverage for geological analysis. Some regions are sampled at slightly higher density, indicating more concentrated sampling efforts in those areas.
Figure 4a depicts the linear regression trend surface of bedrock elevation over the study area. A first-order linear regression model is applied to fit the trend surface, $z = \beta_0 + \beta_1 x + \beta_2 y$, with $x$ and $y$ being the spatial coordinates and $z$ being the bedrock elevation. This model provides a general trend of bedrock elevation, with a slight negative gradient in both the $x$ and $y$ directions, indicating a gradual decrease in elevation from northwest to southeast. The highest elevation values are observed in the northwest part of the study area, and the elevation decreases gradually toward the southeast. This trend suggests a gentle slope across the study area, which may be indicative of underlying geological structures or erosion processes affecting the region. The residuals of the bedrock elevation after removing the trend surface are displayed in Figure 4b, which helps to identify localized variations in bedrock elevation that are not captured by the linear trend, providing insights into the spatial heterogeneity of the study area. The color map illustrates the deviations from the fitted trend surface, where positive residuals are indicated by blue and negative residuals by red.
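The trend removal described above amounts to an ordinary least-squares plane fit followed by subtraction. The sketch below illustrates this on synthetic coordinates and elevations; the data, the coefficient values, and the function name `fit_trend_surface` are assumptions and do not reproduce the study's boreholes.

```python
import numpy as np

def fit_trend_surface(x, y, z):
    """Fit a first-order surface z = b0 + b1*x + b2*y and return coefficients and residuals."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    design = np.column_stack([np.ones_like(x), x, y])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    residuals = z - design @ coeffs
    return coeffs, residuals

# Synthetic stand-in for 49 boreholes with a gentle downward trend in x and y.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1000.0, size=49)
y = rng.uniform(0.0, 1000.0, size=49)
z = 20.0 - 0.004 * x - 0.003 * y + rng.normal(0.0, 0.5, size=49)

coeffs, residuals = fit_trend_surface(x, y, z)
b0, b1, b2 = coeffs   # b1 and b2 are the gradients in the x and y directions
```

The residuals, rather than the raw elevations, are the quantity whose spatial structure a variogram would then describe.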
5. Discussion
The motivation for introducing the random-forest-enhanced fractional kriging model stems from the need to address the limitations of traditional spatial interpolation methods in capturing complex subsurface variability. Geological profiles, such as those represented by bedrock elevation data, often exhibit intricate spatial patterns due to natural processes like erosion, tectonic activity, and lithological diversity. Traditional methods like linear regression, ordinary kriging, and even standard fractional kriging struggle to accurately represent these complexities, often resulting in oversimplified predictions with limited reliability. Ordinary kriging tends to assume homogeneity in spatial correlations, which can lead to inaccuracies in heterogeneous geological settings, while fractional kriging offers improved flexibility but still lacks the adaptive optimization capabilities needed for highly variable data. To overcome these challenges, this study employs a hybrid approach that integrates the machine learning capabilities of random forest with fractional kriging to optimize the variogram, thereby enhancing the accuracy of spatial predictions and better capturing the underlying variability.
The results of this study demonstrate the effectiveness of the proposed model for the variability analysis of bedrock elevation data. Compared with traditional interpolation methods, the proposed model provides a more robust solution for capturing complex spatial relations. Notably, the significant reduction in mean squared error (MSE) achieved by the proposed model underlines its effectiveness in managing the complex variability found in bedrock elevation data. The superior model fitting, reduced prediction errors, and enhanced reliability shown in the cross-validation metrics highlight the strength of this hybrid approach in addressing subsurface variability.
Despite these promising results, there are still challenges and potential areas for future improvement. One notable limitation lies in the computational cost associated with training the random forest model and optimizing the variogram parameters, which can be substantial for larger datasets. Future studies could explore the use of more computationally efficient algorithms or parallel processing techniques to address this challenge. Additionally, incorporating additional geological variables, such as soil type, groundwater levels, or rock permeability, could further enhance the predictive capability of the model and provide a more comprehensive understanding of subsurface variability. Moreover, applying the proposed method to different types of geological data could help validate its generalizability and expand its applicability across various geoscientific disciplines.