1. Introduction
Since its introduction, geostatistical modeling has played a crucial role in the stochastic simulation of geological systems by applying the concept of conditional probability to generate system realizations. Traditional variogram-based simulations, which rely on two-point statistics, often struggle to accurately represent the complex and heterogeneous spatial structures found in geological formations [1]. Multiple-point statistics (MPS) algorithms, however, enhance this by capturing connectivity and intricate features using higher-order statistics within a multiple-point framework [2]. These MPS methods extract structural statistics from a training image (TI), which serves as a model of the spatial structures to be replicated, adding physical realism to stochastic models [3,4]. Recent developments have also explored TI-free approaches [5,6]. MPS algorithms have found applications across various fields, including reservoir geology [7,8], mineral deposit modeling [9,10], seismic inversion [11], porosity modeling [12], hydrology [13], groundwater modeling [14,15,16], climate modeling [17,18,19], and remote sensing [20,21,22,23].
MPS algorithms are divided into two main categories, pattern-based and pixel-based simulation methods, each offering its own benefits and drawbacks. The effectiveness of MPS algorithms is evaluated based on their ability to (a) accurately replicate complex geological features, (b) adhere to conditional data, and (c) maintain computational efficiency. Therefore, the selection of the optimal geostatistical method involves finding a balance among these criteria [4]. Pattern-based methods simulate multiple points simultaneously, whereas pixel-based methods sequentially simulate points across the grid, taking into account the surrounding points. Pattern-based algorithms excel in reproducing large-scale features but may show bias when conditioned on hard data [1]. While pattern-based approaches [24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40] are noted for their speed, they are often critiqued for producing simulations with limited variability because they copy extensive areas directly from the training image onto the simulation grid [41,42]. Additionally, pattern-based methods struggle to incorporate hard data efficiently and are challenging to apply in scenarios with dense conditioning data, especially in the mining sector.
Pixel-based MPS algorithms [2,41,42,43,44,45,46,47,48,49,50,51,52,53] are generally noted for their computational demands and slower processing speeds, and they face challenges in capturing the connectivity of complex patterns. Nonetheless, these algorithms are advantageous for generating more realistic simulations, not requiring the merging of patches, and offering adaptability in handling conditioning data [52]. The ENESIM algorithm, introduced by [2], was the first of its kind in pixel-based MPS, calculating the conditional distribution of a simulation node by examining all the matches throughout the TI to select a value, a process that was computationally intense. Ref. [43] later introduced a more efficient version named SNESIM, which pre-stored the conditional probabilities in a tree structure and utilized a multi-grid concept to enhance the connectivity of complex structures. Ref. [49] developed IMPALA, which reduced memory usage by storing data in lists instead of trees. Continuing this trend, the Direct Sampling (DS) method introduced by [45] samples directly from the TI rather than using a conditional probability distribution, reducing the memory requirements. Refs. [41,42] proposed the high-order simulation (HOSIM) algorithm, which operates based on high-order spatial connectivity criteria, termed spatial cumulants, and uses local conditional distributions generated by high-order Legendre polynomials whose coefficients are derived from the training images as a function of the spatial cumulants and the spatial configuration of the hard data. Ref. [51] further refined this approach by numerically approximating the conditional probability density function (cpdf) based on spatial Legendre moments, simplifying the cpdf approximation to a local empirical function to reduce the computational demands. Ref. [50] expanded this method by approximating the cpdf using Legendre-like orthogonal splines, with coefficients derived from high-order spatial statistics obtained from the hard data and the training image. Despite these enhancements, the computational intensity of pixel-based algorithms remains significant.
To enhance the performance of pixel-based MPS algorithms, researchers have explored various strategies. An important aspect of multiple-point geostatistical modeling is handling high-dimensional data, which often becomes a bottleneck. Several methods have been proposed to decrease the dimensionality within the MPS pattern database [28,29,30,32]. Ref. [28] introduced an approach utilizing filter scores to simplify the dimensions of the pattern database. These filter scores represent specific linear combinations of pixel values in each pattern, capturing the directional mean, gradient, and curvature characteristics. Their FILTERSIM algorithm successfully reduces the dimensionality to six and nine for two- and three-dimensional simulations, respectively. Building on this, Ref. [29] developed a variant of FILTERSIM, which identifies pattern similarities through the comparison of filter scores instead of direct pixel-wise comparisons. However, relying solely on a limited number of filter scores can make it difficult to capture the complexity of the patterns, potentially yielding similar scores for distinctly different patterns. Refs. [32,54] opted for a wavelet-based approach, extracting wavelet approximate sub-band coefficients to represent the patterns in a reduced dimension. Their method, WAVESIM, while computationally intensive, offers a quicker implementation and produces realizations that more accurately represent the structures of the training image compared to FILTERSIM. Nonetheless, the WAVESIM algorithm can be sensitive to the number of wavelet coefficients extracted; a high count can still result in considerable dimensionality for specific training images. Ref. [30] implemented multi-dimensional scaling (MDS) to project a database of patterns onto a two-dimensional Cartesian space for easier clustering. Although MDS facilitates clustering by measuring pairwise distances, its computational efficiency decreases with larger pattern sets, and it only operates on local data pairs. Additionally, when high-dimensional data reside on a low-dimensional nonlinear manifold, maintaining proximity between similar points becomes challenging with a linear method like MDS [55,56].
Another major step in reducing the computational burden of pixel-based methods in some MPS models involves clustering the pattern database extracted from the training image [30,32,54,57]. Numerous clustering techniques are available, developed by computer and data scientists, including hierarchical clustering, expectation-maximization, fuzzy c-means, and mean-shift clustering [58]. Yet, one of the most popular and frequently used clustering methods in MPS is k-means clustering [27,28,29,30,32,54], an unsupervised learning algorithm [59]. This distance-based clustering technique offers simplicity and rapid computation as its key advantages. However, a significant limitation, particularly in geostatistical simulations, is its requirement for a predefined number of clusters, which is not inherently known and is challenging to determine accurately in advance. Consequently, using k-means for clustering patterns often necessitates a sensitivity analysis of the number of clusters to optimize the results [54]. Another drawback of the k-means method is its inability to recognize dense data areas effectively, as it relies on the distances between data points, which can sometimes lead to incorrect cluster identifications and raise questions about the method's overall efficiency.
On the implementation side of the MPS methods, the selection of key parameters is critical to their success, necessitating careful tuning to achieve optimal results. In the past, various optimization techniques have been employed to refine the parameters of MPS algorithms. Ref. [60] introduced an optimization-based approach using the Expectation-Maximization algorithm to enhance MPS methods. Ref. [61] developed a method for quantitatively comparing the training image with the realizations produced by MPS. This methodology was later adopted by [62] for tuning the parameters of the Direct Sampling (DS) algorithm [45,47], utilizing Jensen–Shannon Divergence as the objective function and optimizing it through the simulated annealing (SA) algorithm. Ref. [63] proposed a method for quantitative evaluation of MPS results, which involves estimating a coherence map using key point detection and matching. Ref. [64] introduced methods to evaluate the texture, geometry, spatial correlation, and connectivity of the models. Their approach for parameter optimization leverages the gray-level co-occurrence matrix (GLCM) and deep convolutional neural networks (CNNs), providing robust tools for assessing and refining MPS parameters.
In this study, a novel, computationally efficient, pixel-based MPS method is presented, which can preserve the complexity and continuity of the training images in simulations while honoring the conditioning data in generating realizations. The proposed research uses t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction of the pattern database, and subsequent clustering of the patterns is achieved using an unsupervised classification technique, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The t-SNE algorithm is implemented as an efficient dimensionality reduction technique to map and visualize the database of patterns in a two-dimensional Cartesian environment based on their joint probabilities. The main advantage of t-SNE, as a nonlinear method, over linear methods such as MDS and principal component analysis (PCA) is its ability to preserve global and local data structures at the same time [56]. The DBSCAN algorithm clusters the pattern database using the output from the t-SNE algorithm. It is particularly efficient on low-dimensional data, and its main advantages are the discovery of arbitrarily shaped clusters, robust outlier removal, and no need to pre-specify the number of clusters [65]. These methods are applied to reduce the memory required for storing the training image configurations and to reduce the computations needed. In order to optimize and automate the input parameter selection for the proposed method, different methods were implemented to minimize the user's interference. The proposed methodology is validated and tested using different synthetic datasets and applied in a three-dimensional case-study gold mine for simulating an orebody model.
2. Materials and Methods
2.1. Pattern Database Generation
The MPS algorithm proposed here consists of two steps: (a) the generation of a pattern database; and (b) searching for the best match among the patterns to the conditioning data for simulation. The first step in the method used herein is to scan the training image with a given spatial template. Define $ti(\mathbf{u})$ as a value of the training image $TI$, where $\mathbf{u} \in G_{TI}$ and $G_{TI}$ is the regular Cartesian grid discretizing the training image. $dev_T(\mathbf{u})$ indicates a specific multiple-point vector of $ti$ values within a template $T$ centered at node $\mathbf{u}$, which is provided via Equation (1):

$$dev_T(\mathbf{u}) = \left\{ ti(\mathbf{u} + \mathbf{h}_1),\ ti(\mathbf{u} + \mathbf{h}_2),\ \ldots,\ ti(\mathbf{u} + \mathbf{h}_{\alpha}) \right\} \quad (1)$$

where the $\mathbf{h}_i$ vectors are the vectors defining the geometry of the $\alpha$ nodes of template $T$, and $i = 1, \ldots, \alpha$. The vector $\mathbf{h}_1 = \mathbf{0}$ represents the central location $\mathbf{u}$ of template $T$.
The pattern database, $patdb_T$, the same as in other multiple-point algorithms, is then obtained by scanning $TI$ using template $T$ and storing the multiple-point $dev_T$ vectors in the database. The patterns in the pattern database $patdb_T$ are location-independent, and the $k$th pattern is presented via Equation (2):

$$pat_T^k = \left\{ pat^k(\mathbf{h}_1),\ pat^k(\mathbf{h}_2),\ \ldots,\ pat^k(\mathbf{h}_{\alpha}) \right\} \quad (2)$$

where $k = 1, 2, \ldots, N_{pat}$, $N_{pat}$ is the number of patterns in the pattern database, and the $pat^k(\mathbf{h}_i)$ are the values obtained from $TI$.
It is noteworthy that the pattern database built is independent of the pattern locations. For the continuous simulation cases, the patterns are stored exactly as they appear in the training image. However, for a categorical training image with $M$ categories, the training image is first transformed into $M$ sets of binary values, according to Equation (3):

$$ti_m(\mathbf{u}) = \begin{cases} 1, & \text{if } ti(\mathbf{u}) \text{ belongs to category } m \\ 0, & \text{otherwise} \end{cases}, \quad m = 1, \ldots, M \quad (3)$$

Accordingly, each location is represented by a vector of binary values, where the $m$th element is 1 if that node belongs to category $m$, and 0 otherwise. Thus, for each node, exactly one element is equal to 1.
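For illustration, the sketch below shows one way the scanning and binary transform could be implemented; it assumes a 2D NumPy array `ti` holding the training image and a square template of odd side `t`, and the helper names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def extract_patterns(ti, t):
    """Scan a 2D training image with a t x t template and return the pattern
    database as an (N_pat, t*t) array, one flattened pattern per row."""
    rows, cols = ti.shape
    patterns = [ti[i:i + t, j:j + t].ravel()
                for i in range(rows - t + 1)
                for j in range(cols - t + 1)]
    return np.asarray(patterns)

def one_hot_encode(patterns, categories):
    """Equation (3): replace each categorical value by binary indicators, so a
    pattern of length t*t becomes a vector of length t*t*M."""
    return np.concatenate([(patterns == c).astype(float) for c in categories],
                          axis=1)

# Usage sketch for a binary channel TI with categories {0, 1}:
# patdb = one_hot_encode(extract_patterns(ti, t=13), categories=[0, 1])
```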
2.2. Dimensional Reduction of Pattern Database
During the simulation, the pattern database is searched to find the best-matched conditional probability corresponding to the multiple-point data. Therefore, to construct the conditional probabilities, similar patterns are grouped together.
Since the number of patterns ($N_{pat}$) and the dimension of the patterns ($\alpha$) in the pattern database are very large, grouping the patterns from the pattern database can be computationally expensive. To reduce the computational time of the proposed MPS algorithm, instead of grouping $patdb_T$ in its original dimensions, a dimensionally reduced database is used. In this research, the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm is applied for dimensionality reduction [66].
Assume the pattern database $patdb_T = \{ pat_T^k,\ k = 1, \ldots, N_{pat} \}$ is generated and constitutes a data manifold. In t-SNE, the patterns $pat_T^k$, $k = 1, \ldots, N_{pat}$, in $patdb_T$ should be projected to points $y_k$, $k = 1, \ldots, N_{pat}$, in the projection space $Y$ such that as much structure as possible is preserved. In this non-parametric dimensionality reduction method, a cost-function-based approach is used in which each pattern $pat_T^k$, contained in a high-dimensional vector space, is projected to a low-dimensional space $Y$ in such a way that the characteristics of the low-dimensional points mimic the characteristics of their high-dimensional counterparts. A probabilistic approach is used, where the pairwise similarities in the original pattern space are defined via Equation (4):

$$p_{l|k} = \frac{\exp\!\left( -\lVert pat_T^k - pat_T^l \rVert^2 / 2\sigma_k^2 \right)}{\sum_{m \neq k} \exp\!\left( -\lVert pat_T^k - pat_T^m \rVert^2 / 2\sigma_k^2 \right)} \quad (4)$$

where $p_{l|k}$ depends on the pairwise distances of the patterns, $\lVert \cdot \rVert$ is the norm operator, and the standard deviation $\sigma_k$ for each high-dimensional pattern $pat_T^k$ is calculated so that the perplexity of each point is at a predetermined level. Perplexity is the effective number of local neighbors of each point.
Similarly, the probability distribution in the t-SNE projection space can be defined via Equation (5):

$$q_{kl} = \frac{\left( 1 + \lVert y_k - y_l \rVert^2 \right)^{-1}}{\sum_{m \neq n} \left( 1 + \lVert y_m - y_n \rVert^2 \right)^{-1}} \quad (5)$$

To avoid the crowding problem, a long-tailed distribution, the Student's t-distribution, is used in the projection space. The goal is to find projections $y_k$ such that the difference between the high-dimensional similarities $p_{kl}$ (the symmetrized form of Equation (4)) and the low-dimensional similarities $q_{kl}$ becomes very small, as measured by the Kullback–Leibler divergence in Equation (6):

$$C = KL(P \,\|\, Q) = \sum_{k} \sum_{l \neq k} p_{kl} \log \frac{p_{kl}}{q_{kl}} \quad (6)$$

To achieve this goal, t-SNE uses a gradient-based algorithm [67] to iteratively minimize this divergence.
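To make Equations (4)–(6) concrete, the following minimal sketch computes Gaussian affinities with a single fixed bandwidth (in place of the per-point perplexity search), the Student-t affinities of a candidate 2D embedding, and the Kullback–Leibler cost; it is illustrative only and not the optimizer used in practice.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    """Equation (4) with one fixed bandwidth (each sigma_k is normally tuned so
    that every point reaches the target perplexity), then symmetrized."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)      # conditional p_{l|k}
    return (P + P.T) / (2.0 * X.shape[0])  # joint p_{kl}

def student_t_affinities(Y):
    """Equation (5): heavy-tailed affinities in the low-dimensional embedding."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_cost(P, Q, eps=1e-12):
    """Equation (6): the Kullback-Leibler divergence minimized by t-SNE."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```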
2.3. Pattern Clustering Using DBSCAN
The DBSCAN clustering algorithm is designed to analyze the projected patterns on the basis of their density, as opposed to their distance. One of the main advantages of DBSCAN is that, unlike k-means clustering, it does not require the cluster number as an input parameter to perform clustering. DBSCAN can also find arbitrarily shaped clusters, which is not the case for many clustering algorithms. It can even find a cluster completely surrounded by a different cluster. The other advantage of DBSCAN compared to other clustering methods is that it has a notion of noise (outlier). Additionally, it is mostly insensitive to the ordering of the points in the database.
Before explaining the DBSCAN algorithm, we need to define some terminology used in this method:
Definition 1:
$Eps$-neighborhood of a point $y$: the circular area with $y$ as its center and $Eps$ as its radius, where $Eps > 0$. The set of points included in the $Eps$-neighborhood of $y$ is denoted $N_{Eps}(y)$.
Definition 2:
$MinPts$: the minimum number of points required to form a cluster, a threshold selected by the user.
Definition 3:
Density of point $y$: the number of points within the $Eps$-neighborhood of $y$.
Definition 4:
Core point: a point whose density is greater than or equal to $MinPts$.
The DBSCAN algorithm forms clusters from the projected database $Y$ by performing the following steps [65]:
1. Select a random unlabeled projected point $y$ as the arbitrary starting point from $Y$, and assign the first cluster label, $C$.
2. Retrieve the density within the $Eps$-neighborhood of $y$. If the density is less than $MinPts$, label $y$ as an outlier point (noise). Otherwise, label $y$ as a core point belonging to the existing cluster $C$.
3. Iterate over each core point and repeat step 2 until no new $Eps$-neighborhood points are found, i.e., until the density-connected cluster is completely identified.
4. Select the next unlabeled point as the current point, and increase the cluster count by 1 to form a new cluster.
5. Repeat steps 2–4 until all the points in $Y$ are labeled.
After clustering the projected pattern database using DBSCAN, prototypes of the classes are calculated. The prototype is the single representative of all the members of each class. These prototypes are used during the simulation process, when the similarity between the conditioning data event and the prototype of a class is calculated. The prototype value is obtained by averaging all the patterns falling into a particular cluster after clustering.
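A minimal sketch of this clustering-and-prototype step is given below, assuming the 2D t-SNE embedding `Y` and the pattern database `patdb` are NumPy arrays with one row per pattern; the function name and parameter values are placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_and_prototypes(Y, patdb, eps, min_pts):
    """Cluster the 2D-projected patterns with DBSCAN and compute each cluster's
    prototype as the point-wise average of its member patterns."""
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(Y)
    prototypes = {}
    for label in np.unique(labels):
        if label == -1:                      # -1 marks DBSCAN noise (outliers)
            continue
        prototypes[label] = patdb[labels == label].mean(axis=0)
    return labels, prototypes
```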
2.4. Simulation Step
After classifying the pattern database and calculating the prototypes, simulation of the spatial patterns is carried out. During the simulation, the similarity between the conditioning data event and the prototypes of the classes is determined. A sequential simulation algorithm [68] is used in this paper. At each visited node, a conditioning data event is obtained by placing the same template used on the training image, centering it at the node to be simulated. The similarity between the conditioning data event and the prototypes of the clusters is calculated by a distance function; the distance function used in this paper is the Minkowski distance [69]. The best-matched prototype for the conditioning data event is selected by minimizing the distance function value. A value is then drawn from the center of the best-matched class by a Monte Carlo simulation approach [70].
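The simulation step described above could be sketched as follows; the data structures (`prototypes`, `cluster_members`, `offsets`) are hypothetical containers for the quantities defined earlier, and hard data are assumed to be pre-assigned to the grid with unknown nodes marked as NaN.

```python
import numpy as np

def simulate_node(grid, i, j, offsets, prototypes, cluster_members, rng):
    """Simulate one node: build the conditioning data event from the grid, find
    the closest prototype with a first-order Minkowski (Manhattan) distance over
    the informed nodes, then draw the central value from the matched cluster."""
    dev = np.full(len(offsets), np.nan)
    for k, (di, dj) in enumerate(offsets):
        r, c = i + di, j + dj
        if 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]:
            dev[k] = grid[r, c]
    known = ~np.isnan(dev)
    best = min(prototypes, key=lambda lab:
               np.abs(dev[known] - prototypes[lab][known]).sum())
    member = cluster_members[best][rng.integers(len(cluster_members[best]))]
    return member[len(offsets) // 2]        # value at the template's central node

# Sequential loop: unknown nodes (NaN) are visited along a random path.
# rng = np.random.default_rng(0)
# for i, j in rng.permutation(np.argwhere(np.isnan(grid))):
#     grid[i, j] = simulate_node(grid, i, j, offsets, prototypes, cluster_members, rng)
```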
2.5. Implementation Strategies
A flowchart of the proposed method (Pixel-MPS) is provided in Figure 1. The following subsections provide information on the implementation aspects and the individual steps taken for the simulation.
2.5.1. Template Size Determination
The training image is the main source for borrowing multiple-point statistical information in MPS modeling. To exploit this information, all the possible overlapping patterns are extracted from the TI and stored in the pattern database, as discussed in Section 2.1. The selection of the template size used to extract patterns is a key decision, as the template size can significantly impact the reconstruction process and the quality of the realizations depends on it [71]. Numerous studies have shown this sensitivity, and sensitivity analysis of the template size has been a central part of many of them [30]. The template size should be as small as possible to reduce the computational time and as large as necessary to represent the features of a specific training image. Ref. [30] implemented a method that uses the concept of entropy to determine the size of the template for scanning the training image. We have adopted their approach in this study, using the Shannon entropy, a measure of the amount of information or randomness in a set of data [72] with common applications in digital image processing. The optimal template size selection starts by scanning the training image with different template sizes. After calculating the Shannon entropy for each extracted pattern, the template size leading to the maximum average entropy is automatically used as the optimal template size.
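A sketch of this entropy-based selection is shown below for a categorical (or binned) training image; `extract_patterns` refers to the earlier pattern-extraction sketch, and the candidate size range is an assumed example.

```python
import numpy as np

def shannon_entropy(pattern):
    """Shannon entropy of the value frequencies within one extracted pattern."""
    _, counts = np.unique(pattern, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def optimal_template_size(ti, candidate_sizes, extract_patterns):
    """Return the template size whose patterns show the maximum average entropy."""
    mean_entropy = {}
    for t in candidate_sizes:
        patterns = extract_patterns(ti, t)
        mean_entropy[t] = np.mean([shannon_entropy(p) for p in patterns])
    return max(mean_entropy, key=mean_entropy.get)

# Example: scan odd template sizes from 5 x 5 to 29 x 29 (assumed range)
# t_opt = optimal_template_size(ti, range(5, 31, 2), extract_patterns)
```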
2.5.2. Dimensionality Reduction
As discussed in Section 2.2, the very-high-dimensional pattern database is projected onto a 2D grid using the t-SNE algorithm. However, to optimize the algorithm and ease the computation, it is recommended to apply the Principal Component Analysis (PCA) method [73,74] to the data when the number of features (template size) is relatively large, which reduces the dimensionality to a reasonable number [56,66]. This suppresses some noise and speeds up the computation of the pairwise distances between samples. PCA is a multivariate technique based on eigenvector analysis that helps uncover the underlying structure of the data by explaining its variance most effectively. It focuses on the variance within the dataset to identify and prioritize the features that account for the most variance. In PCA, the number of principal components equals the number of original data variables, as each component is constructed by projecting the data along its axis. However, typically only the first few principal components capture the majority of the variance in the data, because the components are ordered according to the amount of variance each explains [74]. This prioritization is why the initial components are usually sufficient to represent the original features of the dataset effectively.
In terms of the t-SNE parameters, the typical values for perplexity are proposed to be between 5 and 50, and the suggested trade-off value of 30 was used; the performance of t-SNE is fairly robust under different settings of the perplexity [66]. The optimization procedure explained via Equation (6) in Section 2.2 is the most time-consuming part of the algorithm. The t-SNE optimization can be implemented via the exact but expensive algorithm, which optimizes the Kullback–Leibler divergence between the distributions in the original space and the embedded space. However, the more efficient Barnes–Hut approximation was used to speed up the program and cut its memory usage. The underlying idea is that the gradient is similar for nearby points, so the computations can be simplified [56,75].
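As a sketch of this step, the snippet below chains PCA pre-reduction with Barnes–Hut t-SNE using scikit-learn equivalents (the paper's reference implementation is in MATLAB); the parameter values mirror those discussed above.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_patterns(patdb, pca_components=80, perplexity=30, random_state=0):
    """Pre-reduce the pattern database with PCA when it has many features, then
    embed it in 2D with Barnes-Hut t-SNE."""
    X = patdb
    if X.shape[1] > pca_components:
        X = PCA(n_components=pca_components).fit_transform(X)
    return TSNE(n_components=2, perplexity=perplexity, method="barnes_hut",
                random_state=random_state).fit_transform(X)
```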
2.5.3. Pattern Clustering
As discussed in Section 2.3, the 2D projected pattern database is clustered into similar groups using the DBSCAN algorithm. In terms of the input parameters, the algorithm developers suggest that $MinPts$ should be selected between the number of dimensions of the data and "twice the number of dimensions", a rule widely used by other researchers [76]. In this study's case of a two-dimensional embedding, the lower end of this range is selected, which generally results in a higher number of clusters and, consequently, prototypes. This leads to a higher chance of better similarity detection, because the algorithm is provided with a larger number of prototypes as options when selecting the best match during simulation of the nodes. Assigning a larger $MinPts$ for three-dimensional simulations can decrease the computational time by reducing the number of clusters; however, the lower value is used to ensure quality reconstructions in three-dimensional as well as two-dimensional simulations.
In order to define $Eps$ in an optimized manner, the k-nearest neighbor (KNN) method [77] is used to calculate the distance to the kth nearest neighbor of each point in the database. The distance values of all the points are then sorted from low to high and plotted against the point indices. Finally, the knee of the resulting curve is found automatically and taken as the optimal value of $Eps$. The search range for k in the KNN algorithm is defined as 2 (the number of dimensions of the projected data) to 3 (the dimension of the data plus 1). $Eps$ is derived by averaging the knees of the curves obtained for the two tested k values.
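One possible automated implementation of this Eps estimation is sketched below; the knee is located with a simple maximum-distance-from-chord heuristic, which is an assumption, as the exact knee detector is not specified here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_eps(Y, k_values=(2, 3)):
    """Average the knees of the sorted k-distance curves for the tested k values."""
    knees = []
    for k in k_values:
        # distance of every point to its k-th nearest neighbor (self excluded)
        dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(Y).kneighbors(Y)
        d = np.sort(dists[:, -1])
        # knee: point of maximum gap below the chord joining the curve endpoints
        x = np.linspace(0.0, 1.0, len(d))
        chord = d[0] + x * (d[-1] - d[0])
        knees.append(d[np.argmax(chord - d)])
    return float(np.mean(knees))
```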
2.5.4. Similarity Search
During the process of sequential simulation, the distance between the data event in the simulation grid and the prototypes is measured, and the best-matched prototype, with the closest distance, is found. Then, the best single pattern within that prototype's corresponding cluster is found using a second distance comparison. If the number of patterns in a class is large, the conditional cumulative distribution function (ccdf) is used to draw a random pattern from the selected cluster. This way, during sequential simulation, the MPS algorithm searches for similarity only among a limited number of prototypes and patterns, not among all the available patterns within the pattern database. It should be noted that, after performing clustering and prototype generation, clusters with identical prototypes are merged in order to improve the computational performance of the similarity search; this saves computational time when similar cluster prototypes exist in the pattern database, since the similarity loop is shortened.
This study proposes using a weight distribution in the methodology while measuring the similarity in the template-matching step. Since the proposed method is a pixel-based simulation method, there is a risk of mismatching due to the number of nodes that take part in the similarity detection steps. Without assigning higher weights to the nodes near the simulation node, the selected pattern may not be the best match for the data event. Thus, as shown in Figure 2, a weight distribution is applied while measuring distances, and the first-order Minkowski distance (Manhattan distance) is chosen as the distance measure, so the weights have their appropriate impact in the template-matching process. The distance function used for the similarity measurement is provided via Equation (7):

$$d\!\left(dev_T(\mathbf{u}),\ prot^c\right) = \sum_{i=1}^{\alpha} w_i \left| dev_T(\mathbf{u} + \mathbf{h}_i) - prot^c(\mathbf{h}_i) \right| \quad (7)$$

where $dev_T(\mathbf{u})$ is the conditioning data event, $w_i$ is the pixel weight, and $prot^c$ is the prototype of class $c$. If some of the $dev_T(\mathbf{u} + \mathbf{h}_i)$ values are missing in the conditioning data event, those nodes are not considered in the distance calculation.
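A sketch of Equation (7) with missing-node handling is given below; the `center_weights` helper shows one plausible weighting scheme (decay with distance from the central node), which is an assumption standing in for the distribution of Figure 2.

```python
import numpy as np

def weighted_manhattan(dev, prototype, weights):
    """Equation (7): weighted first-order Minkowski (Manhattan) distance between
    a conditioning data event and a class prototype; uninformed nodes (NaN in
    dev) are skipped, as stated above."""
    known = ~np.isnan(dev)
    return float(np.sum(weights[known] * np.abs(dev[known] - prototype[known])))

def center_weights(t):
    """Assumed weighting scheme: weights decay with Chebyshev distance from the
    template's central node, emphasizing nodes near the simulation node."""
    r = t // 2
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    return (1.0 / (1.0 + np.maximum(np.abs(yy), np.abs(xx)))).ravel()
```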
The similarity comparison step becomes expensive when the template size is large and when all, or a high number of, the corresponding nodes around the simulation node are known. This problem is intensified for three-dimensional simulation, where the computational complexity is higher. In this study, the calculation and use of the ccdf is limited to clusters that contain a large number of pattern members; for any cluster with fewer than 100 patterns, the ccdf is not used. The reason is computational: searching among the many patterns within a populous cluster to simulate a single node is expensive, so drawing from the ccdf avoids that cost, whereas for a cluster with a limited number of patterns the ccdf may be unstable, and a second step of similarity measurement within the cluster is performed instead.
For continuous variables, however, the second-step similarity detection is performed by applying the distance function in Equation (7) again to find the best match within the cluster in all cases, even in clusters with a large number of members. This second step of similarity measurement is carried out in the same way as the first, but within the cluster corresponding to the best prototype match from the first step. This means that, for continuous variables, there is a higher chance that the reproduced statistics and patterns of individual realizations will be closer to the training image and hard data than for simulated categorical variables. However, by generating a larger number of realizations of the categorical variables, the effect of assigning a non-dominant category to simulation nodes drawn from highly populated clusters is moderated.
2.5.5. Model Validation
In order to validate the results of the proposed approach, different statistical and visual investigations are performed. Within the statistical framework, the first-order and second-order statistics of the training images and the realizations are compared to ensure the reproduced statistics are acceptably close to the reference statistics. Variograms and cumulant maps are also generated to enable a comparison between the higher-order statistics of the reconstructions and the reference training images. Additionally, the statistics of the hard data are shown for reference.
Within the visualization framework, the reproduced patterns of the realizations are compared to those available in the training image to ensure the reproduction of the specific patterns and shapes for both the continuous and categorical variables. Uncertainty analysis is also performed by generating E-type plots of the realizations in order to understand the variance across the simulation domain, comparing regions with abundant hard data to data-scarce regions.
2.5.6. Comparative Analysis
In order to compare the results with previously developed methods in terms of accuracy and speed, and to justify the choice of algorithms used in this research, a set of sensitivity analyses is performed. The runtimes of the different scenarios and methods are provided. Cluster evaluation is also performed to compare the results of the proposed method (t-SNE + DBSCAN) with a previously introduced approach (MDS + K-means); the quality of the clustering methods is compared quantitatively via the Within-Cluster Sum of Squares (WCSS), which measures the sum of the squared distances of all the points within a cluster to the cluster centroid. Additionally, the Davies–Bouldin Index (DBI) is calculated, which measures the average similarity of each cluster with its most similar cluster, with the aim of evaluating cluster spread and separation [78].
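The two clustering-quality metrics could be computed as sketched below, with WCSS written out explicitly and the Davies–Bouldin Index taken from scikit-learn; applying them to the normalized 2D projection, as in the comparative analysis, is assumed to be done by the caller.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def wcss(Y, labels):
    """Within-Cluster Sum of Squares: squared distances of the points in each
    (non-noise) cluster to that cluster's centroid, summed over all clusters."""
    total = 0.0
    for label in np.unique(labels):
        if label == -1:                      # skip DBSCAN noise points
            continue
        members = Y[labels == label]
        total += float(np.sum((members - members.mean(axis=0)) ** 2))
    return total

# Davies-Bouldin Index (lower is better), e.g. on the clustered 2D projection:
# dbi = davies_bouldin_score(Y[labels != -1], labels[labels != -1])
```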
2.6. Summary of the Methodology
Here, a brief summary of the proposed methodology is provided:
1. Automatically determine the template size T using an entropy-based approach.
2. Scan the training image using the automatically determined template size T and extract all the patterns.
3. Reduce the dimensionality of the pattern database with the PCA method, if needed, and subsequently map it stochastically to two dimensions with the t-SNE algorithm.
4. Cluster the patterns based on the two-dimensional map using the DBSCAN algorithm.
5. Calculate each class prototype by point-wise averaging of all the patterns within the class.
6. In the case of conditional simulation, assign the hard data within the simulation grid (SG) and mark those nodes as seen (sampled) points.
7. Define a random path visiting once and only once all the unseen nodes.
8. Use the same template T at each unseen location to extract the data event on the SG.
9. Find the best match between the class prototypes and the data event in the simulation grid.
10. Sample a value for the central node from the best-matched class using either the second-stage distance function or the ccdf.
11. Assign the sampled value to the current simulation point.
12. Continue until all the grid points are filled with a simulated value.
13. Perform median filtering to enhance the simulation quality.
14. Repeat steps 7 to 13 of the simulation process to generate different equiprobable realizations.
3. Validation Results
In this section, the performance of the model is demonstrated by visual and statistical comparison of the generated realizations with the continuous and categorical training images used in this study. A total of four training images were used for the validation and testing of the method: a two-dimensional categorical TI, a two-dimensional continuous TI, a three-dimensional two-category TI, and a three-dimensional case study with three-category data. These TIs were selected for their ability to represent a range of geological scenarios, ensuring comprehensive validation. The 2D categorical and continuous TIs (Figure 3), as well as the 3D channel TI, represent the complex non-linear structures and connectivity observed in Earth systems, with the complexity even stronger in the 3D channel TI (see Section Three-Dimensional TI). The continuous TI (Figure 3b) and the 3D channel TI are also characterized by non-stationary propagation of patterns, as the patterns differ across different parts of the TIs. The case-study 3D TI (see Section 4) was chosen to ensure the applicability of the method in a complicated real-world 3D domain.
The categorical (size = [101 × 101]) and continuous (size = [100 × 128]) two-dimensional TIs are shown in Figure 3. The binary training image in Figure 3a [30] represents a deposit with complex channels. For the simulation of continuous data, an exhaustive two-dimensional continuous horizontal slice (Figure 3b) was obtained from a three-dimensional fluvial reservoir of the Stanford V Reservoir Dataset [79]; the channel configurations and orientations are complex in nature and change from one slice to another in the vertical direction.
Automatic template size determination was performed on the training images based on the approach explained in Section 2.5.1, and the optimal template size (T) obtained was T = [13 × 13] for the categorical channel TI, as shown in Figure 4, and T = [23 × 23] for the continuous TI.
Figure 5 shows the automated approach by which the Eps parameter of the DBSCAN algorithm is selected through averaging an iterative k-nearest neighbor search with different numbers of neighbors (see Section 2.5.3 for more information). Based on Figure 5, the variability of Eps is not considerable as the number of neighboring points in the k-nearest neighbor search varies, so little sensitivity to this parameter can be inferred. It should also be noted that, depending on the specific training image used for simulation and the nature and number of the patterns extracted, the optimal range of nearest neighbors in the k-nearest neighbor search for Eps estimation can be further investigated by plotting the curves for different values to better understand potential sensitivity or variations. However, the maximum value of the range should not be too high, as it will lead to undesirably high Eps estimates, which can cause DBSCAN to form a significantly lower number of clusters.
As explained in Section 2.5.2, for better t-SNE performance, the dimensionality should first be decreased by PCA when applicable. Thus, based on the value suggested by the t-SNE algorithm developers and on experiments varying the number between 50 and 100, a value of 80 was chosen as the cut-off for PCA implementation, leading to a more robust embedding performance. The dimensionality of the patterns was reduced to 80 whenever it exceeded that value; however, no tangible impact on the performance of t-SNE was observed within the 50–100 range. In the different two- and three-dimensional cases, it was observed that the 80 components selected via PCA explain more than 90% of the variance of the original features.
The stochastic embedding and clustering of the pattern database, generated using the automatically selected template size, were then performed. In a single run, the algorithm produced 690 and 177 distinct clusters for the pattern databases of the categorical and continuous TIs (Figure 3), respectively. As an example, a visualization of the clustering of the channel TI's pattern database after mapping the high-dimensional patterns is shown in Figure 6.
To verify the combined impact of the t-SNE mapping and the DBSCAN clustering, patterns from the clusters were inspected visually. As an example, all the patterns forming one cluster of the categorical training image are depicted in Figure 7, together with the generated prototype of that cluster. Because t-SNE works on probabilities (not directly on distances), measuring the Euclidean distance in the high-dimensional space and the two-dimensional space would not help assess the goodness of fit of the embedding. Due to the stochastic nature of the algorithm, it is good practice to run it a few times and keep the result with the lowest KL divergence. In the cases of the present study, the algorithm provided acceptably small KL divergence values in all runs, and no need to run t-SNE multiple times was identified.
3.1. Unconditional Simulation
The results of the unconditional simulation of the introduced categorical and continuous training images are presented herein. Generally, a minimum number of realizations should be generated to obtain an understanding of the uncertainty of the simulation [71]. A total of 100 realizations were generated for each TI shown in Figure 3 to assess the uncertainty of the modeling. Three realizations of the channel TI (Figure 3a) generated by the proposed method are depicted in Figure 8.
It can be observed that the continuity of the training image streams is reproduced well by the proposed method. The main differences of the proposed method compared to methods such as FILTERSIM are the two-dimensional mapping of the pattern database by t-SNE instead of using filter scores, and the subsequent density-based clustering of the mapped data by DBSCAN instead of k-means clustering on the high-dimensional database. In addition, the proposed methodology uses a pixel-based simulation approach, as opposed to the pattern-based simulation implemented in FILTERSIM.
Statistical analysis was performed on the simulation results to show the efficiency of the methodology. Figure 9 provides a first-order statistical comparison of the realizations and training images using box plots and a mean-comparison plot. It can be inferred from the plots that the simulations reproduce the one-point statistics of the training image, and there is almost no uncertainty regarding this criterion, as the mean plot shows no considerable deviations from the red line (TI average value) among the 100 realizations.
Figure 10 provides a variogram comparison as a measure of the two-point statistics, together with the E-type (ensemble) and variance maps of the realizations. From the variogram comparison, a near-perfect match is observed between the two-point statistics of the training image and the 100 generated realizations, and the two-point statistical characteristics of the TI are maintained well over all lag distances across the generated realizations. The E-type and variance maps confirm that the stochastic nature of the simulation method is preserved in the unconditional simulation, as the simulations are not constrained by any hard data. The variability is high across almost the whole domain, and the channels propagate in different locations within different realizations.
Figure 11 depicts the third-order spatial cumulant maps [41,42] of a realization and the training image for comparison as a higher-order validation method. From the cumulant maps, it can be inferred that, for both small and large lag distances, the proposed algorithm reproduces the third-order statistics of the channel TI well in unconditional categorical simulation.
The results of applying the methodology to the continuous training image (Figure 3b) also confirmed the applicability of the proposed simulation method to continuous variables. Figure 12 shows the quality of the realizations; it can be observed visually that the continuity of the high-valued streams (channels) is reproduced.
Figure 13 provides histogram and variogram comparison plots for validation. The complex bimodal histograms of the generated realizations match that of the training image according to Figure 13a, which confirms the acceptable performance of the method in reproducing the first-order statistics of the TI. The histogram of the TI (in red) falls in between those of the realizations, and higher deviations from the TI are seen at the two modes rather than at the values between the heads of the histogram. The mean plot provided in Figure 13c shows that there is little uncertainty in the reproduction of the first-order statistics of the continuous TI, as none of the 100 realizations has a mean value far from that of the TI, shown by the red line. In addition, the realizations successfully reproduce variograms similar to that of the TI according to Figure 13b. The two-point statistics of the realizations remain similar to those of the continuous TI over different lag distances, and deviations from the red line (TI variogram) are observed for only a few of the realizations' variograms, mostly at the larger lag distances.
Figure 12 also provides E-type and variance maps calculated from the 100 realizations. Figure 14 presents cumulant maps of the TI and a realization. From the cumulant maps, it can be inferred that, for small lag distances, the proposed algorithm reproduces the third-order statistics of the continuous TI in unconditional simulation, and for larger lag distances, a reasonable match is also observed in most areas of the maps. The variability shown in the variance plot is high across almost the whole domain, and high-value channels propagate in different locations within different realizations according to the E-type plot, which shows that the unconditional simulation performs well in generating equiprobable reconstructions of the TI.
Three-Dimensional TI
Additionally, the algorithm was tested on a three-dimensional TI with a size of [69 × 69 × 39], as shown in Figure 15, accompanied by its cross-sectional views (middle row). A template size of T = [15 × 15 × 15] was used, as concluded from the entropy-based method. Figure 16 shows a few unconditional simulations and their cross-sectional views, allowing a comparison of the training image with the realizations and their structures. It can be seen that the method reproduces the desired shapes and continuities present in the three-dimensional TI with reasonable accuracy.
Figure 17 compares the 24 generated simulations and the TI statistically. The box plots in Figure 17a show that the median frequencies of the simulated nodes in each category are near the corresponding values of the training image (red circle), and the TI class frequencies fall within the 50% boxes of the realizations' categorical frequencies. The mean plot in Figure 17b also confirms the similarity of the first-order statistics of the TI (red line) and the realizations by comparing their average values. Looking at the realizations produced by the proposed method (Figure 16) and comparing them to the TI (Figure 15), it can be observed that the continuity of the channels is reproduced within the simulation domain. The trapezoids present in the TI are very difficult to reproduce, even using pattern-based MPS simulation methods. The results show that those trapezoid-shaped structures are reproduced in the unconditional simulations, although they are not precisely identical to the shapes in the training image.
3.2. Conditional Simulation
The proposed method was tested for conditional simulation to check the accuracy of the simulations in honoring the hard data during the simulation process. Figure 18a depicts 361 points of hard data used for the conditional simulation of the channel training image, arranged so as to form a dense area of points in addition to vertical discrete streams. The purpose is to analyze the impact on the E-type and variance maps (Figure 18e,f) in comparison with the hard data and the unconditional simulations' E-type plots (Figure 12d,e). The variance is very small in areas where conditioning data points are available.
Among the hard dataset, 72 data points (20%) are located in the center of the grid, forming a rectangular area that plays the role of a dense region of conditioning data. The other 289 hard data are distributed at equal distances across the grid, as shown in Figure 18a. Looking at the regions that contain a few non-channel-class hard data (blue), uninterrupted by channel-class points (yellow), shows low variance in the simulations and non-channel-class dominance in the ensemble plots (Figure 18e,f).
The areas of proportionally lower variance clearly show how the algorithm honors the hard data and how the variability increases with distance from the hard data. The conditioning data available in those areas are all of the same class, so the other class has little presence there (leading to low variance). The training image contains structures with "Y-like" shapes, or shapes partially resembling a "Y"; each local straight stream tends to turn right, left, or in both directions at certain angles after a certain length of vertical continuity. This can be inferred from the E-type plot of the 100 realizations shown in Figure 18e.
The vertical streams were reconstructed with reasonable accuracy according to the E-type plots, and the variability over those locations is lower, even with such a small number of hard conditioning data points. The dense rectangular area of conditioning data of the same class is dominated by the opposite class on its left and right sides in the E-type plot and in almost all of the realizations. This is an indicator of the ability of the model to respect the formation of channels; in fact, the width of the original TI channels is honored in that area of dense conditioning data. The variability above and below the rectangular channel-class stream is high. This is due to the shapes present in the channel TI, as the channels maintain a straight path for a limited length and then tend to turn in other directions with certain angular deviations.
The isolated channel-class hard data points also mark single channel-class points in the E-type plots that are surrounded by non-channel-class hard data. Looking at the E-type plot, it is observed that even these single points are honored by the algorithm and become points through which channels pass, mostly in two distinct directions. In addition, it is clear from the variance plot that the variability increases gradually with distance from these points. In fact, as a pixel-based method, the proposed algorithm has a high potential for respecting conditioning data. Another point visible in Figure 18 is that, on the left side, there are conditioning data near the domain border, whereas on the right side the conditioning data lie farther from the border. Looking at the variance plot and comparing the two sides clearly shows the difference in the amount of variability (uncertainty) of the simulations produced by the proposed method with and without conditioning data points at the borders.
Figure 19 and Figure 20 provide statistical validation results for 100 realizations of the conditional simulation of the channel TI shown in Figure 3a. The first-, second-, and third-order statistical comparison of the realizations with the corresponding channel TI reveals a very good match, and the uncertainty of the conditional simulation of the categorical variable via the proposed algorithm appears to be very low. The box plots show that the class frequencies of the TI are almost identical to the median frequencies over all the realizations. The variograms of the realizations also show good agreement with the variogram of the TI over approximately all lag values. As shown in Figure 20, the three-point statistical comparison of the training image with the conditional simulation shows a reasonable visual match; for both small and large lag distances, the proposed algorithm reproduces the third-order statistics of the channel TI well in conditional simulation.
The quality of such realizations depends on maintaining the continuity and shape of the channels across the simulated images in the same manner as in the training image. As the hard data are derived from the TI, the realizations should approximately match the TI in terms of where specific textures are located and what the statistical distribution is. In addition, the E-type map will reflect the exhaustive training image when the number of hard data is very high. The algorithm was tested on the channel TI with 1000 randomly distributed hard data, as shown in Figure 21a, and the results presented in Figure 21 confirm this claim. According to the variance plot, the variability of the 100 simulated values over most of the image is very low, resulting in minimal uncertainty in the modeling compared to the previous case with fewer hard data (Figure 18). Two conditionally simulated realizations are also provided in Figure 21, and they are very similar to the categorical channel TI (Figure 3a). Looking at the E-type plot (Figure 21d), channels with continuity over larger distances are visible compared to the previous case (Figure 18). Thus, the effect of the hard data conditioning on the quality of the realizations produced by the method is confirmed.
For the conditional simulation of a continuous variable, 208 hard conditioning data were used (Figure 22a). The hard data are irregularly spaced and scattered over the entire simulation grid domain. Two conditional simulation maps are shown in Figure 22b,c. The visual comparison between the training image and the simulation maps confirms that the proposed method respects the hard data. The shapes and locations of the main channels in the exhaustive continuous training image are well reproduced in the simulated images, especially in the top half of the image. The ensemble (E-type) maps are used to check how the realizations respect the hard conditioning data; the E-type map is generated from 100 realizations. Looking at the green circular areas marked in Figure 22a,e, the lack of conditioning data leads to higher uncertainty, and the simulations show high variance in areas with sparse hard data. The area highlighted by the ellipsoid was selected as an indicator of regions dominated by low-value hard data. Looking at those areas in the E-type and variance plots, it can be inferred that the algorithm respects the hard data by simulating the pixels within those regions with mostly low values and low variability.
On the other hand, within the area marked by the yellow rectangle, the hard data were sampled from both heads of the bimodal distribution of the reference image, so high variability among the realizations is observed in the variance map. The orange diamond in the conditioning data plot marks a single high-value pixel surrounded by a few low-value hard data pixels. It is evident from the E-type map that the algorithm honors that hard data point, and the average value around that point is higher than in its surroundings. This is a clear advantage of the pixel-based simulation method demonstrated by the proposed algorithm over pattern-based simulation, in which single points can be neglected while pasting patterns onto the simulation grid.
The continuous training image used here has a complex bimodal histogram, and the algorithm is able to reproduce this distribution over all the realizations (Figure 23a). A small overestimation of the left-head frequency and a small underestimation of the right-head frequency were observed on average. The variogram comparison plot in Figure 23c reveals that the second-order statistics of the continuous training image are also well reproduced, with the TI variogram falling almost in the middle of those of the realizations over the different lag distances. To show the multiple-point reproduction of the TI by the proposed pixel-based method, three-point cumulant maps of the training image and a simulated realization were generated and are presented in Figure 24. The algorithm's output confirms the reproduction of the TI's third-order statistical characteristics.
For the continuous variable as well, the E-type map closely reproduces the reference TI when the number of hard data is sufficiently high. The algorithm was tested using 500 randomly distributed hard data (Figure 25a), and the results presented in Figure 25c,d confirm this claim. It can be seen from Figure 25b that the channels occurring in the continuous TI are almost perfectly reproduced, owing to the presence of a higher number of conditioning data. In the region bounded by the red lines (60 < y < 80), there are almost no high-value hard data, and consequently no high-value streams were reproduced, which shows the appropriate influence of the hard data on the simulation by the proposed algorithm. The variance plot confirms this as well, because the variability remains very low in that region. A comparison of the E-type maps in Figure 22 and Figure 25 demonstrates the effect of the number of hard data on the simulation quality. In Figure 25c, the channels of the TI are almost perfectly reproduced, as opposed to the ensemble map in Figure 22d, where higher uncertainty (Figure 22e) is revealed. Still, the percentage of hard data in this case is under 4% (Figure 25a).
4. Three-Dimensional Case Study
After validation and testing, the proposed method was applied to a gold deposit to simulate the categorical variables. The primary commodity of the deposit is gold, followed by silver. The deposit type is disseminated in Paleozoic carbonate and siliciclastic rocks with a tabular orebody. The structure is characterized by a narrow reverse fault with a variable dip and host rocks warped by anticlinal folding. The dominant forms of alteration include argillization, decalcification, silicification, and baritization. A cutoff grade of 0.003 AuFA was used for defining the zones of mineralization, and high-grade mineralization was modeled using a 0.17 Au cutoff. The operation type is open-pit surface mining.
The mining company has categorized the exploration drilling data into three different categories, i.e., high-grade, low-grade, and waste data, to create wireframes (solid models) for volumetric analysis. The solid model developed by the mining company was used as the training image, and the categorically transformed composited exploration drilling data served as the hard conditioning data for the simulation. A total of 8759 composite samples with a composite length of 10 m are available for this study. The pixel size of the training image is 25 m × 25 m × 10 m, and the same pixel size was selected for the simulation.
Figure 26 shows the training image and hard conditioning data used in this study. The training image used in this case study has a size of [70 × 60 × 57], and a total of 239,400 blocks were simulated by the proposed method. The solid used as the training image was generated based on updated geological interpretations, including the existence of faults, mineralization zones, lithological characterization, and lithological contacts. The drilling data were considered the source of information with the highest certainty.
To perform the simulation, the template size was first determined in order to extract the pattern database from the training image; the results showed the optimal size to be T = [13 × 13 × 13]. After extraction of the pattern database, the patterns containing above-topography values were removed, and then the stochastic embedding and clustering were performed. The same approach described for the previous three-dimensional simulation was used, meaning that the ccdf was used for sampling from the highly populated clusters (those with more than 100 members). Figure 27 provides two generated realizations and the mean of 16 realizations.
Figure 28 provides a statistical comparison between the hard data, TI, and realizations.
Figure 28 shows that there are some deviations between the hard data and the simulation statistics; however, the results show an excellent match between the training image and the simulations. The reason is that the statistics and frequencies of the different classes are not the same for the conditioning data and the TI. Thus, the simulations, as expected, honor the statistics of the TI rather than those of the hard data. The proposed method, like other MPS methods, is intended to reproduce the TI statistics, as the realizations derived with this methodology are training-image-driven. The statistics of the simulations are pulled slightly toward the conditioning data while remaining similar to the TI; this is particularly clear for class three, as shown in Figure 28c. It can therefore be concluded that, when the TI and the hard data have similar statistics, the simulations will reproduce statistical properties that honor both the TI and the conditioning data.
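As an illustration of how such a comparison can be computed, the following is a minimal MATLAB sketch of class-proportion statistics for the TI, the hard data, and a set of realizations (cf. Figure 28). The variable names (TI, hardData, reals) are assumptions for illustration, not the names used in the released code.

% Minimal sketch: class proportions for the TI, hard data, and realizations.
% TI is the training-image grid, hardData a vector of conditioning categories,
% and reals a cell array of simulated grids (illustrative names).
tiVals  = TI(~isnan(TI));                 % drop above-topography / undefined cells
classes = unique(tiVals);
propTI   = arrayfun(@(c) mean(tiVals == c), classes);
propHard = arrayfun(@(c) mean(hardData(:) == c), classes);
allReals = cell2mat(cellfun(@(r) r(:), reals(:)', 'UniformOutput', false));
simVals  = allReals(~isnan(allReals));
propSim  = arrayfun(@(c) mean(simVals == c), classes);
disp(table(classes, propTI, propHard, propSim));   % side-by-side comparison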
Figure 29 shows a cross-sectional view of the TI and a generated realization to check the ability of the model to reconstruct structures in addition to honoring the conditioning data. A visual comparison of the two plots confirms the accuracy of the proposed pixel-based MPS method in generating realizations that reconstruct the training image structures while honoring the conditioning data. In this case-study three-dimensional TI, there are densely sampled areas of conditioning data indicating where accumulations of orebody material of different grades occur, which makes it necessary to evaluate the results in this regard in addition to checking the reproduction of textures and continuities. As observed in the slice views of Figure 29, even the high-grade zones, which were a minority class, were reproduced by the pixel-based simulation presented in this study: the number of high-grade pixels in the training image was 189 (0.08%), and among the hard data it was 79 (0.14%).
5. Comparative Analysis
To compare the dimensionality reduction and clustering algorithms used in this study (t-SNE and DBSCAN) with a previously used combination (MDS and K-means clustering), we performed a comparative analysis and calculated the WCSS and DBI for both cases on the normalized 2D-projected data. The WCSS increased from 2.52 for t-SNE with DBSCAN to 3.44 for MDS with K-means clustering, showing that the proposed method keeps within-cluster members more similar. In addition, the DBI increased from 0.73 for t-SNE with DBSCAN to 0.82 for MDS with K-means clustering, indicating better cluster separation and lower cluster spread when the proposed method is used.
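For reference, the two metrics can be computed on the normalized 2D-projected points as in the minimal MATLAB sketch below; Y2d and labels stand for the 2D embedding and the cluster assignments and are illustrative names, not the names used in the released code.

% Minimal sketch: WCSS and Davies-Bouldin index on normalized 2D projections.
% Y2d is the 2D embedding (t-SNE or MDS output), labels the cluster assignments.
Yn = normalize(Y2d);                        % z-score each embedding dimension
valid = labels > 0;                         % exclude DBSCAN noise points (-1)
Yv = Yn(valid, :);
[~, ~, lv] = unique(labels(valid));         % relabel clusters as 1..K
wcss = 0;
for k = 1:max(lv)
    pts = Yv(lv == k, :);
    ctr = mean(pts, 1);
    wcss = wcss + sum(sum((pts - ctr).^2, 2));   % within-cluster sum of squares
end
eva = evalclusters(Yv, lv, 'DaviesBouldin');     % Davies-Bouldin index
fprintf('WCSS = %.2f, DBI = %.2f\n', wcss, eva.CriterionValues);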
The proposed method can be considered a fast MPS method in both two-dimensional and three-dimensional simulations, based on the information provided in
Table 1. The codes were run on a PC with Windows 10 (64-bit), a 1.70 GHz Intel(R) Xeon(R) Bronze 3106 CPU (2 processors), and 64 GB of installed memory (RAM); the computer was a shared system used simultaneously by several users. The source code used to produce the results published in this paper, together with the associated supplementary materials, is available online in the GitHub repository of the first author, https://github.com/adel-asadi/Pixel_based_MPS (accessed on 1 April 2024). It should be noted that the code is implemented in MATLAB, using the Parallel Computing Toolbox for the simulation step, so the speed could likely be improved in environments such as Java or C++; it is therefore not appropriate to compare the method's speed with algorithms that are not written and run in MATLAB. However, largely owing to the dimensionality reduction and subsequent fast clustering, the runtime of the code in the different cases is acceptably low, which is a promising criterion for the adoption of the methodology in industrial applications.
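As a simple illustration of the parallelized simulation step, the sketch below distributes realizations over workers with parfor; simulateRealization is a hypothetical wrapper around the simulation routine and is not a function from the released code.

% Minimal sketch: generating realizations in parallel (Parallel Computing Toolbox).
% simulateRealization is a hypothetical wrapper shown only for illustration.
nReals = 16;
reals = cell(1, nReals);
parfor r = 1:nReals
    reals{r} = simulateRealization(TI, hardData, prototypes);   % hypothetical call
end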
To demonstrate the computational efficiency of the proposed method in comparison with previous methods, the same 2D categorical TI (Figure 3a) was used to generate 100 unconditional realizations, and the recorded times per simulated realization were 5.17 s and 38.56 s for the WAVESIM [32] and SIMPAT [25] algorithms, respectively. This corresponds to a 31% and 91% higher speed of the proposed method relative to the two investigated algorithms, respectively.
The information provided in Table 1 also indicates the following points, which are marked as superscripts within the table:
The unconditional simulation using the same training image is faster than the conditional simulation. The method was also tested with a larger amount of hard data (for both of the two-dimensional TIs shown in Figure 3), and an increase in the computational time was observed because of the more expensive similarity detection, even though fewer nodes need to be filled by the MPS simulation. In addition, the number of clusters has an impact on the runtime. As seen in Table 1, although the 2D categorical TI is smaller (in number of pixels) than the 2D continuous TI, its unconditional simulation takes relatively longer, because a larger number of prototypes must be searched when comparing the data event with the prototypes.
Setting MinPts = 3 decreases the computational time significantly (to 542 s for this case), but the reconstructions could be considered suboptimal, so MinPts = 2 was chosen for quality results. This value should nevertheless be tuned by trial and error for different training images and scenarios. Another observation from the experiments was that a larger amount of conditioning data, which helps the simulation reproduce the desired textures and statistics, could allow the user to set MinPts = 3 for faster simulations while maintaining outputs of acceptable quality. It should be noted that setting MinPts = 3 decreased the number of clusters from 1947 to 374 for this case; in one experiment, removing the partial use of the ccdf in the similarity detection for the MinPts = 3 case increased the runtime to 792 s.
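The MinPts trade-off can be checked quickly with a loop such as the minimal MATLAB sketch below; Y is the 2D t-SNE embedding of the pattern database, and the epsilon value is an assumed placeholder rather than the value used in the study.

% Minimal sketch: sensitivity of the DBSCAN clustering to MinPts.
% Y is the 2D t-SNE embedding of the patterns; epsilon = 1.0 is an assumed value.
for minPts = 2:4
    labels = dbscan(Y, 1.0, minPts);
    nClusters = numel(unique(labels(labels > 0)));
    nNoise    = sum(labels == -1);
    fprintf('MinPts = %d: %d clusters, %d noise patterns\n', minPts, nClusters, nNoise);
end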
The case-study three-dimensional TI had 46,339 points above topography, which did not need to be simulated; the number of nodes to be simulated was therefore 193,061 (including the points filled using hard data). In addition, the computational time is lower than that of the unconditional simulation for the other three-dimensional TI because the number of pattern clusters formed for the case-study TI was 463, almost a quarter of the 1947 clusters for the other, two-category three-dimensional TI.
6. Conclusions and Future Work
This research proposed a new pixel-based simulation algorithm using several machine-learning techniques. The results showed that t-Distributed Stochastic Neighbor Embedding (t-SNE) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) are efficient methods for the dimensionality reduction and clustering of the pattern database generated from a training image. In addition, the algorithm retains the advantages of pixel-based MPS simulation, which is the basic approach of the present study. An automatic method for template size determination, together with a number of optimization techniques, was also employed to fully automate the proposed MPS method. High performance was achieved in reconstructing complex features and continuities, even though the method uses a pixel-based approach to geostatistical simulation. In terms of respecting the hard data, the MPS method showed its ability to produce high-quality conditional simulations while also honoring the statistical properties of the training image. Continuous variables were also tested with the algorithm, and the results of the continuous training image simulation were as promising as those of the categorical simulations. Another main advantage of the algorithm is its computational speed; a relatively fast methodology was achieved in this work, and this criterion is very important for real-life three-dimensional problems. The validation results for the different scenarios showed that the algorithm is capable of solving related problems in various disciplines, including but not limited to mineral resources modeling.
Although the proposed approach is computationally fast, there is still scope for further improvement. t-SNE is the most time-consuming step in the proposed method, whereas DBSCAN clustering is fast and efficient but unsuitable for high-dimensional datasets. Therefore, a fast clustering algorithm that operates directly on high-dimensional data, replacing the dimensionality reduction and subsequent clustering proposed in this study, could significantly reduce the computational time and make the method considerably faster, although it is efficient even now. Another path for developing the proposed method would be implementing and testing the algorithm for non-stationary simulation, multivariate modeling, block conditioning, post-processing of simulations, optimization and enhancement of the reconstructions, spatio-temporal datasets, and other widespread MPS applications in different fields of study, using various training images and datasets for hard or soft conditioning of the simulation process.