1. Introduction
Since its introduction, geostatistical modeling has played a crucial role in the stochastic simulation of geological systems by applying the concept of conditional probability to generate system realizations. Traditional variogram-based simulations, which rely on two-point statistics, often struggle to accurately represent the complex and heterogeneous spatial structures found in geological formations [1]. Multiple-point statistics (MPS) algorithms, however, enhance this by capturing connectivity and intricate features using higher-order statistics within a multiple-point framework [2]. These MPS methods extract structural statistics from a training image (TI), which serves as a model of the spatial structures to be replicated, adding physical realism to stochastic models [3,4]. Recent developments have also explored TI-free approaches [5,6]. MPS algorithms have found applications across various fields, including reservoir geology [7,8], mineral deposit modeling [9,10], seismic inversion [11], porosity modeling [12], hydrology [13], groundwater modeling [14,15,16], climate modeling [17,18,19], and remote sensing [20,21,22,23].
MPS algorithms are divided into two main categories, pattern-based and pixel-based simulation methods, each offering its own benefits and drawbacks. The effectiveness of MPS algorithms is evaluated based on their ability to (a) accurately replicate complex geological features, (b) adhere to conditional data, and (c) maintain computational efficiency. Therefore, the selection of the optimal geostatistical method involves finding a balance among these criteria [4]. Pattern-based methods simulate multiple points simultaneously, whereas pixel-based methods sequentially simulate points across the grid, taking into account the surrounding points. Pattern-based algorithms excel in reproducing large-scale features but may show bias when conditioned on hard data [1]. While pattern-based approaches [24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40] are noted for their speed, they are often critiqued for producing simulations with limited variability because they copy extensive areas directly from the training image onto the simulation grid [41,42]. Additionally, pattern-based methods struggle to incorporate hard data efficiently and are challenging to apply in scenarios with dense conditioning data, especially in the mining sector.
Pixel-based MPS algorithms [2,41,42,43,44,45,46,47,48,49,50,51,52,53] are generally noted for their computational demands and slower processing speeds, and they face challenges in capturing the connectivity of complex patterns. Nonetheless, these algorithms are advantageous for generating more realistic simulations, not requiring the merging of patches, and offering adaptability in handling conditioning data [52]. The ENESIM algorithm, introduced by [2], was the first of its kind in pixel-based MPS, calculating the conditional distribution of a simulation node by examining all the matches throughout the TI to select a value, a process that was computationally intense. Ref. [43] later introduced a more efficient version named SNESIM, which pre-stored the conditional probabilities in a tree structure and utilized a multi-grid concept to enhance the connectivity of complex structures. Ref. [49] developed IMPALA, which reduced memory usage by storing data in lists instead of trees. Continuing this trend, the Direct Sampling (DS) method introduced by [45] samples directly from the TI rather than using a conditional probability distribution, reducing the memory requirements. Refs. [41,42] proposed the high-order simulation (HOSIM) algorithm, which operates based on high-order spatial connectivity criteria, termed spatial cumulants, and uses local conditional distributions generated by high-order Legendre polynomials whose coefficients are derived from the training images as a function of the spatial cumulants and the spatial configuration of the hard data. Ref. [51] further refined this approach by numerically approximating the conditional probability density function (cpdf) based on spatial Legendre moments, simplifying the cpdf approximation to a local empirical function to reduce the computational demands. Ref. [50] expanded this method by approximating the cpdf using Legendre-like orthogonal splines, with coefficients derived from high-order spatial statistics obtained from the hard data and the training image. Despite these enhancements, the computational intensity of pixel-based algorithms remains significant.
To enhance the performance of pixel-based MPS algorithms, researchers have explored various strategies. An important aspect of multiple-point geostatistical modeling is handling high-dimensional data, which often becomes a bottleneck. Several methods have been proposed to decrease the dimensionality within the MPS pattern database [28,29,30,32]. Ref. [28] introduced an approach utilizing filter scores to simplify the dimensions of the pattern database. These filter scores represent specific linear combinations of pixel values in each pattern, capturing the directional mean, gradient, and curvature characteristics. Their FILTERSIM algorithm successfully reduces the dimensionality to six and nine for two- and three-dimensional simulations, respectively. Building on this, Ref. [29] developed a variant of FILTERSIM, which identifies pattern similarities through the comparison of filter scores instead of direct pixel-wise comparisons. However, relying solely on a limited number of filter scores can make it difficult to capture the complexity of the patterns, potentially yielding similar scores for distinctly different patterns. Refs. [32,54] opted for a wavelet-based approach, extracting wavelet approximate sub-band coefficients to represent the patterns in a reduced dimension. Their method, WAVESIM, while computationally intensive, offers a quicker implementation and produces realizations that more accurately represent the structures of the training image compared to FILTERSIM. Nonetheless, the WAVESIM algorithm can be sensitive to the number of wavelet coefficients extracted; a high count can still result in considerable dimensionality for specific training images. Ref. [30] implemented multi-dimensional scaling (MDS) to project a database of patterns onto a two-dimensional Cartesian space for easier clustering. Although MDS facilitates clustering by measuring pairwise distances, its computational efficiency decreases with larger pattern sets, and it only operates on local data pairs. Additionally, when high-dimensional data reside on a low-dimensional nonlinear manifold, maintaining proximity between similar points becomes challenging with a linear method like MDS [55,56].
Another major step in reducing the computational burden of pixel-based methods in some MPS models involves clustering the pattern database extracted from the training image [30,32,54,57]. Numerous clustering techniques are available, developed by computer and data scientists, including hierarchical clustering, expectation-maximization, fuzzy c-means, and mean-shift clustering [58]. Yet, one of the most popular and frequently used clustering methods in MPS is k-means clustering [27,28,29,30,32,54], an unsupervised learning algorithm [59]. This distance-based clustering technique offers simplicity and rapid computation as its key advantages. However, a significant limitation, particularly in geostatistical simulations, is its requirement for a predefined number of clusters, which is not inherently known and is challenging to determine accurately in advance. Consequently, using k-means for clustering patterns often necessitates a sensitivity analysis of the number of clusters to optimize the results [54]. Another drawback of the k-means method is its inability to recognize dense data areas effectively, as it relies on the distances between data points, which can sometimes lead to incorrect cluster identifications and raise questions about the method's overall efficiency.
On the implementation side of the MPS methods, the selection of key parameters is critical to their success, necessitating careful tuning to achieve optimal results. In the past, various optimization techniques have been employed to refine the parameters of MPS algorithms. Ref. [60] introduced an optimization-based approach using the Expectation-Maximization algorithm to enhance MPS methods. Ref. [61] developed a method for quantitatively comparing the training image with the realizations produced by MPS. This methodology was later adopted by [62] for tuning the parameters of the Direct Sampling (DS) algorithm [45,47], utilizing Jensen–Shannon Divergence as the objective function and optimizing it through the simulated annealing (SA) algorithm. Ref. [63] proposed a method for quantitative evaluation of MPS results, which involves estimating a coherence map using key point detection and matching. Ref. [64] introduced methods to evaluate the texture, geometry, spatial correlation, and connectivity of the models. Their approach for parameter optimization leverages the gray-level co-occurrence matrix (GLCM) and deep convolutional neural networks (CNNs), providing robust tools for assessing and refining MPS parameters.
In this study, a novel, computationally efficient, pixel-based MPS method is presented, which can preserve the complexity and continuity of the training images in simulations while honoring the conditioning data in generating realizations. The proposed research uses t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction of the pattern database, and subsequent clustering of the patterns is achieved using an unsupervised classification technique, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The t-SNE algorithm is implemented as an efficient dimensionality reduction technique to map and visualize the database of patterns in a two-dimensional Cartesian environment based on their joint probabilities. The main advantage of t-SNE, as a nonlinear method, over linear methods such as MDS and principal component analysis (PCA) is its ability to preserve global and local data structures at the same time [56]. The DBSCAN algorithm clusters the pattern database using the output from the t-SNE algorithm. It is particularly efficient on low-dimensional data, and its main advantages are the discovery of arbitrarily shaped clusters, robust outlier removal, and no need to pre-specify the number of clusters [65]. These methods are applied to reduce the memory required for storing the training image configurations and to reduce the computations needed. In order to optimize and automate the input parameter selection for the proposed method, different methods were implemented to minimize the user's interference. The proposed methodology is validated and tested using different synthetic datasets and applied in a three-dimensional case-study gold mine for simulating an orebody model.
2. Materials and Methods
2.1. Pattern Database Generation
The MPS algorithm proposed here consists of two steps: (a) the generation of a pattern database; and (b) searching for the best match among the patterns to the conditioning data for simulation. The first step in the method used herein is to scan the training image with a given spatial template. Define $ti(\mathbf{u})$ as a value of the training image $TI$, where $\mathbf{u} \in G_{TI}$ and $G_{TI}$ is the regular Cartesian grid discretizing the training image. $dev_T(\mathbf{u})$ indicates a specific multiple-point vector of $ti$ values within a template $T$ centered at node $\mathbf{u}$, which is provided via Equation (1):

$$dev_T(\mathbf{u}) = \left\{ ti(\mathbf{u} + \mathbf{h}_1),\ ti(\mathbf{u} + \mathbf{h}_2),\ \ldots,\ ti(\mathbf{u} + \mathbf{h}_{\alpha}) \right\} \quad (1)$$

where the $\mathbf{h}_i$ vectors are the vectors defining the geometry of the $\alpha$ nodes of template $T$, and $i = 1, \ldots, \alpha$. The vector $\mathbf{h}_1 = \mathbf{0}$ represents the central location $\mathbf{u}$ of template $T$.
The pattern database, $patdb_T$, the same as in other multiple-point algorithms, is then obtained by scanning $TI$ using template $T$ and storing the multiple-point $dev_T$ vectors in the database. The patterns in the pattern database $patdb_T$ are location-independent, and the $k$th pattern is presented via Equation (2):

$$pat_T^k = \left\{ pat^k(\mathbf{h}_1),\ pat^k(\mathbf{h}_2),\ \ldots,\ pat^k(\mathbf{h}_{\alpha}) \right\} \quad (2)$$

where $k = 1, 2, \ldots, N_{pat}$, $N_{pat}$ is the number of patterns in the pattern database, and the $pat^k(\mathbf{h}_i)$ are the values obtained from $TI$.
It is noteworthy that the pattern database built is independent of the pattern locations. For the continuous simulation cases, the patterns are stored exactly as they appear in the training image. However, for a categorical training image with $M$ categories, the training image is first transformed into $M$ sets of binary values, according to Equation (3):

$$ti_m(\mathbf{u}) = \begin{cases} 1, & \text{if } ti(\mathbf{u}) \text{ belongs to category } m \\ 0, & \text{otherwise} \end{cases}, \quad m = 1, \ldots, M \quad (3)$$

Accordingly, each location is represented by a vector of binary values, where the $m$th element is 1 if that node belongs to category $m$, and 0 otherwise. Thus, for each node, exactly one element is equal to 1.
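For illustration, the sketch below shows one way the scanning and binary transform could be implemented; it assumes a 2D NumPy array `ti` holding the training image and a square template of odd side `t`, and the helper names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def extract_patterns(ti, t):
    """Scan a 2D training image with a t x t template and return the pattern
    database as an (N_pat, t*t) array, one flattened pattern per row."""
    rows, cols = ti.shape
    patterns = [ti[i:i + t, j:j + t].ravel()
                for i in range(rows - t + 1)
                for j in range(cols - t + 1)]
    return np.asarray(patterns)

def one_hot_encode(patterns, categories):
    """Equation (3): replace each categorical value by binary indicators, so a
    pattern of length t*t becomes a vector of length t*t*M."""
    return np.concatenate([(patterns == c).astype(float) for c in categories],
                          axis=1)

# Usage sketch for a binary channel TI with categories {0, 1}:
# patdb = one_hot_encode(extract_patterns(ti, t=13), categories=[0, 1])
```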
2.2. Dimensional Reduction of Pattern Database
During the simulation, the pattern database is searched to find the best-matched conditional probability corresponding to the multiple-point data. Therefore, to construct the conditional probabilities, similar patterns are grouped together.
Since the number of patterns ($N_{pat}$) and the dimension of the patterns ($\alpha$) in the pattern database are very large, grouping the patterns from the pattern database can be computationally expensive. To reduce the computational time of the proposed MPS algorithm, instead of grouping $patdb_T$ in its original dimensions, a dimensionally reduced database is used. In this research, the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm is applied for dimensionality reduction [66].
Assume the pattern database $patdb_T = \{ pat_T^k,\ k = 1, \ldots, N_{pat} \}$ is generated and constitutes a data manifold. In t-SNE, the patterns $pat_T^k$, $k = 1, \ldots, N_{pat}$, in $patdb_T$ should be projected to points $y_k$, $k = 1, \ldots, N_{pat}$, in the projection space $Y$ such that as much structure as possible is preserved. In this non-parametric dimensionality reduction method, a cost-function-based approach is used in which each pattern $pat_T^k$, contained in a high-dimensional vector space, is projected to a low-dimensional space $Y$ in such a way that the characteristics of the low-dimensional points mimic the characteristics of their high-dimensional counterparts. A probabilistic approach is used, where the pairwise similarities in the original pattern space are defined via Equation (4):

$$p_{l|k} = \frac{\exp\!\left( -\lVert pat_T^k - pat_T^l \rVert^2 / 2\sigma_k^2 \right)}{\sum_{m \neq k} \exp\!\left( -\lVert pat_T^k - pat_T^m \rVert^2 / 2\sigma_k^2 \right)} \quad (4)$$

where $p_{l|k}$ depends on the pairwise distances of the patterns, $\lVert \cdot \rVert$ is the norm operator, and the standard deviation $\sigma_k$ for each high-dimensional pattern $pat_T^k$ is calculated so that the perplexity of each point is at a predetermined level. Perplexity is the effective number of local neighbors of each point.
Similarly, the probability distribution in the t-SNE projection space can be defined via Equation (5):

$$q_{kl} = \frac{\left( 1 + \lVert y_k - y_l \rVert^2 \right)^{-1}}{\sum_{m \neq n} \left( 1 + \lVert y_m - y_n \rVert^2 \right)^{-1}} \quad (5)$$

To avoid the crowding problem, a long-tailed distribution, the Student's t-distribution, is used in the projection space. The goal is to find projections $y_k$ such that the difference between the high-dimensional similarities $p_{kl}$ (the symmetrized form of Equation (4)) and the low-dimensional similarities $q_{kl}$ becomes very small, as measured by the Kullback–Leibler divergence in Equation (6):

$$C = KL(P \,\|\, Q) = \sum_{k} \sum_{l \neq k} p_{kl} \log \frac{p_{kl}}{q_{kl}} \quad (6)$$

To achieve this goal, t-SNE uses a gradient-based algorithm [67] to iteratively minimize this divergence.
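To make Equations (4)–(6) concrete, the following minimal sketch computes Gaussian affinities with a single fixed bandwidth (in place of the per-point perplexity search), the Student-t affinities of a candidate 2D embedding, and the Kullback–Leibler cost; it is illustrative only and not the optimizer used in practice.

```python
import numpy as np

def gaussian_affinities(X, sigma=1.0):
    """Equation (4) with one fixed bandwidth (each sigma_k is normally tuned so
    that every point reaches the target perplexity), then symmetrized."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)      # conditional p_{l|k}
    return (P + P.T) / (2.0 * X.shape[0])  # joint p_{kl}

def student_t_affinities(Y):
    """Equation (5): heavy-tailed affinities in the low-dimensional embedding."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_cost(P, Q, eps=1e-12):
    """Equation (6): the Kullback-Leibler divergence minimized by t-SNE."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```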
2.3. Pattern Clustering Using DBSCAN
The DBSCAN clustering algorithm is designed to analyze the projected patterns on the basis of their density, as opposed to their distance. One of the main advantages of DBSCAN is that, unlike k-means clustering, it does not require the cluster number as an input parameter to perform clustering. DBSCAN can also find arbitrarily shaped clusters, which is not the case for many clustering algorithms. It can even find a cluster completely surrounded by a different cluster. The other advantage of DBSCAN compared to other clustering methods is that it has a notion of noise (outlier). Additionally, it is mostly insensitive to the ordering of the points in the database.
Before explaining the DBSCAN algorithm, we need to define some terminology used in this method:
Definition 1:
$Eps$-neighborhood of a point $y$: the circular area with $y$ as its center and $Eps$ as its radius, where $Eps > 0$. The set of points included in the $Eps$-neighborhood of $y$ is denoted $N_{Eps}(y)$.
Definition 2:
$MinPts$: the minimum number of points required to form a cluster, a threshold selected by the user.
Definition 3:
Density of point $y$: the number of points within the $Eps$-neighborhood of $y$.
Definition 4:
Core point: a point whose density is greater than or equal to $MinPts$.
The DBSCAN algorithm forms clusters from the projected database $Y$ by performing the following steps [65]:
1. Select a random unlabeled projected point $y$ as the arbitrary starting point from $Y$, and assign the first cluster label, $C$.
2. Retrieve the density within the $Eps$-neighborhood of $y$. If the density is less than $MinPts$, label $y$ as an outlier point (noise). Otherwise, label $y$ as a core point belonging to the existing cluster $C$.
3. Iterate over each core point and repeat step 2 until no new $Eps$-neighborhood points are found, i.e., until the density-connected cluster is completely identified.
4. Select the next unlabeled point as the current point, and increase the cluster count by 1 to form a new cluster.
5. Repeat steps 2–4 until all the points in $Y$ are labeled.
After clustering the projected pattern database using DBSCAN, prototypes of the classes are calculated. The prototype is the single representative of all the members of each class. These prototypes are used during the simulation process, when the similarity between the conditioning data event and the prototype of a class is calculated. The prototype value is obtained by averaging all the patterns falling into a particular cluster after clustering.
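A minimal sketch of this clustering-and-prototype step is given below, assuming the 2D t-SNE embedding `Y` and the pattern database `patdb` are NumPy arrays with one row per pattern; the function name and parameter values are placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_and_prototypes(Y, patdb, eps, min_pts):
    """Cluster the 2D-projected patterns with DBSCAN and compute each cluster's
    prototype as the point-wise average of its member patterns."""
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(Y)
    prototypes = {}
    for label in np.unique(labels):
        if label == -1:                      # -1 marks DBSCAN noise (outliers)
            continue
        prototypes[label] = patdb[labels == label].mean(axis=0)
    return labels, prototypes
```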
2.4. Simulation Step
After classifying the pattern database and calculating the prototypes, simulation of the spatial patterns is carried out. During the simulation, the similarity between the conditioning data event and the prototypes of the classes is determined. A sequential simulation algorithm [68] is used in this paper. At each visited node, a conditioning data event is obtained by placing the same template used on the training image, centering it at the node to be simulated. The similarity between the conditioning data event and the prototypes of the clusters is calculated by a distance function; the distance function used in this paper is the Minkowski distance [69]. The best-matched prototype for the conditioning data event is selected by minimizing the distance function value. A value is then drawn from the center of the best-matched class by a Monte Carlo simulation approach [70].
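The simulation step described above could be sketched as follows; the data structures (`prototypes`, `cluster_members`, `offsets`) are hypothetical containers for the quantities defined earlier, and hard data are assumed to be pre-assigned to the grid with unknown nodes marked as NaN.

```python
import numpy as np

def simulate_node(grid, i, j, offsets, prototypes, cluster_members, rng):
    """Simulate one node: build the conditioning data event from the grid, find
    the closest prototype with a first-order Minkowski (Manhattan) distance over
    the informed nodes, then draw the central value from the matched cluster."""
    dev = np.full(len(offsets), np.nan)
    for k, (di, dj) in enumerate(offsets):
        r, c = i + di, j + dj
        if 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]:
            dev[k] = grid[r, c]
    known = ~np.isnan(dev)
    best = min(prototypes, key=lambda lab:
               np.abs(dev[known] - prototypes[lab][known]).sum())
    member = cluster_members[best][rng.integers(len(cluster_members[best]))]
    return member[len(offsets) // 2]        # value at the template's central node

# Sequential loop: unknown nodes (NaN) are visited along a random path.
# rng = np.random.default_rng(0)
# for i, j in rng.permutation(np.argwhere(np.isnan(grid))):
#     grid[i, j] = simulate_node(grid, i, j, offsets, prototypes, cluster_members, rng)
```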
2.5. Implementation Strategies
A flowchart of the proposed method (Pixel-MPS) is provided in Figure 1. The following subsections provide information on the implementation aspects and the individual steps taken for the simulation.
2.5.1. Template Size Determination
The training image is the main source for borrowing multiple-point statistical information in MPS modeling. To exploit this information, all the possible overlapping patterns are extracted from the TI and stored in the pattern database, as discussed in Section 2.1. The selection of the template size used to extract patterns is a key decision, as the template size can significantly impact the reconstruction process and the quality of the realizations depends on it [71]. Numerous studies have shown this sensitivity, and sensitivity analysis of the template size has been a central part of many of them [30]. The template size should be as small as possible to reduce the computational time and as large as necessary to represent the features of a specific training image. Ref. [30] implemented a method that uses the concept of entropy to determine the size of the template for scanning the training image. We have adopted their approach in this study, using the Shannon entropy, a measure of the amount of information or randomness in a set of data [72] with common applications in digital image processing. The optimal template size selection starts by scanning the training image with different template sizes. After calculating the Shannon entropy for each extracted pattern, the template size leading to the maximum average entropy is automatically used as the optimal template size.
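A sketch of this entropy-based selection is shown below for a categorical (or binned) training image; `extract_patterns` refers to the earlier pattern-extraction sketch, and the candidate size range is an assumed example.

```python
import numpy as np

def shannon_entropy(pattern):
    """Shannon entropy of the value frequencies within one extracted pattern."""
    _, counts = np.unique(pattern, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def optimal_template_size(ti, candidate_sizes, extract_patterns):
    """Return the template size whose patterns show the maximum average entropy."""
    mean_entropy = {}
    for t in candidate_sizes:
        patterns = extract_patterns(ti, t)
        mean_entropy[t] = np.mean([shannon_entropy(p) for p in patterns])
    return max(mean_entropy, key=mean_entropy.get)

# Example: scan odd template sizes from 5 x 5 to 29 x 29 (assumed range)
# t_opt = optimal_template_size(ti, range(5, 31, 2), extract_patterns)
```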
2.5.2. Dimensionality Reduction
As discussed in Section 2.2, the very-high-dimensional pattern database is projected onto a 2D grid using the t-SNE algorithm. However, to optimize the algorithm and ease the computation, it is recommended to apply the Principal Component Analysis (PCA) method [73,74] to the data when the number of features (template size) is relatively large, which reduces the dimensionality to a reasonable number [56,66]. This suppresses some noise and speeds up the computation of the pairwise distances between samples. PCA is a multivariate technique based on eigenvector analysis that helps uncover the underlying structure of the data by explaining its variance most effectively. It focuses on the variance within the dataset to identify and prioritize the features that account for the most variance. In PCA, the number of principal components equals the number of original data variables, as each component is constructed by projecting the data along its axis. However, typically only the first few principal components capture the majority of the variance in the data, because the components are ordered according to the amount of variance each explains [74]. This prioritization is why the initial components are usually sufficient to represent the original features of the dataset effectively.
In terms of the t-SNE parameters, the typical values for perplexity are proposed to be between 5 and 50, and the suggested trade-off value of 30 was used; the performance of t-SNE is fairly robust under different settings of the perplexity [66]. The optimization procedure explained via Equation (6) in Section 2.2 is the most time-consuming part of the algorithm. The t-SNE optimization can be implemented via the exact but expensive algorithm, which optimizes the Kullback–Leibler divergence between the distributions in the original space and the embedded space. However, the more efficient Barnes–Hut approximation was used to speed up the program and cut its memory usage. The underlying idea is that the gradient is similar for nearby points, so the computations can be simplified [56,75].
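As a sketch of this step, the snippet below chains PCA pre-reduction with Barnes–Hut t-SNE using scikit-learn equivalents (the paper's reference implementation is in MATLAB); the parameter values mirror those discussed above.

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_patterns(patdb, pca_components=80, perplexity=30, random_state=0):
    """Pre-reduce the pattern database with PCA when it has many features, then
    embed it in 2D with Barnes-Hut t-SNE."""
    X = patdb
    if X.shape[1] > pca_components:
        X = PCA(n_components=pca_components).fit_transform(X)
    return TSNE(n_components=2, perplexity=perplexity, method="barnes_hut",
                random_state=random_state).fit_transform(X)
```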
2.5.3. Pattern Clustering
As discussed in Section 2.3, the 2D projected pattern database is clustered into similar groups using the DBSCAN algorithm. In terms of the input parameters, the algorithm developers suggest that $MinPts$ should be selected between the number of dimensions of the data and "twice the number of dimensions", a rule widely used by other researchers [76]. In this study's case of a two-dimensional embedding, the lower end of this range is selected, which generally results in a higher number of clusters and, consequently, prototypes. This leads to a higher chance of better similarity detection, because the algorithm is provided with a larger number of prototypes as options when selecting the best match during simulation of the nodes. Assigning a larger $MinPts$ for three-dimensional simulations can decrease the computational time by reducing the number of clusters; however, the lower value is used to ensure quality reconstructions in three-dimensional as well as two-dimensional simulations.
In order to define $Eps$ in an optimized manner, the k-nearest neighbor (KNN) method [77] is used to calculate the distance to the kth nearest neighbor of each point in the database. The distance values of all the points are then sorted from low to high and plotted against the point indices. Finally, the knee of the resulting curve is found automatically and taken as the optimal value of $Eps$. The search range for k in the KNN algorithm is defined as 2 (the number of dimensions of the projected data) to 3 (the dimension of the data plus 1). $Eps$ is derived by averaging the knees of the curves obtained for the two tested k values.
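One possible automated implementation of this Eps estimation is sketched below; the knee is located with a simple maximum-distance-from-chord heuristic, which is an assumption, as the exact knee detector is not specified here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_eps(Y, k_values=(2, 3)):
    """Average the knees of the sorted k-distance curves for the tested k values."""
    knees = []
    for k in k_values:
        # distance of every point to its k-th nearest neighbor (self excluded)
        dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(Y).kneighbors(Y)
        d = np.sort(dists[:, -1])
        # knee: point of maximum gap below the chord joining the curve endpoints
        x = np.linspace(0.0, 1.0, len(d))
        chord = d[0] + x * (d[-1] - d[0])
        knees.append(d[np.argmax(chord - d)])
    return float(np.mean(knees))
```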
2.5.4. Similarity Search
During the process of sequential simulation, the distance between the data event in the simulation grid and the prototypes is measured, and the best-matched prototype, with the closest distance, is found. Then, the best single pattern within that prototype's corresponding cluster is found using a second distance comparison. If the number of patterns in a class is large, the conditional cumulative distribution function (ccdf) is used to draw a random pattern from the selected cluster. This way, during sequential simulation, the MPS algorithm searches for similarity only among a limited number of prototypes and patterns, not among all the available patterns within the pattern database. It should be noted that, after performing clustering and prototype generation, clusters with identical prototypes are merged in order to improve the computational performance of the similarity search; this saves computational time when similar cluster prototypes exist in the pattern database, since the similarity loop is shortened.
This study proposes using a weight distribution in the methodology while measuring the similarity in the template-matching step. Since the proposed method is a pixel-based simulation method, there is a risk of mismatching due to the number of nodes that take part in the similarity detection steps. Without assigning higher weights to the nodes near the simulation node, the selected pattern may not be the best match for the data event. Thus, as shown in Figure 2, a weight distribution is applied while measuring distances, and the first-order Minkowski distance (Manhattan distance) is chosen as the distance measure, so the weights have their appropriate impact in the template-matching process. The distance function used for the similarity measurement is provided via Equation (7):

$$d\!\left(dev_T(\mathbf{u}),\ prot^c\right) = \sum_{i=1}^{\alpha} w_i \left| dev_T(\mathbf{u} + \mathbf{h}_i) - prot^c(\mathbf{h}_i) \right| \quad (7)$$

where $dev_T(\mathbf{u})$ is the conditioning data event, $w_i$ is the pixel weight, and $prot^c$ is the prototype of class $c$. If some of the $dev_T(\mathbf{u} + \mathbf{h}_i)$ values are missing in the conditioning data event, those nodes are not considered in the distance calculation.
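A sketch of Equation (7) with missing-node handling is given below; the `center_weights` helper shows one plausible weighting scheme (decay with distance from the central node), which is an assumption standing in for the distribution of Figure 2.

```python
import numpy as np

def weighted_manhattan(dev, prototype, weights):
    """Equation (7): weighted first-order Minkowski (Manhattan) distance between
    a conditioning data event and a class prototype; uninformed nodes (NaN in
    dev) are skipped, as stated above."""
    known = ~np.isnan(dev)
    return float(np.sum(weights[known] * np.abs(dev[known] - prototype[known])))

def center_weights(t):
    """Assumed weighting scheme: weights decay with Chebyshev distance from the
    template's central node, emphasizing nodes near the simulation node."""
    r = t // 2
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    return (1.0 / (1.0 + np.maximum(np.abs(yy), np.abs(xx)))).ravel()
```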
The similarity comparison step becomes expensive when the template size is large and when all, or a high number of, the corresponding nodes around the simulation node are known. This problem is intensified for three-dimensional simulation, where the computational complexity is higher. In this study, the calculation and use of the ccdf is limited to clusters that contain a large number of pattern members; for any cluster with fewer than 100 patterns, the ccdf is not used. The reason is computational: searching among the many patterns within a populous cluster to simulate a single node is expensive, so drawing from the ccdf avoids that cost, whereas for a cluster with a limited number of patterns the ccdf may be unstable, and a second step of similarity measurement within the cluster is performed instead.
For continuous variables, however, the second-step similarity detection is performed by applying the distance function in Equation (7) again to find the best match within the cluster in all cases, even in clusters with a large number of members. This second step of similarity measurement is carried out in the same way as the first, but within the cluster corresponding to the best prototype match from the first step. This means that, for continuous variables, there is a higher chance that the reproduced statistics and patterns of individual realizations will be closer to the training image and hard data than for simulated categorical variables. However, by generating a larger number of realizations of the categorical variables, the effect of assigning a non-dominant category to simulation nodes drawn from highly populated clusters is moderated.
2.5.5. Model Validation
In order to validate the results of the proposed approach, different statistical and visual investigations are performed. Within the statistical framework, the first-order and second-order statistics of the training images and the realizations are compared to ensure the reproduced statistics are acceptably close to the reference statistics. Variograms and cumulant maps are also generated to enable a comparison between the higher-order statistics of the reconstructions and the reference training images. Additionally, the statistics of the hard data are shown for reference.
Within the visualization framework, the reproduced patterns of the realizations are compared to those available in the training image to ensure the reproduction of the specific patterns and shapes for both the continuous and categorical variables. Uncertainty analysis is also performed by generating E-type plots of the realizations in order to understand the variance across the simulation domain, comparing regions with abundant hard data to data-scarce regions.
2.5.6. Comparative Analysis
In order to compare the results with previously developed methods in terms of accuracy and speed, and to justify the choice of algorithms used in this research, a set of sensitivity analyses is performed. The runtimes of the different scenarios and methods are provided. Cluster evaluation is also performed to compare the results of the proposed method (t-SNE + DBSCAN) with a previously introduced approach (MDS + K-means); the quality of the clustering methods is compared quantitatively via the Within-Cluster Sum of Squares (WCSS), which measures the sum of the squared distances of all the points within a cluster to the cluster centroid. Additionally, the Davies–Bouldin Index (DBI) is calculated, which measures the average similarity of each cluster with its most similar cluster, with the aim of evaluating cluster spread and separation [78].
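The two clustering-quality metrics could be computed as sketched below, with WCSS written out explicitly and the Davies–Bouldin Index taken from scikit-learn; applying them to the normalized 2D projection, as in the comparative analysis, is assumed to be done by the caller.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def wcss(Y, labels):
    """Within-Cluster Sum of Squares: squared distances of the points in each
    (non-noise) cluster to that cluster's centroid, summed over all clusters."""
    total = 0.0
    for label in np.unique(labels):
        if label == -1:                      # skip DBSCAN noise points
            continue
        members = Y[labels == label]
        total += float(np.sum((members - members.mean(axis=0)) ** 2))
    return total

# Davies-Bouldin Index (lower is better), e.g. on the clustered 2D projection:
# dbi = davies_bouldin_score(Y[labels != -1], labels[labels != -1])
```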
2.6. Summary of the Methodology
Here, a brief summary of the proposed methodology is provided:
1. Automatically determine the template size T using an entropy-based approach.
2. Scan the training image using the automatically determined template size T and extract all the patterns.
3. Reduce the dimensionality of the pattern database with the PCA method, if needed, and subsequently map it stochastically to two dimensions with the t-SNE algorithm.
4. Cluster the patterns based on the two-dimensional map using the DBSCAN algorithm.
5. Calculate each class prototype by point-wise averaging of all the patterns within the class.
6. In the case of conditional simulation, assign the hard data within the simulation grid (SG) and mark those nodes as seen (sampled) points.
7. Define a random path visiting once and only once all the unseen nodes.
8. Use the same template T at each unseen location to extract the data event on the SG.
9. Find the best match between the class prototypes and the data event in the simulation grid.
10. Sample a value for the central node from the best-matched class using either the second-stage distance function or the ccdf.
11. Assign the sampled value to the current simulation point.
12. Continue until all the grid points are filled with a simulated value.
13. Perform median filtering to enhance the simulation quality.
14. Repeat steps 7 to 13 of the simulation process to generate different equiprobable realizations.
3. Validation Results
In this section, the performance of the model is demonstrated by visual and statistical comparison of the generated realizations with the continuous and categorical training images used in this study. A total of four training images were used for the validation and testing of the method: a two-dimensional categorical TI, a two-dimensional continuous TI, a three-dimensional two-category TI, and a three-dimensional case study with three-category data. These TIs were selected for their ability to represent a range of geological scenarios, ensuring comprehensive validation. The 2D categorical and continuous TIs (Figure 3), as well as the 3D channel TI, represent the complex non-linear structures and connectivity observed in Earth systems, with the complexity even stronger in the 3D channel TI (see Section Three-Dimensional TI). The continuous TI (Figure 3b) and the 3D channel TI are also characterized by non-stationary propagation of patterns, as the patterns differ across different parts of the TIs. The case-study 3D TI (see Section 4) was chosen to ensure the applicability of the method in a complicated real-world 3D domain.
The categorical (size = [101 × 101]) and continuous (size = [100 × 128]) two-dimensional TIs are shown in Figure 3. The binary training image in Figure 3a [30] represents a deposit with complex channels. For the simulation of continuous data, an exhaustive two-dimensional continuous horizontal slice (Figure 3b) was obtained from a three-dimensional fluvial reservoir of the Stanford V Reservoir Dataset [79]; the channel configurations and orientations are complex in nature and change from one slice to another in the vertical direction.
Automatic template size determination was performed on the training images based on the approach explained in Section 2.5.1, and the optimal template size (T) obtained was T = [13 × 13] for the categorical channel TI, as shown in Figure 4, and T = [23 × 23] for the continuous TI.
Figure 5 shows the automated approach by which the Eps parameter of the DBSCAN algorithm is selected through averaging an iterative k-nearest neighbor search with different numbers of neighbors (see Section 2.5.3 for more information). Based on Figure 5, the variability of Eps is not considerable as the number of neighboring points in the k-nearest neighbor search varies, so little sensitivity to this parameter can be inferred. It should also be noted that, depending on the specific training image used for simulation and the nature and number of the patterns extracted, the optimal range of nearest neighbors in the k-nearest neighbor search for Eps estimation can be further investigated by plotting the curves for different values to better understand potential sensitivity or variations. However, the maximum value of the range should not be too high, as it will lead to undesirably high Eps estimates, which can cause DBSCAN to form a significantly lower number of clusters.
As explained in Section 2.5.2, for better t-SNE performance, the dimensionality should first be decreased by PCA when applicable. Thus, based on the value suggested by the t-SNE algorithm developers and on experiments varying the number between 50 and 100, a value of 80 was chosen as the cut-off for PCA implementation, leading to a more robust embedding performance. The dimensionality of the patterns was reduced to 80 whenever it exceeded that value; however, no tangible impact on the performance of t-SNE was observed within the 50–100 range. In the different two- and three-dimensional cases, it was observed that the 80 components selected via PCA explain more than 90% of the variance of the original features.
The stochastic embedding and clustering of the pattern database, generated using the automatically selected template size, were then performed. In a single run, the algorithm produced 690 and 177 distinct clusters for the pattern databases of the categorical and continuous TIs (Figure 3), respectively. As an example, a visualization of the clustering of the channel TI's pattern database after mapping the high-dimensional patterns is shown in Figure 6.
To verify the combined impact of the t-SNE mapping and the DBSCAN clustering, patterns from the clusters were inspected visually. As an example, all the patterns forming one cluster of the categorical training image are depicted in Figure 7, together with the generated prototype of that cluster. Because t-SNE works on probabilities (not directly on distances), measuring the Euclidean distance in the high-dimensional space and the two-dimensional space would not help assess the goodness of fit of the embedding. Due to the stochastic nature of the algorithm, it is good practice to run it a few times and keep the result with the lowest KL divergence. In the cases of the present study, the algorithm provided acceptably small KL divergence values in all runs, and no need to run t-SNE multiple times was identified.
3.1. Unconditional Simulation
The results of the unconditional simulation of the introduced categorical and continuous training images are presented herein. Generally, a minimum number of realizations should be generated to obtain an understanding of the uncertainty of the simulation [71]. A total of 100 realizations were generated for each TI shown in Figure 3 to assess the uncertainty of the modeling. Three realizations of the channel TI (Figure 3a) generated by the proposed method are depicted in Figure 8.
It can be observed that the continuity of the training image streams is reproduced well by the proposed method. The main differences of the proposed method compared to methods such as FILTERSIM are the two-dimensional mapping of the pattern database by t-SNE instead of using filter scores, and the subsequent density-based clustering of the mapped data by DBSCAN instead of k-means clustering on the high-dimensional database. In addition, the proposed methodology uses a pixel-based simulation approach, as opposed to the pattern-based simulation implemented in FILTERSIM.
Statistical analysis was performed on the simulation results to show the efficiency of the methodology. Figure 9 provides a first-order statistical comparison of the realizations and training images using box plots and a mean-comparison plot. It can be inferred from the plots that the simulations reproduce the one-point statistics of the training image, and there is almost no uncertainty regarding this criterion, as the mean plot shows no considerable deviations from the red line (TI average value) among the 100 realizations.
Figure 10 provides a variogram comparison as a measure of the two-point statistics, together with the E-type (ensemble) and variance maps of the realizations. From the variogram comparison, a near-perfect match is observed between the two-point statistics of the training image and the 100 generated realizations, and the two-point statistical characteristics of the TI are maintained well over all lag distances across the generated realizations. The E-type and variance maps confirm that the stochastic nature of the simulation method is preserved in the unconditional simulation, as the simulations are not constrained by any hard data. The variability is high across almost the whole domain, and the channels propagate in different locations within different realizations.
Figure 11 depicts the third-order spatial cumulant maps [41,42] of a realization and the training image for comparison as a higher-order validation method. From the cumulant maps, it can be inferred that, for both small and large lag distances, the proposed algorithm reproduces the third-order statistics of the channel TI well in unconditional categorical simulation.
The results of applying the methodology to the continuous training image (Figure 3b) also confirmed the applicability of the proposed simulation method to continuous variables. Figure 12 shows the quality of the realizations; it can be observed visually that the continuity of the high-valued streams (channels) is reproduced.
Figure 13 provides histogram and variogram comparison plots for validation. The complex bimodal histograms of the generated realizations match that of the training image according to Figure 13a, which confirms the acceptable performance of the method in reproducing the first-order statistics of the TI. The histogram of the TI (in red) falls in between those of the realizations, and higher deviations from the TI are seen at the two modes rather than at the values between the heads of the histogram. The mean plot provided in Figure 13c shows that there is little uncertainty in the reproduction of the first-order statistics of the continuous TI, as none of the 100 realizations has a mean value far from that of the TI, shown by the red line. In addition, the realizations successfully reproduce variograms similar to that of the TI according to Figure 13b. The two-point statistics of the realizations remain similar to those of the continuous TI over different lag distances, and deviations from the red line (TI variogram) are observed for only a few of the realizations' variograms, mostly at the larger lag distances.
Figure 12 also provides E-type and variance maps calculated from the 100 realizations. Figure 14 presents cumulant maps of the TI and a realization. From the cumulant maps, it can be inferred that, for small lag distances, the proposed algorithm reproduces the third-order statistics of the continuous TI in unconditional simulation, and for larger lag distances, a reasonable match is also observed in most areas of the maps. The variability shown in the variance plot is high across almost the whole domain, and high-value channels propagate in different locations within different realizations according to the E-type plot, which shows that the unconditional simulation performs well in generating equiprobable reconstructions of the TI.
Three-Dimensional TI
Additionally, the algorithm was tested on a three-dimensional TI with a size of [69 × 69 × 39], as shown in Figure 15, accompanied by its cross-sectional views (middle row). A template size of T = [15 × 15 × 15] was used, as concluded from the entropy-based method. Figure 16 shows a few unconditional simulations and their cross-sectional views, allowing a comparison of the training image with the realizations and their structures. It can be seen that the method reproduces the desired shapes and continuities present in the three-dimensional TI with reasonable accuracy.
Figure 17 compares the 24 generated simulations and the TI statistically. The box plots in Figure 17a show that the median frequencies of the simulated nodes in each category are near the corresponding values of the training image (red circle), and the TI class frequencies fall within the 50% boxes of the realizations' categorical frequencies. The mean plot in Figure 17b also confirms the similarity of the first-order statistics of the TI (red line) and the realizations by comparing their average values. Looking at the realizations produced by the proposed method (Figure 16) and comparing them to the TI (Figure 15), it can be observed that the continuity of the channels is reproduced within the simulation domain. The trapezoids present in the TI are very difficult to reproduce, even using pattern-based MPS simulation methods. The results show that those trapezoid-shaped structures are reproduced in the unconditional simulations, although they are not precisely identical to the shapes in the training image.
3.2. Conditional Simulation
The proposed method was tested for conditional simulation to check the accuracy of the simulations in honoring the hard data during the simulation process. Figure 18a depicts 361 points of hard data used for the conditional simulation of the channel training image, arranged so as to form a dense area of points in addition to vertical discrete streams. The purpose is to analyze the impact on the E-type and variance maps (Figure 18e,f) in comparison with the hard data and the unconditional simulations' E-type plots (Figure 12d,e). The variance is very small in areas where conditioning data points are available.
Among the hard dataset, 72 data points (20%) are located in the center of the grid, forming a rectangular area that plays the role of a dense region of conditioning data. The other 289 hard data are distributed at equal distances across the grid, as shown in Figure 18a. Looking at the regions that contain a few non-channel-class hard data (blue), uninterrupted by channel-class points (yellow), shows low variance in the simulations and non-channel-class dominance in the ensemble plots (Figure 18e,f).
The areas of proportionally lower variance clearly show how the algorithm honors the hard data and how the variability increases with distance from the hard data. The conditioning data available in those areas are all of the same class, so the other class has little presence there (leading to low variance). The training image contains structures with "Y-like" shapes, or shapes partially resembling a "Y"; each local straight stream tends to turn right, left, or in both directions at certain angles after a certain length of vertical continuity. This can be inferred from the E-type plot of the 100 realizations shown in Figure 18e.
The vertical streams were reconstructed with reasonable accuracy according to the E-type plots, and the variability over those locations is lower, even with such a small number of hard conditioning data points. The dense rectangular area of conditioning data of the same class is dominated by the opposite class on its left and right sides in the E-type plot and in almost all of the realizations. This is an indicator of the ability of the model to respect the formation of channels; in fact, the width of the original TI channels is honored in that area of dense conditioning data. The variability above and below the rectangular channel-class stream is high. This is due to the shapes present in the channel TI, as the channels maintain a straight path for a limited length and then tend to turn in other directions with certain angular deviations.
The isolated channel-class hard data points also mark single channel-class points in the E-type plots that are surrounded by non-channel-class hard data. Looking at the E-type plot, it is observed that even these single points are honored by the algorithm and become points through which channels pass, mostly in two distinct directions. In addition, it is clear from the variance plot that the variability increases gradually with distance from these points. In fact, as a pixel-based method, the proposed algorithm has a high potential for respecting conditioning data. Another point visible in Figure 18 is that, on the left side, there are conditioning data near the domain border, whereas on the right side the conditioning data lie farther from the border. Looking at the variance plot and comparing the two sides clearly shows the difference in the amount of variability (uncertainty) of the simulations produced by the proposed method with and without conditioning data points at the borders.
Figure 19 and Figure 20 provide statistical validation results for 100 realizations of the conditional simulation of the channel TI shown in Figure 3a. The first-, second-, and third-order statistical comparison of the realizations with the corresponding channel TI reveals a very good match, and the uncertainty of the conditional simulation of the categorical variable via the proposed algorithm appears to be very low. The box plots show that the class frequencies of the TI are almost identical to the median frequencies over all the realizations. The variograms of the realizations also show good agreement with the variogram of the TI over approximately all lag values. As shown in Figure 20, the three-point statistical comparison of the training image with the conditional simulation shows a reasonable visual match; for both small and large lag distances, the proposed algorithm reproduces the third-order statistics of the channel TI well in conditional simulation.
The quality of such realizations depends on maintaining the continuity and shape of the channels across the simulated images in the same manner as in the training image. As the hard data are derived from the TI, the realizations should approximately match the TI in terms of where specific textures are located and what the statistical distribution is. In addition, the E-type map will reflect the exhaustive training image when the number of hard data is very high. The algorithm was tested on the channel TI with 1000 randomly distributed hard data, as shown in Figure 21a, and the results presented in Figure 21 confirm this claim. According to the variance plot, the variability of the 100 simulated values over most of the image is very low, resulting in minimal uncertainty in the modeling compared to the previous case with fewer hard data (Figure 18). Two conditionally simulated realizations are also provided in Figure 21, and they are very similar to the categorical channel TI (Figure 3a). Looking at the E-type plot (Figure 21d), channels with continuity over larger distances are visible compared to the previous case (Figure 18). Thus, the effect of the hard data conditioning on the quality of the realizations produced by the method is confirmed.
For the conditional simulation of a continuous variable, 208 hard conditioning data were used (Figure 22a). The hard data are irregularly spaced and scattered over the entire simulation grid domain. Two conditional simulation maps are shown in Figure 22b,c. The visual comparison between the training image and the simulation maps confirms that the proposed method respects the hard data. The shapes and locations of the main channels in the exhaustive continuous training image are well reproduced in the simulated images, especially in the top half of the image. The ensemble (E-type) maps are used to check how the realizations respect the hard conditioning data; the E-type map is generated from 100 realizations. Looking at the green circular areas marked in Figure 22a,e, the lack of conditioning data leads to higher uncertainty, and the simulations show high variance in areas with sparse hard data. The area highlighted by the ellipsoid was selected as an indicator of regions dominated by low-value hard data. Looking at those areas in the E-type and variance plots, it can be inferred that the algorithm respects the hard data by simulating the pixels within those regions with mostly low values and low variability.
On the other hand, within the area marked by the yellow rectangle, the hard data were sampled from both heads of the bimodal distribution of the reference image, so high variability among the realizations is observed in the variance map. The orange diamond in the conditioning data plot marks a single high-value pixel surrounded by a few low-value hard data pixels. It is evident from the E-type map that the algorithm honors that hard data point, and the average value around that point is higher than in its surroundings. This is a clear advantage of the pixel-based simulation method demonstrated by the proposed algorithm over pattern-based simulation, in which single points can be neglected while pasting patterns onto the simulation grid.
The continuous training image used here has a complex bimodal histogram, and the algorithm is able to reproduce this distribution over all the realizations (Figure 23a). A small overestimation of the left-head frequency and a small underestimation of the right-head frequency were observed on average. The variogram comparison plot in Figure 23c reveals that the second-order statistics of the continuous training image are also well reproduced, with the TI variogram falling almost in the middle of those of the realizations over the different lag distances. To show the multiple-point reproduction of the TI by the proposed pixel-based method, three-point cumulant maps of the training image and a simulated realization were generated and are presented in Figure 24. The algorithm's output confirms the reproduction of the TI's third-order statistical characteristics.
For the continuous variable as well, the E-type map closely reproduces the reference TI when the number of hard data is sufficiently high. The algorithm was tested using 500 randomly distributed hard data (Figure 25a), and the results presented in Figure 25c,d confirm this claim. It can be seen from Figure 25b that the channels occurring in the continuous TI are almost perfectly reproduced, owing to the presence of a higher number of conditioning data. In the region bounded by the red lines (60 < y < 80), there are almost no high-value hard data, and consequently no high-value streams were reproduced, which shows the appropriate influence of the hard data on the simulation by the proposed algorithm. The variance plot confirms this as well, because the variability remains very low in that region. A comparison of the E-type maps in Figure 22 and Figure 25 demonstrates the effect of the number of hard data on the simulation quality. In Figure 25c, the channels of the TI are almost perfectly reproduced, as opposed to the ensemble map in Figure 22d, where higher uncertainty (Figure 22e) is revealed. Still, the percentage of hard data in this case is under 4% (Figure 25a).
4. Three-Dimensional Case Study
After validation and testing, the proposed method was applied to a gold deposit to simulate the categorical variables. The primary commodity of the deposit is gold, followed by silver. The deposit type is disseminated in Paleozoic carbonate and siliciclastic rocks with a tabular orebody. The structure is characterized by a narrow reverse fault with a variable dip and host rocks warped by anticlinal folding. The dominant forms of alteration include argillization, decalcification, silicification, and baritization. A cutoff grade of 0.003 AuFA was used for defining the zones of mineralization, and high-grade mineralization was modeled using a 0.17 Au cutoff. The operation type is open-pit surface mining.
The mining company has categorized the exploration drilling data into three different categories, i.e., high-grade, low-grade, and waste data, to create wireframes (solid models) for volumetric analysis. The solid model developed by the mining company was used as the training image, and the categorically transformed composited exploration drilling data served as the hard conditioning data for the simulation. A total of 8759 composite samples with a composite length of 10 m are available for this study. The pixel size of the training image is 25 m × 25 m × 10 m, and the same pixel size was selected for the simulation.
Figure 26 shows the training image and hard conditioning data used in this study. The training image used in this case study has a size of [70 × 60 × 57], and a total of 239,400 blocks were simulated by the proposed method. The solid used as the training image was generated based on updated geological interpretations, including the existence of faults, mineralization zones, lithological characterization, and lithological contacts. The drilling data were considered the source of information with the highest certainty.
To perform the simulation, the template size was first determined in order to extract the pattern database from the training image; the results showed the optimal size to be T = [13 × 13 × 13]. After extraction of the pattern database, the patterns containing above-topography values were removed, and then the stochastic embedding and clustering were performed. The same approach described for the previous three-dimensional simulation was used, meaning that the ccdf was used for sampling from the highly populated clusters (those with more than 100 members). Figure 27 provides two generated realizations and the mean of 16 realizations.
Figure 28 provides a statistical comparison between the hard data, TI, and realizations.
Figure 28 shows that there are some deviations between the hard data and the simulation statistics; however, the results show an excellent match between the training image and the simulations. The reason is that the statistics and frequencies of the different classes are not the same for the conditioning data and the TI. Thus, the simulations, as expected, honor the statistics of the TI rather than those of the hard data. The proposed method, like other MPS methods, is intended to reproduce the TI statistics, as the realizations derived with this methodology are training-image-driven. The statistics of the simulations are pulled slightly toward the conditioning data while remaining similar to the TI; this is particularly clear for class three, as shown in Figure 28c. It can therefore be concluded that, when the TI and the hard data have similar statistics, the simulations will reproduce statistical properties that honor both the TI and the conditioning data.
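As an illustration of how such a comparison can be computed, the following is a minimal MATLAB sketch of class-proportion statistics for the TI, the hard data, and a set of realizations (cf. Figure 28). The variable names (TI, hardData, reals) are assumptions for illustration, not the names used in the released code.

% Minimal sketch: class proportions for the TI, hard data, and realizations.
% TI is the training-image grid, hardData a vector of conditioning categories,
% and reals a cell array of simulated grids (illustrative names).
tiVals  = TI(~isnan(TI));                 % drop above-topography / undefined cells
classes = unique(tiVals);
propTI   = arrayfun(@(c) mean(tiVals == c), classes);
propHard = arrayfun(@(c) mean(hardData(:) == c), classes);
allReals = cell2mat(cellfun(@(r) r(:), reals(:)', 'UniformOutput', false));
simVals  = allReals(~isnan(allReals));
propSim  = arrayfun(@(c) mean(simVals == c), classes);
disp(table(classes, propTI, propHard, propSim));   % side-by-side comparison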
Figure 29 shows a cross-sectional view of the TI and a generated realization to check the ability of the model to reconstruct structures in addition to honoring the conditioning data. A visual comparison of the two plots confirms the accuracy of the proposed pixel-based MPS method in generating realizations that reconstruct the training image structures while honoring the conditioning data. In this case-study three-dimensional TI, there are densely sampled areas of conditioning data indicating where accumulations of orebody material of different grades occur, which makes it necessary to evaluate the results in this regard in addition to checking the reproduction of textures and continuities. As observed in the slice views of Figure 29, even the high-grade zones, which were a minority class, were reproduced by the pixel-based simulation presented in this study: the number of high-grade pixels in the training image was 189 (0.08%), and among the hard data it was 79 (0.14%).
5. Comparative Analysis
To compare the dimensionality reduction and clustering algorithms used in this study (t-SNE and DBSCAN) with a previously used combination (MDS and K-means clustering), we performed a comparative analysis and calculated the WCSS and DBI for both cases on the normalized 2D-projected data. The WCSS increased from 2.52 for t-SNE with DBSCAN to 3.44 for MDS with K-means clustering, showing that the proposed method keeps within-cluster members more similar. In addition, the DBI increased from 0.73 for t-SNE with DBSCAN to 0.82 for MDS with K-means clustering, indicating better cluster separation and lower cluster spread when the proposed method is used.
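For reference, the two metrics can be computed on the normalized 2D-projected points as in the minimal MATLAB sketch below; Y2d and labels stand for the 2D embedding and the cluster assignments and are illustrative names, not the names used in the released code.

% Minimal sketch: WCSS and Davies-Bouldin index on normalized 2D projections.
% Y2d is the 2D embedding (t-SNE or MDS output), labels the cluster assignments.
Yn = normalize(Y2d);                        % z-score each embedding dimension
valid = labels > 0;                         % exclude DBSCAN noise points (-1)
Yv = Yn(valid, :);
[~, ~, lv] = unique(labels(valid));         % relabel clusters as 1..K
wcss = 0;
for k = 1:max(lv)
    pts = Yv(lv == k, :);
    ctr = mean(pts, 1);
    wcss = wcss + sum(sum((pts - ctr).^2, 2));   % within-cluster sum of squares
end
eva = evalclusters(Yv, lv, 'DaviesBouldin');     % Davies-Bouldin index
fprintf('WCSS = %.2f, DBI = %.2f\n', wcss, eva.CriterionValues);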
The proposed method can be considered a fast MPS method in both two-dimensional and three-dimensional simulations, based on the information provided in
Table 1. The codes were run on a PC with Windows 10 (64-bit), a 1.70 GHz Intel(R) Xeon(R) Bronze 3106 CPU (2 processors), and 64 GB of installed memory (RAM); the computer was a shared system used simultaneously by several users. The source code used to produce the results published in this paper, together with the associated supplementary materials, is available online in the GitHub repository of the first author, https://github.com/adel-asadi/Pixel_based_MPS (accessed on 1 April 2024). It should be noted that the code is implemented in MATLAB, using the Parallel Computing Toolbox for the simulation step, so the speed could likely be improved in environments such as Java or C++; it is therefore not appropriate to compare the method's speed with algorithms that are not written and run in MATLAB. However, largely owing to the dimensionality reduction and subsequent fast clustering, the runtime of the code in the different cases is acceptably low, which is a promising criterion for the adoption of the methodology in industrial applications.
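As a simple illustration of the parallelized simulation step, the sketch below distributes realizations over workers with parfor; simulateRealization is a hypothetical wrapper around the simulation routine and is not a function from the released code.

% Minimal sketch: generating realizations in parallel (Parallel Computing Toolbox).
% simulateRealization is a hypothetical wrapper shown only for illustration.
nReals = 16;
reals = cell(1, nReals);
parfor r = 1:nReals
    reals{r} = simulateRealization(TI, hardData, prototypes);   % hypothetical call
end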
To demonstrate the computational efficiency of the proposed method in comparison with previous methods, the same 2D categorical TI (Figure 3a) was used to generate 100 unconditional realizations, and the recorded times per simulated realization were 5.17 s and 38.56 s for the WAVESIM [32] and SIMPAT [25] algorithms, respectively. This corresponds to a 31% and 91% higher speed of the proposed method relative to the two investigated algorithms, respectively.
The information provided in Table 1 also indicates the following points, which are marked as superscripts within the table:
The unconditional simulation using the same training image is faster than the conditional simulation. The method was also tested with a larger amount of hard data (for both of the two-dimensional TIs shown in Figure 3), and an increase in the computational time was observed because of the more expensive similarity detection, even though fewer nodes need to be filled by the MPS simulation. In addition, the number of clusters has an impact on the runtime. As seen in Table 1, although the 2D categorical TI is smaller (in number of pixels) than the 2D continuous TI, its unconditional simulation takes relatively longer, because a larger number of prototypes must be searched when comparing the data event with the prototypes.
Setting MinPts = 3 decreases the computational time significantly (to 542 s for this case), but the reconstructions could be considered suboptimal, so MinPts = 2 was chosen for quality results. This value should nevertheless be tuned by trial and error for different training images and scenarios. Another observation from the experiments was that a larger amount of conditioning data, which helps the simulation reproduce the desired textures and statistics, could allow the user to set MinPts = 3 for faster simulations while maintaining outputs of acceptable quality. It should be noted that setting MinPts = 3 decreased the number of clusters from 1947 to 374 for this case; in one experiment, removing the partial use of the ccdf in the similarity detection for the MinPts = 3 case increased the runtime to 792 s.
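The MinPts trade-off can be checked quickly with a loop such as the minimal MATLAB sketch below; Y is the 2D t-SNE embedding of the pattern database, and the epsilon value is an assumed placeholder rather than the value used in the study.

% Minimal sketch: sensitivity of the DBSCAN clustering to MinPts.
% Y is the 2D t-SNE embedding of the patterns; epsilon = 1.0 is an assumed value.
for minPts = 2:4
    labels = dbscan(Y, 1.0, minPts);
    nClusters = numel(unique(labels(labels > 0)));
    nNoise    = sum(labels == -1);
    fprintf('MinPts = %d: %d clusters, %d noise patterns\n', minPts, nClusters, nNoise);
end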
The case-study three-dimensional TI had 46,339 points above topography, which did not need to be simulated; the number of nodes to be simulated was therefore 193,061 (including the points filled using hard data). In addition, the computational time is lower than that of the unconditional simulation for the other three-dimensional TI because the number of pattern clusters formed for the case-study TI was 463, almost a quarter of the 1947 clusters for the other, two-category three-dimensional TI.
6. Conclusions and Future Work
This research proposed a new pixel-based simulation algorithm using several machine-learning techniques. The results showed that t-Distributed Stochastic Neighbor Embedding (t-SNE) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) are efficient methods for the dimensionality reduction and clustering of the pattern database generated from a training image. In addition, the algorithm retains the advantages of pixel-based MPS simulation, which is the basic approach of the present study. An automatic method for template size determination, together with a number of optimization techniques, was also employed to fully automate the proposed MPS method. High performance was achieved in reconstructing complex features and continuities, even though the method uses a pixel-based approach to geostatistical simulation. In terms of respecting the hard data, the MPS method showed its ability to produce high-quality conditional simulations while also honoring the statistical properties of the training image. Continuous variables were also tested with the algorithm, and the results of the continuous training image simulation were as promising as those of the categorical simulations. Another main advantage of the algorithm is its computational speed; a relatively fast methodology was achieved in this work, and this criterion is very important for real-life three-dimensional problems. The validation results for the different scenarios showed that the algorithm is capable of solving related problems in various disciplines, including but not limited to mineral resources modeling.
Although the proposed approach is computationally fast, there is still scope for further improvement. t-SNE is the most time-consuming step in the proposed method, whereas DBSCAN clustering is fast and efficient but unsuitable for high-dimensional datasets. Therefore, a fast clustering algorithm that operates directly on high-dimensional data, replacing the dimensionality reduction and subsequent clustering proposed in this study, could significantly reduce the computational time and make the method considerably faster, although it is efficient even now. Another path for developing the proposed method would be implementing and testing the algorithm for non-stationary simulation, multivariate modeling, block conditioning, post-processing of simulations, optimization and enhancement of the reconstructions, spatio-temporal datasets, and other widespread MPS applications in different fields of study, using various training images and datasets for hard or soft conditioning of the simulation process.