Article

Logging Lithology Discrimination with Enhanced Sampling Methods for Imbalance Sample Conditions

1 CAS Engineering Laboratory for Deep Resources Equipment and Technology, Institute of Geology and Geophysics, Chinese Academy of Sciences, Beijing 100029, China
2 Innovation Academy for Earth Science, Chinese Academy of Sciences, Beijing 100029, China
3 College of Earth and Planetary Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
4 Department of Earth Science and Engineering, Imperial College London, London SW7 2BP, UK
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6534; https://doi.org/10.3390/app14156534
Submission received: 6 July 2024 / Revised: 23 July 2024 / Accepted: 23 July 2024 / Published: 26 July 2024
(This article belongs to the Topic Petroleum and Gas Engineering)

Abstract

In the process of lithology discrimination from a conventional well logging dataset, the imbalance in sample distribution restricts the accuracy of log identification, especially in fine-scale reservoir intervals. Enhanced sampling balances the distribution of well logging samples across multiple lithologies, which is of great significance for precise fine-scale reservoir characterization. This study employed data over-sampling and under-sampling algorithms represented by the synthetic minority over-sampling technique (SMOTE), adaptive synthetic sampling (ADASYN), and edited nearest neighbors (ENN) to process the well logging dataset. To achieve automatic and precise lithology discrimination on the enhanced-sampled well logging dataset, support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT) models were trained using cross-validation and grid search methods. To objectively evaluate the performance of different models on different sampling results from multiple perspectives, the lithology discrimination results were evaluated and compared based on the Jaccard index and F1 score. By comparing the predictions of eighteen lithology discrimination workflows, a new discrimination process combining ADASYN, ENN, and RF yielded the most precise lithology discrimination results. This process improves the discrimination accuracy of fine-scale reservoir interval lithology, has great generalization ability, and is feasible in a variety of geological environments.

1. Introduction

Deep and precise navigation technology directs drilling instruments to a specific location in the oil reservoir to acquire the optimum recovery ratio, which is of great significance in improving the economic benefits of oil fields. The quality of lithology discrimination, especially for fine-scale reservoir sections, directly affects the development of precise downhole navigation [1,2,3]. Traditional lithology discrimination methods rely on empirical rules and domain expertise, which are not only time-consuming but also often suffer from subjectivity and inconsistency [4,5,6,7,8]. Therefore, researchers are increasingly focusing on developing faster and more reliable identification tools [9,10,11,12,13]. Currently, artificial intelligence technology is gradually being adopted as an important alternative for addressing such complex problems [14,15,16,17,18,19]. Utilizing intelligent methods such as machine learning to construct lithology discrimination models is highly effective; however, addressing issues in the data preprocessing stage is equally important [20,21,22,23].
Nowadays, various machine learning methods and related technologies are being applied in lithology discrimination tasks. These applications mainly include basic models such as the naive Bayes classifier, support vector machine (SVM), and decision trees, as well as ensemble methods like random forest (RF) and gradient boosting decision trees (GBDT) [24,25,26,27]. Through practical data testing and quantitative comparisons, ensemble methods exhibit outstanding performance in classifying sandstone-related lithologies that are difficult to distinguish accurately using other algorithms. Compared to weak classifiers, ensemble methods have broader application prospects [28,29]. Building upon the foundation of algorithmic models, optimizing the model structure according to the specific requirements of practical problems aids in addressing more specialized and complex situations [30,31]. Improving the random forest model with a probability-based fuzzy representation method provides more information about rhythm, heterogeneity, and geological characteristics and improves the fineness of formation evaluation and reservoir characterization [32]. Introducing an enhanced multi-kernel Fisher discriminant model based on a single Fisher discriminant model effectively extracts features of carbonate reservoirs, enabling more precise lithology identification [33]. Artificial intelligence methods hold a comprehensive advantage in lithology discrimination tasks. The combination of various machine learning models with optimization algorithms forms an efficient and stable workflow for lithology discrimination. For instance, the fusion of principal component analysis, fuzzy decision tree models, and particle swarm optimization algorithms into lithology discrimination processing systems exemplifies this approach [34]. Additionally, other workflows combine dimensionality reduction, k-means sample clustering, and regression analysis for well logging lithology classification [35,36,37]. These studies demonstrate the significant benefits of machine learning methods in lithology discrimination and formation evaluation. However, most of this research has not taken into account the impact of imbalanced well logging data distribution.
Enhanced sampling algorithms can balance the dataset by adequately learning and accurately identifying classes with fewer samples in the original dataset. These methods improve data quality by considering both data distribution characteristics and model computation requirements [38]. The Gaussian mixture model-based over-sampling method with Jensen–Shannon divergence (GJ-RSMOTE) can effectively improve the prediction accuracy of C4.5 decision trees and SVM models [39]. According to the multi-model fusion strategy, using real sample points for interpolation to obtain synthetic data has more consistent probability distribution characteristics compared to traditional sample point interpolation methods, making it more advantageous [40]. At present, in the field of well logging data mining, there are also studies that use stacking models combined with the SMOTE algorithm to obtain lithology identification results [41]. These studies all focus on over-sampling the minority class to enrich the information content of the dataset.
The above review indicates that data balancing is crucial for enhancing the performance of classifiers and improving the predictive capability for sparsely represented classes. Well logging data often suffer from imbalanced sample distributions. Data on fine-scale reservoir intervals are relatively scarce, and models tend to misclassify the corresponding samples into more abundant lithology classes in order to maintain high overall accuracy. This can ultimately lead to missing highly economical reservoir intervals during industrial exploitation [42,43,44]. To enhance the accuracy of identifying lithology classes with scarce samples, it is necessary to construct balanced datasets with an equal number of samples for each class. When adopting sample balancing methods to balance the frequency distribution of data across different classes, it is also necessary to weigh the information loss of under-sampling methods against the overfitting risk of over-sampling methods.
Based on the analysis of the above issues, this study proposes a novel approach that first utilizes over-sampling algorithms represented by the synthetic minority over-sampling technique (SMOTE) and adaptive synthetic sampling (ADASYN) to generate synthetic samples for lithology classes with scarce samples. Then, the edited nearest neighbors (ENN) process was applied to remove marginalized samples and maintain the balance in sample counts between lithology classes. After enhanced sampling, several machine learning models, including support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT), were applied to test the improvement of this sampling workflow on the lithology discrimination results. Based on the experimental results, the optimal combination of enhanced sampling steps and model was determined to achieve efficient, balanced enhanced sampling and intelligent, accurate lithology identification. This study provides a method for synthesizing logging data for scarce-sample lithologies and explores an efficient workflow for the accurate identification of fine-scale reservoirs.

2. Methodology

The intelligent logging lithology discrimination using enhanced sampling methods is divided into four steps: data preparation, data balancing processing, intelligent discrimination model construction, and lithology prediction analysis (Figure 1). Firstly, the well logging dataset was prepared, including organizing data obtained from logging instruments, data cleaning, and standardization. Secondly, the dataset was subjected to enhanced sampling, starting with over-sampling of classes with fewer samples to generate synthetic samples, followed by effective sample selection from the over-sampled dataset. The third step involved establishing an intelligent lithology discrimination model, including initializing machine learning models, model training, and determination of the best-performing model. The final step was to analyze the lithology discrimination results of the model, including quantifying the performance metrics, comparing and evaluating the results of various models, and demonstrating the effectiveness of lithology discrimination in practical applications.
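For orientation, the four steps map onto a short script. The following is a minimal sketch in Python, assuming a scikit-learn/imbalanced-learn environment, with illustrative hyperparameters rather than the exact configuration tuned in Section 3.4.

```python
# Minimal sketch of the four-step workflow (illustrative, not the exact
# configuration used in this study).
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, jaccard_score
from sklearn.model_selection import train_test_split

def run_workflow(X, y):
    # Step 1: data preparation (cleaning, standardization) is assumed done.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    # Step 2: enhanced sampling -- over-sample the minority lithologies,
    # then edit marginal samples out of the over-sampled set.
    X_bal, y_bal = ADASYN(random_state=0).fit_resample(X_tr, y_tr)
    X_bal, y_bal = EditedNearestNeighbours().fit_resample(X_bal, y_bal)
    # Step 3: build and train the discrimination model (RF shown here;
    # SVM and GBDT are handled analogously).
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_bal, y_bal)
    # Step 4: quantify performance on the untouched test split.
    y_pred = model.predict(X_te)
    return (jaccard_score(y_te, y_pred, average='macro'),
            f1_score(y_te, y_pred, average='macro'))
```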

2.1. Data Preprocessing

The Z score method locates each sample within a distribution by measuring its deviation from the mean of the entire dataset in units of the standard deviation. Based on this measure, samples that deviate significantly from the mean are marked as outliers. The Z score can be calculated using the following equation:
$$Z = \frac{x - \mu}{\sigma} \tag{1}$$

where $Z$ is the Z score of sample $x$, $\mu$ represents the mean of the sample set, and $\sigma$ represents the standard deviation of the sample set.
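As a concrete illustration, Equation (1) reduces to a few lines of NumPy; the sketch below applies the |Z| ≤ 3 cut-off that Section 3.2 later uses, and the function name is ours.

```python
import numpy as np

def remove_outliers(X, threshold=3.0):
    """Drop samples whose Z score on any logging curve exceeds the threshold."""
    z = (X - X.mean(axis=0)) / X.std(axis=0)   # Equation (1), per curve
    keep = (np.abs(z) <= threshold).all(axis=1)
    return X[keep], keep
```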
Based on the principle that formations of the same period and phase exhibit similar logging responses, the logging curve data corresponding to same-period strata in each well were adjusted to similar frequency distributions by logging curve standardization. First, we selected a reference well and calculated the mean and standard deviation of its logging data. Based on this, the data from the other wells were transformed. The standardized values of the logging curves can be calculated as:
$$V_{\mathrm{norm}} = \frac{V_{\mathrm{raw}} - \mu_{\mathrm{raw}}}{\sigma_{\mathrm{raw}}} \cdot \sigma_{\mathrm{std}} + \mu_{\mathrm{std}} \tag{2}$$

where $V_{\mathrm{norm}}$ represents the standardized logging curve value, $V_{\mathrm{raw}}$ represents the logging curve value before standardization, $\mu_{\mathrm{raw}}$ and $\sigma_{\mathrm{raw}}$ represent the mean and standard deviation, respectively, of the logging curve before standardization, and $\mu_{\mathrm{std}}$ and $\sigma_{\mathrm{std}}$ represent the mean and standard deviation of the corresponding logging curve in the reference well.
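A small sketch of Equation (2), assuming each curve is passed as a NumPy array (one for the well being standardized, one for the reference well); the function name is illustrative.

```python
import numpy as np

def standardize_curve(v_raw, v_ref):
    """Rescale a logging curve onto the reference well's distribution (Equation (2))."""
    mu_raw, sigma_raw = v_raw.mean(), v_raw.std()
    mu_std, sigma_std = v_ref.mean(), v_ref.std()
    return (v_raw - mu_raw) / sigma_raw * sigma_std + mu_std
```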

2.2. Enhanced Sampling Algorithms

2.2.1. Over-Sampling Algorithms

SMOTE (synthetic minority over-sampling technique) is an algorithm designed to address the issue of imbalanced sample distributions in datasets [45]. Data balancing is a crucial step in machine learning tasks, and SMOTE is a widely recognized over-sampling method that can effectively increase the number of samples of the minority classes. It has obvious advantages in maintaining data reliability and has become a foundation for various over-sampling algorithms. Its basic concept is to generate synthetic samples for minority classes to enhance their representation. The specific process of the SMOTE algorithm is as follows (Figure 2): Firstly, the number of samples for each class in the training set is counted, with the class having the highest sample count designated as the majority class and the remaining classes designated as minority classes. Next, the Euclidean distance between samples of minority classes is calculated to measure feature disparities, expressed by the following equation:
$$d(i, j) = \sqrt{\sum_{k=1}^{N} (x_{ik} - x_{jk})^2} \tag{3}$$

where $N$ represents the number of features. Then, for each minority class sample, another sample is randomly selected from its $K$ nearest neighbors, and a synthetic sample is generated using interpolation. For the sample $x_i$ and its selected neighbor $x_j$, the synthetic sample can be calculated as:

$$x_{\mathrm{new}} = x_i + \lambda \cdot (x_j - x_i) \tag{4}$$

where $\lambda$ is a random number between 0 and 1.
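In practice the procedure is available off the shelf in imbalanced-learn; the sketch below runs it on a toy imbalanced dataset standing in for the logging samples (seven features for the seven curves, five classes for the five lithologies), with class weights chosen only for illustration.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced stand-in for the logging dataset: 7 curves, 5 lithologies.
X, y = make_classification(n_samples=2000, n_features=7, n_informative=5,
                           n_classes=5, n_clusters_per_class=1,
                           weights=[0.45, 0.16, 0.24, 0.04, 0.11],
                           random_state=0)
# k_neighbors=5 is the library default, used here for illustration.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))  # every class raised to the majority count
```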
ADASYN (adaptive synthetic sampling) is a further improved data balancing algorithm that introduces adaptiveness on top of SMOTE [46]. It evaluates the local density of each minority class sample and generates more synthetic data for samples with lower density. The advantages of this method are reflected in both the resulting dataset distribution and the model prediction results. The specific process of the ADASYN algorithm is as follows (Figure 3): The first step is the same as in the SMOTE algorithm, which involves finding the majority and minority classes in the training set. Next, for each minority class sample, the density of majority class samples among its $K$ nearest neighbors is computed. For sample $x_i$ in the current minority class, its density can be represented by the following equation:
$$r_i = 1 - \frac{\Delta_i}{K} \tag{5}$$

where $\Delta_i$ represents the number of samples belonging to the same class as $x_i$ among the $K$ nearest neighbors of $x_i$. The larger the value of $r_i$, the fewer same-class samples there are among the neighbors of $x_i$, and therefore more synthetic samples need to be generated around $x_i$.

The next step involves normalizing the density of each minority class sample to obtain the corresponding sample synthesis weights. In the subsequent data synthesis process, more synthetic samples are generated around samples with larger weights. For the minority class sample $x_i$, the number of synthesized samples $g_i$ can be calculated as:

$$g_i = \frac{r_i}{\sum_{i=1}^{M} r_i} \cdot (N - M) \tag{6}$$

where $N$ represents the number of majority class samples in the dataset, and $M$ represents the number of minority class samples in the class corresponding to $x_i$.
The final step of the algorithm is sample synthesis, which follows the same interpolation method in the sample feature space as Equation (4). Compared to the SMOTE algorithm, the adaptiveness of sample synthesis enables ADASYN to perform better on complex imbalanced datasets, making it an important step in further optimizing the handling of imbalance problems.
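With imbalanced-learn, ADASYN drops in the same way; a sketch reusing the toy X, y from the SMOTE example above:

```python
from collections import Counter
from imblearn.over_sampling import ADASYN

# n_neighbors=5 is the library default, used here for illustration.
X_res, y_res = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)
# Unlike SMOTE, per-class counts land near (not exactly at) the majority
# count, because g_i is allocated adaptively through the density ratio r_i.
print(Counter(y_res))
```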

2.2.2. Under-Sampling Algorithm

ENN (edited nearest neighbors) is an under-sampling algorithm for handling imbalanced datasets [47]. It improves the quality of the dataset and enhances the predictive accuracy of classifiers by removing samples whose class is inconsistent with the majority class among their neighbors. It can effectively reduce the risk of overfitting and purify the dataset by removing possible noise samples (especially potentially erroneous samples synthesized by over-sampling). Moreover, by combining it with different over-sampling algorithms, the overall benefits of the ‘over-sampling + under-sampling’ processing flow can be comprehensively assessed. The specific process of the ENN algorithm is as follows (Figure 4): Firstly, for each sample, its Euclidean distance to the other samples is calculated with Equation (3). Then, if the class of the sample differs from the most frequently occurring class among its $K$ nearest neighbors, it is marked as noise or a potentially misclassified sample and removed. This process is iterated until no more noise samples are found. After processing with the ENN algorithm, the resulting dataset can be represented as:
$$S_{\mathrm{ENN}} = \left\{\, x_i \mid y_i = \mathrm{mode}(K_i),\ i = 1, 2, \ldots, n \,\right\} \tag{7}$$

where $S_{\mathrm{ENN}}$ represents the dataset after under-sampling, $x_i$ represents the $i$-th sample, $y_i$ represents the true label of the $i$-th sample, $K_i$ represents the set of classes of the $K$ nearest neighbors of the $i$-th sample, and $\mathrm{mode}(K_i)$ represents the most frequently occurring class in $K_i$.
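A corresponding sketch with imbalanced-learn follows (again on the toy X, y). Note that the library's EditedNearestNeighbours performs a single editing pass; the iterate-until-stable behavior described above matches its RepeatedEditedNearestNeighbours variant.

```python
from imblearn.under_sampling import EditedNearestNeighbours

# kind_sel='mode' removes a sample whose label differs from the majority
# label among its K nearest neighbors, i.e., the retention rule of Equation (7).
enn = EditedNearestNeighbours(n_neighbors=3, kind_sel='mode')
X_res, y_res = enn.fit_resample(X, y)
```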

2.3. Machine Learning Models

Well logging curve data are input into the intelligent lithology discrimination model to automatically differentiate the various lithologies in the formation and output lithology prediction results. In this study, considering the limited size of the dataset, deep learning models would face a high risk of overfitting. Moreover, to better verify the effect of enhanced sampling on widely used machine learning models, three representative methods, namely support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT), were selected to evaluate the improvement brought by the enhanced sampling methods (Figure 5).

2.3.1. Support Vector Machine (SVM)

The support vector machine is a widely used machine learning algorithm in classification tasks. It is suitable for datasets with small sample size. Its core idea is to find a classification hyperplane that maximizes the margin between different classes. Support vectors are the data points closest to the hyperplane, and they determine the position of the hyperplane during model construction. For linearly separable problems, classification decisions can be made by evaluating the distance between data points and the hyperplane. However, in practical applications of well logging data analysis, data are often not completely linearly separable and may contain some noise [48]. To address these situations, kernel functions can be employed in SVM to map data features into a higher-dimensional space, allowing nonlinear relationships to be separated by a linear hyperplane in this higher-dimensional space. The SVM model used in this study employed the radial basis function (RBF) kernel, which can be represented by the following equation:
$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right) \tag{8}$$

where $x_i$ and $x_j$ are data points and $\sigma$ is a parameter controlling the kernel width.
The soft margin of SVM allows a small number of samples to be misclassified within a certain range, which helps maintain the model’s generalization ability and prevent overfitting. During model training, parameters related to the kernel function and soft margin have the most direct impact on model performance.
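A minimal scikit-learn sketch of this classifier; the C and gamma values are placeholders for the grid search of Section 3.4, and in sklearn's parameterization gamma plays the role of $1/(2\sigma^2)$ in Equation (8).

```python
from sklearn.svm import SVC

# RBF-kernel SVM with a soft margin: C penalizes misclassified samples,
# gamma controls the kernel width.
svm = SVC(kernel='rbf', C=10.0, gamma=0.1)
svm.fit(X_res, y_res)   # e.g., a balanced dataset from Section 2.2
```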

2.3.2. Random Forest (RF)

Random forest is an ensemble learning method. It performs well when processing high-dimensional data. It constructs multiple decision trees during the training process and improves overall accuracy by combining their classification results while controlling overfitting [49]. The random forest model employs random sampling with replacement from the training dataset to construct different training subsets. This random sampling enhances the diversity and robustness of the model while reducing the influence of individual samples. Then, during the construction of each decision tree, the random forest model adopts a random feature selection strategy: when splitting each node of the decision tree, a subset of features is randomly selected from the total feature set as candidate split attributes. This random feature selection helps reduce the correlation between features and improves the model’s generalization ability. Finally, the random forest model makes classification decisions through voting. For classification tasks, each decision tree provides a prediction, and the final classification result is determined by the majority vote. Assuming the set of classes is $\{c_1, c_2, \ldots, c_N\}$, $T$ is the number of decision trees, and $h_i^j(x)$ represents the prediction result of decision tree $h_i$ for class $c_j$, the voting result can be expressed by:
$$H(x) = \begin{cases} c_j, & \text{if } \sum_{i=1}^{T} h_i^j(x) > 0.5 \sum_{k=1}^{N} \sum_{i=1}^{T} h_i^k(x) \\ \text{reject}, & \text{otherwise} \end{cases} \tag{9}$$
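In scikit-learn, the three mechanisms above (bootstrap sampling, per-split feature subsetting, majority voting) are built in; a brief sketch with illustrative hyperparameters:

```python
from sklearn.ensemble import RandomForestClassifier

# predict() returns the majority-vote class of Equation (9); bootstrap
# resampling and random feature selection happen inside each tree.
rf = RandomForestClassifier(n_estimators=200, max_features='sqrt',
                            min_samples_leaf=2, random_state=0)
rf.fit(X_res, y_res)
```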

2.3.3. GBDT

Gradient boosting decision tree (GBDT) is also an ensemble learning algorithm, and its core idea is to iteratively update the structure of decision trees, enabling each tree to correct the errors of the previous one, ultimately forming a robust predictive model [50]. The specific process is as follows: Firstly, the model initializes a basic decision tree. By comparing the lithology prediction results of the decision tree on the well logging dataset with the actual lithology labels, the logarithmic loss function is obtained, represented by the following equation:
$$L(y_i, p_i) = -\sum_{k=1}^{K} y_{ik} \log p_{ik} \tag{10}$$

where $K$ represents the total number of classes; if the $i$-th sample belongs to class $k$, then $y_{ik} = 1$; and $p_{ik}$ represents the probability of the $i$-th sample belonging to class $k$. Next, the negative gradient (residual) of the multinomial logarithmic loss function is calculated, which can be represented by:

$$r_{ik} = -\frac{\partial L(y_{ik}, p_{ik})}{\partial p_{ik}} \tag{11}$$

where $r_{ik}$ represents the residual of class $k$ for the $i$-th sample, and the updating process of the decision tree involves learning these residuals. Through repeated iterations, the residuals of the model on the training data gradually decrease. After the iterations stop, the final model combines the outputs of all decision trees to obtain the classification results.
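A corresponding scikit-learn sketch; each boosting stage fits regression trees to the negative gradient of the multinomial log loss (Equations (10) and (11)), and learning_rate scales each corrective step. The hyperparameter values are illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1,
                                  min_samples_leaf=5, random_state=0)
gbdt.fit(X_res, y_res)
```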

2.3.4. Model Evaluation Framework

To assess the effectiveness of the proposed methods, it is essential to establish evaluation metrics for the experimental results. The Jaccard index serves as a metric for evaluating the classification performance of the model from the perspective of the well logging sample dataset. It is defined as the size of the intersection between the predicted label set and the true label set divided by the size of their union. The Jaccard index for the $k$-th class can be calculated using the following equation:
$$J_k = \frac{\lvert Y_k \cap \hat{Y}_k \rvert}{\lvert Y_k \cup \hat{Y}_k \rvert} \tag{12}$$

where $Y_k$ represents the true label set for the $k$-th class and $\hat{Y}_k$ represents the corresponding predicted label set. A higher Jaccard index indicates that the model’s predicted label set is closer to the true label set, reflecting better lithology prediction performance. Letting $N$ denote the total number of lithology classes, the overall Jaccard index of the model is the average of the scores for each class:

$$J = \frac{1}{N} \sum_{k=1}^{N} J_k \tag{13}$$

Well logging lithology discrimination is a multi-class classification problem, and the F1 score is a commonly used evaluation metric for such problems. The F1 score reflects the classification performance on each class, and the F1 score for the $k$-th class can be calculated using the following equation:

$$F1_k = \frac{2\,TP_k}{2\,TP_k + FP_k + FN_k} \tag{14}$$

where $TP_k$, $FP_k$, and $FN_k$ represent the numbers of true positives, false positives, and false negatives, respectively, for the $k$-th class. The overall F1 score of the model is the average of the scores for each class:

$$F1 = \frac{1}{N} \sum_{k=1}^{N} F1_k \tag{15}$$

Here, $N$ again represents the total number of lithology classes. A higher F1 score indicates better lithology prediction performance.
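Both metrics, with the macro averaging of Equations (13) and (15), are available in scikit-learn; a sketch assuming a held-out X_test, y_test split as in Section 3.4 and the rf model from above:

```python
from sklearn.metrics import jaccard_score, f1_score

# average='macro' computes the per-class scores of Equations (12) and (14)
# first, then averages them with equal class weights.
y_pred = rf.predict(X_test)
print('Jaccard:', jaccard_score(y_test, y_pred, average='macro'))
print('F1:', f1_score(y_test, y_pred, average='macro'))
```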

3. Results

3.1. Dataset Description

The dataset consists of well logging data from eight wells (CB32, CB82, CB85, CB89, CB323, CB327, CB832, and CBX395) in the Chengbei Operation Area of Shengli Oilfield, covering the Ed1 to Ed4 sections. Based on the logging interpretation results and following the Chinese petroleum industry’s classification standard for clastic grain size [51], the lithology labels for the dataset are classified into the following five classes: (1) mudstone, (2) siltstone, (3) fine sandstone, (4) coarse sandstone, and (5) pebbled sandstone. The dataset includes seven well logging curves for lithology discrimination:
① GR (gamma ray): measures the intensity of gamma rays in the formation, which is usually related to the content of radioactive materials in the rock and can help identify lithologies.
② CAL (caliper): measures the diameter of the borehole; changes in borehole diameter can reflect the stability of the borehole and possible changes in geological structure.
③ RD (deep investigate double lateral resistivity log): measures formation resistivity, focusing on detecting strata deeper than the borehole, which helps to identify the type and content of fluids in deep layers.
④ RS (shallow investigate double lateral resistivity log): has a shallower detection depth than RD; it is usually used to evaluate the resistivity of strata near the borehole and plays an important role in identifying the distribution of fluids near the borehole.
⑤ AC (acoustic log): measures the propagation time of sound waves in the formation and is used to estimate the porosity and fluid properties of the formation.
⑥ CNL (compensated neutron log): measures neutron absorption in the formation, which is related to the porosity and fluid type of the formation.
⑦ DEN (density log): measures the density of the formation and can reflect the porosity and mineral composition of the rock.

3.2. Data Preprocessing

The missing values in the original data were filled based on the core data and mud logging information. For dataset cleaning, we calculated the Z score for all samples in the original well logging data. We identified data points with a Z score outside the range [−3, 3] as outliers and removed them. This process resulted in the cleaned well logging dataset with a sample capacity of 28,763. The original dataset comprised five lithology labels: M for mudstone, S for siltstone, FS for fine sandstone, CS for coarse sandstone, and PS for pebbled sandstone. Table 1 displays the statistical information for each lithology in the original dataset. It is evident that the M label has the highest number of samples (13,307 samples), followed by the FS label (6830 samples). The number of samples for S (4716 samples) and PS (2855 samples) is relatively lower, while the CS label has the fewest samples (1055 samples).
Based on the logging curves of well CB32, which served as the reference well, the well logging data for each well were standardized to eliminate the influence of environmental factors and logging instrument variations on data quality. By considering the maximum and minimum values of each logging curve from CB32 after cleaning, optimal estimates for standardization were set. Equation (2) was then applied to standardize the data from other wells, resulting in detailed information for the seven curves as shown in Table 2.

3.3. Data Balancing: Application of Enhanced Sampling Methods

In reservoir exploration, lithologies such as siltstone (S), fine sandstone (FS), and coarse sandstone (CS) represent crucial oil-bearing reservoirs. However, based on the preprocessing results above, the combined sample count for these three lithology types is less than half of the total dataset size. To obtain more comprehensive information at the data level and train models for more accurate predictions of these relevant reservoirs, it is essential to synthesize abundant and reliable synthetic data for these lithology classes through data balancing methods.

3.3.1. Over-Sampling Results

After applying the SMOTE method to the original dataset, the number of samples for each label was balanced to 10,666 (Figure 6). To examine the impact of the data balancing results on the consistency of sample distribution, principal component analysis (PCA) was used to visualize the main features before and after balancing processing, and frequency density curves of the two principal components were plotted (Figure 7b,e). It can be observed that SMOTE increased the sample capacity of the dataset while essentially maintaining the original distribution shape of the samples under the principal component attributes.
After applying the ADASYN over-sampling algorithm to the dataset, the minimum number of samples for each label is 10,411 (label FS), and the maximum number of samples is 10,733 (label PS). Through ADASYN over-sampling, the sample counts for each class are close to the original majority class (label M). From the frequency distribution of the principal components (Figure 7c,f), it can be observed that ADASYN can maintain the consistency of the distribution between the over-sampled dataset and the original dataset.
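The visual check behind Figure 7 can be reproduced along these lines; a sketch assuming X holds the original samples and X_res the balanced ones (as in the sampling sketches of Section 2.2):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(X)        # fit on the original samples only
pc_orig, pc_bal = pca.transform(X), pca.transform(X_res)
for i in range(2):
    plt.hist(pc_orig[:, i], bins=60, density=True, histtype='step',
             label=f'original PC{i + 1}')
    plt.hist(pc_bal[:, i], bins=60, density=True, histtype='step',
             label=f'balanced PC{i + 1}')
plt.xlabel('principal component value')
plt.ylabel('frequency density')
plt.legend()
plt.show()
```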

3.3.2. Under-Sampling Results

Using the ENN under-sampling algorithm to filter out confusing samples from the original dataset does not control the balance of sample quantities for each class. Therefore, the dataset processed by ENN remains unbalanced, with little change in the relative quantities and proportions of samples for each class compared to the original dataset (Figure 7g,j). The ENN algorithm can remove redundant majority class samples and construct a more lightweight and reliable well logging dataset without losing critical information. However, the under-sampling algorithm does not increase the quantity of minority class samples, resulting in limited improvement in the predictive performance for relevant lithologies.
In summary, combining the sample filtering capability of the ENN under-sampling algorithm with the data synthesis function of over-sampling algorithms can maximize the enrichment of lithological information content in well logging data. This approach maintains adaptability and flexibility in handling actual well logging datasets, ultimately leading to the training of optimal intelligent discrimination models.

3.3.3. Integrated Balancing Processing Results

The integrated method of SMOTE+ENN and ADASYN+ENN was employed to process the dataset, aiming to control the relative balance of sample quantities for each label (Figure 6). Initially, datasets processed by SMOTE or ADASYN over-sampling algorithms exhibited rich and fairly balanced sample quantities for each class. Subsequently, after ENN filtering, samples with ambiguous class attributions were removed. At this stage, the sample quantities for each class became comparable yet distinct. Both sets of datasets achieved a relatively balanced state. Upon comparing the principal component frequency densities, it is evident that both integrated methods maintained the original distribution shape of samples in the principal component attributes. Specifically, the results of the SMOTE+ENN approach resemble the principal component frequency distribution of the original dataset (Figure 7h,k) more closely. These findings demonstrate that integrated methods can uphold the overall consistency of key features in the data distribution, making them applicable in intelligent lithology discrimination models.
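imbalanced-learn ships SMOTE+ENN as a single combined sampler, while the ADASYN+ENN chain can be assembled by applying the two samplers in sequence; a sketch on the toy data from Section 2.2:

```python
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import EditedNearestNeighbours

# SMOTE + ENN in one step.
X_se, y_se = SMOTEENN(random_state=0).fit_resample(X, y)
# ADASYN + ENN chained manually: over-sample first, then edit.
X_tmp, y_tmp = ADASYN(random_state=0).fit_resample(X, y)
X_ae, y_ae = EditedNearestNeighbours(kind_sel='mode').fit_resample(X_tmp, y_tmp)
```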

3.4. Application of Intelligent Discrimination Models

Regarding the division of the dataset, 80% of the original dataset was used for model training and validation, serving as the main basis for updating model parameters; this part was named the ‘training and validation set’. The remaining 20%, named the ‘testing set’, was used for the final model evaluation and was not used for updating model parameters. The different data balancing methods were applied to the training and validation sets to obtain the corresponding over-sampled or under-sampled datasets.
For the training and validation sets, five different data balancing methods were used, namely SMOTE, ADASYN, ENN, SMOTE+ENN, and ADASYN+ENN. Together with the training and validation set without data balancing, a total of six datasets for training and validation were obtained, and each was divided into training and validation sets in a ratio of 4:1. The 5-fold cross-validation method was employed to enhance the model’s generalization ability, and grid search was used to find the optimal parameter combination for each model, as sketched below. To avoid randomness in the results, the training process was repeated 10 times, and the average values were recorded as the final results. Line plots were generated to illustrate the training progress of each model parameter on the different datasets, facilitating the comparison of the effects of the various balancing methods on model training performance.
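The split-tune-repeat procedure maps directly onto scikit-learn utilities; the sketch below uses an illustrative RF grid (the actual grids and optima are in Table 3), and the enhanced sampling of the training side would precede the fit.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# 80/20 split; balancing is applied to the training side only, so the
# test set keeps the field's true class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
param_grid = {'min_samples_leaf': [1, 2, 5, 10],
              'max_features': ['sqrt', 0.5]}
scores = []
for run in range(10):  # repeat to average out randomness
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=200, random_state=run),
        param_grid, cv=5, scoring='f1_macro')
    search.fit(X_tr, y_tr)  # enhanced sampling of X_tr would precede this
    y_pred = search.predict(X_te)
    scores.append(f1_score(y_te, y_pred, average='macro'))
print(np.mean(scores))
```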

3.4.1. Model Training

The detailed records of the tuning parameters and optimal values for each model are documented in Table 3. The parameter tuning for the SVM model includes the regularization penalty coefficient C and the coefficient γ of the RBF kernel function. Random forest (RF) and gradient boosting decision tree (GBDT) models were both based on decision trees as basic classifiers, so in terms of parameter tuning, attention needs to be paid to tree structure-related parameters. The most important among these is the minimum number of samples required to form a leaf node. Additionally, in the construction process of random forest, sample and feature selection were involved, making the maximum number of features another important parameter; GBDT involves an iterative process of updating to fit residuals; thus, while focusing on tree structure tuning, it was also necessary to control the learning rate of the model iteration.
During each parameter tuning process, the improvement of the model with the ENN-treated dataset was not significant, while over-sampling methods enhanced the training performance of the models. Moreover, the overall training performance of the models on the ADASYN-treated dataset was better than that on the SMOTE dataset, with the training results being the best on the dataset treated by ADASYN+ENN (Figure 8).

3.4.2. Model Testing

To avoid the randomness of test results, the support vector machine (SVM), random forest (RF), and gradient boosting decision tree (GBDT) models were tested on the test set for 10 runs each. The average Jaccard index and F1 score of the models were obtained (Figure 9). It can be observed that different methods of handling the datasets have varying effects on different models, showing a general trend that the more balanced the sample distribution of the dataset, the higher the model’s lithology recognition scores.
Specifically, for the SVM model, the improvement in the balanced datasets is the most significant. However, the highest scores of both evaluation metrics for SVM were lower than those of RF and GBDT, indicating that the overall lithology recognition performance of the other two models is superior. RF is the highest-scoring model, and it performs better on more balanced datasets as well. The balanced datasets further unleash the potential of the RF model. In contrast, the performance of the GBDT model is not as outstanding as RF, but the performance of the GBDT model also improves on balanced datasets.

3.5. Optimal Combination of Algorithms and Models

The average values of the two evaluation metrics for predicting all five lithology classes using different balanced datasets were recorded over 10 test runs for each model, and corresponding box plots were drawn (Figure 10). Based on the results of all models on the five training datasets, the following important findings were observed: (1) Models performed poorly on the original dataset and the ENN under-sampled dataset, showing a large range and low average scores. This is because the original dataset has very few samples corresponding to the S and CS labels, and the ENN under-sampling algorithm further exacerbates the imbalance issue by removing samples. Therefore, accuracy did not significantly improve compared to the original dataset and, in some cases, even fell below the results of the models on the original dataset; (2) ADASYN and SMOTE are over-sampling algorithms for classes with fewer samples. They generally have a positive impact on the lithology recognition model, primarily reflected in the improvement of the median accuracy, and the results after improvement are relatively consistent across models; (3) The comprehensive balancing method of ADASYN+ENN has significant advantages over all previous results, with the highest average scores and the smallest range. The overall results are relatively stable. These findings are consistent with the performance of the three models, indicating that data balancing algorithms have a certain degree of universality in their impact on models.
The random forest model demonstrated the best overall performance, and it exhibited the most stable performance on the dataset processed using the ADASYN+ENN method. Therefore, ADASYN+ENN+RF is the best method combination in this study.
According to the test scores of the two metrics for predicting all five lithology classes on the test set using random forest (Table 4), on the original imbalanced dataset, the model’s prediction performance is relatively poor for the S and CS lithology classes, while higher scores are achieved when predicting the more abundant M and FS classes. Random forest, when trained on the balanced dataset obtained through the ADASYN+ENN method, learned from the added synthetic samples, leading to a significant improvement in prediction accuracy for the originally under-represented S and CS classes. Additionally, the prediction performance for the PS class also benefited from the ADASYN+ENN method, with scores showing some improvement. For the M and FS classes, the sampling method has almost no negative impact, and the model’s prediction scores remain at a high level.
Visualizing the lithology discrimination performance of the random forest model on the test data (Figure 11), the figure includes seven well log curves and three bar charts: lithology labels from core and mud logging, predictions from the random forest model trained on the original dataset, and predictions from the random forest model trained on the dataset balanced using the ADASYN+ENN method. From a global perspective, the results shown in “No Balancing” exhibit more noise, while the results shown in “ADASYN+ENN” demonstrate a noticeable noise reduction. This reflects the significant role of sample balancing methods in improving the model’s lithology discrimination results. Further observation reveals that in the “No Balancing” results, noise is mainly concentrated in the siltstone (S) and coarse sandstone (CS) reservoirs. This is because the original data lack samples corresponding to these classes, leading to unstable prediction results for them. After balancing processing, the model’s predictions in these classes become more stable, with reduced noise, resulting in more accurate predictions for classes such as siltstone and coarse sandstone.

4. Discussion

4.1. The Effectiveness of Enhanced Sampling Algorithms

When using well logging data to identify important oil-bearing reservoirs represented by sandstone, there is often a situation where the corresponding number of samples is far less than that of non-oil-bearing reservoirs represented by shale. The direct purpose of adopting enhanced sampling methods in this study is to synthesize simulated data for lithology labels corresponding to oil-bearing reservoirs in order to increase the sample capacity of the dataset.
From the results of the dataset processing with enhanced sampling methods (Figure 6), the ADASYN method synthesized a large number of accurate simulated samples, thereby improving the data quality of the high-quality reservoirs. Subsequently, the ENN algorithm filtered the real logging data and simulated samples, ensuring the reliability of the samples in the balanced dataset to a certain extent. According to the feature distributions before and after data balancing shown in Figure 7, it can be observed that the balanced dataset maintains the consistency of the overall distribution of features and samples. It can be said that the results of data balancing meet expectations and enrich the logging data samples of high-quality oil-bearing reservoirs.

4.2. Improvement of Lithology Discrimination Results

The enhanced sampling methods significantly improve the lithology recognition performance of each model. According to the results shown in Figure 10, the improvement from balancing processing is most significant for the SVM model. However, due to the limitations of the model itself, the lithology prediction score of SVM is consistently lower than those of random forest and GBDT. The random forest model performs the best after enhancement, especially in the prediction of siltstone and coarse sandstone, which lack original data; the balanced dataset helps improve the prediction score of random forest. The improvement in the GBDT model is relatively small, and the effect of the balanced dataset on GBDT is limited.
Among the different data sampling methods, the proposed ADASYN+ENN method shows the most significant improvement across models. From the results shown in Figure 7, the comprehensive method processes the dataset more effectively than the ADASYN or SMOTE over-sampling algorithms alone. Over-sampling algorithms cannot determine whether the baseline samples used to synthesize simulated data are errors generated during data collection and may increase noise in the dataset after synthesizing a certain number of samples. The key role of the ENN algorithm is to detect and remove potentially erroneous samples from the dataset. Therefore, the dataset processed by the comprehensive method combining ADASYN and ENN is more reliable.

4.3. Analysis of the Lithology Discrimination Effectiveness of the Optimal Method Combination

From the perspective of exploration and development applications, the practical value of machine learning models in lithology discrimination mainly depends on whether the model can accurately identify important reservoirs represented by various types of sandstone. A model with practical value cannot merely maintain high accuracy in identifying lithology labels with low oil content, such as mudstone. According to Figure 11, after supplementing simulated samples with the ADASYN algorithm and removing erroneous samples with the ENN algorithm, the random forest model learned the characteristics of siltstone and coarse sandstone more comprehensively, resulting in more accurate predictions for these lithology classes. In summary, after balancing the data, the discrimination accuracy for labels with fewer original samples improved significantly. Data balancing algorithms have a significant positive impact on these classes while maintaining high discrimination accuracy for classes with a larger number of samples in the original dataset.

5. Conclusions

Based on the need to address the relative scarcity of high-quality reservoir logging data in practical production, we propose a workflow that combines enhanced sampling algorithms with machine learning models to achieve efficient and accurate lithology discrimination from well logging data. Applying a data synthesis balancing method that combines the ADASYN over-sampling algorithm with the ENN under-sampling algorithm to the problem of intelligent lithology discrimination can assist machine learning models in accurately predicting and identifying reservoirs with insufficient logging curve samples, thereby providing reliable references for production practices such as reservoir interpretation.
The experimental results demonstrate the effectiveness of the proposed method. By comparing the changes in the dataset before and after data balancing, it was found that the comprehensive balancing method better maintains the original data’s feature distribution compared to other data over-sampling or under-sampling algorithms, thereby increasing the number of learnable samples for high-quality reservoir lithology classes. Through testing and comparing the lithology discrimination performance of the SVM, RF, and GBDT models on the comprehensive balanced dataset and the original imbalanced dataset, evaluation metrics such as the Jaccard index and F1 score indicate that the comprehensive balancing method effectively improves the lithology discrimination performance of various machine learning models. Among them, the RF model performs the best and can accurately identify lithology in oil-bearing reservoirs.
The enhanced sampling algorithms used in this study demonstrate strong versatility, as they not only enhance the performance of the well logging lithology discrimination model but also provide important methodological references for identifying shale reservoirs and predicting sweet spots and other related tasks. Future research should conduct a more comprehensive long-term evaluation of this workflow with more data and time resources.

Author Contributions

Data curation, J.L., F.T., A.Z. and W.C.; Funding acquisition, F.T.; Methodology, J.L.; Project administration, F.T.; Resources, F.T.; Software, W.Z.; Supervision, F.T.; Visualization, J.L., A.Z., W.Z. and W.C.; Writing—original draft, J.L.; Writing—review and editing, F.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Chinese National Key Research and Development Program, grant nos. 2019YFA0708301 and 2023YFB3905005; the Youth Innovation Promotion Association Foundation of the Chinese Academy of Sciences (2021063); the Strategic Priority Research Program of the Chinese Academy of Sciences, grant no. XDA14050101; and the China National Petroleum Corporation (CNPC) scientific research and technology development project, grant no. 2021DJ05. The APC was funded by the Institute of Geology and Geophysics, Chinese Academy of Sciences.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

No conflicts of interest exist in the submission of this manuscript, and the manuscript has been approved by all authors for publication.

Abbreviations

The abbreviations used in this article are summarized below.
Abbr.    Full Name
SMOTE    Synthetic Minority Over-sampling Technique
ADASYN   Adaptive Synthetic Sampling
ENN      Edited Nearest Neighbours
PCA      Principal Component Analysis
SVM      Support Vector Machine
RF       Random Forest
GBDT     Gradient Boosting Decision Tree

References

  1. Zhu, R.X.; Jin, Z.J.; Di, Q.Y.; Yang, C.C.; Chen, W.X.; Tian, F.; Zhang, W.X. Research and progress of Intelligent Drilling Technology System and related theories. Chin. J. Geophys.-Chin. Ed. 2023, 66, 1–15.
  2. Vásconez Garcia, R.G.; Mohammadizadeh, S.; Avansi, M.C.K.; Basilici, G.; Bomfim, L.d.S.; Cunha, O.R.; Soares, M.V.T.; Mesquita, Á.F.; Mahjour, S.K.; Vidal, A.C. Geological Insights from Porosity Analysis for Sustainable Development of Santos Basin’s Presalt Carbonate Reservoir. Sustainability 2024, 16, 5730.
  3. Liu, H.; Zhang, X.L.; Li, Z.L.; Weng, Z.P.; Song, Y.P. A borehole clustering based method for lithological identification using logging data. Earth Sci. Inform. 2024.
  4. Datta, D.; Singh, G.; Routray, A.; Mohanty, W.K.; Mahadik, R. Automatic Classification of Lithofacies with Highly Imbalanced Dataset Using Multistage SVM Classifier. In Proceedings of the IECON 2021—47th Annual Conference of the IEEE Industrial Electronics Society, Toronto, ON, Canada, 13–16 October 2021; pp. 1–6.
  5. Kang, Z.M.; Zhang, Y.; Qin, H.J.; Gan, W.; Chen, G. An Intelligent Inversion Method for Azimuth Electromagnetic Logging While Drilling Measurements. IEEE Access 2023, 11, 79285–79294.
  6. Li, Y.; Luo, M.; Ma, S.; Lu, P.; Ren, S. Massive Spatial Well Clustering Based on Conventional Well Log Feature Extraction for Fast Formation Heterogeneity Characterization. Lithosphere 2022, 2022, 7260254.
  7. Saporetti, C.M.; da Fonseca, L.G.; Pereira, E.; de Oliveira, L.C. Machine learning approaches for petrographic classification of carbonate-siliciclastic rocks using well logs and textural information. J. Appl. Geophys. 2018, 155, 217–225.
  8. Tian, F.; Di, Q.Y.; Jin, Q.; Cheng, F.Q.; Zhang, W.; Lin, L.M.; Wang, Y.; Yang, D.B.; Niu, C.K.; Li, Y.X. Multiscale geological-geophysical characterization of the epigenic origin and deeply buried paleokarst system in Tahe Oilfield, Tarim Basin. Mar. Petrol. Geol. 2019, 102, 16–32.
  9. Xing, Y.; Yang, H.; Yu, W. An Approach for the Classification of Rock Types Using Machine Learning of Core and Log Data. Sustainability 2023, 15, 8868.
  10. Zhang, J.L.; He, Y.B.; Zhang, Y.; Li, W.F.; Zhang, J.J. Well-Logging-Based Lithology Classification Using Machine Learning Methods for High-Quality Reservoir Identification: A Case Study of Baikouquan Formation in Mahu Area of Junggar Basin, NW China. Energies 2022, 15, 3675.
  11. Tian, F.; Luo, X.R.; Zhang, W. Integrated geological-geophysical characterizations of deeply buried fractured-vuggy carbonate reservoirs in Ordovician strata, Tarim Basin. Mar. Petrol. Geol. 2019, 99, 292–309.
  12. Tian, F.; Jin, Q.; Lu, X.B.; Lei, Y.H.; Zhang, L.K.; Zheng, S.Q.; Zhang, H.F.; Rong, Y.S.; Liu, N.G. Multi-layered ordovician paleokarst reservoir detection and spatial delineation: A case study in the Tahe Oilfield, Tarim Basin, Western China. Mar. Petrol. Geol. 2016, 69, 53–73.
  13. Tian, F.; Zhang, J.Y.; Zheng, W.H.; Zhou, H.; Ma, Q.H.; Shen, C.G.; Ma, Q.Y.; Lan, M.J.; Liu, Y.C. “Geology-geophysics-data mining” integration to enhance the identification of deep fault-controlled paleokarst reservoirs in the Tarim Basin. Mar. Petrol. Geol. 2023, 158, 106498.
  14. Ai, X.; Wang, H.; Sun, B. Automatic Identification of Sedimentary Facies Based on a Support Vector Machine in the Aryskum Graben, Kazakhstan. Appl. Sci. 2019, 9, 4489.
  15. Hou, L.; Ma, C.; Tang, W.Q.; Zhou, Y.X.; Ye, S.; Chen, X.D.; Zhang, X.X.; Yu, C.Y.; Chen, A.Q.; Zheng, D.Y.; et al. DDViT: Advancing lithology identification on FMI image logs through a dual modal transformer model with less information drop. Geoenergy Sci. Eng. 2024, 234, 212662.
  16. Kim, D.; Byun, J. Selection of Augmented Data for Overcoming the Imbalance Problem in Facies Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8019405.
  17. Zhang, L.; Geisler, T.; Ray, H.; Xie, Y. Improving logistic regression on the imbalanced data by a novel penalized log-likelihood function. J. Appl. Stat. 2022, 49, 3257–3277.
  18. Tian, F.; Wang, Z.X.; Cheng, F.Q.; Xin, W.; Fayemi, O.; Zhang, W.; Shan, X.C. Three-Dimensional Geophysical Characterization of Deeply Buried Paleokarst System in the Tahe Oilfield, Tarim Basin, China. Water 2019, 11, 1045.
  19. Tian, F.; Di, Q.Y.; Zhang, W.H.; Ge, X.M.; Zhang, W.X.; Zhang, J.Y.; Yang, C.C. A formation intelligent evaluation solution for geosteering. Chin. J. Geophys.-Chin. Ed. 2023, 66, 3975–3989.
  20. Geng, Z.X.; Liu, J.; Li, S.Y.; Yang, C.Y.; Zhang, J.; Zhou, K.B.; Tang, J.Z. Channel attention-based static-dynamic graph convolutional network for lithology identification with scarce labels. Geoenergy Sci. Eng. 2023, 223, 211526.
  21. Hossain, T.M.; Watada, J.; Aziz, I.A.; Hermana, M. Machine Learning in Electrofacies Classification and Subsurface Lithology Interpretation: A Rough Set Theory Approach. Appl. Sci. 2020, 10, 5940.
  22. Jiang, C.; Zhang, D.; Chen, S. Lithology identification from well-log curves via neural networks with additional geologic constraint. Geophysics 2021, 86, IM85–IM100.
  23. Zhou, K.; Zhang, J.; Ren, Y.; Huang, Z.; Zhao, L. A gradient boosting decision tree algorithm combining synthetic minority oversampling technique for lithology identification. Geophysics 2020, 85, WA147–WA158.
  24. Jiang, S.Y.; Sun, P.K.; Lyu, F.; Zhu, S.C.; Zhou, R.F.; Li, B.; He, T.H.; Lin, Y.J.; Gao, Y.N.; Song, W.D.; et al. Machine learning (ML) for fluvial lithofacies identification from well logs: A hybrid classification model integrating lithofacies characteristics, logging data distributions, and ML models applicability. Geoenergy Sci. Eng. 2024, 233, 212587.
  25. Martin, T.; Meyer, R.; Jobe, Z. Centimeter-Scale Lithology and Facies Prediction in Cored Wells Using Machine Learning. Front. Earth Sci. 2021, 9, 659611.
  26. Sun, Z.; Jiang, B.; Li, X.; Li, J.; Xiao, K. A Data-Driven Approach for Lithology Identification Based on Parameter-Optimized Ensemble Learning. Energies 2020, 13, 3903.
  27. Gao, L.; Xie, R.-H.; Xiao, L.-Z.; Wang, S.; Xu, C.-Y. Identification of low-resistivity-low-contrast pay zones in the feature space with a multi-layer perceptron based on conventional well log data. Pet. Sci. 2022, 19, 570–580.
  28. Srivardhan, V. Adaptive boosting of random forest algorithm for automatic petrophysical interpretation of well logs. Acta Geod. Geophys. 2022, 57, 495–508.
  29. Xie, Y.; Zhu, C.; Zhou, W.; Li, Z.; Liu, X.; Tu, M. Evaluation of machine learning methods for formation lithology identification: A comparison of tuning processes and model performances. J. Pet. Sci. Eng. 2018, 160, 182–193.
  30. Ouladmansour, A.; Ameur-Zaimeche, O.; Kechiched, R.; Heddam, S.; Wood, D.A. Integrating drilling parameters and machine learning tools to improve real-time porosity prediction of multi-zone reservoirs. Case study: Rhourd Chegga oilfield, Algeria. Geoenergy Sci. Eng. 2023, 223, 211511.
  31. Wang, Z.; Xie, K.; Wen, C.; Sheng, G.; He, J.; Tian, H. Multi-scale spatiotemporal feature lithology identification method based on split-frequency weighted reconstruction. Geoenergy Sci. Eng. 2023, 226, 211794.
  32. Ao, Y.; Zhu, L.; Guo, S.; Yang, Z. Probabilistic logging lithology characterization with random forest probability estimation. Comput. Geosci. 2020, 144, 104556.
  33. Dong, S.; Zeng, L.; Du, X.; He, J.; Sun, F. Lithofacies identification in carbonate reservoirs by multiple kernel Fisher discriminant analysis using conventional well logs: A case study in a oilfield, Zagros Basin, Iraq. J. Pet. Sci. Eng. 2022, 210, 110081.
  34. Ren, Q.; Zhang, H.; Zhang, D.; Zhao, X. Lithology identification using principal component analysis and particle swarm optimization fuzzy decision tree. J. Pet. Sci. Eng. 2023, 220, 111233.
  35. Al Hasan, R.; Saberi, M.H.; Riahi, M.A.; Manshad, A.K. Electro-facies classification based on core and well-log data. J. Pet. Explor. Prod. Technol. 2023, 13, 2197–2215.
  36. Mishra, A.; Sharma, A.; Patidar, A.K. Evaluation and Development of a Predictive Model for Geophysical Well Log Data Analysis and Reservoir Characterization: Machine Learning Applications to Lithology Prediction. Nat. Resour. Res. 2022, 31, 3195–3222.
  37. Zheng, D.; Liu, S.; Chen, Y.; Gu, B. A Lithology Recognition Network Based on Attention and Feature Brownian Distance Covariance. Appl. Sci. 2024, 14, 1501.
  38. Luo, K.; Wang, G. Research on imbalanced data classification based on L-SMOTE and SVM. Comput. Eng. Appl. 2019, 55, 55–62.
  39. Li, G.; Liu, S.; Zhang, Y.; Zheng, Y.; Hong, Y.; Zhou, X. Synthetic Method of Label—Balancing Samples for Classifier Learning. Comput. Appl. Softw. 2022, 39, 230–237.
  40. He, Y.; Chen, J.; Xu, H.; Huang, Z.; Yin, J. Data Generation Model-based Synthetic Sample Imputation Method. J. Syst. Simul. 2023, 35, 1948–1964.
  41. Yang, J.; Wang, M.; Li, M.; Yan, Y.; Wang, X.; Shao, H.; Yu, C.; Wu, Y.; Xiao, D. Shale lithology identification using stacking model combined with SMOTE from well logs. Unconv. Resour. 2022, 2, 108–115.
  42. Deng, C.; Pan, H.; Fang, S.; Konaté, A.A.; Qin, R. Support vector machine as an alternative method for lithology classification of crystalline rocks. J. Geophys. Eng. 2017, 14, 341–349.
  43. Merembayev, T.; Kurmangaliyev, D.; Bekbauov, B.; Amanbek, Y. A Comparison of Machine Learning Algorithms in Predicting Lithofacies: Case Studies from Norway and Kazakhstan. Energies 2021, 14, 1896.
  44. Ramos, M.M.; Bijani, R.; Santos, F.V.; Lupinacci, W.M.; Freire, A.F.M. Analysis of alternative strategies applied to Naive-Bayes classifier into the recognition of electrofacies: Application in well-log data at Reconcavo Basin, North-East Brazil. Geoenergy Sci. Eng. 2023, 227, 211889.
  45. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  46. He, H.B.; Bai, Y.; Garcia, E.A.; Li, S.T. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
  47. Wilson, D.L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Trans. Syst. Man Cybern. 1972, SMC-2, 408–421.
  48. Yan, T.; Xu, R.; Sun, S.-H.; Hou, Z.-K.; Feng, J.-Y. A real-time intelligent lithology identification method based on a dynamic felling strategy weighted random forest algorithm. Pet. Sci. 2024, 21, 1135–1148.
  49. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  50. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
  51. SY/T 5434-2018; Clastic Rock Particle Size Analysis Method. National Energy Administration: Beijing, China, 2018.
Figure 1. Schematic diagram of logging lithology discrimination with enhanced sampling methods.
Figure 2. The workflow of the SMOTE algorithm.
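For readers who wish to reproduce the over-sampling step, the sketch below illustrates the SMOTE interpolation idea using the open-source imbalanced-learn package; the package choice, the toy arrays, and the k_neighbors setting are illustrative assumptions rather than the study's exact configuration.

```python
# A minimal SMOTE sketch using imbalanced-learn; the toy arrays below
# stand in for the logging dataset (an assumption, not the study's data).
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))        # 7 features, mirroring the 7 log curves
y = np.array([0] * 260 + [1] * 40)   # imbalanced toy labels

# SMOTE synthesizes minority samples by interpolating between a minority
# point and one of its k nearest minority-class neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # minority class grown to match majority
```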
Figure 3. The workflow of the ADASYN algorithm.
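ADASYN exposes the same fit_resample interface; a minimal sketch, again assuming imbalanced-learn and toy data, highlighting that ADASYN places more synthetic points near minority samples that are harder to learn:

```python
# ADASYN sketch (imbalanced-learn, an assumed implementation): unlike
# SMOTE's uniform interpolation, ADASYN generates more synthetic points
# for minority samples surrounded by majority-class neighbors.
import numpy as np
from imblearn.over_sampling import ADASYN

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = np.array([0] * 260 + [1] * 40)

# n_neighbors sets the neighborhood used to estimate each minority
# sample's learning difficulty.
X_res, y_res = ADASYN(n_neighbors=5, random_state=42).fit_resample(X, y)
```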
Figure 4. The workflow of the ENN algorithm.
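ENN works in the opposite direction, removing samples rather than synthesizing them; a hedged sketch using imbalanced-learn's EditedNearestNeighbours (an assumed implementation, not necessarily the study's):

```python
# ENN sketch (imbalanced-learn): a sample is dropped when its class
# disagrees with the majority vote of its k nearest neighbors, which
# cleans noisy and boundary points. Toy data as above (an assumption).
import numpy as np
from imblearn.under_sampling import EditedNearestNeighbours

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = np.array([0] * 260 + [1] * 40)

X_clean, y_clean = EditedNearestNeighbours(n_neighbors=3).fit_resample(X, y)
```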
Figure 5. Schematic diagram of candidate classifiers.
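For concreteness, the three candidate classifiers in Figure 5 can be instantiated as follows; scikit-learn is an assumed implementation here, and the default settings shown are placeholders for the tuned values reported in Table 3:

```python
# The three candidate classifiers (Figure 5), instantiated with
# scikit-learn; defaults are placeholders for the tuned values in Table 3.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

candidates = {
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(),
    "GBDT": GradientBoostingClassifier(),
}
```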
Figure 6. The number of samples before and after data balancing.
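The before/after comparison summarized in Figure 6 amounts to chaining the over-sampler and the cleaner and counting labels at each stage; a sketch under the assumption that the combined scheme applies ADASYN first and ENN second:

```python
# Chained ADASYN + ENN sketch reproducing the spirit of Figure 6's
# before/after comparison; library and toy data are assumptions.
import numpy as np
from collections import Counter
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import EditedNearestNeighbours

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = np.array([0] * 260 + [1] * 40)

X_os, y_os = ADASYN(random_state=42).fit_resample(X, y)                 # over-sample
X_bal, y_bal = EditedNearestNeighbours(n_neighbors=3).fit_resample(X_os, y_os)  # clean
print("raw:", Counter(y), "ADASYN:", Counter(y_os), "ADASYN+ENN:", Counter(y_bal))
```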
Figure 7. Principal component frequency densities.
Figure 8. The results of parameter tuning for the SVM, RF, and GBDT models.
Figure 9. Model testing scores.
Figure 10. Box plots of the evaluation metrics for models on datasets processed with different data-balancing methods.
Figure 11. Visualization of the lithology discrimination results of the RF model.
Table 1. Distribution of samples for each lithology label in the dataset.

Lithology | Label | Count | Proportion (%)
Mudstone | M | 13,307 | 46.26
Fine Sandstone | FS | 6830 | 23.75
Siltstone | S | 4716 | 16.39
Pebbled Sandstone | PS | 2855 | 9.93
Coarse Sandstone | CS | 1055 | 3.67
Table 2. Description of standardized well logging curves.

Statistic | GR | CAL | RD | RS | AC | DEN | CNL
count | 28,763 | 28,763 | 28,763 | 28,763 | 28,763 | 28,763 | 28,763
mean | 79.00 | 10.23 | 4.00 | 3.49 | 82.62 | 2.36 | 23.32
std | 15.36 | 1.27 | 2.58 | 1.81 | 10.26 | 0.12 | 6.71
min | 28.13 | 8.61 | 0.20 | 0.83 | 51.80 | 1.83 | 4.51
25% | 67.90 | 9.28 | 2.62 | 2.36 | 76.66 | 2.29 | 18.77
50% | 82.64 | 9.88 | 3.54 | 3.16 | 80.12 | 2.36 | 21.76
75% | 90.87 | 10.84 | 4.78 | 4.22 | 85.22 | 2.44 | 25.96
max | 125.47 | 18.23 | 46.70 | 43.66 | 144.00 | 2.65 | 57.76
Table 3. The tuned parameters for SVM, RF, and GBDT models trained on different datasets.

Model | Parameter | Search Range | No Balancing | SMOTE | ADASYN | ENN | ADASYN+ENN | SMOTE+ENN
SVM | C | 1–50 | 5 | 5 | 5 | 10 | 50 | 50
SVM | γ | 0.01–1.0 | 0.02 | 0.05 | 0.5 | 0.02 | 0.02 | 0.02
RF | min_samples_leaf | 2–10 | 2 | 3 | 3 | 3 | 2 | 3
RF | max_features | 1–6 | 3 | 3 | 3 | 3 | 3 | 3
GBDT | min_samples_leaf | 2–10 | 3 | 2 | 3 | 2 | 2 | 3
GBDT | learning_rate | 0.02–0.5 | 0.2 | 0.2 | 0.5 | 0.25 | 0.2 | 0.2
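The search ranges in Table 3 map directly onto a cross-validated grid search; the sketch below tunes the SVM as an example, with scikit-learn assumed and the specific grid points chosen for illustration within the stated ranges:

```python
# Cross-validated grid-search sketch (scikit-learn) over the SVM ranges
# in Table 3 (C in 1-50, gamma in 0.01-1.0); grid points and toy data
# are illustrative assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = np.array([0] * 260 + [1] * 40)

param_grid = {"C": [1, 5, 10, 25, 50],
              "gamma": [0.01, 0.02, 0.05, 0.1, 0.5, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="f1_macro")
search.fit(X, y)
print(search.best_params_)  # cf. Table 3: e.g. C=50, gamma=0.02 for ADASYN+ENN
```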
Table 4. Comparison of lithology discrimination scores for the random forest model: original training set vs. ADASYN+ENN balanced dataset.

Score | Dataset | M | S | FS | CS | PS
Jaccard Index | Raw | 0.910 | 0.802 | 0.841 | 0.820 | 0.906
Jaccard Index | ADASYN+ENN | 0.910 | 0.847 | 0.884 | 0.861 | 0.913
F1 Score | Raw | 0.955 | 0.785 | 0.893 | 0.823 | 0.942
F1 Score | ADASYN+ENN | 0.955 | 0.901 | 0.942 | 0.939 | 0.949
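The per-lithology values in Table 4 are class-wise Jaccard and F1 scores; a minimal scoring sketch, assuming scikit-learn, a random forest with the Table 3 leaf/feature settings, and toy multiclass data:

```python
# Per-class Jaccard index and F1 score sketch (scikit-learn), the two
# metrics behind Table 4; average=None returns one score per class
# (per lithology in the study). Toy data is an assumption.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import jaccard_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 7))
y = rng.integers(0, 3, size=600)  # three toy classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(min_samples_leaf=2, max_features=3, random_state=42)
y_pred = rf.fit(X_tr, y_tr).predict(X_te)
print(jaccard_score(y_te, y_pred, average=None))  # one value per class
print(f1_score(y_te, y_pred, average=None))
```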