Article

Classification of Logging Data Using Machine Learning Algorithms

by Ravil Mukhamediev 1,2, Yan Kuchin 1,2,*, Nadiya Yunicheva 2,3, Zhuldyz Kalpeyeva 1, Elena Muhamedijeva 2, Viktors Gopejenko 4,5 and Panabek Rystygulov 1,*
1 Institute of Automation and Information Technologies, Satbayev University (KazNRTU), 22 Satbayev Street, Almaty 050013, Kazakhstan
2 Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty 050013, Kazakhstan
3 Institute of Automation and Information Technologies, Almaty University of Energy and Communications, Baitursynov Str., 126/1, Almaty 050013, Kazakhstan
4 International Radio Astronomy Centre, Ventspils University of Applied Sciences, LV-3601 Ventspils, Latvia
5 Department of Natural Science and Computer Technologies, ISMA University of Applied Sciences, LV-1019 Riga, Latvia
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7779; https://doi.org/10.3390/app14177779
Submission received: 4 August 2024 / Revised: 29 August 2024 / Accepted: 30 August 2024 / Published: 3 September 2024
(This article belongs to the Special Issue Application of Artificial Intelligence in the Mining Industry)

Abstract: Log data analysis plays an important role in the uranium mining process. Automating this analysis using machine learning methods improves the results and reduces the influence of the human factor. In particular, the identification of reservoir oxidation zones (ROZs) using machine learning allows a more accurate determination of ore reserves, and correct lithological classification allows the optimization of the mining process. However, training and tuning machine learning models requires labeled datasets, which are hardly available for uranium deposits. In addition, in problems of interpreting logging data using machine learning, data preprocessing is of great importance, that is, a transformation of the original dataset that improves the classification or prediction result. This paper describes a uranium well log (UWL) dataset generated using floating data windows and designed to solve the problems of identifying ROZs and lithological classification (LC) in sandstone-type uranium deposits. Comparative results of solving these problems using classical machine learning methods and ensembles of machine learning algorithms are presented. It is shown that an increase in the size of the floating data window can improve the quality of ROZ classification by 7–9% and of LC by 6–12%. As a result, the best quality indicators for these problems were obtained, f1_score_macro = 0.744 (ROZ) and accuracy = 0.694 (LC), using the light gradient boosting machine and extreme gradient boosting, respectively.

1. Introduction

The host rocks of uranium deposits in Kazakhstan have a complex lithological structure that needs to be accurately determined for the application of the in situ leaching (ISL) method, which is an environmentally friendly way of uranium mining [1]. ISL accounts for about 48.3% of global uranium production and almost all of the production in Kazakhstan [2]; therefore, the task of identifying rock characteristics is very important. Machine learning techniques can help with this task by automating some aspects of it and reducing human errors. The rock characteristics are obtained from geophysical research of boreholes (GRB), which differs for exploration and production wells. The standard GRB methods in Kazakhstan uranium fields include apparent resistance (AR) logging and spontaneous polarization (SP) potential for the lithological classification and assessment of the filtration properties of host rocks, as well as gamma ray (GR) logging for estimating uranium content based on the gamma radiation of radium and its decay products, with a conversion factor using the radioactive equilibrium coefficient. The log data consist of physical parameters measured inside the borehole at 10 cm depth intervals, and they are displayed as graphs (curves) for expert evaluation. However, current automatic interpretation of electric logs (AR and SP) uses only AR data without considering other data sources, such as other logs or information from nearby boreholes. This results in significant manual corrections by the interpreting engineer and also prevents correct lithologic interpretation when the AR curve is distorted.
The application of machine learning is one of the ways to improve the quality of borehole log interpretation, from the 1990s [3,4,5,6] to the present day [7]. Most studies on using ML for log data interpretation focus on the lithology classification of oil and gas fields [8,9,10,11] and permeability prediction [12,13,14,15]. ML methods are employed in uranium deposits for such tasks as lithologic classification [16], stratigraphy [17], the estimation of the filtration properties of host rocks [1], and the evaluation of the impact of expert labeling of logging data [18]. Oil well logging and uranium well logging differ in many ways. Oil well logging needs to identify layers at least one meter thick in deep wells, whereas uranium well logging needs to identify layers as thin as 20 cm in shallow wells. Uranium wells also have smaller diameters, which limits the types of downhole tools and logging methods that can be used. Electrical logging methods are common, but they are not very accurate. Core sampling is also difficult because the rocks are loose and can be destroyed by in situ leaching.
One of the most important tasks in uranium mining by the in situ leaching (ISL) method is determining the uranium concentration, which can be carried out either directly by neutron logging methods or indirectly by the gamma radiation of uranium decay products. The first approach is more reliable, while the second one is cheaper and faster, and it is used in uranium mining in Kazakhstan. However, in the reservoir oxidation zones (ROZs), this relationship breaks down: uranium may be absent even though the gamma radiation of uranium decay products is present.
Errors in ROZ identification, or ignoring ROZs altogether, lead to large financial losses, while their correct delineation is a real challenge. There is no formalized solution algorithm for this problem. The task of ROZ identification based on the data of the standard set of geophysical studies used in uranium mining in Kazakhstan is particularly relevant and complex. The peculiarity of the task is that none of the registered parameters are directly related to ROZs; all of them are only indirect signs and make sense only when analyzing their interrelation and spatial distribution. The use of machine learning is promising for such weakly formalized tasks.
The problem of identifying the reservoir oxidation zones (zones with disturbed radioactive equilibrium) using machine learning methods based on logging data is addressed in this paper. This problem was first considered in [19].
The second task that is solved when interpreting logging data is the classification of lithological types of host rocks, which directly affects the technological processes of production and, ultimately, economic indicators.
A formalized algorithm for lithological interpretation based on the AR curve shape was developed over 50 years ago and described in [20]. There is software implementing this algorithm. When compared with the core, the program shows an accuracy of about 0.4. When performing manual interpretation, experts rely not only on the AR curve shape, but also on data from other logging, neighboring wells, and accumulated experience at the field. Due to this, the average accuracy of experts when comparing with the core is about 0.6. The use of machine learning makes it possible to obtain an accuracy of automatic interpretation comparable to the accuracy of experts’ assessments.
Despite the importance of these tasks, there is a lack of research on using ML for identifying ROZs and lithological classification in uranium deposits. One of the limits is the deficiency of labeled well logs for this type of research. Here, we consider the application of classical machine learning methods to solve these problems and describe the dataset that allows for training and evaluating ML models. We also show how a relatively simple feature engineering method called a floating window can significantly improve the classification quality when using a range of classical machine learning models.
The main contributions of this paper are as follows:
  • The main results of well log data interpretation using machine learning for different types of deposits are presented;
  • A uranium well log (UWL) dataset is presented and described, allowing us to set up machine learning methods for ROZ detection and lithological classification;
  • This paper presents the state-of-the-art result in solving ROZ detection and lithological classification tasks obtained using the UWL dataset;
  • The influence of floating window size on the quality of classification is investigated.
This paper consists of the following sections:
Section 2 discusses the main results of machine learning application in processing the well logging data for various types of minerals and details the features of the problem for uranium deposits.
Section 3 describes the method of forming the UWL dataset and its analysis using machine learning methods to solve the problems of the lithological classification and identification of reservoir oxidation zones.
Section 4 presents the classification results.
Section 5 briefly discusses the obtained results.
The conclusion summarizes the results of the analysis and its limitations and formulates the tasks of further research.

2. Related Works

The interpretation of well logging data is one of the important stages in assessing the lithological structure of ore-bearing horizons, predicting the productivity of a deposit, and selecting a drilling location for mining. One of the frequently studied issues in such interpretation is the problem of choosing the most effective machine learning methods. Both classical and modern machine learning methods are widely used [21], including deep neural networks (see Figure 1).
At the same time, supervised learning (SL) methods are favored under appropriate conditions, since they show better results compared to unsupervised learning (UL) [22].
According to the information available to the authors, the main models used for interpreting well logging data are classical models and ensembles of machine learning algorithms based on decision trees: Linear Regression (LR), Logistic Regression (LogR), k-nearest neighbors (kNNs), decision tree (DT), artificial neural network or multilayer perceptron (ANN or MLP), Naive Bayes Classifier (NB), support vector machines (SVMs), random forest (RF), extreme gradient boosting (XGB), light gradient boosting machine (LGBM) (Table 1). Deep learning models are also used: Deep Feed Forward Neural Network (DFFNN), long short-term memory (LSTM), Convolution Neural Network (CNN), etc.
To assess the quality of the models, the following standard metrics are widely used: accuracy (Acc), the harmonic measure f1 score (f1), the coefficient of determination (R2), and the linear correlation coefficient (Rp). In some cases, the area under the receiver operating characteristic curve (ROC AUC) is used. The main metrics of regression and classification models are presented in Appendix A.
Table 2 shows the main results obtained in the course of solving problems of processing well logging data with employment of machine learning methods.
Types of problems solved in Table 2:
  • Lithological classification.
  • The identification of reservoirs.
  • Stratigraphic classification.
  • The estimation of rock permeability.
  • The identification of reservoir oxidation.
The analysis shows that the majority of works on the topic of logging data interpretation are devoted to reservoir identification and lithological classification at oil and gas fields. The characteristic features of this task are the large thickness of the identified layers (tens of meters) and their stability within the horizon. Sometimes they are even assigned alphanumeric designations, for example, in [41]. Moreover, at least 6–7 geophysical parameters recorded during logging are usually analyzed, including electrical, neutron, and gamma methods. All these factors allow for achieving the classification accuracy at oil and coal deposits of more than 80–90%. The task of interpreting the logging data at sandstone-type uranium deposits has a number of significant differences:
  • The technology of extraction dictates the necessity of identifying impermeable layers with a thickness of 20 cm (this is a requirement of the regulatory documentation) within the ore-bearing horizon with a thickness of 60–80 m. In some cases, one identified layer in petroleum geophysics corresponds to the entire interpreted ore-bearing horizon in uranium geophysics [1].
  • The set of recorded logging data is much smaller compared to oil and gas fields. In fact, only fairly simple variations in electrical logging (AR, SP, IL) are available. Gamma logging cannot be used for lithological classification because the contribution to the recorded gamma radiation from radium and its decay products is two orders of magnitude greater than that from lithology. Of the neutron methods, only fission neutron logging is used, aimed at the direct determination of uranium [1].
  • Difficulties with extracting and tying the core due to the characteristics of the section (sand and clay).
  • Use of experts’ assessments, which contain a significant degree of subjectivity [16].
  • The regulatory framework, interpretation methods, and standard set of logging methods were inherited from the USSR and underwent only minor changes in Kazakhstan.
  • There are no publicly available datasets that allow for a comparative analysis of classification and forecasting methods based on well logging data from uranium deposits.
The above features result in the fact that the problem of classification of logging data of uranium deposits is solved with significantly lower accuracy. For example, the accuracy of determination of rock permeability for oil deposits is more than 0.9, while for uranium deposits, it does not exceed 0.72; the accuracy of lithological classification is about 0.9 and 0.6, respectively. There are also problems that are specific to uranium deposits, and one of them is the problem of identification of reservoir oxidation zones.
In this regard, this paper, first, describes and proposes a set of well log data from a uranium deposit that can be used to solve at least two problems: the identification of reservoir oxidation zones and lithological classification. Second, methods for its interpretation using classical algorithms and ensembles of machine learning models to solve these two problems are discussed.

3. Method

This study includes two stages (see Figure 2).
First, the log data are preprocessed and data dumps are generated (stages 1 and 2 in Figure 2). At this stage, the data of the wells that contain the required logging information are manually selected (1). Each well is represented by a table file. The resulting set of table data is processed so that input data and target columns are formed from the entire set of columns. Then, the data of each well are converted into a floating data window format and saved as dataframes. An expert interprets logging data not by the absolute values of the logging curve, but by its shape; of particular importance are the minima, maxima, and inflection points.
The expert classifies the rock interlayers with a minimum thickness of 0.1 m. At the same time, the distance between the recording electrodes of the logging tool is 1 m; therefore, it is advisable to use sections of the curve with a depth of at least one meter (10 points) for classification. In practice, the expert can consider the shape of a curve of a larger size. Therefore, the features of the logging tool and the data acquisition process suggest the possibility of using floating data windows.
In other words, floating data windows allow the shape of the well log to be considered, thereby simulating the interpretation process performed by an expert.
The test and training datasets are formed from different dataframes. To ensure cross-validation, the dataframes are split into test and training ones many times. The resulting test and training sets form dumps written to a disk. A set of such dumps with different floating window sizes and different splits into training and test sets form a UWL dataset.
The UWL dataset is available at the link in the Data Availability Statement section.
Secondly, the dumps are loaded and the classifiers are trained and evaluated to solve the problems of identifying reservoir oxidation zones (ROZs) and lithological classification (LC) (3 and 4 in Figure 2).
Cross-validation can be performed, or the data dumps can be used separately. The Data Availability Statement contains a link to the data dumps and an example program for processing them.
Further, the above-described processing stages are considered in detail.

3.1. Data Preprocessing

Using 1000 exploration wells with core sampling from one of Kazakhstan’s uranium ore fields, we performed the log data processing tasks. We manually created a special set of wells for ROZ studies, selected input features, and generated floating data windows for each well. We split the set of wells into training, test, and validation sets and used them in a loop of machine learning model selection and evaluation. We applied the best model to interpret the validation dataset and to display the results.
We built a special dataset for ROZ and lithology classification from the wells of one of Kazakhstan’s uranium ore deposits, with three classes of wells: 42 wells without ROZ (LOW_ROZ), 84 wells with ROZ covering 5–50% of the ore-bearing horizon (MEDIUM_ROZ), and 42 wells with ROZ covering more than 50% of the ore-bearing horizon (HI_ROZ).
For the analysis, we used AR (Ohm*m), SP (mV), and GR (µR/h) log data for each well, recorded in 10 cm steps; lithologic intervals (upper boundary, rock code, permeability code, filtration coefficient, lower boundary); and wellhead coordinates (X, Y, Z). Since the rocks’ physical properties (especially the AR recording level) change significantly in different stratigraphic horizons, we used only logging data within one horizon (ore-bearing horizon number 2). We identified the ROZ zones in exploratory wells through lab studies and marked them with a geochemical code of 8. We used these indicators from several tables of well logging data for further processing.
Therefore, the dataset utilizes 24 input variables and two types of target variables, applied separately depending on the problem to be solved.
Input values [19]:
RP—ore intersection (value 2 is used for ore-bearing horizon);
Sn—well number (used only at the stage of data input and search for the nearest wells);
Depth—depth of 10 cm measurement zone (m);
GR—gamma ray log (µR/h);
AR—apparent resistance (Ohm*m);
SP—spontaneous polarization potential (mV);
X, Y, Z—well coordinates.
Target values:
lit—rock type by permeability (target value 1) (1—permeable rocks, 2—impermeable rocks, 8—oxidized rocks (ROZ));
LIT1—lithologic code (target value 2);
LIT2—geochemical type of rock.
The target variable “lit” is used to solve the problem of reservoir oxidation zone identification. The variable LIT1 can be used to solve the problem of lithological classification. Lithological type codes (class numbers) are given in Table 3.
The input data are reformatted as floating data windows [16,18] (Figure 3).
Using floating data windows allows for increasing the size of the input data vector by a multiple of the window size. For example, if the size of the input vector is n = 11, then using a window of size h = 3 gives the model input vector n’ = h*n (n’ = 33). Training and test datasets were generated from the input data. At the same time, in order to avoid data leakage and, as a consequence, obtaining an overestimate of the classification quality, the training and test sets used the data from different wells.
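The window expansion described above can be sketched as follows; the function name and toy values are illustrative and are not part of the released dataset code:

```python
import numpy as np
import pandas as pd

def make_windows(df, feature_cols, up_w, dn_w):
    # Flatten the floating window of rows [i - up_w, i + dn_w] around each
    # measurement i into one feature vector of length (up_w + dn_w + 1) * n
    X = df[feature_cols].to_numpy()
    rows = [X[i - up_w : i + dn_w + 1].ravel()
            for i in range(up_w, len(X) - dn_w)]
    return np.asarray(rows)

# Toy example: 11 depth steps of n = 3 readings (GR, AR, SP), window h = 3
logs = pd.DataFrame(np.arange(33.0).reshape(11, 3), columns=["GR", "AR", "SP"])
W = make_windows(logs, ["GR", "AR", "SP"], up_w=1, dn_w=1)
print(W.shape)  # each flattened row carries h * n = 3 * 3 = 9 features
```

Applying such a function per well, before splitting wells into training and test sets, preserves the no-leakage property described above.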
The dataset was divided into test and train sets in such a way that of the N wells available in a particular experiment, 0.1*N was the test set and 0.9*N was the train set. The resulting datasets are recorded in dumps for storage on a disk. The data dump is written as a file with the extension txt, the name of which defines the parameters of the dataset; for example, 96_wells_up5_dn150_t5_n1 means that
  • ‘up5’:up_w = 5 (size of the top of the data window).
  • ‘dn150’:dn_w = 150 (size of the bottom of the data window).
  • ‘t5’:test_part = 5 (which part of the dataset will be the test part. In this case, it is the 5th of 10 possible test parts).
  • ‘n1’:norm = 1 (whether the input parameters were normalized; 1—normalized, 0—not normalized).
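Given this naming convention, the parameters of a dump can be recovered from its file name; the helper below is a hypothetical sketch, not part of the published tooling:

```python
import re

def parse_dump_name(name):
    # Extract dataset parameters from a dump name such as
    # '96_wells_up5_dn150_t5_n1' (naming convention described above)
    m = re.match(r"(\d+)_wells_up(\d+)_dn(\d+)_t(\d+)_n([01])", name)
    if not m:
        raise ValueError(f"unexpected dump name: {name}")
    n_wells, up_w, dn_w, test_part, norm = map(int, m.groups())
    return {"n_wells": n_wells, "up_w": up_w, "dn_w": dn_w,
            "test_part": test_part, "norm": norm}

print(parse_dump_name("96_wells_up5_dn150_t5_n1"))
```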
Figure 4 shows an example of a dataset framed as a pandas dataframe with floating data window sizes up_w = 5 and dn_w = 5.
In this case, a total of 11 lines of initial logging data values numbered from 0 to 10 are used. Each line of logging data is represented by the depth of the layer depth, which corresponds to RP and logging data: GR, AR, SP.
In addition, the wellhead coordinates X, Y, and Z in conventional units, the depth of the formation for which the classification is performed—Depth_lit, the depth difference from the classified formation to the upper boundary of the window in meters—Diff_depth, and the direction of data window formation—Wtype (0 from top to bottom, 1 from bottom to top), are recorded. Note that within the entire generated dataset, Wtype = 0.
For each ratio of up_w and dn_w, 10 datasets are generated for 10 combinations of test and training wells. In addition, each of the ten datasets is prepared both with the normalization of the original values (n1) and without normalization (n0). For example, the data dump 96_wells_up5_dn25_t3_n0.txt contains non-normalized values, while 96_wells_up5_dn25_t3_n1.txt contains fully normalized values.
The data dump named 96_wells_up0_dn0_…txt contains a minimum number of input parameters without using a floating data window, so that each log dataset corresponds to a rock permeability value (lit) or lithologic code (LIT1). Although these datasets give, as shown below, a relatively low classification result, we have included them for those researchers who wish to convert these datasets into any other form.

3.2. Training and Evaluating Machine Learning Models

The data dump program reads the specified folder containing log data files in xls format. The program generates dataframes of training and test sets of wells from all files. This division into training and test sets is carried out proportionally to the value of the proportion variable. For example, if proportion = 0.1, then it is possible to create a set of test wells that will be represented by different parts of the whole set of wells, each time in the amount of 10% of the whole set. The training set is formed from the remaining 90% of the wells.
Therefore, it is possible to implement k-fold cross-validation so that the data of one well are either only in the training set or only in the test set and there is no data leakage. The generated dataframes as data dumps are written to \ROZ_dumps_2024. The data dump name describes the number of wells used for dump generation (96), the floating window size, the number of the test part from the set S, and the normalization flag. For data classification, we can use a wide range of classifiers available as part of sklearn frameworks [53,54] and PyCaret [55,56] or installed separately.
Preliminary experiments with SVM showed that it works dozens of times slower than the slowest of the algorithms listed in Table 4. By its nature, SVM must take into account all training data through the kernel (exp(−g||x − x′||^2), where g is a parameter greater than 0), and the cost of computing it depends on the number of input parameters (n_features) and training examples (n_samples). In other words, the time complexity of the algorithm lies between the quadratic O(n_features × n_samples^2) and cubic O(n_features × n_samples^3) cases. As a result, the use of SVM with a window depth greater than 100 is extremely difficult due to a sharp slowdown in operation. The libSVM library allows the speed of SVM with a nonlinear kernel to be increased by enlarging the cache size, but even in this case, SVM remains hundreds of times slower than LightGBM.
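The cache-size remedy mentioned above is exposed in scikit-learn's libSVM wrapper via the cache_size parameter (in MB); the snippet below is only an illustration on synthetic data and does not reproduce the timing comparison:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# gamma corresponds to g in exp(-g * ||x - x'||^2); cache_size enlarges the
# libSVM kernel cache, which mitigates but does not remove the roughly
# quadratic-to-cubic scaling in the number of training samples
clf = SVC(kernel="rbf", gamma="scale", cache_size=500).fit(X, y)
print(clf.score(X, y))
```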
The application of deep learning in many problems gives good results, but preliminary experiments on well logging data from uranium deposits [16] did not reveal any advantages of deep learning over decision tree ensembles (XGBoost, LightGBM). In our opinion, the reason for this is that the interpretation is performed on inaccurate data based either on experts’ estimates of different classes or on core data, which are also not 100% reliable in sandy soils. In addition, the application of deep learning usually requires the additional transformation of the input data. Therefore, the application of deep learning, primarily convolutional networks, is considered as a task for future research with the appropriate transformation of the input dataset and conducting large-scale computational experiments.
For these reasons, the following classifiers were used in the process of experiments, taking into account the preliminary experience [19]: LGBM, RF, XGB, kNN, DT, MLP, NB, support vector machines with linear kernel (linear SVMs).
The UWL dataset was used to solve the lithology classification (LC) and ROZ identification (ROZ) problems. In selecting the main metric for evaluating the classification quality, we considered the following factors:
  • The dataset is not balanced, i.e., the number of objects of different classes differs: class 1: 35,812; class 2: 6876; class 8: 28,073 (ROZ task).
  • In the ROZ classification task, objects of all three classes are equally important for the researcher.
For these reasons, the f1_score macro-average (f1_score_macro) was chosen as the main quality assessment metric in the ROZ classification problem (the classification quality assessment metrics are presented in Appendix A). The macro-average means that f1_score is calculated for each class separately and then averaged.
  • In the lithological classification (LC) problem, the correct classification of low-permeability rocks (clay, code 7) and the overall accuracy of the classifier are important.
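The macro-averaged f1 score described above can be computed directly with scikit-learn; the tiny label vectors below are illustrative, using the class codes 1, 2, and 8 from the ROZ task:

```python
from sklearn.metrics import f1_score

# Three-class ROZ-style labels: 1 (permeable), 2 (impermeable), 8 (oxidized)
y_true = [1, 1, 2, 8, 8, 8, 1, 2]
y_pred = [1, 1, 2, 8, 1, 8, 1, 8]

# macro-averaging computes f1 per class and then averages with equal weight,
# so the rare class 2 counts as much as the frequent classes
macro = f1_score(y_true, y_pred, average="macro")
print(round(macro, 3))  # about 0.73 for these toy labels
```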
The computational experiments were carried out on a computer equipped with an Intel(R) Core(TM) i7-10TH processor with 64 GB of RAM and a discrete Nvidia Quadro T2000 video card.

4. Results

4.1. ROZ Identification

To evaluate the quality of ROZ classification, preliminary experiments were conducted with the mentioned range of machine learning models on the basis of a single dataset (Table 4). The division into training and test parts was identical in all computational experiments.
SVM classifiers took tens or even hundreds of times longer to train than others (Appendix B). The best value of f1_score_macro was demonstrated by LightGBM, and it was 30 times faster than the second-best classifier (XGB). Therefore, the LightGBM classifier was further used to evaluate the effect of normalization and floating window size.
The results of computational experiments with datasets in which the upper part of the data window is fixed (up_w = 5) and the lower part (dn_w) varies between 5 and 200 are shown in Table 5 and Table 6. Table 5 shows the results for non-normalized values. Table 6 shows the results for normalized log data.
These results show that using large-size floating data windows improves the classification results by about 7–9%. Using non-normalized values gives slightly better results, especially when the floating data window size is small.

4.2. Lithological Classification

In the process of evaluating the quality of lithology classification, the experiments show that the best value of f1_score_macro was demonstrated by XGB. Table 7 shows the results of solving the lithological classification problem when changing the floating data window.
The results of applying other machine learning models to solve the lithological classification problem are given in Appendix C.

5. Discussion

The correct identification of reservoir oxidation zones is a very important task in interpreting uranium deposit logging data, which helps to avoid economic losses during the production process. The use of a floating data window allowed us to improve the quality of ROZ classification measured using f1_score_macro by 9% (from 0.65 to 0.74). In solving the ROZ identification problem, the best result (f1_score_macro = 0.744) is demonstrated by the LGBM model, which is also much faster than its closest competitors. In general, the result is 2% better than the one obtained at the previous stage of the research [19]. In the problem of lithological classification, which is part of the uranium mining technology, the best result was achieved using XGBoost (Acc = 0.694, f1_score for class 7 is 0.705). The best result was obtained with the maximum floating window size (200). In general, the use of floating data windows allowed us to improve accuracy by more than 12%. The obtained result is also the best to date compared to the literature data. It exceeds the result mentioned in [49] by 4%.
It can be assumed that the influence of the floating window in both tasks is due to the fact that classifiers better “take into account” the patterns of changes in well logs with increasing window size. This allows us to count on the prospects of using deep learning models. First of all, we are talking about convolutional network models, since this allows us to bring the automatic classification technology closer to experts’ assessment, which considers the visual characteristics of logging curves (minimums and maximums, inflection points). Despite the obtained results, which are the best to date, this study has some limitations.
First, as noted above, we did not use deep learning models.
Second, we did not solve the problem of finding the optimal set of classifier hyperparameters, which can improve their accuracy indicators.
Third, we did not evaluate the influence of input parameters on the obtained result, which in principle can allow us to optimize the list of input variables of the model.
Fourth, although the dataset is quite large, it is significantly smaller than that which can be obtained by an additional selection of logging data.

6. Conclusions

The task of analyzing well logging data is one of the most important in the process of exploration of mineral deposits. First of all, this concerns oil, gas, coal, and uranium deposits.
A well logging data analysis allows for optimizing the production process. The correct interpretation of data allows for avoiding unproductive costs and increasing the economic indicators of the production process. In the process of interpretation, machine learning models are used to solve the most common problems. The selection and configuration of such models is one of the important scientific tasks. Most of this kind of work is devoted to the applications of machine learning to the analysis of oil and gas field data. Until now, when analyzing well logging data of uranium deposits, a significant limitation of research was the lack of a publicly available dataset. This study covers this gap.
This paper presents a labeled well logging dataset that partially compensates for the deficit of such data for sandstone-type uranium deposits. In this research, the dataset is used to train and evaluate machine learning models on two problems. Task 1 is the ROZ identification problem, which is based on electrical and gamma logging data and was first posed in [19]. It is of exceptional importance in uranium mining by the ISR method (more than 40% of the world’s uranium production). The use of machine learning allows us to “significantly (by 22%–70%) increase the accuracy of the filtration coefficient determination and, accordingly, improve the accuracy of recoverable reserves calculation and economic indicators of mining processes” [1].
For Task 1, the target variable is lit. If the variables y_LIT1_train and y_LIT1_test are used instead of y_train and y_test as the targets, the same dataset supports the lithological classification problem (Task 2). Correct lithological classification allows the mining process to be optimized. In both cases, feature engineering in the form of increasing the size of the floating data window significantly improved (by 9–12%) the classification result.
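To make the floating-window transformation concrete, the sketch below extends each depth sample with the readings of up_w samples above and dn_w samples below it (hedged: the function and variable names are hypothetical, and the published dataset files may organize the samples differently):

```python
import numpy as np

def floating_window(logs: np.ndarray, up_w: int, dn_w: int) -> np.ndarray:
    """Extend each depth sample with the readings of up_w samples above
    and dn_w samples below (edges are padded by repeating the boundary row).

    logs: array of shape (n_samples, n_logs), one row per depth point.
    Returns an array of shape (n_samples, (up_w + 1 + dn_w) * n_logs).
    """
    padded = np.vstack([
        np.repeat(logs[:1], up_w, axis=0),   # pad the top edge
        logs,
        np.repeat(logs[-1:], dn_w, axis=0),  # pad the bottom edge
    ])
    rows = [padded[i:i + up_w + 1 + dn_w].ravel() for i in range(logs.shape[0])]
    return np.asarray(rows)

# Example: 6 depth samples, 2 logs per sample, window of size 5 (2 above + self + 2 below)
X = np.arange(12, dtype=float).reshape(6, 2)
Xw = floating_window(X, up_w=2, dn_w=2)
print(Xw.shape)  # (6, 10)
```

With up_w = dn_w = 2 and two logs per depth point, each row grows from 2 to 10 features, which is how the feature space expands as the window size increases.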
In the future, it is possible to conduct computational experiments to select the optimal set of input parameters, for example, using the MLxtend library [57,58] (Task 3), and to search for the best combination of hyperparameters of the machine learning models using GridSearchCV, RandomizedSearchCV [54], BayesSearchCV [59], and other methods [60] (Task 4). This will also help to overcome the limitations of this study listed in Section 5. The dataset can also be used for a comparative analysis of log data from different fields or for transfer learning experiments (Task 5).
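As an illustration of such a hyperparameter search, a minimal GridSearchCV sketch follows (synthetic data and an arbitrary parameter grid stand in for the UWL dataset and the study's models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the well log features; the real UWL data would be loaded instead.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Illustrative grid, not the study's configuration.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # the quality metric used for the ROZ task
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

RandomizedSearchCV and BayesSearchCV follow the same estimator/grid/scoring interface, so the sketch transfers directly to those methods.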
One important task for future research is the application of deep learning models that can improve the classification quality. Deep learning models, in particular convolutional networks, can take into account the shape of the well log and, therefore, better imitate the classification processes performed by experts.
It should be noted that this study is based on a dataset limited to 96 wells from one field. In future studies, we plan to enlarge the dataset, use deep learning models alongside algorithm ensembles, and address the above-mentioned Tasks 3–5.

Author Contributions

Conceptualization, R.M. and Y.K.; methodology, R.M., E.M. and Z.K.; validation, Y.K., P.R. and E.M.; formal analysis, Y.K., N.Y. and Z.K.; investigation, R.M., Y.K., P.R. and V.G.; data curation, Y.K. and N.Y.; visualization, R.M. and E.M.; writing—original draft, R.M. and Y.K.; writing—review and editing, Y.K., E.M., P.R. and V.G.; project administration, N.Y. and Y.K.; funding acquisition, N.Y., V.G., Z.K. and R.M. All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan under Grants AP23488745 “Rapid assessment of soil salinity using low-altitude unmanned aerial platforms (RASS)”, AP14869110 “Improving the accuracy of solving problems of interpretation of geophysical well research data on uranium deposits using machine learning methods”, BR21881908 “Complex of urban ecological support”, BR24992908 “Support system for agricultural crop production optimization via remote monitoring and artificial intelligence methods (Agroscope)”, and BR24993051 “Development of an intelligent city system based on IoT and data analysis”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the data and the software are available at https://www.dropbox.com/scl/fo/hj0vuqebsb0irz00guks2/h?rlkey=fn10pja37yqfh0lfaue7xscbf&dl=0 (accessed on 29 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Machine learning model quality metrics.

| Model Type | Metric | Formula | Explanation |
|---|---|---|---|
| Regression | Mean absolute error (MAE) | $MAE = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - h_i \rvert$ | $n$ is the sample size; $y_i$ is the real value of the target variable for the $i$-th example; $h_i$ is the calculated value for the $i$-th example |
| Regression | Determination coefficient | $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res} = \sum_{i=1}^{n}(y_i - h_i)^2$, $SS_{tot} = \sum_{i=1}^{n}(y_i - \bar{y})^2$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ |  |
| Regression | Linear correlation coefficient (Pearson correlation coefficient) | $R_p(y,h) = \dfrac{\sum_{i=1}^{n}(h_i - \bar{h})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2 \sum_{i=1}^{n}(h_i - \bar{h})^2}}$, where $\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_i$ |  |
| Classification | Accuracy | $Acc = \frac{N_t}{N}$ | $N_t$ is the number of correct answers and $N$ is the total number of answers of the model |
| Classification | Precision | $P = \frac{TP}{TP + FP}$ | true positives (TPs) and true negatives (TNs) are cases of correct operation of the classifier; false negatives (FNs) and false positives (FPs) are cases of misclassification |
| Classification | Recall | $R = \frac{TP}{TP + FN}$ |  |
| Classification | F1 score | $F1\_score = \frac{2PR}{P + R}$ |  |
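The classification metrics in Table A1 correspond to standard scikit-learn calls; a small sanity check on toy labels (not data from the study) shows, in particular, that the micro-averaged F1 score equals accuracy for single-label classification:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels with three classes (1, 2, 8), purely for illustration.
y_true = [1, 1, 2, 2, 8, 8]
y_pred = [1, 1, 2, 8, 8, 8]

acc = accuracy_score(y_true, y_pred)                    # Nt / N
p = precision_score(y_true, y_pred, average="macro")    # mean of per-class TP/(TP+FP)
r = recall_score(y_true, y_pred, average="macro")       # mean of per-class TP/(TP+FN)
f1_macro = f1_score(y_true, y_pred, average="macro")    # unweighted mean of per-class F1
f1_micro = f1_score(y_true, y_pred, average="micro")    # equals accuracy here
print(acc, f1_macro, f1_micro)
```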

Appendix B

Typically, SVM runs fast enough for small datasets (hundreds or thousands of rows) and a small number of input variables (tens). As the training dataset and the number of input variables grow, the algorithm’s running time increases roughly quadratically. The libSVM library allows some acceleration through a larger kernel cache. Computational experiments with a cache size of 10 GB showed that SVM with default parameters has no advantages over LGBM but runs about 70 times slower (Table A2).
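In scikit-learn's SVC, which wraps libSVM, the kernel cache is controlled by the cache_size parameter, specified in MB (so a 10 GB cache corresponds to cache_size = 10000). A minimal sketch of such a timing experiment follows (tiny synthetic data, so the measured time is not representative of Table A2):

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the well log features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# cache_size is given in MB; the experiments above used 10 GB (cache_size=10000).
clf = SVC(kernel="rbf", cache_size=1000)
t0 = time.perf_counter()
clf.fit(X, y)
elapsed = time.perf_counter() - t0
print(f"RBF SVM fit in {elapsed:.3f} s, train accuracy {clf.score(X, y):.3f}")
```

Timing the fit call in this way for each classifier is how the Duration columns of the comparison tables can be populated.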
Table A2. Comparative characteristics of the LGBM and RBF SVM algorithms.

| Dw_N | Classifier | Acc | f1_score_class1 | f1_score_class2 | f1_score_class8 | f1_score_macro | f1_score_micro | Duration |
|---|---|---|---|---|---|---|---|---|
| 5 | LGBM | 0.807 | 0.805 | 0.633 | 0.659 | 0.699 | 0.807 | 22.2 |
| 5 | RBF SVM | 0.792 | 0.786 | 0.627 | 0.632 | 0.682 | 0.792 | 1596 |
| 25 | LGBM | 0.808 | 0.805 | 0.661 | 0.673 | 0.713 | 0.808 | 52 |
| 25 | RBF SVM | 0.794 | 0.782 | 0.630 | 0.650 | 0.688 | 0.794 | 3274 |
| 50 | LGBM | 0.814 | 0.815 | 0.668 | 0.681 | 0.721 | 0.814 | 81.1 |
| 50 | RBF SVM | 0.795 | 0.780 | 0.603 | 0.662 | 0.682 | 0.795 | 5743.5 |

Appendix C

Table A3. Results of lithological classification (per-class F1 scores for lithological classes 1, 3, 4, 5, 6, 7, and 9).

| Dw_N | Model | Acc | F1 (1) | F1 (3) | F1 (4) | F1 (5) | F1 (6) | F1 (7) | F1 (9) | f1_Macro | f1_Micro | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | LGBM | 0.55 | 0.695 | 0.295 | 0.168 | 0.29 | 0.015 | 0.643 | 0 | 0.301 | 0.55 | 3.98 |
| 5 | XGB | 0.565 | 0.705 | 0.292 | 0.159 | 0.433 | 0 | 0.623 | 0 | 0.344 | 0.565 | 120.97 |
| 5 | MLP | 0.599 | 0.739 | 0.266 | 0.078 | 0.022 | 0 | 0.671 | 0 | 0.276 | 0.599 | 32.001 |
| 10 | LGBM | 0.565 | 0.706 | 0.344 | 0.209 | 0.247 | 0.063 | 0.629 | 0 | 0.314 | 0.565 | 4.731 |
| 10 | RFC | 0.588 | 0.729 | 0.318 | 0.197 | 0.16 | 0.043 | 0.608 | 0 | 0.324 | 0.588 | 24.25 |
| 10 | XGB | 0.577 | 0.715 | 0.348 | 0.19 | 0.337 | 0.089 | 0.628 | 0 | 0.359 | 0.577 | 114.24 |
| 10 | MLP | 0.62 | 0.758 | 0.328 | 0.11 | 0.059 | 0 | 0.646 | 0 | 0.3 | 0.62 | 37.518 |
| 25 | LGBM | 0.578 | 0.719 | 0.375 | 0.206 | 0.185 | 0.022 | 0.65 | 0 | 0.308 | 0.578 | 9.851 |
| 25 | RFC | 0.608 | 0.745 | 0.337 | 0.213 | 0.143 | 0.027 | 0.641 | 0 | 0.332 | 0.608 | 58.774 |
| 25 | XGB | 0.609 | 0.744 | 0.392 | 0.21 | 0.263 | 0.028 | 0.671 | 0 | 0.364 | 0.609 | 260.25 |
| 25 | MLP | 0.641 | 0.774 | 0.402 | 0.18 | 0.142 | 0 | 0.668 | 0 | 0.341 | 0.641 | 50.892 |
| 50 | LGBM | 0.61 | 0.754 | 0.4 | 0.241 | 0.178 | 0.012 | 0.644 | 0 | 0.318 | 0.61 | 7.161 |
| 50 | RFC | 0.639 | 0.773 | 0.405 | 0.224 | 0.17 | 0 | 0.649 | 0 | 0.35 | 0.639 | 104.44 |
| 50 | XGB | 0.636 | 0.768 | 0.415 | 0.243 | 0.301 | 0.005 | 0.683 | 0 | 0.381 | 0.636 | 435.20 |
| 50 | MLP | 0.659 | 0.792 | 0.454 | 0.228 | 0.198 | 0 | 0.674 | 0 | 0.37 | 0.659 | 81.985 |
| 100 | LGBM | 0.634 | 0.776 | 0.434 | 0.253 | 0.153 | 0.011 | 0.665 | 0 | 0.327 | 0.634 | 32.906 |
| 100 | RFC | 0.656 | 0.79 | 0.431 | 0.245 | 0.216 | 0 | 0.649 | 0 | 0.368 | 0.656 | 195.50 |
| 100 | XGB | 0.659 | 0.791 | 0.441 | 0.247 | 0.308 | 0 | 0.692 | 0 | 0.391 | 0.659 | 732.61 |
| 100 | MLP | 0.656 | 0.796 | 0.432 | 0.253 | 0.188 | 0 | 0.666 | 0 | 0.368 | 0.656 | 102.92 |
| 200 | LGBM | 0.65 | 0.794 | 0.471 | 0.277 | 0.091 | 0.019 | 0.636 | 0 | 0.327 | 0.65 | 46.825 |
| 200 | RFC | 0.681 | 0.81 | 0.489 | 0.282 | 0.199 | 0 | 0.633 | 0 | 0.381 | 0.681 | 658.43 |
| 200 | XGB | 0.694 | 0.819 | 0.492 | 0.295 | 0.28 | 0.008 | 0.705 | 0 | 0.41 | 0.694 | 1319.5 |
| 200 | MLP | 0.649 | 0.789 | 0.455 | 0.234 | 0.18 | 0 | 0.652 | 0 | 0.364 | 0.649 | 157.25 |

References

  1. Mukhamediev, R.I.; Kuchin, Y.; Amirgaliyev, Y.; Yunicheva, N.; Muhamedijeva, E. Estimation of Filtration Properties of Host Rocks in Sandstone-Type Uranium Deposits Using Machine Learning Methods. IEEE Access 2022, 10, 18855–18872.
  2. Amirova, U.; Uruzbaeva, N. Overview of the development of the world market of Uranium. Univers. Econ. Law Electron. Sci. J. 2017, 6, 1–8.
  3. Baldwin, J.L.; Bateman, R.M.; Wheatley, C.L. Application of a neural network to the problem of mineral identification from well logs. Log Anal. 1990, 31, SPWLA-1990-v31n5a1.
  4. Poulton, M.M. Computational Neural Networks for Geophysical Data Processing; Elsevier: Amsterdam, The Netherlands, 2001.
  5. Benaouda, D.; Wadge, G.; Whitmarsh, R.; Rothwell, R.; MacLeod, C. Inferring the lithology of borehole rocks by applying neural network classifiers to downhole logs: An example from the Ocean Drilling Program. Geophys. J. Int. 1999, 136, 477–491.
  6. Saggaf, M.; Nebrija, E.L. Estimation of missing logs by regularized neural networks. AAPG Bull. 2003, 87, 1377–1389.
  7. Kumar, T.; Seelam, N.K.; Rao, G.S. Lithology prediction from well log data using machine learning techniques: A case study from Talcher coalfield, Eastern India. J. Appl. Geophys. 2022, 199, 104605.
  8. Kim, J. Lithofacies classification integrating conventional approaches and machine learning technique. J. Nat. Gas Sci. Eng. 2022, 100, 104500.
  9. Thongsamea, W.; Kanitpanyacharoena, W.; Chuangsuwanich, E. Lithological Classification from Well Logs using Machine Learning Algorithms. Bull. Earth Sci. Thail. 2018, 10, 31–43.
  10. Liang, H.; Xiong, J.; Yang, Y.; Zou, J. Research on Intelligent Recognition Technology in Lithology Based on Multi-parameter Fusion. 2023.
  11. Mohamed, I.M.; Mohamed, S.; Mazher, I.; Chester, P. Formation lithology classification: Insights into machine learning methods. In Proceedings of the SPE Annual Technical Conference and Exhibition, Calgary, AB, Canada, 30 September–2 October 2019.
  12. Ahmadi, M.-A.; Ahmadi, M.R.; Hosseini, S.M.; Ebadi, M. Connectionist model predicts the porosity and permeability of petroleum reservoirs by means of petro-physical logs: Application of artificial intelligence. J. Pet. Sci. Eng. 2014, 123, 183–200.
  13. Gholami, R.; Moradzadeh, A.; Maleki, S.; Amiri, S.; Hanachi, J. Applications of artificial intelligence methods in prediction of permeability in hydrocarbon reservoirs. J. Pet. Sci. Eng. 2014, 122, 643–656.
  14. Zhong, Z.; Carr, T.R.; Wu, X.; Wang, G. Application of a convolutional neural network in permeability prediction: A case study in the Jacksonburg-Stringtown oil field, West Virginia, USA. Geophysics 2019, 84, B363–B373.
  15. Khan, H.; Srivastav, A.; Kumar Mishra, A.; Anh Tran, T. Machine learning methods for estimating permeability of a reservoir. Int. J. Syst. Assur. Eng. Manag. 2022, 13, 2118–2131.
  16. Kuchin, Y.I.; Mukhamediev, R.I.; Yakunin, K.O. One method of generating synthetic data to assess the upper limit of machine learning algorithms performance. Cogent Eng. 2020, 7, 1718821.
  17. Merembayev, T.; Yunussov, R.; Yedilkhan, A. Machine learning algorithms for stratigraphy classification on uranium deposits. Procedia Comput. Sci. 2019, 150, 46–52.
  18. Kuchin, Y.I.; Mukhamediev, R.I.; Yakunin, K.O. Quality of data classification under conditions of inconsistency of expert estimations. Cloud Sci. 2019, 6, 109–126. (In Russian)
  19. Mukhamediev, R.I.; Kuchin, Y.; Popova, Y.; Yunicheva, N.; Muhamedijeva, E.; Symagulov, A.; Abramov, K.; Gopejenko, V.; Levashenko, V.; Zaitseva, E.; et al. Determination of Reservoir Oxidation Zone Formation in Uranium Wells Using Ensemble Machine Learning Methods. Mathematics 2023, 11, 4687.
  20. Dakhnov, V.N. Interpretation of the Results of Geophysical Studies of Well Sections; Nedra: Moscow, Russia, 1982; p. 448. (In Russian)
  21. Mukhamediev, R.I.; Popova, Y.; Kuchin, Y.; Zaitseva, E.; Kalimoldayev, A.; Symagulov, A.; Levashenko, V.; Abdoldina, F.; Gopejenko, V.; Yakunin, K.; et al. Review of Artificial Intelligence and Machine Learning Technologies: Classification, Restrictions, Opportunities and Challenges. Mathematics 2022, 10, 2552.
  22. Singh, H.; Seol, Y.; Myshakin, E.M. Automated well-log processing and lithology classification by identifying optimal features through unsupervised and supervised machine-learning algorithms. SPE J. 2020, 25, 2778–2800.
  23. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
  24. Al Daoud, E. Comparison between XGBoost, LightGBM and CatBoost using a home credit dataset. Int. J. Comput. Inf. Eng. 2019, 13, 6–10.
  25. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967.
  26. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  27. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  28. Fix, E.; Hodges, J.L. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Int. Stat. Rev./Rev. Int. de Stat. 1989, 57, 238–247.
  29. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
  30. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
  31. Galushkin, A.I. Neural Networks: Fundamentals of Theory; Telecom: Perm, Russia, 2010; p. 496. (In Russian)
  32. Bayes, T. An essay towards solving a problem in the doctrine of chances. Biometrika 1958, 45, 296–315.
  33. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  34. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  35. LeCun, Y.; Bengio, Y. Convolutional Networks for Images, Speech, and Time Series. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 1995; p. 3361.
  36. Xueqing, Z.; Zhansong, Z.; Chaomo, Z. Bi-LSTM deep neural network reservoir classification model based on the innovative input of logging curve response sequences. IEEE Access 2021, 9, 19902–19915.
  37. Patidar, A.K.; Singh, S.; Anand, S. Subsurface Lithology Classification Using Well Log Data, an Application of Supervised Machine Learning. In Workshop on Mining Data for Financial Applications; Springer Nature: Singapore, 2022; pp. 227–240.
  38. Zhang, J.; He, Y.; Zhang, Y.; Li, W.; Zhang, J. Well-Logging-Based Lithology Classification Using Machine Learning Methods for High-Quality Reservoir Identification: A Case Study of Baikouquan Formation in Mahu Area of Junggar Basin, NW China. Energies 2022, 15, 3675.
  39. Xing, Y.; Yang, H.; Yu, W. An approach for the classification of rock types using machine learning of core and log data. Sustainability 2023, 15, 8868.
  40. Maxwell, K.; Rajabi, M.; Esterle, J. Automated classification of metamorphosed coal from geophysical log data using supervised machine learning techniques. Int. J. Coal Geol. 2019, 214, 103284.
  41. Al-Mudhafar, W.J. Integrating well log interpretations for lithofacies classification and permeability modeling through advanced machine learning algorithms. J. Pet. Explor. Prod. Technol. 2017, 7, 1023–1033.
  42. Liu, J.J.; Liu, J.C. Integrating deep learning and logging data analytics for lithofacies classification and 3D modeling of tight sandstone reservoirs. Geosci. Front. 2022, 13, 101311.
  43. Rogulina, A.; Zaytsev, A.; Ismailova, L.; Kovalev, D.; Katterbauer, K.; Marsala, A. Similarity learning for well logs prediction using machine learning algorithms. In Proceedings of the International Petroleum Technology Conference, Dhahran, Saudi Arabia, 21–23 February 2022. D032S158R005.
  44. Zhong, R.; Johnson, R.L., Jr.; Chen, Z. Using machine learning methods to identify coal pay zones from drilling and logging-while-drilling (LWD) data. SPE J. 2020, 25, 1241–1258.
  45. Hou, M.; Xiao, Y.; Lei, Z.; Yang, Z.; Lou, Y.; Liu, Y. Machine learning algorithms for lithofacies classification of the Gulong shale from the Songliao Basin, China. Energies 2023, 16, 2581.
  46. Schnitzler, N.; Ross, P.S.; Gloaguen, E. Using machine learning to estimate a key missing geochemical variable in mining exploration: Application of the Random Forest algorithm to multi-sensor core logging data. J. Geochem. Explor. 2019, 205, 106344.
  47. Joshi, D.; Patidar, A.K.; Mishra, A.; Mishra, A.; Agarwal, S.; Pandey, A.; Choudhury, T. Prediction of sonic log and correlation of lithology by comparing geophysical well log data using machine learning principles. GeoJournal 2021, 88, 47–68.
  48. Al-Khudafi, A.M.; Al-Sharifi, H.A.; Hamada, G.M.; Bamaga, M.A.; Kadi, A.A.; Al-Gathe, A.A. Evaluation of different tree-based machine learning approaches for formation lithology classification. In Proceedings of the ARMA/DGS/SEG International Geomechanics Symposium, Al Khobar, Saudi Arabia, 30 October–2 November 2023; p. ARMA-IGS-2023-0026.
  49. Merembayev, T.; Yunussov, R.; Yedilkhan, A. Machine learning algorithms for classification geology data from well logging. In Proceedings of the 14th International Conference on Electronics Computer and Computation (ICECCO), Kaskelen, Kazakhstan, 29 November–1 December 2018; pp. 206–212.
  50. Wenhua, W.; Zhuwen, W.; Ruiyi, H.; Fanghui, X.; Xinghua, Q.; Yitong, C. Lithology classification of volcanic rocks based on conventional logging data of machine learning: A case study of the eastern depression of Liaohe oil field. Open Geosci. 2021, 13, 1245–1258.
  51. Kuchin, Y.; Mukhamediev, R.; Yunicheva, N.; Symagulov, A.; Abramov, K.; Mukhamedieva, E.; Levashenko, V. Application of machine learning methods to assess filtration properties of host rocks of uranium deposits in Kazakhstan. Appl. Sci. 2023, 13, 10958.
  52. Kuchin, Y.; Yakunin, K.; Mukhamedyeva, E.; Mukhamedyev, R. Project on creating a classifier of lithological types for uranium deposits in Kazakhstan. J. Phys. Conf. Ser. 2019, 1405, 012001.
  53. Pedregosa, F. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  54. Scikit-Learn. Machine Learning in Python. Available online: https://scikit-learn.org/stable/ (accessed on 1 February 2024).
  55. Ali, M. PyCaret: An Open Source, Low-Code Machine Learning Library in Python. PyCaret Version 1.0.0. 2020. Available online: https://www.pycaret.org (accessed on 1 February 2024).
  56. Arnaut, F.; Kolarski, A.; Srećković, V.A. Machine Learning Classification Workflow and Datasets for Ionospheric VLF Data Exclusion. Data 2024, 9, 17.
  57. Raschka, S. MLxtend: Providing Machine Learning and Data Science Utilities and Extensions to Python’s Scientific Computing Stack. J. Open Source Softw. 2018, 3, 638.
  58. Raschka, S. Available online: https://rasbt.github.io/mlxtend/ (accessed on 3 May 2023).
  59. Scikit-Optimize. Sequential Model-Based Optimization in Python. Available online: https://scikit-optimize.github.io/stable/ (accessed on 4 August 2024).
  60. Zahedi, L.; Mohammadi, F.G.; Rezapour, S.; Ohland, M.W.; Amini, M.H. Search algorithms for automated hyper-parameter tuning. arXiv 2021, arXiv:2104.14677.
Figure 1. Machine learning models.
Figure 2. Well log data processing and classification quality assessment.
Figure 3. Floating data window with size 5.
Figure 4. Well log dataset with floating data window size up_w = 5 and dn_w = 5.
Table 1. Machine learning models used for well log data analysis.

| # | Classifier | Abbreviated Name | References |
|---|---|---|---|
| 1 | Light gradient boosting machine | LGBM | [23,24,25] |
| 2 | Random forest classifier | RF | [26] |
| 3 | Extreme gradient boosting | XGB | [27] |
| 4 | k-nearest neighbors | kNN | [28] |
| 5 | Decision tree | DT | [29] |
| 6 | Artificial neural network or multilayer perceptron | MLP or ANN | [30,31] |
| 7 | Naive Bayes classifier | NB | [32] |
| 8 | Support vector machine with linear kernel | Linear SVM | [33] |
| 9 | Support vector machine with RBF kernel | RBF SVM | [33] |
| 10 | Long short-term memory | LSTM | [34] |
| 11 | Convolutional neural network | CNN | [35] |
Table 2. Results of applying machine learning models to some well logging data processing tasks.

| Extracted Resources | Task | Model | Results | Ref. |
|---|---|---|---|---|
| Oil | 1, 2 | Bidirectional LSTM | Acc = 92.69% | [36] |
| Oil, gas | 1 | DT, RF | f1 = 0.97 (RF), f1 = 0.94 (DT) | [37] |
| Oil | 1, 2 | UL, SL | UL = 80%, SL = 90% | [22] |
| Oil | 1, 2 | XGBoost and RF | Acc = 0.882 (XGB) | [38] |
| Oil, gas | 1 | kNN, RF, XGB, MLP | Acc = 0.79 | [39] |
| Coal | 1 | XGB, RF, ANN | Acc = 0.99 (RF) | [40] |
| Oil | 4 | DFFNN, XGB, LR | R2 = 0.9551 (LR) | [41] |
| Oil, gas | 1 | ANN | Acc = 0.88 (ANN) | [11] |
| Oil, gas | 1 | Hybrid model based on CNN and LSTM | Acc = 87.3% (CNN-LSTM) | [42] |
| Oil, gas | 2 | XGB, LogR | ROC AUC = 0.824 | [43] |
| Coal | 1 | LR, SVM, ANN, RF, XGB | Acc > 0.9 | [44] |
| Coal | 1 | SVM, MLP, DT, RF, XGB | Acc = 0.8 | [7] |
| Oil, gas | 1 | MLP, SVM, XGB, RF | Acc = 0.868 (XGB) and Acc = 0.884 (RF) | [45] |
| Geothermal wells | 1 | kNN, SVM, XGB | Acc = 0.9067 (XGB) | [9] |
| Sulfide ore | 1 | RF | R > 0.66 between calculated and measured Na concentration in core | [46] |
| Oil | 1 | UL | Acc = 0.5840 | [47] |
| Oil | 1 | RF | F1 = 0.913 | [48] |
| Uranium | 1, 3 | RF, kNN, XGB | Acc = 0.65 (1), Acc = 0.95 (3) | [49] |
| Oil, gas | 1 | SVM, RF | Acc = 0.9746 | [50] |
| Uranium | 4 | ANN, XGB | R = 0.7 (XGB) | [1] |
| Uranium | 4 | XGB, LGBM, RF, DFFNN, SVM | R2 = 0.710, R = 0.845 (LGBM) | [51] |
| Uranium | 1 | kNN, LogR, DT, SVM, XGB, ANN, LSTM | Acc = 0.54 (XGB) | [52] |
| Uranium | 5 | SVM, ANN, RF, XGB, LGBM | f1_weighted = 0.72 (XGB) | [19] |
Table 3. Characteristics of the main lithological types of sandstone-type uranium deposits.

| Code | Rock Name | AR (Ohm·m) | Filtration Coefficient (m/day) |
|---|---|---|---|
| 1 | gravel, pebbles | medium | 12–20 |
| 2 | coarse sand | medium | 8–15 |
| 3 | medium-grained sand | medium | 5–12 |
| 4 | fine-grained sand | medium | 1–7 |
| 5 | sandstones | high | 0–0.1 |
| 6 | silt, siltstone | low | 0.8–1 |
| 7 | clay | low | 0.1–0.8 |
| 8 | gypsum, dolomite | high | 0–0.1 |
| 9 | carbonate rocks | high | 0–0.1 |
Table 4. Results of preliminary computational experiments.

| # | Classifier | Acc | f1_score_class1 | f1_score_class2 | f1_score_class8 | f1_score_macro | f1_score_micro | Duration |
|---|---|---|---|---|---|---|---|---|
| 1 | LGBM | 0.868 | 0.88 | 0.668 | 0.891 | 0.813 | 0.868 | 5.757 |
| 2 | RFC | 0.844 | 0.859 | 0.57 | 0.869 | 0.766 | 0.844 | 94.193 |
| 3 | XGB | 0.859 | 0.87 | 0.657 | 0.882 | 0.803 | 0.859 | 161.549 |
| 4 | kNN | 0.666 | 0.71 | 0.396 | 0.658 | 0.588 | 0.666 | 134.631 |
| 5 | DT | 0.721 | 0.762 | 0.419 | 0.753 | 0.645 | 0.721 | 18.003 |
| 6 | MLP | 0.809 | 0.825 | 0.628 | 0.825 | 0.759 | 0.809 | 54.788 |
| 7 | NB | 0.479 | 0.404 | 0.279 | 0.678 | 0.454 | 0.479 | 0.98 |
| 8 | Linear SVM | 0.750 | 0.775 | 0.461 | 0.757 | 0.664 | 0.750 | 3322.33 |
| 9 | RBF SVM | 0.799 | 0.812 | 0.560 | 0.818 | 0.730 | 0.799 | 1156.59 |
Table 5. Results of k-fold cross-validation (k = 10) for non-normalized log data in the ROZ classification problem.

| Dw_N | Acc | f1_score_class1 | f1_score_class2 | f1_score_class8 | f1_score_macro | f1_score_micro | Duration |
|---|---|---|---|---|---|---|---|
| 0 | 0.79 | 0.828 | 0.505 | 0.685 | 0.672 | 0.79 | 1.651 |
| 5 | 0.825 | 0.843 | 0.636 | 0.698 | 0.726 | 0.825 | 6.386 |
| 25 | 0.82 | 0.839 | 0.656 | 0.69 | 0.729 | 0.82 | 15.56 |
| 50 | 0.827 | 0.842 | 0.664 | 0.696 | 0.734 | 0.827 | 26.56 |
| 100 | 0.836 | 0.847 | 0.67 | 0.723 | 0.747 | 0.836 | 48.94 |
| 200 | 0.837 | 0.815 | 0.68 | 0.73 | 0.741 | 0.837 | 73.61 |
Table 6. Results of k-fold cross-validation (k = 10) for normalized log data in the ROZ classification task.

| Dw_N | Acc | f1_score_class1 | f1_score_class2 | f1_score_class8 | f1_score_macro | f1_score_micro | Duration |
|---|---|---|---|---|---|---|---|
| 0 | 0.779 | 0.803 | 0.498 | 0.647 | 0.649 | 0.779 | 0.902 |
| 5 | 0.807 | 0.805 | 0.633 | 0.659 | 0.699 | 0.807 | 2.503 |
| 25 | 0.808 | 0.805 | 0.661 | 0.673 | 0.713 | 0.808 | 5.833 |
| 50 | 0.814 | 0.815 | 0.668 | 0.681 | 0.721 | 0.814 | 10.47 |
| 100 | 0.818 | 0.815 | 0.668 | 0.697 | 0.727 | 0.818 | 18.99 |
| 200 | 0.826 | 0.815 | 0.679 | 0.739 | 0.744 | 0.826 | 31.5 |
Table 7. Results of k-fold cross-validation (k = 10) for normalized log data in the lithology classification task (per-class F1 scores for lithological classes 1, 3, 4, 5, 6, 7, and 9).

| Dw_N | Acc | F1 (1) | F1 (3) | F1 (4) | F1 (5) | F1 (6) | F1 (7) | F1 (9) | F1_Macro | F1_Micro | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0.565 | 0.705 | 0.292 | 0.159 | 0.433 | 0 | 0.623 | 0 | 0.344 | 0.565 | 120.973 |
| 10 | 0.577 | 0.715 | 0.348 | 0.19 | 0.337 | 0.089 | 0.628 | 0 | 0.359 | 0.577 | 114.243 |
| 25 | 0.609 | 0.744 | 0.392 | 0.21 | 0.263 | 0.028 | 0.671 | 0 | 0.364 | 0.609 | 260.255 |
| 50 | 0.636 | 0.768 | 0.415 | 0.243 | 0.301 | 0.005 | 0.683 | 0 | 0.381 | 0.636 | 435.202 |
| 100 | 0.659 | 0.791 | 0.441 | 0.247 | 0.308 | 0 | 0.692 | 0 | 0.391 | 0.659 | 732.618 |
| 200 | 0.694 | 0.819 | 0.492 | 0.295 | 0.28 | 0.008 | 0.705 | 0 | 0.41 | 0.694 | 1319.534 |
Note. The best results are highlighted in bold.
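The evaluation protocol behind Tables 5–7 (k-fold cross-validation with k = 10, scored with f1_macro) can be sketched as follows; synthetic data and a plain decision tree stand in for the UWL dataset and the boosting models:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class stand-in for the (windowed) log features.
X, y = make_classification(n_samples=400, n_features=12, n_classes=3,
                           n_informative=6, random_state=0)

# k = 10 folds, macro-averaged F1, as reported in Tables 5-7.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=10, scoring="f1_macro")
print(scores.mean())
```

Swapping the decision tree for LGBM or XGB and the scoring for "accuracy" reproduces the other columns of those tables.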