Figure 1.
Flowchart of the well-logging data preprocessing stage: data collection, sample preparation, and data splitting.
Figure 2.
Flowchart of the pipeline setup stage: data scaling, feature extraction, and machine-learning algorithm.
Figure 3.
Flowchart of the optimal-model selection stage: GridSearchCV, the machine-learning algorithm, and the evaluation criteria.
Figure 4.
Schematic diagram of the sample collection and preparation process. The red box marks the gas production segmented at each logging depth.
Figure 5.
An in-depth analysis of reservoir characteristics and production parameters: (a) Histograms depicting the distribution and variability of key reservoir properties such as porosity, permeability, and fluid saturation, alongside critical production metrics like oil and gas output. (b) Violin plots further elucidating these characteristics, offering a detailed view of their distribution patterns, central tendencies, and dispersion, thus providing a holistic understanding of the reservoir’s behavior and production efficiency.
Figure 6.
Comparison of production-capacity predictions on the remaining 20% of the dataset with the real production data.
Figure 7.
Taylor diagram representation of model bias and standard deviation of errors. The azimuthal angle represents the PCC; the radial distance is the standard deviation of the predicted production data; and the semicircles centered at the reference marker indicate the standard deviation of the real production data, 192.2. The color scale shows the root-mean-square error.
Figure 8.
Production prediction of the remaining 20% of the 33 wells using PS-XGB. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 9.
Production prediction of the remaining 20% of the 33 wells using PS-RF. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 10.
Production prediction of the remaining 20% of the 33 wells using PS-NN. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 11.
Production prediction of the remaining 20% of the 33 wells using PFS-XGB. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 12.
Production prediction of the remaining 20% of the 33 wells using PFS-RF. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 13.
Production prediction of the remaining 20% of the 33 wells using PFS-NN. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 14.
Production prediction of the remaining 20% of the 33 wells using PR-XGB. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 15.
Production prediction of the remaining 20% of the 33 wells using PR-RF. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 16.
Production prediction of the remaining 20% of the 33 wells using PR-NN. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 17.
Production prediction of the remaining 20% of the 33 wells using PFR-XGB. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 18.
Production prediction of the remaining 20% of the 33 wells using PFR-RF. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 19.
Production prediction of the remaining 20% of the 33 wells using PFR-NN. The black curve and blue dots represent the segmented production, and the red curve and green squares represent the predicted production.
Figure 20.
Pearson correlation coefficients of different models on different wells.
Figure 21.
Logging parameter distribution and Pearson correlation coefficient analysis of Well 6, Well 7, and Well 33.
Table 1.
Pipelines setup and abbreviation list.
| Pipeline Index | Pipeline Setup | Abbreviation |
|---|---|---|
| 1st | StandardScaler + PCA + XGBoost | PS-XGB |
| 2nd | StandardScaler + PCA + Random Forest | PS-RF |
| 3rd | StandardScaler + PCA + neural network | PS-NN |
| 4th | StandardScaler + PolynomialFeatures + XGBoost | PFS-XGB |
| 5th | StandardScaler + PolynomialFeatures + Random Forest | PFS-RF |
| 6th | StandardScaler + PolynomialFeatures + neural network | PFS-NN |
| 7th | RobustScaler + PCA + XGBoost | PR-XGB |
| 8th | RobustScaler + PCA + Random Forest | PR-RF |
| 9th | RobustScaler + PCA + neural network | PR-NN |
| 10th | RobustScaler + PolynomialFeatures + XGBoost | PFR-XGB |
| 11th | RobustScaler + PolynomialFeatures + Random Forest | PFR-RF |
| 12th | RobustScaler + PolynomialFeatures + neural network | PFR-NN |
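As a minimal sketch, a pipeline such as PS-RF (StandardScaler + PCA + Random Forest) can be assembled with scikit-learn's `Pipeline`; the step names (`"scaler"`, `"pca"`, `"model"`) and the synthetic data below are illustrative assumptions, not taken from the study.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

# Sketch of the PS-RF pipeline; hyperparameter values echo Table 3.
ps_rf = Pipeline([
    ("scaler", StandardScaler()),   # standardize each logging feature
    ("pca", PCA(n_components=6)),   # reduce to 6 principal components
    ("model", RandomForestRegressor(n_estimators=400, max_depth=10,
                                    min_samples_split=10, min_samples_leaf=4,
                                    max_features="sqrt", bootstrap=False,
                                    random_state=0)),
])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))              # 8 synthetic logging curves
y = X[:, 0] * 50.0 + rng.normal(size=100)  # synthetic production target
ps_rf.fit(X, y)
print(ps_rf.predict(X[:3]).shape)          # prints (3,)
```

The same scaffold yields the other eleven pipelines by swapping the scaler, feature-extraction step, and regressor listed above.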
Table 2.
Hyperparameter tuning combination list.
| Pipeline | Method | Hyperparameter | Range of Values |
|---|---|---|---|
| PS-XGB | PCA | n_components | 2/4/6 |
| | XGBoost | learning_rate | 0.01/0.1/0.5 |
| | | n_estimators | 50/100/200 |
| | | max_depth | 1/2/3 |
| | | min_child_weight | 1/3/5 |
| | | booster | "gbtree"/"gblinear" |
| PS-RF | PCA | n_components | 2/4/6 |
| | Random Forest | n_estimators | 400/500/600 |
| | | max_depth | 2/6/10 |
| | | min_samples_split | 10/15/20 |
| | | min_samples_leaf | 4/6/8 |
| | | max_features | "auto"/"sqrt" |
| | | bootstrap | True/False |
| PS-NN | PCA | n_components | 2/4/6 |
| | Neural network | hidden_layer_sizes | (50), (100), (200), (100, 50), (200, 100), (300, 200, 100), (400, 300, 200, 100) |
| | | activation function | "tanh"/"relu"/"logistic" |
| | | alpha | 0.0001/0.001/0.01/0.1 |
| | | max_iter | 3000/5000/7000 |
| PFS-XGB | PolynomialFeatures | poly__degree | 2/3/4 |
| | XGBoost | learning_rate | 0.01/0.1/0.5 |
| | | n_estimators | 50/100/200 |
| | | max_depth | 1/2/3 |
| | | min_child_weight | 1/3/5 |
| | | booster | "gbtree"/"gblinear" |
| PFS-RF | PolynomialFeatures | poly__degree | 2/3/4 |
| | Random Forest | n_estimators | 400/500/600 |
| | | max_depth | 2/6/10 |
| | | min_samples_split | 10/15/20 |
| | | min_samples_leaf | 4/6/8 |
| | | max_features | "auto"/"sqrt" |
| | | bootstrap | True/False |
| PFS-NN | PolynomialFeatures | poly__degree | 2/3/4 |
| | Neural network | hidden_layer_sizes | (50), (100), (200), (100, 50), (200, 100), (300, 200, 100), (400, 300, 200, 100) |
| | | activation function | "tanh"/"relu"/"logistic" |
| | | alpha | 0.0001/0.001/0.01/0.1 |
| | | max_iter | 3000/5000/7000 |
| PR-XGB | PCA | n_components | 2/4/6 |
| | XGBoost | learning_rate | 0.01/0.1/0.5 |
| | | n_estimators | 50/100/200 |
| | | max_depth | 1/2/3 |
| | | min_child_weight | 1/3/5 |
| | | booster | "gbtree"/"gblinear" |
| PR-RF | PCA | n_components | 2/4/6 |
| | Random Forest | n_estimators | 400/500/600 |
| | | max_depth | 2/6/10 |
| | | min_samples_split | 10/15/20 |
| | | min_samples_leaf | 4/6/8 |
| | | max_features | "auto"/"sqrt" |
| | | bootstrap | True/False |
| PR-NN | PCA | n_components | 2/4/6 |
| | Neural network | hidden_layer_sizes | (50), (100), (200), (100, 50), (200, 100), (300, 200, 100), (400, 300, 200, 100) |
| | | activation function | "tanh"/"relu"/"logistic" |
| | | alpha | 0.0001/0.001/0.01/0.1 |
| | | max_iter | 3000/5000/7000 |
| PFR-XGB | PolynomialFeatures | poly__degree | 2/3/4 |
| | XGBoost | learning_rate | 0.01/0.1/0.5 |
| | | n_estimators | 50/100/200 |
| | | max_depth | 1/2/3 |
| | | min_child_weight | 1/3/5 |
| | | booster | "gbtree"/"gblinear" |
| PFR-RF | PolynomialFeatures | poly__degree | 2/3/4 |
| | Random Forest | n_estimators | 400/500/600 |
| | | max_depth | 2/6/10 |
| | | min_samples_split | 10/15/20 |
| | | min_samples_leaf | 4/6/8 |
| | | max_features | "auto"/"sqrt" |
| | | bootstrap | True/False |
| PFR-NN | PolynomialFeatures | poly__degree | 2/3/4 |
| | Neural network | hidden_layer_sizes | (50), (100), (200), (100, 50), (200, 100), (300, 200, 100), (400, 300, 200, 100) |
| | | activation function | "tanh"/"relu"/"logistic" |
| | | alpha | 0.0001/0.001/0.01/0.1 |
| | | max_iter | 3000/5000/7000 |
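The double-underscore names such as `poly__degree` follow scikit-learn's pipeline parameter convention, `<step name>__<parameter>`, which is how GridSearchCV addresses a hyperparameter of a step nested inside a pipeline. A minimal sketch of this tuning setup follows; the deliberately truncated grid and the synthetic data are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Sketch of a PFR-RF-style search; step names are illustrative assumptions.
pipe = Pipeline([
    ("scaler", RobustScaler()),
    ("poly", PolynomialFeatures()),
    ("model", RandomForestRegressor(random_state=0)),
])

param_grid = {
    "poly__degree": [2, 3],        # <step>__<param> targets the "poly" step
    "model__n_estimators": [50],   # truncated grid for speed
    "model__max_depth": [2, 6],
}

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=60)

search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_["poly__degree"])
```

`search.best_params_` then contains entries of exactly the form listed in Table 3, e.g. `poly__degree`.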
Table 3.
Hyperparameter result list.
| Pipeline | Hyperparameter | Best Value |
|---|---|---|
| PS-XGB | n_components | 6 |
| | learning_rate | 0.5 |
| | n_estimators | 200 |
| | max_depth | 3 |
| | min_child_weight | 5 |
| | booster | "gbtree" |
| PS-RF | n_components | 6 |
| | n_estimators | 400 |
| | max_depth | 10 |
| | min_samples_split | 10 |
| | min_samples_leaf | 4 |
| | max_features | "sqrt" |
| | bootstrap | False |
| PS-NN | n_components | 6 |
| | hidden_layer_sizes | (400, 300, 200, 100) |
| | activation function | "tanh" |
| | alpha | 0.01 |
| | max_iter | 7000 |
| PFS-XGB | poly__degree | 3 |
| | learning_rate | 0.1 |
| | n_estimators | 200 |
| | max_depth | 3 |
| | min_child_weight | 5 |
| | booster | "gbtree" |
| PFS-RF | poly__degree | 3 |
| | n_estimators | 500 |
| | max_depth | 10 |
| | min_samples_split | 10 |
| | min_samples_leaf | 4 |
| | max_features | "sqrt" |
| | bootstrap | False |
| PFS-NN | poly__degree | 2 |
| | hidden_layer_sizes | (300, 200, 100) |
| | activation function | "logistic" |
| | alpha | 0.0001 |
| | max_iter | 3000 |
| PR-XGB | n_components | 6 |
| | learning_rate | 0.5 |
| | n_estimators | 200 |
| | max_depth | 3 |
| | min_child_weight | 1 |
| | booster | "gbtree" |
| PR-RF | n_components | 6 |
| | n_estimators | 600 |
| | max_depth | 10 |
| | min_samples_split | 10 |
| | min_samples_leaf | 4 |
| | max_features | "sqrt" |
| | bootstrap | False |
| PR-NN | n_components | 6 |
| | hidden_layer_sizes | (400, 300, 200, 100) |
| | activation function | "tanh" |
| | alpha | 0.01 |
| | max_iter | 3000 |
| PFR-XGB | poly__degree | 3 |
| | learning_rate | 0.1 |
| | n_estimators | 200 |
| | max_depth | 3 |
| | min_child_weight | 3 |
| | booster | "gbtree" |
| PFR-RF | poly__degree | 3 |
| | n_estimators | 400 |
| | max_depth | 10 |
| | min_samples_split | 10 |
| | min_samples_leaf | 4 |
| | max_features | "sqrt" |
| | bootstrap | False |
| PFR-NN | poly__degree | 2 |
| | hidden_layer_sizes | (200, 100) |
| | activation function | "logistic" |
| | alpha | 0.001 |
| | max_iter | 5000 |
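Best values like these can be fixed on a fresh pipeline with `set_params`, again using the `step__parameter` naming. A sketch for the PS-RF entries above, with step names assumed for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

# Apply the PS-RF best values via set_params (step names are assumptions).
ps_rf = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("model", RandomForestRegressor(random_state=0)),
])
ps_rf.set_params(
    pca__n_components=6,
    model__n_estimators=400,
    model__max_depth=10,
    model__min_samples_split=10,
    model__min_samples_leaf=4,
    model__max_features="sqrt",
    model__bootstrap=False,
)
print(ps_rf.get_params()["pca__n_components"])  # prints 6
```

Equivalently, `GridSearchCV(..., refit=True)` retrains the pipeline on the full training split with `best_params_` already applied.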
Table 4.
Taylor diagram evaluation index list.
| | Model | PCC | RMSE | SD |
|---|---|---|---|---|
| | Reference | 1 | 0 | 192.2 |
| | PS-XGB | 0.89 | 88.51 | 159.77 |
| | PS-RF | 0.861 | 101.78 | 137.29 |
| | PS-NN | 0.94 | 66.22 | 173.24 |
| | PFS-XGB | 0.933 | 70.32 | 168.11 |
| | PFS-RF | 0.916 | 81.86 | 148.87 |
| ★ | PFS-NN | 0.979 | 39.57 | 181.52 |
| | PR-XGB | 0.875 | 93.24 | 164.66 |
| | PR-RF | 0.86 | 102.93 | 134.86 |
| | PR-NN | 0.927 | 74.79 | 156.18 |
| | PFR-XGB | 0.929 | 74.67 | 156.25 |
| | PFR-RF | 0.908 | 84.63 | 148.45 |
| | PFR-NN | 0.974 | 44.47 | 177.97 |
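The three Taylor-diagram quantities are the Pearson correlation coefficient, the root-mean-square error, and the standard deviation of the predictions; in the standard Taylor construction the centered RMSE E', the two standard deviations, and the PCC R are linked by E'² = σ_pred² + σ_ref² − 2σ_pred·σ_ref·R. A sketch computing them from paired series follows; the synthetic data (scaled to echo the reference SD of 192.2) is an assumption.

```python
import numpy as np

def taylor_stats(real, pred):
    """PCC, RMSE, and population standard deviation of the predictions."""
    pcc = np.corrcoef(real, pred)[0, 1]
    rmse = np.sqrt(np.mean((np.asarray(real) - np.asarray(pred)) ** 2))
    sd = np.std(pred)
    return pcc, rmse, sd

rng = np.random.default_rng(0)
real = rng.normal(scale=192.2, size=200)         # synthetic "real" production
pred = real + rng.normal(scale=40.0, size=200)   # synthetic predictions

pcc, rmse, sd = taylor_stats(real, pred)
print(round(pcc, 3), round(rmse, 1), round(sd, 1))
```

A perfect model would plot at the reference point: PCC = 1, RMSE = 0, and SD equal to that of the real data, matching the Reference row above.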
Table 5.
Pipelines standard evaluation index list.
| | Model | R² | MAE | MSE |
|---|---|---|---|---|
| | PS-XGB | 0.79 | 42.43 | 7834.73 |
| | PS-RF | 0.72 | 41.33 | 10,360.13 |
| | PS-NN | 0.88 | 17.44 | 4385.2 |
| | PFS-XGB | 0.87 | 29.99 | 4944.32 |
| | PFS-RF | 0.82 | 29.39 | 6700.64 |
| ★ | PFS-NN | 0.96 | 9.03 | 1565.52 |
| | PR-XGB | 0.76 | 50.17 | 8693.67 |
| | PR-RF | 0.71 | 42.96 | 10,594.91 |
| | PR-NN | 0.85 | 22.21 | 5672.91 |
| | PFR-XGB | 0.85 | 31.21 | 5575.96 |
| | PFR-RF | 0.81 | 30.57 | 7162.82 |
| | PFR-NN | 0.95 | 10.8 | 1978 |
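These three indices are the standard regression metrics available in `sklearn.metrics`; note that MSE is simply the square of the RMSE reported in Table 4. A sketch on synthetic data (an assumption):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
real = rng.normal(scale=192.2, size=200)         # synthetic "real" production
pred = real + rng.normal(scale=40.0, size=200)   # synthetic predictions

r2 = r2_score(real, pred)                 # coefficient of determination
mae = mean_absolute_error(real, pred)     # mean absolute error
mse = mean_squared_error(real, pred)      # mean squared error = RMSE**2
print(round(r2, 2), round(mae, 2), round(mse, 2))
```

Higher R² and lower MAE/MSE indicate a better fit, which is why the starred PFS-NN row dominates the table on all three indices.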