Article

A Comparative Analysis of the Prediction of Gas Condensate Dew Point Pressure Using Advanced Machine Learning Algorithms †

1 Department of Petroleum Engineering, University of Houston, Houston, TX 77023, USA
2 Shell International E&P Inc., Westhollow Technology Center, 3333 Hwy 6 South, Houston, TX 77082, USA
3 Xecta Digital Labs, Houston, TX 77201, USA
* Author to whom correspondence should be addressed.
This paper is an extended version of our paper presented at the 2021 SPE Annual Technical Conference and Exhibition, Dubai, United Arab Emirates, 15 September 2021 (SPE-205997-MS).
Current address: Citibank, 3800 Citibank Center, Tampa, FL 33610, USA.
Fuels 2024, 5(3), 548-563; https://doi.org/10.3390/fuels5030030
Submission received: 8 May 2024 / Revised: 19 June 2024 / Accepted: 28 June 2024 / Published: 16 September 2024

Abstract

Dew point pressure (DPP) is a pivotal factor for forecasting reservoir dynamics in terms of the condensate-to-gas ratio, for addressing production and completion hurdles, and for calibrating EOS models for integrated simulation. However, DPP is challenging to predict. Acknowledging these complexities, we introduce a state-of-the-art approach for DPP estimation utilizing advanced machine learning (ML) techniques. Our methodology is compared against published empirical correlation-based methods on two datasets with limited sizes and diverse inputs. With superior performance over correlation-based estimators, our ML approach demonstrates adaptability and resilience even with restricted training datasets, spanning various fluid classifications. We acquired condensate PVT data from publicly available sources and GeoMark RFDBASE, encompassing dew point pressure (the target variable), as well as compositional data (mole percentages of each component), temperature, and the molecular weight (MW) and specific gravity (SG) of heptane plus, which served as input variables. Before initiating the study, thorough assessments of measurement quality were conducted using statistical methods and domain expertise. Subsequently, advanced ML techniques were employed to train predictive models with cross-validation to mitigate overfitting to the limited datasets. Our models were evaluated against the foremost published DPP estimators based on empirical correlations, with the correlation-based estimators also re-trained on the underlying datasets for an equitable comparison. To improve the outcomes, pseudo-critical properties and artificial proxy features were utilized, leveraging the generalized input data.

1. Introduction

For gas condensates, dew point pressure is harder to calculate or predict because of the complex behavior of the heavier components. In addition, measuring the dew point pressure is more difficult than measuring the bubble-point pressure, as it often relies on visual identification of the dew point. Current correlations published in the literature are limited both in their predictive capabilities and in their ranges of validity. Therefore, traditional direct experimental determination is still essential for the identification of dew points. The first published dew point pressure correlation dates back to 1947, when Sage and Olds [1] developed a simple correlation that uses stock tank oil API gravity, gas–oil ratio (GOR), and temperature to predict dew point pressure. Nemeth and Kennedy [2] developed a more complex correlation with 11 coefficients that calculates dew point pressure from compositional information, the bulk properties of the plus fraction (molecular weight and specific gravity of heptane plus), and reservoir temperature. Although the Nemeth and Kennedy [2] correlation was established about five decades ago, it is still one of the most widely used empirical correlations for quick calculation of dew point pressure. Elsharkawy [3] proposed a new correlation for dew point pressure prediction that uses the same input parameters as Nemeth and Kennedy [2], but the resulting equation is more complex, with higher degrees of freedom (19 coefficients in total). Apart from the traditionally developed empirical correlations, Ahmadi and Elsharkawy [4] and El-hoshoudy et al. [5] used genetic programming or gene expression programming (GP or GEP) to develop new sets of correlation equations for dew point pressure.
Such genetic programming (GP) and gene expression programming (GEP) methods are advantageous because they produce an explicit equation similar to traditional empirical correlations. However, the equations derived from these methods tend to be more complex, often containing numerous terms and significant nonlinearity. This complexity can sometimes lead to overfitting, where the model captures noise in the training data rather than the underlying trend. Therefore, more flexible machine learning models were developed for dew point pressure prediction [6,7]. Jalali et al. [6] used an artificial neural network (ANN) as the machine learning algorithm and utilized compositional data, molecular weight of heptane plus ( M W C 7 + ), and temperature as predictors on a 111-sample dataset. Alarfaj [8] compared several machine learning models including ANN, support vector machine (SVM), and basic decision trees using the same predictors as Jalali et al. [6]. The dataset of Alarfaj [8] includes the results from 98 PVT reports. Although the dataset size needed for machine learning depends on the complexity of the problem and the complexity of learning algorithms, a dataset with about 100 samples is still a typical small dataset [9]. Neural networks often exhibit severe overfitting problems, particularly when trained over a large number of iterations, especially with small datasets [10]. Alzahabi [7] compared the performance of linear regression, random forest, generalized additive models (GAMs with local smoothing), and neural networks (with a single hidden layer) on a dataset of 667 samples. The study concluded that these four models performed similarly in terms of accuracy. However, it is important to note that while neural networks can be powerful, they require careful tuning and sufficient data to avoid overfitting and achieve optimal performance.
The advanced machine learning algorithm chosen for this study is extreme gradient boosting, commonly abbreviated as XGBoost [11,12,13], due to its robustness and strong ability to capture nonlinear behavior, making it highly predictive. XGBoost [12,13], developed by Chen and Guestrin [13], quickly emerged as one of the most powerful and popular advanced machine learning algorithms among data scientists.
Based on the gradient boosting algorithm pioneered by Friedman [14,15], XGBoost utilizes an ensemble of weak prediction trees for robust prediction (see Sinha et al. [16] and Sinha et al. [17] for the mathematical details of gradient boosting methods (GBMs)). Its key advantages over other gradient boosting methods include the following features:
(1) The incorporation of regularization terms (γ and λ in Figure 1a) for the leaf weights (w_j in Figure 1a) and the number of leaf nodes per tree (T), restricting their values during loss minimization; this reduces the influence of each individual tree, leaving room for subsequent trees to improve the model.
(2) The use of second-order partial derivatives (as shown in Figure 1b) in the approximation of the loss function, providing more information about the gradient direction and the path that minimizes the loss.
(3) High prediction capability, facilitated by pre-sorted and histogram-based algorithms for splitting at each node; XGBoost [12,13] explores various candidate splits and selects the one with the best value of the scoring function derived from minimizing the loss function.
(4) Efficient handling of sparse data and missing values in the inputs.
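Items (1) and (2) can be made concrete with a small sketch of the regularized scoring XGBoost performs when growing a tree. The functions below are an illustrative re-implementation of the textbook formulas (optimal leaf weight w* = −G/(H + λ) and the split gain penalized by γ), not the library's actual code; for the squared-error loss the per-sample gradient is g_i = ŷ_i − y_i and the hessian is h_i = 1.

```python
# Illustrative sketch of XGBoost's regularized node scoring (not library code).
# G and H denote the sums of gradients and hessians over the samples in a node.

def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda); lambda shrinks each leaf."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain of splitting a node into left/right children.

    gamma is subtracted once per added leaf, pruning splits whose loss
    reduction does not cover the added model complexity.
    """
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Toy node: left-child gradients sum to -6 over 3 samples, right to 4 over 2.
gain = split_gain(GL=-6.0, HL=3.0, GR=4.0, HR=2.0, lam=1.0, gamma=0.5)
w_left = leaf_weight(-6.0, 3.0, lam=1.0)   # -(-6)/(3+1) = 1.5
```

Raising λ shrinks every leaf weight toward zero, and raising γ vetoes low-gain splits; both limit how much any single tree can claim, which is exactly the effect described in item (1).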
Furthermore, to enhance the prediction of dew point pressure and develop a robust model, gas pseudocritical properties (calculated from correlations) and artificial proxy features were incorporated into the XGBoost [12] inputs. This combination aims to further improve the predictive performance of the model.

2. Dataset Analysis

In this study, the gas condensate PVT data were obtained from GeoMark RFDBASE (RFDbase, the Rock and Fluid Database by GeoMark Research, Ltd. [18]), which contains company data as well as general-access data. To investigate the ability of advanced machine learning algorithms to predict dew point pressure, we used the entire mix of PVT data, with no upper or lower range limits and no region constraints. Based on previous research on predicting dew point pressure [2,3,6,7,8,19,20], the mole percentages of methane through heptane plus, nitrogen, carbon dioxide, and hydrogen sulfide, the reservoir temperature, the molecular weight of the reservoir fluid, and the molecular weight and specific gravity of heptane plus are widely used as predictors. These predictors were therefore also adopted during our data cleaning and categorization, before the application of machine learning methods. Beyond staying aligned with what has become the near-norm of predictor selection in the literature, we avoided additional input parameters because they tend to be hard to find or measure and expensive to obtain, which would largely defeat the purpose of estimating the dew point pressure. There is always a fine balance between the required input parameters and the predicted quantities, and this case is no different.
We screened the available data based on the availability and validity of the input proxies, unless other known issues disqualified a sample. The raw dataset contained 521 gas samples with relevant PVT information. Among those, 96 dry gas samples were discarded because they had no dew point pressure measurements, leaving 425 relevant samples. Second, because compositional data (mole % of each component) served as predictors, samples with missing or invalid compositional information had to be excluded; removing 4 such samples reduced the set to 421. Two further samples had missing molecular weights in the database, reducing the set to 419. In addition, 53 samples had no recorded specific gravity of the plus fraction (SG_Cn+), and 2 samples had an incorrect molecular weight of heptane plus, leaving 364 samples. Finally, reservoir temperature was missing for 1 sample, 1 sample had extremely high oil-based mud contamination, and 20 samples lacked accurate plus-fraction fluid properties, yielding the final 342 samples.
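The screening steps above can be sketched as a pandas filtering pipeline. The column names and the tiny in-memory table below are hypothetical stand-ins; the actual GeoMark RFDBASE schema and export format differ.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the real GeoMark RFDBASE schema differs.
raw = pd.DataFrame({
    "sample_type": ["gas condensate", "dry gas", "gas condensate", "gas condensate"],
    "p_dew":       [4500.0,  np.nan,  6200.0,  3100.0],
    "mol_pct_C1":  [70.1,    92.3,    np.nan,  65.4],
    "mw_C7plus":   [135.0,   np.nan,  150.0,   128.0],
    "sg_C7plus":   [0.78,    np.nan,  0.80,    0.76],
    "T_res_F":     [220.0,   180.0,   250.0,   205.0],
})

# Mirror the screening steps described above:
clean = raw[raw["sample_type"] != "dry gas"]          # drop dry gases (no dew point)
clean = clean.dropna(subset=["p_dew"])                # target must be measured
clean = clean.dropna(subset=["mol_pct_C1"])           # usable compositional data
clean = clean.dropna(subset=["mw_C7plus", "sg_C7plus", "T_res_F"])  # plus-fraction props

print(len(clean))  # 2 toy samples survive this screen
```

In the real workflow, the contamination and plus-fraction quality checks would be additional boolean filters applied in the same chained fashion.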
The final dataset of 342 samples was used with 16 predictors, namely the measured mole % of N 2 , C O 2 , H 2 S , C 1 , C 2 , C 3 , i C 4 , n C 4 , i C 5 , n C 5 , C 6 , and C 7 + ; molecular weight of C 7 + ; reservoir temperature; and specific gravity of C 7 + . The statistical information of the corresponding final dataset is shown in Table 1.

3. Development of the New Model

Several alternative advanced machine learning algorithms are evaluated in Appendix A, along with their performance statistics. Based on our study using the dataset, the results from XGBoost [12,13] were the best among those compared in Appendix A. Therefore, in this section, the outcomes and details of the XGBoost [12,13] model are discussed.
XGBoost
Traditionally, researchers have used a portion of the dataset to train the machine learning model and reserved the remainder as test data [2,3,6,8,19,20]. However, for the published dew point pressure prediction models, the datasets were usually under 500 samples, and the test sets under 100 samples [6,8,19,20]. In such cases, the test set cannot cover the whole range of dew point pressure, and because it is chosen randomly, the reported errors can vary with the random seed. In this study, K-fold cross-validation was applied to make full use of our dataset and thus allow a fair comparison between the advanced machine learning algorithm and the empirical correlations. In K-fold cross-validation, the whole dataset is randomly split into K subsets of equal size; for each fold, the rest of the dataset is used to train the model, which then predicts that held-out subset. For example, with 5-fold cross-validation, each iteration uses 4 folds as training data and 1 fold as test data, until all samples have been predicted. The merit of K-fold cross-validation is that each sample is predicted by a model built without that sample. The usual choice is K = 5 or K = 10 [21]; here, 10-fold cross-validation was applied in order to maintain a reasonable training set size.
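The 10-fold scheme can be sketched with scikit-learn's cross-validation utilities. The data below are a synthetic stand-in for the 342-sample, 16-predictor PVT table; the sketch uses scikit-learn's `GradientBoostingRegressor` so it runs with no extra dependencies, and `xgboost.XGBRegressor` (same fit/predict API) can be swapped in if the xgboost package is installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the 342-sample PVT table: 16 predictors, one target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))
y = X[:, 0] * 2.0 + X[:, 1] + rng.normal(scale=0.1, size=100)

# 10-fold CV: every sample is predicted by a model that never saw it.
# xgboost.XGBRegressor could replace this estimator without other changes.
model = GradientBoostingRegressor(random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

mae = float(np.mean(np.abs(y_pred - y)))
print(f"10-fold CV mean absolute error: {mae:.3f}")
```

Because `cross_val_predict` returns one out-of-fold prediction per sample, error statistics computed from `y_pred` use the entire dataset, which is exactly the property the text relies on for a fair comparison.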
In addition to the primary input variables, we also used artificial physical proxy features, including groups of input parameters, such as pseudocritical properties, via a selected correlation (Sutton [22]).
Sutton’s correlations
Pseudocritical properties of gases were applied to adjust the performance of XGBoost [12,13]. Gas mixture pseudocritical properties can be estimated from Sutton’s correlations based on gas-specific gravity (Sutton [22]).
Sutton's [22] correlations were developed from a database of over 3200 associated gas compositions and are suitable for gas condensates. They are expressed as follows:

T_pc = 164.3 + 357.7 γ_gHC − 67.7 γ_gHC^2

P_pc = 744 − 125.4 γ_gHC + 5.9 γ_gHC^2

where

γ_gHC = MW of the hydrocarbon (HC) gas / 28.967
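As a direct translation of the two equations above, a small helper can compute the pseudocritical pair from the hydrocarbon-gas specific gravity (function and variable names are ours, not from the paper):

```python
def sutton_pseudocriticals(sg_hc_gas):
    """Sutton [22] pseudocritical properties from hydrocarbon-gas specific gravity.

    sg_hc_gas = MW(hydrocarbon gas) / 28.967 (air = 1).
    Returns (T_pc in degrees Rankine, P_pc in psia).
    """
    t_pc = 164.3 + 357.7 * sg_hc_gas - 67.7 * sg_hc_gas ** 2
    p_pc = 744.0 - 125.4 * sg_hc_gas + 5.9 * sg_hc_gas ** 2
    return t_pc, p_pc

# Example: a typical condensate-gas specific gravity of 0.70.
t_pc, p_pc = sutton_pseudocriticals(0.70)
print(f"T_pc = {t_pc:.1f} degR, P_pc = {p_pc:.1f} psia")
```

These two outputs are the quantities fed into the artificial proxy features described below, alongside the raw predictors.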
Artificial proxy features (APFs)
Artificial proxy features were combined with XGBoost [12,13] and Sutton's correlations [22] for gas pseudocritical properties to develop a physically sound model and to improve the accuracy of its predictions. This is required to suitably model the effects of all variables relevant to dew point pressure. Each artificial proxy feature was selected based on its feature correlation with dew point pressure as the target, and a high feature-correlation distribution was observed, as shown in Figure 2 and Table 2. This indicates the following:
(a)
The molecular weight of the heptane plus fraction shows the strongest correlation with, and control over, the dew point pressure. Considering this observation, we initially developed two artificial proxy features, as follows:
  • The first artificial proxy feature (APF#1) was created from the molecular weight and the mole percent of the heptane plus fraction:
    APF#1 = MW_C7+ · z_C7+^3
  • The mole percent of methane was combined with the molecular weight of the heptane plus fraction to obtain the second artificial proxy feature (APF#2):
    APF#2 = MW_C7+ · z_C1
(b)
Mixture pseudocritical properties ( T p c and P p c ) contribute more to the dew point pressure than reservoir temperature and mole percentages of light, intermediate, and non-hydrocarbon components. The corrections of gas pseudocritical properties were carried out by adjusting the pseudocritical properties of heptane plus using Matthews [23] correlations as the third and fourth artificial proxy features (APF#3 and APF#4).
APF#3 = T_pc_corr = EXP(T_c,C7+ / T_pc)

APF#4 = P_pc_corr = EXP(P_c,C7+ / P_pc)
(c)
The effect of reservoir temperature was also needed in a different functional form. Therefore, three artificial proxy features were defined by integrating reservoir temperature with pseudocritical temperature ( T p c ).
APF#5 = (T_R / T_pc)^2

APF#6 = EXP(T_R / T_pc)

APF#7 = T_pc / T_R
(d)
The mole percents of the intermediates (iC4, nC4, iC5, nC5, and C6) and of H2S have different effects on the dew point pressure, as their interaction with the heptane plus fraction differs from that of the more volatile part of the fluid. The last artificial proxy feature (APF#8) was defined from the intermediate and H2S fractions.
APF#8 = z_C7+ / (z_iC4 + z_nC4 + z_iC5 + z_nC5 + z_C6 + z_H2S)
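Gathering APF#1 through APF#8 into one function makes the feature-engineering step explicit. The functional forms below follow the reconstruction in the text and should be checked against the original SPE-205997-MS paper; all names are ours.

```python
import math

def artificial_proxy_features(z, mw_c7p, tc_c7p, pc_c7p, t_res_R, t_pc, p_pc):
    """Compute APF#1-#8 as described in the text (forms reconstructed here).

    z        : dict of mole percents by component label
    mw_c7p   : molecular weight of heptane plus
    tc_c7p, pc_c7p : pseudocritical T (degR) and P (psia) of heptane plus
                     (e.g., from the Matthews [23] correlations)
    t_res_R  : reservoir temperature, degR
    t_pc, p_pc : gas mixture pseudocriticals (e.g., from Sutton [22])
    """
    z_int = z["iC4"] + z["nC4"] + z["iC5"] + z["nC5"] + z["C6"] + z["H2S"]
    return {
        "APF1": mw_c7p * z["C7+"] ** 3,
        "APF2": mw_c7p * z["C1"],
        "APF3": math.exp(tc_c7p / t_pc),   # T_pc correction
        "APF4": math.exp(pc_c7p / p_pc),   # P_pc correction
        "APF5": (t_res_R / t_pc) ** 2,
        "APF6": math.exp(t_res_R / t_pc),
        "APF7": t_pc / t_res_R,
        "APF8": z["C7+"] / z_int,
    }

# Illustrative (made-up) fluid:
z = {"C1": 70.0, "iC4": 1.0, "nC4": 1.0, "iC5": 0.5, "nC5": 0.5,
     "C6": 1.0, "C7+": 5.0, "H2S": 0.2}
apf = artificial_proxy_features(z, mw_c7p=140.0, tc_c7p=1100.0, pc_c7p=350.0,
                                t_res_R=660.0, t_pc=400.0, p_pc=650.0)
```

In the actual workflow these eight values are appended as extra columns to the 16 measured predictors before training XGBoost.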

4. Results and Discussion

Dew point pressure correlations determined by Nemeth and Kennedy [2], Elsharkawy [3], Ahmadi and Elsharkawy [4], and El-hoshoudy et al. [5] were used for comparison against our new model.
To perform an objective comparison against the selected set of leading dew point pressure correlations, these models were re-trained on the same dataset used in this study. The objective function used for tuning the Nemeth and Kennedy [2] and Elsharkawy [3] correlations is shown below: the summed absolute error over the training data is minimized. The Nelder–Mead algorithm [24,25], a commonly applied numerical method for minimization in multidimensional space, was used to minimize this objective function.
Obj(a) = Σ_{i=1}^{n_train} | p_dew,correlation(Z_i, T_i, MW_C7+,i, γ_C7+,i | a) − p_dew,i |

where a is the vector of correlation coefficients for dew point pressure.
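The coefficient-tuning step can be sketched with SciPy's Nelder–Mead implementation. The toy "correlation" below has only two coefficients and synthetic data, but the objective has exactly the summed-absolute-error form used for the real correlations.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for re-tuning a correlation: fit p_dew = a0 + a1 * T
# to synthetic measurements by minimizing the summed absolute error.
T = np.array([150.0, 200.0, 250.0, 300.0])
p_dew_meas = 1000.0 + 10.0 * T          # synthetic "measured" dew points

def objective(a):
    p_pred = a[0] + a[1] * T
    return float(np.sum(np.abs(p_pred - p_dew_meas)))

res = minimize(objective, x0=np.array([500.0, 5.0]), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 10000})
a_tuned = res.x
print(f"tuned coefficients: {a_tuned}, residual: {objective(a_tuned):.4f}")
```

The real tuning problems are higher-dimensional (11, 19, or 28 coefficients), so the same call is made with a longer `x0` vector, typically seeded with the originally published coefficients.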
The functional forms of the correlations by Nemeth and Kennedy [2], Elsharkawy [3], Ahmadi and Elsharkawy [4], and El-hoshoudy et al. [5] for dew point pressure prediction are given below. Using these forms, the Nelder–Mead algorithm [24] was applied to minimize the objective function above, tuning the 11 coefficients of Nemeth and Kennedy [2], 19 coefficients of Elsharkawy [3], 28 coefficients of Ahmadi and Elsharkawy [4], and 19 coefficients of El-hoshoudy et al. [5], thus updating these correlations.
Nemeth and Kennedy [2] correlation:
ln p_d = a1 (z_C2 + z_CO2 + z_H2S + z_C6 + 2 (z_C3 + z_C4) + z_C5 + 0.4 z_C1 + 0.2 z_N2) + a2 γ_C7+ + a3 (z_C1 / (z_C7+ + 0.002)) + a4 T_R + a5 (z_C7+ MW_C7+) + a6 (z_C7+ MW_C7+)^2 + a7 (z_C7+ MW_C7+)^3 + a8 (MW_C7+ / (γ_C7+ + 0.0001)) + a9 (MW_C7+ / (γ_C7+ + 0.0001))^2 + a10 (MW_C7+ / (γ_C7+ + 0.0001))^3 + a11
Elsharkawy [3] correlation:
p_d = a0 + a1 T_f + a2 z_H2S + a3 z_CO2 + a4 z_N2 + a5 z_C1 + a6 z_C2 + a7 z_C3 + a8 z_C4 + a9 z_C5 + a10 z_C6 + a11 z_C7+ + a12 MW_C7+ + a13 γ_C7+ + a14 (z_C7+ MW_C7+) + a15 (MW_C7+ / γ_C7+) + a16 (z_C7+ MW_C7+ / γ_C7+) + a17 (z_C7+ / (z_C1 + z_C2)) + a18 (z_C7+ / (z_C3 + z_C4 + z_C5 + z_C6))
Ahmadi and Elsharkawy [4] correlation:
p_d = a1 − T_f z_C1 a2 + T_f^2 a3 + z_C1 a4 + A a5

where

A = a6 − z_N2^(1/3) B a7 + z_N2^(1/3) C a8 + z_N2^(2/3) a9 + B a10 + B^2 a11 + C a12

B = a13 + MW_C7+^3 a14 − MW_C7+^3 z_C7+^(1/3) a15 − MW_C7+^6 a16 − z_C4^(1/3) z_C7+^(1/3) a17 + z_C4^(2/3) a18 + z_C7+^(1/3) a19 − z_C7+^(2/3) a20

C = a21 − z_C1 a22 + z_C1 γ_C7+^3 a23 + z_C1 z_C7+^(1/3) a24 + z_C1^2 a25 − γ_C7+^3 a26 + γ_C7+^3 z_C7+^(1/3) a27 − z_C7+^(2/3) a28
El-hoshoudy et al. [5] correlation:
p_d = x^3 + y^3

where

x = a0 + a1 z_C1 + a2 z_C2 + a3 z_C4 + a4 z_C4^2 + a5 z_C5 + a6 z_C6 + z_C7+ + a7 T_f / (a8 T_f + a9)

y = b0 + b1 z_C1 MW_C7+ + b2 z_C3 + b3 z_C7+ + b4 ln(z_C7+) + b5 z_CO2 + b6 z_N2 / (b7 γ_C7+ + b8)
The cross-plots of the proposed model (XGBoost [12,13] with pseudocritical property correlations from Sutton [22] and artificial proxy features) and the four empirical correlations are shown in Figure 3. The proposed model performs better than the other models even after recalibration of the existing correlation coefficients, as shown in Figure 3a. As seen in Figure 3c, the overall performance of the tuned Ahmadi and Elsharkawy [4] correlation is better than that of the other three correlations; however, it still shows overprediction when predicting dew point pressures higher than 8000 psia. In addition, based on Figure 3d, the tuned Elsharkawy [3] correlation performs better than the tuned Nemeth and Kennedy [2] and tuned El-hoshoudy et al. [5] correlations. As shown in Figure 3e, the tuned El-hoshoudy et al. [5] correlation predicts better than the Nemeth and Kennedy [2] correlation but still has a large range of prediction errors over the full range of dew point pressure. As shown in Figure 3f, the tuned Nemeth and Kennedy [2] correlation also underpredicts below 3000 psia while overpredicting above 6000 psia.
The distribution of the errors for all five methods is shown in Figure 4. The proposed model remains the best method compared against the other correlations, as shown in Figure 4a: it has a higher density of prediction errors between −200 psia and 200 psia. The prediction errors of the proposed model lie between −1400 psia and 1400 psia, with only a few samples outside this range, whereas the prediction errors of all four correlations span −3000 psia to 3000 psia. The proposed model also clearly has the highest peak density around zero; the other methods have similar peak densities around a prediction error of zero psia, and the higher that peak density, the better the performance of the model.
The error statistics of the five models are shown in Table 3; based on these results, the proposed model has the best performance. Adding the pseudocritical property correlations of Sutton [22] and the artificial proxy features to XGBoost [12,13] improves the error relative to XGBoost alone by an average of 35 psia. The proposed model performs best, with a mean absolute error of 470 psia (mean absolute relative error of 7.16%). The three correlations [3,4,5] have similar error statistics, and their performance is better than that of the previously published Nemeth and Kennedy [2] correlation (see Table 4): their mean absolute errors range from 678 psia to 722 psia (mean absolute relative errors of 10.69–11.36%), while the Nemeth and Kennedy [2] correlation has a mean absolute error of 838 psia (mean absolute relative error of 13.15%).
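The error metrics reported in Tables 3 and 4 (and Table A1) can be computed with a short helper; the three-point example below uses made-up values purely to exercise the function.

```python
import numpy as np

def error_statistics(p_meas, p_pred):
    """Error metrics of the kind reported in Tables 3, 4, and A1."""
    err = p_pred - p_meas
    rel = err / p_meas
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((p_meas - np.mean(p_meas)) ** 2))
    return {
        "mean_abs_error_psia": float(np.mean(np.abs(err))),
        "mean_abs_rel_error_pct": float(np.mean(np.abs(rel)) * 100.0),
        "mean_rel_error_pct": float(np.mean(rel) * 100.0),
        "pearson_r": float(np.corrcoef(p_meas, p_pred)[0, 1]),
        "r_square": 1.0 - ss_res / ss_tot,
    }

# Illustrative (made-up) measured vs. predicted dew points, psia:
stats = error_statistics(np.array([2000.0, 4000.0, 6000.0]),
                         np.array([2100.0, 3900.0, 6300.0]))
```

Applied to the out-of-fold predictions from the 10-fold cross-validation, this yields exactly the per-model rows tabulated in the paper.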

5. Summary and Conclusions

In this study, we reviewed a number of reports, and only higher-quality data were utilized (leading to the final 342 points). Some of the fluids had excessive contamination, some did not have reliable molecular weights and/or compositions, and for others an overall quality check via modeling indicated issues with the data. Compositional information (mole percentages of the hydrocarbons up to heptane plus and of the non-hydrocarbon gases carbon dioxide, nitrogen, and hydrogen sulfide), along with the molecular weight of the well stream, the molecular weight and specific gravity of heptane plus, and the reservoir temperature, was used as predictors to estimate the dew point pressures. The samples, as reflected by their compositional character, are representative of gas condensates throughout the world, with reservoir temperatures ranging from 60 °F to 381 °F and measured dew point pressures ranging from 1465 psia to 11,815 psia.
In developing the proposed model, an advanced machine learning algorithm, XGBoost [12,13,26,27], was first applied to the system. Then, XGBoost [12,13,26,27] was combined with the Sutton [22] correlations for gas pseudocritical properties and with eight artificial proxy features selected based on their feature-correlation responses to the dew point pressure measurements. The higher-ranked features were applied to the datasets.
The outcomes of our study were compared against the tuned correlations of Nemeth and Kennedy [2], Elsharkawy [3], Ahmadi and Elsharkawy [4], and El-hoshoudy et al. [5]. In general, high-molecular-weight substances have very pronounced effects on the dew point pressure, as their influence depends on both their quantity and their distribution. Truncating the compositions at the heptane plus level reduces the predictive capabilities of the proposed model (as well as of the correlations); i.e., such compositions are not granular enough. However, more detailed compositions are not always available, especially in older reports, and they may not always be accurate. Therefore, in this study, we used compositional detail down to the heptane plus level, as in the classical correlations.
Some of the key conclusions of this study are as follows:
  • XGBoost [12,13,26,27] with the gas pseudocritical property correlations of Sutton [22] and artificial proxy features gives the best overall estimate of dew point pressure, with a mean absolute error of 470 psia (mean absolute relative error = 7.16%), when compared to the tuned correlations of Nemeth and Kennedy [2], Elsharkawy [3], Ahmadi and Elsharkawy [4], and El-hoshoudy et al. [5].
  • After tuning, the correlations by Elsharkawy [3], Ahmadi and Elsharkawy [4], and El-hoshoudy et al. [5] for dew point pressure perform better than the Nemeth and Kennedy [2] correlation for dew point pressure. However, they all exhibit higher predictive errors for the elevated dew point pressure ranges (>8000 psia), while the proposed model performs better than all the selected correlations.
In conclusion, the implementation of the proposed model for precise dew point pressure prediction strengthens phase-equilibrium modeling, improving the understanding and control of fluid behavior in reservoirs, pipelines, and processing facilities, and it refines decision-making in fluid management and process optimization. Accurate determination of dew point pressure benefits reservoir engineering by supporting comprehensive reservoir characterization and optimal production strategy formulation. By accurately forecasting dew point pressure, operators can optimize natural gas extraction, ensuring maximal hydrocarbon recovery with minimal energy expenditure and operational overhead while improving overall production efficiency. Furthermore, a sound grasp of dew point pressure aids emission mitigation efforts, as it enables the fine-tuning of gas processing and handling protocols, curbing the release of greenhouse gases and pollutants into the atmosphere and advancing environmental sustainability. Precise dew point pressure estimation is likewise indispensable for CO2 capture and storage initiatives, underpinning sustainable carbon capture, utilization, and storage (CCUS) efforts aimed at combating climate change by mitigating greenhouse gas emissions.

Author Contributions

Conceptualization, T.L., B.D. and U.S.; methodology, T.L., B.D., L.L., X.Y. and U.S.; software, T.L., B.D., L.L. and X.Y.; validation, B.D. and X.Y.; formal analysis, L.L. and X.Y.; investigation, T.L. and L.L.; resources, B.D.; data curation, L.L.; writing—original draft, T.L. and X.Y.; writing—review & editing, U.S.; visualization, T.L.; supervision, B.D. and U.S. All authors have read and agreed to the published version of the manuscript.

Funding


Data Availability Statement

The data that have been used are confidential.

Acknowledgments

The authors express gratitude to Shell International Exploration and Production Inc. for authorizing the publication of this work. They also extend appreciation to Keat Hoe Foo from Shell for his assistance with the data. Utkarsh Sinha, a Volunteer Research Associate at the University of Houston, acknowledges the members of the Research Consortium on Interaction of Phase Behavior and Flow (IPB&F), along with its affiliates at the University of Houston.

Conflicts of Interest

Authors Xi Yang (new address: Citibank, 3800 Citibank Center, Tampa, FL 33610, USA) and Ligang Lu were employed by the company Shell International E&P Inc., Westhollow Technology Center. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Nomenclature

γ_g: Specific gravity of gas
γ_C7+: Specific gravity of heptane plus
a: Vector of correlation coefficients for dew point pressure
a_i: Constant, i = 1, 2, …, n
APF: Artificial proxy feature
b_i: Constant, i = 1, 2, …, n
GEP: Gene expression programming
GOR: Gas–oil ratio
LightGBM: Light gradient boosting machine
MW: Molecular weight, g/mole
MW_C7+: Molecular weight of heptane plus, g/mole
NN: Neural network
p_d, p_dew: Dew point pressure, psia
P_pc: Pseudocritical pressure, psia
P_pc_corr: Pseudocritical pressure correction
SG: Specific gravity
SG_C7+: Specific gravity of heptane plus
SG_Cn+: Specific gravity of carbon n plus
T_R: Reservoir temperature, °R
T_f: Reservoir temperature, °F
T_pc: Pseudocritical temperature, °R
T_pc_corr: Pseudocritical temperature correction
T_res: Reservoir temperature, °F
XGBoost: Extreme gradient boosting
z_C1: Mole percent of methane
z_C2: Mole percent of ethane
z_C3: Mole percent of propane
z_iC4: Mole percent of iso-butane
z_nC4: Mole percent of normal butane
z_iC5: Mole percent of iso-pentane
z_nC5: Mole percent of normal pentane
z_C6: Mole percent of hexane
z_C7+: Mole percent of heptane plus
z_CO2: Mole percent of carbon dioxide
z_H2S: Mole percent of hydrogen sulfide
z_i: Mole percent of component i
z_N2: Mole percent of nitrogen

Appendix A

The cross-plots of four advanced machine learning models are shown in Figure A1. From Figure A1a,b, it can be inferred that XGBoost and LightGBM perform similarly, as expected, while XGBoost has slightly better accuracy in the high dew point pressure range. The random forest regressor shows clear overprediction in the low range of dew point pressure (<4000 psia) and underprediction in the high range (>6000 psia), as shown in Figure A1c. The NN regressor with eight neurons in one hidden layer performs the worst among the four algorithms, as shown in Figure A1d: with a training dataset of limited size, neural networks overfit easily, which is the main cause of its poorer performance. The error distribution plot in Figure A2 confirms the observation in Figure A1 that XGBoost is the best of the four models, with only a few samples showing prediction errors outside the range of −2000 to 2000 psia. The prediction error statistics are shown in Table A1. XGBoost performs the best, with a mean absolute error of 505 psia (mean absolute relative error of 8.196%), while LightGBM is the second best, with a mean absolute error of 517 psia (mean absolute relative error of 8.361%). Even so, none of the machine learning models alone yields estimates close enough to the measured dew point pressure. Therefore, correlations of gas pseudocritical properties and artificial proxy features were added to the selected machine learning model, XGBoost, reducing the model error and improving its overall performance.
Figure A1. Cross-plots of measured p_dew vs. predicted p_dew using (a) XGBoost; (b) LightGBM; (c) random forest regressor; (d) NN regressor (8 neurons in 1 hidden layer).
Figure A2. Prediction error distribution plots of (a) XGBoost; (b) LightGBM; (c) random forest regressor; (d) NN regressor (8 neurons in 1 hidden layer).
Table A1. Ten-fold cross-validation prediction error statistics.
| Quantity | XGBoost | LightGBM | Random Forest | Neural Network |
|---|---|---|---|---|
| Mean absolute relative error (%) | 8.196 | 8.361 | 9.513 | 13.212 |
| Mean absolute error (psia) | 505 | 517 | 592 | 705 |
| Absolute relative error standard deviation (%) | 9.600 | 9.154 | 11.486 | 37.524 |
| Mean relative error (%) | 1.874 | 2.079 | 2.812 | 4.864 |
| Relative error standard deviation (%) | 12.483 | 12.222 | 14.646 | 39.484 |
| Pearson correlation | 0.952 | 0.951 | 0.934 | 0.831 |
| R-square | 0.905 | 0.903 | 0.872 | 0.661 |

References

  1. Sage, B.H.; Olds, R.H. Volumetric Behavior of Oil and Gas from Several San Joaquin Valley Fields. Trans. AIME 1947, 170, 156–173.
  2. Nemeth, L.K.; Kennedy, H.T. A Correlation of Dewpoint Pressure with Fluid Composition and Temperature. Soc. Pet. Eng. J. 1967, 7, 99–104.
  3. Elsharkawy, A.M. Characterization of the Plus Fraction and Prediction of the Dewpoint Pressure for Gas Condensate Reservoirs. In Proceedings of the SPE Western Regional Meeting, Bakersfield, CA, USA, 26–30 March 2001.
  4. Ahmadi, M.A.; Elsharkawy, A. Robust correlation to predict dew point pressure of gas condensate reservoirs. Petroleum 2016, 3, 340–347.
  5. El-hoshoudy, A.N.; Gomaa, S.; Desouku, S.M. Prediction of Dew Point Pressure in Gas Condensate Reservoirs Based on a Combination of Gene Expression Programming (GEP) and Multiple Regression Analysis. Pet. Petrochem. Eng. J. 2018, 2, 000163.
  6. Jalali, F.; Abdy, Y.; Akbari, M.K. Using Artificial Neural Network's Capability for Estimation of Gas Condensate Reservoir's Dew Point Pressure. In Proceedings of the EUROPEC/EAGE Conference and Exhibition, London, UK, 11–14 June 2007.
  7. Alzahabi, A.; El-Banbi, A.; Trindade, A.A.; Soliman, M. A regression model for estimation of dew point pressure from down-hole fluid analyzer data. J. Pet. Explor. Prod. Technol. 2017, 7, 1173–1183.
  8. Alarfaj, M.K.; Abdulraheem, A.; Busaleh, Y.R. Estimating Dewpoint Pressure Using Artificial Intelligence. In Proceedings of the SPE Saudi Arabia Section Young Professionals Technical Symposium, Dhahran, Saudi Arabia, 19–21 March 2012.
  9. Brownlee, J. How Much Training Data Is Required for Machine Learning? Machine Learning Mastery, 2017. Available online: https://machinelearningmastery.com/much-training-data-required-machine-learning/ (accessed on 9 November 2020).
  10. Maheswari, J.P. Breaking the Curse of Small Data Sets in Machine Learning, Part 2: How Data Size Impacts Deep Learning Models and How to Work with Small Datasets. Towards Data Science, 2019. Available online: https://towardsdatascience.com/breaking-the-curse-of-small-data-sets-in-machine-learning-part-2-894aa45277f4 (accessed on 10 November 2023).
  11. XGBDocs. Available online: https://xgboost.readthedocs.io/en/stable/python/index.html (accessed on 2 April 2022).
  12. Chen, R.-C.; Caraka, R.E.; Arnita; Goldameir, N.E.; Pomalingo, S.; Rachman, A.; Toharudin, T.; Tai, S.-K.; Pardamean, B. An end to end of scalable tree boosting system. Sylwan 2020, 165, 1–11.
  13. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  14. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
  15. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.
  16. Sinha, U.; Dindoruk, B.; Soliman, M.Y. Physics augmented correlations and machine learning methods to accurately calculate dead oil viscosity based on the available inputs. SPE J. 2022, 27, 3240–3253.
  17. Sinha, U.; Dindoruk, B.; Soliman, M. Physics guided data-driven model to estimate minimum miscibility pressure (MMP) for hydrocarbon gases. Geoenergy Sci. Eng. 2023, 224, 211389.
  18. GeoMark RFDBASE. Available online: https://www.geomarkresearch.com/rfdbase (accessed on 2 April 2022).
  19. Shokir, E.M. Dewpoint Pressure Model for Gas Condensate Reservoirs Based on Genetic Programming. Energy Fuels 2008, 22, 3194–3200.
  20. Kaydani, H.; Mohebbi, A.; Hajizadeh, A. Dew point pressure model for gas condensate reservoirs based on multi-gene genetic programming approach. Appl. Soft Comput. 2016, 47, 168–178.
  21. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009.
  22. Sutton, R.P. Fundamental PVT Calculations for Associated and Gas/Condensate Natural Gas Systems. In Proceedings of the SPE Annual Technical Conference and Exhibition, 2005; Paper SPE 97099.
  23. Matthews, T.A.; Roland, C.H.; Katz, D.L. High pressure gas measurement. In Proceedings of the Twenty-First Annual Convention, NGAA, 1942.
  24. Nelder, J.A.; Mead, R. A simplex method for function minimization. Comput. J. 1965, 7, 308–313.
  25. Gao, F.; Han, L. Implementing the Nelder–Mead simplex algorithm with adaptive parameters. Comput. Optim. Appl. 2012, 51, 259–277.
  26. Swalin, A. CatBoost vs. Light GBM vs. XGBoost. Towards Data Science, 2018. Available online: https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db (accessed on 28 June 2024).
  27. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
Figure 1. Brief summary of the loss function minimized in XGBoost [12,13].
Figure 2. Feature correlation with dew point pressure heatmap for XGBoost [12,13] with Sutton [22] pseudocritical properties.
Figure 3. Cross-plots of measured p_dew vs. predicted p_dew using (a) XGBoost [12,13] with Sutton [22] pseudocritical properties and APF; (b) XGBoost [12,13]; (c) tuned Ahmadi and Elsharkawy [4] correlation; (d) tuned Elsharkawy [3] correlation; (e) tuned El-hoshoudy et al. [5] correlation; (f) tuned Nemeth and Kennedy [2] correlation.
Figure 4. Prediction error distribution plots using (a) XGBoost [12,13] with Sutton [22] pseudocritical properties and APF; (b) XGBoost [12,13]; (c) tuned Ahmadi and Elsharkawy [4] correlation; (d) tuned Elsharkawy [3] correlation; (e) tuned El-hoshoudy et al. [5] correlation; (f) tuned Nemeth and Kennedy [2] correlation.
Table 1. Dataset (342 samples) statistical summaries.
| Quantity | Mean | Standard Deviation | Min | Median | Max |
|---|---|---|---|---|---|
| Reservoir Temperature (°F) | 219.04 | 50.74 | 60.00 | 217.00 | 381.00 |
| Dew Point Pressure (psia) | 6855 | 2223 | 1465 | 6648 | 11815 |
| N2 (mole %) | 0.50 | 0.80 | 0.00 | 0.30 | 10.15 |
| CO2 (mole %) | 2.28 | 3.52 | 0.00 | 0.47 | 33.72 |
| H2S (mole %) | 0.07 | 0.97 | 0.00 | 0.00 | 16.83 |
| C1 (mole %) | 77.75 | 10.25 | 44.62 | 76.53 | 99.06 |
| C2 (mole %) | 5.78 | 2.73 | 0.19 | 5.64 | 14.15 |
| C3 (mole %) | 3.15 | 1.55 | 0.05 | 3.11 | 12.36 |
| iC4 (mole %) | 0.64 | 0.33 | 0.01 | 0.49 | 4.36 |
| nC4 (mole %) | 1.26 | 0.63 | 0.00 | 1.28 | 3.73 |
| iC5 (mole %) | 0.52 | 0.33 | 0.01 | 0.49 | 4.36 |
| nC5 (mole %) | 0.60 | 0.31 | 0.00 | 0.60 | 1.85 |
| C6 (mole %) | 0.84 | 0.52 | 0.01 | 0.84 | 4.81 |
| C7+ (mole %) | 6.61 | 3.88 | 0.09 | 6.69 | 17.08 |
| Molecular Weight of C7+ (g/mole) | 174.68 | 34.08 | 106.00 | 170.81 | 283.16 |
| Specific Gravity of C7+ | 0.81 | 0.03 | 0.72 | 0.81 | 0.90 |
| Reservoir Fluid MW (g/mole) | 31.95 | 9.47 | 16.40 | 31.13 | 65.97 |
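The per-feature summary statistics in Table 1 can be generated column by column; a minimal sketch, assuming each feature is available as a numeric array (the sample standard deviation is an assumption, since the paper does not state whether the population or sample estimator was used):

```python
import numpy as np

def summarize(column):
    """Return the Table 1 statistics for one feature column."""
    x = np.asarray(column, dtype=float)
    return {
        "mean": x.mean(),
        "std": x.std(ddof=1),   # sample standard deviation (assumed)
        "min": x.min(),
        "median": np.median(x),
        "max": x.max(),
    }
```

Running this over the 342-sample dataset's 17 feature columns reproduces the layout of Table 1.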
Table 2. Feature correlation distribution with dew point pressure using XGBoost [12,13] with pseudocritical properties according to Sutton [22].
| Feature | Correlation with Pdew |
|---|---|
| Dewpoint Pressure | 1.0000 |
| Molecular Weight of C7+ | 0.7539 |
| Specific Gravity of C7+ | 0.7041 |
| Mole % of C2 | 0.3741 |
| Mole % of C7+ | 0.3186 |
| Tpc | 0.3140 |
| Ppc | 0.3059 |
| Reservoir Fluid MW | 0.3040 |
| Mole % of C3 | 0.2854 |
| Mole % of CO2 | 0.2839 |
| Mole % of N2 | 0.2413 |
| Mole % of iC4 | 0.1912 |
| Mole % of nC4 | 0.1665 |
| Reservoir Temperature | 0.1473 |
| Mole % of H2S | 0.1261 |
| Mole % of iC5 | 0.1139 |
| Mole % of nC5 | 0.0340 |
| Mole % of C6 | 0.0015 |
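A ranking like Table 2 can be produced by correlating each input feature against the dew point pressure; a sketch using NumPy, where taking the absolute Pearson correlation is an assumption based on the all-positive values reported in Table 2:

```python
import numpy as np

def rank_features(X, names, y):
    """Rank features by |Pearson correlation| with the target, as in Table 2.

    X: (n_samples, n_features) array; names: feature names; y: target values.
    """
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    # Highest correlation first, matching the ordering of Table 2.
    return sorted(zip(names, corrs), key=lambda t: -t[1])
```

A ranking of this kind is often used for screening; here it motivates including MW and SG of C7+ as inputs, since they correlate most strongly with p_dew.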
Table 3. Configuration of XGBoost [12,13] method used in this work.
| Parameter | Value |
|---|---|
| Inputs | Reservoir Temperature (°F); mole % of N2, CO2, C1, C2, C3, nC4, iC4, nC5, iC5, C6, C7+; MW of C7+; SG of C7+; Tpc; Ppc; APF #1, 2, 3, 4, 5, 6, 7, 8 |
| Test–Train Split | 30–70% |
| Cross Validation Folds (Training Data) | 10 |
| booster | gbtree |
| objective | reg:squarederror |
| learning_rate | 0.01 |
| max_depth | 6 |
| min_child_weight | 1 |
| subsample | 0.5 |
| colsample_bytree | 0.5 |
| reg_alpha | 23 |
| reg_lambda | 0.5 |
| gamma | 75 |
The description of all configuration hyperparameters can be found in XGBDocs, 2022 [11].
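The hyperparameters in Table 3 map directly onto XGBoost's Python parameter names; a sketch of the configuration as a parameter dictionary (the dictionary itself follows the table, but treating this as the authors' exact training script is an assumption):

```python
# Hyperparameters from Table 3 as an XGBoost parameter dict.
# See XGBDocs [11] for the meaning of each key.
params = {
    "booster": "gbtree",
    "objective": "reg:squarederror",
    "learning_rate": 0.01,
    "max_depth": 6,
    "min_child_weight": 1,
    "subsample": 0.5,
    "colsample_bytree": 0.5,
    "reg_alpha": 23,       # strong L1 regularization
    "reg_lambda": 0.5,     # L2 regularization
    "gamma": 75,           # high min split loss, guarding against overfitting
}

# Typical usage (assumes the `xgboost` package is installed):
# import xgboost as xgb
# model = xgb.XGBRegressor(**params)
# model.fit(X_train, y_train)
```

The low learning rate, 50% row/column subsampling, and large `reg_alpha`/`gamma` values are all conservative choices consistent with training on a small dataset.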
Table 4. Ten-fold cross-validation prediction error statistical results.
| Quantity | XGBoost [12,13] with Sutton [22] Pseudocritical Properties and APF | XGBoost [12,13] | Tuned Ahmadi and Elsharkawy [4] | Tuned Elsharkawy [3] | Tuned El-hoshoudy et al. [5] | Tuned Nemeth and Kennedy [2] |
|---|---|---|---|---|---|---|
| Mean absolute relative error (%) | 7.160 | 8.196 | 10.693 | 11.075 | 11.359 | 13.149 |
| Mean absolute error (psia) | 470 | 505 | 678 | 722 | 715 | 838 |
| Absolute relative error standard deviation (%) | 7.228 | 9.600 | 10.746 | 10.058 | 11.461 | 12.625 |
| Mean relative error (%) | 1.222 | 1.874 | 0.007 | −0.203 | 0.984 | 4.650 |
| Relative error standard deviation (%) | 10.101 | 12.483 | 15.160 | 14.960 | 16.107 | 17.625 |
| Pearson correlation | 0.957 | 0.952 | 0.912 | 0.902 | 0.904 | 0.880 |
| R-square | 0.916 | 0.905 | 0.824 | 0.809 | 0.812 | 0.736 |
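The "tuned" correlations compared above were refit to the underlying dataset, and the cited Nelder–Mead references [24,25] indicate a derivative-free simplex search. A hedged sketch of such coefficient refitting with SciPy, where the `predict(coeffs, inputs)` interface and the mean-squared-error objective are illustrative assumptions, not the paper's actual code:

```python
import numpy as np
from scipy.optimize import minimize

def tune_correlation(predict, x0, inputs, measured):
    """Refit a correlation's coefficients by minimizing the mean squared
    prediction error with the Nelder-Mead simplex method [24,25]."""
    def mse(coeffs):
        return np.mean((predict(coeffs, inputs) - measured) ** 2)
    result = minimize(mse, x0, method="Nelder-Mead",
                      options={"maxiter": 10000, "xatol": 1e-10, "fatol": 1e-10})
    return result.x

# Toy example: recover the coefficients of a hypothetical linear
# pseudo-correlation p_dew = a * T + b from synthetic "measurements".
T = np.array([150.0, 200.0, 250.0, 300.0])
p_meas = 20.0 * T + 1500.0
coeffs = tune_correlation(lambda c, t: c[0] * t + c[1], [1.0, 1.0], T, p_meas)
```

The same routine applied to each published correlation's coefficient vector gives an equitable comparison against the ML models, since every estimator is then fit to the same data.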

Share and Cite

Lertliangchai, T.; Dindoruk, B.; Lu, L.; Yang, X.; Sinha, U. A Comparative Analysis of the Prediction of Gas Condensate Dew Point Pressure Using Advanced Machine Learning Algorithms. Fuels 2024, 5, 548-563. https://doi.org/10.3390/fuels5030030

