This study develops an approximating function for the deposition rate of hydrates in gas-dominant subsea pipelines operating in environmental temperature conditions that favour hydrate formation. The main assumption is that the rate of hydrate deposition in a pipeline can be accurately predicted from the gas velocity, water volume fraction, subcooling temperature, and pipeline diameter, as supported by empirical evidence [2,3,4,6,7]. Stable hydrates form when the system temperature is below the hydrate equilibrium temperature. Subcooling lowers the gas temperature by the subcooling value into the stable hydrate zone. Before this regression model is used, the equilibrium pressure must be lower than the operating pressure of 8.8 MPa and the pipeline temperature must be less than 292 K, to ensure that hydrates are forming. The equilibrium hydrate formation pressure equation for methane at temperatures ranging from 0–25 °C by Sloan and Koh (2007) [25] is adopted to compute the minimum pressure required for hydrate formation. Stable methane hydrates are formed at temperatures below 292 K, as discussed in the literature [2,7]. Hence, the regression model is for natural gas with a methane content above 82% by composition. The model is based on the parametric simulations conducted using the validated CFD model for predicting the deposition rates of hydrates mentioned earlier. The data form an 81 × 5 matrix with a total of 405 data points. The basis for the selected variables, drawn from evidence in the literature [2,3,4,6,7], is as follows: (i) gas velocity defines the nature of fluid flow (laminar, transitional, or turbulent); (ii) hydrate formation, agglomeration, and pipe wall deposition are affected by the gas velocity; (iii) an increase in gas velocity under the same pressure and subcooling temperature increases the deposition rate of hydrates; and (iv) increasing the subcooling temperature of the pipeline at constant gas velocity also increases the deposition rate of hydrates. In addition, the CFD simulations indicate that: (i) an increase in pipeline diameter under the same gas flow condition increases the deposition rate by a similar factor; and (ii) an increase in the volume fraction of water reduces the deposition rate of hydrates. The developed regression model was validated with experimental studies. The stages of the method adopted in the development, validation, and application of this regression model are presented in Figure 1 below.
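The hydrate-forming check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the coefficients are an assumption (the commonly quoted exponential fit for pure methane between 0 and 25 °C, with pressure in kPa and temperature in K) and should be verified against [25]. The function names are hypothetical.

```python
import math

# ASSUMPTION: coefficients of the commonly quoted fit ln(P) = a + b / T for
# pure methane hydrate (0-25 degC), P in kPa and T in K. They are not stated
# in the text and should be checked against reference [25].
A_COEFF = 38.980
B_COEFF = -8533.80


def methane_equilibrium_pressure_mpa(temperature_k: float) -> float:
    """Return the assumed hydrate equilibrium pressure (MPa) at temperature_k."""
    if not 273.15 <= temperature_k <= 298.15:
        raise ValueError("correlation is assumed valid only for 0-25 degC")
    pressure_kpa = math.exp(A_COEFF + B_COEFF / temperature_k)
    return pressure_kpa / 1000.0


def hydrates_can_form(temperature_k: float, operating_pressure_mpa: float) -> bool:
    """True when the equilibrium pressure lies below the operating pressure."""
    return methane_equilibrium_pressure_mpa(temperature_k) < operating_pressure_mpa
```

For an illustrative subcooled gas temperature of 283 K, the sketch gives an equilibrium pressure of roughly 6.8 MPa. Natural gas mixtures containing heavier components would generally need a composition-specific correlation.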
2.1. Defining Variables and Data Generation
The data for the regression model development were obtained from the CFD simulations. The measured variables are defined based on the parametric studies conducted in the literature [7]. The validated CFD model is a pipe 10 m long and 0.0204 m in diameter, with a wall thickness of 0.0012 m. Because the pipeline is constructed from steel, the entire wall was set to the hydrate-forming temperature to mitigate the impact of pipe wall thickness on the subcooling temperature. The initial multiphase flow is made up of natural gas and water. The simulation was conducted in the commercial CFD software ANSYS Fluent, version 2020 R1, on a computer with an Intel Xeon Gold 6230 quad-core 2.10 GHz CPU and 16 GB of RAM. The input variables are operating pressure, temperature, water volume fraction, and gas velocity. This research utilises the Eulerian–Eulerian multiphase framework, incorporating boundary conditions and physical flow parameters primarily to improve the interaction between the gas and water interfaces. Previous CFD simulations of gas hydrates have favoured the Eulerian–Eulerian approach as the most suitable method for capturing interfacial interactions between gas and water [26,27]. Since hydrate deposition on the pipe wall is a near-wall viscous effect, the realisable k–ε two-equation turbulence model was employed to improve the modelling of near-wall viscosity in predicting the deposition of hydrates [7,28]. To improve the efficiency of multiphase flow in the oil and gas sector, pipeline designs aim to minimise frictional losses and pipe wall erosion, thus reducing pressure drops. Consequently, the study enhanced the stability of the CFD simulation by selecting the mesh size that yielded the least noticeable pressure drop, as determined through a mesh sensitivity analysis. This analysis was conducted at an inlet velocity of 10 m/s (equivalent to a flow rate of 3.3 kg/s), a temperature of 292 K, and a pressure of 8.8 MPa. Simulations were then run for different ranges of pipe diameter, gas velocity, subcooling temperature, and water volume fraction, and a total of eighty-one (81) deposition rates of hydrates were predicted from 81 simulations. The sample size was determined as recommended in the literature [29,30] using G*Power software, version 3.1 [31], with a conservative effect size of 0.30, because the CFD model was already validated with experimental results, and a statistical power of 95%, which yielded a minimum sample size of 72. Detailed documentation of the development and validation of the CFD model is available in the literature [7,32]. The input variables for the CFD simulations are defined in Table 1 as follows.
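One way to arrive at 81 runs from four input variables is a three-level full factorial design (3 × 3 × 3 × 3 = 81). The sketch below illustrates that bookkeeping only. The factorial structure is an assumption, and the level values are hypothetical placeholders rather than the settings of Table 1.

```python
from itertools import product

# ASSUMPTION: three levels per variable; every value below is a hypothetical
# placeholder and does not reproduce the settings listed in Table 1.
LEVELS = {
    "subcooling_K": (2.5, 5.0, 7.5),
    "diameter_m": (0.0204, 0.0408, 0.0612),
    "water_volume_fraction": (0.02, 0.04, 0.06),
    "gas_velocity_m_s": (2.0, 5.0, 8.0),
}


def build_design(levels):
    """Return one dict per simulation run for every combination of levels."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]


design = build_design(LEVELS)
n_runs = len(design)                      # number of CFD simulations
n_cells = n_runs * (len(LEVELS) + 1)      # four predictors plus one response
```

With four predictors and one response per run, the resulting table has 81 rows and 5 columns, matching the 405 data points mentioned earlier.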
2.2. Regression Model Development
The regressor variables are as defined earlier: the subcooling temperature, pipeline diameter, water volume fraction, and gas velocity are the predictors, while the deposition rate of hydrates is the outcome variable. This is represented in Figure 2, below.
Selecting a multiple regression model with the most appropriate explanatory and predictive power is difficult and depends on choosing a set of variables that defines the expected response. In MATLAB, multiple regression modelling can be carried out by standard linear regression, robust linear regression, interaction linear regression, and stepwise linear regression. The standard linear model is also known as ordinary least squares (OLS) estimation, in which the intercept and coefficients are chosen to minimise the error sum of squares [29]. However, there are instances where the data sets contain values that deviate strongly from the expected outcome, also known as outliers. When this occurs, as with some experimental outcomes, an alternative approach using robust linear regression may be adopted. The robust linear regression approach produces improved estimates by reducing the weights given to outlying cases when calculating the regression coefficients [29]. Thus, the presence of influential outliers in the data sets is ruled out when the outcome of the robust linear regression model compares favourably with the predictions of the OLS model. Both models are represented in Equation (10). The stepwise linear regression approach in Equation (11) was considered to enhance the predictability of hydrate deposition rates in MATLAB. In the stepwise regression approach, one variable at each stage is selected from a group of predictors that produces the highest coefficient of determination (R²). The selected variable is the regressor that produces the largest value of the F statistic [24], implying that variables are either added or removed at each step, leading to an iterative sequence of regression modelling. However, one problem with this approach is its high dependence on chance and the likely underestimation of predictive confidence intervals [29]. In Equation (11), two sets of interactions between two regressors were included with the four additive regressors of the OLS equation (Equation (10)). The last approach adopted is the interaction linear regression model in Equation (12). In the interaction approach, additional sets of interacting variables are added to the additive model of the original regressors used in the OLS. Here the interaction predictors are products of the original predictors [29]. The regression modelling approach adopted in this study did not consider squared forms of the input variables, to prevent overfitting, where the model fits the training data too closely. Moreover, when squared terms are introduced into a linear regression model, the response is no longer a linear function of the predictors, which undermines the assumed linear relationship between the predictor variables and the response variable. Furthermore, our adoption of a linear modelling approach is supported by the experimental results in the literature [2,4] used for the model validation.
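The difference between the additive OLS model and a model with interaction terms can be sketched in plain Python as below. This is an illustrative least-squares fit through the normal equations, not the MATLAB routines used in the study. The function names and the synthetic data are hypothetical.

```python
def design_row(x, interactions=()):
    """Return [1, x1..xp, xi*xj for each (i, j) pair in interactions]."""
    return [1.0, *x, *(x[i] * x[j] for i, j in interactions)]


def solve(a, b):
    """Solve a @ beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(xs, ys, interactions=()):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    rows = [design_row(x, interactions) for x in xs]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)


# Synthetic, hypothetical data: y = 1 + 2*x1 - 3*x2 + 0.5*x1*x2 (no noise).
xs = [(x1, x2) for x1 in (1.0, 2.0, 3.0) for x2 in (1.0, 2.0, 3.0)]
ys = [1 + 2 * x1 - 3 * x2 + 0.5 * x1 * x2 for x1, x2 in xs]

beta_additive = fit_ols(xs, ys)                            # intercept, x1, x2
beta_interaction = fit_ols(xs, ys, interactions=[(0, 1)])  # adds the x1*x2 term
```

On this noise-free example the interaction fit recovers the generating coefficients, while the additive fit has to absorb the product term into its intercept and slopes.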
2.3. Model Selection Criteria
The model adopted for the parametric studies was chosen using a combination of five model selection criteria: the error sum of squares (SSE), adjusted R-squared ($R^2_{adj}$), corrected Akaike information criterion (AICc), standard F test, and root mean square error (RMSE). Statistical significance was determined using the p-value at an alpha (α) level of 0.05. Lowering the significance level below 0.05 may shift the focus towards statistical significance at the expense of practical significance in the study, primarily due to the model’s limited ability to detect lower deposition rates. This shift can reduce proactive predictability and increase the vulnerability of the pipeline to hydrate plugging events. Each criterion is discussed below to provide insight into the parameters that influenced the predictive power of the chosen model.
Error sum of squares (SSE): In regression analysis, sums of squares describe the dispersion of the data about the mean. The residual sum of squares, or error sum of squares as used in this study, is computed from the residuals that remain after the model-fitting process. $SSR$ represents the regression sum of squares, the part of the variability captured by the fitted regression line. The total sum of squares ($SST$) describes the total variability in the research data. $SSE$, $SSR$, and $SST$ are defined in Equations (1)–(3) below.

$$SSE = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \tag{1}$$

$$SSR = \sum_{i=1}^{n} \left(\hat{y}_i - \bar{y}\right)^2 \tag{2}$$

$$SST = \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 = SSR + SSE \tag{3}$$

where $SSR$ is the regression sum of squares; $\hat{y}_i$ is the predicted value per data point; $y_i$ is the original target value; $\bar{y}$ is the mean of the data set; and $(\hat{y}_i - \bar{y})$ is the deviation of the predicted value per data point from the mean.
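The three sums of squares can be sketched as follows. A one-predictor least-squares line is used only to keep the example short, and the data values are hypothetical.

```python
def simple_ols(x, y):
    """Closed-form intercept and slope of a one-predictor least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sums_of_squares(y, y_hat):
    """Return (SSE, SSR, SST) for observations y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual scatter
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)            # explained by the fit
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total scatter
    return sse, ssr, sst


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = simple_ols(x, y)
y_hat = [intercept + slope * xi for xi in x]
sse, ssr, sst = sums_of_squares(y, y_hat)
```

For a least-squares fit that includes an intercept, the total sum of squares splits exactly into the regression and error parts, which is a useful sanity check on any implementation.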
Adjusted R-squared ($R^2_{adj}$): The coefficient of determination ($R^2$) is the ratio of $SSR$ to $SST$ (Equation (4)). Since the denominator is always greater than or equal to the numerator, the value lies between 0 and 1. The value of $R^2$ indicates the proportion of the variance in the outcome variable that is explained by the predictor variables. However, because $R^2$ never decreases as new variables are added to the regression equation, it is often problematic for comparing the fit of models with different numbers of predictors. To overcome this weakness, the R-squared is adjusted ($R^2_{adj}$) as in Equation (5), which penalises each additional predictor so that $R^2_{adj}$ decreases when an added variable does not sufficiently improve the fit [24], hence guarding against overfitting. Consequently, it is important to select the predictors that have a higher effect on the variance of the response variable.

$$R^2 = \frac{SSR}{SST} \tag{4}$$

$$R^2_{adj} = 1 - \frac{\left(1 - R^2\right)\left(n - 1\right)}{n - k - 1} \tag{5}$$

where $(n - k - 1)$ is the degree of freedom for the denominator, $k$ represents the number of measured predictor variables, and $n$ the total number of data points.
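A short sketch of the adjustment is given below. The sample size of 81 follows the text, whereas the $R^2$ value of 0.95 is a hypothetical figure chosen only to show the size of the penalty.

```python
def r_squared(ssr: float, sst: float) -> float:
    """Coefficient of determination as the ratio SSR / SST."""
    return ssr / sst


def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n data points and k predictor variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Hypothetical R-squared of 0.95 with n = 81 simulations.
adj_four_predictors = adjusted_r_squared(0.95, n=81, k=4)   # about 0.9474
adj_six_predictors = adjusted_r_squared(0.95, n=81, k=6)    # about 0.9459
```

With the same raw $R^2$, the model with six terms scores lower than the model with four, which is how the adjustment discourages adding terms that do not improve the fit.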
Standard F test: Another statistical measure for model selection is the standard F test, which tests the significance of the obtained value of $R^2$. It is used to determine whether the set of predictor variables explains a statistically significant amount of the variance in the outcome. Higher values of F indicate better model performance. The F statistic is estimated from Equation (6).

$$F = \frac{R^2 / k}{\left(1 - R^2\right) / \left(n - k - 1\right)} \tag{6}$$

where $k$ represents the number of predictor variables and $(n - k - 1)$ the statistical degree of freedom. Similarly, the p-value measures the statistical significance of the regression model or of individual coefficients within the model, and gives the probability of obtaining coefficients at least as extreme as those observed if the true effect were zero. The p-value is not directly used in model selection; however, it does provide evidence of the strength of the contributing variables in regression analysis.
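The F statistic can be evaluated directly from $R^2$, the number of predictors, and the sample size. The sketch below uses the sample size from the text (81) with a hypothetical $R^2$ of 0.95.

```python
def f_statistic(r2: float, n: int, k: int) -> float:
    """Overall F statistic for a regression with k predictors and n data points."""
    explained_per_predictor = r2 / k
    unexplained_per_dof = (1.0 - r2) / (n - k - 1)
    return explained_per_predictor / unexplained_per_dof


# Hypothetical R-squared of 0.95 with four predictors and 81 data points.
f_value = f_statistic(0.95, n=81, k=4)   # (0.95 / 4) / (0.05 / 76)
```

The same $R^2$ spread over more predictors gives a smaller F value, so the statistic rewards models that explain the variance with fewer terms.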
Root of mean square error (RMSE): This model selection measure is the standard deviation of the prediction errors, or residuals. The RMSE indicates how far, on average, the predictions lie from the observed values. Models with lower RMSE have higher predictive power. The RMSE is estimated from Equation (7) below, where the symbols $y_i$, $\hat{y}_i$, and $n$ are as defined earlier.

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2} \tag{7}$$
Akaike information criterion (AICc): The AICc supports the selection of the most fit-for-purpose model because it compares the quality of each model against the other models. It estimates the prediction error and hence the relative quality of competing models for a given set of data. The lowercase “c” in Equation (9) indicates that the calculated AIC value has been corrected for small samples to prevent overfitting, which is relevant here because both stepwise and interaction models are included in this study. The AIC is generally an estimate of the information loss, expressed through the likelihood function, $\hat{L}$. This index also takes into account the number of regression coefficients being tested [29]. When the experimental data sets for cross-validation are sparse, the AICc has been found to be more reliable than the F test [33]. The smaller the value of AICc, the better the model fit. AIC is estimated using Equation (8).

$$AIC = 2k - 2\ln\left(\hat{L}\right) \tag{8}$$

$$AICc = AIC + \frac{2k^2 + 2k}{n - k - 1} \tag{9}$$

where $\hat{L}$ is the likelihood function, $k$ is the number of predictor variables, and $(n - k - 1)$ the statistical degree of freedom, implying that, for the same likelihood, the more variables there are, the higher the AIC value. Thus, from the discussion above, the model selection criteria are defined as follows (Table 2).
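The small-sample correction can be sketched as below. The log-likelihood expression assumes normally distributed residuals, which is a modelling assumption of this example. The SSE values are hypothetical, and k counts predictor variables as in the text (some references count every estimated parameter instead).

```python
import math


def gaussian_log_likelihood(sse: float, n: int) -> float:
    """Maximised log-likelihood of a least-squares fit, ASSUMING normal errors."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sse / n) + 1)


def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion for k predictor variables."""
    return 2 * k - 2 * log_likelihood


def aicc(log_likelihood: float, k: int, n: int) -> float:
    """AIC with the small-sample correction term added."""
    return aic(log_likelihood, k) + (2 * k**2 + 2 * k) / (n - k - 1)


n = 81
# Hypothetical SSE values: two extra terms reduce the SSE only slightly.
log_l_additive = gaussian_log_likelihood(sse=4.0, n=n)
log_l_interaction = gaussian_log_likelihood(sse=3.9, n=n)
aicc_additive = aicc(log_l_additive, k=4, n=n)
aicc_interaction = aicc(log_l_interaction, k=6, n=n)
```

In this made-up case the small reduction in SSE does not offset the penalty for two extra terms, so the additive model has the lower AICc and would be preferred on this criterion.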
A simple ranking method was adopted, where the most favourable of the four models was awarded a score of 4 and the least favourable model was awarded a score of 1 on each selection parameter. The model with the highest sum was adopted for the prediction of hydrate deposition rates.
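The ranking scheme can be sketched as follows. The metric values are hypothetical placeholders that do not reproduce the study's results, and ties are not handled in this simple version.

```python
# True means a higher value is more favourable for that criterion.
CRITERIA = {"SSE": False, "R2adj": True, "AICc": False, "F": True, "RMSE": False}

# Hypothetical metric values, for illustration only.
METRICS = {
    "ols":         {"SSE": 4.0, "R2adj": 0.90, "AICc": 50.0, "F": 180.0, "RMSE": 0.22},
    "robust":      {"SSE": 4.2, "R2adj": 0.89, "AICc": 52.0, "F": 170.0, "RMSE": 0.23},
    "stepwise":    {"SSE": 2.5, "R2adj": 0.94, "AICc": 30.0, "F": 210.0, "RMSE": 0.18},
    "interaction": {"SSE": 2.0, "R2adj": 0.95, "AICc": 35.0, "F": 200.0, "RMSE": 0.16},
}


def rank_models(metrics, criteria):
    """Sum per-criterion scores: 1 for the least favourable model, up to 4."""
    totals = dict.fromkeys(metrics, 0)
    for name, higher_is_better in criteria.items():
        # Order from least to most favourable so the position gives the score.
        worst_first = sorted(
            metrics, key=lambda m: metrics[m][name], reverse=not higher_is_better
        )
        for score, model in enumerate(worst_first, start=1):
            totals[model] += score
    return totals


totals = rank_models(METRICS, CRITERIA)
best_model = max(totals, key=totals.get)
```

With five criteria and four models, the totals can range from 5 to 20. In this made-up example the interaction model scores 18 and is selected.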