Prediction of Pipe Failure Rate in Heating Networks Using Machine Learning Methods

Beloev, Hristo Ivanov; Saitov, Stanislav Radikovich; Filimonova, Antonina Andreevna; Chichirova, Natalia Dmitrievna; Babikov, Oleg Evgenievich; Iliev, Iliya Krastev

doi:10.3390/en17143511


by Hristo Ivanov Beloev ¹, Stanislav Radikovich Saitov ², Antonina Andreevna Filimonova ², Natalia Dmitrievna Chichirova ², Oleg Evgenievich Babikov ² and Iliya Krastev Iliev ³,*

¹ Department Agricultural Machinery, “Angel Kanchev” University of Ruse, 7017 Ruse, Bulgaria

² Department Nuclear and Thermal Power Plants, Kazan State Power Engineering University, 420066 Kazan, Russia

³ Department of Heat, Hydraulics and Environmental Engineering, “Angel Kanchev” University of Ruse, 7017 Ruse, Bulgaria

* Author to whom correspondence should be addressed.

Energies 2024, 17(14), 3511; https://doi.org/10.3390/en17143511

Submission received: 1 July 2024 / Revised: 15 July 2024 / Accepted: 16 July 2024 / Published: 17 July 2024

(This article belongs to the Section F5: Artificial Intelligence and Smart Energy)


Abstract:

The correct prediction of heating network pipeline failure rates can increase the reliability of the heat supply to consumers in the cold season. However, due to the large number of factors affecting the corrosion of underground steel pipelines, it is difficult to achieve high prediction accuracy. The purpose of this study is to identify connections between the failure rate of heating network pipelines and factors not taken into account in traditional methods, such as residual pipeline wall thickness, soil corrosion activity, previous incidents on the pipeline section, flooding (traces of flooding) of the channel, and intersections with communications. To achieve this goal, the following machine learning algorithms were used: random forest, gradient boosting, support vector machines, and artificial neural networks (multilayer perceptron). The data were collected on incidents related to the breakdown of heating network pipelines in the cities of Kazan and Ulyanovsk. Based on these data, four intelligent models have been developed. The accuracy of the models was compared. The best result was obtained for the gradient boosting regression tree, as follows: MSE = 0.00719, MAE = 0.0682, and MAPE = 0.06069. The feature «Previous incidents on the pipeline section» was excluded from the training set as the least significant.

Keywords:

machine learning; heating network; evaluation of the value feature; evaluation of heat supply reliability; intelligent model

1. Introduction

The pipeline is the main method of transporting energy resources such as hot water, natural gas, oil, ammonia, etc. The life and health of people, the safety of the environment, and the existence and development of the national economy depend on the reliable functioning of the pipeline transport system.

A leakage of the energy carrier is considered one of the most serious potential threats to pipeline transport. Due to the extent of the territories, complex geology, aggressive environment, and heterogeneity of soil properties, steel pipelines are susceptible to external corrosion, which reduces their safety and service life. According to the 11th EGIG report, for the period from 2010 to 2019, incidents related to energy carrier leaks occurred most often due to pipeline corrosion—38% of all recorded cases [1] (p. 24).

In this regard, methods that allow for the assessment of the reliability of heat supply and the remaining service life of pipelines play an important role. Predictions obtained using these methods can prevent accidents and reduce capital costs for large-scale pipeline system replacement due to timely maintenance work. In the last few years, there has been an increase in the number of scientific studies related to predicting the residual strength and service life of pipelines [2], which also indicates the relevance of this issue.

Currently used guidelines and industrial regulations (traditional methods) take into account only a small part of the factors affecting the reliability of the pipeline systems’ functioning. In traditional methods, factors such as the length [3], outside diameter [4,5] (radius [5,6]), wall thickness [2,3,4,5,6], and service life [3] of pipeline sections are mainly taken into account. This is due to the fact that in real pipeline systems, many significant factors (the corrosiveness of the soil, the presence of stray currents, wall thinning, etc.) do not have a clear functional connection with their reliability.

A solution to this problem was found in the use of so-called intelligent methods (artificial neural networks, fuzzy logic, chaos theory, support vector machine, etc.). However, as noted by the authors Li H., Huang K., et al. [2] in their work, models created based on these methods have low stability of results, which makes them unsuitable for real operating conditions.

This complexity can be resolved through a combined application of traditional and intelligent methods during the development of a general model. This approach will expand the list of significant factors in the model while maintaining the stability of its results. Consequently, it will be possible to more objectively assess the condition of underground steel pipelines.

The purpose of this study is to determine significant factors affecting the heat supply reliability that are not taken into account in traditional methods, as well as to establish connections between these factors and the failure rate of heating network pipelines.

2. Current State of the Research Area

A review of open sources devoted to assessing the residual strength and/or forecasting the residual service life of pipelines showed that all existing models can be divided into traditional (evaluation) and intelligent (predictive).

2.1. Traditional Evaluation Models

In the late 1960s, the Texas Eastern Transportation Company and American Natural Gas Association (AGA) conducted extensive research into pipeline corrosion damage, which led to the development of the NG-18 formula [6], which predicts the burst pressure of a defective pipeline.

In 1984, the earliest method for assessing the residual strength of corroded pipelines, B31G, was proposed by the American Society of Mechanical Engineers (ASME) based on the NG-18 formula. For a “short” defect, it is assumed that the corrosion zone has a parabolic shape with a curved bottom. In this case, the burst pressure (p_b, MPa) is determined using the following equation:

p_{b} = \frac{2 \cdot t \cdot σ_{f}}{D} \cdot [\frac{1 - (2 / 3) \cdot (d / t)}{1 - (2 / 3) \cdot (d / t) / M}], for \frac{l^{2}}{D \cdot t} \leq 20

(1)

where t and D are the wall thickness and outside diameter of the pipeline, m, respectively; l and d are the length and depth of the corrosion zone, m, respectively; σ_f is the flow stress, MPa; and M is the Folias bulging coefficient.

For a “long” corrosion defect, the corrosion area is simplified to a rectangle with a flat bottom. In this case, the burst pressure is determined using the following equation:

p_{b} = \frac{2 \cdot t \cdot σ_{f}}{D} \cdot [1 - \frac{d}{t}], for \frac{l^{2}}{D \cdot t} > 20

(2)

In expressions (1) and (2), the flow stress is σ_f = 1.1·R_{t0.5}, where R_{t0.5} is the minimum yield strength, MPa, determined according to the API 5L specification [7]. The Folias bulging coefficient is calculated using Equation (3), as follows:

M = \sqrt{1 + 0.8 \cdot {(\frac{l}{\sqrt{D \cdot t}})}^{2}}

(3)
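Equations (1)–(3) can be illustrated with a minimal Python sketch. It assumes the standard B31G form of the Folias coefficient, M = sqrt(1 + 0.8·l²/(D·t)); the function names and the sample pipe dimensions are hypothetical and are not taken from the study.

```python
import math


def folias_factor(l, D, t):
    """Folias bulging coefficient M, Equation (3), in the standard B31G form."""
    return math.sqrt(1.0 + 0.8 * l**2 / (D * t))


def burst_pressure_b31g(t, D, l, d, yield_strength):
    """Burst pressure p_b (MPa) according to Equations (1) and (2).

    t, D, l, d are in metres; yield_strength (R_t0.5) is in MPa.
    """
    flow_stress = 1.1 * yield_strength          # sigma_f = 1.1 * R_t0.5
    intact = 2.0 * t * flow_stress / D          # burst pressure of a defect-free pipe
    if l**2 / (D * t) <= 20.0:                  # "short" defect, Equation (1)
        M = folias_factor(l, D, t)
        ratio = (2.0 / 3.0) * (d / t)
        return intact * (1.0 - ratio) / (1.0 - ratio / M)
    return intact * (1.0 - d / t)               # "long" defect, Equation (2)


# Hypothetical pipe: D = 0.5 m, t = 10 mm, defect depth 3 mm, steel with R_t0.5 = 245 MPa.
p_short = burst_pressure_b31g(t=0.01, D=0.5, l=0.1, d=0.003, yield_strength=245.0)
p_long = burst_pressure_b31g(t=0.01, D=0.5, l=0.5, d=0.003, yield_strength=245.0)
print(f"short defect: {p_short:.2f} MPa, long defect: {p_long:.2f} MPa")
```

With the same defect depth, the long defect gives the lower burst pressure, because Equation (2) removes the full defect depth from the load-bearing wall.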

Subsequently, all newly developed foreign methods of this series (Mod. B31G, SY/T 6151-2009, DNV-RP-F101, PCORRC, RSTRENG, etc.) were based on the B31G standard [2,8]. These methods have proven to be highly effective in assessing the condition of gas pipelines. However, for heat pipelines, it is impossible to obtain the parameters d and l without the complete removal of thermal insulation, which is not possible on the scale of the entire heating network.

In 2013, the Russian company Gazprom Promgaz JSC proposed a methodology and algorithm for calculating the reliability of heating networks when developing city heat supply schemes [9]. According to this methodology, the failure rate of heating network elements λ, 1/(km·h) is determined using the following equation:

λ = λ_{0} \cdot {(0.1 \cdot τ)}^{α - 1}

(4)

where τ is the pipeline service life, years; λ₀ is the initial failure rate of 1 km of single-line heat pipeline, obtained using the Weibull distribution equation, λ₀ = 5.7·10⁻⁶ 1/(km·h); and α is the coefficient taking into account the duration of the pipeline section operation, as follows:

α (τ) = \begin{cases} 0.8, & 0 < τ \leq 3 \\ 1.0, & 3 < τ \leq 17 \\ 0.5 \cdot e^{τ / 20}, & τ > 17 \end{cases}

(5)
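As a rough illustration, Equations (4) and (5) can be evaluated with the short Python sketch below. The function names are hypothetical; the value of λ₀ is the one quoted in the text.

```python
import math

LAMBDA_0 = 5.7e-6  # initial failure rate, 1/(km*h), as quoted in the text


def alpha(tau):
    """Coefficient alpha as a function of service life tau (years), Equation (5)."""
    if tau <= 0:
        raise ValueError("service life must be positive")
    if tau <= 3:
        return 0.8
    if tau <= 17:
        return 1.0
    return 0.5 * math.exp(tau / 20.0)


def failure_rate(tau, lambda_0=LAMBDA_0):
    """Failure rate lambda, 1/(km*h), Equation (4)."""
    return lambda_0 * (0.1 * tau) ** (alpha(tau) - 1.0)


for years in (2, 10, 25):
    print(years, failure_rate(years))
```

For service lives between 3 and 17 years the exponent is zero, so the failure rate stays at λ₀; outside that interval the rate is higher, which reproduces a bathtub-like curve.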

The main advantage of the method is that it enables us to assess the reliability of underground pipelines without excavating the soil and removing the pipeline thermal insulation. The disadvantage of the methodology is that only the service life and length of its sections are considered significant factors when assessing the reliability of the heating network. However, in 2019, Akhmetova I.G. and Akhmetov T.R. [3], in their study, tried to modernize this technique. As a result, expressions (4) and (5) acquired the following form:

λ = λ_{0} \cdot {(τ)}^{α - 1}

(6)

α = 0.5 \cdot e^{K_{i}} = 0.5 \cdot e^{f (K1, K2, K3, K4, K5)}

(7)

where K_i is the coefficient taking into account additional significant factors; K1 is the residual pipeline wall thickness, %; K2 is the previous incidents on the pipeline section; K3 is the soil corrosion activity; K4 is the flooding (traces of flooding) of the channel; and K5 is the presence of intersections with communications.

Since the K2–K5 features are categorical, the authors of the study [3] were unable to take them into account when constructing linear regression. Therefore, the model gives acceptable results only for combinations of one quantitative (K1) and one categorical (K2–K5) characteristic. If there are three or more factors in the test sample, the model demonstrates the instability of the results.

2.2. Intelligent Predictive Models

Artificial neural networks (ANN) are machine learning (ML) algorithms related to deep learning. An ANN is a mathematical abstraction that models the structure and functioning mechanism of a biological neural network, designed for information processing [10].

Figure 1 shows the structure of a simple neural network. It includes input, hidden, and output layers. Let us assume that the training sample is a vector of significant factors x = (x₁, x₂, …, x_m), and the predicted parameters are the output vector y = (y₁, y₂, …, y_n), where m and n are the number of significant factors and output parameters. The input weight of the h-th node of the hidden layer is ω_1h, ω_2h, …, ω_mh, and the corresponding bias is γ_h. The input weight of the j-th node of the output layer is ω_1j, ω_2j, …, ω_kj, and the corresponding bias is θ_j. k is the number of hidden layer nodes.

The input signals of the h-th neuron of the hidden layer are determined using Equation (8), as follows:

α_{h} = \sum_{i = 1}^{m} ω_{i h} x_{i}

(8)

The output signals of the h-th neuron of the hidden layer are calculated using Equation (9), as follows:

b_{h} = f (α_{h} + γ_{h})

(9)

The input signals of the j-th output neuron are determined using Equation (10), as follows:

β_{j} = \sum_{h = 1}^{k} ω_{h j} b_{h}

(10)

The output data of the j-th output neuron are calculated using Equation (11), as follows:

y_{j} = f (β_{j} + θ_{j})

(11)
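The forward pass described by Equations (8)–(11) can be sketched in a few lines of NumPy. The variable names, the toy weights, and the choice of ReLU as the activation function f are assumptions made for illustration only.

```python
import numpy as np


def relu(z):
    """ReLU activation, used here as an example of the function f."""
    return np.maximum(z, 0.0)


def forward(x, w_hidden, gamma, w_out, theta, f=relu):
    """Forward pass of a network with one hidden layer.

    x: (m,) inputs; w_hidden: (m, k) weights; gamma: (k,) hidden biases;
    w_out: (k, n) weights; theta: (n,) output biases.
    """
    alpha = x @ w_hidden        # Equation (8): input signals of the hidden neurons
    b = f(alpha + gamma)        # Equation (9): outputs of the hidden neurons
    beta = b @ w_out            # Equation (10): input signals of the output neurons
    return f(beta + theta)      # Equation (11): network outputs


# Toy example with m = 2 inputs, k = 2 hidden nodes, n = 1 output.
x = np.array([1.0, 2.0])
w_hidden = np.array([[1.0, -1.0], [0.5, 1.0]])
gamma = np.array([-3.0, 0.0])
w_out = np.array([[2.0], [3.0]])
theta = np.array([0.5])

y = forward(x, w_hidden, gamma, w_out, theta)
print(y)
```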

In recent years, artificial neural networks have been increasingly used in the oil and gas industry to predict the residual strength and service life of corroded pipelines [11,12]. ANN is capable of establishing a nonlinear relationship between significant corrosion factors and the corrosion rate [2].

The disadvantages of ANN include low learning speed with a large amount of data, as well as poor interpretability of the results.

Fuzzy logic (FL) is a mathematical tool that allows us, using the membership function, to transform a qualitative (categorical) assessment of objects that are influenced by many factors into a quantitative (numerical) assessment. There are five basic membership functions, as follows: normalized Gaussian function, generalized Bell function, and triangular, trapezoidal, and double sigmoidal functions [12]. The studies of Bagheri M. [13] and Liang Q. [14] discuss the use of triangular and trapezoidal membership functions, respectively, when assessing the risk of leakage through pressure pipelines.

Fuzzy logic allows for a more realistic, science-based quantification of data, with hidden information representing fuzziness. The result of this evaluation is not a point value, but a vector. Such a vector allows us to more accurately describe the object being evaluated, and, in the future, can be processed to obtain reference information.

The disadvantages of the method include the fact that the process of identifying vector weight indices is quite subjective [2]. In addition, with a large set of indices, the comparison of the degree of membership becomes problematic and can lead to algorithm failure. The authors Rahmanifard H. and Plaksina T. [12], in their study, came to the conclusion that in cases in which a good mathematical description of the process is possible, the use of fuzzy logic is justified only when the full mathematical implementation is limited by computing power.

Chaos theory is a way of analyzing irregular and unpredictable phenomena and processes. According to this theory, a chaotic process is a deterministic phenomenon that only at first glance seems disordered, unclear, and random. The corrosion process in a pipeline system can be attributed to this type of phenomenon [15]. The main advantage of this method is that it enables us to obtain a qualitative representation of non-Gaussian nonstationary stochastic processes [16]. The disadvantage of the method is that such an assessment requires a large amount of relevant data. In addition, even minor errors at the data preparation stage can lead to instability in the calculation model [2].

The support vector machine (SVM) is a machine learning algorithm aimed at solving nonlinear classification and regression problems for small amounts of data [11]. The idea of the method is to map the vector of initial data (x₁, x₂, …, x_n) into a higher-dimensional feature space and to find the separating hyperplane with the largest margin in this space [10]. In the case of linear separability, the hyperplane can be described in the form of linear regression (Figure 2), as follows:

f (x_{i}) = ω \cdot x_{i} + b

(12)

where ω and b are linear regression coefficients.

To obtain the regression function, Equation (12) is converted into a mathematical task, as follows:

\frac{1}{2} ω^{T} \cdot ω + C \cdot \sum_{i = 1}^{n} ξ_{i} \to \min

(13)

under the following restrictions (14):

\begin{cases} ω \cdot x_{i} + b - y_{i} \leq ε + ξ_{i} \\ ξ_{i} \geq 0 \end{cases}

(14)

where C is the regularization constant, which adjusts the trade-off between maximizing the margin width and minimizing the total error; ξ_i are the slack variables characterizing the magnitude of the error at objects x_i; and ε is the width of the insensitivity zone of the loss function.

According to the Kuhn–Tucker theorem, the optimization task (13) reduces to finding the saddle point of the Lagrange function, which yields the regression function (15), as follows:

f (x) = \sum_{i = 1}^{n} λ_{i} \cdot y_{i} \cdot K (x, x^{'}) + b

(15)

where λ_i is the Lagrange multiplier (when λ_i is not equal to 0, the corresponding sample is a support vector), and K(x, x′) is the kernel function.

In practice, it is generally not possible to guarantee the linear separability of points into classes. Therefore, the transition from scalar products to arbitrary kernels (the so-called kernel trick) is used, which allows for the construction of nonlinear separators [17,18]. The most common kernels are as follows:

linear: K(x, x′) = (x · x′);
polynomial: K(x, x′) = (x · x′)^d;
radial basis function: K(x, x′) = exp(−γ · ‖x − x′‖²), for γ > 0;
sigmoid: K(x, x′) = tanh(κx · x′ + c), for almost every κ > 0 and c > 0.
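The four kernels listed above can be written as plain NumPy functions. This is an illustrative sketch: the function names and default parameter values are assumptions, not settings from the study.

```python
import numpy as np


def linear_kernel(x, z):
    """K(x, x') = x . x'"""
    return float(np.dot(x, z))


def polynomial_kernel(x, z, degree=3):
    """K(x, x') = (x . x')^d"""
    return float(np.dot(x, z)) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2), gamma > 0"""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))


def sigmoid_kernel(x, z, kappa=1.0, c=0.0):
    """K(x, x') = tanh(kappa * x . x' + c)"""
    return float(np.tanh(kappa * np.dot(x, z) + c))


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(linear_kernel(x, z), polynomial_kernel(x, z, degree=2))
print(rbf_kernel(x, z, gamma=0.5), sigmoid_kernel(x, z, kappa=0.1))
```

The radial basis function kernel equals 1 for identical points and decays toward 0 as the distance between the points grows, so γ controls how local the influence of each support vector is.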

The advantages of the support vector machine include high efficiency with small data samples, which is most valuable in the case of analyzing ruptures in heating network pipelines. Another advantage is that, unlike neural networks, this method avoids local optimization [2].

Among the disadvantages of the method are the following: low learning speed for large volumes of data; high sensitivity to missing data, parameters, and kernel functions; and resource-intensive search for optimal hyperparameters (regularization constants, parameters, and kernel functions).

Ensemble learning (EL) methods are machine learning algorithms that use decision trees to model the relationships between input features and target variables. Their peculiarity is that they combine several weak methods into an ensemble, thereby neutralizing their shortcomings (for example, sensitivity to training data and overfitting problems) [19]. According to the method of combining weak methods, EL models are divided into parallel and sequential.

Parallel EL models, such as random forest (RF), use bootstrap aggregating; homogeneous weak models are trained independently and in parallel, and their average result becomes the prediction result [19,20], as follows:

Y = \frac{1}{B} \cdot \sum_{j = 1}^{B} Y_{j} (x)

(16)

where Y is the target result; Y_j is the j-th separate decision tree; B is the number of decision trees in the ensemble; and x is the vector of significant factors.

The advantages of RF include high accuracy and robustness to overfitting, outliers, and unbalanced datasets. The main disadvantage of RF is its bias toward categorical features with a large number of unique classes. In the case of encoding categorical features using the «Label Encoder» type, decision trees create redundant dependencies that were not present in the original data.

Sequential EL models use the boosting method, which allows us to reduce the variance and systematic error of a weak model by creating another weak model that corrects the errors of the previous one [20]. Sequential EL models include adaptive boosting (AdaBoost), gradient boosting regression tree (GBRT), extreme gradient boosting (XGBoost), etc.

Unlike RF, GBRT fits each newly generated weak learner to the negative gradient (g_m) of the loss function (L), evaluated on the cumulative model of the previous iteration. Adding the fitted weak learner to the cumulative model then moves it along the negative gradient direction and reduces the loss [19], as follows:

F_{m} (x) = F_{m - 1} (x) + \arg \min_{h_{m}} \sum_{i = 1}^{n} L [y_{i}, F_{m - 1} (x_{i}) + h_{m} (x_{i})]

(17)

g_{m} = - \frac{\partial L [y_{i}, F_{m - 1} (x_{i})]}{\partial F_{m - 1} (x_{i})}

(18)

where h_m(x) represents the base (weak) learner function added at iteration m; x is the vector of initial data (x₁, x₂, …, x_n); and m is the iteration number. The error will be minimal when the m-th weak learner fits the negative gradient g_m of the cumulative model’s loss function [19].
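The boosting loop of Equations (17) and (18) can be sketched from scratch for a single input feature. The sketch assumes the squared loss L = 0.5·(y − F)², for which the negative gradient is simply the residual y − F. It uses one-split regression trees (stumps) as weak learners and a shrinkage factor (learning rate) that does not appear in Equation (17). All names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_stump(x, targets):
    """Fit a one-split regression tree to the targets by minimizing squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_value = targets[left].mean()
        right_value = targets[~left].mean()
        sse = ((targets[left] - left_value) ** 2).sum() \
            + ((targets[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda q: np.where(q <= threshold, left_value, right_value)


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Sequentially add stumps fitted to the negative gradient of the squared loss."""
    f0 = y.mean()                                   # initial constant model F_0
    prediction = np.full_like(y, f0, dtype=float)
    learners = []
    for _ in range(n_rounds):
        negative_gradient = y - prediction          # Equation (18) for the squared loss
        h = fit_stump(x, negative_gradient)         # weak learner h_m
        prediction = prediction + learning_rate * h(x)   # update in the spirit of Eq. (17)
        learners.append(h)
    return lambda q: f0 + learning_rate * sum(h(q) for h in learners)


# Synthetic one-dimensional data, for illustration only.
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * x)
model = gradient_boost(x, y)

mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - model(x)) ** 2)
print(mse_baseline, mse_boosted)
```

Each round can only lower the training error here, which also illustrates why the number of trees, the tree depth, and the learning rate need to be limited to avoid overfitting.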

The advantage of GBRT is its high accuracy among ensemble methods (typically higher than that of RF) and its ability to work with categorical features. However, this method also has disadvantages, such as a tendency to overfit (which makes it necessary to artificially limit the tree depth) and sensitivity to data outliers.

The general disadvantage of ensemble methods is their low performance for large volumes of data. However, it can be partially compensated by parallelizing the calculation process between the central (CPU) and graphics (GPU) processors [21].

Li H., Huang K., et al. [2] studied 71 intelligent models published by different researchers between 2009 and 2021. In their review, they noted that in only 2 of the 71 models, the authors divided the general data set into training, test, and validation samples. In other cases, the division was made into training and test samples. The problem is that tuning hyperparameters based on the metrics of the test sample leads to model overfitting. Because of this, the model demonstrates high accuracy in tests, but in practice, it has low efficiency and shows unstable results.

To obtain a stable model, it is necessary to focus on the metrics obtained when evaluating the validation sample. To achieve this, cross-validation (alternately hiding parts of the data during training) [19,22] and/or testing on a held-out (deferred) sample (part of the test sample hidden from the model during hyperparameter tuning) is used [2]. The first method is optimal for situations with a small set of initial data. The second method allows us to obtain a more objective assessment of the model.

3. Materials and Methods

To develop and train intelligent models, statistical data on incidents of heating network pipelines in the cities of Kazan and Ulyanovsk collected by heat supply organizations in accordance with the MDK 4-01.2001 standard [23] were used as initial data (Table 1). A total of 111 incidents were considered; there were 55 records in the Kazan database and 56 records in Ulyanovsk.

The methodology proposed by I.G. Akhmetova and T.R. Akhmetov in [3] was chosen as the basis. To take into account the features (K2–K5) used in expression (7), they were converted from categorical to quantitative using the cat.codes accessor of the Pandas library (Python, https://www.python.org/).
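A minimal sketch of this kind of encoding is shown below. The column name and the category labels are hypothetical and do not reproduce the study's data set.

```python
import pandas as pd

# Hypothetical records; the labels are illustrative only.
df = pd.DataFrame({"K3_soil_corrosion": ["low", "high", "medium", "high"]})

# Default behaviour: categories are sorted alphabetically before coding.
df["K3_code"] = df["K3_soil_corrosion"].astype("category").cat.codes

# Optional: an explicit ordered category list keeps the codes monotonic in severity.
ordered = pd.Categorical(
    df["K3_soil_corrosion"],
    categories=["low", "medium", "high"],
    ordered=True,
)
df["K3_code_ordered"] = ordered.codes

print(df)
```

With the default alphabetical ordering, "high" receives a smaller code than "low", so an explicit category order may be preferable when the labels have a natural ranking.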

Subsequently, using the model_selection.train_test_split function of the Scikit-learn library, the general data set was divided into training (for training models), testing (for setting model hyperparameters), and validation (for objective assessment of models) samples. The training set contained 80% of all records, the testing set contained 15%, and the validation set contained 5%. The distribution was performed in a randomized manner. The shuffle control parameter (random_state = 42) was used to obtain reproducible results.
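One way to obtain an approximately 80/15/5 split is to call train_test_split twice, as sketched below. This two-step arrangement and the placeholder arrays are assumptions; the text does not specify how the three samples were actually produced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of records (111) and features (5) as in the study.
X = np.arange(111 * 5, dtype=float).reshape(111, 5)
y = np.arange(111, dtype=float)

# Step 1: hold out 20% of the records.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Step 2: split the held-out part 3:1 into testing (about 15%) and validation (about 5%).
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, test_size=0.25, random_state=42
)

print(len(X_train), len(X_test), len(X_val))
```

With only 111 records, the validation sample contains just a handful of incidents, so metrics computed on it are sensitive to which records happen to be selected.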

The input data (significant factors) of the models were the characteristics K1–K5, and the predicted parameter was the coefficient, taking into account additional significant factors K_i. The actual value of K_i used in setting up and validating models was calculated according to the method described in [3].

The methods used in the development of intelligent models are as follows:

Multilayer perceptron (MLP);
Support vector machine (SVM);
Gradient boosting regression tree (GBRT);
Random forest (RF).

The MLP was chosen because it is the most commonly used feedforward neural network [24]. The convergence of this network is slow, but it is often reliable and accurate. An advantage of this model, for our specific task, is its capability to establish nonlinear relationships between significant corrosion factors and corrosion rate [2].

A disadvantage of this model is that an MLP with hidden layers has a non-convex loss function with more than one local minimum. Therefore, different random weight initializations can lead to varying accuracies during testing. In other words, the model returns unstable results with each new testing case.

SVM was chosen because it is the most frequently used ML method for predicting pipeline failures [2,4,10,11,17,19,25,26]. In addition, SVM is highly effective with small data samples [25,27], which is valuable for our task. However, it is necessary to take into account that the high efficiency of SVM with small data samples is achieved when the feature space is large. Since our sample contains only 5 features, the expectations of SVM effectiveness were quite modest.

According to Seghier M.E.A.B., Höche D., et al. [19], EL methods are the most promising compared to individual ML methods when modeling the corrosion processes of steel pipelines. Based on this conclusion, the RF and GBRT methods were also included in the study.

RF models are resistant to overfitting and data outliers, so their results can be considered as a benchmark in comparative analysis. However, it is necessary to take into account that in our sample, 4 out of 5 features are categorical, which may negatively affect the method’s effectiveness.

The GBRT method was used to solve the problem of correctly taking into account categorical features. However, due to its tendency to overfit, a careful approach to hyperparameter tuning is required. For this purpose, a comparative analysis of the GBRT model with other models (MLP, SVM, and RF) was conducted, and additional testing was carried out on the validation dataset.

The MLP neural network was created using the Keras tool of the TensorFlow library [28]. The input layer of the network contains 64 nodes, and the neuron activation function is ReLU [11]. There is one hidden layer, and it consists of 32 nodes; the activation function is ReLU. The output layer contains 1 output. The training settings are as follows: the optimizer is Adam [29], the loss functions are MSE and MAE, and the metrics are MAE, MAPE, and MSE. The training parameters are as follows: the number of iterations is 50 epochs (Figure 3), and the size of the iterated blocks (batch size) is 32.

When creating the SVM, GBRT, and RF models, the Scikit-learn library was used [30]. Hyperparameter tuning was performed with cross-validation using a hyperparameter grid (GridSearchCV). Five data splits (n_splits = 5) were specified in the cross-validation parameters.

During the process of the SVM model cross-validation, a search was carried out for the kernel constant γ and regularization constant C in the ranges of 0.01–1.0 and 0.1–100.0, respectively. Among the kernel options, linear (linear), radial (rbf), polynomial (poly), and sigmoid (sigmoid) were considered. The size of the dead zone of the loss function ε varied from 0.01 to 1.0.

In the GBRT and RF models, the hyperparameter grid was built according to 3 parameters. For GBRT, the number of trees (100–300), maximum tree depth (3–7), and learning rate (0.01–0.2) were the parameters; for RF, the number of trees (100–300), maximum tree depth (5–15), and the function of the maximum number of features (sqrt, log2) were the parameters.
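A grid search of this kind for the GBRT model might look like the sketch below. Only the ranges are given in the text, so the grid points (the endpoints of each range), the scoring metric, the fold shuffling, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the training sample (88 records, 4 features).
rng = np.random.default_rng(42)
X = rng.random((88, 4))
y = X @ np.array([0.5, 0.3, 0.1, 0.1]) + 0.05 * rng.standard_normal(88)

# Endpoints of the ranges quoted in the text; intermediate values are omitted for speed.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 7],
    "learning_rate": [0.01, 0.2],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),  # five data splits
    scoring="neg_mean_absolute_error",                    # assumed metric
)
search.fit(X, y)

print(search.best_params_)
```

The same pattern applies to the RF model by swapping in RandomForestRegressor and a grid over n_estimators, max_depth, and max_features.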

Data preprocessing in the form of standardization (StandardScaler) and normalization (MinMaxScaler) was used only for the SVM model since it did not have any effect on other methods [31].

The models were assessed using a deferred (validation) sample using the following 3 metrics:

mean absolute error (MAE), as follows:

MAE = \frac{1}{n} \cdot \sum_{i = 1}^{n} |R_{i} - P_{i}|

(19)

mean absolute percentage error (MAPE), as follows:

MAPE = \frac{1}{n} \cdot \sum_{i = 1}^{n} \frac{|R_{i} - P_{i}|}{R_{i}}

(20)

mean squared error (MSE), as follows:

MSE = \frac{1}{n} \cdot \sum_{i = 1}^{n} {(R_{i} - P_{i})}^{2}

(21)

where R_i and P_i are the calculated (actual) and predicted values of K_i, respectively, and n is the validation sample size.
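Equations (19)–(21) translate directly into plain Python. The function names and the sample values below are hypothetical. The absolute value in the MAPE denominator is an addition that leaves the result unchanged for positive actual values.

```python
def mae(actual, predicted):
    """Mean absolute error, Equation (19)."""
    return sum(abs(r - p) for r, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction, Equation (20)."""
    return sum(abs(r - p) / abs(r) for r, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error, Equation (21)."""
    return sum((r - p) ** 2 for r, p in zip(actual, predicted)) / len(actual)


# Illustrative values only, not data from the study.
actual = [1.0, 2.0, 4.0]
predicted = [1.1, 1.8, 4.4]

print(mae(actual, predicted), mape(actual, predicted), mse(actual, predicted))
```

In this form, MAPE is a fraction rather than a percentage, which matches the way the values are reported in the text (for example, MAPE = 0.06069).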

4. Results

The primary data analysis showed the following:

The distribution of the K_i coefficient actual values within the general population of data is close to the normal Gauss–Laplace distribution (Figure 4), which, in accordance with the central limit theorem, indicates the representativeness of the sample [32];
There is no significant correlation between the target feature and significant factors (Figure 5);
The greatest influence on the target feature (K_i) is exerted by the soil corrosion activity K3 and the wall thinning K1. The least significant factor is the presence of previous pipeline incidents K2 (Figure 5 and Figure 6).

Tuning the hyperparameters made it possible to obtain the limiting metrics of the predictive models (Figure 7). The SVM, GBRT, and RF models gave comparable results in terms of accuracy.

The MLP model showed the worst results both in terms of quality metrics and in the spread of the relative error of the predicted values (Figure 8). Because of this, it was excluded from further studies.

Next, the weakest feature was removed from the training set (K2, Figure 6). Retraining and the subsequent re-estimation of the models showed an increase in accuracy (Figure 9). Filtering out other significant factors (K4 and K5) only led to worse metrics.

The best result for the SVM model was obtained with the following hyperparameters: C = 100, γ = 1, ε = 0.01, and a kernel type of radial basis function (rbf).

The GBRT model showed the best metrics with 100 trees (n_estimators = 100), with a maximum branch depth of 3 (max_depth = 3) and a learning rate of 0.01 (learning_rate = 0.01).

The RF model gave the best results with 100 trees (n_estimators = 100), with a maximum branch depth of 5 (max_depth = 5) and a logarithmic function of the number of features considered when selecting splits in the trees (max_features = ‘log2’). An example of one such tree is shown in Figure 10.

The GBRT model turned out to be the best of the three models considered in terms of quality metrics, as follows: MSE = 0.00719, MAE = 0.0682, and MAPE = 0.06069.

More detailed research results, including model calculation scripts, are available for review in the authors’ public repository (https://github.com/caapel/Failure_rate/blob/main/Failure_rate.ipynb, accessed on 18 June 2024).

5. Discussion

To objectively assess the results obtained, we refer to the work of Li H., Huang K., et al. [2] to compare the metrics of the proposed model with the metrics of 71 models of other authors (Table 2), whose studies were published from 2009 to 2021.

According to Table 2, the main problem for most researchers remains the collection and accumulation of the initial data for intelligent model training. Despite the high degree of wear and tear of the heat supply systems in Russian cities [3], the ruptures of heating network pipelines cannot be classified as a statistically widespread phenomenon. This explains the limited size of the sample studied (111 incidents).

In the future, as incident statistics accumulate, intelligent models will become more accurate in describing the relationships between significant and target features.

In the present study, five significant factors were assessed (four after filtering out the weakest feature). At the same time, it should be taken into account that the basic methodology [3], to which the proposed model is an addition, also takes into account the service life and pipeline length. As a result, the number of significant factors corresponds to the average number of significant factors in studies by other authors (Table 2).

Based on the MAPE metric from Table 2, we can conclude that the proposed GBRT model is superior in accuracy to most models by other authors. However, such a comparison is very conditional since the accuracy and/or error of the models depends on a number of subjective (controllable by the researcher) factors, including the size and representativeness of the general population, data preprocessing, the number of significant features, the presence of a validation sample, etc.

Let us compare the obtained conclusions with more recent and closely related studies. In 2022, Elshaboury N., Al-Sakkaf A., et al. [24] developed a model for predicting the failure modes of steel pipelines used for transporting oil and gas. The model was developed using a database collected by the Conservation of Clean Air and Water in Europe (CONCAWE) from 1971 to 2019. The database included 253 incidents related to corrosion and third-party actions. The feature space of the studied sample included two continuous features (pipeline diameter and age) and three categorical features (transported product, laying method, and land use type). The authors developed three models using a multilayer perceptron (MLP) neural network, radial basis function (RBF) neural network, and multinomial logistic (MNL) regression. The results were evaluated using the average validity percentage (AVP) metric. The accuracies of the MLP, RBF, and MNL models according to the AVP metric were 0.84, 0.85, and 0.81, respectively. These results confirm our conclusion that ANN models are not well-suited for solving problems of this kind.

In 2023, Xu L., Yu J., et al. [25] applied hybrid machine learning to predicting the corrosion rates of natural gas pipelines. They used a database of 60 records of pipeline corrosion in southwest China. The feature space of the sample under study included 10 continuous features. The best result (MAPE = 0.0573) was obtained for the hybrid CEEMDAN–IPSO–SVM model, which combines the complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) data preprocessing method, the improved particle swarm optimization (IPSO) model hyperparameter optimization algorithm, and the ML-SVM method. Due to the large feature space, the hybrid SVM method in this study showed good results, despite the modest sample size. This outcome also confirms our conclusions.

In 2023, Şahin E. and Yüce H. [26] developed a model for predicting damage in water supply networks. The data for the study were obtained from an experimental setup and consisted of 360 records. The feature space included 339 categorical features. The authors proposed two classification models developed using graph convolutional neural network (GCN) and SVM methods. The accuracies of the GCN and SVM models (the number of correctly classified states out of the total number of states) in predicting leaks were 0.95 and 0.81, respectively. Therefore, we can conclude that SVM is less suitable for solving multi-class classification tasks under these conditions.

In 2022, Cai J., Jiang X., et al. [10] developed models for predicting the burst strength of corroded pipelines subjected to internal pressure. The data for the study, as in the previous work, were obtained experimentally. The sample included 115 records, and the feature space contained eight parameters. The authors selected the following three methods: MLP, SVM, and linear regression (LR). The smallest errors of the MLP, SVM, and LR models, according to the MSE metric for the validation set, were 0.15171, 0.03707, and 0.05156, respectively. This also confirms our conclusions about ANN and SVM.

A comparison of the obtained results with the work of other authors has shown that our proposed models demonstrate high accuracy and that these models can be scaled to datasets of heat network pipelines in other cities.

6. Conclusions

Based on the results of this study, the following conclusions were drawn:

The distribution of the actual values of the K_i coefficient across the dataset is close to the normal (Gaussian) distribution, which indicates the representativeness of the source data;
The most significant factors when assessing the condition of underground steel pipelines are wall thinning (K1) and soil corrosion (K3);
Previous incidents on the pipeline section (K2) are the least significant factor, and their exclusion from the training set leads to an increase in the accuracy of the models;
The MLP model showed the worst results and is therefore not suitable for solving such tasks.
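The first conclusion above concerns how close the K_i values are to a normal distribution. One simple way to examine such a claim is to compute the sample skewness and excess kurtosis, both of which are zero for an ideal normal distribution. The sketch below is only an assumption about how this could be checked; this section does not state which procedure the authors used, and the function name and sample values are hypothetical.

```python
import statistics


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    z = [(v - mean) / sd for v in values]  # standardized values
    skew = sum(x ** 3 for x in z) / n
    excess_kurt = sum(x ** 4 for x in z) / n - 3.0
    return skew, excess_kurt


# Hypothetical, symmetric sample of K_i values (not taken from the paper).
sample = [0.8, 0.9, 1.0, 1.0, 1.0, 1.1, 1.2]
skew, kurt = skewness_and_excess_kurtosis(sample)
print(f"skewness = {skew:.3f}, excess kurtosis = {kurt:.3f}")
```

Values near zero for both statistics are consistent with a normal shape, but with small samples they are only a rough indication and do not replace a formal test.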

The GBRT model performed best in terms of quality metrics: MSE = 0.00719, MAE = 0.0682, and MAPE = 0.06069. However, the gap in metrics between the RF and GBRT models is insignificant, which in general gives reason to consider ensemble methods the best choice for solving such tasks.

Author Contributions

Conceptualization, N.D.C., H.I.B. and I.K.I.; software, S.R.S.; validation, A.A.F. and N.D.C.; formal analysis, I.K.I. and H.I.B.; resources, N.D.C.; writing—original draft preparation, S.R.S.; writing—review and editing, O.E.B. and I.K.I.; visualization, S.R.S.; supervision, I.K.I. and H.I.B.; project administration, I.K.I. and H.I.B. All authors have read and agreed to the published version of the manuscript.

Funding

This study is financed by the European Union—NextGenerationEU—through the National Recovery and Resilience Plan of the Republic of Bulgaria, project No. BG-RRP-2.013-0001-C01. This research was also co-funded by the Ministry of Science and Higher Education of the Russian Federation “Study of processes in a fuel cell-gas turbine hybrid power plant” (project code: FZSW-2022-0001).

Data Availability Statement

The original data presented in this study are openly available in the public repository “Failure rate” at https://github.com/caapel/Failure_rate (accessed on 18 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

Notations
α	coefficient taking into account the duration of the pipeline section operation
γ	kernel constant
ε	the insensitivity of the loss function
λ	the failure rate of heating network elements, (km·h)⁻¹
λ₀	initial failure rate of 1 km of single-line heat pipeline, obtained from the Weibull distribution equation, 5.7·10⁻⁶ (km·h)⁻¹
ξ	slack variables
σ_f	flow stress, MPa
τ	pipeline service life
C	the regularization constant
d	depth of the corrosion zone, m
D	outside diameter of the pipeline, m
g_m	negative gradient
K1	residual pipeline wall thickness, %
K2	previous incidents on the pipeline section
K3	soil corrosion activity
K4	flooding (traces of flooding) of the channel
K5	presence of intersections with communications
l	length of the corrosion zone, m
L	loss function
M	Folias bulging coefficient
p_b	burst pressure, MPa
P_i	predicted values
R_i	calculated (actual) values
R_t_0.5	the minimum yield strength, MPa
t	wall thickness of the pipeline, m
Abbreviations
AGA	American Gas Association
ANN	Artificial neural networks
API	American Petroleum Institute
ASME	American Society of Mechanical Engineers
AVP	Average validity percentage
CEEMDAN	Complete ensemble empirical mode decomposition with adaptive noise
CONCAWE	Conservation of Clean Air and Water in Europe
CPU	Central processing unit
EGIG	European Gas Pipeline Incident Data Group
EL	Ensemble learning
FL	Fuzzy logic
GBRT	Gradient boosting regression tree
GCN	Graph convolutional neural network
GPU	Graphics processing unit
IPSO	Improved particle swarm optimization
JSC	Joint-stock company
LR	Linear regression
MAE	Mean absolute error
MAPE	Mean absolute percentage error
ML	Machine learning
MLP	Multilayer perceptron
MNL	Multinomial logistic regression
MSE	Mean squared error
RBF	Radial basis function
ReLU	Rectified linear unit
RF	Random forest
SVM	Support vector machine

References

1. EGIG. Available online: https://www.egig.eu/reports (accessed on 18 June 2024).
2. Li, H.; Huang, K.; Zeng, Q.; Sun, C. Residual Strength Assessment and Residual Life Prediction of Corroded Pipelines: A Decade Review. Energies 2022, 15, 726.
3. Akhmetova, I.G.; Akhmetov, T.R. Analysis of Additional Factors in Determining the Failure Rate of Heat Network Pipelines. Therm. Eng. 2019, 66, 730–736.
4. Zhu, X.-K. Recent Advances in Corrosion Assessment Models for Buried Transmission Pipelines. CivilEng 2023, 4, 391–415.
5. Law, M.; Bowie, G. Prediction of failure strain and burst pressure in high yield-to-tensile strength ratio line pipes. Int. J. Press. Vessel. Pip. 2007, 84, 487–492.
6. Lyons, C.J.; Race, J.M.; Chang, E.; Cosham, A.; Barnett, J. Validation of the ng-18 equations for thick walled pipelines. EFA 2020, 112, 104494.
7. API Specification 5L. Line Pipe, 46th ed.; American Petroleum Institute: Washington, DC, USA, 2018.
8. Zhou, R.; Gu, X.; Luo, X. Residual strength prediction of X80 steel pipelines containing group corrosion defects. Ocean. Eng. 2023, 274, 114077.
9. Methodology and Algorithm for Calculating Reliability Indicators of Heat Supply to Consumers and Redundancy of Heat Networks when Developing Heat Supply Schemes. Available online: https://www.rosteplo.ru/Tech_stat/stat_shablon.php?id=2781 (accessed on 18 June 2024).
10. Cai, J.; Jiang, X.; Yang, Y.; Lodewijks, G.; Wang, M. Data-driven methods to predict the burst strength of corroded line pipelines subjected to internal pressure. J. Mar. Sci. Appl. 2022, 21, 115–132.
11. Soomro, A.A.; Mokhtar, A.A.; Hussin, H.B.; Lashari, N.; Oladosu, T.L.; Jameel, S.M.; Inayat, M. Analysis of machine learning models and data sources to forecast burst pressure of petroleum corroded pipelines: A comprehensive review. EFA 2024, 155, 107747.
12. Rahmanifard, H.; Plaksina, T. Application of artificial intelligence techniques in the petroleum industry: A review. Artif. Intell. Rev. 2019, 52, 2295–2318.
13. Bagheri, M.; Zhu, S.P.; Ben Seghier, M.E.A.; Keshtegar, B.; Trung, N.T. Hybrid intelligent method for fuzzy reliability analysis of corroded X100 steel pipelines. Eng. Comput. 2020, 37, 2559–2573.
14. Liang, Q. Pressure pipeline leakage risk research based on trapezoidal membership degree fuzzy mathematics. GST 2019, 24, 48–53.
15. Mishra, M.; Keshavarzzadeh, V.; Noshadravan, A. Reliability-based lifecycle management for corroding pipelines. Struct. Saf. 2019, 76, 1–14.
16. Sakamoto, S.; Ghanem, R. Polynomial chaos decomposition for the simulation of non-Gaussian nonstationary stochastic processes. J. Eng. Mech. 2002, 128, 190–201.
17. Robles-Velasco, A.; Cortés, P.; Muñuzuri, J.; Onieva, L. Prediction of pipe failures in water supply networks using logistic regression and support vector classification. Reliab. Eng. Syst. Saf. 2020, 196, 106754.
18. Nie, F.; Zhu, W.; Li, X. Decision Tree SVM: An extension of linear SVM for non-linear classification. Neurocomputing 2020, 401, 153–159.
19. Seghier, M.E.A.B.; Höche, D.; Zheludkevich, M. Prediction of the internal corrosion rate for oil and gas pipeline: Implementation of ensemble learning techniques. J. Nat. Gas. Sci. Eng. 2022, 99, 104425.
20. Foroozand, H.; Weijs, S. Entropy ensemble filter: A modified bootstrap aggregating (bagging) procedure to improve efficiency in ensemble model simulation. Entropy 2017, 19, 520.
21. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. 2021, 54, 1937–1967.
22. Ossai, C.I. Corrosion defect modelling of aged pipelines with a feed-forward multi-layer neural network for leak and burst failure estimation. EFA 2020, 110, 104397.
23. MDK 4-01.2001; Recommended Practice for Investigation and Recordkeeping of Technical Violations in Public Energy Utility Systems and in the Operation of Public Energy Utility Organizations. State Unitary Enterprise «Center for Design Products in Construction»: Moscow, Russia, 2001.
24. Elshaboury, N.; Al-Sakkaf, A.; Alfalah, G.; Abdelkader, E.M. Data-Driven Models for Forecasting Failure Modes in Oil and Gas Pipes. Processes 2022, 10, 400.
25. Xu, L.; Yu, J.; Zhu, Z.; Man, J.; Yu, P.; Li, C.; Wang, X.; Zhao, Y. Research and Application for Corrosion Rate Prediction of Natural Gas Pipelines Based on a Novel Hybrid Machine Learning Approach. Coatings 2023, 13, 856.
26. Şahin, E.; Yüce, H. Prediction of Water Leakage in Pipeline Networks Using Graph Convolutional Network Method. Appl. Sci. 2023, 13, 7427.
27. Shang, Y.; Li, S. FedPT-V2G: Security enhanced federated transformer learning for real-time V2G dispatch with non-IID data. Appl. Energy 2024, 358, 122626.
28. Pang, B.; Nijkamp, E.; Wu, Y.N. Deep learning with tensorflow: A review. J. Educ. Behav. Stat. 2020, 45, 227–248.
29. Zhang, Z. Improved Adam Optimizer for Deep Neural Networks. In Proceedings of the 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), Banff, AB, Canada, 4–6 June 2018.
30. Kramer, O. Scikit-Learn; Springer: New York, NY, USA, 2016; pp. 45–53.
31. Çetin, V.; Yildiz, O. A comprehensive review on data preprocessing techniques in data analysis. Pamukkale Üniversitesi Mühendislik Bilim. Derg. 2022, 28, 299–312.
32. Islam, M.R. Sample size and its role in Central Limit Theorem (CLT). J. Computat. Appl. Math. 2018, 4, 1–7.

Figure 1. Multilayer perceptron structure with one hidden layer.

Figure 2. Concept of a linear SVM regression.

Figure 3. Loss functions MAE (a) and MSE (b) during MLP model training.

Figure 4. Distribution of actual K_i values in the population of data.

Figure 5. Correlation of input features with the output value.

Figure 6. Importance of input features for the output value.

Figure 7. Error indicator with a full set of input features; MSE (a), MAE (b), and MAPE (c).

Figure 8. Scatter of the relative error of K_i values in the test set.

Figure 9. Error indicator without the K2 feature; MSE (a), MAE (b), and MAPE (c).

Figure 10. A part of the best RF model decision tree.

Table 1. Part of initial data on incidents of the Kazan heating network.

Record No.	Length, m	Diameter, mm	Wall Thinning (K1), %	Previous Incidents (K2)	Corrosion Activity (K3)	Flooding Traces (K4)	Intersection with Communications (K5)
1	30	50	60.0	no	average	no	no
2	75	50	45.7	no	low	yes	yes
3	50	100	32.5	no	low	no	no
4	30	50	62.9	no	low	no	yes
5	170	200	24.7	yes	average	yes	yes
6	270	150	53.3	yes	high	yes	yes
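
Before records like those in Table 1 can be passed to a regression model, the categorical columns have to be converted to numbers. The sketch below shows one possible encoding in plain Python. It is an assumption made for illustration only: the dictionary keys, the function name, and the integer codes (binary for yes/no, ordinal for corrosion activity) are hypothetical, and the preprocessing actually used by the authors may differ.

```python
# Hypothetical encodings; the paper's actual preprocessing may differ.
YES_NO = {"no": 0, "yes": 1}
CORROSION = {"low": 0, "average": 1, "high": 2}  # ordinal: low < average < high


def encode_record(record):
    """Turn one Table 1 row (given as a dict) into a numeric vector [K1, ..., K5]."""
    return [
        record["wall_thinning_pct"],              # K1, kept in percent
        YES_NO[record["previous_incidents"]],     # K2
        CORROSION[record["corrosion_activity"]],  # K3
        YES_NO[record["flooding_traces"]],        # K4
        YES_NO[record["intersections"]],          # K5
    ]


# Record 5 from Table 1.
record_5 = {
    "wall_thinning_pct": 24.7,
    "previous_incidents": "yes",
    "corrosion_activity": "average",
    "flooding_traces": "yes",
    "intersections": "yes",
}

encoded = encode_record(record_5)
print(encoded)
```

An ordinal code for corrosion activity preserves the low/average/high ordering, which tree-based models such as RF and GBRT can exploit directly; for distance-based or neural models, scaling K1 to the same range as the other features would usually also be needed.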

Table 2. Comparison of the results obtained with foreign authors’ works.

Metrics	Minimum (71 Models from [2])	Maximum (71 Models from [2])	Average (71 Models from [2])	Authors' Result
Sample size	15	259	188	111
Number of significant factors	2	11	6	5(4)
Number of target features	1	1	1	1
MAPE	0.0123	0.1499	0.0708	0.06069

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

