To Feel the Spatial: Graph Neural Network-Based Method for Leakage Risk Assessment in Water Distribution Networks

Wu, Wenhong; Pan, Xinyu; Kang, Yunkai; Xu, Yuexia; Han, Liwei

doi:10.3390/w16142017

by Wenhong Wu ^1,2, Xinyu Pan ^1,2,*, Yunkai Kang ^1,2,*, Yuexia Xu ^3 and Liwei Han

^1 School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450046, China

^2 Henan Province Water Distribution Network Intelligent Management Engineering Research Center, Zhengzhou 450046, China

^3 Zhengzhou Water Supply Investment Holdings Co., Ltd., Zhengzhou 450046, China

^* Authors to whom correspondence should be addressed.

Water 2024, 16(14), 2017; https://doi.org/10.3390/w16142017

Submission received: 22 June 2024 / Revised: 12 July 2024 / Accepted: 14 July 2024 / Published: 16 July 2024

(This article belongs to the Special Issue Water Supply System Reliability, Safety and Risk Modelling & Assessment, Volume II)

Abstract:

As water distribution networks expand, evaluating pipeline network leakage risk has become increasingly crucial. Contrary to traditional evaluation methods, which are often hampered by subjective weight assignment, data scarcity, and high expenses, data-driven models provide advantages like autonomous weight learning, comprehensive coverage, and cost-efficiency. This study introduces a data-driven framework leveraging graph neural networks to assess leakage risk in water distribution networks. Employing geographic information system (GIS) data from a central Chinese city, encompassing pipeline network details and historical repair records, the model achieved superior performance compared to other data-driven approaches, evidenced by metrics such as precision, accuracy, recall, and the Matthews correlation coefficient. Further analysis of risk factors underscores the importance of factors like pipe age, material, prior failures, and length. This approach demonstrates robust predictive accuracy and offers significant reference value for leakage risk evaluation.

Keywords:

water distribution network; risk assessment; graph neural network; supervised learning

1. Introduction

China’s urban water distribution networks (WDNs) play a vital role in connecting water sources to end-users and are essential for the survival and development of cities. However, pipeline networks often experience wear and tear in practical operations, leading to leaks that can disrupt daily life and impact the economy. To prevent significant losses and ensure the smooth operation of WDNs, it is vital to establish a robust leakage risk assessment system and implement appropriate early warning measures for high-risk pipelines (Pietrucha-Urbanik et al., 2023) [1].

Leakage risk prediction for WDNs has been the focus of research in the industry. Aiming at the complex influence factors of WDN pipelines and the surrounding environment, some scholars have proposed utilizing statistics and fuzzy mathematics to comprehensively consider the risk of pipeline leakage. For example, Peng et al. (2020) [2] used the fuzzy mathematical comprehensive evaluation method to predict the safety of WDNs, while Song et al. (2023) [3] used the multilevel fuzzy comprehensive evaluation method to clarify the different leakage risk influencing factors and their weights. Although these methods achieve a quantitative analysis of the risk of WDN pipelines, the subjectivity exposed by these methods due to their reliance on expert opinion reduces their credibility. With the construction of a database for pipeline attribute features based on a Geographic Information System (GIS) (Ho et al., 2010) [4] and the development of machine learning technology, some scholars have proposed that artificial intelligence models can be applied to the field of leakage risk prediction to improve the accuracy of risk prediction through data-driven learning of the nonlinear relationship between pipeline complexity factors and leakage probability (Bubtiena et al., 2011) [5]. Data-driven machine learning models can create relatively objective and effective risk assessment methods (Zhou et al., 2019) [6]. For example, Taiwo et al. (2024) [7] achieved a reliable assessment of pipeline leakage risk using characteristics of the pipeline itself and a logistic regression (LR) model optimized by a genetic algorithm, and Robles-Velasco et al. (2020) [8] developed predictive models using LR and support vector machine (SVM) algorithms for the prediction of water pipe failure probability. Chen et al. (2022) [9] developed predictive models using random forest (RF), boosted trees (BT), and extreme gradient boosting (XGBoost) algorithms with data collected from six WDNs.

The machine learning-based methods above achieve a high identification rate for leaky pipelines in different cities or areas, but they neglect the pipeline topology described in the GIS, and their black-box modeling makes it difficult to distinguish which risk factors on which pipelines have a more significant impact on leakage. We summarize the shortcomings of existing methods as follows:

Existing methods lack spatial considerations or cannot intuitively model the topological space of WDNs.
Existing methods utilize a limited number of pipeline features and cannot accurately capture the complex interactions between features at different domain levels, thus reducing the model’s accuracy.
Existing methods can conduct quantitative analysis of pipelines but not qualitative analysis of the various risk factors of pipelines, and therefore cannot provide practical solutions.

We propose a novel solution to address the challenges of predicting pipeline risk in WDNs. The proposed solution uses a GIS pipeline feature database to build a data-driven model based on a graph neural network (GNN) with spatial sensing ability. This model considers the pipeline's characteristics and spatial factors to make quantitative predictions about pipeline risk. Furthermore, the model provides a risk factor ranking for individual pipelines, which can guide real-world decision-making. The study's main contributions are as follows:

This study proposes the utilization of node embedding learning in the domain of leakage risk estimation for WDNs via the dual-stream network optimization embedding-multilayer perceptron framework.
This study introduces two novel techniques, the spatial perception block and the dynamic graph optimization feature enhancement block, to capture the complex nonlinear relationship between pipeline properties, spatial topology, and the likelihood of leakage.
This study uses the SHAP method to investigate further the risk factors that may lead to pipeline leakage based on the quantitative risk prediction results.

2. Related Work

2.1. Leakage Risk Research for WDNs

Leakage risk research in WDNs has been a major focus of the industry, and leakage detection methods based on data-driven models are becoming mainstream. Konstantinos et al. (2017) [10] leveraged evolutionary polynomial regression and K-means clustering to identify abnormal pipelines through unsupervised learning. Wang et al. (2017) [11] predicted pipeline leakage with support vector machines, using the operating water pressure of the pipeline as the basis for judgment. Jensen et al. (2018) [12] predicted pipeline leakage incidents using neural networks, with pipeline water volume as the metric. Sousa et al. (2023) [13] utilized a learning vector quantization classifier to identify abnormal water pressures, which serve as indicators of leakage incidents within WDNs. Although data-driven leakage prediction for WDNs has become a research priority, quantitative risk analysis and the extraction of spatial characteristics of WDN pipelines still need improvement.

2.2. Graph Neural Networks

GNNs have become an important tool for analyzing and processing graph data. They have been widely used in various fields, including e-commerce, biology, and communication (Tang et al., 2023, Jin et al., 2023) [14,15], for graph node classification and anomaly detection tasks. For instance, Huang et al. (2022) [16] harnessed a real-world financial dataset, employing graph convolutional neural networks to detect financial fraudsters within a network. Choi et al. (2020) [17] merged the transformer encoder with GNNs to enhance the accuracy of heart failure prediction. Ahmed et al. (2019) [18] extensively reviewed the applications of graph neural networks in social media networks, showcasing their significant advantage and heightened accuracy in predicting fake news. Li et al. (2020) [19] integrated graph neural networks with spatio-temporal convolutional networks, achieving superior performance in predicting neighborhood crime rates. Yu et al. (2018) [20] employed graph neural networks for urban traffic computation, leveraging spatial information on connected roads to improve traffic flow predictions.

This study incorporates GNNs into WDN leakage risk research. The approach considers both pipeline features and spatial factors in order to better understand and mitigate the risk of leaks in WDNs.

3. Materials and Methods

3.1. Preliminary

3.1.1. Description of Tasks

The study was conducted in the B2 district metering area of a city in central China, and the data were derived from the pipeline GIS data and inspection and maintenance records covering the period from 2015 to 2021. Our task is to predict the probability of a leakage incident in this section of the network using the available pipeline properties. There is a complex nonlinear relationship between the pipeline properties of WDNs and the leakage risk, so it is necessary to process the pipeline properties to the embedding space, and then obtain the risk probability of this pipeline after nonlinear calculation within the embedding space, which is described in Equation (1):

$$\{X_{1}, X_{2}, \dots, X_{M}\} \overset{f_{1}(\cdot)}{\Rightarrow} \{E_{1}, E_{2}, \dots, E_{D}\} \overset{f_{2}(\cdot)}{\Rightarrow} \hat{Y} \tag{1}$$

where $X$ denotes the natural features that the pipeline has, $M$ denotes the number of features, $E$ denotes the embedded representation of the pipeline in embedding space that has $D$ dimensions, and $\hat{Y}$ denotes the result of predicting pipeline leakage.

3.1.2. Modeling of WDNs with Graph

A graph is a data structure made up of nodes and the connections between them, which are represented as edges. Graph structures are useful for recording spatial topological relationships between data elements. This study models WDNs using graph structures to better capture their spatial distribution. Recent research by Cen et al. (2023) [21] emphasizes the link between WDN leakage incidents and the network's intrinsic characteristics. Therefore, the modeling process is divided into two main parts: the semantic features of the pipeline itself and the spatial topological distribution. The semantic features of the pipeline itself are constructed as the feature matrix $X \in \mathbb{R}^{N \times M}$, where $N$ is the number of pipelines to be predicted, and each pipeline has a feature vector of $\{X_{1}, X_{2}, \dots, X_{M}\}$. The spatial distribution of pipelines is modelled as $G = (V, E, X)$; pipelines are defined as nodes in the form of $V \in \mathbb{R}^{N \times 1}$, which have corresponding feature vectors in $X$; and their connectivity is defined by edges in the form of $E \in \mathbb{R}^{N \times 2}$.

Because spatially adjacent pipelines share similar water quality, soil properties, and other external factors, this model uses neighbor node sampling to limit the interference of these factors with the model results. The neighbor node sampling process is denoted as $N(v)$ (Hamilton et al., 2018) [22]. It represents the search for pipelines with spatial connectivity relationships in the node set $V$ using the edge set $E$. This modeling approach makes it possible for the subsequent models to exploit the spatial distribution of pipelines.
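The neighbor lookup $N(v)$ described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy edge list, the undirected treatment of edges, and the sample size `k` are all assumptions made for the example.

```python
from collections import defaultdict
import random

def build_adjacency(edges):
    """Map each pipeline node to the set of nodes it shares an edge with."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)  # assumption: pipe connectivity is treated as undirected
    return adj

def sample_neighbors(adj, v, k, rng=None):
    """Return at most k neighbors of v (the set N(v)), sampled without replacement."""
    rng = rng or random.Random(0)
    neighbors = sorted(adj.get(v, ()))
    if len(neighbors) <= k:
        return neighbors
    return sorted(rng.sample(neighbors, k))

# Toy edge list for illustration only, not the paper's data.
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
adj = build_adjacency(edges)
print(sample_neighbors(adj, 1, k=2))
```

Capping the number of sampled neighbors keeps the cost of each aggregation step bounded, even for pipes at heavily connected junctions.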

3.2. Model Framework

The framework is shown in Figure 1; the model adopts the dual-stream network design through the spatial perception block (SPB) and dynamic graph optimization feature enhancement block (DGB). The pipeline feature matrix and graph structure are inputs, and the stacked spatial domain graph convolution model learns the potential embedding of each pipeline. The results of the dual-stream network are aggregated and fed into a multilayer perceptron (MLP) optimized with BatchNorm layers (Ioffe et al., 2015) [23] to generate the final pipeline node representation vectors, then the final leakage probability of the pipeline nodes is generated. The components of the model are described in detail in this section.

3.2.1. Spatial Perception Block

This block explicitly utilizes the spatial topology of WDNs after graph modeling, specifically applying graph convolution operations in the spatial domain to perform feature averaging and aggregation operations on the target pipeline and its surrounding pipeline nodes, which are then subjected to nonlinear transformations to ensure dimensional invariance (Hamilton et al., 2018) [22]. After the multi-layer graph convolution operation, this block realizes the synchronous embedding of pipeline features with spatial topology (Yang et al., 2023) [24]. The spatial embedding of the pipeline node $v$ is generated as shown in Equation (2):

$$E_{1v}^{k} = \sigma\left(W_{k} \cdot \mathrm{mean}\left(\{E_{1v}^{k-1}\} \cup \{E_{1u}^{k-1}\}\right)\right) \in \mathbb{R}^{1 \times D} \tag{2}$$

where $\forall u \in N(v)$, $E_{1v}^{(0)} = X(v)$, mean is the average computation, and $\sigma$ represents a nonlinear activation function such as the ReLU.

This part achieves differentiated spatial embedding by aggregating the features of each pipeline together with those of its neighboring pipelines. It also enables the final pipeline node embedding to perceive the spatial structure of WDNs.
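One layer of the mean aggregation in Equation (2) might look like the sketch below. It assumes ReLU for $\sigma$, a single weight matrix `W` shared by all nodes, and embeddings stored in a plain dictionary; the names and toy values are illustrative only.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, xi) for xi in x]

def spb_layer(E, adj, W):
    """One layer of Eq. (2): E[v] <- ReLU(W . mean({E[v]} U {E[u] for u in N(v)}))."""
    out = {}
    for v, e_v in E.items():
        group = [e_v] + [E[u] for u in adj.get(v, [])]
        mean = [sum(col) / len(group) for col in zip(*group)]
        out[v] = relu(matvec(W, mean))
    return out

# Toy 3-node chain 0 - 1 - 2 with 2-dimensional features (illustrative values).
E0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adj = {0: [1], 1: [0, 2], 2: [1]}
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the example easy to check
E1 = spb_layer(E0, adj, W)
```

Stacking the layer `k` times lets each node embedding absorb information from pipes up to `k` hops away.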

3.2.2. Dynamic Graph Optimization Feature Enhancement Block

This block focuses on the features of the pipeline itself in the embedding generation of the pipeline nodes. Since the WDN contains pipes with similar properties, the features of an individual pipe are enhanced by aggregating the feature vectors of the pipes most closely related to it. This module introduces a dynamic graph mechanism that allows the model to spontaneously learn potentially similar nodes to enhance the richness of the embedding representation (Ma et al., 2021) [25]. The dynamic graph $A_{apt}$ construction process is shown in Equation (3):

$$\begin{aligned} M^{k} &= \tanh\left(\beta E^{k} \Gamma_{gc}^{k}\right) \in \mathbb{R}^{N \times D} \\ A_{apt}(i, j) &= \mathrm{ReLU}\left(\tanh\left(\beta \left(M_{ij}^{1} {M_{ij}^{2}}^{T} - M_{ij}^{2} {M_{ij}^{1}}^{T}\right)\right)\right) \in \mathbb{R}^{N \times N} \end{aligned} \tag{3}$$

where $M^{1}, M^{2} \in \mathbb{R}^{N \times D}$ are produced by two neural networks with randomly initialized embeddings $E^{1}, E^{2} \in \mathbb{R}^{N \times D}$ and trainable parameters $\Gamma_{gc}^{1}, \Gamma_{gc}^{2} \in \mathbb{R}^{N \times D}$, computed via the tanh activation function, and $\beta$ is a hyperparameter that adjusts the saturation rate of the activation.
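The construction in Equation (3) can be illustrated with a small pure-Python sketch. Two points are assumptions of this example rather than statements from the paper: the trainable matrices are taken to be $D \times D$ so that the product $E^{k} \Gamma_{gc}^{k}$ is defined, and the embeddings are fixed toy values instead of learned parameters.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def tanh_scaled(A, beta):
    return [[math.tanh(beta * x) for x in row] for row in A]

def adaptive_adjacency(E1, E2, G1, G2, beta=1.0):
    """A_apt = ReLU(tanh(beta * (M1 M2^T - M2 M1^T))), with Mk = tanh(beta * Ek Gk)."""
    M1 = tanh_scaled(matmul(E1, G1), beta)
    M2 = tanh_scaled(matmul(E2, G2), beta)
    P = matmul(M1, transpose(M2))
    Q = matmul(M2, transpose(M1))
    return [[max(0.0, math.tanh(beta * (p - q))) for p, q in zip(rp, rq)]
            for rp, rq in zip(P, Q)]

# Toy embeddings for N = 3 nodes and D = 2 dimensions (illustrative values).
E1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
E2 = [[0.0, 1.0], [1.0, 0.0], [1.0, -1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]  # assumed D x D parameter matrices
A = adaptive_adjacency(E1, E2, I, I)
```

Because the difference inside the tanh is antisymmetric, the ReLU keeps at most one direction of every node pair, so the learned graph is directed and has no self-loops.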

After constructing the dynamic graph, this module uses the maximum pooling method to achieve the feature enhancement of the pipeline node $v$, as shown in Equation (4):

$$E_{2v}^{k} = \max\left(\sigma\left(W_{p} E_{2u}^{k-1} + b\right)\right) + W_{p} E_{2v}^{k-1} \in \mathbb{R}^{1 \times D} \tag{4}$$

where $\forall u \in N_{A_{apt}}(v)$, $E_{2u}^{(0)} = X(u)$, $E_{2v}^{(0)} = X(v)$, max is the maximizing function, $W_{p}$ is the pool of learnable weights, and $b$ is the learnable bias constant.
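A minimal sketch of the max-pooling update in Equation (4) follows. It assumes ReLU for $\sigma$, an element-wise maximum over the neighbors, and a zero pooled vector for nodes with no neighbors in the dynamic graph; the last choice in particular is an assumption of this example.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, xi) for xi in x]

def dgb_layer(E, neighbors, W, b):
    """One layer of Eq. (4): element-wise max over transformed neighbors, plus a skip term."""
    out = {}
    for v, e_v in E.items():
        skip = matvec(W, e_v)  # W_p E[v]
        msgs = [relu([h + bi for h, bi in zip(matvec(W, E[u]), b)])
                for u in neighbors.get(v, [])]
        # Assumption: isolated nodes receive a zero pooled vector.
        pooled = [max(col) for col in zip(*msgs)] if msgs else [0.0] * len(skip)
        out[v] = [p + s for p, s in zip(pooled, skip)]
    return out

# Toy embeddings and dynamic-graph neighborhoods (illustrative values).
E = {0: [1.0, 2.0], 1: [3.0, -1.0], 2: [-2.0, 4.0]}
neighbors = {0: [1, 2], 1: [0], 2: []}
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
out = dgb_layer(E, neighbors, W, b)
```

Max pooling keeps, for each embedding dimension, the strongest signal among the similar pipes, while the skip term preserves the node's own features.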

The feature-enhanced embedding produced by this block is concatenated with the spatial embedding obtained from the spatial perception block and fed into the MLP optimized by BatchNorm to finally obtain the representation vectors of the pipeline node. This process can be described by Equation (5):

$$z_{v}^{(l)} = \mathrm{Dropout}\left(w^{(l)} z_{v}^{(l-1)} + b^{(l)}\right) \tag{5}$$

where $z_{v}^{(0)} = \frac{\mathrm{CONCAT}(E_{1v}, E_{2v})}{1 + e^{-\mathrm{CONCAT}(E_{1v}, E_{2v})}} \in \mathbb{R}^{2 \times D}$, $\mathrm{CONCAT}$ denotes the dimension splicing operation, $w^{(l)}$ and $b^{(l)}$ are the learnable parameters of the linear layer, and $\mathrm{Dropout}$ introduces randomness in the training process by randomly masking some neurons (Hinton et al., 2012) [26].
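The fusion step in Equation (5) can be sketched as below. The gating $x / (1 + e^{-x})$ applied to the concatenated vector is the function commonly called SiLU or Swish. The sketch omits BatchNorm, and the inverted-dropout rescaling by `1 / keep` is a common convention assumed here, not something the text specifies.

```python
import math
import random

def silu_concat(e1, e2):
    """z^(0): concatenate both stream embeddings, then apply x / (1 + exp(-x)) element-wise."""
    x = list(e1) + list(e2)
    return [xi / (1.0 + math.exp(-xi)) for xi in x]

def linear(w, b, z):
    """Affine map w z + b with w given as a list of rows."""
    return [sum(wij * zj for wij, zj in zip(row, z)) + bi for row, bi in zip(w, b)]

def dropout(z, p, rng, training=True):
    """Randomly zero entries with probability p during training (assumed inverted scaling)."""
    if not training or p == 0.0:
        return list(z)
    keep = 1.0 - p
    return [zi / keep if rng.random() < keep else 0.0 for zi in z]

def mlp_layer(w, b, z, p=0.5, rng=None, training=True):
    """One layer of Eq. (5): Dropout(w z + b)."""
    return dropout(linear(w, b, z), p, rng or random.Random(0), training)

# Toy stream embeddings with D = 2 (illustrative values).
z0 = silu_concat([1.0, -1.0], [0.0, 2.0])
z1 = mlp_layer([[1.0, 1.0, 1.0, 1.0]], [0.5], z0, training=False)
```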

3.2.3. Loss Functions and Leakage Probability Generation

The node representation $z_{v}$ output after stacking the MLP needs to be mapped to the probability space to obtain the final leakage probability. To ensure the non-negativity and discriminability of the leakage probability and treat it as a probability distribution, the model maps the final generated node representations into probability space using the sigmoid function defined in Equation (6). This study minimizes the loss using the cross-entropy function in Equation (7) during model training.

$$\hat{Y} = \frac{1}{1 + e^{-z_{v}}} \tag{6}$$

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left(Y_{i} \log(\hat{Y}_{i}) + (1 - Y_{i}) \log(1 - \hat{Y}_{i})\right) \tag{7}$$

where $\hat{Y}_{i}$ is the predicted leakage probability of pipeline $i$, and $Y_{i}$ is the actual label of pipeline $i$.
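Equations (6) and (7) correspond to the standard sigmoid and binary cross-entropy, which can be written out as a short sketch. The clipping constant `eps` is an assumption added for numerical safety and is not part of the formulas above.

```python
import math

def sigmoid(z):
    """Eq. (6): map a scalar node score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Eq. (7): mean negative log-likelihood over N pipelines."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # assumed clipping to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Toy scores and labels (illustrative values): 1 = leakage, 0 = normal.
probs = [sigmoid(z) for z in (2.0, -1.0, 0.0)]
loss = binary_cross_entropy([1, 0, 1], probs)
```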

3.3. Evaluation Indicators and Model Interpretation

3.3.1. Evaluation Indicators

In this study, the evaluation metrics commonly used in binary classification tasks were chosen: precision, accuracy, recall, and the Matthews correlation coefficient (MCC):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{9}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{10}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{11}$$

where true positive (TP), false positive (FP), true negative (TN), and false negative (FN) denote normal samples predicted to be normal, leakage samples predicted to be normal, leakage samples predicted to be leakage, and normal samples predicted to be leakage, respectively.

The first three metrics range from 0% to 100%, and the MCC ranges from −100% to 100%; higher values are better in all cases. The metrics emphasize different aspects of model ability. For example, higher recall and precision indicate that the model is better at recognizing positive samples, and a higher accuracy indicates that the model classifies more of the samples correctly overall. Because the leakage and normal samples are imbalanced, we additionally introduce the MCC, which is more robust under class imbalance: a value of 0% means that the model performs no better than random prediction, a value of −100% means that every prediction is wrong, and a value of 100% means that every sample is predicted correctly. To better evaluate model quality, this study also uses the receiver operating characteristic (ROC) curve to visualize and analyze the prediction results. The ROC curve is generated from the model's predictions on the test samples by varying the probability threshold that separates the two classes (Carrington et al., 2021) [27]. The curve plots the true positive rate against the false positive rate, and the area under the curve (AUC) summarizes the model's performance. In general, a larger AUC indicates better model performance.
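The four metrics can be computed directly from confusion-matrix counts, as in the sketch below. It uses the standard definition of recall, TP / (TP + FN), and returns 0 when a denominator is zero, which is a convention assumed for this example. The counts are made up for illustration and are not results from the paper.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Precision, accuracy, recall, and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Assumed convention: MCC is reported as 0 when its denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical counts for illustration only.
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 4) for name, value in m.items()})
```

Unlike accuracy, the MCC uses all four cells of the confusion matrix, which is why it stays informative when one class dominates.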

3.3.2. SHAP Risk Factor Explanatory Mechanism

To determine how the different risk factors of a particular pipeline affect its leakage status, this study introduces SHAP (Lundberg et al., 2017) [28] ranking. The SHAP method employs the Shapley value from game theory to quantify each feature's contribution to the model's final prediction. The SHAP value of a feature represents the weighted average change in the model prediction when that feature is added to each possible subset of the other features, as shown in Equation (12):

$$\phi_{i}(x) = \sum_{S \subseteq \{1, \dots, p\} \setminus \{i\}} \frac{|S|! \, (p - |S| - 1)!}{p!} \left[f(x_{S \cup \{i\}}) - f(x_{S})\right] \tag{12}$$

where $x_{S}$ denotes the input $x$ restricted to the features in subset $S$, $p$ denotes the number of features in the input, $|S|$ denotes the size of $S$, $f(x_{S \cup \{i\}})$ denotes the output of the model with feature $i$ included, and $f(x_{S})$ denotes the output of the model without feature $i$ included.
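For a small number of features, Equation (12) can be evaluated exactly by enumerating every subset, as in the sketch below. The toy value function stands in for the model output $f(x_{S})$ and is purely hypothetical. Practical SHAP implementations approximate this sum, since it grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, p):
    """Exact Shapley values for p features; value_fn maps a set of feature indices to a number."""
    phi = [0.0] * p
    for i in range(p):
        others = [j for j in range(p) if j != i]
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

def toy_value(s):
    """Hypothetical model output: two additive effects plus one interaction."""
    return 2.0 * (0 in s) + 1.0 * (1 in s) + 4.0 * (0 in s and 2 in s)

phi = shapley_values(toy_value, p=3)
```

The interaction term is split evenly between features 0 and 2, and the values sum to the difference between the full-feature output and the empty-set output.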

The results of SHAP analysis are generally expressed through overall feature violin plots. Within the overall feature plot, the order of the vertical coordinates corresponds to the level of influence ability of different features. In contrast, the width of the horizontal coordinates of the violin plot corresponding to different feature values corresponds to the degree of influence of the feature on the predicted results of the model. The violin plot’s width corresponds to the samples’ distribution, and its color also represents the level of feature values. The SHAP methodology allows for the analysis of pipelines on a case-by-case basis and guides the repair of risky pipelines in practice. Specifically, if we find that a pipeline has a high probability of incurring a risk of leakage, and the SHAP analysis yields that the highest contributing factor to the high probability of the pipeline is the pipe material, we can replace the material of the pipeline during the repair work. Similarly, if the risk factor contributing more to the risk of the line is the age of the pipe and the water pressure in the pipe, then we can renew the pipe or reduce its operating pressure.

4. Results and Discussion

To verify the feasibility of the model, extensive comparison and visualization experiments were conducted on real-world pipeline datasets based on the GIS pipeline system of a managed area in a city in central China.

4.1. Dataset Description

This study selected seven pipeline characteristics that critically influence leakage: pipeline material, pipeline diameter, pipeline length, pipeline age, pipeline burial depth, operating pressure, and previous failures. Pipelines where leakage occurred were given a separate label, “leakage”. Missing values were handled as follows: pipeline features missing from the GIS pipeline database were filled in from the maintenance work orders, and vice versa, matched by section number. If the relevant record was missing from both data sources, the pipeline was deleted from the dataset to preserve the reliability of the model. After cleaning, the numerical features, in particular pipeline length, pipeline age, pipeline burial depth, and operating pressure, were min–max normalized.
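The mutual fill-in of missing values by section number can be sketched as follows. The record layout, field names, and the preference for the GIS value when both sources have one are assumptions made for this illustration.

```python
def merge_records(gis, work_orders, fields):
    """Fill each missing field from the other source; drop sections still incomplete."""
    merged = {}
    for section in sorted(gis.keys() | work_orders.keys()):
        a = gis.get(section, {})
        b = work_orders.get(section, {})
        row = {}
        for field in fields:
            value = a.get(field)
            if value is None:  # assumption: the GIS value wins when both exist
                value = b.get(field)
            row[field] = value
        if all(row[field] is not None for field in fields):
            merged[section] = row
    return merged

# Hypothetical records keyed by section number (illustrative values).
gis = {
    "S1": {"age": 28, "material": None},
    "S2": {"age": None, "material": "PPR"},
    "S3": {"age": None, "material": "steel"},
}
work_orders = {"S1": {"material": "steel"}, "S2": {"age": 6}}
clean = merge_records(gis, work_orders, ["age", "material"])
```

Section `S3` is dropped because its age is missing from both sources, mirroring the deletion rule described above.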

To clean up the GIS database, irrelevant indicators such as pipeline numbers and update times, as well as duplicate records, were removed. The categorical features, pipeline material and previous failures, were expanded to nine dimensions using one-hot encoding: seven for the pipeline material types and two for the previous-failure marker. Together with the five numerical features, this gives 14 input dimensions in total. Numerical features such as pipeline length and age were min–max scaled for model training. The dataset contained 11,489 data entries and 30,612 edges. Random sampling was used to divide the entries into training, validation, and test sets in a 7:2:1 ratio. To avoid the risk of potential data leakage (Chen et al., 2019) [29], a BFS-optimized stratified sampling method was used when dividing the dataset: the pipeline nodes were divided into three regions with as few intersections as possible, so that the model could not aggregate test node information in advance. In the experiment, the model was trained on the training set, and the trained parameters were validated on the validation set. The model that performed best on the validation set was finally tested on the test set.
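The 14-dimensional input described above (seven material dimensions, two previous-failure dimensions, and five scaled numerical features) can be assembled as in the sketch below. Only steel, PPR, steel–plastic, and galvanized are named in the text; the remaining three material names, the field names, and all values are placeholders invented for the example.

```python
# The last three material names are hypothetical placeholders.
MATERIALS = ["steel", "PPR", "steel-plastic", "galvanized",
             "other_1", "other_2", "other_3"]
NUMERIC = ["diameter", "length", "age", "depth", "pressure"]

def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]

def min_max_ranges(rows, keys):
    """Per-feature (min, max) computed over all rows."""
    return {k: (min(r[k] for r in rows), max(r[k] for r in rows)) for k in keys}

def encode(row, ranges):
    """Build one 14-dimensional feature vector: 7 + 2 one-hot dims and 5 scaled dims."""
    vec = one_hot(row["material"], MATERIALS)
    vec += one_hot(row["failed_before"], [False, True])
    for key in NUMERIC:
        lo, hi = ranges[key]
        vec.append((row[key] - lo) / (hi - lo) if hi > lo else 0.0)
    return vec

# Two hypothetical pipes (illustrative values, not the paper's data).
pipes = [
    {"material": "steel", "failed_before": True, "diameter": 300,
     "length": 120.0, "age": 28, "depth": 1.5, "pressure": 0.40},
    {"material": "PPR", "failed_before": False, "diameter": 100,
     "length": 40.0, "age": 6, "depth": 1.0, "pressure": 0.30},
]
ranges = min_max_ranges(pipes, NUMERIC)
X = [encode(p, ranges) for p in pipes]
```

In practice the min and max would be computed on the training split only and reused for the validation and test splits, so that no test statistics leak into training.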

4.2. Visual Comparison Experiment

This section compares the model proposed with five commonly used models for WDN leakage risk prediction to assess its performance. These models include BP networks, random forest, decision tree, AdaBoost (Suárez et al., 2023) [30], and logistic regression. In this study, we conducted 20 experiments with the baseline model and our model and subsequently calculated the resultant averages for the various evaluation metrics. The results of the combined experiments are shown in Figure 2 and detailed in Table 1.

Figure 2 and Table 1 show that the proposed model outperformed the existing models, achieving an accuracy of 93.23%, a precision of 92.18%, and a recall of 92.05%. These figures represent improvements of 2.85%, 1.514%, and 4.000%, respectively, over the second-best model. Figure 2 also shows that the proposed model led on all evaluated metrics and that its results were markedly more stable than those of the other machine learning models. Although ensemble tree models offered stable classification, they did not adequately capture the nonlinear relationships between the features and leakage or the spatial relationships among pipelines, so their accuracy was lower. BP-based models do account for nonlinear interactions, but the variability of their results across metrics indicates weaker generalization. By incorporating spatial information, GCN achieved better performance on the dataset, which highlights the importance of spatial factors for predictive accuracy. Most models showed higher accuracy than recall, which may indicate a data imbalance within the test set. The ROC curves give a clearer picture of the model's predictive capability, particularly on an imbalanced dataset. Given a balanced dataset and the aim of generating a more precise anomaly score, both the ROC and MCC offer a more accurate reflection of model performance. The proposed model achieved the best ROC curve, with an AUC of 0.952, and an MCC of 87.14%. To verify the effectiveness of the proposed method, an ablation study was conducted in which the SPB and the DGB were each masked and the resulting performance was assessed. The results of these ablation experiments confirm that both modules substantially improved the model's accuracy in predicting leakage probability. In particular, the DGB, which enhances features through the dynamic graph, improved accuracy, recall, and MCC by 2.6319%, 1.7727%, and 4.5708%, respectively, compared with the model without the DGB.

The results of the GCN model show that learning the interactions between pipeline features, together with a general representation of similar pipelines, can markedly improve predictive capability. The TSNE-KDE joint plot was therefore adopted to examine the model's node representation learning. The method rests on the premise that pipeline nodes are similar when their high-dimensional representations, after nonlinear projection, lie close together in the low-dimensional space.

In Figure 3a–c, the results of the TSNE-KDE analyses performed on the training, validation, and test datasets are presented, demonstrating the model’s adeptness at internalizing pipeline representations with clear class demarcation. Within the TSNE-KDE visualizations of the training dataset, it is evident that the model adeptly conglomerated disparate pipeline representations, engendering clusters with distinct nuclei. This phenomenon underscores the model’s exemplary learning aptitude in the training phase. Conversely, the TSNE-KDE visualizations for the validation and test sets, while still delineating between leakage and normative pipeline representations, did not manifest pronounced clustering within the leakage representations. This observation intimates a potential diminution in the model’s generalization capabilities across the validation and test datasets. However, it is imperative to note that this limitation predominantly impinges upon the model’s interpretative capacity rather than its predictive accuracy. Furthermore, as depicted in Figure 3d, there was a nominal decrease in leakage prediction accuracy when transitioning from the training set to the test set, with the decrease being less than 5%. Notwithstanding this slight attrition, the model’s ultimate prediction efficacy substantially exceeded competing models. This resilience in performance is ascribed to the disparity between the quantities of leakage and normal samples within the dataset. Consequently, enhancing the model’s generalization capabilities is a critical area for further investigation to overcome the identified shortcomings while maintaining or improving its predictive accuracy.

4.3. Leakage Risk Factor SHAP Analysis

As indicated earlier, the SHAP algorithm complemented our model’s interpretability. In this section, we used the SHAP method to visualize pipeline leakage risk factors. Specifically, we used a violin plot to analyze the risk factors for pipeline leakage at the full pipeline level. Then, we further used a waterfall plot to analyze two pipelines with different model identification results. The results of the evaluation are shown in Figure 4.

Figure 4a shows the ranking of leakage risk factors obtained from the predictions for all pipelines. According to the overall violin plot ranking, pipeline age was a key factor in leakage risk. This may be due to the corrosive effects of soil and water quality on the pipeline, which increase over time and distance. Additionally, steel pipes are less resistant to corrosion, and previous damage or degradation of the pipe material can contribute to risk. Notably, a previous failure also significantly increases the risk of leakage, because repair work weakens the original pipeline. Figure 4b shows the feature contributions for a pipeline that the model identified as at risk of leakage: its age of 28 years was the largest contributor, and its previous failure and steel material also strongly influenced the model's judgment. Figure 4c shows the contributions for a pipeline identified as not at risk: its relatively young age of 6 years reduced the risk of leakage, as did its PPR material and the absence of previous failures, although its relatively high pressure increased the probability of leakage. The analyses of these two pipelines are consistent with the analysis of all pipelines. The influence of operating water pressure on leakage risk, however, remains unclear. Although water pressure made only a moderate contribution in the overall pattern, the analysis of specific pipelines showed that leakage was more sensitive to changes in water pressure. The interaction between water pressure and the other features still needs to be clarified. Conversely, the steel–plastic, PPR, and galvanized features had minimal significance in the model, which can be partially attributed to the inherently low leakage risk of these materials. These findings support interventions such as prioritizing the replacement of old pipelines, reducing reliance on steel pipe in pipeline repairs, and lowering the operating water pressure in pipelines deemed to be at high risk. Such measures improve the safety and efficiency of the pipeline infrastructure and are consistent with sustainable, risk-averse management.

4.4. Practical Applications

In this section, pipeline coordinates from the GIS system are treated as nodes in the B2 area, and each node is classified by its model-predicted leakage probability. Blue nodes have a leakage probability of 0.50 or below and indicate normal conditions. Orange nodes have a leakage probability between 0.50 and 0.75 and indicate medium risk. Red nodes have a leakage probability of 0.75 or above and indicate high risk. This criterion was used to evaluate the status of the local pipeline network, as illustrated in Figure 5b.
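
The classification rule above can be sketched as a small function. The thresholds and colors follow the text; the function name, the level labels as strings, and the example probabilities are illustrative assumptions.

```python
# Map a model-predicted leakage probability to the risk level and
# display color described in the text.
RISK_COLORS = {"normal": "blue", "medium": "orange", "high": "red"}


def risk_level(probability):
    """Return 'normal', 'medium', or 'high' for a probability in [0, 1]."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability <= 0.50:   # 0.50 or below: normal
        return "normal"
    if probability < 0.75:    # between 0.50 and 0.75: medium risk
        return "medium"
    return "high"             # 0.75 or above: high risk


# Hypothetical probabilities, used only to exercise the rule.
for p in (0.12, 0.50, 0.63, 0.75, 0.91):
    level = risk_level(p)
    print(f"p={p:.2f} -> {level} ({RISK_COLORS[level]})")
```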

Figure 5c summarizes the assessment of all 11,489 pipelines in the database: 91.64% were classified as normal, 6.92% as medium-risk, and 1.44% as high-risk. To cross-check the SHAP analysis, pipeline age was used as the principal reference feature for a heuristic data analysis. The violin plot in Figure 5d shows a clear difference in age distribution between leaky and normal pipelines: the likelihood of leakage increases with pipeline age, which agrees with the SHAP results. This correlation underscores the impact of pipeline age on pipeline integrity and the need to include age in pipeline risk assessment frameworks.

5. Conclusions

Failures in WDNs are often unexpected and can have severe consequences, so previous studies have developed physical, statistical, and machine learning models to predict the failure probability of individual pipes in a network. In this study, a novel embedding–MLP model is proposed to address the challenges of predicting leakage risk in WDN pipelines. The model incorporates a spatial perception block and a dynamic graph optimization feature enhancement block, and uses SHAP to interpret its predictions. This approach improves the accuracy of leakage risk prediction by addressing shortcomings of existing methods, including insufficient spatial perception, limited feature learning ability, and inadequate quantitative analysis of risk factors.

In this study, pipeline GIS data and maintenance records from a city in central China were collected. After data cleaning and parameter tuning, the model achieved accuracy, precision, recall, MCC, and AUC values of 93.23%, 92.18%, 92.05%, 87.14%, and 0.95, respectively. Its robustness was further supported by detailed comparative experiments and visualizations. These results indicate that the model can help water utility managers identify and prioritize pipeline repairs on the basis of the calculated probability of failure. SHAP value experiments further showed that “pipeline age,” “previous invalid,” and “material steel” had the greatest influence on the model’s predictions. This finding agrees with the heuristic data analysis, which supports the model’s credibility and its value as a decision-support tool for management teams seeking to improve the operational efficiency and reliability of WDNs.

Despite these insights, this study has limitations that point to future research. The model still places high demands on data quality: the GIS data must include a topological–spatial map recording the connections between pipelines, and more pipeline features are needed for training and inference. As a result, transferring the model to other geographic scenarios remains a major challenge. In subsequent research, we will propose a new way of selecting pipeline features to support model training under different data qualities. In addition, the scarcity of leakage events in the dataset highlights the need for data augmentation to obtain a more balanced sample. We also plan to cover more operating conditions and background characteristics of WDN pipelines, which will involve building a more comprehensive database of pipeline attributes. Building on this study, the ultimate goal is to develop an early warning platform for assessing and mitigating the risk of WDN pipeline leakage.

Author Contributions

Conceptualization, W.W.; methodology, W.W. and Y.K.; validation, X.P.; formal analysis, X.P.; resources, W.W.; data curation, Y.X.; writing—original draft preparation, X.P. and Y.K.; writing—review and editing, L.H. and Y.K.; visualization, X.P. and Y.X.; supervision, Y.K. and L.H.; project administration, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data available on request due to restrictions, e.g., privacy or ethical.

Acknowledgments

Thanks to Zhengzhou Water Supply Company for the data support.

Conflicts of Interest

Author Yuexia Xu was employed by the company Zhengzhou Water Supply Investment Holdings Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Pietrucha-Urbanik, K.; Rak, J. Water, Resources, and Resilience: Insights from Diverse Environmental Studies. Water 2023, 15, 3965. [Google Scholar] [CrossRef]
Peng, L.; Liu, Y.F.; Zheng, J.C. Safety assessment and analysis for urban water supply distribution network. Water Purif. Technol. 2020, 39, 140–148. [Google Scholar]
Song, Y.; Liu, S.G.; Qi, Y.Y. Safety assessment and analysis for urban municipal pipeline networks. Water Purif. Technol. 2023, 42, 151–160+175. [Google Scholar]
Ho, C.; Lin, M.; Lo, S. Use of a GIS-based hybrid artificial neural network to prioritize the order of pipe replacement in a water distribution network. Environ. Monit. Assess. 2010, 166, 177–189. [Google Scholar] [CrossRef]
Bubtiena, A.M.; Elshafie, A.H.; Jafaar, O. Application of Artificial Neural networks in modeling water networks. In Proceedings of the 2011 IEEE 7th International Colloquium on Signal Processing and Its Applications, Penang, Malaysia, 4–6 March 2011; pp. 50–57. [Google Scholar]
Zhou, X.; Tang, Z.; Xu, W.; Meng, F.; Chu, X.; Xin, K.; Fu, G. Deep learning identifies accurate burst locations in water distribution networks. Water Res. 2019, 166, 115058. [Google Scholar] [CrossRef] [PubMed]
Taiwo, R.; Zayed, T.; Ben Seghier, M.E.A. Integrated Intelligent Models for Predicting Water Pipe Failure Probability. Alex. Eng. J. 2024, 86, 243–257. [Google Scholar] [CrossRef]
Robles-Velasco, A.; Cortés, P.; Muñuzuri, J.; Onieva, L. Prediction of pipe failures in water supply networks using logistic regression and support vector classification. Reliab. Eng. Syst. Saf. 2020, 196, 106754. [Google Scholar] [CrossRef]
Chen, T.Y.; Vladeanu, G.; Yazdekhasti, S.; Daly, C.M. Performance evaluation of pipe break machine learning models using datasets from multiple utilities. J. Infrastruct. Syst. 2022, 28, 05022002. [Google Scholar] [CrossRef]
Konstantinos, K.; Kourosh, B.; Raz1yeh, F. Pipeline failure prediction in water distribution networks using evolutionary polynomial regression combined with K means clustering. Urban Water J. 2017, 14, 737–742. [Google Scholar]
Wang, X.; Chen, Z.; Zhong, X.; Lu, N. PSO⁃SVM based leakage diagnosis method of water supply pipeline. Mod. Electron. Tech. 2018, 41, 156–159. [Google Scholar]
Jensen, T.N.; Puig, V.; Romera, J.; Kallesøe, C.S.; Wisniewski, R.; Bendtsen, J.D. Leakage localization in water distribution using data-driven models and sensitivity analysis. IFAC-Pap. 2018, 51, 736–741. [Google Scholar] [CrossRef]
Sousa, D.P.; Du, R.; Mairton Barros da Silva, J., Jr.; Cavalcante, C.C.; Fischione, C. Leakage detection in water distribution networks using machine-learning strategies. Water Supply 2023, 23, 1115–1126. [Google Scholar] [CrossRef]
Tang, J.; Hua, F.; Gao, Z.; Zhao, P.; Li, J. GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection. arXiv 2023. [Google Scholar] [CrossRef]
Jin, G.; Liang, Y.; Fang, Y.; Shao, Z.; Huang, J.; Zhang, J.; Zheng, Y. Spatio-Temporal Graph Neural Networks for Predictive Learning in Urban Computing: A Survey. arXiv 2023. [Google Scholar] [CrossRef]
Huang, X.; Yang, Y.; Wang, Y. DGraph: A large-scale financial dataset for graph anomaly detection. Adv. Neural Inf. Process. Syst. 2022, 35, 22765–22777. [Google Scholar]
Edward, C.; Xu, Z.; Li, Y.; Dusenberry, M.; Flores, G.; Xue, E.; Dai, A. Learning the Graphical Structure of Electronic Health Records with Graph Convolutional Transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 606–613. [Google Scholar]
Ahmed, S.; Hinkelmann, K.; Corradini, F. Combining machine learning with knowledge engineering to detect fake news in social networks—A survey. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 12, p. 8. [Google Scholar]
Li, Z.; Huang, C.; Xia, L.; Xu, Y.; Pei, J. Spatial-Temporal Hypergraph Self-Supervised Learning for Crime Prediction. In Proceedings of the IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 2984–2996. [Google Scholar]
Yu, B.; Yin, H.; Zhu, Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 3634–3640. [Google Scholar]
Cen, H.; Huang, D.; Liu, Q.; Zong, Z.; Tang, A. Application Research on Risk Assessment of Municipal Pipeline Network Based on Random Forest Machine Learning Algorithm. Water 2023, 15, 1964. [Google Scholar] [CrossRef]
Hamilton, W.L.; Ying, R.; Leskovec, J. Inductive Representation Learning on Large Graphs. arXiv 2017. [Google Scholar] [CrossRef]
Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
Yang, C.; Wu, Q.; Wang, J.; Yan, J. Graph Neural Networks are Inherently Good Generalizers: Insights by Bridging GNNs and MLPs. arXiv 2023. [Google Scholar] [CrossRef]
Ma, X.; Wu, J.; Xue, S.; Yang, J.; Zhou, C.; Sheng, Q.Z.; Xiong, H.; Akoglu, L. A Comprehensive Survey on Graph Anomaly Detection with Deep Learning. IEEE Trans. Knowl. Data Eng. 2021, 35, 12012–12038. [Google Scholar] [CrossRef]
Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Ruslan, R. Salakhutdinov Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012. [Google Scholar] [CrossRef]
Carrington, A.M.; Manuel, D.G.; Fieguth, P.W. Deep ROC analysis and AUC as balanced average accuracy to improve model selection, understanding, and interpretation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 45, 329–341. [Google Scholar] [CrossRef] [PubMed]
Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777. [Google Scholar]
Chen, D.; Lin, Y.; Li, W.; Li, P.; Zhou, J.; Sun, X. Measuring and Relieving the Over-smoothing Problem for Graph Neural Networks from the Topological View. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 34. [Google Scholar]
Suárez, M.; Martínez, R.; Torres, A.M.; Ramón, A.; Blasco, P.; Mateo, J. Personalized Risk Assessment of Hepatic Fibrosis after Cholecystectomy in Metabolic-Associated Steatotic Liver Disease: A Machine Learning Approach. J. Clin. Med. 2023, 12, 6489. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Framework of the model.

Figure 2. Results of the comparison experiments on the test set: (a) precision result, (b) accuracy result, (c) recall result, and (d) ROC and AUC results.

Figure 3. Visualization experimental results: (a) TSNE−KDE on the training set, (b) TSNE−KDE on the validation set, (c) TSNE−KDE on the test set, and (d) comparison of model performance on the training set, test set, and validation set.

Figure 4. Result of SHAP analysis: (a) analysis results of pipeline leakage at the full pipeline level and (b,c) waterfall analysis for two separate samples.

Figure 5. Coordinate mapping of model prediction results: (a) GIS system pipeline coordinate mapping results, (b) assessment results, (c) circular assessment results, and (d) results of heuristic data analysis using pipeline age as a reference.

Table 1. Comparison of experimental results.

Model Type	Training Set Precision (%)	Test Set Precision (%)	Test Set Accuracy (%)	Test Set Recall (%)	Test Set MCC (%)
Proposed method	96.8996	93.2346	92.1823	92.0526	87.1423
GCN	94.4862	90.3806	90.6678	88.0526	83.6821
BP	99.7746	86.1159	86.1271	73.0392	70.2132
RF	93.4798	87.2053	84.8039	79.0697	72.0127
DT	96.8715	88.3217	86.9109	77.2093	71.9531
AdaBoost	96.2484	82.3374	83.4951	84.7041	79.4239
LR	81.2648	79.1739	77.8501	58.5784	53.4952
No-SPB	92.4824	90.5982	92.6342	91.4586	86.2581
No-DGB	95.4237	91.3249	90.0023	89.6859	81.6873
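
For readers who want to compute the same kinds of indicators as the columns of Table 1 from their own predictions, the sketch below derives accuracy, precision, recall, and MCC from binary confusion-matrix counts using the standard definitions. The function name and the example counts are illustrative assumptions and are not taken from the study's data. The function returns fractions, whereas Table 1 reports percentages.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "mcc": mcc}


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=45, fp=5, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```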

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
