2. Methods
The sewage network is represented by a directed acyclic graph G = (V, E). A node v ∈ V represents a sewage network junction or spot, such as a building or a sewage well. One or more sensors can be deployed in a node. Each edge e ∈ E represents a pipe between two nodes. The attribute τ(e) of every edge e provides the flow propagation time offset (lag) that the pipe introduces between its two connecting nodes. The direction of an edge corresponds to the direction of the wastewater flow. Additionally, it is assumed that (1) the graph G is connected, (2) each node is connected to the root by exactly one path, and (3) the graph G contains a node representing the sink (drain). The sewage network’s sink is the location where all the wastewater exits the network. Hereinafter, we use the terms “sink” and “root” interchangeably. Under these assumptions, the graph G is a directed tree.
Section 3 shows two examples of such networks, where root nodes are marked as “1”.
The sensors provide measurements of the wastewater properties in the form of observations O = ⟨Q, v, t, y, u⟩ [44]: Q is the observed entity, v is the spatial location of the measurement, t is the time stamp of the measurement, y is a digital representation of the measured value, and u is the uncertainty. Possible entity values for Q include electrical conductance, pH, and the concentration of a specific compound. The spatial location of an observation O corresponds to the node v in G where the measurement was taken.
We define the vector of all observed entities as Q = (Q1, …, Qm). In the presented system, the finite list of substances (compounds) to be tracked is represented by the set C = {c1, …, cn}, where ci is a compound. Predefined functions are used to convert measured values into amounts for each compound c ∈ C. It should be noted that C includes not only pollutants, but also other compounds, primarily those that are generally present in wastewater.
The data fusion algorithm consists of five steps: resampling, pollution quantification, downstream propagation, tracking, and event generation (Figure 1). These steps are repeated in every iteration: the input data for resampling (the first step of the algorithm) consist of sensor measurements, the output of resampling is the input for pollution quantification, and so on.
2.1. Resampling
Each sensor in the system may sample with a different time period. In the resampling step, we convert sensor measurements into sensor observations in a unified discrete-time domain by setting a common sampling time period for all sensors and estimating the values of sensor measurements that were not initially collected. Therefore, for each iteration of the data fusion algorithm, the values of y and u are calculated. This process uses linear interpolation when the sampling period T is smaller than the sensor measurement period or when measurements are missing, and mean aggregation when the sampling period is greater than the sensor measurement period.
Sensor observations resulting from this step are represented as O = ⟨Q, v, k, y, u⟩, where k is the discrete-time step. Time steps are referenced to a fixed point in time t0, so that measurements taken at t = t0 have k = 0. In the present study, we assume uniform time sampling; therefore, discrete-time step k represents time t = t0 + kT.
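As an illustration, the resampling step can be sketched as follows. This is a minimal sketch, not the actual implementation; the function name, data layout, and windowing rule are assumptions.

```python
import numpy as np

def resample(times, values, t0, T, n_steps):
    """Resample raw measurements onto the uniform grid t0 + k*T.

    Measurements falling into a sampling window are averaged (mean
    aggregation); empty windows are filled by linear interpolation.
    """
    grid = t0 + T * np.arange(n_steps)
    out = np.empty(n_steps)
    for k, t in enumerate(grid):
        # measurements falling into the k-th sampling window
        in_window = values[(times >= t - T / 2) & (times < t + T / 2)]
        if len(in_window) > 0:
            out[k] = in_window.mean()             # mean aggregation
        else:
            out[k] = np.interp(t, times, values)  # linear interpolation
    return grid, out
```

The uncertainty u of a resampled observation could be handled analogously, e.g., by propagating the sample variance of each window.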
2.2. Pollutant Quantification
The pollution quantification step converts sensor observations into the identification and quantification of the sought compounds. It yields a set of pollution detections D. Each pollution detection takes the form p = ⟨c, v, k, a, u⟩, where a is the amount in liters of a substance c that is detected with uncertainty u at node v at time step k.
Pollution detections are created using the following method. For each sensor observation O, every compound c ∈ C is considered independently. A potential discharge amount is calculated using the mapping function f(Q, c): y → a, where y is the measured value of entity Q, c is the compound, and a is the amount.
A threshold value θc is used for filtering out sensor observations that are below the noise level. In other words, a pollution detection is created and added to the detection set only if the inferred amount a is greater than the threshold θc. The algorithm used to calculate pollution detections is depicted in Algorithm 1.
Algorithm 1: Pollution quantification algorithm
The function in line 4 of Algorithm 1 computes the compound amount a from an input sensor observation y in the following way. Let z = y − b, where y is a sensor observation value and b is the baseline, which is defined as the sensor observation value when no compound is present in the proximity of the sensor. In the presented study, we assume a linear mapping from z to the compound amount, a = αz. The parameter α specifies how a unit amount of a compound is quantified into pollutant volume units.
A new detection object is created only if amount a exceeds the detection threshold. Thresholds are set per compound and are constant in time. These thresholds allow us to filter out insignificant detected amounts caused by small fluctuations of measured values.
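The quantification rule described above (baseline subtraction, linear mapping, and per-compound thresholding) can be sketched as follows; the function name, tuple layout, and lookup tables are illustrative assumptions, not the authors' implementation.

```python
def quantify(observations, baselines, alphas, thresholds):
    """Convert sensor observations into pollution detections.

    observations: list of (entity, node, step, value) tuples.
    baselines[entity], alphas[(entity, compound)], and
    thresholds[compound] are assumed lookup tables.
    """
    detections = []
    for entity, node, step, y in observations:
        z = y - baselines[entity]          # deviation from the baseline
        for (ent, compound), alpha in alphas.items():
            if ent != entity:
                continue
            a = alpha * z                  # linear mapping to an amount
            if a > thresholds[compound]:   # drop noise-level amounts
                detections.append((compound, node, step, a))
    return detections
```

Each compound is considered independently, so one observation can yield detections of several similar compounds, which are disambiguated later by tracking and event generation.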
2.3. Downstream Propagation
The downstream propagation step infers additional pollution detections in vertices of the graph where no sensors are installed. Most vertices are of this kind, as we assume the number of available sensors to be limited due to either high capital or operational costs.
The inferred time of arrival of a pollutant is generated from pollution detections using Algorithm 2. For each pollution detection p ∈ D, a depth-first search is used to create new detections downstream from the original node, up to a maximum depth d.
New pollution detections are inferred by considering the propagation model of compounds in the utility network between neighboring vertices. For this, we adopt the following three simplifications.
(1) The propagation time of a substance for an edge, τ(e), is known, constant in time, and equal for every compound. In practice, this condition is satisfied only when the flow characteristics do not change in time and the flow rate is the same for each compound.
(2) The total amount of a discharged compound does not change as the substance flows through the network. In practice, a substance may either react with other domestic waste and change its intrinsic characteristics, or may adhere to the sewage pipe walls.
(3) The sensors have infinite resolution and no noise; therefore, even tiny volumes of compounds diluted in the network over time can be measured.
Algorithm 2: Detection propagation algorithm
Inferred detections with amounts less than a given threshold (representing the process noise) are not considered. After detection propagation, pollution detections are associated with almost every vertex in the graph.
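Since each node has exactly one path to the sink, downstream propagation amounts to pushing each detection along that path while accumulating the edge lags. A minimal sketch, with assumed names and data structures:

```python
def propagate(detections, downstream, tau, max_depth, eps):
    """Infer additional detections by pushing each one toward the sink.

    downstream[v] gives the next node toward the sink (None at the sink);
    tau[(v, w)] is the known edge propagation time in time steps.
    """
    inferred = []
    for compound, node, step, amount in detections:
        v, k, depth = node, step, 0
        while downstream.get(v) is not None and depth < max_depth:
            w = downstream[v]
            k = k + tau[(v, w)]        # arrival time step at the next node
            if amount > eps:           # drop amounts below the process noise
                inferred.append((compound, w, k, amount))
            v, depth = w, depth + 1
    return inferred
```

Under simplification (2), the amount is carried unchanged; a more realistic variant would attenuate it per edge.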
2.4. Tracking
The tracking algorithm clusters pollution detections by the detected compound. A cluster of pollution detections associated with a detected compound is called a track. Therefore, a pollution detection cannot be associated with a track if the compound of the detection differs from that of the track.
In this article, a Kalman filter is used to predict the most probable location of a detected compound within the network from the previous algorithm iteration (for time k − 1). The tracking algorithm then updates the most probable location for time k using the pollution detections calculated in the previous two steps.
The filter state (Equation (3)) represents the location in the network of a substance at a given point in time. For each track, the most probable amount a of a compound, as well as the most probable location d (as a function of time), are determined. The location is expressed as a real number equal to the distance from the network sink.
The precise location of a compound within a track can be calculated at any time based on the fact that only one path connects each node to the sink and that the starting node of the track is stored. This localization scheme places a compound on a graph edge at position ⟨vs, vd, r⟩, where vs and vd are the source and the destination of the edge, respectively, and r ∈ [0, 1] is a number describing the position relative to the edge.
The tracking algorithm (Algorithm 3) assigns pollution detections to tracks, creates new tracks, and removes stale tracks. Once the locations of the compounds within tracks have been predicted by the Kalman filter, each pollution detection is assessed by comparing the amounts a and the graph distances between the detection and the track representatives. If these values are less than their respective thresholds, the detection is counted as supporting the track. If the detection cannot be associated with any existing track, a new track is created. Tracks with no new associations over several previous algorithm iterations are labeled as outdated and are removed.
Algorithm 3: Tracking algorithm
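The gating and update logic described above can be sketched with a simplified one-dimensional Kalman filter on the distance-from-sink coordinate; the class layout, parameter names, and default values are assumptions, not the actual implementation.

```python
class Track:
    """Minimal 1-D track: state x = distance from the sink."""

    def __init__(self, compound, x0, amount, p0=1.0):
        self.compound, self.x, self.a, self.p = compound, x0, amount, p0

    def predict(self, velocity, dt, q=0.1):
        self.x -= velocity * dt      # the compound drifts toward the sink
        self.p += q                  # grow the state uncertainty

    def update(self, z, r=0.5):
        k = self.p / (self.p + r)    # Kalman gain
        self.x += k * (z - self.x)   # pull the state toward the measurement
        self.p *= (1.0 - k)

def associate(track, detection, amount_thr, dist_thr):
    """Gate a detection (compound, distance, amount) against a track."""
    compound, dist, amount = detection
    return (compound == track.compound
            and abs(amount - track.a) < amount_thr
            and abs(dist - track.x) < dist_thr)
```

A detection that passes the gate supports the track and feeds the `update` step; otherwise, a new track would be created.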
2.5. Event Generation
The final step, namely, event generation, only considers tracks that have a large number of supporting (associated) detections. Events are created for each of these tracks, where an event represents the discharge of a compound into a node of the graph.
Event generation is depicted in Algorithm 4.
Algorithm 4: Event generation algorithm
Possible events are generated for each important track. Subsequently, equivalent events from different paths are merged into a single event whose confidence equals the sum of the confidences of those events and whose compound amount equals the maximum of the amounts in the cluster. Finally, the events are sorted in descending order of confidence.
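The merging rule above can be sketched as follows; the tuple layout and the choice of (compound, node) as the equivalence key are assumptions.

```python
from collections import defaultdict

def merge_events(events):
    """Merge equivalent events (same compound and node) from different
    paths; confidences add up, amounts keep their cluster maximum."""
    clusters = defaultdict(list)
    for compound, node, amount, confidence in events:
        clusters[(compound, node)].append((amount, confidence))
    merged = [
        (compound, node,
         max(a for a, _ in group),   # maximum amount in the cluster
         sum(c for _, c in group))   # summed confidence
        for (compound, node), group in clusters.items()
    ]
    # report the most confident events first
    return sorted(merged, key=lambda e: e[3], reverse=True)
```

Summing confidences rewards hypotheses supported by several paths, while taking the maximum amount avoids double-counting the same discharge.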
2.6. Implementation
The data fusion algorithm was implemented as a Python package with a modular layout. This enabled the user to replace any of the modules with ones better suited to their specific use. This was especially important because the implemented modules were simplified compared to the real world: the resampling process relied on linear interpolation and mean aggregation, the amounts of the compounds were assumed to be in a linear relationship with the measured values, and the detection and clustering thresholds and the Kalman filter parameters were constant.
A client-server application was developed to store measurements and implement data fusion. This application also contains a presentation layer that allows the results to be presented in a web browser. Our application uses a PostgreSQL database, Python standard packages, and the Django web framework.
3. Results
The fusion algorithm was tested using simulated data. Several numerical experiments were conducted to evaluate our system across multiple scenarios.
Two network topologies were considered. The first network was a path graph: it consisted of linearly connected nodes (Figure 2).
The second network, “Net2” (Figure 3), was a simplified version of the sample network available in the EPANET software. The original network was a water distribution network, so the edges were reversed to resemble a sewage network. The modification included transforming the acyclic graph into a tree via a depth-first search starting at node 1. The edge gains g(e) and offsets τ(e) were calculated using the pipe lengths l(e) from the original EPANET network description (Equations (4) and (5)). The offsets were computed by dividing the pipe length by a constant velocity u, i.e., τ(e) = l(e)/u. The gains were calculated for each edge using a linear function of the pipe length, and the minimal gain was chosen so that the travel times of the two networks were similar.
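Since the exact forms of Equations (4) and (5) are not reproduced here, the following sketch illustrates one plausible construction: offsets from the stated division by a constant velocity, and gains as a linear function of pipe length in which the longest pipe receives the minimal gain. The linear gain coefficients are assumptions.

```python
def edge_offsets_and_gains(lengths, velocity, g_min):
    """Compute per-edge offsets and gains from pipe lengths.

    lengths: dict edge -> pipe length; velocity and g_min are constants.
    """
    l_max = max(lengths.values())
    offsets = {e: l / velocity for e, l in lengths.items()}
    # linear in pipe length: the longest pipe gets the minimal gain g_min
    gains = {e: 1.0 - (1.0 - g_min) * (l / l_max) for e, l in lengths.items()}
    return offsets, gains
```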
We simulated two types of sensors: (1) the microMole sensor system [
6] and (2) liquid chromatography with tandem mass-spectrometry (LC-MS-MS). The microMole sensor system measures the pH and electrical conductivity (EC) of wastewater every second. It can be mounted in main sewer pipes of no less than 250 mm in diameter. The microMole system is not capable of identifying chemical compounds. LC-MS-MS is laboratory equipment capable of detecting and quantifying chemical compounds. Within the H2020 SYSTEM project [
7], LC-MS-MS is used for analysis of wastewater samples collected at WWTPs. It analyses the composition of wastewater every 10 min. As LC-MS-MS is located at the WWTP, LC-MS-MS data are not sufficient for localizing the source of pollution in a sewage network graph.
Our sensors measured one of three entities of different characteristics, each with a bounded range and a neutral value: the electrolytic conductivity, with a neutral value of 1400; the pH; and the relative concentration of a pollutant, with a neutral value of 0.
The substances (compounds) that were tracked are listed in Table 1. The illegal substance is sodium hydroxide, described in Section 1.1. Pipe cleaner is legal but has a similar pH and electrolytic conductivity. The presence of either of those compounds in the proximity of the sensors measuring the electrolytic conductivity and the pH affected the readings in the same way: a positive peak in one of these entities and a negative peak in the other. The measured values of the relative concentration were influenced only by the illegal substance, in the form of a positive peak.
Thresholds for Algorithms 1, 2 and 4, and the Kalman filter parameters were constant for all simulations. The specific values were derived using the expectation-maximization algorithm on a representative sample of the measured values.
The average results of four experiments are presented below. For each network topology, the influence of the sensor coverage, substance discharge amount, update period, and downstream propagation depth was calculated. The parameter values are presented in Table 2.
The update period T was the constant time period used in resampling; it determined how many iterations of the data fusion algorithm were performed.
The sensor coverage was represented by a number in the range (0, 1], which expressed the number of sensors in the network relative to the number of nodes. For a given simulation, the pH and electrolytic conductivity sensors were placed randomly among all nodes (except the sink) using sampling without replacement. A single sensor measuring the relative concentration was always located at the sink of the network.
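The placement scheme can be sketched as follows; the function name and the rounding rule for converting coverage into a sensor count are assumptions.

```python
import random

def place_sensors(nodes, sink, coverage, seed=None):
    """Randomly place sensors on non-sink nodes without replacement.

    coverage in (0, 1] is the fraction of nodes carrying a sensor;
    the sink is excluded because it always hosts the concentration sensor.
    """
    rng = random.Random(seed)
    candidates = [v for v in nodes if v != sink]
    n = round(coverage * len(nodes))
    return rng.sample(candidates, min(n, len(candidates)))
```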
3.1. Simulations
To create random scenarios for the data fusion module, a measurement generation module was produced. The method for creating simplified sensor observations was conceived based on the results of real-world experiments.
To create a simulation scenario, several parameters of the discharge event were required: the compound, node, amount, noise, and the function inverse to the mapping function described in Section 2.2. An additional parameter of each edge e, known as the gain g(e), was also required. The edge gain was a real number satisfying 0 < g(e) ≤ 1. Dividing the amplitude of the signal measured at the edge end by the amplitude of the signal measured at the edge start yielded g(e). The gain parameters revealed how the signal was attenuated while the compounds traveled through the edges. Noise was introduced by adding random values from a Gaussian distribution with a mean of 0 and a standard deviation equal to the product of the measurement value and the noise parameter.
Real-world measurements often resemble exponential functions with bases in the range from 0 to 1. In our experiments, a rectangular impulse function was used to simplify the reverse mapping of the amount of the compound. Generating a single measurement series consisted of an initial calculation of the target area between the baseline reading and the measured values, and then the generation of a suitable number of measurements. The series corresponding to a single discharge event differed only in the signal length (which was calculated by dividing the target area by the product of the gains of all the edges from the discharge node to the current node) and the initial signal amplitude (which was a property of the entity).
During one discharge event, many measurement series were generated that corresponded to each sensor in each node on the path from the discharge node to the sink. Scenarios in which more than one discharge event occurred were not taken into account, as this would have required knowledge of the behavior of compounds when they mix in the sewage network.
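The series-generation procedure can be sketched as follows. This is a simplified variant in which the pulse length is chosen to preserve the target area for the gain-attenuated amplitude; the function name, parameters, and multiplicative noise model are assumptions.

```python
import random

def generate_series(target_area, amplitude, path_gains, noise,
                    baseline, T, seed=None):
    """Generate a rectangular-pulse measurement series for one sensor.

    The pulse amplitude is attenuated by the product of the edge gains
    on the path from the discharge node; the number of samples is chosen
    so that the pulse keeps the target area above the baseline.
    """
    rng = random.Random(seed)
    gain = 1.0
    for g in path_gains:
        gain *= g
    peak = amplitude * gain
    n = max(1, round(target_area / (peak * T)))   # pulse length in samples
    series = []
    for _ in range(n):
        y = baseline + peak
        y += rng.gauss(0.0, abs(y) * noise)       # Gaussian measurement noise
        series.append(y)
    return series
```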
3.2. Quality of Data Fusion
For each scenario, a set of events was generated by the system using many iterations of the fusion algorithm depicted in Figure 1. Each scenario involved the simulated data of a single discharge in a random node; the node where pollution was introduced into the network was selected using a uniform distribution. The detected events were labeled either true positive or false positive. It should be noted that, at most, one event was true positive.
Based on these labels, metrics were calculated for each simulation:
The confidence coefficient, which was computed by dividing the confidence of the true positive event by the average confidence of all events. This metric showed how the confidences of true positive events compared to the confidences of false positive events. For the system to be useful, this metric had to be greater than 1.
The number of reported events. The ground truth was 1. The smaller this number was, the more precise the localization. In studied scenarios, multiple events signified multiple possible nodes of discharge or multiple compounds; therefore, this was a valuable metric that demonstrated the precision of the system.
The results demonstrated that, as expected, the performance of the system depended on the sensor coverage of the network. According to Figure 4, the number of generated events decreased rapidly as the number of sensors in the network increased. In the case of the simple “path” network, the event count plot showed a median count of approximately 20 for a coverage of 10%. Taking into account that the a priori knowledge included two similar compounds, the system should reduce the source of the pollution to approximately 10 nodes. A coverage of 20% provided a twofold decrease in the event count.
According to the simulations performed on the more realistic “Net2” network, the event count should not exceed 15 for a coverage of 10% or greater. Moreover, taking into account that two similar compounds were considered, an event should be narrowed down to approximately seven nodes. Achieving such coverage in real networks may not be possible, but this metric provides a valuable overview of what can be expected concerning the performance of the system. It is important to note that for reconstructed events to point to a single node, the sensor coverage would have to reach 100%, which is impossible to achieve in practice. This fact, however, does not mean that one cannot obtain accurate results from the proposed system at low sensor coverage; it means that the lower the coverage, the more nodes must be considered as a potential source of pollution.
Figure 5 shows that if the positions of the sensors and the discharge node are aligned in a way that allows for any detection, and assuming that the coverage is greater than 10%, it can be expected that true events have greater confidence than false events (confidence coefficient > 1). Across all experiments with a coverage of greater than 10%, true events had ≈30% greater confidence than false events in the simple network and ≈50% greater confidence in the more complicated network.
When it comes to the correct identification of the source node, Figure 6 shows that we can get very close to a 100% identification chance with a network coverage of 80%.
Comparing these results with the event counts shown in Figure 4, we can observe that even though high coverage is needed to achieve excellent accuracy, the results achieved at lower coverage would still be useful. When the number of sensors is low and the sensors are located far from the pollution source, we can expect the system to generate several similar events in multiple nodes in the proximity of the source. It is difficult to distinguish the actual source in such a case. However, as already mentioned (Figure 5), on average, our system assigns higher confidence values to actual pollution sources than to neighboring nodes.
Di Cristo and Leopardi in 2008 achieved a location identification rate from 60.5% to 100% with 31% of nodes containing sensors [30], while our system needed 60% coverage to reach such high values. However, in the cited article, the sensor locations were constant across simulations and only one discharge node was considered. In contrast, we considered a random placement of sensors, and the discharge node was chosen at random. Additionally, Di Cristo and Leopardi used hydraulic simulation with the EPANET simulator, which makes the performance of the system dependent on the simulation quality. Our aim was to create a system that could continuously monitor the network and perform calculations as new measurements appear.
3.3. System Performance
The performance of the system was also evaluated by analyzing the algorithm execution time and the total number of observations that translated into the usage of system memory.
Figure 7 illustrates that the memory usage (observation count) was directly proportional to the number of sensors. Analysis of the simulation run-time charts (Figure 8) showed that the time complexity of the algorithms used was linear relative to the number of measurements that generated detections. As shown in Figure 8, the system can be expected to process more than 100 sensor observations per second.