Article

Nearest-Better Network-Assisted Fitness Landscape Analysis of Contaminant Source Identification in Water Distribution Network

1
School of Automation, China University of Geosciences, Wuhan 430074, China
2
Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China
3
Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education, Wuhan 430074, China
4
School of Artificial Intelligence, Anhui University of Science & Technology, Hefei 232001, China
5
School of Mechanical Engineering and Electronic Information, China University of Geosciences, Wuhan 430074, China
6
School of Computer Science and Informatics, De Montfort University, Leicester LE1 9BH, UK
*
Author to whom correspondence should be addressed.
Data 2024, 9(12), 142; https://doi.org/10.3390/data9120142
Submission received: 30 October 2024 / Revised: 20 November 2024 / Accepted: 26 November 2024 / Published: 6 December 2024
(This article belongs to the Topic Water and Energy Monitoring and Their Nexus)

Abstract

Contaminant Source Identification in Water Distribution Network (CSWIDN) is critical for ensuring public health, and optimization algorithms are commonly used to solve this complex problem. However, these algorithms are highly sensitive to the problem’s landscape features, which has limited their effectiveness in practice. Despite this, there has been little experimental analysis of the fitness landscape of CSWIDN, particularly given its mixed-encoding nature. This study addresses this gap by conducting a comprehensive fitness landscape analysis of CSWIDN using the Nearest-Better Network (NBN), the only applicable method for mixed-encoding problems. Our analysis reveals for the first time that CSWIDN exhibits several landscape features: neutrality, ruggedness, modality, dynamic change, and separability. These findings not only deepen our understanding of the problem’s inherent landscape features but also provide quantitative insights into how these features influence algorithm performance. Based on these insights, we propose specific algorithm design recommendations that are better suited to the unique challenges of the CSWIDN problem. This work advances the knowledge of CSWIDN optimization by both qualitatively characterizing its landscape and quantitatively linking these features to the behavior of algorithms.

1. Introduction

Contaminant Source Identification in Water Distribution Network (CSWIDN) is a practical issue that is closely related to human safety. Water pollution incidents occur frequently around the world, and a reliable and safe water supply system plays an important role in maintaining public health and stable economic development. To reduce the risk of water pollution, water quality sensors can be placed in the water distribution network for real-time monitoring, and the sensor data can be used to identify the location of pollution sources and other injection information.
There are three main methods for solving the contaminant source identification problem: the particle inversion method [1], the machine learning method [2], and the simulation–optimization method [3]. The simulation–optimization method is better in terms of accuracy and robustness [4]. Moreover, the first two methods can only handle problems with a single pollution source, while the simulation–optimization method can identify multiple pollution sources.
The simulation–optimization method consists of two models: a simulation model and an optimization model. The simulation model uses software such as EPANET to simulate a contamination event and generates the concentration data of each sensor. The optimization model generates different solutions through optimization. Each solution represents pollution event information, which includes injection locations, the starting time, and the injection amount of each injection location. Then, the sensors’ simulated concentration data of a solution are compared with real pollutant concentration data, and this comparison is formulated as the objective function.
When the objective value of the best solution found by the optimization algorithm reaches a certain level of accuracy—meaning that the error between the simulated sensor values and the actual values is sufficiently small—this solution is considered to represent the true contamination source.
It is important to note that the simulation model plays a crucial role in the accuracy of the optimization methods. The positioning of the sensors can vary significantly depending on the model used, as demonstrated in [5], thereby influencing the resolution of the optimization model.
Although research shows that the simulation–optimization method is highly competitive in terms of accuracy [3,6], its performance depends on the ability of the optimization algorithm, which in turn is sensitive to the landscape features of the problem [7].
An in-depth experimental exploration of the landscape features of the contamination source identification problem is currently lacking. One reason is that this is a mixed-encoding problem, whereas most existing visual fitness landscape analysis tools are designed specifically for either continuous or discrete problems [8]. Fortunately, the recently proposed Nearest-Better Network (NBN) [8,9] has been shown to be an effective tool for analyzing the landscape features of problems with various encodings.
This paper makes a significant contribution by providing the first comprehensive experimental analysis of the fitness landscape features of CSWIDN using the NBN. This study identifies key landscape features, including neutrality, ruggedness, modality, dynamic change, and separability, revealing how each of these features exhibits distinct patterns. These findings deepen the understanding of the inherent challenges of CSWIDN and offer valuable insights into how these landscape features influence the performance of optimization algorithms. By linking these features to algorithmic performance, this paper aims to assist researchers in designing more effective optimization strategies tailored to the unique characteristics of the CSWIDN problem.
The remainder of this paper is organized as follows. Section 2 provides an overview of the previous related work. Section 3 introduces the fitness landscape analysis method, including the definition of NBN, the definition of the CSWIDN problem, and the definition of the distance. Section 4 provides an in-depth experimental fitness landscape analysis of the CSWIDN problem. The discussion is provided in Section 5. Finally, the conclusions and future research are provided in Section 6.

2. Related Work

2.1. Fitness Landscape

The fitness landscape of an optimization problem describes the relationship between possible solutions and their fitness values. More specifically, a fitness landscape describes a surface of fitness values over the solution space. Figure 1 shows the fitness landscape of a continuous optimization problem in two dimensions. A general view of fitness landscapes was proposed in [10], in which the fitness landscape consists of three elements $(X, \chi, f)$:
  • A set $X$ of potential solutions to the problem;
  • A notion $\chi$ of neighborhood, nearness, distance, or accessibility on $X$; and
  • A fitness function $f: X \to \mathbb{R}$. The fitness of a solution indicates how good the solution is: the larger the fitness value, the better the solution.
There are several definitions related to the fitness landscape and the algorithm behavior:
  • Search space: The search space $X$ is the union of all possible solutions of an optimization problem;
  • Neighborhood: The neighborhood relationship is a mapping $\chi: X \to N$, which associates each solution $x$ with a set of candidate solutions $N(x)$, called neighbors, which can be reached by applying a local search operator for a one-step search or defined by a distance metric;
  • Basin of Attraction (BoA) and local optimum: $B(x^*) = \{x \in X \mid x^* = \text{local-search}(x)\}$, where the BoA $B(x^*)$ of a local optimum $x^*$ is the set of solutions that reach $x^*$ when a local search strategy is applied within the decision variable space $X$ [11].
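The basin-of-attraction definition above can be illustrated with a minimal Python sketch. It assumes a greedy hill-climbing local search on a toy one-dimensional landscape; the function names, the neighborhood, and the fitness values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def local_search(x, fitness, neighbors):
    """Greedy hill climbing: move to the best neighbor until none is strictly better."""
    while True:
        best = max(neighbors(x), key=fitness, default=x)
        if fitness(best) <= fitness(x):
            return x  # x is a local optimum
        x = best

def basins_of_attraction(space, fitness, neighbors):
    """Group every solution by the local optimum that local search reaches from it."""
    basins = defaultdict(set)
    for x in space:
        basins[local_search(x, fitness, neighbors)].add(x)
    return dict(basins)

# Toy landscape over the integers 0..9 (illustrative values only).
values = [1, 3, 5, 4, 2, 1, 3, 6, 8, 7]
space = range(len(values))
fitness = lambda x: values[x]
neighbors = lambda x: [n for n in (x - 1, x + 1) if 0 <= n < len(values)]

basins = basins_of_attraction(space, fitness, neighbors)
print(basins)  # two local optima (indices 2 and 8), each with its own basin
```

With a different local search operator the basins can change, which is why the neighborhood $\chi$ is part of the landscape definition.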
By studying the fitness landscape, researchers obtained a deeper understanding of fitness landscapes and tried to summarize the features of fitness landscapes that may affect the performance of the algorithm [12,13]. Some of the landscape features are listed below:
  • Modality [14]: Modality refers to the number of local optima in a fitness landscape. A problem is unimodal if there is only one global optimum, whereas a problem is multimodal if there are multiple local optima;
  • Ruggedness: A rugged landscape can be regarded as a multimodal landscape with many sharp descents and ascents;
  • Neutrality [15]: In a landscape where the fitness values of the neighborhood solutions of a solution yield no change, the area around the solution is neutral;
  • Separability [16]: It refers to the degree of dependency between components of the variable;
  • Dynamic change [17]: Dynamic means that the fitness or constraint functions may change over time.

2.2. Features Derived from CSWIDN

In general, due to the limited number of sensors, the large number of nodes in the water supply network, and the dynamic demand for water by users, the CSWIDN problem is a large-scale, expensive, multimodal, and dynamic optimization problem. The large-scale difficulty of the problem is determined by the scale of the nodes in the water supply network, and the expensive difficulty is determined by the simulation time of the simulation software. As for difficulties such as modality and dynamics, which are difficulties manifested by the structure of the fitness landscape, it is necessary to analyze the fitness landscape to verify whether these difficulties exist in the problem. At the same time, these landscape features are closely related to the performance of optimization algorithms. For different landscape features, optimization algorithms may have different strategies. Therefore, a deep understanding of the fitness landscape of the problem is very important for researchers to design efficient algorithms.

2.2.1. Modality

Yan [6] argued, from a qualitative perspective, that the number of water quality sensors is limited, while a large number of water distribution network nodes are potential contaminant sources. In addition, the mixing and timeliness of contamination source injection should be considered. Therefore, the number of contamination sources tends to be no less than one. On this basis, the author modeled the CSWIDN problem as a multimodal problem and applied a niching genetic algorithm to solve it. Later, many studies focused on the multimodal difficulty of the CSWIDN problem and used different methods to address it. Li [3] used an adaptive multi-population framework, which can adaptively adjust the number of populations and simultaneously track multiple global optimal solutions.
Although these multimodal algorithms provide multiple optimal solutions for the problem, there is a lack of quantitative analysis of the characteristics of the modality feature, such as the number of global optima, the distribution of these optimal solutions, and the change in modality in a dynamic environment. Compared to unimodal algorithms, multimodal algorithms have relatively lower search efficiency. If multimodal algorithms are used blindly, a lot of computing resources will be wasted. Because no relevant analysis of the multimodal feature of the CSWIDN problem is available, researchers do not know what kind of multimodal algorithm to choose or when it should be applied.

2.2.2. Dynamic Change

The CSWIDN problem is a dynamic problem, with the injection of contamination sources and the water demand of residents changing all the time. These both lead to changes in the data of the water quality sensors and, thus, to changes in the objective value of the solutions. In a dynamic problem, the position of the global optima may change and, thus, the algorithms are required to respond to the dynamic changes and track the changing global optimal solution in real time.
Liu [18] was the first to model the CSWIDN problem as a dynamic optimization problem and solved it using a multi-population method. Rasekh [19] used a multi-objective diversity preservation strategy to tackle the dynamic difficulty. The method of Li [3] is similar to that of Liu. Li solved the problem using a more advanced multi-population algorithm that can adaptively adjust the number of populations by combining the historical data and the current evolutionary state of the population, reducing the consumption of computing resources. Experiments have shown that this method is more effective than Liu’s method.
The CSWIDN problem is indeed dynamic. However, the magnitude of changes in the fitness landscape and the difficulty for algorithms to track the global optimum are still unknown to us.

2.2.3. Separability

The solution to the CSWIDN problem contains two pieces of information: the location and the injection information of the contamination sources.
In the earlier studies, researchers directly used a hybrid encoding method to solve the CSWIDN problem. Yan [20] employed this method to simultaneously optimize both the location and the injection information of the contamination sources. Hybrid encoding involves both continuous and combinatorial encoding, making it more complex. The search operators for continuous and combinatorial problems are quite different, and those for hybrid encoding problems require even more specialized design. Moreover, there are relatively few optimization algorithms tailored for hybrid encoding problems.
In real-world CSWIDN problems, the location and injection information may be two independent components. To reduce the complexity of the problem and ease the optimization pressure on the algorithm, some studies [3,21] have reformulated the CSWIDN problem into two sub-problems, one for the location and one for the injection information of the contamination sources, and then solved them using collaborative optimization methods.
Some studies suggest that the location of the contamination source has a relatively great impact, while the contamination injection information has a smaller influence. Based on sensor data, it is possible to predict the location of certain contamination sources without considering the injection information.
Grbčić [22] used Random Forests to identify contamination source locations with high probability. Based on these locations, optimization algorithms were then used to determine the corresponding injection information. Similarly, Qian [23] utilized a deep neural network to mine the relationship between sensor data and contamination source locations based on historical data.
Assuming that the contamination source location and injection information are independent components of the variable can indeed reduce the complexity of solving the CSWIDN problem and improve algorithm accuracy. However, if the two components are strongly correlated in the fitness landscape, optimizing them independently without considering their interaction could reduce the accuracy of the identification of the contamination sources. Currently, research lacks sufficient investigation into the correlation between these two components.

2.2.4. Other Features

Current research on CSWIDN focuses only on the aforementioned fitness landscape features, and existing algorithms are designed around these features. However, since the understanding of the fitness landscape for CSWIDN is not yet comprehensive, there may be other landscape features that remain unknown. As researchers have not recognized the existence of these features, no algorithms have been developed to address them for CSWIDN.
Research in the field of CSWIDN lacks an in-depth exploration of fitness landscape features, primarily due to the absence of a powerful fitness landscape analysis tool that can be applied to hybrid encoding problems. The recent NBN is currently the only tool available for analyzing hybrid encoding problems [8,9]. This paper has also validated its ability to analyze the fitness landscape features of continuous problems. This paper applies NBN to the CSWIDN problem and analyzes its landscape features.

3. Fitness Landscape Analysis Method

In this section, the relevant definitions of the Nearest-Better Network are introduced, the CSWIDN model and the distance between two solutions are defined, and several fitness landscape analysis metrics are presented.

3.1. Nearest-Better Network

The Nearest-Better Network (NBN) [8] is a sampling-based method. In optimization algorithms, search is fitness-driven, with a higher probability of exploring the neighborhood of a solution. An algorithm that holds a solution $x$ is more likely to find the nearest-better solution of $x$ than any other solution. Thus, the NBN simplifies the complex fitness landscape by preserving only the nearest-better relationships to facilitate fitness landscape analysis.
The NBN can be defined as a directed graph $G = (V, E)$, where the set of vertices $V$ represents the sampled solutions $X_N$, and the set of edges $E$ represents the nearest-better relationship for each solution: $E = \{(x, b(x)) \mid x \in X_N\}$.
The nearest-better relationship is defined as follows:
$$b(x) = \arg\min_{y \in \{y \mid y \in X_N,\ f(y) > f(x)\}} \lVert y, x \rVert,$$
where $b(x)$ is the nearest-better solution for solution $x$ and $\lVert y, x \rVert$ is the distance between $y$ and $x$.
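The nearest-better relationship can be sketched in a few lines of Python. This is a brute-force $O(n^2)$ illustration of the definition only; it is not the faster construction algorithm of [8,9], and the sample set, fitness function, and distance used in the example are hypothetical.

```python
def nearest_better_network(samples, fitness, distance):
    """Return {i: j}, where sample j is the nearest sample with strictly better
    fitness than sample i, or None if no sample is better than i."""
    fit = [fitness(x) for x in samples]
    edges = {}
    for i, x in enumerate(samples):
        better = [j for j in range(len(samples)) if fit[j] > fit[i]]
        edges[i] = min(better, key=lambda j: distance(x, samples[j]), default=None)
    return edges

# Toy example: six points on a line, fitness peaks at x = 3 (larger is better).
samples = [0.0, 1.0, 2.0, 3.0, 4.0, 6.0]
fitness = lambda x: -(x - 3.0) ** 2
distance = lambda a, b: abs(a - b)

edges = nearest_better_network(samples, fitness, distance)
print(edges)
```

Because every edge points to a strictly better solution, the resulting graph contains no cycles, and the best sampled solution is the only vertex without an outgoing edge.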
In a previous work [9], a visualization method was proposed based on NBN. An example of the fitness landscape of the 2D Shubert function is shown in Figure 2a,b, where the color mesh is the real fitness landscape of Shubert’s function and the black network represents the NBN in the original fitness landscape. Figure 2c,d is the visualization of the NBN. In the visualization of the NBN, each node is a sampled solution with an edge that connects its nearest-better solution. The paper [8] has also verified that the features of the fitness landscape can be captured by the NBN visualization, including asymmetry, ill-conditioning, neutrality, ruggedness, the size of the BoAs, and the number of BoAs.
Figure 3 illustrates the workflow of the NBN-assisted fitness landscape analysis method. First, the sampling method proposed in this paper is used to sample the solutions $V$. Then, the nearest-better relationship $E$ is calculated using the algorithm from [8,9]. By combining the sampled solutions with the nearest-better relationship, the NBN is obtained, represented as $G = (V, E)$. There are two approaches for fitness landscape analysis: visualization and metrics. In the visualization method, the NBN figure is generated using the formulas from [8,9]. In the metrics method, the metric values are calculated using the metrics proposed or referenced in this paper.

3.1.1. Optimization Problem for CSWIDN

In the optimization problem, the sum of the squared error between the simulated sensor data of a solution and the actual sensor data of the contamination sources is used as the objective value. In theory, when this value is zero, the solution corresponds to the real contamination sources.
$$\begin{aligned} \min\ & f(x(t)) = \frac{\sum_{j=1}^{W} \sum_{s=0}^{t} [y_j(s) - \tilde{y}_j(s)]^2}{W \cdot t} \\ \text{s.t.}\ & y_j(t) = \mu(x(t)), \quad j = 1, \dots, W \\ & x(t) = [(x_l^1, x_u^1(t)), \dots, (x_l^L, x_u^L(t))] \\ & x_u^i(t) = (x_0^i, x_{x_0^i}, x_{x_0^i+1}, \dots, x_t^i) \end{aligned}$$
where $y_j(t)$ is the contamination concentration data at sensor $j$ at time $t$, and $\tilde{y}_j(t)$ is the corresponding actual sensor data. $\mu$ is a simulation model with input $x(t)$ and output $y(t)$ of $W$ sensors and, in this paper, the hydraulic and water quality analysis software EPANET 2.0 is used as the simulation model. $x(t)$ is the solution with all the information of the $L$ contamination sources at time $t$. $(x_l^i, x_u^i(t))$ is the information of the $i$-th contamination source at time $t$, where $x_l^i \in U$ is the position in the water distribution network $U$ and $x_u^i(t)$ is the injection information. $x_0^i$ is the start time of the injection and $(x_{x_0^i}, x_{x_0^i+1}, \dots, x_t^i)$ represents the injection rate sequence from time $x_0^i$ to time $t$.
This model is a relatively general CSWIDN optimization model with multiple contamination sources. It also takes the feature of dynamic change into consideration. Moreover, the location and injection information of a contamination source are encoded separately, which facilitates the analysis of the correlation between these two components.
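As a minimal sketch of how such an objective can be evaluated, the Python code below computes the sum of squared sensor errors over $s = 0, \dots, t$ divided by $W \cdot t$, which is an assumed reading of the formula above. The function `toy_simulator` is a hypothetical stand-in for the simulation model $\mu$; it is not EPANET and has no physical meaning.

```python
def objective(simulated, observed):
    """Sum of squared sensor errors over s = 0..t, divided by W * t (assumed form)."""
    W = len(observed)
    t = len(observed[0]) - 1  # each series holds the readings for s = 0, 1, ..., t
    if t < 1:
        raise ValueError("need at least two time steps so that W * t > 0")
    total = sum((simulated[j][s] - observed[j][s]) ** 2
                for j in range(W) for s in range(t + 1))
    return total / (W * t)

def toy_simulator(solution, n_sensors=2, t=2):
    """Hypothetical stand-in for the simulation model (NOT EPANET, illustration only)."""
    rate = solution["rate"]
    return [[rate * s / (j + 1) for s in range(t + 1)] for j in range(n_sensors)]

observed = toy_simulator({"rate": 2.0})  # pretend these are the real sensor readings
good = objective(toy_simulator({"rate": 2.0}), observed)  # candidate equals the truth
bad = objective(toy_simulator({"rate": 1.0}), observed)   # candidate with a wrong rate
print(good, bad)
```

In a real setting, each call to the simulator is a full hydraulic and water quality simulation, which is what makes the objective expensive to evaluate.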
The water distribution network model analyzed in this paper is a water distribution network with $|U| = 97$ nodes, as shown in Figure 4. It includes a lake, a river, three water tanks, and several pipeline connection nodes. The purple stars are 4 fixed sensors in the water distribution network, $W = 4$. The red rectangles are the positions of the 3 contamination sources $(o_l^1, o_l^2, o_l^3)$, $L = 3$, and $o$ contains the global optima. When simulating a dynamic contamination event in this water distribution network, pollutants are injected at the three positions with different start times, durations, and substance concentrations. The pollutant injection times for the three contamination sources are $[1, 29]$, $[6, 37]$, and $[14, 46]$, respectively.

3.1.2. Definition of the Distance

The NBN requires a definition of the distance between solutions. The distance between two solutions fundamentally reflects their similarity: the closer the distance, the more similar the solutions. In the context of pollution source identification, the similarity between two solutions is a spatiotemporal quantity. The objective of the problem is to identify the pollution source information. Thus, the similarity between two solutions is essentially the resemblance of the pollutant particle distributions in the water network at different times.
As illustrated in Figure 5, the numerical value of this similarity is spatiotemporally dependent and influenced by various factors, such as the current time, the location and concentration of the pollution source, the connectivity of the pipes, the direction and velocity of the water flow, and the number and placement of sensors, which are used to indirectly measure the pollutant concentrations.
This similarity value can be accurately obtained through particle dynamics equations that simulate the distribution of pollutant particles. However, such simulations are highly complex and time-consuming. Therefore, this paper uses a simpler approach to calculate the similarity between two solutions, which is defined as follows:
$$\lVert x(t), y(t) \rVert = \lVert x_l, y_l \rVert + \lVert x_u(t), y_u(t) \rVert_{\text{norm}}$$
where $x_l$ is the pollution source location, $x_l = (x_l^1, \dots, x_l^L)$, and $x_u(t)$ is the pollution injection information, $x_u(t) = (x_u^1(t), \dots, x_u^L(t))$.
To analyze the correlation between the pollution source location $x_l$ and the pollution injection information $x_u(t)$, the distance between two solutions is defined as the sum of the distances of the two components. The distance for the pollution source location component is given by
$$\lVert x_l, y_l \rVert = \prod_{i=1}^{L} \dim_p\, p(x_l^i, y_l^i), \quad x_l = (x_l^1, \dots, x_l^L).$$
The distance between two nodes in the water distribution network, $\lVert x_l^i, y_l^i \rVert$, can be interpreted as the correlation of the sensor data of the two nodes. $p(x_l^i, y_l^i)$ is the shortest path in the water distribution system between the two nodes $x_l^i$ and $y_l^i$.
If two nodes are on the same pipe and are close to each other, it can be assumed that their sensor data are similar; thus, the number of nodes in the shortest path $p$ between the two nodes, $\dim_p\, p$, can represent the distance between them. It is assumed that the similarity between two pollution source locations is the reciprocal of their distance, denoted as
$$s(x_l^i, y_l^i) = \frac{1}{\dim_p\, p(x_l^i, y_l^i)}.$$
For problems with multiple pollution sources, each source is considered independent. The similarity between two solutions with $L$ sources is given by
$$s(x_l, y_l) = \prod_{i=1}^{L} s(x_l^i, y_l^i) = \prod_{i=1}^{L} \frac{1}{\dim_p\, p(x_l^i, y_l^i)}.$$
Consequently, the distance between the two solutions can be defined as
$$\lVert x_l, y_l \rVert = s(x_l, y_l)^{-1} = \left( \prod_{i=1}^{L} \frac{1}{\dim_p\, p(x_l^i, y_l^i)} \right)^{-1} = \prod_{i=1}^{L} \dim_p\, p(x_l^i, y_l^i).$$
The pollution injection information component forms a continuous space, so the normalized Euclidean distance $\lVert x_u(t), y_u(t) \rVert_{\text{norm}}$ is used as the distance between the pollution injection information components.
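The two-part distance can be sketched in Python as follows. Several details are assumptions made only for illustration: the per-source terms are combined by a product (the combination for which the reciprocal-of-similarity derivation holds), the path length counts the nodes on a shortest path found by breadth-first search, and the "normalized" Euclidean distance scales each variable by its range and divides by the square root of the dimension. The toy graph and all names are hypothetical.

```python
import math
from collections import deque

def path_node_count(graph, a, b):
    """Number of nodes on a shortest path from a to b (1 when a == b), via BFS."""
    seen, queue = {a: 1}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return seen[node]
        for nxt in graph[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    raise ValueError(f"{b!r} is not reachable from {a!r}")

def location_distance(graph, xl, yl):
    """Product over sources of shortest-path node counts (assumed combination)."""
    return math.prod(path_node_count(graph, a, b) for a, b in zip(xl, yl))

def injection_distance(xu, yu, lower, upper):
    """Range-scaled Euclidean distance divided by sqrt(n) (assumed normalization)."""
    sq = sum(((a - b) / (hi - lo)) ** 2 for a, b, lo, hi in zip(xu, yu, lower, upper))
    return math.sqrt(sq / len(xu))

def solution_distance(graph, x, y, lower, upper):
    """Sum of the location distance and the injection distance."""
    (xl, xu), (yl, yu) = x, y
    return location_distance(graph, xl, yl) + injection_distance(xu, yu, lower, upper)

# Toy network: a single pipe A - B - C - D, with two contamination sources.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
x = (("A", "B"), (0.0, 10.0))
y = (("C", "B"), (5.0, 10.0))
lower, upper = (0.0, 0.0), (10.0, 20.0)
d = solution_distance(graph, x, y, lower, upper)
print(d)
```

Note that under the node-count convention two identical locations contribute a factor of 1 rather than 0, so an implementation that needs a zero distance for identical solutions would have to treat that case explicitly.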

3.1.3. Metrics for Fitness Landscape Analysis

In the previous work [8], various fitness landscape metrics were proposed to evaluate the degree of neutrality, ruggedness, and the number of BoAs. The formulas for these metrics are as follows:
  • Ruggedness
    A rugged landscape can be regarded as a multimodal landscape with many sharp descents and ascents. Let $d_x$ be the distance between a solution and its nearest-better solution. If the standard deviation of $d_x$ is relatively large, this indicates that the fitness landscape has many sharp descents and ascents.
    $$I_r = \sqrt{\frac{\sum_{x \in X_N} (d_x - \bar{d})^2}{|X_N|}}, \quad d_x = \lVert x, b(x) \rVert, \quad \bar{d} = \frac{\sum_{x \in X_N} d_x}{|X_N|}$$
  • Neutrality
    Neutral space refers to areas where fitness values show little variation. Thus, the area of this neutral space can be calculated as a metric for neutrality as follows:
    $$B_{dneu}(x) = \left\{ y \;\middle|\; b(y) = x, \text{ and } \frac{|f(x) - f(y)|}{|\max_{z \in X_N} f(z) - \min_{z \in X_N} f(z)|} \le \tau,\ y \in X_N \right\}$$
    $$B_{neu}(x) = \Big( \bigcup_{y \in B_{dneu}(x)} B_{neu}(y) \Big) \cup B_{dneu}(x)$$
    $$I_n = \max_{x \in X_N} |B_{neu}(x)|,$$
    where $\tau = 0.001$ is a small positive number that guarantees a small difference between the fitness of a solution and that of its nearest-better solution in a neutral area;
  • Differences between two fitness landscapes
    This paper also needs to analyze the dynamic change and separability of the problem. The key to analyzing these two properties is to examine the differences between two fitness landscapes. This paper proposes a new fitness landscape metric for this purpose.
    As illustrated in Figure 6, it is not sufficient to rely on counting the number of solutions with different fitness values as an indicator of the difference between two fitness landscapes. In the figure, only a few blue solutions (on the edge of the fitness landscape) have the same fitness value, so it seems that only the edge of the attraction domain remains unchanged. In fact, these two fitness landscapes are very similar, with only three changed areas: (1) the optimal solution of the peak on the left has shifted slightly; (2) the BoA on the right has become larger and, correspondingly, the BoA on the left has become smaller; (3) the two BoAs on the right have merged into one. It is precisely in these three changed areas that the nearest-better relationships differ. Therefore, the difference between two fitness landscapes can be evaluated by counting the solutions whose nearest-better relationships differ. The formula is as follows:
    $$\lVert S(G_1), S(G_2) \rVert = \frac{|T|}{|S|}, \quad \text{where } T = \{x \in S \mid b_1(x) \ne b_2(x)\}$$
    where $S$ represents the set of the sampled solutions and $b_i$ denotes the nearest-better relationship of the $i$-th fitness landscape. $T$ is the set of solutions with different nearest-better relations, and $|T|$ is the number of solutions in the set $T$;
  • Modality in the biased data set
    In previous research [8], optimal solutions were identified solely based on the magnitude of the nearest-better distance (NBD) of the solutions. However, this approach is unsuitable for the NBN generated from biased data. The distribution of solutions generated by the algorithm is non-uniform. In the early stages of evolution, the algorithm’s search radius is relatively large, resulting in higher NBD values in some poorer regions. Consequently, some solutions may be mistakenly identified as local optima due to their larger NBD. In reality, optimal solutions typically refer to those with better fitness values. In the biased data-based NBN, fitness and NBD are integrated to identify optima, as shown in the following equation:
    $$f(x) \ge \theta \ \wedge\ \lVert x, b(x) \rVert \ge \vartheta$$
    where θ and ϑ are user-defined parameters.
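The ruggedness, neutrality, and landscape-difference metrics above can be sketched in Python as follows. The sketch assumes that the NBN is stored as a dictionary mapping each sample index to the index of its nearest-better sample (or `None` for the best sample), and that the fitness range in the neutrality test is taken over all samples; these representation choices and all names are illustrative assumptions.

```python
import math

def ruggedness(nbd):
    """I_r: population standard deviation of the nearest-better distances d_x."""
    mean = sum(nbd) / len(nbd)
    return math.sqrt(sum((d - mean) ** 2 for d in nbd) / len(nbd))

def neutrality(edges, fit, tau=0.001):
    """I_n: largest number of samples chained to one sample through neutral edges."""
    span = max(fit) - min(fit)
    children = {i: [] for i in range(len(fit))}
    for y, x in edges.items():
        # Edge y -> x is neutral when the normalized fitness gap is at most tau.
        if x is not None and span > 0 and abs(fit[x] - fit[y]) / span <= tau:
            children[x].append(y)

    def size(x):
        # Number of neutral descendants of x (recursive, fine for small sketches).
        return sum(1 + size(y) for y in children[x])

    return max(size(x) for x in children)

def landscape_difference(edges1, edges2):
    """|T| / |S|: fraction of samples whose nearest-better sample differs."""
    changed = [x for x in edges1 if edges1[x] != edges2[x]]
    return len(changed) / len(edges1)

# Toy data: four samples on a chain, the first three nearly equal in fitness.
fit = [0.0, 0.0004, 0.0008, 1.0]
edges_a = {0: 1, 1: 2, 2: 3, 3: None}
edges_b = {0: 1, 1: 3, 2: 3, 3: None}

i_r = ruggedness([1.0, 3.0])
i_n = neutrality(edges_a, fit)
diff = landscape_difference(edges_a, edges_b)
print(i_r, i_n, diff)
```

The difference metric only compares nearest-better relationships, so both networks must be built on the same sample set $S$.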

4. Fitness Landscape Analysis

4.1. Sampling Method

To analyze the fitness landscape features of the CSWIDN problem from different perspectives, several sampling methods are designed, as follows:
  • Sampling in continuous search space
    Global sampling in the continuous space centered at $c$
    $$S_{con}(c) = \{x \in X \mid \lVert x_l, c_l \rVert = 0\}$$
    To observe the landscape features of the problem in the continuous space, the pollution source location components $x_l$ of the sampled solutions are set the same as those of the central solution $c$, while the pollution injection information component $x_u$ is sampled randomly;
    Local sampling in continuous space centered at $c$
    $$S_{con}(c, r) = \{x \in X \mid \lVert x_l, c_l \rVert = 0 \ \wedge\ \lVert x_u, c_u \rVert_{\text{norm}} \le r\}$$
    To further investigate the landscape features of the problem in the continuous space, the samples are centered around $c$, keeping the pollution source locations the same as those of $c$ while performing local sampling on the pollution injection information component $x_u$ within a sampling radius $r$;
  • Sampling in combinatorial search space centered at $c$
    $$S_{com}(c) = \{x \in X \mid \lVert x_u, c_u \rVert_{\text{norm}} = 0\}$$
    In the global sampling of the combinatorial space centered at $c$, $S_{com}(c)$, the pollution injection information component $x_u$ of the sampled solutions is set to match that of the central solution $c$, while the pollution source location component $x_l$ is varied for global sampling. Given that the number of pollution sources is relatively small ($L = 3$), there are only $96^3 = 884{,}736$ solutions in the combinatorial space, where $|U| = 96$ is the number of nodes in the water distribution network. Therefore, it is feasible to conduct a complete sampling of the entire combinatorial solution space;
  • Sampling in the whole search space
    To analyze the original fitness landscape features of the CSWIDN problem, this paper also performs global random sampling in the original solution space to generate a set of sampled solutions $S_o \subset X$;
  • Sampling by algorithm
    The algorithm focuses more on solutions with better fitness, making its sampling biased. The set of sampled solutions $S_{alg} \subset X$ generated by the algorithm allows for an indirect analysis of the algorithm’s behavior. This paper applies the Adaptive Multi-Population Algorithm [3] as the sampling method, which is designed comprehensively to account for various landscape features of the problem, such as dynamic changes, multimodality, and separability. Moreover, experiments show that this algorithm outperforms other algorithms. In the sampling, the algorithm is run more than 30 independent times with the recommended parameters.
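The three structured sampling schemes (global continuous, local continuous, and combinatorial) can be sketched in Python as follows. A solution is represented here as a pair (locations, injection values); the bounds, the per-coordinate perturbation used for local sampling, and all names are assumptions for illustration rather than the sampling procedure actually used in the experiments.

```python
import itertools
import random

def sample_continuous(c, lower, upper, n, rng):
    """S_con(c): keep the locations of c and draw injection values uniformly in bounds."""
    locations, _ = c
    return [(locations, tuple(rng.uniform(lo, hi) for lo, hi in zip(lower, upper)))
            for _ in range(n)]

def sample_continuous_local(c, lower, upper, r, n, rng):
    """S_con(c, r): keep the locations of c and move each injection value by at most
    r times its range, so the range-scaled RMS distance to c stays within r."""
    locations, cu = c
    samples = []
    for _ in range(n):
        xu = tuple(min(hi, max(lo, v + rng.uniform(-r, r) * (hi - lo)))
                   for v, lo, hi in zip(cu, lower, upper))
        samples.append((locations, xu))
    return samples

def sample_combinatorial(c, nodes, n_sources):
    """S_com(c): keep the injection values of c and enumerate every location tuple."""
    _, cu = c
    return [(locs, cu) for locs in itertools.product(nodes, repeat=n_sources)]

rng = random.Random(0)
c = ((2, 5, 7), (0.3, 0.6))  # hypothetical center: 3 locations, 2 injection values
lower, upper = (0.0, 0.0), (1.0, 1.0)
global_con = sample_continuous(c, lower, upper, 5, rng)
local_con = sample_continuous_local(c, lower, upper, 1e-2, 5, rng)
all_locations = sample_combinatorial(c, range(10), 3)  # 10**3 tuples on a toy network
print(len(global_con), len(local_con), len(all_locations))
```

Full enumeration of the location component scales as the number of candidate nodes raised to the power $L$, so it is only practical when both are small, as in the $96^3 = 884{,}736$ case considered here.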

4.2. Analysis of CSWIDN

4.2.1. Neutrality

From Figure 7, it can be observed that the CSWIDN exhibits neutrality in the local regions of the continuous space with $r = 1 \times 10^{-5}$ and $r = 1 \times 10^{-6}$. Figure 8 and Figure 9 show that the local areas in the continuous space display the neutrality feature during the early phase ($t \le 9$). However, Figure 9 shows that the global structure of the continuous space with the sampled solution set $S_{con}(o)$ does not exhibit obvious neutrality features.
Figure 9 also reveals that the combinatorial space with the sampled solution set $S_{com}(o)$ demonstrates the neutrality feature during the early phase ($t \le 6$). In Figure 9, the entire search space with $S_o$ and the algorithm’s search space with $S_{alg}$ both exhibit neutrality features in the early stages. The pollutant injection times for the three contamination sources $[o_l^1, o_l^2, o_l^3]$ are $[1, 29]$, $[6, 37]$, and $[14, 46]$, respectively. The times when the fitness landscape exhibits neutrality do not align with the times of pollutant injection. This presence of neutrality suggests that pollutants spread gradually in the water flow, and sensor delays in detection may lead to the observed neutrality in the fitness landscape in the early phase.
Furthermore, the presence of neutrality in the algorithm’s search space suggests that the algorithm, as designed, does not adequately account for this feature. This oversight results in lower search efficiency in neutral spaces, leading to the generation of many solutions with identical fitness values.
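To make the notion of neutrality concrete, the following sketch computes a simple proxy on a continuous sample: the fraction of solutions whose nearest neighbour has (almost) the same fitness. This proxy is not the NBN-based measure used in this paper; the function name `neutrality_ratio` and the tolerance `eps` are assumptions made for illustration, and the brute-force neighbour search is only practical for small samples.

```python
import math


def neutrality_ratio(points, fitness, eps=1e-12):
    """Fraction of points whose nearest neighbour has fitness within eps."""
    n = len(points)
    if n < 2:
        return 0.0
    neutral = 0
    for i in range(n):
        nearest_j, nearest_d = None, math.inf
        # Brute-force search for the nearest neighbour of point i.
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if d < nearest_d:
                nearest_j, nearest_d = j, d
        # A point counts as neutral if its neighbour is equally fit.
        if abs(fitness[i] - fitness[nearest_j]) <= eps:
            neutral += 1
    return neutral / n


line = [(0.0,), (1.0,), (2.0,), (3.0,)]
plateau = neutrality_ratio(line, [5.0, 5.0, 5.0, 5.0])  # flat landscape
slope = neutrality_ratio(line, [0.0, 1.0, 2.0, 3.0])    # strictly sloped
```

On the flat toy landscape every point is neutral, so `plateau` is 1.0, whereas `slope` is 0.0. A search operator that relies only on fitness differences receives no guidance in the first case, which is the efficiency loss described above.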

4.2.2. Ruggedness

From Figure 10, it can be observed that the global structure of the continuous space, $S_{con}(o)$, the local region of the continuous space, $S_{con}(o, 1 \times 10^{-6})$, and the combinatorial space, $S_{com}(o)$, are relatively smooth.
In contrast, the whole search space, $S_o$, is rugged, and its degree of ruggedness remains fairly consistent throughout all phases. The ruggedness of the overall fitness landscape may arise from the combination of the continuous and combinatorial spaces. This suggests that the combined fitness landscape is more complex than that of the individual components.
The algorithm’s search space also shows ruggedness, indicating that the algorithm does not consider the ruggedness feature and lacks a suitable search operator to smooth the fitness landscape.

4.2.3. Modality

Figure 11 shows that both the local and global regions of the CSWIDN continuous space exhibit only one global optimum. This is illustrated in Figure 8 and Figure 12, where only one BoA is present. In contrast, the CSWIDN combinatorial space has many optima, as shown in Figure 13. Combining the two search spaces, the entire search space also exhibits the modality feature, as shown in Figure 14.
During the early phase $t \le 8$, there are more optima, with a total of four in the combinatorial space, and the entire search space also shows modality features during this time. However, in the phases $9 \le t \le 14$, there are only two optima in the combinatorial space, and, at this time, the entire search space has only one global optimum. This could be attributed to the unimodal nature of the continuous space, which tends to diminish the multimodal characteristics of the overall search space.
From Figure 11, it can be found that the time at which the algorithm’s search space exhibits the multimodal feature aligns closely with the time when the entire search space shows the multimodal feature. This suggests that the algorithm indeed has the ability to search for multiple global optima. It can be observed that the number of optima identified by the algorithm significantly exceeds the number of optima recognized in the entire search space, potentially due to low sampling accuracy, meaning that some global optima are not effectively identified.
Given the observed patterns in the emergence of the modality feature, researchers can design a more efficient algorithm specifically tailored to these features. Since modality appears in the combinatorial space but not in the continuous space, researchers should focus on developing mechanisms for handling modality in the combinatorial space. Furthermore, the modality features mainly arise in the early stages. The algorithm may still find many good solutions later, but these good solutions are closely connected in later phases, as shown in Figure 15. Therefore, designing an effective local search operator for the later phases will allow algorithms to quickly identify these good solutions. In short, the algorithm needs diversity-handling mechanisms in the early stages and an effective local search operator in the later phases. Given that these mechanisms consume computational resources, using them selectively, based on an analysis of the problem features, can enhance the algorithm’s search efficiency.

4.2.4. Dynamic Change

From Figure 16, it can be seen that the CSWIDN problem indeed changes over time. The changes in the combinatorial space are relatively minor, with $\lVert S(t), S(t-1) \rVert$ remaining smaller than 0.1 at all times, as shown in Figure 16. The global structure of the continuous space also shows minimal changes most of the time, with $\lVert S(t), S(t-1) \rVert$ smaller than 0.1. However, at certain moments, such as $9 \le t \le 11$, the continuous space can exhibit abrupt changes, with $\lVert S(t), S(t-1) \rVert$ rising above 0.35. These abrupt changes also appear in the entire search space and the algorithm’s search space.
It can be observed that the timing of these abrupt changes does not align with either the start or the end of the injection period. This suggests that, due to the gradual dispersion of particles within the water distribution system, there is a delayed response in the fitness landscape.
Although the magnitudes of these dynamic changes are relatively small, such as in the combinatorial space, where the changes are less than 0.1, the algorithm cannot overlook their impact. As shown in Table 1, in the combinatorial space, even though the changes between successive times are minimal, the set of sampled solutions $S_{com}(o)$ exhibits a change magnitude of $1 - 0.702 = 0.298$ over the entire duration. If the algorithm does not account for these dynamic changes, it may gradually lose the global optimum found at a specific moment.
By analyzing the patterns of dynamic changes, a more efficient dynamic change-handling mechanism can be designed for CSWIDN. Given that the magnitudes of dynamic changes are relatively small in both continuous and combinatorial spaces, researchers can develop mechanisms tailored to handle these minor dynamic changes. Additionally, in the continuous space, a mechanism designed specifically for abrupt changes is needed.
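The two magnitudes quoted in this subsection (below 0.1 for minor changes, above 0.35 for abrupt ones) can serve as thresholds for choosing a change-handling response at each time step. The sketch below assumes that the differences between successive landscapes are already available as a list; the function `label_changes`, the label names, and the values in `diffs` are hypothetical and serve only as an illustration.

```python
def label_changes(diffs, minor=0.1, abrupt=0.35):
    """Label each successive-landscape difference by its magnitude."""
    labels = []
    for d in diffs:
        if d > abrupt:
            labels.append("abrupt")    # e.g., trigger a restart-like response
        elif d < minor:
            labels.append("minor")     # e.g., re-evaluate stored solutions only
        else:
            labels.append("moderate")  # between the two quoted thresholds
    return labels


# Illustrative values only; they are not measurements from Figure 16.
diffs = [0.02, 0.05, 0.40, 0.20]
labels = label_changes(diffs)
```

With these illustrative inputs, the third step is labelled "abrupt" and the first two "minor". In practice, the response attached to each label would be the lightweight or the abrupt-change mechanism discussed above.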

4.3. Separability

To analyze the independence between the contamination source position components and the pollutant injection information components in the CSWIDN problem, the following two sets of experiments are designed:
  • For two different contamination source positions $o_l$ and $c_l$, sampling data are generated with identical distributions of pollutant injection information components, and the differences between the two fitness landscapes are analyzed. The two sampled solution sets are defined as
    $S_{con}(o) = \{x \in X \mid \lVert x_l, o_l \rVert = 0\}$
    and
    $S_{con}(c) = \{x \in X \mid \lVert x_l, c_l \rVert = 0\}$
    For all the i-th sampled solutions in the two sets, $x^i \in S_{con}(o)$ and $y^i \in S_{con}(c)$, the injection information components of the two solutions are the same, $\lVert x_u^i, y_u^i \rVert = 0$.
    Given that the CSWIDN problem involves multiple contamination sources, the phase $t = 38$ is selected for a clearer comparison of the influence of source positions. At this phase, only the third contamination source is discharging pollutants, and the contamination source position components of the two solutions, $o_l$ and $c_l$, are set such that only the third contamination source position differs, $\lVert o_l, c_l \rVert = \lVert o_{l_3}, c_{l_3} \rVert$, to exclude the effects of the other two contamination sources;
  • Under different pollutant injection information $o_u$ and $c_u$, sampling data are generated with identical distributions of contamination source position components, and the differences between the two fitness landscapes are analyzed. The two sampled solution sets are defined as
    $S_{com}(o) = \{x \in X \mid \lVert x_u, o_u \rVert = 0\}$
    and
    $S_{com}(c) = \{x \in X \mid \lVert x_u, c_u \rVert = 0\}$
    For all the i-th sampled solutions in the two sets, $x^i \in S_{com}(o)$ and $y^i \in S_{com}(c)$, the contamination source position components of the two solutions are the same, $\lVert x_l^i, y_l^i \rVert = 0$.
Table 2 demonstrates that the fitness landscapes differ significantly with varying contamination source position components, exhibiting a similarity of about $1 - 0.4 = 0.6$ between $S_{con}(o)$ and $S_{con}(c)$. This indicates that the pollutant injection information components depend highly on the contamination source position components. Conversely, under different pollutant injection information components, the fitness landscapes of the contamination source position components remain quite similar, with a similarity greater than $1 - 0.1 = 0.9$ between $S_{com}(o)$ and $S_{com}(c)$, suggesting that the contamination source position components are relatively independent. Therefore, in an optimization algorithm, the contamination source position components can be searched first, the most likely contamination positions selected, and, subsequently, the pollutant injection information optimized based on those positions to reduce the difficulty of the search.
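The position-first strategy suggested above can be sketched as a two-stage search. The sketch assumes a minimization problem, ranks candidate positions under a fixed reference injection, and then runs a plain random search over the injection information for the shortlisted positions. The function `two_stage_search`, its parameters, and the toy fitness function are hypothetical; the toy function is fully separable, which is a stronger property than the relative independence observed for CSWIDN.

```python
import random


def two_stage_search(fitness, positions, sample_injection, reference_injection,
                     top_k=3, n_trials=200, seed=0):
    """Search positions first, then tune injection for the best positions."""
    rng = random.Random(seed)
    # Stage 1: rank positions while the injection information is held fixed.
    ranked = sorted(positions, key=lambda p: fitness(p, reference_injection))
    shortlist = ranked[:top_k]
    # Stage 2: random search over injection information per shortlisted position.
    best = None
    for p in shortlist:
        for _ in range(n_trials):
            u = sample_injection(rng)
            f = fitness(p, u)
            if best is None or f < best[0]:
                best = (f, p, u)
    return best


def toy_fitness(position, injection):
    """Toy stand-in for the simulation error (lower is better)."""
    return abs(position - 7) + (injection - 0.3) ** 2


best_f, best_p, best_u = two_stage_search(
    toy_fitness,
    positions=range(20),
    sample_injection=lambda rng: rng.uniform(0.0, 1.0),
    reference_injection=0.5,
)
```

In this toy setting, the search returns position 7 after evaluating the injection information for only three of the twenty positions, which illustrates how the decomposition reduces the number of expensive evaluations.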

5. Discussion

The advantage of NBN over other methods lies in its ability to simplify the fitness landscape by utilizing the nearest-better relationship, which has long been a valuable tool for analyzing landscape features. Additionally, NBN introduces an effective visualization method, further enhancing its ability. Experimental results also show the significant potential of NBN. Its simplicity is another strength; by relying only on sampling solutions and distance information, NBN can theoretically be applied to any optimization problem.
However, NBN does have some limitations. One major drawback is the lack of a universal and efficient method for computing the nearest-better relationship. While [8] proposed an efficient algorithm for continuous problems, the current approach for combinatorial and mixed-encoding problems still relies on an $O(N^2)$ iterative algorithm, which limits its scalability. Furthermore, the simplicity of NBN, while advantageous in many ways, means it focuses exclusively on the nearest-better relationship. In comparison to other methods that incorporate a wider range of information to analyze landscape features, NBN may overlook certain aspects, such as ill-conditioning, which can be important for a more comprehensive understanding of the landscape.
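The quadratic cost can be seen in a brute-force sketch of the nearest-better computation, in which every solution is compared with every other solution. The sketch assumes minimization, ignores ties in fitness, and uses a mixed distance defined as the Hamming distance on the position part plus the Euclidean distance on the injection part. This distance and the names `mixed_distance` and `nearest_better` are assumptions for illustration and may differ from the definitions used in this paper.

```python
import math


def mixed_distance(a, b):
    """Hamming distance on positions plus Euclidean distance on injection."""
    (a_pos, a_inj), (b_pos, b_inj) = a, b
    hamming = sum(x != y for x, y in zip(a_pos, b_pos))
    return hamming + math.dist(a_inj, b_inj)


def nearest_better(solutions, fitness, distance):
    """For each solution, find the closest strictly better one (O(N^2))."""
    n = len(solutions)
    parent = [None] * n    # index of the nearest-better solution, if any
    nbd = [math.inf] * n   # nearest-better distance
    for i in range(n):
        for j in range(n):
            if fitness[j] >= fitness[i]:
                continue   # only strictly better solutions qualify
            d = distance(solutions[i], solutions[j])
            if d < nbd[i]:
                parent[i], nbd[i] = j, d
    return parent, nbd


# Each solution is (positions, injection); values are illustrative only.
solutions = [((0,), (0.0,)), ((0,), (1.0,)), ((1,), (1.0,)), ((1,), (3.0,))]
fitness = [3.0, 2.0, 1.0, 4.0]
parent, nbd = nearest_better(solutions, fitness, mixed_distance)
# parent == [1, 2, None, 2]; the best solution has no nearest-better parent.
```

The two nested loops over all $N$ samples are the source of the $O(N^2)$ cost, which becomes a bottleneck at the sample sizes used in this study.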

6. Conclusions

This study presents the first comprehensive fitness landscape analysis for Contaminant Source Identification in Water Distribution Networks, revealing landscape features such as neutrality, ruggedness, modality, dynamic change, and separability. These findings deepen our understanding of the underlying characteristics of the problem and provide valuable insights into how these features influence the performance of optimization algorithms. By recognizing these patterns, researchers can design more efficient algorithms tailored to the specific challenges of CSWIDN.
However, the methodology, NBN, has some limitations. It currently lacks a universally efficient approach for computing the nearest-better relationship, particularly for combinatorial and mixed-encoding problems. Additionally, while NBN’s simplicity is advantageous, it overlooks certain landscape features, such as ill-conditioning, which could provide a more comprehensive understanding of the problem. Future work will focus on addressing these limitations by improving computational efficiency and incorporating additional landscape features to enhance the analysis and algorithm development for CSWIDN. We will also develop a more effective algorithm for CSWIDN in our future work.

Author Contributions

Conceptualization, Y.D., S.Z. and S.Y.; Data curation, Y.D. and C.L.; Formal analysis, Y.D. and C.L.; Funding acquisition, C.L.; Investigation, Y.D. and C.L.; Methodology, Y.D., C.L., S.Z. and S.Y.; Project administration, C.L.; Resources, C.L.; Software, Y.D. and C.L.; Supervision, C.L.; Validation, Y.D.; Writing—original draft, Y.D.; Writing—review and editing, C.L., S.Z. and S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62476006, in part by the Hubei Provincial Natural Science Foundation of China under Grant 2023AFA049, and in part by the Fundamental Research Funds of the AUST under Grant 2024JBZD0007.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CSWIDN    Contaminant Source Identification in Water Distribution Networks
NBN       Nearest-Better Network
NBD       Nearest-Better Distance

References

  1. Costa, D.M.; Melo, L.F.; Martins, F.G. Localization of Contamination Sources in Drinking Water Distribution Systems: A Method Based on Successive Positive Readings of Sensors. Water Resour. Manag. 2013, 27, 4623–4635.
  2. Taormina, R.; Galelli, S. Deep-Learning Approach to the Detection and Localization of Cyber-Physical Attacks on Water Distribution Systems. J. Water Resour. Plan. Manag. 2018, 144, 04018065.
  3. Li, C.; Yang, R.; Zhou, L.; Zeng, S.; Mavrovouniotis, M.; Yang, M.; Yang, S.; Wu, M. Adaptive Multipopulation Evolutionary Algorithm for Contamination Source Identification in Water Distribution Systems. J. Water Resour. Plan. Manag. 2021, 147, 04021014.
  4. Seth, A.; Klise, K.A.; Siirola, J.D.; Haxton, T.; Laird, C.D. Testing Contamination Source Identification Methods for Water Distribution Networks. J. Water Resour. Plan. Manag. 2016, 142, 04016001.
  5. Piazza, S.; Sambito, M.; Freni, G. Analysis of Optimal Sensor Placement in Looped Water Distribution Networks Using Different Water Quality Models. Water 2023, 15, 559.
  6. Yan, X.; Zhao, J.; Hu, C.; Zeng, D. Multimodal optimization problem in contamination source determination of water supply networks. Swarm Evol. Comput. 2019, 47, 66–71.
  7. Kerschke, P.; Trautmann, H. Automated Algorithm Selection on Continuous Black-Box Problems by Combining Exploratory Landscape Analysis and Machine Learning. Evol. Comput. 2019, 27, 99–127.
  8. Diao, Y.; Li, C.; Zeng, S.; Yang, S.; Coello, C.A.C. Nearest-Better Network for Fitness Landscape Analysis of Continuous Optimization Problems. IEEE Trans. Evol. Comput. 2024. early access.
  9. Diao, Y.; Li, C.; Zeng, S.; Yang, S. Nearest Better Network for Visualization of the Fitness Landscape. In Proceedings of the GECCO ’23 Companion: Companion Conference on Genetic and Evolutionary Computation, Lisbon, Portugal, 15–19 July 2023; pp. 815–818.
  10. Stadler, P.F. Fitness landscapes. In Biological Evolution and Statistical Physics; Springer: Berlin/Heidelberg, Germany, 2002; pp. 183–204.
  11. Zou, F.; Chen, D.; Liu, H.; Cao, S.; Ji, X.; Zhang, Y. A survey of fitness landscape analysis for optimization. Neurocomputing 2022, 503, 129–139.
  12. Malan, K.M.; Engelbrecht, A.P. A survey of techniques for characterising fitness landscapes and some possible ways forward. Inf. Sci. 2013, 241, 148–163.
  13. Malan, K.M. A Survey of Advances in Landscape Analysis for Optimisation. Algorithms 2021, 14, 40.
  14. Horn, J.; Goldberg, D.E. Genetic Algorithm Difficulty and the Modality of Fitness Landscapes. In Foundations of Genetic Algorithms; Whitley, L.D., Vose, M.D., Eds.; Elsevier: Amsterdam, The Netherlands, 1995; Volume 3, pp. 243–269.
  15. Reidys, C.M.; Stadler, P.F. Neutrality in fitness landscapes. Appl. Math. Comput. 2001, 117, 321–350.
  16. Davidor, Y. Epistasis Variance: A Viewpoint on GA-Hardness. In Foundations of Genetic Algorithms; Rawlins, G.J., Ed.; Elsevier: Amsterdam, The Netherlands, 1991; Volume 1, pp. 23–35.
  17. Mavrovouniotis, M.; Li, C.; Yang, S. A survey of swarm intelligence for dynamic optimization: Algorithms and applications. Swarm Evol. Comput. 2017, 33, 1–17.
  18. Liu, L.; Ranjithan, S.R.; Mahinthakumar, G. Contamination Source Identification in Water Distribution Systems Using an Adaptive Dynamic Optimization Procedure. J. Water Resour. Plan. Manag. 2011, 137, 183–192.
  19. Rasekh, A.; Brumbelow, K. A dynamic simulation–optimization model for adaptive management of urban water distribution system contamination threats. Appl. Soft Comput. 2015, 32, 59–71.
  20. Yan, X.; Zhao, J.; Hu, C.; Wu, Q. Contaminant source identification in water distribution network based on hybrid encoding. J. Comput. Methods Sci. Eng. 2016, 16, 379–390.
  21. Gong, J.; Yan, X.; Hu, C.; Wu, Q. Collaborative based pollution sources identification algorithm in water supply sensor networks. Desalin. Water Treat. 2019, 168, 123–135.
  22. Grbčić, L.; Kranjčević, L.; Družeta, S. Machine learning and simulation-optimization coupling for water distribution network contamination source detection. Sensors 2021, 21, 1157.
  23. Qian, K.; Jiang, J.; Ding, Y.; Yang, S.H. DLGEA: A deep learning guided evolutionary algorithm for water contamination source identification. Neural Comput. Appl. 2021, 33, 11889–11903.
Figure 1. Fitness landscape of continuous optimization problems in two dimensions.
Figure 2. Transformation from the original fitness landscape of the Shubert function to an NBN visualization with 5000 samples [9].
Figure 3. Flow chart of the NBN-assisted fitness landscape analysis method.
Figure 4. Water distribution network.
Figure 5. Schematic diagram of the distribution of pollutant examples in the local water distribution system at different times $t$ under different solutions $x$ and $y$, where purple stars are sensors, the black rectangle is the pollution source injection location, blue directed lines indicate the direction of water flow, and red circles are pollutant particles. $x$ releases pollutant from time $t$ and $y$ releases pollutant from time $t + 1$.
Figure 6. The contour of two typical 2D fitness landscapes with their NBN with the same distribution of sampled solutions, where blue dots are solutions with the same fitness and red directed arrows are different nearest-better relationships for each solution.
Figure 7. Figures of NBN with local samples in continuous space centered at the global optimum solution $o$, $S_{con}(o, r)$, with $|S_{con}(o, r)| = 1{,}000{,}000$ samples at time $t = 3$.
Figure 8. Figures of NBN with local samples in continuous space centered at the global optimum solution $o$, $S_{con}(o, 1 \times 10^{-6})$, with $|S_{con}(o, 1 \times 10^{-6})| = 1{,}000{,}000$ samples.
Figure 9. Neutrality of different sampled sets at different times.
Figure 10. Ruggedness of different sampled sets at different times.
Figure 11. Number of optima for different sampled sets at different times.
Figure 12. Figures of NBN with samples in continuous space centered at the global optimum solution $o$, $S_{con}(o)$, with $|S_{con}(o, r)| = 1{,}000{,}000$ samples.
Figure 13. Figures of NBN with samples in combinatorial space centered at the global optimum solution $o$, $S_{com}(o)$.
Figure 14. Figures of NBN with samples in the whole search space, $S_o$, with $|S_o| = 1{,}000{,}000$ samples.
Figure 15. Figures of NBN with samples by algorithm, $S_{alg}$.
Figure 16. Differences in fitness landscapes between two successive times for different sampled sets at different times.
Table 1. The differences in the fitness landscape between the start time and the end time of the sampled solution set.
| | $S_{con}(o)$ | $S_{con}(o, 1 \times 10^{-6})$ | $S_{com}(o)$ | $S_o$ | $S_{alg}$ |
| --- | --- | --- | --- | --- | --- |
| $\lVert S(1), S(46) \rVert$ | 0.564 | 0.655 | 0.702 | 0.478 | 0.618 |
Table 2. Analysis of independence between contamination source positions and pollutant injection information components.
| $\lVert o_l, c_l \rVert$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| $\lVert S_{con}(o), S_{con}(c) \rVert$ | 0.394 | 0.361 | 0.357 | 0.477 | 0.310 | 0.312 | 0.419 |
| $\lVert o_l, c_l \rVert$ | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| $\lVert S_{con}(o), S_{con}(c) \rVert$ | 0.411 | 0.406 | 0.402 | 0.277 | 0.312 | 0.311 | 0.311 |
| $\lVert o_u, c_u \rVert_{norm}$ | 0.9 | 0.1 | 0.01 | 0.001 | 0.0001 | 0.00001 | 0.000001 |
| $\lVert S_{com}(o), S_{com}(c) \rVert$ | 0.010 | 0.000 | 0.001 | 0.001 | 0.003 | 0.003 | 0.012 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Diao, Y.; Li, C.; Zeng, S.; Yang, S. Nearest-Better Network-Assisted Fitness Landscape Analysis of Contaminant Source Identification in Water Distribution Network. Data 2024, 9, 142. https://doi.org/10.3390/data9120142
