1. Introduction
Simulation-based optimization problems are usually black-box and computationally expensive, and they have been receiving increasing attention for their relevance in ubiquitous applications [1]. Bayesian optimization (BO), due to its flexibility and sample efficiency, has become a standard approach for simulation optimization. The computational cost, notwithstanding its sample efficiency, can still represent an obstacle to a wider diffusion. To mitigate this problem, in many situations one can resort to cheaper surrogates of the objective function, such as the output of a computer simulation. Examples are ubiquitous, including experimental design in protein engineering or material science, where the “ground truth” is given by a physical prototype, such as the extremely expensive synthesis and characterization of a new material in a laboratory. In other cases, sources of different fidelities are given by the output of a partial differential equation solver using different discretization parameters. Sources of different fidelities can also be exploited when tuning machine learning algorithms: rather than using the full dataset, one could use a smaller related dataset [2] or terminate the training procedure early, as in [3]. Cheap information sources in the optimization scheme have been studied in the literature as the multi-fidelity optimization problem, and specific methods have been developed to leverage cheaper sources more efficiently [4]. Of course, cheaper sources may hold some promise toward tractability, but they offer an incomplete model, inducing unknown bias and epistemic uncertainty. Multi-fidelity optimization methods require that sources are hierarchically organized: once a source has been queried at a given location, no further knowledge can be obtained by querying any other source of lower fidelity at any location [5]. Moreover, the hierarchical organization of sources relies on the assumption that information sources are unbiased, admitting only aleatoric uncertainty, which must be independent across sources.
To overcome these limitations, the multi-fidelity setting was generalized under different headings, such as multi-task BO [2], non-hierarchical multi-fidelity optimization [6], or multiple information source BO [7,8]. In [9], it is shown how more cost-effective sources of information can be integrated with more accurate ones, as in computational chemistry for material discovery. Another application of BO to material optimization is [10].
The above difficulties were first addressed in [11], which integrated the different information sources into a single model, with the discrepancy between each source and the function to optimize depending on the proposed location and changing across the search space. Moreover, [8] introduces a general notion of location-dependent model discrepancy to quantify the difference between each source and the objective function. Under these general assumptions, sources are no longer necessarily unbiased and are allowed to have epistemic error.
Another feature of simulation-based optimization is that the evaluations of the objective function are noisy (aleatoric or observational errors) and can be affected by uncertain (epistemic) errors and model uncertainty. The usual solution considers as the objective function the sample average approximation (SAA), as is done in the cross-validation procedure in machine learning. The reference problem is:

$$\min_{x \in X} f(x) \qquad (1)$$
Real-world optimization problems tend to have stochastic elements in the objective function, in the constraints, or in the context of the problem. This is the case when querying the objective function requires the execution of a stochastic simulation model accounting for different scenarios, but also when a stochastic optimizer of the loss function is used or when the initialization of the optimization algorithm is random.
A more general formulation considers the different sources of randomness synthesized by a random variable $w$. Consequently, the objective function $f(x)$ in (1) becomes a random function $f(x, w)$, and problem (1) becomes:

$$\min_{x \in X} \mathbb{E}_w\left[f(x, w)\right] \qquad (2)$$

If $f(x, w)$ is a performance metric of a system, this defines the optimization of the average performance. In this manuscript, we are concerned with the discrete case, in which the expectation in (2) reduces to $\sum_j p_j f(x, w_j)$, where $f(x, w_j)$ is the value of the performance measure associated with the environmental condition $w_j$ and $p_j$ represents the relevance of condition $w_j$ (i.e., the probability of occurrence or the fraction of time this condition occurs). This is, for instance, the case of optimal sensor placement in a network, where the integer variable $x$ corresponds to the placement of a number of sensors over the nodes of a network, the environmental condition $w_j$ is the injection of a contaminant at a node, and $f(x, w_j)$ is a performance score of the placement, which should monitor the propagation process and detect the contamination/intrusion as early and effectively as possible. The SAA objective is then the sample average approximation of the detection time corresponding to $x$.
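As a minimal numerical illustration of this discrete, risk-neutral objective (the values and symbol names below are purely illustrative):

```python
import numpy as np

# Hypothetical performance scores f(x, w_j) of one candidate solution x
# under four environmental conditions, with relevance weights p_j.
f_values = np.array([12.0, 7.5, 30.0, 9.0])   # e.g., detection times per scenario
p = np.array([0.4, 0.3, 0.2, 0.1])            # probability of each condition

# Risk-neutral (SAA-style) objective: sum_j p_j * f(x, w_j)
saa_objective = float(np.dot(p, f_values))
print(saa_objective)  # 13.95
```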
A relevant limitation of the SAA is that it is a risk-neutral measure, while infrastructure networks such as water, energy, or transport networks, among others, must specifically weigh the downside risk. The networks considered in this paper use a different risk profile given by the value-at-risk (VaR) and the conditional VaR (CVaR), borrowed from financial analysis.
Among simulation optimization problems, combinatorial domains present additional challenges due to the generalization of the Gaussian process to combinatorial structures and to the combinatorial optimization of the acquisition function. A “naïve” solution is given by a continuous embedding of the solutions: a continuous relaxation allows for an efficient optimization of the acquisition function, but it does not account for the discretization needed before the next function evaluation.
The general objective of this paper is to propose a Gaussian-process-based framework, called the augmented Gaussian process (AGP), based on sparsification and originally proposed in [7] for continuous functions, and to show that it can be generalized to stochastic combinatorial optimization using different risk profiles. Some approaches to deal with integer and categorical variables are analyzed in [12,13].
The AGP, used in [14] for fine-tuning the hyperparameters of a machine learning model to optimize simultaneously accuracy and fairness while also reducing energy consumption, is shown in this paper to provide a solution that can be generalized to simulation-based combinatorial and network problems. The AGP enables sample- and cost-efficient BO over multiple information sources and supports a new acquisition function for selecting the new source–location pair, which combines the AGP confidence bound, the cost of the source, and the (location-dependent) model discrepancy between the source-specific GP and the AGP model.
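The exact acquisition function is the one recalled later as (6) and detailed in [7]; the following Python sketch only illustrates, under assumed model interfaces and an assumed way of combining the terms, how a confidence bound, the source cost, and a location-dependent discrepancy can be traded off:

```python
import numpy as np

def miso_acquisition_sketch(x, source, agp_predict, source_predicts, costs, beta=3.0):
    """Illustrative combination (not the exact formula (6) from [7]) of the AGP
    confidence bound, the source query cost, and the location-dependent
    discrepancy between a source-specific GP and the AGP, for minimization."""
    mu_agp, sigma_agp = agp_predict(x)          # AGP posterior mean / std at x
    mu_src, _ = source_predicts[source](x)      # source-specific GP posterior at x
    lcb = mu_agp - np.sqrt(beta) * sigma_agp    # AGP lower confidence bound
    discrepancy = abs(mu_src - mu_agp)          # location-dependent model discrepancy
    return lcb + costs[source] * discrepancy    # cheaper / less discrepant is preferred

# Toy usage with hand-made posteriors (stand-ins for fitted GPs).
agp = lambda x: (np.sin(x), 0.2)
sources = {0: lambda x: (np.sin(x), 0.05),            # ground-truth GP
           1: lambda x: (np.sin(x) + 0.3, 0.05)}      # biased cheap-source GP
print(miso_acquisition_sketch(1.0, 1, agp, sources, costs={0: 1.0, 1: 0.1}))
```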
An extensive set of computational results supports risk-aware optimization based on CVaR. The multiple information source acquisition function avoids variance starvation, premature convergence to local optima, and ill-conditioning in the GP training. Computational experiments confirm the performance of the MISO-AGP (multiple information source optimization through AGP) method on both benchmark functions and real-world problems.
1.1. Related Works
Multi-fidelity and multiple information source BO have been a thriving research domain. Many approaches have been proposed and leveraged into effective algorithms, of which only a few are commented on here. The case of unreliable information sources is considered in [15], where a methodology is proposed to make multi-fidelity BO robust, in the sense that a theoretical guarantee is given that the addition of an auxiliary information source will not lead to worse performance than “vanilla” BO. Also, [16] proposes multi-fidelity BO with the max-value entropy search acquisition function, together with the analysis of a parallel version. A general framework for multi-fidelity BO based on mutual information and a greedy strategy (namely, MF-MI-Greedy) is proposed in [17]; the authors observe that requiring strict relations between the quality and the cost of a lower-fidelity function is likely to lead to sub-optimal experiment design and to limit practicality. Moreover, they propose a simple notion of regret which incorporates the cost of the different fidelities and prove that MF-MI-Greedy achieves low regret. Another strategy for the adaptive sampling of multi-fidelity GPs is proposed in [18] to reduce predictive uncertainty as well as the cost of executing the simulation runs.
The key approach proposed in this paper is the AGP [7], which is based on sparsification over multiple information sources. The strategy is to “augment” the observations of the high-fidelity source with only the “reliable” ones coming from the cheaper sources, and to extend the acquisition function to the selection among the sources which can be considered reliable.
Furthermore, transfer learning as a tool for multi-fidelity optimization is addressed in [19], which proposes an acquisition function based on across-task transferable max-value entropy that balances the need to acquire information about the current task with the goal of acquiring information transferable to future tasks. Also, [20] considers the effects of heterogeneous errors on multi-fidelity BO and proposes a method to learn a noise model for each data source and to leverage highly biased low-fidelity sources which are only locally correlated with the high-fidelity source.
A seminal paper on BO over combinatorial structures is [21], which proposes an approximate optimizer of the acquisition function to overcome the difficulty of scaling many acquisition functions to large combinatorial domains. Another approach is [22], which provides a wide analysis of BO over combinatorial spaces and samples discrete variables upon continuous relaxation; the surrogate model is a Bayesian neural network with Thompson sampling and variational optimization of the acquisition function. An entirely different approach is based on autoencoders and deep learning to generate high-dimensional discrete objects. Ref. [23] uses the epistemic uncertainty of the decoder to guide the exploration of new points, while the algorithm proposed in [24] integrates deep metric learning and a variational autoencoder and provides vanishing regret guarantees. Another approach for solving combinatorial problems was proposed in [25], which introduces a learning-to-search approach over a combinatorial space in which each structure is represented by discrete variables; heuristics are used to select good starting points, while machine learning is adopted to improve global knowledge. A different approach was proposed in [26] based on recent advances in submodular relaxation for solving binary quadratic programming: the parametrized submodular relaxation makes it possible to optimize the acquisition function efficiently via minimum graph cut algorithms.
In [27], a new approach based on Mercer features for combinatorial Bayesian optimization is proposed, built on diffusion kernels and using Thompson sampling as the acquisition function. Finally, the method proposed in [28] maps the structural information of the combinatorial space into a corresponding latent space, where the optimization takes place; the next candidate latent solution is then decoded into a discrete one to be evaluated. The superiority of the method, especially in small-data settings, is empirically shown.
BO has been applied to a wide set of problems. In this manuscript, we focus on problems characterized by a few main features: combinatorial search spaces of discrete variables, simulation-based optimization with stochastic elements, and multiple information sources. Several application domains fit into this framework. Optimal sensor placement in networks, which will be considered in our experiments, is one application. Other problems considered in the experiments are binary quadratic programming problems and standard multi-fidelity benchmarks.

Epidemic scenarios also fit into the simulation optimization setting. Given a network of interacting people, the problem is to choose a small set of people whose surveillance enables the early detection of any disease outbreak when very few people are already infected. In the domain of the web, bloggers publish posts and use hyperlinks to other content on the web; the goal is to select a small set of blogs linking to most of the stories that propagate in the blogosphere.
1.2. Our Contributions
The key contribution of this paper is a new decision-theoretic approach based on the AGP for generating a single model from different information sources, which can also be used for combinatorial and network design problems. The proposed acquisition function for selecting the new source–location pair combines the AGP confidence bound, the cost of the source, and the (location-dependent) model discrepancy between the source-specific GP and the AGP. A genetic operator is also proposed for the optimization of the acquisition function over combinatorial structures.

The focus of the proposed method is on simulation optimization models, which typically generate expensive black-box optimization problems. The risk profile of the problem is accounted for, in the case of network design, using the risk measures VaR and CVaR. The multiple information source acquisition function avoids variance starvation, premature convergence to local optima, and ill-conditioning in the GP training. Computational experiments confirm the performance of the MISO-AGP method on benchmark functions and real-world problems.
1.3. Organization of the Paper
The rest of the paper is organized as follows. Section 2 provides the background on GP-based BO. Section 3 summarizes the MISO-AGP framework initially proposed in [7] for continuous optimization problems. Section 4 presents the structure of BoTorch (https://botorch.org/, accessed on 4 October 2024), the standard reference library for BO, in which MISO-AGP was recently included (https://github.com/pytorch/botorch/pull/2152, accessed on 4 October 2024). Section 5 provides the computational results of MISO-AGP applied to a binary quadratic programming problem from the literature. Then, Section 6 regards the adoption of MISO-AGP for solving a real-world application, specifically the optimal sensor placement in a water distribution network. Finally, Section 7 summarizes concluding remarks, perspectives, and limitations.
5. Test Problem: Binary Quadratic Programming
The MISO-AGP approach was compared against two state-of-the-art information-based multi-fidelity approaches, whose implementations are available in the BoTorch platform. As far as MISO-AGP is concerned, all the GPs, including the AGP, use a Matérn 5/2 kernel. Moreover, to prevent over-reliance on the cheap information source, a minimum number of evaluations on the ground truth was established. Throughout the optimization process, if this threshold was violated, the algorithm was forced to evaluate the ground truth.
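As an illustration of this modeling choice (not of the MISO-AGP implementation itself), the following sketch fits a single-source GP with a Matérn 5/2 kernel in BoTorch on toy data; the data and hyperparameters are placeholders:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.kernels import MaternKernel, ScaleKernel
from gpytorch.mlls import ExactMarginalLogLikelihood

# Toy training data: 10 points in a 5-dimensional unit cube (double precision, as recommended by BoTorch).
train_X = torch.rand(10, 5, dtype=torch.double)
train_Y = (train_X.sum(dim=-1, keepdim=True) - 2.5) ** 2

# Single-source GP with an explicit Matern 5/2 kernel (ARD over the 5 inputs).
gp = SingleTaskGP(
    train_X, train_Y,
    covar_module=ScaleKernel(MaternKernel(nu=2.5, ard_num_dims=train_X.shape[-1])),
)
mll = ExactMarginalLogLikelihood(gp.likelihood, gp)
fit_gpytorch_mll(mll)

# Posterior mean and variance at three test points.
posterior = gp.posterior(torch.rand(3, 5, dtype=torch.double))
print(posterior.mean.squeeze(-1), posterior.variance.squeeze(-1))
```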
To mitigate the effect of randomness in the initialization of the three algorithms, five independent runs were performed. For each run, the three algorithms shared the same set of initial random solutions.
The objective in the binary quadratic programming problem was a quadratic function with regularization:

$$f(x) = x^\top Q x + \lambda \lVert x \rVert_1, \qquad x \in \{0,1\}^d,$$

where $Q$ is a random matrix with zero-mean Gaussian entries, multiplied element-wise by a matrix $K$ with entries $K_{ij} = \exp\left(-(i-j)^2 / L_c^2\right)$, which decay smoothly away from the diagonal at a rate determined by the correlation length $L_c$.
According to the literature, we fixed the correlation length and the regularization coefficient, and sampled 50 independent realizations of the random matrix. Every algorithm was run 10 times on each instance, for each realization. The tests were performed for two cases of the problem parameters and, for the cheaper source, we considered a query cost of 50% and of 10% of that of the high-fidelity source.
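A minimal sketch of how one realization of such an instance can be generated and evaluated; the dimension, regularization coefficient, correlation length, and function names are illustrative placeholders, not the values used in the experiments:

```python
import numpy as np

def make_bqp_instance(d=10, lam=1.0, corr_length=10.0, seed=0):
    """One realization of f(x) = x^T Q x + lam * ||x||_1, with the quadratic
    matrix given by zero-mean Gaussian entries multiplied element-wise by
    K_ij = exp(-(i - j)^2 / corr_length^2)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((d, d))                     # zero-mean Gaussian entries
    i, j = np.meshgrid(np.arange(d), np.arange(d), indexing="ij")
    K = np.exp(-((i - j) ** 2) / corr_length ** 2)      # smooth decay off the diagonal
    Q = A * K
    return lambda x: float(x @ Q @ x + lam * np.sum(x))

f = make_bqp_instance()
x = np.random.default_rng(1).integers(0, 2, size=10)    # a random binary solution
print(f(x))
```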
As depicted in Figure 3, MISO-AGP achieves, on average, a lower best-seen value (Figure 3, left) and a smaller accumulated query cost (Figure 3, right). Although the final best-seen value of MISO-AGP was lower than that of the other two approaches, the difference was not statistically significant, as evaluated via a Wilcoxon test (p-value > 0.05). Further, MISO-AGP is significantly more efficient than MF-MES and MF-GIBBON.
Finally, MISO-AGP uses the ground truth in 79% of the total queries, against 25% for MF-MES and 20% for MF-GIBBON. This behavior is motivated by a relevant discrepancy between the ground truth and the cheap source, leading the AGP to rely on the expensive source instead of the cheap one. This is crucial because, contrary to other standard methods for combining GPs (e.g., fusing GPs), the AGP discards cheap observations if the two sources are, even locally, uncorrelated. This property of the AGP model, which is at the core of its design, was specifically and carefully addressed in [7].
As depicted in Figure 4, MISO-AGP achieves, on average, a lower best-seen value (on the left) and a smaller accumulated query cost (on the right). In this case, MISO-AGP uses the ground truth in 91% of the total queries, against 29% for MF-MES and 20% for MF-GIBBON. It is important to remark that both MISO-AGP and MF-MES increased the number of queries on the ground truth, even though the query cost of the cheap source decreased from 50% to 10% of the ground truth's query cost. Both algorithms increased the number of queries on the cheap source in the first iterations, due to its small cost, which led them to recognize that it is poorly correlated with the ground truth and, consequently, to rely only on the expensive source for most of the remaining queries.
For this specific experiment, MISO-AGP shows worse results than the other two approaches, with a significantly larger value of the final best-seen (Wilcoxon test: against MF-MES, p-value = 0.0143; against MF-GIBBON, p-value = 0.0141). However, the cumulative runtime of MISO-AGP was still significantly lower than those of the other two methods, as depicted in Figure 5.

Nevertheless, MISO-AGP queried the ground truth in 83% of the iterations, against 31% for MF-MES and 20% for MF-GIBBON. Again, MISO-AGP is better able to recognize that the two sources are, locally, poorly correlated.
As depicted in Figure 6, MISO-AGP again achieves, on average, a lower best-seen value at a lower accumulated query cost. Moreover, the final value of the best-seen is significantly smaller than those provided by the other two approaches (Wilcoxon test). Finally, MISO-AGP uses the ground truth in 87% of the total queries (a slight increase with respect to the previous experiment), against 33% for MF-MES and 20% for MF-GIBBON. The underlying motivation is the one already provided for the previous experiments.
6. A Real-Life Application: Risk-Averse Optimal Sensor Placement in a Water Distribution Network
6.1. Conditional Value-at-Risk (CVaR)
CVaR is based on the value-at-risk (VaR), which is the maximum potential value of a metric of interest at a certain confidence level $\alpha$. Formally:

$$\mathrm{VaR}_\alpha(x) = \min \{\, y : F_x(y) \ge \alpha \,\},$$

where $F_x$ is the distribution of the metric of interest with respect to a given solution $x$. When the distribution $F_x$ is discrete, VaR is easily computed as the $\alpha$-quantile of the distribution.
A general framework for Bayesian quantile and expectile optimization is established in [36]. A BO approach for CVaR is given in [37], which received a BoTorch implementation. An application of the CVaR metric to water distribution networks was given in Naseridaze [38] using genetic algorithms.
Then, CVaR is the expected value of the metric of interest, given that it is beyond the VaR. For discrete distributions, CVaR is computed as:

$$\mathrm{CVaR}_\alpha(x) = \frac{1}{N - i_\alpha + 1} \sum_{i = i_\alpha}^{N} y_{(i)},$$

where $i_\alpha$ is the CVaR index, that is, the position of the first value at or beyond the VaR threshold within the sorted samples $y_{(1)} \le \dots \le y_{(N)}$.
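A minimal sketch of how VaR and CVaR can be computed from a discrete sample of the metric of interest; the indexing convention shown is one common choice and may differ in minor details from the one adopted in the paper:

```python
import numpy as np

def var_cvar(samples, alpha=0.95):
    """VaR as the alpha-quantile of the sorted samples and CVaR as the
    average of the samples at or beyond the VaR threshold."""
    y = np.sort(np.asarray(samples, dtype=float))
    n = len(y)
    i_alpha = int(np.ceil(alpha * n)) - 1      # 0-based CVaR index
    var = y[i_alpha]                           # value-at-risk
    cvar = y[i_alpha:].mean()                  # conditional value-at-risk (tail mean)
    return var, cvar

detection_times = [5.0, 7.0, 3.0, 22.0, 9.0, 15.0, 6.0, 30.0, 4.0, 8.0]
print(var_cvar(detection_times, alpha=0.8))    # (15.0, 22.33...)
```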
6.2. The Optimal Sensor Placement Problem
The optimal sensor placement (OSP) problem aims at selecting a subset of locations where a fixed number of sensors are to be deployed, so as to minimize an impact measure. There is not a unique impact metric, because the final choice strictly depends on the specific case. Some examples of frequently used impact measures are (i) the time required to detect a contamination (aka detection time), (ii) the amount of contaminated water consumed up to the detection, as well as the number of inhabitants affected, or (iii) the probability of detecting a contamination.
In this paper, we consider the detection time, and more precisely the CVaR of the detection times under a set of scenarios. We briefly introduce some required notation and then present the formalization of the OSP problem.
A water distribution network is modeled as a graph $G = (V, E)$, where the node set $V$ contains junctions and consumption points, while the edge set $E$ consists of all the pipes connecting pairs of nodes. A sensor placement is defined as a binary vector $x \in \{0,1\}^{|L|}$, where $L \subseteq V$ is the subset of nodes where sensors can possibly be deployed. Specifically, each component of the vector refers to a location in the set $L$; thus, $x_i = 1$ if a sensor is deployed at the corresponding $i$-th location and $x_i = 0$ otherwise. The number of sensors to deploy is fixed in advance, that is, $\sum_{i=1}^{|L|} x_i = s$.

Now, we introduce the stochastic component of the problem, which is the definition of simulation scenarios referring to different contamination events. Specifically, the set of contamination events, denoted by $\mathcal{E}$, is a subset of nodes where a contaminant is, in turn, injected. Each contamination event requires a simulation run and is therefore uniquely associated with a scenario; thus, we refer to scenarios or contamination events interchangeably.
6.3. Combinatorial Multi-Information Source Optimization (MISO) for Risk-Averse Optimal Sensor Placement
As far as the MISO setting is concerned, we used two sets of scenarios, $\mathcal{E}_1$ and $\mathcal{E}_2$, with $|\mathcal{E}_2| = |\mathcal{E}_1| / 2$. Consequently, computing CVaR by using $\mathcal{E}_1$ (the higher-fidelity source, i.e., the ground truth) led to a sampling cost twice as large as that required for computing CVaR on $\mathcal{E}_2$ (i.e., the cheaper source).
The optimal sensor placement $x^*$ is the one that optimizes the CVaR over all the contamination events in $\mathcal{E}_1$, so we want to solve the following problem:

$$x^* \in \operatorname*{arg\,min}_{x \in \{0,1\}^{|L|},\ \sum_{i} x_i = s} \mathrm{CVaR}_\alpha\big(x; \mathcal{E}_1\big) \qquad (9)$$

where $\mathrm{CVaR}_\alpha(x; \mathcal{E}_1)$ denotes the conditional value-at-risk of the detection times observed on the $|\mathcal{E}_1|$ scenarios under the deployment $x$. Specifically, the detection time for one event is the lowest time needed to detect the contamination through any of the sensors in the placement $x$. This leads to as many detection times as the number of scenarios, and their distribution is used to compute $\mathrm{CVaR}_\alpha(x; \mathcal{E}_1)$.
Since we are considering a MISO setting, we want to solve (9) by generating a sequence of solutions that also involves evaluations on the cheap source (i.e., using the scenario set $\mathcal{E}_2$), with the aim of converging to the optimum with a low cumulative cost. Indeed, denoting the sequence of generated solutions by $\{(x^{(k)}, z^{(k)})\}$, the generic source indicator $z^{(k)}$ can be $z^{(k)} = 1$ if CVaR must be computed by using $\mathcal{E}_1$ (i.e., the ground truth, entailing a nominal cost of 1) or $z^{(k)} = 0.5$ if CVaR must be computed by using $\mathcal{E}_2$ (i.e., the cheap source, entailing a nominal cost of 0.5). Our search space is therefore $\{0,1\}^{|L|} \times \{0.5, 1\}$, where the first $|L|$ dimensions refer to the sensor placement and the last dimension refers to the information source to use for computing the objective function.
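A minimal sketch of how a candidate in this augmented search space could be evaluated; the detection-time matrix, the scenario subsets, and all names are illustrative assumptions, with the detection time of each scenario taken as the earliest detection among the deployed sensors, as described above:

```python
import numpy as np

def evaluate_candidate(candidate, detection_matrix, cheap_rows, alpha=0.95):
    """candidate = (x_1, ..., x_|L|, z): binary placement plus source flag z.
    detection_matrix[s, i] = time at which scenario s would be detected by a
    sensor at location i. cheap_rows: indices of the cheap-source scenarios."""
    x, z = np.asarray(candidate[:-1], dtype=bool), candidate[-1]
    rows = np.arange(detection_matrix.shape[0]) if z == 1 else np.asarray(cheap_rows)
    # Detection time of each scenario: earliest detection among deployed sensors.
    det_times = detection_matrix[rows][:, x].min(axis=1)
    y = np.sort(det_times)
    i_alpha = int(np.ceil(alpha * len(y))) - 1
    cvar = y[i_alpha:].mean()                  # objective: CVaR of detection times
    cost = 1.0 if z == 1 else 0.5              # nominal query cost of the source
    return cvar, cost

rng = np.random.default_rng(0)
D = rng.uniform(1, 24, size=(20, 8))               # 20 scenarios, 8 candidate locations
cand = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0.5])     # 3 sensors, cheap source (z = 0.5)
print(evaluate_candidate(cand, D, cheap_rows=np.arange(0, 20, 2)))
```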
At a generic iteration of the MISO-AGP algorithm, the minimization of the acquisition function (6) is performed under the following two constraints: the placement must deploy exactly $s$ sensors, that is, $\sum_{i=1}^{|L|} x_i = s$, and the source indicator must take one of its two admissible values, $z \in \{0.5, 1\}$.
To solve this constrained combinatorial optimization problem, a Pymoo implementation of a genetic algorithm was used. As the mutation operator, a standard bit-flip mutation was used with a fixed mutation probability. As the crossover operator, the problem-specific operator previously proposed in [32] was used.
It is briefly summarized here. Consider the example in Figure 7: each offspring takes, in turn, a random sensor from each parent until no more sensors are available. This strategy guarantees feasible offspring when using feasible parents, i.e., the offspring will have the same number of sensors as the parents.
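A plain-Python sketch of this crossover mechanism (outside the Pymoo operator interface; the operator in [32] may differ in its details):

```python
import numpy as np

def sensor_crossover(parent_a, parent_b, rng=None):
    """Sketch of the feasibility-preserving crossover: each offspring picks, in
    turn, a random sensor location from each parent (skipping duplicates) until
    it holds as many sensors as the parents, so feasibility is preserved."""
    rng = rng or np.random.default_rng()
    n_sensors = int(parent_a.sum())
    parents = [list(np.flatnonzero(parent_a)), list(np.flatnonzero(parent_b))]
    offspring = []
    for start in (0, 1):                       # offspring 1 starts from parent_a, offspring 2 from parent_b
        chosen, turn = set(), start
        while len(chosen) < n_sensors:
            pool = [loc for loc in parents[turn % 2] if loc not in chosen]
            chosen.add(int(rng.choice(pool)))
            turn += 1
        child = np.zeros_like(parent_a)
        child[list(chosen)] = 1
        offspring.append(child)
    return offspring

p1 = np.array([1, 1, 0, 0, 1, 0, 0, 0])
p2 = np.array([0, 0, 1, 0, 0, 1, 0, 1])
o1, o2 = sensor_crossover(p1, p2, rng=np.random.default_rng(42))
print(o1.sum(), o2.sum())                      # both offspring keep 3 sensors
```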
6.4. Numerical Results
A contamination event at each node was simulated using WNTR v1.1.0 (a Python wrapper of EPANET, a water distribution network simulator). The simulations lasted 24 (simulated) hours, and the contaminant concentration at each node was recorded hourly. Sensors could be placed only on a subset of nodes, identified by sampling nodes uniformly on their coordinates in order to attain a good coverage of the entire water distribution network.
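A minimal sketch of how a single contamination scenario can be simulated with WNTR; the network file, injection settings, and detection threshold are placeholders:

```python
import wntr

# Load the network and configure a 24 h water-quality simulation with hourly reporting.
wn = wntr.network.WaterNetworkModel("network.inp")           # placeholder .inp file
wn.options.time.duration = 24 * 3600
wn.options.time.report_timestep = 3600
wn.options.quality.parameter = "CHEMICAL"

# Inject a contaminant at one node (one scenario corresponds to one injection node).
injection_node = wn.junction_name_list[0]
wn.add_source("contamination", injection_node, "CONCEN", 100.0)

# Run EPANET through WNTR and extract the nodal concentration time series.
sim = wntr.sim.EpanetSimulator(wn)
results = sim.run_sim()
quality = results.node["quality"]                             # rows: time [s], columns: nodes

# Detection time at a candidate sensor node: first time the concentration exceeds a threshold.
sensor_node = wn.junction_name_list[5]
detected = quality[sensor_node] > 0.01
detection_time = detected.idxmax() if detected.any() else float("inf")
print(detection_time)
```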
The network considered in the study is named Apulian and has 1364 nodes. The number of allowed sensor locations was 63, while the number of sensors to deploy was fixed in advance. In Figure 8, we report the best-seen value, which is the lowest CVaR value observed so far, with respect to (top of the figure) the cumulative evaluation cost and (bottom of the figure) the overall wall-clock time.
The proposed MISO-AGP and MF-GIBBON were aligned in terms of performance, while the results of MF-MES were slightly worse than those of the other two methods. The main advantage of the proposed approach is its significantly lower standard deviation over the different runs, making MISO-AGP a more robust framework than MF-GIBBON and MF-MES. A drawback of MISO-AGP is its slightly higher wall-clock time.
7. Conclusions, Limitations, and Perspectives
We presented an extension of the basic BO algorithm to a distributionally aware, constrained, and combinatorial multiple information source optimization setting.
The method proposes a new mechanism for generating a single model over the information sources, based on GP sparsification, and a decision-theoretic approach based on the MISO-AGP framework, initially and successfully tested on several test and real-world continuous optimization problems [7,14]. The extension to the combinatorial case was quite straightforward, basically requiring a modification of the way in which the MISO-AGP acquisition function is optimized.

Specifically, the real-world problem addressed in this paper, the optimal sensor placement in water distribution networks, required optimizing the MISO-AGP acquisition function via a genetic algorithm, whose crossover operator was designed to ensure the feasibility of the generated solutions with respect to the combinatorial nature of the problem.

It is important to remark that, to account for non-neutral risk measures, CVaR was considered as the objective function of the optimal sensor placement problem. Computational experiments, also on a test problem from the literature, confirm the previous results obtained on continuous optimization problems.
Examples of other combinatorial optimization problems which could benefit from the approach proposed in this paper are epidemic source detection in contact tracing networks [39] and fake news detection using a graph-based approach [40].
Although high dimensionality is out of the scope of this paper, the authors are aware that the scalability of MISO-AGP to high-dimensional problems is crucial. Fortunately, there are many available GP-based methods for high-dimensional Bayesian optimization (HDBO), such as TuRBO [41] and the more recent BAxUS [42] and BOUNCE [43], which are able to perform scalable BO in high-dimensional spaces while working directly within the general GP-based BO framework. Thus, equipping one of these algorithms with the AGP, with the aim of targeting a MISO problem, would lead to a scalable MISO-AGP implementation. Moreover, it is important to remark that the AGP is based on a GP sparsification technique (i.e., the insertion of relevant observations only), so the resulting AGP is fitted on a subset of all the observations collected over all the information sources, leading to a lower computational cost for training it, contrary to co-kriging and fused-GP methods.